编程代写|Scientific Programming – 2022/2023 Coursework 2 – Part 2
As part of this exercise we will be performing two operations on a large dataset using Numpy. First, considering an array of values, we will compute the percentage differential from one value to the next, as in the example below.
作为这个练习的一部分,我们将使用Numpy对一个大数据集进行两个操作。首先,考虑到一个数值的数组,我们将计算一个数值与下一个数值的百分比差,如下图所示。
input_array = [43, 63, 72, 64, 16]
percent_diff = [(43-63)/43, (63-72)/63, (72-64)/72, (64-16)/64]
Second, we will compute the cummulative sum on an input array:
第二,我们将在一个输入数组上计算累积和。
input_array = [43, 63, 72, 64, 16]
cumsum = [43, 43+63, 43+63+72, 43+63+72+64, 43+63+72+64+16]
Working code for both functions is provided in the following cells, the task is to re-write both functions to accelerate the computation. Several means of speeding up code have been covered during the lectures. Those include the use of caching, the use of native languages, the use of parallel computing, or the modification of the default algorithm. You can use one or several combined strategies.
下面的单元格中提供了这两个函数的工作代码,任务是重写这两个函数以加速计算。在讲座中已经涉及了几种加速代码的方法。其中包括使用缓存,使用本地语言,使用并行计算,或者修改默认算法。你可以使用一种或几种组合策略。
The following cell’s code is from https://github.com/rakshitha123/TSForecasting/blob/master/utils/data_loader.py and enables the convert an input dataset in .tsf format into a pandas data frame.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numba import njit, prange
# https://github.com/rakshitha123/TSForecasting/blob/master/utils/data_loader.py
from datetime import datetime
from distutils.util import strtobool
# Converts the contents in a .tsf file into a dataframe and returns it along
# with other meta-data of the dataset: frequency, horizon, whether the dataset
# contains missing values and whether the series have equal lengths
#
# Parameters
# full_file_path_and_name – complete .tsf file path
# replace_missing_vals_with – a term to indicate the missing values in series
# in the returning dataframe
# value_column_name – Any name that is preferred to have as the name of the
# column containing series values in the returning dataframe
def convert_tsf_to_dataframe(
full_file_path_and_name,
replace_missing_vals_with=“NaN”,
value_column_name=“series_value”,
col_names = []
col_types = []
all_data = {}
line_count = 0
frequency = None
forecast_horizon = None
contain_missing_values = None
contain_equal_length = None
found_data_tag = False
found_data_section = False
started_reading_data_section = False
with open(full_file_path_and_name, “r”) as file: # , encoding=”cp1252″
for line in file:
# Strip white space from start/end of line
line = line.strip()
if line:
if line.startswith(“@”): # Read meta-data
if not line.startswith(“@data”):
line_content = line.split(” “)
if line.startswith(“@attribute”):
if len(line_content) != 3: # Attributes have both name and type
raise Exception(“Invalid meta-data specification.”)
col_names.append(line_content[1])
col_types.append(line_content[2])
else:
if len(line_content) != 2: # Other meta-data have only values
raise Exception(“Invalid meta-data specification.”)
if line.startswith(“@frequency”):
frequency = line_content[1]
elif line.startswith(“@horizon”):
forecast_horizon = int(line_content[1])
elif line.startswith(“@missing”):
contain_missing_values = bool(strtobool(line_content[1]))
elif line.startswith(“@equallength”):
contain_equal_length = bool(strtobool(line_content[1]))
else:
if len(col_names) == 0:
raise Exception(“Missing attribute section. ”
“Attribute section must come before data.”)
found_data_tag = True
elif not line.startswith(“#”):
if len(col_names) == 0:
raise Exception(“Missing attribute section. ”
“Attribute section must come before data.”)
elif not found_data_tag:
raise Exception(“Missing @data tag.”)
else:
if not started_reading_data_section:
started_reading_data_section = True
found_data_section = True
all_series = []
for col in col_names:
all_data[col] = []
full_info = line.split(“:”)
if len(full_info) != (len(col_names) + 1):
raise Exception(“Missing attributes/values in series.”)
series = full_info[len(full_info) – 1]
series = series.split(“,”)
if len(series) == 0:
raise Exception(
“A given series should contains a set of comma separated ”
“numeric values. At least one numeric value should be there ”
“in a series. Missing values should be indicated with ? symbol”
numeric_series = []
for val in series:
if val == “?”:
numeric_series.append(replace_missing_vals_with)
else:
numeric_series.append(float(val))
if numeric_series.count(replace_missing_vals_with) == len(numeric_series):
raise Exception(
“All series values are missing. A given series should contains ”
“a set of comma separated numeric values. At least one numeric ”
“value should be there in a series.”
all_series.append(pd.Series(numeric_series).array)
for i in range(len(col_names)):
att_val = None
if col_types[i] == “numeric”:
att_val = int(full_info[i])
elif col_types[i] == “string”:
att_val = str(full_info[i])
elif col_types[i] == “date”:
att_val = datetime.strptime(full_info[i], “%Y-%m-%d %H-%M-%S”)
else:
raise Exception(
“Invalid attribute type.”
) # Currently, the code supports only numeric, string and date types.
# Extend this as required.
if att_val is None:
raise Exception(“Invalid attribute value.”)
else:
all_data[col_names[i]].append(att_val)
line_count = line_count + 1
if line_count == 0:
raise Exception(“Empty file.”)
if len(col_names) == 0:
raise Exception(“Missing attribute section.”)
if not found_data_section:
raise Exception(“Missing series information under data section.”)
all_data[value_column_name] = all_series
loaded_data = pd.DataFrame(all_data)
return (
loaded_data,
frequency,
forecast_horizon,
contain_missing_values,
contain_equal_length,
# Example of usage
# loaded_data, frequency, forecast_horizon, contain_missing_values, contain_equal_length = convert_tsf_to_dataframe(“TSForecasting/tsf_data/sample.tsf”)
# print(loaded_data)
# print(frequency)
# print(forecast_horizon)
# print(contain_missing_values)
# print(contain_equal_length)
The following cell loads a dataset downloaded from https://zenodo.org/record/7371038. The following description is from the dataset webpage:
“This dataset contains 145063 time series representing the number of hits or web traffic for a set of Wikipedia pages from 2015-07-01 to 2022-06-30. This is an extended version of the dataset that was used in the Kaggle Wikipedia Web Traffic forecasting competition. For consistency, the same Wikipedia pages that were used in the competition have been used in this dataset as well. The colons (:) in article names have been replaced by dashes (-) to make the .tsf file readable using our data loaders.
The original dataset contains missing values. They have been simply replaced by zeros.
The data were downloaded from the Wikimedia REST API. According to the conditions of the API, this dataset is licensed under CC-BY-SA 3.0 and GFDL licenses.”
You can download the dataset from https://zenodo.org/record/7371038/files/web_traffic_extended_dataset_without_missing_values.zip. Note that the file is large: 433 MB. The cells convert the dataset into a Numpy compressed file for easier use in the remainder of this jupyter notebook.
下面的单元格加载了一个从https://zenodo.org/record/7371038 下载的数据集。 下面的描述来自于数据集的网页。
“这个数据集包含145063个时间序列,代表2015-07-01至2022-06-30期间一组维基百科页面的点击量或网络流量。这是在Kaggle维基百科网络流量预测比赛中使用的数据集的扩展版本。为了保持一致性,比赛中使用的维基百科页面也被用在这个数据集中。文章名称中的冒号(:)已被破折号(-)取代,以使.tsf文件可以用我们的数据加载器读取。
原始数据集包含缺失值。它们已被简单地替换为零。
这些数据是从维基媒体REST API中下载的。根据API的条件,该数据集在CC-BY-SA 3.0和GFDL许可下授权。”
你可以从https://zenodo.org/record/7371038/files/web_traffic_extended_dataset_without_missing_values.zip 下载该数据集。请注意,该文件很大。433MB。单元将数据集转换为Numpy压缩文件,以便在本jupyter笔记本的其余部分更容易使用。
In [2]:
# https://zenodo.org/record/7371038
df, frequency, forecast_horizon, contain_missing_values, contain_equal_length = convert_tsf_to_dataframe(
“web_traffic_extended_dataset_without_missing_values.tsf”
hits_arr = np.stack(df[“series_value”].apply(np.asarray).tolist())
np.savez_compressed(“web_traffic_extended_dataset_without_missing_values”, hits_arr)
The following cell implement the pct_change1 function, used to compute the percentage difference change beetween succesive values, as aforementioned. The cell is also used to time the function and display the result as a figure.
下面的单元格实现了 “pct_change1 “函数,用于计算前面提到的连续数值之间的百分比差异变化。该单元格也被用来为该函数计时,并将结果显示为一个数字。
In [ ]:
def pct_change1(n):
result = np.diff(n) / n[:, :–1]
result[np.isnan(result)] = 0 # difference is 0 and both values are 0
result[np.isinf(result)] = 1 # first value was 0 and next is not 0
return result
pct = pct_change1(hits_arr)
%timeit pct_change1(hits_arr)
plt.figure(figsize=(14, 5))
plt.plot(pct[10559, 300:800])
Modify the following cell to implement the function pct_change2 to accelerate the computation. The np.allclose() function checks that the results obtained using pct_change1 and pct_change2 are near identical.
修改以下单元格,实现函数pct_change2以加速计算。np.allclose()函数检查使用pct_change1和pct_change2得到的结果是否接近一致。
In [ ]:
def pct_change2(n):
result = None
#TODO
return result
pct_n = pct_change2(hits_arr)
%timeit pct_change2(hits_arr)
print(np.allclose(pct, pct_n))
The cumsum function from numpy can be used to compute the cummulative sum of the hits_arr array, as shown below.
numpy的cumsum函数可以用来计算hits_arr数组的累积和,如下所示。
In [ ]:
cs1 = np.cumsum(hits_arr, 1)
%timeit np.cumsum(hits_arr,1)
Modify the following cell to implement the function cumsum2 to accelerate the computation. The np.allclose() function enables to check that the results obtained using np.cumsum and cumsum2 are similar.
修改下面的单元格,实现函数cumsum2以加速计算。np.allclose()函数可以检查使用np.cumsum和cumsum2得到的结果是否相似。
In [ ]:
def cumsum2(n):
result = None
#TODO
return result
cs2 = cumsum2(hits_arr)
%timeit cumsum2(hits_arr)
np.allclose(cs1, cs2)
The aim of the exercise is to develop a effective solution to solve the travelling salesman problem.
The idea of the Travelling Salesman Problem is to find the shortest path necessary to visit all locations and return to the start. For an example of visiting four locations, cycle length is the distance travelled from starting location A to location B, then to location C, then to location D, then back to A. The brute force approach to solving this is consider every possible cycle between the locations, however for N locations there are N! possible cycles so this isn’t really tractable for anything above small values like 10.
The following cell generate a random graph with 100 nodes, links the nodes in indices increasing order (to create edges between pair of nodes) and display the resulting solution.
该练习的目的是开发一个有效的解决方案来解决旅行推销员问题。
旅行推销员问题的理念是找到访问所有地点并返回起点的最短路径。以访问四个地点为例,周期长度是指从起始地点A到地点B,然后到地点C,然后到地点D,然后回到A的距离。
下面的单元格生成了一个有100个节点的随机图,按指数递增的顺序链接节点(在一对节点之间建立边),并显示所得到的解决方案。