This assignment uses Python to practice working with time series data, momentum, and mean reversion.
FE520 Assignment 5
Spring 2020
Submission Requirement:
For all the problems in this assignment, you need to write your solutions in Python 3 and
output and present the results in a neatly formatted way.
Please submit a written report (pdf), where you detail your results and copy your
code into an Appendix. You are required to submit a single Python file and a brief
report. Your grade will be based on a combination of the report and the code.
You are strongly encouraged to write comments for your code, because it is a convention to keep your code documented at all times.
The Python script must be a '.py' file; a Jupyter notebook ('.ipynb') is not allowed.
Do NOT copy and paste from others. All homework will first be checked by a plagiarism detection tool.
Note: This assignment contains 4 questions worth 130 points plus 20 bonus points, 150
points in total. The assignment total is set to 120 points, so you automatically get
10 extra bonus points.
1 Time Series Data Practice (30 pts)
Recall from class that there are two ways to split data into training and testing
sets: out of sample and out of time. For a time series dataset, the out-of-time split
is the appropriate one. Write a function to split the "Energy" dataset into training
and testing data.
Parameter input:
StartYear: int (default value = 2012), EndYear: int (default value = None).
Output:
Train, Test (data type: NumPy array)
If EndYear is None, only the data with "Data Date" in StartYear are chosen
as Test data, and all other data as Train data. By default, all company data with a
Data Date within 2012 are selected as Testing data. If EndYear is not None, all data
with "Data Date" from StartYear through EndYear are chosen as Test data, and all other
data as Train data. For example, with StartYear = 2010 and EndYear = 2013, all data in
2010, 2011, 2012, and 2013 are selected as Testing data.
Both returned arrays should contain the columns from "Accumulated Other Comprehensive
Income (Loss)" through "Selling, General and Administrative Expenses".
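The splitting rule described above can be sketched roughly as follows. This is only an illustration on a made-up toy table, not the actual "Energy" dataset: the function name split_out_of_time, the helper parameters date_col, first_col and last_col, and the toy columns "A" and "B" are all hypothetical, and the sketch assumes the "Data Date" values can be parsed by pandas as dates.

```python
import pandas as pd

def split_out_of_time(df, StartYear=2012, EndYear=None,
                      date_col="Data Date", first_col=None, last_col=None):
    """Return (Train, Test) NumPy arrays split by calendar year."""
    # When EndYear is None, the test window is the single year StartYear.
    if EndYear is None:
        EndYear = StartYear
    # Assumption: the date column holds strings/datetimes pandas can parse.
    years = pd.to_datetime(df[date_col]).dt.year
    in_test = years.between(StartYear, EndYear)  # inclusive on both ends
    # Label-based slicing keeps every column from first_col through last_col.
    features = df.loc[:, first_col:last_col]
    return features[~in_test].to_numpy(), features[in_test].to_numpy()

# Toy stand-in for the real data; values and column names "A", "B" are made up.
toy = pd.DataFrame({
    "Data Date": ["2010-12-31", "2011-12-31", "2012-12-31", "2013-12-31"],
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
})
train, test = split_out_of_time(toy, StartYear=2011, EndYear=2012,
                                first_col="A", last_col="B")
print(train.shape, test.shape)
```

With the real data, first_col and last_col would presumably be the two column names given in the assignment text.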
2 Momentum and Mean Reversion (30 pts)
Momentum and mean reversion are common trading strategies. For the purpose of this
homework, a stock exhibiting momentum is defined as an asset whose price returns are
more likely to go up (down) on day t if the return went up (down) on day t-1. In other
words, the stock exhibits positive autocorrelation. Mean reversion is the opposite of
momentum: stocks are more likely to go up (down) on day t if they went down (up)
on day t-1.
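As a small illustration of this definition, the sketch below simulates two return series and measures their lag-1 autocorrelation with pandas. The AR(1) simulation, the coefficient values 0.3 and -0.3, and the helper name simulate_ar1 are assumptions made for the example; they are not how the assignment dataset was generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def simulate_ar1(phi, n=2000):
    """Simulate r_t = phi * r_{t-1} + noise_t (hypothetical toy model)."""
    noise = rng.normal(scale=0.01, size=n)
    r = np.zeros(n)
    for t in range(1, n):
        r[t] = phi * r[t - 1] + noise[t]
    return pd.Series(r)

momentum = simulate_ar1(0.3)    # positive dependence on yesterday's return
reverting = simulate_ar1(-0.3)  # negative dependence on yesterday's return

# Lag-1 autocorrelation: correlation between r_t and r_{t-1}.
mom_ac = momentum.autocorr(lag=1)
rev_ac = reverting.autocorr(lag=1)
print(f"momentum series: {mom_ac:.2f}, mean-reverting series: {rev_ac:.2f}")
```

A positive value indicates momentum in the sense defined above, and a negative value indicates mean reversion.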
You are provided below with a simulated dataset of stock return series. These
returns were generated with a predetermined average momentum during one period and a predetermined average mean reversion during another period.
Please use the dataset provided to answer the questions below. To do so, you
will need to clean the dataset; it comes with a number of flaws commonly seen in
the datasets we receive.
Question:
1. In what month did the returns shift from exhibiting mean reversion to exhibiting
momentum, or from momentum to mean reversion? Please output the last month
in which momentum (mean reversion) shifted to mean reversion (momentum).
2. During the time period when these stock returns had the momentum property, what
was the average momentum? Please note this is a single number, averaged across
all stock returns.
3. During the time period when these stock returns had the mean reversion property,
what was the average mean reversion? Please note this is a single number, averaged
across all stock returns.
This is an industry interview question. I have provided all the information and hints in
the question and in the Zoom video. This question aims to practice your ability to clean
data using the techniques we covered in the pandas lecture (especially for time series).
Please try to think about how to solve this question before looking at the hints.
Hints: The dataset provided contains returns for multiple stocks, but the questions ask
for a single month in which returns shift from one state to another. Thus, what you
need to do is find a way to aggregate the multiple stocks into one market index, and observe
the index to find the answer.
If this is not enough, I have provided a more detailed Q & A in Discussion – Question
about Assignment 5.
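One possible reading of the hint is sketched below on synthetic data. Everything here is an assumption for illustration: the simulated returns, the injected flaws (a duplicated row, a text placeholder, shuffled rows), the equal-weighted mean as the "market index", and lag-1 autocorrelation per calendar month as the momentum measure. The real dataset may have different flaws and may call for a different aggregation.

```python
import numpy as np
import pandas as pd

# Synthetic returns: momentum in 2018, mean reversion in 2019 (made-up regimes).
rng = np.random.default_rng(1)
dates = pd.bdate_range("2018-01-01", "2019-12-31")
phi = np.where(dates.year == 2018, 0.6, -0.6)
n_days, n_stocks = len(dates), 10
noise = rng.normal(scale=0.01, size=(n_days, n_stocks))
returns = np.zeros((n_days, n_stocks))
for t in range(1, n_days):
    returns[t] = phi[t] * returns[t - 1] + noise[t]
raw = pd.DataFrame(returns, index=dates,
                   columns=[f"S{i}" for i in range(n_stocks)])

# Inject typical flaws: a duplicated row, a text placeholder, shuffled order.
dirty = pd.concat([raw, raw.iloc[[5]]]).astype(object)
dirty.iloc[10, 0] = "N/A"
dirty = dirty.sample(frac=1.0, random_state=0)

# Clean: drop duplicated dates, restore time order, coerce text to NaN.
clean = dirty[~dirty.index.duplicated(keep="first")].sort_index()
clean = clean.apply(pd.to_numeric, errors="coerce")

# Aggregate into one equal-weighted index (NaN values are skipped by mean).
index_ret = clean.mean(axis=1)

# Lag-1 autocorrelation of the index within each calendar month.
months = index_ret.index.to_period("M")
monthly_ac = index_ret.groupby(months).apply(lambda s: s.autocorr(lag=1))

# Months whose sign differs from the previous month mark candidate shifts.
sign = np.sign(monthly_ac)
shift_months = monthly_ac.index[sign != sign.shift(1)][1:]
print(monthly_ac.round(2))
print("candidate shift months:", list(shift_months.astype(str)))
```

Monthly estimates rest on roughly 21 observations each, so they are noisy; a longer window or averaging per-stock estimates may give a steadier picture.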
3 Clustering & Classification (40 pts)
1. Use sklearn.cluster.KMeans to do clustering on the given data set points.csv.
There are 5 clusters in this data set. Draw a scatter plot for the data and use color
to indicate their clusters.
2. Regard the clusters given by your KMeans model as the ground truth labels, and randomly split the data set into training data (80%) and testing data (20%). Create a
linear SVM classifier and train it on the training data set. Use the confusion matrix
to evaluate its performance on the testing data set.
3. Regard the data set labels.csv as the ground truth labels and repeat the second
question. Compare the performance of the two classifiers, discuss what you observe,
and explain how you would account for it.
4. (Bonus 10pt) Use the tensorflow.keras API to create a fully connected neural network model and repeat the second question. Draw a plot to show how the loss
changes as the number of training steps increases.
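The first two steps of this section might look roughly like the sketch below. It runs on synthetic blobs from make_blobs because points.csv is not reproduced here; the sample size, the random seeds, and the choice of SVC(kernel="linear") as the linear SVM are assumptions, and the scatter plot and the Keras bonus are left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for points.csv: 500 two-dimensional points, 5 blobs.
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)

# Step 1: cluster into 5 groups; the cluster ids become the labels.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Step 2: random 80/20 split, then a linear-kernel SVM on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)

# Rows are true cluster ids, columns are predicted ids.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=np.arange(5))
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"test accuracy: {accuracy:.2f}")
```

For the scatter plot, the same labels array can be passed as the color argument of a matplotlib scatter call.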
4 Regression (30 pts)
1. In this question, we are going to use the diabetes data set. Use sklearn.datasets.load_diabetes()
to load the data and labels.
2. Randomly split the data into training set (80%) and testing set (20%).
3. Create a linear regression model using sklearn, and fit it to the training data. Evaluate
your model using the test data. Give all the coefficients and the R-squared score.
4. Use 10-fold cross validation to fit and validate your linear regression models on
the whole data set. Print the scores for each validation.
5. (Bonus 3pt) Use sklearn to create a RandomForestRegressor model, and fit it to the
training data.
6. (Bonus 7pt) Use Grid Search to find the optimal hyper-parameters (max_depth: {None,
7, 4} and min_samples_split: {2, 10, 20}) for RandomForestRegressor.
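The steps in this section map onto standard scikit-learn calls, as in the rough sketch below. The random seeds, the 5-fold setting inside the grid search, and n_estimators=50 are assumptions chosen to keep the example quick; the assignment does not specify them.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)

# Steps 1-2: load the data and make a random 80/20 split.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 3: linear regression, coefficients, and R-squared on the test set.
lin = LinearRegression().fit(X_train, y_train)
r2_test = lin.score(X_test, y_test)
print("coefficients:", lin.coef_)
print("test R-squared:", r2_test)

# Step 4: 10-fold cross validation on the whole data set (R-squared per fold).
cv_scores = cross_val_score(LinearRegression(), X, y, cv=10)
print("fold scores:", cv_scores)

# Steps 5-6: random forest with a grid search over the listed values.
param_grid = {"max_depth": [None, 7, 4], "min_samples_split": [2, 10, 20]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),  # assumed settings
    param_grid, cv=5)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
```

For regressors, score and cross_val_score report R-squared by default, so no explicit scoring argument is needed here.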