# Jupyter Assignment Help | Jupyter Code Python Coding

Assignment 3

1. ARIMA

a. Use the file “assignment_1.csv” from the first assignment for this task. The target
column is “gold_price”.

b. Plot the autocorrelation function (ACF) and partial autocorrelation function (PACF)
of the “gold_price” time series. (see
https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html)

c. Describe what the plots indicate (in terms of autocorrelation, the autoregressive
parameter (p), and the moving average parameter (q)).
Some rules of thumb to recall:

i. Rule 1: If the ACF shows exponential decay, the PACF has a spike at lag
1, and no correlation for other lags, then use one autoregressive (p)
parameter

ii. Rule 2: If the ACF shows a sine-wave shape pattern or a set of
exponential decays, the PACF has spikes at lags 1 and 2, and no
correlation for other lags, then use two autoregressive (p) parameters.

iii. Rule 3: If the ACF has a spike at lag 1, no correlation for other lags, and
the PACF damps out exponentially, then use one moving average (q)
parameter.

iv. Rule 4: If the ACF has spikes at lags 1 and 2, no correlation for other
lags, and the PACF has a sine-wave shape pattern or a set of exponential
decays, then use two moving average (q) parameters.

v. Rule 5: If the ACF shows exponential decay starting at lag 1, and the
PACF shows exponential decay starting at lag 1, then use one
autoregressive (p) and one moving average (q) parameter.
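
To make the rules of thumb concrete, here is a minimal sketch that computes a sample ACF and PACF by hand for a simulated AR(1) series. The simulated series, the coefficient 0.7, and the helper names `sample_acf` and `sample_pacf` are assumptions for illustration and are not part of the assignment data. The PACF uses the Durbin-Levinson recursion, so its values may differ slightly from what statsmodels reports with its default settings.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a sequence."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def sample_pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = sample_acf(series, max_lag)
    pacf = [1.0]
    phi_prev = []  # AR coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Simulated AR(1) process y_t = 0.7 * y_{t-1} + noise (illustrative data only).
random.seed(0)
y = [0.0]
for _ in range(4999):
    y.append(0.7 * y[-1] + random.gauss(0.0, 1.0))

acf_values = sample_acf(y, 5)
pacf_values = sample_pacf(y, 5)
for lag in range(1, 6):
    print(f"lag {lag}: ACF={acf_values[lag]: .3f}  PACF={pacf_values[lag]: .3f}")
```

For this series the ACF should decay roughly geometrically while the PACF is large only at lag 1, which is the pattern described in Rule 1.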

d. Determine how many times you need to difference the data and perform the
analysis on the n-times differenced data.
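
As a small illustration of what differencing n times means, the sketch below applies first differences repeatedly to a toy sequence. The helper name `difference` and the quadratic toy data are assumptions for illustration; with pandas, `Series.diff()` performs the same first-difference step.

```python
def difference(series, n=1):
    """Apply first differencing n times: y_t - y_{t-1}."""
    out = list(series)
    for _ in range(n):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


# Toy series with a quadratic trend (illustrative data only).
quadratic = [t * t for t in range(6)]  # [0, 1, 4, 9, 16, 25]

once = difference(quadratic)       # still trending upwards
twice = difference(quadratic, 2)   # constant, so the trend is removed

print(once)
print(twice)
```

Each round of differencing shortens the series by one observation, so the ACF and PACF should be recomputed on the shorter, differenced series.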

e. Another approach to assessing the presence of autocorrelation is by using the
Durbin-Watson (DW) statistic. The value of the DW statistic is close to 2 if the
errors are uncorrelated. What is DW for our data, and does this match what you
observed from the ACF and PACF plots? (see
https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html
– in fact the statsmodels package provides a lot of the functionality
needed for time series analysis)
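
For reference, the DW statistic is the sum of squared successive differences of the errors divided by the sum of squared errors. The sketch below implements that formula directly; the function name and the two toy error sequences are assumptions for illustration, and the statsmodels function linked above can be used on the real data instead.

```python
def durbin_watson(errors):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4."""
    numerator = sum((b - a) ** 2 for a, b in zip(errors, errors[1:]))
    denominator = sum(e * e for e in errors)
    return numerator / denominator


# Toy error sequences (illustrative only).
alternating = [1.0, -1.0, 1.0, -1.0]             # negative autocorrelation
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # positive autocorrelation

print(durbin_watson(alternating))  # above 2
print(durbin_watson(persistent))   # below 2
```

Values well below 2 point to positive autocorrelation and values well above 2 point to negative autocorrelation.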

f. Remove serial dependency by modeling a simple ARMA process with p and q
as derived above. Take a look at what the resulting process looks like (plot).

g. Calculate the residuals, and test the null hypothesis that the residuals come from
a normal distribution, and construct a qq-plot. Do the results of the hypothesis
test and qq-plot align?
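
The task does not name a specific normality test. As one option, the sketch below computes the Jarque-Bera statistic by hand together with the points of a normal qq-plot. The choice of Jarque-Bera, the function names, and the toy residuals are assumptions for illustration. The p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, so it is only a rough guide for small samples.

```python
import math
from statistics import NormalDist


def jarque_bera(residuals):
    """Jarque-Bera statistic and asymptotic p-value (H0: normal residuals)."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    stat = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = math.exp(-stat / 2.0)  # survival function of chi-square(2)
    return stat, p_value


def qq_points(residuals):
    """Pairs (theoretical normal quantile, sorted residual) for a qq-plot."""
    n = len(residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))


# Toy residuals (illustrative only): symmetric, and with one large outlier.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
with_outlier = [0.0] * 50 + [10.0]

stat_sym, p_sym = jarque_bera(symmetric)
stat_out, p_out = jarque_bera(with_outlier)
points = qq_points(symmetric)
print(stat_sym, p_sym)
print(stat_out, p_out)
```

If the residuals are close to normal, the qq-plot points lie near a straight line and the test does not reject the null hypothesis, so the two diagnostics should tell the same story.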

h. Now investigate the autocorrelation of the residuals of your ARMA(p,q) model. Did it improve?
These can be examined graphically, but a statistic will help. Next, we calculate
the lag, autocorrelation (AC), Q statistic and Prob>Q. The Ljung–Box Q test is a
type of statistical test of whether any of a group of autocorrelations of a time
series are different from zero. The null hypothesis is, H0: The data are
independently distributed (i.e. the correlations in the population from which the
sample is taken are 0, so that any observed correlations in the data result from
randomness of the sampling process).
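
The lag, AC, and Q columns can be sketched by hand as follows, using Q = n(n+2) Σ r_k² / (n−k) summed over lags 1 to h. The function name and the toy series are assumptions for illustration. Prob>Q is obtained by comparing Q with a chi-square distribution and is omitted here; the function `acorr_ljungbox` in `statsmodels.stats.diagnostic` reports it.

```python
def ljung_box_table(series, max_lag):
    """Rows of (lag, autocorrelation, cumulative Ljung-Box Q statistic)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    rows = []
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += n * (n + 2) * r_k ** 2 / (n - k)
        rows.append((k, r_k, q))
    return rows


# Toy series (illustrative only).
table = ljung_box_table([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
for lag, ac, q_stat in table:
    print(f"lag={lag}  AC={ac: .3f}  Q={q_stat:.3f}")
```

A large Q (small Prob>Q) at a given lag is evidence against the null hypothesis of independently distributed data up to that lag.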

i. Compute predictions for the years 2000 and after, as well as for 2010 and after,
and analyze their fit against the actual values.

j. Calculate the forecast error via MAE and MFE.

Reminders: Mean absolute error: The mean absolute error (MAE) value is
computed as the average absolute error value. If MAE is zero the forecast is
perfect. Compared to the mean squared error (MSE), this measure of fit
“de-emphasizes” outliers (unique or rare large error values will affect the MAE
less than the MSE).

Mean Forecast Error (MFE, also known as Bias). The MFE is the average of the
forecast errors, where each error is the actual observation minus the forecast.
A large positive MFE means that the forecast is undershooting
the actual observations. A large negative MFE means the forecast is
overshooting the actual observations. A value near zero is ideal, and generally a
small value means a pretty good fit.

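The MAE is a better indicator of fit than the MFE, because positive and negative
errors cancel in the MFE, so a near-zero MFE can hide large individual errors.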
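
A minimal sketch of both measures, with the error defined as actual minus forecast. The function names and the toy numbers are assumptions for illustration.

```python
def mean_absolute_error(actual, forecast):
    """MAE: average of |actual - forecast|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mean_forecast_error(actual, forecast):
    """MFE (bias): average of (actual - forecast)."""
    return sum(a - f for a, f in zip(actual, forecast)) / len(actual)


# Toy values (illustrative only): the forecast alternates between
# overshooting and undershooting by 2.
actual = [10.0, 12.0, 14.0, 16.0]
forecast = [12.0, 10.0, 16.0, 14.0]

mae = mean_absolute_error(actual, forecast)
mfe = mean_forecast_error(actual, forecast)
print(f"MAE={mae}  MFE={mfe}")
```

In this toy case the MFE is zero although every single forecast is off by 2, while the MAE reports the typical size of the error.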

2. Classification

a. Load the file “assignment_3.csv”. This file contains doctors' analyses of breast
cancer images. The target is the “pathology” column, which states whether the tissue
is “benign” (= no cancer) or “malignant” (= has cancer).

b. Explore and clean up the data. Additionally, convert categorical columns into
numerical columns that can be understood by a machine learning model. The
final data must have numerical columns only (except target) and two target
classes.

Operations that might be necessary:

i. One-hot-encoding (i.e., add column for each category)
ii. Binarization (i.e., representing one category as 0 and the other as 1)
iii. Merging categories (e.g., if categories are too similar or if categories only
appear a few times)
iv. Removing columns without information
v. Converting categories to numbers (i.e., introducing an order)
vi. Converting numbers to categories (e.g., if values do not represent
numbers)
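
The first two operations can be sketched in plain Python as follows. The column names “shape” and “side” and their values are hypothetical and only serve as an example; the real columns in “assignment_3.csv” may differ. In a pandas workflow, `pandas.get_dummies` is a common way to do the one-hot step.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def binarize(rows, column, positive):
    """Map the `positive` category to 1 and every other value to 0."""
    return [{**row, column: 1 if row[column] == positive else 0} for row in rows]


# Hypothetical rows; "shape" and "side" are made-up column names.
rows = [
    {"shape": "round", "side": "left", "pathology": "benign"},
    {"shape": "irregular", "side": "right", "pathology": "malignant"},
    {"shape": "oval", "side": "left", "pathology": "benign"},
]

encoded = binarize(one_hot(rows, "shape"), "side", positive="right")
for row in encoded:
    print(row)
```

The target column is left untouched here, in line with the requirement that only the non-target columns must be numerical.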