# Jupyter | Python Coding Assignment

This is a Python and Jupyter assignment.

Assignment 3

**1. ARIMA**

a. Use the file “assignment_1.csv” from the first assignment for this task. The target column is “gold_price”.

b. Plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the gold price time series. (See https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html)

c. Describe what the plots indicate (in terms of autocorrelation, the autoregressive parameter (p), and the moving average parameter (q)).

Some rules of thumb to recall:

i. Rule 1: If the ACF shows exponential decay, the PACF has a spike at lag 1, and there is no correlation at other lags, then use one autoregressive (p) parameter.

ii. Rule 2: If the ACF shows a sine-wave pattern or a set of exponential decays, the PACF has spikes at lags 1 and 2, and there is no correlation at other lags, then use two autoregressive (p) parameters.

iii. Rule 3: If the ACF has a spike at lag 1, no correlation at other lags, and the PACF damps out exponentially, then use one moving average (q) parameter.

iv. Rule 4: If the ACF has spikes at lags 1 and 2, no correlation at other lags, and the PACF has a sine-wave pattern or a set of exponential decays, then use two moving average (q) parameters.

v. Rule 5: If the ACF shows exponential decay starting at lag 1, and the PACF shows exponential decay starting at lag 1, then use one autoregressive (p) and one moving average (q) parameter.

d. Determine how many times you need to difference the data, and perform the analysis on the n-times differenced data.

e. Another approach to assessing the presence of autocorrelation is the Durbin-Watson (DW) statistic. The value of the DW statistic is close to 2 if the errors are uncorrelated. What is DW for our data, and does this match what you observed from the ACF and PACF plots? (See https://www.statsmodels.org/stable/generated/statsmodels.stats.stattools.durbin_watson.html – in fact, the statsmodels package provides much of the functionality needed for time series analysis.)
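A short sketch of the DW statistic, computed both via the linked `statsmodels` function and by hand from its definition (the sum of squared successive residual differences over the sum of squared residuals). Seeded white noise stands in for the residuals, so the value should land near 2:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Uncorrelated errors -> DW should be close to 2.
rng = np.random.default_rng(2)
resid = rng.normal(size=300)

dw = durbin_watson(resid)

# The same statistic by hand, straight from the definition.
dw_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```

A DW value well below 2 indicates positive serial correlation; well above 2, negative serial correlation.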

f. Remove serial dependency by modeling a simple ARMA process with p and q as derived above. Take a look at what the resulting process looks like (plot it).

g. Calculate the residuals, test the null hypothesis that the residuals come from a normal distribution, and construct a qq-plot. Do the results of the hypothesis test and the qq-plot align?

h. Now investigate the autocorrelation of your ARMA(p, q) model. Did it improve? These can be examined graphically, but a statistic will help. Next, we calculate the lag, autocorrelation (AC), Q statistic, and Prob>Q. The Ljung–Box Q test is a statistical test of whether any of a group of autocorrelations of a time series are different from zero. The null hypothesis is H0: the data are independently distributed (i.e., the correlations in the population from which the sample is taken are 0, so any observed correlations in the data result from randomness of the sampling process).

i. Compute predictions for the years 2000 and after, as well as 2010 and after, and analyze their fit against actual values.

j. Calculate the forecast error via MAE and MFE.

Reminders: The mean absolute error (MAE) is computed as the average absolute error value. If MAE is zero, the forecast is perfect. Compared to the mean squared error (MSE), this measure of fit “de-emphasizes” outliers (unique or rare large error values affect the MAE less than the MSE).

Mean forecast error (MFE, also known as bias): the MFE is the average error in the observations. A large positive MFE means the forecast is undershooting the actual observations; a large negative MFE means the forecast is overshooting them. A value near zero is ideal, and a small value generally means a pretty good fit.

The MAE is a better indicator of fit than the MFE.
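Both metrics are one-liners in NumPy, taking the error as actual minus forecast (so a positive MFE means undershooting, matching the reminder above):

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error: average magnitude of the errors."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(forecast)))

def mfe(actual, forecast):
    """Mean forecast error (bias): positive means undershooting."""
    return np.mean(np.asarray(actual) - np.asarray(forecast))

# Tiny worked example: errors = [1, 0, -2]
actual = np.array([3.0, 5.0, 2.0])
forecast = np.array([2.0, 5.0, 4.0])
# mae -> (1 + 0 + 2) / 3 = 1.0;  mfe -> (1 + 0 - 2) / 3 = -1/3
```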

**2. Classification**

a. Load the file “assignment_3.csv”. This file contains doctor analyses of breast cancer images. The target is “pathology”: whether the tissue is “benign” (= no cancer) or “malignant” (= has cancer).

b. Explore and clean up the data. Additionally, convert categorical columns into numerical columns that a machine learning model can understand. The final data must contain only numerical columns (except the target) and two target classes.

Operations that might be necessary:

i. One-hot-encoding (i.e., add column for each category)

ii. Binarization (i.e., representing one category as 0 and the other as 1)

iii. Merging categories (e.g., if categories are too similar or appear only a few times)

iv. Removing columns without information

v. Converting categories to numbers (i.e., introducing an order)

vi. Converting numbers to categories (e.g., if the values do not represent numbers)
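Several of the operations above map directly onto pandas idioms. A sketch on a toy frame: the column names other than "pathology", "benign", and "malignant" are made up for illustration and will differ in the real assignment_3.csv:

```python
import pandas as pd

# Toy stand-in for assignment_3.csv; feature names are hypothetical.
df = pd.DataFrame({
    "pathology": ["benign", "malignant", "benign", "malignant"],
    "breast_side": ["left", "right", "left", "left"],  # unordered -> one-hot
    "assessment": ["3", "4", "5", "4"],                # ordered -> integer
    "id": ["a1", "a2", "a3", "a4"],                    # no information
})

df = df.drop(columns=["id"])                    # iv. remove uninformative column
df["assessment"] = df["assessment"].astype(int) # v. categories to ordered numbers
df = pd.get_dummies(df, columns=["breast_side"])  # i. one-hot-encoding
# ii. binarize the target: malignant -> 1, benign -> 0
df["pathology"] = (df["pathology"] == "malignant").astype(int)
```

One-hot encoding suits unordered categories (no artificial order is introduced), while `astype(int)` is only appropriate when the categories genuinely carry an order.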