Python辅导 | FM5091 Chris Prouty

PCA using Python
Chris Prouty
University of Minnesota
• For next week’s finance topic, please consider:
A taxonomy of banking: retail, commercial,
investment, central.
Taxonomy of Banking PDF available in Useful
• Questions about the implied volatility
– What is Principal Component Analysis
• A method by which we can reduce the dimensionality
of a matrix of data
– What is the math behind PCA
• Spectral Theorem
– Using PCA with financial data sets
• Energy sector stock data
• Yield curve data
– PCA activity in class/assignment
• Maybe we can begin in class…
PCA using Python
• PCA is an acronym that stands for “Principal
Component Analysis”
• PCA has a couple aims
– Decompose a data set into component parts
– Reduce dimensionality
• For example, PCA can be used to determine
whether a set of time series data has a single
“driver” among all the prices, or whether each
evolves independently
• Determining what the drivers are is different than
determining that a driver exists
• In other words, PCA may describe how a market
behaves, but not why it behaves that way
PCA using Python
• Said another way, PCA is used to determine
how many variables in a data set describe the
variance for the data set
• Perhaps the variance among all the variables
can be explained 100% by a single variable;
perhaps it can be explained 0%, or perhaps it
can explain some portion of the observed
• PCA is part of a larger family of statistical
techniques known as factor analysis
• Factor analysis plays a large role in different
parts of finance (and other interesting things),
so it’s worth exploring the techniques beyond
what is presented today
PCA using Python
• PCA is useful because knowing how many
factors are behind moves in a data set allow
for easier modeling
– Example: Suppose you want to model a basket of
10 stocks. Using PCA, you might find that 95% of
the variance in the data set can be explained by a
single factor, making it easier to model the basket.
– In other words: we may be able to reduce the
dimensionality of a high-dimensional problem
• Today, as an example, we’ll be using market
• One data set is a time series of several stocks
in the oil industry and one is a Treasury yield
PCA using Python
• You should be aware that the usefulness of PCA
goes well beyond the realm of finance
• The fundamental principle of PCA is that we can
represent a larger dataset to some degree of
accuracy (possibly 100%) with a smaller set of
• PCA is commonly used in a variety of industries
that attempt to perform predictive modeling
– Psychology
– Retail sales
• Netflix, Amazon
– Meteorology
– Data compression (we will do an example of this…)
PCA using Python
• Before we address the mechanics, a little math…
• PCA is predicated on the Spectral Theorem
– Any symmetric matrix is rotationally diagonalizable
• The Spectral Theorem should be covered in
FM5001; review it if you’re not in FM5001
• Today we will be dealing with covariance matrices
• A covariance matrix should be symmetric and
positive semi-definite
– A covariance matrix satisfies the requirements of the
Spectral Theorem
• Recall that we have only price data, we need to
perform some operation to find returns
PCA using Python
• A brief discussion of the Spectral Theorem…
• Suppose we have a real symmetric matrix,
such as this one, “A”
The eigenvalues
and eigenvectors
of A are given as
b and a, respectively.
PCA using Python
• The Spectral Theorem tells us that if our
matrix satisfies certain conditions, our original
matrix can be “rebuilt” from the eigenvalues
and eigenvectors
• Specifically, the following expression holds
• Remember the conditions we’ve defined such
that this expression is true
– Matrix is symmetric
– Matrix is positive semi-definite
– Matrix is real
• To illustrate these points, consider another
simple example
Complex numbers are unlikely to
appear in a series of returns…
PCA using Python
Note that the equality that was true for the matrix A is no longer
true in the case of the matrix Y. The condition of symmetry is not satisfied.
PCA using Python
• When we perform eigendecomposition on a
real symmetric matrix, an important benefit is
• The eigenvectors of a real, symmetric matrix
are guaranteed to be orthogonal
• This is extremely helpful!
• Imagine data spread over some R
• Imagine that in one of those dimensions, the
orthogonal eigenvector has a corresponding
eigenvalue of zero
• We can simply drop that dimension with no
loss in data fidelity, reducing dimensionality
PCA using Python
• Before we try applying PCA to a data set, we
need to clearly delineate the steps to
performing PCA
• First, a data set needs to be constructed
• Practically, a “data set” means a matrix
• Commonly, each column represents a subject,
and each row represents an observation
– In the case of finance, columns could represent
stocks and rows may represent daily close prices
for the past n days
– Don’t forget about Quandl and Yahoo! Finance if
you’re looking for good time series data for stocks
Step 1: Find a data set
PCA using Python
• We need to construct a covariance matrix
representing the relationship between the
subjects in our data set
• Building a covariance matrix is a multi-step
process in itself:
– For each subject, calculate the mean of the
– Subtract the calculated means from their respective
– Multiply the transpose of the resulting matrix by itself
(assuming subjects are in columns and observations
are in rows) and by 1/(# of observations – 1)
Step 2: Build a covariance matrix
PCA using Python
• Alternatively, you can use the cov function
present in Python NumPy
– This was introduced in an earlier lecture
• Syntax:
cov_matrix=np.cov(data, rowvar=False)
• Eigendecomposition is nothing more than
finding the eigenvalues and eigenvectors of a
particular matrix
• In this case, we want to perform
eigendecomposition on the covariance matrix
we’ve constructed from the data set
Step 3: Perform eigendecomposition
PCA using Python
• Recall that NumPy has built-in functionality to
perform this task: the eig function
• Syntax:
• It’s worth pausing for a moment to perform an
experiment and discuss the results…
• 5 Minute Challenge
Write Python code that generates two
different matrices. Each should be of length n
and contain two subjects. Both subjects
should be a list of normal random numbers,
but in one matrix the two subjects should be
PCA using Python
• Solution
Where ε1
and ε2
are uncorrelated
random numbers
and ρ is the
correlation [-1,1]
import numpy as np
def makeCorrelMatrix(n, rho):
m = np.zeros(n * 2).reshape(n,2)
m[:,1]=rho * m[:,0] + np.sqrt(1 – rho ** 2) *
return m
PCA using Python
• Let’s perform Steps 1 through 3 on contrived
matrices under different conditions…
• First, let’s examine the eigenvalues for
uncorrelated subjects
• Now, let’s examine the eigenvalues for
subjects under different correlation conditions
0.9012 0
0 1.071
0.7327 0
0 1.2773
0.504 0
0 1.5148
PCA using Python
• Notice that the smaller eigenvalue gets
smaller as a proportion of the sum of the
eigenvalues as the correlation increases
• Punch line: data loss incurred by dimensional
reduction can be predicted by eigenvalues
• Discussion: What implication does this have
for data sets we examine in finance? What
implication does it have for other data sets?
0.2343 0
0 1.5989
0 0
0 2.052
PCA using Python
• Stated a different way, the eigenvalues tell us
how many factors are present in a data set
– Suppose we analyze a data set with ten stocks
from a given sector and only find one relevant
eigenvalue. What might that mean?
• Eigenvalues should typically be ordered in
ascending or descending order
• The eigenvector associated with the largest
eigenvalue describes a “line of best fit”
• Note that the NumPy eig function returns a
matrix of eigenvectors and eigenvalues;
rightmost eigenvalues are always the largest
and are associated with rightmost
PCA using Python
• Ending the topic at Step 3 would still provide a
useful introduction to PCA
• However, you might wonder: how do we actually
reduce the dimensionality of a data set?
• First, understand that reducing dimensionality
can only normally be done at the expense of
losing data
– Under what circumstance is this not true?
• Imagine an n-dimensional data set
• A p-dimensional data set is desired, where p < n
• p eigenvectors are retained, (n – p) are discarded
Step 4: Reduce dimensionality
PCA using Python
• Choose which eigenvectors to retain based on
eigenvalue ranking
• Create a new vector of vectors – a matrix –
where each column represents a retained
– Create this matrix such that, when transposed, the
most relevant eigenvector is the top row, with
descending importance downward
• Using the original data matrix, create a new
version by subtracting the mean of each
subject from the respective subject column
PCA using Python
• Practically, it’s easiest to see the usefulness of
dimensional reduction if we generate a highlycorrelated matrix as an example
• No dimensional reduction example:
size = 100
g = makeCorrelMatrix(size, 0.75)
c = np.cov(g, rowvar=False)
eig = np.linalg.eig(c)
# no dimensional reduction
r = np.rot90(eig[1])
norm_g = np.zeros(size * 2).reshape(size,2)
norm_g[:,0] = g[:,0] – np.mean(g[:,0])
norm_g[:,1] = g[:,1] – np.mean(g[:,1])
approx =, r)
PCA using Python
• Dimensional reduction example:
• Notice that approx_reduce is one dimension
smaller than the prior approx example
• We have reduced our problem by one
dimension (albeit with data loss)
size = 100
g = makeCorrelMatrix(size, 0.75)
c = np.cov(g, rowvar=False)
eig = np.linalg.eig(c)
# dimensional reduction, drop an eigenvector
r_reduce = np.rot90(eig[1])
r_reduce = r_reduce[:,0]
norm_g_reduce = np.zeros(size * 2).reshape(size,2)
norm_g_reduce[:,0] = g[:,0] – np.mean(g[:,0])
norm_g_reduce[:,1] = g[:,1] – np.mean(g[:,1])
approx_reduce =, r_reduce)
PCA using Python
m2 m2approx no dim reduce m2approx dim reduce
PCA using Python
• Having constructed a p-dimensional matrix
from the original data, reconstructing the
original data is the next logical step
• Unless a discarded eigenvalue was exactly
zero, reconstructing n-dimensional data from
p-dimensional data will result in data loss
• No dimensional reduction reconstruction:
Step 5: Reconstruct original data
OG =, r)
OG[:,0] = OG[:,0] + np.mean(g[:,0])
OG[:,1] = OG[:,1] + np.mean(g[:,1])
PCA using Python
• OG should be a matrix equal to g (with some
rounding errors)
• Follow the same procedure for the
dimensionally-reduced matrix
• Reconstructing the dimensionally-reduced
data will result in a chart that doesn’t
immediately look much like the original data
• Although the points all lie on a single line, the
trend of the original data is captured
• Since ρ≠1, some data is present in the
eigenvector that we discarded
– What happens if ρ=1?
PCA using Python
Charted here are all steps
involving devolving and
reconstructing data.
Note that these charts use
different data than shown
earlier. ρ=0.75
PCA using Python
Here ρ=1. Notice that the reconstructions
of both the full dimensional and
-reduced charts are
identical. The lesson here is that if an
eigenvalue is zero (ρ=1), we can discard a
dimension without losing any data.
PCA using Python
• We now have the fundamentals of performing
principal component analysis on a data set
• Let’s try applying it to some higherdimensional real-world financial data
– From the class website, download
– Place the file in the your Python directory
– Create the following code:
df = pd.csv_read(‘energysector.csv’)
– Notice that df is now a matrix with four columns
and 251 rows
– Each column represents an energy company’s
daily stock close prices
– Plot the data
PCA using Python
• Below is a chart of the energy sector time
series; is there a relationship among these
PCA using Python
• In order to make an apples-to-apples
comparison among these stocks, convert the
prices to returns
• 5 Minute Challenge
Create a Python function that accepts a matrix
of price time series and returns a matrix of log
returns. Assume the price time series are
arranged as columns. Assume there are n
columns (underlyings) and p rows
PCA using Python
• Solution
returns = pd.DataFrame()
for i in df:
returns[i] = np.log(df[i][1:].values/df[i][:-1].values)
PCA using Python
• Converting to prices to returns and charting
(the mean of each is subtracted)…
PCA using Python
• Calculate returns, build a covariance matrix,
and perform eigendecomposition…
• Examining the eigenvalues we see
approximately the following:
• We can see that there is a single principal
component that is responsible for the majority
of the variance in the data
• Knowing that we are looking at energy
companies, what is the principal component?
0 0 0 0
0 0.00001 0 0
0 0 0.0001 0
0 0 0 0.0008
PCA using Python
• Recall that if an eigenvalue is exactly zero, we
can reduce dimensionality with no data loss
• Examining the associated eigenvector, we
discard the vector shaded in red
0 0 0 0
0 0.00001 0 0
0 0 0.0001 0
0 0 0 0.0008
0.0088 0.0498 -0.8624 0.5037
0.6454 0.4911 0.3207 0.4893
0.0896 -0.8074 0.2593 0.5223
-0.7585 0.3231 0.2935 0.4839
PCA using Python
• Reconstructing the 3-dimensional return
PCA using Python
• Notice that there is very minor data loss
– This is likely attributable to rounding during
various elements of the calculation
• We can perform the same analysis and reduce
the dimension to two
0 0 0 0
0 0.00001 0 0
0 0 0.0001 0
0 0 0 0.0008
0.0088 0.0498 -0.8624 0.5037
0.6454 0.4911 0.3207 0.4893
0.0896 -0.8074 0.2593 0.5223
-0.7585 0.3231 0.2935 0.4839
PCA using Python
• Reconstructing the 2-dimensional return
PCA using Python
• In this case, two dimensions still captures much
• Just as we can apply PCA to stock prices, we can
also apply it to things from other asset classes
– Equities
– Foreign exchange
– Commodities
– Rates
• We’ll try applying PCA to the yield curve
– The yield curve describes the yields of various U.S.
Treasuries at different durations. These yields move
on a daily basis, and therefore we can get a time
series of yields at different points on the curve.
PCA using Python
• We need to do a little setup…
– From the class website, download ycdata.csv
– Place the file in the your Python directory
– At the command prompt, enter:
df = pd.csv_read(‘ycdata.csv’)
– Notice that df is now a matrix with eight columns
and 1112 rows
– Each column represents a daily close yield for a
given duration
– At the command prompt, enter:
• It’s important to understand what is being
charted in this graph…
PCA using Python
• How did the shape/slope of this yield curve
change over time? (x-axis is time)
PCA using Python
• Performing PCA analysis requires one twist
when dealing with rates: instead of using log
returns, we will use differences
– Why we use differences instead of returns is not
within the scope of this lecture…
• Performing PCA, we see eigenvalues:
0 0 0 0 0 0 0 0
0 0.0001 0 0 0 0 0 0
0 0 0.0001 0 0 0 0 0
0 0 0 0.0005 0 0 0 0
0 0 0 0 0.0007 0 0 0
0 0 0 0 0 0.0015 0 0
0 0 0 0 0 0 0.0094 0
0 0 0 0 0 0 0 0.0142
PCA using Python
• First, the punch line: it is generally agreed that
there are three principal components found
when decomposing a yield curve
– Level
– Slope
– Curvature
• Three principal components is consistent with
our PCA analysis
0 0 0 0 0 0 0 0
0 0.0001 0 0 0 0 0 0
0 0 0.0001 0 0 0 0 0
0 0 0 0.0005 0 0 0 0
0 0 0 0 0.0007 0 0 0
0 0 0 0 0 0.0015 0 0
0 0 0 0 0 0 0.0094 0
0 0 0 0 0 0 0 0.0142
PCA using Python
PCA using Python
PCA using Python
PCA using Python
PCA using Python
PCA using Python
PCA using Python
PCA using Python
PCA using Python
• There is a relationship between the
components of the dataset and the
eigenvalues worth noting
• The sum of the eigenvalues is the sum of the
variances of each component in the dataset
• In class exercise:
– Using a data set of your choosing, generate
eigenvalues and eigenvectors from a covariance
– Calculate the variance “by hand” of each subject
– Sum all the subject variance
– Compare to the sum of the eigenvalues
PCA using Python
• In other words, we can say how much
variation in a dataset can be accounted for
with n eigenvalues…or principal components
• Now that we’re able to perform PCA, how do
we use it to help run simulations and do
finance stuff?
• Next semester we may try to integrate PCA
into a Monte Carlo simulation to price an
option on a basket of correlated stocks
• We don’t yet know enough about simulation
to give it a try
• To be continued…
PCA using Python
• NumPy features a function called SVD, which
stands for Singular Value Decomposition
• SVD is a decomposition method which produces a
result similar to finding eigenvalues and
eigenvectors of a covariance matrix
– SVD may be performed on a non-square matrix,
where eigen-decomposition needs a square matrix
– Using SVD, the original matrix can be constructed
from the three resulting matrices
– The condition of matrix symmetry is no longer present
when using SVD
• In the same spirit of PCA, SVD can be used to
represent a larger dataset with less data
PCA using Python
• As I was toying with PCA, I began to wonder
whether the eigenvalues from empirical data
tended toward some distribution
• I randomly chose 1000 names and grabbed
one year of data from Quandl
Plotted here is a distribution
of eigenvalues and a sampled
Pareto distribution.
Does variance in empirical
stock data follow a power law?
PCA using Python
A brief vignette…
PCA using Python
• As mentioned earlier, PCA can be applied to
perform data compression
– Video
• Compressing video using PCA involves layering images
and comparing the hue intensity of a single pixel as the
image evolves
– Still images
• Aside from compressing still images, PCA can also be
used for things like facial recognition and other image
comparison pursuits
– Other types of media
• Today we will use PCA to build an admittedly
simple and shabby image compression
program in Python
PCA using Python
• First, we’ll need an image…
We choose to use
a grayscale image
because the matrix
data is simpler
than a full color
PCA using Python
• Homer will be helping us with image
compression (there’s a lot of him to compress)
• First, we need to load the image into Python
• The library cv2 contains a function called
imread that reads an image into a matrix
• Example:
• The data is recorded into an array reflecting
RGB values
• For ease, I will average those values to create
a greyscale Homer
PCA using Python
• We need to consider how best to apply PCA to
an image…
– Is it possible to simply treat an image like a time
series of stock data?
– Can we recreate the original data (or some
approximation thereof) with only a handful of
eigenvectors and eigenvalues?
– How do we plan to achieve a smaller data set that
represents the original image while allowing for a
smaller file size?
• Recall what PCA tells us about a data set
– Fundamentally PCA allows us to determine the
overall hegemony of a matrix (remember that we
are representing our image as a matrix)
PCA using Python
• We can’t apply PCA to the entire image…
Suppose we chose to
apply PCA to only
regions of the image.
By inspection, we can
make judgments about
the hegemony of
specific regions.
It’s important to
understand the difference
between the eigenvalues
likely to emerge by
performing PCA on the
red region versus the blue
PCA using Python
• The image contained inside the red region was
highly self-similar
• By comparison, the blue region was much less so
• Perhaps it doesn’t make sense to apply PCA to
the entire image, but maybe it does make sense
to apply it to sub-regions of a larger image
• If PCA applied to a region yields only one relevant
eigenvalue, what judgment can be made about
that region?
• If PCA applied to a region yields many relevant
eigenvalues, what judgment can be made about
that region?
PCA using Python
• We can’t apply PCA to the entire image…
Examine the small
boxes used to break
up the larger image.
How many of them
contain only a single
color? (and what
implication does
this have for the
eigenvalues if PCA
was applied to
only that box?)
PCA using Python
• If we can determine how self-similar a region
of the image tends to be, we can decide how
much of the image data should be kept and
how much should be discarded
– The more data discarded, the smaller the image
• Enter: PCA
• We use PCA to determine self-similarity of a
region of the image
• We don’t use PCA to compress, we use PCA to
determine whether we should compress
• In this example, “compression” will simply
mean averaging the data in a region
PCA using Python
• Consider the code below…
import matplotlib.pyplot as plt
import numpy as np
import cv2
im = cv2.imread(“homer.png”)
homer = np.array(im)
homer_grey = np.zeros(448**2).reshape(448,448)
for i in range(448):
for j in range(448):
value = np.mean(homer[i,j]) / 255
homer_grey[i,j] = value
quality = 0.6
new_homer = np.zeros(homer_grey.shape)
for i in range(55):
for j in range(55):
cell = homer_grey[(j * 8):(j * 8) + 8, (i * 8): (i * 8) + 8]
c = np.cov(cell)
eig = np.linalg.eig(c)
test = np.diag(eig[0])[0:1].sum() / np.diag(eig[0]).sum()
if(test > quality):
new_homer[(j * 8): (j * 8) + 8, (i * 8): (i * 8) +
8]=np.ones(64).reshape(8,8) * np.mean(np.mean(cell))
new_homer[(j * 8): (j * 8) + 8, (i * 8): (i * 8) + 8]=cell
Notice that we are not
using PCA to perform
the compression. We
are using PCA to
determine whether we
should perform
compression (a simple
hue average).
PCA using Python
Quality = 0.9
PCA using Python
Quality = 0.8
PCA using Python
Quality = 0.7
PCA using Python
Quality = 0.6
PCA using Python
Quality = 0.5
PCA using Python