# Python Tutoring | FM5091 Chris Prouty

5091.7

PCA using Python

FM5091

Chris Prouty

University of Minnesota

Housekeeping

• For next week’s finance topic, please consider: a taxonomy of banking: retail, commercial, investment, central.
• Reading: Taxonomy of Banking PDF, available in Useful Documents

Agenda

• Questions about the implied volatility assignment?
• PCA
  – What is Principal Component Analysis
    • A method by which we can reduce the dimensionality of a matrix of data
  – What is the math behind PCA
    • Spectral Theorem
  – Using PCA with financial data sets
    • Energy sector stock data
    • Yield curve data
  – PCA activity in class/assignment
    • Maybe we can begin in class…

PCA using Python

• PCA is an acronym that stands for “Principal Component Analysis”
• PCA has a couple of aims
  – Decompose a data set into component parts
  – Reduce dimensionality
• For example, PCA can be used to determine whether a set of time series data has a single “driver” among all the prices, or whether each evolves independently
• Determining what the drivers are is different from determining that a driver exists
• In other words, PCA may describe how a market behaves, but not why it behaves that way

PCA using Python

• Said another way, PCA is used to determine how many variables in a data set are needed to describe the variance of the data set
• Perhaps the variance among all the variables can be explained 100% by a single variable; perhaps a single variable explains essentially none of it; or perhaps it explains some portion of the observed variance
• PCA is part of a larger family of statistical techniques known as factor analysis
• Factor analysis plays a large role in different parts of finance (and other interesting things), so it’s worth exploring the techniques beyond what is presented today

PCA using Python

• PCA is useful because knowing how many factors are behind moves in a data set allows for easier modeling
  – Example: Suppose you want to model a basket of 10 stocks. Using PCA, you might find that 95% of the variance in the data set can be explained by a single factor, making it easier to model the basket.
  – In other words: we may be able to reduce the dimensionality of a high-dimensional problem
• Today, as an example, we’ll be using market data
• One data set is a time series of several stocks in the oil industry and one is a Treasury yield curve

PCA using Python

• You should be aware that the usefulness of PCA goes well beyond the realm of finance
• The fundamental principle of PCA is that we can represent a larger dataset to some degree of accuracy (possibly 100%) with a smaller set of data
• PCA is commonly used in a variety of industries that attempt to perform predictive modeling
  – Psychology
  – Retail sales
    • Netflix, Amazon
  – Meteorology
  – Data compression (we will do an example of this…)

PCA using Python

• Before we address the mechanics, a little math…
• PCA is predicated on the Spectral Theorem
  – Any real symmetric matrix is orthogonally diagonalizable
• The Spectral Theorem should be covered in FM5001; review it if you’re not in FM5001
• Today we will be dealing with covariance matrices
• A covariance matrix should be symmetric and positive semi-definite
  – A covariance matrix satisfies the requirements of the Spectral Theorem
• Recall that we have only price data; we need to perform some operation to find returns

PCA using Python

• A brief discussion of the Spectral Theorem…
• Suppose we have a real symmetric matrix, such as this one, “A”

[Figure: the matrix A, with its eigenvalues and eigenvectors given as b and a, respectively]

PCA using Python

• The Spectral Theorem tells us that if our matrix satisfies certain conditions, our original matrix can be “rebuilt” from the eigenvalues and eigenvectors
• Specifically, the following expression holds: $A = Q \Lambda Q^{T}$, where the columns of $Q$ are the eigenvectors of $A$ and $\Lambda$ is the diagonal matrix of its eigenvalues
• Remember the conditions we’ve defined such that this expression is true
  – Matrix is symmetric
  – Matrix is positive semi-definite
  – Matrix is real
• To illustrate these points, consider another simple example

Complex numbers are unlikely to appear in a series of returns…

PCA using Python

Note that the equality that was true for the matrix A is no longer true in the case of the matrix Y. The condition of symmetry is not satisfied.
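As a quick check of the theorem, here is a minimal NumPy sketch; the matrices here are illustrative stand-ins, since the slide’s original A and Y are not reproduced in this text:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])    # real and symmetric
vals, vecs = np.linalg.eig(A)
print(np.allclose(vecs @ np.diag(vals) @ vecs.T, A))   # True: A is rebuilt from Q Λ Qᵀ

Y = np.array([[2.0, 1.0], [0.0, 3.0]])    # not symmetric
vals_y, vecs_y = np.linalg.eig(Y)
print(np.allclose(vecs_y @ np.diag(vals_y) @ vecs_y.T, Y))   # False: the equality fails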

PCA using Python

• When we perform eigendecomposition on a real symmetric matrix, an important benefit is present
• The eigenvectors of a real, symmetric matrix are guaranteed to be orthogonal
• This is extremely helpful!
• Imagine data spread over some $\mathbb{R}^n$ space
• Imagine that in one of those dimensions, the orthogonal eigenvector has a corresponding eigenvalue of zero
• We can simply drop that dimension with no loss in data fidelity, reducing dimensionality

PCA using Python

• Before we try applying PCA to a data set, we need to clearly delineate the steps to performing PCA
• First, a data set needs to be constructed
• Practically, a “data set” means a matrix
• Commonly, each column represents a subject, and each row represents an observation
  – In the case of finance, columns could represent stocks and rows may represent daily close prices for the past n days
  – Don’t forget about Quandl and Yahoo! Finance if you’re looking for good time series data for stocks

Step 1: Find a data set

PCA using Python

• We need to construct a covariance matrix representing the relationship between the subjects in our data set
• Building a covariance matrix is a multi-step process in itself:
  – For each subject, calculate the mean of the observations
  – Subtract the calculated means from their respective subjects
  – Multiply the transpose of the resulting matrix by itself (assuming subjects are in columns and observations are in rows) and by 1/(# of observations – 1)

Step 2: Build a covariance matrix

PCA using Python

• Alternatively, you can use the cov function present in NumPy
  – This was introduced in an earlier lecture
• Syntax:

cov_matrix = np.cov(data, rowvar=False)
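As a sanity check, a minimal sketch (variable names are illustrative) confirming that the manual recipe above matches np.cov:

import numpy as np

data = np.random.randn(100, 3)             # 100 observations of 3 subjects
centered = data - data.mean(axis=0)        # subtract each subject's mean
manual_cov = centered.T @ centered / (len(data) - 1)
print(np.allclose(manual_cov, np.cov(data, rowvar=False)))   # True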

• Eigendecomposition is nothing more than finding the eigenvalues and eigenvectors of a particular matrix
• In this case, we want to perform eigendecomposition on the covariance matrix we’ve constructed from the data set

Step 3: Perform eigendecomposition

PCA using Python

• Recall that NumPy has built-in functionality to perform this task: the eig function
• Syntax:

eigenresults = np.linalg.eig(matrix)

• It’s worth pausing for a moment to perform an experiment and discuss the results…
• 5 Minute Challenge

Write Python code that generates two different matrices. Each should be of length n and contain two subjects. Both subjects should be a list of normal random numbers, but in one matrix the two subjects should be ρ-correlated.

PCA using Python

• Solution

$x_1 = \varepsilon_1$, $x_2 = \rho\,x_1 + \sqrt{1 - \rho^2}\,\varepsilon_2$, where $\varepsilon_1$ and $\varepsilon_2$ are uncorrelated random numbers and $\rho$ is the correlation in $[-1, 1]$

import numpy as np

def makeCorrelMatrix(n, rho):
    m = np.zeros(n * 2).reshape(n, 2)
    m[:,0] = np.random.randn(n)
    m[:,1] = rho * m[:,0] + np.sqrt(1 - rho ** 2) * np.random.randn(n)
    return m
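A quick usage check, as a hedged sketch: the sample correlation should land near ρ, up to sampling noise.

g = makeCorrelMatrix(10000, 0.5)
print(np.corrcoef(g, rowvar=False)[0, 1])   # ≈ 0.5, up to sampling noise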

PCA using Python

• Let’s perform Steps 1 through 3 on contrived matrices under different conditions…
• First, let’s examine the eigenvalues for uncorrelated subjects
• Now, let’s examine the eigenvalues for subjects under different correlation conditions

ρ=0, n=1000:    diag(0.9012, 1.071)
ρ=0.25, n=1000: diag(0.7327, 1.2773)
ρ=0.5, n=1000:  diag(0.504, 1.5148)
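A minimal sketch reproducing the experiment with the makeCorrelMatrix function above; exact eigenvalues will vary run to run:

for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    g = makeCorrelMatrix(1000, rho)
    vals = np.linalg.eig(np.cov(g, rowvar=False))[0]
    print(rho, np.sort(vals))   # the smaller eigenvalue shrinks as rho grows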

PCA using Python

• Notice that the smaller eigenvalue gets smaller as a proportion of the sum of the eigenvalues as the correlation increases
• Punch line: data loss incurred by dimensional reduction can be predicted by eigenvalues
• Discussion: What implication does this have for data sets we examine in finance? What implication does it have for other data sets?

ρ=0.75, n=1000: diag(0.2343, 1.5989)
ρ=1.0, n=1000:  diag(0, 2.052)

PCA using Python

• Stated a different way, the eigenvalues tell us how many factors are present in a data set
  – Suppose we analyze a data set with ten stocks from a given sector and only find one relevant eigenvalue. What might that mean?
• Eigenvalues should typically be ordered in ascending or descending order
• The eigenvector associated with the largest eigenvalue describes a “line of best fit”
• Note that the NumPy eig function returns an array of eigenvalues and a matrix whose columns are the associated eigenvectors; eig does not guarantee any ordering, so sort the eigenvalues yourself, keeping the eigenvector columns matched (in the 2×2 examples here the largest eigenvalue happened to appear last; for symmetric matrices, np.linalg.eigh returns them in ascending order)
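A common idiom for that sort, as a hedged sketch (g is any data matrix, as above):

vals, vecs = np.linalg.eig(np.cov(g, rowvar=False))
order = np.argsort(vals)[::-1]            # indices from largest to smallest eigenvalue
vals, vecs = vals[order], vecs[:, order]  # keep eigenvector columns matched to their eigenvalues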

PCA using Python

• Ending the topic at Step 3 would still provide a useful introduction to PCA
• However, you might wonder: how do we actually reduce the dimensionality of a data set?
• First, understand that reducing dimensionality can normally only be done at the expense of losing data
  – Under what circumstance is this not true?
• Imagine an n-dimensional data set
• A p-dimensional data set is desired, where p < n
• p eigenvectors are retained, (n – p) are discarded

Step 4: Reduce dimensionality

PCA using Python

• Choose which eigenvectors to retain based on eigenvalue ranking
• Create a new vector of vectors (a matrix) where each column represents a retained eigenvector
  – Create this matrix such that, when transposed, the most relevant eigenvector is the top row, with descending importance downward
• Using the original data matrix, create a new version by subtracting the mean of each subject from the respective subject column

PCA using Python

• Practically, it’s easiest to see the usefulness of dimensional reduction if we generate a highly-correlated matrix as an example
• No dimensional reduction example:

size = 100
g = makeCorrelMatrix(size, 0.75)
c = np.cov(g, rowvar=False)
eig = np.linalg.eig(c)
# no dimensional reduction
r = np.rot90(eig[1])                   # reorder the eigenvector matrix (the rot90 trick used for this 2x2 example)
norm_g = np.zeros(size * 2).reshape(size, 2)
norm_g[:,0] = g[:,0] - np.mean(g[:,0]) # subtract each subject's mean
norm_g[:,1] = g[:,1] - np.mean(g[:,1])
approx = np.dot(norm_g, r)             # coordinates in the eigenvector basis

PCA using Python

• Dimensional reduction example:
• Notice that approx_reduce is one dimension smaller than the prior approx example
• We have reduced our problem by one dimension (albeit with data loss)

size = 100
g = makeCorrelMatrix(size, 0.75)
c = np.cov(g, rowvar=False)
eig = np.linalg.eig(c)
# dimensional reduction, drop an eigenvector
r_reduce = np.rot90(eig[1])
r_reduce = r_reduce[:,0]               # keep only a single direction (the principal eigenvector in this example)
norm_g_reduce = np.zeros(size * 2).reshape(size, 2)
norm_g_reduce[:,0] = g[:,0] - np.mean(g[:,0])
norm_g_reduce[:,1] = g[:,1] - np.mean(g[:,1])
approx_reduce = np.dot(norm_g_reduce, r_reduce)   # an n-vector: 1-D coordinates

PCA using Python

[Charts: m2 (original data), m2approx with no dimensional reduction, and m2approx with dimensional reduction]

PCA using Python

• Having constructed a p-dimensional matrix from the original data, reconstructing the original data is the next logical step
• Unless a discarded eigenvalue was exactly zero, reconstructing n-dimensional data from p-dimensional data will result in data loss
• No dimensional reduction reconstruction:

Step 5: Reconstruct original data

OG = np.dot(approx, r)               # rotate back (here r happens to be its own inverse)
OG[:,0] = OG[:,0] + np.mean(g[:,0])  # restore the subject means
OG[:,1] = OG[:,1] + np.mean(g[:,1])

PCA using Python

• OG should be a matrix equal to g (with some rounding errors)
• Follow the same procedure for the dimensionally-reduced matrix (a sketch follows below)
• Reconstructing the dimensionally-reduced data will result in a chart that doesn’t immediately look much like the original data
• Although the points all lie on a single line, the trend of the original data is captured
• Since ρ≠1, some data is present in the eigenvector that we discarded
  – What happens if ρ=1?
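A minimal sketch of that procedure, under the same assumptions as the code above (r_reduce holds the single retained unit eigenvector):

# project the 1-D coordinates back out along the retained eigenvector
OG_reduce = np.outer(approx_reduce, r_reduce)
OG_reduce[:,0] = OG_reduce[:,0] + np.mean(g[:,0])   # restore the subject means
OG_reduce[:,1] = OG_reduce[:,1] + np.mean(g[:,1])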

PCA using Python

Charted here are all steps involved in devolving and reconstructing the data. Note that these charts use different data than shown earlier. ρ=0.75

PCA using Python

Here ρ=1. Notice that the reconstructions of both the full-dimensional and dimensionally-reduced charts are identical. The lesson here is that if an eigenvalue is zero (ρ=1), we can discard a dimension without losing any data.

PCA using Python

• We now have the fundamentals of performing principal component analysis on a data set
• Let’s try applying it to some higher-dimensional real-world financial data
  – From the class website, download energysector.csv
  – Place the file in your Python directory
  – Create the following code:
df = pd.read_csv('energysector.csv')
  – Notice that df is now a matrix with four columns and 251 rows
  – Each column represents an energy company’s daily stock close prices
  – Plot the data (see the sketch below)
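A minimal sketch of the load-and-plot step (assuming pandas and matplotlib, with one column per ticker in the CSV):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('energysector.csv')
df.plot()        # one line per energy company
plt.show()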

PCA using Python

• Below is a chart of the energy sector time series; is there a relationship among these prices?

PCA using Python

• In order to make an apples-to-apples comparison among these stocks, convert the prices to returns
• 5 Minute Challenge

Create a Python function that accepts a matrix of price time series and returns a matrix of log returns. Assume the price time series are arranged as columns. Assume there are n columns (underlyings) and p rows (observations).

PCA using Python

• Solution

returns = pd.DataFrame()
for i in df:
    returns[i] = np.log(df[i][1:].values / df[i][:-1].values)
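Equivalently, a vectorized one-liner (a pandas idiom, not from the slides):

returns = np.log(df / df.shift(1)).dropna()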

PCA using Python

• Converting prices to returns and charting (the mean of each is subtracted)…

PCA using Python

• Calculate returns, build a covariance matrix, and perform eigendecomposition…
• Examining the eigenvalues we see approximately the following:

diag(0, 0.00001, 0.0001, 0.0008)

• We can see that there is a single principal component that is responsible for the majority of the variance in the data
• Knowing that we are looking at energy companies, what is the principal component?
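A hedged sketch of how numbers like these can be produced, assuming the returns matrix from the previous step:

c = np.cov(returns, rowvar=False)
vals = np.linalg.eig(c)[0]
print(np.sort(vals)[::-1] / vals.sum())   # fraction of total variance per component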

PCA using Python

• Recall that if an eigenvalue is exactly zero, we can reduce dimensionality with no data loss
• Examining the associated eigenvectors, we discard the one shaded in red on the slide: the first column, which is paired with the zero eigenvalue

Eigenvalues: diag(0, 0.00001, 0.0001, 0.0008)

Eigenvectors (one per column):
 0.0088   0.0498  -0.8624   0.5037
 0.6454   0.4911   0.3207   0.4893
 0.0896  -0.8074   0.2593   0.5223
-0.7585   0.3231   0.2935   0.4839

PCA using Python

• Reconstructing the 3-dimensional return data…

PCA using Python

• Notice that there is very minor data loss
  – This is likely attributable to rounding during various elements of the calculation
• We can perform the same analysis and reduce the dimension to two, this time discarding the two eigenvectors paired with the two smallest eigenvalues

Eigenvalues: diag(0, 0.00001, 0.0001, 0.0008)

Eigenvectors (one per column):
 0.0088   0.0498  -0.8624   0.5037
 0.6454   0.4911   0.3207   0.4893
 0.0896  -0.8074   0.2593   0.5223
-0.7585   0.3231   0.2935   0.4839

PCA using Python

• Reconstructing the 2-dimensional return data…

PCA using Python

• In this case, two dimensions still capture much of the data
• Just as we can apply PCA to stock prices, we can also apply it to other asset classes
  – Equities
  – Foreign exchange
  – Commodities
  – Rates
• We’ll try applying PCA to the yield curve
  – The yield curve describes the yields of various U.S. Treasuries at different durations. These yields move on a daily basis, and therefore we can get a time series of yields at different points on the curve.

PCA using Python

• We need to do a little setup…
  – From the class website, download ycdata.csv
  – Place the file in your Python directory
  – At the command prompt, enter:
df = pd.read_csv('ycdata.csv')
  – Notice that df is now a matrix with eight columns and 1112 rows
  – Each column represents a daily close yield for a given duration
  – At the command prompt, enter:
df.plot()
• It’s important to understand what is being charted in this graph…

PCA using Python

• How did the shape/slope of this yield curve change over time? (x-axis is time)

PCA using Python

• Performing PCA requires one twist when dealing with rates: instead of using log returns, we will use differences (see the sketch below)
  – Why we use differences instead of returns is not within the scope of this lecture…
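A minimal sketch of that twist, assuming the df loaded above with all columns holding numeric yields:

diffs = df.diff().dropna()               # day-over-day changes in each yield
c = np.cov(diffs, rowvar=False)
vals, vecs = np.linalg.eig(c)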

• Performing PCA, we see eigenvalues:

diag(0, 0.0001, 0.0001, 0.0005, 0.0007, 0.0015, 0.0094, 0.0142)

PCA using Python

• First, the punch line: it is generally agreed that there are three principal components found when decomposing a yield curve
  – Level
  – Slope
  – Curvature
• Three principal components is consistent with our PCA analysis

diag(0, 0.0001, 0.0001, 0.0005, 0.0007, 0.0015, 0.0094, 0.0142)

PCA using Python

[Charts: reconstructions of the yield curve data using 1 through 8 principal components, one slide per dimension count]

PCA using Python

• There is a relationship between the components of the dataset and the eigenvalues worth noting
• The sum of the eigenvalues is the sum of the variances of each component in the dataset
• In-class exercise (a sketch of the check follows below):
  – Using a data set of your choosing, generate eigenvalues and eigenvectors from a covariance matrix
  – Calculate the variance “by hand” of each subject
  – Sum all the subject variances
  – Compare to the sum of the eigenvalues
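A minimal sketch of the check, using a synthetic data set (the trace of the covariance matrix equals the sum of the eigenvalues):

data = np.random.randn(500, 4)                # any data set will do
c = np.cov(data, rowvar=False)
vals = np.linalg.eig(c)[0]
subject_vars = data.var(axis=0, ddof=1)       # per-subject variance 'by hand'
print(subject_vars.sum(), vals.sum())         # the two sums agree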

PCA using Python

• In other words, we can say how much variation in a dataset can be accounted for with n eigenvalues… or principal components
• Now that we’re able to perform PCA, how do we use it to help run simulations and do finance stuff?
• Next semester we may try to integrate PCA into a Monte Carlo simulation to price an option on a basket of correlated stocks
• We don’t yet know enough about simulation to give it a try
• To be continued…

PCA using Python

• NumPy features a function for SVD (np.linalg.svd), which stands for Singular Value Decomposition
• SVD is a decomposition method which produces a result similar to finding the eigenvalues and eigenvectors of a covariance matrix
  – SVD may be performed on a non-square matrix, whereas eigendecomposition needs a square matrix
  – Using SVD, the original matrix can be reconstructed from the three resulting matrices
  – The condition of matrix symmetry is no longer present when using SVD
• In the same spirit as PCA, SVD can be used to represent a larger dataset with less data
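A minimal sketch of np.linalg.svd (full_matrices=False gives the compact form; the data here is illustrative):

X = np.random.randn(100, 4)                   # non-square is fine for SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, X))    # True: X = U Σ Vᵀ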

PCA using Python

• As I was toying with PCA, I began to wonder whether the eigenvalues from empirical data tended toward some distribution
• I randomly chose 1000 names and grabbed one year of data from Quandl

[Chart: the distribution of eigenvalues alongside a sampled Pareto distribution. Does variance in empirical stock data follow a power law?]

PCA using Python

A brief vignette…

PCA using Python

• As mentioned earlier, PCA can be applied to perform data compression
  – Video
    • Compressing video using PCA involves layering images and comparing the hue intensity of a single pixel as the image evolves
  – Still images
    • Aside from compressing still images, PCA can also be used for things like facial recognition and other image comparison pursuits
  – Other types of media
• Today we will use PCA to build an admittedly simple and shabby image compression program in Python

PCA using Python

• First, we’ll need an image…

We choose to use a grayscale image because the matrix data is simpler than a full color image.

PCA using Python

• Homer will be helping us with image compression (there’s a lot of him to compress)
• First, we need to load the image into Python
• The library cv2 contains a function called imread that reads an image into a matrix
• Example:

homer = cv2.imread('homer.png')

• The data is recorded into an array of color channel values (OpenCV stores them in BGR order, though this doesn’t matter when averaging)
• For ease, I will average those values to create a greyscale Homer

PCA using Python

• We need to consider how best to apply PCA to an image…
  – Is it possible to simply treat an image like a time series of stock data?
  – Can we recreate the original data (or some approximation thereof) with only a handful of eigenvectors and eigenvalues?
  – How do we plan to achieve a smaller data set that represents the original image while allowing for a smaller file size?
• Recall what PCA tells us about a data set
  – Fundamentally, PCA allows us to determine the overall hegemony of a matrix (remember that we are representing our image as a matrix)

PCA using Python

• We can’t apply PCA to the entire image…

Suppose we chose to apply PCA to only regions of the image. By inspection, we can make judgments about the hegemony of specific regions.

It’s important to understand the difference between the eigenvalues likely to emerge by performing PCA on the red region versus the blue region.

PCA using Python

• The image contained inside the red region was highly self-similar
• By comparison, the blue region was much less so
• Perhaps it doesn’t make sense to apply PCA to the entire image, but maybe it does make sense to apply it to sub-regions of a larger image
• If PCA applied to a region yields only one relevant eigenvalue, what judgment can be made about that region?
• If PCA applied to a region yields many relevant eigenvalues, what judgment can be made about that region?

PCA using Python

• We can’t apply PCA to the entire image…

Examine the small boxes used to break up the larger image. How many of them contain only a single color? (And what implication does this have for the eigenvalues if PCA were applied to only that box?)

PCA using Python

• If we can determine how self-similar a region of the image tends to be, we can decide how much of the image data should be kept and how much should be discarded
  – The more data discarded, the smaller the image file
• Enter: PCA
• We use PCA to determine the self-similarity of a region of the image
• We don’t use PCA to compress; we use PCA to determine whether we should compress
• In this example, “compression” will simply mean averaging the data in a region

PCA using Python

• Consider the code below…

Notice that we are not using PCA to perform the compression. We are using PCA to determine whether we should perform compression (a simple hue average).

import matplotlib.pyplot as plt
import numpy as np
import cv2

im = cv2.imread("homer.png")
homer = np.array(im)

# average the color channels into a 448x448 greyscale image with values in [0, 1]
homer_grey = np.zeros(448 ** 2).reshape(448, 448)
for i in range(448):
    for j in range(448):
        homer_grey[i, j] = np.mean(homer[i, j]) / 255

quality = 0.6
new_homer = np.zeros(homer_grey.shape)

# walk the image in 8x8 cells (448 / 8 = 56 cells per side)
for i in range(56):
    for j in range(56):
        cell = homer_grey[(j * 8):(j * 8) + 8, (i * 8):(i * 8) + 8]
        c = np.cov(cell)
        eig = np.linalg.eig(c)
        # fraction of the cell's variance carried by the largest eigenvalue;
        # a zero-variance (single-color) cell counts as fully self-similar
        total = eig[0].real.sum()
        test = eig[0].real.max() / total if total > 0 else 1.0
        if test > quality:
            # self-similar cell: "compress" it to its average hue
            new_homer[(j * 8):(j * 8) + 8, (i * 8):(i * 8) + 8] = \
                np.ones(64).reshape(8, 8) * np.mean(cell)
        else:
            # detailed cell: keep it as-is
            new_homer[(j * 8):(j * 8) + 8, (i * 8):(i * 8) + 8] = cell

plt.imshow(new_homer)
plt.show()

PCA using Python

[Images: compressed Homer at quality thresholds 0.9, 0.8, 0.7, 0.6, and 0.5, one slide each]

PCA using Python

fin