Fundamentals of Machine Learning: Exercise 8


First load the dataset and import scikit-learn’s decomposition module:

import math
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn import decomposition

digits = load_digits()
X = digits["data"] / 255.
Y = digits["target"]

Use the decomposition module to compare non-negative matrix factorization (NMF) with singular
value decomposition (SVD, np.linalg.svd) on the digits dataset where the methods factorize X
(the matrix of flattened digit images) in the following way:

X = Z · H          (NMF)  (1)
X = U · S · V^T    (SVD)  (2)

Here X, Z and H are non-negative: if X ∈ R_{≥0}^{N×D} and your number of latent components is M, then Z ∈ R_{≥0}^{N×M} and H ∈ R_{≥0}^{M×D}. Run SVD with full rank and then select the 6 rows of V^T corresponding to the largest singular values. Use at least 10 components for NMF. Note that you must use centered data for SVD (but not for NMF, of course) and add the mean back to the basis vectors. Reshape the selected basis vectors from H and V^T into 2D images and plot them. One can interpret these images as a basis for the vector space spanned by the digit dataset. Compare the bases resulting from SVD and NMF and comment on interesting observations.
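A minimal sketch of this comparison, assuming the imports from above (the choice of 10 NMF components, the 2×6 figure layout and the grayscale colormap are our own, not prescribed by the exercise):

M = 10  # at least 10 components for NMF
nmf = decomposition.NMF(n_components=M, max_iter=1000)
Z = nmf.fit_transform(X)    # Z >= 0, shape (N, M)
H = nmf.components_         # H >= 0, shape (M, D)

# SVD needs centered data; add the mean back to the basis vectors
mean = X.mean(axis=0)
U, S, VT = np.linalg.svd(X - mean)
svd_basis = VT[:6] + mean   # rows for the 6 largest singular values

# reshape the basis vectors into 8x8 digit images and plot them
fig, axes = plt.subplots(2, 6, figsize=(12, 4))
for i in range(6):
    axes[0, i].imshow(H[i].reshape(8, 8), cmap='gray')
    axes[0, i].set_title(f'NMF {i}')
    axes[0, i].axis('off')
    axes[1, i].imshow(svd_basis[i].reshape(8, 8), cmap='gray')
    axes[1, i].set_title(f'SVD {i}')
    axes[1, i].axis('off')
plt.show()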

We learned in the lecture that the NMF can be found by alternating multiplicative updates of the form

H_{t+1} = H_t ∘ (Z_t^T X) / (Z_t^T Z_t H_t)
Z_{t+1} = Z_t ∘ (X H_{t+1}^T) / (Z_t H_{t+1} H_{t+1}^T)

Numerators and denominators of the fractions are matrix multiplications, whereas the divisions and the multiplicative updates (∘) must be executed element-wise. Implement a function non_negative(data, num_components) that calculates a non-negative matrix factorization with these updates, where num_components is the desired number of features M after decomposition. Initialize Z_0 and H_0 positively, e.g. by taking the absolute value of standard normal random variables (RVs) with np.random.randn. Iterate until reasonable convergence, e.g. for t = 1000 steps. Note that you might have to ensure numerical stability by avoiding division by zero. You can achieve this by clipping denominators at a small positive value with np.clip. Run your code on the digits data, plot the resulting basis vectors and compare with the NMF results from scikit-learn (results should be similar). Can you confirm that the squared loss ||X − Z_t · H_t||_2^2 is non-increasing as a function of t?
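A sketch of such an implementation, assuming the multiplicative updates shown above (the default step count, the clipping threshold eps and the extra losses return value for checking monotonicity are our own additions):

def non_negative(data, num_components, steps=1000, eps=1e-9):
    """NMF via multiplicative updates; returns Z (N x M), H (M x D) and the loss per step."""
    N, D = data.shape
    # positive initialization: absolute values of standard normal RVs
    Z = np.abs(np.random.randn(N, num_components))
    H = np.abs(np.random.randn(num_components, D))
    losses = []
    for t in range(steps):
        # numerators/denominators are matrix products; divisions and updates are element-wise;
        # clip the denominators at a small positive value to avoid division by zero
        H *= (Z.T @ data) / np.clip(Z.T @ Z @ H, eps, None)
        Z *= (data @ H.T) / np.clip(Z @ H @ H.T, eps, None)
        losses.append(np.sum((data - Z @ H) ** 2))
    return Z, H, losses

Z, H, losses = non_negative(X, 10)
plt.plot(losses)  # the squared loss should be non-increasing in t
plt.show()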

2 Recommender system (12 Points)
Use your code to implement a recommendation system. We will use the movielens-100k dataset with pandas, which you can download as an “external link” on MaMPF.

import pandas as pd  # install pandas via conda

# column headers for the dataset
ratings_cols = ['user id', 'movie id', 'rating', 'timestamp']
movies_cols = ['movie id', 'movie title', 'release date',
               'video release date', 'IMDb URL', 'unknown', 'Action',
               'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime',
               'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
               'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
               'War', 'Western']
users_cols = ['user id', 'age', 'gender', 'occupation', 'zip code']

users = pd.read_csv('ml-100k/u.user', sep='|',
                    names=users_cols, encoding='latin-1')

movies = pd.read_csv('ml-100k/u.item', sep='|',
                     names=movies_cols, encoding='latin-1')

ratings = pd.read_csv('ml-100k/u.data', sep='\t',
                      names=ratings_cols, encoding='latin-1')

# peek at the dataframes, if you like :)
users.head()
movies.head()
ratings.head()

# create a joint ratings dataframe for the matrix
fill_value = 0
rat_df = ratings.pivot(index='user id', columns='movie id',
                       values='rating').fillna(fill_value)
rat_df.head()
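One possible way to turn the factorization into recommendations, assuming the non_negative function from the previous part (the 20 components, the example user and the top-10 cut-off are illustrative choices):

R = rat_df.values  # users x movies; zeros mark unrated entries

# low-rank completion: Z @ H gives a dense prediction for every (user, movie) pair
Z, H, _ = non_negative(R, 20)
pred = Z @ H

# recommend the highest-predicted movies the user has not rated yet
user = 0
unrated = R[user] == fill_value
ranked = np.argsort(pred[user])[::-1]
top = [m for m in ranked if unrated[m]][:10]

top_ids = rat_df.columns[top]  # column positions -> movie ids
print(movies.set_index('movie id').loc[top_ids, 'movie title'])

Note that filling unrated entries with 0 makes the factorization treat them as explicit low ratings; this is a simple but crude choice, and you may want to discuss its effect on the recommendations.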