Python Assignment Help | Final Project of MSDM5054

This assignment is mainly a Python machine-learning classification project.

Part I. Classification on 20newsgroups Data
Data info: the goal is to classify postings by their content. The dataset is a tiny version of the 20newsgroups data, with binary occurrence data for 100 keywords across 16242 postings. The file “wordlist.txt” lists the 100 keywords. The file “documents.txt” encodes a 16242×100 occurrence matrix in which each row corresponds to one posting and each column corresponds to one keyword. The matrix has binary entries: the (i,j)-th entry is 1 if and only if the i-th posting contains the j-th keyword. Since the occurrence matrix is extremely sparse, “documents.txt” stores only its non-zero entries, one per line. For instance, the first line of “documents.txt” is “1 23 1”, which means that entry (1,23) of the occurrence matrix is 1, i.e., the 1st posting contains the 23rd keyword.

The file “newsgroup.txt” has 16242 lines, where the i-th line gives the group label of the i-th posting. There are 4 different groups, corresponding to “comp.”, “rec.”, “sci.”, and “talk.” respectively. The goal is to predict which of the 4 groups a posting belongs to, based on the keywords it contains.
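
For reference, here is a minimal Python sketch of how the files might be loaded. It assumes whitespace-delimited triplets in “documents.txt”, integer label codes in “newsgroup.txt”, and one keyword per line in “wordlist.txt”; adjust if the actual files differ.

    import numpy as np
    from scipy.sparse import coo_matrix

    # Each line of "documents.txt" is "row col value" with 1-based indices.
    triplets = np.loadtxt("documents.txt", dtype=int)
    rows, cols = triplets[:, 0] - 1, triplets[:, 1] - 1  # shift to 0-based
    X = coo_matrix((triplets[:, 2], (rows, cols)),
                   shape=(16242, 100)).toarray()

    # Group label of each posting; assumed to be an integer code per line
    # (use a string dtype instead if the file stores the group names).
    y = np.loadtxt("newsgroup.txt", dtype=int)

    # The 100 keywords, assumed one per line.
    keywords = open("wordlist.txt").read().split()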

1. Build a random forest for this dataset and report the 5-fold cross-validation estimate of the misclassification error. Note that you need to tune the model yourself, i.e., decide how many predictors are considered at each split and how many trees are used. There is no benchmark; stop tuning when you judge it appropriate. Report the best CV error, the corresponding confusion matrix, and the tuning parameters. What are the ten most important keywords according to variable importance?
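
One possible workflow in Python with scikit-learn, reusing X, y, and keywords from the loading sketch above; the grid values are illustrative assumptions, not recommended settings.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_predict
    from sklearn.metrics import confusion_matrix

    param_grid = {"n_estimators": [100, 300, 500],  # number of trees
                  "max_features": [5, 10, 20]}      # predictors tried per split
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("5-fold CV misclassification error:", 1 - search.best_score_)

    # Confusion matrix from cross-validated (out-of-sample) predictions.
    y_pred = cross_val_predict(search.best_estimator_, X, y, cv=5)
    print(confusion_matrix(y, y_pred))

    # Ten most important keywords by impurity-based variable importance.
    top10 = np.argsort(search.best_estimator_.feature_importances_)[::-1][:10]
    print([keywords[i] for i in top10])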

2. Build a boosting-tree model for this dataset and report the 5-fold cross-validation estimate of the misclassification error. As before, report the best CV error, the corresponding confusion matrix, and the tuning parameters. Note that the R example in the textbook only considers binary classification, but the library ‘gbm’ can handle the multi-class case by setting ‘distribution = "multinomial"’.
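
The ‘gbm’ note applies to R; if you work in Python instead, scikit-learn’s GradientBoostingClassifier handles the multi-class case directly, with no distribution argument needed. A minimal sketch under the same assumptions as above (grid values are placeholders to tune):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {"n_estimators": [100, 300],    # boosting iterations
                  "learning_rate": [0.05, 0.1],  # shrinkage
                  "max_depth": [2, 3]}           # depth of each tree
    search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("5-fold CV misclassification error:", 1 - search.best_score_)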

3. Compare the results from random forest and boosting trees.

4. Build a multi-class LDA classifier. Report the 5-fold CV misclassification error and the confusion matrix.
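
A minimal LDA sketch with scikit-learn, again reusing X and y from the loading code above:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=5)
    print("5-fold CV misclassification error:", np.mean(y_pred != y))
    print(confusion_matrix(y, y_pred))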

5. Build a multi-class QDA classifier. Report the 5-fold CV misclassification error and the confusion matrix.
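
The QDA sketch is analogous; note that with binary, sparse predictors the per-class covariance estimates can be near-singular, so some regularization may be needed (the reg_param value below is an assumption to tune, not a recommendation):

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_pred = cross_val_predict(QuadraticDiscriminantAnalysis(reg_param=0.1),
                               X, y, cv=5)
    print("5-fold CV misclassification error:", np.mean(y_pred != y))
    print(confusion_matrix(y, y_pred))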

6. Compare the performances of all the methods above and give your comments.