# Python Assignment | Final Project of MSDM5054

This assignment is a Python machine-learning classification project.

## Part I. Classification on 20newsgroups Data

Data info: the goal is to classify postings by topic based on their content. The dataset is a tiny version of the 20newsgroups data, with binary occurrence data for 100 keywords across 16242 postings.

The file “wordlist.txt” lists the 100 keywords. The file “documents.txt” encodes a 16242×100 occurrence matrix in which each row corresponds to one posting and each column corresponds to one keyword. The matrix has binary entries: the (i,j)-th entry is 1 if and only if the i-th posting contains the j-th keyword. Since the occurrence matrix is extremely sparse, “documents.txt” stores it in sparse form: each line represents one non-zero entry. For instance, the first line of “documents.txt” is “1 23 1”, which means that entry (1,23) of the occurrence matrix is 1, i.e., the 1st posting contains the 23rd keyword.

The file “newsgroup.txt” has 16242 lines, where the i-th line gives the group label of the i-th posting. There are 4 different groups, standing for “comp.”, “rec.”, “sci.” and “talk.” respectively. The goal is to predict the type of a posting, i.e., which of the 4 groups it belongs to, based on the words it contains.
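The triplet format above can be read directly into a sparse matrix. A minimal sketch, assuming the 1-indexed “i j 1” layout described above (the helper name `load_occurrence` is ours; in practice you would pass the real file path and the 16242×100 shape):

```python
import io

import numpy as np
from scipy.sparse import csr_matrix


def load_occurrence(source, n_docs, n_words):
    """Parse the sparse triplet format: each line is 'i j 1' with 1-indexed i, j."""
    ijv = np.loadtxt(source, dtype=int, ndmin=2)
    rows, cols = ijv[:, 0] - 1, ijv[:, 1] - 1  # convert to 0-indexed
    return csr_matrix((np.ones(len(ijv)), (rows, cols)), shape=(n_docs, n_words))


# In practice: X = load_occurrence("documents.txt", 16242, 100)
# Tiny in-memory demo mimicking the first lines of the file:
demo = io.StringIO("1 23 1\n1 5 1\n2 23 1\n")
X = load_occurrence(demo, 3, 100)
```

Labels and keywords can then be loaded with `np.loadtxt("newsgroup.txt", dtype=int)` and by splitting “wordlist.txt”, assuming the labels are coded as integers 1–4.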

1. Build a random forest for this dataset and report the 5-fold cross-validation value of the misclassification error. Note that you need to tune the model yourself, i.e., decide how many predictors are considered in each tree and how many trees are used. There is no benchmark; stop tuning when you feel it is appropriate. Report the best CV error, the corresponding confusion matrix, and the tuning parameters. What are the ten most important keywords based on variable importance?
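Since the project is in Python, one way to sketch item 1 is with scikit-learn's `RandomForestClassifier`, whose `n_estimators` and `max_features` parameters correspond to the number of trees and the number of predictors tried at each split. The synthetic `X` and `y` below are placeholders for the real data, and the specific parameter values are illustrative, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the 16242x100 binary matrix and 4-class labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100))
y = rng.integers(1, 5, size=500)

# Tune n_estimators (number of trees) and max_features (predictors per split).
rf = RandomForestClassifier(n_estimators=300, max_features=10, random_state=0)
pred = cross_val_predict(rf, X, y, cv=5)      # 5-fold CV predictions
cv_error = np.mean(pred != y)                 # misclassification error
cm = confusion_matrix(y, pred)

# Ten most important keywords by impurity-based variable importance:
rf.fit(X, y)
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
```

The indices in `top10` would then be mapped back to the words in “wordlist.txt”.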

2. Build a boosting tree for this dataset and report the 5-fold cross-validation value of the misclassification error. Similarly, report the best CV error, the corresponding confusion matrix, and the tuning parameters. Note that the R example in the textbook only considers binary classification, but the library ‘gbm’ can handle the multi-class case by setting ‘distribution = "multinomial"’.
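If you work in Python rather than R, a rough counterpart to ‘gbm’ is scikit-learn's `GradientBoostingClassifier`, which handles the multi-class case natively (no `distribution` argument is needed). Again the data below are synthetic placeholders, and `max_depth` plays roughly the role of gbm's interaction depth:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic placeholder for the real binary occurrence data and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 100))
y = rng.integers(1, 5, size=300)

# Tune n_estimators, learning_rate and max_depth.
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
pred = cross_val_predict(gbt, X, y, cv=5)
cv_error = np.mean(pred != y)
cm = confusion_matrix(y, pred)
```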

3. Compare the results from random forest and boosting trees.

4. Build a multi-class LDA classifier. Report the 5-fold CV misclassification error and the confusion matrix.
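A minimal Python sketch for item 4, using scikit-learn's `LinearDiscriminantAnalysis` on placeholder data (LDA supports more than two classes out of the box):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic placeholder for the real binary occurrence data and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100)).astype(float)
y = rng.integers(1, 5, size=500)

lda = LinearDiscriminantAnalysis()
pred = cross_val_predict(lda, X, y, cv=5)
cv_error = np.mean(pred != y)
cm = confusion_matrix(y, pred)
```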

5. Build a multi-class QDA classifier. Report the 5-fold CV misclassification error and the confusion matrix.
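For item 5, `QuadraticDiscriminantAnalysis` follows the same pattern. One caveat worth noting: with sparse binary features the per-class covariance estimates can be singular, so some regularization (`reg_param`, which shrinks each class covariance toward a diagonal matrix) may be needed; the value 0.1 below is only illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic placeholder for the real binary occurrence data and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100)).astype(float)
y = rng.integers(1, 5, size=500)

# reg_param guards against singular per-class covariance matrices.
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
pred = cross_val_predict(qda, X, y, cv=5)
cv_error = np.mean(pred != y)
cm = confusion_matrix(y, pred)
```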

6. Compare the performances of all the above methods and give your comments.