Python Assignment Help | Final Project of MSDM5054

This assignment is mainly a Python machine-learning classification project.

Part I. Classification on 20newsgroups Data
Data info: the goal is to classify postings by their content. The dataset is a tiny version of the 20newsgroups data, with binary occurrence data for 100 keywords across 16242 postings. The file “wordlist.txt” lists the 100 keywords. The file “documents.txt” encodes a 16242×100 occurrence matrix in which each row corresponds to one posting and each column corresponds to one keyword. The matrix has binary entries: the (i,j)-th entry is 1 if and only if the i-th posting contains the j-th keyword. Since the occurrence matrix is extremely sparse, “documents.txt” stores only its non-zero entries, one per line. For instance, the first line of “documents.txt” is “1 23 1”, which means that entry (1,23) of the occurrence matrix is 1, i.e., the 1st posting contains the 23rd keyword.

The file “newsgroup.txt” has 16242 lines, where the i-th line gives the group label of the i-th posting. There are 4 different groups, corresponding to “comp.”, “rec.”, “sci.”, and “talk.” respectively. The goal is to predict which of the 4 groups a posting belongs to, based on the keywords it contains.
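
For reference, here is a minimal Python sketch of how the files might be loaded. It assumes whitespace-delimited triplets in “documents.txt”, integer label codes in “newsgroup.txt”, and one keyword per line in “wordlist.txt”; adjust if the actual files differ.

    import numpy as np
    from scipy.sparse import coo_matrix

    # Each line of "documents.txt" is "row col value" with 1-based indices.
    triplets = np.loadtxt("documents.txt", dtype=int)
    rows, cols = triplets[:, 0] - 1, triplets[:, 1] - 1  # shift to 0-based
    X = coo_matrix((triplets[:, 2], (rows, cols)),
                   shape=(16242, 100)).toarray()

    # Group label of each posting; assumed to be an integer code per line
    # (use a string dtype instead if the file stores the group names).
    y = np.loadtxt("newsgroup.txt", dtype=int)

    # The 100 keywords, assumed one per line.
    keywords = open("wordlist.txt").read().split()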

1. Build a random forest for this dataset and report the 5-fold cross-validation estimate of the misclassification error. Note that you need to tune the model yourself, i.e., decide how many predictors are considered at each split and how many trees are used. There is no benchmark; stop tuning when you judge it appropriate. Report the best CV error, the corresponding confusion matrix, and the tuning parameters. What are the ten most important keywords according to variable importance?
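
One possible workflow in Python with scikit-learn, reusing X, y, and keywords from the loading sketch above; the grid values are illustrative assumptions, not recommended settings.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_predict
    from sklearn.metrics import confusion_matrix

    param_grid = {"n_estimators": [100, 300, 500],  # number of trees
                  "max_features": [5, 10, 20]}      # predictors tried per split
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("5-fold CV misclassification error:", 1 - search.best_score_)

    # Confusion matrix from cross-validated (out-of-sample) predictions.
    y_pred = cross_val_predict(search.best_estimator_, X, y, cv=5)
    print(confusion_matrix(y, y_pred))

    # Ten most important keywords by impurity-based variable importance.
    top10 = np.argsort(search.best_estimator_.feature_importances_)[::-1][:10]
    print([keywords[i] for i in top10])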

2. Build a boosting-tree model for this dataset and report the 5-fold cross-validation estimate of the misclassification error. As before, report the best CV error, the corresponding confusion matrix, and the tuning parameters. Note that the R example in the textbook only considers binary classification, but the library ‘gbm’ can handle the multi-class case by setting ‘distribution = "multinomial"’.
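
The ‘gbm’ note applies to R; if you work in Python instead, scikit-learn’s GradientBoostingClassifier handles the multi-class case directly, with no distribution argument needed. A minimal sketch under the same assumptions as above (grid values are placeholders to tune):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {"n_estimators": [100, 300],    # boosting iterations
                  "learning_rate": [0.05, 0.1],  # shrinkage
                  "max_depth": [2, 3]}           # depth of each tree
    search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("5-fold CV misclassification error:", 1 - search.best_score_)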

3. Compare the results from random forest and boosting trees.

4. Build a multi-class LDA classifier. Report the 5-fold CV misclassification error and the confusion matrix.
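
A minimal LDA sketch with scikit-learn, again reusing X and y from the loading code above:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=5)
    print("5-fold CV misclassification error:", np.mean(y_pred != y))
    print(confusion_matrix(y, y_pred))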

5. Build a multi-class QDA classifier. Report the 5-fold CV misclassification error and the confusion matrix.
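
The QDA sketch is analogous; note that with binary, sparse predictors the per-class covariance estimates can be near-singular, so some regularization may be needed (the reg_param value below is an assumption to tune, not a recommendation):

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_pred = cross_val_predict(QuadraticDiscriminantAnalysis(reg_param=0.1),
                               X, y, cv=5)
    print("5-fold CV misclassification error:", np.mean(y_pred != y))
    print(confusion_matrix(y, y_pred))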

6. Compare the performances of all the methods above and give your comments.