Python辅导 | 机器学习 Machine Learning and AI Homework
本次Python辅导主要是包含理论和代码实现的问题,第一个是AdaBoost算法理论的,第二个是涉及决策树(Decision Trees)、随机森林(Random Forests)、支持向量机(SVM)的。
This HW includes both theory and implementation problems. Please note,
The submission for this homework should be a single ‘zip‘ file containing all of the relevant code, figures, and any text explaining your results.
Problem 1 (15 points) Consider the AdaBoost algorithm we discussed in the class 1. Ad- aBoost is an example of ensemble classifiers where the weights in next round are decided based on the training error of the weak classifier learned on the current weighted training set. We wish to run the AdaBoost on the dataset provided in Table 1.
a) Assume we choose the following decision stump f1 (a shallow tree with a single decision node), as the first predictor (i.e., when training instances are weighted uniformly):
What would be the weight of f1 in final ensemble classifier (i.e., α1 in f(x) = ?Ki=1 αifi(x))?
b) After computing f1, we proceed to next round of AdaBoost. We begin by recomputing data weights depending on the error of f1 and whether a point was (mis)classified by f1. What is the weight of each instance in second boosting iteration, i.e., after the points have been re-weighted? Please note that the weights across the training set are to be uniformly
initialized.
c) In AdaBoost, would you stop the iteration if the error rate of the current weak classifier on the weighted training data is 0?
Problem 2 (40 points) For this problem, you will need to learn to use software libraries for at least two of the following non-linear classifier types:
All of these are available in scikit-learn, although you may also use other external libraries (e.g., XGBoost 2 for boosted decision trees and LibSVM for SVMs). You are welcome to implement learning algorithms for these classifiers yourself, but this is neither required nor recommended.
Pick two different types of non-linear classifiers from above for classification of Adult dataset. You can the download the data from a9a in libSVM data repository. The a9a data set comes with two files: the training data file a9a with 32,561 samples each with 123 features, and a9a.t with 16,281 test samples. Note that a9a data is in LibSVM for- mat. In this format, each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ….. This format is especially suitable for sparse datasets.
Note that scikit-learn includes utility functions (e.g., load svmlight file in example code below) for loading datasets in the LibSVM format.
For each of learning algorithms, you will need to set various hyperparameters (e.g., the type of kernel and regularization parameter for SVM, tree method, max depth, number of weak classifiers, etc for XGBoost, number of estimators and min impurity decrease for Random Forests). Often there are defaults that make a good starting point, but you may need to adjust at least some of them to get good performance. Use hold-out validation or K-fold cross-validation to do this (scikit-learn has nice features to accomplish this, e.g., you may use train test split to split data into train and test data and sklearn.model selection for K-fold cross validation). Do not make any hyperparameter choices (or any other similar choices) based on the test set! You should only compute the test error rates after you have settled on hyperparameter settings and trained your two final classifiers.
What to submit (in PDF file and Jupyter/python codes):
Please do your best to obtain the best achievable accuracy for each classifier on given dataset. Note: The amount of effort you put on tuning the parameters will be deter- mined based on the discrepancy between the accuracy you get and the best achievable accuracy on a9a data for each algorithm.
Problem 3 (45 points) For this problem, you will implement the k-means++ algorithm in Python. You will then use it to cluster Iris dataset from the UCI Machine Learning Repository. The data is contained the iris.data file under ”Data Folder”, while the file iris.names contains a description of the data. The features x are given as the first four comma-separated values in each row in the data file. The labels y are the last entry in each row, but you do NOT need the class label as it is unknown for clustering.
1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)
5. class: {Iris Setosa, Iris Versicolour, Iris Virginica}
You need to,