COMP309/AIML421 ML Tools and Techniques: Assignment 1
This is a shared case study of a programming assignment from New Zealand.
Objectives
This assignment involves trying out a variety of classification and clustering algorithms. It requires the use of Python, numpy, matplotlib, and scikit-learn, and serves as an introduction to all of those tools. You can run Python in a Jupyter notebook, but there are other options for doing the development.
1 Classification [70 marks]
This part of the assignment is to explore several classifiers in scikit-learn and, for each of them, investigate the hyperparameter that controls complexity. You will do this on three datasets by setting the hyperparameter to a range of plausible values and seeing how well the classifier does on "held-out" data. To do this you will need train_test_split from scikit-learn. To get better estimates, repeat the split 50 times with different random seeds (set the seeds so that the results are reproducible). For simplicity, use a 50:50 train:test split in all cases. For each setting of the hyperparameter, you then have a distribution of 50 different classification accuracies on the test set. A nice way to visualise these scores is a box plot in which the x-axis gives the values of the model's hyperparameter and the y-axis shows the spread of the classifier's accuracies.
Titles and axis labels are needed for clarity.
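The repeated-split procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the data come from make_classification as a stand-in for the OpenML datasets, and both the choice of n_neighbors as the hyperparameter and the list k_values are hypothetical, not the values required by the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): replace with one of the assignment datasets.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_repeats = 50
k_values = [1, 3, 5, 7, 9]  # hypothetical hyperparameter values

# scores[i, j] = test accuracy of split i with hyperparameter value j
scores = np.zeros((n_repeats, len(k_values)))
for seed in range(n_repeats):
    # A different seed per repeat gives a different, reproducible 50:50 split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    for j, k in enumerate(k_values):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train, y_train)
        scores[seed, j] = clf.score(X_test, y_test)

# One box per hyperparameter value (each column of scores is one box).
fig, ax = plt.subplots()
ax.boxplot(scores)
ax.set_xticks(range(1, len(k_values) + 1))
ax.set_xticklabels([str(k) for k in k_values])
ax.set_title("KNeighborsClassifier on stand-in data")
ax.set_xlabel("n_neighbors")
ax.set_ylabel("Test accuracy")
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, plt.show() or fig.savefig(...) would be needed.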
1.1 Machine Learning Models:
You will be trying out the following classifiers in scikit-learn:
(a) KNeighborsClassifier (K nearest neighbours)
(b) GaussianNB (the Gaussian form of Naive Bayes)
(c) DecisionTreeClassifier (a decision tree, DT)
(d) LogisticRegression (essentially, a perceptron)
(e) GradientBoostingClassifier (Gradient Boosted DTs)
(f) RandomForestClassifier (Random Forest)
(g) MLPClassifier (Neural Network)
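As a starting point, the seven estimators listed above can be collected in a dictionary so that later loops can iterate over them. The sketch below only shows where each class lives in scikit-learn and builds each one with its default settings; the dictionary names are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Name -> estimator class, in the order (a) to (g).
classifier_classes = {
    "KNeighborsClassifier": KNeighborsClassifier,
    "GaussianNB": GaussianNB,
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "LogisticRegression": LogisticRegression,
    "GradientBoostingClassifier": GradientBoostingClassifier,
    "RandomForestClassifier": RandomForestClassifier,
    "MLPClassifier": MLPClassifier,
}

# Instantiate each classifier with all hyperparameters left at their defaults.
models = {name: cls() for name, cls in classifier_classes.items()}
```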
1.2 Datasets:
You will need to test the models on the following three datasets (you can easily download them from OpenML).
1.3 Tasks:
You should:
(i) Build a 7-by-3 table (7 classifiers and 3 datasets) presenting the box plots of classifier accuracy versus parameter values (as in the example table below). It probably won't fit into one summary figure, so you should structure it as separate main plots, one per classifier, each consisting of several subplots for the different datasets.
You need to consider the parameter and the corresponding values for each classifier as follows. Other hyperparameters should be left as default.
(ii) Present two summary tables, with rows being classifiers, and columns being datasets.
Table (1) is to contain the best (lowest) mean value of the test errors.
Table (2) is to contain the best value for the hyperparameter.
(iii) Write a paragraph comparing and analysing the overall results captured in these two tables, including which model performs best and why, and how sensitive the models are to the complexity-control hyperparameter.
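One possible way to assemble the two summary tables is sketched below. It is a reduced illustration, not the required solution: it uses only two classifiers, two synthetic stand-in datasets, 10 repeats instead of 50, and hypothetical hyperparameter names and values (n_neighbors, max_depth). Test error is taken here to be 1 minus test accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in datasets (assumption): name -> (X, y).
datasets = {
    "synthetic_a": make_classification(n_samples=200, n_features=8, random_state=0),
    "synthetic_b": make_moons(n_samples=200, noise=0.3, random_state=0),
}
# name -> (estimator class, hyperparameter name, hypothetical values)
classifiers = {
    "KNeighborsClassifier": (KNeighborsClassifier, "n_neighbors", [1, 5, 9]),
    "DecisionTreeClassifier": (DecisionTreeClassifier, "max_depth", [1, 3, 5]),
}
n_repeats = 10  # the assignment asks for 50; fewer here to keep the sketch quick

best_error = {}  # Table (1): lowest mean test error
best_value = {}  # Table (2): hyperparameter value achieving it
for clf_name, (cls, param, values) in classifiers.items():
    for data_name, (X, y) in datasets.items():
        acc = np.zeros((n_repeats, len(values)))
        for seed in range(n_repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.5, random_state=seed
            )
            for j, v in enumerate(values):
                model = cls(**{param: v})
                acc[seed, j] = model.fit(X_tr, y_tr).score(X_te, y_te)
        mean_error = 1.0 - acc.mean(axis=0)  # mean test error per value
        j_best = int(np.argmin(mean_error))
        best_error[(clf_name, data_name)] = float(mean_error[j_best])
        best_value[(clf_name, data_name)] = values[j_best]

def print_table(title, table, fmt):
    """Print a table with classifiers as rows and datasets as columns."""
    print(title)
    print(f"{'classifier':<24}" + "".join(f"{d:>14}" for d in datasets))
    for c in classifiers:
        cells = "".join(f"{format(table[(c, d)], fmt):>14}" for d in datasets)
        print(f"{c:<24}" + cells)

print_table("Table (1): best mean test error", best_error, ".3f")
print_table("Table (2): best hyperparameter value", best_value, "d")
```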
2 Clustering [30 marks]
The scikit-learn library provides a suite of different clustering algorithms to choose from, each of which offers a different approach to discovering the natural groups in data.
In this part of the assignment, you will explore the characteristics of different clustering algorithms by testing them on three toy datasets. You will use the functions make_blobs(), make_classification() (with "n_clusters_per_class=1") and make_circles() (with "noise=0.3") to create three toy clustering datasets. Each of them will have 1,000 examples/instances, with two input features. Set the seed in the dataset generators to get reproducible results.
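A minimal sketch of generating the three toy datasets is given below. The seed value and variable names are arbitrary choices. Note that make_classification defaults to two redundant features, so with only two input features n_redundant has to be set to 0 (n_informative=2 is stated explicitly for clarity).

```python
from sklearn.datasets import make_blobs, make_classification, make_circles

SEED = 0          # any fixed seed gives reproducible datasets
N_SAMPLES = 1000  # 1,000 examples per dataset

# Isotropic Gaussian blobs (three centres by default), two features.
X_blobs, y_blobs = make_blobs(n_samples=N_SAMPLES, n_features=2, random_state=SEED)

# Two-feature classification problem with one cluster per class.
X_clf, y_clf = make_classification(
    n_samples=N_SAMPLES,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=SEED,
)

# Two noisy concentric circles.
X_circ, y_circ = make_circles(n_samples=N_SAMPLES, noise=0.3, random_state=SEED)
```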
2.1 Machine Learning Models:
You will be trying out the following clustering algorithms in scikit-learn:
(a) K-Means
(b) Affinity Propagation
(c) DBSCAN
(d) Gaussian Mixture Model
(e) BIRCH
(f) Agglomerative Clustering
(g) Mean Shift
2.2 Tasks:
You should:
(i) Fit the clustering models on the datasets and predict a cluster for each example in the datasets. Build a 7-by-3 table (7 clustering algorithms and 3 datasets) presenting the scatter plots of the clusters generated by each algorithm.
(ii) Write a paragraph comparing and analysing the results of the clustering algorithms on the three datasets, highlighting the characteristics of these algorithms.
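The 7-by-3 grid of scatter plots might be produced along the lines below. Everything beyond the algorithm and generator names is an assumption made for illustration: the number of clusters given to each model, the DBSCAN eps, the standardisation step, and the reduced sample size (300 points rather than 1,000, to keep the sketch quick).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import (
    KMeans, AffinityPropagation, DBSCAN, Birch, AgglomerativeClustering, MeanShift
)
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs, make_classification, make_circles
from sklearn.preprocessing import StandardScaler

SEED = 0
N = 300  # the assignment specifies 1,000; smaller here for speed

# name -> (X, assumed number of clusters)
datasets = {
    "blobs": (make_blobs(n_samples=N, random_state=SEED)[0], 3),
    "classification": (
        make_classification(n_samples=N, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1,
                            random_state=SEED)[0], 2),
    "circles": (make_circles(n_samples=N, noise=0.3, random_state=SEED)[0], 2),
}

def make_models(k):
    """Return the seven clustering models; settings are illustrative only."""
    return {
        "K-Means": KMeans(n_clusters=k, n_init=10, random_state=SEED),
        "Affinity Propagation": AffinityPropagation(random_state=SEED),
        "DBSCAN": DBSCAN(eps=0.3),
        "Gaussian Mixture Model": GaussianMixture(n_components=k, random_state=SEED),
        "BIRCH": Birch(n_clusters=k),
        "Agglomerative Clustering": AgglomerativeClustering(n_clusters=k),
        "Mean Shift": MeanShift(),
    }

fig, axes = plt.subplots(7, 3, figsize=(9, 21))
labels = {}
for col, (data_name, (X, k)) in enumerate(datasets.items()):
    # Standardise so that distance-based settings such as eps are comparable.
    X = StandardScaler().fit_transform(X)
    for row, (algo_name, model) in enumerate(make_models(k).items()):
        pred = model.fit_predict(X)  # one cluster label per example
        labels[(algo_name, data_name)] = pred
        ax = axes[row, col]
        ax.scatter(X[:, 0], X[:, 1], c=pred, s=5, cmap="tab10")
        ax.set_title(f"{algo_name} on {data_name}", fontsize=8)
        ax.set_xticks([])
        ax.set_yticks([])
fig.tight_layout()
```

DBSCAN marks noise points with the label -1, and Affinity Propagation may emit a convergence warning on some data; both are worth keeping in mind when interpreting the plots.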
Plagiarism: Plagiarism in programming (copying someone else's code) is just as serious as written plagiarism, and is treated accordingly. Make sure you explicitly write down where you got code from (and how much of it) if you use any resources aside from the course material. Using excessive amounts of others' code may result in the loss of marks, but plagiarism could result in zero marks!
Submission
You are required to submit a single .pdf report plus the Python code file (.ipynb or .py) through the web submission system on the COMP309/AIML421 course website by the due time. Provide a README.txt file if you use any non-standard Python libraries.