CSE 5243
Homework 3: Clustering
Objective:
In this lab, you will perform clustering on three datasets. You will develop a K-means
clustering algorithm, evaluate it on the three datasets, and compare its performance to
other off-the-shelf clustering algorithms.
The objectives of this assignment are:
1. Understand how to implement a clustering algorithm in Python.
2. Understand how to tune and evaluate a clustering algorithm to achieve good
performance.
3. Understand how to select and evaluate a suitable off-the-shelf clustering algorithm
based on the characteristics of a dataset and the outcomes you need.
The Datasets:
• The file small_Xydf.csv is a two-dimensional dataset with 100 records. It contains
columns X0, X1, and y. The y column is the actual cluster number that was produced by
the dataset generation algorithm. Do not use it for the clustering algorithm. It will be
used to evaluate your clustering algorithm below.
• The file large_Xydf.csv is a two-dimensional dataset with 1000 records. It contains
columns X0, X1, and y. The y column is the actual cluster number that was produced by
the dataset generation algorithm. Do not use it for the clustering algorithm. It will be
used to evaluate your clustering algorithm below.
• The Wine dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/)
is a more complex dataset. Use the winequality-red.csv file. Do not use the
“quality” attribute for the clustering algorithm. It will be used to evaluate your
clustering algorithm below.
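
As a minimal sketch of keeping the label column out of the clustering input, the example below loads a CSV with Pandas and separates the features from the labels. The helper name split_features_and_labels and the in-memory sample values are hypothetical and only stand in for the real files. For the Wine data, note that winequality-red.csv is semicolon-delimited, so it would be read with pd.read_csv(path, sep=";") and split on "quality" instead of "y".

```python
import io

import pandas as pd


def split_features_and_labels(df, label_col):
    """Return (features, labels); the label column is removed from the features."""
    labels = df[label_col]
    features = df.drop(columns=[label_col])
    return features, labels


# Tiny in-memory stand-in for small_Xydf.csv (values are made up for illustration).
sample_csv = io.StringIO("X0,X1,y\n1.0,2.0,0\n1.2,1.9,0\n8.0,9.1,1\n")
df = pd.read_csv(sample_csv)

# X is what the clustering algorithm sees; y_true is held back for evaluation only.
X, y_true = split_features_and_labels(df, "y")
print(list(X.columns))
```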
Approach:
1. Implement a K-Means Algorithm
Write a K-Means algorithm that takes an input dataset and a parameter K and returns an
output with two columns: the row ID and the computed cluster number for each input
record. Describe in detail (via comments, etc.) how your algorithm is supposed to work.
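
One possible shape for such a function is sketched below. It is a basic Lloyd-style iteration, not a reference solution: the function name kmeans_sketch, the random-record initialization, the max_iter and seed parameters, and the empty-cluster handling are all assumptions made for illustration. The output follows the two-column format described above (row ID and cluster number).

```python
import numpy as np
import pandas as pd


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Lloyd-style K-means; returns a DataFrame with columns row_id and cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centroids as k distinct, randomly chosen records.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every record to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments did not change, so the iteration has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return pd.DataFrame({"row_id": np.arange(len(X)), "cluster": labels})


# Usage on two well-separated groups of made-up points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [10.0, 10.0], [10.1, 9.9], [9.9, 10.2]])
result = kmeans_sketch(points, k=2)
print(result)
```

Because the result depends on the initial centroids, a tuned version would typically run several initializations and keep the one with the lowest SSE.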
2. Evaluate the Algorithm on the Small Dataset
A. Given that you know the true clusters (from column y in the original data), compute
the true cluster SSE, the overall SSE, and the between-cluster sum of squares SSB.
B. Run your algorithm for K=2, 3, 4. For each run, compute the SSE for each cluster, the
overall SSE, and the between-cluster sum of squares SSB.
C. For the K=3 case above:
1. Create a scatterplot, overlaying the true cluster with the cluster produced by your algorithm.
2. Create a cross tabulation matrix comparing the true and assigned clusters.
D. What do you observe or conclude from these experiments? Which is your “preferred”
clustering, and why? Support this with statistics and/or graphs.
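
The quantities requested in this section can be computed along the following lines. This sketch assumes the usual definitions: the SSE of a cluster is the sum of squared Euclidean distances from its members to its centroid, the overall SSE is the sum over clusters, and SSB is the size-weighted sum of squared distances from each centroid to the overall mean. The function name sse_ssb and the toy data are hypothetical. The cross tabulation uses pd.crosstab.

```python
import numpy as np
import pandas as pd


def sse_ssb(X, labels):
    """Return (per-cluster SSE dict, overall SSE, SSB) for a given labeling."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    per_cluster_sse = {}
    ssb = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Within-cluster sum of squared distances to the centroid.
        per_cluster_sse[c.item()] = float(((members - centroid) ** 2).sum())
        # Between-cluster term: cluster size times squared distance to the overall mean.
        ssb += len(members) * float(((centroid - overall_mean) ** 2).sum())
    return per_cluster_sse, sum(per_cluster_sse.values()), ssb


# Made-up data: four points, two true clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
y_true = np.array([0, 0, 1, 1])
y_assigned = np.array([1, 1, 0, 0])  # same grouping, different cluster numbers

per_cluster, total_sse, ssb = sse_ssb(X, y_true)
print(per_cluster, total_sse, ssb)  # {0: 2.0, 1: 2.0} 4.0 100.0

# Cross tabulation of true versus assigned clusters.
table = pd.crosstab(y_true, y_assigned, rownames=["true"], colnames=["assigned"])
print(table)
```

Cluster numbers are arbitrary, so a good clustering shows up in the cross tabulation as one dominant cell per row rather than as a diagonal. A useful sanity check is that SSE + SSB equals the total sum of squares around the overall mean (104 in this toy example).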
3. Evaluate the Algorithm on the Large Dataset
A. Given that you know the true clusters (from column y in the original data), compute
the true cluster SSE, the overall SSE, and the between-cluster sum of squares SSB.
B. Run your algorithm for K=4, 6, 8. For each run, compute the SSE for each cluster, the
overall SSE, and the between-cluster sum of squares SSB.
C. For the K=6 case above:
1. Create a scatterplot, overlaying the true cluster with the cluster produced by your algorithm.
2. Create a cross tabulation matrix comparing the true and assigned clusters.
D. What do you observe or conclude from these experiments? Which is your “preferred”
clustering, and why? Support this with statistics and/or graphs.
4. Evaluate the Algorithm on the Wine Dataset
A. Given that you know the true clusters (from the ‘quality’ column in the original data),
compute the true cluster SSE, the overall SSE, and the between-cluster sum of squares
SSB.
B. Run your algorithm for K=4, 6, 8. For each run, compute the SSE for each cluster, the
overall SSE, and the between-cluster sum of squares SSB.
C. For the K=6 case above:
1. Create a scatterplot, overlaying the true cluster with the cluster produced by your algorithm.
2. Create a cross tabulation matrix comparing the true and assigned clusters (if
possible – due to size constraints).
D. What do you observe or conclude from these experiments? Which is your “preferred”
clustering, and why? Support this with statistics and/or graphs.
5. Compare Your Algorithm with Off-The-Shelf K-Means Algorithm
A. Run the off-the-shelf K-Means algorithm from the SciKit Learn library (see References
below) and evaluate its performance.
B. Compare the results to the results of your coded K-Means algorithm. Does the
off-the-shelf version perform better or worse, faster or slower? Why might that be the case?
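
A minimal sketch of running and timing the scikit-learn implementation is shown below. The synthetic data, the parameter values (n_clusters=2, n_init=10, random_state=0), and the use of time.perf_counter for timing are assumptions chosen for illustration. The fitted model's inertia_ attribute is the sum of squared distances from each sample to its closest cluster center, so it can be compared directly with an overall SSE.

```python
import time

import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two Gaussian blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

start = time.perf_counter()
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
elapsed = time.perf_counter() - start

labels = model.labels_  # cluster number assigned to each record
sse = model.inertia_    # overall SSE of the fitted clustering
print(f"overall SSE = {sse:.3f}, fit time = {elapsed:.4f} s")
```

Timing a single run is noisy; repeating the measurement several times for both implementations gives a fairer speed comparison.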
6. Comparison of the K-Means Algorithms with Two Other Clustering Algorithms
Choose two other clustering algorithms from the SciKit Learn library and run them on the
dataset.
A. Why did you choose these specific algorithms? What characteristics of the data might
impact the relative performance?
B. Compare the performance of these algorithms to the K-Means code you wrote and to
the off-the-shelf K-Means algorithm.
C. Choose one of the clustering algorithms as best, and explain why.
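
For illustration only, the sketch below runs two other algorithms from sklearn.cluster, AgglomerativeClustering and DBSCAN, on synthetic data. The choice of these two algorithms and the parameter values (linkage="ward", eps=1.0, min_samples=5) are assumptions, not recommendations; suitable choices depend on the shape, density, and scale of the actual dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Made-up data: two compact, well-separated blobs of 40 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])

# Hierarchical clustering: the number of clusters is given up front, as in K-means.
agg_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Density-based clustering: the number of clusters is not specified;
# records labeled -1 are treated as noise.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print("agglomerative clusters:", sorted(set(agg_labels.tolist())))
print("DBSCAN clusters (excluding noise):", sorted(set(db_labels.tolist()) - {-1}))
```

Because DBSCAN may label some records as noise, decide how to treat the -1 label before computing SSE or a cross tabulation against the true clusters.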
7. Takeaways
Write a paragraph on what you discovered or learned from this homework.
Collaboration:
For this assignment, you should work as an individual. You may informally discuss ideas with
What you need to turn in:
1) Code
A. Do this work in Python.
B. You may use common Python libraries for I/O, data manipulation, data visualization,
etc. (e.g., NumPy, Pandas, Matplotlib, Seaborn, …)
C. Unless explicitly permitted in the assignment, you may not use library operations that
perform, in effect, the “core” computations for this homework (e.g., if the
assignment is to write a K-Means algorithm, you may not use a library operation that,
in effect, does the core work needed to implement a K-Means algorithm.).
D. The code must be written by you, and any significant code snippets you found on the
Internet and used to understand how to do your coding for the core functionality
must be attributed. (You do not need to attribute basic functionality – matrix
operations, IO, etc.)
E. The code must be commented sufficiently to allow a reader to understand the
algorithm without reading the actual Python, step by step.
F. When in doubt, ask the teaching assistant or instructor.
2) Written Report
A. The report should be well-written. Please proofread and remove spelling and
grammar errors and typos.
B. The report should discuss your analysis and observations. Present charts and graphs
to support your observations. If you performed any data processing, cleaning, etc.,
please discuss it within the report.
C. The written report must be either a Python notebook or a PDF document.
1. Overall readability and organization of your report (10%) – Is it well organized and
does the presentation flow in a logical manner; are there many grammar and spelling
mistakes; do the charts/graphs relate to the text, etc.
2. Implementation of your K-Means algorithm (15%) – Is your algorithm design and
coding correct? Is it well documented? Have you made an effort to tune it for good
performance?
3. Evaluation of your K-Means algorithm on the Small, Large, and Wine datasets (5%,
10%, 15% respectively) – Is the evaluation sound?
4. Comparison of your K-Means algorithm with Off-The-Shelf K-Means algorithm (20%)
– Is the execution and comparison sound? Did you document any insights based on
the comparison?
5. Comparison of the K-Means algorithms with two other clustering algorithms (20%) –
Is the execution and comparison sound? Did you choose a specific clustering
algorithm as best and explain why?
6. Takeaways (5%) – Did you document your overall insights?
How to turn in your work on Carmen:
Submit to Carmen any code that you used to process and analyze this data. You do not need
to include the input data. All the related files (code and/or report) except for the data
should be archived in a *.zip file or *.tgz file, and submitted via Carmen. Use this naming
convention:
• Assignment3_Surname_DotNumber.zip or
• Assignment3_Surname_DotNumber.tgz
The submitted file should be less than 5MB.
References and Acknowledgements:
1. Dr. Jason Van Hulse
2. Dr. Ping Zhang
3. UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/)
4. SciKit-Learn (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster)