# CSE 5243 Homework 3: Clustering

This homework uses Python to perform clustering analysis on three datasets.


Objective:

In this lab, you will perform clustering on three datasets. You will develop a K-means clustering algorithm, evaluate it on the three datasets, and compare its performance to other off-the-shelf clustering algorithms.

The objectives of this assignment are:

1. Understand how to implement a clustering algorithm in Python.

2. Understand how to tune and evaluate a clustering algorithm to achieve good performance.

3. Understand how to select and evaluate suitable off-the-shelf clustering algorithms based on the characteristics of a dataset and the outcomes you need.

The Datasets:

The file small_Xydf.csv is a two-dimensional dataset with 100 records. It contains columns X0, X1, and y. The y column is the actual cluster number that was produced by the dataset generation algorithm. Do not use it for the clustering algorithm. It will be used to evaluate your clustering algorithm below.

The file large_Xydf.csv is a two-dimensional dataset with 1000 records. It contains columns X0, X1, and y. The y column is the actual cluster number that was produced by the dataset generation algorithm. Do not use it for the clustering algorithm. It will be used to evaluate your clustering algorithm below.

The Wine dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) is a more complex dataset. Use the winequality-red.csv file. Do not use the “quality” attribute for the clustering algorithm. It will be used to evaluate your clustering algorithm below.
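One practical detail worth noting before you start: the UCI wine-quality files are semicolon-delimited, so reading them with pandas defaults (comma separator) collapses each row into a single column. A minimal sketch, using an inline two-row excerpt in place of the real file:

```python
import io
import pandas as pd

# A tiny two-row excerpt in the UCI wine-quality format (semicolon-delimited).
sample = "fixed acidity;volatile acidity;quality\n7.4;0.70;5\n7.8;0.88;5\n"

# sep=";" is required here; pd.read_csv defaults to sep=",".
wine = pd.read_csv(io.StringIO(sample), sep=";")

# Set the label column aside; cluster only on the feature columns.
X = wine.drop(columns="quality")
print(X.columns.tolist())
```

The same pattern (read, then drop the label column) applies to the y column of the two synthetic datasets.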


Approach:

1. Implement a K-Means Algorithm

Write a K-Means algorithm that takes an input dataset and a parameter K and returns an output with two columns: the row ID and the computed cluster number for each input record. Describe in detail (via comments, etc.) how your algorithm is supposed to work.

2. Evaluate the Algorithm on the Small Dataset

A. Given that you know the true clusters (from column y in the original data), compute the true cluster SSE, the overall SSE, and the between-cluster sum of squares SSB.

B. Run your algorithm for K=2, 3, 4. For each run, compute the SSE for each cluster, the overall SSE, and the between-cluster sum of squares SSB.

C. For the K=3 case above:

1. Create a scatterplot, overlaying the true cluster with the cluster produced by your algorithm.

2. Create a cross tabulation matrix comparing the true and assigned clusters.

D. What do you observe or conclude from these experiments? Which is your “preferred” clustering, and why? Support this with statistics and/or graphs.
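For reference, the quantities in 2.A and 2.B follow directly from the labels: a cluster's SSE is the sum of squared distances from its points to the cluster centroid, and SSB weights each centroid's squared distance from the grand mean by the cluster size. A sketch (the function name and interface here are illustrative, not required):

```python
import numpy as np

def sse_ssb(X, labels):
    """Return (per-cluster SSE dict, overall SSE, between-cluster SSB)."""
    grand = X.mean(axis=0)          # grand mean of the whole dataset
    per_cluster, ssb = {}, 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        c = pts.mean(axis=0)        # centroid of cluster j
        per_cluster[j] = float(((pts - c) ** 2).sum())
        ssb += len(pts) * float(((c - grand) ** 2).sum())
    return per_cluster, sum(per_cluster.values()), ssb
```

A useful sanity check: overall SSE plus SSB equals the total sum of squares of the data. For the cross tabulation in 2.C, pandas' `crosstab` applied to the true and assigned label vectors produces the matrix directly.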

3. Evaluate the Algorithm on the Large Dataset

A. Given that you know the true clusters (from column y in the original data), compute the true cluster SSE, the overall SSE, and the between-cluster sum of squares SSB.

B. Run your algorithm for K=4, 6, 8. For each run, compute the SSE for each cluster, the overall SSE, and the between-cluster sum of squares SSB.

C. For the K=6 case above:

1. Create a scatterplot, overlaying the true cluster with the cluster produced by your algorithm.

2. Create a cross tabulation matrix comparing the true and assigned clusters.

D. What do you observe or conclude from these experiments? Which is your “preferred” clustering, and why? Support this with statistics and/or graphs.

4. Evaluate the Algorithm on the Wine Dataset

A. Given that you know the true clusters (from the ‘quality’ column in the original data), compute the true cluster SSE, the overall SSE, and the between-cluster sum of squares SSB.

B. Run your algorithm for K=4, 6, 8. For each run, compute the SSE for each cluster, the overall SSE, and the between-cluster sum of squares SSB.

C. For the K=6 case above:

1. Create a scatterplot, overlaying the true cluster with the cluster produced by your algorithm.

2. Create a cross tabulation matrix comparing the true and assigned clusters (if possible – due to size constraints).

D. What do you observe or conclude from these experiments? Which is your “preferred” clustering, and why? Support this with statistics and/or graphs.

5. Compare Your Algorithm with Off-The-Shelf K-Means Algorithm

A. Run the off-the-shelf K-Means algorithm from the SciKit-Learn library (see References below) and evaluate its performance.

B. Compare the results to the results of your coded K-Means algorithm. Does it perform better or worse, faster or slower? Why might that be the case?
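For the off-the-shelf run, `sklearn.cluster.KMeans` exposes the overall SSE of the fitted clustering as `inertia_`, which makes the numeric comparison with your own SSE direct. A sketch on synthetic blobs standing in for one of the course datasets:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (placeholder for a course dataset).
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

start = time.perf_counter()
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
elapsed = time.perf_counter() - start

print(km.inertia_)   # overall SSE of the fitted clustering
print(elapsed)       # wall-clock fit time, for the faster/slower comparison
```

Timing your own implementation with the same `perf_counter` pattern keeps the speed comparison fair.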

6. Comparison of the K-Means Algorithms with Two Other Clustering Algorithms

Choose two other clustering algorithms from the SciKit-Learn library and run them on the dataset.

A. Why did you choose these specific algorithms? What characteristics of the data might impact the relative performance?

B. Compare the performance of these algorithms to the K-Means code you wrote and to the off-the-shelf K-Means algorithm.

C. Choose one of the clustering algorithms as best, and explain why.
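Purely as an illustration (the assignment asks you to justify your own choices), two common picks are `AgglomerativeClustering` and `DBSCAN`, and `adjusted_rand_score` compares any two labelings independently of how clusters are numbered. The `eps` and `min_samples` values below are tuned to this toy data, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Two tight synthetic blobs with known true labels.
X = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(4, 0.2, (40, 2))])
y_true = np.repeat([0, 1], 40)

agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # -1 marks noise

# Agreement with the true labels, invariant to cluster renumbering.
print(adjusted_rand_score(y_true, agg_labels))
print(adjusted_rand_score(y_true, db_labels))
```

Note the asymmetry this exposes: DBSCAN does not take K at all, which matters for datasets where the number of clusters is unknown.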

7. Takeaways

Write a paragraph on what you discovered or learned from this homework.

Collaboration:

For this assignment, you should work as an individual. You may informally discuss ideas with classmates, to get advice on general Python usage, etc., but your work should be your own. Please make use of Piazza!

What you need to turn in:

1) Code

A. Do this work in Python.

B. You may use common Python libraries for I/O, data manipulation, data visualization, etc. (e.g., NumPy, Pandas, Matplotlib, Seaborn, …)

C. Unless explicitly permitted in the assignment, you may not use library operations that perform, in effect, the “core” computations for this homework (e.g., if the assignment is to write a K-Means algorithm, you may not use a library operation that, in effect, does the core work needed to implement a K-Means algorithm).

D. The code must be written by you, and any significant code snippets you found on the Internet and used to understand how to do your coding for the core functionality must be attributed. (You do not need to attribute basic functionality – matrix operations, I/O, etc.)

E. The code must be commented sufficiently to allow a reader to understand the algorithm, step by step, without reading the actual Python.

F. When in doubt, ask the teaching assistant or instructor.

2) Written Report

A. The report should be well-written. Please proof-read and remove spelling and grammar errors and typos.

B. The report should discuss your analysis and observations. Present charts and graphs to support your observations. If you performed any data processing, cleaning, etc., please discuss it within the report.

C. The written report must be in the form of a Python Notebook or as a PDF Document.

Grading Criteria:

1. Overall readability and organization of your report (10%) – Is it well organized and does the presentation flow in a logical manner; are there many grammar and spelling mistakes; do the charts/graphs relate to the text; etc.

2. Implementation of your K-Means algorithm (15%) – Is your algorithm design and coding correct? Is it well documented? Have you made an effort to tune it for good performance?

3. Evaluation of your K-Means algorithm on the Small, Large, and Wine datasets (5%, 10%, 15% respectively) – Is the evaluation sound?

4. Comparison of your K-Means algorithm with the off-the-shelf K-Means algorithm (20%) – Is the execution and comparison sound? Did you document any insights based on the comparison?

5. Comparison of the K-Means algorithms with two other clustering algorithms (20%) – Is the execution and comparison sound? Did you choose a specific clustering algorithm as best and explain why?

6. Takeaways (5%) – Did you document your overall insights?

How to turn in your work on Carmen:

Submit to Carmen any code that you used to process and analyze this data. You do not need to include the input data. All the related files (code and/or report) except for the data should be archived in a *.zip file or *.tgz file, and submitted via Carmen. Use this naming convention:


Assignment3_Surname_DotNumber.zip or

Assignment3_Surname_DotNumber.tgz

The submitted file should be less than 5MB.

References and Acknowledgements:

1. Dr. Jason Van Hulse

2. Dr. Ping Zhang

3. UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/)

4. SciKit-Learn (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster)