CS434 Machine Learning and Data Mining – Homework 4


1 Exercises: Decision Trees and Ensembles [5pts]
To get warmed up and reinforce what we’ve learned, we’ll do some light exercises with decision trees – how to interpret
a decision tree and how to learn one from data.
I Q1 Drawing Decision Tree Predictions [2pts]. Consider the following decision tree:
a) Draw the decision boundaries defined by this tree over the interval x1 ∈ [0, 30], x2 ∈ [0, 30]. Each leaf of
the tree is labeled with a letter. Write this letter in the corresponding region of input space.
b) Give another decision tree that is syntactically different (i.e., has a different structure) but defines the
same decision boundaries.
c) This demonstrates that the space of decision trees is syntactically redundant. How does this redundancy
influence learning – i.e., does it make it easier or harder to find an accurate tree?

I Q2 Manually Learning A Decision Tree [2pts]. Consider the following training set and learn a decision
tree to predict Y. Use information gain to select attributes for splits.

X1 X2 X3 | Y
 0  1  1 | 0
 1  1  1 | 0
 0  0  0 | 0
 1  1  0 | 1
 0  1  0 | 1
 1  0  1 | 1
For each candidate split include the information gain in your report. Also include the final tree and your
training accuracy.
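If you want to check your hand computations, information gain can be computed with a few lines of Python. The snippet below is a sketch for this purpose only (the attribute names X1–X3 are assumed; the assignment only names Y):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """Information gain from splitting labels ys on attribute values xs."""
    n = len(ys)
    conditional = sum(
        len(sub) / n * entropy(sub)
        for v in set(xs)
        for sub in [[y for x, y in zip(xs, ys) if x == v]]
    )
    return entropy(ys) - conditional

# The six training rows above; the last column is Y.
data = [(0, 1, 1, 0), (1, 1, 1, 0), (0, 0, 0, 0),
        (1, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1)]
ys = [row[-1] for row in data]
for i in range(3):
    xs = [row[i] for row in data]
    print(f"IG(X{i + 1}) = {information_gain(xs, ys):.4f}")
```

Note that with three positive and three negative labels, the entropy of Y before any split is exactly 1 bit.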
Now let’s consider building an ensemble of decision trees (also known as a random forest). We’ll specifically look at
how decreasing correlation between ensemble members can lead to further improvements from ensembling.
I Q3 Measuring Correlation in Random Forests [1pt]. We’ve provided a Python script that trains an ensemble of 15 decision trees on the Breast Cancer classification dataset we
used in HW1. We use the sklearn package for the decision tree implementation, as the point of this
exercise is to consider ensembling, not to implement decision trees. When run, the file displays the following plot:

The non-empty cells in the upper-triangle of the figure show the correlation between predictions on the test
set for each of 15 decision tree models trained on the same training set. Variations in the correlation are due
to randomly breaking ties when selecting split attributes. The plot also reports the average correlation (a
very high 0.984 for this ensemble) and accuracy for the ensemble (majority vote) and a separately-trained
single model. Even with the high correlation, the ensemble managed to improve performance marginally.
As discussed in class, uncorrelated errors result in better ensembles. Modify the code to train the following
ensembles (each separately). Provide the resulting plots for each and describe what you observe.
a) Apply bagging by uniformly sampling train datapoints with replacement to train each ensemble member.
b) The sklearn API for the DecisionTreeClassifier provides many options to modify how decision trees are
learned, including some of the techniques we discussed for increasing randomness. When set to a value less
than the number of features in the dataset, the max_features argument causes each split to consider only a
random subset of the features. Modify line 44 to include this option at a value you decide.
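The provided script isn’t reproduced here, so as a rough sketch of both modifications (the dataset loader, 15-tree count, max_features=5, and the correlation computation are assumptions, not the script’s actual code), an ensemble with bagging and random feature subsets might look like:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
preds = []
for t in range(15):
    # (a) Bagging: sample training points uniformly with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # (b) max_features < n_features: each split considers a random
    #     subset of the features (5 is an arbitrary example value).
    tree = DecisionTreeClassifier(max_features=5, random_state=t)
    tree.fit(X_tr[idx], y_tr[idx])
    preds.append(tree.predict(X_te))

preds = np.array(preds)
# Majority vote over the 15 members.
vote = (preds.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy:", (vote == y_te).mean())
# Average pairwise correlation between members' test predictions.
corrs = [np.corrcoef(preds[i], preds[j])[0, 1]
         for i in range(15) for j in range(i + 1, 15)]
print("average correlation:", np.mean(corrs))
```

With both sources of randomness enabled, you should expect the average pairwise correlation to drop well below the 0.984 reported for the unmodified ensemble.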