Machine Learning Assignment Help | Naive Bayes Classifier
In this assignment you will implement the Naive Bayes Classifier. Before starting, make sure you understand the concepts discussed in the Week 2 videos about Naive Bayes. You may also find it useful to read Chapter 1 of the textbook.
Also, make sure that you are familiar with the numpy.ndarray class of Python’s numpy library and that you are able to answer the following questions:
Let’s assume a is a numpy array.
You can answer all of these questions by
The UC Irvine machine learning data repository hosts a famous dataset, the Pima Indians dataset, on whether a patient has diabetes. It was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find it at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The data contains a set of patient attributes and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable. It has a total of 768 data points.
Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.
Report the accuracy of the classifier on the held-out 20%.
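As a minimal sketch of the accuracy metric described above (correct predictions as a fraction of total predictions), the value can be computed as the mean of an element-wise comparison. The function name `accuracy` and the toy arrays below are made up for illustration and are not part of the assignment.

```python
import numpy as np

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    predictions = np.asarray(predictions).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    # The boolean comparison is averaged as 0/1 values.
    return float(np.mean(predictions == labels))

# Hypothetical toy values, not taken from the dataset.
eval_labels = np.array([0, 1, 1, 0, 1])
predictions = np.array([0, 1, 0, 0, 1])
acc = accuracy(predictions, eval_labels)
```

Flattening both arrays first avoids accidental broadcasting when one of them is a column vector and the other is one-dimensional.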
UC Irvine’s Machine Learning Repository hosts a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), which is also available on Kaggle. The data was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito.
You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle website offers valuable visualizations of the original data dimensions in its dashboard. It is worth taking the time to make sense of the data using the dashboard before applying any method to it.
First, we will shuffle the data completely, and forget about the order in the original csv file.
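One possible way to do the shuffling is with a random permutation of the row indices, applied to features and labels together so they stay aligned. This is only a sketch: the helper name `shuffle_and_split`, the fixed seed, and the 80/20 split done in the same step are assumptions for illustration, and the toy arrays stand in for the real data.

```python
import numpy as np

def shuffle_and_split(features, labels, eval_fraction=0.2, seed=0):
    """Shuffle rows, then hold out the last `eval_fraction` for evaluation."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    order = rng.permutation(n)               # random ordering of row indices
    features, labels = features[order], labels[order]
    n_train = n - int(round(eval_fraction * n))
    return (features[:n_train], labels[:n_train],
            features[n_train:], labels[n_train:])

# Hypothetical toy data: 10 samples, 2 features each.
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10) % 2
train_x, train_y, eval_x, eval_y = shuffle_and_split(toy_features, toy_labels)
```

Indexing both arrays with the same `order` keeps each label attached to its own feature row.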
Some of the columns exhibit missing values. We will use a Naive Bayes Classifier later that will treat such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.
Therefore, we will create the train_featues_with_nans and eval_features_with_nans numpy arrays to be just like their train_features and eval_features counterparts, but with the zero values in those columns replaced with NaNs.
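The replacement described above could be sketched as follows. The helper name `zeros_to_nans` is hypothetical, and the column indices assume that the feature array keeps the original attribute order with the label column removed, so that attributes 3, 4, 6 and 8 sit at 0-based columns 2, 3, 5 and 7.

```python
import numpy as np

# Assumed 0-based positions of attributes 3, 4, 6 and 8.
MISSING_VALUE_COLUMNS = [2, 3, 5, 7]

def zeros_to_nans(features, columns=MISSING_VALUE_COLUMNS):
    """Return a float copy of `features` with zeros in `columns` set to NaN."""
    out = np.array(features, dtype=float)    # copy, so the input is untouched
    block = out[:, columns]                  # fancy indexing returns a copy
    block[block == 0] = np.nan
    out[:, columns] = block                  # write the modified columns back
    return out

# Hypothetical toy rows with 8 attributes each.
toy = np.array([[1, 85, 0, 29, 0, 26.6, 0.351, 31],
                [0, 0, 64, 0, 0, 0.0, 0.672, 0]], dtype=float)
toy_with_nans = zeros_to_nans(toy)
```

Zeros in the other columns (for example column 0) are left as they are, since only the listed attributes treat 0 as missing.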
Consider a single sample $(x, y)$, where the feature vector is denoted with $x$, and the label is denoted with $y$. We will also denote the $j$-th feature of $x$ with $x^{(j)}$.
According to the textbook, the Naive Bayes Classifier uses the following decision rule:
“Choose $y$ such that
$$\log p(y) + \sum_{j} \log p(x^{(j)} \mid y)$$
is the largest”
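To make the rule concrete, here is a small sketch that assumes the usual Naive Bayes score, the log prior plus the sum of per-feature log-likelihoods, and picks the class with the largest score. The function `predict`, the array names, and the probability values are all hypothetical; the per-feature log-likelihoods are simply given here rather than estimated from data.

```python
import numpy as np

def predict(log_py, log_px_given_y):
    """Pick the class with the largest log p(y) + sum_j log p(x_j | y).

    log_py:          shape (n_classes, 1)
    log_px_given_y:  shape (n_classes, n_samples, n_features)
    """
    # Sum over features, then broadcast the column of log priors over samples.
    scores = log_py + log_px_given_y.sum(axis=2)   # (n_classes, n_samples)
    return scores.argmax(axis=0)

# Made-up probabilities for 2 classes, 2 samples, 2 features.
log_py = np.log(np.array([[0.6], [0.4]]))
log_px_given_y = np.log(np.array([
    [[0.5, 0.2], [0.5, 0.4]],   # class 0
    [[0.3, 0.9], [0.2, 0.3]],   # class 1
]))
predicted = predict(log_py, log_px_given_y)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when there are many features.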
However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)} \mid y)$ using the training data.
Write a function log_prior that takes a numpy array train_labels as input, and outputs the following vector as a column numpy array (i.e., with shape $(2,1)$):
$$\begin{bmatrix} \log p(y=0) \\ \log p(y=1) \end{bmatrix}$$
Try to avoid loops as much as possible. No loops are necessary.
Hint: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational overhead.
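A loop-free version of log_prior might look like the sketch below. It assumes that the labels are coded as 0 and 1, that both classes appear in the training labels, and that the prior is estimated as the class frequency in the training set. None of these details are spelled out above, so treat this as one possible reading rather than the reference solution.

```python
import numpy as np

def log_prior(train_labels):
    """Return [[log p(y=0)], [log p(y=1)]] as a (2, 1) array."""
    labels = np.asarray(train_labels).reshape(-1)   # flatten any input shape
    p_y1 = np.mean(labels == 1)                     # fraction of positive labels
    p_y = np.array([1.0 - p_y1, p_y1]).reshape(2, 1)
    return np.log(p_y)

# Hypothetical toy labels: three negatives and one positive.
toy_labels = np.array([0, 0, 0, 1])
toy_log_py = log_prior(toy_labels)
```

The explicit `reshape(2, 1)` guarantees the column shape regardless of whether `train_labels` arrives as a flat array or as a column.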