Assignment 4


Please upload the resulting Jupyter notebook as a PDF with the answers to the questions inline. You can do this by “printing” the Jupyter notebook and then selecting “save as PDF”. Make sure to include your NetID and real name in your submission.

1. Natural Language Processing (Logistic Regression & Naïve Bayes)

a. The file “assignment_4.txt” contains text data with labels indicating sentiment (https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10). Each line ends with an “@” followed by its label (positive, neutral, or negative). Load the file line by line to create a dataframe with two columns, “text” and “label”. This code snippet should help you get started:

with open("assignment_4.txt", "r") as fin:
    for line in fin:
        text, label = line.strip().rsplit("@", 1)
        …

What is the distribution of labels in the full data?
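
A minimal sketch of the parsing step and the label count, using a few made-up lines in place of the real file. The `sample_lines` list and the `parse_lines` helper are illustrative names, not part of the assignment. It relies only on the standard library; in the notebook, the same records could be passed to `pandas.DataFrame(records, columns=["text", "label"])` to get the requested dataframe.

```python
from collections import Counter

# Made-up lines in the "<text>@<label>" layout described above (assumption).
sample_lines = [
    "Operating profit rose to EUR 13.1 mn .@positive\n",
    "The company has no plans to move production .@neutral\n",
    "Sales fell by 12 % from the previous year .@negative\n",
    "Contact us at info@example.com for details .@neutral\n",
]


def parse_lines(lines):
    """Split each line at the LAST '@' so an '@' inside the text is kept."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text, label = line.rsplit("@", 1)
        records.append((text.strip(), label.strip()))
    return records


records = parse_lines(sample_lines)
counts = Counter(label for _, label in records)
total = sum(counts.values())
distribution = {label: n / total for label, n in counts.items()}

print(counts)        # absolute counts per label
print(distribution)  # relative frequencies per label
```

Splitting from the right matters because the sentence itself may contain an “@”, as in the last sample line.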

b. Next, split the data randomly into a training (70%) and a test set (30%). Use stratified sampling to roughly retain the original distribution of labels for both training and test data.
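
One possible way to do the stratified split is sklearn's `train_test_split` with its `stratify` argument. The sketch below uses toy data with a 60/30/10 label mix; the variable names and the seed are assumptions chosen for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the "text" and "label" columns (assumption).
texts = [f"sentence {i}" for i in range(100)]
labels = ["neutral"] * 60 + ["positive"] * 30 + ["negative"] * 10

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts,
    labels,
    test_size=0.30,    # 30% test, 70% training
    stratify=labels,   # keep the label proportions in both parts
    random_state=42,   # arbitrary seed so the split is reproducible
)

print(Counter(train_labels))
print(Counter(test_labels))
```

Comparing the two printed counters against the full-data distribution is a quick check that the stratification worked.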

c. Using sklearn, create a binary CountVectorizer() and a TfidfVectorizer(). Use single words (unigrams) as well as bigrams, and use the “english” stop word list. Fit both vectorizers to the training data to extract a vocabulary, then transform both the train and test data (keep a copy of the original train and test data for later).
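
The vectorizer setup described above could look roughly like this. The tiny corpus is made up, and the variable names are assumptions; the constructor arguments (`binary`, `ngram_range`, `stop_words`) are the sklearn parameters that correspond to the requirements in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus standing in for the training and test texts (assumption).
train_texts = [
    "profit rose sharply in the third quarter",
    "net sales fell and profit fell",
    "the company will publish its report in march",
]
test_texts = ["profit fell in march", "sales rose"]

shared = dict(ngram_range=(1, 2), stop_words="english")  # unigrams + bigrams
binary_vec = CountVectorizer(binary=True, **shared)      # 0/1 presence features
tfidf_vec = TfidfVectorizer(**shared)

# Fit on the training texts only, then reuse the learned vocabulary for the test texts.
X_train_bin = binary_vec.fit_transform(train_texts)
X_test_bin = binary_vec.transform(test_texts)
X_train_tfidf = tfidf_vec.fit_transform(train_texts)
X_test_tfidf = tfidf_vec.transform(test_texts)

print(X_train_bin.shape, X_test_bin.shape)
print(sorted(binary_vec.vocabulary_)[:10])  # first few vocabulary entries
```

Fitting only on the training texts keeps test-set vocabulary from leaking into the features; words that appear only in the test texts are simply ignored by `transform`.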

d. Create LogisticRegression() and BernoulliNB() models. For all settings, keep the default values. In a single plot, show the ROC curve for both classifiers on both the binary and tf-idf feature sets. In the legend, include the area under the ROC curve (AUC). Do not forget to label your axes. Your final plot will be a single window with 4 curves.

Which model do you think does a better job?
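
The text does not say how to reduce the three sentiment classes to a single ROC curve per model. The sketch below assumes a micro-averaged one-vs-rest curve, which is only one of several reasonable choices. The toy data, the `micro_roc` helper, and the `plot_curves` helper are illustrative names, not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import label_binarize

# Made-up toy data (assumption); replace with the real train/test split.
train_texts = ["profit rose", "sales grew strongly", "profit fell",
               "sales dropped sharply", "meeting held in march",
               "report published today"] * 5
train_labels = ["positive", "positive", "negative",
                "negative", "neutral", "neutral"] * 5
test_texts = ["profit grew", "sales fell", "report held in march"]
test_labels = ["positive", "negative", "neutral"]


def micro_roc(model, X_test, y_test):
    """Micro-averaged one-vs-rest ROC: pool all (sample, class) decisions."""
    scores = model.predict_proba(X_test)  # columns follow model.classes_
    y_bin = label_binarize(y_test, classes=model.classes_)
    fpr, tpr, _ = roc_curve(y_bin.ravel(), scores.ravel())
    return fpr, tpr, auc(fpr, tpr)


vectorizers = {
    "binary": CountVectorizer(binary=True, ngram_range=(1, 2), stop_words="english"),
    "tf-idf": TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
}
curves = {}
for feat_name, vec in vectorizers.items():
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    for model in (LogisticRegression(), BernoulliNB()):  # default settings
        model.fit(X_train, train_labels)
        curves[f"{type(model).__name__} / {feat_name}"] = micro_roc(
            model, X_test, test_labels)


def plot_curves(curves):
    """Draw all curves in one figure with the AUC in the legend."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    for name, (fpr, tpr, area) in curves.items():
        plt.plot(fpr, tpr, label=f"{name} (AUC = {area:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

Calling `plot_curves(curves)` in the notebook draws the four curves in a single window. Per-class or macro-averaged curves would be equally defensible; whichever is used should be stated next to the plot.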

2. Natural Language Processing (BERT)

a. Use the original training and test data of the previous section (before any transformations).

b. Install the transformers library from huggingface: pip install transformers[torch]


c. Download a pre-trained BERT for Sequence Classification model. You can use bert-base-uncased for both the tokenizer and the model.

d. Train the model using the original training data from above. (Here is a tutorial from huggingface: https://huggingface.co/transformers/training.html#fine-tuning-in-native-pytorch; we will also provide a tutorial session on training in PyTorch.)

Note that you will need an evaluation subset taken from the training data. Don’t use the test data for this. Train for at least 5 epochs and monitor whether the training and evaluation losses decrease. Increase the number of epochs as needed (if the loss flattens out you can stop training; if training takes too long because you are training on a CPU you can stop early).
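
A rough sketch of a native PyTorch fine-tuning loop, under several assumptions: `AutoModelForSequenceClassification` and `AutoTokenizer` are used to load bert-base-uncased, and the learning rate, batch size, maximum sequence length, and the label-to-id mapping are illustrative values, not prescribed by the assignment. The function is only defined here, not run, because it downloads the model and is slow on a CPU.

```python
import random

LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}  # arbitrary but fixed order


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fine_tune(train_texts, train_labels, eval_texts, eval_labels,
              model_name="bert-base-uncased", epochs=5, batch_size=16, lr=2e-5):
    """Fine-tune a sequence classifier; returns model, tokenizer, per-epoch losses."""
    import torch  # heavy imports kept local so the helpers above work without them
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABEL2ID)).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def run_epoch(texts, labels, train):
        model.train(train)  # train=False switches to evaluation mode
        pairs = list(zip(texts, labels))
        if train:
            random.shuffle(pairs)  # new batch order every epoch
        total, count = 0.0, 0
        for batch in batches(pairs, batch_size):
            enc = tokenizer([t for t, _ in batch], padding=True, truncation=True,
                            max_length=128, return_tensors="pt").to(device)
            target = torch.tensor([LABEL2ID[l] for _, l in batch], device=device)
            with torch.set_grad_enabled(train):
                loss = model(**enc, labels=target).loss
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(batch)
            count += len(batch)
        return total / count  # mean loss per example

    history = {"train_loss": [], "eval_loss": []}
    for epoch in range(epochs):
        history["train_loss"].append(run_epoch(train_texts, train_labels, train=True))
        history["eval_loss"].append(run_epoch(eval_texts, eval_labels, train=False))
        print(epoch + 1, history["train_loss"][-1], history["eval_loss"][-1])
    return model, tokenizer, history
```

The evaluation texts passed in should come from a split of the training data (for example another stratified split), never from the test set. The returned `history` lists can be extended with accuracy and AUC per epoch for the plots requested next.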

e. Plot the training and evaluation loss, accuracy, and AUC over time (each pair of training and evaluation in their own plots). Observe how your training converges.

f. Compute the ROC curve for the final model on the test data including its AUC. Compare it to the ROC curves of the previous section by plotting the ROC curves of the best model of the previous section and the current model.

g. Create a confusion matrix on the test data for the BERT model.
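
As an illustration, sklearn's `confusion_matrix` can build the matrix from the true and predicted labels. The label lists below are made up; in the notebook they would be the test labels and the BERT predictions.

```python
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]  # fixes the row/column order
# Made-up true and predicted labels (assumption).
y_true = ["positive", "neutral", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "negative"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
for name, row in zip(labels, cm):
    print(f"{name:>8}", row)
```

Passing `labels` explicitly keeps the cell order stable, which helps when the cells are referenced again in the next section.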

3. Model Explanations

a. Install the shap Python package (https://github.com/slundberg/shap).

b. Compute Shapley explanations for your model from the previous section. For this, pick three example inputs from each cell in the confusion matrix (skip a cell if it has 0 entries; you will have to compute Shapley explanations for at most 27 inputs).
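
Selecting the inputs can be done independently of the explainer. The sketch below groups test rows by (true label, predicted label) and draws up to three indices per non-empty cell; `sample_per_cell`, its seed, and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_per_cell(y_true, y_pred, per_cell=3, seed=0):
    """Map (true_label, predicted_label) to at most per_cell row indices."""
    cells = defaultdict(list)
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        cells[(t, p)].append(idx)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    # Cells with no entries never appear in the dictionary, so they are skipped.
    return {cell: sorted(rng.sample(idxs, min(per_cell, len(idxs))))
            for cell, idxs in cells.items()}


# Made-up labels (assumption); use the test labels and BERT predictions.
y_true = ["positive"] * 5 + ["neutral"] * 2 + ["negative"]
y_pred = ["positive"] * 4 + ["neutral"] + ["neutral"] * 2 + ["positive"]

picked = sample_per_cell(y_true, y_pred)
for cell, idxs in sorted(picked.items()):
    print(cell, idxs)
```

The texts at the selected indices can then be handed to the explainer. The shap README shows `shap.Explainer` wrapping a transformers text-classification pipeline and `shap.plots.text` for display; whether that works unchanged with a fine-tuned model depends on the installed versions, so treat it as a starting point.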

c. Do the explanations match your intuition of which part of a sentence contributes to the final output? Looking at examples where the model failed to predict correctly, try to use the explanations to formulate why the model failed.

d. Create Shapley explanations, as in subsection (b), for the Naïve Bayes model from above. Note that you have to use a different approach for computing the explanations when using tabular data as input. Do you notice a difference in how the two models (NB vs. BERT) utilize their inputs? Can you use the explanations to infer a likely reason why one of the models performs better than the other?

4. Fun with Language Models

a. There are a lot of tasks that BERT can do (e.g., text summarization, question answering, text generation, etc.; check out the huggingface website if you want to see more use cases). In this section we will use BERT for topic modeling. Use the original text data without labels (before the train / test split) and a pre-trained BERT model (https://huggingface.co/transformers/model_doc/bert.html#bertmodel). No further training will be necessary in our case.

Compute the text embeddings for each text in the data (use the last_hidden_state and obtain the embedding of the first token of the inputs). Side note: BERT automatically inserts a special token at the beginning of the input, which will produce the embedding for the full input.
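
A sketch of the embedding step, assuming `BertTokenizer` and `BertModel` with bert-base-uncased. The batch size, the maximum sequence length, and the helper names `batches` and `embed_texts` are illustrative choices. The function is defined but not run here because it downloads the model.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_texts(texts, model_name="bert-base-uncased", batch_size=32):
    """Return one first-token embedding per text, shape (n_texts, hidden_size)."""
    import numpy as np  # heavy imports kept local to the function
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    model.eval()  # no dropout; the model is used as a fixed feature extractor

    chunks = []
    with torch.no_grad():  # no gradients needed since nothing is trained
        for batch in batches(list(texts), batch_size):
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=128, return_tensors="pt")
            out = model(**enc)
            # Position 0 holds the special token inserted at the start of each input.
            chunks.append(out.last_hidden_state[:, 0, :].cpu().numpy())
    return np.concatenate(chunks, axis=0)
```

In the notebook, something like `embeddings = embed_texts(df["text"].tolist())` would yield one row per text, assuming the dataframe from section 1 is named `df`.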

b. Use sklearn’s Agglomerative Clustering (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) on the BERT embeddings.

c. Create a dendrogram of the clustering to decide on a cut that yields a reasonable number of clusters (the majority of clusters should not have only one entry and there should be more than two clusters). Aim for a cut that produces clusters of roughly equal size.
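
One way to combine the clustering and the dendrogram is to build the merge tree with SciPy and the final labels with sklearn. The sketch below uses synthetic blobs in place of the BERT embeddings; the choice of Ward linkage for the SciPy tree (to match sklearn's default) and the number of clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the BERT embeddings (assumption): three separated blobs.
rng = np.random.default_rng(0)
embeddings = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(20, 8)) for c in (0.0, 10.0, 20.0)])

# Full merge tree with Ward linkage (the default linkage of AgglomerativeClustering).
Z = linkage(embeddings, method="ward")
# no_plot=True returns the tree layout only; drop it in a notebook to draw the figure.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

n_clusters = 3  # read off the dendrogram; an assumption for this toy data
cluster_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)
scipy_labels = fcluster(Z, t=n_clusters, criterion="maxclust")  # same cut via SciPy

print(np.bincount(cluster_labels))  # cluster sizes
```

Printing the cluster sizes for a few candidate cuts is a simple way to check the "roughly equal sizes" criterion before settling on one.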

d. Equipped with the embeddings and the cluster labels, create a t-SNE projection (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) of the embeddings of all instances and use the cluster labels for coloring the points. Are the clusters clearly separated or are there significant overlaps?
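
A possible sketch of the projection step, again on synthetic data. The perplexity, the PCA initialization, the seed, and the `plot_projection` helper are illustrative choices; the perplexity in particular must stay below the number of samples and usually needs tuning on the real data.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for embeddings and cluster labels (assumption).
rng = np.random.default_rng(0)
embeddings = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(20, 8)) for c in (0.0, 10.0, 20.0)])
cluster_labels = np.repeat([0, 1, 2], 20)

tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
projection = tsne.fit_transform(embeddings)  # shape (n_samples, 2)


def plot_projection(projection, cluster_labels):
    """Scatter plot of the 2-D projection, colored by cluster label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    plt.scatter(projection[:, 0], projection[:, 1], c=cluster_labels, cmap="tab10", s=15)
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()
```

Calling `plot_projection(projection, cluster_labels)` in the notebook shows whether the clusters separate. Distances between far-apart groups in a t-SNE plot are not reliable, so overlaps are more informative than gaps.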

e. Next, pick the largest cluster and compute its centroid (use the mean of all members for each feature). Look at the original text of the 5 instances closest to the centroid. Is there a clear common theme of the texts? Can you formulate what groups them together?
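
The centroid lookup can be sketched with NumPy alone. The toy embeddings, the helper name `closest_to_centroid`, and the use of Euclidean distance are assumptions; cosine distance would be another reasonable choice for BERT embeddings.

```python
import numpy as np


def closest_to_centroid(embeddings, cluster_labels, k=5):
    """Indices of the k members of the largest cluster nearest to its centroid."""
    embeddings = np.asarray(embeddings, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    largest = np.bincount(cluster_labels).argmax()   # assumes labels are 0, 1, 2, ...
    members = np.flatnonzero(cluster_labels == largest)
    centroid = embeddings[members].mean(axis=0)      # feature-wise mean of the members
    dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
    order = np.argsort(dists)[:k]                    # nearest members first
    return members[order], centroid


# Toy data (assumption): cluster 0 has five members, cluster 1 has two.
embeddings = [[0, 0], [2, 0], [0, 2], [2, 2], [1, 1], [10, 10], [11, 11]]
cluster_labels = [0, 0, 0, 0, 0, 1, 1]
texts = [f"text {i}" for i in range(len(embeddings))]

idx, centroid = closest_to_centroid(embeddings, cluster_labels, k=3)
for i in idx:
    print(texts[i])  # original texts nearest to the centroid
```

The returned indices refer to rows of the full embedding matrix, so they can be used directly to look up the original texts.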