

assignment2_nn_text_clf March 18, 2021

1 [COM6513] Assignment 2: Text Classification with a Feedforward Network

1.0.1 Instructor: Nikos Aletras

The goal of this assignment is to develop a Feedforward neural network for text classification. For that purpose, you will implement:

• [...] followed by a ReLU activation function (2 marks)

• The Stochastic Gradient Descent (SGD) algorithm with back-propagation to learn the weights of your neural network. Your algorithm should:

the pre-trained weights. During training, you should not update them (i.e. weight freezing) and backprop should stop before computing gradients for updating embedding weights. Report results by performing hyperparameter tuning and plotting the learning process. Do you get better performance? (7 marks).

1.0.2 Data

The data you will use for the task is a subset of the AG News Corpus and you can find it in the ./data_topic folder in CSV format:

1.0.3 Pre-trained Embeddings

You can download pre-trained GloVe embeddings trained on Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download) from here. No need to unzip, the file is large.

1.0.4 Save Memory

To save RAM, when you finish each experiment you can delete the weights of your network using del W followed by Python’s garbage collector gc.collect()

1.0.5 Submission Instructions

You should submit a Jupyter Notebook file (assignment2.ipynb) and an exported PDF version (you can do it from Jupyter: File->Download as->PDF via Latex).

You are advised to follow the code structure given in this notebook by completing all given functions. You can also write any auxiliary/helper functions (and arguments for the functions) that you might need, but note that you can provide a full solution without any such functions. Similarly, you can use only the packages imported below, but you are free to use any functionality from the Python Standard Library, NumPy, SciPy (excluding built-in softmax functions) and Pandas. You are not allowed to use any third-party library such as Scikit-learn (apart from the metric functions already provided), NLTK, Spacy, Keras, Pytorch etc. You should mention if you've used Windows to write and test your code, because we mostly use Unix-based machines for marking (e.g. Ubuntu, MacOS).

There is no single correct answer on what your accuracy should be, but correct implementations usually achieve F1-scores around 80% or higher. The quality of the analysis of the results is as important as the accuracy itself.

This assignment will be marked out of 60. It is worth 60% of your final grade in the module.

The deadline for this assignment is 23:59 on Fri, 23 Apr 2021 and it needs to be submitted via Blackboard. Standard departmental penalties for lateness will be applied. We use a range of strategies to detect unfair means, including Turnitin which helps detect plagiarism. Use of unfair means would result in getting a failing grade.

In [1]: import pandas as pd
import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random
from time import localtime, strftime
from scipy.stats import spearmanr,pearsonr
import zipfile
import gc

1.1 Transform Raw texts into training and development data

First, you need to load the training, development and test sets from their corresponding CSV files (tip: you can use Pandas dataframes).
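As a rough illustration of the loading step, the sketch below reads one split into a Pandas dataframe. The two-column layout (label, then text), the absence of a header row, and the file name mentioned in the comment are all assumptions, since the listing of the data files is not shown here; adjust them to match the actual CSV files.

```python
import io
import pandas as pd

def load_split(path_or_buffer):
    """Read a header-less CSV, assuming two columns: label, text."""
    return pd.read_csv(path_or_buffer, header=None, names=["label", "text"])

# Tiny in-memory stand-in for a file such as ./data_topic/train.csv (hypothetical name)
demo_csv = io.StringIO('1,"Stocks rally on earnings"\n2,"Team wins the cup final"\n')

data_tr = load_split(demo_csv)
X_tr_raw = data_tr["text"].tolist()   # raw document strings
Y_tr = data_tr["label"].values        # labels as a NumPy array
```

The same helper can be called once per split (training, development, test) with the corresponding path.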

2 Create input representations

To train your Feedforward network, you first need to obtain input representations given a vocabulary. One-hot encoding requires large memory capacity. Therefore, we will instead represent documents as lists of vocabulary indices (each word corresponds to a vocabulary index).

2.1 Text Pre-Processing Pipeline

To obtain a vocabulary of words, you should:

• tokenise all texts into a list of unigrams (tip: you can re-use the functions from Assignment 1)
• remove stop words (using the list provided or one of your preference)
• remove unigrams appearing in fewer than K documents
• use the remaining unigrams to create a vocabulary of the top-N most frequent unigrams in the entire corpus.

2.1.1 Unigram extraction from a document

You first need to implement the extract_ngrams function. It takes as input:

• x_raw: a string corresponding to the raw text of a document
• ngram_range: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
• token_pattern: a string to be used within a regular expression to extract all tokens. Note that the data is already tokenised, so you could opt for a simple white space tokenisation.
• stop_words: a list of stop words
• vocab: a given vocabulary. It should be used to extract specific features.

and returns:
• a list of all extracted features.

In [1]: def extract_ngrams(x_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b',
                   stop_words=[], vocab=set()):

    tokenRE = re.compile(token_pattern)

    # first extract all unigrams by tokenising
    x_uni = [w for w in tokenRE.findall(str(x_raw).lower()) if w not in stop_words]

    # this list stores the features to be returned
    x = []
    if ngram_range[0]==1:
        x = list(x_uni)  # copy, so that appending n-grams does not modify x_uni

    ngrams = []
    for n in range(ngram_range[0], ngram_range[1]+1):

        # unigrams have already been handled above
        if n==1: continue

        # pass a list of lists as an argument for zip
        arg_list = [x_uni]+[x_uni[i:] for i in range(1, n)]

        # extract tuples of n-grams using zip
        # for bigrams this should look like: list(zip(x_uni, x_uni[1:]))
        # align each item x[i] in x_uni with the next one x[i+1].
        # Note that x_uni and x_uni[1:] have different lengths
        # but zip ignores redundant elements at the end of the longer list
        # Alternatively, this could be done with for loops
        x_ngram = list(zip(*arg_list))
        ngrams.append(x_ngram)

    for n in ngrams:
        for t in n:
            x.append(t)

    if len(vocab)>0:
        x = [w for w in x if w in vocab]

    return x

2.1.2 Create a vocabulary of n-grams

Then the get_vocab function will be used to (1) create a vocabulary of ngrams; (2) count the document frequencies of ngrams; and (3) count their raw frequencies. It takes as input:

• X_raw: a list of strings, each corresponding to the raw text of a document
• ngram_range: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
• token_pattern: a string to be used within a regular expression to extract all tokens. Note that the data is already tokenised, so you could opt for a simple white space tokenisation.
• stop_words: a list of stop words
• min_df: keep ngrams with a minimum document frequency.
• keep_topN: keep the top-N most frequent ngrams.

and returns:

frequency as values.

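In [6]: def get_vocab(X_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b',
              min_df=0, keep_topN=0, stop_words=[]):

    df = Counter()            # document frequency of each ngram
    ngram_counts = Counter()  # raw (corpus) frequency of each ngram

    for x in X_raw:
        x_ngram = extract_ngrams(x, ngram_range=ngram_range, token_pattern=token_pattern,
                                 stop_words=stop_words)

        # count each ngram once per document for df, and every occurrence for ngram_counts
        df.update(set(x_ngram))
        ngram_counts.update(x_ngram)

    # obtain a vocabulary as a set.
    # Keep elements with doc frequency >= minimum doc freq (min_df)
    # Note that df still contains all ngrams, not only those kept in vocab
    vocab = set([w for w in df if df[w]>=min_df])

    # keep only the top-N most frequent ngrams that passed the min_df filter
    if keep_topN>0:
        vocab = set([w[0] for w in ngram_counts.most_common(keep_topN)
                     if w[0] in vocab])

    return vocab, df, ngram_counts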

Now you should use get_vocab to create your vocabulary and get document and raw frequencies of unigrams:

Then, you need to create vocabulary id -> word and word -> vocabulary id dictionaries for reference:
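One possible way to build the two lookup dictionaries is sketched below. The toy vocab set stands in for the output of get_vocab, and sorting it is a design choice made here (not a requirement of the assignment) so that ids stay the same across runs, since set iteration order is not guaranteed.

```python
# Toy vocabulary standing in for the set returned by get_vocab
vocab = {"market", "game", "oil"}

# Sort so that ids are reproducible across runs
id2word = dict(enumerate(sorted(vocab)))
word2id = {w: i for i, w in id2word.items()}
```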

2.1.3 Convert the list of unigrams into a list of vocabulary indices

Storing actual one-hot vectors in memory for all words in the entire data set is prohibitive. Instead, we will store word indices in the vocabulary and look up the corresponding rows of the weight matrix. This is equivalent to computing a dot product between a one-hot vector and the weight matrix.
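The equivalence between the one-hot dot product and a row look-up can be checked on a toy matrix. The names W_emb, idx, via_dot and via_lookup are illustrative only.

```python
import numpy as np

# Toy embedding matrix: 4 vocabulary entries, 3 dimensions each
W_emb = np.arange(12, dtype=np.float32).reshape(4, 3)
idx = 2  # vocabulary index of some word

# Explicit one-hot vector multiplied with the matrix
one_hot = np.zeros(4, dtype=np.float32)
one_hot[idx] = 1.0
via_dot = one_hot @ W_emb

# Direct row look-up, which avoids building the one-hot vector
via_lookup = W_emb[idx]
```

Both variables hold the same row of W_emb, but the look-up needs neither the one-hot vector nor the multiplication.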

First, represent documents in train, dev and test sets as lists of words in the vocabulary:

Then convert them into lists of indices in the vocabulary:

Put the labels Y for train, dev and test sets into arrays:
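A minimal sketch of the conversion to indices follows, assuming documents have already been tokenised into lists of words. The helper name to_indices and the toy data are hypothetical.

```python
def to_indices(docs_tokens, word2id):
    """Map each tokenised document to vocabulary ids, dropping out-of-vocabulary tokens."""
    return [[word2id[w] for w in doc if w in word2id] for doc in docs_tokens]

word2id = {"game": 0, "market": 1, "oil": 2}            # toy mapping
docs = [["oil", "market", "falls"], ["game", "game"]]   # toy tokenised documents
X_idx = to_indices(docs, word2id)
```

Note that a document whose tokens are all out of vocabulary becomes an empty list, which needs handling before averaging embeddings.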

3 Network Architecture

Your network should pass each word index into its corresponding embedding by looking it up in the embedding matrix and then compute the first hidden layer $h_1$ as the average of those embeddings:

$$h_1 = \frac{1}{|x|} \sum_{i \in x} W^e_i$$

where $|x|$ is the number of words in the document and $W^e$ is an embedding matrix of size $|V| \times d$, where $|V|$ is the size of the vocabulary and $d$ is the embedding size.

Then $h_1$ should be passed through a ReLU activation function:

$$a_1 = \mathrm{relu}(h_1)$$

Finally, the hidden layer is passed to the output layer:

$$y = \mathrm{softmax}(a_1 W)$$

where $W$ is a matrix of size $d \times |Y|$ and $|Y|$ is the number of classes.

During training, $a_1$ should be multiplied elementwise with a dropout mask vector for regularisation before it is passed to the output layer.

You can extend to a deeper architecture by passing a hidden layer to another one:

$$h_i = a_{i-1} W_i, \qquad a_i = \mathrm{relu}(h_i)$$

4 Network Training

First we need to define the parameters of our network by initialising the weight matrices. For that purpose, you should implement the network_weights function that takes as input:

layers between the average embedding and the output layer

and returns:

• W: a dictionary mapping from layer index (e.g. 0 for the embedding matrix) to the corresponding weight matrix initialised with small random numbers (hint: use numpy.random.uniform with values from -0.1 to 0.1)

Make sure that the dimensionality of each weight matrix is compatible with the previous and next weight matrix, otherwise you won't be able to perform forward and backward passes. Consider also using np.float32 precision to save memory.

In [15]: def network_weights(vocab_size=1000, embedding_dim=300, hidden_dim=[], num_classes=3, init_val = 0.5):
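One way to fill in this function is sketched below; it is not the reference solution. It assumes that hidden_dim lists the sizes of any extra hidden layers between the average embedding and the output layer, and that init_val is the half-width of the uniform range.

```python
import numpy as np

def network_weights(vocab_size=1000, embedding_dim=300, hidden_dim=[],
                    num_classes=3, init_val=0.5):
    # Layer sizes from input to output, e.g. [|V|, d, (hidden sizes...), |Y|]
    dims = [vocab_size, embedding_dim] + list(hidden_dim) + [num_classes]

    W = {}
    for i in range(len(dims) - 1):
        # Consecutive sizes guarantee that W[i] and W[i+1] are compatible
        W[i] = np.random.uniform(-init_val, init_val,
                                 (dims[i], dims[i + 1])).astype(np.float32)
    return W

W = network_weights(vocab_size=10, embedding_dim=4, hidden_dim=[5],
                    num_classes=3, init_val=0.1)
```

With these arguments W[0] is the 10 × 4 embedding matrix, W[1] is 4 × 5 and W[2] is 5 × 3.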

Then you need to develop a softmax function (same as in Assignment 1) to be used in the output layer.

It takes as input z (array of real numbers) and returns sig (the softmax of z).

In [8]: def softmax(z):

    return sig
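A possible softmax for a single vector of scores is sketched below. Subtracting the maximum before exponentiating is a common numerical-stability trick added here as a design choice; it does not change the result.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    # Shift by the maximum so that np.exp never overflows
    e = np.exp(z - np.max(z))
    sig = e / e.sum()
    return sig

probs = softmax(np.array([1.0, 2.0, 3.0]))
```

Without the shift, inputs such as [1000, 1000] would overflow in np.exp and produce nan values.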

Now you need to implement the categorical cross entropy loss by slightly modifying the function from Assignment 1 to depend only on the true label y and the class probabilities vector y_preds:

In [11]: def categorical_loss(y, y_preds):

    return l
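A minimal sketch of the loss for a single example follows, assuming y is an integer class index starting at 0 and y_preds is a probability vector. The small eps constant is a hypothetical guard against log(0) and is not part of the assignment text.

```python
import numpy as np

def categorical_loss(y, y_preds):
    eps = 1e-12  # hypothetical guard against taking log(0)
    # Only the probability assigned to the true class contributes to the loss
    l = -np.log(y_preds[y] + eps)
    return l

loss = categorical_loss(1, np.array([0.25, 0.5, 0.25]))
```

If the labels in the data start at 1 rather than 0, they need to be shifted before being used as indices.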

Then, implement the relu function to introduce non-linearity after each hidden layer of your network (during the forward pass):

$$\mathrm{relu}(z_i) = \max(z_i, 0)$$

and the relu_derivative function to compute its derivative (used in the backward pass):

$$\mathrm{relu\_derivative}(z_i) = \begin{cases} 0 & \text{if } z_i \le 0 \\ 1 & \text{otherwise} \end{cases}$$

Note that both functions take as input a vector z.

Hint: use .copy() to avoid in-place changes to the array z.

In [12]: def relu(z):

return a
def relu_derivative(z):

return dz
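The two definitions above could be implemented along the following lines. This is one sketch among several equivalent options (np.maximum would work as well); it follows the hint by copying z first.

```python
import numpy as np

def relu(z):
    a = z.copy()        # leave the caller's array untouched
    a[a < 0] = 0
    return a

def relu_derivative(z):
    dz = z.copy()
    dz[dz <= 0] = 0     # derivative is 0 for non-positive inputs
    dz[dz > 0] = 1      # and 1 otherwise
    return dz

z = np.array([-1.0, 0.0, 2.0])
a = relu(z)
dz = relu_derivative(z)
```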

During training you should also apply a dropout mask element-wise after the activation function (i.e. a vector of ones with a random percentage of its elements set to zero). The dropout_mask function takes as input:

• size: the size of the vector that we want to apply dropout
• dropout_rate: the percentage of elements that will be randomly set to zeros

and returns:
• dropout_vec: a vector with binary values (0 or 1)

In [23]: def dropout_mask(size, dropout_rate):

return dropout_vec
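A possible implementation is sketched below. It zeroes exactly int(size * dropout_rate) randomly chosen positions, which is one reading of "percentage of elements"; drawing each element independently with probability dropout_rate would be an equally reasonable alternative.

```python
import numpy as np

def dropout_mask(size, dropout_rate):
    dropout_vec = np.ones(size, dtype=np.float32)
    # Number of positions to switch off (rounded down)
    n_drop = int(size * dropout_rate)
    # Pick distinct positions so that exactly n_drop elements become zero
    drop_idx = np.random.choice(size, n_drop, replace=False)
    dropout_vec[drop_idx] = 0.0
    return dropout_vec

mask = dropout_mask(10, 0.2)
```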

Now you need to implement the forward_pass function that passes the input x through the network up to the output layer for computing the probability for each class using the weight matrices in W. The ReLU activation function should be applied on each hidden layer.

hidden layer, W[1] is the weight matrix that connects the hidden layer to the output layer.

applied after each hidden layer for regularisation.

and returns:

• out_vals: a dictionary of output values from each layer: h (the vector before the activation function), a (the resulting vector after passing h from the activation function), its dropout mask vector; and the prediction vector (probability for each class) from the output layer.

In [25]: def forward_pass(x, W, dropout_rate=0.2):

out_vals = {}

return out_vals
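The forward computation can be sketched as follows. This is a self-contained illustration, not the reference solution: ReLU, dropout and softmax are inlined, the dropout mask is drawn independently per element, x is assumed to be a non-empty list of vocabulary indices, and the dictionary keys h, a and dropout_vec are assumptions (only the key y appears in the evaluation code of this notebook).

```python
import numpy as np

def forward_pass(x, W, dropout_rate=0.2):
    # Key names other than 'y' are illustrative assumptions
    out_vals = {"h": [], "a": [], "dropout_vec": []}

    # First hidden layer: average the embedding rows selected by the indices in x
    h = W[0][x].mean(axis=0)

    # Every matrix after the embedding matrix consumes the activation before it
    for i in range(1, len(W)):
        a = np.maximum(h, 0)  # ReLU
        # Each unit is kept independently with probability 1 - dropout_rate
        mask = (np.random.rand(a.shape[0]) >= dropout_rate).astype(a.dtype)
        out_vals["h"].append(h)
        out_vals["a"].append(a)
        out_vals["dropout_vec"].append(mask)
        h = (a * mask) @ W[i]

    # The last h holds the output-layer scores; turn them into probabilities
    e = np.exp(h - np.max(h))
    out_vals["y"] = e / e.sum()
    return out_vals

# Toy network: 2 words, 2-dimensional embeddings, 2 classes
W_toy = {0: np.array([[1.0, -1.0], [3.0, 1.0]]), 1: np.eye(2)}
out = forward_pass([0, 1], W_toy, dropout_rate=0.0)
```

Calling the function with dropout_rate=0.0, as the evaluation cells do, keeps every unit and makes the output deterministic.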

The backward_pass function computes the gradients and updates the weights for each matrix in the network from the output to the input. It takes as input

hidden and an output layer: W[0] is the weight matrix that connects the input to the first

hidden layer, W[1] is the weight matrix that connects the hidden layer to the output layer.

• freeze_emb: boolean value indicating whether the embedding weights are frozen (True) or updated (False).

and returns:

• W: the updated weights of the network.
Hint: the gradients on the output layer are similar to the multiclass logistic regression.

In [3]: def backward_pass(x, y, W, out_vals, lr=0.001, freeze_emb=False):
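For the basic architecture with only an embedding matrix W[0] and an output matrix W[1], the backward step could look like the sketch below. It is an illustration under assumptions rather than the reference solution: out_vals is assumed to hold lists under the keys h, a and dropout_vec plus the prediction under y, y is assumed to be a class index starting at 0, and deeper networks are not handled.

```python
import numpy as np

def backward_pass(x, y, W, out_vals, lr=0.001, freeze_emb=False):
    # Assumed layout of out_vals: lists under 'h', 'a', 'dropout_vec'; vector under 'y'
    h = out_vals["h"][0]
    a = out_vals["a"][0]
    mask = out_vals["dropout_vec"][0]

    # Output layer: gradient of the loss w.r.t. the scores is y_pred - one_hot(y),
    # as in multiclass logistic regression
    dz = out_vals["y"].copy()
    dz[y] -= 1.0
    grad_out = np.outer(a * mask, dz)

    # Propagate to the hidden layer before W[1] is modified
    da = (W[1] @ dz) * mask
    dh = da * (h > 0)  # ReLU derivative

    W[1] -= lr * grad_out

    if not freeze_emb:
        # Each index in x contributed 1/|x| of its embedding row to h;
        # subtract.at accumulates correctly when an index repeats in x
        step = (lr * dh / len(x)).astype(W[0].dtype)
        np.subtract.at(W[0], x, step)
    return W
```

When freeze_emb is True the function returns before touching W[0], so no embedding gradient is computed or applied.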

Finally you need to modify SGD to support back-propagation by using the forward_pass and backward_pass functions.

The SGD function takes as input:

is smaller than a threshold

be used by the backward pass function).

and returns:

each epoch

after each epoch

In [7]: def SGD(X_tr, Y_tr, W, X_dev=[], Y_dev=[], lr=0.001,
dropout=0.2, epochs=5, tolerance=0.001, freeze_emb=False,


return W, training_loss_history, validation_loss_history
Now you are ready to train and evaluate your neural net. First, you need to define your network using the network_weights function, followed by SGD with backprop:

for i in range(len(W)):
    print('Shape W'+str(i), W[i].shape)

X_dev=X_dev, Y_dev=Y_dev, lr=0.001, dropout=0.2, freeze_emb=False, tolerance=0.01, epochs=100)

Plot the learning process:
Compute accuracy, precision, recall and F1-Score:

In [10]: preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]

4.0.1 Discuss how you chose the model hyperparameters

5 Use Pre-trained Embeddings

Now re-train the network using GloVe pre-trained embeddings. You need to modify the backward_pass function above to stop computing gradients and updating weights of the embedding matrix.

Use the function below to obtain the embedding matrix for your vocabulary. Generally, that should work without any problem. If you get errors, you can modify it.

In [32]: def get_glove_embeddings(f_zip, f_txt, word2id, emb_size=300):

    w_emb = np.zeros((len(word2id), emb_size))

    # read the text file directly from inside the zip archive
    with zipfile.ZipFile(f_zip) as z:
        with z.open(f_txt) as f:
            for line in f:
                line = line.decode('utf-8')
                word = line.split()[0]

                # only keep vectors for words in our vocabulary
                # (word2id is used instead of a global vocab variable)
                if word in word2id:
                    emb = np.array(line.strip('\n').split()[1:]).astype(np.float32)
                    w_emb[word2id[word]] += emb

    return w_emb

In [33]: w_glove = get_glove_embeddings("","glove.840B.300d.txt",word2id)

First, initialise the weights of your network using the network_weights function. Second, replace the weights of the embedding matrix with w_glove. Finally, train the network by freezing the embedding weights:

In [14]: preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]

5.0.1 Discuss how you chose the model hyperparameters

6 Extend to support deeper architectures

Extend the network to support back-propagation for more hidden layers. You need to modify the backward_pass function above to compute gradients and update the weights between intermediate hidden layers. Finally, train and evaluate a network with a deeper architecture. Do deeper architectures increase performance?

In [13]: preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]

Discuss how you chose the model hyperparameters

Full Results

Add your final results here:

Model                                                Precision   Recall   F1-Score   Accuracy
Average Embedding
Average Embedding (Pre-trained)
Average Embedding (Pre-trained) + X hidden layers

Please discuss why your best performing model is better than the rest.