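Deep Learning Tutoring | CIS 8392 Big Data Analytics Assignment 3

The main task in this big data tutoring session is to preprocess the IMDB data and use deep learning models for data analysis and model training.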

#Assignment 3


library(keras)

set.seed(123)
n_sample <- 5000
max_features <- 5000
maxlen <- 300

# Load the data, keeping only the max_features most frequent words
imdb <- dataset_imdb(num_words = max_features)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb

# Pad or truncate every review to exactly maxlen tokens
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)

# Draw n_sample row indices; the same indices subset both sets
sample_indicators <- sample(1:nrow(x_train), n_sample)

# Use a subset of reviews for training
x_train <- x_train[sample_indicators, ]
y_train <- y_train[sample_indicators]

# Use a subset of reviews for testing
x_test <- x_test[sample_indicators, ]
y_test <- y_test[sample_indicators]
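To make the padding step concrete, here is a minimal base R sketch of what `pad_sequences` does to a single review under its default settings (pad with zeros at the front, truncate from the front). The helper `pad_pre` is a hypothetical illustration, not part of keras.

```r
# Hypothetical helper that mimics pad_sequences() defaults for one sequence:
# keep the last maxlen tokens, then left-pad with zeros up to maxlen.
pad_pre <- function(tokens, maxlen) {
  tokens <- tail(tokens, maxlen)
  c(rep(0, maxlen - length(tokens)), tokens)
}

short_review <- pad_pre(c(1, 2, 3), maxlen = 5)  # shorter than maxlen: padded
long_review <- pad_pre(1:7, maxlen = 5)          # longer than maxlen: truncated
```

With `maxlen <- 300`, every row of `x_train` and `x_test` therefore has exactly 300 columns, which is what the embedding layer of each model expects.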


Fit the following models:

1. Simple RNN

4. bidirectional LSTM

5. bidirectional GRU

6. 1D convnet

You can decide the parameters for the network structure (e.g., number of layers, etc.) and model training (e.g., epochs, batch_size, and validation_split).
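As one possible starting point, the sketch below builds and fits a bidirectional LSTM with the keras R package. The helper name `fit_bilstm`, the layer sizes, and the default training settings are all assumptions chosen for illustration; the assignment leaves these choices to you.

```r
library(keras)

# Hypothetical helper: the layer sizes and training settings below are
# illustrative defaults, not values prescribed by the assignment.
fit_bilstm <- function(x_train, y_train, max_features = 5000,
                       epochs = 5, batch_size = 128,
                       validation_split = 0.2) {
  # Embedding -> bidirectional LSTM -> single sigmoid unit for the 0/1 label
  model <- keras_model_sequential() %>%
    layer_embedding(input_dim = max_features, output_dim = 32) %>%
    bidirectional(layer_lstm(units = 32)) %>%
    layer_dense(units = 1, activation = "sigmoid")

  model %>% compile(
    optimizer = "rmsprop",
    loss = "binary_crossentropy",
    metrics = c("accuracy")
  )

  history <- model %>% fit(
    x_train, y_train,
    epochs = epochs,
    batch_size = batch_size,
    validation_split = validation_split
  )

  # Return both objects so they can be saved later
  list(model = model, history = history)
}
```

Calling `fit_bilstm(x_train, y_train)` returns the fitted model together with its training history. The other architectures can follow the same pattern by swapping the recurrent layer, for example `layer_simple_rnn()`, `bidirectional(layer_gru(...))`, or `layer_conv_1d()` followed by a pooling layer.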


Save each of these fitted models to an h5 file

Save the history of each model to an rds file (see write_rds)

Save x_test and y_test to rds files
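The saving step could be wrapped in a small helper like the one below. This is a sketch that assumes the keras and readr packages; the helper name `save_model_outputs`, the `output` folder, and the file-name pattern are hypothetical choices, not requirements of the assignment.

```r
library(keras)
library(readr)

# Hypothetical helper: writes <model_name>.h5 and <model_name>_history.rds
# into out_dir (the folder name and file pattern are assumptions).
save_model_outputs <- function(model, history, model_name,
                               out_dir = "output") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  save_model_hdf5(model, file.path(out_dir, paste0(model_name, ".h5")))
  write_rds(history, file.path(out_dir, paste0(model_name, "_history.rds")))
  invisible(out_dir)
}

# The test data can be saved the same way, for example:
# write_rds(x_test, file.path("output", "x_test.rds"))
# write_rds(y_test, file.path("output", "y_test.rds"))
```

Keeping every output file in one folder makes it easier to compress them together and to read them back with relative paths in the report.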

Step 4. Save the R code you used for steps 1 to 3 to an R file

Step 5. Compress all the output files from step 3 to a zip file
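Compression can be done from R as well as from the file manager. The sketch below assumes the output files sit in a folder named `output` (a hypothetical name); `collect_output_files` is a helper introduced here for illustration. Base R's `zip()` relies on an external zip program being available on the system, so the call is left commented out.

```r
# Hypothetical helper: list every .h5 and .rds file in the output folder
collect_output_files <- function(out_dir) {
  list.files(out_dir, pattern = "\\.(h5|rds)$", full.names = TRUE)
}

# Usage (not run here; needs a zip executable on the PATH):
# zip(zipfile = "Assignment3-Lin.zip",
#     files = collect_output_files("output"))
```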


Step 6. Use R Markdown to achieve the following:

1. Specify author, date, and title in the YAML metadata of your document

2. Read all the output files from Step 3

3. Use x_test and y_test to show the following statistics:

Number of reviews in the test set

Number of positive reviews in the test set

Number of negative reviews in the test set

4. For each model:

Show model summary

Plot the training history

Evaluate the performance of the model using the test set

5. Summarize the performance of different models using a table. Columns include:

model_name

: Overall accuracy of the predictions in the test set

n_tp : Number of true-positive predictions in the test set

n_tn : Number of true-negative predictions in the test set

n_fp : Number of false-positive predictions in the test set

n_fn : Number of false-negative predictions in the test set

6. Discuss what you found from the table
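The test-set statistics and the confusion counts for the summary table can be computed with base R. The sketch below uses toy labels and toy predicted probabilities in place of the real `y_test` and the output of `predict(model, x_test)`. The helper `confusion_counts`, the 0.5 threshold, and the column name `accuracy` are assumptions made for illustration. IMDB labels are 1 for positive and 0 for negative reviews.

```r
# Toy stand-ins for the real labels and predicted probabilities
y_test <- c(1, 0, 1, 1, 0, 0)
pred_prob <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1)

# Test-set statistics
n_reviews <- length(y_test)
n_positive <- sum(y_test == 1)
n_negative <- sum(y_test == 0)

# Hypothetical helper: one table row per model
confusion_counts <- function(model_name, y_true, pred_prob,
                             threshold = 0.5) {
  y_pred <- as.integer(pred_prob > threshold)
  data.frame(
    model_name = model_name,
    accuracy = mean(y_pred == y_true),
    n_tp = sum(y_pred == 1 & y_true == 1),
    n_tn = sum(y_pred == 0 & y_true == 0),
    n_fp = sum(y_pred == 1 & y_true == 0),
    n_fn = sum(y_pred == 0 & y_true == 1)
  )
}

perf <- confusion_counts("toy_model", y_test, pred_prob)
```

Rows for several models can be stacked with `rbind()` and the resulting data frame passed to `knitr::kable()` for the report.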


Here are some additional notes about writing an R Markdown report. Violating these rules will lead to a lower grade.

Put the data in the same folder as your Rmd file. Whenever we run/knit an RMarkdown file, it uses the folder with the Rmd file as the working directory.

Read the data in your Rmd code chunk using a relative path. If you use an absolute path, I will not be able to knit the Rmd file to an html file from my end.

You will lose 5 points if for any reason (input path, error in code, etc.) the Rmd file cannot be knitted to an html file.

Distinguish headings (## heading) and normal text. We should not put all the text in headings.

Do not print excessive data in your RMarkdown report. Use kable to format tables.

Do not put your discussions/explanations in code chunk. Write them as normal text.

Do not use include=FALSE or echo=FALSE in your code chunk. I need to read your code. You may use message=FALSE, warning=FALSE to suppress messages/warnings.

Do not write an excessively long line of code. Break it into multiple lines to improve readability.
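Several of these notes can be handled once in a setup chunk near the top of the Rmd file. The sketch below assumes the knitr package; the file name `x_test.rds` and the placeholder table are hypothetical and only show the pattern.

```r
library(knitr)

# Show the code, but keep package messages and warnings out of the report
opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Relative path: resolved against the folder that holds the Rmd file.
# "x_test.rds" is an assumed file name.
x_test_path <- "x_test.rds"
if (file.exists(x_test_path)) {
  x_test <- readRDS(x_test_path)
}

# Placeholder table: the zeros are made up purely to show kable formatting
demo_table <- data.frame(
  model_name = c("model_a", "model_b"),
  n_tp = c(0L, 0L)
)
kable(demo_table)
```

Setting the options globally avoids repeating them in every chunk, and `kable()` keeps tables compact instead of printing raw data frames.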


Step 7. Knit the R Markdown file (.Rmd) to an HTML file

Step 8. The R, Rmd, HTML, zip files must follow the naming rule below:

Assignment3-YourLastName.FileExtension

For example:

Assignment3-Lin.R

Assignment3-Lin.Rmd

Assignment3-Lin.html

Assignment3-Lin.zip
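A quick way to check the naming rule before submitting is to generate the expected file names and see which ones are missing from the working directory. The helper `submission_files` is a hypothetical name introduced for this sketch.

```r
# Hypothetical helper: build the four required file names for a last name
submission_files <- function(last_name,
                             extensions = c("R", "Rmd", "html", "zip")) {
  paste0("Assignment3-", last_name, ".", extensions)
}

expected <- submission_files("Lin")

# Any names listed here are not yet present in the working directory
missing_files <- expected[!file.exists(expected)]
```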

Step 9. Submit the R, Rmd, html, and zip files (individually) to iCollege


Due by the beginning of next class

Extra credit: the student who has the best report (determined by the instructor) will be given 5 extra points towards the final grade

Submissions that are too similar will not be considered for the extra credit

Accuracy of the models plays a significant role for this extra credit

Grading is based on the following:

Whether all required files were submitted to iCollege on time, following the naming rule

Whether the Rmd file is syntactically correct and can render the html file

Whether the report has a professional format and style (succinct and yet provides adequate and clear discussions)

Whether the report meets the requirements specified in Step 6

Grading is based on the submitted files on iCollege, and the submission folder will become unavailable after the deadline. Do not wait until the last minute. You will see a “not authorized” error message if you click submit after the deadline. You will receive 0 points if you submit your assignment via email.