COMP 598 Final Project – Data Science Project
Project 1: COVID in Canada
Your team has been hired by a not-for-profit that wants to understand the discussions
currently happening around COVID in Canadian social media. They have indicated that they
are especially concerned with vaccine hesitancy. Specifically, they want to know:
1. The salient topics discussed around COVID and what each topic primarily concerns
2. Relative engagement with those topics
3. How positive/negative the response to the pandemic/vaccination has been.
You will conduct this analysis and submit a report discussing your findings.
Your analysis will draw on Twitter posts (tweets). To inform your analysis, you should collect
1,000 tweets within a 3-day window. You should set filters such that all 1,000 posts mention
either COVID, vaccination, or a name-brand COVID vaccine AND all are in English (ensure
that the language field is set to “en”… this isn’t exact, but it gets close). You can filter by
hashtags or words when collecting Twitter data. You can choose the exact words, as long as
they are related to the context that we mentioned before.
Each post in your collection should be unique – meaning that you shouldn’t include an
identical tweet or retweet twice.
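The filtering and uniqueness requirements above can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes each collected tweet is a dict with hypothetical `text` and `lang` keys, and the keyword list is only an example, since the exact words are left to you. Retweets are treated as duplicates by stripping a leading `RT @user:` marker before comparing text.

```python
import re

# Hypothetical keyword list; the assignment leaves the exact words up to you.
KEYWORDS = ["covid", "vaccine", "vaccination", "pfizer", "moderna", "astrazeneca"]
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def normalize(text):
    """Drop a leading retweet marker, then collapse case and whitespace."""
    text = re.sub(r"^RT @\w+:\s*", "", text)
    return " ".join(text.lower().split())


def filter_unique(tweets, limit=1000):
    """Keep English, on-topic tweets, skipping identical tweets and retweets."""
    seen = set()
    kept = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue  # language field must be "en"
        if not KEYWORD_RE.search(tweet["text"]):
            continue  # must mention at least one chosen keyword
        key = normalize(tweet["text"])
        if key in seen:
            continue  # identical tweet or a retweet of one already kept
        seen.add(key)
        kept.append(tweet)
        if len(kept) == limit:
            break
    return kept
```

Comparing normalized text is only one possible definition of "identical"; if your collection tool exposes tweet IDs or retweet metadata, deduplicating on those may be more reliable.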
To develop your topics, conduct an open coding on 200 tweets (approach the exercise
as requiring each tweet to belong to exactly one topic). You should aim for between 3 and 8 topics.
Once your topics have been designed, manually annotate the rest of the 1,000 tweets in
your dataset. During this annotation, also code each post for positive/neutral/negative
sentiment. While double annotation would usually be used, for this project (given time
constraints), use single annotation.
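Once every tweet carries exactly one topic label and one sentiment label, the annotations can be tallied with a few lines of code. The sketch below assumes the annotations are available as `(topic, sentiment)` pairs, one per tweet; the function name and the label strings in the usage example are hypothetical.

```python
from collections import Counter, defaultdict


def summarize(annotations):
    """Count tweets per topic and sentiment labels within each topic.

    annotations: iterable of (topic, sentiment) pairs, one pair per tweet.
    """
    topic_counts = Counter()
    sentiment_by_topic = defaultdict(Counter)
    for topic, sentiment in annotations:
        topic_counts[topic] += 1
        sentiment_by_topic[topic][sentiment] += 1
    return topic_counts, sentiment_by_topic


# Usage with made-up labels:
labels = [
    ("mandates", "negative"),
    ("mandates", "neutral"),
    ("side effects", "negative"),
    ("mandates", "negative"),
]
topic_counts, sentiment_by_topic = summarize(labels)
print(topic_counts.most_common())
print(dict(sentiment_by_topic["mandates"]))
```

Dividing each topic count by the total number of annotated tweets gives the share of the dataset that each topic accounts for.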
Characterize your topics by computing the 10 words in each category with the highest
tf-idf scores (to compute inverse document frequency, use all 1,000 posts that you originally