During the COVID-19 pandemic, people have taken their worries, concerns, frustrations, and loves to social media to share their feelings and thoughts with the rest of the world. Twitter has become one of the official channels through which world leaders communicate with their supporters and followers. To understand what keeps them busy, we extract the tweets of two world leaders: Donald Trump (the President of the United States) and Justin Trudeau (the Prime Minister of Canada). By applying natural language processing techniques and the Latent Dirichlet Allocation (LDA) algorithm, we can learn the topics of their tweets and see what has been on their minds during the crisis.
We use Python 3.6 and the following packages:
- TwitterScraper, a Python script to scrape tweets
- NLTK (Natural Language Toolkit), an NLP package for text processing, e.g. stop-word removal, punctuation stripping, tokenization, lemmatization, etc.
- Gensim, “generate similar”, a popular NLP package for topic modeling
- Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
- pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data
Data Gathering
We use TwitterScraper to scrape tweets from the Twitter handles @realDonaldTrump and @JustinTrudeau. Only original English tweets posted from March 1 to April 27, 2020 are collected; retweets of others are excluded.
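For reference, here is a minimal scraping sketch, assuming TwitterScraper's query_tweets entry point; the -filter:retweets search operator and the tweet attribute names are assumptions that may vary by package version:

```python
import datetime as dt
from twitterscraper import query_tweets

# Scrape original English tweets from one handle inside the study window.
# "-filter:retweets" is a Twitter search operator assumed here to drop retweets.
tweets = query_tweets(
    "from:realDonaldTrump -filter:retweets",
    begindate=dt.date(2020, 3, 1),
    enddate=dt.date(2020, 4, 27),  # check whether enddate is inclusive in your version
    lang="en",
)

for tweet in tweets[:3]:
    print(tweet.timestamp, tweet.text)  # attribute names may differ by version
```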
Number of Tweets by Weekday and Hour
It seems Trump likes to tweet from 1 to 4 pm, while Trudeau likes to tweet around 3 pm.
Both Trump and Trudeau tweet regularly during the week. It seems Trump likes to tweet even more on Sundays!
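These counts come straight from the tweet timestamps. A minimal sketch, assuming the scraped tweets sit in a pandas DataFrame df with a timestamp column (our own naming):

```python
import pandas as pd

# df is assumed to hold one row per tweet with a 'timestamp' column
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["weekday"] = df["timestamp"].dt.day_name()
df["hour"] = df["timestamp"].dt.hour

# tweets per weekday x hour, ready for a heatmap or bar chart
counts = df.groupby(["weekday", "hour"]).size().unstack(fill_value=0)
print(counts)
```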
Tweet Length
From March 1 to April 27, 2020, Trump posted 673 tweets, averaging 27 words per tweet, while Trudeau posted 386 tweets, averaging 41 words per tweet. Trump had many short tweets (fewer than 10 words) and some lengthy ones (over 40 words). Most of Trudeau's tweets ran 40 to 50 words.
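A naive whitespace split is enough for these length statistics. A sketch, again assuming a DataFrame df; the handle column name is our assumption:

```python
# naive word count: split each tweet on whitespace
df["word_count"] = df["text"].str.split().str.len()

# number of tweets and average length per leader
print(df.groupby("handle")["word_count"].agg(["count", "mean"]))
```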
Data Pre-processing
Text pre-processing is needed to transform text from human language into a machine-readable format for further processing. The following pre-processing steps are applied to our tweet texts (a code sketch follows the list).
- Convert all words to lowercase
- Remove non-alphabet characters
- Remove short words (fewer than 3 characters)
- Tokenization: breaking sentences into words
- Part-of-speech (POS) tagging: process of classifying words into their grammatical category, in order to understand their roles in a sentence, e.g. verbs, nouns, adjectives, etc. POS tagging provides grammar context for lemmatization.
- Lemmatization: converting a word to its base form, e.g. car, cars, car’s → car
- Remove common English words, e.g. a, the, of, etc., and remove common words that add very little value to our analysis, e.g. com, twitter, pic, etc.
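Here is a minimal pre-processing sketch with NLTK. The helper names and the extra stop-word set are our own, and this is one possible implementation rather than the exact pipeline behind the tokens shown below:

```python
import re
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# one-time downloads of the NLTK resources used below
for res in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(res, quiet=True)

LEMMATIZER = WordNetLemmatizer()
# standard English stop words plus low-value Twitter-specific words
STOP_WORDS = set(stopwords.words("english")) | {"com", "twitter", "pic"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the tag the WordNet lemmatizer expects."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())             # lowercase, letters only
    tokens = [t for t in word_tokenize(text) if len(t) >= 3]  # drop short words
    lemmas = [LEMMATIZER.lemmatize(w, penn_to_wordnet(t)) for w, t in pos_tag(tokens)]
    return [w for w in lemmas if w not in STOP_WORDS]         # drop stop words
```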
We extract both unigrams and bigrams (pairs of consecutive words) from the texts. After pre-processing, our tweets look like this:
text | token | bigram_token |
---|---|---|
WOW! Thank you, just landed, see everyone soon! #KAG2020pic.twitter.com/QGdfIsOp4u | [wow, thank, land, see, everyone, soon, kag, qgdfisop] | [wow thank, thank land, land see, see everyone, everyone soon, soon kag, kag qgdfisop] |
Departing for the Great State of North Carolina!pic.twitter.com/BjnyTnnHUt | [depart, great, state, north, carolina, bjnytnnhut] | [depart great, great state, state north, north carolina, carolina bjnytnnhut] |
They are staging a coup against Bernie! | [stag, coup, bernie] | [stag coup, coup bernie] |
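The bigram tokens above are simply consecutive token pairs, which a one-liner can produce:

```python
def make_bigrams(tokens):
    """Pair each token with its successor, e.g. [a, b, c] -> ['a b', 'b c']."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(make_bigrams(["stag", "coup", "bernie"]))
# ['stag coup', 'coup bernie']
```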
Word Count and Word Cloud
We use bigrams for our word count and word cloud, as bigrams provide more meaningful insights than single words.
The top 5 most common bigrams in Trump’s tweets are: fake news, white house, united state, news conference, mini mike.
The top 5 most common bigrams in Trudeau’s tweets are: make sure, across country, keep safe, canada emergency, health care.
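A frequency count like this can be produced with collections.Counter, assuming a bigram_token column shaped like the table above:

```python
from collections import Counter

# flatten the per-tweet bigram lists into one stream and count occurrences
bigram_counts = Counter(bg for tokens in df["bigram_token"] for bg in tokens)
print(bigram_counts.most_common(5))
```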
Here is the word cloud of Trump’s tweets:
Here is the word cloud of Trudeau’s tweets:
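One way to render such clouds from the bigram frequencies is the wordcloud package; it is not in our package list above, so treat this as an assumed tooling choice:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# build the cloud straight from the bigram frequency dictionary
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(bigram_counts)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```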
In the next post, we will show how to generate meaningful topics from these tweets by applying the LDA algorithm.
Happy Machine Learning!