Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 2 (Topic Modeling)

In our last post, we extracted #HRTechConf tweets, cleaned up the text, and generated a word cloud that highlighted some of the buzzwords from the conference. But what are the tweets actually talking about? Without reviewing each of the 7,000 tweets, how could we find the popular topics? Let's explore whether tweet topics can be detected automatically by developing a Latent Dirichlet Allocation (LDA) model.

Feature Extraction

Tweets, like any text, must first be converted to vectors of numbers. Each vector is built from a dictionary of the words in the corpus and records how often each word occurs in a given text. The technique we use is called Bag of Words, a simple method of extracting text features. Here are the steps.
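To make the Bag of Words idea concrete, here is a minimal sketch in plain Python. It is not the code from this series: the helper names `build_vocabulary` and `to_bow` and the two toy documents are made up for illustration, and the tweets are assumed to be tokenized already. The sparse (word id, count) pairs it produces follow the general shape that topic modeling libraries expect as input.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id (hypothetical helper)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def to_bow(doc, vocabulary):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(tok for tok in doc if tok in vocabulary)
    return sorted((vocabulary[tok], n) for tok, n in counts.items())


# Toy documents standing in for cleaned, tokenized tweets (made-up data).
docs = [
    ["hr", "tech", "ai", "hr"],
    ["ai", "payroll", "tech"],
]

vocabulary = build_vocabulary(docs)
bow_corpus = [to_bow(doc, vocabulary) for doc in docs]

print(vocabulary)
print(bow_corpus)
```

Word order is discarded on purpose: each document is reduced to which dictionary words it contains and how many times, which is all a Bag of Words representation keeps.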

Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 1 (Word Cloud)

The HR Technology Conference and Expo, the world's leading and largest conference for HR and IT professionals, just took place in Las Vegas from Oct 1 - 4, 2019. An incredible number of HR technology topics were covered at the conference. Unfortunately, not everyone could be there, including myself. Is it possible to tell what the buzzwords and topics were without being there? The answer is YES! I dug into Twitter for some quick insights.

I scrape tweets with #HRTechConf and build a Latent Dirichlet Allocation (LDA) model to automatically detect and interpret topics in the tweets. Here is my pipeline:

  1. Data gathering - Twitter scrape
  2. Data pre-processing
  3. Generating word cloud
  4. Training the LDA model
  5. Visualizing topics
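
As an illustration of step 2, here is a rough sketch of tweet pre-processing using only the Python standard library. It is a stand-in, not the NLTK-based code used in this series: the function name `preprocess`, the tiny `STOP_WORDS` set, and the sample tweet are all assumptions made for the example, and lemmatization is left out.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "and", "at", "is", "of", "to", "in", "for", "on"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet):
    """Lowercase a tweet, strip URLs, mentions and punctuation, drop stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # remove links before punctuation is stripped
    text = MENTION_RE.sub(" ", text)   # remove @user mentions
    text = text.translate(PUNCT_TABLE)  # also drops '#', keeping the hashtag word
    return [tok for tok in text.split() if tok not in STOP_WORDS and len(tok) > 1]


# Made-up sample tweet for demonstration.
tokens = preprocess("Great AI demo at #HRTechConf! https://example.com @someone")
print(tokens)
```

URLs and mentions are removed before punctuation so that fragments such as "https" or a user name do not survive as tokens.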

I use Python 3.6 and the following packages:

  • TwitterScraper, a Python script to scrape for tweets
  • NLTK (Natural Language Toolkit), an NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
  • Gensim, "generate similar", a popular NLP package for topic modeling
  • Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
  • pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data