Can You Tell Someone’s Gender Based on Tweets?

Predicting someone’s demographic attributes based on limited amount of information available is always a hot topic. It is common to use people’s name, ethnicity, location, and pictures for training models that can tell gender. Can you actually guess someone’s gender only based on what they share on Twitter? We will explore this using NLP techniques in this article.

At the end we conclude that it is quite a challenge to predict someone’s gender only using a single tweet. However, by combining prediction results from many tweets of the same person (similar to ensemble techniques like bagging), we may reach much better performance.

Continue reading “Can You Tell Someone’s Gender Based on Tweets?”

Trump And Trudeau Twitter Analysis During COVID-19 Crisis Part 2

In our last post, we extract @realDonalTrump and @JustinTrudeau tweets, clean up the texts, and generate word clouds. In this article, we will build a Latent Dirichlet Allocation (LDA) model to study the topics of the hundreds of tweets posted by the two world leaders.

Topic Modeling

Topic modeling is an unsupervised machine learning technique which is widely used for discovering abstract topics of a collection of documents. It considers each document to be represented by several topics and each topic to be represented by a set of words that frequently appear together. For example, with a cluster of cloud, rain, wind, we can tell that the associated topic likely related to weather.

Continue reading “Trump And Trudeau Twitter Analysis During COVID-19 Crisis Part 2”

Trump And Trudeau Twitter Analysis During COVID-19 Crisis Part 1

During the COVID-19 pandemic, people take their worries, concerns, frustration, and loves to social media to share with the rest of the world about their feelings and thoughts. Twitter has become one of official channels where world leaders communicate with their supporters and followers. To understand what keep them busy, we extract tweets of two world leaders, Donald Trump (the President of United States) and Justin Trudeau (the Prime Minister of Canada). By applying natural language processing techniques and Latent Dirichlet Allocation (LDA) algorithm, topics of their tweets can be learned. So we can see what is on their mind during the crisis.

We use Python 3.6 and the following packages:

  • TwitterScraper, a Python script to scrape for tweets
  • NLTK (Natural Language Toolkit), a NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
  • Gensim, “generate similar”, a popular NLP package for topic modeling
  • Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
  • pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data
Continue reading “Trump And Trudeau Twitter Analysis During COVID-19 Crisis Part 1”

To Know What People Twitter About #Coronavirus In One Minute

Year 2020 is not off to a good start. The ongoing Coronavirus outbreak that originated in Wuhan, China has infected thousands of people worldwide and killed hundreds. Numbers are still rising everyday. With all the quarantine controls and vaccine development, hope this global epidemic will be soon under control.

When we are facing such a global challenge, we take our emotions and concerns to social media and share Coronavirus news with others. Since the outbreak, each day there are hundreds of thousands of tweets about Coronavirus. I decided to run analyses on Twitter feeds and see if I could generate some highlights.

Continue reading “To Know What People Twitter About #Coronavirus In One Minute”

Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 2 (Topic Modeling)

In our last post, we extract #HRTechConf tweets, clean up the texts, and generate a word cloud that highlights some of the buzzwords from the conference. But, what are the tweets talking about? Without reviewing each of the 7,000 tweets, how could we find out the popular topics? Let’s explore and see if tweet topics could be auto detected by developing a Latent Dirichlet Allocation (LDA) model.

Feature Extraction

Tweets or any text must be converted to a vector of numbers – the dictionary that describes the occurrence of words in the text (or corpus). The technique we use is called Bag of Words, a simple method of extracting text features. Here are the steps.

Continue reading “Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 2 (Topic Modeling)”

Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 1 (Word Cloud)

HR Technology Conference and Expo, world’s leading and largest conference for HR and IT professionals, just took place in Las Vegas, from Oct 1 – 4, 2019. An incredibly amount of HR technology topics were covered at the conference. Unfortunately not everyone could be there, including myself. Is it possible to tell what the buzzwords and topics are without being there? The answer is YES! I dig into Twitter for some quick insights.

I scrape tweets with #HRTechConf, and build Latent Dirichlet Allocation (LDA) model for auto detecting and interpreting topics in the tweets. Here is my pipeline:

  1. Data gathering – twitter scrape
  2. Data pre-processing
  3. Generating word cloud
  4. Train LDA model
  5. Visualizing topics

I use Python 3.6 and the following packages:

  • TwitterScraper, a Python script to scrape for tweets
  • NLTK (Natural Language Toolkit), a NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
  • Gensim, “generate similar”, a popular NLP package for topic modeling
  • Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
  • pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data
Continue reading “Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 1 (Word Cloud)”

Things Employees Like and Dislike About Their Companies

I work in people analytics and have been wondering all the time what make employees feel great or bad about their companies. Is it money? Workload? Opportunities to grow? Or team around them? I know the answer depends on the company, but is there anything in common for companies that employees like or dislike the most?

I went to Glassdoor for help. Glassdoor is one of the world’s largest growing job sites where employees anonymously review current or former employers. I did my studies based on the 6,000 companies that have an office in Vancouver, BC.

Continue reading “Things Employees Like and Dislike About Their Companies”

Simple Skill-based Job Recommendation Engine

What are the most demanded skills for data scientists? Python, R, SQL, and the list goes on and on. There are many surveys and reports that show some good statistics on popular data skills. In this post, I am going to gather first-hand information by scraping data science jobs from indeed.ca, analyze top skills required by employers, and make job recommendations by matching skills from resume to posted jobs. It will be fun!

Quick summary of the project workflow:

Workflow
Workflow

Continue reading “Simple Skill-based Job Recommendation Engine”