Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 1 (Word Cloud)

HR Technology Conference and Expo, the world's leading and largest conference for HR and IT professionals, just took place in Las Vegas from Oct 1 - 4, 2019. An incredible range of HR technology topics was covered at the conference. Unfortunately not everyone could be there, including myself. Is it possible to tell what the buzzwords and topics were without being there? The answer is YES! I dug into Twitter for some quick insights.

I scraped tweets tagged #HRTechConf and built a Latent Dirichlet Allocation (LDA) model to automatically detect and interpret the topics in them. Here is my pipeline:

  1. Data gathering - Twitter scraping
  2. Data pre-processing
  3. Word cloud generation
  4. LDA model training
  5. Topic visualization

I use Python 3.6 and the following packages:

  • TwitterScraper, a Python script for scraping tweets
  • NLTK (Natural Language Toolkit), an NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
  • Gensim, "generate similar", a popular NLP package for topic modeling
  • Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
  • pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data

Things Employees Like and Dislike About Their Companies

I work in people analytics and have always wondered what makes employees feel great or bad about their companies. Is it money? Workload? Opportunities to grow? Or the team around them? I know the answer depends on the company, but is there anything in common among the things employees like or dislike the most?

I went to Glassdoor for help. Glassdoor is one of the world's largest and fastest-growing job sites, where employees anonymously review current or former employers. I based my study on the 6,000 companies that have an office in Vancouver, BC.

Web App For Border Crossing Wait Time Forecast – Part 2

Keywords: Web App, Flask, AJAX, API, AWS, Virtual Environment

Previously, I built a Flask web app that runs on my local machine for predicting border crossing wait times. This time I'll show how it gets deployed on AWS and becomes a publicly available web app.

Here is the link to the web app: http://35.164.32.109:5000/

There is a small change to my workflow. Instead of using Facebook Prophet, I switched to an XGBoost model, because Prophet requires a minimum of 4GB of memory while the AWS free tier EC2 instance only has 1GB.

The model is rebuilt daily using the new wait time records from the prior day, and it forecasts the next 7 days. The most recent 7 days of records are held out for model validation, with RMSE as the evaluation metric.
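The retrain-and-validate loop can be sketched as follows. The post uses XGBoost; scikit-learn's `GradientBoostingRegressor` stands in here to keep the example dependency-light, and the wait-time series, feature names, and values are all synthetic placeholders.

```python
# Sketch of the daily retrain: fit on all but the last 7 days,
# validate on the held-out week with RMSE.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=90, freq="D")
df = pd.DataFrame({
    "dayofweek": idx.dayofweek,
    "month": idx.month,
    # invented pattern: weekends are slower at the border
    "wait_minutes": 20 + 10 * (idx.dayofweek >= 5) + rng.normal(0, 3, len(idx)),
}, index=idx)

# Hold out the last 7 days for validation, train on the rest
train, valid = df.iloc[:-7], df.iloc[-7:]
features = ["dayofweek", "month"]
model = GradientBoostingRegressor().fit(train[features], train["wait_minutes"])

rmse = mean_squared_error(valid["wait_minutes"], model.predict(valid[features])) ** 0.5
print(f"7-day holdout RMSE: {rmse:.2f} minutes")
```

Run daily, `df` would grow by one day each time before refitting, which is what keeps the forecasts current.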

Web App For Border Crossing Wait Time Forecast – Part 1

Keywords: Facebook Prophet, Web App, Flask, AJAX, API, AWS

About a year ago I built a predictive model for border crossing wait times. It involved a lot of feature manipulation and parameter tweaking. Although the results were encouraging, I always wanted to simplify the process and make the model available for public use.

After spending two weekends researching and coding (I had no prior knowledge of Prophet or Flask), here is the improved workflow:

  1. Retrieve border crossing wait times from the Cascade Gateway API
  2. Build a predictive model for future crossings using Python + Facebook Prophet
  3. Develop the web app REST API using Flask, HTML, CSS, and AJAX
  4. Deploy the web app on AWS
  5. Refresh data and rebuild the predictive model daily
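Step 3 can be sketched with a minimal Flask endpoint that serves forecasts as JSON for the AJAX front end. The route name and payload shape below are assumptions for illustration, not the app's actual API, and the forecast values are hard-coded placeholders where the real app would read from the daily-refreshed model.

```python
# Minimal sketch of a Flask REST endpoint returning forecast JSON.
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder: the real app would pull these from the Prophet model output
FAKE_FORECAST = [{"date": f"2019-10-{d:02d}", "wait_minutes": 25}
                 for d in range(1, 8)]

@app.route("/forecast")
def forecast():
    # jsonify builds the JSON response the front-end AJAX call consumes
    return jsonify({"forecast": FAKE_FORECAST})

# Exercise the route without starting a server
with app.test_client() as client:
    print(client.get("/forecast").get_json())
```

In production the app would instead run under `app.run()` (or a WSGI server) on the EC2 instance, as in step 4.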

Credit Card Fraud Detection Using SMOTE Technique

Outlier detection is an interesting application of machine learning. The goal is to identify the data records that accurately profile abnormal behavior of a system. However, in real-life examples, such special data, like fraud and spam, makes up a very small percentage of the overall data population, which imposes challenges for developing machine learning models.

In this experiment, we will examine Kaggle's Credit Card Fraud Detection dataset and develop predictive models to detect fraudulent transactions, which account for only 0.172% of all transactions. To deal with the imbalanced dataset, we will first balance the classes in our training data with a resampling technique (SMOTE), and then build a Logistic Regression model optimized for the average precision score.
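The core idea of SMOTE is worth seeing in a few lines: synthetic minority samples are created by interpolating between a minority point and one of its minority-class nearest neighbors. This is a minimal NumPy sketch of that idea with made-up points; the actual experiment would use the `SMOTE` class from the imbalanced-learn package.

```python
# Minimal sketch of the SMOTE idea (not the imbalanced-learn implementation).
import numpy as np

def smote_oversample(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # k nearest, skipping itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
new_points = smote_oversample(minority, n_new=5)
print(new_points.shape)  # (5, 2)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority data already occupies, rather than simply duplicating rows.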

We will build and train our model on Google Colab, a free Jupyter notebook environment that runs in the Google cloud and provides a free GPU! For more information on Colab, check the official Colab page.

Organizational network analysis – an experimental study

In every organization, people build and rely on informal networks for information, advice, and collaboration. Often these invisible networks differ from the formal organization hierarchy. Uncovering the informal but effective networks, and understanding how information flows through the organization, is crucial and enormously valuable to organization leaders.

In this article, we will briefly explain what Organizational Network Analysis (ONA) is and how to measure it effectively. A small sample dataset is used to demonstrate our ONA experiment and network graph.
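A tiny example shows the kind of measurement ONA relies on. The names and "who asks whom for advice" edges below are invented, and NetworkX is assumed as the graph library; betweenness centrality flags the people who broker information flow between otherwise separate groups.

```python
# Toy advice network: edges mean "A asks B for advice".
import networkx as nx

edges = [  # hypothetical advice-seeking pairs
    ("Ana", "Raj"), ("Ben", "Raj"), ("Cara", "Raj"),
    ("Raj", "Dee"), ("Dee", "Eli"), ("Dee", "Fay"),
]
G = nx.Graph(edges)

# Betweenness centrality: how often a person sits on shortest paths
# between others, i.e. how much they broker information flow.
centrality = nx.betweenness_centrality(G)
broker = max(centrality, key=centrality.get)
print(broker, round(centrality[broker], 2))
```

Here Raj scores highest: every path from Ana, Ben, or Cara to the rest of the network passes through him, which is exactly the kind of informal hub a formal org chart would miss.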

This post is part of a series of people analytics experiments:

People Analytics – Attrition Predictions

According to the U.S. Bureau of Labor Statistics, 4.5 years is the average amount of time employees stay with their company today. Attrition hurts an organization's financials and morale, considering the amount of time spent training employees. Can management learn from past attrition and manage to reduce turnover? The answer is yes. We will build some predictive models using the fictional IBM dataset, which contains 1,470 employee attrition records.
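The modeling step can be sketched as a standard classification setup. The frame below is synthetic, with two invented features and a made-up attrition rule, so the example is self-contained; the real experiment would load the 1,470-row IBM dataset instead.

```python
# Hedged sketch of an attrition classifier on synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "years_at_company": rng.integers(0, 15, n),
    "monthly_income": rng.integers(2000, 12000, n),
})
# Made-up rule: short tenure + low income -> more likely to leave
df["attrition"] = ((df.years_at_company < 3) & (df.monthly_income < 5000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["years_at_company", "monthly_income"]], df["attrition"],
    test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"holdout accuracy: {acc:.2f}")
```

On the real dataset, feature importances from the fitted model are what let management see which factors drive turnover.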

This post is part of a series of people analytics experiments I am putting together:

Simple Skill-based Job Recommendation Engine

What are the most in-demand skills for data scientists? Python, R, SQL, and the list goes on and on. There are many surveys and reports with good statistics on popular data skills. In this post, I am going to gather first-hand information by scraping data science jobs from indeed.ca, analyze the top skills required by employers, and make job recommendations by matching skills from a resume to posted jobs. It will be fun!

Quick summary of the project workflow:

[Figure: project workflow]
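The matching step itself is simple set arithmetic: score each posting by the overlap between the resume's skills and the skills extracted from the job text. The skill lists below are invented for illustration, with Jaccard similarity as the (assumed) scoring function.

```python
# Rank job postings by skill overlap with a resume.
resume_skills = {"python", "sql", "machine learning", "spark"}

jobs = {  # hypothetical postings with their extracted skill sets
    "Data Scientist": {"python", "sql", "machine learning", "statistics"},
    "BI Analyst": {"sql", "tableau", "excel"},
    "ML Engineer": {"python", "spark", "machine learning", "docker"},
}

def jaccard(a, b):
    """Overlap score: |intersection| / |union|."""
    return len(a & b) / len(a | b)

ranked = sorted(jobs, key=lambda j: jaccard(resume_skills, jobs[j]), reverse=True)
for title in ranked:
    print(title, round(jaccard(resume_skills, jobs[title]), 2))
```

The same scoring works once the skill sets come from scraped postings instead of hand-written dictionaries; only the extraction step changes.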

How many bikes to be shared in Vancouver NEXT WEEK – Part 2

This is Part 2 of building predictive models on Vancouver bike share data. Part 1 is here. The Python code can be found on my GitHub.

Model Training

The training dataset contains hourly bike rentals for each day from 01/01/2017 to 07/24/2018. Two decision tree models were trained: Random Forest (RF) and Gradient Boosted Trees (GBM). They are well known for delivering good performance and efficiency on noisy datasets. However, tuning their hyperparameters so that they do not overfit can be a challenge.
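The two-model comparison can be sketched like this. The features, target, and rental pattern below are synthetic stand-ins for the Mobi data, and the `max_depth` settings illustrate the kind of hyperparameter that is tuned to keep the ensembles from overfitting.

```python
# Compare RF and GBM on synthetic hourly-rental data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "temp_c": rng.uniform(5, 30, n),
})
# Invented pattern: rentals peak in warm late-afternoon hours, plus noise
df["rentals"] = 50 + 3 * df.temp_c - 2 * abs(df.hour - 17) + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(
    df[["hour", "temp_c"]], df["rentals"], random_state=0)

for model in (RandomForestRegressor(n_estimators=100, max_depth=6, random_state=0),
              GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(type(model).__name__, f"MAE: {mae:.1f}")
```

Capping tree depth (and, in practice, cross-validating it) is what keeps both ensembles from memorizing the noise term in the training data.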

How many bikes to be shared in Vancouver NEXT WEEK – Part 1

Despite worldwide debate on the benefits and challenges of bike sharing, Vancouver launched its own bike share program, Mobi sponsored by Shaw Go, in the summer of 2016. The first bike share appeared in Amsterdam in the 1960s and was then introduced to other big European cities. It was popularized in China over the last decade: 13 of the 15 biggest bike share programs in the world are in China. I like bike share programs because they are simply convenient and good for the environment. So I decided to look into Vancouver's bike share historical data, hoping to find some trends and patterns. Thanks to Mobi making their bike usage data available, predictive models can be built to forecast future rides.

Quick summary of the project workflow:

[Figure: project workflow]