What Skills Do You Need to Become a Data Engineer

People often ask me what skills are needed to become a data engineer. Before answering that question, let's take a look at what data engineers do. According to Coursera:

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.

Data engineering has become the backbone of many applications across industries, and data engineers are an indispensable asset for many organizations.

I like using data to answer questions. I extracted 550 United States data engineer job postings from indeed.com and ran some quick analyses on job description, location, and salary range. Although the sample size is not large, it should be sufficient to reveal some insights and trends.
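As a rough illustration of this kind of quick analysis, the sketch below counts how many job descriptions mention each skill keyword. The descriptions and the keyword list are made-up placeholders, not the Indeed data, and the original analysis may have been done differently.

```python
import re
from collections import Counter

# Hypothetical job descriptions; the real analysis used scraped postings.
descriptions = [
    "Build ETL pipelines in Python and SQL on AWS.",
    "Experience with Spark, SQL, and Airflow required.",
    "Python, Spark, and cloud platforms such as AWS or Azure.",
]

# Assumed keyword list; extend it to match the skills you care about.
skills = ["python", "sql", "spark", "aws", "azure", "airflow"]


def count_skill_mentions(texts, keywords):
    """Count how many texts mention each keyword at least once."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into simple word tokens.
        tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
        for keyword in keywords:
            if keyword in tokens:
                counts[keyword] += 1
    return counts


skill_counts = count_skill_mentions(descriptions, skills)
for skill, n in skill_counts.most_common():
    print(f"{skill}: {n} of {len(descriptions)} postings")
```

Counting postings rather than raw word occurrences keeps one long description from dominating the ranking.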

Exploratory Spatial Data Analysis (ESDA) – Spatial Autocorrelation

In exploratory data analysis (EDA), we often calculate correlation coefficients and present the results in a heatmap. A correlation coefficient measures the statistical relationship between two variables. Its value indicates how strongly, and in which direction, a change in one variable is associated with a change in the other, e.g. quantity purchased vs. price. Correlation analysis is an important step in predictive analytics before building a model.
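To make the idea concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The price and quantity figures are invented for illustration; in practice a library routine would normally be used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, divided by the product of the
    # square roots of the summed squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: quantity purchased falls linearly as price rises.
price = [1, 2, 3, 4, 5]
quantity = [10, 8, 6, 4, 2]

r = pearson(price, quantity)
print(f"correlation between price and quantity: {r:.3f}")
```

A value near -1 means the two variables move in opposite directions, a value near +1 means they move together, and a value near 0 means there is no linear relationship.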

But how do we measure statistical relationships in a spatial dataset with geographic locations? Conventional EDA and correlation analysis ignore the spatial nature of the data and treat geographic coordinates like any other feature. This is where Exploratory Spatial Data Analysis (ESDA) becomes very useful for analyzing location-based data.

Spatial Autocorrelation

ESDA is intended to complement geovisualization through formal statistical tests for spatial clustering, and measuring spatial autocorrelation is one of the main goals of those tests. Spatial autocorrelation measures the correlation of a variable with itself across space, i.e. its relationship to the values at neighboring locations on a graph. Values can be:

  • positive: nearby cases are similar or clustered e.g. High-High or Low-Low (left image on the figure below)
  • neutral: neighboring cases have no particular relationship, i.e. they are random with no pattern (center image on the figure below)
  • negative: nearby cases are dissimilar or dispersed e.g. High-Low or Low-High (right image on the figure below)
Illustrations of spatial autocorrelation. From (Radil, 2011).
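
One common statistic for spatial autocorrelation is Moran's I. The excerpt above does not name it, so treat the following as an illustrative sketch rather than the post's method. It uses a small grid with binary weights between cells that share an edge (rook contiguity); the grid values are made up.

```python
def rook_neighbors(rows, cols):
    """Map each grid cell index to the cells sharing an edge with it."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            neighbors[r * cols + c] = [
                rr * cols + cc
                for rr, cc in candidates
                if 0 <= rr < rows and 0 <= cc < cols
            ]
    return neighbors


def morans_i(values, neighbors):
    """Moran's I with binary weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0       # sum of w_ij * dev_i * dev_j over neighbor pairs
    weight_total = 0  # sum of all weights w_ij
    for i, js in neighbors.items():
        for j in js:
            cross += dev[i] * dev[j]
            weight_total += 1
    return (n / weight_total) * (cross / sum(d * d for d in dev))


grid = rook_neighbors(4, 4)
# Left half high, right half low: similar values sit next to each other.
clustered = [1, 1, 0, 0] * 4
# Checkerboard: every neighbor has the opposite value.
dispersed = [(r + c) % 2 for r in range(4) for c in range(4)]

print("clustered:", round(morans_i(clustered, grid), 3))
print("dispersed:", round(morans_i(dispersed, grid), 3))
```

The clustered layout yields a positive value and the checkerboard a negative one, matching the positive and negative cases in the list above.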

Can You Tell Someone’s Gender Based on Tweets?

Predicting someone’s demographic attributes from a limited amount of available information is always a hot topic. It is common to use people’s names, ethnicity, location, and pictures to train models that can tell gender. Can you actually guess someone’s gender based only on what they share on Twitter? We will explore this using NLP techniques in this article.

In the end, we conclude that it is quite a challenge to predict someone’s gender using only a single tweet. However, by combining prediction results from many tweets by the same person (similar to ensemble techniques like bagging), we may reach much better performance.
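
Combining per-tweet predictions could look roughly like the sketch below. It assumes a hypothetical classifier that outputs, for each tweet, the probability of one class; the numbers are invented, and the article's actual aggregation may differ.

```python
from statistics import mean


def predict_user(tweet_probs, threshold=0.5):
    """Combine per-tweet class probabilities into a user-level prediction.

    Returns the average probability (soft voting) and whether a majority
    of tweets individually exceed the threshold (hard voting).
    """
    average = mean(tweet_probs)
    votes = sum(p > threshold for p in tweet_probs)
    majority = votes > len(tweet_probs) / 2
    return average, majority


# Hypothetical per-tweet probabilities for one user: each is only weakly
# informative on its own.
probs = [0.55, 0.62, 0.48, 0.70, 0.58]
average, majority = predict_user(probs)
print(f"average probability: {average:.3f}, majority vote: {majority}")
```

Individual tweets hover near 0.5, but four of the five lean the same way, so the aggregate is more decisive than any single prediction.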

What Skills Do You Need to Become an HR Analyst

Through my work, I have opportunities to interact with HR analysts every day. I have always been curious about what skills one would need to become an HR analyst.

Below is an HR Analyst job summary from SHRM.org. Other names for HR Analyst include People Analytics Analyst, Workforce Analytics Specialist, Data Analyst - People Analytics, etc.

The Human Resource (HR) Analyst will collect, compile, and analyze HR data, metrics, and statistics, and apply this data to make recommendations related to recruitment, retention, and legal compliance.

SHRM.org

To become an HR analyst, one needs HR-related domain knowledge. That topic will be explored in another article. Here we will focus only on technical skills, general skills, and education.

Trump And Trudeau Twitter Analysis During COVID-19 Crisis Part 2

In our last post, we extracted @realDonaldTrump and @JustinTrudeau tweets, cleaned up the text, and generated word clouds. In this article, we will build a Latent Dirichlet Allocation (LDA) model to study the topics of the hundreds of tweets posted by the two world leaders.

Topic Modeling

Topic modeling is an unsupervised machine learning technique that is widely used for discovering abstract topics in a collection of documents. It treats each document as a mixture of several topics and each topic as a set of words that frequently appear together. For example, from a cluster of words such as cloud, rain, and wind, we can tell that the associated topic is likely related to weather.

Trump And Trudeau Twitter Analysis During COVID-19 Crisis Part 1

During the COVID-19 pandemic, people have taken their worries, concerns, frustrations, and love to social media to share their feelings and thoughts with the rest of the world. Twitter has become one of the official channels through which world leaders communicate with their supporters and followers. To understand what keeps them busy, we extract the tweets of two world leaders, Donald Trump (the President of the United States) and Justin Trudeau (the Prime Minister of Canada). By applying natural language processing techniques and the Latent Dirichlet Allocation (LDA) algorithm, we can learn the topics of their tweets and see what is on their minds during the crisis.

We use Python 3.6 and the following packages:

  • TwitterScraper, a Python script to scrape for tweets
  • NLTK (Natural Language Toolkit), an NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
  • Gensim, “generate similar”, a popular NLP package for topic modeling
  • Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling
  • pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data

COVID-19 (Coronavirus) Power BI Dashboard

On March 11, 2020, the WHO declared the novel coronavirus outbreak a pandemic, that is, a new disease that has spread around the world. Many countries have reported cases of the virus.

To help track and understand the daily spread of the virus, I built this Power BI dashboard. It provides an overview of confirmed and recovered COVID-19 cases worldwide. It contains daily updates from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus repository.

To Know What People Twitter About #Coronavirus In One Minute

The year 2020 is not off to a good start. The ongoing Coronavirus outbreak that originated in Wuhan, China has infected thousands of people worldwide and killed hundreds. Numbers are still rising every day. With all the quarantine controls and vaccine development, I hope this global epidemic will soon be under control.

When we are facing such a global challenge, we take our emotions and concerns to social media and share Coronavirus news with others. Since the outbreak, each day there are hundreds of thousands of tweets about Coronavirus. I decided to run analyses on Twitter feeds and see if I could generate some highlights.

Who Are the Top HR Analytics Influencers on Twitter

Visualizing Twitter social network of HRanalytics

Every day, people use social media such as Twitter to share thoughts and ideas. People with similar interests come together and interact on the platform by re-sharing or replying to posts they like. Studying how people interact on social networks helps us understand how information spreads and identify the most prominent figures.

In our last post, we did a topic modeling study using #HRTechConf Twitter feeds and trained a model to learn the topics of all the tweets. In this article, we will analyze Twitter user interactions and visualize them in an interactive graph.
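
One simple way to spot prominent accounts is to count how often each user is retweeted or replied to, i.e. the in-degree of each node in the interaction graph. The sketch below uses invented user names and interactions, and in-degree is only one of several possible influence measures; it is not necessarily the one used in the post.

```python
from collections import Counter

# Hypothetical (acting_user, target_user) pairs: the first user retweeted
# or replied to the second.
interactions = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "erin"),
    ("bob", "erin"),
    ("alice", "bob"),
]


def in_degree(edges):
    """Count incoming interactions for each target user."""
    return Counter(target for _, target in edges)


ranking = in_degree(interactions).most_common()
for user, count in ranking:
    print(f"{user}: {count} incoming interactions")
```

Users with many incoming interactions are the ones whose posts others amplify, which makes them natural candidates for the largest nodes in a network visualization.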

Social Network is a network of social interactions and personal relationships. 

Oxford Dictionary

Auto Generated Insights of 2019 HR Tech Conference Twitter – Part 2 (Topic Modeling)

In our last post, we extracted #HRTechConf tweets, cleaned up the text, and generated a word cloud that highlights some of the buzzwords from the conference. But what are the tweets talking about? Without reviewing each of the 7,000 tweets, how could we find out the popular topics? Let's explore whether tweet topics can be detected automatically by developing a Latent Dirichlet Allocation (LDA) model.

Feature Extraction

Tweets, like any text, must be converted to vectors of numbers that describe the occurrence of each dictionary word in the text (or corpus). The technique we use is called Bag of Words, a simple method of extracting text features. Here are the steps.
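
A minimal sketch of the idea, using only the Python standard library and two made-up tweets (the post's own steps may differ in detail): build a vocabulary from all texts, then count how often each vocabulary word occurs in each text.

```python
from collections import Counter

# Hypothetical tweets, already cleaned of punctuation.
tweets = [
    "great keynote on hr tech",
    "hr tech expo was great great",
]

# Step 1: tokenize each tweet into lowercase words.
tokenized = [tweet.lower().split() for tweet in tweets]

# Step 2: build the vocabulary (the "dictionary") of unique words.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Step 3: represent each tweet as a vector of word counts, one position
# per vocabulary word. Word order in the tweet is discarded.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has the same length as the vocabulary, so texts of different lengths become directly comparable inputs for a model such as LDA.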