
Credit Card Fraud Detection Using SMOTE Technique

Outlier detection is an interesting application of machine learning. The goal is to identify the data records that accurately profile abnormal behavior of a system. However, in real-life settings, such special data, like fraud and spam, makes up only a very small percentage of the overall data population, which poses challenges for developing machine learning models.

In this experiment, we will examine Kaggle’s Credit Card Fraud Detection dataset and develop predictive models to detect fraudulent transactions, which account for only 0.172% of all transactions. To deal with the unbalanced dataset, we will first balance the classes of our training data with a resampling technique (SMOTE), and then build a Logistic Regression model by optimizing the average precision score.

We will build and train our model on Google Colab, a free Jupyter notebook environment that runs in the Google cloud and provides a free GPU! For more information on Colab, check the official Colab page.

Python code can be found on my GitHub.

Data Preparation

The dataset contains transactions made with credit cards in September 2013 by European cardholders over a two-day period. There are 492 frauds out of a total of 284,807 transactions. It is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

This is what the dataset columns consist of:

  • Class: 0 – normal transaction, 1 – fraud
  • Amount: transaction amount
  • V1, V2, …, V28: anonymized features (for privacy reasons); numerical values resulting from a PCA transformation
  • Time: the number of seconds elapsed between each transaction and the first transaction in the dataset
Sample data

As part of data preprocessing, we normalized the Amount column and dropped Time from the feature set.
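Assuming the dataset has been downloaded from Kaggle as creditcard.csv (the file name here is an assumption), a minimal sketch of this preprocessing step could look like:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Kaggle dataset (file name assumed)
df = pd.read_csv('creditcard.csv')

# Normalize the Amount column and drop Time from the feature set
df['normAmount'] = StandardScaler().fit_transform(df[['Amount']]).ravel()
df = df.drop(['Time', 'Amount'], axis=1)

X = df.drop('Class', axis=1)
y = df['Class']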

Let’s visualize the skewness:


Highly unbalanced dataset: 492 frauds out of a total of 284,807 transactions
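A plot like the one above can be produced along these lines (a rough sketch, assuming the preprocessed DataFrame df from the previous step):

import matplotlib.pyplot as plt

# 284,315 normal transactions vs. 492 frauds
class_counts = df['Class'].value_counts()
print(class_counts)

class_counts.plot(kind='bar')
plt.xticks([0, 1], ['Normal (0)', 'Fraud (1)'], rotation=0)
plt.ylabel('Number of transactions')
plt.title('Class distribution')
plt.show()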

Evaluation Metrics

Before training our model, we must be clear about how to measure its performance, i.e., what the model should optimize. Typically, an accuracy score (the fraction of correct predictions) is used to measure a predictive model’s performance. However, it does not work well for a highly unbalanced dataset. On our dataset, if one always predicts a given transaction as normal, i.e. non-fraud, the accuracy score will be 0.998, which sounds almost perfect, yet no fraud will ever be reported! It is a useless predictive model with “perfect” accuracy.
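As a quick check of that number:

# Baseline: always predict "normal", never flag a fraud
n_total, n_frauds = 284_807, 492
print((n_total - n_frauds) / n_total)   # 0.99827... -> ~0.998 accuracy, zero frauds caught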

Let’s review two common metrics for model evaluation.

Precision = TP/(TP+FP): measures how accurate the positive (fraud) predictions are

Recall = TP/(TP+FN): measures how many of the actual positives are found

where

TP: True Positives (actually fraud and predicted as fraud)

FP: False Positives (actually normal and predicted as fraud)

FN: False Negatives (actually fraud and predicted as normal)
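As a tiny toy example (the labels here are hypothetical, not from this dataset), the two metrics can be computed with scikit-learn like so:

from sklearn.metrics import precision_score, recall_score

# 1 = fraud, 0 = normal; in this toy example TP = 1, FP = 1, FN = 2
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 1/2 = 0.5
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 1/3 ≈ 0.33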

Because letting fraudulent transactions pass through is quite costly to the business and the credit card holder, False Negatives (actually fraud but predicted as normal) should be minimized. Therefore, a higher Recall is desired.

Often, as a tradeoff, increasing Recall tends to lower Precision. If our model flags too many normal transactions as fraudulent (to increase recall), it will become very annoying to regular credit card users and drive people away from the service. That is a nightmare no business wants to see. Precision and Recall must be balanced.

A precision-recall curve shows the tradeoff between precision and recall for different prediction thresholds. It can be summarized by an Average Precision score: the weighted mean of precision at each threshold, where the weight is the increase in recall from the previous threshold. We use Average Precision as a balanced measure of precision and recall to evaluate our model.
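In scikit-learn this score is available directly; the labels and predicted fraud probabilities below are made up just to illustrate the call:

from sklearn.metrics import average_precision_score

# Hypothetical true labels and predicted fraud probabilities
y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

print(average_precision_score(y_true, y_scores))  # ~0.92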

Model Training Without Re-sampling

We hold out 20% of the data for testing and use the rest for training. The training data is split into 5 folds for cross-validation. A Logistic Regression model is built for this binary classification task. We run a random search to find the optimal hyperparameters, i.e., the regularizer (L1 or L2) and the regularization strength.
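The full training code is on GitHub; a minimal sketch of this step with scikit-learn might look like the following (the search space, seeds, and solver here are assumptions, not necessarily what produced the output below; X and y come from the preprocessing step above):

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

# Hold out 20% of the data, stratified so the 0.172% fraud ratio is preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Random search over regularizer type and strength, 5-fold CV,
# optimizing average precision rather than accuracy
param_dist = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000),
    param_distributions=param_dist,
    n_iter=8, cv=5, scoring='average_precision', random_state=42)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)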

Took 2.83 minutes to find optimal parameters 
Best parameters for model: {'penalty': 'l2', 'C': 0.1} 
Best precision-recall score from training: 0.7642589240077233
Confusion Matrix 
[[56855     9]  
[   44    54]] 
Classification Report
               precision    recall  f1-score   support
            0       1.00      1.00      1.00     56864
            1       0.86      0.55      0.67        98
     accuracy                           1.00     56962
    macro avg       0.93      0.78      0.84     56962
 weighted avg       1.00      1.00      1.00     56962 

Due to the very limited number of fraud records in the training dataset, our model catches only 55% of frauds, while precision is very high because normal transactions dominate the dataset.

Model Training With SMOTE (Over-sampling)

We balance the classes of our training data by using SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is an oversampling algorithm that increases the number of positive-class examples by producing synthetic ones. After applying SMOTE, the number of fraud instances in our training dataset equals the number of normal transactions.
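A minimal sketch of this step with imbalanced-learn’s SMOTE (the random seed is an assumption; X_train, y_train and search come from the split and search above, and only the training split is resampled so the held-out test set stays untouched):

from collections import Counter
from imblearn.over_sampling import SMOTE

# Over-sample only the training split; the test set keeps its natural skew
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(Counter(y_train))      # heavily skewed towards the normal class
print(Counter(y_train_res))  # both classes now have the same count

# The same random search as before is then fit on the balanced data
search.fit(X_train_res, y_train_res)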

Took 6.78 minutes to find optimal parameters 
Best parameters for model: {'penalty': 'l2', 'C': 10} 
Best precision-recall score from training: 0.988851588589584 
Confusion Matrix 
[[55557  1307]  
[    8    90]] 
Classification Report
               precision    recall  f1-score   support
            0       1.00      0.98      0.99     56864
            1       0.06      0.92      0.12        98
     accuracy                           0.98     56962
    macro avg       0.53      0.95      0.55     56962
 weighted avg       1.00      0.98      0.99     56962 

With over-sampling, 92% of frauds are captured, at the cost of roughly 2% of normal transactions being reported as fraud (FP). Essentially, it is a decision the business has to make: which is more costly, failing to catch many frauds or falsely stopping normal transactions?

Below is the precision-recall curve for our predictions. It has an average precision score of 0.75, which is not bad. One could adjust the prediction threshold to achieve a balanced precision-recall tradeoff.
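The curve and the threshold adjustment could be produced along these lines (a sketch, assuming the fitted search object and the held-out test set from above; the threshold value is just an example):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# Predicted fraud probabilities on the held-out test set
y_scores = search.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
print('Average precision:', average_precision_score(y_test, y_scores))

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.show()

# Trade a little recall for precision by raising the decision threshold
custom_threshold = 0.9
y_pred = (y_scores >= custom_threshold).astype(int)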

Future Work

  • Try the same dataset with an under-sampling technique (a rough sketch is given below)
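As a starting point (not part of the original experiment), under-sampling with imbalanced-learn’s RandomUnderSampler could look like:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Down-sample the majority (normal) class instead of synthesizing new frauds
rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

print(Counter(y_train_under))  # both classes reduced to the original fraud count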

Again, the Python code can be found on my GitHub.

Happy Machine Learning!