Credit Card Fraud Detection with Logistic Regression
Intro
We will build a logistic regression model on PCA-transformed features. Dataset: data/creditcard.csv (source: Kaggle).
This model assumes that normal and fraud transactions can be separated by a single linear boundary. I also assume that fraud techniques stay the same over time, i.e. the model remains valid over the long term (otherwise it is useful only as a practice exercise).
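For reference, logistic regression models the fraud probability as sigmoid(w·x + b), so its decision boundary w·x + b = 0 is a single hyperplane; a tiny NumPy sketch with made-up, illustrative weights (not fitted values):
import numpy as np

def sigmoid(z):
    # logistic function: maps the linear score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# illustrative weights, intercept, and point only
w, b = np.array([-2.2, 2.8]), 0.5
x = np.array([0.5, -1.0])
print(sigmoid(w @ x + b))  # predicted fraud probability for this toy point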
Import packages
from freq_utils import fsize, plot_learning_curve # freq_utils.py is my custom file
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, log_loss
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
# figure cosmetic function
def fsize(w, h, c=False):
    # set figure size
    plt.rcParams["figure.figsize"] = [w, h]
    # use constrained layout to adjust the plot automatically
    plt.rcParams['figure.constrained_layout.use'] = c
Read dataset
df_read = pd.read_csv("data/creditcard.csv")
df_read.info()
print(df_read.duplicated().value_counts())
df_read.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
False 283726
True 1081
dtype: int64
 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
Data formats are valid. There are no missing values, but there are duplicated rows.
# drop duplication
df_read.drop_duplicates(inplace=True)
# Check number of each class
df_read.Class.value_counts()
0 283253
1 473
Name: Class, dtype: int64
Undersampling and Train Test Split
To fit a logistic regression, the classes should be balanced. I'll balance the number of samples by undersampling the majority class, since we have enough data.
Then I split the train and test sets before exploring the data.
# Split each class separately; index 0 = train, 1 = dev, 2 = test
normal = df_read[df_read.Class==0]
fraud = df_read[df_read.Class==1]
normal0, normal2 = train_test_split(normal, test_size = 0.2, random_state=1)
normal0, normal1 = train_test_split(normal0, test_size = 0.2, random_state=2)
fraud0, fraud2 = train_test_split(fraud, test_size = 0.2, random_state=3)
fraud0, fraud1 = train_test_split(fraud0, test_size = 0.2, random_state=4)
# Undersampling for training and dev sets
df = pd.concat([fraud0,normal0.sample(len(fraud0))])
df_dev = pd.concat([fraud1,normal1.sample(len(fraud1))])
# Make a test sample realistic, i.e. 0.172% of transactions are fraud
df_test = pd.concat([fraud2, normal2])
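As a quick sanity check, we can print the class balance of each set (a small added sketch; the exact counts depend on the unseeded .sample() calls above):
# sanity check: class balance of the undersampled train / dev sets and the realistic test set
for name, d in [('train', df), ('dev', df_dev), ('test', df_test)]:
    counts = d.Class.value_counts()
    print(name, counts.to_dict(), 'fraud ratio:', round(counts.get(1, 0) / len(d), 5))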
EDA and Feature Engineering
Time and amount
Before we begin, let's look at the two features we already understand: Time and Amount.
fsize(16,8)
# Time and Amount joint plot with Class hue
sns.jointplot(x='Time', y='Amount', data=df, hue='Class',)
sns.jointplot(x='Time', y='Amount', data=df[df.Class==1], kind="hex", color="#4CB391")
Neither Amount nor Time alone shows anything that clearly separates the two classes. Contrary to my expectation, the typical fraud amount is quite small.
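To put a number on that observation, one could compare Amount statistics by class on the training frame; a quick added sketch (output not quoted here):
# summary statistics of Amount per class (0 = normal, 1 = fraud)
print(df.groupby('Class').Amount.describe()[['mean', '50%', 'max']])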
Select features
We will select features that are strongly correlated with Class. While selecting, any feature that is too strongly correlated with an already selected feature will be dropped to reduce redundancy.
# sort features according to correlation with Class
sorted_features = df.corr().Class.abs().sort_values(ascending=False).drop('Class').index.to_numpy()
# features to drop
to_drop = set()
# add features to drop due to high correlation with another, higher-ranked feature
for x in sorted_features:
    if x in to_drop:
        continue
    for y in sorted_features:
        if y in to_drop:
            continue
        elif x == y:
            continue
        else:
            val = df[x].corr(df[y])
            if val > 0.85:
                to_drop.add(y)
print('Features to drop:', to_drop)
Features to drop: {'V12', 'V7', 'V16', 'V18', 'V3'}
# Select highly correlated features again
sorted_features = df.corr().Class.abs().sort_values(ascending=False).drop('Class').drop(list(to_drop))
print('Feature correlation with Class:\n', sorted_features)
Feature correlation with Class:
V14 0.720942
V4 0.694075
V11 0.646819
V10 0.605722
V9 0.522540
V17 0.516138
V2 0.461733
V6 0.428790
V1 0.404820
V5 0.322045
V19 0.248267
V20 0.219867
V27 0.106493
V22 0.105856
V15 0.090137
V28 0.087062
V24 0.086127
Amount 0.085669
V21 0.079010
V8 0.073927
V23 0.065870
V13 0.038041
V25 0.031948
V26 0.028809
Time 0.026600
Name: Class, dtype: float64
# DataFrame with the selected features, for training
# Select 5 features for now, to check distribution
selected_features = sorted_features.iloc[:5]
train = df[selected_features.index.tolist()+['Class']]
# Visualize correlations and distributions
sns.pairplot(train, hue="Class")
plt.show()
- From this pair plot, we can see that the selected features have strong discriminating power.
- Also, each feature separates the two classes reasonably well with a single linear boundary, so logistic regression should perform well here.
- I also noticed that the features are not scaled. Let's do that next.
Feature scaling
scaler = StandardScaler()
# fit with training set
scaler.fit(df.drop('Class',axis=1))
# transform all sets
df[df.columns[:-1]] = scaler.transform(df.drop('Class',axis=1))
df_dev[df_dev.columns[:-1]] = scaler.transform(df_dev.drop('Class',axis=1))
df_test[df_test.columns[:-1]] = scaler.transform(df_test.drop('Class',axis=1))
# Confirm transformation result
train = df[selected_features.index.tolist()+['Class']]
sns.pairplot(train, hue="Class")
plt.show()
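As a quick numeric check (a small added sketch, not from the original run), the scaled training features should now have mean ≈ 0 and standard deviation ≈ 1:
# scaled training features should have mean ~0 and std ~1
print(df.drop('Class', axis=1).describe().loc[['mean', 'std']].round(3))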
Train
Let's train the model and plot its learning curve. For the scoring metric, I'll use cross entropy (log loss).
def learning_curve_wrapper(X, y):
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
    model = LogisticRegression()
    #scorer = make_scorer(recall_score, greater_is_better=True)
    # cross entropy (log loss); needs_proba=True so the scorer receives predicted probabilities
    scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
    title = "Learning Curves"
    plot_learning_curve(model, title, X, y, scoring=scorer,
                        train_sizes=np.linspace(.1, 1.0, 20))
    plt.show()
    return model
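For reference, plot_learning_curve lives in my custom freq_utils.py and is not shown in this post; a minimal stand-in, assuming it mirrors the scikit-learn learning-curve example, could look like this:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=None, scoring=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    # compute mean train / validation scores over increasing training-set sizes
    sizes, train_scores, valid_scores = learning_curve(
        estimator, X, y, cv=cv, scoring=scoring, train_sizes=train_sizes)
    plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(sizes, valid_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.title(title)
    plt.legend()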
fsize(12,6)
model = learning_curve_wrapper(train[train.columns[:-1]], train.Class)
Both the training and validation scores saturate at a similar value, with no sign of overfitting.
Hyperparameter tuning
The hyperparameters to tune are:
- the number of selected features, and
- the regularization type and strength.
Here, I'll try L1 and L2 regularization with the "liblinear" solver because it is a good choice for small datasets.
model_var = []
imodel = 0
#for scoring in ['recall','accuracy','precision', 'f1']:  # (alternative: loop over scoring metric)
# loop over the number of selected features
for num_features in range(1, 20):
    selected_features = sorted_features.iloc[:num_features]
    train = df[selected_features.index.tolist() + ['Class']]
    dev = df_dev[selected_features.index.tolist() + ['Class']]
    X_train = train[train.columns[:-1]]
    y_train = train.Class
    X_dev = dev[dev.columns[:-1]]
    y_dev = dev.Class
    # logistic regression with L1/L2 regularization; the solver is liblinear
    model = LogisticRegression(solver='liblinear')
    penalty = ['l1', 'l2']
    # C is the inverse of the regularization strength;
    # smaller values specify stronger regularization
    C = [100, 30, 10, 3.0, 1.0, 0.3, 0.1, 0.03, 0.01]
    # dictionary of hyperparameters
    distributions = dict(penalty=penalty, C=C)
    # grid search over penalty and C
    clf = GridSearchCV(model, distributions, scoring='neg_log_loss', cv=5)
    search = clf.fit(X_train, y_train)
    # best hyperparameter dict
    bp = search.best_params_
    # refit with the best hyperparameters and score on the validation (dev) set
    model = LogisticRegression(solver='liblinear', penalty=bp['penalty'], C=bp['C'])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_dev)
    y_proba = model.predict_proba(X_dev)[:, 1]
    rc = recall_score(y_dev, y_pred)
    ac = accuracy_score(y_dev, y_pred)
    pr = precision_score(y_dev, y_pred)
    f1 = f1_score(y_dev, y_pred)
    coefs = pd.Series(model.coef_.flatten(), index=selected_features.index.tolist())
    model_var.append([imodel, model, coefs, num_features])
    print(imodel, num_features, bp, '\t', round(rc, 3), round(ac, 3), round(pr, 3), round(f1, 3))
    imodel += 1
0 1 {'C': 30, 'penalty': 'l2'} 0.947 0.961 0.973 0.96
1 2 {'C': 10, 'penalty': 'l2'} 0.934 0.934 0.934 0.934
2 3 {'C': 3.0, 'penalty': 'l2'} 0.934 0.941 0.947 0.94
3 4 {'C': 3.0, 'penalty': 'l2'} 0.934 0.947 0.959 0.947
4 5 {'C': 3.0, 'penalty': 'l2'} 0.947 0.954 0.96 0.954
5 6 {'C': 3.0, 'penalty': 'l2'} 0.934 0.947 0.959 0.947
6 7 {'C': 3.0, 'penalty': 'l2'} 0.934 0.954 0.973 0.953
7 8 {'C': 3.0, 'penalty': 'l2'} 0.947 0.954 0.96 0.954
8 9 {'C': 3.0, 'penalty': 'l2'} 0.947 0.961 0.973 0.96
9 10 {'C': 3.0, 'penalty': 'l2'} 0.947 0.961 0.973 0.96
10 11 {'C': 3.0, 'penalty': 'l2'} 0.947 0.961 0.973 0.96
11 12 {'C': 3.0, 'penalty': 'l2'} 0.947 0.961 0.973 0.96
12 13 {'C': 1.0, 'penalty': 'l1'} 0.947 0.961 0.973 0.96
13 14 {'C': 1.0, 'penalty': 'l1'} 0.947 0.954 0.96 0.954
14 15 {'C': 1.0, 'penalty': 'l1'} 0.947 0.954 0.96 0.954
15 16 {'C': 1.0, 'penalty': 'l1'} 0.947 0.947 0.947 0.947
16 17 {'C': 1.0, 'penalty': 'l1'} 0.947 0.947 0.947 0.947
17 18 {'C': 1.0, 'penalty': 'l2'} 0.947 0.961 0.973 0.96
18 19 {'C': 1.0, 'penalty': 'l2'} 0.934 0.954 0.973 0.953
Overall, the performances are similar. Model 5 has slightly better scores, but the best model changes whenever I re-run the code; the variation most likely comes from the unseeded undersampling with .sample() earlier rather than from GridSearchCV, whose 5-fold splits are deterministic here. One way to make the runs reproducible is to seed those .sample() calls, as sketched below. For a closer look at the models themselves, let's plot the feature coefficients together for the first few cases.
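A minimal sketch of that reproducible undersampling, meant to replace the two pd.concat lines in the split section above (the random_state values are arbitrary placeholders):
# reproducible undersampling: seed the .sample() calls from the split section
df = pd.concat([fraud0, normal0.sample(len(fraud0), random_state=5)])
df_dev = pd.concat([fraud1, normal1.sample(len(fraud1), random_state=6)])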
# Plot features with high coefficients (leading features)
fsize(12, 30, True)
for i in range(0, 14):
    plt.subplot(7, 2, i + 1)
    _, model, coefs, num_features = model_var[i]
    #coefs = pd.concat([coefs.sort_values().head(7),
    #                   coefs.sort_values().tail(8)])
    coefs.plot.barh()
    plt.xlabel('Coefficient')
    ax = plt.title("Coefficients")
While the coefficients of the first four leading features look quite consistent, the fifth is not stable. Also, V1 and V5 seem to contribute substantially and consistently. Let's try including them along with the first four features and see whether the model improves.
train = df[['V14','V4','V11','V10','V1','V5','Class']]
dev = df_dev[['V14','V4','V11','V10','V1','V5','Class']]
X_train = train[train.columns[:-1]]
y_train = train.Class
X_dev = dev[dev.columns[:-1]]
y_dev = dev.Class
# logistic regression with L1/L2 regularization; the solver is liblinear
model = LogisticRegression(solver='liblinear')
penalty= ['l1','l2']
# inverse of regularization strength
# smaller values specify stronger regularization
C = [100, 30, 10, 3.0, 1.0, 0.3, 0.1, 0.03, 0.01]
# dictionary of hyperparameters
distributions = dict(penalty=penalty, C=C)
# grid search
clf = GridSearchCV(model, distributions, scoring='neg_log_loss', cv=5)
search = clf.fit(X_train, y_train)
# best hyperparameter dict
bp = search.best_params_
# refit with the best hyperparameters and score on the validation (dev) set
model = LogisticRegression(solver='liblinear', penalty=bp['penalty'], C=bp['C'])
model.fit(X_train, y_train)
y_pred = model.predict(X_dev)
y_proba = model.predict_proba(X_dev)
y_proba = y_proba[:,1]
rc = recall_score(y_dev,y_pred)
ac = accuracy_score(y_dev,y_pred)
pr = precision_score(y_dev,y_pred)
f1 = f1_score(y_dev,y_pred)
coefs = pd.Series(model.coef_.flatten(), index = ['V14','V4','V11','V10','V1','V5'])
print(bp, '\t', round(rc, 3),round(ac, 3),round(pr, 3),round(f1, 3) )
{'C': 3.0, 'penalty': 'l2'} 0.934 0.947 0.959 0.947
No, it didn’t improve.
So, for the final model, let's include only the first four features, whose coefficients were consistently dominant. That corresponds to model number 3.
Test and Results
# get the final model (model 3: four features)
_, model, coefs, num_features = model_var[3]
selected_features = sorted_features.iloc[:num_features]
# note: this scores the balanced frame df; the realistic, imbalanced df_test is revisited below
X_test = df[selected_features.index]
y_test = df.Class
# get prediction
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)
y_proba = y_proba[:,1]
# print scores
rc = recall_score(y_test,y_pred)
ac = accuracy_score(y_test,y_pred)
pr = precision_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
print('Scores:',round(rc, 3),round(ac, 3),round(pr, 3),round(f1, 3) )
print(selected_features, model.coef_, model.intercept_)
print(confusion_matrix(y_test, y_pred))
Scores: 0.871 0.925 0.978 0.921
V14 0.720942
V4 0.694075
V11 0.646819
V10 0.605722
Name: Class, dtype: float64 [[-2.23133532 2.77847589 0.519276 -2.19161554]] [2.68467681]
[[296 6]
[ 39 263]]
V14, V4, and V10 made the most significant contributions. This model gave a high recall of 0.87 while keeping a very high precision of 0.98 on this balanced sample.
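Since X_test above comes from the balanced frame df, a stricter check would repeat the same scoring on the realistic, imbalanced df_test held out at the start; a sketch only (I have not quoted its output here):
# stricter evaluation on the realistic, imbalanced hold-out set
X_hold = df_test[selected_features.index]
y_hold = df_test.Class
y_hold_pred = model.predict(X_hold)
print(confusion_matrix(y_hold, y_hold_pred))
print('recall:', round(recall_score(y_hold, y_hold_pred), 3),
      'precision:', round(precision_score(y_hold, y_hold_pred), 3))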
fsize(16,10)
test = X_test.copy()
test['Class'] = y_test
plt.subplot(2,2,1)
plt.scatter(X_test.V14, y_test, alpha=0.2)
plt.xlabel('V14')
plt.ylabel('Class')
plt.subplot(2,2,2)
plt.scatter(X_test.V4, y_test, alpha=0.2)
plt.xlabel('V4')
plt.ylabel('Class')
plt.subplot(2,2,3)
test['pred']= y_pred
X = test[(test.Class==0)&(test.pred==0)]
plt.scatter(X.V14, X.V4, color='tab:blue', alpha=0.1, label='Normal')
X = test[(test.Class==0)&(test.pred==1)]
plt.scatter(X.V14, X.V4, color='tab:blue', marker='x', label='Normal, predicted fraud')
X = test[(test.Class==1)&(test.pred==1)]
plt.scatter(X.V14, X.V4, color='tab:orange', alpha=0.1, label='Fraud')
X = test[(test.Class==1)&(test.pred==0)]
plt.scatter(X.V14, X.V4, color='tab:orange', marker='x', label='Fraud, predicted normal')
plt.xlabel('V14')
plt.ylabel('V4')
plt.legend()
plt.subplot(2,2,4)
X = test[(test.Class==0)&(test.pred==0)]
plt.scatter(X.V14, X.V10, color='tab:blue', alpha=0.1, label='Normal')
X = test[(test.Class==0)&(test.pred==1)]
plt.scatter(X.V14, X.V10, color='tab:blue', marker='x', label='Normal, predicted fraud')
X = test[(test.Class==1)&(test.pred==1)]
plt.scatter(X.V14, X.V10, color='tab:orange', alpha=0.1, label='Fraud')
X = test[(test.Class==1)&(test.pred==0)]
plt.scatter(X.V14, X.V10, color='tab:orange', marker='x', label='Fraud, predicted normal')
plt.xlabel('V14')
plt.ylabel('V10')
plt.legend()
These plots visualize the regions where the two classes overlap, which is where the misclassifications occur.
# plot probabilities
fsize(8,6)
X = test[test.Class==0].drop(['Class','pred'],axis=1)
ax =plt.hist(model.predict_proba(X)[:,1], bins=30, range=(0,1), label='Normal', alpha=0.4)
X = test[test.Class==1].drop(['Class','pred'],axis=1)
ax = plt.hist(model.predict_proba(X)[:,1], bins=30, range=(0,1), label='Fraud', alpha=0.4)
plt.xlabel('Predicted probability of Fraud')
plt.legend()
Fraud transactions peak at a predicted probability near 1, while the predicted probabilities for normal transactions are spread more evenly; the model is more decisive about fraud than about normal transactions.
# Plot ROC curve
fsize(8,6)
fpr, tpr, thr = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, color='red', label='ROC')
plt.plot([0, 1], [0, 1], color='green', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend()
plt.show()
Around TPR = 0.8, the false positive rate starts to increase rapidly. We might build a two-step detection system: record any transaction above the threshold corresponding to TPR = 0.8, and immediately stop a transaction only when a second model optimized for high precision also flags it.
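Since auc is imported but unused above, we can also report the area under the ROC curve and read off the probability threshold nearest TPR = 0.8; a small added sketch (output not quoted here):
# ROC AUC, plus the probability threshold closest to TPR = 0.8
roc_auc = auc(fpr, tpr)
idx = np.argmin(np.abs(tpr - 0.8))
print('AUC:', round(roc_auc, 3),
      '| threshold at TPR~0.8:', round(thr[idx], 3),
      '| FPR there:', round(fpr[idx], 3))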