Credit Card Fraud Detection with Anomaly Detection
Intro
Goal
- Detect most fraudulent transactions without losing too much precision.
- Build a simple, intuitive model that is easy to retune whenever fraud patterns change.
Dataset
- Provided by the Machine Learning Group of Université Libre de Bruxelles.
- https://www.kaggle.com/mlg-ulb/creditcardfraud (144 MB, too large to upload to GitHub)
- 284,807 transactions, 492 of which (0.172%) are frauds.
- 30 numerical features: time, transaction amount, and 28 PCA-transformed components of confidential original features.
Model
- Supervised anomaly (outlier) detection based on Z-scores.
- Features are selected by their "classification power" (how well they separate fraud from normal transactions).
- Best hyperparameters are suggested for two kinds of "best choice" operating points.
Result
- Both models showed the expected, excellent performance on the test set.
Import modules and read dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#df = pd.read_csv('creditcard.csv') # original
df = pd.read_csv('creditcard_reduced.csv') # 50% random sample of original
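As a quick check (an addition, not in the original notebook), the class imbalance quoted in the introduction can be confirmed directly on the loaded frame:

# Count and fraction of normal (0) vs. fraud (1) transactions
print(df.Class.value_counts())
print(df.Class.value_counts(normalize=True))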
Organize data
# df.duplicated() # -> no duplication
# df.info() # -> no NaN
# Features: ['Time', 'Amount', 'V1', 'V2', ... 'V28']
# Classification: 0 normal, 1 fraud
# split dataset to normal and fraud transaction
df0 = df.loc[df.Class==0].copy()
df1 = df.loc[df.Class==1].copy()
#df_write = pd.concat([df1.sample(n=246),df0.sample(n=142403)])
#df_write.to_csv('creditcard_reduced.csv', index=False)
Split dataset - Train/Cross validation/Test
df0_train, df0_test = train_test_split(df0, test_size=0.4)
df1_train, df1_test = train_test_split(df1, test_size=0.4)
df0_dev, df0_test = train_test_split(df0_test, test_size=0.5)
df1_dev, df1_test = train_test_split(df1_test, test_size=0.5)
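Note that no random seed is fixed here, so every run produces a slightly different split; this is one reason the scores quoted later vary between samplings. A minimal reproducible variant (an added suggestion, with an arbitrary seed) would be:

# Reproducible variant of the split above (seed value is arbitrary)
df0_train, df0_test = train_test_split(df0, test_size=0.4, random_state=42)
df1_train, df1_test = train_test_split(df1, test_size=0.4, random_state=42)
df0_dev, df0_test = train_test_split(df0_test, test_size=0.5, random_state=42)
df1_dev, df1_test = train_test_split(df1_test, test_size=0.5, random_state=42)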
Train - Z-score parameters
features = df0_train.columns.drop(labels='Class')
df0_stat = df0_train[features].describe().T
# get Z-score parameter
mu0=df0_stat['mean']
sig0=df0_stat['std']
Visualize classification power for selected features
# Recall of each feature, using a two-sided Z-score classification with a two-sigma cut
dfz = ((abs(df1_train[features]-mu0[features])/sig0[features]) > 2).mean().sort_values(ascending=False)
sorted_feature = dfz.index.tolist()
feature_lead = sorted_feature[:6] ## leading classifying features
feature_insig = sorted_feature[-6:] ## not significantly classifying features
feature_lead.append('Class')
feature_insig.append('Class')
# Significantly classifying features - high Z-score
df_sig = pd.concat([df0_train[feature_lead].sample(n=100),df1_train[feature_lead].sample(n=100)])
sns.pairplot(df_sig, hue="Class")
plt.show()
Note: Keep in mind that the plots above are drawn with an equal number of normal and fraud samples. With the actual, heavily skewed class balance, the distribution of the normal class would have much thicker tails.
# Insignificantly classifying features - low Z-score
df_insig = pd.concat([df0_train[feature_insig].sample(n=100),df1_train[feature_insig].sample(n=100)])
sns.pairplot(df_insig, hue="Class")
plt.show()
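To inspect the ranking itself rather than the pair plots, the per-feature recall values can simply be printed (an added check, not in the original notebook):

# Fraction of fraud training samples beyond the two-sigma cut, per feature
print(dfz.head(10))  # strongest features
print(dfz.tail(10))  # weakest features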
Define anomaly metric and scoring
A transaction is flagged as an anomaly when the root-mean-square Z-score over the n leading features exceeds a cut \(\epsilon\):
\[Z = \sqrt{\frac{1}{n}\sum_{i}^{n~\textrm{leading features}}\frac{(x_{i}-\mu_{i})^2}{\sigma_{i}^2}} > \epsilon\]
# Z-score transform
X_train1 = (df1_train[features]-mu0[features])**2/sig0[features]**2
X_dev0 = (df0_dev[features]-mu0[features])**2/sig0[features]**2
X_dev1 = (df1_dev[features]-mu0[features])**2/sig0[features]**2
X_test0 = (df0_test[features]-mu0[features])**2/sig0[features]**2
X_test1 = (df1_test[features]-mu0[features])**2/sig0[features]**2
y_dev0 = df0_dev.Class
y_dev1 = df1_dev.Class
y_test0 = df0_test.Class
y_test1 = df1_test.Class
X_dev = pd.concat([X_dev0, X_dev1])
y_dev = pd.concat([y_dev0, y_dev1])
X_test = pd.concat([X_test0, X_test1])
y_test = pd.concat([y_test0, y_test1])
# anomaly metric calculation
def metric(X, sig_cut, nfeatures):
    # X: squared Z-scores of the selected leading features
    y_metric = (X.sum(axis=1)/nfeatures)**0.5
    y_pred = y_metric.apply(lambda x: 1 if x > sig_cut else 0)
    return y_pred

# score calculation
def score(y_pred, y_actual):
    def verdict(p, a):
        # predict, actual -> True/False Positive/Negative
        if p == 0 and a == 0:
            x = 'TN'
        elif p == 0 and a == 1:
            x = 'FN'
        elif p == 1 and a == 0:
            x = 'FP'
        elif p == 1 and a == 1:
            x = 'TP'
        else:
            x = 'Invalid'
        return x

    y_score = pd.DataFrame({'predict': y_pred, 'actual': y_actual})
    y_score['verdict'] = y_score.apply(lambda x: verdict(x['predict'], x['actual']), axis=1)
    tp = y_score[y_score.verdict == 'TP'].verdict.count()
    tn = y_score[y_score.verdict == 'TN'].verdict.count()
    fp = y_score[y_score.verdict == 'FP'].verdict.count()
    fn = y_score[y_score.verdict == 'FN'].verdict.count()
    inv = y_score[y_score.verdict == 'Invalid'].verdict.count()
    if inv > 0:
        print('Invalid value. Check classification values are all 0 or 1')
        return 0
    # Evaluation scores
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1score = 2*precision*recall/(precision+recall)
    #print('TP = {}, TN = {}, FP = {}, FN = {}'.format(tp,tn,fp,fn))
    return y_score.predict, precision, recall, f1score
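As an added sanity check (not part of the original notebook), the metric can be verified on a toy input and, assuming scikit-learn is available, the hand-rolled scores can be cross-checked against sklearn.metrics:

# Toy example of the anomaly metric: squared Z-scores of 4 and 9 over 2 features
# give Z = sqrt((4+9)/2) ~ 2.55, so the transaction is flagged at sig_cut = 2
toy = pd.DataFrame({'V14': [4.0], 'V17': [9.0]})
print(metric(toy, sig_cut=2, nfeatures=2))  # -> 1 (flagged)

# Cross-check of score() against scikit-learn's implementation
from sklearn.metrics import precision_recall_fscore_support
def score_sklearn(y_pred, y_actual):
    p, r, f, _ = precision_recall_fscore_support(y_actual, y_pred,
                                                 average='binary', pos_label=1)
    return p, r, f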
Tune hyper parameters
df_hp = pd.DataFrame(columns=['sig_cut','nfeatures','lead_feature','precision','recall','f1score'])
df_y = y_dev.rename('actual').to_frame()
# This part is very slow
for icut in range(1, 11):
    sig_cut = 0.5*icut
    print(sig_cut)
    # Train to sort leading features in order of "classification power"
    dfz = (X_train1 > sig_cut).mean().sort_values(ascending=False)
    sorted_feature = dfz.index.tolist()
    for nfeatures in range(1, 31):
        lead_feature = sorted_feature[:nfeatures]
        y_pred = metric(X_dev[lead_feature], sig_cut, nfeatures)
        result = score(y_pred, y_dev)
        y, p, r, f = result
        new_name = 'predict'+str(icut)+'_'+str(nfeatures)
        #print(new_name)
        y = y.rename(new_name).to_frame()
        df_y = pd.merge(df_y, y, left_index=True, right_index=True)
        # DataFrame.append was removed in recent pandas; build the row and concat instead
        row = pd.DataFrame([{'sig_cut':sig_cut, 'nfeatures':nfeatures, 'lead_feature':lead_feature,
                             'precision':p, 'recall':r, 'f1score':f}])
        df_hp = pd.concat([df_hp, row], ignore_index=True)
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
Visualize tuning result
fig = plt.figure(figsize=(17,10))
ax0 = plt.subplot(2, 2, 1)
plt.hist2d(df_hp.nfeatures, df_hp.sig_cut, weights=df_hp.precision, bins=[30,10], range=[[1,30],[0.5,5.5]])
plt.colorbar()
ax0.set_title('Precision: purity of fraud sample')
ax0.set_xlabel('# features')
ax0.set_ylabel('Outlier cut in Z-score')
ax1 = plt.subplot(2, 2, 2)
plt.hist2d(df_hp.nfeatures, df_hp.sig_cut, weights=df_hp.recall, bins=[30,10], range=[[1,30],[0.5,5.5]])
plt.colorbar()
ax1.set_title('Recall: detection rate')
ax1.set_xlabel('# features')
ax1.set_ylabel('Outlier cut in Z-score')
ax2 = plt.subplot(2, 2, 3)
plt.hist2d(df_hp.nfeatures, df_hp.sig_cut, weights=df_hp.f1score, bins=[30,10], range=[[1,30],[0.5,5.5]])
plt.colorbar()
ax2.set_title('F1 score: compensated precision and recall')
ax2.set_xlabel('# features')
ax2.set_ylabel('Outlier cut in Z-score')
n_fraud = len(X_dev1)
print(n_fraud,"fraud samples")
#print('leading features: ',df_hp[(df_hp.sig_cut==2)&(df_hp.nfeatures==30)].iloc[0].lead_feature[:2])
samp0 = X_dev0.sample(n=n_fraud)  # sample once so V14 and V17 come from the same transactions
x0 = samp0.V14
y0 = samp0.V17
x1=X_dev1.V14
y1=X_dev1.V17
ax = plt.subplot(2, 2, 4)
plt.plot(x0,y0,'*')
plt.plot(x1,y1,'*')
c1 = plt.Rectangle((0, 0), 1, 1, color='b', fill=False)
c2 = plt.Rectangle((0, 0), 2, 2, color='b', fill=False)
c3 = plt.Rectangle((0, 0), 3, 3, color='b', fill=False)
ax.add_patch(c1)
ax.add_patch(c2)
ax.add_patch(c3)
plt.legend(['normal','fraud'])
ax.set_title('Classification cuts for 1, 2, and 3 Z-score cuts')
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('V14 Z-score square')
ax.set_ylabel('V17 Z-score square')
plt.show()
49 fraud samples
Interpretation of cross validation scores
Precision increases with the number of leading features until about 10 features, then decreases
- The more relevant features we use, the stricter the detection becomes.
- Once we start adding irrelevant features, the detection no longer improves fraud selection.
Precision increases as the Z-score cut increases
- Since normal samples have the sharper distribution, a higher Z-score cut always increases the purity of the selected fraud sample.
Even at its highest, precision remains well below 1, due to the heavily skewed class balance
Recall increases as the Z-score cut decreases
- Not all fraud samples are separable from normal ones, so a lower Z-score cut simply keeps more of them.
- However, precision is very small in the lowest Z-score region.
Recall increases as the number of leading features decreases, except when only one feature is used
- For a given Z-score cut, using a few leading features detects more fraud samples than using too many.
The highest F1 score is achieved with 2-3 leading features and the highest Z-score cut
- The precision and recall observations above explain this result.
- 2-3 leading features also do best on the precision-recall curves below.
fig = plt.figure(figsize=(17,10))
ax3 = plt.subplot(2, 2, 1)
lgd=[]
for i in range(1,11):
plt.plot(df_hp[df_hp.nfeatures==i].recall, df_hp[df_hp.nfeatures==i].precision,'*-')
lgd.append(str(i)+' features')
plt.legend(lgd)
ax3.set_title('Precision vs recall')
ax3.set_xlabel('Recall')
ax3.set_ylabel('Precision')
ax3 = plt.subplot(2, 2, 2)
lgd=[]
for i in range(1,11):
plt.plot(df_hp[(df_hp.recall>0.9)&(df_hp.nfeatures==i)].recall,
df_hp[(df_hp.recall>0.9)&(df_hp.nfeatures==i)].precision,'*-')
lgd.append(str(i)+' features')
plt.legend(lgd)
ax3.set_title('Precision vs recall, recall>90%')
ax3.set_xlabel('Recall')
ax3.set_ylabel('Precision')
plt.show()
print(df_hp[(df_hp.recall>0.9)&(df_hp.nfeatures<11)]
[['sig_cut','nfeatures','recall','precision']].sort_values(by=['recall'],ascending=False).head(10))
print(df_hp[(df_hp.recall>0.9)&(df_hp.nfeatures<11)]
[['sig_cut','nfeatures','recall','precision']].sort_values(by=['precision'],ascending=False).head(10))
print(df_hp[(df_hp.nfeatures<11)]
[['sig_cut','nfeatures','recall','precision']].sort_values(by=['precision'],ascending=False).head(10))
print(df_hp[(df_hp.nfeatures<11)]
[['sig_cut','nfeatures','recall','precision']].sort_values(by=['recall'],ascending=False).head(10))
print(df_hp[['sig_cut','nfeatures','recall','precision','f1score']].sort_values(by=['f1score'],ascending=False).head(10))
sig_cut nfeatures recall precision
4 0.5 5 1.000000 0.001996
5 0.5 6 1.000000 0.002009
6 0.5 7 1.000000 0.001932
7 0.5 8 1.000000 0.001857
8 0.5 9 1.000000 0.001850
9 0.5 10 1.000000 0.001858
2 0.5 3 0.979592 0.002166
3 0.5 4 0.979592 0.002061
0 0.5 1 0.959184 0.003211
1 0.5 2 0.959184 0.002412
sig_cut nfeatures recall precision
33 1.0 4 0.918367 0.005327
35 1.0 6 0.918367 0.005096
0 0.5 1 0.959184 0.003211
1 0.5 2 0.959184 0.002412
2 0.5 3 0.979592 0.002166
3 0.5 4 0.979592 0.002061
5 0.5 6 1.000000 0.002009
4 0.5 5 1.000000 0.001996
6 0.5 7 1.000000 0.001932
9 0.5 10 1.000000 0.001858
sig_cut nfeatures recall precision
271 5.0 2 0.673469 0.673469
272 5.0 3 0.775510 0.666667
277 5.0 8 0.632653 0.659574
279 5.0 10 0.571429 0.595745
278 5.0 9 0.571429 0.583333
276 5.0 7 0.653061 0.581818
275 5.0 6 0.673469 0.568966
241 4.5 2 0.693878 0.566667
249 4.5 10 0.632653 0.543860
242 4.5 3 0.795918 0.541667
sig_cut nfeatures recall precision
4 0.5 5 1.000000 0.001996
5 0.5 6 1.000000 0.002009
6 0.5 7 1.000000 0.001932
7 0.5 8 1.000000 0.001857
8 0.5 9 1.000000 0.001850
9 0.5 10 1.000000 0.001858
2 0.5 3 0.979592 0.002166
3 0.5 4 0.979592 0.002061
0 0.5 1 0.959184 0.003211
1 0.5 2 0.959184 0.002412
sig_cut nfeatures recall precision f1score
272 5.0 3 0.775510 0.666667 0.716981
271 5.0 2 0.673469 0.673469 0.673469
277 5.0 8 0.632653 0.659574 0.645833
242 4.5 3 0.795918 0.541667 0.644628
241 4.5 2 0.693878 0.566667 0.623853
275 5.0 6 0.673469 0.568966 0.616822
276 5.0 7 0.653061 0.581818 0.615385
245 4.5 6 0.734694 0.507042 0.600000
244 4.5 5 0.775510 0.487179 0.598425
211 4.0 2 0.775510 0.475000 0.589147
How to select final parameters
The highest F1 (or \(F_{\beta}\)) score is often the best selection criterion. From the data, the highest F1 scores are achieved by 2-3 leading features with the highest Z-score cut. With such a selection, the cross-validation set showed good performance in both precision (40-60%) and recall (70-90%).
However, fraudulent transactions are costly enough that you may also want to keep recall high even at the cost of some precision.
I selected the final parameters based on overall trends rather than relying too much on precise scores, for the following reasons.
- The number of fraud samples carries a statistical error of order 10%, assuming it follows common counting distributions. Besides, this kind of anomaly distribution tends to be chaotic.
- I expect precision not to be stable across samplings or over time, simply because the frequency of fraudulent transactions is not regular (if it were, that would be interesting in itself). Recall should be relatively more stable than precision, assuming fraudulent transactions have a much broader distribution than normal ones, at least until fraud techniques evolve to resemble normal transactions.
In conclusion, I suggest two parameter choices, each delivering a different kind of "best overall" performance.
Choice 1 - Highest precision at recall > ~90%
On the cross-validation set, the highest precision with recall > 90% was mostly achieved with a few leading features (2-5) and a moderately low Z-score cut (1.5-2.5).
Test
# Test with Choice 1 parameters: Z-score cut = 2, top 2 leading features (as ranked at sig_cut = 2)
lead_feature = df_hp[(df_hp.sig_cut==2)&(df_hp.nfeatures==30)].iloc[0].lead_feature[:2]
y_pred = metric(X_test[lead_feature], 2, 2)
y, p, r, f = score(y_pred, y_test)
print(len(X_test), y.sum(), p,r)
print('precision:',p,'recall:',r,'sampling fraction:',y.sum()/len(X_test))
28531 1590 0.028930817610062894 0.92
precision: 0.028930817610062894 recall: 0.92 sampling fraction: 0.05572885633170937
From the final test, roughly 90% recall and ~3% precision are obtained with 2 leading features and a Z-score cut of 2 (the run shown above gives 92% recall and 2.9% precision; numbers vary between samplings).
Evaluation
- A system with ~90% recall catches about 90% of fraudulent transactions, which sounds very safe.
- The precision of our model is only a few percent, which is too low to halt transactions outright (customers would be annoyed), but high enough to serve as a first level of sampling.
- The advantage of this sampling is that it significantly reduces the number of transactions to be investigated without losing most of the fraudulent ones; therefore, we can build an efficient detector with several levels.
Application
- One application example is a multi-level fraud detector that combines a faster first level with a slower second (or higher) level, as sketched below.
- Level 1: Implement this model as a fast hardware process; for example, add an operation in a card-reader chip that processes the two feature signals and performs the comparison and decision logic. If level 1 detects a fraud-like transaction, it holds the transaction and sends the feature signals to level 2.
- Level 2: Perform further, slower classification using software or higher-level security information. If this decision says normal, the transaction goes through (delayed only by the level-2 processing time) and the customer hardly notices anything. If it says fraud, we can contact the customer to ask for security information.
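A minimal sketch of this two-level idea in Python (an illustration only; level2_check is a hypothetical placeholder, and mu, sig stand for the per-feature means and standard deviations estimated from normal transactions, i.e. mu0 and sig0 above):

def level1_flag(x_v14, x_v17, mu, sig, sig_cut=2.0):
    # Fast level-1 check: the 2-feature Z-score metric from Choice 1
    z2 = ((x_v14 - mu['V14'])**2/sig['V14']**2 + (x_v17 - mu['V17'])**2/sig['V17']**2)/2
    return z2**0.5 > sig_cut

def level2_check(transaction):
    # Hypothetical placeholder for a slower, more thorough classifier or manual check
    return False

def process(transaction, mu, sig):
    if not level1_flag(transaction['V14'], transaction['V17'], mu, sig):
        return 'approve'                    # fast path, no delay
    if level2_check(transaction):
        return 'halt and contact customer'  # both levels flag the transaction
    return 'approve after level-2 delay'    # level-1 false alarm, customer hardly notices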
Choice 2 - Highest F1 score
The highest F1 scores are achieved with 2-3 leading features and the highest Z-score cut. On the cross-validation set, such parameters show good performance in both precision (40-60%) and recall (70-90%).
Test
# Test with Choice 2 parameters: Z-score cut = 5, top 3 leading features (as ranked at sig_cut = 5)
lead_feature = df_hp[(df_hp.sig_cut==5)&(df_hp.nfeatures==30)].iloc[0].lead_feature[:3]
y_pred = metric(X_test[lead_feature], 5, 3)
y, p, r, f = score(y_pred, y_test)
print('precision:',p,'recall:',r,'f1score:',f)
precision: 0.6607142857142857 recall: 0.74 f1score: 0.6981132075471698
From the final test, about 74% recall and 66% precision (numbers vary between samplings) are obtained with 3 leading features and a Z-score cut of 5.
Evaluation
- This system selects highly fraud-like transactions without losing too many of the total fraud transactions. Considering that only 0.17% of transactions are fraudulent, a test precision of this magnitude represents a remarkable concentration on actual fraud.
Application
- One application is a fast hardware-level implementation in a card reader. If a transaction is marked as fraudulent, it is automatically halted, the transaction information is sent to a security monitoring service, and/or shop staff can ask the card user further security questions.
Discussion and Conclusion
- A supervised anomaly-detection model based on Z-scores was built for fraud detection.
- Two best-choice models are suggested; both perform excellently on the test set.
- As the patterns of normal or fraudulent transactions change, the parameters should be readjusted. This model provides an intuitive way to tune them.
- Some features may have skewed distributions. In such cases, a log transformation might work better (see the sketch after this list).
- Of course, knowing the meaning of each feature would significantly improve the model.
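A minimal sketch of the log-transform idea, assuming a right-skewed, non-negative feature such as Amount (an illustration, not part of the original analysis):

# Log-transform a skewed, non-negative feature before computing Z-score parameters
df0_train['logAmount'] = np.log1p(df0_train['Amount'])
df1_train['logAmount'] = np.log1p(df1_train['Amount'])
# Z-score parameters of the transformed feature, estimated from normal transactions only
mu_logA = df0_train['logAmount'].mean()
sig_logA = df0_train['logAmount'].std()
# Fraction of fraud training samples beyond a two-sigma cut on the transformed feature
print((abs(df1_train['logAmount'] - mu_logA)/sig_logA > 2).mean())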