
A Walk Through of the IEEE-CIS Fraud Detection Challenge


Introduction

This is a brief walk-through of the Kaggle challenge IEEE-CIS Fraud Detection. The aim of this post is not to compete with the top solutions by performing extreme feature engineering and a greedy search for the best model and hyper-parameters. It simply walks through the problem and demonstrates a reasonably good solution based on feature analysis and a few experiments, with reference to other people’s methods.

The task of this challenge is to detect payment fraud using data about transactions and identities. Predictions are evaluated on ROC AUC; for why this measure suits the problem better than precision-recall, see the discussion here.
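
As a quick reminder of the two candidate metrics, here is a toy example with made-up labels and scores (not data from this dataset), using standard scikit-learn calls:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical labels and predicted scores for a skewed binary problem.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.05, 0.3, 0.2, 0.1, 0.4, 0.3, 0.8, 0.6])

print(roc_auc_score(y_true, y_score))            # ranking quality, robust to class imbalance
print(average_precision_score(y_true, y_score))  # PR-curve summary, sensitive to the positive ratio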

Look into the data

The provided dataset is broken into two files named identity and transaction, which are joined by TransactionID (note that NOT all the transactions have corresponding identity information).
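
As a quick sketch (using the same file paths as later in this post), the two tables can be left-joined on TransactionID, and we can check what fraction of transactions actually have identity rows:

import pandas as pd

DATA_DIR = '/content/drive/My Drive/colab-data/fraud detect/data'

tran = pd.read_csv(f'{DATA_DIR}/train_transaction.csv')
ids = pd.read_csv(f'{DATA_DIR}/train_identity.csv')

# indicator=True adds a '_merge' column telling which rows found a match.
merged = tran.merge(ids, how='left', on='TransactionID', indicator=True)
print((merged['_merge'] == 'both').mean())  # fraction of transactions with identity info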

Transaction Table

Among these variables, categorical variables are:

Identity Table

All the variables in this table are categorical:

A more detailed explanation of the data can be found in the replies to this discussion.

Now let’s have a close look at the data.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import plotly.express as px


DATA_DIR = '/content/drive/My Drive/colab-data/fraud detect/data'

# reduce_mem_usage is a memory-saving helper, defined in the pipeline section below.
tran_train = reduce_mem_usage(pd.read_csv(f'{DATA_DIR}/train_transaction.csv'))
id_train = reduce_mem_usage(pd.read_csv(f'{DATA_DIR}/train_identity.csv'))

tran_train.info()
tran_train.head()
id_train.info()
id_train.head()
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float16(332), float32(44), int16(1), int32(2), int8(1), object(14)
memory usage: 542.3+ MB
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 card6 addr1 addr2 dist1 dist2 P_emaildomain R_emaildomain C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D7 D8 D9 ... V300 V301 V302 V303 V304 V305 V306 V307 V308 V309 V310 V311 V312 V313 V314 V315 V316 V317 V318 V319 V320 V321 V322 V323 V324 V325 V326 V327 V328 V329 V330 V331 V332 V333 V334 V335 V336 V337 V338 V339
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 credit 315.0 87.0 19.0 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 0.0 1.0 1.0 14.0 NaN 13.0 NaN NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 117.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 117.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 credit 325.0 87.0 NaN NaN gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2987002 0 86469 59.0 W 4663 490.0 150.0 visa 166.0 debit 330.0 87.0 287.0 NaN outlook.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2987003 0 86499 50.0 W 18132 567.0 150.0 mastercard 117.0 debit 476.0 87.0 NaN NaN yahoo.com NaN 2.0 5.0 0.0 0.0 0.0 4.0 0.0 0.0 1.0 0.0 1.0 0.0 25.0 1.0 112.0 112.0 0.0 94.0 0.0 NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 1.0 50.0 1758.0 925.0 0.0 354.0 0.0 135.0 0.0 0.0 0.0 50.0 1404.0 790.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2987004 0 86506 50.0 H 4497 514.0 150.0 mastercard 102.0 credit 420.0 87.0 NaN NaN gmail.com NaN 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 394 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144233 entries, 0 to 144232
Data columns (total 41 columns):
TransactionID    144233 non-null int32
id_01            144233 non-null float16
id_02            140872 non-null float32
id_03            66324 non-null float16
id_04            66324 non-null float16
id_05            136865 non-null float16
id_06            136865 non-null float16
id_07            5155 non-null float16
id_08            5155 non-null float16
id_09            74926 non-null float16
id_10            74926 non-null float16
id_11            140978 non-null float16
id_12            144233 non-null object
id_13            127320 non-null float16
id_14            80044 non-null float16
id_15            140985 non-null object
id_16            129340 non-null object
id_17            139369 non-null float16
id_18            45113 non-null float16
id_19            139318 non-null float16
id_20            139261 non-null float16
id_21            5159 non-null float16
id_22            5169 non-null float16
id_23            5169 non-null object
id_24            4747 non-null float16
id_25            5132 non-null float16
id_26            5163 non-null float16
id_27            5169 non-null object
id_28            140978 non-null object
id_29            140978 non-null object
id_30            77565 non-null object
id_31            140282 non-null object
id_32            77586 non-null float16
id_33            73289 non-null object
id_34            77805 non-null object
id_35            140985 non-null object
id_36            140985 non-null object
id_37            140985 non-null object
id_38            140985 non-null object
DeviceType       140810 non-null object
DeviceInfo       118666 non-null object
dtypes: float16(22), float32(1), int32(1), object(17)
memory usage: 25.9+ MB
TransactionID id_01 id_02 id_03 id_04 id_05 id_06 id_07 id_08 id_09 id_10 id_11 id_12 id_13 id_14 id_15 id_16 id_17 id_18 id_19 id_20 id_21 id_22 id_23 id_24 id_25 id_26 id_27 id_28 id_29 id_30 id_31 id_32 id_33 id_34 id_35 id_36 id_37 id_38 DeviceType DeviceInfo
0 2987004 0.0 70787.0 NaN NaN NaN NaN NaN NaN NaN NaN 100.0 NotFound NaN -480.0 New NotFound 166.0 NaN 542.0 144.0 NaN NaN NaN NaN NaN NaN NaN New NotFound Android 7.0 samsung browser 6.2 32.0 2220x1080 match_status:2 T F T T mobile SAMSUNG SM-G892A Build/NRD90M
1 2987008 -5.0 98945.0 NaN NaN 0.0 -5.0 NaN NaN NaN NaN 100.0 NotFound 49.0 -300.0 New NotFound 166.0 NaN 621.0 500.0 NaN NaN NaN NaN NaN NaN NaN New NotFound iOS 11.1.2 mobile safari 11.0 32.0 1334x750 match_status:1 T F F T mobile iOS Device
2 2987010 -5.0 191631.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 100.0 NotFound 52.0 NaN Found Found 121.0 NaN 410.0 142.0 NaN NaN NaN NaN NaN NaN NaN Found Found NaN chrome 62.0 NaN NaN NaN F F T T desktop Windows
3 2987011 -5.0 221832.0 NaN NaN 0.0 -6.0 NaN NaN NaN NaN 100.0 NotFound 52.0 NaN New NotFound 225.0 NaN 176.0 507.0 NaN NaN NaN NaN NaN NaN NaN New NotFound NaN chrome 62.0 NaN NaN NaN F F T T desktop NaN
4 2987016 0.0 7460.0 0.0 0.0 1.0 0.0 NaN NaN 0.0 0.0 100.0 NotFound NaN -300.0 Found Found 166.0 15.0 529.0 575.0 NaN NaN NaN NaN NaN NaN NaN Found Found Mac OS X 10_11_6 chrome 62.0 24.0 1280x800 match_status:2 T F T T desktop MacOS
is_fraud = tran_train[['isFraud', 'TransactionID']].groupby('isFraud').count()

is_fraud['ratio'] = is_fraud['TransactionID'] / is_fraud['TransactionID'].sum()
fig_Y = px.bar(is_fraud, x=is_fraud.index, y='TransactionID',
               text='ratio',
               labels={'TransactionID': 'Number of transactions',
                       'x': 'is fraud'})
fig_Y.update_traces(texttemplate='%{text:.6p}')

A very imbalanced target variable

Positive isFraud samples make up only about 3.5% of the entire dataset. For this classification problem it is important to have a high true positive rate, that is, how many of the actual fraud cases the model can identify. So recall is, in a sense, more important than precision here, and macro-averaged recall would be a good side metric. Of course, in reality we would also need to weigh the cost of missing a few frauds against the cost of handling flagged cases.

In addition, we need to put some effort into the sampling and train-validation split methods to ensure that the minority class has enough impact on the model during training. Class weights could also be set on the model to see whether they make a difference in performance.
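
As a quick illustration on synthetic data (not this dataset), macro-averaged recall can be reported alongside ROC AUC, and class_weight='balanced' is one simple way to give the minority class more weight during training:

# A minimal sketch, not the pipeline used later in this post.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the real features (~3.5% positives).
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.965],
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, shuffle=False)

# class_weight='balanced' penalizes mistakes on the rare positive class more.
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

print(roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1]))   # main metric
print(recall_score(y_va, clf.predict(X_va), average='macro'))  # side metric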

Check missing values

Now let’s check whether there are any missing values in the dataset. As the tables below show, there are quite a lot.

It’s hard to tell how we should handle them before looking into each variable; sometimes a missing value itself carries information. It also depends on what kind of model we are going to use: with a tree model we can simply leave them as missing.

def missing_ratio_col(df):
    df_na = (df.isna().sum() / len(df)) * 100
    if isinstance(df, pd.DataFrame):
        df_na = df_na.drop(
            df_na[df_na == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame(
            {'Missing Ratio %': df_na})
    else:
        missing_data = pd.DataFrame(
            {'Missing Ratio %': df_na}, index=[0])

    return missing_data

missing_ratio_col(tran_train)
missing_ratio_col(id_train)
Missing Ratio %
dist2 93.628374
D7 93.409930
D13 89.509263
D14 89.469469
D12 89.041047
... ...
V307 0.002032
V299 0.002032
V309 0.002032
V310 0.002032
V308 0.002032

374 rows × 1 columns

Missing Ratio %
id_24 96.708798
id_25 96.441868
id_07 96.425922
id_08 96.425922
id_21 96.423149
id_26 96.420375
id_27 96.416215
id_23 96.416215
id_22 96.416215
id_18 68.722137
id_04 54.016071
id_03 54.016071
id_33 49.187079
id_10 48.052110
id_09 48.052110
id_30 46.222432
id_32 46.207872
id_34 46.056034
id_14 44.503685
DeviceInfo 17.726179
id_13 11.726165
id_16 10.325654
id_05 5.108401
id_06 5.108401
id_20 3.447200
id_19 3.407681
id_17 3.372321
id_31 2.739318
DeviceType 2.373243
id_02 2.330257
id_11 2.256765
id_28 2.256765
id_29 2.256765
id_35 2.251912
id_36 2.251912
id_15 2.251912
id_37 2.251912
id_38 2.251912

Detailed look at each variable

There are already very good references for EDA and feature engineering on this dataset, so it would be redundant to repeat them here. Please check the list here if you’re interested:

Data transformation pipeline

Based on the references and my own analysis, here is a pipeline of transformations to perform on the dataset. It can be adjusted for experimentation; see the code comments for an explanation of each transformation.

import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from typing import List, Callable


DATA_DIR = '/content/drive/My Drive/colab-data/fraud detect/data'


def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df


def load_df(test_set: bool = False, nrows: int = None, sample_ratio: float = None, reduce_mem: bool = True) -> pd.DataFrame:
    if test_set:
        tran = pd.read_csv(f'{DATA_DIR}/test_transaction.csv', nrows=nrows)
        ids = pd.read_csv(f'{DATA_DIR}/test_identity.csv', nrows=nrows)
    else:
        tran = pd.read_csv(f'{DATA_DIR}/train_transaction.csv', nrows=nrows)
        ids = pd.read_csv(f'{DATA_DIR}/train_identity.csv', nrows=nrows)

    if sample_ratio:
        # Sample transactions only; the left merge below keeps the matching
        # identity rows (sampling ids separately could drop matches, or fail
        # when the requested size exceeds the identity table).
        size = int(len(tran) * sample_ratio)
        tran = tran.sample(n=size, random_state=RAND_STATE)
    df = tran.merge(ids, how='left', on='TransactionID')
    if reduce_mem:
        reduce_mem_usage(df)
    return df


def cat_cols(df: pd.DataFrame) -> List[str]:
    cols: List[str] = []

    cols.append('ProductCD')

    cols_card = [c for c in df.columns if 'card' in c]
    cols.extend(cols_card)

    cols_addr = ['addr1', 'addr2']
    cols.extend(cols_addr)

    cols_emaildomain = [c for c in df if 'email' in c]
    cols.extend(cols_emaildomain)

    cols_M = [c for c in df if c.startswith('M')]
    cols.extend(cols_M)

    cols.extend(['DeviceType', 'DeviceInfo'])

    cols_id = [c for c in df if c.startswith('id')]
    cols.extend(cols_id)

    return cols


def num_cols(df: pd.DataFrame, target_col='isFraud') -> List[str]:
    cols_cat = cat_cols(df)
    cols_num = list(set(df.columns) - set(cols_cat))

    if target_col in cols_num:
        cols_num.remove(target_col)

    return cols_num


def missing_ratio_col(df):
    df_na = (df.isna().sum() / len(df)) * 100
    if isinstance(df, pd.DataFrame):
        df_na = df_na.drop(
            df_na[df_na == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame({'Missing Ratio %': df_na})
    else:
        missing_data = pd.DataFrame({'Missing Ratio %': df_na}, index=[0])

    return missing_data


class NumColsNaMedianFiller(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)
        cols_num = list(set(df.columns) - set(cols_cat))

        for col in cols_num:
            median = df[col].median()
            df[col].fillna(median, inplace=True)

        return df


class NumColsNegFiller(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_num = num_cols(df)

        for col in cols_num:
            df[col].fillna(-999, inplace=True)

        return df


class NumColsRatioDropper(TransformerMixin, BaseEstimator):
    def __init__(self, ratio: float = 0.5):
        self.ratio = ratio

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)
        cols_num = list(set(df.columns) - set(cols_cat))
        nums = df[cols_num]

        ratio = self.ratio * 100
        missings = missing_ratio_col(nums)
        inds = missings[missings['Missing Ratio %'] > ratio].index
        df = df.drop(columns=inds)
        return df


class ColsDropper(TransformerMixin, BaseEstimator):
    def __init__(self, cols: List[str]):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        # errors='ignore' so that columns already removed by earlier steps
        # (e.g. the missing-ratio dropper) don't raise a KeyError.
        return df.drop(columns=self.cols, errors='ignore')


class DataFrameSelector(TransformerMixin, BaseEstimator):
    def __init__(self, col_names):
        self.attribute_names = col_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(X[self.attribute_names].columns)

        return X[self.attribute_names].values


class DummyEncoder(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)

        cats = df[cols_cat]
        noncats = df.drop(columns=cols_cat)

        cats = cats.astype('category')
        cats_enc = pd.get_dummies(cats, prefix=cols_cat, dummy_na=True)

        return noncats.join(cats_enc)


# Label encoding is OK when we're using tree models
class MyLabelEncoder(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)

        for col in cols_cat:
            df[col] = df[col].astype('category').cat.add_categories(
                'missing').fillna('missing')
            le = preprocessing.LabelEncoder()
            # TODO add test set together to encoding
            # le.fit(df[col].astype(str).values)
            df[col] = le.fit_transform(df[col].astype(str).values)
        return df


class FrequencyEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        for col in self.cols:
            vc = df[col].value_counts(dropna=True, normalize=True).to_dict()
            vc[-1] = -1
            nm = col + '_FE'
            df[nm] = df[col].map(vc)
            df[nm] = df[nm].astype('float32')
        return df


class CombineEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, cols_pairs: List[List[str]]):
        self.cols_pairs = cols_pairs

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        for pair in self.cols_pairs:
            col1 = pair[0]
            col2 = pair[1]
            nm = col1 + '_' + col2
            df[nm] = df[col1].astype(str) + '_' + df[col2].astype(str)
            df[nm] = df[nm].astype('category')
            # print(nm, ', ', end='')
        return df


class AggregateEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, main_cols: List[str], uids: List[str], aggr_types: List[str],
                 fill_na: bool = True, use_na: bool = False):
        self.main_cols = main_cols
        self.uids = uids
        self.aggr_types = aggr_types
        self.use_na = use_na
        self.fill_na = fill_na

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        for col in self.main_cols:
            for uid in self.uids:
                for aggr_type in self.aggr_types:
                    col_new = f'{col}_{uid}_{aggr_type}'
                    tmp = df.groupby([uid])[col].agg([aggr_type]).reset_index().rename(
                        columns={aggr_type: col_new})
                    tmp.index = list(tmp[uid])
                    tmp = tmp[col_new].to_dict()
                    df[col_new] = df[uid].map(tmp).astype('float32')
                    if self.fill_na:
                        df[col_new].fillna(-1, inplace=True)
        return df
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    # Based on feature engineering from
    # https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600#Encoding-Functions
    ('combine_enc', CombineEncoder(
        [['card1', 'addr1'], ['card1_addr1', 'P_emaildomain']])),
    ('freq_enc', FrequencyEncoder(
        ['addr1', 'card1', 'card2', 'card3', 'P_emaildomain'])),
    ('aggr_enc', AggregateEncoder(['TransactionAmt', 'D9', 'D11'], [
        'card1', 'card1_addr1', 'card1_addr1_P_emaildomain'], ['mean', 'std'])),

    # Drop columns that have certain high ratio of missing values, and then fill
    # in values e.g. median value. May not be used if using a tree model.
    ('reduce_missing', NumColsRatioDropper(0.5)),
    ('fillna_median', NumColsNaMedianFiller()),

    # Drop some columns that will not be used
    ('drop_cols_basic', ColsDropper(['TransactionID', 'TransactionDT', 'D6',
                                     'D7', 'D8', 'D9', 'D12', 'D13', 'D14', 'C3',
                                     'M5', 'id_08', 'id_33', 'card4', 'id_07',
                                     'id_14', 'id_21', 'id_30', 'id_32', 'id_34'])),

    # Drop some columns based on feature importance got from a model.
    ('drop_cols_feat_importance', ColsDropper(
        ['V107', 'V117', 'V119', 'V120', 'V27', 'V28', 'V305'])),

    ('fillna_negative', NumColsNegFiller()),

    # Encode categorical variables. Depending on the kind of model we use,
    # we can choose between label encoding and onehot encoding.
    # ('onehot_enc', DummyEncoder()),
    ('label_enc', MyLabelEncoder()),
])

Split dataset

Since we want to predict future payment fraud based on past data, we should not shuffle the dataset when splitting it into training and validation sets; instead we split it by time.
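
As an explicit illustration (a sketch only; the split helpers below simply keep the original row order by passing shuffle=False), a time-based split could cut on TransactionDT, assuming that column is still present at split time:

import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = 'TransactionDT',
                     val_ratio: float = 0.25):
    # Everything before the time cutoff goes to training, the rest to validation.
    df = df.sort_values(time_col)
    cutoff = df[time_col].quantile(1 - val_ratio)
    train = df[df[time_col] <= cutoff]
    val = df[df[time_col] > cutoff]
    return train, val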

As this is an imbalanced dataset, with the positive class at only about 3.5%, we may also want to try sampling methods such as random over-sampling or SMOTE on the training set.

RAND_STATE = 20200119

def data_split_v1(X: pd.DataFrame, y: pd.Series):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, shuffle=False, random_state=RAND_STATE)

    return X_train, X_val, y_train, y_val


def data_split_oversample_v1(X: pd.DataFrame, y: pd.Series):
    from imblearn.over_sampling import RandomOverSampler

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, shuffle=False, random_state=RAND_STATE)

    ros = RandomOverSampler(random_state=RAND_STATE)
    X_train, y_train = ros.fit_resample(X_train, y_train)

    return X_train, X_val, y_train, y_val


def data_split_smoteenn_v1(X: pd.DataFrame, y: pd.Series):
    from imblearn.combine import SMOTEENN

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, shuffle=False, random_state=RAND_STATE)

    ros = SMOTEENN(random_state=RAND_STATE)
    X_train, y_train = ros.fit_resample(X_train, y_train)

    return X_train, X_val, y_train, y_val

Experiments

Now let’s start experimenting, with both simple models like logistic regression and more complex ones like gradient boosting.

Below is a scaffold for running the experiments.

import os
from datetime import datetime
import json
import pprint

from sklearn import metrics
from sklearn.pipeline import Pipeline
from typing import List, Callable

EXP_DIR = 'exp'

class Experiment:
    def __init__(self, df_nrows: int = None, transform_pipe: Pipeline = None,
                 data_split: Callable = None, model=None, model_class=None,
                 model_param: dict = None):
        self.df_nrows = df_nrows
        self.pipe = transform_pipe

        if data_split is None:
            self.data_split = data_split_v1
        else:
            self.data_split = data_split

        if model_class:
            self.model = model_class(**model_param)
        else:
            self.model = model

        self.model_param = model_param

    def transform(self, X):
        return self.pipe.fit_transform(X)

    def run(self, df_train: pd.DataFrame, save_exp: bool = True) -> float:
        # self.df = load_df(nrows=self.df_nrows)

        y = df_train['isFraud']
        X = df_train.drop(columns=['isFraud'])

        X = self.transform(X)
        # Keep the transformed frame so that feature importances can be
        # mapped back to column names in save_result().
        self.X = X

        X_train, X_val, Y_train, Y_val = self.data_split(X, y)

        # del X
        # gc.collect()

        self.model.fit(X_train, Y_train)

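        # Note: for sklearn classifiers, predict() returns hard class labels,
        # while the LightGBM wrapper later in this post returns probabilities;
        # for sklearn models, predict_proba would give a more informative ROC AUC.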
        Y_pred = self.model.predict(X_val)
        self.last_roc_auc = metrics.roc_auc_score(Y_val, Y_pred)

        if save_exp:
            self.save_result()

        return self.last_roc_auc

    def save_result(self, feature_importance:bool=False):
        save_time = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
        result = {}
        result['roc_auc'] = self.last_roc_auc
        result['transform'] = list(self.pipe.named_steps.keys())
        result['model'] = self.model.__class__.__name__
        result['model_param'] = self.model_param
        result['data_split'] = self.data_split.__name__
        result['num_sample_rows'] = self.df_nrows
        result['save_time'] = save_time
        if feature_importance:
            if hasattr(self.model, 'feature_importances_'):
                result['feature_importances_'] = dict(
                    zip(self.X.columns, self.model.feature_importances_.tolist()))
            elif hasattr(self.model, 'feature_importance'):
                result['feature_importances_'] = dict(
                    zip(self.X.columns, self.model.feature_importance().tolist()))

        pprint.pprint(result, indent=4)

        if not os.path.exists(EXP_DIR):
            os.makedirs(EXP_DIR)
        with open(f'{EXP_DIR}/exp_{save_time}_{self.last_roc_auc:.4f}.json', 'w') as f:
            json.dump(result, f, indent=4)
import gc


del tran_train, id_train
gc.collect()

df_train = load_df()
Mem. usage decreased to 650.48 Mb (66.8% reduction)

Logistic Regression as baseline

def exp1():
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline(steps=[
        ('combine_enc', CombineEncoder(
            [['card1', 'addr1'], ['card1_addr1', 'P_emaildomain']])),
        ('freq_enc', FrequencyEncoder(
            ['addr1', 'card1', 'card2', 'card3', 'P_emaildomain'])),
        ('aggr_enc', AggregateEncoder(['TransactionAmt', 'D9', 'D11'], [
         'card1', 'card1_addr1', 'card1_addr1_P_emaildomain'], ['mean', 'std'])),

        ('reduce_missing', NumColsRatioDropper(0.3)),
        ('fillna_median', NumColsNaMedianFiller()),

        ('drop_cols_basic', ColsDropper(['TransactionID', 'TransactionDT', 'C3', 'M5', 'id_08', 'id_33', 'card4', 'id_07', 'id_14', 'id_21', 'id_30', 'id_32', 'id_34'])),

        # Though onehot encoding is more appropriate for logistic regression, we
        # don't have enough memory to encode that many variables. So we take a
        # step back using label encoding.
        # ('onehot_enc', DummyEncoder()),
        ('label_enc', MyLabelEncoder()),
    ])

    exp = Experiment(transform_pipe=pipe,
                      data_split=data_split_v1,
                      model_class=LogisticRegression,
                      # just use the default hyper-parameters
                      model_param={},
                     )
    exp.run(df_train=df_train)

exp1()
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



{   'data_split': 'data_split_v1',
    'model': 'LogisticRegression',
    'model_param': {},
    'num_sample_rows': None,
    'roc_auc': 0.4956463187232307,
    'save_time': '2020-03-26_20-27-08',
    'transform': [   'combine_enc',
                     'freq_enc',
                     'aggr_enc',
                     'reduce_missing',
                     'fillna_median',
                     'drop_cols_basic',
                     'label_enc']}

Gradient Boosting with LightGBM

Now let’s try a gradient boosting tree model using the LightGBM implementation, tuning the hyper-parameters a little to make it a more complex model.

import lightgbm as lgb


class LgbmWrapper:
    def __init__(self, **param):
        self.param = param
        self.trained = None

    def fit(self, X_train, y_train):
        train = lgb.Dataset(X_train, label=y_train)
        self.trained = lgb.train(self.param, train)
        self.feature_importances_ = self.trained.feature_importance()
        return self.trained

    def predict(self, X_val):
        return self.trained.predict(X_val, num_iteration=self.trained.best_iteration)


def exp2():
    pipe = Pipeline(steps=[
        # Based on feature engineering from
        # https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600#Encoding-Functions
        ('combine_enc', CombineEncoder(
            [['card1', 'addr1'], ['card1_addr1', 'P_emaildomain']])),
        ('freq_enc', FrequencyEncoder(
            ['addr1', 'card1', 'card2', 'card3', 'P_emaildomain'])),
        ('aggr_enc', AggregateEncoder(['TransactionAmt', 'D9', 'D11'], [
            'card1', 'card1_addr1', 'card1_addr1_P_emaildomain'], ['mean', 'std'])),

        # Drop some columns that will not be used
        ('drop_cols_basic', ColsDropper(['TransactionID', 'TransactionDT', 'D6',
                                        'D7', 'D8', 'D9', 'D12', 'D13', 'D14', 'C3',
                                        'M5', 'id_08', 'id_33', 'card4', 'id_07',
                                        'id_14', 'id_21', 'id_30', 'id_32', 'id_34'])),

        # Drop some columns based on feature importance got from a model.
        # ('drop_cols_feat_importance', ColsDropper(
        #     ['V107', 'V117', 'V119', 'V120', 'V27', 'V28', 'V305'])),

        ('fillna_negative', NumColsNegFiller()),

        # Label encoding used for tree models.
        # ('onehot_enc', DummyEncoder()),
        ('label_enc', MyLabelEncoder()),
    ])

    param_lgbm = {'objective': 'binary',
                  'boosting_type': 'gbdt',
                  'metric': 'auc',
                  'learning_rate': 0.01,
                  'num_leaves': 2**8,
                  'max_depth': -1,
                  'tree_learner': 'serial',
                  'colsample_bytree': 0.7,
                  'subsample_freq': 1,
                  'subsample': 0.7,
                  'n_estimators': 10000,
                  #  'n_estimators': 80000,
                  'max_bin': 255,
                  'n_jobs': -1,
                  'verbose': -1,
                  'seed': RAND_STATE,
                  # 'early_stopping_rounds': 100,
                  }


    exp = Experiment(transform_pipe=pipe,
                    data_split=data_split_v1,
                     model_class=LgbmWrapper,
                     model_param=param_lgbm,
                     )
    exp.run(df_train=df_train)


exp2()
/usr/local/lib/python3.6/dist-packages/lightgbm/engine.py:118: UserWarning:

Found `n_estimators` in params. Will use it instead of argument



{   'data_split': 'data_split_v1',
    'model': 'LgbmWrapper',
    'model_param': {   'boosting_type': 'gbdt',
                       'colsample_bytree': 0.7,
                       'learning_rate': 0.01,
                       'max_bin': 255,
                       'max_depth': -1,
                       'metric': 'auc',
                       'n_estimators': 10000,
                       'n_jobs': -1,
                       'num_leaves': 256,
                       'objective': 'binary',
                       'seed': 20200119,
                       'subsample': 0.7,
                       'subsample_freq': 1,
                       'tree_learner': 'serial',
                       'verbose': -1},
    'num_sample_rows': None,
    'roc_auc': 0.919589853747652,
    'save_time': '2020-03-27_09-55-43',
    'transform': [   'combine_enc',
                     'freq_enc',
                     'aggr_enc',
                     'drop_cols_basic',
                     'fillna_negative',
                     'label_enc']}

So we got a local validation ROC AUC of about 0.9196, which looks like a decent score.

This model’s predictions on the test dataset scored 0.9398 on the public leaderboard and 0.9058 on the private leaderboard. These scores still have a sizeable gap to the top solutions, but they are good enough as a starting point, since there are many potential ways to improve: more feature transformations and engineering, other implementations such as CatBoost and XGBoost, and a search for better hyper-parameters, all assuming you have plenty of computational resources and time.
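
As one example of the last point, the Experiment scaffold can be reused for a small manual hyper-parameter sweep with LightGBM. This is a sketch only: it reuses the transformation pipeline `pipe` defined earlier, `base_param` mirrors the settings from exp2, and the candidate values are arbitrary examples (note that each run costs roughly as much as exp2):

# Rough sketch of a manual hyper-parameter sweep; values are illustrative only.
def exp_sweep():
    base_param = {'objective': 'binary', 'boosting_type': 'gbdt',
                  'metric': 'auc', 'n_estimators': 10000,
                  'colsample_bytree': 0.7, 'subsample': 0.7,
                  'subsample_freq': 1, 'max_bin': 255,
                  'n_jobs': -1, 'verbose': -1, 'seed': RAND_STATE}
    results = {}
    for num_leaves in (2**7, 2**8, 2**9):
        for learning_rate in (0.005, 0.01, 0.02):
            param = dict(base_param, num_leaves=num_leaves,
                         learning_rate=learning_rate)
            exp = Experiment(transform_pipe=pipe,
                             data_split=data_split_v1,
                             model_class=LgbmWrapper,
                             model_param=param)
            results[(num_leaves, learning_rate)] = exp.run(df_train=df_train,
                                                           save_exp=False)
    # Return the settings sorted by local validation ROC AUC, best first.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)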


