There are some general steps in predictive modeling that should apply to all kinds of problems. Beyond scientific applications such as predicting molecular orbital energies, band gaps, or chemical reactivity, other fields, like risk management and economic forecasting, are also important and should share the same general principles.

Here I try to learn how risk management is done in its own field.

Risk management is the process by which risks are identified, assessed, measured and managed. The goal of risk management is not to minimize or avoid risk, since risk cannot be fully avoided; it is to take smart risks. Applied to finance, investors assume risk only because they expect to be compensated for it in the form of higher economic returns. Generally, financial risk can be divided into several categories: market risk, credit risk, operational risk, etc. Here we will look at loan repayment as an example of credit risk, which involves the possibility of nonpayment, either on a future obligation or during a transaction. These two cases are also named presettlement and settlement risk, respectively [1].

Presettlement risk is the risk of loss due to the counterparty’s failure to perform on an obligation during the life of the transaction. This includes default on a loan or bond, or failure to make the required payment on a derivative transaction. Presettlement risk exists over long periods, starting from the time the contract is made until settlement [1].

Settlement risk is due to the exchange of cash flows and is of a much shorter-term nature. This risk arises as soon as an institution makes the required payment and exists until the offsetting payment is received [1].

Data provided

Data are from seven sources:

  • application_train.csv / application_test.csv: each row is a loan. If the client repaid the loan, TARGET takes the value 0; if the client failed to repay, TARGET equals 1;
  • bureau.csv: previous credits from other financial institutions reported to the Credit Bureau. Each row represents one previous credit, and a single loan in the application data can have several previous credits (several rows);
  • bureau_balance.csv: monthly balances of previous credits in the Credit Bureau. Each row is one month of a previous credit, and a single previous credit can have several rows, one for each month for as long as the balance exists;
  • previous_application.csv: clients’ previous applications at Home Credit. Each client can have several previous loans; each previous loan has one row and is identified by the feature SK_ID_PREV;
  • POS_CASH_balance.csv: monthly data about previous point-of-sale or cash loans clients have had with Home Credit. Each row is one month of a previous point-of-sale or cash loan, and a single previous loan can have many rows;
  • credit_card_balance.csv: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows;
  • installments_payments.csv: payment history for previous loans at Home Credit. There is one row for every payment that was made and one row for every missed payment.
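
These tables are linked by shared ID columns (SK_ID_CURR for the client’s current application, SK_ID_BUREAU for bureau credits, SK_ID_PREV for previous Home Credit loans). As a minimal sketch of how the sources can be combined, here is one hypothetical feature, the count of previous bureau credits per client, joined onto the main table:

import pandas as pd

# load the main application table and the bureau table
app = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")

# count previous credits per client and give the count a column name
prev_counts = (bureau.groupby("SK_ID_CURR")["SK_ID_BUREAU"]
                     .count()
                     .rename("BUREAU_LOAN_COUNT")
                     .reset_index())

# join the counts back onto the main table via the shared client ID
app = app.merge(prev_counts, on="SK_ID_CURR", how="left")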

Metrics

The data we will deal with are highly imbalanced, meaning the numbers of positive and negative examples are not comparable. Say you have 99 positive events and 1 negative event in the sample. Then, to predict positive events with 99% accuracy, you can simply predict that every event is positive, but such a model is useless. That is why the metric used in this problem is ROC AUC (Area Under the Receiver Operating Characteristic curve). The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, and the area under it equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one.
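
To make this concrete, here is a minimal sketch with made-up labels, using scikit-learn’s accuracy_score and roc_auc_score, showing that the all-positive predictor looks great by accuracy but scores only a chance-level 0.5 by ROC AUC:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# toy imbalanced sample: 99 positive events and 1 negative event
y_true = np.array([1] * 99 + [0])

# a trivial model that predicts "positive" for everything
y_pred = np.ones(100, dtype=int)   # hard class predictions
y_score = np.ones(100)             # predicted probability of the positive class

print(accuracy_score(y_true, y_pred))   # 0.99 -- looks impressive
print(roc_auc_score(y_true, y_score))   # 0.5  -- no better than random guessing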

Basic data properties

Upload the data to Google Colab

We will use Google Colab to work through this problem. To upload the data:

  • Manually download the data from the Kaggle competition website:

https://www.kaggle.com/c/home-credit-default-risk/data

  • Then upload them to Google Colab:

https://colab.research.google.com/drive/1kFgNS4j1trgNuEfmlZ1XsGp_WWUrO2Xg

  • Run the code until the following cell, which will ask you to choose the file to upload:
from google.colab import files

# opens a file-chooser dialog in the notebook
uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
  • After choosing the file, a progress indicator will appear, followed by a notification like:

    application_train.csv(application/vnd.ms-excel) - 166133370 bytes, last modified: 6/26/2018 - 100% done
    Saving application_train.csv to application_train.csv
    User uploaded file "application_train.csv" with length 166133370 bytes
    
  • If the file has been uploaded successfully, it will be in the same directory as the code, which we can verify:

# list the files in the current directory
import os
print(os.listdir("."))

# result:
# ['.config', 'application_train.csv', 'sample_data']

Data cleaning

# To maintain reproducibility, we install version 0.23.4 of pandas
!pip install pandas==0.23.4
import pandas as pd

# open the file
df_raw = pd.read_csv("application_train.csv")

# give the dimensions of the data
df_raw.shape

# give the number of columns of each data type
df_raw.dtypes.value_counts()

# give the mean/std/min/max of the numerical columns
# (pass include='all' to also summarize the non-numerical ones)
df_raw.describe()

# number of columns, index range and memory usage
df_raw.info()

# list the first 10 rows
df_raw.head(10)

# take a random sample of the data (10 rows) for a trial analysis
df_raw.sample(10)

With the file uploaded, we can already show the “imbalance” of the data:

# number of occurrences of each value in the TARGET column:
df_raw['TARGET'].value_counts()
>>>0    282686
>>>1     24825
>>>Name: TARGET, dtype: int64

# plot them as a histogram
df_raw['TARGET'].astype(int).plot.hist()

From the figure, we can see that the counts of the two possible values are highly imbalanced, which is why we need special metrics such as ROC AUC or the F1-score.

### 1. Missing values

In real-world applications, some values will be missing from the table or will not be of the right type. We need to process those values before the modeling step.

Usually, variables are of categorical or numerical type. For categorical variables, we can create a new category “unknown” to replace the empty cells; for numerical variables, if a single cell is missing we can fill it with the average value, and if an entire row is missing we can simply delete the row.

Some models only accept numerical values; in that case, we also need to map the categorical values into numerical values, i.e., encode the categorical variables.

Now let’s look at some code to do these two things.

# visualize cells with missing values in the table
import missingno as msno
msno.matrix(df_raw)

We can see many values are missing, which are indicated by white lines.

  • We will deal with categorical variables first:
# select columns with data type Object
objectColumns = df_raw.select_dtypes(include=["object"]).columns
df_raw[objectColumns].isnull().sum().sort_values(ascending=False)

# visualize the object columns with missing values
msno.matrix(df_raw[objectColumns])

We can see that the occupation column is missing many values, as is the house type column, etc. So we are going to create a new category value “Unknown” and fill it into the empty cells, using the pandas function fillna:

# make a new copy of the original data to fill with new values
df_filled = df_raw.copy()
df_filled[objectColumns] = df_filled[objectColumns].fillna("Unknown") 

# visualize the newly filled data
msno.bar(df_filled[objectColumns])

We can now see that all cells have values.

  • Now we will deal with numerical values:
# select the numeric columns
# you can also use include=[np.number]
numColumns = df_raw.select_dtypes(include=['int','float64']).columns

# visualize the missing values in the numeric columns
msno.matrix(df_raw[numColumns])

We then use the average values to fill in those empty cells:

from sklearn.preprocessing import Imputer

# strategy='mean' means using the average value; axis=0 means along the column direction
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df_filled[numColumns])
df_filled[numColumns] = imr.transform(df_filled[numColumns])

# visualize the table to check the values
msno.matrix(df_filled[numColumns])

You can also try other strategies from the preprocessing library, such as 'median' or 'most_frequent'.
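
Note that in scikit-learn 0.20 and later, Imputer was replaced by sklearn.impute.SimpleImputer. A sketch of the equivalent code with the 'median' strategy:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer always works column-wise, so there is no axis argument
imr = SimpleImputer(missing_values=np.nan, strategy='median')
df_filled[numColumns] = imr.fit_transform(df_filled[numColumns])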

  • Process other types of data

    • The key is to first select the type of data you want, then visualize it to see if any values are missing, and then use a suitable strategy to fill in the empty cells. Alternatively, if a column is not important at the current step, you can simply delete those values (not in the raw dataframe, but in the copied one).

    To choose a specific type of value in the dataframe, check the “pandas.DataFrame.select_dtypes” documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html

    • “pd.to_datetime()” can convert day counts into year-month-day dates, which is sometimes useful (see the short sketch after this list).

    • For certain types of data, we want to observe them before taking the next step. For example, this data has a column called “DAYS_EMPLOYED”, and we want to check whether its values make sense:

    # convert the values into years for convenience:
    (df_filled['DAYS_EMPLOYED']/365).describe()
    count    307511.000000
    mean        174.835742
    std         387.056895
    min         -49.073973
    25%          -7.561644
    50%          -3.323288
    75%          -0.791781
    max        1000.665753
    Name: DAYS_EMPLOYED, dtype: float64
    

    Some of the values look strange: say, 1000 years of employment? Those are called anomalies. Before figuring out what is going on, we can first set those anomalies to NaN and use an extra column to flag them:

    # Create an anomalous flag column
    import numpy as np
    import matplotlib.pyplot as plt

    df_filled['DAYS_EMPLOYED_ANOM'] = df_filled["DAYS_EMPLOYED"] == 365243

    # Replace the anomalous values with NaN
    df_filled['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace=True)

    df_filled['DAYS_EMPLOYED'].plot.hist(title='Days Employment Histogram');
    plt.xlabel('Days Employment');
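
As mentioned in the list above, pd.to_datetime() and pd.to_timedelta() can turn day counts into readable dates. A minimal sketch on the DAYS_BIRTH column (the reference date below is hypothetical, since the offsets are relative to each application's date):

# pd.to_timedelta turns the integer day offsets into time deltas,
# which can be added to a (hypothetical) reference date
reference_date = pd.Timestamp("2018-06-01")
birth_dates = reference_date + pd.to_timedelta(df_filled['DAYS_BIRTH'], unit='D')

# or simply convert the offsets to ages in years for a sanity check
(df_filled['DAYS_BIRTH'] / -365).describe()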
    

### 2. Encoding categorical variables into numeric values

Some modeling algorithms can take both numeric and categorical values (e.g. random forests), but others take only numeric values (e.g. logistic regression). So we need to convert categorical variables into numeric values, i.e., encode them. There are two common ways to do it: label encoding and one-hot encoding:

  • label encoding:

It works like a telegraph code: you assign every value a number. For example:

  convertible -> 0
  hardtop -> 1
  hatchback -> 2
  sedan -> 3
  wagon -> 4

But as you can see, the assignment of values is arbitrary, and it imposes an ordering that the categories may not actually have. Usually, if the categorical variable has only two values (e.g. Yes/No or Male/Female), we use label encoding (0/1 binary coding). For cases with more than two values, one-hot encoding is the better choice.

  • There are several ways to encode a categorical variable as a numeric variable; we first show an example using pandas:
  # change the column "FLAG_OWN_REALTY" to category type
  df_filled["FLAG_OWN_REALTY"] = df_filled["FLAG_OWN_REALTY"].astype("category")
  
  # using cat.codes accessor:
  df_filled["FLAG_OWN_REALTY_cat"] = df_filled["FLAG_OWN_REALTY"].cat.codes
  df_filled["FLAG_OWN_REALTY_cat"]
  
  >>>0         1
  >>>1         0
  >>>2         1
  >>>3         1
  >>>4         1
  ...

We can see that the category values take 1 or 0 to denote “Y” or “N”.

  • The second way: if we want to manually encode the categorical values, we can do:
  # when the value is "Y" the numeric variable takes 1; otherwise it takes 0
  df_filled["FLAG_OWN_CAR_cat"] = np.where(df_filled["FLAG_OWN_CAR"].str.contains("Y"), 1, 0)

This is the general idea behind one-hot encoding. All of the possible values of a categorical variable span a virtual space, and each value becomes a vector in that space. So instead of a single scalar, we use a vector to represent a category. The same scheme appears in Natural Language Processing: a letter can be one-hot encoded as a 26-dimensional vector, since there are 26 letters in the alphabet (methods such as word2vec go further and learn dense, lower-dimensional vectors for words).

  # an example of weather
  weather =['cold', 'cool', 'warm', 'hot']
  # vector value of each word:
  cold = [1,0,0,0]
  cool = [0,1,0,0]
  warm = [0,0,1,0]
  hot = [0,0,0,1]
  

This can be done either manually or using the scikit-learn library:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Basically, in data preprocessing, you need to build a “vocabulary” of all possible category values in order to convert a categorical variable into vectors.

  # example of one hot encoder and label encoder
  from numpy import array
  from numpy import argmax
  from sklearn.preprocessing import LabelEncoder
  from sklearn.preprocessing import OneHotEncoder
  # define example
  data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
  values = array(data)
  print(values)
  # integer encode
  label_encoder = LabelEncoder()
  integer_encoded = label_encoder.fit_transform(values)
  print(integer_encoded)
  # binary encode
  onehot_encoder = OneHotEncoder(sparse=False)
  integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
  onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
  print(onehot_encoded)
  # invert first example
  inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
  print(inverted)
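
As a usage note, pandas offers a one-line alternative: pd.get_dummies one-hot encodes every object or category column of a dataframe at once, creating one 0/1 indicator column per category value. A minimal sketch on our data:

  # one-hot encode all categorical columns in a single call
  df_encoded = pd.get_dummies(df_filled)

  # each categorical column expands into one indicator column per value,
  # so the column count grows
  print(df_filled.shape, df_encoded.shape)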

References

[1] Philippe Jorion and GARP (Global Association of Risk Professionals), Financial Risk Manager Handbook + Test Bank: FRM Part I & Part II (Wiley Finance), Wiley, 2010

[2] Python Exploratory Data Analysis Tutorial https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python

[3] Hitchhiker’s guide to Exploratory Data Analysis https://towardsdatascience.com/hitchhikers-guide-to-exploratory-data-analysis-6e8d896d3f7e

[4] Detailed exploratory data analysis with python https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python

[5] Start Here: A Gentle Introduction https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction

[6] Home Credit: Complete EDA + Feature Importance https://www.kaggle.com/codename007/home-credit-complete-eda-feature-importance