Credit Scoring Development Using R - Part 1
This is the first post in a three-part series.
This post is aimed at two groups of readers. The first group comprises those who have learned the underlying theories of credit scorecards but have yet to see a detailed application of those theories. The second group consists of practitioners who are already well versed in the theories and applications but would like to explore developing credit scorecards using open-source software such as R.
1.0 Introduction
The methodologies and processes of developing credit scorecards for lending purposes have long been held as intellectual property of financial institutions and consulting firms.
Even though several books have recently been published to shed light on credit scorecards, practical guidance and demonstrations of the hands-on aspects of credit scorecard development remain in short supply. This could be due to the following reasons:
- Availability of data is a common hindrance. Due to regulations and customers’ privacy, financial institutions’ data cannot be shared with external parties. Even in the case of credit bureaus, data is subject to strict data protection regulations and, thus, is not available for research purposes. As a result, it is not surprising that step-by-step demonstrations of credit scorecard development are absent not only from published books but also from research papers (to the best of my knowledge); and
- The software used to develop credit scorecards could be another obstacle. Different analysts, financial institutions and consulting firms use different software. Code written for one package is generally not compatible with another. Therefore, selecting any single package for demonstrating credit scorecard development immediately reduces the demonstration’s appeal due to the lack of compatibility. On the other hand, using multiple packages to demonstrate the development process would make the scope unmanageable.
This post seeks to fill the gap by providing a practical, hands-on demonstration of how to develop a credit scorecard, as close to reality as possible. To that end, two decisions were made to address the aforementioned challenges: A) a real, publicly available (albeit small) credit data set is used, and B) the open-source software R is adopted.
The reasons for choosing R are twofold:
- R is widely used among data science practitioners and it is free. In addition, documentation on R is readily available and of good quality. Since R is popular, using it will not reduce the appeal of the demonstrations shown in this post. Non-R users can download R at no cost to run the R code provided in this post.
- A credit scorecard development package called `scorecard` has been built in R by Shichen Xie. This wonderful package makes credit scorecard development a breeze and will be used throughout this post.
With the above, eager students can download R and the data set to learn credit scorecard development. For analysts who currently use other software in their development efforts, this post can be taken as a showcase of how an open-source tool can be used reliably to build credit scorecards. Having a free alternative to consider and explore is always an attractive business case.
Finally, it is worth noting that this post does not seek to produce a powerful credit scorecard, given the rather small data set used and the need to keep the length of this post manageable. It is also assumed that the reader has minimal or no knowledge of R (comments are added in the code to assist non-R users). Nonetheless, basic knowledge of credit scorecard theory is expected.
Note
Xie has prepared slides showing how a credit scorecard can be built using his package. The slides, however, were written in Chinese, and the data set used therein is not what an analyst would likely encounter in a real-world situation.
2.0 R and Packages
Other than the data set, which will be described in Section 3.0, the following are needed for this post:
- Base R: Base R can be downloaded from CRAN;
- RStudio: This is the IDE for R and can be downloaded for free from RStudio.com;
- Required R packages: Install these packages via RStudio (select the option to install all dependencies during package installation); a console sketch is shown after the list below.
- scorecard - for credit scorecard development;
- tidyverse - for data import and manipulation;
- Hmisc - for summary statistics generation;
- ClustOfVar - for variable clustering.
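For those who prefer the console, a minimal sketch of the same installation step:

```r
# Install the four required packages (run once); dependencies included
install.packages(c("scorecard", "tidyverse", "Hmisc", "ClustOfVar"),
                 dependencies = TRUE)
```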
A quick introduction on how to install the software and how to navigate RStudio can be found in the YouTube video How to install R and install R Studio.
3.0 Data
This section describes the data set as well as the techniques used to process it.
A potential question the reader could raise concerns the size of the data R can handle. This is a legitimate question, as it is well known that R loads data into memory (i.e. RAM), which could be an issue if the data set is very large. There are at least two ways to overcome this:
- Pre-process the data as much as possible using relational database software before loading it into R. R has “connector” packages that allow it to link to a relational database server such as MySQL. Via a connector, SQL commands can be submitted from R to the server for data manipulation purposes (a sketch is shown after this list). Once in R, use `data.table` (instead of `data.frame`) to handle the data set; or
- Have access to cloud computing services such as AWS and Azure.
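As an illustration of the connector idea, here is a minimal sketch using the `DBI` package; the table and column names are hypothetical, and the SQLite in-memory driver merely stands in for a real database server:

```r
library(DBI)

# Placeholder connection; swap RSQLite::SQLite() for your server's driver,
# e.g. RMariaDB::MariaDB(), with the appropriate host and credentials
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Push the heavy filtering to the database; pull back only what is needed
small_data <- dbGetQuery(con, "SELECT id, limit_bal FROM credit WHERE age > 25")

dbDisconnect(con)
```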
3.1 The Data Set
The data set analyzed in this post is sourced from Yeh, I. C., & Lien, C. H. (2009) and can be downloaded from the UCI Machine Learning Repository. Don’t download it now; the download will be done using R.
The data set covers 30,000 credit card holders in Taiwan. What makes it interesting is that, in addition to the usual demographic information, the payments, outstanding balances and delinquency statuses of the credit card holders were tracked over a period of time (6 months). This feature makes the data set similar to what an analyst would encounter in a real credit scorecard development project.
In total, there are 24 variables in the data set. The columns (i.e. variables) are neatly labeled and, therefore, largely self-explanatory. Further remarks will be made in this post should the need arise. Nonetheless, interested readers can find information on the columns’ attributes on the UCI Machine Learning Repository page.
Note
As the name of the default indicator default in the next month suggests, a credit scorecard built on this data set will be used to predict whether a credit card holder will default in the next month. In the banking industry, predicting default over the next 12 months is the norm.
3.2 Import Data into R
The data set from the UCI Machine Learning Repository page has an extra row in row 1 of the spreadsheet. It is crucial to delete this row from the spreadsheet. The cleaned-up version of the data set is stored on GitHub.
`library(tidyverse)` loads the `tidyverse` package, which will be used for data processing and manipulation. `read.csv` reads the CSV data set into R as an object called `data`. Carefully examine the first few rows of `data`, generated by running `head(data)`, to ensure that the data set has been imported correctly. `str(data)` gives a summary of the structure of the data set, which can be used in conjunction with `head(data)` for checking.
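A minimal sketch of the import step; the GitHub URL below is a placeholder for wherever the cleaned CSV is stored:

```r
# Load the tidyverse for data processing and manipulation
library(tidyverse)

# Placeholder URL: point this at the cleaned CSV described above
url <- "https://raw.githubusercontent.com/<user>/<repo>/master/credit_card.csv"
data <- read.csv(url)

head(data)  # inspect the first few rows
str(data)   # check variable names and types
```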
3.3 Data Exploration
Once the data set is imported correctly, it is always good to generate summary statistics to understand the data. This can be done using the `Hmisc` package, as below. `describe(data)` gives very detailed descriptive statistics on all of the variables in the data set. This report should be reviewed thoroughly to detect unusual values or anything that is out of line with expectations.
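A minimal sketch of the exploration step:

```r
# Detailed descriptive statistics for every variable in the data set
library(Hmisc)
describe(data)
```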
From the `describe(data)` report (too voluminous to reproduce in full here), two observations are worth noting:
- None of the variables has missing values, which is rare in a real-world situation. In practice, missing values always warrant attention and investigation, because their existence could be due to some underlying business process rather than to no information being captured. Variables with an excessive proportion of genuinely missing values should be discarded (a common benchmark is more than 50% of values missing).
- The “PAY_x” variables represent delinquency status. The first delinquency variable is “PAY_0”, followed by “PAY_2”; in other words, “PAY_1” is omitted. In addition, for the delinquency status variables in this data set, “-1” signifies that a customer was current on the payment. However, the report generated in R shows that these variables also contain the values “0” and “-2”, which are unexpected.
Point 2 will be addressed in the section below.
3.4 Data Manipulation and Variables Generation
3.4.1 Manipulations
A few things need to be achieved in this section. First, rename all variables to lower case for easier code typing. Second, rename “pay_0” to “pay_1”. Third, rename the default indicator “default payment next month” with the shorter name “gb_flag”, where “gb” stands for good/bad. Fourth, in all the “pay_x” variables, recode the values “-2” and “-1” to “0”, where “0” signifies current on payment. Finally, a cross-tab is run to check that the recoding has been done correctly.
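A sketch of these steps, assuming the column names produced by `read.csv` from the UCI file (e.g. `default.payment.next.month`); adjust the names if your copy differs:

```r
data <- data %>%
  rename_with(tolower) %>%                            # all names to lower case
  rename(pay_1 = pay_0,                               # fill the PAY_1 gap
         gb_flag = default.payment.next.month) %>%    # shorter default flag
  mutate(across(matches("^pay_[1-6]$"),
                ~ ifelse(.x %in% c(-2, -1), 0, .x)))  # -2/-1 become 0 (current)

# Cross-tab of one recoded variable against the default flag;
# no -2 or -1 values should remain
table(data$pay_1, data$gb_flag)
```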
3.4.2 Variables Generation
Except for some specific variables such as demographic information, the raw variables in a data set are normally not used directly in credit scorecards. In this post, new variables are created from the delinquency statuses, the credit limit and the outstanding balances. To keep the scope simple, no new variables will be generated from the payment amounts; however, using the ideas and code provided below, the reader should have no problem creating their own variables from the payment amounts. Variable generation is limited only by imagination and creativity: given the same data set, different analysts could generate different variables.
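The variable names and definitions below are illustrative rather than a definitive recipe; they assume the lower-cased column names from the previous step and the original column order:

```r
data <- data %>%
  mutate(
    # Average utilization of the credit limit over the six billing months
    avg_util = rowMeans(across(bill_amt1:bill_amt6)) / limit_bal,
    # Worst (maximum) delinquency status observed over the six months
    max_delq = pmax(pay_1, pay_2, pay_3, pay_4, pay_5, pay_6),
    # Number of months in which the account was past due
    n_delq_months = rowSums(across(pay_1:pay_6) > 0)
  )
```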
3.4.3 Sampling
Having generated the new variables, the full data set can now be divided into a development/training sample and a validation sample.
In a real credit scorecard development project, the validation sample is an out-of-time (OOT) sample, taken from a different time period than the development sample. With this data set, however, such a split is not possible. Instead, 80% of the full population is randomly selected as the development sample, and the remaining 20% serves as the validation sample.
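A base-R sketch of the random 80/20 split (the `scorecard` package also offers a `split_df()` helper for this); the seed value is arbitrary and only keeps the split reproducible:

```r
set.seed(1234)  # arbitrary seed, for reproducibility
dev_idx <- sample(nrow(data), size = floor(0.8 * nrow(data)))
dev_sample <- data[dev_idx, ]   # 80% development/training sample
val_sample <- data[-dev_idx, ]  # remaining 20% validation sample
```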