Predicting Loan Defaults with Machine Learning: A Comprehensive Analysis for Banks

13 min read · Mar 13


— by Gaurav Singh

Defaulting on a loan is a worry for many individuals and businesses alike. Understanding loan default prediction and how to avoid it can be a complicated process. Loan default prediction is a practice used by lenders to measure the risk of a borrower not paying back a loan or credit. One of the most important factors is the borrower’s credit score, which is a numerical value that represents their creditworthiness. Banks also take into account the borrower’s income and employment history, debts, the type of loan and the purpose of the loan, and assets such as property or collateral. A higher credit score generally indicates that a borrower is more likely to repay a loan, while a lower credit score may indicate a higher risk of default. By understanding loan default prediction, borrowers can take steps to reduce the risk of defaulting and protect their financial wellbeing. Knowing how lenders assess risk and what steps to take can help borrowers secure the financing they need while avoiding costly default fees. With the right knowledge, borrowers can better manage their finances and avoid defaulting on their loans.


Loan default prediction is an estimate of whether a borrower will be able to repay a loan. Banks and other lenders estimate the risk of default from a variety of factors, including past payment history, credit score, and current financial situation. The result is often expressed as a probability of default. The higher this probability, the higher the risk and the less likely someone is to be approved for a loan. For example, a predicted default probability of 40% means the lender estimates a 60% chance that the loan will be repaid. Loan default prediction helps lenders assess the risk of giving borrowers credit. The higher the risk of default, the higher the interest rate on the loan is likely to be. Understanding loan default prediction is important for borrowers because it helps them understand how their financial situation is assessed. This can help borrowers better manage their finances and reduce the risk of defaulting on their loans.


Loan default prediction is a crucial aspect of the lending process for banks and financial institutions. It involves analyzing past loan data and identifying patterns that can indicate whether a borrower is likely to default on a loan. By using this information, banks and lenders can make more informed decisions about which borrowers to approve for loans and what terms to offer. This can help them reduce the risk of loan defaults and increase the likelihood of successful loan repayment.
One of the key benefits of loan default prediction is that it can help banks and lenders identify high-risk borrowers early on. By using historical data and statistical models, financial institutions can identify patterns and characteristics that are associated with a higher risk of default. This can include factors such as credit history, income, and employment status.
By identifying high-risk borrowers early on, banks and lenders can take steps to mitigate the risk of default. For example, they might offer these borrowers a loan on stricter terms, such as a higher interest rate or a smaller loan amount. They might also require a larger down payment or collateral to secure the loan.


If a borrower does not repay a loan, they are considered to be in default. Lenders may charge additional fees and require the borrower to pay off the entire loan amount in one payment. This can be difficult for individuals to manage, especially if they have not been employed for long or do not have a steady income. Defaulting can also be expensive, as some lenders charge high default fees. Borrowers who default face other consequences as well. They may be reported to a credit bureau, their credit score will likely drop, and they may find it more difficult to obtain financing in the future. The borrower’s reputation and relationships with friends, family members, or coworkers may also suffer. Finally, borrowers who default on a loan may have trouble paying other obligations, such as utility bills or rent.


In this blog, we will walk through how a bank or financial institution might carry out loan default prediction in the real world.

Introduction to the Dataset

For this we have taken the dataset from Kaggle: Loan_Default_Data.csv
The description of the dataset can be found at: Loan_Default_Data_Description.xlsx

We then loaded the dataset on our local systems using Pandas for analysis and modelling. Some of the characteristics of the dataset were:
1. Total number of features (columns): 111
2. Total number of observations (rows): 42542
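As a rough sketch of the loading step, the snippet below reads a CSV with Pandas and reports its shape. The helper name load_loans and the two sample rows are made up for illustration; with the real file you would pass its path instead of the in-memory sample.

```python
import io

import pandas as pd


def load_loans(source):
    """Read the loan CSV (a path or file-like object) and report its shape."""
    # low_memory=False lets pandas infer each column's dtype from the whole file.
    df = pd.read_csv(source, low_memory=False)
    n_rows, n_cols = df.shape
    print(f"{n_rows} observations (rows), {n_cols} features (columns)")
    return df


# Tiny in-memory stand-in for the real file, with made-up values.
sample_csv = io.StringIO(
    "loan_amnt,term,int_rate,loan_status\n"
    "5000, 36 months,10.65%,Fully Paid\n"
    "2500, 60 months,15.27%,Charged Off\n"
)
loan_df = load_loans(sample_csv)
```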

You can check out the dataset using the above link if you are interested enough to know more about the dataset or if you want to make your own ML model.

Cleaning the Dataset (Data-Preprocessing)

The dataset contained a lot of empty or NA values, which needed to be treated first, along with columns that add nothing to the model. These are the preprocessing steps we followed:

  1. Columns which had more than 50% NA values: the entire column was removed from the dataset. This removed 57 columns.
  2. We still had too many features to work with, i.e., 54 columns, so we needed to understand whether all of them were required to build our model. We went through the dataset manually to understand each column and found that the [‘url’, ‘description’] columns could not directly help us improve the model. So we removed them.
  3. We dropped the columns that contain only a single value, namely: [‘pymnt_plan’, ‘initial_list_status’, ‘collections_12_mths_ex_med’, ‘policy_code’, ‘Application_type’, ‘chargeoff_within_12_mths’], since they are of no use for further analysis, as can be seen from the MS-Excel sheet. Now we have 46 columns to work with.
  4. Next we checked for correlation between columns, since highly correlated columns cause multicollinearity, which can make the model unstable and hurt its accuracy. We removed 5 more columns that were more than 95% correlated with other columns, namely: [‘funded_amnt’, ‘installment’, ‘out_prncp_inv’, ‘total_pymnt_inv’, ‘total_rec_prncp’]. Now we have 41 columns to work with.
  5. Also, any values that only become known after the loan is issued need to be removed, since we cannot know them in advance before lending. One example is ‘recoveries’, which records the post-charge-off gross recovery. Along with a few identifier and redundant columns, we removed the following from our dataset:
    i. [‘id’, ‘member_id’]: do not give us any information about the person taking out the loan.
    ii. ‘funded_amnt_inv’
    iii. ‘grade’, ‘sub_grade’
    iv. ‘issue_d’
    v. ‘zip_code’: not needed since we already have that information in ‘addr_state’.
    vi. [‘out_prncp’, ‘out_prncp_inv’, ‘total_pymnt’, ‘total_pymnt_inv’, ‘total_rec_prncp’]: these 5 variables are all about the future; they tell us how the repayment is going (three of them were already dropped in step 4). So, we need to remove them from our model.
    vii. ‘total_rec_int’ is the interest received to date (meaning the loan has already been approved) and ‘total_rec_late_fee’ is the late fees received to date. These 2 variables are only known once repayment is under way, and the information they carry is reflected in the ‘loan_status’ column, so we removed them.
    viii. [‘recoveries’, ‘collection_recovery_fee’]
    ix. [‘last_pymnt_d’, ‘last_pymnt_amnt’]

We are now left with 27 columns. We could not reduce this number further, so we keep all of them for further analysis and modelling.
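The automatic parts of the cleaning above (steps 1, 3 and 4) can be sketched roughly as follows. This is a minimal illustration on a toy frame with made-up values, not the exact code we ran; the helper names and the choice to drop the later column of each correlated pair are assumptions.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_na_fraction=0.5):
    """Drop columns whose share of missing values exceeds max_na_fraction."""
    na_fraction = df.isna().mean()
    return df.loc[:, na_fraction <= max_na_fraction]


def drop_constant_columns(df):
    """Drop columns that hold a single distinct value (ignoring NA)."""
    n_unique = df.nunique(dropna=True)
    return df.loc[:, n_unique > 1]


def drop_correlated_columns(df, threshold=0.95):
    """Drop the later column of each numeric pair with |correlation| > threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy frame with made-up values, not the Kaggle data.
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "funded_amnt": [1000, 2000, 3000, 4000],       # duplicates loan_amnt
    "annual_inc": [52000, 31000, 78000, 45000],
    "policy_code": [1, 1, 1, 1],                   # single value
    "mostly_na": [np.nan, np.nan, np.nan, 7.0],    # 75% missing
})

cleaned = drop_correlated_columns(drop_constant_columns(drop_sparse_columns(toy)))
print(list(cleaned.columns))
```

The manually chosen columns from steps 2 and 5 can then be removed by name with DataFrame.drop(columns=[...]).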

Data Visualization

Reasons for taking out a loan
Understanding why people take out loans can give banks much greater insight into whether borrowers are likely to repay them. This information can help banks better understand the borrowing patterns and needs of their customers, as well as identify potential market segments and risk factors. This can inform decisions related to loan products, interest rates, and marketing strategies. By analyzing data on loan amounts, loan purposes, demographics, and repayment behavior, banks can make more informed decisions about how to best serve their customers and manage risk.

Job titles of loan applicants

A bank can analyze the loan repayment behavior of individuals with different job titles to identify patterns or correlations. For example, if a bank finds that individuals with certain job titles have a higher likelihood of loan default, they can use this information to adjust their risk assessment and loan underwriting procedures. Additionally, they can also consider the financial stability of different job titles in order to make more informed loan decisions.

Default Rate by Credit Score: A scatter plot to show the relationship between credit score and default rate. This can help identify the credit score range where defaults are most common.

Default Rate by Employment Length: A line graph to show the default rate by the length of time the borrower has been employed. This can help identify the impact of stable employment on loan repayment.
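The numbers behind a chart like this can be computed with a simple group-by. The sketch below uses made-up rows and assumes a hypothetical 0/1 column is_default derived from loan_status, plus an emp_length column already converted to years.

```python
import pandas as pd

# Made-up rows; 'is_default' is a hypothetical 0/1 flag derived from loan_status.
loans = pd.DataFrame({
    "emp_length": [1, 1, 1, 1, 5, 5, 10, 10, 10, 10],
    "is_default": [1, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 flag within each group is that group's default rate.
default_rate = loans.groupby("emp_length")["is_default"].mean()
print(default_rate)
# default_rate.plot(kind="line") would draw the line graph if matplotlib is installed.
```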

Loan Amount vs. Default Rate: A bubble chart to show the loan amount and default rate for each borrower. This can help identify any patterns between loan amount and default rate.

Heatmap between Loan Default & Loan Features: Created a heatmap to understand the correlation between different loan features and loan default. This could help us identify which loan features have the strongest correlation with loan default.
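A heatmap of this kind is drawn from a correlation matrix. As a rough sketch on made-up numbers (the is_default flag is again a hypothetical 0/1 version of loan_status), the matrix and the ranking of features by their correlation with default could be computed like this:

```python
import pandas as pd

# Made-up values for illustration only.
loans = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000, 15000, 3000],
    "int_rate": [7.5, 13.2, 10.1, 18.9, 16.4, 6.8],
    "annual_inc": [60000, 42000, 55000, 30000, 38000, 72000],
    "is_default": [0, 1, 0, 1, 1, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = loans.corr()

# Rank the features by the strength of their correlation with the target.
target_corr = corr["is_default"].drop("is_default").abs().sort_values(ascending=False)
print(target_corr)
# A plotting library can render `corr` as a heatmap, e.g. seaborn.heatmap(corr, annot=True).
```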

Histogram of Loan Amount: Shows the distribution of loan amounts in the dataset. It gives us an idea about the range of loan amounts that are common in the dataset and how frequently they occur. It also helps us identify any outliers or unusual loan amounts in the dataset.

Grade vs. Loan Amount Violin Plot: Created a violin plot of loan amounts split by loan grade. This could help us understand the relationship between loan grade and loan amount, and whether the distribution of loan amounts differs significantly between grades.

Target Column

The target column for our Machine Learning model is “loan_status”, which tells us whether the loan was defaulted on or not. From a business point of view, the bar plot below can help us identify the distribution of loan status categories in the loan_df dataset, which can provide insights into the health of the company’s loan portfolio.

For the ML modelling, however, we will have only two target values, i.e., Fully Paid and Default, combining all the target values except “Fully Paid” into the single value “Default”.
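Collapsing the statuses into two classes can be sketched as below. The status labels other than “Fully Paid” are made-up examples, and encoding Default as 1 is our assumption for illustration.

```python
import pandas as pd

# Example statuses; every label other than "Fully Paid" is treated as a default.
loan_status = pd.Series(
    ["Fully Paid", "Charged Off", "Fully Paid", "Late (31-120 days)", "Fully Paid"],
    name="loan_status",
)

# 0 = Fully Paid, 1 = Default (all other statuses combined).
is_default = (loan_status != "Fully Paid").astype(int)
print(is_default.value_counts())
```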

“Charged off” is a term used by lenders to describe an account that they do not expect to be paid back and consider it as a loss. When a borrower fails to make payments on a debt obligation for a certain period of time, typically 120 to 180 days, the lender may declare the account as “charged off” and write it off as a loss in their financial statements. This does not, however, relieve the borrower of the obligation to pay the debt, and the lender may still pursue collection activities or sell the account to a collection agency. “Charged off” accounts can have a negative impact on a borrower’s credit score and credit history.

Still, we see that very few rows have a charged-off outcome, and such imbalanced data impacts the accuracy of the ML model. So we increase the number of “Default” rows until it equals the number of “Fully Paid” rows using the SMOTE technique. This is called oversampling: it generates synthetic rows for the under-represented output value.
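To give an idea of how SMOTE creates synthetic rows, here is a simplified SMOTE-style sketch in plain NumPy. It is not the library implementation we used; the function name, the value of k and the toy numbers are assumptions. Each synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority rows.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, : min(k, n - 1)]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                # pick a random minority row
        other = rng.choice(neighbours[base])  # pick one of its nearest neighbours
        gap = rng.random()                    # random position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[other] - X_minority[base])
    return synthetic


# Made-up numeric features: 6 "Fully Paid" rows vs 3 "Default" rows.
X_paid = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.1], [1.8, 1.9], [1.1, 2.4]])
X_default = np.array([[5.0, 6.0], [5.5, 6.5], [6.0, 6.2]])

new_rows = smote_like_oversample(X_default, n_new=len(X_paid) - len(X_default))
X_default_balanced = np.vstack([X_default, new_rows])
print(X_default_balanced.shape)
```

In practice a library such as imbalanced-learn provides a full SMOTE implementation, and oversampling is usually applied to the training data only so that the test set keeps real rows.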

Data Transformation

In order to process data in machine learning models, the data needs to be in a numerical format, either in integer or float data types. However, the original data may contain non-numerical values, such as text or categorical variables. Therefore, before feeding the data into machine learning algorithms, it is necessary to transform the data into numerical format.
Before transforming the data, it is important to first understand the data type of each variable, such as whether it is numerical or categorical, as well as the distribution of each variable. This information can help inform the appropriate transformation techniques to use and ensure that the data is properly processed for machine learning models.

Data transformation involves converting the original data into a format that is suitable for analysis. This includes cleaning and preprocessing the data, handling missing values, and converting categorical variables to numerical values through techniques such as one-hot encoding or label encoding. Additionally, some numerical variables may need to be transformed, such as scaling or normalizing data to ensure that all variables are on the same scale.

So now we know which columns have to be converted to numerical values:

  1. term: It is either 36 or 60 months
  2. int_rate: We have to remove the trailing % sign and convert it to a numerical value
  3. emp_length: Needs to be converted into a numeric value by changing the labels
  4. purpose and title convey the same thing, so we will keep only purpose since it has fewer labels and will help us build a better model
  5. home_ownership: It can be of 4 types: RENT, MORTGAGE, OWN, OTHER
  6. verification_status: This has 3 labels associated with it
  7. revol_util: Need to remove the trailing % sign, otherwise no change
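The string clean-up in the list above can be sketched as follows. The raw formats shown (“ 36 months”, “10.65%”, “10+ years”, “< 1 year”) are assumptions about how the values look in the file, and mapping “< 1 year” to 0 is our own choice for illustration.

```python
import pandas as pd

# Made-up rows in the assumed raw string formats.
raw = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "int_rate": ["10.65%", "15.27%", "7.90%"],
    "emp_length": ["10+ years", "< 1 year", "3 years"],
    "revol_util": ["83.7%", "9.4%", "21%"],
})


def percent_to_float(series):
    """Turn strings like '10.65%' into the float 10.65."""
    return series.str.rstrip("%").astype(float)


def first_number(series):
    """Extract the first run of digits from each string as an integer."""
    return series.str.extract(r"(\d+)", expand=False).astype(int)


clean = pd.DataFrame({
    "term": first_number(raw["term"]),                       # ' 36 months' -> 36
    "int_rate": percent_to_float(raw["int_rate"]),
    # '< 1 year' -> 0, '3 years' -> 3, '10+ years' -> 10
    "emp_length": first_number(raw["emp_length"].replace({"< 1 year": "0 years"})),
    "revol_util": percent_to_float(raw["revol_util"]),
})
print(clean)
```

Rows with missing values would need to be filled or dropped before the integer conversion, since astype(int) fails on NA.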

One hot encoding (OHE) or dummy variable encoding is a technique used to convert categorical variables into numerical variables so that they can be used as input for machine learning models. In this technique, each category in a categorical variable is converted into a binary feature, where each feature represents a single category.

For example, suppose we have a categorical variable called “color,” which has three categories: red, blue, and green. Using OHE, we would create three binary features: “color_red,” “color_blue,” and “color_green.” Each observation in the dataset would be assigned a value of 1 for the corresponding feature and 0 for all other features. For instance, if an observation has the value “red” for the “color” variable, then it will have a value of 1 for the “color_red” feature and 0 for the “color_blue” and “color_green” features.
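The color example can be reproduced with Pandas in a few lines. This is a small sketch using pd.get_dummies; the four sample rows are made up.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

The same call can be applied to columns such as home_ownership, verification_status and purpose in the loan data.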

After transformation the following columns had datatypes as follows:

Model Building

Now we have a preprocessed dataset which can be used to build the Machine Learning model to predict whether a person will default on a loan or not.

We define two variables which are to be used as X (features) and Y (target):

  • X = [all the columns/features of loan_df dataset except target column loan_status]
  • Y = loan_status column

Now we divide X and Y into a training set (70%) and a testing set (the remaining 30%). We then model the data with different ML models, namely:

  1. Logistic Regression
  2. Random Forest Classifier
  3. Decision Tree Classifier

We found that the best ML model in terms of performance metrics was the Random Forest Classifier, with an overall accuracy of 96.41% in detecting loan defaults.
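The split-and-compare workflow can be sketched with scikit-learn as below. The data here is synthetic (generated with make_classification) rather than our loan_df, and the hyperparameters and random seeds are assumptions, so the printed accuracies say nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features (X) and loan_status target (Y).
X, Y = make_classification(n_samples=500, n_features=10, random_state=42)

# 70% of the rows go to training, the remaining 30% to testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    scores[name] = accuracy_score(Y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")
```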


Thus it has become very important for banks to accurately predict loan defaults in order to minimize financial risk and maximize profitability. By analyzing a borrower’s credit history, income, employment status, and other factors, banks can assess the likelihood of the borrower being able to repay the loan. This information can then be used to determine the interest rate and other terms of the loan.

If banks do not accurately predict loan defaults, they may approve loans to high-risk borrowers who are unlikely to repay the loan. This can lead to a high number of defaults and financial losses for the bank. On the other hand, if banks are too cautious and deny loans to low-risk borrowers, they may miss out on potential profits.

Therefore, banks need to strike a balance between risk and profitability by accurately predicting loan defaults. This requires the use of advanced data analysis techniques and machine learning algorithms to identify patterns and trends in borrower data that can indicate a high risk of default. By using these tools, banks can make informed decisions about loan approval and minimize their financial risk while maximizing profits.

Thus, here we have tried to replicate the kind of Machine Learning models that banks use to understand which people they can trust with the loans they provide.




The Business Club Of NITT