The purpose of the effort is to perform elementary data analysis. I want to identify loans that have not been paid off as expected, and how this status is related to other variables present. The dataset used is the ‘Loan data from Prosper’ in the Udacity data sets.
This is a list of variables that are present in the dataset. Since there are 81 variables here, I plan to select 10-15 of these variables.
## [1] ListingKey
## [2] ListingNumber
## [3] ListingCreationDate
## [4] CreditGrade
## [5] Term
## [6] LoanStatus
## [7] ClosedDate
## [8] BorrowerAPR
## [9] BorrowerRate
## [10] LenderYield
## [11] EstimatedEffectiveYield
## [12] EstimatedLoss
## [13] EstimatedReturn
## [14] ProsperRating (numeric)
## [15] ProsperRating (Alpha)
## [16] ProsperScore
## [17] ListingCategory
## [18] BorrowerState
## [19] Occupation
## [20] EmploymentStatus
## [21] EmploymentStatusDuration
## [22] IsBorrowerHomeowner
## [23] CurrentlyInGroup
## [24] GroupKey
## [25] DateCreditPulled
## [26] CreditScoreRangeLower
## [27] CreditScoreRangeUpper
## [28] FirstRecordedCreditLine
## [29] CurrentCreditLines
## [30] OpenCreditLines
## [31] TotalCreditLinespast7years
## [32] OpenRevolvingAccounts
## [33] OpenRevolvingMonthlyPayment
## [34] InquiriesLast6Months
## [35] TotalInquiries
## [36] CurrentDelinquencies
## [37] AmountDelinquent
## [38] DelinquenciesLast7Years
## [39] PublicRecordsLast10Years
## [40] PublicRecordsLast12Months
## [41] RevolvingCreditBalance
## [42] BankcardUtilization
## [43] AvailableBankcardCredit
## [44] TotalTrades
## [45] TradesNeverDelinquent
## [46] TradesOpenedLast6Months
## [47] DebtToIncomeRatio
## [48] IncomeRange
## [49] IncomeVerifiable
## [50] StatedMonthlyIncome
## [51] LoanKey
## [52] TotalProsperLoans
## [53] TotalProsperPaymentsBilled
## [54] OnTimeProsperPayments
## [55] ProsperPaymentsLessThanOneMonthLate
## [56] ProsperPaymentsOneMonthPlusLate
## [57] ProsperPrincipalBorrowed
## [58] ProsperPrincipalOutstanding
## [59] ScorexChangeAtTimeOfListing
## [60] LoanCurrentDaysDelinquent
## [61] LoanFirstDefaultedCycleNumber
## [62] LoanMonthsSinceOrigination
## [63] LoanNumber
## [64] LoanOriginalAmount
## [65] LoanOriginationDate
## [66] LoanOriginationQuarter
## [67] MemberKey
## [68] MonthlyLoanPayment
## [69] LP_CustomerPayments
## [70] LP_CustomerPrincipalPayments
## [71] LP_InterestandFees
## [72] LP_ServiceFees
## [73] LP_CollectionFees
## [74] LP_GrossPrincipalLoss
## [75] LP_NetPrincipalLoss
## [76] LP_NonPrincipalRecoverypayments
## [77] PercentFunded
## [78] Recommendations
## [79] InvestmentFromFriendsCount
## [80] InvestmentFromFriendsAmount
## [81] Investors
## 81 Levels: AmountDelinquent ... TradesOpenedLast6Months
The initial set of fields I’m selecting are “Term”,“LoanStatus”,“BorrowerAPR”,“ListingCategory”,“Occupation”,“IsBorrowerHomeowner”,“CreditScoreRangeLower”,“CurrentCreditLines”,“DelinquenciesLast7Years”,“DebtToIncomeRatio”,“StatedMonthlyIncome”,“LoanOriginalAmount”,“LoanOriginationQuarter” and “BankcardUtilization”. My intent is to identify and select 10-15 variables that are fairly independent, but important to the analysis. After initial investigation, I may modify or remove some of these fields. The criteria I used to select these fields are 1. Documentation of the field is clear 2. Fields seem relevant to the ‘LoanStatus’ field. 3. The fields are relatively independent of each other.
I read in the data and create a subset with the selected columns. I then get summary information for each of the variables in the set
## [1] 14
## [1] 113937
## 'data.frame': 113937 obs. of 14 variables:
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ListingCategory : int 0 2 0 16 2 1 1 2 7 7 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ IsBorrowerHomeowner : logi TRUE FALSE FALSE TRUE TRUE TRUE ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ DelinquenciesLast7Years: int 4 0 0 14 0 0 0 0 0 0 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## Term LoanStatus BorrowerAPR
## Min. :12.00 Current :56576 Min. :0.00653
## 1st Qu.:36.00 Completed :38074 1st Qu.:0.15629
## Median :36.00 Chargedoff :11992 Median :0.20976
## Mean :40.83 Defaulted : 5018 Mean :0.21883
## 3rd Qu.:36.00 Past Due (1-15 days) : 806 3rd Qu.:0.28381
## Max. :60.00 Past Due (31-60 days): 363 Max. :0.51229
## (Other) : 1108 NA's :25
## ListingCategory Occupation IsBorrowerHomeowner
## Min. : 0.000 Other :28617 Mode :logical
## 1st Qu.: 1.000 Professional :13628 FALSE:56459
## Median : 1.000 Computer Programmer : 4478 TRUE :57478
## Mean : 2.774 Executive : 4311 NA's :0
## 3rd Qu.: 3.000 Teacher : 3759
## Max. :20.000 Administrative Assistant: 3688
## (Other) :55456
## CreditScoreRangeLower CurrentCreditLines DelinquenciesLast7Years
## Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:660.0 1st Qu.: 7.00 1st Qu.: 0.000
## Median :680.0 Median :10.00 Median : 0.000
## Mean :685.6 Mean :10.32 Mean : 4.155
## 3rd Qu.:720.0 3rd Qu.:13.00 3rd Qu.: 3.000
## Max. :880.0 Max. :59.00 Max. :99.000
## NA's :591 NA's :7604 NA's :990
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## Min. : 0.000 Min. : 0 Min. : 1000
## 1st Qu.: 0.140 1st Qu.: 3200 1st Qu.: 4000
## Median : 0.220 Median : 4667 Median : 6500
## Mean : 0.276 Mean : 5608 Mean : 8337
## 3rd Qu.: 0.320 3rd Qu.: 6825 3rd Qu.:12000
## Max. :10.010 Max. :1750003 Max. :35000
## NA's :8554
## LoanOriginationQuarter BankcardUtilization
## Q4 2013:14450 Min. :0.000
## Q1 2014:12172 1st Qu.:0.310
## Q3 2013: 9180 Median :0.600
## Q2 2013: 7099 Mean :0.561
## Q3 2012: 5632 3rd Qu.:0.840
## Q2 2012: 5061 Max. :5.950
## (Other):60343 NA's :7604
The following columns can be converted to factors - Term, Listing category. Also the loan origination quarter may need to be modified so that they can be ordered chronologically.
As a first step I’ll create histograms for the variables listed. For ‘Listing categories’, which is a numerical column, the categories are : 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
From the histogram and the table, about 32% of the loans are completed or in final payments. 16% have not been successful outcome, i.e. either chargedoff, defaulted or payments are past due. So, 33% of all loans that are not current have failed. This is a large percentage and needs to be investigated. It will be easier if a new column is created which simplifies the outcomes. All chargedoff, defaulted or past due loans will be marked as failed, Completed loans will be marked as successful and Current loans will be marked as ongoing.
##
## Accountant/CPA
## 3588 3233
## Administrative Assistant Analyst
## 3688 3602
## Architect Attorney
## 213 1046
## Biologist Bus Driver
## 125 316
## Car Dealer Chemist
## 180 145
## Civil Service Clergy
## 1457 196
## Clerical Computer Programmer
## 3164 4478
## Construction Dentist
## 1790 68
## Doctor Engineer - Chemical
## 494 225
## Engineer - Electrical Engineer - Mechanical
## 1125 1406
## Executive Fireman
## 4311 422
## Flight Attendant Food Service
## 123 1123
## Food Service Management Homemaker
## 1239 120
## Investor Judge
## 214 22
## Laborer Landscaping
## 1595 236
## Medical Technician Military Enlisted
## 1117 1272
## Military Officer Nurse's Aide
## 346 491
## Nurse (LPN) Nurse (RN)
## 492 2489
## Other Pharmacist
## 28617 257
## Pilot - Private/Commercial Police Officer/Correction Officer
## 199 1578
## Postal Service Principal
## 627 312
## Professional Professor
## 13628 557
## Psychologist Realtor
## 145 543
## Religious Retail Management
## 124 2602
## Sales - Commission Sales - Retail
## 3446 2797
## Scientist Skilled Labor
## 372 2746
## Social Worker Student - College Freshman
## 741 41
## Student - College Graduate Student Student - College Junior
## 245 112
## Student - College Senior Student - College Sophomore
## 188 69
## Student - Community College Student - Technical School
## 28 16
## Teacher Teacher's Aide
## 3759 276
## Tradesman - Carpenter Tradesman - Electrician
## 120 477
## Tradesman - Mechanic Tradesman - Plumber
## 951 102
## Truck Driver Waiter/Waitress
## 1675 436
The field Occupation has many entries. The professions with the largest count of loans are ‘Other’ and ‘Professional’. Also, these values are sometimes ambiguous. For example the fields ‘Professional’ and ‘Principal’ are not clear on the exact field. With so many values, an analysis of this field may not provide much value. Going forward, this field will not be checked.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
This is a numeric field with these categories : 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
It would help to convert this field to a factor.From the histogram, the most common categories are ‘Debt Consolidation’(Close to 50%),‘Not Available’(15%) or ‘Other’(10%). Not sure if the applicants were not sure of the purpose of the loan, or didnt care to fill the field accurately. The field may not be very helpful, but it will be interesting to check the failure rate for the various categories.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
The Borrower APR varies normally from a minimum of 0.01 to a max of 0.5. The peak count is at 0.2. There seems to be a high count of about 6000 for the APR of 0.36, which breaks out of the normal curve.
##
## FALSE TRUE
## 56459 57478
Count of homeowners and renters are about equal, close to 56000 each.It will be interesting to check the loan failure rate for these two categories, in order to check on any differences.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
Credit scores range from 400 at the low end to 880. It follows a bell curve, with some gaps.It will be interesting to check if there is any corelation between the credit scores and the bad loans. Also, since credit scores are based on some of the other fields such as ‘Salary’, it will be helpful to check the corelation between them.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.32 13.00 59.00 7604
All borrowers seem to have credit lines. The range is from 1 to 40 credit lines. The curve is positively skewed. i.e. There are more borrowers with higher number of credit lines. It will be interesting to check if those with higher number of credit lines can pay off he loans in a timely manner.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
As expected the curve shows a decrease in count as the number of delinquencies increase. There are 4000 borrowers with zero deliquency. Maximum number of delinquencies is 49.I’d like to check if borrowers with more delinquencies have higher loan repayment failures.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The debt to income ratio has a peak count at about 0.18. The maximum count is at a value of 0.12. I’d expect those with a lower debtoincome ratio to be better at paying off the loan, and would expect them to have a better credit score.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
I plotted this in a log scale since there were many borrowers with loan amounts less than 10,000. The graph shows the peak count of loans around 5000 and maximum values at 35000. It will be interesting to check the status of loans by the amount of the loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
This is a positively skewed curve, with a max count at about $4000. I expect credit scores to be directly correlated to the monthly income. I’m not sure if some of the higher figures are accurate. i.e. max value of monthly income is stated as 1750003.
The loans origination date ranges from 2006-2013. The graph is not chronologically sorted. It will help to modify this field for that purpose. Also, if the results for 2006-2008 are starkly different from 2009-2013 due to pre and post financial crisis scenario, it may help to separate out the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 6782 730 632 606 611 629 577 603 538 531 616 594 549 609 646
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 607 641 626 625 660 680 702 697 700 722 711 741 783 813 746
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 773 803 764 757 787 736 788 819 820 831 845 857 903 853 878
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 926 893 922 937 924 967 1012 930 994 1015 967 1031 1049 1002 1016
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 1074 1010 1104 1108 1156 1066 1155 1095 1102 1118 1160 1115 1067 1066 1123
## 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89
## 1043 1074 1130 1081 1113 1144 1166 1169 1211 1233 1199 1243 1303 1313 1302
## 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1 1.01 1.02 1.03 1.04
## 1334 1427 1507 1546 1664 1791 1798 2053 2171 1719 830 400 243 164 105
## 1.05 1.06 1.07 1.08 1.09 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19
## 95 64 58 51 46 40 29 34 30 28 26 17 19 18 15
## 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.3 1.31 1.32 1.33 1.34
## 13 15 9 13 10 12 8 8 9 7 6 10 8 2 13
## 1.35 1.36 1.37 1.38 1.39 1.4 1.41 1.43 1.44 1.45 1.46 1.47 1.5 1.51 1.52
## 6 4 4 3 1 4 3 3 1 2 4 1 5 5 3
## 1.53 1.54 1.55 1.57 1.58 1.59 1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.69 1.7
## 2 5 1 3 1 1 2 2 2 1 1 3 1 1 1
## 1.72 1.73 1.75 1.76 1.78 1.79 1.8 1.81 1.82 1.83 1.84 1.86 1.88 1.89 1.9
## 1 2 2 1 1 2 1 1 1 1 1 3 1 1 1
## 1.96 1.97 1.99 2 2.04 2.05 2.07 2.21 2.23 2.25 2.27 2.35 2.36 2.46 2.5
## 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1
## 2.66 2.68 3.47 4.73 5.83 5.95
## 1 1 1 1 1 1
Bank card utilzation count increases steadily from 2000 with 0.05 utilization to 7500 with 1.0 utilization. Intermittently, some values such as 0.01, 0.16, 0.28 have a higher count. There also seem to be some erroneous values greater than 1.
I added Labels to the listing categories to enable convenient analysis of graphs.
I changed the term field to a factor. This will improve the display.
I removed entries from the set that have invalid values for Bank Card Utilization
I changed the Loan origination guarter to the format ‘2012 Q3’ instead of ‘Q3 2012’. This will help us order the graphs chronologically. The graph seems to show a dip in the count of loans beginning 2008 Q2 to 2011 Q4. This could be due to the impact of the 2008 financial crisis.
## [1] "Good" "Ongoing" "Ongoing" "Ongoing" "Ongoing" "Ongoing"
I created a new column ‘Loan Result’ that simplifies the Loan Status. It has one of the three values - ‘Bad’, Good’ and ‘Ongoing’. All completed and nearly completed loans are good. Loans other that this and current are ‘bad’.
Next I recreate many of the previous histogram plots, with the results colored as ‘LoanResult’
##
## Bad Good Ongoing
## 15227 33070 56292
From the total number of loans, about 53000 are ongoing, 33000 have completed successfully and 15000 have failed. i.e. the overall ratio of bad to good loans is 0.46 and the ratio of bad to loans that are not current is 0.32. I plan to focus on the factors which tip the ratio towards bad loans.
Another bivariate plot is the comparison of ‘debt to income ratio’ with the LoanResult
## mySet$LoanResult: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1500 0.2300 0.3542 0.3500 10.0100 1603
## --------------------------------------------------------
## mySet$LoanResult: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1300 0.2000 0.2644 0.2900 10.0100 2661
## --------------------------------------------------------
## mySet$LoanResult: Ongoing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.160 0.230 0.262 0.320 10.010 4090
The summary data shows median DebttoIncome ratio of 0.23 for bad loans Vs 0.20 for the good loans. However there are a large number of outliers in both cases. I’m cannot consider this result reliable enough to investigate further.
From the plot, it seems the count of bad loans increases steadily with increase in APR, and then spikes up at an 0.36 APR. It may help to plot bad loans as a percentage of total loans.
## mySet$ListingCategory: Not Available
##
## Bad Good Ongoing
## 3330 5549 12
## --------------------------------------------------------
## mySet$ListingCategory: Debt Consolidation
##
## Bad Good Ongoing
## 5562 13008 39024
## --------------------------------------------------------
## mySet$ListingCategory: Home Improvement
##
## Bad Good Ongoing
## 1001 2423 3940
## --------------------------------------------------------
## mySet$ListingCategory: Business
##
## Bad Good Ongoing
## 1555 2834 2726
## --------------------------------------------------------
## mySet$ListingCategory: Personal Loan
##
## Bad Good Ongoing
## 690 1552 0
## --------------------------------------------------------
## mySet$ListingCategory: Student Use
##
## Bad Good Ongoing
## 182 542 0
## --------------------------------------------------------
## mySet$ListingCategory: Auto
##
## Bad Good Ongoing
## 366 1207 964
## --------------------------------------------------------
## mySet$ListingCategory: Other
##
## Bad Good Ongoing
## 1661 4512 4190
## --------------------------------------------------------
## mySet$ListingCategory: Baby&Adoption
##
## Bad Good Ongoing
## 19 26 153
## --------------------------------------------------------
## mySet$ListingCategory: Boat
##
## Bad Good Ongoing
## 5 25 55
## --------------------------------------------------------
## mySet$ListingCategory: Cosmetic Procedure
##
## Bad Good Ongoing
## 13 34 43
## --------------------------------------------------------
## mySet$ListingCategory: Engagement Ring
##
## Bad Good Ongoing
## 13 59 144
## --------------------------------------------------------
## mySet$ListingCategory: Green Loans
##
## Bad Good Ongoing
## 12 11 36
## --------------------------------------------------------
## mySet$ListingCategory: Household Expenses
##
## Bad Good Ongoing
## 292 373 1307
## --------------------------------------------------------
## mySet$ListingCategory: Large Purchases
##
## Bad Good Ongoing
## 83 140 649
## --------------------------------------------------------
## mySet$ListingCategory: Medical/Dental
##
## Bad Good Ongoing
## 201 234 1067
## --------------------------------------------------------
## mySet$ListingCategory: Motorcycle
##
## Bad Good Ongoing
## 16 87 200
## --------------------------------------------------------
## mySet$ListingCategory: RV
##
## Bad Good Ongoing
## 2 16 33
## --------------------------------------------------------
## mySet$ListingCategory: Taxes
##
## Bad Good Ongoing
## 83 161 638
## --------------------------------------------------------
## mySet$ListingCategory: Vacation
##
## Bad Good Ongoing
## 73 150 538
## --------------------------------------------------------
## mySet$ListingCategory: Wedding Loans
##
## Bad Good Ongoing
## 68 127 573
The highest count of bad loans is for ‘Debt consolidation’ - About 5500. However, it is not clear which Listing Category has the highest percentage of bad loans.
## mySet$IsBorrowerHomeowner: FALSE
##
## Bad Good Ongoing
## 7859 16545 25987
## --------------------------------------------------------
## mySet$IsBorrowerHomeowner: TRUE
##
## Bad Good Ongoing
## 7368 16525 30305
Both Homeowners and renters seem to have approximately the same counts for successful and failed loans. It may not help to investigate this field. Will not continue with this factor further.
## mySet$CreditScoreRangeLower: 520
##
## Bad Good Ongoing
## 521 404 0
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 540
##
## Bad Good Ongoing
## 288 276 0
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 560
##
## Bad Good Ongoing
## 381 415 0
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 580
##
## Bad Good Ongoing
## 346 401 0
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 600
##
## Bad Good Ongoing
## 1001 1408 369
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 620
##
## Bad Good Ongoing
## 1090 1757 638
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 640
##
## Bad Good Ongoing
## 2124 3646 5574
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 660
##
## Bad Good Ongoing
## 2072 3909 9816
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 680
##
## Bad Good Ongoing
## 2029 4004 9924
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 700
##
## Bad Good Ongoing
## 1721 3667 9729
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 720
##
## Bad Good Ongoing
## 1358 3545 7660
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 740
##
## Bad Good Ongoing
## 958 2987 5083
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 760
##
## Bad Good Ongoing
## 688 2459 3267
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 780
##
## Bad Good Ongoing
## 364 1834 2272
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 800
##
## Bad Good Ongoing
## 182 1149 1194
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 820
##
## Bad Good Ongoing
## 79 721 537
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 840
##
## Bad Good Ongoing
## 15 333 185
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 860
##
## Bad Good Ongoing
## 10 136 39
## --------------------------------------------------------
## mySet$CreditScoreRangeLower: 880
##
## Bad Good Ongoing
## 0 19 5
from the histogram, the distribution of bad loans, good loans and ongoing loans seems to be close to normal. The ‘Bad loans curve’ is skewed to the left, and that for good loans and Ongoing loans is skewed to the right. i.e. borrowers with lower credit scores seem have more failed loans, and those with higher credit scores seem to have more successful loan payback.
## mySet$CurrentCreditLines: 0
##
## Bad Good Ongoing
## 175 135 74
## --------------------------------------------------------
## mySet$CurrentCreditLines: 1
##
## Bad Good Ongoing
## 500 579 244
## --------------------------------------------------------
## mySet$CurrentCreditLines: 2
##
## Bad Good Ongoing
## 694 1055 628
## --------------------------------------------------------
## mySet$CurrentCreditLines: 3
##
## Bad Good Ongoing
## 823 1478 1105
## --------------------------------------------------------
## mySet$CurrentCreditLines: 4
##
## Bad Good Ongoing
## 972 1927 1762
## --------------------------------------------------------
## mySet$CurrentCreditLines: 5
##
## Bad Good Ongoing
## 998 2235 2974
## --------------------------------------------------------
## mySet$CurrentCreditLines: 6
##
## Bad Good Ongoing
## 1099 2524 3618
## --------------------------------------------------------
## mySet$CurrentCreditLines: 7
##
## Bad Good Ongoing
## 1189 2735 4833
## --------------------------------------------------------
## mySet$CurrentCreditLines: 8
##
## Bad Good Ongoing
## 1078 2805 4953
## --------------------------------------------------------
## mySet$CurrentCreditLines: 9
##
## Bad Good Ongoing
## 1064 2606 4935
## --------------------------------------------------------
## mySet$CurrentCreditLines: 10
##
## Bad Good Ongoing
## 969 2327 4737
## --------------------------------------------------------
## mySet$CurrentCreditLines: 11
##
## Bad Good Ongoing
## 851 2173 4387
## --------------------------------------------------------
## mySet$CurrentCreditLines: 12
##
## Bad Good Ongoing
## 805 1839 3802
## --------------------------------------------------------
## mySet$CurrentCreditLines: 13
##
## Bad Good Ongoing
## 717 1617 3294
## --------------------------------------------------------
## mySet$CurrentCreditLines: 14
##
## Bad Good Ongoing
## 640 1384 2856
## --------------------------------------------------------
## mySet$CurrentCreditLines: 15
##
## Bad Good Ongoing
## 487 1113 2347
## --------------------------------------------------------
## mySet$CurrentCreditLines: 16
##
## Bad Good Ongoing
## 439 970 2036
## --------------------------------------------------------
## mySet$CurrentCreditLines: 17
##
## Bad Good Ongoing
## 351 744 1512
## --------------------------------------------------------
## mySet$CurrentCreditLines: 18
##
## Bad Good Ongoing
## 293 590 1347
## --------------------------------------------------------
## mySet$CurrentCreditLines: 19
##
## Bad Good Ongoing
## 213 442 1071
## --------------------------------------------------------
## mySet$CurrentCreditLines: 20
##
## Bad Good Ongoing
## 178 382 808
## --------------------------------------------------------
## mySet$CurrentCreditLines: 21
##
## Bad Good Ongoing
## 141 281 643
## --------------------------------------------------------
## mySet$CurrentCreditLines: 22
##
## Bad Good Ongoing
## 102 238 485
## --------------------------------------------------------
## mySet$CurrentCreditLines: 23
##
## Bad Good Ongoing
## 96 193 377
## --------------------------------------------------------
## mySet$CurrentCreditLines: 24
##
## Bad Good Ongoing
## 75 163 323
## --------------------------------------------------------
## mySet$CurrentCreditLines: 25
##
## Bad Good Ongoing
## 64 114 264
## --------------------------------------------------------
## mySet$CurrentCreditLines: 26
##
## Bad Good Ongoing
## 52 95 201
## --------------------------------------------------------
## mySet$CurrentCreditLines: 27
##
## Bad Good Ongoing
## 34 66 150
## --------------------------------------------------------
## mySet$CurrentCreditLines: 28
##
## Bad Good Ongoing
## 20 56 128
## --------------------------------------------------------
## mySet$CurrentCreditLines: 29
##
## Bad Good Ongoing
## 22 37 86
## --------------------------------------------------------
## mySet$CurrentCreditLines: 30
##
## Bad Good Ongoing
## 22 31 66
## --------------------------------------------------------
## mySet$CurrentCreditLines: 31
##
## Bad Good Ongoing
## 11 36 44
## --------------------------------------------------------
## mySet$CurrentCreditLines: 32
##
## Bad Good Ongoing
## 7 27 41
## --------------------------------------------------------
## mySet$CurrentCreditLines: 33
##
## Bad Good Ongoing
## 10 16 36
## --------------------------------------------------------
## mySet$CurrentCreditLines: 34
##
## Bad Good Ongoing
## 3 12 24
## --------------------------------------------------------
## mySet$CurrentCreditLines: 35
##
## Bad Good Ongoing
## 7 10 23
## --------------------------------------------------------
## mySet$CurrentCreditLines: 36
##
## Bad Good Ongoing
## 5 6 23
## --------------------------------------------------------
## mySet$CurrentCreditLines: 37
##
## Bad Good Ongoing
## 3 5 15
## --------------------------------------------------------
## mySet$CurrentCreditLines: 38
##
## Bad Good Ongoing
## 2 7 14
## --------------------------------------------------------
## mySet$CurrentCreditLines: 39
##
## Bad Good Ongoing
## 2 3 8
## --------------------------------------------------------
## mySet$CurrentCreditLines: 40
##
## Bad Good Ongoing
## 3 2 5
## --------------------------------------------------------
## mySet$CurrentCreditLines: 41
##
## Bad Good Ongoing
## 4 2 2
## --------------------------------------------------------
## mySet$CurrentCreditLines: 42
##
## Bad Good Ongoing
## 0 1 2
## --------------------------------------------------------
## mySet$CurrentCreditLines: 43
##
## Bad Good Ongoing
## 0 0 1
## --------------------------------------------------------
## mySet$CurrentCreditLines: 44
##
## Bad Good Ongoing
## 0 3 1
## --------------------------------------------------------
## mySet$CurrentCreditLines: 45
##
## Bad Good Ongoing
## 1 0 2
## --------------------------------------------------------
## mySet$CurrentCreditLines: 46
##
## Bad Good Ongoing
## 1 0 0
## --------------------------------------------------------
## mySet$CurrentCreditLines: 47
##
## Bad Good Ongoing
## 1 0 2
## --------------------------------------------------------
## mySet$CurrentCreditLines: 48
##
## Bad Good Ongoing
## 2 1 0
## --------------------------------------------------------
## mySet$CurrentCreditLines: 51
##
## Bad Good Ongoing
## 0 1 0
## --------------------------------------------------------
## mySet$CurrentCreditLines: 52
##
## Bad Good Ongoing
## 2 1 0
## --------------------------------------------------------
## mySet$CurrentCreditLines: 54
##
## Bad Good Ongoing
## 0 0 3
## --------------------------------------------------------
## mySet$CurrentCreditLines: 56
##
## Bad Good Ongoing
## 0 2 0
## --------------------------------------------------------
## mySet$CurrentCreditLines: 59
##
## Bad Good Ongoing
## 0 1 0
The ratio of bad to good loans is higher at the extremes, i.e. below credit lines of 3 and higher than credit lines of 30. Between 3-30 credit lines, the count of bad loans is flat seems to be much lower than that of the good loans.
The number of Bad loans to Good loans is higher when the number of delinquencies is low. However, from the graph the ratio of bad to good loan count seems to increase as the number of delinquencies increase.
The ratio of bad to good loans is low when debt to Income ratio is low. The ratio seems to increase after DebttoIncomeRatio exceeds 0.3.
Most of the loans are below the value of 10000. Also, the ratio of bad to good loans is low when the loan amount is below 10,000. As the loan amount increases beyond this value the ratio increases.
Most of the loans are located in the mid range of monthly income - between 3000 & 8000. It is not clear how the ratio changes across the range of monthly income. Also, creating the table of loan results for each value of monthly income does not helpful, since there are too many entries.
## mySet$LoanOriginationQuarter: 2007 Q1
##
## Bad Good Ongoing
## 484 842 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2007 Q2
##
## Bad Good Ongoing
## 1167 1798 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2007 Q3
##
## Bad Good Ongoing
## 969 1575 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2007 Q4
##
## Bad Good Ongoing
## 849 1605 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2008 Q1
##
## Bad Good Ongoing
## 952 1953 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2008 Q2
##
## Bad Good Ongoing
## 1367 2769 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2008 Q3
##
## Bad Good Ongoing
## 1036 2348 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2008 Q4
##
## Bad Good Ongoing
## 138 362 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2009 Q2
##
## Bad Good Ongoing
## 3 10 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2009 Q3
##
## Bad Good Ongoing
## 83 496 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2009 Q4
##
## Bad Good Ongoing
## 219 1210 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2010 Q1
##
## Bad Good Ongoing
## 184 1036 0
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2010 Q2
##
## Bad Good Ongoing
## 258 1263 1
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2010 Q3
##
## Bad Good Ongoing
## 191 1060 1
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2010 Q4
##
## Bad Good Ongoing
## 284 1280 17
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2011 Q1
##
## Bad Good Ongoing
## 342 1242 152
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2011 Q2
##
## Bad Good Ongoing
## 593 1165 691
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2011 Q3
##
## Bad Good Ongoing
## 662 1404 979
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2011 Q4
##
## Bad Good Ongoing
## 798 1661 1397
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2012 Q1
##
## Bad Good Ongoing
## 885 1505 1997
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2012 Q2
##
## Bad Good Ongoing
## 966 1515 2523
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2012 Q3
##
## Bad Good Ongoing
## 965 1488 3121
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2012 Q4
##
## Bad Good Ongoing
## 614 1014 2759
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2013 Q1
##
## Bad Good Ongoing
## 299 699 2598
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2013 Q2
##
## Bad Good Ongoing
## 400 682 5989
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2013 Q3
##
## Bad Good Ongoing
## 282 585 8273
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2013 Q4
##
## Bad Good Ongoing
## 218 416 13768
## --------------------------------------------------------
## mySet$LoanOriginationQuarter: 2014 Q1
##
## Bad Good Ongoing
## 19 87 12026
The ratio of bad to good loans seem to be high prior to 2008 Q3. The number of loans and the ratio dropped since. I’m not adding any comments on the ratio in 2012, since the ongoing loans may be close to completion, so the good to bad ratio may not be valid for these years.
The ratio seems to be fairly steady, except for the BankCardUtilization value of 1.0, where it is high. Perhaps this value indicates borrowers with no bank cards. For this field, I did not output the table,since it would not be useful.
## mySet$Term: 12
##
## Bad Good Ongoing
## 92 1450 62
## --------------------------------------------------------
## mySet$Term: 36
##
## Bad Good Ongoing
## 13198 29174 36181
## --------------------------------------------------------
## mySet$Term: 60
##
## Bad Good Ongoing
## 1937 2446 20049
Loan Term- Majority of the loans have a 36 month term. The bad to good loan ratio is lower for the 12 month term and 36 month term as compared to the 60 month term.
Based on the initial histograms, I plan to investigate the following variables further 1. Stated Monthly Income 2. Loan Original Amount 3. DelinquenciesLast7Years 4. Current Credit Lines 5. CreditScoreRangeLower 6. ListingCategory 7. Borrower APR 8. DebtToIncomeRatio
I plan to plot each of these variables against the ratio of bad to good loans. If there seems to be a corelation, I plan to evaluate it.
I first convert all columns to factors, in order to group the data using these factors. Values have been rounded, so that the resulting graphs will be reasonably granular.
Also, the data is filtered for all loans with a final result, since the final status of an ongoing loan is not known.
The function summarizes the total count of loans by the column name in the arguments. It also summarizes this information by Loan Result and column name. That will enable us to get the ratio of bad loans to total loans for each field.
We now use this function to display ratio of bad loan count to total loans for each of the fields in the frame.
The categories with the highest failure rate are ‘Green Loans’, ‘Medical/Dental, ’Household expenses’ and ‘Baby Adoption’. These categories have a failure rate greater than 37.5%.
There is a direct corelation between the credit score and percentage of successful loans. And an inverse relation with the percentage of failed loans. The corelation can be investigated further.
There are a bunch of outliers, high stated incomes where all loans have been bad.This could be a case of incorrect data. The Failure rate is scattered for high incomes and very low incomes. However for the interval monthy income between 1200-9000, there seems to be a negative corelation between count of loans failed and the income.
The Failure rate seems to increase as the ‘Loan Amount’ increases. The values are more scattered with increase in Loan Amounts. There is also an outlier for failure rate when the loan value is 21500. I will check on the corelation coefficient for this metric.
This metric seems to have a direct corelation to the failure rate, but only for a specific interval - 0.06 to 0.37. At APR values below 0.06 the failure rate increases. Also, at values higher than 0.37 the failure rate decreases. I’m not sure of the cause here.
In the case of delinquencies, the failure rate seems to scatter as the number of delinquencies increase. It is not clear if there is a corelation. I plan to check.
Ignoring an outlier for the value of 0, there seems to be a direct corelation between ‘DebtToIncomeRatio’ and Failure rate, until the DebtoIncomeRatio value of 0.5. After that, the values scatter. It will help to check the corelation coeff for this.
When the number of credit lines goes beyond 40, the rate of loan failures goes high, beyond 60%. However, there is no direct corelation at values below 40. In as the number of credit lines increases from 0 to 5, the loan failure rate decreases from 40% to 30%. This could possibly be due to young inexperienced borrowers, who are new to loans.
The next step is to calculate corelation values between failure rates and some of the metrics identified in the graphs.
## [,1]
## CreditScore -0.9932277
CreditScoreCoeff -0.9932
## [,1]
## CreditLines 0.5403611
CreditLines 0.54
## [,1]
## BorrowerAPR 0.7005896
BorrowerAPR 0.700
## [,1]
## Income 0.53022
Income 0.5302
## [,1]
## LoanAmount 0.4963623
LoanAmount 0.4963
## [,1]
## Delinquencies 0.5464917
Delinquencies 0.5464
## [,1]
## DebtToIncomeRatio 0.7991244
DebtToIncomeRatio 0.799
It seems the most common factors that are well corelated,directly or inversely, to the Failure rate are Credit Score, DebtToIncomeRatio and BorrowerAPR.
To check on Failures further, I’ll plot two relatively independent variables along with the Loan Result.
The graph shows that loans of amounts higher than 10000 are not approved for borrowers below the credit score of 600. That is a good idea, since the failed loans increase when the credit scores are lower.When the loan amount is low, around $1000 and the credit scores are above 700, there are hardly any failed loans. The count of failed loans reduces as the credit scores get higher, for all loan amounts. The graph has been scaled by log10 on the x axis because there are fewer loans as the loan amount gets higher.
The graph illustrates the advantage of a higher credit score. Lower APRs are only predominantly given to those with higher credit scores. The mostly empty triangle in the bottom left makes this clear. Also, the empty triangle on the upper right shows that those with higher credit score dont accept high APRs for their loans. It is only those with lower credit scores that have to take on high APR loans. Consequently, most failures are for high APR loans(greater than 0.2) and low credit scores. Also, there are few borrowers between credit scores of 500-600. Few applicants in this range get approval for a loan.
From the graph, it seems that there are minimal loan failures when the APR is low, i.e. below 0.15. Failures increase with the APR. The count of loans increases with the loan amounts. Largest count of loans is between 5000-10,000. The empty triangle on the upper left shows that loans are not approved or accepted for amounts higher than 100000, when the APR is high (0.3-0.4).
The graph indicates the trend of percentage of failed(red) and successful(green) loans as the credit score increases. The trend is linear as indicated by linear interpolation. The corelation coefficient for Credit Score and Ratio of failed loans is -0.9932. The credit score is created by financial agencies using information from the borrowers credit history. It seems they are doing a good job of it for the lenders.
The graph shows higher density of failed loans along the along the lower border and to the right. This indicates the impact of lower credit scores on all the repayment of the loans. At loan amounts higher than 10000, loans are approved only for credit ratings above 600. Even for these loans, the incidence of successful loans increases as the credit rating rises.Fewer loans are approved for borrowers below a rating of 600. The smoothing curve indicates the fact that a higher credit score is required for a larger loan amount.
The graph shows the failed loans increasing as the credit score decreases and as the APR increases. The density of failed loans is highest at the credit score of 500 and increases with the Borrower APR. Very few borrowers below a credit score of 600 get approved for loans below APR of 0.2.This further decreases their chances of successful repayment. The APRs less than 0.5 can be considered to be outliers. Then the smoothing curve indicates that a higher credit score provides access to loans with lower APR.
The initial step of identifying relevant variables for investigation, is probably the most critical and is a subjective decision. Luckily, some of the variables selected were relevant to the success/failure of the loan. Some of the categorical variables evaluated were related to the loan outcome. However, I was not sure how to proceed with the evaluation using these variables. So, all the bivariate and multivariate plots consist of numerical variables. From the multivariate plots, the use of Credit Score is apparent. Banks are using this variable to decide on loan approval and on the Loan APR. The corelation between Credit score and success/failure of loans is high and graphically linear. Other factors such as loan amount and APR are also related to the loan outcome. The next steps would probably be to create a linear regression model using these variables.