I’ve been investing with Lending Club lately. Lending Club is a form of P2P lending. In short, you’re lending more directly to the borrowers requesting loans. Since there’s less overhead, the investor gets a higher interest rate (with some increased risk) and the borrower gets a lower interest rate on his/her loan.
I’ve spent a lot of time going over the data provided by Lending Club. I’m fascinated to see what kinds of interesting information I can get out of the raw data. For example, there is a more than 95% probability that a loan that has repaid more than 65% of its principal will repay fully. In other words, you really don’t have to worry nearly as much once the loan is past 65% repayment.
Tonight I wanted to find out which pieces of loan information are statistically significant with regard to whether or not the loan defaults. See below for the results, and keep reading if you’re interested in the technical details:
Data | Statistically significant for repayment? | Confidence |
---|---|---|
Inquiries in the past 6 months | YES | >99.99% |
State borrower lives in | YES | >99.99% |
Credit Grade (A1, A2, etc) | YES | >99.99% |
Loan Length | NO | N/A |
Loan Purpose | YES | >99.99% |
Home Ownership (Own, Rent, Mortgage) | YES | >99% |
FICO Score | YES | >99.99% |
Open Credit Lines | NO | N/A |
Employment Length | NO | N/A |
All of these are pretty much what I’d expect with the exception of the last two. I was avoiding borrowers with a lot of open credit lines or who hadn’t been employed very long. It’s good to see that this prejudice was unjustified.
Confidence factor can be a little confusing. For example, the confidence factor for “Loan Purpose” means that, if loan purpose had no effect on repayment, there would be less than a 0.01% chance of seeing differences this large between the observed and expected values of loan repayment purely by random chance. That’s why we are more than 99.99% confident that there must be some underlying reason other than chance that the data differed. This does not include any notion of how or why the loan purpose matters to loan repayment, only that it does.
These values were calculated using a Chi-square test. I took all the loans that were either fully paid, defaulted, or charged off. I further broke the loans down into two results: loss, which included all loans that had repaid less than 94% of the loan’s principal, and gain, which included all loans with more than 94% repayment of principal.
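For readers who want to try this themselves, here is a minimal sketch of that kind of calculation. It is not the original script: the loan counts are made up for illustration, the function names are hypothetical, treating a loan at exactly 94% as a loss is an assumption, and no continuity correction is applied. It uses only the Python standard library.

```python
import math

def label(fraction_repaid, threshold=0.94):
    """Label a finished loan by the fraction of principal it repaid.
    Assumption: a loan at exactly the threshold counts as a loss."""
    return "gain" if fraction_repaid > threshold else "loss"

def expected_counts(observed):
    """Expected counts if category and outcome were independent:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Return (statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(observed)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

def chi2_sf(x, dof):
    """P(chi-square with `dof` degrees of freedom >= x), i.e. the p-value.
    Starts from the closed form for 1 or 2 degrees of freedom and
    steps up two degrees at a time."""
    if x <= 0:
        return 1.0
    if dof % 2 == 0:
        q, nu = math.exp(-x / 2), 2
    else:
        q, nu = math.erfc(math.sqrt(x / 2)), 1
    while nu < dof:
        q += math.exp((nu / 2) * math.log(x / 2) - x / 2
                      - math.lgamma(nu / 2 + 1))
        nu += 2
    return q

# Hypothetical counts, NOT real Lending Club data: [gain, loss] per category.
observed = [
    [320, 60],    # OWN
    [1500, 400],  # RENT
    [1400, 280],  # MORTGAGE
]

stat, dof = chi_square(observed)
p_value = chi2_sf(stat, dof)
print(f"chi-square = {stat:.2f}, dof = {dof}, p = {p_value:.4g}")
print(f"confidence = {100 * (1 - p_value):.2f}%")
```

The confidence column in the table above corresponds to one minus the p-value, expressed as a percentage.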
I only took categories that had more than 300 loans in the set. With smaller numbers you risk having your results greatly impacted by random chance. For example, only seven of the thirty-five credit grades met this criterion (A4, A5, B2, B3, B4, B5, and C1) and only four states (CA, FL, NY, and TX). Since we’re only interested in knowing whether different credit grades or states impact the likelihood of repayment, this restriction is fine.
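The filtering step could look roughly like the sketch below. The record layout (a list of dicts with a `state` key) and the function name are assumptions made for illustration, and the loans are invented.

```python
from collections import Counter

def frequent_categories(loans, field, min_loans=300):
    """Return the values of `field` that appear in more than `min_loans` loans."""
    counts = Counter(loan[field] for loan in loans)
    return {value for value, n in counts.items() if n > min_loans}

# Hypothetical loan records, NOT real Lending Club data.
loans = ([{"state": "CA"}] * 350
         + [{"state": "TX"}] * 301
         + [{"state": "VT"}] * 12)

keep = frequent_categories(loans, "state")
filtered = [loan for loan in loans if loan["state"] in keep]

print(sorted(keep))   # ['CA', 'TX']
print(len(filtered))  # 651
```

Only the loans in `filtered` would then go into the observed vs expected tables for that field.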
If you’d like to see the expected vs observed tables for these results, you can grab them here: Observed vs Expected Tables.
Hopefully I’ll find time to talk about future findings!
Interesting stuff … I’m wondering, based on your research, how many inquiries in the last 6 months should a good filter have, what states should be excluded, what loan purposes should be excluded, what the FICO score should be, etc. You only showed that they’re statistically significant, but not what the best filters would be for each.
Leo, the statistical test I used *only* tells us if something is statistically significant. Other algorithms (genetic, SVM, etc.) should be used to determine what constitutes a good loan. I still plan on doing more tests, but I’ve been distracted with writing a thesis. Check back once in a while! 🙂