Lending Club statistical significance for loan categories

I’ve been investing with Lending Club lately. Lending Club is a form of P2P lending. In short, you’re lending more directly to the borrowers requesting loans. Since there’s less overhead, the investor gets a higher interest rate (with some increased risk) and the borrower gets a lower interest rate on his/her loan.

I’ve spent a lot of time going over the data provided by Lending Club. I’m fascinated to see what kinds of interesting information I can get out of the raw data. For example, there is a more than 95% probability that a loan that has repaid more than 65% of its principal will repay fully. In other words, you really don’t have to worry nearly as much once the loan is past 65% repayment.

Tonight I wanted to find out which pieces of loan information are statistically significant with regard to whether or not the loan will default. See below for the results, and keep reading if you’re interested in the technical details:

Data | Statistically significant with regard to repayment? | Confidence
---- | ---- | ----
Inquiries in the past 6 months | YES | >99.99%
State borrower lives in | YES | >99.99%
Credit Grade (A1, A2, etc) | YES | >99.99%
Loan Length | NO | N/A
Loan Purpose | YES | >99.99%
Home Ownership (Own, Rent, Mortgage) | YES | >99%
FICO Score | YES | >99.99%
Open Credit Lines | NO | N/A
Employment Length | NO | N/A

All of these are pretty much what I’d expect with the exception of the last two. I was avoiding borrowers with a lot of open credit lines or who hadn’t been employed very long. It’s good to see that this prejudice was unjustified.

Confidence factor can be a little confusing. For example, the confidence factor for “Loan Purpose” means that there is less than 0.01% chance that the differences between the observed and expected values of loan repayment for the loan purpose were caused by random chance. That’s why we are more than 99.99% confident that there must be some underlying reason other than chance that the data differed. This does not include any notion of how or why the loan purpose matters to loan repayment, only that it does.

These values were calculated using a Chi-square test. I took all the loans that were either fully paid, defaulted, or charged off. I further broke the loans down into two results: loss, which included all loans that had repaid less than 94% of the loan’s principal, and gain, which included all loans with more than 94% repayment of principal.

I only took categories that had more than 300 loans in the set. With smaller numbers you risk having your results greatly impacted by random chance. For example, only seven of the thirty-five credit grades met this criterion (A4, A5, B2, B3, B4, B5, and C1) and only four states (CA, FL, NY, and TX). Since we’re only interested in knowing whether different credit grades or states impact the likelihood of repayment, this restriction is fine.
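For anyone who wants to see the mechanics, here is a minimal Lua sketch of a Chi-square test on a loss/gain table. The counts are made up for illustration and are not Lending Club data, and the function name chi_square is my own, not something from the original analysis. It computes the expected count for each cell from the row and column totals, then sums (observed - expected)^2 / expected.

```lua
-- Hypothetical counts, NOT real Lending Club data.
-- Each row is one category (e.g. a credit grade); columns are { loss, gain }.
local observed = {
    { 30, 270 },
    { 60, 240 },
}

-- Returns the Chi-square statistic and the degrees of freedom for a table of counts.
local function chi_square( counts )
    local row_totals, col_totals, total = {}, {}, 0
    for i, row in ipairs( counts ) do
        row_totals[ i ] = 0
        for j, count in ipairs( row ) do
            row_totals[ i ] = row_totals[ i ] + count
            col_totals[ j ] = (col_totals[ j ] or 0) + count
            total = total + count
        end
    end

    local stat = 0
    for i, row in ipairs( counts ) do
        for j, count in ipairs( row ) do
            -- Expected count if the category had no effect on repayment
            local expected = row_totals[ i ] * col_totals[ j ] / total
            stat = stat + (count - expected)^2 / expected
        end
    end

    local dof = (#counts - 1) * (#col_totals - 1)
    return stat, dof
end

local stat, dof = chi_square( observed )
print( string.format( "chi-square = %.2f, dof = %d", stat, dof ) ) -- chi-square = 11.76, dof = 1
```

The statistic is then compared against a Chi-square table for the given degrees of freedom. With one degree of freedom, anything above roughly 10.83 corresponds to better than 99.9% confidence, so this made-up example would count as significant.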

If you’d like to see the expected vs observed tables for these results, you can grab them here: Observed vs Expected Tables.

Hopefully I’ll find time to talk about future findings!

Biphasic sleeping

For the past month, I’ve been following a biphasic sleep schedule. This basically means that I’m sleeping in two chunks throughout the day, every day. I started off sleeping 8:00 PM to 9:30 PM and 2:00 AM to 7:00 AM, for a total of six and a half hours. I’ve since changed to sleeping from 3:00 PM to 4:30 PM and 1:00 AM to 6:00 AM because I tend to get pretty drowsy after lunch.

I’m finding that biphasic sleeping is working pretty well; I certainly appreciate the extra time it gives me every day. Unfortunately, I have been nodding off in classes more often than I normally do, but that’s probably because my classes are more boring this semester.

Overall, I’d recommend biphasic sleeping to anyone with a schedule that can accommodate it. Just keep in mind that the first week of adjusting will be tough!

A lovely remake of Spacewar!

For those who aren’t familiar, Spacewar! is the second video game ever made. It took the original programmer 200 hours to create on a DEC PDP-1. Obviously technology has advanced a lot since it was first created, so I thought it would be the perfect game to develop for ACM.

Over the course of two one-hour sessions, I was able to create a viable remake of Spacewar! using Lua and LÖVE. The controls are the arrow keys for ship 1 (red) and WASD for ship 2 (blue). Running into the sun kills you, and each time you die (from anything) you get two seconds of immunity from dying. This immunity prevents spawn camping. I’d like to thank zerohdog on Flickr for the ship image and fastcall for the sun image (also listed in the credits in the game).
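The immunity rule is easy to sketch in plain Lua. This is not the code from the game, just a minimal illustration with names I made up (new_ship, update, try_kill): a countdown timer is reset on every death and anything that tries to kill the ship is ignored while the timer is running.

```lua
local IMMUNITY_SECONDS = 2 -- Matches the two seconds described above

-- Hypothetical ship record; the real game tracks much more than this
local function new_ship()
    return { immunity = 0, deaths = 0 }
end

-- Call once per frame with the elapsed time in seconds
local function update( ship, dt )
    ship.immunity = math.max( 0, ship.immunity - dt )
end

-- Returns true if the ship actually died, false if it was still immune
local function try_kill( ship )
    if ship.immunity > 0 then
        return false
    end
    ship.deaths = ship.deaths + 1
    ship.immunity = IMMUNITY_SECONDS -- Respawn with fresh immunity
    return true
end

local ship = new_ship()
print( try_kill( ship ) ) -- true
print( try_kill( ship ) ) -- false, still immune
update( ship, 2.5 )
print( try_kill( ship ) ) -- true, immunity has run out
```

In a LÖVE game the update function would be called from love.update( dt ).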


Download! (Needs love2d to run)

A simple SQL/Flatfile abstraction in Lua

I’ve been wondering if it was possible to make a MySQL/SQLite/Flatfile abstracted interface for Lua for some time now. At first I thought I’d try something like LINQ, but realized that there was really no reasonable way to do that in Lua since we don’t have the power of expression trees. I then considered writing the queries in a simplified pseudo-Lua language, but that would take too much time to parse and no one really wants to learn another language anyways.

What I settled on instead was a very simplified abstraction of the data access. It can’t do everything you can do yourself from SQL or raw file I/O, but I found that it serves up exactly the kind of interface I’d need for all my past projects in Lua that use some form of I/O.

Here’s an example of declaring a table of data:
[cc_lua]users = CreateDataTable( "users", "steamid", "string(32)", "The steamid of the user" )
users:AddKey( "group", "string(16)", "The group the user belongs to" )
users:AddKey( "name", "string(32)", "The name the player was last seen with" )
users:AddListOfKeyValues( "allow", "string(16)", "string(128)", "The allows for the user" )[/cc_lua]

We’re defining a primary key ‘steamid’ for each of the rows in the table ‘users’. We’re saying that steamid is going to be a string with a maximum length of 32 characters, and we define a comment that’s used in MySQL and flatfiles. We then go on to add regular keys ‘group’ and ‘name’ to the table in a similar fashion; note that regular keys are optional in the table. Finally, we’re making an additional key-value table named ‘allow’ (a key-value table basically means a regular, unrestricted Lua table). So, to make sure you’ve got the idea, the Lua table structure would look like this:

[cc_lua]users = {
    my_steamid1 = {
        group = "admin",
        allow = {
            slap = "*",
            kick = "*"
        }
    },
    my_steamid2 = {
        name = "Bob",
        allow = {}
    }
}[/cc_lua]

The API around such a clean representation of data couldn’t be much simpler. You have four operations: insert a new row by primary key, fetch an existing row by primary key, delete a row by primary key, and get the entire table. When inserting a row you can optionally pass in the data for the row. With both the insert and fetch functions you get back a table for the row that’s being “tracked”. When you change any of the contents, the change is immediately reflected into the DB or file.
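To give a feel for how a “tracked” row can work, here is a minimal sketch using a proxy table and metatables. It is not the actual implementation; track_row and the on_write callback are names I invented for the example, and the callback stands in for whatever would write to the DB or file.

```lua
-- Hypothetical sketch: wraps a row so that every assignment is reported to on_write.
local function track_row( data, on_write )
    local proxy = {} -- Stays empty so that every write goes through __newindex
    return setmetatable( proxy, {
        __index = data, -- Reads fall through to the real data
        __newindex = function( _, key, value )
            data[ key ] = value
            on_write( key, value ) -- A real backend would update the DB or file here
        end,
    } )
end

local writes = {}
local row = track_row( { group = "admin" }, function( key, value )
    writes[ #writes + 1 ] = key .. "=" .. tostring( value )
end )

row.name = "Bob"
row.group = "operator"
print( row.name ) -- Bob
print( #writes ) -- 2
```

Note that this sketch only catches writes to the top level of the row. A nested table such as ‘allow’ would need its own proxy to be tracked the same way.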

A caching system is built into the abstraction, so if you fetch the same row multiple times it won’t go out to the DB or read the file each time. You can request that the cache be flushed or disabled altogether if you need to. Unlike file I/O I’ve done in the past, this system doesn’t need to parse the whole file to get a single row, which means it should work reasonably well with flatfiles even when the files are very large.
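The caching behaviour described above can be sketched in a few lines. Again, this is only an illustration under my own assumptions: new_cached_fetcher and load_row are hypothetical names, and load_row stands in for the real DB query or file read.

```lua
-- Hypothetical sketch: wraps a row loader with a cache that can be flushed.
local function new_cached_fetcher( load_row )
    local cache = {}
    local fetcher = {}

    function fetcher.fetch( key )
        local row = cache[ key ]
        if row == nil then
            row = load_row( key ) -- Only hit the backend on a cache miss
            cache[ key ] = row
        end
        return row
    end

    function fetcher.flush()
        cache = {}
    end

    return fetcher
end

local loads = 0 -- Counts how many times the backend was actually hit
local store = { my_steamid1 = { group = "admin" } }
local users = new_cached_fetcher( function( key )
    loads = loads + 1
    return store[ key ]
end )

users.fetch( "my_steamid1" )
users.fetch( "my_steamid1" )
print( loads ) -- 1
users.flush()
users.fetch( "my_steamid1" )
print( loads ) -- 2
```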

Though I haven’t coded this portion yet, I’m also planning on adding the ability to convert between MySQL/SQLite/flatfiles on the fly. This may be problematic for very large databases, so I’ve also taken care to make sure that this system can be run as a standalone script apart from any specific application.

So… if you’ve read this far, your next question is probably, “Why bother with such a system? Why not just stick with a single format?”. A single format (usually flatfiles) is great for about 95% of my users. The other 5%, however, are the power users who want to do crazy things like hook up a PHP billboard with blinking lights showing how many times a second someone says the word ‘the’. It’s the power users I prefer to cater to, so I’ve always felt guilty in the past that this was one area I couldn’t do much for them in. But now I can!

Damerau–Levenshtein Distance, Lua Implementation

I stumbled across Levenshtein distance today and had to try my hand at writing an implementation in Lua. I chose the slightly more complex Damerau–Levenshtein distance, and I think it turned out pretty well.

Some notes of interest:

  • Complexity is O( (#t+1) * (#s+1) ) when lim isn’t specified.
  • This function can be used to compare array-like tables as easily as strings.
  • This function is case sensitive when comparing strings.
  • Using this function to compare against a dictionary of 250,000 words took about 0.6 seconds on my machine for the word “Teusday”, around 10 seconds for very poorly spelled words. Both tests used lim.

[ccn_lua]--[[
    Function: EditDistance

    Finds the edit distance between two strings or tables. Edit distance is the minimum number of
    edits needed to transform one string or table into the other.

    Parameters:

        s - A *string* or *table*.
        t - Another *string* or *table* to compare against s.
        lim - An *optional number* to limit the function to a maximum edit distance. If specified
            and the function detects that the edit distance is going to be larger than lim, lim
            is returned immediately.

    Returns:

        A *number* specifying the minimum edits it takes to transform s into t or vice versa. Will
        not return a higher number than lim, if specified.

    Example:

        :EditDistance( "Tuesday", "Teusday" ) -- One transposition.
        :EditDistance( "kitten", "sitting" ) -- Two substitutions and an addition.

    returns...

        :1
        :3

    Notes:

        * Complexity is O( (#t+1) * (#s+1) ) when lim isn't specified.
        * This function can be used to compare array-like tables as easily as strings.
        * The algorithm used is Damerau–Levenshtein distance, which calculates edit distance based
            off number of substitutions, additions, deletions, and transpositions.
        * Source code for this function is based off the Wikipedia article for the algorithm.
        * This function is case sensitive when comparing strings.
        * If this function is being used several times a second, you should be taking advantage of
            the lim parameter.
        * Using this function to compare against a dictionary of 250,000 words took about 0.6
            seconds on my machine for the word "Teusday", around 10 seconds for very poorly
            spelled words. Both tests used lim.

    Revisions:

        v1.00 - Initial.
]]
function EditDistance( s, t, lim )
    local s_len, t_len = #s, #t -- Calculate the sizes of the strings or arrays
    if lim and math.abs( s_len - t_len ) >= lim then -- If sizes differ by lim, we can stop here
        return lim
    end

    -- Convert string arguments to arrays of ints (ASCII values)
    if type( s ) == "string" then
        s = { string.byte( s, 1, s_len ) }
    end

    if type( t ) == "string" then
        t = { string.byte( t, 1, t_len ) }
    end

    local min = math.min -- Localize for performance
    local num_columns = t_len + 1 -- We use this a lot

    local d = {} -- (s_len+1) * (t_len+1) is going to be the size of this array
    -- This is technically a 2D array, but we're treating it as 1D. Remember that 2D access in the
    -- form my_2d_array[ i, j ] can be converted to my_1d_array[ i * num_columns + j ], where
    -- num_columns is the number of columns you had in the 2D array assuming row-major order and
    -- that row and column indices start at 0 (we're starting at 0).

    for i=0, s_len do
        d[ i * num_columns ] = i -- Initialize cost of deletion
    end
    for j=0, t_len do
        d[ j ] = j -- Initialize cost of insertion
    end

    for i=1, s_len do
        local i_pos = i * num_columns
        local best = lim -- Check to make sure something in this row will be below the limit
        for j=1, t_len do
            local add_cost = (s[ i ] ~= t[ j ] and 1 or 0)
            local val = min(
                d[ i_pos - num_columns + j ] + 1, -- Cost of deletion
                d[ i_pos + j - 1 ] + 1, -- Cost of insertion
                d[ i_pos - num_columns + j - 1 ] + add_cost -- Cost of substitution, it might not cost anything if it's the same
            )

            -- Is this eligible for transposition?
            if i > 1 and j > 1 and s[ i ] == t[ j - 1 ] and s[ i - 1 ] == t[ j ] then
                val = min(
                    val, -- Current cost
                    d[ i_pos - num_columns - num_columns + j - 2 ] + add_cost -- Cost of transposition
                )
            end
            d[ i_pos + j ] = val

            -- Track the lowest final cost in this row (including any transposition)
            if lim and val < best then
                best = val
            end
        end

        if lim and best >= lim then
            return lim
        end
    end

    local result = d[ s_len * num_columns + t_len ] -- The bottom-right cell holds the answer
    if lim and result > lim then -- Never return more than lim, as documented
        return lim
    end
    return result
end[/ccn_lua]
Gist of the same source code.
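If the 1D array trick in the comments above is unfamiliar, here is a tiny standalone sketch of it. The grid size and the values stored are arbitrary choices of mine for illustration; the only point is the i * num_columns + j mapping with zero-based rows and columns.

```lua
local num_rows, num_columns = 3, 4

-- Fill a flat table as if it were a 3x4 grid, storing i * 10 + j in cell (i, j)
local grid = {}
for i = 0, num_rows - 1 do
    for j = 0, num_columns - 1 do
        grid[ i * num_columns + j ] = i * 10 + j
    end
end

-- Read a cell back using the same row-major mapping
local function at( i, j )
    return grid[ i * num_columns + j ]
end

print( at( 1, 0 ) ) -- 10
print( at( 2, 3 ) ) -- 23
```

A flat table like this avoids creating one Lua table per row, which is presumably part of why it helps in a function that gets called many times.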

Philosophy of higher education

For one of my classes today we came up with our “purpose statement” for our being at college. It’s quite enlightening to sit down and actually think about the reasons why you’re actually spending all this time and money when the only direct, physical result is a piece of paper. Here’s my philosophy:

Higher education primarily ensures that graduates have the tools and knowledge they need for the career field they want to go into. The time spent learning while earning the degree is invested into research in the field and honing professional and interpersonal skills. Studying at an academic institution is often necessary to make certain that topics are fully understood, that any questions are answered, and that the new knowledge can be easily referenced if needed in the future.

Attending a higher learning facility and living in the school’s dormitory has added benefits. Living with people you don’t initially know helps students immensely in learning to empathize with those around them, and helps them realize what kind of person they are versus what kind of person they’d like to be. This social experience is every bit as important as the traditional academic learning that the school provides and needs to be treated as such when students are considering college options.

GUI for NetTunnel

Designing the GUI for NetTunnel put my creativity to the test. I’ve never actually designed a GUI before. I’ve seen and read a lot about GUI design theory, but theory turned out to be fairly pointless for this design process. It was interesting to try to translate the idea in my head into the controls available in Visual Studio.

My first attempt ended up like this:

[Screenshots: Main Window and Services Window]

This is okay, but not great. Most of those elements are static and don’t move even if you resize the window. It’s certainly not something I’d feel comfortable working with every day. After getting lots and lots of advice from friends, my second and final GUI design ended up like so:

[Screenshots: Main Window and Services Window]

A much cleaner and easier-to-understand layout. Services can be toggled by clicking the ‘service’ menu and then the appropriate service in the drop-down, or from within the service window itself. The most commonly used items in the GUI are in obvious places, and everything else is just a few clicks away. Everything resizes with the window, and the size proportions can be adjusted.

Now that I know how easy it is to create GUIs, I think I might start using them in future projects while retaining a command line version for power users.

Introducing NetTunnel

As part of my requirements for obtaining my degree, I’m doing a network capstone project this semester. I’ve always been somewhat fascinated with NATs and, specifically, how to break them, so I decided to work on a NAT traversal application. Enter NetTunnel: the purpose of NetTunnel is to give users who lack networking knowledge, or who are on a restrictive network, a simple way to share network services with other users.

Basically, say I’m on a restrictive network (like the dorm network) that does not allow me to host servers to the world due to NAT restrictions. I want to run Ventrilo on my machine and have my friends join so I can chat with them. Unaided, this would be impossible, but if my friends and I are running NetTunnel, a “hole” will be opened up in the network so they can connect.

It’s a pretty simple concept that’s already used in most modern desktop applications like Skype and peer-to-peer games. As far as I could tell, this idea has never been extended to a general level the way NetTunnel does, so there is a definite need for it. The closest application to it I could find is GameRanger, though GameRanger is aimed specifically at games.

I’ve set up a quick static page for NetTunnel at nettunnel.nayruden.com. Keep an eye out for further development, and I’d love to hear feedback about it!

CODN

While talking to LightSys, an organization that offers free IT services to mission organizations, I was told about the Christian Open Development Network or CODN (pronounced codin’). The site is in dire need of a refresh and a core community (last update is 2005?!). I felt like God was nudging me while they described what they wanted CODN to be, especially since what they need is exactly what I have experience with from managing the Ulysses community.

So, if you or any friends are interested in Christian open source development, be sure to watch the site over the following months as we work on it. If you’ve got any ideas on what you’d like to see there, be sure to drop a note in the comments.

Callback system in D

We’ve been evaluating D for use in Daydream, and I decided to see how easy it would be to create a callback system (aka events or signals) in the D language. This is a daunting task in C++ without variadic templates (which only arrived in C++11) because templates can only accept a fixed number of arguments… very bad when you have a function that can accept any number of arguments. To solve this problem in C++ you need to create a separate template for each number of possible arguments.

In D you can create templates that accept any number of arguments! You can treat these as a tuple, an array, or use them with tail recursion (à la PROLOG).

Combine this with the natural awesomeness of D and you’re set up for a power punch. Following this text is a very simple callback system in D. A short but sweet 50 lines of code; it stores both functions and delegates and gives you a good launching point to create a more complicated callback system.

[ccn_d]import tango.io.Stdout;

// Converts a function to a delegate. Stolen from http://dsource.org/projects/tango/ticket/1174
// Note that it doesn't handle ref or out though
R delegate(T) toDg(R, T...)(R function(T) fp) {
    struct dg {
        R opCall(T t) {
            return (cast(R function(T)) this) (t);
        }
    }
    R delegate(T) t;
    t.ptr = fp;
    t.funcptr = &dg.opCall;
    return t;
}

class SimpleCallback(R, P...)
{
    alias R delegate(P) callbacktype;
    alias R function(P) function_callbacktype;

    private callbacktype[] callback_list;

    typeof( this ) opCatAssign( in callbacktype callback )
    {
        callback_list ~= callback;
        return this;
    }

    typeof( this ) opCatAssign( in function_callbacktype callback )
    {
        auto dg = toDg!(R, P)( callback );
        return this ~= dg;
    }

    R emit( P p )
    {
        static if ( !is( R == void ) )
            R last;

        foreach( callback; callback_list )
        {
            static if ( !is( R == void ) )
                last = callback( p );
            else
                callback( p );
        }

        static if ( !is( R == void ) )
            return last;
    }

    alias emit opCall;
}[/ccn_d]

Here’s some example code:

[cc_d]SimpleCallback!( void ) sc = new SimpleCallback!( void );
SimpleCallback!( bool, char[] ) sc2 = new SimpleCallback!( bool, char[] );
sc ~= function void() { Stdout.formatln( "#1" ); };
sc ~= function void() { Stdout.formatln( "#2" ); };
sc2 ~= function bool( char[] str ) { Stdout.formatln( "#1 called with {}, returning false", str ); return false; };
sc2 ~= function bool( char[] str ) { Stdout.formatln( "#2 called with {}, returning true", str ); return true; };
sc();
Stdout.formatln( "Last sc2 callback returned {}", sc2( "coffee" ) );[/cc_d]
And here’s the output:
[cc_text]#1
#2
#1 called with coffee, returning false
#2 called with coffee, returning true
Last sc2 callback returned true[/cc_text]

A commentary on culture, theology, and programming