Monday, July 15, 2013

Catastrophic Test of the Naive Bayes Classifier

In my prior post I built a simple Naive Bayes classification algorithm on a very simple data set. In this post I'll use the model to predict some "real world" data: surviving the Titanic. Kaggle.com is a website that hosts competitions in data mining and prediction, usually with cash prizes for the best results. Kaggle provides the data, rules and evaluation, and maintains a leader-board of participants' submissions. They are currently running what they call a knowledge competition (i.e. the value is in the knowledge gained, although the competition recently appears to have added $10,000 in prizes) focused on predicting whether a passenger survived the Titanic tragedy. They have made a training file and a testing file available to train the model and test the predictions. Just how well does the simple Naive Bayes model fare against some pretty sophisticated alternatives?

The modeling process will be the same as in the Naive Bayes post with minimal alterations to the queries to accommodate the Titanic data.

The first step is getting the data by joining Kaggle, agreeing to the terms and conditions of the competition and downloading the CSV files. Probably the most tedious aspect of SQL Server as a data mining tool is loading data from a file. There are a couple of ways to do it: the provided import/export wizard, which I personally hate, or bulk insert, which is only marginally better (there are other ways).

I begin by opening the data files in Calc or Excel for a quick review, which in addition to providing a list of the fields and their characteristics also helps identify potential data issues. For example, the fields are comma separated, but the third column, "name", contains embedded commas as well. These confuse bulk insert; to remedy this, re-save the data to your desktop as a CSV file with semi-colon field separators.
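If you'd rather script the re-save than round-trip through a spreadsheet, Python's csv module handles the quoted commas correctly. This is just a sketch of the idea, not part of the SQL workflow:

```python
import csv
import io

# One raw line in Kaggle's comma-separated format; the "name" field is quoted
# and contains an embedded comma.
raw = '0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\r\n'

# csv.reader honors the quotes, so the embedded comma does not split the name.
row = next(csv.reader(io.StringIO(raw)))

# Re-emit the fields with semicolon separators so a naive splitter, such as
# bulk insert with fieldterminator = ';', sees exactly one value per column.
out = io.StringIO()
csv.writer(out, delimiter=';').writerow(row)
```

The same reader/writer pair applied line by line over train.csv would produce the semicolon-delimited file used below.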

Next create the data table in SQL server called Train:

create table Train (
  survived smallint,
  pclass smallint,
  name varchar(100),
  sex varchar(10),
  age real,
  sibsp smallint,
  parch smallint,
  ticket varchar(24),
  fare real,
  cabin varchar(24),
  embarked varchar(1)
)


Execute the bulk insert starting with the second row as the first has field names. The location of the source file in the "from" can be easily identified by right clicking on the data file and copying the path and file name from the properties.


bulk insert Train from  'C:\Users\colin\Desktop\train.csv' with (fieldterminator  = ';', rowterminator  = '\n', firstrow= 2)


This will load 891 records; select the top 5-10 to get a look at the data:

select top 10 * from Train


survived pclass name sex age sibsp parch ticket fare cabin embarked
0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 NULL S
1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 NULL S
1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S
0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 NULL S
0 3 Moran, Mr. James male NULL 0 0 330877 8.4583 NULL Q
0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.075 NULL S
1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NULL S
1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NULL C

Repeat the create and insert processes for the test data:

create table Test (
  passengerid int,
  pclass smallint,
  name varchar(100),
  sex varchar(10),
  age real,
  sibsp smallint,
  parch smallint,
  ticket varchar(24),
  fare real,
  cabin varchar(24),
  embarked varchar(1)
)

bulk insert Test from  'C:\Users\colin\Desktop\test.csv' with (fieldterminator= ';', rowterminator= '\n', firstrow= 2)

                                                                                 
Recall that the Naive Bayes model works only with categorical data. Reviewing the top 10 records suggests that pclass, sex, sibsp, parch & embarked are categorical. Through binning (a future topic), "fare" and "age" could be categorized. "Name" is too varied, with little commonality to categorize other than possibly extracting the title (Miss, Mrs, Mr, etc.) or maybe the family name. "Cabin" has similarly low commonality, but as the first letter is the deck, it may be worth creating a column of the first cabin letter with "na" for the nulls. It may also make sense to replace all nulls with "na". This is one of the benefits of SQL: it is fairly easy to modify the data. Be sure that changes made to the Train table are also incorporated in the Test data table.
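In rough Python (the helper names here are mine, purely for illustration), the two derived features described above amount to:

```python
def cabin_letter(cabin):
    """Deck letter from the cabin field; nulls (None) become 'na'."""
    return cabin[0] if cabin else "na"

def title(name):
    """Extract the honorific from 'Surname, Title. Given Names'."""
    return name.split(",")[1].split(".")[0].strip()
```

So "C85" maps to deck "C", a null cabin to "na", and "Braund, Mr. Owen Harris" to the title "Mr".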

To start the model,  first create the priors table changing the class from "lens" to "survived" and the table name to Train:

select survived class, count(1) pfreq, sum(1.0)/(select count(1) from train) pprob into #Priors from Train group by survived
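For readers who prefer to see the arithmetic outside SQL, here is a rough Python equivalent of the #Priors computation (the function name is mine):

```python
from collections import Counter

def priors(labels):
    """Per-class frequency and prior probability, mirroring the #Priors table."""
    freq = Counter(labels)
    n = len(labels)
    return {c: (k, k / n) for c, k in freq.items()}
```

Given the 891 survived flags from Train, this yields each class's count and its share of the total, just as pfreq and pprob do above.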

Next create the attribute-class summary table, again changing the source table to Train and the attribute names, values and class to match the Train columns, casting where necessary so the formats line up.

select attribute,value, p.class,acfreq, acfreq/pfreq acprob into #AttrClass from 

(select 'pclass ' attribute, cast(pclass as varchar) value,survived class, sum(1.0) acfreq from Train group by pclass,survived
union all
select 'sex', sex,survived, sum(1.0) from Train group by sex,survived
union all
select 'sibsp' ,cast(sibsp as varchar),survived, sum(1.0) from Train group by sibsp,survived
union all
select 'cabinltr',left(cabin,1),survived, sum(1.0) from Train group by left(cabin,1),survived
union all
select 'embarked',embarked,survived, sum(1.0) from Train group by embarked,survived
) rd 
join #Priors p on rd.class=p.class
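The acprob column is just the count of each attribute/value/class combination divided by the class count. A rough Python sketch of the same computation (names are mine):

```python
from collections import Counter

def attr_class_probs(rows, labels):
    """Conditional probabilities P(attribute=value | class), mirroring #AttrClass.
    rows: one dict of categorical attributes per passenger; labels: class per row."""
    class_freq = Counter(labels)
    pair_freq = Counter()
    for attrs, c in zip(rows, labels):
        for a, v in attrs.items():
            pair_freq[(a, v, c)] += 1
    # Divide each (attribute, value, class) count by that class's frequency.
    return {k: n / class_freq[k[2]] for k, n in pair_freq.items()}
```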

Reviewing the summary data is analytically instructive. For example, consider sex: 68% of the survivors were female, but only about 15% of those lost. Or "pclass": less than 15% of those lost were class 1, versus nearly 40% of those who survived.



attribute value class acfreq acprob
cabinltr NULL 0 481 0.876138
cabinltr NULL 1 206 0.602339
cabinltr A 0 8 0.014571
cabinltr A 1 7 0.020467
cabinltr B 0 12 0.021857
cabinltr B 1 35 0.102339
cabinltr C 0 24 0.043715
cabinltr C 1 35 0.102339
cabinltr D 0 8 0.014571
cabinltr D 1 25 0.073099
cabinltr E 0 8 0.014571
cabinltr E 1 24 0.070175
cabinltr F 0 5 0.009107
cabinltr F 1 8 0.023391
cabinltr G 0 2 0.003642
cabinltr G 1 2 0.005847
cabinltr T 0 1 0.001821
embarked NULL 1 2 0.005847
embarked C 0 75 0.136612
embarked C 1 93 0.271929
embarked Q 0 47 0.08561
embarked Q 1 30 0.087719
embarked S 0 427 0.777777
embarked S 1 217 0.634502
pclass 1 0 80 0.145719
pclass 1 1 136 0.39766
pclass 2 0 97 0.176684
pclass 2 1 87 0.254385
pclass 3 0 372 0.677595
pclass 3 1 119 0.347953
sex female 0 81 0.14754
sex female 1 233 0.681286
sex male 0 468 0.852459
sex male 1 109 0.318713
sibsp 0 0 398 0.724954
sibsp 0 1 210 0.614035
sibsp 1 0 97 0.176684
sibsp 1 1 112 0.327485
sibsp 2 0 15 0.027322
sibsp 2 1 13 0.038011
sibsp 3 0 12 0.021857
sibsp 3 1 4 0.011695
sibsp 4 0 15 0.027322
sibsp 4 1 3 0.008771
sibsp 5 0 5 0.009107
sibsp 8 0 7 0.01275

Next it is necessary to create the #Test table. The only way to evaluate the model using the Kaggle test file is to run a prediction and submit it to their site for evaluation. While fun, it does take time, and a participant is limited to 5 submissions per day. Instead, use the Train table and see how well the model predicts itself, kind of like an R squared in regression. At this stage it is useful to have a passenger id on the Train records to facilitate matching predictions with the actual results (the test file already has a passenger id).


alter table Train add passengerid int identity(1,1)

Now build the #Test table from the Train table:


select passengerid id, 'pclass ' attribute, cast(pclass as varchar) value into #Test from Train 
union all
select passengerid,'sex', sex from Train 
union all
select passengerid,'sibsp' ,cast(sibsp as varchar)  from Train 
union all
select passengerid,'cabinltr',left(cabin,1) from Train
union all
select passengerid, 'embarked',embarked  from Train

Execute the fit SQL with no changes from the original post:

select id,acs.class, log(pprob) + sum(log(fprob)) score into #Fit from #Test t join
(select sp.attribute, sp.value, sp.class, sp.pprob, isnull(s.acprob,.001) fprob from 
(select s.attribute,s.value, p.class, p.pprob from #AttrClass s, #Priors p group by s.attribute,s.value, p.class, p.pprob) sp
left join #AttrClass s on sp.attribute=s.attribute and sp.value = s.value and sp.class=s.class) acs
on t.attribute = acs.attribute and t.value = acs.value 
group by id,acs.class,pprob
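The scoring logic is easier to see outside of SQL. Roughly, in Python (the names are mine; cond_prob is an #AttrClass-style lookup from (attribute, value, class) to conditional probability):

```python
import math

def fit_score(attrs, cls, prior_prob, cond_prob, floor=0.001):
    """Log posterior up to a constant: log prior plus summed log conditionals.
    Combinations missing from cond_prob get a small floor probability, as
    isnull(s.acprob, .001) does in the SQL."""
    s = math.log(prior_prob[cls])
    for a, v in attrs.items():
        s += math.log(cond_prob.get((a, v, cls), floor))
    return s
```

Each passenger gets one such score per class; the larger score wins.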

The final prediction step is to identify the winning class for each passenger; the SQL is unchanged from the original post:

select f.id, max(class) pclass from #Fit f join
(select id, max(score) mscore from #Fit group by id) mq
on f.id = mq.id and score = mscore group by f.id
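In rough Python, the winner-picking step, including the max(class) tie-break that the SQL implies, looks like this (the function name is mine):

```python
def predict(scores):
    """Winning class per id from a {(id, cls): score} mapping; ties go to the
    larger class label, matching max(class) in the SQL."""
    best = {}
    for (i, c), s in scores.items():
        cur = best.get(i)
        if cur is None or s > cur[1] or (s == cur[1] and c > cur[0]):
            best[i] = (c, s)
    return {i: c for i, (c, _) in best.items()}
```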

How accurately does it classify?  Modify the evaluation query to use the Train table to get a result of 77.2%:


select sum(case when p.pclass = t.survived then 1.0 else 0 end) / count(1) from Train t join 
(select f.id, max(class) pclass from #Fit f join
(select id, max(score) mscore from #Fit group by id) q
on f.id = q.id and score = mscore group by f.id) p
on p.id = t.passengerid
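The evaluation itself is nothing more than the share of matching predictions; a Python sketch of the same ratio (name is mine):

```python
def accuracy(predicted, actual):
    """Share of passengers whose predicted class matches the actual label."""
    hits = sum(1 for i, c in predicted.items() if actual.get(i) == c)
    return hits / len(predicted)
```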

To finish, rebuild #Test from the actual test data. Since select ... into cannot write to a table that already exists, drop the #Test built from Train first:

drop table #Test

select passengerid id, 'pclass ' attribute, cast(pclass as varchar) value into #Test from Test 
union all
select passengerid,'sex', sex from Test 
union all
select passengerid,'sibsp' ,cast(sibsp as varchar)  from Test 
union all
select passengerid,'cabinltr',left(cabin,1) from Test
union all
select passengerid, 'embarked',embarked  from Test

Run the fit SQL as above (dropping the existing #Fit first, for the same reason), then query the classification results with the field name headers required by the submission instructions and submit...


select f.id passengerid, max(class)  Survived from #Fit f join
(select id, max(score) mscore from #Fit group by id) q
on f.id = q.id and score = mscore group by f.id

Kaggle returned a categorization accuracy score of 72.727%, ranking about 5500th out of 6000 participants. Not great, but consider how little work went into the model (about 45 minutes changing SQL table names and the occasional field). There is plenty of opportunity for improvement to Naive Bayes, including eliminating attributes that degrade performance, incorporating unused attributes, optimal binning and combining attributes. In fact, limiting the model to just sex, embarked and cabin letter resulted in a score of 75.122% and improved the rank to 5084th.

Finally, there are better classification algorithms...more to come on those.












