Monday, July 29, 2013

Single & Multiple Regression in SQL

Linear regression, a.k.a. ordinary least squares, is an algorithm for estimating the mathematical relationship between a set of one or more independent variables and a dependent variable. It takes the mathematical form Y = B+m1*X1+m2*X2+... where Y is the dependent variable being predicted, B is the intercept or constant, and m1, m2 etc are the slopes or weights of the independent, or input, variables X1, X2 etc. I suggest a visit to Google or Wikipedia for the theory involved.

If there are multiple independent variables the model is referred to as multiple regression, while with a single independent variable it is known as simple regression. This distinction is more than semantic: simple regression in SQL is extremely easy and can be implemented as a user defined function that estimates the slope and intercept much like an average. Multiple regression, on the other hand, requires significant matrix algebra. While I've come across and implemented simple regression in SQL, I've yet to see multiple regression despite some serious searching. I've been trying to implement the Gram-Schmidt procedure, but it is still a work in progress (edit: I figured it out! See the post dated 12/15/2013). In the interim there is an iterative approach using gradient descent which I'll cover here. First I'll re-hash simple regression and then follow with the solution to multiple regression.
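As an aside, here is a minimal sketch of what such a user defined function might look like. SQL Server functions can't reference temporary tables, so the sketch assumes the points live in a permanent table I'll call dbo.SamplePoints(x, y), a hypothetical name not used elsewhere in this post:

create function dbo.SimpleRegression()
returns table
as
return
(select slope, ybar - slope*xbar intercept from
(select (sum(x*y) - (sum(x)*sum(y)/count(1)))/
(sum(x*x) - (sum(x)*sum(x)/count(1))) slope,
avg(x) xbar, avg(y) ybar
from dbo.SamplePoints) a)

Calling select * from dbo.SimpleRegression() then returns the slope and intercept in a single row, much like querying any other table.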

Start simple regression by creating a table and populating it with data, in this case the area and price of homes (taken from a Stackoverflow post referencing an on-line machine learning course):

create table #rawdata (id int,X float, Y float)

insert into #rawdata select 1, 2201,400

insert into #rawdata select 2, 1600,330
insert into #rawdata select 3, 2400,369
insert into #rawdata select 4, 1416,232
insert into #rawdata select 5, 3000,540

The estimation process is to first calculate the slope (parse the equation from the query below) and then the intercept, by subtracting the average of the independent variable multiplied by the slope from the average of the dependent variable. To unify the two steps I use a sub-query to estimate the slope and the averages of the independent (x) and dependent (y) variables; the parent query presents the slope and does the math to calculate the intercept:

select slope, ybar - slope*xbar intercept into #model from 
(select (sum(x*y) -(sum(x)*sum(y)/count(1)))/
(sum(x*x) - (sum(x)*sum(x)/count(1))) slope, 
avg(x)  xbar, avg(y) ybar
from #rawdata) a
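In conventional notation the sub-query computes slope = (sum(x*y) - sum(x)*sum(y)/n) / (sum(x*x) - sum(x)*sum(x)/n), which is just the covariance of x and y divided by the variance of x, and the parent query computes intercept = avg(y) - slope*avg(x).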

The results:

slope intercept
0.1658939364 21.9408154411

The query above stores the results in a temporary table called #model; to see the fitted data versus actual, run the following query:

select id, y, intercept+slope*x fit, y-(intercept+slope*x) error from #model, #rawdata

id y fit error
1 400 387.073 12.927
2 330 287.371 42.629
3 369 420.086 -51.086
4 232 256.847 -24.847
5 540 519.623 20.377

On to the multiple regression: in addition to the area of a house, the number of rooms is provided and is to be incorporated into the model to predict the price. Unfortunately the algebraic solution used in simple regression doesn't scale easily to 2 or more independent variables. An alternative is the gradient descent algorithm, which improves the model's fit by repeatedly updating the weights to reduce the error. The process is straightforward: using the current weights estimate the fitted values, calculate the error, propagate a small amount of the error back to the weights, and repeat. As usual, more details are available all over the web.
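In symbols, the batch update applied below nudges each weight by the average of the error times its input: m_j := m_j + alpha * avg((y - fit) * x_j), where alpha is a small learning rate and fit is the prediction using the current weights. This is exactly what the weight-adjusting query inside the loop computes.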

To start, drop the earlier #rawdata table, recreate it and populate it with data. This time instead of X & Y I'll use the variable names.

create table #rawdata (id int,area float, rooms float, price float)

insert into #rawdata select 1, 2201,3,400
insert into #rawdata select 2, 1600,3,330
insert into #rawdata select 3, 2400,3,369
insert into #rawdata select 4, 1416,2,232
insert into #rawdata select 5, 3000,4,540

To facilitate the SQL, the table should be converted into data and class vectors. A problem with gradient methods occurs when variables have widely divergent ranges: convergence slows down dramatically or becomes unstable. To overcome this difficulty, normalize the variables by subtracting the mean and dividing by the standard deviation. This is a simple linear transformation and is easy to convert back when the final predictions are made:

select id,'intercept' attribute,1.0 value into  #data  from #rawdata
union all
select id, 'area',(area-(select avg(area) from #rawdata))/(select stdev(area) from #rawdata) from #rawdata
union all
select id, 'rooms',(rooms-(select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata) from #rawdata

select id,'price' attribute, (price-(select avg(price) from #rawdata))/(select stdev(price) from #rawdata) value into #class from #rawdata
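A quick sanity check, not part of the original workflow but cheap to run, confirms the standardization: every attribute except the constant intercept column should now show a mean of roughly 0 and a standard deviation of 1.

select attribute, avg(value) mean, stdev(value) sd from #data group by attribute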

Now there are two approaches to calculating the fit and error and updating the weights: either observation by observation with repeated passes across the whole data set, or all records at once in batch. This implementation follows the batch approach.

Start by creating a table of weights for each attribute which can be populated with random numbers or, as in this case, 0:

select attribute,cast(0.0 as float) wght into #weight from #data   group by attribute 

The program requires 4 variables: 
  • alpha, a number greater than 0 but less than 1 that controls the portion of the error incorporated into the weights on each iteration; it is generally set to 0.1
  • best and newbest, which track the fit and control when the routine stops
  • count: the maximum number of iterations
After setting the parameter values, including @best which is the sum of the squared errors using the starting weights, the program loops until one of the stop conditions is met:
  • the error increases after adjusting the weights, or
  • the maximum iteration count is exceeded
The key weight-adjusting query is the first statement inside the loop. Its sub-queries first calculate the fit (identifier f), then the error (identifier e), which is multiplied by the independent variables; a small portion (@alpha) of that product is added to the weights and the updated weights are saved to #weightnew. Using the #weightnew table, the overall error is calculated as @newbest; if it is greater than the prior error, @best, the program stops, otherwise the weights are saved to the #weight table and the #weightnew table is dropped. If the maximum number of iterations has not been reached, the loop repeats.

declare  @alpha float, @best  float, @newbest float,@count int

-- set the learning rate and iteration cap, and compute the starting
-- sum of squared errors using the initial (zero) weights
select  @alpha = .1, @count = 100, @best = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id

while 1=1

begin 

-- batch update: calculate the fit (f) and error (e), then add @alpha times
-- the average of error*input to each weight, saving the result to #weightnew
select w.attribute,  wght+@alpha* avg(value*err) wght into #weightnew
from  #weight w join #data d on d.attribute = w.attribute
join (select  c.id, (value-fit) err from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
on c.id = f.id  ) e on d.id = e.id
group by w.attribute, wght

-- sum of squared errors using the new weights
select @newbest = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weightnew w on d.attribute=w.attribute  group by id) f
on c.id = f.id

-- stop if the fit got worse, otherwise keep the new weights
if @newbest>@best break

else set @best = @newbest

drop table #weight

select * into #weight from #weightnew
drop table #weightnew

-- stop once the maximum number of iterations is reached
if @count = 0 break

set @count = @count -1 

end

The final step is to query the actual versus fitted values and the error, which includes converting the standardized variables back to the original scale:

select id, mn+(value*se) value, mn+(fit*se) fit, (fit-value)*se error from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x,
(select avg(price) mn from #rawdata) y,
(select stdev(price) se from #rawdata) z
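The #weight table can also score observations that weren't in the training data. Here is a minimal sketch using a made-up house of 2000 square feet with 3 rooms: the inputs are standardized with the training means and standard deviations, multiplied by the weights, and the result converted back to price units:

select (select avg(price) from #rawdata) +
(select stdev(price) from #rawdata) * sum(value*wght) predicted_price
from (select 'intercept' attribute, 1.0 value
union all
select 'area', (2000 - (select avg(area) from #rawdata))/(select stdev(area) from #rawdata)
union all
select 'rooms', (3 - (select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata)) n
join #weight w on n.attribute = w.attribute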

As a test I replicated the simple regression results above using just the area independent variable, with the following fit & error:

id value fit error
1 400 387.071 -12.929
2 330 287.390 -42.610
3 369 420.076 51.076
4 232 256.872 24.872
5 540 519.591 -20.409

A comparison of the error column with the error column in the simple regression table (the sign is flipped because here the error is calculated as fit minus actual) suggests the gradient descent approach closely matches the algebraic simple regression, with no observation differing by more than 0.033. In fact the square-root of the mean squared error for both versions is 33.546.

Adding the number of rooms to the model improves the fit:

id y fit error
1 400 380.166 19.834
2 330 333.961 -3.961
3 369 395.465 -26.465
4 232 227.534 4.466
5 540 533.874 6.126

The square-root of the mean squared error is improved (reduced)  to 15.277.

A common measure of goodness of fit is R-Squared, the variance explained by the model as a percent of the total variance. It is calculated as 1 - (unexplained variance / total variance), where the unexplained variance is the sum of the squared differences between the actual and fitted values and the total variance is the sum of the squared differences between the actual values and their average. The SQL takes advantage of the normalized data, which sets the average in the total variance calculation to zero:

select 1- sum(power(value-fit,2))/ sum(power(value,2))  from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x
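In symbols the query computes R-Squared = 1 - sum((y - fit)^2) / sum((y - avg(y))^2); because the class vector was standardized, avg(y) is 0 and the denominator reduces to sum(y^2), which is what the query uses.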

For the simple regression, the R-Squared is 88.83% using the gradient algorithm; for the multiple regression it improves to 97.68%.







Thursday, July 25, 2013

Classic Classification: The Perceptron

I don't intend to make this a running dissertation on classification algorithms (although I will come back to them from time to time because they are key to data mining), but the disappointing results of the Naive Bayes classifier on the Kaggle Titanic challenge in my last post led me to ask whether another simple classifier might produce better results. To that end I dusted off my books on the Perceptron, which originated in the late 1950's (showing that machine learning has been around since the dawn of the computer age). The Perceptron is a linear classifier that learns weights on the input variables and indicates membership when their sum exceeds a threshold. For those wanting more background, try Wikipedia; I will use their example to demonstrate the code before running the Titanic data through the model.

First, let's create a table and populate it with some basic data to simulate the NAND function. In the write-ups the data is invariably encoded as binary (0,1 or -1,1); I will make use of SQL and use data labels (true, false), which are easier to interpret and permit more input attribute values. Note, however, that converting the class value to (0,1) will simplify the program.

create table #rawdata (id int identity(1,1), x1 varchar(5), x2 varchar(5), z varchar(5))

insert into #rawdata select 'false', 'false', 'true' 

insert into #rawdata select 'false', 'true', 'true' 
insert into #rawdata select 'true', 'false', 'true' 
insert into #rawdata select 'true', 'true', 'false'

To leverage SQL functionality, convert the data table to a data vector, adding a bias field which works like the constant in regression.

select id, 'bias' attribute,'bias' value into #data from #rawdata
union all
select id , 'x1' , x1 from #rawdata
union all
select id , 'x2' , x2 from #rawdata

Similarly create a class data vector; as noted earlier, convert the class attribute's value to (0,1).

 select id, 'z' attribute, case when z='true' then 1.0 else 0 end value into #class from #rawdata

Recall that the algorithm compares the sum of the weights for each observation to a threshold value to determine membership. The next step is to initialize the table of weights, which can be set to 0.

select attribute, value,cast(0 as float) wght into #weight from #data group by attribute, value

One of the reasons I like the Naive Bayes algorithm is that it doesn't require iteration to get to the weights; unfortunately the Perceptron (like most data mining algorithms) learns through iteration. The process: pick an observation, calculate membership and, if there is a misclassification (error), adjust the weights in the direction of correct classification. Repeat until convergence (or patience is exhausted). If the classes are perfectly separable, it will converge and the weights will stop changing. The real world is seldom so co-operative (this example is; the Titanic data is not) and the results will vacillate. A solution to this problem is the pocket algorithm: calculate the overall fit after each iteration and, if it is the best so far, put the weights in your pocket. When done, use these weights, which are kept in the #bestweight table:

select * into #bestweight from #weight

 Begin the learning program by declaring the necessary variables:

  • threshold is the level necessary to classify, generally 0.5.  
  • alpha is the amount the weights will be adjusted should there be an error; experience shows 0.1 is a good value.  
  • count keeps track of how many passes through the whole data set have been made (not many are required) and id is the individual record being evaluated  
  • fit is the sum of the weights and error is the difference versus the actual  
  • best and newbest track the overall fit to determine if the new weights should replace those in your pocket  
The weights are modified in the update statement following the fit and error calculations. Next is the overall fit (percent correct classification), which is followed by the pocket algorithm.
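In symbols, the update applied to each weight attached to the current record's attribute values is wght := wght + alpha * (actual - predicted), where predicted is 1 if the summed weights exceed the threshold and 0 otherwise; a correctly classified record leaves the weights unchanged.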

declare @threshold float, @alpha float, @count int, @id int,  @fit float, @err float ,@best float,@newbest float

select @threshold = .5, @alpha = .1, @count = 5, @best = 0

-- outer loop: passes over the full data set
while @count>=0

begin

set @id = 1

-- inner loop: one record at a time
while @id<= (select max(id) from #class)

begin  

-- fit: sum of the weights matching this record's attribute values
select @fit = sum(wght) from  #data d join #weight w on d.attribute = w.attribute and d.value = w.value where d.id = @id

-- error: actual class less the thresholded prediction
select @err = c.value - case when @fit>.5 then 1.0 else 0 end  from #class c where c.id =  @id

-- adjust the weights attached to this record's attribute values
update w set w.wght = w.wght+@err*@alpha from #weight w join #data d on w.attribute = d.attribute and w.value = d.value and d.id = @id

-- overall percent correctly classified using the current weights
select @newbest = sum(case when fit = value then 1.0 else 0 end)/count(1) from 
(select c.id,case when sumw>.5 then 1 else 0 end fit, c.value  from #class c join
(select d.id,  sum(wght) sumw 
from #data d 
join #weight w on d.attribute = w.attribute and d.value = w.value  
group by d.id) n 
on c.id = n.id) x

-- pocket algorithm: keep the best weights seen so far
if @newbest>@best

begin
set @best = @newbest
drop table #bestweight
select * into #bestweight from #weight
end
set @id = @id+1
end
set @count=@count-1
end

Finally, evaluate the results with the following query:

select * from (select c.id,case when sumw>@threshold then 1 else 0 end fit, c.value from #class c join (select d.id, sum(wght) sumw from #data d join #weight w on d.attribute = w.attribute and d.value = w.value group by d.id) n on c.id = n.id) x

The fit with the initial (zero value) weights:

id fit value
1 0 1
2 0 1
3 0 1
4 0 0

After running the program it is 100%:

id fit value
1 1 1
2 1 1
3 1 1
4 0 0

With the following weights:

select * from #weight

attribute value weight
bias bias 0.3
x1 false 0.2
x1 true 0.1
x2 false 0.2
x2 true 0.1
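As a quick check of how these weights classify, take the fourth record (true, true): 0.3 + 0.1 + 0.1 = 0.5, which does not exceed the 0.5 threshold, so the fit is 0. The first record (false, false) sums to 0.3 + 0.2 + 0.2 = 0.7, which does exceed it, so the fit is 1.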

This last step does not use the #bestweight table, but per the logic of the program those weights would be the same.  

How does the Perceptron fare on the Titanic data? First, drop the temporary tables left over from the NAND example, then bring the Titanic train data into the data vector and the survival field into the class vector:

select passengerid id, 'bias' attribute,'bias' value into #data from train
union all
select passengerid , 'sex' , sex  from train
union all
select passengerid, 'pclass' , cast(pclass as varchar)  from train
union all
select passengerid, 'embarked' , cast(embarked as varchar)  from train
union all
select passengerid, 'cabin' , isnull(left(cabin,1),'na')  from train

 select passengerid id, 'survived' attribute, cast(survived as float) value into #class from train

Initialize the #weight and #bestweight tables and run the learning program. No further modifications to the program should be required.

The basic fit can be evaluated with the following query, which shows 71.7171% correct classification; using the #bestweight table in place of #weight improves the correct classification rate to 81.8181%.

select sum(case when fit = value then 1.0 else 0 end)/count(1) from 
(select c.id,case when sumw>.5 then 1 else 0 end fit, c.value  from #class c join
(select d.id,  sum(wght) sumw 
from #data d 
join #weight w on d.attribute = w.attribute and d.value = w.value  
group by d.id) n 
on c.id = n.id) x

The weights in the #bestweight table:

attribute value weight
bias bias 0.2
cabin A 0
cabin B 0.2
cabin C 0
cabin D 0.2
cabin E 0.3
cabin F 0.1
cabin G -0.4
cabin n 0
cabin T -0.2
embarked NULL 0
embarked C 0.1
embarked Q 0.1
embarked S 0
pclass 1 0.1
pclass 2 0.2
pclass 3 -0.1
sex female 0.3
sex male -0.1

I submitted the test data classification using the #bestweight table to Kaggle and improved my score to 76.077%, moving my ranking up by nearly 300 places to about 4800th. Not where I'd like to be, but not bad for a half-century-old, simple algorithm.
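The submission query itself isn't shown above; here is a minimal sketch of how it might look, assuming the Test table from the Naive Bayes post is already loaded and the test attributes are encoded exactly like the training vector (attribute values never seen in training simply find no match in #bestweight and contribute nothing to the sum):

select t.id passengerid, case when sum(wght)>.5 then 1 else 0 end Survived from
(select passengerid id, 'bias' attribute, 'bias' value from Test
union all
select passengerid, 'sex', sex from Test
union all
select passengerid, 'pclass', cast(pclass as varchar) from Test
union all
select passengerid, 'embarked', cast(embarked as varchar) from Test
union all
select passengerid, 'cabin', isnull(left(cabin,1),'na') from Test) t
join #bestweight w on t.attribute = w.attribute and t.value = w.value
group by t.id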




Monday, July 15, 2013

Catastrophic Test of the Naive Bayes Classifier

In my prior post I created a simple Naive Bayes classification algorithm on a very simple data set. In this post I'll use the model to predict some "real world" data: surviving the Titanic. Kaggle.com is a website that hosts competitions in data mining and prediction, usually with cash prizes for the best results. Kaggle provides the data, rules and evaluation, and maintains a leader-board using participants' submissions. They are currently conducting what they call a knowledge competition (i.e. the value is in the knowledge gained, although recently the competition appears to have added $10,000 in prizes) focused on predicting whether a passenger survived the Titanic tragedy. They have made a training file and a testing file available to train the model and test the predictions. Just how well does the simple Naive Bayes model fare against some pretty sophisticated alternatives?

The modeling process will be the same as in the Naive Bayes post with minimal alterations to the queries to accommodate the Titanic data.

The first step is getting the data by joining Kaggle, agreeing to the terms and conditions of the competition and downloading the csv files. Probably the most tedious aspect of SQL Server as a data mining tool is loading data from a file. There are a couple of ways to do it: you can use the provided import/export wizard, which I personally hate, or bulk insert, which is only marginally better (there are other ways).

I begin by opening the data files in Calc or Excel for a quick review which, in addition to providing a list of the fields and their characteristics, also helps identify potential data issues. For example, the fields are comma separated, but the third column, "name", has grammatical commas as well. This confuses bulk insert; to remedy it, re-save the data to your desktop as a CSV file with semi-colon field separators.

Next create the data table in SQL server called Train:

create table train(survived smallint, pclass smallint ,name varchar(100) ,sex varchar(10) , age real , sibsp smallint , parch smallint , ticket varchar(24) ,fare real ,cabin varchar(24) , embarked varchar(1) ) 


Execute the bulk insert starting with the second row as the first has field names. The location of the source file in the "from" can be easily identified by right clicking on the data file and copying the path and file name from the properties.


bulk insert Train from  'C:\Users\colin\Desktop\train.csv' with (fieldterminator  = ';', rowterminator  = '\n', firstrow= 2)


This will load 891 records; select the top 5-10 to get a look at the data:

select top 10 * from Train


survived pclass name sex age sibsp parch ticket fare cabin embarked
0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 NULL S
1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925 NULL S
1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S
0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 NULL S
0 3 Moran, Mr. James male NULL 0 0 330877 8.4583 NULL Q
0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.075 NULL S
1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NULL S
1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NULL C

Repeat the create and insert processes for the test data:

create table Test (passengerid int, pclass smallint ,name varchar(100) ,sex varchar(10) , age real , sibsp smallint ,parch smallint ,ticket varchar(24) ,fare real , cabin varchar(24) , embarked varchar(1) ) 

bulk insert Test from  'C:\Users\colin\Desktop\test.csv' with (fieldterminator= ';', rowterminator= '\n', firstrow= 2)

Recall the Naive Bayes model worked only with categorical data. Reviewing the top 10 records suggests that pclass, sex, sibsp, parch & embarked are categorical. Through binning (a future topic), "fare" and "age" could be categorized. "Name" is too varied with too little commonality to categorize, other than possibly extracting the title (Miss, Mrs, Mr etc) or maybe the family name. "Cabin" similarly has low commonality, but as the first letter is the deck, it may be worth creating a column of the first cabin letter with "na" for the nulls. It may also make sense to replace all nulls with "na". This is one of the benefits of SQL: it is fairly easy to modify the data. Be sure that changes made to the Train table are also incorporated in the Test data table.
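As an illustration of the binning idea (the cut points here are arbitrary and purely for illustration, not something tuned in this post), a branch like the following could be union all'd into the attribute-class query below and, with the same CASE expression, into the test vector:

select 'ageband' attribute,
case when age is null then 'na' when age < 13 then 'child' when age < 60 then 'adult' else 'senior' end value,
survived class, sum(1.0) acfreq from Train
group by case when age is null then 'na' when age < 13 then 'child' when age < 60 then 'adult' else 'senior' end, survived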

To start the model,  first create the priors table changing the class from "lens" to "survived" and the table name to Train:

select survived class, count(1) pfreq, sum(1.0)/(select count(1) from train) pprob into #Priors from Train group by survived

Next create the attribute-class summary table again changing the source tables to Train and the attribute, values & class to match the Train table, as well as converting any formats to match the column format.

select attribute,value, p.class,acfreq, acfreq/pfreq acprob into #AttrClass from 

(select 'pclass ' attribute, cast(pclass as varchar) value,survived class, sum(1.0) acfreq from Train group by pclass,survived
union all
select 'sex', sex,survived, sum(1.0) from Train group by sex,survived
union all
select 'sibsp' ,cast(sibsp as varchar),survived, sum(1.0) from Train group by sibsp,survived
union all
select 'cabinltr',left(cabin,1),survived, sum(1.0) from Train group by left(cabin,1),survived
union all
select 'embarked',embarked,survived, sum(1.0) from Train group by embarked,survived
) rd 
join #Priors p on rd.class=p.class

Reviewing the summary data is analytically instructive. For example, consider sex: 68% of the survivors were female but only about 15% of those lost were; or pclass: less than 15% of those lost were class 1 versus nearly 40% of those who survived.



attribute value class acfreq acprob
cabinltr NULL 0 481 0.876138
cabinltr NULL 1 206 0.602339
cabinltr A 0 8 0.014571
cabinltr A 1 7 0.020467
cabinltr B 0 12 0.021857
cabinltr B 1 35 0.102339
cabinltr C 0 24 0.043715
cabinltr C 1 35 0.102339
cabinltr D 0 8 0.014571
cabinltr D 1 25 0.073099
cabinltr E 0 8 0.014571
cabinltr E 1 24 0.070175
cabinltr F 0 5 0.009107
cabinltr F 1 8 0.023391
cabinltr G 0 2 0.003642
cabinltr G 1 2 0.005847
cabinltr T 0 1 0.001821
embarked NULL 1 2 0.005847
embarked C 0 75 0.136612
embarked C 1 93 0.271929
embarked Q 0 47 0.08561
embarked Q 1 30 0.087719
embarked S 0 427 0.777777
embarked S 1 217 0.634502
pclass 1 0 80 0.145719
pclass 1 1 136 0.39766
pclass 2 0 97 0.176684
pclass 2 1 87 0.254385
pclass 3 0 372 0.677595
pclass 3 1 119 0.347953
sex female 0 81 0.14754
sex female 1 233 0.681286
sex male 0 468 0.852459
sex male 1 109 0.318713
sibsp 0 0 398 0.724954
sibsp 0 1 210 0.614035
sibsp 1 0 97 0.176684
sibsp 1 1 112 0.327485
sibsp 2 0 15 0.027322
sibsp 2 1 13 0.038011
sibsp 3 0 12 0.021857
sibsp 3 1 4 0.011695
sibsp 4 0 15 0.027322
sibsp 4 1 3 0.008771
sibsp 5 0 5 0.009107
sibsp 8 0 7 0.01275

Next it is necessary to create the #Test table. The only way to evaluate the model using the Kaggle test file is to run a prediction and submit it to their site for evaluation. While fun, it does take time and a participant is limited to 5 submissions per day. Instead, use the Train table and see how well the model predicts itself, kind of like an R squared in regression. At this stage it is useful to have a passenger id on the Train records to facilitate matching up predictions with the actual results (like the test file, which has a passenger id).


alter table Train add passengerid int identity(1,1)

Now build the #Test table from the Train table:


select passengerid id, 'pclass ' attribute, cast(pclass as varchar) value into #Test from Train 

union all
select passengerid,'sex', sex from Train 
union all
select passengerid,'sibsp' ,cast(sibsp as varchar)  from Train 
union all
select passengerid,'cabinltr',left(cabin,1) from Train
union all
select passengerid, 'embarked',embarked  from Train

Execute the fit SQL with no changes from the original post:

select id,acs.class, log(pprob) + sum(log(fprob)) score into #Fit from #Test t join
(select sp.attribute, sp.value, sp.class, sp.pprob, isnull(s.acprob,.001) fprob from 
(select s.attribute,s.value, p.class, p.pprob from #AttrClass s, #Priors p group by s.attribute,s.value, p.class, p.pprob) sp
left join #AttrClass s on sp.attribute=s.attribute and sp.value = s.value and sp.class=s.class) acs
on t.attribute = acs.attribute and t.value = acs.value 
group by id,acs.class,pprob
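For each passenger and class the query produces the standard Naive Bayes log score: score(class) = log(P(class)) + the sum over the passenger's attributes of log(P(attribute value | class)), with a small stand-in probability of .001 used for attribute value and class combinations never seen in training (the isnull in the query).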

The final prediction step is to identify the winning class for each passenger; the SQL is unchanged from the original post:

select f.id, max(class) pclass from #Fit f join
(select id, max(score) mscore from #Fit group by id) mq
on f.id = mq.id and score = mscore group by f.id

How accurately does it classify?  Modify the evaluation query to use the Train table to get a result of 77.2%:


select sum(case when p.pclass = t.survived then 1.0 else 0 end) / count(1) from Train t join 
(select f.id, max(class) pclass from #Fit f join
(select id, max(score) mscore from #Fit group by id) q
on f.id = q.id and score = mscore group by f.id) p
on p.id = t.passengerid

To finish, drop the #Test and #Fit tables built from the Train data, then load the actual test data into the #Test table.

select passengerid id, 'pclass ' attribute, cast(pclass as varchar) value into #Test from Test 
union all
select passengerid,'sex', sex from Test 
union all
select passengerid,'sibsp' ,cast(sibsp as varchar)  from Test 
union all
select passengerid,'cabinltr',left(cabin,1) from Test
union all
select passengerid, 'embarked',embarked  from Test

Run the fit SQL as above, then query the classification results with the required field name headers as described in the submission instructions and submit...


select f.id passengerid, max(class)  Survived from #Fit f join
(select id, max(score) mscore from #Fit group by id) q
on f.id = q.id and score = mscore group by f.id

Kaggle returned a categorization accuracy score of 72.727%, ranking about 5500th out of 6000 participants. Not great, but consider how little work was put into the model (about 45 minutes changing SQL table names and the occasional field). There is plenty of opportunity for improvement to Naive Bayes, including eliminating attributes that degrade performance, incorporating unused attributes, optimal binning and combining attributes. In fact, limiting the model to just sex, embarked and cabin letter resulted in a score of 75.122% and improved the rank to 5084th.
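The post doesn't show how that restricted variant was run; one quick way to try it (a sketch, assuming the tables built above are still in place) is to remove the other attributes, drop #Fit and re-run the fit, prediction and evaluation queries:

delete from #Test where attribute not in ('sex','embarked','cabinltr')
delete from #AttrClass where attribute not in ('sex','embarked','cabinltr')
drop table #Fit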

Finally, there are better classification algorithms...more to come on those.