The modeling process will be the same as in the Naive Bayes post with minimal alterations to the queries to accommodate the Titanic data.
The first step is getting the data: join Kaggle, agree to the terms and conditions of the competition, and download the csv files. Probably the most tedious aspect of SQL Server as a data mining tool is loading data from a file. There are a couple of ways to do it: the provided import/export wizard, which I personally hate, or bulk insert, which is only marginally better (there are other ways as well).
I begin by opening the data files in Calc or Excel for a quick review, which, in addition to providing a list of the fields and their characteristics, also helps identify potential data issues. For example, the fields are comma separated, but the third column, "name", contains grammatical commas as well. This confuses bulk insert; to remedy it, re-save the data to your desktop as a CSV file with semi-colon field separators.
Next, create the data table in SQL Server called Train:
create table Train (survived smallint, pclass smallint, name varchar(100), sex varchar(10), age real, sibsp smallint, parch smallint, ticket varchar(24), fare real, cabin varchar(24), embarked varchar(1))
Execute the bulk insert, starting with the second row since the first contains the field names. The source file location for the "from" clause is easy to get: right-click the data file and copy the path and file name from its properties.
bulk insert Train from 'C:\Users\colin\Desktop\train.csv' with (fieldterminator = ';', rowterminator = '\n', firstrow= 2)
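As an aside, SQL Server 2017 and later give bulk insert native CSV support that handles double-quoted fields containing commas, which may let you skip the semi-colon re-save. A sketch only; I have not run this against the Kaggle file:

bulk insert Train from 'C:\Users\colin\Desktop\train.csv' with (format = 'csv', fieldquote = '"', fieldterminator = ',', rowterminator = '\n', firstrow = 2)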
This will load 891 records; select the top 5-10 to get a look at the data:
select top 10 * from Train
survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NULL | S |
1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NULL | S |
1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NULL | S |
0 | 3 | Moran, Mr. James | male | NULL | 0 | 0 | 330877 | 8.4583 | NULL | Q |
0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | NULL | S |
1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | NULL | S |
1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | NULL | C |
Repeat the create and insert processes for the test data:
create table Test (passengerid int, pclass smallint, name varchar(100), sex varchar(10), age real, sibsp smallint, parch smallint, ticket varchar(24), fare real, cabin varchar(24), embarked varchar(1))
bulk insert Test from 'C:\Users\colin\Desktop\test.csv' with (fieldterminator = ';', rowterminator = '\n', firstrow = 2)
Recall that the Naive Bayes model works only with categorical data. Reviewing the top 10 records suggests that pclass, sex, sibsp, parch and embarked are categorical. Through binning (a future topic), "fare" and "age" could also be categorized. "Name" is too varied, with too little commonality to categorize, other than possibly extracting the title (Miss, Mrs, Mr, etc.) or the family name. "Cabin" similarly has low commonality, but since the first letter is the deck, it may be worth creating a column holding the first cabin letter, with "na" for the nulls. It may also make sense to replace all nulls with "na". This is one of the benefits of SQL: it is fairly easy to modify the data. Be sure that any changes made to the Train table are also applied to the Test table.
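If you choose to do the null replacement, it is a couple of simple updates (a sketch; note the summary results below were produced without these updates, so they still show NULL values, and the model queries derive the cabin letter inline with left(cabin,1) rather than adding a separate column):

update Train set cabin = 'na' where cabin is null
update Train set embarked = 'na' where embarked is null
update Test set cabin = 'na' where cabin is null
update Test set embarked = 'na' where embarked is null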
To start the model, first create the priors table changing the class from "lens" to "survived" and the table name to Train:
select survived class, count(1) pfreq, sum(1.0)/(select count(1) from train) pprob into #Priors from Train group by survived
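Querying #Priors (select * from #Priors) returns one row per class; the counts can be cross-checked against the sex rows in the attribute summary below (81 + 468 = 549 lost, 233 + 109 = 342 survived, out of 891):

class | pfreq | pprob |
0 | 549 | 0.616161 |
1 | 342 | 0.383838 |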
select attribute,value, p.class,acfreq, acfreq/pfreq acprob into #AttrClass from
(select 'pclass ' attribute, cast(pclass as varchar) value,survived class, sum(1.0) acfreq from Train group by pclass,survived
union all
select 'sex', sex,survived, sum(1.0) from Train group by sex,survived
union all
select 'sibsp' ,cast(sibsp as varchar),survived, sum(1.0) from Train group by sibsp,survived
union all
select 'cabinltr',left(cabin,1),survived, sum(1.0) from Train group by left(cabin,1),survived
union all
select 'embarked',embarked,survived, sum(1.0) from Train group by embarked,survived
) rd
join #Priors p on rd.class=p.class
Reviewing the summary data is analytically instructive. Consider sex: 68% of the survivors were female, but only about 15% of those lost. Or pclass: less than 15% of those lost were in class 1, versus nearly 40% of those who survived.
attribute | value | class | acfreq | acprob |
cabinltr | NULL | 0 | 481 | 0.876138 |
cabinltr | NULL | 1 | 206 | 0.602339 |
cabinltr | A | 0 | 8 | 0.014571 |
cabinltr | A | 1 | 7 | 0.020467 |
cabinltr | B | 0 | 12 | 0.021857 |
cabinltr | B | 1 | 35 | 0.102339 |
cabinltr | C | 0 | 24 | 0.043715 |
cabinltr | C | 1 | 35 | 0.102339 |
cabinltr | D | 0 | 8 | 0.014571 |
cabinltr | D | 1 | 25 | 0.073099 |
cabinltr | E | 0 | 8 | 0.014571 |
cabinltr | E | 1 | 24 | 0.070175 |
cabinltr | F | 0 | 5 | 0.009107 |
cabinltr | F | 1 | 8 | 0.023391 |
cabinltr | G | 0 | 2 | 0.003642 |
cabinltr | G | 1 | 2 | 0.005847 |
cabinltr | T | 0 | 1 | 0.001821 |
embarked | NULL | 1 | 2 | 0.005847 |
embarked | C | 0 | 75 | 0.136612 |
embarked | C | 1 | 93 | 0.271929 |
embarked | Q | 0 | 47 | 0.08561 |
embarked | Q | 1 | 30 | 0.087719 |
embarked | S | 0 | 427 | 0.777777 |
embarked | S | 1 | 217 | 0.634502 |
pclass | 1 | 0 | 80 | 0.145719 |
pclass | 1 | 1 | 136 | 0.39766 |
pclass | 2 | 0 | 97 | 0.176684 |
pclass | 2 | 1 | 87 | 0.254385 |
pclass | 3 | 0 | 372 | 0.677595 |
pclass | 3 | 1 | 119 | 0.347953 |
sex | female | 0 | 81 | 0.14754 |
sex | female | 1 | 233 | 0.681286 |
sex | male | 0 | 468 | 0.852459 |
sex | male | 1 | 109 | 0.318713 |
sibsp | 0 | 0 | 398 | 0.724954 |
sibsp | 0 | 1 | 210 | 0.614035 |
sibsp | 1 | 0 | 97 | 0.176684 |
sibsp | 1 | 1 | 112 | 0.327485 |
sibsp | 2 | 0 | 15 | 0.027322 |
sibsp | 2 | 1 | 13 | 0.038011 |
sibsp | 3 | 0 | 12 | 0.021857 |
sibsp | 3 | 1 | 4 | 0.011695 |
sibsp | 4 | 0 | 15 | 0.027322 |
sibsp | 4 | 1 | 3 | 0.008771 |
sibsp | 5 | 0 | 5 | 0.009107 |
sibsp | 8 | 0 | 7 | 0.01275 |
Next it is necessary to create the #Test table. The only way to evaluate the model against the Kaggle test file is to run a prediction and submit it to their site for scoring. While fun, that takes time, and a participant is limited to 5 submissions per day. Instead, first use the Train table and see how well the model predicts its own data, somewhat like an R squared in regression. At this stage it is useful to add a passenger id to the train records to facilitate matching predictions with the actual results (the test file already has a passenger id).
alter table Train add passengerid int identity(1,1)
Now build the #Test table from the Train table:
select passengerid id, 'pclass ' attribute, cast(pclass as varchar) value into #Test from Train
union all
select passengerid,'sex', sex from Train
union all
select passengerid,'sibsp' ,cast(sibsp as varchar) from Train
union all
select passengerid,'cabinltr',left(cabin,1) from Train
union all
select passengerid, 'embarked',embarked from Train
Execute the fit SQL with no changes from the original post:
select id,acs.class, log(pprob) + sum(log(fprob)) score into #Fit from #Test t join
(select sp.attribute, sp.value, sp.class, sp.pprob, isnull(s.acprob,.001) fprob from
(select s.attribute,s.value, p.class, p.pprob from #AttrClass s, #Priors p group by s.attribute,s.value, p.class, p.pprob) sp
left join #AttrClass s on sp.attribute=s.attribute and sp.value = s.value and sp.class=s.class) acs
on t.attribute = acs.attribute and t.value = acs.value
group by id,acs.class,pprob
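To spell out what this query computes: for each passenger and each class it forms the usual Naive Bayes log score

score(class) = log P(class) + sum over attributes of log P(attribute = value | class)

where P(class) is pprob from #Priors and each conditional probability is acprob from #AttrClass. The cross join of all attribute/value pairs with both classes, followed by the left join back to #AttrClass, surfaces combinations never seen in training for a class, and isnull(s.acprob, .001) substitutes a small constant for those missing probabilities so a single unseen combination does not zero out the product (or, in log space, drive the score to negative infinity). The .001 is an ad hoc floor rather than proper Laplace smoothing.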
The final prediction step is to identify the winning class for each passenger; the SQL is unchanged from the original post:
select f.id, max(class) pclass from #Fit f join
(select id, max(score) mscore from #Fit group by id) mq
on f.id = mq.id and score = mscore group by f.id
How accurately does it classify? Modify the evaluation query to use the Train table to get a result of 77.2%:
select sum(case when p.pclass = t.survived then 1.0 else 0 end) / count(1) from Train t join
(select f.id, max(class) pclass from #Fit f join
(select id, max(score) mscore from #Fit group by id) q
on f.id = q.id and score = mscore group by f.id) p
on p.id = t.passengerid
To finish, rebuild the #Test table from the actual test data. Drop the existing #Test first, since select...into cannot overwrite an existing table:
drop table #Test
select passengerid id, 'pclass ' attribute, cast(pclass as varchar) value into #Test from Test
union all
select passengerid,'sex', sex from Test
union all
select passengerid,'sibsp' ,cast(sibsp as varchar) from Test
union all
select passengerid,'cabinltr',left(cabin,1) from Test
union all
select passengerid, 'embarked',embarked from Test
Run the fit SQL as above (dropping the old #Fit table first), then query the classification results with the field name headers required by the submission instructions and submit:
select f.id passengerid, max(class) Survived from #Fit f join
(select id, max(score) mscore from #Fit group by id) q
on f.id = q.id and score = mscore group by f.id
Kaggle returned a categorization accuracy score of 72.727%, ranking about 5500th out of 6000 participants. Not great, but consider how little work went into the model (about 45 minutes changing table names and the occasional field in the SQL). There is plenty of opportunity for improvement to Naive Bayes, including eliminating attributes that degrade performance, incorporating unused attributes, optimal binning, and combining attributes. In fact, limiting the model to just sex, embarked and cabin letter raised the score to 75.122% and improved the rank to 5084th.
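That attribute-limiting experiment needs only a one-line tweak: delete the unwanted attributes from #Test before re-running the fit and prediction queries (a sketch; the same filter could equally be applied when building #Test):

delete from #Test where attribute not in ('sex', 'embarked', 'cabinltr')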
Finally, there are better classification algorithms...more to come on those.