The algorithm models a numeric dependent variable by estimating weights for the independent variables which are all categorical. An example would be height as a function of gender. Height is a numeric variable while gender is categorical. In typical regression (including my earlier post) this could be modeled by using a binary (1,0) variable for each category which is fairly simple. Imagine doing the same model of weight as a function of state of residence; very quickly the model will have fifty binary variables. Straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.
The model: estimate a score for each attribute value (i.e.gender female) such that the fitted class value is the sum of the scores for that observation. The scores, or weights, are estimated using a batch mode gradient descent.
The course of this post:
- Load some data, I will first use a small data set then the Titanic data
- Put the data into a vector format to ease the SQL
- Initiate the weights and control variables
- Create the gradient descent loop
- Show the results.
The demonstration data is height in feet as a function of gender:
create table #rawdata (id int, sex varchar(6), height float)
insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92
insert into #rawdata select 3, 'male', 5.58
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5
insert into #rawdata select 7,'female', 5.42
insert into #rawdata select 8,'female', 5.75
Create the data vectors with #data as the independent variables and #class as the dependent variable. As a reminder, this format eases the calculations in an SQL environment by using joins instead of having to look things up.
select id, 'intercept' attribute, 'intercept' value into #data from #rawdata
union all
select id, 'sex',sex from #rawdata
select id, 'height' attribute,height class into #class from #rawdata
select attribute, value, cast(0 as float) wght into #weights from #data
select * into #newweights from #weights
declare @error float, @newerror float, @count int, @alpha
set @error = (select sum(power(error,2)) from
(select c.id, c.class - sum(wght) error
from #class c
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) e)
set @count = 1
set @alpha = 0.1
while 1=1
begin
truncate table #newweights
insert into #newweights
select d.attribute, d.value, w.wght+.@alpha*avg(error) aerror
from #weights w
join #data d
on w.attribute = d.attribute and w.value = d.value
join (select c.id, c.class - sum(wght) error
from #class c
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght
set @newerror =(select sum(power(error,2)) from
(select c.id, c.class - sum(wght) error
from #class c
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) e)
if (@newerror> @error and @count >10) or @count>100 break
truncate table #weights
insert into #weights
select * from #newweights
set @error = @newerror
set @count = @count +1
end
On the toy data set it ran in about 1 second with the following weights and fit:
select * from #weights
attribute | value | weight |
intercept | intercept | 2.818 |
sex | female | 2.599 |
sex | male | 3.037 |
select c.id, c.class,sum(wght) fit , c.class-sum(wght) error
from #class c
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class
id | class | fit | error |
1 | 6 | 5.855 | 0.145 |
2 | 5.92 | 5.855 | 0.065 |
3 | 5.58 | 5.855 | -0.275 |
4 | 5.92 | 5.855 | 0.065 |
5 | 5 | 5.418 | -0.418 |
6 | 5.5 | 5.418 | 0.082 |
7 | 5.42 | 5.418 | 0.002 |
8 | 5.75 | 5.418 | 0.332 |
An the r-squared is 99.8% using:
select 1- sum(power(class-fit,2))/ sum(power(class,2)) from
(select c.id, c.class,sum(wght) fit
from #class c
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) f
Frequently the dependent variable is a binary variable in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression). It is an easy change: substitute (1.0/(1.0+exp(-sum(wght))) for sum(wght) throughout. I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/) which is on my server in a table called train.
The dependent variable is survived which is binary. In addition to the survival field, training data has both categorical and numeric data. I will use three categorical variables: sex, pclass and embarked. I've shown the modification in the initial error sub_query only.
select passengerid id, 'survived' attribute,survived class into #class from train
select passengerid id, 'intercept' attribute, 'intercept' value into #data from train
union all
select passengerid, 'sex',sex from train
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train
declare @error float, @newerror float, @count int
drop table #weights, #newweights
select attribute, value, cast(0 as float) wght into #weights from #data
select attribute, value, cast(0 as float) wght into #newweights from #data
set @error = (select sum(power(error,2)) from
(select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error
from #class c
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) e)
attribute | value | weight |
embarked | C | 0.283 |
embarked | Q | 0.013 |
embarked | S | -0.283 |
intercept | intercept | -0.144 |
pclass | 1 | 0.730 |
pclass | 2 | 0.218 |
pclass | 3 | -0.664 |
sex | female | 1.234 |
sex | male | -0.894 |
As this is essentially a classification model, how successfully did it classify if one assumes a fitted value >.5 = 1? 78.67%.
select sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght)))) fit
from #class c
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) f
Finally, the r-squared is 61.5%
select 1- sum(power(class-fit,2))/ sum(power(class,2)) from
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght)))) fit
from #class c
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) f
No comments:
Post a Comment