Tuesday, September 10, 2013

Incredibly Simple Regression Using Group By Categories in SQL

I covered regression in one of my earlier posts and in the process came up with a simpler algorithm. I can't claim it as my discovery, as I'm sure it is already out there somewhere and there is a strong similarity between many error minimization problems, but I've never seen a write-up and my favorite search engine didn't reveal any prior publications. I'm calling it SQL Group By Regression (SGBR); if it has already been published, let me know. It bears a strong resemblance to ANOVA.

The algorithm models a numeric dependent variable by estimating weights for the independent variables, which are all categorical. An example would be height as a function of gender: height is a numeric variable while gender is categorical. In typical regression (including my earlier post) this could be modeled by using a binary (1,0) variable for each category, which is fairly simple. Imagine doing the same for weight as a function of state of residence; very quickly the model will have fifty binary variables. Straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.

The model: estimate a score for each attribute value (e.g. sex = female) such that the fitted class value is the sum of the scores for that observation. The scores, or weights, are estimated using batch-mode gradient descent.
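To make the update rule concrete: if each observation's error is its class value minus the sum of the weights for its attribute values, then each weight moves by a fraction of the average error over the observations sharing that attribute value:

new weight = old weight + alpha * avg(class - fit)

where the average runs over the observations having that attribute value, and fit is the current sum of weights for the observation. This is the batch update the SQL below implements with a group by on attribute and value.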

The course of this post:

  1. Load some data; I will first use a small data set, then the Titanic data
  2. Put the data into a vector format to ease the SQL
  3. Initialize the weights and control variables
  4. Create the gradient descent loop
  5. Show the results.
The demonstration data is height in feet as a function of gender:


create table #rawdata (id int, sex varchar(6), height float)

insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92 
insert into #rawdata select 3, 'male', 5.58 
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5 
insert into #rawdata select 7,'female', 5.42 
insert into #rawdata select 8,'female', 5.75 

Create the data vectors, with #data holding the independent variables and #class the dependent variable. As a reminder, this format eases the calculations in a SQL environment by using joins instead of having to look things up.

select id, 'intercept'  attribute, 'intercept' value into #data from #rawdata  
union all
select id, 'sex',sex from #rawdata 

select id, 'height' attribute,height class  into #class from #rawdata
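For the first and fifth observations, for example, #data now looks like this (one row per observation per attribute, with #class holding one row per observation):

id attribute value
1 intercept intercept
1 sex male
5 intercept intercept
5 sex female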

The weights will be stored in the #weights table, which is a summary of the distinct attribute values in #data to ensure all variables being modeled are included. There is also an identical table called #newweights. As the weights are updated they are stored in the #newweights table until it is verified that the error decreased; if it did, the weights are copied to the #weights table and the looping continues, otherwise the looping stops and the new weights are discarded. Progress is measured with @count, while the improvement in error is tracked through @error and @newerror; @alpha is the fraction of the error added to the weight estimates. The initial weights are set to 0, and @error is the sum of the squared errors between the fit and the class.

select distinct attribute, value, cast(0 as float) wght into #weights from #data
select *  into #newweights from #weights 

declare @error float, @newerror float, @count int, @alpha float

set @error = (select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)


set @count = 1
set @alpha = 0.1
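
As a quick sanity check: the weights start at 0, so the initial fit is 0 for every observation and @error at this point is just the sum of the squared heights, about 255 for this data:

select @error initial_error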



The key SQL is the update statement below: the inner sub-query calculates the error by observation, and the outer insert uses the average of that error to update the category weights. The remaining statements recalculate the error, check the stop conditions, and update the tables with the new information. The gradient descent loops at least 10 times and stops at 100 iterations, or sooner if the error increases after the tenth pass.

while 1=1
begin
truncate table #newweights

insert into #newweights
select d.attribute, d.value, w.wght + @alpha*avg(error) wght
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

set @newerror =(select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

if (@newerror > @error and @count > 10) or @count > 100 break

truncate table #weights
insert into #weights  
select * from #newweights

set @error = @newerror
set @count = @count +1

end
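
If you want to watch the convergence, one option (not in the original code) is a print at the bottom of the loop body, just before the end:

print cast(@count as varchar(10)) + ': ' + cast(@newerror as varchar(30))

The printed error should fall quickly over the first several iterations and then flatten out as the weights settle.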

On the toy data set it ran in about 1 second with the following weights and fit:

select * from #weights


attribute value weight
intercept intercept 2.818
sex female 2.599
sex male 3.037

select c.id, c.class,sum(wght)  fit , c.class-sum(wght) error
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class

id class fit error
1 6 5.855 0.145
2 5.92 5.855 0.065
3 5.58 5.855 -0.275
4 5.92 5.855 0.065
5 5 5.418 -0.418
6 5.5 5.418 0.082
7 5.42 5.418 0.002
8 5.75 5.418 0.332
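
Notice the fitted values are just the group means: the average male height is (6 + 5.92 + 5.58 + 5.92)/4 = 5.855 and the average female height is (5 + 5.5 + 5.42 + 5.75)/4 = 5.4175, which is the resemblance to ANOVA mentioned above. This is easy to verify against the raw data:

select sex, avg(height) avg_height from #rawdata group by sex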

And the r-squared is 99.8% using:

 select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
  (select c.id, c.class,sum(wght)  fit 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f
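
One caveat: the denominator is sum(power(class,2)), so this is an uncentered r-squared, and it runs high when the class values sit well away from zero, as heights do. A centered version, closer to the textbook definition, subtracts the mean class value first; something like:

select 1- sum(power(class-fit,2))/ sum(power(class-avg_class,2)) from 
(select c.id, c.class, sum(wght) fit, (select avg(class) from #class) avg_class
from #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Expect a noticeably lower figure from this version, since most of sum(power(class,2)) is just the overall mean.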

Frequently the dependent variable is binary, in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression). It is an easy change: substitute (1.0/(1.0+exp(-sum(wght)))) for sum(wght) throughout. I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/), which is on my server in a table called train.

The dependent variable is survived, which is binary. In addition to the survival field, the training data has both categorical and numeric fields. I will use three categorical variables: sex, pclass and embarked. I've shown the modification in the initial error sub-query only.

select passengerid id, 'survived' attribute,survived class into #class from train 

select passengerid id, 'intercept'  attribute, 'intercept' value into #data from train  
union all
select passengerid, 'sex',sex from train 
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train

declare @error float, @newerror float, @count int, @alpha float

drop table #weights, #newweights
select distinct attribute, value, cast(0 as float) wght into #weights from #data
select distinct attribute, value, cast(0 as float) wght into #newweights from #data


set @error = (select sum(power(error,2)) from 
(select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)
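
For completeness, the same substitution goes inside the loop; the weight update insert becomes something like:

insert into #newweights
select d.attribute, d.value, w.wght + @alpha*avg(error) wght
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error 
from #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

with the same change in the @newerror calculation (against #newweights).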

The algorithm ran in about 4 seconds on my low-end Gateway using un-indexed tables. The weights show one clearly wanted to be a lady in first class from Cherbourg:

attribute value weight
embarked C 0.283
embarked Q 0.013
embarked S -0.283
intercept intercept -0.144
pclass 1 0.730
pclass 2 0.218
pclass 3 -0.664
sex female 1.234
sex male -0.894
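
To read those weights as probabilities: a first-class female who embarked at Cherbourg scores -0.144 + 0.730 + 1.234 + 0.283 = 2.103, a fitted survival probability of 1.0/(1.0+exp(-2.103)), or about 0.89, while a third-class male from Southampton scores -0.144 - 0.664 - 0.894 - 0.283 = -1.985, about 0.12.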

As this is essentially a classification model, how successfully did it classify if one treats a fitted value > 0.5 as a 1? 78.67%.

select  sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f
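
To see where the misclassifications land, a confusion-matrix style breakdown (not in the original post, but built on the same fit sub-query) is:

select class, round(fit,0) predicted, count(1) n from 
(select c.id, c.class, (1.0/(1.0+exp(-sum(wght)))) fit 
from #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f
group by class, round(fit,0)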

Finally, the r-squared is 61.5%:

select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f





