Tuesday, September 10, 2013

Incredibly Simple Regression Using Group By Categories in SQL

I covered regression in one of my earlier posts and in the process came up with a simpler algorithm. I can't claim it as my discovery, as I'm sure it is already out there somewhere and there is a strong similarity between many error minimization problems, but I've never seen a write-up and my favorite search engine didn't reveal any prior publications. I'm calling it SQL Group By Regression (SGBR); if it has already been published, let me know. It bears a strong resemblance to ANOVA.

The algorithm models a numeric dependent variable by estimating weights for the independent variables, which are all categorical. An example would be height as a function of gender: height is a numeric variable while gender is categorical. In typical regression (including my earlier post) this could be modeled by using a binary (1,0) variable for each category, which is fairly simple. Imagine doing the same for weight as a function of state of residence; very quickly the model will have fifty binary variables. Straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.

The model: estimate a score for each attribute value (e.g. sex = female) such that the fitted class value is the sum of the scores for that observation. The scores, or weights, are estimated using batch-mode gradient descent.
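To make the update rule concrete: if each observation's error is its class value minus the sum of the weights for its attribute values, then each weight moves by a fraction of the average error over the observations sharing that attribute value:

new weight = old weight + alpha * avg(class - fit)

where the average runs over the observations having that attribute value, and fit is the current sum of weights for the observation. This is the batch update the SQL below implements with a group by on attribute and value.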

The course of this post:

  1. Load some data; I will first use a small data set, then the Titanic data
  2. Put the data into a vector format to ease the SQL
  3. Initialize the weights and control variables
  4. Create the gradient descent loop
  5. Show the results.
The demonstration data is height in feet as a function of gender:


create table #rawdata (id int, sex varchar(6), height float)

insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92 
insert into #rawdata select 3, 'male', 5.58 
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5 
insert into #rawdata select 7,'female', 5.42 
insert into #rawdata select 8,'female', 5.75 

Create the data vectors, with #data holding the independent variables and #class the dependent variable. As a reminder, this format eases the calculations in a SQL environment by using joins instead of having to look things up.

select id, 'intercept'  attribute, 'intercept' value into #data from #rawdata  
union all
select id, 'sex',sex from #rawdata 

select id, 'height' attribute,height class  into #class from #rawdata
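For the first and fifth observations, for example, #data now looks like this (one row per observation per attribute, with #class holding one row per observation):

id attribute value
1 intercept intercept
1 sex male
5 intercept intercept
5 sex female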

The weights will be stored in the #weights table, which is a summary of the distinct attribute values in #data to ensure all variables being modeled are included. There is also an identical table called #newweights. As the weights are updated they are stored in the #newweights table until it is verified that the error decreased; if it did, the weights are copied to the #weights table and the looping continues, otherwise the looping stops and the new weights are discarded. Progress is measured with @count, while the improvement in error is tracked through @error and @newerror; @alpha is the fraction of the error added to the weight estimates. The initial weights are set to 0, and @error is the sum of the squared errors between the fit and the class.

select distinct attribute, value, cast(0 as float) wght into #weights from #data
select *  into #newweights from #weights 

declare @error float, @newerror float, @count int, @alpha float

set @error = (select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)


set @count = 1
set @alpha = 0.1
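
As a quick sanity check: the weights start at 0, so the initial fit is 0 for every observation and @error at this point is just the sum of the squared heights, about 255 for this data:

select @error initial_error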



The key SQL is the update statement below: the inner sub-query calculates the error by observation, and the outer insert uses the average of that error to update the category weights. The remaining statements recalculate the error, check the stop conditions, and update the tables with the new information. The gradient descent loops at least 10 times and stops at 100 iterations, or sooner if the error increases after the tenth pass.

while 1=1
begin
truncate table #newweights

insert into #newweights
select d.attribute, d.value, w.wght + @alpha*avg(error) wght
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

set @newerror =(select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

if (@newerror > @error and @count > 10) or @count > 100 break

truncate table #weights
insert into #weights  
select * from #newweights

set @error = @newerror
set @count = @count +1

end
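
If you want to watch the convergence, one option (not in the original code) is a print at the bottom of the loop body, just before the end:

print cast(@count as varchar(10)) + ': ' + cast(@newerror as varchar(30))

The printed error should fall quickly over the first several iterations and then flatten out as the weights settle.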

On the toy data set it ran in about 1 second with the following weights and fit:

select * from #weights


attribute value weight
intercept intercept 2.818
sex female 2.599
sex male 3.037

select c.id, c.class,sum(wght)  fit , c.class-sum(wght) error
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class

id class fit error
1 6 5.855 0.145
2 5.92 5.855 0.065
3 5.58 5.855 -0.275
4 5.92 5.855 0.065
5 5 5.418 -0.418
6 5.5 5.418 0.082
7 5.42 5.418 0.002
8 5.75 5.418 0.332
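
Notice the fitted values are just the group means: the average male height is (6 + 5.92 + 5.58 + 5.92)/4 = 5.855 and the average female height is (5 + 5.5 + 5.42 + 5.75)/4 = 5.4175, which is the resemblance to ANOVA mentioned above. This is easy to verify against the raw data:

select sex, avg(height) avg_height from #rawdata group by sex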

And the r-squared is 99.8% using:

 select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
  (select c.id, c.class,sum(wght)  fit 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f
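
One caveat: the denominator is sum(power(class,2)), so this is an uncentered r-squared, and it runs high when the class values sit well away from zero, as heights do. A centered version, closer to the textbook definition, subtracts the mean class value first; something like:

select 1- sum(power(class-fit,2))/ sum(power(class-avg_class,2)) from 
(select c.id, c.class, sum(wght) fit, (select avg(class) from #class) avg_class
from #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Expect a noticeably lower figure from this version, since most of sum(power(class,2)) is just the overall mean.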

Frequently the dependent variable is binary, in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression). It is an easy change: substitute (1.0/(1.0+exp(-sum(wght)))) for sum(wght) throughout. I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/), which is on my server in a table called train.

The dependent variable is survived, which is binary. In addition to the survival field, the training data has both categorical and numeric fields. I will use three categorical variables: sex, pclass and embarked. I've shown the modification in the initial error sub-query only.

select passengerid id, 'survived' attribute,survived class into #class from train 

select passengerid id, 'intercept'  attribute, 'intercept' value into #data from train  
union all
select passengerid, 'sex',sex from train 
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train

declare @error float, @newerror float, @count int, @alpha float

drop table #weights, #newweights
select distinct attribute, value, cast(0 as float) wght into #weights from #data
select distinct attribute, value, cast(0 as float) wght into #newweights from #data


set @error = (select sum(power(error,2)) from 
(select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)
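
For completeness, the same substitution goes inside the loop; the weight update insert becomes something like:

insert into #newweights
select d.attribute, d.value, w.wght + @alpha*avg(error) wght
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error 
from #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

with the same change in the @newerror calculation (against #newweights).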

The algorithm ran in about 4 seconds on my low-end Gateway using un-indexed tables. The weights show one clearly wanted to be a lady in first class from Cherbourg:

attribute value weight
embarked C 0.283
embarked Q 0.013
embarked S -0.283
intercept intercept -0.144
pclass 1 0.730
pclass 2 0.218
pclass 3 -0.664
sex female 1.234
sex male -0.894
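
To read those weights as probabilities: a first-class female who embarked at Cherbourg scores -0.144 + 0.730 + 1.234 + 0.283 = 2.103, a fitted survival probability of 1.0/(1.0+exp(-2.103)), or about 0.89, while a third-class male from Southampton scores -0.144 - 0.664 - 0.894 - 0.283 = -1.985, about 0.12.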

As this is essentially a classification model, how successfully did it classify if one treats a fitted value > 0.5 as a 1? 78.67%.

select  sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f
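
To see where the misclassifications land, a confusion-matrix style breakdown (not in the original post, but built on the same fit sub-query) is:

select class, round(fit,0) predicted, count(1) n from 
(select c.id, c.class, (1.0/(1.0+exp(-sum(wght)))) fit 
from #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f
group by class, round(fit,0)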

Finally, the r-squared is 61.5%:

select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f





