Sunday, December 15, 2013

True Multiple Regression Using SQL.

Some weeks ago I posted on multiple regression.  The code in that post solved for the coefficients using gradient descent, an iterative process of reducing error through coefficient adjustment, which seems a second-rate compromise compared to the ease of the Linest function in Excel.  In the interim I researched multiple regression built up from simple univariate regression (regressing Y on X through the origin is effectively sum(XY)/sum(XX) in SQL), which forms the basis of this post.  After some trial and error I got the SQL to work.  My primary source for this approach is the somewhat difficult Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman.  A reasonable discussion of the topic is at stats exchange.
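To make that univariate building block concrete outside SQL, here is a minimal Python sketch (the data points are made up purely for illustration): the no-intercept slope of Y on X is just sum(XY)/sum(XX).

```python
# Slope of a no-intercept regression of y on x: b = sum(x*y) / sum(x*x).
# The data here is invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
# b is 59.7 / 30 = 1.99
```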

To begin, create a table of data:

create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400

insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

It simplifies the code to convert the table of data into two data vectors: the independent #x variables and the dependent #y variable.  There are a few points to note.  First, rather than use the label names, I use a label index, which facilitates the processing.  Second, in addition to the independent variables, the #x table also includes the intercept variable with a value of 1 for each observation.  Finally, there is no error handling, and the program cannot manage missing (null) values.

The naming conventions used throughout are fairly simple.  The tables follow the names used in the references, with the exception of #c for the orthogonalization coefficients (instead of a Greek letter).  Field names follow the pattern *id for the row id in table *, *n for the index of the field, and *v for the value.

select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata


Three additional tables are required: the #z table holds the orthogonalized values, #c holds the orthogonalization coefficients, and #b the regression coefficients.

create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

create table #c(cxn int, czn int, cv float) 

create table #b(bn int, bv float) 

The first part of the code loops forward through each of the independent variables, starting with the intercept, first calculating the orthogonalization coefficients and then the orthogonalized X values, which are stored in #z.

declare @p int
set @p = 1

while @p <= (select max(xn) from #x)

begin
insert into #c
select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv) 
from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv
set @p = @p +1

end
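For readers who want to check the forward pass outside SQL, the same successive orthogonalization can be sketched in Python on the five observations above (column order as in #x: intercept, rooms, area, odd).  This only claims to mirror the logic of the loop, not the SQL itself.

```python
# Successive (Gram-Schmidt) orthogonalization of the X columns, mirroring
# the forward SQL loop: z_p = x_p - sum over j < p of c[p][j] * z_j,
# where c[p][j] = sum(x_p * z_j) / sum(z_j * z_j).
X = [  # columns: intercept, rooms, area, odd (data from #rawdata)
    [1, 3, 2201, 1],
    [1, 3, 1600, 0],
    [1, 3, 2400, 1],
    [1, 2, 1416, 1],
    [1, 4, 3000, 0],
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cols = list(zip(*X))             # column vectors of X
z = [list(cols[0])]              # z0 is the intercept column itself
for p in range(1, len(cols)):
    zp = list(cols[p])
    for j in range(p):           # subtract the projection onto each earlier z
        c = dot(cols[p], z[j]) / dot(z[j], z[j])
        zp = [a - c * b for a, b in zip(zp, z[j])]
    z.append(zp)

# The z columns are now mutually orthogonal; for example z[1] is the rooms
# column minus its mean of 3: [0, 0, 0, -1, 1].
```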

The second part of the code loops backward, calculating the b coefficients by regressing Y on Z, where Y is continually updated (in the sub-query) to exclude the portion already fit by previously calculated coefficients.

 while @p>=0 
begin

insert into #b
select zn, sum(yv*zv)/ sum(zv*zv) 
from #z  join 
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
on zid = yid where zn = @p  group by zn

set @p = @p-1

end
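Putting the two loops together outside SQL: the following Python sketch runs the forward orthogonalization and then the backward pass on the same five observations, recovering (up to rounding) the coefficients reported below.  It is a check of the algorithm's logic, not a drop-in replacement for the SQL.

```python
# Forward pass: orthogonalize the X columns. Backward pass: regress the
# progressively deflated y on each z, from the last column back to the
# intercept -- mirroring the two SQL loops in the post.
X = [  # columns: intercept, rooms, area, odd (data from #rawdata)
    [1, 3, 2201, 1],
    [1, 3, 1600, 0],
    [1, 3, 2400, 1],
    [1, 2, 1416, 1],
    [1, 4, 3000, 0],
]
y = [400.0, 330.0, 369.0, 232.0, 540.0]   # price

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

cols = list(zip(*X))
z = [list(cols[0])]
for p in range(1, len(cols)):
    zp = list(cols[p])
    for j in range(p):
        c = dot(cols[p], z[j]) / dot(z[j], z[j])
        zp = [a - c * b for a, b in zip(zp, z[j])]
    z.append(zp)

b = [0.0] * len(cols)
for p in range(len(cols) - 1, -1, -1):
    # Deflate y by the contribution of the coefficients already solved
    # (indices above p), as the SQL sub-query does, then regress on z_p.
    resid = [yi - sum(b[q] * X[i][q] for q in range(p + 1, len(cols)))
             for i, yi in enumerate(y)]
    b[p] = dot(resid, z[p]) / dot(z[p], z[p])
```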

The regression coefficients are:

select * from #b


bn bv
3 11.026
2 0.056
1 114.965
0 -96.747

Finally the fitted values are:

select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid


yid yv fit err
1 400 382.975 17.025
2 330 338.144 -8.144
3 369 394.169 -25.169
4 232 223.856 8.144
5 540 531.856 8.144

The r squared, computed here against the raw sum of squares of Y (i.e., uncentered), is 99.85%:

 select 1-sum(power(err,2))/sum(power(yv,2)) from 
  (select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid) d
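As one of the comments below points out, the query above divides by sum(y²) rather than the sum of squared deviations from the mean, so it is the uncentered version of r squared.  A quick Python check using the observed and fitted prices from the table above shows the difference between the two definitions:

```python
# Compare the uncentered R^2 used in the post (1 - SSE / sum(y^2)) with the
# conventional centered R^2 (1 - SSE / sum((y - ybar)^2)), using the observed
# and fitted prices from the table above.
y   = [400.0, 330.0, 369.0, 232.0, 540.0]
fit = [382.975, 338.144, 394.169, 223.856, 531.856]

sse  = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
ybar = sum(y) / len(y)
r2_uncentered = 1 - sse / sum(yi ** 2 for yi in y)
r2_centered   = 1 - sse / sum((yi - ybar) ** 2 for yi in y)
# r2_uncentered is about 0.9985; r2_centered is about 0.9777
```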


Given the challenge in writing this, I was ultimately surprised how little code was involved.  There is an important detail missing from the model: the errors associated with the coefficients, needed for determining statistical significance.  Maybe at some point I'll return to regression and figure it out, but in the meantime perhaps a reader wants to take on the challenge...


8 comments:

  1. Hi Colin, I really like your post on regression in SQL. We've taken on the challenge of errors associated, but are stuck unfortunately. Have you made any progression on this topic? Kind regards, Wouter

  2. Glad to hear you are finding the post useful. Sorry but I haven't had a chance to get back into the details of regression to figure out the error calculations.

  3. Colin, wonderful work! I've also tried this without luck. This will keep me occupied for a while. I'll need to look into the text you mentioned early in the post. Andrew

  4. Hi Colin, Thanks for this helpful post. I found it very useful and saved my time. Also I need to know if you have any idea how to force intercept to zero in your solution. Thanks again.

  5. Code works perfectly!

  6. The r-squared has a bug in it. This fixes it...
    select 1-sum(power(err,2))/sum(power(yv-yavg,2)) AS r2,
    ((Avg(fit * yv) - Avg(fit) * Avg(yv)) / (StDevP(fit) * StDevP(yv))) AS correlation,
    SQRT(SUM(SQUARE(yv-fit))/COUNT(*)) AS RMSE
    FROM (select yid, yv, fit, yv-fit err,yavg from #y
    JOIN (select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f on yid = xid
    JOIN (SELECT AVG(yv) AS yavg FROM #y) g ON 1=1) d

  7. Thanks Colin, linest returns slope, how do I get that from your code ?
