Sunday, December 15, 2013

True Multiple Regression Using SQL.

Some weeks ago I posted on multiple regression. The code in that post solved for the coefficients using gradient descent, an iterative process of reducing error through coefficient adjustment, which seems a second-rate compromise compared to the ease of the LINEST function in Excel. In the interim I researched an approach to multiple regression that builds on simple univariate regression (regressing Y on X through the origin is effectively sum(X*Y)/sum(X*X) in SQL), and that approach forms the basis of this post. After some trial and error I got the SQL to work. My primary source for this approach is the somewhat difficult Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman. A reasonable discussion of the topic can also be found on Stats Exchange.
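As a reminder of that univariate building block, here is a minimal sketch of regressing y on x through the origin, assuming a hypothetical two-column table #xy with one row per observation:

-- slope of y regressed on x through the origin: sum(xy)/sum(xx)
select sum(x*y)/sum(x*x) slope from #xy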

To begin, create a table of data:

create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400
insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

It simplifies the code to convert the table of data into two long-format tables: #x for the independent variables and #y for the dependent variable. A few points to note. First, rather than the column names, I use a numeric index for each variable, which simplifies the processing. Second, in addition to the independent variables, the #x table also includes the intercept variable with a value of 1 for each observation. Finally, there is no error handling and the program cannot manage missing (null) values.

The naming conventions used throughout are pretty simple. The tables follow the names used in the references, with the exception of #c for the orthogonalization coefficients (instead of a Greek letter). Field names are *id for the row id in table *, *n for the index of the variable, and *v for the value.

select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata


Three additional tables are required: the #z table holds the orthogonalized values, #c holds the orthogonalization coefficients, and #b the regression coefficients. The #z table is seeded with the intercept column (zn = 0), since nothing needs to be orthogonalized out of it.

create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

create table #c(cxn int, czn int, cv float) 

create table #b(bn int, bv float) 

The first part of the code loops forward through the independent variables, starting after the intercept. On each pass it first calculates the orthogonalization coefficients of the remaining X variables against the most recently orthogonalized column (each coefficient is sum(xv*zv)/sum(zv*zv), the univariate regression from the introduction), and then builds the orthogonalized values for variable @p by subtracting the fitted pieces of all earlier #z columns; the results are stored in #z.

declare @p int
set @p = 1

while @p <= (select max(xn) from #x)
begin
-- coefficients of the remaining x columns on the newest z column (zn = @p-1)
insert into #c
select xn cxn, zn czn, sum(xv*zv)/sum(zv*zv) cv
from #x join #z on xid = zid where zn = @p-1 and xn>zn group by xn, zn
-- orthogonalize column @p by subtracting its fit on all earlier z columns
insert into #z
select zid, xn, xv-sum(cv*zv)
from #x join #z on xid = zid join #c on czn = zn and cxn = xn where xn = @p and zn<xn group by zid, xn, xv
set @p = @p + 1
end
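At this point the columns of #z should be mutually orthogonal. A quick sanity check, not part of the algorithm itself, is that the cross products between distinct columns come out to zero up to rounding:

-- cross product of every pair of distinct z columns; each sum should be ~0
select a.zn, b.zn zn2, sum(a.zv*b.zv) cross_product
from #z a join #z b on a.zid = b.zid and a.zn < b.zn
group by a.zn, b.zn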

The second part of the code loops backward, calculating the B coefficients by regressing Y on each Z column in turn (bv = sum(yv*zv)/sum(zv*zv)), where Y is continually updated to exclude the contribution of coefficients already fitted (the sub-query). Note that the forward loop leaves @p one past the last variable index, so the first pass of the backward loop matches no rows and simply decrements.

while @p>=0
begin
-- regress the not-yet-explained part of y on z column @p
insert into #b
select zn, sum(yv*zv)/sum(zv*zv)
from #z join
-- sub-query: residual of y after removing contributions of coefficients already in #b
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on xn = bn group by yid, yv) y
on zid = yid where zn = @p group by zn
set @p = @p-1
end

The regression coefficients are:

select * from #b


bn bv
3 11.026
2 0.056
1 114.965
0 -96.747
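Since #b only carries the numeric index, mapping the coefficients back to the variable names used when building #x can make the output easier to read (a small sketch that hard-codes the same index assignment as above):

select bn,
 case bn when 0 then 'intercept' when 1 then 'rooms' when 2 then 'area' when 3 then 'odd' end variable,
 bv
from #b order by bn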

Finally, the fitted values and errors are:

select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid


yid yv fit err
1 400 382.975 17.025
2 330 338.144 -8.144
3 369 394.169 -25.169
4 232 223.856 8.144
5 540 531.856 8.144

The R squared, computed here against the raw sum of squares of y, is 99.85%:

 select 1-sum(power(err,2))/sum(power(yv,2)) from 
  (select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid) d
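The denominator above is the raw sum of squares of y, so this is the uncentered version; the more conventional R squared divides by the sum of squared deviations from the mean of y and comes out somewhat lower on this data. A sketch of that version, reusing the same derived table with the mean added in:

select 1-sum(power(err,2))/sum(power(yv-ybar,2)) from
  (select yid, yv, fit, yv-fit err, (select avg(yv) from #y) ybar from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid) d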


Given the challenge in writing this, I was ultimately surprised how little code was involved. There is an important detail missing from the model: the standard errors associated with the coefficients, needed for determining statistical significance. Maybe at some point I'll return to regression and figure it out, but in the meantime perhaps a reader wants to take on the challenge...


4 comments:

  1. Hi Colin, I really like your post on regression in SQL. We've taken on the challenge of errors associated, but are stuck unfortunately. Have you made any progression on this topic? Kind regards, Wouter

  2. Glad to hear you are finding the post useful. Sorry but I haven't had a chance to get back into the details of regression to figure out the error calculations.

  3. Colin, wonderful work! I've also tried this without luck. This will keep me occupied for a while. I'll need to look into the text you mentioned early in the post. Andrew

  4. Hi Colin, Thanks for this helpful post. I found it very useful and saved my time. Also I need to know if you have any idea how to force intercept to zero in your solution. Thanks again.
