To begin create a table of data:
create table #rawdata (id int,area float, rooms float, odd float, price float)
insert into #rawdata select 1, 2201,3,1,400
insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540
It facilitates the code to convert the table of data into two data vectors: the independent #x variables and the dependent #y variable. There are a couple of modifications. First, rather than use the label names, I use a label index which facilitates the processing. Second, In addition to the independent variables the #x table also includes the intercept variable with a value of 1 for each observation. Finally, there is no error handling and the program cannot manage missing (null) values.
The naming conventions used throughout are pretty simple. The tables follow the names use in the articles with the exception of #c for the orthogonal coefficients (instead of a Greek letter). Field names are *id for the row id in table *, *n is the index of the field, *v is the value.
select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms from #rawdata
union all
select id, 2,area from #rawdata
union all
select id, 3,odd from #rawdata
select id yid, 0 yn, price yv into #y from #rawdata
Three additional tables are required: the #z table holds the orthogonalized values, #c holds the orthagonalization coefficients and #b the regression coefficients.
create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata
create table #c(cxn int, czn int, cv float)
create table #b(bn int, bv float)
The first part of the code loops forward through each of the independent variables starting with the intercept calculating first the orthogonalization coefficients and then the orthogonalized X values which are store in #z.
declare @p int
set @p = 1
while @p <= (select max(xn) from #x)
begin
insert into #c
select xn cxn, zn czn, sum(xv*zv)/sum(zv*zv) cv
from #x join #z on xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv)
from #x join #z on xid = zid join #c on czn = zn and cxn = xn where xn = @p and zn<xn group by zid, xn,xv
set @p = @p +1
end
The second part of the code loops backward calculating the B coefficients by regressing Y on Z where Y is continually updated to exclude previously fit Y values (the sub-query)
while @p>=0
begin
insert into #b
select zn, sum(yv*zv)/ sum(zv*zv)
from #z join
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on xn=bn group by yid, yv) y
on zid = yid where zn = @p group by zn
set @p = @p-1
end
The regression coefficients are:
select * from #b
bn | bv |
3 | 11.026 |
2 | 0.056 |
1 | 114.965 |
0 | -96.747 |
Finally the fitted values are:
select yid, yv, fit, yv-fit err from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
on yid = xid
yid | yv | fit | err |
1 | 400 | 382.975 | 17.025 |
2 | 330 | 338.144 | -8.144 |
3 | 369 | 394.169 | -25.169 |
4 | 232 | 223.856 | 8.144 |
5 | 540 | 531.856 | 8.144 |
The r squared is 99.85%
select 1-sum(power(err,2))/sum(power(yv,2)) from
(select yid, yv, fit, yv-fit err from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
on yid = xid) d
Given the challenge in writing this I was ultimately surprised how little code was involved. There an important detail missing from the model: the errors associated with the coefficients for determining statistical significance. Maybe at some point I'll return to regression and figure it out. but in the mean time perhaps a reader wants to take on the challenge...
Hi Colin, I really like your post on regression in SQL. We've taken on the challenge of errors associated, but are stuck unfortunately. Have you made any progression on this topic? Kind regards, Wouter
ReplyDeleteGlad to hear you are finding the post useful. Sorry but I haven't had a chance to get back into the details of regression to figure out the error calculations.
ReplyDeleteColin, wonderful work! I've also tried this without luck. This week keep me occupied for a while. I'll need to look into the text you mentioned early in the post. Andrew
ReplyDeleteHi Colin, Thanks for this helpful post. I found it very useful and saved my time. Also I need to know if you have any idea how to force intercept to zero in your solution. Thanks again.
ReplyDeleteBrilliant!
ReplyDeleteCode works perfectly!
ReplyDeleteThe r-squared has a bug in it. This fixes it...
ReplyDeleteselect 1-sum(power(err,2))/sum(power(yv-yavg,2)) AS r2,
((Avg(fit * yv) - Avg(fit) * Avg(yv)) / (StDevP(fit) * StDevP(yv))) AS correlation,
SQRT(SUM(SQUARE(yv-fit))/COUNT(*)) AS RMSE
FROM (select yid, yv, fit, yv-fit err,yavg from #y
JOIN (select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f on yid = xid
JOIN (SELECT AVG(yv) AS yavg FROM #y) g ON 1=1) d
Thanks Colin, linest returns slope, how do I get that from your code ?
ReplyDelete