Monday, July 29, 2013

Single & Multiple Regression in SQL

Linear regression, a.k.a. ordinary least squares, is an algorithm for estimating the mathematical relationship between a set of one or more independent variables and a dependent variable. It takes the mathematical form of Y = B + m1*X1 + m2*X2 + ..., where Y is the dependent variable being predicted, B is the intercept or constant, and m1, m2, etc. are the slopes or weights of the independent, or input, variables X1, X2, etc. I suggest a visit to Google or Wikipedia for the theory involved.

If there are multiple independent variables the model is referred to as multiple regression, while with a single independent variable it is known as simple regression. This distinction is more than semantic: simple regression in SQL is extremely easy and can be implemented as a user-defined function that estimates the slope and intercept much like an average. Multiple regression, on the other hand, requires significant matrix algebra. While I've come across and implemented simple regression in SQL, I've yet to see multiple regression despite some serious searching. I've been trying to implement the Gram-Schmidt procedure, but it is still a work in progress (edit: I figured it out! See the post dated 12/15/2013). In the interim there is an iterative approach using gradient descent, which I'll cover here. First I'll re-hash the simple regression and then follow with the gradient descent solution to multiple regression.

Start the simple regression by creating a table and populating it with data, in this case the area and price of homes from an on-line machine learning course (the data appeared in a Stackoverflow post):

create table #rawdata (id int, X float, Y float)

insert into #rawdata select 1, 2201,400
insert into #rawdata select 2, 1600,330
insert into #rawdata select 3, 2400,369
insert into #rawdata select 4, 1416,232
insert into #rawdata select 5, 3000,540

The estimation process is to first calculate the slope and then the intercept, which is the average of the dependent variable minus the slope times the average of the independent variable. To unify the two steps I use a sub-query to estimate the slope and the averages of the independent (x) and dependent (y) variables; the parent query presents the slope and does the math to calculate the intercept:
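
Written out, the sub-query implements the standard least-squares formulas, in the same plain notation as above:

slope = (sum(x*y) - sum(x)*sum(y)/n) / (sum(x*x) - sum(x)*sum(x)/n)
intercept = avg(y) - slope*avg(x)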

select slope, ybar - slope*xbar intercept into #model from 
(select (sum(x*y) -(sum(x)*sum(y)/count(1)))/
(sum(x*x) - (sum(x)*sum(x)/count(1))) slope, 
avg(x)  xbar, avg(y) ybar
from #rawdata) a

The results:

slope intercept
0.1658939364 21.9408154411

The query above saved the results into a temporary table called #model. To see the fitted data versus the actuals, run the following query:

select id, y, intercept+slope*x fit, y-(intercept+slope*x) error from #model, #rawdata

id y fit error
1 400 387.073 12.927
2 330 287.371 42.629
3 369 420.086 -51.086
4 232 256.847 -24.847
5 540 519.623 20.377

On to the multiple regression: in addition to the area of a house, the number of rooms is provided and is to be incorporated into the model to predict the price. Unfortunately the algebraic solution of simple regression doesn't scale easily to 2 or more independent variables. An alternate solution is to use what is known as the gradient descent algorithm, which improves the model's fit by repeatedly updating the weights to reduce the error. The process is straightforward: using the current weights, estimate the fitted values, calculate the error, propagate a small amount of the error back to the weights, and repeat. As usual, more details are available all over the web.
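
Concretely, the batch update used later in this post nudges each weight by a small fraction (alpha) of the average of the input value times the error; in the same plain notation as before:

m_j := m_j + alpha * avg(X_j * (Y - fit))

This is the gradient descent step for squared error, with the constant factor from the derivative absorbed into alpha.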

To start, drop and recreate the #rawdata table and populate it with data. This time, instead of X & Y, I'll use the variable names.

create table #rawdata (id int,area float, rooms float, price float)

insert into #rawdata select 1, 2201,3,400
insert into #rawdata select 2, 1600,3,330
insert into #rawdata select 3, 2400,3,369
insert into #rawdata select 4, 1416,2,232
insert into #rawdata select 5, 3000,4,540

To facilitate the SQL, the table should be converted into data and class (attribute/value) vectors. A problem with gradient methods occurs when there are variables with widely divergent ranges: convergence slows down dramatically or becomes unstable. To overcome this difficulty, normalize the variables by subtracting the mean and dividing by the standard deviation. This is a simple linear transformation and will be easy to convert back when the final predictions are made:

select id,'intercept' attribute,1.0 value into  #data  from #rawdata
union all
select id, 'area',(area-(select avg(area) from #rawdata))/(select stdev(area) from #rawdata) from #rawdata
union all
select id, 'rooms',(rooms-(select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata) from #rawdata

select id,'price' attribute, (price-(select avg(price) from #rawdata))/(select stdev(price) from #rawdata) value into #class from #rawdata
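
As a quick sanity check (my addition, not part of the original flow), the normalized attributes in #data should now have an average of roughly 0 and a standard deviation of roughly 1, with the constant intercept column staying at 1:

select attribute, avg(value) avg_value, stdev(value) stdev_value
from #data
group by attribute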

Now there are two approaches to calculating the fit and error and updating the weights: either observation by observation, with repeated passes across the whole data set, or all records at once in batch. This implementation follows the batch approach.

Start by creating a table of weights for each attribute which can be populated with random numbers or, as in this case, 0:

select attribute,cast(0.0 as float) wght into #weight from #data   group by attribute 

The program requires 4 variables:
  • alpha, a number greater than 0 but less than 1 that controls the portion of the error incorporated into the weights on each iteration; here it is set to 0.1
  • best and newbest, which track the fit and control when the routine stops
  • count, the maximum number of iterations
After setting the parameter values, including @best, which is the sum of the squared errors using the starting weights, the program begins looping until one of the stop conditions is met:
  • the error increases after adjusting the weights, or
  • the maximum iteration count is exceeded
The key weight-adjusting query is the first statement inside the loop (commented below). The sub-queries first calculate the fit (identifier f) and then the error (identifier e); the error is multiplied by the independent variables and a small portion (@alpha) of the average is added to the weights, with the updated weights saved to #weightnew. Using the #weightnew table, the overall error is calculated as @newbest; if it is greater than the prior error, @best, the program stops, otherwise the new weights replace the #weight table and the #weightnew table is dropped. If the maximum number of iterations has not been reached, the loop repeats.

declare  @alpha float, @best  float, @newbest float,@count int

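-- set the learning rate and iteration cap, and compute the starting error (sum of squared errors) with the initial weights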
select  @alpha = .1, @count = 100, @best = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id

while 1=1

begin 

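-- the key weight-adjusting query: add @alpha times the average of (input value * error) to each weight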
select w.attribute,  wght+@alpha* avg(value*err) wght into #weightnew
from  #weight w join #data d on d.attribute = w.attribute
join (select  c.id, (value-fit) err from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
on c.id = f.id  ) e on d.id = e.id
group by w.attribute, wght

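-- recompute the total squared error using the candidate weights in #weightnew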
select @newbest = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weightnew w on d.attribute=w.attribute  group by id) f
on c.id = f.id

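-- stop if the error got worse; otherwise keep the lower error and promote #weightnew to #weight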
if @newbest>@best break

else set @best = @newbest

drop table #weight

select * into #weight from #weightnew
drop table #weightnew

if @count = 0 break


set @count = @count -1 

end
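
If you want to look at the learned weights themselves, a simple query will do; note they are on the normalized scale, so they are not directly comparable to the original units:

select attribute, wght from #weight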

The final step is to query the actual versus fitted values and the error, which includes converting the standardized variables back to the original scale:

select id, mn+(value*se) value, mn+(fit*se) fit, (value-fit)*se error from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x,
(select avg(price) mn from #rawdata) y,
(select stdev(price) se from #rawdata) z

As a test, I replicated the simple regression results above using just the area independent variable, with the following fit & error:

id value fit error
1 400 387.071 -12.929
2 330 287.390 -42.610
3 369 420.076 51.076
4 232 256.872 24.872
5 540 519.591 -20.409

A comparison with the simple regression table suggests the gradient descent approach closely matched the algebraic simple regression, with the fitted values differing by no more than 0.033 for any observation. In fact the square-root of the mean squared error for both versions is 33.546.
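
For anyone who wants to reproduce that number, one way to compute the root mean squared error on the original scale is to wrap the back-transformed fit in an aggregate (this query is my addition, not from the original post):

select sqrt(avg(power(value-fit,2))) rmse from 
(select id, mn+(value*se) value, mn+(fit*se) fit from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x,
(select avg(price) mn from #rawdata) y,
(select stdev(price) se from #rawdata) z) q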

Adding the number of rooms to the model improves the fit:

id y fit error
1 400 380.166 19.834
2 330 333.961 -3.961
3 369 395.465 -26.465
4 232 227.534 4.466
5 540 533.874 6.126

The square-root of the mean squared error is improved (reduced)  to 15.277.

A common measure of goodness of fit is R-squared, which is the variance explained by the model as a percent of the total variance. It is calculated as 1 - (unexplained variance / total variance), where unexplained variance is the sum of the squared differences between the actual and fitted values, and total variance is the sum of the squared differences between the actual values and their average. The SQL takes advantage of the normalized data, which sets the average in the total variance calculation to zero:

select 1- sum(power(value-fit,2))/ sum(power(value,2))  from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x

For the simple regression, the R-Squared is 88.83% using the gradient algorithm; for the multiple regression it improves to 97.68%.







1 comment:

  1. Hmm, what happened to my first comment?
    Also, the query to select the slope and intercept can be shortened to
    select (avg(x*y) - avg(x)*avg(y))/VARIANCE(X) as slope
    , avg(y) - ((avg(x*y) - avg(x)*avg(y))/VARIANCE(X)) * avg(x) as intercept
    from #rawdata ;
