If there are multiple independent variables the model is referred to as multiple regression, while with a single independent variable it is known as simple regression. This distinction is more than semantic: simple regression in SQL is extremely easy and can be implemented as a user defined function to estimate the slope and intercept, much like an average. Multiple regression, on the other hand, requires significant matrix algebra. While I've come across and implemented simple regression in SQL, I've yet to see a multiple regression implementation despite some serious searching. I've been trying to implement the Gram-Schmidt procedure, but it is still a work in progress (edit: I figured it out! See the post dated 12/15/2013). In the interim there is an iterative approach using gradient descent, which I'll cover here. First I'll re-hash simple regression and then follow with the solution to multiple regression.
Start simple regression by creating a table and populating it with data, in this case the area and price of homes from an on-line machine learning course (the data came via a Stackoverflow post):
create table #rawdata (id int,X float, Y float)
insert into #rawdata select 1, 2201,400
insert into #rawdata select 2, 1600,330
insert into #rawdata select 3, 2400,369
insert into #rawdata select 4, 1416,232
insert into #rawdata select 5, 3000,540
The estimation process is to first calculate the slope (the equation can be parsed from the query) and then the intercept, by subtracting the average of the independent variable multiplied by the slope from the average of the dependent variable. To unify the two steps I use a sub-query to estimate the slope and the averages of the independent (x) and dependent (y) variables; the parent query presents the slope and does the math to calculate the intercept:
select slope, ybar - slope*xbar intercept into #model from
(select (sum(x*y) -(sum(x)*sum(y)/count(1)))/
(sum(x*x) - (sum(x)*sum(x)/count(1))) slope,
avg(x) xbar, avg(y) ybar
from #rawdata) a
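For reference, the sub-query is just computing the usual least-squares estimates:

$$\hat{\beta}_1=\frac{\sum x_i y_i-\tfrac{1}{n}\sum x_i\sum y_i}{\sum x_i^2-\tfrac{1}{n}\left(\sum x_i\right)^2},\qquad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\,\bar{x}$$

which is exactly the algebra written into the SQL above.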
The results:
slope | intercept |
0.1658939364 | 21.9408154411 |
To see the fitted data versus actual, the results are already in the temporary table #model (via the INTO clause above); run the following query:
select id, y, intercept+slope*x fit, y-(intercept+slope*x) error from #model, #rawdata
id | y | fit | error |
1 | 400 | 387.073 | 12.927 |
2 | 330 | 287.371 | 42.629 |
3 | 369 | 420.086 | -51.086 |
4 | 232 | 256.847 | -24.847 |
5 | 540 | 519.623 | 20.377 |
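As a quick check (a small query sketch, not part of the original routine), the root mean squared error of this fit can be computed directly from the two temporary tables; with a single row of coefficients in #model the cross join is harmless:

-- root mean squared error of the simple regression fit
select sqrt(avg(power(y - (intercept + slope*x), 2))) rmse
from #model, #rawdata

This should come out to about 33.5, matching the figure quoted further down for the simple regression.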
On to the multiple regression: in addition to the area of a house, the number of rooms is provided and is to be incorporated into the model to predict the price. Unfortunately the algebraic solution in simple regression doesn't scale easily to 2 or more independent variables. An alternate solution is to use what is known as the gradient descent algorithm, which improves the model's fit by repeatedly updating the weights to reduce the error. The process is straightforward: using the current weights, estimate the fitted model, calculate the error, propagate a small amount of the error back to the weights, and repeat. As usual, more details are available all over the web.
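In equation form, the batch update applied to each weight on every pass is the standard least-squares gradient step, which is what the SQL further down implements:

$$w_j \leftarrow w_j + \alpha\,\frac{1}{n}\sum_{i=1}^{n} x_{ij}\,\big(y_i-\hat{y}_i\big)$$

where $\hat{y}_i$ is the fitted value under the current weights and $\alpha$ is the learning rate.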
To start, recreate the #rawdata table (dropping the previous one first) and populate it with data. This time, instead of X & Y, I'll use the variable names.
create table #rawdata (id int,area float, rooms float, price float)
insert into #rawdata select 1, 2201,3,400
insert into #rawdata select 2, 1600,3,330
insert into #rawdata select 3, 2400,3,369
insert into #rawdata select 4, 1416,2,232
insert into #rawdata select 5, 3000,4,540
To facilitate the SQL, the table should be converted into data and class vectors. A problem with gradient methods occurs when there are variables with widely divergent ranges: convergence slows down dramatically or becomes unstable. To overcome this difficulty, normalize the variables by subtracting the mean and dividing by the standard deviation. This is a simple linear transformation and will be easy to convert back when the final predictions are made:
select id,'intercept' attribute,1.0 value into #data from #rawdata
union all
select id, 'area',(area-(select avg(area) from #rawdata))/(select stdev(area) from #rawdata) from #rawdata
union all
select id, 'rooms',(rooms-(select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata) from #rawdata
select id,'price' attribute, (price-(select avg(price) from #rawdata))/(select stdev(price) from #rawdata) value into #class from #rawdata
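To confirm the standardization worked, a quick sanity check (not part of the original routine) is to verify that every attribute in #data other than the intercept now has a mean of roughly 0 and a standard deviation of 1:

-- area and rooms should show mean ~0 and stdev 1; the intercept column stays at 1
select attribute, avg(value) mean_value, stdev(value) stdev_value
from #data
group by attribute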
Now there are two approaches to calculating the fit and error and updating the weights: either observation by observation with repeated passes across the whole data set, or all records at once in batch. This implementation follows the batch approach.
Start by creating a table of weights for each attribute which can be populated with random numbers or, as in this case, 0:
select attribute,cast(0.0 as float) wght into #weight from #data group by attribute
The program requires 4 variables:
- alpha, a number greater than 0 but less than 1 that controls the portion of the error which will be incorporated in the weights on each iteration and is generally set to 0.1
- best and newbest which track the fit and control when the routine stops
- count: the maximum number of iterations
After setting the parameter values, including @best which is the sum of the squared errors using the starting weights, the program begins looping until a stop condition is met:
- the error increases after adjusting the weights, or
- the maximum iteration count is exceeded
The key weight-adjusting query is the first statement inside the loop below. The sub-queries first calculate the fit (identifier f), then the error (identifier e); the error is multiplied by the independent variables, and a small portion (@alpha) of that product is added to the weights. The updated weights are saved to #weightnew. Using the #weightnew table, the overall error is calculated as @newbest; if it is greater than the prior error, @best, the program stops, otherwise the weights are saved to the #weight table and the #weightnew table is dropped. If the maximum number of iterations has not been reached, the loop repeats.
declare @alpha float, @best float, @newbest float, @count int

-- set the learning rate and iteration cap, and compute the starting error:
-- @best is the sum of squared errors using the current (all-zero) weights
select @alpha = .1, @count = 100, @best = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
on c.id = f.id

while 1=1
begin
  -- compute the fit (f) and error (e) per observation, then nudge each weight
  -- by @alpha times the average of value*error for that attribute
  select w.attribute, wght+@alpha* avg(value*err) wght into #weightnew
  from #weight w join #data d on d.attribute = w.attribute
  join (select c.id, (value-fit) err from #class c join
  (select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
  on c.id = f.id ) e on d.id = e.id
  group by w.attribute, wght

  -- recompute the sum of squared errors using the candidate weights
  select @newbest = sum(power(value-fit,2)) from #class c join
  (select id, sum(value*wght) fit from #data d join #weightnew w on d.attribute=w.attribute group by id) f
  on c.id = f.id

  -- stop if the error got worse; otherwise accept the new weights and repeat
  if @newbest>@best break
  else set @best = @newbest
  drop table #weight
  select * into #weight from #weightnew
  drop table #weightnew
  if @count = 0 break
  set @count = @count -1
end
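Once the loop exits, the learned coefficients (still on the standardized scale) can be inspected with a quick convenience query (not part of the routine above):

-- final weights on the standardized scale, one row per attribute
select attribute, wght from #weight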
The final step is to query the actual values versus the fitted values (the error is simply the difference between the two), converting the standardized variables back to the original scale:
select id, mn+(value*se) value, mn+(fit*se) fit from
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
on c.id = f.id) x,
(select avg(price) mn from #rawdata) y,
(select stdev(price) se from #rawdata) z
As a test I replicated the simple regression results above using just the area independent variable with the following fit & error:
id | value | fit | error |
1 | 400 | 387.071 | -12.929 |
2 | 330 | 287.390 | -42.610 |
3 | 369 | 420.076 | 51.076 |
4 | 232 | 256.872 | 24.872 |
5 | 540 | 519.591 | -20.409 |
A comparison of the error column with the error column in the simple regression table suggests the gradient descent approach closely matched the algebraic simple regression, with the difference being no more than 0.033 for any observation. In fact the square root of the mean squared error for both versions is 33.546.
Adding the number of rooms to the model improves the fit:
id | y | fit | error |
1 | 400 | 380.166 | 19.834 |
2 | 330 | 333.961 | -3.961 |
3 | 369 | 395.465 | -26.465 |
4 | 232 | 227.534 | 4.466 |
5 | 540 | 533.874 | 6.126 |
The square-root of the mean squared error is improved (reduced) to 15.277.
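To score a new observation with the fitted model, the inputs have to be standardized using the training means and standard deviations and the standardized prediction converted back to dollars, mirroring the query above. Here is a sketch along those lines, using a hypothetical new house with an area of 2000 and 3 rooms (the variable names are mine, not part of the original routine):

-- hypothetical new house: area 2000, 3 rooms
declare @newarea float = 2000, @newrooms float = 3

-- standardize the inputs, apply the weights, then convert the result back to price units
select mn + se * ( w_i.wght
                 + w_a.wght * (@newarea  - area_mn)  / area_sd
                 + w_r.wght * (@newrooms - rooms_mn) / rooms_sd ) predicted_price
from (select avg(price) mn,       stdev(price) se,
             avg(area)  area_mn,  stdev(area)  area_sd,
             avg(rooms) rooms_mn, stdev(rooms) rooms_sd
      from #rawdata) s,
     (select wght from #weight where attribute = 'intercept') w_i,
     (select wght from #weight where attribute = 'area') w_a,
     (select wght from #weight where attribute = 'rooms') w_r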
A common measure of goodness of fit is R-squared, which is the variance explained by the model as a percent of the total variance. It is calculated as 1 - (unexplained variance / total variance), where the unexplained variance is the sum of the squared differences between the actual and fitted values and the total variance is the sum of the squared differences between the actual values and their average. The SQL takes advantage of the normalized data, which sets the average in the total variance calculation to zero:
select 1- sum(power(value-fit,2))/ sum(power(value,2)) from
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
on c.id = f.id) x
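In formula terms, with the class variable already standardized to mean zero, the query computes

$$R^2 = 1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} = 1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i y_i^2}$$

since $\bar{y}=0$ on the standardized scale.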
For the simple regression, the R-Squared is 88.83% using the gradient algorithm; for the multiple regression it improves to 97.68%.
From the comments: the query to select the slope and intercept can be shortened to
select (avg(x*y) - avg(x)*avg(y))/varp(x) as slope
, avg(y) - ((avg(x*y) - avg(x)*avg(y))/varp(x)) * avg(x) as intercept
from #rawdata ;
(the population variance, varp, is used here so the ratio matches the algebraic slope above).