
Tuesday, February 18, 2014

Associated Items Using the Apriori Algorithm

Apriori is an association rule learning algorithm for mining frequent item sets; it is extremely useful for identifying which items sell or occur together and is doubtless employed by the likes of Amazon.  Perhaps the most famous example is the (possibly mythical) discovery of an association between beer and diaper sales at a convenience store.  The algorithm proceeds by identifying all of the items (e.g. products) in a transaction (sales order) data set that meet a minimum frequency criterion, known as the support.  The next step is to join that list back to the single item list by transaction to identify associated item pairs that also meet the support criterion.  This process is repeated, extending the associated item list, until either the maximum list size is reached or the result list is empty.  Wikipedia (and many other sources) provide a more comprehensive explanation for the interested.

The process starts with a table of transactions (these could be sales orders, documents, patients) which are composed of items (goods, words, diagnoses) and returns a table of item sets.  Note the list records only whether an item was in the transaction, not how many; so if transaction one contained 4 loaves of bread and two gallons of milk it shows up as just bread and milk.
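If the raw data sits at the order line level with quantities, it first has to be collapsed to one row per transaction and item.  A minimal sketch, assuming a hypothetical #orderlines table with columns orderid, product and qty:

--#orderlines (orderid, product, qty) is a hypothetical source table used only for illustration
--the distinct collapses repeated products within an order; quantities are ignored
select distinct orderid id, product item from #orderlines

For the toy example here the transactions are simply inserted by hand: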

create table #trans (id int, item varchar(24))

insert into #trans select 1,'bread'
insert into #trans select 1,'milk'
insert into #trans select 1,'formula'
insert into #trans select 1,'diapers'
insert into #trans select 1,'beer'
insert into #trans select 2,'bread'
insert into #trans select 2,'diapers'
insert into #trans select 2,'beer'
insert into #trans select 3,'diapers'
insert into #trans select 3,'beer'
insert into #trans select 4,'diapers'
insert into #trans select 5,'beer'
insert into #trans select 5,'bread'

Next, define and set the support level, which is the minimum number of occurrences an item set must have to be included in the associated items list.  For the purposes of this toy example it is set at two; in a real world example it should be set at a level consistent with the database size.

declare @support int
set @support = 2

The process first generates all single item sets that meet the support criterion, then uses that list to create the two item sets, and continues until the maximum set size is reached.  This example generates sets of up to 5 associated items.  For each set size, step one is to generate all acceptable combinations of items by joining the previous set of associated items to the single item list and saving the acceptable combinations.  An acceptable combination is a set of distinct items that is unique regardless of sequence: for two items A and B, the sets AB and BA are both just AB, while AA and BB are excluded.  The second step is to delete those combinations whose count doesn't meet the support criterion.  From an SQL standpoint it would be possible to create a looping program that dynamically generates the queries, but as the number of items commonly grouped together is modest, writing a query for each set size is easily managed.

--1 item
--list of items by transaction
select  id,item item1  into #assoc1 from #trans group by id, item

-- remove items not meeting support level
delete from #assoc1 from #assoc1 x join
(select item1 items from #assoc1 group by item1 having count(1) < @support) y on item1 = items

--2 items
select a.*, b.item1 item2 into #assoc2 from #assoc1 a, #assoc1 b where a.id = b.id and b.item1>a.item1

delete from #assoc2 from #assoc2 x join
(select item1+item2 items from #assoc2 group by item1+item2 having count(1) < @support) y on item1+item2 = items

--3 items
select a.*, b.item1 item3 into #assoc3  from #assoc2 a, #assoc1 b where a.id = b.id and b.item1>a.item2

delete from #assoc3 from #assoc3 x join
(select item1+item2+item3 items from #assoc3 group by item1+item2+item3 having count(1) < @support) y on item1+item2+item3 = items

--4 items
select a.*, b.item1 item4 into #assoc4 from #assoc3 a, #assoc1 b where a.id = b.id and b.item1>a.item3

delete from #assoc4 from #assoc4 x join
(select item1+item2+item3+item4 items from #assoc4 group by item1+item2+item3+item4 having count(1) < @support) y on item1+item2+item3+item4 = items

--5 items
select a.*, b.item1 item5 into #assoc5 from #assoc4 a, #assoc1 b where a.id = b.id and b.item1>a.item4
delete from #assoc5 from #assoc5 x join
(select item1+item2+item3+item4+item5 items from #assoc5 group by item1+item2+item3+item4+item5 having count(1) < @support) y on item1+item2+item3+item4+item5 = items

Incorporating additional set sizes is a fairly straightforward exercise of copying the prior queries, updating the table names and adding the additional fields.
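Alternatively, the level-n queries could be generated dynamically.  A rough sketch of the idea is below (an illustration only, not part of the workflow above; note that a temp table created inside exec() is dropped when the dynamic batch ends, so the candidate generation, pruning and final select are concatenated into a single batch; to keep the table around you would fold all the levels into one dynamic string or use global ##assoc tables):

declare @n int, @support int, @i int, @cols varchar(max), @sql varchar(max)
set @n = 6          -- the level to generate; assumes #assoc5 and #assoc1 already exist
set @support = 2    -- set here for the sketch; adjust to match the support used above

-- build the concatenated column list item1+item2+...+item6
set @cols = 'item1'
set @i = 2
while @i <= @n
begin
set @cols = @cols + '+item' + cast(@i as varchar)
set @i = @i + 1
end

-- candidate generation, pruning and the final select run as one dynamic batch
set @sql =
'select a.*, b.item1 item' + cast(@n as varchar) +
' into #assoc' + cast(@n as varchar) +
' from #assoc' + cast(@n-1 as varchar) + ' a, #assoc1 b' +
' where a.id = b.id and b.item1 > a.item' + cast(@n-1 as varchar) + '; ' +
'delete from #assoc' + cast(@n as varchar) +
' from #assoc' + cast(@n as varchar) + ' x join' +
' (select ' + @cols + ' items from #assoc' + cast(@n as varchar) +
' group by ' + @cols + ' having count(1) < ' + cast(@support as varchar) + ') y' +
' on ' + @cols + ' = items; ' +
'select * from #assoc' + cast(@n as varchar)

exec (@sql)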

The following generates a summary of the associated item lists:

select 5 size, item1,item2,item3,item4,item5, count(1) freq into #assoc from #assoc5 group by item1,item2,item3, item4,item5
union all
select 4, item1,item2,item3,item4,'', count(1) from #assoc4 group by item1,item2,item3, item4
union all
select 3, item1,item2,item3,'','', count(1) from #assoc3 group by item1,item2,item3
union all
select 2, item1,item2,'','','', count(1) from #assoc2 group by item1, item2
union all
select 1, item1,'','','','', count(1) from #assoc1  group by item1


select * from #assoc

size item1 item2 item3 item4 item5 freq
3 beer bread diapers 2
2 beer bread 3
2 beer diapers 3
2 bread diapers 2
1 beer 4
1 bread 3
1 diapers 4

One of the drawbacks of using Apriori to generate the associated item sets is the number of permutations in large databases.  The above was tested on 76,013 transactions (patients from Kaggle's Heritage competition) composed of 45 items (diagnoses) for a total of 282,718 records (medical claims, year 1) with a support of 100, using a fairly basic home PC; it generated a table of 7,500 sets of 2 or more associated items in a little over 1 minute.

With so many rules, the question becomes which ones are interesting and worth further investigation.  One approach is to compare how often an associated item set actually occurs with how often one would expect it to occur by chance.  If the items are unrelated, the expected probability of them co-occurring is the product of the individual probabilities; if the actual probability is substantially greater than expected, the associated items are more interesting.  The following query approximates this for the two item sets and can easily be extended to larger item sets:

select top 10  item1, item2, freqab, freqab/freqt Assoc, (freqb/freqt) *(freqc/freqt) Expect ,(freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) Ratio from 
(select item1, item2, cast(count(1) as float) freqab  from #assoc2  group by item1, item2) a
join
(select item1 itemb, cast(count(1) as float) freqb from #assoc1 group by item1) b  on item1 = itemb
join 
(select item1 itemc, cast(count(1) as float) freqc from #assoc1 group by item1) c on item2 = itemc,
(select count(distinct id) freqt from #assoc1) d
order by (freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) desc
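
For example, a three item version might look like the following sketch (same tables and naming conventions as above; this extension is not part of the original two item query):

select top 10 item1, item2, item3, freqabc, freqabc/freqt Assoc,
(freqb/freqt)*(freqc/freqt)*(freqd/freqt) Expect,
(freqabc/freqt) / ((freqb/freqt)*(freqc/freqt)*(freqd/freqt)) Ratio from
(select item1, item2, item3, cast(count(1) as float) freqabc from #assoc3 group by item1, item2, item3) a
join
(select item1 itemb, cast(count(1) as float) freqb from #assoc1 group by item1) b on item1 = itemb
join
(select item1 itemc, cast(count(1) as float) freqc from #assoc1 group by item1) c on item2 = itemc
join
(select item1 itemd, cast(count(1) as float) freqd from #assoc1 group by item1) d on item3 = itemd,
(select count(distinct id) freqt from #assoc1) e
order by (freqabc/freqt) / ((freqb/freqt)*(freqc/freqt)*(freqd/freqt)) desc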

Below are the top 10 two item set results from the Heritage data, with diagnoses (items) by patient (transaction) and rather obscure diagnostic titles.  I won't go through them all, but for example acute myocardial infarction (AMI) is approximately 7 times more likely than expected to occur in patients also diagnosed with congestive heart failure (CHF).  Some of the others are also quite obvious.

item1 item2 Freq Assoc Expect Ratio
AMI CHF 473 0.62% 0.09% 7.03
HEART2 PERVALV 124 0.16% 0.03% 6.22
SEIZURE STROKE 195 0.26% 0.04% 6.03
CHF HEART2 429 0.56% 0.09% 5.99
HEMTOL RENAL2 145 0.19% 0.04% 5.05
CHF PNEUM 141 0.19% 0.04% 5.03
AMI ROAMI 1324 1.74% 0.37% 4.77
GYNEC1 PRGNCY 558 0.73% 0.15% 4.75
HEART4 STROKE 223 0.29% 0.06% 4.72
GYNECA ODaBNCA 231 0.30% 0.06% 4.70



Sunday, December 15, 2013

True Multiple Regression Using SQL.

Some weeks ago I posted on multiple regression.  The code in that post solved for the coefficients using gradient descent, an iterative process of reducing error through coefficient adjustment, which seems a second rate compromise compared to the ease of the Linest function in Excel.  In the interim I researched an approach to multiple regression that builds on simple univariate regression (effectively, regressing Y on X is sum(XY)/sum(XX) in SQL), and it forms the basis of this post.  After some trial and error I got the SQL to work.  My primary source for this approach is the somewhat difficult Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman.  A reasonable discussion of the topic can be found on Stats Exchange.
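As a concrete sketch of that univariate building block, assuming a hypothetical table #xy with columns x and y and no intercept, the coefficient of y regressed on x is just:

--#xy (x, y) is a hypothetical table; with no intercept the least squares slope reduces to sum(xy)/sum(xx)
select sum(x*y)/sum(x*x) slope from #xy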

To begin create a table of data:

create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400

insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

The code is easier to write if the table of data is converted into two vectors: the independent variables in #x and the dependent variable in #y.  There are a couple of modifications.  First, rather than use the label names, I use a label index, which simplifies the processing.  Second, in addition to the independent variables the #x table also includes the intercept variable with a value of 1 for each observation.  Finally, there is no error handling and the program cannot manage missing (null) values.

The naming conventions used throughout are pretty simple.  The tables follow the names used in the references, with the exception of #c for the orthogonalization coefficients (instead of a Greek letter).  For field names, *id is the row id in table *, *n is the index of the field and *v is the value.

select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata


Three additional tables are required: #z holds the orthogonalized values, #c holds the orthogonalization coefficients and #b the regression coefficients.

create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

create table #c(cxn int, czn int, cv float) 

create table #b(bn int, bv float) 

The first part of the code loops forward through each of the independent variables (the intercept column is loaded into #z up front), first calculating the orthogonalization coefficients and then the orthogonalized X values, which are stored in #z.
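In symbols (notation added here as shorthand for the queries below), with z0 = x0 = 1 for the intercept, the forward loop computes for each p:

\[
c_{jp} = \frac{\langle z_j, x_p \rangle}{\langle z_j, z_j \rangle} \;\; (j < p),
\qquad
z_p = x_p - \sum_{j<p} c_{jp}\, z_j
\]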

declare @p int
set @p = 1

while @p <= (select max(xn) from #x)

begin
insert into #c
select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv) 
from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv
set @p = @p +1

end

The second part of the code loops backward, calculating the B coefficients by regressing Y on Z, where Y is continually updated to exclude the portion already fit by previously calculated coefficients (the sub-query).
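In symbols, working back from the last variable, each coefficient is the univariate regression of the still unfitted part of Y on the corresponding z:

\[
b_p = \frac{\big\langle\, y - \sum_{q>p} b_q x_q,\; z_p \big\rangle}{\langle z_p, z_p \rangle}
\]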

 while @p>=0 
begin

insert into #b
select zn, sum(yv*zv)/ sum(zv*zv) 
from #z  join 
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
on zid = yid where zn = @p  group by zn

set @p = @p-1

end

The regression coefficients are:

select * from #b


bn bv
3 11.026
2 0.056
1 114.965
0 -96.747

Finally the fitted values are:

select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid


yid yv fit err
1 400 382.975 17.025
2 330 338.144 -8.144
3 369 394.169 -25.169
4 232 223.856 8.144
5 540 531.856 8.144

The r squared is 99.85%

 select 1-sum(power(err,2))/sum(power(yv,2)) from 
  (select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid) d


Given the challenge in writing this I was ultimately surprised how little code was involved.  There is an important detail missing from the model: the standard errors of the coefficients needed to determine statistical significance.  Maybe at some point I'll return to regression and figure it out, but in the meantime perhaps a reader wants to take on the challenge...


Tuesday, September 10, 2013

Incredibly Simple Regression Using Group By Categories in SQL

I covered regression in one of my earlier posts and in the process came up with a simpler algorithm.  I can't claim it as my discovery; I'm sure it is already out there somewhere, as there is a strong similarity between many error minimization problems, but I've never seen a write-up and my favorite search engine didn't reveal any prior publications.  I'm calling it SQL Group By Regression (SGBR); if it has already been published, let me know.  It bears a strong resemblance to ANOVA.

The algorithm models a numeric dependent variable by estimating weights for the independent variables, which are all categorical.  An example would be height as a function of gender: height is numeric while gender is categorical.  In typical regression (including my earlier post) this could be modeled by using a binary (1,0) variable for each category, which is fairly simple.  Now imagine the same kind of model for weight as a function of state of residence; very quickly the model has fifty binary variables.  That is straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.

The model: estimate a score for each attribute value (e.g. gender = female) such that the fitted class value is the sum of the scores for that observation.  The scores, or weights, are estimated using batch gradient descent.
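In symbols (notation added here for clarity), the fit for observation i is the sum of the weights of the attribute values it contains, and each batch update nudges a weight by the average error of the observations containing that value:

\[
\hat{y}_i = \sum_{v \in i} w_v,
\qquad
w_v \leftarrow w_v + \alpha \cdot \operatorname{avg}_{\,i\,:\,v \in i}\big(y_i - \hat{y}_i\big)
\]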

The course of this post:

  1. Load some data; first a small data set, then the Titanic data
  2. Put the data into a vector format to ease the SQL
  3. Initialize the weights and control variables
  4. Create the gradient descent loop
  5. Show the results.
The demonstration data is height in feet as a function of gender:


create table #rawdata (id int, sex varchar(6), height float)

insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92 
insert into #rawdata select 3, 'male', 5.58 
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5 
insert into #rawdata select 7,'female', 5.42 
insert into #rawdata select 8,'female', 5.75 

Create the data vectors with #data as the independent variables and #class as the dependent variable.  As a reminder, this format eases the calculations in an SQL environment by using joins instead of having to look things up.

select id, 'intercept'  attribute, 'intercept' value into #data from #rawdata  
union all
select id, 'sex',sex from #rawdata 

select id, 'height' attribute,height class  into #class from #rawdata

The weights are stored in the #weights table, which is a summary of the #data table to ensure all variables being modeled are included.  There is also an identical table called #newweights.  As the weights are updated they are stored in the #newweights table until it is verified that the error was reduced; if it was, the weights are copied to the #weights table and the looping continues, otherwise the looping stops and the new weights are discarded.  The progress is measured with @count while the improvement in error is tracked through @error and @newerror; @alpha is the portion of the error added to the weight estimates.  The initial weights are set to 0 and @error is the sum of the squared errors between the fit and the class.

select attribute, value, cast(0 as float) wght into #weights from #data
select *  into #newweights from #weights 

declare @error float, @newerror float, @count int, @alpha float

set @error = (select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)


set @count = 1
set @alpha = 0.1



The key SQL is the sub-query that calculates the error by observation and the outer query that uses the average of that error to update the category weights.  The remaining queries calculate the overall error, check the stop conditions and update the tables with the new information.  The gradient loop runs at least 10 times and stops at 100 iterations or when the error increases.

while 1=1
begin
truncate table #newweights

insert into #newweights
select d.attribute, d.value, w.wght+@alpha*avg(error) aerror
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

set @newerror =(select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

if (@newerror> @error and @count >10) or @count>100  break

truncate table #weights
insert into #weights  
select * from #newweights

set @error = @newerror
set @count = @count +1

end

On the toy data set it ran in about 1 second with the following weights and fit:

select * from #weights


attribute value weight
intercept intercept 2.818
sex female 2.599
sex male 3.037

select c.id, c.class,sum(wght)  fit , c.class-sum(wght) error
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class

id class fit error
1 6 5.855 0.145
2 5.92 5.855 0.065
3 5.58 5.855 -0.275
4 5.92 5.855 0.065
5 5 5.418 -0.418
6 5.5 5.418 0.082
7 5.42 5.418 0.002
8 5.75 5.418 0.332

And the r-squared is 99.8% using:

 select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
  (select c.id, c.class,sum(wght)  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Frequently the dependent variable is binary, in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression).  It is an easy change: substitute (1.0/(1.0+exp(-sum(wght)))) for sum(wght) throughout.  I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/) which is on my server in a table called train.

The dependent variable is survived, which is binary.  In addition to the survival field, the training data has both categorical and numeric fields.  I will use three categorical variables: sex, pclass and embarked.  I've shown the modification in the initial error sub-query only; a sketch of the correspondingly modified weight update follows it.

select passengerid id, 'survived' attribute,survived class into #class from train 

select passengerid id, 'intercept'  attribute, 'intercept' value into #data from train  
union all
select passengerid, 'sex',sex from train 
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train

declare @error float, @newerror float, @count int

drop table #weights, #newweights
select attribute, value, cast(0 as float) wght into #weights from #data
select attribute, value, cast(0 as float) wght into #newweights from #data


set @error = (select sum(power(error,2)) from 
(select c.id, c.class -  (1.0/(1.0+exp(-sum(wght)))) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)
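
The weight update inside the loop changes in the same way; a sketch of the modified insert, assuming the same loop structure and @alpha setting as the earlier example:

insert into #newweights
select d.attribute, d.value, w.wght+@alpha*avg(error)   -- @alpha declared and set as in the earlier loop
from #weights w
join #data d
on w.attribute = d.attribute and w.value = d.value
join (select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error
from #class c
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght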

The algorithm ran in about 4 seconds on my low-end Gateway using un-indexed tables.  The weights show one clearly wanted to be a lady in first class from Cherbourg:

attribute value weight
embarked C 0.283
embarked Q 0.013
embarked S -0.283
intercept intercept -0.144
pclass 1 0.730
pclass 2 0.218
pclass 3 -0.664
sex female 1.234
sex male -0.894

As this is essentially a classification model, how successfully did it classify if one treats a fitted value greater than .5 as 1?  78.67%:

select  sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Finally, the r-squared is 61.5%

select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f






Monday, July 29, 2013

Single & Multiple Regression in SQL

Linear regression, a.k.a. ordinary least squares, is an algorithm for estimating the mathematical relationship between a set of one or more independent variables and a dependent variable.  It takes the mathematical form Y = B+m1*X1+m2*X2+... where Y is the dependent variable being predicted, B is the intercept or constant, and m1, m2, etc. are the slopes or weights of the independent, or input, variables X1, X2, etc.  I suggest a visit to Google or Wikipedia for the theory involved.

If there are multiple independent variables the model is referred to as multiple regression, while with a single independent variable it is known as simple regression.  This distinction is more than semantic: simple regression in SQL is extremely easy and could be implemented as a user defined function to estimate the slope and intercept, much like an average.  Multiple regression, on the other hand, requires significant matrix algebra.  While I've come across and implemented simple regression in SQL, I've yet to see multiple regression despite some serious searching.  I've been trying to implement the Gram-Schmidt procedure, but it is still a work in progress (edit: I figured it out! see the post dated 12/15/2013).  In the interim there is an iterative approach using gradient descent which I'll cover here.  First I'll re-hash simple regression and then follow with the gradient descent solution to multiple regression.

Start simple regression by creating a table and populating it with data, in this case the area and price of homes (this example came from a Stackoverflow post and originally from an on-line machine learning course):

create table #rawdata (id int,X float, Y float)

insert into #rawdata select 1, 2201,400

insert into #rawdata select 2, 1600,330
insert into #rawdata select 3, 2400,369
insert into #rawdata select 4, 1416,232
insert into #rawdata select 5, 3000,540

The estimation process is to first calculate the slope (the equation can be parsed from the query below) and then the intercept, by subtracting the average of the independent variable multiplied by the slope from the average of the dependent variable.  To unify the two steps I use a sub-query to estimate the slope and the averages of the independent (x) and dependent (y) variables; the parent query presents the slope and does the math to calculate the intercept:

select slope, ybar - slope*xbar intercept into #model from 
(select (sum(x*y) -(sum(x)*sum(y)/count(1)))/
(sum(x*x) - (sum(x)*sum(x)/count(1))) slope, 
avg(x)  xbar, avg(y) ybar
from #rawdata) a

The results:

slope intercept
0.1658939364 21.9408154411

The query above stores the results in a temporary table called #model; to see the fitted values versus the actuals, run the following query:

select id, y, intercept+slope*x fit, y-(intercept+slope*x) error from #model, #rawdata

id y fit error
1 400 387.073 12.927
2 330 287.371 42.629
3 369 420.086 -51.086
4 232 256.847 -24.847
5 540 519.623 20.377

On to the multiple regression.  In addition to the area of a house, the number of rooms is provided to predict the price and is to be incorporated into the model.  Unfortunately the algebraic solution of simple regression doesn't scale easily to 2 or more independent variables.  An alternative is the gradient descent algorithm, which improves the model's fit by repeatedly updating the weights to reduce the error.  The process is straightforward: using the current weights, estimate the fitted values, calculate the error, propagate a small amount of the error back to the weights, and repeat.  As usual, more details are available all over the web.
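In symbols (notation added for clarity), with fitted values formed from the current weights, the batch update used below is:

\[
\hat{y}_i = \sum_j w_j x_{ij},
\qquad
w_j \leftarrow w_j + \alpha \cdot \frac{1}{n}\sum_{i=1}^{n} x_{ij}\,\big(y_i - \hat{y}_i\big)
\]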

To start, recreate the #rawdata table and populate it with data.  This time instead of X & Y I'll use the variable names.

create table #rawdata (id int,area float, rooms float, price float)

insert into #rawdata select 1, 2201,3,400
insert into #rawdata select 2, 1600,3,330
insert into #rawdata select 3, 2400,3,369
insert into #rawdata select 4, 1416,2,232
insert into #rawdata select 5, 3000,4,540

To facilitate the SQL, the table should be converted into data and class vectors.  A problem with gradient methods occurs when there are variables with widely divergent ranges: the convergence slows down dramatically or becomes unstable.  To overcome this difficulty, normalize the variables by subtracting the mean and dividing by the standard deviation.  This is a simple linear transformation and is easy to convert back when the final predictions are made:

select id,'intercept' attribute,1.0 value into  #data  from #rawdata
union all
select id, 'area',(area-(select avg(area) from #rawdata))/(select stdev(area) from #rawdata) from #rawdata
union all
select id, 'rooms',(rooms-(select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata) from #rawdata

select id,'price' attribute, (price-(select avg(price) from #rawdata))/(select stdev(price) from #rawdata) value into #class from #rawdata
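
In symbols, each variable is standardized and the final predictions are mapped back to the original scale with the inverse transformation (this is what the last query in the post does with mn and se):

\[
x' = \frac{x - \bar{x}}{s_x},
\qquad
\hat{y} = \bar{y} + s_y\,\hat{y}'
\]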

There are two approaches to calculating the fit and error and updating the weights: either observation by observation with repeated passes over the whole data set, or all records at once in batch.  This implementation follows the batch approach.

Start by creating a table of weights for each attribute which can be populated with random numbers or, as in this case, 0:

select attribute,cast(0.0 as float) wght into #weight from #data   group by attribute 

The program requires 4 variables: 
  • alpha, a number greater than 0 but less than 1 that controls the portion of the error which will be incorporated in the weights on each iteration and is generally set to 0.1
  • best and newbest which track the fit and control when the routine stops
  • count: the maximum number of iterations
After setting the parameter values, including @best which is the sum of the squared errors using the initial weights, the program begins looping until one of the stop conditions is met:
  • the error increases after adjusting the weights, or
  •  the maximum iteration count is exceeded
The key weight adjusting query is the select into #weightnew.  The sub-queries first calculate the fit (identifier f), then the error (identifier e), which is multiplied by the independent variables; a small portion (@alpha) of the average is added to the weights and the updated weights are saved to #weightnew.  Using the #weightnew table, the overall error is calculated as @newbest; if it is greater than the prior error, @best, the program stops, otherwise the weights are saved to the #weight table and the #weightnew table is dropped.  If the maximum number of iterations has not been reached, the loop repeats.

declare  @alpha float, @best  float, @newbest float,@count int

select  @alpha = .1, @count = 100, @best = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id

while 1=1

begin 

select w.attribute,  wght+@alpha* avg(value*err) wght into #weightnew
from  #weight w join #data d on d.attribute = w.attribute
join (select  c.id, (value-fit) err from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute group by id) f
on c.id = f.id  ) e on d.id = e.id
group by w.attribute, wght

select @newbest = sum(power(value-fit,2)) from #class c join
(select id, sum(value*wght) fit from #data d join #weightnew w on d.attribute=w.attribute  group by id) f
on c.id = f.id

if @newbest>@best break

else set @best = @newbest

drop table #weight

select * into #weight from #weightnew
drop table #weightnew

if @count = 0 break


set @count = @count -1 

end

The final step is to query the actual versus fitted values and the error, which includes converting the standardized values back to the original scale:

select id, mn+(value*se) value, mn+(fit*se) fit, (fit-value)*se error from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x,
(select avg(price) mn from #rawdata) y,
(select stdev(price) se from #rawdata) z

As a test I replicated the simple regression results above using just the area independent variable  with the following fit & error:

id value fit error
1 400 387.071 -12.929
2 330 287.390 -42.610
3 369 420.076 51.076
4 232 256.872 24.872
5 540 519.591 -20.409

A comparison of the error column with the error column in the simple regression table (note the sign convention here is flipped, fit minus actual) shows the gradient descent approach closely matched the algebraic simple regression, with the magnitudes differing by no more than 0.033 for any observation.  In fact the square-root of the mean squared error for both versions is 33.546.

Adding the number of rooms to the model improves the fit:

id y fit error
1 400 380.166 19.834
2 330 333.961 -3.961
3 369 395.465 -26.465
4 232 227.534 4.466
5 540 533.874 6.126

The square-root of the mean squared error is improved (reduced)  to 15.277.

A common measure of goodness of fit is R-squared, the variance explained by the model as a percent of the total variance.  It is calculated as 1 - (unexplained variance / total variance), where the unexplained variance is the sum of the squared differences between the actual and fitted values and the total variance is the sum of the squared differences between the actual values and their average.  The SQL takes advantage of the normalized data, which sets the average in the total variance calculation to zero.
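In symbols, with the dependent variable standardized so that its average is zero:

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i^2}
\]

The corresponding SQL: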

select 1- sum(power(value-fit,2))/ sum(power(value,2))  from 
(select c.id,value,fit from #class c join
(select id, sum(value*wght) fit from #data d join #weight w on d.attribute=w.attribute  group by id) f
on c.id = f.id) x

For the simple regression, the R-Squared is 88.83% using the gradient algorithm; for the multiple regression it improves to 97.68%.