Sunday, December 15, 2013

True Multiple Regression Using SQL.

Some weeks ago I posted on multiple regression.  The code in that post solved for the coefficients using gradient descent, an iterative process of reducing error through coefficient adjustment that seems a second-rate compromise compared to the ease of the Linest function in Excel.  In the interim I researched multiple regression that builds on simple univariate regression (effectively, regressing Y on X in SQL is sum(X*Y)/sum(X*X)), which forms the basis of this post.  After some trial and error I got the SQL to work.  My primary source for this approach is the somewhat difficult Elements of Statistical Learning (2008) by Hastie, Tibshirani and Friedman.  A reasonable discussion of the topic is on the statistics Stack Exchange.
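As a refresher, that univariate building block is a one-line aggregate.  A minimal sketch, assuming a hypothetical two-column table #xy and a regression through the origin (no intercept):

-- slope of y regressed on x through the origin: b = sum(x*y)/sum(x*x)
select sum(x*y)/sum(x*x) b from #xy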

To begin create a table of data:

create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400

insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

To facilitate the code, convert the table of data into two data vectors: the independent variables (#x) and the dependent variable (#y).  There are a couple of modifications.  First, rather than use the label names, I use a label index which simplifies the processing.  Second, in addition to the independent variables the #x table also includes the intercept variable with a value of 1 for each observation.  Finally, there is no error handling and the program cannot manage missing (null) values.

The naming conventions used throughout are pretty simple.  The tables follow the names used in the articles, with the exception of #c for the orthogonalization coefficients (instead of a Greek letter).  For field names, *id is the row id in table *, *n is the index of the field and *v is the value.

select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata


Three additional tables are required: the #z table holds the orthogonalized values, #c holds the orthogonalization coefficients and #b the regression coefficients.

create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

create table #c(cxn int, czn int, cv float) 

create table #b(bn int, bv float) 

The first part of the code loops forward through each of the independent variables starting with the intercept, calculating first the orthogonalization coefficients and then the orthogonalized X values, which are stored in #z.

declare @p int
set @p = 1

while @p <= (select max(xn) from #x)

begin
insert into #c
select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv) 
from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv
set @p = @p +1

end

The second part of the code loops backward, calculating the B coefficients by regressing Y on Z, where Y is continually updated to exclude the previously fit Y values (the sub-query).  Note that @p enters this loop one above the largest variable index, so the first pass finds no matching #z rows and inserts nothing; the useful work starts on the second pass.

 while @p>=0 
begin

insert into #b
select zn, sum(yv*zv)/ sum(zv*zv) 
from #z  join 
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
on zid = yid where zn = @p  group by zn

set @p = @p-1

end

The regression coefficients are:

select * from #b


bn bv
3 11.026
2 0.056
1 114.965
0 -96.747

Finally the fitted values are:

select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid


yid yv fit err
1 400 382.975 17.025
2 330 338.144 -8.144
3 369 394.169 -25.169
4 232 223.856 8.144
5 540 531.856 8.144

The r-squared (computed here without centering Y) is 99.85%:

 select 1-sum(power(err,2))/sum(power(yv,2)) from 
  (select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid) d


Given the challenge in writing this I was ultimately surprised how little code was involved.  There is an important detail missing from the model: the standard errors associated with the coefficients for determining statistical significance.  Maybe at some point I'll return to regression and figure it out, but in the meantime perhaps a reader wants to take on the challenge...


Thursday, December 5, 2013

Viva Monte Carlo: Planning Model for Intermittent Demand

I recently posted a method for demand planning based on joint probability distribution functions derived using averages and variances.  While robust for items with higher sales frequency, my experience is that the method isn't very stable on low frequency items due to churn and susceptibility to extreme observations (not to mention that the normal distribution may not be appropriate).  In this post I'll highlight a distribution-free method based on Monte Carlo methods.

Monte Carlo methods are well documented elsewhere, so I won't go into much detail.  Suffice it to say that by repeatedly sampling from the receipt and sales distributions one can estimate the underlying joint lead-time demand distribution. The algorithm is pretty simple:
  1. For each item
  2. Randomly pick a lead-time in days using the receipts for that item
  3. For each day in #2 pick a day from the sampling period and add the number of orders for that day (often zero) to a running total
  4. For each order in #3 randomly pick an order from the order file and add the quantity to a running total
  5. Repeat steps 2-4 several hundred times saving the quantity total for each iteration
  6. Sort the iterations in ascending order and set the order point equal to the iteration that corresponds to the target service level
In terms of the distributions to pick from, there are a number of approaches.  The first is to use the raw data by taking the top record from a randomly sorted table, a method that can become time consuming with lots of SKUs.  A second approach is to summarize the raw transactions into a cumulative distribution table and add a cumulative % of total column; the random value is chosen by selecting the record where the cumulative % of total corresponds to a random number between 0 and 1.  The third approach is to assume underlying distributions (i.e. orders per day is Poisson, order quantity is log-normal, etc.) and choose a value from that distribution using a random number.  This post will use a combination of the first two methodologies.
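As a minimal sketch of the first two approaches (the table and column names mirror the ones created below, and dbo.fx_rand() is the random-number function from the appendix):

-- approach 1: random sort of the raw transactions (simple, but slow over many SKUs)
select top 1 quantity from #salesorders where item = 'abc' order by newid()

-- approach 2: cumulative distribution lookup (the method used for the order-count calendar below)
select min(ordercount) from #calendar where item = 'abc' and distribution >= dbo.fx_rand()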

The data is the same from the prior post (no cost data will be used, but I've left it in for consistency).

create table #item (item varchar(3), daysinstock int, servicetarget  float,stdcost float,  carrycost float, ordercost float)
insert into #item select 'abc', 120, .95,1.5,  .12, 15

create table #receipts  (purchaseorder int, item varchar(3), orderdate int, receiptdate int)
insert into #receipts select  456,'abc', 20130109, 20130226
insert into #receipts select  789,'abc', 20130325, 20130413
insert into #receipts select  567,'abc', 20131007, 20131130

create table #salesorders (salesorder int, salesdate  int, item varchar(3),   quantity int)
insert into #salesorders select 123, 20130208, 'abc',  1
insert into #salesorders select 234, 20131014, 'abc',  10
insert into #salesorders select 345, 20130413, 'abc',  35
insert into #salesorders select 456, 20130409, 'abc',  10
insert into #salesorders select 323, 20131105, 'abc',  5
insert into #salesorders select 432, 20130515, 'abc',  5

Step #3 of the algorithm picks a day at random, which implies a calendar that has lots of days with no sales.  Rather than build a calendar tailored to each item and select a day at random as in the first approach, construct a summary distribution table.

create table #calendar(item varchar(3), ordercount int, daycount int, distribution float)

The following inserts the number of days for each order count by item.  The sub-query (sq) counts the number of orders by day, which is then summarized as the number of days for each order count.

insert into #calendar (item, ordercount, daycount)
select item, ordercount, count(1) from 
(select i.item, so.salesdate, count(1) ordercount  from #salesorders so 
join #item i on i.item = so.item group by i.item, so.salesdate) sq
group by ordercount, item

The days with no orders are calculated as the days in stock from the #item table less a sub-query totaling the number of days with orders from #calendar.

insert into #calendar (item, ordercount, daycount) select item, 0, daysinstock - (select sum(daycount) from #calendar c where c.item = i.item) from #item i

The last calendar step is to update the cumulative distribution, which uses two sub-queries. The first accumulates the day count over all order counts at or below the current order count, which is then standardized by dividing by the total number of days in the second sub-query.  To save time, the second query could be replaced by incorporating the days in stock value into the #calendar file.

update c set distribution  =  (select cast(sum(daycount) as float) from #calendar c1 where c1.item = c.item and c1.ordercount <= c.ordercount ) /(select cast(sum(daycount) as float) from #calendar c2 where c2.item = c.item) from #calendar c 
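For the sample item above (six orders on six different days over 120 days in stock), #calendar should end up looking something like this:

item ordercount daycount distribution
abc 0 114 0.95
abc 1 6 1.00

A random draw of 0.97 therefore returns an order count of 1, while any draw at or below 0.95 returns 0.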

The next step is the looping program; the final results are stored in a table called #orderpoint  and consist of the item, lead-time demand quantity and order point quantity.  A scratchpad table called #quantity is also required to hold the output for each iteration.

create table #orderpoint(item varchar(3), leadtimequantity int, orderpoint int)
create table #quantity( quantity int)

A number of variables are also required for tracking; the first five are self explanatory, @it1 tracks the overall iterations (set at 1,000 here) and @it2 is a general purpose counter.

declare @item varchar(3),@servicetarget int, @leadtime int, @ordercount int, @quantity int, @it1 int,@it2 int

The program runs until every item in the #item table is also in the #orderpoint table.  The looping statements are annotated with comments describing their function.  There are two functions used for the random selection.  The first is newid(), which is built into SQL Server and is really a pseudo-random sequence of characters; applying checksum() and then taking the right three digits divided by 999 converts it into a random number between 0 and 1.  The second is dbo.fx_rand(), a wrapper around rand() described in the appendix.

while 1=1 --loop for each item
begin 
set @item = (select top 1 item from #item where item not in (select item from #orderpoint))
If @item is null break
set @servicetarget  = (select cast(servicetarget*100 as int) from #item where item = @item)
set @it1 = 1000
truncate table #quantity
while @it1>0 --loop for each iteration picking a lead time
begin
set @leadtime = (select top 1 datediff(d, convert(char, orderdate,112), convert(char, receiptdate,112)) from #receipts where item = @item order by newid())
set @it2 = 1
set @ordercount = 0 
while @it2 <= @leadtime --loop for each day in lead time picking an order count
begin
set @ordercount=@ordercount+ (select min(ordercount) from #calendar  where item = @item and distribution >=dbo.fx_rand())
set @it2 = @it2+1
end
set @it2 = 1
set @quantity = 0
while @it2 <= @ordercount --loop for each order in order count picking a quantity
begin
set @quantity=@quantity+(select top 1 quantity from #salesorders where item = @item order by newid())
set @it2 = @it2+1
end
insert into #quantity select @quantity
set @it1 = @it1-1
end
insert into #orderpoint select @item,leadtimequantity, max(sq2.quantity) from 
(select avg(quantity) leadtimequantity from #quantity) sq1,
(select top (@servicetarget) percent   quantity from #quantity  order by quantity) sq2
group by leadtimequantity
end

The final insert loads the lead-time demand and the order point into #orderpoint, the working file that would then be loaded to an MRP system.

An interesting property of this approach is that it can be used to graphically illustrate the demand profile of an item.  Because each iteration has an equal chance of occurring, a graph of the #quantity table for a given item shows all of the equally likely outcomes. Consider the graphs below.  The first shows the values for 1,000 iterations (took about 2 seconds on my Gateway in SQL Server Express) as calculated, with an average of about 22 units.  The second is sorted based on quantity, with the black arrows indicating the median (15 units) and 95th percentile (71 units); the red arrow to the left is the number of iterations with no volume.  The third shows the lead-time volume by frequency: the red arrow is nearly 20% of the distribution, so there is nearly a one in five chance of selling nothing over a lead time.  The joys of intermittent demand.
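To reproduce those summary figures without a graphing tool, a minimal sketch (assuming #quantity still holds the iterations for the item just processed) pulls the median, the 95th percentile and the share of iterations with zero volume using the same top-percent trick as the order point calculation:

select
(select max(quantity) from (select top 50 percent quantity from #quantity order by quantity) a) median,
(select max(quantity) from (select top 95 percent quantity from #quantity order by quantity) b) pct95,
(select cast(sum(case when quantity = 0 then 1 else 0 end) as float)/count(1) from #quantity) zeroshare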



Appendix:  Dynamic Random Numbers

The rand() command in T-SQL is not dynamic; using it in a query will return the same value for each record returned.  I know of two work-arounds, and I'm not sure which is the more random.

First, newid() is a built-in SQL Server function that returns a pseudo-random sequence of characters that can be used for sorting as is:

select * from #table order by newid()

It can also be used to generate a random number between 0 and 1 :

select cast(right(checksum(newid()),3) as float)/999

The second alternative is more involved but is built on the rand() function.  First create a view that references rand() and then a function that returns the result of the view:

create view vw_rand
as
select rand() random

create function dbo.fx_rand ()
returns float
as
begin
return (select random from vw_rand)
end










Tuesday, December 3, 2013

Forecasting for Inventory Planning

It has been some time since I last posted; to restart I'd like to get into an area that's related to data mining: forecasting, or better yet demand planning. In his book Predictive Analytics, Eric Siegel asserts that "[Predictive analytics] is a completely different animal from forecasting."  I don't agree; they are both concerned with using real data to predict the future. Demand forecasting may be the most widely accepted application of empirical data to decision making in business. Well-designed forecasting systems use actual orders and sales to predict how much manufacturers should make and purchasing departments should buy to support future demand.

I will use a very simple model that lends itself very nicely to the SQL environment and was written up by CERN researchers (hard to beat nuclear scientists: http://ais.web.cern.ch/AIS/apps/lims/theory.html).  Rather than group data into monthly buckets and extrapolate using models like exponential smoothing, this approach offers a "mathematical solution based on the real statistical distributions."

Inventory management algorithms use some variation of the order point model: when inventory for a given stock keeping unit (SKU) falls below an order point the system recommends purchase/manufacture of more (with the quantity usually based on the economic order quantity, a topic for another day).   The order point is calculated as the quantity demanded over the lead time plus safety stock. Lead time is the time between noticing more is required and having more available in inventory for immediate sale; it includes time for inventory review + purchase order placing + vendor fulfillment + in transit + receiving + put-away and is usually measured in days or weeks.  Lead time quantity is the quantity expected to be sold over the lead time.  Safety stock is the inventory held for unpredictable future demand fluctuation.  Typically it is set to achieve some level of fill associated with the business' value proposition, say 95% of customer orders should be filled from stock.  Assuming demand is normally distributed, safety stock is the number of standard deviations associated with the service level times the standard deviation of lead-time demand.

Modeling requires six elements:
  1. orders per day (opd) = count of orders for some period/days in period
  2. average order quantity (aoq) = sum of quantity per order/count of orders
  3. order quantity variance (oqv) = sum of  squared difference between  the order quantity and average order quantity / count of orders minus 1
  4. average lead time in days (altd)
  5. lead time in days variance (ltdv)
  6. number of standard deviations associated with service level (z)
The 3 basic formulas are (* means times and ^ means raise to the power of):
  1. lead time quantity (ltq) =  opd*altd*aoq
  2. lead time variance (ltv) =  opd*altd*(oqv + aoq^2) +opd^2 *aoq^2*ltdv
  3. order point (op) = ltq+z*ltv^.5
As an aside, many vendors specify their lead time, which can be interpreted as altd with zero variance, simplifying #2 above to ltv = opd*altd*(oqv + aoq^2).
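To make the formulas concrete, here is a quick worked example with made-up round numbers (not the sample data below): suppose opd = 0.05 orders per day, aoq = 10, oqv = 150, altd = 40 days, ltdv = 350 and z = 1.645 (a 95% service level).  Then ltq = 0.05*40*10 = 20, ltv = 0.05*40*(150 + 10^2) + 0.05^2*10^2*350 = 500 + 87.5 = 587.5, and op = 20 + 1.645*587.5^.5, or about 60 units.  With a vendor-specified lead time (ltdv = 0) the second term drops out, ltv falls to 500 and the order point falls to about 57.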

The program employs three tables which are stripped to the essential information for the matters at hand.  The first is #item, which holds the item list and relevant SKU-specific planning parameters such as the days in stock (useful for new items or adjusting the forecast for stock outs), the targeted service level (which may vary based on an item's importance), and item cost data which comes in handy when computing carry costs, average inventory costs and economic order quantities. The second table is #receipts, used to measure the lead time: a purchase order has an order date followed by one or more receipt dates which are useful in estimating lead time.  The final table is #salesorders, which is self explanatory.  The one observation I would make is that the sales order date should be the date the customer requested, not the actual ship date.


create table #item (item varchar(3), daysinstock int, servicetarget  float,stdcost float,  carrycost float, ordercost float)
insert into #item select 'abc', 120, .95, 1.5,  .12, 15

create table #receipts  (purchaseorder int, item varchar(3), orderdate int, receiptdate int)
insert into #receipts select  321,'abc', 20130109, 20130226
insert into #receipts select  432,'abc', 20130325, 20130413

insert into #receipts select  543,'abc', 20131007, 20131130

create table #salesorders (salesorder int, salesreqdate  int, item varchar(3),   quantity int)

insert into #salesorders select 123, 20130208, 'abc',  1
insert into #salesorders select 234, 20131014, 'abc',  10
insert into #salesorders select 345, 20130413, 'abc',  35
insert into #salesorders select 456, 20130409, 'abc',  10
insert into #salesorders select 567, 20131105, 'abc',  5
insert into #salesorders select 678, 20130515, 'abc',  5

Using the built-in SQL functions for average, variance and count, summarize the data using sub-queries for the receipts data and the sales order data and place it in a summary table with holding places for lead-time quantity (ltq), lead-time variance (ltqv), order point (op) and economic order quantity (eoq).  While all of these could be calculated in the first pass, I broke it into updates to make following easier.

select i.item, i.daysinstock,servicetarget, stdcost,carrycost, ordercost, altd,ltdv ,ofreq/daysinstock opd, aoq,oqv, 
cast(0 as float) ltq,
cast(0 as float) ltqv, 
cast(0 as int) op,
cast(0 as int) eoq
into #summary from #item i
join
(select item, cast(count(1) as float)  rfreq, avg(cast(datediff(d,convert(char,orderdate,112), convert(char,receiptdate,112)) as float))  altd ,
var(cast(datediff(d,convert(char,orderdate,112), convert(char,receiptdate,112)) as float)) ltdv
from #receipts  group by item) lt
on i.item = lt.item
join 
(select  item, cast(count(1) as float) ofreq, avg(cast(quantity as float)) aoq, var(cast(quantity as float)) oqv from #salesorders  group by item)  so
on i.item = so.item

Use update statements to  add the lead time quantity and lead time variance:

update #summary
set ltq = opd*altd*aoq,
ltqv =  opd*altd*(oqv + power(aoq,2)) +power(opd,2)*power(aoq,2)*ltdv

The final step uses an update to create the order point and the economic order quantity (eoq).   See  http://en.wikipedia.org/wiki/Economic_order_quantity for a very thorough definition.  Also, note the use of dbo.fx_normsinv, which is a function to convert a targeted service level (95%) to a z score.  I've included the code in an appendix.

update #summary  
set op = cast(ltq+ power(ltqv, 0.5)*dbo.fx_normsinv(servicetarget) as int),
eoq = cast(power((opd*365*aoq *ordercost)/(carrycost*stdcost),.5) as int)

In addition to planning, the data can also provide insight into performance and KPIs.  For example, average inventory can be approximated as one half the order quantity plus safety stock, annual cost of goods sold (cogs) as daily demand times 365 times standard cost, and projected turns as cogs/avginv:

select 
sum((eoq/2+ (op-ltq))*stdcost) avginv,
sum(opd*365*aoq*stdcost) cogs,
sum(opd*365*aoq*stdcost)/sum((eoq/2+ (op-ltq))*stdcost) turns
from #summary

That's it.  I employed a slightly extended version of this to run inventory planning on nearly 200,000 SKUs with an inventory of over $100 million achieving a 95+% service level.

Appendix:  

Inverse of normal distribution per Abramowitz and Stegun formula 26.2.23

create function fx_normsinv(@p float) 
returns float
as
begin
declare @c0 float, @c1 float, @c2 float, @d1 float, @d2 float , @d3 float, @t float
if @p>=.5 set @p = 1.0-@p
select  @t = power(-2.0*log(@p),.5), @c0 = 2.515517, @c1 =0.802853,@c2 =0.010328,@d1 = 1.432788, @d2 =0.189269, @d3 =0.001308  
return (select  @t-((@c0+@c1*@t+@c2*@t*@t)/(1.0+@d1*@t+@d2*@t*@t+@d3*@t*@t*@t)))
end
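As a quick sanity check on the approximation (the exact z score for a 95% service level is about 1.6449, and Abramowitz and Stegun quote an absolute error under 0.00045 for this formula):

select dbo.fx_normsinv(0.95)  -- returns roughly 1.645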

Thursday, September 26, 2013

Non-Negative Matrix Factorization

Matrix factorization is a technique for extracting attributes from a data matrix into matrices corresponding to the rows and columns of the original matrix, such that when the extracted matrices are multiplied the original matrix is reconstructed.  It sounds like a neat geek trick, but it actually has practical applications, with perhaps the most renowned example being the Netflix prize (http://en.wikipedia.org/wiki/Netflix_Prize).  Netflix captures film ratings by users, which it mines to recommend movies to users. The competition made part of this database publicly available with the goal of improving the accuracy of the recommendations.  One of the key algorithms used by the winners was matrix factorization, which was first suggested by Simon Funk (http://sifter.org/~simon/journal/20061211.html).

To understand the approach at a high level, imagine the ratings database as a giant table with each row being a different user and each column a different film. At the intersection of each row and column is the rating (1-5 for Netflix) given by that user to that film. As most people have not seen most films, there are a lot of empty cells.  Using matrix factorization, the film effects can be separated from the user effects and the empty cells can be estimated by multiplying a user's effects by the unseen film's effects. The list of recommendations would be the films with the highest estimated ratings for that user.
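In the rcv convention used throughout this post, that prediction step is just a join and a sum.  A hedged sketch using the w and h tables built later in the post (in the Netflix analogy w would hold the user features and h the film features):

-- estimated rating for every (user, film) pair, seen or not
select w.r userid, h.c filmid, sum(w.v*h.v) estimate
from w join h on w.c = h.r
group by w.r, h.c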

While replicating the Funk methodology isn't difficult, this posting uses a different approach: non-negative matrix factorization which ensures the extracted values are all positive.  The SQL is based on Toby Segaran's python code in his excellent Programming Collective Intelligence.

The first step will be to create three stored procedures for matrix algebra that will make the ultimate code more compact, next will be to create a toy matrix example, third the optimization code and, finally, the results on the toy matrix.

Matrix multiplication  (http://en.wikipedia.org/wiki/Matrix_multiplication) multiplies two matrices where the number of columns in matrix 1 (im1) equals the number of rows in matrix 2 (im2).  The output matrix has the rows from im1 and the columns from im2.  One of the side benefits of SQL is that matrix multiplication is relatively easy.  To keep the code simple, there is no error handling, so if the number of columns in im1 doesn't match the number of rows in im2, SQL will work its magic but the answer will be wrong. All matrices are tables in rcv format, where column 1 (r) is the row number, column 2 (c) is the column number and column 3 (v) is the value.

create procedure matmult ( @im1 varchar(24), @im2 varchar(24), @om varchar(24), @dot binary)
as
begin 
declare @sql varchar(max)
if @dot = 1   exec ('drop table '+@om)
set @sql = 'select A.r, B.c, sum(A.v*B.v ) v into zzzzz from xxxxx A join yyyyy B on A.c=B.r group by A.r,B.c'
exec(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om))
end

Matrix transposition is handled by creating an output table that substitutes the rows for the columns and columns for rows on an input matrix:

create procedure mattran ( @im1 varchar(24), @om varchar(24), @dot binary)
as
begin
declare @sql varchar(max)
if @dot = 1 exec ('drop table '+@om)
set @sql = 'select  c r,r c, v into  zzzzz from xxxxx'
exec( replace(replace(@sql, 'xxxxx',@im1),'zzzzz',@om))
end

Matrix cell to cell functions such as adding two matrices are managed with matarr and require input matrices, an output matrix, and a specified function (+,-,*,/):

create procedure matarr ( @im1 varchar(24), @im2 varchar(24),@func varchar(1), @om varchar(24), @dot binary)
as
begin
if @dot = 1 exec('drop table '+@om)
declare @sql varchar(max)
set @sql = 'select im1.r, im1.c, (im1.v qqqqq im2.v) v into zzzzz from xxxxx im1 join yyyyy im2 on im1.c=im2.c and im1.r=im2.r '
exec( replace(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om),'qqqqq',@func))
end

Next build a toy matrix using the rcv format.  To make it interesting, start with 2x2 and a 2x3 matrix then multiply to create a 2x3  matrix:

create table x (r int,c int, v float)
create table y (r int,c int, v float)

insert into x select 1,1,1

insert into x select 2,1,2
insert into x select 1,2,3
insert into x select 2,2,4

insert into y select 1,1,2

insert into y select 1,2,4
insert into y select 1,3,6
insert into y select 2,1,8
insert into y select 2,2,10
insert into y select 2,3,12

exec matmult 'x','y','v',1

select * from v order by r,c


r c v
1 1 26
1 2 34
1 3 42
2 1 36
2 2 48
2 3 60

The extracted matrices have either the number of rows or the number of columns of the original matrix and the same number of hidden features.  Exactly how many hidden features are required to rebuild the original matrix isn't preordained and there is an element of trial and error in this choice, but experience suggests the more complicated the matrix, the more hidden features are required.  This one is pretty simple, so pick 2. Create a vector table and load the numbers; this is useful in initializing the feature values, aka weights:

 create table n(i int)

insert into n select 1
insert into n select 2

Using the v and n tables, create the weights tables w (for row categories) & h (for column categories), calculate the initial fit table (third query) and sum the initial squared error.  The weights are initialized using the fx_rand() function presented in the earlier k-means posting. 

select r,i c, dbo.fx_rand()  v into w from v, n  group by r, i
select i r, c, dbo.fx_rand()  v into h from v, n  group by i, c

exec matmult 'w','h','wh',1

select sum(power(v.v-wh.v,2)) from v join wh  on v.r = wh.r and v.c = wh.c

Using the toy matrix above, the initial error is about 10500 (results will differ somewhat due to the random number weights).  The fitting process loops through the updating steps until achieving the stopping conditions of either a specified sum-of-squared-errors threshold or a maximum number of iterations (the if/break test near the end of the loop):

declare @count int, @error float
set @count = 1

while 1=1
begin
exec mattran 'w','wt',1
exec matmult 'wt','v','hn',1

exec matmult 'wt','w', 'wtw',1
exec matmult 'wtw','h','hd',1

exec matarr 'h','hn','*','hhn',1
exec matarr 'hhn','hd','/','h',1

exec mattran 'h','ht',1
exec matmult 'v','ht','wn',1

exec matmult 'w','h','wxh',1
exec matmult 'wxh','ht','wd',1

exec matarr 'w','wn','*','wwn',1
exec matarr 'wwn','wd','/','w',1

exec matmult 'w','h','wh',1

set @error= (select sum(power(v.v-wh.v,2)) from v join wh  on v.r = wh.r and v.c = wh.c)

if @error<.01  or @count>=25  break
set @count = @count+1
end

For the above example the algorithm stopped after 23 iterations with a sum of squared errors of 0.00795711.  The actual vs fit  could be improved by adding hidden features to the n table  or by decreasing the error target and increasing the iterations.  For the above parameters the results are:

select v.r, v.c, v.v actual, wh.v fit, v.v-wh.v error from v join wh on v.r = wh.r and v.c = wh.c

r c actual fit error
1 1 26 25.988 0.012
2 1 36 36.009 -0.009
1 2 34 34.059 -0.059
2 2 48 47.958 0.042
1 3 42 41.960 0.040
2 3 60 60.028 -0.028

The weights, nicely formatted using a spreadsheet (h is shown pivoted, with the three columns of v as rows and the two hidden features as columns):

h 1 2
1 29.286 23.738
2 37.388 31.600
3 44.840 39.533



w 1 2
1 0.312 0.708
2 0.001 1.516

I've been experimenting with using this technique to predict NFL football scores.  Using the giant table analogy, the rows are the offensive team, the columns the defensive team and the intersecting cell is the offensive team's score. Assuming a suitable solution exists, each team's score could be forecast from the model, and the spread as well. Perhaps at some point in the future I'll share my results (or run off to Vegas to cash in).
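A hedged sketch of loading such a matrix in the same rcv format (assuming a hypothetical #games table with offenseid, defenseid and points columns; averaging handles teams that have played each other more than once):

-- build the offense-by-defense score matrix for the factorization code above
select offenseid r, defenseid c, avg(cast(points as float)) v
into nflv
from #games
group by offenseid, defenseid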









Tuesday, September 10, 2013

Incredibly Simple Regression Using Group By Categories in SQL

I covered regression in one of my earlier posts and in the process came up with a simpler algorithm.  I can't claim it is my discovery as I'm sure it is already out there somewhere, and there is a strong similarity between many error minimization problems, but I've never seen a write-up and my favorite search engine didn't reveal any prior publications. I'm calling it SQL Group By Regression (SGBR); if it has already been published, let me know.  It bears a strong resemblance to ANOVA.

The algorithm models a numeric dependent variable by estimating weights for the independent variables which are all categorical.  An example would be height as a function of gender.  Height is a numeric variable while gender is categorical.  In typical regression (including my earlier post)  this could be modeled by using a binary (1,0) variable for each category which is fairly simple.  Imagine doing the same model of weight as a function of state of residence;  very quickly the model will have fifty binary variables. Straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.

The model: estimate a score for each attribute value (e.g. gender = female) such that the fitted class value is the sum of the scores for that observation. The scores, or weights, are estimated using batch-mode gradient descent.
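In the notation of the code below, the batch update that the loop implements is, for each attribute value (a sketch, with alpha as the learning rate):

new weight = old weight + alpha * average of (class - sum of current weights) over the observations containing that attribute value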

The course of this post:

  1. Load some data; I will first use a small data set, then the Titanic data
  2. Put the data into a vector format to ease the SQL
  3. Initiate the weights and control variables
  4. Create the gradient descent loop
  5. Show the results.
The demonstration data is height in feet as a function of gender:


create table #rawdata (id int, sex varchar(6), height float)

insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92 
insert into #rawdata select 3, 'male', 5.58 
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5 
insert into #rawdata select 7,'female', 5.42 
insert into #rawdata select 8,'female', 5.75 

Create the data vectors with #data as the independent variables and #class as the dependent variable.  As a reminder, this format eases the calculations in an SQL environment by using joins instead of having to look things up.

select id, 'intercept'  attribute, 'intercept' value into #data from #rawdata  
union all
select id, 'sex',sex from #rawdata 

select id, 'height' attribute,height class  into #class from #rawdata

The weights will be stored in the #weights table, which is a summary of the #data file to ensure all variables being modeled are included. There is also an identical table called #newweights.  As the weights are updated they are stored in the #newweights table until it is verified that the error decreased; if it did, the weights are copied to the #weights table and the looping continues, otherwise the looping stops and the new weights are discarded.  The progress is measured with @count while the improvements in error are tracked through @error and @newerror; @alpha is the percent of the error added to the weight estimates.  The initial weights are set to 0 and @error is the sum of the squared error between the fit and the class.

select attribute, value, cast(0 as float) wght into #weights from #data group by attribute, value
select *  into #newweights from #weights 

declare @error float, @newerror float, @count int, @alpha float

set @error = (select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)


set @count = 1
set @alpha = 0.1



The key SQL is the sub-query calculating the error by observation and the insert that uses the average of that error to update the category weights.  The remaining queries calculate the error, check the stop conditions and update the tables with the new information.  The gradient descent loops at least 10 times and stops at 100 iterations or when the error increases (the if ... break statement).

while 1=1
begin
truncate table #newweights

insert into #newweights
select d.attribute, d.value, w.wght+@alpha*avg(error) aerror
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

set @newerror =(select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

if (@newerror> @error and @count >10) or @count>100  break

truncate table #weights
insert into #weights  
select * from #newweights

set @error = @newerror
set @count = @count +1

end

On the toy data set it ran in about 1 second with the following weights and fit:

select * from #weights


attribute value weight
intercept intercept 2.818
sex female 2.599
sex male 3.037

select c.id, c.class,sum(wght)  fit , c.class-sum(wght) error
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class

id class fit error
1 6 5.855 0.145
2 5.92 5.855 0.065
3 5.58 5.855 -0.275
4 5.92 5.855 0.065
5 5 5.418 -0.418
6 5.5 5.418 0.082
7 5.42 5.418 0.002
8 5.75 5.418 0.332

And the r-squared (computed here without centering the class) is 99.8% using:

 select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
  (select c.id, c.class,sum(wght)  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Frequently the dependent variable is a binary variable, in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression). It is an easy change:  substitute 1.0/(1.0+exp(-sum(wght))) for sum(wght) throughout.  I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/), which is on my server in a table called train.

The dependent variable is survived, which is binary.  In addition to the survival field, the training data has both categorical and numeric data.  I will use three categorical variables: sex, pclass and embarked.   I've shown the modification in the initial error sub-query only. 

select passengerid id, 'survived' attribute,survived class into #class from train 

select passengerid id, 'intercept'  attribute, 'intercept' value into #data from train  
union all
select passengerid, 'sex',sex from train 
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train

declare @error float, @newerror float, @count int, @alpha float

drop table #weights, #newweights
select attribute, value, cast(0 as float) wght into #weights from #data group by attribute, value
select attribute, value, cast(0 as float) wght into #newweights from #data group by attribute, value


set @error = (select sum(power(error,2)) from 
(select c.id, c.class -  (1.0/(1.0+exp(-sum(wght)))) error
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

The algorithm ran in about 4 seconds on my low-end Gateway using un-indexed tables.  The weights show one clearly wanted to be a lady in first class from Cherbourg:

attribute value weight
embarked C 0.283
embarked Q 0.013
embarked S -0.283
intercept intercept -0.144
pclass 1 0.730
pclass 2 0.218
pclass 3 -0.664
sex female 1.234
sex male -0.894

As this is essentially a classification model, how successfully did it classify if one assumes a fitted value >.5 = 1? 78.67%.

select  sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Finally, the r-squared is 61.5%

select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f