Thursday, September 26, 2013

Non-Negative Matrix Factorization

Matrix factorization is a technique for extracting attributes from a data matrix into matrices corresponding to the rows and columns of the original matrix such that when the extracted matrices are multiplied the original matrix is reconstructed.  It sounds like a neat geek trick, but it has practical applications, perhaps the most renowned being the Netflix Prize (http://en.wikipedia.org/wiki/Netflix_Prize).  Netflix captures film ratings by users and mines them to recommend movies. The competition made part of this database publicly available with the goal of improving the accuracy of the recommendations.  One of the key algorithms used by the winners was matrix factorization, first suggested by Simon Funk (http://sifter.org/~simon/journal/20061211.html).

To understand the approach at a high level, imagine the ratings database as a giant table with each row being a different user and each column a different film. At the intersection of each row and column is the rating (1-5 for Netflix) given by that user to that film. As most people have not seen most films, there are a lot of empty cells.  Using matrix factorization, the film effects can be separated from the user effects, and the empty cells can be estimated by multiplying a user's effects by an unseen film's effects. The list of recommendations would be the films with the highest estimated ratings for that user.
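
As a minimal sketch of how such an estimate could be queried once the factorization exists, suppose the user effects and film effects sit in two hypothetical rcv-style tables, userfactor (r = user, c = hidden feature, v = weight) and filmfactor (r = hidden feature, c = film, v = weight), and the known ratings sit in a hypothetical ratings table (userid, filmid, rating). The estimated rating is the dot product of a user's feature weights with a film's feature weights, and the highest estimates among unrated films become the recommendations for that user (42 here is an arbitrary example):

-- estimated ratings for films user 42 has not yet rated, best first
select f.c filmid, sum(u.v*f.v) estimated_rating
from userfactor u
join filmfactor f on u.c = f.r
where u.r = 42
and f.c not in (select filmid from ratings where userid = 42)
group by f.c
order by estimated_rating desc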

While replicating the Funk methodology isn't difficult, this posting uses a different approach: non-negative matrix factorization, which ensures the extracted values are all non-negative.  The SQL is based on Toby Segaran's Python code in his excellent Programming Collective Intelligence.

The first step will be to create three stored procedures for matrix algebra that make the ultimate code more compact; next will be a toy matrix example; third, the optimization code; and, finally, the results on the toy matrix.

Matrix multiplication  (http://en.wikipedia.org/wiki/Matrix_multiplication) multiplies two matrices where the number of columns in matrix 1 (im1) equals the number of rows in matrix 2 (im2).  The output matrix has the rows from im1 and the columns from im2.  One of the side benefits of SQL is that matrix multiplication is relatively easy.  To keep the code simple, there is no error handling, so if the number of columns in im1 doesn't match the number of rows in im2, SQL will work its magic but the answer will be wrong. Each matrix is a table in rcv format where column 1 (r) is the row number, column 2 (c) is the column number and column 3 (v) is the value.

create procedure matmult ( @im1 varchar(24), @im2 varchar(24), @om varchar(24), @dot binary)
as
begin 
declare @sql varchar(max)
if @dot = 1   exec ('drop table '+@om)
set @sql = 'select A.r, B.c, sum(A.v*B.v ) v into zzzzz from xxxxx A join yyyyy B on A.c=B.r group by A.r,B.c'
exec(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om))
end

Matrix transposition is handled by creating an output table that swaps the rows and columns of the input matrix:

create procedure mattran ( @im1 varchar(24), @om varchar(24), @dot binary)
as
begin
declare @sql varchar(max)
if @dot = 1 exec ('drop table '+@om)
set @sql = 'select  c r,r c, v into  zzzzz from xxxxx'
exec( replace(replace(@sql, 'xxxxx',@im1),'zzzzz',@om))
end

Matrix cell to cell functions such as adding two matrices are managed with matarr, which requires two input matrices, an output matrix, and a specified operator (+,-,*,/):

create procedure matarr ( @im1 varchar(24), @im2 varchar(24),@func varchar(1), @om varchar(24), @dot binary)
as
begin
if @dot = 1 exec('drop table '+@om)
declare @sql varchar(max)
set @sql = 'select im1.r, im1.c, (im1.v qqqqq im2.v) v into zzzzz from xxxxx im1 join yyyyy im2 on im1.c=im2.c and im1.r=im2.r '
exec( replace(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om),'qqqqq',@func))
end
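
As a quick usage sketch (a, b and absum are hypothetical names; a and b would be two same-sized matrices already loaded in rcv tables), element-wise addition would look like:

exec matarr 'a','b','+','absum',1

select * from absum order by r,c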

Next build a toy example using the rcv format.  To make it interesting, start with a 2x2 and a 2x3 matrix, then multiply them to create the 2x3 matrix to be factored:

create table x (r int,c int, v float)
create table y (r int,c int, v float)

insert into x select 1,1,1

insert into x select 2,1,2
insert into x select 1,2,3
insert into x select 2,2,4

insert into y select 1,1,2

insert into y select 1,2,4
insert into y select 1,3,6
insert into y select 2,1,8
insert into y select 2,2,10
insert into y select 2,3,12

exec matmult 'x','y','v',1

select * from v order by r,c


r c v
1 1 26
1 2 34
1 3 42
2 1 36
2 2 48
2 3 60

The extracted matrices share either the rows or the columns of the original matrix, and both have the same number of hidden features.  Exactly how many hidden features are required to rebuild the original matrix isn't preordained and there is an element of trial and error in this choice, but experience suggests the more complicated the matrix, the more hidden features are required.  This one is pretty simple, so pick 2. Create a vector table and load the numbers; this is used in initiating the feature values, aka weights:

 create table n(i int)

insert into n select 1
insert into n select 2

Using the v and n tables, create the weights tables w (for row categories) and h (for column categories), calculate the initial fit table (third query) and sum the initial squared error.  The weights are initialized using the fx_rand() function presented in the earlier k-means posting.

select r,i c, dbo.fx_rand()  v into w from v, n  group by r, i
select i r, c, dbo.fx_rand()  v into h from v, n  group by i, c

exec matmult 'w','h','wh',1

select sum(power(v.v-wh.v,2)) from v join wh  on v.r = wh.r and v.c = wh.c

Using the toy matrix above, the initial error is about 10500 (results will differ somewhat due to the random number weights).  The fitting process loops through the updating steps until hitting one of the stopping conditions at the bottom of the loop: a sum of squared errors below a specified threshold or a maximum number of iterations.  The loop implements the standard multiplicative update rules for non-negative matrix factorization: h is multiplied cell by cell by (w-transpose x v) and divided cell by cell by (w-transpose x w x h), then w is multiplied cell by cell by (v x h-transpose) and divided cell by cell by (w x h x h-transpose):

declare @count int, @error float
set @count = 1

while 1=1
begin
exec mattran 'w','wt',1
exec matmult 'wt','v','hn',1

exec matmult 'wt','w', 'wtw',1
exec matmult 'wtw','h','hd',1

exec matarr 'h','hn','*','hhn',1
exec matarr 'hhn','hd','/','h',1

exec mattran 'h','ht',1
exec matmult 'v','ht','wn',1

exec matmult 'w','h','wxh',1
exec matmult 'wxh','ht','wd',1

exec matarr 'w','wn','*','wwn',1
exec matarr 'wwn','wd','/','w',1

exec matmult 'w','h','wh',1

set @error= (select sum(power(v.v-wh.v,2)) from v join wh  on v.r = wh.r and v.c = wh.c)

if @error<.01  or @count>=25  break
set @count = @count+1
end

For the above example the algorithm stopped after 23 iterations with a sum of squared errors of 0.00795711.  The actual vs fit  could be improved by adding hidden features to the n table  or by decreasing the error target and increasing the iterations.  For the above parameters the results are:

select v.r, v.c, v.v actual, wh.v fit, v.v-wh.v error from v join wh on v.r = wh.r and v.c = wh.c

r c actual fit error
1 1 26 25.988 0.012
2 1 36 36.009 -0.009
1 2 34 34.059 -0.059
2 2 48 47.958 0.042
1 3 42 41.960 0.040
2 3 60 60.028 -0.028

The weights (nicely formatted using a spreadsheet):

h 1 2
1 29.286 23.738
2 37.388 31.600
3 44.840 39.533



w 1 2
1 0.312 0.708
2 0.001 1.516
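
As a sanity check on those weights, cell (1,1) of the reconstruction is the dot product of the row-1 weights of w (0.312 and 0.708) with the column-1 weights of h (29.286 and 23.738): 0.312*29.286 + 0.708*23.738 ≈ 25.9, which recovers the original value of 26 to within rounding and the remaining fitted error.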

I've been experimenting with using this technique to predict NFL football scores.  Using the giant table analogy, the rows are the offensive team, the columns the defensive team and the intersecting cell is the offensive team's score. Assuming a suitable solution exists, each team's score could be forecast from the model, and the spread as well. Perhaps at some point in the future I'll share my results (or run off to Vegas to cash in).
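
A minimal sketch of how that giant table might be built in the rcv format, assuming a hypothetical games table with one row per team per game (gameid, teamid, points) and integer team ids:

-- rows = offensive team, columns = defensive team, value = average points scored
select o.teamid r, d.teamid c, avg(cast(o.points as float)) v
into nflscores
from games o
join games d on o.gameid = d.gameid and o.teamid <> d.teamid
group by o.teamid, d.teamid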









Tuesday, September 10, 2013

Incredibly Simple Regression Using Group By Categories in SQL

I covered regression in one of my earlier posts and in the process came up with a simpler algorithm.  I can't claim it as my discovery since I'm sure it is already out there somewhere, and there is a strong similarity between many error minimization problems, but I've never seen a write-up and my favorite search engine didn't reveal any prior publications. I'm calling it SQL Group By Regression (SGBR); if it has already been published, let me know.  It bears a strong resemblance to ANOVA.

The algorithm models a numeric dependent variable by estimating weights for the independent variables, which are all categorical.  An example would be height as a function of gender: height is a numeric variable while gender is categorical.  In typical regression (including my earlier post) this could be modeled by using a binary (1,0) variable for each category, which is fairly simple.  Imagine doing the same model of weight as a function of state of residence; very quickly the model will have fifty binary variables. Straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.

The model: estimate a score for each attribute value (e.g. gender = female) such that the fitted class value is the sum of the scores for that observation. The scores, or weights, are estimated using batch-mode gradient descent.

The course of this post:

  1. Load some data, I will first use a small data set  then the Titanic data
  2. Put the data into a vector format to ease the SQL
  3. Initiate the weights and control variables
  4. Create the gradient descent loop
  5. Show the results.

The demonstration data is height in feet as a function of gender:


create table #rawdata (id int, sex varchar(6), height float)

insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92 
insert into #rawdata select 3, 'male', 5.58 
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5 
insert into #rawdata select 7,'female', 5.42 
insert into #rawdata select 8,'female', 5.75 

Create the data vectors with #data as the independent variables and #class as the dependent variable.  As a reminder, this format eases the calculations in an SQL environment by using joins instead of having to look things up.

select id, 'intercept'  attribute, 'intercept' value into #data from #rawdata  
union all
select id, 'sex',sex from #rawdata 

select id, 'height' attribute,height class  into #class from #rawdata

The weights will be stored in the #weights table, which is a summary of the #data table so that all variables being modeled are included. There is also an identical table called #newweights.  As the weights are updated they are stored in the #newweights table until it is verified that the error decreased; if it did, the weights are copied to the #weights table and the looping continues, otherwise the looping stops and the new weights are discarded.  Progress is measured with @count, while the improvement in error is tracked through @error and @newerror; @alpha is the fraction of the error added to the weight estimates.  The initial weights are set to 0 and @error is the sum of the squared errors between the fit and the class.

select distinct attribute, value, cast(0 as float) wght into #weights from #data
select *  into #newweights from #weights 

declare @error float, @newerror float, @count int, @alpha float

set @error = (select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)


set @count = 1
set @alpha = 0.1



The key SQL is the insert into #newweights: the sub-query calculates the error by observation and the outer query adds a fraction (@alpha) of the average of that error to each category weight.  The remaining queries calculate the new error, check the stop conditions and update the tables with the new information. The gradient descent loops at least 10 times and stops at 100 iterations or when the error increases.

while 1=1
begin
truncate table #newweights

insert into #newweights
select d.attribute, d.value, w.wght + @alpha*avg(error) aerror
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

set @newerror =(select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

if (@newerror> @error and @count >10) or @count>100  break

truncate table #weights
insert into #weights  
select * from #newweights

set @error = @newerror
set @count = @count +1

end

On the toy data set it ran in about 1 second with the following weights and fit:

select * from #weights


attribute value weight
intercept intercept 2.818
sex female 2.599
sex male 3.037
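
To see how the fit is assembled from these weights, a male observation's fitted height is the intercept weight plus the male weight, 2.818 + 3.037 = 5.855, and a female's is 2.818 + 2.599 = 5.417 (5.418 before rounding), which matches the fit column returned below: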

select c.id, c.class,sum(wght)  fit , c.class-sum(wght) error
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class

id class fit error
1 6 5.855 0.145
2 5.92 5.855 0.065
3 5.58 5.855 -0.275
4 5.92 5.855 0.065
5 5 5.418 -0.418
6 5.5 5.418 0.082
7 5.42 5.418 0.002
8 5.75 5.418 0.332

And the r-squared is 99.8% using:

 select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
  (select c.id, c.class,sum(wght)  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Frequently the dependent variable is a binary variable, in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression). It is an easy change: substitute (1.0/(1.0+exp(-sum(wght)))) for sum(wght) throughout.  I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/), which is on my server in a table called train.

The dependent variable is survived, which is binary.  In addition to the survival field, the training data has both categorical and numeric fields.  I will use three categorical variables: sex, pclass and embarked.   I've shown the modification in the initial error sub-query only.

select passengerid id, 'survived' attribute,survived class into #class from train 

select passengerid id, 'intercept'  attribute, 'intercept' value into #data from train  
union all
select passengerid, 'sex',sex from train 
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train

declare @error float, @newerror float, @count int

drop table #weights, #newweights
select distinct attribute, value, cast(0 as float) wght into #weights from #data
select distinct attribute, value, cast(0 as float) wght into #newweights from #data


set @error = (select sum(power(error,2)) from 
(select c.id, c.class -  (1.0/(1.0+exp(-sum(wght)))) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

The algorithm ran in about 4 seconds on my low-end gateway using un-indexed tables.  The weights show one clearly wanted to be a lady in first class from Cherbourg:

attribute value weight
embarked C 0.283
embarked Q 0.013
embarked S -0.283
intercept intercept -0.144
pclass 1 0.730
pclass 2 0.218
pclass 3 -0.664
sex female 1.234
sex male -0.894
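
As a quick worked check of those weights, a female in first class who embarked at Cherbourg sums to -0.144 + 1.234 + 0.730 + 0.283 = 2.103, which the logistic function 1/(1+exp(-2.103)) turns into a fitted survival probability of roughly 0.89; a male in third class who embarked at Southampton sums to -0.144 - 0.894 - 0.664 - 0.283 = -1.985, or a fitted probability of roughly 0.12.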

As this is essentially a classification model, how successfully did it classify if one assumes a fitted value >.5 = 1? 78.67%.

select  sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Finally, the r-squared is 61.5%:

select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f