To understand the approach at a high level, imagine the ratings database as a giant table with each row being a different user and each column a different film. At the intersection of each row and column is the rating (1-5 for Netflix) given by that user to that film. As most people have not seen most films, there are a lot of empty cells. Using matrix factorization, the film effects can be separated from the user effects, and each empty cell can be estimated by multiplying a user's effects by an unseen film's effects. The list of recommendations would then be the films with the highest estimated ratings for that user.
While replicating the Funk methodology isn't difficult, this posting uses a different approach: non-negative matrix factorization (NMF), which ensures the extracted values are all positive. The SQL is based on Toby Segaran's Python code in his excellent Programming Collective Intelligence.
The first step will be to create three stored procedures for matrix algebra that make the final code more compact; next, a toy matrix example; third, the optimization code; and finally, the results on the toy matrix.
Matrix multiplication (http://en.wikipedia.org/wiki/Matrix_multiplication) multiplies two matrices where the number of columns in matrix 1 (im1) equals the number of rows in matrix 2 (im2). The output matrix has the rows from im1 and the columns from im2. One of the side benefits of SQL is that matrix multiplication is relatively easy. To keep the code simple there is no error handling, so if the number of columns in im1 doesn't match the number of rows in im2, SQL will work its magic but the answer will be wrong. Every matrix is a table in rcv format, where column 1 (r) is the row number, column 2 (c) is the column number and column 3 (v) is the value.
create procedure matmult ( @im1 varchar(24), @im2 varchar(24), @om varchar(24), @dot binary)
as
begin
declare @sql varchar(max)
if @dot = 1 exec ('drop table '+@om)
set @sql = 'select A.r, B.c, sum(A.v*B.v ) v into zzzzz from xxxxx A join yyyyy B on A.c=B.r group by A.r,B.c'
exec(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om))
end
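Since the SQL is adapted from Segaran's Python, the join-and-group-by logic is easy to mirror back in Python on the same rcv representation. The `matmult` function below is a hypothetical illustration, not part of the post's code:

```python
# Multiply two matrices stored in rcv form: a dict mapping (row, col) -> value.
# Mirrors the SQL: join A.c = B.r, then sum A.v * B.v grouped by (A.r, B.c).
def matmult(im1, im2):
    out = {}
    for (r1, c1), v1 in im1.items():
        for (r2, c2), v2 in im2.items():
            if c1 == r2:  # the join condition A.c = B.r
                out[(r1, c2)] = out.get((r1, c2), 0) + v1 * v2  # group-by sum
    return out

# The same 2x2 and 2x3 matrices used in the toy example below
x = {(1, 1): 1, (2, 1): 2, (1, 2): 3, (2, 2): 4}
y = {(1, 1): 2, (1, 2): 4, (1, 3): 6,
     (2, 1): 8, (2, 2): 10, (2, 3): 12}
print(matmult(x, y))
```

Because only cells that satisfy the join condition contribute, empty cells in the rcv table are skipped for free, just as in the SQL version.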
Matrix transposition is handled by creating an output table that substitutes the rows for the columns and columns for rows on an input matrix:
create procedure mattran ( @im1 varchar(24), @om varchar(24), @dot binary)
as
begin
declare @sql varchar(max)
if @dot = 1 exec ('drop table '+@om)
set @sql = 'select c r,r c, v into zzzzz from xxxxx'
exec( replace(replace(@sql, 'xxxxx',@im1),'zzzzz',@om))
end
Matrix cell to cell functions such as adding two matrices are managed with matarr and require input matrices, an output matrix, and a specified function (+,-,*,/):
create procedure matarr ( @im1 varchar(24), @im2 varchar(24),@func varchar(1), @om varchar(24), @dot binary)
as
begin
if @dot = 1 exec('drop table '+@om)
declare @sql varchar(max)
set @sql = 'select im1.r, im1.c, (im1.v qqqqq im2.v) v into zzzzz from xxxxx im1 join yyyyy im2 on im1.c=im2.c and im1.r=im2.r '
exec( replace(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om),'qqqqq',@func))
end
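The transpose and cell-to-cell helpers are even simpler. A hypothetical Python equivalent on the same rcv dicts, for comparison:

```python
import operator

# Transpose: swap the row and column indices (the SQL "select c r, r c, v").
def mattran(im1):
    return {(c, r): v for (r, c), v in im1.items()}

# Cell-to-cell operation on two same-shaped matrices
# (the SQL join on im1.r = im2.r and im1.c = im2.c).
def matarr(im1, im2, func):
    return {rc: func(v, im2[rc]) for rc, v in im1.items()}

a = {(1, 1): 2.0, (1, 2): 4.0, (2, 1): 6.0, (2, 2): 8.0}
b = {(1, 1): 1.0, (1, 2): 2.0, (2, 1): 3.0, (2, 2): 4.0}
print(mattran(a))                      # rows and columns swapped
print(matarr(a, b, operator.truediv))  # cell-by-cell division
```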
Next, build a toy matrix using the rcv format. To make it interesting, start with a 2x2 and a 2x3 matrix, then multiply them to create a 2x3 result:
create table x (r int,c int, v float)
create table y (r int,c int, v float)
insert into x select 1,1,1
insert into x select 2,1,2
insert into x select 1,2,3
insert into x select 2,2,4
insert into y select 1,1,2
insert into y select 1,2,4
insert into y select 1,3,6
insert into y select 2,1,8
insert into y select 2,2,10
insert into y select 2,3,12
exec matmult 'x','y','v',1
select * from v order by r,c
r | c | v |
1 | 1 | 26 |
1 | 2 | 34 |
1 | 3 | 42 |
2 | 1 | 36 |
2 | 2 | 48 |
2 | 3 | 60 |
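As a sanity check, the same product in dense form (using numpy here, which is not part of the post's SQL):

```python
import numpy as np

# The toy matrices from the x and y tables, in dense form
x = np.array([[1.0, 3.0],
              [2.0, 4.0]])
y = np.array([[2.0, 4.0, 6.0],
              [8.0, 10.0, 12.0]])

print(x @ y)
# matches the rcv output above: rows (26, 34, 42) and (36, 48, 60)
```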
NMF factors the original matrix into two smaller matrices: one with the same number of rows as the original and one column per hidden feature, the other with one row per hidden feature and the same number of columns as the original. Exactly how many hidden features are required to rebuild the original matrix isn't preordained, and there is an element of trial and error in this choice, but experience suggests the more complicated the matrix, the more hidden features are required. This one is pretty simple, so pick 2. Create a vector table and load the numbers; this is useful in initializing the feature values, aka weights:
create table n(i int)
insert into n select 1
insert into n select 2
Using the v and n tables, create the weights tables w (for row categories) and h (for column categories), calculate the initial fit table (third query) and sum the initial squared error. The weights are initialized using the fx_rand() function presented in the earlier k-means posting.
select r,i c, dbo.fx_rand() v into w from v, n group by r, i
select i r, c, dbo.fx_rand() v into h from v, n group by i, c
exec matmult 'w','h','wh',1
select sum(power(v.v-wh.v,2)) from v join wh on v.r = wh.r and v.c = wh.c
Using the toy matrix above, the initial error is about 10500 (results will differ somewhat due to the random initial weights). The fitting process loops through the updating steps until hitting one of the stopping conditions: a specified sum-of-squared-errors threshold or a maximum number of iterations:
declare @count int, @error float
set @count = 1
while 1=1
begin
exec mattran 'w','wt',1
exec matmult 'wt','v','hn',1
exec matmult 'wt','w', 'wtw',1
exec matmult 'wtw','h','hd',1
exec matarr 'h','hn','*','hhn',1
exec matarr 'hhn','hd','/','h',1
exec mattran 'h','ht',1
exec matmult 'v','ht','wn',1
exec matmult 'w','h','wxh',1
exec matmult 'wxh','ht','wd',1
exec matarr 'w','wn','*','wwn',1
exec matarr 'wwn','wd','/','w',1
exec matmult 'w','h','wh',1
set @error= (select sum(power(v.v-wh.v,2)) from v join wh on v.r = wh.r and v.c = wh.c)
if @error<.01 or @count>=25 break
set @count = @count+1
end
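For readers who want to see the math without the table plumbing, the loop above implements the standard multiplicative NMF updates (the same ones Segaran uses). A dense numpy sketch of the equivalent logic — seed, variable names and loop limits here are illustrative, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, stands in for fx_rand()
v = np.array([[26.0, 34.0, 42.0],
              [36.0, 48.0, 60.0]])
k = 2  # hidden features, matching the n table
w = rng.random((v.shape[0], k))
h = rng.random((k, v.shape[1]))

initial_error = ((v - w @ h) ** 2).sum()
for count in range(1, 26):
    h *= (w.T @ v) / (w.T @ w @ h)   # hn / hd in the SQL loop
    w *= (v @ h.T) / (w @ h @ h.T)   # wn / wd in the SQL loop
    error = ((v - w @ h) ** 2).sum()
    if error < 0.01:
        break
```

Because the updates multiply the current weights by a ratio of non-negative terms, w and h stay non-negative throughout, and the squared error never increases from one iteration to the next.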
For the above example the algorithm stopped after 23 iterations with a sum of squared errors of 0.00795711. The fit could be improved by adding hidden features to the n table, or by decreasing the error target and increasing the maximum iterations. For the above parameters the results are:
select v.r, v.c,v.v, wh.v from v join wh on v.r = wh.r and v.c = wh.c
r | c | actual | fit | error |
1 | 1 | 26 | 25.988 | 0.012 |
2 | 1 | 36 | 36.009 | -0.009 |
1 | 2 | 34 | 34.059 | -0.059 |
2 | 2 | 48 | 47.958 | 0.042 |
1 | 3 | 42 | 41.960 | 0.040 |
2 | 3 | 60 | 60.028 | -0.028 |
The weights (nicely formatted using a spreadsheet):
h | 1 | 2 |
1 | 29.286 | 23.738 |
2 | 37.388 | 31.600 |
3 | 44.840 | 39.533 |
w | 1 | 2 |
1 | 0.312 | 0.708 |
2 | 0.001 | 1.516 |
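To confirm the printed weights reproduce the original matrix, multiply them back together. Note the h table above is printed transposed (one row per column of v, one column per feature), so it is flipped back to 2x3 here; numpy is used for illustration only:

```python
import numpy as np

# w as printed: rows of v x hidden features
w = np.array([[0.312, 0.708],
              [0.001, 1.516]])
# h restored to hidden features x columns of v
h = np.array([[29.286, 37.388, 44.840],
              [23.738, 31.600, 39.533]])

v_fit = w @ h
print(v_fit.round(2))
# close to the original [[26, 34, 42], [36, 48, 60]]
```

The small residuals match the actual-vs-fit table above (the rounding of the printed weights costs a little precision).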
I've been experimenting with using this technique to predict NFL football scores. Using the giant-table analogy, the rows are the offensive teams, the columns the defensive teams, and the intersecting cell is the offensive team's score. Assuming a suitable solution exists, each team's score could be forecast from the model, and the spread as well. Perhaps at some point in the future I'll share my results (or run off to Vegas to cash in).