Tuesday, December 3, 2013

Forecasting for Inventory Planning

It has been some time since I last posted; to restart, I'd like to get into an area that's related to data mining: forecasting, or better yet, demand planning. In his book Predictive Analytics, Eric Siegel asserts that "[Predictive analytics] is a completely different animal from forecasting." I don't agree: both are concerned with using real data to predict the future. Demand forecasting may be the most widely accepted application of empirical data to decision making in business. Well-designed forecasting systems use actual orders and sales to predict how much manufacturers should make and purchasing departments should buy to support future demand.

I will use a very simple model, written up by CERN researchers (hard to beat nuclear scientists: http://ais.web.cern.ch/AIS/apps/lims/theory.html), that lends itself very nicely to the SQL environment. Rather than group data into monthly buckets and extrapolate using models like exponential smoothing, this approach offers a "mathematical solution based on the real statistical distributions."

Inventory management algorithms use some variation of the order point model: when inventory for a given stock keeping unit (SKU) falls below an order point, the system recommends purchase/manufacture of more (with the quantity usually based on the economic order quantity, a topic for another day). The order point is calculated as the quantity demanded over the lead time plus safety stock. Lead time is the time between noticing more is required and having more available in inventory for immediate sale; it includes time for inventory review + purchase order placing + vendor fulfillment + in transit + receiving + put-away, and is usually measured in days or weeks. Lead time quantity is the quantity expected to be sold over the lead time. Safety stock is the inventory held to cover unpredictable future demand fluctuation. Typically it is set to achieve some level of fill associated with the business' value proposition, say 95% of customer orders should be filled from stock. Assuming demand is normally distributed, safety stock is the number of standard deviations associated with the service level (z) times the lead time standard error.

Modeling requires six elements:
  1. orders per day (opd) = count of orders for some period/days in period
  2. average order quantity (aoq) = sum of quantity per order/count of orders
  3. order quantity variance (oqv) = sum of squared differences between the order quantity and the average order quantity / (count of orders minus 1)
  4. average lead time in days (altd)
  5. lead time in days variance (ltdv)
  6. number of standard deviations associated with service level (z)
The 3 basic formulas are (* means times and ^ means raise to the power of):
  1. lead time quantity (ltq) =  opd*altd*aoq
  2. lead time variance (ltv) =  opd*altd*(oqv + aoq^2) +opd^2 *aoq^2*ltdv
  3. order point (op) = ltq+z*ltv^.5
As an aside, many vendors specify their lead time which can be interpreted as  altd with zero variance simplifying #2 above to ltv = opd*altd*(oqv + aoq^2).
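To make the arithmetic concrete, here is a small hand calculation in T-SQL using made-up numbers (2 orders per day, an average order quantity of 10, and so on); the figures are purely illustrative and are not taken from the data used later in the post:

declare @opd float, @aoq float, @oqv float, @altd float, @ltdv float, @z float
select @opd = 2, @aoq = 10, @oqv = 4, @altd = 30, @ltdv = 9, @z = 1.645   -- z for a 95% service level

select @opd*@altd*@aoq ltq,                                                               -- 2*30*10 = 600
  @opd*@altd*(@oqv + power(@aoq,2)) + power(@opd,2)*power(@aoq,2)*@ltdv ltv,              -- 6240 + 3600 = 9840
  @opd*@altd*@aoq + @z*power(@opd*@altd*(@oqv + power(@aoq,2)) + power(@opd,2)*power(@aoq,2)*@ltdv,.5) op   -- 600 + 1.645*99.2, about 763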

The program employs three tables which are stripped to the essential information for the matters at hand. The first is #item, which holds the item list and relevant SKU-specific planning parameters such as the days in stock (useful for new items or for adjusting the forecast for stock outs), the targeted service level (which may vary based on an item's importance), and item cost data (which comes in handy when computing carrying costs, average inventory costs and economic order quantities). The second table is #receipts, used to measure the lead time: a purchase order has an order date followed by one or more receipt dates, which are useful in estimating lead time. The final table is #salesorders, which is self explanatory. The one observation I would make is that the sales order date should be the date the customer requested, not the actual ship date.


create table #item (item varchar(3), daysinstock int, servicetarget  float,stdcost float,  carrycost float, ordercost float)
insert into #item select 'abc', 120, .95, 1.5,  .12, 15

create table #receipts  (purchaseorder int, item varchar(3), orderdate int, receiptdate int)
insert into #receipts select  321,'abc', 20130109, 20130226
insert into #receipts select  432,'abc', 20130325, 20130413

insert into #receipts select  543,'abc', 20131007, 20131130

create table #salesorders (salesorder int, salesreqdate  int, item varchar(3),   quantity int)

insert into #salesorders select 123, 20130208, 'abc',  1
insert into #salesorders select 234, 20131014, 'abc',  10
insert into #salesorders select 345, 20130413, 'abc',  35
insert into #salesorders select 456, 20130409, 'abc',  10
insert into #salesorders select 567, 20131105, 'abc',  5
insert into #salesorders select 678, 20130515, 'abc',  5

Using the built-in SQL functions for average, variance and count, summarize the data using sub-queries for the receipts data and the sales order data, and place the results in a summary table with placeholders for lead time quantity (ltq), lead time variance (ltqv), order point (op) and economic order quantity (eoq). While all of these could be calculated in the first pass, I broke it into updates to make it easier to follow.

select i.item, i.daysinstock,servicetarget, stdcost,carrycost, ordercost, altd,ltdv ,ofreq/daysinstock opd, aoq,oqv, 
cast(0 as float) ltq,
cast(0 as float) ltqv, 
cast(0 as int) op,
cast(0 as int) eoq
into #summary from #item i
join
(select item, cast(count(1) as float)  rfreq, avg(cast(datediff(d,convert(char,orderdate,112), convert(char,receiptdate,112)) as float))  altd ,
var(cast(datediff(d,convert(char,orderdate,112), convert(char,receiptdate,112)) as float)) ltdv
from #receipts  group by item) lt
on i.item = lt.item
join 
(select item, cast(count(1) as float) ofreq, avg(cast(quantity as float)) aoq, var(cast(quantity as float)) oqv from #salesorders group by item) so
on i.item = so.item

Use update statements to  add the lead time quantity and lead time variance:

update #summary
set ltq = opd*altd*aoq,
ltqv =  opd*altd*(oqv + power(aoq,2)) +power(opd,2)*power(aoq,2)*ltdv

The final step uses an update statement to create the order point and the economic order quantity (eoq). See http://en.wikipedia.org/wiki/Economic_order_quantity for a very thorough definition. Also, note the use of dbo.fx_normsinv, a function that converts a targeted service level (95%) to a z score; I've included the code in the appendix.

update #summary  
set op = cast(ltq+ power(ltqv, 0.5)*dbo.fx_normsinv(servicetarget) as int),
eoq = cast(power((2*opd*365*aoq*ordercost)/(carrycost*stdcost),.5) as int)  -- classic EOQ: sqrt(2 * annual demand * order cost / annual carrying cost per unit)

In addition to planning, the data can also provide insight into performance and KPIs. For example, average inventory value (avginv) can be approximated as (one half the order quantity plus safety stock) times unit cost, cost of goods sold (cogs) as annual demand times unit cost, and projected turns as cogs/avginv:

select 
sum((eoq/2+ (op-ltq))*stdcost) avginv,
sum(opd*365*aoq*stdcost) cogs,
sum(opd*365*aoq*stdcost)/sum((eoq/2+ (op-ltq))*stdcost) turns
from #summary

That's it.  I employed a slightly extended version of this to run inventory planning on nearly 200,000 SKUs with an inventory of over $100 million achieving a 95+% service level.

Appendix:  

Inverse of normal distribution per Abramowitz and Stegun formula 26.2.23

create function fx_normsinv(@p float) 
returns float
as
begin
-- rational approximation to the inverse of the standard normal CDF
declare @c0 float, @c1 float, @c2 float, @d1 float, @d2 float, @d3 float, @t float, @sign float
-- the approximation is defined for the upper tail; flip p below .5 and restore the sign at the end
set @sign = case when @p < 0.5 then -1.0 else 1.0 end
if @p >= 0.5 set @p = 1.0 - @p
select @t = power(-2.0*log(@p),.5),
       @c0 = 2.515517, @c1 = 0.802853, @c2 = 0.010328,
       @d1 = 1.432788, @d2 = 0.189269, @d3 = 0.001308
return @sign*(@t - ((@c0 + @c1*@t + @c2*@t*@t)/(1.0 + @d1*@t + @d2*@t*@t + @d3*@t*@t*@t)))
end
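
A quick sanity check of the approximation (assuming the function is created in the default dbo schema, matching the calls above):

select dbo.fx_normsinv(.95)   -- returns approximately 1.645, the z score for a 95% service level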

Thursday, September 26, 2013

Non-Negative Matrix Factorization

Matrix factorization is a technique for extracting attributes from a data matrix into matrices corresponding to the rows and columns of the original matrix, such that when the extracted matrices are multiplied the original matrix is reconstructed. It sounds like a neat geek trick, but it actually has practical applications, with perhaps the most renowned example being the Netflix prize (http://en.wikipedia.org/wiki/Netflix_Prize). Netflix captures film ratings by users, which it mines to recommend movies to users. The competition made part of this database publicly available with the goal of improving the accuracy of the recommendations. One of the key algorithms used by the winners was matrix factorization, which was first suggested by Simon Funk (http://sifter.org/~simon/journal/20061211.html).

To understand the approach at a high level, imagine the ratings database as a giant table with each row being a different user and each column a different film. At the intersection of each row and column is the rating (1-5 for Netflix) given by that user to that film. As most people have not seen most films, there are a lot of empty cells. Using matrix factorization the film effects can be separated from the user effects, and the empty cells can be estimated by multiplying a user's effects by the unseen film's effects. The list of recommendations would be the films with the highest estimated ratings for that user.

While replicating the Funk methodology isn't difficult, this posting uses a different approach: non-negative matrix factorization which ensures the extracted values are all positive.  The SQL is based on Toby Segaran's python code in his excellent Programming Collective Intelligence.

The first step will be to create three stored procedures for matrix algebra that will make the ultimate code more compact, next will be to create a toy matrix example, third the optimization code and, finally, the results on the toy matrix.

Matrix multiplication (http://en.wikipedia.org/wiki/Matrix_multiplication) multiplies 2 matrices where the number of columns in matrix 1 (im1) equals the number of rows in matrix 2 (im2). The output matrix has the rows from im1 and the columns from im2. One of the side benefits of SQL is that matrix multiplication is relatively easy. To keep the code simple, there is no error handling, so if the number of columns in im1 doesn't match the number of rows in im2, SQL will work its magic but the answer will be wrong. All matrices are tables in rcv format, where column 1 (r) is the row number, column 2 (c) is the column number and column 3 (v) is the value.

create procedure matmult ( @im1 varchar(24), @im2 varchar(24), @om varchar(24), @dot binary)
as
begin 
declare @sql varchar(max)
if @dot = 1 and object_id(@om) is not null exec ('drop table '+@om)   -- drop the output table only if it already exists
set @sql = 'select A.r, B.c, sum(A.v*B.v ) v into zzzzz from xxxxx A join yyyyy B on A.c=B.r group by A.r,B.c'
exec(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om))
end

Matrix transposition is handled by creating an output table that substitutes the rows for the columns and columns for rows on an input matrix:

create procedure mattran ( @im1 varchar(24), @om varchar(24), @dot binary)
as
begin
declare @sql varchar(max)
if @dot = 1 and object_id(@om) is not null exec ('drop table '+@om)   -- drop the output table only if it already exists
set @sql = 'select  c r,r c, v into  zzzzz from xxxxx'
exec( replace(replace(@sql, 'xxxxx',@im1),'zzzzz',@om))
end

Matrix cell to cell functions such as adding two matrices are managed with matarr and require input matrices, an output matrix, and a specified function (+,-,*,/):

create procedure matarr ( @im1 varchar(24), @im2 varchar(24),@func varchar(1), @om varchar(24), @dot binary)
as
begin
if @dot = 1 and object_id(@om) is not null exec('drop table '+@om)   -- drop the output table only if it already exists
declare @sql varchar(max)
set @sql = 'select im1.r, im1.c, (im1.v qqqqq im2.v) v into zzzzz from xxxxx im1 join yyyyy im2 on im1.c=im2.c and im1.r=im2.r '
exec( replace(replace(replace(replace(@sql, 'xxxxx',@im1), 'yyyyy',@im2),'zzzzz',@om),'qqqqq',@func))
end

Next build a toy matrix using the rcv format. To make it interesting, start with a 2x2 and a 2x3 matrix, then multiply them to create the 2x3 matrix v:

create table x (r int,c int, v float)
create table y (r int,c int, v float)

insert into x select 1,1,1

insert into x select 2,1,2
insert into x select 1,2,3
insert into x select 2,2,4

insert into y select 1,1,2

insert into y select 1,2,4
insert into y select 1,3,6
insert into y select 2,1,8
insert into y select 2,2,10
insert into y select 2,3,12

exec matmult 'x','y','v',1

select * from v order by r,c


r c v
1 1 26
1 2 34
1 3 42
2 1 36
2 2 48
2 3 60

The extracted matrices have either the number of rows or the number of columns of the original matrix, and both share the same number of hidden features. Exactly how many hidden features are required to rebuild the original matrix isn't preordained and there is an element of trial and error in this choice, but experience suggests the more complicated the matrix, the more hidden features are required. This one is pretty simple, so pick 2. Create a vector table n and load the numbers; it is used to initialize the feature values, aka weights:

 create table n(i int)

insert into n select 1
insert into n select 2

Using the v and n tables, create the weights tables w (for row categories) and h (for column categories), calculate the initial fit table (third query) and sum the initial squared error. The weights are initialized using the fx_rand() function presented in the earlier k-means posting.

select r,i c, dbo.fx_rand()  v into w from v, n  group by r, i
select i r, c, dbo.fx_rand()  v into h from v, n  group by i, c

exec matmult 'w','h','wh',1

select sum(power(v.v-wh.v,2)) from v join wh  on v.r = wh.r and v.c = wh.c

Using the toy matrix above, the initial error is about 10500 (results will differ somewhat due to the random initial weights). The fitting process loops through the updating steps until it reaches the stopping condition of either a specified sum-of-errors threshold or a maximum number of iterations (the if statement at the bottom of the loop):

declare @count int, @error float
set @count = 1

while 1=1
begin
exec mattran 'w','wt',1
exec matmult 'wt','v','hn',1

exec matmult 'wt','w', 'wtw',1
exec matmult 'wtw','h','hd',1

exec matarr 'h','hn','*','hhn',1
exec matarr 'hhn','hd','/','h',1

exec mattran 'h','ht',1
exec matmult 'v','ht','wn',1

exec matmult 'w','h','wxh',1
exec matmult 'wxh','ht','wd',1

exec matarr 'w','wn','*','wwn',1
exec matarr 'wwn','wd','/','w',1

exec matmult 'w','h','wh',1

set @error= (select sum(power(v.v-wh.v,2)) from v join wh  on v.r = wh.r and v.c = wh.c)

if @error<.01  or @count>=25  break
set @count = @count+1
end
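
For reference, written in the same plain notation as the earlier formulas, the two updates inside the loop are the standard multiplicative rules for non-negative factorization, where Wt and Ht are the transposes and * and / between same-sized matrices work element by element:

  h is replaced by h * (Wt*v) / (Wt*w*h)    (hn over hd in the code above)
  w is replaced by w * (v*Ht) / (w*h*Ht)    (wn over wd in the code above)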

For the above example the algorithm stopped after 23 iterations with a sum of squared errors of 0.00795711. The actual vs. fit could be improved by adding hidden features to the n table, or by decreasing the error target and increasing the allowed iterations. For the above parameters the results are:

select v.r, v.c, v.v actual, wh.v fit, v.v-wh.v error from v join wh on v.r = wh.r and v.c = wh.c

r c actual fit error
1 1 26 25.988 0.012
2 1 36 36.009 -0.009
1 2 34 34.059 -0.059
2 2 48 47.958 0.042
1 3 42 41.960 0.040
2 3 60 60.028 -0.028

The weights (nicely formatted using a spreadsheet):

h 1 2
1 29.286 23.738
2 37.388 31.600
3 44.840 39.533



w 1 2
1 0.312 0.708
2 0.001 1.516

I've been experimenting with using this technique to predict NFL football scores. Using the giant table analogy, the rows are the offense team, the columns the defense team and the intersecting cell is the offensive team's score. Assuming a suitable solution exists, each team's score could be forecast from the model, and the spread as well. Perhaps at some point in the future I'll share my results (or run off to Vegas to cash in).
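
As a rough sketch of that idea (the table, columns and scores below are hypothetical, not an actual data set), the game results pivot into the same rcv format used above, with offenses as rows and defenses as columns:

-- hypothetical game results: points scored by each offense against each defense
create table games (offense varchar(3), defense varchar(3), points float)
insert into games select 'NE', 'NYJ', 30
insert into games select 'NYJ', 'NE', 27
insert into games select 'DEN', 'NYJ', 41

-- rcv form for the factorization procedures above: r = offense, c = defense, v = average points;
-- dense_rank is one way to map team codes to the integer row and column ids
select dense_rank() over (order by offense) r,
       dense_rank() over (order by defense) c,
       avg(points) v
into scores
from games
group by offense, defense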









Tuesday, September 10, 2013

Incredibly Simple Regression Using Group By Categories in SQL

I covered regression in one of my earlier posts and in the process came up with a simpler algorithm. I can't claim it is my discovery as I'm sure it is already out there somewhere, and there is a strong similarity between many error-minimization problems, but I've never seen a write-up and my favorite search engine didn't reveal any prior publications. I'm calling it SQL Group By Regression (SGBR); if it has already been published, let me know. It bears a strong resemblance to ANOVA.

The algorithm models a numeric dependent variable by estimating weights for the independent variables, which are all categorical. An example would be height as a function of gender: height is a numeric variable while gender is categorical. In typical regression (including my earlier post) this could be modeled by using a binary (1,0) variable for each category, which is fairly simple. Imagine doing the same for weight as a function of state of residence; very quickly the model will have fifty binary variables. Straightforward (if not necessarily statistically valid) in a regression package like R, but a pain in SQL.

The model: estimate a score for each attribute value (e.g. gender = female) such that the fitted class value is the sum of the scores for that observation. The scores, or weights, are estimated using batch-mode gradient descent.
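
Written in the same plain notation as the earlier posts' formulas, the fit and the weight update performed by the loop below are:

  fit(id) = sum of wght over the (attribute, value) pairs present in that observation
  error(id) = class(id) - fit(id)
  wght(attribute, value) is replaced by wght(attribute, value) + @alpha * average of error(id) over the observations having that (attribute, value)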

The course of this post:

  1. Load some data, I will first use a small data set  then the Titanic data
  2. Put the data into a vector format to ease the SQL
  3. Initiate the weights and control variables
  4. Create the gradient descent loop
  5. Show the results.
The demonstration data is height in feet as a function of gender:


create table #rawdata (id int, sex varchar(6), height float)

insert into #rawdata select 1,'male', 6
insert into #rawdata select 2,'male', 5.92 
insert into #rawdata select 3, 'male', 5.58 
insert into #rawdata select 4,'male', 5.92
insert into #rawdata select 5,'female', 5
insert into #rawdata select 6,'female', 5.5 
insert into #rawdata select 7,'female', 5.42 
insert into #rawdata select 8,'female', 5.75 

Create the data vectors with #data as the independent variables and #class as the dependent variable.  As a reminder, this format eases the calculations in an SQL environment by using joins instead of having to look things up.

select id, 'intercept'  attribute, 'intercept' value into #data from #rawdata  
union all
select id, 'sex',sex from #rawdata 

select id, 'height' attribute,height class  into #class from #rawdata

The weights will be stored in the #weights table, which is a summary of the #data table to ensure all variables being modeled are included. There is also an identical table called #newweights. As the weights are updated they are stored in the #newweights table until it is verified that the error was reduced; if it was, the weights are copied to the #weights table and the looping continues, otherwise the looping stops and the new weights are discarded. The progress is measured with @count, while the improvements in error are tracked through @error and @newerror; @alpha is the fraction of the error added to the weight estimates. The initial weights are set to 0 and @error is the sum of the squared error between the fit and the class.

select distinct attribute, value, cast(0 as float) wght into #weights from #data
select *  into #newweights from #weights 

declare @error float, @newerror float, @count int, @alpha float

set @error = (select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)


set @count = 1
set @alpha = 0.1



The key SQL is the sub-query calculating the error by observation and the outer insert using the average of that error to update the category weights. The remaining queries calculate the new error, check the stop conditions and update the tables with the new information. The gradient descent loops at least 10 times and stops at 100 iterations or as soon as the error increases (the if statement after the error calculation).

while 1=1
begin
truncate table #newweights

insert into #newweights
select d.attribute, d.value, w.wght+@alpha*avg(error) aerror
from #weights w  
join #data d
on w.attribute = d.attribute and w.value = d.value  
join (select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e
on d.id = e.id
group by d.attribute, d.value, w.wght

set @newerror =(select sum(power(error,2)) from 
(select c.id, c.class - sum(wght) error 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

if (@newerror> @error and @count >10) or @count>100  break

truncate table #weights
insert into #weights  
select * from #newweights

set @error = @newerror
set @count = @count +1

end

On the toy data set it ran in about 1 second with the following weights and fit:

select * from #weights


attribute value weight
intercept intercept 2.818
sex female 2.599
sex male 3.037

select c.id, c.class,sum(wght)  fit , c.class-sum(wght) error
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class

id class fit error
1 6 5.855 0.145
2 5.92 5.855 0.065
3 5.58 5.855 -0.275
4 5.92 5.855 0.065
5 5 5.418 -0.418
6 5.5 5.418 0.082
7 5.42 5.418 0.002
8 5.75 5.418 0.332

And the r-squared is 99.8% using:

 select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
  (select c.id, c.class,sum(wght)  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Frequently the dependent variable is a binary variable, in which case a logistic model may be more appropriate to keep the fitted value between 0 and 1 (http://en.wikipedia.org/wiki/Logistic_regression). It is an easy change: substitute 1.0/(1.0+exp(-sum(wght))) for sum(wght) throughout. I'll demonstrate this using the Titanic data (currently available from http://www.kaggle.com/ or http://www.encyclopedia-titanica.org/titanic-passenger-list/), which is on my server in a table called train.

The dependent variable is survived, which is binary. In addition to the survival field, the training data has both categorical and numeric columns. I will use three categorical variables: sex, pclass and embarked. I've shown the modification in the initial error sub-query only.

select passengerid id, 'survived' attribute,survived class into #class from train 

select passengerid id, 'intercept'  attribute, 'intercept' value into #data from train  
union all
select passengerid, 'sex',sex from train 
union all
select passengerid, 'pclass',cast(pclass as varchar) from train
union all
select passengerid, 'embarked',embarked from train

declare @error float, @newerror float, @count int

drop table #weights, #newweights
select distinct attribute, value, cast(0 as float) wght into #weights from #data
select distinct attribute, value, cast(0 as float) wght into #newweights from #data


set @error = (select sum(power(error,2)) from 
(select c.id, c.class - (1.0/(1.0+exp(-sum(wght)))) error
from  #class c 
join #data d on c.id = d.id
join #weights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) e)

The algorithm ran in about 4 seconds on my low-end gateway using un-indexed tables. The weights show one clearly wanted to be a lady in first class from Cherbourg:

attribute value weight
embarked C 0.283
embarked Q 0.013
embarked S -0.283
intercept intercept -0.144
pclass 1 0.730
pclass 2 0.218
pclass 3 -0.664
sex female 1.234
sex male -0.894

As this is essentially a classification model, how successfully did it classify if one assumes a fitted value > .5 predicts survival? 78.67%.

select  sum(case when class = round(fit,0) then 1.0 else 0 end)/count(1) from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f

Finally, the r-squared is 61.5%:

select 1- sum(power(class-fit,2))/ sum(power(class,2))  from 
(select c.id, c.class,(1.0/(1.0+exp(-sum(wght))))  fit 
from  #class c 
join #data d on c.id = d.id
join #newweights w on d.attribute = w.attribute and d.value = w.value  
group by c.id, c.class) f






Tuesday, August 13, 2013

K Means Clustering

Perhaps the most successful data mining algorithm after simple statistics and regression is the clustering algorithm known as k-means. It is a deceptively simple iterative process that applies easily understood similarity measures to group observations (records) into homogeneous clusters. It has a wide variety of applications including identifying extreme observations, data reduction, document clustering and more; a quick Google search will reveal all.

The basic algorithm is simple:
  1. Prepare the data set (initially consider only ratio data; a future post will go into categorical data).  There really is no limit on the number of attributes that can be considered, however the more there are, the less influence each will have on the cluster assignment process. To ensure no single attribute overwhelms the analysis, normalize the data by subtracting the attribute average and dividing by the standard deviation.
  2. Select the number of clusters and randomly assign each observation to one of the clusters. A drawback of k-means is that the user must pre-specify the number of clusters rather than have the ideal number identified by the algorithm.
  3. Calculate the mean  for each attribute for each cluster.
  4. For each observation calculate the similarity to each cluster mean and if the observation is more similar to a different cluster than which it is currently assigned, re-assign it to the more similar cluster.
  5. Repeat steps 3 & 4 until observations stop moving between clusters.
There are a variety of similarity measures, but the most frequently used for ratio data is the inverse of the Euclidean distance. The nearest cluster is the most similar, and the distance is calculated as the sum of the squared differences between the observation's attribute values and the cluster means for those attributes. This model will use the absolute value of the difference rather than squaring it, which tends to reduce the impact of extreme observations. If the data is categorical, as in document clustering where the attributes are words and the values are the word frequencies for a given document, cosine similarity, (A*B)/((A^2)^.5 * (B^2)^.5), produces good results.
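
Spelled out in the plain notation used elsewhere in these posts, for an observation x and a cluster center m over attributes a, these measures are:

  absolute-difference distance (used in the code below) = sum over a of abs(x(a) - m(a))
  squared Euclidean distance = sum over a of (x(a) - m(a))^2
  cosine similarity = (sum over a of x(a)*m(a)) / ((sum over a of x(a)^2)^.5 * (sum over a of m(a)^2)^.5)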

In addition to the raw data table, where each record is an observation made up of attributes, the SQL program requires 2 tables: 1. the data in vector format (#data); 2. the observations, aka records, by cluster (#class). The first converts the matrix data into a vector (essentially each record is row, column, value, where row is the record id, column is the attribute name and value is the attribute value normalized through sub-queries from the raw data table). The second keeps track of the cluster membership of each id.

create table #rawdata (id int,area float, rooms float, price float)
insert into #rawdata select 1, 2201,3,400
insert into #rawdata select 2, 1600,3,330
insert into #rawdata select 3, 2400,3,369
insert into #rawdata select 4, 1416,2,232
insert into #rawdata select 5, 3000,4,540

select id, 'area' attribute, (area-(select avg(area) from #rawdata))/(select stdev(area) from #rawdata) value  into #data from #rawdata

union all

select id, 'rooms' , (rooms-(select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata)  from #rawdata

union all

select id, 'price' , (price-(select avg(price) from #rawdata))/(select stdev(price) from #rawdata)  from #rawdata

select id, cast(2*dbo.fx_rand() as int) class into #class from #data  group by id

In the last query the ids are randomly assigned to one of 2 classes (the cast(2*dbo.fx_rand() as int) expression, where the leading 2 is the number of clusters). Ideally SQL would permit this using the built-in rand() function, but it generates a single random number per query and doesn't change with each record in the results. There are two work-arounds: the first is to create a view vw_rand and a user-defined function fx_rand():

create view vw_rand 
as
select rand() as random


create function fx_rand ()
returns float
as
begin
return (select random from vw_rand)
end

The second solution, which is simpler than creating views and functions, is to replace fx_rand() with cast(right(checksum(newid()),3) as float)/1000 to generate a pseudo-random number between 0.000 and 0.999. While both return "random" numbers, their statistical validity has not, to the best of my knowledge, been evaluated; for the purposes of this algorithm they both appear to work.
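
For example, the random cluster assignment query above could be rewritten with the checksum work-around instead of the view and function (a sketch; run one version or the other, not both, since each creates #class):

select id, cast(2*(cast(right(checksum(newid()),3) as float)/1000) as int) class
into #class
from #data
group by id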

Next is the heart of the program, which summarizes the attribute data for the observations by cluster and then compares each record to each cluster, updating the assignment if a different cluster is closer. This continues until the stopping condition -- no further re-assignments are made -- is met.

while 1=1
begin 
;with cte1 as
(select attribute, class, avg(value) mean 
from #data d join #class c on d.id = c.id  group by attribute, class),
cte2 as 
(select d.id,c.class, sum(abs(d.value-c.mean)) dist  from cte1 c join #data d on c.attribute = d.attribute 
group by d.id, c.class),
cte3 as 
(select id, min(dist) mindist from cte2  group by id),
cte4 as
(select a.id, min(class) newclass from cte3 a join cte2 b on a.id = b.id and a.mindist = b.dist group by a.id)
update c set class = newclass from #class c join cte4 b on c.id = b.id where class<>newclass

if @@rowcount = 0 break
end 

The query uses common table expressions (CTEs), which make programming easier than creating and dropping temporary tables or stacking sub-queries like so many planes over O'Hare, but have the disadvantage of not working outside T-SQL, and the results of the interim steps are not available for debugging. The first CTE calculates the centers for each of the attributes for each cluster. The second calculates the distance to each cluster for each observation; if you want to use the Pythagorean-based calculation, change abs(d.value-c.mean) to power(d.value-c.mean,2). CTE3 identifies the shortest distance to a cluster and CTE4 identifies the cluster associated with that shortest distance, with any ties going arbitrarily to the cluster with the lower class number. Finally, any observations with new cluster assignments are updated and the process is repeated until no further updates are made.

To evaluate the effects of the algorithm, use the following 2 queries to link  first the initial clusters then  the results back to the raw data table:

select c.id,c.class cluster, area, rooms,price from #class c  join #rawdata d on c.id = d.id  order by c.id

select class cluster, count(1) freq, avg(area) avg_area, avg(rooms) avg_rooms, avg(price) avg_price from #class c join #rawdata d on c.id = d.id group by class

Initial clusters:

id cluster area rooms Price
1 1 2201 3 400
2 1 1600 3 330
3 0 2400 3 369
4 0 1416 2 232
5 0 3000 4 540

cluster freq avg area avg rooms avg price
0 3 2272.0 3 380.3
1 2 1900.5 3 365.0

After running the algorithm, record 1 was moved to cluster 0 and record 4 was moved to cluster 1; from the summary table it is evident that a greater distinction is made on all variables, with cluster 0 being the larger, more expensive houses.

id cluster area rooms Price
1 0 2201 3 400
2 1 1600 3 330
3 0 2400 3 369
4 1 1416 2 232
5 0 3000 4 540

cluster freq avg area avg rooms avg price
0 3 2533.7 3.3 436.3
1 2 1508.0 2.5 281.0

A couple of final issues to consider. First, k-means can be sensitive to the initial cluster assignments, especially when there are many clusters. Should this happen, run the program several times to identify a preferred solution, much like the pocket solution in the post on the Perceptron. Finally, it is not uncommon for k-means to produce a solution with fewer clusters than originally specified, particularly when the sample size is small. Running the above with 3 clusters produces a 2-cluster solution. Either go with the fewer clusters or try a different similarity measure.