Tuesday, August 13, 2013

K Means Clustering

Perhaps the most successful data mining algorithm after simple statistics and regression is the clustering algorithm known as k-means. It is a deceptively simple iterative process that applies easily understood similarity measures to group observations (records) into homogeneous clusters. It has a wide variety of applications, including identifying extreme observations, data reduction, document clustering and more; a quick Google search will reveal all.

The basic algorithm is simple:
  1. Prepare the data set (initially consider only ratio data; a future post will cover categorical data). There is really no limit on the number of attributes that can be considered; however, the more there are, the less influence each will have on the cluster assignment process. To ensure no single attribute overwhelms the analysis, normalize the data by subtracting the attribute average and dividing by the standard deviation.
  2. Select the number of clusters and randomly assign each observation to one of the clusters. A drawback of k-means is that the user must pre-specify the number of clusters rather than have the ideal number identified by the algorithm.
  3. Calculate the mean for each attribute for each cluster.
  4. For each observation, calculate the similarity to each cluster mean; if the observation is more similar to a different cluster than the one to which it is currently assigned, re-assign it to the more similar cluster.
  5. Repeat steps 3 & 4 until observations stop moving between clusters.
There are a variety of similarity measures, but the most frequently used for ratio data is the inverse of the Euclidean distance. The nearest cluster is the most similar, and the distance is calculated as the sum of the squared differences between the observation's attribute values and the cluster means for those attributes. This model will use the absolute value of the difference rather than the square, which tends to reduce the impact of an extreme observation. If the data is categorical, as in document clustering where the attributes are words and the values are the word frequencies for a given document, cosine similarity, A.B / (|A| |B|), i.e. the dot product of the two vectors divided by the product of their lengths, produces good results.
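As an illustration only (the program below uses the absolute-difference measure), cosine similarity fits the same vector format used later in this post, where #data holds the normalized (id, attribute, value) rows and #class holds the cluster assignments; this sketch computes the similarity of each observation to each cluster centroid:

-- Sketch only: cosine similarity between each observation and each cluster centroid,
-- using the #data and #class tables created below; the nearest cluster is the one
-- with the LARGEST similarity rather than the smallest distance.
;with centroid as
(select attribute, class, avg(value) mean
from #data d join #class c on d.id = c.id group by attribute, class)
select d.id, c.class,
sum(d.value*c.mean)/(sqrt(sum(power(d.value,2)))*sqrt(sum(power(c.mean,2)))) cosine_sim
from centroid c join #data d on c.attribute = d.attribute
group by d.id, c.class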

In addition to the raw data table, where each record is an observation made up of attributes, the SQL program requires 2 tables: 1. the data in vector format (#data), 2. the observations (aka records) by cluster (#class). The first table converts the matrix data into a vector: essentially each record is row, column, value, where row is the record id, column is the attribute name, and value is the attribute value normalized through sub-queries against the raw data table. The second keeps track of the cluster membership of each id.

create table #rawdata (id int,area float, rooms float, price float)
insert into #rawdata select 1, 2201,3,400
insert into #rawdata select 2, 1600,3,330
insert into #rawdata select 3, 2400,3,369
insert into #rawdata select 4, 1416,2,232
insert into #rawdata select 5, 3000,4,540

select id, 'area' attribute, (area-(select avg(area) from #rawdata))/(select stdev(area) from #rawdata) value  into #data from #rawdata

union all

select id, 'rooms' , (rooms-(select avg(rooms) from #rawdata))/(select stdev(rooms) from #rawdata)  from #rawdata

union all

select id, 'price' , (price-(select avg(price) from #rawdata))/(select stdev(price) from #rawdata)  from #rawdata

select id, cast(2*dbo.fx_rand() as int) class into #class from #data  group by id

In the last query the id's are randomly assigned to one of 2 classes (the cast expression begins with the number of clusters, 2). Ideally SQL would permit this using the built-in rand() function, but it generates a single random number per query and doesn't change with each record in the results. There are two workarounds. The first is to create a view vw_rand and a user defined function fx_rand():

create view vw_rand
as
select rand() as random
go

create function fx_rand ()
returns float
as
begin
return (select random from vw_rand)
end
go

The second solution, which is simpler than creating views and functions, is to replace fx_rand() with cast(right(checksum(newid()),3) as float)/1000 to generate a pseudo random number between 0.000 and 0.999. While both return "random" numbers, their validity has not, to the best of my knowledge, been evaluated; for the purposes of this algorithm they both appear to work.
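Plugged into the cluster-assignment statement above, the checksum approach looks like this (a sketch only; for this algorithm it should behave the same as the fx_rand() version):

-- Alternative initial assignment using checksum(newid()) instead of dbo.fx_rand()
select id, cast(2*(cast(right(checksum(newid()),3) as float)/1000) as int) class
into #class from #data group by id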

Next is the heart of the program, which summarizes the attribute data for the observations by cluster and then compares each record to each cluster, updating the assignment if a different cluster is closer. This continues until the stopping condition (no further re-assignments are made) is met.

while 1=1
begin 
;with cte1 as
(select attribute, class, avg(value) mean 
from #data d join #class c on d.id = c.id  group by attribute, class),
cte2 as 
(select d.id,c.class, sum(abs(d.value-c.mean)) dist  from cte1 c join #data d on c.attribute = d.attribute 
group by d.id, c.class),
cte3 as 
(select id, min(dist) mindist from cte2  group by id),
cte4 as
(select a.id, min(b.class) newclass from cte3 a join cte2 b on a.id = b.id and a.mindist = b.dist group by a.id)
update c set class = newclass from #class c join cte4 b on c.id = b.id where class<>newclass

if @@rowcount = 0 break
end 

The query uses common table expressions (CTEs), which make programming easier than creating and dropping temporary tables or stacking sub-queries like so many planes over O'Hare, but have the disadvantages of not being supported by every database engine and of the interim results not being available for debugging. The first CTE calculates the centers for each of the attributes for each cluster. The second calculates the distance to each cluster for each observation; if you want to use the Pythagorean based calculation, change abs(d.value-c.mean) to power(d.value-c.mean,2). CTE3 identifies the shortest distance to a cluster, and CTE4 identifies the cluster associated with that shortest distance, with any ties going arbitrarily to the cluster with the lower class number. Finally, any observations with new cluster assignments are updated, and the process is repeated until no further updates are made.
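For reference, here is a standalone sketch of the squared-difference (Euclidean style) version of the distance calculation; in the loop above only the expression inside sum() changes:

-- Sketch: squared (Euclidean style) distance from each observation to each cluster center
;with cte1 as
(select attribute, class, avg(value) mean
from #data d join #class c on d.id = c.id group by attribute, class)
select d.id, c.class, sum(power(d.value-c.mean,2)) dist
from cte1 c join #data d on c.attribute = d.attribute
group by d.id, c.class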

To evaluate the effects of the algorithm, use the following 2 queries to link first the initial clusters, then the results, back to the raw data table:

select c.id,c.class cluster, area, rooms,price from #class c  join #rawdata d on c.id = d.id  order by c.id

select class cluster, count(1) freq, avg(area) avg_area, avg(rooms) avg_rooms, avg(price) avg_price from #class c join #rawdata d on c.id = d.id group by class

Initial clusters:

id cluster area rooms price
1 1 2201 3 400
2 1 1600 3 330
3 0 2400 3 369
4 0 1416 2 232
5 0 3000 4 540

cluster freq avg area avg rooms avg price
0 3 2272.0 3 380.3
1 2 1900.5 3 365.0

After running the algorithm, record 1 was moved to cluster 0 and record 4 was moved to cluster 1. From the summary table it is evident that a greater distinction is made on all variables, with cluster 0 being the larger, more expensive houses.

id cluster area rooms price
1 0 2201 3 400
2 1 1600 3 330
3 0 2400 3 369
4 1 1416 2 232
5 0 3000 4 540

cluster freq avg area avg rooms avg price
0 3 2533.7 3.3 436.3
1 2 1508.0 2.5 281.0

A couple of final issues to consider. First, k-means can be sensitive to the initial cluster assignments, especially when there are many clusters. Should this happen, run the program several times to identify a preferred solution, much like the pocket solution in the post on the Perceptron. Finally, it is not uncommon for k-means to produce a solution with fewer clusters than originally specified, particularly when the sample size is small. Running the above with 3 clusters produces a 2 cluster solution. Either go with the fewer clusters or try a different similarity measure.
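As a sketch, re-running with 3 clusters only requires rebuilding #class with a different multiplier and repeating the while loop; counting the distinct classes afterwards shows whether any clusters emptied out:

-- Sketch: restart with 3 initial clusters, then re-run the while loop above
drop table #class
select id, cast(3*dbo.fx_rand() as int) class into #class from #data group by id
-- after the loop finishes, check how many clusters actually remain
select count(distinct class) clusters from #class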
