Wednesday, March 19, 2014

Clustering Two by Two

An earlier post covered the K-Means clustering algorithm; in this post I will cover hierarchical agglomerative clustering (HAC). The HAC algorithm starts with each observation as its own cluster and successively merges the closest (or most similar) pair until all observations belong to a single cluster. There are several ways to judge which two clusters are closest: the distance between the cluster averages (average linkage), between the furthest members (complete linkage), or between the nearest members (single linkage). This post uses the nearest-members, or single linkage, approach.

The first step is to create a table of data. Then compute the distances between observations, which will be used to successively cluster them. The last step is displaying the results. Programs like R offer neat dendrogram graphics not available in SQL; to work around this, the post offers code that dynamically builds a table of cluster assignments for each pass. Not ideal, but passable.

To keep it easy and visual, each observation has two dimensions, randomly drawn on a 0-100 scale. The first query creates the data table, and the loop assigns values to the dimensions X1 and X2, displayed below. Note your results will differ because X1 and X2 are randomly assigned each time the program is run.

create table #data (id int, x1 float, x2 float)

-- dbo.fx_rand() is a user-defined wrapper returning a random float between 0 and 1
declare @id int
set @id = 1
while @id <= 12
begin
    insert into #data select @id, round(dbo.fx_rand()*100,0), round(dbo.fx_rand()*100,0)
    set @id = @id + 1
end

select x1,x2,id from #data
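As an aside, the loop can be replaced with a single set-based insert. A sketch, using the `rand(checksum(newid()))` idiom to re-seed the random number generator per row (plain `rand()` returns one value per statement, which is why the loop above works but a single multi-row insert would not):

```sql
-- Set-based alternative to the loop above (a sketch):
-- rand(checksum(newid())) yields a fresh random float for every row
insert into #data
select n,
       round(rand(checksum(newid()))*100, 0),
       round(rand(checksum(newid()))*100, 0)
from (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) v(n)
```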



Distance is computed between all observation pairs using Euclidean distance by joining the data table to itself on the observation id. Redundant distances (A to B = B to A) and irrelevant pairs (A to A) are eliminated by requiring that the id in the first table be greater than the id in the second. Other measures of distance could be used, as could the inverse of a similarity measure; similar items are usually defined as near each other (low distance) and dissimilar items as far apart (high distance). The cluster ids are overwritten as clusters are merged and are therefore archived in the aid2 and bid2 fields.

create table #distance (aid int, bid int,aid2 int, bid2 int, distance float )

insert into #distance
select a.id, b.id, a.id, b.id,
       power(power(a.x1-b.x1,2) + power(a.x2-b.x2,2), .5) distance
from #data a
join #data b on a.id > b.id
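The `power(..., .5)` expression is just a square root; SQL Server's `sqrt()` and `square()` built-ins state the same Euclidean distance a little more directly:

```sql
-- Equivalent Euclidean distance using sqrt()/square() instead of power()
insert into #distance
select a.id, b.id, a.id, b.id,
       sqrt(square(a.x1 - b.x1) + square(a.x2 - b.x2)) as distance
from #data a
join #data b on a.id > b.id
```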

The clusters are stored in the #hierarchy table as a parent-child relationship. Starting with the individual observations as parent ids (and 0 as the child id), the algorithm finds the nearest pair of clusters, inserts them into the hierarchy table as children, and assigns the next sequential id as their parent. In the distance table, the ids of the merged clusters are updated to the new parent id. This ensures that non-members of a cluster are compared against the nearest member within that cluster (the single-linkage criterion) and saves the time of recalculating the distance matrix. The algorithm stops when all rows in the distance table share the same cluster id.

create table #hierarchy( cid int, pid int, pass int)

declare @aid int, @bid int, @pid int, @pass int
set @pass = 0
insert into #hierarchy (cid,pid, pass) select 0 ,id,@pass from #data

while 1=1
begin
    set @pass = @pass + 1
    -- nearest pair of distinct clusters; newid() breaks ties at random
    select top 1 @aid = aid, @bid = bid from #distance where aid <> bid order by distance, newid()
    if @@rowcount = 0 break

    select @pid = max(pid) + 1 from #hierarchy
    insert into #hierarchy (cid, pid, pass) select @aid, @pid, @pass
    insert into #hierarchy (cid, pid, pass) select @bid, @pid, @pass

    -- relabel both merged clusters with the new parent id
    update #distance set aid = @pid where aid in (@aid, @bid)
    update #distance set bid = @pid where bid in (@aid, @bid)
end
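Before building the display table, the merge order can be checked directly. A quick query against the hierarchy, listing which two clusters were joined at each pass:

```sql
-- One row per merge: the pass number, the new parent cluster,
-- and the two child clusters it absorbed
select pass, pid, min(cid) as first_child, max(cid) as second_child
from #hierarchy
where pass > 0
group by pass, pid
order by pass
```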

The next problem is how to display the results. As noted earlier, the usual way to display HAC is a snappy graphic known as a dendrogram, but since T-SQL offers no graphics, an alternate solution is a table that shows the clustering through the parent ids. The following query uses dynamic SQL to build that table, adding a column for each pass populated with the ids of the parent clusters.

create table #list (p0 int)

insert into #list select pid from #hierarchy where pass = 0 

declare @pass int
declare @sql varchar(200)
set @pass = 1
while @pass <= (select max(pass) from #hierarchy)
begin
    set @sql = 'alter table #list add p' + cast(@pass as varchar) + ' int'
    exec (@sql)

    set @sql = 'update l set p' + cast(@pass as varchar) + ' =
        case when pid is null then p' + cast(@pass-1 as varchar) + ' else pid end
        from #list l
        left join #hierarchy b
        on p' + cast(@pass-1 as varchar) + ' = cid and pass = ' + cast(@pass as varchar)
    exec (@sql)

    set @pass = @pass + 1
end

-- 12 observations yield 11 passes; adjust the column list for other sizes
select * from #list
order by p11, p10, p9, p8, p7, p6, p5, p4, p3, p2, p1, p0
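The order by above is hardwired to eleven passes (twelve observations). A sketch of building the same ordering dynamically, so the display query survives a change in the number of observations:

```sql
-- Build 'order by pN, ..., p1, p0' from the actual number of passes
declare @maxpass int, @order varchar(max)
select @maxpass = max(pass) from #hierarchy
set @order = ''
while @maxpass >= 0
begin
    set @order = @order + 'p' + cast(@maxpass as varchar) + ','
    set @maxpass = @maxpass - 1
end
set @order = left(@order, len(@order) - 1)  -- drop the trailing comma
exec ('select * from #list order by ' + @order)
```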

p0  p1  p2  p3  p4  p5  p6  p7  p8  p9  p10 p11
 8   8   8   8   8   8   8  19  19  19  19  23
12  12  12  12  12  12  18  19  19  19  19  23
 5   5  14  14  14  14  18  19  19  19  19  23
 6  13  14  14  14  14  18  19  19  19  19  23
 9  13  14  14  14  14  18  19  19  19  19  23
11  11  11  11  11  11  11  11  11  11  22  23
 7   7   7   7  16  16  16  16  16  21  22  23
10  10  10  10  16  16  16  16  16  21  22  23
 2   2   2   2   2   2   2   2  20  21  22  23
 3   3   3   3   3  17  17  17  20  21  22  23
 1   1   1  15  15  17  17  17  20  21  22  23
 4   4   4  15  15  17  17  17  20  21  22  23

The results conform with the visually evident clustering above. Observations 6 & 9 cluster immediately, joined next by 5, while 1 & 4 start a new cluster; all are clearly close together in the chart above. Ultimately the data set forms two clusters, one with 5, 6, 8, 9 & 12 and the other with the balance, again consistent with the visual expectations.
