The first step is to create a table of data. The next is to compute the distances that will be used to successively cluster the observations. The last step is displaying the results. Programs like R offer neat dendrogram graphics that are not available in SQL; to overcome this, the post offers code that dynamically creates a table of clusters for each pass. Not ideal, but passable.
To keep it easy and visual, each observation has two dimensions that are randomly determined on a 0-100 scale. The first query creates the data table, while the loop assigns a value to each of the dimensions X1 and X2; the result is displayed below. Note that your results will differ because X1 and X2 are randomly assigned each time the program is run.
-- twelve observations, each with two random dimensions on a 0-100 scale
create table #data (id int, x1 float, x2 float)
declare @id int
set @id = 1
while @id <= 12
begin
    insert into #data select @id, round(dbo.fx_rand()*100,0), round(dbo.fx_rand()*100,0)
    set @id = @id + 1
end
select x1, x2, id from #data
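The dbo.fx_rand() helper is not defined in the post. Since SQL Server does not allow RAND() to be called directly inside a user-defined function, a common workaround is to read it through a view; a minimal sketch of what such a helper might look like (the view name here is hypothetical):

-- assumption: fx_rand is a thin wrapper like this
create view vw_rand as select rand() as value
go
create function dbo.fx_rand() returns float
as
begin
    -- RAND() is a side-effecting operator, so it is read through the view
    return (select value from vw_rand)
end
go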
Distance is computed between all observation pairs using Euclidean distance by joining the data table to itself on the observation id. Redundant distances (A to B = B to A) and irrelevant pairs (A to A) are eliminated by specifying that the id in the first table is greater than the id in the second. Other measures of distance could be used, as could the inverse of a similarity measure; similar items are usually defined as near to each other (low distance) and dissimilar items as far apart (high distance). The cluster ids are overwritten as clusters are merged, so the original observation ids are archived in the aid2 and bid2 fields.
-- pairwise euclidean distance; a.id > b.id drops self-pairs and mirror duplicates
create table #distance (aid int, bid int, aid2 int, bid2 int, distance float)
insert into #distance
select a.id, b.id, a.id, b.id,
    power(power(a.x1-b.x1,2) + power(a.x2-b.x2,2), .5) distance
from #data a join #data b on a.id > b.id
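As an example of the alternative measures alluded to above, Manhattan (city-block) distance only changes the distance expression; a sketch, not from the original post:

insert into #distance
select a.id, b.id, a.id, b.id,
    abs(a.x1-b.x1) + abs(a.x2-b.x2) distance  -- city-block instead of euclidean
from #data a join #data b on a.id > b.id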
The clusters are stored in the #hierarchy table as a parent-child relationship. Starting with the individual observations as parent ids (and 0 as the child id), the algorithm finds the nearest pair of clusters, inserts them into the hierarchy table as children, and assigns the next sequential id as their parent. In the distance table, the ids of the merged clusters are updated to the new parent id. This ensures that nonmembers of a cluster are compared to their nearest neighbor within the cluster (in effect, single linkage) and saves the time of recalculating the distance matrix. The algorithm stops when every row in the distance table carries the same cluster id.
create table #hierarchy (cid int, pid int, pass int)
declare @aid int, @bid int, @pid int, @pass int
set @pass = 0
-- seed: each observation is its own cluster at pass 0
insert into #hierarchy (cid, pid, pass) select 0, id, @pass from #data
while 1=1
begin
    set @pass = @pass + 1
    -- nearest remaining pair of distinct clusters; newid() breaks ties at random
    select top 1 @aid = aid, @bid = bid from #distance where aid <> bid order by distance, newid()
    if @@rowcount = 0 break -- every pair shares one cluster id: done
    -- record both clusters as children of the next sequential parent id
    select @pid = max(pid) + 1 from #hierarchy
    insert into #hierarchy (cid, pid, pass) select @aid, @pid, @pass
    insert into #hierarchy (cid, pid, pass) select @bid, @pid, @pass
    -- relabel the merged clusters in the distance table
    update #distance set aid = @pid where aid in (@aid, @bid)
    update #distance set bid = @pid where bid in (@aid, @bid)
end
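To sanity-check the merge order before building the display, a quick query (not in the original post) shows which two cluster ids merged at each pass:

-- each pass inserts two child rows that share the new parent id
select pass, pid as parent, min(cid) as child_a, max(cid) as child_b
from #hierarchy
where pass > 0
group by pass, pid
order by pass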
The next problem is how to display the results. As noted earlier, the usual way to display HAC is through a snappy graphic known as a dendrogram, but since T-SQL offers no graphics, an alternate solution is to create a table that shows the clustering using the parent ids. The following query uses dynamic SQL to build a table, adding a column for each pass populated with the ids of the parent clusters.
-- one row per observation; p0 holds the original observation id
create table #list (p0 int)
insert into #list select pid from #hierarchy where pass = 0
declare @pass int
declare @sql varchar(200)
set @pass = 1
while @pass <= (select max(pass) from #hierarchy)
begin
    -- add a column for this pass
    set @sql = 'alter table #list add p' + cast(@pass as varchar) + ' int'
    exec (@sql)
    -- carry the prior pass's id forward unless the cluster merged this pass
    set @sql = 'update l set p' + cast(@pass as varchar) + ' =
        case when pid is null then p' + cast(@pass-1 as varchar) + ' else pid end from #list l
        left join #hierarchy b
        on p' + cast(@pass-1 as varchar) + ' = cid and pass = ' + cast(@pass as varchar)
    exec (@sql)
    set @pass = @pass + 1
end
-- with 12 observations there are 11 merge passes, hence columns p11 down to p0
select * from #list
order by p11, p10, p9, p8, p7, p6, p5, p4, p3, p2, p1, p0
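The hardcoded ORDER BY only works for this 12-observation run; a sketch of a version that builds the column list for any number of passes:

-- assemble 'p11, p10, ..., p0' from the actual max pass
declare @cols varchar(max), @i int
select @i = max(pass) from #hierarchy
set @cols = ''
while @i >= 0
begin
    set @cols = @cols + 'p' + cast(@i as varchar) + case when @i > 0 then ', ' else '' end
    set @i = @i - 1
end
exec ('select * from #list order by ' + @cols)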
p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | p10 | p11 |
8 | 8 | 8 | 8 | 8 | 8 | 8 | 19 | 19 | 19 | 19 | 23 |
12 | 12 | 12 | 12 | 12 | 12 | 18 | 19 | 19 | 19 | 19 | 23 |
5 | 5 | 14 | 14 | 14 | 14 | 18 | 19 | 19 | 19 | 19 | 23 |
6 | 13 | 14 | 14 | 14 | 14 | 18 | 19 | 19 | 19 | 19 | 23 |
9 | 13 | 14 | 14 | 14 | 14 | 18 | 19 | 19 | 19 | 19 | 23 |
11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 22 | 23 |
7 | 7 | 7 | 7 | 16 | 16 | 16 | 16 | 16 | 21 | 22 | 23 |
10 | 10 | 10 | 10 | 16 | 16 | 16 | 16 | 16 | 21 | 22 | 23 |
2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 20 | 21 | 22 | 23 |
3 | 3 | 3 | 3 | 3 | 17 | 17 | 17 | 20 | 21 | 22 | 23 |
1 | 1 | 1 | 15 | 15 | 17 | 17 | 17 | 20 | 21 | 22 | 23 |
4 | 4 | 4 | 15 | 15 | 17 | 17 | 17 | 20 | 21 | 22 | 23 |
The results conform with the visually evident clustering above. Immediately 6 & 9 cluster (pass 1), followed by 5 (pass 2), while next 1 & 4 start a new cluster (pass 3); all are clearly close together in the chart above. Ultimately the data set forms two clusters, one with 5, 6, 8, 9 & 12 and the other with the balance, which is again consistent with the visual expectations.
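The final split can also be read directly off #list; for this run the two-cluster cut lives in column p10 (a convenience query, not from the original post):

select p10 as cluster, p0 as observation
from #list
order by p10, p0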