Tuesday, February 18, 2014

Associated Items Using the Apriori Algorithm

Apriori is an association rule learning algorithm for mining frequent item sets; extremely useful in identifying which items sell/occur together and is doubtlessly employed by the likes of Amazon.  Perhaps the most famous example of this is the (possibly mythical) discovery of the association of beer and diaper sales at a convenience store.  The algorithm proceeds by identifying all of the items (e.g. products)  in a transaction (sales order) data-set that meet a minimum frequency criteria (aka support) .  The next step is to match the list back to the single item list by transaction to identify associated item groups that meet the support criteria.  This processes is repeated extending the associated item list until either the maximum list size is met or the results list is empty.  Wikipedia (and many other sources)  provide a more comprehensive explanation for the interested.

The processes starts with a table of transactions (this could be sales orders, documents, patients) which are composed of items (goods, words, diagnosis) and returns a table of items sets.  Note the list is not how many of an item is in the transaction but merely whether it was in the transaction, so if in transaction one there were 4 loaves of bread and two gallons of milk it would show up as only bread and milk.

create table #trans (id int, item varchar(24))

insert into #trans select 1,'bread'
insert into #trans select 1,'milk'
insert into #trans select 1,'formula'
insert into #trans select 1,'diapers'
insert into #trans select 1,'beer'
insert into #trans select 2,'bread'
insert into #trans select 2,'diapers'
insert into #trans select 2,'beer'
insert into #trans select 3,'diapers'
insert into #trans select 3,'beer'
insert into #trans select 4,'diapers'
insert into #trans select 5,'beer'
insert into #trans select 5,'bread'

Next, define and set the support level which is the minimum number of occurrences a rule should have to be included in associated items list.  For the purposes of this toy example it is set at two; in a real world example it should be set at a level consistent with the database size.

declare @support int
set @support = 2

The process will first generate all single item sets that meet the support criteria, then using that list will create the two item sets and continue the process until the maximum number of items set is met. This example will generate sets of up to 5 associated items.  For each set of associated items, step one is to generate all acceptable combinations of items by joining the last set of associated items to the single item list and saving the acceptable combinations.  An acceptable combination is a set of items where all are different and regardless of sequence the set is unique.  In other words in a two item starting with A and B,  set AB and BA are just AB while AA and BB are excluded.  The second step is to delete those combinations where the count doesn't meet the support criteria. From an SQL standpoint it would be possible to create a looping program that dynamically generates the query, but as the number of associated items generally grouped is modest, creating a query for each associated item list is easily managed.  

--1 item
--list of items by transaction
select  id,item item1  into #assoc1 from #trans group by id, item

-- remove items not meeting support level
delete from #assoc1  from #assoc1  x join
(select item1 items from #assoc1 group by item1 having count(1)<=@support) y  on item1 = items

--2 items
select a.*, b.item1 item2 into #assoc2 from #assoc1 a, #assoc1 b where a.id = b.id and b.item1>a.item1

delete  from #assoc2  from #assoc2 x join   
(select item1+item2 items from #assoc2 group by item1+item2 having count(1)<=@support) y on item1+item2=items

--3 items
select a.*, b.item1 item3 into #assoc3  from #assoc2 a, #assoc1 b where a.id = b.id and b.item1>a.item2

delete  from #assoc3  from #assoc3  x join 
(select item1+item2+item3  items from #assoc3 group by item1+item2+item3 having count(1)<=@support) y  on item1+item2+item3=items

--4 items
select a.*, b.item1 item4 into #assoc4 from #assoc3 a, #assoc1 b where a.id = b.id and b.item1>a.item3

delete  from #assoc4 from  #assoc4 x join 
(select item1+item2+item3+item4 items from #assoc4 group by item1+item2+item3+item4 having count(1)<=@support) y on item1+item2+item3+item4=items

--5 items
select a.*, b.item1 item5 into #assoc5 from #assoc4 a, #assoc1 b where a.id = b.id and b.item1>a.item4
delete  from #assoc5 from  #assoc5 x join 
(select item1+item2+item3+item4+item5 items from #assoc5 group by item1+item2+item3+item4+item5 having count(1)<=@support) y on item1+item2+item3+item4+item5=items

Incorporating additional item lists is a fairly straight forward exercise of copying the prior queries, updating the table names and adding the additional fields.

The following generates a summary of the associated item lists:

select 5 size, item1,item2,item3,item4,item5, count(1) freq into #assoc from #assoc5 group by item1,item2,item3, item4,item5
union all
select 4, item1,item2,item3,item4,'', count(1) from #assoc4 group by item1,item2,item3, item4
union all
select 3, item1,item2,item3,'','', count(1) from #assoc3 group by item1,item2,item3
union all
select 2, item1,item2,'','','', count(1) from #assoc2 group by item1, item2
union all
select 1, item1,'','','','', count(1) from #assoc1  group by item1


select * from #assoc

size item1 item2 item3 item4 item5 freq
3 beer bread diapers

2
2 beer bread


3
2 beer diapers


3
2 bread diapers


2
1 beer



4
1 bread



3
1 diapers



4

One of the drawbacks of using apriori to generate the associated item sets is the number of permutations in large databases.  The above was tested on 76,013 transactions (patients  from Kaggle's Heritage competition) composed of 45 items (diagnosis) for a total of 282,718 records (medical claims year 1) with a support of 100 using a fairly basic home PC; it generated a table of 7,500 sets of 2  or more associated items in a little over 1 minute.

To have so many rules begs the question of what is interesting and worth further investigation.  One approach is to identify how often an associated item set occurs versus what one would expect if it was random. If items are unrelated the expected probability of them co-occurring would be the product of the individual probabilities; if the actual probability is substantially greater than expected, then the associated items are more interesting.  The following query approximates the two item solution and can easily be extended to larger associated item lists:

select top 10  item1, item2, freqab, freqab/freqt Assoc, (freqb/freqt) *(freqc/freqt) Expect ,(freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) Ratio from 
(select item1, item2, cast(count(1) as float) freqab  from #assoc2  group by item1, item2) a
join
(select item1 itemb, cast(count(1) as float) freqb from #assoc1 group by item1) b  on item1 = itemb
join 
(select item1 itemc, cast(count(1) as float) freqc from #assoc1 group by item1) c on item2 = itemc,
(select count(distinct id) freqt from #assoc1) d
order by (freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) desc

Below are the top 10 two item set results from the Heritage data diagnosis (item) by patient (transaction) with rather obscure diagnostic titles.   I won't go through them, but for example myocardial infarction (AMI) is approximately 7 times more likely to occur in patient diagnosed with congestive heart failure  (CHF) than expected.  Some of the others are also quite obvious.

item1 item2 Freq Assoc Expect Ratio
AMI CHF 473 0.62% 0.09% 7.03
HEART2 PERVALV 124 0.16% 0.03% 6.22
SEIZURE STROKE 195 0.26% 0.04% 6.03
CHF HEART2 429 0.56% 0.09% 5.99
HEMTOL RENAL2 145 0.19% 0.04% 5.05
CHF PNEUM 141 0.19% 0.04% 5.03
AMI ROAMI 1324 1.74% 0.37% 4.77
GYNEC1 PRGNCY 558 0.73% 0.15% 4.75
HEART4 STROKE 223 0.29% 0.06% 4.72
GYNECA ODaBNCA 231 0.30% 0.06% 4.70