The processes starts with a table of transactions (this could be sales orders, documents, patients) which are composed of items (goods, words, diagnosis) and returns a table of items sets. Note the list is not how many of an item is in the transaction but merely whether it was in the transaction, so if in transaction one there were 4 loaves of bread and two gallons of milk it would show up as only bread and milk.
create table #trans (id int, item varchar(24))
insert into #trans select 1,'bread'
insert into #trans select 1,'milk'
insert into #trans select 1,'formula'
insert into #trans select 1,'diapers'
insert into #trans select 1,'beer'
insert into #trans select 2,'bread'
insert into #trans select 2,'diapers'
insert into #trans select 2,'beer'
insert into #trans select 3,'diapers'
insert into #trans select 3,'beer'
insert into #trans select 4,'diapers'
insert into #trans select 5,'beer'
insert into #trans select 5,'bread'
Next, define and set the support level which is the minimum number of occurrences a rule should have to be included in associated items list. For the purposes of this toy example it is set at two; in a real world example it should be set at a level consistent with the database size.
declare @support int
set @support = 2
The process will first generate all single item sets that meet the support criteria, then using that list will create the two item sets and continue the process until the maximum number of items set is met. This example will generate sets of up to 5 associated items. For each set of associated items, step one is to generate all acceptable combinations of items by joining the last set of associated items to the single item list and saving the acceptable combinations. An acceptable combination is a set of items where all are different and regardless of sequence the set is unique. In other words in a two item starting with A and B, set AB and BA are just AB while AA and BB are excluded. The second step is to delete those combinations where the count doesn't meet the support criteria. From an SQL standpoint it would be possible to create a looping program that dynamically generates the query, but as the number of associated items generally grouped is modest, creating a query for each associated item list is easily managed.
The process will first generate all single item sets that meet the support criteria, then using that list will create the two item sets and continue the process until the maximum number of items set is met. This example will generate sets of up to 5 associated items. For each set of associated items, step one is to generate all acceptable combinations of items by joining the last set of associated items to the single item list and saving the acceptable combinations. An acceptable combination is a set of items where all are different and regardless of sequence the set is unique. In other words in a two item starting with A and B, set AB and BA are just AB while AA and BB are excluded. The second step is to delete those combinations where the count doesn't meet the support criteria. From an SQL standpoint it would be possible to create a looping program that dynamically generates the query, but as the number of associated items generally grouped is modest, creating a query for each associated item list is easily managed.
--1 item
--list of items by transaction
select id,item item1 into #assoc1 from #trans group by id, item
-- remove items not meeting support level
select id,item item1 into #assoc1 from #trans group by id, item
-- remove items not meeting support level
delete from #assoc1 from #assoc1 x join
(select item1 items from #assoc1 group by item1 having count(1)<=@support) y on item1 = items
--2 items
select a.*, b.item1 item2 into #assoc2 from #assoc1 a, #assoc1 b where a.id = b.id and b.item1>a.item1
delete from #assoc2 from #assoc2 x join
(select item1+item2 items from #assoc2 group by item1+item2 having count(1)<=@support) y on item1+item2=items
--3 items
select a.*, b.item1 item3 into #assoc3 from #assoc2 a, #assoc1 b where a.id = b.id and b.item1>a.item2
delete from #assoc3 from #assoc3 x join
(select item1+item2+item3 items from #assoc3 group by item1+item2+item3 having count(1)<=@support) y on item1+item2+item3=items
--4 items
select a.*, b.item1 item4 into #assoc4 from #assoc3 a, #assoc1 b where a.id = b.id and b.item1>a.item3
delete from #assoc4 from #assoc4 x join
(select item1+item2+item3+item4 items from #assoc4 group by item1+item2+item3+item4 having count(1)<=@support) y on item1+item2+item3+item4=items
--5 items
select a.*, b.item1 item5 into #assoc5 from #assoc4 a, #assoc1 b where a.id = b.id and b.item1>a.item4
delete from #assoc5 from #assoc5 x join
(select item1+item2+item3+item4+item5 items from #assoc5 group by item1+item2+item3+item4+item5 having count(1)<=@support) y on item1+item2+item3+item4+item5=items
Incorporating additional item lists is a fairly straight forward exercise of copying the prior queries, updating the table names and adding the additional fields.
The following generates a summary of the associated item lists:
select 5 size, item1,item2,item3,item4,item5, count(1) freq into #assoc from #assoc5 group by item1,item2,item3, item4,item5
union all
select 4, item1,item2,item3,item4,'', count(1) from #assoc4 group by item1,item2,item3, item4
union all
select 3, item1,item2,item3,'','', count(1) from #assoc3 group by item1,item2,item3
union all
select 2, item1,item2,'','','', count(1) from #assoc2 group by item1, item2
union all
select 1, item1,'','','','', count(1) from #assoc1 group by item1
Incorporating additional item lists is a fairly straight forward exercise of copying the prior queries, updating the table names and adding the additional fields.
The following generates a summary of the associated item lists:
select 5 size, item1,item2,item3,item4,item5, count(1) freq into #assoc from #assoc5 group by item1,item2,item3, item4,item5
union all
select 4, item1,item2,item3,item4,'', count(1) from #assoc4 group by item1,item2,item3, item4
union all
select 3, item1,item2,item3,'','', count(1) from #assoc3 group by item1,item2,item3
union all
select 2, item1,item2,'','','', count(1) from #assoc2 group by item1, item2
union all
select 1, item1,'','','','', count(1) from #assoc1 group by item1
select * from #assoc
size | item1 | item2 | item3 | item4 | item5 | freq |
3 | beer | bread | diapers | 2 | ||
2 | beer | bread | 3 | |||
2 | beer | diapers | 3 | |||
2 | bread | diapers | 2 | |||
1 | beer | 4 | ||||
1 | bread | 3 | ||||
1 | diapers | 4 |
One of the drawbacks of using apriori to generate the associated item sets is the number of permutations in large databases. The above was tested on 76,013 transactions (patients from Kaggle's Heritage competition) composed of 45 items (diagnosis) for a total of 282,718 records (medical claims year 1) with a support of 100 using a fairly basic home PC; it generated a table of 7,500 sets of 2 or more associated items in a little over 1 minute.
To have so many rules begs the question of what is interesting and worth further investigation. One approach is to identify how often an associated item set occurs versus what one would expect if it was random. If items are unrelated the expected probability of them co-occurring would be the product of the individual probabilities; if the actual probability is substantially greater than expected, then the associated items are more interesting. The following query approximates the two item solution and can easily be extended to larger associated item lists:
select top 10 item1, item2, freqab, freqab/freqt Assoc, (freqb/freqt) *(freqc/freqt) Expect ,(freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) Ratio from
(select item1, item2, cast(count(1) as float) freqab from #assoc2 group by item1, item2) a
join
(select item1 itemb, cast(count(1) as float) freqb from #assoc1 group by item1) b on item1 = itemb
join
(select item1 itemc, cast(count(1) as float) freqc from #assoc1 group by item1) c on item2 = itemc,
(select count(distinct id) freqt from #assoc1) d
order by (freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) desc
Below are the top 10 two item set results from the Heritage data diagnosis (item) by patient (transaction) with rather obscure diagnostic titles. I won't go through them, but for example myocardial infarction (AMI) is approximately 7 times more likely to occur in patient diagnosed with congestive heart failure (CHF) than expected. Some of the others are also quite obvious.
item1 | item2 | Freq | Assoc | Expect | Ratio |
AMI | CHF | 473 | 0.62% | 0.09% | 7.03 |
HEART2 | PERVALV | 124 | 0.16% | 0.03% | 6.22 |
SEIZURE | STROKE | 195 | 0.26% | 0.04% | 6.03 |
CHF | HEART2 | 429 | 0.56% | 0.09% | 5.99 |
HEMTOL | RENAL2 | 145 | 0.19% | 0.04% | 5.05 |
CHF | PNEUM | 141 | 0.19% | 0.04% | 5.03 |
AMI | ROAMI | 1324 | 1.74% | 0.37% | 4.77 |
GYNEC1 | PRGNCY | 558 | 0.73% | 0.15% | 4.75 |
HEART4 | STROKE | 223 | 0.29% | 0.06% | 4.72 |
GYNECA | ODaBNCA | 231 | 0.30% | 0.06% | 4.70 |