The process starts with a table of transactions (these could be sales orders, documents, or patients) composed of items (goods, words, diagnoses) and returns a table of item sets. Note that the list records only whether an item was in the transaction, not how many: if transaction one contained four loaves of bread and two gallons of milk, it shows up simply as bread and milk.
create table #trans (id int, item varchar(24))
insert into #trans select 1,'bread'
insert into #trans select 1,'milk'
insert into #trans select 1,'formula'
insert into #trans select 1,'diapers'
insert into #trans select 1,'beer'
insert into #trans select 2,'bread'
insert into #trans select 2,'diapers'
insert into #trans select 2,'beer'
insert into #trans select 3,'diapers'
insert into #trans select 3,'beer'
insert into #trans select 4,'diapers'
insert into #trans select 5,'beer'
insert into #trans select 5,'bread'
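The presence-only convention can be sketched in Python (an illustration alongside the SQL, not part of it): even if a transaction lists the same item several times, it contributes only one entry.

```python
# Presence-only view of a transaction: quantities are ignored.
# A hypothetical raw transaction with repeated items:
raw_transaction = ["bread", "bread", "bread", "bread", "milk", "milk"]

# Deduplicate: only whether an item appeared matters, not how many.
items_present = sorted(set(raw_transaction))
print(items_present)  # ['bread', 'milk']
```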
Next, define and set the support level, which is the minimum number of occurrences an item set must have to be included in the associated-items list. For this toy example it is set at two; in a real-world example it should be set at a level consistent with the database size.
declare @support int
set @support = 2
The process first generates all single-item sets that meet the support criterion, then uses that list to create the two-item sets, and continues until the maximum item-set size is reached. This example generates sets of up to 5 associated items. For each set size, step one is to generate all acceptable combinations of items by joining the previous list of associated item sets to the single-item list and saving the acceptable combinations. An acceptable combination is a set of distinct items that is unique regardless of sequence: for two items A and B, AB and BA count as the single set AB, while AA and BB are excluded. Step two is to delete those combinations whose count doesn't meet the support criterion. From a SQL standpoint it would be possible to write a looping program that dynamically generates each query, but as the number of items typically grouped together is modest, writing a query for each set size is easily managed.
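The acceptable-combination rule can be sketched in Python by requiring each appended item to sort strictly after the last item already in the set, which has the same effect as the SQL condition `b.item1 > a.itemN` below; this sketch is an illustration, not the article's implementation.

```python
# Sketch of the candidate-generation step: extend each (k-1)-item set with a
# single frequent item that sorts strictly after the set's last item.  The
# strict ordering guarantees AB and BA are generated once (as AB) and that
# AA is impossible.
def extend_candidates(prev_sets, single_items):
    return [s + (item,)
            for s in prev_sets
            for item in single_items
            if item > s[-1]]

singles = [("beer",), ("bread",), ("diapers",)]
pairs = extend_candidates(singles, ["beer", "bread", "diapers"])
print(pairs)
# [('beer', 'bread'), ('beer', 'diapers'), ('bread', 'diapers')]
```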
--1 item
--list of items by transaction
select id,item item1 into #assoc1 from #trans group by id, item
-- remove items not meeting support level
delete from #assoc1 from #assoc1 x join
(select item1 items from #assoc1 group by item1 having count(1) < @support) y on item1 = items
--2 items
select a.*, b.item1 item2 into #assoc2 from #assoc1 a, #assoc1 b where a.id = b.id and b.item1>a.item1
delete from #assoc2 from #assoc2 x join
(select item1+item2 items from #assoc2 group by item1+item2 having count(1) < @support) y on item1+item2=items
--3 items
select a.*, b.item1 item3 into #assoc3 from #assoc2 a, #assoc1 b where a.id = b.id and b.item1>a.item2
delete from #assoc3 from #assoc3 x join
(select item1+item2+item3 items from #assoc3 group by item1+item2+item3 having count(1) < @support) y on item1+item2+item3=items
--4 items
select a.*, b.item1 item4 into #assoc4 from #assoc3 a, #assoc1 b where a.id = b.id and b.item1>a.item3
delete from #assoc4 from #assoc4 x join
(select item1+item2+item3+item4 items from #assoc4 group by item1+item2+item3+item4 having count(1) < @support) y on item1+item2+item3+item4=items
--5 items
select a.*, b.item1 item5 into #assoc5 from #assoc4 a, #assoc1 b where a.id = b.id and b.item1>a.item4
delete from #assoc5 from #assoc5 x join
(select item1+item2+item3+item4+item5 items from #assoc5 group by item1+item2+item3+item4+item5 having count(1) < @support) y on item1+item2+item3+item4+item5=items
Incorporating additional item lists is a fairly straightforward exercise of copying the prior queries, updating the table names, and adding the additional fields.
The following generates a summary of the associated item lists:
select 5 size, item1,item2,item3,item4,item5, count(1) freq into #assoc from #assoc5 group by item1,item2,item3, item4,item5
union all
select 4, item1,item2,item3,item4,'', count(1) from #assoc4 group by item1,item2,item3, item4
union all
select 3, item1,item2,item3,'','', count(1) from #assoc3 group by item1,item2,item3
union all
select 2, item1,item2,'','','', count(1) from #assoc2 group by item1, item2
union all
select 1, item1,'','','','', count(1) from #assoc1 group by item1
select * from #assoc
size | item1 | item2 | item3 | item4 | item5 | freq |
3 | beer | bread | diapers | | | 2 |
2 | beer | bread | | | | 3 |
2 | beer | diapers | | | | 3 |
2 | bread | diapers | | | | 2 |
1 | beer | | | | | 4 |
1 | bread | | | | | 3 |
1 | diapers | | | | | 4 |
One of the drawbacks of using apriori to generate the associated item sets is the number of combinations a large database produces. The above was tested on 76,013 transactions (patients from Kaggle's Heritage competition) composed of 45 items (diagnoses) for a total of 282,718 records (medical claims, year 1) with a support of 100 on a fairly basic home PC; it generated a table of 7,500 sets of 2 or more associated items in a little over 1 minute.
So many rules raise the question of which are interesting and worth further investigation. One approach is to compare how often an associated item set occurs versus what one would expect by chance. If items are unrelated, the expected probability of them co-occurring is the product of their individual probabilities; if the actual probability is substantially greater than expected, the associated items are more interesting. The following query computes this for the two-item sets and can easily be extended to larger associated item lists:
select top 10 item1, item2, freqab, freqab/freqt Assoc, (freqb/freqt) *(freqc/freqt) Expect ,(freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) Ratio from
(select item1, item2, cast(count(1) as float) freqab from #assoc2 group by item1, item2) a
join
(select item1 itemb, cast(count(1) as float) freqb from #assoc1 group by item1) b on item1 = itemb
join
(select item1 itemc, cast(count(1) as float) freqc from #assoc1 group by item1) c on item2 = itemc,
(select count(distinct id) freqt from #assoc1) d
order by (freqab/freqt) / ((freqb/freqt) *(freqc/freqt)) desc
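The ratio the query computes (actual co-occurrence probability over the product of the individual probabilities, commonly called lift) can be sketched in Python over the toy data; the counts below are taken from the toy example above.

```python
# Lift of a two-item set: P(A and B) / (P(A) * P(B)).
total = 5        # transactions containing at least one frequent item
freq_beer = 4    # transactions with beer
freq_bread = 3   # transactions with bread
freq_both = 3    # transactions with both beer and bread

assoc = freq_both / total                            # actual probability: 0.60
expect = (freq_beer / total) * (freq_bread / total)  # expected if independent: 0.48
ratio = assoc / expect
print(round(ratio, 2))  # 1.25
```

A ratio near 1 means the pair co-occurs about as often as chance predicts; a ratio well above 1 (such as the 7.03 for AMI and CHF below) flags a set worth a closer look.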
Below are the top 10 two-item set results from the Heritage data, diagnosis (item) by patient (transaction), with rather obscure diagnostic titles. I won't go through them all, but for example acute myocardial infarction (AMI) is approximately 7 times more likely to occur in patients diagnosed with congestive heart failure (CHF) than expected. Some of the others are also quite obvious.
item1 | item2 | Freq | Assoc | Expect | Ratio |
AMI | CHF | 473 | 0.62% | 0.09% | 7.03 |
HEART2 | PERVALV | 124 | 0.16% | 0.03% | 6.22 |
SEIZURE | STROKE | 195 | 0.26% | 0.04% | 6.03 |
CHF | HEART2 | 429 | 0.56% | 0.09% | 5.99 |
HEMTOL | RENAL2 | 145 | 0.19% | 0.04% | 5.05 |
CHF | PNEUM | 141 | 0.19% | 0.04% | 5.03 |
AMI | ROAMI | 1324 | 1.74% | 0.37% | 4.77 |
GYNEC1 | PRGNCY | 558 | 0.73% | 0.15% | 4.75 |
HEART4 | STROKE | 223 | 0.29% | 0.06% | 4.72 |
GYNECA | ODaBNCA | 231 | 0.30% | 0.06% | 4.70 |