I originally intended to start with a post covering the basics of data and data analysis, but decided that the exciting nature of the topic might lose me a reader or two. Instead, I'll start with one of the simpler, but nevertheless robust, classification tools: Naive Bayes. It has a wide variety of applications and, despite its simplicity, does reasonably well against far more complicated algorithms. The version I will create here is for categorical, binary data (an attribute value is present or not, rather than counted by how frequently it occurs, as in, say, a text document). The SQL is easily adaptable to different data sets and will handle thousands of records and attributes.
If you Google "Naive Bayes" you will get 742,000+ references, many of which provide a solid explanation of the theoretical underpinnings. The algorithm exploits Bayes' theorem: the probability of a class given an attribute equals the probability of the class, times the probability of the attribute given the class, divided by the probability of the attribute. NB ignores the probability of the attribute by replacing "equals" with "is proportional to" and selects the highest-scoring class. I'll build an example in an OpenOffice spreadsheet and then convert it to SQL.
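In symbols, for a class c and attribute values a1 ... an:

P(c | a1 ... an) ∝ P(c) × P(a1 | c) × ... × P(an | c)

The "naive" part is the assumption that the attributes are independent given the class, which is what lets the conditional probabilities simply be multiplied together.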
Assume a table of data. Each row is an observation and the columns are attributes with multiple values; one of the attributes is the variable to be predicted, known as the class. To illustrate the algorithm I used the Lens data set from the UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml/datasets/Lenses ) because it is a small data set with categorical data and three classes. In this example age, prescription, astigmatic and tears are attributes and lens is the class. There is no reason you couldn't pick age as the class and use lens as an attribute.
id | age | prescription | astigmatic | tears | lens |
1 | young | myope | no | reduced | none |
2 | young | myope | no | normal | soft |
3 | young | myope | yes | reduced | none |
4 | young | myope | yes | normal | hard |
5 | young | hypermetrope | no | reduced | none |
6 | young | hypermetrope | no | normal | soft |
7 | young | hypermetrope | yes | reduced | none |
8 | young | hypermetrope | yes | normal | hard |
9 | pre-presbyopic | myope | no | reduced | none |
10 | pre-presbyopic | myope | no | normal | soft |
11 | pre-presbyopic | myope | yes | reduced | none |
12 | pre-presbyopic | myope | yes | normal | hard |
13 | pre-presbyopic | hypermetrope | no | reduced | none |
14 | pre-presbyopic | hypermetrope | no | normal | soft |
15 | pre-presbyopic | hypermetrope | yes | reduced | none |
16 | pre-presbyopic | hypermetrope | yes | normal | none |
17 | presbyopic | myope | no | reduced | none |
18 | presbyopic | myope | no | normal | none |
19 | presbyopic | myope | yes | reduced | none |
20 | presbyopic | myope | yes | normal | hard |
21 | presbyopic | hypermetrope | no | reduced | none |
22 | presbyopic | hypermetrope | no | normal | soft |
23 | presbyopic | hypermetrope | yes | reduced | none |
24 | presbyopic | hypermetrope | yes | normal | none |
The first step is to estimate the probability of each class (aka the prior). As there are three classes, a no-evidence guess would be 33.3% for each class. With data, however, the distribution can be estimated using a simple pivot table: select the data table above as the range, set "lens" (the class) as the column field and a count of "id" as the data field, then change the displayed value type to "percent of row".
lens |
hard | none | soft | Total Result |
16.67% | 62.50% | 20.83% | 100.00% |
Using the scoring rule described above, the highest-scoring class is "none", and picking it is right 62.5% of the time.
The second step is to calculate the probability of an attribute given a class. Again, use a pivot table with the data range as above, with "lens" as the column field and "age" as the row field; keep the count of "id" as the data field but change the displayed value type to "percent of column".
Count - id | lens | |||
age | hard | none | soft | Total Result |
pre-presbyopic | 25.00% | 33.33% | 40.00% | 33.33% |
presbyopic | 25.00% | 40.00% | 20.00% | 33.33% |
young | 50.00% | 26.67% | 40.00% | 33.33% |
Total Result | 100.00% | 100.00% | 100.00% | 100.00% |
Repeat this last step for the remaining attributes and arrange the tables in a spreadsheet as follows, labeling the first row "prior".
lens |
hard | none | soft | Total Result | |
prior | 16.67% | 62.50% | 20.83% | 100.00% |
Count - id | lens | |||
age | hard | none | soft | Total Result |
pre-presbyopic | 25.00% | 33.33% | 40.00% | 33.33% |
presbyopic | 25.00% | 40.00% | 20.00% | 33.33% |
young | 50.00% | 26.67% | 40.00% | 33.33% |
Total Result | 100.00% | 100.00% | 100.00% | 100.00% |
Count - id | lens | |||
prescription | hard | none | soft | Total Result |
hypermetrope | 25.00% | 53.33% | 60.00% | 50.00% |
myope | 75.00% | 46.67% | 40.00% | 50.00% |
Total Result | 100.00% | 100.00% | 100.00% | 100.00% |
Count - id | lens | |||
astigmatic | hard | none | soft | Total Result |
no | 0.00% | 46.67% | 100.00% | 50.00% |
yes | 100.00% | 53.33% | 0.00% | 50.00% |
Total Result | 100.00% | 100.00% | 100.00% | 100.00% |
Count - id | lens | |||
tears | hard | none | soft | Total Result |
normal | 100.00% | 20.00% | 100.00% | 50.00% |
reduced | 0.00% | 80.00% | 0.00% | 50.00% |
Total Result | 100.00% | 100.00% | 100.00% | 100.00% |
This is the "knowledge base" for the NB classifier. An observation is classified by multiplying the prior probability for a class by the probabilities of each attribute value given that class. For example, consider a subject whose age is presbyopic, prescription is hypermetrope, astigmatic is no and tears are normal. The score for soft equals 20.83% (prior) * 20% (presbyopic) * 60% (hypermetrope) * 100% (astigmatic: no) * 100% (tears: normal) = 0.025000. Pulling the relevant row out of each table, alongside the prior, gives:
 | hard | none | soft |
prior | 16.67% | 62.50% | 20.83% |
age: presbyopic | 25.00% | 40.00% | 20.00% |
prescription: hypermetrope | 25.00% | 53.33% | 60.00% |
astigmatic: no | 0.00% | 46.67% | 100.00% |
tears: normal | 100.00% | 20.00% | 100.00% |
Score | 0.000000 | 0.012444 | 0.025000 |
The class "soft" has the highest score and is the winner. However, this example reveals a couple of issues. First, for "astigmatic" the probability of "no" given hard lenses is 0%; since the score is the product of the probabilities, a zero probability eliminates that class from consideration entirely. There are a number of solutions. There may be cases where a zero probability eliminating the class is appropriate (pregnant men). The literature proposes Laplace smoothing, which adds a value such as 1 to each attribute value count for each class before calculating the probabilities. A third option is to substitute an arbitrarily small probability, say 0.1%, for zero. I generally favor the last option because in my experience Laplace smoothing often increases the error (and the substitution is easier to program). In this example, using 0.1% didn't change the results.
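To make those options concrete: with add-one Laplace smoothing, the astigmatic = no probability for hard would become (0 + 1) / (4 + 2) ≈ 16.7% (adding 1 to each of the two attribute values across the 4 hard observations), whereas the substitution approach simply scores the missing combination as 0.1%.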
The second issue is that multiplying many probabilities can lead to very small numbers and introduce rounding errors. The solution is to sum the logs of the probabilities instead (which also comes in handy in SQL).
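For the scores above, ln(0.025000) ≈ -3.69 and ln(0.012444) ≈ -4.39; because the log is monotonic, the ranking of the classes is unchanged, but the sums stay in a comfortable numeric range no matter how many attributes are multiplied in.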
The SQL version of the NB classifier is developed as follows: create a raw data table, create a table with the prior distribution, create a table with the attribute-class distributions, create a table of test data, run a classification and check the accuracy. The code I write is functional; I don't put in a lot of formalities or error handling. I get the program to work, and if a more robust version is required, I'll take the time to build in features. As the data sets are small, the temp tables could easily be replaced with common table expressions or views.
Create table RawData ( id smallint, age varchar(15), prescription varchar(15), astigmatic varchar(15), tears varchar(15), lens varchar(15))
insert into RawData select 1,'young','myope','no','reduced','none'
insert into RawData select 2,'young','myope','no','normal','soft'
insert into RawData select 3,'young','myope','yes','reduced','none'
insert into RawData select 4,'young','myope','yes','normal','hard'
insert into RawData select 5,'young','hypermetrope','no','reduced','none'
insert into RawData select 6,'young','hypermetrope','no','normal','soft'
insert into RawData select 7,'young','hypermetrope','yes','reduced','none'
insert into RawData select 8,'young','hypermetrope','yes','normal','hard'
insert into RawData select 9,'pre-presbyopic','myope','no','reduced','none'
insert into RawData select 10,'pre-presbyopic','myope','no','normal','soft'
insert into RawData select 11,'pre-presbyopic','myope','yes','reduced','none'
insert into RawData select 12,'pre-presbyopic','myope','yes','normal','hard'
insert into RawData select 13,'pre-presbyopic','hypermetrope','no','reduced','none'
insert into RawData select 14,'pre-presbyopic','hypermetrope','no','normal','soft'
insert into RawData select 15,'pre-presbyopic','hypermetrope','yes','reduced','none'
insert into RawData select 16,'pre-presbyopic','hypermetrope','yes','normal','none'
insert into RawData select 17,'presbyopic','myope','no','reduced','none'
insert into RawData select 18,'presbyopic','myope','no','normal','none'
insert into RawData select 19,'presbyopic','myope','yes','reduced','none'
insert into RawData select 20,'presbyopic','myope','yes','normal','hard'
insert into RawData select 21,'presbyopic','hypermetrope','no','reduced','none'
insert into RawData select 22,'presbyopic','hypermetrope','no','normal','soft'
insert into RawData select 23,'presbyopic','hypermetrope','yes','reduced','none'
insert into RawData select 24,'presbyopic','hypermetrope','yes','normal','none'
Calculate the prior distribution using a subquery.
select lens class, count(1) pfreq, sum(1.0)/(select count(1) from RawData) pprob into #Priors from RawData group by lens
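With the 24 rows loaded, #Priors should come out as:

class | pfreq | pprob |
hard | 4 | 0.166666 |
none | 15 | 0.625000 |
soft | 5 | 0.208333 |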
Summarize the attribute-class information using one query per attribute, grouped by value and combined with union all. This produces a vertical table that is more amenable to SQL than the matrix-formatted pivot tables above.
select attribute, value, p.class, acfreq, acfreq/pfreq acprob into #AttrClass from
(select 'age' attribute, age value, lens class, sum(1.0) acfreq from RawData group by age, lens
union all
select 'prescription', prescription, lens, sum(1.0) from RawData group by prescription, lens
union all
select 'astigmatic', astigmatic, lens, sum(1.0) from RawData group by astigmatic, lens
union all
select 'tears', tears, lens, sum(1.0) from RawData group by tears, lens) rd
join #Priors p on rd.class = p.class
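As a spot check, the astigmatic rows of #AttrClass should look like this; note that there is no row for astigmatic = no with class hard, the zero-probability case from the spreadsheet:

attribute | value | class | acfreq | acprob |
astigmatic | no | none | 7.0 | 0.466666 |
astigmatic | no | soft | 5.0 | 1.000000 |
astigmatic | yes | hard | 4.0 | 1.000000 |
astigmatic | yes | none | 8.0 | 0.533333 |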
The knowledge base is now complete. Next, create and populate a table of test data. Like the attribute-class table, it is a vertical table with each subject identified by a common id and one attribute value per row:
create table #test( id int, attribute varchar(15), value varchar(15))
insert into #test select 99, 'age','presbyopic'
insert into #test select 99, 'prescription','hypermetrope'
insert into #test select 99, 'astigmatic','no'
insert into #test select 99, 'tears','normal'
The next step is to calculate a score for each test subject. As the #AttrClass table includes only combinations for which there is data, the subquery sp ensures all possible attribute-class combinations are present when left-joining to the attribute-class summary table. Combinations that don't exist come back as nulls and are replaced by 0.001 when calculating the score.
select id,acs.class, log(pprob) + sum(log(fprob)) score into #Fit from #Test t join
(select sp.attribute, sp.value, sp.class, sp.pprob, isnull(s.acprob,.001) fprob from
(select s.attribute,s.value, p.class, p.pprob from #AttrClass s, #Priors p group by s.attribute,s.value, p.class, p.pprob) sp
left join #AttrClass s on sp.attribute=s.attribute and sp.value = s.value and sp.class=s.class) acs
on t.attribute = acs.attribute and t.value = acs.value
group by id,acs.class,pprob
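For test subject 99, #Fit comes out to roughly the following (hard is dragged far down by the log(0.001) stand-in for the missing astigmatic = no combination):

id | class | score |
99 | hard | -11.47 |
99 | none | -4.39 |
99 | soft | -3.69 |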
The remaining step is to identify the class with the highest score for each id in the test table. Any (unlikely) tie in the scores is broken by picking the alphabetically "maximum" class.
select f.id, max(class) pclass from #fit f join
(select id, max(score) mscore from #fit group by id) mq
on f.id = mq.id and score = mscore group by f.id
Results:
id | pclass |
99 | soft |
How well does the classifier work? Load the test table with the original data and see how well the classifications match:
drop table #Test
select id,'age' attribute, age value into #Test from RawData
union all
select id,'prescription', prescription from RawData
union all
select id,'astigmatic' ,astigmatic from RawData
union all
select id,'tears',tears from RawData
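(#Test now holds 96 rows: four attribute rows for each of the 24 subjects.)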
Link the predicted class back to RawData and count the records where lens matches the predicted class:
select sum(case when pclass = lens then 1 else 0 end) from RawData r join
(select f.id, max(class) pclass from #fit f join
(select id, max(score) mscore from #fit group by id) q
on f.id = q.id and score = mscore group by f.id) p
on p.id = r.id
The query returns 23 matches out of 24, but that 95.8% accuracy is disingenuous: it measures how well the algorithm fits the very data it was trained on. A more meaningful approach is to withhold some observations, fit the model on the rest and see how well the held-out observations are predicted, compared with simply guessing the class with the highest prior probability. Identify records to predict by adding a flag column to RawData, initially set to 1, with 6 randomly selected records set to zero. Modify the model to use only records with flag = 1, put the records with flag = 0 in #Test and run the prediction. Repeat 100 times to generate a sample of predictions.
alter table RawData add flag int
update RawData set flag = 1
update RawData set flag = 0 where id in
(select top 6 id from RawData order by newid())
(The full set of modified queries is not shown, but the sketch below illustrates the pattern.)
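A minimal sketch, assuming the temp tables are dropped and rebuilt each iteration: each training query gains a where flag = 1 filter, and #Test is rebuilt from the held-out rows (the #AttrClass query is filtered the same way).

drop table #Priors
select lens class, count(1) pfreq, sum(1.0)/(select count(1) from RawData where flag = 1) pprob into #Priors from RawData where flag = 1 group by lens

drop table #Test
select id,'age' attribute, age value into #Test from RawData where flag = 0
union all
select id,'prescription', prescription from RawData where flag = 0
union all
select id,'astigmatic', astigmatic from RawData where flag = 0
union all
select id,'tears', tears from RawData where flag = 0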
The distribution of results is below. There was one case where using the prior led to only one correct match out of six, 9 cases where it led to 2 correct matches, and so on. Using the priors yields about 63% correct classification, while NB improves the accuracy to about 69%. Not bad in light of the small sample size (each iteration's knowledge base consisted of 18 subjects, used to predict the other 6).
Matches of 6 | Using Prior | Using Fit |
1 | 1 | 1 |
2 | 9 | 8 |
3 | 26 | 16 |
4 | 39 | 37 |
5 | 24 | 30 |
6 | 1 | 8 |
Total | 100 | 100 |
Percent Correct | 63% | 69% |
Avg Correct | 3.79 | 4.11 |
Enjoy & let me know what you think.