Friday, July 12, 2013

Classification Using Naive Bayes

I originally intended to start with a post covering the basics of data and data analysis, but decided that the exciting nature of that topic might lose me a reader or two.  Instead, I'll start with one of the simpler, yet still robust, classification tools: Naive Bayes.  It has a wide variety of applications and, despite its simplicity, does reasonably well against far more complicated algorithms.  The version I build here is for categorical, binary data (an attribute value is either present or not, rather than counted by how frequently it occurs, as in, say, a text document).  The SQL is easily adaptable to different data sets and will handle thousands of records and attributes.

If you Google "Naive Bayes" you will get 742,000+ references, many of which provide a solid explanation of the theoretical underpinnings.  The classifier exploits Bayes' theorem: the probability of a class given an attribute equals the probability of the class times the probability of the attribute given the class, divided by the probability of the attribute.  NB ignores the probability of the attribute by replacing "equals" with "is proportional to" and simply selects the highest scoring class.  I'll create an example in an OpenOffice spreadsheet and then convert it to SQL.
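In symbols, with the usual assumption that the attribute values are independent given the class, the score for each class is

score(class) = P(class) × P(value1 | class) × P(value2 | class) × ... × P(valueN | class)

and the predicted class is the one with the highest score.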

Assume a table of data.  Each row is an observation and the columns are attributes with multiple values; one of the attributes is the variable to be predicted, known as the class.  To illustrate the algorithm I used the Lenses data set from the UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml/datasets/Lenses ) because it is a small data set with categorical data and three classes.  In this example, age, prescription, astigmatic and tears are attributes and lens is the class.  There is no reason you couldn't pick age as the class and use lens as an attribute.

id | age | prescription | astigmatic | tears | lens
1 | young | myope | no | reduced | none
2 | young | myope | no | normal | soft
3 | young | myope | yes | reduced | none
4 | young | myope | yes | normal | hard
5 | young | hypermetrope | no | reduced | none
6 | young | hypermetrope | no | normal | soft
7 | young | hypermetrope | yes | reduced | none
8 | young | hypermetrope | yes | normal | hard
9 | pre-presbyopic | myope | no | reduced | none
10 | pre-presbyopic | myope | no | normal | soft
11 | pre-presbyopic | myope | yes | reduced | none
12 | pre-presbyopic | myope | yes | normal | hard
13 | pre-presbyopic | hypermetrope | no | reduced | none
14 | pre-presbyopic | hypermetrope | no | normal | soft
15 | pre-presbyopic | hypermetrope | yes | reduced | none
16 | pre-presbyopic | hypermetrope | yes | normal | none
17 | presbyopic | myope | no | reduced | none
18 | presbyopic | myope | no | normal | none
19 | presbyopic | myope | yes | reduced | none
20 | presbyopic | myope | yes | normal | hard
21 | presbyopic | hypermetrope | no | reduced | none
22 | presbyopic | hypermetrope | no | normal | soft
23 | presbyopic | hypermetrope | yes | reduced | none
24 | presbyopic | hypermetrope | yes | normal | none


The first step is to estimate the probability of each class (aka the prior).  As there are three classes, a no-evidence guess would be 33.3% for each class.  With data, however, the distribution can be estimated using a simple pivot table: select the data table above as the range, set "lens" (the class) as the column field and a count of "id" as the data field, then change the displayed value type to "percent of row".

hard | none | soft | Total Result
16.67% | 62.50% | 20.83% | 100.00%


Using the scoring rule described above with no other evidence, the highest scoring class is "none", which is right 62.5% of the time (15 of the 24 records).
The second step is to calculate the probability of an attribute value given a class.  Again, build a pivot table on the same data range with "lens" as the column field and "age" as the row field; keep the count of "id" as the data field but change the displayed value type to "percent of column".

age | hard | none | soft | Total Result
pre-presbyopic | 25.00% | 33.33% | 40.00% | 33.33%
presbyopic | 25.00% | 40.00% | 20.00% | 33.33%
young | 50.00% | 26.67% | 40.00% | 33.33%
Total Result | 100.00% | 100.00% | 100.00% | 100.00%
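Reading the table: of the 4 subjects fitted with hard lenses, 2 are young, so P(young | hard) = 2/4 = 50%.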


Repeat this last step for the remaining attributes and arrange the tables in a spreadsheet as follows, labeling the first row "prior".

lens | hard | none | soft | Total Result
prior | 16.67% | 62.50% | 20.83% | 100.00%

age | hard | none | soft | Total Result
pre-presbyopic | 25.00% | 33.33% | 40.00% | 33.33%
presbyopic | 25.00% | 40.00% | 20.00% | 33.33%
young | 50.00% | 26.67% | 40.00% | 33.33%
Total Result | 100.00% | 100.00% | 100.00% | 100.00%

prescription | hard | none | soft | Total Result
hypermetrope | 25.00% | 53.33% | 60.00% | 50.00%
myope | 75.00% | 46.67% | 40.00% | 50.00%
Total Result | 100.00% | 100.00% | 100.00% | 100.00%

astigmatic | hard | none | soft | Total Result
no | 0.00% | 46.67% | 100.00% | 50.00%
yes | 100.00% | 53.33% | 0.00% | 50.00%
Total Result | 100.00% | 100.00% | 100.00% | 100.00%

tears | hard | none | soft | Total Result
normal | 100.00% | 20.00% | 100.00% | 50.00%
reduced | 0.00% | 80.00% | 0.00% | 50.00%
Total Result | 100.00% | 100.00% | 100.00% | 100.00%


This is the "knowledge base" for the NB classifier and an observation is classified by  multiplying the prior probability for a class times the probabilities for each attribute value  given that class.  For example, consider a subject who's age is presbyopic, prescription is hypermetrope, astigmatic is no and tears are normal.  The score for soft would equal 20.83% (prior) * 20% (presbyopic) * 60%  (hypermetrope)* 100% (astigmatic: no)* 100% (normal) =0.025000  (shown in the highlighted area below).
lens | hard | none | soft
prior | 16.67% | 62.50% | 20.83%
age = presbyopic | 25.00% | 40.00% | 20.00%
prescription = hypermetrope | 25.00% | 53.33% | 60.00%
astigmatic = no | 0.00% | 46.67% | 100.00%
tears = normal | 100.00% | 20.00% | 100.00%
Score | 0.000000 | 0.012444 | 0.025000


The class "soft"  has the highest score and is the winner. However, this example reveals a couple of issues.  First, for "astigmatic" the probability for hard lenses is 0%; since the score is the product of the probabilities, a zero probability eliminates that class from consideration.  There are a number of solutions: there may be cases where a zero probability eliminating the class is appropriate (pregnant men),   the literature proposes Laplace smoothing which adds a value like 1 to each attribute value count for each class before calculating probabilities, or using an arbitrarily small probability, say 0.1% in place of zero.  I generally favor the last option because in my experience Laplace smoothing often increases the error (and  it is easier to program). In this example using .1% didn't change the results.

The second issue is that multiplying probabilities can lead to very small numbers and introduce rounding errors.  The solution is to sum the logs of the probabilities instead (which also comes in handy in SQL).
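For the "soft" score above: ln(0.2083) + ln(0.20) + ln(0.60) + ln(1.0) + ln(1.0) = -1.569 - 1.609 - 0.511 + 0 + 0 = -3.689, and e^-3.689 ≈ 0.025.  The class with the largest (least negative) log score is still the winner.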
The SQL version of the NB classifier is developed as follows: create a raw data table, create a table with the prior distribution, create a table with the attribute-class distributions, create a table of test data, run a classification and check the accuracy.  The code I write is functional; I don't put in a lot of formalities or error handling.  I get the program to work and, if a more robust version is required, I'll take the time to build in features.  As the data sets are small, the temp tables could easily be replaced with common table expressions or views.

Create table RawData ( id smallint, age varchar(15), prescription varchar(15), astigmatic varchar(15), tears varchar(15), lens varchar(15))

insert into RawData select 1,'young','myope','no','reduced','none'
insert into RawData select 2,'young','myope','no','normal','soft'
insert into RawData select 3,'young','myope','yes','reduced','none'
insert into RawData select 4,'young','myope','yes','normal','hard'
insert into RawData select 5,'young','hypermetrope','no','reduced','none'
insert into RawData select 6,'young','hypermetrope','no','normal','soft'
insert into RawData select 7,'young','hypermetrope','yes','reduced','none'
insert into RawData select 8,'young','hypermetrope','yes','normal','hard'
insert into RawData select 9,'pre-presbyopic','myope','no','reduced','none'
insert into RawData select 10,'pre-presbyopic','myope','no','normal','soft'
insert into RawData select 11,'pre-presbyopic','myope','yes','reduced','none'
insert into RawData select 12,'pre-presbyopic','myope','yes','normal','hard'
insert into RawData select 13,'pre-presbyopic','hypermetrope','no','reduced','none'
insert into RawData select 14,'pre-presbyopic','hypermetrope','no','normal','soft'
insert into RawData select 15,'pre-presbyopic','hypermetrope','yes','reduced','none'
insert into RawData select 16,'pre-presbyopic','hypermetrope','yes','normal','none'
insert into RawData select 17,'presbyopic','myope','no','reduced','none'
insert into RawData select 18,'presbyopic','myope','no','normal','none'
insert into RawData select 19,'presbyopic','myope','yes','reduced','none'
insert into RawData select 20,'presbyopic','myope','yes','normal','hard'
insert into RawData select 21,'presbyopic','hypermetrope','no','reduced','none'
insert into RawData select 22,'presbyopic','hypermetrope','no','normal','soft'
insert into RawData select 23,'presbyopic','hypermetrope','yes','reduced','none'
insert into RawData select 24,'presbyopic','hypermetrope','yes','normal','none'

Calculate the prior distribution using a subquery.

select lens class, count(1) pfreq, sum(1.0)/(select count(1) from RawData) pprob into #Priors from RawData group by lens
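For this data set, #Priors ends up with one row per class, matching the prior pivot table above: hard (pfreq 4, pprob 0.1667), none (15, 0.6250) and soft (5, 0.2083).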

Summarize the attribute-class information using a query for each attribute, grouped by value.  This creates a vertical table that is more amenable to SQL than matrix-formatted tables.

select attribute,value, p.class,acfreq, acfreq/pfreq acprob into #AttrClass from 
(select 'age' attribute, age value,lens class, sum(1.0) acfreq from RawData group by age,lens
union all
select 'prescription', prescription,lens, sum(1.0) from RawData group by prescription,lens
union all
select 'astigmatic' ,astigmatic,lens, sum(1.0) from RawData group by astigmatic,lens
union all
select 'tears',tears,lens, sum(1.0) from RawData group by tears,lens) rd 
join #Priors p on rd.class=p.class
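A few #AttrClass rows (attribute, value, class, acfreq, acprob) as a sanity check against the pivot tables: ('age', 'presbyopic', 'soft', 1, 0.20), ('astigmatic', 'no', 'none', 7, 0.4667) and ('tears', 'normal', 'hard', 4, 1.00).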

The knowledge base is now complete.  Next, create and populate a table of test data.  Like the attribute-class table, it is a vertical table with each subject identified by a common id and one attribute value per row:

create table #Test( id int, attribute varchar(15), value varchar(15))
insert into #Test select 99, 'age','presbyopic'
insert into #Test select 99, 'prescription','hypermetrope'
insert into #Test select 99, 'astigmatic','no'
insert into #Test select 99, 'tears','normal'

The next step is to calculate a score for each test subject.  Because the #AttrClass table includes only the combinations for which there is data, the subquery sp builds every possible attribute-value/class combination and left joins it to the attribute-class summary table.  Combinations that don't exist come back as nulls and are replaced by .001 when calculating the score.

select id,acs.class, log(pprob) + sum(log(fprob)) score into #Fit from #Test t join
(select sp.attribute, sp.value, sp.class, sp.pprob, isnull(s.acprob,.001) fprob from 
(select s.attribute,s.value, p.class, p.pprob from #AttrClass s, #Priors p group by s.attribute,s.value, p.class, p.pprob) sp
left join #AttrClass s on sp.attribute=s.attribute and sp.value = s.value and sp.class=s.class) acs
on t.attribute = acs.attribute and t.value = acs.value 
group by id,acs.class,pprob
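For the test subject above, #Fit should contain three rows with (natural log) scores of roughly -11.47 for hard (the missing astigmatic = no combination contributes log(.001) ≈ -6.91), -4.39 for none and -3.69 for soft, the log of the 0.025 computed in the spreadsheet example.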

The remaining step is to identify the class with the highest score for each id in the test table.  In the unlikely event of a tied score, the tie is broken by picking the "maximum" (alphabetically last) class.

select f.id, max(class) pclass from #fit f join
(select id, max(score) mscore from #fit group by id) mq
on f.id = mq.id and score = mscore group by f.id

Results:

id | pclass
99 | soft


How well does the classifier work?  Load the test table with the original data and see how well the classifications match:

drop table #Test
select id,'age' attribute, age value into #Test from RawData
union all
select id,'prescription', prescription from RawData 
union all
select id,'astigmatic' ,astigmatic from RawData 
union all
select id,'tears',tears from RawData

Link the predicted class back to RawData and count the records where lens matches the predicted class:

select sum(case when pclass = lens then 1 else 0 end) from RawData r join 
(select f.id, max(class) pclass from #fit f join
(select id, max(score) mscore from #fit group by id) q
on f.id = q.id and score = mscore group by f.id) p
on p.id = r.id

The result, 23 of 24 or 95.8% accuracy, is disingenuous as it measures how well the algorithm fits the very data it was trained on.  A more meaningful approach is to withhold some observations, fit the model on the rest and see how well the held-out observations are predicted, compared with simply guessing the class with the highest prior probability.  Identify the records to predict by adding a flag column to RawData, initially set to 1, with 6 randomly selected records set to zero.  Modify the model to use only records with flag = 1, put the records with flag = 0 in the test table and run the prediction.  Repeat 100 times to generate a sample of predictions.

alter table RawData add flag int
update RawData set flag = 1
update RawData set flag = 0 where id in 
(select top 6 id from RawData order by newid())

(Modifications to the other queries are not shown.)
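Roughly, the modifications amount to filtering on the flag; here is a sketch (not necessarily my exact code) of the prior and test table queries, run after dropping the existing #Priors and #Test:

select lens class, count(1) pfreq, sum(1.0)/(select count(1) from RawData where flag = 1) pprob
into #Priors from RawData where flag = 1 group by lens
-- the four subqueries that build #AttrClass get the same "where flag = 1" filter

select id,'age' attribute, age value into #Test from RawData where flag = 0
union all
select id,'prescription', prescription from RawData where flag = 0
union all
select id,'astigmatic' ,astigmatic from RawData where flag = 0
union all
select id,'tears',tears from RawData where flag = 0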

The distribution of results is below.  There was one case where using the prior led to only one correct match out of 6, 9 cases where the prior led to 2 correct matches, and so on.  Using the priors leads to about 63% correct classification, while NB improved the accuracy to about 69%.  Not bad in light of the small sample size (the knowledge base consisted of 18 subjects per iteration, used to predict the other 6).

Matches of 6 | Using Prior | Using Fit
1 | 1 | 1
2 | 9 | 8
3 | 26 | 16
4 | 39 | 37
5 | 24 | 30
6 | 1 | 8
Total | 100 | 100
Percent Correct | 63% | 69%
Avg Correct | 3.79 | 4.11


Enjoy & let me know what you think.
