Thursday, July 25, 2013

Classic Classification: The Perceptron

I don't intend to make this a running dissertation on classification algorithms (although I will come back to them from time to time because they are key to data mining), but the disappointing results of the Naive Bayes Classifier on the Kaggle Titanic challenge in my last post led me to ask whether another simple classifier might produce better results.  To that end I dusted off my books on the Perceptron, which originated in the late 1950s (showing that machine learning has been around since the dawn of the computer age).  The Perceptron is a linear classifier: it learns weights on the input variables and declares membership when their sum exceeds a threshold.  For those wanting more background, try Wikipedia; I will use their example to demonstrate the code before running the Titanic data through the model.
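To make the rule concrete: each attribute value carries a weight, only the weights matching a given record's values (plus a bias weight shared by every record) are summed, and the record is classified as a member when that sum exceeds the threshold; on a miss, each matching weight is nudged by alpha times the classification error.  The SQL program below is a direct implementation of this sum, compare, and adjust cycle.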

First thing, let's create a table and populate it with some basic data to simulate the NAND function.  In the write-ups the data is invariably coded as binary (0,1 or -1,1); since I am using SQL, I will use data labels (true, false), which are easier to interpret and permit more input attribute values.  Note, however, that converting the class value to (0,1) will simplify the program.

create table #rawdata (id int identity(1,1), x1 varchar(5), x2 varchar(5), z varchar(5))

insert into #rawdata select 'false', 'false', 'true'
insert into #rawdata select 'false', 'true', 'true'
insert into #rawdata select 'true', 'false', 'true'
insert into #rawdata select 'true', 'true', 'false'

To leverage SQL's set-based functionality, convert the data table to a data vector, adding a bias field which works like the constant in a regression.

select id, 'bias' attribute, 'bias' value into #data from #rawdata
union all
select id, 'x1', x1 from #rawdata
union all
select id, 'x2', x2 from #rawdata

Similarly, create a class data vector; as noted earlier, convert the class attribute to a (0,1) value.

 select id, 'z' attribute, case when z='true' then 1.0 else 0 end value into #class from #rawdata

Recall that the algorithm compares the sum of the weights for each observation to a threshold value to determine membership.  The next step is to initialize the table of weights, which can all be set to 0.

select attribute, value, cast(0 as float) wght into #weight from #data group by attribute, value

One of the reasons I like the Naive Bayes algorithm is that it doesn't require iteration to arrive at the weights; unfortunately, the Perceptron (like most data mining algorithms) learns through iteration.  The process: pick an observation, calculate membership, and if there is a misclassification (error), adjust the weights in the direction of correct classification.  Repeat until convergence (or patience is exhausted).  If the classes are linearly separable, the algorithm will converge and the weights will stop changing.  The real world is seldom so cooperative (this example is; the Titanic data is not) and the results will vacillate.  A solution to this problem is the pocket algorithm: calculate the overall fit after each update and, if the fit is the best so far, put the weights in your pocket.  When done, use these weights, which are kept in the #bestweight table:

select * into #bestweight from #weight

 Begin the learning program by declaring the necessary variables:

  • threshold is the level necessary to classify, generally 0.5.
  • alpha is the amount the weights are adjusted when there is an error; experience shows 0.1 is a good value.
  • count keeps track of how many passes are made through the whole data set (not many are required), and id is the individual record being evaluated.
  • fit is the sum of the weights for the current record, and error is the difference between the actual class and the prediction.
  • best and newbest track the overall fit to determine whether the new weights should replace those in your pocket.

The weights are modified in the update statement following the fit and error calculations.  Next comes the overall fit (percent correctly classified), followed by the pocket algorithm.

declare @threshold float, @alpha float, @count int, @id int, @fit float, @err float, @best float, @newbest float

select @threshold = .5, @alpha = .1, @count = 5, @best = 0


while @count>=0
begin
    set @id = 1
    while @id <= (select max(id) from #class)
    begin
        -- fit: sum the weights matching this record's attribute values
        select @fit = sum(wght) from #data d join #weight w on d.attribute = w.attribute and d.value = w.value where d.id = @id

        -- error: actual class less the predicted class
        select @err = c.value - case when @fit>@threshold then 1.0 else 0 end from #class c where c.id = @id

        -- adjust the matching weights in the direction of correct classification
        update w set w.wght = w.wght+@err*@alpha from #weight w join #data d on w.attribute = d.attribute and w.value = d.value and d.id = @id

        -- overall fit: fraction of records correctly classified by the current weights
        select @newbest = sum(case when fit = value then 1.0 else 0 end)/count(1) from
        (select c.id, case when sumw>@threshold then 1 else 0 end fit, c.value from #class c join
        (select d.id, sum(wght) sumw
        from #data d
        join #weight w on d.attribute = w.attribute and d.value = w.value
        group by d.id) n
        on c.id = n.id) x

        -- pocket algorithm: if these weights are the best so far, pocket them
        if @newbest>@best
        begin
            set @best = @newbest
            drop table #bestweight
            select * into #bestweight from #weight
        end

        set @id = @id+1
    end
    set @count = @count-1
end

Finally, evaluate the results with the following query:

select * from
(select c.id, case when sumw>@threshold then 1 else 0 end fit, c.value from #class c join
(select d.id, sum(wght) sumw
from #data d
join #weight w on d.attribute = w.attribute and d.value = w.value
group by d.id) n
on c.id = n.id) x

The fit with the initial (zero value) weights:

id fit value
1 0 1
2 0 1
3 0 1
4 0 0

After running the program it is 100%:

id fit value
1 1 1
2 1 1
3 1 1
4 0 0

With the following weights:

select * from #weight

attribute value weight
bias bias 0.3
x1 false 0.2
x1 true 0.1
x2 false 0.2
x2 true 0.1

This last step does not use the #bestweight table, but per the logic of the program those weights would be the same.

How does the Perceptron fare on the Titanic data?  The temp tables from the NAND example are presumably still in your session, so clear them out first (skip any that don't exist):
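drop table #rawdata
drop table #data
drop table #class
drop table #weight
drop table #bestweight

Then bring the Titanic train data into the data vector and the survival field into the class vector: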

select passengerid id, 'bias' attribute,'bias' value into #data from train
union all
select passengerid , 'sex' , sex  from train
union all
select passengerid, 'pclass' , cast(pclass as varchar)  from train
union all
select passengerid, 'embarked' , cast(embarked as varchar)  from train
union all
select passengerid, 'cabin' , isnull(left(cabin,1),'na')  from train

 select passengerid id, 'survived' attribute, cast(survived as float) value into #class from train

Initialize the #weight and #bestweight tables and run the learning program.  No further modifications to the program should be required.
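The initialization is the same pair of statements used for the NAND example, now run against the Titanic vectors:

select attribute, value, cast(0 as float) wght into #weight from #data group by attribute, value

select * into #bestweight from #weight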

The basic fit can be evaluated with the following query, which shows 71.7171% correct classification; scoring with the #bestweight table instead improves the correct classification rate to 81.8181%.

select sum(case when fit = value then 1.0 else 0 end)/count(1) from 
(select c.id,case when sumw>.5 then 1 else 0 end fit, c.value  from #class c join
(select d.id,  sum(wght) sumw 
from #data d 
join #weight w on d.attribute = w.attribute and d.value = w.value  
group by d.id) n 
on c.id = n.id) x
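The 81.8181% figure comes from the same query with #bestweight swapped in for #weight:

select sum(case when fit = value then 1.0 else 0 end)/count(1) from
(select c.id, case when sumw>.5 then 1 else 0 end fit, c.value from #class c join
(select d.id, sum(wght) sumw
from #data d
join #bestweight w on d.attribute = w.attribute and d.value = w.value
group by d.id) n
on c.id = n.id) x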

The weights in the #bestweight table:

attribute value weight
bias bias 0.2
cabin A 0
cabin B 0.2
cabin C 0
cabin D 0.2
cabin E 0.3
cabin F 0.1
cabin G -0.4
cabin n 0
cabin T -0.2
embarked NULL 0
embarked C 0.1
embarked Q 0.1
embarked S 0
pclass 1 0.1
pclass 2 0.2
pclass 3 -0.1
sex female 0.3
sex male -0.1

I submitted the test data classification using the #bestweight table to Kaggle and improved my score to 76.077%, moving up nearly 300 places in the rankings to about 4800th.  Not where I'd like to be, but not bad for a half-century-old, simple algorithm.
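For the record, generating that submission is the same vectorize-and-score pattern.  This is a sketch, assuming the Kaggle test data sits in a test table with the same columns as train, and using #testdata as a scratch name:

select passengerid id, 'bias' attribute, 'bias' value into #testdata from test
union all
select passengerid, 'sex', sex from test
union all
select passengerid, 'pclass', cast(pclass as varchar) from test
union all
select passengerid, 'embarked', cast(embarked as varchar) from test
union all
select passengerid, 'cabin', isnull(left(cabin,1),'na') from test

select d.id PassengerId, case when sum(wght)>.5 then 1 else 0 end Survived
from #testdata d
join #bestweight w on d.attribute = w.attribute and d.value = w.value
group by d.id
order by d.id

Any attribute value that never appeared in training simply finds no match in #bestweight and contributes no weight; the bias row guarantees every passenger still gets scored.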



