Wednesday, March 19, 2014

Clustering Two by Two

An earlier post covered the K-Means clustering algorithm; in this post I will cover hierarchical agglomerative clustering (HAC). The HAC algorithm starts with each observation as its own cluster and successively merges the closest (or most similar) pair until all observations belong to a single cluster. There are several ways to judge which two clusters are closest when comparing them: the cluster average, the furthest members, or the nearest members. This post uses the nearest-members (single-linkage) approach.

The first step is to create a table of data. The second is to compute the distances between clusters that will drive the successive merges. The last step is displaying the results. Programs like R offer neat dendrogram graphics not available in SQL; to compensate, this post offers code that dynamically builds a table showing the cluster membership after each pass. Not ideal, but passable.

To keep it easy and visual, each observation has two dimensions, randomly assigned on a 0-100 scale. The first query creates the data table, and the loop assigns values to the dimensions X1 and X2, displayed below. Note that your results will differ because X1 and X2 are randomly assigned each time the program runs.

-- table of observations: an id plus two dimensions
create table #data (id int, x1 float, x2 float)

-- generate 12 observations with random values on a 0-100 scale
declare @id int
set @id = 1
while @id <= 12
begin
insert into #data select @id, round(dbo.fx_rand()*100,0), round(dbo.fx_rand()*100,0)
set @id = @id+1
end

select x1,x2,id from #data
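
The loop calls dbo.fx_rand(), a helper from an earlier post. For readers who don't have it, here is a minimal sketch (not necessarily the original); the view indirection is needed because SQL Server won't allow the side-effecting RAND() directly inside a user-defined function.

-- assumed helper: RAND() cannot be called inside a function, so wrap it in a view
create view vw_rand as select rand() r
go
create function fx_rand() returns float
as
begin
return (select r from vw_rand)
end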



Distance is computed between all observation pairs using Euclidean distance by joining the data table to itself on the observation id.  Redundant distances (A to B = B to A) and irrelevant pairs (A to A) are eliminated by requiring that the id in the first table be greater than the id in the second.  Other measures of distance could be used, as could the inverse of a similarity measure; similar items are usually defined as near each other (low distance) and dissimilar items as far apart (high distance).  The cluster ids are overwritten as clusters are merged, so the originals are archived in the aid2 and bid2 fields.

create table #distance (aid int, bid int, aid2 int, bid2 int, distance float)

-- Euclidean distance for every unique pair (a.id > b.id); aid2 and bid2 preserve
-- the original observation ids after aid and bid are overwritten by merges
insert into #distance
select a.id, b.id, a.id, b.id, power(power(a.x1-b.x1,2)+power(a.x2-b.x2,2),.5) distance
from #data a join #data b on a.id > b.id
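
Swapping in another measure is a one-line change. As an illustrative variant (not used below), Manhattan (city-block) distance:

-- hypothetical alternative: Manhattan distance instead of Euclidean
insert into #distance
select a.id, b.id, a.id, b.id, abs(a.x1-b.x1)+abs(a.x2-b.x2) distance
from #data a join #data b on a.id > b.id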

The clusters are stored in the #hierarchy table as a parent-child relationship. Starting with the individual observations as parent ids (and 0 as the child id), the algorithm finds the nearest pair of clusters, inserts them into the hierarchy table as children, and assigns the next sequential id as their parent.  In the distance table, the ids of the merged clusters are updated to the new parent id. This ensures that non-members of a cluster are compared to their nearest neighbor within the cluster and saves the time of recalculating the distance matrix. The algorithm stops when all rows in the distance table carry the same cluster id.

create table #hierarchy( cid int, pid int, pass int)

declare @aid int, @bid int, @pid int, @pass int
set @pass = 0
insert into #hierarchy (cid,pid, pass) select 0 ,id,@pass from #data

while 1=1
begin 
set @pass = @pass+1

-- find the closest pair of distinct clusters; newid() breaks ties randomly
select top 1 @aid = aid, @bid = bid from #distance where aid<>bid  order by distance, newid() 
if @@rowcount = 0 break  -- every row has aid = bid: all observations are in one cluster

-- merge the pair under a new parent id
select @pid = max(pid)+1 from  #hierarchy
insert into #hierarchy(cid,pid, pass)  select @aid, @pid, @pass
insert into #hierarchy(cid,pid, pass)  select @bid, @pid, @pass

-- relabel the merged clusters in the distance table
update #distance set aid = @pid where aid in (@aid, @bid)
update #distance set bid = @pid where bid in (@aid, @bid)
end
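
Before formatting the output, it can be handy to see the raw merge order. This quick query (my addition, not required by the algorithm) lists the two clusters combined on each pass:

-- which two clusters merged on each pass, and under what parent id
select pass, pid, min(cid) cid1, max(cid) cid2
from #hierarchy
where pass > 0
group by pass, pid
order by pass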

The next problem is how to display the results.  As noted earlier, the usual way to display HAC is through a snappy graphic known as a dendrogram, but since T-SQL offers no graphics, an alternate solution is to create a table that shows the clustering through the parent ids.  The following query uses dynamic SQL to build a table, adding a column for each pass populated with the ids of the parent clusters.

create table #list (p0 int)

insert into #list select pid from #hierarchy where pass = 0 

declare @pass int
declare @sql varchar(200)
set @pass = 1
while @pass<= (select max(pass) from #hierarchy)
begin
-- add a column for this pass
set @sql = 'alter table #list add p'+cast(@pass as varchar)+' int'
exec (@sql)

-- populate it: take the new parent id if the cluster merged this pass,
-- otherwise carry the prior pass's id forward
set @sql = 'Update l  set p'+cast(@pass as varchar)+ '= 
case when pid is null then p'+cast(@pass-1 as varchar)+' else pid end from #list l
left join  #hierarchy b
on p'+cast(@pass-1 as varchar)+' = cid and pass ='+ cast(@pass as varchar)

exec (@sql)

set @pass = @pass+1
end
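
To make the dynamic SQL concrete, on the first pass the two statements expand to:

alter table #list add p1 int

Update l  set p1 = 
case when pid is null then p0 else pid end from #list l
left join  #hierarchy b
on p0 = cid and pass = 1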

select * from #list  
order by p11 ,p10 ,p9 ,p8 ,p7 ,p6 ,p5 ,p4 ,p3 ,p2 ,p1 ,p0 
-- 12 observations yield 11 passes, hence columns p0 through p11

p0 p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11
8 8 8 8 8 8 8 19 19 19 19 23
12 12 12 12 12 12 18 19 19 19 19 23
5 5 14 14 14 14 18 19 19 19 19 23
6 13 14 14 14 14 18 19 19 19 19 23
9 13 14 14 14 14 18 19 19 19 19 23
11 11 11 11 11 11 11 11 11 11 22 23
7 7 7 7 16 16 16 16 16 21 22 23
10 10 10 10 16 16 16 16 16 21 22 23
2 2 2 2 2 2 2 2 20 21 22 23
3 3 3 3 3 17 17 17 20 21 22 23
1 1 1 15 15 17 17 17 20 21 22 23
4 4 4 15 15 17 17 17 20 21 22 23

The results conform with the visually evident clustering above.  Immediately 9 & 6 cluster, followed by 5, while 1 & 4 start a new cluster; all are clearly close together in the chart above.  Ultimately the data set forms two clusters, one with 5, 6, 8, 9 & 12 and the other with the balance, which is again consistent with the visual expectations.

Monday, March 10, 2014

Text Mining With a Bag of Words

Read virtually any article on big data, data mining, machine learning, or artificial intelligence and you'll come across the assertion that the vast majority of the data in existence, being created daily at an increasing rate, isn't in neat tables but rather in text-based documents.  Emails, RSS feeds, blogs (like this one), web pages, Tweets, etc. are all basically strings of words that in their raw form are not suitable for algorithmic analysis.  Nevertheless, academics and data scientists have developed a number of techniques for machine learning from text data, with names like bag of words, sentiment analysis, topic modeling and natural language processing.

This post uses a versatile approach to text modeling called the bag of words, which models documents as lists (or bags) of words as features, ignoring order and grammar. To improve matching, words are reduced to their linguistic roots through stemming, and those with little discriminatory power are removed using a "stop word" list and a simple frequency threshold.  The words are weighted by frequency, and the frequencies are then normalized.  The similarity of each document pair is computed as the dot product of their normalized word-frequency vectors.

The first step is to create the list of documents, consisting of a unique id, title and body.  For the purposes of demonstration this post uses the descriptions of the thirty companies in the Dow Jones index from Yahoo finance. Since most readers will have some familiarity with these companies, the ultimate similarity matrix can be evaluated intuitively.   For the sake of brevity (and to avoid any copyright issues) only the first part of the first few records is shown below; the data can be scraped from the website by changing the GE in the URL to each company's symbol, copying the description and pasting it into the insert statement below.   Because an apostrophe in Yahoo's text would terminate the SQL string literal, apostrophes are manually deleted before inserting; there are ways to handle them in SQL, just not germane to the business at hand.
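
For the curious, the standard T-SQL escape is to double the apostrophe inside the literal; a contrived example:

-- doubling the apostrophe keeps it in the string
select 'Yahoo''s apostrophe survives' as example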

create table #documents (id int identity(1,1), title varchar (48), body varchar(max))

insert into #documents select 'General Electric','General Electric Company operates as...
insert into #documents select 'International Business Machines','Internat...
insert into #documents select '3M Co','3M Company operates as a..

The next step is to parse the body text into terms, which may make more sense than calling them words because, after stemming, most would not pass a spell check.  Parsing is done by first rendering the string in lower case and then looping through each character.  Terms are built using only characters that fall within the ASCII range of 97 to 122 (the lowercase letters a through z); all other characters are treated as spaces indicating breaks between terms. Once a term is parsed, it is checked against a list of common words, or stop words, that are generally ignored, and then reduced to its lexical stem using the Porter Stemmer presented in an earlier post. There is no universally agreed upon list of stop words, but multiple lists are available on the web, one of which is loaded into the stop word function (see Appendix).  When a document is completely parsed, the terms are grouped, counted and inserted into the #documents_terms table.


create table #list (term varchar(24))
create table #documents_terms(id  int, term varchar(24), freq int, freq_incl binary, freq_norm float)
create table #vocabulary(term varchar(24), doc_freq int, term_freq int)

declare @count_char int, @term varchar(24), @count_term int, @document varchar(max), @id int

-- loop over documents that have not yet been parsed
while exists (select * from #documents  where id not in (select id from #documents_terms))
begin
delete from #list
select top 1 @id = id from #documents where id not in (select id from #documents_terms)
select @document = lower(body), @count_term = 0,  @count_char = 1, @term = '' from #documents where id =@id

-- walk the document character by character (the +1 flushes the final term)
while @count_char<=len(@document)+1
begin
-- ascii 97 to 122 is lowercase a-z; anything else ends the current term
if ascii(substring (@document,@count_char,1)) between 97 and 122 set @term = @term+substring (@document,@count_char,1)
else if len(@term)>0
begin
-- drop stop words, then stem before adding to the list
insert into #list select dbo.fx_psWord(@term)  where dbo.fx_stopword(@term)=0
select @term = '',@count_term = @count_term+1
end
set @count_char = @count_char +1
end
insert into #documents_terms (id, term, freq) select @id, term, count(1) from #list  group by term
end

insert into #vocabulary select term, count(1) doc_freq, sum(freq) term_freq from #documents_terms group by term

Use the #vocabulary table to identify terms that should be excluded from the similarity measure: those occurring in too few documents contribute little to matching, while those occurring in too many dominate it without discriminating.


select count(1) terms, doc_freq documents from #vocabulary group by doc_freq order by doc_freq


Terms Documents
893 1
200 2
108 3
48 4
41 5
15 6
13 7
13 8
10 9
7 10
5 11
3 12
1 13
2 14
2 15
4 16
1 18
3 19
2 20
1 21
2 22
1 23
3 25
1 28
1 29
1 30

Based on the above distribution I decided to limit the included terms to those appearing in at least 2 documents but no more than 20.  One could, of course, try different ranges and review the results to pick the best one.

update a  set freq_incl = case when b.doc_freq between 2 and 20 then 1 else 0 end  from #documents_terms a join #vocabulary b on a.term= b.term
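
As a quick sanity check (my addition), you can count how many term occurrences survive the cut:

-- occurrences and distinct terms kept (freq_incl = 1) versus dropped (0)
select freq_incl, count(1) occurrences, count(distinct term) terms
from #documents_terms
group by freq_incl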

The final computation before the similarity matrix is to normalize the frequencies of the included terms.  The normalized value is the frequency of term X in document Y divided by the square root of the sum of squared frequencies of all included terms in document Y.  This could be done within the similarity calculation, but it is easier to separate it out.

update a set freq_norm =  cast(freq as float)/ (select power(sum(power(cast(freq as float),2)),.5) from #documents_terms b where b.id = a.id and b.freq_incl = 1)  from #documents_terms a  where a.freq_incl = 1
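
If the normalization worked, every document vector now has unit length, which is easy to verify (again, my addition):

-- the sum of squared normalized frequencies should equal 1 for each document
select id, sum(freq_norm*freq_norm) squared_length
from #documents_terms
where freq_incl = 1
group by id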

The final step is to calculate the similarities, in this case using cosine similarity: (A·B) / (||A|| ||B||). Since the data has been normalized to A/||A||, similarity is merely a matter of calculating the dot product for each unique company pair.  This requires joining the #documents_terms table to itself on the term, keeping only pairs where the id of the first table is greater than that of the second, and summing the product of the normalized frequencies over the included terms.

select c.title atitle, d.title btitle, a.id aid, b.id bid, sum(a.freq_norm*b.freq_norm) Similarity
into #similarity
from #documents_terms a 
join #documents_terms b on a.term = b.term
join #documents c on a.id = c.id
join #documents d on b.id = d.id
where a.freq_incl = 1 and b.freq_incl = 1 and a.id > b.id
group by c.title, d.title, a.id, b.id

A review of the top 10 company pairs by similarity is qualitatively a success.   JPMorgan and Goldman Sachs are both finance companies, United Technologies and GE are conglomerates, Verizon and AT&T are communications, Pfizer and Merck are pharmaceuticals, and so forth.


select top 10 atitle,btitle,similarity from #similarity order by similarity desc



atitle btitle Similarity
JPMorgan Chase and Co Goldman Sachs Group Inc 0.67
United Technologies Corp General Electric 0.66
Verizon Communications Inc AT&T Inc 0.62
Pfizer Inc Merck & Co Inc 0.58
Exxon Mobil Corp Chevron Corp 0.57
United Technologies Corp Boeing Co 0.57
Procter & Gamble Co Johnson & Johnson 0.56
Caterpillar Inc General Electric 0.55
E I du Pont de Nemours and Co 3M Co 0.54
Pfizer Inc Johnson & Johnson 0.51

The bag of words model presented here is a basic approach, and there are many modifications and extensions available (possible future posts as well) to extract and display additional information. For example, the parsing could include numbers, capitals and bi-grams (two words as a feature rather than one).  Weighting can be binary rather than frequency-based, or use a term frequency-inverse document frequency (tf-idf) algorithm.  The document-term matrix can be factored using non-negative matrix factorization or clustered using K-Means.  The similarity matrix can be evaluated using hierarchical agglomerative clustering or Markov clustering.  Stay tuned.
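
As a taste of one extension, a tf-idf weight could be computed from the tables already built. A sketch, assuming the common freq * log(N/doc_freq) form and this corpus's 30 documents:

-- hypothetical tf-idf weight; log() is the natural log in T-SQL
select a.id, a.term, a.freq * log(30.0 / b.doc_freq) tf_idf
from #documents_terms a
join #vocabulary b on a.term = b.term
where a.freq_incl = 1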

Appendix: Stop Word Function

create function fx_stopword( @word varchar(24))
returns binary
as
begin
return case when @word in
('a','able','about','across','after','all','almost','also','am','among','an','and','any','are','as','at',
'be','because','been','but','by','can','cannot','could','dear','did','do','does','either','else','ever',
'every','for','from','get','got','had','has','have','he','her','hers','him','his','how','however','i',
'if','in','into','is','it','its','just','least','let','like','likely','may','me','might','most','must','my',
'neither','no','nor','not','of','off','often','on','only','or','other','our','own','rather','said','say',
'says','she','should','since','so','some','than','that','the','their','them','then','there','these','they','this','tis','to','too','twas','us','wants','was','we','were','what','when','where','which','while',
'who','whom','why','will','with','would','yet','you','your') 
then 1 else 0 end
end
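
A quick illustrative test of the function (cast to int for readability):

-- expect 1 for a stop word, 0 otherwise
select cast(dbo.fx_stopword('the') as int) stop_the, cast(dbo.fx_stopword('house') as int) stop_house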






Friday, March 7, 2014

Porter Stemming: Pruning Words to Their Root

Stemming decreases the variety of words by finding the root or stem of each word.   For example, "house", "houses" and "housing" parse as separate words, but share the same root: "hous".  Using the stem reduces three words to one and may improve document matching. A well-known stemmer is the one by Martin Porter, which is available on-line with a comprehensive list of sources and code, including T-SQL by Keith Lubell.

I originally intended to use Mr. Lubell's code in my text processing program, but decided to code it in (what I believe to be) a simpler form.  My objective was to incorporate all of the code required for the stemmer into a single function, rather than the stand-alone tables and helper functions that support only the stemmer in the available SQL.  I was only modestly successful: I substituted case-when statements for the table, but as the helper functions are called repeatedly, and I know of no way to code a subroutine within a T-SQL function, the helper functions remain.

The helper functions and what they do:

  • fx_psPat converts a word into a string of v's and c's for vowels and consonants
  • fx_psMeas calculates the measure of a word: how often the pattern vc occurs
  • fx_psVl identifies whether there is a vowel in the string
  • fx_psCVC identifies whether the last three letters of the stem are consonant-vowel-consonant
  • fx_psCC identifies whether the stem ends in a double consonant
The primary function is fx_psWORD('word').   Below is a test query and the results:

select dbo.fx_psWORD('house')  house, dbo.fx_psWORD('houses') houses, dbo.fx_psWORD('housing') housing


house houses housing
hous hous hous



The code in its entirety:


Create function fx_psPat(@str varchar(48))
returns varchar(24)
as 
begin
declare @ctr smallint, @pat varchar(48)
select @ctr = 1, @pat = ''
while @ctr<=len(@str)
begin
set @pat = @pat+case 
when charindex(substring(@str,@ctr,1),'aeiou')>0 then 'v'  
when substring(@str,@ctr,1) = 'y' and @ctr>1 and charindex(substring(@str,@ctr-1,1), 'aeiou') = 0 then 'v'
else 'c' end
set @ctr = @ctr+1
end
return @pat
end

Create function fx_psMeas(@str varchar(48))
returns smallint
as 
begin
declare @ctr smallint, @flg binary, @m smallint
select @ctr = 1, @flg = 0, @m=0, @str = dbo.fx_psPat(@str)
while @ctr<=len(@str)
begin
if substring(@str,@ctr,1)='v' or @flg = 1
begin
set @flg = 1
if substring(@str,@ctr,1) = 'c' select @m = @m+1, @flg = 0
end
set @ctr = @ctr+1
end
return @m
end
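
A quick illustration of the two functions so far (my own check): 'trouble' maps to the pattern ccvvccv, which contains one vc transition, so its measure is 1.

select dbo.fx_psPat('trouble') pattern, dbo.fx_psMeas('trouble') measure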

Create function fx_psVl(@str varchar(48))
returns binary
as
begin
return (select case when  charindex( 'v', dbo.fx_psPat(@str))>0 then 1 else 0 end)
end


create function fx_psCVC(@str varchar(48))
returns binary
as
begin
return (select case when  len(@str)>2 and right(dbo.fx_psPat(@str),3)='cvc'  and charindex(right(@str,1),'wxy') = 0  then 1 else 0 end)
end


create function fx_psCC(@str varchar(48))
returns binary
as
begin
return (select case when  len(@str)>1 and ((right(dbo.fx_psPat(@str),2) ='cc' and substring(@str, len(@str),1)= substring(@str, len(@str)-1,1))  
or (right(@str,2) ='yy' and charindex(substring(@str, len(@str)-3,1),'aeiou')=0 ))  then 1 else 0 end)
end

create function fx_psWORD(@str varchar(48))
returns varchar(48)
as
begin

set @str = lower(rtrim(ltrim(@str)))

if len(@str)>2
begin 

--step 1a
set @str = case
when len(@str)>4 and right(@str, 4) = 'sses'  then left(@str, len(@str)-4)+'ss'
when len(@str)>3 and right(@str, 3) = 'ies'  then left(@str, len(@str)-3)+ 'i'
when len(@str)>2 and right(@str, 2) = 'ss'  then @str
when len(@str)>1 and right(@str, 1) = 's'  then left(@str,len(@str)-1)
else @str end

--step 1b
set @str = case  
when len(@str)>3 and dbo.fx_psMeas(left(@str, len(@str)-3))>0  and right(@str,3) ='eed'  then  left(@str, len(@str)-3)+'ee'
when len(@str)>3 and right(@str,3) ='eed' then  @str
when len(@str)>2 and dbo.fx_psvl(left(@str,len(@str)-2))=1 and right(@str,2)='ed' then left(@str, len(@str)-2)+'\'
when len(@str)>3 and dbo.fx_psvl(left(@str,len(@str)-3))=1 and right(@str,3)='ing' then left(@str, len(@str)-3)+'\'
else @str end

if right(@str,1) = '\'
begin
set @str =replace(@str,'\','')
set @str = case
when  len(@str)>2 and right(@str,2) ='at' then left(@str, len(@str)-2)+'ate'
when  len(@str)>2 and right(@str,2) ='bl' then left(@str, len(@str)-2)+'ble'
when  len(@str)>2 and right(@str,2) ='iz' then left(@str, len(@str)-2)+'ize'
when  len(@str)>1 and dbo.fx_psCC(@str)=1 and right(@str,1) not in ('l','s','z')  then left(@str, len(@str)-1)
when  len(@str)>2 and dbo.fx_psmeas(@str)=1 and dbo.fx_psCVC(@str)=1  then @str+'e'
else @str end
end

--step 1c
set @str = case
when dbo.fx_psvl(left(@str,len(@str)-1))=1 and right(@str,1)='y' then left(@str, len(@str)-1)+'i'
else @str end

--step2
set @str = case 
when len(@str)>7 and dbo.fx_psMeas(left(@str,len(@str)-7))>0 and right(@str, 7)='ational' then left(@str,len(@str)-7)+'ate'
when len(@str)>6 and dbo.fx_psMeas(left(@str,len(@str)-6))>0 and right(@str, 6)='tional' then left(@str,len(@str)-6)+'tion'
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4))>0 and right(@str, 4)= 'enci' then left(@str,len(@str)-4)+'ence'
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4))>0 and right(@str, 4)='anci' then left(@str,len(@str)-4)+'ance'
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4))>0 and right(@str, 4)='izer' then left(@str,len(@str)-4)+'ize'
when len(@str)>3 and dbo.fx_psMeas(left(@str,len(@str)-3))>0 and  right(@str, 3)='bli' then left(@str,len(@str)-3)+'ble'
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4))>0 and right(@str, 4)='alli' then left(@str,len(@str)-4)+'al'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5))>0 and right(@str, 5)='entli' then left(@str,len(@str)-5)+'ent'
when len(@str)>3 and dbo.fx_psMeas(left(@str,len(@str)-3))>0 and right(@str, 3)='eli' then left(@str,len(@str)-3)+'e'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5))>0 and right(@str, 5)='ousli' then left(@str,len(@str)-5)+'ous'
when len(@str)>7 and dbo.fx_psMeas(left(@str,len(@str)-7))>0 and right(@str, 7)='ization' then left(@str,len(@str)-7)+'ize'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5))>0 and right(@str, 5)='ation' then left(@str,len(@str)-5)+'ate'
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4))>0 and right(@str, 4)='ator' then left(@str,len(@str)-4)+'ate'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5))>0 and right(@str, 5)='alism' then left(@str,len(@str)-5)+'al'
when len(@str)>7 and dbo.fx_psMeas(left(@str,len(@str)-7))>0 and right(@str, 7)='iveness' then left(@str,len(@str)-7)+'ive'
when len(@str)>7 and dbo.fx_psMeas(left(@str,len(@str)-7))>0 and right(@str, 7)='fulness' then left(@str,len(@str)-7)+'ful'
when len(@str)>7 and dbo.fx_psMeas(left(@str,len(@str)-7))>0 and right(@str, 7)='ousness' then left(@str,len(@str)-7)+'ous'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5))>0 and right(@str, 5)='aliti' then left(@str,len(@str)-5)+'al'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5))>0 and right(@str, 5)='iviti' then left(@str,len(@str)-5)+'ive'
when len(@str)>6 and dbo.fx_psMeas(left(@str,len(@str)-6))>0 and  right(@str, 6)='biliti' then left(@str,len(@str)-6)+'ble'
else @str end


--step 3
set @str = case 
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4)) >0 and right(@str, 4)='logi' then left(@str,len(@str)-4)+'log'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5)) >0 and right(@str, 5)='icate' then left(@str,len(@str)-5)+'ic'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5)) >0 and right(@str, 5)='ative' then left(@str,len(@str)-5)
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5)) >0 and right(@str, 5)='alize' then left(@str,len(@str)-5)+'al'
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5)) >0 and right(@str, 5)='iciti' then left(@str,len(@str)-5)+'ic'
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4)) >0 and right(@str, 4)='ical' then left(@str,len(@str)-4)+'ic'
when len(@str)>3 and dbo.fx_psMeas(left(@str,len(@str)-3)) >0 and right(@str, 3)='ful' then left(@str,len(@str)-3)
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4)) >0 and right(@str, 4)='ness' then left(@str,len(@str)-4)
else @str end

-- step 4
set @str = case 
when len(@str)>5 and dbo.fx_psMeas(left(@str,len(@str)-5)) >1 and  right(@str, 5)='ement' then left(@str,len(@str)-5)
when len(@str)>5 and right(@str, 5)='ement' then @str
when len(@str)>4 and dbo.fx_psMeas(left(@str,len(@str)-4)) >1 and  right(@str, 4) in ('ance','ence','able','ible','ment') then left(@str,len(@str)-4)
when len(@str)>4 and  right(@str, 4) in ('ance','ence','able','ible','ment') then @str
when len(@str)>3 and dbo.fx_psMeas(left(@str,len(@str)-3)) >1 and  right(@str, 3)='ion' and right(left(@str,len(@str)-3),1) in ('s','t') then left(@str,len(@str)-3)
when len(@str)>3 and dbo.fx_psMeas(left(@str,len(@str)-3)) >1 and  right(@str, 3) in ('ent','ism','ate','iti','ous','ive','ize','ant') then left(@str,len(@str)-3)
when len(@str)>2 and dbo.fx_psMeas(left(@str,len(@str)-2)) >1 and right(@str, 2)in ('al','er','ic','ou') then left(@str,len(@str)-2)
else @str end

-- step 5a
set @str = case 
when len(@str)>1 and dbo.fx_psMeas(left(@str,len(@str)-1))>1 and right(@str,1) = 'e' then left(@str,len(@str)-1)
when len(@str)>1 and dbo.fx_psMeas(left(@str,len(@str)-1))=1 and dbo.fx_psCVC(left(@str,len(@str)-1)) = 0 and right(@str,1) = 'e' then left(@str,len(@str)-1)
else @str end

-- step 5b
set @str = case 
when len(@str)>1 and dbo.fx_psMeas(left(@str,len(@str)-1))>1 and dbo.fx_psCC(@str)=1 and  right(@str,1) = 'l' then left(@str,len(@str)-1)
else @str end

end
return @str
end