Monday, March 10, 2014

Text Mining With a Bag of Words

Read virtually any article on big data, data mining, machine learning, artificial intelligence and you'll come across the assertion that the vast majority of the data in existence, and being created daily at an increasing rate, isn't in neat tables but rather text based documents.  Emails, RSS feeds, blogs (like this one) , web pages, Tweets, etc are all basically strings of words that in their raw form are not suitable for algorithmic analysis.  Nevertheless, academics and data scientists have developed a number of techniques for machine learning from text data with names like the bag of words, sentiment analysis, topic modeling and natural language processing.

This post uses a versatile approach to text modeling called bag of words which models documents as lists (or bags) of words as features ignoring order and grammar. To improve matching, words are reduced to their linguistic roots through stemming and those with little discriminatory power are removed using a "stop word" list and a simple frequency threshold.  The words are weighted using frequency then the frequencies are normalized.  The similarity of document pairs is computed using the dot-product of their normalized word frequency vectors.

The first step is to create the list of documents consisting of a unique id, title and body.  For the purposes of demonstration this post uses the description of the thirty companies in the Dow Jones index from Yahoo finance. Since most readers will have some familiarity with these companies, the ultimate similarity matrix can be intuitively evaluated.   For the sake of  brevity (and to avoid any copyright issues) only the first part of the first few records are shown below; the data can be scraped from the website by changing the GE in the URL to the company symbol, copying the description and pasting it into the insert statement below.   As Yahoo's apostrophe matches the SQL apostrophe, it is manually deleted before inserting; there are ways to handle them in SQL, just not germane to the business at hand.

create table #documents (id int identity(1,1), title varchar (48), body varchar(max))

insert into #documents select 'General Electric','General Electric Company operates as...
insert into #documents select 'International Busines Machines','Internat...
insert into #documents select '3M Co','3M Company operates as a..

The next step is to parse the body text into terms which may make more sense than calling them words because it is the vernacular, and most words would not pass a spell check.  Parsing is done by first rendering the string in lower case then looping through each character.  Terms are built using only characters that fall within the ASCII range of 97 and 122;  all others are non-letters and treated as spaces indicating breaks between the terms Once the term is parsed it is checked against a list of common words, or stop words, that are generally ignored and then reduced to it's lexical stem using the Porter Stemmer presented in an earlier post. There is no universally agreed upon list of common words referred to as stop words  but there are multiple available on the web, one of which is loaded into the stopwords function (see Appendix).  When a document is completely parsed the terms are grouped, counted and inserted into the #documents_terms table.


create table #list (term varchar(24))
create table #documents_terms(id  int, term varchar(24), freq int, freq_incl binary, freq_norm float)
create table #vocabulary(term varchar(24), doc_freq int, term_freq int)

declare  @count_char int, @term varchar(24), @count_term int,@document varchar(max), @id int

while exists (select * from #documents  where id not in (select id from #documents_terms))
begin
delete from #list
select top 1 @id = id from #documents where id not in (select id from #documents_terms)
select @document = lower(document), @count_term = 0,  @count_char = 1, @term = '' from #documents where id =@id

while @count_char<=len(@document)+1
begin
if ascii(substring (@document,@count_char,1)) between 97 and 122 set @term = @term+substring (@document,@count_char,1)
else if len(@term)>0
begin
insert into #list select dbo.fx_psWord(@term)  where dbo.fx_stopword(@term)=0
select @term = '',@count_term = @count_term+1
end
set @count_char = @count_char +1
end
insert into #documents_terms (id, term, freq) select @id, term, count(1) from #list  group by term
end

select term,count(1) doc_freq, sum(freq) term_freq  into #vocabulary from #documents_terms  group by term 

Use the #vocabulary table to identify terms that are not going to drive or dominate the similarity measure because they either occur in too few or too many documents.  


select doc_freq, count(1) terms from #vocabulary  group by doc_freq  order by doc_freq


Terms Documents
893 1
200 2
108 3
48 4
41 5
15 6
13 7
13 8
10 9
7 10
5 11
3 12
1 13
2 14
2 15
4 16
1 18
3 19
2 20
1 21
2 22
1 23
3 25
1 28
1 29
1 30

Based on the above distribution I decided to limit the included terms to show in at least 2 document, but no more than 20.  One of course could try different ranges and review the results to pick the best range.  

update a  set freq_incl = case when b.doc_freq between 2 and 20 then 1 else 0 end  from #documents_terms a join #vocabulary b on a.term= b.term

The final computation before computing the similarity matrix is to normalize the data based on the included terms.  Normalization is computed as the frequency of term X in a document Y divided by square root of the sum of the squared frequencies for all (included) terms in document Y.  This could be done within the similarity calculation, but it is easier to separate it out.

update a set freq_norm =  cast(freq as float)/ (select power(sum(power(cast(freq as float),2)),.5) from #documents_terms b where b.id = a.id and b.freq_incl = 1)  from #documents_terms a  where a.freq_incl = 1

The final step is to calculate the similarities which in this case the measure is cosine similarity: A*B/||A||*||B|| Since the data has been normalized to A/||A|| similarity is merely a matter of calculating the dot product by unique company pairs.  This requires joining the #documents_terms table to itself  using the document id  where the id's of the first table are always greater than the other, and summing the product of the normalized frequencies using only the included terms.

select c.title atitle, d.title btitle, a.id aid, b.id bid ,sum(a.freq_norm*b.freq_norm)   Similarity
into #similarity
from #documents_terms a 
join #documents_terms b on a.term = b.term
join #documents c on a.id = c.id
join #documents d on b.id = d.id
where  a.freq_incl=1 and b.freq_incl = 1 and a.id>b.id
group by c.title , d.title , a.id, b.id

order by sum(a.freq_norm*b.freq_norm) desc

A review of the top 10 company pairs based on similarity is qualitatively a success.   JPMorgan and Goldman sachs are both finance companies, united technologies and GE are conglomerates, Verizon and ATT are communications, Pfizer and Merck are pharmaceuticals, and so forth.


select top 10 atitle,btitle,similarity from #similarity order by similarity desc



atitle btitle Similarity
JPMorgan Chase and Co Goldman Sachs Group Inc 0.67
United Technologies Corp General Electric 0.66
Verizon Communications Inc AT&T Inc 0.62
Pfizer Inc Merck & Co Inc 0.58
Exxon Mobil Corp Chevron Corp 0.57
United Technologies Corp Boeing Co 0.57
Procter & Gamble Co Johnson & Johnson 0.56
Caterpillar Inc General Electric 0.55
E I du Pont de Nemours and Co 3M Co 0.54
Pfizer Inc Johnson & Johnson 0.51

The bag of words model presented is a basic approach and there are many modifications and extensions available (for possible future posts as well) to extract and display additional information. For example, the parsing could include numbers, capitals and bi-grams (two words as a feature, rather than one).  Weighting can be binary based rather than frequency or use a term frequency-inverse document frequency algorithm.  The document term matrix can be factored using the non-negative matrix factorization or clustered using K-means.  The similarity matrix can be evaluated using hierarchical agglomerative clustering or Markov clustering.  Stay tuned.

Appendix: Stop Word Function

create function fx_stopword( @word varchar(24))
returns binary
as
begin
return case when @word in
('a','able','about','across','after','all','almost','also','am','among','an','and','any','are','as','at',
'be','because','been','but','by','can','cannot','could','dear','did','do','does','either','else','ever',
'every','for','from','get','got','had','has','have','he','her','hers','him','his','how','however','i',
'if','in','into','is','it','its','just','least','let','like','likely','may','me','might','most','must','my',
'neither','no','nor','not','of','off','often','on','only','or','other','our','own','rather','said','say',
'says','she','should','since','so','some','than','that','the','their','them','then','there','these','they','this','tis','to','too','twas','us','wants','was','we','were','what','when','where','which','while',
'who','whom','why','will','with','would','yet','you','your') 
then 1 else 0 end
end






No comments:

Post a Comment