This post uses a versatile approach to text modeling called the bag of words, which treats each document as an unordered collection (or bag) of words, ignoring grammar and word order. To improve matching, words are reduced to their linguistic roots through stemming, and words with little discriminatory power are removed using a "stop word" list and a simple frequency threshold. The remaining words are weighted by their frequency within each document, and the frequency vectors are then normalized to unit length. The similarity of each pair of documents is the dot product of their normalized word frequency vectors, which is equivalent to the cosine similarity of the raw frequency vectors.
The first step is to create the list of documents, each consisting of a unique id, a title and a body. For demonstration purposes, this post uses the descriptions of the thirty companies in the Dow Jones index from Yahoo Finance. Since most readers will have some familiarity with these companies, the resulting similarity matrix can be evaluated intuitively. For the sake of brevity (and to avoid any copyright issues) only the first part of the first few records is shown below; the data can be gathered from the website by changing the GE in the URL to each company's symbol, copying the description and pasting it into the insert statement below. Because an apostrophe in the Yahoo text would be read as the end of the SQL string literal, apostrophes are manually deleted before inserting; there are ways to handle them in SQL (such as doubling them), but that is not germane to the business at hand.
create table #documents (id int identity(1,1), title varchar(48), body varchar(max))
insert into #documents (title, body) select 'General Electric','General Electric Company operates as...
insert into #documents (title, body) select 'International Business Machines','Internat...
insert into #documents (title, body) select '3M Co','3M Company operates as a...
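As a quick check that the load worked (this query is not part of the original post), the stored titles and the start of each description can be listed:
-- sanity check: one row per company with the first part of its description
select id, title, left(body, 60) as body_start from #documents order by id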
The next step is to parse the body text into terms; "terms" may make more sense than "words" because, once processed, most of them would not pass a spell check. Parsing is done by first rendering the string in lower case and then looping through each character. Terms are built using only characters that fall within the ASCII range of 97 to 122 (lower-case a through z); all other characters are treated as spaces indicating breaks between terms. Once a term is parsed, it is checked against a list of common words, or stop words, that are generally ignored, and is then reduced to its lexical stem using the Porter Stemmer presented in an earlier post. There is no universally agreed-upon stop word list, but several are available on the web, one of which is loaded into the stopword function (see Appendix). When a document is completely parsed, the terms are grouped, counted and inserted into the #documents_terms table.
create table #list (term varchar(24))                                                 -- scratch list of stemmed terms for the current document
create table #documents_terms (id int, term varchar(24), freq int, freq_incl bit, freq_norm float)  -- one row per document/term pair
create table #vocabulary (term varchar(24), doc_freq int, term_freq int)              -- corpus-wide term counts
declare @count_char int, @term varchar(24), @count_term int, @document varchar(max), @id int
-- parse each document that does not yet have rows in #documents_terms
while exists (select * from #documents where id not in (select id from #documents_terms))
begin
    delete from #list
    select top 1 @id = id from #documents where id not in (select id from #documents_terms)
    select @document = lower(body), @count_term = 0, @count_char = 1, @term = '' from #documents where id = @id
    -- walk the text one character at a time (the +1 flushes a term that ends the string)
    while @count_char <= len(@document)+1
    begin
        if ascii(substring(@document,@count_char,1)) between 97 and 122
            set @term = @term + substring(@document,@count_char,1)   -- letter: extend the current term
        else if len(@term) > 0
        begin
            -- non-letter: the term is complete; stem it and keep it unless it is a stop word
            insert into #list select dbo.fx_psWord(@term) where dbo.fx_stopword(@term) = 0
            select @term = '', @count_term = @count_term + 1
        end
        set @count_char = @count_char + 1
    end
    -- group, count and store the document's terms
    insert into #documents_terms (id, term, freq) select @id, term, count(1) from #list group by term
end
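As a spot check (not part of the original post), the most frequent stemmed terms for the first document can be listed; the exact strings depend on the Porter stemmer's output:
-- sanity check: top stemmed terms parsed from document 1
select top 10 term, freq from #documents_terms where id = 1 order by freq desc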
-- doc_freq: number of documents containing the term; term_freq: total occurrences across the corpus
insert into #vocabulary (term, doc_freq, term_freq) select term, count(1), sum(freq) from #documents_terms group by term
The #vocabulary table is used to identify terms to exclude: those that occur in too few documents to contribute to the similarity measure and those that occur in so many documents that they would dominate it without discriminating between companies.
select doc_freq, count(1) terms from #vocabulary group by doc_freq order by doc_freq
Terms | Documents Containing Term |
893 | 1 |
200 | 2 |
108 | 3 |
48 | 4 |
41 | 5 |
15 | 6 |
13 | 7 |
13 | 8 |
10 | 9 |
7 | 10 |
5 | 11 |
3 | 12 |
1 | 13 |
2 | 14 |
2 | 15 |
4 | 16 |
1 | 18 |
3 | 19 |
2 | 20 |
1 | 21 |
2 | 22 |
1 | 23 |
3 | 25 |
1 | 28 |
1 | 29 |
1 | 30 |
Based on the above distribution, I decided to limit the included terms to those appearing in at least 2 documents but no more than 20. One could, of course, try different ranges and review the results to pick the best one.
update a set freq_incl = case when b.doc_freq between 2 and 20 then 1 else 0 end from #documents_terms a join #vocabulary b on a.term= b.term
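A quick count of how many distinct terms survive the filter (a check query, not in the original post):
-- included (freq_incl = 1) versus excluded (freq_incl = 0) distinct terms
select freq_incl, count(distinct term) as terms from #documents_terms group by freq_incl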
The last step before computing the similarity matrix is to normalize the frequencies of the included terms. The normalized frequency of term X in document Y is the frequency of X in Y divided by the square root of the sum of the squared frequencies of all included terms in Y, so that each document's vector of included terms has unit length. This could be folded into the similarity calculation, but it is easier to separate it out.
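In symbols (a standard formulation added here for reference, with f denoting raw frequency and sums running over included terms only):
w_{X,Y} = \frac{f_{X,Y}}{\sqrt{\sum_{X'} f_{X',Y}^{2}}}
\mathrm{sim}(Y_1, Y_2) = \sum_{X} w_{X,Y_1} \, w_{X,Y_2}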
update a set freq_norm = cast(freq as float)/ (select power(sum(power(cast(freq as float),2)),.5) from #documents_terms b where b.id = a.id and b.freq_incl = 1) from #documents_terms a where a.freq_incl = 1
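If the normalization is correct, the squared normalized frequencies within each document should sum to one; a quick verification (not part of the original post):
-- each included document vector should have (approximately) unit length
select id, sum(power(freq_norm, 2)) as squared_length from #documents_terms where freq_incl = 1 group by id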
-- similarity = dot product of the normalized term vectors; a.id > b.id keeps one triangle of the symmetric matrix
select c.title atitle, d.title btitle, a.id aid, b.id bid, sum(a.freq_norm*b.freq_norm) Similarity
into #similarity
from #documents_terms a
join #documents_terms b on a.term = b.term
join #documents c on a.id = c.id
join #documents d on b.id = d.id
where a.freq_incl = 1 and b.freq_incl = 1 and a.id > b.id
group by c.title, d.title, a.id, b.id
order by sum(a.freq_norm*b.freq_norm) desc
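The order by does not persist in #similarity, so the ranked pairs shown below are presumably retrieved with a query along these lines (my reconstruction, not from the original post):
-- ten most similar company pairs
select top 10 atitle, btitle, round(Similarity, 2) as Similarity from #similarity order by Similarity desc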
A review of the top 10 company pairs by similarity is qualitatively a success. JPMorgan and Goldman Sachs are both finance companies, United Technologies and GE are conglomerates, Verizon and AT&T are communications companies, Pfizer and Merck are pharmaceuticals, and so forth.
atitle | btitle | Similarity |
JPMorgan Chase and Co | Goldman Sachs Group Inc | 0.67 |
United Technologies Corp | General Electric | 0.66 |
Verizon Communications Inc | AT&T Inc | 0.62 |
Pfizer Inc | Merck & Co Inc | 0.58 |
Exxon Mobil Corp | Chevron Corp | 0.57 |
United Technologies Corp | Boeing Co | 0.57 |
Procter & Gamble Co | Johnson & Johnson | 0.56 |
Caterpillar Inc | General Electric | 0.55 |
E I du Pont de Nemours and Co | 3M Co | 0.54 |
Pfizer Inc | Johnson & Johnson | 0.51 |
The bag of words model presented here is a basic approach, and there are many modifications and extensions available (possible future posts as well) to extract and display additional information. For example, the parsing could include numbers, capitalization and bi-grams (two-word features rather than one). Weighting can be binary rather than frequency based, or use a term frequency-inverse document frequency (tf-idf) scheme, a rough sketch of which follows below. The document-term matrix can be factored using non-negative matrix factorization or clustered using K-means. The similarity matrix can be evaluated using hierarchical agglomerative clustering or Markov clustering. Stay tuned.
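As an illustration of the weighting extension only, here is a minimal tf-idf sketch; it writes the weights to a new temp table, #documents_tfidf (my naming, not from the original post), and reuses the doc_freq counts already in #vocabulary. The tfidf values would then be normalized in the same way as freq_norm and substituted into the similarity query.
-- sketch only: tf-idf weighting as an alternative to raw term frequency
declare @n float
select @n = count(1) from #documents          -- total number of documents in the corpus
-- tf-idf = term frequency * log(total documents / documents containing the term)
select a.id, a.term, cast(a.freq as float) * log(@n / cast(b.doc_freq as float)) as tfidf
into #documents_tfidf
from #documents_terms a
join #vocabulary b on a.term = b.term
where a.freq_incl = 1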
Appendix: Stop Word Function
create function fx_stopword (@word varchar(24))
returns bit
as
begin
return case when @word in
('a','able','about','across','after','all','almost','also','am','among','an','and','any','are','as','at',
'be','because','been','but','by','can','cannot','could','dear','did','do','does','either','else','ever',
'every','for','from','get','got','had','has','have','he','her','hers','him','his','how','however','i',
'if','in','into','is','it','its','just','least','let','like','likely','may','me','might','most','must','my',
'neither','no','nor','not','of','off','often','on','only','or','other','our','own','rather','said','say',
'says','she','should','since','so','some','than','that','the','their','them','then','there','these','they','this','tis','to','too','twas','us','wants','was','we','were','what','when','where','which','while',
'who','whom','why','will','with','would','yet','you','your')
then 1 else 0 end
end
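A quick usage check (not in the original post): a stop word returns 1 and a content word returns 0, which is why the parser keeps a term only when the function returns 0.
-- 'the' is in the stop word list; 'electric' is not
select dbo.fx_stopword('the') as is_stop_the, dbo.fx_stopword('electric') as is_stop_electric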