Friday, July 12, 2013

Introduction

My goal is to demonstrate data mining and machine learning tools using SQL. Why? For fame and fortune (and data is popular right now). Why data mining? Because I find it fascinating, enjoy extracting information from data, and believe in evidence-based analysis, decision making and prediction. Why SQL? I know Python and R, but I like SQL and the code seems to flow. I have used Python for extracting data from the web and R more generally, but find both frustrating. Since I started on this project about 6 months ago, I have found that just about everything can be done in fairly simple SQL. I use Microsoft's SQL Server Express because it is free. The most serious drawbacks are speed on large data sets and a lack of graphics. I address the first with efficient data handling and the second by copying results to OpenOffice Calc (all the software I use is free).

Some of the topics I will explore (in no particular order): time series forecasting, regression, Monte Carlo modeling, naive Bayes, matrix factorization, Markov clustering, topic modeling, Dijkstra's algorithm, nearest neighbor, k-means clustering, rule extraction, correlation & similarity, document clustering and more. My approach will be to provide a brief non-statistical description of the algorithm, perhaps a spreadsheet example, my SQL code and a sample of the results, with interpretation if appropriate. I will try to use real-world data. For those interested in a detailed understanding of the theoretical underpinnings, start with Google. As for the code, I cannot claim to be an originalist; I frequently peek over the shoulders of others and will provide attribution unless I forget.

Finally, there will be errors. Let me know and I will correct them.

Update: I've recently started adding forecasting and planning code to the blog; I hope it proves useful. I've worked nearly 20 years in supply chain. If you look at my background, you'll note I led purchasing and inventory management at a company with over 200K stock keeping units and $100 million in inventory, where we employed SQL for rapid development (for example, a new item-level forecasting system designed, coded and operational in just over a week). I'm bringing some of that to bear in my current posts.
