My goal is to demonstrate data mining and machine learning tools using SQL. Why? For fame and fortune (and data is popular right now). Why data mining? Because I find it fascinating, enjoy extracting information from data, and believe in evidence-based analysis, decision making and prediction. Why SQL? I know Python and R, but I like SQL and the code seems to flow. I have used Python for extracting data from the web and R in general, but find both frustrating. Since I started on this project about 6 months ago, I have found that just about everything can be done in fairly simple SQL. I use Microsoft's SQL Server Express because it is free. The most serious drawbacks are speed on large data sets and a lack of graphics. To remedy the first, I try to use efficient data handling; to remedy the second, I copy results to OpenOffice Calc (all the software I use is free).
Some of the topics I will explore (in no particular order): time series forecasting, regression, Monte Carlo modeling, Naive Bayes, matrix factorization, Markov clustering, topic modeling, Dijkstra's algorithm, nearest neighbor, k-means clustering, rule extraction, correlation & similarity, document clustering and more. My approach will be to provide a brief non-statistical description of the algorithm, perhaps a spreadsheet example, my SQL code and a sample of the results, with interpretation if appropriate. I will try to use real-world data. For those interested in a detailed understanding of the theoretical underpinnings, start with Google. As for the code, I cannot claim to be an originalist; I frequently peek over the shoulders of others and will provide attribution unless I forget.
Finally, there will be errors. Let me know and I will correct them.
Update: I've recently started adding forecasting and planning code to the blog; I hope it proves useful. I've worked nearly 20 years in supply chain. If you look at my background, you'll note I led purchasing and inventory management at a company with over 200K stock keeping units and $100m in inventory, where we employed SQL for rapid development (for example, a new item-level forecasting system designed, coded and operational in just over a week). I'm bringing some of that to bear in my current posts.