Sunday, December 15, 2013

True Multiple Regression Using SQL.

Some weeks ago I posted on multiple regression.  The code in that post solved for the coefficients using gradient descent, an iterative process of reducing error through coefficient adjustment which seems a second rate compromise when compared to the ease of  the Linest function in Excel.  In the interim I researched multiple regression that builds on simple univariate regression  (effectively regressing Y on X is sum (XY)/sum(XX) in SQL) which forms the basis of this post.  After some trial & error I got the SQL to work.  My primary source for this approach is the somewhat difficult  Elements of Statistical Learning (2008) by Hastie, Tibshirni and Friedman.  A reasonable discussion on the topic is at stats exchange.

To begin create a table of data:

create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400

insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

It facilitates the code to convert the table of data into two data vectors:  the independent #x variables and the dependent #y variable.  There are a couple of modifications. First, rather than use the label names, I use a label index which facilitates the processing.  Second, In addition to the independent variables the #x table also includes the intercept variable with a value of 1 for each observation.  Finally, there is no error handling and the program cannot manage missing (null)  values.

The naming conventions used throughout are pretty simple.  The tables follow the names use in the articles with the exception of #c  for the orthogonal coefficients (instead of a Greek letter).  Field names are *id for the row id in table *, *n is the index of the field, *v is the value.

select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata


Three additional tables are required: the #z table holds the orthogonalized values,   #c holds the orthagonalization coefficients and #b the regression coefficients.

create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

create table #c(cxn int, czn int, cv float) 

create table #b(bn int, bv float) 

The first part of the code loops forward through each of the independent variables starting with the intercept calculating first the orthogonalization coefficients and then the orthogonalized X values  which are store in #z.

declare @p int
set @p = 1

while @p <= (select max(xn) from #x)

begin
insert into #c
select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv) 
from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv
set @p = @p +1

end

The second part of the code loops backward calculating the B coefficients by regressing Y on Z where Y is continually updated to exclude previously fit Y values (the sub-query)

 while @p>=0 
begin

insert into #b
select zn, sum(yv*zv)/ sum(zv*zv) 
from #z  join 
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
on zid = yid where zn = @p  group by zn

set @p = @p-1

end

The regression coefficients are:

select * from #b


bn bv
3 11.026
2 0.056
1 114.965
0 -96.747

Finally the fitted values are:

select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid


yid yv fit err
1 400 382.975 17.025
2 330 338.144 -8.144
3 369 394.169 -25.169
4 232 223.856 8.144
5 540 531.856 8.144

The r squared is 99.85%

 select 1-sum(power(err,2))/sum(power(yv,2)) from 
  (select yid, yv, fit, yv-fit err from #y join 
(select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
on yid = xid) d


Given the challenge in writing this I was ultimately surprised how little code was involved.  There an important detail missing from the model:  the errors associated with the coefficients for determining statistical significance.  Maybe at some point I'll return to regression and figure it out. but in the mean time perhaps a reader wants to take on the challenge...


Thursday, December 5, 2013

Viva Monte Carlo: Planning Model for Intermittent Demand

I recently posted a method for demand planning based on joint probability distribution functions derived using averages and variances.  While robust for items with higher sales frequency, my experience is that the method isn't very stable on low frequency items due to churn and susceptibility to an extreme observations (not to mention the normal distribution may not be appropriate).  In this post I'll highlight a distribution-free method based on Monte Carlo methods.

Monte Carlo methods are well documented elsewhere , so I won't go into much detail.  Suffice it to say that by repeatedly sampling from the receipt and sales distributions one can estimate the underlying joint lead-time demand distribution. The algorithm is pretty simple:
  1. For each item
  2. Randomly pick a lead-time in days using the receipts for that item
  3. For each day in #2 pick a day from the sampling period and add the number of orders for that day (often zero) to a running total
  4. For each order in #3 randomly pick an order from the order file and add the quantity to a running total
  5. Repeat steps 2-4 several hundred times saving the quantity total for each iteration
  6. Sort the iterations in ascending order and set the order point equal to the iteration that corresponds to the target service level
In terms of the distributions to pick from there are a number of approaches.  The first is to use the raw data by taking the top record from a randomly sorted table, a method that can become time consuming with lots of SKUs.  A second approach is to summarize the raw transactions into a cumulative distribution table and add a cumulative percent of total column.  the random value is choosen by selecting the record where the cumulative % of total corresponds to a random number between 0 and 1.  The third approach is to assume underlying distributions (i.e. orders per day is Poisson, order quantity is log-normal etc) and choose a value form that distribution using a random number.  This post will use combination of the first two methodologies.

The data is the same from the prior post (no cost data will be used, but I've left it in for consistency).

create table #item (item varchar(3), daysinstock int, servicetarget  float,stdcost float,  carrycost float, ordercost float)
insert into #item select 'abc', 120, .95,1.5,  .12, 15

create table #receipts  (purchaseorder int, item varchar(3), orderdate int, receiptdate int)
insert into #receipts select  456,'abc', 20130109, 20130226
insert into #receipts select  789,'abc', 20130325, 20130413
insert into #receipts select  567,'abc', 20131007, 20131130

create table #salesorders (salesorder int, salesdate  int, item varchar(3),   quantity int)
insert into #salesorders select 123, 20130208, 'abc',  1
insert into #salesorders select 234, 20131014, 'abc',  10
insert into #salesorders select 345, 20130413, 'abc',  35
insert into #salesorders select 456, 20130409, 'abc',  10
insert into #salesorders select 323, 20131105, 'abc',  5
insert into #salesorders select 432, 20130515, 'abc',  5

Step #2 of the algorithm picks a day at random which implies calendar that has lots of days with no sales.  Rather than build a calendar tailored to each item and select a day at random as in the first approach, construct a summary distribution table.

create table #calendar(item varchar(3), ordercount int, daycount int, distribution float)

The following inserts the number of days for each order count by item.  The sub-query (sq) counts the number of orders by day which is then summarized as number of order days for each order count.

insert into #calendar (item, ordercount, daycount)
select item, ordercount, count(1) from 
(select i.item, so.salesdate, count(1) ordercount  from #salesorders so 
join #item i on i.item = so.item group by i.item, so.salesdate) sq
group by ordercount, item

The day with no orders is calculated as the days in stock from the #item table less a sub-query totaling the number of days with orders from #calendar.

insert into #calendar (item, ordercount, daycount) select item, 0, daysinstock - (select sum(daycount) from #calendar c where c.item = i.item) from #item i

The last calendar step is to update the cumulative distribution which uses two sub-queries. The first totals the day count buy order count which is then standardized by dividing by the total number of days in the second sub-query.  To save time, the second query could be replaced by incorporating the days in stock value into the #calendar file.

update c set distribution  =  (select cast(sum(daycount) as float) from #calendar c1 where c1.item = c.item and c1.daycount >= c.daycount ) /(select cast(sum(daycount) as float) from #calendar c2 where c2.item = c.item) from #calendar c 

The next step is the looping program; the final results are stored in a table called #orderpoint  and consist of the item, lead-time demand quantity and order point quantity.  A scratchpad table called #quantity is also required to hold the output for each iteration.

create table #orderpoint(item varchar(3), leadtimequantity int, orderpoint int)
create table #quantity( quantity int)

A number of variables are also required for tracking; the first four are self explanatory, @it1 tracks the overall iterations (set at 100 here) and @it2 is a general purpose counter.

declare @item varchar(3),@servicetarget int, @leadtime int, @ordercount int, @quantity int, @it1 int,@it2 int

The program runs until every item in the #item table is also in the #orderpoint table.  I've highlighted the looping statements in yellow and annotated their function.  There are two functions used for the random selection.  The first is newid()  which is build into SQL server and is really  a pseudo random sequence of characters.  Using first check sum and then the right 3 divided by 999 than can be converted into a random number between 0 and 1

while 1=1 --loop for each item
begin 
set @item = (select top 1 item from #item where item not in (select item from #orderpoint))
If @item is null break
set @servicetarget  = (select cast(servicetarget*100 as int) from #item where item = @item)
set @it1 = 1000
truncate table #quantity
while @it1>0 --loop for each iteration picking a lead time
begin
set @leadtime = (select top 1 datediff(d, convert(char, orderdate,112), convert(char, receiptdate,112)) from #receipts where item = @item order by newid())
set @it2 = 1
set @ordercount = 0 
while  @it2 < = @leadtime --loop for each day in lead time picking an order count
begin
set @ordercount=@ordercount+ (select min(ordercount) from #calendar  where item = @item and distribution >=dbo.fx_rand())
set @it2 = @it2+1
end
set @it2 = 1
set @quantity = 0
while  @it2 < = @ordercount --loop for each order in order count picking a quantity
begin
set @quantity=@quantity+(select top 1 quantity from #salesorders where item = @item order by newid())
set @it2 = @it2+1
end
insert into #quantity select @quantity
set @it1 = @it1-1
end
insert into #orderpoint select @item,leadtimequantity, max(sq2.quantity) from 
(select avg(quantity) leadtimequantity from #quantity) sq1,
(select top (@servicetarget) percent   quantity from #quantity  order by quantity) sq2
group by leadtimequantity
end

The orange highlighted section loads the lead-time demand and the order point to the working file that would then be loaded to an MRP system.

An interesting property of this approach is  that it can be used graphically illustrates the demand profile of an item.  Each iteration has an equal chance of occurring, a graph of the #quantity table for a given item shows all of the equally likely outcomes. Consider the graphs below.  The first shows the values for 1000 iterations (took about 2 seconds on my Gateway in SQL server express) as calculated  with an average of about 22 units.  The second is sorted based on quantity with the black arrows indicating the median (15 units) and 95th percentile (71 units).  The red arrow to the left is the number of iterations with  no volume. Shows the lead-time volume by frequency:  the red arrow is nearly 20% of the distribution:  There is nearly a one in five chance of selling nothing after placing an order.  The joys of intermittent demand.



Appendix:  Dynamic Random Numbers

The rand() command in TSQL is not dynamic; using it in a query will return the same value for each record returned.  I know of two work-arounds  of which I'm not sure which is the more random.

First, newid() is a built-in SQLServer function that returns a pseudo-random sequence of characters  that can be used for sorting as is:

select * from #table order by newid()

It can also be used to generate a random number between 0 and 1 :

select cast(right(checksum(newid()),3) as float)/999

The second alternative is more involved but is built on the rand() function.  First create a view that references the rand()  and then a function that return the results of the view:

create view vw_rand
as
select rand() random

create function dbo.fx_rand ()
returns float
as
begin
return (select random from vw_rand)
end










Tuesday, December 3, 2013

Forecasting for Inventory Planning

It has been some time since I last posted; to restart I'd like to get into an area that's related to data mining:  forecasting or better yet demand planning. In his book Predictive Analytics,  Eric Siegel asserts that "[Predictive analytics] is a completely different animal from forecasting."  I don't agree, they are both concerned with using real data to predict the future. Demand forecasting may be the most widely accepted application of empirical data to decision making in business. Well designed forecasting systems use actual orders and sales to predict how much manufacturers should make and purchasing departments should buy to support future demand.

I will use a very simple model that lends itself very nicely to the SQL environment that was written up by CERN researches (hard to beat nuclear scientists: http://ais.web.cern.ch/AIS/apps/lims/theory.html).  Rather than group data into monthly buckets and extrapolate using models like exponential smoothing, this approach offers a "mathematical solution based on the real statistical distributions."

Inventory management algorithms use some variation of the order point model: when inventory  for a given stock keeping unit (SKU) falls below an order point the system recommends purchase/manufacture of more (with the quantity usually based on the economic order quantity, a topic for another day).   The order point is calculated as quantity demanded over the lead time plus safety stock. Lead time is the time between noticing more is required and having more available in inventory for immediate sale and includes time for: inventory review + purchase order placing + vendor fulfillment + in transit + receiving + put-away and is usually measured in days or weeks.  Lead time quantity is the quantity expected to be sold over the lead time.  Safety stock is the inventory held for unpredictable future demand fluctuation.  Typically it is set to achieve some level of fill associated with the business' value proposition, say 95% of customer orders should be filled from stock.  Assuming the demand is normally distributed, then safety stock is the number of standard deviations associated with service level times the lead time standard error.

Modeling requires six elements:
  1. orders per day (opd) = count of orders for some period/days in period
  2. average order quantity (aoq) = sum of quantity per order/count of orders
  3. order quantity variance (oqv) = sum of  squared difference between  the order quantity and average order quantity / count of orders minus 1
  4. average lead time in days (altd)
  5. lead time in days variance (ltdv)
  6. number of standard deviations associated with service level (z)
The 3 basic formulas are (* means times and ^ means raise to the power of):
  1. lead time quantity (ltq) =  opd*altd*aoq
  2. lead time variance (ltv) =  opd*altd*(oqv + aoq^2) +opd^2 *aoq^2*ltdv
  3. order point (op) = ltq+z*ltv^.5
As an aside, many vendors specify their lead time which can be interpreted as  altd with zero variance simplifying #2 above to ltv = opd*altd*(oqv + aoq^2).

The program employs three tables  which are stripped to the essential information for the matters at hand.  The first is #item which holds the item list and relevant SKU specific planning parameters  such as the days in stock useful for new items or adjusting the forecast for stock outs, targeted service level  which my vary based on an item importance, and item cost data which comes in handy when computing carry costs, average inventory costs and economic order quantities. The second table is #receipts to measure the lead time. A purchase order has an order date followed by one or more receipt dates which are useful in estimating lead time.  The final table is #salesorders  which is self explanatory.  the one observation I would make is that the sales order date should be the date the customer requested, not the actual ship date.


create table #item (item varchar(3), daysinstock int, servicetarget  float,stdcost float,  carrycost float, ordercost float)
insert into #item select 'abc', 120, .95, 1.5,  .12, 15

create table #receipts  (purchaseorder int, item varchar(3), orderdate int, receiptdate int)
insert into #receipts select  321,'abc', 20130109, 20130226
insert into #receipts select  432,'abc', 20130325, 20130413

insert into #receipts select  543,'abc', 20131007, 20131130

create table #salesorders (salesorder int, salesreqdate  int, item varchar(3),   quantity int)

insert into #salesorders select 123, 20130208, 'abc',  1
insert into #salesorders select 234, 20131014, 'abc',  10
insert into #salesorders select 345, 20130413, 'abc',  35
insert into #salesorders select 456, 20130409, 'abc',  10
insert into #salesorders select 567, 20131105, 'abc',  5
insert into #salesorders select 678, 20130515, 'abc',  5

Using the built-in SQL functions for average, variance and count summarize the data using sub-queries for the receipts data and the sales order data and place in a summary table with holding places for lead-time quantity (ltq), lead time variance(ltv), order point (op) and economic order quantity (eoq).  While all of these could be calculated in the first pass, I broke it into updates to make following easier.

select i.item, i.daysinstock,servicetarget, stdcost,carrycost, ordercost, altd,ltdv ,ofreq/daysinstock opd, aoq,oqv, 
cast(0 as float) ltq,
cast(0 as float) ltqv, 
cast(0 as int) op,
cast(0 as int) eoq
into #summary from #item i
join
(select item, cast(count(1) as float)  rfreq, avg(cast(datediff(d,convert(char,orderdate,112), convert(char,receiptdate,112)) as float))  altd ,
var(cast(datediff(d,convert(char,orderdate,112), convert(char,receiptdate,112)) as float)) ltdv
from #receipts  group by item) lt
on i.item = lt.item
join 
(select  item, cast(count(1) as float) ofreq, avg (cast(qty as float)) aoq, var(cast(qty as float)) oqv from #salesorders  group by item)  so
on i.item = so.item

Use update statements to  add the lead time quantity and lead time variance:

update #summary
set ltq = opd*altd*aoq,
ltqv =  opd*altd*(oqv + power(aoq,2)) +power(opd,2)*power(aoq,2)*ltdv

The final step uses an update step to create the order point and the economic order quantity(eoq).   See  http://en.wikipedia.org/wiki/Economic_order_quantity for a very thorough definition.  Also, note the use of dbo.fx_normsinv  which is a function to convert a targeted service level (95%) to a z score.  I've included the code in an appendix.

update #summary  
set op = cast(ltq+ power(ltqv, 0.5)*dbo.fx_normsinv(servicetarget) as int),
eoq = cast(power((opd*365*aoq *ordercost)/(carrycost*stdcost),.5) as int)

In addition to planning, the data can also provide insight into performance and KPIs.  For example, average inventory can be approximated as one half the order quantity plus safety stock, cost of goods sold (cogs) and projected turns as cogs/avginv:

select 
sum((eoq/2+ (op-ltq))*stdcost) avginv,
sum(opd*365*aoq*stdcost) cogs,
sum(opd*365*aoq*stdcost)/sum((eoq/2+ (op-ltq))*stdcost) turns
from #summary

That's it.  I employed a slightly extended version of this to run inventory planning on nearly 200,000 SKUs with an inventory of over $100 million achieving a 95+% service level.

Appendix:  

Inverse of normal distribution per Abramowitz and Stegun formula 26.2.23

create function fx_normsinv(@p float) 
returns float
as
begin
declare @c0 float, @c1 float, @c2 float, @d1 float, @d2 float , @d3 float, @t float
if @p>=.5 set @p = 1.0-@p
select  @t = power(-2.0*log(@p),.5), @c0 = 2.515517, @c1 =0.802853,@c2 =0.010328,@d1 = 1.432788, @d2 =0.189269, @d3 =0.001308  
return (select  @t-((@c0+@c1*@t+@c2*@t*@t)/(1.0+@d1*@t+@d2*@t*@t+@d3*@t*@t*@t)))
end