The Big Data Newsvendor: Practical Insights from Machine Learning

发表在 Operations Research, 2019. DOI: https://doi.org/10.1287/opre.2018.1757.

Subject Classiﬁcations: inventory/production; stochastic: programming; stochastic: statistics; estimation

Area of Review: Operations and Supply Chains

Keywords: big data; newsvendor; machine learning; sample average approximation; statistical learning theory; quantile regression

文章是2015年初写的，2018年初接收。这篇文章把对需求的刻画，从统计学跃进到了数据科学。明天的需求，可能跟方方面面的因素有关系（天气、汇率、节假日等等），这汇聚成了 big data。文章提出了一种 distribution-free, one-step machine-learning 的算法来解决大数据时代下的 newsvendor 问题。

We investigate the data-driven newsvendor problem when one has n observations of p features related to the demand as well as historical demand data.

文章围绕四个问题展开：

How should the DM use a feature-demand data set to solve the newsvendor problem?
What is the value of incorporating features in newsvendor decision making in the ﬁrst place?
What theoretical guarantees does the DM using such data have, and how do these scale with the various problem parameters?
How do newsvendor decisions based on the feature-demand data set compare with other benchmarks in practice?

文章的 section 2,3,4,5 分别解决一个问题。

Q1

问题1是：如何在大数据时代解决 newsvendor 问题？

传统的 newsvendor 问题是：

$$ \begin{array}{c} C(q ; D):=b(D-q)^{+}+h(q-D)^{+} \\ \displaystyle\min _{q \geq 0} E C(q):=\mathbb{E}[C(q ; D)] \\ \end{array} $$

这是一个随机规划问题，如果我们不知道需求分布，但是有需求的样本，那么传统的SAA方法给出的解是：

$$ \hat{q}_{n}=\inf \left\{y: \hat{F}_{n}(y) \geq \frac{b}{b+h}\right\} $$

然而，在大数据时代，商品的需求可能与各个变量有关系，而不仅仅是服从某个分布这么简单。这就引出了 feature-based Newsvendor Problem

而 big-data newsvendor 问题是：

$$ \begin{equation} \min _{q(\cdot) \in \mathscr{L},\{q: \mathscr{x} \rightarrow \mathbb{R}\}} \mathbb{E}[C(q(\mathbf{x}) ; D(\mathbf{x})) \mid \mathbf{x}] \label{big-data-newsvendor} \end{equation} $$

其中 $\mathbf{x}$ 是 high-dimensional features，需求 $D(\mathbf{x})$是特征向量 $\mathbf{x}$ 的函数，优化的目的是找到一个规则，输入特征，输出订货量，并使得期望的损失最小。$\mathscr{L}$ 是一族可行的规则。

说白了，就是直接训练一个模型，输入特征，输出订货量。

关于问题 $\eqref{big-data-newsvendor}$ 的求解，文章给出了参数和非参数式两种方法。

假定有数据集 $S_n = [(\mathbf{x}_1, d_1), \dots, (\mathbf{x}_n, d_n)]$，每个特征 $\mathbf{x}$ 是 $p$ 维的。

Parametric —— Empirical Risk Minimization

参数式的方法假设这个决策规则 $q$ 可以由某些参数来表示，比如，假定这个可行集 $\mathscr{L}$ 是线性参数化的：

$$ \mathscr{L}=\left\{q: \mathscr{X} \rightarrow \mathbb{R}: q(\mathbf{x})=\mathbf{q}^{\prime} \mathbf{x}=\sum_{j=1}^{p} q^{j} x^{j}\right\} $$

如果线性不够，还可以加入二次项。

以经验风险作为优化目标：

$$ \min _{q(\cdot) \in \mathscr{L} \;\{q: \mathscr{x} \rightarrow \mathbb{R}\}} \hat{R}\left(q(\cdot) ; S_{n}\right) =\frac{1}{n} \sum_{i=1}^{n}\left[b\left(d_{i}-q\left(\mathbf{x}_{i}\right)\right)^{+}+h\left(q\left(\mathbf{x}_{i}\right)-d_{i}\right)^{+}\right] \tag{NV-ERM} $$

最终这个问题归结为参数的优化问题。我们还甚至可以把参数的正则化考虑进去。

Nonparametric Kernel Optimization Method

非参数回归希望用非参数的方法来近似条件期望，$m$ 不一定是线性函数

$$ \mathrm{E}(Y \mid X)=m(X) $$

首先：

$$ \mathrm{E}(Y \mid X=x)=\int y f(y \mid x) d y=\int y \frac{f(x, y)}{f(x)} d y $$

对 $f(x, y)$ 和 $f(x)$ 做核密度估计：

$$ \begin{aligned} &\hat{f}(x, y)=\frac{1}{n} \sum_{i=1}^n K_h\left(x-x_i\right) K_h\left(y-y_i\right) \\ &\hat{f}(x)=\frac{1}{n} \sum_{i=1}^n K_h\left(x-x_i\right) \end{aligned} $$

最后可以得到：

$$ \begin{aligned} \hat{\mathrm{E}}(Y \mid X=x) &=\int \frac{y \sum_{i=1}^n K_h\left(x-x_i\right) K_h\left(y-y_i\right)}{\sum_{j=1}^n K_h\left(x-x_j\right)} d y, \\ &=\frac{\sum_{i=1}^n K_h\left(x-x_i\right) \int y K_h\left(y-y_i\right) d y}{\sum_{j=1}^n K_h\left(x-x_j\right)}, \\ &=\frac{\sum_{i=1}^n K_h\left(x-x_i\right) y_i}{\sum_{j=1}^n K_h\left(x-x_j\right)} = \widehat{m}_h (x) \end{aligned} $$

$\widehat{m}_h(x)$ 即为 $Y$ 对 $X$ 的条件期望的 Nadaraya–Watson estimator.

文章试图用非参数的方法来估计 $\mathbb{E}\left[C(q ; D) \mid \mathbf{x}_{n+1}\right]$，对应的 NW estimator 是：

$$ \frac{\sum_{i=1}^n K_w\left(\mathbf{x}_{n+1}-\mathbf{x}_i\right) C\left(q, d_i\right)}{\sum_{i=1}^n K_w\left(\mathbf{x}_{n+1}-\mathbf{x}_i\right)} $$

实际上上面这个式子是关于 $q$ 是线性的，文章的 Proposition 1 给出了这种方法的 $\hat{q}^{\kappa}_n$ .

随后，文章列举了一些相近的文献并 argue 了本文的优点。在此略过。

Q2

问题2是：需求特征的价值是什么？

在这一部分，文章举了两个示例，并证明在加入 demand-feature 时，SAA 方法不一定 consistent ，而 ERM 方法 consistent.

Two-Population Model

假设需求分布由分布 $D_1, D_2$ 和特征 $x \in \{0, 1\}$ 组成

$$ D=D_0(1-x)+D_1 x $$

在 mild condition on $D_1, D_2$ 下，SAA inconsistent, EMR consistent.

Linear Demand Model

假设需求与特征 $\mathbf{x}$ 的关系是线性的

$$ \begin{equation} D \mid(\mathbf{X}=\mathbf{x})=\beta^{\top} \mathbf{x}+\varepsilon \label{big-data-newsvendor-linear-demand} \end{equation} $$

首先，线性的关系可以匹配多数需求模型，比如 $\text{AR}(p)$

$$ D_t=\alpha_0+\alpha_1 D_{t-1}+\ldots+\alpha_p D_{t-p}+\varepsilon_t $$

就可以用历史需求数据作为特征。

这个例子想说明的结论与上面相同。

Q3

问题3是：文章方法的理论保证是什么，big-data newsvendor decision 的效果如何？

实际上这就是要导出 out-of-sample performance 的 probability guarantee.

文章的定理5，证明在一定假设下，如果需求服从线性假设 $\eqref{big-data-newsvendor-linear-demand}$ 下，那么有这样一个关系：

$$ \begin{aligned} &\left|R_{\text {true }}\left(q^\ast\right)-\hat{R}_{i n}\left(\hat{q} ; S_n\right)\right| \leq(b \vee h) \bar{D}\left[\frac{2(b \vee h)}{b \wedge h} \frac{p}{n}+\left(\frac{4(b \vee h)}{b \wedge h} p+1\right) \sqrt{\frac{\log (2 / \delta)}{2 n}}\right]+(b \vee h) K \frac{\sqrt{\log n}}{n^{1 /(2+p / 2)}} \text {, }\\ &\text { where }\; K=\sqrt{\frac{9(8+5 p)}{(4+p)}} \frac{1}{\left(1-2^{-4 /(4+p)) \lambda_2^{\ast}}\right.} \text { and } \lambda_2^\ast=\min _{t \in[\underline{D}, \bar{D}]} f_{\varepsilon}(t) \end{aligned} $$

文章把误差分解成了两个部分，generalization error 和 finite-sample bias，分别对应于 overfitting 和 underfitting 带来的误差。

第一部分 generalization error 是 $p/\sqrt{n}$ 阶的，当特征数量比较多的时候，容易过拟合。

第二部分 finite-sample bias 是 $n^{-1 /(2+p / 2)} \sqrt{\log n}$ 阶的，当样本数 $n$ 很大的时候，欠拟合带来的误差就越小。

如果加入参数的正则化项，文章的定理6给出了此时的误差

$$ \begin{aligned} \left|R_{\text {true }}\left(q^\ast\right)-\hat{R}_{i n}\left(\hat{q}_\lambda ; S_n\right)\right| & \leq(b \vee h) \bar{D}\left[\frac{(b \vee h) X_{\max }^2 p}{n \lambda \bar{D}}+\left(\frac{2(b \vee h) X_{\max }^2 p}{\lambda \bar{D}}+1\right) \sqrt{\frac{\log (2 / \delta)}{2 n}}\right] \\ & + (b \vee h) \mathbb{E}_{D \mid \mathbf{x}_{n+1}} \left[ {\mid \hat{q}_\lambda}-\hat{q} \mid \right] \\ & +(b \vee h) K \frac{\sqrt{\log n}}{n^{1 /(2+p / 4)}} \end{aligned} $$

这时候误差分为3项，第一项是 $O(p/(\sqrt{n} \lambda))$ 的泛化误差，这说明增大正则化参数可以减小过拟合；新增加的第二项则是由于引入正则化项带来的偏差。

对于 kernel method 文章也做了类似的分析。

Q4

问题4是：实际应用中，引入 demand feature 效果能有多好。

文章做了一个 case study

Although some analytical comparisons are possible under assumptions about the true demand model, the ultimate test of data-driven methods must be on real data sets.

研究的问题是医院在某一天应该安排多少个护士，数据显示病人的数量与时间有一定的相关性：

文章设定了一些特征，如星期几、时间、前几天的病人数量，然后比较了多种方法的效果。

论文提出的方法在效果上可圈可点。

这篇文章内容丰富，理论扎实，的确可以说是非常优秀了。

The Big Data Newsvendor: Practical Insights from Machine Learning

Q1

Parametric —— Empirical Risk Minimization

Nonparametric Kernel Optimization Method

Q2

Two-Population Model

Linear Demand Model

Q3

Q4

相关文章：