Simchi-Levi, David, Xu, Yunzong, and Zhao, Jinglong

Subjects

Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Mathematics - Optimization and Control, and Statistics - Machine Learning

Abstract

This work is motivated by a practical concern from our retail partner. While they respect the advantages of dynamic pricing, they must limit the number of price changes to be within some constant. We study the classical price-based network revenue management problem, where a retailer has finite initial inventory of multiple resources to sell over a finite time horizon. We consider both known and unknown distribution settings, and derive policies that have the best-possible asymptotic performance in both settings. Our results suggest an intrinsic difference between the expected revenue associated with how many switches are allowed, which further depends on the number of resources. Our results are also the first to show a separation between the regret bounds associated with different number of resources.

Computer Science - Machine Learning and Statistics - Machine Learning

Abstract

This paper investigates the impact of pre-existing offline data on online learning, in the context of dynamic pricing. We study a single-product dynamic pricing problem over a selling horizon of $T$ periods. The demand in each period is determined by the price of the product according to a linear demand model with unknown parameters. We assume that the seller already has some pre-existing offline data before the start of the selling horizon. The offline data set contains $n$ samples, each of which is an input-output pair consists of a historical price and an associated demand observation. The seller wants to utilize both the pre-existing offline data and the sequential online data to minimize the regret of the online learning process. We characterize the joint effect of the size, location and dispersion of offline data on the optimal regret of the online learning process. Specifically, the size, location and dispersion of offline data are measured by the number of historical samples $n$, the absolute difference between the average historical price and the optimal price $\delta$, and the standard deviation of the historical prices $\sigma$, respectively. We show that the best achievable regret is $\tilde\Theta\left(\sqrt{T}\wedge\frac{T}{n\sigma^2+(n\wedge T)\delta^2}\right)$. We design a learning algorithm based on the "optimism in the face of uncertainty" principle, whose regret is optimal up to a logarithmic factor. Our results reveal surprising transformations of the optimal regret rate with respect to the size of offline data, which we refer to as phase transitions. In addition, our results demonstrate that the location and dispersion of the offline data set also have an intrinsic effect on the optimal regret, and we quantify this effect via the inverse-square law.

The universe is things which change and called events. The events are matter and field. A boundary divides a system to things and environment. The things belong to the environment have no significant effect on the things belong to the system. The physical observables are the variations of things and it is always assumed that the conscious thing is placed in environment because the science cannot explain consciousness. There is not only an obligated minimum boundary between things (space) but also between past and future (present). The gravitational field has significant effect on these obligated minimums, especially at Planck scale. By using the above concept we introduce a grand unified reaction platform for categorizing the current physical paradigms and possible future explanation of the universe as a whole. Comment: 28 pages, 4 figures

Computer Science - Data Structures and Algorithms and Mathematics - Optimization and Control

Abstract

We consider an assortment optimization problem where a customer chooses a single item from a sequence of sets shown to her, while limited inventories constrain the items offered to customers over time. In the special case where all of the assortments have size one, our problem captures the online stochastic matching with timeouts problem. For this problem, we derive a polynomial-time approximation algorithm which earns at least 1-ln(2-1/e), or 0.51, of the optimum. This improves upon the previous-best approximation ratio of 0.46, and furthermore, we show that it is tight. For the general assortment problem, we establish the first constant-factor approximation ratio of 0.09 for the case that different types of customers value items differently, and an approximation ratio of 0.15 for the case that different customers value each item the same. Our algorithms are based on rounding an LP relaxation for multi-stage assortment optimization, and improve upon previous randomized rounding schemes to derive the tight ratio of 1-ln(2-1/e).

We study an online knapsack problem where the items arrive sequentially and must be either immediately packed into the knapsack or irrevocably discarded. Each item has a different size and the objective is to maximize the total size of items packed. We focus on the class of randomized algorithms which initially draw a threshold from some distribution, and then pack every fitting item whose size is at least that threshold. Threshold policies satisfy many desiderata including simplicity, fairness, and incentive-alignment. We derive two optimal threshold distributions, the first of which implies a competitive ratio of 0.432 relative to the optimal offline packing, and the second of which implies a competitive ratio of 0.428 relative to the optimal fractional packing. We also consider the generalization to multiple knapsacks, where an arriving item has a different size in each knapsack and must be placed in at most one. This is equivalent to the AdWords problem where item truncation is not allowed. We derive a randomized threshold algorithm for this problem which is 0.214-competitive. We also show that any randomized algorithm for this problem cannot be more than 0.461-competitive, providing the first upper bound which is strictly less than 0.5. This online knapsack problem finds applications in many areas, like supply chain ordering, online advertising, and healthcare scheduling, refugee integration, and crowdsourcing. We show how our optimal threshold distributions can be naturally implemented in the warehouses for a Latin American chain department store. We run simulations on their large-scale order data, which demonstrate the robustness of our proposed algorithms.

Cheung, Wang Chi, Simchi-Levi, David, and Zhu, Ruihao

Subjects

Computer Science - Machine Learning and Statistics - Machine Learning

Abstract

We propose algorithms with state-of-the-art \emph{dynamic regret} bounds for un-discounted reinforcement learning under drifting non-stationarity, where both the reward functions and state transition distributions are allowed to evolve over time. Our main contributions are: 1) A tuned Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence-Widening (\texttt{SWUCRL2-CW}) algorithm, which attains low dynamic regret bounds against the optimal non-stationary policy in various cases. 2) The Bandit-over-Reinforcement Learning (\texttt{BORL}) framework that further permits us to enjoy these dynamic regret bounds in a parameter-free manner.

Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Mathematics - Optimization and Control, and Statistics - Machine Learning

Abstract

We consider the classical stochastic multi-armed bandit problem with a constraint on the total cost incurred by switching between actions. We prove matching upper and lower bounds on regret and provide near-optimal algorithms for this problem. Surprisingly, we discover phase transitions and cyclic phenomena of the optimal regret. That is, we show that associated with the multi-armed bandit problem, there are phases defined by the number of arms and switching costs, where the regret upper and lower bounds in each phase remain the same and drop significantly between phases. The results enable us to fully characterize the trade-off between regret and incurred switching cost in the stochastic multi-armed bandit problem, contributing new insights to this fundamental problem. Under the general switching cost structure, the results reveal a deep connection between bandit problems and graph traversal problems, such as the shortest Hamiltonian path problem.

Motivated by the dynamic assortment offerings and item pricings occurring in e-commerce, we study a general problem of allocating finite inventories to heterogeneous customers arriving sequentially. We analyze this problem under the framework of competitive analysis, where the sequence of customers is unknown and does not necessarily follow any pattern. Previous work in this area, studying online matching, advertising, and assortment problems, has focused on the case where each item can only be sold at a single price, resulting in algorithms which achieve the best-possible competitive ratio of 1-1/e. In this paper, we extend all of these results to allow for items having multiple feasible prices. Our algorithms achieve the best-possible weight-dependent competitive ratios, which depend on the sets of feasible prices given in advance. Our algorithms are also simple and intuitive; they are based on constructing a class of universal ``value functions'' which integrate the selection of items and prices offered. Finally, we test our algorithms on the publicly-available hotel data set of Bodea et al. (2009), where there are multiple items (hotel rooms) each with multiple prices (fares at which the room could be sold). We find that applying our algorithms, as a ``hybrid'' with algorithms which attempt to forecast and learn the future transactions, results in the best performance.

Jin, Rong, Simchi-Levi, David, Wang, Li, Wang, Xinshang, and Yang, Sen

Subjects

Mathematics - Optimization and Control and Computer Science - Machine Learning

Abstract

The recent rising popularity of ultra-fast delivery services on retail platforms fuels the increasing use of urban warehouses, whose proximity to customers makes fast deliveries viable. The space limit in urban warehouses poses a problem for the online retailers: the number of products (SKUs) they carry is no longer "the more, the better", yet it can still be significantly large, reaching hundreds or thousands in a product category. In this paper, we study algorithms for dynamically identifying a large number of products (i.e., SKUs) with top customer purchase probabilities on the fly, from an ocean of potential products to offer on retailers' ultra-fast delivery platforms. We distill the product selection problem into a semi-bandit model with linear generalization. There are in total $N$ different arms, each with a feature vector of dimension $d$. The player pulls $K$ arms in each period and observes the bandit feedback from each of the pulled arms. We focus on the setting where $K$ is much greater than the number of total time periods $T$ or the dimension of product features $d$. We first analyze a standard UCB algorithm and show its regret bound can be expressed as the sum of a $T$-independent part $\tilde O(K d^{3/2})$ and a $T$-dependent part $\tilde O(d\sqrt{KT})$, which we refer to as "fixed cost" and "variable cost" respectively. To reduce the fixed cost for large $K$ values, we propose a novel online learning algorithm, which iteratively shrinks the upper confidence bounds within each period, and show its fixed cost is reduced by a factor of $d$ to $\tilde O(K \sqrt{d})$. Moreover, we test the algorithms on an industrial dataset from Alibaba Group. Experimental results show that our new algorithm reduces the total regret of the standard UCB algorithm by at least 10%.

Cheung, Wang Chi, Simchi-Levi, David, and Zhu, Ruihao

Subjects

Computer Science - Machine Learning and Statistics - Machine Learning

Abstract

We introduce general data-driven decision-making algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for non-stationary bandit settings. It captures applications such as advertisement allocation and dynamic pricing in changing environments. We show how the difficulty posed by the (unknown \emph{a priori} and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe that first converts the rate-optimal Upper-Confidence-Bound (UCB) algorithm for stationary bandit settings into a tuned Sliding Window-UCB algorithm with optimal dynamic regret for the corresponding non-stationary counterpart. Boosted by the novel bandit-over-bandit framework with automatic adaptation to the unknown changing environment, it even permits us to enjoy, in a (surprisingly) parameter-free manner, this optimal dynamic regret if the amount of non-stationarity is moderate to large or an improved (compared to existing literature) dynamic regret otherwise. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in their online learning processes, which is vital in changing environments. We further conduct extensive numerical experiments on both synthetic data and the CPRM-12-001: On-Line Auto Lending dataset provided by the Center for Pricing and Revenue Management at Columbia University to show that our proposed algorithms achieve superior dynamic regret performances. Comment: arXiv admin note: text overlap with arXiv:1810.03024

Bastani, Hamsa, Simchi-Levi, David, and Zhu, Ruihao

Subjects

Computer Science - Machine Learning and Statistics - Machine Learning

Abstract

We study the problem of learning shared structure \emph{across} a sequence of dynamic pricing experiments for related products. We consider a practical formulation where the unknown demand parameters for each product come from an unknown distribution (prior) that is shared across products. We then propose a meta dynamic pricing algorithm that learns this prior online while solving a sequence of Thompson sampling pricing experiments (each with horizon $T$) for $N$ different products. Our algorithm addresses two challenges: (i) balancing the need to learn the prior (\emph{meta-exploration}) with the need to leverage the estimated prior to achieve good performance (\emph{meta-exploitation}), and (ii) accounting for uncertainty in the estimated prior by appropriately "widening" the prior as a function of its estimation error, thereby ensuring convergence of each price experiment. Unlike prior-independent approaches, our algorithm's meta regret grows sublinearly in $N$; an immediate consequence of our analysis is that the price of an unknown prior in Thompson sampling is negligible in experiment-rich environments with shared structure (large $N$). Numerical experiments on synthetic and real auto loan data demonstrate that our algorithm significantly speeds up learning compared to prior-independent algorithms or a naive approach of greedily using the updated prior across products.