flowchart TD n1["Data Preparation"] --> n2["Selecting a Store and Product"] n2 --> n3["Extract Demand Curve"] & n4["Extract Sales Data"] n3 --> n7["Train Agent"] n4 --> n7 n5["Define the Environment"] --> n7 n6["Define the Q Learning Agent"] --> n7 n7 --> n8["Compare to Baseline Policies"]
Businesses struggle to balance inventory: too much stock ties up cash, too little loses sales. Inventory optimisation is a whole field of study, with a well-understood impact on the profitability of any business that sells physical goods - the median company spends about 1% of revenue carrying inventory.
This experiment explores using reinforcement learning (RL) to optimise inventory, using a simple simulation of a retail store. The goal is to learn an optimal policy for ordering stock based on demand forecasts and current inventory levels.
We will be using the M5 Forecasting - Accuracy dataset from Kaggle, which contains historical sales data for various products across different stores. The dataset includes features such as sales history and event calendar information. We will use this data to train our RL agent in a simulated environment, to make inventory decisions such as how much stock to order and when to place orders, with the aim of maximising profits while minimising stockouts and holding costs.
The overall approach
The overall approach to the experiment is as follows.
For the sake of simplicity, we will be using a single store and product. The aim of the experiment is to highlight the reinforcement learning approach to inventory optimisation, rather than to achieve the best possible results.
We will use a simple Q-learning algorithm to train our agent. This is a model-free reinforcement learning algorithm that learns the value of taking a particular action in a given state. The agent will learn to take actions that maximise the expected cumulative reward over time. If we were to train against many stores and products, we would likely use a more complex algorithm, such as Deep Q-Learning or Proximal Policy Optimization (PPO). However, for this experiment, we will keep it simple and use Q-learning.
We will also compare the performance of our RL agent against some baseline policies, such as a naive policy and a reorder point policy. The naive policy orders a fixed quantity of stock at a fixed interval, while the reorder point policy orders stock when inventory falls below a certain threshold. The goal is to demonstrate the effectiveness of the RL approach in optimising inventory management.
Getting the data
We start by downloading the data. You will need to create a Kaggle account and download the dataset from there if you want to replicate this experiment.
Show the code
!kaggle competitions download -c m5-forecasting-accuracy -p .data && unzip -o -q .data/m5-forecasting-accuracy.zip -d .data && rm .data/m5-forecasting-accuracy.zip
Downloading m5-forecasting-accuracy.zip to .data
100%|██████████████████████████████████████| 45.8M/45.8M [00:05<00:00, 8.29MB/s]
We want replicable results, so we will set a seed for the random number generator. This will ensure that the results are consistent across different runs of the experiment.
Show the code
# Set a seed for reproducibility
import random
import numpy as np
seed = 3141
random.seed(seed)
np.random.seed(seed)
What does it represent?
The M5 dataset is composed of several data tables, each representing different aspects of the sales data. The main tables we will be using are the `calendar`, `sell_prices` and `sales` tables. The `calendar` table contains information about the dates, including holidays and events, `sell_prices` contains information about the prices of the products in different stores over time, while `sales` contains the sales data for each product in each store.
Calendar
Show the code
# Load .data/calendar.csv
import pandas as pd
from IPython.display import display
= pd.read_csv(".data/calendar.csv")
calendar
display(calendar.head())
| | date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 1 | 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 2 | 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | NaN | NaN | NaN | NaN | 0 | 0 | 0 |
| 3 | 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | NaN | NaN | NaN | NaN | 1 | 1 | 0 |
| 4 | 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | NaN | NaN | NaN | NaN | 1 | 0 | 1 |
Sell prices
Show the code
# Load .data/sell_prices.csv
= pd.read_csv(".data/sell_prices.csv")
sell_prices
display(sell_prices.head())
| | store_id | item_id | wm_yr_wk | sell_price |
|---|---|---|---|---|
| 0 | CA_1 | HOBBIES_1_001 | 11325 | 9.58 |
| 1 | CA_1 | HOBBIES_1_001 | 11326 | 9.58 |
| 2 | CA_1 | HOBBIES_1_001 | 11327 | 8.26 |
| 3 | CA_1 | HOBBIES_1_001 | 11328 | 8.26 |
| 4 | CA_1 | HOBBIES_1_001 | 11329 | 8.26 |
Sales data
Show the code
# Load .data/sales_train_validation.csv
= pd.read_csv(".data/sales_train_validation.csv")
sales_train_validation
display(sales_train_validation.head())
| | id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 | ... | d_1904 | d_1905 | d_1906 | d_1907 | d_1908 | d_1909 | d_1910 | d_1911 | d_1912 | d_1913 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HOBBIES_1_001_CA_1_validation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 1 | 3 | 0 | 1 | 1 | 1 | 3 | 0 | 1 | 1 |
| 1 | HOBBIES_1_002_CA_1_validation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | HOBBIES_1_003_CA_1_validation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 2 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | HOBBIES_1_004_CA_1_validation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 1 | 0 | 5 | 4 | 1 | 0 | 1 | 3 | 7 | 2 |
| 4 | HOBBIES_1_005_CA_1_validation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 2 | 1 | 1 | 0 | 1 | 1 | 2 | 2 | 2 | 4 |
5 rows × 1919 columns
Selecting a store and product
We will select a random store and product from the sales data, but we don’t want to select a product that has no sales. We will filter the products to only include those that have sales on at least 75% of the days in the dataset. This will help ensure that we have enough data to train our agent.
First, we pick a random store.
Show the code
# Pick a random store
store_id = random.choice(sales_train_validation["store_id"].unique())

# Identify which columns are the "day" columns
day_cols = [c for c in sales_train_validation.columns if c.startswith("d_")]
total_days = len(day_cols)
print(f"Total days: {total_days}")
Total days: 1913
We can quickly visualise the distribution of sales counts for that store.
Show the code
# Subset to that store
df_store = sales_train_validation[sales_train_validation["store_id"] == store_id]

# Compute how many days each product had non-zero sales
sales_counts = df_store[day_cols].gt(0).sum(axis=1)
Show the code
# Plot the distribution of sales counts
import plotly.express as px
fig = px.histogram(
    sales_counts,
    title=f"Distribution of Sales Counts for Store {store_id}",
    labels={"value": "Sales Count"},
    template="plotly_white",
)
fig.update_traces(marker=dict(color="grey", line=dict(width=1, color="black")))
fig.update_layout(
    xaxis_title="Sales Count",
    yaxis_title="Frequency",
    xaxis=dict(title_text="Sales Count"),
    yaxis=dict(title_text="Frequency"),
    bargap=0.2,
    title_x=0.5,
    title_y=0.95,
    title_font_size=20,
)
fig.update_layout(grid=dict())
fig.show()
We will pick a product from the tail end of the distribution to increase the chance that the RL agent learns something useful, rather than training on data that is too sparse. Let us filter the products to only include those that have sales on at least 75% of the days in the dataset.
Show the code
# Filter to products with ≥75% days having sales
threshold = 0.75 * total_days
valid_products = df_store.loc[sales_counts >= threshold, "item_id"].unique()
print(f"Number of valid products to choose from: {len(valid_products)}")
Number of valid products to choose from: 181
We need to pick a product at random from this subset of products.
Show the code
# Filter to the top 10% of products by sales
valid_products = (
    df_store.loc[sales_counts >= threshold, "item_id"]
    .value_counts()
    .nlargest(int(len(valid_products) * 0.1))
    .index
)

if len(valid_products) == 0:
    raise ValueError(f"No products in store {store_id} have sales on ≥75% of days.")

# Pick one at random
product_id = random.choice(valid_products)
print(f"Selected product: {product_id}")
Selected product: FOODS_3_723
Now that we have a store and product, let us finalise the data we will be using for the experiment.
Show the code
# Filter the sales data for the selected store and product
sales_data = sales_train_validation[
    (sales_train_validation["store_id"] == store_id)
    & (sales_train_validation["item_id"] == product_id)
]
Show the code
product_prices = sell_prices[
    (sell_prices["store_id"] == store_id) & (sell_prices["item_id"] == product_id)
]
# We only need wm_yr_wk and sell_price columns
product_prices = product_prices[["wm_yr_wk", "sell_price"]].reset_index(drop=True)
To develop an intuition of what we selected, let us visualise some of the important data points, starting with product prices.
Show the code
# Plot select product prices
price_df = pd.DataFrame(
    {"Week": product_prices["wm_yr_wk"], "Price": product_prices["sell_price"]}
)
fig = px.scatter(
    price_df,
    x="Week",
    y="Price",
    title=f"Prices for {product_id} in {store_id}",
    labels={"Week": "Weeks", "Price": "Price"},
    template="plotly_white",
)
fig.update_traces(marker=dict(size=5, color="grey"))
fig.update_layout(grid=dict())
fig.update_xaxes(showticklabels=False)
fig.show()
Certain events can have a significant impact on sales: holidays or promotional events can lead to spikes in demand (the Super Bowl leads to more beer sold, Christmas means turkey sales, and so on). We will use the `calendar` data to identify these events and incorporate them into our model, by creating demand data that reflects them. For the purpose of this experiment, we will use a simple multiplier to boost demand on event days - in practice, you would want a more sophisticated, product-specific model of how events affect demand.
Show the code
# Create a DataFrame for all days with their corresponding week
sales_days = sales_data.columns[6:]  # columns are like d_1, d_2, ..., d_1913

calendar_small = (
    calendar[["d", "date", "wm_yr_wk", "event_name_1"]]
    .set_index("d")
    .loc[sales_days]
    .reset_index()
)
calendar_small["date"] = pd.to_datetime(calendar_small["date"])

calendar_small["is_event"] = calendar_small["event_name_1"].notna()
print(f"Number of event days: {calendar_small['is_event'].sum()}")
Number of event days: 154
Show the code
# Merge to get sell_price per day
daily_prices = calendar_small.merge(product_prices, on="wm_yr_wk", how="left")

# Fill missing prices if needed (e.g., if price missing for some weeks)
daily_prices["sell_price"] = daily_prices["sell_price"].ffill()

sell_price_data = daily_prices["sell_price"].values
print(f"Min price: {np.min(sell_price_data)}, Max price: {np.max(sell_price_data)}")
Min price: 1.0, Max price: 1.5
Show the code
daily_sales = sales_data.iloc[0, 6:].values
demand_data = pd.Series(daily_sales).rolling(window=7, min_periods=1).mean().values
Show the code
print(
    f"Min daily demand: {np.min(demand_data)}, Max daily demand: {np.max(demand_data)}"
)
expected_max_inventory = 2 * np.max(demand_data)
print(f"Expected Max Inventory: {expected_max_inventory}")
print(f"Min daily sales: {np.min(daily_sales)}, Max daily sales: {np.max(daily_sales)}")
Min daily demand: 0.0, Max daily demand: 71.85714285714286
Expected Max Inventory: 143.71428571428572
Min daily sales: 0, Max daily sales: 154
We can now look at the calculated demand data, which is a rolling average of the sales data. We apply the rolling average to smooth out the demand curve, making it more stable for the RL agent to learn from. Let us have a quick look at the demand data, with event days highlighted in red.
Show the code
# Plot demand data for the selected product
demand_df = pd.DataFrame(
    {
        "Day": range(len(demand_data)),
        "Demand": demand_data,
        # map the boolean is_event → friendly labels
        "Event Day": calendar_small["is_event"].map({False: "No", True: "Yes"}),
    }
)

fig = px.scatter(
    demand_df,
    x="Day",
    y="Demand",
    color="Event Day",
    color_discrete_map={"No": "grey", "Yes": "red"},
    title=f"Demand Data for {product_id} in {store_id} (events in red)",
    labels={"Day": "Days", "Demand": "Demand", "Event Day": "Event Day"},
    template="plotly_white",
)
fig.update_layout(
    legend_title_text="Event Day", xaxis=dict(showgrid=True), yaxis=dict(showgrid=True)
)
fig.show()
We now want to add seasonality and holiday spikes to the demand. We will do this by boosting demand on weekends and holidays.
Show the code
def add_weekly_seasonality(demand_data, weekend_multiplier=1.2):
    """Boost demand on weekends."""
    new_demand = []
    for i, demand in enumerate(demand_data):
        day_of_week = i % 7
        if day_of_week in [5, 6]:  # Saturday, Sunday
            demand *= weekend_multiplier
        new_demand.append(demand)
    return np.array(new_demand)


demand_data = add_weekly_seasonality(demand_data)
Show the code
# Flag where there is any event
"is_holiday"] = calendar["event_name_1"].notnull().astype(int)
calendar[
# Build a simple multiplier array
= np.where(calendar["is_holiday"] == 1, 1.5, 1.0)
holiday_multipliers = demand_data.copy()
demand_data_without_boost = demand_data * holiday_multipliers[: len(demand_data)] demand_data
Here is the final demand curve, which includes seasonality and holiday spikes. You will notice that the demand data is now boosted on weekends and holidays, as expected. You can click on the plot to zoom in and out, hover over points to see the values, and click on the legend items to show and hide individual series.
Show the code
# Plot demand data with seasonality and holiday spikes, vs original demand data
fig = px.scatter(
    demand_df,
    x="Day",
    y="Demand",
    color="Event Day",
    color_discrete_map={"No": "grey", "Yes": "red"},
    title=f"Demand Data with Seasonality Spikes for {product_id} in {store_id}",
    labels={"Day": "Days", "Demand": "Demand", "Event Day": "Event Day"},
    template="plotly_white",
    opacity=0.5,  # Set alpha for this plot
)
fig.add_scatter(
    x=demand_df["Day"],
    y=demand_data_without_boost,
    mode="markers",
    name="Original Demand",
    line=dict(color="blue", width=2),
    opacity=0.6,  # Set alpha for this plot
)
fig.add_scatter(
    x=demand_df["Day"],
    y=demand_data,
    mode="markers",
    name="Demand with Seasonality and Holiday Spikes",
    line=dict(color="orange", width=2),
    opacity=0.8,  # Set alpha for this plot
)
fig.update_layout(
    legend_title_text="Event Day",
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
    # Place legend outside the plot on the right
    legend=dict(
        orientation="v",
        yanchor="top",
        y=1.0,
        xanchor="left",
        x=1.05,
        title_font=dict(size=12),
        font=dict(size=10),
    ),
)
fig.show()
The environment
We will use the Gymnasium library to create a custom environment for our RL agent. It will simulate the store's inventory management process, including the demand data, sell prices and inventory levels. The agent will interact with the environment by taking actions (ordering stock) and receiving rewards (profit or loss) depending on the actions it takes.
This is the fundamental principle of reinforcement learning: the agent learns to take actions that maximise the expected cumulative reward over time. The environment will provide feedback to the agent in the form of rewards, which will be used to update a policy.
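As a minimal sketch of that feedback loop (using a throwaway toy environment and a random "policy" for illustration only - the real inventory environment and learning agent are defined below), the interaction always follows the same reset/step pattern:

```python
import numpy as np


class ToyEnv:
    """A throwaway stand-in environment: five steps, action 1 is always best."""

    def reset(self):
        self.t = 0
        return np.array([0.0])  # initial observation

    def step(self, action):
        self.t += 1
        reward = -abs(action - 1)  # feedback for the chosen action
        done = self.t >= 5  # episode ends after 5 steps
        return np.array([float(self.t)]), reward, done, {}


env = ToyEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = np.random.choice([0, 1, 2])  # a real agent would pick actions from a learned policy
    obs, reward, done, _ = env.step(action)  # the environment returns the next state and a reward
    total_reward += reward  # this feedback is what the agent uses to improve its policy
print(total_reward)
```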
In diagram form, the main pieces of the environment are as follows.
flowchart TD subgraph s1["Environment"] n1["Action Space"] n2["Order 0"] n3["Order 5"] n4["Order 10"] n5["Order ..."] n6["Observation Space"] n7["Inventory"] n8["Demand"] n9["Functions"] n10["Reset"] n11["Step"] end n1 --> n2 & n3 & n4 & n5 n6 --> n7 & n8 n9 --> n10 n9 --> n11
To start with, let us define a few parameters for the environment - the possible order quantities which the agent can choose from, and the action space size it will be working in.
Show the code
# Define the action space as a discrete set of order quantities from 0 to 200, with increments of 5
order_quantities = list(range(0, 201, 5))
action_space_size = len(order_quantities)
# Set max inventory to 300% of the max demand
max_inventory = int(3 * np.max(demand_data))
# Set max order to 50% of the max inventory
max_order = int(0.5 * max_inventory)

print(f"Action space size: {action_space_size}")
print(f"Order quantities: {order_quantities}")
print(f"Max inventory: {max_inventory}")
print(f"Max order: {max_order}")
Action space size: 41
Order quantities: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200]
Max inventory: 341
Max order: 170
Finally, we define the Gymnasium class for the environment. It is composed of state data (inventory levels and recent demand), an action space (the possible order quantities), and a reward function (based on the profit or loss from the actions taken). The environment also includes a reset function to reset its state at the beginning of each episode.
Note that the reset function can be called with a `random_start` parameter, which randomly selects a starting point in the demand data. This is useful for training the agent from different starting points, and helps improve the robustness of the learned policy: we avoid biasing the agent towards a policy that only works for one specific starting point.
An important part is defining the costs associated with holding inventory, stockouts, and ordering stock. For this experiment, we will use the following costs:
- Holding cost per unit: \(0.2\)
- Stockout penalty per unit: \(20\%\) of the average selling price
- Fixed order cost: \(1\)
- Order cost per unit: \(10\%\) of the average selling price
In practice, these costs are a key driver of model performance. For this experiment we are using simple, speculative values. You would want to use actual costs from your business to train the agent, and they would be specific to each product and store (for example, a stockout penalty will vary wildly between products, between stores for the same product, or even depending on when the stockout occurs).
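To make the reward arithmetic concrete before we wire it into the environment, here is a minimal sketch of a single day's profit and loss under these assumed costs. The price, demand, starting inventory and order size below are made-up illustrative values; the logic mirrors the environment's `step` function shown next.

```python
# Illustrative only: one day's reward under the cost assumptions above.
avg_sell_price = 1.25  # assumed average selling price
holding_cost_per_unit = 0.2
stockout_penalty_per_unit = 0.2 * avg_sell_price
fixed_order_cost = 1
order_cost_per_unit = 0.1 * avg_sell_price

inventory, order_quantity, demand = 20, 30, 45  # hypothetical day

inventory += order_quantity  # stock arrives
sales = min(inventory, demand)  # fulfil what we can
lost_sales = max(demand - inventory, 0)  # unmet demand
inventory -= sales

revenue = sales * avg_sell_price
holding_cost = inventory * holding_cost_per_unit
stockout_cost = lost_sales * stockout_penalty_per_unit
order_cost = (fixed_order_cost + order_quantity * order_cost_per_unit) if order_quantity > 0 else 0

reward = revenue - (holding_cost + stockout_cost + order_cost)
print(reward)  # revenue minus holding, stockout and ordering costs
```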
Show the code
import gymnasium as gym
from gymnasium import spaces


class InventoryManagementEnv(gym.Env):
    """Custom Environment for Inventory Management"""

    metadata = {"render.modes": ["human"]}

    def __init__(
        self,
        demand_data,
        sell_price_data,
        order_quantities,
        max_inventory=1000,
        max_order=500,
        random_start=False,
        episode_length=1000,
    ):
        super(InventoryManagementEnv, self).__init__()

        # Data
        self.demand_data = demand_data
        self.sell_price_data = sell_price_data
        self.avg_sell_price = np.mean(sell_price_data)
        self.order_quantities = order_quantities
        self.current_step = 0
        self.random_start = random_start
        self.episode_length = episode_length
        self.random_inventory_divider = np.random.randint(2, 5)

        # Inventory settings
        self.max_inventory = max_inventory
        self.inventory = max_inventory // self.random_inventory_divider
        self.max_daily_demand = np.max(demand_data) * 1.5  # Add a safety margin
        self.max_order = max_order

        # Costs
        self.holding_cost_per_unit = 0.2
        self.stockout_penalty_per_unit = self.avg_sell_price * 0.2
        self.fixed_order_cost = 1
        self.order_cost_per_unit = 0.1 * self.avg_sell_price

        # Action space: discrete choices of order quantities
        self.action_space = spaces.Discrete(len(order_quantities))

        # Observation space: inventory level + recent demand
        self.observation_space = spaces.Box(
            low=np.array([0, 0]),
            high=np.array([self.max_inventory, self.max_daily_demand]),
            dtype=np.float32,
        )

    def _get_obs(self):
        recent_demand = (
            self.demand_data[max(0, self.current_step - 7) : self.current_step].mean()
            if self.current_step > 0
            else 0
        )
        return np.array([self.inventory, recent_demand], dtype=np.float32)

    def step(self, action):
        # Decode the action
        order_quantity = self.order_quantities[action]

        # Place an order
        self.inventory += order_quantity
        self.inventory = min(self.inventory, self.max_inventory)  # Cap at max_inventory

        # Receive demand
        today_demand = (
            self.demand_data[self.current_step]
            if self.current_step < len(self.demand_data)
            else 0
        )

        # Set today's price
        today_price = (
            self.sell_price_data[self.current_step]
            if self.current_step < len(self.sell_price_data)
            else 0
        )

        # Fulfill sales
        sales = min(self.inventory, today_demand)
        lost_sales = max(today_demand - self.inventory, 0)
        self.inventory -= sales

        # Calculate rewards
        revenue = sales * today_price
        holding_cost = self.inventory * self.holding_cost_per_unit
        stockout_cost = lost_sales * self.stockout_penalty_per_unit
        if order_quantity > 0:
            order_cost = (
                self.fixed_order_cost + order_quantity * self.order_cost_per_unit
            )
        else:
            order_cost = 0

        reward = revenue - (holding_cost + stockout_cost + order_cost)

        self.current_step += 1
        done = self.current_step >= len(self.demand_data)

        return self._get_obs(), reward, done, {}

    def reset(self):
        if self.random_start:
            # we choose a random window so we always have enough days left
            max_start = len(self.demand_data) - 1 - self.episode_length
            self.current_step = np.random.randint(0, max_start)
        else:
            self.current_step = 0
        self.inventory = self.max_inventory // 4
        return self._get_obs()

    def render(self, mode="human"):
        print(f"Day {self.current_step}: Inventory={self.inventory}")
We now have a custom environment that simulates the inventory management process. The agent will interact with this environment by taking actions (ordering stock) and receiving rewards (profit or loss) depending on the actions it takes. We also need to define the Q-learning agent, which will learn to take actions that maximise the expected cumulative reward over time.
The Q-learning agent
The agent will use a Q-table to store the expected rewards for each state-action pair. The Q-table is a 2D array where the rows represent the states (inventory levels and recent demand) and the columns represent the actions (order quantities). The agent updates the table using the Q-learning algorithm, a model-free reinforcement learning method that learns the value of taking a particular action in a given state.
The Q-learning algorithm is based on the Bellman equation, which states that the value of a state–action pair is equal to the immediate reward plus the expected (discounted) value of the next state. The Q-table is updated via:
\[ Q(s, a) \;\leftarrow\; Q(s, a) \;+\; \alpha \,\bigl(r \;+\; \gamma \,\max_{a'} Q(s', a') \;-\; Q(s, a)\bigr) \]
where:
- \(Q(s, a)\) is the current estimate of the value of taking action \(a\) in state \(s\)
- \(\alpha\) is the learning rate (how fast we update)
- \(r\) is the immediate reward received
- \(\gamma\) is the discount factor (how much we value future rewards)
- \(s'\) is the next state
- \(a'\) is the action that maximizes \(Q(s', a')\) in that next state
Note: \(\epsilon\) is the exploration rate used in the ε-greedy policy for choosing actions, but it does not appear explicitly in the Q-update formula above.
The agent will use a simple \(\epsilon\)-greedy policy to balance exploration and exploitation. This means it will choose a random action with probability \(\epsilon\) and the best action with probability \(1 - \epsilon\). The value of \(\epsilon\) will decay over time to encourage exploration in the beginning and exploitation later on.
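To make the update rule and the \(\epsilon\)-greedy choice concrete, here is a tiny worked example with made-up numbers (a toy 2-state, 3-action Q-table, not part of the experiment code); it mirrors the `learn` and `choose_action` methods of the agent defined below.

```python
# Illustrative Q-update and epsilon-greedy choice with made-up numbers
import numpy as np

alpha, gamma = 0.1, 0.95
q = np.zeros((2, 3))  # toy Q-table: 2 states x 3 actions

state, action, reward, next_state = 0, 2, 5.0, 1
td_target = reward + gamma * np.max(q[next_state])  # r + gamma * max_a' Q(s', a')
q[state, action] += alpha * (td_target - q[state, action])
print(q[state, action])  # 0.5 after a single update

# epsilon-greedy action selection
epsilon = 0.3
if np.random.rand() < epsilon:
    action = np.random.randint(q.shape[1])  # explore: random action
else:
    action = int(np.argmax(q[state]))  # exploit: best known action
```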
If this sounds a bit overwhelming, don't worry about the details for now; what matters is developing an intuition for how reinforcement learning solves this kind of problem. You can explore how it works in your own time.
Show the code
class QLearningAgent:
    def __init__(
        self,
        state_space_size,
        action_space_size,
        alpha=0.1,
        gamma=0.95,
        epsilon=1.0,
        epsilon_decay=0.995,
        epsilon_min=0.01,
    ):
        self.q_table = np.zeros((state_space_size, action_space_size))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.action_space_size = action_space_size

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.action_space_size)
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state):
        predict = self.q_table[state, action]
        target = reward + self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.alpha * (target - predict)

    def decay_epsilon(self):
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
Discretising the state space
The Q-learning algorithm requires a discrete state space, so we need to discretise the continuous state space (inventory levels and recent demand) into bins. We will use a simple binning approach to achieve this. The number of bins for inventory and demand can be adjusted based on the expected range of values, but we will use a simple heuristic to determine the number of bins based on the maximum values of inventory and demand.
flowchart TD subgraph s1["Bins"] n2["1 to 5"] n3["6 to 10"] n4["..."] n5["51 to 55"] end subgraph s2["Data"] n1["1,2,3,4,5,6,7,...,51,52,53,54,55"] --> n2 & n3 & n4 n1 --> n5 end
We will use the following formula to calculate the number of bins: \[ \text{bins} = \frac{\text{max value} - \text{min value}}{\text{target bin width}} \]
where max value is the maximum value of the variable (inventory or demand), min value is the minimum value of the variable, and target bin width is the desired width of each bin. We will also cap the number of bins to avoid outliers.
Show the code
def discretise_state(
    state, inventory_bins=10, demand_bins=5, max_inventory=1000, max_demand=500
):
    inventory, recent_demand = state

    # Bin inventory
    inventory_bin = int(
        np.clip(inventory / max_inventory * inventory_bins, 0, inventory_bins - 1)
    )

    # Bin recent demand
    demand_bin = int(
        np.clip(recent_demand / max_demand * demand_bins, 0, demand_bins - 1)
    )

    # Flatten to a single integer
    discrete_state_index = inventory_bin * demand_bins + demand_bin

    return discrete_state_index
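As a quick sanity check, a continuous observation maps to a single Q-table row index like this (the inventory, demand and bin settings below are made up for illustration):

```python
# Example: 150 units in stock, recent demand of 40, with hypothetical bin settings
obs = (150.0, 40.0)
idx = discretise_state(
    obs, inventory_bins=10, demand_bins=5, max_inventory=300, max_demand=80
)
print(idx)  # inventory_bin = 5, demand_bin = 2 -> 5 * 5 + 2 = 27
```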
Show the code
desired_bins = 30
width = (demand_data.max() - demand_data.min()) / desired_bins
target_demand_bin_width = max(1, int(width))

width = expected_max_inventory / desired_bins
target_inventory_bin_width = max(1, int(width))

# Calculate bins
demand_bins = int(
    np.ceil((np.max(demand_data) - np.min(demand_data)) / target_demand_bin_width)
)
inventory_bins = int(np.ceil(expected_max_inventory / target_inventory_bin_width))
# Cap bins to avoid outliers
demand_bins = np.clip(demand_bins, 5, desired_bins)
inventory_bins = np.clip(inventory_bins, 5, desired_bins)
print(f"Demand bins: {demand_bins}, Inventory bins: {inventory_bins}")
Demand bins: 30, Inventory bins: 30
Training the agent in the environment
Finally, with all the necessary pieces in place, we can train the agent in the environment we have set up. We will use the `random_start` parameter to randomly select a starting point in the demand data, and we will train the agent for a large number of episodes. Each episode consists of a series of steps in which the agent interacts with the environment, takes actions, and receives rewards.
Show the code
env = InventoryManagementEnv(
    demand_data,
    sell_price_data,
    order_quantities=order_quantities,
    max_inventory=max_inventory,
    max_order=max_order,
    random_start=True,
    episode_length=365,
)

agent = QLearningAgent(
    state_space_size=inventory_bins * demand_bins, action_space_size=action_space_size
)

n_episodes = 3000

rewards_per_episode = []

for episode in range(n_episodes):
    state = env.reset()
    state = discretise_state(state)
    total_reward = 0

    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        next_state = discretise_state(
            next_state, inventory_bins=inventory_bins, demand_bins=demand_bins
        )
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward

    agent.decay_epsilon()
    rewards_per_episode.append(total_reward)

    if episode % (n_episodes // 10) == 0:
        print(f"Episode {episode}: Total reward = {total_reward}")
Episode 0: Total reward = -77318.09671824376
Episode 300: Total reward = -27876.565643357502
Episode 600: Total reward = 1167.2731506832938
Episode 900: Total reward = -17823.154116540958
Episode 1200: Total reward = 5873.567524755438
Episode 1500: Total reward = 24436.06273839149
Episode 1800: Total reward = 5729.572647823171
Episode 2100: Total reward = 27366.796881397957
Episode 2400: Total reward = 3917.763412217164
Episode 2700: Total reward = 27351.212991740715
Comparing to baseline policies
With the model trained, let us compare the results to typical baseline policies which represent common inventory management strategies. We will implement a naive policy, a reorder point policy, and a dynamic reorder policy.
The naive policy orders a fixed quantity of stock on a fixed schedule, regardless of demand, while the reorder point policy orders stock whenever inventory falls below a certain threshold.
The dynamic reorder policy adjusts the reorder point based on recent demand - a more sophisticated strategy that takes current demand levels into account when deciding whether to order.
These policies are not necessarily optimal, but they represent common strategies used in inventory management. The goal is to demonstrate the effectiveness of the RL approach compared to these.
As a first step, we need a helper function to find the closest action in the action space to a given order quantity. This is needed because the baseline policies express their decisions as order quantities, and we must map each desired quantity to a valid action (i.e. one of the discrete order quantities the environment accepts).
Show the code
def find_closest_action(order_quantity, order_quantities):
    return min(
        range(len(order_quantities)),
        key=lambda i: abs(order_quantities[i] - order_quantity),
    )
We then define the policies.
Naive policy
Show the code
def simulate_naive_policy(
    env, order_quantities, order_quantity=50, order_every_n_days=7, n_episodes=500
):
    rewards = []

    for episode in range(n_episodes):
        # reset() returns just obs
        obs = env.reset()
        total_reward = 0
        done = False
        day_counter = 0

        while not done:
            # Every n days, place an order
            if day_counter % order_every_n_days == 0:
                action = find_closest_action(order_quantity, order_quantities)
            else:
                action = find_closest_action(0, order_quantities)

            # step() returns (obs, reward, done, info)
            next_obs, reward, done, info = env.step(action)

            total_reward += reward
            day_counter += 1

            obs = next_obs

        rewards.append(total_reward)

    return rewards
Reorder point policy
Show the code
def simulate_reorder_point_policy(
    env, order_quantities, reorder_point=200, reorder_quantity=300, n_episodes=500
):
    rewards = []
    metrics = {
        "order_quantity": [],
        "inventory": [],
        "recent_demand": [],
        "action": [],
        "reward": [],
        "episode": [],
    }

    for _ in range(n_episodes):
        obs = env.reset()
        inventory, recent_demand = obs
        total_reward = 0
        done = False

        while not done:
            # if below threshold → order, else do nothing
            if inventory < reorder_point:
                action = find_closest_action(reorder_quantity, order_quantities)
            else:
                action = find_closest_action(0, order_quantities)

            next_obs, reward, done, info = env.step(action)
            inventory, recent_demand = next_obs
            metrics["order_quantity"].append(order_quantities[action])
            metrics["inventory"].append(inventory)
            metrics["recent_demand"].append(recent_demand)
            metrics["action"].append(action)
            metrics["reward"].append(reward)
            metrics["episode"].append(_)
            total_reward += reward

        rewards.append(total_reward)

    return rewards, metrics
Dynamic reorder policy
Show the code
def simulate_dynamic_reorder_policy(
    env,
    order_quantities,
    base_reorder_point=200,
    base_order_quantity=300,
    n_episodes=500,
):
    rewards = []
    metrics = {
        "order_quantity": [],
        "inventory": [],
        "recent_demand": [],
        "action": [],
        "reward": [],
        "episode": [],
    }

    for _ in range(n_episodes):
        obs = env.reset()
        inventory, recent_demand = obs
        total_reward = 0
        done = False

        while not done:
            # dynamically adjust the reorder point based on recent demand
            demand_factor = recent_demand / 50
            dynamic_s = base_reorder_point * demand_factor
            dynamic_s = np.clip(dynamic_s, 100, 400)

            # policy decision
            if inventory < dynamic_s:
                action = find_closest_action(base_order_quantity, order_quantities)
            else:
                action = find_closest_action(0, order_quantities)

            next_obs, reward, done, info = env.step(action)
            inventory, recent_demand = next_obs
            metrics["order_quantity"].append(order_quantities[action])
            metrics["inventory"].append(inventory)
            metrics["recent_demand"].append(recent_demand)
            metrics["action"].append(action)
            metrics["reward"].append(reward)
            metrics["episode"].append(_)
            total_reward += reward

        rewards.append(total_reward)

    return rewards, metrics
With all the baseline policies defined, we can now simulate them in the environment. We use the same environment and demand data as for the RL agent, and compare the total rewards received by each policy over the same number of episodes.
Show the code
# Simulate the naive policy
naive_rewards = simulate_naive_policy(
    env,
    order_quantities,
    order_quantity=50,
    order_every_n_days=7,
    n_episodes=n_episodes,
)
# Simulate the reorder point policy
reorder_rewards, reorder_metrics = simulate_reorder_point_policy(
    env,
    order_quantities,
    reorder_point=expected_max_inventory // 2,
    reorder_quantity=expected_max_inventory,
    n_episodes=n_episodes,
)
# Simulate the dynamic reorder policy
dynamic_rewards, dynamic_metrics = simulate_dynamic_reorder_policy(
    env,
    order_quantities,
    base_reorder_point=expected_max_inventory // 2,
    base_order_quantity=expected_max_inventory,
    n_episodes=n_episodes,
)
Performance benchmarking
We can now contrast the performance of the RL agent against the baseline policies. Let us plot the rewards received by each policy over the same number of episodes, and include the rewards received by the RL agent for comparison.
Show the code
# Plot the rewards per episode for each policy, including the RL agent
import plotly.graph_objects as go
# Smooth the rewards for better visualisation
rewards_per_episode = np.convolve(rewards_per_episode, np.ones(10) / 10, mode="valid")

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=list(range(n_episodes)),
        y=reorder_rewards,
        mode="markers",
        name="Reorder Point Policy",
        marker=dict(color="blue", symbol="circle"),
        opacity=0.5,
    )
)
fig.add_trace(
    go.Scatter(
        x=list(range(n_episodes)),
        y=dynamic_rewards,
        mode="markers",
        name="Dynamic Reorder Policy",
        marker=dict(color="orange", symbol="square"),
        opacity=0.5,
    )
)
fig.add_trace(
    go.Scatter(
        x=list(range(n_episodes)),
        y=naive_rewards,
        mode="markers",
        name="Naive Policy",
        marker=dict(color="green", symbol="triangle-up"),
        opacity=0.5,
    )
)
fig.add_trace(
    go.Scatter(
        x=list(range(n_episodes)),
        y=rewards_per_episode,
        mode="lines",
        name="RL Agent",
        line=dict(color="red", width=1),
    )
)
fig.update_layout(
    title="Rewards per Episode for Different Policies",
    xaxis_title="Episode",
    yaxis_title="Total Reward",
    legend_title_text="Policy",
    template="plotly_white",
)
fig.show()
It is clear the RL agent (the red line in the plot) is outperforming all the baseline policies after enough training episodes, meaning the agent has learned a set of actions that maximises the expected cumulative reward over time.
Visualising the Q-table
In the Q-learning algorithm, the Q-table is a 2D array where the rows represent the states (inventory levels and recent demand) and the columns represent the actions (order quantities). It forms the “brain” of the agent, where each cell represents the expected reward for taking a particular action in a given state.
It offers good explanatory power for the learned policy: we can visualise it to understand how the agent has learned to act given the state of the environment.
Show the code
q_table_reshaped = agent.q_table.reshape(
    (inventory_bins, demand_bins, action_space_size)
)

# Best action for each (inventory, demand) combination
best_actions = np.argmax(q_table_reshaped, axis=2)

fig = go.Figure(
    data=go.Heatmap(
        z=best_actions,
        colorscale="Viridis",
        colorbar=dict(title="Best Action Index"),
    )
)
fig.update_layout(
    title="Learned Policy: Best Action per (Inventory Bin, Demand Bin)",
    xaxis_title="Demand Bin",
    yaxis_title="Inventory Bin",
    yaxis_autorange="reversed",  # So low inventory is at bottom
    width=800,
    height=600,
)
fig.show()
Final remarks
We have explored how to use reinforcement learning to solve an inventory management problem. We have defined a custom environment using the Gymnasium library, implemented a Q-learning agent, and compared its performance against common baseline policies. The RL agent has successfully learned to take actions that maximise the expected cumulative reward over time, outperforming the baseline policies.
In future work, we can explore more advanced RL algorithms, such as deep Q-learning or policy gradient methods, to further improve the performance of the agent and also integrate more stores and products into a single model.