InferOpt tutorial

In this tutorial, we will use hybrid methods between Machine Learning (ML) and Combinatorial Optimization (CO) from the package InferOpt.jl. The goal is to learn to approximate a hard problem (Stochastic Vehicle Scheduling Problem) by an easier one (Vehicle Scheduling Problem). For more details, you can look at the InferOpt.jl documentation, or its corresponding paper.

# Imports useful packages and fix seed
using Flux
using Plots
using InferOpt
using Random
using Statistics: mean
using StochasticVehicleScheduling
Random.seed!(1);

Introduction

Let instance be an instance of the Stochastic Vehicle Scheduling problem, and $D = (V, A)$ its associated graph. For each arc $a\in A$, we compute a features vector $x_a\in \mathbb{R}^n$. We obtain a feature matrix $X\in\mathbb{R}^{n\times |A|}$. For more details about the features computation, see Features.

nb_tasks = 10
nb_scenarios = 10
instance = create_random_instance(; nb_tasks, nb_scenarios)
@info "Example instance" instance.graph instance.features
┌ Info: Example instance
│   instance.graph = {12, 61} directed simple Int64 graph
│   instance.features =
│    20×61 Matrix{Float64}:
│       30.8915    25.3271    29.3043  …  22.1639  27.1421  23.5073
│     1000.0     1000.0     1000.0         0.0      0.0      0.0
│        ⋮                             ⋱                     ⋮
└        1.0        1.0        0.0         1.0      1.0      1.0

Let $w\in\mathbb{R}^n$ a weight vector. We define the following Generalized Linear Model (GLM) $\varphi_w$, that let us compute arcs weights $\theta_a\in \mathcal R$ for each arc $a\in A$.

\[\theta_a = \varphi_w(x_a) = w^\top x_a\]

nb_features = size(instance.features, 1)
φ_w = Chain(Dense(nb_features => 1; bias=false), vec)  # Note: we use methods from Flux to build the GLM, w is initialized randomly by default
Chain(
  Dense(20 => 1; bias=false),           # 20 parameters
  vec,
) 

We can use the GLM to compute arcs weights θ

θ = φ_w(instance.features)
θ'
1×61 adjoint(::Vector{Float32}) with eltype Float32:
 175.487  223.087  383.468  510.705  …  70.9228  459.433  245.594  244.993

Then, we define what we call the easy problem as the following linear program:

\[\begin{aligned} y = \arg\max_y & \sum_{a\in A} \theta_a y_a &\\ s.t. & \sum_{a\in \delta^-(v)} y_a = \sum_{a\in \delta^+(v)} y_a, & \forall v \in V\backslash \{o, d\}\\ & \sum_{a\in \delta^-(v)} y_a = 1, & \forall v \in V\backslash \{o, d\}\\ & y_a \in \{0, 1\}, &\forall a\in A \end{aligned}\]

Integrity contraints can be dropped because we have a flow polytope. Therefore, the easy problem can be solved easily, which is done by the easy_problem method (note: we need to give instance as input because we need to know the graph in order to build the constraints).

By applying it to arcs weights $\theta$, we obtain the optimal solution $y$ of the easy problem, which is also a feasible solution of the initial stochastic vehicle scheduling instance:

y = easy_problem(θ; instance=instance)
y'
1×61 adjoint(::BitVector) with eltype Bool:
 1  1  1  1  1  1  1  1  1  1  0  0  0  …  0  0  1  0  0  0  1  0  0  1  1  1

Let's evaluate it:

evaluate_solution(y, instance)
10636.590842402416

The objective value does not seem very good. Let's compare it to the value of the optimal solution of instance, using the exact algorrithm described in MIP formulation:

_, y_opt = solve_scenarios(instance)
evaluate_solution(y_opt, instance)
2698.5071616763576

This is not good, our prediction is far off the optimal solution 😟

Can we do better? Is it possible to find $w$ that predicts $\theta$ that gives good solutions ?

The answer is yes, we can use tools from InferOpt in order to learn parameters $w$ that yield good solutions for any instance.

First of all, let's assemble together our full pipeline:

\[\xrightarrow[\text{Instance}]{X} \fbox{Encoder $\varphi_w$} \xrightarrow[\text{Cost vector}]{\theta \in \mathbb{R}^{d(x)}} \fbox{Easy problem} \xrightarrow[\text{Solution}]{y \in \mathcal{Y}(x)}\]

pipeline(x) = easy_problem(φ_w(x.features); instance=x);

Dataset creation

In order to learn something, we first need to create a dataset containing instances and corresponding solutions. We create a training dataset, and a validation dataset to evaluate the results.

nb_train_samples = 25
nb_val_samples = 25

X_train = [create_random_instance(; nb_tasks, nb_scenarios) for _ in 1:nb_train_samples]
Y_train = [solve_scenarios(x)[2] for x in X_train]

X_val = [create_random_instance(; nb_tasks, nb_scenarios) for _ in 1:nb_val_samples]
Y_val = [solve_scenarios(x)[2] for x in X_val]

data_train, data_val = zip(X_train, Y_train), zip(X_val, Y_val);

We can evaluate our current pipeline by computing the average gap with the optimal solutions:

initial_pred = [pipeline(x) for x in X_val]
initial_obj = [evaluate_solution(y, x) for (x, y) in zip(X_val, initial_pred)]
ground_truth_obj = [evaluate_solution(y, x) for (x, y) in data_val]

initial_average_gap = mean((initial_obj .- ground_truth_obj) ./ ground_truth_obj .* 100)
@info "Initial gap ≃ $(round(initial_average_gap; digits=2))%"
┌ Warning: Layer with Float32 parameters got Float64 input.
│   The input will be converted, but any earlier layers may be very slow.
│   layer = Dense(20 => 1; bias=false)  # 20 parameters
│   summary(x) = "20×60 Matrix{Float64}"
└ @ Flux ~/.julia/packages/Flux/FWgS0/src/layers/stateless.jl:50
[ Info: Initial gap ≃ 220.61%

The gap very high, let's see if we can learn a good $w$ that reduces it.

Learning by imitation

Regularization

In order to train our model by using gradient descent, we need to be able to differentiate through easy_problem with automatic differenciation tools. The problem is that, as a combinatorial algorithm, it's pieciwise constant, therefore gradient descent won't work at all:

try
    gradient(θ -> easy_problem(θ; instance=instance), θ)
catch e
    @error e
end
┌ Error: Zygote.CompileError(Tuple{typeof(MathOptInterface.set), Cbc.Optimizer, MathOptInterface.RawOptimizerAttribute, String}, ErrorException("try/catch is not supported.\nRefer to the Zygote documentation for fixes.\nhttps://fluxml.ai/Zygote.jl/latest/limitations\n"))
└ @ Main inferopt.md:155

That's why we need to regularize the easy problem. Here we choose the PerturbedAdditive regularization from InferOpt.jl.

regularized_predictor = PerturbedAdditive(easy_problem; ε=0.1, nb_samples=10)
PerturbedAdditive(easy_problem, 0.1, 10, Random.MersenneTwister, nothing)

Instead of returning a binary solution, this wrapper around easy_problem takes the same inputs but returns a smooth a continuous solution.

y_pred = regularized_predictor(rand(length(θ)); instance=instance)
y_pred[y_pred .>= 0.1 .&& y_pred .<= 0.9]'
1×21 adjoint(::Vector{Float64}) with eltype Float64:
 0.2  0.9  0.5  0.4  0.5  0.7  0.1  0.1  …  0.9  0.6  0.3  0.1  0.3  0.3  0.4

Loss function

We can now choose a differentiable loss function from InferOpt's toolbox. We choose a FenchelYoungLoss, that wraps our perturbed maximizer.

loss = FenchelYoungLoss(regularized_predictor)
FenchelYoungLoss(PerturbedAdditive(easy_problem, 0.1, 10, Random.MersenneTwister, nothing))

We can evaluate our prediction $\theta$ respect to the optimal solution y_opt

loss(θ, y_opt.value; instance=instance)
13483.924601506505

And compute its gradient respect to $\theta$

gradient(θ -> loss(θ, y_opt.value; instance=instance), θ)
(Float32[0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  1.0, 0.0, 0.0, -1.0, 1.0, -1.0, 0.0, 1.0, 0.0, 0.0],)

We now have all the tools in order to learn our pipeline using gradient descent and automatic differentiation !

Training loop

We train our model using the Flux.jl library, and tha Adam optimizer.

flux_loss(x, y) = loss(φ_w(x.features), y.value; instance=x)
# Optimizer
opt = Adam();

We train our model for 25 epochs

nb_epochs = 25
training_losses, val_losses = Float64[], Float64[]
objective_gap_history = Float64[]
for _ in 1:nb_epochs
    l = mean(flux_loss(x, y) for (x, y) in data_train)
    l_test = mean(flux_loss(x, y) for (x, y) in data_val)
    Y_pred = [pipeline(x) for x in X_val]
    values = [evaluate_solution(y, x) for (x, y) in zip(X_val, Y_pred)]
    V = mean((v_pred - v) / v * 100 for (v_pred, v) in zip(values, ground_truth_obj))
    push!(training_losses, l)
    push!(val_losses, l_test)
    push!(objective_gap_history, V)

    Flux.train!(flux_loss, Flux.params(φ_w), data_train, opt)
end
┌ Warning: Layer with Float32 parameters got Float64 input.
│   The input will be converted, but any earlier layers may be very slow.
│   layer = Dense(20 => 1; bias=false)  # 20 parameters
│   summary(x) = "20×58 Matrix{Float64}"
└ @ Flux ~/.julia/packages/Flux/FWgS0/src/layers/stateless.jl:50

Results

Train and test losses

Let's check our pipeline performance, by plotting training and validation losses.

plot(training_losses; label="Training loss")
plot!(val_losses; label="Validation loss")

The loss decreased, which is a good sign.

Other test metrics

We can also check the average optimality gap as another metric.

plot(objective_gap_history; title="Objective gap")

The optimality gap also decreased:

@info "Initial objective gap ≃ $(round(objective_gap_history[1], digits=2))%"
@info "Final objective gap ≃ $(round(objective_gap_history[end], digits=2))%"
[ Info: Initial objective gap ≃ 220.61%
[ Info: Final objective gap ≃ 0.76%

Its value is now less than 1%. The training was a success 🎉


This page was generated using Literate.jl.