Bayesian Optimization

Bayesian Optimization is a method for finding the maximum value of an objective function that is unknown and costly to evaluate.

Example: We’d like to optimize a chemical reaction that produces a compound C. To maximize the output, we need to find the temperature at which the reaction produces the most C (i.e. has the largest yield). There is thus some function Y(T) that relates the yield to the reaction temperature, but we don’t know this function. We could simply measure the reaction at different temperatures, but this would take a very long time, since running the reaction once (= taking one sample) takes, say, two days.

The goal of Bayesian Optimization is to find the maximum of such a function (e.g. the temperature with the maximum yield) using as few sample points as possible. More precisely, BO calculates the most probable location of the maximum given a certain number of evaluations.

How does BO work?

Roughly, BO works by applying the following steps:

Initial samples

Before running the BO algorithm, we evaluate our function N times, where N can be any positive integer. At the start, we have no idea where the optimum might be or what our objective function might look like, so it makes sense to start with an initial sampling.

Objective function with three sample points already measured

Main Loop

Now, we repeat the following steps until the evaluation budget is exhausted:

Surrogate Model

We fit a surrogate model through the points that we have already measured. This model tries to estimate what the objective function might be, and assigns a standard deviation to each point. The surrogate model is very precise (low standard deviation) close to the points we measured. This makes sense: the points that are already known are “fixed”, so there is not much wiggle room in the proximity of these points.

There are many methods that can be used for this task, but most BO implementations use a Gaussian regression, also called “Kriging”.

Surrogate model with mean and standard deviation for three points

Acquisition Function

Next, we try to figure out where it makes the most sense to measure a new sample point. There are multiple acquisition functions (AF) that we could use, but one of the most common is “Expected Improvement” (EI). This AF calculates, for each x, the expected value of the improvement of f(x) over the current best.

The EI of the above example looks like this:

Expected Improvement of the above surrogate model

Measure a new Sample

The acquisition function now tells us where the best point for a new measurement lies. For this, we take the maximum of the acquisition function and use its x value for a new evaluation of the objective function.

Result

After the main loop finishes, we calculate the surrogate model once more and return the maximum of this model. This is the supposed global maximum of the objective function. In this case, the maximum is at x^* = 0.704 with a maximum of f(x^*) = 1.693.

Surrogate model after 10 iterations
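Before we dive into the implementation, here is the whole procedure condensed into a rough sketch. The names fit_surrogate and next_sample are placeholders for the Gaussian regression and the acquisition-function maximization that we implement below; this is an outline, not the final code.

def bayesian_optimization(objective, initial_xs, budget, fit_surrogate, next_sample):
    # initial sampling
    samples = [(x, objective(x)) for x in initial_xs]
    # main loop
    for _ in range(budget):
        surrogate = fit_surrogate(samples)             # surrogate model
        x_next = next_sample(surrogate, samples)       # acquisition function + its maximum
        samples.append((x_next, objective(x_next)))    # measure a new sample
    # result: the maximum is then read off the final surrogate model
    return fit_surrogate(samples), samples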

Let’s implement it!

First, create a new Python venv, activate it and install the necessary packages.

python -m venv venv
source venv/bin/activate  # depends on the OS
pip install numpy scipy matplotlib

Let’s create a new python file and import all necessary libraries.

import os
import numpy as np
import matplotlib.pyplot as plt
from functools import reduce
from scipy.stats import norm

Next, let’s define a few constants that we can use later. DOMAIN is the extent of the x axis, INITIAL_SAMPLES defines how many points we want to evaluate at the start, BUDGET is the number of iterations and machine_epsilon will be relevant later.

DOMAIN = [0, 10]
INITIAL_SAMPLES = 3
BUDGET = 10
machine_epsilon = np.finfo(np.float32).eps
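machine_epsilon is simply a very small but non-zero number (for IEEE float32 it is 2^-23):

print(machine_epsilon)   # 1.1920929e-07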

Objective Function

Alright, let’s define an objective function that our BO will try to optimize. For this example, I used

f(x) = \sin 1.7x + \cos x,

because it has a clear maximum and looks somewhat interesting.

objective_function = lambda x: np.sin(1.7 * x) + np.cos(x)

Now, we evaluate this function three times at equidistant positions: we generate INITIAL_SAMPLES + 2 equidistant points across the domain and drop the two endpoints, so the initial samples lie in the interior of the domain.

samples = np.array([[x, objective_function(x)] for x in np.linspace(*DOMAIN, INITIAL_SAMPLES+2)[1:-1]])
[[ 2.5        -1.69613297]
 [ 5.          1.0821493 ]
 [ 7.5         0.52923445]]

Let’s plot the objective function together with the evaluated samples. As samples is an array of [x, y] pairs, we transpose it (.T) so that the x and y values end up in separate rows that we can unpack for plotting.

fig, ax = plt.subplots(figsize=(12.8, 4.8))
xvalues = np.linspace(*DOMAIN, 200)
yvalues = objective_function(xvalues)
ax.plot(xvalues, yvalues, linewidth=1, color='black')
ax.scatter(*samples.T, s=10, c='black')
ax.set_xlabel(r'$x$', fontsize=16)
ax.set_ylabel(r'$f(x)$', fontsize=16)
ax.set_xlim(*DOMAIN)
plt.tight_layout()
os.makedirs('output', exist_ok=True)
plt.savefig(os.path.join('output', 'objective_function.pdf'))
plt.close('all')
Objective function with three sample points already measured

Surrogate Model (Gaussian Regression)

Covariance

For the Gaussian Regression, we first need a covariance function. Given two x values x_1 and x_2, this function returns a value that describes how strongly these values correlate. For this demonstration, we use the Squared Exponential Kernel, which has a high correlation for values that are close together and a low correlation for values that are far apart.

The Squared Exponential Kernel is defined as (source)

K(x_1, x_2) = \exp\left(-\frac{1}{2}|x_1 - x_2|^2\right)

and looks something like this:

Squared Exponential Kernel as a function of the distance between x1 and x2

In Python:

squaredExponentialKernel = lambda x1, x2: np.exp(-np.linalg.norm(x1 - x2)**2 / 2)
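The kernel plot above can be generated like this (this snippet also appears in the full code at the end; it reuses xvalues from the objective function plot):

yvalues = [squaredExponentialKernel(0, x) for x in xvalues]
plt.plot(xvalues, yvalues, color='black', linewidth=1)
plt.xlim(*DOMAIN)
plt.xlabel(r'$|x_1 - x_2|$', fontsize=16)
plt.ylabel(r'$\text{cov}(0, |x_1 - x_2|)$', fontsize=16)
plt.tight_layout()
plt.savefig(os.path.join('output', 'squaredExponentialKernel.pdf'))
plt.close('all')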

Now, we write a function that generates a covariance matrix from two input vectors x_1 and x_2 and a kernel function. The covariance of an element with itself is the variance.

covariance = lambda x1, x2, kernel: np.array([[kernel(a, b) for b in x2] for a in x1])
In [2]: covariance([1, 2, 3], [1, 2, 3], squaredExponentialKernel)
Out[2]:
array([[1.        , 0.60653066, 0.13533528],
       [0.60653066, 1.        , 0.60653066],
       [0.13533528, 0.60653066, 1.        ]])

Here, 1 and 3 are further apart, so their covariance is lower.

Gaussian Regression

The Gaussian Regression (with mean \mu and covariance \text{cov}) is defined as (source)

\mu = K(X_*, X) \left( K(X, X) + I \right)^{-1} \mathbf{y}

and

\text{cov} = K(X_*, X_*) - K(X_*, X) \left( K(X, X) + I \right)^{-1} K(X, X_*)

with X being the x values of the samples, \mathbf{y} the y values of the samples and X_* the values at which we wish to calculate the Gaussian Regression.

We now write a function that implements these expressions. machine_epsilon is the machine epsilon of a float32, i.e. the gap between 1.0 and the next larger representable float32. It is added to the diagonal of K(X, X) so that the matrix stays well-conditioned and invertible, and to the predicted covariance so that no value becomes exactly 0, without actually disturbing the calculations. Some of the later operations would throw an error for a value of 0.
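To see what this jitter does, here is a small side check (not part of the final script) using the definitions from above: two almost identical sample locations make K(X, X) nearly singular, and adding machine_epsilon to the diagonal brings its condition number back down.

# Two nearly identical sample locations: the covariance matrix is almost singular.
K_bad = covariance([1.0, 1.0 + 1e-6], [1.0, 1.0 + 1e-6], squaredExponentialKernel)
print(np.linalg.cond(K_bad))                                     # on the order of 1e12, ill-conditioned
print(np.linalg.cond(K_bad + machine_epsilon * np.identity(2)))  # on the order of 1e7, much better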

After calculating \mu and \text{cov}, we calculate the standard deviation at each point by taking the diagonal elements (= variance) and returning the square root (= standard deviation).

def gaussian_regression(x, y, predict, kernel=squaredExponentialKernel):
    # K(X, X) + I
    K = covariance(x, x, kernel) + machine_epsilon * np.identity(len(x))
    # K(X_*, X)
    Ks = covariance(predict, x, kernel)
    # K(X_*, X_*)
    Kss = covariance(predict, predict, kernel)

    K_inv = np.linalg.inv(K)

    mean = np.dot(Ks, np.dot(K_inv, y))
    cov = Kss - np.dot(Ks, np.dot(K_inv, Ks.T))
    cov += machine_epsilon * np.ones(np.shape(cov)[0])
    std = np.sqrt(np.diag(cov))

    return mean, std

We can now use this function to calculate the surrogate model

mean, std = gaussian_regression(*samples.T, xvalues, kernel=squaredExponentialKernel)

and plot it. I plot bands of one, two and three standard deviations here in order to visualize how quickly the probability of a measurement landing there diminishes.

fig, ax = plt.subplots(figsize=(12.8, 4.8))
ax.scatter(*np.array(samples).transpose(), s=10, c='black', label=r'samples')
ax.plot(xvalues, objective_function(xvalues), color='black', linewidth=1, label='objective function')
ax.plot(xvalues, mean, linestyle='--', color='black', linewidth=1, label=r'surrogate function')
ax.fill_between(xvalues, mean + std, mean - std, alpha=0.2, color='black', label=r'surrogate function $\sigma$', linewidth=0)
ax.fill_between(xvalues, mean + 2*std, mean - 2*std, alpha=0.15, color='black', label=r'surrogate function $2\sigma$', linewidth=0)
ax.fill_between(xvalues, mean + 3*std, mean - 3*std, alpha=0.1, color='black', label=r'surrogate function $3\sigma$', linewidth=0)
ax.set_xlim(*DOMAIN)
ax.set_xlabel(r'$x$', fontsize=16)
ax.set_ylabel(r'$f(x)$', fontsize=16)
ax.legend()
plt.tight_layout()
plt.savefig(os.path.join('output', 'gaussian_regression.pdf'))
plt.close('all')
Surrogate model with mean and standard deviation for three points

Acquisition Function

In order to understand Expected Improvement, let’s look at “Probability of Improvement”, PI, first.

For the current best value among our samples (x^* and f(x^*)), PI defines a function I(x) that gives the probability that a new point measured at a given x will result in a better best value. For this, the surrogate model from step 1 defines a normal distribution in the y direction for each x, which is shown by how wide the gray area is at that x.

Visualization of PI, redrawn from here:

Visualization of PI
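PI itself is not used in the rest of this post, but for reference it can be written in a single line with the surrogate’s mean and std from above (best_point and xi are defined below for EI; with xi = 0 this is the plain definition):

# Probability of Improvement: the probability, under the surrogate's normal distribution
# at each x, that a new measurement beats the current best by at least xi.
PI = norm.cdf((mean - best_point[1] - xi) / std)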

EI now considers not only whether a certain measurement will improve the current best, but also by how much.

For implementing EI, we first need to know the best point we currently have. This can be done using the reduce function:

best_point = reduce(lambda point, m: point if point[1] > m[1] else m, samples)

reduce is a function commonly used in functional programming and reduces a list down to a single value. It does this by taking a function that takes an accumulator and a value and returns the new accumulator.
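For example, summing a list with reduce:

# The accumulator starts as the first element, then the lambda folds in the rest:
# ((1 + 2) + 3) + 4
reduce(lambda acc, value: acc + value, [1, 2, 3, 4])   # -> 10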

EI is defined as (source)

\text{EI}(x; \xi) = \delta \Phi\left(\frac{\delta}{\sigma}\right) + \sigma\varphi\left(\frac{\delta}{\sigma}\right)

with \delta = \mu - f(x^*) - \xi, \varphi being the probability density function of the standard normal distribution (scipy.stats.norm.pdf) and \Phi being its cumulative distribution function (scipy.stats.norm.cdf).

The \xi value controls “exploration” vs. “exploitation”: if \xi is close to zero, BO will often home in on local optima and thus converge more quickly, at the risk of a local optimum not being the global optimum. For large \xi, BO will “jump around” more and thus potentially evaluate points at places that are further away from the previous point.

To write this, we first define an empty list EI and then loop over all points to calculate \text{EI}(x), using the standard normal CDF and PDF that scipy exposes as norm.cdf and norm.pdf. If the standard deviation at a point is exactly 0, there is no uncertainty left and nothing to be gained, so we set the EI to 0 there.

xi = 0.1
EI = []
for i in range(len(xvalues)):
    if std[i] == 0:
        # no uncertainty left at this point, so nothing to be gained here
        EI.append(0)
        continue
    delta = mean[i] - best_point[1] - xi
    z = delta / std[i]
    EI.append(delta * norm.cdf(z) + std[i] * norm.pdf(z))

Now, we can plot EI:

fig, ax = plt.subplots(figsize=(12.8, 4.8))
ax.plot(xvalues, EI, linewidth=1, color='black')
ax.set_xlim(*DOMAIN)
ax.set_xlabel(r'$x$', fontsize=16)
ax.set_ylabel(r'$\text{EI}$', fontsize=16)
plt.tight_layout()
plt.savefig(os.path.join('output', 'expected_improvement.pdf'))
plt.close('all')
Expected Improvement of the above surrogate model

Measure a new Sample

Last, we find the maximum in EI and use its x value to measure a new sample from the objective function. We append this new [x, y] pair to the samples array.

max_EI_index = np.argmax(EI)
max_EI = EI[max_EI_index]
samples = np.array([*samples, [xvalues[max_EI_index], objective_function(xvalues[max_EI_index])]])

Return Maximum

The steps Surrogate Model, Acquisition Function and Measure a new Sample are now repeated BUDGET times.

After that, generate the surrogate model once more and find the maximum thereof.

mean, std = gaussian_regression(*samples.T, xvalues, kernel=squaredExponentialKernel)
max_index = np.argmax(mean)
print(f"Maximum found at {[xvalues[max_index], mean[max_index]]}")
Surrogate model after 10 iterations

Full Code

The steps of the main loop are now put into a single for loop, and each iteration’s plots are saved using the loop index in the filename. The zero_pad function just makes sure that each index has the same number of digits (e.g. 0024), which allows for correct sorting by your OS.

import os
import numpy as np
import matplotlib.pyplot as plt
from functools import reduce
from scipy.stats import norm

DOMAIN = [0, 10]
INITIAL_SAMPLES = 3
BUDGET = 10
machine_epsilon = np.finfo(np.float32).eps

objective_function = lambda x: np.sin(1.7 * x) + np.cos(x)

samples = np.array([[x, objective_function(x)] for x in np.linspace(*DOMAIN, INITIAL_SAMPLES+2)[1:-1]])

# --- Objective Function ---
fig, ax = plt.subplots(figsize=(12.8, 4.8))
xvalues = np.linspace(*DOMAIN, 200)
yvalues = objective_function(xvalues)
ax.plot(xvalues, yvalues, linewidth=1, color='black')
ax.scatter(*samples.T, s=10, c='black')
ax.set_xlabel(r'$x$', fontsize=16)
ax.set_ylabel(r'$f(x)$', fontsize=16)
ax.set_xlim(*DOMAIN)
plt.tight_layout()
os.makedirs('output', exist_ok=True)
plt.savefig(os.path.join('output', 'objective_function.pdf'))
plt.close('all')

# --- Covariance ---
squaredExponentialKernel = lambda x1, x2: np.exp(-np.linalg.norm(x1 - x2)**2 / 2)
covariance = lambda x1, x2, kernel: np.array([[kernel(a, b) for b in x2] for a in x1])

yvalues = [squaredExponentialKernel(0, x) for x in xvalues]
plt.plot(xvalues, yvalues, color='black', linewidth=1)
plt.xlim(*DOMAIN)
plt.xlabel(r'$|x_1 - x_2|$', fontsize=16)
plt.ylabel(r'$\text{cov}(0, |x_1 - x_2|)$', fontsize=16)
plt.tight_layout()
plt.savefig(os.path.join('output', 'squaredExponentialKernel.pdf'))
plt.close('all')

# --- Gaussian Regression ---
def gaussian_regression(x, y, predict, kernel=squaredExponentialKernel):
    K = covariance(x, x, kernel) + machine_epsilon * np.identity(len(x))
    Ks = covariance(predict, x, kernel)
    Kss = covariance(predict, predict, kernel)

    K_inv = np.linalg.inv(K)

    mean = np.dot(Ks, np.dot(K_inv, y))
    cov = Kss - np.dot(Ks, np.dot(K_inv, Ks.T))
    cov += machine_epsilon * np.ones(np.shape(cov)[0])
    std = np.sqrt(np.diag(cov))

    return mean, std

def zero_pad(value, max_number):
    num_digits = int(np.log10(max_number-1)) + 1
    return str(value).zfill(num_digits)

for i in range(BUDGET):
    print(f"==> Iteration {i+1}/{BUDGET}")
    filename_suffix = zero_pad(i, BUDGET)

    # --- Surrogate Model ---
    mean, std = gaussian_regression(*samples.T, xvalues, kernel=squaredExponentialKernel)

    fig, ax = plt.subplots(figsize=(12.8, 4.8))
    ax.scatter(*np.array(samples).transpose(), s=10, c='black', label=r'samples')
    ax.plot(xvalues, objective_function(xvalues), color='black', linewidth=1, label='objective function')
    ax.plot(xvalues, mean, linestyle='--', color='black', linewidth=1, label=r'surrogate function')
    ax.fill_between(xvalues, mean + std, mean - std, alpha=0.2, color='black', label=r'surrogate function $\sigma$', linewidth=0)
    ax.fill_between(xvalues, mean + 2*std, mean - 2*std, alpha=0.15, color='black', label=r'surrogate function $2\sigma$', linewidth=0)
    ax.fill_between(xvalues, mean + 3*std, mean - 3*std, alpha=0.1, color='black', label=r'surrogate function $3\sigma$', linewidth=0)
    ax.set_xlim(*DOMAIN)
    ax.set_xlabel(r'$x$', fontsize=16)
    ax.set_ylabel(r'$f(x)$', fontsize=16)
    ax.legend()
    plt.tight_layout()
    plt.savefig(os.path.join('output', f'gaussian_regression_{filename_suffix}.pdf'))
    plt.close('all')

    # --- Acquisition Function ---
    best_point = reduce(lambda point, m: point if point[1] > m[1] else m, samples)

    xi = 0.1
    EI = []
    for i in range(len(xvalues)):
        if std[i] == 0:
            EI.append(0)
            continue
        delta = mean[i] - best_point[1] - xi
        z = delta / std[i]
        EI.append(delta * norm.cdf(z) + std[i] * norm.pdf(z))

    fig, ax = plt.subplots(figsize=(12.8, 4.8))
    ax.plot(xvalues, EI, linewidth=1, color='black')
    ax.set_xlim(*DOMAIN)
    ax.set_xlabel(r'$x$', fontsize=16)
    ax.set_ylabel(r'$\text{EI}$', fontsize=16)
    plt.tight_layout()
    plt.savefig(os.path.join('output', f'expected_improvement_{filename_suffix}.pdf'))
    plt.close('all')

    # --- measure another sample ---
    max_EI_index = np.argmax(EI)
    max_EI = EI[max_EI_index]
    samples = np.array([*samples, [xvalues[max_EI_index], objective_function(xvalues[max_EI_index])]])

# --- get maximum of surrogate model ---
mean, std = gaussian_regression(*samples.T, xvalues, kernel=squaredExponentialKernel)
max_index = np.argmax(mean)
print(f"Maximum found at {[xvalues[max_index], mean[max_index]]}")
