TL for BO
TL Scenarios
- Black Box outputs noisy evaluations: $$y^{t_i} \sim N(f^{t_i},\lambda^{t_i}I)$$ $\lambda^{t_i}$ is the (intrinsic) variance of the evaluation noise.
- $k(x,x')$ is the GP kernel, describing the similarity of two configurations; its hyperparameters can change during training.
- Surrogate Model outputs $\mu(x),\lambda(x),\sigma(x)$.
- The posterior on a given point is calculated from the SM: $$f_i | \mathbf{x}^{t_i},\mathbf{X}^{t_i},\mathbf{y}^{t_i} \sim N(\mu^i_{n_i}(x^{t_i}),\lambda^i_{n_i}(x^{t_i}))$$
TL for SM
Kernel
- Multi-task Kernel Design:
Put source tasks and target task together into one GP model.
- $$ K_{multi}((x^{t_i},t_i),(x^{t_j},t_j)) = K_t(t_i,t_j) \otimes K_x(x^{t_i},x^{t_j})$$
$K_x$ : the kernel between input points.
$K_t$ : the kernel between tasks, capturing task similarity.
- Inferred using slice sampling (a small construction sketch of $K_{multi}$ appears after this block).
- $$ k(x^{t_i}, x^{t_j}) = \begin{cases} \exp\left( -\frac{\|x^{t_i} - x^{t_j}\|^2}{2\ell^2} \right), & \text{if } t_i = t_j \\\\ 1 - \frac{\|x^{t_i} - x^{t_j}\|^2}{B}, & \text{if } t_j \in \text{NN}(t_i),\ t_i \ne t_j \\\\ 0, & \text{otherwise} \end{cases} $$
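A minimal sketch (NumPy) of the product construction $K_t(t_i,t_j)\,K_x(x^{t_i},x^{t_j})$ over pooled data. The task kernel here is computed from hypothetical task descriptors purely for illustration; in practice it is typically learned, e.g. via slice sampling.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def multi_task_kernel(X, task_ids, task_feats, ls_x=1.0, ls_t=1.0):
    """K_multi[i, j] = K_t(t_i, t_j) * K_x(x_i, x_j).

    X          : (n, d) configurations pooled over all tasks
    task_ids   : (n,)   integer task index of each row
    task_feats : (T, q) one descriptor vector per task (assumption; often learned)
    Note: with shared inputs per task this product becomes the Kronecker form above.
    """
    K_x = rbf(X, X, ls_x)
    K_tasks = rbf(task_feats, task_feats, ls_t)    # T x T task similarity
    K_t = K_tasks[np.ix_(task_ids, task_ids)]      # expand to n x n
    return K_t * K_x                               # element-wise product

# toy usage: 2 tasks, 3 points each
X = np.random.rand(6, 2)
tids = np.array([0, 0, 0, 1, 1, 1])
tfeat = np.random.rand(2, 4)
print(multi_task_kernel(X, tids, tfeat).shape)     # (6, 6)
```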
- Boosted Hierarchical Gaussian Process (BHGP): transfer knowledge from $t_s$ to $t$ by injecting additional information into the GP kernel.
- Assume all tasks share the same ML algorithm and differ only in their training datasets; thus each dataset is represented as a feature vector, which is used to construct cross-task kernels.
- $\phi_x,\phi_y$ are feature maps learned by neural networks.
- $P_X$ Kernel mean embedding: $$\varphi(D^{t_k}) = \frac{\sum^{n_k}_{l=1} \phi_x (x_l)}{n_k}$$
- $P_{Y|X}$ Kernel conditional mean operator: $$ \hat{C}_{Y|X} = \Phi _y^T(\Phi_X \Phi_X^T+\lambda I)^{-1} \Phi_X $$
- $P_{XY}$ Cross covariance: $$\hat{C}_{XY} = \frac{\Phi_X^T \Phi_y}{n_k}$$
- Noisy Biased Kernel Design
View source tasks as noisy observations of the target task.
- Envelope-BO (Env-GP):
$$\mathbf{K}_\ast = \begin{bmatrix} K(X^{ts}, X^{ts}) + \sigma_s^2 \mathbf{I}_{n_s \times n_s} & K(X^{ts}, X^T) \\ K(X^T, X^{ts}) & K(X^T, X^T) + \lambda^T \mathbf{I}_{n_T \times n_T} \end{bmatrix}$$
The noise variance $\sigma_s^2$ is dynamically adjusted during optimization: it increases when the similarity between the source and target tasks decreases.
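A minimal sketch (NumPy) of assembling the joint covariance $\mathbf{K}_\ast$ above; the kernel function and the noise values in the usage line are placeholders, and the rule for adjusting $\sigma_s^2$ is not shown.

```python
import numpy as np

def joint_covariance(K, X_s, X_t, sigma_s2, lam_t):
    """Env-GP style joint covariance over [source; target] points.

    K        : kernel function K(A, B) -> matrix
    sigma_s2 : extra noise variance on source observations (grows as
               source/target similarity drops)
    lam_t    : observation noise variance on the target task
    """
    K_ss = K(X_s, X_s) + sigma_s2 * np.eye(len(X_s))
    K_st = K(X_s, X_t)
    K_tt = K(X_t, X_t) + lam_t * np.eye(len(X_t))
    return np.block([[K_ss, K_st],
                     [K_st.T, K_tt]])

# toy usage with a simple RBF kernel
rbf = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2)
X_s, X_t = np.random.rand(5, 2), np.random.rand(3, 2)
print(joint_covariance(rbf, X_s, X_t, sigma_s2=0.5, lam_t=0.01).shape)  # (8, 8)
```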
- Diff-GP: define $g(x)=f^T(x)-f^s(x)$ to measure the difference between source task $t_s$ and target task $t_T$.
$$g|x,X^{t_T},\Delta y(X^{t_T}) \sim N(\mu_t^g(x),\lambda_t^g(x))$$
Assume $g(x)$ follows a GP; $g(x)$ can only be noisily observed as
$$\Delta y(x)=y^{t_T}-\mu^s_{n_s}(x)$$$$\iff \Delta y(x)=g(x)+\epsilon_g(x)$$$$\epsilon_g(x) \sim N(0,\lambda^T+\lambda^s(x))$$
$\lambda^T$ :observation noise variance of $t_T$
$\lambda^s(x)$ :the predictive variance of $t_s$ GP at point $x$.
Finally, use the posterior mean function $\mu_t^g$ to correct the observations from $t_s$ toward $t_T$, with corrected-observation noise $\Lambda^g_{i,i}=\lambda^s_{n_s}(x_i^{t_s})+\lambda_t^g(x_i^{t_s})$.
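A rough sketch of the Diff-GP correction under the definitions above. `mu_s`/`lam_s` are hypothetical callables for the source GP's predictive mean/variance, and the exact correction used in the paper may differ in detail; this is one consistent reading.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_posterior(X_train, y_train, X_test, noise_var):
    """Plain GP regression; noise_var may be a scalar or a per-point vector."""
    noise = np.diag(np.broadcast_to(noise_var, (len(X_train),)))
    K = rbf(X_train, X_train) + noise
    K_s = rbf(X_train, X_test)
    mu = K_s.T @ np.linalg.solve(K, y_train)
    var = rbf(X_test, X_test).diagonal() - np.einsum(
        "ij,ij->j", K_s, np.linalg.solve(K, K_s))
    return mu, var

def diff_gp_correct(mu_s, lam_s, X_T, y_T, lam_T, X_src, y_src):
    """Correct source observations toward the target task.

    mu_s, lam_s  : callables giving the source GP's predictive mean/variance
    X_T, y_T     : target-task observations; lam_T is their noise variance
    X_src, y_src : source-task data to be corrected
    """
    delta_y = y_T - mu_s(X_T)                       # noisy observations of g
    mu_g, lam_g = gp_posterior(X_T, delta_y, X_src,
                               noise_var=lam_T + lam_s(X_T))
    y_corrected = y_src + mu_g                      # shift source data toward target
    Lambda_diag = lam_s(X_src) + lam_g              # noise of corrected observations
    return y_corrected, Lambda_diag
```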
- Multi-armed Bandit (MAB): dynamically select the most helpful source task at each iteration. Each source task is treated as an arm of the bandit. The reward for selecting the $k$-th source task at iteration $t$ is defined as:
\[ r_t^k = -\left( y_t^{t^T} - \mu^k(x_t^{t^T}) \right)^2 \]where:
- $y_t^{t^T}$ is the noisy observation from the target task at point $x_t^{t^T}$,
- $\mu^k(x_t^{t^T})$ is the predictive mean from the $k$-th source task’s GP at the same point.
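A minimal sketch of the bandit layer. The ε-greedy strategy is an assumption (the concrete bandit algorithm may differ); the reward matches the definition above.

```python
import numpy as np

class SourceTaskBandit:
    """Epsilon-greedy bandit over K source-task GPs (illustrative).

    Arm k's reward is the negative squared error between the target
    observation and source k's predictive mean at the same point.
    """
    def __init__(self, n_arms, eps=0.1):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)   # running mean reward per arm
        self.eps = eps

    def select(self):
        if np.random.rand() < self.eps or self.counts.min() == 0:
            return np.random.randint(len(self.counts))
        return int(np.argmax(self.values))

    def update(self, arm, y_target, mu_source_at_x):
        r = -(y_target - mu_source_at_x) ** 2
        self.counts[arm] += 1
        self.values[arm] += (r - self.values[arm]) / self.counts[arm]
```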
- Prior Design
Use the source tasks to learn the prior mean and kernel function.
- MetaBO: assume all $f$ are sampled from the same GP prior and are conditionally independent. All source tasks must share the same input points.
- Search space is a finite set: $$X=[x_j]_{j=1}^M$$ $$Y=\{y_j^{t_k}\}_{j\in[M],k\in[K]}$$ where $y_j^{t_k}$ is the predictive mean of the GP $N(f^k(x),\sigma^2)$ trained on $t_k$, evaluated at $x_j$. $$\hat{\mu}(X)=\frac{Y^T \mathbf{1}_K}{K}$$ $$\hat{k}(X)=\frac{(Y-\mathbf{1}_K \hat{\mu}(X)^T)^T(Y-\mathbf{1}_K\hat{\mu}(X)^T)}{K-1}$$ (An estimation sketch follows below.)
- The case where the search space is a compact subset of $\mathbb{R}^d$, i.e. $X \subset \mathbb{R}^d$, can also be handled.
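A minimal sketch of the finite-search-space estimators $\hat{\mu}(X)$ and $\hat{k}(X)$ above, assuming $Y$ is stored as a $K\times M$ matrix (rows = source tasks, columns = shared input points).

```python
import numpy as np

def empirical_gp_prior(Y):
    """Estimate a prior mean vector and covariance matrix on a fixed grid X.

    Y : (K, M) array, row k = GP mean of source task k at the M shared points.
    """
    K = Y.shape[0]
    mu_hat = Y.mean(axis=0)            # (M,)  = Y^T 1_K / K
    R = Y - mu_hat                     # center each column across tasks
    k_hat = R.T @ R / (K - 1)          # (M, M) empirical kernel matrix
    return mu_hat, k_hat
```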
- HyperBO: unlike MetaBO, source and target tasks do not have to share the same input design (each task may be evaluated at different \(\mathbf{x}\) locations).
Hierarchical Generative Model:
- Task parameter: \[ \theta \sim p(\theta \mid \alpha) \] \(\theta\) captures shared structure across tasks; \(\alpha\) is a global hyper-prior.
- Noise level: \[ \sigma \sim p(\sigma \mid \theta) \]
- GP prior (mean \(\mu\), kernel \(k\)): \[ (\mu, k) \sim p(\mu, k \mid \theta) \]
- True objective for the task: \[ f \sim \mathcal{GP}\!\bigl(\mu, k\bigr) \]
- Each source/target task produces observations \(y = f(x) + \varepsilon,\; \varepsilon \sim \mathcal N(0,\sigma^{2})\).
Learning the hyper-parameters \(\alpha\):

| Strategy | Description |
| --- | --- |
| Hierarchical MLE | Maximise the marginal log-likelihood of all source-task data w.r.t. \(\alpha\). |
| Empirical Divergence Minimisation | Fit \(\alpha\) by minimising the discrepancy between the model-predicted multivariate Gaussians and empirical estimators from the data. |

Result: a learned distribution \(p(\theta \mid \alpha^{\star})\) which induces priors over \(\mu, k, \sigma\).
- Deep Kernel Prior
$$k_{deep}(x,x'|\theta,\omega)=k(\phi(x,\omega),\phi(x',\omega)|\theta)$$ where $\omega$ denotes the weights of the neural network and $\theta$ the hyperparameters of the kernel.
- Use stochastic gradient ascent (SGA) on batches of observations from a sampled source task to update the hyperparameters $\theta,\omega$.
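A minimal sketch of a deep kernel, using a tiny NumPy two-layer network as $\phi(\cdot,\omega)$; the architecture and the random initialization are illustrative assumptions (in practice $\omega,\theta$ are learned by the gradient updates described above).

```python
import numpy as np

def mlp_features(X, W1, b1, W2, b2):
    """Tiny two-layer network phi(x, omega); omega = (W1, b1, W2, b2)."""
    H = np.tanh(X @ W1 + b1)
    return np.tanh(H @ W2 + b2)

def deep_kernel(XA, XB, omega, lengthscale=1.0, amplitude=1.0):
    """k_deep(x, x') = RBF(phi(x), phi(x')) with kernel hyperparameters theta."""
    FA, FB = mlp_features(XA, *omega), mlp_features(XB, *omega)
    d2 = ((FA[:, None, :] - FB[None, :, :]) ** 2).sum(-1)
    return amplitude * np.exp(-d2 / (2 * lengthscale ** 2))

# toy usage with random weights
rng = np.random.default_rng(0)
omega = (rng.normal(size=(3, 16)), np.zeros(16),
         rng.normal(size=(16, 8)), np.zeros(8))
X = rng.uniform(size=(5, 3))
print(deep_kernel(X, X, omega).shape)  # (5, 5)
```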
- Deep Kernel Acquisition Function (DKAF):

| Component | Role | Details |
| --- | --- | --- |
| Deep Kernel | Learnable representation + RBF kernel | \( k(\mathbf x,\mathbf x'\mid\theta,\omega)=\alpha \exp\Bigl(-\frac{\lVert \phi(\mathbf x,\omega)-\phi(\mathbf x',\omega)\rVert_2^2}{2\eta}\Bigr)+\beta\,\delta(\mathbf x,\mathbf x') \) |
| GP Surrogate | Probabilistic model of the objective | Produces mean \(\mu(\mathbf x)\) and variance \(\sigma^2(\mathbf x)\) using the deep kernel. |
| Acquisition Function (learned) | Picks the next query point | Rather than a fixed formula, DKAF learns the policy that maps GP outputs → action. |
- Formulate BO as RL:
- State \(s_t\): current evaluated dataset.
- Action \(a_t\): next configuration \(\mathbf x_t\).
- Reward \(r_t = -(f^* - f(\mathbf x_t))\).
- Policy-Gradient Update:
- Sample a source task, run BO episode with current \((\theta,\omega)\).
- Collect rewards, apply REINFORCE/PPO to update both kernel params and policy network.
- Transfer:
- Freeze learned \((\theta,\omega)\) and policy; apply on target task for fast optimization.
- Data Scale Prior
- Standardization by Mean and Variance:
Each task's response is normalized to zero mean and unit variance: $$\hat{y}_j^{t_i} = \frac{y_j^{t_i} - \bar{y}^{t_i}}{s^{t_i}}$$ where $\bar{y}^{t_i}$ and $s^{t_i}$ are the empirical mean and standard deviation of task $t_i$'s observations.
- Randomized Min-Max Normalization: $$l^i, u^i \sim \mathcal{U}(y_{\min}, y_{\max}), \quad \hat{y}_j^{ti} = \frac{y_j^{ti} - l^i}{u^i - l^i}$$
- Ranking-Based Reconstruction:
Use ranks instead of values. $$x_k^{t_i} \prec x_j^{t_i} \Leftrightarrow y_k^{t_i} < y_j^{t_i}$$ Learn the surrogate from pairwise rankings using GP-Rank or SVM-Rank.
- Gaussian Copula Transformation: Map outputs to normal space via the empirical CDF (transform all tasks' $y$ to $N(0,I)$): $$F(t) \approx \begin{cases} \delta_N & \text{if } \tilde{F}(t) < \delta_N \\ \tilde{F}(t) & \text{if } \delta_N \leq \tilde{F}(t) \leq 1 - \delta_N \\ 1 - \delta_N & \text{if } \tilde{F}(t) > 1 - \delta_N \end{cases}$$ $$z = \phi(y) = \Phi^{-1}(F(y))$$ Then learn a normal model: $$P(z | x) \sim \mathcal{N}(\mu_\theta(x), \sigma_\theta^2(x)), \quad \hat{z} = \frac{\phi(y) - \mu_\theta(x)}{\sigma_\theta(x)}$$
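A minimal sketch of the copula step above, using an empirical CDF clipped away from 0 and 1; the truncation value `delta` used here is a placeholder, not necessarily the $\delta_N$ prescribed in the original method.

```python
import numpy as np
from scipy.stats import norm

def copula_transform(y, delta=None):
    """Map outputs y to z-space via a clipped empirical CDF and Phi^{-1}."""
    n = len(y)
    if delta is None:
        delta = 1.0 / (4.0 * n ** 0.25)        # placeholder truncation choice
    ranks = np.argsort(np.argsort(y)) + 1      # 1..n ranks of the observations
    F = np.clip(ranks / n, delta, 1.0 - delta) # clipped empirical CDF values
    return norm.ppf(F)                         # z = Phi^{-1}(F(y))

y = np.random.lognormal(size=20)
z = copula_transform(y)                        # roughly N(0, 1) distributed
```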
- Ensemble Design
Separately train GPs and combine them.
- POGPE: for each source task \( t_i \), a separate GP is trained; each task has its own mean function \( \mu_i(x) \) and variance function \( \lambda_i(x) \).
Posterior Mean:
\[ \mu^*(x) = \lambda^*(x) \cdot \sum_{i=1}^{K} \beta_i \cdot \lambda_i^{-1}(x) \cdot \mu_i(x) \]
Posterior Variance:
\[ \left(\lambda^*(x)\right)^{-1} = \sum_{i=1}^{K} \beta_i \cdot \lambda_i^{-1}(x) \]
Posterior Likelihood:
\[ p(\mathbf{y} \mid \mathbf{X}, \theta) = \prod_{i=1}^{K} p_i(\mathbf{y}^{t_i} \mid \mathbf{X}^{t_i}, \theta^{t_i})^{\beta_i} \]Here, \( \beta_i \) represents the weight for each task:
- Tasks with smaller uncertainty (\(\lambda_i(x)\)) contribute more.
- Two common choices for weights:
- Equal weight: \( \beta_i = \frac{1}{K} \) for all tasks.
- Target task receives higher weight: For example, \( \beta_T = \frac{1}{2} \), and the rest split \( \frac{1}{2} \) among source tasks.
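A minimal sketch of the product-of-experts combination at a single point, following the posterior mean/variance formulas above; variable names are illustrative.

```python
import numpy as np

def pogpe_combine(mus, lams, betas):
    """Combine per-task GP predictions at one point (product of experts).

    mus, lams : per-task predictive means / variances, shape (K,)
    betas     : per-task weights, shape (K,) (e.g. 1/K each, or 1/2 for the target)
    """
    prec = np.sum(betas / lams)                 # (lambda*)^{-1} = sum_i beta_i / lambda_i
    lam_star = 1.0 / prec
    mu_star = lam_star * np.sum(betas * mus / lams)
    return mu_star, lam_star

mu, lam = pogpe_combine(np.array([0.2, 0.5]),
                        np.array([0.1, 0.4]),
                        np.array([0.5, 0.5]))
```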
Bayesian Neural Network
GP inference scales as $O(n^3)$; BNNs are more scalable.
Bayesian Linear Regression
- Prior : $$\omega \sim N(0,S)$$
- Likelihood : $$ y|x,\omega \sim N(\omega^T \phi(x),\sigma^2)$$ $\sigma^2$ : the variance of the observation noise
- $$\log p(\mathbf{w} | \mathcal{D}) = \log p(\mathbf{w}) + \log p(\mathcal{D} | \mathbf{w}) + \text{const} \\ = -\frac{1}{2}\mathbf{w}^\mathsf{T} \mathbf{S}^{-1}\mathbf{w} - \frac{1}{2\sigma^2}||\boldsymbol{\Psi}\mathbf{w} - \mathbf{t}||^2 + \text{const} \\ = -\frac{1}{2}\mathbf{w}^\mathsf{T} \mathbf{S}^{-1}\mathbf{w} - \frac{1}{2\sigma^2}(\mathbf{w}^\mathsf{T} \boldsymbol{\Psi}^\mathsf{T} \boldsymbol{\Psi}\mathbf{w} - 2\mathbf{t}^\mathsf{T} \boldsymbol{\Psi}\mathbf{w} + \mathbf{t}^\mathsf{T} \mathbf{t}) + \text{const} \\ = -\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^\mathsf{T} \boldsymbol{\Sigma}^{-1}(\mathbf{w} - \boldsymbol{\mu}) + \text{const}$$ where $$\boldsymbol{\mu} = \sigma^{-2}\boldsymbol{\Sigma}\boldsymbol{\Psi}^\mathsf{T}\mathbf{t} \\ \boldsymbol{\Sigma}^{-1} = \sigma^{-2}\boldsymbol{\Psi}^\mathsf{T}\boldsymbol{\Psi} + \mathbf{S}^{-1}$$ $$\omega | D \sim N(\mu,\Sigma)$$
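A minimal sketch (NumPy) of the posterior and predictive computations above; `Psi` is the feature matrix $\boldsymbol{\Psi}$ and `t` the target vector.

```python
import numpy as np

def blr_posterior(Psi, t, S, sigma2):
    """Posterior N(mu, Sigma) over weights for Bayesian linear regression.

    Psi    : (n, D) feature matrix (rows Psi(x_i))
    t      : (n,)   targets
    S      : (D, D) prior covariance of w
    sigma2 : observation noise variance
    """
    Sigma_inv = Psi.T @ Psi / sigma2 + np.linalg.inv(S)
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ Psi.T @ t / sigma2
    return mu, Sigma

def blr_predict(phi_x, mu, Sigma, sigma2):
    """Predictive mean / variance at a new feature vector phi(x)."""
    return phi_x @ mu, phi_x @ Sigma @ phi_x + sigma2
```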
- DNGO: $$y = \omega_0 + \omega_1 x_1 + ... + \omega_d x_d \;\Rightarrow\; y = \omega_0 + \omega_1 \phi_1(\mathbf{x}) + ... + \omega_d \phi_d(\mathbf{x}) $$ $$ \mu_n(\mathbf{x}) = [\beta \mathbf{K}^{-1} \mathbf{\Phi}^{\mathrm{T}}(\mathbf{y} - \mu_0(\mathbf{x}))]^{\mathrm{T}} \phi(\mathbf{x}) + \mu_0(\mathbf{x}) $$
- Multi-task DNGO:
- All tasks share one neural network.
- For each task \( k \), the modeling assumption is:
\[ \mathbf{y}^k \mid \Phi^k, \mathbf{w}^k \sim \mathcal{N}\!\left( \Phi^k \mathbf{w}^k,\; \beta_k^{-1} I_{n_k} \right), \qquad \mathbf{w}^k \sim \mathcal{N}\!\left( \mathbf{0},\; \alpha_k^{-1} I_D \right) \]
Where:
- \( \Phi^k \in \mathbb{R}^{n_k \times D} \): Feature matrix for task \( k \), where each row is \( \phi(\mathbf{x}_i^k) \), extracted by a shared neural network \( \phi(\cdot) \).
- \( \mathbf{w}^k \in \mathbb{R}^D \): Task-specific Bayesian linear regression weight vector (latent).
- \( \beta_k \): Precision (inverse of noise variance) of task \( k \)’s observations.
- \( \alpha_k \): Prior precision of weights.
- \( \theta \): Parameters of the shared neural network that defines \( \phi(\cdot) \).
- Given the model and integrating out \( \mathbf{w}^k \), the predictive mean and variance for a new point \( \mathbf{x} \) in task \( k \) are:
\[ \mu_k(\mathbf{x}) = \phi(\mathbf{x})^T K_k^{-1} (\Phi^k)^T \mathbf{y}^k, \qquad \sigma_k^2(\mathbf{x}) = \frac{1}{\beta_k}\, \phi(\mathbf{x})^T K_k^{-1} \phi(\mathbf{x}) \]
Where:
\[ K_k = (\Phi^k)^T \Phi^k + \frac{\alpha_k}{\beta_k} I_D \]This gives task-specific posterior distributions, similar to classic Bayesian linear regression but enhanced with neural network features.
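A minimal sketch of the task-specific prediction with shared features, following the $K_k$ definition above; feature extraction by the shared network is assumed to have already produced `Phi_k` and `phi_x`.

```python
import numpy as np

def ablr_predict(Phi_k, y_k, phi_x, alpha_k, beta_k):
    """Task-k predictive mean/variance from shared-network features.

    Phi_k : (n_k, D) features of task k's observed points
    y_k   : (n_k,)   observed responses of task k
    phi_x : (D,)     features of the query point
    """
    D = Phi_k.shape[1]
    K_k = Phi_k.T @ Phi_k + (alpha_k / beta_k) * np.eye(D)
    w_mean = np.linalg.solve(K_k, Phi_k.T @ y_k)      # posterior mean of w^k
    mu = phi_x @ w_mean
    var = phi_x @ np.linalg.solve(K_k, phi_x) / beta_k
    return mu, var
```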
- All task-specific and shared parameters are learned by minimizing the negative log marginal likelihood across tasks:
The marginal likelihood for each task is:
\[ \mathbf{y}^k \mid \theta, \alpha_k, \beta_k \sim \mathcal{N}\left( \mathbf{0}, \beta_k^{-1} I_{n_k} + \alpha_k^{-1} \Phi^k (\Phi^k)^T \right) \]
- ABRAC (Adaptive Complexity ABLR with Nested Dropout):
- Nested Dropout: keep the top $k$ feature dimensions and set the rest to $0$ at each iteration, where $k$ is sampled as a truncation index.
- It allows each task to have a different latent feature dimensionality.
- The model assumes a weight prior of the form:
\[ \mathbf{w}^k \sim \mathcal{N}\!\left( \mathbf{0},\; \operatorname{diag}\!\left(\alpha_{k,1}^{-1}, \dots, \alpha_{k,D}^{-1}\right) \right) \]
which allows the prior precision to differ across the components of \( \mathbf{w}^k \).
- Step 1 (offline): the model applies nested dropout during training to learn a feature-importance ordering.
- Step 2 (online complexity selection and fine-tuning): the learned feature ordering is reused; for each task, select the best truncation level \( t^* \) based on validation or prior knowledge, and use only the first \( t^* \) latent dimensions of \( \phi(x) \) during inference and optimization.
- BOHAMIANN (Bayesian Neural Networks with SGHMC):
- BOHAMIANN applies Bayesian neural networks with Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) sampling for BO.
- Each task-specific model is: \[ f^k(x) = h\left( [x;\, \phi_k],\; \theta_h \right) \] where \( \phi_k \) is a learned task embedding concatenated with the input and \( h \) is a shared network with parameters \( \theta_h \).
- The posterior predictive is approximated by sampling \( \theta^i \sim p(\theta \mid D) \) and averaging:
\[ \hat{\mu}(x) = \frac{1}{M} \sum_{i=1}^{M} f(x; \theta^i), \qquad \hat{\sigma}^2(x) = \frac{1}{M} \sum_{i=1}^{M} \left( f(x; \theta^i) - \hat{\mu}(x) \right)^2 \]
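A minimal sketch of the Monte-Carlo predictive above; `f` and `theta_samples` stand in for the network and its SGHMC weight samples (the sampler itself is not shown).

```python
import numpy as np

def mc_predictive(f, x, theta_samples):
    """Average predictions over posterior weight samples theta^i ~ p(theta | D).

    f             : callable f(x, theta) -> scalar prediction
    theta_samples : iterable of weight samples
    """
    preds = np.array([f(x, th) for th in theta_samples])
    return preds.mean(), preds.var()
```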
Neural Process
Transfer Neural Processes (TNP) extend NPs for multi-task and transfer learning scenarios, enabling BO systems to generalize across tasks.
- TNP is composed of the following modules:
- Encoder \( \mathcal{E}_\theta \): Maps each observation \( (x_i, y_i) \rightarrow r_i \in \mathbb{R}^d \)
- Attention module \( \mathcal{A}_\theta \): Aggregates context representations \( \{r_i\} \) using cross-task attention to produce a task-aware representation for query point \( x \)
- Decoder \( \mathcal{D}_\theta \): Predicts \( y \sim \mathcal{N}(\mu(x), \sigma^2(x)) \) using aggregated context and input \( x \)
All components are parameterized by neural networks and trained jointly.
TL for Acq
- Multi-task BO based on entropy search:
- $c_k(x)$ is the real-valued cost of evaluating $x$ at task $t_k$.
- At every iteration, the acquisition function picks one particular task $t_k$ and point $x$ to evaluate, then the shared GP is updated.
- MUMBO (Multi-task Max-value Bayesian Optimization):
$$ \alpha_{\text{MUMBO}}(x^k, z) = \frac{H(y \mid D_n) - \mathbb{E}_{f(x)}\left[ H\left(y \mid D_n \cup \{(x, \mu_n(x))\} \right) \right]}{c_k(x, z)} $$
- $c_k(x,z)$ is the real-valued cost of evaluating $x$ with fidelity $z$ at task $t_k$.
- Ensemble GPs-based acquisition transfer (TAF):
\[ \alpha_{TAF}(\mathbf{x}) = \frac{ \beta_T \alpha_{EI_n}^{T}(\mathbf{x}) + \sum_{i=1}^{K} \beta_i I_i(\mathbf{x}) }{ \sum_{i=1}^{K} \beta_i } \]
  - \( \alpha_{EI_n}^{T}(\mathbf{x}) \): Expected Improvement on the target task at iteration \( n \).
  - \( I_i(\mathbf{x}) = \max\{ y_{\min}^{t_i} - \mu^i(\mathbf{x}), 0 \} \): improvement contributed by source task \( t_i \).
  - \( \beta_T, \beta_i \): importance weights.
  - Each source task contributes via its own GP and acquisition measure.
  - Promotes points with strong performance across the source tasks and high EI on the target task.
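A minimal sketch of evaluating the TAF score at one candidate point, assuming a minimization problem; the normalizer follows the formula above (some formulations also include $\beta_T$ in it), and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI for minimization from the target GP's predictive mean/std."""
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def taf_acquisition(mu_t, sigma_t, y_best_t,
                    source_mus, source_y_mins, betas, beta_T):
    """Transfer acquisition function at one candidate point.

    source_mus    : (K,) predictive means of the source GPs at x
    source_y_mins : (K,) best observed value per source task
    betas, beta_T : importance weights for the sources / the target
    """
    ei_target = expected_improvement(mu_t, sigma_t, y_best_t)
    improvements = np.maximum(source_y_mins - source_mus, 0.0)
    num = beta_T * ei_target + np.sum(betas * improvements)
    return num / np.sum(betas)   # normalizer as in the formula above
```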
- RL-based NAF (Neural Acquisition Function): the acquisition function is learned via Reinforcement Learning (RL), allowing meta-learning of acquisition functions from source to target tasks.
- The acquisition function is a neural network $\alpha_\theta$, parameterized by $\theta$.
- Use Proximal Policy Optimization (PPO) to learn \( \theta \) over many source tasks.
TL for Init
Meta-feature similarity based
MI-SMBO: Meta-learning-based Initialization for Sequential Model-based BO
- Source task datasets: \( D^1, D^2, ..., D^K \), each with its best configuration \( \mathbf{x}^{*1}, ..., \mathbf{x}^{*K} \).
- Meta-feature vector for each task: \( \mathbf{m}^1,\mathbf{m}^2,...,\mathbf{m}^K \in \mathbb{R}^d \)
- Target task meta-feature: \( \mathbf{m}^t \)
- To find similar tasks, compute the distance between target and source meta-features:
  - p-norm distance: \[ d_p(D^k, D^t) = \|\mathbf{m}^k - \mathbf{m}^t\|_p \]
  - Negative Spearman correlation (between how the two tasks rank configurations by performance).
- Then select the top \( t \) most similar datasets and use their best points \( \{\mathbf{x}^{*j}\} \) as warm-start points for the target BO run.
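A minimal sketch of the warm-start selection, assuming a p-norm distance on meta-features; the function and variable names are illustrative.

```python
import numpy as np

def warm_start_configs(meta_target, meta_sources, best_configs, t=3, p=2):
    """Pick the best configurations of the t most similar source tasks.

    meta_target  : (d,)    meta-feature vector of the target dataset
    meta_sources : (K, d)  meta-feature vectors of the source datasets
    best_configs : list of K incumbent configurations x*_k
    """
    dists = np.linalg.norm(meta_sources - meta_target, ord=p, axis=1)
    nearest = np.argsort(dists)[:t]
    return [best_configs[k] for k in nearest]
```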
Warm-start + Deep Meta-feature Learning (image datasets)
- Randomly sample \( \tau \) data points from each dataset \( D^k \).
- Use a deep neural network \( \mathcal{M}^{df} \) to extract deep features \( \mathbf{d}^{t_k}_{1:\tau} = \{d_i^{t_k}\}_{i=1}^\tau \) (because the datasets are images).
- Aggregate these deep features into the meta-feature vector \( \mathbf{h}^{t_k} \in \mathbb{R}^d \) using either:
  - ADF (Aggregation of Deep Features): \[ \mathbf{h}^{t_k}_{ADF} = \frac{1}{\tau} \sum_{i=1}^\tau \mathbf{d}^{t_k}_i \]
  - Bi-LSTM over the sequence: \[ \mathbf{h}^{t_k}_{Bi-LSTM} = \text{Bi-LSTM}(\mathbf{d}^{t_k}_{1:\tau}) \]
- The final meta-feature vector \( \mathbf{m}^k \) is output by a fully connected layer, trained by minimizing $$\|d_{target}(D_i,D_j) - d_{mf}(\mathbf{m}^i , \mathbf{m}^j)\|$$ where $d_{target}(D_i,D_j)=\sum_{s=1}^n \|f^i(x_s) - f^j(x_s)\|$ and \( f^i \), \( f^j \) are performance functions on datasets \( D_i, D_j \); the goal is to make similar datasets have similar meta-feature embeddings.
Gradient-based
- Via the loss gradient, the initial points are expected to perform well across all source tasks.
- ❌ Original (non-differentiable) loss:
\[ \mathcal{L}(\mathcal{X}^I, \mathcal{D}) = \frac{1}{K} \sum_{D^{t_k} \in \mathcal{D}} \min_{x \in \mathcal{X}^I} f^k(x) \]
- ✅ Softmin approximation (a small loss-computation sketch appears at the end of this subsection):
\[ \mathcal{L}(\mathcal{X}^I, \mathcal{D}) = \frac{1}{K} \sum_{D^{t_k} \in \mathcal{D}} \sum_{i=1}^I \sigma_{D^{t_k},i} \hat{f}^k(x_i) \]
  - \( \hat{f}^k(x) \): mean function of the GP surrogate for task \(k\)
  - \( \sigma_{D^{t_k},i} \): softmin weight using a large negative \(\beta\)
- Gradient-based optimization: update each point coordinate by gradient descent:
\[ x_{i,j} \leftarrow x_{i,j} - \eta \frac{\partial \mathcal{L}}{\partial x_{i,j}} \]
- Advanced: task-similarity weighting
\[ \mathcal{L}(\mathcal{X}^I, \mathcal{D}) = \frac{1}{K} \sum_{D^k \in \mathcal{D}} c(D^k, D^T) \sum_{i=1}^I \sigma_{D^k,i} \hat{f}^k(x_i) \]
  - \( c(D^k, D^T) \): similarity between source task \(k\) and the target task
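A minimal sketch of the softmin-relaxed initialization loss; `surrogate_means` are hypothetical callables returning each source GP's predictive mean, and the gradient step on the points (e.g. via autodiff) is omitted.

```python
import numpy as np

def softmin_weights(values, beta=-10.0):
    """sigma_i proportional to exp(beta * f_i); large negative beta approximates min."""
    z = beta * values
    z -= z.max()                      # numerical stability
    w = np.exp(z)
    return w / w.sum()

def init_loss(X_init, surrogate_means, beta=-10.0):
    """Softmin-relaxed loss of an initial design across K source surrogates.

    X_init          : (I, d) candidate initial points
    surrogate_means : list of callables, f_hat_k(X) -> (I,) predicted values
    """
    total = 0.0
    for f_hat in surrogate_means:
        vals = f_hat(X_init)
        total += softmin_weights(vals, beta) @ vals   # softmin over the I points
    return total / len(surrogate_means)
```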
Evolutionary algorithm based
This method proposes using an evolutionary algorithm (EA) to find a good set of initial points \( \mathcal{X}^I \) for BO by minimizing the loss across all source tasks.
Find a set of points that achieves low loss across all source tasks:
\[ \mathcal{X}^I = \arg\min_{\tilde{\mathcal{X}} \subseteq \mathcal{X}} \sum_{k=1}^{K} \min_{x \in \tilde{\mathcal{X}}} \frac{f^k(x) - f^k_{\min}}{f^k_{\max} - f^k_{\min}} \]
- \( f^k(x) \): posterior mean of the GP model for task \( k \)
- \( f^k_{\min}, f^k_{\max} \): min and max over all function values seen so far
This encourages the selected point set to perform reasonably well for all source tasks, rather than excelling on only a few.
- Start with random population of candidate sets.
- For each point \( x \), compute its sampling probability.
- Evolve new sets via mutation or crossover over 100,000 iterations.
TL for Config Space Design
Search space pruning
Prune unpromising regions of the space using knowledge from the source tasks, to avoid unnecessary function evaluations.
- Define the similarity between a source task and the target task as:
$$ \text{KTRC}(D^{t^k}, D^{t^T}) := \frac{ \sum_{x_i, x_j \in X_t} \mathbb{1}\left(\hat{f}^k(x_i) > \hat{f}^k(x_j) \oplus \hat{f}^T(x_i) > \hat{f}^T(x_j)\right) }{ |X_t| (|X_t| - 1) }$$
- Select the $m$ most similar tasks as $T'=\{t_{i_1},\dots,t_{i_m}\}$.
- Define the potential of a region: for a candidate region \( \mathcal{R} = (x, \delta) \),
$$ \text{potential}(\mathcal{R}, X_t) := \sum_{k \in T'} \left( \hat{f}^k(x) - \max_{x' \in X_t} \hat{f}^k(x') \right) $$
Define a pruned space that removes regions far from promising areas:
$$ X^{(\text{pruned})} := \{ x \in X \mid \text{dist}(x, x') > \delta, x' \in X' \} $$Where \( X' \) is the set of high-potential points.
- Define the distance function between \( x \) and \( x' \) as:
$$ \text{dist}(x, x') := \begin{cases} \infty & \text{if } x \text{ and } x' \text{ differ in a categorical variable} \\\\ \|x - x'\| & \text{otherwise} \end{cases} $$
- The final space is $X^{(\text{pruned})} \cup \{x \in X \mid \text{dist}(x,x') \le \delta,\ x' \in X'\}$, where $X$ is the original space and $X'$ is the set of high-potential points.
Promising search space design
The search space design is transformed into an optimization problem, aiming to construct the smallest space that contains all known good points by minimizing the volume of a geometric shape (such as a box or ellipsoid).
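A minimal sketch of the box variant: collect the best known configuration from each source task and take the smallest axis-aligned box containing them. The `margin` parameter is an illustrative assumption, not part of the original formulation.

```python
import numpy as np

def bounding_box_space(best_points, margin=0.0):
    """Smallest axis-aligned box containing the source tasks' best configurations.

    best_points : (K, d) array, one good configuration per source task
    margin      : optional relative enlargement of the box per dimension
    """
    lower = best_points.min(axis=0)
    upper = best_points.max(axis=0)
    pad = margin * (upper - lower)
    return lower - pad, upper + pad

# the target BO run then searches only inside [lower, upper]
best = np.array([[0.1, 0.4], [0.3, 0.2], [0.2, 0.5]])
lower, upper = bounding_box_space(best, margin=0.1)
```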