Variational Bayesian Inference
Log likelihood is a statistical measure that is used to quantify how well a given set of parameters explains the observed data under specific probabilistic model. It is especially popular in statistics and machine learning community, particularly in the context of Bayesian inference, which this post is about. In simpler words, Likelihood : How probable is the observed data given a set of parameters?, and Log-likelihoo: The natural logarithm of the likelihood, transforming a product of proabilities into a sum of log-probabilities for ease of computation.
Now let’s break down the log likelihood to more thorough understanding with an example: Let us consider, we have some observed data points \(x_1, x_2, \cdots, x_n\). Also, let’s assume that these data points are generated from a probability distribution with parameters \(\theta\). For example, the data might come from a normal distribution with mean \(\mu\), and standard deviation \(\sigma\) such that \(\theta = (\mu, \sigma)\).
Now the likelihood function \(L(\theta)\) is the probability of the observed data given the parameters \(\theta\), or mathematically:
\[L(\theta) = P(X \mid \theta)\]For \(n\) independent and identically distributed (i.i.d) observations \(x_1, x_2, \cdots, x_n\), the likelihood function is defined as the product of the probabilities of each individual observation. Mathematically,
\[L(\theta) = \prod_{i = 1}^{n} P(x_i \mid \theta)\]Subsequently, the Log-Likelihood Function is now simply the natural logarithm of the above likelihood function.
\[\mathcal{l}(\theta) = \sum_{i = 1}^n \log L(\theta) = log(\prod_{i = 1}^n)P(x_i \mid \theta)\]Using the properties of logarithm, the above equation can be expressed as sum. This transformation from product to sum is often done because mathematically sums are easier to work with compared to products, especially when taking gradients (eg: for optimization).
\[\mathcal{l}(\theta) = \sum_{i = 1}^n \log P(x_i \mid \theta)\]Log-Likelihood for a Normal Distribution
Usually in scenarios when the data is modelled as a Normal (Gaussian) distribution with some mean \(\mu\), and standard deviation \(\sigma\), the conditional probability of some input \(x_i\) is given by:
\[P(x_i \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(\frac{-(x_i - \mu)^2}{2\sigma^2})\]The log-likelihood function for this normal distribution can be written as:
\[\mathcal{l}(\mu, \sigma) = \sum_{i = 1}^n \log(\frac{1}{\sqrt{2\pi\sigma^2}}\exp(\frac{-(x_i - \mu)^2}{2\sigma^2}))\] \[\mathcal{l}(\mu, \sigma) = \sum_{i = 1}^n(log(\frac{1}{\sqrt{2\pi\sigma^2}}) + \log(\exp(\frac{-(x_i - \mu)^2}{2\sigma^2})))\] \[\mathcal{l}(\mu, \sigma) = \sum_{i = 1}^n(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{-(x_i - \mu)^2}{2\sigma^2})\]Bayesian Linear Layer
Now, let us create a Bayesian linear layer in Pytorch. The weights of the linear layer are sampled from a fully factorised Normal with learnable parameters. The likelihood of the weight samples under the prior and the approximate posterior are returned with each forward pass in order to estimate the KL term in the ELBO.
In the following code, a Python Class created that inherits from nn.Module. For easier understanding of the process, I will try to break down and explain the code snippets in simple manner.
- First, the layer is initialized with the number of input features
num_inp_feats
, the number of output featuresnum_out_feats
, a prior distribution classprior_class
, and an optional bias termwith_bias
(default isTrue
). -
self.W_mu
andself.W_p
are the mean and a parameter related to the standard deviation of the weights, respectively. -
W_mu
is initialized uniformly between -0.1 and 0.1. -
W_p
is initialized uniformly between -3 and -2. -
self.b_mu
andself.b_p
are the mean and a parameter related to the standard deviation of the biases, respectively, initialized similarly to the weights.
class BayesLinearNormalDist(nn.Module):
def __init__(self, num_inp_feats, num_out_feats, prior_class, with_bias=True):
super(BayesLinearNormalDist, self).__init__()
self.num_inp_feats = num_inp_feats
self.num_out_feats = num_out_feats
self.prior = prior_class
self.with_bias = with_bias
# Learnable parameters -> Initialisation is set empirically.
self.W_mu = nn.Parameter(
torch.Tensor(self.num_inp_feats, self.num_out_feats).uniform_(-0.1, 0.1)
)
self.W_p = nn.Parameter(
torch.Tensor(self.num_inp_feats, self.num_out_feats).uniform_(-3, -2)
)
self.b_mu = nn.Parameter(torch.Tensor(self.num_out_feats).uniform_(-0.1, 0.1))
self.b_p = nn.Parameter(torch.Tensor(self.num_out_feats).uniform_(-3, -2))
The forward method below defines the forward pass of the layer.
- If
ifsample
isFalse
, the method returns the deterministic output using the mean of the weights and biases. - If
ifsample
isTrue
, it proceeds to sample weights and biases from the approximate posterior distribution rather than using the mean values. This stochastic sampling is crucial for Bayesian neural networks, allowing the model to account for uncertainty in the parameters.def forward(self, X, sample=0, local_rep=False, ifsample=True): if not ifsample: # When training return MLE of w for quick validation # pdb.set_trace() if self.with_bias: output = torch.mm(X, self.W_mu) + self.b_mu.expand( X.size()[0], self.num_out_feats ) else: output = torch.mm(X, self.W_mu) return output, torch.Tensor([0]).cuda()
The weights and biases are expanded along a new dimension for sampling.
-
eps_W
andeps_b
are standard normal noise samples. -
std_w
andstd_b
are computed using the softplus function to ensure positivity. -
W
andb
are sampled from the approximate posterior. - Unsqueezing: The unsqueeze method adds an extra dimension to the tensors. This is necessary to prepare the tensors for sampling.
-
W_mu
andW_p
are unsqueezed at dimension 1, transforming them from shape (num_inp_feats
,num_out_feats
) to (num_inp_feats
, 1,num_out_feats
). -
b_mu
andb_p
are unsqueezed at dimension 0, transforming them from shape (num_out_feats
) to (1,num_out_feats
).
-
-
Repeating: The repeat method replicates the tensors along specified dimensions.This ensures that each weight and bias parameter has multiple samples, one for each desired sample.
-
W_mu
andW_p
are repeated along the new dimension to match the number of samples, resulting in shape (num_inp_feats
,sample
,num_out_feats
). -
b_mu
andb_p
are repeated along the new dimension to match the number of samples, resulting in shape (sample
,num_out_feats
).
-
- Sampling Noise:
eps_W
andeps_b
are tensors of the same shape asW_mu
andb_mu
, filled with samples from a standard normal distribution. - Computing Standard Deviation: The standard deviations for weights and biases are computed using the softplus function applied to
W_p
andb_p
. The softplus function ensures that the standard deviations are positive. - The weights
W
and biasesb
are sampled from their approximate posterior distributions by adding the scaled noise (std_w
*eps_W
andstd_b
*eps_b
) to their means (W_mu
andb_mu
).else: if not local_rep: # Tensor.new() Constructs a new tensor of the same data type as self tensor. # the same random sample is used for every element in the minibatch # pdb.set_trace() W_mu = self.W_mu.unsqueeze(1).repeat(1, sample, 1) W_p = self.W_p.unsqueeze(1).repeat(1, sample, 1) b_mu = self.b_mu.unsqueeze(0).repeat(sample, 1) b_p = self.b_p.unsqueeze(0).repeat(sample, 1) eps_W = W_mu.data.new(W_mu.size()).normal_() eps_b = b_mu.data.new(b_mu.size()).normal_() if not ifsample: eps_W = eps_W * 0 eps_b = eps_b * 0 # sample parameters std_w = 1e-6 + f.softplus(W_p, beta=1, threshold=20) std_b = 1e-6 + f.softplus(b_p, beta=1, threshold=20) W = W_mu + 1 * std_w * eps_W b = b_mu + 1 * std_b * eps_b
-
lqw
andlpw
are the log-likelihoods of the weights and biases under the approximate posterior and prior distributions, respectively. -
Log-Likelihood under the Posterior (lqw): This is computed for both weights and biases if
self.with_bias
isTrue
. The functionisotropic_gauss_loglike
calculates the log-likelihood of the sampled parameters under the approximate posterior distribution. - Log-Likelihood under the Prior (lpw): This is computed using the provided prior class for both weights and biases if
self.with_bias
isTrue
.
For log-likelihood under the posterior, the weight and bias parameter including the mean, and standard deviation of weights are passed as arguments to isotropic_gauss_loglike
function. The isotropic_gauss_loglike
function computes the log-likelihood of a set of samples x
under an isotropic Gaussian (Normal) distribution with mean mu
and standard deviation sigma
. Within the BayesLinearNormalDist
class, the isotropic_gauss_loglike
function is used to calculate the log-likelihood of the sampled weights and biases under the approximate posterior distribution. This helps in computing the Evidence Lower Bound (ELBO) for variational inference. Specifically, the log-likelihood difference between the posterior and prior distributions (lqw
- lpw
) is used as part of the loss function to optimize the parameters of the model.Here is a detailed breakdown of the method:
def isotropic_gauss_loglike(x, mu, sigma, do_sum=True):
"""Returns the computed log-likelihood
Args:
x (_type_): the sampled weights or biases
mu (_type_): mean of gaussian distribution
sigma (_type_): standard deviation of gaussian dist
do_sum (bool, optional): _description_. a boolean indicating whether to sum the log-likelihoods
over all elements or to take the mean.
Returns:
_type_: Gaussian Log likelihood
"""
cte_term = -(0.5) * np.log(2 * np.pi) # constant term
det_sig_term = -torch.log(sigma) # Determinant term
inner = (x - mu) / sigma
dist_term = -(0.5) * (inner**2)
if do_sum:
out = (cte_term + det_sig_term + dist_term).sum() # sum over all weights
else:
out = (cte_term + det_sig_term + dist_term).mean()
return out
-
cte_term
: Constant term, This term is a constant that is part of the Gaussian log-likelihood formula. It arises from the normalization constant of the Gaussian distribution. -
det_sig_term
: Determinant term, This term accounts for the log of the determinant of the covariance matrix. In the isotropic case (where the covariance matrix is diagonal with equal entries), this term simplifies to the log of the standard deviation. - dist_term: Distance term, This term measures the squared distance between the samples x and the mean mu, scaled by the standard deviation sigma. The result is then multiplied by -0.5, which is part of the Gaussian log-likelihood formula.
- The output is computed using the sampled weights and biases.
if self.with_bias:
lqw = isotropic_gauss_loglike(
W, W_mu, std_w
) + isotropic_gauss_loglike(b, b_mu, std_b)
lpw = self.prior.loglike(W) + self.prior.loglike(b)
else:
lqw = isotropic_gauss_loglike(W, W_mu, std_w)
lpw = self.prior.loglike(W)
# Reshaping weight to (num_inp_feats, num_out_feats) and biases to (num_out_feats)
W = W.view(W.size()[0], -1)
b = b.view(-1)
if self.with_bias:
# wx + b
output = torch.mm(X, W) + b.unsqueeze(0).expand(
X.shape[0], -1
) # (batch_size, num_out_featsput)
else:
output = torch.mm(X, W)
If local representation local_rep
is True
, weights and biases are sampled for each data point in the batch. The process becomes slightly different, with weights and biases sampled for each data point individually rather than a shared sample across the batch. The output is computed using batch matrix multiplication (torch.bmm
). The method returns the output and the difference between the log-likelihoods of the weights and biases under the posterior and prior distributions.
else:
W_mu = self.W_mu.unsqueeze(0).repeat(X.size()[0], 1, 1)
W_p = self.W_p.unsqueeze(0).repeat(X.size()[0], 1, 1)
b_mu = self.b_mu.unsqueeze(0).repeat(X.size()[0], 1)
b_p = self.b_p.unsqueeze(0).repeat(X.size()[0], 1)
# pdb.set_trace()
eps_W = W_mu.data.new(W_mu.size()).normal_()
eps_b = b_mu.data.new(b_mu.size()).normal_()
# sample parameters
std_w = 1e-6 + f.softplus(W_p, beta=1, threshold=20)
std_b = 1e-6 + f.softplus(b_p, beta=1, threshold=20)
W = W_mu + 1 * std_w * eps_W
b = b_mu + 1 * std_b * eps_b
# W = W.view(W.size()[0], -1)
# b = b.view(-1)
# pdb.set_trace()
if self.with_bias:
output = (
torch.bmm(X.view(X.size()[0], 1, X.size()[1]), W).squeeze() + b
) # (batch_size, num_out_featsput)
lqw = isotropic_gauss_loglike(
W, W_mu, std_w
) + isotropic_gauss_loglike(b, b_mu, std_b)
lpw = self.prior.loglike(W) + self.prior.loglike(b)
else:
output = torch.bmm(X.view(X.size()[0], 1, X.size()[1]), W).squeeze()
lqw = isotropic_gauss_loglike(W, W_mu, std_w)
lpw = self.prior.loglike(W)
return output, lqw - lpw
Variational Inference and the ELBO
In Bayesian inference, the goal is to estimate the posterior distribution of the model parameters given the data. However, directly computing the posterior is often intractable for complex models, so we use variational inference as an approximation technique.
Evidence Lower Bound (ELBO) The Evidence Lower Bound (ELBO) is a key concept in variational inference. It provides a way to approximate the true posterior distribution by optimizing a simpler, parameterized distribution. The ELBO is defined as:
\[ELBO = \EX_{q(\theta)}[\log P(X \mid \theta)] - KL(q(\theta) \| P(\theta))\]- where, \(\EX_{q(\theta)}[\log P(X \mid \theta)]\) is the expected log-likelihood of the data under the approximate posterior \(q(\theta)\).
- \(KL(q(\theta) \| P(\theta))\) is the Kullback-Leibler divergence between the approximate posterior \(q(\theta)\) and the prior \(P(\theta)\).
The ELBO serves two main purposes:
- Maximizing the Data Likelihood: By maximizing the first term, we ensure that the model parameters explain the data well.
- Regularizing with the Prior: By minimizing the KL divergence, we ensure that the approximate posterior stays close to the prior, preventing overfitting.
In the BayesLinearNormalDist
class discussed above, the forward method returns (lqw
- lpw
). This term represents the contribution of the KL divergence for the sampled weights and biases. By returning (lqw
− lpw
), the method provides the necessary components to compute the KL divergence term in the ELBO. This term is crucial for ensuring that the approximate posterior does not deviate too much from the prior.
In practice, The loss function to be minimized during training is the negative ELBO.
\[Loss = -ELBO = -(\EX_{q(\theta)}[\log P(X \mid \theta)] - KL(q(\theta) \| P(\theta)))\]Mixture of Gaussian Prior
The isotropic_mixture_gauss_prior
class below defines a prior distribution that is a mixture of two isotropic Gaussian distributions. This means that the prior distribution for the weights and biases is not a single Gaussian, but a weighted combination of two Gaussians with different means and standard deviations. This type of prior can capture more complex prior beliefs about the parameters.
In the prior_class
argument in BayesLinearNormalDist
class, the mixture of gaussian priors is considered. Let’s break down this code:
- Parameters:
-
mu1
,sigma1
: Mean and standard deviation of the first Gaussian component. -
mu2
,sigma2
: Mean and standard deviation of the second Gaussian component. -
pi
: Mixing coefficient for the first Gaussian component (probability that a sample comes from the first Gaussian). The second component’s weight is 1 -pi
.
-
– Precomputed Terms: - cte_term
: Constant term from the Gaussian log-likelihood. - det_sig_term1
, det_sig_term2
: Logarithms of the standard deviations for the two Gaussian components.
- Distance Terms:
-
dist_term1
: Distance term for the first Gaussian, measuring how farx
is frommu1
scaled bysigma1
. -
dist_term2
: Distance term for the second Gaussian, measuring how farx
is frommu2
scaled bysigma2
.
-
- Mixture Log-Likelihood:
- The log-likelihood is computed as the log of a weighted sum of the exponentiated terms from both Gaussian components.
- If
do_sum
isTrue
, the log-likelihoods are summed over all elements. Otherwise, the mean is taken.
Mathematical Explanation For each sampled parameter \(x\) (weights, or biases), the log likelihood under the mixture prior is computed as:
\[\log P(x) = \log(\pi_1\dot\exp(\frac{-(x-\mu_1)^2}{2\sigma_1^2}) + (1 - \pi_1)\dot\exp(\frac{-(x-\mu_2)^2}{2\sigma_2^2}))\]where,
- \(\pi_1\), and \(1 - \pi_1\) are the mixing coefficients.
- \(\mu_1, \sigma_1, \mu_2, \sigma_2\) are the means and standard deviations of two Gaussian components.
Significance in variational Inference: The log-likelihood under the prior lpw
is crucial for computing the KL divergence term in the ELBO.
The isotropic_mixture_gauss_prior
class defines a prior as a mixture of two Gaussians, and its loglike method computes the log-likelihood of parameters under this mixture prior. In the BayesLinearNormalDist
class, this log-likelihood is used to compute the KL divergence term in the ELBO, balancing the fit of the model to the data with the regularization imposed by the prior.
class isotropic_mixture_gauss_prior(object):
def __init__(self, mu1=0, mu2=0, sigma1=0.1, sigma2=1.5, pi=0.5):
self.mu1 = mu1
self.sigma1 = sigma1
self.mu2 = mu2
self.sigma2 = sigma2
self.pi1 = pi
self.pi2 = 1 - pi
self.cte_term = -(0.5) * np.log(2 * np.pi)
self.det_sig_term1 = -np.log(self.sigma1)
self.det_sig_term2 = -np.log(self.sigma2)
def loglike(self, x, do_sum=True):
dist_term1 = -(0.5) * ((x - self.mu1) / self.sigma1) ** 2
dist_term2 = -(0.5) * ((x - self.mu2) / self.sigma2) ** 2
if do_sum:
return (
torch.log(
self.pi1
* torch.exp(self.cte_term + self.det_sig_term1 + dist_term1)
+ self.pi2
* torch.exp(self.cte_term + self.det_sig_term2 + dist_term2)
)
).sum()
else:
return (
torch.log(
self.pi1
* torch.exp(self.cte_term + self.det_sig_term1 + dist_term1)
+ self.pi2
* torch.exp(self.cte_term + self.det_sig_term2 + dist_term2)
)
).mean()