Probability in Machine Learning

Machine learning is highly data driven. By data driven, we mean that the goal of machine learning is to design general-purpose methodologies that extract valuable patterns from data, ideally without much domain-specific expertise. A machine learning model is said to learn from data if its performance on a given task improves after the data is taken into account. The ultimate goal is to find, by learning from data, good models that generalize well to yet unseen data, which we may care about in the future. Learning can be viewed as a way to automatically find patterns and structure in data by optimizing the model parameters. Training the model means using the available data to optimize some parameters of the model with respect to a utility function that evaluates how well the model predicts the training data. For instance, a training method can be thought of as an approach analogous to climbing a hill to reach its peak; the peak here corresponds to the maximum of some desired performance measure.

What is probability?

In a nutshell, “Probability is the study of uncertainty”. It is a degree of belief about an event occurring. We often quantify (measure) uncertainty in the data, uncertainty in the machine learning model, and uncertainty in the predictions of the model. To quantify uncertainty, we need the idea of a random variable, which is a function that maps the outcomes of random experiments to a set of properties that we are interested in. For example, suppose a coin is flipped two times and we record whether it shows a head or a tail. The sample space is \(S = \left\{ HH, HT, TH, TT \right\}\). There are a total of four possible outcomes, and each of them is equally likely. Instead of considering all the possible outcomes, we can define the random variable \(X\) to be the number of heads obtained in two flips of the coin. Then \(X\) takes on the possible values \(0, 1\), or \(2\). It is the probabilities of these values that we are interested in, i.e., the probabilities of a head coming up zero, one, or two times in two flips of the coin.
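To make this concrete, the distribution of \(X\) can be found by simply enumerating the sample space. Below is a minimal Python sketch (the variable names are illustrative, not from any particular library):

```python
from itertools import product
from collections import Counter

# Enumerate the sample space of two coin flips: {HH, HT, TH, TT}.
sample_space = list(product("HT", repeat=2))

# The random variable X maps each outcome to the number of heads.
X = Counter(outcome.count("H") for outcome in sample_space)

# Each of the four outcomes is equally likely, so divide counts by 4.
for x in sorted(X):
    print(f"p(X = {x}) = {X[x] / len(sample_space)}")
# p(X = 0) = 0.25, p(X = 1) = 0.5, p(X = 2) = 0.25
```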

Notations

  • For any two random variables \(X\) and \(Y\), the probability that \(X=x\), and \(Y=y\) is written as \(p(x, y)\) and is called the joint probability.
  • The marginal probability that \(X\) takes the value \(x\) irrespective of the value of the random variable \(Y\) is written as \(p(x)\).
  • \(X \sim p(x)\) denotes that the random variable \(X\) is distributed according to \(p(x)\).
  • If we consider only the instances where \(X=x\), then the fraction of instances (conditional probability) for which \(Y=y\) is written as \(p(y \mid x)\).

Probability Density Function

A function \(f: \mathbb{R}^D \to \mathbb{R}\) is called a probability density function (pdf) if:

  • \(\forall x \in \mathbb{R}^D : f(x) \ge 0\),
  • Its integral exists, and
  • \(\int_{\mathbb{R}^D}f(x)dx =1\).

Simple Explanation with an Example

Imagine we have a smoothie shop that sells smoothies in cups of any size between 0 and 20 ounces. A PDF for the sizes of smoothies sold would show on a graph how likely it is to sell a smoothie at each size. For example, if most people buy 12-ounce smoothies, the graph would be higher at 12 ounces, indicating this size is more popular or likely to be sold.

(Figure: a probability density function graph)

Key Points

  • Shape of the Graph: The shape tells you which outcomes are more likely. A peak at a particular value means that outcome is very likely.
  • Area Under the Graph: For any segment of the graph, the area under it (from the graph to the horizontal axis) represents the probability of outcomes within that range. For example, the area under the curve between 10 and 15 ounces tells you the likelihood of selling a smoothie that’s between 10 and 15 ounces.
  • Total Area Equals 1: The total area under the entire curve equals 1, which in probability terms means “100% certainty.” This is because some outcome within the possible range must happen.

So, a PDF is essentially a way to visualize probabilities for continuous data, helping us see what outcomes are likely, how likely they are, and the range of possible outcomes.
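As a concrete (and purely assumed) model for the smoothie example, suppose sizes follow a normal distribution centered at 12 ounces with a standard deviation of 3 ounces. The area under the pdf between 10 and 15 ounces is then the probability of a sale in that range; a sketch using SciPy:

```python
from scipy.stats import norm
from scipy.integrate import quad

# Assumed model: smoothie sizes ~ Normal(mean=12 oz, std=3 oz).
# (A real shop capped at 0-20 oz would need a truncated distribution.)
sizes = norm(loc=12, scale=3)

# P(10 <= size <= 15) is the area under the pdf between 10 and 15,
# i.e. the difference of the cumulative distribution function (cdf).
prob = sizes.cdf(15) - sizes.cdf(10)
print(f"P(10 oz <= size <= 15 oz) ≈ {prob:.3f}")  # ≈ 0.589

# Total area under the pdf is 1 (checked numerically here).
total, _ = quad(sizes.pdf, -float("inf"), float("inf"))
print(f"Total area ≈ {total:.3f}")  # ≈ 1.000
```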

Sum rule, Product rule, and Bayes’ Theorem

Given the definitions of marginal and conditional probability distributions, we can present the two fundamental rules of probability theory. The sum rule states that:

\[p(x)= \begin{cases} \sum_{y\in \mathcal{Y}} p(x, y),& \text{if } y \text{ is discrete}\\ \int_\mathcal{Y}p(x, y)dy, & \text{if } y \text{ is continuous} \end{cases} \label{eq:1}\]

where \(\mathcal{Y}\) denotes the states of the target space of the random variable \(Y\). We sum out (or integrate out) the states \(y\) of the random variable \(Y\). The sum rule is also known as the marginalization property; it relates the joint distribution to a marginal distribution.
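For discrete random variables, the sum rule amounts to row or column sums of a joint probability table. Here is a small sketch with made-up numbers:

```python
import numpy as np

# A made-up joint distribution p(x, y) for X with 2 states, Y with 3 states.
# Rows index x, columns index y; all entries sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

# Sum rule: marginalize out Y by summing over its states (columns).
p_x = p_xy.sum(axis=1)   # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)   # p(y) = sum_x p(x, y)

print(p_x)          # [0.4 0.6]
print(p_y)          # [0.35 0.25 0.4]
print(p_xy.sum())   # 1.0
```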

The product rule relates the joint distribution to the conditional distribution via

\[p(x, y)=p(y \mid x)p(x)\label{eq:2}\]

The product rule expresses the fact that every joint distribution of two random variables can be factorized (written as a product) into two other distributions: the marginal distribution of the first random variable, \(p(x)\), and the conditional distribution of the second random variable given the first, \(p(y \mid x)\). The ordering of the random variables \(X\) and \(Y\) is arbitrary, so the product rule can also be written as

\[p(x, y)=p(x \mid y)p(y )\label{eq:3}\]
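The factorization can be checked numerically on the same kind of joint table: dividing the joint by a marginal gives a conditional, and multiplying back recovers the joint. Again, the numbers below are made up for illustration:

```python
import numpy as np

# Made-up joint distribution; rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1, keepdims=True)  # p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # p(y), shape (1, 3)

# Conditionals obtained from the joint:
p_y_given_x = p_xy / p_x               # p(y | x) = p(x, y) / p(x)
p_x_given_y = p_xy / p_y               # p(x | y) = p(x, y) / p(y)

# Both factorizations reconstruct the same joint:
assert np.allclose(p_y_given_x * p_x, p_xy)   # p(x, y) = p(y | x) p(x)
assert np.allclose(p_x_given_y * p_y, p_xy)   # p(x, y) = p(x | y) p(y)
```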

In machine learning and Bayesian statistics, we are often interested in making inferences about unobserved random variables given that we have observed other random variables.
The Bayes’ rule or Bayes’ theorem: Assume we have some prior knowledge \(p(x)\) about an unobserved random variable \(x\), and some relationship \(p(y \mid x)\) between \(x\) and a second random variable \(y\), which we can observe. If we observe \(y\), then we can use Bayes’ theorem to draw conclusions about \(x\) given the observed values of \(y\). As a direct consequence of the product rule stated above, Bayes’ theorem states:

\[p(x \mid y)p(y)=p(y \mid x)p(x) \Leftrightarrow p(x \mid y)= \frac{p(y \mid x)p(x)}{p(y)} \label{eq:4}\]

From the above equation, we can deduce the following:

  • \(p(x)\) is the prior, which is the probability of an event occurring before new data is collected or, in other words, subjective knowledge about the unobserved variable \(x\) before observing any data. Any prior can be chosen; however, it is critical to ensure that it has nonzero pdf on all plausible values of \(x\).

  • \(p(y \mid x)\) is the likelihood, which describes how \(x\) and \(y\) are related. The likelihood is not a distribution in \(x\) but in \(y\); we call \(p(y \mid x)\) “the likelihood of \(x\) given \(y\)”.

  • \(p(x \mid y)\) is the posterior which is the quantity of interest in Bayesian statistics. Here it is expressed as what we know about \(x\) having observed \(y\). In Bayesian statistics, the “posterior” is a probability that represents our updated beliefs after we have taken into account new evidence. It comes from Bayes’ Rule, which is a way to revise predictions or hypotheses in light of new data.

    Simple Explanation

    Imagine there’s a 70% chance it will rain today based on the weather forecast (this is our initial belief or “prior”). Then, we look outside and see dark clouds gathering. This new evidence should change our belief about the likelihood of rain. Bayes’ Rule helps us update our beliefs based on new evidence, shifting from our prior belief to our posterior belief.

  • \(p(y)\) is the marginal likelihood/evidence, which ensures that the probabilities normalize correctly so that the posterior is a proper probability distribution. It is calculated by considering not just one possible scenario, but all possible scenarios (or parameter values) under a model and seeing how well these scenarios, on average, explain the observed data. This involves summing or integrating, over all possible parameter values, the product of the likelihood (how probable the data is if a particular parameter value were true) and the prior (how probable the parameter value is before seeing the data). Mathematically, the evidence is represented as:

\[p(y) := \int p(y \mid x)p(x)dx = \mathbb{E}_{X}[p(y \mid x)]\]

In Bayesian statistics, the posterior distribution is the quantity of interest because it represents the updated belief about a parameter after considering both the prior belief and the new evidence provided by the data. In a bigger context, the posterior can be used for decision making under uncertainty, and having the full posterior is extremely useful, as it leads to decisions that are robust to disturbances. Having a full posterior can thus be very useful for downstream tasks.
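Returning to the rain example above, Bayes’ theorem can be applied with a couple of assumed likelihood values (the 0.9 and 0.3 below are invented for illustration, not measured):

```python
# Prior belief: 70% chance of rain (from the forecast).
p_rain = 0.7

# Assumed likelihoods (illustrative numbers, not from any real model):
p_clouds_given_rain = 0.9      # dark clouds are common when it rains
p_clouds_given_no_rain = 0.3   # but can also appear without rain

# Evidence p(clouds), via the sum rule over both scenarios.
p_clouds = (p_clouds_given_rain * p_rain
            + p_clouds_given_no_rain * (1 - p_rain))

# Posterior via Bayes' theorem: p(rain | clouds).
p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
print(f"p(rain | clouds) = {p_rain_given_clouds:.3f}")  # ≈ 0.875
```

Seeing the dark clouds raises the belief in rain from 0.7 (prior) to 0.875 (posterior), exactly the kind of update Bayes’ rule formalizes.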

Expected Value

The concept of expected value is central to machine learning. In probabilistic terms, the expected value is the mean or average of a random variable, incorporating all possible outcomes weighted by their probabilities. Mathematically, the expected value is calculated by summing all possible values that a random variable can take, each multiplied by its probability of occurrence. The expected value of a function \(g:\mathbb{R}\rightarrow \mathbb{R}\) of a univariate continuous random variable \(X \sim p(x)\) is given by:

\[\mathbb{E}_{X}[g(x)] = \int_{\mathcal{X}}g(x)p(x)dx\]

Correspondingly, the expected value of a function \(g\) of a discrete random variable \(X \sim p(x)\) is given by:

\[\mathbb{E}_{X}[g(x)] = \sum_{x\in\mathcal{X}}g(x)p(x)\]

where \(\mathcal{X}\) is the set of possible outcomes of the random variable \(X\). Similarly, for a multivariate random variable \(X\), defined as a finite vector of univariate random variables \([X_1, X_2, \cdots, X_D]^T\), the expected value is defined element-wise as:

\[\begin{align} \mathbb{E}_{X}[g(x)] &= \begin{bmatrix} \mathbb{E}_{X_1}[g(x_1)] \\ \mathbb{E}_{X_2}[g(x_2)] \\ \vdots \\ \mathbb{E}_{X_D}[g(x_D)] \end{bmatrix} \in \mathbb{R}^D \end{align}\]

where the subscript in \(\mathbb{E}_{X_d}\) indicates that we are taking the expected value with respect to the \(d^{\text{th}}\) element of the vector \(x\).
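As a small numeric check of the discrete formula, take \(X\) to be a fair six-sided die and \(g(x) = x^2\):

```python
# Fair six-sided die: each outcome has probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6

# E[g(X)] = sum_x g(x) p(x), here with g(x) = x**2.
expected_g = sum(x**2 * p for x in outcomes)
print(expected_g)  # 15.1666... = 91/6
```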

Mean

The mean of a random variable \(X\) with states \(x \in \mathbb{R}^D\) is an average and is defined as:

\[\begin{align} \mathbb{E}_{X}[x] &= \begin{bmatrix} \mathbb{E}_{X_1}[x_1] \\ \mathbb{E}_{X_2}[x_2] \\ \vdots \\ \mathbb{E}_{X_D}[x_D] \end{bmatrix} \in \mathbb{R}^D \end{align}\]

where

\[\mathbb{E}_{X_d}[x_d]= \begin{cases} \int_{\mathcal{X}} x_d p(x_d)\,dx_d,& \text{if } X \text{ is a continuous random variable}\\ \sum_{x_i\in\mathcal{X}}x_i p(x_d=x_i), & \text{if } X \text{ is a discrete random variable} \end{cases}\]

for \(d=1, \cdots, D\), where the subscript \(d\) indicates the corresponding dimension of \(x\). The integral and sum are over the states \(\mathcal{X}\) of the target space of the random variable \(X\).
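Empirically, the element-wise definition corresponds to averaging each coordinate of the data separately. A sketch with synthetic samples (the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 samples of a 3-dimensional random variable, one sample per row.
samples = rng.normal(loc=[1.0, -2.0, 0.5], scale=1.0, size=(1000, 3))

# Element-wise (empirical) mean: average each dimension separately.
mean = samples.mean(axis=0)
print(mean)  # ≈ [ 1.0, -2.0, 0.5 ], up to sampling noise
```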

Covariance

Covariance is calculated as the expected value (average) of the product of the deviations of two random variables from their respective means. Mathematically, the covariance between two univariate random variables \(X, Y \in \mathbb{R}\), with mean values \(\mathbb{E}_X[x]\) and \(\mathbb{E}_Y[y]\) respectively, can be written as:

\[\text{Cov}_{X,Y}[x,y]:= \mathbb{E}_{X,Y}[(x-\mathbb{E}_X[x])(y-\mathbb{E}_Y[y])]\]

By using the linearity of expectations, the above expression can be rewritten as the expected value of the product minus the product of the expected values:

\[\text{Cov}[x,y]=\mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y]\]

The covariance of a variable with itself, i.e., \(\text{Cov}[x,x]\), is called the variance and is denoted by \(\mathbb{V}_X[x]\). The square root of the variance is called the standard deviation and is denoted by \(\sigma(x)\).

Positive Covariance: Indicates that as \(X\) increases, \(Y\) also tends to increase. This suggests a positive relationship between the variables.

Negative Covariance: Indicates that as \(X\) increases, \(Y\) tends to decrease. This suggests an inverse relationship between the variables.

  • Zero Covariance: Implies that there is no linear relationship between \(X\) and \(Y\). The variables are uncorrelated, though they could still be related in other non-linear ways.
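Finally, the identity \(\text{Cov}[x,y]=\mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y]\) and the sign interpretation above are easy to verify empirically on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y depends positively on x, plus independent noise.
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)

# Covariance from the definition: E[(x - E[x]) (y - E[y])].
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Covariance from the identity: E[xy] - E[x] E[y].
cov_id = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_id)   # both ≈ 2.0 (positive covariance)
print(np.var(x))         # Cov[x, x] = variance ≈ 1.0
```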