+++
title = 'Probability'
template = 'page-math.html'
+++
# Probability

## Probability basics

What even is probability?

-   Frequentism: probability is only a property of repeated experiments
-   Bayesianism: probability is an expression of our uncertainty and of our
    beliefs

### Probability theory

Definitions:

-   sample space: the possible outcomes, can be discrete or continuous
    (like real numbers)
-   event space: the set of things that have a probability (subsets of the
    sample space)
-   random variable: a way to describe events, takes values with some
    probability
    -   the notation P(X = x) = 0.2 means that X takes the value x with
        probability 0.2
-   for random variables X and Y:
    -   joint probability P(X, Y): gives the probability of each atomic
        event (a specified single value for each random variable)
    -   marginal probability: what you get if you sum a row/column of the joint
        distribution (also called "marginalizing out" a variable)
    -   conditional probability P(X | Y): the probability of X given Y,
        i.e. the probability over one variable if another variable is
        known
-   independence: X and Y are independent if P(X, Y) = P(X) P(Y)
-   conditional independence:
    -   equivalently, X and Y are independent if P(X | Y) = P(X)
    -   X and Y are conditionally independent given Z if P(X, Y | Z) =
        P(X|Z) P(Y|Z)

Identities:

-   $P(x | y) = \frac{P(x \cap y)}{P(y)} = \frac{P(x,y)}{P(y)} = \frac{P(y | x) P(x)}{P(y)}$
-   $P(x \cup y) = P(x) + P(y) - P(x \cap y)$

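To make these definitions and identities concrete, here is a small sketch (the joint table is made up for the example) that computes marginals and conditionals from a joint distribution and checks Bayes' rule and independence numerically:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) for binary X (rows) and Y (columns).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

# Marginals: sum out the other variable ("marginalizing out").
p_x = joint.sum(axis=1)   # P(X)
p_y = joint.sum(axis=0)   # P(Y)

# Conditional P(X | Y=0): joint column divided by the marginal P(Y=0).
p_x_given_y0 = joint[:, 0] / p_y[0]

# Bayes' rule: P(X | Y=0) = P(Y=0 | X) P(X) / P(Y=0).
p_y0_given_x = joint[:, 0] / p_x
assert np.allclose(p_x_given_y0, p_y0_given_x * p_x / p_y[0])

# Independence check: X and Y are independent iff P(X, Y) = P(X) P(Y) everywhere.
print(np.allclose(joint, np.outer(p_x, p_y)))  # False for this table
```
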
Maximum likelihood estimation:
$\hat{\theta} = \argmax_{\theta} P(X | \theta)$

Fitting a normal distribution:

$$\begin{aligned}
    \hat{\mu}, \hat{\sigma}  &= \argmax_{\mu, \sigma} P(X^1, X^2, ... | \mu, \sigma) \\
                            &= \argmax_{\mu, \sigma} \prod_i N(X^i | \mu, \sigma) \\
\end{aligned}$$

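A minimal sketch of this fit (assuming the data really comes from one Gaussian, numpy only): the maximum-likelihood estimates turn out to be the sample mean and the 1/n sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # synthetic data

# Maximum-likelihood estimates for a 1D Gaussian:
mu_hat = x.mean()            # the arithmetic mean
sigma_hat = x.std(ddof=0)    # 1/n variance, not the 1/(n-1) unbiased version

print(mu_hat, sigma_hat)     # close to the true values 2.0 and 1.5
```
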
Probabilistic classifiers return a probability over all classes, given
features.

## (Naive) Bayesian classifiers

This is a generative classifier -- learn P(X|Y) and P(Y), apply Bayes'
rule.

Choose the class y that maximises P(y|x) -- the probability of the class given
the data. Then expand using Bayes' rule. The denominator doesn't affect which
class gets the highest probability, so just fit models to P(x|y) and P(y)
to maximise the quantity P(x|y)P(y).

$$\begin{aligned}
c(x)    &= \argmax_{y \in \{pos,neg\}}P(y|x) \\
        &= \argmax_{y \in \{pos,neg\}}\frac{P(x|y) P(y)}{P(x)} \\
        &= \argmax_{y \in \{pos,neg\}}P(x|y)P(y)
\end{aligned}$$


Bayes classifier:

-   choose a probability distribution M (e.g. multivariate normal)
-   fit $M_{pos}$ to all positive points: $P(X=x | pos) = M_{pos}(x)$
-   fit $M_{neg}$ to all negative points: $P(X=x | neg) = M_{neg}(x)$
-   estimate P(Y) from class frequencies in the training data, or from
    domain-specific information

Naive Bayes:

-   assume independence between all features, conditional on the class:
    $P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)$
-   but, if a particular value doesn't occur in the data, we estimate its
    probability to be 0. And since the whole probability estimate is a long
    product, if one factor becomes zero, everything becomes zero.

Laplace smoothing:

-   for each possible value, add an instance where all features have
    that value (e.g. one row with all trues and one row with all falses)
-   avoids collapses due to zero values (see the sketch below)

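A small sketch of a naive Bayes classifier for binary features with Laplace smoothing (the toy data is invented; smoothing is done as add-one counts, which per feature has the same effect as adding one all-true and one all-false row per class):

```python
import numpy as np

# Toy binary dataset: rows are points, columns are features; y is the class (0/1).
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

def fit_naive_bayes(X, y):
    """Estimate P(y) and P(x_j = 1 | y) with Laplace (add-one) smoothing."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {c: (X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)
                   for c in classes}
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return the class maximising log P(y) + sum_j log P(x_j | y)."""
    scores = {}
    for c in priors:
        p = likelihoods[c]
        log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
        scores[c] = np.log(priors[c]) + log_lik
    return max(scores, key=scores.get)

priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0]), priors, likelihoods))  # predicts class 1
```
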
## Logistic "regression" (classifier)

A discriminative classifier: learn a function for P(Y|X) directly.

The logistic sigmoid:
$\sigma(t) = \frac{1}{1+e^{-t}} = \frac{e^t}{1+e^t}$

-   also, $1-\sigma(t) = \sigma(-t)$
-   squashes results into the interval \[0,1\]

Classifier: compute a linear function, apply the logistic sigmoid to the result:
$c(x) = \sigma(w \cdot x + b)$

Loss function: log loss ($-\log{P(class | features)}$)

-   maximum likelihood objective: find the classifier q that maximises the
    probability of the true classes
-   points near the decision boundary get more influence than points far
    away (least squares does the opposite)
-   also sometimes called "cross-entropy loss"

$$\begin{aligned}
  \argmax_q \prod_{C,x}q_x(C) &= \argmax_{q}\log{\prod_{C,x}q_x(C)} \\
                    &= \argmin_{q}-\log{\prod_{C,x} q_x (C)} \\
                    &= \argmin_q \sum_{C,x} - \log{q_x (C)} \\
                    &= \argmin_q - \sum_{x \in X_P} \log{q_x (P)} - \sum_{x \in X_N} \log{q_x (N)}
\end{aligned}$$

where:

-   x: some data point
-   $q_x$: our classifier, $q_x(C) = q(C|x)$

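To make the loss concrete, here is a small sketch (weights and data are invented for the example) that applies the sigmoid to a linear function and computes the log loss over a toy dataset:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy data: two features per point, labels 1 (positive) and 0 (negative).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, 0, 0])

# Hypothetical parameters of the linear function.
w = np.array([0.8, 0.3])
b = -0.1

q = sigmoid(X @ w + b)   # q_x(Pos) for every point

# Log loss: -sum of log q_x(Pos) over positives and log q_x(Neg) over negatives.
log_loss = -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))
print(log_loss)
```
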
Problem: if the classes are linearly well separable, there are many
suitable classifiers, and logistic regression has no reason to prefer
one over the other.

## Information theory

The relation between encoding information and probability theory.

Prefix-free trees assign a prefix-free code to a set of outcomes. The benefit is
that no delimiters are necessary in the bit/codeword string.

Arithmetic coding: if we allow L(x) (the length of the code for x) to take
non-integer values, we can equate codes with probability distributions.

Entropy of a distribution: the expected codelength of an element sampled from
that distribution.

$$\begin{aligned}
    H(p) &= E_p L(x) \\
         &= \sum_{x \in X} P(x)L(x) \\
         &= - \sum_{x \in X} P(x) \log{P(x)}
  \end{aligned}$$

Cross entropy: the expected codelength if we use q, but the data comes from p.

$H(p, q) = E_p L^q(x) = - \sum_{x \in X} p(x) \log{q(x)}$

Kullback-Leibler divergence: the expected difference in codelength between p
and q; in other words, the difference in expected codelength.

$KL(p,q) = H(p,q) - H(p) = - \sum_{x \in X} p(x) \log{\frac{q(x)}{p(x)}}$

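A quick numerical sketch of these three quantities for two made-up discrete distributions (base-2 logs, so the units are bits):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # distribution whose code we actually use

entropy = -np.sum(p * np.log2(p))          # H(p): expected codelength of the optimal code
cross_entropy = -np.sum(p * np.log2(q))    # H(p, q): expected codelength using q's code
kl = cross_entropy - entropy               # KL(p, q) >= 0, zero iff p == q

print(entropy, cross_entropy, kl)          # 1.75, 2.0, 0.25
```
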
### Maximum likelihood
The maximum likelihood model is the model under which the observed data has the highest probability, i.e. the model that is most suitable given the observed data.

(Log) likelihood: what we maximise to fit a probability model

Loss: what we minimise to find a machine learning model


### Normal distributions (Gaussians)
#### 1D normal distribution (Gaussian)
Has a mean μ and standard deviation σ.

Not a probability function, but a probability _density_ function. The only things on the graph that have a probability are intervals, so to find a probability, you integrate the probability density function over an interval.

Definition: $N(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp{\Big[ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big]}$

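For example, P(-1 ≤ X ≤ 1) is the integral of the density over that interval. A small numerical sketch (plain numpy, trapezoidal rule; the interval is chosen just for illustration):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """The 1D Gaussian density from the definition above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 0.0, 1.0
xs = np.linspace(-1.0, 1.0, 10_001)
pdf = normal_pdf(xs, mu, sigma)

# P(-1 <= X <= 1): integrate the density over the interval (≈ 0.6827 for one stdev).
prob = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(xs))
print(prob)
```
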
Maximum likelihood for the mean:

$\begin{aligned}
         \argmax_{\theta} \log{p(x | \theta)} &= \argmax_{\theta} \ln{\prod_{x \in X} p(x|\theta)} \\
                                              &= \argmax_{\theta} \sum_{x}{\ln{p(x|\theta)}}                                                                                  &&\text{(because a product inside a log is a sum outside the log)}\\
                                              &= \argmax_{\mu, \sigma} \sum_{x}\ln{\frac{1}{\sqrt{2\pi\sigma^2}}} \exp \Big[ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big] &&\text{(fill in the formula)}\\
                                              &= \argmax_{\mu, \sigma} \sum_{x}\ln{\frac{1}{\sqrt{2\pi\sigma^2}}} - \frac{1}{2\sigma^2} (x-\mu)^2 \\
\frac{\partial \ln P(x|\theta)}{\partial \mu} &= \sum_{x} \frac{\partial \big[ \ln{\frac{1}{\sqrt{2\pi\sigma^2}}} - \frac{1}{2\sigma^2} (x-\mu)^2 \big]}{\partial \mu}        &&\text{(because we want to maximise it)}\\
                                              &= -\frac{1}{2\sigma^2} \sum_{x} \frac{\partial (x-\mu)^2}{\partial \mu} \\
                                              &= \frac{1}{\sigma^2} \sum_{x} (x-\mu) \\
          \frac{1}{\sigma^2} \sum_{x} (x-\mu) &= 0                                                                                                                            &&\text{(because the max/min is where the derivative is 0)} \\
                             \sum_{x} (x-\mu) &= 0 \\
                          -\mu n + \sum_{x} x &= 0 \\
                                          \mu &= \frac{1}{n} \sum_{x} x                                                                                                       &&\text{(i.e. the arithmetic mean)}
\end{aligned}$

The implication is that the maximum likelihood estimator for the mean of a normal distribution is the mean of the data.

#### Regression with Gaussian errors
For a regression $y = x^{T} w + b + E$, where $E \sim N(0, \sigma)$.

If we want to maximise the likelihood of the parameters of the line, given some data:

$\begin{aligned}
\argmax_{w,b} P(Y|X,w,b) &= \argmax_{w,b} \ln{\prod_{i} N(y_i | x_{i}^{T} w + b, \sigma)} \\
                         &= \argmax_{w,b} \sum_{i} \ln{\frac{1}{\sqrt{2\pi\sigma^2}}} \exp \Big[ -\frac{1}{2\sigma^2} (x_{i}^{T} w + b - y_i)^2 \Big]  &&\text{(just fill in the formula)}\\
                         &= \argmax_{w,b} -\sum_{i} \frac{1}{2\sigma^2} (x_{i}^{T} w + b - y_i)^2                                                      &&\text{(because the ln doesn't matter for argmax)}\\
                         &= \argmax_{w,b} -\frac{1}{2} \sum_{i} (x_{i}^{T} w + b - y_i)^2                                                              &&\text{(because the stdev doesn't impact the result)}\\
                         &= \argmin_{w,b} \frac{1}{2} \sum_{i} (x_{i}^{T} w + b - y_i)^2                                                               &&\text{(which is the least squares function)}\\
\end{aligned}$

So that's why least squares implicitly assumes normally distributed errors.
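
A quick numerical check of this equivalence (synthetic 1D data; the least-squares fit comes from `np.polyfit`, and perturbing it only increases the Gaussian negative log-likelihood):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 2.0 * x + 0.5 + rng.normal(0, 1.0, size=200)   # true line plus Gaussian noise

def neg_log_likelihood(w, b, sigma=1.0):
    """Negative Gaussian log-likelihood of the residuals, up to the constant term."""
    r = x * w + b - y
    return np.sum(r ** 2) / (2 * sigma ** 2)

w_ls, b_ls = np.polyfit(x, y, deg=1)               # least-squares fit

# The least-squares parameters also give the smallest negative log-likelihood.
print(neg_log_likelihood(w_ls, b_ls))
print(neg_log_likelihood(w_ls + 0.1, b_ls))        # any perturbation is worse
```
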
#### n-D normal distribution (multivariate Gaussian)
The formula: $N(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma |}} \exp \Big[ -\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu) \Big]$

#### Gaussian mixture model
Basically, combine Gaussians to represent more complex shapes.

Example with three components:
- three components: N(μ₁, Σ₁), N(μ₂, Σ₂), N(μ₃, Σ₃)
- three weights: w₁, w₂, w₃ with $\sum w_{i} = 1$

Maximum likelihood:
$\argmax_{w_{i}, \mu_{i}, \Sigma_{i}} \sum_{x} {\ln \sum_{i}{ w_{i} N(x | \mu_{i}, \Sigma_{i})}}$
### Expectation-maximisation
Finding the maximum likelihood is hard if there are hidden (unobserved) variables that affect the variables in the dataset -- for example, if the hidden variables are the component assignments of a mixture model (you don't know which component each point came from). EM can be used to fit _any_ hidden variable model.

Key insight: we can't optimise both θ and z at once, but given some θ, we can compute P(z|x), and given z, we can optimise θ.

Intuition:
1. Initialize the components randomly
2. Loop:
    - expectation: assign soft responsibilities to each point, i.e. points "belong" to each Gaussian "to some degree"; each Gaussian takes a certain _responsibility_ for each point.
    - maximisation: fit the components to the data, weighted by responsibility.

Definition of "responsibility": $r_{x}^{i} = \frac{P(z=i | x)}{\sum_{j} P(z=j | x)}$

Model parameters, given responsibilities (see the code sketch after this list):
- $n_i = \sum_{x} r_{x}^{i}$
- $\mu_i = \frac{1}{n_i} \sum_{x} r_{x}^{i} x$
- $\Sigma_i = \frac{1}{n_i} \sum_{x} r_{x}^{i} (x-\mu_i) (x-\mu_i)^{T}$
- $w_i = \frac{n_i}{n}$
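
A compact sketch of EM for a Gaussian mixture on synthetic 2D data, following the responsibility and parameter updates above (scipy is assumed only for the multivariate normal density; the data and initialisation are made up for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Synthetic 2D data from two clusters, so we know what EM should recover.
data = np.vstack([rng.normal([0, 0], 0.5, size=(150, 2)),
                  rng.normal([3, 3], 0.8, size=(150, 2))])
n, d = data.shape
k = 2

# 1. Initialise the components randomly.
mus = data[rng.choice(n, size=k, replace=False)]
sigmas = np.stack([np.eye(d)] * k)
ws = np.full(k, 1.0 / k)

for _ in range(50):
    # Expectation: responsibilities r_x^i, normalised over components.
    dens = np.stack([w * multivariate_normal.pdf(data, m, S)
                     for w, m, S in zip(ws, mus, sigmas)], axis=1)   # shape (n, k)
    r = dens / dens.sum(axis=1, keepdims=True)

    # Maximisation: refit each component, weighted by its responsibilities.
    n_i = r.sum(axis=0)                    # n_i = sum_x r_x^i
    mus = (r.T @ data) / n_i[:, None]      # mu_i
    for i in range(k):
        diff = data - mus[i]
        sigmas[i] = (r[:, i, None] * diff).T @ diff / n_i[i]   # Sigma_i
    ws = n_i / n                           # w_i

print(mus)   # close to the true cluster centres (0, 0) and (3, 3)
```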

![Expectation and maximization](7ba36b211f204ba187d79d53fd4e6a97.png)