+++
title = 'Probability'
template = 'page-math.html'
+++
# Probability

## Probability basics

What even is probability?

- Frequentism: probability is only a property of repeated experiments
- Bayesianism: probability is an expression of our uncertainty and of our beliefs

### Probability theory

Definitions:

- sample space: the possible outcomes, can be discrete or continuous (like real numbers)
- event space: the set of things that have a probability (subsets of the sample space)
- random variable: a way to describe events, takes values with some probability
    - notation P(X = x) = 0.2 means that X takes the value x with probability 0.2
- for random variables X and Y:
    - joint probability P(X, Y): gives the probability of each atomic event (a specified single value for each random variable)
    - marginal probability: what you get by summing a row/column of the joint distribution (also called "marginalizing out" a variable)
    - conditional probability P(X | Y): the probability of X given Y, i.e. the probability over one variable if another variable is known
- independence: X and Y are independent if P(X, Y) = P(X) P(Y), or equivalently if P(X | Y) = P(X)
- conditional independence: X and Y are conditionally independent given Z if P(X, Y | Z) = P(X|Z) P(Y|Z)

Identities:

- $P(x | y) = \frac{P(x \cap y)}{P(y)} = \frac{P(x, y)}{P(y)} = \frac{P(y | x) P(x)}{P(y)}$
- $P(x \cup y) = P(x) + P(y) - P(x \cap y)$

Maximum likelihood estimation:
$\hat{\theta} = \argmax_{\theta} P(X | \theta)$

Fitting a normal distribution:

$$\begin{aligned}
\hat{\mu}, \hat{\sigma} &= \argmax_{\mu, \sigma} P(X^1, X^2, \dots | \mu, \sigma) \\
&= \argmax_{\mu, \sigma} \prod_i N(X^i | \mu, \sigma) \\
\end{aligned}$$

Probabilistic classifiers return a probability over all classes, given features.

## (Naive) Bayesian classifiers

This is a generative classifier -- learn P(X|Y) and P(Y), apply Bayes' rule.

Choose the class y that maximises P(y|x) -- the probability of the class given the data. Then expand using Bayes' rule. The denominator doesn't affect which class gets the highest probability, so it's enough to fit models to P(x|y) and P(y) and maximise the quantity P(x|y)P(y).

$$\begin{aligned}
c(x) &= \argmax_{y \in \{pos,neg\}} P(y|x) \\
&= \argmax_{y \in \{pos,neg\}} \frac{P(x|y) P(y)}{P(x)} \\
&= \argmax_{y \in \{pos,neg\}} P(x|y) P(y)
\end{aligned}$$

Bayes classifier:

- choose a probability distribution M (e.g. multivariate normal)
- fit M<sub>pos</sub> to all positive points: P(X=x | pos) = M<sub>pos</sub>(x)
- fit M<sub>neg</sub> to all negative points: P(X=x | neg) = M<sub>neg</sub>(x)
- estimate P(Y) from class frequencies in the training data, or from domain-specific information

Naive Bayes:

- assume independence between all features, conditional on the class: $P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)$
- but if a particular feature value never occurs with a class in the training data, we estimate its probability to be 0, and since the whole probability estimate is a long product, one zero factor makes everything zero

Laplace smoothing:

- for each possible value, add an instance where all features have that value (e.g. one row with all trues and one row with all falses)
- avoids collapses due to zero values (see the sketch below)
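To make the naive Bayes recipe and the effect of Laplace smoothing concrete, here is a minimal sketch (not from the original notes) for binary features and two classes; the names `fit_naive_bayes`, `predict` and the pseudocount `alpha` are my own choices.

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Fit naive Bayes for binary features X (n x d) and binary labels y (0/1).
    alpha=1.0 is Laplace smoothing: pretend we saw one extra all-true row and
    one extra all-false row per class, so no probability is ever exactly 0."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),                            # P(Y = c)
            "theta": (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha),  # P(feature=1 | Y=c)
        }
    return params

def predict(params, x):
    """Choose the class maximising log P(x|y) + log P(y)."""
    scores = {}
    for c, p in params.items():
        log_lik = np.sum(x * np.log(p["theta"]) + (1 - x) * np.log(1 - p["theta"]))
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)

# tiny made-up example
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
print(predict(fit_naive_bayes(X, y), np.array([1, 0])))  # -> 1
```

Working in log space turns the long product of per-feature probabilities into a sum, which also avoids numerical underflow.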
## Logistic "regression" (classifier)

A discriminative classifier: learn a function for P(Y|X) directly.

The logistic sigmoid:
$\sigma(t) = \frac{1}{1+e^{-t}} = \frac{e^t}{1+e^t}$

- also, $1-\sigma(t) = \sigma(-t)$
- maps results into the interval \[0,1\]

Classifier: compute a linear function, apply the logistic sigmoid to the result:
$c(x) = \sigma(w \cdot x + b)$

Loss function: log loss ($-\log{P(class | features)}$)

- maximum likelihood objective: find the classifier q that maximises the probability of the true classes
- points near the decision boundary get more influence than points far away (least squares does the opposite)
- also sometimes called "cross-entropy loss"

$$\begin{aligned}
\argmax_q \prod_{C,x} q_x(C) &= \argmax_{q} \log{\prod_{C,x} q_x(C)} \\
&= \argmin_{q} -\log{\prod_{C,x} q_x (C)} \\
&= \argmin_q \sum_{C,x} - \log{q_x (C)} \\
&= \argmin_q - \sum_{x \in X_P} \log{q_x (P)} - \sum_{x \in X_N} \log{q_x (N)}
\end{aligned}$$

where:

- x: some data point
- q<sub>x</sub>: our classifier, with q<sub>x</sub>(C) = q(C|x)

Problem: if the classes are linearly well separable, there are many suitable classifiers, and logistic regression has no reason to prefer one over the other.
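As an illustration of the log-loss objective (not part of the original notes), here is a small sketch that trains $c(x) = \sigma(w \cdot x + b)$ by gradient descent; the tiny dataset, learning rate and step count are made-up values.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_logistic(X, y, lr=0.1, steps=1000):
    """Minimise the summed log loss  -sum_i [ y_i log q_i + (1-y_i) log(1-q_i) ]
    where q_i = sigmoid(w . x_i + b), using plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        q = sigmoid(X @ w + b)   # predicted P(class=1 | x)
        grad = q - y             # derivative of the log loss w.r.t. the linear output
        w -= lr * X.T @ grad
        b -= lr * grad.sum()
    return w, b

# one feature; class 1 for larger values
X = np.array([[0.5], [1.0], [2.5], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = train_logistic(X, y)
print(sigmoid(np.array([0.7, 2.8]) * w + b))  # low probability for 0.7, high for 2.8
```

Summing $-\log q_x(C)$ over the data points is exactly the last line of the derivation above, and the gradient of that loss with respect to the linear output is simply (prediction minus label).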
## Information theory

The relation between encoding information and probability theory.

Prefix-free trees assign a prefix-free code to a set of outcomes. The benefit is that no delimiters are necessary in the bit/codeword string.

Arithmetic coding: if we allow L(x) (the length of the code for x) to take non-integer values, we can equate codes with probability distributions.

Entropy of a distribution: the expected codelength of an element sampled from that distribution.

$$\begin{aligned}
H(p) &= E_p L(x) \\
&= \sum_{x \in X} P(x) L(x) \\
&= - \sum_{x \in X} P(x) \log{P(x)}
\end{aligned}$$

Cross entropy: the expected codelength if we use q, but the data comes from p.

$H(p, q) = E_p L^q(x) = - \sum_{x \in X} p(x) \log{q(x)}$

Kullback-Leibler divergence: the expected difference in codelength between p and q. In other words, the difference in expected codelength.

$KL(p,q) = H(p,q) - H(p) = - \sum_{x \in X} p(x) \log{\frac{q(x)}{p(x)}}$

### Maximum likelihood
The maximum likelihood model is the model under which the observed data has the highest probability, i.e. it selects the model that is most suitable given the observed data.

(Log) likelihood: what we maximise to fit a probability model

Loss: what we minimise to find a machine learning model


### Normal distributions (Gaussians)
#### 1D normal distribution (Gaussian)
Has a mean μ and standard deviation σ.

Not a probability function, but a probability _density_ function. The only things on the graph that have a probability are intervals, so to find a probability, you integrate over the probability density function.

Definition: $N(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp \Big[ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big]$

Maximum likelihood for the mean:

$\begin{aligned}
\argmax_{\theta} \log{p(x | \theta)} &= \argmax_{\theta} \ln{\prod_{x \in X} p(x|\theta)} \\
&= \argmax_{\theta} \sum_{x} \ln{p(x|\theta)} &&\text{(because a product inside the log is a sum outside the log)}\\
&= \argmax_{\mu, \sigma} \sum_{x} \ln \Big( \frac{1}{\sqrt{2\pi\sigma^2}} \exp \Big[ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big] \Big) &&\text{(fill in the formula)}\\
&= \argmax_{\mu, \sigma} \sum_{x} \ln{\frac{1}{\sqrt{2\pi\sigma^2}}} - \frac{1}{2\sigma^2} (x-\mu)^2 \\
\frac{\partial \ln P(x|\theta)}{\partial \mu} &= \sum_{x} \frac{\partial \big[ \ln{\frac{1}{\sqrt{2\pi\sigma^2}}} - \frac{1}{2\sigma^2} (x-\mu)^2 \big]}{\partial \mu} &&\text{(because we want to maximise it)}\\
&= -\frac{1}{2\sigma^2} \sum_{x} \frac{\partial (x-\mu)^2}{\partial \mu} \\
&= \frac{1}{\sigma^2} \sum_{x} (x-\mu) \\
\frac{1}{\sigma^2} \sum_{x} (x-\mu) &= 0 &&\text{(because the max/min is where the derivative is 0)} \\
\sum_{x} (x-\mu) &= 0 \\
-\mu n + \sum_{x} x &= 0 \\
\mu &= \frac{1}{n} \sum_{x} x &&\text{(i.e. the arithmetic mean)}
\end{aligned}$

The implication is that the maximum likelihood estimator for the mean of a normal distribution is the mean of the data.

#### Regression with Gaussian errors
For a regression $y = x^{T} w + b + E$, where $E \sim N(0, \sigma)$.

If we want to maximise the likelihood of the parameters of the line, given some data:

$\begin{aligned}
\argmax_{w,b} P(Y|X,w,b) &= \argmax_{w,b} \ln{\prod_{i} N(y_i | x_{i}^{T} w + b, \sigma)} \\
&= \argmax_{w,b} \sum_{i} \ln \Big( \frac{1}{\sqrt{2\pi\sigma^2}} \exp \Big[ -\frac{1}{2\sigma^2} (x_{i}^{T} w + b - y_i)^2 \Big] \Big) &&\text{(just fill in the formula)}\\
&= \argmax_{w,b} -\sum_{i} \frac{1}{2\sigma^2} (x_{i}^{T} w + b - y_i)^2 &&\text{(because the additive ln term doesn't matter for the argmax)}\\
&= \argmax_{w,b} -\frac{1}{2} \sum_{i} (x_{i}^{T} w + b - y_i)^2 &&\text{(because the stdev doesn't impact the result)}\\
&= \argmin_{w,b} \frac{1}{2} \sum_{i} (x_{i}^{T} w + b - y_i)^2 &&\text{(which is the least squares function)}\\
\end{aligned}$

So that's why least squares assumes a normal distribution.
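As a quick numeric check of the maximum-likelihood derivation above (not part of the original notes), the sketch below evaluates the Gaussian log-likelihood of a made-up 1-D sample over a grid of candidate means and confirms that it peaks at the arithmetic mean; the sample, grid and fixed σ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.5, size=1000)   # made-up 1-D sample

def gaussian_log_lik(data, mu, sigma):
    """Sum of ln N(x | mu, sigma) over the data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

# evaluate the log-likelihood on a grid of candidate means
grid = np.linspace(2.0, 4.0, 2001)
best_mu = grid[np.argmax([gaussian_log_lik(x, m, 1.5) for m in grid])]
print(best_mu, x.mean())   # the two values agree (up to the grid resolution)
```

The same log-likelihood evaluated on regression residuals is, up to constants, the negative sum of squared errors, which is the least-squares connection made above.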
#### n-D normal distribution (multivariate Gaussian)
The formula: $N(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma |}} \exp \Big[ -\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu) \Big]$

#### Gaussian mixture model
Basically, combine Gaussians to represent more complex shapes.

Example with three components:
- three components: N(μ₁, Σ₁), N(μ₂, Σ₂), N(μ₃, Σ₃)
- three weights: w₁, w₂, w₃ with $\sum w_{i} = 1$

Maximum likelihood:
$\argmax_{w_{i}, \mu_{i}, \Sigma_{i}} \sum_{x} {\ln \sum_{i}{ w_{i} N(x | \mu_{i}, \Sigma_{i})}}$

### Expectation-maximisation
Finding the maximum likelihood is hard if there are hidden (unobserved) variables that affect the variables that are in the dataset -- for example, the component assignments in a mixture model (you don't know which component generated each point). Expectation-maximisation can be used to fit _any_ hidden variable model.

Key insight: we can't optimise both θ and z at once, but given some θ we can compute P(z|x), and given z we can optimise θ.

Intuition:
1. Initialize the components randomly
2. loop:
    - expectation: assign soft responsibilities to each point, i.e., points "belong" to each Gaussian "to some degree"; each Gaussian takes a certain _responsibility_ for each point.
    - maximisation: fit the components to the data, weighted by responsibility.

Definition of "responsibility": $r_{x}^{i} = P(z=i \mid x) = \frac{w_i \, N(x \mid \mu_i, \Sigma_i)}{\sum_{j} w_j \, N(x \mid \mu_j, \Sigma_j)}$

Model parameters, given the responsibilities:
- $n_i = \sum_{x} r_{x}^{i}$
- $\mu_i = \frac{1}{n_i} \sum_{x} r_{x}^{i} x$
- $\Sigma_i = \frac{1}{n_i} \sum_{x} r_{x}^{i} (x-\mu_i) (x-\mu_i)^{T}$
- $w_i = \frac{n_i}{n}$

![Expectation and maximization](7ba36b211f204ba187d79d53fd4e6a97.png)
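To tie the E and M steps to the formulas above, here is a minimal sketch of EM for a Gaussian mixture (not from the original notes); the helper `gaussian_pdf`, the initialisation scheme and the fixed iteration count are my own choices, and no numerical safeguards are included.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) for each row x of X."""
    d = X.shape[1]
    diff = X - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    mahalanobis = np.einsum("nd,de,ne->n", diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * mahalanobis) / norm

def em_gmm(X, k, iters=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # initialise: random data points as means, identity covariances, uniform weights
    mu = X[rng.choice(n, k, replace=False)]
    Sigma = np.stack([np.eye(d)] * k)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: responsibilities r[x, i] = w_i N(x|mu_i, Sigma_i) / sum_j w_j N(x|mu_j, Sigma_j)
        r = np.column_stack([w[i] * gaussian_pdf(X, mu[i], Sigma[i]) for i in range(k)])
        r /= r.sum(axis=1, keepdims=True)
        # M step: refit each component, weighting every point by its responsibility
        n_i = r.sum(axis=0)                      # n_i = sum_x r_x^i
        mu = (r.T @ X) / n_i[:, None]            # mu_i = (1/n_i) sum_x r_x^i x
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (r[:, i, None] * diff).T @ diff / n_i[i]   # weighted covariance
        w = n_i / n                              # w_i = n_i / n
    return w, mu, Sigma

# made-up example: two well-separated 2-D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([5, 5], 0.5, (200, 2))])
w, mu, Sigma = em_gmm(X, k=2)
print(w)   # roughly [0.5, 0.5]
print(mu)  # roughly the two cluster centres (0, 0) and (5, 5)
```

Each loop iteration is exactly one expectation step (compute responsibilities) followed by one maximisation step (the $n_i$, $\mu_i$, $\Sigma_i$, $w_i$ updates listed above).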