+++
title = 'Models for sequential data'
template = 'page-math.html'
+++
# Models for sequential data

## Sequences
They consist of numbers or symbols:
- numeric: 1-dimensional, e.g. a stock price over time. Can also be n-dimensional.
- symbolic (categorical): 1-dimensional, like an English sequence of words/characters. Can also be n-dimensional, with multiple categorical features per timestamp (like sheet music).

We could have one sequence per instance, and try to classify the sequences (like email spam/not spam),
or the whole dataset is one sequence, and the instances are ordered.

Single-sequence feature extraction:
- make it a regression problem: each point is represented by the m values before it
- this gives us a table with a target label (the value at time t) and m features (the m preceding values)
- you could also use mean/variance statistics
- but shit: if the data is shuffled, the classifier is trained on data that comes from the future (relative to the test data)

major key: think about the real-world use case. E.g. if we want to predict future values, the training data shouldn't contain things that happen later than the test data.

You can do walk-forward validation if the target labels have a meaningful ordering in time:

*(figure: walk-forward validation)*

When modelling probability, break the sequence into its tokens, like the words in a sentence. Each token is modeled as a random variable (_not_ independent).
So you end up with a joint distribution P(W₄, W₃, W₂, W₁) (with some arbitrary number of parameters).

Can apply the chain rule of probability:

$\begin{aligned}
P(W_4, W_3, W_2, W_1) &= P(W_4, W_3, W_2 | W_1) P(W_1) \\
&= P(W_4, W_3 | W_2, W_1) P(W_2 | W_1) P(W_1) \\
&= P(W_4 | W_3, W_2, W_1) P(W_3 | W_2, W_1) P(W_2 | W_1) P(W_1)
\end{aligned}$

i.e.: we can rewrite the probability of a sentence as a product of the probabilities of each word, conditioned on its history.
With log probabilities, you get a sum: $\log{P(\text{sentence})} = \sum_{\text{word}} \log{P(\text{word} | \text{words before it})}$

## Markov models
Markov assumption: limit the amount of memory for previous tokens, e.g. retain at most 2 words.
The "order" is the number of words retained in the conditional.

For example, if the conditional is $P(x | \text{i, will, graduate, in, a, decade})$ (conditioning words listed most recent first, as in the estimate below) and it's a third-order model, the Markov assumption keeps only the three most recent words: $P(x | \text{i, will, graduate})$.

With the Markov assumption and the chain rule, we can model a sequence as a product of limited-memory conditional probabilities. These can be estimated from a corpus (a huge body of text).

For example, to estimate the probability of the word 'prize' given "won a", count how often "won a prize" occurs in the text as a proportion of the total occurrences of "won a":

$P(\text{prize} | \text{a, won}) \approx \frac{\text{\# won a prize}}{\text{\# won a}}$

The word snippets are "n-grams". Three words is a trigram, two words is a bigram. I guess one word is just a gram. And maybe 1000 words would be a kilogram.

Sequential sampling: start with a small seed of words, then repeatedly sample the next word according to its probability given the previous words.

## Embedding models
Model an object x by an embedding vector e<sub>x</sub>; the similarities between these vectors represent similarities between the objects.
For words, this gives embedding vectors where distances and directions reflect semantic meaning.

Distributional hypothesis: words that occur in the same context often have similar meanings.
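To make the distributional hypothesis concrete, here's a tiny sketch (my own toy example, not from the lecture) that counts, for each word, which words appear within a small window around it. Words used in similar contexts, like "cat" and "dog" below, end up with similar context-count vectors, which is the intuition that embedding models build on.

```python
from collections import Counter, defaultdict

# toy corpus; in practice this would be a huge body of text
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2  # how many words on each side count as "context"

# count the context words around every occurrence of every word
contexts = defaultdict(Counter)
for i, word in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            contexts[word][corpus[j]] += 1

# "cat" and "dog" occur in similar contexts, so their count vectors overlap a lot
print(contexts["cat"])
print(contexts["dog"])
```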
1-hot vector: represent each word as an atomic object, a vector that is all zeros except for a single 1 at that word's index.

Word2Vec:
- slide a context window over the sequence, trying to predict the distribution P(y|x): which words are likely to occur in the context window given the middle word
- create a dataset of word pairs from the text
- feed this dataset to a two-layer network, which predicts the context
- a softmax activation over 10k outputs is expensive, so some tricks are needed to make it feasible
- after training, discard the second layer (the softmax) and only use the embeddings produced by the first layer

## Recurrent neural networks
A neural network with cycles in it (used for sequences).

Can be used for:
- sequence to sequence, e.g. translating English to French
- sequence to label, e.g. sequence classification
- label to sequence, e.g. sentence generation

Example: a fully connected network with the input x extended by three nodes, to which the hidden layer is copied:

*(figure: fully connected network with a recurrent connection)*

Visual shorthand:
- a rectangle is a vector of nodes
- an arrow feeding into a rectangle, annotated with a weight matrix, means a fully connected transformation
- if a line doesn't have a weight, it's a copy of the input vector
- if two lines flow into each other, concatenate their vectors

*(figures: RNN in visual shorthand)*

Training RNNs:
- provide an input sequence x and a target sequence t
- backpropagation through time:
    - unroll:
        - every step in the sequence is applied in parallel to a copy of the network
        - the recurrent connection flows from the previous copy to the next
        - the whole thing is now a feedforward net (a network without cycles)
    - the hidden layer is initialized to the zero vector

*(figure: unrolled RNN)*

Basic RNNs work well, but they don't learn to remember information for a long time.
You can't have a long-term memory for everything; you need to be selective. In order to remember things long term, you need to forget a lot of other stuff (such is life).

## LSTMs
"Long short-term memory".
Selective forgetting and remembering, controlled by learnable "gates". Side note: from now on I'm not "studying", I'm "selectively forgetting and remembering".

*[tanh]: sigmoid rescaled so its outputs are between -1 and 1

The gating mechanism takes two input vectors, which are combined with sigmoid and tanh activations.
It produces an additive value -- we want to figure out how much of the input to add to some other vector.
The tanh maps the input to the range (-1, 1), which limits the effect of the addition vector.
The sigmoid acts as a selection vector.

*(figure: LSTM gating mechanism)*

The basic unit of an LSTM is a "cell". There are two recurrent connections between cells: the current output y, and the cell state C.

I don't yet know how much detail we need to know about this, so I'll fill it in later based on exam questions.

The prof's summary: "incredibly powerful language models. Tricky to train, very opaque." Yep, opaque and complicated, indeed.
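To at least pin down the equations for future me: a minimal numpy sketch of one step of a standard LSTM cell (the usual forget/input/output-gate formulation, not necessarily the exact variant from the lecture; the names and toy sizes are mine). It shows the two recurrent connections (output y and cell state C), the sigmoid gates acting as selection vectors, and the tanh-squashed additive update.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell_step(x, y_prev, C_prev, W, b):
    """One step of a standard LSTM cell.

    x:      input vector at this timestep
    y_prev: previous output (recurrent connection 1)
    C_prev: previous cell state (recurrent connection 2)
    W, b:   dicts of weight matrices / bias vectors for the four gates
    """
    # concatenate previous output and current input ("two lines flowing into each other")
    v = np.concatenate([y_prev, x])

    f = sigmoid(W["f"] @ v + b["f"])        # forget gate: what to erase from the cell state
    i = sigmoid(W["i"] @ v + b["i"])        # input gate: how much of the candidate to add
    c_tilde = np.tanh(W["c"] @ v + b["c"])  # candidate values, squashed to (-1, 1)
    o = sigmoid(W["o"] @ v + b["o"])        # output gate: what part of the cell state to expose

    C = f * C_prev + i * c_tilde            # additive update of the cell state
    y = o * np.tanh(C)                      # new output
    return y, C

# toy usage: input size 4, state size 3, a random sequence of 5 timesteps
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
y, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    y, C = lstm_cell_step(x, y, C, W, b)
print(y, C)
```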