index.md (1952B)
1 +++ 2 title = 'Searching & Search Engines' 3 +++ 4 # Searching & Search Engines 5 Classic info retrieval model: 6 7 ![5f288d4b7073dc54b3dd0084791569b8.png](58ee7e1539ec44cd88941b2d254368b2.png) 8 9 Search types: 10 11 - Boolean search: terms (words) combined with AND/OR/NOT 12 - Ad hoc query: one-time 13 - others: browsing, recommender, social bookmarking 14 15 Search engines 16 17 - index pages: locally stored summaries 18 - index building 19 20 1. Crawling: follow links of pages they already know 21 2. Preprocessing 22 23 - Remove HTML tags 24 - Tokenisation (split words on spaces) 25 - Remove words like I, the it, a…. 26 - Take stem of words (e.g. "walking" -> "walk") 27 28 3. Build an index 29 30 - simple — term document matrix 31 - 1 if term appears in document, otherwise 0 32 - to answer query, bitwise AND the term vectors 33 - better — inverted index (just a standard index) 34 - for each term, store a list of all the documents containing it 35 - terms are looked up in index, docs followed using ID 36 - judging search engine fucntionality 37 - ask human raters to judge (because it's subjective) 38 - mathematical — precision, recall (need to know how to calculate this!) 39 - there's a tradeoff between precision & recall 40 - as recall increases, precision decreases 41 - F-measure is the harmonic mean of precision & recall 42 - precision at N only includes the first N results 43 - ranking 44 - weighting scheme tf.idf: 45 - every word is given a weight for a document, some are more important than others 46 - tf: term frequency 47 - idf: inverse document frequency 48 - Zipf's law: frequency of a word is inversely proportional to its rank in the frequency table 49 - Heap's law: most common words are encountered quickly while scanning, but you continue to encounter new words (so you need to keep indexing) 50 - PageRank: absolute score for a page