lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git
Log | Files | Refs | Submodules

index.md (1952B)


      1 +++
      2 title = 'Searching & Search Engines'
      3 +++
      4 # Searching & Search Engines
      5 Classic info retrieval model:
      6 
      7 ![5f288d4b7073dc54b3dd0084791569b8.png](58ee7e1539ec44cd88941b2d254368b2.png)
      8 
      9 Search types:
     10 
     11 - Boolean search: terms (words) combined with AND/OR/NOT
     12 - Ad hoc query: one-time
     13 - others: browsing, recommender, social bookmarking
     14 
     15 Search engines
     16 
     17 - index pages: locally stored summaries
     18 - index building
     19 
     20     1. Crawling: follow links of pages they already know
     21     2. Preprocessing
     22 
     23         - Remove HTML tags
     24         - Tokenisation (split words on spaces)
     25         - Remove words like I, the it, a….
     26         - Take stem of words (e.g. "walking" -> "walk")
     27 
     28     3. Build an index
     29 
     30         - simple — term document matrix
     31             - 1 if term appears in document, otherwise 0
     32             - to answer query, bitwise AND the term vectors
     33         - better — inverted index (just a standard index)
     34             - for each term, store a list of all the documents containing it
     35             - terms are looked up in index, docs followed using ID
     36 - judging search engine fucntionality
     37     - ask human raters to judge (because it's subjective)
     38     - mathematical — precision, recall (need to know how to calculate this!)
     39         - there's a tradeoff between precision & recall
     40         - as recall increases, precision decreases
     41         - F-measure is the harmonic mean of precision & recall
     42         - precision at N only includes the first N results
     43 - ranking
     44     - weighting scheme tf.idf:
     45         - every word is given a weight for a document, some are more important than others
     46             - tf: term frequency
     47             - idf: inverse document frequency
     48     - Zipf's law: frequency of a word is inversely proportional to its rank in the frequency table
     49     - Heap's law: most common words are encountered quickly while scanning, but you continue to encounter new words (so you need to keep indexing)
     50     - PageRank: absolute score for a page