Friday, January 28, 2011

Unit 3: Index Construction and Compression

Index compression is crucial to IR systems. Even a modestly sized collection of documents can produce an index that takes up a lot of memory.

Various compression techniques offer different benefits and disadvantages. One of the biggest factors is whether a compression technique is lossy or lossless. Lossless compression preserves all of the information in the index, so every term in a document remains searchable. Lossy compression methods permanently discard some information, such as terms that may or may not be crucial to the search process. Lossy methods generally use less memory than lossless ones.
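
To make the lossless case concrete, here is a minimal sketch of one lossless scheme for postings lists: store the gaps between sorted docIDs and write each gap with a variable byte code. The function names and the sample docIDs are my own choices for illustration, and this is only one of several possible schemes, not necessarily the one a given system uses.

```python
def to_gaps(doc_ids):
    """Replace each docID in a sorted list with its distance from the previous one."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Undo to_gaps by keeping a running sum."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vb_encode_number(n):
    """Encode one non-negative integer: 7 payload bits per byte,
    with the high bit set only on the final byte."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # mark the last byte of this number
    return bytes(chunks)


def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    numbers = []
    n = 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte          # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))  # final byte
            n = 0
    return numbers


# Illustrative postings list (assumed values).
postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))
decoded = from_gaps(vb_decode(encoded))
print(len(encoded), decoded == postings)
```

Because decoding gives back exactly the original docIDs, nothing is lost. A lossy method, by contrast, would throw information away before the index is built (for example by dropping some terms), and no decoder could bring it back.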

Muddiest Point: 1/24/11

If you are using k-grams for wildcard queries and the k-grams are stored without reference to their parent terms, how does the IR system build the terms that are returned to the user?
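
One way to picture an answer, following the k-gram index described in IIR Chapter 3: each k-gram does keep a list of the vocabulary terms that contain it, so the system never has to rebuild terms from fragments. The sketch below is a small illustration under that assumption. The vocabulary, the function names, and the use of `fnmatchcase` for the final filtering step are my own choices, and patterns are assumed to contain only the `*` wildcard.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(term, k=2):
    """All k-grams of a term, with $ marking the start and end."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_kgram_index(vocabulary, k=2):
    """Map every k-gram to the set of vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def wildcard_lookup(pattern, index, k=2):
    """Find terms matching a pattern such as 're*ve'."""
    # Only the pieces between wildcards can contribute k-grams.
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        for i in range(len(piece) - k + 1):
            grams.add(piece[i:i + k])
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set().union(*index.values())
    # Post-filter: sharing all the k-grams does not guarantee a match.
    return sorted(t for t in candidates if fnmatchcase(t, pattern))


# Hypothetical vocabulary for illustration.
vocabulary = ["retrieve", "remove", "revive", "relive", "red", "retired"]
index = build_kgram_index(vocabulary)
print(wildcard_lookup("re*ve", index))
print(wildcard_lookup("red*", index))
```

The second query shows why the post-filter matters: "retired" contains every k-gram of "red*" but does not match the pattern, so it is removed before results are returned.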

Friday, January 21, 2011

Unit 2: Document and Query Processing

IIR Section 1.2 - Inverted Index:

  • The text describes the process to convert documents into the segments required to create an index.
  • Preprocessing takes tokens pulled from the text and normalizes them. It appears that this preprocessing is removing the formatting of the text to create searchable keywords. Are there any other differences between tokens and normalized tokens?
  • Each token is paired with an integer document ID that ties it to the document from which it is derived.
  • Tokens are then sorted alphabetically, repeated entries are merged, and each term is assigned a document frequency statistic (the number of documents that contain it). The document frequency data can make searches more efficient and can be used to create ranked lists.
  • Two data structures can be used to store the postings list: singly linked lists and variable length arrays. The text lists the benefits of each data structure, but what are the limitations or drawbacks of each?
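
The steps in this list can be sketched in a few lines of Python. This is a toy illustration, not the book's code: the three sample documents, the regular-expression tokenizer, and lowercasing as the only normalization step are all assumptions made for the example. Python lists stand in for the variable length arrays mentioned above.

```python
import re


def tokenize(text):
    """Split text into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Normalization here is just lowercasing; real systems do more."""
    return token.lower()


def build_inverted_index(docs):
    """docs maps docID -> text. Returns term -> sorted list of docIDs."""
    # Step 1: collect (normalized term, docID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in tokenize(text):
            pairs.append((normalize(token), doc_id))
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge duplicates into one postings list per term.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping docIDs present in both."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical mini-collection.
docs = {
    1: "Caesar was killed in the Capitol",
    2: "Brutus killed Caesar",
    3: "The noble Brutus",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    # Document frequency is simply the length of the postings list.
    print(term, len(postings), postings)
print(intersect(index["brutus"], index["caesar"]))
```

Knowing the document frequency up front lets an AND query start with the shortest postings list, which is one way that statistic makes searches more efficient.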

IIR Chapters 2 & 3:

Section 1.2 describes the basic structures used in information retrieval. Chapters 2 and 3 describe some of the challenges in creating effective information retrieval systems. To create useful search tools, multiple factors must be considered.

One of the biggest challenges in creating and normalizing tokens is the nuance and intricacy of human language. Different languages present different issues. Designing an algorithm that breaks text down into tokens perfectly is impossible, and thus we must rely on probabilities and collected user information to best approximate the desires of the search engine user. Designers must decide how to handle punctuation, capitalization, accents, and diacritics. Even if we have a handle on the nuance of language, we still must account for user errors such as misspellings and decide how to correct them.
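
Two of the pieces mentioned above can be sketched briefly: normalizing a token (case, accents, surrounding punctuation) and suggesting a correction for a misspelled query term using edit distance. This is a rough sketch under my own assumptions. The specific normalization choices, the tiny vocabulary, and the `suggest` helper are illustrative, and real systems would make these decisions per language.

```python
import unicodedata


def normalize(token):
    """Fold case, strip accents and diacritics, trim surrounding punctuation."""
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold().strip(".,;:!?\"'()")


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(query_term, vocabulary):
    """Return the vocabulary term closest to the query term (hypothetical helper)."""
    return min(vocabulary, key=lambda term: (edit_distance(query_term, term), term))


# Hypothetical vocabulary for illustration.
vocabulary = ["retrieval", "index", "query"]
print(normalize("Résumé,"))
print(edit_distance("retreival", "retrieval"))
print(suggest("retreival", vocabulary))
```

Picking the closest term by edit distance alone ignores how common each candidate is, which is where the probabilities and collected user information mentioned above would come in.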