Information Storage and Retrieval: January 2011

IIR Section 1.2 - Inverted Index:

The text describes how documents are converted into the tokens needed to build an inverted index.
Preprocessing takes the tokens pulled from the text and normalizes them. This step appears to strip the formatting from the text to produce searchable keywords. Are there any other differences between tokens and normalized tokens?
Each token is paired with an integer document ID that ties it to the document from which it is derived.
Tokens are then sorted alphabetically, and each resulting term is assigned a document frequency statistic: the number of documents that contain it. Document frequency can make searches more efficient and can be used to create ranked lists.
Two data structures can be used to store the postings lists: singly linked lists and variable-length arrays. The text lists the benefits of each data structure, but what are the limitations or drawbacks of each?
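
To make the steps above concrete for myself, here is a minimal sketch of building an inverted index in Python. It is not the book's code: the names `tokenize`, `build_index`, and `intersect` and the three toy documents are my own assumptions, the normalization is deliberately crude (lowercasing and dropping punctuation), and each postings list is stored as a Python list, which behaves like a variable-length array rather than a linked list.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Crude normalization (an assumption): lowercase, keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to its postings list: the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)  # a set merges repeated occurrences
    # Sort terms alphabetically and every postings list by document ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def intersect(p1, p2):
    """Walk two sorted postings lists in step, keeping IDs present in both (AND)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical toy collection, keyed by document ID.
docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "Quick dog, lazy fox",
}
index = build_index(docs)

# Document frequency is just the length of each postings list.
doc_freq = {term: len(ids) for term, ids in index.items()}
```

With this toy collection, `index["dog"]` is `[2, 3]`, so its document frequency is 2, and `intersect(index["the"], index["dog"])` returns `[2]`. One way document frequency could help efficiency is by processing the rarest terms of an AND query first, so the intermediate results stay small.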

IIR Chapters 2 & 3:

Section 1.2 describes the basic structures used in information retrieval. Chapters 2 and 3 describe some of the challenges in creating effective information retrieval systems. To create useful search tools, multiple factors must be considered.

One of the biggest challenges in creating and normalizing tokens is the nuance and intricacy of human language. Different languages present different issues. Designing an algorithm that breaks text down into tokens perfectly is impossible, so we must rely on probabilities and collected user information to best approximate what the search engine user wants. Designers must decide how to handle punctuation, capitalization, accents, and diacritics. Even with a handle on the nuances of language, we must still account for user errors such as misspellings and decide how to correct them.
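
Two of these decisions are easy to try out in a few lines of Python. The sketch below is only one possible set of choices, not a recommendation from the text: `normalize` folds case and strips accents and diacritics using the standard `unicodedata` module, and `edit_distance` computes Levenshtein distance, which `suggest` uses to propose corrections for a misspelled query term. The function names, the sample vocabulary, and the `max_distance=2` cutoff are all my own assumptions.

```python
import unicodedata


def normalize(token):
    """Fold case, then drop combining marks so accented letters match plain ones."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def suggest(query, vocabulary, max_distance=2):
    """Return vocabulary terms within max_distance edits of the query, closest first."""
    scored = [(edit_distance(normalize(query), term), term) for term in vocabulary]
    return [term for dist, term in sorted(scored) if dist <= max_distance]


# Hypothetical vocabulary of already-normalized terms.
vocabulary = ["retrieval", "index", "token"]
corrections = suggest("Retreival", vocabulary)
```

Here `normalize("Résumé")` gives `"resume"`, and `corrections` is `["retrieval"]` because swapping two adjacent letters costs two substitutions. Stripping accents is itself a trade-off, since it can merge words that are distinct in some languages.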

Friday, January 28, 2011

Unit 3: Index Construction and Compression

Muddiest Point: 1/24/11

Friday, January 21, 2011

Unit 2: Document and Query Processing