Information Storage and Retrieval: 2011

Friday, April 22, 2011

Muddiest Point: 4/18/11

How have classification models used in libraries influenced text classification on the web?

Thursday, April 14, 2011

Unit 13: Text Classification and Clustering

The goal of clustering algorithms is to group documents into clusters that are internally coherent. Documents in a cluster should be as similar as possible to each other and as dissimilar as possible from documents in other clusters. The distribution and makeup of the data determine which documents end up in which cluster.
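To make "internally coherent" concrete, here is a minimal sketch that compares the average similarity of document pairs inside the same cluster with the average similarity of pairs from different clusters. The bag-of-words vectors, the cosine measure, the function names, and the toy documents are all my own assumptions for illustration; this only scores a given clustering and is not a clustering algorithm itself.

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence(clusters):
    """Return (mean within-cluster similarity, mean between-cluster similarity)."""
    labeled = [(label, vectorize(doc))
               for label, docs in clusters.items()
               for doc in docs]
    within, between = [], []
    for (label_a, vec_a), (label_b, vec_b) in combinations(labeled, 2):
        target = within if label_a == label_b else between
        target.append(cosine(vec_a, vec_b))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within), mean(between)


# Hypothetical toy clustering, made up for this example.
clusters = {
    "sports": ["football match score", "football team score"],
    "cooking": ["bake bread oven", "bake cake oven"],
}
within, between = coherence(clusters)
print(within, between)
```

A good clustering should give a within-cluster value well above the between-cluster value, as this toy data does.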

Muddiest Point: 4/11/11

What are the storage and performance issues of modeling users for the purpose of personalized search?

Thursday, April 7, 2011

Unit 12: Intelligent Information Retrieval

Cannot access articles for unit.

Muddiest Point: 4/4/11

Since the articles are not exactly the same when using comparable corpora, is it likely that words could be translated incorrectly?

Friday, April 1, 2011

Muddiest Point: 3/28/11

Once PageRank scores for a collection are calculated, how are they used to influence the ranked list for a specific query?

Friday, March 25, 2011

Muddiest Point: 3/21/11

It seems that global methods of query expansion either run a very high risk of returning more irrelevant documents than relevant ones or are too costly to maintain effectively. Are there certain situations where a global method would be preferable to a local method?

Unit 10: Web Information Retrieval

Information retrieval on the web is extremely difficult due to the Internet's enormous size and the sheer variety of information it contains. There are no formatting or information organization standards for the Web. Many different methods have been tried in the effort to build a search engine that can navigate the Internet effectively. Early models required users to submit a web site for indexing. Sites such as Yahoo! attempted to categorize websites so that users could browse without using queries. Google's approach, crawling the web and indexing information based on the structure of hypertext, has proven to be the most effective.

Wednesday, March 16, 2011

Unit 9: User Interaction and Visualization

The user interface is an essential part of an information retrieval system. Human-computer interaction is a large part of creating a successful user experience.

Key design principles for information retrieval interfaces are offering informative feedback, reducing working memory load, and providing alternative interfaces for novice and expert users.

Friday, February 25, 2011

Unit 7: Relevance Feedback and Query Expansion

Relevance feedback (RF) is a method that involves IR system users refining their queries by marking returned results as relevant or irrelevant. Once this is done, the system uses the information to reformulate the query to give more accurate results. Relevance feedback helps when concepts can be referred to using different words. RF also helps the system tell the different meanings of a word apart.
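One common way to reformulate the query from marked results is the Rocchio algorithm, which my notes above do not name, so treat this as one possible technique and not necessarily what a given system does. The sketch below moves the query vector toward the average relevant document and away from the average irrelevant one. The weights alpha, beta, and gamma, the function name, and the toy "jaguar" vectors are assumptions chosen for illustration.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reformulated query vector (dict of term -> weight).

    query:       dict of term -> weight for the original query
    relevant:    list of document vectors the user marked relevant
    nonrelevant: list of document vectors the user marked irrelevant
    alpha, beta, gamma: illustrative weights, not tuned values
    """
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    # Terms pushed to zero or below are dropped from the new query.
    return {term: w for term, w in new_query.items() if w > 0}


# Hypothetical example: the user wants the animal, not the car.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "animal": 1.0}]
nonrelevant = [{"jaguar": 1.0, "car": 1.0}]

expanded = rocchio(query, relevant, nonrelevant)
print(expanded)
```

The expanded query picks up "cat" and "animal" from the relevant documents and leaves out "car", which is how feedback can separate two meanings of the same word.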

RF is ineffective when misspellings occur, when retrieval is cross-language, or when the collection vocabulary and user vocabulary are mismatched. RF also puts more demand on the user.

Muddiest Point: 2/21/11

When evaluating precision, what are the factors in determining whether a document is relevant or irrelevant to a specific query?

Thursday, February 17, 2011

Unit 6: Evaluation

A well-performing IR system strikes a good balance between precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved).

These two measures are the most important when evaluating the performance of an IR system.
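The two definitions above translate directly into code. This is a minimal sketch for a single query, assuming the set of relevant documents is already known; the function name and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: iterable of document IDs the system returned
    relevant:  iterable of document IDs judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were retrieved (recall 0.4). Returning every document would push recall to 1.0 but drag precision down, which is why the balance matters.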

Muddiest Point: 2/14/11

Since languages have a defined syntax, it would seem logical that a language model that can interpret grammar would have higher accuracy than the unigram model. Why does the unigram model perform as well as or better than n-gram or grammar-based models?

Thursday, February 10, 2011

Muddiest Point: 2/7/11

Could an IR system employ a combination of both the Boolean search model and a best match model?

Friday, February 4, 2011

Unit 4: Matching Models, Ranked Boolean and Vector Space

Boolean searches are powerful tools in information retrieval that can provide improved results for the end user.

The task for the creator of the IR system is to have the system perform the Boolean search as efficiently as possible. Without ranking, it is simple enough to implement Boolean search with a basic algorithm, but accuracy may be sacrificed. Users of the AND operator will get focused results, but users of the OR operator may get many irrelevant results.

This is where weighting terms and ranking results help accuracy. Oftentimes a document that uses a term frequently is not more important than a document that uses the term a single time. This problem can be addressed by using the vector space model and giving a weight to each term.
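One way to give terms a weight is tf-idf with a log-scaled term frequency, sketched below. The specific formula, (1 + log10 tf) * log10(N / df), is one common variant that I am assuming here, and the function name and toy documents are made up. The point is that repeating a term raises its weight only slowly, and a term that appears in every document gets a weight of zero.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one dict of term -> tf-idf weight per document.

    Weight = (1 + log10(tf)) * log10(N / df), where tf is the term's count
    in the document, N is the number of documents, and df is the number of
    documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


# Hypothetical toy collection.
docs = ["the cat sat", "the cat cat cat", "the dog"]
vectors = tf_idf_vectors(docs)
for vec in vectors:
    print(vec)
```

In this toy collection "cat" appears three times as often in the second document as in the first, but its weight is less than one and a half times larger, and "the" scores zero everywhere because it appears in all three documents.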

Thursday, February 3, 2011

Muddiest Point: 1/31/11

How do gamma codes help with compression in a situation where the largest encoded value is not known ahead of time?

Friday, January 28, 2011

Unit 3: Index Construction and Compression

Index compression is crucial to IR systems. Even a modestly sized collection of documents can take up a lot of memory.

Various compression techniques offer different benefits and disadvantages. One of the biggest factors is whether the compression technique is lossy or not. Non-lossy (lossless) compression preserves every term in a document. Lossy compression methods permanently remove terms that may or may not be crucial to the search process. Lossy methods will use less memory than non-lossy ones.

Muddiest Point: 1/24/11

If you are using k-grams for wildcard queries and the k-grams are stored without reference to their parent terms, how does the IR system build the terms that are returned to the user?

Friday, January 21, 2011

Unit 2: Document and Query Processing

IIR Section 1.2 - Inverted Index:

The text describes the process to convert documents into the segments required to create an index.
Preprocessing takes tokens pulled from the text and normalizes them. It appears that this preprocessing is removing the formatting of the text to create searchable keywords. Are there any other differences between tokens and normalized tokens?
Each token is assigned an integer that ties it to the document from where it is derived.
Tokens are then arranged alphabetically and assigned a document frequency statistic. The document frequency data can make searches more efficient and can be used to create ranked lists.
Two data structures can be used to store the postings list: singly linked lists and variable-length arrays. The text lists the benefits of each data structure, but what are the limitations or drawbacks of each?
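The steps above can be sketched in a few lines. This is my own simplified version, not the book's code: normalization here is just lowercasing and stripping surrounding punctuation, postings are plain sorted Python lists (closer to the variable-length array option), and the function names and sample documents are made up. Document frequency is simply the length of a term's postings list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each normalized term to a sorted list of document IDs.

    docs: dict of integer doc ID -> document text
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            # Simplified normalization: lowercase and trim punctuation.
            term = token.lower().strip(".,;:!?\"'()")
            if term:
                postings[term].add(doc_id)
    # Sort terms alphabetically and each postings list by doc ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping doc IDs found in both (AND)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical three-document collection.
docs = {
    1: "Caesar was killed.",
    2: "Brutus killed Caesar",
    3: "Brutus was noble",
}
index = build_inverted_index(docs)
for term, plist in index.items():
    print(term, "df =", len(plist), "postings =", plist)

both = intersect(index["brutus"], index["caesar"])
print(both)  # [2]
```

Keeping the postings sorted by document ID is what lets the AND merge walk both lists once instead of comparing every pair.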

IIR Chapters 2 & 3:

Section 1.2 describes the basic structures used in information retrieval. Chapters 2 and 3 describe some of the challenges in creating effective information retrieval systems. To create useful search tools multiple factors must be considered.

One of the biggest challenges in creating and normalizing tokens is the nuance and intricacy of human language. Different languages present different issues. Designing algorithms that break text down into tokens perfectly is impossible, and thus we must rely on probabilities and collected user information to best approximate the desires of the search engine user. Designers must decide how to handle punctuation, capitalization, accents, and diacritics. Even if we have a handle on the nuance of language, we still must account for user errors such as misspellings and decide how to correct them.
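As a small illustration of those design decisions, here is a sketch of one possible normalization policy: fold case, strip accents and diacritics, and drop punctuation. The policy and the function name are my own assumptions, and each choice loses information (for example, it makes "résumé" and "resume" the same term), which is exactly the kind of trade-off a designer has to weigh. It does nothing about misspellings.

```python
import unicodedata


def normalize(token):
    """Normalize a token: lowercase, remove diacritics, drop punctuation."""
    token = token.lower()
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Keep only letters and digits, so "U.S.A." and "USA" match.
    return "".join(ch for ch in without_marks if ch.isalnum())


for raw in ["Résumé", "U.S.A.", "naïve,", "don't"]:
    print(raw, "->", normalize(raw))
```

With this policy "Résumé" becomes "resume", "U.S.A." becomes "usa", and "don't" becomes "dont". A different collection or language might call for keeping some of these distinctions.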