LIS329 Midterm Evaluation. Fall 2000. Due 5 PM, Friday, October 27.

Answer all five questions. They count equally. You may use books, articles, notes, and computers to complete the problems, but you may not solicit or receive assistance from other human beings. Please show all work. Make sure your name is on anything you want graded. You may submit this to the instructor by electronic mail or in person.

QUESTION 1) In class discussions (and on an earlier exam) I proposed a system of assigning fuzzy membership grades for index terms based on the term's occurrence in different parts of the document (title, abstract, etc.). Propose a set of guiding principles for fuzzy indexing of documents by human indexers (applying either controlled or natural language terms). Discuss potential advantages of fuzzy indexing vs. binary, and possible problems that might arise in the execution of your guidelines (by the indexers).

QUESTION 2) Suppose a traditional Boolean search system is modified to produce ranked output by fuzzy membership grade (Korfhage page 70). Documents stored in the database have been indexed by humans according to the guidelines you proposed in question 1. If searchers execute Boolean queries without an understanding of the ranking principles, what kinds of unexpected or perplexing results might one expect? Illustrate your arguments with worked examples of queries, documents, and membership grades.

QUESTION 3) One of the arguments in favor of vector representations for retrieval is that vector components can stand for any document attribute, not just the importance of a word or phrase. We've read about systems that combine bibliographic records with other kinds of data (genomic or astronomical data), but heterogeneous records in those systems are only loosely coupled (the systems support different methods of searching separately).
Suppose we model documents with vectors where some of the components represent index term weights, while others represent other kinds of attributes (e.g., estimates of reading difficulty, authority, cost, recency, access time, and frequencies of non-text features such as illustrations, tables, equations, or citations). We conjecture an interface that allows users to include weighted preferences for these features as part of the query vector. Discuss ways in which relevance ranking might be complicated or made problematic when similarity or distance measures are applied to these heterogeneous vectors. Illustrate your points with simple worked examples using either angular (e.g., cosine) or metric (L1, L2, etc.) relevance measures (or both).

QUESTION 4) Term weighting by signal-to-noise ratio favors words with skewed occurrence distributions. How plausible is it that such terms are likely to be better for indexing? Are there factors relating to the types of documents in the collection that speak to that plausibility? Take a position on this issue, and argue for it. Illustrate your points with one or more examples of computed signal values.

QUESTION 5) It's been argued that adjusting search results with a user profile may not always be helpful for someone researching a novel or unusual topic. Review the description of the GUIDO visualization (Korfhage section 7.4) and explain how well GUIDO answers this particular objection to user profiles.