LIS329 Midterm Evaluation. Spring 1998. Due 4 PM, Thursday, April 2.

Answer all questions. They count equally. You may use books, articles, notes, and computers to complete the problems, but you may not solicit or receive assistance from other human beings. Please show all work. Make sure your name is on anything you want graded. You may submit this to the instructor by electronic mail or in person.

Suppose that I am constructing a search engine that indexes every available web page on the UIUC campus. The documents will be collected automatically from servers. I'm considering two different approaches for matching and ranking: fuzzy Boolean searches and ranking by cosine similarity. I may eventually choose either or both approaches, but I wish to test each against some plausible queries and see how well they do.

I will index the documents as follows: words in META (metadata) elements (e.g. "description" and "keywords" elements) will receive a weight of 0.90. Words in TITLE elements will receive a weight of 0.80. Words in H1, H2, and H3 elements will receive weights of 0.70, 0.60, and 0.50, respectively. Words occurring anywhere else in the document will receive a weight of 0.20. Weights are not cumulative: if a word occurs in more than one kind of element, then the maximum weight will be selected (e.g. 0.80 if the word occurs in both the TITLE and an H2 element). Frequency of occurrence has no influence on the weight. Common stop words will be removed, but words will not be stemmed.

In computing cosine similarity, the weights will be considered coordinates in a multidimensional space where each word is a dimension, and the documents and queries are plotted in that space (i.e. the vector model). Query terms will have a weight of 1.0, unless explicitly assigned a weight by the user.

For computing fuzzy Boolean results, the weights will be considered degrees of membership (Korfhage, section 3.5).
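
The indexing rules and the vector model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine's actual implementation: the function names `index_document` and `cosine_similarity`, the (word, element) input format, and the sample "library hours" page are all hypothetical, and the sample is deliberately not one of the documents discussed in the questions. Stop-word removal is assumed to have happened already.

```python
import math

# Element weights as stated in the indexing rules.
ELEMENT_WEIGHTS = {"meta": 0.90, "title": 0.80, "h1": 0.70, "h2": 0.60, "h3": 0.50}
BODY_WEIGHT = 0.20  # words occurring anywhere else in the document


def index_document(occurrences):
    """Map each word to the maximum weight of the elements it occurs in.

    `occurrences` is an assumed input format: a list of (word, element)
    pairs. Repeated occurrences do not add up, because frequency has no
    influence on the weight.
    """
    weights = {}
    for word, element in occurrences:
        w = ELEMENT_WEIGHTS.get(element.lower(), BODY_WEIGHT)
        weights[word] = max(weights.get(word, 0.0), w)
    return weights


def cosine_similarity(doc, query):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * query.get(term, 0.0) for term, w in doc.items())
    norm_doc = math.sqrt(sum(w * w for w in doc.values()))
    norm_query = math.sqrt(sum(w * w for w in query.values()))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_doc * norm_query)


# Hypothetical page: "library" in TITLE and H2, "hours" twice in body text.
page = index_document([
    ("library", "title"), ("library", "h2"),
    ("hours", "p"), ("hours", "p"),
])
query = {"library": 1.0, "hours": 1.0}  # unweighted query terms default to 1.0
print(page)
print(round(cosine_similarity(page, query), 4))
```

Note that the document vector's length includes every indexed word in the document, not only the words that also appear in the query.
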
Membership grade for retrieved results will be computed according to the procedures outlined on page 70 of the text. If users do not specify Boolean operators, then it will be assumed that search terms are connected by OR operators.

Suppose I search on the query "department courses psychology." Among the documents returned are the following:

Document 1 mentions "psychology" in the TITLE element, "department" in a META element, and "courses" in an H2 element.

Document 2 mentions "Department" and "courses" in H2 elements and "psychology" in an H3 element.

QUESTION 1) Which of the two documents would you expect to be ranked higher by cosine similarity to the query, and why?

QUESTION 2) Under what circumstances, if any, would your prediction with respect to Question 1 be incorrect?

QUESTION 3) Suppose the results of a fuzzy Boolean query are ranked by membership grade. Construct a Boolean query containing at least two of the three search terms that would assign documents 1 and 2 the same (non-zero) membership grade in the retrieved set.

QUESTION 4) Suppose I wish to compare (only) differences between cosine ranking and fuzzy Boolean ranking. To accomplish this I will test each ranking method against several queries, each consisting of unweighted search terms without Boolean operators (like "department courses psychology," for example). Suggest one or more realistic queries that you predict will produce better ranked output under ranking by cosine similarity. Explain your rationale in detail. "Better" means more relevant documents among the top ranked. "Realistic" means you can imagine documents that would actually be relevant to such a query.

QUESTION 5) Same situation as Question 4: suggest one or more realistic queries that you predict will produce better ranked output under ranking by fuzzy Boolean membership grade. Explain your rationale in detail.

QUESTION 6) How well do you think the inner product would work in this example, compared to the cosine and the fuzzy Boolean ranking? Explain your reasoning.

QUESTION 7) Should end users of vector, fuzzy Boolean, and related systems attempt to understand how those complex matching techniques work? Should system administrators attempt to explain them? Take a position on this issue and argue for it cogently in no more than four paragraphs.

QUESTION 8) Consider index terms extracted from documents based on lexical significance measures vs. controlled terms assigned by human indexers. Are these indexing methods aiming to achieve the same goal or different goals? Take a position on this issue and argue for it cogently in no more than four paragraphs.