LIS329 Midterm Evaluation. Spring 1998. Due 4 PM, Thursday, April 2.

Answer all questions. They count equally. You may use books, articles,
notes, and computers to complete the problems, but you may not solicit or
receive assistance from other human beings. Please show all work. Make sure
your name is on anything you want graded. You may submit this to the instructor
by electronic mail or in person.

  Suppose that I am constructing a search engine that indexes every available
web page on the UIUC campus. The documents will be collected automatically 
from servers. I'm considering two different approaches for matching and 
ranking: fuzzy Boolean searches and ranking by cosine similarity. I may 
eventually choose either or both approaches, but I wish to test each against 
some plausible queries and see how well they do.
  I will index the documents as follows: words in META (metadata) elements
(e.g. those named "description" and "keywords") will receive a weight of 0.90. 
Words in TITLE elements will receive a weight of 0.80. Words in H1, H2, and H3
elements will receive weights of 0.70, 0.60, and 0.50, respectively. Words
occurring anywhere else in the document will receive a weight of 0.20. Weights
are not cumulative: if a word occurs in more than one kind of element then 
the maximum weight will be selected (e.g. 0.80 if the word occurs in both the
TITLE and an H2 element). Frequency of occurrence has no influence on the
weight. Common stop words will be removed, but words will not be stemmed.
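
  The indexing rules above can be sketched roughly as follows. This is only an
illustration, not the engine's actual code: the function name, the
(word, element) input format, the tiny stop list, and the decision to fold
case are all assumptions made for the example.

```python
# Weights per element type, taken from the indexing rules above.
ELEMENT_WEIGHTS = {"meta": 0.90, "title": 0.80, "h1": 0.70, "h2": 0.60, "h3": 0.50}
BODY_WEIGHT = 0.20  # words occurring anywhere else in the document

# Hypothetical stop list; the real list is not specified in this exam.
STOP_WORDS = {"the", "of", "and", "a", "in"}


def index_document(occurrences):
    """Map each word to its weight, given (word, element) pairs.

    The input format is an assumption for this sketch: one pair per
    occurrence of a word, naming the element in which it occurred.
    """
    weights = {}
    for word, element in occurrences:
        term = word.lower()  # assumption: case is folded; no stemming is done
        if term in STOP_WORDS:
            continue
        weight = ELEMENT_WEIGHTS.get(element.lower(), BODY_WEIGHT)
        # Weights are not cumulative: keep the maximum, ignore frequency.
        weights[term] = max(weights.get(term, 0.0), weight)
    return weights


# A made-up page: "library" in TITLE, H2, and body text; "hours" twice in body.
example = index_document([
    ("Library", "title"), ("library", "h2"), ("library", "p"),
    ("the", "title"), ("hours", "p"), ("hours", "p"),
])
```

  Here "library" keeps the TITLE weight and "hours" gets the body weight no
matter how often it occurs.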
  In computing cosine similarity, the weights will be considered coordinates
in a multidimensional space where each word is a dimension, and the documents
and queries are plotted in that space (i.e. the vector model). Query terms
will have a weight of 1.0, unless explicitly assigned a weight by the user.
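
  A minimal sketch of the cosine computation, assuming documents and queries
are stored as word-to-weight dictionaries. The helper names and the sample
document are hypothetical and are not one of the exam documents.

```python
import math


def make_query(terms, user_weights=None):
    """Build a query vector: weight 1.0 unless the user assigned a weight."""
    user_weights = user_weights or {}
    return {term: user_weights.get(term, 1.0) for term in terms}


def cosine_similarity(doc, query):
    """Cosine of the angle between two sparse vectors (dicts of weights)."""
    dot = sum(weight * query.get(term, 0.0) for term, weight in doc.items())
    norm_doc = math.sqrt(sum(w * w for w in doc.values()))
    norm_query = math.sqrt(sum(w * w for w in query.values()))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0  # an empty vector has no direction; treat as no match
    return dot / (norm_doc * norm_query)


# Made-up document with one TITLE word and one body word.
sample_doc = {"library": 0.80, "hours": 0.20}
score = cosine_similarity(sample_doc, make_query(["library"]))
```

  Note that every indexed word in a document contributes to its vector length,
including words that do not appear in the query.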
  For computing fuzzy Boolean results, the weights will be considered degrees
of membership (Korfhage, section 3.5). Membership grade for retrieved results 
will be computed according to the procedures outlined on page 70 of the text.
If users do not specify Boolean operators, then it will be assumed that search
terms are connected by OR operators.
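
  A minimal sketch of fuzzy Boolean scoring, assuming the common fuzzy-set
operators (maximum for OR, minimum for AND, complement for NOT). The
procedures in the text take precedence if they differ. The function names and
the sample document are hypothetical.

```python
def membership(doc, term):
    """Degree of membership of a document in the set for one term."""
    return doc.get(term, 0.0)


def fuzzy_or(*grades):
    """Assumed OR operator: the largest membership grade."""
    return max(grades, default=0.0)


def fuzzy_and(*grades):
    """Assumed AND operator: the smallest membership grade."""
    return min(grades, default=0.0)


def fuzzy_not(grade):
    """Assumed NOT operator: the complement of the membership grade."""
    return 1.0 - grade


def default_query(doc, terms):
    """Terms given without operators are treated as connected by OR."""
    return fuzzy_or(*(membership(doc, term) for term in terms))


# Made-up document with one TITLE word and one body word.
sample_doc = {"library": 0.80, "hours": 0.20}
or_grade = default_query(sample_doc, ["library", "hours"])
and_grade = fuzzy_and(membership(sample_doc, "library"),
                      membership(sample_doc, "hours"))
```

  Under these assumed operators a single term determines the grade of an OR
or an AND, and the other terms' weights do not affect it.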
  Suppose I search on the query "department courses psychology." Among the
documents returned are the following two: Document 1 mentions 
"psychology" in the TITLE element, "department" in a META element, and 
"courses" in an H2 element. Document 2 mentions "department" and "courses" in 
H2 elements and "psychology" in an H3 element. 

QUESTION 1)
  Which of the two documents would you expect to be ranked higher by cosine 
similarity to the query, and why?

QUESTION 2)
  Under what circumstances, if any, would your prediction with respect to
Question 1 be incorrect?

QUESTION 3)
  Suppose the results of a fuzzy Boolean query are ranked by membership grade.
Construct a Boolean query containing at least two of the three search terms 
that would assign documents 1 and 2 the same (non-zero) membership grade in the
retrieved set. 

QUESTION 4)
  Suppose I wish to compare (only) differences between cosine ranking and
fuzzy Boolean ranking. To accomplish this I will test each ranking method
against several queries, each consisting of unweighted search terms without
Boolean operators (like "department courses psychology," for example).
Suggest one or more realistic queries that you predict will produce better
ranked output under ranking by cosine similarity. Explain your rationale
in detail. "Better" means more relevant documents among the top ranked.
"Realistic" means you can imagine documents that would actually be relevant
to such a query.

QUESTION 5) 
  Same situation as Question 4: suggest one or more realistic queries that you 
predict will produce better ranked output under ranking by fuzzy Boolean 
membership grade. Explain your rationale in detail.

QUESTION 6)
  How well do you think the inner product would work in this example, compared 
to the cosine and the fuzzy Boolean ranking? Explain your reasoning.

QUESTION 7)
  Should end users of vector, fuzzy Boolean, and related systems attempt to
understand how those complex matching techniques work? Should system 
administrators attempt to explain them? Take a position on this issue and
argue for it cogently in no more than four paragraphs.

QUESTION 8) 
  Consider index terms extracted from documents based on lexical significance
measures vs. controlled terms assigned by human indexers. Are these indexing
methods aiming to achieve the same goal or different goals? Take a position on
this issue and argue for it cogently in no more than four paragraphs.