LIS429 Midterm Evaluation. Spring 1997.
Answer all four questions. They count equally. You may use books, articles,
notes, and computers to complete the problems, but you may not solicit or
receive assistance from other human beings. Please show all work. Make sure
your name is on anything you want graded. You may submit this to the instructor
by electronic mail or in person.
Question 1)
Fred tells you about the web-based search engine he's set up for his
database of full text documents and abstracts. He's got a problem weighting
his search terms:
"I started out with a simple counting measure for relevance ranking:
each time a word in the document matched one of the search terms, the count
was increased by one. Then I displayed the results in descending order by
count.
I didn't like the fact that long documents always came out ahead of
short documents, so I started weighting the terms in the document. I divided
each word frequency by the sum of the term frequencies for that document. So
for example, if the only (non-stopword) terms in the document were "dog"
(3 times), "cat" (2 times) and "mouse" (5 times), I'd divide each of the
term weights by the sum (10) and end up with (dog: 0.3, cat: 0.2, mouse: 0.5).
I made my relevance estimate the sum of the weights (for terms in the query)
instead of the sum of the raw counts. Once I did this the long and short
documents were more balanced in the ranked output.
The next thing I wanted to do was let users weight their search terms
to reflect their relative importance. So I let the users assign numeric
weights to the terms in the query. I normalized the query weights the same
way I did the document weights, and added up the result of multiplying the
query weight by the corresponding document weight.
Ever since I've implemented the term weighting I've gotten funny results.
Sometimes the relative distribution of term frequencies in the document
reflects the weights in the query, but many times poor matches are ranked much
higher than good matches. What have I done wrong?"
Describe in your own words what's happening in Fred's system that's causing
the ranking problem.
Answer
Here's what's really going on in Fred's system:
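One way to see the effect is to run Fred's scheme on toy data. The sketch
below is only an illustration under stated assumptions: the function names,
the query weights, and the two sample documents are all hypothetical, not
taken from Fred's system. It normalizes both the query and the document by
the sum of their weights, as Fred describes, and adds up the products.

```python
def normalize(weights):
    """Divide each weight by the sum of all weights (Fred's normalization)."""
    total = sum(weights.values())
    return {term: w / total for term, w in weights.items()}


def fred_score(query, doc_counts):
    """Sum of (query weight * document weight), both sum-normalized."""
    q = normalize(query)
    d = normalize(doc_counts)
    return sum(q[t] * d.get(t, 0.0) for t in q)


# Hypothetical data: the user rates "dog" above "cat", 7 to 3.
query = {"dog": 7, "cat": 3}
good_match = {"dog": 7, "cat": 3}  # same proportions as the query
poor_match = {"dog": 1}            # a single word; "cat" never appears

good = fred_score(query, good_match)  # 0.7*0.7 + 0.3*0.3
poor = fred_score(query, poor_match)  # 0.7*1.0

print(f"good match: {good:.2f}, poor match: {poor:.2f}")
```

With these made-up numbers the one-word document outscores the document
whose term proportions match the query exactly. Dividing by the square root
of the sum of squares instead of the plain sum (the cosine measure) would put
the proportional document on top.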
Question 2)
The Cranfield test collection has 1437 short documents, and comes with a
bunch of test queries. Here's one of them:
Can a criterion be developed to show empirically the validity of flow
solutions for chemically reacting gas mixtures based on the simplifying
assumption of instantaneous local chemical equilibrium?
According to the documentation, there are only three documents in Cranfield
that are relevant to this query. I considered the occurrence of four stems
in the three relevant documents and in the entire collection:
* "flow" occurs in all three relevant documents and 738 total documents.
* "gas" occurs in all three relevant documents and 156 total documents.
* "chemic" occurs in 2 of the relevant documents and 40 total documents.
* "react" occurs in 2 of the relevant documents and 42 total documents.
* "chemic" and "react" occur together in 2 of the relevant documents
and 29 total documents.
* "gas" and "flow" occur together in all three relevant documents and
117 total documents.
Answer the following questions:
a) What is the actual probability of relevance, given "gas & flow"?
Answer
There are 117 documents that contain the words "gas" and "flow." Three of those
documents are relevant. So the actual probability is 3/117 or
0.0256.
b) If you estimated this probability using all the data except the
co-occurrence frequencies (i.e. assume independence) would your
estimate be close to the truth? If not, would it be off by a little
or by a lot? Why?
Answer
If you don't use the co-occurrence information, you'd assume that the
probability of "gas & flow" was equal to P(gas) * P(flow), or
(156/1437) * (738/1437). You'd guess that gas and flow co-occurred in about
80 documents, so the probability of relevance given gas and flow would be
about 3/80, or 0.04. This is a fairly small difference, since "gas" and
"flow" really are fairly independent in this database.
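The arithmetic behind this answer can be checked with a few lines of Python.
This is just a sketch using the counts given in the question; the variable
names are my own.

```python
N = 1437                     # documents in the collection
n_gas, n_flow = 156, 738     # documents containing each stem
n_both_actual = 117          # documents containing both stems
relevant_with_both = 3       # relevant documents containing both stems

# Under independence, P(gas & flow) = P(gas) * P(flow).
expected_both = N * (n_gas / N) * (n_flow / N)

p_rel_actual = relevant_with_both / n_both_actual
p_rel_estimated = relevant_with_both / expected_both

print(f"expected co-occurrence: {expected_both:.1f} documents")
print(f"actual P(rel | gas & flow):    {p_rel_actual:.4f}")
print(f"estimated P(rel | gas & flow): {p_rel_estimated:.4f}")
```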
c) What is the actual probability of relevance, given "chemic & react"?
Answer
There are 29 documents that mention "chemic" and "react." Two of them are
relevant, so the probability of relevance given "chemic & react" is really
2/29, or about 0.07.
d) If you estimated this probability using all the data except the
co-occurrence frequencies (i.e. assume independence) would your
estimate be close to the truth? If not, would it be off by a little
or by a lot? Why?
Answer
If you assume that the frequencies of "chemic" and "react" are independent
then you'd estimate that between 1 and 2 documents (40 * 42 / 1437 = 1.17)
contain both terms. By Bayes' theorem, this puts your estimate of the
probability of relevance given "chemic" and "react" at around 171% (2/1.17),
which is not even a legal probability and is a long way from the actual
probability of relevance. The issue, of course, is that "chemic" and "react"
co-occur much more frequently than you'd expect based on an assumption of
independence (Korfhage, section 4.6).
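Here is the same check for "chemic" and "react". The sketch follows the
arithmetic of the answer above, which is my reading of it: the numerator
keeps the two relevant documents that contain both stems, and only the
collection-wide co-occurrence count is replaced by the independence estimate.
Variable names are my own.

```python
N = 1437                      # documents in the collection
n_chemic, n_react = 40, 42    # documents containing each stem
n_both_actual = 29            # documents containing both stems
relevant_with_both = 2        # relevant documents containing both stems

# Under independence, expected co-occurrence = N * P(chemic) * P(react).
expected_both = n_chemic * n_react / N

p_rel_actual = relevant_with_both / n_both_actual
p_rel_estimated = relevant_with_both / expected_both

# How many times more often the stems co-occur than independence predicts.
ratio = n_both_actual / expected_both

print(f"expected co-occurrence: {expected_both:.2f} documents")
print(f"actual P(rel | chemic & react):    {p_rel_actual:.3f}")
print(f"estimated P(rel | chemic & react): {p_rel_estimated:.2f}")
print(f"actual / expected co-occurrence:   {ratio:.1f}")
```

An estimate above 1.0 is the giveaway that the independence assumption has
failed badly for this pair of stems.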
Question 3)
Using the data from the last question, which of the four terms should be the
"most discriminating" using the criterion of inverse document frequency?
(This is collection wide, not in the three relevant documents, or in any
particular document). Which should be the worst of the four? Imagine a
situation in which the worst of the four by IDF would be the best by
signal-to-noise ratio, and vice-versa. Describe what the distribution of term
frequencies would have to be like for that to happen.
Answer
The term "chemic" will be the best of the four by IDF, since it occurs in the
fewest documents. The worst will be "flow" since it occurs in the most. The
term "flow" could be the best by SNR if almost all the occurrences of the
term were concentrated in a few documents. The term "chemic" could be the worst
by SNR if the occurrences were distributed evenly (e.g. one occurrence in each
of the 40 documents) (Korfhage, section 5.5).
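The scenario can be tried out numerically. The sketch below makes two
assumptions that are mine, not the exam's: it uses log2(N/n) for IDF and
Salton's signal measure (log2 of the total frequency minus an entropy-like
noise term), which may differ in detail from the textbook's formula. The
within-document frequencies are invented for illustration and are not
Cranfield data.

```python
import math

N = 1437  # documents in the collection
doc_freq = {"flow": 738, "gas": 156, "chemic": 40, "react": 42}


def idf(n_docs_with_term, n_docs=N):
    """Inverse document frequency, here taken as log2(N / n)."""
    return math.log2(n_docs / n_docs_with_term)


def signal(freqs):
    """Salton-style signal: log2(total) minus the noise of the distribution."""
    total = sum(freqs)
    noise = sum((f / total) * math.log2(total / f) for f in freqs if f > 0)
    return math.log2(total) - noise


idf_by_term = {term: idf(n) for term, n in doc_freq.items()}

# Hypothetical within-document frequencies, one entry per document.
chemic_freqs = [1] * 40              # one occurrence in each of 40 documents
flow_freqs = [1] * 735 + [5000] * 3  # 738 documents, three of them dominant

for term, value in sorted(idf_by_term.items(), key=lambda kv: -kv[1]):
    print(f"IDF {term:7s} {value:.2f}")
print(f"signal chemic {signal(chemic_freqs):.2f}")
print(f"signal flow   {signal(flow_freqs):.2f}")
```

Under these invented frequencies the evenly spread term has a signal of zero,
while the term concentrated in three documents has a large one, so the IDF
ordering of "chemic" and "flow" is reversed.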
Question 4)
Suppose I index two documents with the following terms and weights:
DOCUMENT 1          DOCUMENT 2
----------          ----------
COMMITTEE   7       COMMITTEE   5
FORM        7       MEETING     7
STUDENTS    6       DISCUSSION  3
SPRING      6       AGENDA      1
DISCUSSION  5       OFFICE      5
a) How many document space dimensions will I need to compare the documents
with the cosine measure or Euclidean distance?
Answer
Eight dimensions: committee, form, students, spring, discussion, meeting,
agenda, office
b) What is the similarity between the two documents, according to the cosine measure?
Answer
(7*5 + 5*3) / (sqrt(195) * sqrt(109)) = 50/145.79 = 0.34
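The cosine calculation can be reproduced as follows. This is a minimal
sketch; storing each document as a dictionary of term weights and the
function name are my own choices.

```python
import math

doc1 = {"committee": 7, "form": 7, "students": 6, "spring": 6, "discussion": 5}
doc2 = {"committee": 5, "meeting": 7, "discussion": 3, "agenda": 1, "office": 5}


def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


similarity = cosine(doc1, doc2)
print(f"cosine similarity: {similarity:.2f}")
```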
c) Calculate the Euclidean Distance between the two documents.
Answer
sqrt(4 + 49 + 36 + 36 + 4 + 49 + 1 + 25) = sqrt(204) = 14.28
d) How is the Euclidean distance affected by the terms that describe only one of the two documents?
Answer
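Each such term adds the square of its weight to the sum under the square
root, so the distance is substantially increased with each such term (except
"Agenda", whose weight is only 1).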
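To see where the distance comes from, the sketch below breaks the squared
Euclidean distance down term by term. The dictionary representation and
variable names are my own; a term missing from a document is treated as
having weight 0.

```python
import math

doc1 = {"committee": 7, "form": 7, "students": 6, "spring": 6, "discussion": 5}
doc2 = {"committee": 5, "meeting": 7, "discussion": 3, "agenda": 1, "office": 5}

# One dimension per distinct term across both documents.
terms = sorted(set(doc1) | set(doc2))

# Squared difference on each dimension.
squared_diff = {t: (doc1.get(t, 0) - doc2.get(t, 0)) ** 2 for t in terms}
distance = math.sqrt(sum(squared_diff.values()))

# Terms that describe exactly one of the two documents.
unshared = {t: d for t, d in squared_diff.items() if (t in doc1) != (t in doc2)}

for t in terms:
    tag = "one document only" if t in unshared else "shared"
    print(f"{t:10s} {squared_diff[t]:3d}  ({tag})")
print(f"distance: {distance:.2f}")
```

The six terms that appear in only one document account for 196 of the 204
squared units, and "agenda" contributes just 1 of them.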