Reprinted from Journal of Documentation Volume 60 Number 5 2004 pp. 493-502 Copyright © MCB University Press ISSN 0022-0418 and previously from Journal of Documentation Volume 28 Number 1 1972 pp. 11-21
A statistical interpretation of term specificity and its application in retrieval
Karen Spärck Jones
Computer Laboratory, University of Cambridge, Cambridge, UK
Abstract: The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, ...
The idea of an optimum level of indexing exhaustivity for a given document collection then follows: the average number of descriptors per document should be adjusted so that, hopefully, the chances of requests matching relevant documents are maximized, while too many false drops are avoided. Exhaustivity obviously applies to requests too, and one function of a search strategy is to vary request exhaustivity. I will be mainly concerned here, however, with document descriptions.
Specificity as characterized above is a semantic property of index terms: a term is more or less specific as its meaning is more or less detailed and precise. This is a natural view for anyone concerned with the construction of an entire indexing vocabulary. Some decision has to be made about the discriminating power of individual terms in addition to their descriptive propriety. For example, the index term "beverage" may be as properly used for documents about tea, coffee, and cocoa as the terms "tea", "coffee", and "cocoa". Whether the more general term "beverage" only is incorporated in the vocabulary, or whether "tea", "coffee", and "cocoa" are adopted, depends on judgements about the retrieval utility of distinctions between documents made by the latter but not the former. It is also predicted that the more general term would be applied to more documents than the separate terms "tea", "coffee", and "cocoa", so the less specific term would have a larger collection distribution than the more specific ones.
It is of course assumed here that such choices when a vocabulary is constructed are exclusive: we may either have "beverage" or "tea", "coffee", and "cocoa". What happens if we have all four terms is a different matter. We may then either interpret "beverage" to mean "other beverages" or explicitly treat it as a related broader term. I will, however, disregard these alternatives here.
In setting up an index vocabulary the specificity of index terms is looked at from one point of view: we are concerned with the probable effects on document description, and hence retrieval, of choosing particular terms, or rather of adopting a certain set of terms. For our decisions will, in part, be influenced by relations between terms, and how the set of chosen terms will collectively characterize the set of documents. But throughout we assume some level of indexing exhaustivity. We are concerned with obtaining an effective vocabulary for a collection of documents of some broadly known subject matter and size, where a given level of indexing exhaustivity is believed to be sufficient to represent the content of individual documents adequately, and distinguish one document from another.
Index term specificity must, however, be looked at from another point of view. What happens when a given index vocabulary is actually used? We predict when we opt for "beverage", for example, that it will be used more than "cocoa". But we do not have much idea of how many documents there will be to...
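
The prediction discussed above, that a broader term such as "beverage" would be assigned to more documents than any one of "tea", "coffee", and "cocoa", can be illustrated with a small sketch. The five documents, their descriptors, and the function names below are invented purely for illustration and do not come from the collections examined in the paper. The sketch counts how many documents each term is assigned to and computes the average number of descriptors per document, here taken as a simple measure of indexing exhaustivity.

```python
from collections import Counter

# Hypothetical toy collection: each document is described by a set of index terms.
specific_index = {
    "d1": {"tea", "trade"},
    "d2": {"coffee", "trade"},
    "d3": {"cocoa", "farming"},
    "d4": {"tea", "farming"},
    "d5": {"shipping", "trade"},
}

# Hypothetical vocabulary decision: one broader term replaces the three specific ones.
BROADER = {"tea": "beverage", "coffee": "beverage", "cocoa": "beverage"}


def generalize(index, mapping):
    """Re-describe every document using the broader term where one is defined."""
    return {doc: {mapping.get(term, term) for term in terms}
            for doc, terms in index.items()}


def document_frequency(index):
    """Count, for each term, the number of documents it is assigned to."""
    counts = Counter()
    for terms in index.values():
        counts.update(terms)  # terms is a set, so a document counts at most once
    return counts


def mean_exhaustivity(index):
    """Average number of descriptors per document."""
    return sum(len(terms) for terms in index.values()) / len(index)


general_index = generalize(specific_index, BROADER)
df_specific = document_frequency(specific_index)
df_general = document_frequency(general_index)

for term in ("tea", "coffee", "cocoa"):
    print(term, df_specific[term])
print("beverage", df_general["beverage"])
print("exhaustivity:", mean_exhaustivity(specific_index), mean_exhaustivity(general_index))
```

In this toy collection the broader term is assigned to four documents, while no single specific term is assigned to more than two, and the average number of descriptors per document is the same under both vocabularies. How large such differences are in a real collection is an empirical matter that only actual term use can settle.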