Scoring and Lucene

eccenca Documentation

Score Determination for Textfield Searches

Summary

The score returned by Lucene is not an absolute measure of how well a document matches the search term(s). It is relative to the other documents in the index. Consequently, the score is simply a mechanism for ordering the results and is valid only for a specific state of the index and a specific query. If the contents of the index change, a subsequent search may return a different score for the same query and document.
The rest of this page describes how Lucene calculates the score.

The determination of scores is based on the common statistical method TF-IDF (Term Frequency * Inverse Document Frequency).

The more documents in the index contain a term, the less that term says about the content of any single document, and the lower its weight. This weight is called the "Inverse Document Frequency". Together with the "Term Frequency", it forms the basis of the statistical method TF-IDF, which is used to determine the relevance and ranking of search results.

For example, a document with many words that contains only one hit is rated lower than a document with fewer words that contains one hit. Correspondingly, documents with several hits are rated higher.
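The effects described above can be illustrated with a small sketch. It is not Lucene code: the helper names `tf`, `idf` and `length_norm` are hypothetical, the definitions are taken from the formula on this page, and a natural logarithm is assumed.

```python
import math


def tf(freq: int) -> float:
    """Term frequency weight: square root of the raw frequency."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency (natural log is an assumption)."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def length_norm(num_tokens: int) -> float:
    """Length normalisation: square root of the number of tokens in the field."""
    return math.sqrt(num_tokens)


# A term found in few documents weighs more than one found in many documents.
rare_term = idf(num_docs=1000, doc_freq=4)
common_term = idf(num_docs=1000, doc_freq=499)

# One hit in a short document versus one hit in a long document.
one_hit_short = tf(1) / length_norm(10)
one_hit_long = tf(1) / length_norm(1000)

# Several hits in a document of the same length.
four_hits_short = tf(4) / length_norm(10)

print(rare_term > common_term)          # rare terms weigh more
print(one_hit_short > one_hit_long)     # shorter document is rated higher
print(four_hits_short > one_hit_short)  # more hits are rated higher
```

Because both the term frequency and the length normalisation use a square root, doubling the number of hits or the document length changes the weight by less than a factor of two.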

The formula to determine the score is:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
score_d: score for document d
sum_t: sum over all terms t
tf_q: the square root of the frequency of t in the query
tf_d: the square root of the frequency of t in d
idf_t: log(numDocs / (docFreq_t + 1)) + 1.0
numDocs: number of documents in the index
docFreq_t: number of documents containing t
norm_q: sqrt(sum_t((tf_q * idf_t)^2))
norm_d_t: square root of the number of tokens in d in the same field as t
boost_t: the user-specified boost for term t
coord_q_d: number of terms in both query and document / number of terms in query
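The formula can be written out as a minimal sketch. This is an illustration under stated assumptions, not Lucene's implementation or API: the functions `idf` and `score` are hypothetical, documents and queries are plain lists of tokens in a single field, the logarithm is assumed to be natural, and idf_t is read as log(numDocs / (docFreq_t + 1)) + 1.0.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    """idf_t, assuming a natural logarithm."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query, doc, index, boosts=None):
    """Score one document against a query.

    query, doc: lists of tokens; index: list of all documents (token lists).
    boosts: optional mapping term -> boost_t (defaults to 1.0 per term).
    """
    boosts = boosts or {}
    q_freq = Counter(query)
    d_freq = Counter(doc)
    if not q_freq:
        return 0.0

    num_docs = len(index)
    # docFreq_t: number of documents in the index that contain t
    idfs = {
        t: idf(num_docs, sum(1 for d in index if t in d))
        for t in q_freq
    }
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum((math.sqrt(q_freq[t]) * idfs[t]) ** 2 for t in q_freq))
    # norm_d_t: a single field is assumed, so it is the same for every term
    norm_d = math.sqrt(len(doc))

    total = 0.0
    matched = 0
    for t in q_freq:
        if d_freq[t] == 0:
            continue  # term does not occur in the document
        matched += 1
        tf_q = math.sqrt(q_freq[t])
        tf_d = math.sqrt(d_freq[t])
        total += tf_q * idfs[t] / norm_q * tf_d * idfs[t] / norm_d * boosts.get(t, 1.0)

    coord = matched / len(q_freq)  # coord_q_d
    return total * coord


index = [
    ["lucene", "score"],
    ["lucene", "index", "query", "score", "ranking"],
    ["search", "engine"],
]
query = ["score"]
short_score = score(query, index[0], index)
long_score = score(query, index[1], index)

# The same query and document, after three unrelated documents were added.
larger_index = index + [["other"], ["other"], ["other"]]
short_score_after = score(query, index[0], larger_index)

print(short_score, long_score, short_score_after)
```

In this toy index the two-token document scores higher than the five-token document for the same single hit. After unrelated documents are added, the score of the unchanged document differs, which illustrates why a score is only meaningful for one state of the index.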
