changes.
| | h1. Score Determination for Textfield Searches |
| | | |
| | {note:title=Summary} |
| | The score returned by Lucene is not an *absolute* measure of how well the search term(s) match. It is *relative* to the documents currently in the index. As a consequence, the score is simply a mechanism for ordering the results, and it is valid only for that specific state of the index and that specific query. Therefore, if the contents of the index change, a subsequent search may return a different score for the same query and document. \\ The rest of this page discusses the algorithm Lucene uses to calculate the score. |
| | {note} |
| | |
| | Scores are determined using the common statistical method TF-IDF (Term Frequency * Inverse Document Frequency). |
| | |
| | The more documents in the index contain a term, the less that term says about the content of any single document, and the lower its weight. This weight is called the "Inverse Document Frequency". Together with the "Term Frequency" (how often the term occurs within a document), it forms the basis of the TF-IDF method for determining the relevance and ranking of search results. |
| | |
| | For example, a document with many words that contains only one hit is rated lower than a document with fewer words that contains one hit. Correspondingly, documents with several hits are rated higher. |
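| | |
| | The effect of document length and hit count can be sketched in a few lines of Python. This is only an illustration, not Lucene code: it uses the definitions of tf_d and norm_d_t given in the table further down this page, and the function name {{field_weight}} and the token counts are made up for the example. |

```python
import math

def field_weight(hits: int, num_tokens: int) -> float:
    """tf_d / norm_d_t for one term in one field (illustrative helper).

    tf_d     = sqrt(number of occurrences of the term in the document)
    norm_d_t = sqrt(number of tokens in the field)
    """
    return math.sqrt(hits) / math.sqrt(num_tokens)

# Hypothetical documents: only the counts matter here.
short_one_hit = field_weight(hits=1, num_tokens=25)
long_one_hit = field_weight(hits=1, num_tokens=100)
long_four_hits = field_weight(hits=4, num_tokens=100)

print(short_one_hit, long_one_hit, long_four_hits)
```

| | With one hit each, the 25-token document gets a larger weight than the 100-token document. With four hits, the longer document's weight rises again. |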
| | |
| | The formula to determine the score is: |
| | {noformat} |
| | score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d |
| | {noformat} |
| | |score_d | score for document d| |
| | |sum_t | sum for all terms t| |
| | |tf_q | the square root of the frequency of t in the query| |
| | |tf_d | the square root of the frequency of t in d| |
| | |idf_t | log(numDocs/(docFreq_t+1)) + 1.0| |
| | |numDocs | number of documents in index| |
| | |docFreq_t| number of documents containing t| |
| | |norm_q | sqrt(sum_t((tf_q*idf_t)^2))| |
| | |norm_d_t | square root of number of tokens in d in the same field as t| |
| | |boost_t | the user-specified boost for term t| |
| | |coord_q_d| number of terms in both query and document / number of terms in query| |
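| | |
| | The formula and the definitions above can be put together in a minimal Python sketch. It is not Lucene's implementation. It assumes a single field, already tokenized text, a natural logarithm, and the reading idf_t = log(numDocs / (docFreq_t + 1)) + 1.0. The function names, the {{boosts}} parameter and the three sample documents are hypothetical. |

```python
import math
from collections import Counter

def idf(num_docs: int, doc_freq: int) -> float:
    # Assumed grouping: log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def doc_frequencies(docs: dict) -> Counter:
    # docFreq_t: number of documents that contain term t at least once
    return Counter(t for tokens in docs.values() for t in set(tokens))

def score(query_terms, doc_tokens, num_docs, doc_freqs, boosts=None):
    """score_d for one document, following the formula term by term."""
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_tokens)
    if not q_counts:
        return 0.0

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)) over all query terms
    norm_q = math.sqrt(sum(
        (math.sqrt(qf) * idf(num_docs, doc_freqs.get(t, 0))) ** 2
        for t, qf in q_counts.items()))
    # norm_d_t = sqrt(number of tokens in the field)
    norm_d = math.sqrt(len(doc_tokens))

    total, matched = 0.0, 0
    for t, qf in q_counts.items():
        df = d_counts.get(t, 0)
        if df == 0:
            continue  # terms missing from the document contribute nothing
        matched += 1
        idf_t = idf(num_docs, doc_freqs.get(t, 0))
        tf_q, tf_d = math.sqrt(qf), math.sqrt(df)
        total += tf_q * idf_t / norm_q * tf_d * idf_t / norm_d * boosts.get(t, 1.0)

    coord = matched / len(q_counts)  # coord_q_d
    return total * coord

# Hypothetical three-document index.
docs = {
    "a": "lucene scoring basics".split(),
    "b": "lucene lucene search engine internals and more text".split(),
    "c": "unrelated words here".split(),
}
freqs = doc_frequencies(docs)
query = ["lucene", "scoring"]
scores = {name: score(query, toks, len(docs), freqs) for name, toks in docs.items()}
print(scores)
```

| | In this sample index, document "a" matches both query terms and is short, so it ranks above "b", which matches only one term. Document "c" matches nothing and scores 0. Because numDocs and docFreq_t are taken from the whole index, adding or removing documents changes these values even when the query and the scored document stay the same. |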