Elasticsearch: why exact match has lower score than partial match

my question

I search for the word form, but the exact match form is not first in the results. Is there any way to solve this problem?

my search query

{
  "query": {
    "match": {
      "word": "form"
    }
  }
}

result

word             score
--------------------------
formulation      10.864353
formaldehyde     10.864353
formless         10.864353
formal           10.84412
formerly         10.84412
forma            10.84412
formation        10.574185
formula          10.574185
formulate        10.574185
format           10.574185
formally         10.574185
form             10.254687
former           10.254687
formidable       10.254687
formality        10.254687
formative        10.254687
ill-formed       10.054999
in form          10.035862
pro forma        9.492243

POST my_index/_analyze

At search time, the word form produces a single token: form.

In the index, the tokens for form are ["f", "fo", "for", "form"]; the tokens for formulation are ["f", "fo", …, "formulatio", "formulation"].
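To make the token lists above concrete, here is a minimal Python sketch of what an edge n-gram filter with min_gram 1 and max_gram 20 produces for a single token. The function name edge_ngrams is hypothetical and this is only an imitation of the filter, not Elasticsearch code; it ignores the other filters in the analyzer chain.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of `token` that are min_gram..max_gram characters long."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


form_tokens = edge_ngrams("form")
formulation_tokens = edge_ngrams("formulation")

print(form_tokens)
# The search token "form" is one of the prefixes indexed for "formulation",
# which is why that document matches the query.
print("form" in formulation_tokens)
```

Because every word that starts with form has form among its indexed prefixes, a match query for form finds all of them.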

my config

filter

        "edgengram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }

analyzer

      "analyzer": {
        "abc_vocab_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "lowercase",
            "asciifolding",
            "edgengram_filter",
            "unique"
          ]
        },
        "abc_vocab_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "lowercase",
            "asciifolding",
            "unique"
          ]
        }
      }

mapping

        "word": {
          "type": "text",
          "analyzer": "abc_vocab_analyzer",
          "search_analyzer": "abc_vocab_search_analyzer"
        }

2

You get these results because you've applied an edge-ngram filter, and form is a prefix of every one of those words. In the inverted index, the token form therefore also points to the documents that contain formulation, formal, and so on.
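As a rough illustration of that inverted index, the Python sketch below maps each edge n-gram to the ids of the documents that contain it. The helper names, the sample documents, and the simplification to single-word, lowercased values are all assumptions made for the example; a real index is built by the full analyzer chain.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of `token` that are min_gram..max_gram characters long."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_inverted_index(docs):
    """Map each indexed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, word in docs.items():
        for token in edge_ngrams(word.lower()):
            index[token].add(doc_id)
    return index


# Hypothetical sample documents: id -> value of the `word` field.
docs = {1: "form", 2: "formal", 3: "formulation", 4: "table"}
index = build_inverted_index(docs)

# Every document whose word starts with "form" is listed under the token "form".
print(sorted(index["form"]))
```

The posting list for form contains the exact match and the longer words alike, so the lookup itself cannot tell them apart.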

Relevance is therefore computed for all of those documents as well. You can refer to this link, and I'd specifically suggest reading the sections Default Similarity and BM25. Although the current default similarity is BM25, that link will help you understand how scoring works.
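For orientation, here is a small Python sketch of the textbook BM25 score for a single query term. It is not Elasticsearch's implementation: the function name is hypothetical, the statistics passed in are made-up numbers, and k1 = 1.2 and b = 0.75 are the commonly documented default parameters.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document."""
    # Rarer terms (small doc_freq relative to n_docs) get a larger idf.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average fields are penalised through this length normalisation.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Made-up statistics: 100 documents, 20 of which contain the term.
exact = bm25_term_score(tf=1, doc_len=1, avg_doc_len=1.2, n_docs=100, doc_freq=20)
prefix = bm25_term_score(tf=1, doc_len=1, avg_doc_len=1.2, n_docs=100, doc_freq=20)
longer = bm25_term_score(tf=1, doc_len=2, avg_doc_len=1.2, n_docs=100, doc_freq=20)

print(exact == prefix)  # same inputs, same score
print(longer < exact)   # a longer field scores lower for the same term
```

Nothing in this formula knows whether the matched token was the whole word or just a prefix of it. If two documents have the same term frequency and the same field length, they get the same score for form, so an exact match gets no advantage from scoring alone.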

You need to create a sibling field that you can query in a should clause. You could create a keyword sub-field and use a Term Query on it, but then you have to be careful about case sensitivity.

Instead, as mentioned by @Val, you can create a text sub-field that uses the standard analyzer.

Mapping:

  {
    "word": {
      "type": "text",
      "analyzer": "abc_vocab_analyzer",
      "search_analyzer": "abc_vocab_search_analyzer",
      "fields": {
        "standard": {
          "type": "text"
        }
      }
    }
  }

Query (note the should clause):

POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "word": "form"
          }
        }
      ],
      "should": [
        {
          "match": {
            "word.standard": "form"
          }
        }
      ]
    }
  }
}
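The effect of this query can be sketched in Python. A bool query adds up the scores of the matching must and should clauses, so a document that also matches word.standard ends up with a higher total. The helper names and the constant clause scores of 1.0 are assumptions for illustration (Elasticsearch would compute a BM25 score per clause), and whitespace splitting stands in for the standard tokenizer.

```python
def index_tokens(text, min_gram=1, max_gram=20):
    """Edge n-grams of every whitespace-separated token (stand-in for the `word` field)."""
    tokens = set()
    for token in text.lower().split():
        upper = min(len(token), max_gram)
        tokens.update(token[:n] for n in range(min_gram, upper + 1))
    return tokens


def bool_score(text, query, must_score=1.0, should_score=1.0):
    """Return None if the must clause fails, else the sum of the matching clause scores."""
    query = query.lower()
    if query not in index_tokens(text):
        return None                    # must: the edge-ngram field has to match
    score = must_score
    if query in text.lower().split():  # should: whole-token match, as on `word.standard`
        score += should_score
    return score


words = ["formulation", "formal", "form", "in form", "table"]
scored = [(w, bool_score(w, "form")) for w in words]
ranked = sorted(
    (item for item in scored if item[1] is not None),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

In this sketch, form and in form receive the extra should score and move ahead of formulation and formal, while table is filtered out by the must clause.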

Let me know if this helps!

13

This looks like an issue in your custom analyzer. I created my own autocomplete analyzer, which uses the edge_ngram and lowercase filters, and for your query it returns the exact match on top. As I understand it, Elasticsearch favors exact token matches by default, so there should be no need to create another field and boost it explicitly.

Index definition

{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", 
        "search_analyzer": "standard" 
      }
    }
  }
}

Index a few docs

{
   "title" : "formless"
}

{
   "title" : "form"
}

{
   "title" : "formulation"
}

Search query on title field as provided in the question

{
  "query": {
    "match": {
      "title": "form"
    }
  }
}

Search result, with the exact match returned first

"hits": [
         {
            "_index": "so-60523240-score",
            "_type": "_doc",
            "_id": "1",
            "_score": 0.16410133,
            "_source": {
               "title": "form"
            }
         },
         {
            "_index": "so-60523240-score",
            "_type": "_doc",
            "_id": "2",
            "_score": 0.16410133,
            "_source": {
               "title": "formulation"
            }
         },
         {
            "_index": "so-60523240-score",
            "_type": "_doc",
            "_id": "3",
            "_score": 0.16410133,
            "_source": {
               "title": "formaldehyde"
            }
         },
         {
            "_index": "so-60523240-score",
            "_type": "_doc",
            "_id": "4",
            "_score": 0.16410133,
            "_source": {
               "title": "formless"
            }
         }
      ]

1

This happens because the field type is text, which means ES runs full-text analysis on the field. An ES search finds the results that are most similar to the word you supplied.
To search for the exact word "form", change your search method to match_phrase.
You can also read the articles below to learn more about the different ES search methods:
https://www.cnblogs.com/yjf512/p/4897294.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
