How to make Elasticsearch asciifolding filter handle multiple diacritics per character?

The ASCII folding token filter, per the documentation,

Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (first 127 ASCII characters) to their ASCII equivalent, if one exists.

I am using this filter to strip diacritics from characters, and it works well in basic cases; áccènt, for instance, is folded to accent, as the call below shows.
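
To confirm the basic behaviour, here is a transient analysis call with a precomposed á and è (the same request shape I use throughout):

GET http://127.0.0.1:9200/_analyze

{
  "text": "áccènt",
  "filter": ["asciifolding"]
}

The token comes back as plain accent, exactly as advertised.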

It seems to fail, however, for complex combinations of diacritics. Say we have the string “pīr púr pā́r”, with each token’s vowel carrying a different diacritic; the vowel of the last token carries two at once (a macron and an acute). The result I expect is pir pur par, but…

GET http://127.0.0.1:9200/_analyze

{
  "text":"pīr púr pā́r",
  "filter": ["asciifolding"]
}

Response:

{
  "tokens": [
    {
      "token": "pir pur pár",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 0
    }
  ]
}

…so it leaves the ´ accent on the a. Judging by the end_offset of 12, the ā́ in the input must be the precomposed ā (U+0101) followed by a combining acute accent (U+0301): asciifolding folds the precomposed ā to a but passes the standalone combining mark through untouched, so the á in the output is really a + U+0301. This gets quite nasty when searching, as a query for pár (typed with the precomposed á, U+00E1) folds completely to par and would therefore yield no results.
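
To make the mismatch concrete, here is the same transient analysis run on a query-side pár typed with the precomposed á (U+00E1):

GET http://127.0.0.1:9200/_analyze

{
  "text": "pár",
  "filter": ["asciifolding"]
}

This one should come back as plain par, which no longer matches the indexed a + U+0301 sequence.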

Is there any way of stripping diacritics more thoroughly? I would really love to find a way to do this without implementing a plugin. I would also rather not play around with a combination of character decomposition and regex-based filters, as that would likely be expensive (a sketch of what I have in mind follows below). But if it’s really necessary, I’d be open to ideas.
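
For reference, the regex half of that workaround might look something like the index settings below. This is only a minimal sketch: my_index, strip_combining_marks and folded are made-up names, and it assumes the pattern_replace token filter accepts the Java regex category \p{M} (combining marks):

PUT http://127.0.0.1:9200/my_index

{
  "settings": {
    "analysis": {
      "filter": {
        "strip_combining_marks": {
          "type": "pattern_replace",
          "pattern": "\\p{M}",
          "replacement": ""
        }
      },
      "analyzer": {
        "folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "strip_combining_marks"]
        }
      }
    }
  }
}

Running asciifolding first would fold the precomposed characters, and pattern_replace would then sweep up any standalone combining marks that survive; but it applies a regex to every token, and a fuller version would presumably also want a Unicode decomposition step first (e.g., icu_normalizer from the analysis-icu plugin), which brings back the plugin dependency.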
