How to filter by the size of an array in nested type?

  Kiến thức lập trình

Let’s say I have the following type:

{
    "2019-11-04": {
        "mappings": {
            "_doc": {
                "properties": {
                    "labels": {
                        "type": "nested",
                        "properties": {
                            "confidence": {
                                "type": "float"
                            },
                            "created_at": {
                                "type": "date",
                                "format": "strict_date_optional_time||date_time||epoch_millis"
                            },
                            "label": {
                                "type": "keyword"
                            },
                            "updated_at": {
                                "type": "date",
                                "format": "strict_date_optional_time||date_time||epoch_millis"
                            },
                            "value": {
                                "type": "keyword",
                                "fields": {
                                    "numeric": {
                                        "type": "float",
                                        "ignore_malformed": true
                                    }
                                }
                            }
                        }
                    },
                    "params": {
                        "type": "object"
                    },
                    "type": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

And I want to filter by the size/length of the labels array. I’ve tried the following (as the official docs suggest):

{
    "query": {
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "source": "doc['labels'].size > 10"
                    }
                }
            }
        }
    }
}

but I keep getting:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "runtime error",
        "script_stack": [
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
          "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
          "doc['labels'].size > 10",
          "    ^---- HERE"
        ],
        "script": "doc['labels'].size > 10",
        "lang": "painless"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "2019-11-04",
        "node": "kk5MNRPoR4SYeQpLk2By3A",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "script_stack": [
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
            "org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
            "doc['labels'].size > 10",
            "    ^---- HERE"
          ],
          "script": "doc['labels'].size > 10",
          "lang": "painless",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "No field found for [labels] in mapping with types []"
          }
        }
      }
    ]
  },
  "status": 500
}

2

I’m afraid that is not something possible, because the field labels is not a field that ES saves or albiet creates an inverted index on.

Doc doc['fieldname'] is only applicable on the fields on which inverted index is created and Elasticsearch’s Query DSL too only works on fields on which inverted index gets created and unfortunately nested type is not a valid field on which inverted index is created.

Having said so, I have the below two ways of doing this.

For the sake of simplicity, I’ve created sample mapping, documents and two possible solutions which may help you.

Mapping:

PUT my_sample_index
{
  "mappings": {
    "properties": {
      "myfield": {
        "type": "nested",
        "properties": {
          "label": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Sample Documents:

// single field inside 'myfield'
POST my_sample_index/_doc/1
{
  "myfield": {                              
    "label": ["New York", "LA", "Austin"]   
  }
}


// two fields inside 'myfield' 
POST my_sample_index/_doc/2
{                                          
  "myfield": {                             
    "label": ["London", "Leicester", "Newcastle", "Liverpool"],
    "country": "England"
  }
}

Solution 1: Using Script Fields (Managing at Application Level)

I have a workaround to get what you want, well not exactly but would help you filter out on your service layer or application.

POST my_sample_index/_search
{
  "_source": "*", 
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ]
    }
  },
  "script_fields": {
    "label_size": {
        "script": {
            "lang": "painless",
            "source": "params['_source']['labels'].size() > 1"
        }
    }
  }
}

You would notice that in response a separate field label_size gets created with true or false value.

A sample response is something like below:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_sample_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "myfield" : {
            "label" : [
              "New York",
              "LA",
              "Austin"
            ]
          }
        },
        "fields" : {
          "label_size" : [              <---- Scripted Field
            false
          ]
        }
      },
      {
        "_index" : "my_sample_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "myfield" : {
            "country" : "England",
            "label" : [
              "London",
              "Leicester",
              "Newcastle",
              "Liverpool"
            ]
          }
        },
        "fields" : {                  <---- Scripted Field
          "label_size" : [
            true                      <---- True because it has two fields 'labels' and 'country'
          ]
        }
      }
    ]
  }
}

Note that only second document makes sense as it has two fields i.e. country and labels. However if you only want the docs with label_size with true, that’d would have to be managed at your application layer.

Solution 2: Reindexing with labels.size using Script Processor

Create a new index as below:

PUT my_sample_index_temp
{
  "mappings": {
    "properties": {
      "myfield": {
        "type": "nested",
        "properties": {
          "label": {
            "type": "keyword"
          }
        }
      },
      "labels_size":{             <---- New Field where we'd store the size
        "type": "integer"
      }
    }
  }
}

Create the below pipeline:

PUT _ingest/pipeline/set_labels_size
{
  "description": "sets the value of labels size",
  "processors": [
      {
        "script": {
          "source": """
            ctx.labels_size = ctx.myfield.size();
          """
        }
      }
    ]
}

Use Reindex API to reindex from my_sample_index index

POST _reindex
{
  "source": {
    "index": "my_sample_index"
  },
  "dest": {
    "index": "my_sample_index_temp",
    "pipeline": "set_labels_size"
  }
}

Verify the documents in my_sample_index_temp using GET my_sample_index_temp/_search

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_sample_index_temp",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "labels_size" : 1,           <---- New Field Created 
          "myfield" : {
            "label" : [
              "New York",
              "LA",
              "Austin"
            ]
          }
        }
      },
      {
        "_index" : "my_sample_index_temp",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "labels_size" : 2,           <----- New Field Created
          "myfield" : {
            "country" : "England",
            "label" : [
              "London",
              "Leicester",
              "Newcastle",
              "Liverpool"
            ]
          }
        }
      }
    ]
  }
}

Now you can simply use this field labels_size in your query and its way easier and not to mention efficient.

Hope this helps!

You can solve it with a custom score approach:

GET 2019-11-04/_search
{
  "min_score": 0.1,
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "params['_source']['labels'].length > 10 ? 1 : 0"
            }
          }
        }
      ]
    }
  }
}

LEAVE A COMMENT