How to select buckets of aggregation results based on top hit document attribute? - elasticsearch

I am running the following Elasticsearch query and get the response shown below. Now I want to select buckets based on the "source" field of the top hit document.
POST /data/_search?size=0
{
"aggs":{
"by_partyIds":{
"terms":{
"field":"id.keyword"
},
"aggs":{
"oldest_record":{
"top_hits":{
"sort":[
{
"createdate.keyword":{
"order":"asc"
}
}
],
"_source":[
"source"
],
"size":1
}
}
}
}
}
}
Response :
{
"aggregations": {
"by_partyIds": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 3,
"oldest_record": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "data",
"_type": "osr",
"_id": "DcagSm4B9WnM0Ke-MgGk",
"_score": null,
"_source": {
"source": "US"
},
"sort": [
"20-09-18 05:45:26.000000000AM"
]
}
]
}
}
},
{
"key": "2",
"doc_count": 3,
"oldest_record": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "data",
"_type": "osr",
"_id": "7caiSm4B9WnM0Ke-HwGx",
"_score": null,
"_source": {
"source": "UK"
},
"sort": [
"22-09-18 05:45:26.000000000AM"
]
}
]
}
}
}
]
}
}
}
Now I want to get only the buckets whose top hit has "US" as its source. Can we write a query for that? I tried the bucket selector aggregation, but according to the documentation it is a parent pipeline aggregation which executes a script that determines whether the current bucket will be retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value. If the script language is expression then a numeric return value is permitted. In this case 0.0 will be evaluated as false and all other values will evaluate to true.
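Since the quoted documentation says the specified metric must be numeric, the string "source" value from top_hits does not look usable in that script. One fallback is to filter the buckets on the client once the response arrives. The following is a minimal Python sketch of that idea, assuming the response has the shape shown above. The function name buckets_with_top_source is hypothetical, and this is a post-processing step, not an Elasticsearch feature.

```python
def buckets_with_top_source(response, wanted,
                            agg="by_partyIds", sub_agg="oldest_record"):
    """Return the buckets whose first top hit has _source.source == wanted."""
    selected = []
    for bucket in response["aggregations"][agg]["buckets"]:
        hits = bucket[sub_agg]["hits"]["hits"]
        # top_hits was requested with size 1, so only the first hit matters.
        if hits and hits[0]["_source"].get("source") == wanted:
            selected.append(bucket)
    return selected


# Trimmed copy of the response from the question.
sample_response = {
    "aggregations": {
        "by_partyIds": {
            "buckets": [
                {"key": "1", "doc_count": 3, "oldest_record": {"hits": {"hits": [
                    {"_id": "DcagSm4B9WnM0Ke-MgGk", "_source": {"source": "US"}}
                ]}}},
                {"key": "2", "doc_count": 3, "oldest_record": {"hits": {"hits": [
                    {"_id": "7caiSm4B9WnM0Ke-HwGx", "_source": {"source": "UK"}}
                ]}}},
            ]
        }
    }
}

us_buckets = buckets_with_top_source(sample_response, "US")
print([b["key"] for b in us_buckets])  # -> ['1']
```

The drawback is that every bucket still travels over the wire, so this only suits result sets that fit comfortably in one response.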

Related

Remove results with same id from Elasticsearch search result

Let's assume we have a search result with 3 documents. Two of them share a key attribute (product-ID or similar).
Is it possible to remove duplicates from the search result by using Elasticsearch, so that only 2 documents would be returned in that case? I don't want to implement this in application logic as I would still like to use pagination, aggregation, etc. It does not matter which of the two documents with the same id is removed.
Thanks,
Philipp
Edit:
This would be the example in Elasticsearch:
PUT /tmp_pd_articles
{
"mappings": {
"properties": {
"name": { "type": "text" },
"articleNumber": { "type": "keyword" }
}
}
}
PUT /tmp_pd_articles/_doc/1
{
"name": "My Book 1",
"articleNumber": "A9781"
}
PUT /tmp_pd_articles/_doc/2
{
"name": "My Book 1 (with some other title)",
"articleNumber": "A9781"
}
PUT /tmp_pd_articles/_doc/3
{
"name": "My Book 2",
"articleNumber": "A9782"
}
GET /tmp_pd_articles/_search
{
"query": { "match_all": {} }
}
The goal is to write a query that returns only two articles instead of all three:
#1 ("A9781", "My Book 1") OR #2 ("A9781", "My Book 1 (with some other title)") AND
#3 ("A9782", "My Book 2")
This reduction should be applied because #1 and #2 share the same articleNumber "A9781". I wonder whether there is an Elasticsearch query to accomplish this goal.
Yes, it's possible using the top_hits aggregation. Please use the query below to filter the data. Note that I tested it on your mapping and sample data, and it returns your expected data. "size": 0 returns only the aggregation data; remove it if you also want the regular hits for all 3 documents.
{
"size": 0,
"aggs": {
"dedup": {
"terms": {
"field": "articleNumber"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1
}
}
}
}
}
}
Search result:
"aggregations": {
"dedup": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "A9781",
"doc_count": 2,
"dedup_docs": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "tmp_pd_articles",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "My Book 1",
"articleNumber": "A9781"
}
}
]
}
}
},
{
"key": "A9782",
"doc_count": 1,
"dedup_docs": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "tmp_pd_articles",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"name": "My Book 2",
"articleNumber": "A9782"
}
}
]
}
}
}
]
}
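To use the deduplicated documents in application code, the top hit of each bucket can be flattened into a plain list. The following is a minimal Python sketch, assuming the response shape shown above. The helper name deduplicated_sources is hypothetical.

```python
def deduplicated_sources(response, agg="dedup", sub_agg="dedup_docs"):
    """Collect the _source of every top hit, one per articleNumber bucket."""
    docs = []
    for bucket in response["aggregations"][agg]["buckets"]:
        for hit in bucket[sub_agg]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs


# Trimmed copy of the search result from the answer.
sample_response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {"key": "A9781", "doc_count": 2, "dedup_docs": {"hits": {"hits": [
                    {"_id": "1", "_source": {"name": "My Book 1",
                                             "articleNumber": "A9781"}}
                ]}}},
                {"key": "A9782", "doc_count": 1, "dedup_docs": {"hits": {"hits": [
                    {"_id": "3", "_source": {"name": "My Book 2",
                                             "articleNumber": "A9782"}}
                ]}}},
            ]
        }
    }
}

for doc in deduplicated_sources(sample_response):
    print(doc["articleNumber"], doc["name"])
```

Keep in mind that a terms aggregation returns a limited number of buckets (10 by default), so larger result sets need an explicit "size" on the terms aggregation.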

Elasticsearch - Find documents missing two fields

I'm trying to create a query that returns how many documents have no data for two fields (date.new and date.old). I have tried the query below, but it works as OR logic: all documents missing either date.new or date.old are returned. Does anyone know how I can make it return only documents missing both fields?
{
"aggs":{
"Missing_field_count1":{
"missing":{
"field":"date.new"
}
},
"Missing_field_count2":{
"missing":{
"field":"date.old"
}
}
}
}
Aggregations are not the right feature for this. You need to use the exists query wrapped within a bool/must_not query. Note that the bool clause must sit under a "query" key, and the _count API does not take a "size" parameter:
GET index/_count
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "date.new"
}
},
{
"exists": {
"field": "date.old"
}
}
]
}
}
}
In a search response, hits.total.value indicates the number of documents that match the search request, and relation indicates whether that value is accurate (eq) or a lower bound (gte).
Index Data:
{
"date": {
"new": 1501,
"old": 10
}
}
{
"title": "elasticsearch"
}
{
"title": "elasticsearch-query"
}
{
"date": {
"new": 1400
}
}
The search query given by @Val shows how to achieve your use case.
Search Result:
"hits": {
"total": {
"value": 2, <-- note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "65112793",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"title": "elasticsearch"
}
},
{
"_index": "65112793",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"title": "elasticsearch-query"
}
}
]
}
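The must_not logic can be checked locally before running it against a cluster. The following is a minimal Python sketch that builds the request body for any list of fields and applies an approximation of the same rule to the sample documents. The names missing_all_query and has_value are hypothetical, and has_value only approximates the exists query (a dotted path that resolves to a non-null value).

```python
def missing_all_query(fields):
    """Build a bool/must_not body matching documents that lack every field."""
    return {
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": f}} for f in fields]
            }
        }
    }


def has_value(doc, path):
    """Rough local stand-in for `exists`: the dotted path holds a non-null value."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return current is not None


# Sample documents from the answer; the first one is assumed to be keyed by "date".
docs = [
    {"date": {"new": 1501, "old": 10}},
    {"title": "elasticsearch"},
    {"title": "elasticsearch-query"},
    {"date": {"new": 1400}},
]
fields = ["date.new", "date.old"]

# A document is "missing both" when none of the fields has a value.
missing_both = [d for d in docs if not any(has_value(d, f) for f in fields)]
print(len(missing_both))  # -> 2
```

The document that has only date.new is excluded, which is the AND behaviour the question asks for.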

Elasticsearch - Any way to find out all the documents with field value as text

In my Elasticsearch cluster, I accidentally indexed some text in a field that should be a number. I later fixed that and indexed numeric values. Now I want to replace all the old values with a number, for which I need to find all the documents that have text in this field.
Is there any Elasticsearch query that I can use to get this information?
I think that is possible using nested aggregations.
At the top level, use a terms aggregation to find the text values; at the sub-level, use a top_hits aggregation to get the documents that contain these values.
For instance:
GET example_index/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "example_field.keyword",
"size": 10
},
"aggs": {
"documents": {
"top_hits": {
"size": 10
}
}
}
}
}
}
This query will return the distinct values of the field, with the related documents at the sub-level, something like:
{
"aggregations": {
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "mistake",
"doc_count": 2,
"documents": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "example_index",
"_type": "example_index",
"_id": "2QoDoXEBOCkJkkpwq5P0",
"_score": 1,
"_source": {
"example_field": "mistake"
}
},
{
"_index": "example_index",
"_type": "example_index",
"_id": "qAoDoXEBOCkJkkpwq5T0",
"_score": 1,
"_source": {
"example_field": "mistake"
}
}
]
}
}
},
{
"key": "520",
"doc_count": 2,
"documents": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example_index",
"_type": "example_index",
"_id": "5goDoXEBOCkJkkpwq5P0",
"_score": 1,
"_source": {
"example_field": "520"
}
}
]
}
}
}
]
}
}
}
In the example above, we need to delete the documents with the value "mistake"; you can simply delete them by id.
NOTE: if you have a big index, it's better to write a function in your code that builds the aggregation, gets the response, keeps only the values that cannot be parsed as a number, and then removes those documents by id.
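The filtering step in that note could look like the following Python sketch. It assumes the aggregation names NAME and documents from the query above. The function names are hypothetical, and the actual delete call is left out because it depends on the client in use.

```python
def is_number(text):
    """True if the bucket key parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def ids_with_text_values(response, agg="NAME", sub_agg="documents"):
    """Collect the _id of every returned hit whose bucket key is not numeric."""
    ids = []
    for bucket in response["aggregations"][agg]["buckets"]:
        if is_number(str(bucket["key"])):
            continue  # numeric value, nothing to clean up
        ids.extend(hit["_id"] for hit in bucket[sub_agg]["hits"]["hits"])
    return ids


# Trimmed copy of the example response.
sample_response = {
    "aggregations": {
        "NAME": {
            "buckets": [
                {"key": "mistake", "documents": {"hits": {"hits": [
                    {"_id": "2QoDoXEBOCkJkkpwq5P0"},
                    {"_id": "qAoDoXEBOCkJkkpwq5T0"},
                ]}}},
                {"key": "520", "documents": {"hits": {"hits": [
                    {"_id": "5goDoXEBOCkJkkpwq5P0"},
                ]}}},
            ]
        }
    }
}

print(ids_with_text_values(sample_response))
```

Two caveats apply: float() also accepts strings such as "nan" and "inf", which may need to be treated as text, and top_hits only returns up to its "size" documents per bucket, so large buckets need several passes.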

Terms Aggregation return multiple fields (min_doc_count: 0)

I'm making a Terms Aggregation but I want to return multiple fields. I want a user to select buckets via "slug" (my-name), but show the actual "name" (My Name).
At this moment I'm making a TopHits SubAggregation like this:
"organisation": {
"aggregations": {
"label": {
"top_hits": {
"_source": {
"includes": [
"organisations.name"
]
},
"size": 1
}
}
},
"terms": {
"field": "organisations.slug",
"min_doc_count": 0,
"size": 20
}
}
This gives the desired result when my whole query actually finds some buckets/results.
As you can see, I've set min_doc_count to 0, which returns buckets with a doc count of 0. The problem I'm facing is that the TopHits response is empty for those buckets, so I cannot render the proper name to the client.
Example response:
"organisation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "my-name",
"doc_count": 27,
"label": {
"hits": {
"total": 27,
"max_score": 1,
"hits": [
{
"_index": "users",
"_type": "doc",
"_id": "4475",
"_score": 1,
"_source": {
"organisations": [
{
"name": "My name"
}]
}
}]
}
}
},
{
"key": "my-name-2",
"doc_count": 0,
"label": {
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
},
.....
Has anyone accomplished this? I feel like TopHits won't help me here. It should always fetch the name.
What I've also tried:
Working with a terms sub aggregation. (same result)
Working with a significant terms sub aggregation. (same result)
What I think could be a solution, but feels dirty:
Index a new field with "organisations.slug___organisations.name" and work the magic via this.
Manually query the name field where the count is 0 (i.e. where TopHits is empty)
Kind regards,
Thanks in advance
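The combined-field idea from the list above can be sketched as follows. This is only an illustration of that workaround, assuming a new keyword field that stores the slug and the name joined by "___" and is used as the terms field, so that empty buckets still carry the name in their key. The helper names, and the name "My name 2" in the sample, are hypothetical.

```python
SEPARATOR = "___"


def make_label_key(slug, name):
    """Value to index into the combined field (at indexing time)."""
    return f"{slug}{SEPARATOR}{name}"


def split_label_key(key):
    """Recover (slug, name) from a bucket key; assumes the slug never contains '___'."""
    slug, _, name = key.partition(SEPARATOR)
    return slug, name


# Buckets as a terms aggregation on the combined field might return them.
buckets = [
    {"key": make_label_key("my-name", "My name"), "doc_count": 27},
    {"key": make_label_key("my-name-2", "My name 2"), "doc_count": 0},
]

for bucket in buckets:
    slug, name = split_label_key(bucket["key"])
    print(slug, "|", name, "|", bucket["doc_count"])
```

Because the name travels in the bucket key itself, it is available even when doc_count is 0 and the TopHits sub aggregation is empty.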

Elasticsearch OR query with nested objects returns inner_hits not matching the criteria

I'm getting weird results when querying nested objects. Imagine the following structure:
{ owner.name = "fred",
...,
pets [
{ name = "daisy", ... },
{ name = "flopsy", ... }
]
}
If I only have the document shown above, and I search for pets matching these criteria:
pets.name = "daisy" OR
(owner.name = "julie" and pets.name = "flopsy")
I would expect to get only one result ("daisy"), but I'm getting both pet names.
This is one way to reproduce this:
# Create nested mapping
PUT pet-owners
{
"mappings": {
"animals": {
"properties": {
"owner": {"type": "text"},
"pets": {
"type": "nested",
"properties": {
"name": {"type": "text", "fielddata": true}
}
}
}
}
}
}
# Insert nested object
PUT pet-owners/animals/1?op_type=create
{
"owner" : "fred",
"pets" : [
{ "name" : "daisy"},
{ "name" : "flopsy"}
]
}
# Query
GET pet-owners/_search
{ "from": 0, "size": 50,
"query": {
"constant_score": {
"filter": { "bool": {"must": [
{"bool": {"should": [
{"nested": {"query":
{"term": {"pets.name": "daisy"}},
"path":"pets",
"inner_hits": {
"name": "pets_hits_1",
"size": 99,
"_source": false,
"docvalue_fields": ["pets.name"]
}
}},
{"bool": {"must": [
{"term": {"owner": "julie"}},
{"nested": {"query":
{"term": {"pets.name": "flopsy"}},
"path":"pets",
"inner_hits": {
"name": "pets_hits_2",
"size": 99,
"_source": false,
"docvalue_fields": ["pets.name"]
}
}}
]}}
]}}
]}}}},
"_source": false
}
The query returns both pet names (as opposed to the expected one).
Is this behavior normal? Am I doing something wrong, or is my reasoning about the nested structure or the query behavior flawed?
Any help or guidance will be much appreciated.
I'm running this query under Elasticsearch 6.3.x
EDIT: I'm adding the response received, to better illustrate the case
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_score": 1,
"inner_hits": {
"pets_hits_1": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_nested": {
"field": "pets",
"offset": 0
},
"_score": 0.6931472,
"fields": {
"pets.name": [
"daisy"
]
}
}
]
}
},
"pets_hits_2": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_nested": {
"field": "pets",
"offset": 1
},
"_score": 0.6931472,
"fields": {
"pets.name": [
"flopsy"
]
}
}
]
}
}
}
}
]
}
}
So we can see that it's not that the query matches and returns the whole existing document, but that it returns each of the pets independently, one inside each of the inner_hits. It's this result that's surprising to me.
(edited) In summary, this issue is about the context of the 'inner_hits':
It looks like the inner_hits 'pets_hits_2' returns a match because it belongs to the nested query that simply searches the pets field for 'flopsy'.
As an independent query on our single document, that is a valid hit.
However, because that query sits within a list of bool/must queries where another query does not match our document, you might well expect the inner_hits to pick up on this and therefore not return a hit.
I haven't been able to find any docs to clarify whether this is intentional behaviour or not. It might be worth raising with Elastic.
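Until that is clarified, one possible client-side workaround is to drop inner_hits groups whose sibling must clauses did not match. The following Python sketch assumes the owner clause is tagged as a named query (with a "_name" such as the hypothetical "owner_julie") so that it shows up in the hit's matched_queries. The function name and the requirements mapping are also hypothetical, and the approach has not been verified against 6.3.x.

```python
def consistent_inner_hits(hit, requirements):
    """Keep inner_hits groups only if all of their required named queries matched."""
    matched = set(hit.get("matched_queries", []))
    kept = {}
    for name, group in hit.get("inner_hits", {}).items():
        required = requirements.get(name, [])
        if all(query_name in matched for query_name in required):
            kept[name] = group
    return kept


# Trimmed copy of the hit from the question; the owner is "fred", so a
# named query "owner_julie" would not be listed in matched_queries.
sample_hit = {
    "_id": "1",
    "matched_queries": [],
    "inner_hits": {
        "pets_hits_1": {"hits": {"hits": [{"fields": {"pets.name": ["daisy"]}}]}},
        "pets_hits_2": {"hits": {"hits": [{"fields": {"pets.name": ["flopsy"]}}]}},
    },
}

# pets_hits_2 is only meaningful when the owner clause matched as well.
requirements = {"pets_hits_2": ["owner_julie"]}

print(list(consistent_inner_hits(sample_hit, requirements)))  # -> ['pets_hits_1']
```

This keeps the query itself unchanged apart from the added name and moves the "both clauses must hold" check into application code.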
