Elasticsearch distinct parent sub-aggregation without nested field

In Elasticsearch 6.2 I have a parent-child relationship:
Document -> NamedEntity
I want to aggregate NamedEntity by the mention field, counting the occurrences of each mention and the number of documents that contain it.
My use case is :
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to have only 2 documents (not 3). Like a select count distinct.
If I aggregate with "terms":{"field":"join.parent"} I have :
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tried a cardinality aggregation on the join field and obtained a value of 1; a cardinality aggregation on join.parent returns a value of 0.
So how do you make an aggregation distinct count on parents without the use of a reverse nested aggregation ?
As @AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document (content) and NamedEntity (mention) in an ES 6 mapping (fields are defined on the same level):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}'
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'

"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}

I post this workaround in case it can help someone. But if someone has a cleaner way of doing this, I'd be interested.
I added a denormalized field in the children that contains a copy of the parent id (the value already in join/parent):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"document_id": {
"type": "keyword"
},
"mention": {
"type": "keyword"
}
}
}
}}'
Then the cardinality aggregate with this new field works as expected :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"cardinality": {
"field" : "document_id"
}
}
}
}}}'
It responds :
...
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"value" : 2
}
}
]
}
}
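For reference, here is a minimal Python sketch of what the cardinality aggregation on document_id is meant to compute, run on the minimal dataset from the question. It is only a local model of the semantics (a "select count distinct" per mention), not how Elasticsearch computes it; the function name count_mentions is made up for this illustration.

```python
from collections import defaultdict

# Minimal dataset from the question: (child _id, parent document _id, mention).
children = [
    ("ner11", "doc1", "NER"),
    ("ner12", "doc1", "NER"),
    ("ner2", "doc2", "NER"),
]


def count_mentions(children):
    """Return {mention: (occurrences, number of distinct parent documents)}."""
    occurrences = defaultdict(int)
    parents = defaultdict(set)
    for _child_id, document_id, mention in children:
        occurrences[mention] += 1
        # A set keeps each parent id once, which gives the distinct count.
        parents[mention].add(document_id)
    return {m: (occurrences[m], len(parents[m])) for m in occurrences}


result = count_mentions(children)
print(result)  # {'NER': (3, 2)}
```

The 3 corresponds to doc_count in the response above and the 2 to docs.value. Keep in mind that the cardinality aggregation is approximate on large sets, while this sketch is exact.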

I recently ran into the same issue on Elasticsearch 7.1, and the additional field "my_join_field#my_parent" that Elasticsearch creates solved it. I am glad I didn't have to add the parent_id to the child document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html#_searching_with_parent_join
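If it helps, the field name follows the pattern "<join field>#<parent relation>", so the request body can be built programmatically. The sketch below only assembles the JSON body in Python; the helper name distinct_parent_agg and its parameters are hypothetical, and the aggregation names mirror the ones used earlier in this thread.

```python
import json


def distinct_parent_agg(terms_field, join_field, parent_relation):
    """Build a terms aggregation with a distinct-parent count per bucket."""
    # Name of the parent-id field, e.g. "join#Document" for the mapping above.
    parent_id_field = f"{join_field}#{parent_relation}"
    return {
        "aggs": {
            "mentions": {
                "terms": {"field": terms_field},
                "aggs": {
                    "uniques": {"cardinality": {"field": parent_id_field}}
                },
            }
        }
    }


body = distinct_parent_agg("mention", "join", "Document")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d payload of the _search request.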

Related

Update elastic search nested field based on query

I have an Elasticsearch index named pollstat with mapping as follows:
{
"pollstat" : {
"mappings" : {
"dynamic" : "false",
"properties" : {
"dt" : {
"properties" : {
"dte" : {
"type" : "date"
},
"is_polled" : {
"type" : "boolean"
}
}
},
"is_profiled" : {
"type" : "boolean"
},
"maid" : {
"type" : "keyword"
}
}
}
}
}
The above index is created using:
curl -XPUT "http://localhost:9200/pollstat" -H 'Content-Type: application/json' -d'
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"maid" : {
"type" : "keyword"
},
"dt" : {
"type" : "object",
"properties": {
"dte" : {"type":"date"},
"is_polled" : {"type":"boolean"}
}
},
"is_profiled" : {
"type" : "boolean"
}
},
"dynamic":false
}
}'
To add data into this index, I am using the following code:
curl -X POST "localhost:9200/pollstat/_doc/?pretty" -H 'Content-Type: application/json' -d'{"maid" : "fans", "dt" : [{"dte": "2022-03-19", "is_polled":true } ], "is_profiled":true } '
This is working.
The requirement is to append to the dt field when a particular maid polls data on a specific date. In this case, if the maid fans polls data on another day, I want to append that entry to the dt field.
I used the following code, which takes the document id to update the document.
curl -X POST "localhost:9200/pollstat/_doc/hQh4oH8BPfXX63hBUbPN/_update?pretty" -H 'Content-Type: application/json' -d'{"script": {"source": "ctx._source.dt.addAll(params.dt)", "params": {"dt": [{ "dte": "2019-07-16", "is_polled": true }, { "dte": "2019-07-17", "is_polled": false } ] } } } '
This is also working
However, my application does not have visibility to the document id but gets the maid. The maid is also as unique as the document id. Hence to update a specific maid, I was trying to do the same with a query on maid.
I used the following code:
curl -X POST "localhost:9200/pollstat/_update_by_query?pretty" -H 'Content-Type: application/json' -d'{"query": {"match": { "maid": "fans" }}, "script": {"source": "ctx._source.dt.addAll(params.dt)", "params": {"dt": [{ "dte": "2019-07-18", "is_polled": true }, { "dte": "2019-07-19", "is_polled": false } ] } } }'
This code executes without an error and I am getting the following update status as well:
{
"took" : 8,
"timed_out" : false,
"total" : 1,
"updated" : 1,
"deleted" : 0,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
However my index is not getting updated.
Since the maid field has type keyword, I had to use a term query instead of a match query. The final query is as follows:
curl -X POST "localhost:9200/pollstat/_update_by_query?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"term": { "maid": "fans" }},
"script": {
"source": "ctx._source.dt.addAll(params.dt)",
"params": {
"dt": [
{ "dte": "2019-07-18", "is_polled": true },
{ "dte": "2019-07-19", "is_polled": false }
]
}
}
}
'
Posting this answer for others' reference.
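As a side note, building the body in code and serializing it avoids brace mistakes in hand-written JSON. This is a rough Python sketch, not part of the original solution: the helper name append_dt_body is hypothetical, and the last lines only model locally what addAll does to one matching document.

```python
import json


def append_dt_body(maid, new_entries):
    """Build an _update_by_query body that appends entries to dt."""
    # "query" and "script" are siblings at the top level of the body.
    return {
        "query": {"term": {"maid": maid}},
        "script": {
            "source": "ctx._source.dt.addAll(params.dt)",
            "params": {"dt": new_entries},
        },
    }


body = append_dt_body("fans", [
    {"dte": "2019-07-18", "is_polled": True},
    {"dte": "2019-07-19", "is_polled": False},
])
payload = json.dumps(body)  # string to send as the request body
print(payload)

# Local model of the script's effect on the document indexed earlier.
source = {
    "maid": "fans",
    "dt": [{"dte": "2022-03-19", "is_polled": True}],
    "is_profiled": True,
}
source["dt"].extend(body["script"]["params"]["dt"])  # Python analogue of addAll
print(len(source["dt"]))  # 3
```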

Group and count by array of objects' keys

Given the following index definition and query:
curl -XDELETE "localhost:9200/products"
curl -XPUT "localhost:9200/products"
curl -XPUT "localhost:9200/products/_mapping" -H 'Content-Type: application/json' -d'
{
"properties": {
"opinions": {
"type": "nested",
"properties": {
"topic": {"type": "keyword"},
"count": {"type": "long"}
},
"include_in_parent": true
}
}
}'
curl -X POST "localhost:9200/products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"opinions":[{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]}
{"index":{"_id":2}}
{"opinions":[{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]}
'
sleep 2
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
'
Produces the result:
"aggregations" : {
"per_topic" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "room",
"doc_count" : 2,
"counts" : {
"value" : 5.0
}
},
{
"key" : "kitchen",
"doc_count" : 1,
"counts" : {
"value" : 3.0
}
},
{
"key" : "restroom",
"doc_count" : 1,
"counts" : {
"value" : 2.0
}
}
]
}
}
}
I'm expecting the sum for room to be 3, kitchen to be 1 and restroom to be 1, counting only the related nested documents, but instead it sums all the nested count fields of every matched document.
How can I sum only the matched aggregated nested documents?
UPDATE: solution based on comments
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"opinions": {
"nested": {"path": "opinions"},
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
}
}
'
The main initial problem was the use of object fields instead of nested fields: only nested fields preserve the structure [{"room", 2}, {"kitchen", 1}], whereas in object fields the data is flattened to {["room", "kitchen"], [2, 1]}, with no relationship between "room" and 2.
Unfortunately, at the moment it is not possible to use the SQL API to group by (some?) nested fields, but it is possible to write a native Elasticsearch query using nested aggregations.
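To make the flattening concrete, here is a small Python model of the two behaviours on the two sample products. It is a sketch of the semantics only (the function names are made up), but it reproduces both the 5/3/2 result from the question and the expected 3/1/1.

```python
from collections import defaultdict

docs = [
    {"opinions": [{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]},
    {"opinions": [{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]},
]


def sum_flattened(docs):
    """Object-style view: topics and counts of a document are unpaired."""
    totals = defaultdict(int)
    for doc in docs:
        topics = {o["topic"] for o in doc["opinions"]}
        doc_total = sum(o["count"] for o in doc["opinions"])
        for topic in topics:
            # Every count of the document is added to each of its topics.
            totals[topic] += doc_total
    return dict(totals)


def sum_nested(docs):
    """Nested view: each count stays attached to its own topic."""
    totals = defaultdict(int)
    for doc in docs:
        for o in doc["opinions"]:
            totals[o["topic"]] += o["count"]
    return dict(totals)


flat = sum_flattened(docs)
nested = sum_nested(docs)
print(sorted(flat.items()))    # [('kitchen', 3), ('restroom', 2), ('room', 5)]
print(sorted(nested.items()))  # [('kitchen', 1), ('restroom', 1), ('room', 3)]
```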

Difference of two query results in Elasticsearch

Let's say we have an index of e-commerce store data, and we want to get the difference between the lists of products present in 2 stores.
Information on the index content: a sample document looks like this:
{
"product_name": "sample 1",
"store_slug": "store 1",
"sales_count": 42,
"date": "2018-04-04"
}
Below are the queries that get me all products present in each of the 2 stores individually:
Data for store 1
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_1"}}]}}}}}'
Data for store 2
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_2"}}]}}}}}'
Is it possible to get the difference of both results with an Elasticsearch query (without using a script or another language)?
Example of the above operation: let's say "store 1" is selling products ["product 1", "product 2"] and "store 2" is selling products ["product 1", "product 3"]. The expected output of the difference between the products of "store 1" and "store 2" is "product 2".
Why not do it in a single query?
Products that are in store 1 but not in store 2:
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
"_source": [
"product_name"
],
"query": {
"constant_score": {
"filter": {
"bool": {
"filter": [
{
"term": {
"store_slug": "store_1"
}
}
],
"must_not": [
{
"term": {
"store_slug": "store_2"
}
}
]
}
}
}
}
}'
You can easily do the opposite, too.
UPDATE
After reading your updates, I think the best way to solve this is to use terms aggregations, first by product and then by store, and to select only the products that have a single store bucket (using a pipeline aggregation):
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
"size": 0,
"aggs": {
"products": {
"terms": {
"field": "product_name"
},
"aggs": {
"stores": {
"terms": {
"field": "store_slug"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "stores._bucket_count"
},
"script": {
"source": "params.count == 1"
}
}
}
}
}
}
}'
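To illustrate what the bucket selection keeps, here is a small Python model using the example products from the question. It assumes one row per (product, store) pair and only two stores in the index. Note that "exactly one store bucket" yields the products unique to either store, so the remaining store bucket still tells you on which side of the difference a product falls.

```python
from collections import defaultdict

# Example from the question: store 1 sells products 1 and 2, store 2 sells 1 and 3.
rows = [
    ("product 1", "store_1"),
    ("product 2", "store_1"),
    ("product 1", "store_2"),
    ("product 3", "store_2"),
]

# Outer terms aggregation on product_name, inner terms aggregation on store_slug.
stores_by_product = defaultdict(set)
for product_name, store_slug in rows:
    stores_by_product[product_name].add(store_slug)

# Equivalent of the bucket_selector: keep products with exactly one store bucket.
single_store = sorted(p for p, s in stores_by_product.items() if len(s) == 1)

# One-directional difference: products whose only store is store_1.
only_store_1 = sorted(p for p, s in stores_by_product.items() if s == {"store_1"})

print(single_store)   # ['product 2', 'product 3']
print(only_store_1)   # ['product 2']
```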

Making aggregations in two different types and return it grouped in Elasticsearch

Having this mapping with two types, items_one and items_two:
curl -XPUT 'localhost:9200/tester?pretty=true' -d '{
"mappings": {
"items_one": {
"properties" : {
"type" : {"type": "string",
"index": "not_analyzed"}
}},
"items_two": {
"properties" : {
"other_type" : { "type": "string",
"index": "not_analyzed"}
}}}}'
I put two items on items_one:
curl -XPUT 'localhost:9200/tester/items_one/1?pretty=true' -d '{
"type": "Bank transfer"
}'
curl -XPUT 'localhost:9200/tester/items_one/2?pretty=true' -d '{
"type": "PayPal"
}'
... and another two in items_two:
curl -XPUT 'localhost:9200/tester/items_two/1?pretty=true' -d '{
"other_type": "Cash"
}'
curl -XPUT 'localhost:9200/tester/items_two/2?pretty=true' -d '{
"other_type": "No pay"
}'
How can I make the aggregations in two different fields and return it grouped?
I know I can get it from one field doing:
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"field": "type"
}
}
}
}'
But I can't make it "multi-field" with something like this (which does not work):
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"field": ["type", "other_type"]
}
}
}
}'
My desired output should be:
"aggregations" : {
"paying_types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "Bank transfer",
"doc_count" : 1
}, {
"key" : "PayPal",
"doc_count" : 1
}, {
"key" : "Cash",
"doc_count" : 1
}, {
"key" : "No pay",
"doc_count" : 1
} ]
}
}
}
Thanks in advance
Finally solved it. A script will do the trick:
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"script": "doc[\"type\"].values + doc[\"other_type\"].values"
}
}
}
}'
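For anyone wondering what the script amounts to: it concatenates the values of both fields for each document, and the terms aggregation then buckets the combined list. Below is a rough Python model on the four sample documents. The function name merged_terms is made up, and counting a document once per distinct term is a modelling choice of this sketch.

```python
from collections import Counter

# The four sample documents from the question (two per type).
docs = [
    {"type": "Bank transfer"},
    {"type": "PayPal"},
    {"other_type": "Cash"},
    {"other_type": "No pay"},
]


def merged_terms(docs, fields):
    """Count documents per term, taking the terms from several fields."""
    counts = Counter()
    for doc in docs:
        # Concatenate the values of all requested fields, like the script does.
        values = [doc[field] for field in fields if field in doc]
        # Modelling choice: a document counts once per distinct term.
        counts.update(set(values))
    return counts


buckets = merged_terms(docs, ["type", "other_type"])
for key, doc_count in sorted(buckets.items()):
    print(key, doc_count)
```

Each of the four terms ends up with a doc_count of 1, which matches the desired output in the question.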

Multiple properties in facet (elasticsearch)

I have the following index:
curl -XPUT "http://localhost:9200/test/" -d '
{
"mappings": {
"files": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"owners": {
"type": "nested",
"properties": {
"name": {
"type":"string",
"index":"not_analyzed"
},
"mail": {
"type":"string",
"index":"not_analyzed"
}
}
}
}
}
}
}
'
With sample documents:
curl -XPUT "http://localhost:9200/test/files/1" -d '
{
"name": "first.jpg",
"owners": [
{
"name": "John Smith",
"mail": "js@example.com"
},
{
"name": "Joe Smith",
"mail": "joes@example.com"
}
]
}
'
curl -XPUT "http://localhost:9200/test/files/2" -d '
{
"name": "second.jpg",
"owners": [
{
"name": "John Smith",
"mail": "js@example.com"
},
{
"name": "Ann Smith",
"mail": "as@example.com"
}
]
}
'
curl -XPUT "http://localhost:9200/test/files/3" -d '
{
"name": "third.jpg",
"owners": [
{
"name": "Kate Foo",
"mail": "kf@example.com"
}
]
}
'
And I need to find all owners that match some query, let's say "mit":
curl -XGET "http://localhost:9200/test/files/_search" -d '
{
"facets": {
"owners": {
"terms": {
"field": "owners.name"
},
"facet_filter": {
"query": {
"query_string": {
"query": "*mit*",
"default_field": "owners.name"
}
}
},
"nested": "owners"
}
}
}
'
This gives me the following result:
{
"facets" : {
"owners" : {
"missing" : 0,
"_type" : "terms",
"other" : 0,
"total" : 4,
"terms" : [
{
"count" : 2,
"term" : "John Smith"
},
{
"count" : 1,
"term" : "Joe Smith"
},
{
"count" : 1,
"term" : "Ann Smith"
}
]
}
},
"timed_out" : false,
"hits" : {...}
}
And it's ok.
But what I exactly need is to get owners with their email addresses (for each entry in the facet I need an additional field in the results).
Is it achievable?
I don't think that is possible. Depending on your needs, I would do one of the following:
Create a composite field with both name and email and do the facet on that field, or
Run the query in addition to the facet and extract the addresses from the query result, although this obviously does not scale, or
Do it in two steps: get the facet, build the needed queries and merge the results.
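The first option (a composite field) could look roughly like this. It is a plain Python model on the three sample files, not an Elasticsearch request. The separator, the helper names and the lowercase substring match (a rough stand-in for the *mit* query) are all assumptions of this sketch.

```python
from collections import Counter

SEPARATOR = "|"  # assumption: never appears in a name or an address

files = [
    {"name": "first.jpg", "owners": [
        {"name": "John Smith", "mail": "js@example.com"},
        {"name": "Joe Smith", "mail": "joes@example.com"}]},
    {"name": "second.jpg", "owners": [
        {"name": "John Smith", "mail": "js@example.com"},
        {"name": "Ann Smith", "mail": "as@example.com"}]},
    {"name": "third.jpg", "owners": [
        {"name": "Kate Foo", "mail": "kf@example.com"}]},
]


def composite(owner):
    """Value that would be indexed in the composite field."""
    return owner["name"] + SEPARATOR + owner["mail"]


def facet_owners(files, needle):
    """Facet on the composite value, then split it back into name and mail."""
    counts = Counter(
        composite(owner)
        for f in files
        for owner in f["owners"]
        if needle in owner["name"].lower()
    )
    results = []
    for key, count in counts.most_common():
        name, mail = key.split(SEPARATOR, 1)
        results.append({"name": name, "mail": mail, "count": count})
    return results


owners = facet_owners(files, "mit")
for entry in owners:
    print(entry)
```

Each facet entry then carries the address next to the name, at the cost of indexing one extra field per owner.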
