Group and count by array of objects' keys - elasticsearch

Given the following index definition and query:
curl -XDELETE "localhost:9200/products"
curl -XPUT "localhost:9200/products"
curl -XPUT "localhost:9200/products/_mapping" -H 'Content-Type: application/json' -d'
{
"properties": {
"opinions": {
"type": "nested",
"properties": {
"topic": {"type": "keyword"},
"count": {"type": "long"}
},
"include_in_parent": true
}
}
}'
curl -X POST "localhost:9200/products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"opinions":[{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]}
{"index":{"_id":2}}
{"opinions":[{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]}
'
sleep 2
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
'
Produces the result:
"aggregations" : {
"per_topic" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "room",
"doc_count" : 2,
"counts" : {
"value" : 5.0
}
},
{
"key" : "kitchen",
"doc_count" : 1,
"counts" : {
"value" : 3.0
}
},
{
"key" : "restroom",
"doc_count" : 1,
"counts" : {
"value" : 2.0
}
}
]
}
}
}
I'm expecting the sum of room to be 3, kitchen to be 1 and restroom to be 1, counting only the related nested documents, but instead it is summing all the nested count fields in all the matched the documents.
How can I sum only the matched aggregated nested documents?
UPDATE: solution based on comments
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"opinions": {
"nested": {"path": "opinions"},
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
}
}
'

The main initial problem was the use of object fields instead of nested fields: only using nested fields is it possible to preserve the structure [{"room", 2}, {"kitchen", 1}], as in object fields the data is flattened to {["room", "kitchen"], [1,2]} without relationships between "room" and 2.
Unluckily, at the moment is not possible to use the SQL API to group by (some?) nested fields, but it is possible to write a native Elastic query using nested aggregations.

Related

Difference of two query results in Elasticsearch

Let's say we've indexes of e-commerce store data, and we want to get the difference of list of products which are present in 2 stores.
Information on the index content: A sample data stored in each document looks like below:
{
"product_name": "sample 1",
"store_slug": "store 1",
"sales_count": 42,
"date": "2018-04-04"
}
Below are queries which gets me all products present in 2 stores individually,
Data for store 1
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_1"}}]}}}}}'
Data for store 2
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_2"}}]}}}}}'
Is it possible with elasticsearch query to get the difference of both result(without doing using some script/ other languages)?
E.g. of above operation: Let's say "store 1" is selling products ["product 1", "product 2"] and "store 2" is selling products ["product 1", "product 3"], So expected output of difference of products of "store 1" and "store 2" is "product 2".
Why not doing it in a single query?
Products that are in store 1 but not in store 2:
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
"_source": [
"product_name"
],
"query": {
"constant_score": {
"filter": {
"bool": {
"filter": [
{
"term": {
"store_slug": "store_1"
}
}
],
"must_not": [
{
"term": {
"store_slug": "store_2"
}
}
]
}
}
}
}
}'
You can easily do the opposite, too.
UPDATE
After reading your updates, I think the best way to solve this is using terms aggregations, first by product and then by store and only select the products for which there is only a single store bucket (using a pipeline aggregation)
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
{
"size": 0,
"aggs": {
"products": {
"terms": {
"field": "product_name"
},
"aggs": {
"stores": {
"terms": {
"field": "store_slug"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "stores._bucket_count"
},
"script": {
"source": "params.count == 1"
}
}
}
}
}
}
}'

elasticsearch distinct parent sub aggregation without nested field

In elasticsearch 6.2 I have a parent-child relationship :
Document -> NamedEntity
I want to aggregate NamedEntity by counting mention field and giving the number of documents that contains each named entity.
My use case is :
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to have only 2 documents (not 3). Like a select count distinct.
If I aggregate with "terms":{"field":"join.parent"} I have :
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tied with cardinality aggregation on the join field and I obtain a value of 1, and cardinality aggregation on the join.parent that returns a value of 0.
So how do you make an aggregation distinct count on parents without the use of a reverse nested aggregation ?
As #AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document(content) and NamedEntity(mention) in an ES 6 mapping (fields are defined on the same level) :
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
I post this workaround in case it can help someone. But if someone has a cleaner way of doing this, I'd be interested.
I added a denormalized field in the children that contains a copy of the parent id (the value already in join/parent):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"document_id: {
"type": "keyword"
},
"mention": {
"type": "keyword"
}
}
}
}}
Then the cardinality aggregate with this new field works as expected :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"cardinality": {
"field" : "document_id"
}
}
}
}}}'
It responds :
...
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"value" : 2
}
}
]
}
}
I recently ran into the same issue on Elasticsearch 7.1, and this additional field "my_join_field#my_parent" created by elasicsearch solved it. I am glad I didn't have to add the parent_id to the child document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html#_searching_with_parent_join

How to get multiple type search results from elasticsearch index?

So, I have myindex elastic search index with two types type1 and type2. Both the type has two common fields as name and descriptionas below:
{
"name": "",
"description": ""
}
I want 5 results from type1 and 5 results from result2 if I specify the size as 10 in a single search query?
The below query gives me 10 results from type1 if the matching results are more from type1:
curl -XPOST 'localhost:9200/myindex/_search?pretty&pretty' -H 'Content-Type: application/json' -d'
{
"size": 10,
"query": {
"match": {
"name": "xyz"
}
}
}'
I can do this in two different queries as below, but I want to do it in one go.
curl -XPOST 'localhost:9200/myindex/type1/_search?pretty&pretty' -H 'Content-Type: application/json' -d'
{
"size": 5,
"query": {
"match": {
"name": "xyz"
}
}
}'
curl -XPOST 'localhost:9200/myindex/type2/_search?pretty&pretty' -H 'Content-Type: application/json' -d'
{
"size": 5,
"query": {
"match": {
"name": "xyz"
}
}
}'
You can use a multisearch and the results will come back in two separate arrays.
GET /_msearch --data-binary
{ "index" : "myindex" , "type" : "type1" }
{ "size" : 5, "query" : { "match" : { "name" : "xyz" } } }
{ "index" : "myindex", "type" : "type2" }
{ "size" : 5, "query" : { "match" : { "name" : "xyz" } } }

Making aggregations in two different types and return it grouped in Elasticsearch

Having this mapping with two types, items_one and items_two:
curl -XPUT 'localhost:9200/tester?pretty=true' -d '{
"mappings": {
"items_one": {
"properties" : {
"type" : {"type": "string",
"index": "not_analyzed"}
}},
"items_two": {
"properties" : {
"other_type" : { "type": "string",
"index": "not_analyzed"}
}}}}'
I put two items on items_one:
curl -XPUT 'localhost:9200/tester/items_one/1?pretty=true' -d '{
"type": "Bank transfer"
}'
curl -XPUT 'localhost:9200/tester/items_one/2?pretty=true' -d '{
"type": "PayPal"
}'
... and another two in items_two:
curl -XPUT 'localhost:9200/tester/items_two/1?pretty=true' -d '{
"other_type": "Cash"
}'
curl -XPUT 'localhost:9200/tester/items_two/2?pretty=true' -d '{
"other_type": "No pay"
}'
How can I make the aggregations in two different fields and return it grouped?
I know I can get it from one field doing:
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"field": "type"
}
}
}
}'
But I cant make it "multi-field" making something like this (which is not working):
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"field": ["type", "other_type"]
}
}
}
}'
My desired output should be:
"aggregations" : {
"paying_types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "Bank transfer",
"doc_count" : 1
}, {
"key" : "PayPal",
"doc_count" : 1
}, {
"key" : "Cash",
"doc_count" : 1
}, {
"key" : "No pay",
"doc_count" : 1
} ]
}
}
}
Thanks in advance
Finally solved it. A script will do the trick:
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"script": "doc['type'].values + doc['other_type'].values"
}
}
}
}'

Elasticsearch return unique values for a field

I am trying to build an Elasticsearch query that will return only unique values for a particular field.
I do not want to return all the values for that field nor count them.
For example, if there are 50 different values currently contained by the field, and I do a search to return only 20 hits (size=20). I want each of the 20 results to have a unique result for that field, but I don't care about the 30 other values not represented in the result.
For example with the following search (pseudo code - not checked):
{
from: 0,
size: 20,
query: {
bool: {
must: {
range: { field1: { gte: 50 }},
term: { field2: 'salt' },
/**
* I want to return only unique values for "field3", but I
* don't want to return all of them or count them.
*
* How do I specify this in my query?
**/
unique: 'field3',
},
mustnot: {
match: { field4: 'pepper'},
}
}
}
}
You should be able to do this pretty easily with a terms aggregation.
Here's an example. I defined a simple index, containing a field that has "index": "not_analyzed" so we can get the full text of each field as a unique value, rather than terms generated from tokenizing it, etc.
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then I add a few docs with the bulk API.
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"first doc"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"second doc"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"third doc"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"third doc"}
Now we can run our terms aggregation:
POST /test_index/_search?search_type=count
{
"aggs": {
"unique_vals": {
"terms": {
"field": "title"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"unique_vals": {
"buckets": [
{
"key": "third doc",
"doc_count": 2
},
{
"key": "first doc",
"doc_count": 1
},
{
"key": "second doc",
"doc_count": 1
}
]
}
}
}
I'm very surprised a filter aggregation hasn't been suggested. It goes back all the way to ES version 1.3.
The filter aggregation is similar to a regular filter query but can instead be nested into an aggregation chain to filter out counts of documents that don't meet a particular criteria and give you sub-aggregation results based only on the documents that meet the criteria of the query.
First, we'll put our mapping.
curl --request PUT \
--url http://localhost:9200/items \
--header 'content-type: application/json' \
--data '{
"mappings": {
"item": {
"properties": {
"field1" : { "type": "integer" },
"field2" : { "type": "keyword" },
"field3" : { "type": "keyword" },
"field4" : { "type": "keyword" }
}
}
}
}
'
Then let's load some data.
curl --request PUT \
--url http://localhost:9200/items/_bulk \
--header 'content-type: application/json' \
--data '{"index":{"_index":"items","_type":"item","_id":1}}
{"field1":50, "field2":["salt", "vinegar"], "field3":["garlic", "onion"], "field4":"paprika"}
{"index":{"_index":"items","_type":"item","_id":2}}
{"field1":40, "field2":["salt", "pepper"], "field3":["onion"]}
{"index":{"_index":"items","_type":"item","_id":3}}
{"field1":100, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"pepper"}
{"index":{"_index":"items","_type":"item","_id":4}}
{"field1":90, "field2":["vinegar"], "field3":["chives", "garlic"]}
{"index":{"_index":"items","_type":"item","_id":5}}
{"field1":900, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"paprika"}
'
Notice, that only the documents with id's 1 and 5 will pass the criteria and so we will be left to aggregate on these two field3 arrays and four values total. ["garlic", "chives"], ["garlic", "onion"]. Also notice that field3 can be an array or single value in the data but I'm making them arrays to illustrate how the counts will work.
curl --request POST \
--url http://localhost:9200/items/item/_search \
--header 'content-type: application/json' \
--data '{
"size": 0,
"aggregations": {
"top_filter_agg" : {
"filter" : {
"bool": {
"must":[
{
"range" : { "field1" : { "gte":50} }
},
{
"term" : { "field2" : "salt" }
}
],
"must_not":[
{
"term" : { "field4" : "pepper" }
}
]
}
},
"aggs" : {
"field3_terms_agg" : { "terms" : { "field" : "field3" } }
}
}
}
}
'
After running the conjuncted filter/terms aggregation. We only have a count of 4 terms on field3 and three unique terms altogether.
{
"took": 46,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"top_filter_agg": {
"doc_count": 2,
"field3_terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "garlic",
"doc_count": 2
},
{
"key": "chives",
"doc_count": 1
},
{
"key": "onion",
"doc_count": 1
}
]
}
}
}
}

Resources