Group and count by array of objects' keys

Given the following index definition and query:
curl -XDELETE "localhost:9200/products"
curl -XPUT "localhost:9200/products"
curl -XPUT "localhost:9200/products/_mapping" -H 'Content-Type: application/json' -d'
{
"properties": {
"opinions": {
"type": "nested",
"properties": {
"topic": {"type": "keyword"},
"count": {"type": "long"}
},
"include_in_parent": true
}
}
}'
curl -X POST "localhost:9200/products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"opinions":[{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]}
{"index":{"_id":2}}
{"opinions":[{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]}
'
sleep 2
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
'
Produces the result:
"aggregations" : {
"per_topic" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "room",
"doc_count" : 2,
"counts" : {
"value" : 5.0
}
},
{
"key" : "kitchen",
"doc_count" : 1,
"counts" : {
"value" : 3.0
}
},
{
"key" : "restroom",
"doc_count" : 1,
"counts" : {
"value" : 2.0
}
}
]
}
}
}
I expect the sum for room to be 3, kitchen to be 1 and restroom to be 1, counting only the related nested documents, but instead it sums all the nested count fields in all the matched documents.
How can I sum only the matched aggregated nested documents?
UPDATE: solution based on comments
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"opinions": {
"nested": {"path": "opinions"},
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
}
}
'

The main initial problem was aggregating on object fields instead of nested fields: only nested fields preserve the structure [{"room", 2}, {"kitchen", 1}]. In object fields the data is flattened to {["room", "kitchen"], [2, 1]}, with no relationship between "room" and 2.
Unfortunately, at the moment it is not possible to use the SQL API to group by (some?) nested fields, but it is possible to write a native Elasticsearch query using nested aggregations.
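The difference between the two queries can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not of how Elasticsearch works internally; the function names flattened_sums and nested_sums are made up for illustration, and the data is the two documents from the bulk request above.

```python
from collections import defaultdict

# The same two documents as in the bulk request above.
docs = [
    {"opinions": [{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]},
    {"opinions": [{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]},
]


def flattened_sums(docs):
    """Model of aggregating on the parent document: the topics and the
    counts are two unrelated value lists, so every topic of a document
    receives the sum of all counts of that document."""
    sums = defaultdict(int)
    for doc in docs:
        topics = {o["topic"] for o in doc["opinions"]}
        total = sum(o["count"] for o in doc["opinions"])
        for topic in topics:
            sums[topic] += total
    return dict(sums)


def nested_sums(docs):
    """Model of the nested aggregation: each opinion is handled as its
    own small document, so a topic only receives its own count."""
    sums = defaultdict(int)
    for doc in docs:
        for opinion in doc["opinions"]:
            sums[opinion["topic"]] += opinion["count"]
    return dict(sums)


print(flattened_sums(docs))  # room 5, kitchen 3, restroom 2 (the unexpected result)
print(nested_sums(docs))     # room 3, kitchen 1, restroom 1 (the expected result)
```

The first function reproduces the 5 / 3 / 2 values from the question, the second the 3 / 1 / 1 values returned once the terms aggregation is wrapped in a nested aggregation.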

Related

Difference of two query results in Elasticsearch

Let's say we have indexes of e-commerce store data, and we want to get the difference between the lists of products present in 2 stores.
Information on the index content: a sample document looks like this:
{
"product_name": "sample 1",
"store_slug": "store 1",
"sales_count": 42,
"date": "2018-04-04"
}
Below are the queries which get me all products present in each of the 2 stores individually.
Data for store 1
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_1"}}]}}}}}'
Data for store 2
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_2"}}]}}}}}'
Is it possible to get the difference of both results with an Elasticsearch query (without using a script or another language)?
For example: let's say "store 1" sells products ["product 1", "product 2"] and "store 2" sells products ["product 1", "product 3"]. The expected difference between the products of "store 1" and "store 2" is then "product 2".

Why not do it in a single query?
Products that are in store 1 but not in store 2:
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
"_source": [
"product_name"
],
"query": {
"constant_score": {
"filter": {
"bool": {
"filter": [
{
"term": {
"store_slug": "store_1"
}
}
],
"must_not": [
{
"term": {
"store_slug": "store_2"
}
}
]
}
}
}
}
}'
You can easily do the opposite, too.
UPDATE
After reading your updates, I think the best way to solve this is to use terms aggregations, first by product and then by store, and to select only the products for which there is a single store bucket (using a pipeline aggregation):
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '
{
"size": 0,
"aggs": {
"products": {
"terms": {
"field": "product_name"
},
"aggs": {
"stores": {
"terms": {
"field": "store_slug"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "stores._bucket_count"
},
"script": {
"source": "params.count == 1"
}
}
}
}
}
}
}'
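As a rough model of what this aggregation selects, here is a pure-Python sketch using the example data from the question. The helper name single_store_products is hypothetical, and the last filtering step is an assumption about how one would narrow the result to one store; it is not part of the query above.

```python
from collections import defaultdict

# Example from the question: one row per (product, store) document.
rows = [
    {"product_name": "product 1", "store_slug": "store 1"},
    {"product_name": "product 2", "store_slug": "store 1"},
    {"product_name": "product 1", "store_slug": "store 2"},
    {"product_name": "product 3", "store_slug": "store 2"},
]


def single_store_products(rows):
    """Group by product, then by store, and keep only the products
    whose store sub-aggregation has exactly one bucket."""
    stores_by_product = defaultdict(set)
    for row in rows:
        stores_by_product[row["product_name"]].add(row["store_slug"])
    return {
        product: next(iter(stores))
        for product, stores in stores_by_product.items()
        if len(stores) == 1  # same condition as "params.count == 1"
    }


result = single_store_products(rows)
# The selector keeps products unique to either store, so narrowing the
# result to "store 1" is an extra step (an assumption, see above).
only_in_store_1 = sorted(p for p, s in result.items() if s == "store 1")
print(only_in_store_1)
```

Note that the bucket selector on its own returns the products that are unique to any store ("product 2" and "product 3" here); the store bucket inside each remaining product tells you which side of the difference it belongs to.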

elasticsearch distinct parent sub aggregation without nested field

In elasticsearch 6.2 I have a parent-child relationship :
Document -> NamedEntity
I want to aggregate NamedEntity by the mention field, and get for each named entity the number of documents that contain it.
My use case is :
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to have only 2 documents (not 3). Like a select count distinct.
If I aggregate with "terms":{"field":"join.parent"} I have :
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tried a cardinality aggregation on the join field, which returns a value of 1, and a cardinality aggregation on join.parent, which returns a value of 0.
So how do you make a distinct count aggregation on parents without the use of a reverse nested aggregation?
As @AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document(content) and NamedEntity(mention) in an ES 6 mapping (fields are defined on the same level):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}'
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'

"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}

I post this workaround in case it can help someone. But if someone has a cleaner way of doing this, I'd be interested.
I added a denormalized field in the children that contains a copy of the parent id (the value already in join/parent):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"document_id": {
"type": "keyword"
},
"mention": {
"type": "keyword"
}
}
}
}}'
Then the cardinality aggregate with this new field works as expected :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"cardinality": {
"field" : "document_id"
}
}
}
}}}'
It responds :
...
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"value" : 2
}
}
]
}
}
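To make the difference between doc_count and the distinct parent count concrete, here is a small pure-Python model of the minimal dataset above. The function name mention_stats is made up for illustration, and the "parent" key stands in for either the denormalized document_id field or the join#Document field. Unlike this exact model, the real cardinality aggregation is approximate on large data sets.

```python
# Children from the minimal dataset: three NamedEntity documents,
# two of them attached to doc1 and one to doc2.
children = [
    {"_id": "ner11", "mention": "NER", "parent": "doc1"},
    {"_id": "ner12", "mention": "NER", "parent": "doc1"},
    {"_id": "ner2", "mention": "NER", "parent": "doc2"},
]


def mention_stats(children):
    """Per mention: number of child documents (doc_count) and number
    of distinct parents (what the cardinality sub-aggregation counts)."""
    stats = {}
    for child in children:
        entry = stats.setdefault(child["mention"], {"doc_count": 0, "parents": set()})
        entry["doc_count"] += 1
        entry["parents"].add(child["parent"])
    return {
        mention: {"doc_count": e["doc_count"], "docs": len(e["parents"])}
        for mention, e in stats.items()
    }


print(mention_stats(children))  # {'NER': {'doc_count': 3, 'docs': 2}}
```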

I recently ran into the same issue on Elasticsearch 7.1, and the additional field "my_join_field#my_parent" created by Elasticsearch solved it. I am glad I didn't have to add the parent_id to the child document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html#_searching_with_parent_join

How to get multiple type search results from elasticsearch index?

So, I have an Elasticsearch index, myindex, with two types, type1 and type2. Both types have two common fields, name and description, as below:
{
"name": "",
"description": ""
}
How can I get 5 results from type1 and 5 results from type2 if I specify the size as 10 in a single search query?
The below query gives me 10 results from type1 if the matching results are more from type1:
curl -XPOST 'localhost:9200/myindex/_search?pretty' -H 'Content-Type: application/json' -d'
{
"size": 10,
"query": {
"match": {
"name": "xyz"
}
}
}'
I can do this in two different queries as below, but I want to do it in one go.
curl -XPOST 'localhost:9200/myindex/type1/_search?pretty' -H 'Content-Type: application/json' -d'
{
"size": 5,
"query": {
"match": {
"name": "xyz"
}
}
}'
curl -XPOST 'localhost:9200/myindex/type2/_search?pretty' -H 'Content-Type: application/json' -d'
{
"size": 5,
"query": {
"match": {
"name": "xyz"
}
}
}'

You can use a multisearch and the results will come back in two separate arrays.
GET /_msearch --data-binary
{ "index" : "myindex" , "type" : "type1" }
{ "size" : 5, "query" : { "match" : { "name" : "xyz" } } }
{ "index" : "myindex", "type" : "type2" }
{ "size" : 5, "query" : { "match" : { "name" : "xyz" } } }
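The multi search body is newline-delimited JSON: one header line and one body line per search, and the whole payload has to end with a newline. A minimal sketch of building that payload in Python follows; build_msearch_body is a hypothetical helper name, and the "type" key in the header only applies to older Elasticsearch versions that still have mapping types, as in the question.

```python
import json


def build_msearch_body(searches):
    """Serialize (header, body) pairs into the newline-delimited JSON
    payload expected by _msearch. Each JSON object must stay on a
    single line, and the payload must end with a newline."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"


query = {"size": 5, "query": {"match": {"name": "xyz"}}}
payload = build_msearch_body([
    ({"index": "myindex", "type": "type1"}, query),
    ({"index": "myindex", "type": "type2"}, query),
])
print(payload, end="")
```

The resulting string can be saved to a file and sent with curl --data-binary, which preserves the newlines that plain -d would strip.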

Making aggregations in two different types and return it grouped in Elasticsearch

Having this mapping with two types, items_one and items_two:
curl -XPUT 'localhost:9200/tester?pretty=true' -d '{
"mappings": {
"items_one": {
"properties" : {
"type" : {"type": "string",
"index": "not_analyzed"}
}},
"items_two": {
"properties" : {
"other_type" : { "type": "string",
"index": "not_analyzed"}
}}}}'
I put two items on items_one:
curl -XPUT 'localhost:9200/tester/items_one/1?pretty=true' -d '{
"type": "Bank transfer"
}'
curl -XPUT 'localhost:9200/tester/items_one/2?pretty=true' -d '{
"type": "PayPal"
}'
... and another two in items_two:
curl -XPUT 'localhost:9200/tester/items_two/1?pretty=true' -d '{
"other_type": "Cash"
}'
curl -XPUT 'localhost:9200/tester/items_two/2?pretty=true' -d '{
"other_type": "No pay"
}'
How can I make the aggregations in two different fields and return it grouped?
I know I can get it from one field doing:
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"field": "type"
}
}
}
}'
But I can't make it "multi-field" with something like this (which does not work):
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"field": ["type", "other_type"]
}
}
}
}'
My desired output should be:
"aggregations" : {
"paying_types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "Bank transfer",
"doc_count" : 1
}, {
"key" : "PayPal",
"doc_count" : 1
}, {
"key" : "Cash",
"doc_count" : 1
}, {
"key" : "No pay",
"doc_count" : 1
} ]
}
}
}
Thanks in advance

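Finally solved it. A script will do the trick (the field names use escaped double quotes, because single quotes would end the shell string passed to -d):
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"size": 0,
"aggs": {
"paying_types": {
"terms": {
"script": "doc[\"type\"].values + doc[\"other_type\"].values"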
}
}
}
}'
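What the script does can be modelled in plain Python: for every document, the values of both fields are merged into one list, and the terms aggregation then counts each merged value once per document. This is only a sketch of the semantics; combined_terms is a made-up name, and the data is the four documents indexed in the question.

```python
from collections import Counter

# The four documents from the question, two per type.
docs = [
    {"type": "Bank transfer"},
    {"type": "PayPal"},
    {"other_type": "Cash"},
    {"other_type": "No pay"},
]


def combined_terms(docs, fields):
    """Count each distinct value found in any of the given fields,
    once per document, like a terms aggregation on the merged values."""
    counts = Counter()
    for doc in docs:
        values = set()
        for field in fields:
            value = doc.get(field)  # a missing field contributes nothing
            if value is not None:
                values.add(value)
        counts.update(values)
    return dict(counts)


print(combined_terms(docs, ["type", "other_type"]))
```

Each of the four keys ends up with a count of 1, which matches the desired output in the question.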

Elasticsearch return unique values for a field

I am trying to build an Elasticsearch query that will return only unique values for a particular field.
I do not want to return all the values for that field nor count them.
For example, if there are 50 different values currently contained by the field, and I do a search to return only 20 hits (size=20). I want each of the 20 results to have a unique result for that field, but I don't care about the 30 other values not represented in the result.
For example with the following search (pseudo code - not checked):
{
from: 0,
size: 20,
query: {
bool: {
must: {
range: { field1: { gte: 50 }},
term: { field2: 'salt' },
/**
* I want to return only unique values for "field3", but I
* don't want to return all of them or count them.
*
* How do I specify this in my query?
**/
unique: 'field3',
},
mustnot: {
match: { field4: 'pepper'},
}
}
}
}

You should be able to do this pretty easily with a terms aggregation.
Here's an example. I defined a simple index, containing a field that has "index": "not_analyzed" so we can get the full text of each field as a unique value, rather than terms generated from tokenizing it, etc.
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then I add a few docs with the bulk API.
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"first doc"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"second doc"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"third doc"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"third doc"}
Now we can run our terms aggregation:
POST /test_index/_search?search_type=count
{
"aggs": {
"unique_vals": {
"terms": {
"field": "title"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"unique_vals": {
"buckets": [
{
"key": "third doc",
"doc_count": 2
},
{
"key": "first doc",
"doc_count": 1
},
{
"key": "second doc",
"doc_count": 1
}
]
}
}
}

I'm very surprised a filter aggregation hasn't been suggested. It goes back all the way to ES version 1.3.
The filter aggregation is similar to a regular filter query, but it can be nested into an aggregation chain: it excludes the documents that don't meet the given criteria, so the sub-aggregation results are based only on the documents that do.
First, we'll put our mapping.
curl --request PUT \
--url http://localhost:9200/items \
--header 'content-type: application/json' \
--data '{
"mappings": {
"item": {
"properties": {
"field1" : { "type": "integer" },
"field2" : { "type": "keyword" },
"field3" : { "type": "keyword" },
"field4" : { "type": "keyword" }
}
}
}
}
'
Then let's load some data.
curl --request PUT \
--url http://localhost:9200/items/_bulk \
--header 'content-type: application/json' \
--data '{"index":{"_index":"items","_type":"item","_id":1}}
{"field1":50, "field2":["salt", "vinegar"], "field3":["garlic", "onion"], "field4":"paprika"}
{"index":{"_index":"items","_type":"item","_id":2}}
{"field1":40, "field2":["salt", "pepper"], "field3":["onion"]}
{"index":{"_index":"items","_type":"item","_id":3}}
{"field1":100, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"pepper"}
{"index":{"_index":"items","_type":"item","_id":4}}
{"field1":90, "field2":["vinegar"], "field3":["chives", "garlic"]}
{"index":{"_index":"items","_type":"item","_id":5}}
{"field1":900, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"paprika"}
'
Notice that only the documents with IDs 1 and 5 pass the criteria, so we are left to aggregate over these two field3 arrays, four values in total: ["garlic", "onion"] and ["garlic", "chives"]. Also notice that field3 can be an array or a single value in the data; I'm making them arrays to illustrate how the counts work.
curl --request POST \
--url http://localhost:9200/items/item/_search \
--header 'content-type: application/json' \
--data '{
"size": 0,
"aggregations": {
"top_filter_agg" : {
"filter" : {
"bool": {
"must":[
{
"range" : { "field1" : { "gte":50} }
},
{
"term" : { "field2" : "salt" }
}
],
"must_not":[
{
"term" : { "field4" : "pepper" }
}
]
}
},
"aggs" : {
"field3_terms_agg" : { "terms" : { "field" : "field3" } }
}
}
}
}
'
After running the combined filter/terms aggregation, we have a count of only 4 terms on field3 and three unique terms altogether.
{
"took": 46,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"top_filter_agg": {
"doc_count": 2,
"field3_terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "garlic",
"doc_count": 2
},
{
"key": "chives",
"doc_count": 1
},
{
"key": "onion",
"doc_count": 1
}
]
}
}
}
}
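The same two steps, filter first and count terms second, can be checked against the sample data with a short pure-Python model. It only mirrors the logic of the request above; passes_filter and field3_terms are hypothetical names, and nothing here talks to Elasticsearch.

```python
from collections import Counter

# The five documents from the bulk request above.
items = [
    {"field1": 50, "field2": ["salt", "vinegar"], "field3": ["garlic", "onion"], "field4": "paprika"},
    {"field1": 40, "field2": ["salt", "pepper"], "field3": ["onion"]},
    {"field1": 100, "field2": ["salt", "vinegar"], "field3": ["garlic", "chives"], "field4": "pepper"},
    {"field1": 90, "field2": ["vinegar"], "field3": ["chives", "garlic"]},
    {"field1": 900, "field2": ["salt", "vinegar"], "field3": ["garlic", "chives"], "field4": "paprika"},
]


def passes_filter(doc):
    """Same criteria as the bool filter: field1 >= 50, field2 contains
    "salt", and field4 is not "pepper"."""
    return (
        doc["field1"] >= 50
        and "salt" in doc.get("field2", [])
        and doc.get("field4") != "pepper"
    )


def field3_terms(items):
    """Return the number of documents kept by the filter and the
    per-term document counts of field3 among those documents."""
    kept = [doc for doc in items if passes_filter(doc)]
    counts = Counter()
    for doc in kept:
        counts.update(set(doc.get("field3", [])))
    return len(kept), dict(counts)


doc_count, buckets = field3_terms(items)
print(doc_count, buckets)
```

Two documents survive the filter, and the term counts (garlic 2, chives 1, onion 1) agree with the buckets in the response above.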
