Understanding Elasticsearch aggregations - elasticsearch

My scenario is the following:
I have people who can have regular or one-time income. I would like to sum the regular income of all people who are not deleted and were born within a date range. The query part works well, but when I add the aggregation part to the Elasticsearch query, I get the wrong figures and can't understand what I am doing wrong.
This is how I created the mapping for my data type:
curl -X PUT -i http://localhost:9200/people --data '{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"person" : {
"properties" : {
"birthDate" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"company" : {
"type" : "string"
},
"deleted" : {
"type" : "boolean"
},
"income" : {
"type": "nested",
"properties" : {
"income_type" : {
"type" : "string"
},
"value" : {
"type" : "double"
}
}
},
"name" : {
"type" : "string"
}
}
}
}
}'
This is the data:
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/people/person/1 --data '{
"deleted":false,
"birthDate":"1980-10-10",
"name":"John Smith",
"company": "IBM",
"income": [{"income_type":"regular","value":55.5}]
}'
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/people/person/2 --data '{
"deleted":true,
"birthDate":"1960-10-10",
"name":"Mary Legend",
"company": "ASUS",
"income": [{"income_type":"one-time","value":10},{"income_type":"regular","value":55}]
}'
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/people/person/3 --data '{
"deleted":false,
"birthDate":"2000-10-10",
"name":"F. King Elastic",
"income": [{"income_type":"one-time","value":1},{"income_type":"regular","value":5}]
}'
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/people/person/4 --data '{
"deleted":false,
"birthDate":"1989-10-10",
"name":"Prison Lesley",
"income": [{"income_type":"regular","value":120.7},{"income_type":"one-time","value":99.3}]
}'
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/people/person/5 --data '{
"deleted":false,
"birthDate":"1983-10-10",
"name":"Prison Lesley JR.",
"income": [{"income_type":"one-time","value":99.3}]
}'
curl -X PUT -H 'Content-Type: application/json' -i http://localhost:9200/people/person/6 --data '{
"deleted":true,
"birthDate":"1986-10-10",
"name":"Hono Lulu",
"income": [{"income_type":"regular","value":11.3}]
}'
This is a query that filters for undeleted people who have at least one regular income and were born between the given dates. The query below still works as expected (two people fulfil the criteria):
curl -X POST -H 'Content-Type: application/json' -i 'http://localhost:9200/people/person/_search?pretty=true' --data '{
"size": 100,
"filter": {
"bool": {
"must": [
{
"match": {
"deleted": false
}
},
{
"range": {
"birthDate": {
"gte": "1980-01-01",
"lte": "1990-12-31"
}
}
},
{
"nested": {
"path": "income",
"query": {
"bool": {
"must": [
{
"match": {
"income.income_type": "regular"
}
}
]
}
}
}
}
]
}
}
}'
But when I add the aggregation section, everything goes wrong, and I do not understand why :(
curl -X POST -H 'Content-Type: application/json' -i 'http://localhost:9200/people/person/_search?pretty=true' --data '{
"size": 100,
"filter": {
"bool": {
"must": [
{
"match": {
"deleted": false
}
},
{
"range": {
"birthDate": {
"gte": "1980-01-01",
"lte": "1990-12-31"
}
}
},
{
"nested": {
"path": "income",
"query": {
"bool": {
"must": [
{
"match": {
"income.income_type": "regular"
}
}
]
}
}
}
}
]
}
},
"aggs": {
"incomes": {
"nested": {
"path": "income"
},
"aggs": {
"income_type": {
"filter": {
"bool": {
"must": [
{
"match": {
"income.income_type": "regular"
}
},
{
"match": {
"deleted": false
}
}
]
}
},
"aggs": {
"totalIncome": {
"sum": {
"field": "income.value"
}
}
}
}
}
}
}
}'
The result is this:
...
"aggregations": {
"incomes": {
"doc_count": 9,
"income_type": {
"doc_count": 0,
"totalIncome": {
"value": 0.0
}
}
}
}
}
I was expecting the doc_count to be 2 and the totalIncome to be 176.2 (120.7 + 55.5).
Does anyone have an idea what I am doing wrong?

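Great start! You don't need the filter on the deleted field in your aggregation, since your query is already filtering out all deleted documents. It also cannot match there: inside the nested aggregation the filter runs against the nested income documents, which have no deleted field, so that clause matches nothing and the doc_count drops to 0. Try this: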
"aggs": {
"incomes": {
"nested": {
"path": "income"
},
"aggs": {
"income_type": {
"filter": {
"match": {
"income.income_type": "regular"
}
},
"aggs": {
"totalIncome": {
"sum": {
"field": "income.value"
}
}
}
}
}
}
}
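To make the nested behaviour concrete, here is a minimal Python sketch that models the six people from the question as plain dictionaries. It is a simulation, not Elasticsearch itself: the helper names (nested_docs, person_matches, regular_total, filter_with_deleted) are made up for illustration. It shows that a filter on deleted inside the nested scope finds nothing, and that summing regular incomes over the two matching people gives the expected 176.2. The outer doc_count of 9 in the question equals the number of income entries across all six people, which suggests the aggregation was not narrowed by the top-level filter; if that is the case on your version, the person-level conditions may need to live in the query part for the sum to come out as 176.2.

```python
# The six people from the question, reduced to the fields that matter here.
PEOPLE = [
    {"deleted": False, "birthDate": "1980-10-10", "income": [("regular", 55.5)]},
    {"deleted": True, "birthDate": "1960-10-10", "income": [("one-time", 10), ("regular", 55)]},
    {"deleted": False, "birthDate": "2000-10-10", "income": [("one-time", 1), ("regular", 5)]},
    {"deleted": False, "birthDate": "1989-10-10", "income": [("regular", 120.7), ("one-time", 99.3)]},
    {"deleted": False, "birthDate": "1983-10-10", "income": [("one-time", 99.3)]},
    {"deleted": True, "birthDate": "1986-10-10", "income": [("regular", 11.3)]},
]


def nested_docs(people):
    """Model nested objects as separate documents holding only income.* fields."""
    return [{"income_type": kind, "value": value}
            for p in people for kind, value in p["income"]]


def person_matches(person, born_from, born_to):
    """Person-level conditions: not deleted, born in range, at least one regular income."""
    return (not person["deleted"]
            and born_from <= person["birthDate"] <= born_to  # ISO dates sort as strings
            and any(kind == "regular" for kind, _ in person["income"]))


def regular_total(people):
    """Nested scope, filter on income_type, then sum of value."""
    regular = [d["value"] for d in nested_docs(people) if d["income_type"] == "regular"]
    return len(regular), round(sum(regular), 2)


def filter_with_deleted(people):
    """The question's inner filter: 'deleted' does not exist on nested documents."""
    return [d for d in nested_docs(people)
            if d["income_type"] == "regular" and d.get("deleted") is False]


selected = [p for p in PEOPLE if person_matches(p, "1980-01-01", "1990-12-31")]
print(len(nested_docs(PEOPLE)))          # 9, the outer doc_count from the question
print(len(filter_with_deleted(PEOPLE)))  # 0, the inner doc_count from the question
print(regular_total(selected))           # (2, 176.2)
```

Running regular_total over all six people instead of the selected ones yields (5, 247.5), which is what an aggregation that ignores the person-level conditions would report.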

Related

ElasticSearch multiple AND/OR query

I have a schema like the one below:
{
"errorCode": "e015",
"errorDescription": "Description e015",
"storeId": "71102",
"businessFunction": "PriceFeedIntegration",
"createdDate": "2021-02-20T09:17:04.004",
"readBy": [
{
"userId": "scha3055"
},
{
"userId": "abcd1234"
}
]
}
I'm trying to search on a combination of "errorCode", "storeId" and "businessFunction" with a date range, like below:
{
"query": {
"bool": {
"must": [
{
"terms": {
"errorCode": [
"e015",
"e020",
"e022"
]
}
},
{
"terms": {
"storeId": [
"71102",
"71103"
]
}
},
{
"range": {
"createdDate": {
"gte": "2021-02-16T09:17:04.000",
"lte": "2021-02-22T00:00:00.005"
}
}
}
]
}
}
}
But when I add another condition with "businessFunction" the query does not work.
{
"query": {
"bool": {
"must": [
{
"terms": {
"errorCode": [
"e015",
"e020",
"e022"
]
}
},
{
"terms": {
"storeId": [
"71102",
"71103"
]
}
},
{
"terms": {
"errorDescription": [
"Description e020",
"71103"
]
}
},
{
"range": {
"createdDate": {
"gte": "2021-02-16T09:17:04.000",
"lte": "2021-02-22T00:00:00.005"
}
}
}
]
}
}
}
What am I missing in the query? When I add the third "terms" condition, the query does not work. Please suggest a fix or let me know of any alternative way.
In your example you are searching for "Description e020", but the document you stored contains "Description e015".
Short answer, I hope it works for you:
"Description e015" will have been indexed as the two terms ["description","e015"].
Use match_phrase instead of terms:
...
{
"match_phrase": {
"errorDescription": "Description e015"
}
},
{
"range": {
"createdDate": {
"gte": "2021-02-16T09:17:04.000",
"lte": "2021-02-22T00:00:00.005"
}
}
}
....
Without knowing your mapping, I think that your errorDescription field is analyzed.
Another option (not recommended):
If your field is analyzed and you require an exact match, search on errorDescription.keyword:
{
"terms": {
"errorDescription.keyword": [
"Description e015"
]
}
}
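The short answer above can be illustrated with a small Python sketch. It is only a rough stand-in for what an analyzer does (split on non-alphanumeric characters, then lowercase); the real standard analyzer has more rules, and the function names analyze, terms_query and match_phrase_query are hypothetical helpers, not Elasticsearch APIs. The point it demonstrates is that terms compares the search value verbatim against the indexed tokens, while match_phrase analyzes the search text first.

```python
import re


def analyze(text):
    """Rough approximation of an analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def terms_query(indexed_tokens, values):
    """terms: each value is looked up as-is, with no analysis of the search input."""
    return any(value in indexed_tokens for value in values)


def match_phrase_query(indexed_tokens, phrase):
    """match_phrase: analyze the phrase, then require its tokens adjacent and in order."""
    wanted = analyze(phrase)
    n = len(wanted)
    return any(indexed_tokens[i:i + n] == wanted
               for i in range(len(indexed_tokens) - n + 1))


description_tokens = analyze("Description e015")
print(description_tokens)                                            # ['description', 'e015']
print(terms_query(description_tokens, ["Description e015"]))         # False
print(match_phrase_query(description_tokens, "Description e015"))    # True

# A keyword (non-analyzed) field keeps the whole value as a single token.
keyword_tokens = ["Description e015"]
print(terms_query(keyword_tokens, ["Description e015"]))             # True
```

The last two lines mirror the .keyword option: when the value is stored as one untouched token, an exact terms lookup succeeds.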
UPDATE
Long answer:
As I mentioned previously, your field value was probably analyzed and thus converted from "PriceFeedIntegration2" to pricefeedintegration2.
There are 2 options:
Search by your field.keyword, i.e. businessFunction.keyword
Change your field mapping to not analyzed. Then you can get the results you expect using terms.
Option 1
This is the easy way. If you never run full-text searches on that field, it is better not to rely on the default dynamic mapping. If that does not matter to you, use this option; it is the simplest.
Check your businessFunction.keyword field (created by default if you don't specify a mapping).
Index data without a mapping into the my000001 index:
curl -X "POST" "http://localhost:9200/my000001/_doc" \
-H "Content-type: application/json" \
-d $'
{
"errorCode": "e015",
"errorDescription": "Description e015",
"storeId": "71102",
"businessFunction": "PriceFeedIntegration",
"createdDate": "2021-02-20T09:17:04.004"
}'
Check
curl -X "GET" "localhost:9200/my000001/_analyze" \
-H "Content-type: application/json" \
-d $'{
"field": "businessFunction.keyword",
"text": "PriceFeedIntegration"
}'
Result:
{
"tokens": [
{
"token": "PriceFeedIntegration",
"start_offset": 0,
"end_offset": 20,
"type": "word",
"position": 0
}
]
}
Get the results using businessFunction.keyword
curl -X "GET" "localhost:9200/my000001/_search" \
-H "Content-type: application/json" \
-d $'{
"query": {
"bool": {
"must": [
{
"terms": {
"errorCode": [
"e015",
"e020",
"e022"
]
}
},
{
"terms": {
"storeId": [
"71102",
"71103"
]
}
},
{
"terms": {
"businessFunction.keyword": [
"PriceFeedIntegration2",
"PriceFeedIntegration"
]
}
},
{
"range": {
"createdDate": {
"gte": "2021-02-16T09:17:04.000",
"lte": "2021-02-22T00:00:00.005"
}
}
}
]
}
}
}' | jq
Why isn't it recommended as the default option?
"The default dynamic string mappings will index string fields both as
text and keyword. This is wasteful if you only need one of them."
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html
Option 2
Run on my000001 index
curl -X "GET" "localhost:9200/my000001/_analyze" \
-H "Content-type: application/json" \
-d $'{
"field": "businessFunction",
"text": "PriceFeedIntegration"
}'
You can see that your field value was analyzed (tokenized, lowercased, and otherwise modified depending on the analyzer and the value provided).
Results:
{
"tokens": [
{
"token": "pricefeedintegration",
"start_offset": 0,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 0
}
]
}
That is the reason why your search doesn't return results:
"PriceFeedIntegration" doesn't match "pricefeedintegration".
"The problem isn’t with the term query; it is with the way the data
has been indexed."
Your businessFunction field value was analyzed.
If you need to find (search/filter) by exact values, you may need to change your "businessFunction" field mapping to not analyzed (the keyword type used in the mapping below).
Changing your mapping requires deleting your index and creating it again with the required mapping.
If you try to create an index that already exists, you will get a "resource_already_exists_exception" error.
Here is the background that you need to know in order to solve your problem:
https://www.elastic.co/guide/en/elasticsearch/guide/master/_finding_exact_values.html#_finding_exact_values
Create a mapping on a new my000005 index:
curl -X "PUT" "localhost:9200/my000005" \
-H "Content-type: application/json" \
-d $'{
"mappings" : {
"properties" : {
"businessFunction" : {
"type" : "keyword"
},
"errorDescription" : {
"type" : "text"
},
"errorCode" : {
"type" : "keyword"
},
"createdDate" : {
"type" : "date"
},
"storeId": {
"type" : "keyword"
}
}
}
}'
Indexing data
curl -X "POST" "http://localhost:9200/my000005/_doc" \
-H "Content-type: application/json" \
-d $'
{
"errorCode": "e015",
"errorDescription": "Description e015",
"storeId": "71102",
"businessFunction": "PriceFeedIntegration",
"createdDate": "2021-02-20T09:17:04.004"
}'
Get the results that you expect using terms on businessFunction:
curl -X "GET" "localhost:9200/my000005/_search" \
-H "Content-type: application/json" \
-d $'{
"query": {
"bool": {
"must": [
{
"terms": {
"errorCode": [
"e015",
"e020",
"e022"
]
}
},
{
"terms": {
"storeId": [
"71102",
"71103"
]
}
},
{
"terms": {
"businessFunction": [
"PriceFeedIntegration2",
"PriceFeedIntegration"
]
}
},
{
"range": {
"createdDate": {
"gte": "2021-02-16T09:17:04.000",
"lte": "2021-02-22T00:00:00.005"
}
}
}
]
}
}
}' | jq
This answer is based on what I think is your mapping and your needs.
In the future, share your mapping and your ES version in order to get a better answer from the community:
curl -X "GET" "localhost:9200/yourindex/_mappings"
Please read this https://www.elastic.co/guide/en/elasticsearch/guide/master/_finding_exact_values.html#_finding_exact_values
and this https://www.elastic.co/blog/strings-are-dead-long-live-strings

elasticsearch match on every element in the nested collection

I am trying to write an Elasticsearch query that returns documents where every element of the nested collection matches, not just one.
For example, I have a Driver object with a list of cars, and each car has a color attribute.
Driver index:
curl --location --request PUT 'localhost:9200/driver' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"properties": {
"driver": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"car": {
"type": "nested",
"properties": {
"color": {
"type": "text"
}
}
}
}
}
}
}
}'
And the following data: Driver John with green and red car, and Driver Bob with two green cars.
curl --location --request PUT 'localhost:9200/driver/_doc/1' \
--header 'Content-Type: application/json' \
--data-raw '{
"driver": {
"name": "John",
"car": [
{
"color": "red"
},
{
"color": "green"
}
]
}
}'
curl --location --request PUT 'localhost:9200/driver/_doc/2' \
--header 'Content-Type: application/json' \
--data-raw '{
"driver": {
"name": "Bob",
"car": [
{
"color": "green"
},
{
"color": "green"
}
]
}
}'
I want to find the driver that has ONLY green cars (i.e. Bob).
I tried the following query, but it returns a driver that has at least one car that matches color:
curl --location --request GET 'localhost:9200/driver/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"query": {
"nested": {
"path": "driver",
"query": {
"nested": {
"path": "driver.car",
"query": {
"bool": {
"must": [
{
"match": {
"driver.car.color": "green"
}
}
]
}
}
}
}
}
}
}'
This query returns every driver that has at least one green car. What is the fix? Thank you.
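You can add a must_not query explicitly ruling out red. The range query below excludes any driver with a car whose color term sorts after "green", which covers "red":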
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "driver.car",
"query": {
"match": {
"driver.car.color": "green"
}
}
}
}
],
"must_not": [
{
"nested": {
"path": "driver.car",
"query": {
"range": {
"driver.car.color": {
"gt": "green"
}
}
}
}
}
]
}
}
}
BTW you don't need the double nesting -- driver.car will work exactly the same as driver -> driver.car.
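The "every element matches" logic can be sketched in Python as "at least one car is green, and no car is anything else". This is a simulation of the query semantics, not an Elasticsearch call; has_car, only_color_range and only_color are hypothetical helper names, and the driver Alice with a blue car is invented to show a limitation. A range with gt "green" only rules out colors that sort after "green", so a color such as "blue" would slip through. One possible generalisation, offered as an assumption rather than a tested query, is a must_not around a nested query whose inner query is itself a bool with must_not on the match for green.

```python
drivers = [
    {"name": "John", "car": [{"color": "red"}, {"color": "green"}]},
    {"name": "Bob", "car": [{"color": "green"}, {"color": "green"}]},
]


def has_car(driver, predicate):
    """Nested query semantics: true if at least one nested car satisfies the predicate."""
    return any(predicate(car) for car in driver["car"])


def only_color_range(drivers, color):
    """The answer's approach: must have the color, must_not have a color sorting after it."""
    return [d["name"] for d in drivers
            if has_car(d, lambda c: c["color"] == color)
            and not has_car(d, lambda c: c["color"] > color)]


def only_color(drivers, color):
    """Generalised variant: must have the color, must_not have any other color."""
    return [d["name"] for d in drivers
            if has_car(d, lambda c: c["color"] == color)
            and not has_car(d, lambda c: c["color"] != color)]


print(only_color_range(drivers, "green"))  # ['Bob']
print(only_color(drivers, "green"))        # ['Bob']

# Hypothetical extra driver, not part of the question's data.
with_alice = drivers + [{"name": "Alice", "car": [{"color": "blue"}, {"color": "green"}]}]
print(only_color_range(with_alice, "green"))  # ['Bob', 'Alice']
print(only_color(with_alice, "green"))        # ['Bob']
```

For the two drivers in the question both variants agree; they only diverge once colors that sort before "green" appear.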

Difference of two query results in Elasticsearch

Let's say we have indexes of e-commerce store data, and we want to get the difference between the lists of products that are present in 2 stores.
Information on the index content: the data stored in each document looks like this:
{
"product_name": "sample 1",
"store_slug": "store 1",
"sales_count": 42,
"date": "2018-04-04"
}
Below are the queries that get me all products present in each of the 2 stores individually.
Data for store 1
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_1"}}]}}}}}'
Data for store 2
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d'
{
"_source": ["product_name"],
"query": {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{ "term" : { "store_slug" : "store_2"}}]}}}}}'
Is it possible to get the difference of both results with an Elasticsearch query (without using a script or another language)?
E.g. of above operation: Let's say "store 1" is selling products ["product 1", "product 2"] and "store 2" is selling products ["product 1", "product 3"], So expected output of difference of products of "store 1" and "store 2" is "product 2".
Why not do it in a single query?
Products that are in store 1 but not in store 2:
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
"_source": [
"product_name"
],
"query": {
"constant_score": {
"filter": {
"bool": {
"filter": [
{
"term": {
"store_slug": "store_1"
}
}
],
"must_not": [
{
"term": {
"store_slug": "store_2"
}
}
]
}
}
}
}
}'
You can easily do the opposite, too.
UPDATE
After reading your updates, I think the best way to solve this is to use terms aggregations, first by product and then by store, and to select only the products for which there is a single store bucket (using a pipeline aggregation):
curl -XGET 'localhost:9200/store/_search?pretty' -H 'Content-Type: application/json' -d '{
"size": 0,
"aggs": {
"products": {
"terms": {
"field": "product_name"
},
"aggs": {
"stores": {
"terms": {
"field": "store_slug"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "stores._bucket_count"
},
"script": {
"source": "params.count == 1"
}
}
}
}
}
}
}'
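The idea behind the aggregation in the update can be sketched in Python: group documents by product, collect the distinct stores per product, and keep only products with exactly one store bucket. This is a simulation on the four products from the question's example, with store slugs written as in the queries; stores_per_product and exclusive_products are hypothetical helper names. Since each document carries a single store_slug, a must_not on that same field is evaluated per document and cannot express "this product has no document in the other store", which is why grouping by product is needed. Note that the bucket count alone yields products exclusive to either store, so a final check on which store it is may still be required.

```python
from collections import defaultdict

docs = [
    {"product_name": "product 1", "store_slug": "store_1"},
    {"product_name": "product 2", "store_slug": "store_1"},
    {"product_name": "product 1", "store_slug": "store_2"},
    {"product_name": "product 3", "store_slug": "store_2"},
]


def stores_per_product(docs):
    """Outer terms aggregation on product, inner terms aggregation on store."""
    buckets = defaultdict(set)
    for doc in docs:
        buckets[doc["product_name"]].add(doc["store_slug"])
    return buckets


def exclusive_products(docs, store=None):
    """Keep products with exactly one store bucket; optionally require a given store."""
    result = []
    for product, stores in stores_per_product(docs).items():
        if len(stores) == 1 and (store is None or store in stores):
            result.append(product)
    return sorted(result)


print(exclusive_products(docs))             # ['product 2', 'product 3']
print(exclusive_products(docs, "store_1"))  # ['product 2']
```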

elasticsearch distinct parent sub aggregation without nested field

In elasticsearch 6.2 I have a parent-child relationship :
Document -> NamedEntity
I want to aggregate NamedEntity by counting the mention field and giving the number of documents that contain each named entity.
My use case is :
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to have only 2 documents (not 3). Like a select count distinct.
If I aggregate with "terms":{"field":"join.parent"} I have :
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tried a cardinality aggregation on the join field and obtained a value of 1, and a cardinality aggregation on join.parent returns a value of 0.
So how do you make an aggregation distinct count on parents without the use of a reverse nested aggregation ?
As @AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document (content) and NamedEntity (mention) in an ES 6 mapping (fields are defined on the same level):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}'
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
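What the cardinality sub-aggregation is meant to compute can be shown with a short Python simulation over the minimal dataset from the question: for each mention, the number of child documents and the number of distinct parents. The function name mention_stats and the flat "parent" key are illustrative assumptions, not how Elasticsearch stores the join field internally.

```python
from collections import defaultdict

# The three NamedEntity children from the question's minimal dataset.
children = [
    {"_id": "ner11", "mention": "NER", "parent": "doc1"},
    {"_id": "ner12", "mention": "NER", "parent": "doc1"},
    {"_id": "ner2", "mention": "NER", "parent": "doc2"},
]


def mention_stats(children):
    """Per mention: doc_count (terms bucket size) and uniques (distinct parent ids)."""
    counts = defaultdict(int)
    parents = defaultdict(set)
    for child in children:
        counts[child["mention"]] += 1
        parents[child["mention"]].add(child["parent"])
    return {mention: {"doc_count": counts[mention], "uniques": len(parents[mention])}
            for mention in counts}


print(mention_stats(children))  # {'NER': {'doc_count': 3, 'uniques': 2}}
```

Keep in mind that the cardinality aggregation is an approximate count, so on large datasets the value may deviate slightly from the exact set size computed here.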
I post this workaround in case it can help someone. But if someone has a cleaner way of doing this, I'd be interested.
I added a denormalized field in the children that contains a copy of the parent id (the value already in join/parent):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"document_id": {
"type": "keyword"
},
"mention": {
"type": "keyword"
}
}
}
}}'
Then the cardinality aggregate with this new field works as expected :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"cardinality": {
"field" : "document_id"
}
}
}
}}}'
It responds :
...
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"value" : 2
}
}
]
}
}
I recently ran into the same issue on Elasticsearch 7.1, and the additional field "my_join_field#my_parent" created by Elasticsearch solved it. I am glad I didn't have to add the parent_id to the child document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html#_searching_with_parent_join

Search in two different types with different mappings in Elasticsearch

Having the following mapping of the index tester with two types items and items_two:
curl -XPUT 'localhost:9200/tester?pretty=true' -d '{
"mappings": {
"items": {
"properties" : {
"body" : { "type": "string" }
}},
"items_two": {
"properties" : {
"body" : { "type": "string" },
"publised" : { "type": "integer"}
}}}}'
I put three elements on it.
curl -XPUT 'localhost:9200/tester/items/1?pretty=true' -d '{
"body" : "Hey there im reading a book"
}'
curl -XPUT 'localhost:9200/tester/items_two/1?pretty=true' -d '{
"body" : "I love the new book of my brother",
"publised" : 0
}'
curl -XPUT 'localhost:9200/tester/items_two/2?pretty=true' -d '{
"body" : "Stephen kings book is very nice",
"publised" : 1
}'
I need to make a query that matches documents containing the word book with publised = 1, AND also documents whose type has no publised field in its mapping but that contain book (such as the only item of items).
With the following query I only get match with the "Stephen kings book is very nice" item (obviously).
curl -XGET 'localhost:9200/tester/_search?pretty=true' -d '{
"query": {
"bool": {
"must": [
{
"match": { "body": "book" }
},
{
"match": { "publised": "1" }
}]
}}}'
My desired output if I search for the string book should match item #1 from the type items ("Hey there im reading a book") and item #2 from the type items_two ("Stephen kings book is very nice").
I don't want to change the mapping or anything else; I need to achieve this with one query, so how can I build my query?
Thanks in advance.
You can use the _type field for these kinds of searches. Try the following query:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"match": {
"body": "book"
}
},
{
"match": {
"publised": "1"
}
}
],
"filter": {
"term": {
"_type": "items_two"
}
}
}
},
{
"bool": {
"must": [
{
"match": {
"body": "book"
}
}
],
"filter": {
"term": {
"_type": "items"
}
}
}
}
]
}
}
}
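The structure of that bool query (two should branches, each restricted to one type) can be checked against the three documents from the question with a small Python simulation. The search word is "book", as in the question; body_matches and search are hypothetical helpers, and the word matching is a simplification of what a match query does on an analyzed field.

```python
# The three documents from the question, with their type and id.
docs = [
    {"_type": "items", "_id": "1", "body": "Hey there im reading a book"},
    {"_type": "items_two", "_id": "1", "body": "I love the new book of my brother", "publised": 0},
    {"_type": "items_two", "_id": "2", "body": "Stephen kings book is very nice", "publised": 1},
]


def body_matches(doc, word):
    """Simplified match query: the word appears among the lowercased body tokens."""
    return word.lower() in doc["body"].lower().split()


def search(docs, word):
    """should: (items_two AND body AND publised == 1) OR (items AND body)."""
    hits = []
    for doc in docs:
        branch_two = (doc["_type"] == "items_two"
                      and body_matches(doc, word)
                      and doc.get("publised") == 1)
        branch_one = doc["_type"] == "items" and body_matches(doc, word)
        if branch_two or branch_one:
            hits.append((doc["_type"], doc["_id"]))
    return hits


print(search(docs, "book"))  # [('items', '1'), ('items_two', '2')]
```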
