elastic: query the sum of a filtered subset of nested documents - elasticsearch

Consider a document (post) like this in an Elasticsearch index:
{
title: "I love ice cream!"
comments: [
{
body: "me too!",
reaction: 'positive',
likes: 20
},
{
body: "huh!",
reaction: 'sarcastic',
likes: 5
}
]
}
The comments field is of nested type.
How can Elasticsearch answer this:
Give me all posts, where the total sum of likes on "sarcastic" comments is greater than 100.
I'm open to any other way of modelling data which helps answer such queries.

This can be solved using a bucket_selector aggregation.
Mapping:
{
"index1" : {
"mappings" : {
"properties" : {
"comments" : {
"type" : "nested",
"properties" : {
"body" : {
"type" : "text"
},
"likes" : {
"type" : "integer"
},
"reaction" : {
"type" : "text"
}
}
},
"title" : {
"type" : "keyword"
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "p0y9DGsBfPdKzuAGdQrm",
"_score" : 1.0,
"_source" : {
"title" : "I love ice cream!",
"comments" : [
{
"body" : "me too!",
"reaction" : "positive",
"likes" : 20
},
{
"body" : "huh!",
"reaction" : "sarcastic",
"likes" : 5
}
]
}
},
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "qEy9DGsBfPdKzuAGnwox",
"_score" : 1.0,
"_source" : {
"title" : "I hate ice cream!",
"comments" : [
{
"body" : "me too!",
"reaction" : "positive",
"likes" : 10
},
{
"body" : "huh!",
"reaction" : "sarcastic",
"likes" : 5
}
]
}
}
]
}
Query:
GET index1/_search
{
"size": 0,
"aggs": {
"title": {
"terms": {
"field": "title"
},
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"reaction": {
"filter": {
"term": {
"comments.reaction": "positive"
}
},
"aggs": {
"total_likes": {
"sum": {
"field": "comments.likes"
}
}
}
}
}
},
"total_likes_filter": {
"bucket_selector": {
"buckets_path": {
"likes": "comments>reaction>total_likes"
},
"script": "params.likes > 15"
}
}
}
}
}
}
Result:
"aggregations" : {
"title" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "I love ice cream!",
"doc_count" : 1,
"comments" : {
"doc_count" : 2,
"reaction" : {
"doc_count" : 1,
"total_likes" : {
"value" : 20.0
}
}
}
}
]
}
}
}
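The result contains only "I love ice cream!", whose total likes for the positive reaction (20) exceed the threshold of 15 used in the script.
"I hate ice cream!" has a total of 10 likes for the positive reaction, so it is not included.
To answer the original question, filter on "sarcastic" instead of "positive" and change the script to params.likes > 100.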
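For the question as originally asked (reaction "sarcastic", threshold 100), the same aggregation can be parameterized. The following is a minimal Python sketch, not part of the original answer: build_likes_query is a hypothetical helper that only builds the request body as a dict (sending it is left to whichever client you use), and posts_over_threshold mimics the filter, sum and bucket_selector steps in plain Python on the sample data, so the outcome can be checked without a cluster.

```python
def build_likes_query(reaction, threshold):
    """Build the search body: sum likes of comments with `reaction`,
    then keep only title buckets whose sum is above `threshold`."""
    return {
        "size": 0,
        "aggs": {
            "title": {
                "terms": {"field": "title"},
                "aggs": {
                    "comments": {
                        "nested": {"path": "comments"},
                        "aggs": {
                            "reaction": {
                                "filter": {"term": {"comments.reaction": reaction}},
                                "aggs": {"total_likes": {"sum": {"field": "comments.likes"}}},
                            }
                        },
                    },
                    "total_likes_filter": {
                        "bucket_selector": {
                            "buckets_path": {"likes": "comments>reaction>total_likes"},
                            "script": f"params.likes > {threshold}",
                        }
                    },
                },
            }
        },
    }


def posts_over_threshold(posts, reaction, threshold):
    """Plain-Python stand-in for the aggregation: group by title, sum matching likes."""
    totals = {}
    for post in posts:
        likes = sum(c["likes"] for c in post["comments"] if c["reaction"] == reaction)
        totals[post["title"]] = totals.get(post["title"], 0) + likes
    return [title for title, total in totals.items() if total > threshold]


# Sample data copied from the thread.
POSTS = [
    {"title": "I love ice cream!", "comments": [
        {"body": "me too!", "reaction": "positive", "likes": 20},
        {"body": "huh!", "reaction": "sarcastic", "likes": 5}]},
    {"title": "I hate ice cream!", "comments": [
        {"body": "me too!", "reaction": "positive", "likes": 10},
        {"body": "huh!", "reaction": "sarcastic", "likes": 5}]},
]

print(posts_over_threshold(POSTS, "positive", 15))
print(posts_over_threshold(POSTS, "sarcastic", 100))
```

Note that both the query and the stand-in group by title, so two posts with the same title would have their likes added together.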

Related

How to return hit term in ES?

I am trying to return only the terms that actually matched instead of the document itself, but I don’t know how to achieve this.
"es_episode" : {
"aliases" : { },
"mappings" : {
"properties" : {
"endTime" : {
"type" : "long"
},
"episodeId" : {
"type" : "long"
},
"startTime" : {
"type" : "long"
},
"studentIds" : {
"type" : "long"
}
}
}
This is an example:
{
"episodeId":124,
"startTime":10,
"endTime":20,
"studentIds":[200,300]
}
My query:
GET /es_episode/_search
{
"_source": ["studentIds"],
"query": {
"terms": {
"studentIds": [300,400]
}
}
}
The result is
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "es_episode",
"_type" : "episode",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"studentIds" : [
200,
300
]
}
}
]
}
But in fact I only want to know which terms hit. For example, the result I want is studentIds=[300] instead of the full studentIds=[200,300] of the returned document. It seems that some additional operations are required, but I don’t know how.
I try to achieve my goal with the following query
GET /es_episode/_search
{
"_source": ["studentIds"],
"query": {
"terms": {
"studentIds": [300,400]
}
},
"aggs": {
"student_id": {
"terms": {
"field": "studentIds",
"size": 10
},
"aggs": {
"id": {
"terms": {
"field": "episodeId"
}
},
"id_select":{
"bucket_selector": {
"buckets_path": {
"key" : "_key"
},
"script": "params.key==300 || params.key==400"
}
}
}
}
}
}
The result for this is:
"aggregations" : {
"student_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 300,
"doc_count" : 1,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 124,
"doc_count" : 1
}
]
}
}
]
}
}
It seems that I successfully filtered out the terms I don’t want, but this doesn’t look pretty, and I need to repeat my parameters in the script.
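One way to avoid repeating the ids inside a script is to compute the overlap on the client. The sketch below is an assumption about how the response could be post-processed, not an Elasticsearch feature: matched_terms is a hypothetical helper that intersects the queried ids with each hit's _source.studentIds.

```python
def matched_terms(response, queried_ids):
    """Map each hit's _id to the queried ids that actually occur in its studentIds."""
    wanted = set(queried_ids)
    result = {}
    for hit in response["hits"]["hits"]:
        ids = hit["_source"].get("studentIds", [])
        result[hit["_id"]] = [i for i in ids if i in wanted]
    return result


# Trimmed copy of the search response shown above.
response = {"hits": {"hits": [{"_id": "2", "_source": {"studentIds": [200, 300]}}]}}
print(matched_terms(response, [300, 400]))  # {'2': [300]}
```

The ids are then defined once, in the list passed both to the terms query and to the helper.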

How to search Parent documents along with count of associated child documents

I am looking for the best way to search parent documents along with counts of their associated child documents. Example:
We have Organization documents and User documents. There could be thousands of users belonging to one particular organization.
Organization document :
{
"id" : "001"
"name" : "orgname1"
}
{
"id" : "002"
"name" : "orgname2"
}
Users documents :
{
"id" : "testusr1"
"name" : "xyz1"
"orgId" : "001"
},
{
"id" : "testusr2"
"name" : "xyz2"
"orgId" : "001"
}
{
"id" : "testusr3"
"name" : "xyz3"
"orgId" : "001"
}
{
"id" : "testusr4"
"name" : "xyz4"
"orgId" : "001"
}
{
"id" : "testusr5"
"name" : "xyz5"
"orgId" : "002"
}
{
"id" : "testusr6"
"name" : "xyz6"
"orgId" : "002"
}
In the above example, we have 4 users associated with organization 001 and 2 users associated with 002. On the front end, an admin will search for an organization, and I want the response to include the user count for that organization.
You can solve your issue in three ways. Each has its own advantages and disadvantages.
1. Index Parent and child separately
This requires two queries. First query the organization index to get the organization id, then query the user index to count the users with that orgId.
Advantage:
A change in one index doesn't affect the other index.
Disadvantage:
You need to use two queries.
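The two-step flow can be sketched in plain Python. In-memory lists stand in for the two indices here (an assumption made so the example runs without a cluster; in practice each step would be a search or count request), and all function names are hypothetical.

```python
from collections import Counter

# Stand-ins for the organization index and the user index (data from the question).
organizations = [{"id": "001", "name": "orgname1"}, {"id": "002", "name": "orgname2"}]
users = [
    {"id": "testusr1", "orgId": "001"}, {"id": "testusr2", "orgId": "001"},
    {"id": "testusr3", "orgId": "001"}, {"id": "testusr4", "orgId": "001"},
    {"id": "testusr5", "orgId": "002"}, {"id": "testusr6", "orgId": "002"},
]


def search_organizations(name):
    """Step 1: stand-in for the query against the organization index."""
    return [org for org in organizations if org["name"] == name]


def count_users(org_ids):
    """Step 2: stand-in for counting users per orgId in the user index."""
    return Counter(u["orgId"] for u in users if u["orgId"] in org_ids)


def organizations_with_counts(name):
    """Combine both steps: each matching organization plus its user count."""
    orgs = search_organizations(name)
    counts = count_users({org["id"] for org in orgs})
    return [{**org, "userCount": counts.get(org["id"], 0)} for org in orgs]


print(organizations_with_counts("orgname1"))
```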
2. Nested Documents
Mapping:
PUT index9
{
"mappings": {
"properties": {
"id":{
"type": "integer"
},
"name":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"user":{
"type": "nested",
"properties": {
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
}
POST index9/_doc
{
"id" : 1,
"name" : "orgname1",
"user":[
{
"id":"testuser1",
"name":"xyz1"
},
{
"id":"testuser2",
"name":"xyz2"
}
]
}
Query:
GET index9/_search
{
"query": {
"match_all": {}
},
"aggs": {
"organization": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"user": {
"nested": {
"path": "user"
},
"aggs": {
"count": {
"value_count": {
"field": "user.id.keyword"
}
}
}
}
}
}
}
}
Result:
"aggregations" : {
"organization" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 1,
"user" : {
"doc_count" : 2,
"count" : {
"value" : 2
}
}
}
]
}
}
Nested documents are faster to query than parent/child.
However, nested documents require reindexing the parent with all its children, while parent/child allows reindexing, adding, or deleting specific children.
3. Parent Child Relationship
Mapping
{
"my_index" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "keyword"
},
"my_join_field" : {
"type" : "join",
"eager_global_ordinals" : true,
"relations" : {
"organization" : "user"
}
},
"name" : {
"type" : "text"
},
"orgId" : {
"type" : "long"
}
}
}
}
Data:
POST my_index/_doc/1
{
"id": 1,
"name" : "orgname1",
"my_join_field": "organization"
}
POST my_index/_doc/2
{
"id" : 2,
"name" : "orgname2",
"my_join_field": "organization"
}
POST my_index/_doc/3?routing=1
{
"id": "testusr1",
"name": "xyz1",
"orgId": 1,
"my_join_field": {
"name": "user",
"parent": 1
}
}
POST my_index/_doc/4?routing=2
{
"id" : "testusr5",
"name" : "xyz5",
"orgId" : 2,
"my_join_field": {
"name": "user",
"parent": 2
}
}
POST my_index/_doc/5?routing=2
{
"id" : "testusr6",
"name" : "xyz6",
"orgId" : 2,
"my_join_field": {
"name": "user",
"parent": 2
}
}
Query:
GET my_index/_search
{
"query": {
"has_child": {
"type": "user",
"query": { "match_all": {} }
}
},
"aggs": {
"organization": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"user": {
"children": {
"type": "user"
},
"aggs": {
"count": {
"value_count": {
"field": "id"
}
}
}
}
}
}
}
}
Result:
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "orgname1",
"my_join_field" : "organization"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "orgname2",
"my_join_field" : "organization"
}
}
]
},
"aggregations" : {
"organization" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"user" : {
"doc_count" : 1,
"count" : {
"value" : 1
}
}
},
{
"key" : "2",
"doc_count" : 1,
"user" : {
"doc_count" : 2,
"count" : {
"value" : 2
}
}
}
]
}
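To hand the counts to a front end, the aggregation part of the response above can be flattened into a simple mapping. This is a small sketch, assuming the response has been parsed into a Python dict; user_counts is a hypothetical helper name.

```python
def user_counts(response):
    """Return {organization key: number of child user documents}."""
    buckets = response["aggregations"]["organization"]["buckets"]
    return {bucket["key"]: bucket["user"]["count"]["value"] for bucket in buckets}


# Trimmed copy of the aggregation result shown above.
response = {
    "aggregations": {
        "organization": {
            "buckets": [
                {"key": "1", "doc_count": 1, "user": {"doc_count": 1, "count": {"value": 1}}},
                {"key": "2", "doc_count": 1, "user": {"doc_count": 2, "count": {"value": 2}}},
            ]
        }
    }
}
print(user_counts(response))  # {'1': 1, '2': 2}
```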
Benefits:
1. Parent document and children are separate documents.
2. Parent and child can be updated separately without re-indexing the other.
3. It is useful when child documents are large in number and need to be added or changed frequently.
4. Child documents can be returned as the results of a search request.

Filter nested objects in ElasticSearch 6.8.1

I couldn't find an answer on how to do a simple thing in Elasticsearch 6.8: I need to filter nested objects.
Index
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "1"
}
},
"mappings": {
"human": {
"properties": {
"cats": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"breed": {
"type": "text"
},
"colors": {
"type": "integer"
}
}
},
"name": {
"type": "text"
}
}
}
}
}
Data
{
"name": "iridakos",
"cats": [
{
"colors": 1,
"name": "Irida",
"breed": "European Shorthair"
},
{
"colors": 2,
"name": "Phoebe",
"breed": "european"
},
{
"colors": 3,
"name": "Nino",
"breed": "Aegean"
}
]
}
Select the human with name="iridakos" and the cats whose breed contains 'European' (ignoring case).
Only two cats should be returned.
Million thanks for helping.
For nested datatypes, you need to use nested queries.
Elasticsearch always returns the entire document in the response. Note that with the nested datatype, every item in the list is treated as a separate document.
Hence, if you want to know the exact hits in addition to the entire document, you need to use the inner_hits feature.
The query below should help.
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "iridakos"
}
},
{
"nested": {
"path": "cats",
"query": {
"match": {
"cats.breed": "european"
}
},
"inner_hits": {}
}
}
]
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.74455214,
"hits" : [
{
"_index" : "my_cat_index",
"_type" : "_doc",
"_id" : "1", <--- The document that hit
"_score" : 0.74455214,
"_source" : {
"name" : "iridakos",
"cats" : [
{
"colors" : 1,
"name" : "Irida",
"breed" : "European Shorthair"
},
{
"colors" : 2,
"name" : "Phoebe",
"breed" : "european"
},
{
"colors" : 3,
"name" : "Nino",
"breed" : "Aegean"
}
]
},
"inner_hits" : { <---- Note this
"cats" : {
"hits" : {
"total" : {
"value" : 2, <---- Count of nested doc hits
"relation" : "eq"
},
"max_score" : 0.52354836,
"hits" : [
{
"_index" : "my_cat_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "cats",
"offset" : 1
},
"_score" : 0.52354836,
"_source" : { <---- First Nested Document
"breed" : "european"
}
},
{
"_index" : "my_cat_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "cats",
"offset" : 0
},
"_score" : 0.39019167,
"_source" : { <---- Second Document
"breed" : "European Shorthair"
}
}
]
}
}
}
}
]
}
}
Note the inner_hits section in the response, which is where you find the exact hits.
Hope this helps!
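On the client side, the inner_hits section can be turned into the list of matching cats. The following is a minimal Python sketch, assuming the response has been parsed into a dict; matched_cats is a hypothetical helper. It uses each inner hit's _nested.offset to look up the full cat object in the parent's _source.

```python
def matched_cats(response):
    """Return, per parent hit, the matching nested cats in array order."""
    result = {}
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get("cats", {}).get("hits", {}).get("hits", [])
        offsets = sorted(h["_nested"]["offset"] for h in inner)
        result[hit["_id"]] = [hit["_source"]["cats"][offset] for offset in offsets]
    return result


# Trimmed copy of the response shown above.
response = {"hits": {"hits": [{
    "_id": "1",
    "_source": {"name": "iridakos", "cats": [
        {"colors": 1, "name": "Irida", "breed": "European Shorthair"},
        {"colors": 2, "name": "Phoebe", "breed": "european"},
        {"colors": 3, "name": "Nino", "breed": "Aegean"}]},
    "inner_hits": {"cats": {"hits": {"hits": [
        {"_nested": {"field": "cats", "offset": 1}},
        {"_nested": {"field": "cats", "offset": 0}}]}}},
}]}}

print([cat["name"] for cat in matched_cats(response)["1"]])  # ['Irida', 'Phoebe']
```

By default inner_hits returns only the top 3 nested hits, so its size option may need raising when more matches are expected.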
You could use something like this:
{
"query": {
"bool": {
"must": [
{ "match": { "name": "iridakos" }},
{ "match": { "cats.breed": "European" }}
]
}
}
}
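To search on a cat's breed, you can use dot notation. Note that because cats is mapped as nested here, a plain match on cats.breed will not find anything unless it is wrapped in a nested query, as in the answer above.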

elasticsearch groupby and filter by regex condition

It's a bit hard for me to define the question as I'm not very experienced with Elasticsearch. I'm focusing the question on my specific problem:
Assuming I have the following records:
{
id: 1
name: bla1_1.aaa
},
{
id: 1
name: bla1_2.bbb
},
{
id: 2
name: bla2_1.aaa
},
{
id: 2
name: bla2_2.aaa
}
What I want is to GET all the ids that have all of their names ending with aaa.
I was thinking about grouping by id and then running a regex query like *\.aaa so that all the names must satisfy it.
On this particular example I would get id: 2 back.
How do I do it?
Let me know if there's anything I need to add to clarify the question.
A regexp query can be used.
The pattern .* matches any character any number of times, including zero.
A terms aggregation will give you the unique ids and the number of docs under each.
Mapping :
PUT regex
{
"mappings": {
"properties": {
"id":{
"type":"integer"
},
"name":{
"type":"text",
"fields": {
"keyword":{
"type":"keyword"
}
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "olQXjW0BywGFQhV7k84P",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "bla1_1.aaa"
}
},
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "o1QXjW0BywGFQhV7us6B",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "bla1_2.bbb"
}
},
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "pFQXjW0BywGFQhV77c6J",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "bla2_1.aaa"
}
},
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "pVQYjW0BywGFQhV7Dc6F",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "bla2_2.aaa"
}
}
]
Query:
GET regex/_search
{
"size":0,
"query": {
"regexp": {
"name.keyword": {
"value": ".*\\.aaa" ---> name ending with .aaa (the dot is escaped so it matches literally)
}
}
},
"aggs": {
"unique_ids": {
"terms": {
"field": "id",
"size": 10
}
}
}
}
Result:
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"unique_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2, ---> 2 doc under id 2
"doc_count" : 2
},
{
"key" : 1, ----> 1 doc under id 1
"doc_count" : 1
}
]
}
}
Edit:
Use a bucket selector to keep only the buckets where the total count of docs for an id equals the count of docs matching the regex:
GET regex/_search
{
"size": 0,
"aggs": {
"unique_ids": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"totalCount": { ---> to get total count of id(all docs)
"value_count": {
"field": "id"
}
},
"filter_agg": {
"filter": {
"bool": {
"must": [
{
"regexp": {
"name.keyword": ".*\\.aaa"
}
}
]
}
},
"aggs": {
"finalCount": { -->total count of docs matching regex
"value_count": {
"field": "id"
}
}
}
},
"mybucket_selector": { ---> include buckets where totalcount==finalcount
"bucket_selector": {
"buckets_path": {
"FinalCount": "filter_agg>finalCount",
"TotalCount": "totalCount"
},
"script": "params.FinalCount==params.TotalCount"
}
}
}
}
}
}
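The logic of the bucket selector (keep an id only when every one of its docs matches the pattern) can be checked offline. This is a plain-Python sketch of that logic on the sample records, not a replacement for the query; the function name is hypothetical. re.fullmatch is used because Lucene regexp patterns are anchored to the whole term.

```python
import re
from collections import defaultdict


def ids_where_all_names_match(records, pattern=r".*\.aaa"):
    """Return the ids for which every record's name matches the pattern."""
    total = defaultdict(int)    # plays the role of totalCount
    matched = defaultdict(int)  # plays the role of filter_agg>finalCount
    for rec in records:
        total[rec["id"]] += 1
        if re.fullmatch(pattern, rec["name"]):
            matched[rec["id"]] += 1
    # Same condition as the bucket_selector script: FinalCount == TotalCount.
    return sorted(i for i in total if total[i] == matched[i])


records = [
    {"id": 1, "name": "bla1_1.aaa"},
    {"id": 1, "name": "bla1_2.bbb"},
    {"id": 2, "name": "bla2_1.aaa"},
    {"id": 2, "name": "bla2_2.aaa"},
]
print(ids_where_all_names_match(records))  # [2]
```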

Elasticsearch Array (Label/Tag) Querying

I really think that what I'm trying to do is fairly simple. I'm simply trying to query for N tags. A clear example of this was asked and answered over at "Elasticsearch: How to use two different multiple matching fields?". Yet, that solution doesn't seem to work for the latest version of ES (more likely, I'm simply doing it wrong).
To show the current data and to demonstrate a working query, see below:
{
"query": {
"filtered": {
"filter": {
"terms": {
"Price": [10,5]
}
}
}
}
}
Here are the results for this. As you can see, 5 and 10 are showing up (this demonstrates that basic queries do work):
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 1.0,
"hits" : [ {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGMYXB5vRcKBZaDw",
"_score" : 1.0,
"_source" : {
"Category" : [ "Medium Signs" ],
"Code" : "a",
"Name" : "Sample 1",
"Timestamp" : 1.455031083799152E9,
"Price" : "10",
"IsEnabled" : true
}
}, {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGHHXB5vRcKBZaDF",
"_score" : 1.0,
"_source" : {
"Category" : [ "Small Signs" ],
"Code" : "b",
"Name" : "Sample 2",
"Timestamp" : 1.45503108346191E9,
"Price" : "5",
"IsEnabled" : true
}
}, {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGILXB5vRcKBZaDO",
"_score" : 1.0,
"_source" : {
"Category" : [ "Medium Signs" ],
"Code" : "c",
"Name" : "Sample 3",
"Timestamp" : 1.455031083530215E9,
"Price" : "10",
"IsEnabled" : true
}
}, {
"_index" : "labelsample",
"_type" : "entry",
"_id" : "AVLGnGGgXB5vRcKBZaDA",
"_score" : 1.0,
"_source" : {
"Category" : [ "Medium Signs" ],
"Code" : "d",
"Name" : "Sample 4",
"Timestamp" : 1.4550310834233E9,
"Price" : "10",
"IsEnabled" : true
}
}]
}
}
As a side note: the following bool query gives the exact same results:
{
"query": {
"bool": {
"must": [{
"terms": {
"Price": [10,5]
}
}]
}
}
}
Notice Category...
Let's simply copy/paste Category into a query:
{
"query": {
"filtered": {
"filter": {
"terms": {
"Category" : [ "Medium Signs" ]
}
}
}
}
}
This gives the following gem:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 6,
"successful" : 6,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Again, here's the bool query version that gives the same 0-hit result:
{
"query": {
"bool": {
"must": [{
"terms": {
"Category" : [ "Medium Signs" ]
}
}]
}
}
}
In the end, I definitely need something similar to "Category" : [ "Medium Signs", "Small Signs" ] working (in concert with other label queries and minimum_should_match as well-- but I can't even get this bare-bones query to work).
I have zero clue why this is. I pored over the docs for hours, trying everything I could see. Do I need to look into debugging various encodings? Is my syntax archaic?
The problem here is that Elasticsearch is analyzing and tokenizing the Category field, while the terms filter expects an exact match. One solution here is to add a raw field to Category inside your entry mapping:
PUT labelsample
{
"mappings": {
"entry": {
"properties": {
"Category": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"Code": {
"type": "string"
},
"Name": {
"type": "string"
},
"Timestamp": {
"type": "date",
"format": "epoch_millis"
},
"Price": {
"type": "string"
},
"IsEnabled": {
"type": "boolean"
}
}
}
}
}
...and filter on the raw field:
GET labelsample/entry/_search
{
"query": {
"filtered": {
"filter": {
"terms": {
"Category.raw" : [ "Medium Signs" ]
}
}
}
}
}
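Why the analyzed field returns 0 hits while the raw field matches can be illustrated without a cluster. The sketch below is a rough approximation only: rough_standard_analyze is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, which is far simpler than the real analyzer, and terms_filter_matches mimics the exact-term comparison a terms filter performs.

```python
import re


def rough_standard_analyze(text):
    """Very rough stand-in for the default analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def terms_filter_matches(indexed_terms, wanted):
    """A terms filter matches when any wanted value equals an indexed term exactly."""
    return any(value in indexed_terms for value in wanted)


analyzed_terms = rough_standard_analyze("Medium Signs")  # terms indexed for Category
raw_terms = ["Medium Signs"]                             # term indexed for Category.raw

print(analyzed_terms)
print(terms_filter_matches(analyzed_terms, ["Medium Signs"]))
print(terms_filter_matches(raw_terms, ["Medium Signs"]))
```

The analyzed field holds the separate terms medium and signs, so the exact value "Medium Signs" is never found there, whereas the not_analyzed raw field keeps the whole string as a single term.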

Resources