Elasticsearch - How to Generate Facets for Doubly Nested Objects - elasticsearch

Using elasticsearch 7, I am trying to build facets for doubly nested objects.
So in the example below I would like to pull out the artist id codes from the artistMakerPerson field. I can pull out the association which is nested at a single depth but I can't get the syntax for the nested nested objects.
You could use the following code in Kibana to recreate an example.
My mapping looks like this:
PUT test_artist
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"object" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
}
},
"copy_to" : [
"global_search"
]
},
"uniqueID" : {
"type" : "keyword",
"copy_to" : [
"global_search"
]
},
"artistMakerPerson" : {
"type" : "nested",
"properties" : {
"association" : {
"type" : "keyword"
},
"name" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "keyword"
},
"text" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
}
},
"copy_to" : [
"gs_authority"
]
}
}
},
"note" : {
"type" : "text"
}
}
}
}
}
}
Index a document with:
PUT /test_artist/_doc/123
{
"object": "cup",
"uniquedID": "123",
"artistMakerPerson" : [
{
"name" : {
"text" : "Johann Kandler",
"id" : "A6734"
},
"association" : "modeller",
"note" : "probably"
},
{
"name" : {
"text" : "Peter Reinicke",
"id" : "A27702"
},
"association" : "designer",
"note" : "probably"
}
]
}
I am using this query to pull out facets or aggregations for artistMakerPerson.association
GET test_artist/_search
{
"size": 0,
"aggs": {
"artists": {
"nested": {
"path": "artistMakerPerson"
},
"aggs": {
"kinds": {
"terms": {
"field": "artistMakerPerson.association",
"size": 10
}
}
}
}
}
}
and I am rewarded with buckets for designer and modeller but I get nothing when I try to pull out the deeper artist id:
GET test_artist/_search
{
"size": 0,
"aggs": {
"artists": {
"nested": {
"path": "artistMakerPerson"
},
"aggs": {
"kinds": {
"terms": {
"field": "artistMakerPerson.name.id",
"size": 10
}
}
}
}
}
}
What am I doing wrong?

Change the path from artistMakerPerson to artistMakerPerson.name.
GET test_artist/_search
{
"size": 0,
"aggs": {
"artists": {
"nested": {
"path": "artistMakerPerson.name"
},
"aggs": {
"kinds": {
"terms": {
"field": "artistMakerPerson.name.id",
"size": 10
}
}
}
}
}
}

Related

How to count number of fields inside nested field? - Elasticsearch

I did the following mapping. I would like to count the number of products in each nested field "products" (for each document separately). I would also like to do a histogram aggregation, so that I would know the number of specific bucket sizes.
PUT /receipts
{
"mappings": {
"properties": {
"id" : {
"type": "integer"
},
"user_id" : {
"type": "integer"
},
"date" : {
"type": "date"
},
"sum" : {
"type": "double"
},
"products" : {
"type": "nested",
"properties": {
"name" : {
"type" : "text"
},
"number" : {
"type" : "double"
},
"price_single" : {
"type" : "double"
},
"price_total" : {
"type" : "double"
}
}
}
}
}
}
I've tried this query, but I get the number of all the products instead of number of products for each document separately.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products"
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count" : 6552,
"bucket_size" : {
"value" : 0
}
}
}
UPDATE
Now I have this code where I make separate buckets for each id and count the number of products inside them.
GET /receipts/_search
{
"query": {
"match_all": {}
},
"size" : 0,
"aggs": {
"terms":{
"terms":{
"field": "_id"
},
"aggs": {
"nested": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
}
}
Result of the query:
"aggregations" : {
"terms" : {
"doc_count_error_upper_bound" : 5,
"sum_other_doc_count" : 490,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"nested" : {
"doc_count" : 21,
"bucket_size" : {
"value" : 21
}
}
},
{
"key" : "10",
"doc_count" : 1,
"nested" : {
"doc_count" : 5,
"bucket_size" : {
"value" : 5
}
}
},
{
"key" : "100",
"doc_count" : 1,
"nested" : {
"doc_count" : 12,
"bucket_size" : {
"value" : 12
}
}
},
...
Is is possible to group these values (21, 5, 12, ...) into buckets to make a histogram of them?
products is only the path to the array of individual products, not an aggregatable field. So you'll need to use it on one of your product's field -- such as the number:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"field": "products.number"
}
}
}
}
}
}
Note that is a product has no number, it'll not contribute to the total count. It's therefore best practice to always include an ID in each of them and then aggregate on that field.
Alternatively you could use a script to account for missing values. Luckily value_count does not deduplicate -- meaning if two products are alike and/or have empty values, they'll still be counted as two:
GET receipts/_search
{
"size": 0,
"aggs": {
"terms": {
"nested": {
"path": "products"
},
"aggs": {
"bucket_size": {
"value_count": {
"script": {
"source": "doc['products.number'].toString()"
}
}
}
}
}
}
}
UPDATE
You could also use a nested composite aggregation which'll give you the histogrammed product count w/ the corresponding receipt id:
GET /receipts/_search
{
"size": 0,
"aggs": {
"my_aggs": {
"nested": {
"path": "products"
},
"aggs": {
"composite_parent": {
"composite": {
"sources": [
{
"receipt_id": {
"terms": {
"field": "_id"
}
}
},
{
"product_number": {
"histogram": {
"field": "products.number",
"interval": 1
}
}
}
]
}
}
}
}
}
}
The interval is modifiable.

Range query on Nested Documents for Keyword Type

I have a document structure like this. For this below two documents, we have nested documents called interaction info. I just need to get only the documents that have title duration and their value is greater than 60
Here whats the catch the value field is keyword, not an integer. I know that only for the Integer Range query will get executed. Is there any possible way to find the documents that have a duration greater than 60 ( Painless Query or Script Query ). Like converting the value Field into Integer and then searching the document.
{
"key": "f07ff9ba-36e4-482a-9c1c-d888e89f926e",
"interactionInfo": [
{
"title": "duration",
"value": "11"
},
{
"title": "timetaken",
"value": "9"
},
{
"title": "talk_time",
"value": "145"
}
]
},
{
"key": "f07ff9ba-36e4-482a-9c1c-d888e89f926e",
"interactionInfo": [
{
"title": "duration",
"value": "120"
},
{
"title": "timetaken",
"value": "9"
},
{
"title": "talk_time",
"value": "60"
}
]
}
I have added script to get interactionInfo.value>"somevalue". Scripts are slow and it is better to resolve this at index time and use a range query.
Index:
{
"index15" : {
"mappings" : {
"properties" : {
"interactionInfo" : {
"type" : "nested",
"properties" : {
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"value" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"key" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Query:
{
"query": {
"nested": {
"path": "interactionInfo",
"query": {
"bool": {
"must": [
{
"term": {
"interactionInfo.title.keyword": {
"value": "duration"
}
}
},
{
"script": {
"script": {
"source":"def val=Integer.parseInt(doc['interactionInfo.value.keyword'].value); if(val>params.value) return true; else return false;",
"params": {
"value":10
}
}
}
}
]
}
},
"inner_hits": {}
}
}
}

Elasticsearch: filter documents by array passed in request contains all document array elements

My documents stored in elasticsearch have following structure:
{
"id": 1,
"test": "name",
"rules": [
{
"id": 2,
"name": "rule1",
"ruleDetails": [
{
"id": 3,
"requiredAnswerId": 1
},
{
"id": 4,
"requiredAnswerId": 2
},
{
"id": 5,
"requiredAnswerId": 3
}
]
}
]
}
where, rules property has nested type.
I need to query documents by checking that array of requiredAnswerId passed in the search request (provided terms) contains all rules.ruleDetails.requiredAnswerId stored in the document.
Does anyone know which elasticsearch option I can use to build such specific query? Or maybe, it is better to fetch the whole document and perform filtering on the application level.
UPDATED
Adding mapping
{
"my_index": {
"mappings": {
"properties": {
"id": {
"type": "long"
},
"test": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"rules": {
"type": "nested",
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"ruleDetails": {
"properties": {
"id": {
"type": "long"
},
"requiredAnswerId": {
"type": "long"
}
}
}
}
}
}
}
}
}
Mapping:
{
"index4" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "integer"
},
"rules" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "integer"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"ruleDetails" : {
"properties" : {
"id" : {
"type" : "long"
},
"requiredAnswerId" : {
"type" : "long"
}
}
}
}
},
"test" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
}
Query: This will need use of scripts which are not good from performance perspective. I am looping through all documents and checking if field is present is passed parameters
{
"query": {
"nested": {
"path": "rules",
"query": {
"script": {
"script": {
"source": "for(a in doc['rules.ruleDetails.requiredAnswerId']){if(!params.Ids.contains((int)a)) return false; } return true;",
"params": {
"Ids": [
1,
2,
3
]
}
}
}
},
"inner_hits": {}
}
}
}
Result:
"hits" : [
{
"_index" : "index4",
"_type" : "_doc",
"_id" : "TxOpvnEBf42mOjxvvLQB",
"_score" : 4.0,
"_source" : {
"id" : 1,
"test" : "name",
"rules" : [
{
"id" : 2,
"name" : "rule1",
"ruleDetails" : [
{
"id" : 3,
"requiredAnswerId" : 1
},
{
"id" : 4,
"requiredAnswerId" : 2
},
{
"id" : 5,
"requiredAnswerId" : 3
}
]
},
{
"id" : 3,
"name" : "rule3",
"ruleDetails" : [
{
"id" : 3,
"requiredAnswerId" : 1
},
{
"id" : 4,
"requiredAnswerId" : 2
}
]
}
]
},
"inner_hits" : {
"rules" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 4.0,
"hits" : [
{
"_index" : "index4",
"_type" : "_doc",
"_id" : "TxOpvnEBf42mOjxvvLQB",
"_nested" : {
"field" : "rules",
"offset" : 0
},
"_score" : 4.0,
"_source" : {
"id" : 2,
"name" : "rule1",
"ruleDetails" : [
{
"id" : 3,
"requiredAnswerId" : 1
},
{
"id" : 4,
"requiredAnswerId" : 2
},
{
"id" : 5,
"requiredAnswerId" : 3
}
]
}
}
]
}
}
}
}
]
EDIT 1
Terms_set can be used as an alternative. It will be faster compared to script query
Returns documents that contain a minimum number of exact terms in a
provided field.
minimum_should_match_script- size of array can be used to match the minimum number of passed values.
Query:
{
"query": {
"nested": {
"path": "rules",
"query": {
"bool": {
"filter": {
"terms_set": {
"rules.ruleDetails.requiredAnswerId": {
"terms": [
1,
2,
3
],
"minimum_should_match_script": {
"source": "doc['rules.ruleDetails.requiredAnswerId'].size()"
}
}
}
}
}
},
"inner_hits": {}
}
}
}
After some time playing with ES and reading its documentation, I found that you should keep in mind that provided script should be compiled and applied for the document, hence it will be slower, if you just know the required number elements that should match in advance.
Therefore, I created a separate field requiredMatches that stores the number of rules.ruleDetails.requiredAnswerId elements for every document and calculate it before indexing document. Then, instead of using minimum_should_match_script in my search query, I am using minimum_should_match_field:
{
"query": {
"nested": {
"path": "rules",
"query": {
"bool": {
"filter": {
"terms_set": {
"rules.ruleDetails.requiredAnswerId": {
"terms": [
1,
2,
3
],
"minimum_should_match_field": "requiredMatches"
}
}
}
}
},
"inner_hits": {}
}
}
}
I used, following example, as a reference

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND test with id 587 have score > 5 etc. The high level of what is needed can be put something like below, dont know what would be the actual query, apologize for this messy query below
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can enter dynamic test Ids and their respective score filters along with the fields out of nested fields like age, name, status
Something like that?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You must make use of Ngram Tokenizer as wildcard search must not be used for performance reasons and I wouldn't recommend using it.
Change your mapping to the below where you can create your own Analyzer which I've done in the below mapping.
How elasticsearch (albiet lucene) indexes a statement is, first it breaks the statement or paragraph into words or tokens, then indexes these words in the inverted index for that particular field. This process is called Analysis and that this would only be applicable on text datatype.
So now you only get the documents if these tokens are available in inverted index.
By default, standard analyzer would be applied. What I've done is I've created my own analyzer and used Ngram Tokenizer which would be creating many more tokens than just simply words.
Default Analyzer on Life is beautiful would be life, is, beautiful.
However using Ngrams, the tokens for Life would be lif, ife & life
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of Term Queries on keyword fields while for normal searches or LIKE in SQL I would make use of simple Match Queries on text Fields provided they make use of Ngram Tokenizer.
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that I observe the document you've mentioned in your question, in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

Search query for elastic search

I have documents in elastic search in the following format
{
"stringindex" : {
"mappings" : {
"files" : {
"properties" : {
"BaseOfCode" : {
"type" : "long"
},
"BaseOfData" : {
"type" : "long"
},
"Characteristics" : {
"type" : "long"
},
"FileType" : {
"type" : "long"
},
"Id" : {
"type" : "string"
},
"Strings" : {
"properties" : {
"FileOffset" : {
"type" : "long"
},
"RO_BaseOfCode" : {
"type" : "long"
},
"SectionName" : {
"type" : "string"
},
"SectionOffset" : {
"type" : "long"
},
"String" : {
"type" : "string"
}
}
},
"SubSystem" : {
"type" : "long"
}
}
}
}
}
}
My requirement is when I search for a particular string (String.string) i want to get only the FileOffSet (String.FileOffSet) for that string.
How do i do this?
Thanks
I suppose that you want to perform a nested query and retrieve only one field as the result, but I see problems in your mapping, hence I will split my answer in 3 sections:
What is the problem I see:
How to query nested fields (this is more ES background):
How to find a solution:
1) What is the problem I see:
You want to query a nested field, but you don't have a nested field.
The nested field part:
The field "Strings" is not nested in the type "files" (nested data without a nested field may bring future problems), otherwise your mapping for the field "Strings" would be something like this:
{
"stringindex" : {
"mappings" : {
"files" : {
"properties" : {
"Strings" : {
"properties" : {
"type" : "nested",
"String" : {
"type" : "string"
}
}
}
}
}
}
}
}
Note: yes, I cut most of the fields, but I did this to easily show that you didn't create a nested field.
With a nested field "in hands", we need a nested query.
The specific field result part:
To retrieve only one field as result, you have to include the property "_source" in your query.
2) How to query nested fields:
This is more for ES background, if you have never worked with nested fields.
Small example:
You define a type with a nested field:
{
"nesttype" : {
"properties" : {
"name" : { "type" : "string" },
"parents" : {
"type" : "nested" ,
"properties" : {
"sex" : { "type" : "string" },
"name" : { "type" : "string" }
}
}
}
}
}
You create some inputs:
{ "name" : "Dan", "parents" : [{ "name" : "John" , "sex" : "m" },
{ "name" : "Anna" , "sex" : "f" }] }
{ "name" : "Lana", "parents" : [{ "name" : "Maria" , "sex" : "f" }] }
Then you query, but only fetch the nested field "parents.name":
{
"query": {
"nested": {
"path": "parents",
"query": {
"bool": {
"must": [
{
"term": {
"sex": "m"
}
}
]
}
}
}
},
"_source" : [ "parents.name" ]
}
The output of this query is "the name of the parents of all people who have a parent of the sex 'm' ". One entry (Dan) has a father, whereas the other (Lana) doesn't. So it only will retrieve Dan's parents names.
3) How to find a solution:
To fix your mapping:
You only need to include the type "nested" in the field "Strings":
{
"files" : {
"properties" : {
...
"Strings" : {
"type" : "nested" ,
"properties" : {
"FileOffset" : { "type" : "long" },
"RO_BaseOfCode" : { "type" : "long" },
...
}
}
...
}
}
}
To query your data:
{
"query": {
"nested": {
"path": "Strings",
"query": {
"bool": {
"must": [
{
"term": {
"String": "my string"
}
}
]
}
}
}
},
"_source" : [ "Strings.FileOffSet" ]
}
Great answer by dan, but I think he didn't mention it all.
His solution don't work for your question, but I guess you even don't know that.
Consider a scenario where data is like ,
doc_1
{
"Id": 1,
"Strings": [
{
"string": "x",
"fileoffset": "f1"
},
{
"string": "y",
"fileoffset": "f2"
}
]
}
doc_2
{
"Id": 2,
"Strings": {
"string": "z",
"fileoffset": "f3"
}
}
When you run the like dan said, like say let's apply filter with Strings.string=x then response is like,
{
"hits": [
{
"_index": "stringindex",
"_type": "files",
"_id": "11961",
"_score": 1,
"_source": {
"Strings": [
{
"fileoffset": "f1"
},
{
"fileoffset": "f2"
}
]
}
}
]
}
This is because, elasticsearch will get hits from documents where any of the object inside nested field (here Strings) pass the filter criteria. (In this case in doc_1, Strings.string=x passed filter, so doc_1 is returned. But we don't know which nested object pass the criteria.
So, you have to use nested_aggregation,
Here is a solution for you..
POST index/type/_search
{
"size": 0,
"aggs": {
"StringsNested": {
"nested": {
"path": "Strings"
},
"aggs": {
"StringFilter": {
"filter": {
"term": {
"Strings.string": "x"
}
},
"aggs": {
"FileOffsets": {
"terms": {
"field": "Strings.fileoffset"
}
}
}
}
}
}
}
}
So, response is like,
"aggregations": {
"StringsNested": {
"doc_count": 2,
"StringFilter": {
"doc_count": 1,
"FileOffsets": {
"buckets": [
{
"key": "f1",
"doc_count": 1
}
]
}
}
}
}
Remember to have mapping of Strings as nested, as dan said.

Resources