Aggregations in Elasticsearch 5

My Elasticsearch index has entries of the following type:
{
"_index": "employees",
"_type": "employee",
"_id": "10000",
"_score": 1.3640093,
"_source": {
"itms": {
"depc": [
"IT",
"MGT",
"FIN"
],
"dep": [
{
"depn": "Information Technology",
"depc": "IT"
},
{
"depn": "Management",
"depc": "MGT"
},
{
"depn": "Finance",
"depc": "FIN"
},
{
"depn": "Finance",
"depc": "FIN"
}
]
}
}
}
Now I am trying to get a unique list of departments, including the department code (depc) and the department name (depn).
I tried the following, but it doesn't give the result I expected:
{
"size": 0,
"query": {},
"aggs": {
"departments": {
"terms": {
"field": "itms.dep.depc",
"size": 10000,
"order": {
"_term": "asc"
}
},
"aggs": {
"department": {
"terms": {
"field": "itms.dep.depn",
"size": 10
}
}
}
}
}
}
Any suggestions are appreciated.
Thank you.

From your aggregation query, it seems that the mapping type for itms.dep is object and not nested.
Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
Hence, your document is internally transformed to:
{
"itms.depc" : [ "IT", "MGT", "FIN" ],
"itms.dep.depc" : [ "IT", "MGT", "FIN" ],
"itms.dep.depn" : [ "Information Technology", "Management", "Finance" ]
}
That is, you have lost the association between depc and depn.
To fix this:
1. Change the mapping of itms.dep from object to nested (the type of an existing field cannot be changed in place, so this means creating a new index and reindexing).
2. Use a nested aggregation.
The structure of your existing aggregation query looks fine, but you will have to wrap it in a nested aggregation after the mapping update.
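As a rough sketch of what those two steps could look like, the Python snippet below builds the mapping fragment and the search body as plain dictionaries. It assumes that depc and depn are mapped as keyword fields inside the nested itms.dep object; the aggregation names (departments, by_code, by_name) and the two helper functions are made up for illustration and are not part of any Elasticsearch API.

```python
import json


def build_nested_mapping():
    """Mapping fragment that declares itms.dep as nested (field types are assumed)."""
    return {
        "properties": {
            "itms": {
                "properties": {
                    "depc": {"type": "keyword"},
                    "dep": {
                        "type": "nested",
                        "properties": {
                            "depc": {"type": "keyword"},
                            "depn": {"type": "keyword"},
                        },
                    },
                }
            }
        }
    }


def build_department_aggs():
    """Wrap the original terms/terms aggregation in a nested aggregation."""
    return {
        "size": 0,
        "aggs": {
            "departments": {
                # Switch the aggregation context to the hidden nested documents.
                "nested": {"path": "itms.dep"},
                "aggs": {
                    "by_code": {
                        "terms": {
                            "field": "itms.dep.depc",
                            "size": 10000,
                            "order": {"_term": "asc"},
                        },
                        "aggs": {
                            "by_name": {
                                "terms": {"field": "itms.dep.depn", "size": 10}
                            }
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_department_aggs(), indent=2))
```

With a nested mapping in place, each by_code bucket should only contain the names that occur in the same dep object as that code.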

Related

Query with filter array field

I want to return documents that include only some of array field members.
For example, I have two order documents:
{
"orderNumber":"ORD-111",
"items":[{"name":"part-1","status":"new"},
{"name":"part-2","status":"paid"}]
}
{
"orderNumber":"ORD-112",
"items":[{"name":"part-3","status":"paid"},
{"name":"part-4","status":"supplied"}]
}
I want to create a query so that my result will include all the order documents but only with items that match {"status":"supplied"}.
The result should look like:
{
"orderNumber":"ORD-111",
"items":[]
}
{
"orderNumber":"ORD-112",
"items":[{"name":"part-4","status":"supplied"}]
}
You can use a nested query along with inner_hits to get only the matching array values in the result.
Here is a working example.
Index Mapping:
{
"mappings": {
"properties": {
"items": {
"type": "nested"
}
}
}
}
Search Query:
{
"query": {
"nested": {
"path": "items",
"query": {
"bool": {
"must": [
{
"match": {
"items.status": "supplied"
}
}
]
}
},
"inner_hits": {}
}
}
}
Search Result:
"hits": [
{
"_index": "67890614",
"_type": "_doc",
"_id": "2",
"_score": 1.2039728,
"_source": {
"orderNumber": "ORD-112",
"items": [
{
"name": "part-3",
"status": "paid"
},
{
"name": "part-4",
"status": "supplied"
}
]
},
"inner_hits": {
"items": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.2039728,
"hits": [
{
"_index": "67890614",
"_type": "_doc",
"_id": "2",
"_nested": {
"field": "items",
"offset": 1
},
"_score": 1.2039728,
"_source": {
"name": "part-4",
"status": "supplied" // note this
}
}
]
}
}
}
}
]
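To get from this response to the shape asked for in the question, the matching items can be read from inner_hits on the client side. The sketch below is one way to do that in Python; the function names are hypothetical and the sample response is a trimmed copy of the result above. Two caveats: the nested query only returns orders that have at least one matching item, so ORD-111 would not appear with an empty items array, and inner_hits returns at most 3 nested hits per document unless its size is set.

```python
def items_from_inner_hits(hit, path="items"):
    """Return the _source of each nested object that matched within one hit."""
    inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
    return [nested_hit["_source"] for nested_hit in inner]


def filtered_orders(hits, path="items"):
    """Rebuild each order so that its array holds only the matching items."""
    result = []
    for hit in hits:
        order = dict(hit["_source"])  # shallow copy, the response stays untouched
        order[path] = items_from_inner_hits(hit, path)
        result.append(order)
    return result


# Trimmed copy of the search result shown above.
sample_hits = [
    {
        "_id": "2",
        "_source": {
            "orderNumber": "ORD-112",
            "items": [
                {"name": "part-3", "status": "paid"},
                {"name": "part-4", "status": "supplied"},
            ],
        },
        "inner_hits": {
            "items": {
                "hits": {
                    "hits": [
                        {
                            "_nested": {"field": "items", "offset": 1},
                            "_source": {"name": "part-4", "status": "supplied"},
                        }
                    ]
                }
            }
        },
    }
]

print(filtered_orders(sample_hits))
```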
Elasticsearch flattens arrays of objects, so it is unable to tell which element of the array actually matched.
As previously answered, you can use nested queries.
How arrays of objects are flattened
Elasticsearch has no concept of inner objects. Therefore, it flattens object hierarchies into a simple list of field names and values. For instance, consider the following document:
PUT my-index-000001/_doc/1
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
The user field is dynamically added as a field of type object.
The previous document would be transformed internally into a document that looks more like this:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
The user.first and user.last fields are flattened into multi-value fields, and the association between alice and white is lost. This document would incorrectly match a query for alice AND smith:
GET my-index-000001/_search
{
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }}
]
}
}
}
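The false match can be reproduced with a toy model of the flattening step. The Python below is only an illustration of the idea, not how Elasticsearch or Lucene is implemented: flatten and matches_all are invented helpers, and lowercasing stands in for the analyzer of a text field.

```python
def flatten(doc, prefix=""):
    """Collapse objects and arrays of objects into dotted multi-value fields."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                # Lowercasing stands in for the analyzer of a text field.
                flat.setdefault(path, []).append(str(item).lower())
    return {path: sorted(set(vals)) for path, vals in flat.items()}


def matches_all(flat_doc, terms):
    """Mimic bool/must over match clauses: every field must contain its term."""
    return all(term.lower() in flat_doc.get(field, []) for field, term in terms.items())


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

flat = flatten(doc)
print(flat)
# True, although no single user is called Alice Smith.
print(matches_all(flat, {"user.first": "Alice", "user.last": "Smith"}))
```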
To answer your question:
If you need to index arrays of objects and to maintain the independence of each object in the array, use the nested data type instead of the object data type.
Internally, nested objects index each object in the array as a separate hidden document, meaning that each nested object can be queried independently of the others with the nested query:
PUT my-index-000001
{
"mappings": {
"properties": {
"user": {
"type": "nested"
}
}
}
}
PUT my-index-000001/_doc/1
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
GET my-index-000001/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }}
]
}
}
}
}
}
GET my-index-000001/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "White" }}
]
}
},
"inner_hits": {
"highlight": {
"fields": {
"user.first": {}
}
}
}
}
}
}
In the mapping above, the user field is mapped as type nested instead of type object.
The first search doesn't match because Alice and Smith are not in the same nested object.
The second search matches because Alice and White are in the same nested object.
In the second search, inner_hits allows us to highlight the matching nested documents.
Interacting with nested documents
Nested documents can be:
queried with the nested query.
analyzed with the nested and reverse_nested aggregations.
sorted with nested sorting.
retrieved and highlighted with nested inner hits.
Because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query, the nested/reverse_nested aggregations, or nested inner hits.
Consider performance when taking this approach, as nested queries can be considerably more expensive than queries on plain object fields.
For more details, you can check the source:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

How to filter a large number of nested documents in Elasticsearch?

I have nested documents in an elasticsearch index
[
{
"Id": "1",
"Events": [
{
"EventTime": "2021-04-13T08:00:00.000000Z"
},
{
"EventTime": "2021-04-13T08:10:00.000000Z"
}
]
},
{
"Id": "2",
"Events": [
{
"EventTime": "2021-04-13T09:00:00.000000Z"
},
{
"EventTime": "2021-04-13T09:10:00.000000Z"
}
]
},
{
"Id": "3",
"Events": [
{
"EventTime": "2021-04-13T10:00:00.000000Z"
},
{
"EventTime": "2021-04-13T10:10:00.000000Z"
}
]
}
]
I want to get all the documents with an EventTime earlier than some given time, and I want to filter the nested documents as well. I know we can do it using inner_hits as follows:
{
"_source": [ "Id" ],
"query": {
"nested": {
"path": "Events",
"query": {
"range": {
"Events.EventTime": {
"lte": "2021-04-13T09:20:00.000000Z"
}
}
},
"inner_hits": {
"from": 0,
"size": 100
}
}
}
}
But we can only get at most 100 documents in inner_hits. If I have more than 100 nested documents, I can't use inner_hits without changing some configuration in Elasticsearch. Is there a way to achieve this without changing the config?
There's an explicit index setting governing this, called max_inner_result_window. You can increase it by running:
PUT your-index/_settings
{
"index.max_inner_result_window": 10000
}
This change is non-breaking: the setting can be updated on a live index, so it requires neither dropping the index nor reindexing.
There's no alternative to get around this inner hits limitation.
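A small Python sketch of how a client might decide whether the setting needs to be raised before asking for a larger inner_hits page. It relies on the default window of 100 and on the rule that from + size must not exceed the window; the helper names and the figure of 500 events are assumptions made for this example.

```python
import json

# Default value of index.max_inner_result_window in Elasticsearch.
DEFAULT_MAX_INNER_RESULT_WINDOW = 100


def inner_hits_fits(from_, size, max_window=DEFAULT_MAX_INNER_RESULT_WINDOW):
    """Return True if from + size stays within the inner result window."""
    return from_ + size <= max_window


def settings_body(max_window):
    """Body for PUT <index>/_settings that raises the window."""
    return {"index.max_inner_result_window": max_window}


def events_query(upper_bound, from_=0, size=100):
    """The nested range query from the question with a configurable inner_hits page."""
    return {
        "_source": ["Id"],
        "query": {
            "nested": {
                "path": "Events",
                "query": {"range": {"Events.EventTime": {"lte": upper_bound}}},
                "inner_hits": {"from": from_, "size": size},
            }
        },
    }


if __name__ == "__main__":
    wanted = 500  # hypothetical number of nested events per document
    if not inner_hits_fits(0, wanted):
        print("PUT your-index/_settings")
        print(json.dumps(settings_body(wanted)))
    print(json.dumps(events_query("2021-04-13T09:20:00.000000Z", size=wanted), indent=2))
```

Keep in mind that larger inner_hits pages cost more heap and time per request, so it is worth raising the window only as far as needed.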

What does a "must" clause with an array of "match" clauses really mean?

I have an elasticsearch query which looks like this...
"query": {
"bool": {
"must": [{
"match": {"attrs.name": "username"}
}, {
"match": {"attrs.value": "johndoe"}
}]
}
}
... and documents in the index that look like this:
{
"key": "value",
"attrs": [{
"name": "username",
"value": "jimihendrix"
}, {
"name": "age",
"value": 23
}, {
"name": "alias",
"value": "johndoe"
}]
}
Which of the following does this query really mean?
1. Document should contain either attrs.name = username OR attrs.value = johndoe
2. Or, document should contain both attrs.name = username AND attrs.value = johndoe, even if they may match different elements in the attrs array (this would mean that the document given above would match the query)
3. Or, document should contain both attrs.name = username AND attrs.value = johndoe, but they must match the same element in the attrs array (which would mean that the document given above would not match the query)
Further, how do I write a query to express #3 from the list above, i.e. the document should match only if a single element inside the attrs array matches both the following conditions:
attrs.name = username
attrs.value = johndoe
must stands for "AND", so a document is returned only if it satisfies all of the match clauses.
must does not give you point 1 (the document contains either attrs.name = username OR attrs.value = johndoe); for that you need a should clause, which works like "OR".
Whether must gives you point 2 or point 3 depends on the type of the attrs field.
If attrs is of type object, its fields are flattened, that is, no relationship is maintained between the fields of the individual objects in the array. A must query will then return a document if any attrs.name is "username" and any attrs.value is "johndoe", even if they are not part of the same object in the array.
If you want each object in the array to act like a separate document, you need to map attrs as a nested field and use a nested query to match documents:
{
"query": {
"nested": {
"path": "attrs",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"match": {
"attrs.name": "username"
}
},
{
"match": {
"attrs.value": "johndoe"
}
}
]
}
}
}
}
}
The hits in the response contain the parent documents with all of their nested documents; to get only the nested documents that matched, inner_hits has to be specified.
Based on your requirements, you need to define your attrs field as nested; please refer to the nested type in the Elasticsearch documentation for more information. Disclaimer: it maintains the relationship, but it is costly to query.
The answer to your other two questions also depends on which data type you are using; please refer to nested vs. object data type for more details.
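The difference between the object and nested behaviour described above can be mimicked in a few lines of Python. This is a simplified model for intuition only: the two functions are hypothetical, they compare exact string values, and they ignore analysis and scoring.

```python
def match_as_object(doc, path, clauses):
    """Object semantics: each clause may be satisfied by a different array element."""
    elements = doc.get(path, [])
    return all(
        any(str(element.get(field)) == value for element in elements)
        for field, value in clauses.items()
    )


def match_as_nested(doc, path, clauses):
    """Nested semantics: return the elements that satisfy every clause on their own."""
    return [
        element
        for element in doc.get(path, [])
        if all(str(element.get(field)) == value for field, value in clauses.items())
    ]


# The document from the question.
doc = {
    "key": "value",
    "attrs": [
        {"name": "username", "value": "jimihendrix"},
        {"name": "age", "value": 23},
        {"name": "alias", "value": "johndoe"},
    ],
}
clauses = {"name": "username", "value": "johndoe"}

print(match_as_object(doc, "attrs", clauses))  # interpretation 2: the document matches
print(match_as_nested(doc, "attrs", clauses))  # interpretation 3: no element matches
```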
Edit: solution using sample mapping, example docs and expected result
Index mapping using nested type
{
"mappings": {
"properties": {
"attrs": {
"type": "nested"
}
}
}
}
Index two sample documents, one that satisfies the criteria and one that doesn't. The first satisfies them:
{
"attrs": [
{
"name": "username",
"value": "johndoe"
},
{
"name": "alias",
"value": "myname"
}
]
}
The second does not satisfy the criteria, because username and johndoe sit in different elements of the array:
{
"attrs": [
{
"name": "username",
"value": "jimihendrix"
},
{
"name": "age",
"value": 23
},
{
"name": "alias",
"value": "johndoe"
}
]
}
Search query:
{
"query": {
"nested": {
"path": "attrs",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"match": {
"attrs.name": "username"
}
},
{
"match": {
"attrs.value": "johndoe"
}
}
]
}
}
}
}
}
Search result:
"hits": [
{
"_index": "nested",
"_type": "_doc",
"_id": "2",
"_score": 1.7509375,
"_source": {
"attrs": [
{
"name": "username",
"value": "johndoe"
},
{
"name": "alias",
"value": "myname"
}
]
},
"inner_hits": {
"attrs": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7509375,
"hits": [
{
"_index": "nested",
"_type": "_doc",
"_id": "2",
"_nested": {
"field": "attrs",
"offset": 0
},
"_score": 1.7509375,
"_source": {
"name": "username",
"value": "johndoe"
}
}
]
}
}
}
}
]

Elasticsearch 5.3: Aggregation + Sorting on a Non-Numeric Field

I want to aggregate the data on one field and also get the aggregated data sorted by name.
My data is:
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp001_local000000000000001",
"_score": 10.0,
"_source": {
"name": [
"Person 01"
],
"groupbyid": [
"group0001"
],
"ranking": [
"2.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp002_local000000000000001",
"_score": 85146.375,
"_source": {
"name": [
"Person 02"
],
"groupbyid": [
"group0001"
],
"ranking": [
"10.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp003_local000000000000001",
"_score": 20.0,
"_source": {
"name": [
"Person 03"
],
"groupbyid": [
"group0002"
],
"ranking": [
"-1.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp004_local000000000000001",
"_score": 5.0,
"_source": {
"name": [
"Person 04"
],
"groupbyid": [
"group0002"
],
"ranking": [
"2.0"
]
}
}
My query :
{
"size": 0,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "name:emp*^1000.0"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"order": {
"top_hit_agg": "desc"
},
"size": 10
},
"aggs": {
"top_hit_agg": {
"terms": {
"field": "name"
}
}
}
}
}
}
My mapping is :
{
"name": {
"type": "text",
"fielddata": true,
"fields": {
"lower_case_sort": {
"type": "text",
"fielddata": true,
"analyzer": "case_insensitive_sort"
}
}
},
"groupbyid": {
"type": "text",
"fielddata": true,
"index": "analyzed",
"fields": {
"raw": {
"type": "keyword",
"index": "not_analyzed"
}
}
}
}
I am getting data based on the average of the relevance of grouped records. Now, what I want is to first group the records by groupbyid and then, within each bucket, sort the data by the name field.
I want to group on one field and then, within each grouped bucket, sort on another field. This is sample data.
There are other fields, such as created_on and updated_on, and I also want to be able to sort on those, as well as to get the data grouped alphabetically.
I want to sort on a non-numeric (string) field. I can already do it for numeric fields.
I can do it for the ranking field but not for the name field, which gives the error below:
Expected numeric type on field [name], but got [text];
You're asking for a few things, so I'll try to answer them in turn.
Step 1: Sorting buckets by relevance
I am getting data based on the average of the relevance of grouped records.
If this is what you're attempting to do, it's not what the aggregation you wrote is doing. Terms aggregations default to sorting the buckets by the number of documents in each bucket, descending. To sort the groups by "average relevance" (which I'll interpret as "average _score of documents in the group"), you'd need to add a sub-aggregation on the score and sort the terms aggregation by that:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
}
}
}
}
Step 2: Sorting employees by name
Now, what I want is to first group the records by groupbyid and then, within each bucket, sort the data by the name field.
To sort the documents within each bucket, you can use a top_hits aggregation:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"employees": {
"top_hits": {
"size": 10, // top_hits returns 3 hits per bucket by default; adjust as needed
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
Step 3: Putting it all together
Putting both of the above together, the following aggregation should suit your needs (note that I used a function_score query to simulate "relevance" based on ranking; your query can be anything that produces the relevance you need):
POST /testing-aggregation/employee/_search
{
"size": 0,
"query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field": "ranking"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"size": 10,
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
},
"employees": {
"top_hits": {
"size": 10,
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
}
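To make the intended result easier to picture, here is a plain Python imitation of what the combined aggregation computes: buckets ordered by average score, and the members of each bucket sorted by name. It does not talk to Elasticsearch; group_and_sort is an invented helper, the scores are the _score values from the sample data in the question, and lowercasing is assumed to approximate the case_insensitive_sort analyzer.

```python
from collections import defaultdict
from statistics import mean


def group_and_sort(hits, top_n=10):
    """Group hits by groupbyid, order groups by average score, sort members by name."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["groupbyid"]].append(hit)

    # Counterpart of ordering the terms buckets by the average_score sub-aggregation.
    ordered = sorted(
        groups.items(),
        key=lambda pair: mean(h["score"] for h in pair[1]),
        reverse=True,
    )

    # Counterpart of top_hits sorted on name.lower_case_sort, limited to top_n.
    return [
        (group_id, [h["name"] for h in sorted(members, key=lambda h: h["name"].lower())[:top_n]])
        for group_id, members in ordered
    ]


hits = [
    {"name": "Person 01", "groupbyid": "group0001", "score": 10.0},
    {"name": "Person 02", "groupbyid": "group0001", "score": 85146.375},
    {"name": "Person 03", "groupbyid": "group0002", "score": 20.0},
    {"name": "Person 04", "groupbyid": "group0002", "score": 5.0},
]

for group_id, names in group_and_sort(hits):
    print(group_id, names)
```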

Use Elasticsearch percolate with specific type of field name

I'm building a subscription system for notifications using the percolator field type of Elasticsearch 7.x. The problem is that I can't run a percolate query with certain kinds of field names.
This is an example of the indexed data. As you can see, each document stores a query so that I can run percolate queries against it. The only difference worth mentioning is the name of the field used in the stored query, which can be state or created_by.full_name.raw.
{
"_index": "widgets_2020",
"_type": "widget",
"_score": 1.0,
"_source": {
"created_at": "2020-01-09T21:58:14.123Z",
"query": {
"bool": {
"must": [],
"filter": [
{
"terms": {
"created_by.full_name.raw": [
"Ivan Ledner"
]
}
}
]
}
}
}
},
{
"_index": "widgets_2020",
"_type": "widget",
"_score": 1.0,
"_source": {
"created_at": "2020-01-09T22:02:24.133Z",
"query": {
"bool": {
"must": [],
"filter": [
{
"terms": {
"state": [
"done"
]
}
}
]
}
}
}
}
When I do something like this, Elasticsearch returns the documents I expect.
widgets_2020/_search
{
"query" : {
"percolate" : {
"field" : "query",
"document" : {
"state": ["created"]
}
}
}
}
But when I search with this, it returns nothing.
widgets_2020/_search
{
"query" : {
"percolate" : {
"field" : "query",
"document" : {
"created_by.full_name.raw": ["Ivan Ledner"]
}
}
}
}
Is there a different way of dealing with these types of names? Thanks in advance!
The problem was that I had enabled the map_unmapped_fields_as_text option, which, as its name says, mapped all of my unmapped fields as text. I solved it by mapping all of the attributes explicitly, after which the percolator started to work as expected.
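For illustration, an explicit mapping along those lines might look like the sketch below, written as Python dictionaries. The field types are guesses based on the field names in the question (state as keyword, created_by.full_name as text with a raw keyword sub-field), not the mapping actually used, and both helper functions are hypothetical.

```python
import json


def percolator_mapping():
    """Index body with every field used by the stored queries mapped explicitly."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                "created_at": {"type": "date"},
                "state": {"type": "keyword"},
                "created_by": {
                    "properties": {
                        "full_name": {
                            "type": "text",
                            # Assumed multi-field: raw holds the exact, unanalyzed name.
                            "fields": {"raw": {"type": "keyword"}},
                        }
                    }
                },
            }
        }
    }


def percolate_request(document):
    """Search body that percolates one document against the stored queries."""
    return {"query": {"percolate": {"field": "query", "document": document}}}


if __name__ == "__main__":
    print(json.dumps(percolator_mapping(), indent=2))
    # With a multi-field mapping, the document supplies full_name and the raw
    # sub-field is derived from it at index time.
    print(json.dumps(percolate_request({"created_by": {"full_name": "Ivan Ledner"}})))
```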
