ElasticSearch - Unique counts in nested array - elasticsearch

To make the question easier to follow, here is how my data is mapped.
This is the index template I have:
{
"mappings":
{
"properties":
{
"applicationName":
{
"type": "keyword"
},
"tags":
{
"type": "nested",
"properties":
{
"tagKey":
{
"type": "keyword"
},
"tagKeyword":
{
"type": "keyword"
}
}
}
}
}
}
Here are some sample items,
Sample item 1
"applicationName": "application1"
"tags": [
{"tagKey": "user", "tagKeyword": "aaa"},
{"tagKey": "os", "tagKeyword": "android"}
]
Sample item 2
"applicationName": "application2"
"tags": [
{"tagKey": "user", "tagKeyword": "bbb"},
{"tagKey": "os", "tagKeyword": "ios"}
]
Sample item 3
"applicationName": "application1"
"tags": [
{"tagKey": "user", "tagKeyword": "aaa"},
{"tagKey": "os", "tagKeyword": "pc"}
]
I want to retrieve the count of distinct tagKeyword that has tagKey of "user" for each application.
For example,
[
{
"applicationName": "application1",
"distinctUser": 2
},
{
"applicationName": "application2",
"distinctUser": 1
}
]
Either a solution or a link to documentation related to this issue would be appreciated.

You can use a terms aggregation on the applicationName, then filter the user-only tags through a nested filter aggregation:
POST index-name/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.nestedTags.distinctUser
{
"size": 0,
"aggs": {
"distinctAppName": {
"terms": {
"field": "applicationName",
"size": 10
},
"aggs": {
"nestedTags": {
"nested": {
"path": "tags"
},
"aggs": {
"distinctUser": {
"filter": {
"term": {
"tags.tagKey": "user"
}
}
}
}
}
}
}
}
}
yielding
{
"aggregations" : {
"distinctAppName" : {
"buckets" : [
{
"key" : "application1",
"nestedTags" : {
"distinctUser" : {
"doc_count" : 2
}
}
},
{
"key" : "application2",
"nestedTags" : {
"distinctUser" : {
"doc_count" : 1
}
}
}
]
}
}
}
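One caveat: the doc_count of a filter aggregation is the number of matching nested tag objects, not the number of distinct tagKeyword values. In the sample data application1 has two "user" tags, both "aaa". If truly distinct values are needed, one option is a cardinality sub-aggregation (approximate for large sets) under the filter. The Python sketch below is only an illustration: build_query, count_user_tags and the userTags aggregation name are made up here, and the local function merely emulates the two ways of counting on the sample items.

```python
from collections import defaultdict


def build_query(tag_key="user", size=10):
    """Request body: per application, distinct tagKeyword values for one tagKey."""
    return {
        "size": 0,
        "aggs": {
            "distinctAppName": {
                "terms": {"field": "applicationName", "size": size},
                "aggs": {
                    "nestedTags": {
                        "nested": {"path": "tags"},
                        "aggs": {
                            "userTags": {  # hypothetical name for the filter step
                                "filter": {"term": {"tags.tagKey": tag_key}},
                                "aggs": {
                                    "distinctUser": {
                                        "cardinality": {"field": "tags.tagKeyword"}
                                    }
                                },
                            }
                        },
                    }
                },
            }
        },
    }


# The three sample items from the question.
SAMPLE_DOCS = [
    {"applicationName": "application1",
     "tags": [{"tagKey": "user", "tagKeyword": "aaa"},
              {"tagKey": "os", "tagKeyword": "android"}]},
    {"applicationName": "application2",
     "tags": [{"tagKey": "user", "tagKeyword": "bbb"},
              {"tagKey": "os", "tagKeyword": "ios"}]},
    {"applicationName": "application1",
     "tags": [{"tagKey": "user", "tagKeyword": "aaa"},
              {"tagKey": "os", "tagKeyword": "pc"}]},
]


def count_user_tags(docs, tag_key="user"):
    """Return {app: (matching_tag_count, distinct_keyword_count)}, computed locally."""
    keywords = defaultdict(list)
    for doc in docs:
        for tag in doc["tags"]:
            if tag["tagKey"] == tag_key:
                keywords[doc["applicationName"]].append(tag["tagKeyword"])
    return {app: (len(vals), len(set(vals))) for app, vals in keywords.items()}
```

The first number in each tuple corresponds to what the filter's doc_count reports, the second to what a cardinality aggregation would be expected to report.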

Refer to nested aggregations. Try a terms aggregation on the applicationName field to group by application, then a terms sub-aggregation on the nested field tags.tagKeyword to get the distinct list of values within a given application.
You also have to add a filter on the "tags.tagKey" field with the value "user" to suit your requirement.

Related

Elasticsearch - get all nested objects of all documents

Let's imagine Elasticsearch index where each document represents a country. Country has cities field, which is defined as nested.
Sample mapping (simplified for brevity of this example):
{
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"cities": {
"type": "nested",
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
// other properties are omitted for brevity
}
}
}
}
The documents which I'm inserting to the index look like this:
{
"name": "Slovakia",
"cities": [
{
"name": "Bratislava"
},
{
"name": "Zilina"
},
...
]
}
{
"name": "Czech Republic",
"cities": [
{
"name": "Praha"
},
{
"name": "Brno"
},
...
]
}
Is it possible to compose a query which returns all cities (over all countries) and supports sorting & pagination? In response, I'd like to have the complete nested objects + some fields of the parent object (so that I can display which country the city belongs to).
The first returned page (response) would contain 10 cities from Czech Republic, the second page would contain 10 cities where four of them are (the last ones) from Czech Republic and six are from Slovakia.
I was looking into the composite aggregation, but I don't know how to add the country name to sources:
{
"query": {
"match_all": {}
},
"aggs": {
"nested_aggs": {
"nested": {
"path": "cities"
},
"aggs": {
"by_name": {
"composite": {
"sources": [
{
"cityName": {
"terms": {
"field": "cities.name.keyword",
"order": "asc"
}
}
}
]
}
}
}
}
}
}
Is it possible to compose such query without modifying the Elasticsearch mapping?
All members of composite aggregations need to be defined under the same context — you cannot intermix nested and non-nested contexts.
The easiest option would be to first aggregate on the countries and then on the cities:
{
"size": 0,
"aggs": {
"by_country": {
"terms": {
"field": "name.keyword",
"size": 10
},
"aggs": {
"nested_cities": {
"nested": {
"path": "cities"
},
"aggs": {
"by_cities": {
"terms": {
"field": "cities.name.keyword",
"size": 10
}
}
}
}
}
}
}
}
If you do have the option of changing the mapping, you can leverage the include_in_root feature, which will enable you to perform composite aggs such as:
{
"size": 0,
"aggs": {
"by_name": {
"composite": {
"sources": [
{
"countryName": {
"terms": {
"field": "name.keyword",
"order": "asc"
}
}
},
{
"cityName": {
"terms": {
"field": "cities.name.keyword",
"order": "asc"
}
}
}
]
}
}
}
}
which can be easily paginated.
Here's what the result would look like:
...
"aggregations" : {
"by_name" : {
"after_key" : {
"countryName" : "Slovakia",
"cityName" : "Zilina"
},
"buckets" : [
{
"key" : {
"countryName" : "Czech Republic",
"cityName" : "Brno"
},
"doc_count" : 1
},
{
"key" : {
"countryName" : "Czech Republic",
"cityName" : "Praha"
},
"doc_count" : 1
},
{
"key" : {
"countryName" : "Slovakia",
"cityName" : "Bratislava"
},
"doc_count" : 1
},
{
"key" : {
"countryName" : "Slovakia",
"cityName" : "Zilina"
},
"doc_count" : 1
}
]
}
}
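Pagination with a composite aggregation works by sending the returned after_key back as the "after" parameter of the next request. Here is a minimal Python sketch of that step; the helper name next_page is made up for illustration, and it assumes the request and response shapes shown above.

```python
import copy


def next_page(body, response, agg_name="by_name"):
    """Return a copy of `body` asking for the page after `response`, or None when done."""
    agg = response["aggregations"][agg_name]
    after_key = agg.get("after_key")
    # No buckets (or no after_key) means the previous page was the last one.
    if not agg["buckets"] or after_key is None:
        return None
    new_body = copy.deepcopy(body)  # leave the caller's request untouched
    new_body["aggs"][agg_name]["composite"]["after"] = after_key
    return new_body
```

The page size itself is controlled by the "size" parameter inside the composite block, so a loop would keep calling the search with the body returned here until it gets None.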

Reverse_nested aggregation + top hits : get parent and nested data at the same time

Do you know how to use the reverse_nested aggregation to get both the parent and ONLY the nested data inside my top_hits aggregation?
The 'ONLY' part is the problem right now.
This is my mapping :
{
"ticket": {
"mappings": {
"properties": {
"name": {
"type": "keyword"
}
},
"tasks": {
"type": "nested",
"properties": {
"string_task_name": {
"type": "keyword"
}
}
}
}
}
}
My query uses top hits and reverse nested aggs.
{
"aggs": {
"object_tasks": {
"nested": {
"path": "object_tasks"
},
"aggs": {
"filter_by_tasks_attribute": {
"filter": {
"bool": {
"must": [
{
"wildcard": {
"object_tasks.string_task_name.keyword": "*"
}
}
]
}
},
"aggs": {
"using_reverse_nested": {
"reverse_nested": {
"path": "object_tasks"
},
"aggs": {
"names": {
"top_hits": {
"_source": {
"includes": [
"object_tasks.string_task_name",
"string_name"
]
},
"sort": [
{
"object_tasks.string_task_name.keyword": {
"order": "desc"
}
}
],
"from": 0,
"size": 10
}
}
}
}
}
}
}
}
}
}
{
"hits": {
"total": {
"value": 25,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "random_index",
"_type": "_doc",
"_id": "5",
"_score": null,
"_source": {
"object_tasks": [ ================> I don't want all these tasks names, I just want the task name of the current nested object I am in.
{
"string_task_name": "task1"
},
{
"string_task_name": "task2"
},
{
"string_task_name": "task3"
},
{
"string_task_name": "task4"
}
],
"string_name": "Dummy Ticket 854"
},
"sort": [
"seek_a_sme"
]
}
]
}
}
As you can see, the result is giving me four task names. What I want is to return only one task name.
The only workaround I have found is to copy the data of tickets inside the tasks. But if I can avoid it that would be awesome.
I don't want all these tasks names, I just want the task name of the current nested object I am in.
The statement "of the current nested object I'm in" implies that you are inside of a nested context but you cannot be in one when you escape it through reverse_nested…
I'm not sure if I truly understood what you're gunning for here but you could aggregate on the terms of object_tasks.string_task_name.keyword and the keys of this aggregation would then function as the individual "current nested objects" that you're after:
{
"size": 0,
"aggs": {
"object_tasks": {
"nested": {
"path": "object_tasks"
},
"aggs": {
"filter_by_tasks_attribute": {
"filter": {
"bool": {
"must": [
{
"wildcard": {
"object_tasks.string_task_name.keyword": "*"
}
}
]
}
},
"aggs": {
"by_string_task_name": {
"terms": {
"field": "object_tasks.string_task_name.keyword",
"order": {
"_key": "desc"
},
"size": 10
},
"aggs": {
"using_reverse_nested": {
"reverse_nested": {},
"aggs": {
"names": {
"top_hits": {
"_source": {
"includes": [
"string_name"
]
},
"from": 0,
"size": 10
}
}
}
}
}
}
}
}
}
}
}
}
yielding
"aggregations" : {
"object_tasks" : {
...
"filter_by_tasks_attribute" : {
...
"by_string_task_name" : {
...
"buckets" : [
{
"key" : "task4", <--
...
"using_reverse_nested" : {
...
"names" : {
"hits" : {
...
"hits" : [
{
...
"_source" : {
"string_name" : "Dummy Ticket 854" <--
}
}
]
}
}
}
},
{
"key" : "task3", <--
...
},
{
"key" : "task2", <--
...
},
{
"key" : "task1", <--
...
}
}
]
}
}
}
}
Notice that the top_hits aggregation doesn't need to be sorted anymore -- object_tasks.string_task_name.keyword will always be the same for any currently aggregated terms bucket. What I did instead was order this terms aggregation by _key which works the same way as a top_hits sort would have. BTW -- yours was missing the nested path parameter.
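To turn that response into flat (task name, ticket name) pairs on the client side, something like the following Python sketch could be used. The function name task_ticket_pairs is made up for illustration, and it assumes the aggregation names from the query above.

```python
def task_ticket_pairs(response):
    """Yield (task_name, ticket_name) for every ticket found under each task bucket."""
    buckets = (
        response["aggregations"]["object_tasks"]
        ["filter_by_tasks_attribute"]["by_string_task_name"]["buckets"]
    )
    for bucket in buckets:
        # Each terms bucket key is one task name; the top_hits under
        # reverse_nested are the parent tickets that contain that task.
        hits = bucket["using_reverse_nested"]["names"]["hits"]["hits"]
        for hit in hits:
            yield bucket["key"], hit["_source"]["string_name"]
```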

ElasticSearch Aggregation Filter (not nested) Array

I have mapping like that:
PUT myindex1/_mapping
{
"properties": {
"program":{
"properties":{
"rounds" : {
"properties" : {
"id" : {
"type" : "keyword"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
And my example docs:
POST myindex1/_doc
{
"program": {
"rounds":[
{"id":"00000000-0000-0000-0000-000000000000", "name":"Test1"},
{"id":"00000000-0000-0000-0000-000000000001", "name":"Fact2"}
]
}
}
POST myindex1/_doc
{
"program": {
"rounds":[
{"id":"00000000-0000-0000-0000-000000000002", "name":"Test3"},
{"id":"00000000-0000-0000-0000-000000000003", "name":"Fact4"}
]
}
}
POST myindex1/_doc
{
"program": {
"rounds":[
{"id":"00000000-0000-0000-0000-000000000004", "name":"Test5"},
{"id":"00000000-0000-0000-0000-000000000005", "name":"Fact6"}
]
}
}
Purpose: get only the names of rounds that match a wildcard supplied by the user.
Aggregation query:
GET myindex1/_search
{
"aggs": {
"result": {
"aggs": {
"names": {
"terms": {
"field": "program.rounds.name.keyword",
"size": 10000,
"order": {
"_key": "asc"
}
}
}
},
"filter": {
"bool": {
"must":[
{
"wildcard": {
"program.rounds.name": "*test*"
}
}
]
}
}
}
},
"size": 0
}
This aggregation returns all 6 names, but I need only Test1, Test3, Test5. I also tried the "include": "/tes.*/i" regex pattern for terms, but ignoring case does not work.
Note: I'm not sure about using the nested type, because I'm not interested in the association between id and name (at least for now).
ElasticSearch version: 7.7.0
If you want to only aggregate specific rounds based on a condition on the name field, then you need to make rounds nested, otherwise all name values end up in the same field.
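To see why, here is a small Python emulation (not Elasticsearch code) of the two behaviours on the three example docs. The function names are made up for illustration, and the wildcard is approximated by a case-insensitive substring test.

```python
DOCS = [
    {"program": {"rounds": [{"name": "Test1"}, {"name": "Fact2"}]}},
    {"program": {"rounds": [{"name": "Test3"}, {"name": "Fact4"}]}},
    {"program": {"rounds": [{"name": "Test5"}, {"name": "Fact6"}]}},
]


def flatten_rounds(doc):
    """Mimic a non-nested array of objects: all names land in one multi-valued field."""
    return [r["name"] for r in doc["program"]["rounds"]]


def terms_after_doc_filter(docs, needle):
    """Keep whole documents with any matching name, then list every name in them."""
    kept = [d for d in docs if any(needle in n.lower() for n in flatten_rounds(d))]
    return sorted(n for d in kept for n in flatten_rounds(d))


def terms_after_nested_filter(docs, needle):
    """Filter each round object on its own, as a nested filter aggregation would."""
    return sorted(n for d in docs for n in flatten_rounds(d) if needle in n.lower())
```

With the needle "test", the first function keeps all three documents and therefore lists all six names, while the second keeps only the three Test rounds.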
Your mapping needs to be changed to this:
PUT myindex1/
{
"mappings": {
"properties": {
"program": {
"properties": {
"rounds": {
"type": "nested", <--- add this
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
And then your query needs to change to this:
GET myindex1/_search
{
"size": 0,
"query": {
"nested": {
"path": "program.rounds",
"query": {
"bool": {
"must": [
{
"wildcard": {
"program.rounds.name": "*Test*"
}
}
]
}
}
}
},
"aggs": {
"rounds": {
"nested": {
"path": "program.rounds"
},
"aggs": {
"name_filter": {
"filter": {
"wildcard": {
"program.rounds.name": "*Test*"
}
},
"aggs": {
"names": {
"terms": {
"field": "program.rounds.name.keyword",
"size": 10000,
"order": {
"_key": "asc"
}
}
}
}
}
}
}
}
}
And the result will be:
"aggregations" : {
"rounds" : {
"doc_count" : 6,
"name_filter" : {
"doc_count" : 3,
"names" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Test1",
"doc_count" : 1
},
{
"key" : "Test3",
"doc_count" : 1
},
{
"key" : "Test5",
"doc_count" : 1
}
]
}
}
}
}
UPDATE:
Actually, you can achieve what you want without introducing nested types by using the following query. You were close, but the include pattern was wrong:
GET myindex1/_search
{
"aggs": {
"result": {
"aggs": {
"names": {
"terms": {
"field": "program.rounds.name.keyword",
"size": 10000,
"include": "[Tt]est.*",
"order": {
"_key": "asc"
}
}
}
},
"filter": {
"bool": {
"must": [
{
"wildcard": {
"program.rounds.name": "*Test*"
}
}
]
}
}
}
},
"size": 0
}
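The include pattern is matched against the whole keyword value and is case-sensitive, which is why the character class [Tt] is spelled out. A rough Python approximation of that behaviour is shown below; apply_include is a made-up name, and Python's regex syntax is not identical to the one Elasticsearch uses, although this particular pattern is valid in both.

```python
import re

NAMES = ["Fact2", "Fact4", "Fact6", "Test1", "Test3", "Test5"]


def apply_include(terms, pattern):
    """Keep the terms that the pattern matches in full (anchored at both ends)."""
    return [t for t in terms if re.fullmatch(pattern, t)]
```

For instance, apply_include(NAMES, "[Tt]est.*") keeps the three Test names, whereas a lowercase-only pattern such as "tes.*" keeps none of them.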

ES query to match all elements in array

I have this document with a nested array that I want to filter with this query.
I want ES to return only the documents where all items have changes = 0.
If a document has even a single item in the list with changes = 1, it should be discarded.
Is there any way I can achieve this starting from the query I have already written? Or should I use a script instead?
DOCUMENTS:
{
"id": "abc",
"_source" : {
"trips" : [
{
"type" : "home",
"changes" : 0
},
{
"type" : "home",
"changes" : 1
}
]
}
},
{
"id": "def",
"_source" : {
"trips" : [
{
"type" : "home",
"changes" : 0
},
{
"type" : "home",
"changes" : 0
}
]
}
}
QUERY:
GET trips_solutions/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": "abc"
}
}
},
{
"nested": {
"path": "trips",
"query": {
"range": {
"trips.changes": {
"gt": -1,
"lt": 1
}
}
}
}
}
]
}
}
}
EXPECTED RESULT:
{
"id": "def",
"_source" : {
"trips" : [
{
"type" : "home",
"changes" : 0
},
{
"type" : "home",
"changes" : 0
}
]
}
}
Elasticsearch version: 7.6.2
I have already read these answers, but they didn't help me:
https://discuss.elastic.co/t/how-to-match-all-item-in-nested-array/163873
ElasticSearch: How to query exact nested array
First off, if you filter by id: abc, you obviously won't be able to get id: def back.
Second, due to the nature of nested fields which are treated as separate subdocuments, you cannot query for all trips that have the changes equal to 0 -- the connection between the individual trips is lost and they "don't know about each other".
What you can do is return only the trips that matched your nested query using inner_hits:
GET trips_solutions/_search
{
"_source": false,
"query": {
"bool": {
"must": [
{
"nested": {
"inner_hits": {},
"path": "trips",
"query": {
"term": {
"trips.changes": {
"value": 0
}
}
}
}
}
]
}
}
}
The easiest solution then is to dynamically save this nested info on the parent object, as discussed here, and use a range/term query on the resulting array.
EDIT:
Here's how you do it using copy_to onto the doc's top level:
PUT trips_solutions
{
"mappings": {
"properties": {
"trips_changes": {
"type": "integer"
},
"trips": {
"type": "nested",
"properties": {
"changes": {
"type": "integer",
"copy_to": "trips_changes"
}
}
}
}
}
}
trips_changes will be an array of numbers -- I presume they're integers but more types are available.
Then syncing a few docs:
POST trips_solutions/_doc
{"trips":[{"type":"home","changes":0},{"type":"home","changes":1}]}
POST trips_solutions/_doc
{"trips":[{"type":"home","changes":0},{"type":"home","changes":0}]}
And finally querying:
GET trips_solutions/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "trips",
"query": {
"term": {
"trips.changes": {
"value": 0
}
}
}
}
},
{
"script": {
"script": {
"source": "doc.trips_changes.stream().filter(val -> val != 0).count() == 0"
}
}
}
]
}
}
}
Note that we first filter normally using the nested term query to narrow down our search context (scripts are slow so this is useful). We then check if there are any non-zero changes in the accumulated top-level changes and reject those that apply.
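The combined logic of the two clauses can be emulated locally as follows. This is only a Python sketch of the intent, not the Painless script itself, and the function name matches is made up for illustration.

```python
def matches(doc):
    """True when some trip has changes == 0 and no copied value is non-zero."""
    # What copy_to accumulates into the top-level trips_changes field.
    trips_changes = [trip["changes"] for trip in doc["trips"]]
    # Nested term query: at least one trip with changes == 0.
    has_zero_trip = any(c == 0 for c in trips_changes)
    # Script clause: the count of non-zero values must be 0.
    no_nonzero = sum(1 for c in trips_changes if c != 0) == 0
    return has_zero_trip and no_nonzero
```

Applied to the two synced docs, only the one whose trips all have changes equal to 0 passes both checks.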

Elastic Search query return terms within array of a specific type

I've a mapping of an index as following:
{"tagged_index":{"mappings":{"tagged":{"properties":{"tags":{"properties":{"resources":{"properties":{"tagName":{"type":"string"},"type":{"type":"string"}}}}},"content":{"type":"string"}}}}}}
Where Resources is an array which can have multiple tags. For example
{"_id":"82906194","_source":{"tags":{"resources":[{"type":"Person","tagName":"Kim_Kardashian",},{"type":"Person","tagName":"Kanye_West",},{"type":"City","tagName":"New_York",},...},"content":" Popular NEWS ..."}}
,
{"_id":"82906195","_source":{"tags":{"resources":[{"type":"City","tagName":"London",},{"type":"Country","tagName":"USA",},{"type":"Music","tagName":"Hello",},...},"content":" Adele's Hello..."}},
...
I do know how to extract important terms[tagName] with the below query, but I do not want terms[tagName] of all types.
How can I extract only the terms which are for example Cities only [type:City]? (I would like to get a list of tagName where the type is City i.e. London, New_York, Berlin,...)
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*",
"analyze_wildcard": true
}
}
}
},
"aggs": {
"Cities": {
"terms": {
"field": "tags.resources.tagName",
"size": 10,
"order": {
"_count": "desc"
}
}
}
}
}
Following is how the required output should look like:
{"took":1200,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":5179261,"max_score":0.0,"hits":[]},"aggregations":{"Cities":{"doc_count_error_upper_bound":46737,"sum_other_doc_count":36037440,"buckets":[{"key":"London","doc_count":332820},{"key":"New_York","doc_count":211274},{"key":"Berlin","doc_count":156954},{"key":"Amsterdam","doc_count":132173},...
Can you try this:
{
"_source" : ["tags.resources.tagName"],
"query": {
"term": {
"tags.resources.type": {
"value": "City"
}
}
}
}
The above query will fetch the documents containing resources of type City, provided resources is mapped as an object type.
After Edit
The problem is to group by tagName for the resources of type City. That cannot be achieved with your current mapping; you will have to change the resources field to the nested type.
The mapping would look like this:
"mappings": {
"resource": {
"properties": {
"tags": {
"properties": {
"content": {
"type": "string"
},
"resources": {
"type": "nested",
"properties": {
"tagName": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
}
}
Final query would be :
{
"size": 0,
"query": {
"nested": {
"path": "tags.resources",
"query": {
"match": {
"tags.resources.type": "city"
}
}
}
},
"aggs": {
"resources Nested path": {
"nested": {
"path": "tags.resources"
},
"aggs": {
"city type": {
"filter": {
"term": {
"tags.resources.type": "city"
}
},
"aggs": {
"group By tagName": {
"terms": {
"field": "tags.resources.tagName"
}
}
}
}
}
}
}
}
Output would be:
"aggregations": {
"resources Nested path": {
"doc_count": 6,
"city type": {
"doc_count": 2,
"group By tagName": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "london",
"doc_count": 1
},
{
"key": "new_york",
"doc_count": 1
}
]
}
}
}
}
