Elasticsearch - get all nested objects of all documents - elasticsearch

Let's imagine Elasticsearch index where each document represents a country. Country has cities field, which is defined as nested.
Sample mapping (simplified for brevity of this example):
{
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"cities": {
"type": "nested",
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
// other properties are omitted for brevity
}
}
}
}
The documents which I'm inserting to the index look like this:
{
"name": "Slovakia",
"cities": [
{
"name": "Bratislava"
},
{
"name": "Zilina"
},
...
]
}
{
"name": "Czech Republic",
"cities": [
{
"name": "Praha"
},
{
"name": "Brno"
},
...
]
}
Is it possible to compose a query which returns all cities (over all countries) and supports sorting & pagination? In response, I'd like to have the complete nested objects + some fields of the parent object (so that I can display which country the city belongs to).
The first returned page (response) would contain 10 cities from Czech Republic, the second page would contain 10 cities where four of them are (the last ones) from Czech Republic and six are from Slovakia.
I was looking into composite aggregation, but I don't know how add country name to sources:
{
"query": {
"match_all": {}
},
"aggs": {
"nested_aggs": {
"nested": {
"path": "cities"
},
"aggs": {
"by_name": {
"composite": {
"sources": [
{
"cityName": {
"terms": {
"field": "cities.name.keyword",
"order": "asc"
}
}
}
]
}
}
}
}
}
}
Is it possible to compose such query without modifying the Elasticsearch mapping?

All members of composite aggregations need to be defined under the same context — you cannot intermix nested and non-nested contexts.
The easiest option would be to first aggregate on the countries and then on the cities:
{
"size": 0,
"aggs": {
"by_country": {
"terms": {
"field": "name.keyword",
"size": 10
},
"aggs": {
"nested_cities": {
"nested": {
"path": "cities"
},
"aggs": {
"by_cities": {
"terms": {
"field": "cities.name.keyword",
"size": 10
}
}
}
}
}
}
}
}
If you do have the option of changing the mapping, you can leverage the include_in_root feature which'll enable you to perform composite aggs such as:
{
"size": 0,
"aggs": {
"by_name": {
"composite": {
"sources": [
{
"countryName": {
"terms": {
"field": "name.keyword",
"order": "asc"
}
}
},
{
"cityName": {
"terms": {
"field": "cities.name.keyword",
"order": "asc"
}
}
}
]
}
}
}
}
which can be easily paginated.
Here's what the result would look like:
...
"aggregations" : {
"by_name" : {
"after_key" : {
"countryName" : "Slovakia",
"cityName" : "Zilina"
},
"buckets" : [
{
"key" : {
"countryName" : "Czech Republic",
"cityName" : "Brno"
},
"doc_count" : 1
},
{
"key" : {
"countryName" : "Czech Republic",
"cityName" : "Praha"
},
"doc_count" : 1
},
{
"key" : {
"countryName" : "Slovakia",
"cityName" : "Bratislava"
},
"doc_count" : 1
},
{
"key" : {
"countryName" : "Slovakia",
"cityName" : "Zilina"
},
"doc_count" : 1
}
]
}
}

Related

ElasticSearch - Unique counts in nested array

For the sake of easier understanding, I will show you how my data is mapped.
Here is the template I'm having.
{
"mappings":
{
"properties":
{
"applicationName":
{
"type": "keyword"
},
"tags":
{
"type": "nested",
"properties":
{
"tagKey":
{
"type": "keyword"
},
"tagKeyword":
{
"type": "keyword"
}
}
}
}
}
}
Here are some sample items,
Sample item 1
"applicationName": "application1"
"tags": [
{"tagKey": "user", "tagKeyword": "aaa"},
{"tagKey": "os", "tagKeyword": "android"}
]
Sample item 2
"applicationName": "application2"
"tags": [
{"tagKey": "user", "tagKeyword": "bbb"},
{"tagKey": "os", "tagKeyword": "ios"}
]
Sample item 3
"applicationName": "application1"
"tags": [
{"tagKey": "user", "tagKeyword": "aaa"},
{"tagKey": "os", "tagKeyword": "pc"}
]
I want to retrieve the count of distinct tagKeyword that has tagKey of "user" for each application.
For example,
[
{
"applicationName": "application1",
"distinctUser": 2
},
{
"applicationName": "application2",
"distinctUser": 1
}
]
Both solution or URL to the document related to this issue will be appreciated.
You can use a terms aggregation on the applicationName, then filter the user-only tags through a nested filter aggregation:
POST index-name/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.nestedTags.distinctUser
{
"size": 0,
"aggs": {
"distinctAppName": {
"terms": {
"field": "applicationName",
"size": 10
},
"aggs": {
"nestedTags": {
"nested": {
"path": "tags"
},
"aggs": {
"distinctUser": {
"filter": {
"term": {
"tags.tagKey": "user"
}
}
}
}
}
}
}
}
}
yielding
{
"aggregations" : {
"distinctAppName" : {
"buckets" : [
{
"key" : "application1",
"nestedTags" : {
"distinctUser" : {
"doc_count" : 2
}
}
},
{
"key" : "application2",
"nestedTags" : {
"distinctUser" : {
"doc_count" : 1
}
}
}
]
}
}
}
Refer nested aggregations. Try term aggregation for the field applicationName to group by applications and then do term sub-aggregation for nested field tags.tagKeyword to get distinct list of values within a given application.
Also you have to add a filter for "tag.tagKey" field as "user" to suit your requirement

Reverse_nested aggregation + top hits : get parent and nested data at the same time

Do you know how to use reverse_nested aggregation to get both the parent and ONLY the nested data inside my top hit aggregations ?
The 'ONLY' part is the problem right now.
This is my mapping :
{
"ticket": {
"mappings": {
"properties": {
"name": {
"type": "keyword"
}
},
"tasks": {
"type": "nested",
"properties": {
"string_task_name": {
"type": "keyword"
}
}
}
}
}
}
My query uses top hits and reverse nested aggs.
{
"aggs": {
"object_tasks": {
"nested": {
"path": "object_tasks"
},
"aggs": {
"filter_by_tasks_attribute": {
"filter": {
"bool": {
"must": [
{
"wildcard": {
"object_tasks.string_task_name.keyword": "*"
}
}
]
}
},
"aggs": {
"using_reverse_nested": {
"reverse_nested": {
"path": "object_tasks"
},
"aggs": {
"names": {
"top_hits": {
"_source": {
"includes": [
"object_tasks.string_task_name",
"string_name"
]
},
"sort": [
{
"object_tasks.string_task_name.keyword": {
"order": "desc"
}
}
],
"from": 0,
"size": 10
}
}
}
}
}
}
}
}
}
}
{
"hits": {
"total": {
"value": 25,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "random_index",
"_type": "_doc",
"_id": "5",
"_score": null,
"_source": {
"object_tasks": [ ================> I don't want all these tasks names, I just want the task name of the current nested object I am in.
{
"string_task_name": "task1"
},
{
"string_task_name": "task2"
},
{
"string_task_name": "task3"
},
{
"string_task_name": "task4"
}
],
"string_name": "Dummy Ticket 854"
},
"sort": [
"seek_a_sme"
]
}
]
}
}
As you can see the result is giving me 4 tasks name. What I want is to return only 1 task name.
The only workaround I have found is to copy the data of tickets inside the tasks. But if I can avoid it that would be awesome.
I don't want all these tasks names, I just want the task name of the current nested object I am in.
The statement "of the current nested object I'm in" implies that you are inside of a nested context but you cannot be in one when you escape it through reverse_nested…
I'm not sure if I truly understood what you're gunning for here but you could aggregate on the terms of object_tasks.string_task_name.keyword and the keys of this aggregation would then function as the individual "current nested objects" that you're after:
{
"size": 0,
"aggs": {
"object_tasks": {
"nested": {
"path": "object_tasks"
},
"aggs": {
"filter_by_tasks_attribute": {
"filter": {
"bool": {
"must": [
{
"wildcard": {
"object_tasks.string_task_name.keyword": "*"
}
}
]
}
},
"aggs": {
"by_string_task_name": {
"terms": {
"field": "object_tasks.string_task_name.keyword",
"order": {
"_key": "desc"
},
"size": 10
},
"aggs": {
"using_reverse_nested": {
"reverse_nested": {},
"aggs": {
"names": {
"top_hits": {
"_source": {
"includes": [
"string_name"
]
},
"from": 0,
"size": 10
}
}
}
}
}
}
}
}
}
}
}
}
yielding
"aggregations" : {
"object_tasks" : {
...
"filter_by_tasks_attribute" : {
...
"by_string_task_name" : {
...
"buckets" : [
{
"key" : "task4", <--
...
"using_reverse_nested" : {
...
"names" : {
"hits" : {
...
"hits" : [
{
...
"_source" : {
"string_name" : "Dummy Ticket 854" <--
}
}
]
}
}
}
},
{
"key" : "task3", <--
...
},
{
"key" : "task2", <--
...
},
{
"key" : "task1", <--
...
}
}
]
}
}
}
}
Notice that the top_hits aggregation doesn't need to be sorted anymore -- object_tasks.string_task_name.keyword will always be the same for any currently aggregated terms bucket. What I did instead was order this terms aggregation by _key which works the same way as a top_hits sort would have. BTW -- yours was missing the nested path parameter.

ElasticSearch Aggregation Filter (not nested) Array

I have mapping like that:
PUT myindex1/_mapping
{
"properties": {
"program":{
"properties":{
"rounds" : {
"properties" : {
"id" : {
"type" : "keyword"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
And my example docs:
POST myindex1/_doc
{
"program": {
"rounds":[
{"id":"00000000-0000-0000-0000-000000000000", "name":"Test1"},
{"id":"00000000-0000-0000-0000-000000000001", "name":"Fact2"}
]
}
}
POST myindex1/_doc
{
"program": {
"rounds":[
{"id":"00000000-0000-0000-0000-000000000002", "name":"Test3"},
{"id":"00000000-0000-0000-0000-000000000003", "name":"Fact4"}
]
}
}
POST myindex1/_doc
{
"program": {
"rounds":[
{"id":"00000000-0000-0000-0000-000000000004", "name":"Test5"},
{"id":"00000000-0000-0000-0000-000000000005", "name":"Fact6"}
]
}
}
Purpose: get only names of rounds that filtered as wildcard by user.
Aggregation query:
GET myindex1/_search
{
"aggs": {
"result": {
"aggs": {
"names": {
"terms": {
"field": "program.rounds.name.keyword",
"size": 10000,
"order": {
"_key": "asc"
}
}
}
},
"filter": {
"bool": {
"must":[
{
"wildcard": {
"program.rounds.name": "*test*"
}
}
]
}
}
}
},
"size": 0
}
This aggregation returns all 6 names, but I need only Test1,Test3,Test5. Also tried include": "/tes.*/i" regex pattern for terms, but ignore case does not work.
Note: I'm note sure abount nested type, because I don't interested in association between Id and Name (at least for now).
ElasticSearch version: 7.7.0
If you want to only aggregate specific rounds based on a condition on the name field, then you need to make rounds nested, otherwise all name values end up in the same field.
Your mapping needs to be changed to this:
PUT myindex1/
{
"mappings": {
"properties": {
"program": {
"properties": {
"rounds": {
"type": "nested", <--- add this
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
And then your query needs to change to this:
GET myindex1/_search
{
"size": 0,
"query": {
"nested": {
"path": "program.rounds",
"query": {
"bool": {
"must": [
{
"wildcard": {
"program.rounds.name": "*Test*"
}
}
]
}
}
}
},
"aggs": {
"rounds": {
"nested": {
"path": "program.rounds"
},
"aggs": {
"name_filter": {
"filter": {
"wildcard": {
"program.rounds.name": "*Test*"
}
},
"aggs": {
"names": {
"terms": {
"field": "program.rounds.name.keyword",
"size": 10000,
"order": {
"_key": "asc"
}
}
}
}
}
}
}
}
}
And the result will be:
"aggregations" : {
"rounds" : {
"doc_count" : 6,
"name_filter" : {
"doc_count" : 3,
"names" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Test1",
"doc_count" : 1
},
{
"key" : "Test3",
"doc_count" : 1
},
{
"key" : "Test5",
"doc_count" : 1
}
]
}
}
}
}
UPDATE:
Actually, you can achieve what you want without introducing nested types with the following query. You were close, but the include pattern was wrong
GET myindex1/_search
{
"aggs": {
"result": {
"aggs": {
"names": {
"terms": {
"field": "program.rounds.name.keyword",
"size": 10000,
"include": "[Tt]est.*",
"order": {
"_key": "asc"
}
}
}
},
"filter": {
"bool": {
"must": [
{
"wildcard": {
"program.rounds.name": "*Test*"
}
}
]
}
}
}
},
"size": 0
}

Elasticsearch - Applying multi level filter on nested aggregation bucket?

I'm, trying to get distinct nested objects by applying multiple filters.
Basically in Elasticsearch I have cities as top level document and inside I have nested citizens documents, which have another nested pets documents.
I am trying to get all citizens that have certain conditions applied on all of these 3 levels (cities, citizens and pets):
Give me all distinct citizens
that have age:"40",
that have pets "name":"Casper",
from cities with office_type="secondary"
I know that to filter 1st level I can use query condition, and then if I need to filter the nested citizens I can add a filter in the aggregation level.
I am using this article as an example: https://iridakos.com/tutorials/2018/10/22/elasticsearch-bucket-aggregations.html
Query working so far:
GET city_offices/_search
{
"size" : 10,
"query": {
"term" : { "office_type" : "secondary" }
},
"aggs": {
"citizens": {
"nested": {
"path": "citizens"
},
"aggs": {
"inner_agg": {
"filter": {
"term": { "citizens.age": "40" }
} ,
"aggs": {
"occupations": {
"terms": {
"field": "citizens.occupation"
}
}
}
}
}
}
}
}
BUT: How can I add the "pets" nested filter condition?
Mapping:
PUT city_offices
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"city": {
"type": "keyword"
},
"office_type": {
"type": "keyword"
},
"citizens": {
"type": "nested",
"properties": {
"occupation": {
"type": "keyword"
},
"age": {
"type": "integer"
},
"pets": {
"type": "nested",
"properties": {
"kind": {
"type": "keyword"
},
"name": {
"type": "keyword"
},
"age": {
"type": "integer"
}
}
}
}
}
}
}
}
}
Index data:
PUT /city_offices/doc/1
{
"city":"Athens",
"office_type":"secondary",
"citizens":[
{
"occupation":"Statistician",
"age":30,
"pets":[
{
"kind":"Cat",
"name":"Phoebe",
"age":14
}
]
},
{
"occupation":"Librarian",
"age":30,
"pets":[
{
"kind":"Rabbit",
"name":"Nino",
"age":13
}
]
},
{
"occupation":"Librarian",
"age":40,
"pets":[
{
"kind":"Rabbit",
"name":"Nino",
"age":13
}
]
},
{
"occupation":"Statistician",
"age":40,
"pets":[
{
"kind":"Rabbit",
"name":"Casper",
"age":2
},
{
"kind":"Rabbit",
"name":"Nino",
"age":13
},
{
"kind":"Dog",
"name":"Nino",
"age":15
}
]
}
]
}
So I found a solution for this.
Basically I apply top level filters in the query section and then apply rest of conditions in the aggregations.
First I apply citizens level filter aggregation, then I go inside nested pets and apply the filter and then I need to get back up to citizens level (using reverse_nested: citizens) and then set the term that will generate the final bucket.
Query looks like this:
GET city_offices/_search
{
"size" : 10,
"query": {
"term" : { "office_type" : "secondary" }
},
"aggs": {
"citizens": {
"nested": {
"path": "citizens"
},
"aggs": {
"inner": {
"filter": {
"term": { "citizens.age": "40" }
} ,
"aggs": {
"occupations": {
"nested": {
"path": "citizens.pets"
},
"aggs": {
"inner_pets": {
"filter": {
"term": { "citizens.pets.name": "Casper" }
} ,
"aggs": {
"lll": {
"reverse_nested": {
"path": "citizens"
},
"aggs": {
"xxx": {
"terms": {
"field": "citizens.occupation",
"size": 10
}
}
}
}
}
}
}
}
}
}
}
}
}
}
The response bucket looks like this:
"xxx": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Librarian",
"doc_count": 1
},
{
"key": "Statistician",
"doc_count": 1
}
]
}
Any other suggestions?

Elasticsearch aggregation by arrays of String

I have an ElasticSearch index, where I store telephony transactions (SMS, MMS, Calls, etc ) with their associated costs.
The key of these documents are the MSISDN (MSISDN = phone number). In my app, I know that there are group of users. Each users can have one or more MSISDN.
Here is the mapping of this kind of documents :
"mappings" : {
"cdr" : {
"properties" : {
"callDatetime" : {
"type" : "long"
},
"callSource" : {
"type" : "string"
},
"callType" : {
"type" : "string"
},
"callZone" : {
"type" : "string"
},
"calledNumber" : {
"type" : "string"
},
"companyKey" : {
"type" : "string"
},
"consumption" : {
"properties" : {
"data" : {
"type" : "long"
},
"voice" : {
"type" : "long"
}
}
},
"cost" : {
"type" : "double"
},
"country" : {
"type" : "string"
},
"included" : {
"type" : "boolean"
},
"msisdn" : {
"type" : "string"
},
"network" : {
"type" : "string"
}
}
}
}
My goal and issue :
My goal is to make a query that retrieve cost by callType by group. But groups are not represented in ElasticSearch, only in my PostgreSQL database.
So I will make a method that retrieves all the MSISDN for every existing group, and get something like a List of String arrays, containing every MSISDN within each group.
Let's say I have something like :
"msisdn_by_group" : [
{
"group1" : ["01111111111", "02222222222", "033333333333", "044444444444"]
},
{
"group2" : ["05555555555","06666666666"]
}
]
Now, I will use this to generate an Elasticsearch query. I want to make with an aggregation, the sum of the cost, for all those terms in different buckets, and then split it again by callType. (to make a stackedbar chart).
I've tried several things, but didn't manage to make it work (histogram, buckets, term and sum was mainly the keyword i'm playing with).
If somebody here can help me with the order, and the keywords I can use to achieve this, it would be great :) Thanks
EDIT :
Here is my last try :
QUERY:
{
"aggs" : {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
}
I go the expected result, but it missing the "group" split, as I don't know how to pass the MSISDN arrays as a criteria :
RESULT :
"aggregations": {
"cost_histogram": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "data",
"doc_count": 5925,
"cost_histogram_sum": {
"value": 0
}
},
{
"key": "sms_mms",
"doc_count": 5804,
"cost_histogram_sum": {
"value": 91.76999999999995
}
},
{
"key": "voice",
"doc_count": 5299,
"cost_histogram_sum": {
"value": 194.1196
}
},
{
"key": "sms_mms_plus",
"doc_count": 35,
"cost_histogram_sum": {
"value": 7.2976
}
}
]
}
}
Ok I found out how to make this with one query, but it's damn a long query because it repeats for every group, but I have no choise. I'm using the "filter" aggregator.
Here is a working example based on the array I wrote in my question above :
POST localhost:9200/cdr/_search?size=0
{
"query": {
"term" : {
"companyKey" : 1
}
},
"aggs" : {
"group_1_split_cost": {
"filter": {
"bool": {
"should": [{
"bool": {
"must": {
"match": {
"msisdn": "01111111111"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "02222222222"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "03333333333"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "04444444444"
}
}
}
}]
}
},
"aggs": {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
},
"group_2_split_cost": {
"filter": {
"bool": {
"should": [{
"bool": {
"must": {
"match": {
"msisdn": "05555555555"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "06666666666"
}
}
}
}]
}
},
"aggs": {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
}
}
}
Thanks to the newer versions of Elasticsearch we can now nest very deep aggregations, but it's still a bit too bad that we can't pass arrays of values to an "OR" operator or something like that. It could reduce the size of those queries, I guess. Even if they are a bit special and used in niche cases, as mine.

Resources