Elasticsearch aggregation on a field with dynamic properties

Given the following document, where variants is mapped as a nested type and options as a flattened type:
{
  "doc_type" : "product",
  "id" : 1,
  "variants" : [
    {
      "options" : {
        "Size" : "XS"
      },
      "price" : 1
    },
    {
      "options" : {
        "Size" : "S",
        "Material" : "Wool"
      },
      "price" : 6.99
    }
  ]
}
I want to run an aggregation that produces data in the following format:
{
  "variants.options.Size" : {
    "buckets" : [
      {
        "key" : "XS",
        "doc_count" : 1
      },
      {
        "key" : "S",
        "doc_count" : 1
      }
    ]
  },
  "variants.options.Material" : {
    "buckets" : [
      {
        "key" : "Wool",
        "doc_count" : 1
      }
    ]
  }
}
I could very easily do something like:
"aggs": {
"variants.options.Size": {
"terms": {
"field": "variants.options.Size"
}
},
"variants.options.Material": {
"terms": {
"field": "variants.options.Material"
}
}
}
The caveat is that we use the flattened type for options because its fields are dynamic, so there is no way for me to know beforehand that we want to aggregate on Size and Material.
Essentially, I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?

I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?
Not directly. I had the same question a while back. I haven't found a clean solution to this day and I'm convinced there isn't one.
Luckily, there's a scripted_metric workaround that I outlined here. Applying it to your use case:
POST your_index/_search
{
  "size": 0,
  "aggs": {
    "dynamic_variant_options": {
      "scripted_metric": {
        "init_script": "state.buckets = [:];",
        "map_script": """
          def variants = params._source['variants'];
          for (def variant : variants) {
            for (def entry : variant['options'].entrySet()) {
              def key = entry.getKey();
              def value = entry.getValue();
              def path = "variants.options." + key;
              if (state.buckets.containsKey(path)) {
                if (state.buckets[path].containsKey(value)) {
                  state.buckets[path][value] += 1;
                } else {
                  state.buckets[path][value] = 1;
                }
              } else {
                state.buckets[path] = [value:1];
              }
            }
          }
        """,
        "combine_script": "return state",
        "reduce_script": "return states"
      }
    }
  }
}
would yield:
"aggregations" : {
"dynamic_variant_options" : {
"value" : [
{
"buckets" : {
"variants.options.Size" : {
"S" : 1,
"XS" : 1
},
"variants.options.Material" : {
"Wool" : 1
}
}
}
]
}
}
You'll need to adjust the painless code if you want the buckets to be arrays of key-doc_count pairs instead of hash maps like in my example.
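If changing the Painless code is not convenient, the reshaping can also be done on the client. The following is a minimal Python sketch, not part of the original answer: the function name to_bucket_arrays and the sample variable are made up for illustration. It assumes the response shape shown above, where value holds one state per shard, so it also sums the counts across shards. Note that the counts are per variant rather than per document, because the map_script increments once for every variant.

```python
from collections import Counter, defaultdict


def to_bucket_arrays(agg_value):
    """Merge per-shard states and reshape them into terms-style buckets.

    agg_value is the list found under
    aggregations.dynamic_variant_options.value (one entry per shard).
    """
    merged = defaultdict(Counter)
    for shard_state in agg_value:
        for path, counts in shard_state.get("buckets", {}).items():
            merged[path].update(counts)  # sums counts for identical values

    return {
        path: {
            "buckets": [
                {"key": key, "doc_count": count}
                # most_common() orders by descending count
                for key, count in counts.most_common()
            ]
        }
        for path, counts in merged.items()
    }


# Sample copied from the response shown above (a single shard).
sample = [
    {
        "buckets": {
            "variants.options.Size": {"S": 1, "XS": 1},
            "variants.options.Material": {"Wool": 1},
        }
    }
]

result = to_bucket_arrays(sample)
print(result["variants.options.Material"])
```

Merging on the client keeps the Painless script unchanged; the same reshaping could instead be moved into the reduce_script if you prefer the cluster to return the final format.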

Related

Elasticsearch aggregation by array size

I need some statistics from Elasticsearch, but I can't work out the request.
I would like to know the number of people per appointment.
A document in the appointment index looks like this:
{
  "id" : "383577",
  "persons" : [
    {
      "id" : "1"
    },
    {
      "id" : "2"
    }
  ]
}
What I would like:
"buckets" : [
{
"key" : "1", <--- appointment of 1 person
"doc_count" : 1241891
},
{
"key" : "2", <--- appointment of 2 persons
"doc_count" : 10137
},
{
"key" : "3", <--- appointment of 3 persons
"doc_count" : 8064
}
]
Thank you

The easiest way to do this is to create another integer field containing the length of the persons array and aggregate on that field.
{
"id" : "383577",
"personsCount": 2, <---- add this field
"persons" : [
{
"id" : "1"
},
{
"id" : "2",
}
]
}
The less efficient alternative is a script that computes the length of the persons array at query time. Be aware that the script runs for every matching document and can potentially harm your cluster depending on the volume of data you have:
GET /_search
{
"aggs": {
"persons": {
"terms": {
"script": "doc['persons.id'].size()"
}
}
}
}
If you want to update all your documents to create that field you can do it like this:
POST index/_update_by_query
{
"script": {
"source": "ctx._source.personsCount = ctx._source.persons.size()"
}
}
However, you'll also need to modify the logic of your indexing application to create that new field.
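As a rough illustration of that indexing-side change, here is a small Python sketch. It is an assumption about how the application might prepare documents, not code from the original answer; the helper name with_persons_count is hypothetical.

```python
def with_persons_count(appointment):
    """Return a copy of the appointment with personsCount filled in."""
    doc = dict(appointment)
    # A missing or null persons field is counted as zero people.
    doc["personsCount"] = len(doc.get("persons") or [])
    return doc


appointment = {"id": "383577", "persons": [{"id": "1"}, {"id": "2"}]}
indexed = with_persons_count(appointment)
print(indexed["personsCount"])  # 2
```

The returned dictionary is what the application would send to Elasticsearch; a terms aggregation on personsCount then gives the buckets from the question.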

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: I'm trying to write a custom filter on nested documents using Painless, and I want error checks for entries that have no nested documents so that I avoid a null_pointer_exception.
I have a mapping as such (simplified and obfuscated)
{
  "video_entry" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "captions_added" : {
          "type" : "boolean"
        },
        "category" : {
          "type" : "keyword"
        },
        "is_votable" : {
          "type" : "boolean"
        },
        "members" : {
          "type" : "nested",
          "properties" : {
            "country" : {
              "type" : "keyword"
            },
            "date_of_birth" : {
              "type" : "date"
            }
          }
        }
      }
    }
  }
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
  "captions_added" : true,
  "category" : "Mental Health",
  "is_votable" : true,
  "members" : [
    {"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
    {"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
  ]
}
If one or more nested documents exist, we want to write some Painless scripts that check certain fields across all the nested documents. My script works on indices with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite having every null check possible. I've tried various access patterns and error-checking mechanisms, but I still get exceptions.
POST /video_entry/_search
{
"query": {
"script": {
"script": {
"source": """
// various NULL checks that I already tried
// also tried short circuiting on finding null values
if (!params['_source'].empty && params['_source'].containsKey('members')) {
def total = 0;
for (item in params._source.members) {
// custom logic here
// if above logic holds true
// total += 1;
}
return total > 3;
}
return true;
""",
"lang": "painless"
}
}
}
}
Other Statements That I've Tried
if (params._source == null) {
return true;
}
if (params._source.members == null) {
return true;
}
if (!ctx._source.contains('members')) {
return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
params['_source'].members.value != null) {
// logic here
}
if (doc.containsKey('members')) {
for (mem in params._source.members) {
}
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx, doc and some use params. To add to the confusion Debug.explain(doc.members), Debug.explain(params._source.members) return empty responses and I'm having a hard time figuring out the types.
Any help is appreciated.

TL;DR
Elasticsearch flattens arrays of objects, such that
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
turns into:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
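To access the inner values of members, reference them as doc['members.<field>'], because members does not exist as a field on its own. Note that this flattening applies to plain object fields, which is why the test mapping below omits "type": "nested". Fields of a nested type are indexed as separate hidden documents, so I would not expect this access pattern to see them from a script query on the parent document.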
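To make the flattening concrete, here is a rough Python sketch of the idea. It only mimics the shape of the result and is not how Elasticsearch is implemented: it skips analysis (the lowercased, sorted values in the example above come from the index, not from this code), and the function name flatten_object_array is made up.

```python
def flatten_object_array(doc, field):
    """Spread an array of objects into dotted, multi-valued fields."""
    flat = {key: value for key, value in doc.items() if key != field}
    for item in doc.get(field) or []:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}
flat = flatten_object_array(doc, "user")
print(flat)

# An empty array leaves no dotted fields behind at all.
empty = flatten_object_array({"category": "x", "members": []}, "members")
print(empty)
```

The second call shows why a size check on the dotted field is the natural guard in this model: a document with an empty array simply has no values for that field.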
Details
As you may know, Elasticsearch handles inner documents in its own way. [doc]
So you need to reference them accordingly.
Here is what I did to make it work.
By the way, I ran everything in the Kibana Dev Tools console.
PUT /so_test/
PUT /so_test/_mapping
{
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental breakdown",
"is_votable" : true,
"members": []
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental success",
"is_votable" : true,
"members": [
{"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
"query":{
"bool": {
"filter": {
"script": {
"script": {
"lang": "painless",
"source": """
def flag = false;
// /!\ notice how the field is referenced /!\
if(doc['members.country'].size() != 0)
{
for (item in doc['members.country']) {
if (item == params.country){
flag = true
}
}
}
return flag;
""",
"params": {
"country": "Japan"
}
}
}
}
}
}
}
You also mentioned being confused about the different Painless contexts (ctx, doc, params). The documentation has some details about them.
[doc]
In this case, the filter context is the one we want to look at.

I would like to combine the duplicate values in Elasticsearch into one and see the results with a different filter

I'm collecting logs with Elasticsearch. The logs look like this, for example:
{
  "name" : "John",
  "team" : "IT",
  "startTime" : "21:00",
  "result" : "pass"
},
{
  "name" : "James",
  "team" : "HR",
  "startTime" : "21:04",
  "result" : "pass"
},
{
  "name" : "Paul",
  "team" : "IT",
  "startTime" : "21:05",
  "result" : "pass"
},
{
  "name" : "Jackson",
  "team" : "Marketing",
  "startTime" : "21:30",
  "result" : "fail"
},
{
  "name" : "John",
  "team" : "IT",
  "startTime" : "21:41",
  "result" : "pass"
},
... and so on
If you run the query below on these collected logs,
GET logData/_search
{
"size": 0,
"aggs": {
"Documents_per_team": {
"terms": {
"field": "team"
}
}
}
}
The following results are returned:
"aggregations" : {
"Documents_per_team" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "IT",
"doc_count" : 70
},
{
"key" : "Marketing",
"doc_count" : 55
},
{
"key" : "HR",
"doc_count" : 11
}
]
}
}
}
What I want is to count each name only once per team in this result.
[AS-IS]
As shown above, the IT team count is 70.
[The result I want]
If John appears 50 times, Kate 10 times, and Paul 10 times, the IT team count should be 3 (because the IT team has three members).
Can I get a per-team result after removing duplicates?
Thanks

You've got two options:
a cardinality sub-aggregation (straightforward, but approximate and not very scalable, albeit only in very specific/advanced situations)
or a scripted metric aggregation (slower, more verbose but exact).
Both approaches assume that names are unique within a team. If they're not, you'll need to adjust accordingly. They also assume that name is mapped as keyword, just like team. If it isn't, replace the field names with your_field.keyword.
1. Cardinality
{
"size": 0,
"aggs": {
"Documents_per_team": {
"terms": {
"field": "team"
},
"aggs": {
"unique_names_per_team": {
"cardinality": {
"field": "name"
}
}
}
}
}
}
2. Scripted Metric
{
  "size": 0,
  "aggs": {
    "Documents_per_team": {
      "scripted_metric": {
        "init_script": "state.by_department = [:]; state.dept_vs_name = [:];",
        "map_script": """
          def dept = doc['team'].value;
          def name = doc['name'].value;
          def name_already_considered = state.by_department.containsKey(dept) && state.dept_vs_name[dept].containsKey(name);
          if (name_already_considered) {
            return;
          }
          if (state.by_department.containsKey(dept)) {
            state.by_department[dept] += 1;
          } else {
            state.by_department[dept] = 1
          }
          if (!state.dept_vs_name.containsKey(dept)) {
            // init new map & set its first member
            state.dept_vs_name[dept] = [name:true];
          } else if (!state.dept_vs_name[dept].containsKey(name)) {
            state.dept_vs_name[dept][name] = true;
          }
        """,
        "combine_script": "return state.by_department",
        "reduce_script": "return states"
      }
    }
  }
}
Note: If you also wish to see the underlying dept vs. name breakdown, you can modify the combine_script to return the whole state, i.e. return state.
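One thing to keep in mind: with "reduce_script": "return states" the response holds one entry per shard, and the same name can show up on more than one shard, so simply adding the per-shard counts could count a person twice. Below is a minimal Python sketch of a client-side merge. It assumes the combine_script was changed to return the whole state, as described in the note above; the function name unique_names_per_team and the sample data are invented for illustration.

```python
def unique_names_per_team(shard_states):
    """Union the names seen on each shard, then count them per team."""
    names_by_team = {}
    for state in shard_states:
        for team, names in state.get("dept_vs_name", {}).items():
            # names is a map like {"John": True}; only its keys matter.
            names_by_team.setdefault(team, set()).update(names)
    return {team: len(names) for team, names in names_by_team.items()}


# Hypothetical per-shard states; John was logged on both shards.
shard_states = [
    {
        "by_department": {"IT": 2},
        "dept_vs_name": {"IT": {"John": True, "Paul": True}},
    },
    {
        "by_department": {"IT": 2, "HR": 1},
        "dept_vs_name": {
            "IT": {"John": True, "Kate": True},
            "HR": {"James": True},
        },
    },
]

counts = unique_names_per_team(shard_states)
print(counts)
```

On a single-shard index the merge is a no-op and the by_department counts can be used directly.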

Sorting by distance on a numeric field in ElasticSearch

For a project I need to select documents that are within the filter bounds and closest to a numeric value. The value is a price, and I cannot find out whether this is possible.
Say I have 2 documents:
{
"name": "Document1",
"price": 46.12,
"tags": ["tag1", "tag2"]
}
{
"name": "Document2",
"price": 82.29,
"tags": ["tag1", "tag3"]
}
Is it possible to get the document with the price closest to 66.23?

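The answer (thanks to keety) was to enable dynamic scripting and add a script-based sort like this. The example sorts on an age field with a target of 45; for the documents in the question, use price and 66.23 instead: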
{
"query": {
"filtered" : {
"query":{
"match_all" : { }
},
"filter": {}
}
},
"sort" : {
"_script" : {
"script" : "cur = (factor - doc['age'].value); if (cur < 0) { cur = cur * -1 } else { cur = cur}",
"type" : "number",
"params" : {
"factor" : 45
},
"order" : "asc"
}
}
}
This sorts from closest to farthest and works like a charm. Thanks!
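The sort script boils down to ordering documents by the absolute difference between the field and the target. A small Python sketch of that logic, applied to the two documents from the question, is shown here purely as a way to check expectations; the function name sort_by_distance is hypothetical and nothing in it talks to Elasticsearch.

```python
def sort_by_distance(docs, field, target):
    """Order documents from closest to farthest from the target value."""
    return sorted(docs, key=lambda doc: abs(doc[field] - target))


docs = [
    {"name": "Document1", "price": 46.12, "tags": ["tag1", "tag2"]},
    {"name": "Document2", "price": 82.29, "tags": ["tag1", "tag3"]},
]

ranked = sort_by_distance(docs, "price", 66.23)
print([doc["name"] for doc in ranked])  # ['Document2', 'Document1']
```

Document2 comes first because 82.29 is about 16.06 away from 66.23, while 46.12 is about 20.11 away.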

Bucket by fields present in returned documents using Elasticsearch

Our indexed documents do not have a completely fixed schema, that is, not every field is in every document. Is there a way to create buckets based on the fields present in a set of documents (i.e. in response to a query) with the count of how many documents contain those fields? For example, these documents that I just made up comprise the results of a query:
{"name":"Bob","field1":"value","field2":"value2","field3":"value3"}
{"name":"Sue","field2":"value4","field3":"value5"}
{"name":"Ali","field1":"value6","field2":"value7"}
{"name":"Joe","field3":"value8"}
This is the information (not format) I want to extract:
name: 4
field1: 2
field2: 3
field3: 3
Is there a way I can aggregate and count to get those results?

Yeah, I think you can do it like this:
GET /some_index/some_type/_search?search_type=count
{
"aggs": {
"name_bucket": {
"filter" : { "exists" : { "field" : "name" } }
},
"field1_bucket": {
"filter" : { "exists" : { "field" : "field1" } }
},
"field2_bucket": {
"filter" : { "exists" : { "field" : "field2" } }
},
"field3_bucket": {
"filter" : { "exists" : { "field" : "field3" } }
}
}
}
And you get something like this:
"aggregations": {
"field3_bucket": {
"doc_count": 3
},
"field1_bucket": {
"doc_count": 2
},
"field2_bucket": {
"doc_count": 3
},
"name_bucket": {
"doc_count": 4
}
}
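Since the request repeats the same filter for every field, the body can be generated from a list of field names. The following Python sketch is only an illustration: the function name build_exists_aggs is made up, the field names still have to be known up front, and using "size": 0 in the body instead of search_type=count is my assumption for newer Elasticsearch versions, where that search type is no longer available.

```python
import json


def build_exists_aggs(fields):
    """Build a search body with one exists-filter aggregation per field."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            f"{field}_bucket": {"filter": {"exists": {"field": field}}}
            for field in fields
        },
    }


body = build_exists_aggs(["name", "field1", "field2", "field3"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into a search request against the index; each bucket's doc_count is then the number of matching documents that contain the field.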
