Counting comma-separated elements as values in Elasticsearch - elasticsearch

Is it possible to count the values separated by commas from the same field, with aggregations or in any other way?
For example, given this JSON:
{
  "ID": "example.1",
  "Sports": {
    "1": {
      "Football Teams": "Real Madrid, Manchester United, Juventus",
      "Basket Teams": "Chicago Bulls"
    },
    "2": {
      "Football Teams": "F.C Barcelona, Milan",
      "Basket Teams": "Lakers"
    }
  }
}
Query:
GET xxx/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "NAME": {
      "value_count": {
        "field": "Sports.1.Football Teams.keyword"
      }
    }
  }
}
Desired Output
"aggregations" : {
"Count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Real Madrid, Manchester United, Juventus",
"doc_count" : 3
}
The objective of the query is to be able to determine how many comma-separated values a field has.

You can do it with a value script in a terms sub-aggregation:
GET xxx/_search
{
"aggs": {
"count": {
"terms": {
"field": "team.keyword"
},
"aggs": {
"count": {
"terms": {
"field": "team.keyword",
"script": {
"source": "/,/.split(_value).length",
"lang": "painless"
}
}
}
}
}
}
}
The top-level buckets will be the values of the football team and each top-level bucket will have a sub-bucket with the number of tokens, like this:
{
"key" : "Real Madrid, Manchester United, Juventus", <-- team name
"doc_count" : 1,
"count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3", <-- number of teams
"doc_count" : 1
}
]
}
}
In order for this to work, you need to add the following line in elasticsearch.yml and restart your server:
script.painless.regex.enabled: true
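If you'd rather not enable regex for Painless, a possible alternative (a sketch, assuming a reasonably recent Elasticsearch version where the splitOnToken string helper is available) is to split on a plain string instead, so no elasticsearch.yml change should be needed:
"script": {
  "source": "_value.splitOnToken(',').length",
  "lang": "painless"
}
This would replace the regex-based script in the sub-aggregation above.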

Related

Elasticsearch: Tricky aggregation with sum and comparison

I am trying to pull statistics from my Elastic cluster that I cannot figure out.
In the end what I want to achieve is a count of streams (field: status) over time (field: timestamp) for a specific item (field: media).
The data are logs from nginx with anonymized IPs (field: ip_hash) and user agents (field: http_user_agent). To get a valid count I need to sum up the bytes transferred (field: bytes_sent) for the same IP and user agent and compare that sum to a minimum threshold (an integer). A stream is only valid / only counts if at least XY bytes of that stream have been transferred in total.
"_source": {
"media": "my-stream.001",
"http_user_agent": "Spotify/8.4.44 Android/29 (SM-T535)",
"ip_hash": "fcd2653c44c1d8e33ef5d58ac5a33c2599b68f05d55270a8946166373d79a8212a49f75bcf3f71a62b9c71d3206c6343430a9ebec9c062a0b308a48838161ce8",
"timestamp": "2022-02-05 01:32:23.941",
"bytes_sent": 4893480,
"status": 206
}
Where I am having trouble is summing up the transferred bytes per unique user agent / IP hash combination and comparing that sum to the threshold.
Any pointers on how I could solve this are appreciated. Thank you!
So far I got this:
GET /logdata_*/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "now-1w/d",
"lt": "now/d"
}
}
}
]
}
},
"aggs": {
"status206":{
"filter": {
"term": {
"status": "206"
}
},
"aggs": {
"medias": {
"terms": {
"field": "media",
"size": 10
},
"aggs": {
"ips": {
"terms": {
"field": "ip_hash",
"size": 10
},
"aggs": {
"clients": {
"terms": {
"field": "http_user_agent",
"size": 10
},
"aggs": {
"transferred": {
"sum": {
"field": "bytes_sent"
}
}
}
}
}
}
}
}
}
}
}
}
Which gives something like this:
{
"took" : 1563,
"timed_out" : false,
"_shards" : {
"total" : 12,
"successful" : 12,
"skipped" : 8,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"status206" : {
"doc_count" : 1307130,
"medias" : {
"doc_count_error_upper_bound" : 7612,
"sum_other_doc_count" : 1163149,
"buckets" : [
{
"key" : "20220402_ETD_Podcast_2234_Eliten_-_VD_Hanson.mp3",
"doc_count" : 21772,
"ips" : {
"doc_count_error_upper_bound" : 12,
"sum_other_doc_count" : 21574,
"buckets" : [
{
"key" : "ae55a10beda61afd3641fe2a6ca8470262d5a0c07040d3b9b8285ea1a4dba661a0502a7974dc5a4fecbfbbe5b7c81544cdcea126271533e724feb3d7750913a5",
"doc_count" : 38,
"clients" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Deezer/7.0.0.xxx (Android; 10; Mobile; de) samsung SM-G960F",
"doc_count" : 38,
"transferred" : {
"value" : 7582635.0
}
}
]
}
},
{
"key" : "60082e96eb57c4a8b7962dc623ef7446fbc08cea676e75c4ff94ab5324dec93a6db1848d45f6dcc6e7acbcb700bb891cf6bee66e1aa98fc228107104176734ff",
"doc_count" : 37,
"clients" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Deezer/7.0.0.xxx (Android; 12; Mobile; de) samsung SM-N770F",
"doc_count" : 36,
"transferred" : {
"value" : 7252448.0
}
},
{
"key" : "Mozilla/5.0 (Linux; Android 11; RMX2063) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.58 Mobile Safari/537.36",
"doc_count" : 1,
"transferred" : {
"value" : 843367.0
}
}
]
}
},
Now I would need to check that "transferred" is greater than or equal to the threshold, and that would count as 1 stream. In the end I need the count of all applicable streams.
You can try the following:
GET _search?filter_path=aggregations.valid_streams.count
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "now-1w/d",
"lt": "now/d"
}
}
},
{
"match": {
"status": "206"
}
}
]
}
},
"aggs": {
"streams": {
"multi_terms": {
"size": "65536",
"terms": [
{
"field": "media"
},
{
"field": "ip_hash"
},
{
"field": "http_user_agent"
}
]
},
"aggs": {
"transferred": {
"sum": {
"field": "bytes_sent"
}
},
"threshold": {
"bucket_selector": {
"buckets_path": {
"total": "transferred"
},
"script": "params.total > 12345"
}
}
}
},
"valid_streams": {
"stats_bucket": {
"buckets_path": "streams>transferred"
}
}
}
}
Explanation:
streams - a multi_terms aggregation combining the three fields, since a change in any of them should be counted as a new stream. This is mainly for readability; change it if it doesn't fit your logic.
transferred - a sum aggregation that sums up the sent bytes.
threshold - a bucket_selector aggregation that filters out the streams which didn't reach the XY threshold.
valid_streams - a stats_bucket aggregation whose count field contains the number of remaining buckets, i.e. the number of valid streams. As a bonus, it also gives you stats about your valid streams (e.g. average bytes).
The filter_path query parameter is used to reduce the response to only the aggregation output.
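With that filter_path, the response should collapse to roughly this shape (the number is illustrative):
{
  "aggregations": {
    "valid_streams": {
      "count": 42
    }
  }
}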

ELASTICSEARCH - Total doc_count aggregations

I am looking for a way to sum up the total of an aggregation that I have defined in the query.
For example:
{
"name" : false,
"surname" : false
},
{
"name" : false,
"surname" : false
}
Query:
GET index/_search?size=0
{"query": {
"bool": {
"must": [
{"term": {"name": false}},
{"term": {"surname": false}}
]
}
},
"aggs": {
"name": {
"terms": {
"field": "name"
}
},
"surname": {
"terms": {
"field": "surname"
}
}
}
}
The query returns, for each of the fields "name" and "surname", the count of documents with the value "false".
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2 <---------
}
]
},
"surname" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2 <---------
}
]
}
}
}
Is it possible to return the total sum of doc_count, so that in this situation it would be "doc_count" : 2 + "doc_count" : 2 == 4?
I've been trying to do it with a script, but since they are boolean values it doesn't work.
The functionality that most closely resembles the solution I am looking for is sum_bucket.
GET index/_search?filter_path=aggregations
{
"aggs": {
"surname_field": {
"terms": {
"field": "surname",
"size": 1
}
},
"sum": {
"sum_bucket" : {
"buckets_path":"surname_field>_count"
}
}
}
}
For this specific case with a simple JSON document, the result of the query is the same as hits.total.value (the number of documents) when filtering on the boolean fields surname:false or name:false.
But for documents with more fields, this lets us count how many times a value appears across our database rather than how many documents match.
With this result I wanted to get the total number of hits, not the number of documents in the result.
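A hedged extension of the same idea covering both fields could look like the sketch below (the aggregation names are arbitrary; one sum_bucket per terms aggregation, and since sibling pipeline results cannot be combined again at the root level, adding the two totals together would have to happen on the client side):
GET index/_search?filter_path=aggregations
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "name": false } },
        { "term": { "surname": false } }
      ]
    }
  },
  "aggs": {
    "name_field": {
      "terms": { "field": "name" }
    },
    "surname_field": {
      "terms": { "field": "surname" }
    },
    "name_total": {
      "sum_bucket": { "buckets_path": "name_field>_count" }
    },
    "surname_total": {
      "sum_bucket": { "buckets_path": "surname_field>_count" }
    }
  }
}
For the two example documents above, name_total.value and surname_total.value would each be 2, giving the desired 4 once added together.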

Elasticsearch: Browse sorted terms

Let's say I have a set of documents with common personal information like:
{
"name": "Samuel",
"Job": "Engineer",
....
}
My customer wants to browse through all the terms in the "name" field, starting from a page centred on a "selected" name.
For example, they want to search for "Samuel" and show a page of 7 elements like:
Eddie: 7
Lucian: 3
Sammy: 1
Samuel: 3
Scott:3
Teddy: 2
Tod: 1
Where the names are sorted alphabetically and the numbers are the occurrence counts.
It would also be nice to be able to go up and down through the pages.
This is just an example; in reality I may have lots and lots of unique keys to browse, so returning all the terms and looping over them is not really a solution.
Is there a way to achieve this with Elasticsearch?
I'd suggest using filtered aggs that bucket the users by the first character of the name -- so from a to g, h to n, etc.
Indexing a few docs:
POST jobs/_doc
{"name":"Samuel","Job":"Engineer"}
POST jobs/_doc
{"name":"Lucian","Job":"Engineer"}
POST jobs/_doc
{"name":"Teddy","Job":"Engineer"}
POST jobs/_doc
{"name":"Tod","Job":"Engineer"}
POST jobs/_doc
{"name":"Andrew","Job":"Engineer"}
Then applying scripted filters:
GET jobs/_search
{
"size": 0,
"aggs": {
"alpha_buckets": {
"filter": {
"match_all": {}
},
"aggs": {
"a_g": {
"filter": {
"script": {
"script": {
"source": """
def first_char = doc['name.keyword'].value.toLowerCase().toCharArray()[0];
def char_int = Character.getNumericValue(first_char);
return char_int >= 10 && char_int <= 16
""",
"lang": "painless"
}
}
},
"aggs": {
"a_g": {
"terms": {
"field": "name.keyword",
"order": {
"_term": "asc"
}
}
}
}
},
"h_n": {
"filter": {
"script": {
"script": {
"source": """
def first_char = doc['name.keyword'].value.toLowerCase().toCharArray()[0];
def char_int = Character.getNumericValue(first_char);
return char_int >= 17 && char_int <= 23
""",
"lang": "painless"
}
}
},
"aggs": {
"h_n": {
"terms": {
"field": "name.keyword",
"order": {
"_term": "asc"
}
}
}
}
}
}
}
}
}
yielding
"aggregations" : {
"alpha_buckets" : {
"doc_count" : 7,
"h_n" : {
"doc_count" : 2,
"h_n" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Lucian",
"doc_count" : 2
},
{
"key" : "Lupe",
"doc_count" : 1
}
]
}
},
"a_g" : {
"doc_count" : 1,
"a_g" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Andrew",
"doc_count" : 1
},
{
"key" : "Anton",
"doc_count" : 1
}
]
}
}
}
Now, you don't need to fetch all the alphabetical buckets at once; the above is just to show what's possible.
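Another option worth considering for paging alphabetically through a large number of unique names is a composite aggregation, which returns terms sorted by key along with an after_key for requesting the next page. A minimal sketch, assuming a name.keyword sub-field:
GET jobs/_search
{
  "size": 0,
  "aggs": {
    "names": {
      "composite": {
        "size": 7,
        "sources": [
          { "name": { "terms": { "field": "name.keyword", "order": "asc" } } }
        ],
        "after": { "name": "Samuel" }
      }
    }
  }
}
The after key is exclusive, so this page starts with the first name after "Samuel"; paging backwards would take a second request with "order": "desc" and the results reversed on the client.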

Elastic script from buckets and higher level aggregation

I want to compare the daily average of a metric (the frequency of words appearing in texts) to the value of a specific day. This is during a week. My goal is to check whether there's a spike. If the last day is way higher than the daily average, I'd trigger an alarm.
So from my input in Elasticsearch I compute the daily average during the week and find out the value for the last day of that week.
For getting the daily average for the week, I simply cut out a week's worth of data using a range query on the date field, so all the data in play is from the given week. I compute the sum and divide by 7 for a daily average.
For getting the last day's value, I did a terms aggregation on the date field with descending order and size 1 as suggested in a different question (How to select the last bucket in a date_histogram selector in Elasticsearch)
The whole output is as follows. Here you can see words "rama0" and "rama1" with their corresponding frequencies.
{
"aggregations" : {
"the_keywords" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rama0",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
{
"key" : "rama1",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
[...]
]
}
}
}
Now I have the_daily_average in a high level of the output, and the_last_day_frequency in the single-element buckets list in the_last_day aggregation. I cannot use a bucket_script to compare those, because I cannot refer to a single bucket (if I place the script outside the_last_day aggregation) and I cannot refer to higher-level aggregations if I place the script inside the_last_day.
IMO the reasonable thing to do would be to put the script outside the aggregation and use a buckets_path using the <AGG_NAME><MULTIBUCKET_KEY> syntax mentioned in the docs, but I have tried "var1": "the_last_day[1580169600000]>the_last_day_frequency" and variations (hardcoding first until it works), but I haven't been able to refer to a particular bucket.
My ultimate goal is to have a list of keywords for which the last day frequency greatly exceeds the daily average.
For anyone interested, my current query is as follows. Notice that the part I'm struggling with is commented out.
body='{
"query": {
"range": {
"date": {
"gte": "START",
"lte": "END"
}
}
},
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average" : {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {"_key": "desc"}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
}/*,
"the_spike": {
"bucket_script": {
"buckets_path": {
"last_day_frequency": "the_last_day>the_last_day_frequency",
"daily_average": "the_daily_average"
},
"script": {
"inline": "return last_day_frequency / daily_average"
}
}
}*/
}
}
}
}'
In your query, the_last_day>the_last_day_frequency points to a bucket, not a single value, so it throws an error. You need to get a single metric value from "the_last_day_frequency", which you can achieve using max_bucket. Then you can use a bucket_selector aggregation to compare the last day's value with the average value.
Query:
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average": {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
},
"max_frequency_last_day": {
"max_bucket": {
"buckets_path": "the_last_day>the_last_day_frequency"
}
},
"the_spike": {
"bucket_selector": {
"buckets_path": {
"last_day_frequency": "max_frequency_last_day",
"daily_average": "the_daily_average"
},
"script": {
"inline": "params.last_day_frequency > params.daily_average"
}
}
}
}
}
}
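If "way higher" means a multiple of the daily average rather than merely above it, the selector script can compare against a factor instead; the factor of 2 below is purely illustrative:
"script": {
  "inline": "params.last_day_frequency > 2 * params.daily_average"
}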

Elasticsearch aggregations: how to get bucket with 'other' results of terms aggregation?

I'm using an aggregation to collect data from a nested field and I'm a little stuck.
Example of document:
{
  ...
  "rectangle": {
    "attributes": [
      { "_id": "some_id", ... }
    ]
  }
}
ES allows grouping data by rectangle.attributes._id, but is there any way to get an 'other' bucket to hold the documents that were not added to any of the groups? Or maybe there is a way to write a query that creates a bucket for documents matching {"rectangle.attributes._id": {$ne: "{currentDoc}.rectangle.attributes._id"}}.
I think a bucket would be perfect because I need to run further aggregations on the 'other' docs.
Or maybe there's some cool workaround.
I use a query like this for the aggregation:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword"
}
}
}
}
}
And get this result
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 27616,
"attributes" : {
"doc_count" : 45,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45,
"attributeOptionsCount" : {
"value" : 2
}
}
]
}
}
}
]
A result like this would be perfect:
"buckets" : [
{
"key" : "some_parent_id",
"doc_count" : 1000,
"attributes" : {
"doc_count" : 145,
"entries" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "some_id",
"doc_count" : 45
},
{
"key" : "other",
"doc_count" : 100
}
]
}
}
}
]
You can make use of the missing parameter. Update the aggregation as below:
"aggs": {
"attributes": {
"nested": {
"path": "rectangle.attributes"
},
"aggs": {
"attributesCount": {
"cardinality": {
"field": "rectangle.attributes._id.keyword"
}
},
"entries": {
"terms": {
"field": "rectangle.attributes._id.keyword",
"missing": "other"
}
}
}
}
}
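The missing parameter puts the documents that don't have an _id at all into the "other" bucket. If you also need to run further sub-aggregations on just those documents, a dedicated missing aggregation alongside the terms aggregation gives you a bucket to nest them under; a minimal sketch (the "without_id" name is arbitrary, and sub-aggregations would go inside it):
"aggs": {
  "entries": {
    "terms": {
      "field": "rectangle.attributes._id.keyword",
      "missing": "other"
    }
  },
  "without_id": {
    "missing": {
      "field": "rectangle.attributes._id.keyword"
    }
  }
}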
