Elasticsearch: Browse sorted terms

Let's say I have a set of documents with common personal information like:
{
"name": "Samuel",
"Job": "Engineer",
....
}
My customer wants to browse through all the terms in the "name" field, starting from a page centred on a "selected" name.
For example, they want to search for "Samuel" and see a page of 7 elements like:
Eddie: 7
Lucian: 3
Sammy: 1
Samuel: 3
Scott:3
Teddy: 2
Tod: 1
Where the names are sorted alphabetically and the numbers are occurrence counts.
It would also be nice to be able to page up and down through the results.
This is just an example; in reality I may have lots and lots of unique keys to browse, so returning all the terms and looping over them is not really a solution.
Is there a way to achieve this with Elasticsearch?

I'd suggest using filtered aggs that bucket the users by the name's first character -- so a to g, h to n, etc.:
Indexing a few docs:
POST jobs/_doc
{"name":"Samuel","Job":"Engineer"}
POST jobs/_doc
{"name":"Lucian","Job":"Engineer"}
POST jobs/_doc
{"name":"Teddy","Job":"Engineer"}
POST jobs/_doc
{"name":"Tod","Job":"Engineer"}
POST jobs/_doc
{"name":"Andrew","Job":"Engineer"}
Then applying scripted filters:
GET jobs/_search
{
"size": 0,
"aggs": {
"alpha_buckets": {
"filter": {
"match_all": {}
},
"aggs": {
"a_g": {
"filter": {
"script": {
"script": {
"source": """
def first_char = doc['name.keyword'].value.toLowerCase().toCharArray()[0];
def char_int = Character.getNumericValue(first_char);
return char_int >= 10 && char_int <= 16
""",
"lang": "painless"
}
}
},
"aggs": {
"a_g": {
"terms": {
"field": "name.keyword",
"order": {
"_term": "asc"
}
}
}
}
},
"h_n": {
"filter": {
"script": {
"script": {
"source": """
def first_char = doc['name.keyword'].value.toLowerCase().toCharArray()[0];
def char_int = Character.getNumericValue(first_char);
return char_int >= 17 && char_int <= 23
""",
"lang": "painless"
}
}
},
"aggs": {
"h_n": {
"terms": {
"field": "name.keyword",
"order": {
"_term": "asc"
}
}
}
}
}
}
}
}
}
yielding
"aggregations" : {
"alpha_buckets" : {
"doc_count" : 7,
"h_n" : {
"doc_count" : 2,
"h_n" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Lucian",
"doc_count" : 2
},
{
"key" : "Lupe",
"doc_count" : 1
}
]
}
},
"a_g" : {
"doc_count" : 1,
"a_g" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Andrew",
"doc_count" : 1
},
{
"key" : "Anton",
"doc_count" : 1
}
]
}
}
}
}
Now, you don't need to get all the alphabetical buckets at once; the above is just to show what's possible.
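For instance, to serve the page centred on "Samuel", the client only needs the range that covers the letter s -- say o through u -- and can page up and down within that bucket, fetching a neighbouring range when it runs off either end. A minimal sketch along the same lines (the o-u split and the terms size are assumptions, not part of the original answer):
GET jobs/_search
{
"size": 0,
"aggs": {
"o_u": {
"filter": {
"script": {
"script": {
"source": """
def first_char = doc['name.keyword'].value.toLowerCase().toCharArray()[0];
def char_int = Character.getNumericValue(first_char);
// 'o' maps to 24 and 'u' to 30, so this bucket covers names starting with o through u
return char_int >= 24 && char_int <= 30
""",
"lang": "painless"
}
}
},
"aggs": {
"o_u": {
"terms": {
"field": "name.keyword",
"size": 50,
"order": {
"_key": "asc"
}
}
}
}
}
}
}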

Related

bucket aggregation/bucket_script computation

How do I apply a computation on bucket fields via bucket_script? Moreover, I would like to understand how to aggregate on distinct results.
For example, below is a sample query, and the response.
What I am looking for is to aggregate the following into two fields:
the sum of all buckets' dist.value from the example response (1+2=3)
the sum of all buckets' (dist.value x key) from the example response ((1x10)+(2x20)=50)
Query
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"field": "value"
}
}
]
}
},
"aggs":{
"sales_summary":{
"terms":{
"field":"qty",
"size":"100"
},
"aggs":{
"dist":{
"cardinality":{
"field":"somekey.keyword"
}
}
}
}
}
}
Query Result:
{
"aggregations": {
"sales_summary": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 10,
"doc_count": 100,
"dist": {
"value": 1
}
},
{
"key": 20,
"doc_count": 200,
"dist": {
"value": 2
}
}
]
}
}
}
You need to use a sum_bucket aggregation, a pipeline aggregation, to compute the sum of the cardinality aggregation's results across all the buckets.
Search query for the sum of all buckets' dist.value from the example response (1+2=3):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>dist"
}
}
}
}
Search Response:
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
}
}
]
},
"sum_buckets" : {
"value" : 5.0
}
}
For the second requirement, you first need to modify the value in each bucket using a bucket_script aggregation, and then run the sum_bucket aggregation over the modified values.
Search query for the sum of all buckets' (dist.value x key) from the example response ((1x10)+(2x20)=50):
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist"
},
"script": "params.newValue * 10"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}
Search Response:
"aggregations" : {
"sales_summary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 3,
"dist" : {
"value" : 2
},
"format-value-agg" : {
"value" : 20.0
}
},
{
"key" : 20,
"doc_count" : 3,
"dist" : {
"value" : 3
},
"format-value-agg" : {
"value" : 30.0
}
}
]
},
"sum_buckets" : {
"value" : 50.0
}
}
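Note that params.newValue * 10 multiplies every bucket by the constant 10; it reproduces the target figure of 50 here, but it does not multiply by each bucket's own key (for this data that would give (2x10)+(3x20)=80). If your Elasticsearch version supports referencing the special _key value in buckets_path (treat this as an assumption to verify against your version), the general form would look like this sketch:
POST idxtest1/_search
{
"size": 0,
"aggs": {
"sales_summary": {
"terms": {
"field": "qty",
"size": "100"
},
"aggs": {
"dist": {
"cardinality": {
"field": "pageview"
}
},
"format-value-agg": {
"bucket_script": {
"buckets_path": {
"newValue": "dist",
"key": "_key" --> assumption: _key support in buckets_path
},
"script": "params.newValue * params.key"
}
}
}
},
"sum_buckets": {
"sum_bucket": {
"buckets_path": "sales_summary>format-value-agg"
}
}
}
}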

Get top values from Elasticsearch bucket

I have some items with a brand field.
I want to return N records, but no more than x from each bucket.
So far I have my buckets grouped by brand:
"aggs": {
"brand": {
"terms": {
"field": "brand"
}
}
}
"aggregations" : {
"brand" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "brandA",
"doc_count" : 130
},
{
"key" : "brandB",
"doc_count" : 127
}
]
}
}
But how do I access a specific bucket and get the top x values from there?
You can use the top_hits sub-aggregation to get the documents under each brand. You can sort those documents and define a size, too.
{
"aggs": {
"brand": {
"terms": {
"field": "brand",
"size": 10 --> no of brands
},
"aggs": {
"top_docs": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 1 --> no of documents under each brand
}
}
}
}
}
}
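For the original "N records, but no more than x from each bucket" requirement, one option is to cap the number of brand buckets and the per-bucket top_hits size so that their product is at most N. A sketch for N=10 and x=2 (the date sort field is carried over from the answer above and is an assumption about your mapping):
{
"size": 0,
"aggs": {
"brand": {
"terms": {
"field": "brand",
"size": 5 --> at most N/x brands
},
"aggs": {
"top_docs": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 2 --> x docs per brand
}
}
}
}
}
}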

Counting comma-separated elements as values in Elasticsearch

Is it possible to count the values separated by commas in the same field, with aggregations or any other way?
For example:
JSON
{
"ID": "example.1",
"Sports": {
"1": {
"Football Teams": "Real Madrid, Manchester United, Juventus",
"Basket Teams": "Chicago Bulls"
},
"2": {
"Football Teams": "F.C Barcelona, Milan",
"Basket Teams": "Lakers"
}
}
}
Query
GET xxx/_search
{
"query": {
"match_all": {}
},
"aggs": {
"NAME": {
"value_count": {
"field": "Sports.1.Football Teams.keyword"
}
}
}
}
Desired Output
"aggregations" : {
"Count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Real Madrid, Manchester United, Juventus",
"doc_count" : 3
}
]
}
}
The objective of the query I am looking for is to determine how many comma-separated values a field has.
You can do it with a value script in a terms sub-aggregation:
GET xxx/_search
{
"aggs": {
"count": {
"terms": {
"field": "team.keyword"
},
"aggs": {
"count": {
"terms": {
"field": "team.keyword",
"script": {
"source": "/,/.split(_value).length",
"lang": "painless"
}
}
}
}
}
}
}
The top-level buckets will be the football team values, and each top-level bucket will have a sub-bucket with the number of tokens, like this:
{
"key" : "Real Madrid, Manchester United, Juventus", <-- team name
"doc_count" : 1,
"count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3", <-- number of teams
"doc_count" : 1
}
]
}
}
In order for this to work, you need to add the following line to elasticsearch.yml and restart your server:
script.painless.regex.enabled: true
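If you'd rather not enable regex cluster-wide, newer Painless versions also provide a splitOnToken method on strings, which avoids the elasticsearch.yml change (treat the method's availability in your version as an assumption to verify):
GET xxx/_search
{
"aggs": {
"count": {
"terms": {
"field": "team.keyword"
},
"aggs": {
"count": {
"terms": {
"field": "team.keyword",
"script": {
"source": "_value.splitOnToken(',').length",
"lang": "painless"
}
}
}
}
}
}
}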

Elastic script from buckets and higher level aggregation

I want to compare the daily average of a metric (the frequency of words appearing in texts) during a week to the value of a specific day. My goal is to check whether there's a spike: if the last day is way higher than the daily average, I'd trigger an alarm.
So from my input in Elasticsearch I compute the daily average during the week and find the value for the last day of that week.
For the daily average over the week, I simply cut a week's worth of data using a range query on the date field, so all my available data is the given week, and I compute the sum and divide by 7.
For getting the last day's value, I did a terms aggregation on the date field with descending order and size 1 as suggested in a different question (How to select the last bucket in a date_histogram selector in Elasticsearch)
The whole output is as follows. Here you can see words "rama0" and "rama1" with their corresponding frequencies.
{
"aggregations" : {
"the_keywords" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rama0",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
{
"key" : "rama1",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
[...]
]
}
}
}
Now I have the_daily_average in a high level of the output, and the_last_day_frequency in the single-element buckets list in the_last_day aggregation. I cannot use a bucket_script to compare those, because I cannot refer to a single bucket (if I place the script outside the_last_day aggregation) and I cannot refer to higher-level aggregations if I place the script inside the_last_day.
IMO the reasonable thing to do would be to put the script outside the aggregation and use buckets_path with the <AGG_NAME><MULTIBUCKET_KEY> syntax mentioned in the docs. I have tried "var1": "the_last_day[1580169600000]>the_last_day_frequency" and variations (hardcoding first until it works), but I haven't been able to refer to a particular bucket.
My ultimate goal is to have a list of keywords for which the last day frequency greatly exceeds the daily average.
For anyone interested, my current query is as follows. Notice that the part I'm struggling with is commented out.
body='{
"query": {
"range": {
"date": {
"gte": "START",
"lte": "END"
}
}
},
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average" : {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {"_key": "desc"}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
}/*,
"the_spike": {
"bucket_script": {
"buckets_path": {
"last_day_frequency": "the_last_day>the_last_day_frequency",
"daily_average": "the_daily_average"
},
"script": {
"inline": "return last_day_frequency / daily_average"
}
}
}*/
}
}
}
}'
In your query, the_last_day>the_last_day_frequency points at a multi-bucket aggregation, not a single value, so it throws an error. You need to get a single metric value out of "the_last_day_frequency", which you can achieve with a max_bucket aggregation. Then you can use a bucket_selector aggregation to compare the last day's value with the average value.
Query:
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average": {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
},
"max_frequency_last_day": {
"max_bucket": {
"buckets_path": "the_last_day>the_last_day_frequency"
}
},
"the_spike": {
"bucket_selector": {
"buckets_path": {
"last_day_frequency": "max_frequency_last_day",
"daily_average": "the_daily_average"
},
"script": {
"inline": "params.last_day_frequency > params.daily_average"
}
}
}
}
}
}
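If you also want the magnitude of the spike reported per keyword, rather than only filtering the buckets, the same buckets_path entries should work in a bucket_script, since pipeline aggregations can be chained. A sketch of an extra sibling aggregation (the_spike_ratio is a name introduced here, not part of the answer) to place next to the_spike above:
"the_spike_ratio": {
"bucket_script": {
"buckets_path": {
"last_day_frequency": "max_frequency_last_day",
"daily_average": "the_daily_average"
},
"script": "params.last_day_frequency / params.daily_average"
}
}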

Why doesn't Elasticsearch support min_doc_count together with order by _count asc?

Requirements:
group by hldId having count(*) = 2
Elasticsearch query:
"aggs": {
"groupByHldId": {
"terms": {
"field": "hldId",
"min_doc_count": 2,
"order" : { "_count" : "asc" }
}
}
}
but no buckets are returned:
"aggregations" : {
"groupByHldId" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 2660,
"buckets" : [ ]
}
}
but if I change the order to desc, it returns results:
"buckets" : [
{
"key" : 200035075,
"doc_count" : 355
},
or, without min_doc_count, it also returns results:
"buckets" : [
{
"key" : 200000061,
"doc_count" : 1
},
So why does it return an empty list when min_doc_count is combined with the ascending direction?
Ordering terms by ascending doc count is unreliable: each shard returns only its own lowest-count terms, so buckets that would reach min_doc_count globally can be pruned at shard level before the results are merged, which is likely why you get an empty list. As a workaround, you can use a bucket_selector with a custom script; note that the selector filters buckets only after the terms aggregation has collected them, so size must be large enough to cover the buckets you care about.
{
"aggs": {
"countfield": {
"terms": {
"field": "hldId",
"size": 100,
"order": {
"_count": "desc"
}
},
"aggs": {
"criticals": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": "params.doc_count==2"
}
}
}
}
}
}
