Which Elasticsearch aggregations should I use?

I need to create a bar chart of "number of active users by date". An active user is one who has logged in within the last 7 days.
So for each bar (day) in my chart, I need to count the total number of users whose last_activity date falls within the last 7 days.
I understand this needs to be done using Elasticsearch aggregations, but I'm unsure
which ones to use: bucket aggregations, pipeline aggregations?
Please let me know if you know of a similar example.
Here are two sample documents for user "john":
{
  "userid": "john",
  "last_activity": "2017-08-09T16:10:10.396+01:00",
  "date_of_this_report": "2017-09-24T00:00:00+01:00"
}
{
  "userid": "john",
  "last_activity": "2017-08-09T16:10:10.396+01:00",
  "date_of_this_report": "2017-09-25T00:00:00+01:00"
}

You can filter the users whose last activity falls within the last 7 days using Elasticsearch's date math, and nest the date histogram aggregation under that filter.
POST active_users/document_type1/_search
{
  "size": 0,
  "aggs": {
    "filtered_active_users_7_days": {
      "filter": {
        "range": {
          "last_activity": {
            "gte": "now-7d/d"
          }
        }
      },
      "aggs": {
        "date_histogram_last_7_days": {
          "date_histogram": {
            "field": "last_activity",
            "interval": "day"
          }
        }
      }
    }
  }
}
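Since the same user produces one document per report date, the raw doc_count per histogram bucket can overcount users. A hedged variant that counts distinct users per day with a cardinality sub-aggregation (per_day and unique_users are made-up names; use userid or userid.keyword depending on your mapping):
POST active_users/document_type1/_search
{
  "size": 0,
  "aggs": {
    "filtered_active_users_7_days": {
      "filter": {
        "range": {
          "last_activity": {
            "gte": "now-7d/d"
          }
        }
      },
      "aggs": {
        "per_day": {
          "date_histogram": {
            "field": "last_activity",
            "interval": "day"
          },
          "aggs": {
            "unique_users": {
              "cardinality": {
                "field": "userid"
              }
            }
          }
        }
      }
    }
  }
}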
Hope this works for you.

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time a file is downloaded by a client. Each document is quite basic; it contains a filename field and a when date indicating the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like that:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads - it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
Composite aggregations don't allow sorting based on a metric value.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations. Part of the tradeoff is that you lose things like ordering by doc count, since that isn't known until after all the docs have been collected.
I have no experience with Transforms (part of X-Pack & licensed), but you can try them out. Apart from this, I don't see a way to get the expected output.
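If you do want to try Transforms, a rough sketch (assuming Elasticsearch 7.5+, where the _transform API is available; downloads_by_file, download_count, and downloads_summary are made-up names, the rest comes from the question):
PUT _transform/downloads_by_file
{
  "source": {
    "index": "file",
    "query": {
      "range": {
        "when": {
          "gte": "now-3M"
        }
      }
    }
  },
  "pivot": {
    "group_by": {
      "filename": {
        "terms": {
          "field": "filename.keyword"
        }
      }
    },
    "aggregations": {
      "download_count": {
        "value_count": {
          "field": "filename.keyword"
        }
      }
    }
  },
  "dest": {
    "index": "downloads_summary"
  }
}
After starting it (POST _transform/downloads_by_file/_start), the downloads_summary index holds one document per filename, so a regular search sorted on download_count with from/size gives you sorted pagination.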

How to get documents that differ by a field value

I'm using ElasticSearch 6.3.
Scenario: tens of thousands of documents have a "324" field, and most of them hold the value "blabla". A few hold "blabla blo" in that field. Those occupy the last places in the query results if I set size: 10000 (with the default size, they don't appear at all). But I really want both unique records: one with "324": "blabla" and one with "324": "blabla blo".
I'm using a wildcard query and getting all 10000 documents. I only need those two.
I'm going to feed an HTML select tag with these records, ideally only the two of them!
Query body:
{
  "query": {
    "wildcard": {
      "324": {
        "value": "*b*"
      }
    }
  },
  "size": 10000,
  "_source": ["324"]
}
How should I do it? The concept would be similar to finding records whose value in that field is not fully duplicated, I suppose.
Thank you
That's what aggs are for!
GET index_name/_search
{
  "query": {
    "wildcard": {
      "324": {
        "value": "*b*"
      }
    }
  },
  "size": 0,
  "aggs": {
    "324_uniques": {
      "terms": {
        "field": "324",
        "size": 10
      }
    }
  }
}
field could be 324 OR 324.keyword, depending on your mapping.
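For reference, a sketch of the mapping shape that produces both variants - ES 6.x dynamic mapping creates a text field with a keyword sub-field (the _doc type name is an assumption for 6.3):
PUT index_name
{
  "mappings": {
    "_doc": {
      "properties": {
        "324": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
The wildcard query runs against the analyzed 324 field, while the terms aggregation wants the exact values in 324.keyword.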

Group-by query in Elasticsearch

I have an Elasticsearch cluster holding the analytics data of my website. There are page view events recorded when a user visits a page. Each pageview event has a session-id field, which remains the same during the user's session.
I would like to calculate the duration of each session by grouping the events by session id and computing the difference between the first event and the last event.
Is there any way I can achieve this with an Elasticsearch query?
Pageview events
[
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage1",
    "timestamp": 54323424222
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage2",
    "timestamp": 54323424223
  },
  {
    "session-id": "234234-234234-324324-23432432",
    "url": "testpage3",
    "timestamp": 54323424224
  }
]
Session duration will be (54323424224 - 54323424222)ms
EDIT:
I was able to create a data table visualization with session id, max timestamp, and min timestamp, by querying min(timestamp) and max(timestamp) for each session id. Now all I need is the difference between these two aggs.
A simple option is to fetch the min/max stats per session and calculate the max - min difference on your client side:
{
  "aggs": {
    "bySession": {
      "terms": {
        "field": "session-id.keyword"
      },
      "aggs": {
        "statsBySession": {
          "stats": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}
The stats metric aggregation will give you information about the min and max timestamps per session. You can then calculate the difference between them (max - min) using a bucket script aggregation.
Refer: bucket-script-aggregation
and stats-aggregation.
You can use the following query to calculate the difference between the max and min timestamps per session-id:
{
  "size": 0,
  "aggs": {
    "session": {
      "terms": {
        "field": "session-id.keyword",
        "size": 10
      },
      "aggs": {
        "stats_bucket": {
          "stats": {
            "field": "timestamp"
          }
        },
        "time_spent": {
          "bucket_script": {
            "buckets_path": {
              "min_stats": "stats_bucket.min",
              "max_stats": "stats_bucket.max"
            },
            "script": "params.max_stats - params.min_stats"
          }
        }
      }
    }
  }
}
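For the three example events above, the relevant part of the response would look roughly like this (abridged; the avg and sum follow from the three timestamps):
{
  "aggregations": {
    "session": {
      "buckets": [
        {
          "key": "234234-234234-324324-23432432",
          "doc_count": 3,
          "stats_bucket": {
            "count": 3,
            "min": 54323424222,
            "max": 54323424224,
            "avg": 54323424223,
            "sum": 162970272669
          },
          "time_spent": {
            "value": 2
          }
        }
      ]
    }
  }
}
time_spent.value is the session duration in the same unit as timestamp (milliseconds here).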

Elasticsearch - retrieving documents only if multiple match by a specific field

I have an index in Elasticsearch with users' posts. I want to retrieve user_id from this index if, for a given date range, there are at least X posts; otherwise such users should be skipped.
Is there any way I can achieve this in ES, or do I have to get all the entities and handle them later?
Trawa ;)
To answer your question I'll assume you have the fields user and datetime in your mapping.
You can get the requested data like so:
Get the list of users who have more than X (e.g. X=100) posts in a given date range - aggregate by user name for that date range:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "datetime": {
              "gte": "2017-05-01",
              "lt": "2017-06-01"
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "users": {
      "terms": {
        "field": "user",
        "min_doc_count": 100
      }
    }
  }
}
Edit the query to match your date range (and its format), and set min_doc_count to the minimum of X posts per user.
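Illustratively, the response shape looks like this (user names and counts are made up; only users with at least min_doc_count posts in the range appear as buckets):
{
  "aggregations": {
    "users": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "alice", "doc_count": 152 },
        { "key": "bob", "doc_count": 117 }
      ]
    }
  }
}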
EDIT:
There is no way to avoid a terms aggregation if you need all distinct values.
50k values does seem to be too much data to retrieve - but it also depends on your cluster.
My suggestion is to add another filter, say an alphabetical one, so that instead of getting 50k results at once you can spread them over several queries:
"must": [
{
"range": {
"datetime": {
"gte": "2017-05-01",
"lt": "2017-06-01"
}
}
},
{
"wildcard": {
"user": "a*"
}
},
{
"wildcard": {
"user": "b*"
}
}
]
See the wildcard query documentation.
Unfortunately, scrolling over aggregation results is not available. Manually dividing the data into pieces is the best thing I can see right now.
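On newer versions (6.1+), the composite aggregation can paginate over all distinct users without the wildcard trick; a sketch reusing the question's field names (users_page is a made-up name, and the minimum-X filter still has to happen client-side, since pipeline aggregations are not supported under composite):
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "datetime": {
              "gte": "2017-05-01",
              "lt": "2017-06-01"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "users_page": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "user": {
              "terms": {
                "field": "user"
              }
            }
          }
        ]
      }
    }
  }
}
Pass the after_key from each response back in an after parameter to fetch the next page, dropping buckets whose doc_count is below X.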

Elasticsearch - calculate percentage in nested aggregations in relation to parent bucket

Updated question
In my query I aggregate on date and then on sensor name. Is it possible to calculate a ratio from a nested aggregation and the total count of documents (or any other aggregation) of the parent bucket? Example query:
{
  "size": 0,
  "aggs": {
    "over_time": {
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "date",
            "interval": "1d",
            "min_doc_count": 0
          },
          "aggs": {
            "measure_count": {
              "cardinality": {
                "field": "date"
              }
            },
            "all_count": {
              "value_count": {
                "field": "name"
              }
            },
            "by_name": {
              "terms": {
                "field": "name",
                "size": 0
              },
              "aggs": {
                "count_by_name": {
                  "value_count": {
                    "field": "name"
                  }
                },
                "my ratio": count_by_name / all_count * 100 <-- How to do that?
              }
            }
          }
        }
      }
    }
  }
}
I want a custom metric that gives me the ratio count_by_name / all_count * 100. Is that possible in ES, or do I have to compute that on the client?
This seems very simple to me, but I haven't found a way yet.
Old post:
Is there a way to let Elasticsearch consider the overall count of documents (or any other metric) when calculating the average for a bucket?
Example:
I have around 100,000 sensors that generate events at different times. Every event is indexed as a document that has a timestamp and a value.
When I want to calculate a ratio of the value over a date histogram, and some sensors only generated values at one point in time, I want Elasticsearch to treat the non-existing values (documents) for those sensors as 0 instead of null.
So when aggregating by day and a sensor only generated two values, at 10pm (3) and 11pm (5), the aggregate for the day should be (3+5)/24, or formally: SUM(value)/24.
Instead, Elasticsearch calculates the average as (3+5)/2, which is not correct in my case.
There was once a ticket on GitHub https://github.com/elastic/elasticsearch/issues/9745, but the answer was "handle it in your application". That's no answer for me, as I would have to generate zillions of zero-value documents for every sensor/time combination to get the average ratio right.
Any ideas on this?
If that's the case, simply divide the results by 24 on the application side, and when the granularity changes, adjust that divisor accordingly. The number of hours per day is fixed, right?
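If you'd rather have Elasticsearch do the division, a sketch using a bucket_script pipeline under a daily histogram (timestamp and value are the field names from the scenario; per_day, value_sum, and hourly_ratio are made-up names):
{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "value_sum": {
          "sum": {
            "field": "value"
          }
        },
        "hourly_ratio": {
          "bucket_script": {
            "buckets_path": {
              "total": "value_sum"
            },
            "script": "params.total / 24"
          }
        }
      }
    }
  }
}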
You can use the Bucket script aggregation to do what you want.
{
  "bucket_script": {
    "buckets_path": {
      "count_by_name": "count_by_name",
      "all_count": "all_count"
    },
    "script": "count_by_name / all_count * 100"
  }
}
It's just an example.
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-pipeline-bucket-script-aggregation.html
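Note that the linked page is for 2.4, where Groovy scripts could reference buckets_path variables directly. On 5.0+ with Painless as the default scripting language, the same aggregation would access them via params:
{
  "bucket_script": {
    "buckets_path": {
      "count_by_name": "count_by_name",
      "all_count": "all_count"
    },
    "script": "params.count_by_name / params.all_count * 100"
  }
}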
