Histogram on the basis of facet counts - elasticsearch

I am currently working on a project in which I am storing user activity logs in Elasticsearch. The user field in the log is like {"user":"abc#yahoo.com"}. I have a timestamp field for each activity that describes when the activity was recorded. Can I generate a date histogram based on the number of users in a particular time period, e.g. each histogram entry shows the number of users at that time? I can implement this by obtaining facet counts, but I need to get counts for various intervals and various ranges with a minimum of queries. Please guide me in this regard. Thanks.

Add a facet to your query, something like the following:
{
  "facets": {
    "daily_volume": {
      "date_histogram": {
        "size": 100,
        "field": "created_at",
        "interval": "day",
        "order": "time"
      }
    }
  }
}
This returns a nice set of ordered data for the number of items per day.
I then feed this to a Google Chart (the ColumnChart works nicely for histograms), converting the returned timestamp integer to a Date type understood by the JavaScript charts API.

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns about raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key value pairs:
key1:15, key2:45, key99999:1313123.
Here key is a string and value is an integer that I would like to sort my results on: if a certain document has a given key, it gets sorted by that key's value.
I ended up creating an object and just putting the key/value pairs inside so I can match them easily.
For example I have sorting: "object.key".
I was wondering: if I just use a simple object with a bunch of strings inside that are just there for exact matching, should I worry about raising this limit to 10k or 20k?
Because I now have an issue where there can be more than 1k of these records. I've found I could use nested sorting, but it still has a default limit of 10k.
Is there a good design pattern for this, or should I not be worried about raising the field limits?
Simplified version of the query:
GET products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "sortingObject.someSortingKey1": {
        "order": "desc",
        "missing": 2,
        "unmapped_type": "float"
      }
    }
  ]
}
The point is that I get the sorting key from the request and use it to sort my results. There are, for example, 100k different ways to sort the result.
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html
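If that fits your queries, a minimal sketch of such a mapping could look like this (the index and field names follow the simplified query above and are illustrative):
PUT products
{
  "mappings": {
    "properties": {
      "sortingObject": {
        "type": "flattened"
      }
    }
  }
}
One thing to keep in mind: flattened leaf values are indexed as keywords, so as far as I know sorting on sortingObject.someSortingKey1 would be lexicographic rather than numeric.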

ES Scripted fields on large number of data

I have an ES index with millions of records. Currently records are fetched using the scroll API and are paginated. Also, all fields are sortable.
I have a request for new additional fields (dependent on already existing fields), so I wanted to use scripted fields.
Can we apply scripted fields combined with the scroll API and search_after API?
Can we have the scripted fields' data sorted?
Example:
We have a point collection system. Users have to collect certain points by doing some tasks in a certain time limit.
{
  "user1": {
    "endtime": "1640807573000", // time epoch
    "points": 100,
    "target": 1000,
    ...
  },
  "user2": {
    "endtime": "1640807573000", // time epoch
    "points": 200,
    "target": 5000,
    ...
  }, ... // millions of such records
}
Values of endtime, points and target can be updated at any time by another system.
Admins can view the report of all users in an ES-based paginated tabular UI, or download the whole dump of the tabular report in CSV format. Now, for the Admins, we want visibility of the following new parameters for every user in the report:
Time left: (user.endtime - current time) // current time is the time when the report is fetched, for a realtime report
Expected work rate: ( target - points ) / ( Time left )
Zone: on the basis of Expected work rate ranges, classify into 3 zones: Red, Yellow, Green
Also, these 3 new fields should support sorting, filtering and pagination.
I wanted to calculate the above 3 fields using scripted fields when the report is fetched, roughly as sketched below.
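For illustration, this is a rough, untested sketch of the first two fields as script fields (the index name user_points is hypothetical; endtime, points and target are assumed to be mapped as numeric fields, with endtime in epoch milliseconds; the "now" param is just an example value):
GET user_points/_search
{
  // hypothetical index and field names, see the assumptions above
  "size": 10,
  "query": { "match_all": {} },
  "script_fields": {
    "time_left": {
      "script": {
        "lang": "painless",
        "source": "doc['endtime'].value - params.now",
        "params": { "now": 1640800000000 }
      }
    },
    "expected_work_rate": {
      "script": {
        "lang": "painless",
        "source": "double left = doc['endtime'].value - params.now; return left > 0 ? (doc['target'].value - doc['points'].value) / left : 0;",
        "params": { "now": 1640800000000 }
      }
    }
  }
}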
Any suggestions on the feasibility of the above approach, or recommendations for a better approach?

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other docs.
So on average, out of every 10 docs I will have 8 unique primary ids, and 1 primary id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
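For reference, a partitioned terms request of this kind looks roughly like this (the index name, field name, partition count and sizes are illustrative, not my actual values):
GET my_index/_search
{
  // illustrative names; run once per partition, 0 .. num_partitions-1
  "size": 0,
  "aggs": {
    "unique_primary_ids": {
      "terms": {
        "field": "primary_id",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 40000
      },
      "aggs": {
        "docs": {
          "top_hits": { "size": 3 }
        }
      }
    }
  }
}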
Ideally, I want to know the best way to go about such data, where the number of distinct values of the field which forms the bucket is almost equal to the number of docs. The SQL equivalent would be select DISTINCT(primary_id) from .....
But in elasticsearch, distinct things can only be processed via bucketing (terms aggregation).
I also use top hits as a sub aggregation query under terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: can combine multiple data sources into single buckets, and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You can fetch "n" records, then pass the returned after key to fetch the next "n" records.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
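For reference, pagination with bucket_sort under a terms aggregation looks roughly like this (field and aggregation names are illustrative); it only pages within the term buckets the parent terms aggregation already returned:
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": {
        "field": "id.keyword",
        "size": 100
      },
      "aggs": {
        "paging": {
          "bucket_sort": {
            "from": 20,
            "size": 10
          }
        }
      }
    }
  }
}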
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
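For example (applied to the index from the query above):
PUT index22/_settings
{
  "index.max_result_window": 20000
}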
A better option is to use the scroll API and perform the distinct on the client side.

Elasticsearch: Aggregate documents based on date range

I have a set of documents in ElasticSearch 5.5 with two date fields: start_date and end_date.
I want to aggregate them into date histogram buckets (e.g. weekly) such that if start_date < week X < end_date, then the document would be in the "week X" bucket.
This means that a single document might be in multiple buckets.
Consider the following concrete example: I have a set of documents describing company employees, and for each employee I have a hire date and (optionally) a termination date. I want to build a date histogram of the number of active employees for the trailing twelve months.
Sample doc content:
{
  "start_date": "2013-01-12T00:00:00.000Z",
  "end_date": "2016-12-08T00:00:00.000Z",
  "id": "123123123"
}
Is there a way to do this in ES?
I have found one way to do this, using filter aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-filter-aggregation.html). If I need, say, a trailing 12 months report, then I would create 12 buckets, where each bucket defines filter conditions such as:
"bool":{
"must":[{
"range":{
"start_date":{
"lte":"2016-01-01T00:00:00.000Z"
}
}
},{
{
"range":{
"end_date":{
"gt":"2016-02-01T00:00:00.000Z"
}
}
}]
}
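Put together with the filters aggregation, a version with just two monthly buckets would look roughly like this (the index name, bucket names and dates are illustrative):
GET employees/_search
{
  "size": 0,
  "aggs": {
    "active_per_month": {
      "filters": {
        "filters": {
          "2016-01": {
            "bool": {
              "must": [
                { "range": { "start_date": { "lte": "2016-01-01T00:00:00.000Z" } } },
                { "range": { "end_date": { "gt": "2016-02-01T00:00:00.000Z" } } }
              ]
            }
          },
          "2016-02": {
            "bool": {
              "must": [
                { "range": { "start_date": { "lte": "2016-02-01T00:00:00.000Z" } } },
                { "range": { "end_date": { "gt": "2016-03-01T00:00:00.000Z" } } }
              ]
            }
          }
        }
      }
    }
  }
}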
However, I feel that it would be nice if there was an easier way to do this, since if I want, say, the trailing 365 days, I have to create 365 bucket filters, which makes the resultant query very large.
I know this question is quite old, but as it's still open I am sharing my knowledge on this. The question does not clearly explain what kind of output is expected, but I still think this can be achieved using the "Date Histogram Aggregation" and the "Bucket Script Aggregation".
Here are the documentation links for both of these aggregations.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-datehistogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-pipeline-bucket-script-aggregation.html

Calculate change rate of Time Series values

I have an application which writes Time Series data to Elasticsearch. The (simplified) data looks like the following:
{
  "timestamp": 1425369600000,
  "shares": 12271
},
{
  "timestamp": 1425370200000,
  "shares": 12575
},
{
  "timestamp": 1425370800000,
  "shares": 12725
},
...
I would now like to use an aggregation to calculate the change rate of the shares field per time "bucket". For example, the change rate of the share values within the last 10-minute "bucket" could IMHO be calculated as
# of shares t1
--------------
# of shares t0
I tried the Date Histogram aggregation, but I guess that's not what I need to calculate the change rates, because it would only give me the doc_count, and it's not clear to me how I could calculate the change rate from that:
{
  "aggs": {
    "shares_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "10m"
      }
    }
  }
}
Is there a way to achieve my goal with aggregations within Elasticsearch? I searched the docs, but didn't find a matching method.
Thanks a lot for any help!
I think it is hard to achieve with out-of-the-box aggregate functions. However, you can take a look at the percentile_ranks aggregation and add your own modifications to the script to create point-in-time rates.
Also, sorry for going off-topic, but I wonder: is Elasticsearch the best fit for this kind of thing? As I understand it, at any given point in time you only need the previous sample's data to calculate the correct rate for the current sample. This sounds to me like a better fit for a real-time sliding-window implementation (even on some relational DB like Postgres), where you keep a fixed number of time buckets and the counters you are interested in inside each bucket. Once a new sample arrives, you update (slide) the window and calculate the updated rate for the most recent time bucket.
