ES scripted fields on a large amount of data - elasticsearch

I have an ES index with millions of records. Currently records are fetched using the scroll API and are paginated. Also, all fields are sortable.
I have a request for new additional fields (derived from already existing fields), so I wanted to use scripted fields.
Can we apply scripted fields combined with the scroll API and search_after API?
Can we have the scripted field values sorted?
Example:
We have a point collection system. Users have to collect certain points by doing some tasks in a certain time limit.
{
  user1: {
    "endtime": "1640807573000", // time epoch
    "points": 100,
    "target": 1000,
    ...
  },
  user2: {
    "endtime": "1640807573000", // time epoch
    "points": 200,
    "target": 5000,
    ...
  },
  ... // millions of such records
}
Values of endtime, points, and target can be updated at any time by another system.
Admins can view the report of all users on an ES-based paginated tabular UI, or download the whole tabular report as a CSV dump. Now, for the admins, we want visibility of the following new parameters for every user in the report.
Time left: (user.endtime - current time) // current time is the time at which the report is fetched, for a realtime report
Expected work rate: (target - points) / (time left)
Zone: on the basis of expected work rate ranges, classify into 3 zones: Red, Yellow, Green
Also, these 3 new fields should support sorting, filtering and pagination.
I wanted to calculate the above 3 fields using scripted fields when the report is fetched.
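To make this concrete, here is a rough sketch of what I had in mind, using runtime fields (ES 7.11+), which as far as I know are the sortable successor to scripted fields. The index name, the long (epoch millis) mapping of endtime/points/target, and the params.now value are all assumptions; the zone field would be a third runtime field of type keyword emitting Red/Yellow/Green based on expected_work_rate thresholds:
GET user-points/_search
{
  "runtime_mappings": {
    "time_left": {
      "type": "long",
      "script": {
        "source": "emit(doc['endtime'].value - (long) params.now)",
        "params": { "now": 1640800000000 }
      }
    },
    "expected_work_rate": {
      "type": "double",
      "script": {
        "source": "long left = doc['endtime'].value - (long) params.now; if (left <= 0) { emit(0.0); } else { emit((doc['target'].value - doc['points'].value) / (double) left); }",
        "params": { "now": 1640800000000 }
      }
    }
  },
  "fields": ["time_left", "expected_work_rate"],
  "sort": [
    { "expected_work_rate": "desc" }
  ]
}
Passing the report time as params.now should keep all pages of one report consistent; my understanding is that search_after can then page over the expected_work_rate sort, but that is exactly the part I would like confirmed.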
Any suggestions on the feasibility of the above approach, or recommendations for a better approach?

Related

Q: Structuring elasticsearch indexes for query optimization

So I'm working on setting up Elasticsearch/OpenSearch in order to build analytics dashboards.
The data that I have is:
Product > Date > Customer > Variables:Data (e.g., revenue: 100)
{
  "_id": X,
  "_type": ["Date"],
  "_index": ["Product A"],
  "_CustomerXYZ": {
    "revenue": 100,
    "name": ["ABC Inc."],
    "usage": 200
  }
}
I was thinking of setting up an index for each product, then a document for each date, and then a JSON map for each customer holding each of the variables.
I essentially want to be able to easily query and graph customer variables over time for a particular product. E.g., for product A over the last 90 days for customer B, plot their revenue.
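For illustration, the kind of request I would want to run might look roughly like this (just a sketch; it assumes a flatter mapping with explicit customer, date and revenue fields and a per-product index, rather than the per-customer keys above):
GET product-a-metrics/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term":  { "customer": "customer-b" } },
        { "range": { "date": { "gte": "now-90d/d" } } }
      ]
    }
  },
  "aggs": {
    "revenue_over_time": {
      "date_histogram": { "field": "date", "calendar_interval": "day" },
      "aggs": {
        "revenue": { "sum": { "field": "revenue" } }
      }
    }
  }
}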
As I will have millions of customers, 2+ years of data, and multiple products, I'm looking at hundreds of millions if not billions of records. What is the best way to set up my Elasticsearch cluster to ensure scalability and sub-second latencies?

Navigating terms aggregation in Elastic with a very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other docs.
So on average, out of every 10 docs I will have 8 unique primary ids, and 1 primary id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to go about such data, where the cardinality of the field which forms the buckets is almost equal to the number of docs. The SQL equivalent would be select DISTINCT ( primary_id) from .....
But in Elasticsearch, distinct values can only be obtained via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite Aggregation: it can combine multiple data sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You can fetch "n" records, then pass the returned after_key to fetch the next "n" records.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
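For example, after the first page you take the after_key returned in the aggregation response and pass it back to get the next page (the key value below is a placeholder):
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          { "TradeRef": { "terms": { "field": "id.keyword" } } }
        ],
        "after": { "TradeRef": "<last TradeRef from previous page>" }
      }
    }
  }
}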
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test it to see how much your hardware can support.
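For example (index name as above; pick a value your hardware can actually handle):
PUT index22/_settings
{
  "index": {
    "max_result_window": 50000
  }
}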
A better option is to use the scroll API and perform the distinct on the client side.

How to use Kibana and elasticsearch [7.5.0] to track the number of documents containing a particular value

I have an index which contains information about some objects. I want to display some of the information on my Kibana dashboard. Let's assume an object looks as follows:
{
  "_index": "obj",
  "_type": "_doc",
  "_id": "KwDPAHABfo5V345r4IYV",
  "_version": 1,
  "_score": 0,
  "_source": {
    "value_1": "some value",
    "value_2": "some_other value",
    "owner": "jason",
    "modified_date": "2020-02-01T12:53:08.210317+00:00",
    "created_date": "2020-02-01T12:53:08.243980+00:00"
  }
}
I need to show the (live) number of objects that have owner: 'UNKNOWN'. The thing is that this value changes over time. Each change is a new document - the existing documents are not being updated. I need to track how many UNKNOWN owners I currently see. Updates (new documents) are being sent to ELK at fixed intervals.
When I try to set up a metric, it sometimes shows 0 during the window between one update and another - when there are no documents flowing into ELK. How can I make Kibana display only the latest documents with owner: 'UNKNOWN'?
How can I make Kibana display only the latest documents with owner: 'UNKNOWN'?
You could set up a data table visualization for that as an alternative to the one-dimensional metric visualization.
This is how I personally would configure the data table:
Set a filter with 'owner(.keyword) is UNKNOWN'.
Use the metric 'Top Hit' on the field created_date (or @timestamp, that's up to you) instead of the count metric.
Set the order to descending based on the timestamp field.
Split the rows (Term Aggregations) for every field you want to display in the rows. This will create 'columns' in your table.
Go to the options tab and enable count on the sum of all rows.
Set an appropriate time interval, e.g. last 1 hour.
This will display all the relevant data of your documents that have the field owner equal to UNKNOWN. Also, you see the ingestion/creation date timestamp of these documents in descending order. Furthermore, you see the number of documents that match (configured via the options tab as described above).
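If you ever want to sanity-check the table outside of Kibana, a rough query-level equivalent could look like this (a sketch only: owner.keyword, created_date and the obj index come from your document, while splitting on value_1.keyword is just an assumed example of a row split):
GET obj/_search
{
  "size": 0,
  "query": {
    "term": { "owner.keyword": "UNKNOWN" }
  },
  "aggs": {
    "per_row": {
      "terms": { "field": "value_1.keyword", "size": 100 },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "created_date": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}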
I hope I could help you.

mongodb - Recommended tree structure for a large number of data points

I'm working on a project which records price history for items across multiple territories, and I'm planning on storing the data in a mongodb collection.
As I'm relatively new to mongodb, I'm curious about what might be a recommended document structure for quite a large amount of data. Here's the situation:
I'm recording the price history for about 90,000 items across 200 or so territories. I'm looking to record the price of each item every hour, and give a 2-week history for any given item. That comes out to around (90000*200*24*14) ~= 6 billion data points, or approximately 67,200 per item. A cleanup query will be run once a day to remove records older than 14 days (more specifically, archive them to a gzipped JSON/text file).
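The daily cleanup I have in mind would be roughly the following (just a sketch; the prices collection name is a placeholder, and it assumes the flat one-document-per-data-point layout described below, with time stored as epoch milliseconds):
// archiving would happen first; then drop everything older than 14 days
var cutoff = Date.now() - 14 * 24 * 60 * 60 * 1000;
db.prices.deleteMany({ time: { $lt: cutoff } });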
In terms of data that I will be getting out of this, I'm mainly interested in two things: 1) The price history for a specific item in a specific territory, and 2) the price history for a specific item across ALL territories.
Before I actually start importing this data and running benchmarks, I'm hoping someone might be able to give some advice on how I should structure this to allow for quick access to the data through a query.
I'm considering the following structure:
{
  _id: 1234,
  data: [
    {
      territory: "A",
      price: 5678,
      time: 123456789
    },
    {
      territory: "B",
      price: 9876,
      time: 123456789
    }
  ]
}
Each item is its own document, with each territory/price point for that item in a particular territory. The issue I run into with this is retrieving the price history for a particular item. I believe I can accomplish this with the following query:
db.collection.aggregate([
  { $unwind: "$data" },
  { $match: { _id: 1234, "data.territory": "B" } }
])
The other alternative I was considering was to just put every single data point in its own document and put an index on the item and territory.
// Document 1
{
  item: 1234,
  territory: "A",
  price: 5679,
  time: 123456789
}
// Document 2
{
  item: 1234,
  territory: "B",
  price: 9676,
  time: 123456789
}
I'm just unsure whether having 6 billion documents with 3 indexes, or having 90,000 documents with 67,200 array objects each and using an aggregate, would be better for performance.
Or perhaps there's some other tree structure or handling of this problem that you fine folks and MongoDB wizards can recommend?
I would structure the documents as "prices for a product in a given territory per fixed time interval". The time interval is fixed for the schema as a whole, but different schemas result from different choices and the best one for your application will probably need to be decided by testing. Choosing the time interval to be 1 hour gives your second schema idea, with ~6 billion documents total. You could choose the time interval to be 2 weeks (don't). In my mind, the best time interval to choose is 1 day, so the documents would look like this
{
  "_id" : ObjectId(...), // could also use a combination of prod_id, terr_id, and time so you get a free unique index to look up by those 3 values
  "prod_id" : "DEADBEEF",
  "terr_id" : "FEEDBEAD",
  "time" : ISODate("2014-10-22T00:00:00.000Z"), // start of the day this document contains the data for
  "data" : [
    {
      "price" : 1234321,
      "time" : ISODate("2014-10-22T15:00:00.000Z") // start of the hour this data point is for
    },
    ...
  ]
}
I like the time interval of 1 day because it hits a nice balance between number of documents (mostly relevant because of index sizes), size of documents (16MB limit, have to pipe over the network), and ease of retiring old docs (hold 15 days, wipe+archive all from the 15th day at some point each day). If you put an index on { "prod_id" : 1, "terr_id" : 1 }, that should let you fulfill your two main queries efficiently. You can gain an additional bonus performance boost by preallocating the doc for each day so that updates are in-place.
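As a rough sketch (the prices collection name is just a placeholder), the index and your two main queries could look like:
// compound index mentioned above
db.prices.createIndex({ prod_id: 1, terr_id: 1 })

// 1) price history for one product in one territory
db.prices.find({ prod_id: "DEADBEEF", terr_id: "FEEDBEAD" }).sort({ time: 1 })

// 2) price history for one product across ALL territories
db.prices.find({ prod_id: "DEADBEEF" }).sort({ time: 1 })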
There's a great blog post about managing time series data like this, based on experience building the MMS monitoring system. I've essentially lifted my ideas from there.

Histogram on the basis of facet counts

I am currently working on a project in which I am storing user activity logs in elasticsearch. The user field in the log is like {"user":"abc#yahoo.com"}. I have a timestamp field for each activity that describes when this activity was recorded. Can I generate a date histogram on the basis of the number of users in a particular time period? E.g., each histogram entry must show the number of users at that time. I can have this implemented by obtaining facet counts, but I need to get counts for various intervals and various ranges with a minimum number of queries. Please guide me in this regard. Thanks.
Add a facet to your query something like the following:
{"facets": {
"daily_volume": {
"date_histogram": {
"size": 100,
"field": "created_at",
"interval": "day"
"order": "time"
}
}
}
This returns a nice set of ordered data for the number of items per day.
I then feed this to a Google Chart (the ColumnChart works nicely for histograms), converting the returned timestamp integer to a Date type understood correctly by the JavaScript charts API.
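The conversion itself is just a matter of mapping each facet entry to a [Date, count] pair, roughly like this (assuming response is the parsed search response containing the facet above):
// each entry comes back as { "time": <epoch millis>, "count": <n> }
var rows = response.facets.daily_volume.entries.map(function (e) {
  return [new Date(e.time), e.count];
});
// rows can then be loaded into a google.visualization DataTable for the ColumnChart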
