So I'm working on setting up ElasticSearch/OpenSearch in order to build an analytics dashboard.
The data that I have is:
Product > Date > Customer > Variables: Data (e.g., revenue: 100)
{
  "_id": X,
  "_type": ["Date"],
  "_index": ["Product A"],
  "_CustomerXYZ": {
    "revenue": 100,
    "name": ["ABC Inc."],
    "usage": 200
  }
}
I was thinking of setting up an index for each product, then a document for each date, and then a JSON map for each customer containing each of the variables.
I essentially want to be able to easily query and graph customer variables over time for a particular product, e.g. for product A over the last 90 days, plot customer B's revenue.
As I will have millions of customers, 2+ years of data, and multiple products, I'm looking at hundreds of millions if not billions of records. What is the best way to set up my ElasticSearch cluster to ensure scalability and sub-second latencies?
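For illustration only, here is a minimal sketch of how the proposed layout could be queried from Python: fetch customer B's revenue from product A's index for the last 90 days. The index name, the date field, and the customer key are my assumptions, not settled parts of the design.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="product-a",  # hypothetical: one index per product, as proposed above
    body={
        "size": 90,
        "_source": ["date", "CustomerB.revenue"],  # only the fields needed for the plot
        "query": {"range": {"date": {"gte": "now-90d/d"}}},  # assumes a "date" field per document
        "sort": [{"date": "asc"}],
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])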
Related
I have an ES index with millions of records. Currently records are fetched using the scroll API and paginated, and all fields are sortable.
I have a request for new additional fields (derived from already existing fields), so I wanted to use scripted fields.
Can scripted fields be combined with the scroll API and the search_after API?
Can scripted-field data be sorted?
Example:
We have a point collection system. Users have to collect a certain number of points by doing some tasks within a certain time limit.
{
  user1: {
    "endtime": "1640807573000", // time epoch
    "points": 100,
    "target": 1000,
    ...
  },
  user2: {
    "endtime": "1640807573000", // time epoch
    "points": 200,
    "target": 5000,
    ...
  }, ... // millions of such records
}
Values of endtime, points, and target can be updated at any time by another system.
Admins can view a report of all users in an ES-based paginated tabular UI, or download the whole tabular report as a CSV dump. Now, for the admins, we want visibility of the following new parameters for every user in the report:
Time left: (user.endtime - current time) // current time is the time when the report is fetched, for a real-time report
Expected work rate: (target - points) / (time left)
Zone: based on ranges of expected work rate, classify into 3 zones: Red, Yellow, Green
These 3 new fields should also support sorting, filtering, and pagination.
I wanted to calculate the above 3 fields using scripted fields when the report is fetched.
Any suggestions on the feasibility of the above approach, or recommendations for a better one?
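For what it's worth, here is a rough, hedged sketch of the scripted-field idea using runtime fields (available since Elasticsearch 7.11), computed at search time and used for sorting. The index name "user_points" and the assumption that endtime, points, and target are mapped as numeric (epoch-millis) fields are mine, not from the question; this is a sketch, not a tested implementation.

import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
now_ms = int(time.time() * 1000)  # "current time" at the moment the report is fetched

body = {
    "runtime_mappings": {
        "time_left": {
            "type": "long",
            "script": {
                "source": "emit(doc['endtime'].value - params.now)",  # assumes endtime is a long (epoch millis)
                "params": {"now": now_ms},
            },
        },
        "expected_rate": {
            "type": "double",
            "script": {
                "source": (
                    "long left = doc['endtime'].value - params.now; "
                    "emit(left > 0 ? (doc['target'].value - doc['points'].value) / (double) left : 0);"
                ),
                "params": {"now": now_ms},
            },
        },
    },
    "fields": ["time_left", "expected_rate"],
    "sort": [{"expected_rate": {"order": "desc"}}],  # add a unique tiebreaker field here for search_after
    "size": 1000,
}
page = es.search(index="user_points", body=body)
# search_after pagination: pass the last hit's sort values into the next request, e.g.
# body["search_after"] = page["hits"]["hits"][-1]["sort"]

The Zone field could be a third runtime field of type keyword that emits "Red"/"Yellow"/"Green" based on expected_rate thresholds, or be derived client-side. Note that these scripts are evaluated per document at query time, so on an index with millions of records that evaluation cost is the main feasibility concern.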
Part of this question is related to: Elasticsearch filter on aggregation
Context
Let's say my Elasticsearch index contains some orders. Each order has a price field and an amount field. This results in an index that looks like this:
[
{
"docKey": "order01",
"user": "1",
"price": 8,
"amount": 20
},
{
"docKey": "order02",
"user": "1",
"price": 14,
"amount": 3
},
{
"docKey": "order03",
"user": "2",
"price": 5,
"amount": 1
},
{
"docKey": "order04",
"user": "2",
"price": 10,
"amount": 3
}
]
What I would like to do
What I want to do is filter on some values aggregated per user. I want to use this kind of filter for search, and also to apply further aggregations on top of it. For example, here I would like to retrieve the documents of all users whose average order price is in the range 9-14.
User 1 has an average order price of 11, so we keep both of his orders.
User 2 has an average order price of 7.5, so both of his orders are dropped.
That was the easy part. After filtering down to user 1 only, I want to compute some more aggregations on the result.
So for example: I want the distribution of the per-user average of the amount field among the buckets [0,10] and [10,20], for all users whose average order price is in the range 9-14.
The answer I expect for this question is 0 in the bucket [0,10] and 1 in the bucket [10,20] (only user 1 is kept because of his average price; his average amount is 11.5, so it falls in the bucket [10,20]).
What I have tried
I have managed to do my filter in order to retrieve the users whose average order price is in the range 9-14. I did this by first doing a terms aggregation on the user field, then a sub-aggregation that is an avg aggregation on the price, and then a bucket selector pipeline aggregation that checks whether the previously computed average price is between 9 and 14.
I have also managed to do the aggregation I wanted, but without the previous filter. I did exactly the same thing as for the filter, once per range, and then counted the number of results in each bucket.
I haven't found any way to apply another aggregation on a bucket selector's result, so I could not first do the filter and then apply the range...
Also, these solutions are not elegant. I don't think they will scale, since a large part of the documents would need to be returned in the response and processed further (even if that processing is done client-side I'd prefer to avoid it, and I might be limited by the result size of an aggregation?).
I did manage to find a solution, but it's not elegant and might scale poorly (a sketch of the request body follows this list):
Make a terms aggregation on the user field.
As a sub-aggregation of the terms aggregation, do an avg aggregation that computes the average of the price.
As a sub-aggregation of the terms aggregation, do an avg aggregation that computes the average of the amount.
Do a bucket selector pipeline aggregation that keeps only buckets with avg_price in the range [9,14].
Do a bucket selector pipeline aggregation that keeps only buckets with avg_amount in the range [0,10].
Do a "count" bucket script pipeline aggregation (with a script returning 1).
Do a sum bucket pipeline aggregation that sums the counts.
Repeat all the steps for every range wanted ([0,10], [10,20]).
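A hedged sketch of what that request body could look like, expressed as a Python dict for the Elasticsearch client. The index name "orders", the user.keyword field (assuming default dynamic mapping), and the terms size are my assumptions; this covers the [0,10] amount range only and would be repeated per range.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,
    "aggs": {
        "per_user": {
            "terms": {"field": "user.keyword", "size": 10000},  # one bucket per user
            "aggs": {
                "avg_price": {"avg": {"field": "price"}},
                "avg_amount": {"avg": {"field": "amount"}},
                "price_filter": {  # keep users whose average price is in [9, 14]
                    "bucket_selector": {
                        "buckets_path": {"avgPrice": "avg_price"},
                        "script": "params.avgPrice >= 9 && params.avgPrice <= 14",
                    }
                },
                "amount_filter": {  # keep users whose average amount is in [0, 10]
                    "bucket_selector": {
                        "buckets_path": {"avgAmount": "avg_amount"},
                        "script": "params.avgAmount >= 0 && params.avgAmount <= 10",
                    }
                },
                "count_one": {  # emits 1 for every surviving user bucket
                    "bucket_script": {
                        "buckets_path": {"p": "avg_price"},
                        "script": "1",
                    }
                },
            },
        },
        "users_in_0_10": {  # total number of users falling in this amount range
            "sum_bucket": {"buckets_path": "per_user>count_one"}
        },
    },
}
resp = es.search(index="orders", body=body)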
I'm working on document visualization for binary classification of a large number of documents (around 150,000). The challenge is how to present general visual information to end users, so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top 20 topics in positively classified documents, and then the same for the negatives.
I created a Python script that downloads the data from Elastic and classifies the docs, BUT the problem is that the predictions on the dataset are not registered in Elasticsearch, so I cannot ask for the top 20 topics in a certain category. First I thought about creating a query in Elastic to ask for the aggregations and passing a match clause per document ID.
As I have the IDs of the positive/negative documents, I can write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really large number of document IDs to select, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50,000 IDs like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the number of documents is really huge, it takes about half an hour (compared to less than a minute to run the classification)... which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:
# One HTTP round-trip per document update, which is what makes this slow.
for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )
Do you know an alternative to update the predictions faster?
You could use the bulk API, which lets you batch your requests and hit Elasticsearch only once to execute many operations.
Try:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")
query_list = []
list_ids = ["1", "2", "3"]
for id in list_ids:
    query_dict = {
        '_op_type': 'update',  # bulk update action
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)
helpers.bulk(client=es, actions=query_list)  # send all updates in a single bulk request
Please have a read of the elasticsearch-py bulk helpers documentation.
Regarding querying the list of IDs: to get a faster response, you shouldn't match on the id_str value as you do in the question, but use the _id field instead. That lets you use the mget (multi-get) query, a bulk version of the get operation, also available in the Python library. Try:
my_ids_list = [<some_ids_here>]
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})
I just started using ElasticSearch and Grafana this week (so it might be an easy question).
I have an ES index whose documents look like:
{
  "title": "Adventure Book",
  "month": "01-2019",
  "price": 50
}
So for each book, I have its price for each month.
What I want to do is create a dashboard in Grafana that:
computes the maximum price of each book
creates a histogram of the number of books per maximum price
I managed to do the first part and create a table with book_id / maximum price (a sketch of that aggregation follows this question).
But then I don't know how I can use my first table "as a source" for my histogram.
If you have ideas or a workaround to do this, it would really help.
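For reference, a hedged sketch of that first part as an Elasticsearch aggregation issued from Python: a terms aggregation on the book title with a max sub-aggregation on price. The index name "books" and the title.keyword field are my assumptions about the mapping.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,
    "aggs": {
        "per_book": {
            "terms": {"field": "title.keyword", "size": 10000},  # one bucket per book
            "aggs": {"max_price": {"max": {"field": "price"}}},  # maximum price over all months
        }
    },
}
resp = es.search(index="books", body=body)
for bucket in resp["aggregations"]["per_book"]["buckets"]:
    print(bucket["key"], bucket["max_price"]["value"])

The second part (bucketing books by their maximum price) is the tricky bit, since it needs a histogram over the per-bucket max values rather than over a plain document field.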
I'm working on a project which records price history for items across multiple territories, and I'm planning on storing the data in a mongodb collection.
As I'm relatively new to mongodb, I'm curious about what might be a recommended document structure for quite a large amount of data. Here's the situation:
I'm recording the price history for about 90,000 items across 200 or so territories. I'm looking to record the price of each item every hour, and give a 2 week history for any given item. That comes out to around (90000*200*24*14) ~= 6 billion data points, or approximately 67200 per item. A cleanup query will be run once a day to remove records older than 14 days (more specifically, archive it to a gzipped json/text file).
In terms of data that I will be getting out of this, I'm mainly interested in two things: 1) The price history for a specific item in a specific territory, and 2) the price history for a specific item across ALL territories.
Before I actually start importing this data and running benchmarks, I'm hoping someone might be able to give some advice on how I should structure this to allow for quick access to the data through a query.
I'm considering the following structure:
{
_id: 1234,
data: [
{
territory: "A",
price: 5678,
time: 123456789
},
{
territory: "B",
price: 9876,
time: 123456789
}
]
}
Each item is its own document, with each territory/price point for that item stored in the data array. The issue I run into with this is retrieving the price history for a particular item. I believe I can accomplish this with the following query:
db.collection.aggregate([
    {$unwind: "$data"},
    {$match: {_id: 1234, "data.territory": "B"}}
])
The other alternative I was considering was to just put every single data point in its own document and put an index on the item and territory.
// Document 1
{
item: 1234,
territory: "A",
price: 5679,
time: 123456789
}
// Document 2
{
item: 1234,
territory: "B",
price: 9676,
time: 123456789
}
I'm just unsure whether having 6 billion documents with 3 indexes, or having 90,000 documents with 67,200 array objects each and using an aggregation, would be better for performance.
Or perhaps there's some other tree structure or way of handling this problem that you fine folks and MongoDB wizards can recommend?
I would structure the documents as "prices for a product in a given territory per fixed time interval". The time interval is fixed for the schema as a whole, but different schemas result from different choices, and the best one for your application will probably need to be decided by testing. Choosing a time interval of 1 hour gives your second schema idea, with ~6 billion documents total. You could choose a time interval of 2 weeks (don't). In my mind, the best time interval to choose is 1 day, so the documents would look like this:
{
"_id" : ObjectId(...), // could also use a combination of prod_id, terr_id, and time so you get a free unique index to look up by those 3 values
"prod_id" : "DEADBEEF",
"terr_id" : "FEEDBEAD",
"time" : ISODate("2014-10-22T00:00:00.000Z"), // start of the day this document contains the data for
"data" : [
{
"price" : 1234321,
"time" : ISODate("2014-10-22T15:00:00.000Z") // start of the hour this data point is for
},
...
]
}
I like the time interval of 1 day because it hits a nice balance between the number of documents (mostly relevant because of index sizes), the size of documents (16MB limit, has to be piped over the network), and the ease of retiring old docs (hold 15 days, wipe+archive everything from the 15th day back at some point each day). If you put an index on { "prod_id" : 1, "terr_id" : 1 }, that should let you fulfill your two main queries efficiently. You can gain an additional performance boost by preallocating the document for each day so that updates are in-place.
There's a great blog post about managing time series data like this, based on experience building the MMS monitoring system. I've essentially lifted my ideas from there.
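For illustration, a hedged pymongo sketch of the per-day schema above: one document per (product, territory, day), with hourly points pushed into the data array. The field names and the DEADBEEF/FEEDBEAD IDs follow the example document; the connection URI, database, and collection names are my assumptions.

from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection URI
prices = client["pricing"]["prices"]               # hypothetical database/collection names

# Compound index supporting both main queries: one item in one territory,
# and one item across all territories (via the prod_id prefix).
prices.create_index([("prod_id", ASCENDING), ("terr_id", ASCENDING), ("time", ASCENDING)])

now = datetime.now(timezone.utc)
day_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
hour_start = now.replace(minute=0, second=0, microsecond=0)

# Upsert the (product, territory, day) document and push this hour's price point.
prices.update_one(
    {"prod_id": "DEADBEEF", "terr_id": "FEEDBEAD", "time": day_start},
    {"$push": {"data": {"price": 1234321, "time": hour_start}}},
    upsert=True,
)

# Price history for one item in one territory, ordered by day.
history = prices.find({"prod_id": "DEADBEEF", "terr_id": "FEEDBEAD"}).sort("time", ASCENDING)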