Is there a way to maintain aging in documents in elastic search - elasticsearch

Here is the problem
I have about 1 million record in indexes. There is a property aging in the documents which increase daily. Every night scheduler runs and it calculates the aging from current date and created date in the document and update the index.
The problem is as data is increasing the bulk update is leading to GC overhead limit exceeded. So what I did is added some pause in each update, but still no help.
Now I am thinking and researching of using groovy script with 'update_with_query'.
I want to ask it there any other way to maintain age. e.g in jira everyday overdue date is increased or I have to fetch visit and update documents
EveryTime bulk request is run I can see elastic search throttling ' now throttling indexing: numMergesInFlight=5, maxNumMerges=4'. I have read about this but not sure what to do. I think there should be another approach to calculate aging but not sure, because as data will increase this problem is going to persist
IN the end I want a query like give me all docs whose aging is 100 or give me all documents whose aging > 100

The answer was simple. I was thinking other way around.
if a query is get all docs where aging is > 2. It means I need to get all docs who were created before two days. Simple convert '2' to date from current date and use range operation and it should solve the problem

Related

How to find memory usage difference in Grafana

I am working in graph panel in grafana and using elastic search as a data source. In the data source, I have memory-used with timestamp. I am trying to give notification alert when the difference is more than 100 MB. How to find memory difference between the memory used in day one and memory used in current day and send alert notification?
You would setup a query which is basically grouped by timestamp and define it based on whether you are looking for the 100 MB difference to be on max value or average. Assuming it is max value- you query would be something like
And then you would set alerts by going to the alert tab based on the query and diff in the values for 24 hours

Reasons & Consequences of putting a Date in Elastic Index Name

I am looking at sending my App logs to Elastic (6.x) via FileBeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need add the Date to the Index name. The reason for doing so was that when the time came to delete old data, it was easier to delete an entire Index by date, rather than individual documents. Is this true?
If I should be following this recommendation of adding the Date to the Index Name, I’m curious what additional things I need to do to ensure seamless querying? By this I mean querying esp. in Kibana, for e.g. over the past day which would need to look at today’s index as well as yesterday’s index.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect anything more than a GB of data day, and no more than a couple of 100K documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate to cross a GB a day.
How many teams are using the same cluster apart from yours if there is
any
The cluster would be used (actually queried) by just my team. We are about 5 right now, but I don't see more than 10 users (and that's not concurrent, just over a day or month)
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions
Hardware details
I've not worked this out with management. For most part I expect 3 nodes. Also this is not critical i.e. if we lose all of our logs for some reason, I would not lose sleep over it.
First of all you need to take a step back and understand do you really need multiple index or single one(where you need to filter documents while querying using a date field for a particular date).
Some of questions you must have before you take on such decision
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours if there is any
Usage patterns, type of queries used etc.
Hardware details
Advantages
In a way, having multiple indexes(with date field as its index name) would be more beneficial.
You can delete the old indexes without affecting new ones.
In case if you have to change the mapping, you can do so with the new index without affecting the old ones. Comparatively less overhead while for single index, you have to reindex all the documents which would take lot more time if size is pretty huge. And if this keeps happening every now and then, you would need to come up with solution where you have to execute such operations at the times of minimal usages. That means, it can harm productivity.
searching using multiple indexes still is convenient.
not really sure but its easier for scaling using multiple indexes.
Disadvantages are:
Additional shards are created for each and every index that can waste some storage space.
Overhead to maintain multiple indexes by monitoring/operations team.
At times can lead to over-creation of indexes.
No mapping changes and less documents insertion(in 100s or few 100s), it'd be better to use single index.
The only way and the only correct way to figure out what's best is to have a cluster that closely resembles the production one with data too resembling to production, try various configurations and see which solution fits best.
Speaking of querying in Kibana, is there a way of simply working with
the base index name without the date stamp i.e. setting it up so that
I do not see or have to deal with the date named indexes?
Yes there is. If you have indexes with names like logs-0001, logs-0002, you can use logs-* as indexname when you query.
Including date in an index name is a very common use case implemened by many Elasticsearch users. It helps with archiving/ purging old indices as you mentioned. You dont need to do anything additionally to be able to query. Setup your index basename as an index pattern for your indices for ex. logstash-* and you can query on that particular index pattern in Kibana.

How to merge old data to save space in Elasticsearch

I tried to find information about this, but I have not find what I was looking for.
I am storing metrics every minutes in an Elasticsearch database. My idea is that the frequency is important only in a short period.
For example, I want to have my metrics every minutes for the last past week, but then I would like to merge these metrics in order to have only one document of metrics for each past weeks.
Thus, I have an idea to achieve this with a stream processing framework such as Spark streaming or Flink, but my question is : is there a native way / tool / tricks to make it happen in Elasticsearch ?
Thank you, hope my question is clear enough, otherwise leave a comment for more details.
One idea would be to have a weekly index in which you store all your metrics every minutes, once the week has passed, you could run an aggregation query on the past week index and aggregate all info at the day or week level. You'd then store that weekly aggregated information as new document in another historical index that you can query later on. I don't think it's necessary to leverage Spark streaming for this, ES aggregations can do the job pretty easily.

How to add calculations to an Elastic Search database?

I'm using Elastic Search to index large amounts of sensor data for analytics purposes. The table has 4 million + rows and growing fast - expecting 40 million within the next year. This makes Elastic Search seem like a natural fit, especially with tools such as Kibana to easily display the data.
Elastic Search seems great, however there are are some more complex calculations that have to be performed as well. One such calculation is for our "average user time", where we take two data points (timestamp of item picked up and timestamp of item placed back), subtract them from each other and do an average of all these for one specific customer over a specific timeframe. The SQL query would look something like "select * from events where event_type = 'object picked up' or event_type = 'object placed back down'" then take all these events and get diffs on all their timestamps, add them all together then divide by count.
These types of calculations to my understanding are not the type of thing that Elastic Search is meant to do. I've had people recommend Hadoop but that could take a long time to get set up and we can use a fast language like GO or Node/JavaScript to batch process things and add them to the DB periodically... but what is the right way to do this? Allowing for future scalability and working nicely with Elastic Search.
Our setup is: Rails, AngularJS, Elastic Search, Heroku, Postgres.
Maybe you could try to use scripted metrics. In connection with filters can give you more or less proper solution for your problem
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html

Is it possible to limit a size of an Elasticsearch index?

I have an Elasticsearch instance for indexing log records. Naturally the data grows over time and I would like to limit its size(about 10GB). Something like a mongoDb capped collection.
I'm not interested in old log records anyway.
I haven't found any config for this and I'm not sure that I can just remove data files.
any suggestions ?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning, for each day or each week you create an index. Index everything belonging to that day/week in that index.
You decide how many days you want to keep around and stick to that number. Let's say that the data for 7 days counts as 10 GB. In the 8th day you create the new index, as usual, then you delete the index from 8 days before.
All the time you'll have in your cluster 7 indices.
Using ttl as the other poster suggested is not recommended, because is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks every indices.ttl.interval (60 seconds by default) for expired documents, it creates bulk requests out of them and deletes them. This means unnecessary requests coming to the cluster.
Instead, deleting an index is very easy and quick.
Take a look at this and how to easily manage time based indices with Curator.
From what I remember a capped collection in MongoDB was just a circular buffer type of collection that removes oldest entries when there's no more room? Unfortunately there's nothing like this out of the box in ElasticSearch, you have to add this functionality yourself either by removing single documents (or batches of documents) using ES's API. A more performant way is described in their documentation under retiring data.
You can provide a per index/type default _ttl(time to live) value as follows:
{
"tweet" : {
"_ttl" : { "enabled" : true, "default" : "1d" }
}
}
You will have more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain

Resources