I'm using Elasticsearch 5 and I need my documents older than X days/weeks (or a given date) to be deleted automatically. I'm not sure _ttl is still available in 5, but from what I read Elastic does not recommend it anyway.
I will be updating my documents; it is only the ones not updated for a defined period that I need deleted.
Any ideas?
If you need to do that for all docs which are older than a date X, then it's definitely better to create one index per period (say, per day) and then, after X days, simply drop the index.
That's way more efficient than running individual delete-document operations.
If it's driven by a query, i.e. docs that are older than X days and also match XYZ, then add a timestamp field to your documents and run a delete-by-query call every day.
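For example, a minimal sketch of such a daily call, assuming an index named my_index and a date field named timestamp (both names are placeholders):

POST /my_index/_delete_by_query
{
  "query": {
    "range": {
      "timestamp": {
        "lt": "now-7d/d"
      }
    }
  }
}

With the index-per-period approach from the first paragraph, the equivalent cleanup is just a single DELETE of that day's index, which is far cheaper.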
Related
I am trying to update the data in my Elasticsearch indices with zero downtime, but I am not sure how to achieve this. Can anyone assist me with how I can do so?
For example: I have an index named my_es_index and I want to update the data in that particular index with zero downtime, so that the old data is still available to anyone running a query while, in parallel, we update the data in that index in the background.
Is this possible to achieve? If yes, please help me with how to proceed.
You build/create another index (call it the new index), then switch from the old index to the new one, then delete the old index.
Read more at https://medium.com/craftsmenltd/rebuild-elasticsearch-index-without-downtime-168363829ea4
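In REST terms that boils down to something like the sketch below, assuming clients query through an alias (my_es_alias and my_es_index_v2 are placeholder names) rather than hitting my_es_index directly:

POST /_reindex
{
  "source": { "index": "my_es_index" },
  "dest": { "index": "my_es_index_v2" }
}

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my_es_index", "alias": "my_es_alias" } },
    { "add": { "index": "my_es_index_v2", "alias": "my_es_alias" } }
  ]
}

DELETE /my_es_index

Because both alias actions happen in a single _aliases call, the switch is atomic and searches against the alias never see a gap.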
Unless you need to change the mapping of an existing field and preserving the field's name is required, I don't think taking the cluster down is needed.
While the above article is a good read and might be treated as best practice, ES is quite flexible: unlike a rigid SQL schema, it lets you add fields to and update existing documents in place.
Adding a new field
Let's call the new field to be added x; a sketch of the corresponding requests follows the list below.
add a mapping for x to the index.
make the code changes such that, going forward, all new documents have this new field x.
while new documents now carry the field x, write a script which updates the older documents and adds this field x.
once you are sure that all the documents have the field x, you may enable the feature you added this field for.
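A rough sketch of the mapping change and the backfill script, with placeholder names and values (my_index, the keyword type and some_default are all illustrative, and a recent single-type index is assumed):

PUT /my_index/_mapping
{
  "properties": {
    "x": { "type": "keyword" }
  }
}

POST /my_index/_update_by_query?conflicts=proceed
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "x" } }
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.x = 'some_default'"
  }
}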
Updating mapping of a field
Let's again call the field to be updated x (assuming the exact name of the field is not the prime concern); a sketch of the requests follows the list below.
create a new field, say new_x (add correct mapping to the index).
follow the above steps to ensure new_x has the data (with the slight change that we need to keep both x and new_x populated during the transition).
once all the documents in the index have the field new_x, simply refactor the code to use new_x instead of x.
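Again as a rough sketch with placeholder names (new_x is mapped as date here purely for illustration):

PUT /my_index/_mapping
{
  "properties": {
    "new_x": { "type": "date" }
  }
}

POST /my_index/_update_by_query?conflicts=proceed
{
  "query": {
    "bool": {
      "must": { "exists": { "field": "x" } },
      "must_not": { "exists": { "field": "new_x" } }
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.new_x = ctx._source.x"
  }
}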
While one might argue that the above two approaches are somewhat hacky, they save you the time, effort and cost of building a new index and managing the aliases.
I am facing an issue with the UpdateByQuery API while trying to update a document which doesn't exist in Elasticsearch.
Problem description
We create one index per day, like test_index-2020.03.11, test_index-2020.03.12, …, and we keep eight days of indexes (today's as well as the previous seven days).
When data arrives (read one by one or in bulk from a Kafka topic), we either need to update the document if one already exists with the given ID (it may live in any of the 8 daily indexes), or save it to the current day's index if it does not.
The solution I am currently trying when data arrives one by one (a rough sketch of the request follows the list below):
Using UpdateByQuery with an inline script to update the doc
If the BulkByScrollResponse returns an updated count of 0, then save the doc
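For reference, this is roughly the REST equivalent of what the update step does (the status field in the script is only illustrative; the ID is one of ours):

POST /test_index-*/_update_by_query?conflicts=proceed
{
  "query": {
    "ids": { "values": ["str1:str2:Used:Sat Mar 14 23:34:39 IST 2020"] }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.status = params.status",
    "params": { "status": "Used" }
  }
}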
Issues:
Even if the doc doesn't exist, I can see the BulkByScrollResponse return a non-zero updated field (1, 2, 3, 4, …), as follows:
BulkIndexByScrollResponse[sliceId=null,updated=1,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s]
Because of this, I am unable to trigger the document save request.
How should I approach this if a bulk of documents (with a set of different doc IDs) needs to be updated with their respective content in a single request? Will I be able to achieve that with UpdateByQuery?
Note: considering the amount of data to be processed per hour, we need to avoid multiple hits to Elasticsearch.
The doc ID is in the format:
str1:str2:Used:Sat Mar 14 23:34:39 IST 2020
But even if the doc doesn't exist, I can still see a non-zero updated count.
A couple more points about the approach I am trying:
- In my case there is always only one doc to be updated per request, as I am updating the doc matching the given ID.
- We have configured shards and replicas as
"number_of_shards": 10,
"number_of_replicas": 1
- We are going with this approach because we don't know which index the actual doc resides in.
If at most one document matches, then the updated field of the response should not be more than 1.
The following are a couple of the outputs I get as part of the response:
BulkIndexByScrollResponse[sliceId=null,updated=9,created=0,deleted=0,batches=1,versionConflicts=1,noops=0,retries=0,throttledUntil=0s]
BulkIndexByScrollResponse[sliceId=null,updated=10,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s]
I am looking at sending my app logs to Elastic (6.x) via Filebeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need to add the date to the index name. The stated reason is that when the time comes to delete old data, it is easier to delete an entire index by date than to delete individual documents. Is this true?
If I should follow this recommendation of adding the date to the index name, what additional things do I need to do to ensure seamless querying? By this I mean querying, especially in Kibana, e.g. over the past day, which would need to look at today's index as well as yesterday's.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp, i.e. setting it up so that I do not see or have to deal with the date-named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect anything more than a GB of data a day, and no more than a couple of hundred thousand documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate to cross a GB a day.
How many teams are using the same cluster apart from yours, if any
The cluster would be used (actually queried) by just my team. We are about 5 right now, but I don't see more than 10 users (and that's not concurrent, just over a day or month)
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions
Hardware details
I've not worked this out with management. For the most part I expect 3 nodes. Also, this is not critical, i.e. if we lose all of our logs for some reason, I will not lose sleep over it.
First of all, you need to take a step back and work out whether you really need multiple indexes or a single one (where you filter documents at query time using a date field for a particular date).
Some questions you should answer before taking such a decision:
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours if there is any
Usage patterns, type of queries used etc.
Hardware details
Advantages
In a way, having multiple indexes (with the date as part of the index name) would be more beneficial:
You can delete the old indexes without affecting new ones.
If you have to change the mapping, you can do so in the new index without affecting the old ones, with comparatively little overhead. With a single index you would have to reindex all the documents, which takes a lot more time if the data set is large. And if this keeps happening every now and then, you would need a process for running such operations at times of minimal usage, which can hurt productivity.
searching across multiple indexes is still convenient.
not really sure, but scaling is arguably easier with multiple indexes.
Disadvantages are:
Additional shards are created for each and every index, which can waste some storage space.
Overhead for the monitoring/operations team to maintain multiple indexes.
At times it can lead to over-creation of indexes.
If there are no mapping changes and document insertion is low (in the hundreds or a few hundreds), it'd be better to use a single index.
The only correct way to figure out what is best is to have a cluster that closely resembles the production one, with data that also resembles production, try various configurations, and see which solution fits best.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp, i.e. setting it up so that I do not see or have to deal with the date-named indexes?
Yes, there is. If you have indexes with names like logs-0001, logs-0002, you can use logs-* as the index name when you query.
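For example, a last-day query across all of them could look like the sketch below (assuming the usual @timestamp field that Logstash/Filebeat write):

GET /logs-*/_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1d/d" }
    }
  }
}

In Kibana the same idea applies: define logs-* as the index pattern and every date-stamped index behind it is queried transparently.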
Including the date in an index name is a very common pattern implemented by many Elasticsearch users. It helps with archiving/purging old indices, as you mentioned. You don't need to do anything additional to be able to query. Set up your index base name as an index pattern, e.g. logstash-*, and you can query that index pattern in Kibana.
Here is the problem
I have about 1 million records in my indexes. There is a property, aging, in the documents which increases daily. Every night a scheduler runs, calculates the aging from the current date and the created date in each document, and updates the index.
The problem is that as the data grows, the bulk update leads to 'GC overhead limit exceeded'. I added a pause between updates, but that did not help.
Now I am researching using a Groovy script with update_by_query.
I want to ask if there is any other way to maintain the age, e.g. in Jira the overdue age increases every day. Or do I have to fetch, visit and update the documents?
Every time the bulk request runs I can see Elasticsearch throttling: 'now throttling indexing: numMergesInFlight=5, maxNumMerges=4'. I have read about this but am not sure what to do. I think there should be another approach to calculating aging, because as the data grows this problem will persist.
In the end I want a query like 'give me all docs whose aging is 100' or 'give me all documents whose aging > 100'.
The answer was simple. I was thinking about it the wrong way around.
If a query is 'get all docs where aging > 2', it means I need all docs that were created more than two days ago. Simply convert the '2' into a date relative to the current date and use a range query; that solves the problem.
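A sketch of that, assuming the creation date is stored in a field called created_date (a placeholder name):

GET /my_index/_search
{
  "query": {
    "range": {
      "created_date": { "lt": "now-2d/d" }
    }
  }
}

No nightly bulk update of an aging field is needed; the age is computed implicitly by the range at query time.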
I'm using the jdbc river to sync Elasticsearch with a database. The known problem is that rows deleted from the database remain in ES; the jdbc river plugin doesn't solve that. The author of the jdbc river suggested a way of solving the problem:
A good method would be windowed indexing. Each timeframe (maybe once per day or per week) a new index is created for the river, and added to an alias. Old indices are to be dropped after a while. This maintenance is similar to logstash indexing, but it is outside the scope of a river.
My question is, what does that mean precisely?
Let's say I have a table in the database called table1 with a million rows; my attempt is as follows:
Create a river called river1, with index1. index1 contains the indexed rows of table1. index1 is added to an alias.
Some rows from table1 are deleted during the day, so every night I create another river called river2, with index2, which contains only what is now present in table1.
Remove the old index1 from the alias and add index2 to the alias (see the sketch after this list).
Delete old index1.
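Concretely, I assume the alias switch and the cleanup would be something like this, with an alias name of my choosing (table1_alias is a placeholder):

POST /_aliases
{
  "actions": [
    { "remove": { "index": "index1", "alias": "table1_alias" } },
    { "add": { "index": "index2", "alias": "table1_alias" } }
  ]
}

DELETE /index1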
Is that the right way?
How about using the _ttl field? Define a static _ttl in the SQL statement that is longer than the SQL update frequency.
The SQL would be something like this when the river is scheduled to run more often than once an hour:
"select '1h' as _ttl, some_id as _id, ..."
This way the _ttl gets refreshed each time the river runs, but deleted rows no longer get refreshed and are removed from ES when their _ttl expires.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
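Note that this only works on versions where _ttl still exists (it was removed in 5.0), and _ttl has to be enabled in the type mapping first, roughly like this (index and type names are placeholders):

PUT /my_index
{
  "mappings": {
    "my_type": {
      "_ttl": { "enabled": true }
    }
  }
}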
Yes, it can be done using the _ttl field, but I solved it using scripts.
Every night a script starts indexing the table and creates an index for that day. Indexing can last a few hours.
Another script periodically reads the output of localhost:9200/_river/jdbc/*/_state?pretty and checks whether all rivers are finished (by checking for the existence of the lastEndDate field). When all rivers are finished, the alias is switched to the newly created index and the old index is dropped.
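For reference, the state check is simply a call along these lines, looking for lastEndDate in each river's entry:

GET /_river/jdbc/*/_state?pretty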