How does Elasticsearch store indexes? - elasticsearch

The Elasticsearch indexes keep getting bigger and bigger, and then on some days the indexes are small. On the days the indexes are small, no machine is down; everything is the same as on the days the indexes are big.
I noticed that Elasticsearch still stores documents from previous days in the indexes.
Is it possible that Elasticsearch is piling up documents from previous days into the current day's index? How does Elasticsearch store documents in indexes?
We had to decrease the number of days the indexes are kept, since on some days one index is 2x the size of another.
Thanks

#maryf there are two possibilities here if these are date-based indexes:
The log shipper is not persisting its registry, which contains info about which log files have been harvested and up to what offset.
The index is defined to use an incorrect timestamp field for its timeline.
In the first case, whenever your log shipper restarts, it will start reading log files from the beginning and you will see duplicate records in your index. In the second case, logs are stored in an index based on the timestamp field being used; if the timestamp is from an older date, the document will be stored in the older index matching that date.
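One way to see which days' events actually sit in a given daily index is a query along these lines, which counts the documents whose timestamp falls outside the index's own day. This is only a sketch: the index name and the @timestamp field are examples, so adjust them to your naming scheme.
# Documents in the 2024.05.10 index whose @timestamp is NOT from that
# day suggest an unexpected timestamp field or value is being used.
GET logstash-2024.05.10/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "range": {
          "@timestamp": {
            "gte": "2024-05-10T00:00:00Z",
            "lt": "2024-05-11T00:00:00Z"
          }
        }
      }
    }
  }
}
The response's hits.total tells you how many documents in that index carry timestamps from other days.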

Related

Elasticsearch - reindex millions of records based on their date

We have millions of records in a date-based index. In the same index there are thousands of records belonging to different dates, and we want to reindex all of those records into an index based on their own day of the month and year (e.g. 2020-08-01, 2020-08-02, etc.), deleting each record from the source index after it has been reindexed successfully.
I wrote a small script query to get all records which do not belong to the current date and started reindexing using the reindex API, but after a few hundred API calls ES throws a "too many requests" error.
I tried to reindex using the "wait_for_completion: false" parameter too.
I am not able to group records by date, as a date_histogram gives only counts, and I can't use the bulk API because the destination for every record can be different and I can't extract the destination value for each record.
What would be the best way to solve this issue?
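For reference, one day's slice of the job described above might look like the sketch below. The index names, the date field, and the dates are placeholders; requests_per_second is an optional throttle that can help avoid the 429 "too many requests" rejections when many reindex calls run back to back.
# Copy one day's worth of documents into its own daily index,
# throttled so the cluster isn't flooded with bulk requests.
POST _reindex?wait_for_completion=false&requests_per_second=500
{
  "source": {
    "index": "big-index",
    "query": {
      "range": {
        "date": {
          "gte": "2020-08-01",
          "lt": "2020-08-02"
        }
      }
    }
  },
  "dest": {
    "index": "big-index-2020-08-01"
  }
}
Once a day's copy has been verified, a _delete_by_query with the same range query is one option for the cleanup step mentioned above, run once per day rather than once per record.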

Logstash Output Index Yearly Rotation

Quick question related to Logstash and Elasticsearch:
If I use the following output index option:
index => "myindex-%{+MM}"
Will this overwrite the oldest index once it reaches a year or just add to it?
Logstash will never delete existing indexes, so when the year wraps around, the oldest index will simply be used again.
Nothing will get overwritten AS LONG AS the document IDs of new documents are always different from the IDs of the documents already existing in that index.
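As a rough illustration of that last point (the index name and IDs here are made up): once the month wraps around and the same index name is reused, only explicitly supplied IDs can replace existing documents, while auto-generated IDs just append.
# Explicit ID: indexing into the reused index with an ID that already
# exists replaces the old document (the response shows "result": "updated").
PUT myindex-01/_doc/1
{ "message": "overwrites any existing doc with _id 1" }

# No ID: Elasticsearch auto-generates one, so the document is simply
# added alongside last year's data in the same index.
POST myindex-01/_doc
{ "message": "appended with a new auto-generated _id" }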

Reasons & Consequences of putting a Date in Elastic Index Name

I am looking at sending my app logs to Elastic (6.x) via Filebeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need to add the date to the index name. The stated reason for doing so is that when the time comes to delete old data, it is easier to delete an entire index by date rather than individual documents. Is this true?
If I should be following this recommendation of adding the date to the index name, I'm curious what additional things I need to do to ensure seamless querying. By this I mean querying, especially in Kibana, e.g. over the past day, which would need to look at today's index as well as yesterday's index.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp, i.e. setting it up so that I do not see or have to deal with the date-named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect anything more than a GB of data a day, and no more than a couple of hundred thousand documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate to cross a GB a day.
How many teams are using the same cluster apart from yours if there is any
The cluster would be used (actually queried) by just my team. We are about 5 right now, but I don't see more than 10 users (and that's not concurrent, just over a day or month)
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions
Hardware details
I've not worked this out with management. For the most part I expect 3 nodes. Also, this is not critical, i.e. if we lose all of our logs for some reason, I would not lose sleep over it.
First of all, you need to take a step back and understand whether you really need multiple indexes or a single one (where you filter documents at query time using a date field for a particular date).
Some questions you must answer before you make such a decision:
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours if there is any
Usage patterns, type of queries used etc.
Hardware details
Advantages
In a way, having multiple indexes (with a date as part of the index name) would be more beneficial.
You can delete the old indexes without affecting new ones.
In case you have to change the mapping, you can do so with the new index without affecting the old ones (see the template sketch after this answer). This is comparatively little overhead, while for a single index you would have to reindex all the documents, which would take a lot more time if the size is pretty huge. And if this keeps happening every now and then, you would need to come up with a solution where you execute such operations at times of minimal usage. That means it can harm productivity.
Searching across multiple indexes is still convenient.
Not really sure, but scaling is easier with multiple indexes.
Disadvantages are:
Additional shards are created for each and every index, which can waste some storage space.
Overhead for the monitoring/operations team to maintain multiple indexes.
At times this can lead to over-creation of indexes.
If there are no mapping changes and only a small number of document insertions (in the hundreds or a few hundreds), it'd be better to use a single index.
The only way, and the only correct way, to figure out what's best is to have a cluster that closely resembles the production one, with data that also resembles production, try various configurations, and see which solution fits best.
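To illustrate the mapping point from the list above: with daily indices you can change the mapping going forward by updating an index template, so only indices created from then on pick up the change, while existing daily indices stay untouched. This is a minimal sketch assuming 6.x-style daily indices named applogs-*; the template name and the field are made up.
# Legacy template API (6.x). New applogs-* indices created after this
# call get the updated mapping; already existing indices are not modified.
# "doc" is the single mapping type, as 6.x still requires one.
PUT _template/applogs
{
  "index_patterns": ["applogs-*"],
  "mappings": {
    "doc": {
      "properties": {
        "client_ip": { "type": "ip" }
      }
    }
  }
}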
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Yes, there is. If you have indexes with names like logs-0001 and logs-0002, you can use logs-* as the index name when you query.
Including a date in the index name is a very common approach implemented by many Elasticsearch users. It helps with archiving/purging old indices, as you mentioned. You don't need to do anything additional to be able to query. Set up your index base name as an index pattern for your indices, for example logstash-*, and you can query on that particular index pattern in Kibana.
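For example, with daily indices a single wildcard search covers today's and yesterday's data, which is also what a Kibana index pattern such as logstash-* does behind the scenes. This sketch assumes the usual Logstash defaults (logstash-* naming and an @timestamp field); adjust if yours differ.
# Searches every index matching logstash-*; documents from the last day
# are matched regardless of which daily index they live in.
GET logstash-*/_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1d/d" }
    }
  }
}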

Is it possible to limit a size of an Elasticsearch index?

I have an Elasticsearch instance for indexing log records. Naturally the data grows over time and I would like to limit its size (to about 10 GB), something like a MongoDB capped collection.
I'm not interested in old log records anyway.
I haven't found any config option for this, and I'm not sure that I can just remove data files.
Any suggestions?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning, for each day or each week you create an index. Index everything belonging to that day/week in that index.
You decide how many days you want to keep around and stick to that number. Let's say the data for 7 days amounts to 10 GB. On the 8th day you create the new index, as usual, and then you delete the index from 8 days before.
At any time you'll have 7 indices in your cluster.
Using ttl as the other poster suggested is not recommended, because it is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks every indices.ttl.interval (60 seconds by default) for expired documents, creates bulk requests out of them, and deletes them. This means unnecessary requests coming to the cluster.
Instead, deleting an index is very easy and quick.
Take a look at this and how to easily manage time based indices with Curator.
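For completeness, dropping a whole day's index is a single call; a rough sketch with made-up index names:
# List the time-based indices with their sizes, sorted by name.
GET _cat/indices/logs-*?v&s=index

# Delete the index that has aged out of the retention window. This
# removes its shards entirely, which is far cheaper than deleting
# individual documents.
DELETE logs-2015.03.12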
From what I remember, a capped collection in MongoDB is just a circular-buffer type of collection that removes the oldest entries when there's no more room. Unfortunately there's nothing like this out of the box in Elasticsearch; you have to add this functionality yourself, either by removing single documents (or batches of documents) using ES's API, or, more performantly, by retiring whole indices as described in their documentation under "retiring data".
You can provide a per-index/type default _ttl (time to live) value as follows:
{
  "tweet": {
    "_ttl": { "enabled": true, "default": "1d" }
  }
}
You will find more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain

Delete old indices elasticsearch?

I am using the ELK stack for analyzing logs. As per the default configuration, a new index named "logstash-YYYY-MM-DD" is created by ES.
So if I have configured logstash to read like this:
/var/log/rsyslog/**/2014-12-0[1-7]/auditd.log
So it is reading old logs, but the index created will be named "logstash-2015-03-20", and this index will have documents (logs) from previous dates.
My problem occurs when I have to delete indexes: I have to keep only the last week's data and purge the older indices. But when I delete all index names except the last 7 days, I have no way to track which days' logs are kept under which index name. E.g. the 2014-12-07 logs may be kept in an index named logstash-2015-03-19 or logstash-2015-03-20.
So how should I delete indexes?
Log messages are stored in indexes based on the value of the @timestamp field (which uses UTC time). If your 2014-12-07 logs end up in a 2015-03-19 index, the timestamp parsing isn't being done correctly.
Correct the problem by adding a grok and/or date filter, and your 2014-12-07 logs will end up in the logstash-2014.12.07 index, making it trivial to clean up old logs.
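If you first want to see where an older day's logs actually ended up, something along these lines reports, per index, how many documents carry that day's timestamp. This is a sketch assuming the default logstash-* naming and @timestamp field.
# Count 2014-12-07 documents, broken down by the index they live in.
GET logstash-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2014-12-07T00:00:00Z",
        "lt": "2014-12-08T00:00:00Z"
      }
    }
  },
  "aggs": {
    "by_index": {
      "terms": { "field": "_index" }
    }
  }
}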
