Logstash Output Index Yearly Rotation - elasticsearch

Quick question related to Logstash and Elasticsearch:
If I use the following output index option:
index => "myindex-%{+MM}"
Will this overwrite the oldest index once it reaches a year or just add to it?

Logstash will never delete existing indexes, so when the year ends, the oldest index will be used again.
Nothing will get overwritten AS LONG AS the document IDs of new documents are always different from the IDs of the documents already existing in that index.
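For illustration, here is a minimal sketch with elasticsearch-py (the client mentioned in a later question; the cluster URL and field names are assumptions, and it is written against the classic client API) showing why auto-generated IDs make the yearly reuse safe:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Mimic the Logstash pattern "myindex-%{+MM}": one index per month name,
# so the same 12 index names get reused every year.
index = "myindex-{:02d}".format(datetime.now(timezone.utc).month)

# Indexing WITHOUT an explicit id makes Elasticsearch auto-generate one,
# so nothing already sitting in last year's reused index is overwritten.
es.index(index=index, body={
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "message": "a new event",
})
```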

Related

How does Elasticsearch store indexes?

The Elasticsearch indexes keep getting bigger and bigger, and then on some days the indexes are small. On the days the indexes are small, no machine is down; everything is the same as on the days the indexes are big.
I noticed that Elasticsearch still stores documents in the indexes from days before.
Is it possible that Elasticsearch is piling up documents from previous days into the current day's index? How does Elasticsearch store documents in indexes?
We had to decrease the number of days the indexes are kept, since on some days one index is 2x the size of another.
Thanks
@maryf there can be two possibilities here if there are date-based indexes:
1. The log shipper is not persisting its registry, which contains info about which log files have been harvested and up to what offset.
2. The index is defined to use an incorrect timestamp field for its timeline.
In the first case, whenever your log shipper restarts it will start reading log files from the beginning, and you will see duplicate records in your index. In the second case, logs are stored in an index based on the timestamp field being used: if the timestamp is from an older date, the log will be stored in the older index matching that date.
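A common fix for the first case is to derive a deterministic document id from the log line itself, so a shipper that re-reads a file cannot create duplicates. A minimal elasticsearch-py sketch (index name, field names, and the hashing scheme are illustrative assumptions, not anything from the original thread):

```python
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# A deterministic _id computed from the raw line: re-harvesting the same
# file re-indexes the same _id, overwriting instead of duplicating.
line = "2014-12-07T10:00:00Z host1 sshd[42]: session opened"
doc_id = hashlib.sha1(line.encode("utf-8")).hexdigest()

es.index(index="logstash-2014.12.07", id=doc_id, body={"message": line})
```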

How to upsert documents in an existing elasticsearch index?

I have an Elasticsearch index which has multiple documents. Now I want to update the index with some new documents, which might also contain duplicates of the existing documents. What's the best way to do this? I'm using elasticsearch-py for all CRUD operations.
Every update in Elasticsearch deletes the old document and creates a new one, because the smallest unit of a document collection in Elasticsearch, the segment, is immutable. When you index a new document or update an existing one, it goes into a new segment, and segments are merged into bigger segments during the merge process.
So even if you have duplicate data, as long as it has the same id it will simply replace the existing document. That is fine, and more performant than first fetching the document, comparing the two to see whether they are duplicates, and then discarding the update/upsert request in the application. Just index whatever is coming; note that if duplicates arrive without ids, ES will insert them again as new documents.
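Since the question mentions elasticsearch-py, a hedged sketch of that approach: bulk-index everything with deterministic ids and let Elasticsearch overwrite the duplicates (index name, id field, and cluster URL are assumptions):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

new_docs = [
    {"id": "doc-1", "title": "updated title"},   # duplicate of an existing doc
    {"id": "doc-42", "title": "brand new doc"},  # genuinely new doc
]

# "index" with an explicit _id behaves as upsert-by-overwrite: an existing
# document with the same _id is replaced, a new _id is inserted.
actions = (
    {"_op_type": "index", "_index": "myindex", "_id": d["id"], "_source": d}
    for d in new_docs
)
helpers.bulk(es, actions)
```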

Ways to only process new(index after last run) data in Elasticsearch?

Is there a way to get the date and time that an Elasticsearch document was written?
I am running ES queries via Spark and would prefer NOT to look through all documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
What is the best most efficient way to do this?
I have looked at:
updating to add a field with an array of booleans recording whether each analytic has looked at the document. The negative is waiting for the update to occur.
an index-per-time-frame method, which would break the current indexes down into smaller ones, e.g. by hour. The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6
I posted the question on the elasticsearch discussion board and it appears using the ingest pipeline is the best option.
I am running ES queries via Spark and would prefer NOT to look through all documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
A workaround could be:
While inserting data into Elasticsearch using Logstash, Logstash appends a @timestamp key to each document which represents the time (in UTC) at which the document was created, or we can use an ingest pipeline.
After that we can query based on the timestamp.
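A hedged sketch of both pieces with elasticsearch-py (pipeline id, field name, and index name are assumptions; on 5.x clients you would also pass doc_type): an ingest pipeline that stamps each document with its ingest time, and a range query selecting only documents stamped after the last run:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Ingest pipeline that stamps every document with the ingest time.
es.ingest.put_pipeline(id="stamp-ingest-time", body={
    "processors": [
        {"set": {"field": "indexed_at", "value": "{{_ingest.timestamp}}"}}
    ]
})

# Index through the pipeline, then query only what arrived since last run.
es.index(index="myindex", pipeline="stamp-ingest-time", body={"message": "hi"})

last_run = "2018-01-01T00:00:00Z"  # persisted by the Spark job between runs
resp = es.search(index="myindex", body={
    "query": {"range": {"indexed_at": {"gt": last_run}}}
})
```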
For more on this please have a look at: Mapping changes
There is no way to ask ES to insert a timestamp at index time
Elasticsearch doesn't have such functionality.
You need to manually save a date with each document. In this case you will be able to search by date range.
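A minimal sketch of that manual approach (field and index names assumed):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

doc = {"message": "some event"}
doc["created_at"] = datetime.now(timezone.utc).isoformat()  # saved manually
es.index(index="myindex", body=doc)

# Later: fetch only documents written in a given date range.
es.search(index="myindex", body={
    "query": {"range": {"created_at": {"gte": "2018-01-01", "lt": "2018-02-01"}}}
})
```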

Updating existing documents in ElasticSearch (ES) while using rollover API

I have a data source which will create a high number of entries that I'm planning to store in ElasticSearch.
The source creates two entries for the same document in ElasticSearch:
the 'init' part which records init-time and other details under a random key in ES
the 'finish' part which contains the main data, and updates the initially created document (merges) in ES under the init's random key.
I will need to use time-based indexes in Elasticsearch, with an alias pointing to the actual index, using the rollover index.
For updates I'll use the update API to merge init and finish.
Question: if the init document with the random key is not in the current index (but in an older one that has already rolled over), would updating it using its key successfully execute? If not, what is the best practice to perform the update?
After some quietness I've set out to test it.
Short answer: After the index is rolled over under an alias, an update operation using the alias refers to the new index only, so it will create the document in the new index, resulting in two separate documents.
One way of solving it is to perform a search in the last 2 (or more if needed) indexes and figure out which non-alias index name to use for the update.
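A sketch of that first workaround with elasticsearch-py (index pattern, key, and payload are assumptions; older 5.x clients also need doc_type): search across the rolled-over indexes for the document's key, then update it in the concrete index the hit came from:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

key = "some-random-init-key"
finish_part = {"status": "finished", "duration_ms": 1234}

# Find which concrete (non-alias) index holds the init document.
resp = es.search(index="myindex-*", body={
    "query": {"ids": {"values": [key]}}, "size": 1
})
hits = resp["hits"]["hits"]
if hits:
    # Update in the exact index the document lives in, not via the alias.
    es.update(index=hits[0]["_index"], id=key, body={"doc": finish_part})
```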
The other solution, which I prefer, is to avoid using rollover and instead calculate the index name from the required date field of our document, creating new indexes from the application and using a template to define the mapping. This way event sourcing and replaying the documents in order will yield the same indexes.
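And a sketch of that preferred approach (field names and naming scheme are assumptions): derive the index name from the document's own date field and upsert there, so the init and finish parts always land in the same index:

```python
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def index_for(doc):
    # The index name is a pure function of the document's date field,
    # so replaying events in order always targets the same index.
    day = datetime.strptime(doc["event_date"][:10], "%Y-%m-%d")
    return "myindex-" + day.strftime("%Y.%m.%d")

doc = {"event_date": "2018-06-01T12:00:00Z", "status": "finished"}

# doc_as_upsert merges into the existing init document, or creates it.
es.update(index=index_for(doc), id="some-random-init-key",
          body={"doc": doc, "doc_as_upsert": True})
```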

Delete old indices elasticsearch?

I am using the ELK stack for analyzing logs. As per the default configuration, a new index named "logstash-YYYY-MM-DD" is created by ES.
So if I have configured logstash to read like this:
/var/log/rsyslog/**/2014-12-0[1-7]/auditd.log
So it is reading old logs, and the index name created will be "logstash-2015-03-20", so this index will have documents (logs) from previous dates.
My problem occurs when I have to delete indexes, keeping only the last week's data and purging the older indices. When I delete all index names except the last 7 days', I have no way to track which days' logs are kept in which index name. E.g., the 2014-12-07 logs may be kept in either of the indexes named logstash-2015-03-19 or logstash-2015-03-20.
So how shall I delete indexes??
Log messages are stored into indexes based on the value of the @timestamp field (which uses UTC time). If your 2014-12-07 logs end up in logstash-2015-03-19, the timestamp parsing isn't being done correctly.
Correct the problem by adding a grok and/or date filter and your 2014-12-07 logs will end up in the logstash-2014.12.07 index and it'll be trivial to clean up old logs.
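Once the index names line up with the log dates, the cleanup is a short script. A hedged elasticsearch-py sketch (the 7-day retention window and index pattern are assumptions):

```python
from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

cutoff = datetime.now(timezone.utc) - timedelta(days=7)

# Delete every logstash-YYYY.MM.dd index older than the cutoff.
for name in es.indices.get(index="logstash-*"):
    try:
        day = datetime.strptime(name, "logstash-%Y.%m.%d")
    except ValueError:
        continue  # skip indexes that don't match the daily naming pattern
    if day.replace(tzinfo=timezone.utc) < cutoff:
        es.indices.delete(index=name)
```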