I use Elasticsearch 2.3 with fluentd/logstash and Kibana, and have fluentd/logstash writing to the index logstash-YYYY.MM.dd.
But because mistakes happen and I would like to be able to re-index my data, I would rather write to the alias logstash-YYYY.MM.dd-alias.
This is what I can do:
Create an index via an index template that auto-creates the alias using
{ "aliases": { "{index}-alias": {} } }
(a fuller sketch of the template follows below)
Write to the alias for the rest of the day
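For reference, a minimal sketch of what such a template looks like in ES 2.x; the template name is illustrative, and Elasticsearch substitutes {index} with the concrete index name:

PUT /_template/logstash-alias-template
{
  "template": "logstash-*",
  "aliases": {
    "{index}-alias": {}
  }
}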
The problem is: let's say it is 2016-04-12. On the new day, the index "logstash-2016.04.13" does not exist yet, and hence neither does its alias "logstash-2016.04.13-alias". So instead of creating both, Elasticsearch auto-creates an index named "logstash-2016.04.13-alias" and, via the template, its alias "logstash-2016.04.13-alias-alias".
Possible solution: have a cron job create the next day's index 5 minutes before midnight.
Downside: the cron job's server can go down, timezone issues can occur, and other errors lead to more errors.
Is there any other way to manage writing to an alias using a date format without having to create the index+alias pair beforehand?
Related
I have an Oracle DB. Logstash retrieves data from Oracle and puts it into Elasticsearch.
But when Logstash performs its scheduled export every 5 minutes, Elasticsearch fills up with duplicates because the old data still exists. This is to be expected: the Oracle data barely changes during those 5 minutes; say, 2-3 rows are added and 4-5 deleted.
How can we replace the old data with the new data without duplicates?
For example:
Delete the whole old index;
Create new index with the same name and make the same configuration (nGram configuration and mapping);
Add all new data;
Wait for 5 minutes and repeat.
It's pretty easy: create a new index for each import and apply the mappings, then switch your alias to the most recent index. Remove old indices if needed. Your current data will always be searchable while you index the most recent data.
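A minimal sketch of that atomic alias switch (index and alias names are made up):

POST /_aliases
{
  "actions": [
    { "remove": { "index": "import-2018.01.01", "alias": "import-current" } },
    { "add":    { "index": "import-2018.01.02", "alias": "import-current" } }
  ]
}

Because both actions run in a single request, searches against import-current never see an intermediate state.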
Here are the sources you'll probably need to read:
Use aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) to point to the most current data when searching in Elasticsearch (BTW, it's always a good idea to have aliases in place).
Use the rollover API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) to create a new index for each import run; note the alias handling here too (see the sketch after this list).
Use index templates (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html) to automatically apply the mappings/settings to your newly created indices.
Shrink, close and/or delete old indices so your cluster only handles the data you really need. Have a look at Curator (https://github.com/elastic/curator) as a standalone tool.
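For the rollover step, a sketch assuming a write alias named import-current already points at the newest index (the conditions are illustrative):

POST /import-current/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 1000000
  }
}

If any condition is met, Elasticsearch creates the next index in the series and re-points the alias to it.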
You just need to use a fingerprint/hash of each document, or a hash of the unique fields in each document, as the document id. That way each run overwrites the same documents in place with their updated versions, while still adding new documents as well.
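A sketch of the idea at the REST level, assuming the exporter computes the hash (the index, type, field names and id are made up):

# id = e.g. sha1(primary key + other unique fields), computed by the exporter
PUT /oracle-data/row/6b3f2a9ec41d8f
{
  "id": 42,
  "name": "example row"
}

Re-running the export produces the same ids, so existing rows are overwritten instead of duplicated.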
But this approach will not handle data that is deleted from Oracle.
Is there a recipe out there to reindex all Elasticsearch indices with Curator?
I see that it can reindex a set of indices into one (the daily-to-monthly use case), but I don't see anything that suggests it could easily apply a new mapping file to every index.
I'm guessing I'll need to write a wrapper script around Curator to grab the index names and feed them into Curator.
I don't know if I got you right, as you mentioned both reindexing and mapping changes...
If you want to set/update a mapping across a collection of indices, and you know the indices to update by name (or pattern), you can apply the same mapping or a mapping change to all of them at once with https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html#_multi_index_2
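A sketch of such a multi-index mapping update (index, type and field names are made up; pre-7.0 versions include a mapping type in the URL). Note that this only works for mapping additions; changing an existing field's type still requires a reindex:

PUT /logstash-2018.01.01,logstash-2018.01.02/_mapping/doc
{
  "properties": {
    "message": {
      "type": "text"
    }
  }
}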
For reindexing, there is no way to specify multiple source/target pairs at once, although you can split one index into many. But as you suggested, you can use subsequent calls to the reindex API.
BTW: the reindex API copies neither the settings nor the mappings from the source into the destination index. You need to handle that yourself, for example by using https://www.elastic.co/guide/en/elasticsearch/reference/6.4/indices-templates.html
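A sketch of one such reindex call (index names are made up); a wrapper script would simply loop over the source/destination pairs:

POST /_reindex
{
  "source": { "index": "logstash-2018.01.01" },
  "dest":   { "index": "logstash-2018.01.01-v2" }
}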
I have a data source which will create a high number of entries that I'm planning to store in Elasticsearch.
The source creates two entries for the same document in Elasticsearch:
the 'init' part, which records the init time and other details under a random key in ES
the 'finish' part, which contains the main data and updates (merges into) the initially created document in ES under the init's random key.
I will need to use time-based indices in Elasticsearch, with an alias pointing to the actual index, using the rollover API.
For the updates I'll use the update API to merge init and finish.
Question: if the init document with the random key is not in the current index (but in an older one that has already been rolled over), would an update using its key execute successfully? If not, what is the best practice for performing the update?
After some silence I set out to test it myself.
Short answer: after the index is rolled over under an alias, an update operation through the alias targets the new index only, so it will create the document in the new index, resulting in two separate documents.
One way of solving it is to search across the last 2 (or more, if needed) indices and figure out which concrete (non-alias) index name to use for the update.
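A sketch of that approach (alias, index and key names are made up; the exact update URL depends on your Elasticsearch version):

# 1. Find which concrete index holds the init document; each hit carries _index.
GET /logs-alias/_search
{
  "query": { "ids": { "values": ["the-init-random-key"] } }
}

# 2. Update against the concrete index taken from the hit, not against the alias.
POST /logs-000001/doc/the-init-random-key/_update
{
  "doc": { "finished": true }
}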
The other solution, which I prefer, is to avoid the rollover and instead calculate the index name from the required date field of the document, creating the new index from the application and using a template to define the mapping. This way, event sourcing and replaying the documents in order will yield the same indices.
What if I've changed the mapping for my index and want to reindex?
I'm currently using the Java API, which does not yet have the reindex functionality, so using scroll search + bulk would solve my problem. The solution would look something like this
(ref: How to reindex in ElasticSearch via Java API)
Long time ago
create index MY_INDEX_1
create mapping for MY_INDEX_1
create alias MY_INDEX_1 -> MY_INDEX
create documents in MY_INDEX
Time to reindex!
create index MY_INDEX_2
create mapping for MY_INDEX_2
scroll search + bulk all documents from MY_INDEX_1 to MY_INDEX_2 (see the REST sketch after these steps)
Renaming and deletion of old index
create alias MY_INDEX_2 -> MY_INDEX
delete alias MY_INDEX_1 -> MY_INDEX
delete index MY_INDEX_1
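A REST-level sketch of the scroll + bulk step (the Java API mirrors these calls; sizes, types and ids are illustrative, and the exact scroll syntax varies slightly between versions):

# Open a scroll over the old index.
POST /MY_INDEX_1/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

# Keep paging with the _scroll_id returned by each response.
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll id from the previous response>"
}

# Write every page into the new index, preserving the document ids.
POST /_bulk
{ "index": { "_index": "MY_INDEX_2", "_type": "doc", "_id": "1" } }
{ "field": "value" }

Note that the 'create alias' and 'delete alias' steps can be combined into one atomic _aliases call, as discussed further below.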
But what happens if, while all documents are being reindexed, a document that was reindexed at the beginning is updated by a user?
Or if the same thing happens between reindexing and switching the aliases?
Possible Solutions?
One way would be to use external versioning, so that an update does not overwrite a document that has a higher version (see the sketch after this list)
Or could it be solved in another way?
Or, between switching the aliases and deleting MY_INDEX_1, reindex all documents that have been indexed since the first pass? But even then, a document could be updated between switching the aliases and the second reindexing.
Or should we lock while reindexing? That seems like a bad solution...
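A sketch of the external-versioning idea (index, type, id and version are made up):

PUT /MY_INDEX_2/doc/1?version=42&version_type=external
{
  "field": "value"
}

With version_type=external, the write succeeds only if the supplied version is higher than the stored one, so a reindexed copy can never overwrite a newer user update.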
I think this is your real question:
But what happens if, while all documents are being reindexed, a document that was reindexed at the beginning is updated by a user? Or if the same thing happens between reindexing and switching the aliases?
I just asked a question that is very close, but still has questions that need to be resolved separately. However, my research allows me to answer this question. See the question for details and references.
To answer your question: you create a second alias just before reindexing. I call this a duplicate_write_alias, and your application, if it sees this second alias, writes first to the old and then to the new index via the two aliases (the order is important to cancel out a potential race). When the reindexing is done, your indexing process deletes this duplicate_write_alias and moves your MY_INDEX alias to the new MY_INDEX_2, as noted above. Do the alias switch in one atomic command.
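A sketch of that final atomic switch, assuming duplicate_write_alias currently points at MY_INDEX_2:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "MY_INDEX_2", "alias": "duplicate_write_alias" } },
    { "remove": { "index": "MY_INDEX_1", "alias": "MY_INDEX" } },
    { "add":    { "index": "MY_INDEX_2", "alias": "MY_INDEX" } }
  ]
}

All three actions take effect together, so there is no moment at which writers see both aliases or neither.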
As I noted in my question, you still have to deal with potential 'index does not exist' errors because of a remaining race between your application's checking for existence of the alias and the alias being deleted. I'm hoping there's a better answer than 'always write twice and ignore errors' or 'check and hope for the best'...
I think there is also another (uglier) way:
You can disable write operations on the source index while reindexing (see the settings sketch at the end of this answer). This makes your write APIs temporarily unusable, but you don't have to:
Maintain a second storage to hold the truth
Deal with inconsistency
Flag documents for deletion and delete them after the migration
You can use Elasticsearch's snapshot functionality to create snapshots between indices
You can signal the users of your API to send their changes again later (when the reindexing is done)
Downsides:
You have downtime, at least for write operations
You need more logic to handle errors, e.g. if the index is not switched back to allow-writes mode (automatic recovery etc.)
Holding more than one index uses more storage space.
For more information look here:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules.html
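A sketch of toggling the write block (the index name is made up; index.blocks.write is the relevant setting):

# Block writes on the source index before reindexing.
PUT /MY_INDEX_1/_settings
{ "index.blocks.write": true }

# Re-enable writes if you keep the index around afterwards.
PUT /MY_INDEX_1/_settings
{ "index.blocks.write": false }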
So we are using time-frame indices that are created automatically via an index template.
The ideal situation now would be to have an alias, let's call it 'current', that points to the most recently created index.
The question then is whether there's any way to do this through the index template. I can see that you can specify aliases, but I also want to remove the binding between the previous index and this 'current' alias.
You can use the Curator tool to do it for you.
That tool has an alias command that you can use in conjunction with the indices subcommand in order to remove an alias from your old index.
curator alias --name current --remove indices oldindex
You can combine the above command with a cron job that kicks in around the same time your index template comes into action to create the new time-based index.
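A sketch of such a crontab entry, reusing the command above (the binary path and index-name pattern are assumptions, and % must be escaped as \% inside crontab):

# Shortly after midnight, drop yesterday's index from the 'current' alias.
5 0 * * * /usr/local/bin/curator alias --name current --remove indices logstash-$(date -d yesterday +\%Y.\%m.\%d)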