I'm new to Elasticsearch.
I would like to set an index lifecycle policy (from hot to warm), time based.
I'm using Java and Spring Boot to store data.
So my questions are:
1. Can I set the lifecycle policy to read from my custom date key? If so, how do I do it? Does the key need to be in a specific format?
2. If 1 is not possible, is there a way to set the @timestamp field manually? If we set a key with this format, will it do the trick?
3. If 1 and 2 are not possible, that means all rollovers have to be done programmatically. Does anyone have a good example? Or should I just use simple select, insert and delete?
Thanks!
I'm not exactly sure what your question is. Anyway, I will try to answer as I understood it.
A lifecycle policy can only be based on the date the index was created, that's all.
It's index creation time only.
You can configure rollover to happen in the hot phase automatically based on the age, size or doc count of the index.
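As a minimal sketch (the policy name, node attribute and thresholds below are made up, adjust to your cluster), a time-based hot-to-warm policy looks roughly like this; note that it keys off index age, not a date field inside your documents:

PUT _ilm/policy/my-hot-warm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      }
    }
  }
}

The allocate step assumes your warm nodes are tagged with node.attr.data: warm. Because the policy has a rollover action, min_age is measured from rollover, so an index moves to the warm hardware roughly 7 days after it stops being written to.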
Env Details:
Elasticsearch version 7.8.1
The routing param is optional in index settings.
As per the Elasticsearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have landed in the same scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we need to remove the custom routing now.
Which means the docId will now be used as the default routing param. This is creating duplicate records with the same id across different shards during index operations. Earlier (before removing the custom routing) it resulted in an update of the record (expected).
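To illustrate the duplicate scenario (index name, id and routing values are hypothetical): if the same id was originally indexed with a custom routing and is later indexed again without one, the two writes can hash to different shards and both copies survive:

PUT my-index/_doc/1?routing=customerA
{ "customerId": "customerA", "note": "old copy written with custom routing" }

PUT my-index/_doc/1
{ "customerId": "customerA", "note": "new copy written with default (_id) routing" }

A search for that id can now return two documents, while a GET without routing only checks the shard the _id hashes to.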
I am thinking of the following approaches to come out of this; please advise if you have a better approach to suggest. The key here is to AVOID DOWNTIME.
Approach 1:
As we receive an update request, let the duplicate record get created. Once the record without custom routing is created, issue a delete request for the record with the custom routing.
CONS: If a record never receives an update, it will linger around with the custom routing. We want to avoid this as it might result in unforeseen scenarios in the future.
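For Approach 1, note that the cleanup delete has to carry the old routing value explicitly, otherwise it will be routed by _id and miss the old copy (names and values here are hypothetical):

DELETE my-index/_doc/1?routing=customerA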
Approach 2:
We use the Reindex API to migrate data to a new index (turning off custom routing during migration). The application will use the new index after a successful migration.
CONS: Some of our indexes are huge; they take 12 hrs+ for the reindex operation, and the Elasticsearch Reindex API will not migrate newer records created during that 12 hr window, since it works from a snapshot of the source. This forces a downtime approach.
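A rough sketch of the reindex call for Approach 2 (index names are hypothetical); setting dest.routing to "discard" drops the stored custom routing while copying:

POST _reindex
{
  "source": { "index": "my-index-v1" },
  "dest": {
    "index": "my-index-v2",
    "routing": "discard"
  }
}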
Please suggest an alternative if you have faced this before.
Thanks @Val. Also found a few other approaches, like writing to both indexes and reading from the old one, then shifting reads to the new one after re-indexing is finished. Something along the following lines -
1. Create aliases pointing to the old indices (*_v1)
2. Point the application to these aliases instead of the actual indices
3. Create new indices (*_v2) with the same mapping
4. Move data from the old indices to the new ones using re-indexing, and make sure we don't retain the custom routing during this
5. Post re-indexing, change the aliases to point to the new indices instead of the old ones (need to verify this, but there are easy alternatives if it doesn't work; see the alias-swap sketch below)
6. Once verification is done, delete the old indices
What do we do in the transition period (the window between reindexing start and finish)?
Write to both indices (old and new) and read from the old indices via the aliases.
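For step 5, the alias flip can be done atomically with the _aliases API once reindexing and verification are done (alias and index names below are made up):

POST _aliases
{
  "actions": [
    { "remove": { "index": "orders_v1", "alias": "orders" } },
    { "add": { "index": "orders_v2", "alias": "orders" } }
  ]
}

Because both actions run in a single request, readers never see a moment where the alias points at neither index.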
I'm looking to set up my index so that it is partitioned into daily sub-indices whose settings I can adjust individually depending on the age of that index, i.e. indices >= 30 days old should be moved to slower hardware etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join the dots on is how to set up the original index to be partitioned by day. When adding data or querying, do I need to specify the individual daily indices, or is there something in Elasticsearch that will do this for me? If the latter, how does it work for adding and querying (assuming they are different)? How does it determine which partitions are relevant for a query, or which partition a document should be added to? (I'm assuming there is a timestamp field, but I can't see from the docs how it's all linked together.)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
There's no such thing as sub-indices or partitions in Elasticsearch. If you want to use ILM, which you should, then you are using aliases and multiple indices.
You will need to upgrade from 7.7 - which is EOL - and use the default distribution to get access to ILM as well.
Getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. But to your questions:
The major assumption of ILM is that the data being ingested is current, so at a rough level, data from today will end up in an index from today.
If you are indexing historic data, then you may want to put it into "traditional" index names, e.g. logs-2021.08.09, and then attach them to the ILM policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
When querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. It does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html
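To join the dots a bit: in an ILM setup the application never addresses daily indices directly; it writes to a single alias and the backing indices roll over behind it. A minimal sketch on a recent default distribution (template, policy and alias names here are invented):

PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "my-hot-warm-policy",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}

PUT logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}

The application indexes into the "logs" alias, ILM creates logs-000002, logs-000003, ... as rollover conditions are met, and searches against "logs" or "logs-*" fan out to whichever backing indices match the query.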
I'm using the new enrich API of Elasticsearch (ver 7.11).
To my understanding, I need to execute the policy ("PUT /_enrich/policy/my-policy/_execute") each time the source index changes, which leads to the creation of a new .enrich index.
Is there an option to make this happen automatically and avoid index creation on every change of the source index?
This is not (yet) supported and there have been other reports of similar needs.
It seems to be complex to provide the ability to regularly update an enrich index based on a changing source index and the issue above explains why.
That feature might be available some day, something seems to be in the works. I agree it would be super useful.
You can add a default pipeline to your index. That pipeline will process the documents.
See here.
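A rough sketch of that approach (field names and the pipeline name are hypothetical, and my-policy is the enrich policy from the question): a pipeline with an enrich processor is attached to the index as its default pipeline, so every incoming document goes through it without the client asking for it. Note that the processor still reads from the .enrich index built by the last _execute, so re-executing the policy is still needed when the source index changes.

PUT _ingest/pipeline/my-enrich-pipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "my-policy",
        "field": "email",
        "target_field": "user_info"
      }
    }
  ]
}

PUT my-index/_settings
{
  "index.default_pipeline": "my-enrich-pipeline"
}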
I have followed this tutorial for crawling content with StormCrawler and then storing it in Elasticsearch: https://www.youtube.com/watch?v=KTerugU12TY. However, I would like to add to every document the date it was crawled. Can anyone tell me how this can be done?
In general, how can I change the fields of the crawled content?
Thanks in advance
One option would be to create an ingest pipeline in Elasticsearch to populate a date field, as described here. Alternatively, you'd have to write a bespoke parse filter to put the date in the metadata and then index it using indexer.md.mapping in the configuration.
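For the ingest pipeline option, a minimal sketch could look like this (pipeline, field and index names are made up; point the settings call at whatever index your indexer bolt writes to). The set processor stamps each document with the time it was ingested, which here stands in for the crawl date:

PUT _ingest/pipeline/add-crawl-date
{
  "processors": [
    {
      "set": {
        "field": "crawl_date",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}

PUT content/_settings
{
  "index.default_pipeline": "add-crawl-date"
}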
It would probably be useful to make this operation simpler; please feel free to open an issue on GitHub (or even better, contribute some code) so that the ES indexer could check the configuration for a field name indicating where to store the current date, e.g. es.now.field.
How can I find out how long Elasticsearch stores indexes?
For what period, i.e. from which date until now?
Is it in the elasticsearch.yml config, or do I need something else?
Edit:
No, I don't want to delete indices, I want to know from which date I have indices.
Use Cerebro (formerly Kopf) or the management view in Kibana (DevTools) for manual operations and peeking around.
Elasticsearch does not expire indices on its own. An index is deleted or rebuilt with REST commands, so only on demand. This is typically scripted to delete by time filters; for example, Curator can do that.
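To see from which date you have indices, the creation date of every index is stored in its settings and can be listed with the _cat API, for example:

GET _cat/indices?v&h=index,creation.date.string&s=creation.date

A single index's creation date (as epoch milliseconds) also shows up under index.creation_date in GET <index>/_settings.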