Elasticsearch .tasks index usage

The reindex operation in Elasticsearch creates an entry in the ".tasks" index.
The following is an excerpt from the docs:
If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.
However, if I disable dynamic index creation by using this API/property, then the .tasks index is not created during reindex and the operation still succeeds.
My questions are:
Will this affect the normal processing of Elasticsearch (especially the reindex operation)?
Did this ".tasks" index exist in versions before 6.6?

The .tasks index has existed since at least ES 5.0, and its purpose is to let you manage your long-running tasks instead of letting them run and finish without seeing their outcome.
Normal processing is in no way affected by this; that index is just a container for task outcomes that you can consult at your leisure. However, if you decide to store task outcomes in there, it's your job to keep that index clean, i.e. ES will not delete task documents from that index.

Related

Elasticsearch count of searches against an index resets to zero after cluster restart

We use Elasticsearch - one cluster is 7.16 and another is 8.4. Behavior is the same in both.
We need to be able to get a count of search queries run against an index since the index's creation.
We retrieve the number of searches that have been run against a given index by using the _stats endpoint like this:
GET /_stats?filter_path=indices.my_index.primaries.search.query_total
The problem is that this stat resets to zero after a cluster reboot. Does this data persist anywhere for a given index such that I can get the total since inception of the index? If not, is there an action I can take to somehow record that stat before a reboot so I can always access the full total number?
EDIT - this is the only item I was able to find on this subject, and the answer in this discussion does not look promising: https://discuss.elastic.co/t/why-close-reopen-index-will-reset-index-stats-to-zero/170830

As far as I know, there is no out-of-the-box solution for your use case, but it's not that hard to build it yourself either. You can simply call the same _stats API periodically and store the result in another Elasticsearch index or in a database, so that it survives the reset. IMHO it's not that much work.
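A rough sketch of that idea in Python, using only the standard library. The cluster URL is an assumption, and where you persist the running state (a file, another index, a DB row) is left open. The interesting part is update_lifetime_total, which treats any reading lower than the previous one as a counter reset.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without auth


def fetch_query_total(index, base_url=ES_URL):
    """Read primaries.search.query_total for one index from the _stats API."""
    url = (f"{base_url}/_stats"
           f"?filter_path=indices.{index}.primaries.search.query_total")
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return body["indices"][index]["primaries"]["search"]["query_total"]


def update_lifetime_total(state, current):
    """Fold a fresh counter reading into a running lifetime total.

    state holds "lifetime" (total since tracking began) and "last"
    (the previous raw reading). If the new reading is lower than the
    previous one, the counter was reset, so the whole reading counts
    as new searches.
    """
    last = state.get("last", 0)
    delta = current - last if current >= last else current
    return {"lifetime": state.get("lifetime", 0) + delta, "last": current}
```

You would call fetch_query_total from cron (or any scheduler) and store the returned state between runs. Searches that happen between the last poll and a restart are still lost, so the polling interval bounds the error.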

Periodically remove documents from Elasticsearch index depending on field

Let's say I have an index called car. The documents in car have the following fields:
constructionYear
seats
decommissioned
…
Now I want to periodically delete all documents where decommissioned is true.
Is there a way to configure such a job on the Elasticsearch server? Or do I have to perform a REST call every time I want to clean up the index?

You'd need to build a delete-by-query request to manage this, and then schedule it outside of Elasticsearch to run every so often. There's no built-in scheduler in Elasticsearch to do this.
However, to the point of Yuri's comment above, why not just leave them? You can still run analytics on the data.
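For illustration, a minimal sketch of such a job in Python (standard library only). The cluster URL is an assumption, and the index and field names are taken from the question. It posts a term query on decommissioned to the _delete_by_query endpoint.

```python
import json
import urllib.request
from typing import Tuple

ES_URL = "http://localhost:9200"  # assumption: local cluster without auth


def build_delete_request(index: str,
                         field: str = "decommissioned") -> Tuple[str, dict]:
    """Return the path and body for deleting all docs where field is true."""
    path = f"/{index}/_delete_by_query"
    body = {"query": {"term": {field: True}}}
    return path, body


def delete_decommissioned(index: str, base_url: str = ES_URL) -> dict:
    """Send the delete-by-query request and return the parsed response."""
    path, body = build_delete_request(index)
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling delete_decommissioned("car") from a cron entry (or whatever scheduler you already run) would cover the "periodically" part.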

Actually, you can use a Watcher for this purpose.
It's not what Watcher was made for, yet you can set up a Webhook Action there to go through whatever your Search input returns and make a REST call to delete the unwanted docs by ID.
That way you'll be able to keep it all within your Elastic cluster.
P.S. It does make sense to rethink your "data model" a bit, though.
Elasticsearch is not your regular RDBMS, and selective deletes can get VERY expensive.
It's better to leave the documents there and simply modify your queries to take that attribute into account.

Timeout Response for Failed Elasticsearch Re-index

I'm using the reindex API of Elasticsearch to move documents from one index (named index1) to another index (named index2).
My problem arises when index1 is too big, so that a timeout response comes back from Elasticsearch. There is another query (GET _tasks?detailed=true&actions=*reindex) which shows the reindex in progress, but I can't figure out how to see the errors, if any occur during the reindex, or why my reindex task fails.
One possible solution that I don't like is to increase the response timeout of Elasticsearch. Is there any way to see the errors without increasing the timeout?

What I usually do is to launch the reindex with ?wait_for_completion=false, so that a background task gets created. The reindex call will return almost immediately and tell you the ID of the task that was created.
You can then check the status of the task, either through the Task API (GET _tasks/<taskId>) or by fetching the task document directly:
GET .tasks/task/<taskId>
Even when the reindex is done, the task will stay in the index and you can check the errors if any.
It is your responsibility, though, to delete that document using:
DELETE .tasks/task/<taskId>
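Here is a hedged sketch of that flow in Python, standard library only. The cluster URL and polling interval are assumptions. It polls the Task API (GET _tasks/<taskId>) rather than reading the .tasks document, and it assumes the finished task comes back with "completed" plus either a "response" (which may contain "failures") or an "error" field.

```python
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without auth


def _request(method, path, body=None, base_url=ES_URL):
    """Tiny JSON-over-HTTP helper."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def start_reindex(source, dest):
    """Launch the reindex in the background and return the task ID."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    resp = _request("POST", "/_reindex?wait_for_completion=false", body)
    return resp["task"]


def extract_failures(task_doc):
    """Return (completed, failures) from a task status document."""
    completed = task_doc.get("completed", False)
    failures = list(task_doc.get("response", {}).get("failures", []))
    if "error" in task_doc:
        failures.append(task_doc["error"])
    return completed, failures


def wait_for_task(task_id, poll_seconds=10):
    """Poll until the task completes, then return its list of failures."""
    while True:
        completed, failures = extract_failures(
            _request("GET", f"/_tasks/{task_id}"))
        if completed:
            return failures
        time.sleep(poll_seconds)
```

Typical use would be wait_for_task(start_reindex("index1", "index2")), followed by deleting the task document once you have looked at the result.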

Elasticsearch Reindexing race condition

Hello elasticsearch users/experts,
I have a bit of trouble understanding the race condition problem with the reindex API of Elasticsearch and would like to hear whether anyone has found a solution for it.
I have searched a lot of places and could not find any clear solution (most of the solutions date back to before the reindex api).
As you might know, the (now) standard way of reindexing a document (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We then create a new index called "new_index" with the new mapping, we call the reindex api to reindex the documents from 'old_index' to 'new_index' and then switch the alias to point to the new_index (and remove the alias pointer to old_index). It seems this is the standard way of reindexing and that is what I have seen on almost all recent websites I visited.
My questions about this method are the following, given that I do not want any downtime (users should still be able to search documents) and I still want to be able to index documents into Elasticsearch while the reindexing process is running:
1. If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested in the old index (to be able to search for it while the reindexing process is working) but still would be correctly reindexed to the new index?
2. If a document is modified in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would ElasticSearch ensure that this modification is also taken account in the new index?
3. (Similar to 2.) If a record is deleted in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would ElasticSearch ensure that this removal is also taken account in the new index?
Basically in a scenario where it is not affordable to make any indexing mistake for a document, how would one be able to proceed to make sure the reindexing goes without any of the above problems?
Has anyone any idea? And if there is no solution without any downtime, how would we proceed with the least amount of downtime in that case?
Thanks in advance!

Apologies if it's too verbose, but my two cents:
"If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested in the old index (to be able to search for it while the reindexing process is working) but still would be correctly reindexed to the new index?"
While reindexing is running from source to destination, the alias would (and must) still point to source_index. All modifications to this index happen independently of the reindex, and these updates/deletes take effect in source_index immediately.
Let's say the state of source_index changes from t to t+1.
If you ran a reindexing job at t to dest_index, it would still consume the snapshot of source_index as of t. You need to run the reindexing job again to have the latest data of source_index, i.e. the data at t+1, in your dest_index.
Ingestion into source_index and copying from source_index to dest_index are independent processes.
A reindexing job does not by itself guarantee consistency between source_index and dest_index.
"If a document is modified in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would ElasticSearch ensure that this modification is also taken account in the new index?"
It won't be taken into account in the new index, as reindexing works from the snapshot of source_index at time t.
You would need to perform the reindexing again. The general approach for this is to have a scheduler that reruns the reindexing process every few hours.
You can apply updates/deletes to source_index every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).
However, a full reindex (from source_index to dest_index) is an expensive process, so schedule it only once or twice a day.
"(Similar to 2.) If a record is deleted in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would ElasticSearch ensure that this removal is also taken account in the new index?"
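Again, you need to run a new job/reindexing process. Note that reindex only creates or updates documents in dest_index; it does not remove documents there that were deleted from source_index, so those deletes have to be applied to dest_index separately.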
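Once a last pass has caught up, the alias switch described in the question can be sent as a single _aliases request, so searches never see the alias pointing at neither index. A minimal sketch of building that request body in Python; the alias name "my_alias" in the usage note is hypothetical, and the index names follow the question.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases: remove and add in the same request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def alias_swap_json(alias, old_index, new_index):
    """Same body, serialized and ready to send as the request payload."""
    return json.dumps(build_alias_swap(alias, old_index, new_index))
```

For example, build_alias_swap("my_alias", "old_index", "new_index") would be posted to /_aliases. Writes that arrive between the last reindex pass and the swap are still not covered by this, which is the race the question is about.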
version_type: external
Just as a side note, one interesting thing you can do during reindexing is to make use of version_type: external, which ensures that only the updated/missing documents from source_index are reindexed into dest_index.
You can refer to this LINK for more info.
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}
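If you rerun this as a periodic catch-up pass, documents that are already up to date in dest_index show up as version conflicts, and by default a version conflict aborts the reindex. A sketch of a request body that keeps going instead, using "conflicts": "proceed" (that addition is mine, not part of the request above), plus a small helper to read the counters from the response:

```python
def build_catch_up_reindex(source, dest):
    """Body for POST /_reindex that only creates/updates what is stale."""
    return {
        "conflicts": "proceed",  # count version conflicts instead of aborting
        "source": {"index": source},
        "dest": {"index": dest, "version_type": "external"},
    }


def summarize_reindex(response):
    """Pick the counters of interest out of a reindex response."""
    keys = ("created", "updated", "version_conflicts")
    return {key: response.get(key, 0) for key in keys}
```

On a pass where nothing changed in source_index, you would expect created and updated to be 0 and version_conflicts to match the number of documents.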

Reindex all of ElasticSearch with Curator?

Is there a recipe out there to reindex all Elasticsearch indices with Curator?
I can see that it can reindex a set of indices into one (the daily-to-monthly use case); however, I don't see anything that suggests it could easily apply a new mapping file to every Elasticsearch index.
I'm guessing I'll need to write a wrapper script around Curator to grab index names and feed them into Curator.

I don't know if I understood you correctly, as you mentioned both reindexing and mapping changes...
If you want to set/update a mapping in a collection of indices, and you know the indices to update by name (or pattern), you can apply the same mapping or mapping change to all of them at once with https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html#_multi_index_2
For reindexing, there is no way to specify multiple source/target pairs in one request (although a single reindex can split one index into many). But as you suggested, you can make successive calls to the reindex API, one per index.
BTW: The reindex API copies neither the settings nor the mappings from the source into the destination index. You need to handle that yourself, maybe using https://www.elastic.co/guide/en/elasticsearch/reference/6.4/indices-templates.html
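A wrapper script along those lines could look roughly like this in Python (standard library only, talking to the REST API directly rather than through Curator). The cluster URL and the "-reindexed" suffix are assumptions, and it presumes an index template or pre-created destination indices already carry the new mapping.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without auth


def list_indices(base_url=ES_URL):
    """Fetch all index names via the _cat/indices API."""
    url = f"{base_url}/_cat/indices?format=json&h=index"
    with urllib.request.urlopen(url) as resp:
        return [row["index"] for row in json.load(resp)]


def plan_reindex(index_names, suffix="-reindexed"):
    """Build one (path, body) reindex request per user index."""
    plan = []
    for name in sorted(index_names):
        if name.startswith("."):
            continue  # skip dot-prefixed indices such as .tasks
        body = {"source": {"index": name}, "dest": {"index": name + suffix}}
        plan.append(("/_reindex?wait_for_completion=false", body))
    return plan
```

Each entry of plan_reindex(list_indices()) would then be POSTed in turn; with wait_for_completion=false every call returns a task ID you can follow up on.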
