Are ElasticSearch scripts safe for concurrency issues? - elasticsearch

I'm running a process that updates user documents in Elasticsearch. This process can run on multiple instances on different machines. If two instances try to run a script updating the same document at the same time, can some of the data be lost because of a race condition? Or is the internal script mechanism safe (using the version property for optimistic locking, or some other way)?
See the official ES scripting documentation.

Using the version attribute is safe for that kind of job.
Do the search with version: true:
GET /index/type/_search
{
  "version": true,
  your_query...
}
Then for the update, add a version attribute corresponding to the number returned during the search.
POST /index/type/the_id_to_update/_update?version=3   // <- version returned by the search
{
  "doc": {
    "ok": "name"
  }
}
https://www.elastic.co/guide/en/elasticsearch/guide/current/version-control.html
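The search-then-update flow above amounts to optimistic locking with a retry loop: if another writer bumps the version between your search and your update, the update fails with a 409 conflict and you re-fetch and try again. A minimal, client-agnostic sketch of that loop (the fetch/update/modify callables are hypothetical stand-ins for the real ES requests, not part of any client API):

```python
def update_with_retry(fetch, update, modify, max_retries=3):
    """Optimistic-locking loop around a versioned Elasticsearch update.

    fetch()              -> (doc, version), e.g. a search with "version": true
    update(doc, version) -> True on success, False on a 409 version conflict,
                            e.g. POST .../_update?version=N
    modify(doc)          -> applies your change to the fetched document
    """
    for _ in range(max_retries):
        doc, version = fetch()
        if update(modify(doc), version):
            return True
    return False  # still conflicting after max_retries attempts
```

Because every update carries the version it was based on, two concurrent writers cannot silently overwrite each other: the loser of the race gets a conflict and retries against the fresh document.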


Old elastic search data from controller despite checking repo and calling refresh

A background process checks the state of Elasticsearch and sends an update if it has changed, e.g.
while (true) {
    if (repo.get(id).status.equals(newStatus)) {
        repo.refresh();
        tellClientToUpdate();
    }
}
Yet, when the client fetches the data from a Spring Data controller afterwards, it receives the old data, from the same repository. After a wait of approximately 200 milliseconds, it returns the new data. The Elasticsearch executable is version 6.2; the client is version 5.6.11.
How can this be fixed (or even debugged)?

View Completed Elasticsearch Tasks

I am trying to run daily tasks using Elasticsearch's update by query API. I can find currently running tasks but need a way to view all tasks, including completed.
I've reviewed the ES docs for the Update By Query API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query
And the Task API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html#tasks
The Task API shows how to get the status of a currently running task with GET _tasks/[taskId], or all running tasks with GET _tasks. But I need to see a history of all tasks that have run.
How do I see a list of all completed tasks?
A bit late to the party, but you can check the tasks in the .tasks system index.
You can query this index like any other regular index.
For the updateByQuery task you can do:
curl -XPOST -sS "elastic:9200/.tasks/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "filter": [
        { "term": { "completed": true } },
        { "term": { "task.action": "indices:data/write/update/byquery" } }
      ]
    }
  }
}'
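The same filter can also be built programmatically. A small sketch in Python that only constructs the request body (the action string matches the curl example above; send the result to /.tasks/_search with any HTTP client):

```python
def completed_tasks_query(action="indices:data/write/update/byquery"):
    """Build a .tasks search body for completed tasks of one action type."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # only tasks that have finished running
                    {"term": {"completed": True}},
                    # only tasks of the given action type
                    {"term": {"task.action": action}},
                ]
            }
        }
    }
```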
From the documentation:
The Task Management API is new and should still be considered a beta feature. It allows you to retrieve information about the tasks currently executing on one or more nodes in the cluster.
Since it is still in beta, you are currently only able to do the following:
- Retrieve information for a particular task with GET /_tasks/<task_id>, where you can also use the detailed request parameter to get more information about the running tasks (other supported parameters work as well).
- List tasks with GET _cat/tasks, the _cat version of the list tasks command, which accepts the same arguments as the standard list tasks command.
- Cancel a long-running task (if it supports cancellation) with the cancel tasks API, e.g. POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel.
- Group tasks with GET _tasks?group_by=parents.

Elasticsearch Delete by Query Version Conflict

I am using Elasticsearch version 5.6.10. I have a query that deletes records for a given agency, so they can later be updated by a nightly script.
The query is in elasticsearch-dsl and looks like this:
def remove_employees_from_search(jurisdiction_slug, year):
    s = EmployeeDocument.search()
    s = s.filter('term', year=year)
    s = s.query('nested', path='jurisdiction', query=Q("term", **{'jurisdiction.slug': jurisdiction_slug}))
    response = s.delete()
    return response
The problem is that I am getting a ConflictError exception when trying to delete the records via that function. I have read that this occurs because the documents changed between the time the delete process started and when it executed. But I don't see how that can be, because nothing else is modifying the records during the delete process.
I am going to add s = s.params(conflicts='proceed') in order to silence the exception. But this is a band-aid, as I do not understand why the delete is not processing as expected. Any ideas on how to troubleshoot this? A snapshot of the error is below:
ConflictError: TransportError(409,
u'{
  "took": 10,
  "timed_out": false,
  "total": 55,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 55,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681043",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681043]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    },
    {
      "index": "employees",
      "type": "employee_document",
      "id": "24681063",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[employee_document][24681063]: version conflict, current version [5] is different than the one provided [4]",
        "index_uuid": "G1QPF-wcRUOCLhubdSpqYQ",
        "shard": "0",
        "index": "employees"
      },
      "status": 409
    },
    ...
You could try making it do a refresh first:
client.indices.refresh(index='your-index')
Source: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_indices_refresh
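For reference, the question's dsl query corresponds roughly to the raw _delete_by_query body below. This sketch only builds the request dict (field names are taken from the question's code); conflicts and refresh go in the URL as query parameters, e.g. POST /employees/_delete_by_query?conflicts=proceed:

```python
def delete_by_query_body(jurisdiction_slug, year):
    """Request body equivalent to the elasticsearch-dsl query in the question."""
    return {
        "query": {
            "bool": {
                # non-scoring filter on the year field
                "filter": [{"term": {"year": year}}],
                # nested query against the jurisdiction.slug sub-document
                "must": [
                    {
                        "nested": {
                            "path": "jurisdiction",
                            "query": {"term": {"jurisdiction.slug": jurisdiction_slug}},
                        }
                    }
                ],
            }
        }
    }
```

With conflicts=proceed in the URL, documents whose version changed mid-delete are skipped and counted in version_conflicts instead of aborting the whole request.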
First, this question was asked 2 years ago, so take my response with a grain of salt due to the time gap.
I am using the JavaScript API, but I would bet the flags are similar. When you index or delete, there is a refresh flag which lets you force the index to make the result visible to search.
I am not an Elasticsearch guru, but the engine must perform some systematic maintenance on the indices and shards to move them to a stable state. This is probably done over time, so you would not necessarily see an immediate state update. Furthermore, from personal experience, I have seen deletes that do not immediately remove the item from the index: the document may be marked as "deleted" and given a new version number, yet it seems to stick around (probably until general maintenance sweeps run).
Here I am showing the js API for delete, but it is the same for index and some of the other calls.
client.delete({
  id: string,
  index: string,
  type: string,
  wait_for_active_shards: string,
  refresh: 'true' | 'false' | 'wait_for',
  routing: string,
  timeout: string,
  if_seq_no: number,
  if_primary_term: number,
  version: number,
  version_type: 'internal' | 'external' | 'external_gte' | 'force'
})
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#_delete
refresh
'true' | 'false' | 'wait_for' - If true then refresh the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false (the default) then do nothing with refreshes.
For additional reference, here is the page on Elasticsearch refresh info and what might be a fairly relevant blurb for you.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
Use the refresh API to explicitly refresh one or more indices. If the request targets a data stream, it refreshes the stream’s backing indices. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds. You can change this default interval using the index.refresh_interval setting.

Error 429 [type=reduce_search_phase_exception]

I have many languages for my docs and am following this pattern: one index per language. In that article they suggest searching across all indices with the
/blogs-*/post/_count
pattern. In my case I am getting a count across the indices of how many docs I have. My code runs concurrently, so it makes many requests at the same time. If I search
/blogs-en/post/_count
or any other language then all is fine. However if I search
/blogs-*/post/_count
I soon encounter:
"Error 429 (Too Many Requests): [reduce] [type=reduce_search_phase_exception]
"
Is there a workaround for this? The same number of requests is made regardless of whether I use
/blogs-en/post/_count or /blogs-*/post/_count.
I have always used the same number of workers in my code but re-arranging the indices to have one index per language suddenly broke my code.
EDIT: It is a brand new index without any documents when I start the program and when I get the error I have about 5,000 documents so not under any heavy load.
Edit: I am using the mapping found in the above-referenced link and running on a local machine with all the defaults of ES...in my case shards=5 and replicas=1. I am really just following the example from the link.
EDIT: The errors are seen when as few as 13-20 requests are made, and I know ES can handle more than that. Searching /blogs-en/post/_count instead of /blogs-*/post/_count, etc., can easily handle thousands with no errors.
Another edit: I have removed all concurrency but can still only get through 40-50 requests before I hit the error.
I don't get an error for that request, and it returns the total documents.
Is your cluster under load?
Anyway, using a simple aggregation you can get the total document count in hits.total and the per-index document count in the count_per_index part of the result:
GET /blogs-*/post/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "count_per_index": {
      "terms": {
        "field": "_index"
      }
    }
  }
}
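When that response comes back, the per-index counts sit in the standard terms-aggregation bucket shape. A small sketch for extracting them (the response structure below is the stock terms-agg format; the aggregation name mirrors the query above):

```python
def counts_per_index(resp):
    """Extract {index_name: doc_count} from the terms-aggregation response."""
    buckets = resp["aggregations"]["count_per_index"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}
```

One _search call like this replaces N separate _count requests, which sidesteps the request flood that triggered the 429.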

How do I configure elasticsearch to retain documents up to 30 days?

Is there a default data retention period in elasticsearch? If yes can you help me find the configuration?
TTL is no longer supported in Elasticsearch 5.0.0 or greater. The best practice is to create indices periodically (daily is most common) and then delete the index when the data gets old enough.
Here's a reference to how to delete an index (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-delete-index.html)
This article (though it's old enough to reference _ttl) also gives some insight: https://www.elastic.co/blog/using-elasticsearch-and-logstash-to-serve-billions-of-searchable-events-for-customers
As a reminder, it's best to protect your Elasticsearch cluster from the outside world via a proxy and restrict the methods that can be sent to your cluster. This way you can prevent your cluster from being ransomed.
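If you follow the daily-index approach, the deletion step is mostly date arithmetic on index names. A sketch, assuming a hypothetical logs-YYYY.MM.DD naming scheme (the pattern and the helper name are illustrations, not from the answer):

```python
from datetime import datetime, timedelta, timezone

def indices_to_delete(index_names, days=30, now=None):
    """Pick date-suffixed indices (logs-YYYY.MM.DD) older than `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    old = []
    for name in index_names:
        try:
            # parse the date suffix after the first hyphen
            day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d")
        except (IndexError, ValueError):
            continue  # not a date-suffixed index; leave it alone
        if day.replace(tzinfo=timezone.utc) < cutoff:
            old.append(name)
    return old
```

Each returned name can then be removed with DELETE /<index>; schedule the whole thing from cron (or a Lambda) to enforce the 30-day window.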
Yeah you can set a TTL on the data. Take a look here for the configuration options available.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Elasticsearch curator is the tool to use if you want to manage your indexes: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
Here's an example of how to delete indices based on age: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/ex_delete_indices.html
Combine with cron to have this done at regular intervals.
There is no default retention period, but newer versions of Elasticsearch have index lifecycle management (ILM). Among other things, it allows you to delete stale indices to enforce data retention standards; see the ILM documentation.
Simple policy example:
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
If you use OpenSearch in AWS then take a look at this documentation for the same thing.
Pretty old question but I got the same question just now. Maybe it will be helpful for somebody else.
Just FYI if you are using AWS's Elasticsearch service, they have a great doc on Using Curator to Rotate Data which includes some sample python code that can be used in a Lambda function.
