AWS Elasticsearch - Delete_by_query, how to find task id so can continue with code when delete is done, and when to do wait_for _completion - aws-lambda

I'm using AWS Lambda to do a delete_by_query on an Elasticsearch index so I get rid of everything older than 7 days. That works, but I noticed that the count of the documents is the same before and after, so if I were to run a query in Elasticsearch I may not get correct results until the delete_by_query is completed.
I found this post (python 3.x - Right way to delete and then reindex ES documents - Stack Overflow) that states that it is "best to set wait_for_completion to False. In this case you'll get task details and will be able to track task progress." For one, I haven't found anything that states why this is the case, unless your delete takes 4 hours like that example.
I found code to determine if the delete_by_query is still running at this wonderful site here and tried:
es_client.tasks(detailed=True,actions="*/delete/byquery")
However, I'm getting the message that
'TasksClient' object is not callable.
I am not entirely sure if that is true or not , or if my syntax is incorrect and thus that is why it is not working. It doesn't make sense that I can't programmatically query Tasks with python if I can do it in the console and with curl.
If it is not good to set wait_for_completion to False, and I can't query this with Python, how am I to programmatically get any information about the task or an understanding as to whether I can go ahead with the analytical queries or whatever else I want to do that depends on this task being done?

Okay, I'm not entirely sure why you are getting that error, so I can't help with that in particular. But, I noticed that the python elasticsearch documentation on how to get the task id from the delete_by_query when wait_for_completion is set to false isn't very clear, so I'm going to provide this in case it helps.
from elasticsearch import Elasticsearch
es = Elasticsearch()
response = es.delete_by_query(index=someIndex, body=someQuery, wait_for_completion=False)
# get task id
print(response['task'])
Hope that helps!

Related

409 error when using streaming_bulk() - certain that document is only included once.

I am attempting to upload a large number of documents - about 7 million.
I have created actions for each document to be added and split them up into about 260 files, about 30K documents each.
Here is the format of the actions:
a = someDocument with nesting
esActionFromFile = [{
'_index': 'mt-interval-test-9',
'_type': 'doc',
'_id': 5641254,
'_source': a,
'_op_type': 'create'}]
I have tried using helpers.bulk, helpers.parallel_bulk, and helpers.streaming_bulk and have had partial success using helpers.bulk and helpers.streaming_bulk.
Each time I run a test, I delete, and then recreate the index using:
# Refresh Index
es.indices.delete(index=index, ignore=[400, 404])
es.indices.create(index = index, body = mappings_request_body)
When I am partially successful - many documents are loaded, but eventually I get a 409 version conflict error.
I am aware that there can be version conflicts created when there has not been sufficient time for ES to process the deletion of individual documents after doing a delete by query.
At first, I thought that something similar was happening here. However, I realized that I am often getting the errors from files the first time they have ever been processed (i.e. even if the deletion was causing issues, this particular file had never been loaded, so there couldn't be a conflict).
The _id value I am using is the primary key from the original database where I am extracting the data from - so I am certain they are unique. Furthermore, I have checked whether there was unintentional duplication of records in my actions arrays, or the files I created them from, and there are no duplicates.
I am at a loss to explain why this is happening, and struggling to find a solution to upload my data.
Any assistance would be greatly appreciated!
There should be information attached to the 409 response that should tell you exactly what's going wrong and which document caused it.
Another thing that could cause this would be a retry - when elasticsearch-py cannot connect to the cluster it will resend the request again to a different node. In some complex scenarios it can happen that a request will be thus sent twice. This is especially true if you enabled retry_on_timeout option.

Get status of a task Elasticsearch for a long running update query

Assuming I have a long running update query where I am updating ~200k to 500k, perhaps even more.Why I need to update so many documents is beyond the scope of the question.
Since the client times out (I use the official ES python client), I would like to have a way to check what the status of the bulk update request is, without having to use enormous timeout values.
For a short request, the response of the request can be used, is there a way I can get the response of the request as well or if I can specify a name or id to a request so as to reference it later.
For a request which is running : I can use the tasks API to get the information.
But for other statuses - completed / failed, how do I get it.
If I try to access a task which is already completed, I get resource not found .
P.S. I am using update_by_query for the update
With the task id you can look up the task directly:
GET /_tasks/taskId:1
The advantage of this API is that it integrates with
wait_for_completion=false to transparently return the status of
completed tasks. If the task is completed and
wait_for_completion=false was set on it them it’ll come back with a
results or an error field. The cost of this feature is the document
that wait_for_completion=false creates at .tasks/task/${taskId}. It is
up to you to delete that document.
From here https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query-task-api
My use case went like this, I needed to do an update_by_query and I used painless as the script language. At first I did a reindex (when testing). Then I tried using the update_by_query functionality (they resemble each other a lot). I did a request to the task api (the operation hasn't finished of course) and I saw the task being executed. When it finished I did a query and the data of the fields that I was manipulating had disappeared. The script worked since I used the same script for the reindex api and everything went as it should have. I didn't investigate further because of lack of time, but... yeah, test thoroughly...
I feel GET /_tasks/taskId:1 confusing to understand. It should be
GET http://localhost:9200/_tasks/taskId
A taskId looks something like this NCvmGYS-RsW2X8JxEYumgA:1204320.
Here is my trivial explanation related to this topic.
To check a task, you need to know its taskId.
A task id is a string that consists of node_id, a colon, and a task_sequence_number. An example is taskId = NCvmGYS-RsW2X8JxEYumgA:1204320 where node_id = NCvmGYS-RsW2X8JxEYumgA and task_sequence_number = 1204320. Some people including myself thought taskId = 1204320, but that's not the way how the elasticsearch codebase developers understand it at this moment.
A taskId can be found in two ways.
wait_for_deletion = false. When sending a request to ES, with this parameter, the response will be {"task" : "NCvmGYS-RsW2X8JxEYumgA:1204320"}. Then, you can check a status of that task like this GET http://localhost:9200/_tasks/NCvmGYS-RsW2X8JxEYumgA:1204320
GET http://localhost:9200/_tasks?detailed=false&actions=*/delete/byquery. This example will return you the status of all tasks with action = delete_by_query. If you know there is only one task running on ES, you can find your taskId from the response of all running tasks.
After you know the taskId, you can get the status of a task with this.
GET /_tasks/taskId
Notice you can only check the status of a task when the task is running, or a task is generated with wait_for_deletion == false.
More trivial explanation, wait_for_deletion by default is true. Based on my understanding, tasks with wait_for_deletion = true are "in-memory" only. You can still check the status of a task while it's running. But it's completely gone after it is completed/canceled. Meaning checking the status will return you a 'resouce_not_found_exception'. Tasks with wait_for_deletion = false will be stored in an ES system index .task. You can still check it's status after it finishes. However, you might want to delete this task document from .task index after you are done with it to save some space. The deletion request looks like this
http://localhost:9200/.tasks/task/NCvmGYS-RsW2X8JxEYumgA:1204320
You will receive resouce_not_found_exception if a taskId is not present. (for example, you deleted some task twice, or you are deleting an in-memory task, whose wait_for_deletetion == true).
About this confusing taskId thing, I made a pull request https://github.com/elastic/elasticsearch/pull/31122 to help clarify the Elasticsearch document. Unfortunately, they rejected it. Ugh.

Aerospike 3.6.4 list insert operation giving 'Server error'

I am using the aerospike list operation (golang client) to prepend to an existing key in aerospike using the following command:
client.Operate(c.WritePolicy, aeroKey, aero.ListInsertOp(c.bin, 0,
messages...))
But I am getting "Server error" as response error and no other error details. I already checked that the aeroKey exists and is not nil. Could it be that the aerospike version does not support this operation? Is there a way to confirm this problem or some setting to allow this operation ?
Well, that would be because the list API was added in release 3.7.0.1. Before that lists were a data type without any atomic operations (list-append, etc). Same thing goes for maps, before 3.8.4 they were just a container for map data.
You're running against a version that is two years old. Time to upgrade.

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producers get some data from and external source and writes em inside Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The consumer is based on Spark Streaming and every 5 seconds fetches new messages from Kafka and writes them to ElasticSearch. The problem is I'm not able to write to ES because I get a lot of errors like the one below :
ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]:
Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all
entries [3/26560] (maybe ES was overloaded?). Bailing out... at
org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57)
~[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[na:na] at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[na:na] at
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.scheduler.Task.run(Task.scala:58)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
[spark-core_2.10-1.2.1.jar:1.2.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_65] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages, I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code :
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1 ,AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES : elasticsearch-spark_2.10 with the latest beta version.
I found, after many retries, a way to write to ElasticSearch without getting any error. Basically passing the parameter "es.batch.size.entries" -> "1" to the saveToES method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error considering that I would expect an error message if I'm trying to write more stuff than the allowed max batch size, not less.
Moreover I've noticed that actually I was writing to ES but not all my messages, I was losing between 1 and 3 messages per batch.
When I pushed dataframe to ES on Spark, I had the same error message. Even with "es.batch.size.entries" -> "1" configuration,I had the same error.
Once I increased thread pool in ES, I could figure out this issue.
for example,
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
Like it was already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When written to elastic as part of the same batch produces the error above.
Possible solutions:
Generate unique id for each record. Depending on your use case it can be done in a few different ways. As example, one common solution is to create a new field by combining the id and lastModifiedDate fields and use that field as id when writing to elastic.
Perform de-duplication of records based on id - select only one record with particular id and discard other duplicates. Depending on your use case, this could be the most current record (based on time stamp field), most complete (most of the fields contain data), etc.
The #1 solution will store all records that you receive in the stream.
The #2 solution will store only the unique records for a specific id based on your de-duplication logic. This result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit the performance by writing one record at a time.
One of the possibility is the cluster/shard status being RED. Please address this issue which may be due to unassigned replicas. Once status turned GREEN the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
Result is Elasticsearch receiving multiple updates for a single Document at once - from multiple sources / through multiple nodes / containing different data
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents then:
if you are using a custom routing field (not _id), then according to
the documentation the uniqueness of the documents is not guaranteed.
This might cause issues while updating from spark.
If you are using the standard _id, the uniqueness will be preserved, however you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing

How to Fix Read timed out in Elasticsearch

I used Elasticsearch-1.1.0 to index tweets.
The indexing process is okay.
Then I upgraded the version. Now I use Elasticsearch-1.3.2, and I get this message randomly:
Exception happened: Error raised when there was an exception while talking to ES.
ConnectionError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)) caused by: ReadTimeoutError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)).
Snapshot of the randomness:
Happened --33s-- Happened --27s-- Happened --22s-- Happened --10s-- Happened --39s-- Happened --25s-- Happened --36s-- Happened --38s-- Happened --19s-- Happened --09s-- Happened --33s-- Happened --16s-- Happened
--XXs-- = after XX seconds
Can someone point out on how to fix the Read timed out problem?
Thank you very much.
Its hard to give a direct answer since the error your seeing might be associated with the client you are using. However a solution might be one of the following:
1.Increase the default timeout Globally when you create the ES client by passing the timeout parameter. Example in Python
es = Elasticsearch(timeout=30)
2.Set the timeout per request made by the client. Taken from Elasticsearch Python docs below.
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
The above will give the cluster some extra time to respond
Try this:
es = Elasticsearch(timeout=30, max_retries=10, retry_on_timeout=True)
It might won't fully avoid ReadTimeoutError, but it minimalize them.
Read timeouts can also happen when query size is large. For example, in my case of a pretty large ES index size (> 3M documents), doing a search for a query with 30 words took around 2 seconds, while doing a search for a query with 400 words took over 18 seconds. So for a sufficiently large query even timeout=30 won't save you. An easy solution is to crop the query to the size that can be answered below the timeout.
For what it's worth, I found that this seems to be related to a broken index state.
It's very difficult to reliably recreate this issue, but I've seen it several times; operations run as normal except certain ones which periodically seem to hang ES (specifically refreshing an index it seems).
Deleting an index (curl -XDELETE http://localhost:9200/foo) and reindexing from scratch fixed this for me.
I recommend periodically clearing and reindexing if you see this behaviour.
Increasing various timeout options may immediately resolve issues, but does not address the root cause.
Provided the ElasticSearch service is available and the indexes are healthy, try increasing the the Java minimum and maximum heap sizes: see https://www.elastic.co/guide/en/elasticsearch/reference/current/jvm-options.html .
TL;DR Edit /etc/elasticsearch/jvm.options -Xms1g and -Xmx1g
You also should check if all fine with elastic. Some shard can be unavailable, here is nice doc about possible reasons of unavailable shard https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/

Resources