How to handle errors with bulk requests - elasticsearch

I am using Elasticsearch bulk API to send a lot of documents to index and delete at once. If there is an error for one document, other documents will be indexed or deleted successfully. And this leads to wrong state of data in elasticstore because in my case documents are kind of related to each other. I mean if one document's field has some value then there are other documents which should also have same value for that field. I am not sure how I can handle such errors from Bulk requests. Is it possible to rollback a request in any way? I read similar questions but could not get solution on handling such cases. Or instead of rollback, is there any way to send data only if there is no error? or something like dry run of request possible?

I'm late to the question but will answer for whoever runs across a similar scenario in the future.
After executing the Elasticsearch (ES) bulk API aka BulkRequest, you get a BulkResponse in return which consists of one or more BulkItemResponse. BulkItemResponse has a method isFailed() which will tell you if that action failed or not. In your case, you can traverse all the items in the response if there are failures and handle failed responses as per your requirement.
The code will look something like this for Synchronous execution:
val bulkResponse: BulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
bulkResponse.iterator.asScala
.filter(_.isFailed)
.foreach(item => { // your logic to handle failures })
For Asynchronous execution, you can provide a listener which will be called after the execution is completed. You have to override onResponse() and onFailure() in this case. You can read more about it at https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-document-bulk.html
HTH.

The solution shared above to use BulkResponse output is basically to handle next batch requests. What if I want to break the batch processing at the position where any request failed in the batch. We are sending bulk events which are related to each other. Example of my issue: Batch(E1- E10), if batch fails at E5. I don't want E6-E10 to process because they are related. I want immediate response in that case.

Related

Possible to implement a "job" pattern with GraphQL

Is there a reasonable way to implement a job-based query paradigm in GraphQL?
In particular, something like the following:
Caller submits a search request
Backend returns a job ID
Caller receives status updates on the job as it runs
Caller separately can retrieve pages of data from the job results
I guess the problem I see here is that we are splitting up the process into two steps: One is making the request and the second is retrieving data. As a result, the fields requested in the first request do not correspond with what is returned (just a job ID). And similarly, a call to retrieve results has the same issue.
Subscriptions don't really solve this problem either, I don't believe. They might help with requesting data that might take a long time to return I think, but that isn't quite the same as a job-based API.
Maybe this is a niche use case, and I have no doubt that it wasn't what GraphQL was initially built to solve. But, I'm just wondering if this is something doable, or if this is more of trying to fit a square peg into a round hole.

GraphQL Asynchronous query results

I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial. You return a 202 and return a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not solved easily in a spec-compliant way. Apollo introduced the idea of a #defer directive that does pretty much what you're looking for but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes couple of minutes which makes it an asynchronous operation. I created a mutation which submitted the job to the backend processing system and returned a job ID. Then I periodically poll the jobs field using a query to find out about the state of the job and eventually the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (similar approach Github uses).
Polling for actual results is working as expected but I guess this might be better solved by subscriptions.

Sending apollo request outside batch

I have an apollo frontend with batch requests set up. However there are certain requests that shouldn't be included in the batch:
A component depends on a "small" version of a request to load
The "full" request should happen at the same time, to be entered into the cache for later use
If the small and full request are sent in the same batch it doesn't return until the full one is finished, which takes too long.
I've thought of two non-ideal solutions:
Start the full request once the small one is finished, using onCompleted. Not ideal because for speed I'd like to start the two simultaneously
Set up two backend endpoints, one with batching and one without, and use split to direct requests where appropriate. Would work but I'd like to get away without an extra endpoint
Any ideas?
EDIT: I've realised that the first solution is no good because it can cause other unrelated queries to be delayed - so the only option so far is the last solution.
I am not an expert on the topic but it seems that batchKey option in apllo-link-batch-http is what you are looking for. The easiest would be to for example prefix your operations with a keyword:
const link = BatchHttpLink({
batchKey: operation =>
operation.name && operation.name.value.startsWith('eager_') ? 'eager' : 'normal'
});

How to retry indexing with elasticsearch-py when using bulk streaming?

I have occasional BulkIndexError when using streaming_bulk helper. Is there any way to configure client to retry on such errors? What is the best way to handle errors when using helpers?
Well, you could set up your streaming pipeline in a way, so as to retry on errors (I believe, this will be a BulkIndexError).
The response from streaming_bulk is a tuple that looks like ok, item [see this]. Now, if you wrap the request to streaming_bulk in a try, and in your except, not empty out your list of actions, you could have this try-except block in an infinite loop, and break out when your list of actions is empty.

Posting an update request to ElasticSearch without waiting for completion

I have an ElasticSearch index that stores files, sometimes very large ones. Because the underlying Lucene engine is actually doing a complete replacement each time a document is updated, even if I am just modifying the value of one field, the entire document needs to be updated behind the scenes.
For large, multi-MB files this can take a fairly long time (several hundred ms). Since this is done as part of a web application this is not really acceptable. What I am doing right now is forking the process, so the update is called on a separate thread while the request finishes.
This works, but I'm not really happy with this as a long term solution, partially because it means that every time I create a new interface to the search engine I'll have to recode the forking logic. Also it means I basically can't know whether the request is successful or not, or if some kind of error occurred, without writing additional code to log successful or unsuccessful requests somewhere.
So I'm wondering if there is an unknown feature where you can post an UPDATE request to ElasticSearch, and have them return an acknowledgement without waiting for the update task to actually complete.
If you look at the documentation for Snapshot and Restore you'll see when you make a request you can add wait_for_completion=true in order to have the entire process run before receiving the result.
What I want is the reverse — the ability to add ?wait_for_completion=false to a POST request.

Resources