How to retry indexing with elasticsearch-py when using bulk streaming?

I get an occasional BulkIndexError when using the streaming_bulk helper. Is there any way to configure the client to retry on such errors? What is the best way to handle errors when using the helpers?

Well, you could set up your streaming pipeline so that it retries on errors (which, I believe, will surface as a BulkIndexError).
streaming_bulk yields a tuple of the form (ok, item) for every action (see the helpers documentation). If you wrap the call to streaming_bulk in a try and, in your except, do not empty out your list of actions, you can put that try/except block in a loop and break out when the list of actions is empty. See the sketch below.
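A minimal sketch of that loop, assuming a hypothetical index name and synthetic actions. (Note that streaming_bulk itself also accepts max_retries/initial_backoff, but those only retry documents rejected with a 429; the loop below handles any per-document failure.)

import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

client = Elasticsearch("http://localhost:9200")

# Hypothetical actions; real code would build these from your documents.
actions = [{"_index": "my-index", "_id": i, "value": i} for i in range(1000)]

for attempt in range(5):  # cap the retries so the loop cannot spin forever
    if not actions:
        break
    # raise_on_error=False reports per-document failures as (False, item)
    # tuples instead of raising one BulkIndexError at the end of a chunk.
    results = streaming_bulk(client, actions, raise_on_error=False)
    # Results come back one per action, in order, so zipping lets us keep
    # only the failed actions and feed them into the next attempt.
    actions = [action for action, (ok, _) in zip(actions, results) if not ok]
    if actions:
        time.sleep(2 ** attempt)  # simple exponential backoff between attempts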

Related

Flink Streaming JdbcSink Exception Handling

I am using JdbcSink to insert processed events into a Postgres DB.
Occasionally I receive bad records from the source stream, and the insert into the database fails with a java.sql.BatchUpdateException because they don't satisfy some table constraints.
I can obviously pass the events through a Flink filter operator to filter them out, but the filter would turn into complex code checking every possible combination of failure. Instead of a filter, I would like to catch the BatchUpdateException thrown by the JdbcSink, log it, and continue processing other events.
I've had no luck finding a way to catch the BatchUpdateException from the JdbcSink.
Has anyone tried something similar with success?
I took a quick look at the code and didn't see an obvious solution. You could extend the JdbcOutputFormat class and override its attemptFlush method, then clone the JdbcSink class and modify your version to use your output format class.
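Very roughly, the override could look like the sketch below. This is untested, and every name should be checked against your Flink version: in 1.11/1.12 the batching class is called JdbcBatchingOutputFormat (it was later renamed), its constructor shape differs across releases, and the assumption here is only that it exposes a protected attemptFlush() that throws SQLException.

import java.sql.BatchUpdateException;
import java.sql.SQLException;

import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.internal.JdbcBatchingOutputFormat;
import org.apache.flink.connector.jdbc.internal.connection.JdbcConnectionProvider;
import org.apache.flink.connector.jdbc.internal.executor.JdbcBatchStatementExecutor;
import org.apache.flink.types.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative subclass: log and drop a batch that violates table
// constraints instead of letting BatchUpdateException fail the job.
public class TolerantJdbcOutputFormat
        extends JdbcBatchingOutputFormat<Row, Row, JdbcBatchStatementExecutor<Row>> {

    private static final Logger LOG =
            LoggerFactory.getLogger(TolerantJdbcOutputFormat.class);

    public TolerantJdbcOutputFormat(
            JdbcConnectionProvider connectionProvider,
            JdbcExecutionOptions executionOptions,
            StatementExecutorFactory<JdbcBatchStatementExecutor<Row>> executorFactory,
            RecordExtractor<Row, Row> recordExtractor) {
        super(connectionProvider, executionOptions, executorFactory, recordExtractor);
    }

    @Override
    protected void attemptFlush() throws SQLException {
        try {
            super.attemptFlush();
        } catch (BatchUpdateException e) {
            // Swallow the bad batch: log it (or route the records to a
            // dead-letter table) and let the stream keep going.
            LOG.warn("Dropping batch that failed constraint checks", e);
        }
    }
}

A cloned JdbcSink would then be changed to construct this class in place of the stock output format.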

GraphQL Asynchronous query results

I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial: you return a 202 with a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not solved easily in a spec-compliant way. Apollo introduced the idea of a @defer directive that does pretty much what you're looking for, but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
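As a sketch of what that looks like under Apollo's proposal (field names here are made up):

query GetReport {
  report(id: "123") {
    title          # resolved and returned in the initial response
    rows @defer    # left null initially, delivered later as a patch
  }
}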
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes a couple of minutes, which makes it an asynchronous operation. I created a mutation which submitted the job to the backend processing system and returned a job ID. Then I periodically poll a jobs field using a query to find out the state of the job and, eventually, the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (a similar approach to the one GitHub uses). See the schema sketch below.
Polling for the actual results works as expected, but I suspect this might be better solved by subscriptions.
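A hypothetical schema for that submit-then-poll pattern (all names invented for illustration):

input ReportParams {
  query: String!
}

enum JobState { PENDING RUNNING DONE FAILED }

type Job {
  id: ID!
  state: JobState!
  downloadUrl: String   # populated once the report file is ready
}

type Mutation {
  createReport(params: ReportParams!): Job!   # returns immediately with an ID
}

type Query {
  job(id: ID!): Job   # poll this until state is DONE, then follow downloadUrl
}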

Retry on the next Flux element and omit the successful ones

A little background.
I am trying to use reactive programming to download a file from another service. The trick is that, in case of a connection failure or a failed Flux element (anything), I would like to retry on the Flux a number of times, but once I get hold of it again I would like to resume without processing the elements from the very start.
What I mean is: something goes wrong and I get only 56 elements out of a possible 100 from my Flux (let's say it's a .jpg image) because of the connection failure. Once I successfully retry, I would like to resume at the 57th element so I do not have to reprocess what I already received or perform the GET from the start once again.
A normal retry (a plain retry appended to the Flux) restarts the sequence from the beginning; what I would like instead is for a retry to fetch only the elements I am still missing, keeping the ones that already arrived.
Just a sidenote: I would like functionality like HTTP Range request headers, where I can get bytes in a specific range only and, in case of failure, resume from the byte I want.
Is what I am trying to achieve even possible? If so, what could be the course of action?
You need to keep some state (at least the beginning of the range to request) on a per-subscriber basis. That has to be done upstream of the retry, so that each retry re-evaluates the range. At the same time, the state should be atomically updatable AND visible downstream of the retry (for updating purposes). Assuming you're using WebClient, the pieces are (a sketch follows the list):
a flatMap can be used to create a scope in which the range state is visible
in the lambda, an AtomicLong can be used as the state
again in the flatMap lambda, wrap the WebClient call in a Flux.defer to ensure lazy creation of the request, re-evaluating the state to generate the appropriate Range header (reading from the AtomicLong)
append the retry after the defer
update the AtomicLong as needed once each piece is received and processed (e.g. in a doOnNext)
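Putting those steps together, a minimal sketch (the host, path, and retry count are placeholders; counting bytes with readableByteCount() stands in for whatever unit your range bookkeeping actually uses):

import java.util.concurrent.atomic.AtomicLong;

import org.springframework.core.io.buffer.DataBuffer;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class ResumableDownload {

    public static void main(String[] args) {
        WebClient client = WebClient.create("https://example.org"); // hypothetical host

        Flux<DataBuffer> download = Flux.just(1) // any one-element source, just to open a scope
                .flatMap(ignored -> {
                    // Per-subscriber state: the next byte offset to request.
                    AtomicLong nextByte = new AtomicLong(0);
                    return Flux.defer(() -> client.get()
                                    .uri("/file.jpg") // hypothetical path
                                    // re-evaluated on every (re)subscription, i.e. on every retry
                                    .header("Range", "bytes=" + nextByte.get() + "-")
                                    .retrieve()
                                    .bodyToFlux(DataBuffer.class))
                            // advance the offset as each chunk arrives, so a
                            // retry resumes where the stream broke off
                            .doOnNext(buf -> nextByte.addAndGet(buf.readableByteCount()))
                            .retry(3); // the retry sits after the defer, inside the scope
                });

        download.subscribe(buf -> { /* write the chunk out, release the buffer, etc. */ });
    }
}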

How to handle errors with bulk requests

I am using the Elasticsearch bulk API to send a lot of documents to index and delete at once. If there is an error for one document, the other documents will still be indexed or deleted successfully, and this leads to a wrong state of the data in Elasticsearch, because in my case the documents are related to each other: if one document's field has some value, then there are other documents which should have the same value for that field. I am not sure how I can handle such errors from bulk requests. Is it possible to roll back a request in any way? I read similar questions but could not find a solution for handling such cases. Or, instead of a rollback, is there any way to send the data only if there will be no errors, something like a dry run of the request?
I'm late to the question but will answer for whoever runs across a similar scenario in the future.
After executing the Elasticsearch (ES) bulk API aka BulkRequest, you get a BulkResponse in return which consists of one or more BulkItemResponse. BulkItemResponse has a method isFailed() which will tell you if that action failed or not. In your case, you can traverse all the items in the response if there are failures and handle failed responses as per your requirement.
For synchronous execution, the code will look something like this:
import org.elasticsearch.action.bulk.BulkResponse
import org.elasticsearch.client.RequestOptions
import scala.collection.JavaConverters._

val bulkResponse: BulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT)
bulkResponse.iterator.asScala
  .filter(_.isFailed)
  .foreach(item => {
    // your logic to handle failures, e.g. log item.getFailureMessage
  })
For Asynchronous execution, you can provide a listener which will be called after the execution is completed. You have to override onResponse() and onFailure() in this case. You can read more about it at https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-document-bulk.html
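A rough sketch of the asynchronous variant, moving the same handling into the callbacks (onFailure fires when the bulk call itself fails, e.g. on a connection error):

import org.elasticsearch.action.ActionListener
import org.elasticsearch.action.bulk.{BulkRequest, BulkResponse}
import org.elasticsearch.client.RequestOptions

restHighLevelClient.bulkAsync(bulkRequest, RequestOptions.DEFAULT,
  new ActionListener[BulkResponse] {
    override def onResponse(response: BulkResponse): Unit = {
      // inspect response.iterator for items where isFailed is true,
      // exactly as in the synchronous example above
    }
    override def onFailure(e: Exception): Unit = {
      // the whole bulk request failed before any per-item results existed
    }
  })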
HTH.
The solution shared above, using the BulkResponse output, basically helps with handling the next batch. What if I want to stop the batch processing at the position where a request in the batch failed? We are sending bulk events which are related to each other. An example of my issue: with a batch of E1 to E10, if the batch fails at E5, I don't want E6 to E10 to be processed, because they are related. I want an immediate response in that case.
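There is no server-side way to abort the remainder of a bulk request; each action in the batch is applied independently. A client-side sketch (same Scala client as above) of locating the first failure so the already-applied later actions can be compensated for or re-checked:

val items = bulkResponse.getItems  // one BulkItemResponse per action, in order
val firstFailure = items.indexWhere(_.isFailed)
if (firstFailure >= 0) {
  // The server has already applied the successful actions after the
  // failure, so they need compensating requests (e.g. deletes) or review.
  val applied = items.drop(firstFailure + 1).filterNot(_.isFailed)
  applied.foreach(item => println(s"needs compensation: ${item.getId}"))
}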

Concurrent web requests with Ruby (Sinatra?)?

I have a Sinatra app that basically takes some input values and then finds data matching those values from external services like Flickr, Twitter, etc.
For example:
input:"Chattanooga Choo Choo"
Would go out and find images at Flickr on the Chattanooga Choo Choo and tweets from Twitter, etc.
Right now I have something like:
@images = Flickr::...find...images..
@tweets = Twitter::...find...tweets...
@results << @images
@results << @tweets
So my question is: is there an efficient way in Ruby to run those requests concurrently, instead of waiting for the images to finish before fetching the tweets?
Threads would work, but it's a crude tool. You could try something like this:
flickr_thread = Thread.start do
  @flickr_result = ... # make the Flickr request
end
twitter_thread = Thread.start do
  @twitter_result = ... # make the Twitter request
end

# this makes the main thread wait for the other two threads
# before continuing with its execution
flickr_thread.join
twitter_thread.join

# now both @flickr_result and @twitter_result have
# their values (unless an error occurred)
You'd have to tinker a bit with the code, though, and add proper error detection. I can't remember right now whether instance variables work when assigned inside the thread block; local variables wouldn't unless they were explicitly declared outside.
I wouldn't call this an elegant solution, but I think it works, and it's not too complex. In this case there is luckily no need for locking or synchronizations apart from the joins, so the code reads quite well.
Perhaps a tool like EventMachine (in particular the em-http-request subproject) might help you, if you do a lot of things like this. It could probably make it easier to code at a higher level. Threads are hard to get right.
You might consider making a client side change to use asynchronous Ajax requests to get each type (image, twitter) independently. The problem with server threads (one of them anyway) is that if one service hangs, the entire request hangs waiting for that thread to finish. With Ajax, you can load an images section, a twitter section, etc, and if one hangs the others will still show their results; eventually you can timeout the requests and show a fail whale or something in that section only.
Yes, why not threads?
As I understood it, as soon as the user submits the form you want to process all the requests in parallel, right? You could have one multithreaded controller (Ruby's thread support works really well) that receives the request and then executes the external service queries in parallel, answering back in a single response. Or, on the client side, you could send one Ajax post per service and process each response separately (maybe each external service gets its own controller/actions?).
http://github.com/pauldix/typhoeus for parallel/concurrent HTTP requests.
Consider using YQL for this. It supports subqueries, so that you can pull everything you need with a single (client-side, even) call that just spits out JSON of what you need to render. There are tons of tutorials out there already.
