How to write an xarray object to s3 in an asynchronous routine? - python-asyncio

I have an xarray object in an asynchronous task and would like to write it to s3. I thought to_zarr would do the trick, since zarr uses fsspec, which supports async via s3fs. But how do I actually use it? It seems that to_zarr performs some writes synchronously and immediately, even when compute=True, and then hands me back a dask object. Is there any way to avoid dask?
I am writing relatively small arrays, but lots of them. I am receiving data from several websocket streams, filling fixed-size chunks, and then triggering a task to write each full chunk to s3. How can I avoid blocking operations?
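One common pattern for the workflow described above is to keep the websocket readers fully async, queue finished chunks, and hand each blocking to_zarr call to a worker thread with asyncio.to_thread, so the event loop is never blocked. This is a minimal sketch of that pattern only; blocking_write is a stand-in for the real call (something like chunk.to_zarr("s3://bucket/store", ...)) so the example runs without xarray/zarr/s3fs installed, and all names here are illustrative.

```python
import asyncio

def blocking_write(chunk, written):
    # stand-in for the real, blocking chunk.to_zarr(...) call
    written.append(chunk)

async def writer(queue, written):
    while True:
        chunk = await queue.get()
        if chunk is None:          # sentinel: no more chunks
            return
        # runs in a thread, so the loop keeps servicing the websockets
        await asyncio.to_thread(blocking_write, chunk, written)

async def main():
    written = []
    queue = asyncio.Queue()
    task = asyncio.create_task(writer(queue, written))
    for i in range(3):             # stand-in for chunks filled from streams
        await queue.put({"chunk": i})
    await queue.put(None)
    await task
    return written

print(asyncio.run(main()))
```

On Python versions before 3.9, loop.run_in_executor gives the same effect as asyncio.to_thread. The single writer task also serializes the uploads; spawn several writer tasks on the same queue if concurrent uploads are acceptable.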

Related

Alternatives to Electron IPC for Large Messages

I have a React/Electron app split into 2 (and optionally many more) processes - a frontend, a backend, and potentially many 'inspector' windows. They are all connected via Redux using redux-electron-store which keeps all the instances in sync using IPC, with the main process being the 'master' node, renderers being sent diff actions. The backend processes lots of images and XML, potentially hundreds, and sends them to Redux for storage, resulting in the entire thing hanging. The frontend requires the thumbnails, and both other windows require the parsed XML data.
Originally, I was sending each item as its own Redux action, resulting in around 200 actions, which froze the app. I also tried staggering these, sending one every 2 seconds or so, which was better until performance started degrading partway through anyway. I then changed to a batch process with one action per type of processing (thumbnails or parsed XML) for a group of files, which resulted in 2 payloads of roughly 48MB and 37MB. That was better, but still froze everything for a good few seconds.
I put a little interval counter in the main process to see if it was a main or renderer hang, and it seems the main process is freezing, presumably while it ingests and resends these big messages (naturally this is not a very foolproof method of establishing causation here). So I'm not really sure how to restructure things to stop freezing the main process. We had two ideas:
Abstract the thumbnail and XML data into a part of Redux that is not synced over IPC, and instead run a small local websocket server in the backend that communicates directly with whichever process requests the data; that process stores it in its own Redux without syncing it. This could perhaps be done with Web Workers. It would circumvent sending big payloads through the main process, and the web worker should avoid freezing the renderer.
A partner's idea was a local database that the processes read from and write to, with the other windows notified somehow, potentially storing the data in component state rather than Redux. I'm less fond of this: it introduces more I/O, the file has to be maintained, and components would need some additional mechanism to learn that the write is done before going to read the same data.
The IPC is all done async currently, though it still blocks.
This is all under the impression that the large messages freezing the renderer are the sole problem, and not Redux itself doing work with them, which may also be true. However, removing the data from the synced store, as in solution 1, would cover both.
If anyone has any ideas with how to better structure this, I'd be very appreciative.
If sharing these actions between renderers only is a requirement, and all renderers have the same origin, you can try BroadcastChannel as an alternative to IPC.
You could also handle the data in one renderer process and send updates to the other renderers without involving the main process at all.

boost asio concurrent async_read and async_write

Looking at the documentation it looks like the TCP socket object is not thread-safe. So I cannot issue async_read from one thread and async_write concurrently from another thread? Also I would guess it applies to boost::asio::write() as well?
Can I issue write() - synchronous, while I do async_read from another thread?
If that is not safe, then the only way is probably to get the socket's native handle
and use synchronous Linux mechanisms to achieve concurrent reads and writes. I have an application where the reads and writes are actually independent.
It is thread-safe for the use-cases you listed. You can read in one thread, and write in another. And you can use the synchronous as well as asynchronous operations for that.
You will however run into problems if you try to do one dedicated operation type (e.g. reads) from more than one thread, especially if you are using the freestanding/composed operations (boost::asio::read(socket) instead of socket.read_some()). The reason is that only the primitive operations are atomic/thread-safe, and the composed operations work by calling into those primitives multiple times.

NSUrlConnection synchronous request without accepting redirects

I am currently implementing code that uses macOS API for HTTP/HTTPs requests in a Delphi/Lazarus program.
The code runs in its own thread (i.e. not main/ui thread) and is part of a larger threading based crawler across Windows/Mac and Delphi/Lazarus. I try to implement the actual HTTP/S request part using the OS API - but handle e.g. processing and taking action upon HTTP headers myself.
This means I would like to keep using synchronous mode if possible.
I want the request to simply return to me what the server returns.
I do not want it to follow redirects.
I currently use sendSynchronousRequest_returningResponse_error
I have tried searching Google, but it seems there is no way to do this when using synchronous requests? That just seems a bit odd.
No, NSURLConnection's synchronous functionality is very limited, and was never expanded because it is so strongly discouraged. That said, it is technically possible to implement what you're trying to do.
My recollection, from having replaced that method with an NSURLSession equivalent once (to swizzle in a less leaky replacement for that method in a binary-only library), is that you basically write a method that uses a shared dictionary to store a semaphore for each NSURLSessionDataTask (using the data task as a key). You create the semaphore with a count of zero so that it blocks immediately when you wait on it, start the asynchronous data task on the main thread, and then wait on the semaphore in the current thread. In the data task's completion handler block, you increment the semaphore, thus unblocking the calling thread.
The trick is to ensure that the session runs its callbacks on a thread OTHER than the current one (which is blocked waiting for the semaphore). So you'll need to dispatch_async into the main thread when you actually start the data task.
Ostensibly, if you supported converting the task into a download task or stream task in the relevant delegate method, you would also need to take appropriate action to update the shared dictionary as well, but I'm assuming you won't use that feature. :-)
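The semaphore trick described above can be sketched with Python's threading primitives; the NSURLSession pieces are replaced by hypothetical stand-ins (synchronous_request, fake_async_request), so this illustrates only the blocking pattern, not the Cocoa API. The caller blocks on a semaphore created with a count of zero, and the completion handler, which must run on a different thread, releases it.

```python
import threading

def synchronous_request(start_async_request):
    done = threading.Semaphore(0)   # count 0: acquire() blocks immediately
    result = {}

    def completion(data):           # plays the data task's completion handler
        result["data"] = data
        done.release()              # unblock the waiting caller

    start_async_request(completion) # kick off the async work elsewhere
    done.acquire()                  # block here until the completion fires
    return result["data"]

def fake_async_request(completion):
    # stand-in for starting an NSURLSessionDataTask; the callback runs on
    # another thread, never on the one blocked in synchronous_request
    threading.Thread(target=lambda: completion(b"response")).start()

print(synchronous_request(fake_async_request))
```

If the callback were ever scheduled onto the blocked thread itself, the acquire() would deadlock, which is exactly why the answer stresses running the session's callbacks on a different thread.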

Connecting Redis events to Lua Script execution and concurrency issues

I have grouped key-value pairs, or data structures, built using the Redisson library. The design is that a change in the value of any group should be sent as an event to subscribing Lua scripts. These scripts then do computations and update another group's key-value pair. The process is implemented as a chain: once a Lua script updates a key-value pair, that in turn generates an event, and another Lua script does work similar to the first, based on certain parameters.
Question 1: How to connect the Lua script and the event?
Question 2: Events are pipelined, but my Lua scripts may have to wait on network IO. In that case, I assume the next event is processed and its subscribing script executed. That is a problem for me: the first script hasn't finished updating the key-value pair it needs to, and the second script is already going ahead with its work. This will cause errors. Is there a way around this?
Question 3: How do I emit events from Redisson data structures? The Lua script also needs to understand the data structure's layout. How?
At the time of writing, Redis (3.2.9) does not allow blocking commands inside Lua scripts, including the subscribe command. So it is impossible to achieve what you have described via Lua script.
However you can do it using Redisson Topic and/or Redisson distributed services:
Modify a value, then send a message to a channel. Another process receives the message and does the computation and updating.
Or ...
If there's only one particular process that does the computation and updating, you can use a Redisson remote service to tell that process to do the work; it works like RPC. It may be able to modify the first value too.
Or ...
Create the whole lot as one runnable job and send it to be processed by a Redisson remote executor. You can also choose to schedule the job if it is not immediately required.
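The first option (a message-driven chain, analogous to a Redisson Topic) can be sketched in-process with stdlib primitives; store, set_value, and handle are illustrative names, not Redisson API. Updating a key publishes an event, and a single worker consumes events strictly in order, so a later step can never observe a half-finished earlier update, which addresses the ordering worry in Question 2.

```python
import queue
import threading

store = {}
events = queue.Queue()

def set_value(key, value):
    store[key] = value
    events.put(key)                 # "publish" a change event for this key

def handle(key):
    # the chain: an update to "a" derives "b", an update to "b" derives "c"
    if key == "a":
        set_value("b", store["a"] * 2)
    elif key == "b":
        set_value("c", store["b"] + 1)

def worker():
    while True:
        key = events.get()
        try:
            if key is None:         # shutdown sentinel
                return
            handle(key)             # one event at a time, in publish order
        finally:
            events.task_done()

t = threading.Thread(target=worker)
t.start()
set_value("a", 10)
events.join()                       # wait until the whole chain has settled
events.put(None)
t.join()
print(store)
```

With Redis pub/sub the same serialization falls out of having a single subscriber process per channel; with multiple subscribers you would need the remote-executor option instead.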

Synchronous calls in akka / actor model

I've been looking into Akka lately, and it looks like a great framework for building scalable servers on the JVM. However, most libraries on the JVM are blocking (e.g. JDBC), so don't you lose the performance benefits of an event-based model because your threads will always be blocked? Does Akka do something to get around this? Or is it just something you have to live with until we get more non-blocking libraries on the JVM?
Have a look at CQRS, it greatly improves scalability by separating reads from writes. This means that you can scale your reads separately from your writes.
With the types of IO-blocking issues you mentioned, Scala provides a language-embedded solution that matches perfectly: Futures. For example:
def expensiveDBQuery(key: Key) = Future {
  // ...query the database
}

val dbResult: Future[Result] =
  expensiveDBQuery(...) // non-blocking call
The dbResult returns immediately from the function call; the Result will be available "in the future". The cool part about a Future is that you can think of it like any old collection, except you can never call .size on it. Other than that, all the collection-ish functions (e.g. map, filter, foreach, ...) are fair game. Simply think of dbResult as a list of Results. What would you do with such a list:
dbResult.map(_.getValues)
  .filter(values => someTestOnValues(values))
  ...
That sequence of calls sets up a computation pipeline that is invoked whenever the Result actually comes back from the database. You can specify a whole sequence of computing steps before the data has arrived, all asynchronously.
