How to perform asynchronous tasks in fixed interval of time - ruby

The goal is to perform an async task(file read, network operation) without blocking the code. And we have multiple such async tasks that need to be executed at a fixed interval of times. Here is a pseudo code to demonstrate the same.
# the async tasks should be performed in parallel
# provide me with a return value after the task is complete, or they can have a callback or any other mechanism of communication
async_task_1 = perform_async(1)
# now I need to wait fix amount of time before the async task 2
sleep(5)
# this also similar to the tasks one in nature
async_task_2 = perform_async(2)
# finally do something with the result
I'm reading that in ruby I've 2 options forking, threading. The is also something called as Fiber. I also read that due to GIL in the basic Ruby, I won't be able to make much use of threading. I still want to stick to the base Ruby.
I've written some parallel code previously in OMP and Cuda. But I've never got a chance to do that in Ruby.
Can you suggest how to achieve this?

I would recommend to you the concurrent-ruby gem with its async feature. This will work great, as long as your tasks are IO bound. (As you said they are)
There you have a async feature to perform your tasks. To wait the amount of time between your 2 async calls you can use literally the sleep function
class AsyncCalls
include Concurrent::Asnyc
def perform_task(params)
# IO bound task
end
end
AsyncCalls.new.async.perform_task("param")
sleep 5
AsyncCalls.new.async.perform_task("other param")

Related

Why do Ruby fibers that run sequentially without a scheduler set run concurrently when a scheduler is set?

I have the following Gemfile:
source "https://rubygems.org"
ruby "3.1.2"
gem "libev_scheduler", "~> 0.2"
and the following Ruby code in a file called main.rb:
require 'libev_scheduler'
set_sched = ARGV[0] == "--set-sched"
if set_sched then
Fiber.set_scheduler Libev::Scheduler.new
end
N_FIBERS = 5
fibers = []
N_FIBERS.times do |i|
n = i + 1
fiber = Fiber.new do
puts "Beginning calculation ##{n}..."
sleep 1
end
fibers.push({fiber: fiber, n: n})
end
fibers.each do |fiber|
fiber[:fiber].resume
end
puts "Finished all calculations!"
I'm executing the code with Ruby 3.1.2 installed via RVM.
When I run the program with time bundle exec ruby main.rb, I get the following output:
Beginning calculation #1...
Beginning calculation #2...
Beginning calculation #3...
Beginning calculation #4...
Beginning calculation #5...
Finished all calculations!
real 0m5.179s
user 0m0.146s
sys 0m0.027s
When I run the program with time bundle exec ruby main.rb --set-sched, I get the following output:
Beginning calculation #1...
Beginning calculation #2...
Beginning calculation #3...
Beginning calculation #4...
Beginning calculation #5...
Finished all calculations!
real 0m1.173s
user 0m0.150s
sys 0m0.021s
Why do my fibers only run concurrently when I've set a scheduler? Some older Stack Overflow answers (like this one) state that fibers are a construct for flow control, not concurrency, and that it is impossible to use fibers to write concurrent code. My results seem to contradict this.
My understanding so far of fibers is that they are meant for cooperative concurrency, as opposed to preemptive concurrency. Therefore, in order to get concurrency out of them, you'd need to have them yield to some other code as early as they can (ex. when IO begins) so that the other code can be executed while the fiber waits for its next opportunity to execute.
Based on this understanding, I think I understand why my code without a scheduler isn't able to run concurrently. It sleeps and because it lacks yield statements before and after code in it, there are no points in time where it could yield control to any other code I've written. But when I add a scheduler, it appears to somehow yield to something. Is sleep detecting the scheduler and yielding to it so that my code resuming the fibers is immediately yielded to, making it able to immediately resume all five fibers?
Great question!
As #stefan noted above, Ruby 3.0 introduced the concept of a "non-blocking fiber." The way the actual non-blocking behavior is accomplished is left up to the scheduler implementation. There is no default scheduler as far as I know; per the Ruby docs:
If Fiber.scheduler is not set in the current thread, blocking and non-blocking fibers’ behavior is identical.
Now, to answer your last question:
But when I add a scheduler, it appears to somehow yield to something ... Is sleep detecting the scheduler and yielding to it so that my code resuming the fibers is immediately yielded to, making it able to immediately resume all five fibers?
You're onto something! When you set a fiber scheduler, it's expected to conform to Fiber::SchedulerInterface, which defines several "hooks." One of those hooks is #kernel_sleep, which is invoked by Kernel#sleep (and Mutex#sleep)!
I can't say I've read much libev code, but you can find libev_scheduler's implementation of that hook here.
The idea is (emphasis my own):
The scheduler runs into a wait loop, checking all the blocked fibers (which it has registered on hook calls) and resuming them when the awaited resource is ready (e.g. I/O ready or sleep time elapsed).
So, in summary:
Your fiber calls Kernel#sleep with some duration.
Kernel#sleep calls the scheduler's #kernel_sleep hook with that same duration.
The schedule "somehow registers what the current fiber is waiting on, and yields control to other fibers with Fiber.yield" (quote from the docs there)
"The scheduler runs into a wait loop, checking all the blocked fibers (which it has registered on hook calls) and resuming them when the awaited resource is ready (e.g. I/O ready or sleep time elapsed)."
Hope this helps!

Using asyncio.run, is it safe to run multiple times?

The documentation for asyncio.run states:
This function always creates a new event loop and closes it at the end.
It should be used as a main entry point for asyncio programs, and should
ideally only be called once.
But it does not say why. I have a non-async program that needs to invoke something async. Can I just use asyncio.run every time I get to the async portion, or is this unsafe/wrong?
In my case, I have several async coroutines I want to gather and run in parallel to completion. When they are all completed, I want move on with my synchronous code.
async my_task(url):
# request some urls or whatever
integration_tasks = [my_task(url1), my_task(url2)]
async def gather_tasks(*integration_tasks):
return await asyncio.gather(*integration_tasks)
def complete_integrations(*integration_tasks):
return asyncio.run(gather_tasks(*integration_tasks))
print(complete_integrations(*integration_tasks))
Can I use asyncio.run() to run coroutines multiple times?
This actually is an interesting and very important question.
As a documentation of asyncio (python3.9) says:
This function always creates a new event loop and closes it at the end. It should be used as a main entry point for asyncio programs, and should ideally only be called once.
It does not prohibit calling it multiple times. And moreover, an old way of calling coroutines from synchronous code, which was:
loop = asyncio.get_event_loop()
loop.run_until_complete(coroutine)
Is now deprecated because of get_event_loop() method, which documentation says:
Consider also using the asyncio.run() function instead of using lower level functions to manually create and close an event loop.
Deprecated since version 3.10: Deprecation warning is emitted if there is no running event loop. In future Python releases, this function will be an alias of get_running_loop().
So in future releases it will not spawn new event loop if already running one is not present! Docs are proposing usage of asyncio.run() if You want to automatically spawn new loop if there is no new one.
There is a good reason for such decision. Even if You have an event loop and You will successfully use it to execute coroutines, there is few more things You must remember to do:
closing an event loop
consuming unconsumed generators (most important in case of failed coroutines)
...probably more, which I do not even attempt to refer here
What is exactly needed to be done to properly finalize event loop You can read in this source code.
Managing an event loop manually (if there is no running one) is a subtle procedure, and it is better to not doing that, unless one know what he is doing.
So Yes, I think that proper way of runing async function from synchronous code is calling asyncio.run(). But it is only suitable from a fully synchronous application. If there is already running event loop, it will probably fail (not tested). In such case, just await it or use get_runing_loop().run_untilcomplete(coro).
And for such synchronous apps, using asyncio.run() it is safe way and actually the only safe way of doing this, and it can be invoked multiple times.
The reason docs says that You should call it only once is that usually there is one single entrypoint to whole asynchronous application. It simplifies things and actually improves performance, because setting thins up for an event loop also takes some time. But if there is no single loop available in Your application, You should use multiple calls to asyncio.run() to run coroutines multiple times.
Is there is any performance gain?
Beside discussing multiple calls to asyncio.run(), I want to address one more concern. In comments, #jwal says:
asyncio is not parallel processing. Says so in the docs. [...] If you want parallel, run in a separate processes on a computer with a separate CPU core, not a separate thread, not a separate event loop.
Suggesting that asyncio is not suitable for parallel processing, which can be misunderstood and misleading to a conclusion, that it will not result in a performance gain, which is not always true. Moreover it is usually false!
So, any time You can delegate a job to an external process (not only a python process, it can be a database worker process, http call, ideally any TCP socket call) You can utilize a performance gain using asyncio. In huge majority of cases, when You are using a library which exposes async interface, the author of that library made an effort to eventually await for a result from a network/socket/process call. While response from such socket is not ready, event loop is completely free to do any other tasks. If loop has more than one such tasks, it will gain a performance.
A canonical example of such case is making a calls to a HTTP endpoints. At some point, there will be a network call, so python thread is free to do other work while awaiting for a data to appear on a TCP socket buffer. I have an example!
The example uses httpx library to compare performance of doing multiple calls to a OpenWeatherMap API. There are two functions:
get_weather_async()
get_weather_sync()
The first one does 8 request to an http API, but schedules those request to
run cooperatively (not concurrently!) on an event loop using asyncio.gather().
The second one performs 8 synchronous request in sequence.
To call the asynchronous function, I am actually using asyncio.run() method. And moreover, I am using timeit module to perform such call to asyncio.run() 4 times. So in a single python application, asyncio.run() was called 4 times, just to challenge my previous considerations.
from time import time
import httpx
import asyncio
import timeit
from random import uniform
class AsyncWeatherApi:
def __init__(
self, base_url: str = "https://api.openweathermap.org/data/2.5"
) -> None:
self.client: httpx.AsyncClient = httpx.AsyncClient(base_url=base_url)
async def weather(self, lat: float, lon: float, app_id: str) -> dict:
response = await self.client.get(
"/weather",
params={
"lat": lat,
"lon": lon,
"appid": app_id,
"units": "metric",
},
)
response.raise_for_status()
return response.json()
class SyncWeatherApi:
def __init__(
self, base_url: str = "https://api.openweathermap.org/data/2.5"
) -> None:
self.client: httpx.Client = httpx.Client(base_url=base_url)
def weather(self, lat: float, lon: float, app_id: str) -> dict:
response = self.client.get(
"/weather",
params={
"lat": lat,
"lon": lon,
"appid": app_id,
"units": "metric",
},
)
response.raise_for_status()
return response.json()
def get_random_locations() -> list[tuple[float, float]]:
"""generate 8 random locations in +/-europe"""
return [(uniform(45.6, 52.3), uniform(-2.3, 29.4)) for _ in range(8)]
async def get_weather_async(locations: list[tuple[float, float]]):
api = AsyncWeatherApi()
return await asyncio.gather(
*[api.weather(lat, lon, api_key) for lat, lon in locations]
)
def get_weather_sync(locations: list[tuple[float, float]]):
api = SyncWeatherApi()
return [api.weather(lat, lon, api_key) for lat, lon in locations]
api_key = "secret"
def time_async_job(repeat: int = 1):
locations = get_random_locations()
def run():
return asyncio.run(get_weather_async(locations))
duration = timeit.Timer(run).timeit(repeat)
print(
f"[ASYNC] In {duration}s: done {len(locations)} API calls, all"
f" repeated {repeat} times"
)
def time_sync_job(repeat: int = 1):
locations = get_random_locations()
def run():
return get_weather_sync(locations)
duration = timeit.Timer(run).timeit(repeat)
print(
f"[SYNC] In {duration}s: done {len(locations)} API calls, all repeated"
f" {repeat} times"
)
if __name__ == "__main__":
time_sync_job(4)
time_async_job(4)
At the end, a comparison of performance was printed. It says:
[SYNC] In 5.5580058859995916s: done 8 API calls, all repeated 4 times
[ASYNC] In 2.865574334995472s: done 8 API calls, all repeated 4 times
Those 4 repetitions was just to show that You can safely run a asyncio.run() multiple times. It had actualy destructive impact on measuring performance of asynchronous http calls, because all 32 request was actually run in four synchronous batches of 8 asynchronous tasks. Just to compare performance of one batch of 32 request:
[SYNC] In 4.373898585996358s: done 32 API calls, all repeated 1 times
[ASYNC] In 1.5169846520002466s: done 32 API calls, all repeated 1 times
So yes, it can, and usually will result in performance gain, if only proper async library is used (if library exposes an async API, it usually does it intentianally, knowing that there will be a network call somewhere).

Using wait_for with timeouts with list of tasks

So, I have a list of tasks which I want to schedule concurrently in a non-blocking fashion.
Basically, gather should do the trick.
Like
tasks = [ asyncio.create_task(some_task()) in bleh]
results = await asyncio.gather(*tasks)
But then, I also need a timeout. What I want is that any task which takes > timeout time cancels and I proceed with what I have.
I fould asyncio.wait primitive.
https://docs.python.org/3/library/asyncio-task.html#waiting-primitives
But then the doc says:
Run awaitable objects in the aws set concurrently and block until the condition specified by return_when.
Which seems to suggest that it blocks...
It seems that asyncio.wait_for will do the trick
https://docs.python.org/3/library/asyncio-task.html#timeouts
But how do i send in the list of awaitables rather than just an awaitable?
What I want is that any task which takes > timeout time cancels and I proceed with what I have.
This is straightforward to achieve with asyncio.wait():
# Wait for tasks to finish, but no more than a second.
done, pending = await asyncio.wait(tasks, timeout=1)
# Cancel the ones not done by now.
for fut in pending:
fut.cancel()
# Results are available as x.result() on futures in `done`
Which seems to suggest that [asyncio.wait] blocks...
It only blocks the current coroutine, the same as gather or wait_for.

How does one achieve parallel tasks with Ruby's Fibers?

I'm new to fibers and EventMachine, and have only recently found out about fibers when I was seeing if Ruby had any concurrency features, like go-lang.
There don't seem to be a whole lot of examples out there for real use cases for when you'd use a Fiber.
I did manage to find this: https://www.igvita.com/2009/05/13/fibers-cooperative-scheduling-in-ruby/ (back from 2009!!!)
which has the following code:
require 'eventmachine'
require 'em-http'
require 'fiber'
def async_fetch(url)
f = Fiber.current
http = EventMachine::HttpRequest.new(url).get :timeout => 10
http.callback { f.resume(http) }
http.errback { f.resume(http) }
return Fiber.yield
end
EventMachine.run do
Fiber.new{
puts "Setting up HTTP request #1"
data = async_fetch('http://www.google.com/')
puts "Fetched page #1: #{data.response_header.status}"
EventMachine.stop
}.resume
end
And that's great, async GET request! yay!!! but... how do I actually use it asyncily? The example doesn't have anything beyond creating the containing Fiber.
From what I understand (and don't understand):
async_fetch is blocking until f.resume is called.
f is the current Fiber, which is the wrapping Fiber created in the EventMachine.run block.
the async_fetch yields control flow back to its caller? I'm not sure what this does
why does the wrapping fiber have resume at the end? are fibers paused by default?
Outside of the example, how do I use fibers to say, shoot off a bunch of requests triggered by keyboard commands?
like, for example: every time I type a letter, I make a request to google or something? - normally this requires a thread, which the main thread would tell the parallel thread to launch a thread for each request. :-\
I'm new to concurrency / Fibers. But they are so intriguing!
If anyone can answers these questions, that would be very appreciated!!!
There is a lot of confusion regarding fibers in Ruby. Fibers are not a tool with which to implement concurrency; they are merely a way of organizing code in a way that may more clearly represent what is going on.
That the name 'fibers' is similar to 'threads' in my opinion contributes to the confusion.
If you want true concurrency, that is, distributing the CPU load across all available CPU's, you have the following options:
In MRI Ruby
Running multiple Ruby VM's (i.e. OS processes), using fork, etc. Even with multiple threads in Ruby, the GIL (Global Interpreter Lock) prevents the use of more than 1 CPU by the Ruby runtime.
In JRuby
Unlike MRI Ruby, JRuby will use multiple CPU's when assigning threads, so you can get truly concurrent processing.
If your code is spending most of its time waiting for external resources, then you may not have any need for this true concurrency. MRI threads or some kind of event handling loop will probably work fine for you.

How to use ruby fibers to avoid blocking IO

I need to upload a bunch of files in a directory to S3. Since more than 90% of the time required to upload is spent waiting for the http request to finish, I want to execute several of them at once somehow.
Can Fibers help me with this at all? They are described as a way to solve this sort of problem, but I can't think of any way I can do any work while an http call blocks.
Any way I can solve this problem without threads?
I'm not up on fibers in 1.9, but regular Threads from 1.8.6 can solve this problem. Try using a Queue http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html
Looking at the example in the documentation, your consumer is the part that does the upload. It 'consumes' a URL and a file, and uploads the data. The producer is the part of your program that keeps working and finds new files to upload.
If you want to upload multiple files at once, simply launch a new Thread for each file:
t = Thread.new do
upload_file(param1, param2)
end
#all_threads << t
Then, later on in your 'producer' code (which, remember, doesn't have to be in its own Thread, it could be the main program):
#all_threads.each do |t|
t.join if t.alive?
end
The Queue can either be a #member_variable or a $global.
To answer your actual questions:
Can Fibers help me with this at all?
No they can't. Jörg W Mittag explains why best.
No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct, they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)
The only concurrency construct in Ruby is Thread.
When he says that the only concurrency contruct in Ruby is Thread, remember that there are many different implimentations of Ruby and that they vary in their threading implementations. Jörg once again provides a great answer to these differences; and correctly concludes that only something like JRuby (that uses JVM threads mapped to native threads) or forking your process is how you can achieve true parallelism.
Any way I can solve this problem without threads?
Other than forking your process, I would also suggest that you look at EventMachine and something like em-http-request. It's an event driven, non-blocking, reactor pattern based HTTP client that is asynchronous and does not incur the overhead of threads.
Aaron Patterson (#tenderlove) uses an example almost exactly like yours to describe exactly why you can and should use threads to achieve concurrency in your situation.
Most I/O libraries are now smart enough to release the GVL (Global VM Lock, or most people know it as the GIL or Global Interpreter Lock) when doing IO. There is a simple function call in C to do this. You don't need to worry about the C code, but for you this means that most IO libraries worth their salt are going to release the GVL and allow other threads to execute while the thread that is doing the IO waits for the data to return.
If what I just said was confusing, you don't need to worry about it too much. The main thing that you need to know is that if you are using a decent library to do your HTTP requests (or any other I/O operation for that matter... database, interprocess communication, whatever), the Ruby interpreter (MRI) is smart enough to be able to release the lock on the interpreter and allow other threads to execute while one thread awaits IO to return. If the next thread has its own IO to grab, the Ruby interpreter will do the same thing (assuming that the IO library is built to utilize this feature of Ruby, which I believe most are these days).
So, to sum up what I am saying, use threads! You should see the performance benefit. If not, check to see whether your http library is using the rb_thread_blocking_region() function in C and, if not, find out why not. Maybe there is a good reason, maybe you need to consider using a better library.
The link to the Aaron Patterson video is here: http://www.youtube.com/watch?v=kufXhNkm5WU
It is worth a watch, even if just for the laughs, as Aaron Patterson is one of the funniest people on the internet.
You could use separate processes for this instead of threads:
#!/usr/bin/env ruby
$stderr.sync = true
# Number of children to use for uploading
MAX_CHILDREN = 5
# Hash of PIDs for children that are working along with which file
# they're working on.
#child_pids = {}
# Keep track of uploads that failed
#failed_files = []
# Get the list of files to upload as arguments to the program
#files = ARGV
### Wait for a child to finish, adding the file to the list of those
### that failed if the child indicates there was a problem.
def wait_for_child
$stderr.puts " waiting for a child to finish..."
pid, status = Process.waitpid2( 0 )
file = #child_pids.delete( pid )
#failed_files << file unless status.success?
end
### Here's where you'd put the particulars of what gets uploaded and
### how. I'm just sleeping for the file size in bytes * milliseconds
### to simulate the upload, then returning either +true+ or +false+
### based on a random factor.
def upload( file )
bytes = File.size( file )
sleep( bytes * 0.00001 )
return rand( 100 ) > 5
end
### Start a child uploading the specified +file+.
def start_child( file )
if pid = Process.fork
$stderr.puts "%s: uploaded started by child %d" % [ file, pid ]
#child_pids[ pid ] = file
else
if upload( file )
$stderr.puts "%s: done." % [ file ]
exit 0 # success
else
$stderr.puts "%s: failed." % [ file ]
exit 255
end
end
end
until #files.empty?
# If there are already the maximum number of children running, wait
# for one to finish
wait_for_child() if #child_pids.length >= MAX_CHILDREN
# Start a new child working on the next file
start_child( #files.shift )
end
# Now we're just waiting on the final few uploads to finish
wait_for_child() until #child_pids.empty?
if #failed_files.empty?
exit 0
else
$stderr.puts "Some files failed to upload:",
#failed_files.collect {|file| " #{file}" }
exit 255
end

Resources