asyncio: can a task start only when the previous task reaches a pre-defined stage? - python-asyncio

I am getting started with asyncio, which I would like to apply to the following problem:
Data is split into chunks.
Each chunk is first compressed.
Then the compressed chunk is written to the file.
A single file is used for all chunks, so I need to process them one by one.
with open('my_file', 'w+b') as f:
    for chunk in chunks:
        # compress_chunk() is assumed to return the compressed bytes
        compressed = compress_chunk(chunk)
        f.write(compressed)
In this context, to make the process faster, could the compress step of the next iteration be triggered as soon as the write step of the current iteration starts?
Can I do that with asyncio, keeping a similar for loop structure? If yes, could you share some pointers about this?
I am guessing another way to run this in parallel is to use ProcessPoolExecutor and fully split the compress phase from the write phase. This means first compressing all chunks in different executors.
Only when all chunks are compressed would the writing step start.
But I would like to investigate the first approach with asyncio first, if it makes sense.
Thanks in advance for any help.
Bests

You can do this with a producer-consumer model. As long as there is one producer and one consumer, the chunks are written in the correct order, and for your use case a single pair is all you can benefit from. Also, you should use the aiofiles library. Standard file I/O will mostly block the thread that also runs your compression/producer code, and you won't see much speedup. Try something like this:
import asyncio
import aiofiles

async def produce(queue, chunks):
    for chunk in chunks:
        # compress_chunk() is assumed to return the compressed bytes
        compressed_chunk = compress_chunk(chunk)
        await queue.put(compressed_chunk)

async def consume(queue):
    # binary mode, because compressed chunks are bytes
    async with aiofiles.open('my_file', 'wb') as f:
        while True:
            compressed_chunk = await queue.get()
            await f.write(compressed_chunk)
            queue.task_done()

async def main():
    queue = asyncio.Queue()
    producer = asyncio.create_task(produce(queue, chunks))
    consumer = asyncio.create_task(consume(queue))
    # wait for the producer to finish
    await producer
    # wait for the consumer to finish processing and cancel it
    await queue.join()
    consumer.cancel()

asyncio.run(main())
https://github.com/Tinche/aiofiles
Using asyncio.Queue for producer-consumer flow
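As a complement, here is a minimal sketch of the overlap the question asks about (compress chunk N+1 while chunk N is being written), keeping the for loop. It is only an illustration under assumptions: zlib.compress stands in for compress_chunk, the names compress_and_write and demo are hypothetical, and both steps are pushed to the default thread pool with run_in_executor. That works here because CPython's zlib releases the GIL while compressing; a pure-Python compressor would probably need a ProcessPoolExecutor instead.

```python
import asyncio
import os
import tempfile
import zlib


async def compress_and_write(chunks, path):
    """Write compressed chunks in order, compressing chunk N+1 while chunk N is written."""
    loop = asyncio.get_running_loop()
    bytes_written = 0
    pending_write = None  # future of the write that is currently in flight
    with open(path, "wb") as f:
        for chunk in chunks:
            # Runs in a worker thread; the previous write (if any) proceeds meanwhile.
            compressed = await loop.run_in_executor(None, zlib.compress, chunk)
            if pending_write is not None:
                # Finish write N before starting write N+1 so file order is kept.
                bytes_written += await pending_write
            pending_write = loop.run_in_executor(None, f.write, compressed)
        if pending_write is not None:
            bytes_written += await pending_write
    return bytes_written


def demo():
    # Hypothetical input: four 100 kB chunks written to a temporary file.
    chunks = [bytes([i]) * 100_000 for i in range(4)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_file")
        written = asyncio.run(compress_and_write(chunks, path))
        return written, os.path.getsize(path)
```

Only one write is ever in flight, so the single-file ordering constraint from the question still holds.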

Related

Can I use multiple event loops in a program where I also use multiprocessing module

Thanks for any reply in advance.
I have the entrance program main.py:
import asyncio
from loguru import logger
from multiprocessing import Process
from app.events import type_a_tasks, type_b_tasks, type_c_tasks

def run_task(task):
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task())
    loop.run_forever()

def main():
    processes = list()
    processes.append(Process(target=run_task, args=(type_a_tasks,)))
    processes.append(Process(target=run_task, args=(type_b_tasks,)))
    processes.append(Process(target=run_task, args=(type_c_tasks,)))
    for process in processes:
        process.start()
        logger.info(f"Started process id={process.pid}, name={process.name}")
    for process in processes:
        process.join()

if __name__ == '__main__':
    main()
where the different types of tasks are similarly defined, for example type_a_tasks are:
import asyncio
from . import business_1, business_2, business_3, business_4, business_5, business_6

async def type_a_tasks():
    tasks = list()
    tasks.append(asyncio.create_task(business_1.main()))
    tasks.append(asyncio.create_task(business_2.main()))
    tasks.append(asyncio.create_task(business_3.main()))
    tasks.append(asyncio.create_task(business_4.main()))
    tasks.append(asyncio.create_task(business_5.main()))
    tasks.append(asyncio.create_task(business_6.main()))
    await asyncio.wait(tasks)
    return tasks
where the main() function of each business module (1-6) is an asyncio coroutine in which I implemented my business code.
Is my usage of multiprocessing and asyncio event loops above the correct way of doing it?
I am doing so because I have a lot of asynchronous tasks to perform, but it doesn't seem appropriate to put them all in one event loop, so I divided them into three parts (a, b and c). I hope they can run in three different processes to use multiple CPU cores while still taking advantage of asyncio features.
I tried running my code, and the log records show that there actually are different processes, but all of them appear to use the same thread/event loop (I know this from adding process_id and thread_id to the loguru format).
This seems OK. Just use asyncio.run(task()) inside run_task: it is simpler, and there is no need to call run_forever (also, with the run_forever call, your processes never finish, so join() in the base process never returns).
IDs of other objects, such as threads, may repeat across processes. If you want, add the result of calling os.getpid() in the body of run_task to your logging.
(If these are, by chance, the same, that means multiprocessing is somehow using a "dummy" backend due to some configuration in your project. That should not happen anyway.)
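A minimal sketch of that suggestion, assuming stand-in coroutines instead of the real app.events tasks (business and this short type_a_tasks are hypothetical placeholders). Each process calls asyncio.run(), which creates its own event loop, and os.getpid() in the output shows which process did the work.

```python
import asyncio
import os
from multiprocessing import Process


async def business(name):
    # Placeholder for the real business coroutine.
    await asyncio.sleep(0.01)
    return f"{name} ran in pid {os.getpid()}"


async def type_a_tasks():
    return await asyncio.gather(business("business_1"), business("business_2"))


def run_task(task):
    # asyncio.run() creates a fresh loop, runs the coroutine, then closes the loop,
    # so the process exits once the tasks are done and join() can return.
    results = asyncio.run(task())
    for line in results:
        print(line)
    return results


def main():
    processes = [Process(target=run_task, args=(type_a_tasks,)) for _ in range(2)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()


if __name__ == "__main__":
    main()
```

The two child processes should print different pids, while object ids or thread names inside them may well coincide.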

Using asyncio.run, is it safe to run multiple times?

The documentation for asyncio.run states:
This function always creates a new event loop and closes it at the end.
It should be used as a main entry point for asyncio programs, and should
ideally only be called once.
But it does not say why. I have a non-async program that needs to invoke something async. Can I just use asyncio.run every time I get to the async portion, or is this unsafe/wrong?
In my case, I have several async coroutines I want to gather and run in parallel to completion. When they are all completed, I want to move on with my synchronous code.
import asyncio

async def my_task(url):
    # request some urls or whatever
    ...

integration_tasks = [my_task(url1), my_task(url2)]

async def gather_tasks(*integration_tasks):
    return await asyncio.gather(*integration_tasks)

def complete_integrations(*integration_tasks):
    return asyncio.run(gather_tasks(*integration_tasks))

print(complete_integrations(*integration_tasks))
Can I use asyncio.run() to run coroutines multiple times?
This is actually an interesting and very important question.
As the asyncio documentation (Python 3.9) says:
This function always creates a new event loop and closes it at the end. It should be used as a main entry point for asyncio programs, and should ideally only be called once.
It does not prohibit calling it multiple times. Moreover, the old way of calling coroutines from synchronous code, which was:
loop = asyncio.get_event_loop()
loop.run_until_complete(coroutine)
is now deprecated because of the get_event_loop() method, whose documentation says:
Consider also using the asyncio.run() function instead of using lower level functions to manually create and close an event loop.
Deprecated since version 3.10: Deprecation warning is emitted if there is no running event loop. In future Python releases, this function will be an alias of get_running_loop().
So in future releases it will not spawn a new event loop if there is no running one! The docs propose using asyncio.run() if you want a new loop to be created automatically when none is running.
There is a good reason for this decision. Even if you have an event loop and successfully use it to execute coroutines, there are a few more things you must remember to do:
closing the event loop
shutting down unfinished asynchronous generators (most important in case of failed coroutines)
...probably more, which I do not even attempt to list here
What exactly needs to be done to properly finalize an event loop can be read in this source code.
Managing an event loop manually (if there is no running one) is a subtle procedure, and it is better not to do it unless you know what you are doing.
So yes, I think the proper way of running an async function from synchronous code is calling asyncio.run(). But it is only suitable in a fully synchronous application. If an event loop is already running in the current thread, asyncio.run() raises a RuntimeError. In that case, just await the coroutine or schedule it with asyncio.create_task(coro).
For such synchronous apps, asyncio.run() is a safe way, and actually the only safe way, of doing this, and it can be invoked multiple times.
The reason the docs say that you should call it only once is that there is usually one single entry point to the whole asynchronous application. That simplifies things and actually improves performance, because setting things up for an event loop also takes some time. But if there is no single loop available in your application, you should use multiple calls to asyncio.run() to run coroutines multiple times.
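To make this concrete, here is a small network-free sketch. The names fetch, gather_all, complete_batch and call_run_inside_loop are hypothetical, and asyncio.sleep stands in for real async work. It calls asyncio.run() several times from synchronous code and also shows the RuntimeError raised when asyncio.run() is called while a loop is already running.

```python
import asyncio


async def fetch(value):
    # Stand-in for real async work such as an HTTP request.
    await asyncio.sleep(0.01)
    return value * 2


async def gather_all(values):
    return await asyncio.gather(*(fetch(v) for v in values))


def complete_batch(values):
    # Each call creates a fresh event loop and closes it afterwards.
    return asyncio.run(gather_all(values))


async def call_run_inside_loop():
    # Calling asyncio.run() from inside a running loop is rejected.
    coro = gather_all([1])
    try:
        asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return True
    return False
```

Synchronous code can call complete_batch() as often as needed; inside a coroutine, await gather_all(values) directly instead.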
Is there any performance gain?
Besides discussing multiple calls to asyncio.run(), I want to address one more concern. In the comments, @jwal says:
asyncio is not parallel processing. Says so in the docs. [...] If you want parallel, run in a separate processes on a computer with a separate CPU core, not a separate thread, not a separate event loop.
This suggests that asyncio is not suitable for parallel processing, which can be misunderstood and lead to the conclusion that it will not result in a performance gain. That is not always true. Moreover, it is usually false!
So, any time you can delegate a job to an external process (not only a Python process; it can be a database worker process, an HTTP call, ideally any TCP socket call), you can get a performance gain using asyncio. In the vast majority of cases, when you are using a library that exposes an async interface, the author of that library made an effort to eventually await a result from a network/socket/process call. While the response from such a socket is not ready, the event loop is completely free to do other tasks. If the loop has more than one such task, performance improves.
A canonical example of such a case is making calls to HTTP endpoints. At some point there will be a network call, so the Python thread is free to do other work while waiting for data to appear in a TCP socket buffer. I have an example!
The example uses the httpx library to compare the performance of doing multiple calls to the OpenWeatherMap API. There are two functions:
get_weather_async()
get_weather_sync()
The first one makes 8 requests to an HTTP API, but schedules those requests to run cooperatively (concurrently, but not in parallel!) on an event loop using asyncio.gather().
The second one performs 8 synchronous requests in sequence.
To call the asynchronous function, I am actually using the asyncio.run() method. Moreover, I am using the timeit module to perform that call to asyncio.run() 4 times. So in a single Python application, asyncio.run() was called 4 times, just to challenge my previous considerations.
from time import time
import httpx
import asyncio
import timeit
from random import uniform


class AsyncWeatherApi:
    def __init__(
        self, base_url: str = "https://api.openweathermap.org/data/2.5"
    ) -> None:
        self.client: httpx.AsyncClient = httpx.AsyncClient(base_url=base_url)

    async def weather(self, lat: float, lon: float, app_id: str) -> dict:
        response = await self.client.get(
            "/weather",
            params={
                "lat": lat,
                "lon": lon,
                "appid": app_id,
                "units": "metric",
            },
        )
        response.raise_for_status()
        return response.json()


class SyncWeatherApi:
    def __init__(
        self, base_url: str = "https://api.openweathermap.org/data/2.5"
    ) -> None:
        self.client: httpx.Client = httpx.Client(base_url=base_url)

    def weather(self, lat: float, lon: float, app_id: str) -> dict:
        response = self.client.get(
            "/weather",
            params={
                "lat": lat,
                "lon": lon,
                "appid": app_id,
                "units": "metric",
            },
        )
        response.raise_for_status()
        return response.json()


def get_random_locations() -> list[tuple[float, float]]:
    """generate 8 random locations in +/-europe"""
    return [(uniform(45.6, 52.3), uniform(-2.3, 29.4)) for _ in range(8)]


async def get_weather_async(locations: list[tuple[float, float]]):
    api = AsyncWeatherApi()
    return await asyncio.gather(
        *[api.weather(lat, lon, api_key) for lat, lon in locations]
    )


def get_weather_sync(locations: list[tuple[float, float]]):
    api = SyncWeatherApi()
    return [api.weather(lat, lon, api_key) for lat, lon in locations]


api_key = "secret"


def time_async_job(repeat: int = 1):
    locations = get_random_locations()

    def run():
        return asyncio.run(get_weather_async(locations))

    duration = timeit.Timer(run).timeit(repeat)
    print(
        f"[ASYNC] In {duration}s: done {len(locations)} API calls, all"
        f" repeated {repeat} times"
    )


def time_sync_job(repeat: int = 1):
    locations = get_random_locations()

    def run():
        return get_weather_sync(locations)

    duration = timeit.Timer(run).timeit(repeat)
    print(
        f"[SYNC] In {duration}s: done {len(locations)} API calls, all repeated"
        f" {repeat} times"
    )


if __name__ == "__main__":
    time_sync_job(4)
    time_async_job(4)
At the end, a comparison of performance was printed. It says:
[SYNC] In 5.5580058859995916s: done 8 API calls, all repeated 4 times
[ASYNC] In 2.865574334995472s: done 8 API calls, all repeated 4 times
Those 4 repetitions were just to show that you can safely run asyncio.run() multiple times. They actually had a destructive impact on measuring the performance of the asynchronous HTTP calls, because all 32 requests were run in four synchronous batches of 8 asynchronous tasks. Just to compare, the performance of one batch of 32 requests:
[SYNC] In 4.373898585996358s: done 32 API calls, all repeated 1 times
[ASYNC] In 1.5169846520002466s: done 32 API calls, all repeated 1 times
So yes, it can, and usually will, result in a performance gain, provided a proper async library is used (if a library exposes an async API, it usually does so intentionally, knowing that there will be a network call somewhere).
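For readers without an API key, the same effect can be reproduced offline. This is only a sketch under assumptions: fake_request is a hypothetical stand-in that sleeps for 50 ms instead of making an HTTP call, so the absolute numbers mean nothing; only the comparison between the two schedules does.

```python
import asyncio
import time


async def fake_request(delay=0.05):
    # Stand-in for an awaited network call.
    await asyncio.sleep(delay)
    return delay


async def sequential(n):
    # Awaits each request before starting the next one.
    return [await fake_request() for _ in range(n)]


async def concurrent(n):
    # Starts all requests, then waits for all of them together.
    return await asyncio.gather(*(fake_request() for _ in range(n)))


def timed(coro_fn, n):
    start = time.perf_counter()
    asyncio.run(coro_fn(n))
    return time.perf_counter() - start
```

With 8 requests, timed(sequential, 8) should take roughly 8 times the delay, while timed(concurrent, 8) should take roughly one delay.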

Correct use of streamz with websocket

I am trying to figure out a correct way of processing streaming data using streamz. My streaming data is loaded using websocket-client, after which I do this:
from streamz import Stream
from tornado import gen
from tornado.ioloop import IOLoop
from websocket import create_connection

# open a stream and push updates into the stream
stream = Stream()
# establish a connection
ws = create_connection("ws://localhost:8765")

# get continuous updates
async def f():
    while True:
        await gen.sleep(0.001)
        data = ws.recv()
        stream.emit(data)

IOLoop.current().add_callback(f)
While this works, I find that my stream is not able to keep pace with the streaming data (so the data I see in the stream is several seconds behind the streaming data, which is both high volume and high frequency). I tried setting the gen.sleep(0.001) to a smaller value (removing it completely halts the jupyter lab), but the problem remains.
Is this a correct way of connecting streamz with streaming data using websocket?
I don't think websocket-client provides an async API, so ws.recv() is blocking the event loop.
You should use an async websocket client, such as the one Tornado provides:
from tornado import gen
from tornado.websocket import websocket_connect

async def f():
    # websocket_connect() returns an awaitable, so the connection
    # has to be awaited inside a coroutine
    ws = await websocket_connect("ws://localhost:8765")
    while True:
        data = await ws.read_message()
        if data is None:
            break
        else:
            await stream.emit(data)
            # considering you're receiving data from a localhost
            # socket, it will be really fast, and the `await`
            # statement above won't pause the while-loop for
            # enough time for the event loop to have chance to
            # run other things.
            # Therefore, sleep for a small time to suspend the
            # while-loop.
            await gen.sleep(0.0001)
You don't need to sleep if you're receiving/sending data from/to a remote connection which will be slow enough to suspend the while loop at await statements.
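The effect of a blocking receive on the event loop can be illustrated without any websocket at all. In this hedged sketch everything is a stand-in: time.sleep plays the role of the blocking ws.recv(), asyncio.sleep plays the role of an awaited read_message(), and heartbeat is a hypothetical background task that represents whatever else the loop should be doing (for example, processing the stream).

```python
import asyncio
import time


async def heartbeat(ticks, stop):
    # A background task that would like to tick roughly every 10 ms.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def receive_loop(blocking, messages=5, latency=0.05):
    ticks = []
    stop = asyncio.Event()
    task = asyncio.create_task(heartbeat(ticks, stop))
    for _ in range(messages):
        if blocking:
            time.sleep(latency)      # stands in for a blocking ws.recv()
            await asyncio.sleep(0)   # brief yield, like the tiny gen.sleep()
        else:
            await asyncio.sleep(latency)  # stands in for await ws.read_message()
    stop.set()
    await task
    return len(ticks)
```

Running asyncio.run(receive_loop(True)) should report far fewer heartbeat ticks than asyncio.run(receive_loop(False)), because the loop only gets control in the short gaps between blocking calls.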

Gathering coin volumes - Is my code running asynchronously?

I'm fairly new to programming in Python; I've been programming for about half a year. I've decided to try to build a functional trading bot. While trying to code this bot, I stumbled upon the asyncio module. I would really like to understand the module better, but it's hard to find any simple tutorials or documentation about asyncio.
For my script I'm gathering the volume per coin. This works perfectly, but it takes a really long time to gather all the volumes. I would like to ask if my script is running synchronously, and if so, how do I fix this? I'm using an API wrapper to communicate with the Binance Exchange.
import binance
import asyncio
import time

s = time.time()
names = [name for name in binance.ticker_prices()]  # Gathering all the coin names
loop = asyncio.get_event_loop()

async def get_volume(name):
    async def get_data():
        return binance.ticker_24hr(name)  # Returns per coin a dict of the data of the last 24hr
    data = await get_data()
    return (name, data['volume'])

tasks = [asyncio.ensure_future(get_volume(name)) for name in names]
results = loop.run_until_complete(asyncio.gather(*tasks))
print('Total time:', time.time() - s)
Since binance.ticker_24hr does not look like a coroutine, it is almost certainly blocking the event loop and therefore preventing asyncio.gather from doing its job. As a quick fix, you can use run_in_executor to run the blocking function in a separate thread:
async def get_volume(name):
    loop = asyncio.get_event_loop()
    data = await loop.run_in_executor(None, binance.ticker_24hr, name)
    return name, data['volume']
This will work just fine for a reasonable number of parallel tasks. The downside is that it uses threads, so it might not scale to a huge number of parallel requests (or it would require unnecessary waiting). The correct solution in the long run is to use a library that natively supports asyncio.
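Here is a self-contained sketch of that quick fix that runs without a Binance account. The function blocking_ticker and its fake volume values are hypothetical stand-ins for binance.ticker_24hr; time.sleep simulates the blocking HTTP round trip.

```python
import asyncio
import time


def blocking_ticker(name):
    # Stand-in for a blocking call such as binance.ticker_24hr(name).
    time.sleep(0.1)
    return {"symbol": name, "volume": str(len(name))}  # fake data


async def get_volume(name):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    data = await loop.run_in_executor(None, blocking_ticker, name)
    return name, data["volume"]


async def get_all_volumes(names):
    # gather() keeps the results in the same order as the input names.
    return await asyncio.gather(*(get_volume(n) for n in names))


def timed_run(names):
    start = time.perf_counter()
    results = asyncio.run(get_all_volumes(names))
    return results, time.perf_counter() - start
```

For a handful of symbols the total time should be close to one simulated request rather than the sum of all of them, since each blocking call waits in its own worker thread.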
Maarten, firstly, you are calling get_ticker for every symbol, which means you're making many unnecessary requests. If you call it without a symbol value, you get all tickers in one request. This also removes the need for any loops or async, if you aren't performing other tasks. It looks like the binance library you're using doesn't support this. You can use python-binance to do it:
return client.get_ticker()
That said, I've been testing an asyncio version of python-binance. It's currently in a feature branch if you want to try it.
pip install git+https://github.com/sammchardy/python-binance@feature/asyncio
Include the asyncio version of the client and initialise the client:
from binance.client_async import AsyncClient as Client
client = Client("<api_key>", "<api_secret>")
Then you can await the calls to get the ticker for a particular symbol
return await client.get_ticker(symbol=name)
Or for all symbol tickers don't pass the symbol parameter
return await client.get_ticker()
Hope that helps

How to perform asynchronous tasks in fixed interval of time

The goal is to perform an async task (file read, network operation) without blocking the code. We have multiple such async tasks that need to be executed at fixed intervals of time. Here is some pseudocode to demonstrate this.
# the async tasks should be performed in parallel
# provide me with a return value after the task is complete, or they can have a callback or any other mechanism of communication
async_task_1 = perform_async(1)
# now I need to wait fix amount of time before the async task 2
sleep(5)
# this also similar to the tasks one in nature
async_task_2 = perform_async(2)
# finally do something with the result
I've read that in Ruby I have 2 options: forking and threading. There is also something called Fiber. I also read that due to the GIL in standard Ruby, I won't be able to make much use of threading. I still want to stick to standard Ruby.
I've written some parallel code previously in OMP and CUDA, but I've never had a chance to do that in Ruby.
Can you suggest how to achieve this?
I would recommend the concurrent-ruby gem with its async feature. This will work great as long as your tasks are IO bound (as you said they are).
The gem provides an Async mixin to perform your tasks. To wait between your 2 async calls, you can literally use the sleep function:
require 'concurrent'

class AsyncCalls
  include Concurrent::Async

  def perform_task(params)
    # IO bound task
  end
end

AsyncCalls.new.async.perform_task("param")
sleep 5
AsyncCalls.new.async.perform_task("other param")
