Asyncio with multiprocessing: Producers-Consumers model

I am trying to retrieve stock prices and process them as they come in. I am a beginner with concurrency, but I thought this setup was suited to an asyncio producers-consumers model in which each producer retrieves a stock price and passes it to the consumers via a queue. The consumers do the stock price processing in parallel (multiprocessing), since the work is CPU intensive, so multiple consumers would already be working while some producers are still retrieving data. In addition, I would like to implement a step in which, if a consumer finds that the stock price it's working on is invalid, we spawn a new consumer job for that stock.
So far, I have the following toy code that sort of gets me there, but it has issues with my process_data function (the consumer).
from concurrent.futures import ProcessPoolExecutor
import asyncio
import random
import time
random.seed(444)
#producers
async def retrieve_data(ticker, q):
    '''
    Pretend we're using aiohttp to retrieve stock prices from a URL.
    Place a tuple of stock ticker and price into the asyncio queue as it becomes available.
    '''
    start = time.perf_counter()  # start timer
    await asyncio.sleep(random.randint(4, 8))  # pretend we're calling some URL
    price = random.randint(1, 100)  # pretend this is the price we retrieved
    print(f'{ticker} : {price} retrieved in {time.perf_counter() - start:0.1f} seconds')
    await q.put((ticker, price))  # place the price into the asyncio queue
#consumers
async def process_data(q):
    while True:
        data = await q.get()
        print(f"processing: {data}")
        with ProcessPoolExecutor() as executor:
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(executor, data_processor, data)
            # if output of data_processing failed, send ticker back to queue to retrieve data again
            if not result[2]:
                print(f'{result[0]} data invalid. Retrieving again...')
                await retrieve_data(result[0], q)  # add a new task
                q.task_done()  # end this task
            else:
                q.task_done()  # so that q.join() knows when the task is done
async def main(tickers):
    q = asyncio.Queue()
    producers = [asyncio.create_task(retrieve_data(ticker, q)) for ticker in tickers]
    consumers = [asyncio.create_task(process_data(q))]
    await asyncio.gather(*producers)
    await q.join()  # implicitly awaits the consumers, too; blocks until all items in the queue have been received and processed
    for c in consumers:
        c.cancel()  # cancel the consumer tasks, which would otherwise hang and wait endlessly for additional queue items to appear
'''
RUN IN JUPYTER NOTEBOOK
'''
start = time.perf_counter()
tickers = ['AAPL', 'AMZN', 'TSLA', 'C', 'F']
await main(tickers)
print(f'total elapsed time: {time.perf_counter() - start:0.2f}')
'''
RUN IN TERMINAL
'''
# if __name__ == "__main__":
#     start = time.perf_counter()
#     tickers = ['AAPL', 'AMZN', 'TSLA', 'C', 'F']
#     asyncio.run(main(tickers))
#     print(f'total elapsed time: {time.perf_counter() - start:0.2f}')
The data_processor() function below, called by process_data() above, needs to be in a different cell in a Jupyter notebook, or in a separate module (from what I understand, to avoid a PicklingError).
from multiprocessing import current_process

def data_processor(data):
    ticker = data[0]
    price = data[1]
    print(f'Started {ticker} - {current_process().name}')
    start = time.perf_counter()  # start time counter
    time.sleep(random.randint(4, 5))  # mimic some random processing time
    # pretend we're processing the price. Let the processing outcome be invalid if the price is an odd number
    if price % 2 == 0:
        is_valid = True
    else:
        is_valid = False
    print(f"{ticker}'s price {price} validity: --{is_valid}--"
          f' Elapsed time: {time.perf_counter() - start:0.2f} seconds')
    return (ticker, price, is_valid)
THE ISSUES
Instead of using Python's multiprocessing module, I used concurrent.futures' ProcessPoolExecutor, which I read is compatible with asyncio (What kind of problems (if any) would there be combining asyncio with multiprocessing?). But it seems that I have to choose between retrieving the output (result) of the function called by the executor and being able to run several subprocesses in parallel. With the construct below, the subprocesses run sequentially, not in parallel.
with ProcessPoolExecutor() as executor:
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(executor, data_processor, data)
Removing result = await in front of loop.run_in_executor(executor, data_processor, data) allows several consumers to run in parallel, but then I can't collect their results in the parent process. I need the await for that, and then of course the rest of the code block will fail.
How can I have these subprocesses run in parallel and still provide their output? Perhaps this needs a different construct, or something other than the producers-consumers model?
The part of the code that requests invalid stock prices to be retrieved again works (provided I can get the result from above), but it is run in the consumer that calls it and blocks new consumers from being created until the request is fulfilled. Is there a way to address this?
# if output of data_processing failed, send ticker back to queue to retrieve data again
if not result[2]:
    print(f'{result[0]} data invalid. Retrieving again...')
    await retrieve_data(result[0], q)  # add a new task
    q.task_done()  # end this task
else:
    q.task_done()  # so that q.join() knows when the task is done

But it seems that I have to choose between retrieving the output (result) of the function called by the executor and being able to run several subprocesses in parallel.
Luckily this is not the case: you can also use asyncio.gather() to wait for multiple items at once. But since you obtain data items one by one from the queue, you don't have a batch of items to process. The simplest solution is to just start multiple consumers. Replace
# the single-element list looks suspicious anyway
consumers = [asyncio.create_task(process_data(q))]
with:
# now we have an actual list
consumers = [asyncio.create_task(process_data(q)) for _ in range(16)]
Each consumer will wait for an individual task to finish, but that's ok because you'll have a whole pool of them working in parallel, which is exactly what you wanted.
Also, you might want to make executor a global variable and not use with, so that the process pool is shared by all consumers and lasts as long as the program. That way consumers will reuse the worker processes already spawned instead of having to spawn a new process for each job received from the queue. (That's the whole point of having a process "pool".) In that case you probably want to add executor.shutdown() at the point in the program where you don't need the executor anymore.
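For illustration, here is a minimal sketch of process_data along those lines. It is not the original code: it assumes the retrieve_data and data_processor functions defined above and simply hoists the executor out of the loop.
from concurrent.futures import ProcessPoolExecutor

# one pool shared by every consumer, created once at module level
executor = ProcessPoolExecutor()

async def process_data(q):
    loop = asyncio.get_running_loop()
    while True:
        data = await q.get()
        print(f"processing: {data}")
        # reuse the already-spawned worker processes for every job
        result = await loop.run_in_executor(executor, data_processor, data)
        if not result[2]:
            print(f'{result[0]} data invalid. Retrieving again...')
            await retrieve_data(result[0], q)
        q.task_done()

# once q.join() has returned in main() and no more jobs are coming:
# executor.shutdown()
The only behavioural change here is when worker processes are created and destroyed; the queue handling is the same as in the question.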

Related

Python asyncio: awaiting a future you don't have yet

Imagine that I have a main program which starts many async activities which all wait on queues to do jobs, and then on ctrl-C properly closes them all down: it might look something like this:
async def run_act1_forever():
    # this is the async queue loop
    while True:
        job = await inputQueue1.get()
        # do something with this incoming job

def run_activity_1(loop):
    # run the async queue loop as a task
    coro = loop.create_task(run_act1_forever())
    return coro

def mainprogram():
    loop = asyncio.get_event_loop()
    act1 = run_activity_1(loop)
    # also start act2, act3, etc here
    try:
        loop.run_forever()
    except KeyboardInterrupt:
        pass
    finally:
        act1.cancel()
        # also act2.cancel(), act3.cancel(), etc
        loop.close()
This all works fine. However, starting up activity 1 is actually more complex than this; it happens in three parts. Part 1 is to wait on the queue until a particular job comes in, one time; part 2 is a synchronous part which has to run in a thread with run_in_executor, one time, and then part 3 is the endless waiting on the queue for jobs as above. How do I structure this? My initial thought was:
async def run_act1_forever():
    # this is the async queue loop
    while True:
        job = await inputQueue1.get()
        # do something with this incoming job

async def run_act1_step1():
    while True:
        job = await inputQueue1.get()
        # good, we have handled that first task; we're done
        break

def run_act1_step2():
    # note: this is sync, not async, so it's in a thread
    # do whatever, here, and then exit when done
    time.sleep(5)

def run_activity_1(loop):
    # run step 1 as a task
    step1 = loop.create_task(run_act1_step1())
    # ERROR! See below
    # now run the sync step 2 in a thread
    loop.run_in_executor(None, run_act1_step2)
    # finally, run the async queue loop as a task
    coro = loop.create_task(run_act1_forever())
    return coro

def mainprogram():
    loop = asyncio.get_event_loop()
    act1 = run_activity_1(loop)
    # also start act2, act3, etc here
    try:
        loop.run_forever()
    except KeyboardInterrupt:
        pass
    finally:
        act1.cancel()
        # also act2.cancel(), act3.cancel(), etc
        loop.close()
but this does not work, because at the point where we say "ERROR!", we need to await the step1 task and we never do. We can't await it, because run_activity_1 is not an async function. So... what should I do here?
I thought about getting the Future back from calling run_act1_step1() and then using future.add_done_callback to handle running steps 2 and 3. However, if I do that, then run_activity_1() can't return the future generated by run_act1_forever(), which means that mainprogram() can't cancel that run_act1_forever() task.
I thought of generating an "empty" Future in run_activity_1() and returning that, and then making that empty Future "chain" to the Future returned by run_act1_forever(). But Python asyncio doesn't support chaining Futures.
You say that things are difficult because run_activity_1 is not an async function, but don't really detail why it can't be async.
async def run_activity_1(loop):
    await run_act1_step1()
    await loop.run_in_executor(None, run_act1_step2)
    await run_act1_forever()
The returned coroutine won't be the same as the one returned by run_act1_forever(), but cancellation should propagate if you've got as far as executing that step.
With this change, run_activity_1 is no longer returning a task, so the invocation inside mainprogram would need to change to:
act1 = loop.create_task(run_activity_1(loop))
I think you were on the right track when you said, "I thought about getting the Future back from calling run_act1_step1() and then using future.add_done_callback to handle running steps 2 and 3." That's the logical way to structure this application. You have to manage the various returned objects correctly, but a small class solves this problem.
Here is a program similar to your second code snippet. It runs (tested with Python3.10) and handles Ctrl-C gracefully.
Python3.10 issues a deprecation warning when the function asyncio.get_event_loop() is called without a running loop, so I avoided doing that.
Activities.run() creates task1, then attaches a done_callback that starts task2 and the rest of the activities. The Activities object keeps track of task1 and task2 so they can be cancelled. The main program keeps a reference to Activities, and calls cancel_gracefully() to do the right thing, depending on how far the script progressed through the sequence of start-up activities.
Some care needs to be taken to catch the CancelledError exceptions; otherwise stuff gets printed on the console when the program terminates.
The important difference between this program and your second code snippet is that this program immediately stores task1 and task2 in variables so they can be accessed later. Therefore they can be cancelled any time after their creation. The done_callback trick is used to launch all the steps in the proper order.
#! python3.10
import asyncio
import time
async def run_act1_forever():
    # this is the async queue loop
    while True:
        await asyncio.sleep(1.0)
        # job = await inputQueue1.get()
        # do something with this incoming job
        print("Act1 forever")

async def run_act1_step1():
    while True:
        await asyncio.sleep(1.0)
        # job = await inputQueue1.get()
        # good, we have handled that first task; we're done
        break
    print("act1 step1 finished")

def run_act1_step2():
    # note: this is sync, not async, so it's in a thread
    # do whatever, here, and then exit when done
    time.sleep(5)
    print("Step2 finished")
class Activities:
    def __init__(self, loop):
        self.loop = loop
        self.task1: asyncio.Task = None
        self.task2: asyncio.Task = None

    def run(self):
        # run step 1 as a task
        self.task1 = self.loop.create_task(run_act1_step1())
        self.task1.add_done_callback(self.run2)
        # also start act2, act3, etc here

    def run2(self, fut):
        try:
            if fut.exception() is not None:  # do nothing if task1 failed
                return
        except asyncio.CancelledError:  # or if it was cancelled
            return
        # now run the sync step 2 in a thread
        self.loop.run_in_executor(None, run_act1_step2)
        # finally, run the async queue loop as a task
        self.task2 = self.loop.create_task(run_act1_forever())

    async def cancel_gracefully(self):
        if self.task2 is not None:
            # in this case, task1 has already finished without error
            self.task2.cancel()
            try:
                await self.task2
            except asyncio.CancelledError:
                pass
        elif self.task1 is not None:
            self.task1.cancel()
            try:
                await self.task1
            except asyncio.CancelledError:
                pass
        # also act2.cancel(), act3.cancel(), etc
def mainprogram():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    acts = Activities(loop)
    loop.call_soon(acts.run)
    try:
        loop.run_forever()
    except KeyboardInterrupt:
        pass
    loop.run_until_complete(acts.cancel_gracefully())

if __name__ == "__main__":
    mainprogram()
You can do this with a combination of threading events and asyncio events. You'll need two events, one to signal the first item has arrived. The thread will wait on this event, so it needs to be a threading Event. You'll also need one to signal the thread is finished. Your run_act1_forever coroutine will await this, so it will need to be an asyncio Event. You can then return the task for run_act1_forever normally and cancel it as you need.
Note that when setting the asyncio event from the separate thread you'll need to use loop.call_soon_threadsafe as asyncio Events are not thread safe.
import asyncio
import time
import threading
import functools
from asyncio import Queue, AbstractEventLoop
async def run_act1_forever(inputQueue1: Queue,
                           thread_done_event: asyncio.Event):
    await thread_done_event.wait()
    print('running forever')
    while True:
        job = await inputQueue1.get()

async def run_act1_step1(inputQueue1: Queue,
                         first_item_event: threading.Event):
    print('Waiting for queue item')
    job = await inputQueue1.get()
    print('Setting event')
    first_item_event.set()

def run_act1_step2(loop: AbstractEventLoop,
                   first_item_event: threading.Event,
                   thread_done_event: asyncio.Event):
    print('Waiting for event...')
    first_item_event.wait()
    print('Got event, processing...')
    time.sleep(5)
    loop.call_soon_threadsafe(thread_done_event.set)

def run_activity_1(loop):
    inputQueue1 = asyncio.Queue(loop=loop)
    first_item_event = threading.Event()
    thread_done_event = asyncio.Event(loop=loop)
    loop.create_task(run_act1_step1(inputQueue1, first_item_event))
    inputQueue1.put_nowait('First item to test the code')
    loop.run_in_executor(None, functools.partial(run_act1_step2,
                                                 loop,
                                                 first_item_event,
                                                 thread_done_event))
    return loop.create_task(run_act1_forever(inputQueue1, thread_done_event))
def mainprogram():
    loop = asyncio.new_event_loop()
    act1 = run_activity_1(loop)
    # also start act2, act3, etc here
    try:
        loop.run_forever()
    except KeyboardInterrupt:
        pass
    finally:
        act1.cancel()
        # also act2.cancel(), act3.cancel(), etc
        loop.close()
mainprogram()

Iterate through asyncio loop

I am very new to aiohttp and asyncio, so apologies for my ignorance up front. I am having difficulties with the event loop portion of the documentation and don't think my code below is executing asynchronously. I am trying to take all combinations of two lists via itertools and POST each one as XML. A more full-blown version using the requests module is listed here; however, that is not ideal, as I potentially need to POST 1000+ requests at a time. Here is a sample of how it looks now:
import aiohttp
import asyncio
import itertools
skillid = ['7715','7735','7736','7737','7738','7739','7740','7741','7742','7743','7744','7745','7746','7747','7748' ,'7749','7750','7751','7752','7753','7754','7755','7756','7757','7758','7759','7760','7761','7762','7763','7764','7765','7766','7767','7768','7769','7770','7771','7772','7773','7774','7775','7776','7777','7778','7779','7780','7781','7782','7783','7784']
agent= ['5124','5315','5331','5764','6049','6076','6192','6323','6669','7690','7716']
url = 'https://url'
user = 'user'
password = 'pass'
headers = {
'Content-Type': 'application/xml'
}
async def main():
    async with aiohttp.ClientSession() as session:
        for x in itertools.product(agent, skillid):
            payload = "<operation><operationType>update</operationType><refURLs><refURL>/unifiedconfig/config/agent/" + x[0] + "</refURL></refURLs><changeSet><agent><skillGroupsRemoved><skillGroup><refURL>/unifiedconfig/config/skillgroup/" + x[1] + "</refURL></skillGroup></skillGroupsRemoved></agent></changeSet></operation>"
            async with session.post(url, auth=aiohttp.BasicAuth(user, password), data=payload, headers=headers) as resp:
                print(resp.status)
                print(await resp.text())
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I see that coroutines can be used, but I am not sure that applies, as there is only a single task to execute. Any clarification is appreciated.
Because you're making a request and then immediately await-ing on it, you are only making one request at a time. If you want to parallelize everything, you need to separate making the request from waiting for the response, and you need to use something like asyncio.gather to wait for the requests in bulk.
In the following example, I've modified your code to connect to a local httpbin instance for testing; I'm making requests to the /delay/<value> endpoint so that each request takes a random amount of time to complete.
The theory of operation here is:
1. Move the request code into the asynchronous one_request function, which we use to build an array of tasks.
2. Use asyncio.gather to run all the tasks at once.
3. The one_request function returns an (agent, skillid, response) tuple, so that when we iterate over the responses we can tell which combination of parameters resulted in the given response.
import aiohttp
import asyncio
import itertools
import random
skillid = [
"7715", "7735", "7736", "7737", "7738", "7739", "7740", "7741", "7742",
"7743", "7744", "7745", "7746", "7747", "7748", "7749", "7750", "7751",
"7752", "7753", "7754", "7755", "7756", "7757", "7758", "7759", "7760",
"7761", "7762", "7763", "7764", "7765", "7766", "7767", "7768", "7769",
"7770", "7771", "7772", "7773", "7774", "7775", "7776", "7777", "7778",
"7779", "7780", "7781", "7782", "7783", "7784",
]
agent = [
"5124", "5315", "5331", "5764", "6049", "6076", "6192", "6323", "6669",
"7690", "7716",
]
user = 'user'
password = 'pass'
headers = {
'Content-Type': 'application/xml'
}
async def one_request(session, agent, skillid):
    # I'm setting `url` here because I want a random parameter for
    # each request. You would probably just set this once globally.
    delay = random.randint(0, 10)
    url = f'http://localhost:8787/delay/{delay}'
    payload = (
        "<operation>"
        "<operationType>update</operationType>"
        "<refURLs>"
        f"<refURL>/unifiedconfig/config/agent/{agent}</refURL>"
        "</refURLs>"
        "<changeSet>"
        "<agent>"
        "<skillGroupsRemoved><skillGroup>"
        f"<refURL>/unifiedconfig/config/skillgroup/{skillid}</refURL>"
        "</skillGroup></skillGroupsRemoved>"
        "</agent>"
        "</changeSet>"
        "</operation>"
    )
    # This shows when the task actually executes.
    print('req', agent, skillid)
    async with session.post(
            url, auth=aiohttp.BasicAuth(user, password),
            data=payload, headers=headers) as resp:
        return (agent, skillid, await resp.text())
async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        # Add tasks to the `tasks` array
        for x in itertools.product(agent, skillid):
            task = asyncio.ensure_future(one_request(session, x[0], x[1]))
            tasks.append(task)
        print(f'making {len(tasks)} requests')
        # Run all the tasks and wait for them to complete. Return
        # values will end up in the `responses` list.
        responses = await asyncio.gather(*tasks)
        # Just print everything out.
        print(responses)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
The above code results in about 561 requests, and runs in about 30
seconds with the random delay I've introduced.
This code runs all the requests at once. If you wanted to limit the
maximum number of concurrent requests, you could introduce a
Semaphore to make one_request block if there were too many active requests.
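For example, a minimal sketch of that idea (the limit of 50 and the limited_request wrapper name are made up for illustration, not part of the original answer):
# cap the number of in-flight requests; 50 is an arbitrary example value
semaphore = asyncio.Semaphore(50)

async def limited_request(session, agent, skillid):
    # at most 50 one_request calls are active at any moment
    async with semaphore:
        return await one_request(session, agent, skillid)
In main() you would then create tasks for limited_request instead of one_request; everything else stays the same.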
If you wanted to process responses as they arrived, rather than
waiting for everything to complete, you could investigate the
asyncio.wait method instead.
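As a rough sketch of that approach, you could replace the asyncio.gather() call inside main() with a loop over asyncio.as_completed (a convenient alternative to calling asyncio.wait repeatedly):
# this loop lives inside main(), in place of the gather call,
# and handles each response as soon as it is ready
for fut in asyncio.as_completed(tasks):
    agent_id, skill, body = await fut
    print('done', agent_id, skill, len(body))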

Python asyncio - Increase the value of Semaphore

I am making use of aiohttp in one of my projects and would like to limit the number of requests made per second. I am using asyncio.Semaphore to do that. My challenge is that I may want to increase or decrease the number of requests allowed per second.
For example:
limit = asyncio.Semaphore(10)

async with limit:
    async with aiohttp.request(...):
        ...
    await asyncio.sleep(1)
This works great. That is, it limits aiohttp.request to 10 concurrent requests in a second. However, I may want to increase or decrease the Semaphore._value. I can do limit._value = 20, but I am not sure if this is the right approach or whether there is another way to do it.
Accessing the private _value attribute is not the right approach for at least two reasons: one, the attribute is private and can be removed, renamed, or change meaning in a future version without notice; and two, increasing the limit won't be noticed by a semaphore that already has waiters.
Since asyncio.Semaphore doesn't support modifying the limit dynamically, you have two options: implementing your own Semaphore class that does support it, or not using a Semaphore at all. The latter is probably easier as you can always replace a semaphore-enforced limit with a fixed number of worker tasks that receive jobs through a queue. Assuming you currently have code that looks like this:
async def fetch(limit, arg):
    async with limit:
        # your actual code here
        return result

async def tweak_limit(limit):
    # here you'd like to be able to increase the limit
    ...

async def main():
    limit = asyncio.Semaphore(10)
    asyncio.create_task(tweak_limit(limit))
    results = await asyncio.gather(*[fetch(limit, x) for x in range(1000)])
You could express it without a semaphore by creating workers in advance and giving them work to do:
async def fetch_task(queue, results):
    while True:
        arg = await queue.get()
        # your actual code here
        results.append(result)
        queue.task_done()

async def main():
    # fill the queue with jobs for the workers
    queue = asyncio.Queue()
    for x in range(1000):
        await queue.put(x)
    # create the initial pool of workers
    results = []
    workers = [asyncio.create_task(fetch_task(queue, results))
               for _ in range(10)]
    asyncio.create_task(tweak_limit(workers, queue, results))
    # wait for workers to process the entire queue
    await queue.join()
    # finally, cancel the now-idle worker tasks
    for w in workers:
        w.cancel()
    # results are now available
The tweak_limit() function can now increase the limit simply by spawning new workers:
async def tweak_limit(workers, queue, results):
    while True:
        await asyncio.sleep(1)
        if need_more_workers:
            workers.append(asyncio.create_task(fetch_task(queue, results)))
Using workers and queues is a more complex solution: you have to think about issues like setup, teardown, exception handling, backpressure, and so on.
A Semaphore can be implemented with a Lock. If you don't mind a bit of inefficiency (you will see why), here's a simple implementation of a dynamic-value semaphore:
class DynamicSemaphore:
    def __init__(self, value=1):
        self._lock = asyncio.Lock()
        if value < 0:
            raise ValueError("Semaphore initial value must be >= 0")
        self.value = value

    async def __aenter__(self):
        await self.acquire()
        return None

    async def __aexit__(self, exc_type, exc, tb):
        self.release()

    def locked(self):
        return self.value == 0

    async def acquire(self):
        async with self._lock:
            while self.value <= 0:
                await asyncio.sleep(0.1)
            self.value -= 1
            return True

    def release(self):
        self.value += 1
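A small usage sketch, assuming the class above (the names and numbers are arbitrary):
limit = DynamicSemaphore(10)

async def fetch(arg):
    async with limit:
        ...  # your actual request code here

# later, anywhere in the same event loop, you can raise the limit:
limit.value += 10
Because acquire() polls every 0.1 seconds while the count is exhausted, waiters wake up with a little latency; that is the inefficiency mentioned above.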

Asyncio task vs coroutine

Reading the asyncio documentation, I realize that I don't understand a very basic and fundamental aspect: the difference between awaiting a coroutine directly, and awaiting the same coroutine when it's wrapped inside a task.
In the documentation examples the two calls to the say_after coroutine are running sequentially when awaited without create_task, and concurrently when wrapped in create_task. So I understand that this is basically the difference, and that it is quite an important one.
However what confuses me is that in the example code I read everywhere (for instance showing how to use aiohttp), there are many places where a (user-defined) coroutine is awaited (usually in the middle of some other user-defined coroutine) without being wrapped in a task, and I'm wondering why that is the case. What are the criteria to determine when a coroutine should be wrapped in a task or not?
What are the criteria to determine when a coroutine should be wrapped in a task or not?
You should use a task when you want your coroutine to effectively run in the background. The code you've seen just awaits the coroutines directly because it needs them running in sequence. For example, consider an HTTP client sending a request and waiting for a response:
# these two don't make too much sense in parallel
await session.send_request(req)
resp = await session.read_response()
There are situations when you want operations to run in parallel. In that case asyncio.create_task is the appropriate tool, because it turns over the responsibility to execute the coroutine to the event loop. This allows you to start several coroutines and sit idly while they execute, typically waiting for some or all of them to finish:
dl1 = asyncio.create_task(session.get(url1))
dl2 = asyncio.create_task(session.get(url2))
# run them in parallel and wait for both to finish
resp1 = await dl1
resp2 = await dl2
# or, shorter:
resp1, resp2 = await asyncio.gather(session.get(url1), session.get(url2))
As shown above, a task can be awaited as well. Just like awaiting a coroutine, that will block the current coroutine until the coroutine driven by the task has completed. In analogy to threads, awaiting a task is roughly equivalent to join()-ing a thread (except you get back the return value). Another example:
queue = asyncio.Queue()

# read output from a process in an infinite loop and
# put it in a queue
async def process_output(cmd, queue, identifier):
    proc = await asyncio.create_subprocess_shell(
        cmd, stdout=asyncio.subprocess.PIPE)
    while True:
        line = await proc.stdout.readline()
        await queue.put((identifier, line))

# create multiple workers that run in parallel and pour
# data from multiple sources into the same queue
asyncio.create_task(process_output("top -b", queue, "top"))
asyncio.create_task(process_output("vmstat 1", queue, "vmstat"))

while True:
    identifier, output = await queue.get()
    if identifier == 'top':
        ...
In summary, if you need the result of a coroutine in order to proceed, you should just await it without creating a task, i.e.:
# this is ok
resp = await session.read_response()
# unnecessary - it has the same effect, but it's
# less efficient
resp = await asyncio.create_task(session.read_response())
To continue with the threading analogy, creating a task just to await it immediately is like running t = Thread(target=foo); t.start(); t.join() instead of just foo() - inefficient and redundant.

Why does using asyncio.ensure_future for long jobs instead of await run so much quicker?

I am downloading JSONs from an API and am using the asyncio module. The crux of my question is: with the following event loop, implemented like this:
loop = asyncio.get_event_loop()
main_task = asyncio.ensure_future( klass.download_all() )
loop.run_until_complete( main_task )
and download_all() implemented as the following instance method of a class, which already has downloader objects created and available to it, and thus calls each respective download method:
async def download_all(self):
    """ Builds the coroutines, uses asyncio.wait, then sifts for those still pending, loops """
    ret = []
    async with aiohttp.ClientSession() as session:
        pending = []
        for downloader in self._downloaders:
            pending.append(asyncio.ensure_future(downloader.download(session)))
        while pending:
            dne, pnding = await asyncio.wait(pending)
            ret.extend([d.result() for d in dne])
            # Get all the tasks, cannot use "pnding"
            tasks = asyncio.Task.all_tasks()
            pending = [tks for tks in tasks if not tks.done()]
            # Exclude the one that we know hasn't ended yet (UGLY)
            pending = [t for t in pending if not t._coro.__name__ == self.download_all.__name__]
    return ret
Why is it that, in the downloaders' download methods, when I use asyncio.ensure_future instead of the await syntax, it runs much faster, that is, seemingly more "asynchronously", as I can see from the logs?
This works because of the way I detect all the tasks that are still pending, preventing the download_all method from completing and repeatedly calling asyncio.wait.
I thought that the await keyword allowed the event loop mechanism to do its thing and share resources efficiently? How come doing it this way is faster? Is there something wrong with it? For example:
async def download(self, session):
    async with session.request(self.method, self.url, params=self.params) as response:
        response_json = await response.json()
        # Not using await here, as I am "supposed" to
        asyncio.ensure_future(self.write(response_json, self.path))
        return response_json

async def write(self, res_json, path):
    # using aiofiles to write, but it doesn't (seem to?) support direct json
    # so converting to raw text first
    txt_contents = json.dumps(res_json, **self.json_dumps_kwargs)
    async with aiofiles.open(path, 'w') as f:
        await f.write(txt_contents)
With full code implemented and a real API, I was able to download 44 resources in 34 seconds, but when using await it took more than three minutes (I actually gave up as it was taking so long).
When you await in each iteration of the for loop, you wait for the download in every iteration before moving on.
When you use ensure_future, on the other hand, you don't: it creates tasks to download all the files, and then the second loop awaits all of them together.
