Iterate through asyncio loop - python-asyncio

I am very new to aiohttp and asyncio, so apologies for my ignorance up front. I am having difficulties with the event loop portion of the documentation and don't think my code below is executing asynchronously. I am trying to take all combinations of two lists via itertools and POST an XML payload for each one. A more full-blown version using the requests module is listed here, but that is not ideal, as I potentially need to POST 1000+ requests at a time. Here is a sample of how it looks now:
import aiohttp
import asyncio
import itertools
skillid = ['7715','7735','7736','7737','7738','7739','7740','7741','7742','7743','7744','7745','7746','7747','7748' ,'7749','7750','7751','7752','7753','7754','7755','7756','7757','7758','7759','7760','7761','7762','7763','7764','7765','7766','7767','7768','7769','7770','7771','7772','7773','7774','7775','7776','7777','7778','7779','7780','7781','7782','7783','7784']
agent= ['5124','5315','5331','5764','6049','6076','6192','6323','6669','7690','7716']
url = 'https://url'
user = 'user'
password = 'pass'
headers = {
    'Content-Type': 'application/xml'
}
async def main():
    async with aiohttp.ClientSession() as session:
        for x in itertools.product(agent, skillid):
            payload = "<operation><operationType>update</operationType><refURLs><refURL>/unifiedconfig/config/agent/" + x[0] + "</refURL></refURLs><changeSet><agent><skillGroupsRemoved><skillGroup><refURL>/unifiedconfig/config/skillgroup/" + x[1] + "</refURL></skillGroup></skillGroupsRemoved></agent></changeSet></operation>"
            async with session.post(url, auth=aiohttp.BasicAuth(user, password), data=payload, headers=headers) as resp:
                print(resp.status)
                print(await resp.text())
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I see that coroutines can be used, but I'm not sure that applies here, as there is only a single task to execute. Any clarification is appreciated.

Because you're making a request and then immediately await-ing on it, you are only making one request at a time. If you want to parallelize everything, you need to separate making the request from waiting for the response, and you need to use something like asyncio.gather to wait for the requests in bulk.
In the following example, I've modified your code to connect to a local httpbin instance for testing; I'm making requests to the /delay/<value> endpoint so that each request takes a random amount of time to complete.
The theory of operation here is:
Move the request code into the asynchronous one_request function, which we use to build an array of tasks.
Use asyncio.gather to run all the tasks at once.
The one_request function returns a (agent, skillid, response) tuple, so that when we iterate over the responses we can tell which combination of parameters resulted in the given response.
import aiohttp
import asyncio
import itertools
import random
skillid = [
"7715", "7735", "7736", "7737", "7738", "7739", "7740", "7741", "7742",
"7743", "7744", "7745", "7746", "7747", "7748", "7749", "7750", "7751",
"7752", "7753", "7754", "7755", "7756", "7757", "7758", "7759", "7760",
"7761", "7762", "7763", "7764", "7765", "7766", "7767", "7768", "7769",
"7770", "7771", "7772", "7773", "7774", "7775", "7776", "7777", "7778",
"7779", "7780", "7781", "7782", "7783", "7784",
]
agent = [
"5124", "5315", "5331", "5764", "6049", "6076", "6192", "6323", "6669",
"7690", "7716",
]
user = 'user'
password = 'pass'
headers = {
'Content-Type': 'application/xml'
}
async def one_request(session, agent, skillid):
    # I'm setting `url` here because I want a random parameter for
    # each request. You would probably just set this once globally.
    delay = random.randint(0, 10)
    url = f'http://localhost:8787/delay/{delay}'
    payload = (
        "<operation>"
        "<operationType>update</operationType>"
        "<refURLs>"
        f"<refURL>/unifiedconfig/config/agent/{agent}</refURL>"
        "</refURLs>"
        "<changeSet>"
        "<agent>"
        "<skillGroupsRemoved><skillGroup>"
        f"<refURL>/unifiedconfig/config/skillgroup/{skillid}</refURL>"
        "</skillGroup></skillGroupsRemoved>"
        "</agent>"
        "</changeSet>"
        "</operation>"
    )
    # This shows when the task actually executes.
    print('req', agent, skillid)
    async with session.post(
            url, auth=aiohttp.BasicAuth(user, password),
            data=payload, headers=headers) as resp:
        return (agent, skillid, await resp.text())
async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        # Add tasks to the `tasks` array
        for x in itertools.product(agent, skillid):
            task = asyncio.ensure_future(one_request(session, x[0], x[1]))
            tasks.append(task)
        print(f'making {len(tasks)} requests')
        # Run all the tasks and wait for them to complete. Return
        # values will end up in the `responses` list.
        responses = await asyncio.gather(*tasks)
    # Just print everything out.
    print(responses)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
The above code makes 561 requests and runs in about 30 seconds with the random delay I've introduced.
This code runs all the requests at once. If you wanted to limit the
maximum number of concurrent requests, you could introduce a
Semaphore to make one_request block if there were too many active requests.
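For example, here is a minimal sketch of that Semaphore idea (my own illustration, not part of the original answer; the cap of 50 and the bounded_request wrapper name are arbitrary):
MAX_CONCURRENT = 50  # arbitrary cap on in-flight requests
async def bounded_request(semaphore, session, agent, skillid):
    # At most MAX_CONCURRENT coroutines get past this point at once;
    # the rest wait here until a slot frees up.
    async with semaphore:
        return await one_request(session, agent, skillid)
async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = [
            asyncio.ensure_future(bounded_request(semaphore, session, a, s))
            for a, s in itertools.product(agent, skillid)
        ]
        responses = await asyncio.gather(*tasks)
    print(responses)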
If you wanted to process responses as they arrived, rather than waiting for everything to complete, you could investigate the asyncio.wait function instead.
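A rough sketch of what that could look like, again as an illustration rather than tested code:
async def main():
    async with aiohttp.ClientSession() as session:
        pending = {
            asyncio.ensure_future(one_request(session, a, s))
            for a, s in itertools.product(agent, skillid)
        }
        while pending:
            # Wake up as soon as at least one request has finished.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                agent_id, skill, body = task.result()
                print('done', agent_id, skill, len(body))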

Related

Rate-limiting aiohttp.ClientSession() to make N requests per second

I want to create an asynchronous SDK using aiohttp client for our service. I haven't been able to figure out how to throttle the ClientSession() to make only N requests per second.
class AsyncHTTPClient:
    def __init__(self, api_token, per_second_limit=10):
        self._client = aiohttp.ClientSession(
            headers={"Authorization": f"Bearer {api_token}"}
        )
        self._throttler = asyncio.Semaphore(per_second_limit)
    async def _make_request(self, method, url, **kwargs):
        async with self._throttler:
            return await self._client.request(method, url, **kwargs)
    async def get(self, url, **params):
        return await self._make_request("GET", url, **params)
    async def close(self):
        await self._client.close()
I have this class with get, post, patch, put, delete methods implemented as a call to _make_request.
# As a user of the SDK, I run the following code.
async def main():
    try:
        urls = [some_url] * 100
        client = AsyncHTTPClient(my_token, per_second_limit=20)
        await asyncio.gather(*[client.get(url) for url in urls])
    finally:
        await client.close()
asyncio.run(main())
asyncio.Semaphore limits the concurrency. That is, when the main() function is called, the async with self._throttler used in client._make_request limits concurrency to 20 requests. However, if those 20 requests finish within 1 second, more requests are made immediately. What I want is to ensure that only N requests (e.g. 20) are made per second: if all 20 requests finish in 0.8 seconds, sleep for 0.2 seconds before processing the next batch.
I looked at some asyncio.Queue examples with workers, but I am not sure how I would implement that in my SDK, since the workers would have to be created by the user of the SDK, and I want to avoid that; I want AsyncHTTPClient itself to handle the requests-per-second limit.
Any suggestions/advice/samples will be greatly appreciated.
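A minimal sketch of one possible approach to the behaviour described above (entirely my own illustration, not from the thread; the RateLimiter class and its fields are hypothetical):
import asyncio
import time
class RateLimiter:
    """Allow at most per_second acquisitions per one-second window."""
    def __init__(self, per_second):
        self._per_second = per_second
        self._lock = asyncio.Lock()
        self._window_start = 0.0
        self._calls_in_window = 0
    async def acquire(self):
        async with self._lock:
            now = time.monotonic()
            if now - self._window_start >= 1.0:
                # Start a new one-second window.
                self._window_start = now
                self._calls_in_window = 0
            if self._calls_in_window >= self._per_second:
                # Window is full: sleep until it ends, then start a new one.
                await asyncio.sleep(1.0 - (now - self._window_start))
                self._window_start = time.monotonic()
                self._calls_in_window = 0
            self._calls_in_window += 1
With something like this, _make_request would call await self._throttler.acquire() before issuing the request, instead of using async with on a Semaphore.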

Asyncio with multiprocessing : Producers-Consumers model

I am trying to retrieve stock prices and process the prices as they come in. I am a beginner with concurrency, but I thought this setup seems suited to an asyncio producers-consumers model in which each producer retrieves a stock price and passes it to the consumers via a queue. The consumers then need to do the stock price processing in parallel (multiprocessing), since the work is CPU intensive. Therefore I would have multiple consumers already working while not all the producers are finished retrieving data. In addition, I would like to implement a step in which, if the consumer finds that the stock price it's working on is invalid, we spawn a new consumer job for that stock.
So far, I have the following toy code that sort of gets me there, but it has issues with my process_data function (the consumer).
from concurrent.futures import ProcessPoolExecutor
import asyncio
import random
import time
random.seed(444)
# producers
async def retrieve_data(ticker, q):
    '''
    Pretend we're using aiohttp to retrieve stock prices from a URL.
    Place a tuple of stock ticker and price into the asyncio queue as it becomes available.
    '''
    start = time.perf_counter()  # start timer
    await asyncio.sleep(random.randint(4, 8))  # pretend we're calling some URL
    price = random.randint(1, 100)  # pretend this is the price we retrieved
    print(f'{ticker} : {price} retrieved in {time.perf_counter() - start:0.1f} seconds')
    await q.put((ticker, price))  # place the price into the asyncio queue
# consumers
async def process_data(q):
    while True:
        data = await q.get()
        print(f"processing: {data}")
        with ProcessPoolExecutor() as executor:
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(executor, data_processor, data)
        # if output of data_processing failed, send ticker back to queue to retrieve data again
        if not result[2]:
            print(f'{result[0]} data invalid. Retrieving again...')
            await retrieve_data(result[0], q)  # add a new task
            q.task_done()  # end this task
        else:
            q.task_done()  # so that q.join() knows when the task is done
async def main(tickers):
    q = asyncio.Queue()
    producers = [asyncio.create_task(retrieve_data(ticker, q)) for ticker in tickers]
    consumers = [asyncio.create_task(process_data(q))]
    await asyncio.gather(*producers)
    await q.join()  # Implicitly awaits consumers, too. Blocks until all items in the queue have been received and processed.
    for c in consumers:
        c.cancel()  # cancel the consumer tasks, which would otherwise hang and wait endlessly for additional queue items to appear
'''
RUN IN JUPYTER NOTEBOOK
'''
start = time.perf_counter()
tickers = ['AAPL', 'AMZN', 'TSLA', 'C', 'F']
await main(tickers)
print(f'total elapsed time: {time.perf_counter() - start:0.2f}')
'''
RUN IN TERMINAL
'''
# if __name__ == "__main__":
#     start = time.perf_counter()
#     tickers = ['AAPL', 'AMZN', 'TSLA', 'C', 'F']
#     asyncio.run(main(tickers))
#     print(f'total elapsed time: {time.perf_counter() - start:0.2f}')
The data_processor() function below, called by process_data() above, needs to be in a different cell of the Jupyter notebook, or in a separate module (from what I understand, to avoid a PicklingError):
from multiprocessing import current_process
# time and random are needed here too when this lives in its own module
import random
import time
def data_processor(data):
    ticker = data[0]
    price = data[1]
    print(f'Started {ticker} - {current_process().name}')
    start = time.perf_counter()  # start time counter
    time.sleep(random.randint(4, 5))  # mimic some random processing time
    # pretend we're processing the price. Let the processing outcome be invalid if the price is an odd number
    if price % 2 == 0:
        is_valid = True
    else:
        is_valid = False
    print(f"{ticker}'s price {price} validity: --{is_valid}--"
          f' Elapsed time: {time.perf_counter() - start:0.2f} seconds')
    return (ticker, price, is_valid)
THE ISSUES
Instead of using Python's multiprocessing module, I used concurrent.futures' ProcessPoolExecutor, which I read is compatible with asyncio (What kind of problems (if any) would there be combining asyncio with multiprocessing?). But it seems that I have to choose between retrieving the output (result) of the function called by the executor and being able to run several subprocesses in parallel. With the construct below, the subprocesses run sequentially, not in parallel.
with ProcessPoolExecutor() as executor:
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(executor, data_processor, data)
Removing result = await in front of loop.run_in_executor(executor, data_processor, data) allows several consumers to run in parallel, but then I can't collect their results from the parent process. I need the await for that, and then of course the rest of the code block fails.
How can I have these subprocesses run in parallel and still provide the output? Perhaps this needs a different construct, or something other than the producers-consumers model.
The part of the code that requests invalid stock prices to be retrieved again works (provided I can get the result from above), but it is run in the subprocess that calls it and blocks new consumers from being created until the request is fulfilled. Is there a way to address this?
# if output of data_processing failed, send ticker back to queue to retrieve data again
if not result[2]:
    print(f'{result[0]} data invalid. Retrieving again...')
    await retrieve_data(result[0], q)  # add a new task
    q.task_done()  # end this task
else:
    q.task_done()  # so that q.join() knows when the task is done
But it seems that I have to choose between retrieving the output (result) of the function called by the executor and being able to run several subprocesses in parallel.
Luckily this is not the case; you could also use asyncio.gather() to wait for multiple items at once. But you obtain data items one by one from the queue, so you don't have a batch of items to process. The simplest solution is to just start multiple consumers. Replace
# the single-element list looks suspicious anyway
consumers = [asyncio.create_task(process_data(q))]
with:
# now we have an actual list
consumers = [asyncio.create_task(process_data(q)) for _ in range(16)]
Each consumer will wait for an individual task to finish, but that's ok because you'll have a whole pool of them working in parallel, which is exactly what you wanted.
Also, you might want to make executor a global variable and not use with, so that the process pool is shared by all consumers and lasts as long as the program. That way consumers will reuse the worker processes already spawned instead of having to spawn a new process for each job received from the queue. (That's the whole point of having a process "pool".) In that case you probably want to add executor.shutdown() at the point in the program where you don't need the executor anymore.
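A rough sketch of that shared-executor variant, keeping the names from the question's code (the consumer count of 16 and the shutdown placement are just reasonable choices, not requirements):
from concurrent.futures import ProcessPoolExecutor
import asyncio
# One pool for the whole program, shared by all consumers.
executor = ProcessPoolExecutor()
async def process_data(q):
    while True:
        data = await q.get()
        loop = asyncio.get_running_loop()
        # Reuses the already-spawned worker processes instead of creating
        # a new pool for every queue item.
        result = await loop.run_in_executor(executor, data_processor, data)
        if not result[2]:
            await retrieve_data(result[0], q)
        q.task_done()
async def main(tickers):
    q = asyncio.Queue()
    producers = [asyncio.create_task(retrieve_data(t, q)) for t in tickers]
    consumers = [asyncio.create_task(process_data(q)) for _ in range(16)]
    await asyncio.gather(*producers)
    await q.join()
    for c in consumers:
        c.cancel()
    executor.shutdown()  # no more jobs will be submitted after this point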

Aiohttp: Server & Client at the same time

I'm trying to use aiohttp 3.6.2 as both server and client:
The webhook performs this work:
1) Receive a JSON request from a service
2) Quickly send HTTP 200 OK back to the service
3) Do additional work afterwards: make an HTTP request to a slow web service (which answers in 2-5 seconds)
I don't understand how to perform work after the view (or handler) has returned web.Response(text="OK").
Current view:
(it's slow because the slow HTTP request is performed before the response is returned)
view.py:
async def make_http_request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            print(await resp.text())
async def work_on_request(request):
    url = (await request.json())['url']
    await make_http_request(url)
    return aiohttp.web.Response(text='all ok')
routes.py:
from views import work_on_request
def setup_routes(app):
    app.router.add_get('/', work_on_request)
server.py:
from aiohttp import web
from routes import setup_routes
import asyncio
app = web.Application()
setup_routes(app)
web.run_app(app)
So, a workaround for me would be to start one more thread with a different event loop, or maybe you know how to add some work to the current event loop?
This is no longer relevant, because I found a solution: add one more task to the main event loop.
# Additionally, I created one global queue so the coroutines can interoperate with each other.
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
queue = asyncio.Queue(maxsize=100000)
loop.create_task(worker('Worker1', queue))
app = web.Application()
app['global_queue'] = queue
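The worker coroutine referenced above isn't shown in the snippet; here is a minimal sketch of what it might look like, together with a handler that enqueues the slow call and returns 200 OK right away (the payload shape with a 'url' key is assumed from the original view):
async def worker(name, queue):
    # Runs forever on the main event loop, draining queued jobs.
    while True:
        url = await queue.get()
        try:
            await make_http_request(url)  # the slow 2-5 second call
        finally:
            queue.task_done()
async def work_on_request(request):
    url = (await request.json())['url']
    # Enqueue the slow work and answer immediately.
    request.app['global_queue'].put_nowait(url)
    return web.Response(text='all ok')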

How to reuse aiohttp ClientSession pool?

The docs say to reuse the ClientSession:
Don’t create a session per request. Most likely you need a session per
application which performs all requests altogether.
A session contains a connection pool inside, connection reusage and
keep-alives (both are on by default) may speed up total performance.
But there doesn't seem to be any explanation in the docs about how to do this? There is one example that's maybe relevant, but it does not show how to reuse the pool elsewhere: http://aiohttp.readthedocs.io/en/stable/client.html#keep-alive-connection-pooling-and-cookie-sharing
Would something like this be the correct way to do it?
@app.listener('before_server_start')
async def before_server_start(app, loop):
    app.pg_pool = await asyncpg.create_pool(**DB_CONFIG, loop=loop, max_size=100)
    app.http_session_pool = aiohttp.ClientSession()
@app.listener('after_server_stop')
async def after_server_stop(app, loop):
    app.http_session_pool.close()
    app.pg_pool.close()
@app.post("/api/register")
async def register(request):
    # json validation
    async with app.pg_pool.acquire() as pg:
        await pg.execute()  # create unactivated user in db
    async with app.http_session_pool as session:
        # TODO send activation email using SES API
        async with session.post('http://httpbin.org/post', data=b'data') as resp:
            print(resp.status)
            print(await resp.text())
    return HTTPResponse(status=204)
There are a few things I think can be improved:
1)
An instance of ClientSession is one session object. This one session contains a pool of connections, but it is not a "session pool" itself. I would suggest renaming http_session_pool to http_session or maybe client_session.
2)
The session's close() method is a coroutine. You should await it:
await app.client_session.close()
Or even better (IMHO), instead of thinking about how to properly open/close the session, use the standard async context manager protocol by awaiting __aenter__ / __aexit__:
@app.listener('before_server_start')
async def before_server_start(app, loop):
    # ...
    app.client_session = await aiohttp.ClientSession().__aenter__()
@app.listener('after_server_stop')
async def after_server_stop(app, loop):
    await app.client_session.__aexit__(None, None, None)
    # ...
3)
Pay attention to this info:
However, if the event loop is stopped before the underlying connection
is closed, an ResourceWarning: unclosed transport warning is emitted
(when warnings are enabled).
To avoid this situation, a small delay must be added before closing
the event loop to allow any open underlying connections to close.
I'm not sure it's mandatory in your case, but there's nothing wrong with adding await asyncio.sleep(0) inside after_server_stop, as the documentation advises:
@app.listener('after_server_stop')
async def after_server_stop(app, loop):
    # ...
    await asyncio.sleep(0)  # http://aiohttp.readthedocs.io/en/stable/client.html#graceful-shutdown
Upd:
A class that implements __aenter__ / __aexit__ can be used as an async context manager (i.e. in an async with statement). It lets you perform some actions before executing the inner block and after it. This is very similar to regular context managers, but asyncio-related. Just like a regular context manager, an async one can also be used directly (without async with) by manually awaiting __aenter__ / __aexit__.
Why do I think it's better to create/free the session using __aenter__ / __aexit__ manually instead of using close(), for example? Because we shouldn't have to worry about what actually happens inside __aenter__ / __aexit__. Imagine that in future versions of aiohttp the creation of a session changes and, for example, requires awaiting open(). If you use __aenter__ / __aexit__, you won't need to change your code.
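As a tiny illustration of that point, any class implementing those two methods can be driven either by async with or by awaiting the methods directly (a made-up example, not aiohttp code):
class Resource:
    async def __aenter__(self):
        print("setting up")    # e.g. open a connection
        return self
    async def __aexit__(self, exc_type, exc, tb):
        print("tearing down")  # e.g. close the connection
async def use_it():
    async with Resource() as r:
        print("using", r)
    # Equivalent manual form, like the session handling above:
    r = await Resource().__aenter__()
    print("using", r)
    await r.__aexit__(None, None, None)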
There seems to be no session pool in aiohttp; below I'm just posting some official docs.
persistent session
Here is the persistent-session usage demo from the official site:
https://docs.aiohttp.org/en/latest/client_advanced.html#persistent-session
app.cleanup_ctx.append(persistent_session)
async def persistent_session(app):
    app['PERSISTENT_SESSION'] = session = aiohttp.ClientSession()
    yield
    await session.close()
async def my_request_handler(request):
    session = request.app['PERSISTENT_SESSION']
    async with session.get("http://python.org") as resp:
        print(resp.status)
# TODO: a full runnable demo (a sketch of one follows below)
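Here is a sketch of how those pieces might be wired into a runnable app (the route and the handler body are my own additions to the docs snippet):
from aiohttp import web
import aiohttp
async def persistent_session(app):
    app['PERSISTENT_SESSION'] = session = aiohttp.ClientSession()
    yield
    await session.close()
async def my_request_handler(request):
    session = request.app['PERSISTENT_SESSION']
    async with session.get("http://python.org") as resp:
        return web.Response(text=f"upstream status: {resp.status}")
app = web.Application()
app.cleanup_ctx.append(persistent_session)
app.router.add_get('/', my_request_handler)
if __name__ == '__main__':
    web.run_app(app)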
connection pool
and it has a connection pool:
https://docs.aiohttp.org/en/latest/client_advanced.html#connectors
conn = aiohttp.TCPConnector()
#conn = aiohttp.TCPConnector(limit=30)
#conn = aiohttp.TCPConnector(limit=0) # nolimit, default is 100.
#conn = aiohttp.TCPConnector(limit_per_host=30) # default is 0
session = aiohttp.ClientSession(connector=conn)
I found this question after searching on Google on how to reuse an aiohttp ClientSession instance after my code was triggering this warning message: UserWarning: Creating a client session outside of coroutine is a very dangerous idea
This code may not solve the above problem though it is related. I am new to asyncio and aiohttp, so this may not be best practice. It's the best I could come up with after reading a lot of seemingly conflicting information.
I created a class ResourceManager taken from the Python docs that opens a context.
The ResourceManager instance handles the opening and closing of the aiohttp ClientSession instance via the magic methods __aenter__ and __aexit__ with BaseScraper.set_session and BaseScraper.close_session wrapper methods.
I was able to reuse a ClientSession instance with the following code.
The BaseScraper class also has methods for authentication. It depends on the lxml third-party package.
import asyncio
from time import time
from contextlib import contextmanager, AbstractContextManager, ExitStack
import aiohttp
import lxml.html
class ResourceManager(AbstractContextManager):
    # Code taken from the Python docs: section 29.6.2.4 of https://docs.python.org/3.6/library/contextlib.html
    def __init__(self, scraper, check_resource_ok=None):
        self.acquire_resource = scraper.acquire_resource
        self.release_resource = scraper.release_resource
        if check_resource_ok is None:
            def check_resource_ok(resource):
                return True
        self.check_resource_ok = check_resource_ok
    @contextmanager
    def _cleanup_on_error(self):
        with ExitStack() as stack:
            stack.push(self)
            yield
            # The validation check passed and didn't raise an exception
            # Accordingly, we want to keep the resource, and pass it
            # back to our caller
            stack.pop_all()
    def __enter__(self):
        resource = self.acquire_resource()
        with self._cleanup_on_error():
            if not self.check_resource_ok(resource):
                msg = "Failed validation for {!r}"
                raise RuntimeError(msg.format(resource))
        return resource
    def __exit__(self, *exc_details):
        # We don't need to duplicate any of our resource release logic
        self.release_resource()
class BaseScraper:
    login_url = ""
    login_data = dict()  # dict of key, value pairs to fill the login form
    loop = asyncio.get_event_loop()
    def __init__(self, urls):
        self.urls = urls
        self.acquire_resource = self.set_session
        self.release_resource = self.close_session
    async def _set_session(self):
        self.session = await aiohttp.ClientSession().__aenter__()
    def set_session(self):
        set_session_attr = self.loop.create_task(self._set_session())
        self.loop.run_until_complete(set_session_attr)
        return self  # variable after "as" becomes instance of BaseScraper
    async def _close_session(self):
        await self.session.__aexit__(None, None, None)
    def close_session(self):
        close_session = self.loop.create_task(self._close_session())
        self.loop.run_until_complete(close_session)
    def __call__(self):
        fetch_urls = self.loop.create_task(self._fetch())
        return self.loop.run_until_complete(fetch_urls)
    async def _get(self, url):
        async with self.session.get(url) as response:
            result = await response.read()
            return url, result
    async def _fetch(self):
        tasks = (self.loop.create_task(self._get(url)) for url in self.urls)
        start = time()
        results = await asyncio.gather(*tasks)
        print(
            "time elapsed: {} seconds \nurls count: {}".format(
                time() - start, len(self.urls)
            )
        )
        return results
    @property
    def form(self):
        """Create and return form for authentication."""
        form = aiohttp.FormData(self.login_data)
        get_login_page = self.loop.create_task(self._get(self.login_url))
        url, login_page = self.loop.run_until_complete(get_login_page)
        login_html = lxml.html.fromstring(login_page)
        hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
        login_form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
        for key, value in login_form.items():
            form.add_field(key, value)
        return form
    async def _login(self, form):
        async with self.session.post(self.login_url, data=form) as response:
            if response.status != 200:
                response.raise_for_status()
            print("logged into {}".format(self.login_url))
            await response.release()
    def login(self):
        post_login_form = self.loop.create_task(self._login(self.form))
        self.loop.run_until_complete(post_login_form)
if __name__ == "__main__":
    urls = ("http://example.com",) * 10
    base_scraper = BaseScraper(urls)
    with ResourceManager(base_scraper) as scraper:
        for url, html in scraper():
            print(url, len(html))

Why does using asyncio.ensure_future for long jobs instead of await run so much quicker?

I am downloading JSON from an API and am using the asyncio module. The crux of my question concerns the following event loop, implemented like this:
loop = asyncio.get_event_loop()
main_task = asyncio.ensure_future( klass.download_all() )
loop.run_until_complete( main_task )
and download_all() implemented as the following instance method of a class, which already has downloader objects created and available to it, and thus calls each respective download method:
async def download_all(self):
    """ Builds the coroutines, uses asyncio.wait, then sifts for those still pending, loops """
    ret = []
    async with aiohttp.ClientSession() as session:
        pending = []
        for downloader in self._downloaders:
            pending.append(asyncio.ensure_future(downloader.download(session)))
        while pending:
            dne, pnding = await asyncio.wait(pending)
            ret.extend([d.result() for d in dne])
            # Get all the tasks, cannot use "pnding"
            tasks = asyncio.Task.all_tasks()
            pending = [tks for tks in tasks if not tks.done()]
            # Exclude the one that we know hasn't ended yet (UGLY)
            pending = [t for t in pending if not t._coro.__name__ == self.download_all.__name__]
    return ret
Why is it that, in the downloaders' download methods, everything runs much faster when I use asyncio.ensure_future instead of the await syntax, i.e. it appears more "asynchronous", as I can see from the logs?
This works because of the way I have set up detection of all the tasks that are still pending, not letting the download_all method complete, and repeatedly calling asyncio.wait.
I thought that the await keyword allowed the event loop mechanism to do its thing and share resources efficiently. How come doing it this way is faster? Is there something wrong with it? For example:
async def download(self, session):
    async with session.request(self.method, self.url, params=self.params) as response:
        response_json = await response.json()
        # Not using await here, as I am "supposed" to
        asyncio.ensure_future(self.write(response_json, self.path))
        return response_json
async def write(self, res_json, path):
    # using aiofiles to write, but it doesn't (seem to?) support direct json
    # so converting to raw text first
    txt_contents = json.dumps(res_json, **self.json_dumps_kwargs)
    async with aiofiles.open(path, 'w') as f:
        await f.write(txt_contents)
With full code implemented and a real API, I was able to download 44 resources in 34 seconds, but when using await it took more than three minutes (I actually gave up as it was taking so long).
When you await in each iteration of the for loop, you wait for each download to finish before starting the next one.
When you use ensure_future, on the other hand, you don't wait there: it creates a task for each download, and all of those tasks are then awaited together in the second loop.
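In miniature, the difference between the two patterns looks like this (a toy sketch, not the original downloader code):
async def fetch_one(session, url):
    async with session.get(url) as resp:
        return await resp.json()
# Sequential: each iteration waits for the previous download to finish.
async def sequential(session, urls):
    results = []
    for url in urls:
        results.append(await fetch_one(session, url))
    return results
# Concurrent: create all the tasks first, then await them together.
async def concurrent(session, urls):
    tasks = [asyncio.ensure_future(fetch_one(session, url)) for url in urls]
    return await asyncio.gather(*tasks)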
