aiohttp Error Rate Increases with Number of Connections - python-asyncio

I am trying to get the status code from millions of different sites using asyncio and aiohttp. I run the code below with different numbers of connections (but the same per-request timeout) and get very different results, specifically a much higher number of the following exception:
'concurrent.futures._base.TimeoutError'
The code
import pandas as pd
import asyncio
import aiohttp

out = []
CONNECTIONS = 1000
TIMEOUT = 10

async def fetch(url, session, loop):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            res = response.status
            out.append(res)
            return res
    except Exception as e:
        _exception = 'Error: ' + str(type(e))
        out.append(_exception)
        return _exception

async def bound_fetch(sem, url, session, loop):
    async with sem:
        await fetch(url, session, loop)

async def run(urls, loop):
    tasks = []
    sem = asyncio.Semaphore(value=CONNECTIONS, loop=loop)
    _connector = aiohttp.TCPConnector(limit=CONNECTIONS, loop=loop)
    async with aiohttp.ClientSession(connector=_connector, loop=loop) as session:
        for url in urls:
            task = asyncio.ensure_future(bound_fetch(sem, url, session, loop))
            tasks.append(task)
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return responses

## BEGIN ##
tlds = open('data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls, loop))
ans = loop.run_until_complete(future)
print(str(pd.Series(out).value_counts()))
Results
The value_counts output (omitted here) differs sharply between CONNECTIONS=1000 and CONNECTIONS=100: the 1000-connection run shows far more TimeoutErrors.
Is this a bug? These sites do respond with a status code, and when I run them sequentially or with fewer connections there are no timeout errors, so why is this happening? The other exceptions stay stable as the number of connections changes. The ClientOSErrors come from sites that genuinely time out or misbehave when responding; I honestly don't know where the concurrent.futures._base.TimeoutError exceptions are coming from.

Imagine you opened 1000 URLs in a browser simultaneously. I bet you'd notice that many of them aren't loaded after 10 seconds. It's not a bug, it's a limit of your machine's resources.
The more parallel requests you make, the less network capacity, CPU time and RAM each one gets, and the higher the chance that a given request won't be finished before its timeout.
If you see many timeouts with 1000 connections, make fewer connections (and maybe increase the timeout). Based on the aiohttp documentation, using different ClientSession instances may also help:
Unless you are connecting to a large, unknown number of different
servers over the lifetime of your application, it is suggested you use
a single session for the lifetime of your application
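For illustration, a minimal sketch of that adjustment to the run() coroutine above, assuming aiohttp 3.x; the numbers are illustrative, not tuned, and bound_fetch is the coroutine from the question:

import asyncio
import aiohttp

CONNECTIONS = 100   # far fewer concurrent requests than before
TIMEOUT = 30        # more generous timeout, in seconds

async def run(urls, loop):
    sem = asyncio.Semaphore(CONNECTIONS)
    connector = aiohttp.TCPConnector(limit=CONNECTIONS)
    session_timeout = aiohttp.ClientTimeout(total=TIMEOUT)
    async with aiohttp.ClientSession(connector=connector, timeout=session_timeout) as session:
        tasks = [asyncio.ensure_future(bound_fetch(sem, url, session, loop)) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)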

I've had the same issue. Have a look at the details of the ClientOSErrors and you might see Too many open files; if so, you need to increase the OS's limit on open file descriptors.
Either way, you'll get more information if you print the whole exception, not just its type.
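As a rough sketch (the resource module is Unix-only, and the hard limit caps how far the soft limit can go), you can check and raise the descriptor limit from Python, and keep the full error text in the fetch() coroutine above:

import resource

# Unix only: inspect the current limit on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('file descriptor limit (soft, hard):', soft, hard)

# raise the soft limit as far as the hard limit allows (4096 is illustrative)
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

# and in fetch() above, keep the whole exception, not just its type:
#     _exception = 'Error: ' + repr(e)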

Related

python request get invalid url lighting speed

I have a list of 10^6 URLs I want to check for their status code.
The thing is that requests.get is too slow for me even with a timeout specified, and sometimes I cannot be sure whether a URL is valid or not even with a 1-second timeout (say the server's response is slow).
So, currently I do:
import requests

url = "https://dupa.ucho.elo.8"
r = requests.get(url, headers={'Connection': 'close'}, timeout=1)
How can I quickly check whether a URL is valid and have invalid URLs fail immediately, without relying on a timeout?
Note 1: I want to avoid the grequests module.
Note 2: I do not want to use multithreading.
I have read https://stackoverflow.com/questions/17782142/why-doesnt-requests-get-return-what-is-the-default-timeout-that-requests-geta but it still involves setting a timeout.
While this might not give you lightning speed, since it avoids multithreading, you can check whether the response of the URL contains what you want to see (a 200 status code) and stop right after.
import requests
import sys

url_list = ['http://google12121.com/', 'https://google.com/']

for url in url_list:
    try:
        response = requests.get(url)
        if "200" in str(response.status_code):
            print("Yes")
        else:
            print("No")
    except:
        print("Error: " + str(sys.exc_info()[0]))
        continue
You might want to write more specific error-catching logic, because catching all exceptions indiscriminately is generally bad practice.
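Since this is the asyncio tag, a hedged sketch of the same check done concurrently with aiohttp; it is single-threaded (so no multithreading involved), and the URLs are just the ones from the example above:

import asyncio
import aiohttp

async def check(session, url):
    try:
        async with session.get(url) as response:
            return url, response.status
    except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
        return url, 'Error: ' + repr(exc)

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # all requests run concurrently on one thread
        return await asyncio.gather(*(check(session, u) for u in urls))

urls = ['http://google12121.com/', 'https://google.com/']
loop = asyncio.get_event_loop()
for url, status in loop.run_until_complete(main(urls)):
    print(url, status)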

How do I properly use Threads to connect ping a url?

I am trying to ping a large number of URLs and retrieve information about each URL's certificate. As I read in the thoughtbot article here (Thoughtbot Threads) and others, the best way to do this is by using Threads. When I implement threads, however, I keep running into Timeout errors and other problems for URLs that I can retrieve successfully on their own. I've been told in a related question I asked earlier that I should not use Timeout with Threads. However, the examples I see wrap the API/Net::HTTP/TCPSocket calls in the Timeout block, and based on what I've read, that entire API/Net::HTTP/TCPSocket call will be nested within the Thread. Here is my code:
class SslClient
  attr_reader :url, :port, :timeout

  def initialize(url, port = '443', timeout = 30)
    @url = url
    @port = port
    @timeout = timeout
  end

  def ping_for_certificate_info
    context = OpenSSL::SSL::SSLContext.new
    certificates = nil
    verify_result = nil
    Timeout.timeout(timeout) do
      tcp_client = TCPSocket.new(url, port)
      ssl_client = OpenSSL::SSL::SSLSocket.new tcp_client, context
      ssl_client.hostname = url
      ssl_client.sync_close = true
      ssl_client.connect
      certificates = ssl_client.peer_cert_chain
      verify_result = ssl_client.verify_result
      tcp_client.close
    end
    { certificate: certificates.first, verify_result: verify_result }
  rescue => error
    puts url
    puts error.inspect
  end
end

[VERY LARGE LIST OF URLS].map do |url|
  Thread.new do
    ssl_client = SslClient.new(url)
    cert_info = ssl_client.ping_for_certificate_info
    puts cert_info
  end
end.map(&:value)
If you run this code in your terminal, you will see many Timeout errors and Errno::ETIMEDOUT errors for sites like fandango.com, fandom.com, mcaffee.com, google.de, etc. that should return information. When I run these individually, however, I get the information I need; when I run them in threads they tend to fail, especially for domains with foreign domain names. What I'm asking is whether I am using Threads correctly. This snippet is part of a larger piece of code that interacts with ActiveRecord objects in Rails depending on the results. Am I using Timeout and Threads correctly? What do I need to do to make this work? Why would a ping work individually but not when wrapped in a thread? Help would be greatly appreciated.
There are several issues:
You should not spawn thousands of threads; use a connection pool (e.g. https://github.com/mperham/connection_pool) so you have at most 20-30 concurrent requests going (this maximum should be determined by testing the point at which network performance drops and you start getting these timeouts).
It's difficult to guarantee that your code is not broken when you use threads, which is why I suggest you use something where others have figured it out for you, like https://github.com/httprb/http (with examples for thread safety and concurrent requests at https://github.com/httprb/http/wiki/Thread-Safety). There are other libs out there (Typhoeus, patron), but this one is pure Ruby, so basic thread safety is easier to achieve.
You should not use Timeout (see https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying and https://medium.com/@adamhooper/in-ruby-dont-use-timeout-77d9d4e5a001). Use IO.select or something else.
Also, I suggest you learn about threading issues like deadlocks, starvation and all the other gotchas. In your case you are starving the threads of network resources, because they are all fighting for bandwidth.

Aiohttp server max connections

I cannot understand why the aiohttp (and asyncio in general) server implementation does not provide a way to limit the maximum number of concurrent connections (number of accepted sockets, or number of running request handlers).
(https://github.com/aio-libs/aiohttp/issues/675). Without this limit, it is easy to run out of memory and/or file descriptors.
At the same time, the aiohttp client by default limits the number of concurrent requests to 100 (https://docs.aiohttp.org/en/stable/client_advanced.html#limiting-connection-pool-size), aiojobs limits the number of running tasks and the size of the pending-task list, nginx has a worker_connections limit, and any sync framework is limited by its number of worker threads by design.
While aiohttp can handle a lot of concurrent requests, this number is still limited. The aiojobs docs say: "The Scheduler has implied limit for amount of concurrent jobs (100 by default). ... It prevents a program over-flooding by running a billion of jobs at the same time". And still, we can happily spawn a "billion" (well, until we run out of resources) aiohttp handlers.
So the question is, why is it implemented this way? Am I missing some important detail? I think we could pause request handlers using a Semaphore, but the socket is still accepted by aiohttp and a coroutine is spawned, in contrast with nginx. Also, when deploying behind nginx, the nginx worker_connections limit and the limit desired for aiohttp will certainly differ (because nginx may also serve static files).
Based on the developers' comments on the linked issue, the reasons for this choice are the following:
The application can return a 4xx or 5xx response if it detects that the number of connections is larger than what it can reasonably handle. (This differs from the Semaphore idiom, which would effectively queue the connection.)
Throttling the number of server connections is more complicated than just specifying a number, because the limit might well depend on what your coroutines are doing, i.e. it should at least be path-based. Andrew Svetlov links to NGINX documentation about connection limiting to support this.
It is anyway recommended to put aiohttp behind a specialized front server such as NGINX.
More detail than this can only be provided by the developer(s), who have been known to read this tag.
At this point, it appears that the recommended solution is to either use a reverse proxy for limiting, or an application-based limit like this decorator (untested):
REQUEST_LIMIT = 100

def throttle_handle(real_handle):
    _nrequests = 0
    async def handle(request):
        nonlocal _nrequests
        if _nrequests >= REQUEST_LIMIT:
            return aiohttp.web.Response(
                status=429, text="Too many connections")
        _nrequests += 1
        try:
            return await real_handle(request)
        finally:
            _nrequests -= 1
    return handle

@throttle_handle
async def handle(request):
    ... your handler here ...
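If you would rather apply the limit to every route instead of decorating each handler, here is a similar (also untested) sketch using aiohttp's middleware support; REQUEST_LIMIT is the same illustrative number:

from aiohttp import web

REQUEST_LIMIT = 100

@web.middleware
async def throttle_middleware(request, handler):
    # the counter lives on the app, so all handlers share it
    app = request.app
    if app.get('nrequests', 0) >= REQUEST_LIMIT:
        return web.Response(status=429, text="Too many connections")
    app['nrequests'] = app.get('nrequests', 0) + 1
    try:
        return await handler(request)
    finally:
        app['nrequests'] -= 1

app = web.Application(middlewares=[throttle_middleware])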
To limit concurrent connections on the client side, you can use aiohttp.TCPConnector, or aiohttp.ProxyConnector if you are using a proxy. Just create it and pass it to the session instead of using the default.

aiohttp.ClientSession(
    connector=aiohttp.TCPConnector(limit=1)
)

aiohttp.ClientSession(
    connector=aiohttp.ProxyConnector.from_url(proxy_url, limit=1)
)

Gathering coin volumes - Is my code running asynchronously?

I'm fairly new to programming in Python; I've been programming for about half a year. I've decided to try to build a functional trading bot. While coding this bot I stumbled upon the asyncio module. I would really like to understand the module better, but it's hard to find simple tutorials or documentation about asyncio.
My script gathers the volume for each coin. This works perfectly, but it takes a really long time to gather all the volumes. I would like to ask whether my script is running synchronously, and if so, how do I fix it? I'm using an API wrapper to communicate with the Binance exchange.
import binance
import asyncio
import time

s = time.time()

names = [name for name in binance.ticker_prices()]  # Gathering all the coin names
loop = asyncio.get_event_loop()

async def get_volume(name):
    async def get_data():
        return binance.ticker_24hr(name)  # Returns per coin a dict of the data of the last 24hr
    data = await get_data()
    return (name, data['volume'])

tasks = [asyncio.ensure_future(get_volume(name)) for name in names]
results = loop.run_until_complete(asyncio.gather(*tasks))

print('Total time:', time.time() - s)
Since binance.ticker_24hr does not look like a coroutine, it is almost certainly blocking the event loop and therefore preventing asyncio.gather from doing its job. As a quick fix, you can use run_in_executor to run the blocking function in a separate thread:
async def get_volume(name):
    loop = asyncio.get_event_loop()
    data = await loop.run_in_executor(None, binance.ticker_24hr, name)
    return name, data['volume']
This will work just fine for a reasonable number of parallel tasks. The downside is that it uses threads, so it might not scale to a huge number of parallel requests (or it would require unnecessary waiting). The correct solution in the long run is to use a library that natively supports asyncio.
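If you do stick with threads, a sketch that caps how many blocking calls run at once by passing an explicit ThreadPoolExecutor instead of None (the pool size of 20 is illustrative):

import asyncio
from concurrent.futures import ThreadPoolExecutor

import binance

# at most 20 blocking ticker_24hr calls run at the same time
executor = ThreadPoolExecutor(max_workers=20)

async def get_volume(name):
    loop = asyncio.get_event_loop()
    data = await loop.run_in_executor(executor, binance.ticker_24hr, name)
    return name, data['volume']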
Maarten, firstly, you are calling get_ticker for every symbol, which means you're making many unnecessary requests. If you call it without a symbol value, you get all tickers in one request; that also removes the need for any loops or async if you aren't performing other tasks. It looks like the binance library you're using doesn't support this, but you can use python-binance to do it:
return client.get_ticker()
That said, I've been testing an asyncio version of python-binance. It's currently in a feature branch if you want to try it.
pip install git+https://github.com/sammchardy/python-binance@feature/asyncio
Include the asyncio version of the client and initialise the client
from binance.client_async import AsyncClient as Client
client = Client("<api_key>", "<api_secret>")
Then you can await the calls to get the ticker for a particular symbol
return await client.get_ticker(symbol=name)
Or for all symbol tickers don't pass the symbol parameter
return await client.get_ticker()
Hope that helps

Concurrency with aiohttp server

Here is my code:
import asyncio
import logging
import sys
import time

from aiohttp import web

logging.getLogger('aiohttp').setLevel(logging.DEBUG)
logging.getLogger('aiohttp').addHandler(logging.StreamHandler(sys.stderr))

def handle_sync(request):
    web.web_logger.debug('Sync begin')
    time.sleep(10)
    web.web_logger.debug('Sync end')
    return web.Response(text='Synchronous hello')

async def handle_async(request):
    web.web_logger.debug('Async begin')
    await asyncio.sleep(10)
    web.web_logger.debug('Async end')
    return web.Response(text='Asynchronous hello')

async def init(loop):
    app = web.Application(loop=loop)
    app.router.add_get('/sync/', handle_sync)
    app.router.add_get('/async/', handle_async)
    srv = await loop.create_server(app.make_handler(), '0.0.0.0', 8080)
    return srv

loop = asyncio.get_event_loop()
loop.run_until_complete(init(loop))
loop.run_forever()
I would expect 2 behaviors:
when hitting the /sync/ URL twice, with, say, a 2-second interval, the overall time spent is 20 seconds, since we have one server in one thread and a blocking call
an overall time of 12 seconds when hitting the /async/ URL twice the same way (these calls are asynchronous, right?)
But it appears that both cases take 20 seconds. Can someone please explain why?
To be able to observe the benefits of async here, you need to send the two separate requests at (almost) the same time.
The await call simply means the function yields control back to the event loop, so that if any other events occur they can be handled without blocking. If you send two requests at the same time to the async endpoint, you will see that each of them finishes in around 10 seconds. This is because both requests are handled concurrently: the first request does not block the server from handling the second.
However, if you send two requests in a similar fashion to the sync endpoint, the second request will take 20 seconds, because the sync endpoint blocks on the first request and cannot start serving the second until the first one is finished.
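To see this, a quick hypothetical client that fires two requests to the same endpoint at once and times them, assuming the server above is running on localhost:8080:

import asyncio
import time

import aiohttp

async def hit(session, path):
    start = time.time()
    async with session.get('http://localhost:8080' + path) as response:
        await response.text()
    print(path, 'took', round(time.time() - start, 1), 's')

async def main(path):
    async with aiohttp.ClientSession() as session:
        # two concurrent requests to the same endpoint
        await asyncio.gather(hit(session, path), hit(session, path))

loop = asyncio.get_event_loop()
loop.run_until_complete(main('/async/'))  # both finish in roughly 10 s
loop.run_until_complete(main('/sync/'))   # the second finishes in roughly 20 s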
