I want to crawl data from the web using HTTP requests, but calling each function sequentially is too slow, so I found the gevent library for Python to run the logic concurrently. My source code is below. When content is large, I get the error shown after it. How can I make gevent call my functions within a limited amount of resources?
from gevent import monkey
monkey.patch_all()

import gevent

threads = [gevent.spawn(yahoo.download_csv, tkr) for tkr in content]
gevent.joinall(threads)
The error is:
File "D:\dev\Python36-64\lib\ssl.py", line 465, in options
super(SSLContext, SSLContext).options.__set__(self, value)
[Previous line repeated 318 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object
yahoo.download_csv is the function that crawls data via HTTP requests and returns the crawled data for the given ticker.
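gevent ships a bounded pool for exactly this. A minimal sketch, reusing yahoo.download_csv and content from the question (the pool size of 20 is arbitrary); note also that a RecursionError in ssl.py usually means monkey.patch_all() ran after the ssl module was already imported, so the patch must come first:

from gevent import monkey
monkey.patch_all()  # must run before anything imports ssl/requests

import gevent
from gevent.pool import Pool

pool = Pool(20)  # at most 20 download greenlets run concurrently
greenlets = [pool.spawn(yahoo.download_csv, tkr) for tkr in content]
gevent.joinall(greenlets)
results = [g.value for g in greenlets]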
There is a specific API endpoint. To reduce the load on the server, I want clients to be able to send POST requests to it no more often than, say, once every 15 minutes, while the rest of the endpoints keep working as usual.
I thought I needed to implement some kind of timeout, so that the client waits and gets no response until the 15 minutes have passed, i.e. the post function simply never returns. But that is impossible: a view must return a response. And once the client receives a response, it can immediately send the next request. I need to reduce the number of requests as much as possible, so that this timeout keeps the client from bombarding the server with requests.
I would also like to put the logic enabling this behavior inside the post function itself; it is a little more complicated than described in the question.
I'm a complete noob in Python and Django. Perhaps this can be implemented in some other way? Please point me in the right direction.
You can use throttling: add the following lines to the REST_FRAMEWORK block of your settings.py, like below:
REST_FRAMEWORK = {
...
'DEFAULT_THROTTLE_CLASSES': [
'rest_framework.throttling.AnonRateThrottle',
'rest_framework.throttling.UserRateThrottle'
],
'DEFAULT_THROTTLE_RATES': {
'anon': '20/min',
'user': '60/min'
},
...
}
This will return a 429 status code if an anonymous client sends more than 20 requests, or an authenticated user more than 60 requests, in one minute.
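Since the question is about limiting a single endpoint rather than the whole API, DRF's ScopedRateThrottle is a closer fit. A minimal sketch (the scope name slow_endpoint and the view are made up; DRF rate periods only go up to days, so "once per 15 minutes" is approximated as 4 per hour, which is a sliding window and therefore allows short bursts):

# settings.py
REST_FRAMEWORK = {
    'DEFAULT_THROTTLE_CLASSES': [
        'rest_framework.throttling.ScopedRateThrottle',
    ],
    'DEFAULT_THROTTLE_RATES': {
        'slow_endpoint': '4/hour',  # roughly one request per 15 minutes
    },
}

# views.py
from rest_framework.views import APIView
from rest_framework.response import Response

class SlowEndpointView(APIView):
    throttle_scope = 'slow_endpoint'  # only views with this scope are throttled

    def post(self, request):
        return Response({'status': 'accepted'})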
Throttling could be a solution. It lets you set the request rate: https://www.django-rest-framework.org/api-guide/throttling/
I cannot understand why the aiohttp (and asyncio in general) server implementation does not provide a way to limit the maximum number of concurrent connections (accepted sockets, or running request handlers) (https://github.com/aio-libs/aiohttp/issues/675). Without such a limit, it is easy to run out of memory and/or file descriptors.
At the same time, the aiohttp client by default limits the number of concurrent requests to 100 (https://docs.aiohttp.org/en/stable/client_advanced.html#limiting-connection-pool-size), aiojobs limits the number of running tasks and the size of the pending-task list, nginx has the worker_connections limit, and any sync framework is limited by its number of worker threads by design.
While aiohttp can handle a lot of concurrent requests, that number is still finite. The aiojobs docs say: "The Scheduler has implied limit for amount of concurrent jobs (100 by default). ... It prevents a program over-flooding by running a billion of jobs at the same time". And yet we can happily spawn a "billion" aiohttp handlers (well, until we run out of resources).
So the question is: why is it implemented the way it is? Am I missing some important detail? I think we could pause request handlers using a Semaphore, but the socket is still accepted by aiohttp and a coroutine is spawned, in contrast with nginx. Also, when deploying behind nginx, worker_connections and the desired aiohttp limit will certainly differ (because nginx may also serve static files).
Based on the developers' comments on the linked issue, the reasons for this choice are the following:
The application can return a 4xx or 5xx response if it detects that the number of connections is larger than what it can reasonably handle. (This differs from the Semaphore idiom, which would effectively queue the connection.)
Throttling the number of server connections is more complicated than just specifying a number, because the limit might well depend on what your coroutines are doing, i.e. it should at least be path-based. Andrew Svetlov links to NGINX documentation about connection limiting to support this.
It is anyway recommended to put aiohttp behind a specialized front server such as NGINX.
More detail than this can only be provided by the developer(s), who have been known to read this tag.
At this point, it appears that the recommended solution is to either use a reverse proxy for limiting, or an application-based limit like this decorator (untested):
import aiohttp.web

REQUEST_LIMIT = 100

def throttle_handle(real_handle):
    _nrequests = 0  # requests currently being handled
    async def handle(request):
        nonlocal _nrequests
        if _nrequests >= REQUEST_LIMIT:
            return aiohttp.web.Response(
                status=429, text="Too many connections")
        _nrequests += 1
        try:
            return await real_handle(request)
        finally:
            _nrequests -= 1
    return handle

@throttle_handle
async def handle(request):
    ...  # your handler here
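For contrast, the Semaphore idiom mentioned in the question, which queues excess requests instead of rejecting them with a 429, could be written as aiohttp middleware. A minimal sketch (untested, like the decorator above; the limit of 100 is arbitrary):

import asyncio
from aiohttp import web

_sem = asyncio.Semaphore(100)

@web.middleware
async def limit_middleware(request, handler):
    # Excess requests wait here; note the socket has still been
    # accepted by aiohttp, unlike with nginx's worker_connections.
    async with _sem:
        return await handler(request)

app = web.Application(middlewares=[limit_middleware])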
To limit concurrent connections you can use aiohttp.TCPConnector, or aiohttp.ProxyConnector if you're using a proxy. Just create it in the session instead of using the default one.
aiohttp.ClientSession(
connector=aiohttp.TCPConnector(limit=1)
)
aiohttp.ClientSession(
connector=aiohttp.ProxyConnector.from_url(proxy_url, limit=1)
)
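For example, even if many requests are scheduled at once, the connector caps how many are actually in flight. A small usage sketch (the URL and the limit of 10 are placeholders):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    # At most 10 connections are open at any moment, no matter
    # how many fetch() coroutines have been scheduled.
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

results = asyncio.run(main(['https://example.com'] * 100))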
TL;DR
How can I safely await a function's execution (it takes a str and an int as arguments and requires no other context) in a separate process?
Long story
I have an aiohttp.web API that uses a Boost.Python wrapper for a C++ extension, runs under Gunicorn (and I plan to deploy it on Heroku), and is load-tested with Locust.
About the extension: it has just one function that performs a non-blocking operation: it takes one string (and one integer for timeout management), does some calculations on it, and returns a new string. For every input string there is only one possible output (except on timeout, but in that case a C++ exception must be raised and translated by Boost.Python into a Python-compatible one).
In short, a handler for specific URL executes the code below:
res = await loop.run_in_executor(executor, func, *args)
where executor is a ProcessPoolExecutor instance and func is the function from the C++ extension module. (In the real project this code lives in a coroutine method of a class, and func is a classmethod that only executes the C++ function and returns the result.)
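Put together, the pattern looks roughly like this. A self-contained sketch in which cpp_func is a stand-in for the Boost.Python extension function, and the route and worker count are made up:

import asyncio
from concurrent.futures import ProcessPoolExecutor

from aiohttp import web

executor = ProcessPoolExecutor(max_workers=4)

def cpp_func(data: str, timeout: int) -> str:
    # Stand-in for the C++ extension call.
    return data.upper()

async def handle(request):
    post = await request.post()
    loop = asyncio.get_event_loop()
    # The CPU-bound call runs in a worker process; the event loop stays free.
    res = await loop.run_in_executor(executor, cpp_func, post['data'], 5)
    return web.Response(text=res)

app = web.Application()
app.router.add_post('/process', handle)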
Error catching
When a new request arrives, I extract its POST data via request.post() and store that data in an instance of a custom class named Call (because I have no idea what else to name it). The call object thus contains all the input data (the string), the time the request was received, and the unique id that came with the request.
It then proceeds to a class named Handler (not the aiohttp request handler), which passes its input to another class's method with loop.run_in_executor inside. Handler has a logging system that works like middleware: it reads the id and receive time of every incoming call object and logs a message saying whether the call is just starting to execute, has executed successfully, or has run into trouble. Handler also has a try/except and stores all errors inside the call object, so the logging middleware knows which error occurred, or what output the extension returned.
Testing
I have a unit test that just creates 256 coroutines with this code inside, with an executor that has 256 workers, and it works well.
But when testing with Locust a problem appears. I use 4 Gunicorn workers and 4 executor workers for this kind of testing. At some point the application just starts returning wrong output.
My Locust TaskSet is configured to log every failed response with all available information: output string, error string, input string (which the application also returns), and id. All simulated requests are the same, but the id is unique for each one.
The situation is better when setting Gunicorn's max_requests option to 100 requests, but failures still occur.
Interestingly, sometimes I can trigger a "wrong output" period simply by stopping and restarting the Locust test.
I need a 100% guarantee that my web API works as I expect.
UPDATE & solution
I just asked my teammate to review the C++ code: the problem was global variables. Somehow this wasn't a problem for 256 parallel coroutines, but it was for Gunicorn.
If I have a basic HTTP handler for POST requests, how can I stop processing if the payload is larger than 100 KB?
From what I understand, behind the scenes the server streams the POSTed data in my POST handler. But if I try to access it, it will block, correct?
I want to stop processing if it is over 100 KB in size.
Use http.MaxBytesReader to limit the amount of data read from the client. Execute this line of code
r.Body = http.MaxBytesReader(w, r.Body, 100000)
before calling r.ParseForm, r.FormValue or any other request method that reads the body.
Wrapping the request body with io.LimitedReader limits the amount of data read by the application, but does not necessarily limit the amount of data read by the server on behalf of the application.
Checking the request content length is unreliable because the field is not set to the actual request body size when chunked encoding is used.
I believe you can simply check the http.Request.ContentLength field to learn the size of the posted request, and then decide whether to go ahead or return an error if it is larger than expected.
I am building a server on sockjs-tornado and wonder how one could take advantage of Tornado's asynchronous HTTP client, or other asynchronous facilities for Tornado such as asyncmongo, tornado-redis, etc. Apparently it is not possible to use the tornado.web.asynchronous and tornado.gen.engine decorators on arbitrary methods. So if I need to make asynchronous Mongo/HTTP/Redis calls from within SockJSConnection's on_message(), how would I do that?
All you have to do is create a method (or a function) decorated with the tornado.gen.engine decorator.
Created small gist to illustrate how you can do it: https://gist.github.com/3708549
If you run the sample and check the server console, you'll see the following output:
1 - Making request
2 - Returned from on_message
... slight delay ...
3 - Sent data to client
So it does not block the IOLoop and makes the HTTP call in the background.
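A rough sketch of what such a connection might look like, using the old-style gen.engine API from that era (the class name is made up, and the incoming message is assumed to be a URL):

from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from sockjs.tornado import SockJSConnection

class EchoConnection(SockJSConnection):
    def on_message(self, message):
        print('1 - Making request')
        self.fetch_and_reply(message)  # schedules the work and returns at once
        print('2 - Returned from on_message')

    @gen.engine
    def fetch_and_reply(self, url):
        client = AsyncHTTPClient()
        # The fetch runs on the IOLoop; this coroutine resumes when it completes.
        response = yield gen.Task(client.fetch, url)
        self.send(response.body)
        print('3 - Sent data to client')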