Python requests.get on invalid URLs at lightning speed - performance

I have a list of 10^6 URLs that I want to check for their status codes.
The thing is, requests.get is too slow for me even with a timeout specified, and sometimes I can't be sure whether a URL is valid or not even with a 1-second timeout (say the server's response is slow).
So, currently I do:
import requests
url = "https://dupa.ucho.elo.8"
r = requests.get(url, headers={'Connection': 'close'}, timeout=1)
How can I quickly check whether a URL is valid or not without setting a timeout, and get an instant response for invalid URLs?
Note1: I want to avoid grequests module.
Note2: I do not want to use multithreading.
I have read https://stackoverflow.com/questions/17782142/why-doesnt-requests-get-return-what-is-the-default-timeout-that-requests-geta but it still involves setting a timeout.

While this might not give you lightning speed, since you're avoiding multithreading, you can check whether the response for each URL contains what you want to see (a 200 status code) and move on right after.
import requests
import sys

url_list = ['http://google12121.com/', 'https://google.com/']
for url in url_list:
    try:
        response = requests.get(url)
        if response.status_code == 200:
            print("Yes")
        else:
            print("No")
    except:
        print("Error: " + str(sys.exc_info()[0]))
        continue
You might want to write more specific error-catching logic, because catching every exception indiscriminately is generally considered bad practice; a minimal sketch follows below.
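For instance, a minimal sketch (assuming the same url_list as above) that distinguishes timeouts from connection failures, both of which are real exception classes raised by requests:

import requests

url_list = ['http://google12121.com/', 'https://google.com/']
for url in url_list:
    try:
        response = requests.get(url, headers={'Connection': 'close'}, timeout=1)
        print("Yes" if response.status_code == 200 else "No")
    except requests.exceptions.Timeout:
        # slow or unresponsive server
        print("Timed out: " + url)
    except requests.exceptions.ConnectionError:
        # DNS failure, refused connection, invalid host, etc.
        print("Connection failed: " + url)
    except requests.exceptions.RequestException as e:
        # anything else raised by requests
        print("Other error: " + repr(e))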

Related

Ruby progressbar with down gem

I am implementing a file downloader using the down gem.
I need to add a progress bar to my program for fancier output. I found a gem called ruby-progressbar. However, I couldn't integrate it into my code base even though I followed the instructions documented on the official site. Here's what I have done so far:
First, I thought of using progress_proc. It was a bad idea because progress_proc returns chunked partials of the data.
Second, I streamed the data and built my own logic for counting the chunked data. It actually worked well, but it smells bad to me.
Here is a small part of my code base; I hope it helps you understand the concept.
progressbar = ProgressBar.create(title: 'File 1')
Down.download(url, progress_proc: ->(progress) { progressbar.progress = progress }) # It doesn't work

progressbar = ProgressBar.create(title: 'File 1')
file = Down.open(url, progress_proc: ->(progress) { progressbar.progress = progress })
chunked = 0
loop do
  break if file.eof?
  data = file.read(1024)
  chunked += data.bytesize # count actual bytes read, not a fixed 1024
  progressbar.progress = (chunked.to_f / file.size) * 100 # to_f avoids integer division
end
# This worked well as I remember. It can be faulty because I wrote it down without testing.
In the HTTP protocol, there are two different ways a client can determine the full length of a response:
In the most common case, the entire response is sent by the server in one go. Here, the length of the response body in bytes is set in the Content-Length header of the response. Thus, if the response is not chunked, you can get the value of this header and read the response in one go as it is sent by the server.
The second option is for the server to send a chunked response. Here, the server sends chunks of the entire response, one after another. Each chunk is prefixed with the length of the chunk. However, the client has no way to know how many chunks there are in total, nor how large the total response may be. Often, this is even unknown to the server as the first chunks are already sent before the entire response is available to the server.
The down gem follows these two approaches by offering two interfaces:
In the first case (i.e. if the content length of the entire response is known), the gem will call the given content_length_proc once.
In the second case, as the entire length of the response is unknown before it was received in total, the down gem calls the progress_proc once for each chunk received. In this case, it is up to you to show something useful. In general, you can NOT show a progress bar as a percentage of completion here.
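As a rough illustration of these two cases at the HTTP level (in Python with the requests library rather than Ruby/down, and with a made-up URL), a client can only compute a percentage when a Content-Length header is present:

import requests

url = 'https://example.com/some/file'  # hypothetical URL
with requests.get(url, stream=True) as response:
    total = response.headers.get('Content-Length')
    received = 0
    for chunk in response.iter_content(chunk_size=8192):
        received += len(chunk)
        if total is not None:
            # Case 1: total size known up front -> a percentage is meaningful
            print('{:.1f}%'.format(received / int(total) * 100))
        else:
            # Case 2: chunked/streamed response of unknown total size -> only bytes so far
            print('{} bytes received'.format(received))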

aiohttp Error Rate Increases with Number of Connections

I am trying to get the status code from millions of different sites using asyncio and aiohttp. I run the code below with different numbers of connections (but the same timeout on each request) and get very different results, specifically a much higher number of the following exception:
'concurrent.futures._base.TimeoutError'
The code
import pandas as pd
import asyncio
import aiohttp

out = []
CONNECTIONS = 1000
TIMEOUT = 10

async def fetch(url, session, loop):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            res = response.status
            out.append(res)
            return res
    except Exception as e:
        _exception = 'Error: ' + str(type(e))
        out.append(_exception)
        return _exception

async def bound_fetch(sem, url, session, loop):
    async with sem:
        await fetch(url, session, loop)

async def run(urls, loop):
    tasks = []
    sem = asyncio.Semaphore(value=CONNECTIONS, loop=loop)
    _connector = aiohttp.TCPConnector(limit=CONNECTIONS, loop=loop)
    async with aiohttp.ClientSession(connector=_connector, loop=loop) as session:
        for url in urls:
            task = asyncio.ensure_future(bound_fetch(sem, url, session, loop))
            tasks.append(task)
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return responses

## BEGIN ##
tlds = open('data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls, loop))
ans = loop.run_until_complete(future)
print(str(pd.Series(out).value_counts()))
Results: value_counts output for CONNECTIONS=1000 vs CONNECTIONS=100 (not shown).
Is this a bug? These sites do respond with a status code, and when run sequentially or with fewer connections there is no timeout error, so why is this happening? The other exceptions seem stable as you change the number of connections. The ClientOSErrors are from sites that actually time out or respond; I honestly don't really know where the concurrent.futures._base.TimeoutError errors are coming from.
Imagine you opened 1000 URLs in a browser simultaneously. I bet you'd notice many of them aren't loaded after 10 seconds. It's not a bug, it's a limit of your machine's resources.
The more parallel requests you make -> the less network capacity, CPU time, and RAM each one gets -> the higher the chance that a given request won't be ready before its timeout.
If you see many timeouts with 1000 connections, make fewer connections (and maybe increase the timeout). Based on the aiohttp documentation, using different ClientSession instances may also help:
Unless you are connecting to a large, unknown number of different
servers over the lifetime of your application, it is suggested you use
a single session for the lifetime of your application
I've had the same issue. Have a look at the details of the ClientOSErrors and you might see Too many open files; if so, you need to increase the OS's limit on open file descriptors.
Either way, you'll get more information if you print the whole exceptions, not just their types (see the sketch below).
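A minimal sketch of both suggestions, reusing the shape of the question's code; the lower CONNECTIONS value and the use of repr(e) instead of str(type(e)) are the only substantive changes, and the URL list is assumed to be built as in the question:

import asyncio
import aiohttp

CONNECTIONS = 100   # fewer parallel requests -> fewer spurious timeouts
TIMEOUT = 20        # a more forgiving per-request timeout
out = []

async def fetch(url, session):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            out.append(response.status)
    except Exception as e:
        # repr(e) keeps the message (e.g. "Too many open files"), not just the type
        out.append('Error: ' + repr(e))

async def run(urls):
    sem = asyncio.Semaphore(CONNECTIONS)
    connector = aiohttp.TCPConnector(limit=CONNECTIONS)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def bound(url):
            async with sem:
                await fetch(url, session)
        await asyncio.gather(*(bound(u) for u in urls), return_exceptions=True)

# asyncio.run(run(urls))  # urls built as in the question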

aiohttp timeout doesn't work properly

I have code that makes HTTP requests to sites (using aiohttp) with async_timeout. If I run all requests together, then some requests raise TimeoutError (even with timeout=20s). But if I run a single request, it works.
async def coro(url):
    with async_timeout.timeout(TIMEOUT, loop=loop):
        async with session.get(url) as response:
            text, status = (await response.text()), response.status
            ...
Is this an async_timeout problem/bug, or mine?
I tried to use a TCPConnector (aiohttp.TCPConnector(limit=None, verify_ssl=False, loop=loop)), but it doesn't work.
There is nothing strange about a request taking more than 20 seconds when a very large number of requests is in flight (even though the same request is much faster when executed alone).
To make sure, just insert timestamp printouts before and after the .get()/.text() calls, as in the sketch below.
async_timeout's code is dead simple and highly tested; don't suspect an error in it.
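A minimal sketch of the suggested timestamp printouts, assuming session, TIMEOUT and loop exist as in the question:

import time
import async_timeout

async def coro(url):
    with async_timeout.timeout(TIMEOUT, loop=loop):
        t0 = time.monotonic()
        async with session.get(url) as response:   # headers received here
            t1 = time.monotonic()
            text, status = (await response.text()), response.status  # body received here
            t2 = time.monotonic()
            print('{}: get() took {:.1f}s, text() took {:.1f}s'.format(url, t1 - t0, t2 - t1))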

Is it appropriate to use HTTP status codes for non-HTTP errors?

I know someone who is writing an API and wants to use HTTP status codes to report the outcome of queries. For example, if the user calls example.com/api/product_info?product_id=X and the product doesn't exist, it would return HTTP status 400: Bad Request. I think that, since this is a valid call (i.e. the actual HTTP request is not malformed), it should return a 200 response and just have the body of the response be something like {status: 'error', message: 'No such product'}.
So my questions are:
1) Is it appropriate to use HTTP status codes to convey non-HTTP program state, as in the example above?
2) Is there some standard, or at least widely used, specification describing when HTTP status codes are appropriate for use?
I was actually just talking about this the other day - http://blogs.mulesoft.org/api-best-practices-response-handling/
Your status code should reflect the response of the API: 200 is "OK" and should be used for data that is successfully returned, while 201 ("Created") should be used for newly created items.
As mentioned already, when the user makes a call that fails (e.g. users/?id=5 for a nonexistent user), the server could return a 400 to inform the user that it was a Bad Request, or a 404 if the resource doesn't exist.
It also depends on the action: if they are searching for users and there are no matches, I wouldn't return an error, just a 200 with no results found. However, if they are trying to do a PUT or PATCH on a user that doesn't exist, I would tell them with an error, as chances are there's a problem somewhere within their application (a rough server-side sketch follows).
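For example, a rough server-side sketch of that distinction; Flask is used here purely as an example framework, and the routes and in-memory data are hypothetical:

from flask import Flask, jsonify, request

app = Flask(__name__)
USERS = {5: {'id': 5, 'name': 'Alice'}}  # hypothetical in-memory data

@app.route('/users/')
def search_users():
    name = request.args.get('name', '')
    matches = [u for u in USERS.values() if name in u['name']]
    # A search with no hits is still a successful request: 200 with an empty list.
    return jsonify(matches), 200

@app.route('/users/<int:user_id>', methods=['PUT'])
def update_user(user_id):
    if user_id not in USERS:
        # Updating a resource that doesn't exist is a client error: 404.
        return jsonify({'message': 'No such user'}), 404
    USERS[user_id].update(request.get_json())
    return jsonify(USERS[user_id]), 200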
In the link posted above you'll find more status codes, but one of the biggest advantages of using status codes is that they inform the client, just through the header, of what actually happened on the server. This allows the client to do a relatively quick (and low-memory) check instead of having to deserialize the body and loop through an array looking for an errors key.
Essentially, you're giving them the tools to quickly and easily understand what is happening, something that I think every (sane) developer appreciates; a small client-side sketch follows.
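A minimal client-side sketch of that difference, assuming a hypothetical endpoint:

import requests

r = requests.get('https://example.com/api/product_info', params={'product_id': 'X'})  # hypothetical

# With meaningful status codes, one cheap check suffices; no body parsing on failure:
if r.status_code == 404:
    print('No such product')
elif r.ok:
    product = r.json()

# With an "always 200" API, the client must deserialize and inspect the body instead:
# body = r.json()
# if body.get('status') == 'error':
#     print(body.get('message'))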
Hope this helps!
- Mike

Does an HTTP Status code of 0 have any meaning?

It appears that when you make an XMLHttpRequest from a script in a browser, if the browser is set to work offline or if the network cable is pulled out, the request completes with an error and with status = 0. 0 is not listed among permissible HTTP status codes.
What does a status code of 0 mean? Does it mean the same thing across all browsers, and for all HTTP client utilities? Is it part of the HTTP spec or is it part of some other protocol spec? It seems to mean that the HTTP request could not be made at all, perhaps because the server address could not be resolved.
What error message is appropriate to show the user? "Either you are not connected to the internet, or the website is encountering problems, or there might be a typing error in the address"?
I should add that I see this behavior in Firefox when set to "Work Offline", but not in Microsoft Internet Explorer when set to "Work Offline". In IE, the user gets a dialog giving the option to go online. Firefox does not notify the user before returning the error.
I am asking this in response to a request to "show a better error message". What Internet Explorer does is good: it tells the user what is causing the problem and gives them the option to fix it. In order to give an equivalent UX with Firefox, I need to infer the cause of the problem and inform the user. So what in total can I infer from status 0? Does it have a universal meaning, or does it tell me nothing?
Short Answer
It's not a HTTP response code, but it is documented by WhatWG as a valid value for the status attribute of an XMLHttpRequest or a Fetch response.
Broadly speaking, it is a default value used when there is no real HTTP status code to report and/or an error occurred sending the request or receiving the response. Possible scenarios where this is the case include, but are not limited to:
The request hasn't yet been sent, or was aborted.
The browser is still waiting to receive the response status and headers.
The connection dropped during the request.
The request timed out.
The request encountered an infinite redirect loop.
The browser knows the response status, but you're not allowed to access it due to security restrictions related to the Same-origin Policy.
Long Answer
First, to reiterate: 0 is not a HTTP status code. There's a complete list of them in RFC 7231 Section 6.1, that doesn't include 0, and the intro to section 6 states clearly that
The status-code element is a three-digit integer code
which 0 is not.
However, 0 as a value of the .status attribute of an XMLHttpRequest object is documented, although it's a little tricky to track down all the relevant details. We begin at https://xhr.spec.whatwg.org/#the-status-attribute, documenting the .status attribute, which simply states:
The status attribute must return the response’s status.
That may sound vacuous and tautological, but in reality there is information here! Remember that this documentation is talking here about the .status attribute of an XMLHttpRequest, not of a response, so this tells us that the definition of status on an XHR object is deferred to the definition of a response's status in the Fetch spec.
But what response object? What if we haven't actually received a response yet? The inline link on the word "response" takes us to https://xhr.spec.whatwg.org/#response, which explains:
An XMLHttpRequest has an associated response. Unless stated otherwise it is a network error.
So the response whose status we're getting is by default a network error. And by searching for everywhere the phrase "set response to" is used in the XHR spec, we can see that it's set in five places:
To a network error, when:
the open() method is called, or
the response's body's stream is errored (see the algorithm described in the docs for the send() method)
the timed out flag is set, causing the request error steps to run
the abort() method is called, causing the request error steps to run
To the response produced by sending the request using Fetch, by way of either the Fetch process response task (if the XHR request is asynchronous) or the Fetch process response end-of-body task (if the XHR request is synchronous).
Looking in the Fetch standard, we can see that:
A network error is a response whose status is always 0
so we can immediately tell that we'll see a status of 0 on an XHR object in any of the cases where the XHR spec says the response should be set to a network error. (Interestingly, this includes the case where the body's stream gets "errored", which the Fetch spec tells us can happen during parsing the body after having received the status - so in theory I suppose it is possible for an XHR object to have its status set to 200, then encounter an out-of-memory error or something while receiving the body and so change its status back to 0.)
We also note in the Fetch standard that a couple of other response types exist whose status is defined to be 0, whose existence relates to cross-origin requests and the same-origin policy:
An opaque filtered response is a filtered response whose ... status is 0...
An opaque-redirect filtered response is a filtered response whose ... status is 0...
(various other details about these two response types omitted).
But beyond these, there are also many cases where the Fetch algorithm (rather than the XHR spec, which we've already looked at) calls for the browser to return a network error! Indeed, the phrase "return a network error" appears 40 times in the Fetch standard. I will not try to list all 40 here, but I note that they include:
The case where the request's scheme is unrecognised (e.g. trying to send a request to madeupscheme://foobar.com)
The wonderfully vague instruction "When in doubt, return a network error." in the algorithms for handling ftp:// and file:// URLs
Infinite redirects: "If request’s redirect count is twenty, return a network error."
A bunch of CORS-related issues, such as "If httpRequest’s response tainting is not "cors" and the cross-origin resource policy check with request and response returns blocked, then return a network error."
Connection failures: "If connection is failure, return a network error."
In other words: whenever something goes wrong other than getting a real HTTP error status code like a 500 or 400 from the server, you end up with a status attribute of 0 on your XHR object or Fetch response object in the browser. The number of possible specific causes enumerated in spec is vast.
Finally: if you're interested in the history of the spec for some reason, note that this answer was completely rewritten in 2020, and that you may be interested in the previous revision of this answer, which parsed essentially the same conclusions out of the older (and much simpler) W3 spec for XHR, before these were replaced by the more modern and more complicated WhatWG specs this answer refers to.
Status 0 appears when an AJAX call was cancelled before getting the response, for example by refreshing the page or requesting a URL that is unreachable.
This status is not documented, but it exists for AJAX and makeRequest calls from gadget.io.
I know it's an old post, but these issues still exist.
Here are some of my findings on the subject, grossly explained.
"Status" 0 means one of 3 things, as per the XMLHttpRequest spec:
DNS name resolution failed (that's the case, for instance, when the network plug is pulled out)
the server did not answer (a.k.a. unreachable or unresponsive)
the request was aborted because of a CORS issue (the abort is performed by the user agent and follows a failed OPTIONS pre-flight).
If you want to go further, dive deep into the inner workings of XMLHttpRequest. I suggest reading up on the ready-state update sequence ([0,1,2,3,4] is the normal sequence, [0,1,4] corresponds to status 0, [0,1,2,4] means no content was sent, which may or may not be an error). You may also want to attach listeners to the xhr (onreadystatechange, onabort, onerror, ontimeout) to figure out the details.
From the spec (XHR Living spec):
const unsigned short UNSENT = 0;
const unsigned short OPENED = 1;
const unsigned short HEADERS_RECEIVED = 2;
const unsigned short LOADING = 3;
const unsigned short DONE = 4;
From the documentation at http://www.w3.org/TR/XMLHttpRequest/#the-status-attribute:
it means a request was cancelled before going anywhere.
Since iOS 9, you need to add "App Transport Security Settings" to your Info.plist file and enable "Allow Arbitrary Loads" before making requests to a non-secure HTTP web service. I had this issue in one of my apps.
Yes, somehow the AJAX call was aborted. The cause may be one of the following:
The user navigated to another page before the AJAX request completed.
The AJAX request timed out.
The server is not able to return any response.
