Fixing 416 error: HTTParty - ruby

Getting a 416 error when trying to GET a website with HTTParty. Works just fine in the browser.
I have never gotten this error before, so I went online and found this:
It occurs when the server is unable to fulfill the request. This may
be, for example, because the client asked for the 800th-900th bytes of
a document, but the document is only 200 bytes long.
The request includes a Range request-header field, and none of the
range-specifier values in this field overlap the current extent of
the selected resource, and also the request does not include an
If-Range request-header field.
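To make the quoted definition concrete, here is a minimal Python sketch of the satisfiability check a server performs. It is an illustration only: the function name is hypothetical, and it handles just a single "first-last" byte range, not suffix ranges or multiple ranges.

```python
def range_is_satisfiable(range_header: str, resource_length: int) -> bool:
    """Return True if a single 'bytes=first-last' range overlaps the resource.

    Simplified sketch: suffix ranges ('bytes=-500') and multiple ranges
    are deliberately not handled.
    """
    unit, _, spec = range_header.partition("=")
    if unit.strip() != "bytes":
        raise ValueError("only byte ranges are handled in this sketch")
    first_text, _, last_text = spec.partition("-")
    first = int(first_text)
    if last_text and int(last_text) < first:
        raise ValueError("malformed range: last byte precedes first byte")
    # The range overlaps the resource only if it starts inside it.
    return first < resource_length


# The example from the quote: bytes 800-900 of a 200-byte document.
status = 206 if range_is_satisfiable("bytes=800-900", 200) else 416
print(status)
```

A range that merely runs past the end (for example bytes 150-900 of the same document) is still satisfiable; the server would return the part that exists.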
Wondering if anyone has gotten a 416 with HTTParty before and if there is a way to prevent this from happening. Thanks
Example website where error occurs:
http://www.bizjournals.com/jacksonville/blog/morning-edition/2014/07/teens-make-up-less-of-summer-workforce-than-ever.html

It appears that bizjournals is able to detect that you are a bot (that is, not accessing the site from a browser) and therefore returns a 416.
irb(main):005:0> HTTParty.get('http://www.bizjournals.com/jacksonville/blog/morning-edition/2014/07/teens-make-up-less-of-summer-workforce-than-ever.html').body
=> "........As you were browsing <strong>http://www.bizjournals.com</strong> something about your browser made us think you were a bot. There are a few reasons this might happen........"
You could either ask bizjournals to allow you to make requests or try to change the headers to make bizjournals think you are not a bot.
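As a rough sketch of the second option, the request can carry browser-like headers. The example below uses Python's standard library rather than HTTParty, and the header values are assumptions chosen for illustration; there is no guarantee that any particular set of headers satisfies bizjournals' bot detection. HTTParty accepts a comparable headers: option on HTTParty.get.

```python
import urllib.request

# Hypothetical header values imitating a desktop browser; adjust as needed.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url: str) -> str:
    """Send the request and return the decoded body (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


request = build_request("http://www.bizjournals.com/")
# urllib stores header names with only the first letter capitalized.
print(request.get_header("User-agent"))
```

Only build_request runs here; fetch is left uncalled so the sketch does not hit the site unless you ask it to.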

Related

How can I scrape an image that doesn't have an extension?

Sometimes I come across an image that I can't scrape so that it can be saved. An example of this is:
https://s3.amazonaws.com/plumdistrict.com-production/perks/12321/image/original.?1325898487
When I open the URL in Internet Explorer I see the image, but when I try to fetch it with the code below, GetResponse throws "System.Net.WebException: The remote server returned an error: (403) Forbidden":
string url = "https://s3.amazonaws.com/plumdistrict.com-production/perks/12321/image/original.?1325898487";
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();
Any ideas on how to get this image?
Edit:
I am able to save images that do have extensions. For example I can scrape the following image just fine:
https://s3.amazonaws.com/plumdistrict.com-production/perks/12659/image/original.jpg?1326828951
Although HTTP was originally designed to be stateless, a lot of implementations rely on state. I could configure my web server to only accept requests for "http://mydomain.com/sexy_avatar.jpg" if you provide a cookie proving you were logged in. If not, I send you a 303 redirect to "http://mydomain.com/avatar_for_public_use.jpg".
Amazon could be doing the same. Try to load the web page using Chrome, and look at the Network view in developer mode (CTRL+SHIFT+J) to see all headers supplied to the website. Maybe you even need to do a full navigation in the same session before you are allowed to see the image. This is certainly the case in many web applications I have developed :-)
Well, it looks like it's being generated from a script (possibly being retrieved from a database). The server should be sending a file/content type to go along with that... but it doesn't seem to be, which I believe is a violation of standards.
My Linux box knows full well that that's a JPEG image once it's on my hard drive, because it examines file headers rather than relying on extensions. Perhaps there is a tool to do the same in Windows?
Edit: Actually, on further contemplation, it seems odd that you'd get a 403 for that. Perhaps the server is actually blocking you from retrieving the file in that manner.
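The idea of examining file headers instead of extensions can be sketched in a few lines of Python. The signatures below are the well-known leading bytes of JPEG, PNG and GIF files; the function name is hypothetical, and this does nothing about the 403 itself.

```python
from typing import Optional

# Leading "magic" bytes for a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess the image format from the first bytes, ignoring any file name."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None


# A downloaded body that starts with the JPEG marker, whatever its URL looked like.
print(sniff_image_type(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

Once the bytes are retrieved, the detected type can be used to pick a file extension when saving.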

Cross domain ajax POST in chrome

There are several topics about the problem with cross-domain AJAX. I've been looking at these and the conclusion seems to be this:
Apart from using something like JSONP, or a proxy solution, you should not be able to do a basic jQuery $.post() to another domain
My test code looks something like this (running on "http://myTestdomain.tld/path/file.html")
var myData = {datum1: "datum", datum2: "datum"};
// "return" is a reserved word in JavaScript, so the callback parameter needs another name
$.post("http://External-Ip:port", myData, function(response){ alert(response); });
When I tried this (the reason I started looking), chrome-console told me:
XMLHttpRequest cannot load
http://External-IP:port/page.php. Origin
http://myTestdomain.tld is not allowed
by Access-Control-Allow-Origin.
Now this is, as far as I can tell, expected. I should not be able to do this. The problem is that the POST actually DOES come through. I've got a simple script running that saves the $_POST to a file, and it is clear the post gets through. Any real data I return is not delivered to my calling script, which again seems expected because of the Access-Control issue. But the fact that the post actually arrived at the server got me confused.
Is it correct that I assume that above code running on "myTestdomain" should not be able to do a simple $.post() to the other domain (External-IP)?
Is it expected that the request would actually arrive at the external IP's script, even though the output is not received? Or is this a bug? (I'm using Chrome 11.0.696.60.)
I posted a ticket about this on the WebKit bugtracker earlier, since I thought it was weird behaviour and possibly a security risk.
Since security-related tickets aren't publicly viewable, I'll quote the reply from Justin Schuh here:
This is implemented exactly as required by the spec. For simple cross-origin requests <http://www.w3.org/TR/cors/#simple-method> there is no pre-flight check; the request is made and the response cannot be read if the appropriate headers do not authorize the requesting origin. Functionally, this is no different than creating a form and using script to make an off-origin POST (which has always been possible).
So: you're allowed to do the POST since you could have done that anyway by embedding a form and triggering the submit button with javascript, but you can't see the result. Because you wouldn't be able to do that in the form scenario.
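The distinction the reply relies on can be sketched as follows. This Python function is a simplified, hypothetical model of the rule in the CORS specification that decides whether a browser sends a preflight OPTIONS request first; it ignores finer points such as restrictions on header values.

```python
SIMPLE_METHODS = {"GET", "HEAD", "POST"}
SIMPLE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}
# Headers a script may set without triggering a preflight.
SIMPLE_HEADERS = {"accept", "accept-language", "content-language", "content-type"}


def needs_preflight(method: str, headers: dict) -> bool:
    """Return True if a cross-origin request would be preflighted (simplified)."""
    if method.upper() not in SIMPLE_METHODS:
        return True
    for name, value in headers.items():
        if name.lower() not in SIMPLE_HEADERS:
            return True
        if name.lower() == "content-type":
            media_type = value.split(";")[0].strip().lower()
            if media_type not in SIMPLE_CONTENT_TYPES:
                return True
    return False


# A form-encoded POST, which is what jQuery's $.post() sends for a plain object.
form_post = needs_preflight(
    "POST", {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}
)
# A JSON POST is not "simple", so the browser asks the server first.
json_post = needs_preflight("POST", {"Content-Type": "application/json"})
print(form_post, json_post)
```

Under this model the POST in the question is a simple request, which is why it reaches the server without any prior check.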
A solution would be to add a header to the script running on the target server, e.g.
<?php
header("Access-Control-Allow-Origin: http://your_source_domain");
....
?>
Haven't tested that, but according to the spec, that should work.
Firefox 3.6 seems to handle it differently, by first doing an OPTIONS to see whether or not it can do the actual POST. Firefox 4 does the same thing Chrome does, or at least it did in my quick experiment. More about that is on https://developer.mozilla.org/en/http_access_control
The important thing to note about the JavaScript same-origin policy restriction is that it is something built into modern browsers for security - it is not a limitation of the technology or something enforced by servers.
To answer your question, neither of these are bugs.
Requests are not stopped from reaching the server - this gives the server the option to allow these cross-domain requests by setting the appropriate headers.
The response is also received back by the browser. Before the introduction of the access control headers, responses to cross-domain requests would be stopped dead in their tracks by a security-conscious browser - the browser would receive the response but it would not hand it off to the script. With the access control headers, the server has the option of setting the appropriate headers indicating to a compliant browser that it would like to allow certain origin URLs to make cross-domain requests.
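A rough model of the browser-side check described above: the request has already reached the server, and the only remaining question is whether the script gets to see the response. The function name and the origins are hypothetical, and credentialed requests are not modelled.

```python
def script_may_read_response(page_origin: str, response_headers: dict) -> bool:
    """Return True if the Access-Control-Allow-Origin header admits page_origin."""
    # Header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in response_headers.items()}
    allowed = normalised.get("access-control-allow-origin")
    if allowed is None:
        # No header: the browser withholds the response from the script.
        return False
    return allowed.strip() in ("*", page_origin)


page = "http://myTestdomain.tld"
print(script_may_read_response(page, {"Content-Type": "text/html"}))
print(script_may_read_response(page, {"Access-Control-Allow-Origin": page}))
```

The first call models the situation in the question (the server sends no access control header), the second the situation after the server opts in.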
The exact behaviour on response might differ between browsers - I can't recall for sure now but I think Chrome calls the success callback function when using jQuery's ajax() but the response is empty. IIRC, Firefox will not invoke the success function.
I get the same thing happening for me. You are able to post across domains but are not able to receive a response. This is what I expected to be able to do and happens for me in Firefox, Chrome, and IE.
One way to get around this caveat is to have a local PHP file which fetches the data via curl and passes the response back to your JavaScript. (This more or less restates what you said you knew already.)
Yes, it's correct, and you won't be able to do that unless you use a proxy.
No, the request won't go to the external IP as long as there is such a limitation.

BITS error codes

I'm writing an application updater that pulls an installation package from our distribution web site to the user's PC using the Background Intelligent Transfer Service (BITS).
More or less everything is working fine now, but I'm having a bit of a problem getting the application to react well to all recoverable errors. Specifically, I'd like the application to properly handle the case of proxy authentication.
In HTTP, it's simple: make a request, get a "407" HTTP response code, prompt for user name/password and repeat until you either get through or the user presses "cancel".
With BITS, it's not that simple. I don't get the HTTP status code. I get a couple of codes: the context (which should be BG_ERROR_CONTEXT_REMOTE_FILE in my case) and an "ErrorCode" that is supposed to depend on the context.
If I request the textual description of the error through GetErrorDescription, I get the correct "407 proxy authentication required" text. But the error code I have is 0x80190197, which is nowhere near 407.
So, does anyone know where I can get a full list of the BITS error codes? Failing that, a partial list with the most common errors would be nice.
0x80190197 is not strictly speaking a BITS error, it's an HTTP stack error. The list is available here: Errors (019) FACILITY_HTTP
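The HTTP status is in fact embedded in that value. An HRESULT packs a severity bit, a facility code and a 16-bit error code; for facility 0x19 (FACILITY_HTTP, the "019" in the list's title) the low 16 bits hold the HTTP status. A small Python sketch, with hypothetical function names:

```python
from typing import Optional

FACILITY_HTTP = 0x19  # facility code 25


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    return {
        "failure": bool((hresult >> 31) & 0x1),
        # Same mask as the Windows HRESULT_FACILITY macro.
        "facility": (hresult >> 16) & 0x1FFF,
        "code": hresult & 0xFFFF,
    }


def http_status_from_hresult(hresult: int) -> Optional[int]:
    """Return the HTTP status carried by a FACILITY_HTTP HRESULT, else None."""
    fields = decode_hresult(hresult)
    if fields["facility"] == FACILITY_HTTP:
        return fields["code"]
    return None


print(http_status_from_hresult(0x80190197))
```

Since 0x197 is 407 in decimal, the updater can mask the error code this way and then handle the result like an ordinary HTTP status.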

Django: How to track down a spurious HTTP request?

I have 3 AJAX functions to move data between a Django app on my website and some JavaScript using YUI in the browser. There is not a major difference between them in terms of their structure, concept, code, etc. 2 of them work fine, but in the 3rd one, I get one spurious HTTP request immediately after the intended request. Its POST data contains a subset of the POST data of the intended request. The request meta data is identical except for the CONTENT_LENGTH (obviously) and the CONTENT_TYPE, which is 'text/plain; charset=UTF-8' for the intended and 'application/x-www-form-urlencoded' for the unwanted request. I do not set the content type explicitly at all, which seems to suggest the two requests do not have the same origin and the second one just pops out of thin air.
The intended request sets 'HTTP_CACHE_CONTROL': 'no-cache' and 'HTTP_PRAGMA': 'no-cache', the spurious one does not. The dev server log output for both requests is
[15/Feb/2010 15:00:12] "POST /settings/ HTTP/1.1" 200 0
What does the last 0 at the end mean? I could not find any documentation on that. This value is usually non-zero... In Apache, it is the total size in bytes of the server response; can someone confirm it's the same for Django?
My problem obviously is to find out where this additional request comes from.
I am fairly familiar with JS debugging using Firebug and I think I'm good at Python and Django, but I do not know a lot about the internals of HTTP requests and responses. I can breakpoint and step through the JS code that sends the intended XMLHTTP request, but that breakpoint does not get hit again.
The problem occurs with both FF3 and Safari, I'm on Snow Leopard, so I can't test with IE or Chrome.
I've looked at Django debugging techniques and tools like http://robhudson.github.com/django-debug-toolbar/ but I think I already have the information they can give me.
Can someone advise on a strategy or a tool to narrow the problem down?
The problematic AJAX function submits form data, the working two don't. Forms have a default action which takes place when the form is submitted: post a request with the form data. I failed to prevent this default action.
So the spurious request did indeed come out of the dark underwood of the browser, there is no code in my js files that sends it.
Solution:
YAHOO.util.Event.preventDefault(event);
at the beginning of the form submit event handler.
See also http://developer.yahoo.com/yui/examples/event/eventsimple.html

Why is AJAX returning HTTP status code 0?

For some reason, while using AJAX (with my dashcode developed application) the browser just stops uploading and returns status codes of 0. Why does this happen?
Another case:
It could be possible to get a status code of 0 if you have sent an AJAX call and a refresh of the browser was triggered before getting the AJAX response. The AJAX call will be cancelled and you will get this status.
In my experience, you'll see a status of 0 when:
doing cross-site scripting (where access is denied)
requesting a URL that is unreachable (typo, DNS issues, etc)
the request is otherwise intercepted (check your ad blocker)
as above, if the request is interrupted (browser navigates away from the page)
Same problem here when using <button onclick="">submit</button>. Then solved by using <input type="button" onclick="">
Status code 0 means the requested URL is not reachable. Changing http://something/something to https://something/something worked for me. IE throws an error saying "permission denied" when the status code is 0; other browsers don't.
It is important to note, that ajax calls can fail even within a session which is defined by a cookie with a certain domain prefixed with www. When you then call your php script e.g. without the www. prefix in the url, the call will fail and viceversa, too.
Because this shows up when you google ajax status 0, I wanted to leave a tip about something that just cost me hours of wasted time... I was using ajax to call a PHP service which happened to be Phil's REST_Controller for Codeigniter (not sure if this has anything to do with it or not) and kept getting status 0, readystate 0, and it was driving me nuts. I was debugging it and noticed that when I would echo and return instead of exit the message, I'd get a success. Finally I turned debugging off, tried again, and it worked. It seems the xDebug debugger for PHP was somehow modifying the response. If you're using a PHP debugger, try turning it off to see if that helps.
I found another case where jquery gives you status code 0 -- if for some reason XMLHttpRequest is not defined, you'll get this error.
Obviously this won't normally happen on the web, but a bug in a nightly firefox build caused this to crop up in an add-on I was writing. :)
This article helped me. I was submitting a form via AJAX and forgot to use return false (after my ajax request), which led to a classic form submission that, strangely, was not completed.
"Accidental" form submission was exactly the problem I was having. I just removed the FORM tags altogether and that seems to fix the problem. Thank you, everybody!
I had the same problem, and it was related to XSS (cross site scripting) block by the browser. I managed to make it work using a server.
Take a look at: http://www.daniweb.com/web-development/javascript-dhtml-ajax/threads/282972/why-am-i-getting-xmlhttprequest.status0
We had a similar problem - status code 0 on a jQuery ajax call - and it took us a whole day to diagnose it. Since no one had mentioned this reason yet, I thought I'd share.
In our case the problem was an HTTP server crash. Some bug in PHP was crashing Apache, so on the client end it looked like this:
mirek@toccata:~$ telnet our.server.com 80
Trying 180.153.xxx.xxx...
Connected to our.server.com.
Escape character is '^]'.
GET /test.php HTTP/1.0
Host: our.server.com
Connection closed by foreign host.
mirek@toccata:~$
where test.php contained the crashing code.
No data returned from the server (not even headers) => ajax call was aborted with status 0.
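The failure is easy to reproduce locally. In this Python sketch a throwaway server (an assumption standing in for the crashing Apache) reads the request and closes the socket without writing anything, so the client never obtains a status line. Browsers surface the same situation to XMLHttpRequest as status 0.

```python
import http.client
import socket
import threading


def serve_once_and_hang_up(listener: socket.socket) -> None:
    """Accept one connection, read the request, then close without replying."""
    connection, _ = listener.accept()
    with connection:
        connection.recv(65536)  # consume the request before closing


def probe(host: str, port: int) -> str:
    """Issue a GET and describe what came back."""
    client = http.client.HTTPConnection(host, port, timeout=5)
    try:
        client.request("GET", "/test.php")
        return f"HTTP {client.getresponse().status}"
    except (ConnectionError, http.client.BadStatusLine):
        # No status line at all: there is nothing to report as a status code.
        return "no response"
    finally:
        client.close()


listener = socket.create_server(("127.0.0.1", 0))  # port 0 picks a free port
port = listener.getsockname()[1]
server = threading.Thread(target=serve_once_and_hang_up, args=(listener,))
server.start()
result = probe("127.0.0.1", port)
server.join()
listener.close()
print(result)
```

Checking the server's error log, or repeating the request by hand as in the telnet session above, is what distinguishes this cause from the client-side ones.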
In my case, it was caused by running my django server under http://127.0.0.1:8000/ but sending the ajax call to http://localhost:8000/. Even though you would expect them to map to the same address, they don't so make sure you're not sending your requests to localhost.
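The reason is that browsers compare origins as (scheme, host, port) triples of text, without resolving host names. A minimal Python sketch of that comparison, with hypothetical helper names:

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a: str, b: str) -> bool:
    """Compare two URLs the way the same-origin policy does."""
    return origin(a) == origin(b)


# Same machine, but different host strings, so different origins.
print(same_origin("http://127.0.0.1:8000/", "http://localhost:8000/api/"))
# Identical scheme, host and port: same origin regardless of path.
print(same_origin("http://localhost:8000/", "http://localhost:8000/api/"))
```

The same comparison explains failures caused by mixing http with https, or a www. prefix with the bare domain.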
In our case, the page link was changed from https to http. Even though the users were logged in, they were prevented from loading with AJAX.
In my case, setting url: '' in the ajax settings would result in a status code 0 in IE8. It seems IE just doesn't tolerate such a setting.
For me, the problem was caused by the hosting company (Godaddy) treating POST operations which had substantial response data (anything more than tens of kilobytes) as some sort of security threat. If more than 6 of these occurred in one minute, the host refused to execute the PHP code that responded to the POST request during the next minute. I'm not entirely sure what the host did instead, but I did see, with tcpdump, a TCP reset packet coming as the response to a POST request from the browser. This caused the http status code returned in a jqXHR object to be 0.
Changing the operations from POST to GET fixed the problem. It's not clear why Godaddy impose this limit, but changing the code was easier than changing the host.
I think I know what may cause this error.
In google chrome there is an in-built feature to prevent ddos attacks for google chrome extensions.
When ajax requests continuously return 500+ status errors, it starts to throttle the requests.
Hence it is possible to receive status 0 on following requests.
In an attempt to win the prize for the dumbest reason for the problem described:
Forgetting to call
xmlhttp.send(); //yes, you need this pivotal line!
Yes, I was still getting status returns of zero from the 'open' call.
In my case, I was getting this but only on Safari Mobile. The problem is that I was using the full URL (http://example.com/whatever.php) instead of the relative one (whatever.php). This doesn't make any sense though, it can't be a XSS issue because my site is hosted at http://example.com. I guess Safari looks at the http part and automatically flags it as an insecure request without inspecting the rest of the URL.
In my troubleshooting, I found that an AJAX xmlhttpRequest.status == 0 can mean the call never reached the server and failed because of an issue on the client side. If the response had come from the server, the status would be one of the 1xx/2xx/3xx/4xx/5xx HTTP response codes. The troubleshooting should therefore focus on the client: the network connection could be down, or it could be one of the causes described by @Langdon above.
In my case, I was making a Firefox Add-on and forgot to add the permission for the url/domain I was trying to ajax, hope this saves someone a lot of time.
Watch the browser console while making the request. If you see "The Same Origin Policy disallows reading the remote resource at http ajax..... reason: cors header ‘access-control-allow-origin’ missing", you need to add "Access-Control-Allow-Origin" to the response headers. For example, in Java you can set it with response.setHeader("Access-Control-Allow-Origin", "*"), where response is an HttpServletResponse.

Resources