For some time I have been trying to solve a fairly common problem consisting of basically three steps:
fetch an HTML page at a specified URL and store its content in a String
detect the content encoding, either from the HTML meta information or from the HTTP header
recode the content into UTF-8 for further processing
In real usage the first step is extended a little with features like a "user-agent" instance with a cookie jar, a configurable timeout and number of GET attempts, a configurable request-count-per-time-frame limit, etc.
I implemented a rest-client wrapper but ran into several problems:
the class-global RestClient.proxy setting conflicts with e.g. couchrest (which uses rest-client itself)
freezing: sometimes the timeout causes the process to hang; AFAIK several of my friends have run into the same problem with rest-client
redirect Location URI parsing: rest-client fails to fetch "http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte", complaining about the invalid URI '/indexnew.aspx?cidade=4,Belo Horizonte' in the Location header of the 302 response, while curb follows it through to the target page without trouble. I'm about to reimplement the wrapper using curb
recoding problems in the third step: I attempt to detect the encoding from the HTML meta information and the HTTP header (in that order), but for some pages still to no avail
I would love to hear about a gem that handles these needs, or any hints toward a more universal solution, if one exists.
As nobody answered, I ended up implementing the curb-based solution myself:
curburger
Perhaps somebody will find it useful.
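For reference, here is a minimal sketch of the fetch-and-recode approach using curb and Ruby's built-in String#encode (the helper name and the regex-based charset detection are illustrative, not taken from curburger):

require 'curb'

def fetch_utf8(url)
  c = Curl::Easy.new(url) do |curl|
    curl.follow_location = true  # follow 3xx redirects
    curl.timeout = 10
  end
  c.perform
  body = c.body_str

  # 1) charset from the Content-Type response header, e.g. "text/html; charset=ISO-8859-1"
  charset = c.content_type.to_s[/charset=([\w-]+)/i, 1]
  # 2) fall back to the HTML meta information, then to UTF-8
  charset ||= body[/<meta[^>]+charset=["']?([\w-]+)/i, 1]
  charset ||= 'UTF-8'

  begin
    body.force_encoding(charset).encode('UTF-8', invalid: :replace, undef: :replace)
  rescue ArgumentError, Encoding::ConverterNotFoundError
    # unknown or misdeclared charset: fall back to treating the body as UTF-8
    body.force_encoding('UTF-8')
  end
end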
We have some publicly shared S3 files that we want to make sure won't be indexed by Google. I can't seem to find any documentation on how to do this. Is there a way to set a "noindex" x-robots-tag response header on individual S3 objects?
(We're using the Ruby AWS client)
There does not appear to be a way to do this.
Only certain headers from an S3 PUT object request are documented as being returned when the object is fetched.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
Anything else you send appears to be simply disregarded, as long as it doesn't actually invalidate the request.
Actually, that's what I thought before researching this, and it's almost true.
The documentation here seems incomplete, and elsewhere suggests the following request headers, if sent with the upload, will appear in the download:
Cache-Control
Content-Disposition
Content-Encoding
Content-Type
x-amz-meta-*
Other headers are listed at the latter link, but some of these like Expect wouldn't make sense on a GET request, so they logically wouldn't appear.
So far, this is all consistent with my experience with S3.
If you send a random but not-invalid header with your request, it's ignored. Example:
X-Foo: bar
S3 seems to accept this on upload, but discards it (presumably doesn't store it)... downloading the object does not return the X-Foo header.
But X-Robots-Tag appears to be an undocumented exception to this.
Uploading a file with X-Robots-Tag: noindex (for example) does indeed result in the same header and value being returned with the object when you GET it.
Unless somebody can cite the documentation that explains why this works, we're operating in distinctly undocumented territory.
But, if you're interested in going there, the simple answer appears to be, you just add this header to the HTTP PUT request you send to the REST API to upload the object.
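For illustration only, a hypothetical raw request would look roughly like this (authentication and date headers omitted, bucket and key names made up):

PUT /hello.txt HTTP/1.1
Host: my-bucket.s3.amazonaws.com
Content-Type: text/plain
X-Robots-Tag: noindex

hello world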
"Not so fast," you say, "I'm using the Ruby SDK." Indeed. The AWS Ruby client seems to be too "helpful" to let you get away with this, at least, not easily. The docs there show how to add "metadata" --
:metadata (Hash) — A hash of metadata to be included with the object. These will be sent to S3 as headers prefixed with x-amz-meta. Each name, value pair must conform to US-ASCII.
Well, that's not going to work, because you'd get x-amz-meta-x-robots-tag.
How do you set other headers in the upload? Every other header you'd normally set is an element of the options hash, like :cache_control, which turns into Cache-Control: in the upload request. Unless they're blindly applying the keys from that hash to the upload transaction (which would be terrible design combined with excellent luck), you may not have a straightforward way to get here from there. I can't be much more specific, because the only thing I really know about Ruby is the same thing I know about Java -- from what I've seen of it, I don't like it. :)
But X-Robots-Tag does appear to be a custom header S3 supports, to some extent, without clear documentation of that fact. It's, at least, accepted by the REST API.
Failing the above, you can manually add this header to the metadata in the S3 console after uploading the object. (Note, X-Foo: Bar doesn't work from the S3 console, either -- it's silently discarded, with no error -- but X-Robots-Tag: works fine).
You can also, of course, put a publicly-readable robots.txt file (with the appropriate directives in it) in the root of the bucket. Depending on your content mix, path hierarchy, and other factors, that isn't (perhaps) as simple as selectively setting headers, but if the entire bucket consists of information you don't want indexed, it should easily accomplish what you want: content disallowed in robots.txt should not be indexed even when a search spider follows a link to it from another site, since every domain's (and subdomain's) robots.txt file stands alone.
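For example, if nothing in the bucket should be crawled at all (assuming the bucket is reachable at its own hostname, e.g. via the website endpoint or a CNAME), the directives could be as simple as:

User-agent: *
Disallow: /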
@Michael - sqlbot is correct. The SDKs don't support it by default and it won't show in the AWS Console, but if you set it directly with the REST API it works. For those who don't want to figure out the REST API and its authentication method, I was able to modify the node.js aws-sdk to support this feature.
Amazon stores the method params configuration and validation in a large JSON file: apis/s3-2006-03-01.min.json. I guess the other SDKs may implement their validation in the same way.
Go to the "PutObject" command, and under "input.members" add a new parameter, "XRobotsTag". Configure it as a "header" and set the location name to "X-Robots-Tag":
"XRobotsTag": {
"location": "header",
"locationName": "X-Robots-Tag"
}
Your local aws-sdk is now configured to support X-Robots-Tag on your putObject requests. In node.js this would look like this:
s3.putObject({
  ACL: "public-read",
  Body: "hello world",
  Bucket: "my-bucket",
  CacheControl: "public, max-age=31536000",
  ContentType: "text/plain",
  Key: "hello.txt",
  XRobotsTag: "noindex, nofollow"
}, function(err, resp){});
I'm trying to find a method that takes a URI/URL string from a user and determines a working, canonical form (or fails if the resource isn't valid). It should simultaneously verify that the URL exists. So we're checking both for valid "syntax" and for existence.
For instance, a string like google.com should be turned into http://www.google.com, and a string like google.com/insights should be turned into http://www.google.com/insights. A string like http://thiswebsitedoesntexistatall.com should return some sort of error or exception.
I believe part of the solution will likely be calling an HTTP get_response() method and following redirects until I get a 200 OK status.
It seems like the URI.parse() method is not forgiving of leaving off the http. I realize I could write a simple thing to try adding http in front, etc., but I was hoping there was some existing gem or little-known library function that would be really forgiving about URLs and canonicalize them for me.
Both the built in net/http and HTTParty seem to be too strict for what I'm looking for. Is there a nice way to do this?
There are some problems with what you're asking for:
A URL parser shouldn't assume the value passed in is HTTP, when FTP and many other protocols are equally valid. If you know the protocol is likely to be HTTP, then you need to add the protocol.
If you try to connect to a site and follow redirects until you get a 200 response, you've only proven that the URL resolves to a valid page of some sort. That 200 could be an error page returned because the page you want is a dead link or invalid, or because the site is temporarily down. To prove otherwise, you need some intimate pre-knowledge about the page you're looking for, such as specific content to search for.
Assuming the URL is good after you follow the redirects is not quite safe either. Many sites tack all sorts of session data onto the URL, so what starts as a simple, clean URL can resolve to a long and convoluted one.
I'd recommend you look at the Addressable::URI gem. It's much more full-featured than Ruby's URI. It won't make the decisions for you, but at least it will give you a more complete API and can rewrite/normalize URLs. Cleaning them up and/or determining if they are good is still left as an exercise for you.
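As a rough sketch of that exercise (the canonicalize helper below is illustrative, not part of Addressable): Addressable::URI.heuristic_parse will add a missing http:// for you, and Net::HTTP can then chase redirects until something answers with a 200:

require 'addressable/uri'
require 'net/http'

def canonicalize(str, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = Addressable::URI.heuristic_parse(str).normalize
  response = Net::HTTP.get_response(URI(uri.to_s))

  case response
  when Net::HTTPSuccess
    uri.to_s
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URI
    canonicalize(uri.join(response['location']).to_s, limit - 1)
  else
    raise "#{uri} answered with HTTP #{response.code}"
  end
end

canonicalize('google.com/insights')                      # follows redirects, returns the final URL
canonicalize('http://thiswebsitedoesntexistatall.com')   # raises (e.g. SocketError if DNS fails)

Per the caveats above, a 200 only proves that something answered; it does not prove it was the page you meant.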
Possible Duplicate:
Why do people put code like “throw 1; <dont be evil>” and “for(;;);” in front of json responses?
I found this kind of syntax being used on Facebook for Ajax calls. I'm confused by the for (;;); part at the beginning of the response. What is it used for?
This is the call and response:
GET http://0.131.channel.facebook.com/x/1476579705/51033089/false/p_1524926084=0
Response:
for (;;);{"t":"continue"}
I suspect the primary reason it's there is control. It forces you to retrieve the data via Ajax, not via JSON-P or similar (which uses script tags, and so would fail because that for loop is infinite), and thus ensures that the Same Origin Policy kicks in. This lets them control what documents can issue calls to the API — specifically, only documents that have the same origin as that API call, or ones that Facebook specifically grants access to via CORS (on browsers that support CORS). So you have to request the data via a mechanism where the browser will enforce the SOP, and you have to know about that preface and remove it before deserializing the data.
So yeah, it's about controlling (useful) access to that data.
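For a consumer that is allowed to read the data (same-origin client code, or any server-side fetch), "knowing about that preface" just means stripping it before parsing. Sketched in Ruby, purely for illustration:

require 'json'

raw = 'for (;;);{"t":"continue"}'
# strip the known anti-hijacking preface, then hand the rest to a real JSON parser
payload = JSON.parse(raw.sub(/\Afor\s*\(;;\);\s*/, ''))
payload['t']   # => "continue"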
Facebook has a ton of developers working internally on a lot of projects, and it is very common for someone to make a minor mistake; whether it be something as simple and serious as failing to escape data inserted into an HTML or SQL template or something as intricate and subtle as using eval (sometimes inefficient and arguably insecure) or JSON.parse (a compliant but not universally implemented extension) instead of a "known good" JSON decoder, it is important to figure out ways to easily enforce best practices on this developer population.
To face this challenge, Facebook has recently been going "all out" with internal projects designed to gracefully enforce these best practices, and to be honest the only explanation that truly makes sense for this specific case is just that: someone internally decided that all JSON parsing should go through a single implementation in their core library, and the best way to enforce that is for every single API response to get for(;;); automatically tacked on the front.
In so doing, a developer can't be "lazy": they will notice immediately if they use eval(), wonder what is up, and then realize their mistake and use the approved JSON API.
The other answers being provided seem to all fall into one of two categories:
misunderstanding JSONP, or
misunderstanding "JSON hijacking".
Those in the first category rely on the idea that an attacker can somehow make a request "using JSONP" to an API that doesn't support it. JSONP is a protocol that must be supported on both the server and the client: it requires the server to return something akin to myFunction({"t":"continue"}) such that the result is passed to a local function. You can't just "use JSONP" by accident.
Those in the second category are citing a very real vulnerability that has been described allowing a cross-site request forgery via <script> tags to APIs that do not use JSONP (such as this one), allowing a form of "JSON hijacking". This is done by changing the Array/Object constructor, which allows one to access the information being returned from the server without a wrapping function.
However, that is simply not possible in this case: the reason it works at all is that a bare array (one possible result of many JSON APIs, such as the famous Gmail example) is a valid expression statement, which is not true of a bare object.
In fact, the syntax for objects defined by JSON (which includes quotation marks around the field names, as seen in this example) conflicts with the syntax for blocks, and therefore cannot be used at the top-level of a script.
js> {"t":"continue"}
typein:2: SyntaxError: invalid label:
typein:2: {"t":"continue"}
typein:2: ....^
For this example to be exploitable by way of Object() constructor remapping, it would require the API to have instead returned the object inside of a set of parentheses, making it valid JavaScript (but then not valid JSON).
js> ({"t":"continue"})
[object Object]
Now, it could be that this for(;;); prefix trick is only "accidentally" showing up in this example, and is in fact being returned by other internal Facebook APIs that are returning arrays; but in this case that should really be noted, as that would then be the "real" cause for why for(;;); is appearing in this specific snippet.
Well the for(;;); is an infinite loop (you can use Chrome's JavaScript console to run that code in a tab if you want, and then watch the CPU-usage in the task manager go through the roof until the browser kills the tab).
So I suspect that maybe it is being put there to frustrate anyone attempting to parse the response using eval or any other technique that executes the returned data.
To explain further, it used to be fairly commonplace to parse a bit of JSON-formatted data using JavaScript's eval() function, by doing something like:
var parsedJson = eval('(' + jsonString + ')');
...this is considered unsafe, however, because if for some reason your JSON-formatted data contains executable JavaScript code instead of (or in addition to) the data itself, that code will be executed by eval(). This means that if you are talking to an untrusted server, or if someone compromises a trusted server, they can run arbitrary code on your page.
Because of this, using things like eval() to parse JSON-formatted data is generally frowned upon, and the for(;;); statement in the Facebook JSON will prevent people from parsing the data that way. Anyone that tries will get an infinite loop. So essentially, it's like Facebook is trying to enforce that people work with its API in a way that doesn't leave them vulnerable to future exploits that try to hijack the Facebook API to use as a vector.
I'm a bit late and T.J. has basically solved the mystery, but I thought I'd share a great paper on this particular topic that has good examples and provides deeper insight into this mechanism.
These infinite loops are a countermeasure against "Javascript hijacking", a type of attack that gained public attention with an attack on Gmail that was published by Jeremiah Grossman.
The idea is as simple as it is beautiful: a lot of users tend to be permanently logged in to Gmail or Facebook. So you set up a malicious site, and in its JavaScript you override the Object or Array constructor:
function Object() {
//Make an Ajax request to your malicious site exposing the object data
}
Then you include a <script> tag on that page, such as
<script src="http://www.example.com/object.json"></script>
And finally you can read all about the JSON objects in your malicious server's logs.
As promised, the link to the paper.
This looks like a hack to prevent a CSRF attack. There are browser-specific ways to hook into object creation, so a malicious website could do that first, and then include the following:
<script src="http://0.131.channel.facebook.com/x/1476579705/51033089/false/p_1524926084=0" />
If there weren't an infinite loop before the JSON, an object would be created, since JSON can be eval()ed as javascript, and the hooks would detect it and sniff the object members.
Now if you visit that site from a browser, while logged into Facebook, it can get at your data as if it were you, and then send it back to its own server via e.g., an AJAX or javascript post.
Internet Explorer 8 has a new security feature, an XSS filter that tries to intercept cross-site scripting attempts. It's described this way:
The XSS Filter, a feature new to Internet Explorer 8, detects JavaScript in URL and HTTP POST requests. If JavaScript is detected, the XSS Filter searches evidence of reflection, information that would be returned to the attacking Web site if the attacking request were submitted unchanged. If reflection is detected, the XSS Filter sanitizes the original request so that the additional JavaScript cannot be executed.
I'm finding that the XSS filter kicks in even when there's no "evidence of reflection", and am starting to think that the filter simply notices when a request is made to another site and the response contains JavaScript.
But even that is hard to verify because the effect seems to come and go. IE has different zones, and just when I think I've reproduced the problem, the filter doesn't kick in anymore, and I don't know why.
Anyone have any tips on how to combat this? What is the filter really looking for? Is there any way for a good-guy to POST data to a 3rd-party site which can return HTML to be displayed in an iframe and not trigger the filter?
Background: I'm loading a JavaScript library from a 3rd-party site. That JavaScript harvests some data from the current HTML page, and posts it to the 3rd-party site, which responds with some HTML to be displayed in an iframe. To see it in action, visit an AOL Food page and click the "Print" icon just above the story.
What does it really do? It allows third parties to link to a messed-up version of your site.
It kicks in when [a few conditions are met and] it sees a string in the query submission that also exists verbatim in the page, and which it thinks might be dangerous.
It assumes that if <script>something()</script> exists in both the query string and the page code, then it must be because your server-side script is insecure and reflected that string straight back out as markup without escaping.
But of course, apart from the fact that it could be a perfectly valid query someone typed that matches by coincidence, it's also just as possible that they match because someone looked at the page and deliberately copied part of it out. For example:
http://www.bing.com/search?q=%3Cscript+type%3D%22text%2Fjavascript%22%3E
Follow that in IE8 and I've successfully sabotaged your Bing page so it'll give script errors, and the pop-out result bits won't work. Essentially it gives an attacker whose link is being followed license to pick out and disable parts of the page he doesn't like — and that might even include other security-related measures like framebuster scripts.
What does IE8 consider ‘potentially dangerous’? A lot more, and a lot stranger, things than just this script tag. What's more, it appears to match against a set of ‘dangerous’ templates using a text pattern system (presumably regex), instead of any kind of HTML parser like the one that will eventually parse the page itself. Yes, use IE8 and your browser is pařṣinͅg HT̈́͜ML w̧̼̜it̏̔h ͙r̿e̴̬g̉̆e͎x͍͔̑̃̽̚.
‘XSS protection’ by looking at the strings in the query is utterly bogus. It can't be ‘fixed’; the very concept is intrinsically flawed. Apart from the problem of stepping in when it's not wanted, it can't ever really protect you from anything but the most basic attacks — and attackers will surely work around such blocks as IE8 becomes more widely used. If you've been forgetting to escape your HTML output correctly you'll still be vulnerable; all XSS “protection” has to offer you is a false sense of security. Unfortunately Microsoft seem to like this false sense of security; there is similar XSS “protection” in ASP.NET too, on the server side.
So if you've got a clue about webapp authoring and you've been properly escaping output to HTML like a good boy, it's definitely a good idea to disable this unwanted, unworkable, wrong-headed intrusion by outputting the header:
X-XSS-Protection: 0
in your HTTP responses. (And using ValidateRequest="false" in your pages if you're using ASP.NET.)
For everyone else, who still slings strings together in PHP without taking care to encode properly... well you might as well leave it on. Don't expect it to actually protect your users, but your site is already broken, so who cares if it breaks a little more, right?
To see it in action, visit an AOL Food page and click the "Print" icon just above the story.
Ah yes, I can see this breaking in IE8. Not immediately obvious where IE has made the hack to the content that's stopped it executing though... the only cross-domain request I can see that's a candidate for the XSS filter is this one to http://h30405.www3.hp.com/print/start:
POST /print/start HTTP/1.1
Host: h30405.www3.hp.com
Referer: http://recipe.aol.com/recipe/oatmeal-butter-cookies/142275?
csrfmiddlewaretoken=undefined&characterset=utf-8&location=http%253A%2F%2Frecipe.aol.com%2Frecipe%2Foatmeal-butter-cookies%2F142275&template=recipe&blocks=Dd%3Do%7Efsp%7E%7B%3D%25%3F%3D%3C%28%2B.%2F%2C%28%3D3%3F%3D%7Dsp%7Ct#kfoz%3D%25%3F%3D%7E%7C%7Czqk%7Cpspm%3Db3%3Fd%3Do%7Efsp%7E%7B%3D%25%3F%3D%3C%7D%2F%27%2B%2C.%3D3%3F%3D%7Dsp%7Ct#kfoz%3D%25%3F%3D%7E%7C%7Czqk...
That blocks parameter continues with pages more of gibberish. Presumably there is something in there that (by coincidence?) is reflected in the returned HTML and triggers one of IE8's messed-up ideas of what an XSS exploit looks like.
To fix this, HP need to make the server at h30405.www3.hp.com include the X-XSS-Protection: 0 header.
You should send me (ericlaw#microsoft) a network capture (www.fiddlercap.com) of the scenario you think is incorrect.
The XSS filter works as follows:
1. Is XSSFILTER enabled for this process?
   If yes – proceed to next check
   If no – bypass XSS Filter and continue loading
2. Is it a "document" load (like a frame, not a subdownload)?
   If yes – proceed to next check
   If no – bypass XSS Filter and continue loading
3. Is it an HTTP/HTTPS request?
   If yes – proceed to next check
   If no – bypass XSS Filter and continue loading
4. Does the RESPONSE contain an x-xss-protection header?
   Yes:
     Value = 1: XSS Filter Enabled (no urlaction check)
     Value = 0: XSS Filter Disabled (no urlaction check)
   No: proceed to next check
5. Is the site loading in a Zone where URLAction enables XSS filtering? (By default: Internet, Trusted, Restricted)
   If yes – proceed to next check
   If no – bypass XSS Filter and continue loading
6. Is it a cross-site request? (Referrer header: does the final (post-redirect) fully-qualified domain name in the HTTP request referrer header match the fully-qualified domain name of the URL being retrieved?)
   If yes – bypass XSS Filter and continue loading
   If no – then the URL in the request should be neutered.
7. Does the heuristic indicate that the RESPONSE data came from unsafe REQUEST DATA?
   If yes – modify the response.
Now, the exact details of #7 are quite complicated, but basically, you can imagine that IE does a match of request data (URL/Post Body) to response data (script bodies) and if they match, then the response data will be modified.
In your site's case, you'll want to look at the body of the POST to http://h30405.www3.hp.com/print/start and the corresponding response.
Actually, it's worse than it might seem. The XSS filter can make safe sites unsafe. Read here:
http://www.h-online.com/security/news/item/Security-feature-of-Internet-Explorer-8-unsafe-868837.html
From that article:
However, Google disables IE's XSS filter by sending the X-XSS-Protection: 0 header, which makes it immune.
I don't know enough about your site to judge if this may be a solution, but you can probably try.
More in depth, technical discussion of the filter, and how to disable it is here: http://michael-coates.blogspot.com/2009/11/ie8-xss-filter-bug.html
I am using jQuery to call PHP files via the $.get method
function fetchDepartment(company_id)
{
    $.get("ajax/fetchDepartment.php?sec=departments&company_id=" + company_id, function(data) {
        $("#department_id").html(data);
    });
}
What I am wondering is: can I secure the filename even further?
Currently I have a global access check within the .php file that checks whether the user is logged in, whether he can access this data, etc.
But I am wondering if there are further steps I can take so a user can't see this filename, or what other steps you would recommend.
Encoded requests
You could make the request details effectively invisible to the casual miscreant by encoding almost all of the URL and then decoding the request details server-side.
The request details would include the action you wish to perform plus the parameters relevant to that action.
All requests would be sent to a single URL, where a server-side process would decode the request details and perform the relevant action as required.
Example Original URL:
/ajax/delete.php?parameter1=foo&parameter2=bar
Request details:
action=delete&parameter1=foo&parameter2=bar
Encoded request details (encoded using base64):
YWN0aW9uPWRlbGV0ZSZwYXJhbWV0ZXIxPWZvbyZwYXJhbWV0ZXIyPWJhcg==
Encoded URL:
/ajax/?request=YWN0aW9uPWRlbGV0ZSZwYXJhbWV0ZXIxPWZvbyZwYXJhbWV0ZXIyPWJhcg==
Older browsers lacked native base64 support in JavaScript (modern ones provide window.btoa() and window.atob()), but either way it's far from impossible to find a suitable method or to write your own.
With obfuscated/minified client-side JavaScript it would be quite tricky for someone to determine how to make a request 'by hand'.
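To illustrate the round trip (sketched in Ruby purely for illustration; the same idea applies with JavaScript on the client and PHP on the server):

require 'base64'
require 'cgi'

details = 'action=delete&parameter1=foo&parameter2=bar'
encoded = Base64.strict_encode64(details)
# encoded is the string shown above: "YWN0aW9uPWRlbGV0ZSZwYXJhbWV0ZXIxPWZvbyZwYXJhbWV0ZXIyPWJhcg=="

# Server side: decode and recover the request details
params = CGI.parse(Base64.decode64(encoded))
params['action'].first   # => "delete"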
Hide implementation details
There are a number of practices you can follow to make your application less susceptible to attack through URL misuse.
Let's start with a URL of: ajax/fetchDepartment.php?sec=departments&company_id=99
There's no need to reveal what server-side technology you're using (PHP) nor, through the query string (sec, company_id), what the query string values actually mean.
Masking the server-side technology
Assuming you have index.php defined as a default, the following URLs are equivalent:
ajax/fetchDepartment.php?sec=departments&company_id=99
ajax/fetchDepartment/index.php?sec=departments&company_id=99
ajax/fetchDepartment/?sec=departments&company_id=99
The third URL does not reveal the server-side technology you're using. This limits the range of possible attacks. It also makes it easier for you to switch over to a different server-side technology without changing your URLs.
Hiding the meaning of request parameters
ajax/fetchDepartment/?sec=departments&company_id=99
ajax/99/departments/
The latter URL still conveys enough information to perform the request without revealing what the information means.
Whilst someone could still change the values, they won't know what they're changing. This will make it more difficult for an attacker to evaluate and understand the result of any URL changes they make.
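One way to support such URLs (assuming Apache here; this mod_rewrite rule is only illustrative, and other servers have equivalents) is to rewrite them back to the real script internally:

# .htaccess: map /ajax/99/departments/ back to the real script
RewriteEngine On
RewriteRule ^ajax/(\d+)/([a-z]+)/$ ajax/fetchDepartment/index.php?sec=$2&company_id=$1 [L,QSA]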
Pretty much the only way to hide the URL for a certain piece of information from the user is by not loading it over HTTP at all. Maybe you can load the data as part of the calling page, or through another page with a more generic URL.