We have some publicly shared S3 files that we want to make sure won't be indexed by Google. I can't seem to find any documentation on how to do this. Is there a way to set a "noindex" x-robots-tag response header on individual S3 objects?
(We're using the Ruby AWS client)
There does not appear to be a way to do this.
Only certain headers from an S3 PUT object request are documented as being returned when the object is fetched.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
Anything else you send appears to be simply disregarded, as long as it doesn't actually invalidate the request.
Actually, that's what I thought before researching this, and it's almost true.
The documentation here seems incomplete, and documentation elsewhere suggests that the following request headers, if sent with the upload, will appear in the download:
Cache-Control
Content-Disposition
Content-Encoding
Content-Type
x-amz-meta-*
Other headers are listed there as well, but some of them, like Expect, wouldn't make sense in the response to a GET request, so they logically wouldn't appear.
So far, this is all consistent with my experience with S3.
If you send a random but not-invalid header with your request, it's ignored. Example:
X-Foo: bar
S3 seems to accept this on upload, but discards it (presumably it isn't stored)... downloading the object does not return the X-Foo header.
But X-Robots-Tag appears to be an undocumented exception to this.
Uploading a file with X-Robots-Tag: noindex (for example) does indeed result in the same header and value being returned with the object when you GET it.
Unless somebody can cite the documentation that explains why this works, we're operating in distinctly undocumented territory.
But, if you're interested in going there, the simple answer appears to be, you just add this header to the HTTP PUT request you send to the REST API to upload the object.
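For illustration, the raw request would look something like this (the bucket, key, and body are placeholders, and the SigV4 Authorization header is elided):
PUT /hello.txt HTTP/1.1
Host: my-bucket.s3.amazonaws.com
Authorization: (AWS signature elided)
Content-Type: text/plain
Content-Length: 11
X-Robots-Tag: noindex

hello world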
"Not so fast," you say, "I'm using the Ruby SDK." Indeed. The AWS Ruby client seems to be too "helpful" to let you get away with this, at least, not easily. The docs there show how to add "metadata" --
:metadata (Hash) — A hash of metadata to be included with the object. These will be sent to S3 as headers prefixed with x-amz-meta. Each name, value pair must conform to US-ASCII.
Well, that's not going to work, because you'd get x-amz-meta-x-robots-tag.
How do you set other headers in the upload? Every other header you'd normally set is an element of the options hash, like :cache_control, which turns into Cache-Control: in the upload request. Unless they're blindly applying the keys from that hash to the upload transaction (which would be terrible design combined with excellent luck), you may not have a straightforward way to get here from there. I can't be much more specific, because the only thing I really know about Ruby is the same thing I know about Java -- from what I've seen of it, I don't like it. :)
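That said, one way to step around the SDK's upload helpers entirely is to have the SDK presign a PUT URL and then send the raw request yourself. This is only a sketch, assuming the aws-sdk-s3 gem, placeholder bucket and key names, and that S3 will accept the extra unsigned header on a presigned PUT:
require 'aws-sdk-s3'
require 'net/http'

# Presign a PUT so we control the raw request headers ourselves.
signer = Aws::S3::Presigner.new
url = URI(signer.presigned_url(:put_object, bucket: 'my-bucket', key: 'hello.txt'))

request = Net::HTTP::Put.new(url)
request['X-Robots-Tag'] = 'noindex' # the undocumented header under test
request['Content-Type'] = 'text/plain'
request.body = 'hello world'

Net::HTTP.start(url.host, url.port, use_ssl: true) { |http| http.request(request) }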
But X-Robots-Tag does appear to be a custom header S3 supports, to some extent, without clear documentation of that fact. It's, at least, accepted by the REST API.
Failing the above, you can manually add this header to the metadata in the S3 console after uploading the object. (Note, X-Foo: Bar doesn't work from the S3 console, either -- it's silently discarded, with no error -- but X-Robots-Tag: works fine).
You can also, of course, put a publicly-readable robots.txt file (with the appropriate directives in it) in the root of the bucket. Depending on your content mix, path hierarchy, and other factors, that isn't (perhaps) as simple as selectively setting headers. But if the entire bucket is comprised of information you don't want indexed, it should easily accomplish what you want, since content should not be indexed if disallowed in robots.txt, even when a search spider follows a link to it from another site -- every domain's (and subdomain's) robots.txt file stands alone.
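For example, a minimal robots.txt that blocks all crawling under the host it's served from:
User-agent: *
Disallow: /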
@Michael - sqlbot is correct. The SDKs don't support it by default, and it won't show in the AWS Console, but if you set it directly with the REST API, it works. For those who don't want to figure out the REST API and its authentication method, I was able to modify the node.js aws-sdk to support this feature.
Amazon stores the method parameter configuration and validation in a large JSON file: apis/s3-2006-03-01.min.json. I'd guess the other SDKs implement their validation in the same way.
You can go to the "PutObject" command, and under "input.members" add a new parameter, "XRobotsTag". Configure its "location" as "header" and its "locationName" as "X-Robots-Tag":
"XRobotsTag": {
  "location": "header",
  "locationName": "X-Robots-Tag"
}
Your local aws-sdk is now configured to support X-Robots-Tag on your putObject requests. In node.js this would look like this:
// Assumes the aws-sdk with the patched apis/s3-2006-03-01.min.json described above
var AWS = require("aws-sdk");
var s3 = new AWS.S3();

s3.putObject({
  ACL: "public-read",
  Body: "hello world",
  Bucket: "my-bucket",
  CacheControl: "public, max-age=31536000",
  ContentType: "text/plain",
  Key: "hello.txt",
  XRobotsTag: "noindex, nofollow" // sent on the wire as the X-Robots-Tag header
}, function (err, resp) {});
I'm trying to learn how to create an API (I use Laravel on the backend and Postman to send requests), but I have a basic question about sending data to be processed in the backend.
I see that there are several ways to send data to the backend, but I'm not sure which is the right way to do it.
For example, with Postman I have seen that the sending can be done as parameters through the URI:
www.example.com/api/v1/orders?limit=10&offset=20
I can also send it in the body of the request, using one of the body types:
form data
x-www-form-urlencoded
raw
other ...
I understand that I can make the request and send data in several ways. I would like to know the correct, standard, or optimal way to do it for typical requests, such as getting a set of records with filtering, ordering, or pagination.
I would also like to know if the way of sending data should depend on the verb to be used in the request.
My main concern is that I'd like the way users consume the API to be as simple and suitable as possible for them. I know that I always want to return the data (when necessary) in JSON format, but I'm not clear on how it should be sent.
Could someone please clarify these questions (maybe with a link to a page that covers this kind of thing)?
Thank you very much in advance.
It depends:
GET, HEAD, and DELETE don't have a request body, so all parameters have to be sent via the URL
POST can be easily sent via form data in Laravel
For PUT/PATCH I prefer application/json, because PHP exposes the body via the php://input stream, which can sometimes cause problems in Laravel
You can also combine URL parameters and the request body. Compound types (for example, models) can only be sent as a unit via the request body, while it might suffice to send an id via a URL parameter.
I guess, nearly more important is the overall format and documentation. The format should be consistent, easy to understand and maybe standardized (for example: https://jsonapi.org/format/#crud).
Keep in mind that HTML forms are limited in two ways by default:
They only support the methods GET and POST
They only support the enctypes application/x-www-form-urlencoded, multipart/form-data, and text/plain
If you want to enforce something else, you have to use scripts/libraries to do this.
Nowadays, JSON content (for POST, PUT, and PATCH) appears to be the most popular and readable choice. It is widely recognized and clean, and examples in documentation are easy to read.
I would go with JSON for both the incoming parameters and the outgoing response. This applies to parameters related to the business logic of your application.
At the same time, the GET, HEAD, and DELETE methods don't have a payload at all. For parameters that control the API (i.e. not strictly related to the business logic of the application, but to the API itself), I'd go with query parameters. This applies to parameters like limit, offset, order_by, etc.
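As a concrete illustration (the endpoints and payload are hypothetical), a filtered listing goes in the query string, while a create request carries a JSON body:
GET /api/v1/orders?limit=10&offset=20&order_by=created_at HTTP/1.1
Host: www.example.com
Accept: application/json

POST /api/v1/orders HTTP/1.1
Host: www.example.com
Content-Type: application/json
Accept: application/json

{"customer_id": 42, "items": [{"product_id": 7, "quantity": 2}]}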
P.S. There is one caveat related to the JSON format: if your API happens to take file parameters, you may run into a problem. You can still use JSON, but in that case you should encode your files (e.g. using base64) and include them as string parameters of your JSON. This may be demanding for the consumers of your API ;) It will also enlarge your files and will probably force you to process them in memory. The alternative is to use multipart/form-data as the request Content-Type; this way you can have both the form fields and a separate "space" for the files. It's worth keeping this case in mind when you decide.
I am using URLSession 'uploadTask from file'
func uploadTask(with request: URLRequest, fromFile fileURL: URL) -> URLSessionUploadTask
Almost everything works fine, but now our server needs an extra param, 'uploadKey', to be passed in the Content-Disposition along with the file name.
This can be done by generating multipart request with content disposition added as we normally do.
I want to add it while using 'uploadTask from file' to avoid memory pressure. Please suggest how to do it.
From reading the question, I suspect that you're subtly misunderstanding what upload tasks do (and unfortunately, Apple's documentation needs some serious improvement in this area, which doesn't help matters). These tasks don't upload a file in the same way that a web browser would if you chose a file in an upload form. Rather, they use the file as the body of an upload request. I think they default to providing a sane Content-Type based on the filename, though I'm not certain, but they do not send data in form encoding.
So assuming I'm fully understanding the question, your options are either:
Keep using multipart encoding. Optionally, write the multipart body into a file rather than keeping it in memory, and use the upload task to provide the body from that file instead of from an NSData object (see the sketch after this list).
Upload the file you're trying to send, unencoded, as the entirety of the upload body, and provide whatever additional parameters you need in the form of query parameters in the URL itself.
Use some other encoding like JSON or protocol buffers.
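For the first option, the file you hand to the upload task would contain the finished multipart body, while the matching Content-Type: multipart/form-data; boundary=... header goes on the URLRequest itself. The body would look roughly like this (the boundary, field names, and uploadKey value are only examples):
--EXAMPLE_BOUNDARY
Content-Disposition: form-data; name="uploadKey"

some-upload-key
--EXAMPLE_BOUNDARY
Content-Disposition: form-data; name="file"; filename="photo.jpg"
Content-Type: image/jpeg

(raw file bytes here)
--EXAMPLE_BOUNDARY--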
Whichever way you go, the server code will determine which of these approaches is supported. If you can modify the server code, then I would recommend the second approach. It is slightly more efficient than the first approach, much more efficient than JSON, and easier to implement than any of the other approaches.
Is there an easy way to ask the google api ruby client to just give you back the stock HTTP response, rather than to perform the lovely, but slightly limiting translation into one of their ruby representable objects?
e.g.
response = Gmail.client.get_user_message("me", id)
=> #<Google::Apis::GmailV1::Message
response = Gmail.client.list_user_messages("me")
=> #<Google::Apis::GmailV1::ListMessagesResponse
but
response = Gmail.client.delete_user_message("me", id)
=>nil #successfully deleted
Now that's all fine and dandy, except that sometimes I just want to know what sort of response is going to come back. i.e. an HTTP response with maybe some JSON in the body. And then I'll worry about what I do with it...
I can take the response and call
response.to_json
to get the JSON body that would have come back (though I still won't have the response code, and I need to KNOW that it's one of those objects first).
The client library is definitely receiving all of that; it's just converting it into these objects before it lets me see it. And if I don't know whether it's a Google object (and not nil), I can't call to_json on it consistently...
Any ideas other than second guess what google is going to send me back?
(I should note that this came up while trying to move a library from their 0.8 API to their 0.9 API, so call me a cynic if you must, but my faith that Google won't make breaking changes to those returned objects is at a low ebb...)
As far as I know, it is possible to ask the server to send only the fields you really need and get a partial response instead of the default full response, as mentioned in Performance Tips.
However, I suggest that you please check the documentation for the specific API you are using to see if the field you're looking for is currently supported. For the Gmail API, you may go through Working with partial resources.
Here are the two types of partial requests that you can use:
Partial response: A request where you specify which fields to include in the response (use the fields request parameter; see the sketch after this list).
Patch: An update request where you send only the fields you want to change (use the PATCH HTTP verb).
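With the Ruby client, a partial response request might look like the following sketch (the field mask is just an example, and credentials is assumed to be an OAuth authorization you've already obtained):
require 'google/apis/gmail_v1'

gmail = Google::Apis::GmailV1::GmailService.new
gmail.authorization = credentials # assumed: previously obtained OAuth credentials

# The fields parameter asks the server to return only the named fields.
response = gmail.list_user_messages('me', fields: 'messages(id,threadId),nextPageToken')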
Hope that helps!
I am setting up a project using spring data rest.
Exposing my domain model seems to work, but I have some strange behaviour:
According to the wiki/docs, if I access a URI like /files/, I should end up with an array of links to the individual files. But I get not only the links to the files, but also the attributes of the file objects when accessing /files/.
This is annoying, because I have the content of the files as byte[], and I end up transmitting the contents of all files when accessing /files/.
Does anybody know how to turn this behaviour off?
Ok, I just found the answer myself. I am posting this for others with the same question...
I have to use the header: "Accept: application/x-spring-data-compact+json" in my request to get the compact representation of my repositories...
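In other words, the request looks something like this (host and port are just examples):
GET /files/ HTTP/1.1
Host: localhost:8080
Accept: application/x-spring-data-compact+json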
That's it...
Possible Duplicate:
Why do people put code like “throw 1; <dont be evil>” and “for(;;);” in front of json responses?
I found this kind of syntax being used on Facebook for Ajax calls. I'm confused about the for (;;); part at the beginning of the response. What is it used for?
This is the call and response:
GET http://0.131.channel.facebook.com/x/1476579705/51033089/false/p_1524926084=0
Response:
for (;;);{"t":"continue"}
I suspect the primary reason it's there is control. It forces you to retrieve the data via Ajax, not via JSON-P or similar (which uses script tags, and so would fail because that for loop is infinite), and thus ensures that the Same Origin Policy kicks in. This lets them control what documents can issue calls to the API — specifically, only documents that have the same origin as that API call, or ones that Facebook specifically grants access to via CORS (on browsers that support CORS). So you have to request the data via a mechanism where the browser will enforce the SOP, and you have to know about that preface and remove it before deserializing the data.
So yeah, it's about controlling (useful) access to that data.
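To illustrate the client side of that, here is a hedged Ruby sketch (the endpoint URL is hypothetical) that strips the preface before deserializing:
require 'json'
require 'net/http'

raw = Net::HTTP.get(URI('https://example.com/api/endpoint')) # hypothetical URL
# Remove the anti-hijacking preface, then parse the remainder as JSON.
data = JSON.parse(raw.sub(/\Afor\s*\(;;\);/, ''))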
Facebook has a ton of developers working internally on a lot of projects, and it is very common for someone to make a minor mistake; whether it be something as simple and serious as failing to escape data inserted into an HTML or SQL template or something as intricate and subtle as using eval (sometimes inefficient and arguably insecure) or JSON.parse (a compliant but not universally implemented extension) instead of a "known good" JSON decoder, it is important to figure out ways to easily enforce best practices on this developer population.
To face this challenge, Facebook has recently been going "all out" with internal projects designed to gracefully enforce these best practices, and to be honest the only explanation that truly makes sense for this specific case is just that: someone internally decided that all JSON parsing should go through a single implementation in their core library, and the best way to enforce that is for every single API response to get for(;;); automatically tacked on the front.
In so doing, a developer can't be "lazy": they will notice immediately if they use eval(), wonder what is up, and then realize their mistake and use the approved JSON API.
The other answers being provided seem to all fall into one of two categories:
misunderstanding JSONP, or
misunderstanding "JSON hijacking".
Those in the first category rely on the idea that an attacker can somehow make a request "using JSONP" to an API that doesn't support it. JSONP is a protocol that must be supported on both the server and the client: it requires the server to return something akin to myFunction({"t":"continue"}) such that the result is passed to a local function. You can't just "use JSONP" by accident.
Those in the second category are citing a very real vulnerability that has been described allowing a cross-site request forgery via <script> tags to APIs that do not use JSONP (such as this one), allowing a form of "JSON hijacking". This is done by changing the Array/Object constructor, which allows one to access the information being returned from the server without a wrapping function.
However, that is simply not possible in this case: the reason it works at all is that a bare array (one possible result of many JSON APIs, such as the famous Gmail example) is a valid expression statement, which is not true of a bare object.
In fact, the syntax for objects defined by JSON (which includes quotation marks around the field names, as seen in this example) conflicts with the syntax for blocks, and therefore cannot be used at the top-level of a script.
js> {"t":"continue"}
typein:2: SyntaxError: invalid label:
typein:2: {"t":"continue"}
typein:2: ....^
For this example to be exploitable by way of Object() constructor remapping, it would require the API to have instead returned the object inside of a set of parentheses, making it valid JavaScript (but then not valid JSON).
js> ({"t":"continue"})
[object Object]
Now, it could be that this for(;;); prefix trick is only "accidentally" showing up in this example, and is in fact being returned by other internal Facebook APIs that are returning arrays; but in this case that should really be noted, as that would then be the "real" cause for why for(;;); is appearing in this specific snippet.
Well the for(;;); is an infinite loop (you can use Chrome's JavaScript console to run that code in a tab if you want, and then watch the CPU-usage in the task manager go through the roof until the browser kills the tab).
So I suspect that maybe it is being put there to frustrate anyone attempting to parse the response using eval or any other technique that executes the returned data.
To explain further, it used to be fairly commonplace to parse a bit of JSON-formatted data using JavaScript's eval() function, by doing something like:
var parsedJson = eval('(' + jsonString + ')');
...this is considered unsafe, however: if for some reason your JSON-formatted data contains executable JavaScript code instead of (or in addition to) JSON-formatted data, that code will be executed by eval(). This means that if you are talking to an untrusted server, or if someone compromises a trusted server, they can run arbitrary code on your page.
Because of this, using things like eval() to parse JSON-formatted data is generally frowned upon, and the for(;;); statement in the Facebook JSON will prevent people from parsing the data that way. Anyone that tries will get an infinite loop. So essentially, it's like Facebook is trying to enforce that people work with its API in a way that doesn't leave them vulnerable to future exploits that try to hijack the Facebook API to use as a vector.
I'm a bit late and T.J. has basically solved the mystery, but I thought I'd share a great paper on this particular topic that has good examples and provides deeper insight into this mechanism.
These infinite loops are a countermeasure against "Javascript hijacking", a type of attack that gained public attention with an attack on Gmail that was published by Jeremiah Grossman.
The idea is as simple as it is beautiful: a lot of users tend to be logged in permanently to Gmail or Facebook. So what you do is set up a site, and in your malicious site's JavaScript you override the object or array constructor:
function Object() {
//Make an Ajax request to your malicious site exposing the object data
}
then you include a <script> tag in that site such as
<script src="http://www.example.com/object.json"></script>
And finally you can read all about the JSON objects in your malicious server's logs.
As promised, the link to the paper.
This looks like a hack to prevent a CSRF attack. There are browser-specific ways to hook into object creation, so a malicious website could do that first, and then include the following:
<script src="http://0.131.channel.facebook.com/x/1476579705/51033089/false/p_1524926084=0" />
If there weren't an infinite loop before the JSON, an object would be created, since JSON can be eval()ed as javascript, and the hooks would detect it and sniff the object members.
Now if you visit that site from a browser while logged into Facebook, it can get at your data as if it were you, and then send it back to its own server via, e.g., an Ajax POST.