How to get ETag value from a list of objects in an Amazon S3 bucket - ruby

To get the ETag of a single object in Ruby, I can simply use the list_objects_v2 method. But in my use case, I want to get ETags for a specific list of files.
The problem is that there are more than 7000 objects in the bucket, and I only want the ETags for ~1000 of them. One way is to use each filename as a prefix and query them individually, but that is probably the last resort.
Is there a way I can pass multiple values for prefix? Or any other way to specify exactly which objects I want the ETags for?
EDIT: While looking for a possible way, I ran into https://stackoverflow.com/a/11421066/6200672. I know this is not directly about the Ruby client, but as far as I can tell it's going to use the same API under the hood. (CMIIW)
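For reference, a minimal sketch of the list-everything-then-filter approach, using the aws-sdk-s3 gem (the bucket name, region, and key list below are placeholders). One paginated listing of the whole bucket is usually cheaper than ~1000 individual prefix queries, since ListObjectsV2 returns up to 1000 keys per page, so 7000 objects is only 7 calls:
require 'aws-sdk-s3'
require 'set'

client = Aws::S3::Client.new(region: 'us-east-1')    # placeholder region
wanted = Set.new(['reports/a.csv', 'reports/b.csv']) # the ~1000 keys you care about
etags  = {}

# Walk every page of the listing and keep only the keys you want.
client.list_objects_v2(bucket: 'my-bucket').each_page do |page|
  page.contents.each do |object|
    etags[object.key] = object.etag if wanted.include?(object.key)
  end
end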

Related

Setting noindex on Amazon S3 objects

We have some publicly shared S3 files that we want to make sure won't be indexed by Google. I can't seem to find any documentation on how to do this. Is there a way to set a "noindex" x-robots-tag response header on individual S3 objects?
(We're using the Ruby AWS client)
There does not appear to be a way to do this.
Only certain headers from an S3 PUT object request are documented as being returned when the object is fetched.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
Anything else you send appears to be simply disregarded, as long as it doesn't actually invalidate the request.
Actually, that's what I thought before researching this, and it's almost true.
The documentation here seems incomplete, and elsewhere suggests the following request headers, if sent with the upload, will appear in the download:
Cache-Control
Content-Disposition
Content-Encoding
Content-Type
x-amz-meta-*
Other headers are listed in that documentation as well, but some of these, like Expect, wouldn't make sense on a GET request, so they logically wouldn't appear.
So far, this is all consistent with my experience with S3.
If you send a random but not-invalid header with your request, it's ignored. Example:
X-Foo: bar
S3 seems to accept this on upload, but discards it (presumably it doesn't store it)... downloading the object does not return the X-Foo header.
But X-Robots-Tag appears to be an undocumented exception to this.
Uploading a file with X-Robots-Tag: noindex (for example) does indeed result in the same header and value being returned with the object when you GET it.
Unless somebody can cite the documentation that explains why this works, we're operating in distinctly undocumented territory.
But, if you're interested in going there, the simple answer appears to be, you just add this header to the HTTP PUT request you send to the REST API to upload the object.
"Not so fast," you say, "I'm using the Ruby SDK." Indeed. The AWS Ruby client seems to be too "helpful" to let you get away with this, at least, not easily. The docs there show how to add "metadata" --
:metadata (Hash) — A hash of metadata to be included with the object. These will be sent to S3 as headers prefixed with x-amz-meta. Each name, value pair must conform to US-ASCII.
Well, that's not going to work, because you'd get x-amz-meta-x-robots-tag.
How do you set other headers in the upload? Every other header you'd normally set is an element of the options hash, like :cache_control, which turns into Cache-Control: in the upload request. Unless they're blindly applying the keys from that hash to the upload transaction (which would be terrible design combined with excellent luck), you may not have a straightforward way to get there from here. I can't be much more specific, because the only thing I really know about Ruby is the same thing I know about Java -- from what I've seen of it, I don't like it. :)
But X-Robots-Tag does appear to be a custom header S3 supports, to some extent, without clear documentation of that fact. It's, at least, accepted by the REST API.
Failing the above, you can manually add this header to the metadata in the S3 console after uploading the object. (Note, X-Foo: Bar doesn't work from the S3 console, either -- it's silently discarded, with no error -- but X-Robots-Tag: works fine).
You can also, of course, put a publicly-readable robots.txt file (with the appropriate directives in it) in the root of the bucket. Depending on your content mix, path hierarchy, and other factors, that isn't (perhaps) as simple as selectively setting headers. But if the entire bucket is comprised of information you don't want indexed, it should easily accomplish what you want, since content should not be indexed if disallowed in robots.txt, even when a search spider follows a link to it from another site -- every domain's (and subdomain's) robots.txt file stands alone.
@Michael - sqlbot is correct. The SDKs don't support it by default and it won't show in the AWS Console, but if you set it directly with the REST API it works. For those who don't want to figure out the REST API and its authentication method, I was able to modify the node.js aws-sdk to support this feature.
Amazon stores the method params configuration and validation in a large JSON file: apis/s3-2006-03-01.min.json. I guess the other SDKs may implement their validation in the same way.
You can go to the "PutObject" command, and under "input.members" add a new parameter, "XRobotsTag". Configure it as a "header" and set the locationName to "X-Robots-Tag":
"XRobotsTag": {
"location": "header",
"locationName": "X-Robots-Tag"
}
Your local aws-sdk is now configured to support X-Robots-Tag on your putObject requests. In node.js this would look like this:
s3.putObject({
  ACL: "public-read",
  Body: "hello world",
  Bucket: "my-bucket",
  CacheControl: "public, max-age=31536000",
  ContentType: "text/plain",
  Key: "hello.txt",
  XRobotsTag: "noindex, nofollow"
}, function(err, resp){});
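In the same spirit, a hedged Ruby sketch that bypasses the SDK's parameter validation entirely by signing a raw REST PUT with the aws-sigv4 gem (the bucket, key, region, and credentials below are placeholders, and this leans on the undocumented X-Robots-Tag behavior described above):
require 'aws-sigv4'
require 'net/http'

body = 'hello world'
url  = 'https://my-bucket.s3.amazonaws.com/hello.txt' # placeholder bucket/key

signer = Aws::Sigv4::Signer.new(
  service: 's3',
  region: 'us-east-1', # placeholder region
  access_key_id: ENV['AWS_ACCESS_KEY_ID'],
  secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
)

# The undocumented part: X-Robots-Tag goes in as a plain request header.
headers = { 'content-type' => 'text/plain', 'x-robots-tag' => 'noindex, nofollow' }
signature = signer.sign_request(http_method: 'PUT', url: url, headers: headers, body: body)

uri = URI(url)
request = Net::HTTP::Put.new(uri)
headers.merge(signature.headers).each { |name, value| request[name] = value }
request.body = body

Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| puts http.request(request).code }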

Fully update documents without creating if not existent

Is there any method in Elasticsearch for fully (not partially) updating documents without creating new ones in case they don't already exist?
So far, I have found the _update method, which partially updates a document when you pass a doc attribute inside the JSON request body; however, I would like to replace the entire document in this case, not only part of it.
I have also found the index method, where sending a PUT request works fine, although it creates a new document in case the id is not yet indexed.
Setting the op_type parameter to create will enforce document creation instead of an update.
I was wondering if there is any way to always enforce update and never create a new one?
Or perhaps is there another method that would allow me to achieve such task?
If I understand correctly, you want to index a doc, but only if it already exists? Like an op_type option of update?
You can mostly do it with the update API, given that your mapping remains consistent. With an _update, if the document doesn't exist, you'll get back a 404. If it does exist, ES will merge the contents of doc with whatever document exists there. If you make sure you're sending over a new doc with all the fields in the mapping, then you're effectively replacing it outright.
Note, however, that you can do it without the document merge rather efficiently in two requests; the first one checking for doc existence with a HEAD request. If HEAD /idx/type/id is successful, then do a PUT. This is essentially what's happening internally anyway with the update API, with a little extra overhead. But HEAD is really cheap because it's not shuffling any payload around. It simply returns an HTTP 200/404.
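A minimal Ruby sketch of that HEAD-then-PUT approach, assuming an unauthenticated Elasticsearch node on localhost and a placeholder index/type/id (note the small race window between the two requests if other writers are active):
require 'net/http'
require 'json'

uri = URI('http://localhost:9200/idx/type/1') # placeholder index/type/id

Net::HTTP.start(uri.host, uri.port) do |http|
  # Cheap existence check: HEAD returns 200 if the doc exists, 404 otherwise.
  if http.head(uri.path).code == '200'
    put = Net::HTTP::Put.new(uri.path, 'Content-Type' => 'application/json')
    put.body = { title: 'new title', body: 'full replacement document' }.to_json
    puts http.request(put).code # replaces the stored document outright
  else
    puts 'document does not exist; not creating it'
  end
end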

extJS4 difference between Ext.Ajax and Ext.data.proxy.Ajax

I am new to the ExtJS framework, and after looking at these two classes I am curious when it is better to use one or the other. Is one better for submitting and one for getting? In my current situation I am using a grid, and the API says to use proxy.Ajax, which I trust. But let's say I want to send the JSON back to a database; would I be better off using Ext.Ajax?
Also, I am rather new to JSON; I understand that it's essentially just structured text. What is the best way to send the JSON straight to a database if I don't want to store it locally?
Ext.Ajax is used to make one-off requests to a server. You can use this for GET and POST requests.
Ext.data.proxy.Ajax is used by the data package, in particular by Ext.data.Store instances. You cannot use this 'manually'; you must use it via the data package.

Combining GET and POST in Sinatra (Ruby)

I am trying to make a RESTful API and have a function which needs credentials. For example, say I'm writing a function which finds all nearby places within a certain radius, but only authorised users can use it.
One way to do it is to send it all using GET like so:
http://myapi.heroku.com/getNearbyPlaces?lon=12.343523&lat=56.123533&radius=30&username=john&password=blabla123
but obviously that's the worst possible way to do it.
Is it possible to instead move the username and password fields and embed them as POST variables over SSL, so the URL will only look like so:
https://myapi.heroku.com/getNearbyPlaces?lon=12.343523&lat=56.123533&radius=30
and the credentials will be sent encrypted.
How would I then in Sinatra and Ruby properly get at the GET and POST variables? Is this The Right Way To Do It? If not why not?
If you are really trying to create a RESTful API, instead of some URL endpoints which happen to speak some HTTP dialect, you should stick to GET. The verb is even in your path already, so you seem to be pretty sure it's a GET.
Instead of trying to hide the username and password in GET or POST parameters, you should instead use Basic authentication, which was invented especially for that purpose and is universally available in clients (and is available using convenience methods in Sinatra).
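For example, a minimal Sinatra sketch using Rack's built-in Basic auth middleware (the credential check below is a placeholder; verify against your real user store instead):
require 'sinatra'

# Rack's built-in Basic auth middleware; a classic Sinatra app mounts it with `use`.
use Rack::Auth::Basic, 'Restricted API' do |username, password|
  # Placeholder check; look the user up in your own store instead.
  username == 'john' && password == 'blabla123'
end

get '/getNearbyPlaces' do
  # lon, lat and radius arrive as ordinary query parameters.
  "places near #{params[:lon]},#{params[:lat]} within #{params[:radius]}"
end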
Also, if you are trying to use REST, you should embrace the concept of resources and resource collections (which is implied by the R and E of REST). So you have a single URL like http://myapi.heroku.com/NearbyPlaces. If you GET there, you gather information about that resource; if you POST, you create a new resource; if you PUT, you update an existing resource; and if you DELETE, well, you delete it. What you should do first is structure your object space into these resources and design your API around them.
Possibly, you could have a resource collection at http://myapi.heroku.com/places. Each place as a resource has a unique URL like http://myapi.heroku.com/places/123. New places can be created by POSTing to http://myapi.heroku.com/places. And nearby places could be gathered by GETting http://myapi.heroku.com/places/nearby?lon=12.343523&lat=56.123533&radius=30. That call could return an array of URLs to nearby places, e.g.
[
  "http://myapi.heroku.com/places/123",
  "http://myapi.heroku.com/places/17",
  "http://myapi.heroku.com/places/42"
]
If you want to be truly discoverable, you might also embrace HATEOAS, which constrains REST semantics in a way that allows API clients to "browse" through the API as a user with a browser would. To allow this, you use hyperlinks inside your API which point to other resources, kind of like in the example above.
The params that are part of the URL (namely lon, lat, and radius) are known as query parameters; the user and password information that you want to send in your form are known as form parameters. In Sinatra, both of these types of parameters are made available in the params hash of a controller.
So in Sinatra you would be able to access your lon parameter as params[:lon] and the user parameter as params[:user].
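A tiny sketch of that, with a query parameter and a form parameter landing in the same hash (route and field names are placeholders):
require 'sinatra'

# POST /places?lon=12.343523 with a body of user=john&password=blabla123:
# query parameters and form parameters are merged into the same params hash.
post '/places' do
  "lon=#{params[:lon]}, user=#{params[:user]}"
end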
I suggest using basic or digest authentication and a plain GET request. In other words, your request should be "GET /places?lat=x&lon=x&radius=x" and you should let HTTP handle the authentication. If I understand your situation correctly, this is the ideal approach and will certainly be the most RESTful solution.
As an aside, your URI could be improved. Having verbs ("get") and query-like adjectives ("nearby") in your resource names is not really appropriate. In general, resources should be nouns (i.e. "places", "person", "books"). See the example request I wrote above; "get" is redundant because you are using a GET request, and "nearby" is redundant because you are already querying by location.

GET vs POST in AJAX?

Why are there GET and POST requests in AJAX, since they do not affect the page URL anyway? What difference does it make to pass sensitive data over GET in AJAX, as the data is not reflected in the page URL?
You should use the proper HTTP verb according to what you require from your web service.
When dealing with a Collection URI like: http://example.com/resources/
GET: List the members of the collection, complete with their member URIs for further navigation. For example, list all the cars for sale.
PUT: Meaning defined as "replace the entire collection with another collection".
POST: Create a new entry in the collection where the ID is assigned automatically by the collection. The ID created is usually included as part of the data returned by this operation.
DELETE: Meaning defined as "delete the entire collection".
When dealing with a Member URI like: http://example.com/resources/7HOU57Y
GET: Retrieve a representation of the addressed member of the collection expressed in an appropriate MIME type.
PUT: Update the addressed member of the collection or create it with the specified ID.
POST: Treats the addressed member as a collection in its own right and creates a new subordinate of it.
DELETE: Delete the addressed member of the collection.
Source: Wikipedia
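A brief Ruby sketch of the two URI styles in the table above (the host and member ID are the placeholder values from the table):
require 'net/http'
require 'json'

# Collection URI: POST creates a new member and the server assigns the ID.
collection = URI('http://example.com/resources/')
post = Net::HTTP::Post.new(collection.path, 'Content-Type' => 'application/json')
post.body = { model: 'car for sale' }.to_json
Net::HTTP.start(collection.host, collection.port) { |http| http.request(post) }

# Member URI: GET retrieves a representation of one addressed resource.
member = URI('http://example.com/resources/7HOU57Y')
puts Net::HTTP.get_response(member).body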
Well, as for GET, you still have the URL length limitation. Other than that, it is quite conceivable that the server treats POST and GET requests differently; thus the need to be able to specify which kind of request you're making.
Another difference between GET and POST is the way caching is handled in browsers. A POST response is never cached. A GET response may or may not be cached based on the caching rules specified in your response headers.
Two primary reasons for having them:
GET requests have some pretty restrictive limitations on size; POST requests are typically capable of containing much more information.
The backend may be expecting GET or POST, depending on how it's designed. We need the flexibility of doing a GET if the backend expects one, or a POST if that's what it's expecting.
It's simply down to respecting the rules of the HTTP protocol.
GET calls must be idempotent. This means that if you call one multiple times you will get the same result. It is not intended to change the underlying data. You might use this for a search box, etc.
POST calls are NOT idempotent. They are allowed to change the underlying data, so they might be used in a create method. If you call one multiple times you will create multiple entries.
You normally send parameters to the AJAX script, it returns data based on these parameters. It works just like a form that has method="get" or method="post". When using the GET method, the parameters are passed in the query string. When using POST method, the parameters are sent in the post body.
Generally, if your parameters have very few characters and do not contain sensitive information then you send them via GET method. Sensitive data (e.g. password) or long text (e.g. an 8000 character long bio of a person) are better sent via POST method.
I mainly use the GET method with Ajax and I haven't had any problems until now, except the following:
Internet Explorer (unlike Firefox and Google Chrome) caches GET calls when the same GET values are used.
So, polling with Ajax GET can keep showing the same results unless you change the URL by appending an irrelevant random number for each Ajax GET.
Others have covered the main points (context/idempotency, and size), but I'll add another: encryption. If you are using SSL and want to encrypt your input args, you need to use POST.
When we use the GET method in Ajax, only the value of the field is sent, not the format the content is in. For example, content in a textarea is just added to the URL in the case of the GET method (without newline characters). That is not the case with the POST method.
