How to serve images from Riak - image

I have a bunch of markdown documents in Riak, which I'm exposing via a small Sinatra API with basic search functionality etc.
Each document has an associated image, also stored in Riak (in a different bucket). I'd like to have a client app display the documents alongside their associated images - so I need some way to make the images available, but as I'm only ever going to be requesting them by key it seems wasteful to serve them via a Sinatra app as I'm doing with the documents.
However I'm uneasy with serving them directly from Riak, because a) even using nginx to limit the acceptable requests, I worry about exposing more functionality than we want to and b) Riak throws a 403 for any request where the referrer is set, so by default using a direct-to-Riak url as the src of an img tag doesn't work.
So my question is - what's a good approach to take for serving the images? Add another endpoint to the Sinatra app? Direct from Riak using some Nginx wizardry that is currently beyond me? Or some other approach I haven't considered yet? This would ideally use Ruby as that's what the team I'm working with are more comfortable with.
Not sure if this question might be better suited to Server Fault - if so I'll move it over.

You're right to be concerned about exposing Riak to any direct connectivity. Until 2.0 arrives early next year, there is no security in the system (although the 403 for requests with a referrer is a security mechanism to protect against XSS), and even with security exposing any database directly to the Internet invites disaster.
I've not done anything with nginx, but all you'd really need to use it properly, I'd think, would be two features:
Ability to restrict requests to GET
Ability to restrict (or rewrite) requests to the proper bucket
Ability to strip out all HTTP headers that Riak includes in its result (which, since nginx is a proxy server and not a straight load balancer, seems like it should be straightforward)
Assuming that your images are the only content in that bucket, nginx feels like a reasonable choice here.

Related

Can a person add CORS headers using the ELB Application Load Balancer (sitting in front of Solr)?

We have a number of EC2 instances running Solr in EC2, which we've used in the past through another application. We would like to move towards allowing users (via web browser) to directly access Solr.
Without something "in front" of Solr this results in a security risk, so we have opted to try to use ELB (specifically the Application Load Balancer) as a simple and maintenance free way of preventing certain requests from hitting SOLR (i.e. preventing the public from DELETING or otherwise modifying the documents in Solr).
This worked great, but we realize that we need to deal with the CORS issue. In other words, we need to add the appropriate headers to requests that come in from a browser. I have not yet seen a way of doing this with Application Load Balancer but am wondering if it is possible to do someway. If it is not possible, I would love as an additional recomendation the easier and least complicated way of adding these headers. We really really really hate to add something like nginx in front of Solr because then we've got additional redundancy to deal with, more servers, etc.
Thank you!
There is not much I can find on CORS for ALB either and I remember when I used Beanstalk with ELB I had to add CORS support in my java application directly.
Having said that, I can find a lot of articles on how to set up CORS for Solr.
Can it be an option for you?

Amazon AWS and usage model for S3 storage

There is this example on amazon, a high traffic web application. I noticed that they are using S3 as their content delivery method. I was wondering if I need to have a Web Server for the content delivery and a Web App for my application. I don't understand why they have 2 web servers and 2 web app in the diagram.
And what is the best way to set up a website that serves images and static contents through S3 and the rest of the content through the regular storage.
My last question is, can I consider S3 as a main storage, reliable enough that I can only keep my static content there and don't have a normal storage as a backup ?
That is a very general diagram, specific diagrams will vary depending on the specifics of the overall architecture.
Having said that, I believe the Web Server represents something like Apache or Nginx and the App Server represent something like Rails, Rack Server, Unicorn, Gunicorn, Django, Sinatra, Flask, Jetty, Tomcat, etc. In some cases you can merge the Web Server and the App Server together like for example deploying Apache with python mod_wsgi to run your Django app. (So depends on Architecture)
what is the best way to set up a website that serves images and static
contents through S3 and the rest of the content through the regular
storage.
There's no really best way other than just point your dynamic content to your Databases (SQL and NoSQL) and point your static files to an S3 bucket (images, css, Jquery code, etc) You can also use third party modules depending on your application stack. For example you can accomplish this in Django with the django-storages module. You can find similar modules for other app stacks like Rails.
My last question is, can I consider S3 as a main storage, reliable
enough that I can only keep my static content there and don't have a
normal storage as a backup ?
S3 is pretty reliable, they provide a 99.999999999% reliability of your data. That goes down if you use their RRS (Reduced Redundancy Storage), but if you want to use it you probably want to back up your data in a non RRS bucket anyways. Anyhow, if it's extremely critical data, you are more than free to backup your data somewhere else just in case.
Notice in the diagram that they also recommend using CloudFront for your static files and this is especially useful if your users will be accessing your application from different geographical areas.
Hope this helps.

Should I make my CouchDB database server public-facing?

I'm new to CouchDb and am trying to comprehend how to properly make use of it. I'm coming from MongoDB where I would always write a web layer and put it in front of mongo so that I could allow users to access the data inside of it, etc. In fact, this is how I've used all databases for every web site that I've ever written. So, looking at Couch, I see that it's native API is HTTP and that it has built in things like OAuth support, and other features that hint to me that perhaps I should no longer have my code layer sitting in front of Couch, but instead write Views and things and just give out accounts to Couch to my users? I'm thinking in terms of like an HTTP-based API for a site of mine, or something that users would consume my data through. Opening up Couch like this seems odd to me, though. Is OAuth, in Couch's sense, meant more for remote access for software that I'd write and run internal to my own network "officially", or is it literally meant for the end users?
I know there might be things that could only be done through a code layer on top of CouchDB, like if you wanted additional non-database related things to occur during API requests, also. So thinking along those lines I think I will still need a code layer, anyway.
Dealer's choice.
Nodejitsu has a great writeup on this sort of topic here.
Not knowing your application specifics I'll take a broad approach...
Back-end
If you want to prevent users from ever seeing your database then make it back-end. You can pipe everything through something like node.js and present only what the user needs to see and they'll never know anything about the database.
See Resource View Presenter
Front-end
If you are not concerned about data security, you can host an entire app on CouchDB; see CouchApp. This approach has the benefit of using the replication mechanism to control publishing your site/data. The drawback here is that you will almost certainly run into some technical limitations that will require moving CouchDB closer to the backend.
Bl-end
Have the app server present the interface and the client pull the data from the database separately. This gives the most flexibility but can be a bag of hurt because even with good design this could lead to supportability and scalability issues.
My recommendation
Use CouchDB on the backend. If you need mobile clients to synchronize then use a secondary DB publicly exposed for this purpose and selectively sync this data to wherever it needs to go.
Simply put, no.
There's no way to secure Couch properly on a public facing site. There's no way to discriminate access at a fine enough granular level. If someone has access to any of the data, they have access to all of the data.
Not all data on a site is meant for public consumption, save for the most trivial of sites.

Server side caching in openrasta

I have a site that is being polled rather hard for the JSON representation of the same resources from multiple clients (browsers, other applications, unix shell scripts, python scripts, etc).
I would like to add some caching such that some of the resources are cached within the server for a configurable amount of time, to avoid the CPU hit of handling the request and serialising the resource to JSON. I could of course cache them myself within the handlers, but would then take the serialisation hit on every request, and would have to modify loads of handlers too.
I have looked at the openrasta-caching module but think this is only for controlling the browser cache?
So any suggestions for how I can get openrasta to cache the rendered representation of the resource, after the codec has generated it?
Thanks
openrasta-caching does have preliminary support for server-side caching, with an API you can map to the asp.net server-side cache, using the ServerCaching attribute. That said it's not complete, neither is openrasta-caching for that matter. It's a 0.2 that would need a couple of days of work to make it to a good v1 that completely support all the scenarios I wanted to support that the asp.net caching infrastructure doesn't currently support (mainly, make the caching in OpenRasta work exactly like an http intermediary rather than an object and .net centric one as exists in asp.net land, including client control of the server cache for those times you want the client to be allowed to force the server to redo the query). As I currently have no client project working on caching it's difficult to justify any further work on that plugin, so I'm not coding anything for it right now. I do have 4 days free coming up though, so DM me if you want openrasta-caching to be bumped to 0.3 with whatever requirements you have in that that would fit 4 days of work.
You can implement something simpler using an IOperationInterceptor and plugging in the asp.net pipeline using that, or be more web-friendly and implement your caching out of process using squid and rely on openrasta-caching for generating the correct http caching instructions.
That said, for your problem, you may not even need server caching if the cost is in json. If you map a last modified or an Etag to what the handler returns, it'll generate correctly a 304 where appropriate, and bypass json rendering alltogether, provided your client does conditional requests (and it should).
There's also a proposed API to allow you to further optimize your API by doing a first query on last modified/etag to return the 304 without retrieving any data.

Which schemaless datastores provide good performance?

I've recently written a web app that uses couchdb. I like couchdb and it suited the app - which has a lot of dynamic behaviour and simply pulls JSON directly from couchdb. Being able to upload images via a browser is nice and it's a snap to do tweaks to document data. The replication also has made deployment a breeze as the app is a couchapp, and all that's required to deploy is a replicate to the production server.
However for a new app I'm thinking off (think blog type thingy), I want good performance and it's one area I think couchdb is not strong in. The app will be predominantly read oriented (I'm estimating 90% reads to 10% writes).
Which datastores provide the best performance in a single server scenario? I'd be very interested to hear people's experiences in this...
I think MongoDB is beginning to look like the front runner performance wise for schemaless data stores.
We're currently in the processes of evaluating this for storing binary objects that can range from 10Kb to 50Mb and I've been very impressed with it's performance even on modest hardware.
If it is primarily read performance you are worried about why not just put a varnish proxy in front of couchdb? I use a couple of custom configurations in varnish to tell it not to actually query couchdb for cached objects despite couchdb specifying must-validate, then have a script with an active HTTP GET on _changes that uses the data from _changes in order to explicitly purge changed entries from varnish.
As a plus varnish lets you do URL rewriting, which I need. Most of the other solutions for it involve running something like apache or ngnix just to rewrite URLs for couchdb.

Resources