How can I compress the JSON content from CouchDB's HTTP responses?

I am making a lot of _all_docs requests to CouchDB's HTTP server. One thing I'm realizing is that the responses are not compressed, which results in large transfers. Even when using limit and skip, the responses can sometimes be 10 MB each. That doesn't cause any problems for my app, but it does mean that when the connection to our CouchDB server is slower than our office connection, things get rather slow.
Is there any way I can enable HTTP compression? I am not referring to attachments - just the JSON files.
Also, I am using Windows Server - not Linux/Unix.
Thanks!

There is no support for this in CouchDB directly, but it has been requested, so voice your support there if you want it included.
That said, you have a number of options. First, you can set up nginx as a reverse proxy in front of CouchDB and let it compress (and possibly cache) responses for you. After a quick search, I also found this plugin that you install into CouchDB directly.
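Once a gzip-capable proxy is in front of CouchDB, nothing special is needed on the client side: any HTTP client that sends Accept-Encoding: gzip gets the smaller payload and decompresses it transparently. A minimal Python sketch, assuming a hypothetical front-end hostname:

```python
# Minimal sketch (hypothetical proxy host): fetching _all_docs through a
# gzip-enabled reverse proxy in front of CouchDB. The requests library sends
# "Accept-Encoding: gzip, deflate" by default and decompresses transparently.
import requests

PROXY_URL = "http://couch-proxy.example.com/mydb/_all_docs"  # assumed front-end

resp = requests.get(
    PROXY_URL,
    params={"include_docs": "true", "limit": 1000, "skip": 0},
    headers={"Accept-Encoding": "gzip"},  # explicit, though requests adds this anyway
)
resp.raise_for_status()

print("Content-Encoding:", resp.headers.get("Content-Encoding"))  # "gzip" if the proxy compressed
print("Rows returned:", len(resp.json()["rows"]))
```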
Another thing is that CouchDB does a pretty solid job of letting clients cache reliably: responses carry ETags, which you can leverage to avoid repeatedly downloading the same large resource.
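For example, a client can remember the ETag from the last response and send it back as If-None-Match; a 304 reply means the cached body is still current. A rough sketch (again, the front-end URL is a placeholder):

```python
# Rough sketch of client-side caching with CouchDB's ETags: remember the ETag
# from the last response and send it back as If-None-Match. A 304 means the
# cached body can be reused without downloading it again.
import requests

URL = "http://couch-proxy.example.com/mydb/_all_docs"  # assumed front-end
cache = {}  # {url: (etag, parsed_body)} -- in-memory, for illustration only

def fetch(url):
    headers = {}
    if url in cache:
        headers["If-None-Match"] = cache[url][0]
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:          # unchanged: reuse the cached body
        return cache[url][1]
    resp.raise_for_status()
    etag = resp.headers.get("ETag")
    body = resp.json()
    if etag:
        cache[url] = (etag, body)
    return body

docs = fetch(URL)
```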

Related

Implementing "extreme" bandwidth saving for web browsing with a compression proxy

I have a network connection where I pay per megabyte, so I'm interested in reducing my bandwidth usage as far as possible while still having a reasonably good browsing experience. I use this wonderful extension (https://bandwidth-hero.com/). The extension runs an image-compression proxy on my Heroku account that accepts image URLs and returns a low-quality version of those images. This reduces bandwidth usage by 30-40% when images are loaded.
To further reduce usage, I typically browse with both JavaScript and images disabled (there are various extensions for doing this in firefox/firefox-esr/google-chrome). This has an added bonus of blocking most ads (since they usually need JavaScript to run).
For daily browsing, the most efficient solution is using a text-mode browser in a virtual console such as elinks/lynx/links2 running over ssh (with zlib compression) on a VPS server. But sometimes using JavaScript becomes necessary, as sites will not render without it. Elinks is the only text-mode browser that even tries to support JavaScript, and even that support is quite rudimentary. When I have to come back to using firefox/chrome, I find my bandwidth usage shooting up. I would like to avoid this.
I find that bandwidth is used partially to get the 'raw' html files of the sites I'm browsing, but more often for the associated .js/.css files. These are typically highly compressible. On my local workstation, html+css+javascript files typically compress by a factor of more than 10x when using lzma(2) compression.
It seems to me that one way to drastically reduce bandwidth consumption would be to use the same approach as the bandwidth-hero extension, i.e. run a compression proxy (either on a VPS or on my Heroku account) but do so for text content (.html/.js/.css).
Ideally, I would like to run a compression proxy on my local machine. When I open a site (say www.stackoverflow.com), the browser would send the request to this local proxy. The local proxy then sends a request to a back-end running on Heroku or a VPS. The back-end actually fetches all the content, compresses it (lzma/bzip/gzip), and sends the compressed content back to my local proxy. The local proxy decompresses the content and finally hands it to the browser.
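To make the idea concrete, here is roughly what I imagine the back-end half looking like in Python (the route name and query parameter are made up for illustration; the local proxy would call this endpoint, lzma.decompress() the payload, and pass the result to the browser):

```python
# Very rough sketch of the Heroku/VPS half of the idea: an endpoint that fetches
# a URL on your behalf and returns the body lzma-compressed. The route name and
# query parameter are invented for this example.
import lzma

import requests
from flask import Flask, request, Response

app = Flask(__name__)

@app.route("/fetch")
def fetch():
    url = request.args.get("url")
    upstream = requests.get(url, timeout=30)
    compressed = lzma.compress(upstream.content, preset=6)
    return Response(
        compressed,
        content_type="application/octet-stream",
        headers={
            "X-Original-Content-Type": upstream.headers.get("Content-Type", ""),
            "X-Original-Length": str(len(upstream.content)),
        },
    )

if __name__ == "__main__":
    app.run(port=8000)
```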
There is something like this mentioned in this answer (https://stackoverflow.com/a/42505732/10690958) for node.js. I am thinking of the same for Python.
From what Google searches show, HTTP clients can "automatically" ask for gzip versions of pages. But does this also apply to the associated files that are loaded by JavaScript, and to the CSS files? Perhaps what I am thinking about is already implemented by default?
Any pointers would be welcome. I was thinking of writing the local proxy in Python, as I am reasonably fluent in it, but I know little about Heroku or the intricacies of HTTP.
Thanks.
Update: I found a possible solution here: https://github.com/barnacs/compy
which does almost exactly what I need (minify + compress with brotli/gzip + transcode jpeg/gif/png). It uses Go instead of Python, but that does not really matter. It also has a Docker image here: https://hub.docker.com/r/andrewgaul/compy/. Since I'm not very familiar with Heroku, I can't figure out how to use this to run the compression proxy service on my account. The Heroku docs also weren't of much help to me. Any pointers would be welcome.

Server side caching of dynamic content with Nginx and Etags

I have a CouchDB database, with an Nginx reverse proxy in front of it. Some responses from CouchDB take a long time to generate (yes, it was a bad choice, but we need to stick with it for now), and I would like to cache them with Nginx. (Currently Nginx only does SSL.)
CouchDB supports ETags, so ideally I would like Nginx to cache the ETags as well on behalf of dumb clients. The clients do not use ETags; they would just query Nginx, which would revalidate against CouchDB with its cached ETag and then send back either the cached response or the fresh one.
My understanding based on the docs is that Nginx cannot do this currently. Have I missed something? Is there an alternative that supports this setup? Or is the only solution to invalidate the Nginx cache by hand?
I am assuming that you already looked at varnish and did not find it suitable for your case. There are two ways you can achieve what you want.
With nginx
Nginx has a default caching mechanism that you can configure for your use.
If that does not help, you should give Nginx compiled with the third-party Ngx_Lua module a try. This is also conveniently packaged, along with other useful modules and the required Lua environment, as OpenResty.
With Ngx_Lua, you can use the shared dictionary to cache your CouchDB responses. As the name suggests, the shared dictionary uses a shared memory zone in Ngx_Lua's execution environment. This is similar to the way proxy_cache works in Nginx (which also defines a shared memory zone in Nginx's execution environment), but it comes with the added advantage that you can program it.
The steps required to build a CouchDB cache are pretty simple (with this approach you don't need to send ETags to the client); a sketch of the flow follows below:
You make a request to CouchDB.
You save the {Url-Etag:response} pair.
Next time a request comes in for the same URL, query CouchDB for the current ETag using a HEAD request.
If the returned ETag matches the stored {Url-Etag:response} pair, send the cached response; otherwise query CouchDB again (GET/POST), update the {Url-Etag:response} pair, and send the fresh response to the client.
Of course, if you program a cache by hand you will have to define a maximum cache size and a mechanism to remove old items from the cache. The lua_shared_dict directive lets you define the size of the memory zone in which responses will be cached. When saving values in the shared dictionary you can specify how long a value remains in the memory zone before it automatically expires. Combining the maximum cache size and the expiry time of the shared dictionary, you should be able to program a fairly sophisticated caching mechanism for your users.
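The flow above, written out in Python purely to make the logic concrete (in Ngx_Lua the cache table would be a lua_shared_dict with a size limit and per-entry expiry; the CouchDB address and view path are placeholders):

```python
# Illustration of the HEAD-then-conditional-serve flow described above, in
# Python rather than Lua just to make the logic concrete. In Ngx_Lua the
# `cache` dict would be a lua_shared_dict with a size limit and expiry times.
import requests

COUCH = "http://127.0.0.1:5984"   # placeholder CouchDB address
cache = {}                        # {url: (etag, body)}

def get(path):
    url = COUCH + path
    if url in cache:
        etag, body = cache[url]
        head = requests.head(url)                     # cheap revalidation
        if head.headers.get("ETag") == etag:
            return body                               # serve from cache
    resp = requests.get(url)                          # miss or stale: refetch
    resp.raise_for_status()
    cache[url] = (resp.headers.get("ETag"), resp.content)
    return resp.content

body = get("/mydb/_design/app/_view/by_date")         # placeholder view path
```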
With erlang
Since CouchDB is written in Erlang, you already have an Erlang environment on your machine. So if you can program in it, you can create a very robust distributed cache with mnesia. The steps are the same. Erlang timers can be combined with gen_* behaviours to give you automatic expiry of items, and mnesia has functions to monitor its memory usage and notify you about it. The two approaches are almost equivalent, the only difference being that mnesia can be distributed.
Update
As #abyz suggested, Redis is also a good choice when it comes to caching.

AS3 limit bandwidth or prioritize downloads

I have multiple downloads in a SWF, all pulling in external data.
Can I limit the download speed of one download thread or prioritize one?
Flash will download them in the order you ask for them. If you need one first, then ask for that one first.
When running within a browser it is actually the browser that makes the download on behalf of the client container (the client merely requests that the browser does the download). A lot of browsers have restrictions on concurrent file downloads anyway, especially in the same domain, and that is not something you will be able to circumvent with the in-built classes.
My guess is that you could roll your own (or use an existing client as a base) using Socket instead of URLLoader. This would be a large amount of work though - the browser does a lot of free work for you (SSL, cookies from the current session, gzip, keep-alives, etc.).

How to serve images from Riak

I have a bunch of markdown documents in Riak, which I'm exposing via a small Sinatra API with basic search functionality etc.
Each document has an associated image, also stored in Riak (in a different bucket). I'd like to have a client app display the documents alongside their associated images - so I need some way to make the images available, but as I'm only ever going to be requesting them by key it seems wasteful to serve them via a Sinatra app as I'm doing with the documents.
However I'm uneasy with serving them directly from Riak, because a) even using nginx to limit the acceptable requests, I worry about exposing more functionality than we want to and b) Riak throws a 403 for any request where the referrer is set, so by default using a direct-to-Riak url as the src of an img tag doesn't work.
So my question is - what's a good approach to take for serving the images? Add another endpoint to the Sinatra app? Direct from Riak using some Nginx wizardry that is currently beyond me? Or some other approach I haven't considered yet? This would ideally use Ruby, as that's what the team I'm working with is most comfortable with.
Not sure if this question might be better suited to Server Fault - if so I'll move it over.
You're right to be concerned about exposing Riak to any direct connectivity. Until 2.0 arrives early next year, there is no security in the system (although the 403 for requests with a referrer is a security mechanism to protect against XSS), and even with security, exposing any database directly to the Internet invites disaster.
I've not done anything with nginx, but all you'd really need to use it properly, I'd think, would be three features (a sketch enforcing the same constraints follows below):
Ability to restrict requests to GET
Ability to restrict (or rewrite) requests to the proper bucket
Ability to strip out all HTTP headers that Riak includes in its result (which, since nginx is a proxy server and not a straight load balancer, seems like it should be straightforward)
Assuming that your images are the only content in that bucket, nginx feels like a reasonable choice here.
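If you'd rather keep it at the application layer (another small endpoint in front of Riak instead of nginx), the same constraints are easy to express. A rough Python sketch, just to show the shape of it - the Ruby/Sinatra version would look much the same, and the bucket name and Riak's /riak/<bucket>/<key> URL layout are assumptions about a typical setup:

```python
# Rough sketch of the three constraints above enforced in an application-level
# proxy: GET only, a single fixed bucket, and Riak's own headers stripped.
import requests
from flask import Flask, Response, abort

app = Flask(__name__)

RIAK = "http://127.0.0.1:8098"                    # assumed Riak HTTP address
IMAGE_BUCKET = "images"                           # the only bucket we expose
KEEP = {"content-type", "etag"}                   # pass only what the browser needs

@app.route("/images/<key>", methods=["GET"])      # GET only, one bucket only
def image(key):
    resp = requests.get(f"{RIAK}/riak/{IMAGE_BUCKET}/{key}")
    if resp.status_code != 200:
        abort(404)
    headers = {k: v for k, v in resp.headers.items() if k.lower() in KEEP}
    return Response(resp.content, headers=headers)
```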

Which schemaless datastores provide good performance?

I've recently written a web app that uses CouchDB. I like CouchDB and it suited the app, which has a lot of dynamic behaviour and simply pulls JSON directly from CouchDB. Being able to upload images via a browser is nice, and it's a snap to tweak document data. Replication has also made deployment a breeze: the app is a couchapp, and all that's required to deploy is a replication to the production server.
However, for a new app I'm thinking of (think blog-type thingy), I want good performance, and that's one area I think CouchDB is not strong in. The app will be predominantly read-oriented (I'm estimating 90% reads to 10% writes).
Which datastores provide the best performance in a single server scenario? I'd be very interested to hear people's experiences in this...
I think MongoDB is beginning to look like the front runner, performance-wise, for schemaless data stores.
We're currently in the process of evaluating it for storing binary objects that can range from 10 KB to 50 MB, and I've been very impressed with its performance even on modest hardware.
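For objects in that size range you'd presumably go through GridFS, since plain documents are capped at 16 MB. A minimal pymongo sketch (connection string, database name, and filenames are placeholders):

```python
# Minimal GridFS sketch for storing and retrieving binary objects with pymongo.
# Connection string, database name, and filenames are placeholders.
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
fs = gridfs.GridFS(client["mediastore"])

# store a blob; GridFS splits it into chunks automatically
with open("photo.jpg", "rb") as f:
    file_id = fs.put(f, filename="photo.jpg")

# read it back
blob = fs.get(file_id).read()
print(len(blob), "bytes")
```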
If it is primarily read performance you are worried about, why not just put a Varnish proxy in front of CouchDB? I use a couple of custom configurations in Varnish to tell it not to actually query CouchDB for cached objects, even though CouchDB specifies must-revalidate, and then have a script holding an active HTTP GET on _changes that uses the feed to explicitly purge changed entries from Varnish.
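The purge script is only a few lines; here is a rough sketch, assuming Varnish has a vcl_recv rule that accepts PURGE requests from this host and that the cached URLs map directly to /db/<doc id> (both are configuration choices, not givens):

```python
# Sketch of a _changes-driven purger: hold a continuous _changes feed open and
# issue a PURGE to Varnish for each changed document. Assumes Varnish accepts
# PURGE from this host and that cached URLs look like /<db>/<doc id>.
import json
import requests

COUCH = "http://127.0.0.1:5984"
VARNISH = "http://127.0.0.1:6081"
DB = "mydb"

feed = requests.get(
    f"{COUCH}/{DB}/_changes",
    params={"feed": "continuous", "since": "now", "heartbeat": 30000},
    stream=True,
)

for line in feed.iter_lines():
    if not line:                      # heartbeat newlines keep the socket alive
        continue
    change = json.loads(line)
    doc_id = change.get("id")
    if doc_id:
        requests.request("PURGE", f"{VARNISH}/{DB}/{doc_id}")
```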
As a plus, Varnish lets you do URL rewriting, which I need. Most of the other solutions for this involve running something like Apache or nginx just to rewrite URLs for CouchDB.
