Which schemaless datastores provide good performance? - performance

I've recently written a web app that uses couchdb. I like couchdb and it suited the app - which has a lot of dynamic behaviour and simply pulls JSON directly from couchdb. Being able to upload images via a browser is nice and it's a snap to do tweaks to document data. The replication also has made deployment a breeze as the app is a couchapp, and all that's required to deploy is a replicate to the production server.
However for a new app I'm thinking off (think blog type thingy), I want good performance and it's one area I think couchdb is not strong in. The app will be predominantly read oriented (I'm estimating 90% reads to 10% writes).
Which datastores provide the best performance in a single server scenario? I'd be very interested to hear people's experiences in this...

I think MongoDB is beginning to look like the front runner performance wise for schemaless data stores.
We're currently in the processes of evaluating this for storing binary objects that can range from 10Kb to 50Mb and I've been very impressed with it's performance even on modest hardware.

If it is primarily read performance you are worried about why not just put a varnish proxy in front of couchdb? I use a couple of custom configurations in varnish to tell it not to actually query couchdb for cached objects despite couchdb specifying must-validate, then have a script with an active HTTP GET on _changes that uses the data from _changes in order to explicitly purge changed entries from varnish.
As a plus varnish lets you do URL rewriting, which I need. Most of the other solutions for it involve running something like apache or ngnix just to rewrite URLs for couchdb.

Related

How to cache data in next.js server on vercel?

I'm trying to build a small site that gets its data from a database (currently I use Firebase's Cloud Firestore).
I've build it using next.js and thought to host it on vercel. It looks very nice and was working well.
However, the site needs to handle ~1000 small documents - serve, search, and rarely update. In order to reduce calls to the database on every request, which is costly both in time, and in database pricing, I thought it would be better if the server could get the full list of item when it starts (or on the first request), and then hold them in memory and make data request get the data from its memory.
It worked well in the local dev server, but when I deployed it to vercel, it didn't work. It seems it forces me to work in serverless mode, where each request is separate, and I can't use a common in-memory cache to get the data.
Am I missing something and there is a way to achieve something like that with next.js on vercel?
If not, can you recommend other free cloud services that can provide what I'm looking for?
One option can be using FaunaDB and Netlify, as described in this post, but I ended up opening a free Wix site and using Wix data to store the data. I built http-functions module to provide access to the data via REST, which also caches highly used data in memory. Currently it seems to work like a charm!

NoSQL for multi-site archival logging with full-text search

I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide full text search of those logs at each site fairly instantaneous for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras
The log data is (as usual) highly compressible. Any solution that compresses data transacting from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that ElasticSearch would probably be really good for this, though from what I understand that doesn't support multi-site.
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data-model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations on multi-site with some of the options that don't support multi-site by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster on multiple geographical sites (Couchbase calls these "datacenters") and then XDCR would automatically replicate (bidirectionally) the data between sites. If I understand your requirements correctly, this sounds like just what you need.

What are the size limits for Laravel's file-based caching?

I am a new developer and am trying to implement Laravel's (5.1) caching facility to improve the speed of my app. I started out caching a large DB table that my app constantly references - but it got too large so I have backed away from that and am now 'forever' caching smaller chunks of data - for example, for each page only the portions of that large DB table that are relevant.
I have watched 'Caching Essentials' on Laracasts, done some Googling and had a search in this forum (and Laracasts') but I still have a couple of questions:
I am not totally clear on how the cache size limits work when you are using Laravel's file-based system - is there an overall in-app size limit for the cache or is one limited size-wise only per key and by your server size?
What are the signs you should switch from file-based caching to something like Memcached or Redis - and what are the benefits of using one of those services? Is it the fact that your caching is handled on a different server (thereby lightening the load on your own)? Do you switch over to one of these services when your local, file-based cache gets too big for your server?
My app utilizes several tables that have 3,000-4,000 rows - the data in these tables is constantly referenced and will remain static unless I decide to add new options. I am basically looking for the best way to speed up queries to the data in these tables.
Thanks!
I don't think Laravel imposes any limitations on its file i/o at all - the limitations will be with how much what PHP can read / write to a file at once, or hold in its memory / process at any one time.
It does serialise the data that you cache, and unserialise it when you reload it, so your PHP environment would have to be able to process the entire cache file (which is equivalent to the top level cache key) at once. So, if you are getting cacheduser.firstname, it would have to load the whole cacheduser key from the file, unserialise it, then get the firstname key from that.
I would take the PHP memory limit (classic, i know!) as a first point to investigate if you want to keep down this road.
Caching services like Redis or memcached are bespoke, optimised caching solutions. They take some of the logic and responsibility out of your PHP environment.
They can, for example, retrieve sub-keys from items without having to process the whole thing, so can retrieve part of some cached data in a memory efficient way. So, when you request cacheduser.firstname from redis, it just returns you the firstname attribute.
They have other advantages regarding tagging / clearing out subsets of caches (see [the cache tags Laravel docs] (https://laravel.com/docs/5.4/cache#cache-tags))
Another thing to think about is scaling. If your site is large enough, and is load-balanced across multiple servers, the filesystem caching may be different across those servers, as each server can only check their local filesystem for the cache files. A caching service can be on a different server (many hosts will have a separate redis / memcached services available), so isn't victim to this issue.
Also - as I understand it (and this might be the most important thing), the file cache driver in Laravel is mainly for local development and testing. Although it can work fine for simple applications with basic caching needs, it's not intended for large scalable production environments.
Personally, I develop locally and test with file caching, as i'm only dealing with small amounts of data then, and use redis to cache on production environments.
It doesn't necessarily need to be on a separate server to get the benefits. If you are never going to scale to multiple application servers, then using a caching service on the same server will already be a large improvement to caching large documents.

Is memcache(d) necessary when using Cloudflare/Incapsula

If you need caching in your website to make database use lower, do you have to do it using memcache or memcached (in PHP, for example) or can you achieve this by using professional services like CloudFlare, Incapsula or others like that do some caching for you?
Services like Cloudflare cache your HTML and/or assets like images and CSS files in a CDN, so that your entire server is hit less often. This is great for semi-static sites but may not be the best fit for highly dynamic sites.
Local caches like memcached just store any data in a way that's fast to access. You can use that to cache database queries and lower your database activity, but you can also use it to store pre-computed data that would be expensive to re-create all the time or whatever else you may want to store non-permanently in a fast-to-access way.
Both solutions solve different problems. You may use both together, or either, or neither. It really depends on where exactly your bottleneck is and which solution fits your problem better.
I'm the CEO of CloudFlare and I'd say: more (intelligent) caching is almost always a good thing. While we can significantly decrease the load coming to your web server, to get the best performance it's still extremely important to optimize your web application and it's interaction with your database. To that end, memcache and other fast caching layers can play an important role and I'd never discourage them.
PS - we work great with dynamic sites. 95%+ of our sites are highly dynamic web applications.

Is SQLite suitable for use as a read only cache on a web server?

I am currently building a high traffic GIS system which uses python on the web front end. The system is 99% read only. In the interest of performance, I am considering using an externally generated cache of pre-generated read-optimised GIS information and storing in an SQLite database on each individual web server. In short it's going to be used as a distributed read-only cache which doesn't have to hop over the network. The back end OLTP store will be postgreSQL but that will handle less than 1% of the requests.
I have considered using Redis but the dataset is quite large and therefore it will push up the administrative cost and memory cost on the virtual machines this is being hosted on. Memcache is not suitable as it cannot do range queries.
Am I going to hit read-concurrency problems with SQLite doing this?
Is this a sensible approach?
Ok after much research and performance testing, SQLite is suitable for this. It has good request concurrency on static data. SQLite only becomes an issue if you are doing writes as well as heavy reads.
More information here:
http://www.sqlite.org/lockingv3.html
if usage case is just a cache why don't you use something like
http://memcached.org/.
You can find memcached bindings for python in pypi repository.
Another options is that you use materialized views in postgres, this way you will keep things simple and have everything in one place.
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views

Resources