Does using Elasticsearch as key value cache like redis makes sense - caching

I have recently encountered a question that since redis is not distributed and don't support parallelism(multi-core), isn't elastic search be a better choice to use instead of redis for caching purpose.
This is all in reference to a simple web, where we used redis to cache db query.
I have kind of got the idea here,
but still not sure whether it has any real benefits. Opening up this thread to discuss the advantages/disadvantages in doing so.

It's not really what you asked for but you might want to have a look at Aerospike.
Redis is an in-memory data structure store known for speed and often used as a cache. Both Redis and Aerospike are open source, however, when applications need persistence or when applications must scale but servers have maxed out of RAM, developers should consider Aerospike, a distributed key-value store that is just as fast or faster than Redis, but scales more simply and with the economics of flash/SSDs.

Related

Redis vs dynamoDb geolocation tracking

I am currently a bit confused into which database to use for geolocation Tracking. What I want to do is update the location of a group of people every 30 secs. The data is sent to the server using web-sockets. Each user has an Id in the database and I would like to update the location of that user every 30 second. After doing so, I would like to query these locations and show it in real time to another group of users. My question is what is the advantage and the disadvantages of DynamoDb and Redis. Which one is faster and can scale easier. I am expecting almost 2 million QPS
Both can scale fairly well, but this depends heavily on your use case and architecture.
DynamoDB is a cloud based NoSQL storage system, and Redis is an in memory data structure store. This means that queries to DynamoDB would involve making a roundtrip to Amazon's servers, while queries to Redis would be over RAM (so, much, much lower latency).
As a consequence of the above, the amount of data you can store in Redis would be limited by the RAM available on your hardware. That said, in the event of Redis or your hardware crashing for some reason, you would have to be content with some level of data loss. You can mitigate this somewhat by configuring Redis persistence so that Redis writes to disk regularly (either every N seconds or by manually triggering a write in your code) and mitigate further by then copying those writes to S3 or elsewhere. This trades performance (depending on your scale) for data safety somewhat due to I/O latency. See the documentation for Redis persistence and this blog post by the GitHub engineering team mentioning their decision to remove Redis persistence for performance reasons.
Meanwhile all of the issues above are abstracted away for you by DynamoDB since AWS handles availability for you behind the scenes. You are really only limited by how much you can afford and usage (read/write per second) limits.
DynamoDB does not have native support for querying and inserting geospatial data (although there is a library for it, but it seems to be unmaintained), Redis does. You could write your own code for this.
DynamoDB does not have support for namespacing, or rather, DynamoDB is namespaced by your AWS account meaning you would not be able to maintain a separate DynamoDB instance with the same table names (say for production vs dev data) on the same AWS account. Redis doesn't either, but you can trivially spin up a separate Redis instance for this.
See also Redis MEMORY USAGE command and Redis memory optimization docs.

Redis or Ehcache?

Which is better suited for the following environment:
Persistence not a compulsion.
Multiple servers (with Ehcache some cache sync must be required).
Infrequent writes and frequent reads.
Relatively small database (very less memory requirement).
I will pour out what's in my head currently. I may be wrong about these.
I know Redis requires a separate server (?) and Ehcache provides local cache so it must be faster but will replicate cache across servers (?). Updating all caches after some update on one is possible with Ehcache.
My question is which will suit better for the environment I mentioned?
Whose performance will be better or what are scenarios when one may outperform another?
Thanks in advance.
You can think Redis as a shared data structure, while Ehcache is a memory block storing serialized data objects. This is the main difference.
Redis as a shared data structure means you can put some predefined data structure (such as String, List, Set etc) in one language and retrieve it in another language. This is useful if your project is multilingual, for example: Java the backend side , and PHP the front side. You can use Redis for a shared cache. But it can only store predefined data structure, you cannot insert any Java objects you want.
If your project is only Java, i.e. not multilingual, Ehcache is a convenient solution.
You will meet issues with EhCache scaling and need resources to manage it during failover and etc.
Redis benefits over EhCache:
It uses time proven gossip protocol for Node discovery and synchronization.
Availability of fully managed services like AWS ElastiCache, Azure Redis Cache. Such services offers full automation, support and management of Redis, so developers can focus on their applications and not maintaining their databases.
Correct large memory amount handling (we all know that Redis can manage with hundreds of gigabytes of RAM on single machine). It doesn't have problems with Garbage collection like Java.
And finally existence of Java Developer friendly Redis client - Redisson.
Redisson provides many Java friendly objects on top of Redis, like:
Set
ConcurrentMap
List
Queue
Deque
BlockingQueue
BlockingDeque
ReadWriteLock
Semaphore
Lock
AtomicLong
CountDownLatch
Publish / Subscribe
ExecutorService
and many more...
Redisson supports local cache for Map structure which cold give you 45x performance boost for read operations.
Here is the article describing detailed feature comparison of Ehcache and Redis.

Is it necessary for memcached to replicate its data?

I understand that memcached is a distributed caching system. However, is it entirely necessary for memcached to replicate? The objective is to persist sessions in a clustered environment.
For example if we have memcached running on say 2 servers, both with data on it, and server #1 goes down, could we potentially lose session data that was stored on it? In other words, what should we expect to see happen should any memcached server (storing data) goes down and how would it affect our sessions in a clustered environment?
At the end of the day, will it be up to use to add some fault tolerance to our application? For example, if the key doesn't exist possibly because one of the servers it was on went down, re-query and store back to memcached?
From what I'm reading, it appears to lean in this direction but would like confirmation: https://developers.google.com/appengine/articles/scaling/memcache#transient
Thanks in advance!
Memcached has it's own fault tolerance built in so you don't need to add it to your application. I think providing an example will show why this is the case. Let's say you have 2 memcached servers set up in front of your database (let's say it's mysql). Initially when you start your application there will be nothing in memcached. When your application needs to get data if will first check in memcached and if it doesn't exist then it will read the data from the database and insert it into memcached before returning it to the user. For writes you will make sure that you insert the data into both your database and memcached. As you application continues to run it will populate the memcached servers with a bunch of data and take load off of your database.
Now one of your memcached servers crashes and you lose half of your cached data. What will happen is that your application will now be going to the database more frequently right after the crash and your application logic will continue to insert data into memcached except everything will go directly to the server that didn't crash. The only consequence here is that your cache is smaller and your database might need to do a little bit more work if everything doesn't fit into the cache. Your memcached client should also be able to handle the crash since it will be able to figure out where your remaining healthy memcached servers are and it will automatically hash values into them accordingly. So in short you don't need any extra logic for failure situations in memcached since the memcached client should take care of this for you. You just need to understand that memcached servers going down might mean your database has to do a lot of extra work. I also wouldn't recommend re-populating the cache after a failure. Just let the cache warm itself back up since there's no point in loading items that you aren't going to use in the near future.
m03geek also made a post where he mentioned that you could also use Couchbase and this is true, but I want to add a few things to his response about what the pros and cons are. First off Couchbase has two bucket (database) types and these are the Memcached Bucket and the Couchbase Bucket. The Memcached bucket is plain memcached and everything I wrote above is valid for this bucket. The only reasons you might want to go with Couchbase if you are going to use the memcached bucket are that you get a nice web ui which will provide stats about your memcached cluster along with ease of use of adding and removing servers. You can also get paid support down the road for Couchbase.
The Couchbase bucket is totally different in that it is not a cache, but an actual database. You can completely drop your backend database and just use this bucket type. One nice thing about the Couchbase bucket is that it provides replication and therefore prevents the cold cache problem that memcached has. I would suggest reading the Couchbase documentation if this sounds interesting you you since there are a lot of feature you get with the Couchbase bucket.
This paper about how Facebook uses memcached might be interesting too.
https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf
Couchbase embedded memcached and "vanilla" memcached have some differences. One of them, as far as I know, is that couchbase's memcached servers act like one. This means that if you store your key-value on one server, you'll be able to retreive it from another server in cluster. And vanilla memcached "clusters" are usally built with sharding technique, which means on app side you should know what server contain desired key.
My opinion is that replicating memcached data is unnessesary. Modern datacenters provide almost 99% uptime. So if someday one of your memcached servers will go down just some of your online users will be needed to relogin.
Also on many websites you can see "Remember me" checkbox that sets a cookie, which can be used to restore session. If your users will have that cookie they will not even notice that one of your servers were down. (that's answer for your question about "add some fault tolerance to our application")
But you can always use something like haproxy and replicate all your session data on 2 or more independent servers. In this case to store 1 user session you'll need N times more RAM, where N is number of replicas.
Another way - to use couchbase to store sessions. Couchbase cluster support replicas "out of the box" and it also stores data on disk, so if your node (or all nodes) will suddenly shutdown or reboot, session data will not lost.
Short answer: memcached with "remember me" cookie and without replication should be enough.

Is SQLite suitable for use as a read only cache on a web server?

I am currently building a high traffic GIS system which uses python on the web front end. The system is 99% read only. In the interest of performance, I am considering using an externally generated cache of pre-generated read-optimised GIS information and storing in an SQLite database on each individual web server. In short it's going to be used as a distributed read-only cache which doesn't have to hop over the network. The back end OLTP store will be postgreSQL but that will handle less than 1% of the requests.
I have considered using Redis but the dataset is quite large and therefore it will push up the administrative cost and memory cost on the virtual machines this is being hosted on. Memcache is not suitable as it cannot do range queries.
Am I going to hit read-concurrency problems with SQLite doing this?
Is this a sensible approach?
Ok after much research and performance testing, SQLite is suitable for this. It has good request concurrency on static data. SQLite only becomes an issue if you are doing writes as well as heavy reads.
More information here:
http://www.sqlite.org/lockingv3.html
if usage case is just a cache why don't you use something like
http://memcached.org/.
You can find memcached bindings for python in pypi repository.
Another options is that you use materialized views in postgres, this way you will keep things simple and have everything in one place.
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views

hazelcast vs ehcache

Question is clear as you see in the title, it would be appreciated to hear your ideas about adv./disadv. differences between them.
UPDATE:
I have decided to use Hazelcast because of the advantages like distributed caching/locking mechanism as well as the extremely easy configuration while adapting it to your application.
We tried both of them for one of the largest online classifieds and e-commerce platform. We started with ehcache/terracotta(server array) cause it's well-known, backed by Terracotta and has bigger community support than hazelcast. When we get it on production environment(distributed,beyond one node cluster) things changed, our backend architecture became really expensive so we decided to give hazelcast a chance.
Hazelcast is dead simple, it does what it says and performs really well without any configuration overhead.
Our caching layer is on top of hazelcast for more than a year, we are quite pleased with it.
Even though Ehcache has been popular among Java systems, I find it less flexible than other caching solutions. I played around with Hazelcast and yes it did the job, it was easy to get running etc and it is newer than Ehcache. I can say that Ehcache has much more features than Hazelcast, is more mature, and has big support behind it.
There are several other good cache solutions as well, with all different properties and solutions such as good old Memcache, Membase (now CouchBase), Redis, AppFabric, even several NoSQL solutions which provides key value stores with or without persistence. They all have different characteristics in the sense they implement CAP theorem, or BASE theorem along with transactions.
You should care more about, which one have the functionality you want in your application, again, you should consider CAP theorem or BASE theorem for your application.
This test was done very recently with Cassandra on the cloud by Netflix. They reached to million writes per second with about 300 instances. Cassandra is not a memory cache but you data model is like a cache, which is consist of key value pairs. You can as well use Cassandra as a distributed memory cache.
Hazelcast has been a nightmare to scale and stability is still a major issue.
The dedicated client to grid component choices are
The messy version that cant survive node loss anywhere, negating the point of backups (superclient), or
An incredibly slow native client option that does not allow for any type of load balancing to processing nodes in the grid.
If any host could request records from this data grid it would be a sweet design, but you are stuck with those two lackluster option to get anything out of it.
Also multiple issues with database thread pools locking up on individual members and not writing anything to the databases, causing permanent records loss is a frequent issue and we often have to take the whole thing down for hours to refresh any of the JVM's. Split brain is also still an issue, although in 1.9.6 it seems to have calmed down a little.
Rallying to move to Ehcache and improving the database layer instead of using this as a band-aid.
Hazelcast serializes everything whenever there is a node (standard-one), so the data you will save to Hazelcast must implement serialization.
http://open.bekk.no/efficient-java-serialization/
Hazelcast has been a nightmare for me. I was able to get it "working" in a clustered Websphere environment. I use the term "working" loosely. First, all of Hazelcast's documentation is out of date and only shows examples using deprecated method calls. Trying to use the new code without comments in the Javadocs and no examples in the documentation is very hard. Also, the J2EE container code simply does not work at this point because it does not support XA transactions in Websphere. An error is thrown calling code that follows their only J2EE example explicitly(it does look like Milestone 3.0 is addressing this). I had to forget about joining Hazelcast to a J2EE transaction. It does seem Hazelcast is definitely geared to a non EJB/Non-J2EE container environment. Making calls to Hazelcast.getAllInstances() fails to retain any information about Hazelcast's state when switching from one enterprise java bean to another. That forces me to create a new Hazelcast instance just to run calls that give me access to my data. That causes many Hazelcast Instances to start up on the same JVM. Also,retrieving data from Hazelcast is not fast. I tried retrieving data using both the Native Client and directly as a member of the cluster. I stored 51 lists, each containing only 625 objects in Hazelcast. I could not perform a query directly on a list and did not want to store a map just to get access to that feature (SQL operations can be performed on a map). It took about a half second to retrieve each list of 625 objects because Hazelcast Serializes the entire list and sends it over the wire rather than just giving me the delta (what has changed). Another thing, I had to switch to a TCPIP configuration and explicitly list the ip addresses of the servers I wanted to be in the cluster. The default Multicast configuration did not work and from the group discussions in google, other people are experiencing that difficulty as well. To sum up; I did eventually get 8 machines communicating in a cluster through many hours of torturous programmatic configuration and trial and error (the documentation will be little help) but when I did, I still had no control over the number of instances and partitions being created on each JVM due to the half finished nature of Hazelcast for EJB/J2EE and it was VERY SLOW. I implemented a real use case in the unemployment insurance application I work on and the code was much faster making direct calls to the database. It would have been cool if Hazelcast worked as advertised because I really did not want to use a separate service to implement what I am trying to do. I have used MongoDB extensively so I may skip the whole in memory cache and just serialize my objects as documents in a separate repository.
One advantage of Ehcache is that it is backed by a company (Terracotta) that does extensive performance, failover, and platform testing in a large performance lab. Terracotta provides support, indemnity, etc. For many companies, that sort of thing is important.
I have not used Hazelcast but I've heard that it is easy to use and that it works. I haven't heard anything with respect to scalability or performance of Hazelcast vs Terracotta/Ehcache but given the amount of scalability and failover testing that Terracotta does, it's hard for me to imagine that Hazelcast would be competitive in a production deployment. But I presume it would work fine for smaller uses.
[Bias: I'm a former employee of Terracotta.]
Developers describe Ehcache as "Java's Most Widely-Used Cache". Ehcache is an open-source, standards-based cache for boosting performance, offloading your database, and simplifying scalability. It's the most widely-used Java-based cache because it's robust, proven, and full-featured. Ehcache scales from in-process, with one or more nodes, all the way to mixed in-process/out-of-process configurations with terabyte-sized caches. On the other hand, Hazelcast is detailed as "Clustering and highly scalable data distribution platform for Java". With its various distributed data structures, distributed caching capabilities, elastic nature, memcache support, integration with Spring and Hibernate and more importantly with so many happy users, Hazelcast is feature-rich, enterprise-ready and developer-friendly in-memory data grid solution.
Ehcache and Hazelcast are primarily classified as "Cache" and "In-Memory Databases" tools respectively.

Resources