Postgres hstore vs. Redis, performance-wise

I read about hstore in Postgres, which offers functionality similar to Redis.
Our application is written in NodeJS. Two questions:
Performance-wise, is Postgres hstore comparable to Redis?
For session storage, what would you recommend: Redis, or Postgres with some other data type (hstore, or maybe even a plain relational table)? And how much worse is one option than the other?
Another constraint is that we will need to combine the data already in PostgreSQL with the active sessions (which we aren't sure where to store at this point, whether in Redis or PostgreSQL).
From what we have read, we have been pointed toward Redis as a session manager, but because of the PostgreSQL constraint we are not sure how to combine the two, or what performance issues might arise.
Thanks!

Redis will generally be faster than Postgres, because Postgres offers durability guarantees on your data (when a transaction is committed, it is guaranteed to be on disk), whereas Redis persists to disk asynchronously by default, so it can lose recent writes on a crash and shouldn't be the only store for critical data.
Redis seems like a good option for your session data, or heck, even store it in a cookie or in your client-side JavaScript. But if you need data from your database on every request anyway, it might not even be worth involving Redis. It very much depends on your application.
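If you do go with Redis for sessions, here is a minimal hand-rolled sketch using the node-redis v4 client (the key prefix and TTL are illustrative assumptions; in practice a ready-made store such as connect-redis with express-session is more common):

    // Minimal Redis-backed session helpers (node-redis v4, ESM with top-level await).
    import { createClient } from "redis";
    import { randomUUID } from "crypto";

    const redis = createClient({ url: "redis://localhost:6379" });
    await redis.connect();

    const SESSION_TTL_SECONDS = 60 * 30; // 30-minute sliding window (illustrative)

    interface SessionData {
      userId: string;
      [key: string]: unknown;
    }

    // Create a session: a single SET with an expiry; Redis evicts it for us.
    async function createSession(data: SessionData): Promise<string> {
      const sid = randomUUID();
      await redis.set(`sess:${sid}`, JSON.stringify(data), { EX: SESSION_TTL_SECONDS });
      return sid;
    }

    // Read a session and refresh its TTL (sliding expiration).
    async function getSession(sid: string): Promise<SessionData | null> {
      const raw = await redis.get(`sess:${sid}`);
      if (raw === null) return null;
      await redis.expire(`sess:${sid}`, SESSION_TTL_SECONDS);
      return JSON.parse(raw) as SessionData;
    }

Note the EX option: Redis expires idle sessions on its own, so no cleanup job is needed.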

Using PostgreSQL as a session manager is usually a bad idea.
Before 9.1 there was a physical limit on transactions per second, determined by the persistent media. For session management you usually don't need MVCC (sessions don't collide with each other), so MVCC is pure overhead, and databases without MVCC and full ACID guarantees can be significantly faster (10x or 100x).
I know of a case where PostgreSQL was used for session management and performance was really terrible and unstable: an e-shop with about 10,000 live sessions. When session management was moved to memcached, performance and stability improved significantly. PostgreSQL can probably handle 100 live sessions without problems; for higher numbers there are better tools.
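If the sessions must live in Postgres anyway, one way to cut the per-commit disk overhead described above is an UNLOGGED table (the feature 9.1 introduced): it skips write-ahead logging, so writes become much cheaper, at the price of the table being emptied after a crash, which is often acceptable for sessions. A sketch using the node-postgres (pg) client; the table layout is an illustrative assumption:

    // Session table without write-ahead logging: cheap commits, but the
    // table is emptied if Postgres crashes (usually fine for sessions).
    import { Pool } from "pg";

    const pool = new Pool({ connectionString: "postgres://localhost/app" });

    await pool.query(`
      CREATE UNLOGGED TABLE IF NOT EXISTS sessions (
        sid        text PRIMARY KEY,
        data       jsonb NOT NULL,
        expires_at timestamptz NOT NULL
      )`);

    // Upsert one session row; expired rows can be swept by a periodic DELETE.
    async function saveSession(sid: string, data: object): Promise<void> {
      await pool.query(
        `INSERT INTO sessions (sid, data, expires_at)
         VALUES ($1, $2, now() + interval '30 minutes')
         ON CONFLICT (sid) DO UPDATE
           SET data = EXCLUDED.data, expires_at = EXCLUDED.expires_at`,
        [sid, JSON.stringify(data)]
      );
    }

A softer variant is SET synchronous_commit = off for the session-writing connection: commits stop waiting for the WAL flush, and a crash can only lose the last few transactions rather than the whole table.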

Related

Cache only specific tables in Spring Boot

I have a table with millions of rows (98% reads, maybe 1-2% writes) which has references to a couple of config tables (with maybe 20 entries each). What are the best practices for caching the tables in this case? I cannot cache the table with millions of rows, but at the same time I also don't want to hit the DB for the config tables. Is there a workaround for this? I'm using Spring Boot, and the data is in Postgres.
Thanks.
First of all, let me refer to this:
What are the best practices for caching the tables in this case?
I don't think you should "cache tables" as you say. In the application you work with the data, and that is what should be cached. This means the object you cache should already be in a structure that includes these relations. Of course, in order to fetch the whole object from the database you can use JOINs, but once the object is cached it no longer matters: the translation from the relational model to the object model has already been done.
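To make that concrete, here is a minimal cache-aside sketch (written in TypeScript for brevity, since the pattern is language-agnostic; in Spring, @Cacheable on a service method automates the same idea). The cached value is the assembled object with its config relations already resolved; all names and the TTL are illustrative:

    // Cache-aside over assembled domain objects: the JOIN against the small
    // config tables runs once per miss instead of on every read.
    interface Item {
      id: number;
      name: string;
      config: { code: string; label: string }; // relation already resolved
    }

    const cache = new Map<number, { value: Item; expiresAt: number }>();
    const TTL_MS = 60_000; // bounds how long a stale entry may live

    async function getItem(
      id: number,
      loadFromDb: (id: number) => Promise<Item> // e.g. one SELECT ... JOIN
    ): Promise<Item> {
      const hit = cache.get(id);
      if (hit && hit.expiresAt > Date.now()) return hit.value; // hit: no DB
      const value = await loadFromDb(id); // miss: assemble the object once
      cache.set(id, { value, expiresAt: Date.now() + TTL_MS });
      return value;
    }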
Now, the question is too broad, because the actual answer can vary with the technologies you use, the nature of the data, and so forth.
You should answer the following questions before you design the cache (the list is off the top of my head, but hopefully you'll get the idea):
What is the cache invalidation strategy? You say there are ~2% writes; when the data gets updated, the data in the cache may become stale. Is that acceptable?
A generalization of the previous question: if you have multiple instances (JVMs) of the same application and one of them triggers an update to the DB data, what should happen to the other instances' caches?
How long may stale/invalid data reside in the cache?
Do your application's use cases access all the data in the tables with the same frequency, or is some data more "interesting" (for example, the oldest data is never read, but the latest data is always "hot")? If there are millions of rows, the JVM probably can't hold all these objects in the heap at once, so there should be some "slice" of this data...
What are the performance implications of having the cache? How does it affect GC behavior?
Which technologies can be used in your case? (Maybe due to regulations/licensing some technologies are simply not available; this tends to be the case in large organizations.)
Based on these observations you can go with:
In-memory cache:
Spring integrates with various in-memory cache technologies; you can also use them without Spring at all. To name a few:
Google Guava cache (for older Spring cache implementations)
Caffeine (for newer Spring cache implementations)
An in-memory map of keys/values
In memory, but in another process:
Redis
Infinispan
Now, these caches are slower than those listed in the previous category (each lookup is typically a network round trip), but they can still be significantly faster than the DB.
Data grids:
Hazelcast
Off-heap memory-based caches (the data is stored outside the JVM heap, so it is not subject to garbage collection)
Postgres-related solutions. For example, you can still go to the DB, but size shared_buffers so that the indexes stay cached in memory, which makes the queries significantly faster.
Some ORM-specific caches (Hibernate, for example, has its own second-level cache).
Some mix of all of the above.
Implementing your own solution: this is probably something you shouldn't do as a first attempt to address the issue, because caching can be tricky.
Finally, let me point to a very interesting session on caching given by Michael Plod. I believe it will help you find the solution that works best for you.

Difference between In-Memory cache and In-Memory Database

I was wondering if I could get an explanation of the differences between in-memory caches (Redis, Memcached), in-memory data grids (GemFire), and in-memory databases (VoltDB). I'm having a hard time distinguishing the key characteristics of the three.
Cache - by definition, this means it is stored in memory. Any data stored in memory (RAM) for faster access is called a cache. Examples: Ehcache, Memcached. Typically you put an object in the cache with a String key and access it by that key. It is very straightforward. It is up to the application when to access the cache vs. the database, and no complex processing happens in the cache. If the cache spans multiple machines, it is called a distributed cache. For example, Netflix uses EVCache, which is built on top of Memcached, to store the movie recommendations you see on the home screen.
In-memory database - it has all the features of a cache plus some processing/querying capabilities. Redis falls under this category. Redis supports multiple data structures, and you can query the data in Redis (for example, get the last 10 accessed items, or the most-used item). It can span multiple machines, is usually very performant, and also supports persistence to disk if needed. For example, Twitter uses Redis to store timeline information.
I don't know about GemFire and VoltDB, but even Memcached and Redis are very different. Memcached is really simple caching: a place to store variables in a very simple fashion and retrieve them later, so you don't have to do a file or database lookup every time you need that data. The variable types are very simple. Redis, on the other hand, is actually an in-memory database with a very interesting selection of data types. It has a wonderful data type for sorted lists, which works great for applications such as leaderboards. You add your new record to the data, and it gets sorted automagically.
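To illustrate the sorted-set type, a minimal leaderboard sketch with the node-redis v4 client (the key and member names are made up):

    // Leaderboard on a Redis sorted set: ZADD keeps members ordered by
    // score, so "top N" is a single cheap range read.
    import { createClient } from "redis";

    const redis = createClient();
    await redis.connect();

    await redis.zAdd("leaderboard", [
      { score: 420, value: "alice" },
      { score: 310, value: "bob" },
      { score: 550, value: "carol" },
    ]);

    // Top 10 players, highest score first.
    const top = await redis.zRangeWithScores("leaderboard", 0, 9, { REV: true });
    console.log(top); // [{ value: "carol", score: 550 }, ...]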
So I wouldn't get too hung up on the categories. You really need to examine each tool on its own to see what it can do for you and the application you're building. It's kind of like trying to draw comparisons between NoSQL databases: they are all very different and do different things well.
I would add that things in the "database" category tend to have more features to protect and replicate your data than a simple "cache". A cache is (usually) temporary, whereas database data should be persistent. Many cache solutions I've seen do not persist to disk, so if you lost power to your whole cluster, you'd lose everything in the cache.
But there are some cache solutions that have persistence and replication features too, so the line is blurry.
An in-memory cache is a common query store and therefore relieves the DB of read workloads. A common example of an in-memory cache is Redis. One example use: a web site storing the popular searches made by clients, thereby relieving the DB of some load.
An in-memory database additionally provides query functionality on top of caching (caching being, e.g., storing session data in RAM as temporary storage).
Memcached falls into the plain temp-store caching category.

Storing data in a secondary database

Our application (Java, Spring, Hibernate) uses Postgres to store data.
We are looking to add an analysis engine to the application, and I want to explore using a NoSQL DB to run the analysis on. This is partly an attempt to learn NoSQL, and partly to free the main application activity from the performance penalty (as much as possible).
So I want the data changes to also sync to the NoSQL DB (in addition to Postgres). Any sync mechanism will affect the performance of the main data/transaction activity.
Is it a good idea to push the data changes to a message bus and free the main transaction as early as possible? Can anyone point me to frameworks/technologies/ideas that address this issue of the same data going to two different data stores?
The simplest solution would be to set up a Postgres read replica and run your analytics queries on that. The performance impact on the primary is minimal, and this would save a lot of time compared to the alternative approaches.
Unless you really know what you are doing, I would avoid NoSQL for this kind of application. If your dataset is too big for a Postgres read replica, you might want to use Redshift, a columnar datastore that is optimized for the types of queries analytics workloads typically perform.
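From the application side this amounts to a second connection pool, assuming a replica is already streaming from the primary (the connection strings and schema below are placeholders):

    // Route transactional writes to the primary and analytics reads to a
    // streaming replica; hostnames and table layout are placeholders.
    import { Pool } from "pg";

    const primary = new Pool({ connectionString: "postgres://primary-host/app" });
    const replica = new Pool({ connectionString: "postgres://replica-host/app" });

    // OLTP path: unchanged, still hits the primary.
    async function recordOrder(userId: number, total: number): Promise<void> {
      await primary.query(
        "INSERT INTO orders (user_id, total) VALUES ($1, $2)",
        [userId, total]
      );
    }

    // Analytics path: the heavy aggregate runs on the replica.
    async function revenueByDay(): Promise<unknown[]> {
      const res = await replica.query(
        `SELECT date_trunc('day', created_at) AS day, sum(total) AS revenue
         FROM orders GROUP BY 1 ORDER BY 1`
      );
      return res.rows;
    }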

Should I trust Redis for data integrity?

In my current project I have PostgreSQL as my master DB and Redis as a kind of slave: e.g., when some user adds another as a friend, the relationship is first stored in PostgreSQL and then a friend list in Redis is updated. When some user's friend list is requested, it is pulled out of Redis instead of PostgreSQL.
The question is: when I update the friend list in Redis, should I get a fresh copy out of PostgreSQL and replace the old list in Redis with the new one, or should I keep the old list and simply SADD the user id into it? The latter is of course best for performance, but intuitively the former does a better job of keeping the data consistent. And if something like Celery is used, is the second method worth the risk?
This has nothing to do with Redis. When you are writing to two databases, a lot of things can go wrong even if both of them individually guarantee data integrity.
For the sake of discussion, replace Redis with MySQL in your question, and ask yourself - will data integrity be compromised?
You may have written to Postgres and then your process can die without writing to MySQL. Or perhaps there is a network outage. Or perhaps MySQL is down. In all these cases, Postgres and MySQL would start to differ.
It does not matter whether you replace the entire record or simply add one row; both can lead to data corruption.
If you are concerned with data integrity, keep the data in a single authoritative system. Otherwise, you would need a two-phase commit protocol.
You should evaluate how important consistency is to your application and take things from there. It doesn't sound like anyone will cry if you lose a commit. You could have a background process that reads data from PostgreSQL and pushes it back into Redis, eventually cleaning up any inconsistencies. Alternatively, you could look at read-slave PostgreSQL instances replicating from the write master. This would get you better read scalability using well-tested synchronization technology.
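Such a repair job can be small. A sketch treating Postgres as the source of truth and rebuilding each Redis friend set atomically (table, column, and key names are assumptions):

    // Periodic reconciliation: Postgres is authoritative; each Redis friend
    // set is rebuilt atomically (DEL + SADD inside one MULTI/EXEC).
    import { Pool } from "pg";
    import { createClient } from "redis";

    const pg = new Pool({ connectionString: "postgres://localhost/app" });
    const redis = createClient();
    await redis.connect();

    async function reconcileFriends(userId: number): Promise<void> {
      const { rows } = await pg.query<{ friend_id: number }>(
        "SELECT friend_id FROM friendships WHERE user_id = $1",
        [userId]
      );
      const members = rows.map((r) => String(r.friend_id));
      const key = `friends:${userId}`;
      const tx = redis.multi().del(key);
      if (members.length > 0) tx.sAdd(key, members);
      await tx.exec(); // readers never observe a half-built set
    }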

Database with low memory requirements and Ruby interface

I need a database with low memory requirements for a small virtual server with little memory. At the moment I'm stuck choosing between SQLite and Kyoto Cabinet or Tokyo Cabinet. The database should have a Ruby interface.
Ideally I want to avoid key-value stores, because I have "complex" queries (more complex than looking up a single key) and tuples as keys. On the other hand, I don't want a fixed schema, and I'd like to avoid the planning and migration effort of a SQL database. A database server is also not necessary, because only a single application will use the database.
Do you have any recommendations and numbers for me?
There is schema-less PostgreSQL (PostgreSQL 9.2 + json). It's not as hard or confusing to set up as I thought. You get lots of flexibility in queries while still getting the benefits of schema-less storage. PG 9.2 also includes plv8js, a language handler that lets you write functions in JavaScript. Here is one example of how you can index and query JSON docs in PG 9.2: http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html
CouchDB (use BigCouch: based on CouchDB, but with fewer bugs/problems):
Very low memory requirements.
Schema-less.
HTTP-based interface. Ruby has plenty of HTTP clients. HTTP caching (like Varnish) can also speed up reads.
Creative/complex queries. You can create indexes and queries on any key in the document (record), and since the indexes are very programmable you can get very creative with queries.
Downsides:
The learning curve of setting up your queries/indexes.
You have to schedule a type of cleanup operation called "compaction".
Data will take up more space compared to other databases.
More: http://www.paperplanes.de/2010/7/26/10_annoying_things_about_couchdb.html
If disk is cheap and memory expensive, CouchDB is a good candidate for your needs.
"...another strength of CouchDB, which has proven to serve thousands of concurrent requests only needed about 10MB of RAM - how awesome is that?!?!" (From: http://www.larsgeorge.com/2009/03/hbase-vs-couchdb-in-berlin.html )
SQLite3 is a great fit for what you are trying to do. It's used by a lot of companies as their embedded app database because it's flexible, fast, well tested, and has a small footprint. It's easy to create and blow away tables, so it plays well with testing and single-application data stores.
The SQL language it supports is rich enough to do normal things, but I'd recommend using Sequel with it. It's a great ORM that easily lets you treat it as a full-blown ORM, or drop all the way down to talking raw SQL to the DBM.
You are probably looking for a solution that has only a database file and no running server. In that case SQLite should be a good choice: if you don't need it, just close the connection and that's it. SQLite has everything you need from an RDBMS (except enforcing foreign keys directly, but that can be done with triggers), with a very small memory footprint, so you are probably more worried about the memory your ORM (if any) uses.
Personally, I use SQLite for that use case as well, as it is portable and easy to access and install (which shouldn't be a problem on a server anyway, but in a desktop application it can be).
BerkeleyDB with SQLite API is what you need.
http://www.oracle.com/technetwork/database/berkeleydb/overview/sql-160887.html

Resources