Should I trust Redis for data integrity? - performance

In my current project, I have PostgreSQL as my master DB, and Redis as kind of a slave, e.g., when some user adds another as a friend, first the relationship will be stored in PostgreSQL and then a friend list in Redis will be updated. When some user's friend list is requested, it will be pulled out of Redis instead of PostgreSQL.
The question is: when I update the friend list in Redis, should I get a fresh copy outof PostgreSQL, and replace the old list in Redis with the new one or should I keep the old list and simply SADD the userid into the list? The latter is of course best for performance, but intuitively the former does a better job in keep the data integrity? And if something like Celery is used, is the second method worth the risk?

This has nothing to do with Redis. When you are writing to two databases, a lot of things can go wrong even if both of them individually guarantee data integrity.
For the sake of discussion, replace Redis with MySQL in your question, and ask yourself - will data integrity be compromised?
You may have written to Postgres and then your process can die without writing to MySQL. Or perhaps there is a network outage. Or perhaps MySQL is down. In all these cases, Postgres and MySQL would start to differ.
It does not matter whether you replace the entire record or simply add one row. Both can lead to data corruption.
If you are concerned with data integrity, keep data in a single authoritative system. Otherwise, you would need a two phase commit protocol

You should evaluate how important consistency is to your application and take things from there. It doesn't sound like anyone's cry if you loose a commit. You could have a background process that reads data from PostgreSQL and pushes it back into Redis, eventually cleaning up any inconsistencies. Alternatively, you could look at read slave PostgreSQL instances replicating from the write master. This would get you better read scalability using well tested synchronization technology.

Related

Cache and update regularly complex data

Lets star with background. I have an api endpoint that I have to query every 15 minutes and that returns complex data. Unfortunately this endpoint does not provide information of what exactly changed. So it requires me to compare the data that I have in db and compare everything and than execute update, add or delete. This is pretty boring...
I came to and idea that I can simply remove all data from certain tables and build everything from scratch... But it I have to also return this cached data to my clients. So there might be a situation that the db will be empty during some request from my client because it will be "refreshing/rebulding". And that cant happen because I have to return something
So I cam to and idea to
Lock the certain db tables so that the client will have to wait for the "refreshing the db"
or
CQRS https://martinfowler.com/bliki/CQRS.html
Do you have any suggestions how to solve the problem?
It sounds like you're using a relational database, so I'll try to outline a solution using database terms. The idea, however, is more general than that. In general, it's similar to Blue-Green deployment.
Have two data tables (or two databases, for that matter); one is active, and one is inactive.
When the software starts the update process, it can wipe the inactive table and write new data into it. During this process, the system keeps serving data from the active table.
Once the data update is entirely done, the system can begin to serve data from the previously inactive table. In other words, the inactive table becomes the active table, and vice versa.

Maintain consistency between multiple datastores

I'm writing real time application that needs fast access to some resources. I'm using a relational database and redis. I use the relational database for safe storing of the resources and redis for fast access of those same resources.
The problem I'm facing is that the code for maintaining consistency between these two stores becomes very complicated. I need to check if writes are being done to both and if one of those fails, undo the one that worked.
I thought of using something like kafka where the write would be sent to a specific topic and having different consumers (SQLConsumer and RedisConsumer) that would write to the respective databases. This way, the consumers would retry indefinitely and achieve eventual consistency. Ideally, the message wouldn't be committed until the write was successful.
Is this a common/correct approach? Is there other way in which I could improve my architecture?

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, e.g. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to a) load initial data, and b) get changes that happened since the initial load. There is a time window where changes can be lost: between initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people to do lower network load in that case is they put a proxy node on the same machine as their app server, and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and because it takes a lot of CPU/memory load off of their main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.

Postgres Hstore vs. Redis - performance wise

I read about HStores in Postgres something that is offered by Redis as well.
Our application is written in NodeJS. Two questions:
Performance-wise, is Postgres HStore comparable to Redis?
for session storage, what would you recommend--Redis, or Postgres with some other kind of data type (like HStore, or maybe even the usual relational table)? And how bad is one option vs the other?
Another constraint, is that we will need to use the data that is already in PostgreSQL and combine it with the active sessions (which we aren't sure where to store at this point, if in Redis or PostgreSQL).
From what we have read, we have been pointed out to use Redis as a Session manager, but due to the PostgreSQL constraint, we are not sure how to combine both and the possible performance issues that may arise.
Thanks!
Redis will be faster than Postgres because Pg offers reliability guarantees on your data (when the transaction is committed, it is guaranteed to be on disk), whereas Redis has a concept of writing to disk when it feels like it, so shouldn't be used for critical data.
Redis seems like a good option for your session data, or heck even store in a cookie or in your client side Javascript. But if you need data from your database on every request then it might not be even worth involving Redis. It very much depends on your application.
Using PostgreSQL as session manager is usually bad idea.
For older than 9.1 was physical limit of transaction per second based on persistent media parameters. For session management you usually don't need MGA (because there are not collision) and it means so MGA is overhead and databases without MGA and ACID must be significantly faster (10 or 100).
I know a use case, where PostgreSQL was used for session management and Performance was really terrible and unstable - it was eshop with about 10000 living sessions. When session management was moved to memcached, then performance and stability was significantly increased. PostgreSQL can be used for 100 living session without problem probably. For higher numbers there are better tools.

When do we really need a key/value database instead of a key/value cache server?

Most of the time,we just get the result from database,and then save it in cache server,with an expiration time.
When do we need to persistent that key/value pair,what's the significant benifit to do so?
If you need to persist the data, then you would want a key/value database. In particular, as part of the NoSQL movement, many people have suggested replacing traditional SQL databases with Key/Value pair databases - but ultimately, the choice remains with you which paradigm is a better fit for your application.
Use a key/value database when you are using a key/value cache and you don't need a sql database.
When you use memcached/mysql or similar, you need to write two sets of data access code - one for getting objects from the cache, and another from the database. If the cache is your database, you only need the one method, and it is usually simpler code.
You do lose some functionality by not using SQL, but in a lot of cases you don't need it. Only the worst applications actually leave constraint checking to the database. Ad-hoc queries become impractical at scale. The occasional lost or inconsistent record simply doesn't matter if you are working with tweets rather than financial data. How do you justify the added complexity of using a SQL database?

Resources