Maintain consistency between multiple datastores - events

I'm writing real time application that needs fast access to some resources. I'm using a relational database and redis. I use the relational database for safe storing of the resources and redis for fast access of those same resources.
The problem I'm facing is that the code for maintaining consistency between these two stores becomes very complicated. I need to check if writes are being done to both and if one of those fails, undo the one that worked.
I thought of using something like kafka where the write would be sent to a specific topic and having different consumers (SQLConsumer and RedisConsumer) that would write to the respective databases. This way, the consumers would retry indefinitely and achieve eventual consistency. Ideally, the message wouldn't be committed until the write was successful.
Is this a common/correct approach? Is there other way in which I could improve my architecture?

Related

Evaluation of ehcache in web application

Is it a good practice to store your data in ehcache to improve the performance of a web application when lots of update operation on data regularly?
It all depends on how much reads you have over writes. Your updates will be costlier. So the time gain by reading should offset that.
Ehcache handles concurrent access. However, it is atomic, not transactional. So if you are getting multiple values from different caches, you can get updates in-between. But that's the same for a database. Also, you can use XA to make sure your writes are in sync with the database.

Oracle Materialized View for sensory data transfer

In an application we have to send sensory data stream from multiple clients to a central server over internet. One obvious solution is to use MOMs (Message Oriented Middlewares) such as Kafka, but I recently learned that we can do this with data base synchronization tools such as oracle Materialized View.
The later approach works in some application (sending data from a central server to multiple clients, inverse directin of our application), but what is the pros and cons of it in our application? Which one is better for sending sensory data stream from multiple (~100) clients to server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail consider an application in which many (about 100) clients have to send streaming data (1MB data per minute) to a central server over internet. The data are needed in server for the sake of online monitoring, analysis and some computation such as machine learning and data mining tasks.
My question is about the difference between db-to-db connection and streaming solutions such as kafka for trasfering data from clients to server.
Prologue
I'm going to try and break your question down into in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been partially stopped by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over your question before posting a complex question may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients, at that amount we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database etc. This is description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'll have to make I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Consequences of changes
Now we've decided that you have a single data pipeline your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from our raw data persistence. This means that the choices are now yours but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive and you're pushing 140GB/day into yours (that's 50TB/year by the way, quite a bit). I don't know if you're actually storing all the raw data but at those volumes it's less likely that you are - you're only storing the aggregations
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't then consider getting some to prevent madness supporting everything
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database.
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
One final point about this; we've defined a data pipeline. A single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients, no need to do anything. Want to change the technology behind part of the system, as long as the interface remains the same there's no issue. Want to send data elsewhere, no problem, it's all in the raw data persistence layer.

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, e.g. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to a) load initial data, and b) get changes that happened since the initial load. There is a time window where changes can be lost: between initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people to do lower network load in that case is they put a proxy node on the same machine as their app server, and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and because it takes a lot of CPU/memory load off of their main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.

storing data in secondary database

Our application (java,spring, hibernate) uses postgress to store data.
We are looking to add an analysis engine to the application. I want to explore using a nosql db to run the analysis on. This is an attempt at learning the nosql a bit also to free the main application activity from performance penalty (as much as possible).
So, I want the data changes to also synch to the nosql db (in addition to postgres). Any synch mechanism will affect the performance of the main data/transaction activity.
Is it a good idea to push the data changes to a message bus and free the main transaction as early as possible ? Can anyone point me to frameworks/technologies/ideas that address this issue of same data going to two different data stores.
The simplest solution would be sending data to a Postgres read replica and running your analytics queries on that. The performance impact is minimal and this would save a lot of time compared to alternative approaches.
Unless you really know what you are doing, I would avoid NoSQL for this kind of application. If your dataset is too big for a Postgres read replica, you might want to use Redshift, which is a columnar datastore that is optimized for types of analytics queries typically performed.

Should I trust Redis for data integrity?

In my current project, I have PostgreSQL as my master DB, and Redis as kind of a slave, e.g., when some user adds another as a friend, first the relationship will be stored in PostgreSQL and then a friend list in Redis will be updated. When some user's friend list is requested, it will be pulled out of Redis instead of PostgreSQL.
The question is: when I update the friend list in Redis, should I get a fresh copy outof PostgreSQL, and replace the old list in Redis with the new one or should I keep the old list and simply SADD the userid into the list? The latter is of course best for performance, but intuitively the former does a better job in keep the data integrity? And if something like Celery is used, is the second method worth the risk?
This has nothing to do with Redis. When you are writing to two databases, a lot of things can go wrong even if both of them individually guarantee data integrity.
For the sake of discussion, replace Redis with MySQL in your question, and ask yourself - will data integrity be compromised?
You may have written to Postgres and then your process can die without writing to MySQL. Or perhaps there is a network outage. Or perhaps MySQL is down. In all these cases, Postgres and MySQL would start to differ.
It does not matter whether you replace the entire record or simply add one row. Both can lead to data corruption.
If you are concerned with data integrity, keep data in a single authoritative system. Otherwise, you would need a two phase commit protocol
You should evaluate how important consistency is to your application and take things from there. It doesn't sound like anyone's cry if you loose a commit. You could have a background process that reads data from PostgreSQL and pushes it back into Redis, eventually cleaning up any inconsistencies. Alternatively, you could look at read slave PostgreSQL instances replicating from the write master. This would get you better read scalability using well tested synchronization technology.

Resources