How does immutable data make eventual consistency trivial? - immutability

I have been reading Nathan Marz' article about how to beat the CAP theorem with the Lambda Architecture and don't understand how immutable data will make eventual consistency less complex.
The following paragraph is taken from the article:
The key is that data is immutable. Immutable data means there's no such thing as an update, so it's impossible for different replicas of a piece of data to become inconsistent. This means there are no divergent values, vector clocks, or read-repair. From the perspective of queries, a piece of data either exists or doesn't exist. There is just data and functions on that data. There's nothing you need to do to enforce eventual consistency, and eventual consistency does not get in the way of reasoning about the system.
Imagine the following example: I have a distributed insert-only database with two nodes A and B and both hold the record [timestamp=1; id=1; value=10]. Then at the same time, there is an insert against node A which results in [timestamp=2; id=1; value=20] and a read against node B for record with id=1.
How is solving the problem of eventual consistency less complex with that example than for databases with update possibility?

I'm not 100% I got it right, but I'll try to explain anyway.
Consider an example - you have 2 databases accepting writes/reads, connected with a network link. The link goes down, resulting in a network partition. We want our system to be CAP available, so we accept writes/reads in both databases.
When working with mutable data structures: suppose a client, connected to the 1st database, wants to update value for record X to A and another client, connected to the 2nd database, wants to update that value to B. Since our system is available, we accept both writes in both databases, but we will have to resolve the conflict once the network parittion is gone. This will result in one of the update being lost.
With immutable data structures, you wouldn't update the data but insert, so both writes would be there after the network paritition is gone. You'd still need some kind of time synchronization though in order to preserve the operation order which can be very tricky (see the comment in the article from Sebastien Diot).

Related

How to approach event sourcing with millions of records

We're looking at implementing event sourcing / CQRS and for 95% of our system I can reason about the events and it doesn't scare me.
On the other hand, we also have a requirement where customers can insert data for millions of records in one go. A large portion of them can be updated in one go as they move location etc or have batch level details updated. It also needs to be reversed if they change their mind moments after.
Each record relates to a physical entity in the real world and it's important that the read model is updated quickly and the audit trail preserved at all costs for each record.
I can't seem to find any advice on how to handle these volumes. Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive? Are there any case studies that have similar requirements?
Any guidance is appreciated.
Are you supposed to write an event for every single record and action and just accept that it's going to be computationally / Database expensive?
A potentially helpful heuristic -- how would you do it with a version control system? Would you create an empty document, and then introduce a million commits, or would you have a single Data Imported commit, and go from there?
An important consideration to notice is that the authority for the data is somewhere else. "Physical entities in the real world" are not subject to the rules of your domain model; you you have there is a big pile of reference data.
It can help to think in processes -- what you have is an import reference data process, that has a relatively small number of immediate steps, and independently some "do interesting things with each record" which may turn out to be millions of little processes with some small number of events.

How to inform remote users a cached data is out of date?

Our company has built some new webservices. The services provide some large data so it is best to save the data in cache for performance issues. What if new or update data will provide in our webservices, how can we inform our users? What is the best way to do this?
First thing to do is to include the expiration / valid till date along with the data response.
Second thing to do is to make a separate web-service method to check if the data has been modified after the given date.
You basically have trade-off between caching, and making sure data is valid - and storing the entire data on your webservice. Finding the right solution is an engineering issue that really depends on your specific case, but here are some pointers and possible approaches:
Each entry in the cache must have expiry data, that will be wiped after the time has passed, it will make sure you don't store old data, and your cache is not full of unneccessary information.
You can send a message to all your users once some entry is invalidated, that they should take this data out of their cache. This requires your clients to listen to you, and becomes inefficient if data changes often.
You can store a hash value of each element, and before using the actual value - check the hash data is correct. This usually requires much less data transfer than checking the actual value, but you can have false negative - you think a value is not changed, while in fact it was.
In some cases (especially peer-to-peer, but not exclusively) it is wise to use Merkele Trees. The idea if Merkele trees is that each leaf holds data, and its hash value, and each internal node is a hash of its two sons.
The idea is you can find out very quickly if no change was made to the cache by checking the value of the root, and finding what was changed is done in O(logN).
The downside is this DS is probabilistic, and there is a small yet possible chance that a value was changed and you won't detect it.
This approach is an efficient generalization of (3)
Ultimately, there is no silver bullet, and the chosen method should fit your specific case, and depends on a lot of factors, some are:
Size of entry in cache
Rate of changes of cache
Web server availability
Availability to maintain connection with clients
Is a probabalistic approach enough?
etc.

what is more important speed(time) or space

I had this question when I was given a task which made me think for a while. But not able to come up with more accurate or satisfying answer.
The task was something like this,
Things I already have
a table "User" which contain details about the user. Like createdTime, type(agent, admin, other), id, etc.,(This table contains too many rows(entries) in it)
The task given to me is
create a new table which will keep track of user which are deleted.
Then join the "User" table with this newly created table and show the user which are not already deleted and are of type='agent'.
Now my question is
"why are they asking me to create a new table instead of creating a new column(in User table) which will store a flag(true If user is deleted, else not deleted)
"Is join not time consuming?"
(creating a new column in the current "User" table will help in keeping the detail intact with the User.But creating a new table - where this will help?)
When I asked this question to my team member, he replied "If you create a new column, the value can be empty for a lot of rows, and space wastage" (what he said is right too.)
This states that he is caring about the space more and less on query time.
Nowadays we can get any amount of space with money, But speed is more important right. If the speed is less, then what is the use of anything which saves our disk space.
Shouldn't we care about time more and space less?.
I would like to know what you think about this. What is more important to you and why?.
I know this question can be down-voted But I wanted to know what most developers think in such case. What they care about more is it space or time?
Thanks for your time.
This will be an answer even though it's personally speaking because it's too big for a comment:
The two are rarely separate from each other but a function of each other.
Speed is also not just 'query time' - but basically execution time including data handling. From request to reply.
So if you have something that hogs space, you have more I/O and more memory usage and need to spend more time accessing data and more time transmitting data.
So my "answer" would be:
Treat both as equal and minimize both.
This states that he is caring about the space more and less on query time. :
No. That's a wrong assumption.
If you start thinking normalization and de-normalization in the database when thinking performance, you are most always better off normalizing data.
This is not only to save space, but also to save maintenance of data/data integrity (faster update, less locking); Indexing (space, yes - but also speed) and then when using the data I/O transfer from disk to server, to memory usage, across networks. Also - space usage means memory usage and the more things take up space, the more you want to also put into memory.
All these things leads to performance, aka speed.
The times you start thinking of demoralization in the database, you usually do it in connection with pre-calculating results and queries and utilize caching so you don't have to do the joins on demand. So while there are some situations where de-normalization is a plausible solution - more often than not, you're still better off normalizing the data.

Efficient creation time descending lookup in Raik

I'm learning to use Raik, the NoSQL engine. Given that I have a user "timeline" with posts, and that post may range from millions to billions, how can I take the last N posts from the raik bucket? I mean, the last created.
I read that when using a Secondary Index Raik will return posts ordered by key. So I decided to use an UUID1 for post keys and to have a Secondary Index for the post author, so that I can take all posts from that author using it's key.
However the posts are sorted ASCENDING! I also want to use the max_results parameter as the SQL LIMIT.
This query however returns the FIRST N posts of that user, not the last. Given that I already saw some StackOverflow posts, and that the proposed solution, MapReduce is not efficient for big buckets, how would you model data or write the query?
Thanks
When coming from a SQL environment it is easy to treat a bucket as a table and store small individual records there, often relying on secondary indexes to get the data out. As Riak is a key-value store that uses consistent hashing, this is however often not the most efficient or scalable approach.
A lookup based on key in Riak allows the partitions holding the data to be directly identified, and the coordinating node can directly query these partitions. When querying a secondary index, Riak does not know on which partitions data that may match the index will reside. It will therefore need to send the query to a large number of partitions in order to ensure that all matching objects can be found. This is known as a 'coverage query' and means that, assuming n_val of 3 is used for the bucket, at least 1/3 of all partitions need to be queried. This generally leads to higher load on the cluster and does not scale as well as direct key lookups. Latencies also tend to be higher.
When using Riak it is therefore often recommended that you structure your data so that you can use direct key lookups as much as possible, e.g. through de-normalization.
If your messages/posts can be grouped some way, e.g. by user or conversation, it may make sense to store them in a single object representing this grouping instead of as separate objects.
If we assume that your posts can consist of either text or images and are linked to a conversation thread, you could create an object representing the conversation thread. This would contain information about the conversation as well as a list of posts. This list of posts can e.g. contain the id of the poster, a timestamp and the key of the record containing the post. If the post is a reasonably short text message it may even contain the entire post, reducing the number of records that will need to be fetched.
As posts come in to this conversation, the record is updated and the list of posts gets longer. It may be wise to set allow_mult to true in order to enable siblings, as this will allow you to handle concurrent writes. This approach allows you to always get the conversation as well as the latest posts through a single direct key lookup.
Riak works best when the size of objects are kept below a couple of MB. You will therefore need to move the oldest posts off to a separate object at some point to keep the size in check. If you keep a list of these related objects in the main conversation object, possibly together with some information about the time interval they cover, you can easily access these through direct key lookup as well if you should need to scroll back over older posts.
As the most common query usually is for the most recent entries, this can always be fulfilled through the main conversation object.
I would also like to point out that we do have a very active mailing list where these kind of issues are discussed quite frequently.
I know it's probably too late to help you, but I found this post through wondering about the same thing. The workaround I have come up with and been using to good effect is to create two secondary indexes, one with the real timestamp, and another that is (MAX_DATE - timestamp). Performing lookups on the first query gets ascending results, and performing lookups on the second query gets descending results (once you do the math to turn it back into a real date). You can find the max date value in the Javascript specification, such as reported in MDN, which is 8640000000000000. I can't speak to how performant it is under really heavy load, but I can tell you that for my purposes it has been blazingly fast and I'm very satisfied. I just came here hoping to find a less hacky way to do it.

Asking for opinions : One sequence for all tables

Here's another one I've been thinking about lately.
We have concluded in earlier discussions : 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier I have seen that Hibernate default creates one sequence for all tables. At first I was puzzled by this, why would you do this. But later I saw the advantage that it makes linking parents and children fool proof. Because no tables have the same primary key value, accidentally linking a parent with a table that is not a child gives no results.
Does anyone see any downsides to this approach. I only see one : you cannot have more than 999999999999999999999999999 records in your database.
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only a few or only one thread request new values, there will be no locking issues. But a bad implementation could cause congestion.
Another problem is rolling back transactions: Sequences don't get rolled back (because someone else might have requested a higher value already), so you can have large gaps which will eat your number space much more quickly than you might expect. OTOH, it will take some time to eat 2 or 4 billion IDs (if you "only" use 32 bit (signed) ints), so it's rarely an issue in practice.
Lastly, you can't easily reset the sequence if you have to. But if you need to have a restarting sequence (say, number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling. If you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables when you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much slower.
I see little value in having only one sequence for all tables. The arguments told so far do not convince me.
There are a couple of disadvantages of using a single sequence:-
reduced concurrency. Handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem
Oracle has special code when maintaining btree indexes to detect monotonically increasing values and balance the tree approriately
The CBO might have a better time estimating range queries on the index (if you ever did this) if most values were filled in
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally I find the ability to assign a truly unique identifier to a row, making each id column a uuid, to be enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches IMPORT statements, this will not be a performance bottleneck unless massive amounts of records are vying for the same db resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails. Though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored

Resources