CockroachDB 2-phase commits with or without blocking? - cockroachdb

Two-phase commits are supposed to suffer from blocking problems. Is that the case with CockroachDB, and if not, how is it avoided?

Summary: 2-phase commits are blocking, so it is important to keep the thing being 2-phase committed as "small" as possible, so that the set of actions that end up blocked is minimal. CockroachDB does this using MVCC with write intents, 2-phase committing only on a single intent. Because CockroachDB provides serializable transactions, it can adjust transaction timestamps so that blocking occurs only where it is absolutely necessary.
Longer answer
2-phase commits are blocking after the first phase, while all participants wait for a reply from the coordinator as to whether the second phase is to commit or abort. During this period, a participant that has already sent a "Yes" vote cannot unilaterally revoke its vote, but also cannot treat the transaction as committed (the coordinator might come back with an abort). So it is forced to block all subsequent actions that need to know concretely what the state of this transaction is. The key word in that sentence is "need": it is on us to design the system so that this set of actions is reduced to the bare minimum. CockroachDB uses write intents and MVCC to minimize these dependencies.
Consider a naïve implementation of a distributed (multi-key) transactional key-value store: I wish to transactionally commit some write transaction t1. t1 spans many keys across many machines, but of particular concern is that it writes k1 = v2. k1 is on machine m1 (let's say k1=v1 was the previous value).
Since t1 spans many keys on many machines, all of them are involved in a 2-phase commit transaction. Once that 2-phase transaction is begun, we have to note that we have an intent to write k1=v2, and the status of the transaction is unknown (the transaction may abort, because one of the other writes cannot proceed).
Now if some other transaction t2 comes along which wants to read the value of k1, we simply cannot give that transaction an authoritative answer, until we know the final result of the 2-phase commit. t2 is blocked.
But, we (and CockroachDB) can do better. We can keep multiple versions of values for each key, and have a concurrency control mechanism to keep all of these versions in order. Namely, we can assign our transactions timestamps, and have our writes look (loosely) as follows:
`k1 = v1 committed at time=1`
`k1 = v2 at time=110 INTENT (pending transaction t1)`
Now, when t2 comes along, it has an option: it can choose to do the read at time<=109, which would not be blocked on t1. Of course, some transactions cannot do this (if say, they also are distributed, and there's a different component that simply requires a higher timestamp). Those transactions will be blocked. But in practice, this frees up the database to assign timestamps such that many types of transactions can proceed.
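To make the intent mechanics concrete, here is a minimal sketch under simplifying assumptions (the class, method, and exception names are hypothetical, not CockroachDB's actual internals): readers below the intent's timestamp proceed, while readers at or above it must wait for the transaction's outcome.

```python
# Toy MVCC store with write intents (hypothetical names, not CockroachDB's
# actual data structures).

class BlockedOnIntent(Exception):
    """Raised when a read must wait for a pending transaction's outcome."""

class Key:
    def __init__(self):
        self.versions = []   # committed (timestamp, value) pairs
        self.intent = None   # (timestamp, value, txn_id) of a pending write, or None

    def write_intent(self, ts, value, txn_id):
        self.intent = (ts, value, txn_id)

    def resolve_intent(self, committed):
        ts, value, _ = self.intent
        if committed:
            self.versions.append((ts, value))
        self.intent = None

    def read(self, ts):
        """Return the value visible at timestamp `ts`."""
        if self.intent and self.intent[0] <= ts:
            # The intent's fate is unknown, so this reader cannot get an
            # authoritative answer yet: it must block (or push the writer).
            raise BlockedOnIntent(self.intent[2])
        visible = [(t, v) for t, v in self.versions if t <= ts]
        return max(visible)[1] if visible else None

k1 = Key()
k1.versions.append((1, "v1"))             # k1 = v1 committed at time=1
k1.write_intent(110, "v2", txn_id="t1")   # k1 = v2 INTENT of pending txn t1

print(k1.read(109))    # 'v1' -- reads below the intent's timestamp are not blocked
try:
    k1.read(150)       # reads at time >= 110 must wait for t1's outcome
except BlockedOnIntent as blocked_on:
    print("blocked on", blocked_on)
```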
As the other answer says, Cockroach Labs has a blog post about CockroachDB's use of MVCC, which explains some further details as well.

CockroachDB has a long blog post on how it uses 2-phase commit without locking here: https://www.cockroachlabs.com/blog/how-cockroachdb-distributes-atomic-transactions/
The part that deals most with the prevention of locking is its use of "write intents" (Stage: Write Intents is the heading in the blog post).

Related

Solana - leader validator and incrementing field

As I understand it, Solana will elect a leader each round and there will be multiple validators handling the transactions independently. The leader will then consolidate all the transactions.
From this understanding, I'm curious how Solana actually handles programs which increment a field. So let's say we have a counter field which increases by 1 each time the program is called. What happens if 10 different users call this program at the same time? How does this work if the 10 transactions are handled by ten validators independently? For example, at the start of the round counter=50, and during the round ten different validators handle the transactions separately, so each validator will set counter=51. When the leader gets back all the txns, will it say counter=51? What happens in this scenario?
I feel like there is something missing in my assumptions.
So my understanding here seems to be incorrect. It is actually the leader who executes the transactions and the validators who verify them.
Source
Page 2 - Section 3 - https://solana.com/solana-whitepaper.pdf
As shown in Figure 1, at any given time a system node is designated as Leader to generate a Proof of History sequence, providing the network global read consistency and a verifiable passage of time. The Leader sequences user messages and orders them such that they can be efficiently processed by other nodes in the system, maximizing throughput. It executes the transactions on the current state that is stored in RAM and publishes the transactions and a signature of the final state to the replications nodes called Verifiers. Verifiers execute the same transactions on their copies of the state, and publish their computed signatures of the state as confirmations. The published confirmations serve as votes for the consensus algorithm.
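To make the counter example from the question concrete, here is a toy sketch (hypothetical names, not Solana's actual runtime) of how a single leader that sequences the transactions, with verifiers replaying the same ordered log, resolves concurrent increments:

```python
# Toy model: the leader fixes one global order and executes transactions one
# after another; verifiers replay that same order and reach the same state.

def apply(state, tx):
    # each transaction increments the counter by 1
    state["counter"] += 1
    return state

def leader_sequence_and_execute(state, incoming_txs):
    ordered = list(incoming_txs)          # the leader fixes one global order
    for tx in ordered:
        state = apply(state, tx)
    return ordered, state

def verifier_replay(initial_state, ordered_txs):
    state = dict(initial_state)
    for tx in ordered_txs:
        state = apply(state, tx)
    return state

initial = {"counter": 50}
txs = [f"user{i}-increment" for i in range(10)]

ordered, leader_state = leader_sequence_and_execute(dict(initial), txs)
print(leader_state)                       # {'counter': 60}, not 51

# every verifier that replays the same ordered log computes the same state
assert verifier_replay(initial, ordered) == leader_state
```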
The "recent blockhash" is another important part of this. A transaction references a recent blockhash, which is part of the Proof of History sequence. If two transactions reference the same blockhash, they are counted as duplicates by the network, even if they come from two different users.
More information can be found at https://docs.solana.com/developing/programming-model/transactions#recent-blockhash
There is only one PoH generator (block producer) at a time; the other nodes are just validating.
I cannot comment on Jon C's answer, but it is wrong: you can reuse the same recent blockhash, otherwise there is no way Solana could handle 50,000 TPS when the block time is around 0.4 seconds.

Oracle Transaction Commit

Let's say I have a function which carries out a lot of CRUD operations, and also assume that this function is going to execute without any exception (100% success). Is it better to have one transaction for the entire function, or a transaction commit for each CRUD operation? Basically, I want to know whether using many transaction commits has an impact on memory and time consumption while executing a function which has a lot of CRUD operations.
Transaction boundaries should be defined by your business logic.
If your application has 100 CRUD operations to do, and each is completely independent of the others, maybe a commit after each is appropriate. Think about this: is it OK for a user running a report against your database to see only half of the CRUD operations?
A transaction is a set of updates that must all happen together or not at all, because a partial transaction would represent an inconsistent or inaccurate state.
Commit at the end of every transaction - that's it. No more, no less. It's not about performance, releasing locks, or managing server resources. Those are all real technical issues, but you don't solve them by committing halfway through a logical unit of work. Commit frequency is not a valid "tuning trick".
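As a sketch of that advice (Python DB-API style against any driver's connection object; the table and column names are hypothetical), the whole logical unit of work gets exactly one commit, and a failure rolls all of it back:

```python
# One commit per logical unit of work: either the whole order is visible to
# other sessions, or none of it is. (Hypothetical schema; `conn` is any
# DB-API 2.0 connection, e.g. one from python-oracledb.)

def place_order(conn, order_id, customer_id, items):
    cur = conn.cursor()
    try:
        cur.execute(
            "INSERT INTO orders (order_id, customer_id) VALUES (:1, :2)",
            (order_id, customer_id),
        )
        for line_no, item in enumerate(items, start=1):
            cur.execute(
                "INSERT INTO order_lines (order_id, line_no, item) VALUES (:1, :2, :3)",
                (order_id, line_no, item),
            )
            cur.execute(
                "UPDATE inventory SET qty = qty - 1 WHERE item = :1",
                (item,),
            )
        conn.commit()      # the single commit that ends the transaction
    except Exception:
        conn.rollback()    # a partial order never becomes visible
        raise
```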
EDIT
To answer your actual question:
Basically, I wanted to know whether using many transaction commits has an impact on the memory and time consumption while executing the function which has a lot of CRUD operations.
Committing frequently will actually slow you down. Every time you do a regular commit, Oracle has to make sure that anything in the redo log buffers is flushed to disk, and your COMMIT will wait for that process to complete.
Also, there is little or no memory saving in frequent commits. Almost all your transaction's work and any held locks are written to redo log buffers and/or database block buffers in memory. Oracle will flush both of those to disk in the background as often as it needs to in order to manage memory. Yes, that's right -- your dirty, uncommitted database blocks can be written to disk. No commit necessary.
The only resource that a really huge transaction can blow out is UNDO space. But, again, you don't fix that problem by committing halfway through a logical unit of work. If your logical unit of work really is that huge, size your database with an appropriate amount of UNDO space.
My response is "it depends." Does the transaction involve data in only one table or several? Are you performing inserts, updates, or deletes. With an INSERT no other session can see your data till it is committed so technically no rush. However if you update a row on a table where the exact same row may need to be updated by another session in short order you do not want to hold the row for any longer than absolutely necessary. What constitutes a logic unit of work, how much UNDO the table and index changes involved consume, and concurrent DML demand for the same rows all come into play when choosing the commit frequency.

What's the difference between Paxos and W+R>=N in Cassandra?

Dynamo-like databases (e.g. Cassandra) can enforce consistency by means of quorum, i.e. a number of synchronously written replicas (W) and a number of replicas to read (R) should be chosen in such a way that W+R>N where N is a replication factor. On the other hand, PAXOS-based systems like Zookeeper are also used as a consistent fault-tolerant storage.
What is the difference between these two approaches? Does PAXOS provide guarantees that are not provided by W+R>N schema?
Yes, Paxos provides guarantees that are not provided by the Dynamo-like systems and their read-write quorums. The difference is in how failures are handled and what happens during a write. After a successful write, both kinds of systems behave similarly. The data will be saved and available for reading afterwards (until overwritten or deleted) and so on.
The difference appears during a write and after failures. Until you get a successful answer from W nodes when writing something to one of the eventually consistent systems, the data may have been written to some nodes and not to others, and there is no guarantee that the whole system agrees on the current value. If you try to read the data back at this point, some clients may get the new data back and some the old data back. In other words, the system is not immediately consistent. This is because writes aren't atomic across nodes in these systems. There are usually mechanisms to "heal" an inconsistency like this, and "eventually" the system will become consistent again (i.e. reads will once again always return the same value, until something new is written). This is the reason why they are often called "eventually consistent". Inconsistencies can (and will) appear, but they will always be dealt with and reconciled eventually.
With Paxos, writes can be made atomic across nodes and inconsistencies between nodes are therefore possible to avoid. The Paxos algorithm makes it possible to guarantee that non-faulty nodes never disagree on the outcome of a write, at any point in time. Either the write succeeded everywhere or nowhere. There will never be any inconsistent reads at any point (if it's correctly implemented and if all the assumptions hold, of course). This comes at a cost, however. Mainly, the system may need to delay some requests and be unavailable when for example too many nodes (or the communication between them) aren't working. This is necessary to assure that no inconsistent replies are given.
To summarize: the main difference is that the Dynamo-like systems can return inconsistent results during writes or after failures for some time (but will eventually recover from it), whereas Paxos based systems can guarantee that there are never any such inconsistencies by sometimes being unavailable and delaying requests instead.
Paxos is non-trivial to implement, and expensive enough that many systems using it use hints as well, or use it only for leader election, or something. However, it does provide guaranteed consistency in the presence of failures - subject of course to the limits of its particular failure model.
The first quorum based systems I saw assumed some sort of leader or transaction infrastructure that would ensure enough consistency that you could trust that the quorum mechanism worked. This infrastructure might well be Paxos-based.
Looking at descriptions such as https://cloudant.com/blog/dynamo-and-couchdb-clusters/, it would appear that Dynamo is not based on an infrastructure that guarantees consistency for its quorum system - so is it being very clever or cutting corners? According to http://muratbuffalo.blogspot.co.uk/2010/11/dynamo-amazons-highly-available-key.html, "The Dynamo system emphasizes availability to the extent of sacrificing consistency. The abstract reads "Dynamo sacrifices consistency under certain failure scenarios". Actually, later it becomes clear that Dynamo sacrifices consistency even in the absence of failures: Dynamo may become inconsistent in the presence of multiple concurrent write requests since the replicas may diverge due to multiple coordinators." (end quote)
So, it would appear that in the case of quorums as implemented in Dynamo, Paxos provides stronger reliability guarantees.
Paxos and the W+R>N quorum try to solve slightly different problems. Paxos is usually described as a way to replicate a state machine, but in fact it is more of a distributed log: each item written to the log gets an index, and the different servers eventually hold the same log items plus their indices. (A replicated state machine can be achieved by writing the state machine's inputs to the log; each server then replays the state machine on the agreed inputs in index order.) You can read more about Paxos in a blog post I wrote here.
The W+R>N quorum solves the problem of sharing a single value among multiple servers. In academia this is called a "shared register". A shared register has two operations: read and write, where we expect a read to return the value of the previous write.
So, Paxos and the W+R>N quorum live in different domains, and have different properties (e.g., Paxos saves an ordered list of items). However, Paxos can be used to implement a shared register, and a W+R>N quorum can be used to implement a distributed log (although, very inefficiently).
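As a toy illustration of those two abstractions (single-process stand-ins with hypothetical names; the real systems replicate them across servers):

```python
# The two abstractions contrasted above, as minimal in-memory stand-ins.

class SharedRegister:
    """What a W+R>N quorum system implements: one overwritable value."""
    def __init__(self):
        self._value = None
    def write(self, value):
        self._value = value
    def read(self):
        return self._value

class ReplicatedLog:
    """What (multi-)Paxos implements: an append-only sequence of agreed values."""
    def __init__(self):
        self._entries = []
    def append(self, value):
        index = len(self._entries)   # each entry gets a fixed index
        self._entries.append(value)
        return index
    def replay(self, apply_fn, initial_state):
        # a replicated state machine = replaying the agreed log in order
        state = initial_state
        for value in self._entries:
            state = apply_fn(state, value)
        return state

log = ReplicatedLog()
for cmd in ["deposit 100", "withdraw 30"]:
    log.append(cmd)

def apply_cmd(balance, cmd):
    op, amount = cmd.split()
    return balance + int(amount) if op == "deposit" else balance - int(amount)

print(log.replay(apply_cmd, 0))   # 70 on every server that holds the same log
```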
Saying all the above, sometimes the W+R>N quorums aren't implemented in their "fully robust" way, as that would require more than one communication round. Thus, in systems that want low latency, it is possible that their implementation of W+R>N quorums provides weaker properties (e.g., conflicting values can coexist).
To sum up, theoretically, Paxos and W+R>N can achieve the same goals. Practically, it would be very inefficient, and each one is better for something slightly different. Even more practically, W+R>N isn't always implemented fully, thus sacrificing some consistency properties for speed.
Update: Paxos supports a very general failure model: messages can be dropped, nodes can crash and restart. The W+R>N quorum scheme has different implementations, many of which assume less general failures. So the difference between the two also depends on the assumptions about the possible failures that are supported.
There is no difference. The definition of a quorum says that any two quorums' intersection is not empty. A simple majority quorum is an example, not a definition. Take a look at Dr. Lamport's later paper "Vertical Paxos", where he gives some other possible configurations of quorums.
The multi-decree Paxos protocol (AKA Multi-Paxos) is, in steady state, just two-phase commit. Ballot number changes are only needed when the leader fails.
ZooKeeper's replication protocol (ZAB) and Raft are both based on Paxos. The differences are in fault detection and in the transition after a leader fails.
As mentioned in other answers, in an R+W > N system the writes are not atomic on all nodes, which means that while a write is in progress (or after a write failure) some nodes will have newer values and some older ones. Take an example of a system where n=3, r=2, and w=2. For clarity let's assume the 3 nodes are named A, B, and C. Consider this scenario: a write is in progress; node A has been updated while B and C are still in the process of receiving the updated value. Clients reading from A and B will see the newer value (resolved using version vectors or last-write-wins), while clients reading from B and C will see old values. This type of read is not considered linearizable. Such issues will not occur with properly linearizable systems such as Paxos or Raft.
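A tiny sketch of that exact scenario (toy model, last-write-wins resolution by version number):

```python
# n=3, r=2, w=2: node A has already applied the in-flight write, B and C have
# not. Which value a reader sees depends on which two replicas its read
# quorum happens to contact.

replicas = {
    "A": (2, "new"),   # (version, value)
    "B": (1, "old"),
    "C": (1, "old"),
}

def quorum_read(nodes):
    responses = [replicas[n] for n in nodes]
    return max(responses)[1]       # last-write-wins: highest version wins

print(quorum_read(["A", "B"]))     # 'new' -- the read quorum overlaps the updated node
print(quorum_read(["B", "C"]))     # 'old' -- it misses the update: not linearizable
```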

Real world example of Paxos

Can someone give me a real-world example of how the Paxos algorithm is used in a distributed database? I have read many papers on Paxos that explain the algorithm, but none of them really explain it with an actual example.
A simple example could be a banking application where an account is being modified through multiple sessions (e.g. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first? Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used? Basically, I am trying to understand all this through a concrete example rather than abstract terms.
For example, consider a MapReduce system where the master consists of 3 hosts: one is the master and the others are slaves. The procedure for choosing the master uses the Paxos algorithm.
Google's Chubby lock service, used by Bigtable, also uses Paxos: The Chubby Lock Service for Loosely-Coupled Distributed Systems; Bigtable: A Distributed Storage System for Structured Data.
The Clustrix database is a distributed database that uses Paxos in the transaction manager. Paxos is used by the database internals to coordinate messages and maintain transaction atomicity in a distributed system.
The Coordinator is the node the transaction originated on.
Participants are the nodes that modified the database on behalf of the transaction.
Readers are nodes that executed code on behalf of the transaction but did not modify any state.
Acceptors are the nodes that log the state of the transaction.
The following steps are taken when performing a transaction commit:
Coordinator sends a PREPARE message to each Participant.
The Participants lock transaction state. They send PREPARED messages back to the Coordinator.
Coordinator sends ACCEPT messages to Acceptors.
The Acceptors log the membership id, transaction, commit id, and participants. They send ACCEPTED messages back to the Coordinator.
Coordinator tells the user the commit succeeded.
Coordinator sends COMMIT messages to each Participant and Reader.
The Participants and Readers commit the transaction and update transaction state accordingly. They send COMMITTED messages back to the Coordinator.
Coordinator removes internal state and is now done.
This is all transparent to the application and is implemented in the database internals. So for your banking application, all the application level would need to do is perform exception handling for deadlock conflicts. The other key to implementing a database at scale is concurrency, which is generally helped via MVCC (Multi-Version Concurrency Control).
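A toy walk-through of that message flow (hypothetical class and method names, in-process calls standing in for network messages, no failure handling -- surviving coordinator failure via the Acceptors' durable log is the part omitted here):

```python
# Toy sketch of the commit steps listed above (not actual Clustrix code).

class Node:
    def __init__(self, name):
        self.name = name
    def prepare(self, txn):
        print(f"{self.name}: PREPARED {txn} (transaction state locked)")
    def log_decision(self, txn):
        print(f"{self.name}: ACCEPTED {txn} (decision durably logged)")
    def commit(self, txn):
        print(f"{self.name}: COMMITTED {txn}")

def run_commit(coordinator, participants, readers, acceptors, txn):
    for p in participants:                  # 1-2. PREPARE / PREPARED
        p.prepare(txn)
    for a in acceptors:                     # 3-4. ACCEPT / ACCEPTED
        a.log_decision(txn)
    print(f"{coordinator.name}: commit of {txn} reported to the user")  # 5.
    for n in participants + readers:        # 6-7. COMMIT / COMMITTED
        n.commit(txn)
    print(f"{coordinator.name}: internal state for {txn} discarded")    # 8.

run_commit(Node("coord"),
           [Node("part1"), Node("part2")],
           [Node("reader1")],
           [Node("acc1"), Node("acc2"), Node("acc3")],
           txn="txn-42")
```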
Can someone give me a real-world example of how the Paxos algorithm is used in a distributed database?
MySQL uses Paxos. This is why a highly available MySQL setup needs three servers. In contrast, a typical Postgres setup is a master-slave two-node configuration which isn't running Paxos.
I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.
Here is a fairly detailed explanation of Paxos for transaction log replication. And here is the source code that implements it in Scala. Paxos (aka Multi-Paxos) is optimally efficient in terms of messages: in a three-node cluster, in steady state, the leader accepts its own next value, transmits it to the other two nodes, and knows the value is fixed when it gets back one response. It can then put the commit message (the learning message) at the front of the next value that it sends.
A simple example could be a banking application where an account is being modified through multiple sessions (e.g. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first?
Yes. If you use a MySQL database cluster to hold the bank accounts, then Paxos is being used to ensure that the replicas agree with the master as to the order in which transactions were applied to the customer bank accounts. If all the nodes agree on the order in which transactions were applied, they will all hold the same balances.
Operations on a bank account cannot be reordered without coming up with different balances that may violate the business rules of not exceeding your credit. The trivial way to ensure the order is to just use one server process that decides the official order simply based on the order of the messages that it receives. It can then track the balances of each bank account and enforce the business rules. Yet you don't want just a single server as it may crash. You want replica servers that are also receiving the credit and debit commands and agree with the master.
The challenge with having replicas that should hold the same balances is that messages may be lost and resent, and messages are buffered by switches that may deliver some messages late. The net effect is that if the network is unstable it is hard to prove that fast replication protocols will never cause different servers to see the messages arrive in different orders. You can end up with different servers in the same cluster holding different balances.
You don't have to use Paxos to solve the bank accounts problem. You can just do simple master-slave replication. You have one master, one or more slaves, and the master waits until it has got acknowledgements from the slaves before telling any client the outcome of a command. The challenge there is lost and reordered messages. Before Paxos was invented database vendors just created expensive hardware designed to have very high redundancy and reliability to run master-slave. What was revolutionary about Paxos is that it does work with commodity networking and without specialist hardware.
Since banking applications were profitable with expensive custom hardware it is likely that many real-world banking systems are still running that way. In such scenarios, the database vendor supplies the specialist hardware with built-in reliable networking that the database software runs on. That is very expensive and not something that smaller companies want. Cost-conscious companies can set up a MySQL cluster on VMs in any public cloud with normal networking and Paxos will make it reliable rather than using specialist hardware.
Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used?
I wrote a blog about multi-Paxos being the original Paxos protocol. Simply put, in the case of choosing the order of transactions in a cluster, you want to treat the transactions as a stream of values. Each value is fixed in a separate logical instance of the protocol. As described in my blog about Paxos for cluster replication, the algorithm is very efficient in steady state, needing only one round trip between the master and enough nodes to form a majority, which is one other node in a three-node cluster. When there are crashes or network issues the algorithm is always safe but needs more messages. So to answer your question: typical applications run many instances (rounds) of Paxos to establish the order of client commands in the cluster.
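A minimal sketch of that idea (hypothetical classes; the consensus itself is stubbed out, the point is only that each log slot is its own logical instance and a decided slot never changes):

```python
# "Multiple instances of Paxos": one single-decree instance per log slot.

class PaxosInstance:
    """One single-decree consensus instance: it decides exactly one value."""
    def __init__(self, slot):
        self.slot = slot
        self.decided = None

    def decide(self, value):
        if self.decided is None:      # once decided, the value never changes
            self.decided = value
        return self.decided

class ReplicatedCommandLog:
    def __init__(self):
        self.instances = []

    def propose(self, command):
        slot = len(self.instances)    # the next free slot gets its own instance
        inst = PaxosInstance(slot)
        self.instances.append(inst)
        return slot, inst.decide(command)

log = ReplicatedCommandLog()
for cmd in ["debit acct1 50", "credit acct2 50", "debit acct1 10"]:
    print(log.propose(cmd))           # (0, 'debit acct1 50'), (1, ...), (2, ...)
```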
I should note that Raft was specifically invented as a detailed description of how to perform cluster replication. The original Paxos papers require you to figure out many of the details to do cluster replication. So we can expect that people who are specifically trying to implement cluster replication would use Raft as it leaves nothing for the implementor to have to figure out for themselves.
So when might you use Paxos? It can be used to change the cluster membership of a cluster that is writing values based on a different protocol that can only be correct when you know the exact cluster membership. Corfu is a great example of that where it removes the bottleneck of writing via a single master by having clients write to shards of servers concurrently. Yet it can only do that accurately when all clients have an accurate view of the current cluster membership and shard layout. When nodes crash or you need to expand the cluster you propose a new cluster membership and shard layout and run it through Paxos to get consensus across the cluster.

A question about Oracle undo segment binding

I'm no DBA, I just want to learn about Oracle's Multi-Version Concurrency model.
When launching a DML operation, the first step in the MVCC protocol is to bind an undo segment. The question is: why can one undo segment only serve one active transaction?
Thank you for your time.
Multi-Version Concurrency is probably the most important concept to grasp when it comes to Oracle. It is good for programmers to understand it even if they don't want to become DBAs.
There are a few aspects to this, but they all come down to efficiency: undo management is overhead, so minimizing the number of cycles devoted to it contributes to the overall performance of the database.
A transaction can consist of many statements and generate a lot of undo: it might insert a single row, it might delete thirty thousand. It is better to assign one empty UNDO block at the start rather than continually scouting around for partially filled blocks with enough space.
Following on from that, sharing undo blocks would require the kernel to track usage at a much finer granularity, which is just added complexity.
When the transaction completes, the undo is released (unless it is still needed for read consistency; see the next point). The fewer blocks the transaction has used, the fewer latches have to be reset. Plus, if the blocks were shared we would have to free shards of a block, which is just more effort.
The key thing about MVCC is read consistency. This means that all the records returned by a longer-running query will appear in the state they had when the query started. So if I issue a SELECT on the EMP table which takes fifteen minutes to run, and halfway through you commit an update of all the salaries, I won't see your change. The database does this by retrieving the undo data from the blocks your transaction used. Again, this is a lot easier when all the undo data is collocated in one or two blocks.
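As a toy model of that read-consistency mechanism (not Oracle's actual block or undo record format), a query started at an older SCN rolls a copy of the block back by applying undo records newer than its snapshot:

```python
# Toy undo-based read consistency: reconstruct the pre-update salary for a
# query whose snapshot (query_scn=100) predates the committing update (scn=150).

block = {"scn": 150, "rows": {"SMITH": 2200}}            # current committed version
undo = [{"undo_scn": 150, "row": "SMITH", "old": 2000}]  # saved by the updating txn

def consistent_get(block, undo, query_scn):
    rows = dict(block["rows"])
    if block["scn"] > query_scn:
        # roll the copied block back until it is "as of" query_scn
        for rec in undo:
            if rec["undo_scn"] > query_scn:
                rows[rec["row"]] = rec["old"]
    return rows

print(consistent_get(block, undo, query_scn=100))   # {'SMITH': 2000}  pre-update value
print(consistent_get(block, undo, query_scn=200))   # {'SMITH': 2200}  current value
```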
"why one undo segment can only serve for one active transaction?"
It is simply a design decision. That is how undo segments are designed to work. I guess that it was done to address some of the issues that could occur with the previous rollback mechanism.
Rollback (which is still available but deprecated in favor of undo) included explicit creation of rollback segments by the DBA, and multiple transactions could be assigned to a single rollback segment. This had some drawbacks, most obviously that if one transaction assigned to a given segment generated enough rollback data that the segment was full (and could no longer extend), then other transactions using the same segment would be unable to perform any operation that would generate rollback data.
I'm surmising that one design goal of the new undo feature was to prevent this sort of inter-transaction dependency. Therefore, they designed the mechanism so that the DBA sizes and creates the undo tablespace, but the management of segments within it is done internally by Oracle. This allows the use of dedicated segments by each transaction. They can still cause problems for each other if the tablespace fills up (and cannot autoextend), but at the segment level there is no possibility of one transaction causing problems for another.
