I have been reading about data races and race conditions, but it seems that these are used in place of each other.
So what is the real difference?
(Note: this isn't about parallel execution of a query inside the RDB, but about the performance characteristics of submitting queries in parallel.)
I have a process that executes thousands (if not tens of thousands) of queries in a single-threaded manner (i.e. send, wait for response, process, send, ...), loosely of the form
select a,b from table where id = 123
i.e. query a single record on an already indexed field
on an Oracle database.
This process takes longer than desired, and having gathered some metrics on it, I'm sure that 90% of the time is spent on server-side execution (and transport) rather than on the client side.
This process can naturally be split into N 'jobs', and it's been suggested that this could/should speed up the process.
Naively you would expect it to run N times quicker (with a small overhead to merge the answer).
Given that SQL is (loosely) 'serialised', is this the case though? That would imply that it would probably not run any quicker at all.
I assume that for an update on a single record (for example) N updates would have to be effectively serialised, but for N reads, this may not be the case.
Which theory is most accurate (or is neither)?
I'm not a DBA, but it looks like reads never block reads, so assuming infinite resources, the theory would be that N reads could run completely in parallel with no blocking. For mixed writes and reads it gets more complex, depending on how you set up your transactions/locks, but that's out of scope for me.
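As a rough illustration of splitting the lookups into N jobs, here is a minimal Python sketch. It is not Oracle-specific: `fake_fetch` and `FAKE_TABLE` are hypothetical stand-ins for the real single-record query, and `run_in_parallel` is a name invented for this example. With a real driver you would most likely give each worker its own connection (or use a connection pool), and the actual speed-up depends on server and network capacity rather than being a guaranteed factor of N.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(ids, fetch_one, workers=8):
    """Run fetch_one(id) for every id on a pool of worker threads.

    fetch_one is a hypothetical callable standing in for
    "select a,b from table where id = :id" against the real database.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input ids.
        return dict(zip(ids, pool.map(fetch_one, ids)))


# Stand-in for the indexed table, so the sketch runs without a database.
FAKE_TABLE = {i: (f"a{i}", f"b{i}") for i in range(1000)}


def fake_fetch(record_id):
    # In real code this would execute the query and return the row.
    return FAKE_TABLE[record_id]


results = run_in_parallel(list(range(1000)), fake_fetch)
```

Threads suit this workload because the client spends most of its time waiting on the server, so the waits of different jobs can overlap.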
I noticed that transactions started on the shard follower are significantly slower than on the shard leader. However, while write operations are 5-10 times slower, read operations are 1000 times slower.
I mostly read the same data that I have just written, but in separate transactions. Can using a transaction chain improve the read performance?
Transactions on a follower will be slower because every operation (read, write, etc.) has to go to the leader for strong consistency, which incurs network and serialization latencies. A transaction chain won't improve that.
When there is only one writer to a Berkeley DB, is it worth using transactions?
Do transactions cause a significant slowdown? (As a percentage, please.)
You use transactions if you require the atomicity that they provide. Perhaps you need to abort the transaction, undoing everything in it? Or perhaps you need the semantic that should the application fail, a partially completed transaction is aborted. Your choice of transactions is based on atomicity, not performance. If you need it, you need it.
If you don't need atomicity, you may not need durability either. Running without it is significantly faster!
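To make the atomicity point concrete, here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in, not Berkeley DB, and the `accounts` table and `transfer` function are hypothetical names invented for the example. The idea it shows is the one described above: when a transaction is aborted, everything done inside it is undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst, or change nothing at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?",
                (src,)).fetchone()
            if balance < 0:
                # Abort: the debit above is undone by the rollback.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # aborted, nothing is applied
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Without the transaction, the failed transfer would leave the debit in place with no matching credit, which is exactly the partial state atomicity rules out.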
Transactions with DB_INIT_TXN in Berkeley DB are not significantly
slower than other models, although generally maintaining a transactional
log requires all data to be written to the log before being written
to the database.
For a single writer and multiple readers, try the DB_INIT_CDB
model because the code is much simpler. Locks in the DB_INIT_CDB
model are per-table, so overall throughput might be worse
than with the DB_INIT_TXN model because of coarse-grained per-table
lock contention.
Performance will depend on access patterns more than whether
one uses DB_INIT_TXN or DB_INIT_CDB models.
I'm no DBA, I just want to learn about Oracle's Multi-Version Concurrency model.
When launching a DML operation, the first step in the MVCC protocol is to bind an undo segment. The question is: why can one undo segment serve only one active transaction?
Thank you for your time.
Multi-Version Concurrency is probably the most important concept to grasp when it comes to Oracle. It is good for programmers to understand it even if they don't want to become DBAs.
There are a few aspects to this, but they all come down to efficiency: undo management is overhead, so minimizing the number of cycles devoted to it contributes to the overall performance of the database.
A transaction can consist of many statements and generate a lot of undo: it might insert a single row, or it might delete thirty thousand. It is better to assign one empty UNDO block at the start rather than continually scouting around for partially filled blocks with enough space.
Following on from that, sharing undo blocks would require the kernel to keep track of usage at a much finer granularity, which is just added complexity.
When the transaction completes, the undo is released (unless it is still needed; see the next point). The fewer blocks the transaction has used, the fewer latches have to be reset. Plus, if the blocks were shared we would have to free shards of a block, which is just more effort.
The key thing about MVCC is read consistency. This means that all the records returned by a long-running query will appear in the state they had when the query started. So if I issue a SELECT on the EMP table which takes fifteen minutes to run, and halfway through you commit an update of all the salaries, I won't see your change. The database does this by retrieving the undo data from the blocks your transaction used. Again, this is a lot easier when all the undo data is collocated in one or two blocks.
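The read-consistency behaviour described above can be illustrated with a toy Python model. This is only a sketch of the idea, not how Oracle implements it: Oracle reconstructs older row versions from undo data, whereas this hypothetical `VersionedStore` simply keeps every committed version with a logical commit number. The class, its methods, and the sample salaries are all invented for illustration.

```python
class VersionedStore:
    """Toy multi-version store: readers see data as of their start point."""

    def __init__(self):
        self._clock = 0      # logical commit counter
        self._versions = {}  # key -> [(commit_number, value), ...] ascending

    def commit(self, updates):
        """Apply a whole transaction's updates under one new commit number."""
        self._clock += 1
        for key, value in updates.items():
            self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return the point in time a query starting now should read from."""
        return self._clock

    def read(self, key, snapshot):
        """Return the newest value committed at or before the snapshot."""
        for commit_number, value in reversed(self._versions.get(key, [])):
            if commit_number <= snapshot:
                return value
        return None


emp = VersionedStore()
emp.commit({"SMITH": 800, "JONES": 2975})

query_start = emp.snapshot()                # my long-running SELECT begins
emp.commit({"SMITH": 900, "JONES": 3100})   # you commit a salary update

seen_by_query = emp.read("SMITH", query_start)     # still the old salary
seen_by_new_query = emp.read("SMITH", emp.snapshot())
```

The long-running query keeps reading with `query_start`, so it never sees the later commit, while a query started afterwards does.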
"why one undo segment can only serve for one active transaction?"
It is simply a design decision. That is how undo segments are designed to work. I guess that it was done to address some of the issues that could occur with the previous rollback mechanism.
Rollback (which is still available but deprecated in favor of undo) included explicit creation of rollback segments by the DBA, and multiple transactions could be assigned to a single rollback segment. This had some drawbacks, most obviously that if one transaction assigned to a given segment generated enough rollback data that the segment was full (and could no longer extend), then other transactions using the same segment would be unable to perform any operation that would generate rollback data.
I'm surmising that one design goal of the new undo feature was to prevent this sort of inter-transaction dependency. Therefore, they designed the mechanism so that the DBA sizes and creates the undo tablespace, but the management of segments within it is done internally by Oracle. This allows the use of dedicated segments by each transaction. They can still cause problems for each other if the tablespace fills up (and cannot autoextend), but at the segment level there is no possibility of one transaction causing problems for another.
I have a requirement where I have large sets of incoming data into a system I own.
A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time.
The requirements are as follows -
Large sets of data can experience state changes. Updates need to be fast.
I should be able to aggregate data pivoted on various attributes.
Ideally, there should be a way to correlate individual data units to an aggregated result, i.e. I want to drill down into the specific transactions that produced a certain aggregation.
(I am aware of the race conditions here, like the state of a data unit changing after an aggregation is performed, but this is expected.)
All aggregations are time based - i.e. sum of x on pivot y over a day, 2 days, week, month etc.
I am evaluating different technologies to meet these use cases, and would like to hear your suggestions. I have taken a look at Hive/Pig, which fit the analytics/aggregation use case. However, I am concerned about the large bursts of updates that can come into the system at any time. I am not sure how this performs on HDFS files when compared to an indexed database (SQL or NoSQL).
You'll probably arrive at the optimal solution only by stress testing actual scenarios in your environment, but here are some suggestions. First, if write speed is a bottleneck, it might make sense to write the changing state to an append-only store, separate from the immutable data, then join the data again for queries. Append-only writing (e.g., like log files) will be faster than updating existing records, primarily because it minimizes disk seeks. This strategy can also help with the problem of data changing underneath you during queries. You can query against a "snapshot" in time. For example, HBase keeps several timestamped updates to a record. (The number is configurable.)
This is a special case of the persistence strategy called Multiversion Concurrency Control - MVCC. Based on your description, MVCC is probably the most important underlying strategy for you to perform queries for a moment in time and get consistent state information returned, even while updates are happening simultaneously.
Of course, doing joins over split data like this will slow down query performance. So, if query performance is more important, then consider writing whole records where the immutable data is repeated along with the changing state. That will consume more space, as a tradeoff.
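The append-only strategy from this answer can be sketched in a few lines of Python. Everything here is hypothetical sample data and naming (`IMMUTABLE`, `state_log`, `record_state`, `state_as_of`, `sum_by`), and the in-memory list merely stands in for whatever append-only store is chosen. The sketch assumes state changes are appended in timestamp order.

```python
from collections import defaultdict

# Immutable attributes, written once per data unit (hypothetical sample data).
IMMUTABLE = {
    "u1": {"region": "EU", "amount": 10},
    "u2": {"region": "EU", "amount": 5},
    "u3": {"region": "US", "amount": 7},
}

# Append-only log of state changes: (timestamp, unit_id, state).
state_log = []


def record_state(ts, unit_id, state):
    """A state change is an append, never an in-place update."""
    state_log.append((ts, unit_id, state))


def state_as_of(ts):
    """Latest state of each unit at or before ts (log is in time order)."""
    latest = {}
    for entry_ts, unit_id, state in state_log:
        if entry_ts <= ts:
            latest[unit_id] = state
    return latest


def sum_by(pivot, state, as_of):
    """Join the snapshot with the immutable data and sum amount per pivot.

    Also returns the unit ids behind each total, for drill-down.
    """
    totals, members = defaultdict(int), defaultdict(list)
    for unit_id, current in state_as_of(as_of).items():
        if current == state:
            attrs = IMMUTABLE[unit_id]
            totals[attrs[pivot]] += attrs["amount"]
            members[attrs[pivot]].append(unit_id)
    return dict(totals), dict(members)


record_state(1, "u1", "open")
record_state(1, "u2", "open")
record_state(2, "u3", "open")
record_state(3, "u2", "closed")

totals_t2, members_t2 = sum_by("region", "open", as_of=2)
totals_t3, members_t3 = sum_by("region", "open", as_of=3)
```

Because each aggregation is computed against a fixed `as_of` timestamp, a later state change does not alter an earlier result, and the returned member lists give the drill-down from a total back to the units that produced it.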
You might consider looking at Flexviews. It supports creating incrementally refreshable materialized views for MySQL. A materialized view is like a snapshot of a query that is updated periodically with the data that has changed. You can use materialized views to summarize on multiple attributes in different summary tables and keep these views transactionally consistent with each other. You can find some slides describing the functionality on slideshare.net.
There is also Shard-Query which can be used in combination with InnoDB and MySQL partitioning, as well as supporting spreading data over many machines. This will satisfy both high update rates and will provide query parallelism for fast aggregation.
Of course, you can combine the two together.