Compare and Contrast Change Data Capture and Database Change Notification - oracle

Oracle has two seemingly competing technologies: Change Data Capture (CDC) and Database Change Notification (DCN).
What are the strengths of each?
When would you use one and not the other?

In general, you would use DCN to notify a client application that it needs to clear or update its cache, and CDC for ETL processing.
DCN would generally be preferable when you have an OLTP application that needs to be notified immediately about data changes in the database. Since the goal is to minimize the number of network round-trips and database hits, you'd generally want the application to use DCN for queries whose results are mostly static. If a large fraction of the query's result set changes regularly, you may be better off simply refreshing the application's cache on a set frequency rather than running queries constantly to pick up the changed data (DCN does not contain the changed data, just the ROWID of the row(s) that changed). If the application goes down, I believe DCN allows changes to be lost.
CDC would generally be preferable when you have a DSS application that needs to periodically pull over all the data that changed in a number of tables. CDC can guarantee that the subscriber has received every change to the underlying table(s), which can be important if you are trying to replicate changes to a different database. CDC lets the subscriber pull the changes at its convenience rather than trying to notify the subscriber that there are changes, so you'd definitely want CDC if you wanted the subscriber to process new data every hour or every day rather than in near real time. (Note: DCN also has a guaranteed delivery mode, see comments below. --Mark Harrison)
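To make the cache-invalidation pattern concrete, here is a minimal sketch of what a DCN registration can look like from the application side, assuming a Python client using the cx_Oracle driver; the connection details, table, and cache are illustrative, and the database user needs the CHANGE NOTIFICATION privilege. The PL/SQL DBMS_CQ_NOTIFICATION route mentioned in the comments below is equivalent.

    import cx_Oracle

    cache = {}  # toy in-process cache keyed by ROWID

    def on_change(message):
        # DCN delivers the ROWIDs of changed rows, not the data itself,
        # so the application just evicts entries and re-queries lazily.
        for table in message.tables:
            for row in table.rows:
                cache.pop(row.rowid, None)

    # events=True is required for subscriptions (DSN and credentials are made up).
    conn = cx_Oracle.connect("app_user", "app_pw", "dbhost/ORCLPDB1", events=True)
    subscription = conn.subscribe(
        callback=on_change,
        operations=cx_Oracle.OPCODE_INSERT | cx_Oracle.OPCODE_UPDATE | cx_Oracle.OPCODE_DELETE,
        qos=cx_Oracle.SUBSCR_QOS_ROWIDS,
    )
    # Register the mostly-static query whose result set the cache mirrors.
    subscription.registerquery("select id, price from products")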

CDC seems to be much more complex to set up than DCN.
I mean, to set up DCN I wrap a select in a start and end DCN block and then write a procedure to be called with a collection of changes. That's it.
CDC requires publishers and subscribers and, in any case, seems like more work.
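For contrast, here is a rough sketch of the recurring CDC consumption cycle once the publication, change set, subscription, and subscriber view already exist. The subscription and view names are made up, the control columns (OPERATION$, CSCN$) and the DBMS_CDC_SUBSCRIBE calls should be double-checked against your Oracle version's documentation, and apply_to_target stands in for whatever the ETL step does.

    import cx_Oracle

    conn = cx_Oracle.connect("etl_user", "etl_pw", "dbhost/ORCLPDB1")
    cur = conn.cursor()

    # 1. Extend the subscription window to take in the latest committed changes.
    cur.callproc("DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW", ["orders_sub"])

    # 2. Read the changes exposed through the subscriber view.
    cur.execute("select operation$, cscn$, order_id, amount from orders_sub_view")
    for operation, commit_scn, order_id, amount in cur:
        apply_to_target(operation, commit_scn, order_id, amount)  # hypothetical ETL step

    # 3. Purge the processed window so the change data can be cleaned up.
    cur.callproc("DBMS_CDC_SUBSCRIBE.PURGE_WINDOW", ["orders_sub"])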

Related

Implementing CQRS / ES the proper way

Recently I've been looking to implement the CQRS/ES pattern (Command Query Responsibility Segregation with Event Sourcing) in my microservices.
I've been reading about these patterns, but I have some questions I couldn't find answers to anywhere:
When doing CQRS/ES, should each microservice still have its own local database (within the microservice)?
I know that there will be an event store for writes and a read-only projection database, and I totally understand their purpose, but do microservices need their own local database for any reason? (Advantages/disadvantages?)
Example: the order microservice could have a local orders database, the item service a local items database, and so on, apart from the event store and the projection database.
How do you validate that some data exists in a microservice before actually issuing a command?
Let's say I want to place a new order, so I assume I first have to check whether the item is still in stock, then perform the other operation(s).
However, if I want to check whether an item is still in stock, where do I query that data: the projection (read-only) database, or a local database that each microservice has?
I've read many articles about CQRS/ES at this point, but most of them just explain the concept rather than diving into real-life scenarios or explaining how to implement it. I would appreciate it if you have any recommendations.
Much appreciated
In general, when dealing with microservices, it's recommended (regardless of whether or not you're doing CQRS/ES) that no two microservices use the same database, or at the very least that no two microservices be writing to the same database. This allows each microservice to control its schema, which only needs to change if the microservice needs it to. One other advantage of this is that the database becomes entirely encapsulated within the service: it's purely an implementation detail.
It's entirely possible that a microservice implementing a read-model might not have a database: it might be able to keep all state in memory (an example might be a read-model which exposes metrics for your monitoring infrastructure), or it might simply be translating events from the write-model into commands to another service (so all of its state is just its position in the event stream).
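As an illustration of that last point, a read-model can be as small as a loop that folds events into in-memory state and remembers nothing but its position in the stream. This is a hypothetical sketch; the event shape and the event_store.read_from API are assumptions, not any particular framework.

    from dataclasses import dataclass, field

    @dataclass
    class OrderCountProjection:
        position: int = 0                      # offset of the last event processed
        orders_per_item: dict = field(default_factory=dict)

        def apply(self, event):
            # Fold one event into the in-memory view.
            if event["type"] == "OrderPlaced":
                item = event["item_id"]
                self.orders_per_item[item] = self.orders_per_item.get(item, 0) + 1
            self.position = event["offset"]

        def catch_up(self, event_store):
            # read_from(offset) is an assumed API that yields events in order.
            for event in event_store.read_from(self.position + 1):
                self.apply(event)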
If I want to check whether an item is still in stock, where do I query that data: the projection (read-only) database, or a local database that each microservice has?
In an event-sourced system, every view that's not the stream of events is a projection. So, depending on your requirements, your service can query another service or maintain its own view based on the events.
Note that at any given instant there may exist an event which has been published to the event stream (i.e. it has indisputably happened) but which some projection has not yet processed: the projections are eventually consistent with the event stream. So any check of whether an item is in stock will only tell you that the item was in stock at some point in the past (never mind, to use Greg Young's example, that no in-stock data can guarantee that nothing's been stolen from the warehouse, unless the thieves happened to have the decency to update the count as they walked out with their loot). The nanosecond after your query, the projection might receive word of an event that makes the item out of stock before you've placed your order.
Accordingly, it may just be worth sending the command and letting the write side reject your order if the item is not in stock. The write side (which is the more strongly consistent part of the system, though it should be remembered that in many cases one component's events are another component's commands) is under no obligation to accept every command; "command" in this context really means "polite request to publish events to the event stream which are conformant with my desired state of the universe".
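A hedged sketch of that "send the command and let the write side say no" idea, with invented names (InventoryAggregate, StockReserved, and so on) rather than any particular framework:

    class OutOfStock(Exception):
        pass

    class InventoryAggregate:
        # Write-side state for one item, rebuilt from its event stream.

        def __init__(self, item_id, events):
            self.item_id = item_id
            self.on_hand = 0
            for event in events:
                self.apply(event)

        def apply(self, event):
            if event["type"] == "StockReceived":
                self.on_hand += event["quantity"]
            elif event["type"] == "StockReserved":
                self.on_hand -= event["quantity"]

        def reserve(self, quantity):
            # The command is a polite request: reject it if it cannot be honoured.
            if quantity > self.on_hand:
                raise OutOfStock(f"only {self.on_hand} of {self.item_id} left")
            return {"type": "StockReserved", "item_id": self.item_id, "quantity": quantity}

The order service then sends the command and handles a possible rejection, instead of trying to pre-validate against an eventually consistent projection.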

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, i.e. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds have no concept of time or version number, as Kafka does, for example. This means there is no way to (a) load initial data and then (b) get exactly the changes that happened since that initial load. There is a time window in which changes can be lost: between the initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
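For reference, solution #2 with the Python driver would look roughly like this; the database, table, and index names are placeholders, the JavaScript includeInitial option is spelled include_initial in Python, and push_to_frontend is whatever delivery mechanism the server uses.

    from rethinkdb import RethinkDB

    r = RethinkDB()
    conn = r.connect("localhost", 28015, db="app")

    def stream_user_data(user_id):
        feed = (r.table("user_items")
                 .get_all(user_id, index="user_id")    # secondary index on user_id
                 .changes(include_initial=True)        # initial docs, then live changes
                 .run(conn))
        for change in feed:
            if "old_val" not in change:
                push_to_frontend(user_id, change["new_val"])  # part of the initial load
            else:
                push_to_frontend(user_id, change)             # a subsequent change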
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people do to lower network load in that case is put a proxy node on the same machine as their app server and connect to that: the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and it takes a lot of CPU/memory load off the main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.
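So in practice the client ends up wrapping the feed in a reconnect loop along these lines; the exception names are from the current Python driver and may differ by version, and handle_change is a stand-in for rebuilding and then updating the client state.

    import time
    from rethinkdb import RethinkDB
    from rethinkdb.errors import ReqlDriverError, ReqlRuntimeError

    r = RethinkDB()

    def follow_user_feed(user_id):
        while True:
            try:
                conn = r.connect("localhost", 28015, db="app")
                feed = (r.table("user_items")
                         .get_all(user_id, index="user_id")
                         .changes(include_initial=True)   # reload everything on each (re)connect
                         .run(conn))
                for change in feed:
                    handle_change(user_id, change)
            except (ReqlDriverError, ReqlRuntimeError):
                time.sleep(2)  # back off, then start over with a fresh initial load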

How to implement an ETL Process

I would like to implement synchronization between a source SQL-based database and a target triplestore.
However, for the sake of simplicity, let's just say two databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like each change to a row in the source database to be seen by a process that reads the changes and populates the target database accordingly, applying some transformation in the middle.
I have seen suggestions around the notification mechanisms that can be available in the database, around building tables in which changes can be tracked (i.e. doing it manually) and having the process poll them at different intervals, and around the use of logs (change data capture, etc.).
I'm seriously puzzled by all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective, meaning: the names of the methods and where to look.
My organization mostly uses Postgres and Oracle databases.
I have to take relational data, transform it into RDF so as to store it in a triplestore, and keep that triplestore constantly synchronized with the data in the SQL store.
Many thanks.
PS:
A clarification of the difference between ETL and replication techniques such as Change Data Capture, with respect to my overall objective, would be appreciated.
Again, I need to make sense of the subject and know what the methods are, so I can start digging further for myself. So far I have understood that CDC is the new way to go.
Assuming you can't use replication and you need some kind of ETL process to actually extract, transform, and load all changes to the destination database, you could use insert, update, and delete triggers to fill a (manually created) audit table with columns such as GeneratedId, TableName, RowId, Action (insert, update, delete), and a boolean flag indicating whether your ETL process has already handled the change. Use that table to get all the changed rows in your database and transport them to the destination database, then delete the processed rows from the audit table so it doesn't grow too big. How often you have to run the ETL process depends on the amount of change occurring in the source database.
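A hedged sketch of the polling half of that scheme, assuming the triggers already fill an audit table with the columns described above; the connection comes from whatever DB-API driver fits your source database (the bind-variable style below is Oracle's), and load_into_target stands in for the extract/transform/load step.

    import time

    def run_etl_cycle(source_conn, load_into_target):
        cur = source_conn.cursor()
        cur.execute(
            "SELECT GeneratedId, TableName, RowId, Action "
            "FROM audit_log WHERE Processed = 0 ORDER BY GeneratedId"
        )
        for generated_id, table_name, row_id, action in cur.fetchall():
            load_into_target(table_name, row_id, action)
            # Remove the processed entry so the audit table doesn't grow too big.
            cur.execute("DELETE FROM audit_log WHERE GeneratedId = :1", [generated_id])
        source_conn.commit()

    def main(source_conn, load_into_target, interval_seconds=60):
        # How often to poll depends on the volume of changes in the source database.
        while True:
            run_etl_cycle(source_conn, load_into_target)
            time.sleep(interval_seconds)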

Use of Oracle Advanced Queuing to receive changes of database table rows

I am confused about Oracle Advanced Queuing. It looks like a way to asynchronously send database notifications to the application layer.
But looking at some of the details, there is a queue to be set up, alongside a table, and there are explicit calls to publish messages that are afterwards pushed to the application layer.
Does this work automatically with table row modifications?
If a particular table changes (no matter who or how changed it), I want to receive a notification about it in the form of a binary object that represents the changed row.
(Note: I know about Oracle query change notification (CQN), but I am not satisfied with its performance; my goal is to see whether Oracle Advanced Queuing can offer something similar with better speed.)
Thanks in advance.

The best way to track data changes in Oracle

As the title says, what's the best way to track data changes in Oracle? I just want to know which rows are being updated, deleted, or inserted.
At first I thought about triggers, but I would need to write triggers on each table and record the affected ROWIDs into my change table, which isn't good. Then I searched Google and learned about new concepts: materialized view logs and Change Data Capture.
A materialized view log looks good to me because I can compare it with the original table to get the changed records, even the changed fields. I think that is the same as creating/copying a new table from the original (but I don't know what the difference is).
The Change Data Capture component is complicated for me :), so I don't want to waste my time researching it.
Does anybody have experience with the best way to track data changes in Oracle?
You'll want to have a look at the AUDIT statement. It gathers all auditing records in the SYS.AUD$ table.
Example:
AUDIT INSERT, UPDATE, DELETE ON t BY ACCESS;
Regards,
Rob.
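If you go that route, the captured actions are usually read back through the DBA_AUDIT_TRAIL view rather than SYS.AUD$ directly. A small sketch, with made-up connection details:

    import cx_Oracle

    conn = cx_Oracle.connect("audit_reader", "pw", "dbhost/ORCLPDB1")
    cur = conn.cursor()
    cur.execute(
        "SELECT username, action_name, timestamp "
        "FROM dba_audit_trail WHERE obj_name = 'T' ORDER BY timestamp DESC"
    )
    for username, action_name, audited_at in cur.fetchmany(20):
        print(username, action_name, audited_at)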
You might want to take a look at Oracle GoldenGate. It makes capturing changes a snap, at a price, but with good performance and quick setup.
If performance is no issue, triggers and auditing could be a valid solution.
If performance is an issue and GoldenGate is considered too expensive, you could also use LogMiner or Change Data Capture. Given that choice, my preference would go to CDC.
As you see, there are quite a few options, near-real-time and offline.
Coding a solution by hand also has a price; GoldenGate is worth investigating.
Oracle does this for you via the redo logs; it depends on what you're trying to do with this information. I'm assuming your need is replication (track changes on the source instance and propagate them to one or more target instances).
If that's the case, you may consider Oracle Streams (there are other options, such as Advanced Replication, but you'll need to weigh your needs):
From Oracle:
When you use Streams, replication of a DML or DDL change typically includes three steps:

A capture process or an application creates one or more logical change records (LCRs) and enqueues them into a queue. An LCR is a message with a specific format that describes a database change. A capture process reformats changes captured from the redo log into LCRs, and applications can construct LCRs. If the change was a data manipulation language (DML) operation, then each LCR encapsulates a row change resulting from the DML operation to a shared table at the source database. If the change was a data definition language (DDL) operation, then an LCR encapsulates the DDL change that was made to a shared database object at a source database.

A propagation propagates the staged LCR to another queue, which usually resides in a database that is separate from the database where the LCR was captured. An LCR can be propagated to a number of queues before it arrives at a destination database.

At a destination database, an apply process consumes the change by applying the LCR to the shared database object. An apply process can dequeue the LCR and apply it directly, or an apply process can dequeue the LCR and send it to an apply handler. In a Streams replication environment, an apply handler performs customized processing of the LCR and then applies the LCR to the shared database object.

Resources