How to read/write in Kafka using JDBC with conditions

I have a Kafka server that works fine for syncing a table between servers. My DB is PostgreSQL and I'm using the JDBC sink/source connectors.
Now my question is: how can I read data from two tables on the source side and insert the data into four different tables on the sink side?
example:
Source tables: Users, Roles
Sink tables: Workers, Managers, Employers, ...
On the parent server all users are available in the Users table and have a relation to the Roles table. On the other side I want to insert the data into a specific table according to each user's role.

For the JDBC Sink you need one topic per target table. Thus you need four topics, one per target table, populated with the joined data. This join needs to happen at some point in the pipeline. Options would be:
As part of the JDBC Source, using the query option of the connector. Build four connectors, each with the query needed to populate its target topic, with the join done on the Postgres side in SQL.
As a streaming application, e.g. in Kafka Streams or KSQL. The JDBC Source would pull in the source users and roles tables and you'd perform the join as each record flowed through.
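For the first option, a minimal sketch of one of the four source connector configurations (connection details, column names such as role_id and updated_at, and the role filter are assumptions, not taken from the question; with the query option, the value of topic.prefix is used as the full topic name):

# One of four JDBC source connectors, one per role/target topic.
# Connection details, column names and the role filter are placeholders.
name=jdbc-source-workers
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://parent-host:5432/parentdb
connection.user=kafka_connect
connection.password=********
mode=timestamp
timestamp.column.name=updated_at
# In timestamp mode the connector appends its own WHERE clause,
# so the join/filter is wrapped in a derived table.
query=SELECT * FROM (SELECT u.id, u.firstname, u.lastname, u.updated_at FROM users u JOIN roles r ON u.role_id = r.id WHERE r.name = 'worker') workers
# With "query" set, topic.prefix is the complete topic name.
topic.prefix=workers
poll.interval.ms=10000

The same pattern would repeat for the managers, employers, etc. connectors, each with its own role filter and target topic.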

Related

Kafka connect - connector per event type

I'm using Kafka to transfer application events to a SQL historical database. The events are structured differently depending on the type, e.g. OrderEvent, ProductEvent, and the two are related via Order.productId = Product.id. I want to store these events in separate SQL tables. I came up with two approaches to transfer this data, but each has a technical problem.
Topic per event type - this approach is easy to configure, but the order of events is not guaranteed across multiple topics, so there may be a problem when the product doesn't exist yet at the time the order is consumed. This may be solved with foreign keys in the database, so that the consumer of the order topic fails until the product is available in the database.
One topic with multiple event types - using the schema registry it is possible to store multiple event types in one topic. Events are now properly ordered, but I'm stuck on the JDBC connector configuration. I haven't found any way to set the SQL table name depending on the event type. Is it possible to configure a connector per event type?
Is the first approach with foreign keys correct? Is it possible to configure a connector per event type in the second approach? Maybe there is another solution?

Debezium is replicating tables not included in "table.include.list"

I have specified a list of tables to be replicated in Debezium, using the "table.include.list" configuration.
However, when new tables that have not been selected for replication are created in the source DB, they are in fact being replicated.
How can I change this behaviour of Debezium so that it only replicates the tables specified?
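For reference, a minimal sketch of the kind of configuration being described, assuming a PostgreSQL source (the connector class, connection details and table names are placeholders, not taken from the question):

# Hypothetical Debezium source connector (property names as in Debezium 1.x).
name=debezium-source
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=source-db
database.port=5432
database.user=debezium
database.password=********
database.dbname=sourcedb
database.server.name=sourcedb
# Only these tables are intended to be captured.
table.include.list=public.customers,public.orders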

AWS DMS with CDC. The update records only include the updated field. How to include all?

We recently started the process of continuous migration (initial load + CDC) from an Oracle database on RDS to S3 using AWS DMS. The DB is using LogMiner.
The problem we have detected is that the CDC records of type Update only contain the fields that were updated, leaving the rest of the fields empty, so we lose the ability to simply treat the record with the maximum timestamp as the full, valid version of the row.
Does anyone know if this can be changed, or which part of the DMS or RDS configuration to adjust so that the update record contains the information of all the fields of the record?
Thanks in advance.
Supplemental logging at the table level may increase what is logged, but it will also increase the total volume of log data written for a given workload.
Many Log Based Data Replication products from various vendors require additional supplemental logging at the table level to ensure the full row data for updates with before and after change data is written to the database logs.
re: https://docs.oracle.com/database/121/SUTIL/GUID-D857AF96-AC24-4CA1-B620-8EA3DF30D72E.htm#SUTIL1582
Pulling data through LogMiner may be possible, but you will need to evaluate if it will scale with the data volumes you need.
DMS full-load/CDC also supports Binary Reader, which is a better option than LogMiner. In order to capture updates WITH all the columns, use "ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS" on the Oracle side.
This will push all the columns of an update record to the endpoint, from both Oracle RAC and non-RAC DBs. Also, a pointer for CDC: use TRANSACT_ID on the DMS side to generate a unique sequence for each record. The redo volume will be a little higher, but it is what it is; you can keep an eye on it and drop the supplemental logging at the table level if required.
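For reference, the table-level statements behind that advice are plain Oracle DDL (schema and table names are placeholders):

-- Log all columns for this table so updates carry the full row image in the redo log.
ALTER TABLE myschema.mytable ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Drop it again at the table level if the extra redo becomes an issue.
ALTER TABLE myschema.mytable DROP SUPPLEMENTAL LOG DATA (ALL) COLUMNS;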
Cheers!

Syncing data between services using Kafka JDBC Connector

I have a system with a microservice architecture. It has two services, Service A and Service B, each with its own database, as in the following diagram.
As far as I understand, having a separate database for each service is the better approach. In this design each service is the owner of its data and is responsible for creating, updating, deleting and enforcing constraints.
In order to have Service A data in Database B I was thinking of using JDBC Kafka Connector, but I am not sure if Table1 and Table2 in Database B should enforce constraints from Database A.
If the constraint, like the foreign key from Table2 to Table1 should exist in Database B then, is there a way to have the connector know about this?
What are other common or better ways to sync data or solve this problem?
The easiest solution seems to be syncing per table without any constraints in Database B. That would make things easier, but it could also lead to a situation where Service A's data in Service B is inconsistent, for example entries in Table2 that point to a non-existent entry in Table1.
If the constraint, like the foreign key from Table2 to Table1 should exist in Database B then, is there a way to have the connector know about this?
No, unfortunately the Kafka JDBC Connector does not know about constraints.
Based on your question I assume that Table1 and Table2 are duplicated tables in Database B which exist in Database A. In Database A you have constraints which you are not sure you should add in Database B?
If that is the case then I am not sure if using "Kafka JDBC Connector" to sync data is the best choice.
You have a couple of options:
Enforce constraints like foreign keys in Database B, but update it from your application level and not through the Kafka JDBC Connector. For this option you cannot use the Kafka JDBC Connector; you would need to write a small service/worker that reads the data from that Kafka topic and populates your database tables. This way you control what is saved to the DB and you can validate the constraints before trying to save to your database (a minimal SQL sketch of such a constraint follows this list). But the question here is: do you really need the constraints? They are important in micro-service-A, but do you really need them in micro-service-B, which holds just a copy of the data?
Don't use constraints and allow temporary inconsistency. This is common in the micro-services world. When working with distributed systems you always have to think about the CAP theorem. So you accept that some data might at some point be inconsistent, but you have to make sure that you will eventually bring it back to a consistent state. This means you would need to develop some cleanup/healing mechanism at the application level which recognizes this data and corrects it. So DB constraints do not necessarily have to be enforced on data which the micro-service does not own and which is considered external data to that micro-service's domain.
Rethink your design. Usually we duplicate data from micro-service-A in micro-service-B in order to avoid coupling between the services, so that micro-service-B can live and operate even when micro-service-A is down or not running for some reason. We also do it to reduce the load from micro-service-B on micro-service-A for every operation which needs data from Table1 and Table2. Table1 and Table2 are owned by micro-service-A, and micro-service-A is the only source of truth for this data. Micro-service-B is using a duplicate of that data for its operations.
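If you do decide to enforce the relationship in Database B (the first option above), the constraint itself is plain DDL; a minimal sketch, assuming Table2 references Table1 through a table1_id column (the column names are assumptions):

-- Hypothetical column names; adjust to the actual schema.
ALTER TABLE table2
  ADD CONSTRAINT fk_table2_table1
  FOREIGN KEY (table1_id) REFERENCES table1 (id);

With this in place, a Table2 record that arrives before its parent Table1 record is rejected, which is exactly the ordering problem the consuming worker then has to handle, e.g. with retries or a dead-letter topic.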
Looking at your database design, the following questions might help you figure out what would be the best option for your system:
Is it necessary to duplicate the data in micro-service-B?
If I duplicate the data, do I need both tables, and do I need all their columns/data in micro-service-B? Usually you store/duplicate only the subset of the Entity/Table that you need.
Do I need the same table structure in micro-service-B as in micro-service-A? You have to decide this based on your domain, but very often you denormalize your tables and change them to fit the needs of micro-service-B's operations. As usual, all these design decisions depend on your application domain and use case.

Is KSQL making remote requests under the hood, or is a Table actually a global KTable?

I have a Kafka topic containing customer records, called "customer-created". Each customer is a new record in the topic. There are 4 partitions.
I have two ksql-server instances running, based on the docker image confluentinc/cp-ksql-server:5.3.0. Both use the same KSQL Service Id.
I've created a table:
CREATE TABLE t_customer (id VARCHAR,
                         firstname VARCHAR,
                         lastname VARCHAR)
  WITH (KAFKA_TOPIC = 'customer-created',
        VALUE_FORMAT = 'JSON',
        KEY = 'id');
I'm new to KSQL, but my understanding was that KSQL builds on top of Kafka Streams and that each ksql-server instance is roughly equivalent to a Kafka Streams application instance. The first thing I notice is that as soon as I start a new instance of the ksql-server, it already knows about the tables/streams created on the first instance, even though it is an interactive instance in developer mode. Second, I can select the same customer based on its ID from both instances, but I expected to only be able to do that from one of the instances, because I assumed a KSQL table is equivalent to a KTable, i.e. it should only contain local data, i.e. data from the partitions being processed by that ksql-server instance.
SET 'auto.offset.reset'='earliest';
select * from t_customer where id = '7e1a141b-b8a6-4f4a-b368-45da2a9e92a1';
Regardless of which instance of the ksql-server I attach the ksql-cli to, I get a result. The only way I can get this to work with plain Kafka Streams is to use a global KTable. The fact that I get the result from both instances surprised me a little, because according to the docs, "Only the Kafka Streams DSL has the notion of a GlobalKTable", so I expected only one of the two instances to find the customer. I haven't found any docs that explain how to specify whether a KSQL table should be a local or a global table.
So here is my question: is a KSQL table the equivalent of a global KTable and the docs are misleading, or is the ksql-server instance that I am connected to making a remote request under the hood to the instance responsible for the ID (presumably based on the partition), as described here for Kafka Streams?
KSQL does not support GlobalKTables atm.
Your analogy between a KSQL server and a Kafka Streams program is not 100% accurate though. Each query is a Kafka Streams program (note that a "program" can have multiple instances). Also, there is a difference between persistent queries and transient queries. When you create a TABLE from a topic, the command itself is a metadata operation only (similarly for CREATE STREAM from a topic). In both cases, no query is executed and no Kafka Streams program is started.
The information about all created STREAMs and TABLEs is stored in a shared "command topic" in the Kafka cluster. All servers with the same service ID receive the same information about created streams and tables.
Queries run in the CLI are transient queries and they will be executed by a single server. The information about such transient queries is not distributed to other servers. Basically, a unique query id (i.e., application.id) is generated and the server runs a single-instance KafkaStreams program. Hence, the server/program will subscribe to all partitions.
A persistent query (i.e., CREATE STREAM AS or CREATE TABLE AS) is a query that queries a STREAM or TABLE and produces a STREAM or TABLE as output. The information about persistent queries is distributed via the "command topic" to all servers (however, not all servers will necessarily execute every persistent query -- how many do depends on the configured parallelism). For persistent queries, each server that participates in executing the query creates a KafkaStreams instance running the same program, and all will use the same query id (i.e., application.id); thus the input partitions are distributed across the different servers.
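To make the distinction concrete, a short sketch using the t_customer table from the question (the derived table name and the filter are made up for illustration):

-- Transient query: issued from the CLI and executed only by the server the CLI
-- is attached to, under a generated query id; nothing is distributed to the
-- other servers.
SELECT * FROM t_customer WHERE id = '7e1a141b-b8a6-4f4a-b368-45da2a9e92a1';

-- Persistent query: creates a new table plus backing topic and starts a Kafka
-- Streams program; the statement goes through the command topic, so the
-- participating servers run it under the same query id and share the input
-- partitions between them.
CREATE TABLE t_customer_smith AS
  SELECT * FROM t_customer WHERE lastname = 'Smith';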

Resources