Debezium Docker with Connect table.include.list not working - apache-kafka-connect

I'm using this example to sync Elasticsearch with MSSQL: https://github.com/debezium/debezium-examples/tree/master/unwrap-smt#elasticsearch-sink
I'm running Debezium 1.5 with this MSSQL connector configuration:
{
  "name": "Test-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "192.168.1.234",
    "database.port": "1433",
    "database.user": "user",
    "database.password": "pass",
    "database.dbname": "Test",
    "database.server.name": "MyServer",
    "table.include.list": "dbo.TEST_A",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.testA"
  }
}
According to the Debezium documentation at https://debezium.io/documentation/reference/connectors/sqlserver.html:
table.include.list
An optional comma-separated list of regular expressions that match fully-qualified table identifiers for tables that you want Debezium to capture; any table that is not included in table.include.list is excluded from capture. Each identifier is of the form schemaName.tableName. By default, the connector captures all non-system tables for the designated schemas. Must not be used with table.exclude.list.
When I run this connector, Kafka doesn't respect table.include.list. Listing the topics shows that all tables have been captured; the TEST_A topic is there, but so is a topic for every other table. I also tried "snapshot.include.collection.list": "Test.dbo.TEST_A" without result. What am I missing?

I figured it out: it should be table.whitelist instead of table.include.list. Not sure why, though.
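For reference, a minimal sketch of re-registering the connector with the whitelist property through the Kafka Connect REST API (assuming the Connect worker is reachable on localhost:8083; everything else is taken from the config above):

curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/Test-connector/config \
  -d '{
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "192.168.1.234",
    "database.port": "1433",
    "database.user": "user",
    "database.password": "pass",
    "database.dbname": "Test",
    "database.server.name": "MyServer",
    "table.whitelist": "dbo.TEST_A",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.testA"
  }'

PUT /connectors/{name}/config takes the bare config object (without the "name"/"config" wrapper) and creates or updates the connector in place.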

Related

How to make Kafka fetch data only when a new row is inserted or an old row is updated in a MySQL database, using Kafka streaming

I am using the Confluent Platform JDBC connector; it streams data from MySQL to a Kafka consumer, and an application then inserts that data into another database for reporting purposes.
The problem is that it keeps streaming all the data again and again at a fixed interval. I want only the rows that are newly inserted or updates to previous records.
A timestamp-based approach won't work because the table does not contain any time column, and an incrementing ID is not possible either. Please share any solution.
I have a sample configuration file.
demo.json
{
  "name": "mysql-connector-demo",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "connection.url": "jdbc:mysql://localhost:3306/test",
    "connection.user": "root",
    "connection.password": "1234",
    "topic.prefix": "test",
    "catalog.pattern": "test",
    "mode": "bulk",
    "validate.non.null": false,
    "query": "select * from test ",
    "table.types": "TABLE",
    "topic.prefix": "test-jdbc-",
    "poll.interval.ms": 10000,
    "schema.ignore": true,
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false"
  }
}
But with this, newly inserted and newly updated records do not reach the Kafka consumer.
I want only the rows that are newly inserted or updates to previous records
Then you should be using change-data-capture (CDC) rather than JDBC polling.
Debezium is one such solution.
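For illustration, a minimal sketch of a Debezium MySQL source connector config (all hostnames, credentials, server and topic names below are placeholders, and MySQL's binlog must be enabled for CDC to work):

{
  "name": "mysql-cdc-demo",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "root",
    "database.password": "1234",
    "database.server.id": "184054",
    "database.server.name": "mysqlserver",
    "database.include.list": "test",
    "table.include.list": "test.test",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "dbhistory.test"
  }
}

With this, each insert, update and delete on test.test is published as a change event to the topic mysqlserver.test.test, instead of the whole table being re-read in bulk on every poll.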

Kafka Connect - Caused by: org.apache.kafka.connect.errors.ConnectException: PK mode for table is RECORD_KEY, but record key schema is missing

I have a JDBC sink to transfer data from Kafka to an Oracle database.
My connector gives this error:
Caused by: org.apache.kafka.connect.errors.ConnectException: PK mode for table 'orders' is RECORD_KEY, but record key schema is missing
My sink properties:
{
  "name": "jdbc-oracle",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "connection.url": "jdbc:oracle:thin:#10.1.2.3:1071/orac",
    "connection.user": "ersin",
    "connection.password": "ersin!",
    "auto.create": "true",
    "delete.enabled": "true",
    "pk.mode": "record_key",
    "pk.fields": "MESSAGE_KEY",
    "insert.mode": "update ",
    "plugin.path": "/home/ersin/confluent-5.4.1/share/java/",
    "name": "jdbc-oracle"
  },
  "tasks": [
    {
      "connector": "jdbc-oracle",
      "task": 0
    }
  ],
  "type": "sink"
}
My connect-avro-distributed.properties:
bootstrap.servers=10.0.0.0:9092
group.id=connect-cluster
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://10.0.0.0:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://10.0.0.0:8081
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-statuses
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
I send data like this:
./bin/kafka-avro-console-producer \
--broker-list 10.0.0.0:9092 --topic orders \
--property parse.key="true" \
--property key.schema='{"type":"record","name":"key_schema","fields":[{"name":"id","type":"int"}]}' \
--property key.separator="$" \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"id","type":"int"},{"name":"product","type":"string"}, {"name":"quantity", "type": "int"}, {"name":"price","type": "int"}]}' \
--property schema.registry.url=http://10.0.0.0:8081
How can I solve this?
Thanks in advance.
The problem seems to be with your payload and the configuration "pk.mode": "record_key".
pk.mode is used to define the primary key mode and you have the following config options:
none: No keys utilized
kafka: Kafka coordinates are used as the PK
record_key: Field(s) from the record key are used, which may be a primitive or a struct.
record_value: Field(s) from the record value are used, which must be a struct.
In your configuration, you are using record_key which means that Kafka Connect will take the field from the key of the message and use it as the primary key in the target Oracle table.
Judging from your Kafka Connect worker's configuration, the Avro converter is already in place for both keys and values, so the converters themselves should be fine.
According to the documentation,
The sink connector requires knowledge of schemas, so you should use a
suitable converter e.g. the Avro converter that comes with the schema
registry, or the JSON converter with schemas enabled. Kafka record
keys if present can be primitive types or a Connect struct, and the
record value must be a Connect struct. Fields being selected from
Connect structs must be of primitive types. If the data in the topic
is not of a compatible format, implementing a custom Converter may
be necessary.
Now in your case the problem seems to be "pk.fields", which is currently set to "pk.fields": "MESSAGE_KEY". In your key schema, the message key field is defined as id. Therefore, the following should do the trick:
"pk.fields": "id"

Query-based JDBC source connector in Kafka

I have a legacy database whose primary key column is a string (yeah, I know). I want to do an incremental dump from the Postgres DB into Kafka topics using the JDBC Kafka source connector.
Below is my attempt to recreate the problem:
create table test(
  id varchar(20) primary key,
  name varchar(10)
);
INSERT INTO test(id, name)
VALUES ('1ab', 't'),
       ('2ab', 't'),
       ('3ab', 't');
My config
{"name" : "test_connector",
"config" : {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://12.34.5.6:5432/",
"connection.user": "user",
"connection.password": "password",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"topic.prefix": "incre_",
"mode": "incrementing",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"query" :"SELECT cast(replace(id, 'ab','') as integer) as id , name from test ORDER BY id ASC",
"incrementing.column.name":"id",
"value.converter.schema.registry.url": "http://schema-registry_url.com",
"key.converter.schema.registry.url": "http://schema-registry_url.com",
"offset.flush.timeout.ms": 2000,
}
}
After I posted the config, the status was RUNNING when I checked it with an HTTP curl, and there was no error in the worker's log.
There was also no data in the Kafka topic when I tried a console consumer.
I also tried several other combinations, like adding "table.whitelist": "test".
Another thing I tried was following these two links:
https://rmoff.net/2018/05/21/kafka-connect-and-oracle-data-types/
https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector
but neither helped, not even the smart trick suggested there, like SELECT * from (SELECT id, name from test where ...)
So after a few hours of playing with different configurations, I came back to the official documentation and realised this:
Use a custom query instead of loading tables, allowing you to join data from multiple tables. As long as the query does not include its own filtering, you can still use the built-in modes for incremental queries (in this case, using a timestamp column). Note that this limits you to a single output per connector and because there is no table name, the topic “prefix” is actually the full topic name in this case.
So the key is "topic.prefix": "incre_test"; with a custom query there is no table name, so the prefix is actually the full topic name.
Following up on the previous setting, the proper config should be:
{"name" : "test_connector",
"config" : {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://12.34.5.6:5432/",
"connection.user": "user",
"connection.password": "password",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"topic.prefix": "incre_test",
"mode": "incrementing",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"query" :"SELECT cast(replace(id, 'ab','') as integer) as id , name from test ORDER BY id ASC",
"incrementing.column.name":"id",
"value.converter.schema.registry.url": "http://schema-registry_url.com",
"key.converter.schema.registry.url": "http://schema-registry_url.com",
"offset.flush.timeout.ms": 2000,
}
}
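To verify, the topic can be read back with the Avro console consumer; a sketch follows (the broker address is a placeholder, and the Schema Registry URL matches the config above):

./bin/kafka-avro-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic incre_test \
  --from-beginning \
  --property schema.registry.url=http://schema-registry_url.com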
I am afraid you cannot use your varchar id in incrementing mode because it is not an incrementing column/type. According to Confluent Docs,
Incrementing Column: A single column containing a unique ID for each row, where newer rows are guaranteed to have larger IDs, i.e. an
AUTOINCREMENT column. Note that this mode can only detect new rows.
Updates to existing rows cannot be detected, so this mode should only
be used for immutable data. One example where you might use this mode
is when streaming fact tables in a data warehouse, since those are
typically insert-only.

Kafka connect ElasticSearch sink - using if-else blocks to extract and transform fields for different topics

I have a Kafka Elasticsearch sink properties file like the following:
name=elasticsearch.sink.direct
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=16
topics=data.my_setting
connection.url=http://dev-elastic-search01:9200
type.name=logs
topic.index.map=data.my_setting:direct_my_setting_index
batch.size=2048
max.buffered.records=32768
flush.timeout.ms=60000
max.retries=10
retry.backoff.ms=1000
schema.ignore=true
transforms=InsertKey,ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=MY_SETTING_ID
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=MY_SETTING_ID
This works perfectly for a single topic (data.my_setting). I would like to use the same connector for data coming in from more than one topic. A message in a different topic will have a different key, which I'll need to transform. I was wondering if there's a way to use if-else statements, with a condition on the topic name or on a single field in the message, so that I can transform the key differently. All the incoming messages are JSON with schema and payload.
UPDATE based on the answer:
In my JDBC connector I add the key as follows:
name=data.my_setting
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
poll.interval.ms=500
tasks.max=4
mode=timestamp
query=SELECT * FROM MY_TABLE with (nolock)
timestamp.column.name=LAST_MOD_DATE
topic.prefix=investment.ed.data.app_setting
transforms=ValueToKey
transforms.ValueToKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.ValueToKey.fields=MY_SETTING_ID
However, I still get this error when a message produced by this connector is read by the Elasticsearch sink:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
Caused by: org.apache.kafka.connect.errors.DataException: STRUCT is not supported as the document id
The payload looks like this:
{
  "schema": {
    "type": "struct",
    "fields": [{
      "type": "int32",
      "optional": false,
      "field": "MY_SETTING_ID"
    }, {
      "type": "string",
      "optional": true,
      "field": "MY_SETTING_NAME"
    }],
    "optional": false
  },
  "payload": {
    "MY_SETTING_ID": 9,
    "MY_SETTING_NAME": "setting_name"
  }
}
The Connect standalone properties file looks like this:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/apps/{env}/logs/infrastructure/offsets/connect.offsets
rest.port=8084
plugin.path=/usr/share/java
Is there a way to achieve my goal, which is to have messages from multiple topics (in my case DB tables), each with its own unique ID (which will also be the ID of a document in ES), sent to a single ES sink?
Can I use Avro for this task? Is there a way to define the key in the Schema Registry, or will I run into the same problem?
This isn't possible. You'd need multiple Connectors if the key fields are different.
One option to think about is pre-processing your Kafka topics through a stream processor (e.g. Kafka Streams, KSQL, Spark Streaming, etc.) to standardise the key fields, so that you can then use a single connector. Whether that is worth doing or overkill depends on what you're building.
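If you go the multiple-connector route, each sink can carry its own ValueToKey/ExtractField pair. A sketch for a second, hypothetical topic data.other_setting keyed on OTHER_SETTING_ID (both names are placeholders, the rest mirrors the config above):

name=elasticsearch.sink.other
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=16
topics=data.other_setting
connection.url=http://dev-elastic-search01:9200
type.name=logs
topic.index.map=data.other_setting:direct_other_setting_index
schema.ignore=true
transforms=InsertKey,ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=OTHER_SETTING_ID
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=OTHER_SETTING_ID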

Kafka Connect stops after 300k records

I am trying to sink my MySQL table to Elasticsearch. My table has over 1 million records. The issue is that Elasticsearch stops receiving records after some 300 thousand have been inserted. I know that the first time I ran it, it did run through all the records; this happened when I tried to run it again after deleting the ES index. I have tried resetting the update_ts field to a new timestamp. I have tried the offset value in the sink. Nothing seems to be working.
Here is my source file
{
  "name": "items3",
  "config": {
    "_comment": "The JDBC connector class. Don't change this if you want to use the JDBC Source.",
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "_comment": "How to serialise the value of keys - here use the Confluent Avro serialiser. Note that the JDBC Source Connector always returns null for the key ",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "_comment": "Since we're using Avro serialisation, we need to specify the Confluent schema registry at which the created schema is to be stored. NB Schema Registry and Avro serialiser are both part of Confluent Open Source.",
    "key.converter.schema.registry.url": "http://localhost:8081",
    "_comment": "As above, but for the value of the message. Note that these key/value serialisation settings can be set globally for Connect and thus omitted for individual connector configs to make them shorter and clearer",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "_comment": " --- JDBC-specific configuration below here --- ",
    "_comment": "JDBC connection URL. This will vary by RDBMS. Consult your manufacturer's handbook for more information",
    "connection.url": "jdbc:mysql://localhost:3306/db?user=user&password=password",
    "_comment": "Which table(s) to include",
    "table.whitelist": "items",
    "_comment": "Pull all rows based on a timestamp column. You can also do bulk or incrementing column-based extracts. For more information, see http://docs.confluent.io/current/connect/connect-jdbc/docs/source_config_options.html#mode",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "update_ts",
    "_comment": "If the column is not defined as NOT NULL, tell the connector to ignore this ",
    "validate.non.null": "true",
    "_comment": "The Kafka topic will be made up of this prefix, plus the table name ",
    "topic.prefix": "kafka-",
    "auto.offset.reset": "earliest"
  }
}
And here is my sink
{
  "name": "items-sink",
  "config": {
    "_comment": "-- standard converter stuff -- this can actually go in the worker config globally --",
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://localhost:8081",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "_comment": "--- Elasticsearch-specific config ---",
    "_comment": "Elasticsearch server address",
    "connection.url": "http://localhost:9200",
    "_comment": "Elasticsearch mapping name. Gets created automatically if doesn't exist ",
    "type.name": "items",
    "_comment": "Which topic to stream data from into Elasticsearch",
    "topics": "kafka-items",
    "auto.offset.reset": "earliest",
    "_comment": "If the Kafka message doesn't have a key (as is the case with JDBC source) you need to specify key.ignore=true. If you don't, you'll get an error from the Connect task: 'ConnectException: Key is used as document id and can not be null.",
    "key.ignore": "true"
  }
}
As you can see, I am trying to set auto.offset.reset to earliest so that, if it is keeping track of my records somehow, it will start over, but all in vain.
"auto.offset.reset" : "earliest" can only be used inside the connect-distributed.properties file, not the JSON connector configurations
And in that file, since it's a consumer configuration, it's named consumer.auto.offset.reset.
Also, the consumer group is mapped to the name field of the connector configuration, so unless that's changed, you'd be continuing to consume from where the previous one of the same name left off until the group offsets are reset or the name is changed. By default, the group name is connect-${connector_name}
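For example, a sketch of resetting that consumer group with the standard Kafka CLI (stop the connector first; the group name assumes the sink connector is named items-sink as above, and the broker address is a placeholder):

./bin/kafka-consumer-groups \
  --bootstrap-server localhost:9092 \
  --group connect-items-sink \
  --topic kafka-items \
  --reset-offsets --to-earliest --execute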
