Use ByteArrayFormat with TimeBasedPartitioner that extracts using RecordField

I'm trying to use a TimeBasedPartitioner that extracts the timestamp from a record field (RecordField) with the following configuration:
{
"name": "s3-sink",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "10",
"topics": "topics1.topics2",
"s3.region": "us-east-1",
"s3.bucket.name": "bucket",
"s3.part.size": "5242880",
"s3.compression.type": "gzip",
"timezone": "UTC",
"rotate.schedule.interval.ms": "900000",
"flush.size": "1000000",
"schema.compatibility": "NONE",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
"partition.duration.ms": "900000",
"locale": "en",
"timestamp.extractor": "RecordField",
"timestamp.field": "time",
"key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"key.converter.schemas.enabled": false,
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"value.converter.schemas.enabled": false,
"interal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
"internal.key.converter.schemas.enabled": false,
"interal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
"internal.value.converter.schemas.enabled": false,
}
I keep getting the following error, and I'm not finding much that explains what is going on. I looked at the source code, and it appears the record value is not a Struct or Map type, so I'm wondering if there is an issue with using ByteArrayFormat.
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.confluent.connect.storage.errors.PartitionException: Error encoding partition.
at io.confluent.connect.storage.partitioner.TimeBasedPartitioner$RecordFieldTimestampExtractor.extract(TimeBasedPartitioner.java:294)
at io.confluent.connect.s3.TopicPartitionWriter.executeState(TopicPartitionWriter.java:199)
at io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:176)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:195)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:524)
I've been able to write out using the default partitioner.
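For reference, two of the documented timestamp.extractor values do not need a Struct or Map value, so they still work when the value converter is ByteArrayConverter. A minimal sketch showing only the keys that would change from the config above (whether partitioning by the Kafka record timestamp is acceptable here is an assumption):
{
"partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
"timestamp.extractor": "Record"
}
"Record" partitions by the timestamp stored on the Kafka record itself and "Wallclock" by processing time; "RecordField" requires the deserialized value to be a Connect Struct or Map, which ByteArrayConverter never produces.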

Related

None of log files contains offset SCN: 9483418337736, re-snapshot is required

I am deploying a Debezium connector for Oracle. When I deploy the connector, it takes a consistent snapshot and captures all the records being inserted into the table. After a while, ranging from a few minutes to hours, it fails, complaining that the SCN could not be found. When I perform a fresh snapshot, it starts working and then fails after a while with the same error.
Following is the deployed config.
{
"name": "debezium-connector",
"config": {
"connector.class": "io.debezium.connector.oracle.OracleConnector",
"message.key.columns": "COMMON.SETTLEMENT_DAY:SETTLEMENT_DAY",
"tasks.max": "1",
"database.history.kafka.topic": "debezium.schema-changes.inventory",
"log.mining.strategy": "online_catalog",
"signal.data.collection": "ABC.debezium_signal",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"log.mining.archive.log.hours": "2",
"database.user": "ABC",
"database.dbname": "EDPM",
"database.connection.adapter": "logminer",
"database.history.kafka.bootstrap.servers": "10.0.0.36:9092,10.0.0.37:9092,10.0.0.38:9092",
"database.url": "jdbc:oracle:thin:#localhost:1521/EDPM",
"time.precision.mode": "connect",
"database.server.name": "DEBEZIUM",
"event.processing.failure.handling.mode": "warn",
"heartbeat.interval.ms": "300000",
"database.port": "1521",
"key.converter.schemas.enable": "true",
"database.hostname": "localhost",
"database.password": "************",
"value.converter.schemas.enable": "true",
"name": "debezium-connector",
"table.include.list": "COMMON.SETTLEMENT_DAY,ABC.debezium_signal",
"snapshot.mode": "schema_only"
}
}
I am getting the following error:
{
"name": "debezium-connector",
"connector": {
"state": "RUNNING",
"worker_id": "0.0.0.0:28084"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "0.0.0.0:28084",
"trace": "org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.\n\tat
io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:42)\n\tat
io.debezium.connector.oracle.logminer.LogMinerStreamingChangeEventSource.execute(LogMinerStreamingChangeEventSource.java:211)\n\tat
io.debezium.connector.oracle.logminer.LogMinerStreamingChangeEventSource.execute(LogMinerStreamingChangeEventSource.java:63)\n\tat
io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:159)\n\tat
io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:122)\n\tat
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
java.lang.Thread.run(Thread.java:748)\n
Caused by: java.lang.IllegalStateException: None of log files contains offset SCN: 9483418337736, re-snapshot is required.\n\tat
io.debezium.connector.oracle.logminer.LogMinerHelper.setLogFilesForMining(LogMinerHelper.java:480)\n\tat
io.debezium.connector.oracle.logminer.LogMinerStreamingChangeEventSource.initializeRedoLogsForMining(LogMinerStreamingChangeEventSource.java:248)\n\tat
io.debezium.connector.oracle.logminer.LogMinerStreamingChangeEventSource.execute(LogMinerStreamingChangeEventSource.java:167)\n\t... 8 more\n"
}
],
"type": "source"
}

Facing issues with Kafka keys while building a SQL audit system using Kafka Connect & Debezium

I have a table "books" in the database "motor". This is my source, and for the source connection I created a topic "mysql-books". So far all good: I am able to see messages on Confluent Control Center. Now I want to sink these messages into another database called "motor-audit", so that in audit I should see all the changes that happened to the table "books". I have given the topic "mysql-books" in my sink curl for the sink connector, since changes are being published to this topic.
My source config -
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
"name": "jdbc_source_mysql_001",
"config": {
"value.converter.schema.registry.url": "http://0.0.0.0:8081",
"key.converter.schema.registry.url": "http://0.0.0.0:8081",
"name": "jdbc_source_mysql_001",
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"connection.url": "jdbc:mysql://localhost:3306/motor",
"connection.user": "yagnesh",
"connection.password": "yagnesh123",
"catalog.pattern": "motor",
"mode": "bulk",
"poll.interval.ms": "10000",
"topic.prefix": "mysql-",
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
}
}'
My Sink config -
curl -X PUT http://localhost:8083/connectors/jdbc_sink_mysql_001/config \
-H "Content-Type: application/json" -d '{
"value.converter.schema.registry.url": "http://0.0.0.0:8081",
"value.converter.schemas.enable": "true",
"key.converter.schema.registry.url": "http://0.0.0.0:8081",
"name": "jdbc_sink_mysql_001",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"topics":"mysql-books",
"connection.url": "jdbc:mysql://mysql:3306/motor",
"connection.user": "yagnesh",
"connection.password": "yagnesh123",
"insert.mode": "insert",
"auto.create": "true",
"auto.evolve": "true"
}'
This is how the messages on the topic look: the keys show up as raw bytes. But even if I use either AvroConverter or StringConverter for the key, and keep it the same in both source and sink, I still face the same error.
The database table in play is created with this schema -
CREATE TABLE `motor`.`books` (
`id` INT NOT NULL AUTO_INCREMENT,
`author` VARCHAR(45) NULL,
PRIMARY KEY (`id`));
With all this in place, I am facing this error -
io.confluent.rest.exceptions.RestNotFoundException: Subject 'mysql-books-key' not found.
at io.confluent.kafka.schemaregistry.rest.exceptions.Errors.subjectNotFoundException(Errors.java:69)
Edit: I modified the URL in the sink to use localhost, set StringConverter for the key, and kept AvroConverter for the value, and now I am getting a new error, which is -
Caused by: java.sql.SQLException: Exception chain:
java.sql.SQLSyntaxErrorException: BLOB/TEXT column 'id' used in key specification without a key length
Edit 2:
As suggested by @Onecricketeer, I am trying Debezium and using the below config for MySqlConnector. I have already enabled the binlog in mysqld.cnf, but upon launching I am getting errors like -
Caused by: org.apache.kafka.connect.errors.DataException: Field does not exist: id
This is my debezium config -
{
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"value.converter.schema.registry.url": "http://0.0.0.0:8081",
"transforms.extractInt.field": "id",
"transforms.createKey.fields": "id",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"key.converter.schema.registry.url": "http://0.0.0.0:8081",
"name": "mysql-connector-deb-demo",
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"transforms": [
"createKey",
"extractInt",
"unwrap"
],
"database.hostname": "localhost",
"database.port": "3306",
"database.user": "yagnesh",
"database.password": "**********",
"database.server.name": "mysql",
"database.server.id": "1",
"event.processing.failure.handling.mode": "ignore",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "dbhistory.demo",
"table.whitelist": [
"motor.books"
],
"table.include.list": [
"motor.books"
],
"include.schema.changes": "true"
}
Before using "unwrap" I was facing mismatched input '-' expecting <EOF> SQL
hence upon looking for this fixed this using "unwrap" following this question - Fix for mismatched input.
Let me know if this is actually needed or not.
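For reference, SMTs run in the order they are listed under "transforms", and until ExtractNewRecordState runs, the Debezium value is still the change-event envelope (before/after/source/op), so "id" is not a top-level field. A minimal sketch of the transform keys in that order (an assumption about the cause of the "Field does not exist: id" error, not a verified fix; the rest of the config is unchanged):
{
"transforms": "unwrap,createKey,extractInt",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id"
}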

S3 connector with HourlyPartitioner failing

When we tried to write into S3 through the S3 sink connector with the default config, it worked fine without any issue. But when we tried with the hourly partitioner, it failed with the below error.
Please find both configs and the error message, and help us resolve this issue.
Default :
{
"value.converter.schemas.enable": "false",
"name": "tibconew1-test-s3standard-default-sink-connector",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "2",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"errors.tolerance": "all",
"topics": [
"test.s3custom.default.dax.shipment.data",
"test.s3custom.default.dax.shipment.data",
"test.s3custom.hourly.onprem.tibco.dax_shipment.dpp_asn"
],
"topics.regex": "",
"errors.deadletterqueue.topic.name": "dlq_test.s3custom.default.dax.shipment.data",
"errors.deadletterqueue.context.headers.enable": "true",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"flush.size": "1000",
"s3.bucket.name": "test-stg-raw",
"s3.region": "us-east-1",
"s3.credentials.provider.class": "com.amazonaws.auth.InstanceProfileCredentialsProvider",
"s3.acl.canned": "bucket-owner-full-control",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"topics.dir": "streams_dir",
"partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner"
}
Hourly :
{
"value.converter.schema.registry.url": "https://confschema.test-dsol-core.testdigital-stg.com",
"value.converter.schemas.enable": "false",
"name": "test.s3custom.hourly.tibco.dax_shipment.dpp_asn.sink-connector",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "2",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"errors.tolerance": "all",
"topics": [
"test.s3custom.hourly.onprem.tibco.dax_shipment.dpp_asn"
],
"topics.regex": "",
"errors.deadletterqueue.topic.name": "dlq_test.s3custom.hourly.onprem.tibco.dax_shipment.dpp_asn.sink",
"errors.deadletterqueue.context.headers.enable": "true",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"flush.size": "10",
"s3.bucket.name": "test-stg-raw",
"s3.region": "us-east-1",
"s3.credentials.provider.class": "com.amazonaws.auth.InstanceProfileCredentialsProvider",
"s3.acl.canned": "bucket-owner-full-control",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"topics.dir": "streams_dir",
"partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
"locale": "en-US",
"timezone": "America/Chicago",
"timestamp.extractor": "RecordField",
"timestamp.field": "DPP_ASN.LST_UPDT_TS"
}
Update :
Finally we found the reason: the timestamp received in the payload was in an invalid format (it had an additional space in it), so we corrected the format on the source side. For the hourly partitioner, the connector expects a timestamp value it can resolve to the hour.
Hourly Partitioner:
io.confluent.connect.storage.partitioner.HourlyPartitioner is equivalent to the TimeBasedPartitioner with path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH and partition.duration.ms=3600000 (one hour).
Message was : "LST_UPDT_TS":"2021-02-01 07:16:23.567"
Corrected as : "LST_UPDT_TS":"2015-08-01T17:00:00.69243-05:00"
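For reference, a sketch of the TimeBasedPartitioner settings that the documentation describes as equivalent to HourlyPartitioner (partition.duration.ms of 3600000 is one hour; the remaining values are copied from the hourly config above):
{
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"partition.duration.ms": "3600000",
"locale": "en-US",
"timezone": "America/Chicago",
"timestamp.extractor": "RecordField",
"timestamp.field": "DPP_ASN.LST_UPDT_TS"
}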

Having a problem with the Flatten value transformation

I am attempting to flatten a topic before sending it along to my Postgres DB, using something like the connector below. I am using the Confluent 4.1.1 Kafka Connect docker image, the only changes being that I copied a custom connector jar into /usr/share/java and am running it under a different account.
version (kafka connect) "1.1.1-cp1"
commit "0a5db4d59ee15a47"
{
"name": "problematic_postgres_sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schema.registry.url": "http://kafkaschemaregistry.service.consul:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://kafkaschemaregistry.service.consul:8081",
"connection.url": "jdbc:postgresql://123.123.123.123:5432/mypostgresdb",
"connection.user": "abc",
"connection.password": "xyz",
"insert.mode": "upsert",
"auto.create": true,
"auto.evolve": true,
"topics": "mytopic",
"pk.mode": "kafka",
"transforms": "Flatten",
"transforms.Flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.Flatten.delimiter": "_"
}
}
I get a 400 error code:
Connector configuration is invalid and contains the following 1 error(s): Invalid value class org.apache.kafka.connect.transforms.Flatten for configuration transforms.Flatten.type: Error getting config definition from Transformation: null
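For reference, a sketch of what Flatten$Value with delimiter "_" produces once the transformation loads correctly (the record below is hypothetical, not taken from the question). A value like:
{
"id": 1,
"address": {
"street": "Main St",
"city": "Springfield"
}
}
becomes:
{
"id": 1,
"address_street": "Main St",
"address_city": "Springfield"
}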

Kafka Connect failing to flush records to Elasticsearch

I'm running a simple Kafka docker instance and trying to insert data into an Elasticsearch instance; however, I'm seeing this kind of exception:
[2018-01-08 16:17:20,839] ERROR Failed to execute batch 36528 of 1 records after total of 6 attempt(s) (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:48)
at io.confluent.connect.elasticsearch.BulkIndexingClient.execute(BulkIndexingClient.java:57)
at io.confluent.connect.elasticsearch.BulkIndexingClient.execute(BulkIndexingClient.java:34)
at io.confluent.connect.elasticsearch.bulk.BulkProcessor$BulkTask.execute(BulkProcessor.java:350)
at io.confluent.connect.elasticsearch.bulk.BulkProcessor$BulkTask.call(BulkProcessor.java:327)
at io.confluent.connect.elasticsearch.bulk.BulkProcessor$BulkTask.call(BulkProcessor.java:313)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My Connect config is as follows:
{
"name": "elasticsearch-analysis",
"config": {
"tasks.max": 1,
"topics": "analysis",
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"connection.url": "http://elasticsearch:9200",
"topic.index.map": "analysis:analysis",
"schema.ignore": true,
"key.ignore": false,
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema_registry:8081",
"type.name": "analysis",
"batch.size": 200,
"flush.timeout.ms": 600000,
"transforms":"insertKey,extractId",
"transforms.insertKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.insertKey.fields": "Id",
"transforms.extractId.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractId.field":"Id"
}
}
There's not much data in the topic, just about 70,000 unique messages.
As you can see, I've increased the flush timeout and reduced the batch size, but I still experience these timeouts.
I was unable to find what the fix for it could be.
Possibly your index is refreshing too quickly (the default is 1 second). Try updating it to something less frequent, or even turning it off initially.
curl -X PUT http://$ES_HOST/$ELASTICSEARCH_INDEX_NEW/_settings \
-H "Content-Type: application/json" -d '
{
"index" : {
"refresh_interval" : "15s"
}
}'
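Another pair of settings sometimes relevant to java.net.SocketTimeoutException: Read timed out are the connector's own HTTP client timeouts; a sketch with illustrative values (the property names are from the Confluent Elasticsearch sink connector; suitable values depend on the cluster and are an assumption):
{
"read.timeout.ms": "60000",
"connection.timeout.ms": "5000"
}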
