ExtractField and Parse JSON in kafka-connect sink - elasticsearch

I have a Kafka Connect flow of MongoDB -> Kafka Connect -> Elasticsearch sending data end to end OK, but the payload document ends up as a JSON-encoded string. Here's my source MongoDB document:
{
  "_id": "1541527535911",
  "enabled": true,
  "price": 15.99,
  "style": {
    "color": "blue"
  },
  "tags": [
    "shirt",
    "summer"
  ]
}
And here's my mongodb source connector configuration:
{
  "name": "redacted",
  "config": {
    "connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",
    "databases": "redacted.redacted",
    "initial.import": "true",
    "topic.prefix": "redacted",
    "tasks.max": "8",
    "batch.size": "1",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
    "key.serializer.schemas.enable": false,
    "value.serializer.schemas.enable": false,
    "compression.type": "none",
    "mongo.uri": "mongodb://redacted:27017/redacted",
    "analyze.schema": false,
    "schema.name": "__unused__",
    "transforms": "RenameTopic",
    "transforms.RenameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.RenameTopic.regex": "redacted.redacted_Redacted",
    "transforms.RenameTopic.replacement": "redacted"
  }
}
Over in elasticsearch, it ends up looking like this:
{
  "_index" : "redacted",
  "_type" : "kafka-connect",
  "_id" : "{\"schema\":{\"type\":\"string\",\"optional\":true},\"payload\":\"1541527535911\"}",
  "_score" : 1.0,
  "_source" : {
    "ts" : 1541527536,
    "inc" : 2,
    "id" : "1541527535911",
    "database" : "redacted",
    "op" : "i",
    "object" : "{ \"_id\" : \"1541527535911\", \"price\" : 15.99, \"enabled\" : true, \"tags\" : [\"shirt\", \"summer\"], \"style\" : { \"color\" : \"blue\" } }"
  }
}
I'd like to use two single message transforms:
1. ExtractField to grab object, which is a string of JSON
2. Something to parse that JSON into an object, or just let the normal JSONConverter handle it, as long as it ends up properly structured in Elasticsearch.
I've attempted to do it with just ExtractField in my sink config, but I see this error logged by Kafka Connect:
kafka-connect_1 | org.apache.kafka.connect.errors.ConnectException:
Bulk request failed: [{"type":"mapper_parsing_exception",
"reason":"failed to parse",
"caused_by":{"type":"not_x_content_exception",
"reason":"Compressor detection can only be called on some xcontent bytes or
compressed xcontent bytes"}}]
Here's my Elasticsearch sink connector configuration. In this version I have things working, but I had to code a custom ParseJson SMT. It's working well, but if there's a better way, or a way to do this with some combination of built-in stuff (converters, SMTs, whatever works), I'd love to see that.
{
  "name": "redacted",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "batch.size": 1,
    "connection.url": "http://redacted:9200",
    "key.converter.schemas.enable": true,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "schema.ignore": true,
    "tasks.max": "1",
    "topics": "redacted",
    "transforms": "ExtractFieldPayload,ExtractFieldObject,ParseJson,ReplaceId",
    "transforms.ExtractFieldPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldPayload.field": "payload",
    "transforms.ExtractFieldObject.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldObject.field": "object",
    "transforms.ParseJson.type": "reaction.kafka.connect.transforms.ParseJson",
    "transforms.ReplaceId.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceId.renames": "_id:id",
    "type.name": "kafka-connect",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": false
  }
}

I am not sure about your Mongo connector; I don't recognize the class or the configurations... Most people probably use the Debezium Mongo connector. I would set it up this way, though:
"connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",
"key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
"key.serializer.schemas.enable": false,
"value.serializer.schemas.enable": true,
The schemas.enable setting is important; that way the internal Connect data classes know how to convert to/from other formats. Then, in the sink, you again need to use JSON deserialization (via the converter) so that it creates a full object rather than the plaintext string you currently see in Elasticsearch ({\"schema\":{\"type\":\"string\").
"connector.class":
"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": false,
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true
And if this doesn't work, then you might have to manually create your index mapping in Elasticsearch ahead of time so it knows how to actually parse the strings you are sending it.
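As a rough sketch of that, assuming a pre-7.x cluster that still uses mapping types (the _type of kafka-connect in your search output suggests one), a manual mapping for the fields in your source document might look something like this; the field types here are only my guesses:
PUT /redacted
{
  "mappings": {
    "kafka-connect": {
      "properties": {
        "price": { "type": "double" },
        "enabled": { "type": "boolean" },
        "tags": { "type": "keyword" },
        "style": {
          "properties": {
            "color": { "type": "keyword" }
          }
        }
      }
    }
  }
}
Note that schema.ignore=true in your sink only means the connector won't push a mapping derived from the record schema; Elasticsearch will still use whatever mapping the index already has.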

Related

How connector-source works with "query" and "mode": timestamp+incrementing

Can someone explain to me how the source connector works when using "query" with "mode": timestamp+incrementing? It works perfectly when the query is small, but with many records it becomes impossible. From what I see, it re-runs the entire query over and over again.
This is my source connector:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "tasks.max": "1",
  "connection.url": "jdbc:informix-sqli://ip:port/sis:informixserver=mibase",
  "connection.user": "informix",
  "connection.password": "pass",
  "query": "SELECT * FROM my_vw",
  "topic.prefix": "novedades",
  "db.timezone": "America/Argentina/Buenos_Aires",
  "dialect.name": "GenericDatabaseDialect",
  "timestamp.granularity": "connect_logical",
  "poll.interval.ms": "10000",
  "mode": "timestamp+incrementing",
  "schema.pattern": "informix",
  "timestamp.column.name": "last_date",
  "incrementing.column.name": "id",
  "validate.non.null": false,
  "numeric.mapping": "best_fit",
  "transforms": "copyFieldToKey,extractKeyFromStruct,removeKeyFromValue",
  "transforms.copyFieldToKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "transforms.copyFieldToKey.fields": "id",
  "transforms.extractKeyFromStruct.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
  "transforms.extractKeyFromStruct.field": "id",
  "transforms.removeKeyFromValue.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.removeKeyFromValue.blacklist": "id",
  "key.converter": "org.apache.kafka.connect.converters.LongConverter"
}
The filters work well, because it only brings me the changed and/or new records, but apparently it re-runs the entire query, and if the view has millions of records, that total is read over and over again.

dropfield transform sink connector: (STRUCT) type doesn't have a mapping to the SQL database column type

I created a sink connector from Kafka to MySQL. After adding a transform to the sink connector's config to delete some columns, I get this error, whereas without the transform it works:
(STRUCT) type doesn't have a mapping to the SQL database column type
{
  "name": "mysql-conf-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "3",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "topics": "mysql.cars.prices",
    "transforms": "dropPrefix,unwrap",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "mysql.cars.prices",
    "transforms.dropPrefix.replacement": "prices",
    "transforms.timestamp.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
    "transforms.timestamp.target.type": "Timestamp",
    "transforms.timestamp.field": "date_time",
    "transforms.timestamp.format": "yyyy-MM-dd HH:mm:ss",
    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "connection.url": "jdbc:mysql://localhost:3306/product",
    "connection.user": "kafka",
    "connection.password": "123456",
    "transforms": "ReplaceField",
    "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceField.blacklist": "id, brand",
    "insert.mode": "insert",
    "auto.create": "true",
    "auto.evolve": "true",
    "batch.size": 50000
  }
}
You have put the "transforms" key more than once in your JSON, which isn't valid. Try with one entry:
"transforms": "unwrap,ReplaceField,dropPrefix",
You are getting the error because you have overridden that value, and unwrap, specifically, is no longer called, so you end up with nested Structs.
The blacklist property got renamed to exclude, by the way - https://docs.confluent.io/platform/current/connect/transforms/replacefield.html#properties
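Put together, the transform section of your sink config would look roughly like this. I'm assuming unwrap is meant to be Debezium's ExtractNewRecordState, since its type isn't defined anywhere in the config you posted, and I've used exclude in place of the deprecated blacklist; I've also left out the timestamp transform, since neither of your transforms lists referenced it:
"transforms": "unwrap,ReplaceField,dropPrefix",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.ReplaceField.exclude": "id, brand",
"transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex": "mysql.cars.prices",
"transforms.dropPrefix.replacement": "prices"
Each transforms.<name>.* block then appears exactly once, alongside the single transforms list.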

Kafka connect Jdbc sink connector not auto creating tables

I am using Docker images of Kafka and Kafka Connect to test CDC using Debezium, and the database is a standalone one. My sink connector config JSON looks like this:
{
  "name": "jdbc-sink-test-oracle",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "dialect.name": "OracleDatabaseDialect",
    "table.name.format": "TEST",
    "topics": "oracle-db-source.DBZ_SRC.TEST",
    "connection.url": "jdbc:oracle:thin:#hostname:1521/DB",
    "connection.user": "DBZ_TARGET",
    "connection.password": "DBZ_TARGET",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true",
    "auto.create": "true",
    "insert.mode": "upsert",
    "delete.enabled": "true",
    "pk.fields": "ID",
    "pk.mode": "record_key"
  }
}
And my source connector config JSON looks like this:
{
  "name": "test-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "tasks.max": "1",
    "database.server.name": "oracle-db-source",
    "database.hostname": "hostname",
    "database.port": "1521",
    "database.user": "clogminer",
    "database.password": "clogminer",
    "database.dbname": "DB",
    "database.oracle.version": "19",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.DBZ_SRC",
    "database.connection.adapter": "logminer",
    "table.include.list": "DBZ_SRC.TEST",
    "database.schema": "DBZ_SRC",
    "errors.log.enable": "true",
    "snapshot.lock.timeout.ms": "5000",
    "include.schema.changes": "true",
    "snapshot.mode": "initial",
    "decimal.handling.mode": "double"
  }
}
And I am getting this error for the above configuration:
Error : 942, Position : 11, Sql = merge into "TEST" using (select :1 "ID", :2 "NAME", :3 "DESCRIPTION", :4 "WEIGHT" FROM dual) incoming on("TEST"."ID"=incoming."ID") when matched then update set "TEST"."NAME"=incoming."NAME","TEST"."DESCRIPTION"=incoming."DESCRIPTION","TEST"."WEIGHT"=incoming."WEIGHT" when not matched then insert("TEST"."NAME","TEST"."DESCRIPTION","TEST"."WEIGHT","TEST"."ID") values(incoming."NAME",incoming."DESCRIPTION",incoming."WEIGHT",incoming."ID"), OriginalSql = merge into "TEST" using (select ? "ID", ? "NAME", ? "DESCRIPTION", ? "WEIGHT" FROM dual) incoming on("TEST"."ID"=incoming."ID") when matched then update set "TEST"."NAME"=incoming."NAME","TEST"."DESCRIPTION"=incoming."DESCRIPTION","TEST"."WEIGHT"=incoming."WEIGHT" when not matched then insert("TEST"."NAME","TEST"."DESCRIPTION","TEST"."WEIGHT","TEST"."ID") values(incoming."NAME",incoming."DESCRIPTION",incoming."WEIGHT",incoming."ID"), Error Msg = ORA-00942: table or view does not exist
at io.confluent.connect.jdbc.sink.JdbcSinkTask.getAllMessagesException(JdbcSinkTask.java:150)
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:102)
... 11 more
2022-07-28 15:01:58,644 ERROR || WorkerSinkTask{id=jdbc-sink-test-oracle-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:610)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:330)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:237)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.connect.errors.ConnectException: java.sql.SQLException: Exception chain:
java.sql.BatchUpdateException: ORA-00942: table or view does not exist
java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
I believe that, according to the configuration I gave, the table should be auto-created, but it says the table doesn't exist. However, the sink connector configuration below works fine: it auto-creates the table named 'TEST2' and also exports the data from source to that table:
{
  "name": "jdbc-sink-test2-oracle",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "dialect.name": "OracleDatabaseDialect",
    "table.name.format": "TEST2",
    "topics": "oracle-db-source.DBZ_SRC.TEST",
    "connection.url": "jdbc:oracle:thin:#hostname:1521/DB",
    "connection.user": "DBZ_TARGET",
    "connection.password": "DBZ_TARGET",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "auto.create": "true",
    "insert.mode": "upsert",
    "delete.enabled": "true",
    "pk.fields": "ID",
    "pk.mode": "record_key"
  }
}
Edit:
The sink connector works fine if a target table with the same name as the source table has already been created with the same DDL, but the target table is not auto-created when it is not already present.

illegal_argument_exception while using kafka connector to push to elasticsearch

I am trying to push logs from Kafka topics to Elasticsearch. My message in Kafka:
{
  "#timestamp": 1589549688.659166,
  "log": "13:34:48.658 [pool-2-thread-1] DEBUG health check success",
  "stream": "stdout",
  "time": "2020-05-15T13:34:48.659166158Z",
  "pod_name": "my-pod-789f8c85f4-mt62l",
  "namespace_name": "services",
  "pod_id": "600ca012-91f5-XXXX-XXXX-XXXXXXXXXXX",
  "host": "ip-192-168-88-59.ap-south-1.compute.internal",
  "container_name": "my-pod",
  "docker_id": "XXXXXXXXXXXXXXXXX1435bb2870bfc9d20deb2c483ce07f8e71ec",
  "container_hash": "myregistry",
  "labelpod-template-hash": "9tignfe9r",
  "labelsecurity.istio.io/tlsMode": "istio",
  "labelservice": "my-pod",
  "labelservice.istio.io/canonical-name": "my-pod",
  "labelservice.istio.io/canonical-revision": "latest",
  "labeltype": "my-pod",
  "annotationkubernetes.io/psp": "eks.privileged",
  "annotationsidecar.istio.io/status": "{\"version\":\"58dc8b12bb311f1e2f46fd56abfe876ac96a38d7ac3fc6581af3598ccca7522f\"}"
}
This is my connector config:
{
  "name": "logs",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "connection.url": "http://es:9200",
    "connection.username": "username",
    "connection.password": "password",
    "tasks.max": "10",
    "topics": "my-pod",
    "name": "logs",
    "type.name": "_doc",
    "schema.ignore": "true",
    "key.ignore": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "transforms": "routeTS",
    "transforms.routeTS.type": "org.apache.kafka.connect.transforms.TimestampRouter",
    "transforms.routeTS.topic.format": "${topic}-${timestamp}",
    "transforms.routeTS.timestamp.format": "YYYYMMDD"
  }
}
This is the error I'm getting:
cp-kafka-connect-server [2020-05-15 13:30:59,083] WARN Failed to execute batch 4830 of 18 records with attempt 4/6, will attempt retry after 539 ms. Failure reason: Bulk request failed: [{"type":"illegal_argument_exception","reason":"mapper [labelservice] of different type, current_type [text], merged_type [ObjectMapper]"}
I haven't created any mapping beforehand; I'm depending on the connector to create the index. This is the mapping I have in ES, which was auto-created:
{
"mapping": {}
}
The error message is clear:
"reason": "mapper [labelservice] of different type, current_type [text], merged_type [ObjectMapper]"
It means that in your index mapping, labelservice is defined as text, but you are sending the data below in the labelservice field:
"labelservice": "my-pod",
"labelservice.istio.io/canonical-name": "my-pod",
"labelservice.istio.io/canonical-revision": "latest",
This is the format of the object type in Elasticsearch, so there is a mismatch in the data types, which caused the error message. You need to change your mapping and define labelservice as an object to make it work. Refer to the object datatype in Elasticsearch for more info.
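As a minimal sketch, assuming an index name produced by your TimestampRouter (for example my-pod-20200515) and a pre-7.x cluster where the _doc type name is still declared in the mapping, that could look like:
PUT my-pod-20200515
{
  "mappings": {
    "_doc": {
      "properties": {
        "labelservice": { "type": "object" }
      }
    }
  }
}
You would need to create this (or an index template covering my-pod-*) before the connector writes the first document, since Elasticsearch will not change the type of a field that has already been mapped as text.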

Elasticsearch fails to map imported data

I have managed to create an import from Kafka to Elasticsearch using Kafka Connect.
Connector-config:
{
  "name": "raw-customer-equipment",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": 1,
    "topics": "raw.customer.equipment",
    "key.ignore": true,
    "value.converter.schemas.enable": false,
    "schema.ignore": true,
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "connection.url": "<elastic-url>",
    "connection.username": "<user>",
    "connection.password": "<pwd>",
    "type.name": "_doc"
  }
}
However, Elasticsearch doesn't seem to be able to map the imported (JSON) data. When peeking at it in Kibana, the imported data doesn't seem to be searchable.
{
  "_index": "raw.customer.equipment",
  "_type": "_doc",
  "_id": "raw.customer.equipment+1+929943",
  "_version": 1,
  "_score": 0,
  "_source": {
    "ifstats_list": [
      {
        "Event Time": "1589212678436",
        "AP_list": [
          {
            "AP ID": 1,
            "AP Alias": "PRIV0"
          },
          {
            "AP ID": 2,
            "AP Alias": "VID1"
          },
          {
            "AP ID": 5,
            "AP Alias": "VID1_BH"
          }
        ],
        "Device Type": "<type>",
        ...
        "Associated Stations": [
          {
            "Packets sent": 11056613,
            "Packets received": 304744,
            "Multiple Retries Count": 0,
            "Channel STA": 6,
            "MAC Address": "<mac>",
            ....
          },
          {
            ....
          }]
        ....
I want to be able to query by, for instance, "MAC Address", but Elastic seems to just handle the imported data as one big chunk of text.
I guess something in the Kafka connector setup is missing or wrong, but I fail to see what.
As you might have guessed, I'm new at Elastic, and I'm not the one who is supposed to use the data in the end.
Any help appreciated.
BR
Edit:
Added connector-config by request.
