Rename index in Elasticsearch with Kafka sink

I am using the following sink. The issue is that it sets the Elasticsearch index name to be the same as the topic. I want a different Elasticsearch index name. How can I achieve that? I am using Confluent 4.
{
"name": "es-sink-mysql-foobar-02",
"config": {
"_comment": "-- standard converter stuff -- this can actually go in the worker config globally --",
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter.schema.registry.url": "http://localhost:8081",
"_comment": "--- Elasticsearch-specific config ---",
"_comment": "Elasticsearch server address",
"connection.url": "http://localhost:9200",
"_comment": "Elasticsearch mapping name. Gets created automatically if doesn't exist ",
"type.name": "type.name=kafka-connect",
"index.name": "asimtest",
"_comment": "Which topic to stream data from into Elasticsearch",
"topics": "mysql-foobar",
"_comment": "If the Kafka message doesn't have a key (as is the case with JDBC source) you need to specify key.ignore=true. If you don't, you'll get an error from the Connect task: 'ConnectException: Key is used as document id and can not be null.",
"key.ignore": "true"
}
}

Use Kafka Connect's Single Message Transform (SMT) capabilities for this.
For example, to drop the mysql- prefix:
"_comment": "Drop the mysql- prefix from the topic name and thus Elasticsearch index name",
"transforms": "dropPrefix",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"mysql-(.*)",
"transforms.dropPrefix.replacement":"$1"
or to drop the prefix and also route the messages to a time-based Elasticsearch index:
"transforms":"dropPrefix,routeTS",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"mysql-(.*)",
"transforms.dropPrefix.replacement":"$1",
"transforms.routeTS.type":"org.apache.kafka.connect.transforms.TimestampRouter",
"transforms.routeTS.topic.format":"kafka-${topic}-${timestamp}",
"transforms.routeTS.timestamp.format":"YYYYMM"
See https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/ for more details.
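For reference, here is a sketch of what the full sink config from the question might look like with the SMT folded in. The transform name renameTopic and the literal replacement asimtest are illustrative (they simply map the single topic mysql-foobar to the index name the question asked for), and the invalid index.name property is dropped:
{
"name": "es-sink-mysql-foobar-02",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"connection.url": "http://localhost:9200",
"type.name": "kafka-connect",
"topics": "mysql-foobar",
"key.ignore": "true",
"transforms": "renameTopic",
"transforms.renameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.renameTopic.regex": "mysql-foobar",
"transforms.renameTopic.replacement": "asimtest"
}
}
Because the regex matches the whole topic name, the replacement string becomes the topic name the sink sees, and therefore the Elasticsearch index name.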

Related

Can we make a single JDBC Sink Connector for multiple source DBs if the primary key is the same in all source DBs?

Below are my JDBC Sink Connector configuration properties.
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"transforms.dropPrefix.replacement": "$1",
"table.name.format": "kafka_${topic}",
"connection.password": "********",
"tasks.max": "3",
"topics": "aiq.db1.Test1,aiq.db1.Test2,aiq.db2.Topic1,aiq.db2.Topic2",
"batch.size": "3000",
"transforms": "dropPrefix",
"transforms.dropPrefix.regex": "aiq.(.*)",
"transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"value.converter.schema.registry.url": "http://localhost:8081",
"value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.RecordNameStrategy",
"auto.evolve": "true",
"connection.user": "admin",
"name": "MSSQL_jdbc_sink_connect",
"errors.tolerance": "all",
"auto.create": "true",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"connection.url": "jdbc:sqlserver://mssql",
"insert.mode": "upsert",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://localhost:8081",
"pk.mode": "record_value",
"pk.fields": "id"
If I use this, the connector looks for db1 or db2, which are the source databases, and gives this error.
com.microsoft.sqlserver.jdbc.SQLServerException: Database 'db2' does not exist. Make sure that the name is entered correctly.
at io.confluent.connect.jdbc.sink.JdbcSinkTask.getAllMessagesException(JdbcSinkTask.java:150)
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:102)
... 11 more
[2022-01-25 06:09:09,582] WARN Write of 500 records failed, remainingRetries=10 (io.confluent.connect.jdbc.sink.JdbcSinkTask:92)
com.microsoft.sqlserver.jdbc.SQLServerException: Database 'db2' does not exist. Make sure that the name is entered correctly.
Please let me know whether I can create a JDBC sink connector that consumes topics from more than one source database.
If this scenario is possible, how can I achieve it with the JDBC Sink Connector?
I have used these properties and it worked for me in this case (multiple source databases and one target database to store that data):
table.name.format=iq_${topic}
transforms=dropPrefix
transforms.dropPrefix.replacement=$1_transferiq_$2
transforms.dropPrefix.regex=iq.(.*).transferiq.(.*)
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
AFAIK, the connection.url can only refer to one database at a time, for an authenticated user to that database.
If you need to write different topics to different databases, copy your connector config and change the appropriate configs.
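For example (a sketch, not tested against your setup): two copies of the question's config, where only the name, the target database in the JDBC URL, the topics, and the prefix regex differ; everything else stays the same. The databaseName URL parameter and the _db1/_db2 name suffixes are illustrative:
"name": "MSSQL_jdbc_sink_connect_db1",
"connection.url": "jdbc:sqlserver://mssql;databaseName=db1",
"topics": "aiq.db1.Test1,aiq.db1.Test2",
"transforms.dropPrefix.regex": "aiq.db1.(.*)",
"transforms.dropPrefix.replacement": "$1"
and for the second connector:
"name": "MSSQL_jdbc_sink_connect_db2",
"connection.url": "jdbc:sqlserver://mssql;databaseName=db2",
"topics": "aiq.db2.Topic1,aiq.db2.Topic2",
"transforms.dropPrefix.regex": "aiq.db2.(.*)",
"transforms.dropPrefix.replacement": "$1"
With table.name.format=kafka_${topic}, the topic aiq.db1.Test1 would then land in table kafka_Test1 of database db1, and similarly for db2.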

How can Kafka fetch data only when a new row is inserted or an old row is updated in a MySQL database, using Kafka streaming?

I am using the Confluent Platform JDBC connector; it streams data from MySQL to a Kafka consumer, and an application then inserts the data into another database for reporting purposes.
The problem is that it just streams all the data again and again at some interval of time. I only want the data that is newly inserted, or updates to previous records.
Timestamp mode is not possible because the table does not contain any time column, and an incrementing id is not possible either. Please share any solution.
I have a sample configuration file.
demo.json
{
"name":"mysql-connector-demo",
"config":{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"connection.url": "jdbc:mysql://localhost:3306/test",
"connection.user": "root",
"connection.password": "1234",
"topic.prefix": "test",
"catalog.pattern":"test",
"mode": "bulk",
"validate.non.null": false,
"query": "select * from test ",
"table.types": "TABLE",
"topic.prefix": "test-jdbc-",
"poll.interval.ms": 10000
"schema.ignore": true,
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false"
}
}
But here, newly inserted records and newly updated records do not reach the Kafka consumer.
want only those data which is newly inserted or any update of previous record
Then you should be using change-data-capture (CDC) rather than JDBC polling.
Debezium is one such solution.
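As a rough sketch (assuming Debezium 1.x property names; the hostnames, credentials, server id, and topic names below are illustrative, not from the question), a Debezium MySQL source connector that captures inserts and updates to the test table could look like this:
{
"name": "mysql-cdc-demo",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "localhost",
"database.port": "3306",
"database.user": "root",
"database.password": "1234",
"database.server.id": "184054",
"database.server.name": "mysqlserver1",
"database.include.list": "test",
"table.include.list": "test.test",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "schema-changes.test"
}
}
Each insert, update, and delete in the MySQL binlog then becomes a single change event on the topic mysqlserver1.test.test, instead of the whole table being re-polled.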

query-based JDBC Source connector Kafka

I have a legacy database whose primary key column is a string (yeah, I know). I want to do an incremental dump from the Postgres DB into Kafka topics using the JDBC Kafka Source Connector.
Below is my attempt to recreate the problem
create table test(
id varchar(20) primary key,
name varchar(10)
);
INSERT INTO test(
id, name)
VALUES ('1ab', 't'),
('2ab', 't'),
('3ab', 't');
My config
{"name" : "test_connector",
"config" : {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://12.34.5.6:5432/",
"connection.user": "user",
"connection.password": "password",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"topic.prefix": "incre_",
"mode": "incrementing",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"query" :"SELECT cast(replace(id, 'ab','') as integer) as id , name from test ORDER BY id ASC",
"incrementing.column.name":"id",
"value.converter.schema.registry.url": "http://schema-registry_url.com",
"key.converter.schema.registry.url": "http://schema-registry_url.com",
"offset.flush.timeout.ms": 2000,
}
}
After I posted the config, the status was RUNNING when I checked with an HTTP curl, and there was no error in the worker's log.
There was also no data in the Kafka topic when I tried the console consumer.
I also tried several other combinations, like adding in "table.whitelist": "test".
Another thing I tried was following these two links:
https://rmoff.net/2018/05/21/kafka-connect-and-oracle-data-types/
https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector but neither helped, not even the smart trick that was suggested, like SELECT * from (SELECT id, name from test where ...)
After a few hours of playing with different configurations, I came back to the official documentation and realised this:
Use a custom query instead of loading tables, allowing you to join data from multiple tables. As long as the query does not include its own filtering, you can still use the built-in modes for incremental queries (in this case, using a timestamp column). Note that this limits you to a single output per connector and because there is no table name, the topic “prefix” is actually the full topic name in this case.
So the key is that "topic.prefix": "incre_test"
Following up on the previous setting, the proper config should be:
{"name" : "test_connector",
"config" : {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://12.34.5.6:5432/",
"connection.user": "user",
"connection.password": "password",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"topic.prefix": "incre_test",
"mode": "incrementing",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"query" :"SELECT cast(replace(id, 'ab','') as integer) as id , name from test ORDER BY id ASC",
"incrementing.column.name":"id",
"value.converter.schema.registry.url": "http://schema-registry_url.com",
"key.converter.schema.registry.url": "http://schema-registry_url.com",
"offset.flush.timeout.ms": 2000,
}
}
I am afraid you cannot use your varchar id in incrementing mode because it is not an incrementing column/type. According to Confluent Docs,
Incrementing Column: A single column containing a unique ID for each row, where newer rows are guaranteed to have larger IDs, i.e. an AUTOINCREMENT column. Note that this mode can only detect new rows. Updates to existing rows cannot be detected, so this mode should only be used for immutable data. One example where you might use this mode is when streaming fact tables in a data warehouse, since those are typically insert-only.

Kafka Connect Elasticsearch ID creation for multiple fields not working

I am asking this question as there was no answer in the original case: Elastic Kafka Connector, ID Creation.
I have a similar situation.
Elasticsearch creates a record when the document key is built from a single field, but not when it is built from multiple fields and the request is sent through Kafka Connect.
I get the exception "Key is used as document id and can not be null" from Elasticsearch.
My Connector Configurations:
{
"name": "test-connector33",
"config": {
"connector.class":"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "test-connector33",
"connection.url": "http://localhost:9200",
"type.name": "aggregator",
"schema.ignore": "true",
"topic.schema.ignore": "true",
"topic.key.ignore": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"key.ignore":"false",
"name": "test-connector33",
"transforms": "InsertKey,extractKey",
"transforms.InsertKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.InsertKey.fields":"customerId,city",
"transforms.extractKey.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field":"customerId,city"
}}
Any idea how to resolve this?
Thanks in advance!
org.apache.kafka.connect.transforms.ExtractField$Key only supports single fields.
Pretend your JSON object was a HashMap<String, Object>. There is no single field named customerId,city, so the map.get(field) operation returns null, which in turn sets the key to null.
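One way around this (a sketch, not from the original answer) is to key on a single field that is unique on its own; here it is assumed that customerId alone is. Building a document id from several fields would need the key to be assembled elsewhere, e.g. in the source data or with a custom SMT:
"transforms": "InsertKey,extractKey",
"transforms.InsertKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.InsertKey.fields": "customerId",
"transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field": "customerId"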
If you want to send keys via the console producer, you can do that by adding --property parse.key=true as a flag, then typing the key, pressing Tab, then typing the value. If you want to echo data into the process, you can also set --property key.separator='|' to use a vertical bar instead of the tab.

Kafka Connect stops after 300k records

I am trying to sink my MySQL table to Elasticsearch. My table has over 1 million records. The issue is that Elasticsearch does not get any more records after some 300 thousand records are inserted. I know that the first time I ran it, it did load all the records; this happened when I tried to run it again after deleting the ES index. I have tried resetting the update_ts field to a new timestamp, and I have tried the offset value in the sink. Nothing seems to be working.
Here is my source file
{
"name": "items3",
"config": {
"_comment": "The JDBC connector class. Don't change this if you want to use the JDBC Source.",
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"_comment": "How to serialise the value of keys - here use the Confluent Avro serialiser. Note that the JDBC Source Connector always returns null for the key ",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"_comment": "Since we're using Avro serialisation, we need to specify the Confluent schema registry at which the created schema is to be stored. NB Schema Registry and Avro serialiser are both part of Confluent Open Source.",
"key.converter.schema.registry.url": "http://localhost:8081",
"_comment": "As above, but for the value of the message. Note that these key/value serialisation settings can be set globally for Connect and thus omitted for individual connector configs to make them shorter and clearer",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081",
"_comment": " --- JDBC-specific configuration below here --- ",
"_comment": "JDBC connection URL. This will vary by RDBMS. Consult your manufacturer's handbook for more information",
"connection.url": "jdbc:mysql://localhost:3306/db?user=user&password=password",
"_comment": "Which table(s) to include",
"table.whitelist": "items",
"_comment": "Pull all rows based on an timestamp column. You can also do bulk or incrementing column-based extracts. For more information, see http://docs.confluent.io/current/connect/connect-jdbc/docs/source_config_options.html#mode",
"mode": "timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "update_ts",
"_comment": "If the column is not defined as NOT NULL, tell the connector to ignore this ",
"validate.non.null": "true",
"_comment": "The Kafka topic will be made up of this prefix, plus the table name ",
"topic.prefix": "kafka-",
"auto.offset.reset" : "earliest"
}
}
And here is my sink
{
"name": "items-sink",
"config": {
"_comment": "-- standard converter stuff -- this can actually go in the worker config globally --",
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter.schema.registry.url": "http://localhost:8081",
"_comment": "--- Elasticsearch-specific config ---",
"_comment": "Elasticsearch server address",
"connection.url": "http://localhost:9200",
"_comment": "Elasticsearch mapping name. Gets created automatically if doesn't exist ",
"type.name": "items",
"_comment": "Which topic to stream data from into Elasticsearch",
"topics": "kafka-items",
"auto.offset.reset" : "earliest",
"_comment": "If the Kafka message doesn't have a key (as is the case with JDBC source) you need to specify key.ignore=true. If you don't, you'll get an error from the Connect task: 'ConnectException: Key is used as document id and can not be null.",
"key.ignore": "true"
}
}
As you can see, I am trying to set auto.offset.reset to earliest so that, if it is keeping track of my records somehow, it will start over, but all in vain.
"auto.offset.reset" : "earliest" can only be used inside the connect-distributed.properties file, not the JSON connector configurations
And in that file, since it's a consumer configuration, it's named consumer.auto.offset.reset.
Also, the consumer group is mapped to the name field of the connector configuration, so unless that's changed, you'd be continuing to consume from where the previous one of the same name left off until the group offsets are reset or the name is changed. By default, the group name is connect-${connector_name}
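For reference, a minimal sketch of the relevant setting (assuming a distributed worker; the exact file location depends on your installation) is a single line in connect-distributed.properties:
consumer.auto.offset.reset=earliest
After changing it, restart the worker; then either reset the offsets of the connect-items-sink consumer group or simply give the connector a new name (e.g. items-sink-v2, an illustrative name) so it starts with a fresh consumer group from the earliest offsets.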
