I've used Debezium for MySQL -> Elasticsearch CDC.
Now the issue is that when I delete data from MySQL, it still reappears in Elasticsearch, even though the data is no longer present in the MySQL DB. UPDATE and INSERT work fine, but DELETE doesn't.
Also, I did the following:
1. Delete data in MySQL
2. Delete the Elasticsearch index and the ES Kafka sink connector
3. Create a new ES sink connector in Kafka
Now, the weird part is that all of my deleted data reappears here as well! When I checked the ES data before step (3), it wasn't there; only after recreating the connector does this behaviour show up.
Please help me fix this issue!
MySQL config:
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.allowPublicKeyRetrieval": "true",
"database.user": "cdc-reader",
"tasks.max": "1",
"database.history.kafka.bootstrap.servers": "X.X.X.X:9092",
"database.history.kafka.topic": "schema-changes.mysql",
"database.server.name": "data_test",
"schema.include.list": "data_test",
"database.port": "3306",
"tombstones.on.delete": "true",
"delete.enabled": "true",
"database.hostname": "X.X.X.X",
"database.password": "xxxxx",
"name": "slave_test",
"database.history.skip.unparseable.ddl": "true",
"table.include.list": "search_ai.*"
},
Elasticsearch config:
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"type.name": "_doc",
"behavior.on.null.values": "delete",
"transforms.extractKey.field": "ID",
"tasks.max": "1",
"topics": "search_ai.search_ai.slave_data",
"transforms.InsertKey.fields": "ID",
"transforms": "unwrap,key,InsertKey,extractKey",
"key.ignore": "false",
"transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.key.field": "ID",
"transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"name": "esd_2",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"connection.url": "http://X.X.X.X:9200",
"transforms.InsertKey.type": "org.apache.kafka.connect.transforms.ValueToKey"
},
Debezium is reading the transaction log, not the source table, so the inserts and updates are always going to be read first, causing inserts and doc updates in Elasticsearch...
Secondly, did you create the sink connector with the same name or a different one?
If the same name, the original consumer group offsets would not have changed, so the consumer group picks up at the offsets it had before you deleted the original connector.
If a new name, then depending on the auto.offset.reset value of the sink connector's consumer, you could be consuming the Debezium topic from the beginning, causing data to get re-inserted into Elasticsearch, as you describe. You also need to check whether your MySQL delete events are actually being produced and consumed as tombstone/null values, since that is what triggers deletes in Elasticsearch.
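For reference, here is a minimal sketch of the sink-side pieces that have to line up for deletes to reach Elasticsearch. It keeps tombstones in the unwrap SMT and takes the document id from the Debezium record key rather than the value (a ValueToKey step cannot read a field out of a null tombstone value). The ID field name and transform aliases are borrowed from your config; treat the rest as assumptions to verify against your connector versions:
"transforms": "unwrap,key",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.key.field": "ID",
"key.ignore": "false",
"behavior.on.null.values": "delete"
With tombstones.on.delete left at true on the MySQL connector, a delete then arrives at the sink as a null-valued record carrying the row's key, which is exactly what behavior.on.null.values=delete acts on.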
Related
I have this connector and sink, which basically creates a topic "Test.dbo.TEST_A" and writes to the ES index "Test". I have set "key.ignore": "false" so that row updates are also updated in ES, and
"transforms.unwrap.add.fields": "table" to keep track of which table each document belongs to.
{
"name": "Test-connector",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"database.hostname": "192.168.1.234",
"database.port": "1433",
"database.user": "user",
"database.password": "pass",
"database.dbname": "Test",
"database.server.name": "MyServer",
"table.include.list": "dbo.TEST_A",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "dbhistory.testA",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.unwrap.add.fields":"table"
}
}
{
"name": "elastic-sink-test",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "TEST_A",
"connection.url": "http://localhost:9200/",
"string.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schema.enable": "false",
"schema.ignore": "true",
"transforms": "topicRoute,unwrap,key",
"transforms.topicRoute.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.topicRoute.regex": "(.*).dbo.TEST_A", /* Use the database name */
"transforms.topicRoute.replacement": "$1",
"transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
"transforms.unwrap.drop.tombstones": "false",
"transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.key.field": "Id",
"key.ignore": "false",
"type.name": "TEST_A",
"behavior.on.null.values": "delete"
}
}
But when I add another connector/sink to include another table, "TEST_B", from the same database, it seems that whenever an id in TEST_A and an id in TEST_B are the same, one of the rows is deleted from ES.
Is it possible with this setup to have one index = one database, or is the only solution to have one index per table?
The reason I want one index = one database is to keep the number of indexes down as more databases are added to ES.
You are reading data changes from different Databases/Tables and writing them into the same ElasticSearch index, with the ES document ID set to the DB record ID. And as you can see, if the DB record IDs collide, the index document IDs will also collide, causing old documents to be deleted.
You have a few options here:
ElasticSearch index per DB/Table name: you can implement this with different connectors or with a custom Single Message Transform (SMT); see the config sketch after this list.
Globally unique DB records: If you control the schema of the source tables, you can set the primary key to a UUID. This will prevent ID collisions.
As you mentioned in the comments, set the ES document ID to DB/Table/ID. You can implement this change using an SMT.
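For the first option, one lightweight variant that avoids writing a custom SMT is the stock RegexRouter: route each table's topic to its own index inside a single sink connector. A sketch, assuming the Debezium topics are named MyServer.dbo.TEST_A and MyServer.dbo.TEST_B as in the question (the tableRoute alias is just an illustrative name):
"topics": "MyServer.dbo.TEST_A,MyServer.dbo.TEST_B",
"transforms": "tableRoute,unwrap,key",
"transforms.tableRoute.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.tableRoute.regex": ".*\\.dbo\\.(.*)",
"transforms.tableRoute.replacement": "$1"
Each routed topic then gets its own index, so identical row ids no longer collide across tables; depending on the connector version you may need the routed name to be lowercase, since Elasticsearch only accepts lowercase index names.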
From the documentation for the Oracle Debezium connector, it seems that when an update is performed on a row, the connector should send a Kafka message with the full state of the row both before and after the update. However, I am getting zeros in almost all of the fields, except the field that was updated and one other field that has a unique constraint but is not used by Debezium as the key. The key used by Debezium is a combination of four fields, which together are unique. Here is how I created the connector. How can I get Debezium to give me data for all of the fields, not just the one that was updated, or is this not possible?
{
"name": "bom-tables",
"config": {
"name": "bom-tables",
"connector.class": "io.debezium.connector.oracle.OracleConnector",
"database.server.name": "fake.example.com",
"database.hostname": "fake2.example.com",
"snapshot.mode": "initial",
"database.port": "1521",
"database.user": "XSTRM",
"database.password": "FAKE_PASS",
"database.dbname": "FAKE_DBNAME",
"database.out.server.name": "DBZXOUT",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "schema-changes.inventory",
"database.tablename.case.insensitive": "true",
"database.oracle.version": "11",
"include.schema.changes": "true",
"table.whitelist": "XXX,YYY",
"errors.log.enable": "true"
}
}
Thanks for any help.
I'm trying to create an Elasticsearch sink connector with the following config. The creation is successful, but when a message is produced on "my.topic.one", the ES sink connector fails while trying to create an index named "my.topic.one": "Could not create index 'my.topic.one'" (the user I connect to ES with intentionally does not have the create-index permission). Why is it trying to create a new index, and how do I get the connector to index into the previously created "elasticsearch_index_name"?
{
"type.name": "_doc",
"tasks.max": "1",
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"connection.url": "http://elasticsearch-service:9200",
"behavior.on.null.values": "delete",
"key.ignore": "false",
"write.method": "upsert",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter.schemas.enable": "false",
"topics": "my.topic.one,my.topic.two",
"transforms": "renameTopic",
"transforms.renameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.renameTopic.regex": ".*",
"transforms.renameTopic.replacement": "elasticsearch_index_name"
}
UPDATE: the ES sink connector throws the same error even if I use just one topic in the "topics" attribute and the same topic name in "renameTopic.regex", as below, with all other attributes unchanged.
"topics": "my.topic.one",
"transforms.renameTopic.regex": "my.topic.one"
Adding the following property to the ES sink connector config solved the issue:
"auto.create.indices.at.start": "false"
The connector otherwise pre-creates an index for every topic listed in "topics" at startup, before any SMT such as RegexRouter is applied, which is why it was trying to create "my.topic.one".
I have topics being created in Kafka (test1, test2, test3) and I want to sink them to Elasticsearch as soon as they are created. I tried topics.regex, but it only creates indices for topics that already exist. How can I sink a new topic into an index when the topic gets created dynamically?
Here is the connector config that I am using for kafka-sink:
{
"name": "elastic-sink-test-regex",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics.regex": "test[0-9]+",
"type.name": "kafka-connect",
"connection.url": "http://192.168.0.188:9200",
"key.ignore": "true",
"schema.ignore": "true",
"schema.enable": "false",
"batch.size": "100",
"flush.timeout.ms": "100000",
"max.buffered.records": "10000",
"max.retries": "10",
"retry.backoff.ms": "1000",
"max.in.flight.requests": "3",
"is.timebased.indexed": "False",
"time.index": "at"
}
}
A sink connector won't read new topics until the connector is restarted (or a scheduled rebalance occurs). As a workaround, you can run a Kafka Streams application that reads messages from the new topics and writes them into a single result topic, and have the sink connector read from that result topic.
To preserve the message-to-topic mapping, you can use Kafka record headers.
Make sure it meets your requirements!
I'm using the Kafka Connect Elasticsearch connector to write data from a topic to an Elasticsearch index. Both the key and value of the topic messages are in JSON format. The connector is not able to start because of the following error:
org.apache.kafka.connect.errors.DataException: MAP is not supported as the document id.
Following is the format of my messages (key | value):
{"key":"OKOK","start":1517241690000,"end":1517241695000} | {"measurement":"responses","count":9,"sum":1350.0,"max":150.0,"min":150.0,"avg":150.0}
And following is the body of the POST request I'm using to create the connector:
{
"name": "elasticsearch-sink-connector",
"config": {
"connector.class":"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "output-topic-elastic",
"connection.url": "http://elasticsearch:9200",
"type.name": "aggregator",
"schemas.enable": "false",
"topic.schema.ignore": "true",
"topic.key.ignore": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"key.ignore":"false",
"topic.index.map": "output-topic-elastic:aggregator",
"name": "elasticsearch-sink",
"transforms": "InsertKey",
"transforms.InsertKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.InsertKey.fields":"key"
}}
Any help would be really appreciated. I found a similar question on Stack Overflow ("ES document ID creation"), but I've had no luck with the answers there.
You also need ExtractField in there: ValueToKey on its own turns the extracted field into a one-entry map/struct key, which is why you get "MAP is not supported as the document id"; ExtractField$Key then pulls the single field out so the document id is a plain value.
"transforms": "InsertKey,extractKey",
"transforms.InsertKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.InsertKey.fields":"key",
"transforms.extractKey.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field":"key"
Check out this post for more details.