Does the Kafka sink connector insert a record if the transformation returns NULL? - elasticsearch

I have a Kafka sink connector (Elasticsearch), and I'm writing a custom SMT. I'm wondering what happens if the SMT returns a NULL record. What would the connector do? Would it insert something NULL into my Elasticsearch index, or not insert anything at all?

You can control how a null record value is handled using the behavior.on.null.values config option:
ignore - the message is ignored
delete - the Elasticsearch document with the matching key is deleted
fail - the connector stops
Ref: doc page
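
For reference, a minimal sketch of where that option sits in the sink configuration; the connector class is the Confluent Elasticsearch sink, and the topic and URL are placeholders:

# Elasticsearch sink (placeholder topic and URL)
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=my-topic
connection.url=http://localhost:9200
# What to do when a record's value is null: ignore | delete | fail
behavior.on.null.values=delete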

Related

Kafka Sink Connector: Is there any way to apply value transformations only for messages that meet the condition?

I'm trying to use the Elasticsearch sink connector to transfer all messages to an ES index.
There is a drop transformation for Kafka connectors that tells the connector to delete rows from ES if the body is null.
What if, for the delete action, we send a message with a non-null body? Is there any way to apply a transformation with some condition/predicate, while continuing to process create/update messages without the transformation? For example, apply the drop-value transformation only to rows that have a deleted flag set to true in their bodies.
The transforms don't "delete from ES"; they only modify the Kafka record.
If you want to act only on specific records, that's what the Filter transform with a Predicate is for. You'd need to chain it before a drop transformation, since I don't think it's possible to trigger an ES delete with a non-null record value (see the config sketch below).
at the same time continue processing create/update messages without transformation?
You'd need to run another connector that reverses the predicate condition of the other.
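
A minimal sketch of that predicate wiring (Kafka Connect 2.6+). The Drop$Value transform class is an assumption (Confluent's Drop SMT), and the header-based condition is only illustrative, since the built-in predicates cannot look inside the record value; a check on a deleted flag in the body would need a custom Predicate implementation:

# Null out the record value only when the predicate matches (Drop$Value class is assumed)
transforms=dropValue
transforms.dropValue.type=io.confluent.connect.transforms.Drop$Value
transforms.dropValue.predicate=isDelete
# Built-in predicates: TopicNameMatches, HasHeaderKey, RecordIsTombstone
predicates=isDelete
predicates.isDelete.type=org.apache.kafka.connect.transforms.predicates.HasHeaderKey
predicates.isDelete.name=delete

For the second connector mentioned above, the same predicate could gate org.apache.kafka.connect.transforms.Filter so the delete markers are skipped and the create/update records pass through untouched.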

kafka-connect-elasticsearch: How to sync elasticsearch with consumer group?

I want to query messages in a Kafka topic, but not all messages and not from the beginning. I just need to see which messages are not yet committed based on a consumer group. So, basically, what I want is to delete the documents whose offset is lower than a consumer group's offset.
At this point, if I use the Elasticsearch connector, is there any way or workaround to delete documents from the Elastic index after a message is consumed and committed?
Or should I use Kafka Streams, and if so, how?
The sink connector only deletes documents when that property is explicitly enabled and there is a null-valued record for a document ID in the topic you're reading. This means you need to actually consume that null record and have it processed by the connector.
see which messages are not yet committed
This would imply messages that have not been processed by the connector, which makes them not searchable in Elasticsearch.
delete the documents whose offset is lower than a consumer group offset
If you created a fresh index in Elasticsearch that's only used by the connector, you could pause the connector, then truncate the index, then resume the connector.
is there any way or a workaround to delete documents from the elastic index after a message is consumed and committed
Directly use the DELETE API (a sketch is shown below).
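
For that last option, a hedged sketch of deleting a single document directly; the index name is a placeholder, and the document ID assumes the connector's default topic+partition+offset key format:

# Delete one document by ID (Elasticsearch 7+ URL form)
curl -X DELETE "http://localhost:9200/my-index/_doc/my-topic+0+42"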

Confluent Elasticsearch Sink connector, write.method : "UPSERT" on different key

In the Confluent Elasticsearch Sink connector, I am trying to write to the same Elasticsearch index from two different topics. The first topic is INSERT and the other topic is UPSERT. For UPSERT, I want to update the JSON document based on some other field instead of "_id". Is that possible? If yes, how can I do that?
Use key.ignore=false and use the existing primary key columns as the _id for each JSON document.
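
A hedged sketch of that setup, reusing the same SMT pattern shown further down this page; the field name id is an assumption for whatever your primary key column is called:

# Use the record key as the Elasticsearch _id and update existing documents
key.ignore=false
write.method=upsert
# Promote the primary key field into the record key so it becomes _id
transforms=InsertKey,ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id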

Using elasticsearch generated ID's in kafka elasticsearch connector

I noticed that documents indexed in Elasticsearch using the Kafka Elasticsearch connector have their IDs in the format topic+partition+offset.
I would prefer to use IDs generated by Elasticsearch. It seems topic+partition+offset is not always unique, so I am losing data.
How can I change that?
As Phil says in the comments -- topic-partition-offset should be unique, so I don't see how this is causing data loss for you.
Regardless - you can either let the connector generate the key (as you are doing), or you can define the key yourself (key.ignore=false). There is no other option.
You can use Single Message Transformations with Kafka Connect to derive a key from the fields in your data. Based on your message in the Elasticsearch forum it looks like there is an id in your data - if that's going to be unique you could set that as your key, and thus as your Elasticsearch document ID too. Here's an example of defining a key with SMT:
# Add the `id` field as the key using Single Message Transformations
transforms=InsertKey, ExtractId
# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
# `ExtractField`: convert key from an object to a plain field
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id
(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)
@Robin Moffatt, as far as I can see, topic-partition-offset can cause duplicates if you upgrade your Kafka cluster not in a rolling-upgrade fashion but by replacing the cluster with a new cluster (which is sometimes easier to do). In this case you will experience data loss because of overwritten data.
Regarding your excellent example, this can be the solution for many of the cases, but I'd add another option. Maybe you can add an epoch timestamp element to the topic-partition-offset, so it becomes topic-partition-offset-current_timestamp.
What do you think?

Is an upsert possible with Kafka Connect to ElasticSearch

I'm receiving events which end up in Kafka. From these events I fetch the id using a Kafka Streams application and post it back to Kafka as a pair of (id, 1) in another topic. Then I would like to see if the id already exists in Elasticsearch, and if so update its counter, otherwise create a new record in Elasticsearch with the id from Kafka and the counter set to 1, i.e. an upsert of the record (id, 1) to ES.
I was hoping to use Kafka Connect to Elasticsearch for this, but it seems not to be that straightforward, if possible at all. I can see that adding records to ES works, but merging with existing records is something I haven't found out about yet. Is this possible already, and if so, how? And if not, is it planned for an upcoming release?
I forked the datamountaineer ES sink connector to allow upsert. With it you can specify a PK and run an update with docAsUpsert into ES. You can grab the project and compile the JAR from my GitHub fork.
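
To illustrate what a docAsUpsert write means on the Elasticsearch side (index, ID, and field are placeholders), a doc-as-upsert update merges the given fields into the existing document, or creates the document if it doesn't exist:

# Elasticsearch 7+ URL form; merges "counter" into the document or creates it
curl -X POST "http://localhost:9200/my-index/_update/my-id" -H "Content-Type: application/json" -d '{"doc": {"counter": 1}, "doc_as_upsert": true}'

Note that this overwrites the counter field rather than incrementing it; an actual increment would need a scripted update, which, as far as I know, the sink connectors don't do.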
