Kafka Sink Connector: Is there any way to apply value transformations only for messages that meet the condition? - elasticsearch

I'm trying to use the Elasticsearch sink connector to transfer all messages to an ES index.
There is a drop transformation for Kafka connectors that tells the connector to delete rows from ES if the body is null.
What if we send a delete message with a non-null body? Is there any way to apply a transformation under some condition/predicate while continuing to process create/update messages without the transformation? For example, applying the drop value transformation only to rows that have the deleted flag set to true in their bodies.

The transforms don't "delete from ES"; they only modify the Kafka record.
If you want to act only on specific records, that's what the Filter transform with a predicate is for. You'd need to chain it before a drop transformation, since I don't think it's possible to trigger an ES delete with a non-null record value.
"at the same time continue processing create/update messages without transformation?"
You'd need to run another connector that reverses the predicate condition of the first one.
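As a rough sketch of how a predicate is wired up (Apache Kafka 2.6 and later): a predicate can also be attached directly to the drop transform, in which case records that don't match skip that transform and are written as usual. The built-in predicates cannot look inside the record value, so this sketch assumes the producer also sets a header named deleted on delete messages (a hypothetical convention). Checking a deleted field in the body itself would need a custom predicate. It also assumes the Confluent Drop SMT is installed and that the aliases dropValue and isDeleted are free to use.

```properties
# Null out the value only for records matched by the predicate
transforms=dropValue
transforms.dropValue.type=io.confluent.connect.transforms.Drop$Value
transforms.dropValue.predicate=isDeleted

# Predicate: true when the record carries a header named "deleted"
# (hypothetical header; the producer would have to set it)
predicates=isDeleted
predicates.isDeleted.type=org.apache.kafka.connect.transforms.predicates.HasHeaderKey
predicates.isDeleted.name=deleted

# Let the sink turn the resulting null-valued records into deletes
key.ignore=false
behavior.on.null.values=delete
```

Setting transforms.dropValue.negate=true would invert the match, which is the "reversed condition" mentioned above.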

Related

For a Kafka sink connector, can I send a single message to multiple index documents in Elasticsearch?

I am receiving a very complex JSON inside a topic message, so I want to do some computations on it using SMTs and send the results to documents in different Elasticsearch indices. Is it possible?
I am not able to find a solution for this.
The Elasticsearch sink connector only writes to one index, per record, based on the topic name. It's explicitly written in the Confluent documentation that topic altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at the Logstash Kafka input and Elasticsearch output as an alternative; however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer such as Kafka Streams or ksqlDB to extract your nested JSON and emit multiple records that you expect in Elasticsearch.

kafka-connect-elasticsearch: How to sync elasticsearch with consumer group?

I want to query messages in a Kafka topic but not all messages, not from the beginning. I just need to see which messages are not yet committed based on a consumer group. So, basically what I want to have is to delete the documents whose offset is lower than a consumer group offset.
At this point, if I use elastic-connector, is there any way or a workaround to delete documents from the elastic index after a message is consumed and committed?
Or, should I use Kafka Streams and how?
The sink connector only deletes documents when that property is explicitly enabled and there is a null valued record for a document ID in the topic you're reading. This means you need to actually consume this null record and have it be processed by the connector.
"see which messages are not yet committed"
This would imply messages that have not been processed by the connector, making them not searchable in Elasticsearch.
"delete the documents whose offset is lower than a consumer group offset"
If you created a fresh index in Elasticsearch that's only used by the connector, you could pause the connector, then truncate the index, then resume the connector.
"is there any way or a workaround to delete documents from the elastic index after a message is consumed and committed"
Directly use the DELETE API.

Does the Kafka sink connector insert a record if the transformation returns NULL?

I have a Kafka sink connector (Elasticsearch), and I'm writing a custom SMT. I'm wondering what would happen if the SMT returns a NULL record. What would the connector do? Is it going to insert something NULL in my Elasticsearch index, or is it not going to insert anything at all?
If the SMT returns null for the whole record, Kafka Connect drops that record and the connector never sees it. If the record comes back with a null value, you can control how it's handled using the behavior.on.null.values config option:
ignore - the message is ignored
delete - the Elasticsearch record with matching key is deleted
fail - the connector will stop.
Ref: doc page
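For illustration, a minimal sink configuration that turns null-valued records into deletes might look like the fragment below. The topic name and connection URL are placeholders, and key.ignore=false is assumed here because the delete targets the document whose ID matches the record key.

```properties
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
# Placeholder topic and Elasticsearch URL
topics=my-topic
connection.url=http://localhost:9200

# Use the record key as the document ID so a delete can find its target
key.ignore=false
# One of: ignore, delete, fail
behavior.on.null.values=delete
```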

Kafka Connect to persist topic to Elasticsearch index using field from (json) message

I'm attempting to index messages in Elasticsearch using only SMTs from Kafka's Connect API.
So far I had luck with simply using the topic and timestamp router functionality.
However, now I'd like to create separate indices based on a certain field in the message.
Suppose the messages are formatted as such:
{"productId": 1, "category": "boat", "price": 135000}
{"productId": 1, "category": "helicopter", "price": 300000}
{"productId": 1, "category": "car", "price": 25000}
Would it somehow be possible to index these to the following indices based on product category?
product-boat
product-helicopter
product-car
or would I have to create separate topics for every single category (knowing that it could become hundreds or thousands of them)?
Am I overlooking a transform that could do this, or is this simply not possible and will a custom component have to be built?
There's nothing out of the box with Kafka Connect that will do this. You have a few options:
1. The Elasticsearch sink connector routes messages to a target index based on their topic, so you could write a custom SMT that inspects a message and routes it to a different topic accordingly.
2. Use a stream processor to pre-process the messages so that they're already on different topics by the time they are consumed by the Elasticsearch sink connector, for example Kafka Streams or KSQL.
   - With KSQL you would need to hard-code each category (CREATE STREAM product-boat AS SELECT * FROM messages WHERE category='boat', etc.).
   - Kafka Streams now has Dynamic Routing (KIP-303), which would be a more flexible way of doing it.
3. Hand-code a bespoke Elasticsearch sink connector with the logic to route messages to indices based on message contents. This feels like the worst of the three approaches IMO.
If you are using Confluent Platform, you can do some routing depending on a field value in the message.
To do that, you have to use the ExtractTopic SMT from Confluent. More details regarding that SMT can be found at https://docs.confluent.io/current/connect/transforms/extracttopic.html#extracttopic
A Kafka sink connector processes messages that are represented by SinkRecord. Each SinkRecord consists of several fields: topic, partition, value, key, etc. Those fields are set by Kafka Connect, and using a transformation you can change their values. The ExtractTopic SMT changes the value of topic based on the value or key of the message.
The transformation configuration will be something like this:
{
...
"transforms": "ExtractTopic",
"transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
"transforms.ExtractTopic.field": "name",
...
}
Here "name" is the name of the field whose value will be used as the index name.
One limitation is that you have to create the indices in advance.
I assume you are using the Elasticsearch Sink Connector. The connector is able to create an index, but it does so when it opens, in the method that creates writers for a particular partition (ElasticsearchSinkTask::open). In your use case, not all indices can be created at that moment, because the values of all messages are not yet available.
Maybe it isn't the purest approach, because ExtractTopic should rather be used for source connectors, but in your case it might work.

Using elasticsearch generated ID's in kafka elasticsearch connector

I noticed that documents indexed in elasticsearch using the kafka elasticsearch connector have their ids in the following format topic+partition+offset.
I would prefer to use IDs generated by Elasticsearch. It seems topic+partition+offset is not usually unique, so I am losing data.
How can I change that?
As Phil says in the comments -- topic-partition-offset should be unique, so I don't see how this is causing data loss for you.
Regardless - you can either let the connector generate the key (as you are doing), or you can define the key yourself (key.ignore=false). There is no other option.
You can use Single Message Transformations with Kafka Connect to derive a key from the fields in your data. Based on your message in the Elasticsearch forum it looks like there is an id in your data - if that's going to be unique you could set that as your key, and thus as your Elasticsearch document ID too. Here's an example of defining a key with SMT:
# Add the `id` field as the key using Single Message Transformations
transforms=InsertKey, ExtractId
# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
# `ExtractField`: convert key from an object to a plain field
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id
(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)
@Robin Moffatt, as far as I can see, topic-partition-offset can cause duplicates if you upgrade your Kafka cluster not in a rolling fashion but by replacing the cluster with a new one (which is sometimes easier). In this case you will experience data loss because existing documents get overwritten.
Regarding your excellent example, this can be the solution in many cases, but I'd add another option. Maybe you can add an epoch timestamp element to the topic-partition-offset, so the ID would look like topic-partition-offset-current_timestamp.
What do you think?

Resources