Kafka Connect Elasticsearch: automatically lowercase topic name for index

I'm using the Elasticsearch sink Kafka connector to index messages from multiple Kafka topics into Elasticsearch.
My topics use camelCase naming, and I can't change that. So when the ES sink connector starts up, it doesn't index anything, because Elasticsearch rejects non-lowercase index names.
I know I can use the topic.index.map property to manually convert topic names to index names:
topic.index.map=myTopic1:mytopic1, myTopic2:mytopic2,...
Is there a way to convert to lowercase automatically? I have dozens of topics to convert, and I expect that to be around a hundred soon.

It turns out that since version 5.1 the connector does this automatically, if no mapping is specified for the topic.
From here:
final String indexOverride = topicToIndexMap.get(topic);
String index = indexOverride != null ? indexOverride : topic.toLowerCase();
See this commit for details.
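The fallback logic quoted above can be sketched as a small standalone program (the class name, map contents, and topic names here are illustrative, not the connector's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class IndexNameResolver {

    // Illustrative stand-in for the connector's topic.index.map lookup table
    static final Map<String, String> topicToIndexMap = new HashMap<>();

    static {
        // Only myTopic1 has an explicit mapping; everything else falls through
        topicToIndexMap.put("myTopic1", "custom-index");
    }

    static String resolveIndex(String topic) {
        // An explicit mapping wins; otherwise fall back to lowercasing the topic name
        String indexOverride = topicToIndexMap.get(topic);
        return indexOverride != null ? indexOverride : topic.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(resolveIndex("myTopic1")); // prints "custom-index" (explicit mapping)
        System.out.println(resolveIndex("myTopic2")); // prints "mytopic2" (automatic lowercase)
    }
}
```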

As of recent versions of the Elasticsearch sink connector this is done automatically. The PR that fixed this was https://github.com/confluentinc/kafka-connect-elasticsearch/pull/251

Related

Can the Kafka sink connector send a single message to multiple index documents in Elasticsearch?

I am receiving a very complex JSON payload in a topic message, so I want to do some computations on it using SMTs and send the results to different Elasticsearch index documents. Is that possible?
I have not been able to find a solution for this.
The Elasticsearch sink connector writes each record to exactly one index, based on the topic name. The Confluent documentation explicitly states that topic-altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at the Logstash Kafka input and Elasticsearch output as an alternative; however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer such as Kafka Streams or ksqlDB to extract your nested JSON and emit multiple records that you expect in Elasticsearch.
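As a rough sketch of the ksqlDB route (the stream, column, and topic names here are hypothetical): assuming each input record carries an array of nested items, the EXPLODE table function can fan one record out into several, each landing in a new topic that the ES sink connector then indexes as separate documents:

```sql
-- Hypothetical stream over the topic carrying the complex JSON
CREATE STREAM orders_raw (order_id VARCHAR, items ARRAY<STRUCT<sku VARCHAR, qty INT>>)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');

-- Emit one record per array element into a topic the ES sink can consume
CREATE STREAM order_items WITH (KAFKA_TOPIC='order_items') AS
  SELECT order_id, EXPLODE(items) AS item
  FROM orders_raw
  EMIT CHANGES;
```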

Confluent Elasticsearch Sink connector, write.method : "UPSERT" on different key

In the Confluent Elasticsearch sink connector, I am trying to write to the same Elasticsearch index from two different topics. The first topic is INSERT and the other is UPSERT. For UPSERT, I want to update the JSON document based on some field other than "_id". Is that possible? If yes, how can I do that?
Set key.ignore=false and use your existing primary key columns as the _id for each JSON document.
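A hedged sketch of what that could look like in the config of a connector dedicated to the UPSERT topic (write.method applies per connector, and the field name customer_id is hypothetical):

```properties
# Write using upserts, keyed on a field from the value rather than topic+partition+offset
write.method=upsert
key.ignore=false
# Promote the business-key field into the record key, then flatten it to a plain value
transforms=InsertKey,ExtractId
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=customer_id
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=customer_id
```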

Use a Clevercloud drain with an Elasticsearch target

I'm using Clevercloud to host my application and I want to store logs in an Elasticsearch index. I read the documentation here and tried to create a drain as explained. I was able to create the drain, but no data has been added to my Elasticsearch.
Does anybody have an idea?
I found the solution: I couldn't see the data because I was looking at the wrong ES index. Even if you specify an index in your URL, logs are in Logstash format, so by default a new index is created per day, named logstash-YYYY-MM-DD. The data was in those indices.

Topic mapping when streaming from Kafka to Elasticsearch

When I transfer or stream two or three tables, I can easily map them in Elasticsearch, but can I map topics to indices automatically?
I have streamed data from PostgreSQL to ES by mapping manually: topic.index.map=topic1:index1,topic2:index2, etc.
Can whatever topics the producer sends be consumed and mapped automatically by the ES connector?
By default, the topics map directly to an index of the same name.
If you want "better" control, you can use RegexRouter in a transforms property.
To quote the docs:
topic.index.map
This option is now deprecated. A future version may remove it completely. Please use single message transforms, such as RegexRouter, to map topic names to index names
If a single regex cannot capture every topic in one connector, run additional connectors, each with a different pattern.
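For example, a RegexRouter transform that strips a common prefix from topic names before they become index names might look like this (the pg- prefix is hypothetical):

```properties
# Route records from topics like pg-customers to an index named customers
transforms=route
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=pg-(.*)
transforms.route.replacement=$1
```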

Using elasticsearch generated ID's in kafka elasticsearch connector

I noticed that documents indexed in Elasticsearch using the Kafka Elasticsearch connector have their IDs in the format topic+partition+offset.
I would prefer to use IDs generated by Elasticsearch. It seems topic+partition+offset is not always unique, so I am losing data.
How can I change that?
As Phil says in the comments, topic-partition-offset should be unique, so I don't see how this is causing data loss for you.
Regardless - you can either let the connector generate the key (as you are doing), or you can define the key yourself (key.ignore=false). There is no other option.
You can use Single Message Transformations with Kafka Connect to derive a key from the fields in your data. Based on your message in the Elasticsearch forum it looks like there is an id in your data - if that's going to be unique you could set that as your key, and thus as your Elasticsearch document ID too. Here's an example of defining a key with SMT:
# Add the `id` field as the key using Single Message Transformations
transforms=InsertKey, ExtractId
# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
# `ExtractField`: convert key from an object to a plain field
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id
(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)
@Robin Moffatt, as far as I can see, topic-partition-offset can cause duplicates if you upgrade your Kafka cluster, not in a rolling-upgrade fashion but by replacing the cluster with a new one (which is sometimes easier). In that case you will experience data loss because of overwritten data.
Regarding your excellent example, this can be the solution for many cases, but I'd add another option. Maybe you can add an epoch timestamp element to the topic-partition-offset, so the ID becomes topic-partition-offset-current_timestamp.
What do you think?
