Kafka Connect to persist topic to Elasticsearch index using field from (json) message - elasticsearch

I'm attempting to index messages in Elasticsearch using SMTs from Kafka's Connect API only.
So far I've had luck with simply using the topic and timestamp router functionality.
However, now I'd like to create separate indices based on a certain field in the message.
Suppose the messages are formatted as such:
{"productId": 1, "category": "boat", "price": 135000}
{"productId": 1, "category": "helicopter", "price": 300000}
{"productId": 1, "category": "car", "price": 25000}
Would it somehow be possible to index these to the following indices based on product category?
product-boat
product-helicopter
product-car
or would I have to create separate topics for every single category (knowing that it could become hundreds or thousands of them)?
Am I overlooking a transform that could do this, or is this simply not possible and will a custom component have to be built?

There's nothing out of the box with Kafka Connect that will do this. You have a few options:
The Elasticsearch sink connector routes each message to a target index based on its topic, so you could write a custom SMT that inspects a message and routes it to a different topic accordingly
Use a stream processor to pre-process the messages such that they're already on different topics by the time they are consumed by the Elasticsearch sink connector. For example, Kafka Streams or KSQL.
With KSQL you would need to hard-code each category (CREATE STREAM product-boat AS SELECT * FROM messages WHERE category='boat', etc.)
Kafka Streams now has dynamic routing (KIP-303), which would be a more flexible way of doing it (see the sketch below)
Hand-code a bespoke Elasticsearch sink connector with the logic coded in to route the messages to indices based on message contents. This feels like the worst of the three approaches IMO.
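
For the Kafka Streams option, a minimal sketch of KIP-303 dynamic routing might look like the following. It assumes the records are plain JSON strings on a hypothetical products topic; since Kafka 2.0, to() accepts a TopicNameExtractor that derives the destination topic from each record.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ProductRouter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "product-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        ObjectMapper mapper = new ObjectMapper();
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> products = builder.stream("products");

        // KIP-303: to() can take a TopicNameExtractor that picks the
        // destination topic per record, here "product-" + category.
        products.to((key, value, recordContext) -> {
            try {
                String category = mapper.readTree(value).get("category").asText();
                return "product-" + category;          // e.g. product-boat, product-car
            } catch (Exception e) {
                return "product-unparseable";          // dead-letter topic for bad JSON
            }
        });

        new KafkaStreams(builder.build(), props).start();
    }
}

The Elasticsearch sink connector can then subscribe with topics.regex (e.g. product-.*) so that each category lands in its own index.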

If you are using Confluent Platform, you can do this kind of routing based on a field value in the message.
To do that, you have to use the ExtractTopic SMT from Confluent. More details regarding that SMT can be found at https://docs.confluent.io/current/connect/transforms/extracttopic.html#extracttopic
A Kafka sink connector processes messages, which are represented by SinkRecord. Each SinkRecord contains several fields: topic, partition, value, key, etc. Those fields are set by Kafka Connect, and using transformations you can change their values. The ExtractTopic SMT changes the value of the topic based on the value or key of the message.
The transformation configuration will look something like this:
{
  ...
  "transforms": "ExtractTopic",
  "transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
  "transforms.ExtractTopic.field": "name",
  ...
}
where field is the name of the field whose value will be used as the index name.
One limitation is that you have to create the indices in advance.
I assume you are using the Elasticsearch sink connector. The Elasticsearch connector has the ability to create an index, but it does so when it opens, in the method that creates writers for a particular partition (ElasticsearchSinkTask::open). In your use case the indices can't all be created at that moment, because the values of all the messages are not yet available.
Maybe it isn't the purest approach, because ExtractTopic should rather be used with source connectors, but in your case it might work.

Related

For a Kafka sink connector, can I send a single message to multiple index documents in Elasticsearch?

I am receiving a very complex JSON inside a topic message, so I want to do some computations with it using SMTs and send it to different Elasticsearch index documents. Is that possible?
I am not able to find a solution for this.
The Elasticsearch sink connector only writes to one index per record, based on the topic name. It's explicitly written in the Confluent documentation that topic-altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at the Logstash Kafka input and Elasticsearch output as an alternative; however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer such as Kafka Streams or ksqlDB to extract your nested JSON and emit multiple records that you expect in Elasticsearch.
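
If you do go the Kafka Streams route, a rough sketch of the fan-out could look like this. The nested "orders" field, the topic names, and the class name are all hypothetical; the idea is simply to emit one record per nested element and point the Elasticsearch sink at the output topic.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

import java.util.ArrayList;
import java.util.List;

public class FlattenTopology {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> complex = builder.stream("complex-topic");

        // One input record fans out into one output record per element of
        // the nested "orders" array, so the sink indexes each element as
        // its own Elasticsearch document.
        complex.flatMapValues(value -> {
            List<String> docs = new ArrayList<>();
            try {
                JsonNode root = MAPPER.readTree(value);
                root.get("orders").forEach(order -> docs.add(order.toString()));
            } catch (Exception e) {
                // drop records that fail to parse
            }
            return docs;
        }).to("flattened-topic");   // point the Elasticsearch sink at this topic

        return builder.build();
    }
}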

Kafka JDBC Sink handle array datatype

I know that the Kafka JDBC Sink Connector has some drawbacks for the array datatype. However, is it possible to combine the Sink Connector with a simple Kafka Connector which can support the array datatype? How can I filter that from the Kafka configuration and switch to a simple Kafka Connector configuration? What does a simple Kafka configuration mean? How can Kafka Connect support array fields like the following?
name: topic_name
type: array
item: Topic file
Is it possible for it to be consumed into the DB as a string, not an array?
"fields":[{
"name":"item_id",
"type":{
"type":"array",
"items":["null", "string"]
},
"default":[]
}]
}
The Kafka Connect framework itself doesn't impose limitations around types; it's in the source code for the JDBC sink that arrays are rejected.
There's an outstanding PR to support it for Postgres - https://github.com/confluentinc/kafka-connect-jdbc/pull/805
Unclear what you mean by "simple", but if you want to use a different connector, then you'd need to install it, and then change the class. For example, maybe the MongoDB sink handles arrays. I know that S3 and HDFS sinks do...
is it possible to combine the Sink Connector with a simple Kafka Connector
Again, unsure what you mean by this, but connectors generally don't "chain together". While you could use MirrorMaker 2 with a transform to effectively do the same work as Kafka Streams, it's best to use the more appropriate tool for that.
Is this possible where it will consume to the db as a string not an array
Sure, if the message field is actually a string. As suggested, you're going to need to process the message before the sink connector consumes it.
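
As one illustration of that pre-processing, a small Kafka Streams sketch could flatten the array into a delimited string before the JDBC sink sees it. The topic names are hypothetical, and it assumes the record values are JSON strings containing an "item_id" array like the schema above.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

import java.util.ArrayList;
import java.util.List;

public class ArrayToStringTopology {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("source-topic")
               .mapValues(value -> {
                   try {
                       ObjectNode root = (ObjectNode) MAPPER.readTree(value);
                       List<String> ids = new ArrayList<>();
                       for (JsonNode id : root.get("item_id")) {
                           if (!id.isNull()) ids.add(id.asText());
                       }
                       root.put("item_id", String.join(",", ids)); // array -> string
                       return root.toString();
                   } catch (Exception e) {
                       return value; // pass through records that fail to parse
                   }
               })
               .to("jdbc-ready-topic"); // the JDBC sink consumes this topic instead
        return builder.build();
    }
}

The JDBC sink connector would then consume jdbc-ready-topic and only ever see a plain string column.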

Design: Generic Elasticsearch index storing all Kafka topic messages

Hi, I am new to the Elastic stack. This is basically a design-based question. We have a lot of Kafka topics (>500), and each of them stores JSON as the data exchange format. Now we are planning to build a Kafka consumer and dump all the records/JSONs into a single index. We have some requirements, but to begin with, the most important one is being able to search through all the relevant JSONs based on a few important field values. For example, if I have multiple JSONs with a correlation id field whose value is XYZ, then entering XYZ should search through all the topics.
Also, as an additional question: since we are using Kibana, do we have some built-in visualization for this search so that we don't need to build our own UI? This is simply for management searching specific values, and it need not be a very fancy UI.
What would be the best thing to do: is having a single index the best design? What things do we need to consider? I read about the standard analyzer and am wondering if that is enough for our purpose.
Assumption: all Kafka topics will store JSON, and each JSON can be of a different format. Some might have lots of nesting, some might have nested objects, some might be simple.

Topic mapping when streaming from Kafka to Elasticsearch

When I transfer or stream two or three tables, I can easily map them in Elasticsearch, but can I map topics to indices automatically?
I have streamed data from PostgreSQL to ES by mapping manually: topic.index.map=topic1:index1,topic2:index2, etc.
Can I automatically map whatever topics the producer sends, so that the ES connector consumes them automatically?
By default, the topics map directly to an index of the same name.
If you want "better" control, you can use RegexRouter in a transforms property
To quote the docs
topic.index.map
This option is now deprecated. A future version may remove it completely. Please use single message transforms, such as RegexRouter, to map topic names to index names
If you cannot capture a single regex for each topic in the connector, then run more connectors with a different pattern
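As a rough illustration (the topic pattern and index prefix here are made up), the transforms section of the Elasticsearch sink configuration could look like this:
{
  ...
  "topics.regex": "pg-.*",
  "transforms": "AddPrefix",
  "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.AddPrefix.regex": "pg-(.*)",
  "transforms.AddPrefix.replacement": "index-$1",
  ...
}
Every topic matching pg-.* is then routed to an index named index-<suffix>, with no per-topic mapping to maintain.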

Elasticsearch Kafka Connector - setting index based on message value

I have messages coming through a Kafka topic in the following format:
{"elasticsearch_index": "index_1", "first_name": "Jane"}
{"elasticsearch_index": "index_2", "first_name": "John"}
Note that each message contains the desired Elasticsearch index that we would like to route the record to. Is it possible to use Confluent's Elasticsearch Kafka Connector to route these records to the appropriate index name (e.g. whatever is listed under the elasticsearch_index key)?
It doesn't look like the Single Message Transforms (SMT) supports this behavior currently, but maybe I am misreading. Any info will be greatly appreciated.
Two options:
Write your own Transform using the Single Message Transform API (see the sketch below)
Use KSQL (or Kafka Streams) to route the messages to the required topics first, and then use the new (Apache Kafka 1.1) regex capabilities to land those topics in Elasticsearch from Kafka Connect.
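
For the first option, a bare-bones custom SMT might look like the sketch below. The class name is hypothetical, and it assumes schemaless JSON so that the record value arrives as a Map; it copies the elasticsearch_index field into the record's topic, which the Elasticsearch sink then uses as the index name.

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

import java.util.Map;

// Hypothetical SMT: routes each record to the topic named by its
// "elasticsearch_index" field, which the ES sink uses as the index name.
public class RouteByField<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        String targetTopic = record.topic();
        Object value = record.value();
        if (value instanceof Map) {
            Object index = ((Map<?, ?>) value).get("elasticsearch_index");
            if (index != null) {
                targetTopic = index.toString();
            }
        }
        // Only the topic changes; key, value, and timestamp are passed through.
        return record.newRecord(targetTopic, record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), record.value(),
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

After packaging it onto the Connect plugin path, it would be registered on the sink connector via "transforms": "RouteByField" and "transforms.RouteByField.type" set to this class.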
