ClickHouse Kafka engine virtual column _timestamp meaning

In the ClickHouse Kafka engine, there are virtual columns named _timestamp and _timestamp_ms.
What exactly do these fields mean? Is it when the message
was sent to Kafka,
was consumed by ClickHouse, or
was stored in ClickHouse?
Or something different?

It is the timestamp from the Kafka message itself: https://kafka.apache.org/20/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html
https://stackoverflow.com/a/61857937/11644308
It is set by your producer or your broker, depending on the topic's message.timestamp.type setting (CreateTime or LogAppendTime).
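As an illustration, the virtual columns can be read directly in a SELECT from a Kafka engine table. This is a hedged sketch; the topic, broker address, and table name are placeholder assumptions:

```sql
-- Sketch, assuming a hypothetical topic `events` on a local broker.
CREATE TABLE events_queue (payload String)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse_events',
         kafka_format = 'JSONAsString';

-- _timestamp carries the Kafka message timestamp (producer- or broker-set),
-- not the time ClickHouse consumed or stored the row.
SELECT payload, _topic, _offset, _timestamp, _timestamp_ms
FROM events_queue;
```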

Related

For the Kafka sink connector, can I send a single message to multiple index documents in Elasticsearch?

I am receiving a very complex JSON inside a topic message, so I want to do some computations with it using SMTs and send it to documents in different Elasticsearch indices. Is that possible?
I am not able to find a solution for this.
The Elasticsearch sink connector only writes to one index per record, based on the topic name. It's explicitly written in the Confluent documentation that topic-altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at the Logstash Kafka input and Elasticsearch output as an alternative; however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer such as Kafka Streams or ksqlDB to extract your nested JSON and emit multiple records that you expect in Elasticsearch.
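The fan-out such a stream processor would perform can be sketched in plain Python. This is only an illustration of the logic, not Kafka Streams or ksqlDB code, and the field names (order_id, items, sku) are hypothetical:

```python
import json

def fan_out(message: str) -> list:
    """Split one nested JSON message into several flat records,
    one per element of a nested array (hypothetical 'items' field)."""
    event = json.loads(message)
    parent = {k: v for k, v in event.items() if k != "items"}
    # Emit one record per nested item, each carrying the parent fields.
    return [{**parent, **item} for item in event.get("items", [])]

msg = '{"order_id": 7, "items": [{"sku": "a"}, {"sku": "b"}]}'
records = fan_out(msg)
print(records)
```

Each emitted record could then be produced to its own topic and indexed separately by the sink connector.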

kafka-connect-elasticsearch: How to sync elasticsearch with consumer group?

I want to query messages in a Kafka topic, but not all messages and not from the beginning. I just need to see which messages are not yet committed based on a consumer group. So, basically, what I want is to delete the documents whose offset is lower than a consumer group's offset.
At this point, if I use elastic-connector, is there any way or a workaround to delete documents from the elastic index after a message is consumed and committed?
Or, should I use Kafka Streams and how?
The sink connector only deletes documents when delete behavior is explicitly enabled and there is a null-valued record for a document ID in the topic you're reading. This means you need to actually consume this null record and have it be processed by the connector.
see which messages are not yet committed
This would imply messages that have not been processed by the connector, making them not searchable in Elasticsearch
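For reference, the delete-on-tombstone behavior is controlled by connector properties. A hedged fragment of a sink config, with the connector name and topic as placeholders (key.ignore must be false so the document ID comes from the record key):

```json
{
  "name": "es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "my-topic",
    "key.ignore": "false",
    "behavior.on.null.values": "delete"
  }
}
```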
delete the documents whose offset is lower than a consumer group offset
If you created a fresh index in Elasticsearch that's only used by the connector, you could pause the connector, then truncate the index, then resume the connector
is there any way or a workaround to delete documents from the elastic index after a message is consumed and committed
Directly use the DELETE API
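A minimal sketch of building a call to the Elasticsearch delete-document API with Python's standard library; the host, index, and document ID are placeholders:

```python
import urllib.request

def build_delete_request(host: str, index: str, doc_id: str) -> urllib.request.Request:
    # Elasticsearch delete-document API: DELETE /<index>/_doc/<id>
    url = f"{host}/{index}/_doc/{doc_id}"
    return urllib.request.Request(url, method="DELETE")

req = build_delete_request("http://localhost:9200", "my-index", "42")
print(req.get_method(), req.full_url)
# Send with urllib.request.urlopen(req) against a live cluster.
```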

Kafka JDBC Sink handle array datatype

I know that the Kafka JDBC Sink Connector has some drawbacks for the array datatype. However, is it possible to combine the sink connector with a simple Kafka connector which can support the array datatype? How can I filter them from the Kafka configuration and switch them into a simple Kafka connector configuration? What does a simple Kafka configuration mean? How can Kafka Connect support array fields?
name: topic_name
type: array
item: Topic file
Is this possible where it will consume to the db as a string not an array
"fields": [{
    "name": "item_id",
    "type": {
        "type": "array",
        "items": ["null", "string"]
    },
    "default": []
}]
}
The Kafka Connect framework itself doesn't impose limitations around types; it's the source code of the JDBC sink that rejects arrays.
There's an outstanding PR to support it for Postgres - https://github.com/confluentinc/kafka-connect-jdbc/pull/805
Unclear what you mean by "simple", but if you want to use a different connector, then you'd need to install it, and then change the class. For example, maybe the MongoDB sink handles arrays. I know that S3 and HDFS sinks do...
is it possible to combine the Sink Connector with a simple Kafka Connector
Again, unsure what you mean by this, but connectors generally don't "chain together". While you could use MirrorMaker2 with a transform to effectively do the same work as Kafka Streams, best to use the more appropriate tools for that
Is this possible where it will consume to the db as a string not an array
Sure, if the message field is actually a string. As suggested, you're going to need to process the message before the sink connector consumes it
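As an illustration of that pre-processing, a Python sketch that flattens the array field (item_id, from the schema above) into a plain string before the record reaches the sink; in practice this would live in a stream processor or a custom SMT:

```python
import json

def stringify_array_field(record: dict, field: str = "item_id") -> dict:
    """Replace an array-valued field with its JSON string form so the
    JDBC sink can store it in a plain VARCHAR/TEXT column."""
    out = dict(record)
    if isinstance(out.get(field), list):
        out[field] = json.dumps(out[field])
    return out

row = stringify_array_field({"id": 1, "item_id": ["a", None, "b"]})
print(row)
```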

Kafka Connect to persist topic to Elasticsearch index using field from (json) message

I'm attempting to index messages in Elasticsearch using SMTs from Kafka's Connect API only.
So far I've had luck with simply using the topic and timestamp router functionality.
However, now I'd like to create separate indices based on a certain field in the message.
Suppose the messages are formatted as such:
{"productId": 1, "category": "boat", "price": 135000}
{"productId": 1, "category": "helicopter", "price": 300000}
{"productId": 1, "category": "car", "price": 25000}
Would it somehow be possible to index these to the following indices based on product category?
product-boat
product-helicopter
product-car
or would I have to create a separate topic for every single category (knowing that there could be hundreds or thousands of them)?
Am I overlooking a transform that could do this, or is this simply not possible and will a custom component have to be built?
There's nothing out of the box with Kafka Connect that will do this. You have a few options:
The Elasticsearch sink connector will route messages to a target index based on its topic, so you could write a custom SMT that would inspect a message and route it to a different topic accordingly
Use a stream processor to pre-process the messages such that they're already on different topics by the time they are consumed by the Elasticsearch sink connector. For example, Kafka Streams or KSQL.
With KSQL you would need to hard-code each category (CREATE STREAM product-boat AS SELECT * FROM messages WHERE category='boat', etc.)
Kafka Streams now has Dynamic Routing (KIP-303) which would be a more flexible way of doing it
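The idea behind KIP-303's dynamic routing, mirrored as a plain Python sketch (in Kafka Streams this logic would live in a TopicNameExtractor passed to `to()`; the field and prefix are taken from the question):

```python
def extract_topic(record: dict, prefix: str = "product-") -> str:
    """Derive the destination topic from the record's category field,
    mirroring what a Kafka Streams TopicNameExtractor would return."""
    return prefix + record["category"]

print(extract_topic({"productId": 1, "category": "boat", "price": 135000}))
# One function covers every category, with no per-category stream definitions.
```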
Hand-code a bespoke Elasticsearch sink connector with the logic built in to route messages to indices based on message contents. This feels like the worst of the three approaches, IMO.
If you are using Confluent Platform, you can do some kind of routing that depends on a field value in the message.
To do that you have to use the ExtractTopic SMT from Confluent. More details regarding that SMT can be found at https://docs.confluent.io/current/connect/transforms/extracttopic.html#extracttopic
A Kafka sink connector processes messages that are represented by SinkRecord. Each SinkRecord consists of several fields: topic, partition, value, key, etc. Those fields are set by Kafka Connect, and using a transformation you can change those values. The ExtractTopic SMT changes the value of topic based on the value or key of the message.
The transformations configuration will look something like this ("transforms.ExtractTopic.field" names the message field whose value will be used as the index name):
{
...
"transforms": "ExtractTopic",
"transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
"transforms.ExtractTopic.field": "name",
...
}
One limitation is that you have to create the indices in advance.
I assume you are using the Elasticsearch sink connector. The connector is able to create an index, but it does so when it opens writers for a particular partition (ElasticsearchSinkTask::open). In your use case the indices can't all be created at that moment, because the values of all the messages are not yet available.
Maybe it isn't the purest approach, because ExtractTopic is intended rather for source connectors, but in your case it might work.

Elasticsearch Kafka Connector - setting index based on message value

I have messages coming through a Kafka topic in the following format:
{"elasticsearch_index": "index_1", "first_name": "Jane"}
{"elasticsearch_index": "index_2", "first_name": "John"}
Note that each message contains the desired Elasticsearch index that we would like to route the record to. Is it possible to use Confluent's Elasticsearch Kafka Connector to route these records to the appropriate index name (e.g. whatever is listed under the elasticsearch_index key)?
It doesn't look like Single Message Transforms (SMTs) support this behavior currently, but maybe I am misreading. Any info will be greatly appreciated.
Two options:
Write your own Transform using the Single Message Transform API
Use KSQL (or Kafka Streams) to route the messages to the required topics first, and then use the new (Apache Kafka 1.1) regex capabilities in Kafka Connect to land those topics in Elasticsearch.
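A hedged fragment of what the regex-based sink configuration could look like (topics.regex has been supported by Kafka Connect since Apache Kafka 1.1; the pattern below is a placeholder matching per-category topics):

```json
{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "topics.regex": "product-.*",
  "key.ignore": "true"
}
```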
