kafka-connect-elasticsearch: How to sync Elasticsearch with a consumer group?

I want to query messages in a Kafka topic, but not all of them and not from the beginning. I just need to see which messages have not yet been committed by a given consumer group. So, basically, what I want is to delete the documents whose offset is lower than that consumer group's committed offset.
At this point, if I use the Elasticsearch sink connector, is there any way, or a workaround, to delete documents from the Elasticsearch index after a message is consumed and committed?
Or should I use Kafka Streams instead, and if so, how?

The sink connector only deletes documents when its delete-on-null behavior is explicitly enabled and there is a null-valued record (a tombstone) for a document ID in the topic you're reading. This means the connector actually has to consume that null record and process it.
"see which messages are not yet committed"
This would imply messages that have not yet been processed by the connector, which means they are not searchable in Elasticsearch anyway.
"delete the documents whose offset is lower than a consumer group offset"
If you created a fresh index in Elasticsearch that is only used by the connector, you could pause the connector, truncate the index, then resume the connector.
"is there any way or a workaround to delete documents from the elastic index after a message is consumed and committed"
Directly use the Elasticsearch DELETE API.
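As a rough sketch of both options (assuming a recent Confluent Elasticsearch sink connector and Elasticsearch 7+ style endpoints; the index name and document ID below are placeholders):

# Sink connector properties: delete the document whose ID matches the record
# key whenever a tombstone (null-valued) record is consumed from the topic.
behavior.on.null.values=delete
key.ignore=false

# Manual alternative via the Elasticsearch REST API
# ("my-index" and "42" are placeholders for your index name and document ID):
DELETE /my-index/_doc/42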

Related

For the Kafka sink connector, can I send a single message to multiple index documents in Elasticsearch?

I am receiving a very complex JSON payload inside a topic message, so I want to do some computations on it using SMTs and send the results to different Elasticsearch index documents. Is this possible?
I have not been able to find a solution for this.
The Elasticsearch sink connector only writes to one index, per record, based on the topic name. It's explicitly written in the Confluent documentation that topic-altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at the Logstash Kafka input and Elasticsearch output as an alternative; however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer such as Kafka Streams or ksqlDB to extract your nested JSON and emit the multiple records you expect in Elasticsearch.
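As a rough sketch of that intermediate step (the topic names, the "type" field check, and the extractNestedDocuments() helper are all made up for illustration), a Kafka Streams application can fan one complex record out into several records, each routed to its own topic, so the sink connector writes them to separate indices:

// Minimal Kafka Streams sketch: split one complex JSON record into several
// records and route each one to a topic that maps to its own ES index.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class SplitToIndices {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "split-to-indices");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> complex = builder.stream("complex-json-topic");

        complex
            // emit one output record per nested element of the incoming JSON
            .flatMapValues(SplitToIndices::extractNestedDocuments)
            // route each record to a topic derived from its content; each output
            // topic then maps 1:1 to an Elasticsearch index via the sink connector
            .to((key, value, ctx) -> value.contains("\"type\":\"order\"") ? "orders" : "customers");

        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical helper: replace with real extraction of the nested pieces.
    private static Iterable<String> extractNestedDocuments(String json) {
        return java.util.Collections.singletonList(json);
    }
}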

Kafka Sink Connector: Is there any way to apply value transformations only for messages that meet a condition?

I'm trying to use the Elasticsearch sink connector to transfer all messages to an ES index.
There is a drop transformation for Kafka connectors that tells the connector to delete rows from ES if the body is null.
What if, for the delete action, we send a message with a non-null body? Is there any way to apply a transformation with some condition/predicate while continuing to process create/update messages without transformation? For example, applying the drop-value transformation only to rows that have a deleted flag set to true in their bodies.
The transforms don't "delete from ES"; they only modify the Kafka record.
If you want to act only on specific records, that's what the Filter w/ Predicate transform is for, which you'd need to chain before a drop transformation, since I don't think it's possible to trigger an ES delete with a non-null record value.
"at the same time continue processing create/update messages without transformation?"
You'd need to run another connector that reverses the predicate condition of the first (see the sketch below).
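A minimal sketch of that predicate mechanism using only the Apache Kafka built-ins (the transform/predicate aliases and the "deleted" header are made up; the built-in predicates only match on topic name, a header key, or a tombstone, so a condition on a body field would need a custom predicate class or an upstream step that sets such a header):

# Connector A: keep only records flagged for deletion (Filter with negate=true
# removes everything the predicate does not match), then chain the value-nulling
# "drop" transform from the question after this so flagged records become tombstones.
transforms=keepOnlyFlagged
transforms.keepOnlyFlagged.type=org.apache.kafka.connect.transforms.Filter
transforms.keepOnlyFlagged.predicate=isFlaggedDeleted
transforms.keepOnlyFlagged.negate=true
predicates=isFlaggedDeleted
predicates.isFlaggedDeleted.type=org.apache.kafka.connect.transforms.predicates.HasHeaderKey
predicates.isFlaggedDeleted.name=deleted

# Connector B (the reversed condition): drop the flagged records and process
# all other create/update messages untouched.
transforms=skipFlagged
transforms.skipFlagged.type=org.apache.kafka.connect.transforms.Filter
transforms.skipFlagged.predicate=isFlaggedDeleted
predicates=isFlaggedDeleted
predicates.isFlaggedDeleted.type=org.apache.kafka.connect.transforms.predicates.HasHeaderKey
predicates.isFlaggedDeleted.name=deleted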

Topic mapping when streaming from Kafka to Elasticsearch

When I transfer or stream two or three tables, I can easily map them in Elasticsearch, but can I map topics to indices automatically?
I have streamed data from PostgreSQL to ES by mapping topics manually: topic.index.map=topic1:index1,topic2:index2, etc.
Can whatever topics the producer sends be consumed by the ES connector and mapped automatically?
By default, the topics map directly to an index of the same name.
If you want "better" control, you can use RegexRouter in a transforms property
To quote the docs
topic.index.map
This option is now deprecated. A future version may remove it completely. Please use single message transforms, such as RegexRouter, to map topic names to index names
If you cannot capture a single regex for each topic in the connector, then run more connectors with a different pattern
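For illustration, a minimal RegexRouter sketch (the "pg." topic prefix is made up; adjust the regex to your own topic naming scheme):

# Strip a hypothetical "pg." prefix so topic "pg.customers" writes to index "customers".
transforms=toIndex
transforms.toIndex.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.toIndex.regex=pg\.(.*)
transforms.toIndex.replacement=$1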

Is an upsert possible with Kafka Connect to Elasticsearch?

I'm receiving events which end up in Kafka. From these events I fetch the id using a Kafka Streams application and post it back to Kafka as a pair of (id, 1) in another topic. Then I would like to see if the id already exists in Elasticsearch and, if so, update its counter; otherwise, create a new record in Elasticsearch with the id from Kafka and the counter set to 1, i.e. an upsert of the record (id, 1) to ES.
I was hoping to use Kafka Connect to Elasticsearch for this, but it seems it is not that straightforward, if possible at all. I can see that adding records to ES works, but merging with existing records is something I haven't found out how to do yet. Is this possible already, and if so, how? And if not, is it planned to be possible in a nearby release?
I forked the datamountaineer ES sink connector to allow upserts. With it you can specify a PK and run an update with docAsUpsert into ES. You can grab the project and compile the JAR from my GitHub fork.
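If you are on a recent version of the Confluent Elasticsearch sink connector instead, it exposes an upsert mode directly; a minimal sketch, assuming the Kafka record key carries the id you want to use as the document ID (note this merges fields into an existing document; incrementing a counter server-side would still need a scripted update outside the connector):

# Use the Kafka record key as the Elasticsearch document ID...
key.ignore=false
# ...and issue update/upsert requests instead of plain index requests.
write.method=upsert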

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index, so we could persist them while the index was down for an extended period of time (for example, a complete re-index) and then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages, but I haven't done any sort of estimation of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task, as most of the work would be done automatically; however, there are 4 questions I have remaining:
1. If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase. Is this possible to do (perhaps it is even the default behaviour)?
2. Is it possible to apply some sort of versioning so that a document in the index is not overwritten by an older version coming from Couchbase?
3. If I were to add a new field to the index, I might need to re-index from the actual document datasource and then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and, say, 500,000 documents in Couchbase that I want to re-apply to Elasticsearch. What would the speed be like?
4. Would I be able to apply any sort of logic in between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that is in Couchbase.
XDCR is used to send "all the events" to Elasticsearch. This means the index is updated (or entries are deleted) every time a document stored in Couchbase is created, modified or deleted.
So "all the documents" stored in a Couchbase bucket are indexed in Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch plugin.
1. When a document is removed from Couchbase, the Elasticsearch index is updated (the entry is removed).
2. I'm not sure I understand the question. How could an "older" version come from Couchbase? Anyway, once again, every time a document stored in Couchbase is modified, the index in Elasticsearch is updated.
3. I'm not sure I understand where you want to add a new field. If it is in a document stored in Couchbase, the index will be updated when the document is sent to Elasticsearch. But, based on what I have said before, all documents "stored" in Couchbase will be present in the Elasticsearch index.
4. Not with the plugin as it is today, but as you know it is an open-source project, so you can either add some logic to it or even contribute your ideas to the project (https://github.com/couchbaselabs/elasticsearch-transport-couchbase).
So let me ask you more questions:
- How do you insert documents in your application (and where: Couchbase? Elasticsearch?)
- What are the types of documents?
- What do you want to cache in Couchbase?
