Logstash Kafka input performance / config tuning - Elasticsearch

I use Logstash to transfer data from Kafka to Elasticsearch and I'm getting the following warning:
WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group kafka-es-sink: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I tried adjusting the session timeout (to 30000) and max poll records (to 250).
The topic produces 1000 events per second in Avro format. There are 10 partitions (2 servers) and two Logstash instances with 5 consumer threads each.
I have no problems with other topics that carry ~100-300 events per second.
I think it is a config issue, because I also have a second connector between Kafka and Elasticsearch on the same topic that works fine (Confluent's kafka-connect-elasticsearch).
The main aim is to compare Kafka Connect and Logstash as connectors. Does anyone have general experience with this?
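For a rough feel of what the two settings imply, here is a back-of-the-envelope sketch in Python (not Logstash code). It assumes, as the warning text says, that a whole batch returned by one poll() has to be processed before session.timeout.ms expires. The function names and the 150 ms figure are hypothetical.

```python
def per_record_budget_ms(session_timeout_ms: int, max_poll_records: int) -> float:
    """Average time available per record if a full batch must finish
    before the session timeout expires."""
    if max_poll_records <= 0:
        raise ValueError("max_poll_records must be positive")
    return session_timeout_ms / max_poll_records


def batch_fits(session_timeout_ms: int, max_poll_records: int, avg_record_ms: float) -> bool:
    """True if a full batch can be processed within the session timeout."""
    return avg_record_ms * max_poll_records < session_timeout_ms


# Values from the question: session timeout 30000 ms, max poll records 250.
budget = per_record_budget_ms(30_000, 250)
print(budget)  # 120.0 ms per record on average

# Hypothetical: if indexing took 150 ms per record, a full batch would not fit.
print(batch_fits(30_000, 250, 150.0))  # False
```

If the downstream output occasionally stalls for longer than that budget allows, the consumer would miss its deadline and the group would rebalance, which matches the warning above.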

Related

JDBC sink connector Confluent to sink based on either time or records

I have used data lake connectors to sink data from a topic, and they allowed me to specify a number of records and an interval.
So the connector would essentially sink as soon as either condition was satisfied.
For example, this one, with the properties specified here.
There you can see the properties flush.size and rotate.interval.ms or rotate.schedule.interval.ms.
I am trying to achieve the same with the JDBC sink connector specified here, but I only see batch.size.
The problem is that at some times of the day messages arrive rather infrequently, so the data is not written to the destination (in this case an Azure SQL Server DB) until batch.size is reached.
Is there a way to make the connector sink either when batch.size is reached or when a certain time interval has elapsed?
I have gone through this very interesting discussion, but I can't find a way to use it to fulfill my requirements.
Also, I have seen the tasks.max property, which essentially spawns multiple "tasks" in parallel to sink the data. So, if my topic has 4 partitions, tasks.max is 4, and batch.size is 10, does that mean each task only writes its data once 10 messages have arrived in its assigned partition?
If anything is unclear, I can elaborate.

Spring Boot Kafka Listener concurrency

I am using the Spring Boot @KafkaListener in my application. Let's assume I use the configuration below:
Topic Partitions : 2
spring.kafka.listener.concurrency : 2
group-id : TEST_GRP_ID
Acknowledgement : Manual
My questions are:
As far as I know, concurrency creates parallel threads to consume messages.
So if thread 1 consumes a batch of records and thread 2 consumes a batch of records, will the messages be processed sequentially before the offsets are committed?
If I have two instances of the microservice in my cloud environment (in production, more partitions and more instances), how will concurrency work? Will each instance create two parallel threads for my Kafka consumer?
How can I improve the performance of my consumer, that is, speed up consumption and processing of the messages?
Your understanding is not too far from the truth. In fact, each partition can be assigned to only one consumer within a given group. The concurrency number gives an approximate target number of consumers. And regardless of the number of microservice instances, at most two consumers can be active if your topic has only two partitions.
So, to increase performance you need more than 2 partitions, or more topics to consume; then they can all be distributed evenly between your instances and their consumers.
See more info in the Kafka consumer docs: https://docs.confluent.io/platform/current/clients/consumer.html
✓ You have concurrency set to 2, which means 2 containers will be created for your listener.
✓ As the topic has 2 partitions, messages from both partitions will be consumed and processed in parallel.
✓ When you spin up one more instance with the same group name, the first thing that happens is a group rebalance.
✓ Even after that, since at any point in time only one consumer from a given consumer group can be assigned to a partition, only 2 containers will end up listening for messages and the other 2 containers just remain idle.
✓ To scale further, you need to add more partitions to the topic so that more listener containers can be active.
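To illustrate the point about idle containers, here is a small Python simulation. It is only a sketch: it uses a simplified round-robin assignment rather than Kafka's actual assignor, so which consumers end up idle may differ in a real cluster, and the consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through consumers."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two service instances with concurrency 2 each: four consumers in one group.
consumers = [
    "instance1-thread0",
    "instance1-thread1",
    "instance2-thread0",
    "instance2-thread1",
]

# A topic with only 2 partitions leaves two of the four consumers idle.
assignment = assign_round_robin([0, 1], consumers)
idle = [consumer for consumer, parts in assignment.items() if not parts]
print(assignment)
print(len(idle))  # 2

# With 4 partitions every consumer gets work.
print(assign_round_robin([0, 1, 2, 3], consumers))
```

Whatever the assignment strategy, the number of active consumers in a group cannot exceed the number of partitions.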

Kafka Elasticsearch Connector for bulk operations

I am using the Elasticsearch Sink Connector for operations (index, update, delete) on single records.
Elasticsearch also has a /_bulk endpoint which can be used to create, update, index, or delete multiple records at once. Documentation here.
Does the Elasticsearch Sink Connector support these types of bulk operations? If so, what is the configuration I need, or is there any sample code I can review?
Internally the Elasticsearch sink connector creates a bulk processor that is used to send records in a batch. To control this processor you need to configure the following properties:
batch.size: The number of records to process as a batch when writing to Elasticsearch.
max.in.flight.requests: The maximum number of indexing requests that can be in-flight to Elasticsearch before blocking further requests.
max.buffered.records: The maximum number of records each task will buffer before blocking acceptance of more records. This config can be used to limit the memory usage for each task.
linger.ms: Records that arrive in between request transmissions are batched into a single bulk indexing request, based on the batch.size configuration. Normally this only occurs under load when records arrive faster than they can be sent out. However it may be desirable to reduce the number of requests even under light load and benefit from bulk indexing. This setting helps accomplish that - when a pending batch is not full, rather than immediately sending it out the task will wait up to the given delay to allow other records to be added so that they can be batched into a single request.
flush.timeout.ms: The timeout in milliseconds to use for periodic flushing, and when waiting for buffer space to be made available by completed requests as records are added. If this timeout is exceeded the task will fail.
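As a sketch of how these properties might fit together, the Python snippet below builds a connector definition in the JSON shape that the Kafka Connect REST API accepts. The connector name, topic, connection URL and all numeric values are hypothetical placeholders rather than recommendations, and connector.class, topics and connection.url are assumed here because they are not mentioned above.

```python
import json

# Hypothetical connector definition; Kafka Connect expects config values as strings.
connector = {
    "name": "es-sink-example",  # hypothetical name
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "my-topic",                       # hypothetical topic
        "connection.url": "http://localhost:9200",  # hypothetical Elasticsearch URL
        # Bulk-related settings described above (illustrative values only):
        "batch.size": "2000",
        "max.in.flight.requests": "5",
        "max.buffered.records": "20000",
        "linger.ms": "1000",
        "flush.timeout.ms": "10000",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

With a non-trivial linger.ms, a partially filled batch waits up to that delay for more records, so even a lightly loaded topic produces fewer, larger bulk requests.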

Kafka Connect JDBC Sink Commits

I cannot find the commit strategy, or a parameter for it, for the Kafka Connect JDBC sink with respect to the JDBC target.
Does it commit every N rows (whatever N is), or when batch.size is reached? Committing per batch or on completion would make sense.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task: once partitions have been opened for writing, Connect will begin forwarding records from Kafka using the put(Collection) API.
The JDBC sink connector writes each batch of messages passed through the put(Collection) method in a single transaction (the size of which can be controlled via the connector's consumer settings).
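The idea of "one transaction per put(Collection) call" can be sketched in Python as follows. This is not the connector's code: sqlite3 stands in for the JDBC target, and the put function and the events table are hypothetical.

```python
import sqlite3


def put(conn, records):
    """Write one batch atomically: either every row is committed or none is."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO events (id, payload) VALUES (?, ?)", records
            )
    except sqlite3.Error:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

ok = put(conn, [(1, "a"), (2, "b")])               # whole batch committed
failed = put(conn, [(3, "c"), (1, "duplicate")])   # primary key clash, batch rolled back

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(ok, failed, count)  # True False 2
```

The second batch leaves no trace in the table, including its valid row with id 3, because the rollback covers everything handed over in that call.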

Do the events in the same partition go to the same FlowFile using Kafka Consumer in NiFi

The post below sets Max Poll Records to 1 to guarantee that the events in one flow file come from the same partition.
https://community.hortonworks.com/articles/223849/simple-backup-and-restore-of-kafka-messages-via-ni.html
Does that mean that, when using a Message Demarcator, the events in the same FlowFile can come from different partitions?
From the source code, I think that is the case?
https://github.com/apache/nifi/blob/ea9b0db2f620526c8dd0db595cf8b44c3ef835be/nifi-nar-bundles/nifi-kafka-bundle/nifi-kafka-0-9-processors/src/main/java/org/apache/nifi/processors/kafka/pubsub/ConsumerLease.java#L366
When using a demarcator it creates a bundle per topic/partition, so you will get flow files where all messages are from the same topic partition:
https://github.com/apache/nifi/blob/ea9b0db2f620526c8dd0db595cf8b44c3ef835be/nifi-nar-bundles/nifi-kafka-bundle/nifi-kafka-0-9-processors/src/main/java/org/apache/nifi/processors/kafka/pubsub/ConsumerLease.java#L378
The reason that post set Max Poll Records to 1 is explained in the post: the key of the messages is only available when there is 1 message per flow file, and they needed the key in this case. In general, it is better not to do this and to have many messages per flow file.
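The bundling behaviour described above can be sketched in Python like this. It is an illustration only, not the NiFi implementation: the function name, the tuple layout of a record, and the newline demarcator are assumptions.

```python
from collections import defaultdict


def bundle_by_partition(records, demarcator=b"\n"):
    """Group (topic, partition, value) records per topic/partition and
    join each group's values with the demarcator."""
    groups = defaultdict(list)
    for topic, partition, value in records:
        groups[(topic, partition)].append(value)
    return {key: demarcator.join(values) for key, values in groups.items()}


# One poll may return records from several partitions.
polled = [
    ("events", 0, b"a"),
    ("events", 1, b"b"),
    ("events", 0, b"c"),
]

flow_files = bundle_by_partition(polled)
print(flow_files)
# {('events', 0): b'a\nc', ('events', 1): b'b'}
```

Each resulting bundle holds messages from exactly one topic partition, even though the poll itself mixed partitions.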
