Kafka Elasticsearch Connector for bulk operations

I am using the Elasticsearch Sink Connector for operations (index, update, delete) on single records.
Elasticsearch also has a /_bulk endpoint, which can be used to create, update, index, or delete multiple records at once (see the Elasticsearch bulk API documentation).
Does the Elasticsearch Sink Connector support these types of bulk operations? If so, what is the configuration I need, or is there any sample code I can review?

Internally, the Elasticsearch sink connector creates a bulk processor that is used to send records in batches. To control this processor you need to configure the following properties:
batch.size: The number of records to process as a batch when writing to Elasticsearch.
max.in.flight.requests: The maximum number of indexing requests that can be in-flight to Elasticsearch before blocking further requests.
max.buffered.records: The maximum number of records each task will buffer before blocking acceptance of more records. This config can be used to limit the memory usage for each task.
linger.ms: Records that arrive in between request transmissions are batched into a single bulk indexing request, based on the batch.size configuration. Normally this only occurs under load when records arrive faster than they can be sent out. However it may be desirable to reduce the number of requests even under light load and benefit from bulk indexing. This setting helps accomplish that - when a pending batch is not full, rather than immediately sending it out the task will wait up to the given delay to allow other records to be added so that they can be batched into a single request.
flush.timeout.ms: The timeout in milliseconds to use for periodic flushing, and when waiting for buffer space to be made available by completed requests as records are added. If this timeout is exceeded the task will fail.
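In other words, the bulk requests happen automatically; there is no separate "bulk mode" to enable, you only tune the batching. A minimal sink connector configuration that sets these properties might look like the following sketch (the topic, connection URL, and the values shown are only illustrative placeholders):

name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=my-topic
connection.url=http://localhost:9200
type.name=_doc
key.ignore=true

# batching / bulk behaviour
batch.size=2000
max.in.flight.requests=5
max.buffered.records=20000
linger.ms=50
flush.timeout.ms=10000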

Related

How do we know when a flow is completed when we have multiple flowfiles running in parallel?

I have a requirement where a template uses SQL as the source and SQL as the destination, and the data is more than 100 GB per table. The template will be instantiated multiple times, once per table to be migrated, and each table is also partitioned into multiple flowfiles. How do we know when the process is completed? Because there are multiple flowfiles, we cannot conclude that the flow is done just because one of them hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it reports the count per connection, and it is difficult to fetch the connection id for each connection and concatenate them because we have a large number of templates. The reporting task also has another problem: it reports data for all process groups on the NiFi canvas, which is a huge amount of data if all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection id.
Can you please suggest some ideas to help me achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify processor pair.
If you don't want multiple flowfiles running in parallel:
2 - Set backpressure on the queue.
3 - Specify process-group-level flowfile concurrency (recommended, but NiFi 1.12+ only).

logstash kafka input performance / config tuning

I use logstash to transfer data from Kafka to Elasticsearch and I'm getting the following error:
WARN org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group kafka-es-sink: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I tried to adjust the session timeout (to 30000) and max poll records (to 250).
The topic produces 1000 events per second in Avro format. There are 10 partitions (2 servers) and two Logstash instances with 5 consumer threads each.
I have no problems with other topics that produce ~100-300 events per second.
I think it must be a configuration issue, because I also have a second connector between Kafka and Elasticsearch on the same topic which works fine (Confluent's kafka-connect-elasticsearch).
The main aim is to compare Kafka Connect and Logstash as connectors. Does anyone have general experience with this?
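For reference, the setup described above would look roughly like this in the Logstash pipeline (a sketch only: broker addresses, topic, Elasticsearch host, and the Avro schema path are placeholders, and the option names assume the newer logstash-input-kafka plugin built on the Kafka 0.9+ consumer):

input {
  kafka {
    bootstrap_servers  => "broker1:9092,broker2:9092"   # placeholder brokers
    topics             => ["my_topic"]                  # placeholder topic
    group_id           => "kafka-es-sink"
    consumer_threads   => 5                             # x2 Logstash instances = 10 threads for 10 partitions
    session_timeout_ms => "30000"
    max_poll_records   => "250"
    codec              => avro { schema_uri => "/path/to/schema.avsc" }  # placeholder schema
  }
}
output {
  elasticsearch {
    hosts => ["es-host:9200"]                           # placeholder
    index => "my-index"                                 # placeholder
  }
}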

How do I add a custom monitoring feature in my Spark application?

I am developing a Spark application. The application takes data from a Kafka queue and processes it. After processing, it stores the data in an HBase table.
Now I want to monitor some performance attributes such as:
Total count of input and output records (not all records will be persisted to HBase; some of the data may be filtered out during processing).
Average processing time per message.
Average time taken to persist the messages.
I need to collect this information and send it to a different Kafka queue for monitoring.
The monitoring should not incur a significant delay in the processing.
Please suggest some ideas for this.
Thanks.
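One possible approach (an illustration only, not taken from this thread) is to keep running totals in Spark accumulators and publish a small metrics record per micro-batch from the driver, so the per-record processing path is untouched. The sketch below assumes Spark 2.x with the spark-streaming-kafka-0-10 integration and kafka-clients on the classpath; broker addresses, topic names, and the filter that stands in for the real processing logic are placeholders.

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.util.LongAccumulator;

public class MonitoredStreamingJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("monitored-job");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker:9092");   // placeholder
        kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("group.id", "monitored-job");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("input-topic"), kafkaParams));  // placeholder topic

        // Accumulators keep running totals across batches and are read on the driver.
        LongAccumulator inputRecords  = jssc.sparkContext().sc().longAccumulator("inputRecords");
        LongAccumulator outputRecords = jssc.sparkContext().sc().longAccumulator("outputRecords");

        stream.foreachRDD(rdd -> {
            long start = System.currentTimeMillis();

            long in = rdd.count();  // records pulled from Kafka in this batch (second pass over the batch; cache if needed)
            JavaRDD<String> processed = rdd.map(ConsumerRecord::value)
                                           .filter(v -> v != null && !v.isEmpty());  // placeholder for the real processing/filtering
            long out = processed.count();  // records that would be persisted
            // ... write `processed` to HBase here; time this section separately for the persist-time metric ...

            long elapsed = System.currentTimeMillis() - start;
            inputRecords.add(in);
            outputRecords.add(out);

            // Publish one small JSON metric record per batch from the driver,
            // so individual message processing is not slowed down.
            String metric = String.format(
                    "{\"batchInput\":%d,\"batchOutput\":%d,\"batchMillis\":%d,\"avgMillisPerRecord\":%.3f,"
                    + "\"totalInput\":%d,\"totalOutput\":%d}",
                    in, out, elapsed, in == 0 ? 0.0 : (double) elapsed / in,
                    inputRecords.value(), outputRecords.value());

            Properties p = new Properties();
            p.put("bootstrap.servers", "broker:9092");          // placeholder
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // A producer per batch keeps the sketch simple; in practice you would reuse one.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("monitoring-topic", metric));  // placeholder topic
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}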

Apache Nifi processor that acts like a barrier to synchronize multiple flow files

I'm evaluating Nifi for our ETL process.
I want to build the following flow:
Fetch a lot of data from an SQL database -> split into chunks of 1000 records each -> count error records in each chunk -> count the total number of error records -> if it exceeds a threshold, fail the process -> else save each chunk to the database.
The problem I can't resolve is how to wait until all chunks are validated. If, for example, I have 5 validation tasks working concurrently, I need some kind of barrier to wait until all chunks are processed, and only after that run the error-count processor, because I don't want to save invalid data and then have to delete it if the threshold is reached.
The other question I have is whether there is any possibility to run this validation processor on multiple nodes in parallel and still be able to wait until they are all completed.
One solution is to use the ExecuteScript processor as a "relief valve" that holds a simple count, triggered by the first receipt of a flowfile with a specific attribute value (stored in local/cluster state as, essentially, a map from the key attribute value to a count). Once that count reaches a threshold, you can generate a new flowfile, containing the attribute value that has finished, and route it to the success relationship. Meanwhile, send the other results (the flowfiles that need to be batched) to a MergeContent processor with the minimum batch size set to whatever you like. The processor that follows the valve should have its Scheduling Strategy set to Event Driven so it only runs when it receives a flowfile from the valve.
Updating a count in the DistributedMapCache is not the correct way to do this, because the fetch and the update are separate operations and cannot be made atomic by a processor that simply increments the count.
http://apache-nifi-users-list.2361937.n4.nabble.com/How-do-I-atomically-increment-a-variable-in-NiFi-td1084.html

HBase Data Access performance improvement using HBase API

I am trying to scan some rows using a prefix filter on an HBase table. I am on HBase 0.96.
I want to increase the throughput of each RPC call so as to reduce the number of requests hitting the region.
I tried getCaching(int) and setCacheBlocks(true) on the Scan object. I also tried adding resultScanner.next(int). Using all these combinations I am still not able to reduce the number of RPC calls; I am still hitting the HBase region for each key instead of bringing back multiple keys per RPC call.
The HBase region server / DataNode has enough CPU and memory allocated. My data is evenly distributed across the different region servers, and the data I am bringing back per key is not a lot.
I observed that when I add more data to the table, the time taken for a request increases. It also increases when the number of requests increases.
Thank you for your help.
R
A prefix filter is usually a performance killer because it performs a full table scan. Always use a start and stop row in your scans rather than a prefix filter:
Scan scan = new Scan(Bytes.toBytes("prefix"), Bytes.toBytes("prefix~"));
When you iterate over the Results from the ResultScanner, every iteration is an RPC call; you can call resultScanner.next(n) to get a batch of results in one go.
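Putting both suggestions together, a short sketch against the HBase 0.96-era client API might look like this (the table name, prefix, and caching/batch sizes are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixRangeScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");   // placeholder table name

        // Range scan instead of PrefixFilter: start at the prefix, stop just past it.
        Scan scan = new Scan(Bytes.toBytes("prefix"), Bytes.toBytes("prefix~"));
        scan.setCaching(500);        // rows fetched per RPC by the scanner
        scan.setCacheBlocks(false);  // avoid filling the block cache during a large scan

        ResultScanner scanner = table.getScanner(scan);
        try {
            Result[] batch;
            // next(n) returns up to n rows per call instead of one row at a time.
            while ((batch = scanner.next(100)).length > 0) {
                for (Result result : batch) {
                    byte[] rowKey = result.getRow();  // process each row here
                }
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}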
