Achieve concurrency in Kafka consumers - parallel-processing

We are working on parallelising our Kafka consumer so that it can process more records and handle peak load. One thing we already do is spin up as many consumers as there are partitions, all within the same consumer group.
Our consumer makes an API call, which is currently synchronous. We felt that making this API call asynchronous would let our consumer handle more load. Hence, we are trying to make the API call asynchronous and commit the offset when its response arrives. However, we are seeing an issue with this:
Because the API calls are asynchronous, we may get the response for the last record first, before the API calls for the earlier records have completed or even started. If we commit the offset as soon as we receive the response for the last record, the committed offset moves to that last record. If the consumer then restarts or the partitions rebalance, we will not receive any of the records before the committed offset again, so we will miss the unprocessed records.
We already have 25 partitions. We would like to know whether anyone has achieved parallelism without increasing the number of partitions, or whether increasing the partitions is the only way to achieve parallelism (and avoid these offset issues).

First, you need to decouple (if only at first) the reading of the messages from the processing of these messages. Next look at how many concurrent calls you can make to your API as it doesn't make any sense to call it more frequently than the server can handle, asynchronously or not. If the number of concurrent API calls is roughly equal to the number of partitions you have in your topic, then it doesn't make sense to call the API asynchronously.
If the number of partitions is significantly less than the max number of possible concurrent API calls, then you have a few choices. You could try to make the max number of concurrent API calls with fewer threads (one per consumer) by calling the API asynchronously as you suggest, or you can create more threads and make your calls synchronously. Of course, then you get into the problem of how your consumers can hand their work off to a greater number of shared threads, but that's exactly what streaming execution platforms like Flink or Storm do for you. Streaming platforms (like Flink) that offer checkpoint processing can also solve your problem of how to handle offset commits when messages are processed out of order. You could roll your own checkpoint processing and your own shared thread management, but you'd have to really want to avoid using a streaming execution platform.
Finally, you might have more consumers than max possible concurrent API calls, but then I'd suggest that you just have fewer consumers and share partitions, not API calling threads.
And, of course, you can always change the number of your topic partitions to make your preferred option above more feasible.
Either way, to answer your specific question, you want to look at how Flink does checkpoint processing with Kafka offset commits. To oversimplify (because I don't think you want to roll your own), the Kafka consumers have to remember not only the offsets they just committed, but also the previously committed offsets, and together those define a block of messages flowing through your application. Either that block of messages is processed all the way through in its entirety, or you need to roll back the processing state of each thread to the point where the last message in the previous block was processed. Again, that's a major oversimplification, but that's kinda how it's done.
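To make the offset problem from the question concrete, here is a minimal Python sketch of a simpler technique than Flink-style checkpointing: track which offsets have completed and only ever commit up to the first gap. The `OffsetTracker` class and its method names are made up for illustration and are not part of any Kafka client; the returned value follows Kafka's convention that the committed offset is the next offset to read.

```python
class OffsetTracker:
    """Tracks out-of-order completions for one partition (illustrative helper)."""

    def __init__(self, start_offset):
        self._next_to_commit = start_offset  # lowest offset not yet completed
        self._done = set()                   # completed offsets above that point

    def mark_done(self, offset):
        """Record that the async call for `offset` finished.

        Returns the position that is currently safe to commit."""
        self._done.add(offset)
        # Advance only across a gap-free run of completed offsets.
        while self._next_to_commit in self._done:
            self._done.remove(self._next_to_commit)
            self._next_to_commit += 1
        return self._next_to_commit


tracker = OffsetTracker(start_offset=100)
# Responses arrive out of order: the last record (104) answers first.
positions = [tracker.mark_done(o) for o in (104, 101, 100, 102, 103)]
print(positions)  # [100, 100, 102, 103, 105]
```

With this, a response for record 104 arriving first does not move the commit position past 100. After a restart, at worst some already-processed records are replayed, so the API call needs to tolerate duplicates.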

You should look at Kafka batch processing. In a nutshell: you can configure large batches (batch.size on the producer and, on the consumer side, batch-related settings such as max.poll.records and fetch.min.bytes) with a small number of partitions, or even a single one. Once a whole batch of messages has been consumed on the consumer side (i.e. it is in RAM), you can parallelise the processing of those messages in any way you want.
I would really like to share links, but there are too many of them scattered across the web.
UPDATE
In terms of committing offsets, you can do this once for the whole batch.
In general, Kafka does not meet its performance targets by inflating the number of partitions, but rather by relying on batch processing.
I have seen a lot of projects suffer from scaling up partitions (you may see issues later, during rebalancing for example). The rule of thumb: look at every available batch setting first.
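As a rough illustration of the batch idea, the Python sketch below processes one already-polled batch in parallel and only then computes a single offset to commit for the whole batch. No Kafka client is involved: `call_api`, `process_batch` and the `(offset, payload)` tuples are hypothetical stand-ins for the real API call and the real consumer records.

```python
from concurrent.futures import ThreadPoolExecutor


def call_api(record):
    """Hypothetical stand-in for the synchronous API call."""
    offset, payload = record
    return offset, payload.upper()


def process_batch(batch, max_workers=8):
    """Process every record of one polled batch in parallel, then return
    the results and the single offset to commit for the whole batch."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all calls to finish (and re-raises any failure)
        # before a commit position is computed.
        results = list(pool.map(call_api, batch))
    commit_offset = batch[-1][0] + 1  # Kafka commits the *next* offset to read
    return results, commit_offset


batch = [(offset, f"msg-{offset}") for offset in range(40, 45)]
results, commit_offset = process_batch(batch)
print(commit_offset)  # 45
```

Because nothing is committed until every call in the batch has returned, a crash in the middle of a batch means the whole batch is redelivered rather than partially skipped.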

Related

Many producers single consumer fair job scheduling in Golang

I have multiple producers that stage objects (jobs) for processing, and a single consumer that takes the objects one by one. I need to design a kind of scheduler in Go with the following requirements:
Scheduling is asynchronous, i.e. each producer works in a separate goroutine.
The scheduler interface is "good" in terms of idiomatic Go (I'm new to Go).
A producer can remove or replace its staged object (if not yet consumed) with zero or minimal loss of its position in the queue. If a producer misses its slot because it cancelled and then restaged an object, it still keeps the privilege to stage as early as possible until the end of that particular round.
"Fair" scheduling between producers.
Customizable multi-level weighting/prioritization.
I'd like some hints and examples on the right design for such a scheduler.
I feel I need every producer to wait for a token on a channel, then write (or not write) an object to a shared consumer channel, then release the token so that it is routed to the next producer. Still, I'm not sure this is the best approach. Besides, it takes 3 sequential synchronous operations per producer, so I'm afraid I'll hit performance pitfalls because the token travels too slowly between producers. Also, 3 steps for one operation is probably not idiomatic Go.

AWS Kinesis - Avoiding stalled shards

I am using Kinesis to process events in a micro-service architecture. Events are partitioned at a client project level to ensure all events related to the same project occur in the correct sequence. Currently, if there is an error processing one of the events, this can cause the events from other partitions to also become blocked. I had hoped that by increasing the parallelisation factor and bisecting the batch on error, the other lambda processors would be able to continue processing events from other partitions. This is largely the case, but there are still times when multiple partitions become stuck, presumably because Kinesis sometimes decides to always allocate several partitions to the same lambda processor.
My question is: is there any way to avoid this in Kinesis, or will I need to start making use of a dead letter queue and removing events that are repeatedly failing? The downside to this is that I don't really want to continue processing further events for the same partition once there is a failure, as the state of the micro-service is likely to be corrupt at this point, and I would rather our team manually address whatever issue has occurred before continuing to play events from the failed partition.

Kafka: is it better to have a lot of small messages or fewer, but bigger ones?

There is a microservice which receives batches of messages from the outside and pushes them to Kafka. Each message is sent separately, so for each batch I have around 1000 messages of 100 bytes each. It seems that the messages take much more space internally, because the free space on the disk is going down much faster than I expected.
I'm thinking about changing the producer logic so that it puts the whole batch in one message (the consumer would then split it by itself). But I haven't found any information about space or performance issues with many small messages, nor any guidelines about the balance between size and count. And I don't know Kafka well enough to draw my own conclusion.
Thank you.
The producer will, by itself, batch messages that are destined for the same partition, in order to avoid unnecessary calls.
The producer does this thanks to its background threads; for example, it may batch 3 messages before sending them to each partition.
If you also set compression on the producer side, it will compress the messages (GZip, LZ4 and Snappy are the valid codecs) before sending them over the wire. This property can also be set on the broker side (so the messages are sent uncompressed by the producer and compressed by the broker).
It depends on your network capacity whether you prefer a slower producer (as compression will slow it down) or a bigger load on the wire. Note that setting a high compression level on big files may significantly affect your overall performance.
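To get a feel for why batching and compression belong together, here is a small Python sketch that compresses 1000 small records either one by one or as a single batch. It uses plain gzip on made-up JSON payloads and is not Kafka's actual wire or storage format, so the numbers only indicate the general trend, not real disk usage.

```python
import gzip
import json

# 1000 small records of roughly 100 bytes each (made-up payload for illustration).
messages = [
    json.dumps({"id": i, "event": "page_view", "user": f"user-{i % 50}",
                "path": "/products/list", "status": "ok"}).encode()
    for i in range(1000)
]

raw_size = sum(len(m) for m in messages)
# Each record compressed separately: every record pays the gzip framing cost
# and the compressor never sees the repetition between records.
per_message_size = sum(len(gzip.compress(m)) for m in messages)
# All records compressed together as one batch.
batched_size = len(gzip.compress(b"\n".join(messages)))

print(raw_size, per_message_size, batched_size)
```

On data like this, the batched figure comes out far smaller than the per-message one, which is the effect the producer-side batching is after when compression is enabled.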
Anyway, I believe the big/small message problem hurts the consumer side a lot more. Sending messages to Kafka is easy and fast (the default behaviour is async, so the producer won't be too busy). But on the consumer side, you'll have to look at the way you are processing the messages:
One Consumer-Worker
Here you couple consuming with processing. This is the simplest way: the consumer runs in its own thread, reads a Kafka message and processes it. Then it continues the loop.
One Consumer - Many workers
Here you decouple consuming and processing. In most cases, reading from kafka will be faster than the time you need to process the message. It is just physics. In this approach, one consumer feeds many separate worker threads that share the processing load.
More info about this here, just above the Constructors area.
Why do I explain this? Well, if your messages are too big and you choose the first option, your consumer may not call poll() within the timeout interval, so it will rebalance continuously. If your messages are big (and take some time to be processed), it is better to implement the second option, as the consumer will carry on calling poll() without triggering rebalances.
If the messages are too big and too many, you may have to start thinking about different structures that can buffer the messages in memory. Pools, deques and queues, for example, are different options to accomplish this.
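The "one consumer, many workers" layout with an in-memory buffer can be sketched in Python roughly as follows. There is no Kafka client here: `consumer_loop` merely stands in for the poll() loop, the doubling in `worker` stands in for slow processing, and all names are made up for illustration.

```python
import queue
import threading

NUM_WORKERS = 4
work_queue = queue.Queue(maxsize=100)  # bounded, so a slow pool applies back-pressure
processed = []
processed_lock = threading.Lock()


def worker():
    while True:
        msg = work_queue.get()
        if msg is None:          # sentinel: no more work
            break
        result = msg * 2         # hypothetical stand-in for slow processing
        with processed_lock:
            processed.append(result)


def consumer_loop(batches):
    """Stands in for the poll() loop: it only hands records off, never processes."""
    for batch in batches:
        for msg in batch:
            work_queue.put(msg)  # blocks only when the buffer is full
    for _ in range(NUM_WORKERS):
        work_queue.put(None)     # one sentinel per worker


workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in workers:
    t.start()
consumer_loop([[1, 2, 3], [4, 5], [6]])
for t in workers:
    t.join()
print(sorted(processed))  # [2, 4, 6, 8, 10, 12]
```

Note that a blocking put on a full buffer would itself delay the next poll(), so a real consumer would more likely stop fetching (for instance by pausing its partitions) while the buffer drains, and would commit offsets only for records the workers have actually finished.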
You may also increase the poll timeout interval. This may hide dead consumers from you, so I don't really recommend it.
So my answer would be: it depends, basically on your network capacity, your required latency and your processing capacity. If you are able to process big messages as fast as smaller ones, then I wouldn't care much.
Maybe if you need to filter and reprocess older messages I'd recommend partitioning the topics and sending smaller messages, but it's only a use-case.

Two processes single producer / single consumer in Windows. What is better Mutex, Event or Semaphore

I could use any of these primitives to make it work, but I wonder which one is most suitable for such a scenario from a performance perspective.
I need to synchronize only two processes. There are always two, no more, no less. One writes to a memory-mapped file while the other reads from it in a producer/consumer fashion. I care about performance, and given how simple the scenario is, I think I could use something lightweight, but I don't know for sure which one is faster while still being adequate for this scenario.
First point: they're all kernel objects so all of them involve a switch from user mode to kernel mode. That imposes enough overhead by itself that you're unlikely to notice any real difference between them in terms of speed or anything like that. Therefore, which one is preferable will depend a great deal upon how you're structuring the data in the shared memory region, and how you use it.
Let's start with what would probably be the simplest case: that the shared memory region forms the bottleneck. All the time that the consumer isn't reading, the producer will be writing, and vice versa. At least initially, this seems like a case where we can use a single mutex. The producer waits on the mutex, writes data, releases the mutex. The consumer waits on the mutex, reads data, releases the mutex. This continues until everything is done.
Unfortunately, while this protects against the producer and consumer using the shared region at the same time, it does not ensure proper operation. For example: the producer writes a buffer full of information, then releases the mutex. Then it waits on the mutex again, so when the reader is done it can write more data -- but at that point, there's no guarantee that the consumer will be the next one to get the mutex. The producer might get it back immediately, and write more data over what it just produced, so the consumer will never see the previous data.
One way to prevent that would be to use a couple of events: one from the producer to the consumer to say that there's data waiting to be read, and the other from the consumer to the producer to say all the data in the buffer has been read. In this case, the producer waits on its event, which the consumer will only set when it's done reading data. The producer then writes some data, and signals the consumer's event to say some data is ready. The consumer reads the data, and then signals the producer's event so the cycle can continue.
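The two-event handshake can be sketched in Python like this. It is only an analogy: two threads and `threading.Event` stand in for the two processes and their Windows event objects, the `shared` dict stands in for the memory-mapped region, and the explicit `clear()` calls emulate auto-reset events. All names are made up for illustration.

```python
import threading

buffer_free = threading.Event()   # consumer -> producer: buffer has been read
data_ready = threading.Event()    # producer -> consumer: buffer holds new data
shared = {"value": None}          # stands in for the memory-mapped region
received = []
ITEMS = 5


def producer():
    for i in range(ITEMS):
        buffer_free.wait()        # wait until the consumer is done reading
        buffer_free.clear()       # emulate an auto-reset event
        shared["value"] = i
        data_ready.set()          # tell the consumer there is data


def consumer():
    for _ in range(ITEMS):
        data_ready.wait()         # wait until the producer has written
        data_ready.clear()
        received.append(shared["value"])
        buffer_free.set()         # tell the producer the buffer is free again


buffer_free.set()                 # the buffer starts out empty
threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # [0, 1, 2, 3, 4]
```

Unlike the single-mutex version, the producer cannot overwrite a value the consumer has not yet read, because it has to wait for `buffer_free` before every write.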
As long as you only have a single producer and a single consumer and treat the entire buffer as a single "chunk" of data that's controlled together, that's adequate. That, however, can lead to a problem. Let's consider, for example, a web server front-end as the producer and a back-end as the consumer (and some separate mechanism for passing results back to the web server). If the buffer is small enough to hold only one request, the producer may have to buffer up several incoming requests while the consumer is processing one. Each time the consumer is ready to process a request, the producer has to stop what it's doing, copy a request to the buffer, and let the consumer know it can proceed.
The basic point of separate processes, however, is to let each proceed on its own schedule as much as possible. To allow that, we might make room in our shared buffer for a number of requests. At any given time, some number of those slots will be full (or, looking at it from the other direction, some number will be free). For this case, we just about need a counted semaphore to track those slots. The producer can write something any time at least one slot is free. The consumer can read something any time at least one slot is filled.
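A multi-slot buffer guarded by counted semaphores might look like the following Python sketch. Again this is an analogy using threads rather than two Windows processes: the `ring` list stands in for the shared memory region, and the slot count, item count and names are arbitrary choices for illustration.

```python
import threading

SLOTS = 4
ITEMS = 20
ring = [None] * SLOTS                     # stands in for the shared buffer
free_slots = threading.Semaphore(SLOTS)   # counts empty slots
filled_slots = threading.Semaphore(0)     # counts slots holding data
consumed = []


def producer():
    for i in range(ITEMS):
        free_slots.acquire()              # wait for at least one free slot
        ring[i % SLOTS] = i
        filled_slots.release()            # one more slot is filled


def consumer():
    for i in range(ITEMS):
        filled_slots.acquire()            # wait for at least one filled slot
        consumed.append(ring[i % SLOTS])
        free_slots.release()              # one more slot is free


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(consumed == list(range(ITEMS)))  # True
```

With exactly one producer and one consumer, each side keeps its own position in the ring, so no extra mutex is needed; the producer can run up to `SLOTS` items ahead before it has to wait.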
Bottom line: the choice isn't about speed. It's about how you use and structure the data and the processes' access to it. Assuming it's really as simple as you describe, the pair of events is probably the simplest mechanism that will work.

Are there any tools to optimize the number of consumer and producer threads on a JMS queue?

I'm working on an application that is distributed over two JBoss instances and that produces/consumes JMS messages on several JMS queues.
When we configured the application we had to determine which threading model we would use, in particular the number of producing and consuming threads per queue. We have done this in a rather ad-hoc fashion but after reading the most recent columns by Herb Sutter in Dr Dobbs (in particular this one) I would like to size our threads in a more rigorous manner.
Are there any methods/tools to measure the throughput of JMS queues (in particular JBoss Messaging queues) as a function of the number of producing/consuming threads?
This is not really about a specific tool, but may be helpful.
Consumers:
Not sure what your inner architecture is, but let's assume it's an MDB reading in messages. I assert that your only requirement here for rigorous thread count sizing is to choose a maximum cap. If your MDB uses resources from a finite supplier like a JDBC connection pool, consider the maximum cap as the highest number of concurrent instances from that resource that you can tolerate taking. If the MDB's queue is remote, you probably want to consider remote connections (or technically, JMS sessions) a finite resource. If the MDB has less finite requirements (and the queue is local), your maximum cap becomes the number of threads, memory used and/or flat out CPU consumed by the working threads. The reasoning here is that the JBoss MDB container will simply keep allocating more MDB instances (and therefore threads) until the queue is empty or the maximum cap is reached. The only reason I can think of that you would really agonize over the minimum would be if the container's elapsed time or overhead to create new instances is above your tolerance and those operations are usually pretty small potatoes.
Producers
A general axiom of messaging is that producers nearly always outperform consumers. You would think this is pretty arbitrary, but it is a pattern I see recurring all the time, even in widely different messaging scenarios. Anyways, it's tough to say how the threading should work for the producer without knowing a bit about the application, but are you basically capable of [indefinitely] proportionally increasing the number of producer threads and the number of messages generated, or do you have some sort of cap where additional threads simply do not generate more messages ? I would guess it is the latter since most useful work has some limited data or calculation supplier. As I see it, the two drivers here are ordering and persistence.
First off, if you have strict message ordering, where messages must be processed First Produced First Processed (FPFP), then you're in a bit of a bind, because you almost have to drop down to single-threaded throughput unless you can devise some form of logical message demarcation (e.g. a client number where any given client's messages are always sent to the same queue, but you may have multiple queues, each serviced by one thread, so each client is effectively FPFP).
Ordering aside, persistence is the next consideration, in that if you have reliable and extensive message persistence (or have a very high tolerance for message loss), just let the producer threads go to town. The messages will queue up reliably and eventually the consumers will [hopefully] catch up. However, if the number of persisted messages or simple queue depths can potentially give you the willies when they get too high, here's where a tool might come in useful. If your producer thread count can be dynamically modified (which it can in many Java ThreadPool implementations), then you could sample the queue depths and raise or lower the producer thread count in accordance with the queue depth ranges you define, optionally to the point where if the consumers basically stall, so will the producers. I do not know of a specific tool that does this, but between two JBoss servers this is fairly simple to whip up. Picking your mapping from queue depth to producer thread count will be trickier.
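The depth-based throttling idea can be reduced to a small lookup, sketched here in Python. The function name, the thresholds and the fractions are all invented for illustration; how the queue depth is sampled and how the pool is resized depend on the JMS provider and thread pool in use and are left out.

```python
def target_producer_threads(queue_depth, depth_ranges, max_threads):
    """Map a sampled queue depth to a producer thread count.

    depth_ranges is a list of (depth_limit, fraction_of_max) pairs sorted by
    depth_limit; the first limit that the depth stays under wins."""
    for depth_limit, fraction in depth_ranges:
        if queue_depth < depth_limit:
            return max(1, int(max_threads * fraction))
    return 0  # deeper than every range: stall the producers entirely


# Made-up thresholds: full speed under 1,000 queued messages, then back off.
RANGES = [(1_000, 1.0), (10_000, 0.5), (50_000, 0.25)]
samples = [200, 5_000, 20_000, 80_000]
plan = [target_producer_threads(d, RANGES, max_threads=16) for d in samples]
print(plan)  # [16, 8, 4, 0]
```

A periodic task would call something like this with the latest sampled depth and resize the producer pool to the returned value, so that when the consumers stall and the queue keeps growing, the producers eventually stall too.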
Having said all that, I am going to actually read the article you linked to.....
I've got the perfect thing for you: IBM provide a free command line tool called perfharness.
It's aimed at benchmarking JMS providers, i.e. measuring the throughput of queues (single or multiple) given different numbers of producing or consuming threads.
Some features:
Send and consume messages at a fixed rate (msg/s) or at maximum rate possible on the queue
Use a specific number of threads
Use either JMS or native MQ
Can use data either generated randomly or taken from a file
Generates statistics telling you exactly how fast your queue is performing
The only down side is that it's not super intuitive, given the number of operations it supports. And IBM haven't open sourced it, which is a shame. However it sounds perfect for your purposes.
