multiple consumers per kinesis shard - sharding

I read that you can have multiple consumer apps per Kinesis stream.
http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
However, I heard you can only have one consumer per shard. Is this true? I can't find any documentation to support this, and can't imagine how that could be the case if multiple consumers are reading from the same stream. Certainly, it can't mean the producer needs to repeat content in different shards for different consumers.

The Kinesis Client Library starts threads in the background, each of which listens to one shard in the stream. You cannot connect to a shard over multiple threads; that is by design.
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-scaling.html
For example, if your application is running on one EC2 instance and is processing one Amazon Kinesis stream that has four shards, this one instance has one KCL worker and four record processors (one record processor for every shard). These four record processors run in parallel within the same process.
In the explanation above, the term "KCL worker" refers to a Kinesis consumer application, not the threads.
But below, the same "KCL worker" term refers to a "Worker" thread in the application, which is a Runnable.
Typically, when you use the KCL, you should ensure that the number of instances does not exceed the number of shards (except for failure standby purposes). Each shard is processed by exactly one KCL worker and has exactly one corresponding record processor, so you never need multiple instances to process one shard.
See the Worker.java class in the KCL source.

Late to the party, but the answer is that you can have multiple consumers per Kinesis shard. A KCL instance will only start one process per shard, but you can have another KCL instance consuming the same stream (and shard), assuming the second one has permission.
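To see this at the API level, here is a minimal boto3 sketch (the stream name and region are hypothetical) in which two independent readers each obtain their own iterator over the same shard:

import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

# Hypothetical single-shard stream; both "consumers" read the same shard.
shard_id = kinesis.list_shards(StreamName='example-stream')['Shards'][0]['ShardId']

iterators = []
for _ in range(2):  # two independent consumers
    # Each consumer gets its own shard iterator, starting at the oldest record.
    resp = kinesis.get_shard_iterator(
        StreamName='example-stream',
        ShardId=shard_id,
        ShardIteratorType='TRIM_HORIZON',
    )
    iterators.append(resp['ShardIterator'])

# Both iterators independently return the same records.
for it in iterators:
    print(len(kinesis.get_records(ShardIterator=it, Limit=10)['Records']))

Each iterator tracks its own position; the shared read limit quoted below is the real constraint.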
There are limits, though, as laid out in the docs, including:
Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.
If you want a stream with multiple consumers where each message will be processed once, you're probably better off with something like Amazon Simple Queue Service.

To keep it simple: you can have multiple/different Lambda functions triggered on Kinesis data. This way both of your Lambdas will get all the data from the Kinesis stream. The downside is that you will now have to increase the throughput at the Kinesis level, which is going to be pricey. Use SQS instead for your use case.

Related

Dynamically adapt the number of consumer thread to the number of Kafka partitions

I have a Kafka topic with 50 partitions.
My Spring Boot application uses Spring Kafka to read those messages with a @KafkaListener.
The number of instances of my application autoscale in my Kubernetes.
By default, it seems that Spring Kafka launches 1 consumer thread per topic.
org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1
So, with a single instance of the application, one thread is reading all 50 partitions.
With 2 instances, the load is balanced and each instance listens to 25 partitions, still with 1 thread per instance.
I know I can set the number of threads using the concurrency parameter on @KafkaListener.
But this is a fixed value.
Is there any way to tell Spring to dynamically adapt the number of consumer threads to the number of partitions the client is currently listening to?
I think there might be a better way of approaching this.
You should figure out how many records / partitions in parallel one instance of your application can handle optimally, through load / performance tests.
Let's say one instance can handle 10 threads / records in parallel optimally. Now if you scale your app out to 50 instances, in your approach each instance will get one partition, and each instance will be performing below its capacity, wasting resources.
Now consider the opposite: only one instance is left, and it spawns 50 threads to consume from all partitions in parallel. The app's performance will be severely degraded; it might become unresponsive or even crash.
So, in this hypothetical scenario, what you might want to do is, for example, start with one or two instances handling all partitions with 10 threads each, and have it scale up to 5 instances if there's consumer lag, so that each partition has a dedicated thread processing it.
Again, the actual figures should be determined through load / performance testing.
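Spring aside, the underlying sizing rule is easy to sketch. A rough illustration with kafka-python (the topic, group, and optimal thread count are hypothetical; Spring's concurrency setting plays the role of the thread loop):

import threading
from kafka import KafkaConsumer

TOPIC = 'my-topic'        # hypothetical topic
GROUP = 'my-group'        # hypothetical consumer group
OPTIMAL_THREADS = 10      # determined through load / performance testing

def consume():
    # Consumers in the same group split the topic's partitions between them.
    consumer = KafkaConsumer(TOPIC, group_id=GROUP,
                             bootstrap_servers=['localhost:9092'])
    for record in consumer:
        pass  # process the record

# Never start more threads than there are partitions to assign.
probe = KafkaConsumer(bootstrap_servers=['localhost:9092'])
partition_count = len(probe.partitions_for_topic(TOPIC) or [])
probe.close()

for _ in range(min(partition_count, OPTIMAL_THREADS)):
    # Threads keep polling until the process is stopped.
    threading.Thread(target=consume).start()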

Stream thread calculation

I'm using the Streams DSL. I have three source topics with 17, 100, and 40 partitions.
I will be running three instances and two standby instances.
How can I calculate how many stream threads I will need so that each thread gets exactly one task, or the highest parallelism is achieved?
This depends on the structure of your application. You can run the application with a single thread and observe the number of created tasks; the number of tasks is the maximum number of threads you can use.
The tasks that are created are logged, or you can obtain them via KafkaStreams#localThreadsMetadata().
I will try to discuss an approach here in short.
You are asking for maximum parallelism. This can be achieved by separating each topic out into its own topology, with each topology given its own thread count of topic partitions / instances (one thread per consumer per topic): 17/3, 100/3 and 40/3, i.e. roughly 6, 34 and 14 threads per instance. This makes sure that each topology gets a separate thread count and separate parallelism, and each topology will act as a separate consumer group.

Kafka work queue with a dynamic number of parallel consumers

I want to use Kafka to "divide the work". I want to publish instances of work to a topic, and run a cloud of identical consumers to process them. As each consumer finishes its work, it will pluck the next work from the topic. Each work should only be processed once by one consumer. Processing work is expensive, so I will need many consumers running on many machines to keep up. I want the number of consumers to grow and shrink as needed (I plan to use Kubernetes for this).
I found a pattern where a unique partition is created for each consumer. This "divides the work", but the number of partitions is set when the topic is created. Furthermore, the topic must be created on the command line, e.g.
bin/kafka-topics.sh --zookeeper localhost:2181 --partitions 3 --topic divide-topic --create --replication-factor 1
...
from kafka import KafkaConsumer, TopicPartition

for n in range(0, 3):
    consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
    partition = TopicPartition('divide-topic', n)
    consumer.assign([partition])
...
I could create a unique topic for each consumer, and write my own code to assign work to those topics. That seems gross, and I still have to create topics via the command line.
A work queue with a dynamic number of parallel consumers is a common architecture. I can't be the first to need this. What is the right way to do it with Kafka?
The pattern you found is accurate. Note that topics can also be created using the Kafka Admin API and partitions can also be added once a topic has been created (with some gotchas).
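For example, a minimal sketch of programmatic topic creation with kafka-python's admin client (the topic name and sizing are carried over from the question's example):

from kafka.admin import KafkaAdminClient, NewTopic

# Create the work topic programmatically instead of via kafka-topics.sh.
admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([
    NewTopic(name='divide-topic', num_partitions=50, replication_factor=1)
])
admin.close()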
In Kafka, the way to divide work and allow scaling is to use partitions. This is because in a consumer group, each partition is consumed by a single consumer at any time.
For example, you can have a topic with 50 partitions and a consumer group subscribed to this topic:
When the throughput is low, you can have only a few consumers in the group and they should be able to handle the traffic.
When the throughput increases, you can add consumers, up to the number of partitions (50 in this example), to pick up some of the work.
In this scenario, 50 consumers is the limit in terms of scaling. Consumers expose a number of metrics (like lag), allowing you to decide if you have enough of them at any time.
Thank you Mickael for pointing me in the correct direction.
https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.
https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa
Having consumers as part of the same consumer group means providing the “competing consumers” pattern with whom the messages from topic partitions are spread across the members of the group. Each consumer receives messages from one or more partitions (“automatically” assigned to it) and the same messages won’t be received by the other consumers (assigned to different partitions). In this way, we can scale the number of the consumers up to the number of the partitions (having one consumer reading only one partition); in this case, a new consumer joining the group will be in an idle state without being assigned to any partition.
Example code for dividing the work among 3 consumers, up to a maximum of 100:
bin/kafka-topics.sh --partitions 100 --topic divide-topic --create --replication-factor 1 --zookeeper localhost:2181
...
from kafka import KafkaConsumer

for n in range(0, 3):
    consumer = KafkaConsumer(group_id='some-constant-group',
                             bootstrap_servers=['localhost:9092'])
...
I think you are on the right path.
Here are the steps involved:
Create the Kafka topic and the required partitions. The number of partitions is the unit of parallelism; in other words, you run that many consumers to process the work.
You can increase the partitions if the scaling requirements increase, BUT this comes with caveats like repartitioning. Please read the Kafka documentation about adding new partitions.
Define a Kafka consumer group for the consumers. Kafka will assign partitions to the available consumers in the consumer group and rebalance automatically whenever a consumer is added or removed (see the sketch after this list).
If the consumers are packaged as Docker containers, Kubernetes helps in managing the containers, especially in multi-node environments. Other tools include Docker Swarm, OpenShift, Mesos, etc.
Kafka guarantees ordering within a partition, not across partitions.
Check out the delivery guarantees (at-least-once, exactly-once) based on your use case.
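As a rough sketch of this consumer-group pattern with kafka-python (the group name and processing function are hypothetical): every copy of this process that you start or stop triggers a rebalance, and Kafka redistributes the partitions automatically.

from kafka import KafkaConsumer

def do_expensive_work(payload: bytes) -> None:
    # Placeholder for the real (expensive) processing.
    print('processing', payload)

consumer = KafkaConsumer(
    'divide-topic',
    group_id='work-queue',                 # hypothetical group name
    bootstrap_servers=['localhost:9092'],
    enable_auto_commit=False,              # commit only after work is done
)

for record in consumer:
    do_expensive_work(record.value)
    consumer.commit()                      # at-least-once delivery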
Alternatively, you can use the Kafka Streams API. Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
Since you have a slow-consumer use case, it's a great fit for Confluent's Parallel Consumer (PC). PC directly solves this by sub-partitioning the input partitions by key and processing each key in parallel, so processing can take as long as you like. It also tracks per-record acknowledgement. Check out the Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).

Kinesis stream / shard - multiple consumers

I have already read some questions about Kinesis shards and multiple consumers, but I still don't understand how it works.
My use case: I have a Kinesis stream with just one shard. I would like to consume this shard using different Lambda functions, each of them independent of the others; it's as if each Lambda function would have its own shard iterator.
Is it possible to set multiple Lambda consumers (stream-based) reading from the same stream/shard?
To clarify: you can set multiple Lambdas as consumers on a Kinesis stream, but the Lambdas will block each other on processing. If your stream has only one shard, each will only have one concurrent Lambda invocation.
If you have one Kinesis stream, you can connect as many Lambda functions as you want through event source mappings.
All functions will run simultaneously and fully independent of each other and will constantly be invoked if new records arrive in the stream.
The number of shards does not matter.
For a single lambda function:
"For Lambda functions that process Kinesis or DynamoDB streams the number of shards is the unit of concurrency. If your stream has 100 active shards, there will be at most 100 Lambda function invocations running concurrently. This is because Lambda processes each shard’s events in sequence." [https://docs.aws.amazon.com/lambda/latest/dg/scaling.html]
But there is no limit on how many different Lambda consumers you can attach to a Kinesis stream.
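As an illustration, attaching two functions to the same stream with boto3 (the ARN and function names are hypothetical):

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-1')
stream_arn = 'arn:aws:kinesis:us-east-1:123456789012:stream/example-stream'

# One event source mapping per consuming function; each function sees every record.
for function_name in ('consumer-a', 'consumer-b'):
    lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        StartingPosition='LATEST',
        BatchSize=100,
    )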
Yes, no problem with this!
The number of shards doesn't limit the number of consumers a stream can have.
In your case, it will just limit the number of concurrent invocations of each Lambda. This means that each consumer can have at most as many concurrent executions as there are shards.
See this doc for more details.
Short answer:
Yes it will work, and will work concurrently.
Long answer:
Each shard in a Kinesis stream has 2 MiB/sec of read throughput:
https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html
If you have multiple applications (in your case, Lambdas), they will share that throughput.
A description taken from the link above:
Fixed at a total of 2 MiB/sec per shard. If there are multiple consumers reading from the same shard, they all share this throughput. The sum of the throughput they receive from the shard doesn't exceed 2 MiB/sec.
If you create (write) less than 1 MiB/sec of data, you should be able to support two "applications" with a single shard.
In general, if you have Y shards and X applications, it should work properly assuming your total write throughput (in MiB/sec) is less than 2 MiB/sec × Y / X and the data is spread equally between the shards.
If you need each "application" to get its own 2 MiB/sec, you can enable "Consumers with Enhanced Fan-Out", which "fans out" the stream, giving each application a dedicated 2 MiB/sec per shard instead of a shared one.
This is described in the following link:
https://docs.aws.amazon.com/streams/latest/dev/introduction-to-enhanced-consumers.html
In Amazon Kinesis Data Streams, you can build consumers that use a feature called enhanced fan-out. This feature enables consumers to receive records from a stream with throughput of up to 2 MiB of data per second per shard. This throughput is dedicated, which means that consumers that use enhanced fan-out don't have to contend with other consumers that are receiving data from the stream.
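Registering an enhanced fan-out consumer is a single API call; a minimal boto3 sketch (the stream ARN and consumer name are hypothetical):

import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

# Each registered consumer gets its own dedicated 2 MiB/sec per shard.
response = kinesis.register_stream_consumer(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/example-stream',
    ConsumerName='consumer-a',
)
print(response['Consumer']['ConsumerARN'])

The returned consumer ARN is what you then pass to SubscribeToShard (or to a Lambda event source mapping) to receive the dedicated throughput.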

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the DB (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40 million messages per day.
Our current way of processing is to have listeners on the queues, and then multiple threads of producers and consumers which do the unmarshalling and the subsequent DB write.
My question: can this process fit into the Storm use case scenario?
I mean, can MQ be my spout, with two bolts: one to unmarshal, whose output then feeds the next bolt which does the write to the DB?
If yes, what is the benefit I can derive? Is it goodbye to the cumbersome multi-threaded producer/consumer pattern of code?
If it's as simple as the above, then where/why would one want to resort to the conventional multi-threaded producer/consumer approach?
My point being: is there a data volume/frequency at which Storm starts to shine compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
Regards,
CVM
This scenario can definitely fit into a Storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over the conventional multi-threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer/consumer patterns.
A specific data volume number is a very broad question, since it depends on a large number of factors such as hardware.
