So I have multiple producers that generate tasks, and there is a consumer (or several) that executes these tasks (e.g. counting the number of lines in a file).
My problem is that producers should be treated evenly: if one producer generates 10 tasks and another only 2, the consumer should first do one task from producer 1, then the task from producer 2, then producer 1 again, then producer 2 again, and then the rest of producer 1's tasks.
Basically, for each producer the system must guarantee that its tasks will not wait behind a large chunk of tasks from other producers.
Can you help me with an algorithm, or ready-to-use broker/queue software, that can achieve this goal?
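A minimal sketch of such a round-robin scheduler, assuming plain in-process Python structures rather than a real broker (all names are illustrative): keep one FIFO per producer and have the consumer take at most one task per producer per pass.

import collections

class FairScheduler:
    def __init__(self):
        self.queues = collections.OrderedDict()   # producer_id -> deque of tasks

    def submit(self, producer_id, task):
        self.queues.setdefault(producer_id, collections.deque()).append(task)

    def next_task(self):
        # Take one task from the first non-empty queue, then rotate that
        # producer to the back, so producers are served round-robin.
        for producer_id in list(self.queues):
            q = self.queues[producer_id]
            if q:
                self.queues.move_to_end(producer_id)
                return producer_id, q.popleft()
        return None   # nothing pending

scheduler = FairScheduler()
for i in range(10):
    scheduler.submit('producer1', 'count-lines-%d' % i)
for i in range(2):
    scheduler.submit('producer2', 'count-lines-%d' % i)

while (item := scheduler.next_task()) is not None:
    print(item)   # p1, p2, p1, p2, then the rest of p1's tasks

With a real broker, the usual approximation of the same idea is one queue per producer plus a consumer that rotates its polling across those queues.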
Is it safe to say that, all in all, in Kafka Streams tasks represent subscriptions to partitions, while threads represent consumers?
That is, if there are 8 partitions there will always be 8 tasks. However, the number of consumers is determined by the number of threads available. Those are spread across application instances, so one application instance may represent 2 consumers provided it has 2 threads associated with it.
For full parallelism, with a topic with 8 partitions, we could have 2 application instances each with 4 threads, or one application instance with 8 threads, and so on.
Yes. The number of tasks will be equal to the maximum number of partitions across the input topics of the Kafka Streams app.
In case there are two topics "A" and "B", each having 8 partitions, the number of tasks will be max(8, 8) = 8. Each thread acts as a consumer. If you set the number of threads to 2, those 2 threads will distribute the tasks between each other, and each thread will get 4 tasks to process.
For full parallelism, with a topic with 8 partitions we could have 2
application instance with each having 4 Thread, or one application
instance with 8 Threads and so on.
You should always set the number of threads to the maximum number of partitions in order to achieve full parallelism. You can spread them over several application instances or just one.
Here is a nicely explained threading model of Kafka Streams:
https://docs.confluent.io/current/streams/architecture.html#parallelism-model
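To make the counting concrete, a minimal sketch of the arithmetic (assuming a simple topology where the task count equals the maximum partition count across the input topics):

# Task count for a simple topology: the maximum partition count
# across the input topics.
def num_tasks(partitions_per_topic):
    return max(partitions_per_topic)

# How many tasks each of num_threads threads ends up with.
def tasks_per_thread(tasks, num_threads):
    base, extra = divmod(tasks, num_threads)
    return [base + (1 if i < extra else 0) for i in range(num_threads)]

print(num_tasks([8, 8]))        # topics A and B, 8 partitions each -> 8 tasks
print(tasks_per_thread(8, 2))   # 2 threads -> [4, 4]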
I'm using the Streams DSL. I have three source topics with 17, 100 and 40 partitions.
I will be running three instances and 2 standby instances.
How can I calculate how many stream threads I will need so that each thread gets exactly one task, i.e. the highest parallelism is achieved?
This depends on the structure of your application. You can run the application with a single thread and observe the number of created tasks; that number is the maximum number of threads you can use.
The tasks that are created are logged, and you can also obtain them via KafkaStreams#localThreadMetadata().
I will try to discuss an approach here in short.
You are asking for maximum parallelism. This can be achieved by separating each topic out into its own topology.
Each topology then gets its own thread count (one thread per consumer per topic): 17/3, 100/3 and 40/3, i.e. topic partitions divided by instances; see the sketch below.
This makes sure that each topology gets a separate thread count and therefore separate parallelism.
Each topology will act as a separate consumer group.
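A minimal sketch of that division (Python; topic names are illustrative, rounding up so every partition is covered):

import math

# Partitions per source topic, divided across 3 application instances.
partitions = {'topic_a': 17, 'topic_b': 100, 'topic_c': 40}
instances = 3

for topic, count in partitions.items():
    threads = math.ceil(count / instances)   # threads per instance, this topic
    print(topic, threads)                    # topic_a 6, topic_b 34, topic_c 14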
I want to use Kafka to "divide the work". I want to publish instances of work to a topic, and run a cloud of identical consumers to process them. As each consumer finishes its work, it will pluck the next work item from the topic. Each work item should be processed only once, by one consumer. Processing work is expensive, so I will need many consumers running on many machines to keep up. I want the number of consumers to grow and shrink as needed (I plan to use Kubernetes for this).
I found a pattern where a unique partition is created for each consumer. This "divides the work", but the number of partitions is fixed when the topic is created. Furthermore, the topic must be created on the command line, e.g.
bin/kafka-topics.sh --zookeeper localhost:2181 --partitions 3 --topic divide-topic --create --replication-factor 1
...
from kafka import KafkaConsumer, TopicPartition

for n in range(0, 3):
    # In practice each consumer runs in its own process; each one is
    # pinned to a single partition by manual assignment.
    consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
    partition = TopicPartition('divide-topic', n)
    consumer.assign([partition])
...
I could create a unique topic for each consumer and write my own code to assign work to those topics. That seems gross, and I still have to create topics via the command line.
A work queue with a dynamic number of parallel consumers is a common architecture. I can't be the first to need this. What is the right way to do it with Kafka?
The pattern you found is accurate. Note that topics can also be created using the Kafka Admin API and partitions can also be added once a topic has been created (with some gotchas).
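For example, with kafka-python's admin client the topic from the question could be created programmatically (a sketch):

from kafka.admin import KafkaAdminClient, NewTopic

# Create the topic programmatically instead of via kafka-topics.sh.
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
admin.create_topics([NewTopic(name='divide-topic',
                              num_partitions=3,
                              replication_factor=1)])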
In Kafka, the way to divide work and allow scaling is to use partitions. This is because in a consumer group, each partition is consumed by a single consumer at any time.
For example, you can have a topic with 50 partitions and a consumer group subscribed to this topic:
When the throughput is low, you can have only a few consumers in the group and they should be able to handle the traffic.
When the throughput increases, you can add consumers, up to the number of partitions (50 in this example), to pick up some of the work.
In this scenario, 50 consumers is the limit in terms of scaling. Consumers expose a number of metrics (like lag) allowing you to decide if you have enough of them at any time.
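One way to watch lag from inside a consumer, sketched with kafka-python (topic and group names are illustrative; the log end offset of each assigned partition is compared with the consumer's position):

from kafka import KafkaConsumer

consumer = KafkaConsumer('divide-topic',
                         group_id='some-constant-group',
                         bootstrap_servers=['localhost:9092'])
consumer.poll(timeout_ms=1000)   # join the group and receive an assignment
for tp in consumer.assignment():
    # Remaining records on this partition from this consumer's point of view.
    lag = consumer.end_offsets([tp])[tp] - consumer.position(tp)
    print('partition %d lag: %d' % (tp.partition, lag))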
Thank you Mickael for pointing me in the correct direction.
https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
Kafka consumers are typically part of a consumer group. When multiple
consumers are subscribed to a topic and belong to the same consumer group,
each consumer in the group will receive messages from a different subset of
the partitions in the topic.
https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa,
Having consumers as part of the same consumer group means providing the
“competing consumers” pattern with whom the messages from topic partitions
are spread across the members of the group. Each consumer receives messages
from one or more partitions (“automatically” assigned to it) and the same
messages won’t be received by the other consumers (assigned to different
partitions). In this way, we can scale the number of the consumers up to the
number of the partitions (having one consumer reading only one partition); in
this case, a new consumer joining the group will be in an idle state without
being assigned to any partition.
Example code for dividing the work among 3 consumers, up to a maximum of 100:
bin/kafka-topics.sh --partitions 100 --topic divide-topic --create --replication-factor 1 --zookeeper localhost:2181
...
from kafka import KafkaConsumer

for n in range(0, 3):
    # Consumers sharing a group_id split the topic's partitions among themselves.
    consumer = KafkaConsumer('divide-topic',   # subscribe by topic name
                             group_id='some-constant-group',
                             bootstrap_servers=['localhost:9092'])
...
I think you are on the right path.
Here are the steps involved:
Create a Kafka topic with the required number of partitions. The number of partitions is the unit of parallelism; in other words, you run that many consumers to process the work.
You can increase the partitions if the scaling requirements grow, BUT that comes with caveats like repartitioning. Please read the Kafka documentation about adding partitions to an existing topic.
Define a Kafka consumer group for the consumers. Kafka will assign partitions to the available consumers in the consumer group and rebalance automatically whenever a consumer is added or removed.
If the consumers are packaged as Docker containers, then Kubernetes helps in managing them, especially in multi-node environments. Other tools include Docker Swarm, OpenShift, Mesos, etc.
Kafka guarantees ordering within a partition.
Check out the delivery guarantees - at-least-once, exactly-once - and pick one based on your use case; see the sketch below.
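A minimal sketch of at-least-once consumption with kafka-python (topic and group names are illustrative; the offset is committed only after the work succeeds, so a crash re-delivers the in-flight records):

from kafka import KafkaConsumer

def process(payload):
    print('processing', payload)   # stand-in for the real, expensive work

consumer = KafkaConsumer('work-topic',
                         group_id='workers',
                         bootstrap_servers=['localhost:9092'],
                         enable_auto_commit=False)   # commit manually
for record in consumer:
    process(record.value)
    consumer.commit()   # committed only after processing -> at-least-once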
Alternatively, you can use the Kafka Streams API. Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds on important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
Since you have a slow-consumer use case, it's a great fit for Confluent's Parallel Consumer (PC). PC solves this directly by sub-partitioning the input partitions by key and processing each key in parallel, so processing can take as long as you like. It also tracks per-record acknowledgement. Check out the Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).
I am using resque to background process two types of jobs:
(1) 3rd-party API requests
(2) DB query and insert
While the two job types can be processed in parallel, each job type in itself can only be processed in serial order. For example, DB operations need to happen in serial order but can be executed in parallel with 3rd-party API requests.
I am contemplating either of the following methods for executing this:
(1) Having two queues with one queue handling only API requests and the other queue
handling only db queries. Each queue will have its own worker.
(2) One single queue but two workers. One worker for each job.
I would like to know the difference in the two approaches and which among the two would be a better approach to take.
This choice of architecture is not straightforward; you have to keep many things in mind.
Having two queues with one queue handling only API requests and the other queue
handling only db queries. Each queue will have its own worker.
Ans: This architecture works when both queues are equally busy. If one queue has many jobs while the other is empty, one worker will sit idle while jobs wait in the other queue.
You should always think about full utilisation of your workers.
One single queue but two workers. One worker for each job.
Ans: This approach is what we also use in our project.
All jobs are enqueued to one queue, with multiple workers running on it.
Your workers will always be busy no matter which type of job is present, so proper worker utilisation is possible.
In the end, I would suggest the 2nd approach.
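A language-agnostic sketch of the 2nd approach in Python (not resque-specific; all names are illustrative). One shared queue keeps both workers busy regardless of the job mix; note that it deliberately ignores the per-type serial-ordering constraint from the question, which would need extra coordination:

import queue
import threading

jobs = queue.Queue()   # the single shared queue

def worker(name):
    # Either worker takes whatever job is next, so neither sits idle
    # while the other job type's backlog grows.
    while True:
        kind, payload = jobs.get()
        print('%s handling %s: %s' % (name, kind, payload))   # the real job
        jobs.task_done()

for name in ('worker-1', 'worker-2'):
    threading.Thread(target=worker, args=(name,), daemon=True).start()

for i in range(5):
    jobs.put(('api_request', i))
jobs.put(('db_insert', 0))
jobs.join()   # wait until every job has been handled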
I want to run 50 tasks. All these tasks execute the same piece of code; only the data differs. Which will complete faster?
a. Queuing up 50 tasks in a queue
b. Queuing up 5 tasks each in 10 different queue
Is there any ideal number of tasks that can be queued up in 1 queue before using another queue ?
The rate at which tasks are executed depends on two factors: the number of instances your app is running on, and the execution rate of the queue the tasks are on.
The maximum task queue execution rate is now 100 per queue per second, so that's not likely to be a limiting factor, and there's no harm in adding them all to the same queue. In any case, sharding between queues for more execution rate is at best a hack; queues are designed for functional separation, not as a performance measure.
The bursting rate of task queues is controlled by the bucket size. If there is a token in the queue's bucket the task should run immediately. So if you have:
queue:
- name: big_queue
  rate: 50/s
  bucket_size: 50
If you haven't queued any tasks in the last second, the bucket will be full and all 50 tasks should start right away.
See http://code.google.com/appengine/docs/python/config/queue.html#Queue_Definitions for more information.
Splitting the tasks into different queues will not improve the response time unless the bucket hasn't had enough time to completely fill with tokens.
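For illustration, enqueueing all 50 tasks onto that one queue from the first-generation App Engine Python runtime might look like this (a sketch; the handler URL is illustrative):

from google.appengine.api import taskqueue

# All 50 tasks go onto the same queue; the rate/bucket settings above
# control how quickly they actually execute.
for n in range(50):
    taskqueue.add(queue_name='big_queue',
                  url='/process-task',      # illustrative worker handler
                  params={'n': n})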
I'd add another factor into the mix: concurrency. If you have slow-running tasks (more than 30 seconds or so), then App Engine seems to struggle to scale up the correct number of instances to deal with the requests (it seems to max out at about 7-8 for me).
As of SDK 1.4.3, there are settings in your queue.xml and your appengine-web.xml you can use to tell App Engine that each instance can handle more than one task at a time:
<threadsafe>true</threadsafe> (in appengine-web.xml)
<max-concurrent-requests>10</max-concurrent-requests> (in queue.xml)
This solved all my problems with tasks executing too slowly (despite setting all the other queue params to the maximum).
More details: http://blog.crispyfriedsoftware.com
Queue up 50 tasks and set your queue to process 10 at a time, or whatever you like, if they can run independently of each other. I saw a similar problem, and I just run 10 tasks at a time to process the 3300 or so that I need to run. It takes 45 minutes or so to process all of them, but surprisingly the CPU time used is negligible.