Kafka Streams thread count double the expected value - apache-kafka-streams

If I create a simple topology where I have a source and a processor, I am seeing double the expected number of StreamThreads in the console.
For example, if I set the thread count to one and have one partition, I see 2 stream threads. If I set it to 20 threads with 20 partitions, I see 40 stream threads.
Based on Kafka Streams thread number, I was expecting half of these numbers.
Am I configuring something wrong, or is this expected?
EDIT:
After stream = new KafkaStreams(topology, streamsConfig); is called, I only see 20 threads created.
After stream.start() is called, I see those 20 threads transition from CREATED to RUNNING.
It is only later in the initialization that the other 20 threads get created. It looks like StreamsBuilderFactoryBean#start() is then being called with a topology that contains nothing. It seems I either need to stop this from being called somehow or remove my own creation process; I am not sure which is preferred.

Turns out the @EnableKafkaStreams annotation will start Kafka Streams for you. Since I was not creating a topology bean, there was no collision - the auto-started instance just ran an empty topology.
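For reference, a minimal sketch of letting the auto-configuration own the lifecycle instead of also calling new KafkaStreams(...) yourself (topic names and types are illustrative):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafkaStreams;

@Configuration
@EnableKafkaStreams
public class TopologyConfig {

    // The auto-configured StreamsBuilderFactoryBean builds and starts
    // KafkaStreams itself; defining the topology against its StreamsBuilder
    // (instead of also calling new KafkaStreams(...)) leaves only one set
    // of StreamThreads.
    @Bean
    public KStream<String, String> topology(StreamsBuilder builder) {
        KStream<String, String> source = builder.stream("input-topic");
        source.to("output-topic");
        return source;
    }
}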

Related

Design Pattern - Spring KafkaListener processing 1 million records in 1 hour

My Spring Boot application is going to listen to 1 million records an hour from a Kafka broker. The entire processing logic for each message takes 1-1.5 seconds, including a database insert. The topic has 64 partitions, which is also the concurrency of my @KafkaListener.
My current code is only able to process 90 records in a minute in a lower environment, where I am listening to around 50k records an hour. Below is the code; all other config parameters, like max.poll.records, are at their default values:
@KafkaListener(id = "xyz-listener", concurrency = "64", topics = "my-topic")
public void listener(String record) {
    // processing logic, including a ~1-1.5 s database insert per record
}
I do get "it is likely that the consumer was kicked out of the group" 7-8 times an hour. I think both of these issues could be solved by isolating the listener method and multithreading the processing of each message, but I am not sure how to do that.
There are a few points to consider here. First, 64 consumers seems a bit too much for a single application instance to handle consistently.
Considering each poll fetches up to 500 records per consumer by default, your app might be getting overloaded, causing consumers to get kicked out of the group whenever a single batch takes longer than the 5-minute default of max.poll.interval.ms to process.
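For reference, these are the two knobs involved (the values below are illustrative starting points, not recommendations):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Map<String, Object> props = new HashMap<>();
// fetch fewer records per poll so a batch fits within the processing deadline
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);          // default 500
// or widen the deadline itself so slow batches don't trigger a rebalance
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);  // default 300000 (5 min)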
So first, I'd consider scaling the application horizontally so that each instance handles a smaller number of partitions / threads.
A second way to increase throughput would be using a batch listener, and handling processing and DB insertions in batches as you can see in this answer.
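For illustration, a minimal sketch of such a batch listener (this assumes Spring Kafka 2.8+, where @KafkaListener supports batch = "true"; on older versions, call setBatchListener(true) on the container factory instead - the events table and topic name are made up):

import java.util.List;
import java.util.stream.Collectors;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class BatchEventListener {

    private final JdbcTemplate jdbcTemplate;

    public BatchEventListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // batch = "true" hands the listener the whole poll at once
    @KafkaListener(id = "xyz-batch-listener", topics = "my-topic", batch = "true")
    public void listen(List<String> records) {
        // one batched insert per poll instead of one insert per record
        jdbcTemplate.batchUpdate(
                "INSERT INTO events (payload) VALUES (?)",
                records.stream()
                        .map(r -> new Object[] { r })
                        .collect(Collectors.toList()));
    }
}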
Using both, you should be processing a sensible amount of work in parallel per app, and should be able to achieve your desired throughput.
Of course, you should load test each approach with different figures to have proper metrics.
EDIT: Addressing your comment: if you want to achieve this throughput, I wouldn't give up on batch processing just yet. If you do the DB operations row by row, you'll need a lot more resources for the same performance.
If your rule engine doesn't do any I/O, you can iterate each record from the batch through it without losing performance.
About data consistency, you can try some strategies. For example, you can use a lock to ensure that, even through a rebalance, only one instance will process a given batch of records at a given time - or perhaps there's a more idiomatic way of handling that in Kafka using the rebalance hooks.
With that in place, you can batch-load all the information you need to filter out duplicated / outdated records when you receive them, iterate each record through the rule engine in memory, and then batch-persist all results before releasing the lock.
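A rough sketch of that flow (Java 16+ for the record syntax; Record, RuleEngine and Repository are hypothetical stand-ins for your own types, and the locking is left out):

import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchProcessor {

    record Record(String key, Instant timestamp, String payload) {}

    interface RuleEngine { String evaluate(Record r); }               // pure, no I/O

    interface Repository {
        Map<String, Instant> loadLatestTimestamps(List<String> keys); // one batched read
        void batchPersist(List<String> results);                      // one batched write
    }

    private final RuleEngine ruleEngine;
    private final Repository repository;

    BatchProcessor(RuleEngine ruleEngine, Repository repository) {
        this.ruleEngine = ruleEngine;
        this.repository = repository;
    }

    public void process(List<Record> batch) {
        List<String> keys = batch.stream().map(Record::key).collect(Collectors.toList());
        Map<String, Instant> latest = repository.loadLatestTimestamps(keys);
        List<String> results = batch.stream()
                // drop duplicated / outdated records up front
                .filter(r -> r.timestamp().isAfter(
                        latest.getOrDefault(r.key(), Instant.MIN)))
                .map(ruleEngine::evaluate)   // run the rule engine in memory
                .collect(Collectors.toList());
        repository.batchPersist(results);
    }
}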
Of course, it's hard to come up with an ideal strategy without knowing more details about the process. The point is that, by doing this, you should be able to handle around 10x more records within each instance, so I'd definitely give it a shot.

Is the attempts argument passed to retry_on respected with multiple workers?

When retry_on causes a delayed job to be retried, I see run_at change on the record in the database, but attempts in the database does not get incremented until the job has been retried as many times as the attempts argument I passed to retry_on. (That may be confusing; there are two different things here called attempts: the column in the DB and the argument passed to retry_on.)
How does this work when I have multiple workers? I see in the code for ActiveJob that the number of attempts in retry_on is being tracked through exception_executions, which appears to be just a hash stored in RAM.
Does anything prevent different workers from picking up the job on the retry?
If not, it seems like if I pass attempts: 3 to retry_on and I have 10 workers, then the job could end up getting retried as many as 30 times before it is reported as an error in DelayedJob.
Is that right? If so, is it a bug?
In ActiveJob, it looks like both executions and exception_executions are persisted with the job itself. If you're using something like delayed_job, they are stored in the handler column of the delayed_jobs table.
When a job is enqueued, its data (args, provider information, and these counters) is serialized for the queueing adapter; when a worker picks the job up, that data is deserialized back onto the job instance.
So this should not be an issue with multiple worker processes: each process reads from and writes back this persisted job data as part of ActiveJob's retry_on logic, meaning each process knows how many times the job has executed and/or raised an exception, regardless of which workers ran the earlier attempts.

Wait for submitTopology to finish

I am reading the Storm Applied book. I found the following code snippet in the book:
LocalCluster lc = new LocalCluster();
lc.submitTopology("GitHub-commit-count-topology", config, topology);
Utils.sleep(TEN_MINUTES);
lc.killTopology("GitHub-commit-count-topology");
lc.shutdown();
So this code submits the topology for execution, waits a fixed 10 minutes, and then kills the topology. But this seems odd. How can I make submitTopology wait until the topology has completed, and only then kill it and shut down?
In Akka Streams, for example, we get a Future[Done] and simply wait for that future to complete (rather than a fixed 10 minutes).
You can do this with https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/Testing.java#L376.
The reason this isn't used in some cases is that it requires every spout in the topology to implement the CompletableSpout interface https://github.com/apache/storm/blob/4137328b75c06771f84414c3c2113e2d1c757c08/storm-client/src/jvm/org/apache/storm/testing/CompletableSpout.java.
Most Storm spouts never reach a point where they're "done" (since it's a stream processing framework, not a batch processing framework), so there's no way to tell when the topology is finished. For example, if you're consuming messages from a Kafka topic, the producers may at any point add more messages to the topic, so how will the consumer determine it is finished consuming?
CompletableSpout exists mostly to ease testing, because it's then possible for a spout to say whether it is done. The completeTopology method I linked can then use this extra feature to tell whether all spouts in the topology are "done", and can stop the topology after that.
If the spout you're using in a test doesn't implement CompletableSpout (which most spouts don't), there's no way to tell when the topology is finished in general. In many cases you can still do better than the example you linked, e.g. if my topology is supposed to write 10 messages to a queue in the test, I can make the test end once 10 messages have been written to the queue.
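For instance, a rough sketch of that kind of test ending condition (plain Java inside a test method that may throw InterruptedException; outputQueue is a hypothetical handle to wherever the topology writes):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// outputQueue stands in for wherever the topology under test writes
Queue<String> outputQueue = new ConcurrentLinkedQueue<>();
long deadline = System.currentTimeMillis() + 60_000;   // safety timeout
while (outputQueue.size() < 10 && System.currentTimeMillis() < deadline) {
    Thread.sleep(100);   // poll the queue instead of sleeping a fixed 10 minutes
}
lc.killTopology("GitHub-commit-count-topology");
lc.shutdown();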
To relate this to Akka Streams: I'm not really familiar with them, but looking at the introductory documentation, you could consider CompletableSpouts to be similar to bounded Sources (e.g. a Source(1 to 100)), while "normal" spouts are unbounded Sources (e.g. a Source.repeat(1)).

Spark batches do not complete when running on a Yarn cluster

Setting the scene
I am working to make a Spark Streaming application (Spark 2.2.1 with Scala) run on a Yarn cluster (Hadoop 2.7.4).
So far I have managed to submit the application to the Yarn cluster with spark-submit. I can see that the receiver task starts up correctly and fetches a lot of records from the database (Couchbase Server 5.0), and I can also see that the records are divided into batches.
The question
When I look at the Streaming Statistics on the Spark Web UI, however, I can see that my batches are never processed. I have seen batches with 0 records process and complete, but when a batch with records starts processing it never completes. Once it even got stuck on a batch with 0 records.
I even tried simplifying the output operations on the StreamingContext as much as possible. But even with the very simple output operation print(), my batches are never processed. The logs do not show any warnings or errors.
Does anyone know what might be wrong? Any suggestions on how to solve this will be much appreciated.
More Info
The main class of the Spark application is built from this example (the first one) from the Couchbase Spark Connector documentation, combined with this checkpointing example from the Spark documentation.
Right now I have 3230 active batches (3229 queued and 1 processing) and 1 completed batch (which had 0 records), and the application has been running for 4 hours and 30 minutes... with another batch added every 5 seconds.
If I look at the "thread dump" for the executors, I see a lot of WAITING, TIMED_WAITING and a few RUNNABLE threads. The list would fill 3 screenshots, so I will only post it if needed.
Below you will find some screenshots from the Web UI
Executor Overview
Spark Jobs Overview
Node Overview with resources
Capacity Scheduler Overview
Per the screenshots, you have 2 cores: one is being used by the driver and the other by the receiver. That leaves no core for the actual processing to happen. Please increase the number of cores and try again.
Refer: https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
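As a minimal illustration in the Java API (the app name and batch interval are made up; on a Yarn cluster the equivalent fix is raising --executor-cores / --num-executors on spark-submit):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ReceiverCoresExample {
    public static void main(String[] args) throws InterruptedException {
        // local[2]: one thread is taken by the receiver, leaving one free to
        // process the batches; with a single core the batches queue up forever,
        // which matches the symptoms described above.
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("ReceiverCoresExample");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // ... define the receiver-based input DStream and output operations here ...
        ssc.start();
        ssc.awaitTermination();
    }
}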

How would you emit Storm data after a period of time has elapsed?

For example, let's say you were using Storm to aggregate web visit start and end dates. A session starts with the first visit from a user and ends after 30 minutes of inactivity from that same user. This data is being streamed into Storm in real time as it's collected. How would you tell Storm to emit data after those 30 minutes of inactivity?
I am not sure, but you can look at the TOPOLOGY_TICK_TUPLE_FREQ_SECS property in Storm, as described in this article:
Tick tuples: It’s common to require a bolt to “do something” at a fixed interval, like flush writes to a database. Many people have been using variants of a ClockSpout to send these ticks. The problem with a ClockSpout is that you can’t internalize the need for ticks within your bolt, so if you forget to set up your bolt correctly within your topology it won’t work correctly. 0.8.0 introduces a new “tick tuple” config that lets you specify the frequency at which you want to receive tick tuples via the “topology.tick.tuple.freq.secs” component-specific config, and then your bolt will receive a tuple from the __system component and __tick stream at that frequency.
You can also find sample code there for configuring spouts or bolts to receive tick tuples at a specific interval.
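For illustration, a rough sketch of such a bolt (modern org.apache.storm packages; the userId field and the 60-second tick interval are assumptions):

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SessionFlushBolt extends BaseBasicBolt {

    private final Map<String, Long> lastSeen = new HashMap<>();

    // ask Storm to send this bolt a tick tuple every 60 seconds
    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        long now = System.currentTimeMillis();
        if (isTickTuple(tuple)) {
            // on each tick, emit and drop sessions idle for 30+ minutes
            lastSeen.entrySet().removeIf(e -> {
                if (now - e.getValue() > 30 * 60 * 1000L) {
                    collector.emit(new Values(e.getKey(), e.getValue()));
                    return true;
                }
                return false;
            });
        } else {
            // otherwise record the latest activity for this user
            lastSeen.put(tuple.getStringByField("userId"), now);
        }
    }

    private static boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("userId", "sessionEnd"));
    }
}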
