I am interested in using Kafka to stream data (100K records per second) that has to be consumed by multiple consumers (NoSQL, Lucene), and I want to know whether Kafka is a good fit for my requirements, or whether there are useful alternatives. The consumers are:
Consumer 1 - consumes data as soon as it arrives in the topic.
Consumer 2 - consumes data in batches from the topic.
Yes, Kafka is a perfect fit for your requirement. For the first consumer, read about Kafka Streams.
If you want to read the data in batches, use the plain Kafka Consumer.
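For the batch case, here is a minimal sketch of a plain KafkaConsumer that polls records in batches; the broker address, group id, topic name, and batch size are placeholders you would adapt:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class BatchConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("group.id", "batch-consumer");           // hypothetical group id
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            // Cap how many records a single poll() returns, i.e. the batch size.
            props.put("max.poll.records", "500");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));  // hypothetical topic name
                while (true) {
                    // Each poll() returns a batch of up to max.poll.records records.
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(1000));
                    for (ConsumerRecord<String, String> record : batch) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }

max.poll.records caps how many records a single poll() call returns, which is the closest thing to a batch size on the consumer side.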
I have a Kafka Streams application which reads from one topic with 100 partitions, does some stateless transformations on individual messages, and produces them to another topic. There are 2 instances of the application running on 2 machines with 4 cores and 8 GB RAM each, and num.stream.threads is configured to 50. The throughput for the output topic is very low. What are the tuning parameters to look at for increasing the processing speed of the Kafka Streams application?
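For reference, a minimal sketch (with placeholder topic names and a placeholder transformation) of the kind of topology described above, including where num.stream.threads is set:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class StatelessTransformApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-transform");  // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // 50 stream threads per instance, as described above; with 100 partitions
            // and 2 instances this works out to one partition per thread.
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 50);

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("input-topic");  // hypothetical source topic
            source.mapValues(value -> value.toUpperCase())                   // placeholder stateless transformation
                  .to("output-topic");                                       // hypothetical sink topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }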
I am comparing the throughput of Spark Streaming and Kafka Streams. My results show that Kafka Streams has a higher throughput than Spark Streaming. Is this correct? Shouldn't it be the other way around?
Thanks
No single streaming platform is universally faster than all others for every use case. Don't get fooled by benchmarketing results that compare apples to oranges (like Kafka Streams reading from a disk-based source vs Spark Streaming reading from an in-memory source). You haven't posted your test, but it is entirely possible that it represents a use case (and test environment) in which Kafka Streams is indeed faster.
We are facing a performance issue while integrating Spark and Kafka streams.
Project setup:
We are using a Kafka topic with 3 partitions, producing 3000 messages in each partition, and processing them in Spark direct streaming.
Problem we are facing:
On the processing end we use the Spark direct stream approach, as per the documentation below. Spark should create as many parallel direct streams as there are partitions in the topic (3 in this case). But while reading, we can see that all the messages from partition 1 are processed first, then partition 2, then partition 3. Any idea why it is not processing in parallel? As per my understanding, if it were reading from all the partitions in parallel at the same time, the messages should come out interleaved rather than one partition at a time.
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers
Did you try setting the spark.streaming.concurrentJobs parameter?
Maybe in your case it can be set to three:
    sparkConf.set("spark.streaming.concurrentJobs", "3")
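For context, a minimal sketch of where that setting would go with the 0-8 direct approach linked in the question; the broker address, topic name, and batch interval are placeholders:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class DirectStreamJob {
        public static void main(String[] args) throws InterruptedException {
            SparkConf sparkConf = new SparkConf().setAppName("direct-stream-job");
            // Allow up to three streaming jobs to run concurrently, as suggested above.
            sparkConf.set("spark.streaming.concurrentJobs", "3");

            JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(5));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "localhost:9092");    // assumed broker address
            Set<String> topics = new HashSet<>(Arrays.asList("events"));  // hypothetical topic with 3 partitions

            // Direct stream: one RDD partition per Kafka partition (3 here).
            JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                    kafkaParams, topics);

            stream.foreachRDD(rdd ->
                    System.out.println("Batch with " + rdd.getNumPartitions() + " partitions, "
                            + rdd.count() + " records"));

            jssc.start();
            jssc.awaitTermination();
        }
    }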
Thanks.
I have a Kafka cluster with three brokers and one topic with a replication factor of three and three partitions. I can see that every broker has a copy of the log for all partitions, with the same size. There are two producers for this topic.
One day I reduced the writing volume of one producer by half. I then found that the inbound traffic of all three brokers was reduced, which is expected, but only the outbound traffic of partition 1's leader node was reduced, which I don't understand.
The partition leader's outbound traffic is reduced because of replication. But each broker is the leader of one partition, so why did only one leader's outbound traffic drop? Is it possible that the producer only writes to one partition? I don't think so, though.
Please help me explain it. The cluster is working fine now, but I need to understand this in case of potential problems.
Assuming you are using the default partitioner for the KafkaProducer, two events with the same key are guaranteed to be sent to the same partition.
From the Kafka documentation:
All reads and writes go to the leader of the partition, and followers consume messages from the leader just as a normal Kafka consumer would and apply them to their own log.
Reducing the volume from that producer could have removed a specific key or set of keys, which could mean no (or less) data going to a particular partition.
This explains why only that leader's outbound traffic was reduced (fewer records for the followers to consume).
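To illustrate, here is a small keyed-producer sketch (broker address, topic, and keys are made up): with the default partitioner, every record that carries the same key lands in the same partition, so dropping a key entirely means one partition stops receiving data from that producer.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class KeyedProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String[] keys = {"device-1", "device-2", "device-3"};  // hypothetical keys
                for (String key : keys) {
                    // With the default partitioner, the partition is derived from a hash
                    // of the key, so records with the same key go to the same partition.
                    RecordMetadata meta = producer
                            .send(new ProducerRecord<>("my-topic", key, "payload"))  // hypothetical topic
                            .get();
                    System.out.printf("key=%s -> partition %d%n", key, meta.partition());
                }
            }
        }
    }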
I am reading up on Apache Storm to evaluate whether it is suited to our real-time processing needs.
One thing that I couldn't figure out so far is where Storm stores tuples while the next node is not available to process them. For example, let's say spout A is producing at a rate of 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume 500 tuples per second. What happens to the other tuples? Does Storm have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up with processing, the messages are buffered there.
Before Storm 1.0.0 those queues could grow without bound (i.e., you get an out-of-memory exception and your worker dies). To protect against data loss, you need to make sure that the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html).
You can use the "max.spout.pending" parameter to limit the number of in-flight tuples and tackle this problem, though.
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to "slow down" if a queue grows too large (and to speed up again when the queue empties). In your spout-bolt example, the spout would slow down its emission of messages in this case.
Typically, Storm spouts read off of some persistent store and track the completion of tuples to determine when it's safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples; tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream processing frameworks have advanced significantly since the inception of Storm.