How does Spark process multiple Kafka topic at one Dstream ?
1.on by one in a row.
2.simultaneously (topic-partition per executor).
In DirectStream method spark process Kafka data parallelly. i.e each executor core will pull data from corresponding kafka topic and partition.
Lets say you have 2 topics each with 3 partitions. And you have 3 executors each with 2 cores.
If you are reading from these 2 topics then spark will launch 1 job with 6 tasks (=6 cores) and each task pulls data from corresponding topic+partition. And all these tasks run parallelly.
Hope this helps!
Related
I have a requirement to run multiple instances of a spring batch remote partitioned job with SQS.Say for example I have 3 instances of my job submitted as below and below are the partitions for each job that is submitted to the SQS to be executed by followers.
J1 - P1,P2,P3,P4,P5
J2 - P6,P7
J3 - P8
Messages in my SQS queue will look like this since am using a SQS FIFO with random groupId P1, P2, P3,P4,P5,P6,P7,P8
I know we will have random message processing order since each message has a different groupId but my requirement is to make sure all 3 jobs run in parallel.If job 1 has large no of partitions,I don't want my other jobs to wait till job 1 messages are exhausted.Any ideas here will help ...
I am using Spring Boot #kafkaListener in my application. Lets assume I use below configuration -
Topic Partitions : 2
spring.kafka.listener.concurrency : 2
group-id : TEST_GRP_ID
Acknowledgement : Manual
My question is ,
As per my knowledge Concurrency will create parallel thread to consume message.
So, thread 1 consumed the batch of records and thread 2 consumed the batch of records in this case processing of the messages will sequential and then commit the offset?
If I have two instances of the micro service in my cloud environment (in production more partition and more instances), then how concurrency will work? In each instance will create two parallel thread for my Kafka consumer?
How can I improve performance of my consumer or how can I make fast consumption and processing of the messages?
Your understanding is not too far from the truth. In fact only one consumer per partition can exist for the given group. The concurrency number gives us an approximate number of target consumers. And independently of microservice instances only two maximum consumers can exist if you have only two partitions in your topic.
So, to increase a performance you need to have more than 2 partition or more topics to consume, then they all can be distributed between your instances and their consumers evenly.
See more info in Apache Kafka docs: https://docs.confluent.io/platform/current/clients/consumer.html
✓you are having concurrency as 2 , which means 2 containers will be created to your listener.
✓As you are having 2 partitions in topic , messages from both the partitions will be consumed and processed parallelly.
✓When you spin up one more instance with same group name , the first thing that will happen is Group Rebalance .
✓Despite this event , as at any point of time only one consumer from a specific consumer group can be there for a partition , In the end , only 2 containers will be listening to messages and other 2 containers just remain idle.
✓In order to achieve more scalability , we need to add more number of partitions to the topic there by we can have more number of active listener containers
I am trying to create a Kafka Streams service where
I am trying to initialize a cache in a processor, that will then be updated by consuming messages with a topic say "nodeStateChanged" for a partition key lets say locationId.
I need to check the node state when I consume another topic lets say "Report" again keyed by the same locationId. Effectively I am joining with the table created by nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged fall on the same instance as the Report topic so that the lookup for a location is possible when a new report is recieved. Do 1 and 2 need to be created by the same topology or it okay to create two seperate topologies that share the same APPLICATION_ID_CONFIG.
You don't need to do anything. Kafka Streams will always co-partition topics. Ie, if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks and each task is processing corresponding partitions, ie, task 0 processes partitions zero of both input topics, task 1 processes partitions one of both input topics, etc.
I have a use case where event information about sensors is inserted continuously in MySQL. We need to send this information with some processing in a Kafka topic every 1 or 2 minutes.
I am using Spark to send this information to Kafka topic and for maintaining CDC in Phoenix table.I am using a Cron job to run spark job every 1 minute.
The issue I am currently facing is message ordering, I need to send these messages in ascending timestamp to end the system Kafka topic (which has 1 partition). But most of the message ordering is lost due to more than 1 spark DataFrame partition sends information concurrently to Kafka topic.
Currently as a workaround I am re-partitioning my DataFrame in 1, in order to maintain the messages ordering, but this is not a long term solution as I am losing spark distributed computing.
If you guys have any better solution design around this please suggest.
I am able to achieve message ordering as per ascending timestamp to some extend by reparations my data with the keys and by applying sorting within a partition.
val pairJdbcDF = jdbcTable.map(row => ((row.getInt(0), row.getString(4)), s"${row.getInt(0)},${row.getString(1)},${row.getLong(2)},${row. /*getDecimal*/ getString(3)},${row.getString(4)}"))
.toDF("Asset", "Message")
val repartitionedDF = pairJdbcDF.repartition(getPartitionCount, $"Asset")
.select($"Message")
.select(expr("(split(Message, ','))[0]").cast("Int").as("Col1"),
expr("(split(Message, ','))[1]").cast("String").as("TS"),
expr("(split(Message, ','))[2]").cast("Long").as("Col3"),
expr("(split(Message, ','))[3]").cast("String").as("Col4"),
expr("(split(Message, ','))[4]").cast("String").as("Value"))
.sortWithinPartitions($"TS", $"Value")
We are trying to replace Apache Storm with Apache Spark streaming.
In storm; we partitioned stream based on "Customer ID" so that msgs with a range of "customer IDs" will be routed to same bolt (worker).
We do this because each worker will cache customer details (from DB).
So we split into 4 partitions and each bolt (worker) will have 1/4 of the entire range.
I did see comparison Spark and Storm; and this being limitation on Spark.
I am hoping we have a solution to this in Spark Streaming
When using Kafka, one way to address this problem is to partition your data at the producer side. As you probably have seen, Kafka messages have a key, and you may use that key to partition the data among partitions.
Using the Kafka receiver, you create one receiver per partition. Upon start of the Streaming job, the receivers will be distributed over several executors.
This means that every executor (JVM) will be receiving data for only the partitions it's got assigned. This results on the same id going to the same executor for the lifetime of the receiver, and enables effective local caching as intended in the question.