Partition Strategy in Kafka Stream - apache-kafka-streams

Which partition assignment strategy does Kafka Streams use? Can we change it in Kafka Streams the way we can for a normal Kafka consumer? Setting
streamsConfiguration.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,Collections.singletonList(StickyAssignor.class));
makes no difference; StreamsPartitionAssignor is always used.

No. You cannot set a partition assignor.
Kafka Streams has very specific requirements for how partition assignment works, and if assignment is not done correctly, incorrect results could be computed. Thus, setting a custom partition assignor is not allowed.

Related

Set Table/Topic order in Apache Kafka JDBC

I have two topics, source_topic.a and source_topic.b.
source_topic.a depends on source_topic.b, so data from source_topic.b needs to be sunk first and data from source_topic.a afterwards. Is there any way to set an order of topics/tables in the source/sink configurations?
The configurations used are shown below; there are multiple tables and topics. The connector uses timestamp mode, so each table is checked for updates every time it is polled, and timestamp.initial is set to a specific timestamp.
The Source Configuration
name=jdbc-mssql-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver:
connection.user=
connection.password=
topic.prefix= source_topic.
mode=timestamp
table.whitelist=A,B,C
timestamp.column.name=ModifiedDateTime
connection.backoff.ms=60000
connection.attempts=300
validate.non.null= false
# enter timestamp in milliseconds
timestamp.initial= 1604977200000
The Sink Configuration
name=mysql-sink-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics= sink_topic_a, sink_topic_b
connection.url=jdbc:mysql:
connection.user=
connection.password=
insert.mode=upsert
delete.enabled=true
pk.mode=record_key
errors.log.enable= true
errors.log.include.messages=true
No, the JDBC Sink connector doesn't support that kind of logic.
You're applying batch thinking to a streams world :) Consider: how would Kafka know that it had "finished" sinking topic_a? Streams are unbounded, so you'd end up having to say something like "if you don't receive any more messages in a given time window then assume that you've finished sinking data from this topic and move onto the next one".
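To make the problem with that heuristic concrete, here is a minimal Java sketch. It is purely illustrative: the class IdleWindowSketch and the method looksFinished are hypothetical names, not part of Kafka Connect or the JDBC sink connector, and the timestamps are made up.

```java
/** Hypothetical illustration; not part of Kafka Connect or the JDBC sink connector. */
class IdleWindowSketch {

    /** Guess that a topic is "finished" if no record has arrived for idleWindowMs. */
    static boolean looksFinished(long lastRecordMs, long nowMs, long idleWindowMs) {
        return nowMs - lastRecordMs >= idleWindowMs;
    }

    public static void main(String[] args) {
        long idleWindowMs = 5_000;
        long lastRecordMs = 10_000;

        // At t=16s nothing has arrived for 6s, so the heuristic declares the topic done.
        System.out.println(looksFinished(lastRecordMs, 16_000, idleWindowMs)); // true

        // A late record at t=20s shows that the earlier decision was only a guess.
        lastRecordMs = 20_000;
        System.out.println(looksFinished(lastRecordMs, 21_000, idleWindowMs)); // false
    }
}
```

Any record that arrives after the window has elapsed invalidates the "finished" decision, which is why an ordering based on completion does not fit an unbounded stream.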
You may be better off doing the necessary join of the data within Kafka itself (e.g. with Kafka Streams or ksqlDB), and then writing the result back to a new Kafka topic which you then sink to your database.

How do I get two topics that have the same partition key and the same number of partitions to land on the same consumer within a Kafka Streams application

I am trying to create a Kafka Streams service where:
1. I initialize a cache in a processor, which is then updated by consuming messages from a topic, say "nodeStateChanged", with a partition key, say locationId.
2. I need to check the node state when I consume another topic, say "Report", again keyed by the same locationId. Effectively, I am joining with the table created by nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged land on the same instance as the Report topic, so that the lookup for a location is possible when a new report is received? Do 1 and 2 need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything. Kafka Streams will always co-partition topics. That is, if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks, and each task processes the corresponding partitions: task 0 processes partition 0 of both input topics, task 1 processes partition 1 of both input topics, and so on.
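The task-to-partition mapping described above can be sketched in plain Java. This is only an illustration, not Kafka Streams code: CoPartitioningSketch, partitionFor and partitionsForTask are hypothetical names, the partition count of 4 is assumed, and String.hashCode() is a stand-in for the producer's real partitioner (Kafka's default uses a murmur2 hash). The point is that if both topics are written with the same key, the same partitioning function and the same number of partitions, a given locationId ends up in the same partition number of both topics and therefore in the same task.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not Kafka Streams code. */
class CoPartitioningSketch {

    /** Stand-in for a key-to-partition function (assumption: same function for both topics). */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Task i of a sub-topology reads partition i of every input topic. */
    static List<String> partitionsForTask(int taskId, List<String> topics) {
        List<String> assigned = new ArrayList<>();
        for (String topic : topics) {
            assigned.add(topic + "-" + taskId);
        }
        return assigned;
    }

    public static void main(String[] args) {
        List<String> topics = List.of("nodeStateChanged", "Report");
        int numPartitions = 4; // assumed; must be identical for both topics
        String locationId = "location-42"; // made-up key

        int partition = partitionFor(locationId, numPartitions);
        System.out.println("key " + locationId + " -> partition " + partition
                + " -> task reads " + partitionsForTask(partition, topics));
    }
}
```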

Is scalability achievable with Kafka Streams if each topic has a single partition

My understanding, as per the Kafka Streams documentation:
The maximum possible number of parallel tasks is equal to the maximum number of partitions of a topic among all topics in a cluster.
I have around 60 topics in my Kafka cluster. Each topic has only a single partition.
Is it possible to achieve scalability/parallelism with Kafka Streams for my Kafka cluster?
Do you want to do the same computation over all topics? For this, I would recommend introducing an extra topic with many partitions that you use to scale out:
// using the 1.0 API (StreamsBuilder)
StreamsBuilder builder = new StreamsBuilder();
KStream parallelizedStream = builder
    .stream(/* subscribe to all topics at once */)
    .through("topic-with-many-partitions"); // route all records through the wide topic
// apply computation
parallelizedStream...
Note: You need to create the topic "topic-with-many-partitions" manually before starting your Streams application.
Pro Tip:
The topic "topic-with-many-partitions" can have a very short retention time, as it's only used for scaling and does not need to hold data long term.
Update
If you have 10 topics, T1 to T10, with a single partition each, the program above will execute as follows (with TN being the dummy topic with 10 partitions):
T1-0  --+           +--> TN-0  --> T1_1
...   --+--> T0_0 --+--> ...   --> ...
T10-0 --+           +--> TN-10 --> T1_10
The first part of your program will only read all 10 input topics and write the data into the 10 partitions of TN. Afterwards, you can get up to 10 parallel tasks, each processing one input partition. If you start 10 KafkaStreams instances, only one will execute T0_0, and each will also have one T1_x running.
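The task counts in this example can be reproduced with a small stand-alone Java sketch. It is not Kafka Streams code: TaskCountSketch and taskCount are hypothetical names, and the rule it encodes (one task per partition of the widest input topic of a sub-topology) is the assumption behind the diagram above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not Kafka Streams code. */
class TaskCountSketch {

    /** Number of tasks for one sub-topology: the largest partition count among its input topics. */
    static int taskCount(Map<String, Integer> inputTopicPartitions) {
        int max = 0;
        for (int partitions : inputTopicPartitions.values()) {
            max = Math.max(max, partitions);
        }
        return max;
    }

    public static void main(String[] args) {
        // First part: reads T1..T10, each with a single partition.
        Map<String, Integer> firstPart = new LinkedHashMap<>();
        for (int i = 1; i <= 10; i++) {
            firstPart.put("T" + i, 1);
        }
        // Second part: reads the dummy topic TN with 10 partitions.
        Map<String, Integer> secondPart = Map.of("TN", 10);

        System.out.println("first part tasks:  " + taskCount(firstPart));  // 1
        System.out.println("second part tasks: " + taskCount(secondPart)); // 10
    }
}
```

The single task of the first part is the bottleneck for reading; the parallelism gained applies to whatever computation runs after the wide topic.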

Spark Streaming Direct Stream approach with Group ID

I was reading the Spark Streaming Kafka integration guide in the latest documentation, which is based on Kafka 0.10:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream
There I can see that one of the Kafka params is "group.id" -> "example".
I thought we don't have to pass group.id as a parameter when we use the direct stream approach, so this documentation confuses me. What is the relation between group.id and the Spark Streaming direct stream approach?
group.id is a Kafka consumer configuration that is used to group a set of consumer processes into a group so that each Kafka partition can be assigned to exactly one node in the group.
Looking at the Kafka consumer configuration, the parameter is optional unless we use Kafka-based offset management (which Spark Streaming doesn't use with its direct approach). So it should be an optional parameter.
Also, looking at the source code of the Spark Kafka direct DStream, Spark doesn't add Kafka params that the client doesn't set. So group.id will default to an empty string if not given.
In general, consumer group IDs are needed when you have multiple consumers (a Spark Streaming job, an Akka application, etc.) for the same Kafka topic and you don't want all of them to fall under the same group (which they will if you don't give a group ID to each of them). So I think it is good practice to name each consumer group with its own group ID. If you use operational tools around Kafka, properly named groups will also give you visibility into each consumer group.
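As a small illustration of that naming practice, the sketch below builds consumer settings with an explicit group.id per application using only java.util.Properties. The class GroupIdSketch, the helper consumerProps, the broker address and the group names are all assumptions made up for this example; only the keys "bootstrap.servers" and "group.id" are actual Kafka consumer configuration names.

```java
import java.util.Properties;

/** Hypothetical illustration of giving each consuming application its own group.id. */
class GroupIdSketch {

    /** Builds minimal consumer settings; the broker address is an assumed placeholder. */
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed address
        props.setProperty("group.id", groupId);
        return props;
    }

    public static void main(String[] args) {
        // Two independent applications reading the same topic get distinct groups.
        Properties sparkJob = consumerProps("spark-streaming-job");
        Properties akkaApp = consumerProps("akka-application");

        System.out.println(sparkJob.getProperty("group.id"));
        System.out.println(akkaApp.getProperty("group.id"));
    }
}
```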

Sticking stream data to a specific worker

We are trying to replace Apache Storm with Apache Spark Streaming.
In Storm, we partitioned the stream based on "Customer ID" so that messages with a range of customer IDs are routed to the same bolt (worker).
We do this because each worker caches customer details (from the DB).
So we split the stream into 4 partitions, and each bolt (worker) handles 1/4 of the entire range.
I did see a comparison of Spark and Storm that listed this as a limitation of Spark.
I am hoping there is a solution to this in Spark Streaming.
When using Kafka, one way to address this problem is to partition your data on the producer side. As you have probably seen, Kafka messages have a key, and you may use that key to distribute the data among partitions.
Using the Kafka receiver, you create one receiver per partition. When the streaming job starts, the receivers are distributed over several executors.
This means that every executor (JVM) receives data only for the partitions assigned to it. As a result, the same ID goes to the same executor for the lifetime of the receiver, which enables effective local caching as intended in the question.
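The effect on caching can be simulated in plain Java. This sketch involves neither Kafka nor Spark: KeyedCacheSketch, Worker, partitionFor and simulate are hypothetical names, the 4 partitions come from the question, and the modulo function merely stands in for whatever key-based partitioning the producer applies. It counts how often customer details would have to be loaded when every customer ID always reaches the same worker.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical simulation of key-based routing with per-worker caches. */
class KeyedCacheSketch {
    static final int NUM_PARTITIONS = 4; // as in the question

    /** Stand-in for producer-side partitioning by customer ID. */
    static int partitionFor(long customerId) {
        return (int) Math.floorMod(customerId, (long) NUM_PARTITIONS);
    }

    /** One worker per partition, each with its own local cache. */
    static class Worker {
        final Map<Long, String> cache = new HashMap<>();
        int dbLoads = 0;

        void process(long customerId) {
            // Load customer details only on the first message for this customer.
            cache.computeIfAbsent(customerId, id -> {
                dbLoads++;
                return "details-" + id; // placeholder for a real DB lookup
            });
        }
    }

    /** Routes every message to the worker owning its partition; returns total DB loads. */
    static int simulate(long[] customerIds) {
        Worker[] workers = new Worker[NUM_PARTITIONS];
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            workers[i] = new Worker();
        }
        for (long id : customerIds) {
            workers[partitionFor(id)].process(id);
        }
        int total = 0;
        for (Worker w : workers) {
            total += w.dbLoads;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] messages = {7, 12, 7, 7, 12, 3};
        System.out.println("DB loads: " + simulate(messages)); // one load per distinct customer
    }
}
```

Because routing is deterministic, each customer is cached on exactly one worker, so the number of loads equals the number of distinct customers rather than growing with the number of workers.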
