I was reading the Spark Streaming Kafka integration guide on the latest documentation page, which is based on Kafka version 0.10:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream
In it I can see that one of the Kafka params is "group.id" -> "example".
I thought we don't have to pass group.id as one of the parameters when we use the direct stream approach, so this documentation confuses me. What is the relation between group.id and the Spark Streaming direct stream approach?
group.id is a Kafka consumer configuration that is used to gather a set of consumer processes into a consumer group, so that each Kafka partition is assigned to exactly one consumer in the group.
Looking at the Kafka consumer configuration, the parameter is optional unless we use Kafka-based offset management (which Spark Streaming doesn't use with its direct approach), so it should be an optional parameter.
Also, looking at the source code of Spark's Kafka direct DStream, Spark doesn't add Kafka params that the client doesn't set, so group.id will default to the empty string if not given.
In general, consumer group ids are needed when you have multiple consumers (a Spark Streaming job, an Akka application, etc.) for the same Kafka topic and you don't want all of them to fall under the same group (which they will if you don't give each of them a group id). So I think it is good practice to give each consumer group its own group id. If you use operational tools around Kafka, naming the groups properly also gives you visibility into each consumer group.
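For reference, here is roughly what the example from the linked guide looks like with an explicit group.id set; the broker address, topic name, and group name below are placeholders, not values from the guide:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(new SparkConf().setAppName("group-id-example"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  // naming the group per application gives you visibility in consumer-group tooling
  "group.id" -> "my-spark-streaming-app",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("example-topic"), kafkaParams)
)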
I have 2 topics, source_topic.a and source_topic.b.
source_topic.a has a dependency on source_topic.b (i.e. source_topic.b needs to be sunk first). To respect that dependency, data from source_topic.b needs to be sunk first and only then the data from source_topic.a. Is there any way to set an order of topics / tables in the source/sink configurations?
The configurations used are below, and there are multiple tables and topics. mode=timestamp is used so that each table is checked for updated rows every time it is polled, and timestamp.initial sets the starting point to a specific timestamp.
The Source Configuration
name=jdbc-mssql-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver:
connection.user=
connection.password=
topic.prefix= source_topic.
mode=timestamp
table.whitelist=A,B,C
timestamp.column.name=ModifiedDateTime
connection.backoff.ms=60000
connection.attempts=300
validate.non.null= false
# enter timestamp in milliseconds
timestamp.initial= 1604977200000
The Sink Configuration
name=mysql-sink-prod-5
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics= sink_topic_a, sink_topic_b
connection.url=jdbc:mysql:
connection.user=
connection.password=
insert.mode=upsert
delete.enabled=true
pk.mode=record_key
errors.log.enable= true
errors.log.include.messages=true
No, the JDBC Sink connector doesn't support that kind of logic.
You're applying batch thinking to a streams world :) Consider: how would Kafka know that it had "finished" sinking topic_a? Streams are unbounded, so you'd end up having to say something like "if you don't receive any more messages in a given time window then assume that you've finished sinking data from this topic and move onto the next one".
You may be best doing the necessary join of the data within Kafka itself (e.g. with Kafka Streams or ksqlDB), and then writing the result back to a new Kafka topic which you then sink to your database.
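If it helps, here is a minimal sketch of that idea using the Kafka Streams Java DSL from Scala. It assumes both topics are keyed by the field that links them and use string serdes; the output topic name and application id are placeholders:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, Produced, ValueJoiner}

val builder = new StreamsBuilder()

// Topic B becomes a table of its latest value per key; topic A stays a stream
val tableB  = builder.table("source_topic.b", Consumed.`with`(Serdes.String(), Serdes.String()))
val streamA = builder.stream("source_topic.a", Consumed.`with`(Serdes.String(), Serdes.String()))

// Combine each A record with the current B value for the same key
val joiner = new ValueJoiner[String, String, String] {
  override def apply(a: String, b: String): String = s"$a|$b"
}

streamA
  .join(tableB, joiner)
  .to("joined_topic", Produced.`with`(Serdes.String(), Serdes.String()))

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "a-b-join")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
new KafkaStreams(builder.build(), props).start()

The JDBC sink then only needs to read the joined topic, so the ordering problem between the two source topics disappears.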
I have a use case where event information about sensors is continuously inserted into MySQL. We need to send this information, with some processing, to a Kafka topic every 1 or 2 minutes.
I am using Spark to send this information to the Kafka topic and to maintain CDC in a Phoenix table. I am using a cron job to run the Spark job every minute.
The issue I am currently facing is message ordering: I need to send these messages in ascending timestamp order to the end system's Kafka topic (which has 1 partition). But most of the message ordering is lost because more than one Spark DataFrame partition sends information to the Kafka topic concurrently.
Currently, as a workaround, I am repartitioning my DataFrame into 1 partition in order to maintain the message ordering, but this is not a long-term solution as I lose Spark's distributed computing.
If you have a better solution design around this, please suggest it.
I am able to achieve message ordering by ascending timestamp to some extent by repartitioning my data by key and applying sorting within each partition.
// Key each row by (asset id, value) and carry the full row as a comma-separated string
val pairJdbcDF = jdbcTable
  .map(row => ((row.getInt(0), row.getString(4)),
    s"${row.getInt(0)},${row.getString(1)},${row.getLong(2)},${row. /*getDecimal*/ getString(3)},${row.getString(4)}"))
  .toDF("Asset", "Message")

// Repartition by Asset so all messages for one asset land in the same partition,
// split the string back into typed columns, then sort within each partition by timestamp
val repartitionedDF = pairJdbcDF.repartition(getPartitionCount, $"Asset")
  .select($"Message")
  .select(
    expr("(split(Message, ','))[0]").cast("Int").as("Col1"),
    expr("(split(Message, ','))[1]").cast("String").as("TS"),
    expr("(split(Message, ','))[2]").cast("Long").as("Col3"),
    expr("(split(Message, ','))[3]").cast("String").as("Col4"),
    expr("(split(Message, ','))[4]").cast("String").as("Value"))
  .sortWithinPartitions($"TS", $"Value")
I have seen that the Big Data community is very keen on using Flafka in many ways for data ingestion, but I haven't really understood why yet.
A simple example I have developed to better understand this is to ingest Twitter data and move it to multiple sinks (HDFS, Storm, HBase).
I have done the implementation for the ingestion part in the following two ways:
(1) A plain Kafka Java producer with multiple consumers; (2) Flume agent #1 (Twitter source + Kafka sink) | (potentially) Flume agent #2 (Kafka source + multiple sinks). I haven't really seen any difference in the complexity of developing either of these solutions (it isn't a production system, so I can't comment on performance). The only thing I found online is that a good use case for Flafka would be data from multiple sources that needs to be aggregated in one place before being consumed in different places.
Can someone explain why would I use Flume+Kafka over plain Kafka or plain Flume?
People usually combine Flume and Kafka because Flume has a great (and battle-tested) set of connectors (HDFS, Twitter, HBase, etc.) and Kafka brings resilience. Also, Kafka helps distribute Flume events between nodes.
EDIT:
Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures. -- https://kafka.apache.org/documentation#replication
Thus, as soon as Flume gets the message to Kafka, you have a guarantee that your data won't be lost. NB: you can integrate Kafka with Flume at every stage of your ingestion (i.e. Kafka can be used as a source, a channel, and a sink, too).
I am using Apache Kafka & Apache Storm integration.
I need to design a model. Here are the specifications of my topology:
I have configured a topic in Kafka, let's say customer1. Now, the Storm bolts will read the data from the customer1 kafka-spout, process it, and write it into Mongo and Cassandra DBs. Here the DB names are also the same as the Kafka topic, customer1. The table structure and the rest of the things will be the same.
Now, suppose I get a new customer, let's say customer2. I need to read data from a customer2 kafka-spout and write it into Mongo and Cassandra DBs whose names will be customer2.
I can think of two ways to do it:
I will write a bolt which gets triggered whenever a new customer name gets added to a Kafka topic. That bolt will have code which will create and submit the new topology to the cluster.
I will create independent jars for each customer and submit the topologies manually (a parameterized sketch of this idea appears below, after my questions).
I searched a lot about this but couldn't determine which approach is better.
What are the pros and cons of the above approaches in terms of efficiency, code maintainability, and adding new changes to the existing model?
Is there any other way to handle this?
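Purely as an illustration of the second idea, here is a hypothetical Scala sketch of a single topology definition parameterized by customer name, so the same jar can be submitted once per customer instead of maintaining independent jars. The broker address and the DbWriterBolt placeholder are assumptions, not part of the question:

import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.kafka.spout.{KafkaSpout, KafkaSpoutConfig}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.Tuple

// Placeholder bolt: a real implementation would write each tuple into the
// Mongo or Cassandra database named after the customer.
class DbWriterBolt(dbName: String) extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    // write `tuple` into the database called `dbName`
  }
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}

object CustomerTopology {
  def main(args: Array[String]): Unit = {
    val customer = args(0)  // e.g. "customer1" or "customer2"

    // One spout per customer topic; bolt instances carry the customer-specific DB name
    val spoutConf = KafkaSpoutConfig.builder("broker:9092", customer).build()
    val builder   = new TopologyBuilder
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConf))
    builder.setBolt("mongo-writer", new DbWriterBolt(customer)).shuffleGrouping("kafka-spout")
    builder.setBolt("cassandra-writer", new DbWriterBolt(customer)).shuffleGrouping("kafka-spout")

    StormSubmitter.submitTopology(s"$customer-topology", new Config, builder.createTopology())
  }
}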
We are trying to replace Apache Storm with Apache Spark streaming.
In Storm, we partition the stream based on "Customer ID" so that messages with a range of customer IDs are routed to the same bolt (worker).
We do this because each worker will cache customer details (from DB).
So we split the stream into 4 partitions, and each bolt (worker) handles 1/4 of the entire range.
I did see a comparison of Spark and Storm, and this appears to be a limitation of Spark.
I am hoping there is a solution to this in Spark Streaming.
When using Kafka, one way to address this problem is to partition your data at the producer side. As you probably have seen, Kafka messages have a key, and you may use that key to partition the data among partitions.
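As a minimal sketch (broker address, topic name, and values below are placeholders), keying producer records by customer ID is enough for Kafka's default partitioner to route every message for a given customer to the same partition:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// The record key is the customer ID; Kafka's default partitioner hashes the key,
// so all events for one customer end up in the same partition (and thus on the
// same executor on the consuming side).
val customerId = "customer-42"
val event      = """{"sensor":"s1","value":12.3}"""
producer.send(new ProducerRecord[String, String]("customer-events", customerId, event))
producer.close()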
Using the Kafka receiver, you create one receiver per partition. Upon start of the Streaming job, the receivers will be distributed over several executors.
This means that every executor (JVM) will receive data only for the partitions it has been assigned. This results in the same ID going to the same executor for the lifetime of the receiver, and enables effective local caching as intended in the question.