Spring XD: How can a processor in a stream get data from the Kafka source module?

I have the following stream defined:
stream create aStream --definition "kafka --zkconnect=localhost:2181 --topic aTopic | aProcess"
My question is: how should I code aProcess so that it receives data (as a String) from the built-in kafka source module and prints it? Many thanks.

You don't need to write any code. All you need to do is specify a sink module in your stream definition:
stream create aStream --definition "kafka --zkconnect=localhost:2181 --topic aTopic | log" --deploy
This writes the data to the STDOUT of your XD container; alternatively, use the file sink to write it to a file. You only need to write a custom module when you need functionality that none of the out-of-the-box sources, sinks, or processors provide. They are documented at
http://docs.spring.io/spring-xd/docs/current/reference/html/

Related

Once in a while Spark Structured Streaming write stream is getting IllegalStateException: Race while writing batch 4

I have multiple queries running on the same spark structured streaming session.
The queries are writing parquet records to Google Bucket and checkpoint to Google Bucket.
val query1 = df1
  .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream.format("parquet").option("path", path).outputMode("append")
  .option("checkpointLocation", checkpoint_dir1)
  .partitionBy("key") /*.trigger(Trigger.ProcessingTime("5 seconds"))*/
  .queryName("query1").start()
val query2 = df2
  .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
  .select("key", "data.*")
  .writeStream.format("parquet").option("path", path).outputMode("append")
  .option("checkpointLocation", checkpoint_dir2)
  .partitionBy("key") /*.trigger(Trigger.ProcessingTime("5 seconds"))*/
  .queryName("query2").start()
Problem: Sometimes the job fails with java.lang.IllegalStateException: Race while writing batch 4
Logs:
Caused by: java.lang.IllegalStateException: Race while writing batch 4
at org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitJob(ManifestFileCommitProtocol.scala:67)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 20 more
20/07/24 19:40:15 INFO SparkContext: Invoking stop() from shutdown hook
This error occurs because two writers are writing to the same output path. The file streaming sink doesn't support multiple writers; it assumes there is only one writer per path.
To fix this, make each query use its own output directory. When reading the data back, you can load each output directory and union them.
Alternatively, use a streaming sink that supports multiple concurrent writers, such as the Delta Lake library. It is also supported on Google Cloud: https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc . That link has instructions for using Delta Lake on Google Cloud. It doesn't mention the streaming case, but all you need to do is change format("parquet") to format("delta") in your code.
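As a rough illustration of the first fix, the sketch below derives a separate output directory and checkpoint directory for each query name. It is plain Python rather than Spark code. The helper per_query_paths, the bucket gs://example-bucket/stream and the output/ and checkpoints/ layout are all assumptions made for this example, not anything taken from the question.

```python
def per_query_paths(base, query_names):
    """Hypothetical helper: one output dir and one checkpoint dir per query."""
    if len(set(query_names)) != len(query_names):
        raise ValueError("query names must be unique")
    base = base.rstrip("/")
    return {
        name: {
            # Value for .option("path", ...) of this query only.
            "path": f"{base}/output/{name}",
            # Value for .option("checkpointLocation", ...) of this query only.
            "checkpointLocation": f"{base}/checkpoints/{name}",
        }
        for name in query_names
    }


# The bucket name is a placeholder, not a real location.
paths = per_query_paths("gs://example-bucket/stream", ["query1", "query2"])
for name, options in paths.items():
    print(name, options["path"], options["checkpointLocation"])
```

Each query would then pass its own "path" value to the sink, so no two writers share a directory.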

Restart kafka connect sink and source connectors to read from beginning

I have searched quite a lot on this, but there doesn't seem to be a good guide.
From what I have found, there are a few things to consider:
Resetting the sink connector's internal topics (status, config and offset).
Source connector offset handling is implementation-specific.
Question: Is there even a need to reset these topics?
Deleting the consumer group.
Restarting the connector with a different name (this is also an option, but it doesn't seem to be the right thing to do).
Resetting the consumer group with --reset-offsets --to-earliest.
Using the REST API (does it provide the functionality to reset and read from the beginning?).
What would be the best way to restart both a sink and a source connector to read from the beginning?
Source Connector:
Standalone mode: remove offset file (/tmp/connect.offsets) or change connector name.
Distributed mode: change name of the connector.
Sink Connector (both modes), one of the following methods:
Change name.
Reset the offset for the consumer group. By default the group is named connect-<connectorName>.
To reset the offset you first have to delete the connector, then reset the offset (./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connect-connectorName --reset-offsets --to-earliest --execute --topic topicName), and then add the same configuration again.
You can also check the following question: Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?
For a source connector in distributed mode there is another option: producing a new message to the offsets topic.
For example, I use the JDBC source connector.
Looking at the offsets topic, I see the following:
./kafka-console-consumer.sh --zookeeper localhost:2181/kafka11-staging --topic kc-staging--offsets --from-beginning --property print.key=true
["referrer-family-jdbc-source",{"query":"query"}] {"incrementing":100}
To reset this, I just produce another message with incrementing set to 0.
For example, producing a keyed message from the shell:
./kafka-console-producer.sh \
--broker-list `hostname`:9092 \
--topic kc-staging--offsets \
--property "parse.key=true" \
--property "key.separator=|"
["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}
Please note that you need to do the following:
Delete the connector.
Produce a message with the relevant offset as I described above.
Create the connector again.
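The input line fed to the console producer above can also be generated rather than typed by hand. The following is a minimal sketch, assuming the key is a JSON array of connector name and source partition and the value is the offset map, as in the record shown from the offsets topic. The helper name offset_record is hypothetical.

```python
import json


def offset_record(connector, partition, offset, separator="|"):
    """Hypothetical helper: build one 'key|value' line for kafka-console-producer."""
    # Compact separators keep the output free of spaces, matching the existing record.
    key = json.dumps([connector, partition], separators=(",", ":"))
    value = json.dumps(offset, separators=(",", ":"))
    return f"{key}{separator}{value}"


line = offset_record(
    "referrer-family-jdbc-source",  # connector name from the example above
    {"query": "query"},             # source partition as stored in the key
    {"incrementing": 0},            # offset to restart from
)
print(line)
```

The separator must match the key.separator property passed to the producer.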
A bit late, but I found another way: in standalone mode, set offset.storage.file.filename to /dev/null:
#worker.properties
offset.storage.file.filename=/dev/null
#cmdline
connect-standalone /data/config/worker.properties /data/config/connector.properties

How to scale up named channels by having kafka as a transport layer

I'm using Spring XD 1.2.1 with kafka as a transport layer. I have the following setup:
xd:
  transport: kafka
  messagebus:
    kafka:
      default:
        concurrency: 10
        minPartitionCount: 10
I have the following streams as an example:
Streams
stream create f --definition "queue:foo > transform --expression=payload+'-foo' | log"
stream create b --definition "queue:bar > transform --expression=payload+'-bar' | log"
stream deploy --name f --properties "module.transform.count=2"
stream deploy --name b --properties "module.transform.count=2"
stream create r --definition "time | router --expression=payload.contains('10')?'queue:foo':'queue:bar'" --deploy
Question
How can I scale up the first processor in a stream whose source is a named channel? I was expecting something like 20 partitions for the transformers of streams f and b, since the count is 2 and the concurrency is 10. But the number of partitions is 10.
This works as expected for deployed modules that are not the first in the stream.
Should I configure the named channels in a specific way to achieve this?
Thanks.
The kafka partitions are controlled by the producing side.
For a mid-stream channel we can look at the "next" module and calculate the partition count by calculating its needs (concurrency*count).
With a named channel, we have no way of knowing the number of consumers (or their concurrency) so a count and concurrency of 1 is used, and so minPartitionCount is used as the partition count.
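The arithmetic behind those two cases can be sketched as follows. This is not Spring XD code: the function partition_count is hypothetical, and it assumes the bus takes the larger of minPartitionCount and count times concurrency, which is consistent with the behaviour described here.

```python
def partition_count(min_partition_count, next_count=1, next_concurrency=1):
    """Hypothetical model of the rule: the consumer's needs, floored by minPartitionCount."""
    return max(min_partition_count, next_count * next_concurrency)


# Mid-stream channel: the next module's count (2) and concurrency (10) are known.
mid_stream = partition_count(10, next_count=2, next_concurrency=10)

# Named channel: consumers are unknown, so count and concurrency default to 1.
named_channel = partition_count(10)

print(mid_stream, named_channel)
```

With the settings from the question this gives 20 partitions mid-stream and 10 for the named channel, matching what was observed.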
You would need to deploy the producing stream with an appropriate setting to increase the partitions:
stream deploy foo --properties module.last.producer.minPartitionCount=20
EDIT
Actually, it looks like we have a bug: you can't specify the minPartitionCount on a named channel.
I see you have opened a JIRA Issue.

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka consumer group name. Following the id setting through the storm-kafka code, I found that it is never used in any consumer configuration, but it is used to build the ZooKeeper path at which offset information is stored. This link has an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, the storm-kafka integration is implemented using Kafka's SimpleConsumer API, and it stores consumer offsets in ZooKeeper in its own format (JSON).
If you write a spout config like the one below,
SpoutConfig spoutConfig = new SpoutConfig(
    zkBrokerHosts,
    "topic name",
    "/kafka/consumers",  // zkRoot: just an example, the path under which consumer offsets are stored
    "yourTopic");        // id: names the subdirectory for this spout's offsets
it will write consumer offsets in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses the same ZooKeeper that your Storm cluster uses.
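To make the relationship between the last two SpoutConfig arguments and the offset location concrete, here is a small sketch. It only joins the two strings the way the answer describes; the function name offset_root is hypothetical and this is not storm-kafka's own code.

```python
import posixpath


def offset_root(zk_root, spout_id):
    """Hypothetical helper: ZooKeeper node under which a spout's offsets are kept."""
    # ZooKeeper paths always use forward slashes, hence posixpath.
    return posixpath.join(zk_root, spout_id)


root = offset_root("/kafka/consumers", "yourTopic")
print(root)
```

Two spouts reading the same topic therefore keep independent offsets as long as their id values differ.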

Using Kafka to import data to Hadoop

First I was deciding what to use to get events into Hadoop, where they will be stored and periodically analysed (possibly using Oozie to schedule the analysis): Kafka or Flume. I decided that Kafka is probably the better solution, since we also have a component that does event processing, so this way both the batch and the event processing components get data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka:
Flume - Contains Kafka Source (consumer) and Sink (producer)
I also found on the same page, and in the Kafka documentation, that there is something called Camus:
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
Which would be the better (and easier, better documented) solution? Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler high-level consumer?
I'm open to suggestions if there is another, better solution than these two.
Thanks
You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink, so it's just a matter of changing a properties file. An example is given below.
Steps:
Create a Kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Write to the above created topic using kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a flume agent with the following properties
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic = testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the flume agent
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to HDFS under the following path
/tmp/kafka/%{topic}/%y-%m-%d
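The escape sequences in that path are filled in per event: %{topic} comes from the event's topic header and %y-%m-%d from the timestamp (the agent's local time here, because useLocalTimeStamp is true). The sketch below imitates that expansion in plain Python to show what a resulting directory looks like. The function expand_hdfs_path and the sample date are assumptions for illustration, not Flume's implementation.

```python
from datetime import datetime


def expand_hdfs_path(template, headers, timestamp):
    """Hypothetical imitation of how the HDFS sink path escapes are resolved."""
    # Replace %{header} placeholders with the event's header values.
    for name, value in headers.items():
        template = template.replace("%{" + name + "}", value)
    # The remaining escapes (%y, %m, %d) are formatted from the timestamp.
    return timestamp.strftime(template)


# Sample date chosen arbitrarily for the example.
example = expand_hdfs_path(
    "/tmp/kafka/%{topic}/%y-%m-%d",
    {"topic": "testkafka"},
    datetime(2015, 3, 7),
)
print(example)
```

For the topic created in the first step, events from that day would land under /tmp/kafka/testkafka/15-03-07.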
Most of the time, I see people using Camus with Azkaban.
You can look at Mate1's GitHub repo for their implementation of Camus. It's not a tutorial, but I think it could help you:
https://github.com/mate1/camus

Resources