I am designing an Apache Storm topology using a Flux YAML topology definition file. The trouble is that I don't see how to:
Create a stream that sends to multiple bolts (the syntax seems to only include one 'to:' line).
Emit multiple named streams from a single bolt. This is perfectly legal in Apache Storm, but I am concerned that the stream 'name:' line is declared as 'optional - not used', and hence Flux does not seem to support this feature of Storm?
Each destination needs to be listed as a separate stream, since each one has its own grouping definition.
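For example, a minimal Flux sketch of one spout feeding two bolts; the component names, groupings and the field name here are made up for illustration:

streams:
  - name: "spout-1 --> bolt-1"    # 'name' is optional and only used for documentation
    from: "spout-1"
    to: "bolt-1"
    grouping:
      type: SHUFFLE

  - name: "spout-1 --> bolt-2"
    from: "spout-1"
    to: "bolt-2"
    grouping:
      type: FIELDS
      args: ["word"]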
As for emitting multiple named streams from a single bolt, I don't think that's possible with Flux (0.10.0) yet.
Our Apache Storm topology listens for messages from Kafka using KafkaSpout and, after doing a lot of mapping/reducing/enrichment/aggregation, finally inserts data into Cassandra. There is another Kafka input where we receive user queries for data; if the topology finds a response, it sends it to a third Kafka topic. Now we want to write an E2E test using JUnit in which we can programmatically insert data into the topology, then insert a user query message and assert at the third point that the response received for our query is correct.
To achieve this, we thought of starting EmbeddedKafka and CassandraUnit, replacing the actual Kafka and Cassandra with them, and then starting the topology in the context of this single JUnit test.
But this approach doesn't fit well with JUnit because it makes the tests too bulky: starting Kafka, Cassandra and the topology is all time-consuming and resource-hungry. Is there anything in Apache Storm that can support the kind of testing we are planning to write?
You have a number of options here, depending on what kind of slowdown you can live with:
As you mentioned, you can start Kafka, Cassandra and the topology. This is the slowest option, and the "most realistic".
Start Kafka and Cassandra once, and reuse them for all the tests. You can do the same with the Storm LocalCluster. It is likely faster to clear Kafka/Cassandra between each test (e.g. deleting all topics) instead of restarting them.
Replace the Kafka spouts/bolts and the Cassandra bolt with stubs in the test. Storm has a number of tools built in for stubbing bolts and spouts, e.g. the FixedTupleSpout, the FeederSpout, and the tracked topology and completable topology functionality in LocalCluster. This way you can insert some fixed tuples into the topology and assert on which tuples were sent to the Cassandra bolt stub. There are examples of some of this functionality here and here.
Finally you can of course unit test individual bolts. This is the fastest kind of test. You can use Testing.testTuple to create test tuples to pass to the bolt.
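For the last option, a rough sketch of a single-bolt JUnit test; the bolt class, component name, field name and the assertion are placeholders, and Mockito is assumed for mocking the collector:

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import java.util.HashMap;

import org.apache.storm.Testing;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.testing.MkTupleParam;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.junit.Test;

public class EvaluatorBoltTest {

    @Test
    public void acksEveryInputTuple() {
        EvaluatorBolt bolt = new EvaluatorBolt();          // hypothetical bolt under test
        OutputCollector collector = mock(OutputCollector.class);
        bolt.prepare(new HashMap<>(), null, collector);

        // Build a test tuple that looks like it came from the upstream spout
        MkTupleParam param = new MkTupleParam();
        param.setComponent("kafkaSpout");
        param.setFields("value");
        Tuple input = Testing.testTuple(new Values("some-message"), param);

        bolt.execute(input);

        // Assert on how the bolt used the collector (here: that it acked the tuple)
        verify(collector).ack(input);
    }
}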
I have a Storm topology where I have to send output to Kafka as well as update a value in Redis. For this I have a KafkaBolt as well as a RedisBolt.
Below is what my topology looks like:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaSpout");
tp.setBolt("ResultToRedisBolt", ResultsToRedisBolt, 3).shuffleGrouping("EvaluatorBolt", "ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt", "ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt (ResultStream), and hence both can fail independently. What I really need is to update the value in Redis only if the result is successfully published to Kafka. Is there a way to have an output stream from the KafkaBolt where I can get the messages that were published successfully to Kafka? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
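A rough sketch of that second topology, assuming the storm-kafka-client and storm-redis modules are on the classpath; the topic name, hosts and field names are made up for illustration:

import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.redis.bolt.RedisStoreBolt;
import org.apache.storm.redis.common.config.JedisPoolConfig;
import org.apache.storm.redis.common.mapper.RedisDataTypeDescription;
import org.apache.storm.redis.common.mapper.RedisStoreMapper;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.ITuple;

public class ResultToRedisTopology {

    public static void main(String[] args) {
        // Spout reads the result topic that the first topology's KafkaBolt wrote to
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka:9092", "result-topic").build();

        JedisPoolConfig poolConfig =
                new JedisPoolConfig.Builder().setHost("redis-host").setPort(6379).build();

        // Maps a tuple to the Redis key/value to store; "key"/"value" are the
        // default output fields of the storm-kafka-client spout
        RedisStoreMapper storeMapper = new RedisStoreMapper() {
            @Override
            public RedisDataTypeDescription getDataTypeDescription() {
                return new RedisDataTypeDescription(RedisDataTypeDescription.RedisDataType.STRING);
            }

            @Override
            public String getKeyFromTuple(ITuple tuple) {
                return tuple.getStringByField("key");
            }

            @Override
            public String getValueFromTuple(ITuple tuple) {
                return tuple.getStringByField("value");
            }
        };

        TopologyBuilder tp = new TopologyBuilder();
        tp.setSpout("resultKafkaSpout", new KafkaSpout<>(spoutConfig), 1);
        tp.setBolt("redisStoreBolt", new RedisStoreBolt(poolConfig, storeMapper), 1)
          .shuffleGrouping("resultKafkaSpout");
        // submit with StormSubmitter or run in a LocalCluster as usual
    }
}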
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.
I have 1 spout and 3 bolts in a topology sharing a single stream, declared originally using declarer.declareStream(s1, ...) in the declareOutputFields() method of the spout.
The spout emits to the stream s1, and all downstream bolts also emit Values to the same stream s1. The bolts also declare the same stream s1 in their declareOutputFields().
Is there any problem with that? What is the correct way to do it? Please provide sufficient references.
I don't see any problem with your design, except that it is unnecessary unless you have a specific reason. According to the Storm documentation:
Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).
Thus, if your bolts and spouts do not need to emit more than one stream, there is really no need to specify the stream ID yourself. You can just use the default stream ID.
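A minimal sketch of a bolt that only uses the default stream; the class and field names are made up:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class UppercaseBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Emits on the default stream; no stream id needed anywhere
        collector.emit(new Values(input.getStringByField("word").toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Same as declarer.declareStream(Utils.DEFAULT_STREAM_ID, new Fields("word"))
        declarer.declare(new Fields("word"));
    }
}

When wiring it up, .shuffleGrouping("uppercaseBolt") subscribes to that default stream, exactly as .shuffleGrouping("uppercaseBolt", Utils.DEFAULT_STREAM_ID) would.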
I am beginning with Storm. I would like to know what possible sources I can use for my POC, like Twitter.
Where can I write output data after processing with Storm? HDFS/HBase, etc.?
Storm can handle any source or sink that can be accessed via regular Java/Clojure/... code (if you are willing to do the coding by yourself).
Of course, there are many standard sources/sinks, and Storm contains several Spout/Bolt implementations for them: for example, there is the KafkaSpout and the HdfsBolt. Check out the project, as new sources and sinks are added over time.
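As a rough sketch of wiring two of those bundled implementations together (broker and namenode addresses, topic and path are placeholders; assumes the storm-kafka-client and storm-hdfs modules):

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaToHdfsTopology {
    public static void main(String[] args) {
        // HDFS sink: where to write, how to name/format files, when to rotate and sync
        HdfsBolt hdfsBolt = new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")
                .withFileNameFormat(new DefaultFileNameFormat().withPath("/storm/output/"))
                .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
                .withRotationPolicy(new FileSizeRotationPolicy(64.0f, Units.MB))
                .withSyncPolicy(new CountSyncPolicy(1000));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafkaSpout", new KafkaSpout<>(
                KafkaSpoutConfig.builder("kafka:9092", "input-topic").build()), 1);
        builder.setBolt("hdfsBolt", hdfsBolt, 1).shuffleGrouping("kafkaSpout");
        // submit with StormSubmitter or run in a LocalCluster
    }
}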
I'm working on a project that should write via Kafka to HDFS.
Suppose there is an online server that writes messages into Kafka. Each message includes a timestamp.
I want to create a job whose output is a file (or files) split according to the timestamps in the messages.
For example, if the data in Kafka is
{"ts":"01-07-2013 15:25:35.994", "data": ...}
...
{"ts":"01-07-2013 16:25:35.994", "data": ...}
...
{"ts":"01-07-2013 17:25:35.994", "data": ...}
I would like to get these 3 files as output:
kafka_file_2013-07-01_15.json
kafka_file_2013-07-01_16.json
kafka_file_2013-07-01_17.json
And of course, if I run this job again and there are new messages in the queue, like
{"ts":"01-07-2013 17:25:35.994", "data": ...}
It should create a file
kafka_file_2013-07-01_17_2.json // second chunk of hour 17
I've seen some open source projects, but most of them just read from Kafka into some HDFS folder.
What is the best solution/design/open source project for this problem?
You should definitely check out Camus from LinkedIn. Camus is LinkedIn's Kafka->HDFS pipeline: a MapReduce job that does distributed data loads out of Kafka. Check out this post I have written for a simple example which fetches from the Twitter stream and writes to HDFS based on tweet timestamps.
The project is available on GitHub at https://github.com/linkedin/camus
Camus needs two main components: one for reading and decoding data from Kafka, and one for writing data to HDFS.
Decoding Messages read from Kafka
Camus has a set of Decoders which help in decoding messages coming from Kafka. Decoders basically extend com.linkedin.camus.coders.MessageDecoder, which implements the logic to partition data based on the timestamp. A set of predefined Decoders is present in camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/, and you can write your own based on them.
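As an illustration, here is a sketch of a custom decoder that pulls the timestamp out of the "ts" field, modelled on the bundled JsonStringMessageDecoder. The exact MessageDecoder signatures differ between Camus versions, and the org.json parsing is just for brevity, so treat this as a starting point only:

import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;

import org.json.JSONObject;

import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;

public class TimestampFieldDecoder extends MessageDecoder<byte[], String> {

    // Matches the "01-07-2013 15:25:35.994" format from the question
    private static final SimpleDateFormat TS_FORMAT =
            new SimpleDateFormat("dd-MM-yyyy HH:mm:ss.SSS");

    @Override
    public CamusWrapper<String> decode(byte[] payload) {
        String json = new String(payload, StandardCharsets.UTF_8);
        long timestamp;
        try {
            // Camus uses the timestamp handed back in the CamusWrapper to pick
            // the time-based output partition for the record
            timestamp = TS_FORMAT.parse(new JSONObject(json).getString("ts")).getTime();
        } catch (Exception e) {
            timestamp = System.currentTimeMillis();   // fall back to ingestion time
        }
        return new CamusWrapper<String>(json, timestamp);
    }
}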
Writing messages to HDFS
Camus needs a set of RecordWriterProvider classes, which extend com.linkedin.camus.etl.RecordWriterProvider and tell Camus what payload should be written to HDFS. A set of predefined RecordWriterProviders is present in camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common, and you can write your own based on them.
If you're looking for a more real-time approach, you should check out StreamSets Data Collector. It's also an Apache-licensed open source tool for ingest.
The HDFS destination is configurable to write to time based directories based on the template you specify. And it already includes a way to specify a field in your incoming messages to use to determine the time a message should be written. The config is called "Time Basis" and you can specify something like ${record:value("/ts")}.
*Full disclosure: I'm an engineer working on this tool.
If you are using Apache Kafka 0.9 or above, you can use the Kafka Connect API.
Check out https://github.com/confluentinc/kafka-connect-hdfs
This is a Kafka connector for copying data between Kafka and HDFS.
Check this out for continuous ingestion from Kafka to HDFS. Since it depends on Apache Apex, it has the guarantees Apex provides.
https://www.datatorrent.com/apphub/kafka-to-hdfs-sync/
Check out Camus:
https://github.com/linkedin/camus
This will write data in Avro format, though other RecordWriters are pluggable.