Storm: possible input sources and output sinks - Hadoop

I am beginning with Storm. I would like to know what possible sources I can use for my POC, like Twitter.
Where can I write output data after processing with Storm? For example HDFS/HBase etc.

Storm can handle any source or sink that can be accessed via regular Java/Clojure/... code (if you are willing to do the coding yourself).
Of course, there are many standard sources/sinks, and Storm contains a couple of Spout/Bolt implementations for them: for example there is KafkaSpout or HdfsBolt. Check out the project as new sources and sinks are added over time.
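For illustration, a minimal sketch of wiring those two together could look like this, assuming the storm-kafka-client and storm-hdfs modules are on the classpath; the broker address, topic name, HDFS URL and output path are placeholders.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaToHdfsTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Source: read string records from a Kafka topic (placeholder broker and topic).
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("broker1:9092", "events").build();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 1);

        // Sink: write tuples to HDFS, rotating files at 5 MB and syncing every 1000 tuples.
        HdfsBolt hdfsBolt = new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")
                .withFileNameFormat(new DefaultFileNameFormat().withPath("/storm/output/"))
                .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("|"))
                .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB))
                .withSyncPolicy(new CountSyncPolicy(1000));
        builder.setBolt("hdfs-bolt", hdfsBolt, 1).shuffleGrouping("kafka-spout");

        StormSubmitter.submitTopology("kafka-to-hdfs", new Config(), builder.createTopology());
    }
}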

Related

Storm bolt following a Kafka bolt

I have a Storm topology where I have to send output to Kafka as well as update a value in Redis. For this I have a KafkaBolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaStream");
tp.setBolt("ResultToRedisBolt",ResultsToRedisBolt,3).shuffleGrouping("EvaluatorBolt","ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt ("ResultStream"), hence both can fail independently. What I really need is to update the value in Redis only if the result is successfully published to Kafka. Is there a way to have an output stream from the KafkaBolt where I can get the messages that were successfully published to Kafka? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.
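To make the suggested two-topology design more concrete, a second topology that consumes the result topic and updates Redis could be sketched roughly as below, using storm-kafka-client and storm-redis. The topic name, broker address, Redis host and field names are placeholder assumptions, not code from the question.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.redis.bolt.RedisStoreBolt;
import org.apache.storm.redis.common.config.JedisPoolConfig;
import org.apache.storm.redis.common.mapper.RedisDataTypeDescription;
import org.apache.storm.redis.common.mapper.RedisStoreMapper;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.ITuple;

public class ResultsToRedisTopology {

    // Maps the Kafka record's key/value onto a Redis string key/value (field names are assumptions).
    public static class ResultMapper implements RedisStoreMapper {
        public RedisDataTypeDescription getDataTypeDescription() {
            return new RedisDataTypeDescription(RedisDataTypeDescription.RedisDataType.STRING);
        }
        public String getKeyFromTuple(ITuple tuple) {
            return tuple.getStringByField("key");
        }
        public String getValueFromTuple(ITuple tuple) {
            return tuple.getStringByField("value");
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder tp = new TopologyBuilder();

        // Consume the topic that the first topology's KafkaBolt writes to.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("broker1:9092", "result-topic").build();
        tp.setSpout("resultSpout", new KafkaSpout<>(spoutConfig), 3);

        // Update Redis only for records that actually made it into Kafka.
        JedisPoolConfig poolConfig =
                new JedisPoolConfig.Builder().setHost("localhost").setPort(6379).build();
        tp.setBolt("resultToRedis", new RedisStoreBolt(poolConfig, new ResultMapper()), 3)
          .shuffleGrouping("resultSpout");

        StormSubmitter.submitTopology("results-to-redis", new Config(), tp.createTopology());
    }
}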

Apache Storm Topology using Flux YAML file

I am designing an Apache Storm topology using a Flux YAML topology definition file. The trouble is I don't see how to:
Create a stream that sends to multiple bolts (the syntax seems to only include one 'to:' line).
Emit multiple named streams from a single bolt. This is perfectly legal in Apache Storm, but I am concerned that the stream 'name:' line is declared as 'optional - not used' and hence Flux does not seem to support this feature of Storm?
Each destination needs to be listed as a separate stream as they have individual grouping definitions.
I don't think that's possible with Flux (0.10.0) yet.
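For the first point, here is a rough sketch of how each destination can be listed as its own stream entry in the Flux file. The component ids "my-spout", "bolt-1", "bolt-2" and the "word" field are placeholders assumed to be declared under the spouts:/bolts: sections elsewhere in the file.

streams:
  - name: "my-spout --> bolt-1"
    from: "my-spout"
    to: "bolt-1"
    grouping:
      type: SHUFFLE
  - name: "my-spout --> bolt-2"
    from: "my-spout"
    to: "bolt-2"
    grouping:
      type: FIELDS
      args: ["word"]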

Storm - how to choose a stream grouping

I'm using the KafkaSpout to read/stream messages of compressed byte[]. The bolts are simple: uncompress the message -> write to Cassandra. I'm wondering which stream grouping to use.
The samples appear to mainly use the Shuffle grouping. In testing I've been using the All grouping (figuring that I want all of the messages to go through the one bolt), but I see notes saying "Use this grouping with care".
Suggestions on how to proceed?
Shuffle grouping is sufficient for your use case; it distributes the workload evenly across the downstream bolt instances.
All grouping is rarely needed, and it can result in duplicated processing, since every tuple is replicated to every downstream bolt instance.
Reference:
https://storm.apache.org/documentation/Concepts.html#stream-groupings
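For reference, a minimal sketch of that pipeline with shuffle grouping at both hops. DecompressBolt and CassandraWriterBolt are hypothetical stand-ins for your own bolts, and the "value"/"message" field names are assumptions.

import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DecompressToCassandraWiring {

    // Stand-in for the decompression step (plug your GZIP/Snappy logic into decompress()).
    public static class DecompressBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            byte[] compressed = input.getBinaryByField("value");
            collector.emit(new Values(decompress(compressed)));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
        private String decompress(byte[] payload) {
            return new String(payload); // placeholder for the real decompression
        }
    }

    // Stand-in for the Cassandra write (use a Cassandra driver or storm-cassandra here).
    public static class CassandraWriterBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            String message = input.getStringByField("message");
            // write `message` to Cassandra
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream
        }
    }

    // Shuffle grouping at both hops spreads tuples evenly over the bolt instances.
    public static StormTopology build(IRichSpout kafkaSpout) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", kafkaSpout, 2);
        builder.setBolt("decompress", new DecompressBolt(), 4).shuffleGrouping("kafka-spout");
        builder.setBolt("cassandra-writer", new CassandraWriterBolt(), 4).shuffleGrouping("decompress");
        return builder.createTopology();
    }
}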

Storm: how to set up various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-accessible) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me e.g. the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if any.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist data grouped by all the metrics, but only by the groupings we need to extract later
is the spout's emitted data actually processed by all the state/query combinations, i.e. not just by the first one that comes?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in the state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (and I will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to what is the correct way of doing this?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
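To make that concrete, here is a rough Trident sketch of the suggested structure: one stream from one spout, branched once per metric with groupBy() and persistentAggregate(), each backing its own DRPC query. MemoryMapState is used only for brevity, and the "channel"/"gateway" field names are assumptions taken from the question.

import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class MetricsTopologySketch {
    // Pass in the spout (e.g. the gist's RandomMessageSpout), assumed to emit
    // fields such as "sender", "recipient", "channel" and "gateway".
    public static StormTopology build(IBatchSpout messageSpout) {
        TridentTopology topology = new TridentTopology();

        // One stream from one spout instance; branch it once per metric.
        Stream messages = topology.newStream("messages", messageSpout);

        // Count per channel, persisted in its own (here in-memory) state.
        TridentState byChannel = messages
                .groupBy(new Fields("channel"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // Count per gateway, branched from the same stream into a separate state.
        TridentState byGateway = messages
                .groupBy(new Fields("gateway"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // One DRPC query per metric; the same pattern applies to byGateway.
        topology.newDRPCStream("count-by-channel")
                .groupBy(new Fields("args"))
                .stateQuery(byChannel, new Fields("args"), new MapGet(), new Fields("count"));

        return topology.build();
    }
}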

What is the most efficient way to write from Kafka to HDFS with files partitioned by date

I'm working on a project that should write from Kafka to HDFS.
Suppose there is an online server that writes messages into Kafka. Each message includes a timestamp.
I want to create a job whose output will be a file or files partitioned according to the timestamps in the messages.
For example if the data in kafka is
{"ts":"01-07-2013 15:25:35.994", "data": ...}
...
{"ts":"01-07-2013 16:25:35.994", "data": ...}
...
{"ts":"01-07-2013 17:25:35.994", "data": ...}
I would like to get these 3 files as output
kafka_file_2013-07-01_15.json
kafka_file_2013-07-01_16.json
kafka_file_2013-07-01_17.json
And of course, if I run this job again and there are new messages in the queue like
{"ts":"01-07-2013 17:25:35.994", "data": ...}
It should create a file
kafka_file_2013-07-01_17_2.json // second chunk of hour 17
I've seen some open source projects, but most of them just read from Kafka into some HDFS folder.
What is the best solution/design/open source project for this problem?
You should definitely check out the Camus API implementation from LinkedIn. Camus is LinkedIn's Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. Check out this post I have written for a simple example which fetches from the Twitter stream and writes to HDFS based on tweet timestamps.
The project is available on GitHub at - https://github.com/linkedin/camus
Camus needs two main components for reading and decoding data from Kafka and writing data to HDFS:
Decoding messages read from Kafka
Camus has a set of Decoders which help decode messages coming from Kafka. Decoders basically extend com.linkedin.camus.coders.MessageDecoder, which implements the logic to partition data based on timestamp. A set of predefined Decoders is present in camus/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/ and you can write your own based on these.
Writing messages to HDFS
Camus needs a set of RecordWriterProvider classes, which extend com.linkedin.camus.etl.RecordWriterProvider and tell Camus what payload should be written to HDFS. A set of predefined RecordWriterProviders is present in camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common and you can write your own based on these.
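For illustration only, a custom decoder that partitions by the embedded "ts" field could look roughly like the sketch below, loosely modeled on Camus's JsonStringMessageDecoder. The base-class signatures differ between Camus versions, so treat this as a starting point rather than drop-in code; the "ts" field name and its format are taken from the question.

import java.text.SimpleDateFormat;

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;

public class TimestampJsonMessageDecoder extends MessageDecoder<byte[], String> {

    // Matches the question's timestamp format, e.g. "01-07-2013 15:25:35.994".
    private final SimpleDateFormat format = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss.SSS");

    @Override
    public CamusWrapper<String> decode(byte[] payload) {
        try {
            String json = new String(payload, "UTF-8");
            JsonObject record = new JsonParser().parse(json).getAsJsonObject();
            // Use the embedded "ts" field so Camus partitions output files by event time.
            long timestamp = format.parse(record.get("ts").getAsString()).getTime();
            return new CamusWrapper<String>(json, timestamp);
        } catch (Exception e) {
            throw new RuntimeException("Failed to decode message", e);
        }
    }
}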
If you're looking for a more real-time approach, you should check out StreamSets Data Collector. It's also an Apache-licensed open source tool for ingest.
The HDFS destination can be configured to write to time-based directories based on the template you specify, and it already includes a way to specify a field in your incoming messages to determine the time a message should be written. The config is called "Time Basis" and you can specify something like ${record:value("/ts")}.
*Full disclosure: I'm an engineer working on this tool.
If you are using Apache Kafka 0.9 or above, you can use the Kafka Connect API.
Check out https://github.com/confluentinc/kafka-connect-hdfs
This is a Kafka connector for copying data between Kafka and HDFS.
Check this out for continuous ingestion from Kafka to HDFS. Since it depends on Apache Apex, it has the guarantees Apex provides.
https://www.datatorrent.com/apphub/kafka-to-hdfs-sync/
Check out Camus:
https://github.com/linkedin/camus
This will write data in Avro format, though other RecordWriters are pluggable.
