Apache Storm: split a stream to different bolts - apache-storm

I'm building a Storm topology and I'm dealing with strings of the format "x-x-x-x", where x is some digit. I want the stream of strings to be split equally between 4 bolts.
The problem is that with the following code, all the bolts get all the tuples, instead of each tuple being sent to exactly one bolt:
builder.setSpout("digits-spout", new ReaderSpout());
builder.setBolt("level-1", new SomeBolt(1)).shuffleGrouping("digits-spout");
builder.setBolt("level-2", new SomeBolt(2)).shuffleGrouping("digits-spout");
builder.setBolt("level-3", new SomeBolt(3)).shuffleGrouping("digits-spout");
builder.setBolt("level-4", new SomeBolt(4)).shuffleGrouping("digits-spout");
As you can see, I use the same bolt class but with a different constructor argument.
Thanks!

Based on what I understand from your question, I would suggest adding an extra bolt in front, like in the following example:
builder.setSpout("digits-spout", new ReaderSpout());
builder.setBolt("stringSplitterBoltName", new StringSplitterBolt(1)).shuffleGrouping("digits-spout");
builder.setBolt("level-1", new SomeBolt(1)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-2", new SomeBolt(2)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-3", new SomeBolt(3)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-4", new SomeBolt(4)).shuffleGrouping("stringSplitterBoltName");

If you want the bolts to apply different processing logic, you can still just add 4 tasks of the same bolt (i.e. set a parallelism of 4). Tuples will then be distributed evenly across the bolt instances; inside the bolt you can check the string value and take the appropriate execution path. This way you avoid maintaining a separate codebase for 4 bolts.
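For illustration, here is a minimal sketch of that single-bolt approach (the LevelBolt class, its branching logic, and the Storm 1.x package names are assumptions for the example; the question's SomeBolt could be adapted the same way):
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// One bolt class deployed with a parallelism of 4; each instance decides per tuple
// which "level" logic to run, instead of wiring four differently-constructed bolts.
public class LevelBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String value = input.getString(0);            // e.g. "1-2-3-4"
        String firstDigit = value.split("-")[0];
        switch (firstDigit) {
            case "1": /* level-1 logic */ break;
            case "2": /* level-2 logic */ break;
            case "3": /* level-3 logic */ break;
            default:  /* level-4 logic */ break;
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // declare output fields here if the bolt emits anything
    }
}
In the topology-building code, a shuffle grouping then sends each tuple to exactly one of the 4 executors:
builder.setBolt("level-bolt", new LevelBolt(), 4).shuffleGrouping("digits-spout");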
Alternatively, if you want to keep separate bolt code for the strings, go with the suggestion above by zackeriya.

Related

Consumer assignment with multiple topics with Kafka Streams

Apologies if this has already been covered here; I couldn't find anything closely related. I have a Kafka Streams app which reads from multiple topics, persists the records in a DB and then publishes an event to an output topic. Pretty straightforward; it's stateless in terms of Kafka local stores. (Topology below.)
Topic1 (T1) has 5 partitions and Topic2 (T2) has a single partition. The issue is that, while consuming from two topics, if I want to go "full speed" with T1 (5 consumers), there is no guarantee that I will have dedicated consumers for each partition of T1. The partitions of both topics will be distributed across the consumers, and I might end up with unbalanced (and idle) consumers, something like the examples below:
[c1: t1p1, t1p3], [c2: t1p2, t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: (idle consumer)]
[c1: t1p1, t1p2], [c2: t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: t1p3]
With that said:
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Which of the topologies below is best suited to what I want to achieve? Or is it completely unrelated?
Option A (Current topology)
Topologies:
  Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Processor: topic1-processor (stores: [])
      --> topic1-sink
      <-- topic1-source
    Sink: topic1-sink (topic: OUTPUT-TOPIC)
      <-- topic1-processor
  Sub-topology: 1
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic2-processor (stores: [])
      --> topic2-sink
      <-- topic2-source
    Sink: topic2-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor
Option B:
Topologies:
  Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic1-processor (stores: [])
      --> response-sink
      <-- topic1-source
    Processor: topic2-processor (stores: [])
      --> response-sink
      <-- topic2-source
    Sink: response-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor, topic1-processor
If I use two streams, one for each topic, instead of a single streams instance with multiple topics, would that work for what I am trying to achieve?
config1.put("application.id", "app1");
KafkaStreams stream1 = new KafkaStreams(config1, topologyTopic1);
stream1.start();
config2.put("application.id", "app2");
KafkaStreams stream2 = new KafkaStreams(config2, topologyTopic2);
stream2.start();
The initial assignments you describe would never happen with Kafka Streams (and also not with any default consumer configuration). If there are 5 partitions and you have 5 consumers, each consumer would get 1 partition assigned (with a plain consumer and a custom PartitionAssignor you could do the assignment differently, but all default implementations ensure proper load balancing).
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
There is no issue with that.
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Depending on how you write your topology, this is the assignment Kafka Streams uses out of the box. Of your two options, Option B would result in this assignment.
Which of the topologies below is best suited to what I want to achieve? Or is it completely unrelated?
As mentioned above, Option B would result in the assignment above. For Option A, you could actually even use a 6th instance and each instance would process exactly one partition (because there are two sub-topologies, you get 6 tasks: 5 for sub-topology-0 and 1 for sub-topology-1; sub-topologies are scaled out independently of each other). For Option B, you only get 5 tasks, because there is only one sub-topology, and thus the maximum partition count over both input topics (that is, 5) determines the number of tasks.
If I use two streams, one for each topic, instead of a single streams instance with multiple topics, would that work for what I am trying to achieve?
Yes, it would basically be the same as Option A -- however, you get two consumer groups and thus "two applications" instead of one.
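For reference, a rough sketch of how an Option-B-style wiring might look with the Processor API, so that both sources share a single sink node and therefore end up in one sub-topology (Topic1Processor and Topic2Processor stand in for your existing processors; this is an illustration, not your actual code):
import org.apache.kafka.streams.Topology;

// Both processors feed the same sink node, so the whole graph forms a single
// sub-topology and Kafka Streams creates max(5, 1) = 5 tasks.
Topology topology = new Topology();
topology.addSource("topic1-source", "TOPIC1")
        .addSource("topic2-source", "TOPIC2")
        .addProcessor("topic1-processor", Topic1Processor::new, "topic1-source")
        .addProcessor("topic2-processor", Topic2Processor::new, "topic2-source")
        .addSink("response-sink", "OUTPUT-TOPIC", "topic1-processor", "topic2-processor");
System.out.println(topology.describe()); // should print a single "Sub-topology: 0"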

How to do aggregation on fixed-size count-based sliding window?

How do I implement a sliding window aggregation (or transformation) with a fixed-size count-based window?
For example, if I have stream data like the following:
input stream = 1,2,3,4,5,6,7,8...
Assume that time is not relevant here. Say my aggregate function is AVERAGE and the window size is fixed at 3 records (not 3 ms, 3 seconds, 3 hours, etc.); I would like my output stream to be
output stream = avg(1,2,3), avg(2,3,4), avg(3,4,5), avg(4,5,6), avg(5,6,7)... = 2,3,4,5,6...
The windows documented for Kafka Streams are "time-based". Even the constructor of the base class Window has the following signature:
Window(long startMs, long endMs)
So I was not sure if it's the right tool for non-time-based windowed aggregation.
Apache Flink supports count-based sliding and tumbling windows. That's exactly what I need, but I'm looking for a similar feature in Kafka Streams.
If time-ordering is of no concern for you, you can implement a custom Transformer with an attached state store.
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(...); // add a KeyValueStore here (e.g. via Stores.keyValueStoreBuilder)
KStream result = builder.stream("topic").transform(...); // pass in the name of your KeyValueStore, too
In your custom Transformer, you can maintain a List per key, with the list being your window: as long as the list is smaller than your window size, you append the new record; once it is exactly the window size, you trigger the computation; if it exceeds the size, you trim it first and trigger the computation afterwards.
See the docs for more details: https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html (Note, that a Processor and a Transformer are basically the same thing.)
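A rough, untested sketch of what such a Transformer might look like (the window size of 3, the store name "count-window-store", and the comma-separated-string encoding of the window are all assumptions made to keep the example short):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();
// In-memory store holding the current window per key (kept as a comma-separated string
// purely to avoid a custom serde in this sketch).
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.inMemoryKeyValueStore("count-window-store"),
        Serdes.String(), Serdes.String()));

final int windowSize = 3;
KStream<String, Double> averages = builder
        .stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
        .transform(() -> new Transformer<String, String, KeyValue<String, Double>>() {
            private KeyValueStore<String, String> store;

            @SuppressWarnings("unchecked")
            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore("count-window-store");
            }

            @Override
            public KeyValue<String, Double> transform(String key, String value) {
                String stored = store.get(key);
                List<String> window = stored == null
                        ? new ArrayList<>()
                        : new ArrayList<>(Arrays.asList(stored.split(",")));
                window.add(value);
                if (window.size() > windowSize) {
                    window.remove(0);                 // slide: drop the oldest record
                }
                store.put(key, String.join(",", window));
                if (window.size() < windowSize) {
                    return null;                      // window not full yet, emit nothing
                }
                double avg = window.stream().mapToDouble(Double::parseDouble).average().orElse(0.0);
                return KeyValue.pair(key, avg);
            }

            @Override
            public void close() { }
        }, "count-window-store");

averages.to("output-topic", Produced.with(Serdes.String(), Serdes.Double()));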
If you wish to use Apache Storm, which is also a streaming engine, Kafka can be connected to it as a data source. Newer Storm versions provide a concept called a Tumbling Window, which delivers an exact number of tuples to your topology. This can easily be used to solve your problem.
For more have a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_storm-component-guide/content/storm-windowing-concepts.html
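Storm's windowing API supports both tumbling and sliding count-based windows. A rough sketch of a sliding count window bolt matching the question's average-over-the-last-3-records (names are illustrative and Storm 1.x package names are assumed):
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.windowing.TupleWindow;

// Receives a window of the last 3 tuples (sliding by 1 tuple) and emits their average.
public class SlidingAvgBolt extends BaseWindowedBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(TupleWindow window) {
        double sum = 0;
        for (Tuple t : window.get()) {
            sum += t.getInteger(0);
        }
        collector.emit(new Values(sum / window.get().size()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("avg"));
    }
}

// Topology wiring: window length of 3 tuples, sliding interval of 1 tuple.
builder.setBolt("sliding-avg",
        new SlidingAvgBolt().withWindow(new BaseWindowedBolt.Count(3), new BaseWindowedBolt.Count(1)))
       .shuffleGrouping("number-spout");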

Storm UI showing Different Numbers of Executors and Tasks

I am using Storm with a Trident topology, but I am not able to understand how the parallelism is determined; it differs between my calculation and what I see in the Storm UI.
Here's the code that assigns the number of workers:
public Config getTopologyConfiguration() {
    Config conf = new Config();
    //conf.setDebug(true);
    conf.setNumWorkers(6);
    conf.setMessageTimeoutSecs(100);
    return conf;
}
And Here is the stream processing code:
s.name("aggregation_stream")
.parallelismHint(invoiceAggregationConfig.getSpoutParallelism())
.partitionBy(groupedFields)
.partitionAggregate(aggregateInputFields,
new GenericAggregator(groupedFields, aggregatedFieldsList, aggregateFieldsOperationList),
aggregatorOutputFields)
.parallelismHint(invoiceAggregationConfig.getAggregationParallelism())
.shuffle()
.each(aggregatorOutputFields,
new CreatePaymentFromInvoices(paymentType, groupMap, aggMap, paymentExtraParams),
Const.PAYMENT_FIELD)
.each(TridentUtils.fieldsConcat(aggregatorOutputFields, Const.PAYMENT_FIELD),
new CreateApplicationFromPaymentAndInvoices(invoiceType),
Const.APPLICATIONS_FIELD)
.each(TridentUtils.fieldsConcat(aggregatorOutputFields, Const.PAYMENT_FIELD, Const.APPLICATIONS_FIELD),
new RestbusFilterForPaymentAndApplications(environment, bu, serviceConfiguration))
.parallelismHint(invoiceAggregationConfig.getPersistenceParallelism());
and here are the parallelism attributes I am using in the code above:
spoutParallelism: 3
aggregationParallelism: 6
persistenceParallelism: 6
Now, according to my calculation, the number of executors should be
3*6 + 6 = 24
But the Storm UI shows 23. How?
EDITED
Adding a new screenshot with information about the individual components.
Here I can see that the number of executors and tasks is 50, but I didn't set any configuration for this. Does Storm provide this by itself?
Secondly, the number of emitted tuples is huge. I am not producing that much data; it is more than 100 times the number of tuples I produce. Why are so many tuples showing up in the UI?
The number of emitted tuples can be a huge number.
Reason: when the spout emits a tuple, it expects an ack; if the ack is not received, it re-sends the tuple, so the emitted and transferred counts can be much higher. (Check the ack count: it is small compared to the emitted count.)
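If replays are indeed the cause, one setting worth checking (a suggestion, not a guaranteed fix for your topology) is how many tuples/batches may be in flight at once; fewer in-flight tuples means fewer timeouts and hence fewer replays:
Config conf = new Config();
conf.setNumWorkers(6);
conf.setMessageTimeoutSecs(100);  // already in your topology config
conf.setMaxSpoutPending(500);     // cap in-flight tuples (batches, for Trident) to reduce timeouts and replays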

Storm 0.10.0 reuse a topology design?

Can the following design be accomplished in Storm?
Let's take the word count example that is present in the following:
https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/WordCountTopology.java
I am changing the word generator spout to a file reader spout
The design for this Word Count Topology is
1. Spout to read file and create sentences line by line
2. Bolt to split sentences to words
3. Bolt to add unique words and give a word and its corresponding count
So, in a way, the topology describes the flow a file needs to take to count the unique words it contains.
If I have two files, file 1 and file 2, one should be able to call the same topology and create two instances of this topology to run the same word count.
In order to track whether the word count has indeed finished, the instances of the word count topology should have a completed status once the file has been processed.
In the current design of Storm, I find that the topology is the actual instance, so it is like a task.
One needs to make two different calls with different Topology names like
for file 1
StormSubmitter.submitTopology("WordCountTopology1", conf,builder.createTopology());
for file 2
StormSubmitter.submitTopology("WordCountTopology2", conf,builder.createTopology());
not to mention the same upload of the jar using the storm client
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount1Main.App "server" "filepath1"
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount2Main.App "server" "filepath2"
The other issue is that the topologies don't complete once the file is processed. They stay alive until we issue a kill on the topology:
storm kill "WordCountTopology"
I understand that in a streaming world, where the messages come from a message queue like Kafka, there is no end of messages, but how is that relevant in the file world where the entities/messages are fixed?
Is there an API that does the following?
//creates the topology, this is done one time using the storm to upload the respective jars
StormSubmitter.submitTopology("WordCountTopology", conf,builder.createTopology());
Once uploaded, the application code just instantiates the topology with the arguments:
//creates an instance of the topology and give a status tracker
JobTracker tracker = StormSubmitter.runTopology("WordCountTopology", conf, args);
//Can query the Storm for the current job if its complete or not
JobStatus status = StormSubmitter.getTopologyStatus(conf, tracker);
For reusing the same topology twice, you have two possibilities:
1) Use a constructor parameter for your file spout and instantiate the same topology twice with different parameters:
private static StormTopology createMyTopology(String filename) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("File Spout", new FileSpout(filename));
    // add further spouts and bolts etc.
    return builder.createTopology();
}

public static void main(String[] args) {
    String file1 = "/path/to/file1";
    String file2 = "/path/to/file2";
    Config c = new Config();
    boolean useFile1 = args.length > 0 && args[0].equals("file1"); // e.g. decide via command-line argument
    if (useFile1) {
        StormSubmitter.submitTopology("T1", c, createMyTopology(file1));
    } else {
        StormSubmitter.submitTopology("T1", c, createMyTopology(file2));
    }
}
2) As an alternative, you could configure your file spout in its open() method:
public class FileSpout extends BaseRichSpout {
    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        String filename = (String) conf.get("FILENAME");
        // ...
    }
    // other methods omitted
}
public static void main(String[] args) {
    String file1 = "/path/to/file1";
    String file2 = "/path/to/file2";
    Config c = new Config();
    boolean useFile1 = args.length > 0 && args[0].equals("file1"); // e.g. decide via command-line argument
    if (useFile1) {
        c.put("FILENAME", file1);
    } else {
        c.put("FILENAME", file2);
    }
    // assemble topology...
    StormSubmitter.submitTopology("T", c, builder.createTopology());
}
For your second question: there is no API in Storm that terminates a topology automatically. You could use TopologyInfo and monitor the number of tuples emitted by the spout. If it does not change for some time, you can assume that the whole file has been read and then kill the topology.
Config cfg = new Config();
// set NIMBUS_HOST and NIMBUS_THRIFT_PORT in cfg
Client client = NimbusClient.getConfiguredClient(cfg).getClient();
TopologyInfo info = client.getTopologyInfo("topology-id"); // note: expects the topology ID, not the name
// get emitted tuples...
client.killTopology("topologyName");
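A rough, untested sketch of how the "get emitted tuples" part might look (the ":all-time" stats window key, the spout component id, and the Thrift-generated getters such as get_executors() are assumptions based on Storm's generated client classes):
// Sum the all-time emitted count of the spout component; poll this periodically
// and kill the topology once the value stops growing.
long emitted = 0;
for (ExecutorSummary exec : info.get_executors()) {
    if ("File Spout".equals(exec.get_component_id()) && exec.get_stats() != null) {
        Map<String, Long> perStream = exec.get_stats().get_emitted().get(":all-time");
        if (perStream != null) {
            for (long count : perStream.values()) {
                emitted += count;
            }
        }
    }
}
// compare 'emitted' with the value from the previous poll; if unchanged for a while,
// call client.killTopology(...)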
The word count topology mentioned in the post doesn't do justice to the might and power of Storm. Since Storm is a stream processor, it requires a stream; period. Files, by definition, are static. I empathize with the Storm developers: a simple hello-world was needed to showcase the topology concepts, and a non-streaming source like a file was chosen. For newbies learning Storm, which I was at the time, it was difficult to understand how to develop using that example. The example is just a way to show how Storm's concepts work, not a real-world application of how files would arrive or need to be processed.
So here is the take on how one of the solution could be.
Since topologies run all the time, they can compute the word count for as long as one wants, i.e. within a file or across all files over any period of time.
In order to allow different files to come in, we need a streaming spout, so naturally you would need a Kafka message broker or similar to receive files as a stream. Depending on the size of the files and the limits that message brokers impose (Kafka, for example, has a default message size limit of about 1 MB), we can either send the file itself as the payload or send a reference to the file, in which case you would need a distributed file system to store it, such as HDFS or a NAS.
We then read these files using a Kafka Spout as opposed to FileSpout.
We now have the following issues
1. Word Count Across Files
2. Word Count per File
3. Running Status on the word count till it is processed
4. When do we know if a file is processed or complete
Word Count Across Files
This is the use case the provided example targets: if we continue to stream the files, and for each file we read the lines, split them into words, and send them to the other bolts, the bolts will count the words independently of which file they came from.
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
Field Grouping
quick
brown
fox
...
Once
upon
fox (not needed as it came in file 1)
...
Word Count Per File
In order to do this, we now need the fields-grouping key to be the word appended with a fileId. So the example needs to change to include a fileId for each word it splits.
So
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
So the fields grouping on word would be (ignoring the noise words):
File1_quick
File1_brown
File1_fox
File2_once
File2_upon
File2_fox
Running Status on the word count till it is processed
Since all these counts are in the bolt's memory and we don't know the EOF, there is no way to get the status unless someone peeks into the bolt or we periodically send the counts to another data store where we can query them. That is exactly what we need to do: at periodic intervals, persist the in-memory bolt counts to a data store such as HBase, Elasticsearch, MongoDB, etc., as in the sketch below.
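One way to do that periodic persistence (a sketch only, assuming Storm 1.x package names; the actual write to the data store is left as a stub) is to use tick tuples so the bolt flushes its in-memory counts on a fixed interval:
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PersistingWordCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // ask Storm to send this bolt a tick tuple every 30 seconds
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 30);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId())) {
            flushToDataStore(counts);                   // write snapshot to HBase/Elasticsearch/MongoDB, etc.
        } else {
            String key = tuple.getString(0);            // e.g. "File1_fox"
            counts.merge(key, 1L, Long::sum);
        }
    }

    private void flushToDataStore(Map<String, Long> snapshot) {
        // persistence intentionally left out of the sketch
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}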
When do we know if a file is processed or complete
Perhaps this is the toughest question to answer in the streaming world. Basically, the stream processor doesn't know the stream is finished: from its perspective, files keep coming in, and it needs to split each file into words and count them in the corresponding bolts. The bolts don't know what happened before or after a tuple reached them.
This entire thing needs to be done by the app developer.
One way to do this is, when each file is read, to count the total words and send a message like:
File 1 : Total Words : 1000
File 2 : Total Words : 2000
Now, when we do the word count and track the per-file words (File1_*), the sum of the individual word counts should match the total words before we say a file is complete. All of this is custom logic we would need to write before we can say it is complete.
So, in essence, Storm provides the framework to do stream processing in a variety of ways. It's the application developer's job to work with the design it offers and implement their own logic depending on the use case. Storm doesn't provide application use cases out of the box or a good reference implementation, which I think the community needs to build, since it's not a commercial product and depends on the community to champion it.

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made of a Producer and a Consumer. The Producer gets some data from an external source and writes it to Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The Consumer is based on Spark Streaming, and every 5 seconds it fetches new messages from Kafka and writes them to Elasticsearch. The problem is that I'm not able to write to ES because I get a lot of errors like the one below:
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]: Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [3/26560] (maybe ES was overloaded?). Bailing out...
    at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
    at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
    at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
    at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33) ~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3]
    at org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57) ~[spark-core_2.10-1.2.1.jar:1.2.1]
    at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) [spark-core_2.10-1.2.1.jar:1.2.1]
    at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [na:na]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [na:na]
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) [spark-core_2.10-1.2.1.jar:1.2.1]
    at org.apache.spark.scheduler.Task.run(Task.scala:58) [spark-core_2.10-1.2.1.jar:1.2.1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) [spark-core_2.10-1.2.1.jar:1.2.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds, so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages; I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code:
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1, AvroToJsonUtil.avroToJson(k._2)))
//TO-DO : Remove resource (yahoo/yahoo) hardcoded parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES : elasticsearch-spark_2.10 with the latest beta version.
I found, after many retries, a way to write to Elasticsearch without getting any error. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error, considering that I would expect an error message if I were trying to write more than the allowed max batch size, not less.
Moreover, I've noticed that I actually was writing to ES, just not all my messages: I was losing between 1 and 3 messages per batch.
When I pushed a DataFrame to ES from Spark, I had the same error message. Even with the "es.batch.size.entries" -> "1" configuration, I had the same error.
Once I increased the bulk thread pool in ES, the issue went away.
for example,
Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When they are written to Elasticsearch as part of the same batch, this produces the error above.
Possible solutions:
Generate a unique id for each record. Depending on your use case, this can be done in a few different ways; for example, one common solution is to create a new field by combining the id and lastModifiedDate fields and use that field as the id when writing to Elasticsearch.
Perform de-duplication of records based on id: select only one record with a particular id and discard the other duplicates. Depending on your use case, this could be the most current record (based on a timestamp field), the most complete record (most of the fields contain data), etc.
The #1 solution will store all records that you receive in the stream.
The #2 solution will store only the unique records for a specific id based on your de-duplication logic. This result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit the performance by writing one record at a time.
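As an illustration of solution #1, the elasticsearch-hadoop connector can be told which field to use as the document _id via es.mapping.id (shown here as a plain Java option map; the compositeId field name is hypothetical, and the same option works from Scala):
import java.util.HashMap;
import java.util.Map;

Map<String, String> esConf = new HashMap<>();
esConf.put("es.input.json", "true");        // records are already JSON strings
esConf.put("es.mapping.id", "compositeId"); // hypothetical field combining id + lastModifiedDate
// pass esConf to saveToEs instead of Map("es.input.json" -> "true")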
One possibility is that the cluster/shard status is RED. Please address that issue, which may be due to unassigned replicas. Once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
Result is Elasticsearch receiving multiple updates for a single Document at once - from multiple sources / through multiple nodes / containing different data
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents then:
if you are using a custom routing field (not _id), then according to the documentation the uniqueness of the documents is not guaranteed. This might cause issues while updating from Spark.
If you are using the standard _id, the uniqueness will be preserved, however you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing
