When I use Storm, how can I ensure that a bolt with multiple inputs processes only when all the inputs have arrived? - apache-storm

The topology looks like this.
How can I ensure that a bolt with multiple inputs processes only when all the inputs have arrived?

Bolt.execute() is called for each incoming tuple, regardless of which producer emitted it (and you cannot change this). If you want to process multiple tuples from different producers at once, you need to write custom UDF code:
You need an input buffer for each producer that can hold incoming tuples (for example a LinkedList<Tuple> as a bolt member).
For each incoming tuple, add it to the buffer of the corresponding producer (you can read the producer from the tuple's metadata via input.getSourceComponent()).
After adding the tuple to its buffer, check whether every buffer contains at least one tuple: if yes, take one tuple from each buffer and process them together (after processing, check the buffers again and repeat until at least one buffer is empty); if no, just return and do not process anything.
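A minimal sketch of this buffering pattern might look like the following. It assumes the set of producer component ids is known up front and passed to the constructor, and the package names assume the org.apache.storm layout (older releases use backtype.storm); the join logic itself is left as a placeholder.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class MergeBolt extends BaseRichBolt {
    private final List<String> producers;                        // expected upstream component ids
    private final Map<String, LinkedList<Tuple>> buffers = new HashMap<>();
    private OutputCollector collector;

    public MergeBolt(List<String> producers) {
        this.producers = producers;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        for (String producer : producers) {
            buffers.put(producer, new LinkedList<Tuple>());
        }
    }

    @Override
    public void execute(Tuple input) {
        // buffer the tuple under the component that emitted it
        buffers.get(input.getSourceComponent()).add(input);

        // while every buffer has at least one tuple, take one from each and process them together
        while (buffers.values().stream().noneMatch(LinkedList::isEmpty)) {
            List<Tuple> batch = new ArrayList<>();
            for (LinkedList<Tuple> buffer : buffers.values()) {
                batch.add(buffer.poll());
            }
            // custom join/aggregation logic goes here; anchor and ack the consumed tuples as needed
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // declare whatever fields the merged output carries
    }
}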

You might want to take a look here (refer to Batching). For bolts that perform more complex operations, such as aggregation over multiple input tuples, you will need to extend BaseRichBolt and manage the anchoring mechanism yourself.
For this you need to declare your own output collector like this:
private OutputCollector outputCollector;
And then initialise it through your override of the prepare method:
@Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
    this.outputCollector = outputCollector;
}
The execute method of BaseRichBolt only receives a tuple as its argument, so you need to maintain the anchors yourself and use them when emitting.
private final List<Tuple> anchors = new ArrayList<Tuple>();

@Override
public void execute(Tuple tuple) {
    if (!isTupleAggregationComplete(anchors, tuple)) {
        anchors.add(tuple);
        return;
    }
    anchors.add(tuple); // anchor the final tuple as well
    // do your computations here!
    outputCollector.emit(anchors, new Values(foo, bar, xpto));
    anchors.clear();
}
You should implement isTupleAggregationComplete with the logic that checks whether the bolt has all the information it needs to proceed with processing.
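One possible isTupleAggregationComplete, assuming the bolt needs exactly one tuple from each of its upstream components before it can proceed (expectedSources is a hypothetical member holding those component ids), could look like this:
private boolean isTupleAggregationComplete(List<Tuple> anchors, Tuple current) {
    Set<String> seen = new HashSet<>();
    for (Tuple buffered : anchors) {
        seen.add(buffered.getSourceComponent());
    }
    seen.add(current.getSourceComponent());
    // complete once a tuple from every expected upstream component has been seen
    return seen.containsAll(expectedSources);
}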

Related

RunnableGraph to wait for multiple response from source

I am using Akka in a Play controller and performing ask() to an actor named publish; internally the publish actor performs ask to multiple actors and passes along a reference to the sender. The controller needs to wait for the responses from multiple actors and build a list of the responses.
Please find the code below, but this code only waits for one response and then terminates. Please suggest a fix.
// Performs ask to publish actor
Source<Object, NotUsed> inAsk = Source.fromFuture(
        ask(publishActor, service.getOfferVerifyRequest(request).getPayloadData(), 1000));

final Sink<String, CompletionStage<String>> sink = Sink.head();

final Flow<Object, String, NotUsed> f3 = Flow.of(Object.class).map(elem -> {
    log.info("Data in Graph is " + elem.toString());
    return elem.toString();
});

RunnableGraph<CompletionStage<String>> result = RunnableGraph.fromGraph(
        GraphDSL.create(sink, (builder, out) -> {
            final Outlet<Object> source = builder.add(inAsk).out();
            builder
                .from(source)
                .via(builder.add(f3))
                .to(out); // to() expects a SinkShape
            return ClosedShape.getInstance();
        }));

ActorMaterializer mat = ActorMaterializer.create(aSystem);
CompletionStage<String> fin = result.run(mat);
fin.toCompletableFuture().thenApply(a -> {
    log.info("Data is " + a);
    return true;
});
log.info("COMPLETED CONTROLLER ");
If you have several responses, ask won't cut it; it is only for a single request-response where the response ends up in a Future/CompletionStage.
There are a few different strategies to wait for all answers:
One is to create an intermediate actor whose only job is to collect all answers and then, when all partial responses have arrived, respond to the original requestor; that way you can use ask to get a single aggregate response back.
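A rough sketch of such an intermediate actor using classic Akka actors might look like the following. The class name, the Props factory, the Request message type, and passing in the list of worker actors are all assumptions for illustration, not part of the original code:
import java.util.ArrayList;
import java.util.List;
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;

// Forwards the request to all workers, collects their partial responses,
// replies once with the full list, then stops itself.
public class ResponseAggregator extends AbstractActor {
    private final List<ActorRef> workers;      // the actors that each produce one partial response
    private final List<Object> responses = new ArrayList<>();
    private ActorRef requestor;                // whoever asked us (the ask() temporary actor)

    public static Props props(List<ActorRef> workers) {
        return Props.create(ResponseAggregator.class, () -> new ResponseAggregator(workers));
    }

    public ResponseAggregator(List<ActorRef> workers) {
        this.workers = workers;
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(Request.class, request -> {             // Request is a hypothetical message type
                requestor = getSender();
                workers.forEach(w -> w.tell(request, getSelf()));
            })
            .matchAny(partial -> {
                responses.add(partial);
                if (responses.size() == workers.size()) {  // all partial responses have arrived
                    requestor.tell(new ArrayList<>(responses), getSelf());
                    getContext().stop(getSelf());
                }
            })
            .build();
    }
}
The controller can then ask this aggregator once and get the whole list back as a single CompletionStage.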
Another option would be to use Source.actorRef to get an ActorRef that you could use as the sender together with tell (and skip using ask). Inside the stream you would then take elements until some criterion is met (time has passed or enough elements have been seen). You may have to add an operator to mimic the ask response timeout to make sure the stream fails if the actor never responds.
There are some other issues with the code shared. One is creating a materializer on each request: materializers have a lifecycle and will fill up your heap over time, so you should instead have a materializer injected by Play.
With the given logic there is no need whatsoever to use the GraphDSL; that is only needed for complex streams with multiple inputs and outputs or cycles. You should be able to compose the operators using the Flow API alone (see for example https://doc.akka.io/docs/akka/current/stream/stream-flows-and-basics.html#defining-and-running-streams).
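For example, reusing the names from the question (inAsk, f3, sink, and an injected Materializer mat), the same pipeline can be expressed without the GraphDSL roughly as:
// source -> flow -> sink, materializing the sink's CompletionStage
CompletionStage<String> fin = inAsk.via(f3).runWith(sink, mat);
fin.toCompletableFuture().thenAccept(a -> log.info("Data is " + a));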

Is it necessary to ack a tuple in storm bolt

This seems confusing: in some examples I have seen, ack is called on the tuple in each bolt, while in other places it is not. What is the practice regarding this, and what are its implications?
After searching around on the internet and reading this answer, I found this link from the docs, which is really helpful in this regard.
How spouts handle messages:
When a spout takes a message from the source, say a Kafka or Kestrel queue, it opens the message. This means the message is not actually taken off the queue yet, but instead placed in a "pending" state, waiting for acknowledgement that the message is completed. While in the pending state, a message will not be sent to other consumers of the queue. Additionally, if a client disconnects, all pending messages for that client are put back on the queue.
When a message is opened, Kestrel provides the client with the data for the message as well as a unique id for the message. The KestrelSpout uses that exact id as the message id for the tuple when emitting the tuple to the SpoutOutputCollector. Sometime later on, when ack or fail are called on the KestrelSpout, the KestrelSpout sends an ack or fail message to Kestrel with the message id to take the message off the queue or have it put back on.
When is Ack needed:
A client needs to tell Storm whenever it creates a new link in the tree of tuples (also called anchoring), which is done by emitting a new tuple.
A client also needs to tell Storm when it has finished processing an individual tuple, which is done by calling ack. By doing both of these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately.
In the following example, the bolt splits a tuple containing a sentence into a tuple for each word. Each word tuple is anchored by specifying the input tuple as the first argument to emit. Since the word tuple is anchored, the spout tuple at the root of the tree will be replayed later on if the word tuple fails to be processed downstream.
public class SplitSentence extends BaseRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            _collector.emit(tuple, new Values(word));
        }
        _collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
In contrast, if the word tuple is emitted like this:
_collector.emit(new Values(word));
Emitting the word tuple this way causes it to be unanchored. If the tuple fails to be processed downstream, the root tuple will not be replayed. Depending on the fault-tolerance guarantees you need in your topology, sometimes it's appropriate to emit an unanchored tuple.
When is Ack not needed:
In many cases, bolts follow a common pattern of reading an input tuple, emitting tuples based on it, and then acking the tuple at the end of the execute method. These bolts fall into the categories of filters and simple functions. Storm has an interface called BasicBolt that encapsulates this pattern for you.
Following is the SplitSentence example rewritten as a BasicBolt:
public class SplitSentence extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
This implementation is simpler than the implementation from before and is semantically identical. Tuples emitted to BasicOutputCollector are automatically anchored to the input tuple, and the input tuple is acked for you automatically when the execute method completes.
Edit
As commented, and as can be seen here, IBasicBolt takes care of acking for you, so this applies to whatever class implements IBasicBolt:
/**
* Process the input tuple and optionally emit new tuples based on the input tuple.
*
* All acking is managed for you. Throw a FailedException if you want to fail the tuple.
*/
BaseBasicBolt implements IBasicBolt, whereas BaseRichBolt implements IRichBolt and therefore requires explicit acking.
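For example, in a BaseBasicBolt you never call ack yourself; to reject a tuple you throw a FailedException as the javadoc above says (isValid below is a hypothetical check used only for illustration):
public void execute(Tuple tuple, BasicOutputCollector collector) {
    if (!isValid(tuple)) {
        // the framework fails the input tuple for you
        throw new FailedException("invalid tuple");
    }
    // otherwise emit (auto-anchored) and let the framework ack the input tuple
    collector.emit(new Values(tuple.getString(0)));
}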

Is it possible to know the size of a stream without using a terminal operation

I have 3 interfaces
public interface IGhOrg {
    int getId();
    String getLogin();
    String getName();
    String getLocation();
    Stream<IGhRepo> getRepos();
}

public interface IGhRepo {
    int getId();
    int getSize();
    int getWatchersCount();
    String getLanguage();
    Stream<IGhUser> getContributors();
}

public interface IGhUser {
    int getId();
    String getLogin();
    String getName();
    String getCompany();
    Stream<IGhOrg> getOrgs();
}
and I need to implement Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations).
This method returns the IGhRepo with the most contributors (getContributors()).
I tried this
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
            .flatMap(IGhOrg::getRepos)
            .max((repo1, repo2) -> (int) repo1.getContributors().count() - (int) repo2.getContributors().count());
}
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
thanks
Is it possible to know the size of a stream without using a terminal operation
No, it's not, because streams can be infinite or generate output on demand; they are not necessarily backed by collections.
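For example, the following stream is infinite, so any attempt to count it would never return:
Stream<Integer> naturals = Stream.iterate(0, i -> i + 1); // 0, 1, 2, ... without end
// naturals.count();  // a terminal operation that would never terminate here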
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
That's because you are returning the same stream instance on each method invocation. You should return a new Stream instead.
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
IMHO you are misusing streams here. Performance- and simplicity-wise, it's much better to return some Collection<XXX> instead of Stream<XXX>.
No, it is not possible to know the size of a stream in Java.
As mentioned in the Java 8 stream docs:
No storage. A stream is not a data structure that stores elements;
instead, it conveys elements from a source such as a data structure,
an array, a generator function, or an I/O channel, through a pipeline
of computational operations.
You don't specify this, but it looks like some or possibly all of the interface methods that return Stream<...> values don't return a fresh stream each time they are called.
This seems problematic to me from an API point of view, as it means each of these streams, and a fair chunk of the object's functionality can be used at most once.
You may be able to solve the particular problem you are having by ensuring that the stream from each object is used only once in the method, something like this:
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
            .flatMap(IGhOrg::getRepos)
            .distinct()
            .map(repo -> new AbstractMap.SimpleEntry<>(repo, repo.getContributors().count()))
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
}
Unfortunately it looks like you will now be stuck if you want to (for example) print a list of the contributors, as the stream returned from getContributors() for the returned IGhRepo has already been consumed.
You might want to consider having your implementation objects return a fresh stream each time a stream returning method is called.
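For instance, if the implementation keeps the contributors in a List (the field name contributors is an assumption here), getContributors() can simply build a new stream on every call:
@Override
public Stream<IGhUser> getContributors() {
    // a fresh Stream per invocation, so each caller can consume it independently
    return contributors.stream();
}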
You could keep a counter that is incremented per "iteration" using peek. In the example below the counter is incremented before every item is processed with doSomeLogic
final var counter = new AtomicInteger();
getStream().peek(item -> counter.incrementAndGet()).forEach(this::doSomeLogic);

Storm Tick Tuple from a spout

I want to configure my spout to emit tick tuples on 2 different frequencies on different streams. My questions are as follows:
I understand how this is done using a bolt. But, on a spout, will the tick tuple invoke the nextTuple method on every tick?
How can I determine the frequency at which the tick was invoked? Meaning, the actual value of the time I configured in the config object?
Only bolts can receive tick tuples. Spouts can only emit tuples.
I'm assuming you're trying to do a "read" every so often from within your spout in order to emit a new tuple.
For example, to sleep 50 milliseconds between reads:
@Override
public void nextTuple() {
    try {
        String message = _mqClient.getMessage();
        if (message != null) {
            _collector.emit(new Values(message));
        }
        // sleep for 50 milliseconds
        Utils.sleep(50);
    } catch (Exception e) {
        _collector.reportError(e);
        LOG.error("MQ spout error {}", e);
    }
}
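For completeness, if a downstream bolt should do the periodic work instead, the bolt can request tick tuples through its component configuration and detect them in execute. A minimal sketch (the 10-second frequency is just an example, and a component can only have a single tick frequency, so two different frequencies would need two bolts):
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> conf = new HashMap<>();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10); // one tick tuple every 10 seconds
    return conf;
}

private static boolean isTickTuple(Tuple tuple) {
    return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}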
Maybe this can help you:
https://github.com/ptgoetz/storm-signals
Storm-Signals aims to provide a way to send messages ("signals") to
components (spouts/bolts) in a storm topology that are otherwise not
addressable.
Storm topologies can be considered static in that modifications to a
topology's behavior require redeployment. Storm-Signals provides a
simple way to modify a topology's behavior at runtime, without
redeployment.

how can I prove field-grouping functionality (tuples with the same field go to the same task)?

I'm new to Storm and am getting started with it using the storm-starter project. In this project there is a topology called WordCountTopology, and the key code for building the topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of the WordCount bolt, the key method execute is:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
My Question is:
The functionality of field-grouping is that tuples with the same field word will go to the same task for post-processing. Here "task" means thread; how can I prove this functionality? In addition, in my opinion, the logic in the execute method is a little awkward. In a single task the parameter tuple is always the same, but the execute method does not reflect this; in other words, the logic does not take advantage of this convenience.
Am I clear? My point is that the code here in execute does not take the feature of field-grouping into account; the same code could also be applied to the situation of shuffle-grouping.
I would like to cite a few points; they might help clear your doubts.
Here "task" means thread
In Storm's terminology, tasks are NOT threads; they are responsible for processing the actual logic. Each spout or bolt that you implement in your code executes as a number of tasks across the cluster. So you can think of a task as a running instance of a component, i.e., a spout or bolt.
There is another entity called an executor, which is the thread responsible for running these tasks. An executor can run one or multiple tasks of the same component; an executor having multiple tasks means the same component is executed multiple times by that executor.
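For example, in the WordCountTopology above, the parallelism hint sets the number of executors, while setNumTasks controls the number of task instances:
// 12 executor threads running 24 WordCount tasks, i.e. two tasks per executor
builder.setBolt("count", new WordCount(), 12)
       .setNumTasks(24)
       .fieldsGrouping("split", new Fields("word"));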
Now coming back to your question
the code here in execute does not take the feature of field-grouping into account; the same code could also be applied to the situation of shuffle-grouping
Very briefly, a fields grouping lets you partition a stream by a subset of its fields. For a word count, if we partition the stream using fieldsGrouping on a field named first_name, then all tuples whose first_name value is, say, Foo are expected to go to the same task, while tuples with a different value (Bar) go to another task.
So here the execute method is expected to receive tuples with the same field value in the same task and can therefore easily update its counter without considering anything special. The whole logic is written with the assumption that the bolt will be fed the proper data, and that is why using the proper grouping is so important. If you use shuffleGrouping instead, the same code will run but will produce incorrect data.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
@Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
    this.boltId = tc.getThisTaskId();
}
Now in the execute() of the same fieldsGrouped bolt that receives the tuples, you just print the id and the tuple:
@Override
public void execute(Tuple tuple) {
    String myWord = (String) tuple.getValue(0);
    System.out.println("word: " + myWord + " boltID:" + boltId);
}
