Storm Tick Tuple from a spout - apache-storm

I want to configure my spout to emit tick tuples on 2 different frequencies on different streams. My questions are as follows:
I understand how this is done in a bolt. But on a spout, will the tick tuple invoke the nextTuple method on every tick?
How can I determine the frequency at which the tick was fired, i.e. the actual value of the time I configured in the config object?

Only bolts can receive tick tuples. Spouts can only emit tuples.
I'm assuming you're trying to do a "read" every so often from within your spout in order to emit a new tuple.
For example, to sleep 50 milliseconds between reads:
@Override
public void nextTuple() {
    try {
        String message = _mqClient.getMessage();
        if (message != null) {
            _collector.emit(new Values(message));
        }
        // sleep for 50 milliseconds
        Utils.sleep(50);
    } catch (Exception e) {
        _collector.reportError(e);
        LOG.error("MQ spout error", e);
    }
}
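To address the tick-tuple part of the question directly: ticks are configured and consumed on the bolt side. Below is a minimal sketch of the usual bolt-side pattern (assuming the org.apache.storm packages; older releases use backtype.storm). The simplest way to know the frequency is to keep the configured value in the bolt itself, since the bolt is the one that sets it via getComponentConfiguration.

import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class TickAwareBolt extends BaseRichBolt {
    private static final int TICK_FREQ_SECS = 10; // example value, choose your own
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every TICK_FREQ_SECS seconds
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, TICK_FREQ_SECS);
        return conf;
    }

    @Override
    public void execute(Tuple tuple) {
        if (Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId())) {
            // Periodic work goes here, at the frequency configured above
        } else {
            // Normal tuple processing
            collector.ack(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }
}

Since the tick frequency is configured per component, getting two different frequencies generally means using two bolts, each with its own value and its own check like the one above.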

Maybe this can help you:
https://github.com/ptgoetz/storm-signals
Storm-Signals aims to provide a way to send messages ("signals") to
components (spouts/bolts) in a storm topology that are otherwise not
addressable.
Storm topologies can be considered static in that modifications to a
topology's behavior require redeployment. Storm-Signals provides a
simple way to modify a topology's behavior at runtime, without
redeployment.

Related

Batch processing flowfiles in apache nifi

I have written a custom NiFi processor which tries to batch process input flow files.
However, it seems it is not behaving as expected. Here is what is happening:
I copy-paste some files onto the server. FetchFromServerProcessor fetches those files from the server and puts them in queue1. MyCustomProcessor reads files in batches from queue1. I have a batchSize property defined on MyCustomProcessor, and inside its onTrigger() method I am getting all flow files from queue1 in the current batch by doing the following:
session.get(context.getProperty(batchSize).asInteger())
The first line of onTrigger() creates a timestamp and adds this timestamp to all flow files. So all files in the batch should have the same timestamp. However, that is not happening. Usually the first flow file gets one timestamp and the rest of the flow files get another timestamp.
It seems that when FetchFromServerProcessor fetches the first file from the server and puts it in queue1, MyCustomProcessor gets triggered and fetches all files from the queue. At that moment there happens to be only a single file, which is picked up as the only file in this batch. By the time MyCustomProcessor has processed this file, FetchFromServerProcessor has fetched all the files from the server and put them in queue1. So after processing the first file, MyCustomProcessor takes all the files in queue1 and forms a second batch, whereas I want all files picked up in a single batch.
How can I avoid two batches getting formed? I see that people discuss wait-notify in this context: 1, 2. But I am not able to make quick sense out of these posts. Can someone give me minimal steps to achieve this using Wait/Notify processors, or point me to a minimal tutorial with a step-by-step procedure for using them? Also, is the wait-notify pattern the standard approach to solve the batching problem I explained, or is there another standard approach to get this done?
It sounds as if this batch size is the required count of incoming flowfiles to CustomProcessor, so why not write your CustomProcessor#onTrigger() as follows:
@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    final ComponentLog logger = getLogger();
    // Try to get n flowfiles from the incoming queue
    final Integer desiredFlowfileCount = context.getProperty(batchSize).asInteger();
    final int queuedFlowfileCount = session.getQueueSize().getObjectCount();
    if (queuedFlowfileCount < desiredFlowfileCount) {
        // There are not yet n flowfiles queued up, so don't try to run again immediately
        if (logger.isDebugEnabled()) {
            logger.debug("Only {} flowfiles queued; waiting for {}", new Object[]{queuedFlowfileCount, desiredFlowfileCount});
        }
        context.yield();
        return;
    }

    // If we're here, we do have at least n queued flowfiles
    List<FlowFile> flowfiles = session.get(desiredFlowfileCount);

    try {
        // TODO: Perform work on all flowfiles
        flowfiles = flowfiles.stream().map(f -> session.putAttribute(f, "timestamp", "my static timestamp value")).collect(Collectors.toList());
        session.transfer(flowfiles, REL_SUCCESS);

        // If extending AbstractProcessor, this is handled for you and you don't have to explicitly commit
        session.commit();
    } catch (Exception e) {
        logger.error("Helpful error message");
        if (logger.isDebugEnabled()) {
            logger.error("Further stacktrace: ", e);
        }
        // Penalize the flowfiles if appropriate (also done for you if extending AbstractProcessor and an exception is thrown from this method)
        session.rollback(true);
        // --- OR ---
        // Transfer to failure if they can't be retried
        session.transfer(flowfiles, REL_FAILURE);
    }
}
The Java 8 stream syntax can be replaced by this if it's unfamiliar:
for (int i = 0; i < flowfiles.size(); i++) {
    // Write the same timestamp value onto all flowfiles
    FlowFile f = flowfiles.get(i);
    flowfiles.set(i, session.putAttribute(f, "timestamp", "my timestamp value"));
}
The semantics of penalization (telling the processor to delay performing work on a specific flowfile) versus yielding (telling the processor to wait some period of time before attempting any work again) are important.
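As a rough illustration of that difference, reusing the session, context, flowfiles and REL_FAILURE names from the example above (treat this as a sketch, not drop-in code):

// Penalize: delay this specific flowfile before it is worked on again,
// e.g. when routing it somewhere it will later be retried
FlowFile penalized = session.penalize(flowfiles.get(0));
session.transfer(penalized, REL_FAILURE);

// Yield: tell the framework to wait a while before scheduling onTrigger() again,
// e.g. when the processor as a whole has nothing useful to do right now
context.yield();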
You probably also want the @TriggerSerially annotation on your custom processor to ensure that multiple threads are not running concurrently, which could otherwise lead to a race condition.
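For example (the class name MyCustomProcessor is taken from the question; the rest of the processor body is elided):

import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.processor.AbstractProcessor;

@TriggerSerially // only one thread will execute onTrigger() at a time
public class MyCustomProcessor extends AbstractProcessor {
    // properties, relationships and the onTrigger() shown above
}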

Is it necessary to ack a tuple in storm bolt

This seems confusing: in some examples I have seen, ack is called on the tuple in each bolt, while in other places that was not the case. What is the practice regarding this, and what are its implications?
After searching around on the internet and reading this answer, I found this link from the docs, which is really helpful in this regard.
How a spout handles messages:
When a spout takes a message from the source, say Kafka or a Kestrel queue, it "opens" the message. This means the message is not actually taken off the queue yet, but instead placed in a "pending" state, waiting for acknowledgement that the message is completed. While in the pending state, a message will not be sent to other consumers of the queue. Additionally, if a client disconnects, all pending messages for that client are put back on the queue.
When a message is opened, Kestrel provides the client with the data for the message as well as a unique id for the message. The KestrelSpout uses that exact id as the message id for the tuple when emitting the tuple to the SpoutOutputCollector. Sometime later on, when ack or fail are called on the KestrelSpout, the KestrelSpout sends an ack or fail message to Kestrel with the message id to take the message off the queue or have it put back on.
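As a sketch of what that looks like in spout code (QueueClient and QueueMessage are hypothetical stand-ins for a Kestrel/Kafka client; the important part is passing the message id as the second argument to emit and handling ack/fail):

import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class QueueSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private QueueClient queueClient; // hypothetical queue client

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queueClient = new QueueClient(); // hypothetical
    }

    @Override
    public void nextTuple() {
        QueueMessage m = queueClient.open(); // hypothetical: the message goes into the "pending" state
        if (m != null) {
            // The queue's message id becomes the tuple's message id, so Storm can
            // route the later ack/fail back to the right pending message.
            collector.emit(new Values(m.getData()), m.getId());
        }
    }

    @Override
    public void ack(Object msgId) {
        queueClient.delete(msgId); // fully processed: take it off the queue
    }

    @Override
    public void fail(Object msgId) {
        queueClient.requeue(msgId); // failed downstream: put it back on the queue
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("data"));
    }
}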
When is Ack needed:
A client needs to tell Storm whenever it creates a new link in the tree of tuples (also called anchoring), which is done by emitting a new tuple.
A client also needs to tell Storm when it has finished processing an individual tuple, which is done by calling ack. By doing both of these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately.
In the following example, a bolt splits a tuple containing a sentence into a tuple for each word. Each word tuple is anchored by specifying the input tuple as the first argument to emit. Since the word tuple is anchored, the spout tuple at the root of the tree will be replayed later on if the word tuple fails to be processed downstream.
public class SplitSentence extends BaseRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            _collector.emit(tuple, new Values(word));
        }
        _collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
In contrast, if the word tuple is emitted like this:
_collector.emit(new Values(word));
Emitting the word tuple this way causes it to be unanchored. If the tuple fails to be processed downstream, the root tuple will not be replayed. Depending on the fault-tolerance guarantees you need in your topology, sometimes it's appropriate to emit an unanchored tuple.
When is Ack not needed:
In many cases, bolts follow a common pattern of reading an input tuple, emitting tuples based on it, and then acking the tuple at the end of the execute method. These bolts fall into the categories of filters and simple functions. Storm has an interface called BasicBolt that encapsulates this pattern for you.
Following is the SplitSentence example, rewritten as a BasicBolt:
public class SplitSentence extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
This implementation is simpler than the implementation from before and is semantically identical. Tuples emitted to BasicOutputCollector are automatically anchored to the input tuple, and the input tuple is acked for you automatically when the execute method completes.
Edit
As commented, and as can be seen here, IBasicBolt takes care of acking for you, so the same applies to whatever class implements IBasicBolt:
/**
* Process the input tuple and optionally emit new tuples based on the input tuple.
*
* All acking is managed for you. Throw a FailedException if you want to fail the tuple.
*/
BaseBasicBolt implements IBasicBolt, whereas BaseRichBolt implements IRichBolt, which is why a BaseRichBolt has to ack explicitly.
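So, with a BaseBasicBolt, the only decision left is whether to fail a tuple, which is done by throwing FailedException as the Javadoc above says. A minimal sketch (the validation rule here is made up for illustration):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.FailedException;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ValidatingBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        if (sentence == null || sentence.isEmpty()) {
            // The input tuple is failed (and may be replayed) instead of being acked
            throw new FailedException("Empty sentence");
        }
        collector.emit(new Values(sentence.toLowerCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}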

ZeroMQ (clrzmq4) polling issue

What I'm trying to accomplish is to read a message from one of two sockets, whichever one a message arrives on first. As far as I understand, polling (zmq_poll) is the right thing to do (as demonstrated by mspoller in the guide). Here I'll provide a small pseudo-code snippet:
TimeSpan timeout = TimeSpan.FromMilliseconds(50);
using (var receiver1 = new ZSocket(ZContext.Current, ZSocketType.DEALER))
using (var receiver2 = new ZSocket(ZContext.Current, ZSocketType.PAIR))
{
    receiver1.Bind("tcp://someaddress");
    // Note that PAIR socket is inproc:
    receiver2.Connect("inproc://otheraddress");
    var poll = ZPollItem.CreateReceiver();
    ZError error;
    ZMessage msg;
    while (true)
    {
        if (receiver1.PollIn(poll, out msg, out error, timeout))
        {
            // ...
        }
        if (receiver2.PollIn(poll, out msg, out error, timeout))
        {
            // ...
        }
    }
}
As you can see, it is actually the exact same implementation as mspoller in the guide.
In my case receiver2 (the PAIR socket) should receive a large number of messages. In fact I've created a test in which the number of messages sent to it is always greater than the number it is capable of receiving (at least with the demonstrated implementation).
I've run the test for 2 seconds, and I was very surprised with results:
Number of messages sent to receiver2: 180 (by "sent" I mean that they are handed out to another PAIR socket not shown in the previous snippet);
Number of messages received by receiver2: 21 ??? Only 21 messages in 2 seconds??? 10 messages per second???
Then I've tried to play with different timeout values and I've found out that it significantly influences the number of messages received. Duration (2 seconds) and number of messages sent (180) remain the same. The results are:
timeout value of 200 milliseconds - number of messages received drops to 10 (5 per second);
timeout value of 10 milliseconds - number of messages received rises to 120 (60 per second).
The results are telling me that polling simply does not work. If polling were working properly, as far as I understand the mechanism, timeout should not have any influence in this scenario. No matter if we set timeout to 1 hour or 5 milliseconds - since there are always messages to receive there's nothing to wait for, so the loop should work with the same speed.
Another big concern of mine is the fact that even with a very small timeout value receiver2 is not capable of receiving all 180 messages. I'm struggling here to achieve a receiving rate of 100 messages per second, even though I've selected ZeroMQ, which should be very fast (benchmarks mention numbers like 6 million messages per second).
So my question is obvious: am I doing something wrong here? Is there a better way to implement polling?
By browsing the clrzmq4 code I've noticed that there's also the possibility to call the PollIn method on an enumeration of sockets (ZPollItems.cs, line 151), but I haven't found any example anywhere!
Can this be the right approach? Is there any documentation anywhere?
Thanks
I've found the problem / solution for this. Instead of calling the PollIn method on each socket separately, we should call the PollIn method on an array of sockets. Obviously the example from the guide is HUGELY MISLEADING. Here's the correct approach:
TimeSpan timeout = TimeSpan.FromMilliseconds(50);
using (var receiver1 = new ZSocket(ZContext.Current, ZSocketType.DEALER))
using (var receiver2 = new ZSocket(ZContext.Current, ZSocketType.PAIR))
{
    receiver1.Bind("tcp://someaddress");
    receiver2.Connect("inproc://otheraddress");
    // We should "remember" the order of sockets within the array
    // because the order of messages in the received array will correspond to it.
    ZSocket[] sockets = { receiver1, receiver2 };
    // Note that we should use two ZPollItem instances:
    ZPollItem[] pollItems = { ZPollItem.CreateReceiver(), ZPollItem.CreateReceiver() };
    ZError error;
    ZMessage[] msg;
    while (true)
    {
        if (sockets.PollIn(pollItems, out msg, out error, timeout))
        {
            if (msg[0] != null)
            {
                // A message was received from receiver1
            }
            if (msg[1] != null)
            {
                // A message was received from receiver2
            }
        }
    }
}
Now receiver2 reaches 15,000 received messages per second, no matter the timeout value and no matter the number of messages received by receiver1.
UPDATE: Folks from clrzmq4 have acknowledged this issue, so probably the example will be corrected soon.

When I use Storm, how can I ensure that a bolt with multiple inputs processes only when all the inputs arrive?

The topology looks like this.
How can I ensure that a bolt with multiple inputs processes only when all the inputs arrive?
Bolt.execute() is called for each incoming tuple, regardless of what the producer was (and you cannot change this). If you want to process multiple tuples from different producers at once, you need to write custom UDF code.
You need an input buffer for each producer that can hold incoming tuples (maybe a LinkedList<Tuple> as a bolt member).
For each incoming tuple, you add the tuple to the corresponding buffer (you can access the producer information in the tuple's metadata via input.getSourceComponent()).
After adding the tuple to the buffer, you check whether each buffer contains at least one tuple: if yes, take one tuple from each buffer and process them together (after processing, check the buffers again, until at least one buffer is empty); if no, just return and do not process anything. A minimal sketch of this pattern follows.
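Here is that sketch, assuming exactly two upstream components whose IDs are "spoutA" and "spoutB" (hypothetical names) and a made-up combination step:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MergeInputsBolt extends BaseRichBolt {
    private OutputCollector collector;
    // One buffer per upstream producer; "spoutA" and "spoutB" are hypothetical component IDs
    private final Map<String, LinkedList<Tuple>> buffers = new HashMap<>();

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        buffers.put("spoutA", new LinkedList<Tuple>());
        buffers.put("spoutB", new LinkedList<Tuple>());
    }

    @Override
    public void execute(Tuple input) {
        // 1. Buffer the tuple under its producer
        buffers.get(input.getSourceComponent()).add(input);

        // 2. Process only while every buffer holds at least one tuple
        while (allBuffersNonEmpty()) {
            List<Tuple> batch = new ArrayList<>();
            for (LinkedList<Tuple> buffer : buffers.values()) {
                batch.add(buffer.removeFirst());
            }
            // Hypothetical combination step: concatenate the first field of each tuple
            StringBuilder combined = new StringBuilder();
            for (Tuple t : batch) {
                combined.append(t.getString(0)).append(' ');
            }
            // Anchor the output on all contributing tuples, then ack them
            collector.emit(batch, new Values(combined.toString().trim()));
            for (Tuple t : batch) {
                collector.ack(t);
            }
        }
    }

    private boolean allBuffersNonEmpty() {
        for (LinkedList<Tuple> buffer : buffers.values()) {
            if (buffer.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("combined"));
    }
}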
You might want to take a look here (refer to Batching). For bolts that perform more complex operations, such as aggregation over multiple input tuples, you will need to extend BaseRichBolt and control the anchoring mechanism yourself.
For this you need to declare your own output collector like this:
private OutputCollector outputCollector;
And then initialise it through your override of the prepare method:
@Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
    this.outputCollector = outputCollector;
}
The execute method of BaseRichBolt only receives a tuple as its argument, so you need to maintain the anchors yourself and use them when emitting.
private final List<Tuple> anchors = new ArrayList<Tuple>();

@Override
public void execute(Tuple tuple) {
    if (!isTupleAggregationComplete(anchors, tuple)) {
        anchors.add(tuple);
        return;
    }
    anchors.add(tuple);

    // do your computations here! (foo, bar, xpto are placeholders for your computed values)
    outputCollector.emit(anchors, new Values(foo, bar, xpto));

    // ack every input tuple that contributed to the emitted tuple
    for (Tuple anchor : anchors) {
        outputCollector.ack(anchor);
    }
    anchors.clear();
}
You should implement isTupleAggregationComplete with the logic that checks whether the bolt has all the information necessary to proceed with the processing.
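One possible (hypothetical) way to implement it is to declare the aggregation complete once a tuple has been seen from every expected upstream component; EXPECTED_SOURCES below is a made-up constant listing those component IDs (this fragment also needs java.util.Arrays, HashSet, List and Set):

private static final Set<String> EXPECTED_SOURCES =
        new HashSet<String>(Arrays.asList("spoutA", "spoutB")); // hypothetical producer IDs

private boolean isTupleAggregationComplete(List<Tuple> anchors, Tuple current) {
    Set<String> seenSources = new HashSet<String>();
    for (Tuple t : anchors) {
        seenSources.add(t.getSourceComponent());
    }
    seenSources.add(current.getSourceComponent());
    // Complete once every expected upstream component has contributed a tuple
    return seenSources.containsAll(EXPECTED_SOURCES);
}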

How can I prove field-grouping functionality (tuples with the same field go to the same task)?

I'm new to Storm and getting started with it using the storm-starter project. In this project there is a topology called WordCountTopology, and the key code for building the topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of WordCount bolt, the key method execute is:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word); // counts is the bolt's Map<String, Integer> field
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
My Question is:
The functionality of field grouping is that tuples with the same field word will go to the same task for post-processing. Here "task" means thread. How can I prove this functionality? In addition, in my opinion, the logic in the execute method is a little awkward. In a single task, the parameter tuple is always the same, but the execute method does not reflect this; in other words, the logic does not use this convenience.
Am I clear? My point is that the code here in execute is not taking the feature of field grouping into account; the code here could also be applied to the situation of shuffle grouping.
I would like to cite a few points; they might help clear your doubts.
Here "task" means thread
In Storm's terminology, tasks are NOT threads; they are responsible for processing the actual logic. Each spout or bolt that you implement in your code executes as multiple tasks across the cluster. So you can think of a task as a running instance of a component, i.e. a spout or bolt.
There is another entity called an executor, which is the thread responsible for running these tasks. An executor can run one or multiple tasks of the same component; an executor having multiple tasks means that the same component is executed multiple times by that executor.
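To make this concrete, here is a minimal sketch reusing the builder and WordCount bolt from the question above: the parallelism hint sets the number of executors (threads), while setNumTasks sets the number of tasks (component instances) those executors run.

// 6 executors (threads) running 12 tasks (instances) of the WordCount bolt,
// i.e. each executor runs 2 tasks of the same component
builder.setBolt("count", new WordCount(), 6)
        .setNumTasks(12)
        .fieldsGrouping("split", new Fields("word"));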
Now coming back to your question
the code here in execute is not taking the feature of filed-grouping into account, the code here can also be applied to the situation of shuffle-grouping
Very briefly, a fields grouping lets you partition a stream by a subset of its fields. For the word count, if we partition the stream using fieldsGrouping on a field named first_name, then all tuples whose first_name value is, say, Foo are expected to go to the same task, while tuples with a different value (say Bar) may be routed to a different task.
So here, for a given word, the execute method of the same task always receives that word and can simply keep updating its counter; it does not need any special handling for that. The whole logic is written with the assumption that the bolt will be fed the proper data, which is why using the proper grouping becomes so important. If you use shuffleGrouping, the same code will run, but it will produce incorrect counts.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
private int boltId;

@Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
    this.boltId = tc.getThisTaskId();
}
Now in the execute() of the same fieldsGrouped bolt that receives the tuples, you just print the id and the tuple:
@Override
public void execute(Tuple tuple) {
    String myWord = (String) tuple.getValue(0);
    System.out.println("word: " + myWord + " boltID: " + boltId);
}
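Then wire the instrumented bolt exactly as in the WordCountTopology from the question (InstrumentedWordCount is whatever name you give the bolt above) and run it with parallelism greater than 1. Every occurrence of a given word should always print with the same boltID, whereas shuffleGrouping would spread the same word across different IDs.

// Same wiring as the original topology, but with the bolt that prints its task ID
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new InstrumentedWordCount(), 12)
        .fieldsGrouping("split", new Fields("word"));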
