Is it necessary to ack a tuple in storm bolt - apache-storm

This seems confusing: in some examples I have seen, ack is called on the tuple in each bolt, while in other places it is not. What is the practice regarding this, and what are its implications?

After searching around on the internet and reading this answer, I found this link from the docs which is really helpful in this regard.
How a spout handles messages:
When a spout takes a message from the source, say a Kafka or Kestrel queue, it opens the message. This means the message is not actually taken off the queue yet, but instead placed in a "pending" state, waiting for acknowledgement that the message is completed. While in the pending state, a message will not be sent to other consumers of the queue. Additionally, if a client disconnects, all pending messages for that client are put back on the queue.
When a message is opened, Kestrel provides the client with the data for the message as well as a unique id for the message. The KestrelSpout uses that exact id as the message id for the tuple when emitting the tuple to the SpoutOutputCollector. Sometime later on, when ack or fail are called on the KestrelSpout, the KestrelSpout sends an ack or fail message to Kestrel with the message id to take the message off the queue or have it put back on.
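As a rough sketch of this pattern (the QueueClient, QueueMessage and their getMessage/deleteMessage/requeueMessage methods are hypothetical stand-ins for the Kestrel/Kafka specifics), a spout emits each tuple with the queue's message id and reacts to ack and fail:
public class QueueSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private QueueClient _client; // hypothetical queue client

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _client = new QueueClient();
    }

    public void nextTuple() {
        // "open" a message: it stays pending on the queue until acked or failed
        QueueMessage msg = _client.getMessage();
        if (msg != null) {
            // the queue's message id becomes the tuple's message id
            _collector.emit(new Values(msg.getBody()), msg.getId());
        }
    }

    @Override
    public void ack(Object msgId) {
        _client.deleteMessage(msgId); // fully processed: take it off the queue
    }

    @Override
    public void fail(Object msgId) {
        _client.requeueMessage(msgId); // failed: put it back on the queue for redelivery
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}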
When is Ack needed:
A client needs to tell Storm whenever it creates a new link in the tree of tuples (also called anchoring), which is done when emitting a new tuple.
A client also needs to tell Storm when it has finished processing an individual tuple, which is done by calling ack. By doing both of these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately.
In the following example, a bolt splits a tuple containing a sentence into a tuple for each word. Each word tuple is anchored by specifying the input tuple as the first argument to emit. Since the word tuple is anchored, the spout tuple at the root of the tree will be replayed later on if the word tuple fails to be processed downstream.
public class SplitSentence extends BaseRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            _collector.emit(tuple, new Values(word));
        }
        _collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
In contrast, if the word tuple is emitted like this:
_collector.emit(new Values(word));
Emitting the word tuple this way causes it to be unanchored. If the tuple fails to be processed downstream, the root tuple will not be replayed. Depending on the fault-tolerance guarantees you need in your topology, sometimes it's appropriate to emit an unanchored tuple.
When is Ack not needed:
In many cases, bolts follow a common pattern of reading an input tuple, emitting tuples based on it, and then acking the tuple at the end of the execute method. These bolts fall into the categories of filters and simple functions. Storm has an interface called IBasicBolt that encapsulates this pattern for you.
Here is the SplitSentence example rewritten as a BasicBolt:
public class SplitSentence extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
This implementation is simpler than the implementation from before and is semantically identical. Tuples emitted to BasicOutputCollector are automatically anchored to the input tuple, and the input tuple is acked for you automatically when the execute method completes.
Edit
As commented, and as can be seen here, IBasicBolt takes care of acking for you, so acking is handled for whatever class implements IBasicBolt:
/**
* Process the input tuple and optionally emit new tuples based on the input tuple.
*
* All acking is managed for you. Throw a FailedException if you want to fail the tuple.
*/
BaseBasicBolt implements IBasicBolt, so bolts that extend it are acked automatically; BaseRichBolt implements IRichBolt, so bolts that extend it must call ack explicitly.
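For completeness, a hedged sketch of how a BaseBasicBolt can still fail a tuple, as the Javadoc above suggests (the validation check is just an example):
public class ValidatingBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String value = tuple.getString(0);
        if (value == null || value.isEmpty()) {
            // throwing FailedException makes Storm fail the tuple;
            // otherwise it is acked automatically when execute returns
            throw new FailedException("empty value");
        }
        collector.emit(new Values(value.toUpperCase()));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}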

Related

Linearization in Reactor Netty (Spring Boot Webflux)

How can I guarantee linearizability of requests in Reactor Netty?
Theory:
Given:
Request A wants to write x=2, y=0
Request B wants to read x, y and write x=x+2, y=y+1
Request C wants to read x and write y=x
All Requests are processed asynchronously and return to the client immediately with status ACCEPTED.
Example:
Send requests A, B, C in order.
Example Log Output: (request, thread name, x, y)
Request A, nioEventLoopGroup-2-0, x=2, y=0
Request C, nioEventLoopGroup-2-2, x=2, y=2
Request B, nioEventLoopGroup-2-1, x=4, y=3
Business logic requires all reads after A to see x=2 and y=0.
And request B to see x=2, y=0 and set y=1.
And request C to see x=4 and set y=4.
In short: the business logic makes every write operation dependent on the previous write operation having completed. Otherwise the operations are not reversible.
Example Code
Document:
@Document
@Data
@NoArgsConstructor
@AllArgsConstructor
public class Event {
    @Id
    private String id;
    private int data;

    public Event withNewId() {
        setId(UUID.randomUUID().toString());
        return this;
    }
}
Repo:
public interface EventRepository extends ReactiveMongoRepository<Event, String> {}
Controller:
@RestController
@RequestMapping(value = "/api/event")
@RequiredArgsConstructor
public class EventHandler {
    private final EventRepository repo;

    @PostMapping
    public Mono<String> create(Event event) {
        return Mono.just(event.withNewId().getId())
            .doOnNext(id ->
                // do query based on some logic depending on event data
                Mono.just(someQuery)
                    .flatMap(query ->
                        repo.find(query)
                            .map(e -> event.setData(event.getData() + e.getData())))
                    .switchIfEmpty(Mono.just(event))
                    .flatMap(e -> repo.save(e))
                    .subscribeOn(Schedulers.single())
                    .subscribe());
    }
}
It does not work, but with subscribeOn I try to guarantee linearizability. Meaning that concurrent requests A and B will always write their payload to the DB in the order in which they are received by the server. Therefore, if another concurrent request C is a compound of first read then write, it will read changes from the DB that reflect those of request B, not A, and write its own changes based on B's.
Is there a way in Reactor Netty to schedule executors with an unbound FIFO queue, so that I can process the requests asynchronously but in order?
I don't think this is specific to Netty or Reactor in particular, but rather a broader topic: how to handle out-of-order message delivery and more-than-once message delivery. A few questions:
Does the client always send the same number of requests in the same order? There's always a chance that, due to networking issues, the requests may arrive out of order, or one or more may be lost.
Does the client make retries? What happens if the same request reaches the server twice?
If the order matters, why doesn't the client wait for the result of the (n-1)th request before issuing the nth request? In other words, why are there many concurrent requests?
I'd try to redesign the operation in such a way that there's a single request executing the operations on the backend in the required order, using concurrency there if necessary to speed up the process.
If that's not possible, for example because you don't control the client, or more generally the order in which the events (requests) arrive, you have to implement ordering in application-level logic, using per-message semantics to do the ordering. You can, for example, store or buffer the messages, wait for all of them to arrive, and only then trigger the business logic using the data from the messages in the correct order. This requires some kind of key (identity) that can attribute messages to the same entity, and a sort key that tells you how to order the messages correctly.
EDIT:
After getting the answers, you can definitely implement it "the Reactor way".
Sinks.Many<Event> sink = Sinks.many() // you create a 'sink' where the events will go
    .multicast() // broadcasts all messages to all subscribers of the stream
    .directBestEffort(); // additional semantics - publishing will fail if there are no subscribers - doesn't really matter here

Flux<Event> eventFlux = sink.asFlux(); // the 'view' of the sink as a flux you can subscribe to

public void run() {
    subscribeAndProcess();
    sink.tryEmitNext(new Event("A", "A", "A"));
    sink.tryEmitNext(new Event("A", "C", "C"));
    sink.tryEmitNext(new Event("A", "B", "B"));
    sink.tryEmitNext(new Event("B", "A", "A"));
    sink.tryEmitNext(new Event("B", "C", "C"));
    sink.tryEmitNext(new Event("B", "B", "B"));
}

void subscribeAndProcess() {
    eventFlux.groupBy(Event::key)
        .flatMap(
            groupedEvents -> groupedEvents.distinct(Event::type) // distinct to avoid duplicates
                .buffer(3) // there are three event types, so we buffer and wait for all to arrive
                .flatMap(events -> // once all the events are there we can do the processing the way we need
                    Mono.just(events.stream()
                        .sorted(Comparator.comparing(Event::type))
                        .map(e -> e.key + e.value)
                        .reduce(String::concat)
                        .orElse(""))
                )
        )
        .subscribe(System.out::println);
}
// prints values concatenated in order per key:
// - AAABAC
// - BABBBC
See Gist: https://gist.github.com/tarczynskitomek/d9442ea679e3eed64e5a8470217ad96a
There are a few caveats:
If all of the expected events for a given key don't arrive, you waste memory buffering, unless you set a timeout (see the sketch after this list).
How will you ensure that all the events for a given key go to the same application instance?
How will you recover from failures encountered mid-processing?
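For the first caveat, one hedged option is to swap buffer(3) in the example above for bufferTimeout, which emits whatever has arrived once either the size or the time limit is reached (the 30-second limit is arbitrary, and the downstream logic then has to tolerate incomplete groups):
// emit when 3 events have arrived for the key, or after 30 seconds, whichever comes first
groupedEvents.distinct(Event::type)
    .bufferTimeout(3, Duration.ofSeconds(30))
    .flatMap(events -> processPossiblyIncomplete(events)); // hypothetical handler for possibly incomplete groups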
Having all this in mind, I would go with persistent storage - say, saving the incoming events in a database and doing the processing in the background - and for this you don't need Reactor. Most of the time a simple Servlet-based Spring app will be far easier to maintain and develop, especially if you have no previous experience with functional reactive programming.
Looking at the provided code I would not try to handle it on Reactor Netty level.
First, several comments regarding the controller implementation, because it has multiple issues that violate reactive principles. I would recommend spending some time learning the reactive API, but here are some hints:
In reactive code nothing happens until you subscribe. At the same time, calling subscribe explicitly is an anti-pattern and should be avoided unless you are creating a framework similar to WebFlux.
The parallel scheduler should be used to run non-blocking logic; switch schedulers only if you have some blocking code.
The doOn... operators are so-called side-effect operators and should not be used for constructing reactive flows.
@PostMapping
public Mono<String> create(Event event) {
    // do query based on some logic depending on event data
    return repo.find(query)
        .map(e -> event.setData(event.getData() + e.getData()))
        .switchIfEmpty(Mono.just(event))
        .flatMap(e -> repo.save(e));
}
Now, processing requests in the predefined sequence could be tricky because of network failures, possible retries, etc. What if you never get Request B or Request C? Should you still persist Request A?
As @ttarczynski mentioned in his comment, the best option is to redesign the API and send a single request.
In case that's not an option, you would need to introduce some state to "postpone" request processing and then, depending on the consistency semantics, process the requests as a "batch" when the last one is received, or just defer Request C until you get Request A & B.

When I use storm, how can I ensure that a bolt with multiple inputs, process only when all the inputs arrive?

The topology looks like this.
how can I ensure that a bolt with multiple inputs processes only when all the inputs have arrived?
Bolt.execute() is called for each incoming tuple, regardless of what the producer was (and you cannot change this). If you want to process multiple tuples from different producers at once, you need to write custom UDF code.
You need an input buffer for each producer that can buffer incoming tuples (maybe a LinkedList<Tuple> as a bolt member).
For each incoming tuple, add the tuple to the corresponding buffer (you can access the producer information in the tuple's metadata via input.getSourceComponent()).
After adding the tuple to the buffer, check whether each buffer contains at least one tuple: if yes, take one tuple from each buffer and process them (after processing, check the buffers again until at least one buffer is empty); if no, just return and do not process anything. A minimal sketch of this pattern follows below.
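A minimal sketch of this buffering bolt, assuming exactly two upstream producers whose component ids are "spout-a" and "spout-b" (the names and the emitted values are placeholders):
public class JoinBolt extends BaseRichBolt {
    private OutputCollector collector;
    // one buffer per producer component
    private final Map<String, LinkedList<Tuple>> buffers = new HashMap<>();

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        buffers.put("spout-a", new LinkedList<Tuple>());
        buffers.put("spout-b", new LinkedList<Tuple>());
    }

    public void execute(Tuple input) {
        // add the tuple to the buffer of the component that emitted it
        buffers.get(input.getSourceComponent()).add(input);
        // only process if every buffer holds at least one tuple
        for (LinkedList<Tuple> buffer : buffers.values()) {
            if (buffer.isEmpty()) {
                return;
            }
        }
        Tuple a = buffers.get("spout-a").poll();
        Tuple b = buffers.get("spout-b").poll();
        // anchor the output to both inputs, then ack them
        collector.emit(Arrays.asList(a, b), new Values(a.getValue(0), b.getValue(0)));
        collector.ack(a);
        collector.ack(b);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("valueA", "valueB"));
    }
}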
You might want to take a look here (refer to Batching). For bolts that perform more complex operations such as aggregation over multiple input tuples, you will need to extend BaseRichBolt and control the anchoring mechanism yourself.
For this you need to declare your own output collector like this:
private OutputCollector outputCollector;
And then initialise it through your override of the prepare method:
@Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
    this.outputCollector = outputCollector;
}
The execute method of BaseRichBolt only receives a tuple as an argument, so you need to implement the logic that maintains the anchors yourself and use them when emitting.
private final List<Tuple> anchors = new ArrayList<Tuple>();

@Override
public void execute(Tuple tuple) {
    if (!isTupleAggregationComplete(anchors, tuple)) {
        anchors.add(tuple);
        return;
    }
    anchors.add(tuple);
    // do your computations here!
    outputCollector.emit(anchors, new Values(foo, bar, xpto));
    // ack every buffered input so the corresponding spout tuples are not replayed
    for (Tuple anchor : anchors) {
        outputCollector.ack(anchor);
    }
    anchors.clear();
}
You should implement isTupleAggregationComplete with the logic that checks whether the bolt has all the information necessary to proceed with the processing.
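What isTupleAggregationComplete looks like depends entirely on your aggregation. As a hedged sketch, assuming the bolt simply waits for one tuple from each expected upstream component (EXPECTED_SOURCES is a hypothetical set of component ids):
// true once the buffered anchors plus the new tuple cover every expected source component
private boolean isTupleAggregationComplete(List<Tuple> anchors, Tuple tuple) {
    Set<String> seenSources = new HashSet<>();
    for (Tuple anchor : anchors) {
        seenSources.add(anchor.getSourceComponent());
    }
    seenSources.add(tuple.getSourceComponent());
    return seenSources.containsAll(EXPECTED_SOURCES);
}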

Storm Tick Tuple from a spout

I want to configure my spout to emit tick tuples on 2 different frequencies on different streams. My questions are as follows:
I understand how this is done using a bolt. But, on a spout, will the tick tuple invoke the nextTuple method on every tick?
How can I determine the frequency at which the tick was invoked? Meaning, the actual value of the time I configured in the config object?
Only bolts can receive tick tuples. Spouts can only emit tuples.
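For reference, a bolt opts in to tick tuples through its component configuration and can recognize them in execute; here is a short sketch of the usual pattern (the 60-second frequency is just an example):
@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    // ask Storm to send this bolt a tick tuple every 60 seconds
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
    return conf;
}

private static boolean isTickTuple(Tuple tuple) {
    return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
        && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
}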
I'm assuming you're trying to do a "read" every so often from within your spout in order to emit a new tuple.
For example, to sleep 50 milliseconds between reads:
@Override
public void nextTuple() {
    try {
        String message = _mqClient.getMessage();
        if (message != null) {
            _collector.emit(new Values(message));
        }
        // sleep for 50 milliseconds
        Utils.sleep(50);
    } catch (Exception e) {
        _collector.reportError(e);
        LOG.error("MQ spout error {}", e);
    }
}
Maybe this can help you:
https://github.com/ptgoetz/storm-signals
Storm-Signals aims to provide a way to send messages ("signals") to
components (spouts/bolts) in a storm topology that are otherwise not
addressable.
Storm topologies can be considered static in that modifications to a
topology's behavior require redeployment. Storm-Signals provides a
simple way to modify a topology's behavior at runtime, without
redeployment.

how to choose a field value from a specific stream in storm

public void execute(Tuple input) {
    Object value = input.getValueByField(FIELD_NAME);
    ...
}
When calling getValueByField, how do I specify a particular stream name emitted by the previous bolt/spout, so that the particular FIELD_NAME comes from that stream?
I need to know this because I'm facing the following exception:
InvalidTopologyException(msg:Component: [bolt2-name] subscribes from non-existent stream: [default] of component [bolt1-name])
So, I want to specify a particular stream while calling getValueBy... methods.
I don't remember a way of doing it on a tuple, but you can get the information about who sent you the tuple:
String sourceComponent = tuple.getSourceComponent();
String streamId = tuple.getSourceStreamId();
Then you can use a classic switch/case in Java to call a specific method that knows which fields are available.
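A short sketch of that dispatch, assuming two upstream streams named "stream-a" and "stream-b" (the stream names, field names and handler methods are placeholders):
@Override
public void execute(Tuple input) {
    switch (input.getSourceStreamId()) {
        case "stream-a":
            handleA(input.getValueByField("fieldOnStreamA"));
            break;
        case "stream-b":
            handleB(input.getValueByField("fieldOnStreamB"));
            break;
        default:
            // unexpected stream: log or fail as appropriate
            break;
    }
}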
You can also iterate through the fields included in your tuple to check if a field is available, but I find this way dirty.
for (String field : tuple.getFields()) {
// Check something on field...
}
I just found out that binding to a specific stream can be done while building the topology.
The spout can declare fields on a stream (in its declareOutputFields method):
declarer.declareStream(streamName, new Fields(field1, field2));
...and emit values to the stream:
collector.emit(streamName, new Values(value1, value2...), msgID);
When the bolt is added to the topology, it can subscribe to a specific stream from a preceding spout or bolt like the following:
topologyBuilder.setBolt(boltId, new BoltClass(), parallelismLevel)
.localOrShuffleGrouping(spoutORBoltID, streamID);
The overloaded version of localOrShuffleGrouping provides an option to specify the streamID as the last argument.

how can I prove field-grouping functionality (tuples with the same field value go to the same task)?

I'm new to Storm and getting started with the storm-starter project. In this project there is a topology called WordCountTopology; the key code for building the topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of the WordCount bolt, the key method execute is:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
My Question is:
The functionality of fields grouping is that tuples with the same field word will go to the same task for post-processing. Here "task" means thread; how can I prove this functionality? In addition, in my opinion, the logic in the execute method is a little awkward. In a single task, the parameter tuple is always the same, but the execute method does not reflect this; in other words, the logic does not use this convenience.
Am I clear? My point is that the code here in execute does not take the feature of fields grouping into account; the same code could also be applied in the shuffle-grouping situation.
I would like to cite a few points; they might help clear your doubts.
Here "task" means thread
In Storm's terminology, tasks are NOT threads, but they are responsible for executing the actual logic. Each spout or bolt that you implement in your code runs as one or more tasks across the cluster. So you can define a task as a running instance of a component, i.e. a spout or a bolt.
There is another entity called an executor, which is the thread responsible for running these tasks. It can run one or more tasks of the same component; an executor with multiple tasks means the same component is executed multiple times by that executor.
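As a sketch of the distinction, the parallelism hint on setBolt sets the number of executors (threads), while setNumTasks sets the number of task instances they run (the numbers here are arbitrary):
// 12 executors (threads), each running 2 of the 24 WordCount task instances
builder.setBolt("count", new WordCount(), 12)
       .setNumTasks(24)
       .fieldsGrouping("split", new Fields("word"));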
Now coming back to your question
the code here in execute is not taking the feature of filed-grouping into account, the code here can also be applied to the situation of shuffle-grouping
Very briefly: a fields grouping lets you group a stream by a subset of its fields. For a word count, if we partition the stream using fieldsGrouping on a field named first_name, then every tuple whose first_name has a given value, say Foo, is expected to go to the same task, and tuples with a different value, say Bar, go to another task.
So here the execute method is expected to keep receiving the same field value for a given word, and can therefore simply update its counter without any special handling. The whole logic is written with the assumption that the bolt will be fed the proper data, and that's why using the proper grouping is so important. If you use shuffleGrouping instead, the same code will run but will produce incorrect counts.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
@Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
    this.boltId = tc.getThisTaskId();
}
Now in the execute() of the same fieldsGrouped bolt that receives the tuples, you just print the id and the tuple:
@Override
public void execute(Tuple tuple) {
    String myWord = (String) tuple.getValue(0);
    System.out.println("word: " + myWord + " boltID: " + boltId);
}
