how to choose a field value from a specific stream in storm - apache-storm

public void execute(Tuple input) {
    Object value = input.getValueByField(FIELD_NAME);
    ...
}
When calling getValueByField, how do I specify a particular stream name emitted by the previous Bolt/Spout, so that FIELD_NAME is read from that stream?
I need to know this because I'm facing the following exception:
InvalidTopologyException(msg:Component: [bolt2-name] subscribes from non-existent stream: [default] of component [bolt1-name])
So, I want to specify a particular stream while calling getValueBy... methods.

I don't remember a way of doing it on the tuple itself, but you can get information about who sent you the tuple:
String sourceComponent = tuple.getSourceComponent();
String streamId = tuple.getSourceStreamId();
Then you can use a classic switch/case in Java to call a specific method that knows which fields are available.
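For example, a rough sketch of that dispatch (the stream and field names here are made up, not taken from the original topology):
@Override
public void execute(Tuple input) {
    switch (input.getSourceStreamId()) {
        case "metrics":
            // fields declared on the hypothetical "metrics" stream
            Object metricValue = input.getValueByField("metricValue");
            // ... handle metric tuples
            break;
        case "logs":
            // fields declared on the hypothetical "logs" stream
            String logLine = input.getStringByField("logLine");
            // ... handle log tuples
            break;
        default:
            // unknown stream: ignore, log, or fail
            break;
    }
}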
You can also iterate over the fields included in your tuple to check whether a field is available, but I find this approach messy.
for (String field : tuple.getFields()) {
    // Check something on the field...
}

Just found out that binding to a specific stream can be done while building the topology.
The Spout can declare fields for a named stream (in the declareOutputFields method):
declarer.declareStream(streamName, new Fields(field1, field2));
...and emit values to that stream:
collector.emit(streamName, new Values(value1, value2...), msgID);
When a Bolt is added to the topology, it can subscribe to a specific stream from a preceding spout or bolt like this:
topologyBuilder.setBolt(boltId, new BoltClass(), parallelismLevel)
               .localOrShuffleGrouping(spoutORBoltID, streamID);
The overloaded version of localOrShuffleGrouping lets you specify the streamID as the last argument.
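Putting the pieces together, a minimal sketch (the stream names, field names, and the Bolt2 class are assumptions for illustration; the component IDs match the exception above):
// In the upstream spout/bolt: declare two named streams with their fields.
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("streamA", new Fields("field1", "field2"));
    declarer.declareStream("streamB", new Fields("field3"));
}

// When emitting, choose the stream explicitly.
collector.emit("streamA", new Values(value1, value2), msgID);

// When building the topology, subscribe the downstream bolt to exactly one of those streams.
topologyBuilder.setBolt("bolt2-name", new Bolt2(), 2)
               .localOrShuffleGrouping("bolt1-name", "streamA");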

Related

There is no way to create a reference to stream & it’s not possible to reuse the same stream multiple times

Reading an article about Java 8 streams, I found:
Java Streams are consumable, so there is no way to create a reference
to stream for future usage. Since the data is on-demand, it’s not
possible to reuse the same stream multiple times.
At the same time, the same article shows:
//sequential stream
Stream<Integer> sequentialStream = myList.stream();
//parallel stream
Stream<Integer> parallelStream = myList.parallelStream();
What does "there is no way to create a reference to a stream for future usage" mean? Aren't sequentialStream and parallelStream references to streams?
Also, what does "it's not possible to reuse the same stream multiple times" mean?
What it means is that every time you need to operate on a stream, you must make a new one.
So you cannot, for example, have something like:
class Person {
    private Stream<String> phoneNumbers;

    Stream<String> getPhoneNumbers() {
        return phoneNumbers;
    }
}
and just reuse that one stream whenever you like. Instead, you must have something like
class Person {
    private List<String> phoneNumbers;

    Stream<String> getPhoneNumbers() {
        return phoneNumbers.stream(); // make a NEW stream over the same data
    }
}
The code snippet you included does just that: it makes two different streams over the same data.
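A small, self-contained illustration of the difference (a sketch using only the standard library):
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class StreamReuseDemo {
    public static void main(String[] args) {
        List<String> myList = Arrays.asList("a", "b", "c");

        Stream<String> stream = myList.stream();
        System.out.println(stream.count()); // terminal operation: the stream is now consumed
        // stream.count();                  // would throw IllegalStateException:
                                            // "stream has already been operated upon or closed"

        // Making a fresh stream over the same data is fine:
        System.out.println(myList.stream().count());
    }
}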

How do I combine a Kafka Processor with a Kafka Streams application?

I am trying to construct a Kafka Streams operator that takes in a KStream<timestamp, data> and outputs another stream with the timestamps sorted in ascending order; the purpose is to deal with streams that have "out of order" entries due to delays at the supplier.
At first, I thought about doing this with time-windowed aggregation, but then I happened upon a solution using a Kafka Processor. I figured I could then say something like:
class SortProcessor implements Processor<timestamp,data> ...
class SortProcessorSupplier ...supplies suitably initialized SortProcessor
KStream<timestamp,data> input_stream = ...sourced from "input_topic"
KStream<timestamp,data> output_stream =
input_stream.process( new SortProcessorSupplier(...parameters...) );
However, this doesn't work because KStream.process returns void.
So, my question is: How do I "wrap" the Processor so that I can use it as follows:
KStream<timestamp,data> input_stream = ...sourced from "input_topic"
KStream<timestamp,data> output_stream =
new WrappedSortProcessor( input_stream, ...parameters... )
Instead of a Processor you can use a Transformer, which is very similar to a Processor but is better suited to forwarding results on to the stream. You can then invoke it from the stream using the KStream.transform() method instead of process().
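A rough sketch of that wrapping (the key/value types, class names, and sorting details are assumptions; a real sorter would also buffer out-of-order records in a state store and flush them from a punctuator):
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical transformer: here it simply forwards records downstream; a real implementation
// would buffer records in a state store and emit them in timestamp order.
class SortTransformer implements Transformer<Long, String, KeyValue<Long, String>> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        // e.g. schedule a punctuation here to flush buffered records in order
    }

    @Override
    public KeyValue<Long, String> transform(Long timestamp, String data) {
        // Whatever you return is forwarded downstream; returning null forwards nothing.
        return KeyValue.pair(timestamp, data);
    }

    @Override
    public void close() { }
}

// Usage: unlike process(), transform() gives you back a KStream to keep building on.
// KStream<Long, String> outputStream = inputStream.transform(SortTransformer::new);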

When I use Storm, how can I ensure that a bolt with multiple inputs processes only when all the inputs have arrived?

The topology looks like this.
How can I ensure that a bolt with multiple inputs processes only when all the inputs have arrived?
Bolt.execute() is called for each incoming tuple, regardless of what the producer was (and you cannot change this). If you want to process multiple tuples from different producers at once, you need to write custom UDF code.
You need an input buffer for each producer that can hold incoming tuples (for example, a LinkedList<Tuple> as a bolt member).
For each incoming tuple, you add the tuple to the corresponding buffer (you can access the producer information in the tuple's metadata via input.getSourceComponent()).
After adding the tuple to the buffer, you check whether each buffer contains at least one tuple: if yes, take one tuple from each buffer and process them (after processing, check the buffers again until at least one buffer is empty); if no, just return and do not process anything. A sketch of this buffering follows.
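A minimal sketch, assuming two upstream components named "bolt1-name" and "bolt2-name" (the emitted fields, output fields, and package prefix are also assumptions; older Storm releases use backtype.storm instead of org.apache.storm):
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MergeBolt extends BaseRichBolt {
    // one FIFO buffer per producer component
    private final Map<String, LinkedList<Tuple>> buffers = new HashMap<>();
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        buffers.put("bolt1-name", new LinkedList<>());
        buffers.put("bolt2-name", new LinkedList<>());
    }

    @Override
    public void execute(Tuple input) {
        // 1. buffer the tuple under its producer
        buffers.get(input.getSourceComponent()).add(input);
        // 2. process only while every buffer holds at least one tuple
        while (buffers.values().stream().noneMatch(LinkedList::isEmpty)) {
            Tuple a = buffers.get("bolt1-name").poll();
            Tuple b = buffers.get("bolt2-name").poll();
            collector.emit(new Values(a.getValue(0), b.getValue(0))); // anchoring omitted for brevity
            collector.ack(a);
            collector.ack(b);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("left", "right"));
    }
}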
You might want to take a look here (refer to Batching). For bolts that perform more complex operations, such as aggregation over multiple input tuples, you will need to extend BaseRichBolt and manage the anchoring mechanism yourself.
For this you need to declare your own output collector like this:
private OutputCollector outputCollector;
And then initialise it through your override of the prepare method:
@Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
    this.outputCollector = outputCollector;
}
The execute method of BaseRichBolt only receives a tuple as an argument, so you need to implement the logic that maintains the anchors and uses them when emitting.
private final List<Tuple> anchors = new ArrayList<Tuple>();

@Override
public void execute(Tuple tuple) {
    if (!isTupleAggregationComplete(anchors, tuple)) {
        anchors.add(tuple);
        return;
    }
    // do your computations here!
    outputCollector.emit(anchors, new Values(foo, bar, xpto));
    anchors.clear();
}
You should implement isTupleAggregationComplete with the logic that checks whether the bolt has all the information it needs to proceed with processing.
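One hypothetical way to implement isTupleAggregationComplete, assuming "complete" means a tuple from every expected upstream component has been seen (the component names are made up; needs java.util imports for Arrays, HashSet, List, Set):
// Expected producers; you could also derive these from the TopologyContext in prepare().
private final Set<String> expectedSources =
        new HashSet<>(Arrays.asList("bolt1-name", "bolt2-name"));

private boolean isTupleAggregationComplete(List<Tuple> anchors, Tuple current) {
    Set<String> seen = new HashSet<>();
    for (Tuple t : anchors) {
        seen.add(t.getSourceComponent());
    }
    seen.add(current.getSourceComponent());
    return seen.containsAll(expectedSources);
}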

Is it possible to know the size of a stream without using a terminal operation?

I have 3 interfaces
public interface IGhOrg {
    int getId();
    String getLogin();
    String getName();
    String getLocation();
    Stream<IGhRepo> getRepos();
}

public interface IGhRepo {
    int getId();
    int getSize();
    int getWatchersCount();
    String getLanguage();
    Stream<IGhUser> getContributors();
}

public interface IGhUser {
    int getId();
    String getLogin();
    String getName();
    String getCompany();
    Stream<IGhOrg> getOrgs();
}
and I need to implement Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations)
This method returns the IGhRepo with the most contributors (getContributors()).
I tried this
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
        .flatMap(IGhOrg::getRepos)
        .max((repo1, repo2) -> (int) repo1.getContributors().count() - (int) repo2.getContributors().count());
}
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
I understand that count() is a terminal operation on Stream, but I can't solve this problem. Please help!
thanks
Is it possible to know the size of a stream without using a terminal operation?
No it's not, because streams can be infinite or generate output on demand. They are not necessarily backed by collections.
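For example (a toy illustration):
import java.util.stream.Stream;

// An unbounded stream: there is no finite size to report, and even count()
// (a terminal operation) would never return here.
Stream<Integer> naturals = Stream.iterate(0, n -> n + 1);
// naturals.count(); // never terminates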
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
That's because you are returning the same stream instance on each method invocation. You should return a new Stream instead.
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
IMHO you are misusing the streams here. Performance- and simplicity-wise, it's much better to return some Collection<XXX> instead of Stream<XXX>.
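For instance, the repository interface could expose a collection instead (a sketch of this suggested reshaping, not the original API):
import java.util.List;

public interface IGhRepo {
    int getId();
    int getSize();
    int getWatchersCount();
    String getLanguage();
    List<IGhUser> getContributors(); // callers can do getContributors().stream() whenever they need a stream
}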
No, it is not possible to know the size of a stream in Java.
As mentioned in the Java 8 stream docs:
No storage. A stream is not a data structure that stores elements;
instead, it conveys elements from a source such as a data structure,
an array, a generator function, or an I/O channel, through a pipeline
of computational operations.
You don't specify this, but it looks like some or possibly all of the interface methods that return Stream<...> values don't return a fresh stream each time they are called.
This seems problematic to me from an API point of view, as it means each of these streams, and a fair chunk of the object's functionality, can be used at most once.
You may be able to solve the particular problem you are having by ensuring that the stream from each object is used only once in the method, something like this:
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
        .flatMap(IGhOrg::getRepos)
        .distinct()
        .map(repo -> new AbstractMap.SimpleEntry<>(repo, repo.getContributors().count()))
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
}
Unfortunately it looks like you will now be stuck if you want to (for example) print a list of the contributors, as the stream returned from getContributors() for the returned IGhRepo has already been consumed.
You might want to consider having your implementation objects return a fresh stream each time a stream returning method is called.
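For example, an implementation backed by a collection can build a new stream on every call (a sketch; the class name is made up and the other IGhRepo methods are omitted, which is why the class is abstract):
import java.util.List;
import java.util.stream.Stream;

abstract class RepoWithFreshStreams implements IGhRepo {
    private final List<IGhUser> contributors;

    RepoWithFreshStreams(List<IGhUser> contributors) {
        this.contributors = contributors;
    }

    @Override
    public Stream<IGhUser> getContributors() {
        return contributors.stream(); // a new stream over the same backing list, on every call
    }
}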
You could keep a counter that is incremented per "iteration" using peek. In the example below, the counter is incremented before each item is processed by doSomeLogic:
final var counter = new AtomicInteger();
getStream().peek(item -> counter.incrementAndGet()).forEach(this::doSomeLogic);

how can I prove the fields-grouping functionality (tuples with the same field value go to the same task)?

I'm new to Storm and am getting started with the storm-starter project. This project contains a topology called WordCountTopology; the key code for building the topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of the WordCount bolt, the key method execute is:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
My Question is:
The functionality of fields-grouping is that tuples with the same field word go to the same task for post-processing. Here "task" means thread; how can I prove this functionality? In addition, in my opinion, the logic in the execute method is a little awkward. In a single task the field value of the incoming tuple is always the same, but the execute method does not reflect this; in other words, the logic does not use this convenience.
Am I clear? My point is that the code in execute is not taking the fields-grouping feature into account; the same code could also be applied in a shuffle-grouping situation.
I would like to cite a few points; they might help clear your doubts.
Here "task" means thread
In Storm's terminology, tasks are NOT threads, but they are responsible for executing the actual processing logic. Each spout or bolt that you implement in your code runs as a number of tasks across the cluster, so you can think of tasks as running instances of the components, i.e. Spouts or Bolts.
There is another entity called executors, which are the threads responsible for running these tasks. An executor can run one or multiple tasks of the same component; an executor having multiple tasks simply means the same component is executed multiple times by that executor.
Now coming back to your question
the code here in execute is not taking the feature of filed-grouping into account, the code here can also be applied to the situation of shuffle-grouping
In brief, a fields grouping lets you partition a stream by a subset of its fields. For a word count this means: if we group the stream using fieldsGrouping on a field named first_name, then all tuples whose first_name value is, say, Foo are expected to go to the same task, while tuples with a different value (Bar) may go to another task.
So here the execute method is guaranteed to receive every occurrence of a given field value, and can therefore simply update its counter without considering anything special. The whole logic is written on the assumption that the bolt is wired up with the proper data, which is why using the proper grouping is so important. If you used shuffleGrouping, the same code would run but would produce incorrect data: for example, two occurrences of the word foo could land on two different tasks, each emitting a partial count of 1 instead of a single count of 2.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
@Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
    this.boltId = tc.getThisTaskId();
}
Now, in the execute() of the same fields-grouped bolt that receives the tuples, you just print the task ID and the tuple:
@Override
public void execute(Tuple tuple) {
    String myWord = (String) tuple.getValue(0);
    System.out.println("word: " + myWord + " boltID: " + boltId);
}
