sending input from single spout to multiple bolts with Fields grouping in Apache Storm - apache-storm

builder.setSpout("spout", new TweetSpout());
builder.setBolt("bolt", new TweetCounter(), 2).fieldsGrouping("spout",
new Fields("field1"));
I have an input field "field1" added in the fields grouping. By the definition of fields grouping, all tweets with the same "field1" should go to a single task of TweetCounter. The number of executors set for the TweetCounter bolt is 2.
However, if "field1" is the same in all the tuples of the incoming stream, does this mean that even though I specified 2 executors for TweetCounter, the stream would only be sent to one of them while the other instance remains idle?
To go further with my particular use case, how can I use a single spout and send data to different bolts based on a particular value of an input field (field1)?

It seems one way to solve this problem is to use Direct grouping, where the producer decides which task of the consumer will receive the tuple:
This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods (see OutputCollector#emitDirect(int, int, java.util.List)). A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
You can see an example of its use here:
collector.emitDirect(getWordCountIndex(word),new Values(word));
where getWordCountIndex returns the index of the task where this tuple will be processed.
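For context, here is a rough sketch of the pieces involved (the stream name "directStream", the helper chooseIndexFor, and the variable names are made up for illustration): the emitter declares a direct stream, looks up the consumer's task ids from the TopologyContext, and the consumer subscribes with directGrouping.
// In the emitting spout/bolt: declare a direct stream (the 'true' flag marks it as direct).
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("directStream", true, new Fields("field1"));
}

// In open()/prepare(): look up the task ids of the consuming component ("bolt").
List<Integer> counterTasks = context.getComponentTasks("bolt");

// When emitting: pick a task id based on the value of field1.
int targetTask = counterTasks.get(chooseIndexFor(field1Value)); // chooseIndexFor is your own routing logic
collector.emitDirect(targetTask, "directStream", new Values(field1Value));

// When building the topology, the consumer subscribes to the direct stream:
builder.setBolt("bolt", new TweetCounter(), 2).directGrouping("spout", "directStream");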

An alternative to using emitDirect as described in this answer is to implement your own stream grouping. The complexity is about the same, but it allows you to reuse grouping logic across multiple bolts.
For example, the shuffle grouping in Storm is implemented as a CustomStreamGrouping as follows:
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

public class ShuffleGrouping implements CustomStreamGrouping, Serializable {
    private ArrayList<List<Integer>> choices;
    private AtomicInteger current;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        // One single-element task list per target task, shuffled so the rotation
        // starts from a random position.
        choices = new ArrayList<List<Integer>>(targetTasks.size());
        for (Integer i : targetTasks) {
            choices.add(Arrays.asList(i));
        }
        current = new AtomicInteger(0);
        Collections.shuffle(choices, new Random());
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Round-robin over the shuffled task list.
        int rightNow;
        int size = choices.size();
        while (true) {
            rightNow = current.incrementAndGet();
            if (rightNow < size) {
                return choices.get(rightNow);
            } else if (rightNow == size) {
                current.set(0);
                return choices.get(0);
            }
            // race condition with another thread, and we lost. try again
        }
    }
}
Storm will call prepare to tell you the task ids your grouping is responsible for, as well as some context on the topology. When Storm emits a tuple from a bolt/spout where you're using this grouping, Storm will call chooseTasks which lets you define which tasks the tuple should go to. You would then use the grouping when building your topology as shown:
TopologyBuilder tp = new TopologyBuilder();
tp.setSpout("spout", new MySpout(), 1);
tp.setBolt("bolt", new MyBolt())
.customGrouping("spout", new ShuffleGrouping());
Be aware that groupings need to be Serializable and thread safe.
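Applied to the question above, a custom grouping can route on the value of field1 directly. The following is only a sketch under the assumption that you know up front which values should go to which task; the routing rule (the hypothetical check against "xyz") is yours to define:
public class FieldValueGrouping implements CustomStreamGrouping, Serializable {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // values is the emitted tuple; values.get(0) is "field1" here.
        // Replace this rule with whatever value-to-task mapping your use case needs.
        int index = "xyz".equals(values.get(0)) ? 0 : 1;
        return Collections.singletonList(targetTasks.get(index % targetTasks.size()));
    }
}
It would be wired in exactly like the ShuffleGrouping above, i.e. .customGrouping("spout", new FieldValueGrouping()).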

Related

Utilize a single processor to process data from multiple sources of different Key and Value "Serdes"

Is it possible to utilize a single processor to process data from multiple sources of different Key and Value "Serdes"?
Below is my topology
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "MarketData", "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");
Below is the process method from the processor.
public void process(Record<String, MarketData> record) {
MarketData marketData = record.value();
}
Is it possible to have a generic record in the process method that can be processed differently depending on the type of record?
In the event that the above solution is not feasible, is it possible to have multiple sources and processors without having intermediate topics as a result? Example:
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addProcessor("StrategyTwoMarketData", new StrategyTwoMarketDataProcessorSupplier(), "MarketData")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwoEventData", new StrategyTwoEventDataProcessorSupplier(), "EventData")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");

Spring #StreamListener process(KStream<?,?> stream) Partition

I have a topic with multiple partitions. In my stream processor I just want to stream from one partition, and could not figure out how to configure this:
spring.cloud.stream.kafka.streams.bindings.input.consumer.application-id=s-processor
spring.cloud.stream.bindings.input.destination=uinput
spring.cloud.stream.bindings.input.group=r-processor
spring.cloud.stream.bindings.input.contentType=application/java-serialized-object
spring.cloud.stream.bindings.input.consumer.header-mode=raw
spring.cloud.stream.bindings.input.consumer.use-native-decoding=true
spring.cloud.stream.bindings.input.consumer.partitioned=true
@StreamListener(target = "input")
// @SendTo(value = { "uoutput" })
public void process(KStream<UUID, AModel> ustream) {
I want only one partition's data to be processed by this processor; there will be other processors for the other partition(s).
So far my finding is that it has something to do with https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/StreamsConfig.html#PARTITION_GROUPER_CLASS_CONFIG, but I could not find how to set this property in the Spring application.properties.
I think the partition grouper is used to group partitions into tasks within a single processor. If you want to ensure that only a single partition is processed by each processor, then you need to provide at least as many processor instances as there are topic partitions. For example, if your topic has 4 partitions, then you need 4 instances of the stream application to ensure that each instance processes only a single partition.
Kafka Streams does not allow you to read a single partition. If you subscribe to a topic, all partitions are consumed and distributed over the available instances. Thus, you can't know in advance which partition is assigned to which instance, and all instances execute the same code.
But each partition linked to the processor has a different kind of data, hence it requires a different processor application.
For this case, the processor (or transformer) must be able to process data for all partitions. Kafka Streams exposes the partition number via the ProcessorContext object that is handed to a processor via the init() method: https://kafka.apache.org/20/javadoc/org/apache/kafka/streams/kstream/Transformer.html#init-org.apache.kafka.streams.processor.ProcessorContext-
Thus, you need to "branch" within your transformer to apply different processing logic based on the partition:
ustream.transform(() -> new MyTransformer());
class MyTransformer<K, V, R> implements Transformer<K, V, R> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context; // keep the context so the partition can be queried later
    }

    @Override
    public R transform(K key, V value) {
        switch (context.partition()) {
            case 0:
                // your processing logic
                break;
            case 1:
                // your processing logic
                break;
            // ...
        }
        return null; // return the transformed record for the partition's branch
    }

    // other methods omitted
}

Kafka Streams API: I am joining two KStreams of empmodel

final KStream<String, EmpModel> empModelStream = getMapOperator(empoutStream);
final KStream<String, EmpModel> empModelinput = getMapOperator(inputStream);
// empModelinput.print();
// empModelStream.print();
empModelStream.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel paramV1, EmpModel paramV2) {
        System.out.println("Model1 " + paramV1.getKey());
        System.out.println("Model2 " + paramV2.getKey());
        return paramV1;
    }
}, JoinWindows.of("2000L"));
I get error:
Invalid topology building: KSTREAM-MAP-0000000003 and KSTREAM-MAP-0000000004 are not joinable
If you want to join two KStreams you must ensure that both have the same number of partitions. (cf. "Note" box in http://docs.confluent.io/current/streams/developer-guide.html#joining-streams)
If you use Kafka v0.10.1+, repartitioning will happen automatically (cf. http://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning).
For Kafka v0.10.0.x you have two options:
ensure that the original input topics do have the same number of partitions
or, add a call to .through("my-repartitioning-topic") to one of the KStreams before the join. You need to create the topic "my-repartitioning-topic" with the right number of partitions (i.e., the same number of partitions as the second KStream's original input topic) before you start your Streams application.
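For the second option, the call could look roughly like this (a sketch for Kafka 0.10.0.x, where JoinWindows.of(...) takes a window name and the window size goes to within(...); "my-repartitioning-topic" and "empmodel-join" are placeholder names):
final KStream<String, EmpModel> repartitioned = empModelStream.through("my-repartitioning-topic");
repartitioned.join(empModelinput, new ValueJoiner<EmpModel, EmpModel, Object>() {
    @Override
    public Object apply(EmpModel left, EmpModel right) {
        return left;
    }
}, JoinWindows.of("empmodel-join").within(2000L));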

How to count the number of output tuples in cascading

Using cascading framework, I am filtering some tuples and outputting those in an S3 file.
I also want to count the total number of output tuples. One simple way is to download the output S3 file and count the number of lines.
Is there any other way to dump the count of output tuples to another file?
This can be done using a FlowProcess. We can write a custom function:
public class Counter extends BaseOperation implements Function {
    ...
    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        // pass the tuple through unchanged, and increment a counter for it
        functionCall.getOutputCollector().add(functionCall.getArguments());
        flowProcess.increment(counterGroup, counterName, 1);
    }
    ...
}
Usage:
groupByPipe = new Each(groupByPipe, new Counter(COUNTER_GROUP_NAME, counterName));
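To actually dump the count to another file, one option (a sketch; the flow variable, counter names, and output path are from the snippets above or hypothetical) is to read the counter back from the flow's stats after the flow completes and write it out yourself:
flow.complete();
long outputTuples = flow.getFlowStats().getCounterValue(COUNTER_GROUP_NAME, counterName);
java.nio.file.Files.write(java.nio.file.Paths.get("tuple-count.txt"),
        String.valueOf(outputTuples).getBytes(java.nio.charset.StandardCharsets.UTF_8));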

SingleColumnValueFilter not returning proper number of rows

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we only want to process rows from a given crawl at any one time. In order to run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.
I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or in between these two jobs.
Since adding the scan filter caused the visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {
    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}
The filter setup is like this:
String crawlIdentifier=configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)){
throw new IllegalArgumentException("Crawl Identifier not set.");
}
// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198 but I'm pretty sure the Scan includes all the families by default.
The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter is using Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the native character encoding in HBaseSchema or when you write the record, if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan.
Bytes.toBytes(crawlIdentifier)
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
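If the rows were written with String.getBytes(), a like-for-like fix (sketch only; rowKey and table are placeholders, the column constants come from the question's HBaseSchema) is to write the value with Bytes.toBytes as well, so the filter compares identical byte sequences:
Put put = new Put(rowKey);
put.add(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        Bytes.toBytes(crawlIdentifier)); // UTF-8, matching the filter's comparator
table.put(put);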
