Send output of two bolts to a single bolt in Storm?

What is the easiest way to send the output of BoltA and BoltB to BoltC? Do I have to use joins, or is there a simpler solution? BoltA and BoltB have the same fields (ts, metric_name, metric_count).
// KafkaSpout --> LogDecoder
builder.setBolt(LOGDECODER_BOLT_ID, logdecoderBolt, 10).shuffleGrouping(KAFKA_SPOUT_ID);
// LogDecoder --> CountBolt
builder.setBolt(COUNT_BOLT_ID, countBolt, 10).shuffleGrouping(LOGDECODER_BOLT_ID);
// LogDecoder --> HttpResCodeCountBolt
builder.setBolt(HTTP_RES_CODE_COUNT_BOLT_ID, http_res_code_count_bolt, 10).shuffleGrouping(LOGDECODER_BOLT_ID);
// And now I want to send CountBolt and HttpResCodeCountBolt output to AggregateBolt.
// CountBolt --> AggregateBolt
builder.setBolt(AGGREGATE_BOLT_ID, aggregateBolt, 5).fieldsGrouping((COUNT_BOLT_ID), new Fields("ts"));
// HttpResCodeCountBolt --> AggregateBolt
builder.setBolt(AGGREGATE_BOLT_ID, aggregateBolt, 5).fieldsGrouping((HTTP_RES_CODE_COUNT_BOLT_ID), new Fields("ts"));
Is this possible?

Yes. Just add a stream-id ("stream1" and "stream2" below) to the fieldsGrouping call:
BoltDeclarer bd = builder.setBolt(AGGREGATE_BOLT_ID, aggregateBolt, 5);
bd.fieldsGrouping((COUNT_BOLT_ID), "stream1", new Fields("ts"));
bd.fieldsGrouping((HTTP_RES_CODE_COUNT_BOLT_ID), "stream2", new Fields("ts"));
and then in the execute() method for BoltC you can test to see which stream the tuple came from:
public void execute(Tuple tuple) {
    if ("stream1".equals(tuple.getSourceStreamId())) {
        // this came from stream1
    } else if ("stream2".equals(tuple.getSourceStreamId())) {
        // this came from stream2
    }
}
Since you know which stream the tuple came from, you don't need to have the same shape of tuple on the two streams. You just de-marshall the tuple according to the stream-id.
You can also check to see which component the tuple came from (as I type this I think this might be more appropriate to your case) as well as the instance of the component (the task) that emitted the tuple.

As @Chris said, you can use streams. But you can also simply get the source component from the tuple.
@Override
public final void execute(final Tuple tuple) {
    final String sourceComponent = tuple.getSourceComponent();
    ....
}
The source component is the name you gave to the Bolt at the topology's initialization. For instance: COUNT_BOLT_ID.
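For the topology in the question, a minimal sketch (mine, not from either answer) of the aggregator's execute() that branches on the source component; COUNT_BOLT_ID and HTTP_RES_CODE_COUNT_BOLT_ID are the ids from the question, and the branch bodies are placeholders:
@Override
public void execute(Tuple tuple) {
    String source = tuple.getSourceComponent(); // the id passed to setBolt(...)
    int task = tuple.getSourceTask();           // the specific task instance that emitted the tuple
    if (COUNT_BOLT_ID.equals(source)) {
        // aggregate CountBolt output
    } else if (HTTP_RES_CODE_COUNT_BOLT_ID.equals(source)) {
        // aggregate HttpResCodeCountBolt output
    }
}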

Related

Spring Batch - use JpaPagingItemReader to read lists instead of individual items

Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.
My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository, like:
https://github.com/spring-projects/spring-batch/blob/main/spring-batch-samples/src/main/java/org/springframework/batch/sample/domain/multiline/AggregateItemReader.java
Spring Batch - Item Reader and ItemProcessor with a list
Spring Batch- how to pass list of multiple items from input to ItemReader, ItemProcessor and ItemWriter
Update:
I'm looking for a solution that would work for a rapidly changing dataset and in a multithreaded environment.
I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>.
Spring Batch is not (and should not be) aware of what an "item" is. It is up to you to design what an "item" is and how it is implemented (a single value, a list, a stream, etc.). In your case, you can encapsulate the List<T> in a type that can be used as an item and process the data as needed. You would need a custom item reader, though.
The solution we found is to use a custom aggregate reader, as suggested here, which accumulates the read data into a list of a given size and then passes it along. For our specific use case, we read data using a JpaPagingItemReader. The relevant part is:
public List<T> read() throws Exception {
    ResultHolder holder = new ResultHolder();
    // read until no more results are available or the aggregated size is reached
    while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
        process(itemReader.read(), holder);
    }
    if (CollectionUtils.isEmpty(holder.getResults())) {
        return null;
    }
    return holder.getResults();
}

private void process(T readValue, ResultHolder resultHolder) {
    if (readValue == null) {
        itemReaderExhausted = true;
        return;
    }
    resultHolder.addResult(readValue);
}
To account for the volatility of the dataset, we extended the JPA reader and overrode the getPage() method to always return 0, and we controlled the dataset through the processor and writer so that the next batch of fresh data is always fetched from the first page. The hint was given here and in some other SO answers.
@Override
public int getPage() {
    return 0; // always read the first page so that freshly changed data is picked up
}
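If it helps, here is a hedged sketch of how such a reader can be wired into a chunk-oriented step so that each chunk "item" is a whole list (Spring Batch 4.x style; MyEntity, aggregateReader and listWriter are hypothetical names, not from the original code):
import java.util.List;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

// inside a @Configuration class
@Bean
public Step aggregateStep(StepBuilderFactory steps,
                          ItemReader<List<MyEntity>> aggregateReader, // e.g. the aggregating reader shown above
                          ItemWriter<List<MyEntity>> listWriter) {    // its write() receives a list of lists
    return steps.get("aggregateStep")
            .<List<MyEntity>, List<MyEntity>>chunk(10) // each "item" is itself a List
            .reader(aggregateReader)
            .writer(listWriter)
            .build();
}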

Is it possible to make a map-only task execute in parallel in Apache Flink

I'm using Flink to process some JSON-format streaming data:
{"uuid":"903493290432934", "bin": "68.3"}
{"uuid":"324938722984237", "bin": "56.8"}
...
My job is quite simple:
get stream from the Data Source ---> deserialize data into String ---> transform String to JSON object myJsonObj ---> double res = myJsonObj.get("bin") ---> do some heavy calculation with res.
Here is my code:
FlinkPravegaReader<String> source = ... // init source
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// transform String to MyJson
DataStream<MyJson> jsonStream = env.addSource(source).name("Pravega Stream")
        .map(new MapFunction<String, MyJson>() {
            @Override
            public MyJson map(String s) throws Exception {
                MyJson myJson = JSON.parseObject(s, MyJson.class);
                return myJson;
            }
        });

// do the heavy process
DataStream<String> heavyResult = jsonStream
        .map(new MapFunction<MyJson, String>() {
            @Override
            public String map(MyJson myJson) throws Exception {
                double res = myJson.get("bin");
                // do some very heavy calculation
                return myJson.get("uuid").asText() + " done.";
            }
        });
heavyResult.print();
As I understand it, I haven't used any keyBy/window, so I think I used windowAll by default. Am I right?
If so, the Flink documentation tells me that windowAll cannot be run in parallel. Does that mean I have to do the heavy calculation one element at a time? I'm wondering whether it is possible to do the heavy calculation in parallel.
As you can see, in my case using keyBy/window doesn't seem to make any sense. So how can I make this case execute in parallel? Is it possible to have two jobs running together off the same Data Source, as below?
             /---- windowAll ---- do the heavy calculation
            /
Data Source -
            \
             \---- windowAll ---- do the heavy calculation
Is this design possible? Say the Data Source generates two elements, A and B. With this design, I'm expecting one windowAll to process A while the other processes B.
A keyed stream is used to partition your data, so all the traffic for the same key is sent to the same task manager.
A window is used when you want to aggregate elements from the stream and compute them as a set, for a given reason.
If your case does not fit either of the above, you don't need to use them.
To provide parallelism to the whole job, just use:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3); // note: you'll need 3 task slots available
To define the parallelism for a single operator (the heavy calculation), use:
DataStream<String> heavyResult = jsonStream
        .map(new MapFunction<MyJson, String>() {
            @Override
            public String map(MyJson myJson) throws Exception {
                double res = myJson.get("bin");
                // do some very heavy calculation
                return myJson.get("uuid").asText() + " done.";
            }
        }).setParallelism(3); // note: you'll need 3 task slots available
More info at https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/parallel.html
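Putting it together, here is a minimal, self-contained sketch of the idea (not the OP's job: it uses a placeholder socket source instead of FlinkPravegaReader and a stand-in for the heavy work), with default parallelism for the light operators and a higher parallelism only for the heavy map:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelHeavyMapJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // default parallelism for the light operators

        // placeholder source; the question uses a FlinkPravegaReader here instead
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<String> heavyResult = lines
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String s) throws Exception {
                        // stand-in for: parse the JSON, extract "bin", run the heavy calculation
                        return s + " done.";
                    }
                })
                .setParallelism(3); // only the heavy map gets 3 parallel instances

        heavyResult.print();
        env.execute("parallel heavy map");
    }
}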

Reactor Flux conditional emit

Is it possible to allow emitting values from a Flux conditionally based on a global boolean variable?
I'm working with Flux delayUntil(...) but not able to fully grasp the functionality or my assumptions are wrong.
I have a global AtomicBoolean that represents the availability of a downstream connection and only want the upstream Flux to emit if the downstream is ready to process.
To illustrate the scenario, I created a (not working) test sample:
//Randomly generates a boolean value every 5 seconds
private Flux<Boolean> signalGenerator() {
    return Flux.range(1, Integer.MAX_VALUE)
            .delayElements(Duration.ofMillis(5000))
            .map(integer -> new Random().nextBoolean());
}
and
Flux.range(1, Integer.MAX_VALUE)
    .delayElements(Duration.ofMillis(1000))
    .delayUntil(evt -> signalGenerator()) // ?? only proceed when signalGenerator returns true
    .subscribe(System.out::println);
I have another scenario where a downstream process can accept only x messages a second. In the current non-reactive implementation, we have a Semaphore with x permits and the thread blocks when no more permits are available; the permits are reset every second.
In both scenarios I want the upstream Flux to emit only when there is demand from the downstream process, and I do not want to buffer.
You might consider using Mono.fromRunnable() as an input to delayUntil(), like below.
Helper class:
public class FluxCondition {
    CountDownLatch latch = new CountDownLatch(10); // it depends, might be managed somehow

    Runnable r = () -> {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    };

    // blocks the subscribing thread until the latch reaches zero, then completes
    public Mono<Void> lock() { return Mono.fromRunnable(r); }

    public void release() { latch.countDown(); }
}
Usage:
FluxCondition delayCondition = new FluxCondition();
Flux.range(1, 10).delayUntil(o -> delayCondition.lock()).subscribe();
.....
delayCondition.release(); // call this once for each element
I guess there might be a better solution using sink.emitNext, but that would also require a condition variable for controlling the Flux flow.
As I understand it, in reactive programming your data should be considered at every operator step, so it might be better to design your consumer as a reactive processor. In my case that was not an option, so I followed the approach described above.
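Alternatively, sticking with the AtomicBoolean from the question, a minimal sketch (my own, not from the answer above) is to give delayUntil a Mono that polls the flag until it becomes true; the 100 ms poll interval is an arbitrary choice:
// completes only once the flag is true, checking it every 100 ms
private static Mono<Void> awaitReady(AtomicBoolean ready) {
    return Mono.defer(() -> ready.get()
            ? Mono.<Void>empty()
            : Mono.delay(Duration.ofMillis(100)).then(awaitReady(ready)));
}

// usage: each element is held back until the downstream signals readiness
AtomicBoolean downstreamReady = new AtomicBoolean(false);
Flux.range(1, Integer.MAX_VALUE)
    .delayElements(Duration.ofMillis(1000))
    .delayUntil(evt -> awaitReady(downstreamReady))
    .subscribe(System.out::println);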

Kafka Streams - The state store may have migrated to another instance

I'm writing a basic application to test the Interactive Queries feature of Kafka Streams. Here is the code:
public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    KeyValueBytesStoreSupplier waypointsStoreSupplier = Stores.persistentKeyValueStore("test-store");
    StoreBuilder waypointsStoreBuilder = Stores.keyValueStoreBuilder(waypointsStoreSupplier, Serdes.ByteArray(), Serdes.Integer());

    final KStream<byte[], byte[]> waypointsStream = builder.stream("sample1");
    final KStream<byte[], TruckDriverWaypoint> waypointsDeserialized = waypointsStream
            .mapValues(CustomSerdes::deserializeTruckDriverWaypoint)
            .filter((k, v) -> v.isPresent())
            .mapValues(Optional::get);

    waypointsDeserialized.groupByKey().aggregate(
            () -> 1,
            (aggKey, newWaypoint, aggValue) -> {
                aggValue = aggValue + 1;
                return aggValue;
            },
            Materialized.<byte[], Integer, KeyValueStore<Bytes, byte[]>>as("test-store").withKeySerde(Serdes.ByteArray()).withValueSerde(Serdes.Integer())
    );

    final KafkaStreams streams = new KafkaStreams(builder.build(), new StreamsConfig(createStreamsProperties()));
    streams.cleanUp();
    streams.start();

    ReadOnlyKeyValueStore<byte[], Integer> keyValueStore = streams.store("test-store", QueryableStoreTypes.keyValueStore());

    KeyValueIterator<byte[], Integer> range = keyValueStore.all();
    while (range.hasNext()) {
        KeyValue<byte[], Integer> next = range.next();
        System.out.println(next.value);
    }

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}

protected static Properties createStreamsProperties() {
    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "random167");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Serdes.Integer().getClass().getName());
    //streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
    return streamsConfiguration;
}
So my problem is, every time I run this I get this same error:
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, test-store, may have migrated to another instance.
I'm running only 1 instance of the application, and the topic I'm consuming from has only 1 partition.
Any idea what I'm doing wrong?
Looks like you have a race condition. From the Kafka Streams Javadoc for KafkaStreams::start(), it says:
Start the KafkaStreams instance by starting all its threads. This function is expected to be called only once during the life cycle of the client.
Because threads are started in the background, this method does not block.
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html
You're calling streams.store() immediately after streams.start(), but I'd wager that you're in a state where it hasn't initialized fully yet.
Since this code appears to be just for testing, add a Thread.sleep(5000) or something in there and give it a go (this is not a solution for production). Depending on your input rate into the topic, that'll probably give the store a bit of time to start filling up with events so that your KeyValueIterator actually has something to process/print.
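A slightly less arbitrary (but still test-only) variant of the same idea, assuming the code from the question, is to poll the client state until it reports RUNNING before querying the store; note that even then a store can briefly be unavailable during a rebalance, which the retry loop shown further below handles more robustly:
streams.start();

// wait until the Streams client has actually transitioned to RUNNING
while (streams.state() != KafkaStreams.State.RUNNING) {
    Thread.sleep(100); // the enclosing method must handle InterruptedException
}

ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        streams.store("test-store", QueryableStoreTypes.keyValueStore());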
Probably not applicable to OP but might help others:
In trying to retrieve a KTable's store, make sure the KTable's topic exists first or you'll get this exception.
In my case, I had failed to call the StoreBuilder before consuming the store.
Typically this happens for two reasons:
1. The local KafkaStreams instance is not yet ready (i.e., not yet in runtime state RUNNING, see Run-time Status Information) and thus its local state stores cannot be queried yet.
2. The local KafkaStreams instance is ready (e.g., in runtime state RUNNING), but the particular state store was just migrated to another instance behind the scenes. This may notably happen during the startup phase of a distributed application or when you are adding/removing application instances.
https://docs.confluent.io/platform/current/streams/faq.html#handling-invalidstatestoreexception-the-state-store-may-have-migrated-to-another-instance
The simplest approach is to guard against InvalidStateStoreException when calling KafkaStreams#store():
// Example: Wait until the store of type T is queryable. When it is, return a reference to the store.
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
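For the store in the question, the call would then look something like this (hedged sketch; the explicit type arguments are only there to make the generics line up):
ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        waitUntilStoreIsQueryable("test-store", QueryableStoreTypes.<byte[], Integer>keyValueStore(), streams);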

(Twitter) Storm's Window On Aggregation

I'm playing around with Storm, and I'm wondering where Storm specifies (if possible) the (tumbling/sliding) window size for an aggregation. E.g., if we want to find the trending topics for the previous hour on Twitter, how do we specify that a bolt should return results every hour? Is this done programmatically inside each bolt? Or is there some way to specify a "window"?
Disclaimer: I wrote the Trending Topics with Storm article referenced by gakhov in his answer above.
I'd say the best practice is to use the so-called tick tuples in Storm 0.8+. With these you can configure your own spouts/bolts to be notified at certain time intervals (say, every ten seconds or every minute).
Here's a simple example that configures the component in question to receive tick tuples every ten seconds:
// in your spout/bolt
@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    int tickFrequencyInSeconds = 10;
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, tickFrequencyInSeconds);
    return conf;
}
You can then use a conditional switch in your spout/bolt's execute() method to distinguish "normal" incoming tuples from the special tick tuples. For instance:
// in your spout/bolt
@Override
public void execute(Tuple tuple) {
    if (isTickTuple(tuple)) {
        // now you can trigger e.g. a periodic activity
    } else {
        // do something with the normal tuple
    }
}

private static boolean isTickTuple(Tuple tuple) {
    return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
            && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
}
Again, I wrote a pretty detailed blog post about doing this in Storm a few days ago as gakhov pointed out (shameless plug!).
Add a new spout with a parallelism degree of 1, and have it emit an empty signal and then Utils.sleep until the next time (all done in nextTuple). Then link all relevant bolts to that spout using all-grouping, so all of their instances will receive that same signal.
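Here is a rough, hedged sketch of that approach (assuming current Storm packages under org.apache.storm; ClockSpout and TrendingTopicsBolt are hypothetical names, and the one-hour sleep stands in for "until the next time"):
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class ClockSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // emit one signal tuple, then sleep until the next window boundary (one hour here)
        collector.emit(new Values(System.currentTimeMillis()));
        Utils.sleep(60 * 60 * 1000L);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ts"));
    }
}

and the wiring in the topology:
builder.setSpout("clock", new ClockSpout(), 1);
builder.setBolt("trending", new TrendingTopicsBolt(), 10)
       .shuffleGrouping("tweets")
       .allGrouping("clock"); // every TrendingTopicsBolt instance receives the same clock signal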
