(Twitter) Storm's Window On Aggregation - apache-storm

I'm playing around with Storm, and I'm wondering where Storm specifies (if possible) the (tumbling/sliding) window size for an aggregation. E.g., if we want to find the trending topics for the previous hour on Twitter, how do we specify that a bolt should return results every hour? Is this done programmatically inside each bolt, or is there some way to specify a "window"?

Disclaimer: I wrote the Trending Topics with Storm article referenced by gakhov in his answer above.
I'd say the best practice is to use the so-called tick tuples in Storm 0.8+. With these you can configure your own spouts/bolts to be notified at certain time intervals (say, every ten seconds or every minute).
Here's a simple example that configures the component in question to receive tick tuples every ten seconds:
// in your spout/bolt
@Override
public Map<String, Object> getComponentConfiguration() {
    Config conf = new Config();
    int tickFrequencyInSeconds = 10;
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, tickFrequencyInSeconds);
    return conf;
}
You can then use a conditional switch in your spout/bolt's execute() method to distinguish "normal" incoming tuples from the special tick tuples. For instance:
// in your spout/bolt
@Override
public void execute(Tuple tuple) {
    if (isTickTuple(tuple)) {
        // now you can trigger e.g. a periodic activity
    } else {
        // do something with the normal tuple
    }
}

private static boolean isTickTuple(Tuple tuple) {
    return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
        && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
}
Again, I wrote a pretty detailed blog post about doing this in Storm a few days ago as gakhov pointed out (shameless plug!).

Add a new spout with a parallelism degree of 1, and have it emit an empty signal and then Utils.sleep until the next interval (all done in nextTuple). Then link all relevant bolts to that spout using all-grouping, so all of their instances receive the same signal.
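A rough sketch of that approach (ClockSpout, AggregatorBolt and the "data-spout" id are made-up names for illustration):
// Hypothetical "clock" spout: emits one signal tuple, then sleeps until the next interval.
public class ClockSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        collector.emit(new Values("tick"));   // empty/marker signal
        Utils.sleep(60 * 60 * 1000L);         // sleep until the next window boundary
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("signal"));
    }
}

// Wiring: parallelism 1 for the clock spout, all-grouping so every bolt instance gets the signal.
builder.setSpout("clock", new ClockSpout(), 1);
builder.setBolt("aggregator", new AggregatorBolt(), 4)
       .shuffleGrouping("data-spout")
       .allGrouping("clock");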

Related

Reactor Flux conditional emit

Is it possible to allow emitting values from a Flux conditionally based on a global boolean variable?
I'm working with Flux delayUntil(...) but not able to fully grasp the functionality or my assumptions are wrong.
I have a global AtomicBoolean that represents the availability of a downstream connection and only want the upstream Flux to emit if the downstream is ready to process.
To represent the scenario, I created a (not working) test sample:
// Randomly generates a boolean value every 5 seconds
private Flux<Boolean> signalGenerator() {
    return Flux.range(1, Integer.MAX_VALUE)
            .delayElements(Duration.ofMillis(5000))
            .map(integer -> new Random().nextBoolean());
}
and
Flux.range(1, Integer.MAX_VALUE)
    .delayElements(Duration.ofMillis(1000))
    .delayUntil(evt -> signalGenerator()) // ?? Only proceed when signalGenerator returns true
    .subscribe(System.out::println);
I have another scenario where a downstream process can accept only x messages a second. In the current non-reactive implementation we have a Semaphore of x permits and the thread is blocked if no more permits are available, with Semaphore permits resetting every second.
In both scenarios I want the upstream Flux to emit only when there is demand from the downstream process, and I do not want to buffer.
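For reference, a minimal sketch of the non-reactive limiter described above (x, sendDownstream and message are placeholders):
// x permits per second; a scheduler tops the bucket back up once per second
final int x = 5; // placeholder rate
final Semaphore permits = new Semaphore(x);
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(
        () -> permits.release(x - permits.availablePermits()),
        1, 1, TimeUnit.SECONDS);

// producer thread (acquire() throws InterruptedException): blocks when the
// budget for the current second is used up
permits.acquire();
sendDownstream(message); // placeholder for the actual downstream call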
You might consider using Mono.fromRunnable() as an input to delayUntil(), like below.
Helper class:
public class FluxCondition {
    CountDownLatch latch = new CountDownLatch(10); // it depends, might be managed somehow
    Runnable r = () -> {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    };
    public Mono<Void> lock() { return Mono.fromRunnable(r); }
    public void release() { latch.countDown(); }
}
Usage:
FluxCondition delayCondition = new FluxCondition();
Flux.range(1, 10).delayUntil(o -> delayCondition.lock()).subscribe();
.....
delayCondition.release(); // shall call this for each element
I guess there might be a better solution using sink.emitNext, but that would probably also require a condition variable for controlling the Flux flow.
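For what it's worth, a rough sketch of that sink-based idea (Reactor 3.4+ Sinks API; the AtomicBoolean gate and the spin-wait are assumptions, not a recommendation):
// downstream readiness flag, as in the question
AtomicBoolean downstreamReady = new AtomicBoolean(true);

// unicast sink: elements are pushed explicitly via tryEmitNext
Sinks.Many<Integer> sink = Sinks.many().unicast().onBackpressureBuffer();
sink.asFlux().subscribe(System.out::println);

// producer side: wait until the downstream is ready before emitting each element
for (int i = 1; i <= 10; i++) {
    while (!downstreamReady.get()) {
        Thread.onSpinWait(); // a real implementation would block on a condition variable instead
    }
    sink.tryEmitNext(i);
}
sink.tryEmitComplete();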
As I understand it, in reactive programming your data should be considered at every operator step, so it might be better to design your consumer as a reactive processor. In my case I had no choice and followed the approach I described above.

Prometheus + Micrometer: how to record time intervals and success/failure rates

I am sending from a front-end client to a metrics-microservice a JSON with the following data:
{
    totalTimeOnTheNetwork: number;
    timeElasticsearch: number;
    isSuccessful: boolean;
}
The metrics-microservice currently handles the data like this:
@AllArgsConstructor
@Service
public class ClientMetricsService {

    @Autowired
    MeterRegistry registry; // abstract class, SimpleMeterRegistry gets injected

    public void metrics(final MetricsProperty metrics) {
        final long networkTime = metrics.getTotalTime() - metrics.getElasticTime();
        registry.timer(ELASTIC_TIME_LABEL).record(metrics.getElasticTime(), TimeUnit.MILLISECONDS);
        registry.timer(TOTAL_TIME_LABEL).record(metrics.getTotalTime(), TimeUnit.MILLISECONDS);
        registry.timer(NETWORK_TIME_LABEL).record(networkTime, TimeUnit.MILLISECONDS);
    }
}
As you can see I make a new metric for each of the time intervals. I was wondering if I can put all the intervals into one metric? It would be great if I did not have to calculate network-time on the metrics-microservice but rather in Grafana.
Also, could I put a success/failure tag inside the registry.timer? I assume I need to use a timer.builder on every request then like this:
Timer timer = Timer
    .builder("my.timer")
    .description("a description of what this timer does") // optional
    .tags("region", "test") // optional
    .register(registry);
Is that a typical way to do it (e.g. create a new timer on every HTTP request and link it to the registry), or should the timer be derived from the MeterRegistry as in my current version?
Or would you use another metric for logging success/failure? In the future the metric might change from a boolean to an HTTP error code, for example, so I am not sure how to implement it in a maintainable way.
Timer timer = Timer
    .builder("your-timer-name-here")
    .tags("ResponseStatus", String.valueOf(isSuccessful), "ResponseCode", String.valueOf(httpErrorCode))
    .register(registry);
timer.record(metrics.getTotalTime(), TimeUnit.MILLISECONDS);
This should be working code that answers your question, but I have a feeling there is a misunderstanding: why do you want everything in one metric?
Either way, you can probably sort that out with tags. I do not know the capabilities on the Grafana end, but it might be as simple as throwing the .getElasticTime info into another tag and sending it through.
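For example, a rough sketch of recording all three intervals under a single metric name and distinguishing them with tags (the metric name, tag names and the isSuccessful getter are assumptions):
public void metrics(final MetricsProperty metrics) {
    final long networkTime = metrics.getTotalTime() - metrics.getElasticTime();
    final String success = String.valueOf(metrics.isSuccessful()); // assumed getter

    // one metric name, sliced by tags, so Grafana can filter/derive as needed
    registry.timer("client.request.time", "phase", "elastic", "success", success)
            .record(metrics.getElasticTime(), TimeUnit.MILLISECONDS);
    registry.timer("client.request.time", "phase", "total", "success", success)
            .record(metrics.getTotalTime(), TimeUnit.MILLISECONDS);
    registry.timer("client.request.time", "phase", "network", "success", success)
            .record(networkTime, TimeUnit.MILLISECONDS);
}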

Kafka Streams - The state store may have migrated to another instance

I'm writing a basic application to test the Interactive Queries feature of Kafka Streams. Here is the code:
public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    KeyValueBytesStoreSupplier waypointsStoreSupplier = Stores.persistentKeyValueStore("test-store");
    StoreBuilder waypointsStoreBuilder = Stores.keyValueStoreBuilder(waypointsStoreSupplier, Serdes.ByteArray(), Serdes.Integer());

    final KStream<byte[], byte[]> waypointsStream = builder.stream("sample1");
    final KStream<byte[], TruckDriverWaypoint> waypointsDeserialized = waypointsStream
            .mapValues(CustomSerdes::deserializeTruckDriverWaypoint)
            .filter((k, v) -> v.isPresent())
            .mapValues(Optional::get);

    waypointsDeserialized.groupByKey().aggregate(
            () -> 1,
            (aggKey, newWaypoint, aggValue) -> {
                aggValue = aggValue + 1;
                return aggValue;
            },
            Materialized.<byte[], Integer, KeyValueStore<Bytes, byte[]>>as("test-store").withKeySerde(Serdes.ByteArray()).withValueSerde(Serdes.Integer())
    );

    final KafkaStreams streams = new KafkaStreams(builder.build(), new StreamsConfig(createStreamsProperties()));
    streams.cleanUp();
    streams.start();

    ReadOnlyKeyValueStore<byte[], Integer> keyValueStore = streams.store("test-store", QueryableStoreTypes.keyValueStore());

    KeyValueIterator<byte[], Integer> range = keyValueStore.all();
    while (range.hasNext()) {
        KeyValue<byte[], Integer> next = range.next();
        System.out.println(next.value);
    }

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}

protected static Properties createStreamsProperties() {
    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "random167");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Serdes.Integer().getClass().getName());
    //streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
    return streamsConfiguration;
}
So my problem is, every time I run this I get this same error:
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, test-store, may have migrated to another instance.
I'm running only 1 instance of the application, and the topic I'm consuming from has only 1 partition.
Any idea what I'm doing wrong?
Looks like you have a race condition. The Kafka Streams javadoc for KafkaStreams::start() says:
Start the KafkaStreams instance by starting all its threads. This function is expected to be called only once during the life cycle of the client.
Because threads are started in the background, this method does not block.
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html
You're calling streams.store() immediately after streams.start(), but I'd wager that you're in a state where it hasn't initialized fully yet.
Since this code appears to be just for testing, add a Thread.sleep(5000) or something in there and give it a go (this is not a solution for production). Depending on your input rate into the topic, that'll probably give the store a bit of time to start filling up with events so that your KeyValueIterator actually has something to process/print.
Probably not applicable to OP but might help others:
In trying to retrieve a KTable's store, make sure the KTable's topic exists first or you'll get this exception.
In my case, I had failed to register the StoreBuilder before consuming the store.
Typically this happens for two reasons:
1. The local KafkaStreams instance is not yet ready (i.e., not yet in runtime state RUNNING, see Run-time Status Information) and thus its local state stores cannot be queried yet.
2. The local KafkaStreams instance is ready (e.g. in runtime state RUNNING), but the particular state store was just migrated to another instance behind the scenes. This may notably happen during the startup phase of a distributed application or when you are adding/removing application instances.
https://docs.confluent.io/platform/current/streams/faq.html#handling-invalidstatestoreexception-the-state-store-may-have-migrated-to-another-instance
The simplest approach is to guard against InvalidStateStoreException when calling KafkaStreams#store():
// Example: Wait until the store of type T is queryable. When it is, return a reference to the store.
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
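In the original main() this would replace the direct store lookup, e.g.:
// instead of streams.store("test-store", QueryableStoreTypes.keyValueStore())
ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        waitUntilStoreIsQueryable("test-store", QueryableStoreTypes.keyValueStore(), streams);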

Twitter crawler: why does the memory grow?

I have been trying to crawl Twitter via the Streaming API, filtering the retrieved tweets by keywords/hashtags/users.
Here is my example using HBC (although the same problem happens with Twitter4J):
// After connection:
final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
StatusesFilterEndpoint filterQuery = new StatusesFilterEndpoint();
filterQuery.followings(myListOfUserIDs);
filterQuery.trackTerms(myListOfKeywordsAndHashtags);

final ExecutorService executor = Executors.newFixedThreadPool(4);
Runnable tweetAnalyzer = defineRunnable(queue);
for (int i = 0; i < NUM_THREADS; i++) {
    executor.execute(tweetAnalyzer);
}
where the analyzer tweetAnalyzer is returned by:
private Runnable defineRunnable(final BlockingQueue<String> queue) {
    return new Runnable() {
        @Override
        public void run() {
            while (true) {
                try {
                    System.out.println(queue.take());
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    };
}
However, the process continues to grow in memory.
Two questions:
How do I design this crawler properly, so that it does not grow in memory and does not saturate the RAM?
How do I select the best queue length (here set to 10000) so that it does not saturate? With this length the queue stays full of tweets (it never goes empty), and I am able to crawl 700 tweets/min, which is huge.
Thank you in advance.
It's a bit hard to determine from the snippets that you provide. Do you register StatusesFilterEndpoint correctly?
I would recommend that you write a separate thread to monitor the size of the queue.
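Something as simple as this would do (a sketch; queue is the BlockingQueue from your snippet, the interval is arbitrary):
// Simple monitor thread: logs the queue depth every few seconds so you can
// see whether the consumers keep up with the stream.
Thread monitor = new Thread(() -> {
    while (true) {
        System.out.println("queue size = " + queue.size());
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
    }
});
monitor.setDaemon(true);
monitor.start();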
Obviously you are not able to process all the Twitter messages you download, so you can only:
reduce the number of tweets you download by filtering more aggressively
sample the input by processing only every n-th message and throwing the rest away (see the sketch after this list)
use a faster machine, although for the tweetAnalyzer you show in the question this might not help
deploy on a cluster
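A sketch of the sampling option, keeping the structure of your defineRunnable (n is an arbitrary sampling factor):
// Keep only every n-th message and discard the rest, so the analyzers never fall behind.
private Runnable defineSamplingRunnable(final BlockingQueue<String> queue, final int n) {
    return () -> {
        long seen = 0;
        while (true) {
            try {
                String tweet = queue.take();
                if (++seen % n == 0) {   // process 1 out of every n tweets
                    System.out.println(tweet);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    };
}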

send output of two bolts to a single bolt in Storm?

What is the easiest way to send the output of BoltA and BoltB to BoltC? Do I have to use joins, or is there a simpler solution? A and B have the same fields (ts, metric_name, metric_count).
// KafkaSpout --> LogDecoder
builder.setBolt(LOGDECODER_BOLT_ID, logdecoderBolt, 10).shuffleGrouping(KAFKA_SPOUT_ID);
// LogDecoder --> CountBolt
builder.setBolt(COUNT_BOLT_ID, countBolt, 10).shuffleGrouping(LOGDECODER_BOLT_ID);
// LogDecoder --> HttpResCodeCountBolt
builder.setBolt(HTTP_RES_CODE_COUNT_BOLT_ID, http_res_code_count_bolt, 10).shuffleGrouping(LOGDECODER_BOLT_ID);
// And now I want to send CountBolt and HttpResCodeCountBolt output to the Aggregator Bolt.
// CountBolt --> AggregateBolt
builder.setBolt(AGGREGATE_BOLT_ID, aggregateBolt, 5).fieldsGrouping(COUNT_BOLT_ID, new Fields("ts"));
// HttpResCodeCountBolt --> AggregateBolt
builder.setBolt(AGGREGATE_BOLT_ID, aggregateBolt, 5).fieldsGrouping(HTTP_RES_CODE_COUNT_BOLT_ID, new Fields("ts"));
Is this possible ?
Yes. Just add a stream-id ("stream1" and "stream2" below) to the fieldsGrouping call:
BoltDeclarer bd = builder.setBolt(AGGREGATE_BOLT_ID, aggregateBolt, 5);
bd.fieldsGrouping((COUNT_BOLT_ID), "stream1", new Fields("ts"));
bd.fieldsGrouping((HTTP_RES_CODE_COUNT_BOLT_ID), "stream2", new Fields("ts"));
and then in the execute() method for BoltC you can test to see which stream the tuple came from:
public void execute(Tuple tuple) {
    if ("stream1".equals(tuple.getSourceStreamId())) {
        // this came from stream1
    } else if ("stream2".equals(tuple.getSourceStreamId())) {
        // this came from stream2
    }
}
Since you know which stream the tuple came from, you don't need to have the same shape of tuple on the two streams. You just de-marshall the tuple according to the stream-id.
You can also check to see which component the tuple came from (as I type this I think this might be more appropriate to your case) as well as the instance of the component (the task) that emitted the tuple.
As @Chris said, you can use streams. But you can also simply get the source component from the tuple.
@Override
public final void execute(final Tuple tuple) {
    final String sourceComponent = tuple.getSourceComponent();
    ....
}
The source component is the name you gave to the Bolt at the topology's initialization. For instance: COUNT_BOLT_ID.
