Using tick tuples with Trident in Storm - apache-storm

I am able to use a standard spout/bolt combination to do streaming aggregation, and it works very well in the happy case, using tick tuples to persist data at some interval so as to make use of batching. Right now I am doing some of the failure management (tracking tuples that were not saved, etc.) myself, i.e. not out of the box from Storm.
But I have read that Trident gives you a higher abstraction and better failure management.
What I don't understand is whether there is tick tuple support in Trident. Basically, I would like to batch in memory for the current minute or so and persist any aggregated data for the previous minutes using Trident.
Any pointers or design suggestions would be helpful.
Thanks

Actually, micro-batching is a built-in Trident feature. You don't need any tick tuples for that. When you have something like this in your code:
topology
    .newStream("myStream", spout)
    .partitionPersist(
        ElasticSearchEventState.getFactoryFor(connectionProvider),
        new Fields("field1", "field2"),
        new ElasticSearchEventUpdater()
    )
(I'm using my custom ElasticSearch state/updater here; you might use something else.)
So when you have something like this, under the hood Trident groups your stream into batches and performs the partitionPersist operation not on individual tuples but on those batches.
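If the reason for wanting tick tuples is only to control how often those batches are emitted (and therefore persisted), the batch emit interval can be tuned through the topology configuration. A minimal sketch (my addition, not from the original answer), assuming Storm's standard Config constant; the one-minute value is only an illustration:
import org.apache.storm.Config; // backtype.storm.Config on pre-1.0 Storm releases

public class TridentBatchIntervalConfig {
    // Returns a Config that makes Trident emit a new batch roughly every intervalMillis.
    public static Config batchEveryMillis(int intervalMillis) {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, intervalMillis);
        return conf; // pass this Config when submitting the topology, e.g. with 60 * 1000
    }
}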
If you still need tick tuples for any reason, just create your own tick spout; something like this works for me:
import java.util.Map;

// On Storm 1.x+ these classes live under org.apache.storm;
// older releases use the backtype.storm / storm.trident packages.
import org.apache.storm.task.TopologyContext;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class TickSpout implements IBatchSpout {

    public static final String TIMESTAMP_FIELD = "timestamp";

    private final long delay;

    public TickSpout(long delay) {
        this.delay = delay;
    }

    @Override
    public void open(Map conf, TopologyContext context) {
    }

    @Override
    public void emitBatch(long batchId, TridentCollector collector) {
        // Sleep for the configured delay, then emit one "tick" tuple per batch.
        Utils.sleep(delay);
        collector.emit(new Values(System.currentTimeMillis()));
    }

    @Override
    public void ack(long batchId) {
    }

    @Override
    public void close() {
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }

    @Override
    public Fields getOutputFields() {
        return new Fields(TIMESTAMP_FIELD);
    }
}
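A hedged usage sketch (not part of the original answer): the tick stream is just another Trident stream, so you can hang whatever periodic work you need off it. FlushAggregatesFunction is a hypothetical function of yours that flushes the in-memory aggregates.
TridentTopology topology = new TridentTopology();
topology
    .newStream("ticks", new TickSpout(60 * 1000))          // roughly one batch per minute
    .each(new Fields(TickSpout.TIMESTAMP_FIELD),
          new FlushAggregatesFunction(),                    // hypothetical flush logic
          new Fields());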

Related

How to process multiple AMQP messages in parallel with the same @Incoming method

Is it possible to process multiple AMQP messages in parallel with the same method annotated with @Incoming("queue"), using Quarkus and smallrye-reactive-messaging?
To be more precise, I have the following class:
@ApplicationScoped
public class Receiver {

    @Incoming("test-queue")
    public void process(String input) {
        System.out.println("start processing:" + input);
        try {
            Thread.sleep(10_000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("end processing:" + input);
    }
}
With the configuration in the application.properties:
amqp-host: localhost
amqp-port: 5672
amqp-username: quarkus
amqp-password: quarkus
mp.messaging.incoming.test-queue.connector: smallrye-amqp
mp.messaging.incoming.test-queue.address: test-queue
Now I'd like to define by configuration how many messages can be processed in parallel. For example, on a 4-core CPU it should run 4 in parallel.
Currently I can just add 4 copies of the method with different names to allow this parallelism, but that is not configurable.
I'm not sure, but I don't think Reactive Messaging supports what you're asking for.
You can, however, do what you want another way. I think it's also a better overall pattern for using messaging.
http://smallrye.io/smallrye-reactive-messaging/smallrye-reactive-messaging/2.5/amqp/amqp.html#amqp-inbound
Find the example with the CompletionStage and the explicit ack(). That variant is asynchronous, so if you combine it with Java's existing concurrency facilities, you'll get efficient parallel processing.
I would send the incoming work to an executor, and then have the executing task ack() when it completes.
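A sketch of that executor-based approach, assuming the MicroProfile Reactive Messaging Message/ack() API; the fixed pool size of 4, the class name and the processing body are illustrative only:
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.enterprise.context.ApplicationScoped;

import org.eclipse.microprofile.reactive.messaging.Incoming;
import org.eclipse.microprofile.reactive.messaging.Message;

@ApplicationScoped
public class ParallelReceiver {

    // Hypothetical pool size; make it configurable as needed.
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    @Incoming("test-queue")
    public CompletionStage<Void> process(Message<String> message) {
        // Hand the work off to the pool and ack only once processing has finished.
        return CompletableFuture
                .runAsync(() -> doWork(message.getPayload()), executor)
                .thenCompose(ignored -> message.ack());
    }

    private void doWork(String input) {
        System.out.println("start processing:" + input);
        try {
            Thread.sleep(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("end processing:" + input);
    }
}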
I just came across the same scenario and here is how the spec intends for you to handle concurrency:
From the Eclipse MicroProfile spec
Basically, instead of having a class with a method like this:
@Incoming("test-queue")
public void process(String input) {}
You have 2 classes like this:
@ApplicationScoped
public class MessageSubscriberProducer {

    @Incoming("test-queue")
    public Subscriber<String> createSubscriber() {
        return new SubscriberImpl();
    }
}
public class SubscriberImpl implements Subscriber<String> {
    private Subscription subscription;

    @Override
    public void onSubscribe(Subscription subscription) {
        this.subscription = subscription;
        this.subscription.request(4); // this tells how many messages to grab right away
    }

    @Override
    public void onNext(String val) {
        // do processing
        this.subscription.request(1); // grab 1 more
    }

    @Override
    public void onError(Throwable throwable) {
        // handle the failure
    }

    @Override
    public void onComplete() {
    }
}
This has the additional advantage of moving your processing code from the vert.x event-loop thread to a worker thread pool.

How can I see the current output of a running Storm topology?

Currently learning how to use Storm (version 2.1.0), I am a bit confused about a specific aspect of this data stream processing (DSP) engine: how is output data handled? Tutorials provide good explanations of system setup and running a first application. Unfortunately, I didn't find a page providing details on the results generated by a topology.
With DSP applications, there is no final output, because the input is a continuously incoming stream of data (or maybe we can say there is a final output when the application is stopped). What I would like is to be able to see the current output (the actual output data generated at the current time) of a running topology.
I'm able to run WordCountTopology. I understand the output of this topology is generated by the following snippet of code:
public static class WordCount extends BaseBasicBolt {

    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
My question concerns the location of the <"word":string, "count":int> output. Is it only in memory, written to a database somewhere, or written to a file?
Going further with this question: what are the existing possibilities for storing in-progress output data? What is the "good way" of handling such data?
I hope my question is not too naive. And thanks to the StackOverflow community for always providing good help.
A few days have passed since I posted this question, and I am back to share what I have tried. Although I cannot tell whether it is the right way of doing it, the following two approaches answer my question.
Simple System.out.println()
The first thing I tried was a System.out.println("Hello World!") directly within the prepare() method of my BaseBasicBolt. This method is called only once, at the beginning of each bolt's thread execution.
public void prepare(Map topoConf, TopologyContext context) {
    System.out.println("Hello World!");
}
The big challenge was to figure out where the log is written. By default, it is written to <storm installation folder>/logs/workers-artifacts/<topology name>/<worker-port>/worker.log, where <worker-port> is the port of a requested worker/slot.
For instance, with conf.setNumWorkers(3), the topology requests access to 3 workers (3 slots). Therefore, the values of <worker-port> will be 6700, 6701 and 6702. Those values are the port numbers of the 3 slots (defined in storm.yaml under supervisor.slots.ports).
Note: you will have as many "Hello World!" messages as the parallelism of your BaseBasicBolt. When the split bolt is instantiated with builder.setBolt("split", new SplitSentence(), 8), there are 8 parallel threads, each one writing to its own log.
Writing to a file
For research purposes I have to analyse large amounts of logs that I need in a specific format. The solution I found is to append the logs to a specific file managed by each bolt.
Hereafter is my own implementation of this file logging solution for the count bolt.
public static class WordCount extends BaseBasicBolt {

    private String workerName;
    private FileWriter fw;
    private BufferedWriter bw;
    private PrintWriter out;
    private String logFile = "/var/log/storm/count.log";
    private Map<String, Integer> counts = new HashMap<String, Integer>();

    public void prepare(Map topoConf, TopologyContext context) {
        this.workerName = this.toString();
        try {
            this.fw = new FileWriter(logFile, true);
            this.bw = new BufferedWriter(fw);
            this.out = new PrintWriter(bw);
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
        out.println(this.workerName + ": Hello World!");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
In this code, my log file is located at /var/log/storm/count.log and calling out.println(text) appends the text at the end of this file. As I am not sure whether this is thread-safe, parallel threads writing into the same file at the same time might result in data loss.
Note: if your bolts are distributed across multiple machines, each machine is going to have its own log file. During my tests, I configured a simple cluster with 1 machine (running Nimbus + Supervisor + UI), so I had only 1 log file.
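A possible mitigation for the thread-safety concern above (my own sketch, not part of the original answer): give each bolt task its own file by including the task id from the TopologyContext in the file name, so parallel tasks never share a file:
public void prepare(Map topoConf, TopologyContext context) {
    this.workerName = this.toString();
    // One file per task, e.g. /var/log/storm/count-7.log
    String perTaskLogFile = "/var/log/storm/count-" + context.getThisTaskId() + ".log";
    try {
        this.fw = new FileWriter(perTaskLogFile, true);
        this.bw = new BufferedWriter(fw);
        this.out = new PrintWriter(bw);
    } catch (Exception e) {
        System.out.println(e);
    }
}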
Conclusion
There are multiple ways to deal with output data and, more generally, with logging anything in Storm. I didn't find an official way of doing it, and the documentation is very light on this subject.
While some of us would be satisfied with a simple System.out.println(), others might need to push large quantities of data into specific files, or maybe into a specialized database engine. Anything you can do with Java is possible with Storm, because it is plain Java programming.
Any advice and additional comments to complete this answer will be gladly appreciated.

Spring Batch question for email summary at the end of all jobs

We have approximately 20 different Spring Batch jobs (some running as microservices, some lumped together in one Spring Boot app). What I need to do is gather all the errors encountered by ALL the jobs, as well as the number of records processed, and summarize it all in an email.
I have implemented ItemListenerSupport as a start:
public class BatchItemListener extends ItemListenerSupport<BaseDomainDataObject, BaseDomainDataObject> {

    private final static Log logger = LogFactory.getLog(BatchItemListener.class);
    private final static Map<String, Integer> numProcessedMap = new HashMap<>();
    private final static Map<String, String> errorMap = new HashMap<>();

    @Override
    public void onReadError(Exception ex) {
        logger.error("Encountered error on read", ex);
    }

    @Override
    public void onProcessError(BaseDomainDataObject item, Exception ex) {
        String msgBody = ExceptionUtils.getStackTrace(ex);
        errorMap.put(item.toString(), msgBody);
    }

    @Override
    public void onWriteError(Exception ex, List<? extends BaseDomainDataObject> items) {
        logger.error("Encountered error on write", ex);
        numProcessedMap.computeIfAbsent("numErrors", val -> items.size());
    }

    @Override
    public void afterWrite(List<? extends BaseDomainDataObject> items) {
        logger.info("Logging successful number of items written...");
        numProcessedMap.computeIfAbsent("numSuccess", val -> items.size());
    }
}
But how do I access the errors I accumulate in the listener when my batch jobs are finally finished? Right now I don't even have a good way to know when they are all finished. Any suggestions? Does Spring Batch provide something better for summarizing jobs?
Spring Batch does not provide a way to orchestrate jobs. The closest you can get out of the box is a "master" job with multiple steps of type JobStep that delegate to your sub-jobs. With this approach, you can do the aggregation in a JobExecutionListener#afterJob configured on the master job.
Otherwise, you can use Spring Cloud Data Flow and create a composed task of all your jobs.
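For the master-job approach, here is a minimal sketch, assuming Spring Batch's JobStep support via the step builders and @EnableBatchProcessing somewhere in the application; the bean names, the two sub-jobs and the summary logic are hypothetical placeholders:
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MasterJobConfig {

    @Bean
    public Job masterJob(JobBuilderFactory jobs,
                         StepBuilderFactory steps,
                         JobLauncher jobLauncher,
                         Job firstSubJob,    // hypothetical: one of your existing jobs
                         Job secondSubJob) { // hypothetical: another existing job

        // Each step is a JobStep that simply runs one of the sub-jobs.
        Step runFirst = steps.get("runFirstSubJob")
                .job(firstSubJob)
                .launcher(jobLauncher)
                .build();
        Step runSecond = steps.get("runSecondSubJob")
                .job(secondSubJob)
                .launcher(jobLauncher)
                .build();

        return jobs.get("masterJob")
                .start(runFirst)
                .next(runSecond)
                .listener(new JobExecutionListener() {
                    @Override
                    public void beforeJob(JobExecution jobExecution) {
                    }

                    @Override
                    public void afterJob(JobExecution jobExecution) {
                        // All sub-jobs have finished here: aggregate the error and
                        // count maps collected by your item listeners and send the
                        // summary email.
                    }
                })
                .build();
    }
}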

Subscription to UnicastProcessor never triggers

I wish to batch and process items as they come along, so I created a UnicastProcessor and subscribed to it like this:
UnicastProcessor<String> processor = UnicastProcessor.create()
processor
    .bufferTimeout(10, Duration.ofMillis(500))
    .subscribe(new Subscriber<List<String>>() {
        @Override
        public void onSubscribe(Subscription subscription) {
            System.out.println("OnSubscribe");
        }

        @Override
        public void onNext(List<String> strings) {
            System.out.println("OnNext");
        }

        @Override
        public void onError(Throwable throwable) {
            System.out.println("OnError");
        }

        @Override
        public void onComplete() {
            System.out.println("OnComplete");
        }
    });
And then, for testing purposes, I created a new thread and started adding items in a loop:
new Thread(() -> {
    int limit = 100
    i = 0
    while(i < limit) {
        ++i
        processor.sink().next("Hello $i")
    }
    System.out.println("Published all")
}).start()
After running this (and letting the main thread sleep for 5 seconds) I can see that all items have been published, but the subscriber does not trigger on any of the events, so I can't process any of the published items.
What am I doing wrong here?
The Reactive Streams specification is the answer!
The total number of onNext's signalled by a Publisher to a Subscriber MUST be less than or equal to the total number of elements requested by that Subscriber's Subscription at all times. [Rule 1.1]
In your example, you simply provide a subscriber that does nothing. In turn, the Reactive Streams specification directly says that nothing will happen (there will be no onNext invocation) if you have not called the Subscription#request method:
A Subscriber MUST signal demand via Subscription.request(long n) to
receive onNext signals. [Rule 2.1]
Thus, to fix your problem, one of the possible solutions is changing the code in the following way:
UnicastProcessor<String> processor = UnicastProcessor.create()
processor
    .bufferTimeout(10, Duration.ofMillis(500))
    .subscribe(new Subscriber<List<String>>() {
        @Override
        public void onSubscribe(Subscription subscription) {
            System.out.println("OnSubscribe");
            subscription.request(Long.MAX_VALUE);
        }

        @Override
        public void onNext(List<String> strings) {
            System.out.println("OnNext");
        }

        @Override
        public void onError(Throwable throwable) {
            System.out.println("OnError");
        }

        @Override
        public void onComplete() {
            System.out.println("OnComplete");
        }
    });
Note that in this example a demand of Long.MAX_VALUE means unbounded demand, so all messages will be pushed directly to the given Subscriber. [Rule 3.17]
Use UnicastProcessor correctly
On the one hand, your example will work correctly with the mentioned fixes. On the other hand, each invocation of FluxProcessor#sink() (yes, sink is a FluxProcessor method) leads to a redundant call of UnicastProcessor's onSubscribe method, which under the hood causes a few atomic reads and writes that can be avoided by creating the FluxSink once and safely using it as many times as needed. For example:
UnicastProcessor<String> processor = UnicastProcessor.create()
FluxSink<String> sink = processor.serialize().sink();
...
new Thread(() -> {
    int limit = 100
    i = 0
    while(i < limit) {
        ++i
        sink.next("Hello $i")
    }
    System.out.println("Published all")
}).start()
Note that in this example I call an additional method, serialize(), which provides a thread-safe sink and ensures that calling FluxSink#next concurrently will not violate the Reactive Streams spec.

How to close a database connection opened by an IBackingMap implementation within a Storm Trident topology?

I'm implementing an IBackingMap for my Trident topology to store tuples to ElasticSearch. (I know there are several existing implementations of Trident/ElasticSearch integration on GitHub; however, I've decided to implement a custom one which suits my task better.)
So my implementation is a classic one with a factory:
public class ElasticSearchBackingMap implements IBackingMap<OpaqueValue<BatchAggregationResult>> {

    // omitting here some other cool stuff...
    private final Client client;

    public static StateFactory getFactoryFor(final String host, final int port, final String clusterName) {
        return new StateFactory() {
            @Override
            public State makeState(Map conf, IMetricsContext metrics, int partitionIndex, int numPartitions) {
                ElasticSearchBackingMap esbm = new ElasticSearchBackingMap(host, port, clusterName);
                CachedMap cm = new CachedMap(esbm, LOCAL_CACHE_SIZE);
                MapState ms = OpaqueMap.build(cm);
                return new SnapshottableMap(ms, new Values(GLOBAL_KEY));
            }
        };
    }

    public ElasticSearchBackingMap(String host, int port, String clusterName) {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", clusterName).build();
        // TODO add a possibility to close the client
        client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(host, port));
    }

    // the actual implementation is left out
}
As you can see, it gets host/port/cluster name as input params and creates an ElasticSearch client as a member of the class, BUT IT NEVER CLOSES THE CLIENT.
It is then used from within a topology in a pretty familiar way:
tridentTopology.newStream("spout", spout)
    // ...some processing steps here...
    .groupBy(aggregationFields)
    .persistentAggregate(
        ElasticSearchBackingMap.getFactoryFor(
            ElasticSearchConfig.ES_HOST,
            ElasticSearchConfig.ES_PORT,
            ElasticSearchConfig.ES_CLUSTER_NAME
        ),
        new Fields(FieldNames.OUTCOME),
        new BatchAggregator(),
        new Fields(FieldNames.AGGREGATED));
This topology is wrapped in a public static void main, packed into a jar and sent to Storm for execution.
The question is: should I worry about closing the ElasticSearch connection, or is it Storm's own business? If it is not done by Storm, how and when in the topology's lifecycle should I do that?
Thanks in advance!
Okay, answering my own question.
First of all, thanks again @dedek for the suggestions and for reviving the ticket in Storm's Jira.
Finally, since there's no official way to do that, I've decided to go for the cleanup() method of Trident's Filter. So far I've verified the following (for Storm v. 0.9.4):
With LocalCluster:
- cleanup() gets called on the cluster's shutdown;
- cleanup() DOESN'T get called when killing the topology; this shouldn't be a tragedy, as very likely one won't use LocalCluster for real deployments anyway.
With a real cluster:
- it gets called when the topology is killed, as well as when the worker is stopped using pkill -TERM -u storm -f 'backtype.storm.daemon.worker';
- it doesn't get called if the worker is killed with kill -9, or when it crashes, or - sadly - when the worker dies due to an exception.
Overall that gives a more or less decent guarantee of cleanup() getting called, provided you are careful with exception handling (I tend to add 'thundercatches' to every one of my Trident primitives anyway).
My code:
public class CloseFilter implements Filter {

    private static final Logger LOG = LoggerFactory.getLogger(CloseFilter.class);

    private final Closeable[] closeables;

    public CloseFilter(Closeable... closeables) {
        this.closeables = closeables;
    }

    @Override
    public boolean isKeep(TridentTuple tuple) {
        return true;
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {
    }

    @Override
    public void cleanup() {
        for (Closeable c : closeables) {
            try {
                c.close();
            } catch (Exception e) {
                LOG.warn("Failed to close an instance of {}", c.getClass(), e);
            }
        }
    }
}
However, it would be nice if some day hooks for closing connections became part of the API.
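A hedged wiring sketch (not in the original answer): the filter is attached like any other Trident operation, here right before the aggregation. sharedClient is a hypothetical Closeable that must be reachable (and serializable or otherwise obtainable) where the topology is built; resources created lazily inside the state factory would need to be exposed some other way, e.g. via a static holder.
tridentTopology.newStream("spout", spout)
    // ...some processing steps here...
    .each(new Fields(FieldNames.OUTCOME), new CloseFilter(sharedClient)) // cleanup() closes it
    .groupBy(aggregationFields)
    .persistentAggregate(
        ElasticSearchBackingMap.getFactoryFor(
            ElasticSearchConfig.ES_HOST,
            ElasticSearchConfig.ES_PORT,
            ElasticSearchConfig.ES_CLUSTER_NAME
        ),
        new Fields(FieldNames.OUTCOME),
        new BatchAggregator(),
        new Fields(FieldNames.AGGREGATED));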
