How can I see the current output of a running Storm topology? - apache-storm

Currently learning how to use Storm (version 2.1.0), I am a bit confused about a specific aspect of this data stream processing (DSP) engine: how is output data handled? Tutorials provide good explanations of system setup and running a first application. Unfortunately, I didn't find a page providing details on the results generated by a topology.
With DSP applications, there is no final output because the input is a continuously incoming stream of data (or maybe we can say there is a final output when the application is stopped). What I would like is to be able to see the current output (the actual output data generated at the current time) of a running topology.
I'm able to run WordCountTopology. I understand the output of this topology is generated by the following snippet of code:
public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
My misunderstanding is about the location of the <"word":string, "count":int> output. Is it only kept in memory, written to a database somewhere, or written to a file?
Going further with this question: what are the existing possibilities for storing in-progress output data? What is the "good way" of handling such data?
I hope my question is not too naive. And thanks to the StackOverflow community for always providing good help.

A few days have passed since I posted this question. I am back to share with you what I have tried. Although I cannot tell whether it is the right way of doing it, the following two approaches answer my question.
Simple System.out.println()
The first thing I tried was to call System.out.println("Hello World!") directly within the prepare() method of my BaseBasicBolt. This method is called only once, when each of the bolt's executor threads starts.
public void prepare(Map topoConf, TopologyContext context) {
    System.out.println("Hello World!");
}
The big challenge was to figure out where the log is written. By default, it is written within <storm installation folder>/logs/workers-artifacts/<topology name>/<worker-port>/worker.log where <worker-port> is the port of a requested worker/slot.
For instance, with conf.setNumWorkers(3), the topology requests access to 3 workers (3 slots). Therefore, the values of <worker-port> will be 6700, 6701 and 6702. Those values are the port numbers of the 3 slots (defined in storm.yaml under supervisor.slots.ports).
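For reference, here is roughly what requesting those 3 workers looks like at submission time. This is only a minimal sketch with placeholder names ("word-count", builder), not code copied from the original WordCountTopology:
// Sketch: ask Storm for 3 workers (slots) when submitting the topology
// (inside main, which declares throws Exception).
Config conf = new Config();
conf.setNumWorkers(3); // the topology will be spread over 3 worker ports, e.g. 6700, 6701, 6702
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());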
Note: you will have as many "Hello World!" lines as the parallelism of your BaseBasicBolt. When the split bolt is instantiated with builder.setBolt("split", new SplitSentence(), 8), it results in 8 parallel threads, each one writing its own log.
Writing to a file
For research purposes I have to analyse large amounts of logs, and I need them in a specific format. The solution I found is to append the logs to a specific file managed by each bolt.
Hereafter is my own implementation of this file logging solution for the count bolt.
public static class WordCount extends BaseBasicBolt {
    private String workerName;
    private FileWriter fw;
    private BufferedWriter bw;
    private PrintWriter out;
    private String logFile = "/var/log/storm/count.log";
    private Map<String, Integer> counts = new HashMap<String, Integer>();

    public void prepare(Map topoConf, TopologyContext context) {
        this.workerName = this.toString();
        try {
            this.fw = new FileWriter(logFile, true);
            this.bw = new BufferedWriter(fw);
            this.out = new PrintWriter(bw);
        } catch (Exception e) {
            System.out.println(e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
        out.println(this.workerName + ": Hello World!");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
In this code, my log file is located at /var/log/storm/count.log and calling out.println(text) appends the text at the end of this file. As I am not sure whether it is thread-safe, having all parallel threads write into the same file at the same time might result in data loss (a possible workaround is sketched after the note below).
Note: if your bolts are distributed across multiple machines, each machine is going to have its own log file. During my tests, I configured a simple cluster with 1 machine (running Nimbus + Supervisor + UI), therefore I had only 1 log file.
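A possible workaround for the thread-safety concern is to give each parallel task its own file, for example by using the task id exposed by the TopologyContext. This is only a sketch of the idea; the file name pattern is my own choice, not something Storm prescribes:
// Sketch: one log file per task, so parallel executors never share a file.
public void prepare(Map topoConf, TopologyContext context) {
    int taskId = context.getThisTaskId(); // unique id of this bolt instance
    String taskLogFile = "/var/log/storm/count-" + taskId + ".log";
    try {
        this.out = new PrintWriter(new BufferedWriter(new FileWriter(taskLogFile, true)), true); // autoflush on println
    } catch (Exception e) {
        System.out.println(e);
    }
}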
Conclusion
There are multiple ways to deal with output data and, more generally, with logging anything from Storm. I didn't find any official way of doing it, and the documentation is very light on this subject.
While some of us would be satisfied with a simple System.out.println(), others might need to push large quantities of data into specific files, or maybe into a specialized database engine. Anything you can do with Java is possible with Storm because it is plain Java programming.
Any advice and additional comments to complete this answer will be gladly appreciated.

Related

Is there any way to free JVM memory in @AfterChunks in Spring Batch?

Is there any way to free JVM memory in @AfterChunks? Because we are getting an outOfMemory error after processing a couple of records.
Is there any way to free memory after Spring Batch job completion?
public class ABC implements ChunkListener {
    private static final Logger log = LoggerFactory.getLogger(ABC.class);
    private MessageFormat fmt = new MessageFormat("{0} items processed");
    private int loggingInterval = 100;

    @Override
    public void beforeChunk(ChunkContext context) {
        // Nothing to do here
    }

    @Override
    public void afterChunk(ChunkContext context) {
        int count = context.getStepContext().getStepExecution().getReadCount();
        // If the number of records processed so far is a multiple of the logging interval then output a log message.
        if (count > 0 && count % loggingInterval == 0) {
            log.info(fmt.format(new Object[] { new Integer(count) }));
        }
        //String name = context.getStepContext().getStepName();
        //context.getStepContext().registerDestructionCallback(name, callback);
    }
}
How to call registerDestructionCallback to clean up? What are name and callback? Any reference?
There is no way to force a GC in Java (System.gc() is just a hint to the JVM) and you should leave this to the GC.
Items in a chunk-oriented step should be garbage collected after each chunk is processed, see Does Spring Batch release the heap memory after processing each batch?. If you have an OOM, make sure:
your items are not held by a processor, mapper, etc
a chunk can fit in memory: sometimes, when using the driving query pattern, the processor fetches more details about each item and you can run out of memory very quickly, even for the first few items (see the sketch below)
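As an illustration of the second point, keeping the commit interval small is the usual way to make sure a chunk fits in memory. This is only a hedged sketch with placeholder type and bean names (MyItem, smallChunkStep), not code from the question:
// Sketch: a chunk-oriented step with a small commit interval, so fewer items
// are held in memory between transaction commits.
@Bean
public Step smallChunkStep(StepBuilderFactory stepBuilderFactory,
                           ItemReader<MyItem> reader,
                           ItemWriter<MyItem> writer) {
    return stepBuilderFactory.get("smallChunkStep")
            .<MyItem, MyItem>chunk(100) // commit interval: 100 items per chunk
            .reader(reader)
            .writer(writer)
            .build();
}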

Long-running AEM EventListener working inconsistently - blacklisted?

As always, AEM has brought new challenges to my life. This time, I'm experiencing an issue where an EventListener that listens for ReplicationEvents works only sometimes, and usually just the first few times after the service is restarted. After that, it stops running entirely.
The first line of the listener is a log line, so if it were running, that would be clear from the logs. Here's a simplified example of the listener:
@Component(immediate = true, metatype = false)
@Service(value = EventHandler.class)
@Property(
    name = "event.topics", value = ReplicationEvent.EVENT_TOPIC
)
public class MyActivityReplicationListener implements EventHandler {

    @Reference
    private SlingRepository repository;

    @Reference
    private OnboardingInterface onboardingService;

    @Reference
    private QueryInterface queryInterface;

    private Logger log = LoggerFactory.getLogger(this.getClass());
    private Session session;

    @Override
    public void handleEvent(Event ev) {
        log.info(String.format("Starting %s", this.getClass()));
        // Business logic
        log.info(String.format("Finished %s", this.getClass()));
    }
}
Now before you panic that I haven't included the business logic, see my answer below. The main point of interest is that the business logic could take a few seconds.
While crawling through the second page of Google search results to find an answer, I came across this article: a German article explaining that EventListeners that take more than 5 seconds to finish are sort of silently quarantined by AEM, with no output.
It just so happens that this task might take longer than 5 seconds, as it's working off data that was originally quite small, but has grown (and this is in line with other symptoms).
I put a change in that makes the listener much more like the one in that article - that is, it uses an EventConsumer to asynchronously process the ReplicationEvent using a pub/sub model. Here's a simplified version of the new model (for AEM 6.3):
@Component(immediate = true, property = {
    EventConstants.EVENT_TOPIC + "=" + ReplicationEvent.EVENT_TOPIC,
    JobConsumer.PROPERTY_TOPICS + "=" + AsyncReplicationListener.JOB_TOPIC
})
public class AsyncReplicationListener implements EventHandler, JobConsumer {

    private static final String PROPERTY_EVENT = "event";
    static final String JOB_TOPIC = ReplicationEvent.EVENT_TOPIC;

    @Reference
    private JobManager jobManager;

    @Override
    public JobConsumer.JobResult process(Job job) {
        try {
            ReplicationEvent event = (ReplicationEvent) job.getProperty(PROPERTY_EVENT);
            // Slow business logic (>5 seconds)
        } catch (Exception e) {
            return JobResult.FAILED;
        }
        return JobResult.OK;
    }

    @Override
    public void handleEvent(Event event) {
        final Map<String, Object> payload = new HashMap<>();
        payload.put(PROPERTY_EVENT, ReplicationEvent.fromEvent(event));
        final Job addJobResult = jobManager.addJob(JOB_TOPIC, payload);
    }
}
You can see here that the EventListener passes off the ReplicationEvent wrapped up in a Job, which is then handled by the JobConsumer, which, according to this magic article, is not subject to the 5-second rule.
Here is some official documentation on this time limit. Once I had the "5 seconds" keyword, I was able to find a bit more information, here and here, that talks about the 5-second limit as well. The first article uses a similar method to the above, and the second article shows a way to turn off these time limits.
The time limits can be disabled entirely (or increased) in the configMgr by setting the Timeout property to zero in the Apache Felix Event Admin Implementation configuration.

How to close a database connection opened by an IBackingMap implementation within a Storm Trident topology?

I'm implementing an IBackingMap for my Trident topology to store tuples in ElasticSearch (I know there are several implementations of Trident/ElasticSearch integration already existing on GitHub; however, I've decided to implement a custom one which suits my task better).
So my implementation is a classic one with a factory:
public class ElasticSearchBackingMap implements IBackingMap<OpaqueValue<BatchAggregationResult>> {

    // omitting here some other cool stuff...
    private final Client client;

    public static StateFactory getFactoryFor(final String host, final int port, final String clusterName) {
        return new StateFactory() {
            @Override
            public State makeState(Map conf, IMetricsContext metrics, int partitionIndex, int numPartitions) {
                ElasticSearchBackingMap esbm = new ElasticSearchBackingMap(host, port, clusterName);
                CachedMap cm = new CachedMap(esbm, LOCAL_CACHE_SIZE);
                MapState ms = OpaqueMap.build(cm);
                return new SnapshottableMap(ms, new Values(GLOBAL_KEY));
            }
        };
    }

    public ElasticSearchBackingMap(String host, int port, String clusterName) {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", clusterName).build();
        // TODO add a possibility to close the client
        client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(host, port));
    }

    // the actual implementation is left out
}
You can see it gets host/port/cluster name as input params and creates an ElasticSearch client as a member of the class, BUT IT NEVER CLOSES THE CLIENT.
It is then used from within a topology in a pretty familiar way:
tridentTopology.newStream("spout", spout)
// ...some processing steps here...
.groupBy(aggregationFields)
.persistentAggregate(
ElasticSearchBackingMap.getFactoryFor(
ElasticSearchConfig.ES_HOST,
ElasticSearchConfig.ES_PORT,
ElasticSearchConfig.ES_CLUSTER_NAME
),
new Fields(FieldNames.OUTCOME),
new BatchAggregator(),
new Fields(FieldNames.AGGREGATED));
This topology is wrapped into some public static void main, packed in a jar and sent to Storm for execution.
The question is: should I worry about closing the ElasticSearch connection, or is it Storm's own business? If it is not done by Storm, how and when in the topology's lifecycle should I do that?
Thanks in advance!
Okay, answering my own question.
First of all, thanks again @dedek for the suggestions and for reviving the ticket in Storm's Jira.
Finally, since there's no official way to do that, I've decided to go for the cleanup() method of Trident's Filter. So far I've verified the following (for Storm v. 0.9.4):
With LocalCluster
cleanup() gets called on the cluster's shutdown
cleanup() DOESN'T get called when killing the topology; this shouldn't be a tragedy, as one very likely won't use LocalCluster for real deployments anyway
With a real cluster
it gets called when the topology is killed, as well as when the worker is stopped using pkill -TERM -u storm -f 'backtype.storm.daemon.worker'
it doesn't get called if the worker is killed with kill -9, or when it crashes, or - sadly - when the worker dies due to an exception
Overall that gives a more or less decent guarantee of cleanup() getting called, provided you are careful with exception handling (I tend to add 'thundercatches' to every one of my Trident primitives anyway).
My code:
public class CloseFilter implements Filter {

    private static final Logger LOG = LoggerFactory.getLogger(CloseFilter.class);

    private final Closeable[] closeables;

    public CloseFilter(Closeable... closeables) {
        this.closeables = closeables;
    }

    @Override
    public boolean isKeep(TridentTuple tuple) {
        return true;
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {
    }

    @Override
    public void cleanup() {
        for (Closeable c : closeables) {
            try {
                c.close();
            } catch (Exception e) {
                LOG.warn("Failed to close an instance of {}", c.getClass(), e);
            }
        }
    }
}
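For completeness, this is roughly how such a filter can be attached to the stream. The esClient name is a placeholder, and whatever you pass in has to be available (and serializable) at topology build time, which is the main limitation of this approach:
// Sketch: the filter passes every tuple through untouched; its only purpose
// is to close the given resources in cleanup() when the topology goes down.
Stream stream = tridentTopology.newStream("spout", spout)
        .each(new Fields(), new CloseFilter(esClient));
// ...then continue with groupBy/persistentAggregate as in the topology above...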
However, it would be nice if some day hooks for closing connections became part of the API.

Using tick tuples with trident in storm

I am able to use a standard spout/bolt combination to do streaming aggregation,
and it works very well in the happy case, using tick tuples to persist data at some interval
to make use of batching. Right now I am doing some failure management (tracking tuples not saved, etc.) myself (i.e. not out of the box from Storm).
But I have read that Trident gives you a higher abstraction and better failure management.
What I don't understand is whether there is tick tuple support in Trident. Basically,
I would like to batch in memory for the current minute or so and persist any aggregated data
for the previous minutes using Trident.
Any pointers here or design suggestions would be helpful.
Thanks
Actually, micro-batching is a built-in Trident feature. You don't need any tick tuples for that. When you have something like this in your code:
topology
.newStream("myStream", spout)
.partitionPersist(
ElasticSearchEventState.getFactoryFor(connectionProvider),
new Fields("field1", "field2"),
new ElasticSearchEventUpdater()
)
(I'm using my custom ElasticSearch state/updater here; you might use something else.)
When you have something like this, under the hood Trident groups your stream into batches and performs the partitionPersist operation not on individual tuples but on those batches.
If you still need tick tuples for any reason, just create your own tick spout; something like this works for me:
public class TickSpout implements IBatchSpout {

    public static final String TIMESTAMP_FIELD = "timestamp";

    private final long delay;

    public TickSpout(long delay) {
        this.delay = delay;
    }

    @Override
    public void open(Map conf, TopologyContext context) {
    }

    @Override
    public void emitBatch(long batchId, TridentCollector collector) {
        Utils.sleep(delay);
        collector.emit(new Values(System.currentTimeMillis()));
    }

    @Override
    public void ack(long batchId) {
    }

    @Override
    public void close() {
    }

    @Override
    public Map getComponentConfiguration() {
        return null;
    }

    @Override
    public Fields getOutputFields() {
        return new Fields(TIMESTAMP_FIELD);
    }
}
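And a hedged sketch of wiring it up: emit one timestamp batch per minute and react to it downstream. MyPeriodicFlushFunction is a placeholder for whatever periodic work you need, not an existing class:
// Sketch: a tick batch every 60 seconds drives a periodic action downstream.
TridentTopology topology = new TridentTopology();
topology.newStream("ticks", new TickSpout(60000))
        .each(new Fields(TickSpout.TIMESTAMP_FIELD), new MyPeriodicFlushFunction(), new Fields());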

How to provide values to storm for calculation

I have a hard time understanding how to provide values to Storm since I am a newbie to Storm.
I started with the starter kit. I went through the TestWordSpout, and in it the following code provides new values:
public void nextTuple() {
    Utils.sleep(100);
    final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
    final Random rand = new Random();
    final String word = words[rand.nextInt(words.length)];
    _collector.emit(new Values(word));
}
So I see it's emitting one word at a time with _collector.emit(new Values(word)).
How can I provide a collection of words directly? Is this possible?
TestWordSpout.java
What I mean is: when nextTuple is called, a new word is selected at random from the list and emitted. The random output may look like this after a certain time interval:
#100ms: nathan
#200ms: golda
#300ms: golda
#400ms: jackson
#500ms: mike
#600ms: nathan
#700ms: bertels
What if I already have this collection of words and just want to feed it to Storm?
Storm is designed and built to process continuous streams of data. Please see the Rationale page for Storm. It's very unlikely that static input data is fed into the Storm cluster directly. Generally, the input data to Storm comes from JMS queues, Apache Kafka, Twitter feeds, etc. I would think you would like to pass a few configuration values. In that case, the following would apply.
Considering Storm's design purpose, only limited configuration details can be passed to Storm, such as RDBMS connection details (Oracle/DB2/MySQL etc.), JMS provider details (IBM MQ/RabbitMQ etc.) or Apache Kafka/HBase details etc.
For your particular question, or for providing the configuration details for the above products, there are three ways that I can think of:
1. Set the configuration details on the instance of the Spout or Bolt.
For example, declare the instance variables and assign the values as part of the Spout/Bolt constructor, as below:
public class TestWordSpout extends BaseRichSpout {
    List<String> listOfValues;

    public TestWordSpout(List<String> listOfValues) {
        this.listOfValues = listOfValues;
    }
}
In the topology submission class, create an instance of the Spout with the list of values:
List<String> listOfValues=new ArrayList<String>();
listOfValues.add("nathan");
listOfValues.add("golda");
listOfValues.add("mike");
builder.setSpout("word", new TestWordSpout(listOfValues), 3);
These values are then available as instance variables in the nextTuple() method, as sketched below.
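A minimal sketch of what nextTuple() could then look like with the injected list; the random emission logic mirrors TestWordSpout and is my own illustration:
// Sketch: pick a random entry from the injected list and emit it.
@Override
public void nextTuple() {
    Utils.sleep(100);
    final String word = listOfValues.get(new Random().nextInt(listOfValues.size()));
    _collector.emit(new Values(word));
}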
Please look at the Storm integrations in storm-contrib for how configurations are set for RDBMS/Kafka etc. in this way.
2. Set the configuration in getComponentConfiguration(). This method is primarily used to override topology configurations; however, you could pass in a few details as below:
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> ret = new HashMap<String, Object>();
    if (!_isDistributed) {
        ret.put(Config.TOPOLOGY_MAX_TASK_PARALLELISM, 1);
        return ret;
    } else {
        List<String> listOfValues = new ArrayList<String>();
        listOfValues.add("nathan");
        listOfValues.add("golda");
        listOfValues.add("mike");
        ret.put("listOfValues", listOfValues);
    }
    return ret;
}
The configuration details are then available in the open() or prepare() method of the Spout/Bolt respectively:
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    this.listOfValues = (List<String>) conf.get("listOfValues");
}
3. Declare the configuration in a properties file and package it as part of the jar file that is submitted to the Storm cluster. The Nimbus node copies the jar file to the worker nodes and makes it available to the executor threads. The open()/prepare() method can then read the properties file and assign the values to instance variables, as sketched below.
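A minimal sketch of option 3, assuming a config.properties file with a comma-separated words key bundled at the root of the topology jar (both names are my own placeholders):
// Sketch: load a properties file that was packaged inside the topology jar.
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    Properties props = new Properties();
    try (InputStream in = getClass().getClassLoader().getResourceAsStream("config.properties")) {
        props.load(in);
    } catch (Exception e) {
        throw new RuntimeException("Could not read config.properties from the jar", e);
    }
    this.listOfValues = Arrays.asList(props.getProperty("words", "").split(","));
}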
"Values" type accept any kind of objects and any number.
So you can simply send a List for instance from the execute method of a Bolt or from the nextTuple method of a Spout:
List<String> words = new ArrayList<>();
words.add("one word");
words.add("another word");
_collector.emit(new Values(words));
You can add a new field too; just be sure to declare it in the declareOutputFields method:
_collector.emit(new Values(words, "a new field value!"));
And in your declareOutputFields method
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declare(new Fields("collection", "newField"));
}
You can get the fields in the next Bolt in the topology from the tuple object given by the execute method:
List<String> collection = (List<String>) tuple.getValueByField("collection");
String newFieldValue = tuple.getStringByField("newField");
