Execute a method only once at start of an Apache Storm topology - apache-storm

If I have a simple Apache Storm topology with a spout (set to a parallelism of 2) running on two separate nodes. How can I write a method that will be run once, and only once, at the start of the topology before any processing of tuples has begun?
Any implementation of a singleton/static class, or synchronized method alone will not work, as the two instances are running on separate nodes.
Perhaps there are some Storm methods that I can use to decide if I'm the first Spout to be instantiated, and run only then? I tried playing around with the getThisTaskId() & getThisWorkerTasks() methods, but was unsuccessful.
NOTE: The parallelism of 2 is to keep things simple. A solution should work for any number of nodes/workers.

Edit: Thought of an easier solution. I'll leave the original answer below in case it is helpful.
You can use TopologyContext.getThisTaskIndex to do this. If you make your spout open method run the code only if TopologyContext.getThisTaskIndex == 0, then your code will run only once, before any tuples are emitted.
If the worker that ran this code crashes, the code will be run again when the spout instance with task index 0 is restarted. In order to fix this, you can use Zookeeper to store state that should carry over across restarts, e.g. put a flag in Zookeeper once the only-once code has run, and have the spout open check that the flag is not set before running the code.
You can use TopologyContext.getStormId to get a constant unique string to identify the topology, so you can tell whether the flag was set by this topology or a previous deployment.
Original answer:
The easiest way to run some code only once on deployment of a topology, is to call the code when you submit the topology. You can call the only-once code at the same time as you wire your topology with TopologyBuilder. This will only get run once. The downside is it will run on the machine you're calling storm jar from.
If you for some reason can't do it this way or need to run the code from one of the worker nodes, there isn't anything built in to Storm to allow you to do this. The reason there isn't such a mechanism is that it requires extra coordination between the worker JVMs, and I don't think anyone has needed something like this.
The best option for you would probably be to look at Zookeeper/Curator to do this coordination (see https://curator.apache.org/curator-recipes/index.html). This should allow you to make only one worker in the cluster run your code. You'll have to consider what should happen if the worker chosen to run your code crashes/stalls.
Storm already uses Zookeeper for coordination, so you can just connect to that cluster.

Related

Control over scheduling/placement in Apache Storm

I am running a wordcount topology in a Storm cluster composed on 2 nodes. One node is the Master node (with Nimbus, UI and Logviewer) and both of then are Supervisor with 1 Worker each. In other words, my Master node is also a Supervisor, and the second node is only a Supervisor. As I said, there is 1 Worker per Supervisor.
The topology I am using is configured so that it is using these 2 Workers (setNumWorkers(2)). In details, the topology has 1 Spout with 2 threads, 1 split Bolt and 1 count Bolt. When I deploy the topology with the default scheduler, the first Supervisor has 1 Spout thread and the split Bolt, and the second Supervisor has 1 Spout thread and the count Bolt.
Given this context, how can I control the placement of operators (Spout/Bolt) between these 2 Workers? For research purpose, I need to have some control over the placement of these operators between nodes. However, the mechanism seems to be transparent within Storm and such a control is not available for the end-user.
I hope my question is clear enough. Feel free to ask for additional details. I am aware that I may need to dig into Storm's source code and recompile. That's fine. I am looking for a starting point and advices on how to proceed.
The version of Storm I am using is 2.1.0.
Scheduling is handled by a pluggable scheduler in Storm. See the documentation at http://storm.apache.org/releases/2.1.0/Storm-Scheduler.html.
You may want to look at the DefaultScheduler for reference https://github.com/apache/storm/blob/v2.1.0/storm-server/src/main/java/org/apache/storm/scheduler/DefaultScheduler.java. This is the default scheduler used by Storm, and has a bit of handling for banning "bad" workers from the assignment, but otherwise largely just does round robin assignment.
If you don't want to implement a cluster-wide scheduler, you might be able to set your cluster to use the ResourceAwareScheduler, and use a topology-level scheduling strategy instead. You would set this by setting config.setTopologyStrategy(YourStrategyHere.class) when you submit your topology. You will want to implement this interface https://github.com/apache/storm/blob/e909b3d604367e7c47c3bbf3ec8e7f6b672ff778/storm-server/src/main/java/org/apache/storm/scheduler/resource/strategies/scheduling/IStrategy.java#L43 and you can find an example implementation at https://github.com/apache/storm/blob/c427119f24bc0b14f81706ab4ad03404aa85aede/storm-server/src/main/java/org/apache/storm/scheduler/resource/strategies/scheduling/DefaultResourceAwareStrategy.java
Edit: If you implement either an IStrategy or IScheduler, they need to go in a jar that you put in storm/lib on the Nimbus machine. The strategy or scheduler needs to be on the classpath of the Nimbus process.

Parallelism in Apache Storm with one worker node

I am trying to Parallelize my topology using Apache Storm but it gives me java.util.ConcurrentModificationException error on worker nodes if I increased the number of workers>1. It works fine with 1 worker and in local cluster. I want a way to parallelize my topology and measure the different parameters like throughput, latency, emit rate etc. using one worker node only.
Based on the stack trace you posted, it looks like Kryo is trying to serialize an ArrayList and hitting a ConcurrentModificationException. I would look for any place you emit an ArrayList and make sure that you don't modify it after you've passed it to OutputCollector.emit.
Likely the reason you're not seeing this issue when you only have one worker is that Storm only serializes emitted objects when they need to be sent to a different worker.

storm - How to check if a topology is idle or running?

That can be met checking if a determined bolt is processing, if the bolt have any tuples in the queue to be inserted yet or something like that.
What I want, in resume, is know, in any way, if a topology has done it's work yet or no.
I know it sounds contradictory, since a topology should never have the work done, but I'm using it to do tests and in the beginning I have not a non-stop stream of data, but a finite amount of data.
To check the running topologies and their statuses you can run:
{dir/to/storm}/bin/storm list
You can also navigate to the running storm UI and check topologies/logs from there.
If you want to check if the work has been performed on a tuple then you can add your own logging. I have added some logic to print out how many tuples are processed each second which I find useful.
U can check it from the storm UI which I think is the most easiest way.

How to find the right portion between hadoop instance types

I am trying to find out how many MASTER, CORE, TASK instances are optimal to my jobs. I couldn't find any tutorial that explains how do I figure it out.
How do I know if I need more than 1 core instance? What are the "symptoms" I would see in EMR's console in the metrics that would hint I need more than one core? So far when I tried the same job with 1*core+7*task instances it ran pretty much like on 8*core, but it doesn't make much sense to me. Or is it possible that my job is so much CPU bound that the IO is such minor? (I have a map-only job that parses apache log files into csv file)
Is there such a thing to have more than 1 master instance? If yes, when is it needed? I wonder, because my master node pretty much is just waiting for the other nodes to do the job (0%CPU) for 95% of the time.
Can the master and the core node be identical? I can have a master only cluster, when the 1 and only node does everything. It looks like it would be logical to be able to have a cluster with 1 node that is the master and the core , and the rest are task nodes, but it seems to be impossible to set it up that way with EMR. Why is that?
The master instance acts as a manager and coordinates everything that goes in the whole cluster. As such, it has to exist in every job flow you run but just one instance is all you need. Unless you are deploying a single-node cluster (in which case the master instance is the only node running), it does not do any heavy lifting as far as actual MapReducing is concerned, so the instance does not have to be a powerful machine.
The number of core instances that you need really depends on the job and how fast you want to process it, so there is no single correct answer. A good thing is that you can resize the core/task instance group, so if you think your job is running slow, then you can add more instances to a running process.
One important difference between core and task instance groups is that the core instances store actual data on HDFS whereas task instances do not. In turn, you can only increase the core instance group (because removing running instances would lose the data on those instances). On the other hand, you can both increase and decrease the task instance group by adding or removing task instances.
So these two types of instances can be used to adjust the processing power of your job. Typically, you use ondemand instances for core instances because they must be running all the time and cannot be lost, and you use spot instances for task instances because losing task instances do not kill the entire job (e.g., the tasks not finished by task instances will be rerun on core instances). This is one way to run a large cluster cost-effectively by using spot instances.
The general description of each instance type is available here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/InstanceGroups.html
Also, this video may be useful for using EMR effectively:
https://www.youtube.com/watch?v=a5D_bs7E3uc

What is the difference between job.submit and job.waitForComplete in Apache Hadoop?

I have read the documentation so I know the difference.
My question however is that, is there any risk in using .submit instead of .waitForComplete if I want to run several Hadoop jobs on a cluster in parallel ?
I mostly use Elastic Map Reduce.
When I tried doing so, I noticed that only the first job being executed.
If your aim is to run jobs in parallel then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion exists is that it's method call returns only when the job gets finished, and it returns with it's success or failure status which can be used to determine that further steps are to be run or not.
Now, getting back at you seeing only the first job being executed, this is because by default Hadoop schedules the jobs in FIFO order. You certainly can change this behaviour. Read more here.

Resources