Storm topology deployment using a pre-deployed jar - apache-storm

We currently have a jar that contains 6 topologies. To deploy them we make 6 separate calls using
/bin/storm jar $LOCAL_JAR $TOPOLOGY_CLASS $TOPOLOGY_NAME $PS_ENV $ZK_QUORUM -c nimbus.host=$NIMBUS_HOST $STORM_CONFIG_ARGS
Looking at the log output, each time a topology is submitted the jar is also uploaded to Nimbus, i.e. there are 6 lines like this:
9937 [main] INFO o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location:...
I want to avoid uploading the jar multiple times. I have tried uploading the jar via scp and placing it at "uploadedJarLocation" on the Nimbus node (I do this once),
then changing my deployment code to use the following for each of the topologies:
nimbusClient = NimbusClient.getConfiguredClient(storm_conf);
client = nimbusClient.getClient();
...
client.submitTopology(topologyName, uploadedJarLocation, jsonConf, topology.buildTopology());
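For context, a fuller sketch of that deployment code (the class names, config values and nimbus host below are illustrative placeholders, not the exact code from our project):
import org.apache.storm.Config;
import org.apache.storm.generated.Nimbus;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;
import org.json.simple.JSONValue;

import java.util.Map;

public class PreUploadedJarDeployer {

    public static void main(String[] args) throws Exception {
        String topologyName = args[0];
        // Path on the Nimbus node where the jar was copied once via scp.
        String uploadedJarLocation = args[1];

        // Client-side config read from storm.yaml / defaults; nimbus.host mirrors
        // the -c nimbus.host=... flag from the storm jar invocation above.
        Map storm_conf = Utils.readStormConfig();
        storm_conf.put("nimbus.host", "nimbus.example.com"); // illustrative value

        // Topology-level config, serialised to JSON for the Thrift call.
        Config topologyConf = new Config();
        topologyConf.setNumWorkers(2);
        String jsonConf = JSONValue.toJSONString(topologyConf);

        NimbusClient nimbusClient = NimbusClient.getConfiguredClient(storm_conf);
        try {
            Nimbus.Client client = nimbusClient.getClient();
            // Points Nimbus at the jar already sitting at uploadedJarLocation,
            // so nothing is re-uploaded for this submission.
            client.submitTopology(topologyName, uploadedJarLocation, jsonConf, buildTopology());
        } finally {
            nimbusClient.close();
        }
    }

    private static StormTopology buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout/bolt wiring omitted; this stands in for topology.buildTopology().
        return builder.createTopology();
    }
}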
This has sped things up and seems to work fine, but I want to ask:
Is this a safe approach? Can I safely reference the uploadedJarLocation I pre-uploaded to Nimbus via scp?
Are there any alternative methods to avoid the multiple jar upload?
I know about StormSubmitter.submitJar as an alternative, but I have found it to be slow.

Related

How do I get the working directory of a Spark executor in Java?

I need to know the current working directory URI/URL of a Spark executor so I can copy some dependencies there before the job executes. How do I get it in Java? Which API should I call?
The working directory is application specific, so you won't be able to get it before the application starts. It is best to use the standard Spark mechanisms:
--jars / spark.jars - for JAR files.
pyFiles - for Python dependencies.
SparkFiles / --files / --archives - for everything else
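For example, a file shipped with --files can be located from inside an executor via SparkFiles.get; a minimal Java sketch (the app name, file name and submit command are illustrative):
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class ShippedFileExample {
    public static void main(String[] args) {
        // Submitted with something like:
        //   spark-submit --files /local/path/app.conf --class ShippedFileExample app.jar
        SparkConf conf = new SparkConf().setAppName("shipped-file-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.parallelize(Arrays.asList(1, 2, 3)).foreach(x -> {
            // Runs on an executor: SparkFiles.get resolves the shipped file inside the
            // executor's working directory, wherever Spark happened to create it.
            String localPath = SparkFiles.get("app.conf");
            System.out.println("app.conf is available at: " + localPath);
        });

        sc.stop();
    }
}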

Submitting a topology to Storm

I have configured Storm on my machine. Zookeeper, Nimbus and the Supervisor are running properly.
Now I want to submit a topology to this Storm cluster.
I am trying to use storm jar, but I am not able to submit it.
Can anybody please give an example of this?
It would be very helpful.
Thanks in advance :)
The answer is in the official documentation, and it is clear enough. Run storm jar path/to/allmycode.jar org.me.MyTopology arg1 arg2 arg3 (replace with your project name and arguments, if any). Make sure you are using the StormSubmitter object instead of LocalCluster.
Unfortunately, almost all the examples on the internet show the word-count example and do not lay out the required steps in a simple way:
All you need to do is this:
1. Navigate to your storm bin folder:
cd /Users/nav/programming/apache-storm-1.0.1/bin
2. Start nimbus
./storm nimbus
3. Start supervisor
./storm supervisor
4. Start the ui program
./storm ui
5. Make sure you build your jar file with the Storm jar excluded from it (the Storm dependency should be marked as provided, since the cluster supplies it at runtime).
6. Make sure your /Users/nav/programming/apache-storm-1.0.1/conf/storm.yaml file is valid (this should've been step 2).
7. Make sure that in your code you are submitting the topology using StormSubmitter.submitTopology (a minimal sketch is included at the end of this answer).
8. Navigate to the storm bin folder again
cd /Users/nav/programming/apache-storm-1.0.1/bin
9. Submit your jar file to storm
./storm jar /Users/nav/myworkspace/StormTrial/build/libs/StormTrial.jar com.abc.stormtrial.StormTrial
The above command is basically just this:
stormExecutable jarOption pathToYourJarFile theClassContainingYourMainFile
If you want to pass commandline arguments to your program, add it at the end:
stormExecutable jarOption pathToYourJarFile theClassContainingYourMainFile commandlineArguments
Here, com.abc.stormtrial is the package name and StormTrial is the name of the class that contains your main function.
Now open up your browser and type http://127.0.0.1:8080 and you'll see your topology running via Storm's UI.
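For step 7, here is a minimal sketch of what the main class could look like (the spout/bolt wiring is omitted and the names are placeholders):
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class StormTrial {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // builder.setSpout("my-spout", new MySpout());
        // builder.setBolt("my-bolt", new MyBolt()).shuffleGrouping("my-spout");

        Config conf = new Config();
        conf.setNumWorkers(2);

        // StormSubmitter sends the topology to the cluster configured in storm.yaml;
        // a LocalCluster, by contrast, would only run it inside this JVM.
        StormSubmitter.submitTopology("storm-trial", conf, builder.createTopology());
    }
}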

Interfacing Hadoop to Cassandra on Amazon AWS - netty version conflict?

I've got a Hadoop map reduce class that runs on Amazon EMR and outputs to an HDFS flatfile. All well and good, but now I need to output to a Cassandra DB, also running on AWS. I built and ran a local client and got that working, then moved the Cassandra writing code to my Hadoop project. The problem, it seems, is that Amazon draws in /home/hadoop/lib/netty-3.2.4.Final.jar for Hadoop 1.0.3, but the Cassandra that runs on AWS is 1.2.6 and uses netty-3.5.9.Final.jar.
What can I do to prevent or circumvent this conflict? Can I draw in my newer version of netty alongside the one Amazon EMR draws in?
The error I get from running the jar on EMR is as follows:
Exception in thread "main" java.lang.NoSuchMethodError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.<init>(IIIIIZ)V
at org.apache.cassandra.transport.Frame$Decoder.<init>(Frame.java:147)
at com.datastax.driver.core.Connection$PipelineFactory.getPipeline(Connection.java:616)
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:212)
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at com.datastax.driver.core.Connection.<init>(Connection.java:111)
at com.datastax.driver.core.Connection.<init>(Connection.java:56)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:387)
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:211)
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:174)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:87)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:609)
at com.datastax.driver.core.Cluster$Manager.access$100(Cluster.java:553)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:67)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:94)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:534)
at GraphAnalysis.MatrixBuilder.compileOutput(MatrixBuilder.java:282)
at GraphAnalysis.MatrixBuilder.main(MatrixBuilder.java:205)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
I had a similar problem with one of our internal modules, which used netty-3.2.4.jar and was missing a function.
Ultimately I had to do open-heart surgery on the jar file.
1. Expand the jar file that embeds the old netty library: jar -xvf oldlibrary.jar
2. Delete the folder org/jboss/netty.
3. Expand the new netty jar file into a separate folder.
4. Copy the new org/jboss/netty folder along with its contents to the old location: cp -prf netty .
5. Create a new JAR package: jar -cvf ../new_jarfile.jar *
This is a rare case, but you can apply this method to overcome library incompatibilities when they occur. Make sure that your program still runs: if you swap the underlying library this way, there is a chance your original program will no longer work. A well-designed library should be tolerant of such changes, though.
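If you want to double-check which copy of the conflicting class actually wins on the classpath (before and after repacking), a small diagnostic along these lines can help (the class name is taken from the stack trace above):
public class NettyVersionCheck {
    public static void main(String[] args) throws Exception {
        // Load the class involved in the NoSuchMethodError and print the jar it was loaded from.
        Class<?> decoder = Class.forName(
                "org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder");
        java.security.CodeSource source = decoder.getProtectionDomain().getCodeSource();
        System.out.println(source != null ? source.getLocation() : "bootstrap classpath");
    }
}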

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a Hadoop cluster with CDH4.2.0 installed and ran into this error. It has been fixed in later versions of Hadoop, but I don't have access to update the cluster. Is there a way to tell Hadoop to use this jar when running my job, through command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must Hadoop be completely recompiled with this new jar? I'm new to Hadoop, so I don't know all the command line options that are possible.
I'm not sure how that would work, since when you execute the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the issue only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce job.

Running a jar on the Hadoop server as a service

I made a jar which analyzes system logs. To run this jar on the Hadoop server I can use the command line, like "bin/hadoop jar log.jar".
My problem is that I want to run this jar in the background as a service on the Ubuntu master machine.
Can anyone tell me how I can make the Hadoop jar run like a background service on the Ubuntu machine, executing every hour?
You have a few options; here are two:
Configure a crontab job to run your job every hour, something like (you'll need to fully qualify the path to hadoop and the jar itself):
0 * * * * /usr/lib/hadoop/bin/hadoop jar /path/to/jar/log.jar
Run an Oozie server and configure a coordinator to submit the job on an hourly basis. More effort than the above suggestion, but worth a look.
