I am working on a Storm application and need to pass environment & cloud info using VM args.
Passing them as program arguments raises security issues, and since I am setting those values with System.setProperty, I can't use a properties file either. VM args don't work with Storm.
Any idea what else I can do?
I assume you are running Storm in cluster mode on a remote machine.
In this case, Storm suggests you use the storm jar command. It is true that by default this accepts just a few arguments, since internally it is a Python script that calls java.
But it seems to be possible to start Storm with a plain java command as well and pass command-line options (including VM args) to it - see this link.
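Something along these lines might work; this is only a rough sketch of roughly what the storm jar wrapper does under the hood (assuming a Storm 1.x client with storm on the PATH; the class name and the env/cloud -D properties are placeholders for your own):

TOPO_JAR=/path/to/yourtopo.jar
# `storm classpath` prints the client classpath (including conf/ with storm.yaml);
# -Dstorm.jar tells StormSubmitter which jar to upload to Nimbus
java -cp "$(storm classpath):$TOPO_JAR" \
  -Dstorm.jar="$TOPO_JAR" \
  -Denv.name=prod \
  -Dcloud.region=eu-west-1 \
  com.example.YourTopology arg1 arg2

Note that such system properties are only visible to the submitting JVM (the main method that calls StormSubmitter), not to the workers; anything the workers need has to go into the topology Config.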
Another way to pass variables to a topology is environment variable substitution in a Flux file, e.g.
name: topo${ENV-YOURVAR}
Running this from bash then looks like:
export YOURVAR=test
storm jar yourtopo.jar org.apache.storm.flux.Flux yourtopo.flux -e
The -e flag tells Flux to substitute environment variables in the flux file.
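For reference, a minimal yourtopo.flux that such a command could consume might look like the sketch below. It uses Storm's built-in test spout/bolt classes purely to keep the example self-contained, so swap in your own components; the quoted heredoc keeps bash from expanding ${ENV-YOURVAR} so that Flux can do the substitution itself:

cat > yourtopo.flux <<'EOF'
name: "topo${ENV-YOURVAR}"
config:
  topology.workers: 1
spouts:
  - id: "spout-1"
    className: "org.apache.storm.testing.TestWordSpout"
    parallelism: 1
bolts:
  - id: "bolt-1"
    className: "org.apache.storm.testing.TestGlobalCount"
    parallelism: 1
streams:
  - from: "spout-1"
    to: "bolt-1"
    grouping:
      type: SHUFFLE
EOF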
I have to test some code using Spark and I'm pretty new to it.
The code I have runs an ETL script on a cluster. The ETL script is written in Python and has several prints in it, but I'm unable to see those prints. The Python script is added to the spark-submit call via the --py-files flag. I don't know if those prints are unreachable because they happen in the YARN executors, and whether I should change them to logs and use log4j, or add them to an accumulator reachable by the driver.
Any suggestions would help.
The final goal is to see how the execution of the code is going. I don't know if simple prints are the best solution, but they were already in the code I was given to test.
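For what it's worth, in YARN cluster mode the output of print ends up in each executor's stdout log rather than on the driver console, so one way to inspect it after a run is the YARN log CLI. A sketch, assuming log aggregation is enabled and using a placeholder application id (use the one spark-submit or the ResourceManager UI reports):

# dump the aggregated container logs (driver and executor stdout/stderr)
yarn logs -applicationId application_1234567890123_0042 | less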
I'm running an Oozie workflow with some bash scripts in a Hadoop environment (Hadoop 2.7.3), but my workflow is failing because my shell action gets an error. After saving the command output to a file as a log, I found the following entry in it:
moveToLocal: Option '-moveToLocal' is not implemented yet.
After I get this error, my shell action fails. Is that because it takes this as an error and fails the entire action?
Also, does that line mean that my current version of Hadoop (2.7.3) doesn't support that command?
According to the documentation for Hadoop 2.7.3:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html#moveToLocal
this command is not supported yet. My shell action takes it as an error and terminates the shell script. I'm changing that command to an equivalent.
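For example, a copy followed by a delete behaves like moveToLocal; a sketch with placeholder paths:

# copy the file out of HDFS, then remove the HDFS copy only if the copy succeeded
SRC=/user/me/output/part-00000    # HDFS path (placeholder)
DST=/tmp/part-00000               # local path (placeholder)
hadoop fs -copyToLocal "$SRC" "$DST" && hadoop fs -rm "$SRC"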
I don't know anything about bash, but I put together a script to help me run my HBase Java application:
#!/bin/bash
HADOOP_CLASSPATH="$(hbase classpath)"
hadoop jar my.jar my_pkg.my_class
When I run it I get a:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/regionserver/IncreasingToUpperBoundRegionSplitPolicy
When I echo out the HADOOP_CLASSPATH I see that hbase-server-1.2.0-cdh5.8.0.jar is there...
Is the hadoop jar command ignoring the HADOOP_CLASSPATH?
Also I have tried to run the commands from the command-line instead of using my script. I get the same error.
The approach was inspired by this Cloudera question.
The solution was to put the HADOOP_CLASSPATH assignment on the same line as the hadoop command. I am not certain what the difference is, but this works:
HADOOP_CLASSPATH="$(hbase classpath)" hadoop jar my.jar my_pkg.my_class
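The likely difference is that in the original script the variable was set but never exported, so the hadoop child process never saw it; prefixing the command passes it in that one command's environment. Exporting it first should work just as well, a sketch under that assumption:

#!/bin/bash
# export makes HADOOP_CLASSPATH visible to the hadoop child process,
# so the jars listed by `hbase classpath` end up on its classpath
export HADOOP_CLASSPATH="$(hbase classpath)"
hadoop jar my.jar my_pkg.my_class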
I have configured Storm on my machine. Zookeeper, Nimbus and Supervisor are running properly.
Now I want to submit a topology to this storm.
I am trying to use storm jar,
but I am not able to submit it.
Can anybody please give an example of this?
It would be very helpful.
Thanks in advance:)
The answer is in the official documentation, and it is clear enough. Run storm jar path/to/allmycode.jar org.me.MyTopology arg1 arg2 arg3 (replacing the jar, class, and arguments with your own). Make sure you are using the StormSubmitter object instead of LocalCluster.
Unfortunately, almost all the examples on the internet show the word counter example, and do not mention the steps required in a simple way:
All you need to do is this:
1. Navigate to your storm bin folder:
cd /Users/nav/programming/apache-storm-1.0.1/bin
2. Start nimbus
./storm nimbus
3. Start supervisor
./storm supervisor
4. Start the ui program
./storm ui
5. Make sure you build your jar file with the Storm jar itself excluded from it (e.g., with the storm-core dependency marked as provided).
6. Make sure your /Users/nav/programming/apache-storm-1.0.1/conf/storm.yaml file is valid (this should've been step 2).
7. Make sure that in your code, you are submitting the topology using StormSubmitter.submitTopology
8. Navigate to the storm bin folder again
cd /Users/nav/programming/apache-storm-1.0.1/bin
9. Submit your jar file to storm
./storm jar /Users/nav/myworkspace/StormTrial/build/libs/StormTrial.jar com.abc.stormtrial.StormTrial
The above command is basically just this:
stormExecutable jarOption pathToYourJarFile theClassContainingYourMainFile
If you want to pass commandline arguments to your program, add it at the end:
stormExecutable jarOption pathToYourJarFile theClassContainingYourMainFile commandlineArguments
Here, com.abc.stormtrial is the full package name and .StormTrial is the name of the class that contains your main function.
Now open up your browser and type http://127.0.0.1:8080 and you'll see your topology running via Storm's UI.
Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box. I have a shell script that is controlling a set of hadoop streaming jobs in sequence and I need to condition copying files from HDFS to local depending on whether the jobs have been running locally or not. Is there a standard way to accomplish this test? I could do a "ps aux | grep something" but that seems ad-hoc.
Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box.
Can you please point to the reference for this?
A regular or a streaming job will run the way it is configured, so we know ahead of time in which mode a job will run. Check the documentation for configuring Hadoop on a Single Node and on a Cluster in the different modes.
Rather than trying to detect at run time which mode the process is operating in, it is probably better to wrap the tool you are developing in a bash script that explicitly selects local vs. cluster operation. The O'Reilly Hadoop book describes how to explicitly choose local mode using a configuration file override:
hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp
where hadoop-local.xml is an XML file configured for local operation.
I haven't tried this yet, but I think you can just read out the mapred.job.tracker configuration setting.
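For instance, something along these lines could work from the controlling shell script. It is only a sketch: it assumes the pre-YARN mapred.job.tracker property (which defaults to local when no JobTracker is configured) and that the cluster settings live in mapred-site.xml under HADOOP_CONF_DIR:

# quick-and-dirty check: an empty or "local" value means local mode
CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
TRACKER=$(grep -A1 '<name>mapred.job.tracker</name>' "$CONF_DIR/mapred-site.xml" 2>/dev/null \
          | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g')
if [ -z "$TRACKER" ] || [ "$TRACKER" = "local" ]; then
  echo "running in local mode"
else
  echo "running against cluster at $TRACKER"
fi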