How to set -javaagent in spark-submit - shell

I have used aspect-oriented programming to do the logging in a Java Maven project.
While running it through Eclipse, I have to initialize the javaagent in the VM arguments, as follows:
-javaagent:lib/aspectjweaver-1.9.1.jar
Now I want to submit the produced jar to a Spark worker. I have written a shell script to do it. I am able to run the job but unable to initialize the javaagent.
export SPARK_PATH=/xyz
export SPARK_URL=spark://abc:0000
export JAVA_OPTS="$JAVA_OPTS -javaagent:../aspectjweaver-1.9.1.jar"
$SPARK_PATH/spark-submit --master $SPARK_URL --jars --class com.main.index ../index-0.0.1-SNAPSHOT.jar
I have tried a number of approaches, such as setting JAVA_OPTS and CATALINA_OPTS and creating spark-env.sh and setting it there, but none of them worked. I have been struggling with this for the last 3 days.
I checked a few similar questions on Stack Overflow, but none of them were helpful for setting the javaagent. Help.
Thanks.
EDIT:
I am checking whether the javaagent is initialized using the code below:
try {
    org.aspectj.weaver.loadtime.Agent.getInstrumentation();
} catch (NoClassDefFoundError | UnsupportedOperationException e) {
    System.out.println(e);
}
I get the NoClassDefFoundError, which indicates that the javaagent is not set.

I found the answer to this: I had to use "--driver-java-options". Below is the updated script.
$SPARK_PATH/spark-submit --master $SPARK_URL --driver-java-options "-javaagent:../aspectjweaver-1.9.1.jar" --class com.main.index ../index-0.0.1-SNAPSHOT.jar "$1"
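Note that --driver-java-options only affects the driver JVM. If the aspects also need to be woven inside the executors, the executor-side equivalent is the spark.executor.extraJavaOptions setting. The sketch below assumes the weaver jar already exists at a fixed path on every worker node (the /opt/agents/ path is a placeholder):
$SPARK_PATH/spark-submit --master $SPARK_URL \
  --driver-java-options "-javaagent:../aspectjweaver-1.9.1.jar" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/agents/aspectjweaver-1.9.1.jar" \
  --class com.main.index ../index-0.0.1-SNAPSHOT.jar "$1"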

Related

Is there any way in Spark to keep each stage's run time?

I am measuring the run times of a Spark job with different resource configurations and need to compare the run time of each stage. I can see them in the UI only while the job is running.
I run my job on a Hadoop cluster and use YARN as the resource manager.
Is there any way to keep each stage's run-time? Is there any log for them?
UPDATE:
I read the monitoring documentation mentioned in the comments and added the following lines, but it doesn't work:
in spark-defaults.conf :
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///[nameNode]:8020/[PathToSparkEventLogDir]
spark.history.fs.logDirectory hdfs:///[nameNode]:8020/[PathTosparkLogDirectory]
in spark-env.sh:
export SPARK_PUBLIC_DNS=[nameNode]
SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$sparkHistoryDir"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.interval=7d"
It looks for the /tmp/spark-events/ folder, and when I create it and start the history server, it doesn't show any completed or incomplete applications.
Note that I tried the logDirectory value without the port number too, but it didn't work.
I was able to run the Spark History Server and see the history of completed and incomplete applications by doing the following:
Set the public DNS value in conf/spark-env.sh
export SPARK_PUBLIC_DNS=[NameNode-IP]
Add these properties to SparkConf in my Java code:
SparkConf conf = new SparkConf()
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "hdfs:///user/[user-path]/sparkEventLog")
.set("spark.history.fs.logDirectory", "hdfs:///user/[user-path]/sparkEventLog")
Create the property file (spark/conf/history.properties) containing the following lines:
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/[user-path]/sparkEventLog
spark.history.fs.logDirectory hdfs:///user/[user-path]/sparkEventLog
Start the history server:
./sbin/start-history-server.sh --properties-file ./conf/history.properties
Note: The properties spark.eventLog.dir and spark.history.fs.logDirectory should have the same value.
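As a sketch of how to pull the per-stage run times once the history server is up (instead of reading them from the UI): the history server exposes a REST API, by default on port 18080. The host below is a placeholder, and the exact field names can differ between Spark versions.
# List the application IDs known to the history server
curl http://[historyServerHost]:18080/api/v1/applications
# Per-stage data for one application; each stage entry includes timing fields
# such as submissionTime, completionTime, and executorRunTime
curl http://[historyServerHost]:18080/api/v1/applications/[app-id]/stages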

Unable to Start Stanford CoreNLP server with Shift-Reduce Parser

I have set up a CoreNLP server locally, and it can be started using the command:
java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 100000
But the default constituency parser is used, with the following warning message:
warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
I downloaded the shift-reduce parser, extracted the jar file, and pasted the edu folder into the project directory.
To double-check, I manually traversed to edu/stanford/nlp/models/lexparser to make sure englishSR.ser.gz exists.
I also created a .properties file and used the -serverProperties option when starting the server again.
StanfordCoreNLP.properties file contains:
parser.model = edu/stanford/nlp/models/srparser/englishSR.ser.gz
Command to run server is now:
java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 100000 -serverProperties "StanfordCoreNLP.properties"
No luck; I am still getting the same warning.
Please comment if more clarification is needed.
You should put the full English models jar in the directory where you are running this command.
The -cp "*" tells Java to look at all jars in that directory.
You can find that English models jar here:
https://stanfordnlp.github.io/CoreNLP/download.html
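As a sketch of what that looks like (the jar file name below is illustrative; use the name of the English models jar you actually download):
# The SR parser model is typically packaged in the separate English models jar,
# not in the default models jar shipped with the base distribution.
cp ~/Downloads/stanford-english-corenlp-models.jar .
java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 100000 -serverProperties "StanfordCoreNLP.properties"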

Adding some hadoop configuration to a spark application at runtime (through spark-submit)?

I want to send a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But I want to be able to send this property through spark-submit at runtime. Would this be possible?
Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which allow you to specify such arbitrary configuration options. You can then retrieve the values you pass by using SparkConf's get function:
val conf = new SparkConf()
val mrRecursive = conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
sc.hadoopConfiguration.set("spark.mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
Spark-submit/spark-shell --help:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html
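A usage sketch of that --conf approach (the class and jar names are placeholders). spark-submit only forwards --conf keys that start with spark., hence the prefix used above; per the configuration docs, keys with the spark.hadoop. prefix are also copied into the Hadoop configuration automatically, which avoids the manual hadoopConfiguration.set step.
spark-submit \
  --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true \
  --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \
  --class com.example.MyApp my-app.jar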
Such an approach can be used without modifying the code.
Hadoop's Configuration reads the file "core-default.xml" during creation; the description is here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and include the directory containing that file on the classpath via the spark-submit "driver-class-path" parameter, it can work.
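A sketch of that invocation, with a placeholder path for the directory holding the customized core-default.xml (executors would need the directory on their classpath too, e.g. via spark.executor.extraClassPath):
spark-submit \
  --driver-class-path /path/to/conf-dir \
  --class com.example.MyApp my-app.jar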

HBase ERROR: hbase-default.xml file seems to be for and old version of HBase (null)

I am trying to write a program to connect to HBase. However, when I execute the following command
HBaseConfiguration.create();
I get the following error:
"hbase-default.xml file seems to be for and old version of HBase (null), this version is 0.92.1-cdh4.1.2."
When I dig deeper and debug inside, I observe the following:
class HBaseConfiguration
private static void checkDefaultsVersion(Configuration conf) {
    if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE)) return;
    String defaultsVersion = conf.get("hbase.defaults.for.version");
    String thisVersion = VersionInfo.getVersion();
    if (!thisVersion.equals(defaultsVersion)) {
        throw new RuntimeException(
            "hbase-default.xml file seems to be for and old version of HBase (" +
            defaultsVersion + "), this version is " + thisVersion);
    }
}
In my case, HBase returns the default version as null. I am not sure why it returns null; I checked the corresponding entry in the hbase-default.xml packaged with the HBase jar, and it has the correct entry.
When I try the same thing from a standalone program it works as expected.
Please let me know if you have any questions.
Thanks in advance,
Rohit
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hbase.defaults.for.version.skip</name>
    <value>true</value>
  </property>
</configuration>
Add this to an hbase-default.xml and put the file on the classpath or in the resources folder. I got this error when I ran from within a Spring Hadoop environment. By adding the above file to the resources folder of the job jar, I was able to solve this.
I finally found a workaround for this problem...
The problem is that hbase-default.xml is not included in your classpath.
I added hbase-default.xml to target/test-classes (it will vary in your case); you can add hbase-default.xml to various folders and see what works for you.
NOTE: This is just a workaround, not the solution.
The solution would be to load the proper jars (which I haven't figured out yet).
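As a hedged diagnostic sketch for tracking down which jar (or jars) the version check is actually reading, assuming your dependencies sit in a single directory (/path/to/lib below is a placeholder):
# Show every jar that ships an hbase-default.xml and the HBase version it declares
for j in /path/to/lib/*.jar; do
  if unzip -p "$j" hbase-default.xml 2>/dev/null | grep -q hbase.defaults.for.version; then
    echo "$j:"
    unzip -p "$j" hbase-default.xml | grep -A1 '<name>hbase.defaults.for.version</name>'
  fi
done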
I've been getting this error using HBase 1.1.1.
I created a simple HBase client and it worked fine. Then I built a simple RMI service, and that worked fine. But when I tried putting my simple HBase query code into the RMI service, I started getting this error on the HBaseConfiguration.create() call. After experimenting a bit, I found that the HBaseConfiguration.create() call works OK if placed before the security manager code in my main(). I get the error if the call is placed after the block of code containing the security manager calls...
Configuration conf = HBaseConfiguration.create(); // This works
if (System.getSecurityManager() == null)
{
    System.setSecurityManager(new SecurityManager());
} // End if
// Configuration conf = HBaseConfiguration.create(); // This fails
I get the error if the create() call happens in main() after that security manager block, or in code within the class that is instantiated by main(). I don't get the error if create() is called within a static{ } block in my RMI service class (which I believe gets called before main()), or in main() before the security manager block, as shown.
BTW, the jar files that I include in my class path in order to get a minimal client to run are the following:
commons-codec-1.9.jar,
commons-collections-3.2.1.jar,
commons-configuration-1.6.jar,
commons-lang-2.6.jar,
commons-logging-1.2.jar,
guava-12.0.1.jar,
hadoop-auth-2.5.1.jar,
hadoop-common-2.5.1.jar,
hbase-client-1.1.1.jar,
hbase-common-1.1.1.jar,
hbase-hadoop2-compat-1.1.1.jar,
hbase-it-1.1.1-tests.jar,
hbase-protocol-1.1.1.jar,
htrace-core-3.1.0-incubating.jar,
log4j-1.2.17.jar,
netty-all-4.0.23.Final.jar,
protobuf-java-2.5.0.jar,
slf4j-api-1.7.7.jar,
slf4j-log4j12-1.7.5.jar
Had a similar problem where the error was
java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (0.98.3-hadoop2), this version is Unknown
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:70)
at org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:102)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:113)
In my case, I had the same set of jar files at two different levels of the classpath; after removing them from one level, it worked fine.
In my case the issue was caused by an old Java version (1.5), which was the default on the server. It works fine with 1.7.
In my code, I used this to solve the error:
val config = HBaseConfiguration.create() // error: triggers the hbase-default.xml version check
// A plain Hadoop Configuration skips the check (but does not load hbase-default.xml/hbase-site.xml):
val config = new Configuration() // works

Alternative ways to start hadoop streaming job

I can successfully kick off a Hadoop streaming job from the terminal, but I am looking for ways to start streaming jobs via an API, Eclipse, or some other means.
The closest I found was this post https://stackoverflow.com/questions/11564463/remotely-execute-hadoop-streaming-job but it has no answers!
Any ideas or suggestions would be welcome.
Interesting question. I found a way to do this; hopefully it will help you too.
The first method should work on Hadoop 0.22:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

// Point the job at the cluster (placeholders kept from the original answer)
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://xxxxx:9000");
conf.set("mapred.job.tracker", "hdfs://xxxxx:9001");

StreamJob sj = new StreamJob();
try {
    ToolRunner.run(conf, sj, new String[] {
        "-D", "stream.tmpdir=c:\\",
        "-mapper", "/path/to/mapper.py",
        "-reducer", "/path/to/reducer.py",
        "-input", "/path/to/input",
        "-output", "/path/to/output" });
} catch (Exception e) {
    e.printStackTrace();
}
I also found this Java wrapper which you should be able to run.
Take a look at Apache Oozie: once you have defined your job via XML, you can launch it via an HTTP POST to the Oozie server.
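A rough sketch of such a call, assuming a standard Oozie setup on port 11000; job.xml here is a hypothetical Hadoop-style configuration file that sets at least user.name and oozie.wf.application.path (the HDFS path of the workflow definition):
# Submit and start a workflow job through Oozie's REST API (sketch only)
curl -X POST -H "Content-Type: application/xml" -d @job.xml \
  "http://oozie-host:11000/oozie/v1/jobs?action=start"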
When the Hadoop streaming job is run as
hadoop jar /home/training/Installations/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar -input input4 -output output4 -mapper /home/training/Code/Streaming/max_temperature_map.rb -reducer /home/training/Code/Streaming/max_temperature_reduce.rb
then org.apache.hadoop.streaming.HadoopStreaming is executed. This class is declared as the main class in the MANIFEST.MF of hadoop-streaming-1.0.3.jar. Check the code of the org.apache.hadoop.streaming.HadoopStreaming Java class for the API details.
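A quick way to confirm which main class the streaming jar declares (using the same jar path as in the command above):
unzip -p /home/training/Installations/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar META-INF/MANIFEST.MF | grep Main-Class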
