Submitting Spark application via YARN client - hadoop

I am using org.apache.spark.deploy.yarn.Client (Spark 2.1.0) to submit a Spark application to YARN (the SparkPi example). These are the pertinent lines:
// Arguments for the YARN Client (not the application's own main args)
List<String> arguments = Lists.newArrayList("--class", "org.apache.spark.examples.SparkPi", "--jar", "path/to/spark examples jar", "--arg", "10");
SparkConf sparkConf = new SparkConf();
String applicationTag = "TestApp-" + new Date().getTime();
sparkConf.set("spark.yarn.submit.waitAppCompletion", "false");
sparkConf.set("spark.yarn.tags", applicationTag);
sparkConf.set("spark.submit.deployMode", "cluster");
sparkConf.set("spark.yarn.jars", "/opt/spark/jars/*.jar");
System.setProperty("SPARK_YARN_MODE", "true");
System.setProperty("SPARK_HOME", "/opt/spark");
ClientArguments cArgs = new ClientArguments(arguments.toArray(new String[arguments.size()]));
Client client = new Client(cArgs, sparkConf);
client.run();
This seems to work: the Spark application appears in the YARN RM UI and succeeds. However, the container logs show that the URL for the staging directory is picked up as
SPARK_YARN_STAGING_DIR -> file:/home/{current user}/.sparkStaging/application_xxxxxx. Going through org.apache.spark.deploy.yarn.Client, the likely reason is that the base path for the staging directory is not resolved correctly. The base path should be hdfs://localhost:9000/user/{current user}/ rather than file:/home/{current user}/, which is confirmed by the following error appearing in the logs when the staging directory is cleaned up:
java.lang.IllegalArgumentException: Wrong FS: file:/home/user/.sparkStaging/application_1496908076154_0022, expected: hdfs://127.0.0.1:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:707)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:703)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:714)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
This all works fine when spark-submit is used, presumably because it sets all the required environment variables correctly.
I have also tried setting sparkConf.set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/{current user}"), but to no avail: it results in other errors, such as hdfs not being recognised as a valid file system.
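One quick way to narrow this down (a sketch, assuming the root cause is that the Hadoop configuration from HADOOP_CONF_DIR is not on the submitting JVM's classpath) is to check which filesystem a plain Hadoop Configuration resolves to; if core-site.xml is not visible, fs.defaultFS falls back to file:///, which would match the file:/home/... staging directory seen above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration hadoopConf = new Configuration();
        // Prints file:/// when core-site.xml is not on the classpath,
        // hdfs://localhost:9000 when it is picked up correctly
        System.out.println("fs.defaultFS = " + hadoopConf.get("fs.defaultFS"));
        System.out.println("default FS   = " + FileSystem.get(hadoopConf).getUri());
    }
}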

Related

FileStream on cluster gives me an exception

I am writing a Spark Streaming application using fileStream...
val probeFileLines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP", filterF, false) //.persist(StorageLevel.MEMORY_AND_DISK_SER)
But I get a file I/O exception:
16/09/07 10:20:30 WARN FileInputDStream: Error finding new files
java.io.FileNotFoundException: /mapr/cellos-mapr/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP
at com.mapr.fs.MapRFileSystem.listMapRStatus(MapRFileSystem.java:1486)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:1523)
at com.mapr.fs.MapRFileSystem.listStatus(MapRFileSystem.java:86)
The directory does exist in my cluster, though.
I am running my job using spark-submit:
spark-submit --class "StreamingEngineSt" target/scala-2.11/sprkhbase_2.11-1.0.2.jar
This could be related to file permissions or ownership (you may need the hdfs user); a quick check is sketched below.
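A minimal way to verify this, assuming the cluster configuration (core-site.xml) is on the classpath so the correct filesystem is resolved, is to stat the directory as the same user that submits the Spark job and inspect its owner and permissions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckStreamingDir {
    public static void main(String[] args) throws Exception {
        // Same path as in the fileStream call above
        Path dir = new Path("/data-sources/DXE_Ver/1.4/MTN_Abuja/DXE/20160221/HTTP");
        FileSystem fs = FileSystem.get(new Configuration());
        // Throws FileNotFoundException if the path is not visible to this user
        FileStatus status = fs.getFileStatus(dir);
        System.out.println("owner=" + status.getOwner()
                + " group=" + status.getGroup()
                + " permissions=" + status.getPermission());
    }
}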

How to trigger a Spark job from Java code with SparkSubmit.scala

I saw that Oozie uses
List<String> sparkArgs = new ArrayList<String>();
sparkArgs.add("--master");
sparkArgs.add("yarn-cluster");
sparkArgs.add("--class");
sparkArgs.add("com.sample.spark.HelloSpark");
...
SparkSubmit.main(sparkArgs.toArray(new String[sparkArgs.size()]));
But when I ran this on the cluster, I always got
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
I think that is because my program cannot find HADOOP_CONF_DIR. But how do I pass that setting to SparkSubmit from Java code?
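One possible alternative (a sketch, not the Oozie approach above) is org.apache.spark.launcher.SparkLauncher, available since Spark 1.4, which runs spark-submit as a child process and accepts an environment map, so HADOOP_CONF_DIR can be passed explicitly; the paths below are placeholders:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchOnYarn {
    public static void main(String[] args) throws Exception {
        // Environment for the spark-submit child process (placeholder paths)
        Map<String, String> env = new HashMap<>();
        env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf");

        Process spark = new SparkLauncher(env)
                .setSparkHome("/opt/spark")
                .setAppResource("/path/to/your-app.jar")
                .setMainClass("com.sample.spark.HelloSpark")
                .setMaster("yarn")
                .setDeployMode("cluster")
                .launch();
        spark.waitFor();
    }
}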

/hbase/meta-region-server because node does not exist (not an error)

I'm running HBase in cluster mode and I'm getting the following error:
DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil - catalogtracker-on-hconnection-0x6e704bd0x0, quorum=node2:2181, baseZNode=/hbase Set watcher on znode that does not yet exist, /hbase/meta-region-server
I had a similar error and resolved it by doing the following:
1) Making sure the HBase client version is compatible with the HBase version on the cluster.
2) Adding hbase-site.xml to your application classpath, so that the HBase client determines all the appropriate HBase configuration from it.
val conf = org.apache.hadoop.hbase.HBaseConfiguration.create()
// Instead of the following settings, pass hbase-site.xml in classpath
// conf.set("hbase.zookeeper.quorum", hbaseHost)
// conf.set("hbase.zookeeper.property.clientPort", hbasePort)
HBaseAdmin.checkHBaseAvailable(conf);
log.debug("HBase found! with conf " + conf);
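For point 1, a minimal compatibility check (a sketch in Java, assuming hbase-site.xml is already on the classpath) is to print the client-side HBase version, confirm the cluster is reachable, and then compare that version against the one shown in the HBase Master UI:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.VersionInfo;

public class HBaseCompatCheck {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        // Version of the client libraries bundled with the application
        System.out.println("HBase client version: " + VersionInfo.getVersion());
        // Throws an exception if the cluster cannot be reached with this configuration
        HBaseAdmin.checkHBaseAvailable(conf);
        System.out.println("Cluster reachable via quorum " + conf.get("hbase.zookeeper.quorum"));
    }
}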

Unresponsive Spark Master

I'm trying to run a simple Spark app in Standalone mode on Mac.
I managed to run ./sbin/start-master.sh to start the master server and a worker.
./bin/spark-shell --master spark://MacBook-Pro.local:7077 also works, and I can see it in the running application list in the Master WebUI.
Now I'm trying to write a simple spark app.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
      .setMaster("spark://MacBook-Pro.local:7077")
    val sc = new SparkContext(conf)
    val logFile = "README.md"
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Running this simple app gives me an error message that the master is unresponsive:
15/02/15 09:47:47 INFO AppClient$ClientActor: Connecting to master spark://MacBook-Pro.local:7077...
15/02/15 09:47:48 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#MacBook-Pro.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/15 09:48:07 INFO AppClient$ClientActor: Connecting to master spark://MacBook-Pro.local:7077...
15/02/15 09:48:07 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#MacBook-Pro.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/15 09:48:27 INFO AppClient$ClientActor: Connecting to master spark://MacBook-Pro.local:7077...
15/02/15 09:48:27 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#MacBook-Pro.local:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/15 09:48:47 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/02/15 09:48:47 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
15/02/15 09:48:47 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
Any idea what is the problem?
Thanks
You can set the master either when calling spark-submit or (as you've done here) by setting it explicitly via SparkConf. Try following the example in the Spark Configuration docs and setting the master as follows:
val conf = new SparkConf().setMaster("local[2]")
From the same page (explaining the number in brackets that follows local): "Note that we run with local[2], meaning two threads - which represents “minimal” parallelism, which can help detect bugs that only exist when we run in a distributed context."
I got the same issue and finally solved it. In my case, I wrote my source code against Scala 2.11, but I built Spark with Maven using the default command:
build/mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
This build sets the Scala version to 2.10. Because of the Scala version mismatch between the Spark client and the master, serialization becomes incompatible when the client sends messages to the master via the remote actor, and the "All masters are unresponsive" error message ends up in the console.
My solution:
1. Rebuild Spark for Scala 2.11 (make sure your programming environment is on Scala 2.11). Run the following commands in SPARK_HOME:
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
After building, the package will be located in SPARK_HOME/assembly/target/scala-2.11. If you start your Spark server using start-all.sh now, it will report that the target package can't be found.
2. Go to the conf folder and edit spark-env.sh, appending the line below:
export SPARK_SCALA_VERSION="2.11"
3. Run start-all.sh, set the correct master URL in your program, and run it. Done!
Note: the error message in the console is not enough on its own, so you need to turn logging on to inspect what is happening. Go to the conf folder and copy log4j.properties.template to log4j.properties. After the Spark master has started, the log files are saved in the SPARK_HOME/logs folder.
I write my code in Java, but I had the same problem. My Scala version was 2.10 while my dependencies were 2.11, so I changed spark-core_2.11 and spark-sql_2.11 to spark-core_2.10 and spark-sql_2.10 in pom.xml. Maybe you can solve your issue in a similar way.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

How to set a system environment variable from a Hadoop mapper?

The problem below the line is solved but I am facing another problem.
I am doing this:
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(new URI
("hdfs:/user/hadoop/harsh/libnative1.so"),conf.getConfiguration());
and in the mapper:
System.loadLibrary("libnative1.so");
(I also tried
System.loadLibrary("libnative1");
System.loadLibrary("native1");
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I am totally clueless about what I should set java.library.path to.
I tried setting it to /home and copied every .so from the distributed cache to /home/, but it still didn't work :(
Any suggestions / solutions please?
--------------------------------------------------------------------------------
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried:
Runtime run = Runtime.getRuntime();
Process pr=run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws an IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using Hadoop version 0.20.2, which has Job & Configuration instead of JobConf.
I am unable to find any such variable there, and this is not a Hadoop-specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance..
Why don't you export this variable on all nodes of the cluster?
Anyway, use the Configuration class as below while submitting the job:
Configuration conf = new Configuration();
conf.set("mapred.map.child.env",<string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2
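Applied to the original question's LD_LIBRARY_PATH, the setting might look like the fragment below (the library path is a placeholder; adjust it to wherever your .so files live on the task nodes):
Configuration conf = new Configuration();
// value uses the k1=v1,k2=v2 format; here a single variable is set for the map tasks' environment
conf.set("mapred.map.child.env", "LD_LIBRARY_PATH=/usr/local/lib");
Job job = new Job(conf);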
