How to trigger a Spark job from Java code with SparkSubmit.scala

I saw that Oozie uses:
List<String> sparkArgs = new ArrayList<String>();
SparkSubmit.main(sparkArgs.toArray(new String[sparkArgs.size()]));
But when I ran this on a cluster, I always got:
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
I think that is because my program cannot find HADOOP_CONF_DIR. How do I pass that setting to SparkSubmit from Java code?
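A running JVM cannot change its own environment variables through the standard API, so HADOOP_CONF_DIR has to be in place before SparkSubmit starts. One way around this, assuming it is acceptable to run spark-submit as a child process instead of calling SparkSubmit.main in the same JVM, is to set the variable on a ProcessBuilder. The sketch below is only an illustration: the class name SparkSubmitProcess, the paths /opt/spark and /etc/hadoop/conf, and the application class and jar are all placeholders that do not come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; every path and name below is a placeholder.
class SparkSubmitProcess {

    /** Builds (but does not start) a spark-submit child process with HADOOP_CONF_DIR set. */
    static ProcessBuilder build(String sparkHome, String hadoopConfDir, List<String> sparkArgs) {
        List<String> command = new ArrayList<>();
        command.add(sparkHome + "/bin/spark-submit");
        command.addAll(sparkArgs);

        ProcessBuilder pb = new ProcessBuilder(command);
        // The child inherits the parent's environment; add or override what it needs.
        pb.environment().put("HADOOP_CONF_DIR", hadoopConfDir);
        pb.environment().put("SPARK_HOME", sparkHome);
        pb.inheritIO(); // forward spark-submit output to this process's console
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = build("/opt/spark", "/etc/hadoop/conf",
                Arrays.asList("--master", "yarn", "--class", "com.example.MyApp", "/path/to/app.jar"));
        System.out.println(pb.command());
        // To actually submit: int exitCode = pb.start().waitFor();
    }
}
```

Spark also ships org.apache.spark.launcher.SparkLauncher, whose constructor accepts a map of environment variables for the child process; that is probably the more idiomatic route if the spark-launcher artifact is on the classpath.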


Is it possible to read a file using the SparkSession object in Scala on Windows?

I've been trying to read a .csv file in many ways, using the SparkContext object. I found it possible through a function, but I want to use the spark object. Every time I run the textFile function of org.apache.spark.SparkContext I get the same error:
21/12/29 16:47:32 WARN streaming.FileStreamSink: Error while looking for metadata directory.
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
As mentioned in the title, I run the code on Windows in IntelliJ.
My build.sbt has no redundant or overlapping dependencies. I use hadoop-tools, spark-sql and hadoop-xz.
Have you tried to run your spark-shell using local mode?
spark-shell --master=local
Also, take care not to use both Hadoop-code and Hadoop-commons as dependencies, since you may run into conflicting-jar issues.
I've found the solution; more precisely, one of my colleagues did.
In the build.sbt dependencies I changed hadoop-tools to hadoop-commons and it worked.

Submitting Spark application via YARN client

I am using org.apache.spark.deploy.yarn.Client (Spark 2.1.0) to submit a Spark application to YARN (the SparkPi example). The pertinent lines are:
List<String> arguments = Lists.newArrayList("--class", "org.apache.spark.examples.SparkPi","--jar", "path/to/spark examples jar", "--arg", "10");
SparkConf sparkConf = new SparkConf();
applicationTag = "TestApp-" + new Date().getTime();
sparkConf.set("spark.yarn.submit.waitAppCompletion", "false");
sparkConf.set("spark.yarn.tags", applicationTag);
sparkConf.set("spark.submit.deployMode", "cluster");
sparkConf.set("spark.yarn.jars", "/opt/spark/jars/*.jar");
System.setProperty("SPARK_YARN_MODE", "true");
System.setProperty("SPARK_HOME", "/opt/spark");
ClientArguments cArgs = new ClientArguments(arguments.toArray(new String[arguments.size()]));
Client client = new Client(cArgs, sparkConf);
This seems to work: the Spark application appears in the YARN RM UI and succeeds. However, the container logs show that the URL for the staging directory is being picked up as
SPARK_YARN_STAGING_DIR -> file:/home/{current user}/.sparkStaging/application_xxxxxx. Going through org.apache.spark.deploy.yarn.Client suggests the likely reason: the base path for the staging directory is not picked up correctly. The base path should be hdfs://localhost:9000/user/{current user}/ rather than file:/home/{current user}/, as confirmed by the following error, which appears in the logs when the staging directory is cleaned up:
java.lang.IllegalArgumentException: Wrong FS: file:/home/user/.sparkStaging/application_1496908076154_0022, expected: hdfs://
at org.apache.hadoop.fs.FileSystem.checkPath(
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$
at org.apache.hadoop.util.ShutdownHookManager$
This all works fine when spark-submit is used, as I believe it sets all the required environment variables correctly.
I have also tried setting sparkConf.set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/{current user}"); but to no avail: it results in other errors, such as hdfs not being recognised as a valid file system.
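A staging directory under file:/ usually means the client fell back to Hadoop's built-in default file system (file:///) because it never loaded the cluster's core-site.xml. As far as I know, spark-submit handles this by putting HADOOP_CONF_DIR on the classpath, which a hand-rolled Client call does not do by itself. The sketch below is a small diagnostic, not part of Spark or Hadoop: it reads fs.defaultFS straight out of a core-site.xml with the JDK's XML parser, so you can check what the embedding JVM would see. The class name DefaultFsCheck is made up, and older configurations may use the key fs.default.name instead.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper; it only parses the XML and does not use Hadoop classes.
class DefaultFsCheck {

    /** Returns the fs.defaultFS value from a core-site.xml file, or null if it is absent. */
    static String readDefaultFs(File coreSite) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(coreSite);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            if ("fs.defaultFS".equals(text(prop, "name"))) {
                return text(prop, "value");
            }
        }
        return null;
    }

    /** Text of the first child element with the given tag, or null if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String confDir = System.getenv("HADOOP_CONF_DIR");
        if (confDir == null) {
            System.out.println("HADOOP_CONF_DIR is not set for this JVM");
            return;
        }
        File coreSite = new File(confDir, "core-site.xml");
        if (!coreSite.isFile()) {
            System.out.println("No core-site.xml found in " + confDir);
            return;
        }
        System.out.println("fs.defaultFS = " + readDefaultFs(coreSite));
    }
}
```

If this prints the expected hdfs:// URI but the staging directory still ends up under file:/, the next thing to check is whether that configuration directory is actually on the classpath of the JVM that creates the Client.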

Spark (shell), Cassandra : Hello, World?

How do you get a basic Hello, world! example running in Spark with Cassandra? So far, we've found this helpful answer:
How to load Spark Cassandra Connector in the shell?
Which works perfectly!
Then we attempted to follow the documentation and the getting-started example, which says to do this:
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE test2 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE test2.words (word text PRIMARY KEY, count int)")
}
But it says we don't have com.datastax.spark.connector.cql?
Btw, we got the Spark connector from here:
Maven Central Repository (spark-cassandra-connector-java_2.11)
So how do you get to the point where you can create a keyspace, a table and insert rows after you have Spark and Cassandra running locally?
The jar you downloaded only contains the Java API, so it won't work with the Scala Spark shell. I recommend you follow the instructions on the Spark Cassandra Connector page.
These instructions will have you build the full assembly jar with all the dependencies and add it to the Spark Shell classpath using --jars.

pig + hbase + hadoop2 integration

Has anyone successfully loaded data into hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 (that is, with the combination hadoop-2.2.0 + hbase-0.98.0 + pig-0.12.0) without encountering this error:
ERROR 2998: Unhandled internal error.
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of problems and solutions, but all of them refer to pre-hadoop2 and hbase-0.94.x, which are not applicable to my situation.
I have a 5-node hadoop-2.2.0 cluster, a 3-node hbase-0.98.0 cluster, and a client machine with hadoop-2.2.0, hbase-0.98.0 and pig-0.12.0 installed. Each of them works fine separately: HDFS, MapReduce, the region servers and Pig all work. To complete a "loading data to hbase from pig" example, I have the following export:
When I tried to run: pig -x local -f loaddata.pig
boom, the following error: ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this must be the 100th+ time I have gotten it during countless tries to figure out a working setup).
The trace log shows: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
The following is my Pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created an HBase table 'weather'.
Has anyone had success with this and would be generous enough to share it with us?
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default it builds against hbase 0.94. 94 and 95 are the only options.
If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArray, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig
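If you don't know which jar contains the missing class, a few lines of Java can scan a lib directory for it. This is a minimal sketch using only the JDK; the class name JarClassFinder is made up, and the default directory /usr/local/hbase/lib is simply taken from the REGISTER lines in the question. It only looks at jars directly inside the directory, not in subdirectories.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper for locating the jar that provides a class.
class JarClassFinder {

    /** Returns the jars directly under libDir that contain the given class. */
    static List<File> findJarsContaining(File libDir, String className) throws IOException {
        // Accepts both dotted (a.b.C) and slashed (a/b/C) class names.
        String entryName = className.replace('.', '/') + ".class";
        List<File> hits = new ArrayList<>();
        File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars == null) {
            return hits; // not a directory, or not readable
        }
        for (File jar : jars) {
            try (JarFile jf = new JarFile(jar)) {
                if (jf.getEntry(entryName) != null) {
                    hits.add(jar);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        File libDir = new File(args.length > 0 ? args[0] : "/usr/local/hbase/lib");
        String cls = args.length > 1 ? args[1]
                : "org.apache.hadoop.hbase.filter.WritableByteArrayComparable";
        for (File jar : findJarsContaining(libDir, cls)) {
            System.out.println(jar);
        }
    }
}
```

If nothing is printed, no jar in that directory ships the class, which points to a version mismatch between Pig and HBase, not a classpath problem; otherwise the printed path is a candidate value for pig.additional.jars.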

How to set a system environment variable from a Hadoop Mapper?

The problem below the line is solved but I am facing another problem.
I am doing this :
DistributedCache.addCacheFile(new URI
and in the mapper :
(i also tried
But I am getting this error:
java.lang.UnsatisfiedLinkError: no in java.library.path
I am totally clueless about what I should set java.library.path to.
I tried setting it to /home and copied every .so from the distributed cache to /home/, but it still didn't work :(
Any suggestions / solutions please?
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried :
Runtime run = Runtime.getRuntime();
Process pr=run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws IOException.
I also know about
But I am using hadoop version 0.20.2 which has Job & Configuration instead of JobConf.
I am unable to find any such variable, and this is also not a Hadoop specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance..
Why don't you export this variable on all nodes of the cluster?
Anyway, use the Configuration class as below when submitting the Job:
Configuration conf = new Configuration();
conf.set("",<string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2
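To avoid typos when several variables are involved, the k1=v1,k2=v2 value can be built from a map. This is just a small sketch of the string format described above: the class name EnvValueFormat and the second variable MY_FLAG are invented for illustration, and the property key to pass to conf.set is not shown because it is left blank in the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that renders environment entries as "k1=v1,k2=v2".
class EnvValueFormat {

    /** Joins the map entries as key=value pairs separated by commas, in iteration order. */
    static String format(Map<String, String> env) {
        return env.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is predictable.
        Map<String, String> env = new LinkedHashMap<>();
        env.put("LD_LIBRARY_PATH", "/usr/local/");
        env.put("MY_FLAG", "1"); // made-up second variable, only to show the separator
        System.out.println(format(env)); // prints LD_LIBRARY_PATH=/usr/local/,MY_FLAG=1
    }
}
```

The resulting string is what would go in place of <string value> in the conf.set call. Note that a value containing a comma would break this simple format.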
