How to trigger a Spark job from Java code with SparkSubmit.scala - hadoop

I saw that Oozie does this:
List<String> sparkArgs = new ArrayList<String>();
sparkArgs.add("--master");
sparkArgs.add("yarn-cluster");
sparkArgs.add("--class");
sparkArgs.add("com.sample.spark.HelloSpark");
...
SparkSubmit.main(sparkArgs.toArray(new String[sparkArgs.size()]));
But when I ran this on the cluster, I always got:
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
I think that is because my program cannot find HADOOP_CONF_DIR. But how do I pass those settings to SparkSubmit from Java code?
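One workaround (a sketch, assuming spark-submit is on the PATH; the application jar path is a placeholder) is to launch spark-submit as a child process and set HADOOP_CONF_DIR in its environment, which is where SparkSubmit looks for the YARN configuration. Spark also ships org.apache.spark.launcher.SparkLauncher, which wraps exactly this pattern.

```java
import java.util.ArrayList;
import java.util.List;

public class SparkSubmitLauncher {
    // Build a spark-submit invocation with HADOOP_CONF_DIR set for the child process.
    static ProcessBuilder buildSubmit(String hadoopConfDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");                // assumes spark-submit is on the PATH
        cmd.add("--master");
        cmd.add("yarn-cluster");
        cmd.add("--class");
        cmd.add("com.sample.spark.HelloSpark"); // class name taken from the question
        cmd.add("/path/to/app.jar");            // placeholder application jar
        ProcessBuilder pb = new ProcessBuilder(cmd);
        // Environment variables set here are inherited by the child process,
        // so SparkSubmit can find yarn-site.xml / core-site.xml.
        pb.environment().put("HADOOP_CONF_DIR", hadoopConfDir);
        pb.inheritIO();                         // forward spark-submit output to our console
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = buildSubmit("/etc/hadoop/conf");
        System.out.println(pb.environment().get("HADOOP_CONF_DIR"));
        // pb.start().waitFor() would actually submit the job; omitted here.
    }
}
```

This sidesteps calling SparkSubmit.main in-process, so the child picks up the environment the same way a shell invocation would.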

Related

Is it possible to read a file using a SparkSession object in Scala on Windows?

I've been trying to read from a .csv file in many ways, using the SparkContext object. I found it possible through the scala.io.Source.fromFile function, but I want to use the spark object. Every time I run the textFile function of org.apache.spark.SparkContext I get the same error:
scala> sparkSession.read.csv("file://C:\\Users\\184229\\Desktop\\bigdata.csv")
21/12/29 16:47:32 WARN streaming.FileStreamSink: Error while looking for metadata directory.
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
.....
As mentioned in the title, I run the code on Windows in IntelliJ.
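Note also that the URI in the snippet is malformed: in file://C:\... the C: part is parsed as a host, not a drive. A stdlib sketch (using the path from the question) of building a well-formed file: URI:

```java
import java.nio.file.Paths;

public class FileUriDemo {
    // Paths.get(...).toUri() yields a well-formed file: URI, e.g.
    // file:///C:/Users/184229/Desktop/bigdata.csv when run on Windows.
    static String toFileUri(String path) {
        return Paths.get(path).toUri().toString();
    }

    public static void main(String[] args) {
        System.out.println(toFileUri("C:\\Users\\184229\\Desktop\\bigdata.csv"));
    }
}
```

The resulting string can then be passed to sparkSession.read.csv(...).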
[Edit]
In build.sbt I have no redundant or overlapping dependencies. I use hadoop-tools, spark-sql and hadoop-xz.
Have you tried running your spark-shell in local mode?
spark-shell --master=local
Also take care not to use both Hadoop-code and Hadoop-commons as dependencies, since you may run into conflicting-jar issues.
I've found the solution; to be precise, one of my colleagues did. In the build.sbt dependencies I changed hadoop-tools to hadoop-commons and it worked.

Submitting Spark application via YARN client

I am using org.apache.spark.deploy.yarn.Client (Spark 2.1.0) to submit a Spark application to YARN (the SparkPi example). Following are the pertinent lines:
List<String> arguments = Lists.newArrayList("--class", "org.apache.spark.examples.SparkPi","--jar", "path/to/spark examples jar", "--arg", "10");
SparkConf sparkConf = new SparkConf();
applicationTag = "TestApp-" + new Date().getTime();
sparkConf.set("spark.yarn.submit.waitAppCompletion", "false");
sparkConf.set("spark.yarn.tags", applicationTag);
sparkConf.set("spark.submit.deployMode", "cluster");
sparkConf.set("spark.yarn.jars", "/opt/spark/jars/*.jar");
System.setProperty("SPARK_YARN_MODE", "true");
System.setProperty("SPARK_HOME", "/opt/spark");
ClientArguments cArgs = new ClientArguments(arguments.toArray(new String[arguments.size()]));
Client client = new Client(cArgs, sparkConf);
client.run();
This seems to work, and the Spark application appears in the YARN RM UI and succeeds. However, the container logs show that the URL for the staging directory is picked up as
SPARK_YARN_STAGING_DIR -> file:/home/{current user}/.sparkStaging/application_xxxxxx. Reading through org.apache.spark.deploy.yarn.Client suggests the likely reason: the base path for the staging directory is not picked up correctly. It should be hdfs://localhost:9000/user/{current user}/ rather than file:/home/{current user}/, as confirmed by the following error appearing in the logs when the staging directory is cleared:
java.lang.IllegalArgumentException: Wrong FS: file:/home/user/.sparkStaging/application_1496908076154_0022, expected: hdfs://127.0.0.1:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:707)
at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:703)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:714)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
This all works fine when spark-submit is used, as I believe it sets all the required environment variables correctly.
I have also tried setting sparkConf.set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/{current user}"); but to no avail, as it results in other errors, such as hdfs not being recognised as a valid file system.
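When spark-submit is bypassed, the Hadoop side has to be supplied explicitly. A stdlib sketch that just assembles the relevant key/value pairs (property names from the Spark configuration docs; the HDFS address is the one in the question) - Spark copies anything under the spark.hadoop.* prefix into the Hadoop Configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class YarnClientConf {
    // Settings that spark-submit would normally derive from HADOOP_CONF_DIR.
    static Map<String, String> stagingConf(String nameNode, String user) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.hadoop.fs.defaultFS", nameNode);          // copied into the Hadoop Configuration
        conf.put("spark.yarn.stagingDir", nameNode + "/user/" + user);
        return conf;
    }

    public static void main(String[] args) {
        // Each entry corresponds to a sparkConf.set(key, value) call
        // made before constructing org.apache.spark.deploy.yarn.Client.
        stagingConf("hdfs://localhost:9000", "user")
            .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

If hdfs:// is still reported as an unknown file system, the hadoop-hdfs jar (which registers the hdfs scheme with FileSystem) is probably missing from the launcher's classpath; putting HADOOP_CONF_DIR on the classpath is what spark-submit normally arranges.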

Spark (shell), Cassandra : Hello, World?

How do you get a basic Hello, world! example running in Spark with Cassandra? So far, we've found this helpful answer:
How to load Spark Cassandra Connector in the shell?
Which works perfectly!
Then we attempt to follow the documentation and the getting-started example:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
It says to do this:
import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(conf).withSessionDo { session =>
session.execute("CREATE KEYSPACE test2 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
session.execute("CREATE TABLE test2.words (word text PRIMARY KEY, count int)")
}
But it says we don't have com.datastax.spark.connector.cql?
Btw, we got the Spark connector from here:
Maven Central Repository (spark-cassandra-connector-java_2.11)
So how do you get to the point where you can create a keyspace, a table and insert rows after you have Spark and Cassandra running locally?
The jar you downloaded only has the Java API, so it won't work with the Scala Spark shell. I recommend you follow the instructions on the Spark Cassandra Connector page:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/13_spark_shell.md
These instructions will have you build the full assembly jar with all the dependencies and add it to the Spark Shell classpath using --jars.
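For reference, the linked instructions amount to something like this (the assembly jar name is a placeholder for whatever the sbt assembly build produces):

```shell
# Build the assembly jar per the connector docs, then put it on the
# shell classpath; --jars also ships it to the executors.
spark-shell --jars /path/to/spark-cassandra-connector-assembly.jar
```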

pig + hbase + hadoop2 integration

Has anyone successfully loaded data into hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 (a hadoop-2.2.0 + hbase-0.98.0 + pig-0.12.0 combination) without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of problems and solutions, but all of them refer to pre-hadoop2 and hbase-0.94.x, which were not applicable to my situation.
I have a 5-node hadoop-2.2.0 cluster, a 3-node hbase-0.98.0 cluster, and a client machine with hadoop-2.2.0, hbase-0.98.0 and pig-0.12.0 installed. Each of them functions fine separately: HDFS, MapReduce, the region servers and Pig all work. To complete a "loading data into HBase from Pig" example, I have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
and when I tried to run pig -x local -f loaddata.pig, boom, the following error: ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this must be the 100th time I have gotten it, after countless tries to figure out a working setup).
The trace log shows: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
the following is my pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created the HBase table 'weather'.
Has anyone gotten this working, and would you be generous enough to share?
Rebuild Pig for Hadoop 2 and HBase 0.95:
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default Pig builds against HBase 0.94; 94 and 95 are the only options.
If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArrayComparable, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig

How to set system environment variable from Mapper Hadoop?

The problem below the line is solved, but I am facing another problem.
I am doing this:
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(new URI("hdfs:/user/hadoop/harsh/libnative1.so"), job.getConfiguration());
and in the mapper :
System.loadLibrary("libnative1.so");
(I also tried System.loadLibrary("libnative1") and System.loadLibrary("native1").)
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I am totally clueless about what I should set java.library.path to. I tried setting it to /home and copied every .so file from the distributed cache to /home/, but it still didn't work.
Any suggestions / solutions please?
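One detail worth checking first: System.loadLibrary takes the bare library name, not the file name. A stdlib sketch showing the mapping (assuming the file is libnative1.so):

```java
public class LoadLibraryDemo {
    public static void main(String[] args) {
        // mapLibraryName shows the file name loadLibrary will search for:
        // "native1" maps to "libnative1.so" on Linux.
        System.out.println(System.mapLibraryName("native1"));
        // So the call should be System.loadLibrary("native1"), with the
        // directory containing libnative1.so listed in -Djava.library.path.
    }
}
```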
----------------------------------------------------------------
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried :
Runtime run = Runtime.getRuntime();
Process pr=run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using Hadoop 0.20.2, which has Job & Configuration instead of JobConf. I am unable to find any such variable there; besides, this is not a Hadoop-specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance.
Why don't you export this variable on all the nodes of the cluster?
Anyway, use the Configuration class as below when submitting the job:
Configuration conf = new Configuration();
conf.set("mapred.map.child.env",<string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2.
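For example, to propagate LD_LIBRARY_PATH to the map tasks (a sketch; only the value-string format is shown, the Hadoop call is the conf.set shown above):

```java
public class ChildEnvFormat {
    // Build the k1=v1,k2=v2 value expected by mapred.map.child.env.
    static String childEnv(String... pairs) {
        return String.join(",", pairs);
    }

    public static void main(String[] args) {
        String value = childEnv("LD_LIBRARY_PATH=/usr/local/lib");
        System.out.println(value);  // prints LD_LIBRARY_PATH=/usr/local/lib
        // conf.set("mapred.map.child.env", value);  // as in the snippet above
    }
}
```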
