How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)? - hadoop

I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc.
Now I want to run my application with spark-submit (cluster mode) in a cluster.
The jar file with all dependencies of my project is stored on HDFS.
As long as my config files are included in the jar file everything works fine. But this is unpractical for testing purposes because I always have to rebuild the jar.
Therefore I excluded the config files of my project and I added them via "driver-class-path". This worked on client mode but if I move the config files now to HDFS and run my application in cluster mode it can't find the settings. Below you can find my spark-submit command:
/usr/local/spark/bin/spark-submit \
--total-executor-cores 10 \
--executor-memory 15g \
--verbose \
--deploy-mode cluster\
--class com.hdp.speedlayer.SpeedLayerApp \
--driver-class-path hdfs://iot-master:8020/user/spark/config \
--master spark://spark-master:6066 \
hdfs://iot-master:8020/user/spark/speed-layer-CONFIG.jar
I already tried it with the --file parameter but that also didn't work. Does anybody know how I can fix this?
Update:
I did some further research and I figured out that it could be related to the HDFS path. I changed the HDFS path to "hdfs:///iot-master:8020//user//spark//config But unfortunately that also that didn't work. But maybe this could help you.
Below you can also see the error I get when I run the driver program in cluster mode:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ExceptionInInitializerError
at com.speedlayer.SpeedLayerApp.main(SpeedLayerApp.scala)
... 6 more
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'application'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
...

Trying to achieve the same result I found out the following:
--files: is associated only to local files on machine running the spark-submit command and converts to conf.addFile(). so hdfs files wont work unless you are able to run hdfs dfs -get <....> before to retrieve the file. in my case I want to run it from oozie so I dont know on which machine its going to run and I dont want to add a copy file action to my workflow.
The quote #Yuval_Itzchakov took refers to --jars which only handles jars since it converts to conf.addJar()
So as far as I know there is no strait way to load configuration file from hdfs.
My approach was to pass the path to my app and read the configuration file and merge it into reference file:
private val HDFS_IMPL_KEY = "fs.hdfs.impl"
def loadConf(pathToConf: String): Config = {
val path = new Path(pathToConf)
val confFile = File.createTempFile(path.getName, "tmp")
confFile.deleteOnExit()
getFileSystemByUri(path.toUri).copyToLocalFile(path, new Path(confFile.getAbsolutePath))
ConfigFactory.load(ConfigFactory.parseFile(confFile))
}
def getFileSystemByUri(uri: URI) : FileSystem = {
val hdfsConf = new Configuration()
hdfsConf.set(HDFS_IMPL_KEY, classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
FileSystem.get(uri, hdfsConf)
}
P.S the error only means that the ConfigFactory didnt find any configuration file, so he couldn't find the property you are looking for.

One option is to use the --files flag and with the HDFS location and make sure you add it to your executor classpath using the spark.executor.extraClassPath flag with -Dconfig.file:
Spark uses the following URL scheme to allow different
strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP
file server, and every executor pulls the file from the driver HTTP
server.
hdfs:, http:, https:, ftp: - these pull down files and JARs
from the URI as expected
local: - a URI starting with local:/ is
expected to exist as a local file on each worker node. This means that
no network IO will be incurred, and works well for large files/JARs
that are pushed to each worker, or shared via NFS, GlusterFS, etc.
Also, you can see it when looking at the help documentation for spark-submit:
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
Running with spark-submit:
/usr/local/spark/bin/spark-submit \
--total-executor-cores 10 \
--executor-memory 15g \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
--verbose \
--deploy-mode cluster\
--class com.hdp.speedlayer.SpeedLayerApp \
--driver-class-path hdfs://iot-master:8020/user/spark/config \
--files hdfs:/path/to/conf \
--master spark://spark-master:6066 \
hdfs://iot-master:8020/user/spark/speed-layer-CONFIG.jar

Related

Spark yarn-cluster mode - read file passed with --files

I'm running my spark application using yarn-cluster master.
What does the app do?
External service generates a jsonFile based on HTTP request to a RESTService
Spark needs to read this file and do some work after parsing the json
Simplest solution that came to mind was to use --files to load that file.
In yarn-cluster mode reading a file means it must be available on hdfs (if I'm right?) and my file is being copied to path like this:
/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json
Where I can of course read it, but I cannot find a way to get this path from any configuration / SparkEnv object. And hardcoding .sparkStaging in spark code seamed like a bad idea.
Why simple:
val jsonStringData = spark.textFile(myFileName)
sqlContext.read.json(jsonStringData)
cannot read file passed with --files and throws FileNotFoundException? Why is spark looking for files in hadoop_user_folder only?
My solution which works for now:
Just before running spark, I copy file to proper hdfs folder, pass the filename as Spark argument, process the file from a known path and after the job is done I delete the file form hdfs.
I thought passing the file as --files would let me forget about saving and deleting this file. Something like pass-process-andforget.
How do you read a file passed with --files then? The only solution is with creating path by hand, hardcoding ".sparkStaging" folder path?
The question is written very ambiguously. However, from what I seem to get is that you want to read a file from any location of your Local OS File System, and not just from HDFS.
Spark uses URI's to identify paths, and in the availability of a valid Hadoop/HDFS Environment, it will default to HDFS. In that case, to point to your Local OS FileSystem, in the case of for example UNIX/LINUX, you can use something like:
file:///home/user/my_file.txt
If you are using an RDD to read from this file, you run in yarn-cluster mode, or the file is accessed within a task, you will need to take care of copying and distributing that file manually to all nodes in your cluster, using the same path. That is what it makes it easy of first putting it on hfs, or that is what the --files option is supposed to do for you.
See more info on Spark, External Datasets.
For any files that were added through the --files option, or were added through SparkContext.addFile, you can get information about their location using the SparkFiles helper class.
Answer from #hartar worked for me. Here is the complete solution.
add required files during spark-submit using --files
spark-submit --name "my_job" --master yarn --deploy-mode cluster --files /home/xyz/file1.properties,/home/xyz/file2.properties --class test.main /home/xyz/my_test_jar.jar
get spark session inside main method
SparkSession ss = new SparkSession.Builder().getOrCreate();
Since i am interested only in .properties files, i am filtering it, instead if you know the file name which you wish to read then it can be directly used in FileInputStream.
spark.yarn.dist.files would have stored it as file:/home/xyz/file1.properties,file:/home/xyz/file2.properties hence splitting the string by (,) and (/) so that i can eliminate the rest of the content except the file name.
String[] files = Pattern.compile("/|,").splitAsStream(ss.conf().get("spark.yarn.dist.files")).filter(s -> s.contains(".properties")).toArray(String[]::new);
//load all files to Property
for (String f : files) {
props.load(new FileInputStream(f));
}
I had the same problem as you, in fact, you must know that when you send an executable and files, these are at the same level, so in your executable, it is enough that you just put the file name to Access it since your executable is based on its own folder.
You do not need to use sparkFiles or any other class. Just the method like readFile("myFile.json");
I have come across an easy way to do it.
We are using Spark 2.3.0 on Yarn in pseudo distributed mode. We need to query a postgres table from spark whose configurations are defined in a properties file.
I passed the property file using --files attribute of spark submit. To read the file in my code I simply used java.util.Properties.PropertiesReader class.
I just need to ensure that the path I specify when loading file is same as that passed in --files argument
e.g. if the spark submit command looked like:
spark-submit --class --master yarn --deploy-mode client--files test/metadata.properties myjar.jar
Then my code to read the file will look like:
Properties props = new Properties();
props.load(new FileInputStream(new File("test/metadata.properties")));
Hope you find this helpful.

Spark assembly file uploaded despite spark.yarn.conf being set

I submit jobs to a Spark cluster running on Yarn using spark-submit sometimes through a relatively slow connection. In order to avoid uploading the 156MB spark-assembly file for each job, I set the configuration option spark.yarn.jar to the file on HDFS. However, this does not avoid the upload, but rather takes the assembly file from the HDFS Spark directory and copies it to the application directory:
$:~/spark-1.4.0-bin-hadoop2.6$ bin/spark-submit --class MyClass --master yarn-cluster --conf spark.yarn.jar=hdfs://node-00b/user/spark/share/lib/spark-assembly.jar my.jar
[...]
15/07/06 21:25:43 INFO yarn.Client: Uploading resource hdfs://node-00b/user/spark/share/lib/spark-assembly.jar -> hdfs://nameservice1/user/XXX/.sparkStaging/application_1434986503384_0477/spark-assembly.jar
I was expecting that the assembly file should be copied within the HDFS, but actually it seems to be downloaded and uploaded again which is quite counter-productive. Any hints on that?
Both HDFS have to be the same system. See relevant codes here:
https://github.com/apache/spark/blob/37bf76a2de2143ec6348a3d43b782227849520cc/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1308
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1308
Any reason why you can't have spark assembly jar on nameservice1 HDFS instead?

Setting S3 output file grantees for spark output files

I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files (rdd.saveAsTextFile('<file_dir_name>')). In hive, I would add a line in the beginning with set fs.s3.canned.acl=BucketOwnerFullControl and that would set the correct permissions. For Spark, I tried running:
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.driver.extraJavaOptions -Dfs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
But the permissions do not get set properly on the output files. What is the proper way to pass in the 'fs.s3.canned.acl=BucketOwnerFullControl' or any of the S3 canned permissions to the spark job?
Thanks in advance
I found the solution. In the job, you have to access the JavaSparkContext and from there get the Hadoop configuration and set the parameter there. For example:
sc._jsc.hadoopConfiguration().set('fs.s3.canned.acl','BucketOwnerFullControl')
The proper way to pass hadoop config keys in spark is to use --conf with keys prefixed with spark.hadoop.. Your command would look like
hadoop jar /mnt/var/lib/hadoop/steps/s-3HIRLHJJXV3SJ/script-runner.jar \
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
--conf "spark.hadoop.fs.s3.canned.acl=BucketOwnerFullControl" \
hdfs:///user/hadoop/spark.py
Unfortunately I cannot find any reference in official documentation of spark.

Spark-submit not working when application jar is in hdfs

I'm trying to run a spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in hdfs, i get the following exception:
Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp
Here's the command:
$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar
I'm using hadoop version 2.6.0, spark version 1.2.1
The only way it worked for me, when I was using
--master yarn-cluster
To make HDFS library accessible to spark-job , you have to run job in cluster mode.
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--class <main_class> \
--master yarn-cluster \
hdfs://myhost:8020/user/root/myjar.jar
Also, There is Spark JIRA raised for client mode which is not supported yet.
SPARK-10643 :Support HDFS application download in client mode spark submit
There is a workaround. You could mount the directory in HDFS (which contains your application jar) as local directory.
I did the same (with azure blob storage, but it should be similar for HDFS)
example command for azure wasb
sudo mount -t cifs //{storageAccountName}.file.core.windows.net/{directoryName} {local directory path} -o vers=3.0,username={storageAccountName},password={storageAccountKey},dir_mode=0777,file_mode=0777
Now, in your spark submit command, you provide the path from the command above
$ ./bin/spark-submit --class com.example.SimpleApp --master local {local directory path}/simple-project-1.0-SNAPSHOT.jar
spark-submit --master spark://kssr-virtual-machine:7077 --deploy-mode client --executor-memory 1g hdfs://localhost:9000/user/wordcount.py
For me its working I am using Hadoop 3.3.1 & Spark 3.2.1. I am able to read the file from HDFS.
Yes, it has to be a local file. I think that's simply the answer.

How to report JMX from Spark Streaming on EC2 to VisualVM?

I have been trying to get a Spark Streaming job, running on a EC2 instance to report to VisualVM using JMX.
As of now I have the following config file:
spark/conf/metrics.properties:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
And I start the spark streaming job like this:
(the -D bits I have added afterwards in the hopes of getting remote access to the ec2's jmx)
terminal:
spark/bin/spark-submit --class my.class.StarterApp --master local --deploy-mode client \
project-1.0-SNAPSHOT.jar \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false
There are two issues with the spark-submit command line:
local - you must not run Spark Standalone with local master URL because there will be no threads to run your computations (jobs) and you've got two, i.e. one for a receiver and another for the driver. You should see the following WARN in the logs:
WARN StreamingContext: spark.master should be set as local[n], n > 1
in local mode if you have receivers to get data, otherwise Spark jobs
will not get resources to process the received data.
-D options are not picked up by the JVM as they're given after the Spark Streaming application and effectively became its command-line arguments. Put them before project-1.0-SNAPSHOT.jar and start over (you have to fix the above issue first!)
spark-submit --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8090 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"/path/example/src/main/python/pi.py 10000
Notes:the configurations format : --conf "params" . tested under spark 2.+

Resources