Adding Hadoop configuration to a Spark application at runtime (through spark-submit)?

I want to send a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
But I want to be able to send this property through spark-submit at runtime. Is that possible?

Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which allow you to specify such arbitrary configuration options. You can then read the values you passed with the SparkConf.get function:
val conf = new SparkConf()
// value passed at submit time, e.g. via --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true
val mrRecursive =
  conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
// copy it into the Hadoop configuration under the un-prefixed key that Hadoop actually reads
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
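The property can then be supplied at submit time with --conf. A sketch (the application class and jar names here are placeholders):
spark-submit --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true --class com.example.MyApp my-app.jar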
Spark-submit/spark-shell --help:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html

Without modifying the code, the following approach can also be used.
A Hadoop Configuration reads the file "core-default.xml" during creation; this is described here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and include the directory containing that file on the classpath via the spark-submit "driver-class-path" parameter, it can work.
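For illustration, a minimal sketch of such a file, using the property from the question above (the directory holding it would then be passed via --driver-class-path):
<configuration>
  <property>
    <name>mapreduce.input.fileinputformat.input.dir.recursive</name>
    <value>true</value>
  </property>
</configuration>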

Related

Is there any way in Spark to keep each stage's run time?

I am measuring the run times of a Spark job with different resource configurations and need to compare the run time of each stage. I can see them in the UI, but only while the job is running.
I run my job on a Hadoop cluster and use Yarn as the resource manager.
Is there any way to keep each stage's run-time? Is there any log for them?
UPDATE:
I read the monitoring document mentioned in the comments and added the following lines, but it doesn't work:
in spark-defaults.conf :
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///[nameNode]:8020/[PathToSparkEventLogDir]
spark.history.fs.logDirectory hdfs:///[nameNode]:8020/[PathTosparkLogDirectory]
in spark-env.sh:
export SPARK_PUBLIC_DNS=[nameNode]
SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$sparkHistoryDir"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.interval=7d"
It looks for the /tmp/spark-events/ folder, and when I create it and start the history server, it doesn't show any complete or incomplete applications.
Note that I also tried the logDirectory value without the port number, but it didn't work.
I was able to run the Spark History Server and see the history of completed and incomplete applications by applying the following steps:
Set the public DNS value in conf/spark-env.sh
export SPARK_PUBLIC_DNS=NameNode-IP
Add these properties to SparkConf in my Java code:
SparkConf conf = new SparkConf()
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "hdfs:///user/[user-path]/sparkEventLog")
.set("spark.history.fs.logDirectory", "hdfs:///user/[user-path]/sparkEventLog")
Create the property file (spark/conf/history.properties) containing the following lines:
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/[user-path]/sparkEventLog
spark.history.fs.logDirectory hdfs:///user/[user-path]/sparkEventLog
Start the history server:
./sbin/start-history-server.sh --properties-file ./conf/history.properties
Note: The properties spark.eventLog.dir and spark.history.fs.logDirectory should have the same value.
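Once the history server is up, per-stage timings can also be pulled from its REST API instead of the UI. A minimal Scala sketch, assuming the server listens on its default port 18080; the host name and application id below are placeholders:
import scala.io.Source

val historyHost = "history-server-host"       // placeholder: your history server host
val appId = "application_1487000000000_0001"  // placeholder: your application id
// each stage entry in the returned JSON carries fields such as submissionTime and completionTime
val stagesJson = Source.fromURL(s"http://$historyHost:18080/api/v1/applications/$appId/stages").mkString
println(stagesJson)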

Hadoop Configuration Object read XML

I have an XML file containing some names and values that I want to read in my Spark application. How do I use the Hadoop Configuration to read these values and use them in my code?
I tried uploading the XML file to HDFS, but I'm not sure what the key is supposed to be when I use conf.get().
Maybe you forgot to include these lines in your code:
val conf = new Configuration()
conf.addResource(new Path(<path-to-file>))
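The key you pass to conf.get is simply whatever <name> element your XML resource declares. A minimal sketch, with a hypothetical file path and property name:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// suppose the uploaded file (hypothetical path) contains:
// <configuration>
//   <property><name>my.app.threshold</name><value>42</value></property>
// </configuration>
val conf = new Configuration()
val fs = FileSystem.get(sc.hadoopConfiguration)
// reading through an InputStream avoids any doubt about whether addResource(Path)
// resolves HDFS URIs on your Hadoop version
conf.addResource(fs.open(new Path("/user/me/my-settings.xml")))
val threshold = conf.get("my.app.threshold")  // the key is the <name> from the XML
println(threshold)                            // "42"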

Why doesn't Hadoop respect 'spark.hadoop.fs' properties set in pyspark? [duplicate]

There are three properties in my spark-defaults.conf that I want to be able to set dynamically:
spark.driver.maxResultSize
spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.secret.key
Here's my attempt to do so:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf()
    .setMaster(spark_master)
    .setAppName(app_name)
    .set('spark.driver.maxResultSize', '5g')
    .set('spark.hadoop.fs.s3a.access.key', '<access>')
    .set('spark.hadoop.fs.s3a.secret.key', '<secret>')
)
spark = SparkSession.builder.\
config(conf=conf).\
getOrCreate()
print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))
spark.stop()
Here's the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission denied error.
If I set the credentials via environment variables, I am able to read the file.
Why doesn't Hadoop respect the credentials specified this way?
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record for posterity how you can be fooled into thinking that you can set them dynamically via spark.hadoop.*, since that is the name you use to set these properties in spark-defaults.conf, and since you don't get an error directly when you try to set them this way.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf, not dynamically in pyspark.
It turns out that you can't specify Hadoop properties via:
spark.conf.set('spark.hadoop.<property>', <value>)
but you must instead use:
spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)
I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
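Putting it together, a sketch of what the answer describes (the bucket path and the placeholder credentials are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master(spark_master).appName(app_name).getOrCreate()

# S3A properties go on the Hadoop configuration, without the 'spark.hadoop.' prefix
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')
hadoop_conf.set('fs.s3a.secret.key', '<secret>')

df = spark.read.csv('s3a://some-bucket/some-file.csv')  # hypothetical path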

Dump hadoop configuration in Spark

I use sc.hadoopConfiguration.set to set configuration.
How do I dump those configs? Either print them on the console or dump them to a file.
You can dump the Hadoop configuration to an XML file (I am assuming you are using Scala):
import java.io.FileOutputStream
val out = new FileOutputStream("conf.xml")
sc.hadoopConfiguration.writeXml(out)
out.close()
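If you would rather print them on the console, the Hadoop Configuration is iterable, so a short sketch like this works too:
import scala.collection.JavaConverters._

// each entry is a java.util.Map.Entry[String, String]
sc.hadoopConfiguration.iterator().asScala.foreach { entry =>
  println(s"${entry.getKey}=${entry.getValue}")
}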

Inject external properties for Apache Storm Flux

I am using Flux 1.0.0 and I have rewritten my topology as a YAML file. But I have some properties that used to be part of the configuration file I ran the topology with via the Storm driver:
storm.Driver --config myConfig/config.conf
Now with Storm Flux, how can I inject the properties that are in config.conf into my topology?
I am currently doing java -cp myStormJar org.apache.storm.flux.Flux --local /src/main/resources/myTopology.yaml
I tried to use the --resources option, followed by the path to the conf file, but it does not inject it.
Add filter placeholders (e.g. ${kafka.mapper.zkPort}) to your YAML file where the values should be substituted.
To make a property available in the storm config, re-declare the filtered property under config:, for example:
name: "storm-topology"
config:
kafka.mapper.zkPort: ${kafka.mapper.zkPort}
kafka.mapper.zkServers: ${kafka.mapper.zkServers}
You can also review the simple_hdfs.yaml example at: https://github.com/ptgoetz/flux/tree/master/flux-examples
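A sketch of how the substitution might be wired up, assuming your Flux version supports property substitution via the --filter option (check the Flux usage output for your release); the properties file name and values are placeholders:
# dev.properties (hypothetical)
kafka.mapper.zkPort=2181
kafka.mapper.zkServers=zk-host-1,zk-host-2

java -cp myStormJar org.apache.storm.flux.Flux --local --filter dev.properties /src/main/resources/myTopology.yaml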
