Hadoop Configuration Object read XML - hadoop

I have an XML file containing some names and values that I want to read in my Spark application. How do I use the Hadoop Configuration to read these values and use them in my code?
I tried uploading the XML file to HDFS, but I'm not sure what the key is supposed to be when I call conf.get().

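Maybe you forgot to add these lines to your code. Note that addResource(Path) looks the file up on the local filesystem, so the path should point to a local copy rather than to HDFS:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
val conf = new Configuration()
conf.addResource(new Path(<path-to-file>))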
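On the question of which key to pass to conf.get(): a Hadoop configuration file is a list of <property> elements, and the lookup key is the text of a property's <name> element. The following is a minimal Python sketch of that lookup rule using only the standard library. It is a stand-in for illustration, not Hadoop's Configuration class; the property names and the load_properties helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents in the Hadoop configuration layout.
SAMPLE_XML = """<configuration>
  <property>
    <name>my.app.input.dir</name>
    <value>/data/in</value>
  </property>
  <property>
    <name>my.app.retries</name>
    <value>3</value>
  </property>
</configuration>
"""

def load_properties(xml_text):
    """Return a dict mapping each <name> to its <value> (illustrative helper)."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            # Values are kept as strings, as in the XML file.
            props[name.strip()] = (value or "").strip()
    return props

props = load_properties(SAMPLE_XML)
print(props.get("my.app.input.dir"))
print(props.get("my.app.retries"))
```

With the real API, the corresponding call after addResource has loaded such a file would be conf.get("my.app.input.dir").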

Related

How to handle service application failures developed by Hive JDBC?

I'm developing a service application based on the Hive jdbc-examples project, but the application fails to run.
I normally use Huawei's big data products for Hive JDBC sample development, and the application always fails to run.
To solve this problem, try either one of the following methods:
Method 1
Put the core-site.xml configuration file on the classpath, as in the Hive jdbc-examples sample project.
Method 2
In the code, explicitly load the core-site.xml file:
......conf = new Configuration();
String userdir = System.getProperty("user.dir") + File.separator + "conf" + File.separator;
conf.addResource(new Path(userdir + "core-site.xml"));......
Method 3
In the code, set hadoop.security.authentication to kerberos:
......CONF = new Configuration();
CONF.set("hadoop.security.authentication", "kerberos");......

Adding some hadoop configuration to a spark application at runtime (through spark-submit)?

I want to pass a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But I want to be able to send this property through spark-submit at runtime. Would this be possible?
Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which let you specify arbitrary configuration options. Property names passed this way need the spark. prefix. You can then read the values you passed with SparkConf's get method and copy them into the Hadoop configuration under the unprefixed Hadoop key:
val conf = new SparkConf()
val mrRecursive = conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
// Hadoop expects the property name without the spark. prefix
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
Spark-submit/spark-shell --help:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html
The following approach works without modifying any code.
When a Hadoop Configuration is created, it reads the file "core-default.xml" (and then "core-site.xml") from the classpath, as described here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and add the directory containing that file to the classpath with the spark-submit "--driver-class-path" parameter, they should be picked up.
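Another option that needs no code change: the Spark configuration documentation states that a property passed as spark.hadoop.<name> is added to the Hadoop configuration as <name>, so --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true should reach sc.hadoopConfiguration under the unprefixed key. The Python sketch below only models that prefix-stripping rule with plain dicts. The function name is hypothetical and this is not Spark's implementation.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."

def to_hadoop_conf(spark_conf):
    """Model of the rule: keep only spark.hadoop.* entries and drop the prefix."""
    return {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }

# Properties as they might be passed with --conf (illustrative values).
submitted = {
    "spark.app.name": "demo",
    "spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive": "true",
}
hadoop_conf = to_hadoop_conf(submitted)
print(hadoop_conf)
```

Entries without the spark.hadoop. prefix, such as spark.app.name here, stay Spark-only settings and do not appear in the Hadoop configuration.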

Why doesn't Hadoop respect 'spark.hadoop.fs' properties set in pyspark? [duplicate]

This question already has answers here: How to set hadoop configuration values from pyspark (3 answers). Closed 5 years ago.
There are three properties in my spark-defaults.conf that I want to be able to set dynamically:
spark.driver.maxResultSize
spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.secret.key
Here's my attempt to do so:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf()
    .setMaster(spark_master)
    .setAppName(app_name)
    .set('spark.driver.maxResultSize', '5g')
    .set('spark.hadoop.fs.s3a.access.key', '<access>')
    .set('spark.hadoop.fs.s3a.secret.key', '<secret>')
)
spark = SparkSession.builder.\
config(conf=conf).\
getOrCreate()
print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))
spark.stop()
Here's the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission denied error.
If I set the credentials via environment variables, I am able to read the file.
Why doesn't Hadoop respect the credentials specified this way?
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record for posterity how you can be fooled into thinking that you can set them dynamically via spark.hadoop.*, since that is the name you use to set these properties in spark-defaults.conf, and since you don't get an error directly when you try to set them this way.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf and not dynamically in pyspark.
It turns out that you can't specify Hadoop properties via:
spark.conf.set('spark.hadoop.<property>', <value>)
but you must instead use:
spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)
I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.

Dump hadoop configuration in Spark

I use sc.hadoopConfiguration.set to set configuration.
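How do I dump that configuration, either by printing it to the console or by writing it to a file?
You can dump the Hadoop configuration to an XML file (I am assuming you are using Scala). Passing System.out instead of a file stream prints it to the console:
import java.io.FileOutputStream
val out = new FileOutputStream("conf.xml")
sc.hadoopConfiguration.writeXml(out)
out.close()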

Utility to push data into HDFS

I need to build a common utility for Unix- and Windows-based systems to push data into a Hadoop system. Users should be able to run the utility from any platform and push data into HDFS.
WebHDFS is one option, but I'm curious to know whether anything else is available.
Any suggestions?
I usually create a Maven project and add this dependency to my pom.xml file:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.1</version>
</dependency>
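Pushing data into HDFS is then very easy with the Hadoop Java API. This is a simple example just to show how it works:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// Replace the bracketed placeholders with your NameNode address and port.
String namenodeLocation = "hdfs://[your-namenode-ip-address]:[hadoop:listening-port]/";
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get( new URI( namenodeLocation ), configuration );
// namenodeLocation already ends with "/", so no extra separator is needed.
Path file = new Path(namenodeLocation + "myWonderful.data");
FSDataOutputStream outStream = hdfs.create(file);
// Dummy payload: 1500 zero bytes.
byte[] coolDataToPushToHDFS = new byte[1500];
outStream.write(coolDataToPushToHDFS);
outStream.close();
hdfs.close();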
It's a really simple program. I think the steps you need are:
Let users choose the input/data to push.
Use the Hadoop Java API to send the file/data to your cluster.
Give some feedback to the user.
You can also append to an existing file, not just create a new one.
Take a look at the documentation: https://hadoop.apache.org/docs/current/api/
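Since the question mentions WebHDFS, here is a small sketch of that route as well: building the URL for the first step of a WebHDFS file upload (an HTTP PUT with op=CREATE), which needs no Hadoop libraries on the client. The host, port, and user name are placeholders, the helper name is hypothetical, and the snippet only builds the URL without sending anything.

```python
from urllib.parse import quote, urlencode

def webhdfs_create_url(host, port, hdfs_path, user=None, overwrite=False):
    """Build the URL for the initial WebHDFS CREATE request (hypothetical helper)."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    params = {"op": "CREATE", "overwrite": "true" if overwrite else "false"}
    if user:
        # user.name is the query parameter WebHDFS uses with simple authentication.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(hdfs_path), urlencode(params)
    )

# Placeholder NameNode address and port; adjust to your cluster.
url = webhdfs_create_url("namenode.example.com", 9870, "/tmp/myWonderful.data", user="alice")
print(url)
```

A client would send an HTTP PUT to this URL, follow the redirect that the NameNode returns to a DataNode, and PUT the file bytes there.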
