Hadoop S3 access - FileSystem vs FileContext - hadoop

I'm having trouble accessing the S3 "file-system" from the HDFS FileContext object, but I can use the FileSystem object to do the same.
As I understand, FileContext has superseded FileSystem so it seems I'm doing it wrong if I need to fall back to using the FileSystem.
Am I doing it wrong? Or is the FileContext not as functional as the older FileSystem?
My functions (FYI - I'm running this from Jupyter, using spark 2.1 with hadoop 2.6.0-cdh5.5.1):
val hdfsConf = spark.sparkContext.hadoopConfiguration
import _root_.org.apache.hadoop.conf.Configuration
import _root_.org.apache.hadoop.fs.{FileContext, Path, FileSystem}
def pathExistsFs(bucket:String, pStr:String): Boolean = {
val p = new Path(pStr)
val fs = FileSystem.get(new URI(s"s3a://$bucket"), spark.sparkContext.hadoopConfiguration)
fs.exists(p)
}
def pathExistsFc(bucket:String, pStr:String): Boolean = {
val p = new Path(pStr)
val fc = FileContext.getFileContext(new URI(s"s3a://$bucket"),
spark.sparkContext.hadoopConfiguration)
fc.util().exists(p)
}
The output (pathExistsFs works, pathExistsFc fails):
pathExistsF("myBucket", "myS3Key/path.txt")
>>> res36_5: Boolean = true
pathExistsFc("myBucket", "myS3Key/path.txt")
>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3a...
org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3a
org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
org.apache.hadoop.fs.FileContext$2.run(FileContext.java:337)
org.apache.hadoop.fs.FileContext$2.run(FileContext.java:334)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:334)
org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:451)
$sess.cmd37Wrapper$Helper$Hadoop$.pathExistsFc(cmd37.sc:14)
$sess.cmd42Wrapper$Helper.<init>(cmd42.sc:8)
$sess.cmd42Wrapper.<init>(cmd42.sc:686)
$sess.cmd42$.<init>(cmd42.sc:545)
$sess.cmd42$.<clinit>(cmd42.sc:-1)
Thanks!

Stay with the FileSystem APIs; because of it's low-level nature, it's actually where most of the S3 performance dev goes. There is now a bridge class from FileContext to the S3AFileSystem class, but that clearly isn't in your CDH version.

Related

Accessing HDFS configured as High availability from Client program

I am trying to understand the context of the working and not working program which connects HDFS via nameservice(which connects active name node - High availability Namenode) outside HDFS cluster.
Not working program:
When i read both config files (core-site.xml and hdfs-site.xml) and accessing HDFS file , it throws an error
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object HadoopAccess {
def main(args: Array[String]): Unit ={
val hadoopConf = new Configuration(false)
val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
println("hadoopConf : " + hadoopConf.get("fs.defaultFS"))
val fs = FileSystem.get(hadoopConf)
val check = fs.exists(new Path("/apps/hive"));
//println("Checked : "+ check)
}
}
Error : We see that Unknownhost Exception
hadoopConf :
hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at HadoopAccess$.main(HadoopAccess.scala:28)
at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster
Working Program : I specifically set the High availability into hadoopConf object and passing to Filesystem object , the program works
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
object HadoopAccess {
def main(args: Array[String]): Unit ={
val hadoopConf = new Configuration(false)
val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
//hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
//hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))
hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
println(hadoopConf)
/* val namenode = hadoopConf.get("fs.defaultFS")
println("namenode: "+ namenode) */
val fs = FileSystem.get(hadoopConf)
val check = fs.exists(new Path("hdfs://mycluster/apps/hive"));
//println("Checked : "+ check)
}
}
Any reason why we need to set values for this configs like dfs.nameservices,fs.client.failover.proxy.provider.mycluster,dfs.namenode.rpc-address.mycluster.nn1 in hadoopconf object as this values already present in hdfs-site.xml file and core-site.xml. These configs are High availability Namenode settings.
The above program which I am running via Edge mode or local IntelliJ.
Hadoop version : 2.7.3.2
Hortonworks : 2.6.1
My observation in Spark Scala REPL :
When I do val hadoopConf = new Configuration(false) and val fs = FileSystem.get(hadoopConf) .This gives me Local FileSystem .So when I perform below
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
,now the file System changed to DFSFileSysyem ..My assumption is that some client library which is in Spark that is not available in somewhere in during build or edge node common place .
some client library which is in Spark that is not available in somewhere in during build or edge node common place
This common place would be $SPARK_HOME/conf and/or $HADOOP_CONF_DIR. But if you are just running a regular Scala app with java jar or with IntelliJ, that has nothing to do with Spark.
... this values already present in hdfs-site.xml file and core-site.xml
Then, they should be read, accordingly, however overriding in the code shouldn't hurt either.
The values are necessary because they dicate where the actual namenodes are running; otherwise, it thinks mycluster is a real DNS name of only one server, when it isn't

How to check if a file exists in Google Storage from Spark Dataproc?

I was assuming that Google Storage connector would allow to query GS directly as if it was HDFS from Spark in Dataproc, but it looks like the following does not work (from Spark Shell):
scala> import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.FileSystem
scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path
scala> FileSystem.get(sc.hadoopConfiguration).exists(new Path("gs://samplebucket/file"))
java.lang.IllegalArgumentException: Wrong FS: gs://samplebucket/file, expected: hdfs://dataprocmaster-m
Is there a way to access Google Storage files using just the Hadoop API?
That's because FileSystem.get(...) returns the default FileSystem which according to your configuration is HDFS and can only work with paths starting with hdfs://. Use the following to get the correct FS.
Path p = new Path("gs://...");
FileSystem fs = p.getFileSystem(...);
fs.exists(p);
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}
val p = "gs://<your dir>"
val path = new Path(p)
val fs = path.getFileSystem(sc.hadoopConfiguration)
fs.exists(path)
fs.isDirectory(path)
I translated #Pradeep Gollakota answer to PySpark, thanks!!
def path_exists(spark, path): #path = gs://.... return true if exists
p = spark._jvm.org.apache.hadoop.fs.Path(path)
fs = p.getFileSystem(spark._jsc.hadoopConfiguration())
return fs.exists(p)

HDFS path does not exist with SparkSession object when spark master is set as LOCAL

I am trying to load a dataset into Hive table using Spark.
But when I try to load the file from HDFS directory to Spark, I get the exception:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
These are the steps before loading the file.
val wareHouseLocation = "file:${system:user.dir}/spark-warehouse"
val SparkSession = SparkSession.builder.master("local[2]") \
.appName("SparkHive") \
.enableHiveSupport() \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode","nonstrict") \
.config("hive.metastore.warehouse.dir","/user/hive/warehouse") \
.config("spark.sql.warehouse.dir",wareHouseLocation).getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
Exception for the statement ->
val partf = sparkSession.read.textFile("partfile")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
But I have the file in my home directory of HDFS.
hadoop fs -ls
Found 1 items
-rw-r--r-- 1 cloudera cloudera 58 2017-06-30 02:23 partfile
I tried various ways to load the dataset like:
val partfile = sparkSession.read.textFile("/user/cloudera/partfile") and
val partfile = sparkSession.read.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/partfile")
But nothing seems to work.
My spark version is 2.0.2
Could anyone tell me how to fix it ?
When you submit the job by setting master as local[2], your job is not getting submitted to spark master and so, spark does not know about underlying HDFS.
Spark will consider local file system as its default file system, and that's why, IOException occurs in your case.
Try this way:
val SparkSession = SparkSession.builder \
.master("<spark-master-ip>:<spark-port>") \
.appName("SparkHive").enableHiveSupport() \
.config("hive.exec.dynamic.partition", "true") \
.config("hive.exec.dynamic.partition.mode","nonstrict") \
.config("hive.metastore.warehouse.dir","/user/hive/warehouse") \
.config("spark.sql.warehouse.dir",wareHouseLocation).getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
You need to know <spark-master-ip> and <spark-port> for this.
This way, spark will take underlying hdfs file system as its default file system.
It's not clear for me what would be an error with explicit protocol specification but usually (as already was answered) it means that no neccesary configurations were passed into Spark context.
The first solution:
val sc = ??? // Spark Context
val config = sc.hadoopConfiguration
// you can mutate config object, it should work
config.addResource(new Path(s"${HADOOP_HOME}/conf/core-site.xml"))
// instead of adding a resource you can just specify hdfs address
// config.set("fs.defaultFS", "hdfs://host:port")
The second:
Explicitly specify HADOOP_CONF_DIR in $SPARK_HOME/spark-env.sh file. If you plan to use a cluster, be sure that every node of your cluster have HADOOP_CONF_DIR specified.
And be sure that you have all necessary Hadoop deps in your Spark / App classpath.
Try the following, it should work.
SparkSession session = SparkSession.builder().appName("Appname").master("local[1]").getOrCreate();
DataFrameReader dataFrameReader = session.read();
String path = "path\\file.csv";
Dataset <Row> responses = dataFrameReader.option("header","true").csv(path);

Can I create sequence file using spark dataframes?

I have a requirement in which I need to create a sequence file.Right now we have written custom api on top of hadoop api,but since we are moving in spark we have to achieve the same using spark.Can this be achieved using spark dataframes?
AFAIK there is no native api available directly in DataFrame except the below approach
Please try/think some thing like(which is RDD of DataFrame style, inspired by SequenceFileRDDFunctions.scala & method saveAsSequenceFile) in below example :
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable
object driver extends App {
val conf = new SparkConf()
.setAppName("HDFS writable test")
val sc = new SparkContext(conf)
val empty = sc.emptyRDD[Any].repartition(10)
val data = empty.mapPartitions(Generator.generate).map{ (NullWritable.get(), _) }
val seq = new SequenceFileRDDFunctions(data)
// seq.saveAsSequenceFile("/tmp/s1", None)
seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
sc.stop()
}
Further information pls see ..
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file

using amazon s3 as input,output and to store intermediate results in EMR map reduce job

I am trying to use Amazon s3 storage with EMR. However, when I currently run my code I get multiple errors like
java.lang.IllegalArgumentException: This file system object (hdfs://10.254.37.109:9000) does not support access to the request path 's3n://energydata/input/centers_200_10k_norm.csv' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at edu.stanford.cs246.hw2.KMeans$CentroidMapper.setup(KMeans.java:112)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In main I set my input and output paths like this and I put s3n://energydata/input/centers_200_10k_norm.csv in configuration CFILE that I retrieve in the mapper and reducer:
FileSystem fs = FileSystem.get(conf);
conf.set(CFILE, inPath); //inPath in this case is s3n://energydata/input/centers_200_10k_norm.csv
FileInputFormat.addInputPath(job, new Path(inputDir));
FileOutputFormat.setOutputPath(job, new Path(outputDir));
The specific example where the error above occurs in my mapper and reducer where I try to access CFILE (s3n://energydata/input/centers_200_10k_norm.csv). This is how I try to get the path:
FileSystem fs = FileSystem.get(context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile)); ---->Error
s3n://energydata/input/centers_200_10k_norm.csv is one of the input arguments to the program and when I launched my EMR job I specified my input and output directories to be s3n://energydata/input and s3n://energydata/output
I tried doing what was suggested in file path in hdfs but I'm still getting the error. Any help would be appreciated.
thanks!
try instead:
Path cFile = new Path(context.getConfiguration().get(CFILE));
FileSystem fs = cFile.getFileSystem(context.getConfiguration());
DataInputStream d = new DataInputStream(fs.open(cFile));
thanks. I actually fixed it by using the following code:
String uriStr = "s3n://energydata/centroid/";
URI uri = URI.create(uriStr);
FileSystem fs = FileSystem.get(uri, context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile));

Resources