Accessing HDFS configured for High Availability from a client program - hadoop

I am trying to understand why one program works and another does not when connecting to HDFS via the nameservice (which resolves to the active NameNode in a High Availability setup) from outside the HDFS cluster.
Not working program:
When I load both config files (core-site.xml and hdfs-site.xml) and then access an HDFS file, it throws an error:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HadoopAccess {
  def main(args: Array[String]): Unit = {
    // Empty Configuration: no default resources (core-default.xml etc.) are loaded
    val hadoopConf = new Configuration(false)
    val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
    val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
    println("hadoopConf : " + hadoopConf.get("fs.defaultFS"))

    val fs = FileSystem.get(hadoopConf)
    val check = fs.exists(new Path("/apps/hive"))
    //println("Checked : " + check)
  }
}
Error: we see an UnknownHostException
hadoopConf : hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at HadoopAccess$.main(HadoopAccess.scala:28)
at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster
Working program: when I explicitly set the High Availability values on the hadoopConf object and pass it to the FileSystem object, the program works.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HadoopAccess {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration(false)
    val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
    val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))

    // Explicitly set the High Availability properties for the mycluster nameservice
    hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
    //hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
    //hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
    hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))
    hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
    hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
    hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
    hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
    println(hadoopConf)
    /* val namenode = hadoopConf.get("fs.defaultFS")
       println("namenode: " + namenode) */

    val fs = FileSystem.get(hadoopConf)
    val check = fs.exists(new Path("hdfs://mycluster/apps/hive"))
    //println("Checked : " + check)
  }
}
Is there any reason why we need to set values for configs like dfs.nameservices, dfs.client.failover.proxy.provider.mycluster and dfs.namenode.rpc-address.mycluster.nn1 on the hadoopConf object, when these values are already present in hdfs-site.xml and core-site.xml? These configs are the High Availability NameNode settings.
I am running the above program from an edge node or locally from IntelliJ.
Hadoop version : 2.7.3.2
Hortonworks : 2.6.1
My observation in the Spark Scala REPL:
When I do val hadoopConf = new Configuration(false) and val fs = FileSystem.get(hadoopConf), this gives me the LocalFileSystem. But after I perform
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
the file system changes to DistributedFileSystem. My assumption is that some client library that is present in Spark is not available somewhere during the build, or in a common place on the edge node.
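A minimal way to reproduce this switch from the REPL (just a sketch reusing the same conf file paths; the printed class names are what I would expect from Hadoop's FileSystem cache):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"

val hadoopConf = new Configuration(false)
// Nothing loaded yet, so fs.defaultFS falls back to file:///
println(FileSystem.get(hadoopConf).getClass.getName)   // org.apache.hadoop.fs.LocalFileSystem

hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
// fs.defaultFS is now hdfs://mycluster, so a DistributedFileSystem is returned,
// provided the HA properties resolve on this machine
println(FileSystem.get(hadoopConf).getClass.getName)   // org.apache.hadoop.hdfs.DistributedFileSystem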

some client library that is present in Spark is not available somewhere during the build, or in a common place on the edge node
This common place would be $SPARK_HOME/conf and/or $HADOOP_CONF_DIR. But if you are just running a regular Scala app with java -jar or from IntelliJ, that has nothing to do with Spark.
... these values are already present in the hdfs-site.xml and core-site.xml files
Then they should be read accordingly; however, overriding them in the code shouldn't hurt either.
The values are necessary because they dictate where the actual NameNodes are running; otherwise the client thinks mycluster is a real DNS name of a single server, when it isn't.
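A quick way to confirm that the site files were actually picked up is to print the HA-related keys right after addResource and before FileSystem.get. A diagnostic sketch, reusing the file paths from the question (a null for any key means that property never reached the Configuration):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration(false)
hadoopConf.addResource(new Path("file:///C:/Users/507/conf/core-site.xml"))
hadoopConf.addResource(new Path("file:///C:/Users/507/conf/hdfs-site.xml"))

// Every one of these must resolve for hdfs://mycluster to be usable by a client
Seq(
  "fs.defaultFS",
  "dfs.nameservices",
  "dfs.ha.namenodes.mycluster",
  "dfs.namenode.rpc-address.mycluster.nn1",
  "dfs.namenode.rpc-address.mycluster.nn2",
  "dfs.client.failover.proxy.provider.mycluster"
).foreach(k => println(s"$k = ${hadoopConf.get(k)}"))

// Only create the FileSystem once the keys above are confirmed non-null
val fs = FileSystem.get(hadoopConf)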

Related

Hadoop S3 access - FileSystem vs FileContext

I'm having trouble accessing the S3 "file-system" from the HDFS FileContext object, but I can use the FileSystem object to do the same.
As I understand, FileContext has superseded FileSystem so it seems I'm doing it wrong if I need to fall back to using the FileSystem.
Am I doing it wrong? Or is the FileContext not as functional as the older FileSystem?
My functions (FYI - I'm running this from Jupyter, using spark 2.1 with hadoop 2.6.0-cdh5.5.1):
import _root_.org.apache.hadoop.conf.Configuration
import _root_.org.apache.hadoop.fs.{FileContext, FileSystem, Path}
import java.net.URI

val hdfsConf = spark.sparkContext.hadoopConfiguration

def pathExistsFs(bucket: String, pStr: String): Boolean = {
  val p = new Path(pStr)
  val fs = FileSystem.get(new URI(s"s3a://$bucket"), spark.sparkContext.hadoopConfiguration)
  fs.exists(p)
}

def pathExistsFc(bucket: String, pStr: String): Boolean = {
  val p = new Path(pStr)
  val fc = FileContext.getFileContext(new URI(s"s3a://$bucket"),
    spark.sparkContext.hadoopConfiguration)
  fc.util().exists(p)
}
The output (pathExistsFs works, pathExistsFc fails):
pathExistsFs("myBucket", "myS3Key/path.txt")
>>> res36_5: Boolean = true
pathExistsFc("myBucket", "myS3Key/path.txt")
>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3a...
org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3a
org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
org.apache.hadoop.fs.FileContext$2.run(FileContext.java:337)
org.apache.hadoop.fs.FileContext$2.run(FileContext.java:334)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:334)
org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:451)
$sess.cmd37Wrapper$Helper$Hadoop$.pathExistsFc(cmd37.sc:14)
$sess.cmd42Wrapper$Helper.<init>(cmd42.sc:8)
$sess.cmd42Wrapper.<init>(cmd42.sc:686)
$sess.cmd42$.<init>(cmd42.sc:545)
$sess.cmd42$.<clinit>(cmd42.sc:-1)
Thanks!
Stay with the FileSystem APIs; because of its low-level nature, that is actually where most of the S3 performance work goes. There is now a bridge class from FileContext to the S3AFileSystem class, but that clearly isn't in your CDH version.
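For reference, on a Hadoop version that ships the bridge (2.8 and later, to the best of my knowledge), wiring FileContext to S3A is a one-line configuration. A sketch, assuming the org.apache.hadoop.fs.s3a.S3A bridge class is actually on the classpath:
import _root_.org.apache.hadoop.fs.{FileContext, Path}
import java.net.URI

val conf = spark.sparkContext.hadoopConfiguration
// Tell the FileContext layer which AbstractFileSystem implementation backs the s3a:// scheme
conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")

val fc = FileContext.getFileContext(new URI("s3a://myBucket"), conf)
fc.util().exists(new Path("myS3Key/path.txt"))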

spark job performing poorly while converting text files to parquet format

I have a Spark Streaming application which is responsible for converting text files into parquet format on the fly and then saving the data in an external Hive table. Please refer to the piece of code below, which is one of the classes for processing text files into parquet:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object HistTableLogic {
  val logger = Logger.getLogger("file")

  def schemadef(batchId: String) {
    println("process started!")
    logger.debug("process started")
    val sourcePath = "some path"
    val destPath = "somepath"
    println(s"source path :${sourcePath}")
    println(s"dest path :${destPath}")
    logger.debug(s"source path :${sourcePath}")
    logger.debug(s"dest path :${destPath}")
    // val sc = new SparkContext(new SparkConf().set("spark.driver.allowMultipleContexts", "true"))
    val conf = new Configuration()
    println("Spark Context created!!")
    logger.debug("Spark Context created!!")
    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    println("Spark session created!")
    logger.debug("Spark session created!")
    val schema = StructType.apply(spark.read.table("hivetable").schema.fields.dropRight(2))
    try {
      val fs = FileSystem.get(conf)
      spark.sql("ALTER table hivetable drop if exists partition (batch_run_dt='" + batchId.substring(1, 9) + "', batchid='" + batchId + "')")
      // Note: the loop variable x is never used, so the same ${sourcePath}/batchId data
      // is read and written once per entry returned by listStatus(sourcePath)
      fs.listStatus(new Path(sourcePath)).foreach(x => {
        val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", "\u0001").
          schema(schema).csv(s"${sourcePath}/" + batchId).na.fill("").repartition(50).write.mode("overwrite").option("compression", "gzip")
          .parquet(s"${destPath}/batch_run_dt=" + batchId.substring(1, 9) + "/batchid=" + batchId)
        spark.sql("ALTER table hivetable add partition (batch_run_dt='" + batchId.substring(1, 9) + "', batchid='" + batchId + "')")
        logger.debug("Partition added")
      })
    } catch {
      case e: Exception =>
        println("---------Exception caught---------!")
        logger.debug("---------Exception caught---------!")
        e.printStackTrace()
        logger.debug(e.getMessage, e) // log message and stack trace (printStackTrace itself returns Unit)
    }
  }
}
I am calling the schemadef method of the above class from the main method of another Java class, which contains the logic for receiving batchIds 24x7 via a custom receiver.
Functionally the application runs fine, but it takes around 15 minutes to process even 1 GB of data, whereas if I simply load the data into the Hive table through a LOAD query, it completes within a minute.
The Spark job uses the following configuration:
SPARK_MASTER YARN
SPARK_DEPLOY-MODE CLUSTER
SPARK_DRIVER-MEMORY 13g
SPARK_NUM-EXECUTORS 6
SPARK_EXECUTOR-MEMORY 15g
SPARK_EXECUTOR-CORES 2
Please let me know if you find any flaw in this, or any other optimization I can make to speed up this process. Thank you.
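One thing that stands out in the posted code (an observation based only on the snippet above, not a confirmed root cause of the slowness): the foreach over fs.listStatus(new Path(sourcePath)) never uses its loop variable, so the same ${sourcePath}/batchId input is read, converted and written once for every entry that listStatus returns. A sketch of processing the batch exactly once, reusing spark, schema, sourcePath, destPath and batchId from the posted schemadef method and keeping the original options:
// Hypothetical rework of the loop body: convert the batch directory a single time
val df = spark.read
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")
  .schema(schema)
  .csv(s"${sourcePath}/" + batchId)
  .na.fill("")
  .repartition(50)

df.write
  .mode("overwrite")
  .option("compression", "gzip")
  .parquet(s"${destPath}/batch_run_dt=" + batchId.substring(1, 9) + "/batchid=" + batchId)

spark.sql("ALTER table hivetable add partition (batch_run_dt='" + batchId.substring(1, 9) + "', batchid='" + batchId + "')")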

HDFS path does not exist with SparkSession object when spark master is set as LOCAL

I am trying to load a dataset into a Hive table using Spark.
But when I try to load the file from an HDFS directory into Spark, I get the exception:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
These are the steps before loading the file.
val wareHouseLocation = "file:${system:user.dir}/spark-warehouse"

val sparkSession = SparkSession.builder.master("local[2]")
  .appName("SparkHive")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation)
  .getOrCreate()

import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
Exception for the statement ->
val partf = sparkSession.read.textFile("partfile")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
But I have the file in my home directory of HDFS.
hadoop fs -ls
Found 1 items
-rw-r--r-- 1 cloudera cloudera 58 2017-06-30 02:23 partfile
I tried various ways to load the dataset like:
val partfile = sparkSession.read.textFile("/user/cloudera/partfile") and
val partfile = sparkSession.read.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/partfile")
But nothing seems to work.
My spark version is 2.0.2
Could anyone tell me how to fix it ?
When you submit the job with the master set to local[2], the job is not submitted to the Spark master, so Spark does not know about the underlying HDFS.
Spark then treats the local file system as its default file system, which is why the exception occurs in your case.
Try this way:
val sparkSession = SparkSession.builder
  .master("<spark-master-ip>:<spark-port>")
  .appName("SparkHive").enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation)
  .getOrCreate()

import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
You need to know <spark-master-ip> and <spark-port> for this.
This way, Spark will use the underlying HDFS as its default file system.
It's not clear to me what the error would be with an explicit protocol specification, but usually (as was already answered) it means that the necessary configuration was not passed into the Spark context.
The first solution:
val sc = ??? // Spark Context
val config = sc.hadoopConfiguration
// you can mutate config object, it should work
config.addResource(new Path(s"${HADOOP_HOME}/conf/core-site.xml"))
// instead of adding a resource you can just specify hdfs address
// config.set("fs.defaultFS", "hdfs://host:port")
The second:
Explicitly specify HADOOP_CONF_DIR in the $SPARK_HOME/spark-env.sh file. If you plan to use a cluster, make sure that every node of your cluster has HADOOP_CONF_DIR set.
And be sure that you have all necessary Hadoop deps in your Spark / App classpath.
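Putting the fs.defaultFS option together with the paths from the question, a minimal sketch (assuming the NameNode really is quickstart.cloudera:8020, as in the question; whether this resolves the error still depends on the Hadoop client jars and configuration being on the classpath):
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.master("local[2]").appName("SparkHive").getOrCreate()

// Point the default file system at HDFS before reading any paths
sparkSession.sparkContext.hadoopConfiguration.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")

// The path now resolves against HDFS instead of the local file system
val partf = sparkSession.read.textFile("/user/cloudera/partfile")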
Try the following, it should work.
SparkSession session = SparkSession.builder().appName("Appname").master("local[1]").getOrCreate();
DataFrameReader dataFrameReader = session.read();
String path = "path\\file.csv";
Dataset<Row> responses = dataFrameReader.option("header", "true").csv(path);

error spark-shell, falling back to uploading libraries under SPARK_HOME

I'm trying to connect with spark-shell on Amazon Hadoop (EMR), but it keeps giving the following error and I don't know how to fix it or what configuration is missing.
spark.yarn.jars, spark.yarn.archive
spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/12 07:47:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/08/12 07:47:28 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Thx!!!
Error1
I'm trying to run a SQL query, something as simple as:
val sqlDF = spark.sql("SELECT col1 FROM tabl1 limit 10")
sqlDF.show()
WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Error2
Then I try to run a Scala script, something simple taken from:
https://blogs.aws.amazon.com/bigdata/post/Tx2D93GZRHU3TES/Using-Spark-SQL-for-ETL
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
import java.util.HashMap
var ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "tableDynamoDB")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
var genreRatingsCount = sqlContext.sql("SELECT col1 FROM table1 LIMIT 1")
var ddbInsertFormattedRDD = genreRatingsCount.map(a => {
  var ddbMap = new HashMap[String, AttributeValue]()

  var col1 = new AttributeValue()
  col1.setS(a.get(0).toString)
  ddbMap.put("col1", col1)

  var item = new DynamoDBItemWritable()
  item.setItem(ddbMap)

  (new Text(""), item)
})
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving object InterfaceAudience
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1502)
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1500)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
It looks like the Spark UI has not started. I tried starting spark-shell again and also checked that the Spark UI at localhost:4040 is running correctly.

using amazon s3 as input, output and to store intermediate results in EMR map reduce job

I am trying to use Amazon S3 storage with EMR. However, when I run my code I get multiple errors like:
java.lang.IllegalArgumentException: This file system object (hdfs://10.254.37.109:9000) does not support access to the request path 's3n://energydata/input/centers_200_10k_norm.csv' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at edu.stanford.cs246.hw2.KMeans$CentroidMapper.setup(KMeans.java:112)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In main I set my input and output paths like this, and I put s3n://energydata/input/centers_200_10k_norm.csv into the configuration property CFILE, which I retrieve in the mapper and reducer:
FileSystem fs = FileSystem.get(conf);
conf.set(CFILE, inPath); //inPath in this case is s3n://energydata/input/centers_200_10k_norm.csv
FileInputFormat.addInputPath(job, new Path(inputDir));
FileOutputFormat.setOutputPath(job, new Path(outputDir));
The error above occurs in my mapper and reducer when I try to access CFILE (s3n://energydata/input/centers_200_10k_norm.csv). This is how I try to get the path:
FileSystem fs = FileSystem.get(context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile)); ---->Error
s3n://energydata/input/centers_200_10k_norm.csv is one of the input arguments to the program, and when I launched my EMR job I specified my input and output directories as s3n://energydata/input and s3n://energydata/output.
I tried doing what was suggested in file path in hdfs but I'm still getting the error. Any help would be appreciated.
thanks!
Try this instead; Path.getFileSystem returns a FileSystem matching the path's scheme (s3n here), rather than the cluster default that FileSystem.get(conf) returns:
Path cFile = new Path(context.getConfiguration().get(CFILE));
FileSystem fs = cFile.getFileSystem(context.getConfiguration());
DataInputStream d = new DataInputStream(fs.open(cFile));
Thanks. I actually fixed it by using the following code:
String uriStr = "s3n://energydata/centroid/";
URI uri = URI.create(uriStr);
FileSystem fs = FileSystem.get(uri, context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile));
