apache spark - check if file exists - hadoop

I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, which is a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing the data.
I checked the Spark API and didn't find any method which checks if a file exists. Any ideas how to handle this?
The only method I found was sc.textFile("hdfs:///SUCCESS.txt").count(), which throws an exception when the file does not exist, so I would have to catch that exception and write my program accordingly. I don't really like this approach and am hoping to find a better alternative.

For a file in HDFS, you can use the Hadoop FileSystem API to do this:
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val exists = fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))
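Continuing the snippet above, a minimal usage sketch (the input path is just an illustration, not from the question): the second step only starts processing when the marker file is present.
if (exists) {
  // SUCCESS.txt was written by step one, so it is safe to start the job
  val data = sc.textFile("hdfs:///data/input")   // hypothetical input path
  // ... process data ...
} else {
  println("SUCCESS.txt not found, skipping this run")
}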

For PySpark, you can achieve this without invoking a subprocess using something like:
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))

I would say the best way is to call this through a function that internally performs the traditional Hadoop file-existence check:
object OutputDirCheck {
  def dirExists(hdfsDirectory: String): Boolean = {
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
    fs.exists(new org.apache.hadoop.fs.Path(hdfsDirectory))
  }
}
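A brief usage sketch (the path is the one from the question). Note that building a fresh Configuration() relies on the Hadoop config files being on the classpath; inside a Spark job you may prefer passing sc.hadoopConfiguration into the helper instead.
if (OutputDirCheck.dirExists("/path/on/hdfs/to/SUCCESS.txt")) {
  // safe to start processing
}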

Using Databricks dbutils:
def path_exists(path):
    try:
        return len(dbutils.fs.ls(path)) > 0
    except Exception:
        return False

For Spark 2.0 or higher you can use the exists method of org.apache.hadoop.fs.FileSystem:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object Test extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BigDataETL - Check if file exists")
    .getOrCreate()

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // This method returns a Boolean (true if the file exists, false if it doesn't)
  val fileExists = fs.exists(new Path("<path_to_file>"))
  if (fileExists) println("File exists!")
  else println("File doesn't exist!")
}
For Spark 1.6 to 2.0:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {
  val sparkConf = new SparkConf().setAppName("BigDataETL - Check if file exists")
  val sc = new SparkContext(sparkConf)
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val fileExists = fs.exists(new Path("<path_to_file>"))
  if (fileExists) println("File exists!")
  else println("File doesn't exist!")
}

For Java coders:
SparkConf sparkConf = new SparkConf().setAppName("myClassname");
SparkContext sparky = new SparkContext(sparkConf);
JavaSparkContext context = new JavaSparkContext(sparky);
FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(context.hadoopConfiguration());
Path path = new Path(sparkConf.get("path_to_File"));
if (!hdfs.exists(path)) {
    // Path does not exist.
} else {
    // Path exists.
}

For PySpark (Python) users:
I didn't find a way to do this with Python or PySpark alone, so we need to execute the hdfs command from Python code. This has worked for me.
hdfs command to check whether a folder exists: returns 0 if true
hdfs dfs -test -d /folder-path
hdfs command to check whether a file exists: returns 0 if true
hdfs dfs -test -e /file-path
To put this in Python code, I used the following:
import subprocess

def run_cmd(args_list):
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode

cmd = ['hdfs', 'dfs', '-test', '-d', "/folder-path"]
code = run_cmd(cmd)
if code == 0:
    print('folder exists')
    print(code)
Output if the folder exists:
folder exists
0

For PySpark:
from py4j.protocol import Py4JJavaError

def path_exist(path):
    try:
        rdd = sc.textFile(path)
        rdd.take(1)
        return True
    except Py4JJavaError as e:
        return False

@Nandeesh's answer treats every Py4JJavaError as "file does not exist". I propose adding another step to evaluate the Java exception's error message:
from py4j.protocol import Py4JJavaError

def file_exists(path):
    try:
        spark.sparkContext.textFile(path).take(1)
        return True
    except Py4JJavaError as e:
        if 'org.apache.hadoop.mapred.InvalidInputException: Input path does not exist' in str(e.java_exception):
            return False
        else:
            return True

Related

How to convert an RDD (read in from a directory of text files) into a DataFrame in Apache Spark in Scala?

I'm developing a Scala feature-extraction app using Apache Spark TF-IDF. I need to read in from a directory of text files. I'm trying to convert an RDD to a DataFrame but I'm getting the error "value toDF() is not a member of org.apache.spark.rdd.RDD[streamedRDD]". This is what I have right now ...
I have spark-2.2.1 & Scala 2.1.11. Thanks in advance.
Code:
// Creating the Spark context that will interface with Spark
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("TextClassification")
val sc = new SparkContext(conf)

// Load documents (one per line)
val data = sc.wholeTextFiles("C:/Users/*")
val text = data.map { case (filepath, text) => text }
val id = data.map { case (filepath, text) => text.split("#").takeRight(1)(0) }

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class dataStreamed(id: String, input: String)

val tweetsDF = data
  .map { case (filepath, text) =>
    val id = text.split("#").takeRight(1)(0)
    val input = text.split(":").takeRight(2)(0)
    dataStreamed(id, input)
  }
  .as[dataStreamed]
  .toDF()
  .cache()

// -------------------- TF-IDF --------------------
// From spark.apache.org
// URL http://spark.apache.org/docs/latest/ml-features.html#tf-idf
val tokenizer = new Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsDF)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
val tf = hashingTF.transform(wordsData).cache() // Hashed words

// Compute the TF-IDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(tf)
Data (text files like these in a folder are what I need to read in):
https://www.dropbox.com/s/cw3okhaosu7i1md/cars.txt?dl=0
https://www.dropbox.com/s/29tgqg7ifpxzwwz/Italy.txt?dl=0
The problem here is that the map function returns a Dataset[Row], which you assign to tweetsDF. It should be:
case class dataStreamed(id: String, input: String)

def test() = {
  val sparkConf = new SparkConf().setAppName("TextClassification").setMaster("local")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._

  // Load documents (one per line)
  val data = spark.sparkContext.wholeTextFiles("C:\\tmp\\stackoverflow\\*")
  val dataset = spark.createDataset(data)

  val tweetsDF = dataset
    .map { case (id: String, input: String) =>
      val file = id.split("#").takeRight(1)(0)
      val content = input.split(":").takeRight(2)(0)
      dataStreamed(file, content)
    }
    .as[dataStreamed]

  tweetsDF.printSchema()
  tweetsDF.show(10)
}
First, data will be an RDD[(String, String)]; then I create a new Dataset with spark.createDataset in order to be able to use map properly together with the case class. Please note that you must define the dataStreamed class outside of your method (test in this case).
Good luck
We can do this with a couple of commands/functions:
Invoke the Spark/Scala shell; you can pass driver-memory, executor-memory, executor-cores, etc. as suits your job
spark-shell
Read the text file from HDFS
val text_rdd = sc.textFile("path/to/file/on/hdfs")
Convert the text RDD to a DataFrame
val text_df = text_rdd.toDF
Save in plain text format in HDFS
text_df.write.text("path/to/hdfs")
Save in a splittable compressed format in HDFS
text_df.coalesce(1).write.parquet("path/to/hdfs")

How do we check whether there are any avro files inside an HDFS folder?

I have some avro files inside HDFS folder /user/data/output_files/file_2017-10-18
scala> val hdfsLoc ="/user/data/output_files/file_2017-10-18/*.avro"
hdfsLoc: String = /user/data/output_files/file_2017-10-18/*.avro
scala> val conf = new Configuration()
scala> val fs = FileSystem.get(conf)
scala> val result = fs.exists(new Path(hdfsLoc))
result: Boolean = false
The above code gives the result false. It says there are no avro files inside that HDFS folder.
If I give the full name of an avro file, then it returns true:
scala> val hdfsLoc ="/user/data/output_files/file_2017-10-18/part-r-00000-ed937f14-c7d1-480a-9c79-1cda3db4e6ce.avro"
hdfsLoc: String = /user/data/output_files/file_2017-10-18/part-r-00000-ed937f14-c7d1-480a-9c79-1cda3db4e6ce.avro
scala> val result = fs.exists(new Path(hdfsLoc))
result: Boolean = true
How do I ensure that there are one or more avro files inside a HDFS folder?
It seems FileSystem.exists doesn't support wildcards. A workaround is below, though it looks nasty:
val list = fs.listFiles(new Path("/test/"), true)
var result = false
while (list.hasNext()) {
  if (list.next().getPath.getName.endsWith(".avro"))
    result = true
}
The FileSystem API has a different function called globStatus which allows you to use wildcards.
It returns Array[org.apache.hadoop.fs.FileStatus].
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.globStatus(new Path("/user/data/output_files/file_2017-10-18/*.avro")).length match {
  case x: Int if (x > 0) => doSomethingWhenAvroFileExists()
  case _ => doSomethingWhenNoAvroFilesExist()
}

File found in IntelliJ but not in built jar

I am running some code through a compiler, and I have to query which operating system the user is using in order to call the appropriate binary. The code works and calls the binary in IntelliJ, but when I build a jar file with Gradle, I get a file-not-found exception (for the binary) on the line that corresponds to val tempBinaryCopy.
fun assemble(file: String) {
    val currentDirectory = System.getProperty("user.dir")
    val binary = when {
        System.getProperty("os.name").startsWith("Linux") -> javaClass.classLoader.getResource("osx_linux").file
        System.getProperty("os.name").startsWith("Mac") -> javaClass.classLoader.getResource("osx_mac").file
        else -> javaClass.classLoader.getResource("osx_win.exe").file
    }
    val binaryFile = File(binary).name
    val assemblyFile = File(file).name
    val tempBinaryCopy = File(binary).copyTo(File(currentDirectory, binaryFile), true)
    val tempAssemblyCopy = File(file).copyTo(File(currentDirectory, assemblyFile), true)
    tempAssemblyCopy.deleteOnExit()
    tempBinaryCopy.deleteOnExit()
    Files.setPosixFilePermissions(tempBinaryCopy.toPath(), setOf(PosixFilePermission.OWNER_EXECUTE))
    val process = Runtime.getRuntime().exec(arrayOf(tempBinaryCopy.absolutePath, tempAssemblyCopy.absolutePath, "-v"))
    process.inputStream.bufferedReader().readLines().forEach { println(it) }
}
The exception
Exception in thread "main" kotlin.io.NoSuchFileException: file:/Users/dross/Desktop/OS.jar!/osx_mac: The source file doesn't exist.
at kotlin.io.FilesKt__UtilsKt.copyTo(Utils.kt:179)
at kotlin.io.FilesKt__UtilsKt.copyTo$default(Utils.kt:177)
at com.max.power.os.assembler.ProvidedAssembler.assemble(ProvidedAssembler.kt:22)
I have also tried a replace on the line in question to remove the ! and the result was the same.
javaClass.classLoader.getResource("osx_linux") gives you an instance of URL, and URL.file gives you the file part of that URL. This may work as long as the file is not packaged in a JAR, but as you can see it fails when the file is packed into a JAR. You should probably use getResourceAsStream instead and copy the InputStream you receive to the destination where you want to have it.
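A minimal sketch of that stream-copy approach (written in Scala for brevity; the same java.nio calls are available from Kotlin, and the resource name and target directory mirror the ones in the question):
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

val targetDir = System.getProperty("user.dir")
val in = getClass.getClassLoader.getResourceAsStream("osx_mac")
if (in != null) {
  // Copy the packaged resource out of the JAR into a real file on disk
  val target = new File(targetDir, "osx_mac").toPath
  Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
  in.close()
}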

Spark RDD map in yarn mode does not allow access to variables?

I have a brand new install of Spark 1.2.1 on a MapR cluster, and while testing it I find that it works fine in local mode, but in yarn mode it seems unable to access variables, not even if they are broadcast. To be precise, the following test code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object JustSpark extends App {
  val conf = new org.apache.spark.SparkConf().setAppName("SimpleApplication")
  val sc = new SparkContext(conf)
  val a = List(1, 3, 4, 5, 6)
  val b = List("a", "b", "c")
  val bBC = sc.broadcast(b)
  val data = sc.parallelize(a)
  val transform = data map (t => { "hi" })
  transform.take(3) foreach (println _)
  val transformx2 = data map (t => { bBC.value.size })
  transformx2.take(3) foreach (println _)
  //val transform2 = data map (t => { b.size })
  //transform2.take(3) foreach (println _)
}
works in local mode but fails in yarn. More precisely, both methods, transform2 and transformx2, fail, and all of them work if --master local[8].
I am compiling it with sbt and submitting it with the spark-submit tool:
/opt/mapr/spark/spark-1.2.1/bin/spark-submit --class JustSpark --master yarn target/scala-2.10/simulator_2.10-1.0.jar
Any idea what is going on? The failure message just reports a Java NullPointerException at the place where it should be accessing the variable. Is there another way to pass variables into the RDD maps?
I'm going to take a pretty good guess: it's because you're using App. See https://issues.apache.org/jira/browse/SPARK-4170 for details. Write a main() method instead.
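A minimal sketch of that change (the same object, with a main() method instead of extends App):
import org.apache.spark.{SparkConf, SparkContext}

object JustSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApplication")
    val sc = new SparkContext(conf)
    val b = List("a", "b", "c")
    val bBC = sc.broadcast(b)
    val data = sc.parallelize(List(1, 3, 4, 5, 6))
    // The broadcast value is now read inside a method, not in App's delayed init
    data.map(_ => bBC.value.size).take(3).foreach(println)
    sc.stop()
  }
}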
I presume the culprit was
val transform2 = data map ( t => { b.size })
In particular, the access to the variable b. You may actually see java.io.NotSerializableException in your log files.
What is supposed to happen: Spark will attempt to serialize any referenced object. That means, in this case, the entire JustSpark object, since one of its members is referenced.
Why did this fail? Your class is not Serializable, therefore Spark is unable to send it over the wire. In particular, you have a reference to SparkContext, which does not extend Serializable:
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
So your first code, which broadcasts only the variable value, is the correct way.
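As an aside (a hedged sketch, not from the original answers): once the code lives inside a method rather than in the object body, a plain local value is also safe to capture, because the closure then serializes only that value and not the enclosing object. The object and method names below are made up for illustration.
import org.apache.spark.SparkContext

object ClosureSafe {
  def run(sc: SparkContext): Unit = {
    // b is local to the method, not a field of the enclosing object
    val b = List("a", "b", "c")
    val data = sc.parallelize(List(1, 3, 4, 5, 6))
    // The closure captures only the local list, so nothing non-serializable
    // (such as the SparkContext) gets pulled in.
    data.map(_ => b.size).take(3).foreach(println)
  }
}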
This is the original example of broadcast, from spark sources, altered to use lists instead of arrays:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MultiBroadcastTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Multi-Broadcast Test")
    val sc = new SparkContext(sparkConf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val num = if (args.length > 1) args(1).toInt else 1000000

    val arr1 = (1 to num).toList
    val arr2 = (1 to num).toList

    val barr1 = sc.broadcast(arr1)
    val barr2 = sc.broadcast(arr2)

    val observedSizes: RDD[(Int, Int)] = sc.parallelize(1 to 10, slices).map { _ =>
      (barr1.value.size, barr2.value.size)
    }
    observedSizes.collect().foreach(i => println(i))

    sc.stop()
  }
}
I compiled it in my environment and it works.
So what is the difference?
The problematic example uses extends App while the original example is a plain singleton.
So I demoted the code to a "doIt()" function
object JustDoSpark extends App {
  def doIt() {
    ...
  }
  doIt()
}
and guess what: it worked.
The problem is indeed related to serialization, but in a different way: having the code in the body of the object seems to cause problems.

How to get file size

I am running a Hadoop job. I have a FileSystem object and a Path object, and I want to know the size of the file (Path).
Any idea?
long length = fs.getFileStatus(path).getLen();
See FileSystem#getFileStatus and FileStatus#getLen in the Hadoop 2.2.0 documentation.
Another API (written in Scala):
private def getFileSizeByPath(arg: String): Long = {
  val path = new Path(arg)
  val hdfs = path.getFileSystem(new Configuration())
  val cSummary = hdfs.getContentSummary(path)
  val length = cSummary.getLength
  length
}
Note that the returned Long value is the size in bytes.
