I am running a Hadoop job. I have a FileSystem object and a Path object, and I want to know the size of the file at that Path.
Any ideas?
long length = fileSystem.getFileStatus(path).getLen();
Here is a link to the relevant documentation of Hadoop 2.2.0
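For example, a minimal Scala sketch of that call; the path below is only a placeholder:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder path; substitute the Path you already have in your job
val path = new Path("/user/hadoop/input/data.txt")
val fs = FileSystem.get(new Configuration())
// getFileStatus describes a single file; getLen returns its size in bytes
val lengthInBytes: Long = fs.getFileStatus(path).getLen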
Another API you can use (written in Scala):
private def getFileSizeByPath(arg: String): Long = {
  val path = new Path(arg)
  val hdfs = path.getFileSystem(new Configuration())
  val cSummary = hdfs.getContentSummary(path)
  val length = cSummary.getLength
  length
}
Note that the returned Long value is the size in bytes.
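A quick usage sketch (the HDFS paths are only illustrations); getContentSummary also accepts a directory, in which case the returned length is the total size of all files underneath it:
val fileSize = getFileSizeByPath("/user/hadoop/input/data.txt") // size of a single file, in bytes
val dirSize  = getFileSizeByPath("/user/hadoop/input")          // total size of the directory's contents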
I'm developing a Scala feature-extraction app using Apache Spark TF-IDF. I need to read in from a directory of text files. I'm trying to convert an RDD to a DataFrame, but I'm getting the error "value toDF() is not a member of org.apache.spark.rdd.RDD[streamedRDD]". This is what I have right now ...
I have spark-2.2.1 & Scala 2.1.11. Thanks in advance.
Code:
// Creating the Spark context that will interface with Spark
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("TextClassification")
val sc = new SparkContext(conf)
// Load documents (one per line)
val data = sc.wholeTextFiles("C:/Users/*")
val text = data.map{case(filepath,text) => text}
val id = data.map{case(filepath, text) => text.split("#").takeRight(1)(0)}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class dataStreamed(id: String, input: String)
val tweetsDF = data
  .map{case (filepath, text) =>
    val id = text.split("#").takeRight(1)(0)
    val input = text.split(":").takeRight(2)(0)
    dataStreamed(id, input)}
  .as[dataStreamed]
  .toDF()
  .cache()
// -------------------- TF-IDF --------------------
// From spark.apache.org
// URL http://spark.apache.org/docs/latest/ml-features.html#tf-idf
val tokenizer = new Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsDF)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
val tf = hashingTF.transform(wordsData).cache() // Hashed words
// Compute for the TFxIDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(tf)
Data (text files like these, in a folder, are what I need to read in):
https://www.dropbox.com/s/cw3okhaosu7i1md/cars.txt?dl=0
https://www.dropbox.com/s/29tgqg7ifpxzwwz/Italy.txt?dl=0
The problem here is the type that comes out of the map: you are mapping over an RDD, not a Dataset, so the .as and .toDF calls do not work as written. It should be:
case class dataStreamed(id: String, input: String)
def test() = {
  val sparkConf = new SparkConf().setAppName("TextClassification").setMaster("local")
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  // Load documents (one per line)
  val data = spark.sparkContext.wholeTextFiles("C:\\tmp\\stackoverflow\\*")
  val dataset = spark.createDataset(data)
  val tweetsDF = dataset
    .map{case (id: String, input: String) =>
      val file = id.split("#").takeRight(1)(0)
      val content = input.split(":").takeRight(2)(0)
      dataStreamed(file, content)}
    .as[dataStreamed]
  tweetsDF.printSchema()
  tweetsDF.show(10)
}
First, data will be an RDD[(String, String)]; then I create a new Dataset with spark.createDataset in order to be able to use map properly together with the case class. Please note that you must define the dataStreamed class outside of your method (test in this case).
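If it helps, here is a rough sketch of the TF-IDF part of your code applied to that tweetsDF; it assumes the same spark.ml classes you already use (Tokenizer, HashingTF, IDF):
val tokenizer = new Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsDF)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val tf = hashingTF.transform(wordsData).cache()
// IDF.fit returns a model; transform adds the final "features" column
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val tfidf = idfModel.transform(tf)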
Good luck
We can do this with a couple of commands/functions:
Invoke the Spark/Scala shell; you can pass driver-memory, executor-memory, executor-cores, etc. as suits your job
spark-shell
Read the text file from HDFS
val text_rdd = sc.textFile("path/to/file/on/hdfs")
Convert the text RDD to a DataFrame
val text_df = text_rdd.toDF
Save as plain text format in HDFS
text_df.write.text("path/to/hdfs")
Save as splittable compressed format in HDFS
text_df.coalesce(1).write.parquet("path/to/hdfs")
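Putting the steps together as one sketch you could paste into spark-shell (the paths are placeholders; spark and sc are the objects the shell already provides):
val text_rdd = sc.textFile("hdfs:///data/input/file.txt")
import spark.implicits._ // already in scope in the shell, needed in a compiled app
val text_df = text_rdd.toDF("line")
text_df.write.text("hdfs:///data/output_text")       // plain text
text_df.write.parquet("hdfs:///data/output_parquet") // splittable, compressed (snappy by default)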
I have some Avro files inside the HDFS folder /user/data/output_files/file_2017-10-18:
scala> val hdfsLoc ="/user/data/output_files/file_2017-10-18/*.avro"
hdfsLoc: String = /user/data/output_files/file_2017-10-18/*.avro
scala> val conf = new Configuration()
scala> val fs = FileSystem.get(conf)
scala> val result = fs.exists(new Path(hdfsLoc))
result: Boolean = false
The above code gives result as false, i.e. it says there are no Avro files inside that HDFS folder.
If I give the full name of an Avro file, then it returns true:
scala> val hdfsLoc ="/user/data/output_files/file_2017-10-18/part-r-00000-ed937f14-c7d1-480a-9c79-1cda3db4e6ce.avro"
hdfsLoc: String = /user/data/output_files/file_2017-10-18/part-r-00000-ed937f14-c7d1-480a-9c79-1cda3db4e6ce.avro
scala> val result = fs.exists(new Path(hdfsLoc))
result: Boolean = true
How do I check whether there are one or more Avro files inside an HDFS folder?
It seems FileSystem.exists doesn't support wildcards. A workaround could be as below, though it looks nasty:
val list = fs.listFiles(new Path("/test/"), true)
var result = false
while (list.hasNext()) {
  if (list.next().getPath.getName.endsWith(".avro"))
    result = true
}
The FileSystem API has a different function called globStatus which allows you to use wildcards.
It returns an Array[org.apache.hadoop.fs.FileStatus]:
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.globStatus(new Path("/user/data/output_files/file_2017-10-18/*.avro")).length match {
  case x: Int if (x > 0) => doSomethingWhenAvroFileExists()
  case _ => doSomethingWhenNoAvroFilesExist()
}
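If all you need is a Boolean, here is a small helper along the same lines (fs is the FileSystem from above; the Option wrapper is defensive, since globStatus can return null in some cases):
// true when the glob matches at least one .avro file
def avroFilesExist(fs: FileSystem, dir: String): Boolean =
  Option(fs.globStatus(new Path(s"$dir/*.avro"))).exists(_.nonEmpty)

val result = avroFilesExist(fs, "/user/data/output_files/file_2017-10-18")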
I am running some code through a compiler, and I have to query which operating system the user is on in order to call the appropriate binary. The code works and calls the binary in IntelliJ, but when I build a jar file with Gradle, I get a file-not-found exception (for the binary) on the line that corresponds to val tempBinaryCopy.
fun assemble(file: String) {
    val currentDirectory = System.getProperty("user.dir")
    val binary = when {
        System.getProperty("os.name").startsWith("Linux") -> javaClass.classLoader.getResource("osx_linux").file
        System.getProperty("os.name").startsWith("Mac") -> javaClass.classLoader.getResource("osx_mac").file
        else -> javaClass.classLoader.getResource("osx_win.exe").file
    }
    val binaryFile = File(binary).name
    val assemblyFile = File(file).name
    val tempBinaryCopy = File(binary).copyTo(File(currentDirectory, binaryFile), true)
    val tempAssemblyCopy = File(file).copyTo(File(currentDirectory, assemblyFile), true)
    tempAssemblyCopy.deleteOnExit()
    tempBinaryCopy.deleteOnExit()
    Files.setPosixFilePermissions(tempBinaryCopy.toPath(), setOf(PosixFilePermission.OWNER_EXECUTE))
    val process = Runtime.getRuntime().exec(arrayOf(tempBinaryCopy.absolutePath, tempAssemblyCopy.absolutePath, "-v"))
    process.inputStream.bufferedReader().readLines().forEach { println(it) }
}
The exception
Exception in thread "main" kotlin.io.NoSuchFileException: file:/Users/dross/Desktop/OS.jar!/osx_mac: The source file doesn't exist.
at kotlin.io.FilesKt__UtilsKt.copyTo(Utils.kt:179)
at kotlin.io.FilesKt__UtilsKt.copyTo$default(Utils.kt:177)
at com.max.power.os.assembler.ProvidedAssembler.assemble(ProvidedAssembler.kt:22)
I have also tried a replace on the line in question to remove the ! and the result was the same.
javaClass.classLoader.getResource("osx_linux") gives you an instance of URL, and URL.file gives you the file part of the URL. This might work as long as the file is not packaged in a JAR, but as you can see it fails when the file is packed into a JAR. You should instead use getResourceAsStream and then copy the InputStream you receive to the destination where you want to have it.
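As a rough illustration of that approach (sketched here in Scala; the Kotlin version uses the same classLoader and java.nio.file.Files calls, and the resource name and target directory are simply the ones from the question):
import java.nio.file.{Files, Paths, StandardCopyOption}

val target = Paths.get(System.getProperty("user.dir"), "osx_mac")
val in = getClass.getClassLoader.getResourceAsStream("osx_mac")
require(in != null, "resource not found on classpath")
try {
  // Streaming the resource works whether it sits on disk or inside the JAR
  Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
} finally {
  in.close()
}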
I have a string sequence Seq[String] which represents stdin input lines.
Those lines map to a model entity, but it is not guaranteed that 1 line = 1 entity instance.
Each entity is delimited with a special string that will not occur anywhere else in the input.
My solution was something like:
val entities = lines.mkString.split(myDelimiter).map(parseEntity)
The parseEntity implementation is not relevant; it takes a String and maps it to a case class which represents the model entity.
The problem is that, with a given input, I get an OutOfMemoryError on the lines.mkString. Would a fold/foldLeft/foldRight be more efficient? Or do you have any better alternative?
You can solve this using Akka Streams and delimiter framing. See this section of the documentation for the basic approach.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Framing, Source}
import akka.util.ByteString
val example = (0 until 100).mkString("delimiter").grouped(8).toIndexedSeq
val framing = Framing.delimiter(ByteString("delimiter"), 1000)
implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()
Source(example)
  .map(ByteString.apply)
  .via(framing)
  .map(_.utf8String)
  .runForeach(println)
The conversion to and from ByteString is a bit annoying, but Framing.delimiter is only defined for ByteString.
If you prefer a more purely functional approach, fs2 also offers primitives to solve this problem.
Something that worked for me if you are reading from a stream (your mileage may vary); it is a slightly modified version of Scala's LineIterator:
class EntityIterator(val iter: BufferedIterator[Char]) extends AbstractIterator[String] with Iterator[String] {
  private[this] val sb = new StringBuilder
  def getc() = iter.hasNext && {
    val ch = iter.next
    if (ch == '\n') false // Replace with your delimiter here
    else {
      sb append ch
      true
    }
  }
  def hasNext = iter.hasNext
  def next = {
    sb.clear
    while (getc()) { }
    sb.toString
  }
}
val entities =
  new EntityIterator(scala.io.Source.fromInputStream(...).buffered)
entities.map(...)
I have got a brand new install of Spark 1.2.1 on a MapR cluster, and while testing it I find that it works fine in local mode, but in YARN mode it seems unable to access variables, not even if they are broadcast. To be precise, the following test code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object JustSpark extends App {
  val conf = new org.apache.spark.SparkConf().setAppName("SimpleApplication")
  val sc = new SparkContext(conf)
  val a = List(1,3,4,5,6)
  val b = List("a","b","c")
  val bBC = sc.broadcast(b)
  val data = sc.parallelize(a)
  val transform = data map ( t => { "hi" })
  transform.take(3) foreach (println _)
  val transformx2 = data map ( t => { bBC.value.size })
  transformx2.take(3) foreach (println _)
  //val transform2 = data map ( t => { b.size })
  //transform2.take(3) foreach (println _)
}
works in local mode but fails in YARN. More precisely, both transformations, transform2 and transformx2, fail, and all of them work with --master local[8].
I am compiling it with sbt and submitting it with the submit tool:
/opt/mapr/spark/spark-1.2.1/bin/spark-submit --class JustSpark --master yarn target/scala-2.10/simulator_2.10-1.0.jar
Any idea what is going on? The failure message just reports a Java NullPointerException at the place where it should be accessing the variable. Is there another way to pass variables into the RDD maps?
I'm going to take a pretty good guess: it's because you're using App. See https://issues.apache.org/jira/browse/SPARK-4170 for details. Write a main() method instead.
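A minimal sketch of that restructuring, reusing the names from the question:
object JustSpark {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("SimpleApplication")
    val sc = new SparkContext(conf)
    val b = List("a", "b", "c")
    val bBC = sc.broadcast(b)
    val data = sc.parallelize(List(1, 3, 4, 5, 6))
    // The closure now only captures the broadcast variable, not the enclosing App object
    data.map(_ => bBC.value.size).take(3).foreach(println)
    sc.stop()
  }
}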
I presume the culprit was:
val transform2 = data map ( t => { b.size })
In particular, the access to the local variable b. You may actually see java.io.NotSerializableException in your log files.
What is supposed to happen: Spark will attempt to serialize any referenced object, which in this case means the entire JustSpark object, since one of its members is referenced.
Why did this fail? Your class is not Serializable, so Spark is unable to send it over the wire. In particular, you have a reference to SparkContext, which does not extend Serializable:
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
So your first code, which broadcasts only the variable's value, is the correct way.
This is the original broadcast example from the Spark sources, altered to use lists instead of arrays:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object MultiBroadcastTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Multi-Broadcast Test")
    val sc = new SparkContext(sparkConf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val num = if (args.length > 1) args(1).toInt else 1000000
    val arr1 = (1 to num).toList
    val arr2 = (1 to num).toList
    val barr1 = sc.broadcast(arr1)
    val barr2 = sc.broadcast(arr2)
    val observedSizes: RDD[(Int, Int)] = sc.parallelize(1 to 10, slices).map { _ =>
      (barr1.value.size, barr2.value.size)
    }
    observedSizes.collect().foreach(i => println(i))
    sc.stop()
  }
}
I compiled it in my environment and it works.
So what is the difference?
The problematic example uses extends App while the original example is a plain singleton.
So I demoted the code into a "doIt()" function:
object JustDoSpark extends App {
  def doIt() {
    ...
  }
  doIt()
}
And guess what: it worked.
The problem is indeed related to serialization, but in a different way: having the code in the body of the object (via extends App) seems to cause problems.