How to count partitions with FileSystem API? - hadoop

I am using Hadoop version 2.7 and its FileSystem API. The question is about "how to count partitions with the API?" but, to put it into a software problem, I am coping here a Spark-Shell script... The concrete question about the script is
The variable parts below is counting the number of table partitions, or other thing?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._
val warehouse = "/apps/hive/warehouse" // the Hive default location for all databases
val db_regex = """\.db$""".r // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive files
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r //mark
def cutPath (thePath: String, toCut: Boolean = true) : String =
if (toCut) trStrange.replaceAllIn( thePath.replaceAll("^.+/", ""), "#") else thePath
val fs_get = FileSystem.get( sc.hadoopConfiguration )
fs_get.listStatus( new Path(warehouse) ).foreach( lsb => {
val b = lsb.getPath.toString
if (db_regex.findFirstIn(b).isDefined)
fs_get.listStatus( new Path(b) ).foreach( lst => {
val lstPath = lst.getPath
val t = lstPath.toString
var parts = -1
var size = -1L
if (!tab_regex.findFirstIn(t).isDefined) {
try {
val pp = fs_get.listStatus( lstPath )
parts = pp.length // !HERE! partitions?
pp.foreach( p => {
try { // SUPPOSING that size is the number of bytes of table t
size = size + fs.getContentSummary(p.getPath).getLength
} catch { case _: Throwable => }
})
} catch { case _: Throwable => }
println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
}
})
}) // x warehouse loop
System.exit(0) // get out from spark-shell
This is only an example to show the focus: the correct scan and semantic interpretation of the Hive default database FileSystem structure, using Hive FileSystem API. The script sometimes need some memory, but is working fine. Run with sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or Spark Schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower level measurement".

Hive structures its metadata as database > tables > partitions > files. This typically translates into filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files. Where /partition/.../ signifies that tables can be partitioned by multiple columns thus creating a nested levels of subdirectories. (A partition is a directory with the name .../partition_column=value by convention).
So seems like your script will be printing the number of files (parts) and their total length (size) for each single-column partitioned table in each of your databases, if I'm not mistaken.
As alternative, I'd suggest you look at hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.

Related

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacing (differences between the consecutive numbers) between the values averaged over the files (generationIds).
The first lines in my zepplein notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized however, that this was not a good idea. The data in the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM eig where generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive. Any Spark related solution is OK.
The number of values can be ~10000, and later. I would like to plot this distribution together with an already known function (simulation vs theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
if (l.size < 2) {
Seq.empty[Double]
} else {
val values = l.take(2)
Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
}
}
Given such a function, you could apply it in Spark like this:
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))

Iteratively running queries on Apache Spark

I've been trying to execute 10,000 queries over a relatively large dataset 11M. More specifically I am trying to transform an RDD using filter based on some predicate and then compute how many records conform to that filter by applying the COUNT action.
I am running Apache Spark on my local machine having 16GB of memory and an 8-core CPU. I have set the --driver-memory to 10G in order to cache the RDD in memory.
However, because I have to re-do this operation 10,000 times it takes unusually long for this to finish. I am also attaching my code hoping it will make things more clear.
Loading the queries and the dataframe I am going to query against.
//load normalized dimensions
val df = spark.read.parquet("/normalized.parquet").cache()
//load query ranges
val rdd = spark.sparkContext.textFile("part-00000")
Parallelizing the execution of queries
In here, my queries are collected in a list and using par are executed in parallel. I then collect the required parameters that my query needs, to filter the Dataset. The isWithin function calls a function and tests whether the Vector contained in my dataset is within the given bounds by my queries.
Now after filtering my dataset, I execute count to get the number of records that exist in the filtered dataset and then create a string reporting how many that was.
val results = queries.par.map(q => {
val volume = q(q.length-1)
val dimensions = q.slice(0, q.length-1)
val count = df.filter(row => {
val v = row.getAs[DenseVector]("scaledOpen")
isWithin(volume, v, dimensions)
}).count
q.mkString(",")+","+count
})
Now, what I have in mind is that this task is generally really hard given the large dataset that I have and trying to run such thing on a single machine. I know this could be much faster on something running on top of Spark or by utilizing an index. However, I am wondering if there is a way to make it faster as it is.
Just because you parallelize access to a local collection it doesn't mean that anything is executed in parallel. Number of jobs that can be executed concurrently is limited by the cluster resources not driver code.
At the same time Spark is designed for high latency batch jobs. If number of jobs goes into tens of thousands you just cannot expect things to be fast.
One thing you can try is to push filters down into a single job. Convert DataFrame to RDD:
import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
import org.apache.spark.rdd.RDD
val vectors: RDD[org.apache.spark.mllib.linalg.DenseVector] = df.rdd.map(
_.getAs[MLlibVector]("scaledOpen").toDense
)
map vectors to {0, 1} indicators:
import breeze.linalg.DenseVector
// It is not clear what is the type of queries
type Q = ???
val queries: Seq[Q] = ???
val inds: RDD[breeze.linalg.DenseVector[Long]] = vectors.map(v => {
// Create {0, 1} indicator vector
DenseVector(queries.map(q => {
// Define as before
val volume = ???
val dimensions = ???
// Output 0 or 1 for each q
if (isWithin(volume, v, dimensions)) 1L else 0L
}): _*)
})
aggregate partial results:
val counts: breeze.linalg.DenseVector[Long] = inds
.aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)
and prepare final output:
queries.zip(counts.toArray).map {
case (q, c) => s"""${q.mkString(",")},$c"""
}

Merging inputs in distributed application

INTRODUCTION
I have to write distributed application which counts maximum number of unique values for 3 records. I have no experience in such area and don't know frameworks at all. My input could looks as follow:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
...
Then beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
...
So my approach is to first merge each two of records, and then merge each merged pair with each single record.
QUESTION
Can I do such operation like working (merge) on more than one input row on the same time in framewors like hadoop/spark? Or maybe my approach is incorrect and I should do this different way?
Any advice will be appreciated.
Can I do such operation like working (merge) on more than one input row on the same time in framewors like hadoop/spark?
Yes, you can.
Or maybe my approach is incorrect and I should do this different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS(distributed file system), and do analysis using Mapreduce/Spark.
A example spark application written in scala:
object MyCounter {
val sparkConf = new SparkConf().setAppName("My Counter")
val sc = new SparkContext(sparkConf)
def main(args: Array[String]) {
val inputFile = sc.textFile("hdfs:///inputfile.txt")
val keys = inputFile.map(line => line.substring(0, 2)) // get "u1" from "u1: u2,u3,u4,u5,u6"
val triplets = keys.cartesian(keys).cartesian(keys)
.map(z => (z._1._1, z._1._2, z._2))
.filter(z => !z._1.equals(z._2) && !z._1.equals(z._3) && !z._2.equals(z._3)) // get "(u1,u2,u3)" triplets
// If you have small numbers of (u1,u2,u3) triplets, it's better prepare them locally.
val res = triplets.cartesian(inputFile).filter(z => {
z._2.startsWith(z._1._1) || z._2.startsWith(z._1._2) || z._2.startsWith(z._1._3)
}) // (u1,u2,u3) only matches line starts with u1,u2,u3, for example "u1: u2,u3,u4,u5,u6"
.reduceByKey((a, b) => a + b) // merge three lines
.map(z => {
val line = z._2
val values = line.split(",")
//count unique values using set
val set = new util.HashSet[String]()
for (value <- values) {
set.add(value)
}
"key=" + z._1 + ", count=" + set.size() // the result from one mapper is a string
}).collect()
for (line <- res) {
println(line)
}
}
}
The code is not tested. And is not efficient. It can have some optimization (for example, remove unnecessary map-reduce steps.)
You can rewrite the same version using Python/Java.
You can implement the same logic using Hadoop/Mapreduce

How to split rows of a Spark RDD by Deliminator

I am trying to split data in Spark into the form of an RDD of Array[String]. Currently I have loaded the file into an RDD of String.
> val csvFile = textFile("/input/spam.csv")
I would like to split on a a , deliminator.
This:
val csvFile = textFile("/input/spam.csv").map(line => line.split(","))
returns you RDD[Array[String]].
If you need first column as one RDD then using map function return only first index from Array:
val firstCol = csvFile.map(_.(0))
You should be using the spark-csv library which is able to parse your file considering headers and allow you to specify the delimitor. Also, it makes a pretty good job at infering the schema. I'll let you read the documentation to discover the plenty of options at your disposal.
This may look like this :
sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","your delimitor")
.load(pathToFile)
Be aware, this returns a DataFrame that you may have to convert to an rdd using .rdd function.
Of course, you will have to load the package into the driver for it to work.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// convert dataframe to rdd[row]
val rddRow = df.rdd
// print 2 rows
rddRow.take(2)
// convert df to rdd[string] for specific column
val oneColumn = df.select("colName").as[(String)].rdd
oneColumn.take(2)
// convert df to rdd[string] for multiple columns
val multiColumn = df.select("col1Name","col2Name").as[(String, String)].rdd
multiColumn.take(2)

Writing to hadoop distributed file system multiple times with Spark

I've created a spark job that reads in a textfile everyday from my hdfs and extracts unique keys from each line in the text file. There are roughly 50000 keys in each text file. The same data is then filtered by the extracted key and saved to the hdfs.
I want to create a directory in my hdfs with the structure: hdfs://.../date/key that contains the filtered data. The problem is that writing to the hdfs takes a very very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile(""hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
val filteredData = cleanedData.filter(line => line.contains(key))
filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
def writeLines(iterator: Iterator[(String, String)]) = {
val writers = new mutalbe.HashMap[String, BufferedWriter] // (key, writer) map
try {
while (iterator.hasNext) {
val item = iterator.next()
val key = item._1
val line = item._2
val writer = writers.get(key) match {
case Some(writer) => writer
case None =>
val path = arg(1) + key
val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
writer = new BufferedWriter(outputStream)
}
writer.writeLine(line)
} finally {
writers.values.foreach(._close())
}
}
val inputData = sc.textFile()
val keyValue = inputData.map(line => (key, line))
val partitions = keyValue.partitionBy(new MyPartition(10))
partitions.foreachPartition(writeLines)
class MyPartitioner(partitions: Int) extends Partitioner {
override def numPartitions: Int = partitions
override def getPartition(key: Any): Int = {
// make sure lines with the same key in the same partition
(key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
}
}
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The partition number has nothing to do with the directory number. To implement it, you may need to override generateFileNameForKeyValue with your customized version to save to different directory.
Regarding scalability, it is not an issue of spark, it is hdfs instead. But no matter how you implemented, as long as the requirements is not changed, it is unavoidable. But I think Hdfs is probably OK with 50,000 file handlers
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories but that is really causing the bottlenecks here. Is there perhaps another way to architect this that simply lets you read (key,value) results?

Resources