Why Spark DataFrame Repartition not working correctly - hadoop

Spark 1.6.2 HDP 2.5.2
I am using Spark SQL to fetch data from a Hive table and then repartitioning it on a particular column, "serial", with 100 partitions, but Spark does not repartition the data into 100 partitions (which would show up as the number of tasks in the Spark UI); instead it runs 126 tasks.
val data = sqlContext.sql("""select * from default.tbl_orc_zlib""")
val filteredData = data.filter( data("day").isNotNull ) // NULL check
//Repartition on serial column with 100 partitions
val repartData = filteredData.repartition(100,filteredData("serial"))
val repartSortData = repartData.sortWithinPartitions("serial","linenr")
val mappedData = repartSortData.map(s => s.mkString("\t"))
val res = mappedData.pipe("xyz.dll")
res.saveAsTextFile("hdfs:///../../../")
But if I use a coalesce first and then repartition, the number of tasks becomes 150 (the expected 50 from coalesce plus 100 from repartition):
filteredData.coalesce(50)//works fine
Can someone please explain why this is happening?
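For reference, one way to inspect the partition counts directly, rather than inferring them from the task counts in the UI, is to look at the underlying RDDs. A minimal sketch using the DataFrames defined above (the comments state what I would expect, not confirmed output):
val scanPartitions = data.rdd.partitions.length        // partitions of the initial table scan (likely the source of the extra tasks)
val repartPartitions = repartData.rdd.partitions.length // should report 100 after the explicit repartition
println(s"scan=$scanPartitions, repartitioned=$repartPartitions")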

Related

How to count partitions with FileSystem API?

I am using Hadoop version 2.7 and its FileSystem API. The question is about "how to count partitions with the API?", but, to put it into a software problem, I am copying here a spark-shell script... The concrete question about the script is:
Is the variable parts below counting the number of table partitions, or something else?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._

val warehouse = "/apps/hive/warehouse"   // the Hive default location for all databases
val db_regex = """\.db$""".r             // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive staging files
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r // whitespace, slashes, non-ASCII and non-printable characters

def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) trStrange.replaceAllIn(thePath.replaceAll("^.+/", ""), "#") else thePath

val fs_get = FileSystem.get(sc.hadoopConfiguration)
fs_get.listStatus(new Path(warehouse)).foreach(lsb => {
  val b = lsb.getPath.toString
  if (db_regex.findFirstIn(b).isDefined)
    fs_get.listStatus(new Path(b)).foreach(lst => {
      val lstPath = lst.getPath
      val t = lstPath.toString
      var parts = -1
      var size = -1L
      if (tab_regex.findFirstIn(t).isEmpty) {
        try {
          val pp = fs_get.listStatus(lstPath)
          parts = pp.length // !HERE! partitions?
          pp.foreach(p => {
            try { // SUPPOSING that size is the number of bytes of table t
              size = size + fs_get.getContentSummary(p.getPath).getLength
            } catch { case _: Throwable => }
          })
        } catch { case _: Throwable => }
        println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
      }
    })
}) // end of warehouse loop
System.exit(0) // get out of spark-shell
This is only an example to show the focus: the correct scan and semantic interpretation of the Hive default database filesystem structure, using the Hadoop FileSystem API. The script sometimes needs some memory, but it is working fine. Run with sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or the Spark schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower level measurement".
Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, thus creating nested levels of subdirectories. (A partition is, by convention, a directory named .../partition_column=value.)
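For illustration, a table partitioned by a single column might look like this on the filesystem (hypothetical database, table and partition names):
/apps/hive/warehouse/mydb.db/mytable/day=2017-01-01/part-00000
/apps/hive/warehouse/mydb.db/mytable/day=2017-01-02/part-00000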
So it seems your script will be printing the number of files (parts) and their total length (size) for each single-column partitioned table in each of your databases, if I'm not mistaken.
As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.
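If you want to stay within the FileSystem API, getContentSummary returns essentially the same numbers as hdfs dfs -count (directory count, file count, total bytes). A minimal sketch, reusing the fs_get instance from the script above and a hypothetical table path:
val summary = fs_get.getContentSummary(new Path("/apps/hive/warehouse/mydb.db/mytable"))
println(s"dirs=${summary.getDirectoryCount}, files=${summary.getFileCount}, bytes=${summary.getLength}")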

SparkRDD Operations

Let's assume I have a table of two columns, A and B, in a CSV file. I pick the maximum value from column A [max value = 100] and I need to return the corresponding value of column B [return value = AliExpress] using JavaRDD operations, without using DataFrames.
Input Table :
COLUMN A Column B
56 Walmart
72 Flipkart
96 Amazon
100 AliExpress
Output Table:
COLUMN A Column B
100 AliExpress
This is what I have tried so far.
Source code:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));
From the above code I can fetch only one column's data. Is there any way to get both columns? Any suggestions? Thanks in advance.
You can use either the top or takeOrdered function to achieve it.
rdd.top(1) // gives you the top element in your RDD (according to the ordering used)
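Note that on an RDD[String] the natural ordering is lexicographic, so to get the row with the maximum column A you would supply an explicit ordering on the parsed number. A minimal Scala sketch, assuming diskfile is an RDD[String] of the raw CSV lines with the header removed:
val maxRow = diskfile.top(1)(Ordering.by((line: String) => line.split(",")(0).toInt)).head
println(maxRow) // 100,AliExpress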
Data:
COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress
Creating the DataFrame using Spark 2:
val df = sqlContext.read.option("header", "true")
.option("inferSchema", "true")
.csv("filelocation")
df.show
import sqlContext.implicits._
import org.apache.spark.sql.functions._
Using DataFrame functions
df.orderBy(desc("COLUMN_A")).take(1).foreach(println)
OUTPUT:
[100,AliExpress]
Using RDD functions
df.rdd
.map(row => (row(0).toString.toInt, row(1)))
.sortByKey(false)
.take(1).foreach(println)
OUTPUT:
(100,AliExpress)

Iteratively running queries on Apache Spark

I've been trying to execute 10,000 queries over a relatively large dataset of 11M records. More specifically, I am trying to transform an RDD using filter based on some predicate and then compute how many records conform to that filter by applying the count action.
I am running Apache Spark on my local machine, which has 16GB of memory and an 8-core CPU. I have set --driver-memory to 10G in order to cache the RDD in memory.
However, because I have to redo this operation 10,000 times, it takes unusually long to finish. I am also attaching my code, hoping it will make things clearer.
Loading the queries and the DataFrame I am going to query against:
//load normalized dimensions
val df = spark.read.parquet("/normalized.parquet").cache()
//load query ranges
val rdd = spark.sparkContext.textFile("part-00000")
Parallelizing the execution of queries
Here, my queries are collected in a list and executed in parallel using par. I then collect the required parameters that my query needs in order to filter the dataset. The isWithin function tests whether the vector contained in my dataset is within the bounds given by my query.
Now, after filtering my dataset, I execute count to get the number of records that exist in the filtered dataset and then create a string reporting how many there were.
val results = queries.par.map(q => {
  val volume = q(q.length - 1)
  val dimensions = q.slice(0, q.length - 1)
  val count = df.filter(row => {
    val v = row.getAs[DenseVector]("scaledOpen")
    isWithin(volume, v, dimensions)
  }).count
  q.mkString(",") + "," + count
})
Now, what I have in mind is that this task is generally really hard given the large dataset I have, and that I am trying to run such a thing on a single machine. I know this could be much faster on something running on top of Spark, or by utilizing an index. However, I am wondering if there is a way to make it faster as it is.
Just because you parallelize access to a local collection, it doesn't mean that anything is executed in parallel. The number of jobs that can be executed concurrently is limited by the cluster resources, not by the driver code.
At the same time, Spark is designed for high-latency batch jobs. If the number of jobs goes into the tens of thousands, you just cannot expect things to be fast.
One thing you can try is to push the filters down into a single job. Convert the DataFrame to an RDD:
import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
import org.apache.spark.rdd.RDD
val vectors: RDD[org.apache.spark.mllib.linalg.DenseVector] = df.rdd.map(
_.getAs[MLlibVector]("scaledOpen").toDense
)
map vectors to {0, 1} indicators:
import breeze.linalg.DenseVector

// It is not clear what the type of queries is
type Q = ???
val queries: Seq[Q] = ???

val inds: RDD[breeze.linalg.DenseVector[Long]] = vectors.map(v => {
  // Create a {0, 1} indicator vector
  DenseVector(queries.map(q => {
    // Define as before
    val volume = ???
    val dimensions = ???
    // Output 0 or 1 for each q
    if (isWithin(volume, v, dimensions)) 1L else 0L
  }): _*)
})
aggregate partial results:
val counts: breeze.linalg.DenseVector[Long] = inds
.aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)
and prepare final output:
queries.zip(counts.toArray).map {
case (q, c) => s"""${q.mkString(",")},$c"""
}

Spark SQL performance: version 1.6 vs version 1.5

I have tried to compare the performance of Spark SQL version 1.6 and version 1.5. In a simple case, Spark 1.6 is quite a bit faster than Spark 1.5. However, in a more complex query, in my case an aggregation query with grouping sets, Spark SQL version 1.6 is very much slower than Spark SQL version 1.5. Does anybody notice the same issue? And, even better, does anyone have a solution for this kind of query?
Here is my code:
case class Toto(
  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: String = f"${(math.random*1e6).toLong}%06.0f",
  c: String = f"${(math.random*1e6).toLong}%06.0f",
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame(data)
df.registerTempTable("toto")

val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText = s"$sqlSelect $sqlGroupBy"

val rs1 = sqlContext.sql(sqlText)
rs1.saveAsParquetFile("rs1")
Here are 2 screenshots of Spark 1.5.2 and Spark 1.6.0, both run with --driver-memory=1G. The DAG of the Spark 1.6.0 run can be viewed at DAG.
Thanks to Herman van Hövell for his reply on the Spark dev community. In order to share with other members, I quote his response here.
1.6 plans single distinct aggregates like multiple distinct aggregates; this inherently causes some overhead but is more stable in case of high cardinalities. You can revert to the old behavior by setting the spark.sql.specializeSingleDistinctAggPlanning option to false. See also: https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
Actually, in order to revert to the old behavior, the setting value should be "true".
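For example, a minimal way to set it from the same code (assuming the sqlContext from the snippet above):
// Revert to the 1.5-style planning of single distinct aggregates (Spark 1.6 option quoted above)
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")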

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads in a text file every day from my HDFS and extracts unique keys from each line in the text file. There are roughly 50,000 keys in each text file. The same data is then filtered by the extracted keys and saved to HDFS.
I want to create a directory in my HDFS with the structure hdfs://.../date/key that contains the filtered data. The problem is that writing to HDFS takes a very, very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData) // keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) // cleaned data is an RDD of strings

keys.map(key => {
  val filteredData = cleanedData.filter(line => line.contains(key))
  filteredData.repartition(1).saveAsTextFile(s"hdfs://.../date/$key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted, but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey, but I can't save the values because they aren't RDDs.
Any help is appreciated :)
import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner

def writeLines(iterator: Iterator[(String, String)]): Unit = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val item = iterator.next()
      val key = item._1
      val line = item._2
      val writer = writers.get(key) match {
        case Some(w) => w
        case None =>
          val path = args(1) + key // output base path passed in as a program argument
          val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
          val w = new BufferedWriter(new OutputStreamWriter(outputStream))
          writers(key) = w // remember the writer so it is reused for this key
          w
      }
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}

val inputData = sc.textFile() // pass your HDFS input path here
val keyValue = inputData.map(line => (getKey(line), line)) // getKey(line): however you derive a line's key (placeholder)
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)

class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The partition number has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with your customized version to save to different directories, as sketched below.
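A minimal sketch of that idea, assuming an RDD of (key, line) pairs such as the keyValue RDD above (not the original poster's exact code):
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a subdirectory named after its key
class KeyBasedOutputFormat extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key + "/" + name // e.g. hdfs://.../date/<key>/part-00000
  override def generateActualKey(key: String, value: String): String =
    null // drop the key from the output lines
}

keyValue.saveAsHadoopFile(
  "hdfs://.../date",
  classOf[String],
  classOf[String],
  classOf[KeyBasedOutputFormat]
)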
Regarding scalability, it is not an issue of Spark, it is HDFS instead. But no matter how you implement it, as long as the requirement is not changed, it is unavoidable. That said, I think HDFS is probably OK with 50,000 file handles.
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what's causing the bottleneck here. Is there perhaps another way to architect this that simply lets you read (key, value) results? One possibility is sketched below.
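For instance, a minimal sketch of that idea: write everything in one job as tab-separated (key, value) lines so downstream readers can filter by key later. Here getKey is a hypothetical per-line key function, not something from the original post:
// One pass, one output directory: each line is "key<TAB>original line"
cleanedData
  .map(line => getKey(line) + "\t" + line) // getKey: hypothetical per-line key extractor
  .saveAsTextFile("hdfs://.../date")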
