I've created a Spark job that reads in a text file every day from my HDFS and extracts unique keys from each line in the text file. There are roughly 50,000 keys in each text file. The same data is then filtered by the extracted key and saved to HDFS.
I want to create a directory in my hdfs with the structure: hdfs://.../date/key that contains the filtered data. The problem is that writing to the hdfs takes a very very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile(""hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
  val filteredData = cleanedData.filter(line => line.contains(key))
  filteredData.repartition(1).saveAsTextFile(s"hdfs://.../date/$key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
def writeLines(iterator: Iterator[(String, String)]) = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val item = iterator.next()
      val key = item._1
      val line = item._2
      val writer = writers.get(key) match {
        case Some(writer) => writer
        case None =>
          val path = args(1) + key
          val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
          val newWriter = new BufferedWriter(new OutputStreamWriter(outputStream))
          writers.put(key, newWriter)
          newWriter
      }
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}
val inputData = sc.textFile("hdfs://...")
val keyValue = inputData.map(line => (extractKey(line), line)) // pair each line with its key (assumes a per-line key extractor)
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)
class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The number of partitions has nothing to do with the number of output directories. To implement it, you may need to override generateFileNameForKeyValue with your own version that saves to different directories.
Regarding scalability, it is not a Spark issue but an HDFS one. However you implement it, as long as the requirement does not change, the cost is unavoidable. That said, HDFS is probably fine with 50,000 file handles.
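For concreteness, here is a minimal sketch of that approach with Spark's saveAsHadoopFile and Hadoop's MultipleTextOutputFormat, following the pattern of the linked answer; the class name KeyBasedOutputFormat and the keyedData RDD of (key, line) pairs are placeholders of mine, not from the original post:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Writes each value under <output base>/<key>/part-xxxxx and drops the key from the file contents.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

// keyedData: RDD[(String, String)] of (key, line) pairs
keyedData.saveAsHadoopFile(
  "hdfs://.../date",
  classOf[String],
  classOf[String],
  classOf[KeyBasedOutputFormat]
)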
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is to severely limit the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is also really slow, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what is causing the bottleneck here. Is there perhaps another way to architect this that simply lets you read (key, value) results?
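As a rough sketch of that idea, using the question's cleanedData and a hypothetical per-line extractor extractKeyFromLine, you would tag each line with its key in a single pass instead of running 50,000 separate filters:
// one pass over the data: (key, line) pairs
val keyedData = cleanedData.map(line => (extractKeyFromLine(line), line))

// from here the pairs can be grouped, partitioned by key, or handed to a
// key-aware output format as in the other answers
val grouped = keyedData.groupByKey()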
I am using Hadoop version 2.7 and its FileSystem API. The question is about "how to count partitions with the API?", but, to frame it as a software problem, I am copying here a Spark-Shell script... The concrete question about the script is:
Is the variable parts below counting the number of table partitions, or something else?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._
val warehouse = "/apps/hive/warehouse" // the Hive default location for all databases
val db_regex = """\.db$""".r // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive files
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r //mark
def cutPath (thePath: String, toCut: Boolean = true) : String =
if (toCut) trStrange.replaceAllIn( thePath.replaceAll("^.+/", ""), "#") else thePath
val fs_get = FileSystem.get( sc.hadoopConfiguration )
fs_get.listStatus( new Path(warehouse) ).foreach( lsb => {
  val b = lsb.getPath.toString
  if (db_regex.findFirstIn(b).isDefined)
    fs_get.listStatus( new Path(b) ).foreach( lst => {
      val lstPath = lst.getPath
      val t = lstPath.toString
      var parts = -1
      var size = -1L
      if (!tab_regex.findFirstIn(t).isDefined) {
        try {
          val pp = fs_get.listStatus( lstPath )
          parts = pp.length // !HERE! partitions?
          pp.foreach( p => {
            try { // SUPPOSING that size is the number of bytes of table t
              size = size + fs_get.getContentSummary(p.getPath).getLength
            } catch { case _: Throwable => }
          })
        } catch { case _: Throwable => }
        println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
      }
    })
}) // x warehouse loop
System.exit(0) // get out from spark-shell
This is only an example to show the focus: the correct scan and semantic interpretation of the Hive default database FileSystem structure, using the Hadoop FileSystem API. The script sometimes needs some memory, but it works fine. Run it with sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or Spark Schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower level measurement".
Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, creating nested levels of subdirectories. (A partition is a directory named .../partition_column=value by convention.)
So it seems your script will print the number of files (parts) and their total length (size) for each single-column-partitioned table in each of your databases, if I'm not mistaken.
As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.
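If you'd rather stay inside the API, getContentSummary is roughly the programmatic counterpart of hdfs dfs -count. A minimal sketch, where tablePath stands for a table directory under the warehouse:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val summary = fs.getContentSummary(new Path(tablePath)) // recursive, no manual tree walk needed
println(s"dirs=${summary.getDirectoryCount}, files=${summary.getFileCount}, bytes=${summary.getLength}")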
I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacing (differences between the consecutive numbers) between the values averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized, however, that this was not a good idea. The data from the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM results WHERE generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive. Any Spark related solution is OK.
The number of values can be ~10000, possibly more later. I would like to plot this distribution together with an already-known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double]
  } else {
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this:
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
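If the recursion worries you for ~10000 values, here is a non-recursive sketch of the same idea using sliding(2), assuming import spark.implicits._ is in scope for the encoder:
import spark.implicits._

val spacings = jsonData
  .select($"generationId", $"values").as[(Long, Seq[Double])]
  .map { case (genId, values) =>
    // differences between consecutive values, without a deep call stack
    val diffs = values.sliding(2).collect { case Seq(a, b) => math.abs(b - a) }.toSeq
    (genId, diffs)
  }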
INTRODUCTION
I have to write a distributed application which counts the maximum number of unique values for 3 records. I have no experience in this area and don't know the frameworks at all. My input could look as follows:
u1: u2,u3,u4,u5,u6
u2: u1,u4,u6,u7,u8
u3: u1,u4,u5,u9
u4: u1,u2,u3,u6
...
Then the beginning of the results should be:
(u1,u2,u3), u4,u5,u6,u7,u8,u9 => count=6
(u1,u2,u4), u3,u5,u6,u7,u8 => count=5
(u1,u3,u4), u2,u5,u6,u9 => count=4
(u2,u3,u4), u1,u5,u6,u7,u8,u9 => count=6
...
So my approach is to first merge every pair of records, and then merge each merged pair with each remaining single record.
QUESTION
Can I perform such an operation, i.e. work on (merge) more than one input row at the same time, in frameworks like Hadoop/Spark? Or maybe my approach is incorrect and I should do this a different way?
Any advice will be appreciated.
Can I perform such an operation, i.e. work on (merge) more than one input row at the same time, in frameworks like Hadoop/Spark?
Yes, you can.
Or maybe my approach is incorrect and I should do this a different way?
It depends on the size of the data. If your data is small, it's faster and easier to do it locally. If your data is huge, at least hundreds of GBs, the common strategy is to save the data to HDFS (a distributed file system) and do the analysis using MapReduce/Spark.
An example Spark application written in Scala:
import java.util

import org.apache.spark.{SparkConf, SparkContext}

object MyCounter {
  val sparkConf = new SparkConf().setAppName("My Counter")
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]): Unit = {
    val inputFile = sc.textFile("hdfs:///inputfile.txt")
    val keys = inputFile.map(line => line.substring(0, 2)) // get "u1" from "u1: u2,u3,u4,u5,u6"
    val triplets = keys.cartesian(keys).cartesian(keys)
      .map(z => (z._1._1, z._1._2, z._2))
      .filter(z => !z._1.equals(z._2) && !z._1.equals(z._3) && !z._2.equals(z._3)) // get "(u1,u2,u3)" triplets
    // If you have a small number of (u1,u2,u3) triplets, it's better to prepare them locally.
    val res = triplets.cartesian(inputFile).filter(z => {
      z._2.startsWith(z._1._1) || z._2.startsWith(z._1._2) || z._2.startsWith(z._1._3)
    }) // (u1,u2,u3) only matches lines starting with u1, u2 or u3, for example "u1: u2,u3,u4,u5,u6"
      .reduceByKey((a, b) => a + "," + b) // merge the three lines, separated by commas
      .map(z => {
        val line = z._2
        val values = line.split(",")
        // count unique values using a set
        val set = new util.HashSet[String]()
        for (value <- values) {
          set.add(value)
        }
        "key=" + z._1 + ", count=" + set.size() // the result from one mapper is a string
      }).collect()
    for (line <- res) {
      println(line)
    }
  }
}
The code is not tested and not efficient; it could be optimized, for example by removing unnecessary map-reduce steps.
You can rewrite the same version in Python/Java.
You can implement the same logic using Hadoop MapReduce.
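One possible simplification, sketched under the assumption that the whole adjacency list fits in driver memory (it reuses inputFile, triplets and sc from the example above; the helper names are mine):
// parse "u1: u2,u3,u4,u5,u6" into u1 -> Set(u2, u3, u4, u5, u6) and collect it locally
val adjacency: Map[String, Set[String]] = inputFile
  .map { line =>
    val Array(key, rest) = line.split(":", 2)
    key.trim -> rest.split(",").map(_.trim).toSet
  }
  .collect()
  .toMap

val adjBroadcast = sc.broadcast(adjacency)

// for every triplet, count the distinct neighbours outside the triplet itself
val counts = triplets.map { case (a, b, c) =>
  val adj = adjBroadcast.value
  val union = (adj.getOrElse(a, Set.empty[String]) ++
               adj.getOrElse(b, Set.empty[String]) ++
               adj.getOrElse(c, Set.empty[String])) -- Set(a, b, c)
  ((a, b, c), union.size)
}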
I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files.
How can I do that?
You can add as many data sinks to a DataSet program as you need.
For example in a program like this:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String, Long, Long>> data = env.readCsvFile(...).types(String.class, Long.class, Long.class);
// apply MapFunction and emit
data.map(new YourMapper()).writeAsText("/foo/bar");
// apply FilterFunction and emit
data.filter(new YourFilter()).writeAsCsv("/foo/bar2");
You read a DataSet data from a CSV file. This data is given to two subsequent transformations:
To a MapFunction, whose result is written to a text file.
To a FilterFunction, and the tuples that pass the filter are written to a CSV file.
You can also have multiple data sources, and branch and merge data sets (using union, join, coGroup, cross, or broadcast sets) as you like.
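For illustration, a small sketch with Flink's Scala DataSet API showing two sources merged with union and written to two sinks in the same job (paths and element values are made up):
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val left = env.fromElements(("a", 1L), ("b", 2L))
val right = env.fromElements(("c", 3L), ("d", 4L))

val merged = left.union(right)                          // merge two data sets of the same type
merged.writeAsCsv("/foo/merged")                        // first sink
merged.filter(_._2 > 2L).writeAsText("/foo/filtered")   // second sink

env.execute("multiple sinks example")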
You can use the HadoopOutputFormat API in Flink like this:
class IteblogMultipleTextOutputFormat[K, V] extends MultipleTextOutputFormat[K, V] {
  override def generateActualKey(key: K, value: V): K =
    NullWritable.get().asInstanceOf[K]

  override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
    key.asInstanceOf[String]
}
and we can use IteblogMultipleTextOutputFormat as follows:
val multipleTextOutputFormat = new IteblogMultipleTextOutputFormat[String, String]()
val jc = new JobConf()
FileOutputFormat.setOutputPath(jc, new Path("hdfs:///user/iteblog/"))
val format = new HadoopOutputFormat[String, String](multipleTextOutputFormat, jc)
val batch = env.fromCollection(List(("A", "1"), ("A", "2"), ("A", "3"),
("B", "1"), ("B", "2"), ("C", "1"), ("D", "2")))
batch.output(format)
For more information, see: http://www.iteblog.com/archives/1667
I am trying to split data in Spark into the form of an RDD of Array[String]. Currently I have loaded the file into an RDD of String.
> val csvFile = textFile("/input/spam.csv")
I would like to split on a "," delimiter.
This:
val csvFile = textFile("/input/spam.csv").map(line => line.split(","))
returns an RDD[Array[String]].
If you need the first column as its own RDD, then use the map function to return only the first element of each Array:
val firstCol = csvFile.map(_(0))
You should be using the spark-csv library, which is able to parse your file taking headers into account and lets you specify the delimiter. It also does a pretty good job of inferring the schema. I'll let you read the documentation to discover the many options at your disposal.
It may look like this:
sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","your delimitor")
.load(pathToFile)
Be aware that this returns a DataFrame that you may have to convert to an RDD using the .rdd method.
Of course, you will have to load the package into the driver for it to work.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
  .master("local")
  .appName("Spark CSV Reader")
  .getOrCreate()
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// convert dataframe to rdd[row]
val rddRow = df.rdd
// print 2 rows
rddRow.take(2)
// convert df to rdd[string] for a specific column (needs import spark.implicits._ for the encoder)
val oneColumn = df.select("colName").as[String].rdd
oneColumn.take(2)
// convert df to rdd[string] for multiple columns
val multiColumn = df.select("col1Name","col2Name").as[(String, String)].rdd
multiColumn.take(2)