How to split rows of a Spark RDD by delimiter - hadoop

I am trying to split data in Spark into the form of an RDD of Array[String]. Currently I have loaded the file into an RDD of String.
val csvFile = sc.textFile("/input/spam.csv")
I would like to split on a , delimiter.

This:
val csvFile = sc.textFile("/input/spam.csv").map(line => line.split(","))
returns an RDD[Array[String]].
If you need the first column as its own RDD, use map to return only the first element of each Array:
val firstCol = csvFile.map(_(0))
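One caveat with a plain split(","): Java's String.split drops trailing empty strings, so rows that end in empty fields come back with fewer columns. A minimal sketch, in plain Scala without Spark, of the difference a negative limit makes. The helper name splitRow and the sample line are made up for this illustration, and quoted fields that contain commas are still not handled:

```scala
// Hypothetical helper: split one CSV line, keeping trailing empty fields.
def splitRow(line: String, delimiter: String = ","): Array[String] =
  line.split(delimiter, -1) // a negative limit keeps trailing empty strings

val line = "spam,hello,,"
val dropped = line.split(",") // trailing empty fields are discarded
val kept = splitRow(line)     // all four fields are kept

println(dropped.length) // 2
println(kept.length)    // 4
```

On the RDD this would presumably be used as .map(line => splitRow(line)) in place of .map(line => line.split(",")).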

You should be using the spark-csv library, which can parse your file taking headers into account and lets you specify the delimiter. It also does a pretty good job of inferring the schema. I'll let you read the documentation to discover the many options at your disposal.
It may look like this:
sqlContext.read.format("com.databricks.spark.csv")
.option("header","true")
.option("delimiter","your delimiter")
.load(pathToFile)
Be aware that this returns a DataFrame, which you may have to convert to an RDD using .rdd.
Of course, you will have to load the package into the driver for it to work.

// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// convert dataframe to rdd[row]
val rddRow = df.rdd
// print 2 rows
rddRow.take(2)
// convert df to rdd[string] for specific column
import spark.implicits._ // provides the encoders needed by .as[...]
val oneColumn = df.select("colName").as[String].rdd
oneColumn.take(2)
// convert df to rdd[string] for multiple columns
val multiColumn = df.select("col1Name","col2Name").as[(String, String)].rdd
multiColumn.take(2)

Related

How to count partitions with FileSystem API?

I am using Hadoop version 2.7 and its FileSystem API. The question is "how to count partitions with the API?" but, to put it as a software problem, I am copying a Spark-Shell script here... The concrete question about the script is:
Is the variable parts below counting the number of table partitions, or something else?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._
val warehouse = "/apps/hive/warehouse" // the Hive default location for all databases
val db_regex = """\.db$""".r // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive files
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r //mark
def cutPath (thePath: String, toCut: Boolean = true) : String =
  if (toCut) trStrange.replaceAllIn( thePath.replaceAll("^.+/", ""), "#") else thePath
val fs_get = FileSystem.get( sc.hadoopConfiguration )
fs_get.listStatus( new Path(warehouse) ).foreach( lsb => {
  val b = lsb.getPath.toString
  if (db_regex.findFirstIn(b).isDefined)
    fs_get.listStatus( new Path(b) ).foreach( lst => {
      val lstPath = lst.getPath
      val t = lstPath.toString
      var parts = -1
      var size = -1L
      if (!tab_regex.findFirstIn(t).isDefined) {
        try {
          val pp = fs_get.listStatus( lstPath )
          parts = pp.length // !HERE! partitions?
          pp.foreach( p => {
            try { // SUPPOSING that size is the number of bytes of table t
              size = size + fs_get.getContentSummary(p.getPath).getLength
            } catch { case _: Throwable => }
          })
        } catch { case _: Throwable => }
        println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
      }
    })
}) // x warehouse loop
System.exit(0) // get out from spark-shell
This is only an example to show the focus: the correct scan and semantic interpretation of the Hive default database FileSystem structure, using the Hadoop FileSystem API. The script sometimes needs a lot of memory, but it works fine. Run with sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or Spark Schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower level measurement".
Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, creating nested levels of subdirectories. (By convention, a partition is a directory named .../partition_column=value.)
So, if I'm not mistaken, your script prints the number of entries directly under each table directory (parts) and their total length (size). That equals the number of partitions only for tables partitioned by a single column; for an unpartitioned table it is the number of files.
As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables.
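To make the difference between "entries directly under the table directory" and "partitions" concrete, here is a minimal sketch in plain Scala. It assumes the file paths relative to one table directory have already been collected (for instance with a recursive FileSystem.listFiles(path, true) call); the helper name partitionDirs and the sample paths are made up for illustration:

```scala
// Hypothetical helper: given file paths relative to a table directory, return the
// distinct partition directories (every parent component looks like "column=value").
def partitionDirs(relativeFilePaths: Seq[String]): Set[String] =
  relativeFilePaths
    .map(_.split("/").toList.init) // drop the file name, keep the parent directories
    .filter(dirs => dirs.nonEmpty && dirs.forall(_.contains("=")))
    .map(_.mkString("/"))
    .toSet

val files = Seq(
  "year=2019/month=01/part-00000",
  "year=2019/month=01/part-00001",
  "year=2019/month=02/part-00000",
  "year=2020/month=01/part-00000",
  "year=2020/month=02/part-00000",
  "_SUCCESS" // a plain file, not inside any partition directory
)

val partitions = partitionDirs(files)
// What a single listStatus on the table directory would see: its direct children.
val topLevelEntries = files.map(_.split("/").head).distinct.size

println(partitions.size) // 4 leaf partitions
println(topLevelEntries) // 3 direct children (year=2019, year=2020, _SUCCESS)
```

With two partition columns the one-level listing undercounts the partitions and also counts stray files, which is why the two numbers differ here.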

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacing (differences between the consecutive numbers) between the values averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized however, that this was not a good idea. The data in the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM results where generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId|              values|
+------------+--------------------+
|           1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive. Any Spark related solution is OK.
The number of values can be ~10000, and later I would like to plot this distribution together with an already known function (simulation vs theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
if (l.size < 2) {
Seq.empty[Double]
} else {
val values = l.take(2)
Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
}
}
Given such a function, you could apply it in Spark like this:
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
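As a non-recursive alternative, the same spacings can be computed with sliding windows of size two, which avoids deep recursion for lists of ~10000 values. This is only a sketch in plain Scala; the name spacings and the sample values are assumptions for illustration:

```scala
// Absolute differences between consecutive values; works for any length, including 0 and 1.
def spacings(values: Seq[Double]): Seq[Double] =
  values.sliding(2).collect { case Seq(a, b) => math.abs(b - a) }.toVector

val sample = Seq(-36.0, -35.5, -34.0, -33.75)
val diffs = spacings(sample)

println(diffs)                  // Vector(0.5, 1.5, 0.25)
println(diffs.sum / diffs.size) // 0.75
```

A window with fewer than two elements (which sliding produces for a one-element input) does not match the pattern and is skipped, so no special case is needed.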

SparkRDD Operations

Let's assume I have a table with two columns, A and B, in a CSV file. I pick the maximum value from column A [Max value = 100] and need to return the corresponding value of column B [Return Value = AliExpress] using JavaRDD operations, without using DataFrames.
Input Table :
COLUMN A Column B
56 Walmart
72 Flipkart
96 Amazon
100 AliExpress
Output Table:
COLUMN A Column B
100 AliExpress
This is what I have tried so far:
SourceCode:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));
With the above code I can fetch only one column of data. Is there any way to get two columns? Any suggestions are welcome. Thanks in advance.
You can use either top or takeOrdered functions to achieve it.
rdd.top(1) //gives you top element in your RDD
Data:
COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress
Creating df using Spark 2
val df = sqlContext.read.option("header", "true")
.option("inferSchema", "true")
.csv("filelocation")
df.show
import sqlContext.implicits._
import org.apache.spark.sql.functions._
Using Dataframe functions
df.orderBy(desc("COLUMN_A")).take(1).foreach(println)
OUTPUT:
[100,AliExpress]
Using RDD functions
df.rdd
.map(row => (row(0).toString.toInt, row(1)))
.sortByKey(false)
.take(1).foreach(println)
OUTPUT:
(100,AliExpress)
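Since the question asks for a solution without DataFrames, here is a sketch of the underlying logic on a local collection in plain Scala: parse each line into a (number, name) pair and select the pair with the largest number. The helper name parse is made up for illustration. On an RDD of such pairs the selection step would presumably be pairs.top(1)(Ordering.by[(Int, String), Int](_._1)):

```scala
// Parse "number,name" lines into (Int, String) pairs, skipping the header line.
def parse(lines: Seq[String]): Seq[(Int, String)] =
  lines.drop(1).map { line =>
    val cols = line.split(",")
    (cols(0).trim.toInt, cols(1).trim)
  }

val lines = Seq("COLUMN_A,Column_B", "56,Walmart", "72,Flipkart", "96,Amazon", "100,AliExpress")
val best = parse(lines).maxBy(_._1) // the pair with the largest COLUMN_A value

println(best) // (100,AliExpress)
```

Keeping both columns in one tuple is what allows the value of column B to be returned together with the maximum of column A.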

Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?

I'm using Apache Flink's DataSet API. I want to implement a job that writes multiple results into different files.
How can I do that?
You can add as many data sinks to a DataSet program as you need.
For example in a program like this:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String, Long, Long>> data = env.readCsvFile(...).types(String.class, Long.class, Long.class);
// apply MapFunction and emit
data.map(new YourMapper()).writeAsText("/foo/bar");
// apply FilterFunction and emit
data.filter(new YourFilter()).writeAsCsv("/foo/bar2");
You read a DataSet data from a CSV file. This data is given to two subsequent transformations:
To a MapFunction, whose result is written to a text file.
To a FilterFunction; the tuples that pass the filter are written to a CSV file.
You can also have multiple data sources and branch and merge data sets (using union, join, coGroup, cross, or broadcast sets) as you like.
You can use HadoopOutputFormat API in Flink like this:
class IteblogMultipleTextOutputFormat[K, V] extends MultipleTextOutputFormat[K, V] {
override def generateActualKey(key: K, value: V): K =
NullWritable.get().asInstanceOf[K]
override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
key.asInstanceOf[String]
}
and we can use IteblogMultipleTextOutputFormat as follows:
val multipleTextOutputFormat = new IteblogMultipleTextOutputFormat[String, String]()
val jc = new JobConf()
FileOutputFormat.setOutputPath(jc, new Path("hdfs:///user/iteblog/"))
val format = new HadoopOutputFormat[String, String](multipleTextOutputFormat, jc)
val batch = env.fromCollection(List(("A", "1"), ("A", "2"), ("A", "3"),
("B", "1"), ("B", "2"), ("C", "1"), ("D", "2")))
batch.output(format)
For more information, see: http://www.iteblog.com/archives/1667

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads in a text file every day from my HDFS and extracts unique keys from each line in the text file. There are roughly 50000 keys in each text file. The same data is then filtered by the extracted key and saved to HDFS.
I want to create a directory in my hdfs with the structure: hdfs://.../date/key that contains the filtered data. The problem is that writing to the hdfs takes a very very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
val filteredData = cleanedData.filter(line => line.contains(key))
filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
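import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner
// Writes every (key, line) pair of one partition to a file named after its key.
def writeLines(iterator: Iterator[(String, String)]): Unit = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  val fs = FileSystem.get(new Configuration())
  try {
    while (iterator.hasNext) {
      val (key, line) = iterator.next()
      // open one writer per key on first use and remember it in the map
      val writer = writers.getOrElseUpdate(key, {
        val path = args(1) + key // args(1): assumed to hold the base output directory
        new BufferedWriter(new OutputStreamWriter(fs.create(new Path(path))))
      })
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}
val inputData = sc.textFile("hdfs://...") // input path as in the question
// extractKeyOf is a placeholder for your own per-line key extraction
val keyValue = inputData.map(line => (extractKeyOf(line), line))
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)
class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}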
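The reason the per-key writers are safe is that getPartition above always maps the same key to the same partition index. A small sketch in plain Scala, without Spark, that repeats the same arithmetic as a standalone function; the name partitionFor and the sample keys are made up for illustration:

```scala
// Same arithmetic as getPartition above, as a standalone function.
def partitionFor(key: Any, numPartitions: Int): Int =
  (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions

val keys = Seq("alpha", "beta", "alpha", "gamma", "beta")
val assigned = keys.map(k => k -> partitionFor(k, 10))

// Each key appears with exactly one partition index, always in the range 0 to 9.
assigned.distinct.foreach(println)
```

Masking with Integer.MAX_VALUE clears the sign bit, so a negative hashCode cannot produce a negative partition index.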
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The number of partitions has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with your customized version to save to different directories.
Regarding scalability, the bottleneck is not Spark but HDFS. No matter how you implement it, as long as the requirements do not change, this is unavoidable. But I think HDFS is probably OK with 50,000 file handles.
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories but that is really causing the bottlenecks here. Is there perhaps another way to architect this that simply lets you read (key,value) results?
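The grouping idea can be sketched on a local collection in plain Scala: a single pass buckets every line under its key, instead of one filter pass per key. The key extraction (first comma-separated field), the date value, and the names keyOf and groupLines are all assumptions made for this illustration; the question's own extraction uses line.contains(key) and is not reproduced here:

```scala
// Hypothetical key extraction: the key is assumed to be the first comma-separated field.
def keyOf(line: String): String = line.split(",", -1)(0)

// One pass over the data: bucket every line under its key.
def groupLines(lines: Seq[String]): Map[String, Seq[String]] = lines.groupBy(keyOf)

val lines = Seq("k1,a", "k2,b", "k1,c")
val date = "2016-01-01" // hypothetical date component of the output path

// Map each key's bucket to the relative output path date/key.
val outputs = groupLines(lines).map { case (key, ls) => s"$date/$key" -> ls }
outputs.foreach { case (path, ls) => println(s"$path: ${ls.mkString(" | ")}") }
```

With 50,000 keys this replaces 50,000 scans of the data with one, which is the main cost the filter-per-key loop incurs.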
