spark job performing poorly while converting text files to parquet format - performance

I have a spark streaming application which is responsible for converting text files into parquet format on the fly, and then saving the data in an external hive table. Please refer the mentioned piece of code which is one of the classes for processing text files into parquet:
object HistTableLogic {val logger = Logger.getLogger("file")
def schemadef(batchId: String) {
println("process started!")
logger.debug("process started")
val sourcePath = "some path"
val destPath = "somepath"
println(s"source path :${sourcePath}")
println(s"dest path :${destPath}")
logger.debug(s"source path :${sourcePath}")
logger.debug(s"dest path :${destPath}")
// val sc = new SparkContext(new SparkConf().set("spark.driver.allowMultipleContexts", "true"))
val conf = new Configuration()
println("Spark Context created!!")
logger.debug("Spark Context created!!")
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
println("Spark session created!")
logger.debug("Spark session created!")
val schema = StructType.apply(spark.read.table("hivetable").schema.fields.dropRight(2))
try {
val fs = FileSystem.get(conf)
spark.sql("ALTER table hivetable drop if exists partition (batch_run_dt='"+batchId.substring(1,9)+"', batchid='"+batchId+"')")
fs.listStatus(new Path(sourcePath)).foreach(x => {
val df = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").option("delimiter","\u0001").
schema(schema).csv(s"${sourcePath}/"+batchId).na.fill("").repartition(50).write.mode("overwrite").option("compression", "gzip")
.parquet(s"${destPath}/batch_run_dt="+batchId.substring(1,9)+"/batchid="+batchId)
spark.sql("ALTER table hivetable add partition (batch_run_dt='"+batchId.substring(1,9)+"', batchid='"+batchId+"')")
logger.debug("Partition added")
})
} catch {
case e: Exception => {
println("---------Exception caught---------!")
logger.debug("---------Exception caught---------!")
e.printStackTrace()
logger.debug(e.printStackTrace)
logger.debug(e.getMessage)
}
}
}}
I am passing schemadef method of above class in main method of another java class which has the logic of receiving batchIds 24x7, set via custom receiver.
Functionally the application runs fine, but is taking around 15 minutes to process even 1GB of data. And if I try to simply load the data into hive table through LOAD query, it happens within a minute.
referring below configuration for spark job:
SPARK_MASTER YARN
SPARK_DEPLOY-MODE CLUSTER
SPARK_DRIVER-MEMORY 13g
SPARK_NUM-EXECUTORS 6
SPARK_EXECUTOR-MEMORY 15g
SPARK_EXECUTOR-CORES 2
Please let me know if you find any flaw in this or any other optimization I can do to enhance this process. Thank you

Related

Can I create sequence file using spark dataframes?

I have a requirement in which I need to create a sequence file.Right now we have written custom api on top of hadoop api,but since we are moving in spark we have to achieve the same using spark.Can this be achieved using spark dataframes?
AFAIK there is no native api available directly in DataFrame except the below approach
Please try/think some thing like(which is RDD of DataFrame style, inspired by SequenceFileRDDFunctions.scala & method saveAsSequenceFile) in below example :
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable
object driver extends App {
val conf = new SparkConf()
.setAppName("HDFS writable test")
val sc = new SparkContext(conf)
val empty = sc.emptyRDD[Any].repartition(10)
val data = empty.mapPartitions(Generator.generate).map{ (NullWritable.get(), _) }
val seq = new SequenceFileRDDFunctions(data)
// seq.saveAsSequenceFile("/tmp/s1", None)
seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
sc.stop()
}
Further information pls see ..
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file

How to read a record from HBase then store into Spark RDD (Resilient Distributed Datasets); and read one RDD record then write into HBase?

So I want to write a code to read a record from Hadoop HBase then store it into Spark RDD (Resilient Distributed Datasets); and read one RDD record then write into HBase. I have ZERO knowledge about either of the two and I need to use AWS cloud or Hadoop virtual machine. Someone please guide me to start from scratch.
Please make use of the basic code in Scala where we are reading the data in HBase using Scala. Similarly you can write a table creation to write the data into HBase
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark._
object HBaseApp {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HBaseApp").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "table1"
System.setProperty("user.name", "hdfs")
System.setProperty("HADOOP_USER_NAME", "hdfs")
conf.set("hbase.master", "localhost:60000")
conf.setInt("timeout", 100000)
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(tableName)
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + hBaseRDD.count())
sc.stop()
}
}

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
context = SparkContext(appName="PythonHBaseBulkLoader")
streamingContext = StreamingContext(context, 5)
stream = streamingContext.textFileStream("file:///test/input");
stream.foreachRDD(bulk_load)
streamingContext.start()
streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
#???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well, and successfully bulk loads data (using Puts or HFiles) I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assume you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
#Your configuration will likely be different. Insert your own quorum and parent node and table name
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\#Split the input into individual lines
.flatMap(csv_to_key_value)#Convert the CSV line to key value pairs
load_rdd.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
cols = row.split(",")#Split on commas.
#Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
#Works well for n>=1 columns
result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
(cols[0], [cols[0], "f2", "c2", cols[2]]),
(cols[0], [cols[0], "f3", "c3", cols[3]]))
return result
The value converter we defined earlier will convert these tuples into HBase Puts
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;
public class GatewayApplication {
public static void main(String[] args)
{
GatewayApplication app = new GatewayApplication();
GatewayServer server = new GatewayServer(app);
server.start();
}
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
#The output class changes, everything else stays
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}#"org.apache.hadoop.hbase.client.Put"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)
#Don't process empty RDDs
if not load_rdd.isEmpty():
#saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
"org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
conf=conf,
keyConverter=keyConv,
valueConverter=valueConv)
#The file has now been written, but HBase doesn't know about it
#Get a link to Py4J
gateway = JavaGateway()
#Convert conf to a fully fledged Configuration type
config = dict_to_conf(conf)
#Set up our HTable
htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
#Set up our path
path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
#Get a bulk loader
loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
#Load the HFile
loader.doBulkLoad(path, htable)
else:
print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
gateway = JavaGateway()
config = gateway.jvm.org.apache.hadoop.conf.Configuration()
keys = conf.keys()
vals = conf.values()
for i in range(len(keys)):
config.set(keys[i], vals[i])
return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)#Sort in ascending order

How to export data from Spark SQL to CSV

This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me to write export to CSV feature in Spark SQL.
You can use below statement to write the contents of dataframe in CSV format
df.write.csv("/data/home/csv")
If you need to write the whole dataframe into a single CSV file, then use
df.coalesce(1).write.csv("/data/home/sample.csv")
For spark 1.x, you can use spark-csv to write the results into CSV files
Below scala snippet would help
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
Since Spark 2.X spark-csv is integrated as native datasource. Therefore, the necessary statement simplifies to (windows)
df.write
.option("header", "true")
.csv("file:///C:/out.csv")
or UNIX
df.write
.option("header", "true")
.csv("/var/out.csv")
Notice: as the comments say, it is creating the directory by that name with the partitions in it, not a standard CSV file. This, however, is most likely what you want since otherwise your either crashing your driver (out of RAM) or you could be working with a non distributed environment.
The answer above with spark-csv is correct but there is an issue - the library creates several files based on the data frame partitioning. And this is not what we usually need. So, you can combine all partitions to one:
df.coalesce(1).
write.
format("com.databricks.spark.csv").
option("header", "true").
save("myfile.csv")
and rename the output of the lib (name "part-00000") to a desire filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/
The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that)
df.map(r=>r.mkString(",")) would do the same
if you want CSV escaping you can use apache commons lang for that. e.g. here's the code we're using
def DfToTextFile(path: String,
df: DataFrame,
delimiter: String = ",",
csvEscape: Boolean = true,
partitions: Int = 1,
compress: Boolean = true,
header: Option[String] = None,
maxColumnLength: Option[Int] = None) = {
def trimColumnLength(c: String) = {
val col = maxColumnLength match {
case None => c
case Some(len: Int) => c.take(len)
}
if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
}
def rowToString(r: Row) = {
val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") //remove control characters
st.split("~-~").map(trimColumnLength).mkString(delimiter)
}
def addHeader(r: RDD[String]) = {
val rdd = for (h <- header;
if partitions == 1; //headers only supported for single partitions
tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
rdd.getOrElse(r)
}
val rdd = df.map(rowToString).repartition(partitions)
val headerRdd = addHeader(rdd)
if (compress)
headerRdd.saveAsTextFile(path, classOf[GzipCodec])
else
headerRdd.saveAsTextFile(path)
}
With the help of spark-csv we can write to a CSV file.
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header","true").save("output.csv")`
The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.
enter code here IN DATAFRAME:
val p=spark.read.format("csv").options(Map("header"->"true","delimiter"->"^")).load("filename.csv")

Spark insert to HBase slow

I am inserting to HBase using Spark but it's slow. For 60,000 records it takes 2-3mins. I have about 10 million records to save.
object WriteToHbase extends Serializable {
def main(args: Array[String]) {
val csvRows: RDD[Array[String] = ...
val dateFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
val usersRDD = csvRows.map(row => {
new UserTable(row(0), row(1), row(2), row(9), row(10), row(11))
})
processUsers(sc: SparkContext, usersRDD, dateFormatter)
})
}
def processUsers(sc: SparkContext, usersRDD: RDD[UserTable], dateFormatter: DateTimeFormatter): Unit = {
usersRDD.foreachPartition(part => {
val conf = HBaseConfiguration.create()
val table = new HTable(conf, tablename)
part.foreach(userRow => {
val id = userRow.id
val name = userRow.name
val date1 = dateFormatter.parseDateTime(userRow.date1)
val hRow = new Put(Bytes.toBytes(id))
hRow.add(cf, q, Bytes.toBytes(date1))
hRow.add(cf, q, Bytes.toBytes(name))
...
table.put(hRow)
})
table.flushCommits()
table.close()
})
}
I am using this in spark-submit:
--num-executors 2 --driver-memory 2G --executor-memory 2G --executor-cores 2
It's slow because the implementation doesn't leverage the proximity of the data; the piece of Spark RDD in a server may be transferred to a HBase RegionServer running on another server.
Currently there is no Spark's RRD operation to use HBase data store in efficient manner.
There is a batch api in Htable, you can try to send put requests as 100-500 put packets.I think it can speed up you a little. It returns individual result for every operation, so you can check failed puts if you want.
public void batch(List<? extends Row> actions, Object[] results)
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List,%20java.lang.Object[]%29
You have to look on the approach where you can distribute your incoming data in to the Spark Job. In your current approach of foreachPartition instead you have to look on Transformations like map, mapToPair as well. You need to evaluate your whole DAG lifecycle and where you can save more time.
After that based on the Parallelism achieved you can call saveAsNewAPIHadoopDataset Action of Spark to write inside HBase more fast and parallel. Like:
JavaPairRDD<ImmutableBytesWritable, Put> yourFinalRDD = yourRDD.<SparkTransformation>{()};
yourFinalRDD.saveAsNewAPIHadoopDataset(yourHBaseConfiguration);
Note: Where yourHBaseConfiguration will be a singleton and will be single object on an Executor node to share between the Tasks
Kindly let me know if this Pseudo-code doesn't work for you or find any difficulty on the same.

Resources