This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me to write export to CSV feature in Spark SQL.
You can use below statement to write the contents of dataframe in CSV format
df.write.csv("/data/home/csv")
If you need to write the whole dataframe into a single CSV file, then use
df.coalesce(1).write.csv("/data/home/sample.csv")
For spark 1.x, you can use spark-csv to write the results into CSV files
Below scala snippet would help
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
Since Spark 2.X spark-csv is integrated as native datasource. Therefore, the necessary statement simplifies to (windows)
df.write
.option("header", "true")
.csv("file:///C:/out.csv")
or UNIX
df.write
.option("header", "true")
.csv("/var/out.csv")
Notice: as the comments say, it is creating the directory by that name with the partitions in it, not a standard CSV file. This, however, is most likely what you want since otherwise your either crashing your driver (out of RAM) or you could be working with a non distributed environment.
The answer above with spark-csv is correct but there is an issue - the library creates several files based on the data frame partitioning. And this is not what we usually need. So, you can combine all partitions to one:
df.coalesce(1).
write.
format("com.databricks.spark.csv").
option("header", "true").
save("myfile.csv")
and rename the output of the lib (name "part-00000") to a desire filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/
The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that)
df.map(r=>r.mkString(",")) would do the same
if you want CSV escaping you can use apache commons lang for that. e.g. here's the code we're using
def DfToTextFile(path: String,
df: DataFrame,
delimiter: String = ",",
csvEscape: Boolean = true,
partitions: Int = 1,
compress: Boolean = true,
header: Option[String] = None,
maxColumnLength: Option[Int] = None) = {
def trimColumnLength(c: String) = {
val col = maxColumnLength match {
case None => c
case Some(len: Int) => c.take(len)
}
if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
}
def rowToString(r: Row) = {
val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") //remove control characters
st.split("~-~").map(trimColumnLength).mkString(delimiter)
}
def addHeader(r: RDD[String]) = {
val rdd = for (h <- header;
if partitions == 1; //headers only supported for single partitions
tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
rdd.getOrElse(r)
}
val rdd = df.map(rowToString).repartition(partitions)
val headerRdd = addHeader(rdd)
if (compress)
headerRdd.saveAsTextFile(path, classOf[GzipCodec])
else
headerRdd.saveAsTextFile(path)
}
With the help of spark-csv we can write to a CSV file.
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header","true").save("output.csv")`
The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.
enter code here IN DATAFRAME:
val p=spark.read.format("csv").options(Map("header"->"true","delimiter"->"^")).load("filename.csv")
Related
I have a spark streaming application which is responsible for converting text files into parquet format on the fly, and then saving the data in an external hive table. Please refer the mentioned piece of code which is one of the classes for processing text files into parquet:
object HistTableLogic {val logger = Logger.getLogger("file")
def schemadef(batchId: String) {
println("process started!")
logger.debug("process started")
val sourcePath = "some path"
val destPath = "somepath"
println(s"source path :${sourcePath}")
println(s"dest path :${destPath}")
logger.debug(s"source path :${sourcePath}")
logger.debug(s"dest path :${destPath}")
// val sc = new SparkContext(new SparkConf().set("spark.driver.allowMultipleContexts", "true"))
val conf = new Configuration()
println("Spark Context created!!")
logger.debug("Spark Context created!!")
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
println("Spark session created!")
logger.debug("Spark session created!")
val schema = StructType.apply(spark.read.table("hivetable").schema.fields.dropRight(2))
try {
val fs = FileSystem.get(conf)
spark.sql("ALTER table hivetable drop if exists partition (batch_run_dt='"+batchId.substring(1,9)+"', batchid='"+batchId+"')")
fs.listStatus(new Path(sourcePath)).foreach(x => {
val df = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").option("delimiter","\u0001").
schema(schema).csv(s"${sourcePath}/"+batchId).na.fill("").repartition(50).write.mode("overwrite").option("compression", "gzip")
.parquet(s"${destPath}/batch_run_dt="+batchId.substring(1,9)+"/batchid="+batchId)
spark.sql("ALTER table hivetable add partition (batch_run_dt='"+batchId.substring(1,9)+"', batchid='"+batchId+"')")
logger.debug("Partition added")
})
} catch {
case e: Exception => {
println("---------Exception caught---------!")
logger.debug("---------Exception caught---------!")
e.printStackTrace()
logger.debug(e.printStackTrace)
logger.debug(e.getMessage)
}
}
}}
I am passing schemadef method of above class in main method of another java class which has the logic of receiving batchIds 24x7, set via custom receiver.
Functionally the application runs fine, but is taking around 15 minutes to process even 1GB of data. And if I try to simply load the data into hive table through LOAD query, it happens within a minute.
referring below configuration for spark job:
SPARK_MASTER YARN
SPARK_DEPLOY-MODE CLUSTER
SPARK_DRIVER-MEMORY 13g
SPARK_NUM-EXECUTORS 6
SPARK_EXECUTOR-MEMORY 15g
SPARK_EXECUTOR-CORES 2
Please let me know if you find any flaw in this or any other optimization I can do to enhance this process. Thank you
Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?
For example, I have thousands of pdf invoices and I want to read data from those and perform some analytics on that. What steps must I do to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load files in binary format and then use map to map value to some other format - for example, parse binary with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...
val ready = rawFile.map ( here parsing with other framework
What is important, parsing must be done with other framework like mentioned previously in my answer. Map will get InputStream as an argument
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
from pyspark import SparkContext, SparkConf, HiveContext, AccumulatorParam
def decryptUncompressAndParseFile(filePathAndContents):
'''each line of the file becomes an RDD record'''
global acc_errCount, acc_errLog
proc = subprocess.Popen(['custom_decrypt_program','--decrypt'],
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(unzippedData, err) = proc.communicate(input=filePathAndContents[1])
if len(err) > 0: # problem reading the file
acc_errCount.add(1)
acc_errLog.add('Error: '+str(err)+' in file: '+filePathAndContents[0]+
', on host: '+ socket.gethostname()+' return code:'+str(returnCode))
return [] # this is okay with flatMap
records = list()
iterLines = iter(unzippedData.splitlines())
for line in iterLines:
#sys.stderr.write('Line: '+str(line)+'\n')
values = [x.strip() for x in line.split('|')]
...
records.append( (... extract data as appropriate from values into this tuple ...) )
return records
class StringAccumulator(AccumulatorParam):
''' custom accumulator to holds strings '''
def zero(self,initValue=""):
return initValue
def addInPlace(self,str1,str2):
return str1.strip()+'\n'+str2.strip()
def main():
...
global acc_errCount, acc_errLog
acc_errCount = sc.accumulator(0)
acc_errLog = sc.accumulator('',StringAccumulator())
binaryFileTup = sc.binaryFiles(args.inputDir)
# use flatMap instead of map, to handle corrupt files
linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
df = sqlContext.createDataFrame(linesRdd, ourSchema())
df.registerTempTable("dataTable")
...
The custom string accumulator was very useful in identifying corrupt input files.
I have a requirement in which I need to create a sequence file.Right now we have written custom api on top of hadoop api,but since we are moving in spark we have to achieve the same using spark.Can this be achieved using spark dataframes?
AFAIK there is no native api available directly in DataFrame except the below approach
Please try/think some thing like(which is RDD of DataFrame style, inspired by SequenceFileRDDFunctions.scala & method saveAsSequenceFile) in below example :
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable
object driver extends App {
val conf = new SparkConf()
.setAppName("HDFS writable test")
val sc = new SparkContext(conf)
val empty = sc.emptyRDD[Any].repartition(10)
val data = empty.mapPartitions(Generator.generate).map{ (NullWritable.get(), _) }
val seq = new SequenceFileRDDFunctions(data)
// seq.saveAsSequenceFile("/tmp/s1", None)
seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
sc.stop()
}
Further information pls see ..
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file
I have a csv file in hdfs, how can I query this file with spark SQL? For example I would like to make a select request on special columns and get the result to be stored again to the Hadoop distributed file system
Thanks
you can achieve by creating Dataframe.
val dataFrame = spark.sparkContext
.textFile("examples/src/main/resources/people.csv")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
dataFrame.sql("<sql query>");
You should create a SparkSession. An example is here.
Load a CSV file: val df = sparkSession.read.csv("path to your file in HDFS").
Perform your select operation: val df2 = df.select("field1", "field2").
Write the results back: df2.write.csv("path to a new file in HDFS")
I've got stuck while I'm writing a program using Apache Flink. The problem is that I'm trying to generate Hadoop's MapFile as a result of computation but Scala compiler complains about type mismatch.
To illustrate the problem, let me show you the below code snippet which tries to generate two kinds of output: one is Hadoop's SequenceFile and the other is MapFile.
val dataSet: DataSet[(IntWritable, BytesWritable)] =
env.readSequenceFile(classOf[Text], classOf[BytesWritable], inputSequenceFile.toString)
.map(mapper(_))
.partitionCustom(partitioner, 0)
.sortPartition(0, Order.ASCENDING)
val seqOF = new HadoopOutputFormat(
new SequenceFileOutputFormat[IntWritable, BytesWritable](), Job.getInstance(hadoopConf)
)
val mapfileOF = new HadoopOutputFormat(
new MapFileOutputFormat(), Job.getInstance(hadoopConf)
)
val dataSink1 = dataSet.output(seqOF) // it typechecks!
val dataSink2 = dataSet.output(mapfileOF) // syntax error
As commented above, dataSet.output(mapfileOF) causes Scala compiler to complain as follows:
FYI, compared to SequenceFile, MapFile calls for a stronger condition that a key must be WritableComparable.
Before writing the application using Flink, I implemented it using Spark as below and it worked okay (no compilation error and it runs okay without any error).
val rdd = sc
.sequenceFile(inputSequenceFile.toString, classOf[Text], classOf[BytesWritable])
.map(mapper(_))
.repartitionAndSortWithinPartitions(partitioner)
rdd.saveAsNewAPIHadoopFile(
outputPath.toString,
classOf[IntWritable],
classOf[BytesWritable],
classOf[MapFileOutputFormat]
)
Did you check: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/hadoop_compatibility.html#using-hadoop-outputformats
It contains the following example:
// Obtain your result to emit.
val hadoopResult: DataSet[(Text, IntWritable)] = [...]
val hadoopOF = new HadoopOutputFormat[Text,IntWritable](
new TextOutputFormat[Text, IntWritable],
new JobConf)
hadoopOF.getJobConf.set("mapred.textoutputformat.separator", " ")
FileOutputFormat.setOutputPath(hadoopOF.getJobConf, new Path(resultPath))
hadoopResult.output(hadoopOF)