Unable to view column headers when reading in data from Databricks - azure-databricks

I am reading in data from Databricks using the following code:
acct = spark.read.csv("/mnt/syn/account/2018-06.csv", inferSchema = True, header=True)
However, I am unable to see the column headers.
The printSchema() output is as follows:
6d4cd0fe-dd7a-e811-a95c-00224800c9ff:string
5/19/2022 4:25:38 PM1:string
5/19/2022 4:25:38 PM2:string
0:string
14:string
_c5:string
_c6:string
_c7:string
18:string
_c9:string
_c10:string
_c11:string
71775000112:string
930580000:string
_c14:string
_c15:string
_c16:string
117:string
_c18:string
However, when I query the data as a table in Azure Synapse, I successfully get the headers.
I'm pretty sure there is a simple explanation, but I can't think why this is the case with Databricks.

Please follow this sample code:
file_location = "/FileStore/tables/export.csv"
df = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("sep", ",") \
    .load(file_location)
display(df)
For more details, refer to the official documentation.
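If the headers still come out wrong, it can help to confirm what the first line of the file actually contains. A minimal check in a Databricks notebook (assuming the same mount path as in the question; dbutils is only available inside Databricks):
# Print the first 500 bytes of the CSV to see whether line 1 is a header row or a data row
print(dbutils.fs.head("/mnt/syn/account/2018-06.csv", 500))
If the first line is already data (as the inferred column names above suggest), header=True will keep consuming that row as the header, and you would instead need to supply the column names or a schema explicitly.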

Related

Update Static Dataframe with Streaming Dataframe in Spark structured streaming

I have a use case similar to the post How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming. You can use the same example data as in that post.
static_df = spark.read.schema(schemaName).json(fileName)
streaming_df = spark.readStream(....)
new_reference_data = update_reference_df(streaming_df, static_df)
def update_reference_df(df, static_df):
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query

def update_static_df(batch_df, static_df):
    df1: DataFrame = static_df.union(batch_df.join(static_df,
                                                   (batch_df.SITE == static_df.SITE),
                                                   "left_anti"))
    return df1
I want to know how static_df will get refreshed with the new values from the data processed via foreachBatch, since as far as I know foreachBatch returns nothing (void). I need to use the new values from static_df in further processing. I'd appreciate your help.
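One commonly suggested pattern (a sketch only, not necessarily what the linked post intends) is to keep the reference data in a driver-side variable and rebuild it inside foreachBatch, so that later batches and downstream code pick up the refreshed version. This assumes static_df and batch_df share the same schema and reuses schemaName, fileName, streaming_df and the SITE column from the question:
static_df = spark.read.schema(schemaName).json(fileName)
static_df.cache()

def refresh_static_df(batch_df, batch_id):
    global static_df
    # keep only the incoming rows whose SITE is not already in the reference data
    new_rows = batch_df.join(static_df, batch_df.SITE == static_df.SITE, "left_anti")
    refreshed = static_df.union(new_rows)
    refreshed.cache()
    static_df.unpersist()
    static_df = refreshed  # reassign so the next batch (and other driver code) sees the new data

query = streaming_df.writeStream \
    .outputMode("append") \
    .foreachBatch(refresh_static_df) \
    .start()
Since foreachBatch itself returns nothing, the refresh only happens through that reassignment; downstream processing has to read the static_df variable at the time it runs rather than hold on to an old DataFrame object.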

Reading Blob Into Pyspark

I'm trying to read a series of JSON files stored in an Azure blob into Spark using a Databricks notebook. I set spark.conf with my account and key, but it always returns the error:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: java.lang.IllegalArgumentException: The String is not a valid Base64-encoded string.
I've followed along with the information provided here:
https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html
and here:
https://luminousmen.com/post/azure-blob-storage-with-pyspark
I can pull the data just fine using the Azure SDK for Python. Here is my Spark setup:
storage_account_name = "name"
storage_account_access_key = "key"
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key)
file_location = "wasbs://loc/locationpath"
file_type = "json"
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
This should return a DataFrame of the JSON file.
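One thing worth checking: the Base64 complaint usually points at the account key itself, since the WASB driver expects the value set for fs.azure.account.key.<account>.blob.core.windows.net to be a Base64 string; a truncated copy, a placeholder, or a SAS token pasted in its place triggers exactly this kind of exception. A quick, hedged way to verify the string before handing it to Spark (plain Python, assuming storage_account_access_key holds the value used above):
import base64
# Raises binascii.Error if the key is not valid Base64 (the same condition the WASB driver trips over)
base64.b64decode(storage_account_access_key, validate=True)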

'Integer overflow' in AWS Athena for more than 800M records

I used Spark on EMR to copy tables from Oracle to S3 in Parquet format, then used a Glue crawler to crawl the data from S3 and register it in Athena. The data ingestion is fine, but when I try to preview the data it shows this error:
GENERIC_INTERNAL_ERROR: integer overflow
I have tried the pipeline multiple times. The original schema is this:
SAMPLEINDEX (NUMBER(38,0))
GENEINDEX (NUMBER(38,0))
VALUE (FLOAT)
MINSEGMENTLENGTH (NUMBER(38,0))
I tried casting the data to integer, long and string, but the error still persists. I also inspected the original dataset and didn't find any value that could cause an integer overflow.
Tables with fewer than 800 million rows work perfectly fine, but once a table has more than 800 million rows the error starts to appear.
Here is some sample code in Scala:
val table = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin://#XXX")
  .option("dbtable", "tcga.%s".format(tableName))
  .option("user", "XXX")
  .option("password", "XXX")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("fetchsize", "50000")
  .option("numPartitions", "200")
  .load()

println("writing tablename: %s".format(tableName))

val finalDF = table.selectExpr(
  "cast(SAMPLEINDEX as string) as SAMPLEINDEX",
  "cast(GENEINDEX as string) as GENEINDEX",
  "cast(VALUE as string) as VALUE",
  "cast(MINSEGMENTLENGTH as string) as MINSEGMENTLENGTH")

finalDF.repartition(200)
finalDF.printSchema()
finalDF.write.format("parquet").mode("Overwrite").save("s3n://XXX/CNVGENELEVELDATATEST")
finalDF.printSchema()
finalDF.show()
Does anyone know what may cause the issue?

HDFS path does not exist with SparkSession object when spark master is set as LOCAL

I am trying to load a dataset into a Hive table using Spark.
But when I try to load the file from the HDFS directory into Spark, I get the exception:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
These are the steps before loading the file.
val wareHouseLocation = "file:${system:user.dir}/spark-warehouse"
val sparkSession = SparkSession.builder.master("local[2]")
  .appName("SparkHive")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation)
  .getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
Exception for the statement ->
val partf = sparkSession.read.textFile("partfile")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
But I have the file in my home directory of HDFS.
hadoop fs -ls
Found 1 items
-rw-r--r-- 1 cloudera cloudera 58 2017-06-30 02:23 partfile
I tried various ways to load the dataset like:
val partfile = sparkSession.read.textFile("/user/cloudera/partfile") and
val partfile = sparkSession.read.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/partfile")
But nothing seems to work.
My Spark version is 2.0.2.
Could anyone tell me how to fix it?
When you submit the job with the master set to local[2], the job is not submitted to a Spark master, so Spark does not know about the underlying HDFS.
Spark treats the local file system as its default file system, which is why the exception occurs in your case.
Try this way:
val sparkSession = SparkSession.builder
  .master("spark://<spark-master-ip>:<spark-port>")
  .appName("SparkHive")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation)
  .getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
You need to know <spark-master-ip> and <spark-port> for this.
This way, Spark will use the underlying HDFS as its default file system.
It's not clear to me what the error would be with an explicit protocol specification, but usually (as already answered) it means that the necessary configuration was not passed into the Spark context.
The first solution:
val sc = ??? // Spark Context
val config = sc.hadoopConfiguration
// you can mutate config object, it should work
config.addResource(new Path(s"${HADOOP_HOME}/conf/core-site.xml"))
// instead of adding a resource you can just specify hdfs address
// config.set("fs.defaultFS", "hdfs://host:port")
The second: explicitly set HADOOP_CONF_DIR in the $SPARK_HOME/conf/spark-env.sh file. If you plan to use a cluster, make sure every node of your cluster has HADOOP_CONF_DIR set.
And make sure you have all the necessary Hadoop dependencies on your Spark/application classpath.
Try the following, it should work.
SparkSession session = SparkSession.builder().appName("Appname").master("local[1]").getOrCreate();
DataFrameReader dataFrameReader = session.read();
String path = "path\\file.csv";
Dataset<Row> responses = dataFrameReader.option("header", "true").csv(path);

How to export data from Spark SQL to CSV

This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me on how to implement an export-to-CSV feature in Spark SQL.
You can use the statement below to write the contents of a DataFrame in CSV format:
df.write.csv("/data/home/csv")
If you need to write the whole DataFrame into a single CSV file, then use:
df.coalesce(1).write.csv("/data/home/sample.csv")
For Spark 1.x, you can use spark-csv to write the results into CSV files.
The Scala snippet below should help:
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
Since Spark 2.x, CSV support is integrated as a native data source. Therefore, the necessary statement simplifies to (Windows):
df.write
  .option("header", "true")
  .csv("file:///C:/out.csv")
or UNIX
df.write
  .option("header", "true")
  .csv("/var/out.csv")
Note: as the comments say, this creates a directory with that name containing the partition files, not a single standard CSV file. This, however, is most likely what you want, since otherwise you would either crash your driver (out of RAM) or be working in a non-distributed environment.
The answer above with spark-csv is correct, but there is an issue: the library creates several files based on the DataFrame's partitioning, and this is not usually what we need. So you can combine all partitions into one:
df.coalesce(1).
  write.
  format("com.databricks.spark.csv").
  option("header", "true").
  save("myfile.csv")
and rename the library's output (named "part-00000") to the desired filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/
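If you really need a single, conventionally named file, the rename step can be scripted as well; a minimal PySpark sketch, assuming Spark 2.x's built-in CSV writer, a local output path (not DBFS or HDFS), and illustrative paths:
import glob
import shutil

df.coalesce(1).write.option("header", "true").csv("/tmp/out_dir")
# Spark writes a directory of part files; with coalesce(1) there is exactly one, so rename it
part_file = glob.glob("/tmp/out_dir/part-*.csv")[0]
shutil.move(part_file, "/tmp/out.csv")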
The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that), df.map(r => r.mkString(",")) would do the same.
If you want CSV escaping, you can use Apache Commons Lang for that. For example, here's the code we're using:
import org.apache.commons.lang3.StringEscapeUtils
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def DfToTextFile(path: String,
                 df: DataFrame,
                 delimiter: String = ",",
                 csvEscape: Boolean = true,
                 partitions: Int = 1,
                 compress: Boolean = true,
                 header: Option[String] = None,
                 maxColumnLength: Option[Int] = None) = {

  def trimColumnLength(c: String) = {
    val col = maxColumnLength match {
      case None => c
      case Some(len: Int) => c.take(len)
    }
    if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
  }

  def rowToString(r: Row) = {
    val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") // remove control characters
    st.split("~-~").map(trimColumnLength).mkString(delimiter)
  }

  def addHeader(r: RDD[String]) = {
    val rdd = for (h <- header;
                   if partitions == 1; // headers only supported for single partitions
                   tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
    rdd.getOrElse(r)
  }

  val rdd = df.map(rowToString).repartition(partitions)
  val headerRdd = addHeader(rdd)
  if (compress)
    headerRdd.saveAsTextFile(path, classOf[GzipCodec])
  else
    headerRdd.saveAsTextFile(path)
}
With the help of spark-csv, we can write to a CSV file.
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header", "true").save("output.csv")
The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.
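A minimal PySpark illustration of that RDD route (df is whatever DataFrame you want to export, the path is illustrative, and there is no quoting or escaping, so it only suits simple values):
# Join each row's fields with commas and write the lines out; the result is a directory of part files
df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("/data/home/csv_rdd")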
In a DataFrame:
val p = spark.read.format("csv").options(Map("header" -> "true", "delimiter" -> "^")).load("filename.csv")
