Azure Databricks: missing entries when streaming to SQL Data Warehouse - spark-streaming

First I had the following instructions, and when I uploaded 20,000 files I got 20,000 records in the DB (each file holds only 1 record).
aTracking = sqlContext.read.format('csv').options(header='true', delimiter=';').schema(csvSchema).load("wasbs://" + blobContainer + "@" + blobStorage + ".blob.core.windows.net/rtT*.csv")
aTracking.write \
    .option('user', dwUser) \
    .option('password', dwPass) \
    .jdbc('jdbc:sqlserver://' + dwServer + ':' + dwJdbcPort + ';database=' + dwDatabase, 'stg_tr_energy_xmlin.csv_in', mode='append')
Then, for speed purposes, I thought it would be better to stream with PolyBase, coded as below. But there I only got roughly 17,000 entries.
aTracking = spark.readStream.format('csv').options(header='true', delimiter=';').schema(csvSchema).load("wasbs://" + blobContainer + "@" + blobStorage + ".blob.core.windows.net/rtT*.csv")
aTracking.writeStream \
    .format("com.databricks.spark.sqldw") \
    .option("url", sqlDwUrl) \
    .option("tempDir", "wasbs://uploaddw@" + blobStorage + ".blob.core.windows.net/stream") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "stg_tr_energy_xmlin.csv_in") \
    .option("checkpointLocation", "/checkpoint") \
    .start()
Any suggestions as to what could cause this?

The state of your structured streaming query is tracked in the checkpoint location. "Every streaming source is assumed to have offsets (similar to Kafka offsets (...)) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger". See the Spark documentation (search for checkpointing) for more details.
So if you want to reprocess all your files, delete the checkpoint location directory (or define a new one) set under:
.option("checkpointLocation", "/checkpoint")
Additionally, the _spark_metadata file in the target directory is checked, so if you want all data to be written again, you should also clean up the target directory (with Azure SQL Data Warehouse, the temp dir).
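For example, a minimal sketch of restarting the same query against a fresh checkpoint location (and a fresh temp dir) so that every source file is considered again; the "_v2" suffixes are hypothetical names:
# Hypothetical restart: new checkpointLocation (and tempDir) so the stream
# re-reads all rtT*.csv files from the beginning.
aTracking.writeStream \
    .format("com.databricks.spark.sqldw") \
    .option("url", sqlDwUrl) \
    .option("tempDir", "wasbs://uploaddw@" + blobStorage + ".blob.core.windows.net/stream_v2") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "stg_tr_energy_xmlin.csv_in") \
    .option("checkpointLocation", "/checkpoint_v2") \
    .start()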

Related

Unable to view column headers when read in data from Databricks

I am reading in data from Databricks using the following code:
acct = spark.read.csv("/mnt/syn/account/2018-06.csv", inferSchema = True, header=True)
However, I am unable to see the column headers.
The printSchema() output is as follows:
6d4cd0fe-dd7a-e811-a95c-00224800c9ff:string
5/19/2022 4:25:38 PM1:string
5/19/2022 4:25:38 PM2:string
0:string
14:string
_c5:string
_c6:string
_c7:string
18:string
_c9:string
_c10:string
_c11:string
71775000112:string
930580000:string
_c14:string
_c15:string
_c16:string
117:string
_c18:string
However, when I query the data as a table in Azure Synapse, I do successfully get the headers.
I'm pretty sure there is a simple explanation, but I can't think why this is the case with Databricks.
Please follow this sample code.
file_location = "/FileStore/tables/export.csv"
df = spark.read.format("csv") \
.option("inferSchema", "true") \
.option("header", "true") \
.option("sep", ",") \
.load(file_location)
display(df)
For more details, refer to the official documentation.
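If the headers still come back as _cN-style names, it is worth checking whether the exported file actually contains a header line at all; a quick diagnostic sketch (using the path from the question) is to print the first few hundred bytes of the raw file:
# Show the raw start of the CSV; if the first line is data rather than column
# names, header=True will treat that data row as the header.
print(dbutils.fs.head("/mnt/syn/account/2018-06.csv", 500))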

Writing a parquet file throws "An HTTP header that's mandatory for this request is not specified"

I have two ADLSv2 storage accounts, both are hierarchical namespace enabled.
In my Python notebook, I'm reading a CSV file from one storage account and, after some enrichment, writing it as a parquet file to the other storage account.
I am getting the below error when writing the parquet file:
StatusCode=400, An HTTP header that's mandatory for this request is not specified
Any help is greatly appreciated.
Below is my Notebook code snippet...
# Databricks notebook source
# MAGIC %python
# MAGIC
# MAGIC STAGING_MOUNTPOINT = "/mnt/inputfiles"
# MAGIC if STAGING_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC     dbutils.fs.unmount(STAGING_MOUNTPOINT)
# MAGIC
# MAGIC PERM_MOUNTPOINT = "/mnt/outputfiles"
# MAGIC if PERM_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC     dbutils.fs.unmount(PERM_MOUNTPOINT)
STAGING_STORAGE_ACCOUNT = "--------"
STAGING_CONTAINER = "--------"
STAGING_FOLDER = "--------"
PERM_STORAGE_ACCOUNT = "--------"
PERM_CONTAINER = "--------"
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "#####################",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="DemoScope", key="DemoSecret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/**********************/oauth2/token"}
STAGING_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
    container=STAGING_CONTAINER, storage_acct=STAGING_STORAGE_ACCOUNT)
try:
    dbutils.fs.mount(
        source=STAGING_SOURCE,
        mount_point=STAGING_MOUNTPOINT,
        extra_configs=configs)
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e
print("Staging Storage mount Success.")
inputDemoFile = "{}/{}/demo.csv".format(STAGING_MOUNTPOINT, STAGING_FOLDER)
readDF = (spark
.read.option("header", True)
.schema(inputSchema)
.option("inferSchema", True)
.csv(inputDemoFile))
PERM_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
    container=PERM_CONTAINER, storage_acct=PERM_STORAGE_ACCOUNT)
try:
    dbutils.fs.mount(
        source=PERM_SOURCE,
        mount_point=PERM_MOUNTPOINT,
        extra_configs=configs)
except Exception as e:
    if "Directory already mounted" in str(e):
        pass  # Ignore error if already mounted.
    else:
        raise e
print("Landing Storage mount Success.")
outPatientsFile = "{}/patients.parquet".format(outPatientsFilePath)
print("Writing to parquet file: " + outPatientsFile)
The call below is failing; the error is:
StatusCode=400
StatusDescription=An HTTP header that's mandatory for this request is not specified.
ErrorCode=
ErrorMessage=
(readDF
.coalesce(1)
.write
.mode("overwrite")
.option("header", "true")
.option("compression", "snappy")
.parquet(outPatientsFile)
)
I summarize the solution below.
If you want to mount Azure Data Lake Storage Gen2 as an Azure Databricks file system, the URL should be like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/. For more details, please refer to the documentation here.
For example
Create an Azure Data Lake Storage Gen2 account.
az login
az storage account create \
--name <account-name> \
--resource-group <group name> \
--location westus \
--sku Standard_RAGRS \
--kind StorageV2 \
--enable-hierarchical-namespace true
Create a service principal and assign the Storage Blob Data Contributor role to it, scoped to the Data Lake Storage Gen2 storage account.
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
--scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
Mount Azure Data Lake Storage Gen2 in Azure Databricks (Python):
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<appId>",
"fs.azure.account.oauth2.client.secret": "<clientSecret>",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<container-name>#<storage-account-name>.dfs.core.windows.net/folder1",
mount_point = "/mnt/flightdata",
extra_configs = configs)
A couple of important points to note while mounting storage accounts in Azure Databricks:
For Azure Blob storage: source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>"
For Azure Data Lake Storage Gen2: source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/"
To mount an Azure Data Lake Storage Gen2 filesystem, or a folder inside it, as an Azure Databricks file system, the URL should be like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/
Reference: Azure Databricks - Azure Data Lake Storage Gen2
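Applied to the notebook in the question, this means both mount sources should use the dfs endpoint rather than the blob endpoint; a sketch reusing the question's own variable names:
# Point the abfss mount sources at the Data Lake Storage Gen2 (dfs) endpoint.
STAGING_SOURCE = "abfss://{container}@{storage_acct}.dfs.core.windows.net/".format(
    container=STAGING_CONTAINER, storage_acct=STAGING_STORAGE_ACCOUNT)
PERM_SOURCE = "abfss://{container}@{storage_acct}.dfs.core.windows.net/".format(
    container=PERM_CONTAINER, storage_acct=PERM_STORAGE_ACCOUNT)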

HDFS path does not exist with SparkSession object when spark master is set as LOCAL

I am trying to load a dataset into Hive table using Spark.
But when I try to load the file from HDFS directory to Spark, I get the exception:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
These are the steps before loading the file.
val wareHouseLocation = "file:${system:user.dir}/spark-warehouse"
val sparkSession = SparkSession.builder.master("local[2]")
  .appName("SparkHive")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation)
  .getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
Exception for the statement ->
val partf = sparkSession.read.textFile("partfile")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
But I have the file in my home directory of HDFS.
hadoop fs -ls
Found 1 items
-rw-r--r-- 1 cloudera cloudera 58 2017-06-30 02:23 partfile
I tried various ways to load the dataset like:
val partfile = sparkSession.read.textFile("/user/cloudera/partfile") and
val partfile = sparkSession.read.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/partfile")
But nothing seems to work.
My Spark version is 2.0.2.
Could anyone tell me how to fix it?
When you submit the job with the master set to local[2], it is not submitted to the Spark master, so Spark does not know about the underlying HDFS.
Spark will consider the local file system as its default file system, and that's why the IOException occurs in your case.
Try this way:
val sparkSession = SparkSession.builder
  .master("spark://<spark-master-ip>:<spark-port>")
  .appName("SparkHive")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation)
  .getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
You need to know <spark-master-ip> and <spark-port> for this.
This way, Spark will take the underlying HDFS file system as its default file system.
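To confirm which file system a given session resolves by default, one hedged diagnostic (shown here in PySpark; _jsc is an internal handle, so treat this as a quick check rather than a stable API):
# Prints e.g. "file:///" in plain local mode, or the HDFS NameNode URI once
# the Hadoop configuration has been picked up.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))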
It's not clear to me what the error with explicit protocol specification would be, but usually (as already answered) it means that the necessary configurations were not passed into the Spark context.
The first solution:
import org.apache.hadoop.fs.Path

val sc = ??? // Spark Context
val config = sc.hadoopConfiguration
// you can mutate the config object, it should work
config.addResource(new Path(s"${HADOOP_HOME}/conf/core-site.xml"))
// instead of adding a resource you can just specify the hdfs address
// config.set("fs.defaultFS", "hdfs://host:port")
The second:
Explicitly specify HADOOP_CONF_DIR in the $SPARK_HOME/spark-env.sh file. If you plan to use a cluster, be sure that every node of your cluster has HADOOP_CONF_DIR specified.
And be sure that you have all the necessary Hadoop dependencies on your Spark / application classpath.
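Equivalently, Hadoop settings can be injected when the session is built, using Spark's spark.hadoop.* configuration prefix; a minimal PySpark sketch, reusing the NameNode address from the question as a placeholder:
from pyspark.sql import SparkSession

# Any key prefixed with "spark.hadoop." is copied into the underlying Hadoop
# Configuration, so relative paths resolve against HDFS even with a local master.
spark = (SparkSession.builder
         .master("local[2]")
         .appName("SparkHive")
         .config("spark.hadoop.fs.defaultFS", "hdfs://quickstart.cloudera:8020")
         .getOrCreate())

partf = spark.read.text("partfile")  # now resolved against the HDFS home directory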
Try the following, it should work.
SparkSession session = SparkSession.builder().appName("Appname").master("local[1]").getOrCreate();
DataFrameReader dataFrameReader = session.read();
String path = "path\\file.csv";
Dataset<Row> responses = dataFrameReader.option("header", "true").csv(path);

Texture Packer Command line multiple images

I'm using TexturePacker from the command line and am unable to get it to pack multiple sprites into one sheet. It works with a single sprite, so the command itself is fine.
This is the command I am using:
"TexturePacker --format sparrow --texture-format atf --opt DXT5
--max-width 2048 --max-height 2048 --size-constraints POT --pack-mode Best --enable-rotation --trim-mode Trim --algorithm MaxRects
--multipack --border-padding 5 --shape-padding 5 --reduce-border-artifacts --scale " + sheetInfo.scale + " --data " + imageOutputDirectory + "\" + lanfPrefix + "\" + sheetInfo.name +
".xml " + "--sheet " + imageOutputDirectory + "\" + lanfPrefix + "\"
+ sheetInfo.name + ".atf image1.png image2.png";
Any ideas why this isn't working? According to the documentation, it should work.
I was unable to find any real way to fix this, and even contacted the developer of TexturePacker, but got no response. Instead I was able to accomplish the desired outcome by copying all needed files to a temp directory and then adding that directory to the end of the TexturePacker call instead of individual images.
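A rough sketch of that temp-directory workaround in Python; the image list, output names, and the reduced option set are placeholders based on the command above:
import shutil
import subprocess
import tempfile

# Copy only the sprites we want into a scratch folder, then pass the folder
# to TexturePacker instead of listing individual images on the command line.
src_images = ["image1.png", "image2.png"]   # placeholder file names
temp_dir = tempfile.mkdtemp(prefix="tp_sheet_")
for img in src_images:
    shutil.copy(img, temp_dir)

subprocess.call([
    "TexturePacker",
    "--format", "sparrow",
    "--data", "sheet.xml",                  # placeholder output names
    "--sheet", "sheet.png",
    temp_dir,
])
shutil.rmtree(temp_dir)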
Due to the poor TexturePacker documentation, it took me much trial and error to figure this out!
To add multiple images to a single sprite sheet, here is a sample command that will create an atlas called out.png (the default) containing images img_1 to img_4...
TexturePacker --format unity-texture2d img_1.png img_2.png img_3.png img_4.png
The key is the list of image filenames separated only by spaces. I am working from Python, so here is the script I use to create the same atlas that the sample line above will give you. Using glob with wildcards allows me to select images from a folder containing many images and eliminates the need to isolate the files I want into a folder just for TexturePacker's sake.
import subprocess, glob

TP = r"C:\Program Files\CodeAndWeb\TexturePacker\bin\TexturePacker.exe"
baseFrame = "img"

def FillAtlas(baseFrame):
    globString = baseFrame + "_*.png"
    frameList = glob.glob(globString)
    imgList = []
    for frame in frameList:
        imgList.append(frame)
    TPargs = [TP, "--format", "unity-texture2d"] + imgList
    subprocess.call(TPargs)

FillAtlas(baseFrame)

Creating a new rrd database based on an existing one

I have some old rrdtool databases, for which the exact creation recipe has long since been lost. I need to create a new database with the same characteristics as the current ones. I've dumped a couple of the old databases and pored over the contents, but I'm not sure how to interpret the metadata. I think it appears in the following stanzas:
<cf> AVERAGE </cf>
<pdp_per_row> 360 </pdp_per_row> <!-- 1800 seconds -->
<xff> 5.0000000000e-01 </xff>
There are four such stanzas, which correspond to the way I recall the round-robin cascading was set up. Has anyone already done this, or can anyone give me pointers on how to clone a new empty RRD database from an existing one? Or show me where I missed this in the documentation.
I use the command rrdcreate. It can create a new RRD based on an existing one. The -t parameter indicates an existing RRD as a template.
rrdcreate new.rrd -t existing.rrd
rrdtool's rrdinfo is your friend!
It will tell you how the RRD file's data source(s) and archive(s) were created. Example:
$ rrdtool info random.rrd
filename = "random.rrd"
rrd_version = "0001"
step = 300
last_update = 955892996
ds[a].type = "GAUGE"
ds[a].minimal_heartbeat = 600
ds[a].min = NaN
ds[a].max = NaN
ds[a].last_ds = "UNKN"
ds[a].value = 2.1824421548e+04
ds[a].unknown_sec = 0
ds[b].type = "GAUGE"
ds[b].minimal_heartbeat = 600
ds[b].min = NaN
ds[b].max = NaN
ds[b].last_ds = "UNKN"
ds[b].value = 3.9620838224e+03
ds[b].unknown_sec = 0
rra[0].cf = "AVERAGE"
rra[0].pdp_per_row = 1
rra[0].cdp_prep[0].value = nan
rra[0].cdp_prep[0].unknown_datapoints = 0
rra[0].cdp_prep[1].value = nan
rra[0].cdp_prep[1].unknown_datapoints = 0
You can try to use the clone script described here. It's very basic, but it works for simple RRD files. I used it to figure out a schema that was generated by Munin. I needed to insert old data into Munin, so I reverse-engineered the schema, set the --start to a date prior to the start of my old data, and re-imported the data into the RRD.
$ python rrdinfo-parser.py -f test.rrd
rrdtool create test.rrd --start 920804400 --step 300 \
DS:speed:COUNTER:600:U:U \
RRA:AVERAGE:0.5:1:24 \
RRA:AVERAGE:0.5:6:10 \
