Reading SAS file from blob storage in R - azure-blob-storage

I am trying to read a .sas7bdat file from the default container. I have tried the following so far:
sas_file <- RxSasData("wasbs://container#storageaccount.blob.core.windows.net/abc/xyz.sas7bdat")
sas_df <- rxImport(sas_file)
but I get the following error:
The file 'wasbs://container#storageaccount.blob.core.windows.net/abc/xyz.sas7bdat' does not exist.
Could not open data source.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Could not open data source.
The file exists at the location mentioned in the code, yet it still throws this error. Can someone please help me with this?

According to your code, I think you want to load a SAS data file from HDFS on Azure HDInsight via RxSasData. However, RxSasData does not appear to be supported in a Hadoop environment; please see here.
Please try copying the file to the local filesystem on the HDInsight node and reading it from there.

Related

File read from ADLS Gen2 Error - Configuration property xxx.dfs.core.windows.net not found

I am using ADLS Gen2 and, from a Databricks notebook, I am trying to process a file using an 'abfss' path.
I am able to read Parquet files just fine, but when I try to load the XML files I get the error Configuration property xxx.dfs.core.windows.net not found.
I haven't tried mounting the storage, but I am trying to understand whether this is a known limitation with XML files, since the Parquet files read just fine.
Here is my XML library config:
com.databricks:spark-xml_2.11:0.9.0
I tried a couple of things per the other articles but am still getting the same error.
Added a new scope in the Databricks workspace to see if it is a scope issue.
Tried adding the configuration
spark.conf.set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")
df = (spark.read.format("xml")
      .option("rootTag", "BookArticle")
      .option("inferSchema", "true")
      .option("error_bad_lines", True)
      .option("mode", "DROPMALFORMED")
      .load(abfsssourcename))  # abfsssourcename is the path of the source file
Exception Details: Py4JJavaError: An error occurred while calling o1113.load.
Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:392)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1008)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:151)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:106)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1281)
at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1269)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:820)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1269)
at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:43)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
at scala.Option.getOrElse(Option.scala:121)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:311)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I summarize the solution as below.
The package com.databricks:spark-xml appears to use the RDD API to read XML files. When we use the RDD API to access Azure Data Lake Storage Gen2, we cannot access Hadoop configuration options set using spark.conf.set(...). So we should update the code to spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, please refer to here.
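As a rough sketch of that change (keeping the placeholder account name xxxxx and key from above, and assuming abfsssourcename still holds the abfss:// path from the question), the read would then look something like this:
# Set the account key on the underlying Hadoop configuration so the RDD-based
# XML reader can see it (placeholder account name and key).
spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="
)
df = (spark.read.format("xml")
      .option("rootTag", "BookArticle")
      .option("inferSchema", "true")
      .load(abfsssourcename))  # abfss:// path from the question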
Besides, you can also mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks.
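For example, a mount using the storage account key might look roughly like the sketch below; the container, account name, mount point, and file name are placeholders, and OAuth with a service principal is the commonly recommended alternative to an account key:
# Mount the ADLS Gen2 container once, then read through the mount point.
dbutils.fs.mount(
    source="abfss://container@xxxxx.dfs.core.windows.net/",   # placeholder container/account
    mount_point="/mnt/xmlsource",                             # hypothetical mount point
    extra_configs={"fs.azure.account.key.xxxxx.dfs.core.windows.net": "xxxx=="}
)
df = (spark.read.format("xml")
      .option("rootTag", "BookArticle")
      .load("/mnt/xmlsource/BookArticle.xml"))                # hypothetical file path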

Required field 'uncompressed_page_size' was not found in serialized data! Parquet

I am getting the error below while trying to save a parquet file from a local directory using PySpark.
I tried Spark 1.6 and 2.2; both give the same error.
It displays the schema properly but throws the error at the time of writing the file.
base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
df = sqlContext.read.option("basePath", base_path).parquet(reg_path)
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"
df2 = df.coalesce(5)
df2.printSchema()
df2.write.mode('append').parquet(out_path)
org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
In my own case, I was writing a custom Parquet parser for Apache Tika when I experienced this error. It turned out that if the file is being used by another process, the ParquetReader is not able to read uncompressed_page_size, hence the error.
Verify that no other process is holding on to the file.
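If you want to check that from a script, a quick sketch like the following (assuming a Linux/macOS host with lsof available; the path is a placeholder) lists any process that still has the file open:
import subprocess

# List processes that currently hold the Parquet file open (placeholder path).
result = subprocess.run(["lsof", "/path/to/part-00000.parquet"],
                        capture_output=True, text=True)
if result.stdout:
    print("File is still open by another process:")
    print(result.stdout)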
Temporarily resolved it with the Spark config:
"spark.sql.hive.convertMetastoreParquet": "false"
Although it has an extra cost, it is a workaround for now.
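For reference, a minimal sketch of applying that setting, either when building the session or on an already-running one (the app name is just a placeholder):
from pyspark.sql import SparkSession

# Build a session with the workaround config (placeholder app name).
spark = (SparkSession.builder
         .appName("parquet-workaround")
         .config("spark.sql.hive.convertMetastoreParquet", "false")
         .getOrCreate())

# Or set it on an existing session; it is a runtime SQL configuration.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")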

Sqoop through JAVA API

We are trying to Sqoop data from MySQL to HDFS. When we run the code, the data gets stored in the local file system, but we want it in HDFS. Can anyone suggest what is wrong with the following code?
SqoopOptions options = new SqoopOptions();
options.setConnectString("jdbc:mysql://hostname/db_name");
options.setUsername("user");
options.setPassword("pass");
options.setTableName("table");
options.setDirectMode(true);
options.setNumMappers(4);
options.setDriverClassName("com.mysql.jdbc.Driver");
options.setSqlQuery("select * from table");
options.setWhereClause("value > 15.0");
options.setTargetDir("output");
options.doHiveImport();
System.out.println();
int ret=new ImportTool().run(options);
System.out.println(ret);
I ran the same program on HDFS and got the output :)
The issue here is with options.setTargetDir("output");
You are not specifying a fully-qualified HDFS path. If you replace "output" with a valid HDFS path (for example hdfs://<namenode>:8020/user/<user>/output), you should be able to run the code from anywhere and still get a proper result.

Pig register jar, file does not exist error

I'm using the Hortonworks sandbox and trying to run a simple Pig script. There appears to be an annoying error related to "file does not exist".
Below is the script:
REGISTER '/piggybank.jar';
inp = load '/my.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage..
ERROR 2997: Encountered IOException. File does not exist:
hdfs://sandbox.hortonworks.com:8020/tmp/udfs/ '/piggybank.jar'
However, my jar is present at the root (/) and I have given it proper permissions as well. I don't know why the path is pointing to /tmp/udfs....
Can anyone provide some suggestions?
Do not place the path within quotes. Also, provide the full URI of the jar file location:
REGISTER hdfs://sandbox.hortonworks.com:8020/piggybank.jar;
Refer to REGISTER (a jar/script).

How can I use Jena to read a file from HDFS and convert it to RDF?

I'm using Apache Jena to convert a .csv file to .rdf. I use model.read(pathFile), but it only reads files from the local filesystem. I want to read from HDFS, e.g. model.read("hdfs://...."), but it gives an error.
And the error is:
Exception in thread "main" org.apache.jena.riot.RiotNotFoundException: Not found: hdfs://localhost:54310/user/hduser/demo/departments/part-00000.csv
How can I do it?
You will need to add a Locator to the StreamManager to handle "hdfs://".
Jena does not ship with code for reading HDFS URLs.
