Reading Blob Into Pyspark - azure-databricks

I'm trying to read a series of JSON files stored in Azure Blob Storage into Spark from a Databricks notebook. I set the account name and key with spark.conf.set(), but it always returns this error:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: java.lang.IllegalArgumentException: The String is not a valid Base64-encoded string.
I've followed along with the information provided here:
https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html
and here:
https://luminousmen.com/post/azure-blob-storage-with-pyspark
I can pull the data just fine using the Azure SDK for Python. This is the code I'm running:
storage_account_name = "name"
storage_account_access_key = "key"
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
    storage_account_access_key,
)
file_location = "wasbs://loc/locationpath"
file_type = "json"
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
I expect this to return a DataFrame of the JSON files.
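The exception message says the value passed as the account key could not be decoded as Base64, which typically points at a placeholder, a truncated key, or some other string pasted in its place. The following is a minimal sketch, not a confirmed fix for this case: it checks a key locally before handing it to spark.conf.set(), and builds a location in the wasbs://<container>@<account>.blob.core.windows.net/<path> form. The helper names and the sample container, account, and path values are hypothetical.

```python
import base64
import binascii


def is_valid_base64_key(key: str) -> bool:
    """Return True if `key` decodes as strict Base64."""
    try:
        base64.b64decode(key, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def wasbs_url(container: str, account: str, path: str = "") -> str:
    """Build a wasbs:// URL in the container@account form."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical values, for illustration only.
sample_key = base64.b64encode(b"0" * 64).decode("ascii")
print(is_valid_base64_key(sample_key))  # a well-formed key
print(is_valid_base64_key("key"))       # the placeholder from the question
print(wasbs_url("mycontainer", "myaccount", "/data/events"))
```

If the check fails for the real key, copying it again from the storage account's access keys page is a reasonable first step.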

Related

Azure - Copy LARGE blobs from one container to other using logic apps

I successfully built a logic app that copies a blob to container-2 whenever it is added to container-one. However, it fails when a blob larger than 50 MB (the default size limit) is uploaded.
Could you please advise?
Blobs are added via the REST API.
Currently, the maximum file size with chunking disabled is 50 MB. One workaround is to use Azure Functions to transfer the files from one container to another.
Below is sample Python code that worked for me when transferring files from one container to another:
from datetime import datetime, timedelta
from azure.storage.blob import (
    AccountSasPermissions,
    BlobClient,
    BlobServiceClient,
    ResourceTypes,
    generate_account_sas,
)
connection_string = '<Your Connection String>'
account_key = '<Your Account Key>'
source_container_name = 'container1'
blob_name = 'samplepdf.pdf'
destination_container_name = 'container2'
# Create the service client
client = BlobServiceClient.from_connection_string(connection_string)
# Create a read-only account SAS token so the copy can read the source blob
sas_token = generate_account_sas(
    account_name=client.account_name,
    account_key=account_key,
    resource_types=ResourceTypes(object=True),
    permission=AccountSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=4),
)
# Create a blob client for the source blob
source_blob = BlobClient(
    client.url,
    container_name=source_container_name,
    blob_name=blob_name,
    credential=sas_token,
)
# Create the destination blob and start the copy operation
new_blob = client.get_blob_client(destination_container_name, blob_name)
new_blob.start_copy_from_url(source_blob.url)
REFERENCES:
General Limits
How to copy a blob from one container to another container using Azure Blob storage SDK

How do I write data binary to gcs with ruby efficiently?

I want to upload binary data directly to Google Cloud Storage without writing a file to disk. The snippet below gets me to my starting point:
require 'google/cloud/storage'
bucket_name = '-----'
data = File.open('image_block.jpg', 'rb') {|file| file.read }
storage = Google::Cloud::Storage.new("project_id": "maybe-i-will-tell-u")
bucket = storage.bucket bucket_name, skip_lookup: true
Now I want to put this data directly into a file on GCS, without writing a file to disk first.
Is there an efficient way to do that?
I tried the following code:
to_send = StringIO.new(data).read
bucket.create_file to_send, "image_inder_11111.jpg"
but this throws an error saying
/google/cloud/storage/bucket.rb:2898:in `file?': path name contains null byte (ArgumentError)
from /home/inder/.gem/gems/google-cloud-storage-1.36.1/lib/google/cloud/storage/bucket.rb:2898:in `ensure_io_or_file_exists!'
from /home/inder/.gem/gems/google-cloud-storage-1.36.1/lib/google/cloud/storage/bucket.rb:1566:in `create_file'
from champa.rb:14:in `<main>'
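As suggested by @stefan, it should be to_send = StringIO.new(data), i.e. without .read (which would return a string again). As the stack trace shows, create_file checks a String argument with file?, treating it as a file path, which is why the binary data fails with the null-byte error.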

Is there a ruby method for finding a blob uri?

I checked the whole azure-storage-blob gem and didn't find any way to get the URI for a blob. Is there some way to construct it correctly and in a generic way that will work for any other blob in any region?
I used S3 SDK before and I'm well grounded in S3 but new to Azure.
There is a protected method called blob_uri that looks like this:
def blob_uri(container_name, blob_name, query = {}, options = {})
  if container_name.nil? || container_name.empty?
    path = blob_name
  else
    path = ::File.join(container_name, blob_name)
  end
  options = { encode: true }.merge(options)
  generate_uri(path, query, options)
end
So you could take the shortcut of:
blob_client = Azure::Storage::Blob::BlobService.create(storage_account_name: 'XXX', storage_access_key: 'XXX')
blob_client.send(:blob_uri, container_name, blob_name)
However, the actual URI is simply:
https://[storage_account_name].blob.core.windows.net/[container]/[blob file name]
Since you already have to know the container and the blob name to access the blob, you can build it yourself:
File.join(blob_client.host, container, blob_name)
This is the URI of the blob.
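For comparison, the same construction can be sketched outside the gem. This is a minimal illustration in Python, assuming the default public-cloud endpoint suffix blob.core.windows.net (custom domains and sovereign clouds use different hosts); the function name and sample values are hypothetical. The region does not appear in the URL, so the same pattern applies to any storage account on that endpoint.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the HTTPS URL for a blob, percent-encoding the blob name."""
    # Keep "/" unescaped so virtual directories in the blob name survive.
    encoded_name = quote(blob_name, safe="/")
    return f"https://{account}.blob.core.windows.net/{container}/{encoded_name}"


# Hypothetical account, container, and blob name.
print(blob_url("myaccount", "images", "2021/summer photo.jpg"))
```

Encoding the blob name matters for names with spaces or other reserved characters, which is what the encode: true option in blob_uri appears to handle.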

How to read a COMPRESS()-ed H2 blob column via JDBC?

I have a file-based H2 database (engine version 1.4.196) with a mediumblob column containing data returned by the COMPRESS() function:
create table foo (compressed_data mediumblob);
...
insert into foo (compressed_data) values (COMPRESS(STRINGTOUTF8('Test'), 'DEFLATE'));
(The table is created and filled by flyway.)
I'd like to read this data in a JDBC client without calling DECOMPRESS() first. (I want to do the decompression client-side for compatibility with another system). I've tried to read the data via an InflaterInputStream, which can uncompress DEFLATE data:
try (InputStream dbStream = rs.getBinaryStream("compressed_data");
     InflaterInputStream inflaterStream = new InflaterInputStream(dbStream);
) {
    inflaterStream.read();
    ...
But this causes an error:
java.util.zip.ZipException: incorrect header check
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
...
Is there any way I can get InflaterInputStream-compatible compressed data from a column in H2?
Since you are already using the H2 JDBC driver to access the database, you can simply retrieve the compressed data with getBytes and use the expand method of org.h2.tools.CompressTool to decompress it:
// .java source file is Cp1252 encoded
String sql = "SELECT COMPRESS(STRINGTOUTF8('fermé'), 'DEFLATE') AS foo";
ResultSet rs = st.executeQuery(sql);
rs.next();
byte[] bytesOut = rs.getBytes(1);
byte[] expanded = org.h2.tools.CompressTool.getInstance().expand(bytesOut);
String strOut = new String(expanded, "UTF-8");

Tika text extraction not working on HDFS

I'm trying to use Tika to extract text from a bunch of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX+fileAdd);
InputStream stream = fs.open(pt);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(stream, handler, metadata);
spaceContentBuffer.append(handler.toString());
The last line appends the extracted content to a StringBuilder, but it is always empty.
P.S. My Hadoop cluster is Azure HDInsight, so HDFS is backed by Blob Storage.
I also tried the following code:
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);
and I got the following error message:
Failed to detect the character encoding of a document
If the user does not specify Content-Type when uploading a blob, it will be set to “application/octet-stream” by default.
