Reading a CSV from the Windows C drive in Azure Databricks

I am trying to read a .csv file from the Windows C drive into Databricks. I tried the following code after going through some of the existing answers.
# remove the 'file' string and use an 'r' or 'u' prefix to indicate raw/unicode string format
# Option 1
#PATH = r'C:\customers_marketing.csv' # raw string
# Option 2
PATH = u'C:\\customers_marketing.csv' # unicode string
customers_marketing = spark.read.csv(PATH, header="true", inferSchema="true")
However, I was not able to read it into Databricks. I get the following error.
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: C:%5Ccustomers_marketing.csv
Could anyone please advise how I can read data from the Windows C drive into Databricks?
Thanks in advance

It's not possible: your file is on your local machine, while Databricks runs in the cloud and has no knowledge of your machine.
You need to upload the file to DBFS and then read it from there. You can do this, for example, via the UI - the DBFS file browser (docs) or the Upload Data UI (docs).
If the file is huge, use something like AzCopy to upload the file(s) to Azure Storage instead.
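Once the file is uploaded, reading it is straightforward. A minimal sketch, assuming the file was uploaded via the Upload Data UI to a hypothetical path under /FileStore:
# hypothetical DBFS location - adjust to wherever the upload placed the file
PATH = "dbfs:/FileStore/tables/customers_marketing.csv"
customers_marketing = spark.read.csv(PATH, header="true", inferSchema="true")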

Related

Azure Data Factory Unzipping many files into partitions based on filename

I have a large zip file that has 900k JSON files in it. I need to process these with a data flow. I'd like to organize the files into folders using the last two digits in the file name so I can process them in chunks of 10k. My question is: how do I set up a pipeline to use part of the file name of the files in the zip file (the source) as part of the path in the sink?
current setup: zipfile.zip -> /json/XXXXXX.json
desired setup: zipfile.zip -> /json/XXX/XXXXXX.json
Please check whether the references below can help.
In the source transformation, you can read from a container, folder, or individual file in Azure Blob storage. Use the Source options tab to manage how the files are read. Using a wildcard pattern will instruct the service to loop through each matching folder and file in a single source transformation. This is an effective way to process multiple files within a single flow.
For example:
[ ] - matches one or more characters in the brackets
/data/sales/**/*.csv - gets all .csv files under /data/sales
Please also go through Copy and transform data in Azure Blob storage - Azure Data Factory & Azure Synapse | Microsoft Docs for other patterns and to check all the filtering possibilities in Azure Blob storage.
How to UnZip Multiple Files which are stored on Azure Blob Storage By using Azure Data Factory - Bing video
In the sink transformation, you can write to either a container or a folder in Azure Blob storage.
File name option: Determines how the destination files are named in the destination folder.
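As a purely illustrative aside (not something Azure Data Factory itself runs), the folder layout the question is after - partitioning by the last two digits of the file name - can be sketched in Python; the helper function and paths below are hypothetical:
import os

def partition_path(file_name, base="/json"):
    # hypothetical helper: use the last two digits before the extension
    # as the partition folder, e.g. 123456.json -> /json/56/123456.json
    stem = os.path.splitext(file_name)[0]
    return f"{base}/{stem[-2:]}/{file_name}"

print(partition_path("123456.json"))  # /json/56/123456.json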

How to give an Azure Machine Learning dataset path in an inference script?

I am using the azureml SDK in Azure Databricks.
When I write the inference script (%%writefile script.py) in a Databricks cell,
I try to load a .bin file that I uploaded to Azure Machine Learning Datasets.
I would like to do this in the script.py:
fasttext.load_model(azuremldatasetpath)
How can I pass the correct dataset path of my .bin file in the azuremldatasetpath variable (without calling the workspace in the script)?
Something like:
dataset_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'file.bin')
You can use your model name with the Model.get_model_path() method to retrieve the path of the model file or files on the local file system. If you register a folder or a collection of files, this API returns the path of the directory that contains those files.
For more info, you may want to refer to: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-advanced-entry-script#azureml_model_dir
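A minimal sketch of how this could look inside script.py, assuming the .bin file was registered as a model named 'fasttext-model' (a hypothetical name) and the script runs inside the deployed service:
import fasttext
from azureml.core.model import Model

def init():
    global model
    # resolves the local path of the registered model file (or directory)
    # without needing a Workspace object in the script
    model_path = Model.get_model_path("fasttext-model")  # hypothetical registered name
    model = fasttext.load_model(model_path)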

Why is an empty file with the name of a folder created inside an Azure Blob storage container?

I am running a HiveQL script through an HDInsight on-demand cluster which does the following:
Spool the data from a Hive view
Create a folder named abcd inside a Blob storage container named XYZ
Store the view data in a file inside the abcd folder
However, when the HiveQL is run, an empty file named abcd gets created outside the abcd folder.
Any idea why this is happening and how we can stop it from happening? Please suggest.
Thanks,
Surya
You get this because the Azure storage you are mounting does not have a hierarchical file system. For example, the mount is a Blob storage account of kind StorageV2, but you did not enable the hierarchical namespace at creation time. A v2 storage account with the hierarchical namespace enabled is known as Azure Data Lake Storage Gen2 (ADLS Gen2), which essentially removes the blob vs. lake distinction you had between ADLS Gen1 and the older blob generations.
Depending on the blob API you are using, a number of tricks are applied to give you the illusion of a hierarchical FS even when you don't have one, such as creating empty or hidden files. The main point is that the namespace is flat (i.e. there is no real hierarchy), so you can't just create an empty folder: you have to put something there.
For example, if you mount a v2 blob container with the wasbs:// driver in Databricks and run mkdir -p /dbfs/mnt/mymount/this/is/a/path from a %sh cell, you will see something like this:
this (folder), this (empty file)
this/is (folder), this/is (empty file)
etc.
Finally, while this is perfectly fine for Azure Blob itself, it might cause trouble for anything else not expecting it, even %sh ls.
Just recreate the storage account as ADLS Gen2, or upgrade it in place by enabling the hierarchical namespace.
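A minimal way to observe this from a Databricks notebook; the mount point /mnt/mymount below is hypothetical:
# create a nested path on a flat (non-hierarchical-namespace) wasbs:// mount
dbutils.fs.mkdirs("/mnt/mymount/this/is/a/path")

# listing the mount root will typically show the "this" directory entry alongside
# a zero-length placeholder blob the driver created to simulate the folder
for entry in dbutils.fs.ls("/mnt/mymount/"):
    print(entry.name, entry.size)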

DataStage: Reading Excel from a Windows shared folder

I have an Excel file at a shared location in a Windows environment. My DataStage server is on a Unix box. I want to read the Excel file and load the data into a Teradata table. I need help with reading the Excel file. One option is to transfer the file to the server location and access it from there, but can I read the Excel file directly from the shared folder in the Windows environment?
I tried to use FTP first in DataStage, but I am getting the error below.
<FTP_Enterprise_18> Error occurred during initializeFromArgs().
<FTP_Enterprise_18> uri : ftp://server/path/file.xlsx is not valid remote file.
<main_program> Creation of a step finished with status = FAILED.
No - it is not possible to read it from a remote location, so you will need to transfer it first (unless the shared location is a Samba mount on the Unix machine).
You can use the "Unstructured Data" stage to read the Excel file once it is on the Unix server.

Unzip a .GZ file in an Azure Worker process

Can anyone give me an idea of how to implement unzipping of a .gz file in a Worker? If I write the unzipping code, where do I need to store the unzipped file (i.e. one text file)? Will it be placed in some location in Azure, and how can I specify a path in the Windows Azure Worker process, such as the current executing directory? If this approach does not work, do I need to create one more blob to store the unzipped content of the .gz file, i.e. the .txt?
-mahens
In your Worker Role, it is up to you how the .gz file arrives (for example, downloaded from Azure Blob storage); once the file is available, you can use GZipStream to compress/decompress it. You can also find a code sample with Compress and Decompress functions in the linked discussion.
This SO discussion shares a few tools and code to explain how you can unzip .GZ using C#:
Unzipping a .gz file using C#
Next, when you use the Decompress/Compress code in a Worker Role, you can either store the result directly in local storage (as suggested by JcFx) or use a MemoryStream to store it directly in Azure Blob Storage.
The following SO article shows how you can use GZipStream to write unzipped content into a MemoryStream and then use the UploadFromStream() API to store it directly in Azure Blob storage:
How do I use GZipStream with System.IO.MemoryStream?
If you don't need to do anything else with the unzipped file, storing it directly in Azure Blob storage is best; however, if you do have to work with the unzipped content, you can save it locally and then store it back in Azure Blob storage for further usage.
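Purely to illustrate the same decompress-then-upload-from-stream pattern outside of .NET (the Worker Role itself would use GZipStream as described above), here is a rough Python sketch using the azure-storage-blob SDK; the connection string, container, and file names are placeholders:
import gzip
import io
from azure.storage.blob import BlobClient  # assumes the azure-storage-blob package

# decompress a locally available .gz file in memory
with open("input.gz", "rb") as f:          # hypothetical local file
    text_bytes = gzip.decompress(f.read())

# upload the unzipped content straight from an in-memory stream,
# mirroring the MemoryStream + UploadFromStream() approach described above
blob = BlobClient.from_connection_string(
    "<connection-string>",                 # placeholder
    container_name="unzipped",             # hypothetical container
    blob_name="input.txt",
)
blob.upload_blob(io.BytesIO(text_bytes), overwrite=True)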
This example, using SharpZipLib, extracts a .gzip file to a stream. From there, you could write it to Azure local storage, or to blob storage:
http://wiki.sharpdevelop.net/GZip-and-Tar-Samples.ashx
