create a CSV file in ADLS from databricks - azure-databricks

I am creating a CSV file in an ADLS folder.
For example, sample.txt is the file name, but instead of a single file I get a sample.txt/ directory containing part-000* files.
My question is: is there a way to create a sample.txt file instead of a directory in PySpark?
df.write() and df.save() both create a folder with multiple files inside it.
Using coalesce(1) I can combine the multiple part-000* files into one, but how do I end up with a single CSV file?

Unfortunately, Spark does not support writing a single data file without a containing folder.
As a workaround:
First, use coalesce or repartition to write a single part (partition) file.
df\
  .coalesce(1)\
  .write\
  .format("csv")\
  .mode("overwrite")\
  .save("/mydata.csv")
The above example produces a /mydata.csv directory containing a single part-000* file and some hidden metadata files.
All of the data is in that one part file, but its name is not user-friendly, so we rename it and pull it out of the directory.
data_location = "/mydata.csv/"

# List the directory, pick out the single part file ending in .csv,
# move it up one level as /mydata.csv.csv, then delete the directory.
files = dbutils.fs.ls(data_location)
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")
dbutils.fs.rm(data_location, recurse=True)
Finally, set the storage account access key in the Spark configuration so that the ADLS account can be accessed, then copy the renamed file from DBFS to ADLS using dbutils.fs.cp (or dbutils.fs.mv if you want to move it rather than copy it).
```python
storage_account_name = "Storage account name"
storage_account_access_key = "storage account access key"

# For an abfss:// (ADLS Gen2) path, the account key is set against the dfs endpoint
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)

# Copy the single CSV from DBFS into the 'demo12' container of the storage account
dbutils.fs.cp('/mydata.csv.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```
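Alternatively, once the account key is configured you can write directly to an abfss:// path and do the rename there, skipping the intermediate DBFS copy. Below is a minimal sketch under that assumption, reusing the demo12 container and pratikstorage1 account from above; the temporary directory and final file name are just placeholders.
```python
# Write a single part file straight into the ADLS Gen2 container
target_dir = "abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata_tmp"
final_path = "abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv"
df.coalesce(1).write.format("csv").mode("overwrite").option("header", True).save(target_dir)

# Rename the part file to the friendly name, then drop the temporary directory
part_file = [f.path for f in dbutils.fs.ls(target_dir) if f.path.endswith(".csv")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(target_dir, recurse=True)
```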

Related

Data Factory | Copy recursively from multiple subfolders into one folder with same name

Objective: Copy all files from multiple subfolders into one folder with same filenames.
E.g.
Source Root Folder
  20221110/
    AppID1/
      File1.csv
      File2.csv
    AppID2/
      File3.csv
      File4.csv
  20221114/
    AppID3/
      File5.csv
      File6.csv
  ...and so on
Destination Root Folder
  File1.csv
  File2.csv
  File3.csv
  File4.csv
  File5.csv
  File6.csv
Approach 1: Azure Data Factory V2, all datasets selected as binary
GET METADATA - CHILDITEMS
FOR EACH - Childitem
COPY ACTIVITY (RECURSIVE: TRUE, COPY BEHAVIOUR: FLATTEN)
This config renames the files with autogenerated names.
If I change the copy behaviour to preserve hierarchy, both the file names and the folder structure remain intact.
Approach 2
GET METADATA - CHILDITEMS
FOR EACH - Childitems
Execute PL2 (pipeline-level parameter: @item().name)
Get Metadata2 (parameterised from dataset, invoked at pipeline level)
For EACH2 - Childitems
Copy (Source: FolderName - pipeline level, File name - ForEach2)
Neither approach gives the desired output. Any help/workaround would be appreciated.
My understanding of Approach 2:
Steps 3 & 5 are done to iterate through the folders and subfolders, correct?
Step 6: Copy (Source: FolderName - pipeline level, File name - ForEach2)
Since step 6 already has the file name, add a dynamic expression on the SINK side that sets the file name to @Filename, and that should do the trick.
If all of your files are in the same directory level, you can try the below approach.
First use the Get Metadata activity to get the list of all files, and then use a Copy activity inside a ForEach to copy them to a target folder.
These are my source files with directory structure:
Source dataset:
Based on your directory level, use the wildcard placeholder (*/*) in the file path of the source dataset.
If the dataset shows an error here, it is only a warning and can be ignored while debugging.
Get Metadata activity:
This will give the list of all files inside the subfolders.
Pass this array to a ForEach activity and use a Copy activity inside the ForEach.
Copy activity source:
Here also, the */* wildcard should be the same as the one given in the Get Metadata activity.
For the sink dataset, create a dataset parameter and use it in the file path of the dataset.
Copy activity sink:
Files copied to target folder:
If your source files are not at the same directory level, then you can try the recursive approach mentioned in this article by @Richard Swinbank.
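If a notebook-based workaround is also an option, the same flattening (copy every file from all subfolders into one destination folder, keeping the original file names) can be sketched with dbutils; the abfss:// paths below are placeholders and not part of the original pipelines.
```python
# Recursively walk the source tree and copy every file into a single flat folder,
# keeping the original file names (placeholder paths, adjust to your storage)
src_root = "abfss://source@yourstorageaccount.dfs.core.windows.net/SourceRootFolder"
dst_root = "abfss://dest@yourstorageaccount.dfs.core.windows.net/DestinationRootFolder"

def flatten(path):
    for entry in dbutils.fs.ls(path):
        if entry.name.endswith("/"):   # directories are listed with a trailing slash
            flatten(entry.path)
        else:
            dbutils.fs.cp(entry.path, dst_root + "/" + entry.name)

flatten(src_root)
```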

Shell Script to Convert CSV to Text File

I need to create a shell script that reads a different folder based on today's date. That folder contains multiple files, including one tab-delimited CSV file whose name is unique every day. I want to pull this file and resave it as a text file.
Example of file path:
data/model/output20190725 (folder contains multiple files; a new folder is created every day)
-logfile1
-logfile2
-part3983isis4838.csv (this CSV file has a new, randomly generated name every day, and is tab delimited)
I know how to go from a CSV file to a text file, but I don't know how to add the logic of the folder name and the CSV name changing every day.
I saw that I could possibly use grep, but I don't know how to navigate to today's date folder, pull the CSV, and pass it to the next argument to make the conversion.
grep -l .csv * |
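The key pieces are building the folder name from today's date and globbing for the single CSV inside it. Here is that logic sketched in Python (the data/model/output root and the .txt naming are taken from the question; everything else is an assumption, and the same idea translates directly to a shell script using date +%Y%m%d and a wildcard):
```python
import glob
import shutil
from datetime import date

# Build today's folder name, e.g. data/model/output20190725
folder = "data/model/output" + date.today().strftime("%Y%m%d")

# There is exactly one CSV in the folder, but its name is random, so glob for it
csv_path = glob.glob(folder + "/*.csv")[0]

# The file is already tab-delimited text, so converting it is just a copy with a .txt name
shutil.copyfile(csv_path, csv_path[:-4] + ".txt")
```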

informatica Post command task

I am working with multiple source files through a single source instance. I created three flat files and one destination table to experiment with multiple sources. I am using the 'file list' concept, and for that I created a text file which contains all the flat file names.
Example:
Filename : File_list.txt
File content : Price1.txt
Price2.txt
Price3.txt
In the above example Price1.txt, Price2.txt and Price3.txt are flat file names. I specified File_list.txt as the source file while running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now, once the data is inserted into the destination, I need to delete those source files from that directory location.
How can I achieve this?
You'll need to write a custom script that uses File_list.txt as input and performs the delete operations. You can then call it using the Post-Session Success Command session component, or as a separate Command Task in the workflow linked using a $YourSessionName.Status = SUCCEEDED condition.
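As an illustration, here is a minimal sketch of such a cleanup script in Python; the file-list path and source directory are assumptions, and the same logic could just as well be a shell one-liner invoked from the command task.
```python
import os

# Assumed locations; point these at the session's source file directory
file_list = "/path/to/SrcFiles/File_list.txt"
src_dir = "/path/to/SrcFiles"

# Read the file list and delete each flat file it names
with open(file_list) as fl:
    for line in fl:
        name = line.strip()
        if name:
            target = os.path.join(src_dir, name)
            if os.path.exists(target):
                os.remove(target)
```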

how to load multiple text files in a folder in pig using load command?

I have been using this to load one text file:
A = LOAD '1try.txt' USING PigStorage(' ') as (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
You can use a folder name instead of a file name, like this:
A = LOAD 'myfolder' USING PigStorage(' ')
AS (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
Pig will load all files in the specified folder, as stated in Programming Pig:
When specifying a “file” to read from HDFS, you can specify directories. In this case, Pig will find all files under the directory you specify and use them as input for that load statement. So, if you had a directory input with two datafiles today and yesterday under it, and you specified input as your file to load, Pig will read both today and yesterday as input. If the directory you specify has other directories, files in those directories will be included as well.
Here is the link to the official pig documentation that indicates that you can use the load statement to load all the files in a directory:
http://pig.apache.org/docs/r0.14.0/basic.html#load
Syntax: LOAD 'data' [USING function] [AS schema];
Where: 'data': The name of the file or directory, in single quotes. If you specify a directory name, all the files in the directory are loaded.
data = load '/FOLDER/PATH' using PigStorage(' ') AS (<name> <type>, ..);
OR
data = load '/FOLDER/PATH' using HBaseStorage();

Where do the files created with File.new actually get stored in Ruby?

I am creating files from within Ruby scripts and adding stuff to them. But where are these files stored that I am creating?
I'm very new to this, sorry!
The files are created at whatever location you specified. For instance:
f = File.new("another_test.txt","w+")
That will create the file in the current working directory. To put it somewhere else, specify the path along with the file name. For example:
f = File.new(File.expand_path("~/Desktop/another_test.txt"), "w+") # will create the file on the desktop
For more details, check the File documentation.
Updated:
Included mu is too short's correction (use File.expand_path so that the ~ is resolved).
