Data Factory | Copy recursively from multiple subfolders into one folder with the same file names - azure-blob-storage

Objective: Copy all files from multiple subfolders into one destination folder, keeping the same file names.
E.g.

Source Root Folder
    20221110/
        AppID1/
            File1.csv
            File2.csv
        AppID2/
            File3.csv
            File4.csv
    20221114/
        AppID3/
            File5.csv
            File6.csv
    ...and so on

Destination Root Folder
    File1.csv
    File2.csv
    File3.csv
    File4.csv
    File5.csv
    File6.csv
Approach 1 (Azure Data Factory V2, all datasets selected as Binary):
Get Metadata - Child Items
ForEach - over the child items
Copy activity (Recursive: true, Copy behaviour: Flatten hierarchy)
This configuration renames the files with autogenerated names.
If I change the copy behaviour to Preserve hierarchy, both the file names and the folder structure remain intact.
Approach 2:
Get Metadata - Child Items
ForEach - over the child items
Execute Pipeline PL2 (pipeline-level parameter: @item().name)
Get Metadata2 (parameterised dataset, invoked at pipeline level)
ForEach2 - over the child items
Copy (source: folder name from the pipeline parameter, file name from ForEach2)
Neither approach gives the desired output. Any help/workaround would be appreciated.

My understanding of Approach 2:
Steps 3 and 5 are done to iterate through the folders and subfolders, correct?
Step 6: Copy (source: folder name from the pipeline parameter, file name from ForEach2).
I think that since in step 6 you already have the file name, you can add a dynamic expression on the sink side that sets the sink file name to that value, and that should do the trick (see the sketch below).
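For illustration only (the parameter name here is assumed, not from the original post): the sink dataset could define a parameter such as FileName and use the expression @dataset().FileName in its file-name field; the copy activity's sink then passes @item().name from ForEach2 into that parameter, so every file lands directly in the destination folder under its original name.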

If all of your files are at the same directory level, you can try the approach below.
First use a Get Metadata activity to get the list of all files, and then use a Copy activity inside a ForEach to copy them to the target folder.
These are my source files with directory structure:
Source dataset:
Based on your directory level, use the wildcard placeholder (*/*) in the source dataset's file path.
(The error shown above is only a warning, and we can ignore it while debugging.)
Get Metadata activity:
This will give the list of all files inside the subfolders.
Pass this array to a ForEach activity, and inside the ForEach use a Copy activity.
Copy activity source:
Here also, the */* wildcard should be the same as the one given in the Get Metadata activity.
For the sink dataset, create a dataset parameter and use it in the file path of the dataset.
Copy activity sink:
Files copied to target folder:
If your source files are not at the same directory level, you can try the recursive approach mentioned in this article by Richard Swinbank.
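If a small script outside Data Factory is an acceptable workaround, the same flatten-and-copy logic can also be sketched with the azure-storage-blob SDK. This is only a minimal sketch, assuming connection-string authentication and hypothetical container and folder names; for private containers the copy source URL would also need a SAS token.
```python
from azure.storage.blob import BlobServiceClient

# Hypothetical names - replace the connection string, containers and folder prefixes with your own.
service = BlobServiceClient.from_connection_string("<connection-string>")
src = service.get_container_client("source-container")
dst = service.get_container_client("destination-container")

for blob in src.list_blobs(name_starts_with="SourceRootFolder/"):
    file_name = blob.name.rsplit("/", 1)[-1]   # drop the folder path, keep only the file name
    if not file_name:                           # skip folder placeholder blobs
        continue
    source_url = src.get_blob_client(blob.name).url  # append a SAS token here for private containers
    dst.get_blob_client(f"DestinationRootFolder/{file_name}").start_copy_from_url(source_url)
```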

Related

create a CSV file in ADLS from databricks

I am creating a CSV file in an ADLS folder.
For example, sample.txt is the file name; instead of a single file, I see a sample.txt/ directory containing part-000 files.
My question is: is there a method to create a sample.txt file instead of a directory in PySpark?
df.write() and df.save() both create a folder with multiple files inside that directory.
Using coalesce(1) I can combine the multiple part-000 files into one file, but how do I create a single CSV file?
Unfortunately, Spark doesn't support creating a data file without a folder.
As a workaround, first use coalesce or repartition to create a single part (partition) file.
df\
.coalesce(1)\
.write\
.format("csv")\
.mode("overwrite")\
.save("/mydata.csv")
The above example produces a /mydata.csv directory containing a single part-000* file and some hidden files.
However, our data is contained in only one CSV file inside that directory, and its name is not user-friendly. We can rename this file and move it out of the directory:
data_location = "/mydata.csv/"
files = dbutils.fs.ls(data_location)                                  # list the files Spark wrote
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]      # the single part-000* csv file
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")           # rename/move it to /mydata.csv.csv
dbutils.fs.rm(data_location, recurse = True)                          # remove the leftover Spark output folder
Then set up the account key and configure access to the storage account, and move the file from the Databricks file system to ADLS using dbutils.fs.cp (or dbutils.fs.mv):
```python
storage_account_name = "Storage account name"
storage_account_access_key = "storage account access key"
# Authenticate with the account key; the abfss path below goes through the dfs endpoint
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)
# Copy the single CSV from DBFS to the ADLS container (container@storage-account)
dbutils.fs.cp('/mydata.csv.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```

rsync diff but with filter on certain file extension only

I am trying to perform a diff on sub-directories, including only files with a certain extension.
Below is my folder structure
GRADES
    test1
        file1.pdf
        file2.pdf
    test2
        T60_Hello.xml
    file123.pdf
    T60_Hello.xml
    T61_Hello.xml
    test3   (this folder would contain everything except files with extension .xml)
What I want to do: when I perform the diff from GRADES (src) to test3 (dest), I want to include only files matching the following filter - T60_*.xml.
Below is my script -
rsync -rvcm --include='T60_*.xml' /Users/Desktop/GRADES/* /Users/Desktop/GRADES/test3
At the moment I am getting all of the files that are not in the test3 folder.
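A likely fix (assuming standard rsync filter semantics; not tested against this exact tree): with only an --include rule and no --exclude rule, nothing is filtered out, so everything gets transferred. Including directories, then the wanted pattern, and excluding everything else restricts the copy to the T60_*.xml files, for example:
rsync -rvcm --exclude='test3/' --include='*/' --include='T60_*.xml' --exclude='*' /Users/Desktop/GRADES/ /Users/Desktop/GRADES/test3
The leading --exclude='test3/' keeps the destination folder (which lives inside the source) from being copied into itself, and -m prunes directories that end up empty.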

Informatica Post command task

I am working with multiple source files and a single source instance. To experiment with multiple sources, I created three flat files and one destination table. I am using the 'file list' concept, so I created a text file that contains all the flat file names.
Example:
File name: File_list.txt
File content:
Price1.txt
Price2.txt
Price3.txt
In the above example, Price1.txt, Price2.txt and Price3.txt are the flat file names. I specified File_list.txt as the source file while running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now what I want to do is: once the data is inserted into the destination, I need to delete those source files from that directory location.
How can I achieve this?
You'll need to write a custom script that will use the File_list.txt as input and perform the delete operations. You can then call it using Post-Session Success Command session component, or as a separate Command Task in the workflow linked using a $YourSessionName.Status = SUCCEEDED condition.
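A minimal sketch of such a script (the paths and list-file location are assumptions, not from the original post), reading File_list.txt and deleting each listed flat file, to be invoked from the Post-Session Success Command or the Command Task:
```python
import os

# Assumed location of the file list; point this at the session's source file directory.
list_file = "/path/to/SrcFiles/File_list.txt"
src_dir = os.path.dirname(list_file)

with open(list_file) as f:
    for line in f:
        name = line.strip()
        if not name:
            continue                        # skip blank lines
        path = os.path.join(src_dir, name)
        if os.path.exists(path):
            os.remove(path)                 # delete the flat file after a successful load
```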

Merge PDF files from multiple folders with same filename

I am looking to merge PDF files from two separate folders into a third folder, based on file name.
Directory structure:
FOLDER_1 = File set #1.
FOLDER_2 = File set #2.
MERGED_PDFS = Output of merged files.
FOLDER_1 contains a set of PDF files which could be named with any combination of letters, numbers and allowed symbols.
FOLDER_2 contains a set of PDFs with the exact same names as FOLDER_1. The data on these sheets is different. The files from FOLDER_2 need to be inserted into the files from FOLDER_1, at the end of the file.
The merged output file will be placed in the MERGED_PDFS folder, retaining the name used to match the files in FOLDER_1 and FOLDER_2.
Example:
FOLDER_1: R000135322.PDF
FOLDER_2: R000135322.PDF
MERGED_PDFS: R000135322.PDF
(MERGED_PDFS contains a merged PDF from FOLDER_1 & FOLDER_2, with the PDF from FOLDER_2 placed at the end of the PDF from FOLDER_1.)
I saw some similar examples of this being done with PDFtk, but I am unsure how to edit them to get my expected output.
Thanks
Here's what you need to do:
Install FolderMill
Specify the Incoming folder and the Output folder for FolderMill on your PC
Since you mention that files in FOLDER_1 and files in FOLDER_2 have the same filenames, just add "Convert to PDF" action and select Multipage: "Append pages to existing document" in the options.
Click Apply changes
Start FolderMill by pressing the Play button.
Grab the files from FOLDER_1 and put them into the Incoming folder
Grab the files from FOLDER_2 and do the same.
Receive the merged PDFs from the Output folder
If you are not sure whether all the corresponding files have the same filenames, you may also need to use the "Rename" action.
FYI, we have a detailed step-by-step guide on how to do it (with screenshots).
You are welcome :)
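If a scripted route (along the lines of the PDFtk idea mentioned in the question) is preferred, here is a minimal sketch using the pypdf package; the folder names follow the question, everything else is an assumption:
```python
from pathlib import Path
from pypdf import PdfWriter

folder_1 = Path("FOLDER_1")
folder_2 = Path("FOLDER_2")
merged = Path("MERGED_PDFS")
merged.mkdir(exist_ok=True)

for pdf_1 in folder_1.iterdir():
    if pdf_1.suffix.lower() != ".pdf":
        continue
    pdf_2 = folder_2 / pdf_1.name          # counterpart with the exact same file name
    if not pdf_2.exists():
        continue                           # no matching file in FOLDER_2
    writer = PdfWriter()
    writer.append(str(pdf_1))              # FOLDER_1 pages first
    writer.append(str(pdf_2))              # FOLDER_2 pages appended at the end
    with open(merged / pdf_1.name, "wb") as out:
        writer.write(out)
```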

How to restore a folder structure 7Zip'd with split volume option?

I 7-Zipped a multi-gig folder, which contained many folders each with many files, using the split-to-volumes (9 MB) option. 7-Zip created files of type .zip.001, .zip.002, etc. When I extract .001 it appears to work correctly, but I get an 'unexpected end of data' error, and 7-Zip does not automatically go on to .002. When I extract .002, it gives the same error and does not continue the original folder/file structure; instead it extracts a zip file into the same folder as the previously extracted files. How do I properly extract split files to obtain the original folder/file structure? Thank you.
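For what it's worth (a suggestion based on how 7-Zip split archives normally behave, not taken from the original thread): all of the .zip.001, .zip.002, ... parts need to sit in the same folder, and extraction should be started from the .001 volume only; 7-Zip then reads the remaining parts itself, e.g. from the command line (archive name assumed): 7z x backup.zip.001. Extracting later parts directly, or extracting with a part missing or renamed, produces exactly the 'unexpected end of data' behaviour described.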
