I am new to Azure Data Factory and I am trying to solve a particular issue I have when copying files from FTP to an Azure storage account. I have to copy files from a source folder to a target folder. The files in the source folder are of different formats (csv, txt, xml), but when I execute the pipeline all files are copied with a .txt extension.
I am not sure how I can copy the files and keep their original format
Any help?
Note: I think the problem is the output file name that I didn't specify in the output dataset, but when I specify it by setting the file name to @item().name, it is the name of the folder containing the files that gets used, not the file names. Any solution for that?
Use a Get Metadata activity to get the list of folders from the path and a ForEach activity to loop through the folders and copy the files to the sink.
Use binary datasets for both source and sink to copy the files.
Use Get Metadata to get the list of folders. You can parameterize the path or hardcode it.
Output of the Get Metadata activity:
Connect the Get Metadata activity to the ForEach activity and pass the childItems to its items list.
@activity('Get Metadata1').output.childItems
Add a Copy data activity inside the ForEach loop and build the folder path dynamically by concatenating the source dataset path with the current item of the ForEach loop.
@concat(pipeline().parameters.folderpath, '/', item().name, '/')
In the sink, connect it to the binary dataset.
When you run the pipeline, you can see all the files copied from source to sink irrespective of their extensions.
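Wired together, a simplified version of that pipeline could look roughly like the JSON below. This is only a sketch: the pipeline name, the dataset names SourceBinaryDataset and SinkBinaryDataset, and the folderpath dataset parameter are assumptions, so map them to your own datasets and linked services. The source binary dataset is assumed to expose a folderpath parameter and use @dataset().folderpath in its folder path, so each ForEach iteration copies one subfolder's files as-is.

```json
{
    "name": "CopyAllFilesPipeline",
    "properties": {
        "parameters": {
            "folderpath": { "type": "string" }
        },
        "activities": [
            {
                "name": "Get Metadata1",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "SourceBinaryDataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "folderpath": {
                                "value": "@pipeline().parameters.folderpath",
                                "type": "Expression"
                            }
                        }
                    },
                    "fieldList": [ "childItems" ]
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [
                    { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "items": {
                        "value": "@activity('Get Metadata1').output.childItems",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "Copy data1",
                            "type": "Copy",
                            "typeProperties": {
                                "source": { "type": "BinarySource" },
                                "sink": { "type": "BinarySink" }
                            },
                            "inputs": [
                                {
                                    "referenceName": "SourceBinaryDataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "folderpath": {
                                            "value": "@concat(pipeline().parameters.folderpath, '/', item().name, '/')",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ],
                            "outputs": [
                                { "referenceName": "SinkBinaryDataset", "type": "DatasetReference" }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}
```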
Try using the binary format.
https://learn.microsoft.com/en-us/azure/data-factory/format-binary.
You can use the Binary dataset in the Copy activity, GetMetadata activity, or Delete activity. When using a Binary dataset, ADF does not parse the file content but treats it as-is.
...
Below is an example of Binary dataset on Azure Blob Storage:
{
    "name": "BinaryDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "compression": {
                "type": "ZipDeflate"
            }
        }
    }
}
Objective: Copy all files from multiple subfolders into one folder with same filenames.
E.g.
Source Root Folder
    20221110/
        AppID1/
            File1.csv
            File2.csv
        AppID2/
            File3.csv
            File4.csv
    20221114/
        AppID3/
            File5.csv
            File6.csv
    ... and so on
Destination Root Folder
    File1.csv
    File2.csv
    File3.csv
    File4.csv
    File5.csv
    File6.csv
Approach 1 (Azure Data Factory V2, all datasets selected as binary)
GET METADATA - CHILDITEMS
FOR EACH - Childitem
COPY ACTIVITY(RECURSIVE : TRUE, COPY BEHAVIOUR: FLATTEN)
This config renames the files with autogenerated names.
If I change the copy behaviour to preserve hierarchy, both the file names and the folder structure remain intact.
Approach 2
GET METADATA - CHILDITEMS
FOR EACH - Childitems
Execute PL2 (pipeline-level parameter: @item().name)
Get Metadata2 (Parameterised from dataset, invoked at pipeline level)
For EACH2- Childitems
Copy (Source: FolderName - Pipeline level, File name - ForEach2)
Neither approach gives the desired output. Any help/workaround would be appreciated.
My understanding is that in Option 2,
Steps 3 & 5 are done to iterate through the folders and subfolders, correct?
Step 6: Copy (Source: FolderName - pipeline level, File name - ForEach2)
I think since in step 6 you already have the file name, on the SINK side add a dynamic expression with @Filename, and that should do the trick.
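As a rough illustration of that sink-side expression, the sink binary dataset could expose a Filename parameter and reference it in its location. The sketch below assumes Blob storage, a hypothetical destinationcontainer, and a dataset named SinkBinaryDataset; adjust these to your environment.

```json
{
    "name": "SinkBinaryDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "<destination linked service name>",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "Filename": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "destinationcontainer",
                "fileName": {
                    "value": "@dataset().Filename",
                    "type": "Expression"
                }
            }
        }
    }
}
```

The Copy activity's sink dataset reference would then pass the current file name from ForEach2 (for example @item().name) into that Filename parameter.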
If all of your files are at the same directory level, you can try the approach below.
First use a Get Metadata activity to get the list of all files, and then use a Copy activity inside a ForEach to copy them to a target folder.
These are my source files with their directory structure:
Source dataset:
Based on your directory level, use the wildcard placeholder (*/*) in the source dataset.
The above error is only a warning, and we can ignore it while debugging.
Get Metadata activity:
This will give the list of all files inside the subfolders.
Give this array to a ForEach activity, and inside the ForEach use a Copy activity.
Copy activity source:
Here also, the */* should be the same as we gave in the Get Metadata activity.
For the sink dataset, create a dataset parameter and use it in the file path of the dataset.
Copy activity sink:
Files copied to the target folder:
If your source files are not at the same directory level, you can try the recursive approach mentioned in this article by Richard Swinbank.
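For reference, the Copy activity source with that wildcard path might look roughly like the fragment below. This is a sketch: the DelimitedText format, the Blob read settings, and the use of @item().name to pick the current file are assumptions, so adjust the types and expressions to your actual store, format, and ForEach items.

```json
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFolderPath": "*/*",
        "wildcardFileName": {
            "value": "@item().name",
            "type": "Expression"
        }
    },
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    }
}
```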
I am creating a CSV file in an ADLS folder.
For example: sample.txt is the file name
Instead of a single file, I see sample.txt created as a directory with part-000 files inside it.
My question is: is there a method to create a sample.txt file instead of a directory in PySpark?
Both df.write() and df.save() create a folder and multiple files inside that directory.
Using coalesce(1) I can combine the multiple part-000 files into one file, but how do I create a single CSV file?
Unfortunately, Spark doesn't support creating a data file without a folder.
As a workaround:
First, using coalesce or repartition, create a single part (partition) file.
# write the DataFrame as a single partition; Spark still creates a directory
df.coalesce(1) \
    .write \
    .format("csv") \
    .mode("overwrite") \
    .save("/mydata")
The above example produces a /mydata directory containing a part-000* file and some hidden files.
However, our data is contained in only one CSV part file, and its name is not user-friendly. We can move this file out of the directory and rename it.
# locate the single CSV part file inside the output directory
data_location = "/mydata/"
files = dbutils.fs.ls(data_location)
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
# move it next to the directory as /mydata.csv, then remove the directory
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")
dbutils.fs.rm(data_location, recurse=True)
Set up the account key and configure the storage account for access, then copy the file from the Databricks location to ADLS. To copy the file, we use dbutils.fs.cp:
```python
# access the storage account with the account key (abfss uses the dfs endpoint)
storage_account_name = "Storage account name"
storage_account_access_key = "storage account access key"
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)

# copy the single CSV file to the ADLS container
dbutils.fs.cp('/mydata.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```
My Execution:
Output:
I have multiple folders in a single directory, and each folder (named per SiteId) contains .orig files. I want to copy the most recently created .orig file from each of these directories into a single directory.
The files are received from FTP into each directory on a weekly basis.
Filename format:
PSAN{SiteId}_20190312_TO_20190318_AT_201903191600.txt
I tried to copy the files using a Windows command:
for /R "source" %f in (*.orig) do copy %f "destination"
Using this command I am able to copy all .orig files from multiple folders into a single folder, but it fetches every file.
Can the command be modified to fetch only the latest file?
How can I perform this task using an SSIS package or using cmd?
Create two package variables called InputFileNameWithPath and OutputFolderPath. The former will be set dynamically in the package; the latter should be hardcoded to your destination folder path.
Create a Foreach Loop container with a Foreach File enumerator. Set the path to the root folder and select the check box for traversing subfolders. You may want to add an expression for the FileMask and set it to *.orig so that only the .orig files are moved (assuming there are other files in the directories that you do not want to move). Assign the fully qualified path to the package variable InputFileNameWithPath.
Then add a File System Task inside the Foreach Loop and have it move the file from InputFileNameWithPath to OutputFolderPath.
You can use a Script Task to achieve that: simply store the folders in an array, loop over these folders, get the latest created file in each, and copy it to the destination directory (you can even achieve this without SSIS). For example:
using System;
using System.Linq;

string[] sourcefolders = new string[] { @"C:\New Folder1", @"C:\New Folder2" };
string destinationfolder = @"D:\Destination";

foreach (string srcFolder in sourcefolders)
{
    // pick the most recently written .orig file in this folder
    string latestfile = System.IO.Directory.GetFiles(srcFolder, "*.orig")
        .OrderByDescending(x => System.IO.File.GetLastWriteTime(x))
        .FirstOrDefault();

    if (!String.IsNullOrWhiteSpace(latestfile))
    {
        System.IO.File.Copy(latestfile, System.IO.Path.Combine(destinationfolder, System.IO.Path.GetFileName(latestfile)));
    }
}
An example C# Script Task for this is below. It uses SSIS variables for the source and destination folders, so make sure to add these to the ReadOnlyVariables field if that's how you're storing them, and be sure that you have the references listed below as well. If a file with the same name already exists in the destination, it will be overwritten to avoid an error; the third parameter of File.Copy controls this and can be omitted (it is not required), so confirm that overwriting meets your requirements.
using System.Linq;
using System.IO;
using System.Collections.Generic;
string sourceFolder = Dts.Variables["User::SourceDirectory"].Value.ToString();
string destinationFolder = Dts.Variables["User::DestinationDirectory"].Value.ToString();
Dictionary<string, string> files = new Dictionary<string, string>();
DirectoryInfo sourceDirectoryInfo = new DirectoryInfo(sourceFolder);
foreach (DirectoryInfo di in sourceDirectoryInfo.EnumerateDirectories())
{
//get most recently created .orig file
FileInfo fi = di.EnumerateFiles().Where(e => e.Extension == ".orig")
.OrderByDescending(f => f.CreationTime).First();
//add the full file path (FullName) as the key
//and the name only (Name) as the value
files.Add(fi.FullName, fi.Name);
}
foreach (var v in files)
{
//copy using key (full path) to destination folder
//by combining destination folder path with the file name
//true- overwrite file if file by same name exists
File.Copy(v.Key, Path.Combine(destinationFolder, v.Value), true);
}
I 7-Zipped a multi-gig folder, which contained many folders each with many files, using the split-to-volumes (9 MB) option. 7-Zip created files of type .zip.001,
.zip.002, etc. When I extract .001 it appears to work correctly, but I get an 'unexpected end of data' error, and 7-Zip does not automatically go on to .002. When I extract .002, it also gives the same error, and it does not continue the original folder/file structure; instead it extracts a zip file into the same folder as the previously extracted files. How do I properly extract split files to obtain the original folder/file structure? Thank you.
I've just spent a few hours struggling to do a very simple task: enumerating zip file contents using C# and System.IO.Compression.
When I have zip files with small files inside, all is good.
When I use a 1.2 GB zip file which has a 4.8 GB database file inside, I get "Number of entries expected in End Of Central Directory does not correspond to number of entries in Central Directory." I've read a lot, and I can't seem to find a way to work with my archive.
The code is:
string zipPath = @"E:\Path\Filename.zip";
using (ZipArchive archive = ZipFile.OpenRead(zipPath))
{
foreach (ZipArchiveEntry entry in archive.Entries)
{
Console.WriteLine(entry.FullName);
}
}
Is there a way, any way, to use System.IO.Compression with large zip file contents? All I want is to enumerate the contents of the archive, and I do not want to use any 3rd-party bits, as was suggested in other places.