Transfer CSV files from azure blob storage to azure SQL database using azure data factory - azure-blob-storage

I need to transfer around 20 CSV files inside a folder named ActivityPointer in an azure blob storage container to Azure SQL database in a single data factory pipeline, but ActivityPointer contains 20 CSV files and another folder named snapshots inside it. So when I try to create a pipeline and give * to select all the CSV files inside ActivityPointer it includes the snapshots folder too, which should not be included. Is there any possibilities to complete this task. Also I can't create another folder to transform the snapshots folder into it. What can I do now? Anyone can please help me out.

Assuming you want to copy all CSV files within ACtivityPointer folder,
You can use wildcard expression as below :
you can provide path till Active folder and than *.csv

Copy data is also considering the inner folder while using wildcards (even if we use .csv in wildcard file path). So, we have to validate whether it is a file or folder. Please look at the following demonstration.
First use Get Metadata on the required folder with field list as Child items. The debug output will be:
Now use this to iterate through child items using For each activity.
#activity('Get Metadata1').output.childItems
Inside for each, use if condition activity to check whether the current item is a file or not. Use the following condition.
#equals(item().type,'File')
When this is true, you can use copy data to complete copying the file to target table (Ignore the false case). I have create file_name parameter in my source dataset passing its value as #item().name().
This will help you to achieve your requirement. The following is the debug output. I have 4 files and 1 folder. The folder will be ignored, and the rest will be copied into the target table.

Related

Load new files only from FTP to BLOB Azure data factory

I am trying to copy files from an FTP to Blob , the probleme is that my pipeline copies all files including the old ones. I would like to do an incremental load by only copying new files. how do U configure this. BTW in my FTP dataset the parameters ModifiedStartDate and ModifiedEndDate are not showing. I would also like to configure theses dates dynamically
Thank you!
There's some work to be done in Azure Data Factory to get this to work. What you're trying to do, if I understand correctly, is to Incrementally Load New Files in Azure Data Factory. You can do so by looking up the latest modified date in the destination folder.
In short (see the above linked article for more information):
Use Get Metadata activity to make a list of all files in the Destination folder
Use For Each activity to iterate this list and compare the modified date with the value stored in a variable
If the value is greater than that of the variable, update the variable with that new value
Use the variable in the Copy Activity’s Filter by Last Modified field to filter out all files that have already been copied

How do I add multiple csv files to the catalog in kedro

I have 4 csv files in Azure blob storage, with same metadata, that i want to process. How can i add them to the datacatalog with a single name in Kedro.
I checked this question
https://stackoverflow.com/questions/61645397/how-do-i-add-many-csv-files-to-the-catalog-in-kedro
But this seems to load all the files in the given folder.
But my requirement is to read only given 4 from many files in the azure container.
Example:
I have many files in azure container in which are 4 transaction csv files with names sales_<date_from>_<date_to>.csv, i want to load these 4 transaction csv files into kedro datacatalog under one dataset.
For starters, PartitionedDataSet is lazy, meaning that files are not actually loaded until you explicitly call that function. Even if you have 100 CSV files that get picked up by the PartitionedDataSet, you can select the partitions that you actually load/work with.
Second, what distinguishes these 4 files from the others? If they have a unique suffix, you can use the filename_suffix option to just select them. For example, if you have:
file_i_dont_care_about.csv
first_file_i_care_about.csv
second_file_i_care_about.csv
third_file_i_care_about.csv
fourth_file_i_care_about.csv
you can specify filepath_suffix: _file_i_care_about.csv.
Don’t think there’s a direct way to do this , you can add another subdirectory inside the blob storage with the 4 files and then use
my_partitioned_dataset:
type: "PartitionedDataSet"
path: "data/01_raw/subdirectory/"
dataset: "pandas.CSVDataSet"
Or in case the requirement of using only 4 files is not going to change anytime soon ,you might as well pass 4 files in the catalog.yml separately to avoid over engineering it.

Azure Data Factory copy activity creates empty files

Whenever I use ADF copy activity with Blob as source/sink, ADF creates an empty file named after the directory of the sink Blob.
For instance, if I want to copy from input/file.csv to process/file.csv, the copy happens but I also have a blob called "process" with size 0 byte created each time.
Any idea why?
Source
Sink
Firstly, I would suggest you optimize you pipeline copy active settings.
Since you are copying one file from one container/folder to another, you can directly set the source file with parameter. Wildcard path expression *.csv is usually used for folder the same type of files.
You can test again and check if the empty file exist again.
HTH.
This happens if you have a storage ADLS gen2 but you have not enabled the Hierarchical namespace and you select the ADLS gen2 while defining your Linked Service and Dataset. A quick fix for this is use Azure Blob Storage when defining LS and DS.

How to update data in CSV files in oracle stored procedure

Create a stored procedure that will read the .csv file from oracle server path using read file operation, query the data in some X table and write the output in .csv file.
here after read .csv file, compare .csv file data with table data and need to update few columns in .csv file.
Oracle works best with data in the database. UPDATE is one of the most frequently used commands.
But, modifying a file which resides in some directory seems to be somewhat out of scope. There are other programming languages you should use, I believe. However, if a hammer is the only tool you have, every problem looks like a nail.
I can think of two options.
One is to load file into the database. Use SQL*Loader to do that if file resides on your PC, or - if you have access to the database server and DBA granted you read/write privileges on a directory (an Oracle object which points to a filesystem directory) - use it as an external table. Once you load data, modify it and export it back (i.e. create a new CSV file) using spool.
Another option is to use UTL_FILE package. It also requires access to the database server's directory. Using the A(ppend) option, you can add rows to the original file, but I don't think that you can edit it so this option - at the end - finishes like the previous one - with creating a new file (but this time using UTL_FILE).
Conclusion? Don't use a database management system to modify files. Use another tool.

Find Replace in .txt files in a directory

I have a lot of script files (all files have SQL or Trans SQL) say around 2000 files that we used to add with the passage of the time. Now the problem is that in most of the scripts the owner of the database object is not defined and due to my current requirement I must have to set the owner of the database object. How can I update all the script files easily.
One approach is to open each file look for create word and replace/append the owner name before the create key word. I want to do it for all the script files in the directory.
Any ideas?
I want something like find replace.
Many thanks in advance.

Resources