I'm copying data from an Oracle DB to ADLS using a copy activity of Azure Data Factory.
The result of this copy is a Parquet file that contains the same data as the table I copied, but the resulting Parquet file gets a name like this:
data_32ecaf24-00fd-42d4-9bcb-8bb6780ae152_7742c97c-4a89-4133-93ea-af2eb7b7083f.parquet
I need the file to be named like this instead:
TableName-Timestamp.parquet
How can I do that with Azure Data Factory?
Another question: is there a way to add hierarchy when this file is being written? For example, I use the same pipeline for writing several tables and I want to create a new folder for each table. I can do that if I create a new Dataset for each table to write, but I want to know if there is a way to do that automatically (using dynamic content).
Thanks in advance.
You could set a pipeline parameter to achieve it.
Here's an example where I copied data from an Azure SQL database to ADLS; it should also work for Oracle to ADLS.
Set a pipeline parameter: the Azure SQL/Oracle table name that needs to be copied to ADLS:
Source dataset:
Add dynamic content to set table name:
Source:
Add dynamic content: set table name with pipeline parameter:
Sink dataset:
Add dynamic content to set Parquet file name:
Sink:
Add dynamic content to set Parquet file name with pipeline parameter:
Format: TableName-Timestamp.parquet:
@concat(pipeline().parameters.tablename,'-',utcnow())
Then execute the pipeline and you will get a Parquet file named like TableName-Timestamp.parquet:
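If you also want the .parquet extension and a file-name-friendly timestamp in the same expression, something like this should work too (formatDateTime with a custom format string; adjust to taste):
@concat(pipeline().parameters.tablename,'-',formatDateTime(utcnow(),'yyyyMMddHHmmss'),'.parquet')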
About your other question:
You could add dynamic content to set the folder name for each table; just follow this:
For example, if we copy the table "test", the result we will get is:
container/test/test-2020-04-20T02:01:36.3679489Z.parquet
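In text form, the sink dataset's directory and file name could be set with expressions like these (a sketch reusing the same tablename pipeline parameter; the container itself is configured on the dataset):
Directory: @pipeline().parameters.tablename
File name: @concat(pipeline().parameters.tablename,'-',utcnow(),'.parquet')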
Hope this helps.
I need to transfer around 20 CSV files inside a folder named ActivityPointer in an Azure Blob Storage container to an Azure SQL database in a single Data Factory pipeline. The problem is that ActivityPointer contains the 20 CSV files and also another folder named snapshots. So when I create a pipeline and use * to select all the CSV files inside ActivityPointer, the snapshots folder is included too, which it should not be. Is there any way to accomplish this? I also can't create another folder to move the snapshots folder into. What can I do now? Can anyone please help me out?
Assuming you want to copy all CSV files within the ActivityPointer folder, you can use a wildcard expression as below:
provide the path up to the ActivityPointer folder and then *.csv.
The Copy data activity also considers the inner folder when using wildcards (even if we use *.csv in the wildcard file path). So, we have to validate whether each item is a file or a folder. Please look at the following demonstration.
First, use a Get Metadata activity on the required folder with the field list set to Child items. The debug output will be:
Now iterate through the child items using a ForEach activity, with the items set to:
@activity('Get Metadata1').output.childItems
Inside the ForEach, use an If Condition activity to check whether the current item is a file or not. Use the following condition:
@equals(item().type,'File')
When this is true, you can use a Copy data activity to copy the file to the target table (ignore the False case). I have created a file_name parameter in my source dataset and pass its value as @item().name.
This will help you to achieve your requirement. The following is the debug output. I have 4 files and 1 folder. The folder will be ignored, and the rest will be copied into the target table.
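To tie the pieces together (a rough sketch, using the file_name parameter mentioned above): in the source dataset, the file name field would reference the dataset parameter as @dataset().file_name, and the Copy data activity passes file_name = @item().name, so each iteration of the loop copies exactly one file.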
I am trying to copy files from an FTP to Blob; the problem is that my pipeline copies all files, including the old ones. I would like to do an incremental load by only copying new files. How do you configure this? By the way, in my FTP dataset the parameters ModifiedStartDate and ModifiedEndDate are not showing. I would also like to configure these dates dynamically.
Thank you!
There's some work to be done in Azure Data Factory to get this to work. What you're trying to do, if I understand correctly, is to Incrementally Load New Files in Azure Data Factory. You can do so by looking up the latest modified date in the destination folder.
In short (see the above linked article for more information):
Use Get Metadata activity to make a list of all files in the Destination folder
Use For Each activity to iterate this list and compare the modified date with the value stored in a variable
If the value is greater than that of the variable, update the variable with that new value
Use the variable in the Copy Activity’s Filter by Last Modified field to filter out all files that have already been copied
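As a sketch of the expressions involved (activity and variable names like Get Metadata2 and LastModifiedDate are placeholders, not from the article): inside the ForEach, a second Get Metadata activity with the Last modified field exposes @activity('Get Metadata2').output.lastModified; compare it using @greater(activity('Get Metadata2').output.lastModified, variables('LastModifiedDate')) in an If Condition and, when true, update the variable with a Set Variable activity. The Copy activity's Filter by last modified start time (UTC) can then simply be set to @variables('LastModifiedDate').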
Create a stored procedure that will read a .csv file from the Oracle server path using a file read operation, query the data in some table X, and write the output to a .csv file.
After reading the .csv file, I need to compare the .csv file data with the table data and update a few columns in the .csv file.
Oracle works best with data in the database. UPDATE is one of the most frequently used commands.
But, modifying a file which resides in some directory seems to be somewhat out of scope. There are other programming languages you should use, I believe. However, if a hammer is the only tool you have, every problem looks like a nail.
I can think of two options.
One is to load the file into the database. Use SQL*Loader to do that if the file resides on your PC, or - if you have access to the database server and the DBA granted you read/write privileges on a directory (an Oracle object which points to a filesystem directory) - use it as an external table. Once you load the data, modify it and export it back (i.e. create a new CSV file) using spool.
Another option is to use the UTL_FILE package. It also requires access to a directory on the database server. Using the A(ppend) option, you can add rows to the original file, but I don't think you can edit it in place, so this option, in the end, finishes like the previous one: by creating a new file (but this time using UTL_FILE).
Conclusion? Don't use a database management system to modify files. Use another tool.
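If "another tool" is on the table, here is a minimal Python sketch of that route: read the CSV, look up the matching row in table X, overwrite a couple of columns, and write a new CSV. The driver (python-oracledb), connection details, table/column names and the ID join key are all assumptions for illustration.

import csv
import oracledb  # python-oracledb driver; assumed installed and able to reach the DB

# Assumed connection details; replace with your own.
conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")

with conn.cursor() as cur, \
        open("input.csv", newline="") as src, \
        open("output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Look up the matching row in table X by an assumed key column ID.
        cur.execute("SELECT status, amount FROM x WHERE id = :id", id=row["ID"])
        match = cur.fetchone()
        if match:
            # Overwrite a few CSV columns with the values from the table.
            row["STATUS"], row["AMOUNT"] = match[0], match[1]
        writer.writerow(row)

conn.close()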
I have a lot of Parquet files. I need to read them through Amazon Glue and then provide my own column names to the table that is being read.
The problem is that the Parquet files already have column names, which are read by the crawler and shown in the table. Is it possible to provide my own column names for these Parquet files in Glue?
To replace the detected column names with names of your own, you could either:
Use one of the following built-in transformations on DynamicFrame:
ApplyMapping - Applies a declarative mapping to this DynamicFrame and returns a new DynamicFrame with those mappings applied. (source column, source type, target column, target type)
RenameField - Renames a field in this DynamicFrame and returns a new DynamicFrame with the field renamed. (oldName -> newName)
See the Scala or Python ETL programming guides for more detail; a short sketch of the ApplyMapping route is included after these options.
Or try updating the data catalog field names manually if you don't need to continuously re-crawl the data (or if you do, it is possible to prevent a glue crawler from updating existing data catalog tables via the crawler configuration).
Alternatively, if your requirements are more discrete, the map transform is available to convert each DynamicRecord in the DynamicFrame to a new DynamicRecord of your choosing.
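Tying the first option above to code, here is a minimal Glue PySpark sketch of ApplyMapping (the database/table names and the column mappings are assumptions for illustration):

from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the crawled Parquet table from the Data Catalog (assumed names).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_parquet_table"
)

# Mapping entries are (source column, source type, target column, target type).
renamed = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("col0", "string", "customer_id", "string"),
        ("col1", "long", "order_total", "long"),
    ],
)

# Or rename a single column instead:
# renamed = dyf.rename_field("col0", "customer_id")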
I am new to Informatica so I need your help.
I have one staging table where data comes in every day, and I need to extract data from this staging table, convert it into .dat file format and place it into a folder, so that these .dat files can be a feed for another process.
I don't know how Informatica does this (conversion of data from a staging table to a .dat file). So please help me understand how Informatica fetches the data from the staging table, transforms it into a .dat file and places it into a folder.
Thanks & Regards,
Vikram
To create a pipe-delimited flat file...
Go to the Target Designer - select Target->Create, then choose Flat File. Then double-click on the file, and in the 'Table' tab, at the bottom right, select 'Advanced' and choose your delimiter. Then you can add your columns, specify the file location and all is well!
You will need to define a source definition based on your staging table, a target definition based on your final file format, and then create the mapping, session and workflow that link the two.
A .dat extension is not a complete description of the file, since any file can be renamed to a .dat file. You'll need to decide how the data will be separated in this file (commas? tabs? pipes?). Remember, all downstream processes will then use this file as input, so you need to publish this format too.