How to create an archive file from a flat file with a timestamp in Informatica, written to the same target location? - business-intelligence

The scenario: I have a flat file as the source and need to create two target files in the same location. One is the target file and the other is an archive file (with a timestamp in its name).

You can create a command task and archive the file with a timestamp.

If you have two targets in the mapping itself, you can parameterize the target file with $OutputfileName and assign it a value something like "targetFileName-`date '+%Y-%m-%d'`.log".
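For the command-task route, a minimal post-session shell sketch could copy the freshly written target file to an archive copy with a timestamp suffix in the same directory. The directory and file names below are placeholders, not values taken from the question:

```bash
# Hypothetical post-session command: keep the target file and write an
# archive copy next to it, suffixed with a timestamp.
TGT_DIR=/infa/tgtfiles                    # placeholder target file directory
TGT_FILE=targetFileName.out               # placeholder target file name
TS=$(date '+%Y%m%d_%H%M%S')               # e.g. 20240131_235959

cp "${TGT_DIR}/${TGT_FILE}" "${TGT_DIR}/${TGT_FILE%.*}_${TS}.${TGT_FILE##*.}"
```

This would produce, for example, targetFileName_20240131_235959.out alongside targetFileName.out, keeping both files in the same target location.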

Related

Create a CSV file in ADLS from Databricks

I am creating a CSV file in an ADLS folder.
For example, sample.txt is the intended file name, but instead of a single file I see a sample.txt/ directory containing part-000* files.
My question is: is there a way to create a single sample.txt file, rather than a directory, in PySpark?
Both df.write() and df.save() create a folder with multiple files inside it.
Using coalesce(1) I can combine the multiple part-000* files into one part file, but how do I get a single CSV file?
Unfortunately, Spark doesn't support writing a data file without a wrapping folder. As a workaround, first use coalesce or repartition to create a single part (partition) file:
df \
    .coalesce(1) \
    .write \
    .format("csv") \
    .mode("overwrite") \
    .save("/mydata.csv")
The above example produces a /mydata.csv directory containing a single part-000* file plus some hidden metadata files. All of the data is in that one part file, but its name is not user-friendly, so rename it and remove the wrapping directory:
# List the single part file Spark wrote into the output directory
data_location = "/mydata.csv/"
files = dbutils.fs.ls(data_location)
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
# Rename it to /mydata.csv.csv and remove the now-empty directory
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")
dbutils.fs.rm(data_location, recurse=True)
Then set up the account key so the workspace can access the storage account, and copy the file from the Databricks location to ADLS. Here dbutils.fs.cp is used; dbutils.fs.mv works the same way if you want to move rather than copy the file.
```python
# Configure access to the storage account with its account key
storage_account_name = "Storage account name"
storage_account_access_key = "storage account access key"
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)
# Copy the renamed file into the ADLS container (container@account)
dbutils.fs.cp('/mydata.csv.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```

Source file move to archive folder in Unix

Unix Shell Script
Usage: this type of script is common in environments where Informatica (INFA) or DataStage is prevalent.
Task
Create two folders, Source and Archive.
In Source, keep multiple files of varying sizes.
The task is to archive the largest file by compressing it with tar into tgz format; the archive should be named nameoffile_ddmmmyyyy_hhmmss.tgz and moved to the Archive folder or directory.
Configuration file:
Source file path
Archive file path
File size threshold as an integer: if a file's size is equal to or greater than this number, the archival process should be triggered.
The above three values should be maintained in a CSV- or TXT-based configuration file.
Consider all exception-handling use cases. A shell sketch of one possible solution follows below.
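The sketch below is one possible shape for this script; the configuration file name, its comma-separated column order (source path, archive path, threshold in bytes), GNU stat, and the error messages are assumptions rather than part of the original requirement:

```bash
#!/bin/bash
# Sketch only: archive the largest file in the source folder when it meets the
# size threshold read from a one-line CSV configuration file.
CONFIG=archive_config.csv    # assumed format: src_path,arch_path,threshold_bytes
IFS=',' read -r SRC_PATH ARCH_PATH THRESHOLD < "$CONFIG" || { echo "Cannot read $CONFIG" >&2; exit 1; }

[ -d "$SRC_PATH" ]  || { echo "Source path $SRC_PATH not found" >&2; exit 1; }
[ -d "$ARCH_PATH" ] || { echo "Archive path $ARCH_PATH not found" >&2; exit 1; }

BIGGEST=$(ls -S "$SRC_PATH" | head -n 1)          # largest file by size
[ -n "$BIGGEST" ] || { echo "No files in $SRC_PATH" >&2; exit 1; }
SIZE=$(stat -c %s "$SRC_PATH/$BIGGEST")           # size in bytes (GNU stat)

if [ "$SIZE" -ge "$THRESHOLD" ]; then
    TS=$(date '+%d%b%Y_%H%M%S')                   # ddmmmyyyy_hhmmss
    tar -czf "$ARCH_PATH/${BIGGEST%.*}_${TS}.tgz" -C "$SRC_PATH" "$BIGGEST" \
        && rm "$SRC_PATH/$BIGGEST"
else
    echo "Largest file ($SIZE bytes) is below threshold $THRESHOLD; nothing archived."
fi
```

Real exception handling would also need to cover an empty or malformed configuration file, tar failures, and files that disappear between the size check and the archive step.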

Informatica post-session command task

I am working with multiple source files through a single source instance. I created three flat files and one destination table to experiment with multiple sources, using the 'file list' concept: I created a text file that contains all the flat file names.
Example:
Filename: File_list.txt
File content:
Price1.txt
Price2.txt
Price3.txt
In the example above, Price1.txt, Price2.txt and Price3.txt are the flat file names. I specified File_list.txt as the source file when running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now, once the data has been inserted into the destination, I need to delete the source files from that directory location.
How can I achieve this?
You'll need to write a custom script that uses File_list.txt as input and performs the delete operations. You can then call it from the session's Post-Session Success Command, or as a separate Command task in the workflow, linked with a $YourSessionName.Status = SUCCEEDED condition.
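A minimal sketch of such a script, assuming the file list and the flat files live in the same source directory (the path here is a placeholder):

```bash
#!/bin/bash
# Delete every flat file named in the file list once the load has succeeded.
SRC_DIR=/infa/srcfiles                 # placeholder source file directory
LIST_FILE="$SRC_DIR/File_list.txt"

while IFS= read -r fname; do
    # Skip blank lines; remove each listed file (Price1.txt, Price2.txt, ...)
    [ -n "$fname" ] && rm -f "$SRC_DIR/$fname"
done < "$LIST_FILE"
```

Calling it from the Post-Session Success Command (rather than unconditionally) ensures the source files are only removed after the load has actually succeeded.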

Make: Dependency on newest file in directory

For a small project, I have the following workflow:
1. compile the code and generate ./data and ./images
2. run the code, which writes many files to ./data
3. generate images from the data files and place them in ./images
4. generate a video from the images
I have written a makefile that can run the code, compiling it first if necessary. But I don't know how to express the dependencies for steps 3 and 4, so I currently invoke those targets manually.
So, is there a way to check whether, e.g., the newest file in ./data is newer than the newest file in ./images? It doesn't have to be done on a file-by-file basis, and the total number of data/image files is not known in advance.
Typically a directory's timestamp is updated when a file is added to, removed from, or renamed within it, so you could use the timestamp on the directory itself for the dependency:
images : data
# generate images ...
Alternatively, if there is a mapping between the files in the two directories, you could do something like:
images/%.img: data/%.dat
# generate image ...
which would prevent reprocessing data that has already been handled.
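If you would rather compare the newest files explicitly than rely on directory timestamps, a small shell check along these lines could decide whether the image step needs to run again. It assumes flat ./data and ./images directories and uses the shell's -nt (newer-than) test:

```bash
#!/bin/bash
# Regenerate images if ./data contains anything newer than the newest image.
newest_data=$(ls -t data | head -n 1)
newest_image=$(ls -t images | head -n 1)

if [ -z "$newest_image" ] || [ "data/$newest_data" -nt "images/$newest_image" ]; then
    echo "data is newer than images - regenerating"
    # e.g. make images, or call the image-generation step directly
fi
```

The same test can be run from a small driver script before step 3, and the video step can be gated on ./images in the same way.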

SQLLDR file path argument

I have more than 30 files to load data from, and the data file path inside the control files changes on every run; for example, the path becomes
INFILE "/home/dmf/Cycle7Data/ITEM_IMAGE.csv"
INFILE "/home/dmf/Cycle8Data/ITEM_IMAGE.csv"
The file name also changes from one control file to the next (e.g. SUPPLIER.csv).
Is there any way to pass the file path in a variable, or to set an environment variable, so that the control files do not have to be edited every time?
You can pass the data file name on the command line; from the documentation:
DATA specifies the name of the data file containing the data to be loaded. If you do not specify a file extension or file type, then the default is .dat.
If you specify a data file on the command line and also specify data files in the control file with INFILE, then the data specified on the command line is processed first. The first data file specified in the control file is ignored. All other data files specified in the control file are processed.
So pass the relevant file name with each invocation, e.g.
sqlldr user/passwd control=myfile.ctl data=/home/dmf/Cycle7Data/ITEM_IMAGE.csv
If you have lots of files to load from a directory, you could have a shell script that loops over the directory contents and passes each file name in turn to an SQL*Loader session, as sketched below.
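A sketch of that approach, assuming each control file is named after the CSV it loads (e.g. ITEM_IMAGE.ctl loads ITEM_IMAGE.csv) and taking the cycle directory as an argument; the credentials and paths are placeholders:

```bash
#!/bin/bash
# Run SQL*Loader once per CSV in the cycle directory passed on the command line.
CYCLE_DIR=${1:?usage: $0 /home/dmf/Cycle7Data}

for datafile in "$CYCLE_DIR"/*.csv; do
    name=$(basename "$datafile" .csv)          # e.g. ITEM_IMAGE
    sqlldr user/passwd control="${name}.ctl" data="$datafile" log="${name}.log"
done
```

Switching from Cycle7Data to Cycle8Data then only requires changing the argument, not editing any control file.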
