I am working with multiple source files and a single source instance. To experiment with multiple sources, I created three flat files and one destination table. I am using the ‘File list’ concept, so I created a text file that contains all the flat file names.
Example:
Filename: File_list.txt
File contents:
Price1.txt
Price2.txt
Price3.txt
In the above example, Price1.txt, Price2.txt and Price3.txt are the flat file names. I specified File_list.txt as the source file while running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all their values into the destination table.
Now, once the data has been inserted into the destination, I want to delete those source files from that directory location.
How can I achieve this?
You'll need to write a custom script that uses File_list.txt as input and performs the delete operations. You can then call it from the Post-Session Success Command session component, or as a separate Command Task in the workflow, linked using a $YourSessionName.Status = SUCCEEDED condition.
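For example, a minimal sketch of such a cleanup script in Python (assuming File_list.txt holds one file name per line and sits in the same directory as the flat files; the paths here are placeholders):

```python
import os
import sys

# Directory holding File_list.txt and the flat files (placeholder default).
source_dir = sys.argv[1] if len(sys.argv) > 1 else "/data/source"
file_list = os.path.join(source_dir, "File_list.txt")

with open(file_list) as fh:
    for line in fh:
        name = line.strip()
        if not name:
            continue
        path = os.path.join(source_dir, name)
        if os.path.exists(path):
            os.remove(path)  # delete the processed flat file
            print(f"Deleted {path}")
        else:
            print(f"Skipped {path} (not found)")
```

Calling it as, say, python cleanup_sources.py /path/to/source/dir from the Post-Session Success Command ensures it only runs when the session succeeds.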
I am experimenting with the flow editor in PowerAutomate to merge a bunch of CSVs from OneDrive that I am syncing via rclone.
The structure is:
OneDrive(Root)/Folder/Subfolder/*.csv
I would like to merge them into a dataset (master CSV) that I can use with PowerBI.
Because this dataset is updated daily, new CSVs will get added to the folder, so my triggering event is "When a file is created".
The automation looks like this:
When a file is created >
Initialize a String Variable >
Find files in folder >
    Search Query: *
    Folder: Same as #1
    FileSearch Mode: OneDriveSearch
Apply to each >
    Get File content
    Append to string variable
Compose (string variable) >
Create file >
    File Path: whatever/path/
    File Name: whatever.csv
    File Contents: Outputs
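In other words, the end result I'm after is roughly what this local Python sketch produces (the paths are made up; the real flow runs against OneDrive):

```python
import glob
import os

# Hypothetical local copy of the synced OneDrive subfolder.
folder = "OneDrive/Folder/Subfolder"
pieces = []

# Read every CSV in the folder and append its full contents to one string.
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    with open(path, encoding="utf-8") as fh:
        pieces.append(fh.read())

# Write the combined contents out as the master CSV.
with open("whatever/whatever.csv", "w", encoding="utf-8") as out:
    out.write("".join(pieces))
```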
The automation runs fine, and creates my master csv.
Except it's blank!
What's going on?
At the end of the day, I get a CSV with some data written to it. Previously I thought it was blank, but it does indeed have data; it just appears truncated.
Something to note here: it looks like it retains the headers.
It also looks like it only grabbed the first 31 files, but there are 229 files in that folder.
Thanks in advance
I am creating a CSV file in an ADLS folder.
For example, sample.txt is the file name, but instead of a single file I get a sample.txt/ directory containing part-000 files.
My question is: is there a way to create a single sample.txt file, rather than a directory, in PySpark?
Both df.write and df.save create a folder with multiple files inside it.
Using coalesce(1) I can combine the multiple part-000 files into one part file, but how do I get a single CSV file?
Unfortunately, Spark doesn't support writing a data file without a containing folder.
As a workaround, first use coalesce or repartition to produce a single part (partition) file:
df \
    .coalesce(1) \
    .write \
    .format("csv") \
    .mode("overwrite") \
    .save("/mydata")
The above example produces a /mydata directory containing a single part-000* file plus some hidden metadata files.
However, our data is contained in just one CSV file, and its name is not user-friendly. We can rename that file and pull it out of the directory:
data_location = "/mydata/"
files = dbutils.fs.ls(data_location)
# Find the single part-*.csv file Spark wrote inside the folder.
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
# Rename it to /mydata.csv, then remove the now-empty folder.
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")
dbutils.fs.rm(data_location, recurse=True)
Then set up the account key and configure the storage account for access, and copy the file from the Databricks file system to ADLS with dbutils.fs.cp:
```python
storage_account_name = "Storage account name"
storage_account_access_key = "Storage account access key"
# Register the account key for the ADLS Gen2 (abfss) endpoint.
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)
# Copy the single CSV into the container (container@account).
dbutils.fs.cp('/mydata.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```
I have a problem: I used "Everything" to extract the path of every .txt file in a specific directory so that I can merge them, but in EmEditor I can't find a way to merge files from a list of locations.
Here is what the Everything export looks like:
E:\Main directory\subdirectory 1\file.txt
E:\Main directory\subdirectory 2\file.txt
E:\Main directory\subdirectory 3\file.txt
E:\Main directory\subdirectory 4\file.txt
The list has over 40k locations. Is there a way to use a program to read all the locations in the text file and combine the files?
Also, the subdirectories contain other .txt files that I don't want, so I can't just merge every .txt file under the main directory. Another thing is that there are variations of "file.txt", like "Files.txt" for example.
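One option is a small script that reads the Everything export line by line and appends each listed file to a single output. A minimal Python sketch (assuming the export is saved with one absolute path per line, as above; the file names here are placeholders):

```python
# Concatenate every file listed in the Everything export into one output file.
list_path = r"E:\file_list.txt"    # placeholder: the exported list of paths
output_path = r"E:\merged.txt"     # placeholder: the merged result

with open(list_path, encoding="utf-8") as listing, \
     open(output_path, "w", encoding="utf-8") as out:
    for line in listing:
        path = line.strip()
        if not path:
            continue
        # errors="replace" keeps the merge going if a file has an odd encoding.
        with open(path, encoding="utf-8", errors="replace") as src:
            out.write(src.read())
            out.write("\n")  # separate the contents of consecutive files
```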
The scenario is: I have a flat file as source and I need to create two target files in the same location. One is the target file and the other is an archive file (with a timestamp). How can I do this?
You can create a Command task and archive the file with a timestamp.
If you have two targets in the mapping itself, then you can parameterize the target file name with $OutputfileName and assign it a value like "targetFileName-$(date '+%Y-%m-%d').log".
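For the Command task route, here is a minimal sketch of the archiving step in Python (the directories and file names are placeholders; a one-line shell cp with a date suffix would do the same job):

```python
import os
import shutil
from datetime import datetime

# Placeholder paths: adjust to the session's target file directory.
target_file = "/informatica/tgtfiles/targetFileName.csv"
archive_dir = "/informatica/tgtfiles/archive"

# Build a timestamped archive name and copy the freshly written target to it.
stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
archive_file = os.path.join(archive_dir, f"targetFileName_{stamp}.csv")
os.makedirs(archive_dir, exist_ok=True)
shutil.copy2(target_file, archive_file)
print(f"Archived {target_file} -> {archive_file}")
```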
I have more than 30 files to load data from.
The path in those control files changes at every run, so the path becomes
INFILE "/home/dmf/Cycle7Data/ITEM_IMAGE.csv"
INFILE "/home/dmf/Cycle8Data/ITEM_IMAGE.csv"
The data file name also changes from one control file to the next (e.g. SUPPLIER.csv).
Is there any way to pass the file path as a variable, or set an environment variable, so that the control file does not have to be edited every time?
You can pass the data file name on the command line; from the documentation:
DATA specifies the name of the data file containing the data to be loaded. If you do not specify a file extension or file type, then the default is .dat.
If you specify a data file on the command line and also specify data files in the control file with INFILE, then the data specified on the command line is processed first. The first data file specified in the control file is ignored. All other data files specified in the control file are processed.
So pass the relevant file name with each invocation, e.g.
sqlldr user/passwd control=myfile.ctl data=/home/dmf/Cycle7Data/ITEM_IMAGE.csv
If you have lots of files to load from a directory you could have a shell script that loops over the directory contents and passes each file name in turn to an SQL*Loader session.
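A minimal Python sketch of that loop (the directory, control file, and credentials are placeholders; a plain shell for loop would work just as well):

```python
import glob
import subprocess

# Placeholder values: point these at the current cycle directory and control file.
data_dir = "/home/dmf/Cycle7Data"
control_file = "myfile.ctl"

# One SQL*Loader session per data file, passing the file name via DATA.
for data_file in sorted(glob.glob(f"{data_dir}/*.csv")):
    cmd = ["sqlldr", "user/passwd", f"control={control_file}", f"data={data_file}"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```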