I'm new to Talend and I'm having trouble tidying up after I have looped through files and loaded them into Postgres.
The load works, and it moves all but one of the files: if there are 5 files in the folder it archives 4, and if there is one file it doesn't archive any. It always leaves one file behind in the folder and gives a permissions error.
I have tried various configurations of this job with and without the "tUnite", with a second loop (as shown), and with just a move as part of the main job flow. File locking is the common theme across all of the approaches I've tried. This is the current error I get:
tFileCopy_1 C:\Users\stuar\Documents\vb-stock-20200705.csv -> C:\Users\stuar\Documents\Archive\vb-stock-20200705.csv: The process cannot access the file because it is being used by another process.
C:\Users\stuar\Documents\vb-stock-20200705.csv tFileCopy_1 - The source file "C:\Users\stuar\Documents\vb-stock-20200705.csv" could not be removed from the folder because it is open or you only have read-only rights.
Should I be splitting this into 2 jobs and calling them from a parent, with the load in one and the move in a separate job? Or writing to a new file after the tUnite and loading that instead? It feels like the load is still hanging on to the file I am trying to move.
OK I solved it. I needed to
add a tFileOutputDelimited after the tUnite, which merges all the files into a new staging file
add in a tFileDelete at the end of my tPostJob to delete the staging file I created in the previous step
This all worked fine.
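Conceptually, the fix works because nothing still holds the source files open when the archive move runs. A rough sketch of the same sequence in plain Python, with the staging file name as a placeholder and header handling left out:

```python
import shutil
from pathlib import Path

src_dir = Path(r"C:\Users\stuar\Documents")   # source folder from the error above
archive_dir = src_dir / "Archive"
staging = src_dir / "staging-merged.csv"      # placeholder staging file name

sources = sorted(src_dir.glob("vb-stock-*.csv"))

# Merge every source into the staging file, closing each one as we go
with staging.open("w", encoding="utf-8", newline="") as out:
    for f in sources:
        with f.open("r", encoding="utf-8") as src:
            out.write(src.read())

# ... load the staging file into Postgres here ...

# Only now, with no handles left open on the sources, archive them
archive_dir.mkdir(exist_ok=True)
for f in sources:
    shutil.move(str(f), str(archive_dir / f.name))
staging.unlink()
```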
Related
I need to move several hundred files from a Windows source folder to a destination folder together in one operation. The files are named sequentially (e.g. part-0001.csv, part-002.csv). It is not known what the final file in the sequence will be called. The files will arrive in the source folder over a number of weeks, and it is not ascertainable when the final one will arrive. The users want to use a trigger file (i.e. the arrival of a specifically named file in the folder, e.g. trigger.txt) to cause the flow to start.
My first two thoughts were using a first ListFile processor as an input to a second, or as the input to an ExecuteProcess processor that would call a script to start the second one; however, neither of these processors accepts an input, so I am a bit stumped as to how I might achieve this, or indeed whether it is possible with NiFi. Has anyone encountered this use case, and if so, how did you resolve it?
So I'm a complete rookie with NiFi, and when I was trying it out for the first time I just ran a single "GetFile" processor pointed at a fairly important directory, and now all of the files are gone. I poked around in the Content Repository, and it would appear that there are a whole lot of files there in some unknown format. I am assuming those are the files from my HD, now in "FlowFile" format. I also noticed that I can look at the provenance records and download them one by one, but there are several thousand, so that is not an option.
So if I'm looking to restore all of those files, I imagine I would need to read everything in the content repository as flowfiles and then do a PutFile. Any suggestions on how to go about this? Thanks so much!
If you still have the flowfiles in a queue, add a PutFile processor pointed at another directory (not your important one) and move the queue over to it (click the queue that has the flowfiles in it and drag the little blue square at the end of the relationship over to the new PutFile). Run the PutFile and let the queue drain out. The files might not come out like-for-like, but the data will be there (assuming you didn't drop any flowfiles).
Don't develop flows on important directories that you don't have backups for. Copy a subset of the data to a testing directory.
I have many files which I got from a sensor, and more files arrive every hour. The file names consist of 3 parts, rain_date_time. How can I open each file recursively to get what's inside it and add it to a database? I have found a way to read the files one by one, yet I face difficulty reading them recursively.
this is my code
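A minimal sketch of the recursive version I'm after, assuming the files are CSVs under one root folder and the target is a SQLite database (every name below is a placeholder):

```python
import csv
import sqlite3
from pathlib import Path

root = Path("data/rain")                 # placeholder: folder the sensor writes into
conn = sqlite3.connect("sensor.db")      # placeholder database
conn.execute("CREATE TABLE IF NOT EXISTS readings (source TEXT, line TEXT)")

# Walk the folder recursively; file names have 3 parts: rain_date_time
for path in sorted(root.rglob("rain_*_*")):
    if not path.is_file():
        continue
    with path.open(newline="") as f:
        for row in csv.reader(f):
            conn.execute(
                "INSERT INTO readings (source, line) VALUES (?, ?)",
                (path.name, ",".join(row)),
            )

conn.commit()
conn.close()
```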
rdd.saveAsTextFile("s3n://bucket-name/path") is creating an empty file named after the folder, as [folder-name]_$folder$.
It seems like this empty file is used by the hadoop-aws jar (from org.apache.hadoop) to mimic the S3 filesystem as a Hadoop filesystem.
But my application writes thousands of files to S3. As saveAsTextFile creates a folder (from the given path) to write the data (from the RDD), my application ends up creating thousands of these empty files - [directory-name]_$folder$.
Is there a way to make rdd.saveAsTextFile not to write these empty files?
Stop using s3n and switch to s3a. It's faster and actually supported. That will make this issue go away, along with the atrocious performance problems reading large Parquet/ORC files.
Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine source data into larger, columnar-formatted files and use whatever SELECT mechanism your framework has to read only the bits you want.
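For instance, a PySpark sketch that switches the paths to s3a and coalesces the output into a handful of larger Parquet files instead of thousands of small text files (bucket name, paths, and partition count are placeholders, and the hadoop-aws / AWS SDK jars are assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-output-example").getOrCreate()

df = spark.read.text("s3a://bucket-name/input/")   # s3a, not s3n

# Fewer, larger columnar files instead of thousands of small text files
df.coalesce(16).write.mode("overwrite").parquet("s3a://bucket-name/output/")
```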
Suppose I have a folder with a few files (images, texts, whatever); it only matters that there are multiple files and the folder is rather large (> 100 MB). Now I want to update five files in this folder, but I want to do this atomically. Normally I would just create a temporary folder, write everything into it and, if that succeeds, replace the existing folder. But because I/O is expensive, I don't really want to go this way: resaving hundreds of files just to update five seems like a huge overhead. So how am I supposed to write these five files atomically? Note, I want the writing of all files to be atomic as a group, not each file separately.
You could adapt your original solution:
1. Create a temporary folder full of hard links to the original files.
2. Save the five new files into the temporary folder.
3. Delete the original folder and move the folder of hard links in its place.
Creating a few links should be speedy, and it avoids rewriting all the files.
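A rough Python sketch of those three steps, assuming everything lives on one filesystem so hard links work (the paths and the set of updated files are placeholders):

```python
import os
import shutil
import tempfile
from pathlib import Path

folder = Path("data/bundle")                      # placeholder: the large folder
updates = {"a.txt": b"new a", "b.txt": b"new b"}  # placeholder: the files to update

# 1. Temporary sibling folder full of hard links to the originals
#    (assumes a flat folder of regular files)
tmp = Path(tempfile.mkdtemp(dir=folder.parent))
for f in folder.iterdir():
    os.link(f, tmp / f.name)              # hard link, not a copy

# 2. Write the new versions into the temporary folder
for name, data in updates.items():
    target = tmp / name
    if target.exists():
        target.unlink()                   # break the link so the original isn't modified
    target.write_bytes(data)

# 3. Swap the folders (delete + rename, as in the steps above)
shutil.rmtree(folder)
tmp.rename(folder)
```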