Talend, combine tWaitForFile and tFileList - etl

I'm not so advanced in Talend, and I have got a job developed by a Talend expert which have a trick that i cannot understand, this is the tricky beginning of the job:
The tricky beginning
The job will process any file exists in a specified folder, there is a producer process which write files continuously into the folder, so it have to process the already existing files and any new file will be created, the ammount of files is so huge and the target system is Linux.
I cant understand why he uses the tFileList while he can use only the tWaitForFile, it can retrieve existing and later created files.
Cordially.

you are right, tFileList or tWaitForFile component is sufficient to process exiting files and newly created file. But if you post job design or component property details for both the component then it will help to answer your question.

TWaitForFile used to start file processing when there is files, the job should be run continiously but do nothing until TWaitForFile detect file in the directory, its simply a file-based trigger used as alternative for the trigger feature of the Entreprise edition.

Related

Apache spark - dealing with auto-updating inputs

I'm new to spark and using it a lot recently to do some batch processing.
Currently I have a new requirement and am stuck on how to approach it.
I have a file which has to be processed but this file can get periodically updated. I want the initial file to be processed and as and when there is an update to the file, I want spark operations to be triggered and should operate only on the updated parts this time. Any way to approach this would be helpful. An
I'm open to using any other technology in combination with spark. The files will generally sit on a file system and could be several GBs in size.
Spark alone cannot recognize if a file has been updated.
It does its job when reading for a first time the file and that's all.
By default, Spark won't know that a file has been updated and won't know which parts of the file are updates.
You should rather work with folders, Spark can run on a folder and can recognize if there is a new file to process in it -> sc.textFile(PATH_FOLDER)...

Rename a file that multiple processes are trying to use

I have 2 applications running in parallel, both doing the following:
check for file not containing "processed"
process the file and then rename it to filename+processed
for every file, only one application shall use it (on a first come first served basis)
I get the files and I also lock them so the other application cannot process it. But when it comes to renaming the file I get a problem. To rename the file, wanted to use the File.renameTo function. However, for that to work, I have to release the lock on the file. But when I release the lock another process may try to use the file. Exactly that should not happen.
Is there any way to prevent the application B from using the file between application A releasing the lock and finishing renaming the file?
EDIT
Some more information:
File creation if the file doesn't exist has to be prevented.
The file will be processed RandomAccessFile (with read and write permission; this creates a new file if it doesn't exist).
Note: On linux, one can rename a file that is locked, so this problem doesn't occur there. However, on Windows a locked file cannot be renamed; I have to release the lock, then rename it. But the time, during which the lock is released creates enables other applications to see that the file is available and then they will try to use it.
Windows applications can do this using the SetFileInformationByHandle function, which allows you to rename the file using the handle you already have open. You probably can't do this natively from Java.
However, a more straightforward solution would be to rename the file (to filename+processing, for example) before you start processing it. Whichever process successfully renames the file in this way is the one responsible for processing it and eventually renaming it to filename+processed.

File watcher in shell

I am trying to keep two directories synchronized with the same files in them.
Files are dropped into Directory A throughout the day. I would like to create a file watcher script that will copy files from Directory A to Directory B as soon as they are dropped.
My thought was to run the job every minute and simply copy everything that dropped in the last minute, but I am wondering if there is a better solution out there.
I'm running MKS toolkit under Windows. Different servers, same operating system.
Thanks for your help!
If you use Linux, you can hook into the kernel using the inotify API to get notified if something in a folder changes. There are command line versions like inotifywatch(1) as well.
To copy the files, I suggest to use rsync(1): it is clever, knows how to clean up after itself and it will create new files hidden while they are copied so users and programs are less likely to pick them up before they are complete.

Relocating ".fig" files when creating a GUI using Matlab GUIDE

I've developed a GUI for some build scripts, and am now in the process of deploying it. As the script will be deployed to a number of different machines at various points, I need to use the standard format of directories that the team use.
The GUI consists of a ".fig" file that contains the visual definition of the UI, and a m-script that defines the functionality. I need to locate these two in "fig/" and "m/" folders respectively, but I can't figure out how to. I first searched for an include statement of some kind in the m-script, as when I Run it on its own, the error message in the command window states that the ".fig" file can't be found, but there doesn't seem to be a reference to the ".fig" file anywhere, I assume that it's inferred as both files have the same name but a different extension.
I fear that Matlab's GUI system requires that both ".m" and ".fig" files are in the same location, but this will be an inelegant solution that I'd rather not go for if I can avoid it.
The next thing I'm going to try is to call a script that copies the fig file from the other directory to the same location as the m-script, when it is executed, then deletes that copy once the script exits, which again seems a clunky solution, but will allow me to adhere to the team's organisation conventions.
Does anyone else know of an undocumented means of specifying the relative location of a GUI ".fig" file?
You can export the GUIDE-generated GUI as a single .m file. Check out this blog post: GUIDE GUIs in All One File.
I'm not sure if this is a new feature, or one of those things that has always been there...

Windows: File Monitoring Script (Batch/VBS)

I'm currently in working on a script to create a custom backup script, the only piece I'm missing is a file monitor. I need some form of a script that will monitor a folder for file changes, and then run a command with the file that's changed.
So, for example, if the file changes, it'll execute "c:/syncbatch.bat %Location_Of_File%"
In VBScript, you can monitor a folder for file changes by subscribing to the WMI __InstanceModificationEvent event. These articles contain sample code that you can learn from and adapt to your specific needs:
WMI and File System Monitoring
How Can I Monitor for Different Types of Events With Just One Script?
Calling WMI is fairly cryptic and it causes the WMI service to start running which can contribute to bloat since its fairly large and you really can't cancel the file change notifications you've requested from it without rebooting. Some people experimenting with remote printing from a Dropbox folder found that a simple VBScript program that ran an endless loop with a 10 second WScript.Sleep call in the loop used far less resource. Of course to stop it you have to task kill that script or program into it some exit trigger it can find like a specifically named empty file in the watch folder, but that is still easier to do than messing with WMI.
The Folder Spy http://venussoftcorporation.blogspot.com/2010/05/thefolderspy.html
is a free lightweight DOT.NET based file/folder watching GUI application I'ved used before to run scripts based on file changes. It looks like the new version can pass the event filename to the launched command. The old version I had didn't yet support file event info so when launched, my script had to instance a File System Object and scan the watched folder to locate the new files based on criteria like datestamps and sizes.
This newer version appears to let you pass the file name to the script if you say myscript.vbs "*f" on the optional script call entry. Quotes can be important when passing file paths that have spaces in folder names. Just remember if you are watching change events you will get a lot of them as a file grows or is edited, usually you just want notification of file adds or deletes.
Another trick your script can do is put the file size in a variable, sleep for a few seconds, and check the file again to see if its changed. if it hasn't changed in a few seconds you can usually assume whatever created it is done writing it to disk. if it keeps changing just loop until its stable.

Resources