Don't process already processed files? - hadoop

In our system, we have multiple pig scripts that run against a particular HDFS directory. The pig scripts can run at different times, and are scheduled to run regularly.
Is there a way to point a pig script at the same directory for multiple executions, but make sure that it only processes new files that it hasn't seen before?
I was thinking of using a custom PathFilter for my loader, but I thought I would ask to see if there is already a way to do this, rather than me reinventing the wheel (!).

Have you tried moving files to a processed directory once processing is finished?
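If you go that route, a minimal sketch of a wrapper around the Pig run could look like this; the paths, the script name and the use of a small driver script are all assumptions for illustration, not part of the original setup:

    # Sketch: run the Pig script against the incoming directory, then move the
    # inputs that were just processed into a "processed" directory so the next
    # run only sees new files. Paths and script name are made up.
    import subprocess

    INCOMING = "/data/incoming"      # directory the Pig script reads
    PROCESSED = "/data/processed"    # files are parked here after a successful run

    def hdfs_ls(path):
        """Return the HDFS paths currently under `path` (uses `hdfs dfs -ls -C`)."""
        out = subprocess.run(["hdfs", "dfs", "-ls", "-C", path],
                             check=True, capture_output=True, text=True).stdout
        return [line for line in out.splitlines() if line.strip()]

    def run_once():
        snapshot = hdfs_ls(INCOMING)              # remember what we are about to process
        if not snapshot:
            return
        subprocess.run(["pig", "my_script.pig"], check=True)   # hypothetical script name
        # Only move the files that existed before the run; files that arrived
        # during the run stay in place for the next execution.
        subprocess.run(["hdfs", "dfs", "-mv", *snapshot, PROCESSED], check=True)

    if __name__ == "__main__":
        run_once()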


How to check if a file transfer to HDFS is completed or not

I am copying a file to HDFS from another script. I cannot know when the file transfer is done, since another system is doing the transfer to HDFS. I want to perform the next operation as soon as the file copy is done. How can I do this?
Whenever you have a chain of commands, it is best to build a pipeline, which also lets you plug in error-handling or alerting routines if need be.
Have you tried Apache Oozie/Airflow or tools in a similar ecosystem?
Using such a toolset, you can define the first task as the copy, followed by any other tasks in the chain.
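For example, a minimal Airflow 2.x DAG along those lines might look like the sketch below; the DAG id, schedule, paths and commands are placeholders, not your actual pipeline:

    # Sketch of an Airflow DAG: the copy to HDFS is the first task, and the
    # downstream processing only runs once the copy task has succeeded.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="copy_then_process",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        copy_to_hdfs = BashOperator(
            task_id="copy_to_hdfs",
            bash_command="hdfs dfs -put -f /local/staging/data.csv /data/incoming/",
        )
        process = BashOperator(
            task_id="process",
            bash_command="spark-submit process_job.py /data/incoming/data.csv",
        )
        # `process` only starts after `copy_to_hdfs` succeeds, and retries or
        # alerting can be attached to either task.
        copy_to_hdfs >> process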

Is _logs/skip/ related to hadoop version?

I am doing a project about MapReduce task failures. According to Hadoop Beginner's Guide (Garry Turkington), all of the skip data is stored in the _logs/skip/ folder. The author used Hadoop 1.0. I am working with Hadoop 2.7.4. Although I tested with skip data, neither the output folder nor _logs/skip/ was created. Is the _logs/skip folder related to the Hadoop version? If I want to skip data in Hadoop 2.7.4, what should I do?
The short answer is no, it is not related to the Hadoop version at all.
There are many temporary folders created at the time of execution, which are removed once execution is completed. This includes log folders, temporary output folders and other temporary folders.
You should not be confused by them. The only guarantee is that the job will generate an output folder with a _SUCCESS file, even if there is no output.
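If you need to detect programmatically that a job has finished, checking for that _SUCCESS marker is the usual approach. A rough sketch, polling via the standard hdfs dfs -test command; the output path is a made-up example:

    # Sketch: poll for the _SUCCESS marker that a completed job leaves in its
    # output directory. The output path below is a placeholder.
    import subprocess
    import time

    def wait_for_success(output_dir, poll_seconds=30):
        marker = output_dir.rstrip("/") + "/_SUCCESS"
        while True:
            # `hdfs dfs -test -e` exits 0 if the path exists, non-zero otherwise.
            result = subprocess.run(["hdfs", "dfs", "-test", "-e", marker])
            if result.returncode == 0:
                return
            time.sleep(poll_seconds)

    wait_for_success("/user/me/job-output")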
I hope it answers your query.

Apache spark - dealing with auto-updating inputs

I'm new to spark and using it a lot recently to do some batch processing.
Currently I have a new requirement and am stuck on how to approach it.
I have a file which has to be processed, but this file can get updated periodically. I want the initial file to be processed, and whenever there is an update to the file, I want the Spark operations to be triggered and to operate only on the updated parts this time. Any way to approach this would be helpful.
I'm open to using any other technology in combination with spark. The files will generally sit on a file system and could be several GBs in size.
Spark alone cannot recognize that a file has been updated.
It does its job when it reads the file for the first time, and that's all: by default, Spark won't know that a file has been updated, let alone which parts of the file are new.
You should rather work with folders. Spark can run on a folder and can recognize when there is a new file to process in it -> sc.textFile(PATH_FOLDER)...
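If you want Spark itself to keep watching the folder and pick up each new file exactly once, Structured Streaming's file source is built for that. A minimal PySpark sketch with placeholder paths; note it reacts to new files dropped into the folder, not to in-place edits of an existing file:

    # Sketch: Structured Streaming treats a directory as a stream and processes
    # each file exactly once as it appears. Input/checkpoint paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("watch-folder").getOrCreate()

    # Each new file dropped into /data/incoming becomes a micro-batch of rows.
    lines = spark.readStream.text("/data/incoming")

    query = (lines.writeStream
             .format("parquet")
             .option("path", "/data/output")
             .option("checkpointLocation", "/data/checkpoints")  # remembers which files were processed
             .start())

    query.awaitTermination()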

Talend, combine tWaitForFile and tFileList

I'm not so advanced in Talend, and I have a job developed by a Talend expert which has a trick that I cannot understand. This is the tricky beginning of the job:
[screenshot: the tricky beginning of the job]
The job will process any file that exists in a specified folder. There is a producer process which writes files continuously into the folder, so the job has to process the already existing files as well as any new file that gets created. The amount of files is huge and the target system is Linux.
I can't understand why he uses tFileList when he could use only tWaitForFile, which can retrieve both existing files and files created later.
Cordially.
You are right, either the tFileList or the tWaitForFile component is sufficient to process existing files and newly created files. But if you post the job design or the component property details for both components, it will help in answering your question.
tWaitForFile is used to start file processing when there are files: the job runs continuously but does nothing until tWaitForFile detects a file in the directory. It is simply a file-based trigger, used as an alternative to the trigger feature of the Enterprise edition.

File watcher in shell

I am trying to keep two directories synchronized with the same files in them.
Files are dropped into Directory A throughout the day. I would like to create a file watcher script that will copy files from Directory A to Directory B as soon as they are dropped.
My thought was to run the job every minute and simply copy everything that dropped in the last minute, but I am wondering if there is a better solution out there.
I'm running MKS toolkit under Windows. Different servers, same operating system.
Thanks for your help!
If you use Linux, you can hook into the kernel using the inotify API to get notified if something in a folder changes. There are command line versions like inotifywatch(1) as well.
To copy the files, I suggest using rsync(1): it is clever, knows how to clean up after itself, and it creates new files hidden while they are being copied, so users and programs are less likely to pick them up before they are complete.
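If inotify and rsync are not available (for example on the Windows/MKS setup mentioned in the question), the "run it every minute" idea can be made concrete with a small polling loop; a stdlib-only sketch with placeholder directories, borrowing rsync's copy-then-rename trick:

    # Sketch: poll Directory A and copy anything new or modified into Directory B.
    # Portable fallback for when inotify/rsync are not available.
    import os
    import shutil
    import time

    SRC = r"C:\data\dir_a"   # placeholder paths
    DST = r"C:\data\dir_b"
    POLL_SECONDS = 60

    def sync_once():
        for name in os.listdir(SRC):
            src_path = os.path.join(SRC, name)
            dst_path = os.path.join(DST, name)
            if not os.path.isfile(src_path):
                continue
            # Copy if the file is missing in B or A's copy is newer.
            if (not os.path.exists(dst_path)
                    or os.path.getmtime(src_path) > os.path.getmtime(dst_path)):
                tmp_path = dst_path + ".part"          # write to a temp name, then rename,
                shutil.copy2(src_path, tmp_path)       # so readers never see a partial file
                os.replace(tmp_path, dst_path)

    while True:
        sync_once()
        time.sleep(POLL_SECONDS)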
