I'm new to spark and using it a lot recently to do some batch processing.
Currently I have a new requirement and am stuck on how to approach it.
I have a file which has to be processed, but this file can get updated periodically. I want the initial file to be processed, and whenever there is an update to the file I want Spark operations to be triggered, operating only on the updated parts this time. Any way to approach this would be helpful.
I'm open to using any other technology in combination with spark. The files will generally sit on a file system and could be several GBs in size.
Spark alone cannot recognize that a file has been updated: it reads the file once, and that's all. By default it won't know the file has changed, let alone which parts of it are updates.
You should rather work with folders: point Spark at a folder, and each run can pick up any new files that have appeared in it -> sc.textFile(PATH_FOLDER)...
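For illustration, here is a minimal PySpark sketch of the folder-based approach (the /data/incoming path, the app name and the 30-second batch interval are placeholders I made up): a plain sc.textFile re-reads whatever is in the folder on each run, while a streaming context picks up only files that appear after it starts.

    # Minimal sketch, assuming PySpark; the paths are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="folder-demo")

    # Batch: each run simply re-reads everything currently in the folder.
    print(sc.textFile("/data/incoming").count())

    # Streaming: only files that appear in the folder *after* the stream starts
    # are processed, i.e. the "new file" behaviour described above.
    ssc = StreamingContext(sc, batchDuration=30)          # poll every 30 seconds
    ssc.textFileStream("/data/incoming") \
       .foreachRDD(lambda rdd: print(rdd.count()))        # placeholder processing
    ssc.start()
    ssc.awaitTermination()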
I am copying a file to HDFS from another script. I cannot know when the file transfer is done, since another system is doing the transfer to HDFS. I want to perform the next operation as soon as the file copy is done. How can I do this?
When you have a chain of commands, it is best to develop a pipeline, which also lets you plug in error-handling or alerting routines if need be.
Have you tried Apache Oozie/Airflow or tools in a similar ecosystem?
Using such a toolset, you can define the copy as the first task, followed by any other tasks in line.
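To make the idea concrete, here is a minimal Airflow 2.x sketch (the DAG id, schedule, file paths and commands are all placeholders): the copy itself becomes the first task, so "the transfer is done" simply means "the copy task succeeded", and the processing task runs right after it.

    # Minimal Airflow 2.x sketch; dag id, schedule, paths and commands are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="copy_then_process",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        # Task 1: perform the copy as part of the pipeline itself.
        copy_to_hdfs = BashOperator(
            task_id="copy_to_hdfs",
            bash_command="hdfs dfs -put -f /local/staging/data.csv /data/incoming/",
        )

        # Task 2: runs only after the copy task has succeeded.
        process = BashOperator(
            task_id="process",
            bash_command="spark-submit /jobs/process_incoming.py /data/incoming/data.csv",
        )

        copy_to_hdfs >> process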
I am doing a project about MapReduce task failures. According to Hadoop Beginner's Guide (Garry Turkington), all of the skip data is stored in the _logs/skip/ folder. The author used Hadoop 1.0. I am working with Hadoop 2.7.4. Although I tested with skip data, neither the output folder nor _logs/skip/ was created. Is the _logs/skip folder related to the Hadoop version? If I want to skip data in Hadoop 2.7.4, what should I do?
The short answer is no, it is not related to the Hadoop version at all.
There are many temporary folders created at execution time, which are removed after the execution completes. These include log folders, temporary output folders, and other temporary folders.
You should not get confused by them. The only guarantee is that the job will generate an output folder with a _SUCCESS file, even if there is no actual output.
I hope it answers your query.
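If the goal is just to tell whether a job finished, a simple check for that _SUCCESS marker works; here is a minimal sketch (assuming the hdfs CLI is on the path, and with a made-up output path):

    # Minimal sketch: check for the _SUCCESS marker with the hdfs CLI.
    # /user/me/wordcount/output is a placeholder output path.
    import subprocess

    ret = subprocess.run(
        ["hdfs", "dfs", "-test", "-e", "/user/me/wordcount/output/_SUCCESS"]
    ).returncode
    print("job finished successfully" if ret == 0 else "no _SUCCESS marker found")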
I'm not very advanced in Talend, and I have a job developed by a Talend expert that has a trick I cannot understand. This is the tricky beginning of the job:
The tricky beginning
The job has to process any file that exists in a specified folder. There is a producer process that continuously writes files into the folder, so the job must handle the already existing files as well as any new file that gets created. The amount of files is huge, and the target system is Linux.
I can't understand why he uses tFileList when he could use only tWaitForFile, which can retrieve both existing and later-created files.
Cordially.
You are right, the tFileList or tWaitForFile component is sufficient to process existing files and newly created ones. But if you post the job design or the component property details for both components, it will help in answering your question.
tWaitForFile is used to start file processing when there are files: the job runs continuously but does nothing until tWaitForFile detects a file in the directory. It's simply a file-based trigger used as an alternative to the trigger feature of the Enterprise edition.
I want to run Hadoop to process big files, but the server machines are clustered and share a file system. So even if I log in to different machines, I see the same directories and files.
In this case, I don't know how to get started. I guess the split files don't have to be transferred through HDFS to other nodes, but I'm not sure how to configure or start this.
Is there any reference or tutorial for this?
Thanks
In our system, we have multiple pig scripts that run against a particular HDFS directory. The pig scripts can run at different times, and are scheduled to run regularly.
Is there a way to point a Pig script at the same directory for multiple executions, but make sure that it only processes new files that it hasn't seen before?
I was thinking of using a custom PathFilter for my loader, but I thought I would ask to see if there is already a way to do this, rather than me reinventing the wheel (!).
Have you tried moving files to a "processed" directory when the processing has finished?
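A minimal sketch of that approach, assuming the Pig script is driven by a small wrapper (the script name, its INPUT parameter and the directories are placeholders): run the script, and only on success move the consumed files into a "processed" directory, so the next scheduled run sees only files it hasn't touched yet.

    # Minimal wrapper sketch; aggregate.pig, INPUT and the paths are placeholders.
    import subprocess

    INPUT_DIR = "/data/events/incoming"
    DONE_DIR = "/data/events/processed"

    # 1. Run the Pig script against the incoming directory.
    subprocess.run(["pig", "-param", f"INPUT={INPUT_DIR}", "-f", "aggregate.pig"],
                   check=True)

    # 2. Only if the script succeeded, move the consumed files out of the way.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", DONE_DIR], check=True)
    subprocess.run(["hdfs", "dfs", "-mv", f"{INPUT_DIR}/*", DONE_DIR], check=True)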