How to identify new files in HDFS - hadoop

Is there a way to identify new files that are added to a path in HDFS? For example, some files have already been present for some time, and I have added new files today. I want to process only those new files. What is the best way to achieve this?
Thanks

You need to write some Java code to do this. These steps may help:
1. Before adding the new files, fetch the last modified time of the existing files (hadoop fs -ls /your-path). Let's call it mTime.
2. Next, upload the new files into the HDFS path.
3. Now filter for files whose modification time is greater than mTime; these are the files to be processed. Make your program process only these files, as in the sketch below.
This is just a hint for developing your code. :)
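For step 3, a minimal Java sketch could look like the following; it assumes the new files land directly under /your-path and that mTime was recorded in milliseconds before the upload (the class and path names are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.ArrayList;
import java.util.List;

public class NewFileFinder {
    // returns the files under dir whose modification time is later than mTime
    public static List<Path> filesNewerThan(FileSystem fs, Path dir, long mTime) throws Exception {
        List<Path> newFiles = new ArrayList<Path>();
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isFile() && status.getModificationTime() > mTime) {
                newFiles.add(status.getPath());
            }
        }
        return newFiles;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path(args[0]);          // e.g. /your-path
        long mTime = Long.parseLong(args[1]);  // timestamp recorded before the upload
        for (Path p : filesNewerThan(fs, dir, mTime)) {
            System.out.println(p);             // feed these paths to your processing job
        }
    }
}

Note that getModificationTime() returns milliseconds since the epoch, so record mTime at the same granularity.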

If it is MapReduce, then you can create an output directory with a timestamp appended on a daily basis, like:
FileOutputFormat.setOutputPath(job, new Path(hdfsFilePath + timestamp_start));
// timestamp_start is the start of the day at 12 midnight, for example 1427241600 (GMT) -- you can write logic to get the epoch time
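A rough sketch of that epoch-time logic, assuming hdfsFilePath is the base output directory and the method is called from your job driver (the class and method names are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DailyOutput {
    // appends today's midnight (GMT) epoch seconds to the base output path
    public static void setDailyOutputPath(Job job, String hdfsFilePath) {
        long timestampStart = LocalDate.now(ZoneOffset.UTC)
                .atStartOfDay(ZoneOffset.UTC)
                .toEpochSecond();   // 1427241600, for example, is 2015-03-25 00:00 GMT
        FileOutputFormat.setOutputPath(job, new Path(hdfsFilePath + timestampStart));
    }
}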

Related

Hadoop MapReduce streaming - Best methods to ensure I have processed all log files

I'm developing Hadoop MapReduce streaming jobs written in Perl to process a large set of logs in Hadoop. New files are continually added to the data directory and there are 65,000 files in the directory.
Currently I'm using ls on the directory and keeping track of what files I have processed but even the ls takes a long time. I need to process the files in as close to real time as possible.
Using ls to keep track seems less than optimal. Are there any tools or methods for keeping track of what logs have not been processed in a large directory like this?
You can rename the log files once they have been processed by your program.
For example:
command: hadoop fs -mv numbers.map/part-00000 numbers.map/data
Once renamed, you can easily distinguish the processed files from the ones yet to be processed; a small Java sketch of the same idea follows.
Hope this fixes your issue.
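Here is that sketch. It assumes processed files are moved into a (hypothetical) processed/ subdirectory, so that a listing of the input directory only shows unprocessed work:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MarkProcessed {
    // move a file into processedDir once the job has consumed it;
    // rename is a metadata-only operation, so it is cheap even for large files
    public static void markProcessed(FileSystem fs, Path file, Path processedDir) throws Exception {
        if (!fs.exists(processedDir)) {
            fs.mkdirs(processedDir);
        }
        fs.rename(file, new Path(processedDir, file.getName()));
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        markProcessed(fs, new Path(args[0]), new Path(args[1]));
    }
}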

Locking a directory in HDFS

Is there a way to acquire lock on a directory in HDFS? Here's what I am trying to do:
I've a directory called ../latest/...
Every day I need to add fresh data into this directory, but before I copy new data in here, I want to acquire a lock so that no one is using the directory while I copy the new data into it.
Is there a way to do this in HDFS?
No, there is no way to do this through HDFS.
In general, when I have this problem, I try to copy the data into a random temp location and then move the file once the copy is complete. This is nice because mv is pretty instantaneous, while copying takes longer. That way, if you check to see that no one else is writing and then mv, the period for which the "lock" is held is much shorter:
1. Generate a random number.
2. Put the data into a new folder in hdfs://tmp/$randomnumber.
3. Check to see if the destination is OK (hadoop fs -ls, perhaps).
4. hadoop fs -mv the data to the latest directory.
There is a slim chance that between steps 3 and 4 someone might clobber something. If that really makes you nervous, perhaps you can implement a simple lock in ZooKeeper; Curator can help you with that.
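A minimal sketch of the copy-then-move pattern; the /tmp staging path and /data/latest destination are placeholders, and the step-3 check is reduced to a simple existence test:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.UUID;

public class StagedCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // steps 1-2: copy the local data into a random temp location first (the slow part)
        Path tmpDir = new Path("/tmp/staging-" + UUID.randomUUID());
        fs.copyFromLocalFile(new Path(args[0]), tmpDir);

        // step 3: sanity-check the destination before moving
        Path latest = new Path("/data/latest");
        if (!fs.exists(latest)) {
            fs.mkdirs(latest);
        }

        // step 4: move into place; rename is near-instantaneous, so the unprotected window is tiny
        fs.rename(tmpDir, new Path(latest, "data-" + System.currentTimeMillis()));
    }
}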

What would be a good hadoop folder structure which can handle these scenarios?

The folder structure inside HDFS should support yearly, monthly and daily processing of data. If we have to do the processing for the last 16 or 21 days, the framework should support that. For any ad hoc number of days, the processing should be done without human intervention, except for specifying the number of days and the start date. The HDFS path specification should be automated. The default will be daily processing of files.
The framework should be integrated with the MapReduce code, as it has to know which folders it needs to look into to start the job.
Current:
Eg:
/user/projectname/sourcefiles/datasetname/yyyy/mm/dd/timestamp/filename
But this doesn't satisfy all requirements. For example, suppose we have to process data for the past 16 days.
"/user/projectname/sourcefiles/datasetname/yyyy/mm/[01][0-9]/timestamp/filename" will not work as a path, because the 19th day's files will also be included.
And how do you ensure that the timestamps in the HDFS folder structure and the MapReduce job are in sync?
Thanks for your time.
You can:
use path globbing: calculate the path string for the days you wish to process (a sketch follows below) - see here http://books.google.co.il/books?id=Nff49D7vnJcC&pg=PA61&lpg=PA61&dq=path+globbing+pattern+hadoop&source=bl&ots=IihwWu8xXr&sig=g7DLXSqiJ7HRjQ8ZpxcAWJW0WV0&hl=en&sa=X&ei=Fp13Uey9AaS50QXJq4B4&ved=0CDAQ6AEwAQ#v=onepage&q=path%20globbing%20pattern%20hadoop&f=false
use symbolic links to help you have more than one hierarchy (only available in the Java API, though) - see here http://blog.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/
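For the path-globbing option, here is a sketch that builds one input path per day for the last N days and hands the comma-separated list to the job. The base path mirrors the layout in the question, and the timestamp level is assumed to be covered by a wildcard:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.StringJoiner;

public class LastNDaysInput {
    // e.g. base = "/user/projectname/sourcefiles/datasetname", days = 16
    public static void addLastNDays(Job job, String base, LocalDate start, int days) throws Exception {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd");
        StringJoiner paths = new StringJoiner(",");
        for (int i = 0; i < days; i++) {
            // one entry per day, with wildcards for the timestamp directory and the filename
            paths.add(base + "/" + start.minusDays(i).format(fmt) + "/*/*");
        }
        FileInputFormat.setInputPaths(job, paths.toString());
    }
}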
If you provide a folder to MapReduce, it will process all the files in that folder. You can create weekly or fortnightly folders. I hope that helps.

atomic hadoop fs move

While building the infrastructure for one of my current projects, I've faced the problem of replacing already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which continuously generate logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes in length and 500-800 MB in size) from the log-servers, preprocessing them and uploading them to the HDFS of our Hadoop cluster.
Preprocessing is done in 3 steps:
1. For each log-server: filter (in parallel) the received log chunk (the output file is about 60-80 MB).
2. Combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files).
3. Using the current mapping from an external DB, process the file from step 2 to obtain the final logfile and put this file into HDFS.
The final logfiles are to be used as input for several periodic Hadoop applications which run on a Hadoop cluster. In HDFS, logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Problem description:
The mapping which is used in step 3 changes over time, and we need to reflect these changes by recalculating step 3 and replacing old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes), at least for the last 12 hours. Please note that, if the mapping has changed, the result of applying step 3 to the same input file may be significantly different (it will not be just a superset/subset of the previous result). So we need to overwrite the existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using the file which has been temporarily removed, the app may fail. The solution I use is to put a new file next to the old one; the files have the same name but different suffixes denoting the files' versions. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
Any Hadoop application, during its start (setup), chooses the files with the most up-to-date versions and works with them. So even if some update is going on, the application will not experience any problems, because no input file is removed.
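For reference, the version-picking step at application start could look roughly like this; the file layout follows the listing above, and the .vN suffix parsing is an assumption:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestVersion {
    // returns the highest-versioned file for one hour, e.g. hourPrefix = "2012-09-26.09.00.log"
    public static Path latest(FileSystem fs, Path logDir, String hourPrefix) throws Exception {
        FileStatus[] candidates = fs.globStatus(new Path(logDir, hourPrefix + ".v*"));
        Path best = null;
        int bestVersion = -1;
        if (candidates != null) {
            for (FileStatus status : candidates) {
                int version = Integer.parseInt(status.getPath().getName().replaceAll(".*\\.v", ""));
                if (version > bestVersion) {
                    bestVersion = version;
                    best = status.getPath();
                }
            }
        }
        return best;   // null if no version of this hour exists yet
    }
}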
Questions:
Do you know some easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file which is currently being uploaded but not yet complete (applications see this file in HDFS but don't know whether it is consistent). In the case of gzip files this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure that this is the case for HDFS. Could you please advise whether HDFS has some atomic operation like mv in conventional local file systems?
Thanks in advance!
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks atomic renames (they are dirty updates, IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on this without worrying about a partial state, since the source file is already created and closed.
From HDFS 2.x onwards, proper atomic renames (via a new API call) are supported, replacing the earlier version's dirty one. This is also the default behavior of rename if you use the FileContext APIs.
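For example, on HDFS 2.x the replace can be done in a single call through the FileContext API mentioned above; the paths here are placeholders, and Options.Rename.OVERWRITE replaces the destination as part of the rename:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class AtomicReplace {
    public static void main(String[] args) throws Exception {
        FileContext fc = FileContext.getFileContext(new Configuration());
        // write the new version under a temporary name first, then swap it into place
        fc.rename(new Path("/spool/logs/2012-09-26.09.00.log.tmp"),
                  new Path("/spool/logs/2012-09-26.09.00.log"),
                  Options.Rename.OVERWRITE);
    }
}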

VB Script - move files older than 180 days from modified date to another directory

I would like to know if there is a VBScript which will move files from a specific location and its subfolders to another location based on their modified date, keeping the original directory structure in the new location.
The results should be saved in a .txt file.
Thanks in advance.
This former question here on SO,
VBScript - copy files modified in last 24 hours,
is a sample from which you can start. If you have any difficulty adapting it to your needs, come back and ask again.
