While building the infrastructure for one of my current projects, I ran into the problem of replacing existing HDFS files. More precisely, I want to do the following:
We have a few machines (log servers) which continuously generate logs. We have a dedicated machine (the log preprocessor) which is responsible for receiving log chunks from the log servers (each chunk covers about 30 minutes and is 500-800 MB in size), preprocessing them and uploading them to the HDFS of our Hadoop cluster.
Preprocessing is done in 3 steps:
1. For each log server: filter (in parallel) the received log chunk (the output file is about 60-80 MB).
2. Combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files).
3. Using the current mapping from an external DB, process the file from step 2 to obtain the final logfile and put this file into HDFS.
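As an illustration of the merge-sort step only, here is a minimal Python sketch. It assumes each filtered chunk is already sorted and that every line starts with a lexicographically sortable timestamp; the sample lines and the function name `merge_sorted_logs` are made up for the example and are not part of the original pipeline.

```python
import heapq

def merge_sorted_logs(*chunks):
    """Lazily merge already-sorted iterables of log lines into one sorted stream."""
    # heapq.merge keeps only one pending line per input in memory.
    return heapq.merge(*chunks)

# Hypothetical sample data: one pre-sorted chunk per log server.
server_a = ["2013-01-01T00:00:01 a1", "2013-01-01T00:00:05 a2"]
server_b = ["2013-01-01T00:00:02 b1", "2013-01-01T00:00:03 b2"]

merged = list(merge_sorted_logs(server_a, server_b))
```

With real chunks the inputs would be open file objects rather than lists, so that the merge streams through the files instead of loading them.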
The final logfiles are used as input for several periodic Hadoop applications which run on the Hadoop cluster. In HDFS logfiles are stored as follows:
Problem description:
The mapping used in step 3 changes over time, and we need to reflect these changes by recalculating step 3 and replacing the old HDFS files with new ones. This update is performed periodically (e.g. every 10-15 minutes), at least for the last 12 hours of data. Please note that if the mapping has changed, the result of applying step 3 to the same input file may be significantly different (it will not be just a superset or subset of the previous result). So we need to overwrite existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using a file that has been temporarily removed, the application may fail. The solution I use is to put the new file next to the old one; the files have the same name but different suffixes denoting the file's version. Now the layout is the following:
During its start (setup), any Hadoop application chooses the files with the most up-to-date versions and works with them. So even if an update is in progress, the application will not experience any problems, because no input file is removed.
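The version-selection step at application setup could look roughly like the following Python sketch. The naming scheme `<base>.v<integer>` is purely an assumption made for illustration (the actual suffix format is not shown above), as are the function name and the sample listing.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical naming scheme: "<base>.v<integer>", e.g. "10.log.gz.v3".
VERSIONED = re.compile(r"^(?P<base>.+)\.v(?P<version>\d+)$")

def latest_versions(names: Iterable[str]) -> List[str]:
    """Return, for every base name, the file name carrying the highest version."""
    best: Dict[str, Tuple[int, str]] = {}
    for name in names:
        match = VERSIONED.match(name)
        if match is None:
            continue  # ignore files that do not follow the assumed scheme
        version = int(match.group("version"))  # numeric compare, so v10 > v9
        base = match.group("base")
        if base not in best or version > best[base][0]:
            best[base] = (version, name)
    return sorted(name for _, name in best.values())

# Hypothetical directory listing.
listing = ["10.log.gz.v9", "10.log.gz.v10", "11.log.gz.v1", "notes.txt"]
chosen = latest_versions(listing)
```

Because the choice is made once at setup, a job keeps reading the versions it picked even if newer ones appear while it runs.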
Do you know some easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file which is still being uploaded (applications see this file in HDFS but don't know whether it is complete). In the case of gzip files this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure that this is the case for HDFS. Could you please advise whether HDFS has an atomic operation like mv in conventional local file systems?
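For reference, the local-filesystem pattern above can be sketched in Python as follows. This only illustrates the copy-then-rename idea on a local POSIX file system, where `os.replace` is atomic as long as source and destination are on the same file system; it says nothing about HDFS behaviour. The function name and the `.tmp` suffix are illustrative choices.

```python
import os
import shutil
import tempfile

def publish_atomically(src: str, dest: str) -> None:
    """Copy src next to dest under a temporary name, then rename it into place."""
    tmp = dest + ".tmp"        # must live on the same file system as dest
    shutil.copyfile(src, tmp)  # slow part: readers never see a partial dest
    os.replace(tmp, dest)      # fast part: atomic rename, overwrites dest

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "infile")
    dest = os.path.join(workdir, "outfile")
    with open(src, "w") as f:
        f.write("new contents\n")
    publish_atomically(src, dest)
    with open(dest) as f:
        published = f.read()
    leftovers = sorted(os.listdir(workdir))
```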
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks atomic renames (they are dirty updates IIRC) - but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You could rely on this without worrying about a partial state, since the source file is already created and closed.
HDFS 2.x onwards supports proper atomic renames (via a new API call) that have replaced the earlier version's dirty ones. This is also the default behavior of rename if you use the FileContext APIs.


Syncing files on HDFS that have the same size but different contents

I am trying to sync files from one Hadoop cluster to another using distcp and the Airbnb reair utility, but neither of them works as expected.
If the file size is the same on source and destination, both of them fail to update the file even if its contents have changed (the checksum also differs), unless the overwrite option is used.
I need to keep around 30 TB of data in sync, so reloading the complete dataset every time is not feasible.
Could anyone please suggest how I can bring the two datasets in sync when the file size is the same (the count in the source has changed) but the checksum differs?
The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, this has the known shortcoming of being incompatible between different storage implementations, and even incompatible between HDFS instances that have different internal block/chunk settings. Specifically, that FileChecksum bakes in the structure of having, for example, 512-bytes-per-chunk and 128MB-per-block.
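To make the layout dependence concrete, here is a toy Python model of an "MD5 of per-block MD5s of per-chunk CRCs" checksum. It is a simplified sketch, not the actual HDFS byte format: it uses `zlib.crc32` rather than whatever CRC variant a given cluster is configured with, and tiny chunk and block sizes so the effect shows up on a few kilobytes of data. The function name is made up for the example.

```python
import hashlib
import zlib

def layout_dependent_checksum(data: bytes, bytes_per_chunk: int, block_size: int) -> str:
    """Toy MD5-of-MD5-of-CRC checksum; the result depends on chunk and block sizes."""
    block_digests = []
    for block_start in range(0, len(data), block_size):
        block = data[block_start:block_start + block_size]
        # One 4-byte CRC per chunk within this block.
        chunk_crcs = b"".join(
            zlib.crc32(block[i:i + bytes_per_chunk]).to_bytes(4, "big")
            for i in range(0, len(block), bytes_per_chunk)
        )
        block_digests.append(hashlib.md5(chunk_crcs).digest())
    # The file-level value is an MD5 over the per-block MD5s.
    return hashlib.md5(b"".join(block_digests)).hexdigest()

data = bytes(range(256)) * 16  # 4096 bytes of identical content in every call

base = layout_dependent_checksum(data, bytes_per_chunk=512, block_size=2048)
other_chunk = layout_dependent_checksum(data, bytes_per_chunk=1024, block_size=2048)
other_block = layout_dependent_checksum(data, bytes_per_chunk=512, block_size=4096)
```

The three values differ even though `data` is the same in every call, which is why such checksums cannot be compared across clusters with different settings.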
Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".
That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a particular question involving HDFS and the concat api.
Here it is. When concat talks about files, does it mean only "files created and managed by HDFS?" Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to
Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
Go directly to the datanodes and make local copies of the blocks using normal shell commands.
Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned. The original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop. The concat code put all those new blocks into a new HDFS file without having to pass the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
The base philosophy of HDFS is:
write-once, read-many
So it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop branch that allows it (the original version does not).
An alternative could be to use a non-standard HDFS implementation such as the MapR file system.
Go for HBase, which is built on top of Hadoop to support CRUD operations in the big-data Hadoop world.
If you cannot use a NoSQL database, there is no way to update HDFS files in place. The only option is to rewrite them.

Hadoop MapReduce streaming - Best methods to ensure I have processed all log files

I'm developing Hadoop MapReduce streaming jobs written in Perl to process a large set of logs in Hadoop. New files are continually added to the data directory, and there are 65,000 files in the directory.
Currently I'm running ls on the directory and keeping track of which files I have processed, but even the ls takes a long time. I need to process the files in as close to real time as possible.
Using ls to keep track seems less than optimal. Are there any tools or methods for keeping track of which logs have not been processed in a large directory like this?
You can rename the log files once processed by your program.
For example:
command: hadoop fs -mv
Once the files are renamed, you can easily tell the processed ones from those yet to be processed.
I think this would fix your issue.
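A rough Python sketch of the rename-based bookkeeping follows. The `.done` suffix and both function names are hypothetical choices for illustration; the only real command used is `hadoop fs -mv`, and the sketch merely builds its argument list rather than running it.

```python
from typing import Iterable, List

DONE_SUFFIX = ".done"  # hypothetical marker appended to processed files

def pending(names: Iterable[str]) -> List[str]:
    """Return the names that do not yet carry the processed marker."""
    return [name for name in names if not name.endswith(DONE_SUFFIX)]

def mark_done_command(path: str) -> List[str]:
    """Build the rename command that marks one HDFS file as processed."""
    return ["hadoop", "fs", "-mv", path, path + DONE_SUFFIX]

# Hypothetical listing of the log directory.
listing = ["access-001.log", "access-002.log.done", "access-003.log"]
todo = pending(listing)
command = mark_done_command("/logs/access-001.log")
```

The resulting list can be handed to `subprocess.run(command, check=True)` once a file has been processed successfully.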

Locking a directory in HDFS

Is there a way to acquire a lock on a directory in HDFS? Here's what I am trying to do:
I have a directory called ../latest/...
Every day I need to add fresh data to this directory, but before I copy new data in, I want to acquire a lock so that no one is using the directory while I copy new data into it.
Is there a way to do this in HDFS?
No, there is no way to do this through HDFS.
In general, when I have this problem, I copy the data into a random temp location and then move it once the copy is complete. This is nice because mv is pretty much instantaneous, while copying takes longer. That way, if you check whether anyone else is writing and then mv, the "lock" is held for a much shorter time:
1. Generate a random number
2. Put the data into a new folder in hdfs://tmp/$randomnumber
3. Check to see if the destination is OK (hadoop fs -ls perhaps)
4. hadoop fs -mv the data to the latest directory.
There is a slim chance that between 3 and 4 you might have someone clobber something. If that really makes you nervous, perhaps you can implement a simple lock in ZooKeeper. Curator can help you with that.
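The four steps above could be scripted roughly as follows. This is a sketch under stated assumptions: the temp root `/tmp`, the paths in the demonstration, the function name and the use of a UUID as the random component are all illustrative. The runner is injectable so the command sequence can be inspected without a cluster; by default it is `subprocess.run`. The race mentioned above between the check and the move is still present.

```python
import os
import posixpath
import subprocess
import uuid

def publish_to_hdfs(local_path, dest_dir, run=subprocess.run, tmp_root="/tmp"):
    """Stage a local file in a random HDFS temp dir, then move it into dest_dir."""
    tmp_dir = posixpath.join(tmp_root, uuid.uuid4().hex)             # step 1
    run(["hadoop", "fs", "-mkdir", "-p", tmp_dir], check=True)
    run(["hadoop", "fs", "-put", local_path, tmp_dir], check=True)   # step 2 (slow)
    run(["hadoop", "fs", "-ls", dest_dir], check=True)               # step 3
    staged = posixpath.join(tmp_dir, os.path.basename(local_path))
    run(["hadoop", "fs", "-mv", staged, dest_dir], check=True)       # step 4 (fast)
    return staged

# Dry run with a fake runner that only records the commands.
calls = []

def fake_run(cmd, check=True):
    calls.append(cmd)

staged_path = publish_to_hdfs("/data/part-0001", "/warehouse/latest", run=fake_run)
```

With the default runner, `check=True` makes any failing step raise, so the move is skipped if the upload or the destination check fails.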

Hadoop Distributed Cache - modify file

I have a file in the distributed cache. The driver class, based on the output of a job, updates this file and starts a new job. The new job needs these updates.
The way I currently do it is to replace the old distributed cache file with a new (updated) one.
Is there a way of broadcasting the diffs (between the old file and the new one) to all the task trackers which need the file?
Or is it the case that, after a job (the first one, in my case) has finished, all the directories and files specific to that job are deleted, so that it doesn't even make sense to think in this direction?
I think the distributed cache was not built with such a scenario in mind. It simply puts files on the local disk.
In your case I would suggest putting the file in HDFS and having all interested parties read it from there.
As an optimization, you can give this file a high replication factor so that it will be local to most of the tasks.
