Locking a directory in HDFS

Is there a way to acquire lock on a directory in HDFS? Here's what I am trying to do:
I've a directory called ../latest/...
Every day I need to add fresh data to this directory, but before I copy the new data in, I want to acquire a lock so that no one is using the directory while the new data is copied.
Is there a way to do this in HDFS?

No, there is no way to do this through HDFS.
In general, when I have this problem, I copy the data into a random temp location and then move it once the copy is complete. This is nice because mv is practically instantaneous, while copying takes longer. That way, if you check that no one else is writing and then mv, the window in which the "lock" is held is much shorter:
1. Generate a random number.
2. Put the data into a new folder in hdfs://tmp/$randomnumber.
3. Check that the destination is OK (hadoop fs -ls, perhaps).
4. hadoop fs -mv the data to the latest directory.
There is a slim chance that between steps 3 and 4 someone might clobber something. If that really makes you nervous, you can implement a simple lock in ZooKeeper; Apache Curator can help you with that.
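A minimal sketch of that pattern in Java, assuming Curator is on the classpath; the ZooKeeper connect string, the lock znode, and all paths below are made up for illustration:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestDirSwap {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path staging = new Path("/tmp/staging-" + System.currentTimeMillis());
        Path latest  = new Path("/data/latest");

        // Steps 1-2: copy the new data into a staging directory first (the slow part).
        fs.copyFromLocalFile(new Path("/local/newdata"), staging);

        // Steps 3-4: take a ZooKeeper lock only around the cheap rename.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        InterProcessMutex lock = new InterProcessMutex(zk, "/locks/latest");
        lock.acquire();
        try {
            // Move the staged data into the latest directory while holding the lock.
            fs.rename(staging, new Path(latest, staging.getName()));
        } finally {
            lock.release();
            zk.close();
        }
    }
}

The point of the design is that the lock is only held around the near-instant rename, never around the slow copy.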

Related

How to restore HDFS blocks moved out of /dataN/dfs/dn/current directory?

Due to an unfortunate series of events, a program moved blocks from
/dataN/dfs/dn/current/BP-XXXXXXX/current/finalized/subdirN/subdirN/blk_NNNNNNNNNN
into
/tmp/blk_NNNNNNNNNN
I don't have any logging from the program to tell where the original subdirN/subdirN/ directory was.
Is there any way to figure out where this block should be based on fsimage file, the block file itself, or some other metadata?
I was able to restore some blocks by looking for the corresponding *.meta file, but there are still some holes. Replication saved me from the worst of it, but I'm still missing 5 "mission critical" files I'd like to try and recover.
From hdfs fsck / I can tell what the missing blocks are, and what HDFS files they belonged to, but I can't tell where in the blockpool they should have been placed.
hdfs fsck / -delete is NOT a solution. I don't want to delete things, I want to try my hardest to recover the files, because I HAVE the blocks. I just don't know where they go.
$ hdfs version
Hadoop 2.6.0-cdh5.4.4
I'm not sure whether it is possible to do the restore manually, but you can try.
The subdirs are calculated in DatanodeUtil.idToBlockDir(...) using the following code:
int d1 = (int)((blockId >> 16) & 0xff);
int d2 = (int)((blockId >> 8) & 0xff);
String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP + DataStorage.BLOCK_SUBDIR_PREFIX + d2;
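So, given a recovered block file's numeric id, you can compute which subdirectory it should live in by reimplementing the same arithmetic; the block id below is just an example:

public class BlockDirLocator {
    public static void main(String[] args) {
        long blockId = 1073741825L;            // example id, i.e. a file named blk_1073741825
        int d1 = (int) ((blockId >> 16) & 0xff);
        int d2 = (int) ((blockId >> 8) & 0xff);
        // Matches the layout .../current/finalized/subdirD1/subdirD2/blk_<id>
        System.out.println("subdir" + d1 + "/subdir" + d2);
    }
}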
If the files were moved manually, the fsimage might still contain the block IDs. Use the hdfs oiv command (e.g. hdfs oiv -p XML -i &lt;fsimage&gt; -o fsimage.xml) to convert the fsimage to XML, then look up the block IDs for the missing file names.
Here is what I ended up doing to fix this. It won't work in all cases, but it worked in mine.
I took advantage of the fact that the input files use newline as the record separator, so the blocks in Hadoop could simply be concatenated together with the missing block. The order of the data doesn't matter to me, only that all the lines are there.
I retrieved all the blocks for the file (including the one that had been moved out of HDFS to a new location) and concatenated them together. Then I deleted the file from HDFS and did an hdfs dfs -put of the concatenated file to restore the contents.
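A rough sketch of that reassembly in Java, assuming the recovered block files are readable from the local filesystem of one machine; the block file paths and the destination HDFS path are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.FileInputStream;
import java.io.OutputStream;

public class ReassembleFile {
    public static void main(String[] args) throws Exception {
        // Local copies of the block files, including the one recovered from /tmp.
        String[] blocks = { "/recovery/blk_1073741825", "/recovery/blk_1073741830" };
        FileSystem fs = FileSystem.get(new Configuration());
        // Write the reassembled contents to a new HDFS path, then swap it in
        // for the corrupt file once it has been verified.
        try (OutputStream out = fs.create(new Path("/recovered/mission_critical.log"))) {
            for (String b : blocks) {
                try (FileInputStream in = new FileInputStream(b)) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
    }
}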
Not perfect, but it was effective. This saved me from having to reverse engineer anything, and also proved the easiest way to restore the data.
Thanks for the help. I'm sure there is useful information in here for the next person with this problem.

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a specific question involving HDFS and the concat API.
Here it is: when concat talks about files, does it mean only files created and managed by HDFS, or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to:
1. Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
2. Go directly to the datanodes and make local copies of the blocks using normal shell commands.
3. Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
4. Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned: the original and an updated copy. Essentially, I have put the data blocks on the datanodes without going through Hadoop, and the concat code has put all those new blocks into a new HDFS file without passing the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
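For reference, concat is exposed on DistributedFileSystem and takes HDFS paths that the NameNode already tracks as files; a rough sketch of its intended use, with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ConcatExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // concat is on DistributedFileSystem, not the generic FileSystem interface.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // The target and sources must already be HDFS files known to the NameNode;
        // the sources are appended to the target and then removed.
        Path target = new Path("/data/parts/part-00000");
        Path[] sources = { new Path("/data/parts/part-00001"),
                           new Path("/data/parts/part-00002") };
        dfs.concat(target, sources);
    }
}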
The base philosophy of HDFS is:
write-once, read-many
so it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop version that allows it (the original version doesn't).
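Where append is supported, it looks roughly like this (the path is illustrative, and the cluster must have append enabled):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Append to an existing file; this only adds bytes at the end, it never rewrites them.
        try (FSDataOutputStream out = fs.append(new Path("/data/current.log"))) {
            out.writeBytes("one more record\n");
        }
    }
}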
An alternative could be to use a non-standard HDFS-compatible file system such as the MapR file system: https://www.mapr.com/blog/get-real-hadoop-read-write-file-system#.VfHYK2wViko
Go for HBase, which is built on top of Hadoop and supports CRUD operations in the big data Hadoop world.
If you are not supposed to use a NoSQL database, then there is no way to update HDFS files; the only option is to rewrite them.

How to identify new files in HDFS

Just wondering if there is a way to identify new files that are added to a path in HDFS. For example, some files have already been present for some time, and today I added new files. I want to process only those new files. What is the best way to achieve this?
Thanks
You need to write some Java code to do this. These steps may help:
1. Before adding the new files, fetch the latest modification time in the directory (hadoop fs -ls /your-path). Let's call it mTime.
2. Next, upload the new files into the HDFS path.
3. Now filter for the files whose modification time is greater than mTime; these are the files to be processed. Make your program process only these files.
This is just a hint for developing your code. :)
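A minimal sketch of that filter step using the FileSystem API; the directory and the way mTime is supplied are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NewFileFinder {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/your-path");         // illustrative path
        long mTime = Long.parseLong(args[0]);      // cutoff recorded before the upload (epoch millis)
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isFile() && status.getModificationTime() > mTime) {
                System.out.println("process: " + status.getPath());
            }
        }
    }
}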
If it is MapReduce, then you can create the output directory with a timestamp appended on a daily basis, like this:
// timestamp_start: epoch seconds at midnight, for example 1427241600 (GMT) -- you can write logic to get the epoch time
FileOutputFormat.setOutputPath(job, new Path(hdfsFilePath + timestamp_start));
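If it helps, one way to compute that midnight epoch value, assuming Java 8 or later:

// epoch seconds at the most recent UTC midnight, e.g. 1427241600 for 2015-03-25
long timestamp_start = java.time.LocalDate.now(java.time.ZoneOffset.UTC)
        .atStartOfDay(java.time.ZoneOffset.UTC)
        .toEpochSecond();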

Hadoop MapReduce streaming - Best methods to ensure I have processed all log files

I'm developing Hadoop MapReduce streaming jobs written in Perl to process a large set of logs in Hadoop. New files are continually added to the data directory and there are 65,000 files in the directory.
Currently I'm using ls on the directory and keeping track of which files I have processed, but even the ls takes a long time. I need to process the files in as close to real time as possible.
Using ls to keep track seems less than optimal. Are there any tools or methods for keeping track of what logs have not been processed in a large directory like this?
You can rename the log files once processed by your program.
For example:
command: hadoop fs -mv numbers.map/part-00000 numbers.map/data
Once renamed, you can easily tell the processed files apart from the ones yet to be processed.
I thought this would fix your issue.
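The same idea with the Java FileSystem API, moving consumed files into a separate directory instead of renaming them in place; the paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MarkProcessed {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/logs/incoming/server1-20150325.log");   // illustrative
        // Move the file into a "processed" directory once the job has consumed it,
        // so a plain listing of /logs/incoming only shows unprocessed files.
        Path processedDir = new Path("/logs/processed");
        fs.mkdirs(processedDir);
        fs.rename(log, new Path(processedDir, log.getName()));
    }
}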

atomic hadoop fs move

While building the infrastructure for one of my current projects, I faced the problem of replacing already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which continuously generate logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes in length and 500-800 MB in size) from the log-servers, preprocessing them, and uploading them to the HDFS of our Hadoop cluster.
Preprocessing is done in 3 steps:
1. For each log-server, filter (in parallel) the received log chunk (the output file is about 60-80 MB).
2. Combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files).
3. Using the current mapping from an external DB, process the file from step 2 to obtain the final logfile, and put this file into HDFS.
The final logfiles are used as input for several periodic Hadoop applications running on the Hadoop cluster. In HDFS, logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Problem description:
The mapping used in step 3 changes over time, and we need to reflect these changes by recalculating step 3 and replacing the old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes), at least for the last 12 hours. Note that if the mapping has changed, the result of applying step 3 to the same input file may be significantly different (it will not be just a superset/subset of the previous result). So we need to overwrite the existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using the file while it is temporarily removed, the app may fail. The solution I use is to put the new file next to the old one; the files have the same name but different suffixes denoting their version. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
During its start (setup), any Hadoop application chooses the files with the most up-to-date versions and works with them. So even if an update is in progress, the application will not experience any problems, because no input file is removed.
Questions:
Do you know an easier approach to this problem that does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file that is currently being uploaded but is not yet complete (applications see the file in HDFS but don't know whether it is consistent). In the case of gzip files this can lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure this is the case for HDFS. Could you please advise whether HDFS has an atomic operation like mv on conventional local file systems?
Thanks in advance!
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks truly atomic renames (they are dirty updates, IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on it without worrying about a partial state, since the source file is already created and closed.
HDFS 2.x onwards supports a proper atomic rename (via a new API call) that replaces the earlier version's dirty one. It is also the default behavior of rename if you use the FileContext APIs.
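A minimal sketch of that FileContext rename, assuming HDFS 2.x; both paths are illustrative:

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class AtomicSwap {
    public static void main(String[] args) throws Exception {
        FileContext fc = FileContext.getFileContext();
        // Upload under a temporary name first, then rename into place.
        // OVERWRITE lets the rename replace an existing destination as part of
        // the same metadata operation.
        fc.rename(new Path("/spool/logs/2012-09-26.09.00.log.tmp"),
                  new Path("/spool/logs/2012-09-26.09.00.log"),
                  Options.Rename.OVERWRITE);
    }
}

This should mirror the local cp && mv trick from the question: the slow upload happens under the temporary name, and readers only ever see the final path appear or change in one rename.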
