How to delete an intermediate output file from HDFS - hadoop

I am trying to delete the intermediate output directory of a MapReduce program using
FileUtils.deleteDirectory(new File(tempFiles));
but this command doesn't delete directories from HDFS.

MapReduce does not write intermediate results to HDFS; it writes them to local disk.
Whenever a mapper produces output, it first goes into an in-memory buffer where partitioning and sorting take place; when the buffer exceeds its default capacity, the results are spilled to local disk.
In short, the output produced by a mapper goes to the local file system.
The only case in which a mapper writes its output to HDFS is when the driver class is explicitly configured not to use any reducer.
In that case the mapper output is the final output, so we wouldn't call it intermediate.
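For reference, here is a minimal sketch of such a map-only driver (the class name, identity mapper, and paths are placeholders, not from the question). With zero reducers, each mapper's output is written directly to HDFS under the job's output path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(Mapper.class);   // identity mapper, used here only as a placeholder
        job.setNumReduceTasks(0);           // zero reducers: each mapper writes its output directly to HDFS
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}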

You are using the wrong API! You should be using Hadoop's FileUtil instead of Commons IO's FileUtils; the latter is used for file manipulation on local filesystems.
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileUtil.html#fullyDelete
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html
I understand that one can easily pick the wrong one due to the similar names. Your current code is looking in your local file system to delete that path, without any effect on HDFS.
Sample code:
FileUtil.fullyDelete(new File("pathToDir"));
On the other hand, you can make use of the FileSystem API itself, which has a delete method. You need to get the FileSystem object first, e.g.:
filesystem.delete(new Path("pathToDir"), true);
The second argument is the recursive flag.
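For example, a minimal sketch of the FileSystem approach (the temp path here is a placeholder):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();             // picks up core-site.xml / hdfs-site.xml from the classpath
FileSystem fs = FileSystem.get(conf);                  // HDFS, if fs.defaultFS points to it
Path tempDir = new Path("/tmp/intermediate-output");   // placeholder path
if (fs.exists(tempDir)) {
    fs.delete(tempDir, true);                          // true = delete recursively
}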

Related

How to process an open file using MapReduce framework

I have a file that gets aggregated and written into HDFS. This file will be open for an hour before it is closed. Is it possible to process this file using the MapReduce framework while it is open? I tried it, but it's not picking up all the appended data. I could query the data in HDFS and it is available, but not when read by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can read what was physically flushed. Since close() makes the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or command line).
As a solution, I would recommend periodically closing the current file and then opening a new one (with an incremented index suffix). You can run your MapReduce job over multiple files. You would still end up with some data missing from the most recent file, but at least you can control that by the frequency of your file "rotation".
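As an illustration of this rotation idea (the base path and the numeric suffix convention are assumptions, not part of the original answer), a minimal sketch could look like:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RotatingHdfsWriter {
    private final FileSystem fs;
    private final String basePath;   // e.g. "/logs/current" (placeholder)
    private FSDataOutputStream out;
    private int index = 0;

    public RotatingHdfsWriter(Configuration conf, String basePath) throws IOException {
        this.fs = FileSystem.get(conf);
        this.basePath = basePath;
        rotate();
    }

    // Close the current file (making its data fully visible to readers) and open the next one.
    public final void rotate() throws IOException {
        if (out != null) {
            out.close();
        }
        out = fs.create(new Path(basePath + "." + index++));
    }

    public void write(byte[] record) throws IOException {
        out.write(record);
    }

    public void close() throws IOException {
        out.close();
    }
}
A timer or scheduler calling rotate() every few minutes would then give you a series of closed files that MapReduce can safely read.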

Reading contents of blocks directly in a datanode

In HDFS, the blocks are distributed among the active nodes/slaves. The contents of the blocks are simple text, so is there any way to see, read, or access the blocks present on each data node?
As an entire file or to read a single block (say block number 3) out of sequence?
You can read the file via various mechanisms including the Java API but you cannot start reading in the middle of the file (for example at the start of block 3).
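For what it's worth, here is a minimal sketch of reading an HDFS file through the Java API mentioned above (the path is a placeholder); it streams the file from the beginning rather than addressing an individual block:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
try (FSDataInputStream in = fs.open(new Path("/user/me/data.txt"));   // placeholder path
     BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);   // lines come back in order, regardless of which block holds them
    }
}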
Hadoop reads a block of data and feeds each line to the mapper for further processing. Also, the Hadoop client gets the blocks related to a file from different DataNodes before concatenating them. So it should be possible to get the data from a particular block.
The Hadoop client might be a good place to start looking at the code. But HDFS provides a file system abstraction, and I'm not sure what the requirement would be for reading the data from a particular block.
Assuming you have SSH access (and appropriate permissions) to the DataNodes, you can cd to the path where the blocks are stored and read the blocks stored on that node (e.g., do a cat BLOCK_XXXX). The configuration parameter that tells you where the blocks are stored is dfs.datanode.data.dir, which defaults to file://${hadoop.tmp.dir}/dfs/data.
Caveat: block names are encoded by HDFS based on their internal block IDs, so just by looking at a block's name you cannot tell which file it belongs to.
Finally, I assume you want to do this for debugging purposes or just to satisfy your curiosity. Normally there is no reason to do this, and you should just use the HDFS web UI or command-line tools to look at the contents of your files.

How to achieve desired block size with Hadoop with data on local filesystem

I have a 2TB sequence file that I am trying to process with Hadoop, residing on a cluster set up to use a local (Lustre) filesystem for storage instead of HDFS. My problem is that no matter what I try, I am always forced to have about 66000 map tasks when I run a map/reduce job with this data as input. This seems to correspond to a block size of 2TB/66000 =~ 32MB. The actual computation in each map task executes very quickly, but the overhead associated with so many map tasks slows things down substantially.
For the job that created the data and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512MB). I also found suggestions that said to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on Lustre. It seems that any parameters having to do with block size are ignored for the local file system.
I know that using Lustre instead of HDFS is a non-traditional use of Hadoop, but this is what I have to work with. I'm wondering if others have experience with this, or have any ideas to try beyond what I have mentioned.
I am using cdh3u5 if that is useful.
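For context, the two properties mentioned above are usually set per job in the driver's Configuration before submission; this is just a sketch of that step, and whether the local (Lustre) filesystem actually honors them is exactly the open question here:
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 536870912L);       // 512 MB, as in the question
conf.setLong("fs.local.block.size", 536870912L);  // 512 MB, as in the question
// conf is then passed to the job that reads the sequence file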

atomic hadoop fs move

While building the infrastructure for one of my current projects, I've faced the problem of replacing already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which continuously generate logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes long and 500-800 MB in size) from the log-servers, preprocessing them, and uploading them to the HDFS of our Hadoop cluster.
Preprocessing is done in 3 steps:
1. for each log-server: filter (in parallel) the received log chunk (the output file is about 60-80 MB)
2. combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files)
3. using the current mapping from an external DB, process the file from step 2 to obtain the final logfile and put this file into HDFS.
The final logfiles are used as input for several periodic Hadoop applications running on the Hadoop cluster. In HDFS, logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Problem description:
The mapping used in step 3 changes over time, and we need to reflect these changes by recomputing step 3 and replacing old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes), at least for the last 12 hours. Please note that, if the mapping has changed, the result of applying step 3 to the same input file may be significantly different (it will not just be a superset/subset of the previous result). So we need to overwrite existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using the file that is temporarily removed, the app may fail. The solution I use is to put a new file next to the old one; the files have the same name but different suffixes denoting their versions. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
Any Hadoop application, during its start (setup), chooses the files with the most up-to-date versions and works with them. So even if some update is going on, the application will not experience any problems, because no input file is removed.
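For illustration only, a minimal sketch of that version-selection step, assuming the .vN suffix convention shown in the layout above (the directory is a placeholder):
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Pick the highest .vN version of each logfile under the spool directory.
FileSystem fs = FileSystem.get(new Configuration());
Map<String, Integer> latest = new HashMap<String, Integer>();
for (FileStatus status : fs.listStatus(new Path("/spool/logs"))) {   // placeholder directory
    String name = status.getPath().getName();                        // e.g. 2012-09-26.09.00.log.v2
    int split = name.lastIndexOf(".v");
    if (split < 0) continue;
    String base = name.substring(0, split);
    int version = Integer.parseInt(name.substring(split + 2));
    Integer current = latest.get(base);
    if (current == null || version > current) {
        latest.put(base, version);
    }
}
// latest now maps each logfile name to its most recent version number.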
Questions:
Do you know of an easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using an HDFS file which is currently being uploaded but is not yet complete (applications see this file in HDFS but don't know whether it is consistent). In the case of gzip files, this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I'm not sure that this is the case for HDFS. Could you please advise whether HDFS has some atomic operation like mv on conventional local file systems?
Thanks in advance!
IMO, the file rename approach is absolutely fine to go with.
HDFS, up to 1.x, lacks atomic renames (they are dirty updates, IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on it without worrying about a partial state, since the source file is already created and closed.
HDFS 2.x onwards supports proper atomic renames (via a new API call) that replace the earlier version's dirty one. This is also the default behavior of rename if you use the FileContext APIs.
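A minimal sketch of that FileContext rename (the paths are placeholders); the OVERWRITE option atomically replaces an existing destination:
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

FileContext fc = FileContext.getFileContext();    // default filesystem from the configuration (HDFS here)
fc.rename(new Path("/spool/logs/2012-09-26.09.00.log.tmp"),   // placeholder temp upload
          new Path("/spool/logs/2012-09-26.09.00.log"),
          Options.Rename.OVERWRITE);              // replace the old file in a single rename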

Is it possible to send part of the mapper output to the reducer, while just writing the other part to HDFS, in Hadoop?

I want to write part of the mapper output to a folder, say folder A, in HDFS. I want the other part of the output to be processed by the reducer. Is this possible? I am aware of MultipleOutputs. Is this possible using MultipleOutputs?
Thanks!
Yes, it is possible with MultipleOutputs: according to the docs, any output passed via MultipleOutputs during the map stage is ignored by the reducer, so this is exactly what you want. I wrote a small example on my GitHub and I hope you'll find it useful.
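For illustration only (this is not the GitHub example referenced above), a minimal sketch of the MultipleOutputs approach; the named output "side", the "A" subfolder, and the filtering condition are assumptions:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().startsWith("#")) {                 // hypothetical condition for the "other part"
            // Written under the job output directory, subfolder "A"; never seen by the reducer.
            mos.write("side", new Text(key.toString()), value, "A/part");
        } else {
            context.write(new Text(key.toString()), value);     // goes through the shuffle to the reducer
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
In the driver you would also register the named output before submitting the job, e.g. MultipleOutputs.addNamedOutput(job, "side", TextOutputFormat.class, Text.class, Text.class).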
You can just write output directly to HDFS from your mapper implementation - just create a FileSystem object using the context's configuration, then create a file, write to it, and remember to close it:
public void cleanup(Context context) throws IOException, InterruptedException {
    // Build a FileSystem from the job configuration and write a side file from the mapper.
    FileSystem fs = FileSystem.get(context.getConfiguration());
    PrintStream ps = new PrintStream(fs.create(
            new Path("/path/to/output", "map-output")));
    ps.println("test");
    ps.close();
}
Other things to consider: each file needs to be uniquely named in HDFS, so you could suffix the filename with the mapper ID number, but you also need to account for speculative execution (your mapper task instance may be running in two locations, both trying to write to the same file in HDFS).
You're normally abstracted away from this because the output committer creates files in a temporary HDFS directory named with the task ID and attempt number, and only moves them to the correct location and file name upon commit of that task attempt. There's no way around this problem when running map-side (where output normally goes to the local file system) without either turning off speculative execution or creating multiple files in HDFS, one for each attempt.
So a more 'complete' solution would look like:
FileSystem fs = FileSystem.get(context.getConfiguration());
PrintStream ps = new PrintStream(fs.create(new Path(
        "/path/to/output",
        String.format("map-output-%05d-%d",
                context.getTaskAttemptID().getTaskID().getId(),
                context.getTaskAttemptID().getId()))));
ps.println("test");
ps.close();
MultipleOutputs would help you on the reduce side, but I don't think it would work map-side, as there is no output committer and the work directory is not in HDFS.
Of course, if this were a map-only job, then MultipleOutputs would work. So an alternative approach would be to run a map-only job and then use the desired portion of the output in a secondary job (with an identity mapper) - it depends on how much data you're moving, I guess.
