I have a use case wherein I need to process a huge set of files stored in HDFS. The files are non-splittable and each needs to be processed by a single map task. I wish to schedule each map task on a tasktracker where its data is already available. How can I do it? Currently, I have a file that contains a list of filenames; each map gets one line of it via NLineInputFormat. The map task then accesses the file via FSDataInputStream and works with it. Is there a way to ensure this map task runs on the node where the file is available? I would like to avoid WholeFileInputFormat, as I read it would load the file contents as the value and feed them to the mapper.
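A sketch of one possible approach, assuming the new mapreduce API: a custom input format that creates one non-splittable split per file and reports the hosts holding the file's first block, so the scheduler can place each map task data-local without ever loading the file contents. The class name below is made up for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One split per whole file; each split carries the hosts of the file's first block
// so the framework can schedule the map task where the data lives.
public class FilenameInputFormat extends FileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the files are non-splittable
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            BlockLocation[] blocks = status.getPath()
                    .getFileSystem(job.getConfiguration())
                    .getFileBlockLocations(status, 0, status.getLen());
            String[] hosts = blocks.length > 0 ? blocks[0].getHosts() : new String[0];
            splits.add(new FileSplit(status.getPath(), 0, status.getLen(), hosts));
        }
        return splits;
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // Emits a single record per split: the file path as the key.
        return new RecordReader<Text, NullWritable>() {
            private Text key;
            private boolean done = false;

            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                key = new Text(((FileSplit) s).getPath().toString());
            }
            public boolean nextKeyValue() {
                if (done) return false;
                done = true;
                return true;
            }
            public Text getCurrentKey() { return key; }
            public NullWritable getCurrentValue() { return NullWritable.get(); }
            public float getProgress() { return done ? 1.0f : 0.0f; }
            public void close() { }
        };
    }
}

The mapper would still open the path with FSDataInputStream exactly as before; the only change is that the task is, where possible, scheduled on a node holding the data.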
I have a file that gets aggregated and written into HDFS. This file will be open for an hour before it is closed. Is it possible to compute over this file using the MapReduce framework while it is open? I tried it, but it's not picking up all the appended data. I could query the data in HDFS and it's available, but not when done by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can read what was physically flushed. Since close() makes the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or the command line).
As a solution, I would recommend periodically closing the current file and then opening a new one (with an incremented index suffix). You can run your MapReduce job over multiple files. You would still end up with some data missing from the most recent file, but at least you can control that by the frequency of your file "rotation".
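A minimal sketch of such rotation, assuming Hadoop 2.x (where FSDataOutputStream.hflush() is available); the path, interval, and readNextRecord() are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RotatingHdfsWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long rotationIntervalMs = 5 * 60 * 1000L; // rotate every 5 minutes (illustrative)
        int index = 0;

        while (true) {
            // Next file in the series: /data/events/part-0000, part-0001, ...
            FSDataOutputStream out = fs.create(new Path(String.format("/data/events/part-%04d", index++)));
            long deadline = System.currentTimeMillis() + rotationIntervalMs;

            while (System.currentTimeMillis() < deadline) {
                out.writeBytes(readNextRecord() + "\n");
                out.hflush(); // pushes data to the datanodes so readers can see it
            }

            // close() performs the final flush; only closed files are fully visible to MapReduce.
            out.close();
        }
    }

    private static String readNextRecord() {
        return "record"; // placeholder for the real aggregation source
    }
}

The MapReduce job then runs over all files except the one currently being written.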
I am trying to delete the intermediate output directory of a MapReduce program using
FileUtils.deleteDirectory(new File(tempFiles));
but this command doesn't delete directories from HDFS.
MapReduce does not write intermediate results to HDFS; it writes them to local disk.
Whenever a mapper produces output, it first goes into an in-memory buffer where partitioning and sorting take place. When the buffer exceeds its threshold, the results are spilled to local disk.
In summary, the output produced by the mapper goes to the local file system.
There is only one case in which mappers write their output to HDFS: when the driver class explicitly sets the job to use no reducers.
In that case the mapper output is the final output, so we wouldn't call it intermediate.
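For reference, a minimal sketch of that map-only case (class name and paths are illustrative): with zero reducers, the map output is written straight to the job's HDFS output directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);

        job.setMapperClass(Mapper.class); // identity mapper, for illustration
        job.setNumReduceTasks(0);         // no reducers: map output goes directly to HDFS

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}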
You are using the wrong API! You should be using Apache Hadoop's FileUtil instead of FileUtils. The latter is used for file manipulation on local filesystems.
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileUtil.html#fullyDelete
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html
I understand that one can easily pick the wrong one because of the similar names. Your current code is looking in your local file system to delete that path, with no effect on HDFS.
Sample code :
FileUtil.fullyDelete(new File("pathToDir"));
On the other hand, you can make use of the FileSystem API itself, which has a delete method. You need to get hold of the FileSystem object, though. E.g.:
filesystem.delete(new Path("pathToDir"), true);
The second argument is the recursive flag.
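A short, self-contained sketch of that approach (the path is made up; the cluster configuration is assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteHdfsDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // handle to the default (HDFS) filesystem

        Path tempDir = new Path("/user/me/job-temp"); // hypothetical intermediate directory
        if (fs.exists(tempDir)) {
            fs.delete(tempDir, true);                 // true = delete recursively
        }
    }
}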
So I'm writing an MR job to read hundreds of files from an input folder. Since all the files are compressed, instead of using the default TextInputFormat I was using the WholeFileReadFormat from an online code source.
So my question is: does the Mapper process multiple input files in sequence? I mean, if I have three files A, B, C, and since I'm reading the whole file content as the map input value, will MapReduce process the files in the order A -> B -> C, meaning that only after finishing A will the Mapper start to process B?
Actually, I'm kind of confused about the concepts of map job and map task. In my understanding a map job is the same thing as a Mapper, and a map job contains several map tasks; in my case, each map task will read in a single file. But what I don't understand is that I think map tasks are executed in parallel, so all the input files should be processed in parallel, which seems to be a paradox...
Can anyone please explain it to me?
Is it possible to execute a Hadoop Streaming job that has no input file?
In my use case, I'm able to generate the necessary records for the reducer with a single mapper and the execution parameters. Currently I'm using a stub input file with a single line; I'd like to remove this requirement.
We have 2 use cases in mind.
1) I want to distribute the loading of files into HDFS from a network location available to all nodes. Basically, I'm going to run ls in the mapper and send the output to a small set of reducers.
2) We are going to be running fits leveraging several different parameter ranges against several models. The model names do not change and will go to the reducer as keys, while the list of tests to run is generated in the mapper.
According to the docs this is not possible. The following are required parameters for execution:
input directoryname or filename
output directoryname
mapper executable or JavaClassName
reducer executable or JavaClassName
It looks like providing a dummy input file is the way to go currently.
I have a reducer that needs to output results to different directories so that we can later use the output as input to Hive as a partitioned table. (Hive creates partitions based on folder name.) In order to write to these locations, we are currently not using any Hadoop framework; we are just writing to separate locations "behind Hadoop's back", so to speak. In other words, we are not using Hadoop's API to output these files.
We had issues with mapred.reduce.tasks.speculative.execution set to true. I understand this to be because multiple task attempts for the same task were writing to the same location.
Is there a way to correctly use Hadoop's API to output to several different folders from the same reducer such that I can also use mapred.reduce.tasks.speculative.execution=true ? (I know about MultipleOutputs, which I'm not sure supports speculative execution.)
If so, is there a way to do that and output to S3?
The way Hadoop typically deals with speculative execution is to create an output folder for each task attempt (in a _temporary subfolder of the actual HDFS output directory).
The OutputCommitter for the OutputFormat then simply moves the contents of the temp task folder to the actual output folder when a task succeeds, and deletes the temp task folders of failed / aborted attempts (this is the default behavior for most FileOutputFormats).
So for your case, if you are writing to a folder outside of the job output folder, then you'll need to extend / implement your own output committer. I'd follow the same principles when creating the files: include the full task ID (including the attempt ID) to avoid name collisions when speculatively executing. How you track the files created in your job and manage their deletion in the abort / fail scenarios is up to you (maybe some file globbing on the task IDs?). A rough skeleton follows.
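Purely as a sketch: this assumes the reducers write their side files under an external base path and embed the full task attempt ID in each file name, so the committer can glob for and delete the files of aborted or speculative attempts. The base path and naming pattern below are assumptions, not a drop-in committer.

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class SideOutputCommitter extends FileOutputCommitter {

    // Hypothetical external location the reducers write to "behind Hadoop's back".
    private final Path sideOutputBase = new Path("/warehouse/mytable");

    public SideOutputCommitter(Path outputPath, TaskAttemptContext context) throws IOException {
        super(outputPath, context);
    }

    @Override
    public void abortTask(TaskAttemptContext context) throws IOException {
        super.abortTask(context);
        // Remove side files written by this failed or speculative attempt.
        // Assumes file names end with the full task attempt ID, e.g. ...-attempt_..._r_000003_1
        String attemptId = context.getTaskAttemptID().toString();
        FileSystem fs = sideOutputBase.getFileSystem(context.getConfiguration());
        FileStatus[] matches = fs.globStatus(new Path(sideOutputBase, "*/*" + attemptId));
        if (matches != null) {
            for (FileStatus status : matches) {
                fs.delete(status.getPath(), false);
            }
        }
    }
}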
You might be interested in this: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
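For completeness, a minimal MultipleOutputs sketch that writes into Hive-style partition subfolders of the job's own output directory (the partition column name dt is made up); because everything stays under the job output directory, the standard FileOutputCommitter still handles speculative attempts:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitioningReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Writes to <job output dir>/dt=<key>/part-r-xxxxx, a Hive-style partition folder.
            out.write(NullWritable.get(), value, "dt=" + key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}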