Inside my map function, I am trying to read a file from the DistributedCache and load its contents into a hash map.
The system output log of the MapReduce job prints the contents of the hash map, which shows that the task found the file, loaded it into the data structure, and iterated over it successfully.
However, I still get the error below after a few minutes of running the MR job:
13/01/27 18:44:21 INFO mapred.JobClient: Task Id : attempt_201301271841_0001_m_000001_2, Status : FAILED File does not exist: /app/hadoop/jobs/nw_single_pred_in/predict
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.(
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.Child$
at Method)
at org.apache.hadoop.mapred.Child.main(
Here's the portion that initializes a Path with the location of the file to be placed in the distributed cache:
// inside main, surrounded by try catch block, yet no exception thrown here
Configuration conf = new Configuration();
// rest of the stuff that relates to conf
Path knowledgefilepath = new Path(args[3]); // args[3] = /app/hadoop/jobs/nw_single_pred_in/predict/knowledge.txt
DistributedCache.addCacheFile(knowledgefilepath.toUri(), conf);
// rest of job settings
job.waitForCompletion(true); // kick off load
This one is inside the map function:
try {
    System.out.println("Inside try !!");
    Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    Path cfile = new Path(files[0].toString()); // only one file
    System.out.println("File path : " + cfile.toString());
    CSVReader reader = new CSVReader(new FileReader(cfile.toString()), '\t');
    while ((nline = reader.readNext()) != null) {
        data.put(nline[0], Double.parseDouble(nline[1])); // load into a hashmap
    }
} catch (Exception e) {
    // handle exception
}
Help appreciated.
I did a fresh installation of Hadoop and ran the job with the same jar, and the problem disappeared. It seems to have been a bug rather than a programming error.
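For reference, here is a minimal sketch of the loading step on its own, using only the standard library in place of CSVReader. The class name TsvLookupLoader and its load method are hypothetical and not part of the original job; the idea is only that the tab-separated parsing can be exercised outside Hadoop before it is wired into a mapper.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses "key<TAB>number" lines into a map.
class TsvLookupLoader {

    static Map<String, Double> load(Reader source) {
        Map<String, Double> data = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 2) {
                    continue; // skip blank or malformed lines
                }
                data.put(fields[0], Double.parseDouble(fields[1]));
            }
        } catch (IOException e) {
            // rethrow unchecked so callers are not forced to catch
            throw new UncheckedIOException(e);
        }
        return data;
    }

    public static void main(String[] args) {
        // In-memory sample standing in for knowledge.txt
        Map<String, Double> data = load(new StringReader("alpha\t0.5\nbeta\t2.0\n"));
        System.out.println(data);
    }
}
```

In a mapper this would typically be called once from setup() with new FileReader(cfile.toString()), rather than once per record inside map().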


SplitFile gives casting error

I have placed an MP4 file on HDFS and am trying to analyze it directly. I have a class named VideoRecordReader, which throws a casting error. The error output is below.
You have loaded library /usr/local/lib/ which might have disabled stack guard. The VM will try to fix the stack guard now.
attempt_201607261400_0011_m_000000_1: It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
16/07/26 17:32:27 INFO mapred.JobClient: Task Id : attempt_201607261400_0011_m_000000_2, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
at
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at
at org.apache.hadoop.mapred.Child$
at Method)
at
at
at org.apache.hadoop.mapred.Child.main(
Here is the code of SplitFile.
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    start = 0;
    end = 1;
    final Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file); // right-hand side missing in the original post; fs.open(file) is assumed
    filename = split.getPath().getName();
    byte[] b = new byte[fileIn.available()];
    video = new VideoObject(b);
}
Kindly help. Thank you, and best regards.
It's likely you're mixing the mapred and mapreduce APIs.
It's complaining that you're trying to cast org.apache.hadoop.mapreduce.lib.input.FileSplit to org.apache.hadoop.mapred.FileSplit.
As a general rule, don't mix imports from the two APIs.
So check whether org.apache.hadoop.mapred.FileSplit has been imported, and if so change it to org.apache.hadoop.mapreduce.lib.input.FileSplit.
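To see why the cast fails, it may help to remember that two classes with the same simple name in different packages are unrelated types. The sketch below uses hypothetical stand-in classes (OldApi.FileSplit and NewApi.FileSplit) instead of the real Hadoop ones, so it runs without Hadoop on the classpath; it is an illustration of the failure mode, not of Hadoop's own code.

```java
// Hypothetical stand-ins for two unrelated classes that share a simple name,
// as org.apache.hadoop.mapred.FileSplit and
// org.apache.hadoop.mapreduce.lib.input.FileSplit do.
class SplitCastDemo {

    static class OldApi {
        static class FileSplit { }
    }

    static class NewApi {
        static class FileSplit { }
    }

    // Returns true only if the object really is an OldApi.FileSplit.
    static boolean castsToOldApi(Object split) {
        try {
            OldApi.FileSplit s = (OldApi.FileSplit) split;
            return s != null;
        } catch (ClassCastException e) {
            // same simple name, different type: the cast is rejected at runtime
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(castsToOldApi(new NewApi.FileSplit())); // false
        System.out.println(castsToOldApi(new OldApi.FileSplit())); // true
    }
}
```

The compiler accepts the cast because the static type is Object, which is why the mismatch only shows up when the task runs.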

Can Hadoop MultipleInputs.addInputPath be made to work recursively?

Recent versions of Hadoop already easily support nested input directories using FileInputFormat.setInputDirRecursive, which relies on the mapreduce.input.fileinputformat.input.dir.recursive configuration key.
It's also possible to specify multiple mapper/input-directory combinations using MultipleInputs.addInputPath.
But can I do both at the same time? In other words, is there a way to specify multiple mapper/input-directory combinations where the input directories are included recursively?
A concrete example:
I have the following directory structure:
I tried something like this:
Job job = Job.getInstance(conf);
FileInputFormat.setInputDirRecursive(job, true);
MultipleInputs.addInputPath(job, new Path("/dataset1"), TextInputFormat.class,
MultipleInputs.addInputPath(job, new Path("/dataset2"), TextInputFormat.class,
But then I get an exception along the lines of Error: 's3://bucketname/dataset1/subdir1' is a directory
This is running in Amazon EMR under Hadoop 2.4.0.
Edit: Hadoop version is 2.4.0, not 2.6.0
I'm not sure about S3, but this behavior is expected: each input path should point to a file, not a directory.
Try this.
Method 1
public static void addInputPathRecursively(FileSystem fs, Path path, PathFilter inputFilter, Job job, boolean swithc) throws IOException {
    for (FileStatus stat : fs.listStatus(path, inputFilter)) {
        if (stat.isDirectory()) {
            // pass the flag along so nested files get the same mapper
            addInputPathRecursively(fs, stat.getPath(), inputFilter, job, swithc);
        } else if (swithc) {
            MultipleInputs.addInputPath(job, stat.getPath(), TextInputFormat.class, Mapper1.class);
        } else {
            MultipleInputs.addInputPath(job, stat.getPath(), TextInputFormat.class, Mapper2.class);
        }
    }
}
In the driver class you can call it accordingly.
addInputPathRecursively(fs, datset1path, new FileFilter(conf, fs,
new String[] { txt }), job,true);
addInputPathRecursively(fs, datset2path, new FileFilter(conf, fs,
new String[] { txt }), job,false);
This is only an example, but a working one. Configure the PathFilter appropriately if you want to match paths with a regular expression.
Method 2
Setting this should do the magic too.
FileInputFormat.setInputDirRecursive(job, true);
Method 3
Bypass inside the mapper and process at line level. (Not a good idea!)

FileNotFoundException when using DistributedCache to access MapFile

I am using Hadoop CDH 4.7 running in YARN mode. There is a MapFile at hdfs://test1:9100/user/tagdict_builder_output/part-00000
and it contains two files, index and data.
I used the following code to add it to distributedCache:
Configuration conf = new Configuration();
Path tagDictFilePath = new Path("hdfs://test1:9100/user/tagdict_builder_output/part-00000");
DistributedCache.addCacheFile(tagDictFilePath.toUri(), conf);
Job job = new Job(conf);
And I initialize a MapFile.Reader in the Mapper's setup method:
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles != null && localFiles.length > 0 && localFiles[0] != null) {
        String mapFileDir = localFiles[0].toString();
        // log: "mapFileDir " + mapFileDir
        FileSystem fs = FileSystem.get(context.getConfiguration());
        reader = new MapFile.Reader(fs, mapFileDir, context.getConfiguration());
    } else {
        throw new IOException("Could not read lexicon file in DistributedCache");
    }
}
But it throws FileNotFoundException:
Error: File does not exist: /home/mps/cdh/local/usercache/mps/appcache/application_1405497023620_0045/container_1405497023620_0045_01_000012/part-00000/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(
at aps.Cdh4MD5TaglistPreprocessor$Vectorizer.setup(
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.YarnChild$
at Method)
at org.apache.hadoop.mapred.YarnChild.main(
I've also tried /user/tagdict_builder_output/part-00000 as the path, and using a symbolic link, but neither works. How can I solve this? Many thanks.
As it says here:
Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks.
So you should try to access your files through the File object:
File f = new File("./part-00000");
My last suggestion:
DistributedCache.addCacheFile(new URI(tagDictFilePath.toString() + "#cache-file"), conf);
// in mapper
File f = new File("cache-file");
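The "#cache-file" suffix in the suggestion above is an ordinary URI fragment, which DistributedCache uses as the name of the symlink. As a small sanity check, the sketch below shows how such a URI splits into path and fragment using only java.net.URI. The class name CacheUriDemo and the helper withLinkName are hypothetical, introduced just for this illustration.

```java
import java.net.URI;

// Hypothetical helper showing how a cache URI with a symlink name is parsed.
class CacheUriDemo {

    // Appends "#linkName" so the fragment carries the desired symlink name.
    static URI withLinkName(String hdfsPath, String linkName) {
        return URI.create(hdfsPath + "#" + linkName);
    }

    public static void main(String[] args) {
        URI uri = withLinkName(
                "hdfs://test1:9100/user/tagdict_builder_output/part-00000", "cache-file");
        System.out.println(uri.getPath());     // /user/tagdict_builder_output/part-00000
        System.out.println(uri.getFragment()); // cache-file
    }
}
```

The fragment is what the mapper later passes to new File(...), so the two names need to match exactly.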

Unable to load OpenNLP sentence model in Hadoop map-reduce job

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:
public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;
    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);
    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
    } catch (IOException e) {
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }
    logger.info("number of sentences: " + sentences.length);
When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:
I've verified that the model file exists in the location sentenceModelPath refers to.
I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
Hadoop is just running on my local machine.
Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?
Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This will try to load a file from the classpath - neither the file in HDFS nor on the client local file system is part of the classpath at mapper / reducer runtime, so this is why you're seeing the Null error (the getResourceAsStream() returns null if the resource cannot be found).
To get around this you have a number of options:
Amend your code to load the file from HDFS:
modelIn = FileSystem.get(context.getConfiguration()).open(
new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies the file from the local file system to HDFS, and back down to the local directory of the running mapper / reducer):
modelIn = new FileInputStream("en-sent.bin");
Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:
modelIn = getClass().getResourceAsStream("/en-sent.bin");
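Whichever option is used, it can help to check for the null return explicitly so the failure is reported where it happens instead of surfacing later as "in must not be null!". The following is a minimal sketch using only the standard library; the class ResourceCheckDemo and the method isOnClasspath are hypothetical names, not part of OpenNLP or Hadoop.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reports whether a resource can be found on the classpath.
class ResourceCheckDemo {

    static boolean isOnClasspath(String resourcePath) {
        // getResourceAsStream returns null (it does not throw) when nothing matches
        try (InputStream in = ResourceCheckDemo.class.getResourceAsStream(resourcePath)) {
            return in != null;
        } catch (IOException e) {
            // only reachable if closing the stream fails
            return false;
        }
    }

    public static void main(String[] args) {
        // An absolute file-system path is not a classpath resource name
        System.out.println(isOnClasspath("/sandbox/corpus-analysis/nlp/en-sent.bin"));
        // A leading slash means "from the classpath root", e.g. the root of the job jar
        System.out.println(isOnClasspath("/en-sent.bin"));
    }
}
```

A guard like this before constructing the SentenceModel would make it clear that the lookup, not the model parsing, is what failed.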

Copying files from HDFS to local file system with JAVA

I am trying to copy files from HDFS to the local filesystem for preprocessing. The code below should work according to the documentation. Although it doesn't give any error messages and the MapReduce job runs smoothly, I cannot see any output on my local hard drive. What do you think the problem is? Thanks.
try {
    Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/" + value.toString());
    Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    fs.copyToLocalFile(phdfs_input, plocal_input);
    /* String localoutput_file = "/home/hduser/Destop/output/"+value.toString();
    String cmd1[] = {"mafia", "-mfi", ".5", "-ascii", "~/Desktop/"+value.toString(), localoutput_file };
    File mafia_dir = new File("/home/hduser/");
    ShellCommandExecutor s = new ShellCommandExecutor(cmd1, mafia_dir);*/
} catch (Exception e) {
}
Try using "/user/hduser/conninput/" + value.toString() in the Path constructor instead of including the hdfs://master:54310 prefix. The file system should pick up master:54310 from the Configuration.
