Can Hadoop MultipleInputs.addInputPath be made to work recursively? - hadoop

Recent versions of Hadoop already easily support nested input directories using FileInputFormat.setInputDirRecursive, which relies on the mapreduce.input.fileinputformat.input.dir.recursive configuration key.
It's also possible to specify multiple mapper/input-directory combinations using MultipleInputs.addInputPath.
But can I do both at the same time? In other words, is there a way to specify multiple mapper/input-directory combinations where the input directories are included recursively?
A concrete example:
I have the following directory structure:
/dataset1/subdir1/data1.txt
/dataset2/subdir2/data2.txt
I tried something like this:
Job job = Job.getInstance(conf);
FileInputFormat.setInputDirRecursive(job, true);
MultipleInputs.addInputPath(job, new Path("/dataset1"), TextInputFormat.class,
Mapper1.class);
MultipleInputs.addInputPath(job, new Path("/dataset2"), TextInputFormat.class,
Mapper2.class);
...
job.waitForCompletion(true);
But then I get an exception along the lines of Error: java.io.IOException: 's3://bucketname/dataset1/subdir1' is a directory
This is running in Amazon EMR under Hadoop 2.4.0.
Edit: Hadoop version is 2.4.0, not 2.6.0

Well, I'm not sure about S3, but this is normal: the input path should point to a file, not a directory. Try one of the methods below.
Method 1
public static void addInputPathRecursively(FileSystem fs, Path path, PathFilter inputFilter,
        Job job, boolean useMapper1) throws IOException {
    for (FileStatus stat : fs.listStatus(path, inputFilter)) {
        if (stat.isDirectory()) {
            // recurse into subdirectories, keeping the same mapper flag
            addInputPathRecursively(fs, stat.getPath(), inputFilter, job, useMapper1);
        } else if (useMapper1) {
            MultipleInputs.addInputPath(job, stat.getPath(), TextInputFormat.class, Mapper1.class);
        } else {
            MultipleInputs.addInputPath(job, stat.getPath(), TextInputFormat.class, Mapper2.class);
        }
    }
}
In the driver class you can then call it accordingly:
// FileFilter is a custom PathFilter implementation (not shown here)
addInputPathRecursively(fs, dataset1Path, new FileFilter(conf, fs, new String[] { txt }), job, true);
addInputPathRecursively(fs, dataset2Path, new FileFilter(conf, fs, new String[] { txt }), job, false);
This is a working example; just make sure you configure the PathFilter properly if you want to apply a regex.
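As a variation on Method 1 (not from the original answer, just a sketch assuming Hadoop 2.x), you can let FileSystem.listFiles do the recursion for you; Mapper1 and Mapper2 are the same classes as above:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Walk a dataset directory recursively and register every file with the given mapper.
public static void addDatasetRecursively(FileSystem fs, Path root, Job job,
        Class<? extends Mapper> mapperClass) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true); // true = recursive
    while (it.hasNext()) {
        LocatedFileStatus status = it.next();
        if (status.isFile()) {
            MultipleInputs.addInputPath(job, status.getPath(), TextInputFormat.class, mapperClass);
        }
    }
}

// Usage in the driver:
//   addDatasetRecursively(fs, new Path("/dataset1"), job, Mapper1.class);
//   addDatasetRecursively(fs, new Path("/dataset2"), job, Mapper2.class);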
Method 2
Setting this should do the trick too.
FileInputFormat.setInputDirRecursive(job, true);
Method 3
Bypass the problem inside the mapper and process the input at the line level. (Not a good idea!)

Related

Access hdfs file from udf

I'd like to access a file from my UDF call. This is my script:
files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};
The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems the file is not found, maybe because I'm trying to access the file on HDFS.
How can I access a file in HDFS from my UDF Java call?
Inside an EvalFunc you can get a file from the HDFS via:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FSDataInputStream in = fs.open(new Path(fileName));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
...
You might also consider putting the files into the distributed cache, in that case you have to override getCacheFiles() in your EvalFunc class.
E.g:
@Override
public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(2);
    list.add("/cache/pig/wordlist1.txt#w1");
    list.add("/cache/pig/wordlist2.txt#w2");
    return list;
}
then you can just pass the symlinks of the files (w1 and w2) in order to get them from
the local file system of each of the worker nodes:
BufferedReader br = new BufferedReader(new FileReader(fileName));
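For context, here is a minimal sketch of how the two pieces can fit together in a single EvalFunc; the class name, the cache path, and the lookup logic are made up for illustration:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that checks whether the input word appears in a cached word list.
public class WordListLookup extends EvalFunc<Boolean> {

    @Override
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        // ship the HDFS file to every worker and expose it under the symlink "w1"
        list.add("/cache/pig/wordlist1.txt#w1");
        return list;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        String word = (String) input.get(0);
        // read the file through its symlink on the worker's local file system
        BufferedReader br = new BufferedReader(new FileReader("w1"));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().equals(word)) {
                    return true;
                }
            }
        } finally {
            br.close();
        }
        return false;
    }
}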

Reading in a parameter file in Amazon Elastic MapReduce and S3

I am trying to run my Hadoop program in the Amazon Elastic MapReduce system. My program takes an input file from the local filesystem which contains parameters needed for the program to run. However, since the file is normally read from the local filesystem with FileInputStream, the task fails when executed in the AWS environment with an error saying that the parameter file was not found. Note that I have already uploaded the file to Amazon S3. How can I fix this problem? Thanks. Below is the code that I use to read the parameter file and, consequently, the parameters in the file.
FileInputStream fstream = new FileInputStream(path);
DataInputStream datain = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(datain));
String[] args = new String[7];
int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    args[i++] = strLine;
}
If you must read the file from the local file system, you can configure your EMR job to run with a bootstrap action. In that action, simply copy the file from S3 to a local file using s3cmd or similar.
You could also go through the Hadoop FileSystem class to read the file, as I'm pretty sure EMR supports direct access like this. For example:
FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));
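For example, a sketch of how the original parameter-reading loop might look when it goes through the Hadoop FileSystem API instead of FileInputStream; the bucket name and file path are placeholders:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParameterFileReader {

    // Read every line of an S3-hosted parameter file into a list.
    public static List<String> readParameters(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
        FSDataInputStream in = fs.open(new Path("/my/parameter/file"));
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
        List<String> params = new ArrayList<String>();
        try {
            String line;
            while ((line = br.readLine()) != null) {
                params.add(line);
            }
        } finally {
            br.close();
        }
        return params;
    }
}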
I have not tried Amazon Elastic MapReduce yet, but it looks like a classical application of the distributed cache. You add a file to the cache using the -files option (if you implement Tool/ToolRunner) or the job.addCacheFile(URI uri) method, and access it as if it existed locally.
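A rough sketch of the job.addCacheFile variant, assuming the newer org.apache.hadoop.mapreduce API; the S3 path and the "params" symlink name are placeholders:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileExample {

    public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // the symlink "params" points at the cached copy on the task's local disk
            BufferedReader br = new BufferedReader(new FileReader("params"));
            try {
                String line;
                while ((line = br.readLine()) != null) {
                    // parse the parameters here
                }
            } finally {
                br.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // "#params" creates a symlink called "params" in the task working directory
        job.addCacheFile(new URI("s3://my.bucket.name/my/parameter/file#params"));
        // ... set the jar, mapper, input/output formats and paths, then submit
        job.waitForCompletion(true);
    }
}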
You can add this file to the distributed cache as follows :
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;
@Override
public void configure(JobConf job) {
    s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
    FileInputStream fstream = new FileInputStream(s3FilePath.toString());
    ...
}

Files not put correctly into distributed cache

I am adding a file to distributed cache using the following code:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
Then I read the file into the mappers:
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            String[] s = line.split("\t");
            // do stuff to s
        }
    } finally {
        joinReader.close();
    }
}
The problem is that I only read in one line, and it is not the file I was putting into the cache. Rather it is: cm9vdA==, or root in base64.
Has anyone else had this problem, or see how I'm using distributed cache incorrectly? I am using Hadoop 0.20.2 fully distributed.
Common mistake in your job configuration:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
After you create your Job object, you need to pull back the Configuration object, as Job makes a copy of it; configuring values in conf2 after you create the job will have no effect on the job itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
You should also check the number of files in the distributed cache; there is probably more than one, and you're opening a random file, which is giving you the value you are seeing.
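For example, a quick diagnostic sketch (inside the same setup(Context) method shown in the question) to see what actually ended up in the cache:
// Dump every URI registered in the distributed cache for this job.
URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());
if (cacheFiles != null) {
    for (URI uri : cacheFiles) {
        System.err.println("cache file: " + uri);
    }
}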
I suggest you use symlinking, which will make the files available in the local working directory with a known name:
DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);
// then in your mapper setup:
BufferedReader joinReader = new BufferedReader(new FileInputStream("myfile"));

Unable to load OpenNLP sentence model in Hadoop map-reduce job

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:
public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;
    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);
    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }
    logger.info("number of sentences: " + sentences.length);
    <snip>
}
When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:
I've verified that the model file exists in the location sentenceModelPath refers to.
I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
Hadoop is just running on my local machine.
Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?
Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This will try to load a file from the classpath; neither a file in HDFS nor one on the client's local file system is part of the classpath at mapper / reducer runtime, which is why you're seeing the null error (getResourceAsStream() returns null if the resource cannot be found).
To get around this you have a number of options:
Amend your code to load the file from HDFS:
modelIn = FileSystem.get(context.getConfiguration()).open(
new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
Amend your code to load the file from the local directory, and use the -files GenericOptionsParser option (which copies the file from the local file system to HDFS, and back down to the local directory of the running mapper / reducer); see the sketch after this list:
modelIn = new FileInputStream("en-sent.bin");
Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:
modelIn = getClass().getResourceAsStream("/en-sent.bin");
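For illustration, here is a sketch of how the second option could look end to end, assuming the model is shipped with -files and the driver goes through ToolRunner; the class name and output key/value choices are made up:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceMapper extends Mapper<LongWritable, Text, Text, Text> {

    private SentenceDetectorME sentenceBreaker;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // With "-files /local/path/en-sent.bin" on the command line, the file is
        // copied into the task's working directory, so a relative name works here.
        InputStream modelIn = new FileInputStream("en-sent.bin");
        try {
            sentenceBreaker = new SentenceDetectorME(new SentenceModel(modelIn));
        } finally {
            modelIn.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String sentence : sentenceBreaker.sentDetect(value.toString())) {
            context.write(new Text("sentence"), new Text(sentence));
        }
    }
}
The job would then be launched with something like hadoop jar myjob.jar MyDriver -files /local/path/en-sent.bin <input> <output>, with the driver run through ToolRunner so the -files option is parsed.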

Copying files from HDFS to local file system with JAVA

I am trying to copy files from HDFS to the local filesystem for preprocessing. The code below should work according to the documentation. Although it doesn't give any error messages and the MapReduce job runs smoothly, I cannot see any output on my local hard drive. What do you think the problem is? Thanks.
try {
    Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/" + value.toString());
    Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    fs.copyToLocalFile(phdfs_input, plocal_input);
    /* String localoutput_file = "/home/hduser/Desktop/output/" + value.toString();
    String cmd1[] = {"mafia", "-mfi", ".5", "-ascii", "~/Desktop/" + value.toString(), localoutput_file};
    File mafia_dir = new File("/home/hduser/");
    ShellCommandExecutor s = new ShellCommandExecutor(cmd1, mafia_dir); */
} catch (Exception e) {
    e.printStackTrace();
}
Try using /user/hduser/conninput/"+value.toString() in the Path constructor instead of providing the master:54310 part. It should figure out master:54310 from the Configuration.
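In other words, something like this (a sketch of the suggested change; the local destination path is kept from the question):
// Let the FileSystem obtained from the job configuration resolve the HDFS
// authority (master:54310) instead of hard-coding it in the Path.
FileSystem fs = FileSystem.get(context.getConfiguration());
Path phdfs_input = new Path("/user/hduser/conninput/" + value.toString());
Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
fs.copyToLocalFile(phdfs_input, plocal_input);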
