Passing a URI as a runtime variable to the distributed cache in Hadoop MapReduce

I am using the distributed cache in my MapReduce program, and I pass three arguments to it: the input file, the output directory, and a config file.
I want to add the third argument, i.e. the config file, to the DistributedCache.
I am setting the parameter as follows in the run() method of the MapReduce driver:
conf.set("CONF_XML", args[2]);
How do I add this file to the distributed cache in the same method?
Usually we add a file using new URI(file_path):
DistributedCache.addCacheFile(new URI(file_path), conf); // << here, how do I pass the argument parameter?

Pass the file path argument to the DistributedCache API as a URI:
DistributedCache.addCacheFile(new Path(args[2]).toUri(), job.getConfiguration());
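For context, a minimal sketch of how this could sit in the driver's run() method; the class name MyDriver and the job setup details are assumptions, not part of the original question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("CONF_XML", args[2]);          // keep the path in the configuration as well

        Job job = new Job(conf, "my job");
        job.setJarByClass(MyDriver.class);
        // ... set mapper, reducer, key/value classes here ...

        // Add the third argument (the config file) to the distributed cache.
        DistributedCache.addCacheFile(new Path(args[2]).toUri(), job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}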

Related

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop that allows multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like: a stub HDFS file is created at the beginning of the job, then each reducer creates, as output, a variable number of data blocks and assigns them to this file according to a certain order.
The answer is no; that would be an unnecessary complication for a rare use case.
What you should do:
option 1 - add some code at the end of your Hadoop driver
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
    // ls job output directory, collect part-r-XXXXX file names
    // create HDFS readers for the files
    // merge them into a single file in whatever way you want
}
All of the required methods are present in the Hadoop FileSystem API.
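As an illustration, a sketch of option 1 using the FileSystem API; the argument layout and file names are assumptions, and this simply concatenates the part files in listing order:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path outputDir = new Path(args[0]);  // the job's output directory
        Path merged = new Path(args[1]);     // destination for the merged file

        FSDataOutputStream out = fs.create(merged);
        try {
            // ls the job output directory and collect the part-r-XXXXX files.
            for (FileStatus status : fs.listStatus(outputDir)) {
                if (status.isDir() || !status.getPath().getName().startsWith("part-")) {
                    continue; // skip _SUCCESS, _logs, subdirectories, etc.
                }
                // Open an HDFS reader for the part file and append it to the merged file.
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    IOUtils.copyBytes(in, out, conf, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}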
option 2 - add a job to merge the files
You can create a generic Hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline with your main job.
This would work faster for big inputs.
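A minimal sketch of such a merge job, assuming the first job's output is tab-separated text key/value pairs so the identity Mapper and Reducer can pass records straight through; note that the shuffle sorts records by key, so the original record order is only preserved within each key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge part files");
        job.setJarByClass(MergeJob.class);

        // Identity mapper and reducer: records pass through unchanged.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // A single reducer means a single output file (part-r-00000).
        job.setNumReduceTasks(1);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // output dir of the first job
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}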
If you want the merged output file on the local filesystem, you can use the hadoop getmerge command to combine multiple reduce task files into one single local output file; below is the command for the same.
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

Oozie generate set of files in directory

I'm trying to ingest log files into Hadoop.
I'd like to use Oozie to trigger my ingestion task (written in Spark), and have Oozie pass the filenames to my task.
I expect the log files to be set out as:
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.2.log
(etc).
So, now I have two problems:
1. How to get Oozie to generate all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/ and pass them to my app; and
2. How to get Oozie to, in parallel, generate all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/ and pass them to a second invocation of my task.
Date/time-based file names can be created with a small Java program, which can be called from the Oozie workflow.xml, something like:
String processedDateString = (new SimpleDateFormat("yyyyMMddhhmmss")).format(new Date(timeInMilis));
and when calling the same jar in the workflow:
<main-class>NameFile.jar</main-class>
<arg>Path=${output_path}</arg>
<arg>Name=${name}</arg>
<arg>processedDate=${(wf:actionData('Rename')['ProcessedDate'])}</arg>
For copying/moving you can use the same Java program with a copy action.
The Log1 and Log2 locations can be specified in job.properties.
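A minimal sketch of such a helper program, assuming it runs as an Oozie <java> action with <capture-output/> so that the generated value can later be read back with wf:actionData(); the class name and property key here are assumptions:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

public class NameFileMain {
    public static void main(String[] args) throws Exception {
        // Build the date/time-based name for the current time.
        String processedDateString =
                new SimpleDateFormat("yyyyMMddhhmmss").format(new Date());

        // Oozie points this system property at a file; properties written here
        // become available to the workflow through wf:actionData('ActionName').
        String outputFile = System.getProperty("oozie.action.output.properties");
        Properties props = new Properties();
        props.setProperty("ProcessedDate", processedDateString);
        OutputStream os = new FileOutputStream(new File(outputFile));
        try {
            props.store(os, "");
        } finally {
            os.close();
        }
    }
}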

Hadoop Chain Jobs - Skip second job if input file does not exist

I have two Hadoop jobs. The first job saves its output to an HDFS file, and the second job takes this file as input. When this file does not exist I get an error. How can I skip the second job if the first job's output file does not exist?
Use this test, but with the path created by the first job:
FileSystem fs = FileSystem.get(conf);
String inputDir = "HDFS file path";
if (fs.exists(new Path(inputDir))) {
    // this block gets executed only if the file path inputDir exists
}
The code inside the block would contain the configuration and execution code for the second job.
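Put together, a driver that chains the two jobs might look like the following sketch; the job configuration details are omitted and the class name and path layout are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path(args[1]); // output of job 1, input of job 2

        Job job1 = Job.getInstance(conf, "job 1");
        // ... configure mapper/reducer/input for job1, writing to 'intermediate' ...
        if (!job1.waitForCompletion(true)) {
            System.exit(1); // first job failed, do not run the second
        }

        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(intermediate)) {
            Job job2 = Job.getInstance(conf, "job 2");
            // ... configure job2 to read from 'intermediate' ...
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
        // second job skipped because its input does not exist
    }
}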

DistributedCache Hadoop - FileNotFound

I'm trying to place a file in the distributed cache. In order to do this I invoke my driver class using the -files option, something like:
hadoop jar job.jar my.driver.class -files MYFILE input output
The getCacheFiles() and the getLocalCacheFiles() return arrays of URIs/Paths containing MYFILE.
(E.g.: hdfs://localhost/tmp/hadoopuser/mapred/staging/knappy/.staging/job_201208262359_0005/files/histfile#histfile)
Unfortunately, when trying to retrieve MYFILE in the map task, it throws a FileNotFoundException.
I tried this in standalone (local) mode as well as in pseudo-distributed mode.
Do you know what might be the cause?
UPDATE:
The following three lines:
System.out.println("cache files:"+ctx.getConfiguration().get("mapred.cache.files"));
uris = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
for(Path uri: uris){
System.out.println(uri.toString());
System.out.println(uri.getName());
if(uri.getName().contains(Constants.PATH_TO_HISTFILE)){
histfileName = uri.getName();
}
}
print out this:
cache files:file:/home/knappy/histfile#histfile
/tmp/hadoop-knappy/mapred/local/archive/-7231_-1351_105/file/home/knappy/histfile
histfile
So, the file seems to be listed in the job.xml mapred.cache.files property and the local file seems to be present. Still, the FileNotFoundException is thrown.
First check mapred.cache.files in your job's xml to see whether the file is in the cache.
Then you can retrieve it in your mapper:
...
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
File myFile = new File(files[0].getName());
//read your file content
...
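As an illustration, a sketch of a mapper that reads the cached file in setup(), using the full local path returned by getLocalCacheFiles() rather than just the file name, which avoids depending on a symlink in the task working directory; the class name and processing logic are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HistfileMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String histfileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (files != null && files.length > 0) {
            // Use the full local path of the cached file.
            histfileName = files[0].toString();
            BufferedReader reader = new BufferedReader(new FileReader(histfileName));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process each line of the cached file here
                }
            } finally {
                reader.close();
            }
        }
    }
}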

Hadoop per-file block size

In the Hadoop book it is said that we can specify a per-file block size at the time the file is created.
"The most natural way to increase the split size is to have larger blocks in HDFS, by setting dfs.block.size, or on a per-file basis at file construction time."
Any idea how to do this at file construction time? I hope that by setting this value to the file size, the file will not be split.
You can use the CLI:
hadoop fs -D dfs.block.size=file-size -put local_name remote_location
or you can use the Java API to specify dfs.block.size when you want to create or copy files:
Configuration conf = new Configuration();
conf.setInt("dfs.block.size", fileSizeInBytes); // block size in bytes
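If you create the file programmatically, FileSystem.create also accepts the block size directly as a parameter; a minimal sketch, where the destination path, replication factor, and block size value are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFileWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 512L * 1024 * 1024; // 512 MB: choose >= file size so the file stays in one block
        short replication = 3;               // replication factor (assumption)
        int bufferSize = 4096;

        // FileSystem.create takes the block size for this one file as a parameter.
        Path dst = new Path("/user/example/bigfile.dat"); // hypothetical destination
        FSDataOutputStream out = fs.create(dst, true, bufferSize, replication, blockSize);
        try {
            // ... write the file contents here ...
        } finally {
            out.close();
        }
    }
}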
