Can we use a file in the reduce function in hadoop? - hadoop

I want to access a different file (other than the input file to the map) in the reduce function. Is this possible?

Have a look at the Distributed Cache. You can send a small file to every mapper or reducer.
(if you use Java)
In your main/driver, set the file on the job:
job.addCacheFile(new URI("path/to/file/inHadoop/file.txt#var"));
Note: var is the symlink name used to access your file in the mapper/reducer, i.e. fn[1] in the code below.
In mapper or reducer, get file from context:
@Override
public void setup(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    // The part after '#' is the symlink name in the task's working directory
    String[] fn = cacheFiles[0].toString().split("#");
    BufferedReader br = new BufferedReader(new FileReader(fn[1]));
    String line = br.readLine();
    // do something with line
    br.close();
}
Note: cacheFiles[0] refers to the file you sent from your main/driver
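A common pattern, if the cached file is used as a lookup table, is to load it once in setup() and keep it in memory for every map() call. A minimal sketch along those lines (the field names and key/value layout are hypothetical, and it assumes the #var symlink from the driver snippet above):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Filled once per task from the cached file
    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // "var" is the symlink name given after '#' in job.addCacheFile(...)
        BufferedReader br = new BufferedReader(new FileReader("var"));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumes tab-separated key/value lines
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            br.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String match = lookup.get(value.toString());
        if (match != null) {
            context.write(value, new Text(match));      // simple map-side join on the cached file
        }
    }
}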

Related

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write to your main job output (the keys and values you emit), or use MultipleOutputs if you do not want to mess with your standard job output.
For example, you can write name/value properties as Text keys and Text values from your mappers or reducers (a sketch of the writing side follows the load() example below). Once control is back in your driver, simply read them from HDFS. For example, you can store the name/value pairs in the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry<Object, Object> prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
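The writing side could look roughly like this (a sketch, not the exact code: the named output "props" and the field mos, a MultipleOutputs instance created in setup() and closed in cleanup(), are assumptions). TextOutputFormat emits key<TAB>value lines, and Properties.load() accepts a tab as the key/value separator, so the load() helper above can parse them:
// Driver: register a side output for the name/value pairs
MultipleOutputs.addNamedOutput(job, "props", TextOutputFormat.class, Text.class, Text.class);

// Mapper or reducer: write the pairs through MultipleOutputs
mos.write("props", new Text("feedName"), new Text(feedName));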
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
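Reading every output file from the driver can be done with a glob over the parts; a small sketch, assuming the same "props" named output, an outputDir path variable, and the load() helper above:
// In the driver, after job.waitForCompletion(true)
FileSystem fs = FileSystem.get(conf);
for (FileStatus status : fs.globStatus(new Path(outputDir, "props-*"))) {
    load(conf, status.getPath(), fs);   // merge each part's name/value pairs into conf
}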
Using the Configuration you can only do it the other way around: each task gets its own copy of the Configuration, so values set in a mapper never reach the driver.
You can set values in the Driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
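One pitfall worth noting: set the value on the Configuration before the Job is created, because the Job takes a copy of the Configuration at construction time. A minimal driver sketch, assuming a Tool-based driver (MyDriver and the value are hypothetical names):
@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    conf.set("feedName", "myFeed");               // must happen before Job.getInstance(conf)
    Job job = Job.getInstance(conf, "feed job");
    job.setJarByClass(MyDriver.class);
    // ... mapper/reducer classes, input/output paths ...
    return job.waitForCompletion(true) ? 0 : 1;
}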
UPDATE
One option for your problem is to write the data to a file on HDFS and then read it back in the Driver class. Such files can be treated as "intermediate files".
Just try it and see.

Reading a specific file from a directory containing many files in hadoop

I want to read a specific file from a list of files present in Hadoop, based on the name of the file. If the filename matches my given name, I want to process that file's data. Here is what I have tried in the map method:
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
{
    FileSplit fs = (FileSplit) con.getInputSplit();
    String filename = fs.getPath().getName();
    filename = filename.split("-")[0];
    if (filename.equals("aak"))
    {
        String[] tokens = value.toString().split("\t");
        String name = tokens[0];
        con.write(new Text("mrs"), new Text("filename"));
    }
}
You need to write a custom PathFilter implementation and then use setInputPathFilter on FileInputFormat in your driver code. Please take a look at the link below:
https://hadoopi.wordpress.com/2013/07/29/hadoop-filter-input-files-used-for-mapreduce/
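For a flat input directory, a minimal sketch of such a filter and its registration might look like this (the class name is hypothetical; note that if your input directories are nested you also need to accept directories so the listing can descend into them):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class AakPathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        // Accept only files whose name starts with "aak-"
        return path.getName().startsWith("aak-");
    }
}

// In the driver, after setting the input paths:
FileInputFormat.setInputPathFilter(job, AakPathFilter.class);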
Either use a PathFilter, as Arani suggests (+1 for this), or,
if your criterion for selecting your input file is simply that its name starts with the string "aak-", you can do what you want by changing the input path in your main method (Driver class), like this:
replace:
String inputPath = "/your/input/path"; //containing the file /your/input/path/aak-00000
FileInputFormat.setInputPaths(conf, new Path(inputPath));
with:
String inputPath = "/your/input/path"; //containing the file /your/input/path/aak-00000
FileInputFormat.setInputPaths(conf, new Path(inputPath + "/aak-*"));

How to save the input of a Reducer as a file on the Reducer machine and then access it again in the same reduce phase

I am trying to implement a decision tree using MapReduce. In my reduce phase I want to save the output of my mapper to a .data file and then access it again in the same reducer.
I need the data in the form of instances, which requires a *.data file rather than the array of strings the mapper provides as output.
This is what I am doing right now:
public void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException
{
    Writer writer = null;
    writer = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(key + "/temp.data"), "utf-8"));
    for (Text val : values) {
        writer.write(val.toString() + "\n");
    }
}
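If the point is just to materialize the values as a local .data file and read them back within the same reduce() call, a sketch along those lines might look like this (the file name is hypothetical and lives in the task's local working directory; the essential fix over the code above is that the writer is closed before the file is re-read):
@Override
protected void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    File dataFile = new File("instances-" + key.get() + ".data");

    // Write the incoming values to the local file
    Writer writer = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(dataFile), "utf-8"));
    try {
        for (Text val : values) {
            writer.write(val.toString() + "\n");
        }
    } finally {
        writer.close();   // flush and close before reading the file back
    }

    // Read the same file back in this reduce call, e.g. to build instances
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(dataFile), "utf-8"));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            // build an instance from 'line' ...
        }
    } finally {
        reader.close();
    }
}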

hadoop DistributedCache returns null

I'm using the Hadoop DistributedCache, but I ran into some trouble.
My Hadoop is in pseudo-distributed mode.
From here we can see that in pseudo-distributed mode we use
DistributedCache.getLocalCache(xx) to retrieve the cached file.
First I put my file into the DistributedCache:
DistributedCache.addCacheFile(new Path(
        "hdfs://localhost:8022/user/administrator/myfile").toUri(),
        job.getConfiguration());
Then I retrieve it in the mapper's setup(), but DistributedCache.getLocalCacheFiles returns null. I can see my cached file through
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
and it prints out:
hdfs://localhost:8022/user/administrator/myfile
Here is my pseudocode:
public static class JoinMapper {
    @Override
    protected void setup(Context context) {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
        System.out.println("Cache: "
                + context.getConfiguration().get("mapred.cache.files"));
        Path cacheFile;
        if (cacheFiles != null) {}
    }
}
xx....
public static void main(String[] args) {
    Job job = new Job(conf, "Join Test");
    DistributedCache.addCacheFile(new Path("hdfs://localhost:8022/user/administrator/myfile").toUri(),
            job.getConfiguration());
}
Sorry about the poor typesetting. Can anyone help, please?
By the way, I can get the URIs using
URI[] uris = DistributedCache.getCacheFiles(context.getConfiguration());
and uris contains:
hdfs://localhost:8022/user/administrator/myfile
but when I try to read from the URI, I get a FileNotFoundException.
The Distributed Cache will copy your files from HDFS to the local file system of every TaskTracker.
How are you reading the file? If the file is in HDFS, you will have to get an HDFS FileSystem; otherwise it is going to use the default one (probably the local one). So to read the file from HDFS, try:
String uri = "hdfs://localhost:8022/user/administrator/myfile";
FileSystem fs = FileSystem.get(new Path(uri).toUri(), new Configuration());
Path path = new Path(uri);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
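If you do want the localized copy instead, which is what getLocalCacheFiles is for, a small sketch, assuming the file was added to the cache before the job was submitted as in your main():
@Override
protected void setup(Context context) throws IOException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles != null && localFiles.length > 0) {
        // The localized copy is on the task node's local disk, so a plain FileReader works
        BufferedReader br = new BufferedReader(new FileReader(localFiles[0].toString()));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                // use line ...
            }
        } finally {
            br.close();
        }
    }
}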

How do multiple reducers output only one part-file in Hadoop?

In my map-reduce job, I use 4 reducers, so the final output consists of 4 part files: part-0000, part-0001, part-0002, part-0003.
My question is: how can I configure Hadoop to output only one part file, even though it uses 4 reducers?
This isn't the behaviour expected from Hadoop, but you may use MultipleOutputs to your advantage here.
Create one named output and use it in all your reducers to route the final output through it. The javadoc itself suggests the following:
JobConf conf = new JobConf();

conf.setInputPath(inDir);
FileOutputFormat.setOutputPath(conf, outDir);

conf.setMapperClass(MOMap.class);
conf.setReducerClass(MOReduce.class);
...

// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
        LongWritable.class, Text.class);
...

JobClient jc = new JobClient();
RunningJob job = jc.submitJob(conf);
...
Usage in the Reducer is:
public class MOReduce implements
        Reducer<WritableComparable, Writable, WritableComparable, Writable> {

    private MultipleOutputs mos;

    public void configure(JobConf conf) {
        ...
        mos = new MultipleOutputs(conf);
    }

    public void reduce(WritableComparable key, Iterator<Writable> values,
            OutputCollector<WritableComparable, Writable> output, Reporter reporter)
            throws IOException {
        ...
        mos.getCollector("text", reporter).collect(key, new Text("Hello"));
        ...
    }

    public void close() throws IOException {
        mos.close();
        ...
    }
}
If you are using the new mapreduce API then see here.
MultipleOutputs.addNamedOutput(conf, "text", TextOutputFormat.class,
LongWritable.class, Text.class);
Is text here an output directory, or a single large file named text?
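For reference, a sketch of the same idea with the new org.apache.hadoop.mapreduce API (the key/value types here are hypothetical). Note that, as noted earlier on this page, each reducer still writes its own text-r-##### file under the job output directory, so to end up with literally one file you still need a single reducer (job.setNumReduceTasks(1)) or a merge step such as hadoop fs -getmerge:
public static class MOReduce extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // "text" must match the name passed to MultipleOutputs.addNamedOutput(job, "text", ...)
            mos.write("text", key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}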
