hadoop-streaming : writing output to different files - hadoop

Here is the scenario:

              Reducer1
            /
  Mapper - -  Reducer2
            \
              ReducerN
In the reducer I want to write the data to different files. Let's say the reducer looks like:

def reduce():
    for line in sys.stdin:
        if line == type1:
            create_type_1_file(line)
        if line == type2:
            create_type_2_file(line)
        if line == type3:
            create_type_3_file(line)
        # ... and so on

def create_type_1_file(line):
    # writes to file1

def create_type_2_file(line):
    # writes to file2

def create_type_3_file(line):
    # writes to file3
Consider the paths to write to as:
file1 = /home/user/data/file1
file2 = /home/user/data/file2
file3 = /home/user/data/file3
When I run in pseudo-distributed mode (a single machine with the HDFS daemons running), things are fine, since all the tasks write to the same set of local files.
Question:
- If I run this on a cluster of 1000 machines, will they still write to the same set of files? I am writing to the local filesystem in this case. Is there a better way to perform this operation in Hadoop streaming?

Typically the output of reduce is written to a reliable storage system like HDFS, because if one of the nodes goes down then the reduce data associated with that node is lost. It's not possible to run that particular reduce task again outside the context of the Hadoop framework. Also, once the job is complete, the output from the 1000 nodes has to be consolidated for the different input types.
Concurrent writes are not supported in HDFS. There might be a case where multiple reducers write to the same file in HDFS, which could corrupt the file. When multiple reduce tasks are running on a single node, concurrency can also be a problem when writing to a single local file.
One solution is to use a reduce-task-specific file name and later combine all the files for a specific input type, as sketched below.
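For example, once every reduce task has written to its own file underneath a per-type directory on HDFS, the pieces can be merged from the driver or any other client. A minimal sketch, assuming Hadoop 2.x (where FileUtil.copyMerge is still available) and a hypothetical layout in which each task writes /home/user/data/file1/<task-id>:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeTypeFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Merge all the per-task files for one type into a single output file.
        // Signature: copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
        FileUtil.copyMerge(fs, new Path("/home/user/data/file1"),
                           fs, new Path("/home/user/data/file1.merged"),
                           false, conf, null);
    }
}

The same merge could also be done with "hadoop fs -getmerge" from the command line; the point is only that per-task names avoid concurrent writes to one file.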

Output can be written from the Reducer to more than one location using the MultipleOutputs class. You can regard file1, file2 and file3 as three folders and write the output data of the 1000 Reducers to these folders separately.
Usage pattern for job submission:
Job job = new Job();

FileInputFormat.setInputPath(job, inDir);
// outDir is the root path; in this case, outDir = "/home/user/data/"
FileOutputFormat.setOutputPath(job, outDir);

// You have to assign the output format class. Using MultipleOutputs in this way
// will still create zero-sized default output files, e.g. part-00000. To prevent
// this, use LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class)
// instead of job.setOutputFormatClass(TextOutputFormat.class) in your Hadoop job
// configuration.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MOMap.class);
job.setReducerClass(MOReduce.class);
...
job.waitForCompletion(true);
Usage in Reducer:
private MultipleOutputs<Text, Text> out;

public void setup(Context context) {
    out = new MultipleOutputs<Text, Text>(context);
    ...
}

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // '/' characters in baseOutputPath are translated into directory levels in
    // your file system. Also, append your custom-generated path with "part" or
    // similar, otherwise your output files will be named -00000, -00001 etc.
    // No call to context.write() is necessary.
    for (Text line : values) {
        if (line == type1)
            out.write(key, new Text(line), "file1/part");
        else if (line == type2)
            out.write(key, new Text(line), "file2/part");
        else if (line == type3)
            out.write(key, new Text(line), "file3/part");
    }
}

protected void cleanup(Context context) throws IOException, InterruptedException {
    out.close();
}
ref: https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Related

cluster.getJob is returning null in hadoop

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    org.apache.hadoop.mapreduce.Cluster cluster = new org.apache.hadoop.mapreduce.Cluster(conf);
    Job currentJob = cluster.getJob(context.getJobID());
    mapperCounter = currentJob.getCounters().findCounter(TEST).getValue();
}
I wrote the code above to get the value of a counter that I am incrementing in my mapper function. The problem is that the currentJob returned by cluster.getJob turns out to be null. Does anyone know how I can fix this?
My question is different because I am trying to access my counter in the reducer, not after all the map and reduce tasks are done. The code I have pasted here belongs in my reducer class.
It seems that cluster.getJob(context.getJobID()) does not work in Hadoop's Standalone Operation.
Try running your program with YARN in Hadoop's Single Node Cluster (pseudo-distributed) mode, as described in the documentation: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
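As a purely illustrative variant (not from the original answer), the lookup can be guarded so the reducer fails with a clear message instead of a NullPointerException when the job cannot be retrieved; mapperCounter and TEST are the fields from the snippet above:

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    org.apache.hadoop.mapreduce.Cluster cluster =
            new org.apache.hadoop.mapreduce.Cluster(conf);
    try {
        Job currentJob = cluster.getJob(context.getJobID());
        if (currentJob == null) {
            // Typically means there is no real cluster to ask, e.g. standalone mode.
            throw new IOException("cluster.getJob() returned null - is the job running on YARN?");
        }
        mapperCounter = currentJob.getCounters().findCounter(TEST).getValue();
    } finally {
        cluster.close();
    }
}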

Difference between combiner and in-mapper combiner in mapreduce?

I'm new to Hadoop and MapReduce. Could someone clarify the difference between a combiner and an in-mapper combiner, or are they the same thing?
You are probably already aware that a combiner is a process that runs locally on each Mapper machine to pre-aggregate data before it is shuffled across the network to the various cluster Reducers.
The in-mapper combiner takes this optimization a bit further: the aggregations do not even write to local disk: they occur in-memory in the Mapper itself.
The in-mapper combiner does this by taking advantage of the setup() and cleanup() methods of
org.apache.hadoop.mapreduce.Mapper
to create an in-memory map along the following lines:
Map<LongWritable, Text> inmemMap = null;

protected void setup(Mapper.Context context) throws IOException, InterruptedException {
    inmemMap = new HashMap<LongWritable, Text>();
}
Then during each map() invocation you add values to that in-memory map (instead of calling context.write() on each value). Finally the Map/Reduce framework will automatically call:
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException {
    for (LongWritable key : inmemMap.keySet()) {
        // do some aggregation on the inmemMap
        Text myAggregatedText = doAggregation(inmemMap.get(key));
        context.write(key, myAggregatedText);
    }
}
Notice that instead of calling context.write() every time, you add entries to the in-memory map. Then in the cleanup() method you call context.write(), but with the condensed/pre-aggregated results from your in-memory map. Therefore your local map output spill files (that will be read by the reducers) will be much smaller.
In both cases - in-memory and external combiner - you gain the benefit of less network traffic to the reducers thanks to smaller map spill files. That also decreases the reducer processing load.
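To make this concrete, here is a minimal sketch of an in-mapper combining word-count Mapper (my own illustration, not taken from the answer above); it buffers counts in a HashMap and only emits them in cleanup():

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Aggregate in memory instead of calling context.write() per token.
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            Integer current = counts.get(token);
            counts.put(token, current == null ? 1 : current + 1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one pre-aggregated record per distinct word seen by this mapper.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

The trade-off is memory: for a very large number of distinct keys the in-memory map has to be flushed (written out and cleared) periodically, which a plain combiner handles for you.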

hadoop: how do i know what file a task was handling when it failed?

I have a job that has some failed tasks. I want to try to reproduce the problem on the files those tasks were handling, but I can't find out which files these were.
How can I find out which files a task was handling when it failed?
I have no idea if this really works, but you may want to try it out (I was coding against Hadoop 2.2):
job.waitForCompletion(true);

Class<? extends InputFormat<?, ?>> clz = job.getInputFormatClass();
InputFormat<?, ?> inputFormat = ReflectionUtils.newInstance(clz, conf);
List<InputSplit> splits = inputFormat.getSplits(job);

TaskCompletionEvent[] events = job.getTaskCompletionEvents(0);
for (TaskCompletionEvent ev : events) {
    if (ev.isMapTask() && ev.getStatus() == Status.FAILED) {
        int idWithinJob = ev.idWithinJob();
        InputSplit inputSplit = splits.get(idWithinJob);
        if (inputSplit instanceof FileSplit) {
            FileSplit sp = (FileSplit) inputSplit;
            System.out.println(sp.getPath() + " failed!");
        }
    }
}
The idea is rather simple: you get all the task completion events and keep the failed map tasks. Then you can obtain the index that is usually assigned to the split internally.
The split itself can be obtained by running getSplits over the job data. Please note that a FileSplit can also be a part of a file (a block), so you want to check the internal offset and length fields. The type of the split depends on the InputFormat, so there is no guarantee that the returned splits are FileSplits.
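If you need those block boundaries, FileSplit exposes them directly; a small extension of the loop above (my addition, not part of the original answer) could print them too:

if (inputSplit instanceof FileSplit) {
    FileSplit sp = (FileSplit) inputSplit;
    // getStart()/getLength() give the byte range of the file covered by this split.
    System.out.println(sp.getPath() + " failed, bytes "
            + sp.getStart() + " to " + (sp.getStart() + sp.getLength()));
}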
It turns out that grepping the logs shows which files the task was reading.

Hadoop variable set in reducer and read in driver

How can I set a variable in a reducer which, after its execution, can be read by the driver once all tasks finish their execution? Something like:
class Driver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        ...
        JobClient.runJob(conf); // reducer sets some variable
        String varValue = ...;  // variable value is read by driver
    }
}
WORKAROUND
I came up with this "ugly" workaround. The main idea is that you create a group of counters in which you hold only one counter, whose name is the value you wish to return (you ignore the actual counter value). The code looks like this:
// reducer || mapper
reporter.incrCounter("Group name", "counter name -> actual value", 0);

// driver
RunningJob runningJob = JobClient.runJob(conf);
String value = runningJob.getCounters().getGroup("Group name").iterator().next().getName();
The same will work for mappers as well. Though this solves my problem, I think this type of solution is "ugly". Thus I leave the question open.
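For reference, the same trick with the new org.apache.hadoop.mapreduce API would look roughly like this (my own sketch, not part of the original question):

// reducer (or mapper): encode the value to return in the counter *name*
context.getCounter("Group name", "counter name -> actual value").increment(0);

// driver, after job.waitForCompletion(true)
String value = job.getCounters()
                  .getGroup("Group name")
                  .iterator().next().getName();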
You can't amend the configuration in a map / reduce task and expect that change to be persisted to the configurations of other tasks and / or the job client that submitted the job (let's say you write different values in the reducers - which one 'wins' and is persisted back?).
You can however write files to HDFS yourself, which can then be read back when your job returns - no less ugly really, but there isn't a way that doesn't involve another technology (ZooKeeper, HBase or any other NoSQL / RDB) holding the value between your task ending and you being able to retrieve the value upon job success. A sketch of the HDFS side-file approach follows.
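A minimal sketch of that side-file idea (my own illustration, with made-up paths; myValue, conf and job stand in for your own variables):

// In the reducer's cleanup(): write the value to a task-specific side file.
FileSystem fs = FileSystem.get(context.getConfiguration());
Path side = new Path("/tmp/myjob/side/" + context.getTaskAttemptID());  // hypothetical location
FSDataOutputStream out = fs.create(side);
out.writeUTF(myValue);
out.close();

// In the driver, after the job has completed: read the side files back.
FileSystem fs = FileSystem.get(conf);
for (FileStatus st : fs.listStatus(new Path("/tmp/myjob/side"))) {
    FSDataInputStream in = fs.open(st.getPath());
    String varValue = in.readUTF();
    in.close();
    // use varValue ...
}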

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that etc) to a mapper, then do certain operation on this data, output the results to reducers and use that output again to do the second phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send the data line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with the FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above code snippet is the one that I wrote, but somehow it doesn't work well.
I would suggest putting connection strings and some other indications of where to find the shards into your input files.
The mapper will take this information, connect to the database and do its job. I would not suggest converting the result sets to Hadoop's writable classes - it will hinder performance.
The problem I see to be addressed is how to have enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way will be more flexible; a sketch with NLineInputFormat is shown below.
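One way to get small splits without writing a custom InputFormat (my suggestion, assuming each input line holds one shard reference) is org.apache.hadoop.mapreduce.lib.input.NLineInputFormat, which turns every N input lines into their own split and hence their own map task:

// Driver fragment: one line of the shard list per split, i.e. one shard per mapper.
Job job = Job.getInstance(conf, "shard-processing");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path("/input/shard-list.txt"));  // hypothetical path
NLineInputFormat.setNumLinesPerSplit(job, 1);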
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to the mappers as input.
public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    // System.out.println(lineValue.toString());
    return true;
    // throw new UnsupportedOperationException("Not supported yet.");
}
