I tried to use KeyValueTextInputFormat in the student marks example.
This is the input:
s1 10
s2 50
s3 30
s1 100
s1 50
s2 30
s3 70
s3 50
s2 75
I used KeyValueTextInputFormat as the input format, so it brings in the student names (s1, s2, ...) as the keys and the marks (10, 50, ...) as the values. My aim is to find the total marks for each student, so I used just this reducer:
public class MarkReducer extends Reducer<Text, Text, Text, LongWritable> {
    public void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        for (Text value : values) {
            sum += Long.parseLong(value.toString());
        }
        ctx.write(key, new LongWritable(sum));
    }
}
I neither created nor mentioned a mapper in the job. I am getting this error:
Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text
But if I use a dummy mapper like this,
public class MarkMapper extends Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(key, value);
    }
}
I am able to get the proper output. Can someone please help me out?
The problem is that you have stated in your main method that the output of the program will be keys of type Text and values of type LongWritable.
By default, this is also assumed to be the output type of the Mapper.
Also, the default mapper (the identity mapper) passes its input straight through, so it assumes the input it receives has the same types as its output; in your case both the input and output of the mapper are Text/Text key-value pairs.
So, just add two calls in the main method to specify that the output of the Mapper should be of the form Text, Text:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
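For context, here is a minimal driver sketch showing where these two lines fit. This is only a sketch under assumptions, not your exact job: the class name MarkDriver is a placeholder, it assumes the new (org.apache.hadoop.mapreduce) API with the Hadoop 2-style separator property, and since your sample input is space-separated it overrides KeyValueTextInputFormat's default tab separator:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MarkDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The sample input uses spaces; KeyValueTextInputFormat splits on tab by default.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", " ");
        Job job = Job.getInstance(conf, "student marks");
        job.setJarByClass(MarkDriver.class);

        // Each line is parsed into a Text key (student id) and Text value (mark).
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // No mapper is set, so the identity mapper passes (Text, Text) through.
        // These two lines tell the framework the map output types differ from
        // the final Text/LongWritable output types declared below.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(MarkReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}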
I think this should work. Otherwise, do as Mobin Ranjbar suggests.
Change this line:
ctx.write(key, new LongWritable(sum));
to
ctx.write(key, new Text(String.valueOf(sum)));
in your reducer (Text has no constructor that takes a long, so the sum must be converted to a String first; the reducer's output value type would then be Text as well). Or, if your map output values were LongWritable, change reduce(Text key, Iterable<Text> values, Context ctx) to reduce(Text key, Iterable<LongWritable> values, Context ctx).
Related
I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
    public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String id = context.getConfiguration().get("mapreduce.task.partition");
            context.write(new Text("Test"), new Text(id + "_" + value.toString()));
        }
    }

    public static class TestReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: Test <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(Test.class);
        job.setJobName("Test");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key (a LongWritable) tells you the byte offset of the line being processed. Unless your lines are all exactly the same length, you won't be able to calculate the line number from an offset, but it does let you determine ordering, if that's useful.
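A minimal sketch of the first option, reusing the TestMapper signature from the question, with the offset prepended where the partition id was:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // key.get() is the byte offset of this line within the file, not a row
    // number, but it preserves the original ordering of lines in a split.
    context.write(new Text("Test"), new Text(key.get() + "_" + value));
}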
The second option is to just track it in the mapper. You could change your code to something like:
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long currentLineNum = 0;
    private Text test = new Text("Test");

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(test, new Text(currentLineNum + "_" + value));
        currentLineNum++;
    }
}
You could also represent your matrix as lines of tuples, including the row and column on every tuple, so that when you read the file in you already have that information. If you use a file that is just space- or comma-separated values making up a 2D array, it will be extremely hard to figure out which line (row) you are currently working on in the mapper; see the example below.
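For illustration, a hypothetical tuple-per-line layout for a 2x2 matrix might look like this, with the row and column carried on every line:

0 0 3.5
0 1 1.2
1 0 9.9
1 1 0.0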
I have an input file that contains:
id value
1e 1
2e 1
...
2e 1
3e 1
4e 1
And I would like to find the total number of ids in my input file. So in my main class I have declared a set, so that when I read the input file I can insert each id into it.
MainDriver.java
public static Set list = new HashSet();
and in my map
// Apply regex to find the id
...
// Insert id to the list
MainDriver.list.add(regex.group(1)); // add 1e, 2e, 3e ...
and in my reduce, I try to use the list as
public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException {
    ...
    output.collect(key, new IntWritable(MainDriver.list.size()));
}
So I expect the value printed to the output file, in this case, to be 4. But it actually prints out 0.
I have verified that regex.group(1) extracts valid ids, so I have no clue why the size of my list is 0 in the reduce process.
The mappers and reducers run in separate JVMs (and often on separate machines altogether), both from each other and from the driver program, so there is no common instance of your static Set variable that all of those methods can concurrently read and write.
One way in MapReduce to count the number of keys is:
Emit (id, 1) from your mapper
(optionally) Sum the 1s for each mapper using a combiner to minimize network and reducer I/O
In the reducer (a sketch follows this list):
In setup() initialize a class-scope numeric variable (an int or long, presumably) to 0
In reduce() increment the counter, and ignore the values
In cleanup() emit the counter value now that all keys have been processed
Run the job with a single reducer, so all the keys go to the same JVM where a single count can be made
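A minimal sketch of such a reducer, assuming the mapper emits (id, 1) as (Text, IntWritable) and the job is configured with job.setNumReduceTasks(1); the class name and output label are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KeyCountReducer extends Reducer<Text, IntWritable, Text, LongWritable> {
    private long keyCount;

    @Override
    protected void setup(Context context) {
        keyCount = 0; // class-scope counter, initialized once per task
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        keyCount++; // one call per distinct key; the values are ignored
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // All keys have been processed by now (single reducer), so emit once.
        context.write(new Text("distinct-ids"), new LongWritable(keyCount));
    }
}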
This is basically ignoring the advantage of using MapReduce in the first place.
Correct me if I'm wrong, but it appears you can key the output from your Mapper by the id, and then in your Reducer you receive something like (Text key, Iterable values) as the parameters.
You can then just sum up the values and emit output.collect(key, <total value>);
Example (apologies for using Context rather than OutputCollector, but the logic is the same):
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text countOne = new Text("1");
    private final Text id = new Text();

    public void map(LongWritable key, Text value,
            Context context) throws IOException, InterruptedException {
        id.set(regex.group(1)); // extract the id however you do it
        context.write(id, countOne);
    }
}
public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable totalCount = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        int cnt = 0;
        for (Text value : values) {
            cnt++;
        }
        totalCount.set(cnt);
        context.write(key, totalCount);
    }
}
In the run method of the Driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one-way: it flows from the driver to the tasks, not back.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write them to your job's main output (the keys and values you emit).
Or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
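For instance, from inside a reducer (a sketch only: the named output "props" and the feedName variable are illustrative, and mos is assumed to be a MultipleOutputs<Text, Text> field created in setup()):

// TextOutputFormat renders this as feedName<TAB>value, which the
// Properties-based loader below can parse back.
mos.write("props", new Text("feedName"), new Text(feedName));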
Once control is back in your driver, simply read from HDFS. For example, you can store the name/values in the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        // Each output line is name<separator>value, which Properties can parse.
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF-8"));
        for (Map.Entry<Object, Object> prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
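If you do end up with several part files, here is a small sketch of gathering them all into the target Configuration, assuming the named output was called props and the job wrote into outDir:

FileStatus[] parts = fs.globStatus(new Path(outDir, "props-r-*"));
for (FileStatus part : parts) {
    load(targetConf, part.getPath(), fs); // reuse the load() helper above
}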
Using Configuration you can only do it the other way around.
You can set values in the Driver class:
public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // assuming the driver implements Tool
    conf.set("feedName", value);
}
and then get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file stored in HDFS, and then access it in the Driver class. Such files can be treated as "intermediate files".
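As a sketch, assuming the job wrote the value as the first line of its single output file (the paths here are illustrative):

// In the driver, after job.waitForCompletion(true) returns:
FileSystem fs = FileSystem.get(conf);
Path resultFile = new Path("/tmp/feed-info/part-r-00000");
BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(resultFile), "UTF-8"));
try {
    String lineVal = in.readLine(); // e.g. the feedName the job wrote out
    System.out.println("feedName = " + lineVal);
} finally {
    in.close();
}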
Just try it and see.
I am new to Hadoop and currently working with it. I have a small query.
I have around 10 files in the input folder which I need to pass to my MapReduce program. I want the file name in my mapper, as the file name contains the time at which the file was created. I saw people using FileSplit to get the file name in the mapper. But if, say, my input files contain millions of lines, then the mapper code will be called for every line, and each call would fetch the file name and extract the time from it again, which is obviously repetitive and time-consuming for the same file. Once I get the time in the mapper, I shouldn't have to extract it from the file name again and again.
How can I achieve this?
You could use the Mapper's setup() method to get the filename, as setup() is guaranteed to run only once, before map() is called, like this:
public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {
    String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
        // setup() runs once per map task, so you could also parse the
        // timestamp out of the filename here and cache it in a field,
        // sparing map() from repeating the extraction.
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // process each key value pair
    }
}
I have a Mapper class that emits a Text key and an IntWritable value, which can be 1, 2 or 3. Depending on the value, I have to write three different files with different keys. However, I am getting a single file output with no records in it.
Also, is there any good MultipleOutputs example (with explanation) you could guide me to?
My Driver class had this code:
MultipleOutputs.addNamedOutput(job, "name", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "attributes", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "others", TextOutputFormat.class, Text.class, IntWritable.class);
My reducer class is:
public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {
    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        String CheckKey = values.toString();
        if ("1".equals(CheckKey)) {
            mos.write("name", key, new IntWritable(1));
        } else if ("2".equals(CheckKey)) {
            mos.write("attributes", key, new IntWritable(2));
        } else if ("3".equals(CheckKey)) {
            mos.write("others", key, new IntWritable(3));
        }
        /* for (IntWritable val : values) {
            sum += val.get();
        } */
        // context.write(key, null);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
P.S. I am new to Hadoop/MapReduce programming.
ArrayList<Integer> l = new ArrayList<Integer>();
l.add(1);
System.out.println(l.toString());
results in "[1]" not 1 so
values.toString()
will not give "1"
Apart from that, I just tried printing an Iterable directly and it only gave an object reference, so that is definitely your problem. If you want to iterate over the values, do as in the example below:
Iterator<Text> valueIterator = values.iterator();
while (valueIterator.hasNext()) {
    Text value = valueIterator.next(); // consume each value here
}
Note that you can only iterate once!
Your problem statement is muddled. What do you mean, "depending on the values"? The reducer gets an Iterable of values, not a single value. Something tells me that you need to move the multiple output code in your reducer inside the loop you have commented out for taking the sum.
Or perhaps you don't need a reducer at all and can take care of this in the map phase. If you are using the reduce phase to end up with exactly 4 files by using a single reduce task, you can also achieve what you want by flipping the key and value in your map phase and forgetting about MultipleOutputs altogether, because you'll end up with only 3 working reduce tasks, one for each of your int values. To get the 4th file, you can output two copies of the record in each map call, using a special key to indicate that the output is meant for the normal file rather than one of the three special files. Normally I would not recommend such a course of action, as you put severe bounds on the level of parallelism you can achieve in the reduce phase when the number of keys is small.
You should also add some anomalous-data handling at the end of your 'if' ladder, incrementing a counter or something if you encounter a value that is not one of the three you are expecting; a combined sketch follows.
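Putting both suggestions together, a rough sketch of what the reduce method might look like (mos is the MultipleOutputs field from the question's setup(); the counter group and name are illustrative):

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    for (IntWritable val : values) {
        // Route each record by its value, inside the loop, instead of
        // calling toString() on the Iterable.
        switch (val.get()) {
            case 1: mos.write("name", key, val); break;
            case 2: mos.write("attributes", key, val); break;
            case 3: mos.write("others", key, val); break;
            default:
                // Anomalous data: count anything that isn't 1, 2 or 3.
                context.getCounter("Anomalies", "UnexpectedValue").increment(1);
        }
    }
}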