Hadoop partitioner based on the first two words of the key

I am running a Hadoop streaming job. The mapper's output is (key, value), where the key is a sequence of words separated by whitespace.
I'd like to use a partitioner that returns the hash of the first two words, so I implemented it as:
public static class CounterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String[] line = key.toString().split(" ");
        String prefix = (line.length > 1) ? (line[0] + line[1]) : line[0];
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
My question is: is there a way to do this with the built-in Hadoop library, by modifying configuration properties such as:
mapred.output.key.comparator.class
stream.map.output.field.separator
stream.num.map.output.key.fields
map.output.key.field.separator
mapred.text.key.comparator.options
...
Thanks in advance.

The built-in Hadoop library is written in Java, and the purpose of streaming is to let you use languages other than Java that talk to Hadoop via STDIN/STDOUT.
I don't see the point of changing the streaming-related properties through the Hadoop Java API.
BTW, Configuration#set can be used to set configuration properties, besides setting them in the configuration files or on the command line.
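For example, if you are writing a Java driver anyway, a minimal sketch of setting some of these properties programmatically would be (the property names come from the question; whether a streaming job honors them depends on your setup):

// Requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.mapreduce.Job.
Configuration conf = new Configuration();
conf.set("stream.map.output.field.separator", " ");
conf.set("stream.num.map.output.key.fields", "2");
Job job = Job.getInstance(conf, "streaming-config-example");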

Related

How to filter Hadoop result output

My reducer:
public static class CustomReducer extends Reducer<Int256Writable, ByteWritable, IntWritable, Int256Writable>
Based on whether the resulting IntWritable is > 1, I want to filter the output of Hadoop so that the key/value pairs matching that condition are not written to the output.
Until now I've been using a simple TextOutputFormat, but I'm planning to switch to a binary format soon.
How can I filter the key/value pairs before they are written?
Answering my own question, for the record: simply don't context.write the result in your reducer if you don't want it to appear in the output.
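A minimal sketch of that idea, with standard writable types standing in for the custom ones above (the summing logic and the > 1 condition are only illustrative):

// Requires org.apache.hadoop.io.{Text, IntWritable} and org.apache.hadoop.mapreduce.Reducer.
public static class FilteringReducer extends Reducer<Text, IntWritable, IntWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Skip context.write for pairs matching the filter condition (here: sum > 1).
        if (sum <= 1) {
            context.write(new IntWritable(sum), key);
        }
    }
}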

Partitioning: how does Hadoop do it? Does it use a hash function? What is the default function?

Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for all of its output (key, value) pairs, which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
Problem: how does Hadoop do this? Does it use a hash function? What is the default function?
The default partitioner in Hadoop is the HashPartitioner which has a method called getPartition. It takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks.
For example, if there are 10 reduce tasks, getPartition will return values 0 through 9 for all keys.
Here is the code:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
To create a custom partitioner, you would extend Partitioner, create a method getPartition, then set your partitioner in the driver code (job.setPartitionerClass(CustomPartitioner.class);). This is particularly helpful if doing secondary sort operations, for example.
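For example, a minimal driver sketch showing where that call goes (all class names here are placeholders, not from any question above):

// Requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.mapreduce.Job.
Job job = Job.getInstance(new Configuration(), "custom-partitioner-example");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setPartitionerClass(CustomPartitioner.class);
job.setNumReduceTasks(10);   // getPartition will then return values 0 through 9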

Hadoop: how to set more than one column as the key and more than one column as the value in MapReduce classes

I want to set more than one column as the key and more than one column as the value in the MapReduce key-value pair classes in Hadoop using Java. The file I read from contains 20 columns. Thank you.
Combine all the columns you want to emit as the key and the value into delimited strings and emit them as Text.
Suppose your input looks like this:
No,Name,Age,Country
1,tariq,25,india
2,samy,25,xyz
And you want to emit "No+Age" as the key and "Name+Country" as the value.
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Key: No + Age, value: Name + Country, each joined with a delimiter.
        String outKey = parts[0] + "." + parts[2];
        String outVal = parts[1] + "." + parts[3];
        context.write(new Text(outKey), new Text(outVal));
    }
}
You could make a composite object which implements WritableComparable<YourClassName> to store the keys together in a concise form. See this link for a good example.
However, seeing as you want 20 components, I'd probably suggest just using a single Text object and parsing it when appropriate. I often use tab-separated values and parse them with a custom TSV parser, but merely splitting Text.toString() with a suitable delimiter character should be entirely sufficient.
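If you do go the composite-key route, a minimal sketch of a two-field WritableComparable could look like the following (the field names follow the No/Age example above; this is an illustration, not the code behind the link):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key holding two of the columns; more fields can be added the same way.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private String no = "";
    private String age = "";

    public CompositeKey() {}          // no-arg constructor required by Hadoop

    public CompositeKey(String no, String age) {
        this.no = no;
        this.age = age;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(no);
        out.writeUTF(age);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        no = in.readUTF();
        age = in.readUTF();
    }

    @Override
    public int compareTo(CompositeKey other) {
        int cmp = no.compareTo(other.no);
        return (cmp != 0) ? cmp : age.compareTo(other.age);
    }

    @Override
    public int hashCode() {           // keeps HashPartitioner grouping consistent
        return no.hashCode() * 163 + age.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey other = (CompositeKey) o;
        return no.equals(other.no) && age.equals(other.age);
    }

    @Override
    public String toString() {
        return no + "\t" + age;       // what TextOutputFormat would print
    }
}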

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is map a shard of the database (please don't ask why I need to do that, etc.) to a mapper, do certain operations on this data, output the results to reducers, and then use that output to run a second-phase map/reduce job on the same data using the same shard format.
Hadoop does not provide an input method that sends a shard of the database. You can only send data line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with FileSplits and all the problems associated with splitting the files.
All I need to do now is override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above snippet is what I wrote, but somehow it doesn't work well.
I would suggest putting connection strings, and some other indication of where to find the shard, into your input files.
The mapper will take this information, connect to the database, and do its job. I would not suggest converting result sets to Hadoop's Writable classes - it will hinder performance.
The problem I see to be addressed is having enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way is more flexible; a rough sketch of the mapper side follows.
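As an illustration of that idea (the input line format, the JDBC URL handling, and the query are all assumptions, not something from the question):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input line is assumed to carry "jdbcUrl<TAB>shardId".
public class ShardMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        String jdbcUrl = parts[0];
        String shardId = parts[1];
        // Connect to the shard and emit its rows; table and column names are placeholders.
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT k, v FROM shard_" + shardId)) {
            while (rs.next()) {
                context.write(new Text(rs.getString("k")), new Text(rs.getString("v")));
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}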
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to the mappers as input.
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    // lineRecordReader, lineKey, lineValue, and eol are fields of the enclosing RecordReader.
    StringBuilder sb = new StringBuilder();
    // Read five underlying lines and concatenate them into a single value.
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    return true;
}
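To plug such a reader in, you also need a small InputFormat that hands it out, along these lines (the class names are illustrative):

// Assumed companion InputFormat; NLinesRecordReader stands for the reader sketched above.
public class NLinesInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new NLinesRecordReader();
    }
}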

How to call a Partitioner in Hadoop v 0.21

In my application I want to create as many reducer tasks as possible based on the keys. My current implementation writes all the keys and values to a single (reducer) output file. To solve this I have used a partitioner, but I cannot get the class to be called. The partitioner should be called after the selection map task and before the selection reduce task, but it isn't. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        // Note: Text has no getFirst(); hash the key itself (or extract the prefix you need first).
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct for partitioning the files based on the keys and values, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a class (often called the "Job" or driver class) that configures the mapper, reducer, partitioner, etc., and then actually submits the job to Hadoop. In this class you will have a job configuration object with many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible.
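A sketch of the relevant driver lines (the job name and reducer count are illustrative; the Job constructor shown is the 0.21-era form, newer versions prefer Job.getInstance):

Configuration conf = new Configuration();
Job job = new Job(conf, "multi-way-join");
job.setPartitionerClass(MultiWayJoinPartitioner.class);
// The value set here is what arrives as nbPartitions in getPartition().
job.setNumReduceTasks(8);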
