My reducer:
public static class CustomReducer extends Reducer<Int256Writable, ByteWritable, IntWritable, Int256Writable>
Based on whether the result IntWritable is > 1, I want to filter Hadoop's output so that the KV pairs where that condition applies are not written to the output.
Up until now I've been using a simple TextOutputFormat, but I'm planning to change to a binary format soon.
How can I filter the KV pairs before outputting them?
Holy shit I'm stupid. For the record: Simply don't context.write the result in your reducer if you don't want it to appear in the output.
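A minimal sketch of that pattern (the enclosing FilterJob class and the counting aggregation are assumptions for illustration; Int256Writable is the custom writable from the question):

import java.io.IOException;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterJob {   // hypothetical enclosing driver class

    // Int256Writable is the question's custom writable; the counting below is a
    // stand-in aggregation. The point is the guarded context.write call.
    public static class CustomReducer
            extends Reducer<Int256Writable, ByteWritable, IntWritable, Int256Writable> {

        @Override
        protected void reduce(Int256Writable key, Iterable<ByteWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (ByteWritable ignored : values) {
                count++;
            }
            // Pairs where the condition applies (result > 1) are simply not written.
            if (count <= 1) {
                context.write(new IntWritable(count), key);
            }
        }
    }
}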
I have been learning Phoenix CSV Bulk Load recently, and I found that org.apache.phoenix.mapreduce.CsvToKeyValueReducer will cause an OOM (Java heap out of memory) when a row has many columns (in my case, 44 columns per row with an average row size of 4 KB).
What's more, this class is similar to the HBase bulk load reducer class, KeyValueSortReducer, which means the same OOM may happen when using KeyValueSortReducer in my case.
So, my question about KeyValueSortReducer is: why does it need to sort all KeyValues in a TreeSet first and then write them all to the context? If I remove the TreeSet sorting code and write all KeyValues directly to the context, will the result be different or wrong?
I am looking forward to your reply. Best wishes!
Here is the source code of KeyValueSortReducer:
public class KeyValueSortReducer
    extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {

  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
      org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
      throws java.io.IOException, InterruptedException {
    TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv : kvs) {
      try {
        map.add(kv.clone());
      } catch (CloneNotSupportedException e) {
        throw new java.io.IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (KeyValue kv : map) {
      context.write(row, kv);
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}
Please have a look into this case study. There are some requirements where you need to order the KeyValue pairs that go into the same row in the HFile.
1. The main question: why does the HBase KeyValueSortReducer need to sort all KeyValues?
Thanks to RamPrasad G's reply, we can look into the case study: http://www.deerwalk.com/blog/bulk-importing-data/
This case study tells us more about HBase bulk import and the reducer class KeyValueSortReducer.
The reason for sorting all KeyValues in the KeyValueSortReducer reduce method is that the HFile needs this ordering. You can focus on this section:
A frequently occurring problem while reducing is lexical ordering. It happens when keyvalue list to be outputted from reducer is not sorted. One example is when qualifier names for a single row are not written in lexically increasing order. Another being when multiple rows are written in same reduce method and row id’s are not written in lexically increasing order. It happens because reducer output is never sorted. All sorting occurs on keyvalue outputted by mapper and before it enters reduce method. So, it tries to add keyvalue’s outputted from reduce method in incremental fashion assuming that it is presorted. So, before keyvalue’s are written into context, they must be added into sorting list like TreeSet or HashSet with KeyValue.COMPARATOR as comparator and then writing them in order specified by sorted list.
So, when a row has very many columns, it will use a lot of memory for sorting.
As the Javadoc of KeyValueSortReducer mentions:
/**
 * Emits sorted KeyValues.
 * Reads in all KeyValues from passed Iterator, sorts them, then emits
 * KeyValues in sorted order. If lots of columns per row, it will use lots of
 * memory sorting.
 * @see HFileOutputFormat
 */
2. The referenced question: why does the Phoenix CSV BulkLoad reducer cause OOM?
The reason the Phoenix CSV BulkLoad reducer causes OOM is the issue tracked as PHOENIX-2649.
The Comparator inside CsvTableRowKeyPair compared two CsvTableRowKeyPair instances incorrectly, which made all rows go through one single reducer in one single reduce call,
and this caused OOM quickly in my case.
Fortunately, the Phoenix team fixed this issue in version 4.7. If your Phoenix version is below 4.7, please be aware of it and try to upgrade your version,
or you can apply a patch to your version.
I hope this answer helps you!
In my MapReduce job, the mapper's output type is <Text, FileAlias>, and the class FileAlias has two attributes as follows:
public class FileAlias extends Configured implements WritableComparable<FileAlias> {
    public boolean isAlias;
    public String value;
    ...
}
For each output key (of type Text), only one of the output values (of type FileAlias) has the attribute isAlias set to true. I would like this output value to be the first item in the values passed to the reducer. Is there any way to do that?
Take a look at the setGroupingComparatorClass method on the Job object. You should be able to implement a comparator that puts the FileAlias with isAlias == true first in the Iterable that is passed to the reduce task; a sketch of one way to wire this up follows.
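As a rough illustration, here is a minimal sketch of the secondary-sort pattern this relies on. Everything here is an assumption layered on top of the question: the composite key class AliasKey, the comparator name, and the wiring are hypothetical, not part of the original code. The idea is to move the isAlias flag into the shuffle key so the framework sorts on it, while still grouping on the natural Text key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical composite key: the natural Text key plus the isAlias flag,
// so the flag can take part in the shuffle sort.
public class AliasKey implements WritableComparable<AliasKey> {
    private final Text name = new Text();
    private boolean isAlias;

    public void set(String naturalKey, boolean alias) { name.set(naturalKey); isAlias = alias; }
    public Text getName() { return name; }

    @Override
    public void write(DataOutput out) throws IOException {
        name.write(out);
        out.writeBoolean(isAlias);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name.readFields(in);
        isAlias = in.readBoolean();
    }

    // Sort order: by natural key; within the same key, the isAlias == true
    // entry sorts first, so it is the first value the reducer sees.
    @Override
    public int compareTo(AliasKey other) {
        int cmp = name.compareTo(other.name);
        if (cmp != 0) return cmp;
        return Boolean.compare(other.isAlias, this.isAlias);
    }
}

// Grouping comparator: compare only the natural key, so all values for the
// same Text key land in a single reduce() call despite differing flags.
class AliasGroupingComparator extends WritableComparator {
    protected AliasGroupingComparator() {
        super(AliasKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((AliasKey) a).getName().compareTo(((AliasKey) b).getName());
    }
}

The job would then register the grouping comparator with job.setGroupingComparatorClass(AliasGroupingComparator.class), and typically also a partitioner that hashes only getName(), so that all records with the same natural key reach the same reducer.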
I'm writing an M/R job that processes large time-series data files written in a binary format that looks something like this (newlines added here for readability; the actual data is continuous):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
Where the timestamp is simply an 8-byte struct, identifiable as such by its first 2 bytes. The actual data is bounded between duplicate timestamps, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit these key/value pairs to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader, for example if a certain reader receives the following split:
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Is there an easy way to access the file offsets so that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps in how splits are handled, so perhaps an explanation of these may help. Thanks.
In general it is not simple to create an input format which supports splits, since you should be able to find out where to move from the split boundary to get consistent records. XmlInputFormat is a good example of a format doing so.
I would suggest first considering whether you indeed need splittable inputs. You can define your input format as not splittable and avoid all these issues.
If your files are generally not much larger than the block size, you lose nothing. If they are, you will lose part of the data locality.
You can subclass a concrete subclass of FileInputFormat, for example SequenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;

public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // return your customized record reader here
    }
}
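As a small sketch of the wiring (assuming the old mapred API used above; the MyDriver class is a hypothetical placeholder), the job configuration would then point at the custom format:

import org.apache.hadoop.mapred.JobConf;

public class MyDriver {                       // hypothetical driver class
    public static void main(String[] args) {
        JobConf conf = new JobConf(MyDriver.class);
        // Register the non-splittable format so each file goes to a single mapper.
        conf.setInputFormat(NonSplitableBinaryFile.class);
        // ... set mapper, reducer, paths, and submit with JobClient.runJob(conf).
    }
}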
When I run Hadoop streaming, there's the output of the mapper (Key, Value).
The key is a word sequence separated by whitespace.
I'd like to use a partitioner that returns the hash value of the first two words.
So, I implemented it as:
public static class CounterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String[] line = key.toString().split(" ");
        String prefix = (line.length > 1) ? (line[0] + line[1]) : line[0];
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
My question is: is there a way to do this by using the built-in Hadoop library and modifying configuration properties such as
mapred.output.key.comparator.class
stream.map.output.field.separator
stream.num.map.output.key.fields
map.output.key.field.separator
mapred.text.key.comparator.options
...
Thanks in advance.
When I run Hadoop streaming, there's the output of the mapper (Key, Value). The key is a word sequence separated by whitespace.
My question is: is there a way to do this by using the built-in Hadoop library and modifying configuration such as
mapred.output.key.comparator.class
stream.map.output.field.separator
The built-in Hadoop library is based on Java, and the purpose of streaming is to use languages other than Java that talk over STDIN/STDOUT.
I don't see the purpose of changing the streaming-related properties through the Hadoop API, which is built in Java.
BTW, Configuration#set can be used to set configuration properties, besides setting them in the configuration files and from the command line; a small example follows.
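A minimal sketch of that last point, using two of the streaming properties listed in the question; the values and the class name here are arbitrary examples, not taken from the original job:

import org.apache.hadoop.conf.Configuration;

public class StreamingConfExample {                 // hypothetical driver snippet
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Same effect as -D stream.map.output.field.separator=. on the command line.
        conf.set("stream.map.output.field.separator", ".");
        conf.set("stream.num.map.output.key.fields", "2");
        // ... pass conf to the JobConf / Job that submits the streaming job.
    }
}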
In my application I want to create as many reducer tasks as possible based on the keys. My current implementation writes all the keys and values into a single (reducer) output file. To solve this I have used a partitioner, but the class is never called. The partitioner should be invoked after the map task and before the reduce task, but it was not. The code of the partitioner is the following:
public class MultiWayJoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int nbPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % nbPartitions;
    }
}
Is this code correct for partitioning the output based on the keys and values, and will the output be transferred to the reducers automatically?
You don't show all of your code, but there is usually a class (called the "Job" or "MR" class) that configures the mapper, reducer, partitioner, etc. and then actually submits the job to Hadoop. In this class you will have a job configuration object with many properties, one of which is the number of reducers. Set this property to whatever number your Hadoop configuration can handle.
Once the job is configured with a given number of reducers, that number will be passed into your partitioner (which looks correct, by the way). Your partitioner will then return the appropriate reducer/partition for each key/value pair. That's how you get as many reducers as possible; a sketch of such a driver class is below.
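As a rough sketch of such a driver class, assuming the new mapreduce API; MyMapper, MyReducer, the reducer count, and the command-line input/output paths are hypothetical placeholders standing in for the question's own classes and settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiWayJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "multi-way join");

        job.setJarByClass(MultiWayJoinDriver.class);
        job.setMapperClass(MyMapper.class);        // placeholder for the question's mapper
        job.setReducerClass(MyReducer.class);      // placeholder for the question's reducer
        job.setPartitionerClass(MultiWayJoinPartitioner.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // The number of reduce tasks; the partitioner receives this as nbPartitions.
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}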