Deallocate memory of reducer Iterable input - hadoop

I am facing a GC overhead problem because the reducer input is huge. There is no way for me to filter out any of this input data as all entries in the Iterable argument are of use. I tried increasing the heap size on the EMR cluster where I am running this job, but even that didn't help.
Basically, my reducer takes a list of strings and converts them into a list of objects of type A, which are then combined into a bigger object B. The workaround I came up with was to dump the entire Iterable input to disk and free the Iterable object completely, then read the dumped strings back from disk one at a time and keep building the larger object B, constructing an intermediate object A for each string and releasing it before fetching the next string from disk. This way I would keep only about half as much data on the heap as before. However, I don't seem to be able to free the Iterable input, because I am still getting GC overhead errors.
The way I tried to do this is as follows:
public void reduce(final Text key, Iterable<MapWritable> inputValues,
        final Context context) throws IOException, InterruptedException
{
    BufferedWriter bw = new BufferedWriter(
            new FileWriterWithEncoding(this.fileName, this.encoding));
    for (Iterator<MapWritable> iterator = inputValues.iterator(); iterator.hasNext();)
    {
        MapWritable mapWritable = iterator.next();
        // ----
        // Put the string contained in mapWritable into a file on disk
        // ----
    }
    bw.close();

    // Release the input Iterable instance
    inputValues = null; // This doesn't seem to work :'(

    BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(this.fileName), this.encoding));
    for (String line; (line = br.readLine()) != null;)
    {
        // ----
        // Then read from the file saved on disk one line at a time
        // and process it to build the object B
        // ----
    }
    br.close();
}
My question is, is there a way to deallocate the memory of the Iterable input of reducer, from within the reducer method?

Related

Why does HBase KeyValueSortReducer need to sort all KeyValues?

I have been learning about Phoenix CSV Bulk Load recently, and I found that org.apache.phoenix.mapreduce.CsvToKeyValueReducer will cause an OOM (Java heap out of memory) error when a row has many large columns (in my case, 44 columns per row with an average row size of 4 KB).
What's more, this class is similar to the HBase bulk load reducer class, KeyValueSortReducer, which means an OOM may also happen when using KeyValueSortReducer in my case.
So, my question about KeyValueSortReducer is: why does it need to sort all KeyValues in a TreeSet first and then write all of them to the context? If I remove the TreeSet sorting code and write all KeyValues directly to the context, will the result be different or wrong?
I am looking forward to your reply. Best wishes!
Here is the source code of KeyValueSortReducer:
public class KeyValueSortReducer extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
      org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
      throws java.io.IOException, InterruptedException {
    TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv: kvs) {
      try {
        map.add(kv.clone());
      } catch (CloneNotSupportedException e) {
        throw new java.io.IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (KeyValue kv: map) {
      context.write(row, kv);
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}
Please have a look at this case study. There are some requirements where you need to order the KeyValue pairs for the same row in the HFile.
1. The main question: why does HBase KeyValueSortReducer need to sort all KeyValues?
Thanks to RamPrasad G's reply, we can look into the case study: http://www.deerwalk.com/blog/bulk-importing-data/
It tells us more about HBase bulk import and the reducer class KeyValueSortReducer.
The reason for sorting all KeyValues in the KeyValueSortReducer reduce method is that the HFile needs this ordering. Focus on this section:
A frequently occurring problem while reducing is lexical ordering. It happens when keyvalue list to be outputted from reducer is not sorted. One example is when qualifier names for a single row are not written in lexically increasing order. Another being when multiple rows are written in same reduce method and row id’s are not written in lexically increasing order. It happens because reducer output is never sorted. All sorting occurs on keyvalue outputted by mapper and before it enters reduce method. So, it tries to add keyvalue’s outputted from reduce method in incremental fashion assuming that it is presorted. So, before keyvalue’s are written into context, they must be added into sorting list like TreeSet or HashSet with KeyValue.COMPARATOR as comparator and then writing them in order specified by sorted list.
So, when a row has many columns, it will use a lot of memory for sorting.
As the source code of KeyValueSortReducer mentions:
/**
 * Emits sorted KeyValues.
 * Reads in all KeyValues from passed Iterator, sorts them, then emits
 * KeyValues in sorted order. If lots of columns per row, it will use lots of
 * memory sorting.
 * @see HFileOutputFormat
 */
2. The referenced question: why does the Phoenix CSV BulkLoad reducer cause an OOM?
The reason the Phoenix CSV BulkLoad reducer causes an OOM is the issue described in PHOENIX-2649.
Because the Comparator inside CsvTableRowKeyPair compared two CsvTableRowKeyPair instances incorrectly, all rows were passed to a single reducer in one single reduce call,
which caused an OOM quickly in my case.
Fortunately, the Phoenix team fixed this issue in version 4.7. If your Phoenix version is below 4.7, please be aware of it and try to upgrade,
or apply a patch to your version.
I hope this answer helps!

Serializing an object to a JSON input stream using GSON?

I'm not sure if I'm asking for the right thing, but is it possible to make the GSON Gson.toJson(...) family of methods work in "streaming mode" while serializing to JSON? Say, there are cases when using an Appendable is not possible:
final String json = gson.toJson(value);
final byte[] bytes = json.getBytes(charset);
try ( final InputStream inputStream = new ByteArrayInputStream(bytes) ) {
    inputStreamConsumer.accept(inputStream);
}
The example above is not perfect in this scenario, because:
It generates a string json as a temporary buffer.
The json string produces a new byte array just to wrap it up into a ByteArrayInputStream instance.
I think it wouldn't be a big problem to write a CharSequence-to-InputStream adapter and get rid of the byte array copy, but I still couldn't avoid generating the temporary string buffer in order to use the inputStreamConsumer efficiently. So, I'd expect something like:
try ( final InputStream inputStream = gson.toJsonInputStream(value) ) {
    inputStreamConsumer.accept(inputStream);
}
Is it possible using just GSON somehow?
According to this comment, this cannot be done using GSON.
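That said, a common workaround (a sketch of my own, not a GSON API; the toJsonStream helper name and the Consumer<InputStream> parameter are made up for illustration, and it assumes spawning a helper thread is acceptable) is to pipe a Writer into an InputStream and let Gson.toJson(Object, Appendable) write into it:
import com.google.gson.Gson;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

public final class GsonPipedStreaming {

    // Serializes 'value' straight into a pipe, so no intermediate String or
    // byte[] copy of the whole JSON document is ever created.
    static void toJsonStream(Gson gson, Object value, Consumer<InputStream> inputStreamConsumer)
            throws IOException {
        PipedInputStream pipeIn = new PipedInputStream();
        PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);

        // The writer side must run on its own thread; otherwise the pipe
        // deadlocks as soon as its internal buffer fills up.
        Thread writerThread = new Thread(() -> {
            try (Writer writer = new OutputStreamWriter(pipeOut, StandardCharsets.UTF_8)) {
                gson.toJson(value, writer); // streams directly into the pipe
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writerThread.start();

        try (InputStream in = pipeIn) {
            inputStreamConsumer.accept(in);
        }
    }
}
The JSON still has to be produced character by character, but it flows through the pipe's fixed-size buffer instead of being materialized as one big string.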

Java Stream BufferedReader file stream

I am using Java 8 Streams to create a stream from a CSV file.
I am using BufferedReader.lines(). I read the docs for BufferedReader.lines():
After execution of the terminal stream operation there are no guarantees that the reader will be at a specific position from which to read the next character or line.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;

public class Streamy {
    public static void main(String args[]) {
        Reader reader = null;
        BufferedReader breader = null;
        try {
            reader = new FileReader("refined.csv");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        breader = new BufferedReader(reader);
        long l1 = breader.lines().count();
        System.out.println("Line Count " + l1); // this works correctly
        long l2 = breader.lines().count();
        System.out.println("Line Count " + l2); // this gives 0
    }
}
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file. What is the way around this problem?
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file.
No - and I don't know why you would expect it to given the documentation you quoted. Basically, the lines() method doesn't "rewind" the reader before starting, and may not even be able to. (Imagine the BufferedReader wraps an InputStreamReader which wraps a network connection's InputStream - once you've read the data, it's gone.)
What is the way around this problem?
Two options:
Reopen the file and read it from scratch
Save the result of lines() to a List<String>, so that you're then not reading from the file at all the second time. For example:
List<String> lines = breader.lines().collect(Collectors.toList());
As an aside, I'd strongly recommend using Files.newBufferedReader instead of FileReader - the latter always uses the platform default encoding, which isn't generally a good idea.
And for that matter, to load all the lines into a list, you can just use Files.readAllLines... or Files.lines if you want the lines as a stream rather than a list. (Note the caveats in the comments, however.)
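To make those suggestions concrete, here is a short sketch of my own (assuming Java 8+ and a UTF-8 encoded file; the file name is simply the one from the question) showing the Files-based alternatives to FileReader:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class FilesExamples {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("refined.csv");

        // Option 1: read everything into memory once, then count as often as you like.
        List<String> lines = Files.readAllLines(path); // UTF-8 by default
        System.out.println("Line Count " + lines.size());
        System.out.println("Line Count " + lines.size()); // still correct

        // Option 2: open a fresh stream each time you need to re-read the file.
        try (Stream<String> stream = Files.lines(path)) {
            System.out.println("Line Count " + stream.count());
        }

        // Option 3: a BufferedReader that uses UTF-8 instead of the platform default.
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            System.out.println("Line Count " + reader.lines().count());
        }
    }
}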
Probably the cited fragment from the JavaDoc needs to be clarified. Usually you would expect that after reading the whole file the reader will point to the end of the file. But with streams it depends on whether a short-circuit terminal operation is used and whether the stream is parallel. For example, if you use
String magicLine = breader.lines()
        .filter(str -> str.startsWith("magic"))
        .findAny()
        .orElse(null);
Your reader will likely stop after the first matching line (because there is no need to read further), or it will read the whole input file if no such line is found. If you perform the same operation on a parallel stream, the resulting position will be unpredictable, because the input will be split into implementation-dependent chunks in which the search is performed. That's why the documentation is written this way.
As for workarounds, please read @JonSkeet's answer. And consider closing your streams via the try-with-resources construct.
If there are no guarantees that the reader will be at a specific line, why wouldn't you create two readers?
Reader reader1 = new FileReader("refined.csv");
Reader reader2 = new FileReader("refined.csv");

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that etc.) to a mapper, then do certain operations on this data, output the results to reducers, and use that output again to do a second-phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method for sending a shard of the database; you can only send it line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with the FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above code snippet is the one that I wrote, but somehow it doesn't work well.
I would suggest putting connection strings and other indications of where to find the shard into your input files.
The mapper will take this information, connect to the database and do its job. I would not suggest converting result sets to Hadoop's Writable classes - it will hinder performance.
The problem I see to be addressed is having enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way will be more flexible, as shown in the sketch below.
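As a rough illustration of the "enough splits from a small input" idea (my own sketch, not part of the answer above; the file paths and driver class name are placeholders), the stock NLineInputFormat can already hand each mapper a fixed number of shard-reference lines:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShardJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "shard-processing");
        job.setJarByClass(ShardJobDriver.class);

        // Each line of the input file is a shard reference (e.g. a JDBC URL).
        // NLineInputFormat makes one split per N lines, so even a tiny input
        // file still fans out across many mappers.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1); // one shard reference per mapper

        FileInputFormat.addInputPath(job, new Path("shard-references.txt"));
        FileOutputFormat.setOutputPath(job, new Path("shard-output"));

        // job.setMapperClass(...); job.setReducerClass(...); etc. as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}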
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to the mappers as input.
public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    // Pull five lines from the wrapped LineRecordReader and join them
    // into a single value for the mapper.
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    return true;
}
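For completeness, a record reader like this is usually returned from a small custom input format. Here is a minimal sketch (the class names NLinesInputFormat and NLinesRecordReader are hypothetical, not taken from the answer above):
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Plugs the custom record reader into the job: the framework still handles
// file splits; only the "read 5 lines per key/value" grouping is custom.
public class NLinesInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new NLinesRecordReader(); // the reader shown above, wrapping a LineRecordReader
    }
}
The driver would then register it with job.setInputFormatClass(NLinesInputFormat.class).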

Incremental write of Protocol Buffer object

I have a Protocol Buffer definition for logging data.
message Message {
  required double val1 = 1;
  optional int32 val2 = 2;
}

message BigObject {
  repeated Message message = 1;
}
I receive one message per second. The messages are stored in memory in my BigObject and used for some tasks. At the same time, I want to store those messages in a file as a backup in case the application crashes. Simply writing the whole BigObject every time would be a waste of time, so I am trying to find a way to write only the messages added since the last write to the file. Is there a way to do that?
Protobuf is an appendable format, and your layout is ideal for this. Just open your file positioned at the end, and start with a new (empty) BigObject. Add/serialize just the new Message instance, and write to the file (from the end onwards).
Now, if you parse your file from the beginning you will get a single BigObject with all the Message instances (old and new).
You could actually do this by logging each individual Message as it arrives, as long as you wrap it in a BigObject each time, i.e. in pseudo-code
loop {
    msg = await NextMessage();
    wrapper = new BigObject();
    wrapper.Messages.Add(msg);
    file = OpenFileAtEnd();
    wrapper.WriteTo(file);
    file.Close();
}
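For reference, here is a minimal Java sketch of the same idea (my own illustration, not the answerer's code; it assumes the generated proto2 Java classes for the definition above and a backup file path of your choosing):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class MessageBackup {

    // Appends a single Message to the backup file by wrapping it in a
    // one-element BigObject and serializing it at the end of the file.
    public static void appendMessage(Message msg, String backupPath) throws IOException {
        BigObject wrapper = BigObject.newBuilder()
                .addMessage(msg)
                .build();
        try (FileOutputStream out = new FileOutputStream(backupPath, true /* append */)) {
            wrapper.writeTo(out);
        }
    }

    // After a crash, parsing the whole file yields a single BigObject holding
    // every Message appended so far, because concatenated serialized messages
    // merge their repeated fields when parsed.
    public static BigObject restore(String backupPath) throws IOException {
        try (FileInputStream in = new FileInputStream(backupPath)) {
            return BigObject.parseFrom(in);
        }
    }
}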
