Hadoop Custom Input format with the new API - hadoop

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that etc) to a mapper, then do certain operation on this data, output the results to reducers and use that output again to do the second phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't also help in this case. I need to extend FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader since the underlying code already deals with the FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method which I don't exactly know how.
for(int i=0;i<shard_size;i++){
if(lineRecordReader.nextKeyValue()){
lineValue.append(lineRecordReader.getCurrentValue().getBytes(),0,lineRecordReader.getCurrentValue().getLength());
}
}
The above code snippet is the one that wrote but somehow doesn't work well.

I would suggest to put into your input files connection strings and some other indications where to find the shard.
Mapper will take this information, connect to the database and do a job. I would not suggest t o convert result sets to hadoop's writable classes - it will hinder performance.
The problem I see to be addressed - is to have enough splits of this relatively small input.
You can simply create enough small files with a few shards references each, or you can tweak input format to build small splits. Second way will be more flexible.

What I did, is something like this. I wrote my own record reader to read n lines at a time and send them to mappers as input
public boolean nextKeyValue() throws IOException,
InterruptedException {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
if (!lineRecordReader.nextKeyValue()) {
return false;
}
lineKey = lineRecordReader.getCurrentKey();
lineValue = lineRecordReader.getCurrentValue();
sb.append(lineValue.toString());
sb.append(eol);
}
lineValue.set(sb.toString());
//System.out.println(lineValue.toString());
return true;
// throw new UnsupportedOperationException("Not supported yet.");
}
how do you thin

Related

why hbase KeyValueSortReducer need to sort all KeyValue

I am learning Phoenix CSV Bulk Load recently and I found that the source code of org.apache.phoenix.mapreduce.CsvToKeyValueReducer will cause OOM ( java heap out of memory ) when columns are large in one row (In my case, 44 columns in one row and the avg size of one row is 4KB).
What's more, this class is similar with the hbase bulk load reducer class - KeyValueSortReducer. It means that OOM may happen when using KeyValueSortReducer in my case.
So, I have a question of KeyValueSortReducer - why it need to sort all kvs in treeset first and then write all of them to context? If I remove the treeset sorting code and wirte all kvs directly to the context, the result will be different or be wrong ?
I am looking forward to your reply. Best wish to you!
here is the source code of KeyValueSortReducer:
public class KeyValueSortReducer extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
throws java.io.IOException, InterruptedException {
TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
for (KeyValue kv: kvs) {
try {
map.add(kv.clone());
} catch (CloneNotSupportedException e) {
throw new java.io.IOException(e);
}
}
context.setStatus("Read " + map.getClass());
int index = 0;
for (KeyValue kv: map) {
context.write(row, kv);
if (++index % 100 == 0) context.setStatus("Wrote " + index);
}
}
}
please have a look in to this case study. there are some requirements where you need to order keyvalue pairs into the same row in the HFile.
1.The main question : why hbase KeyValueSortReducer need to sort all KeyValue ?
Thanks to RamPrasad G's reply, we can look into the case study : http://www.deerwalk.com/blog/bulk-importing-data/
This case study will tell us more about hbase bulk import and the reducer class - KeyValueSortReducer.
The reason of sorting all KeyValue in KeyValueSortReducer reduce method is that the HFile need this sorting. you can focus on the section :
A frequently occurring problem while reducing is lexical ordering. It happens when keyvalue list to be outputted from reducer is not sorted. One example is when qualifier names for a single row are not written in lexically increasing order. Another being when multiple rows are written in same reduce method and row id’s are not written in lexically increasing order. It happens because reducer output is never sorted. All sorting occurs on keyvalue outputted by mapper and before it enters reduce method. So, it tries to add keyvalue’s outputted from reduce method in incremental fashion assuming that it is presorted. So, before keyvalue’s are written into context, they must be added into sorting list like TreeSet or HashSet with KeyValue.COMPARATOR as comparator and then writing them in order specified by sorted list.
So, when your columns is very large, it will use a lot of memory for sorting.
As the source code of KeyValueSortReducer memtioned :
/**
* Emits sorted KeyValues.
* Reads in all KeyValues from passed Iterator, sorts them, then emits
* KeyValues in sorted order. If lots of columns per row, it will use lots of
* memory sorting.
* #see HFileOutputFormat
*/
2.The referenced question : why Phoenix CSV BulkLoad reducer casue OOM ?
The reason of Phoenix CSV BulkLoad reducer casue OOM is the issue refer to PHOENIX-2649.
Due to the Comparator inside CsvTableRowKeyPair error to compare two CsvTableRowKeyPair and make all rows to pass by one single reducer in one single reduce call,
it will cause OOM quickly in my case.
Fortunately, Phoenix Team had fixed this issue upon the version of 4.7. If your phoenix version is under 4.7, please note about it and try to update your version,
or you can make a patch to your version.
I hope this answer will help you !

how can I prove field-grouping functionality(tuple with same field goes to same task)?

I'm a fresher of Storm, I'm getting started with Storm using the project storm-starter. In this project there is a Topology called WordCountTopology, the key code for building topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of WordCount bolt, the key method execute is:
#Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count++;
counts.put(word, count);
collector.emit(new Values(word, count));
}
My Question is:
As the functionality of filed-grouping is that: tuples with the same filed word will go to the same task for post processing. Here "task" means thread, how can I prove this functionality? In addition, in my opinion, the logic in method execute is a little awkward. In a single task, the parameter tuple is always the same, but in the execute method it does not reflect this, in other words, the logic dose not use this convenience.
Am I clear? My point is that, the code here in execute is not taking the feature of filed-grouping into account, the code here can also be applied to the situation of shuffle-grouping.
I would like to site few points, it might help clear your doubts
Here "task" means thread
In storm's terminology tasks are NOT threads but they are responsible for processing the actual logic. Each spout or bolt that you implement in your code executes as many tasks across the cluster. So you can define them as an running instance of the components i.e Spouts or Bolts.
There is another entity called Executors which are the thread responsible for running these tasks.It can run one or multiple tasks of the same component. An executor having multiple tasks actually is saying the same component is executed for multiple times by the executor.
Now coming back to your question
the code here in execute is not taking the feature of filed-grouping into account, the code here can also be applied to the situation of shuffle-grouping
In very brief A fields grouping lets you group a stream by a subset of its fields, meaning in order to do a word count, if we filtered the stream by using fieldsGrouping on a field name 'first_name` then it is expected that all the first_name field with a value say (Foo) should go to the same task, and the same field with a different value (Bar) goes to another task.
So here the execute method is supposed to receive the same field value and thus can easily update its counter and to do that it does not require to consider anything special. The whole logic is written keeping in mind that the bolt will be chained with the proper data and that's why using the proper grouping become such an important thing. So if you use shuffleGrouping then same code will run but produces incorrect data.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
#Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
this.boltId = tc.getThisTaskId();
}
Now in the execute() of the same fieldsGrouped bolt that receives the tuples, you just print the id and the tuple:
#Override
public void execute(Tuple tuple) {
String myWord = (String) tuple.getValue(0);
System.out.println("word: "+myWord+" boltID:"+boltId);
}

hadoop: how do i know what file a task was handling when it failed?

I have a job that has some failed tasks. I want to try and reproduce on the files the tasks were handling but can't find how to know which files these were.
How can I find what files a task was handling when it failed?
I have no idea if that really works, but you may want to try that out (I was coding with Hadoop 2.2):
job.waitForCompletion(true);
Class<? extends InputFormat<?, ?>> clz = job.getInputFormatClass();
InputFormat<?, ?> inputFormat = ReflectionUtils.newInstance(clz, conf);
List<InputSplit> splits = inputFormat.getSplits(job);
TaskCompletionEvent[] events = job.getTaskCompletionEvents(0);
for (TaskCompletionEvent ev : events) {
if (ev.isMapTask() && ev.getStatus() == Status.FAILED) {
int idWithinJob = ev.idWithinJob();
InputSplit inputSplit = splits.get(idWithinJob);
if (inputSplit instanceof FileSplit) {
FileSplit sp = (FileSplit) inputSplit;
System.out.println(sp.getPath() + " failed!");
}
}
}
The idea is rather simple, you get all task events, take map and failed ones. Then you can obtain an index that is usally assigned to the split internally.
The split itself can be obtained by running it over the job data. Please note that the FileSplit can also be a part of the file (block), so you want to check the internal offset and length fields. The type of the split is dependent on the InputFormat, so there is no guarantee that the returned splits are a FileSplit.
Turns out grepping the logs shows what files the task is reading.

Hadoop variable set in reducer and read in driver

How I can set a variable in a reducer, which after its execution can be read by the driver after all tasks finish their execution? Something like:
class Driver extends Configured implements Tool{
public int run(String[] args) throws Exception {
...
JobClient.runJob(conf); // reducer sets some variable
String varValue = ...; // variable value is read by driver
}
}
WORKAROUND
I came up with this "ugly" workaround. The main idea is that you create a group of counters in which you hold only one counter where its name is the value you wish to return (you ignore the actual counter value). The code look like this:
// reducer || mapper
reporter.incrCounter("Group name", "counter name -> actual value", 0);
// driver
RunningJob runningJob = JobClient.runJob(conf);
String value = runningJob.getCounters().getGroup("Group name").iterator().next().getName();
The same will work for mappers as well. Though this solves my problem, I think this type of solution is "ugly". Thus I leave the question open.
You can't amend the configuration in a map / reduce task and expect that change to be persisted to configurations in other tasks and / or the job client that submitted the job (lets say you write different values in the reducer - which one 'wins' out and is persisted back?).
You can however write files to HDFS yourself which can then be read back when your job returns - No less ugly really but there isn't a way doesn't involve another technology (Zookeeper, HBase or any other NoSQL / RDB) holding the value between your task ending and you being able to retrieve the value upon job success.

Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce

I'm writing a M/R job that processes large time-series-data files written in binary format that looks something like this (new lines here for readability, actual data is continuous, obviously):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
Where timestamp is simply a 8 byte struct, identifiable as such by the first 2 bytes. The actual data is bounded between duplicate value timestamps, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit the key/value pair to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP, and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader, so if a certain reader receives the following split
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Do I have an easy way to access the file offsets such that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps on how splits are handled, so perhaps an explanation of these may help. thanks.
In general it is not simple to create input format which support splits, since you should be able to find out where to move from the split boundary to get consistent records. XmlInputFormat is good example of format doing so.
I would suggest first consider if you indeed need splittable inputs? You can define your input format as not splittable and not have all these issues.
If you files are generally not much larger then block size - you loose nothing. If they do - you will loose part of the data locality.
You can subclass the concrete subclass of FileInputFormat, for example, SeqenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;
public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat{
#Override
protected boolean isSplitable(FileSystem fs, Path file) {
return false;
}
#Override
public RecordReader getRecordReader(InputSplit split, JobConf job,
Reporter reporter) throws IOException {
//return your customized record reader here
}
}

Resources