When to go for custom Input format for Map reduce jobs - hadoop

When should we go for custom Input Format while using Map Reduce programming ?
Say I have a file which I need to read line by line and it has 15 columns delimited by pipe, should I go for custom Input Format ?
I can use a TextInput Format as well as Custom Input Format in this case.

CustomInputFormat can be written when you need to customize input
record reading. But in your case you need not have such an implementation.
see below example of CustomInputFormat out of many such...
Example : Reading Paragraphs as Input Records
If you are working on Hadoop MapReduce or Using AWS EMR then there might be an use case where input files consistent a paragraph as key-value record instead of a single line (think about scenarios like analyzing comments of news articles). So instead of processing a single line as input if you need to process a complete paragraph at once as a single record then you will need to customize the default behavior of **TextInputFormat** i.e. to read each line by default into reading a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires us to to create a custom record reader which can be done by implementing the class RecordReader. The next() method is where you would tell the record reader to fetch a paragraph instead of one line. See the following implementation, it’s self-explanatory:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
private LineRecordReader lineRecord;
private LongWritable lineKey;
private Text lineValue;
public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
lineRecord = new LineRecordReader(conf, split);
lineKey = lineRecord.createKey();
lineValue = lineRecord.createValue();
}
#Override
public void close() throws IOException {
lineRecord.close();
}
#Override
public LongWritable createKey() {
return new LongWritable();
}
#Override
public Text createValue() {
return new Text("");
}
#Override
public float getProgress() throws IOException {
return lineRecord.getPos();
}
#Override
public synchronized boolean next(LongWritable key, Text value) throws IOException {
boolean appended, isNextLineAvailable;
boolean retval;
byte space[] = {' '};
value.clear();
isNextLineAvailable = false;
do {
appended = false;
retval = lineRecord.next(lineKey, lineValue);
if (retval) {
if (lineValue.toString().length() > 0) {
byte[] rawline = lineValue.getBytes();
int rawlinelen = lineValue.getLength();
value.append(rawline, 0, rawlinelen);
value.append(space, 0, 1);
appended = true;
}
isNextLineAvailable = true;
}
} while (appended);
return isNextLineAvailable;
}
#Override
public long getPos() throws IOException {
return lineRecord.getPos();
}
}
With a ParagraphRecordReader implementation, we would need to extend TextInputFormat to create a custom InputFomat by just overriding the getRecordReader method and return an object of ParagraphRecordReader to override default behavior.
ParagrapghInputFormat will look like:
public class ParagrapghInputFormat extends TextInputFormat
{
#Override
public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter)throws IOException {
reporter.setStatus(split.toString());
return new ParagraphRecordReader(conf, (FileSplit)split);
}
}
Ensure that the job configuration to use our custom input format implementation for reading data into MapReduce jobs. It will be as simple as setting up inputformat type to ParagraphInputFormat as show below:
conf.setInputFormat(ParagraphInputFormat.class);
With above changes, we can read paragraphs as input records into MapReduce programs.
let’s assume that input file is as follows with paragraphs:
And a simple mapper code would look like:
#Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
System.out.println(key+" : "+value);
}

Yes u can use TextInputformat for you case.

Related

Custom FileInputFormat always assign one filesplit to one slot

I have been writing protobuf records to our s3 buckets. And I want to use flink dataset api to read from it. So I implemented a custom FileInputFormat to achieve this. The code is as below.
public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {
public ProtobufInputFormat() {
}
private transient boolean reachedEnd = false;
#Override
public boolean reachedEnd() throws IOException {
return reachedEnd;
}
#Override
public StandardLog.Pageview nextRecord(StandardLog.Pageview reuse) throws IOException {
StandardLog.Pageview pageview = StandardLog.Pageview.parseDelimitedFrom(stream);
if (pageview == null) {
reachedEnd = true;
}
return pageview;
}
#Override
public boolean supportsMultiPaths() {
return true;
}
}
public class BatchReadJob {
public static void main(String... args) throws Exception {
String readPath1 = args[0];
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ProtobufInputFormat inputFormat = new ProtobufInputFormat();
inputFormat.setNestedFileEnumeration(true);
inputFormat.setFilePaths(readPath1);
DataSet<StandardLog.Pageview> dataSource = env.createInput(inputFormat);
dataSource.map(new MapFunction<StandardLog.Pageview, String>() {
#Override
public String map(StandardLog.Pageview value) throws Exception {
return value.getId();
}
}).writeAsText("s3://xxx", FileSystem.WriteMode.OVERWRITE);
env.execute();
}
}
The problem is that flink always assign one filesplit to one parallelism slot. In other word, it always process the same number of file split as the number of the parallelism.
I want to know what's the correct way of implementing custom FileInputFormat.
Thanks.
I believe the behavior you're seeing is because ExecutionJobVertex calls the FileInputFormat. createInputSplits() method with a minNumSplits parameter equal to the vertex (data source) parallelism. So if you want a different behavior, then you'd have to override the createInputSplits method.
Though you didn't say what behavior you actually wanted. If, for example, you just want one split per file, then you could override the testForUnsplittable() method in your subclass of FileInputFormat to always return true; it should also set the (protected) unsplittable boolean to true.

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (from the mapper function) and want to write it to a file. I used the following code, but null was returned. Please help
Mapper
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.getConfiguration().set("feedName", feedName);
}
Driver Class
#Override
public int run(String[] args) throws Exception {
String lineVal = conf.get("feedName")
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write to your main output context (key and values) that you emit from your job.
Or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back to your driver, simply read from HDFS. For example you can store your name/values to the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
InputStream is = fs.open(src);
try {
Properties props = new Properties();
props.load(new InputStreamReader(is, "UTF8"));
for (Map.Entry prop : props.entrySet()) {
String name = (String)prop.getKey();
String value = (String)prop.getValue();
targetConf.set(name, value);
}
} finally {
is.close();
}
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
Using configuration you can only do the viceversa.
You can set values in Driver class
public int run(String[] args) throws Exception {
conf.set("feedName",value);
}
and set get those in Mapper class
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
String lineVal = conf.get("feedName");
}
UPDATE
One option to your question is write data to a file and store it in HDFS, and then access them in Driver class. These files can be treated as "Intermediate Files".
Just try it and see.

Weird output while reading a .tar.gz file in MapReduce

Please go a little Easy on me, As I am a newbie to hadoop and MapReduce.
I have a .tar.gz file that I am trying to read using mapReduce by writing a custom InputFormatter employing CompressionCodecfactory.
I read some document over Internet that CompressionCodecFactory can be used to read a .tar.gz file. hence I implemented that in my code.
The output that I get After running the code is absolute garbage.
A piece of my input file is provided below:
"MAY 2013 KOTZEBUE, AK"
"RALPH WIEN MEMORIAL AIRPORT (PAOT)"
"Lat:66° 52'N Long: 162° 37'W Elev (Ground) 30 Feet"
"Time Zone : ALASKA WBAN: 26616 ISSN#: 0197-9833"
01,21,0,11,-11,3,11,54,0," ",4, ,0.0,0.00,30.06,30.09,10.2,36,10.0,25,360,22,360,01
02,25,3,14,-9,5,12,51,0," ",4, ,0.0,0.00,30.09,30.11,6.1,34,7.7,16,010,14,360,02
03,21,1,11,-12,7,11,54,0," ",4, ,0.0,0.00,30.14,30.15,5.0,28,6.0,17,270,16,270,03
04,20,8,14,-10,11,13,51,0,"SN BR",4, ,.001,.0001,30.09,30.11,8.6,26,9.2,20,280,15,280,04
05,29,19,24,-1,21,23,41,0,"SN BR",5, ,0.6,0.06,30.11,30.14,8.1,20,8.5,22,240,20,240,05
06,27,19,23,-3,21,23,42,0,"SN BR",4, ,0.1,0.01,30.14,30.15,8.7,19,9.4,18,200,15,200,06
The the output I get is quite weird:
��#(���]�OX}�s���{Fw8OP��#ig#���e�1L'�����sAm�
��#���Q�eW�t�Ruk�#��AAB.2P�V�� \L}��+����.֏9U]N �)(���d��i(��%F�S<�ҫ ���EN��v�7�Y�%U�>��<�p���`]ݹ�#�#����9Dˬ��M�X2�'��\R��\1- ���V\K1�c_P▒W¨P[Ö␤ÍãÏ2¨▒;O
Below is the Custom InputFormat and RecordReader code:
InputFormat
public class SZ_inptfrmtr extends FileInputFormat<Text, Text>
{
#Override
public RecordReader<Text, Text> getRecordReader(InputSplit split,
JobConf job_run, Reporter reporter) throws IOException {
// TODO Auto-generated method stub
return new SZ_recordreader(job_run, (FileSplit)split);
}
}
RecordReader:
public class SZ_recordreader implements RecordReader<Text, Text>
{
FileSplit split;
JobConf job_run;
boolean processed=false;
CompressionCodecFactory compressioncodec=null; // A factory that will find the correct codec(.file) for a given filename.
public SZ_recordreader(JobConf job_run, FileSplit split)
{
this.split=split;
this.job_run=job_run;
}
#Override
public void close() throws IOException {
// TODO Auto-generated method stub
}
#Override
public Text createKey() {
// TODO Auto-generated method stub
return new Text();
}
#Override
public Text createValue() {
// TODO Auto-generated method stub
return new Text();
}
#Override
public long getPos() throws IOException {
// TODO Auto-generated method stub
return processed ? split.getLength() : 0;
}
#Override
public float getProgress() throws IOException {
// TODO Auto-generated method stub
return processed ? 1.0f : 0.0f;
}
#Override
public boolean next(Text key, Text value) throws IOException {
// TODO Auto-generated method stub
FSDataInputStream in=null;
if (!processed)
{
byte [] bytestream= new byte [(int) split.getLength()];
Path path=split.getPath();
compressioncodec=new CompressionCodecFactory(job_run);
CompressionCodec code = compressioncodec.getCodec(path);
// compressioncodec will find the correct codec by visiting the path of the file and store the result in code
System.out.println(code);
FileSystem fs= path.getFileSystem(job_run);
try
{
in =fs.open(path);
IOUtils.readFully(in, bytestream, 0, bytestream.length);
System.out.println("the input is " +in+ in.toString());
key.set(path.getName());
value.set(bytestream, 0, bytestream.length);
}
finally
{
IOUtils.closeStream(in);
}
processed=true;
return true;
}
return false;
}
}
Could anybody please point out the flaw..
There is a codec for .gz, but no codec for .tar.
Your .tar.gz is being decompressed to .tar, but it's still a tarball, and not something understood by the Hadoop system.
Your code may be stuck there in mapper and reducer class communication, To work with compressed files in MapReduce you need to set some configuration options for your job. These classes
are mandatory to set in driver class:
conf.setBoolean("mapred.output.compress", true);//Compress The Reducer Out put
conf.setBoolean("mapred.compress.map.output", true);//Compress The Mapper Output
conf.setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);//Compression codec for Compresing mapper output
The only difference between a MapReduce job that works with uncompressed versus
compressed IO are these three annotated lines.
I read some document over Internet that CompressionCodecFactory can be used to read a .tar.gz file. hence I implemented that in my code.
Even Compression codec do better over, but there are many codec for this purpose most one is LzopCodec and SnappyCodec for possible big data.. you can find a Git for LzopCodec here :https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/LzopCodec.java

Weird error in Hadoop reducer

The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text,Text> {
public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
while(values.hasNext()){
Neighbourhood n = values.next();
cachedValues.add(n);
//correct output
//output.collect(new Text(n.source), new Text(n.neighbours));
}
for(Neighbourhood node:cachedValues){
//wrong output
output.collect(new Text(key.toString()), new Text(node.source+"\t\t"+node.neighbours));
}
}
}
TheNeighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key which has 19 values(of type Neighbourhood) assigned. When I output the source and neighbours inside the while loop, I get the actual values of 19 different values. However, if I output them after the while loop as shown in the code, I get 19 similar values. That is, one object gets output 19 times! It is very weired that what happens. Is there any idea on that?
Here is the code of the class Neighbourhood
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {
Text source ;
Text neighbours ;
public Neighbourhood(){
source = new Text();
neighbours = new Text();
}
public Neighbourhood (String s, String n){
source = new Text(s);
neighbours = new Text(n);
}
#Override
public void readFields(DataInput arg0) throws IOException {
source.readFields(arg0);
neighbours.readFields(arg0);
}
#Override
public void write(DataOutput arg0) throws IOException {
source.write(arg0);
neighbours.write(arg0);
}
#Override
public int compareTo(Neighbourhood o) {
return 0;
}
}
You're being caught out by a efficiency mechanism employed by Hadoop - Object reuse.
Your calls to values.next() is returning the same object reference each time, all Hadoop is doing behind the scenes is replaced the contents of that same object with the underlying bytes (deserialized using the readFields() method).
To avoid this you'll need to create deep copies of the object returned from values.next() - Hadoop actually has a utility class to do this for you called ReflectionUtils.copy. A simple fix would be as follows:
while(values.hasNext()){
Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
ReflectionUtils.copy(values.next(), n, conf);
You'll need to cache a version of the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
#Override
protected void configure(JobConf job) {
conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.

Hadoop MapReduce: return sorted list of words in a text file

So my task is to return a alpahbetically sorted list of all words contained in a text file while keeping duplicates.
{To be or not to be} −→ {be be not or to to}
My idea is to take each word as the key as well as the value. This way, because hadoop sorts the keys, they will automatically be sorted alphabtically. In the Reduce phase I simply append all words with the same key (so basically identical words) to one single Text value.
public class WordSort {
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
// transform to lower case
String lower = word.toString().toLowerCase();
context.write(new Text(lower), new Text(lower));
}
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String result = "";
for (Text value : values){
res += value.toString() + " ";
}
context.write(key, new Text(result));
}
}
However my problem is, how do I simply return the value in my output file? At the moment I have this:
be be be
not not
or or
to to to
So in every line I have the key first and then the values, but I just want to return the values so that I get this:
be be
not
or
to to
Is this even possible or do I have to just delete one entry from the value of each word?
Disclaimer: I'm not an Hadoop user, but I do a lot of Map/Reduce with CouchDB.
If you just need the keys, why don't you emit an empty value?
Moreover, it sounds like you don't want to reduce them at all, since you want to get a key for every occurrence.
Just tried with the MaxTemperature example from the Hadoop - The Definitive Guide and the below code worked
context.write(null, new Text(result));

Resources