Weird output while reading a .tar.gz file in MapReduce - hadoop
Please go a little easy on me, as I am a newbie to Hadoop and MapReduce.
I have a .tar.gz file that I am trying to read in MapReduce by writing a custom InputFormat that uses CompressionCodecFactory.
I read in some documentation on the Internet that CompressionCodecFactory can be used to read a .tar.gz file, hence I implemented that in my code.
The output that I get after running the code is absolute garbage.
A piece of my input file is provided below:
"MAY 2013 KOTZEBUE, AK"
"RALPH WIEN MEMORIAL AIRPORT (PAOT)"
"Lat:66° 52'N Long: 162° 37'W Elev (Ground) 30 Feet"
"Time Zone : ALASKA WBAN: 26616 ISSN#: 0197-9833"
01,21,0,11,-11,3,11,54,0," ",4, ,0.0,0.00,30.06,30.09,10.2,36,10.0,25,360,22,360,01
02,25,3,14,-9,5,12,51,0," ",4, ,0.0,0.00,30.09,30.11,6.1,34,7.7,16,010,14,360,02
03,21,1,11,-12,7,11,54,0," ",4, ,0.0,0.00,30.14,30.15,5.0,28,6.0,17,270,16,270,03
04,20,8,14,-10,11,13,51,0,"SN BR",4, ,.001,.0001,30.09,30.11,8.6,26,9.2,20,280,15,280,04
05,29,19,24,-1,21,23,41,0,"SN BR",5, ,0.6,0.06,30.11,30.14,8.1,20,8.5,22,240,20,240,05
06,27,19,23,-3,21,23,42,0,"SN BR",4, ,0.1,0.01,30.14,30.15,8.7,19,9.4,18,200,15,200,06
The output I get is quite weird:
��#(���]�OX}�s���{Fw8OP��#ig#���e�1L'�����sAm�
��#���Q�eW�t�Ruk�#��AAB.2P�V�� \L}��+����.֏9U]N �)(���d��i(��%F�S<�ҫ ���EN��v�7�Y�%U�>��<�p���`]ݹ�#�#����9Dˬ��M�X2�'��\R��\1- ���V\K1�c_P▒W¨P[ÖÍãÏ2¨▒;O
Below is the Custom InputFormat and RecordReader code:
InputFormat
public class SZ_inptfrmtr extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split,
            JobConf job_run, Reporter reporter) throws IOException {
        return new SZ_recordreader(job_run, (FileSplit) split);
    }
}
RecordReader:
public class SZ_recordreader implements RecordReader<Text, Text> {

    FileSplit split;
    JobConf job_run;
    boolean processed = false;
    // A factory that will find the correct codec for a given filename.
    CompressionCodecFactory compressioncodec = null;

    public SZ_recordreader(JobConf job_run, FileSplit split) {
        this.split = split;
        this.job_run = job_run;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
        FSDataInputStream in = null;
        if (!processed) {
            byte[] bytestream = new byte[(int) split.getLength()];
            Path path = split.getPath();
            compressioncodec = new CompressionCodecFactory(job_run);
            // The factory inspects the file's path and returns the matching codec.
            CompressionCodec code = compressioncodec.getCodec(path);
            System.out.println(code);
            FileSystem fs = path.getFileSystem(job_run);
            try {
                in = fs.open(path);
                IOUtils.readFully(in, bytestream, 0, bytestream.length);
                System.out.println("the input is " + in + in.toString());
                key.set(path.getName());
                value.set(bytestream, 0, bytestream.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
}
Could anybody please point out the flaw?
There is a codec for .gz, but there is no codec for .tar.
Even once the .gz layer is decompressed you are left with a .tar, which is still a tarball and not something the Hadoop input machinery understands.
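On top of that, the posted next() never actually applies the codec it looks up: the stream from fs.open() is read as-is, so the bytes placed in the value are still gzip-compressed, which is exactly the garbage shown above. As a rough sketch of one way to handle both layers, the gzip layer can be removed with codec.createInputStream() and the tar layer walked with Apache Commons Compress; this is an assumption on my part (commons-compress would have to be added to the job's classpath), and the one-record-per-file behaviour and field names from the question are kept:

// Sketch only. Assumes org.apache.commons.compress.archivers.tar.TarArchiveInputStream
// and TarArchiveEntry (Apache Commons Compress) are on the classpath.
@Override
public boolean next(Text key, Text value) throws IOException {
    if (processed) {
        return false;
    }
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(job_run);
    CompressionCodecFactory factory = new CompressionCodecFactory(job_run);
    CompressionCodec codec = factory.getCodec(path);      // resolves GzipCodec for *.gz

    FSDataInputStream raw = fs.open(path);
    // Strip the gzip layer; without this step the value holds compressed bytes.
    InputStream decompressed = (codec != null) ? codec.createInputStream(raw) : raw;

    // The decompressed stream is still a tar archive, so it has to be parsed.
    TarArchiveInputStream tar = new TarArchiveInputStream(decompressed);
    StringBuilder contents = new StringBuilder();
    try {
        TarArchiveEntry entry;
        byte[] buf = new byte[4096];
        while ((entry = tar.getNextTarEntry()) != null) {
            if (entry.isDirectory()) {
                continue;                                  // skip directory entries
            }
            int n;
            while ((n = tar.read(buf)) != -1) {
                // Simplified: assumes plain UTF-8 text content in each member file.
                contents.append(new String(buf, 0, n, "UTF-8"));
            }
        }
    } finally {
        IOUtils.closeStream(tar);
    }

    key.set(path.getName());
    value.set(contents.toString());
    processed = true;
    return true;
}

Since a gzip stream is not splittable anyway, it is often simpler to unpack the tarballs up front (or repack each member file, for example as a plain .gz or a SequenceFile) before running the job.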
Your job may be getting stuck in the handoff between the mapper and the reducer. To work with compressed files in MapReduce you need to set some configuration options for your job; these are set in the driver class:
conf.setBoolean("mapred.output.compress", true);//Compress The Reducer Out put
conf.setBoolean("mapred.compress.map.output", true);//Compress The Mapper Output
conf.setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);//Compression codec for Compresing mapper output
The only difference between a MapReduce job that works with uncompressed I/O and one that works with compressed I/O is these three lines.
Regarding "CompressionCodecFactory can be used to read a .tar.gz file": compression codecs do help here, but there are many codecs to choose from; the most commonly used for big data are LzopCodec and SnappyCodec. You can find LzopCodec on GitHub here: https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/LzopCodec.java
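For reference, a driver fragment tying these options together might look like the sketch below. It assumes the old (mapred.*) property names used above, GzipCodec for the job output, and SnappyCodec for the intermediate map output (Snappy needs the native library to be installed); MyDriver is a placeholder class name.

// Assumes org.apache.hadoop.mapred.JobConf and
// org.apache.hadoop.io.compress.{CompressionCodec, GzipCodec, SnappyCodec}.
JobConf conf = new JobConf(MyDriver.class);             // MyDriver is a placeholder

// Compress the intermediate map output with Snappy.
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              SnappyCodec.class, CompressionCodec.class);

// Compress the final job output with gzip.
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);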
Related
When to go for a custom InputFormat for MapReduce jobs
When should we go for a custom InputFormat while using MapReduce programming? Say I have a file which I need to read line by line, and it has 15 columns delimited by a pipe; should I go for a custom InputFormat? I can use TextInputFormat as well as a custom InputFormat in this case.
A custom InputFormat is written when you need to customize how input records are read, but in your case you do not need such an implementation. Below is one example, out of many, of where a custom InputFormat is useful.

Example: reading paragraphs as input records

If you are working with Hadoop MapReduce or AWS EMR, there may be a use case where the input files contain a paragraph as one key-value record instead of a single line (think of scenarios like analyzing the comments on news articles). If you need to process a complete paragraph at once as a single record, rather than a single line, then you need to change the default behaviour of TextInputFormat, which reads one line at a time, so that it reads a complete paragraph as one input key-value pair for further processing in the MapReduce job.

This requires a custom record reader, which is done by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of one line. See the following implementation; it is self-explanatory:

public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {

    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }

    @Override
    public void close() throws IOException {
        lineRecord.close();
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text("");
    }

    @Override
    public float getProgress() throws IOException {
        return lineRecord.getProgress();
    }

    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended, isNextLineAvailable;
        boolean retval;
        byte space[] = {' '};
        value.clear();
        isNextLineAvailable = false;
        do {
            appended = false;
            retval = lineRecord.next(lineKey, lineValue);
            if (retval) {
                if (lineValue.toString().length() > 0) {
                    byte[] rawline = lineValue.getBytes();
                    int rawlinelen = lineValue.getLength();
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                isNextLineAvailable = true;
            }
        } while (appended);
        return isNextLineAvailable;
    }

    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}

With the ParagraphRecordReader implementation in place, we extend TextInputFormat to create a custom InputFormat by overriding just the getRecordReader() method and returning a ParagraphRecordReader. ParagraphInputFormat looks like this:

public class ParagraphInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
            JobConf conf, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}

Finally, make sure the job configuration uses this custom InputFormat implementation for reading data into the MapReduce job. That is as simple as setting the input format to ParagraphInputFormat, as shown below:

conf.setInputFormat(ParagraphInputFormat.class);

With the above changes, we can read paragraphs as input records into MapReduce programs.
Let's assume that the input file contains paragraphs as described above. A simple mapper would then look like:

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {
    System.out.println(key + " : " + value);
}
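For completeness, a minimal old-API driver wiring this together could look like the sketch below; the class names ParagraphDriver and ParagraphMapper, the map-only setup, and the argument-based paths are illustrative assumptions, not part of the original answer.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ParagraphDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ParagraphDriver.class);
        conf.setJobName("paragraph-example");

        conf.setInputFormat(ParagraphInputFormat.class); // the custom format from above
        conf.setMapperClass(ParagraphMapper.class);      // hypothetical class holding the map() shown above
        conf.setNumReduceTasks(0);                       // map-only example

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}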
Yes, you can use TextInputFormat in your case.
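To illustrate why TextInputFormat is enough here: the pipe-delimited parsing can live entirely in the mapper, since TextInputFormat already delivers one line per record. A small sketch follows; the class name and the choice of which columns to emit are made up for the example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PipeDelimitedMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // TextInputFormat hands us one line; split it on the pipe delimiter.
        String[] columns = value.toString().split("\\|", -1);
        if (columns.length == 15) {
            // Emit the first column as the key and the second as the value
            // (purely illustrative; pick whichever columns your job needs).
            output.collect(new Text(columns[0]), new Text(columns[1]));
        }
    }
}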
How to read two files in parallel with two map tasks
Please go a little easy on me because I am only 3 months into Hadoop and MapReduce. I have 2 files of 120 MB each; the data inside each file is completely unstructured but with a common pattern. Because of the varying structure of the data, my requirement cannot be satisfied by the default LineInputFormat. Hence, while reading the file I override the isSplitable() method and stop the split by returning false, so that one mapper can access one complete file and I can perform my logic and achieve the requirement.

My machine can run two mappers in parallel, so by stopping the split I am degrading performance by running one mapper at a time per file rather than running two mappers in parallel on a file. My question is: how can I run two mappers in parallel, one for each of the files, so the performance improves?

For example, when splitting was allowed:

file 1: split 1 (1st mapper) || split 2 (2nd mapper) ------ 2 min
file 2: split 1 (1st mapper) || split 2 (2nd mapper) ------ 2 min
Total time for reading two files ===== 4 min

When splitting is not allowed:

file 1: no parallel tasks, so (1st mapper) --------- 4 min
file 2: no parallel tasks, so (1st mapper) --------- 4 min
Total time to read two files ===== 8 min (performance degraded)

What I want:

file 1 (1st mapper) || file 2 (2nd mapper) ------ 4 min
Total time to read two files ====== 4 min

Basically I want both files to be read at the same time by two different mappers. Please help me achieve this scenario. Below are my custom InputFormat and custom RecordReader code.

public class NSI_inputformatter extends FileInputFormat<NullWritable, Text> {

    @Override
    public boolean isSplitable(FileSystem fs, Path filename) {
        //System.out.println("Inside the isSplitable Method of NSI_inputformatter");
        return false;
    }

    @Override
    public RecordReader<NullWritable, Text> getRecordReader(InputSplit split,
            JobConf job_run, Reporter reporter) throws IOException {
        //System.out.println("Inside the getRecordReader method of NSI_inputformatter");
        return new NSI_record_reader(job_run, (FileSplit) split);
    }
}

Record Reader:

public class NSI_record_reader implements RecordReader<NullWritable, Text> {

    FileSplit split;
    JobConf job_run;
    String text;
    public boolean processed = false;

    public NSI_record_reader(JobConf job_run, FileSplit split) {
        //System.out.println("Inside the NSI_record_reader constructor");
        this.split = split;
        this.job_run = job_run;
        //System.out.println(split.toString());
    }

    @Override
    public boolean next(NullWritable key, Text value) throws IOException {
        //System.out.println("Inside the next method of the NSI_record_reader");
        if (!processed) {
            byte[] content_add = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(job_run);
            FSDataInputStream input = null;
            try {
                input = fs.open(file);
                System.out.println("the input is " + input + input.toString());
                IOUtils.readFully(input, content_add, 0, content_add.length);
                value.set(content_add, 0, content_add.length);
            } finally {
                IOUtils.closeStream(input);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public NullWritable createKey() {
        System.out.println("Inside createKey() method of NSI_record_reader");
        return NullWritable.get();
    }

    @Override
    public Text createValue() {
        System.out.println("Inside createValue() method of NSI_record_reader");
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        System.out.println("Inside getPos() method of NSI_record_reader");
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        System.out.println("Inside getProgress() method of NSI_record_reader");
        return processed ? 1.0f : 0.0f;
    }
}

Input sample:

<Dec 12, 2013 1:05:56 AM CST> <Error> <HTTP> <BEA-101017> <[weblogic.servlet.internal.WebAppServletContext@42e87d99 - appName: 'Agile', name: '/Agile', context-path: '/Agile', spec-version: 'null'] Root cause of ServletException.
javax.servlet.jsp.JspException: Connection reset by peer: socket write error
    at com.agile.ui.web.taglib.common.FormTag.writeFormHeader(FormTag.java:498)
    at com.agile.ui.web.taglib.common.FormTag.doStartTag(FormTag.java:429)
    at jsp_servlet._default.__login_45_cms._jspService(__login_45_cms.java:929)
    at weblogic.servlet.jsp.JspBase.service(JspBase.java:34)
    at weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:227)
    Truncated. see log file for complete stacktrace
>
Retrieving the value for the attribute Page Two.Validation Status for the Object 769630
Retrieving the value for the attribute Page Two.Pilot Required for the Object 769630
Retrieving the value for the attribute Page Two.NPO Contact for the Object 769630
<Dec 12, 2013 1:12:13 AM CST> <Warning> <Socket> <BEA-000449> <Closing socket as no data read from it during the configured idle timeout of 0 secs>

Thanks.
You could try setting the property -D mapred.min.split.size=209715200. In that case FileInputFormat will not split your files, because they are smaller than mapred.min.split.size.
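The same property can also be set in the driver instead of on the command line; a small sketch, assuming the old-API JobConf already used in the question (MyDriver is a placeholder class name):

JobConf conf = new JobConf(MyDriver.class);   // placeholder driver class

// 209715200 bytes = 200 MB. The minimum split size is larger than either
// 120 MB file, so FileInputFormat will keep each file in a single split.
conf.setLong("mapred.min.split.size", 209715200L);

Note that this only controls how the input is split; whether the two resulting map tasks actually run at the same time still depends on the cluster having two free map slots.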
Weird error in Hadoop reducer
The reducer in my map-reduce job is as follows:

public static class Reduce_Phase2 extends MapReduceBase
        implements Reducer<IntWritable, Neighbourhood, Text, Text> {

    public void reduce(IntWritable key, Iterator<Neighbourhood> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
        while (values.hasNext()) {
            Neighbourhood n = values.next();
            cachedValues.add(n);
            //correct output
            //output.collect(new Text(n.source), new Text(n.neighbours));
        }
        for (Neighbourhood node : cachedValues) {
            //wrong output
            output.collect(new Text(key.toString()),
                           new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}

The Neighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key which has 19 values (of type Neighbourhood) assigned to it. When I output source and neighbours inside the while loop, I get 19 different values. However, if I output them after the while loop, as shown in the code, I get 19 identical values; that is, one object gets output 19 times! It is very weird. Any idea what is going on?

Here is the code of the Neighbourhood class:

public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {

    Text source;
    Text neighbours;

    public Neighbourhood() {
        source = new Text();
        neighbours = new Text();
    }

    public Neighbourhood(String s, String n) {
        source = new Text(s);
        neighbours = new Text(n);
    }

    @Override
    public void readFields(DataInput arg0) throws IOException {
        source.readFields(arg0);
        neighbours.readFields(arg0);
    }

    @Override
    public void write(DataOutput arg0) throws IOException {
        source.write(arg0);
        neighbours.write(arg0);
    }

    @Override
    public int compareTo(Neighbourhood o) {
        return 0;
    }
}
You're being caught out by an efficiency mechanism in Hadoop: object reuse.

Your calls to values.next() return the same object reference each time; all Hadoop is doing behind the scenes is replacing the contents of that same object with the underlying bytes (deserialized using the readFields() method). To avoid this you need to create deep copies of the object returned from values.next(); Hadoop actually has a utility class to do this for you, called ReflectionUtils.copy. A simple fix would be as follows:

while (values.hasNext()) {
    Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
    ReflectionUtils.copy(conf, values.next(), n);
    cachedValues.add(n);
}

You'll need to cache a reference to the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your reducer:

@Override
public void configure(JobConf job) {
    conf = job;
}

Be warned though: accumulating a list in this way is often the cause of memory problems in a job, especially if you have 100,000+ values for a single key.
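Putting the two fragments together, the corrected reducer could look roughly like the sketch below; it keeps the question's class and field names and assumes org.apache.hadoop.util.ReflectionUtils is imported.

public static class Reduce_Phase2 extends MapReduceBase
        implements Reducer<IntWritable, Neighbourhood, Text, Text> {

    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        conf = job;                       // cache the job configuration for the copies
    }

    public void reduce(IntWritable key, Iterator<Neighbourhood> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
        while (values.hasNext()) {
            // Deep-copy each value; Hadoop reuses the object returned by next().
            Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
            ReflectionUtils.copy(conf, values.next(), n);
            cachedValues.add(n);
        }
        for (Neighbourhood node : cachedValues) {
            output.collect(new Text(key.toString()),
                           new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}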
Read/write HBase without a reducer, exceptions
I want to read and write HBase without using any reducer. I followed the example in "The Apache HBase™ Reference Guide", but there are exceptions. Here is my code:

public class CreateHbaseIndex {

    static final String SRCTABLENAME = "sourceTable";
    static final String SRCCOLFAMILY = "info";
    static final String SRCCOL1 = "name";
    static final String SRCCOL2 = "email";
    static final String SRCCOL3 = "power";

    static final String DSTTABLENAME = "dstTable";
    static final String DSTCOLNAME = "index";
    static final String DSTCOL1 = "key";

    public static void main(String[] args) {
        System.out.println("CreateHbaseIndex Program starts!...");
        try {
            Configuration config = HBaseConfiguration.create();
            Scan scan = new Scan();
            scan.setCaching(500);
            scan.setCacheBlocks(false);
            scan.addColumn(Bytes.toBytes(SRCCOLFAMILY), Bytes.toBytes(SRCCOL1)); // info:name

            HBaseAdmin admin = new HBaseAdmin(config);
            if (admin.tableExists(DSTTABLENAME)) {
                System.out.println("table Exists.");
            } else {
                HTableDescriptor tableDesc = new HTableDescriptor(DSTTABLENAME);
                tableDesc.addFamily(new HColumnDescriptor(DSTCOLNAME));
                admin.createTable(tableDesc);
                System.out.println("create table ok.");
            }

            Job job = new Job(config, "CreateHbaseIndex");
            job.setJarByClass(CreateHbaseIndex.class);

            TableMapReduceUtil.initTableMapperJob(
                SRCTABLENAME,                 // input HBase table name
                scan,                         // Scan instance to control CF and attribute selection
                HbaseMapper.class,            // mapper
                ImmutableBytesWritable.class, // mapper output key
                Put.class,                    // mapper output value
                job);
            job.waitForCompletion(true);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        System.out.println("Program ends!...");
    }

    public static class HbaseMapper extends TableMapper<ImmutableBytesWritable, Put> {

        private HTable dstHt;
        private Configuration dstConfig;

        @Override
        public void setup(Context context) throws IOException {
            dstConfig = HBaseConfiguration.create();
            dstHt = new HTable(dstConfig, SRCTABLENAME);
        }

        @Override
        public void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // this is just copying the data from the source table...
            context.write(row, resultToPut(row, value));
        }

        private static Put resultToPut(ImmutableBytesWritable key, Result result)
                throws IOException {
            Put put = new Put(key.get());
            for (KeyValue kv : result.raw()) {
                put.add(kv);
            }
            return put;
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            dstHt.close();
            super.cleanup(context);
        }
    }
}

By the way, "sourceTable" looks like this:

key  name   email
1    peter  a@a.com
2    sam    b@b.com

"dstTable" will look like this:

key    value
peter  1
sam    2

I am a newbie in this field and need your help. Thanks~
You are correct that you don't need a reducer to write to HBase, but there are some instances where a reducer might help. If you are creating an index, you might run into situations where two mappers are trying to write to the same row. Unless you are careful to ensure that they write into different column qualifiers, you could overwrite one update with another due to race conditions. While HBase does row-level locking, it won't help if your application logic is faulty.

Without seeing your exceptions, I would guess that you are failing because you are trying to write key-value pairs from your source table into your index table, where the column family doesn't exist.
In this code you are not specifying the output format. You need to add the following:

job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, DSTTABLENAME);

Also, you are not supposed to create a new configuration in setup(); you should use the same configuration from the context.
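Applied to the mapper in the question, that last point would look roughly like the following sketch; only setup() changes, and dstConfig and dstHt are the fields already declared there.

@Override
public void setup(Context context) throws IOException {
    // Reuse the job's configuration rather than building a fresh one.
    dstConfig = context.getConfiguration();
    dstHt = new HTable(dstConfig, SRCTABLENAME);
}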
Hadoop MultipleOutputs does not write to multiple files when the file formats are custom formats
I am trying to read from Cassandra and write the reducer's output to multiple output files using the MultipleOutputs API (Hadoop version 1.0.3). The file formats in my case are custom output formats extending FileOutputFormat. I have configured my job in a similar manner as shown in the MultipleOutputs API. However, when I run the job, I only get one output file named part-r-00000, which is in text output format.

If job.setOutputFormatClass() is not set, by default it considers TextOutputFormat to be the format. Also, it will only allow one of the two format classes to be initialized. It completely ignores the output formats I specified in MultipleOutputs.addNamedOutput(job, "format1", MyCustomFileFormat1.class, Text.class, Text.class) and MultipleOutputs.addNamedOutput(job, "format2", MyCustomFileFormat2.class, Text.class, Text.class). Is someone else facing a similar problem, or am I doing something wrong?

I also tried to write a very simple MR program which reads from a text file and writes the output in two formats, TextOutputFormat and SequenceFileOutputFormat, as shown in the MultipleOutputs API. However, no luck there as well; I get only one output file in text output format. Can someone help me with this?

Job job = new Job(getConf(), "cfdefGen");
job.setJarByClass(CfdefGeneration.class);

//read input from cassandra column family
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");

//thrift input job configurations
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), HOST);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("classification")));
//ConfigHelper.setRangeBatchSize(job.getConfiguration(), 2048);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

//specification for mapper
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

//specifications for reducer (writing to files)
job.setReducerClass(ReducerToFileSystem.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//job.setOutputFormatClass(MyCdbWriter1.class);
job.setNumReduceTasks(1);

//set output path for storing output files
Path filePath = new Path(OUTPUT_DIR);
FileSystem hdfs = FileSystem.get(getConf());
if (hdfs.exists(filePath)) {
    hdfs.delete(filePath, true);
}
MyCdbWriter1.setOutputPath(job, new Path(OUTPUT_DIR));

MultipleOutputs.addNamedOutput(job, "cdb1", MyCdbWriter1.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "cdb2", MyCdbWriter2.class, Text.class, Text.class);

boolean success = job.waitForCompletion(true);
return success ? 0 : 1;

public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    public void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    //public void reduce(Text key, Text value, Context context)
    //        throws IOException, InterruptedException
    //(This was the mistake; changed the signature and it worked fine)
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            //context.write(key, value);
            mos.write("cdb1", key, value, OUTPUT_DIR + "/" + "cdb1");
            mos.write("cdb2", key, value, OUTPUT_DIR + "/" + "cdb2");
        }
        context.progress();
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

public class MyCdbWriter1<K, V> extends FileOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
    }

    public static void setOutputPath(Job job, Path outputDir) {
        job.getConfiguration().set("mapred.output.dir", outputDir.toString());
    }

    protected static class CdbDataRecord<K, V> extends RecordWriter<K, V> {
        @Override
        write()
        close()
    }
}
I found my mistake after debugging: my reduce method was never called. My function definition did not match the API's definition, so I changed it from public void reduce(Text key, Text value, Context context) to public void reduce(Text key, Iterable<Text> values, Context context). I don't know why my reduce method did not have the @Override annotation; it would have prevented my mistake.
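To make that last point concrete: annotating the method forces the compiler to check the signature against Reducer.reduce(), so the wrong-signature version simply fails to compile instead of being silently ignored at runtime. A sketch (reusing the names from the question):

public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text> {

    // With @Override the compiler rejects any signature that does not match
    // Reducer.reduce(KEYIN, Iterable<VALUEIN>, Context), turning the silent
    // "reduce never called" failure into a compile-time error.
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}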
I also encountered a similar issue; mine turned out to be that I was filtering all my records in the map phase, so nothing was being passed to the reducer. With un-named multiple outputs in the reduce task, this still resulted in a _SUCCESS file and an empty part-r-00000 file.