I have been writing protobuf records to our S3 buckets, and I want to use the Flink DataSet API to read them. So I implemented a custom FileInputFormat to achieve this. The code is below.
public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {

    public ProtobufInputFormat() {
    }

    // Flipped to true once parseDelimitedFrom() returns null, i.e. the split is exhausted.
    private transient boolean reachedEnd = false;

    @Override
    public boolean reachedEnd() throws IOException {
        return reachedEnd;
    }

    @Override
    public StandardLog.Pageview nextRecord(StandardLog.Pageview reuse) throws IOException {
        // 'stream' is the input stream that FileInputFormat opened for the current split.
        StandardLog.Pageview pageview = StandardLog.Pageview.parseDelimitedFrom(stream);
        if (pageview == null) {
            reachedEnd = true;
        }
        return pageview;
    }

    @Override
    public boolean supportsMultiPaths() {
        return true;
    }
}
public class BatchReadJob {

    public static void main(String... args) throws Exception {
        String readPath1 = args[0];

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ProtobufInputFormat inputFormat = new ProtobufInputFormat();
        inputFormat.setNestedFileEnumeration(true);
        inputFormat.setFilePaths(readPath1);

        DataSet<StandardLog.Pageview> dataSource = env.createInput(inputFormat);

        dataSource.map(new MapFunction<StandardLog.Pageview, String>() {
            @Override
            public String map(StandardLog.Pageview value) throws Exception {
                return value.getId();
            }
        }).writeAsText("s3://xxx", FileSystem.WriteMode.OVERWRITE);

        env.execute();
    }
}
The problem is that Flink always assigns one file split per parallel slot; in other words, it always processes exactly as many file splits as the parallelism.
I want to know the correct way of implementing a custom FileInputFormat.
Thanks.
I believe the behavior you're seeing is because ExecutionJobVertex calls the FileInputFormat.createInputSplits() method with a minNumSplits parameter equal to the vertex (data source) parallelism. So if you want different behavior, you'd have to override the createInputSplits() method.
You didn't say what behavior you actually want, though. If, for example, you just want one split per file, then you could override the testForUnsplittable() method in your subclass of FileInputFormat to always return true; it should also set the (protected) unsplittable boolean to true.
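A minimal sketch of that one-split-per-file option, assuming you subclass Flink's FileInputFormat as in the question:

    public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {

        // ... reachedEnd() / nextRecord() as before ...

        @Override
        protected boolean testForUnsplittable(FileStatus pathFile) {
            // Declare every file unsplittable so each file becomes exactly one split,
            // regardless of the data source parallelism.
            unsplittable = true;
            return true;
        }
    }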
When should we go for a custom InputFormat when using MapReduce programming?
Say I have a file that I need to read line by line, and it has 15 columns delimited by pipes. Should I go for a custom InputFormat?
I could use TextInputFormat as well as a custom InputFormat in this case.
A custom InputFormat should be written when you need to customize how input records are read. In your case you don't need such an implementation.
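For a pipe-delimited, one-record-per-line file like yours, the stock TextInputFormat plus an ordinary mapper is enough. A rough sketch (the column-count check and the choice of the first column as the key are assumptions for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class PipeDelimitedMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            // TextInputFormat already hands the mapper one line per record;
            // splitting the 15 pipe-delimited columns is plain mapper logic.
            String[] columns = value.toString().split("\\|", -1);
            if (columns.length == 15) {
                output.collect(new Text(columns[0]), value);
            }
        }
    }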
See below one example, out of many, where a custom InputFormat is genuinely needed.
Example: Reading Paragraphs as Input Records
If you are working with Hadoop MapReduce or AWS EMR, there may be a use case where the input files contain a paragraph as a key-value record instead of a single line (think of scenarios like analyzing comments on news articles). So instead of processing a single line as input, if you need to process a complete paragraph at once as a single record, you need to customize the default behavior of **TextInputFormat**, i.e. reading each line, into reading a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires us to create a custom record reader, which can be done by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of one line. See the following implementation; it's self-explanatory:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {

    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }

    @Override
    public void close() throws IOException {
        lineRecord.close();
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text("");
    }

    @Override
    public float getProgress() throws IOException {
        return lineRecord.getProgress();
    }

    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended, isNextLineAvailable;
        boolean retval;
        byte space[] = {' '};
        value.clear();
        isNextLineAvailable = false;
        do {
            appended = false;
            retval = lineRecord.next(lineKey, lineValue);
            if (retval) {
                if (lineValue.toString().length() > 0) {
                    // Non-empty line: append it (plus a space) to the current paragraph.
                    byte[] rawline = lineValue.getBytes();
                    int rawlinelen = lineValue.getLength();
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                // An empty line (appended == false) terminates the paragraph.
                isNextLineAvailable = true;
            }
        } while (appended);
        return isNextLineAvailable;
    }

    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}
With the ParagraphRecordReader implementation in place, we need to extend TextInputFormat to create a custom InputFormat, overriding just the getRecordReader() method so that it returns a ParagraphRecordReader and thereby replaces the default behavior.
ParagraphInputFormat looks like this:
public class ParagraphInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}
Ensure that the job configuration uses our custom input format implementation for reading data into the MapReduce job. It is as simple as setting the input format to ParagraphInputFormat, as shown below:
conf.setInputFormat(ParagraphInputFormat.class);
With the above changes, we can read paragraphs as input records in MapReduce programs.
Assume the input file contains paragraphs separated by blank lines. A simple mapper that just prints each record would look like this:
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // Each call now receives a whole paragraph in 'value' rather than a single line.
    System.out.println(key + " : " + value);
}
Yes, you can use TextInputFormat in your case.
Please go a little easy on me, as I am a newbie to Hadoop and MapReduce.
I have a .tar.gz file that I am trying to read with MapReduce by writing a custom InputFormat that uses CompressionCodecFactory.
I read in some documents on the Internet that CompressionCodecFactory can be used to read a .tar.gz file, hence I implemented it in my code.
The output I get after running the code is absolute garbage.
A piece of my input file is provided below:
"MAY 2013 KOTZEBUE, AK"
"RALPH WIEN MEMORIAL AIRPORT (PAOT)"
"Lat:66° 52'N Long: 162° 37'W Elev (Ground) 30 Feet"
"Time Zone : ALASKA WBAN: 26616 ISSN#: 0197-9833"
01,21,0,11,-11,3,11,54,0," ",4, ,0.0,0.00,30.06,30.09,10.2,36,10.0,25,360,22,360,01
02,25,3,14,-9,5,12,51,0," ",4, ,0.0,0.00,30.09,30.11,6.1,34,7.7,16,010,14,360,02
03,21,1,11,-12,7,11,54,0," ",4, ,0.0,0.00,30.14,30.15,5.0,28,6.0,17,270,16,270,03
04,20,8,14,-10,11,13,51,0,"SN BR",4, ,.001,.0001,30.09,30.11,8.6,26,9.2,20,280,15,280,04
05,29,19,24,-1,21,23,41,0,"SN BR",5, ,0.6,0.06,30.11,30.14,8.1,20,8.5,22,240,20,240,05
06,27,19,23,-3,21,23,42,0,"SN BR",4, ,0.1,0.01,30.14,30.15,8.7,19,9.4,18,200,15,200,06
The output I get is quite weird:
��#(���]�OX}�s���{Fw8OP��#ig#���e�1L'�����sAm�
��#���Q�eW�t�Ruk�#��AAB.2P�V�� \L}��+����.֏9U]N �)(���d��i(��%F�S<�ҫ ���EN��v�7�Y�%U�>��<�p���`]ݹ�#�#����9Dˬ��M�X2�'��\R��\1- ���V\K1�c_P▒W¨P[ÖÍãÏ2¨▒;O
Below is the Custom InputFormat and RecordReader code:
InputFormat
public class SZ_inptfrmtr extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split,
            JobConf job_run, Reporter reporter) throws IOException {
        return new SZ_recordreader(job_run, (FileSplit) split);
    }
}
RecordReader:
public class SZ_recordreader implements RecordReader<Text, Text> {

    FileSplit split;
    JobConf job_run;
    boolean processed = false;
    CompressionCodecFactory compressioncodec = null; // a factory that finds the correct codec for a given filename

    public SZ_recordreader(JobConf job_run, FileSplit split) {
        this.split = split;
        this.job_run = job_run;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
        FSDataInputStream in = null;
        if (!processed) {
            byte[] bytestream = new byte[(int) split.getLength()];
            Path path = split.getPath();
            compressioncodec = new CompressionCodecFactory(job_run);
            CompressionCodec code = compressioncodec.getCodec(path);
            // compressioncodec finds the correct codec from the file's path and stores the result in code
            System.out.println(code);
            FileSystem fs = path.getFileSystem(job_run);
            try {
                in = fs.open(path);
                IOUtils.readFully(in, bytestream, 0, bytestream.length);
                System.out.println("the input is " + in + in.toString());
                key.set(path.getName());
                value.set(bytestream, 0, bytestream.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
}
Could anybody please point out the flaw?
There is a codec for .gz, but no codec for .tar.
Your .tar.gz is being decompressed to .tar, but it's still a tarball, and not something understood by the Hadoop system.
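Even after finding the right codec, the record reader has to actually run the stream through it, and then something tar-aware has to walk the entries; Hadoop itself has nothing for the tar layer. A rough sketch of what next() would need to do instead, assuming Apache Commons Compress is available for the tar handling (that library is an assumption, not part of the original code):

    // Inside next(): decompress via the codec instead of copying the raw compressed bytes,
    // then iterate over the tar entries with a tar-aware reader (commons-compress here).
    Path path = split.getPath();
    CompressionCodec codec = new CompressionCodecFactory(job_run).getCodec(path);
    FSDataInputStream raw = path.getFileSystem(job_run).open(path);
    InputStream decompressed = (codec != null) ? codec.createInputStream(raw) : raw;
    TarArchiveInputStream tar = new TarArchiveInputStream(decompressed);
    TarArchiveEntry entry;
    while ((entry = tar.getNextTarEntry()) != null) {
        if (!entry.isDirectory()) {
            byte[] content = new byte[(int) entry.getSize()];
            IOUtils.readFully(tar, content, 0, content.length);
            // 'content' now holds one plain-text member of the tarball
        }
    }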
Your job may also be getting stuck in the communication between the mapper and reducer classes. To work with compressed files in MapReduce you need to set some configuration options for your job; the following must be set in the driver class:
conf.setBoolean("mapred.output.compress", true);//Compress The Reducer Out put
conf.setBoolean("mapred.compress.map.output", true);//Compress The Mapper Output
conf.setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);//Compression codec for Compresing mapper output
The only difference between a MapReduce job that works with uncompressed IO and one that works with compressed IO is these three configuration lines.
I read in some documents on the Internet that CompressionCodecFactory can be used to read a .tar.gz file, hence I implemented it in my code.
Compression codecs can do even better, and there are many codecs available for this purpose; the most common for big data are LzopCodec and SnappyCodec. You can find a Git repository for LzopCodec here: https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/LzopCodec.java
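For completeness, a concrete (illustrative) way to fill in the codecClass placeholder from the driver snippet above, using the old mapred property names; GzipCodec ships with Hadoop, while SnappyCodec additionally requires the native Snappy libraries:

    // Illustrative driver-side settings, not the only valid combination.
    conf.setBoolean("mapred.output.compress", true);
    conf.setClass("mapred.output.compression.codec",
            org.apache.hadoop.io.compress.GzipCodec.class,
            org.apache.hadoop.io.compress.CompressionCodec.class);

    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
            org.apache.hadoop.io.compress.SnappyCodec.class,
            org.apache.hadoop.io.compress.CompressionCodec.class);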
The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text, Text> {

    public void reduce(IntWritable key, Iterator<Neighbourhood> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();

        while (values.hasNext()) {
            Neighbourhood n = values.next();
            cachedValues.add(n);

            // correct output
            // output.collect(new Text(n.source), new Text(n.neighbours));
        }

        for (Neighbourhood node : cachedValues) {
            // wrong output
            output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}
The Neighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key that has 19 values (of type Neighbourhood) assigned to it. When I output source and neighbours inside the while loop, I get 19 different values. However, if I output them after the while loop, as shown in the code, I get the same value 19 times; that is, one object gets output 19 times! It is very weird. Any idea what is happening?
Here is the code of the Neighbourhood class:
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {

    Text source;
    Text neighbours;

    public Neighbourhood() {
        source = new Text();
        neighbours = new Text();
    }

    public Neighbourhood(String s, String n) {
        source = new Text(s);
        neighbours = new Text(n);
    }

    @Override
    public void readFields(DataInput arg0) throws IOException {
        source.readFields(arg0);
        neighbours.readFields(arg0);
    }

    @Override
    public void write(DataOutput arg0) throws IOException {
        source.write(arg0);
        neighbours.write(arg0);
    }

    @Override
    public int compareTo(Neighbourhood o) {
        return 0;
    }
}
You're being caught out by an efficiency mechanism employed by Hadoop: object reuse.
Your calls to values.next() return the same object reference each time; all Hadoop does behind the scenes is replace the contents of that same object with the underlying bytes (deserialized using the readFields() method).
To avoid this you'll need to create deep copies of the object returned from values.next(). Hadoop actually has a utility method to do this for you, ReflectionUtils.copy. A simple fix would be as follows:
while (values.hasNext()) {
    Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
    // Note the Configuration argument comes first in ReflectionUtils.copy.
    ReflectionUtils.copy(conf, values.next(), n);
    cachedValues.add(n);
}
You'll need to cache a version of the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
private JobConf conf;

@Override
public void configure(JobConf job) {
    conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.
I am using Hadoop 1.0.3.
I write logs to a Hadoop sequence file in HDFS. I call syncFS() after each batch of logs, but I never close the file (except when performing daily rolling).
What I want to guarantee is that the file is available to readers while the file is still being written.
I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key,val), it returns false at the first call.
I know the data is in the file since I can read it with FSDataInputStream or with the cat command and I am 100% sure that syncFS() is called.
I checked the namenode and datanode logs, no error or warning.
Why is SequenceFile.Reader unable to read the file while it is still being written?
You can't ensure that the data is completely written to disk on the datanode side. You can see this in the documentation of DFSClient#DFSOutputStream.sync(), which states:
All data is written out to datanodes. It is not guaranteed that data has
been flushed to persistent store on the datanode. Block allocations are
persisted on namenode.
So it basically updates the namenode's block map with the current information and sends the data to the datanodes. Since the data cannot be flushed to disk on the datanode, yet you read directly from the datanode, you hit a time frame in which the data is buffered somewhere and not accessible. Thus your SequenceFile.Reader thinks the data stream is finished (or empty) and can't read additional bytes, returning false to the deserialization process.
A datanode writes the data to disk (it is written beforehand, but not readable from outside) once the block is fully received. So you are able to read from the file once your block size has been reached or the file has been closed beforehand, thereby finalizing a block. This makes complete sense in a distributed environment, because your writer can die without finishing a block properly; it is a matter of consistency.
So the fix would be to make the block size very small so that blocks are finalized more often. But that is not very efficient, and it should be clear that your requirement is not well suited to HDFS.
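A minimal sketch of that block-size workaround, assuming you open the output stream yourself before handing it to the SequenceFile writer; the 1 MB value and the path are purely illustrative:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/logs/current.seq"); // hypothetical path

    // Request a much smaller block size than the 64 MB default so that blocks
    // are finalized (and therefore become readable) far more often.
    long smallBlockSize = 1L * 1024 * 1024; // 1 MB, illustrative only
    FSDataOutputStream out = fs.create(path, true,
            conf.getInt("io.file.buffer.size", 4096),
            fs.getDefaultReplication(), smallBlockSize);

    SequenceFile.Writer writer = SequenceFile.createWriter(conf, out,
            LongWritable.class, Text.class,
            SequenceFile.CompressionType.NONE, null);

Keep in mind this trades a lot of small blocks and namenode metadata for readability, so it is a stopgap rather than a recommendation.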
The reason SequenceFile.Reader fails to read a file being written is that it uses the file length to perform its magic.
The file length stays at 0 while the first block is being written, and is updated only when the block is full (64 MB by default).
Then the file size is stuck at 64 MB until the second block is fully written, and so on...
That means you can't read the last, incomplete block of a sequence file using SequenceFile.Reader, even if the raw data is readable directly through FSDataInputStream.
Closing the file also fixes the file length, but in my case I need to read files before they are closed.
I hit the same issue, and after some investigation and time I figured out the following workaround, which works.
The problem is due to the internal implementation of sequence file creation and the fact that it uses the file length, which is updated per 64 MB block.
So I created the following class to create the reader: I wrapped the Hadoop FileSystem with my own, overriding the getLength() method to return the actual file length instead:
public class SequenceFileUtil {

    public SequenceFile.Reader createReader(Configuration conf, Path path) throws IOException {
        WrappedFileSystem fileSystem = new WrappedFileSystem(FileSystem.get(conf));
        return new SequenceFile.Reader(fileSystem, path, conf);
    }

    private class WrappedFileSystem extends FileSystem {

        private final FileSystem nestedFs;

        public WrappedFileSystem(FileSystem fs) {
            this.nestedFs = fs;
        }

        @Override
        public URI getUri() {
            return nestedFs.getUri();
        }

        @Override
        public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            return nestedFs.open(f, bufferSize);
        }

        @Override
        public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException {
            return nestedFs.create(f, permission, overwrite, bufferSize, replication, blockSize, progress);
        }

        @Override
        public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException {
            return nestedFs.append(f, bufferSize, progress);
        }

        @Override
        public boolean rename(Path src, Path dst) throws IOException {
            return nestedFs.rename(src, dst);
        }

        @Override
        public boolean delete(Path path) throws IOException {
            return nestedFs.delete(path);
        }

        @Override
        public boolean delete(Path f, boolean recursive) throws IOException {
            return nestedFs.delete(f, recursive);
        }

        @Override
        public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException {
            return nestedFs.listStatus(f);
        }

        @Override
        public void setWorkingDirectory(Path new_dir) {
            nestedFs.setWorkingDirectory(new_dir);
        }

        @Override
        public Path getWorkingDirectory() {
            return nestedFs.getWorkingDirectory();
        }

        @Override
        public boolean mkdirs(Path f, FsPermission permission) throws IOException {
            return nestedFs.mkdirs(f, permission);
        }

        @Override
        public FileStatus getFileStatus(Path f) throws IOException {
            return nestedFs.getFileStatus(f);
        }

        @Override
        public long getLength(Path f) throws IOException {
            // Ask the DFSClient for the real length (including the block still being written)
            // and use it whenever the length reported by the nested FS is smaller.
            DFSClient.DFSInputStream open = new DFSClient(nestedFs.getConf()).open(f.toUri().getPath());
            long fileLength = open.getFileLength();
            long length = nestedFs.getLength(f);
            if (length < fileLength) {
                // We might have uncompleted blocks
                return fileLength;
            }
            return length;
        }
    }
}
I faced a similar problem, here is how I fixed it:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201303.mbox/%3CCALtSBbY+LX6fiKutGsybS5oLXxZbVuN0WvW_a5JbExY98hJfig#mail.gmail.com%3E