Hadoop HDFS: Read sequence files that are being written - hadoop

I am using Hadoop 1.0.3.
I write logs to a Hadoop sequence file in HDFS. I call syncFS() after each bunch of logs, but I never close the file (except when performing daily rolling).
What I want to guarantee is that the file is available to readers while the file is still being written.
I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key, val), it returns false on the first call.
I know the data is in the file since I can read it with FSDataInputStream or with the cat command and I am 100% sure that syncFS() is called.
I checked the namenode and datanode logs, no error or warning.
Why is SequenceFile.Reader unable to read the file while it is still being written?

You can't ensure that a write is completely flushed to disk on the datanode side. You can see this in the documentation of DFSClient#DFSOutputStream.sync(), which states:
All data is written out to datanodes. It is not guaranteed that data has
been flushed to persistent store on the datanode. Block allocations are
persisted on namenode.
So sync() basically updates the namenode's block map with the current information and sends the data to the datanodes. Since the data cannot be flushed to disk on the datanode, yet you read directly from the datanode, you hit a window where the data is buffered somewhere but not accessible. Thus your SequenceFile reader thinks the data stream is finished (or empty), cannot read additional bytes, and returns false to the deserialization process.
A datanode writes the data to disk once the block is fully received (it is written beforehand, but not readable from outside). So you are able to read from the file once your block size has been reached, or once the file has been closed and the last block thereby finalized. This makes sense in a distributed environment, because your writer can die without finishing a block properly - it is a matter of consistency.
So one fix would be to make the block size very small so that blocks are finalized more often. But that is not very efficient, and it should be clear that this requirement is not well suited to HDFS.
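The visibility gap described above can be reproduced with nothing but java.io buffering. A minimal plain-Java sketch (not HDFS code; the class name FlushVisibility is made up for illustration): bytes sitting in a client-side buffer are invisible to a concurrent reader until flushed, just as un-synced HDFS data is invisible past the last finalized block.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java analogy of the HDFS sync() window: data written into a
// client-side buffer is not visible to readers until it is flushed.
public class FlushVisibility {
    // Writes 100 bytes through a buffered stream and reports the file
    // size a concurrent reader would observe at that moment.
    public static long visibleAfterWrite(boolean flush) throws IOException {
        Path tmp = Files.createTempFile("flush-demo", ".bin");
        try (OutputStream os = new BufferedOutputStream(Files.newOutputStream(tmp), 8192)) {
            os.write(new byte[100]);   // stays in the 8 KB buffer
            if (flush) {
                os.flush();            // push the buffer out, like a sync
            }
            return Files.size(tmp);    // the "reader's view" of the file
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Without the flush the reader sees a length of 0; with it, all 100 bytes. HDFS adds the extra twist that even flushed data only becomes readable per finalized block.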

The reason the SequenceFile.Reader fails to read a file being written is that it uses the file length to perform its magic.
The file length stays at 0 while the first block is being written, and is updated only when the block is full (64 MB by default).
Then the file size is stuck at 64 MB until the second block is fully written, and so on...
That means you can't read the last, incomplete block of a sequence file using SequenceFile.Reader, even though the raw data is readable directly via FSDataInputStream.
Closing the file also fixes the file length, but in my case I need to read files before they are closed.

I hit the same issue, and after some investigation and time I came up with the following workaround, which works.
The problem is due to the internal implementation of SequenceFile.Reader: it relies on the file length, which is only updated per 64 MB block.
So I created the following class to build the reader, wrapping the Hadoop FileSystem with my own and overriding the getLength() method to return the actual file length instead:
public class SequenceFileUtil {

    public SequenceFile.Reader createReader(Configuration conf, Path path) throws IOException {
        WrappedFileSystem fileSystem = new WrappedFileSystem(FileSystem.get(conf));
        return new SequenceFile.Reader(fileSystem, path, conf);
    }

    private class WrappedFileSystem extends FileSystem {

        private final FileSystem nestedFs;

        public WrappedFileSystem(FileSystem fs) {
            this.nestedFs = fs;
        }

        @Override
        public URI getUri() {
            return nestedFs.getUri();
        }

        @Override
        public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            return nestedFs.open(f, bufferSize);
        }

        @Override
        public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite,
                int bufferSize, short replication, long blockSize, Progressable progress)
                throws IOException {
            return nestedFs.create(f, permission, overwrite, bufferSize, replication, blockSize, progress);
        }

        @Override
        public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException {
            return nestedFs.append(f, bufferSize, progress);
        }

        @Override
        public boolean rename(Path src, Path dst) throws IOException {
            return nestedFs.rename(src, dst);
        }

        @Override
        public boolean delete(Path path) throws IOException {
            return nestedFs.delete(path);
        }

        @Override
        public boolean delete(Path f, boolean recursive) throws IOException {
            return nestedFs.delete(f, recursive);
        }

        @Override
        public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException {
            return nestedFs.listStatus(f);
        }

        @Override
        public void setWorkingDirectory(Path new_dir) {
            nestedFs.setWorkingDirectory(new_dir);
        }

        @Override
        public Path getWorkingDirectory() {
            return nestedFs.getWorkingDirectory();
        }

        @Override
        public boolean mkdirs(Path f, FsPermission permission) throws IOException {
            return nestedFs.mkdirs(f, permission);
        }

        @Override
        public FileStatus getFileStatus(Path f) throws IOException {
            return nestedFs.getFileStatus(f);
        }

        @Override
        public long getLength(Path f) throws IOException {
            // DFSInputStream#getFileLength() also counts the bytes of the
            // last, not-yet-finalized block.
            DFSClient.DFSInputStream open = new DFSClient(nestedFs.getConf()).open(f.toUri().getPath());
            long fileLength = open.getFileLength();
            long length = nestedFs.getLength(f);
            if (length < fileLength) {
                // We might have uncompleted blocks
                return fileLength;
            }
            return length;
        }
    }
}

I faced a similar problem, here is how I fixed it:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201303.mbox/%3CCALtSBbY+LX6fiKutGsybS5oLXxZbVuN0WvW_a5JbExY98hJfig@mail.gmail.com%3E

Related

How to force file content to be processed sequentially?

I have a requirement to process a file as-is, meaning the file content should be processed in the order in which it appears in the file.
For example: I have a file of 700 MB. How can we make sure it is processed in order, given that processing depends on datanode availability and some datanodes may process their part slowly (low-spec hardware)?
One way to fix this would be to add a unique id/key to the file, but we don't want to add anything new to the file.
Any thoughts :)
You can guarantee that only one mapper processes the content of the file by writing your own FileInputFormat that sets isSplitable to false, e.g.:
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}
For more examples of how to do this, I recommend a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
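The WholeFileRecordReader referenced above is not shown, but its essence is a single read that hands the framework one record: (file name, entire file contents). A plain-Java sketch of that idea, using local files rather than HDFS (class and method names are illustrative, not Hadoop API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Essence of an unsplittable whole-file record reader: one read, one
// record, so the content is always seen in file order by a single task.
public class WholeFileRead {
    // The Hadoop version would do this inside RecordReader.next(key, value),
    // filling a BytesWritable; here it is just one bulk read.
    public static byte[] readWholeFile(Path file) throws IOException {
        return Files.readAllBytes(file);
    }
}
```

Because the split covers the whole file, record order trivially equals file order, which is the sequential-processing guarantee the question asks for.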

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write them to your job's main output (the keys and values you emit), or alternatively use MultipleOutputs if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read from HDFS. For example, you can load the name/value pairs into the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF-8"));
        for (Map.Entry<Object, Object> prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
Using Configuration you can only do the reverse.
You can set values in the Driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and then get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your problem is to write the data to a file in HDFS and then read it back in the Driver class. These files can be treated as "intermediate files".
Just try it and see.

Weird error in Hadoop reducer

The reducer in my map-reduce job is as follows:
public static class Reduce_Phase2 extends MapReduceBase implements Reducer<IntWritable, Neighbourhood, Text, Text> {

    public void reduce(IntWritable key, Iterator<Neighbourhood> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ArrayList<Neighbourhood> cachedValues = new ArrayList<Neighbourhood>();
        while (values.hasNext()) {
            Neighbourhood n = values.next();
            cachedValues.add(n);
            // correct output
            // output.collect(new Text(n.source), new Text(n.neighbours));
        }
        for (Neighbourhood node : cachedValues) {
            // wrong output
            output.collect(new Text(key.toString()), new Text(node.source + "\t\t" + node.neighbours));
        }
    }
}
The Neighbourhood class has two attributes, source and neighbours, both of type Text. This reducer receives one key with 19 values (of type Neighbourhood). When I output source and neighbours inside the while loop, I get 19 different values, as expected. However, if I output them after the while loop, as shown in the code, I get 19 identical values. That is, one object gets output 19 times! It is very weird. Any idea what is happening?
Here is the code of the class Neighbourhood
public class Neighbourhood extends Configured implements WritableComparable<Neighbourhood> {

    Text source;
    Text neighbours;

    public Neighbourhood() {
        source = new Text();
        neighbours = new Text();
    }

    public Neighbourhood(String s, String n) {
        source = new Text(s);
        neighbours = new Text(n);
    }

    @Override
    public void readFields(DataInput arg0) throws IOException {
        source.readFields(arg0);
        neighbours.readFields(arg0);
    }

    @Override
    public void write(DataOutput arg0) throws IOException {
        source.write(arg0);
        neighbours.write(arg0);
    }

    @Override
    public int compareTo(Neighbourhood o) {
        return 0;
    }
}
You're being caught out by an efficiency mechanism employed by Hadoop: object reuse.
Your calls to values.next() return the same object reference each time; all Hadoop does behind the scenes is replace the contents of that same object with the underlying bytes (deserialized via the readFields() method).
To avoid this you'll need to create deep copies of the object returned by values.next(). Hadoop actually has a utility method to do this for you: ReflectionUtils.copy. A simple fix would be as follows:
while (values.hasNext()) {
    Neighbourhood n = ReflectionUtils.newInstance(Neighbourhood.class, conf);
    ReflectionUtils.copy(values.next(), n, conf);
    cachedValues.add(n);
}
You'll need to cache a reference to the job Configuration (conf in the above code), which you can obtain by overriding the configure(JobConf) method in your Reducer:
@Override
public void configure(JobConf job) {
    conf = job;
}
Be warned though - accumulating a list in this way is often the cause of memory problems in your job, especially if you have 100,000+ values for a given single key.
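The reuse pitfall is easy to reproduce without Hadoop at all. A plain-Java sketch (all names here are made up for illustration): an iterator that refills one shared mutable holder, so caching references yields n copies of the last value, while copying the contents, as ReflectionUtils.copy effectively does, preserves them all.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Reproduces Hadoop's object-reuse behaviour: next() always returns the
// SAME holder object, only its contents change between calls.
public class ObjectReuseDemo {
    static class Holder { String value; }

    // Mimics the reducer's values iterator: one reused Holder, refilled per next().
    static Iterator<Holder> reusingIterator(List<String> data) {
        Holder shared = new Holder();
        Iterator<String> it = data.iterator();
        return new Iterator<Holder>() {
            public boolean hasNext() { return it.hasNext(); }
            public Holder next() { shared.value = it.next(); return shared; }
        };
    }

    // Buggy pattern: caching references stores the same object n times.
    public static List<String> cacheRefs(List<String> data) {
        List<Holder> cached = new ArrayList<>();
        reusingIterator(data).forEachRemaining(cached::add);
        List<String> out = new ArrayList<>();
        for (Holder h : cached) out.add(h.value);
        return out;
    }

    // Fixed pattern: deep-copy the contents before caching.
    public static List<String> cacheCopies(List<String> data) {
        List<String> out = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(data);
        while (it.hasNext()) {
            Holder copy = new Holder();
            copy.value = it.next().value;
            out.add(copy.value);
        }
        return out;
    }
}
```

With input ["a", "b", "c"], cacheRefs yields ["c", "c", "c"] (the 19-identical-values symptom), while cacheCopies yields ["a", "b", "c"].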

Hadoop RawLocalFileSystem and getPos

I've found that the getPos in the RawLocalFileSystem's input stream can throw a null pointer exception if its underlying stream is closed.
I discovered this when playing with a custom record reader.
To patch it, I simply check whether a call to "stream.available()" throws an exception, and if so, I return 0 from the getPos() function.
The existing getPos() implementation is found here:
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20/src/examples/org/apache/hadoop/examples/MultiFileWordCount.java
What should be the correct behaviour of getPos() in the RecordReader?
The "getPos" in the RecordReader has changed over time.
In the old mapred RecordReader implementations, it was used to count bytes read.
/**
 * Returns the current position in the input.
 *
 * @return the current position in the input.
 * @throws IOException
 */
long getPos() throws IOException;
In the newer mapreduce RecordReader implementations, this information is not provided by the RR class, but rather, it is part of the FSInputStream implementations:
class LocalFSFileInputStream extends FSInputStream implements HasFileDescriptor {
    private FileInputStream fis;
    private long position;

    public LocalFSFileInputStream(Path f) throws IOException {
        this.fis = new TrackingFileInputStream(pathToFile(f));
    }

    @Override
    public void seek(long pos) throws IOException {
        fis.getChannel().position(pos);
        this.position = pos;
    }

    @Override
    public long getPos() throws IOException {
        return this.position;
    }
    // ...
}
Thus, with the new mapreduce API, the RecordReader was abstracted to not necessarily return a getPos(). Newer implementations of RecordReaders which might want to use this underlying implementation can be rewritten to use the FSInputStream objects directly, which do provide a getPos().
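The byte-counting idea behind a tracking stream like the one above can be sketched in plain Java: a FilterInputStream counts bytes as they pass, so getPos() answers from its own counter and never has to interrogate a possibly-closed underlying stream. This is an illustration, not Hadoop's actual class.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Position tracking by counting bytes as they flow through the stream,
// instead of asking the underlying stream where it is.
public class PositionTrackingInputStream extends FilterInputStream {
    private long position;

    public PositionTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) position++;   // count only bytes actually read
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) position += n;
        return n;
    }

    public long getPos() {
        return position;          // safe even after close(): no delegation
    }
}
```

Because getPos() touches only the local counter, it cannot hit the null-pointer path the question describes.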

Hadoop approach to outputting millions of small binary/image files

I need to process and manipulate many images in a Hadoop job. The input arrives over the network, with slow downloads, using the MultithreadedMapper.
But what is the best approach for the reduce output? I think I should write the raw binary image data into a sequence file, transfer those files to their eventual home, then write a small app to extract the individual images from the SequenceFile into individual JPGs and GIFs.
Or is there a better option to consider?
If you feel up to it (or maybe through some Googling you can find an implementation), you could write a FileOutputFormat which wraps a FSDataOutputStream with a ZipOutputStream, giving you a Zip file for each reducer (and thus saving you the effort of writing the SequenceFile extraction program).
Don't be daunted by writing your own OutputFormat, it really isn't that difficult (and much easier than writing custom InputFormats which have to worry about splits). In fact here's a starting point - you just need to implement the write method:
// Key: Text (path of the file in the output zip)
// Value: BytesWritable - binary content of the image to save
public class ZipFileOutputFormat extends FileOutputFormat<Text, BytesWritable> {

    @Override
    public RecordWriter<Text, BytesWritable> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, ".zip");
        FileSystem fs = file.getFileSystem(job.getConfiguration());
        return new ZipRecordWriter(fs.create(file, false));
    }

    public static class ZipRecordWriter extends RecordWriter<Text, BytesWritable> {

        protected ZipOutputStream zos;

        public ZipRecordWriter(FSDataOutputStream os) {
            zos = new ZipOutputStream(os);
        }

        @Override
        public void write(Text key, BytesWritable value) throws IOException,
                InterruptedException {
            // TODO: create new ZipEntry & add to the ZipOutputStream (zos)
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException,
                InterruptedException {
            zos.close();
        }
    }
}
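The TODO in the write method boils down to one ZipEntry per record: the key names the entry inside the zip, and the value bytes are its body. A plain java.util.zip sketch of that logic, with Hadoop's Text/BytesWritable replaced by String/byte[] for illustration (class and helper names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// What ZipRecordWriter.write(key, value) would do, in plain java.util.zip terms.
public class ZipWriteSketch {
    public static void writeEntry(ZipOutputStream zos, String key, byte[] value)
            throws IOException {
        zos.putNextEntry(new ZipEntry(key));  // key = path of the file inside the zip
        zos.write(value);                     // value = raw image bytes
        zos.closeEntry();
    }

    // Round-trip helper: build a one-entry zip in memory.
    public static byte[] zipOf(String name, byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            writeEntry(zos, name, data);
        }
        return bos.toByteArray();
    }

    public static String firstEntryName(byte[] zip) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            return zis.getNextEntry().getName();
        }
    }
}
```

In the real RecordWriter, the ZipOutputStream would wrap the FSDataOutputStream from fs.create(), and the framework's close() call finalizes the zip's central directory.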
