How to force file content to be processed sequentially? - hadoop

I have a requirement to process a file as-is, meaning the file content should be processed in the order in which it appears in the file.
For example: I have a file of 700 MB. How can we make sure the file is processed in the order it appears, given that this depends on DataNode availability? In some cases one of the DataNodes may process the file slowly (low-spec machine).
One way to fix this is to add a unique id/key to the file, but we don't want to add anything new to the file.
Any thoughts :)

You can guarantee that only one mapper processes the content of the file by writing your own FileInputFormat which sets isSplitable to false. E.g.:
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}
For more examples of how to do this, I would recommend looking at a GitHub project. Depending on your Hadoop version, slight changes might be necessary.
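The answer's getRecordReader returns a WholeFileRecordReader that is not shown; here is a minimal sketch of what such a reader could look like against the old mapred API (class and field names are illustrative, not taken from the original answer). It emits exactly one record per file: the file path as the key and the entire file content as the value.
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {

    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (processed) {
            return false; // the whole file is a single record
        }
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        byte[] contents = new byte[(int) split.getLength()];
        FSDataInputStream in = fs.open(path);
        try {
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        key.set(path.toString());
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() {
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // nothing to close; the stream is closed in next()
    }
}
Keep in mind that this loads the whole file into memory, so a 700 MB file needs a correspondingly large task heap.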

Related

set a conf value in mapper - get it in run method

In the run method of the driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write to your main job output (the keys and values that you emit from your job), or alternatively use MultipleOutputs if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read from HDFS. For example, you can load the name/value pairs into the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
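As a hedged sketch of the MultipleOutputs route described above (the mapper class, the named output "props", and the way feedName is derived are all assumptions for the example; the driver must also register the named output via MultipleOutputs.addNamedOutput):
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class FeedMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String feedName = value.toString(); // however the mapper derives it
        // TextOutputFormat writes "name<TAB>value" lines, which Properties.load()
        // in the driver can parse back (whitespace is a valid separator).
        mos.write("props", new Text("feedName"), new Text(feedName));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}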
Using the Configuration object you can only do the reverse (vice versa).
You can set values in the driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and get those in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file, store it in HDFS, and then access it in the driver class. These files can be treated as "intermediate files".
Just try it and see.
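A rough sketch of that idea, assuming the job wrote a single value into a known HDFS location (the path below is purely illustrative):
// In the driver, after job.waitForCompletion(true) has returned:
FileSystem fs = FileSystem.get(conf);
Path feedFile = new Path("/tmp/job-output/feedName.txt"); // hypothetical intermediate file
BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(feedFile), "UTF-8"));
try {
    String lineVal = reader.readLine(); // the value the mapper wrote out
    System.out.println("feedName = " + lineVal);
} finally {
    reader.close();
}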

Hadoop: Getting the input file name in the mapper only once

I am new to Hadoop and currently working on it. I have a small query.
I have around 10 files in the input folder that I need to pass to my MapReduce program. I want the file name in my mapper, as the file name contains the time at which the file was created. I have seen people using FileSplit to get the file name in the mapper. But if, say, my input files contain millions of lines, then every time the mapper is called it will fetch the file name and extract the time from it, which is obviously repeated, time-consuming work for the same file. Once I get the time in the mapper, I should not have to derive it from the file name again and again.
How can I achieve this?
You could use the Mapper's setup method to get the filename, as the setup method is guaranteed to run only once, before the map() method is called, like this:
public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {

    String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // process each key value pair
    }
}

how to set custom input format in MapReduce?

I am writing a MapReduce program and using classes in org.apache.hadoop.mapred.*. Can anybody tell me the cause of this error? My CustomInputFormat class extends InputFormat and I have overridden the createRecordReader method.
Signature of my CustomInputFormat is:
class ParagraphInputFormat extends InputFormat {

    @Override
    public RecordReader createRecordReader(InputSplit arg0,
            TaskAttemptContext arg1) throws IOException, InterruptedException {
        return new CustomRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext arg0) throws IOException,
            InterruptedException {
        // TODO Auto-generated method stub
        return null;
    }
}
And the signature of CustomRecordReader is class CustomRecordReader extends RecordReader.
While declaring this class I used org.apache.hadoop.mapreduce.*. I am confused between org.apache.hadoop.mapred.* and org.apache.hadoop.mapreduce.*. Eclipse keeps showing deprecation messages sometimes. I heard that Apache added some classes, then removed them, and then added the earlier classes back again. Is it due to that? Is it affecting my code?
JobConf conf = new JobConf(new Configuration(),MyMRJob.class);
conf.setJobName("NameofJob");
conf.setOutputKeyClass(CutomeKeyClass.class); //no error to this line
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MYMap.class);
conf.setCombinerClass(MyReduce.class);
conf.setReducerClass(MyReduce.class);
conf.setInputFormat(CustomInputFormat.class);//ERROR to this line while typing
conf.setOutputFormat(IntWritable.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
Your input format extends InputFormat from the mapreduce package (it extends rather than implements, and the signature matches that of the new API), yet your job configuration uses the old API (JobConf rather than Job).
So you'll either need to amend your custom input format to implement the old-API InputFormat (o.a.h.mapred.InputFormat), or amend your job configuration to use the new API (Job).
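For the second option, a hedged sketch of the driver rewritten against the new API (Job), reusing the class names from the question (MyMRJob, MYMap, MyReduce, CutomeKeyClass, ParagraphInputFormat) and assuming the mapper and reducer are also written against org.apache.hadoop.mapreduce, would look roughly like this:
Configuration conf = new Configuration();
Job job = new Job(conf, "NameofJob"); // Job.getInstance(conf, "NameofJob") on newer releases
job.setJarByClass(MyMRJob.class);
job.setOutputKeyClass(CutomeKeyClass.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MYMap.class);
job.setCombinerClass(MyReduce.class);
job.setReducerClass(MyReduce.class);
job.setInputFormatClass(ParagraphInputFormat.class);    // new-API setter, takes an o.a.h.mapreduce.InputFormat
job.setOutputFormatClass(TextOutputFormat.class);       // must be an OutputFormat class, not a Writable
FileInputFormat.setInputPaths(job, new Path(args[0]));  // o.a.h.mapreduce.lib.input.FileInputFormat
FileOutputFormat.setOutputPath(job, new Path(args[1])); // o.a.h.mapreduce.lib.output.FileOutputFormat
System.exit(job.waitForCompletion(true) ? 0 : 1);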
Hey, I have faced the same problem. I then used the classes from
org.apache.hadoop.mapreduce instead of
org.apache.hadoop.mapred.
It's a problem of the old and new API; for this, don't configure the job with JobConf, use the Job-based configuration only...

Hadoop HDFS: Read sequence files that are being written

I am using Hadoop 1.0.3.
I write logs to a Hadoop sequence file in HDFS. I call syncFS() after each batch of logs, but I never close the file (except when performing daily rolling).
What I want to guarantee is that the file is available to readers while the file is still being written.
I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key,val), it returns false at the first call.
I know the data is in the file since I can read it with FSDataInputStream or with the cat command and I am 100% sure that syncFS() is called.
I checked the namenode and datanode logs, no error or warning.
Why is SequenceFile.Reader unable to read the file while it is still being written?
You can't ensure that the data you read has been completely written to disk on the datanode side. You can see this in the documentation of DFSClient#DFSOutputStream.sync(), which states:
All data is written out to datanodes. It is not guaranteed that data has
been flushed to persistent store on the datanode. Block allocations are
persisted on namenode.
So it basically updates the namenode's block map with the current information and sends the data to the datanode. Since the data cannot be flushed to disk on the datanode, yet you read directly from the datanode, you hit a time frame where the data is buffered somewhere and not accessible. Thus your SequenceFile reader will think the data stream is finished (or empty) and can't read additional bytes, returning false to the deserialization process.
A datanode makes the data readable from outside (it is written beforehand, but not readable) once the block is fully received. So you are able to read from the file once your block size has been reached, or once your file has been closed and the block thereby finalized. This makes total sense in a distributed environment, because your writer can die and fail to finish a block properly; it is a matter of consistency.
So the fix would be to make the block size very small, so that blocks are finalized more often. But that is not very efficient, and I hope it is clear that your requirement is not well suited to HDFS.
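If you really wanted to try that, a sketch of the small-block-size idea could look like this (the 1 MB value, the path, and the Text key/value types are assumptions; in Hadoop 1.x the per-file block size can be requested through dfs.block.size at creation time):
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 1L * 1024 * 1024); // 1 MB instead of the 64 MB default (illustrative)
FileSystem fs = FileSystem.get(conf);
Path logPath = new Path("/logs/current.seq");     // hypothetical log file
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, logPath, Text.class, Text.class);
writer.append(new Text("key"), new Text("log line"));
writer.syncFs(); // flush to the datanodes; blocks now finalize roughly every 1 MB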
The reason the SequenceFile.Reader fails to read a file being written is that it uses the file length to perform its magic.
The file length stays at 0 while the first block is being written, and is updated only when the block is full (by default 64MB).
Then the file size is stuck at 64MB until the second block is fully written and so on...
That means you can't read the last, incomplete block of a sequence file using SequenceFile.Reader, even if the raw data is readable directly through FSDataInputStream.
Closing the file also fixes the file length, but in my case I need to read files before they are closed.
I hit the same issue, and after some investigation and time I figured out the following workaround, which works.
The problem is due to the internal implementation of sequence file creation and the fact that it uses the file length, which is only updated per 64 MB block.
So I created the following class to build the reader, wrapping the Hadoop FileSystem with my own and overriding the getLength method to return the actual file length instead:
public class SequenceFileUtil {

    public SequenceFile.Reader createReader(Configuration conf, Path path) throws IOException {
        WrappedFileSystem fileSystem = new WrappedFileSystem(FileSystem.get(conf));
        return new SequenceFile.Reader(fileSystem, path, conf);
    }

    private class WrappedFileSystem extends FileSystem {

        private final FileSystem nestedFs;

        public WrappedFileSystem(FileSystem fs) {
            this.nestedFs = fs;
        }

        @Override
        public URI getUri() {
            return nestedFs.getUri();
        }

        @Override
        public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            return nestedFs.open(f, bufferSize);
        }

        @Override
        public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException {
            return nestedFs.create(f, permission, overwrite, bufferSize, replication, blockSize, progress);
        }

        @Override
        public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException {
            return nestedFs.append(f, bufferSize, progress);
        }

        @Override
        public boolean rename(Path src, Path dst) throws IOException {
            return nestedFs.rename(src, dst);
        }

        @Override
        public boolean delete(Path path) throws IOException {
            return nestedFs.delete(path);
        }

        @Override
        public boolean delete(Path f, boolean recursive) throws IOException {
            return nestedFs.delete(f, recursive);
        }

        @Override
        public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException {
            return nestedFs.listStatus(f);
        }

        @Override
        public void setWorkingDirectory(Path new_dir) {
            nestedFs.setWorkingDirectory(new_dir);
        }

        @Override
        public Path getWorkingDirectory() {
            return nestedFs.getWorkingDirectory();
        }

        @Override
        public boolean mkdirs(Path f, FsPermission permission) throws IOException {
            return nestedFs.mkdirs(f, permission);
        }

        @Override
        public FileStatus getFileStatus(Path f) throws IOException {
            return nestedFs.getFileStatus(f);
        }

        @Override
        public long getLength(Path f) throws IOException {
            DFSClient.DFSInputStream open = new DFSClient(nestedFs.getConf()).open(f.toUri().getPath());
            long fileLength = open.getFileLength();
            long length = nestedFs.getLength(f);
            if (length < fileLength) {
                // We might have uncompleted blocks
                return fileLength;
            }
            return length;
        }
    }
}
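Possible usage of the utility above, assuming the sequence file has Text keys and values and lives at a path of your choosing (both are illustrative):
Configuration conf = new Configuration();
SequenceFileUtil util = new SequenceFileUtil();
SequenceFile.Reader reader = util.createReader(conf, new Path("/logs/current.seq")); // hypothetical path
Text key = new Text();
Text value = new Text();
while (reader.next(key, value)) {
    // records in the last, still-open block are now visible as well
    System.out.println(key + "\t" + value);
}
reader.close();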
I faced a similar problem, here is how I fixed it:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201303.mbox/%3CCALtSBbY+LX6fiKutGsybS5oLXxZbVuN0WvW_a5JbExY98hJfig#mail.gmail.com%3E

Hadoop approach to outputting millions of small binary/image files

I need to process and manipulate many images in a Hadoop job. The input arrives over the network, with the slow downloads handled using the MultiThreadedMapper.
But what is the best approach for the reduce output? I think I should write the raw binary image data into a sequence file, transfer those files to their eventual home, and then write a small app to extract the individual images from the SequenceFile into individual JPGs and GIFs.
Or is there a better option to consider?
If you feel up to it (or maybe through some Googling you can find an implementation), you could write a FileOutputFormat which wraps an FSDataOutputStream with a ZipOutputStream, giving you a Zip file for each reducer (and thus saving you the effort of writing a sequence-file extraction program).
Don't be daunted by writing your own OutputFormat; it really isn't that difficult (and much easier than writing custom InputFormats, which have to worry about splits). In fact, here's a starting point - you just need to implement the write method (one possible implementation is sketched after the class):
// Key: Text (path of the file in the output zip)
// Value: BytesWritable - binary content of the image to save
public class ZipFileOutputFormat extends FileOutputFormat<Text, BytesWritable> {

    @Override
    public RecordWriter<Text, BytesWritable> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, ".zip");
        FileSystem fs = file.getFileSystem(job.getConfiguration());
        return new ZipRecordWriter(fs.create(file, false));
    }

    public static class ZipRecordWriter extends
            RecordWriter<Text, BytesWritable> {

        protected ZipOutputStream zos;

        public ZipRecordWriter(FSDataOutputStream os) {
            zos = new ZipOutputStream(os);
        }

        @Override
        public void write(Text key, BytesWritable value) throws IOException,
                InterruptedException {
            // TODO: create new ZipEntry & add to the ZipOutputStream (zos)
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException,
                InterruptedException {
            zos.close();
        }
    }
}
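A possible body for the write method above (a sketch: the key becomes the entry name inside the zip, and getLength() is used because the backing array of a BytesWritable can be longer than the actual data):
@Override
public void write(Text key, BytesWritable value) throws IOException,
        InterruptedException {
    // one zip entry per record, named after the key (java.util.zip.ZipEntry)
    zos.putNextEntry(new ZipEntry(key.toString()));
    zos.write(value.getBytes(), 0, value.getLength());
    zos.closeEntry();
}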
