How to read a binary file from HDFS? - hadoop

I'm writing a shapefile parser with Spark. I use NewAPIHadoopFile to extract the binary records one by one from the original .shp file. The problem is that when the program reads the file from the local disk, it works; but when reading the file from HDFS, the byte stream I get from the DataInputStream no longer matches the original file. The exception is below.
java.lang.NegativeArraySizeException
at ShapeFileParse.ShapeParseUtil.parseRecordPrimitiveContent(ShapeParseUtil.java:53)
at spatial.ShapeFileReader.nextKeyValue(ShapeFileReader.java:54)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/25 17:19:15 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NegativeArraySizeException
My code in the RecordReader is as follows:
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    ShapeParseUtil.initializeGeometryFactory();
    FileSplit fileSplit = (FileSplit) split;
    long start = fileSplit.getStart();
    long end = start + fileSplit.getLength();
    int len = (int) fileSplit.getLength();
    Path filePath = fileSplit.getPath();
    FileSystem fileSys = filePath.getFileSystem(context.getConfiguration());
    FSDataInputStream inputStreamFS = fileSys.open(filePath);
    //byte[] wholeStream = new byte[len];
    inputStream = new DataInputStream(inputStreamFS);
    //IOUtils.readFully(inputStream, wholeStream, 0, len);
    //inputStream = new DataInputStream(new ByteArrayInputStream(wholeStream));
    ShapeParseUtil.parseShapeFileHead(inputStream);
}
And here is how I set up my FileInputFormat:
public class ShapeInputFormat extends FileInputFormat<ShapeKey, BytesWritable> {
    public RecordReader<ShapeKey, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        return new ShapeFileReader(split, context);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
When I use IOUtils.readFully to test, the whole byte array I get is correct. But when I read through the DataInputStream, the stream goes wrong. So I believe the file on HDFS is still intact: my file is only 560 KB and there is only one file on HDFS right now, so I don't think it can be spread across multiple blocks.
I'm new to Spark, so maybe this is a naive question; thanks to anyone who can explain it to me.

I figured it out myself by changing read() to readFully() when reading the byte array for each record. I'm still a little confused, since the available() I get is exactly equal to the size of the split, yet if I call inputStream.read() it fails to read up to the length I requested.
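For reference, the change amounts to something like the following sketch. This is not the actual RecordReader code from the question; the helper name and the 16-bit-word record length are illustrative (the latter follows the shapefile spec), and the point is only the difference between read() and readFully():
// Sketch only: read one record's content from the DataInputStream.
// java.io.DataInputStream.read(buf) may return after filling only part of the
// buffer (for example at an HDFS packet boundary), silently truncating the record;
// readFully(buf) loops until the buffer is completely filled or throws EOFException.
private byte[] readRecordContent(DataInputStream in, int contentLengthWords) throws IOException {
    byte[] content = new byte[contentLengthWords * 2]; // shapefile lengths are in 16-bit words
    in.readFully(content);
    return content;
}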

Related

set a conf value in mapper - get it in run method

In the run method of the driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one mechanism, but it will not carry a value from the mapper back to the driver.
If you want to pass non-counter types of values back to the driver, you can use HDFS for that.
Either write to your main output context (key and values) that you emit from your job.
Or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read them from HDFS. For example, you can store the name/value pairs in the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry<Object, Object> prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
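For completeness, here is a rough sketch of the writing side using the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output); the reading side is the load() method above. The named output "props", the reducer class name and the property value are placeholders, not anything from your job:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Driver side: MultipleOutputs.addNamedOutput(job, "props", TextOutputFormat.class, Text.class, Text.class);
public class FeedReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // ... write the normal job output with context.write(...) ...
        // Emit a name/value pair to the side output; with TextOutputFormat the line
        // comes out as "feedName<TAB>someValue", which Properties.load() can parse.
        mos.write("props", new Text("feedName"), new Text("someValue"));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}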
Using the Configuration, you can only go the other way around.
You can set values in the driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and get them in the mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your problem is to write the data to a file, store it in HDFS, and then access it in the driver class. Such files can be treated as "intermediate files".
Just try it and see.
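A rough sketch of that idea, assuming a hypothetical side-file path (/tmp/feedName.txt); context, feedName and getConf() refer to the mapper's context, your value, and the driver's Configuration. If more than one task might write the value, each task should write its own file (for example named after the task attempt ID) and the driver should read them all:
// In the mapper (for example in cleanup()), write the value to a small HDFS file:
FileSystem fs = FileSystem.get(context.getConfiguration());
FSDataOutputStream out = fs.create(new Path("/tmp/feedName.txt"), true);
out.writeUTF(feedName);
out.close();

// In the driver's run(), after job.waitForCompletion(true) returns:
FileSystem fs = FileSystem.get(getConf());
FSDataInputStream in = fs.open(new Path("/tmp/feedName.txt"));
String lineVal = in.readUTF();
in.close();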

Hadoop: Getting the input file name in the mapper only once

I am new to Hadoop and currently working with it; I have a small query.
I have around 10 files in the input folder which I need to pass to my MapReduce program. I want the file name in my mapper, since the file name contains the time at which the file was created. I have seen people use FileSplit to get the file name in the mapper. But if, say, my input files contain millions of lines, then every time the map method is called it will get the file name and extract the time from it, which is obviously a repeated, time-consuming operation for the same file. Once I have the time in the mapper, I should not have to derive it from the file name again and again.
How can I achieve this?
You could use the Mapper's setup method to get the filename, as the setup method is guaranteed to run exactly once, before map() is called, like this:
public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {

    String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // process each key value pair
    }
}

Hadoop HDFS: Read sequence files that are being written

I am using Hadoop 1.0.3.
I write logs to a Hadoop sequence file in HDFS; I call syncFS() after each batch of logs, but I never close the file (except when I am performing daily rolling).
What I want to guarantee is that the file is available to readers while the file is still being written.
I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key,val), it returns false at the first call.
I know the data is in the file since I can read it with FSDataInputStream or with the cat command and I am 100% sure that syncFS() is called.
I checked the namenode and datanode logs, no error or warning.
Why is SequenceFile.Reader unable to read the file while it is still being written?
You can't ensure that the data is completely written to disk on the datanode side. You can see this in the documentation of DFSClient#DFSOutputStream.sync(), which states:
All data is written out to datanodes. It is not guaranteed that data has
been flushed to persistent store on the datanode. Block allocations are
persisted on namenode.
So it basically updates the namenode's block map with the current information and sends the data to the datanodes. Since the data cannot be flushed to disk on the datanode, yet you read directly from the datanode, you hit a time frame where the data is buffered somewhere and not accessible. Thus your SequenceFile reader will think the data stream is finished (or empty), cannot read additional bytes, and returns false to the deserialization process.
A datanode writes the data to disk once the block is fully received (it is written beforehand, but not readable from outside). So you are able to read from the file once your block size has been reached, or once the file has been closed and the block thereby finalized. This makes complete sense in a distributed environment, because your writer can die and fail to finish a block properly; it is a matter of consistency.
So the fix would be to make the block size very small so blocks are finalized more often. But that is not very efficient, and it should be clear that your requirement is not well suited to HDFS.
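For reference, here is a minimal Hadoop 1.x sketch of the writer side described in the question, with the path and record types as placeholders; the comment notes what syncFs() does and does not guarantee:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/logs/current.seq"), LongWritable.class, Text.class);
writer.append(new LongWritable(System.currentTimeMillis()), new Text("a log line"));
// Pushes the buffered bytes out to the datanodes, but does not finalize the block,
// so the file length visible to SequenceFile.Reader still ends at the last
// completed block.
writer.syncFs();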
The reason the SequenceFile.Reader fails to read a file being written is that it uses the file length to perform its magic.
The file length stays at 0 while the first block is being written, and is updated only when the block is full (by default 64MB).
Then the file size is stuck at 64MB until the second block is fully written and so on...
That means you can't read the last, incomplete block of a sequence file using SequenceFile.Reader, even if the raw data is readable directly via FSDataInputStream.
Closing the file also fixes the file length, but in my case I need to read files before they are closed.
I hit the same issue, and after some investigation and time I came up with the following workaround, which works.
The problem is due to the internal implementation of sequence file creation and the fact that it uses the file length, which is updated per 64 MB block.
So I created the following class to create the reader, and I wrapped the Hadoop FileSystem with my own implementation in which I override the getLength method to return the real file length instead:
public class SequenceFileUtil {

    public SequenceFile.Reader createReader(Configuration conf, Path path) throws IOException {
        WrappedFileSystem fileSystem = new WrappedFileSystem(FileSystem.get(conf));
        return new SequenceFile.Reader(fileSystem, path, conf);
    }

    private class WrappedFileSystem extends FileSystem {

        private final FileSystem nestedFs;

        public WrappedFileSystem(FileSystem fs) {
            this.nestedFs = fs;
        }

        @Override
        public URI getUri() {
            return nestedFs.getUri();
        }

        @Override
        public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            return nestedFs.open(f, bufferSize);
        }

        @Override
        public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException {
            return nestedFs.create(f, permission, overwrite, bufferSize, replication, blockSize, progress);
        }

        @Override
        public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException {
            return nestedFs.append(f, bufferSize, progress);
        }

        @Override
        public boolean rename(Path src, Path dst) throws IOException {
            return nestedFs.rename(src, dst);
        }

        @Override
        public boolean delete(Path path) throws IOException {
            return nestedFs.delete(path);
        }

        @Override
        public boolean delete(Path f, boolean recursive) throws IOException {
            return nestedFs.delete(f, recursive);
        }

        @Override
        public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException {
            return nestedFs.listStatus(f);
        }

        @Override
        public void setWorkingDirectory(Path new_dir) {
            nestedFs.setWorkingDirectory(new_dir);
        }

        @Override
        public Path getWorkingDirectory() {
            return nestedFs.getWorkingDirectory();
        }

        @Override
        public boolean mkdirs(Path f, FsPermission permission) throws IOException {
            return nestedFs.mkdirs(f, permission);
        }

        @Override
        public FileStatus getFileStatus(Path f) throws IOException {
            return nestedFs.getFileStatus(f);
        }

        @Override
        public long getLength(Path f) throws IOException {
            DFSClient.DFSInputStream open = new DFSClient(nestedFs.getConf()).open(f.toUri().getPath());
            long fileLength = open.getFileLength();
            long length = nestedFs.getLength(f);
            if (length < fileLength) {
                // We might have uncompleted blocks
                return fileLength;
            }
            return length;
        }
    }
}
I faced a similar problem; here is how I fixed it:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201303.mbox/%3CCALtSBbY+LX6fiKutGsybS5oLXxZbVuN0WvW_a5JbExY98hJfig#mail.gmail.com%3E

Hadoop approach to outputting millions of small binary/image files

I need to process and manipulate many images in a Hadoop job; the input arrives over the network, with the slow downloads handled using the MultiThreadedMapper.
But what is the best approach to the reduce output? I think I should write the raw binary image data into a sequence file, transfer those files to their eventual home, then write a small app to extract the individual images from the SequenceFile into individual JPGs and GIFs.
Or is there a better option to consider?
If you feel up to it (or maybe through some googling you can find an implementation), you could write a FileOutputFormat which wraps a FSDataOutputStream with a ZipOutputStream, giving you a Zip file for each reducer (and thus saving you the effort of writing a SequenceFile extraction program).
Don't be daunted by writing your own OutputFormat; it really isn't that difficult (and much easier than writing custom InputFormats, which have to worry about splits). In fact, here's a starting point - you just need to implement the write method:
// Key: Text (path of the file in the output zip)
// Value: BytesWritable - binary content of the image to save
public class ZipFileOutputFormat extends FileOutputFormat<Text, BytesWritable> {

    @Override
    public RecordWriter<Text, BytesWritable> getRecordWriter(
            TaskAttemptContext job) throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, ".zip");
        FileSystem fs = file.getFileSystem(job.getConfiguration());
        return new ZipRecordWriter(fs.create(file, false));
    }

    public static class ZipRecordWriter extends
            RecordWriter<Text, BytesWritable> {

        protected ZipOutputStream zos;

        public ZipRecordWriter(FSDataOutputStream os) {
            zos = new ZipOutputStream(os);
        }

        @Override
        public void write(Text key, BytesWritable value) throws IOException,
                InterruptedException {
            // TODO: create new ZipEntry & add to the ZipOutputStream (zos)
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException,
                InterruptedException {
            zos.close();
        }
    }
}
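To fill in the TODO above, one possible write(..) body is sketched below; it treats the key as the path of the entry inside the zip, as the comment at the top of the class suggests (untested, in the same spirit as the rest of the answer):
@Override
public void write(Text key, BytesWritable value) throws IOException,
        InterruptedException {
    // the key is the path of the file inside the zip
    zos.putNextEntry(new java.util.zip.ZipEntry(key.toString()));
    // BytesWritable's backing array can be longer than the logical content,
    // so only write getLength() bytes
    zos.write(value.getBytes(), 0, value.getLength());
    zos.closeEntry();
}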

hadoop MultipleInputs fails with ClassCastException

My Hadoop version is 1.0.3. When I use MultipleInputs, I get this error:
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit
at org.myorg.textimage$ImageMapper.setup(textimage.java:80)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:55)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I tested a single input path with no problem. The error only appears when I use:
MultipleInputs.addInputPath(job, TextInputpath, TextInputFormat.class,
        TextMapper.class);
MultipleInputs.addInputPath(job, ImageInputpath,
        WholeFileInputFormat.class, ImageMapper.class);
I googled and found this link https://issues.apache.org/jira/browse/MAPREDUCE-1178, which says 0.21 had this bug. But I am using 1.0.3; has this bug come back? Can anyone who has had the same problem tell me how to fix it? Thanks.
Here is the setup code of the image mapper; the 4th line is where the error occurs:
protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Path path = ((FileSplit) split).getPath();
    try {
        pa = new Text(path.toString());
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
Following up on my comment, the Javadocs for TaggedInputSplit confirm that you are probably wrongly casting the input split to a FileSplit:
/**
 * An {@link InputSplit} that tags another InputSplit with extra data for use
 * by {@link DelegatingInputFormat}s and {@link DelegatingMapper}s.
 */
My guess is your setup method looks something like this:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
}
Unfortunately TaggedInputSplit is not publicly visible, so you can't easily do an instanceof-style check followed by a cast and a call to TaggedInputSplit.getInputSplit() to get the actual underlying FileSplit. So either you'll need to update the source yourself and recompile and deploy, post a JIRA ticket to ask for this to be fixed in a future version (if it hasn't already been actioned in 2+), or perform some nasty reflection hackery to get to the underlying InputSplit.
This is completely untested:
@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    InputSplit split = context.getInputSplit();
    Class<? extends InputSplit> splitClass = split.getClass();

    FileSplit fileSplit = null;
    if (splitClass.equals(FileSplit.class)) {
        fileSplit = (FileSplit) split;
    } else if (splitClass.getName().equals(
            "org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")) {
        // begin reflection hackery...
        try {
            Method getInputSplitMethod = splitClass
                    .getDeclaredMethod("getInputSplit");
            getInputSplitMethod.setAccessible(true);
            fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
        } catch (Exception e) {
            // wrap and re-throw error
            throw new IOException(e);
        }
        // end reflection hackery
    }
}
Reflection Hackery Explained:
Because TaggedInputSplit is not declared public, it's not visible to classes outside the org.apache.hadoop.mapreduce.lib.input package, and therefore you cannot reference that class in your setup method. To get around this, we perform a number of reflection-based operations:
Inspecting the class name, we can test for the type TaggedInputSplit using its fully qualified name:
splitClass.getName().equals("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit")
We know we want to call the TaggedInputSplit.getInputSplit() method to recover the wrapped input split, so we use the Class.getDeclaredMethod(..) reflection method to acquire a reference to the method:
Method getInputSplitMethod = splitClass.getDeclaredMethod("getInputSplit");
The class still isn't publicly visible, so we call setAccessible(..) on the method to override this, stopping the security manager from throwing an exception:
getInputSplitMethod.setAccessible(true);
Finally, we invoke the method on the input split and cast the result to a FileSplit (optimistically hoping it's an instance of that type!):
fileSplit = (FileSplit) getInputSplitMethod.invoke(split);
I had this same problem, but the actual cause in my case was that I was still setting the InputFormat after setting up the MultipleInputs:
job.setInputFormatClass(SequenceFileInputFormat.class);
Once I removed this line everything worked fine.
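In other words, the driver setup should look roughly like this (class and variable names as in the question), letting MultipleInputs register the input formats:
MultipleInputs.addInputPath(job, TextInputpath, TextInputFormat.class,
        TextMapper.class);
MultipleInputs.addInputPath(job, ImageInputpath,
        WholeFileInputFormat.class, ImageMapper.class);
// do NOT also call job.setInputFormatClass(...); MultipleInputs already
// sets DelegatingInputFormat as the job's input format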
