passing set of images as input to mapreduce - hadoop

I have a system where I get images (jpg) from some module. I get images for 10 objects at a time (1000 images per object, so 10000 images in total). I need to do some processing on these images using a Hadoop cluster.
I am wondering how I should go about this, i.e. how I should form the input. I would like to process one object (and its 1000 images) completely in one mapper or reducer. For example: the first object in the first mapper, the second object in the second mapper, etc.
Some of the approaches that come to my mind are:
1. For each object, create a directory and place all of its images in it. Then tar and compress the directory; this archive goes as one input to a single mapper.
2. Do the same as above, but only tar the directory (don't compress it). Implement the InputFormat interface and make isSplitable() return false.
3. Create a SequenceFile for each object. The SequenceFile would contain a key-value pair for each of the object's images (see the sketch after this list). Here I am not sure how to tell MapReduce to give the SequenceFile to just one mapper.
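For the third approach, here is a minimal sketch (only an illustration, not tested against the asker's setup) of packing one object's JPEGs into a SequenceFile, with the file name as the key and the raw image bytes as the value. The class name and the paths are hypothetical:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ObjectBundler {
    // Packs all JPEGs of one object's directory into a single SequenceFile on HDFS.
    public static void pack(String localDir, String hdfsSeqFile) throws Exception {
        Configuration conf = new Configuration();
        File[] jpgs = new File(localDir).listFiles((dir, name) -> name.endsWith(".jpg"));
        if (jpgs == null) {
            throw new IllegalArgumentException("Not a directory: " + localDir);
        }
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(hdfsSeqFile)),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File jpg : jpgs) {
                byte[] bytes = Files.readAllBytes(jpg.toPath());   // raw JPEG bytes as the value
                writer.append(new Text(jpg.getName()), new BytesWritable(bytes));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. pack("/local/object1", "/input/object1.seq") -- hypothetical paths
        pack(args[0], args[1]);
    }
}

Run it once per object, so you end up with one SequenceFile (i.e. one mapper input) per object.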

"Here I am not sure how to tell MapReduce to give the SequenceFile to just one mapper."
FileInputFormat#isSplitable is your friend here, for all the file-based input formats. SequenceFileInputFormat extends FileInputFormat, so you can subclass it and override isSplitable() to return false, as in the sketch below.
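A minimal sketch of such a non-splittable input format (the class name is just an illustration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Each SequenceFile (i.e. each object) is handed to exactly one mapper.
public class WholeObjectSequenceFileInputFormat<K, V> extends SequenceFileInputFormat<K, V> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Set it on the job with job.setInputFormatClass(WholeObjectSequenceFileInputFormat.class).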

PROCESSING IMAGES IN HADOOP USING MAPREDUCE
HIPI:
HIPI is the Hadoop Image Processing Interface. It provides a set of tools and an InputFormat to process bulk amounts of images using the Hadoop Distributed File System (HDFS) and MapReduce.
STEPS INVOLVED:
In HIPI the entire process can be divided into two parts:
1) Converting all images into a bulk file (HIPI Image Bundle).
2) Processing the created bulk file of images using HIPI's image input formats.
The culler class is used to filter out images with low clarity or defects.
ISSUES WITH HIPI:
To simulate my bulk image processing scenario, I used a Java program to create multiple copies of the same image with different names in a single directory. Then, using HIPI's utility, I converted all the images into a bulk file (known as a HIP file in HIPI).
To check whether all the images exist in the bulk file, I did the reverse process (converted the HIP file back into multiple images); there is another utility to do that. But I did not get all the images back, and found that I was losing some images when using HIPI.
I couldn't proceed with my POC using HIPI, so I thought of creating a new framework to process bulk images using MapReduce.
NEW IMAGE PROCESSING FRAMEWORK:
In order to avoid spawning multiple maps (one per file), we must do as HIPI does, that is, convert all the images into a single bundle file.
This bundle file is given as the input to MapReduce. The image input format parses the bundle file and creates a BufferedImage object for each image.
IMAGE INPUT FORMAT - IMPORTANT CLASSES:
ImageCombiner:
Merges multiple images into a single bundle file.
ImageInputFormat:
Returns the ImageRecordReader and manages splits.
ImageRecordReader:
Manages reading each split.
Performs the initial seek of the file pointer to the start of each split.
The nextKeyValue() method reads each image from the split and converts it to a BufferedImage.
BufferedImageWritable:
Since the key and value classes of MapReduce must be writable serializable types, we cannot keep a BufferedImage directly as the value in the map method.
This is a wrapper class that just holds the BufferedImage in it.
import java.awt.image.BufferedImage;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class BufferedImageWritable implements WritableComparable<BufferedImageWritable> {

    private BufferedImage img;

    public BufferedImage getImage() { return img; }
    public void setImage(BufferedImage img) { this.img = img; }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Empty: images are never read back from HDFS in this scenario.
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Empty: images are never written back to HDFS in this scenario.
    }

    @Override
    public int compareTo(BufferedImageWritable o) {
        return 0; // never used as a map key, so ordering is irrelevant
    }
}
I have not implemented the readFields(), write() and compareTo() methods, since in my scenario I don't want to store the images back to HDFS.
If you want to write any images back to HDFS (in map or reduce), you will have to implement all of these methods. In write() you need the logic to store the image, just like the logic used while writing images when creating the bulk file, and readFields() should contain the opposite logic of write(). compareTo() does not need to be implemented, since we never use this image as a key in MapReduce (compareTo() is invoked during the sort phase of MapReduce).
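As a hedged sketch of what such an implementation could look like (this is my assumption, not the author's actual code, and the class name is hypothetical), the image can be serialized as length-prefixed JPEG bytes via ImageIO:

import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.hadoop.io.Writable;

// Sketch of a BufferedImage wrapper whose write()/readFields() actually
// round-trip the pixels, encoded as JPEG bytes with a length prefix.
public class SerializableBufferedImageWritable implements Writable {

    private BufferedImage img;

    @Override
    public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(img, "jpg", baos);        // encode the image as JPEG bytes
        byte[] bytes = baos.toByteArray();
        out.writeInt(bytes.length);             // length prefix so readFields knows how much to read
        out.write(bytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];  // read the length prefix
        in.readFully(bytes);
        img = ImageIO.read(new ByteArrayInputStream(bytes));
    }

    public BufferedImage getImage() { return img; }
    public void setImage(BufferedImage img) { this.img = img; }
}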
Since you get the image as a BufferedImage (a common Java class for image processing), it is easy to perform most operations on it. In HIPI's case, by contrast, the image is available in the map's value as HIPI's FloatImage class, and you might find it harder to do manipulations on top of it.
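For example, a mapper over this custom input format might look roughly like the following (the Text key holding the image name and the emitted width x height output are assumptions made for illustration):

import java.awt.image.BufferedImage;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emits "imageName -> widthxheight" for every image.
// Assumes the BufferedImageWritable class shown above is on the classpath.
public class ImageDimensionsMapper extends Mapper<Text, BufferedImageWritable, Text, Text> {

    @Override
    protected void map(Text key, BufferedImageWritable value, Context context)
            throws IOException, InterruptedException {
        BufferedImage img = value.getImage();   // plain Java image, easy to manipulate
        context.write(key, new Text(img.getWidth() + "x" + img.getHeight()));
    }
}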
I have successfully implemented a face detection program using this custom input format and OpenCV.
The code that I used to develop it will be shared soon on GitHub:
http://worldofbigdata-inaction.blogspot.in/2017/02/processing-images-in-hadoop-using.html

Related

Custom input splits for streaming the data in MapReduce

I have a large data set that is ingested into HDFS as sequence files, with the key being the file metadata and the value the entire file contents. I am using SequenceFileInputFormat, and hence my splits are based on the sequence file sync points.
The issue I am facing is that when I ingest really large files, I am basically loading the entire file into memory in the Mapper/Reducer, because the value is the entire file content. I am looking for ways to stream the file contents while retaining the sequence file container. I even thought about writing custom splits, but I am not sure how I would retain the sequence file container.
Any ideas would be helpful.
The custom split approach is not suitable for this scenario, for the following two reasons:
1) The entire file is getting loaded onto the map node because the map function needs the entire file (value = entire content). If you split the file, the map function receives only a partial record (value) and it would fail.
2) The sequence file container is probably treating your file as a 'single record' file, so it would have at most one sync point, namely after the header. So even if you retain the sequence file container's sync points, the whole file gets loaded onto the map node, just as it is loaded now.
I had concerns about losing the sequence file's sync points if I wrote a custom split. I was thinking of modifying the SequenceFile input format / record reader to return chunks of the file contents as opposed to the entire file, but to return the same key for every chunk.
The chunking strategy would be similar to how file splits are calculated in MapReduce; a rough sketch of the idea follows.
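A minimal sketch of that chunking idea as a record reader over a plain HDFS file (it does not preserve the sequence file container, which is the hard part discussed above; the class name and chunk size are assumptions):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits the same key (the file path) with successive fixed-size chunks of the
// file as values, instead of one huge record holding the whole file.
public class ChunkRecordReader extends RecordReader<Text, BytesWritable> {

    private static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB per chunk (arbitrary)

    private FSDataInputStream in;
    private final Text key = new Text();
    private final BytesWritable value = new BytesWritable();
    private long total;
    private long remaining;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Path path = fileSplit.getPath();
        in = path.getFileSystem(context.getConfiguration()).open(path);
        in.seek(fileSplit.getStart());
        total = remaining = fileSplit.getLength();
        key.set(path.toString());               // same key for every chunk of this file
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (remaining <= 0) {
            return false;
        }
        int len = (int) Math.min(CHUNK_SIZE, remaining);
        byte[] buf = new byte[len];
        in.readFully(buf);                      // read the next chunk
        value.set(buf, 0, len);
        remaining -= len;
        return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return total == 0 ? 1f : (float) (total - remaining) / total; }
    @Override public void close() throws IOException { if (in != null) in.close(); }
}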

Get input stream as source for Hadoop map method

I am working with a set of files from a directory, which is the output of another task. I need to process the content of each entire file at once (calculate MD5 checksums and do some transformations). I'm not sure what the signature of my Mapper should look like. If I make it
class MyMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> { ... }
then I will get the entire content of an input file in the map method, and this will be stored in memory, but the files could be quite big.
Is there any way to avoid reading the complete "record" into memory for processing by a Hadoop map task, and to get a "stream" for the record instead?
You actually don't need to worry about it. Hadoop is optimized to leverage all the resources of your cluster to do the jobs. The whole point of it is to abstract away the low level details of all that and let you focus on your use case.
I assure you Hadoop can handle your files. If they are really big and/or your cluster has less powerful or unreliable machines, then the jobs might take longer. But they won't fail (absent any other errors).
So I think your approach is fine. The only suggestion I would make is to consider avoiding canonical MapReduce because it isn't a high enough level of abstraction. Consider Cascading or JCascalog instead.
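If you do stay with plain MapReduce, a minimal sketch of such a mapper, assuming the whole file content arrives as the Text value (as in the signature above) and that emitting "offset -> checksum" pairs is acceptable output, could look like this:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Computes an MD5 checksum over the whole record value and emits it as a hex string.
public class Md5Mapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Text.getBytes() returns the backing array; only the first getLength() bytes are valid.
            md5.update(value.getBytes(), 0, value.getLength());
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            context.write(new Text(key.toString()), new Text(hex.toString()));
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}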

Running hadoop for processing sources in full sky maps

I have a few tens of full sky maps, in binary format (FITS), of about 600MB each.
For each sky map I already have a catalog of the positions of a few thousand sources, i.e. stars, galaxies, radio sources.
For each source I would like to:
open the full sky map
extract the relevant section, typically 20MB or less
run some statistics on them
aggregate the outputs to a catalog
I would like to run hadoop, possibly using python via the streaming interface, to process them in parallel.
I think the input to the mapper should be each record of the catalogs; then the python mapper can open the full sky map, do the processing and print the output to stdout.
Is this a reasonable approach?
If so, I need to be able to configure hadoop so that a full sky map is copied locally to the nodes that are processing one of its sources. How can I achieve that?
Also, what is the best way to feed the input data to hadoop? For each source I have a reference to the full sky map, plus a latitude and a longitude.
Though it doesn't sound like a few tens of your sky maps make a very big data set, I've used Hadoop successfully as a simple way to write distributed applications/scripts.
For the problem you describe, I would try implementing a solution with Pydoop, and specifically Pydoop Script (full disclaimer: I'm one of the Pydoop developers).
You could set up a job that takes as input the list of sections of the sky map that you want to process, serialized in some sort of text format with one record per line. Each map task should process one of these; you can achieve this split easily with the standard NLineInputFormat.
You don't need to copy the sky map locally to all the nodes as long as the map tasks can access the file system on which it is stored. Using the pydoop.hdfs module, the map function can read the section of the sky map that it needs to process (given the coordinates it received as input) and then emit the statistics as you were saying so that they can be aggregated in the reducer. pydoop.hdfs can read from both "standard" mounted file systems and HDFS.
Though the problem domain is totally unrelated, this application may serve as an example:
https://github.com/ilveroluca/seal/blob/master/seal/dist_bcl2qseq.py#L145
It uses the same strategy, preparing a list of "coordinates" to be processed, serializing them to a file, and then launching a simple pydoop job that takes that file as input.
Hope that helps!
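As a side note on the NLineInputFormat mentioned above, here is a hedged sketch of the equivalent driver-side configuration in plain Hadoop Java (the class name and path arguments are placeholders; with Pydoop or streaming you would set the corresponding input format and lines-per-map property instead):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver-side configuration: one catalog line (i.e. one source) per map task.
public class SkyMapDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sky-map-sources");
        job.setJarByClass(SkyMapDriver.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);        // one catalog line per split
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... plug in your mapper (or the Pydoop/streaming equivalent) here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}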

Hadoop read multiple lines at a time

I have a file in which every set of four lines represents a record.
For example, the first four lines represent record 1, the next four represent record 2, and so on.
How can I ensure that the Mapper gets these four lines at a time as input?
Also, I want the file splitting in Hadoop to happen at record boundaries (the line number should be a multiple of four), so records don't get spanned across multiple splits.
How can this be done?
A few approaches, some dirtier than others:
The right way
You may have to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you may be able to reuse some of the existing implementations of those three. You will likely have to write your own RecordReader to define the key/value pair, and you will likely have to write your own InputSplit to help define the boundary.
Another right way, which may not be possible
The above task is quite daunting. Do you have any control over your data set? Can you preprocess it in some way (either while it is coming in or at rest)? If so, you should strongly consider trying to transform your dataset into something that is easier to read out of the box in Hadoop.
Something like:
ALine1
ALine2
ALine3    ->    ALine1;Aline2;Aline3;Aline4
ALine4
BLine1
BLine2
BLine3    ->    BLine1;Bline2;Bline3;Bline4
BLine4
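A hedged sketch of such a preprocessing step (a plain standalone program, not something from the original answer) that joins every four input lines into one semicolon-separated line:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Reads stdin and prints one semicolon-joined line for every four input lines.
public class JoinFourLines {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        List<String> buffer = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() == 4) {               // a complete 4-line record
                System.out.println(String.join(";", buffer));
                buffer.clear();
            }
        }
    }
}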
Down and Dirty
Do you have any control over the file sizes of your data? If you manually split your data on the block boundary, you can force Hadoop to not care about records spanning splits. For example, if your block size is 64MB, write your files out in 60MB chunks.
Without worrying about input splits, you could do something dirty: in your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clear out the list. Otherwise, don't emit anything and move on without doing anything (a sketch of this buffering mapper follows below).
The reason why you have to manually split the data is that you are not guaranteed that an entire 4-row record will be given to the same map task.
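A minimal sketch of that buffering mapper, where the "emit something" step is just a placeholder that concatenates the four buffered lines:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffers lines in the map task and emits one record for every four lines seen.
public class FourLineBufferingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final List<String> buffer = new ArrayList<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        buffer.add(value.toString());
        if (buffer.size() == 4) {
            // Process the 4-line record here; emitting the joined lines is just a placeholder.
            context.write(NullWritable.get(), new Text(String.join(";", buffer)));
            buffer.clear();
        }
        // Any trailing lines that do not complete a group of four are ignored in this sketch.
    }
}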
Another way (easy, but possibly not efficient in some cases) is to implement FileInputFormat#isSplitable(). Then the input files are not split and are processed one per map.
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
And as orangeoctopus said
In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
This has some overhead, for the following reasons:
The time to process the largest file drags out the job completion time.
A lot of data may be transferred between the data nodes.
The cluster is not properly utilized, since # of maps = # of files.
** The above code is from Hadoop : The Definitive Guide

Hadoop: Mapping binary files

Typically the input file is capable of being partially read and processed by the Mapper function (as with text files). Is there anything that can be done to handle binaries (say images, serialized objects), which would require all the blocks to be on the same host before the processing can start?
Stick your images into a SequenceFile; then you will be able to process them iteratively, using map-reduce.
To be a bit less cryptic: Hadoop does not natively know anything about text and not-text. It just has a class that knows how to open an input stream (HDFS handles stitching together blocks on different nodes to make them appear as one large file). On top of that, you have a RecordReader and an InputFormat that know how to determine where in that stream records start, where they end, and how to find the beginning of the next record if you are dropped somewhere in the middle of the file. TextInputFormat is just one implementation, which treats newlines as the record delimiter. There is also a special format called a SequenceFile that you can write arbitrary binary records into, and then read them back out. Use that.
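A minimal sketch of consuming such a SequenceFile (assuming, as an illustration, that it was written with Text keys such as file names and BytesWritable values holding the raw image bytes):

import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call receives one complete image record from the SequenceFile.
public class SequenceFileImageMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text fileName, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // copyBytes() returns exactly getLength() bytes (the backing array may be larger).
        BufferedImage img = ImageIO.read(new ByteArrayInputStream(imageBytes.copyBytes()));
        if (img != null) {
            context.write(fileName, new Text(img.getWidth() + "x" + img.getHeight()));
        }
    }
}

In the driver you would pair this with job.setInputFormatClass(SequenceFileInputFormat.class) so the records are decoded from the container.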
