How to process image files using Pig - Hadoop

There are 100 image files with different colors. I want to get the unique images based on their color.

There is no built-in Hadoop/Pig API for processing image data.
To process image data using Pig/MapReduce, use the following steps:
1) Convert all the images into one or more sequence files, with one record per image (see the sketch after this list):
Key: image_file_id, Value: image content
2) Load this file into HDFS.
3) Use any third-party detection library (for example Haar cascades) as a UDF in Pig, or call the Java library from a MapReduce program.
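For illustration, a minimal Java sketch of the conversion step could look like the following. The class name, the local directory and the HDFS output path are placeholders, and it assumes the Hadoop 2.x SequenceFile.Writer API with (file name, raw image bytes) as the key/value pair.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every image in a local directory into one SequenceFile on HDFS,
// keyed by file name, with the raw image bytes as the value.
public class ImagesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path("hdfs:///user/hadoop/images.seq");   // placeholder output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // placeholder local directory, assumed to exist and contain only image files
            for (File image : new File("/local/images").listFiles()) {
                byte[] bytes = Files.readAllBytes(image.toPath());
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        }
    }
}

The resulting file can then be loaded into Pig with a SequenceFile loader (for example the Piggybank one) or consumed directly by a MapReduce job, with the detection library applied per record.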

Related

PyTorch: wrapping multiple records in one file?

Is there a standard way of encoding multiple records (in this case, data from multiple .png or .jpeg images) in one file that PyTorch can read? Something similar to TensorFlow's "TFRecord" or MXNet's "RecordIO", but for PyTorch.
I need to download image data from S3 for inference, and it's much slower if my image data is in many small .jpg files rather than fewer files.
Thanks.
One option is to store batches of images together in a single .npz file. NumPy's np.savez (or np.savez_compressed) lets you save multiple arrays into a single file. Then load the file back as NumPy arrays and use torch.from_numpy to convert them to tensors.

How to convert multiple PDFs to text using Hadoop (example)

I have one million PDFs. How do I convert them into text using Hadoop and use the result for analytics?
The goal is to use the power of Hadoop for extracting the PDF data as text.
I have processed a single PDF file on Hadoop. I haven't tried it with multiple files, but I believe it will work fine for multiple files too.
The complete code is available at the link below:
http://ybhavesh.blogspot.in/2015/12/poc-sensex-log-data-processing-pdf-file.html
Hope this helps!
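As a rough sketch of how the map side could look (this is not the code from the linked post): it assumes the PDFs have already been packed into a SequenceFile of (file name, raw bytes) records and that Apache PDFBox 2.x is on the classpath; the class name is made up for illustration.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Turns each PDF (delivered as raw bytes) into its extracted text,
// keyed by the original file name.
public class PdfToTextMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    protected void map(Text fileName, BytesWritable pdfBytes, Context context)
            throws IOException, InterruptedException {
        try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdfBytes.copyBytes()))) {
            String text = new PDFTextStripper().getText(doc);
            context.write(fileName, new Text(text));
        }
    }
}

The extracted text can then be analyzed with further MapReduce jobs, Pig, or Hive.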

using pyspark, read/write 2D images on hadoop file system

I want to be able to read and write images on an HDFS file system and take advantage of the HDFS locality.
I have a collection of images where each image is composed of:
- a 2D array of uint16
- basic additional information stored as an XML file
I want to create an archive on the HDFS file system and use Spark for analyzing the archive. Right now I am struggling with the best way to store the data on HDFS in order to take full advantage of the Spark + HDFS structure.
From what I understand, the best way would be to create a sequenceFile wrapper. I have two questions:
Is creating a sequenceFile wrapper the best way?
Does anybody have any pointers to examples I could use to start with? I can't be the first one who needs to read something other than text files from HDFS through Spark!
I have found a solution that works: pyspark 1.2.0's binaryFiles does the job. It is flagged as experimental, but I was able to read TIFF images with a proper combination of OpenCV calls.
import cv2
import numpy as np
# build the RDD of (path, content) pairs and take one element for testing purposes
L = sc.binaryFiles('hdfs://localhost:9000/*.tif').take(1)
# convert the file content to a bytearray and then to a np array
file_bytes = np.asarray(bytearray(L[0][1]), dtype=np.uint8)
# use OpenCV to decode the byte array; cv2.IMREAD_COLOR (== 1) yields an 8-bit BGR image,
# use cv2.IMREAD_UNCHANGED instead to keep the original uint16 depth
R = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
Note the help text from pyspark:
binaryFiles(path, minPartitions=None)
:: Experimental
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Note: Small files are preferred, large file is also allowable, but may cause bad performance.

Load image to pig

I am a newbie to analyzing images using Apache Pig.
Can anyone suggest how to load and process images?
I know that for text files it is:
alias = load '/user/Pavan/sample.txt' using PigStorage(" ");
How do I do this with images?
You have a few options, which really depend on the kind of manipulation you're looking to do:
1) Write a custom load function
Pig can be used for images, but you'd need to write a custom load function, which could be more than you're looking to do.
2) Use a Sequence File (my recommendation)
You could also convert the images to a Sequence File, which Pig has a loader for in the Piggybank JAR. There are also load and store functions for reading and writing Sequence Files available via Twitter's Elephant Bird package.
Here's an article about using Sequence Files on Hadoop for astronomical categorization tasks.
3) Go with MapReduce.
Depending on the nature of your task, you may be better off in native MapReduce.

passing set of images as input to mapreduce

I have a system where I get images (JPEG) from some module. I get images for 10 objects at a time (1,000 images per object, so 10,000 images in total). I need to do some processing on these images using a Hadoop cluster.
I am wondering how I should go about this, in particular how I should form the input. I would like to process one object (and its 1,000 images) completely in one mapper or reducer. For example: the first object in the first mapper, the second object in the second mapper, and so on.
Some of the approaches that come to my mind are:
1. For each object, create a directory and place all its images in it. Then tar and compress the directory; this goes as one input to a single mapper.
2. Do the same thing as above, but just tar the file (don't compress it). Implement the InputFormat interface and make isSplitable() return false.
3. Create a sequence file for each object. The sequence file would contain a key-value pair for each of the object's images. Here I am not sure how to tell MapReduce to give the sequence file to just one mapper.
FileInputFormat#isSplitable is your friend here for all the file input formats. SequenceFileInputFormat extends FileInputFormat.
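A minimal sketch of that idea, assuming each object's images are packed into their own SequenceFile of (image name, raw bytes) records; the class name is illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Input format that refuses to split files, so each per-object SequenceFile
// goes to exactly one mapper.
public class WholeObjectSequenceFileInputFormat extends SequenceFileInputFormat<Text, BytesWritable> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

The job would then register it with job.setInputFormatClass(WholeObjectSequenceFileInputFormat.class), so every record of a given object's file arrives in the same map task.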
PROCESSING IMAGES IN HADOOP USING MAPREDUCE
HIPI:
HIPI is the Hadoop Image Processing Interface. It provides a set of tools and an InputFormat to process a bulk amount of images using the Hadoop Distributed File System (HDFS) and MapReduce.
STEPS INVOLVED:
In HIPI the entire process can be categorized into 2 parts:
1) Converting all images into a bulk file (HIPI Image Bundle).
2) Processing the created bulk file of images using HIPI's image input formats.
The culler class is used to filter out images with low clarity or defects.
ISSUES WITH HIPI:
To simulate my bulk image processing scenario, I used a Java program to create multiple copies of the same image with different names in a single directory. Then, using HIPI's utility, I converted all the images into a bulk file (known as a HIP file in HIPI).
To check whether all the images exist in the bulk file, I did the reverse process (converted the HIP file back into multiple images); there is also another utility to do the same. But I didn't get all the images back, and found that with HIPI I was losing some images.
I couldn't proceed with my POC using HIPI, so I thought of creating a new framework to process bulk images using MapReduce.
NEW IMAGE PROCESSING FRAMEWORK:
In order to avoid spawning multiple maps (one per file) we must do what HIPI does, that is, convert all the images into a single bundle file.
This bundle file is given as the input to MapReduce. The image input format parses the bundle file and creates a BufferedImage object corresponding to each image.
IMAGE INPUT FORMAT - IMPORTANT CLASSES:
ImageCombiner:
Merges multiple images into a single bundle file.
ImageInputFormat:
Returns an ImageRecordReader and manages splits.
ImageRecordReader:
Manages reading each split.
Performs the initial seek of the file pointer to the start of each split.
The nextKeyValue() method reads each image from the split and converts it to a BufferedImage.
BufferedImageWritable:
Since the key and value classes of MapReduce must be Writable (serializable) types, we cannot keep a BufferedImage directly as the value in the map method.
This is a wrapper class that just holds the BufferedImage.
import java.awt.image.BufferedImage;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class BufferedImageWritable implements WritableComparable<BufferedImageWritable> {
    private BufferedImage img;

    public BufferedImage getImage() { return img; }
    public void setImage(BufferedImage img) { this.img = img; }

    @Override
    public void readFields(DataInput in) throws IOException {
        // intentionally empty: images are not read back from HDFS in this scenario
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // intentionally empty: images are not written back to HDFS in this scenario
    }

    @Override
    public int compareTo(BufferedImageWritable o) {
        return 0;
    }
}
I have not implemented the readFields(), write(), and compareTo() methods since in my scenario I don't want to store the images back to HDFS.
If you want to write any images back to HDFS (in map or reduce), you will have to implement all these methods. In write() you need the logic to store the image, just as we did while creating the bulk file, and readFields() should contain the opposite logic of write(). compareTo() does not need a real implementation since we never keep the image as a key in MapReduce (compareTo() is invoked during the sort phase of MapReduce).
Since you get the image as a BufferedImage (a common Java class for image processing) it is easy to perform most operations on it. In the case of HIPI, by contrast, the image is available in the map's value as HIPI's FloatImage class, and you might find it harder to do manipulations on top of it.
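For illustration only, a map() over this input format might look like the sketch below; the mapper name, the width-based "processing" and the getImage() accessor are assumptions about the framework described here, not its published API.

import java.awt.image.BufferedImage;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the custom image input format described above:
// the key is the image name and the value wraps a decoded BufferedImage.
public class ImageDimensionMapper extends Mapper<Text, BufferedImageWritable, Text, IntWritable> {
    @Override
    protected void map(Text imageName, BufferedImageWritable value, Context context)
            throws IOException, InterruptedException {
        BufferedImage img = value.getImage();   // assumed accessor on the wrapper class
        // placeholder processing: emit the pixel width of each image
        context.write(imageName, new IntWritable(img.getWidth()));
    }
}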
I have successfully implemented a face detection program using this custom input format and OpenCV.
The code I used to develop it will be shared soon on GitHub.
http://worldofbigdata-inaction.blogspot.in/2017/02/processing-images-in-hadoop-using.html
