Passing two files containing data in different formats to a mapper - Hadoop

I have 5 image files (each image file less than 5MB).
ImageDir/Image1 = {ImageID1 <image in binary form>}
...
ImageDir/Image5 = {ImageID5 <image in binary form>}
There is some textual data that is also associated with the image,
ImageData/Image1_data = {ImageID1 <text data>}
...
ImageData/Image5_data = {ImageID5 <text data>}
I want each image and its text data to go to one mapper. How do I achieve this? I know that each image would go to one mapper, but how do I make sure that the image's text data, which is in a different format, also goes to the same mapper?

The easiest approach is to combine each image and its associated data into a single file (gz, tar, etc.) with an automated script and let a mapper process that file.
AFAIK, Hadoop doesn't support this out of the box, so a custom InputFormat would need to be coded. I wouldn't recommend that approach, as the image and its associated data might be on different nodes and there would be a lot of data shuffling during job execution.
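A minimal sketch of such a bundling script in Python, assuming the ImageDir/ImageData layout and the _data naming convention described in the question (adjust the directory names and suffix to your actual layout):

import os
import tarfile

IMAGE_DIR = "ImageDir"   # binary image files: Image1 ... Image5
DATA_DIR = "ImageData"   # matching text files: Image1_data ... Image5_data
OUT_DIR = "Bundles"      # one tar archive per image/text pair

os.makedirs(OUT_DIR, exist_ok=True)

for name in sorted(os.listdir(IMAGE_DIR)):
    image_path = os.path.join(IMAGE_DIR, name)
    data_path = os.path.join(DATA_DIR, name + "_data")
    if not os.path.isfile(data_path):
        continue  # skip images without a matching text file
    # Bundle the pair into one archive so a single mapper sees both parts.
    with tarfile.open(os.path.join(OUT_DIR, name + ".tar"), "w") as tar:
        tar.add(image_path, arcname=name)
        tar.add(data_path, arcname=name + "_data")

Each resulting .tar can then be uploaded to HDFS and processed by a mapper that untars it and reads both members.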

Related

What is the best output format / platform to display different sorts of extracted data?

I am writing a script that extracts different types of data from different kinds of custom log files.
But before I continue writing, I want to decide on an output format / platform, so that the data is displayed properly and can be read properly.
Examples:
sometimes it is certain lines of text with an important word in them
sometimes it is a block of text between a start and end phrase
sometimes it is data points, which I then want to visualize in a line chart
....
OR it is a combination of those
At first I thought I would write it as a Markdown file, so that I could, for instance, create foldable blocks and just unfold the part that I want to read.
But Markdown is not versatile enough, meaning I can't create line charts or other kinds of things (thinking about the future).
So now I put the different types of data into different output formats and visualize them in an HTML file:
the blocks of text go into a Markdown file, which I then import through a JavaScript Markdown viewer
for the data points, I create a line chart through a JavaScript charting library
.....
HOWEVER, I am not sure that this is the best/correct way to go .....
What is your advice?

Need to implement bulk PDF extraction using Tesseract API

I have a large number of PDF documents from which I need to extract text; the extracted text is used for further processing. I did this for a small subset of documents using the Tesseract API in a sequential approach and I get the required output. However, this takes a very long time when I have a large number of documents.
I tried to use the Hadoop environment's processing capabilities (MapReduce) and storage (HDFS) to solve this issue. However, I am having trouble fitting the Tesseract API into the Hadoop (MapReduce) approach. Since Tesseract converts the files into intermediate image files, I am confused as to how the intermediate image files produced by the Tesseract process can be handled inside HDFS.
I have searched and unsuccessfully tried a few options earlier, such as:
I extracted text from PDFs by extending the FileInputFormat class into my own PdfInputFormat class in Hadoop MapReduce, using Apache PDFBox to extract the text. However, when it comes to scanned PDFs, which contain images, this solution does not give me the required results.
I found a few answers on the same topic stating either to use Fuse, or to generate the image files locally and then upload them into HDFS for further processing. I am not sure whether this is the correct approach.
I would like to know about approaches to this.
This is an approach I found to process multiple PDFs and extract text using the power of the Hadoop framework, and then use this text for further processing:
Put all the PDFs to be converted to text in one folder.
Create one text file per PDF containing the path to that PDF. E.g., if I have 10 PDFs to convert, then 10 text files are generated, each containing the unique path to the respective PDF.
These text files are given as input to the MapReduce program.
Because each input file is very small, only 1 input split is generated by the framework per input file. E.g., if I have 10 PDFs as input, the framework will generate 10 input splits.
From each input split, one line (record) is read by the RecordReader and passed to one mapper as a value. So if there are 10 records (lines == file paths) in the input text files, the mapper runs 10 times. As there is one record per input split, one mapper is assigned to each input split.
As there are 10 input splits, 10 mappers run in parallel.
Inside the mapper, Ghostscript generates images from the PDF whose path arrives in the mapper's value. The images are converted to text using Tesseract inside the mapper itself to get the text of each PDF. This is the output.
This is passed to the reducer to do other analytics work as required.
This is the current solution; I would like feedback on it.
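For reference, here is a minimal sketch of what such a mapper could look like if written as a Hadoop Streaming script in Python; it assumes gs, tesseract and the hdfs command-line client are installed on every node and that each input line is the HDFS path of one PDF (the solution described above may equally well be a plain Java mapper):

#!/usr/bin/env python
# Hadoop Streaming mapper: reads one HDFS PDF path per line from stdin,
# rasterises the PDF with Ghostscript, OCRs each page with Tesseract,
# and emits "pdf_path <TAB> extracted_text" on stdout.
import glob
import os
import subprocess
import sys
import tempfile

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    workdir = tempfile.mkdtemp()
    local_pdf = os.path.join(workdir, "input.pdf")

    # Pull the PDF from HDFS onto the local disk of the node running this mapper.
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, local_pdf])

    # Ghostscript: render each page as a 300-dpi PNG; the intermediate images stay local.
    subprocess.check_call([
        "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r300",
        "-sOutputFile=" + os.path.join(workdir, "page_%03d.png"), local_pdf])

    # Tesseract: OCR each page image; "tesseract <img> <base>" writes <base>.txt.
    text_parts = []
    for png in sorted(glob.glob(os.path.join(workdir, "page_*.png"))):
        base = png[:-4]
        subprocess.check_call(["tesseract", png, base])
        with open(base + ".txt") as f:
            text_parts.append(f.read())

    # Key = original PDF path, value = OCR text with newlines flattened.
    print("%s\t%s" % (hdfs_path, " ".join(text_parts).replace("\n", " ")))

This keeps the intermediate image files on the local disk of the mapper's node, so only the final text ever needs to go back into HDFS.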

Nvidia Digits accuracy and loss plots data

I trained my model in Nvidia Digits 5 and I would now like to extract, for a report, the accuracy and loss plots that were generated during training. Is this data saved somewhere so that it would be possible to extract it, plot it in Python, and perhaps ultimately modify the plots to compare different models, etc.?
The best solution I have found is either to look at the HTML file or to scan the text file caffe_output.log that is produced by Caffe. The text file is usually stored in /var/digits/jobs/insert_your_job_id/ but on Linux systems you can also just run:
locate caffe_output.log
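If you go the log-scanning route, here is a minimal sketch in Python that pulls the training loss out of caffe_output.log; the exact log-line format depends on your Caffe version and your network's output names, so treat the regular expression as an assumption to adapt:

import re

iterations, losses = [], []
# Typical Caffe solver lines look like:
#   "... solver.cpp:218] Iteration 100 (...), loss = 0.693147"
pattern = re.compile(r"Iteration (\d+).*?, loss = ([0-9.eE+-]+)")

with open("caffe_output.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            iterations.append(int(match.group(1)))
            losses.append(float(match.group(2)))

# iterations/losses can now be plotted, e.g. with matplotlib.pyplot.plot(iterations, losses).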
Go to your DIGITS job folder and locate your job's subfolder. Inside you'll find a file status.pickle, which is a pickled object containing all your job's information.
You can load it in Python like so:
import digits   # the DIGITS package must be importable so pickle can rebuild the job object
import pickle
data = pickle.load(open('status.pickle', 'rb'))
This object is somewhat generic and may contain multiple tasks. For a typical classification task it will likely be just one, but you will still need to access it via data.tasks[0]. From there you can grab the plots:
data.tasks[0].combined_graph_data()
which returns a somewhat convoluted dict (unfortunately - since your network can produce many accuracy/loss outputs, as well as even custom ones). It contains everything you need though - I managed to plot accuracy with:
import matplotlib.pyplot as plt
plt.plot( data.tasks[0].combined_graph_data()['columns'][2][1:] )
but it's likely that you'll have to write a bit of custom code. As always, dir() is your friend.
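If you want to plot every series rather than a single column, a small sketch along these lines may help; it assumes (and you should verify this on your own job with print() and dir()) that each entry of 'columns' is a list whose first element is the series name and whose remaining elements are the values:

import digits   # needed so pickle can resolve the DIGITS classes
import pickle
import matplotlib.pyplot as plt

with open('status.pickle', 'rb') as f:
    data = pickle.load(f)

graph = data.tasks[0].combined_graph_data()
for column in graph['columns']:
    name, values = column[0], column[1:]   # assumed layout: [series_name, v1, v2, ...]
    plt.plot(values, label=str(name))

plt.legend()
plt.xlabel('training progress')
plt.ylabel('value')
plt.show()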

Import multiple images and store them in a list using Mathematica

I am using Mathematica to enhance and thin images. I used it for a single image; now I want to use it for multiple images. So I have to import 6 images, do the thinning, and store the results in a list, for example. Can anyone show me how to do that?
The images will be used for a biometric identification system.
Since you want a list as a result you might think of using either Table or Map. Either of those can do n things, one after another, and put the result into your final list.
Since you didn't show the steps you used for processing a single image, it is a little difficult to tell you exactly how to wrap Table or Map around this.
If you have a list of image file names then you can use Map to process those names one at a time. The processing could be a compound function that Imports the image and then enhances and thins it; the output of that function would be a single processed image. Map would then repeat this over all the names.
Table can work in a similar way: you use each iteration to get the file name, do the processing, and store the result in your desired list.
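A minimal sketch in Mathematica of the Map version, assuming the per-image processing is simply a Binarize followed by Thinning (substitute your actual enhancement and thinning steps) and that the six file names are listed explicitly:

(* Placeholder file names; replace with your actual image paths. *)
fileNames = {"img1.png", "img2.png", "img3.png", "img4.png", "img5.png", "img6.png"};

(* One compound function that imports and processes a single image. *)
processImage[name_] := Thinning[Binarize[Import[name]]];

(* Map applies processImage to every name and returns the list of thinned images. *)
thinnedImages = Map[processImage, fileNames]

The Table form would be Table[processImage[fileNames[[i]]], {i, Length[fileNames]}], which builds the same list one iteration at a time.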

How do I create a grayscale image from non-image data

I have an array containing data. This array may contain image data, or it may be just random data; no header information is available, so writing it to a file and giving it a .jpg extension is not going to work. Can someone please recommend a library that would do this for me?
Any language that isn't a scripting language is OK, any environment. I would prefer C/Java/Matlab.
If you have your array in MATLAB (let's say it's in a variable called im), then you can just type
imwrite(im, 'myfilename.bmp', 'bmp')
and your array will be written to a .bmp file. You can choose from a number of other common formats too. See the documentation for imwrite.
You can even write random data in this way:
a = rand(100,100);
imwrite(a, 'testimg.jpg', 'jpg')
