How to convert multiple PDF files to text with Hadoop (example) - hadoop

I have one million PDF files. How can I convert them to text using Hadoop and use the result for analytics?
The goal is to use the power of Hadoop to extract the PDF data as text.

I have processed a single PDF file on Hadoop. I have not tried it with multiple files, but I believe it will work fine for multiple files too.
The complete code is available at the link below:
http://ybhavesh.blogspot.in/2015/12/poc-sensex-log-data-processing-pdf-file.html
Hope this helps!
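One common way to do this is to give each mapper a whole PDF and extract its text with a PDF library. Below is a minimal sketch, assuming Apache PDFBox 2.x and a (hypothetical) whole-file InputFormat that delivers the file name as key and the raw file bytes as value; none of this is code from the linked post.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Assumes an upstream whole-file InputFormat that hands each map() call
// one complete PDF: key = file name, value = raw file bytes.
public class PdfToTextMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text fileName, BytesWritable fileBytes, Context context)
            throws IOException, InterruptedException {
        // Load the PDF from the in-memory bytes and strip its text (PDFBox 2.x API).
        try (PDDocument document = PDDocument.load(fileBytes.copyBytes())) {
            String text = new PDFTextStripper().getText(document);
            // Emit file name -> extracted text so a downstream job can run analytics on it.
            context.write(fileName, new Text(text));
        }
    }
}

Run this as a map-only job over the million PDFs and the output is plain text in HDFS, ready for whatever analytics comes next.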

Related

How to process image files using PIG

There are 100 image files with different colors. I want to get the unique images based on their color.
There is no built-in Hadoop/Pig API for processing image data.
To process image data using Pig/MapReduce, use the following steps (a sketch of step (a) follows after this list):
a) Convert all the images into a Sequence File (or files), with key = image file id and value = image content.
b) Load this file into HDFS.
c) Use a third-party detection library such as Haar cascades as a UDF in Pig, or call the Java library from a MapReduce program.
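A minimal sketch of step (a), packing a local image directory into a SequenceFile; the class name and paths are illustrative:

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a local directory of images into one Hadoop SequenceFile:
// key = image file name, value = raw image bytes.
public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        File imageDir = new File(args[0]);       // local directory with the 100 images
        Path output = new Path(args[1]);         // e.g. an HDFS path such as /data/images.seq
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File image : imageDir.listFiles()) {
                if (!image.isFile()) {
                    continue;                    // skip subdirectories
                }
                byte[] bytes = Files.readAllBytes(image.toPath());
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        }
    }
}

A Pig UDF or MapReduce job can then read this SequenceFile and pass each value (the image bytes) to the detection library.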

How to load multiple files in tar.gz into Pig

Scenario: a vendor will provide a raw feed in tar.gz format which contains multiple files in tab-delimited format.
File Detail:
a) One Hit level data
b) Multiple Lookup files
c) One Header file for (a)
The feed (tar.gz) will be ingested and landed into the BDP operational raw area.
Query: I would like to load this data from the operational raw area into Pig for a data quality checking process. How can this be achieved? Do the files have to be extracted in Hadoop before we can use them, or are there alternatives available? Please advise. Thanks!
Note: any sample script would be very helpful.
Ref: http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions
Extract from the docs:
Handling Compression
Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.
To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.
A = load 'myinput.gz';
store A into 'myoutput.gz';
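Since PigStorage understands .gz but not .tar.gz, one workable approach is to unpack the tarball into individual files on HDFS first and then load those with PigStorage('\t'). A rough sketch of the unpacking step, assuming Apache Commons Compress is on the classpath and with illustrative paths:

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Unpacks a vendor tar.gz that already sits on HDFS into individual files
// under an output directory, so PigStorage can load them directly.
public class UntarFeedToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path feed = new Path(args[0]);      // e.g. the landed feed.tar.gz in the raw area
        Path outDir = new Path(args[1]);    // e.g. an "extracted" staging directory

        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(fs.open(feed)))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) {
                    continue;               // skip directory entries inside the tar
                }
                // Copy this tar entry's bytes into its own HDFS file.
                try (FSDataOutputStream out = fs.create(new Path(outDir, entry.getName()))) {
                    IOUtils.copyBytes(tar, out, 4096, false);
                }
            }
        }
    }
}

After that, the hit-level file, the lookup files and the header file are ordinary tab-delimited files and can each be loaded with PigStorage('\t') for the data quality checks.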

Parse Freebase RDF dump with MapReduce

I downloaded the rdf data dump from Freebase and what I need to extract is the name of every entity in English in Freebase.
Do I have to use Hadoop and MapReduce to do this? If so, how? Or is there another way to extract the entity names?
It would be nice if each entity title/name were on its own line in a .txt file.
You could use Hadoop, but for such simple processing, you'd spend more time uncompressing and splitting the input than you would save in being able to do the search in parallel. A simple zgrep would accomplish your task in much less time.
Something along the lines of this:
zegrep $'name.*#en\t\\.$' freebase-public/rdf/freebase-rdf-2013-09-15-00-00.gz | cut -f 1,3 | gzip > freebase-names-20130915.txt.gz
will give you a compressed two column file of Freebase MIDs and their English names. You'll probably want to make the grep a little more specific to avoid false positives (and test it, which I haven't done). This file is over 20GB compressed, so it'll take a while, but less time than even getting started to prepare a Hadoop job.
If you want to do additional filtering such as only extract entities with type of /common/topic, you may find that you need to move to a scripting language like Python to be able to look at and evaluate across multiple lines at once.
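If you would rather stay in Java than shell or Python, a rough equivalent of the zgrep filter above might look like the sketch below. The column layout and the predicate/language checks are assumptions about the tab-separated N-Triples format of the dump and should be verified against a few real lines first:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Streams the gzipped Freebase RDF dump line by line and prints
// "mid<TAB>name" for triples whose predicate looks like type.object.name
// and whose object is an English literal.
public class ExtractEnglishNames {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");
                if (cols.length >= 3
                        && cols[1].contains("type.object.name")
                        && cols[2].endsWith("@en")) {
                    System.out.println(cols[0] + "\t" + cols[2]);
                }
            }
        }
    }
}

Like the shell pipeline, this is a single sequential pass over the compressed dump, so it will take a while but needs no cluster.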
No, I don't think you need to use Hadoop and MapReduce for this. You can easily create a web service to extract the RDF and send it to a file. The blog post at [1] explains how you can extract RDF data using the WSO2 Data Services Server. Similarly, you can use the WSO2 DSS data federation feature [2] to extract RDF data and send it to an Excel sheet.
[1] - http://sparkletechthoughts.blogspot.com/2011/09/extracting-rdf-data-using-wso2-data.html
[2] - http://prabathabey.blogspot.com/2011/08/data-federation-with-wso2-data-service.html
There's a screencast for Google Compute Engine that shows you how to do this as well.

Hadoop streaming: single file or multi file per map. Don't Split

I have a lot of zip files that need to be processed by a C++ library, so I wrote my Hadoop streaming program in C++. The program reads a zip file, unzips it, and processes the extracted data.
My problem is that:
My mapper can't get the content of exactly one file. It usually gets something like 2.4 or 3.2 files: Hadoop sends several files to my mapper, but at least one of them is only partial, and zip files can't be processed like that.
Can I get exactly one file per map? I don't want to use a file list as input and read the files from my program, because I want to keep the advantage of data locality.
I could accept the contents of multiple zip files per map as long as Hadoop doesn't split the zip files - I mean exactly 1, 2, or 3 files, not something like 2.3 files. Actually that would be even better, because my program needs to load an ~800MB data file to process the unzipped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work, you would need to implement a custom InputFormat, which is not very difficult to do; you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
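The FAQ entry above boils down to overriding isSplitable() so Hadoop never splits a file. A minimal version against the old mapred API (which Hadoop streaming uses) could look like this; note that for binary zip content you would also want a RecordReader that hands over whole files rather than text lines, which this sketch does not show:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// A TextInputFormat that never splits its input files, so every map task
// only ever sees complete files.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // never split, regardless of file size or block size
    }
}

Package it into a jar, ship it with -libjars, and point the streaming job at it with -inputformat and the fully qualified class name.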
Rather than depending on the min split size, an easier way I would suggest is to gzip your files. Gzipped files cannot be split, so each map receives whole files only.
See http://www.gzip.org/ for the tool itself.
If you are on Linux, you can compress the extracted data with:
gzip -r /path/to/data
Then pass this data as the input to your Hadoop streaming job.

hadoop job to split xml files

I've got 1000's of files to process. Each file consists of 1000's of XML files concatenated together.
I'd like to use Hadoop to split each XML file separately. What would be a good way of doing this using Hadoop?
NOTES: I am total Hadoop newbie. I plan on using Amazon EMR.
Check out Mahout's XmlInputFormat. It's a shame that this is in Mahout and not in the core distribution.
Are the concatenated XML files at least in the same format? If so, set START_TAG_KEY and END_TAG_KEY to the root tag of each of your documents. Each document will then show up as one Text record in the map, and you can use your favorite Java XML parser to finish the job.
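On the driver side, this is mostly a matter of setting the two tag properties and using XmlInputFormat as the input format. A sketch, assuming Mahout's XmlInputFormat is on the classpath (its package has moved between Mahout versions, e.g. org.apache.mahout.text.wikipedia in later releases) and using <record> as a stand-in for your real root tag:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.text.wikipedia.XmlInputFormat;

// Map-only job: every <record>...</record> block in the concatenated input
// arrives at the default (identity) mapper as one Text value and is written
// back out. Replace the identity mapper with your own to parse each record.
public class SplitXmlJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set(XmlInputFormat.START_TAG_KEY, "<record>");    // root tag of each document
        conf.set(XmlInputFormat.END_TAG_KEY, "</record>");

        Job job = Job.getInstance(conf, "split concatenated xml");
        job.setJarByClass(SplitXmlJob.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setNumReduceTasks(0);                              // map-only: just split and emit
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

On Amazon EMR this can run as a custom JAR step once the input files are in S3 or HDFS.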
