hadoop job to split xml files - hadoop

I've got 1000's of files to process. Each file consists of 1000's of XML files concatenated together.
I'd like to use Hadoop to split each XML file separately. What would be a good way of doing this using Hadoop?
NOTES: I am total Hadoop newbie. I plan on using Amazon EMR.

Check out Mahout's XmlInputFormat. It's a shame that this is in Mahout and not in the core distribution.
Are the XML files that are concatenated at least in the same format? If so, you set START_TAG_KEY and END_TAG_KEY to the root in each of your files. Each file will show up as one Text record in the map. Then, you can use your favorite Java XML parser to finish the job.

Related

How to parse multiple pdf conversion into hadoop (example)

i have one million pdf , how to convert into text using hadoop and used this for analytics.
The goal is to use the power of hadoop for extracting pdf data as a text.
I have processed a single pdf file on Hadoop not tried with multiple file but i believe it will work fine for multiple files too..
Complete code is available on the below link
http://ybhavesh.blogspot.in/2015/12/poc-sensex-log-data-processing-pdf-file.html
Hope this helps!!..

Downloading list of files in parallel in Apache Pig

I have a simple text file which contains list of folders on some FTP servers. Each line is a separate folder. Each folder contains couple of thousand images. I want to connect to each folder, store all files inside that foder in a SequenceFile and then remove that folder from FTP server. I have written a simple pig UDF for this. Here it is:
dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need results bag. It is just a dummy bag */
The problem is I'm not sure if each line of input is processed in separate mapper. The input file is not a huge file just couple of hundred lines. If it were pure Map/Reduce then I would use NLineInputFormat and process each line in a separate Mapper. How can I achieve the same thing in pig?
Pig lets you write your own load functions, which let you specify which InputFormat you'll be using. So you could write your own.
That said, the job you described sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest just doing it in vanilla map-reduce. If the total file size is Gigabytes or less, I'd just do it all directly on a single host. It's simpler not to use map reduce if you don't have to.
I typically use map-reduce to first load data into HDFS, and then Pig for all data processing. Pig doesn't really add any benefits over vanilla hadoop for loading data IMO, it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus it's technically possible with Pig that your loader will be called multiple times. That's a gotcha you don't need to worry about using Hadoop map-reduce directly.

Hadoop streaming: single file or multi file per map. Don't Split

I have a lot of zip files that need to be processed by a C++ library. So I use C++ to write my hadoop streaming program. The program will read a zip file, unzip it, and process the extracted data.
My problem is that:
my mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper but at least one of the file is partial. You know zip files can't be processed like this.
Can I get exactly one file per map? I don't want to use file list as input and read it from my program because I want to have the advantage of data locality.
I can accept the contents of multiple zip file per map if Hadoop don't split the zip files. I mean exactly 1, 2, 3 files, not something like 2.3 files. Actually it will be even better because my program need to load about 800MB data file for processing the unziped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work then you would need to implement an InputFormat which is not very difficult to do and you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Rather then depending on the min split size I would suggest an easier way is to Gzip your files.
There is a way to compress files using gzip
http://www.gzip.org/
If you are on Linux you compress the extracted data with
gzip -r /path/to/data
Now that you have this pass this data as your input in your hadoop streaming job.

hadoop/HDFS: Is it possible to write from several processes to the same file?

f.e. create file 20bytes.
1st process will write from 0 to 4
2nd from 5 to 9
etc
I need this to parallel creating a big files using my MapReduce.
Thanks.
P.S. Maybe it is not implemented yet, but it is possible in general - point me where I should dig please.
Are you able to explain what you plan to do with this file after you have created it.
If you need to get it out of HDFS to then use it then you can let Hadoop M/R create separate files and then use a command like hadoop fs -cat /path/to/output/part* > localfile to combine the parts to a single file and save off to the local file system.
Otherwise, there is no way you can have multiple writers open to the same file - reading and writing to HDFS is stream based, and while you can have multiple readers open (possibly reading different blocks), multiple writing is not possible.
Web downloaders request parts of the file using the Range HTTP header in multiple threads, and then either using tmp files before merging the parts together later (as Thomas Jungblut suggests), or they might be able to make use of Random IO, buffering the downloaded parts in memory before writing them off to the output file in the correct location. You unfortunately don't have the ability to perform random output with Hadoop HDFS.
I think the short answer is no. The way you accomplish this is write your multiple 'preliminary' files to hadoop and then M/R them into a single consolidated file. Basically, use hadoop, don't reinvent the wheel.

How do I control output files name and content of an Hadoop streaming job?

Is there a way to control the output filenames of an Hadoop Streaming job?
Specifically I would like my job's output files content and name to be organized by the ket the reducer outputs - each file would only contain values for one key and its name would be the key.
Update:
Just found the answer - Using a Java class that derives from MultipleOutputFormat as the jobs output format allows control of the output file names.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
I haven't seen any samples for this out there...
Can anyone point out to an Hadoop Streaming sample that makes use of a custom output format Java class?
Using a Java class that derives from MultipleOutputFormat as the jobs output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
When using Hadoop Streaming, since only one JAR is supported you actually have to fork the streaming jar and put your new output format classes in it for streaming jobs to be able to reference it...
EDIT:
As of version 0.20.2 of hadoop this Class has been deprecated and you should now use:
http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
In general, Hadoop would have you consider the entire directory to be the output, and not an individual file. There's no way to directly control the filename, whether using Streaming or regular Java jobs.
However, nothing is stopping you from doing this splitting and renaming yourself, after the job has finished. You can $HADOOP dfs -cat path/to/your/output/directory/part-*, and pipe that to a script of yours that splits content up by keys and writes it to new files.

Resources