Parsing large XML to TSV - hadoop

I need to parse few XML's to TSV, the Size of the XML Files is of the order of 50 GB, I am basically doubtful about the implemetation i should choose to parse this i have two oprions
using SAXParser
use Hadoop
i have a fair bit of idea about SAXParser implementaion but i think having access to Hadoop cluster, i should use Hadoop as this is what hadoop is for i.e. Big Data
it would be great someone could provide a hint/doc as how to do this in Hadoop or efficient SAXParser implementaion for such a big file or rather what should i go for Hadoop or SAXparser?

I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way... the other is to write SAX code) since you can still operate on the records in a dom-like fashion.
With these large files, one thing to keep in mind is that you'll most definitely want to enable compression on the mapper output: Hadoop, how to compress mapper output but not the reducer output... this will speed things up quite a bit.
I've written a quick outline of how I've handled all this, maybe it'll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees which makes things really simple....

I don't know about SAXparser. But definitely Hadoop will do your job if you have a hadoop cluster with enough data nodes. 50Gb is nothing as I was performing operations on more than 300GB of data on my cluster. Write a map reduce job in java and the documentation for hadoop can be found at http://hadoop.apache.org/

It is rilatively trivial to process XML on hadoop by having one mapper per XML file. This approach will be fine for large number of relatively small XMLs
The problem is that in Your case files are big and thier number is small so without splitting hadoop benefit will be limited. Taking to account hadoop's overhead the benefit be negative...
In hadoop we need to be able to split input files into logical parts (called splits) to efficiently process large files.
In general XML is not looks like "spliitable" format since there is no well defined division into blocks, which can be processed independently. In the same time, if XML contains "records" of some kind splitting can be implemented.
Good discussion about splitting XMLs in haoop is here:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
where Mahout's XML input format is suggested.
Regarding your case - I think as long as number of your files is not much bigger then number of cores you have on single system - hadoop will not be efficient solution.
In the same time - if you want to accumulate them over time - you can profit from hadoop as a scalable storage also.

I think that SAX has traditionally been mistakenly associated with processing big XML files... in reality, VTD-XML is often the best option, far better than SAX in terms of performance, flexibility, code readability and maintainability... on the issue of memory, VTD-XML's in-memory model is only 1.3x~1.5X the size of the corresponding XML document.
VTD-XML has another significant benefit over SAX: its unparalleled XPath support. Because of it, VTD-XML users routinely report performance gain of 10 to 60x over SAX parsing over hundreds of MB XML files.
http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java#anch104307
Read this paper that comprehensively compares the existing XML parsing frameworks in Java.
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf

Related

Processing HUGE number of small files independently

The task is to process HUGE (around 10,000,000) number of small files (each around 1MB) independently (i.e. the result of processing file F1, is independent of the result of processing F2).
Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that processing files in my case, are independent. As far as I understand MR, it works best when the output is dependent on many individual files (for example counting the frequency of each word, given many documents, since a word might be included in any document in the input file). But in my case, I just need a lot of independent CPUs/Cores.
I was wondering if you have any advice on this.
Side Notes: There is another issue which is that MR works best for "huge files rather than huge number of small size". Although there seems to be solutions for that. So I am ignoring it for now.
It is possible to use map reduce for your needs. In MapReduce, there are two phases Map and Reduce, however, the reduce phase is not a must, just for your situation, you could write a map-only MapReduce job, and all the calculations on a single file should be put into a customised Map function.
However, I haven't process such huge num of files in a single job, no idea on its efficiency. Try it yourself, and share with us :)
This is quite easy to do. In such cases - the data for MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names - which is order of a couple of gigs max.
One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.
(fwiw - I would suggest Qubole (where I am a founder) instead of EMR cause it would save you a ton of money with auto-scaling and spot integration).

what are the disadvantages of mapreduce?

What are the disadvantages of mapreduce? There are lots of advantages of mapreduce. But I would like to know the disadvantages of mapreduce too.
I would rather ask when mapreduce is not a suitable choice? I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where mapreduce is not a suitable choice :
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other(jobs run in isolation).
When your processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well are you using it. For example, you can't expect a MR job to give you the result in a couple of ms. You can't count it as its disadvantage either. It's just that you are using it at the wrong place. And it holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some usecases where MapReduce does not work very well.
When you need a response fast. e.g. say < few seconds (Use stream
processing, CEP etc instead)
Processing graphs
Complex algorithms e.g. some machine learning algorithms like SVM, and also see 13 drawfs
(The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again. e.g. KMeans - use Spark
When map phase generate too many keys. Thensorting takes for ever.
Joining two large data sets with complex conditions (equal case can
be handled via hashing etc)
Stateful operations - e.g. evaluate a state machine Cascading tasks
one after the other - using Hive, Big might help, but lot of overhead
rereading and parsing data.
You need to rethink/ rewrite trivial operations like Joins, Filter to achieve in map/reduce/Key/value patterns
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied with Java, of course you have Pig and Hive for rescue but you lose flexibility.
First of all, it streams the map output, if it is possible to keep it in memory this will be more efficient. I originally deployed my algorithm using MPI but when I scaled up some nodes started swapping, that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a hadoop book (Hadoop in action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies if you have too many files your Namenode could experience problems.
If you do not want to use the streaming API you have to write your program in the java language. I for example did a translation from C++. This has some side effects, for example Java has a large string overhead compared to C. Since my software is all about strings this is some sort of drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Is Hadoop the right tech for this?

If I had millions of records of data, that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic and then take that matching subset and insert it into a separate database would I use Hadoop and MapReduce for such a task or is there some other technology I am missing? The main reason I am looking for something other than a standard RDMS is because all of the base data is from multiple sources and not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized and local results can be computed and aggregated. A typical example would be counting words in a document. You can split this up into multiple parts where you count some of the words on one node, some on another node, etc and then add up the totals (obviously this is a trivial example, but illustrates the type of problem).
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of having non-uniformly structured data, you might consider a NoSQL database, which is designed to handle data where a lot of a columns are null (such as MongoDB).
Hadoop/MR are designed for batch processing and not for real time processing. So, some other alternative like Twitter Storm, HStreaming has to be considered.
Also, look at Hama for real time processing of data. Note that real time processing in Hama is still crude and a lot of improvement/work has to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not great , and millions of records are not sounds as such I would suggest to try to get most from RDMBS, even if your schema will not be properly normalized.
I think even tavle of structure K1, K2, K3, Blob will be more useful t
In NoSQL KeyValue stores are built to support schemaless data in various flavors but their query capability are limited.
Only case I can think as usefull is MongoDB/ CoachDB capability to index schemaless data. You will be able to get records by some attribute value.
Regarding Hadoop MapReduce - i think it is not useful unless you want to harness a lot of CPUs for your processing or have a lot of data or need distributed sort capability.

Hadoop for processing very large binary files

I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are of the order of a couple of hundreds of Gb. For a variety of fixed, implementation specific reasons, these files cannot be processed in parallel but have to be processed sequentially by the same process through to the end.
The application is developed in C++ so I would be considering Hadoop pipes to stream the data in and out. Each instance will need to process of the order of 100Gb to 200Gb sequentially of its own data (currently stored in one file), and the application is currently (probably) IO limited so it's important that each job is run entirely locally.
I'm very keen on HDFS for hosting this data - the ability to automatically maintain redundant copies and to rebalance as new nodes are added will be very useful. I'm also keen on map reduce for its simplicity of computation and its requirement to host the computation as close as possible to the data. However, I'm wondering how suitable Hadoop is for this particular application.
I'm aware that for representing my data it's possible to generate non-splittable files, or alternatively to generate huge sequence files (in my case, these would be of the order of 10Tb for a single file - should I pack all my data into one). And that it's therefore possible to process my data using Hadoop. However it seems like my model doesn't fit Hadoop that well: does the community agree? Or have suggestions for laying this data out optimally? Or even for other cluster computing systems that might fit the model better?
This question is perhaps a duplicate of existing questions on hadoop, but with the exception that my system requires an order of magnitude or two more data per individual file (previously I've seen the question asked about individual files of a few Gb in size). So forgive me if this has been answered before - even for this size of data.
Thanks,
Alex
It seems like you are working with relatively few numbers of large files. Since your files are huge and not splittable, Hadoop will have trouble scheduling and distributing jobs effectively across the cluster. I think the more files that you process in one batch (like hundreds), the more worth while it will be to use Hadoop.
Since you're only working with a few files, have you tried a simpler distribution mechanism, like launching processes on multiple machines using ssh, or GNU Parallel? I've had a lot of success using this approach for simple tasks. Using a NFS mounted drive on all your nodes can share limits the amount of copying you would have to do as well.
You can write a custom InputSplit for your file, but as bajafresh4life said it won't really be ideal because unless your HDFS chunk size is the same as your file size your files are going to be spread all around and there will be network overhead. Or if you do make your HDFS size match your file size then you're not getting the benefit of all your cluster's disks. Bottom line is that Hadoop may not be the best tool for you.

Converting word docs to pdf using Hadoop

Say if I want to convert 1000s of word files to pdf then would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?
Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?
There isn't much advantage in using hadoop for this use case. Having competing consumers read from a queue and producing output is going to be a lot easier to setup and will probably be more efficient.
Hadoop would not automatically split a document and process sections on differnt nodes. Although if you had a really big (many thousands of pages long) then the Hadoop use case would make sense - but only when the time to produce a pdf on a single machine is significant.
The map tasks could print a few thousand pages each and the reduce task merge the PDF's into a single document - although reading the resulting file may be difficult to read if it is very large.
Say if I want to convert 1000s of word
files to pdf then would using Hadoop
to approach this problem make sense?
Would using Hadoop have any advantage
over simply using multiple EC2
instances with job queues?
I think either tool could accomplish this task, so it depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the realm of tasks at which Hadoop performs well.
Also if there was 1 file and 10 free
nodes then would hadoop split the file
and send it to the 10 nodes or will
the file be sent to just 1 node while
9 sit idle?
It depends on the InputFormat you use. As you can see in the documentation, you can specify how to compute the "InputSplits", which might include splitting a large document into chunks.
Good luck with whatever tool you choose for this problem!
Regards,
Jeff
How many 1000's are you talking about? If this is a once off batch I would set it up on a single machine and simply let it run, you'll be surprised I think at how fast you can convert 1000s of Docs to PDF, even if you need to run the task for a couple of days, if its a once off convert then there is no need for complications such as Hadoop. If you are continually converting 1000s of docs then its probably worth the effort of setting up something else.

Resources