How to deal with a large number of Parquet files - Hadoop

I'm using Apache Parquet on Hadoop and after a while I have one concern. When I generate Parquet files with Spark on Hadoop, things can get pretty messy. By messy I mean that the Spark job generates a large number of Parquet files. When I try to query them, the query takes a long time because Spark has to merge all the files together.
Can you show me the right way to deal with this, or am I misusing Parquet? Have you dealt with this already, and how did you resolve it?
UPDATE 1:
Is some "side job" for merging those files in one parquet good enough? What size of parquet files is prefered to use, some up and down boundaries?

Take a look at this GitHub repo and this answer. In short, keep the size of the files larger than the HDFS block size (128 MB, 256 MB).

A good way to reduce the number of output files is to use coalesce or repartition.
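A minimal sketch of what that looks like with the Spark Java API, assuming hypothetical input/output paths and a partition count you would tune so each output file ends up at or above the HDFS block size:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CompactParquet {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("CompactParquet").getOrCreate();

            // Hypothetical paths; point these at the messy output and a compacted target.
            Dataset<Row> df = spark.read().parquet("/data/events/raw");

            // coalesce(n) merges existing partitions without a full shuffle;
            // use repartition(n) instead if you also need evenly sized output files.
            df.coalesce(16)
              .write()
              .mode("overwrite")
              .parquet("/data/events/compacted");

            spark.stop();
        }
    }

The same call also works as a periodic "side job" that rewrites yesterday's small files into a handful of block-sized ones.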

Related

HDFS small file design

I want to be able to store millions of small binary files (images, exe, etc., each ~1 MB) on HDFS. My requirement is basically to be able to fetch random files, not to run MapReduce jobs over them.
The main problem for me is the NameNode memory issue, not the MapReduce mapper issue.
So my options are:
HAR files - aggregate the small files and only then store them, referenced by their har:// path, somewhere else
Sequence files - append the files as they come in; this is more suitable for MapReduce jobs, so I have pretty much eliminated it
HBase - saving the small files to HBase is another solution described in a few articles on Google
I guess I'm asking whether there is anything I missed. Can I achieve what I need by appending binary files to big Avro/ORC/Parquet files and then querying them by name or by hash from a Java client program?
Thanks,
If you append multiple files into large files, then you'll need to maintain an index of which large file each small file resides in. This is basically what HBase does for you: it combines data into large files, stores them in HDFS, and uses sorting on keys to support fast random access. It sounds to me like HBase would suit your needs, and if you hand-rolled something yourself, you might end up redoing a lot of work that HBase already does.
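A minimal sketch of that access pattern with the standard HBase client API, assuming a pre-created table named small_files with a column family f (both names are made up here); the row key is the file name, so random reads never touch MapReduce:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        private static final byte[] CF = Bytes.toBytes("f");      // assumed column family
        private static final byte[] QUAL = Bytes.toBytes("data"); // qualifier holding the raw bytes

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) {

                // Write: row key = file name (a content hash works just as well).
                byte[] content = Files.readAllBytes(Paths.get("image001.png"));
                Put put = new Put(Bytes.toBytes("image001.png"));
                put.addColumn(CF, QUAL, content);
                table.put(put);

                // Random read by key - no MapReduce job involved.
                Result result = table.get(new Get(Bytes.toBytes("image001.png")));
                byte[] fetched = result.getValue(CF, QUAL);
                System.out.println("fetched " + fetched.length + " bytes");
            }
        }
    }

At ~1 MB per file the values fit comfortably in HBase cells; for much larger blobs you would want something like MOB support or a pointer-to-HDFS scheme instead.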

Processing HUGE number of small files independently

The task is to process a HUGE number (around 10,000,000) of small files (each around 1 MB) independently (i.e. the result of processing file F1 is independent of the result of processing F2).
Someone suggested MapReduce (on Amazon EMR Hadoop) for my task. However, I have serious doubts about MR.
The reason is that in my case the files are processed independently. As far as I understand MR, it works best when the output depends on many individual files (for example, counting the frequency of each word given many documents, since a word might appear in any document in the input). But in my case, I just need a lot of independent CPUs/cores.
I was wondering if you have any advice on this.
Side note: there is another issue, which is that MR works best for "huge files rather than a huge number of small files". Although there seem to be solutions for that, I am ignoring it for now.
It is possible to use MapReduce for your needs. MapReduce has two phases, Map and Reduce, but the Reduce phase is not a must. For your situation you can write a map-only MapReduce job, with all the computation for a single file going into a customised Map function.
However, I haven't processed such a huge number of files in a single job, so I have no idea about its efficiency. Try it yourself, and share with us :)
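For concreteness, a minimal map-only skeleton in Java (the class names and the placeholder processing are made up; the only essential detail is setNumReduceTasks(0), which turns off the shuffle and reduce phases entirely):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {

        public static class FileMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Placeholder per-record processing; put your real per-file logic here.
                String result = value.toString().toUpperCase();
                context.write(new Text(result), NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(FileMapper.class);
            job.setNumReduceTasks(0);               // map-only: no shuffle, no reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }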
This is quite easy to do. In such cases the input to the MR job is typically the list of files (not the files themselves), so the data submitted to Hadoop is just 10M file names, which is a couple of gigabytes at most.
You then use MR to split the list of files into smaller fragments (how many can be controlled by various options). Each mapper gets a list of files, processes them one at a time and generates the output.
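One way to get that behaviour, assuming the 10M paths are stored one per line in a text file: NLineInputFormat hands each mapper a fixed number of lines from the list, and the mapper opens each path itself. The list location and lines-per-split value below are illustrative, and the driver is the same map-only skeleton as in the previous sketch:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    // Each input record is one line of the file list, i.e. the path of one small file.
    public class FileListMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text pathLine, Context context)
                throws IOException, InterruptedException {
            Path path = new Path(pathLine.toString().trim());
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(path)) {
                // Process the file's contents here and emit whatever result you need.
            }
            context.write(pathLine, NullWritable.get());
        }

        // Extra driver settings on top of the map-only skeleton above.
        public static void configure(Job job) throws IOException {
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 10000); // ~10k file names per mapper; tune freely
            NLineInputFormat.addInputPath(job, new Path("/lists/all-files.txt")); // hypothetical list file
        }
    }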
(FWIW, I would suggest Qubole (where I am a founder) instead of EMR, because it would save you a ton of money with auto-scaling and spot integration.)

How can I work with a large number of small files in Hadoop?

I am new to Hadoop and I'm working with a large number of small files in the wordcount example.
It takes a lot of map tasks, which slows down my execution.
How can I reduce the number of map tasks?
If the best solution to my problem is to concatenate the small files into a larger file, how can I concatenate them?
If you're using something like TextInputFormat, the problem is that each file gets at least one split, so the upper bound on the number of maps is the number of files. In your case, with many very small files, you end up with many mappers each processing very little data.
To remedy that, you should use CombineFileInputFormat, which packs multiple files into the same split (up to the block size limit, I think), so the number of mappers becomes independent of the number of files and simply depends on the amount of data.
You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined, let's call it CombinedInputFormat as in the link, you can tell your job to use it by doing:
job.setInputFormatClass(CombinedInputFormat.class);
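As a rough idea of what such a format can look like on Hadoop 2.x (a sketch, not the exact implementation behind the link), assuming the CombineFileRecordReaderWrapper helper from the newer mapreduce API is available:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

        public CombinedInputFormat() {
            // Cap each combined split at roughly one HDFS block.
            setMaxSplitSize(128 * 1024 * 1024);
        }

        // Wrapper exposing the (CombineFileSplit, TaskAttemptContext, Integer)
        // constructor that CombineFileRecordReader expects; it delegates the
        // actual reading of each file to the normal line record reader.
        public static class TextWrapper extends CombineFileRecordReaderWrapper<LongWritable, Text> {
            public TextWrapper(CombineFileSplit split, TaskAttemptContext context, Integer idx)
                    throws IOException, InterruptedException {
                super(new TextInputFormat(), split, context, idx);
            }
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            return new CombineFileRecordReader<LongWritable, Text>(
                    (CombineFileSplit) split, context, TextWrapper.class);
        }
    }

Recent Hadoop releases also ship a ready-made CombineTextInputFormat that does exactly this for plain text input, so for wordcount you may be able to drop it in without writing any code.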
Cloudera posted a blog on the small files problem some time back. It's an old entry, but the suggested method still applies.

Performance with a large number of multiple output files in Hadoop

I'm using a custom output format that outputs a new sequence file per mapper per key, so you end up with something like this:
Input
Key1 Value
Key2 Value
Key1 Value
Files
/path/to/output/Key1/part-00000
/path/to/output/Key2/part-00000
I've noticed a huge performance hit. It usually takes around 10 minutes to simply map the input data, but after two hours the mappers weren't even halfway complete, though they were outputting rows. I expect the number of unique keys to be around half the number of input rows, around 200,000.
Has anyone done anything like this, or can you suggest anything that might help the performance? I'd like to keep this key-splitting process within Hadoop if possible.
Thanks!
I believe you should revisit your design. I don't believe HDFS scales well beyond 10M files. I suggest reading more about Hadoop, HDFS and MapReduce. A good place to start is http://www.cloudera.com/blog/2009/02/the-small-files-problem/.
Good luck!
EDIT 8/26: Based on @David Gruzman's comment, I looked deeper into the issue. Indeed, the penalty for storing a large number of small files applies only to the NameNode; there is no additional space penalty on the data nodes. I removed the incorrect part of my answer.
It sounds like writing the output to some key-value store might help a lot.
For example, HBase might suit your needs, since it is optimized for a big number of writes, and you would reuse part of your Hadoop infrastructure.
There is an existing output format for writing straight to HBase: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html
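A sketch of wiring that up, assuming a map-only job reading tab-separated "key<TAB>value" lines and an existing HBase table (the table name, column family and input layout are all assumptions here):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class HBaseSinkJob {

        // Every input row becomes one Put keyed by its "Key" column,
        // instead of one sequence file per key on HDFS.
        public static class PutMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            private static final byte[] CF = Bytes.toBytes("f");   // assumed column family
            private static final byte[] QUAL = Bytes.toBytes("v");

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\t", 2);    // assumed key<TAB>value layout
                Put put = new Put(Bytes.toBytes(parts[0]));
                put.addColumn(CF, QUAL, Bytes.toBytes(parts[1]));
                context.write(new ImmutableBytesWritable(put.getRow()), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set(TableOutputFormat.OUTPUT_TABLE, "keyed_values"); // hypothetical table
            Job job = Job.getInstance(conf, "hdfs-to-hbase");
            job.setJarByClass(HBaseSinkJob.class);
            job.setMapperClass(PutMapper.class);
            job.setNumReduceTasks(0);                                 // map-only
            job.setOutputFormatClass(TableOutputFormat.class);
            job.setOutputKeyClass(ImmutableBytesWritable.class);
            job.setOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }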

Parsing large XML to TSV

I need to parse a few XMLs to TSV. The XML files are on the order of 50 GB each. I am basically unsure which implementation I should choose to parse them; I have two options:
use SAXParser
use Hadoop
I have a fair idea about a SAXParser implementation, but I think that since I have access to a Hadoop cluster, I should use Hadoop, as this is what Hadoop is for, i.e. big data.
It would be great if someone could provide a hint/doc on how to do this in Hadoop, or an efficient SAXParser implementation for such a big file, or rather, tell me which I should go for: Hadoop or SAXParser?
I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way... the other is to write SAX code), since you can still operate on the records in a DOM-like fashion.
With these large files, one thing to keep in mind is that you'll most definitely want to enable compression on the mapper output (see: Hadoop, how to compress mapper output but not the reducer output)... this will speed things up quite a bit.
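If it helps, this is the configuration that controls it in the Hadoop 2.x mapreduce API (the codec choice is just an example; Snappy needs the native libraries installed, otherwise swap in another codec):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOutputCompression {
        // Compress only the intermediate map output; the final reducer output stays uncompressed.
        public static void enable(Job job) {
            Configuration conf = job.getConfiguration();
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);
            // On Hadoop 1.x the equivalent properties were mapred.compress.map.output
            // and mapred.map.output.compression.codec.
        }
    }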
I've written a quick outline of how I've handled all this; maybe it'll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees, which makes things really simple...
I don't know about SAXParser, but Hadoop will definitely do the job if you have a Hadoop cluster with enough data nodes. 50 GB is nothing; I have run operations on more than 300 GB of data on my cluster. Write a MapReduce job in Java; the documentation for Hadoop can be found at http://hadoop.apache.org/
It is relatively trivial to process XML on Hadoop by having one mapper per XML file. This approach is fine for a large number of relatively small XMLs.
The problem is that in your case the files are big and their number is small, so without splitting them the benefit of Hadoop will be limited. Taking Hadoop's overhead into account, the benefit may even be negative...
In Hadoop we need to be able to split input files into logical parts (called splits) to process large files efficiently.
In general, XML does not look like a "splittable" format, since there is no well-defined division into blocks that can be processed independently. At the same time, if the XML contains "records" of some kind, splitting can be implemented.
A good discussion of splitting XML in Hadoop is here:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
where Mahout's XML input format is suggested.
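For reference, wiring that input format into a job boils down to setting the start and end tags that delimit one record (a sketch; the element name is made up, and the package of XmlInputFormat has moved between Mahout releases, so check the version you have):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    // Package has varied across Mahout releases; recent versions keep it here.
    import org.apache.mahout.text.wikipedia.XmlInputFormat;

    public class XmlJobSetup {
        public static Job build(Configuration conf) throws Exception {
            // Each map input value will be one complete <record>...</record> element.
            conf.set("xmlinput.start", "<record>");   // assumed element name
            conf.set("xmlinput.end", "</record>");
            Job job = Job.getInstance(conf, "xml-to-tsv");
            job.setInputFormatClass(XmlInputFormat.class);
            // The mapper then parses each element (DOM/StAX) and emits TSV lines.
            return job;
        }
    }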
Regarding your case: I think that as long as the number of your files is not much bigger than the number of cores you have on a single system, Hadoop will not be an efficient solution.
At the same time, if you want to accumulate these files over time, you can also profit from Hadoop as scalable storage.
I think that SAX has traditionally been mistakenly associated with processing big XML files... in reality, VTD-XML is often the best option, far better than SAX in terms of performance, flexibility, code readability and maintainability... on the issue of memory, VTD-XML's in-memory model is only 1.3x~1.5x the size of the corresponding XML document.
VTD-XML has another significant benefit over SAX: its unparalleled XPath support. Because of this, VTD-XML users routinely report performance gains of 10x to 60x over SAX parsing on XML files of hundreds of MB.
http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java#anch104307
Read this paper, which comprehensively compares the existing XML parsing frameworks in Java:
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
