Effectively merge big parquet files - hadoop

I'm using parquet-tools to merge parquet files. But it seems that parquet-tools needs an amount of memory as big as the merged file. Do we have other ways or configurable options in parquet-tools to use memory more effectively? Cause I run the merge job in as a map job on hadoop env. And the container gets killed every time cause it used more memory than it is provided.
Thank you.

I wouldn't recommend using parquet-tools merge, since it just places row groups one after the another, so you will still have small groups, just packed together in a single file. The resulting file will typically not have noticably better performance, and under certain circumstances it may even perform worse than separate files. See PARQUET-1115 for details.
Currently the only proper way to merge Parquet files is to read all data from them and write it to a new Parquet file. You can do it with a MapReduce job (requires writing custom code for this purpose) or using Spark, Hive or Impala.

Related

Small files in hadoop

I am trying to combine small files on hdfs. This is simply for historical purposes, if needed the large file(s) would be disassembled and ran through the process to create the data for the hadoop table. Is there a way to achieve this simply? For example, day one receive 100 small files, combine into a file, then day two add/append more files into the previously created file, etc...
If the files are all the same "schema", let's say, like CSV or JSON. Then, you're welcome to write a very basic Pig / Spark job to read a whole folder of tiny files, then write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size.
You've also mentioned Hive, so use an external table for the small files, and use a CTAS query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
IMO, if possible, the optimal solution is to setup a system "upstream" of Hadoop, which will batch your smaller files into larger files, and then dump them out to HDFS. Apache NiFi is a useful tool for this purpose.

What is the best place to store multiple small files in hadoop

I will be having multiple small text files around size of 10KB, got confused where to store those files in HBase or in HDFS. what will be the optimized storage?
Because to store in HBase I need to parse it first then save it against some row key.
In HDFS I can directly create a path and save that file at that location.
But till now whatever I read, it says you should not have multiple small files instead create less big files.
But I can not merge those files, so I can't create big file out of small files.
Kindly suggest.
A large number of small files don´t fit very well with hadoop since each file is a hdfs block and each block require a one Mapper to be processed by default.
There are several options/strategies to minimize the impact of small files, all options require to process at least one time small files and "package" them in a better format. If you are planning to read these files several times, pre-process small files could make sense, but if you will use those files just one time then it doesn´t matter.
To process small files my sugesstion is to use CombineTextInputFormat (here an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat use one Mapper to process several files but could require to transfer the files to a different DataNode to put files together in the DAtaNode where the map is running and could have a bad performance with speculative tasks but you can disable them if your cluster is enough stable.
Alternative to repackage small files are:
Create sequence files where each record contains one of the small files. With this option you will keep the original files.
Use IdentityMapper and IdentityReducer where the number of reducers are less than the number of files. This is the most easy approach but require that each line in the files be equals and independents (Not headers or metadata at the beginning of the files required to understand the rest of the file).
Create a external table in hive and then insert all the records for this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach have the same limitations than the option two and require to use Hive, the adventage is that you don´t require to write a MapReduce.
If you can not merge files like in option 2 or 3, my suggestion is to go with option 1
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
It's no problem with having many small different files. If for example you have a table in Hive with many very small files in hdfs, it's not optimal, better to merge these files into less big ones because when reading this table a lot of mappers will be created. If your files are completely different like 'apples' and 'employees' and can not be merged than just store them as is.

Modeling Data in Hadoop

Currently I am bringing into Hadoop around 10 tables from an EDW (Enterprise Data Warehouse), these tables are closely related to a Star Schema model. I'm usig Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at what are some better ways to store these files before striking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at how might be some ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
Storing these files in csv is fine. Since you will be able to process these files using text output format and could also read it through hive using specific delimiter. You could change the delimiter if you do not like comma to pipe("|") that's what I do most of the time. Also you generally need to have large files in hadoop but if its large enough that you can partition these files and each file partition is in the size of few 100 gigs then it would be a good to partition these files into separate directory based on your partition column.
Also it would be better idea to have most of the columns in single table than having many normalized small tables. But that varies depending on your data size. Also make sure whenever you copy , move or create data you do all the constraint check on your applications as it will be difficult to make small changes in the table later on, you will need to modify the complete file for even small change.
Hive Partitioning and Bucketing concepts can be used to effectively used to put similar data together (not in nodes, but in files and folders) based on a particular column. Here are some nice tutorials for Partitioning and Bucketing.

Suggested Hadoop File Format for Tabular Data

My application needs to process a couple of TB worth of tabular data. At the moment, the data is saved as several huge comma separated csv files. I can control how the files are being provided to my M/R job and I am wondering what is the preferred file format to make the job to run faster? For instance, is there any point in saving the input data as sequence files instead of the text file that I am using now? Will that make my M/R job to run noticeably faster?
From the perspective of "file format" I don't think using SequeceFile will be a great improvement over text file for csv data. If it was a single (Key,Value) pair in the CSV data, using SequenceFile over textfile would have made sense.
How ever, I am intrigued over use of RCFile (Record Columnar File) which should lend itself well for CSV like data. I have used it with hive tables and achieved some significant improvement in execution time for hive queries. I am assuming that that was due to execution efficiency in M/R since hive queries get translated to M/R programs.
Ref: http://www.ixwebhosting.mobi/2011/10/06/4823.html

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in hdtf which contains structured data. Now the goal is to process only a portion of data in the file like all the lines in the file where second column value is between so and so. Is it possible to launch the MR job such that hdfs only stream the relevant portion of the file versus streaming everything to the mappers.
The reason is that I want to expedite the job speed by only working on the portion that I need. Probably one approach is to run a MR job to get create a new file but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside of the date directory would be you actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
Tho bottom line is that if you want functionality like this you either want to move to a higher level language (hive/pig) or you need to roll your own solutions.
Big part of the processing cost - is data parsing to produce Key-Values to the Mapper. We create there (usually) one java object per value + some container. It is costly both in terms of CPU and garbage collector pressure
I would suggest the solution "in the middle". You can write input format which will read the input stream and skip non-relevant data in the early stage (for example by looking into few first bytes of the string). As a result you will read all data, but actually parse and pass to the Mapper - only portion of it.
Another approach I would consider - is to use RCFile format (or other columnar format), and take care that relevant and non relevant data will sit in the different columns.
If the files that you want to process have some unique attribute about their filename (like extension or partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and _xxx" files/dirs, but you can extend with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. Can sometimes take more to "do it right", esp if you're dealing with a small amount of data & the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.

Resources