What's the recommended way of loading data into Hive from compressed files? - hadoop

I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS s3) are compressed gzip files, I should first load the data with the option STORED AS TextFile and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table set with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?

You must load your data in its actual format. If your files are text files, load them as TEXTFILE; if they are sequence files, load them as SEQUENCEFILE.
For Hive, the compression format doesn't matter, because it decompresses files on the fly, using the file extension to pick the codec (provided the compression codec is configured properly in Hadoop).
The page you link suggests that it is better to work with sequence files than with compressed text files. That is because a gzip file is not splittable: a very large gzip file has to be processed by a single mapper, so the work cannot be distributed in parallel across the cluster nodes.
Hive's suggestion is therefore to convert compressed text files into sequence files to avoid that limitation. It is only about performance.
If your files are small (less than one Hadoop block, 128 MB by default), it doesn't matter.
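To make the splittability point concrete, here is a rough sketch of how many map tasks a single input file could yield. It is a simplified model, not Hadoop's actual split calculation: the function name `estimate_map_tasks` and the one-split-per-block rule are assumptions for illustration, and the 128 MB block size is the default mentioned above.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size cited above


def estimate_map_tasks(file_size, splittable, block_size=BLOCK_SIZE):
    """Rough number of map tasks for one non-empty file (simplified model)."""
    if not splittable:
        # A gzip text file must be read from start to finish by one mapper.
        return 1
    # A splittable file gets roughly one split per block.
    return math.ceil(file_size / block_size)


ten_gb = 10 * 1024 ** 3
print(estimate_map_tasks(ten_gb, splittable=False))  # gzip text file
print(estimate_map_tasks(ten_gb, splittable=True))   # e.g. a sequence file
print(estimate_map_tasks(50 * 1024 ** 2, splittable=True))  # below one block
```

Under this model a file smaller than one block yields a single task either way, which is why the distinction only matters for large files.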

Related

uncompress snappy parquet files in Azure Databricks

I have a bunch of Snappy-compressed Parquet files in a folder in Azure Data Lake.
Does anyone have code that I can use to uncompress Snappy Parquet files to plain Parquet using Azure Databricks?
Thanks
The compression of Parquet files is internal to the format. You cannot simply uncompress them as you would ordinary files that are compressed as a whole. In Parquet, each column chunk (or actually even smaller parts of it) is compressed individually. To uncompress, you therefore need to read the files in with spark.read.parquet and write them out as completely new files with different Parquet settings for the write.
Note that using no compression is not useful in most settings. Snappy is so CPU-efficient that the small amount of CPU time it uses is negligible compared with the time saved by moving smaller files to disk or over the network.
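A minimal sketch of that read-and-rewrite step, assuming a Databricks notebook where a SparkSession named `spark` already exists. The helper name `rewrite_parquet` and the paths in the usage comment are hypothetical; `compression` is the Parquet writer option in Spark, and the value `"none"` writes the data uncompressed.

```python
def rewrite_parquet(spark, src_path, dst_path, codec="none"):
    """Read Parquet files and write them back as new files with another codec."""
    df = spark.read.parquet(src_path)
    # The codec is chosen at write time; "none" disables compression.
    df.write.option("compression", codec).parquet(dst_path)


# Hypothetical usage in a notebook (both paths are placeholders):
# rewrite_parquet(spark, "/mnt/lake/snappy_in", "/mnt/lake/plain_out")
```

The destination should be a new folder, since the write produces a completely new set of files rather than modifying the originals.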

How do I stream parquet using pyarrow?

I'm trying to read in a large dataset of parquet files piece by piece, do some operation and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset and I'm aware of RecordBatchStreamReader but I'm not sure how to combine them.
How can I use Pyarrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.

Input Format to save image files (jpeg,png) in HDFS

I want to save image files (JPEG, PNG, etc.) in HDFS (Hadoop Distributed File System). I tried two ways:
1. Saved the image files as they are (i.e., in the same format) into HDFS using the put command. The full command was: hadoop fs -put /home/a.jpeg /user/hadoop/. The file was placed successfully.
2. Converted the image files into Hadoop's SequenceFile format and then saved them in HDFS using the put command.
I want to know which format should be used to save in HDFS.
And what are the advantages of the SequenceFile format? One advantage I know of is that it is splittable. Are there any others?
Images are very small compared to the HDFS block size. The problem with small files is their impact on processing performance, which is why you should use sequence files, HAR, HBase, or a merging solution. See these two threads for more info:
effective way to store image files
How many files is too many on a modern HDP cluster?
Processing a 1Mb file has an overhead to it. So processing 128 1Mb
files will cost you 128 times more "administrative" overhead, versus
processing 1 128Mb file. In plain text, that 1Mb file may contain 1000
records. The 128 Mb file might contain 128000 records.
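
The arithmetic in that quote can be sketched with a toy cost model. The per-file overhead and throughput figures below are made-up assumptions chosen only to show the shape of the effect, not measurements from any cluster, and `total_cost_seconds` is a hypothetical helper.

```python
def total_cost_seconds(num_files, file_size_mb,
                       overhead_per_file_s=5.0, throughput_mb_s=50.0):
    """Toy model: fixed overhead per file plus time to read the bytes."""
    overhead = num_files * overhead_per_file_s
    read_time = num_files * file_size_mb / throughput_mb_s
    return overhead + read_time


many_small = total_cost_seconds(num_files=128, file_size_mb=1)
one_large = total_cost_seconds(num_files=1, file_size_mb=128)
print(f"128 x 1 MB files: {many_small:.2f} s")
print(f"1 x 128 MB file:  {one_large:.2f} s")
```

The bytes read are the same in both cases; only the fixed per-file cost is multiplied by 128.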

What is the best place to store multiple small files in hadoop

I will have many small text files of around 10 KB each, and I'm unsure whether to store them in HBase or in HDFS. Which is the better-optimized storage?
To store them in HBase, I need to parse each file first and then save it against some row key.
In HDFS, I can directly create a path and save the file at that location.
But everything I have read so far says you should not have many small files and should instead create fewer, bigger files.
However, I cannot merge those files, so I can't create a big file out of the small ones.
Kindly suggest.
A large number of small files doesn't fit well with Hadoop, since each file occupies at least one HDFS block and, by default, each block requires one mapper to process it.
There are several strategies to minimize the impact of small files; all of them require processing the small files at least once to "package" them in a better format. If you plan to read these files several times, pre-processing them could make sense, but if you will use them just once, it doesn't matter.
To process small files, my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat uses one mapper to process several files, but it may need to transfer files to the DataNode where the map task is running in order to put them together. It can also perform badly with speculative tasks, but you can disable those if your cluster is stable enough.
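The grouping idea behind a combining input format can be sketched as greedy packing of files into splits up to a size limit. This is a simplification for illustration only: the function `combine_into_splits` and the file names are hypothetical, and the real CombineTextInputFormat also takes node and rack locality into account.

```python
def combine_into_splits(file_sizes, max_split_size):
    """Greedily pack (name, size) pairs into splits of at most max_split_size.

    A file larger than the limit ends up in a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current split if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# 1000 hypothetical 10 KB files packed into splits of up to 128 MB.
files = [(f"part-{i:04d}.txt", 10 * 1024) for i in range(1000)]
splits = combine_into_splits(files, max_split_size=128 * 1024 * 1024)
print(len(files), "files ->", len(splits), "map task(s)")
```

Without combining, the same input would by default need one mapper per file.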
Alternatives for repackaging small files are:
1. Create sequence files where each record contains one of the small files. With this option you keep the original files intact.
2. Use IdentityMapper and IdentityReducer, with fewer reducers than files. This is the easiest approach, but it requires every line in the files to be uniform and independent (no headers or metadata at the beginning of a file that are needed to understand the rest of it).
3. Create an external table in Hive and then insert all of its records into a new table (INSERT INTO . . . SELECT FROM . . .). This approach has the same limitations as option 2 and requires Hive; the advantage is that you don't need to write a MapReduce job.
If you cannot merge the files as in options 2 or 3, my suggestion is to go with option 1.
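Storing each small file as one record amounts to a key/value pair with the file name as key and the file contents as value. The sketch below illustrates that idea with a simple length-prefixed layout. It is not the Hadoop SequenceFile binary format, and the functions `pack_records` and `unpack_records` are hypothetical; in practice you would write real sequence files with the Hadoop API.

```python
import io
import struct


def pack_records(records, out):
    """Write (name, data) pairs as length-prefixed key/value records."""
    for name, data in records:
        key = name.encode("utf-8")
        # Two big-endian 4-byte lengths, then the key and value bytes.
        out.write(struct.pack(">II", len(key), len(data)))
        out.write(key)
        out.write(data)


def unpack_records(inp):
    """Yield (name, data) pairs back from a packed stream."""
    while True:
        header = inp.read(8)
        if len(header) < 8:
            return  # end of stream
        key_len, value_len = struct.unpack(">II", header)
        name = inp.read(key_len).decode("utf-8")
        yield name, inp.read(value_len)


small_files = [("a.txt", b"first file\n"), ("b.txt", b"second file\n")]
buffer = io.BytesIO()
pack_records(small_files, buffer)
buffer.seek(0)
restored = list(unpack_records(buffer))
print(restored == small_files)
```

Because each record keeps its name and exact bytes, the original files can be recovered one by one, which is what makes this approach usable when the files cannot be merged line by line.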
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
Having many small, different files is not a problem in itself. If, for example, you have a Hive table backed by many very small files in HDFS, that is not optimal: it is better to merge these files into fewer, bigger ones, because reading the table will otherwise create a lot of mappers. If your files are completely different, like 'apples' and 'employees', and cannot be merged, then just store them as they are.

Avro file type for images?

I'm trying to figure out this case in Hadoop.
Which file format is best, Avro or SequenceFile, for storing images in HDFS and processing them afterwards with Python?
SequenceFiles are key-value oriented, so I think Avro files will work better?
I use SequenceFile to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, so they can store images efficiently. As keys in the SequenceFile I usually use the original image file names.
SequenceFiles are used in many image processing products, such as OpenIMAJ. You can use existing tools for working with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.
In addition, you can take a look at HipiImageBundle. This is a special format provided by HIPI (Hadoop Image Processing Interface). In my experience, HipiImageBundle performs better than SequenceFile, but it can be used only with HIPI.
If you don't have a large number of files (fewer than 1M), you can try storing them without packaging them into one big file and use CombineFileInputFormat to speed up processing.
I have never used Avro to store images, and I don't know of any project that does.
