Can anyone help me configure Hive to accept file formats like zlib, LZO, LZ4 and Snappy compression? - hadoop

We are working on a POC to figure out which compression technique is best for saving files in a compressed format while still getting good read performance from that format. We have four formats: *.gz, *.zlib, *.snappy and *.lz4.
We found that *.gz and *.zlib have a better compression ratio, but they have performance issues while reading, since these files are not splittable and the number of mappers and reducers is always 1. These formats are accepted by Hive 0.14 by default.
However, we want to test other compression techniques for our text files, such as *.lz4, *.lzo and Snappy.
Can anyone help me with how to configure Hive to read input files compressed as *.lzo, Snappy and *.lz4, and also Avro?
Are these compression techniques already present in Hive 0.14, or do I need to upload the corresponding *.jar files (I'm a .NET guy with no idea about Java) and use a SerDe for serialization and deserialization?
In short: does Hive accept file formats like *.lzo, *.snappy, *.lz4 and Avro by default, or do I need to configure Hive to read them? I'm looking for the best read performance on compressed files; it's OK to compromise on compression ratio, but reads should be fast.
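A minimal HiveQL sketch of this kind of POC setup, assuming the codecs (SnappyCodec, Lz4Codec, or an LZO codec from hadoop-lzo) are already registered in Hadoop's io.compression.codecs; the table name and paths below are placeholders:

-- Hypothetical external table over plain text files; at read time Hive picks
-- the codec from the file extension (.gz, .snappy, .lz4, ...), provided the
-- codec is configured in Hadoop.
CREATE EXTERNAL TABLE poc_logs (log_line STRING)
STORED AS TEXTFILE
LOCATION '/data/poc/compressed_logs';

-- Writing compressed output for comparison: switch the codec per test run.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- or org.apache.hadoop.io.compress.Lz4Codec / com.hadoop.compression.lzo.LzopCodec

INSERT OVERWRITE DIRECTORY '/data/poc/out_snappy'
SELECT * FROM poc_logs;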

Related

Difference between Avro file format and bz2 compression in hive

I know that the following are input and output formats in Hive:
Text File.
Sequence File.
RC File.
AVRO File.
ORC File.
Parquet File.
When do we use bz2 compression, and how is it different from the Hive file formats? When should it be used?
Avro is a file format and BZ2 is a compression codec. These two are completely different things.
You can choose the file format and the compression codec independently. Some file formats use internal compression and limit which codecs can be used. For example, ORC supports the ZLIB and SNAPPY codecs, and you can configure the codec in the table properties like this:
...
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
Or using the Hive configuration:
hive.exec.orc.default.compress=SNAPPY;
Read about ORC here: ORC hive configuration
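For example, a full (hypothetical) table definition could look like this; the table and column names are made up, and ORC handles the SNAPPY compression internally, so the files stay splittable:

-- Hypothetical table; the codec is set through the ORC table properties.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");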
Avro supports the SNAPPY and DEFLATE codecs:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
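With those two settings in place, a hypothetical Avro-backed table could be populated like this (table names are made up; STORED AS AVRO is available from Hive 0.14 on):

-- Target table stored as Avro; with the settings above the data files are
-- written with the Snappy codec.
CREATE TABLE events_avro (
  event_id BIGINT,
  payload  STRING
)
STORED AS AVRO;

-- source_events stands in for any existing table.
INSERT OVERWRITE TABLE events_avro
SELECT event_id, payload FROM source_events;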
With TEXTFILE you can use any codec.
BZ2 is not the fastest codec and can be used when you do not have strict performance requirements. Read about compression on the Cloudera site.
What is important to understand here is that non-splittable compression is not always an issue if you are using a splittable container. For example, a whole file compressed with Snappy is not splittable, but ORC using SNAPPY internally is splittable, because ORC itself is splittable.

uncompress snappy parquet files in Azure Databricks

I have a bunch of Snappy-compressed Parquet files in a folder in Azure Data Lake.
Does anyone have code that I can use to uncompress these Snappy Parquet files to plain Parquet using Azure Databricks?
Thanks
The compression of Parquet files is internal to the format. You cannot simply uncompress them as you would with files that are compressed as a whole. In Parquet, each column chunk (or actually even smaller parts of it) is compressed individually. Thus, to uncompress them, you would need to read them in with spark.read.parquet and write them out as completely new files with different Parquet settings for the write.
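As a rough sketch, the same rewrite can also be done with Spark SQL instead of the spark.read.parquet / DataFrame route described above; the path and table name are placeholders, and the Parquet writer's 'compression' option is what turns compression off:

-- Read the existing Snappy-compressed Parquet folder directly by path and
-- rewrite it as a new, uncompressed Parquet table.
CREATE TABLE events_uncompressed
USING PARQUET
OPTIONS ('compression' = 'none')
AS SELECT * FROM parquet.`/mnt/datalake/events_snappy/`;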
Note that using no compression is actually not useful in most settings. Snappy is such a CPU-efficient codec that the minimal CPU time it costs is far outweighed by the time the size savings buy when the files are transferred to disk or over the network.

how to read parquet schema in non mapreduce java program

Is there a way to directly read Parquet file column names by getting the metadata, without MapReduce? Please give some example. I am using Snappy as the compression codec.
You can either use ParquetFileReader or use the existing parquet-tools utility (https://github.com/Parquet/parquet-mr/tree/master/parquet-tools) to read a Parquet file from the command line.

What's the recommended way of loading data into Hive from compressed files?

I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS S3) are gzip-compressed, I should first load the data with the option STORED AS TEXTFILE and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table defined with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?
You must load your data in its own format. That means if your files are text files you should load them as TEXTFILE, and if your files are sequence files you should load them as SEQUENCEFILE.
For Hive the compression format doesn't matter, because it will decompress the files on the fly using the file extension as a reference (if the compression codec is configured properly in Hadoop).
The suggestion on the page you are sharing is that it is better to work with sequence files than with compressed text files. That is because a gzip file is not splittable, so a very big gzip file has to be processed by a single mapper, which prevents the work from being distributed in parallel among the cluster nodes.
Hive's suggestion is therefore to convert compressed text files into sequence files to avoid that limitation; it is purely about performance.
If your files are small (less than one Hadoop block, 128 MB by default), then it doesn't matter.
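A rough HiveQL sketch of that two-step conversion, with made-up table names, column layout and S3 path; the codec choice is illustrative:

-- Staging table over the gzip text files; Hive decompresses them on the fly
-- based on the .gz extension.
CREATE EXTERNAL TABLE raw_logs_text (log_line STRING)
STORED AS TEXTFILE
LOCATION 's3a://my-bucket/raw-logs/';

-- Target table in a splittable, block-compressed sequence file layout.
CREATE TABLE logs_seq (log_line STRING)
STORED AS SEQUENCEFILE;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE logs_seq
SELECT * FROM raw_logs_text;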

Avro file type for images?

I'm trying to figure out the following case in Hadoop.
What is the best file format, Avro or SequenceFile, for storing images in HDFS and processing them afterwards with Python?
SequenceFiles are key-value oriented, so I think Avro files might work better?
I use SequenceFiles to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, so they can store images efficiently. As the keys in a SequenceFile I usually use the original image file names.
SequenceFiles are used in many image processing products, such as OpenIMAJ. You can use existing tools for working with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.
In addition, you can take a look at HipiImageBundle. This is a special format provided by HIPI (the Hadoop Image Processing Interface). In my experience, HipiImageBundle has better performance than SequenceFile, but it can be used only by HIPI.
If you don't have a large number of files (less than 1M), you can try storing them without packaging them into one big file, and use CombineFileInputFormat to speed up processing.
I have never used Avro to store images and I don't know of any project that uses it.
