Difference between Avro file format and bz2 compression in hive - hadoop

I know that the following are input and output formats in Hive:
Text File.
Sequence File.
RC File.
AVRO File.
ORC File.
Parquet File.
When do we use BZ2 compression, how is it different from the Hive file formats, and when should it be used?

Avro is a file format and BZ2 is a compression codec. These two are completely different things.
You can choose the file format and the compression codec independently. Some file formats use internal compression and limit which codecs can be used. For example, ORC supports the ZLIB and SNAPPY codecs, and you can configure the codec in the table properties like this:
...
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
Or using the Hive configuration:
hive.exec.orc.default.compress=SNAPPY;
Read about ORC here: ORC hive configuration
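For illustration, a complete ORC table definition with Snappy might look like this (the table and column names are just examples):
CREATE TABLE events_orc (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");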
Avro supports SNAPPY and DEFLATE codecs.
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
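For example, a write into an Avro-backed table would then pick up those settings (Hive 0.14+ syntax; the table names are illustrative):
CREATE TABLE events_avro STORED AS AVRO AS SELECT * FROM events_text;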
With TEXTFILE you can use any codec.
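For instance, a minimal sketch of producing BZ2-compressed text output from Hive (the table names are made up):
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
INSERT OVERWRITE TABLE events_text_bz2 SELECT * FROM events_text;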
BZ2 is not the fastest codec and can be used when you do not have strict performance requirements. Read about compression on the Cloudera site.
What is important to understand here is that non-splittable compression is not always an issue if you are using a splittable container. For example, a whole file compressed with Snappy is not splittable, but an ORC file using SNAPPY internally is splittable because ORC itself is splittable.

Related

uncompress snappy parquet files in Azure Databricks

I have a bunch of Snappy-compressed Parquet files in a folder in Azure Data Lake.
Does anyone have code that I can use to uncompress Snappy Parquet files to plain Parquet using Azure Databricks?
Thanks
The compression of Parquet files is internal to the format. You cannot simply uncompress them as with usual files that are compressed as a whole. In Parquet, each column chunk (or actually even smaller parts of it) is compressed individually. Thus, to uncompress them, you would need to read them in with spark.read.parquet and write them out as completely new files with different Parquet settings for the write.
Note that using no compression is actually not useful in most settings. Snappy is so CPU-efficient that the small amount of CPU time it uses is easily outweighed by the time saved when the smaller files are transferred to disk or over the network.
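As a rough sketch, the same read-and-rewrite can be expressed in Spark SQL on Databricks; the table name and paths below are placeholders, and 'none' turns off compression for the rewritten copy:
CREATE TABLE events_uncompressed
USING PARQUET
OPTIONS ('compression' = 'none')
LOCATION '/mnt/datalake/events_uncompressed'
AS SELECT * FROM parquet.`/mnt/datalake/events_snappy`;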

How to confirm my files are Snappy compressed in Hive?

So I compressed my table in Hive using Snappy compression and it did get compressed; the size was reduced.
But when I run hadoop fs -lsr /hive/user.db/table_name, I see no files with a .snappy extension.
I want to know whether they really were Snappy compressed or not.

Is ORC File with Snappy Compression Splittable at Stripe?

Is an ORC file with Snappy compression splittable at stripes?
As far as I know, a Snappy-compressed file is not splittable.
But I have read in a blog that a Snappy-compressed ORC file is splittable at stripes.
Is that true?
You would have to create your own InputFormat class; I don't believe OrcInputFormat or OrcNewInputFormat supports splitting at the stripe level.

How to read a Parquet schema in a non-MapReduce Java program

Is there a way to directly read Parquet file column names from the metadata, without MapReduce? Please give an example. I am using Snappy as the compression codec.
You can either use ParquetFileReader or use the existing parquet-tools utility (https://github.com/Parquet/parquet-mr/tree/master/parquet-tools) to read Parquet files from the command line, for example with its schema subcommand: parquet-tools schema <file>.

Can anyone help me configure Hive to accept compression formats like zlib, LZO, LZ4 and Snappy?

We are working on a POC to figure out which compression technique is better for saving files in compressed format while still getting good read performance from them. We have four formats: *.gz, *.zlib, *.snappy and *.lz4.
We figured out that *.gz and *.zlib have a better compression ratio, but they have performance issues when read compressed, since these files are not splittable and the number of mappers and reducers is always 1. These formats are accepted by Hive 0.14 by default.
But we want to test other compression techniques for our text files, like *.lz4, *.lzo and Snappy.
Can anyone help me with how to configure Hive to read input files compressed with *.lzo, Snappy and *.lz4, and also Avro?
Are these compression techniques present in Hive 0.14, or do I need to upload the corresponding *.jar files (I'm a .NET guy with no idea about Java) and use a SerDe for serialization and deserialization?
Can anyone tell me whether Hive accepts formats like *.lzo, *.snappy, *.lz4 and Avro by default for reading these compressed files, or whether I need to configure Hive to read them? I'm looking for the best performance when reading compressed files. It's OK to compromise on compression ratio, but reading should perform better.
