How to confirm my files are snappy compressed in Hive? - hadoop

So I compressed my table in Hive using Snappy compression and it did get compressed; the size was reduced.
But when I run hadoop fs -lsr /hive/user.db/table_name, I see no files with a .snappy extension.
How can I tell whether they were really Snappy compressed or not?

Related

Difference between Avro file format and bz2 compression in hive

I know that the following are input and output formats in Hive:
Text File.
Sequence File.
RC File.
AVRO File.
ORC File.
Parquet File.
When do we use BZ2 compression, how does it differ from the Hive file formats, and when should it be used?
Avro is a file format and BZ2 is a compression codec. These two are completely different things.
You can choose the file format and the compression codec independently. Some file formats use internal compression and limit which codecs can be used. For example, ORC supports the ZLIB and SNAPPY codecs, and you can configure the codec in the table properties like this:
...
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
Or using hive configuration:
hive.exec.orc.default.compress=SNAPPY;
Read about ORC here: ORC hive configuration
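For example, a minimal sketch of a Snappy-compressed ORC table (the table and column names here are made up for illustration):
-- Hypothetical table; ORC compresses its stripes internally with the codec named in TBLPROPERTIES.
CREATE TABLE events_orc (event_id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
-- Rows written into the table end up in Snappy-compressed ORC files (no .snappy extension on disk).
INSERT INTO TABLE events_orc SELECT event_id, payload FROM events_raw;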
Avro supports SNAPPY and DEFLATE codecs.
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
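With those two settings in place, writing Snappy-compressed Avro is just a normal insert; a minimal sketch with made-up table names:
-- Hypothetical tables; STORED AS AVRO requires Hive 0.14 or later.
CREATE TABLE events_avro (event_id BIGINT, payload STRING) STORED AS AVRO;
-- Data blocks inside the resulting Avro files are Snappy-compressed.
INSERT INTO TABLE events_avro SELECT event_id, payload FROM events_raw;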
With textfile you can use any codec.
BZ2 is not the fastest codec and can be used in cases where you do not have strict performance requirements. Read about compression on the Cloudera site.
What is important to understand here is that non-splittable compression is not always an issue if you are using a splittable container. For example, a whole file compressed with Snappy is not splittable, but an ORC file using SNAPPY internally is splittable, because ORC itself is splittable.
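To illustrate the textfile case, here is a sketch (made-up table names; exact property names can differ between Hadoop versions) of writing BZip2-compressed text output, which stays splittable because BZip2 itself is a splittable codec:
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
-- Hypothetical tables; the result files get a .bz2 extension and remain splittable.
CREATE TABLE events_txt (event_id BIGINT, payload STRING) STORED AS TEXTFILE;
INSERT OVERWRITE TABLE events_txt SELECT event_id, payload FROM events_raw;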

uncompress snappy parquet files in Azure Databricks

I have a bunch of Snappy-compressed Parquet files in a folder in Azure Data Lake.
Does anyone have code I can use to uncompress the Snappy Parquet files to plain Parquet using Azure Databricks?
Thanks
The compression of Parquet files is internal to the format. You cannot simply uncompress them as you would files that were compressed as a whole. In Parquet, each column chunk (actually even smaller parts of it) is compressed individually. To uncompress the data, you would therefore need to read it in with spark.read.parquet and write it out as completely new files with different Parquet settings for the write.
Note that using no compression is actually not useful in most settings. Snappy is so CPU-efficient that its small CPU cost is far outweighed by the time saved moving smaller files to disk or over the network.
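A Spark SQL sketch of the rewrite described above (paths and table name are hypothetical; the same can be done with spark.read.parquet / df.write in Python or Scala):
-- Reads the Snappy Parquet files and rewrites them as new, uncompressed Parquet files.
CREATE TABLE weblogs_uncompressed
USING PARQUET
OPTIONS ('compression' = 'none')
LOCATION '/mnt/datalake/weblogs_uncompressed'
AS SELECT * FROM parquet.`/mnt/datalake/weblogs_snappy`;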

Is ORC File with Snappy Compression Splittable at Stripe?

Is ORC File with Snappy Compression Splittable at stripes?
As far as I know, a Snappy-compressed file is not splittable.
But I have read in a blog that a Snappy-compressed ORC file is splittable at the stripe level.
Is that true?
You would have to create your own InputFormat class, I don't believe OrcInputFormat or OrcNewInputFormat support splitting at the stripe level.

How do I configure Hive to accept compression formats like zlib, LZO, LZ4 and Snappy?

We are working on a POC to figure out which compression technique is best for storing files in compressed form while still getting good read performance from them. We have four formats: *.gz, *.zlib, *.snappy and *.lz4.
We figured out that *.gz and *.zlib have a better compression ratio, but they have performance issues when read, since these files are not splittable and the number of mappers and reducers is always 1. These formats are accepted by Hive 0.14 by default.
But we want to test other compression techniques for our text files, such as *.lz4, *.lzo and Snappy.
Can anyone help me configure Hive to read input files compressed with *.lzo, Snappy and *.lz4, and also Avro?
Are these compression techniques already present in Hive 0.14, or do I need to upload the *.jar files (I'm a .NET guy with no Java background) and use a SerDe for serialization and deserialization?
Can anyone tell me whether Hive accepts files like *.lzo, *.snappy, *.lz4 and Avro by default, or whether I need to configure Hive to read these formats? I'm looking for the best read performance on compressed files; it's OK to compromise on compression ratio, but reading should be fast.
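As a sketch of the session-level settings involved (property names are from Hadoop 2.x / Hive 0.14, so verify them against your distribution; LZO ships separately and needs the hadoop-lzo codec jar):
-- Compress query output with a chosen codec; swap the codec class to compare formats.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Codecs shipped with Hadoop: SnappyCodec, Lz4Codec, BZip2Codec, GzipCodec, DefaultCodec (zlib).
-- LZO needs com.hadoop.compression.lzo.LzopCodec from the separate hadoop-lzo project.
-- For reading, TextInputFormat picks the codec from the file extension (.gz, .bz2, .lz4, .snappy, .lzo),
-- provided the codec is registered in io.compression.codecs.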

HDFS file compression internally

I am looking for default compression in HDFS. I saw this, but I don't want my files to have gzip-like extensions (in fact, they should be accessible as if they weren't compressed). What I am actually looking for is exactly like the "Compress contents to save disk space" option on Windows, which compresses files internally but lets them be accessed just like ordinary files. Any ideas will be helpful.
Thanks
This doesn't exist in standard HDFS implementations; you have to manage the compression yourself. However, MapR, a proprietary Hadoop distribution, does do this, if solving this problem is important enough for you.
After using Hadoop for a little while this doesn't really bother me anymore. Pig, MapReduce and the like handle the compression transparently enough for me. I know that's not a real answer, but I couldn't tell from your question whether you are simply annoyed or have a real problem this is causing. Getting used to adding | gunzip to everything didn't take long. For example:
hadoop fs -cat /my/file.gz | gunzip
cat file.txt | gzip | hadoop fs -put - /my/file.txt.gz
When you're using compressed files you need to think about making them splittable, i.e. whether Hadoop can split the file when running a MapReduce job (if the file is not splittable it will only be read by a single map).
The usual way around this is to use a container format, e.g. a sequence file, ORC file, etc., where you can enable compression. If you are using simple text files (CSV, etc.) there's an LZO project by Twitter, but I haven't used it personally.
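A minimal sketch of the container approach (made-up table names), using a SequenceFile with block compression:
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
-- Hypothetical tables; SequenceFile is a splittable container compressed block by block.
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;
INSERT INTO TABLE logs_seq SELECT line FROM logs_text;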
The standard way to store compressed files in HDFS is to pass a compression argument while writing the file into HDFS. This is available in the mapper libraries, Sqoop, Flume, Hive, the HBase catalog and so on. I am quoting some examples from Hadoop here. You don't need to worry about compressing the file locally for efficiency; it's best to let the HDFS file format options do this work. This type of compression integrates smoothly with Hadoop mapper processing.
Job written through the Mapper library
When creating the writer in your mapper program, here is the definition. You write your own mapper and reducer and pass the codec as an argument to the writer method to have the file written into HDFS compressed.
createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, org.apache.hadoop.io.SequenceFile.CompressionType compressionType, CompressionCodec codec)
Sqoop Import
The option below sends a default compression argument for the file import into HDFS:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compress
With Sqoop you can also specify a particular codec with the --compression-codec option:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Hive import
In the example below, you can set your desired compression option before loading the file into a Hive table. This again is a property you set in the session while loading from your local file.
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY; -- this is the default actually
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS PARQUET;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log' INTO TABLE raw;
I have not mentioned all the methods of compressing data while importing it into HDFS.
The HDFS CLI (e.g. hdfs dfs -copyFromLocal) doesn't provide any direct way to compress. This is my understanding from working with the Hadoop CLI.
