Uncompress snappy parquet files in Azure Databricks

I have a bunch of snappy parquet files in a folder in Azure Data Lake.
Does anyone have code that I can use to uncompress snappy parquet files to plain parquet using Azure Databricks?
Thanks

The compression of Parquet files is internal to the format. You cannot simply uncompress them as you would with ordinary files that are compressed as a whole. In Parquet, each column chunk (or actually even smaller parts of it) is compressed individually. Thus, to uncompress, you would need to read the data in with spark.read.parquet and write it out as completely new files with different Parquet settings for the write.
Note that using no compression is actually not useful in most settings. Snappy is such a CPU-efficient codec that the minimal CPU time it uses is far outweighed by the time the size savings gain when writing the files to disk or transferring them over the network.

Related

Best method to save intermediate tables in pyspark

This is my first question on Stack Overflow.
I am replicating a SAS codebase in Pyspark. The SAS codebase produces and stores scores of intermediate SAS datasets (100 when I last counted) which are used to cross check the final output and also for other analyses at a later point in time.
My purpose is to save numerous Pyspark dataframes in some format so that they can be re-used in a separate Pyspark session. I have thought of 2 options:
Save dataframes as hive tables.
Save them as parquet files.
Are there any other formats? Which method is faster? Will parquet files or csv files have schema related issues while re-reading the files as Pyspark dataframes?
The best option is to use parquet files, as they have the following advantages:
~3x compression saves space
Columnar format, so predicate pushdowns are faster
Optimized by Spark's Catalyst optimizer
Schema persists, as parquet contains schema-related info
The only issue is to make sure you are not generating multiple small files. The default parquet block size is 128 MB, so make sure your files are sufficiently large. You can repartition the data to make sure the file size is large enough.
Use Delta Lake: it gives you the advantages of parquet plus easy updates, changeable schema, change tracking, and data versioning, so you can iterate over data changes.
Parquet is the default for PySpark and works well, so you can just store the data as parquet files / a Hive table. Before pushing to HDFS/Hive you can repartition the files if the source produces many small files. If it's huge data, try partitioning the Hive table by a suitable column.

Input Format to save image files (jpeg,png) in HDFS

I want to save image files (like jpeg, png etc.) on HDFS (Hadoop File System). I tried two ways:
Saved the image files as they are (i.e. in the same format) into HDFS using the put command. The full command was: hadoop fs -put /home/a.jpeg /user/hadoop/. It was successfully placed.
Converted these image files into Hadoop's Sequence File format & then saved them in HDFS using the put command.
I want to know which format should be used to save in HDFS.
And what are the pros of using the Sequence File format? One advantage I know of is that it is splittable. Are there any others?
Images are very small compared to the block size of HDFS storage. The problem with small files is their impact on processing performance, which is why you should use Sequence Files, HAR, HBase, or a merging solution. See these two threads for more info:
effective way to store image files
How many files is too many on a modern HDP cluster?
Processing a 1Mb file has an overhead to it. So processing 128 1Mb files will cost you 128 times more "administrative" overhead, versus processing 1 128Mb file. In plain text, that 1Mb file may contain 1000 records. The 128Mb file might contain 128000 records.

What's the recommended way of loading data into Hive from compressed files?

I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS s3) are compressed gzip files, I should first load the data with the option STORED AS TextFile and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table set with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?
You must load your data in its existing format. This means that if your files are text files, you should load them as TextFile, and if your files are sequence files, load them as SEQUENCEFILE.
For Hive the compression format doesn't matter, because it will decompress them on the fly using the extension of the file as a reference (if the compression codec was configured properly in Hadoop).
The suggestion in the page that you are sharing is that it's better to work with sequence files than compressed text files. That is because a gzip file is not splittable: if you have a very big gzip file, the whole file has to be processed by only one mapper, which prevents the work from being distributed in parallel among the cluster nodes.
Hive's suggestion is therefore to convert compressed text files into sequence files to avoid that limitation. It is only about performance.
If your files are small (less than 1 Hadoop block size, 128 MB by default), then it doesn't matter.
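The non-splittable point can be demonstrated with Python's standard library: a gzip stream can only be decompressed from the start, so a mapper handed the second half of a .gz file has nothing valid to work with:

```python
import gzip
import zlib

data = b"some,text,record\n" * 10000
blob = gzip.compress(data)

# Decompressing from the start works fine.
assert gzip.decompress(blob) == data

# Decompressing from an arbitrary mid-file offset fails: there is no
# gzip header there, which is why one .gz file = one mapper.
try:
    gzip.decompress(blob[len(blob) // 2:])
    splittable = True
except (OSError, zlib.error):
    splittable = False
print(splittable)  # False
```

Block-compressed sequence files (and splittable codecs like bzip2) avoid this by restarting compression at known boundaries, so splits can begin mid-file.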

Can anyone help me with configuring Hive to accept file formats like zlib, LZO, LZ4 and snappy compression?

We are working on a POC to figure out which compression technique is better for saving files in compressed format while still getting good read performance. We have 4 formats: *.gz, *.zlib, *.snappy & *.lz4.
We figured out that *.gz and *.zlib have better compression ratios, but they have performance issues while reading compressed data, since these files are not splittable and the number of mappers and reducers is always 1. These formats are accepted by Hive 0.14 by default.
But we want to test other compression techniques for our text files, like *.lz4, *.lzo and snappy.
Can anyone help me with how to configure Hive to read input files compressed with *.lzo, snappy and *.lz4, and also Avro?
Are these compression techniques present in Hive 0.14, or do I need to upload the *.jar files (I'm a .NET guy, no idea about Java) and use a SerDe for serialization and deserialization?
Can anyone tell me whether Hive by default accepts file formats like *.lzo, *.snappy, *.lz4 and Avro, or whether I need to configure Hive to read them? I'm looking for the best performance while reading compressed files. It's OK to compromise on compression ratio, as long as reading performs better.
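Snappy, LZO and LZ4 bindings are not in Python's standard library, but the measurement behind such a POC is simple; a sketch using the stdlib codecs (zlib, bz2, lzma) as stand-ins shows the ratio-vs-speed trade-off you would record for any codec:

```python
import bz2
import lzma
import time
import zlib

# Repetitive text resembling a log file, the kind of data being compressed.
data = b"2017-01-01\tGET\t/index.html\t200\t1234\n" * 50000

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # round-trip must be lossless
    results[name] = (len(data) / len(packed), t1 - t0, t2 - t1)
    print(f"{name:5s} ratio={results[name][0]:6.1f}x "
          f"compress={results[name][1]:.3f}s decompress={results[name][2]:.3f}s")
```

Snappy and LZ4 would sit at the opposite end of this trade-off from lzma: lower ratios, much faster (de)compression, which is exactly why they are preferred when read performance matters more than size.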

Avro file type for images?

I am trying to figure out this case in Hadoop.
What is the best file format, Avro or SequenceFile, for storing images in HDFS and processing them afterwards with Python?
SequenceFiles are key-value oriented, so I think Avro files will work better?
I use SequenceFile to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, hence they can store images efficiently. As keys in the SequenceFile I usually use the original image file names.
SequenceFiles are used in many image-processing products, such as OpenIMAJ. You can use existing tools for working with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.
In addition, you can take a look at HipiImageBundle. This is a special format provided by HIPI (Hadoop Image Processing Interface). In my experience, HipiImageBundle has better performance than the SequenceFile, but it can be used only by HIPI.
If you don't have a large number of files (less than 1M), you can try to store them without packaging them into one big file, and use CombineFileInputFormat to speed up processing.
I have never used Avro to store images and I don't know of any project that uses it.
