How to unzip .gz files in a new directory in Hadoop?

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?

I can think of three ways to do this.
Using the Linux command line
The following command worked for me:
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz
The output gets stored in /tmp/unzipped/Links.txt
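To cover every .gz file in a folder rather than a single file, the same pipeline can be wrapped in a loop. The following is only a minimal sketch: the function names unzipped_path and gunzip_hdfs_dir are made up for illustration, and it assumes that file names contain no whitespace and that the path is the last column of the hadoop fs -ls output.

```shell
#!/usr/bin/env bash

# Map an HDFS .gz path to its target path in another folder,
# e.g. /tmp/Links.txt.gz + /tmp/unzipped -> /tmp/unzipped/Links.txt
unzipped_path() {
  local src="$1" target_dir="$2"
  local name
  name=$(basename "$src" .gz)
  printf '%s/%s\n' "${target_dir%/}" "$name"
}

# Decompress every .gz file in an HDFS folder into a target folder.
# Hypothetical helper, not a Hadoop command; assumes paths without whitespace.
gunzip_hdfs_dir() {
  local src_dir="$1" target_dir="$2"
  local f
  hadoop fs -mkdir -p "$target_dir"
  # The glob is quoted so that Hadoop, not the local shell, expands it.
  # Only regular files (permission string starting with "-") are kept.
  for f in $(hadoop fs -ls "${src_dir%/}/*.gz" | awk '$1 ~ /^-/ {print $NF}'); do
    hadoop fs -cat "$f" | gzip -d | hadoop fs -put - "$(unzipped_path "$f" "$target_dir")"
  done
}
```

With these functions loaded, a call such as gunzip_hdfs_dir /tmp /tmp/unzipped would process each .gz file in /tmp in turn. Every file is streamed through the machine running the script, so this is best suited to a modest number of files.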
Using a Java program
The book Hadoop: The Definitive Guide has a section on codecs. That section contains a program that decompresses a file using CompressionCodecFactory. I am reproducing that code here:
package com.myorg.hadooptests;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        // Pick the codec from the file extension (.gz -> gzip)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }
        // The output path is the input path minus the codec's extension
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the path of the .gz file as input.
You can execute it as:
FileDecompressor <gzipped file name>
For example, when I ran it on my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at: /tmp/Links.txt
The program stores the unzipped file in the same folder as the input, so you need to modify the code to take two input parameters: <input file path> and <output folder>.
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
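Such a driver script might look roughly like the sketch below. It is not part of the book's example: the helper names gz_paths_from_listing and decompress_all are invented for illustration, and it assumes that the compiled class is on HADOOP_CLASSPATH so that it can be started with hadoop followed by the fully qualified class name, and that paths contain no whitespace.

```shell
#!/usr/bin/env bash

# Read `hadoop fs -ls` output on stdin and print the paths of regular
# files ending in .gz. Directories and the "Found N items" header are skipped.
gz_paths_from_listing() {
  awk '$1 ~ /^-/ && $NF ~ /\.gz$/ {print $NF}'
}

# Run the decompressor once per .gz file in an HDFS folder.
# Hypothetical helper; $1 is the HDFS folder to scan.
decompress_all() {
  local f
  for f in $(hadoop fs -ls "$1" | gz_paths_from_listing); do
    hadoop com.myorg.hadooptests.FileDecompressor "$f"
  done
}
```

With the program as written, each output file lands next to its input. Once the program accepts an output folder as a second parameter, that folder would be passed as an extra argument inside the loop.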
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
Store A into '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder, /tmp/tmp_unzipped, in a file named part-m-00000.
Hence, we need to explicitly rename that file and finally delete the /tmp/tmp_unzipped folder, using the following commands:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
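One way to do that parameterization is Pig's parameter substitution, where values are passed with -param name=value on the command line and referenced as $name inside the script. The sketch below rests on several assumptions: the script file name unzip.pig, the parameter names input, tmpdir and output, and both helper functions are hypothetical, and unzip.pig is assumed to be the script above with its literal paths replaced by $input, $tmpdir and $output.

```shell
#!/usr/bin/env bash

# Print the -param arguments for one gzipped file, one token per line.
# $1 = HDFS path of the .gz file, $2 = HDFS target folder
pig_params() {
  local input="$1" target_dir="${2%/}"
  local name
  name=$(basename "$input" .gz)
  printf '%s\n' \
    -param "input=$input" \
    -param "tmpdir=$target_dir/tmp_$name" \
    -param "output=$target_dir/$name"
}

# Run the (hypothetical) parameterized script for one file.
unzip_with_pig() {
  # Word splitting on the helper's output is intentional;
  # it assumes paths without whitespace.
  pig $(pig_params "$1" "$2") -f unzip.pig
}
```

A per-file temporary folder is used here because a Pig STORE fails when its output folder already exists, which would otherwise happen on the second file of a loop.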

Bash solution
In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure that all files inside the zip archives get extracted onto HDFS.
I have created a simple bash script. The comments should give you a clue about what is going on, and there is a short description below.
#!/bin/bash
# workdir must point to an empty local scratch folder
cd "$workdir" || exit 1
# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
for hdfsfile in $zips
do
    echo $hdfsfile
    # copy to temp folder to unpack
    hdfs dfs -copyToLocal $hdfsfile $workdir
    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")
    # unpack locally and remove
    unzip $zipname
    rm -rf $zipname
    # copy files back to hdfs
    files=$(ls $workdir)
    for file in $files; do
        hdfs dfs -copyFromLocal $file $hdfsdir
        rm -rf $file
    done
    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done
Get all the *.zip files in an HDFS dir
One by one: copy each zip to a temp dir (on the local filesystem) and unzip it there
Copy all the extracted files to the HDFS dir of the zip file, then clean up the temp dir
I also got it working with a sub-directory structure that has many zip files in each sub-directory, using /mypath/*/*.zip.
Good luck :)

If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).
hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a

You can do this using Hive (assuming it is text data).
create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;
The data will be uncompressed into a new set of files.
If you do not want to change the file names and you have enough storage on the node where you are running the command, you can do this instead:
hadoop fs -get <your_source_directory> <directory_name>
This creates a directory in the location where you run the hadoop command. cd into it and gunzip all the files, then go back up and move the directory to HDFS:
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
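The steps above might be scripted roughly as follows. This is a sketch under assumptions: the function names gunzip_tree and gunzip_via_local are invented for illustration, find is used so that nested folders are handled as well, and the local node needs enough free disk space for the uncompressed data.

```shell
#!/usr/bin/env bash

# Decompress every .gz file below a local directory in place,
# keeping each file's name minus the .gz suffix.
gunzip_tree() {
  find "$1" -type f -name '*.gz' -exec gunzip {} +
}

# Download an HDFS directory, gunzip it locally, and move the result
# to a target HDFS path. Stops at the first step that fails.
# $1 = HDFS source directory, $2 = local directory name, $3 = target HDFS path
gunzip_via_local() {
  hadoop fs -get "$1" "$2" &&
    gunzip_tree "$2" &&
    hadoop fs -moveFromLocal "$2" "$3"
}
```

Because gunzip only strips the .gz suffix, the file names and folder layout of the source directory are preserved in the target.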

Here is the Scala code:
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession

val conf = new org.apache.hadoop.conf.Configuration()

def extractFile(sparkSession: SparkSession, compath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compath)
  // Pick the codec from the file extension (.gz -> gzip)
  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No codec found for $compath")
  }
  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  // Return the output path (assumed; the function is declared to return a String)
  uncompPath
}

Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.


