Hadoop's gzip different from tar -zvcf?

The question is exactly this:
I create a text file and compress it with tar -cxzf. The file's name is part-r-0000.gz.
I put the file on HDFS with hadoop fs -put source dest; the Hadoop version is 0.20.2-cdh.
I try to view the file with hadoop fs -text part-r-0000.gz and find that it shows garbled output.
I wonder if there are different versions of gz compression?

The HDFS browser does not support reading gzip files; it will show up as garbled text in the browser. There is no problem with gzip itself. Your command, though, is definitely fishy: c is for create, x is for extract.
Hadoop supports file compression. Here is a link that explains it well: Compression

The tar command is wrong:
-c Create, -x Extract
Sample:
tar -czf ... -> create a tgz file
tar -xzf ... -> extract a tgz file
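As a minimal sketch (the data.txt filename is just an illustration): compressing with gzip directly produces a bare .gz that hadoop fs -text can display cleanly, whereas tar -czf wraps the data in a tar stream whose headers show up as binary when the archive is decompressed and printed.
gzip data.txt                         # produces data.txt.gz, a bare gzip file
hadoop fs -put data.txt.gz /tmp/data.txt.gz
hadoop fs -text /tmp/data.txt.gz      # prints the original plain text
# a tar -czf archive, by contrast, is gzip wrapped around a tar stream,
# so fs -text prints tar headers mixed in with the file contents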

Related

How to unzip a split zip file in Hadoop

I have a split zip file (created by WinZip on Windows), which I then FTP'd to the Hadoop server.
Somehow I can't unzip it with something like the command below.
The files look like this:
file.z01, file.z02, file.z03 ... file.zip
Then I run this command:
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
Then this error comes up:
cat: Unable to write to output stream
What I expect is to unzip those split files into a particular folder on Hadoop.
It's unclear how Links.txt.gz is related to your .zip part files...
Hadoop doesn't really understand the ZIP format (especially split archives), and gzip -d wouldn't work on .zip files anyway.
Neither ZIP nor gzip is splittable for Hadoop processing (read: "able to be computed in parallel"), so since WinZip supports the BZ2 format, I suggest you switch to that. I also don't see a need to create split files in Windows unless it's to upload the file faster...
Sidenote: hadoop fs -cat /input | <anything> | hadoop fs -put - /output is not running "in Hadoop"... You are copying the raw bytes of the file to your local machine, doing an operation locally, then streaming the result back to HDFS.
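If you do need the contents of that split archive on HDFS, one sketch (assuming Info-ZIP's zip/unzip are installed where the parts live, and using hypothetical paths) is to merge the parts into a single archive, extract locally, then upload the extracted files; note the unzip still happens on your machine, not in Hadoop:
zip -s 0 file.zip --out file-joined.zip   # merge file.z01, file.z02, ... file.zip into one archive
unzip file-joined.zip -d extracted/
hadoop fs -put extracted/* /tmp/unzipped/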

Decompress .deflate files as text in HDFS and copy result to local

After running a Sqoop job I got files with a .deflate extension (compression is enabled by default). I know that I can show the file content using the following command:
hadoop fs -text <file>
How can I copy this result to my local folder?
Just redirect the output to a local file:
hadoop fs -text hdfs_path > local_file.txt
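Since Sqoop normally writes several part files into an output directory, you can also glob over them and concatenate everything into one local file (the directory and file-name pattern below are hypothetical):
hadoop fs -text /user/me/sqoop_output/part-m-* > local_file.txt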

How to decompress the gz files in hadoop

I wanted to know if there is any Hadoop command to decompress a gz file
sitting on HDFS and display its contents to stdout.
Just use the text command:
hdfs dfs -text file.gz
Hadoop knows how to detect gzip files and decompresses them for you.
You can do it easily by:
hdfs dfs -cat /path/to/file.gz | zcat
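If you want to keep an uncompressed copy on HDFS rather than just print the contents, a small sketch (hypothetical paths) is to pipe the decompressed stream back through put; note the decompression happens on the client, not in the cluster:
hdfs dfs -cat /data/file.gz | gzip -d | hdfs dfs -put - /data/file.txt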

Transferring many small files into Hadoop file system

I want to transfer a large number of small files (e.g. 200k files) packed in a zip file into HDFS from the local machine. When I unzip the zip file and transfer the files into HDFS, it takes a long time. Is there any way I can transfer the original zip file into HDFS and unzip it there?
If your file is several GB, this command will certainly help you avoid out-of-space errors, as there is no need to unzip the file on the local filesystem.
The put command in Hadoop supports reading input from stdin. To read the input from stdin, use '-' as the source file.
Compressed filename: compressed.tar.gz
gunzip -c compressed.tar.gz | hadoop fs -put - /user/files/uncompressed_data
The only drawback of this approach is that in HDFS the data will be merged into a single file even though the local compressed archive contains more than one file.
http://bigdatanoob.blogspot.in/2011/07/copy-and-uncompress-file-to-hdfs.html
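If you need the individual files preserved on HDFS rather than one merged stream, a sketch (hypothetical paths) is to extract the archive locally and upload the whole directory with a single put, at the cost of temporary local disk space:
mkdir extracted
tar -xzf compressed.tar.gz -C extracted
hadoop fs -put extracted /user/files/extracted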

How to read a .deflate file in hadoop

I got some Pig-generated files with a part-r-00000.deflate extension. I know this is a compressed file. How do I generate a normal file in a readable format? When I use hadoop fs -text, I cannot get plain-text output; the output is still binary. How can I fix this problem?
You might be using quite an old Hadoop version (e.g. 0.20.0) in which fs -text can't inflate the compressed file.
As a workaround you may try this one-liner (based on this answer):
hadoop fs -text file.deflate | perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
You can decompress it on the fly by using this command:
hdfs dfs -text file.deflate | hdfs dfs -put - uncompressed_destination_file
