How to decompress the gz files in hadoop - hadoop

Wanted to know if there is any hadoop command to decompress the gz file
sitting on HDFS and display the content to stdout.

Just use text command
hdfs dfs -text file.gz
Hadoop knows how to detect gzip files and uncompresses it for you

You can do it easily by:
hdfs dfs -cat /path/to/file.gz | zcat

Related

How to unzip a split zip file in Hadoop

I have a split zip file (created by winzip in window) , then ftp to hadoop server.
Somehow i can't unzip it through something like below command
The files like below
file.z01,file.zo2,file.zo3....file.zip
Then i run below command
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
Then Error comes up
cat: Unable to write to output stream
What i expect is that unzip those split files to Hadoop particular folder
Unclear how Links.txt.gz is related to your .zip part files...
Hadoop doesn't really understand ZIP format (especially split ones), and gzip -d wouldn't work on .zip files anyway.
Zip nor gzip are splittable in Hadoop processing (read "able to be computed in parallel"), so since WinZip supports BZ2 format, I suggest you switch to that, and I don't see a need to create split files in Windows unless it's to upload the file faster...
Sidenote: hadoop fs -cat /input | <anything> | hadoop fs -put - /output is not splitting "in Hadoop"... You are copying the raw text of the file to your local buffer, then doing an operation locally, then optionally streaming it back to HDFS.

Read File directly from HDFS

Is there a way to read any file format from HDFS directly by using the HDFS path, instead of having to pull the file locally from HDFS and read it.
You can use cat command on HDFS to read regular text files.
hdfs dfs -cat /path/to/file.csv
To read compressed files like gz, bz2 etc, you can use:
hdfs dfs -text /path/to/file.gz
These are the two read methods that Hadoop supports natively using FsShell comamnds. For other complex file types, you will have to use a more complex way, like, a Java program or something along those lines.
hdfs dfs -cat /path or hadoop fs -cat /path
You have to pull the entire file. Whether you use cat or text commands, the entire file is still being streamed to your shell. There's just no remnant of the file when the command ends. So, if you plan on inspecting the file a few times, it's better to get it
As an hdfs client, you must contact the namenode to acquire all block locations for a particular file.
You can try with hdfs dfs -cat
Usage: hdfs dfs -cat [-ignoreCrc] URI [URI ...]
hdfs dfs -cat /your/path

How to decompress .Snappy files in Hadoop HDFS?

I have some snappy compressed Snappy files in a directory in HDFS. I need to decompress each file and load into a Text file. Any Hadoop DFS commands are available? I am new here. Kindly help.
Thanks,
Praveen.
One way you can achieve it is via -text hadoop command
hadoop fs -text /hdfs_path/hdfs_file.snappy > some_unix_file.txt
hadoop fs -put some_unix_file.txt /hdfs_path

how to write customizesd output file format in mapreduce

Please suggest to me how to update the output fileformat (part-r-00000)(default file format) to another file format like csv or txt file formatsin map reduce programs.
You could do this:
hdfs dfs -cat /path/in/hdfs/part* |hdfs dfs -put - /chosen/path/in/hdfs/name_of_file.txt
OR
hdfs dfs -cat /path/in/hdfs/part* |hdfs dfs -put - chosen/path/in/hdfs/name_of_file.csv
Another method is -getmerge which copies to local but then you need to -copyFromLocal back to hdfs but it serves the purpose of changing your file format:
hdfs dfs -getmerge /path/in/hdfs/part* /path/in/local/file_name.format
hdfs dfs -copyFromLocal /path/in/local/file_name.format /path/in/hdfs/archive/
one way is you can copy the part-r-00000 file to xyz.txt file by using put command of hadoop.
like hdfs dfs -put part-r-00000 to xyz.txt

How to read a .deflate file in hadoop

I got some pig generated files with part-r-00000.deflate extension. I know this is a compressed file. How do I generate a normal file in a readable format. When I used hadoop fs -text, I cannot get plaintext output. The output is still binary. How can I fix this problem?
You might be using a quite old Hadoop version (e.g: 0.20.0) in which fs -text can't inflate the compressed file.
As a workaround you may try this one-liner (based on this answer):
hadoop fs -text file.deflate | perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
you can decompress on the fly by using this command
hdfs dfs -text file.deflate | hdfs dfs -put - uncompressed_destination_file

Resources