Read File directly from HDFS - hadoop

Is there a way to read a file of any format from HDFS directly, using just its HDFS path, instead of having to pull the file locally from HDFS first and then read it?

You can use the cat command on HDFS to read regular text files.
hdfs dfs -cat /path/to/file.csv
To read compressed files such as gz, bz2, etc., you can use:
hdfs dfs -text /path/to/file.gz
These are the two read methods that Hadoop supports natively through its FsShell commands. For other, more complex file formats you will have to use something more involved, such as a Java program.
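Since both commands stream to stdout, you can also pipe them into standard Unix tools for a quick look without keeping any local copy. A minimal sketch, reusing the example paths above:
# peek at the first 20 lines of a CSV stored in HDFS
hdfs dfs -cat /path/to/file.csv | head -n 20
# same idea for a gzipped file, letting -text handle the decompression
hdfs dfs -text /path/to/file.gz | head -n 20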

hdfs dfs -cat /path or hadoop fs -cat /path

You have to pull the entire file either way. Whether you use the cat or text command, the whole file is still streamed to your shell; there's just no local copy of it left when the command ends. So, if you plan on inspecting the file a few times, it's better to get it locally first.
As an HDFS client, you must contact the NameNode to acquire all of the block locations for a particular file.
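If you do expect to look at the same file several times, pulling one local copy avoids re-streaming it from the datanodes on every read. A small sketch; the local destination under /tmp is just an assumption:
# fetch the file once, then inspect the local copy as often as needed
hdfs dfs -get /path/to/file.csv /tmp/file.csv
less /tmp/file.csv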

You can try with hdfs dfs -cat
Usage: hdfs dfs -cat [-ignoreCrc] URI [URI ...]
hdfs dfs -cat /your/path

Related

Append hdfs file to local file?

I can use the hdfs dfs -appendToFile <localFile> ... <hdfsFile> command to append local files to a file on HDFS, as mentioned in HDFS Command Line Append.
Are there any similar commands that allow me to append files in the opposite direction, that is, to append HDFS files to a given local file?
For example, something like:
# append files to local
hdfs dfs -appendToLocal <hdfsFile> <localFile>
I found that hdfs dfs -getmerge solves my problem.
hdfs dfs -getmerge -nl <hdfsFile1> <hdfsFile2> ... <hdfsFileN> <localFile>
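If the goal is to append HDFS content onto a local file that already exists (rather than write a fresh merged copy, which is what -getmerge does), piping -cat through a shell append redirect is another option. A small sketch using the placeholders from above:
# stream the HDFS files to stdout and append them to the local file
hdfs dfs -cat <hdfsFile1> <hdfsFile2> >> <localFile>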

Extract HDFS folder or file details

To find the number of files present in an HDFS directory at any point in time using Hive, I have created a Hive external table. Can anyone please help me extract the file details of the directories present in HDFS? Neither INPUT__FILE__NAME nor hdfs dfs -stat serves my purpose, and I want the whole -ls listing in a CSV file.
Working with the output of ls is generally not recommended; it is not made for this. That being said, this is not the normal ls, so perhaps there is no alternative.
You can put its output in a file like so:
hadoop fs -ls /path > output.txt
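If you really need the listing as a CSV, you can reshape the columns with standard tools. A rough sketch, assuming the usual hadoop fs -ls column layout (permissions, replication, owner, group, size, date, time, path) and skipping the "Found N items" header line:
# turn the listing into owner,size,date,time,path
hadoop fs -ls /path | tail -n +2 | awk '{print $3","$5","$6","$7","$8}' > output.csv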
You can also use HDFS directly to find a table across all databases.
The path of the Hive databases is:
/apps/hive/warehouse/
So, using hdfs:
hdfs dfs -find /apps/hive/warehouse/ -name 't*'
or
hadoop fs -ls /path
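If all you need is the number of files under a directory, hdfs dfs -count reports it directly, with no listing to parse. A quick sketch reusing the warehouse path from above:
# prints: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
hdfs dfs -count /apps/hive/warehouse/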

How to write a customized output file format in MapReduce

Please suggest how I can change the default output file (part-r-00000) to another file format, such as a csv or txt file, in MapReduce programs.
You could do this:
hdfs dfs -cat /path/in/hdfs/part* | hdfs dfs -put - /chosen/path/in/hdfs/name_of_file.txt
OR
hdfs dfs -cat /path/in/hdfs/part* | hdfs dfs -put - /chosen/path/in/hdfs/name_of_file.csv
Another method is -getmerge, which copies to the local filesystem; you then need to -copyFromLocal back to HDFS, but it serves the purpose of changing your file format:
hdfs dfs -getmerge /path/in/hdfs/part* /path/in/local/file_name.format
hdfs dfs -copyFromLocal /path/in/local/file_name.format /path/in/hdfs/archive/
One way is to copy the part-r-00000 file to an xyz.txt file by using Hadoop's put command, like:
hdfs dfs -put part-r-00000 xyz.txt
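If the part file is already on HDFS and you only want it to carry a different name or extension, renaming or copying it inside HDFS avoids the round trip through the local filesystem. A small sketch, with the directory layout borrowed from the examples above:
# rename the reducer output in place on HDFS
hdfs dfs -mv /path/in/hdfs/part-r-00000 /path/in/hdfs/name_of_file.csv
# or keep the original and make a renamed copy
hdfs dfs -cp /path/in/hdfs/part-r-00000 /path/in/hdfs/name_of_file.csv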

How to decompress the gz files in hadoop

Wanted to know if there is any hadoop command to decompress the gz file
sitting on HDFS and display the content to stdout.
Just use the text command:
hdfs dfs -text file.gz
Hadoop knows how to detect gzip files and uncompresses them for you.
You can do it easily by:
hdfs dfs -cat /path/to/file.gz | zcat
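If you want the decompressed content stored back in HDFS rather than only displayed, you can pipe the stream straight back in with -put. A sketch; the .txt target name is an assumption:
# decompress a gz file on HDFS and write the plain-text result back to HDFS
hdfs dfs -cat /path/to/file.gz | zcat | hdfs dfs -put - /path/to/file.txt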

How can I concatenate two files in hadoop into one using Hadoop FS shell?

I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again, but I still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them, but had no luck with getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error relates to you trying to redirect the standard output of the command back into HDFS. There are ways you can do this, using the hadoop fs -put command with the source argument being a hyphen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS
Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS. You can do that either with:
a custom map reduce job with a single reducer and a custom mapper and reducer that retain the file ordering (remember that each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself), or
via the FsShell commands, depending on your network topology - i.e. does your client console have a good-speed connection to the datanodes? This is certainly the least effort on your part, and will probably complete quicker than an MR job doing the same (as everything has to go to one machine anyway, so why not your local console?).
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on HDFS and you want to concatenate the files in each of those folders, you can use a shell script to do this. (Note: this is not very efficient and can be slow.)
Syntax:
for i in `hadoop fs -ls <folder> | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/<outputfilename>; done
eg:
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/output.csv; done
Explanation:
So you basically loop over all the folders and cat each folder's contents into an output file on HDFS.
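The cut -d' ' -f19 part depends on the exact column spacing of the ls output, which differs between Hadoop versions. A slightly more robust sketch, assuming only that the path is the last field of each listing line and that the first line is the "Found N items" header:
# loop over sub-folders and concatenate each folder's files into one output file on HDFS
for dir in $(hadoop fs -ls my-job-folder | awk 'NR>1 {print $NF}'); do
  hadoop fs -cat $dir/* | hadoop fs -put - $dir/output.csv
done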
