Extract HDFS folder or file details - hadoop

I have created a Hive external table so that I can find the number of files present in an HDFS directory at any point in time. Can anyone please help me extract the file details of the directories present in HDFS? INPUT__FILE__NAME and hdfs dfs -stat are not serving my purpose, and I want the full -ls output in a CSV file.

Working with the output of ls is generally not recommended, because it is not designed to be parsed. That said, this is not the normal ls, so there may be no alternative.
You can put its output in a file like so:
hadoop fs -ls /path > output.txt
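
Since the question asks for the listing as CSV and for a file count, the redirect above could be extended along these lines. This is only a sketch: the function names ls_to_csv and count_files are made up for illustration, and the awk script assumes the usual eight-column layout of hadoop fs -ls (permissions, replication, owner, group, size, date, time, path). Runs of spaces inside a path are collapsed to a single space.

```shell
# Convert `hadoop fs -ls` output on stdin into CSV on stdout.
# Assumes the usual 8-column layout:
# permissions, replication, owner, group, size, date, time, path.
ls_to_csv() {
  awk '
    BEGIN { print "permissions,replication,owner,group,size,date,time,path" }
    NF >= 8 {                       # skips the "Found N items" header line
      path = $8
      for (i = 9; i <= NF; i++)     # re-join a path that contains spaces
        path = path " " $i
      gsub(/"/, "\"\"", path)       # escape embedded double quotes for CSV
      printf "%s,%s,%s,%s,%s,%s,%s,\"%s\"\n", $1, $2, $3, $4, $5, $6, $7, path
    }'
}

# Print the number of files under an HDFS directory (recursive).
# `hdfs dfs -count` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
count_files() {
  hdfs dfs -count "$1" | awk '{ print $2 }'
}

# Usage (hypothetical paths):
#   hadoop fs -ls /path | ls_to_csv > listing.csv
#   count_files /path
```

For a plain file count, hdfs dfs -count avoids parsing ls output altogether, which is why it is used in count_files instead of counting lines of the listing.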

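You can also use hdfs to find a table across all databases.
The Hive warehouse path here is (it varies by installation and is set by hive.metastore.warehouse.dir):
/apps/hive/warehouse/
So, using hdfs, with the pattern quoted so that the local shell does not expand it:
hdfs dfs -find /apps/hive/warehouse/ -name 't*'
or
hadoop fs -ls /path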

Related

Read File directly from HDFS

Is there a way to read any file format from HDFS directly by using the HDFS path, instead of having to pull the file locally from HDFS and read it?
You can use the cat command on HDFS to read regular text files.
hdfs dfs -cat /path/to/file.csv
To read compressed files such as gz or bz2, you can use:
hdfs dfs -text /path/to/file.gz
These are the two read methods that Hadoop supports natively through FsShell commands. For other, more complex file types, you will need a more involved approach, such as a Java program.
hdfs dfs -cat /path or hadoop fs -cat /path
You have to pull the entire file. Whether you use the cat or text command, the entire file is still streamed to your shell; there is just no remnant of the file when the command ends. So, if you plan on inspecting the file a few times, it's better to get it
As an HDFS client, you must contact the NameNode to acquire all block locations for a particular file.
You can try with hdfs dfs -cat
Usage: hdfs dfs -cat [-ignoreCrc] URI [URI ...]
hdfs dfs -cat /your/path
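
If you only want to look at the beginning of a file, the cat output can be piped into head. This is a minimal sketch: hdfs_peek is a hypothetical helper name, and the paths are placeholders. The stream normally ends once head has read enough lines, so a large file is not read to the end.

```shell
# Print the first N lines (default 10) of an HDFS file without keeping a local copy.
# For compressed files, swap -cat for -text as described above.
hdfs_peek() {
  hdfs dfs -cat "$1" | head -n "${2:-10}"
}

# Usage (hypothetical path):
#   hdfs_peek /path/to/file.csv 20
```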

hdfs dfs -put with overwrite?

I am using
hdfs dfs -put myfile mypath
and for some files I get
put: 'myfile': File Exists
Does that mean there is a file with the same name, or that the exact same file (size, content) is already there?
How can I specify an overwrite option here?
Thanks!
put: 'myfile': File Exists
This means that a file named "myfile" already exists at that location in HDFS. You cannot have two files with the same name in the same directory.
You can overwrite it using hadoop fs -put -f /path_to_local /path_to_hdfs
You can overwrite your file in HDFS using the -f option. For example:
hadoop fs -put -f <localfile> <hdfsDir>
OR
hadoop fs -copyFromLocal -f <localfile> <hdfsDir>
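It worked fine for me. However, when this was written the -f option did not work with the get or copyToLocal commands; newer releases document a -f option for -get as well, so check hadoop fs -help get on your version.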
A file with the same name exists at the location you're trying to write to.
You can overwrite by specifying the -f flag.
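An update to this answer: in Hadoop 3.x the -f flag works the same way, and the destination can also be given as a full URI:
hdfs dfs -put -f /local/to/path hdfs://localhost:9870/users/XXX/folder/folder2
Make sure the port in the URI is the NameNode RPC port from fs.defaultFS; in Hadoop 3, 9870 is by default the NameNode web UI port.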
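
The "File exists" check is by name only, not by size or content. A small wrapper can make an overwrite visible before it happens. This is a sketch under assumptions: put_overwrite is a hypothetical name, and the second argument is taken to be the exact destination file path rather than a directory.

```shell
# Upload a local file to an exact HDFS destination path, replacing any existing file.
# Logs to stderr when a file of that name is already present.
put_overwrite() {
  if hdfs dfs -test -e "$2"; then
    echo "overwriting existing $2" >&2
  fi
  hdfs dfs -put -f "$1" "$2"
}

# Usage (hypothetical paths):
#   put_overwrite myfile /user/dev/mypath/myfile
```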

How do I remove a file from HDFS

I am learning Hadoop and I have never worked on Unix before, so I am facing a problem here. What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
Now I am going to put a ready-made file named file.txt in HDFS:
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in HDFS, since it shows up when I run the -ls command.
Now I want to remove this file from HDFS. How should I do this? What command should I use?
If you run the command hadoop fs -usage, you'll get a look at what commands the file system shell supports, and with hadoop fs -help you'll get a more in-depth description of them.
For removing files, the command is simply -rm, with -r added to remove folders recursively (-f only suppresses the error when the path does not exist). In your case:
hadoop fs -rm /user/user_name/abcd/file.txt
Read the command descriptions and try them out.
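
If a script does not know in advance whether a path is a file or a folder, it can ask HDFS first with -test -d. This is a rough sketch; hdfs_remove is a hypothetical helper name.

```shell
# Remove an HDFS path: folders recursively with -rm -r, files with plain -rm.
hdfs_remove() {
  if hadoop fs -test -d "$1"; then
    hadoop fs -rm -r "$1"
  else
    hadoop fs -rm "$1"
  fi
}

# Usage:
#   hdfs_remove /user/user_name/abcd/file.txt
#   hdfs_remove /user/user_name/abcd
```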

Automatically see my folders in HDFS based on user

I'd like to know how to do this in Hadoop. Say I'm logged in as 'dev'. When I issue the command hadoop fs -ls /, I want to automatically see the files that belong to me, like so:
# hadoop fs -ls /
/user/dev/mydata1
/user/dev/mydata2
The reason is that we have existing shell scripts that should not have to specify which subfolders in Hadoop to get the data from. They only have to call /mydata1, and Hadoop should know that it belongs to /user/dev/.
Thanks in advance
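Use a relative path. With no path argument, or with ., hadoop fs -ls lists your HDFS home directory (by default /user/<username>) rather than the root:
hadoop fs -ls .
or
hadoop fs -ls
In the same way, a script that passes mydata1 instead of /mydata1 refers to /user/dev/mydata1 when run as 'dev'.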

hadoop dfs -ls complains

Can anyone tell me what is wrong here? The hadoop dfs command itself seems to be OK, but the options that follow are not recognized.
[hadoop-0.20]$bin/hadoop dfs -ls ~/wordcount/input/
ls: Cannot access /home/cloudera/wordcount/input/ : No such file or directory
hadoop fs -ls /some/path/here lists an HDFS location, not a location on your local Linux file system. In your command, the shell expands ~ to /home/cloudera before Hadoop sees it, and that path does not exist in HDFS.
Try this command first:
hadoop fs -ls /
Then investigate the other folders step by step.
If you want to copy some files from a local directory to a users directory in HDFS, just use this:
hadoop fs -mkdir /users
hadoop fs -put /some/local/file /users
For more HDFS commands, see: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
fs is the generic file system shell, which can point to any file system Hadoop supports, such as local, HDFS, or S3. dfs is specific to HDFS. So with fs, an operation can go from or to the local file system or the Hadoop distributed file system, whereas a dfs operation always relates to HDFS.
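
To illustrate the generic nature of fs, a local directory can be listed through the Hadoop shell by giving an explicit file:// URI. This is a sketch only; ls_local is a hypothetical helper name and it expects an absolute local path.

```shell
# List a directory on the local file system through the Hadoop shell.
# "file://" plus an absolute path yields a URI such as file:///home/cloudera/...
ls_local() {
  hadoop fs -ls "file://$1"
}

# Usage:
#   ls_local /home/cloudera/wordcount/input    # local Linux directory
#   hadoop fs -ls /                            # root of the default file system (normally HDFS)
```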
