Where can I see my data in Hadoop HDFS - hadoop

I set dfs.name.dir and dfs.data.dir on the master and slave nodes to /home/hduser/hadoop/hdfs/name
and /home/hduser/hadoop/hdfs/data.
I copied a file from the local disk to HDFS.
Where can I see that file's data in HDFS?

These configuration parameters determine where in the local filesystem Hadoop stores its image and raw data. When you import file data into HDFS, these values are not involved. In general, data is written into HDFS at the path you specify (when it is absolute), or at a path qualified by your username (by default, I believe, this is /user/your_username) when you use a relative path.
So, if I have a file named example in my (local) home directory and say
local:~ matt> hadoop fs -put example relative/path
I should be able to find it in HDFS at /user/matt/relative/path/example. On the other hand, if I do this
local:~ matt> hadoop fs -put example /absolute/path/in/hdfs
it will be in HDFS at /absolute/path/in/hdfs/example.
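If you want to double-check where the file ended up and look at its contents, a quick verification like the following works (the /user/matt path simply continues the example above):
hadoop fs -ls /user/matt/relative/path
hadoop fs -cat /user/matt/relative/path/example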

Related

Difference between hadoop fs -put and hadoop distcp

We are starting the ingestion phase of our data lake project, and I have mostly used hadoop fs -put throughout my Hadoop developer experience. What is the difference between that and hadoop distcp, and when should each be used?
distcp is a special tool for copying data from one cluster to another. You usually copy from one HDFS to another HDFS, not from the local filesystem. Another very important point is that the copy is done as a MapReduce job with 0 reduce tasks, which makes it faster because the operations are distributed. distcp expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
hdfs put - copies data from the local system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially, talking to the NameNode and DataNodes. It does not create MapReduce jobs to process the data.
hdfs or hadoop put is used for data ingestion from the local filesystem into HDFS.
distcp cannot be used for data ingestion from local to HDFS, as it works only on the HDFS filesystem.
We extensively use distcp for archiving, back-up, and restore of HDFS files, something like this:
hadoop distcp $CURRENT_HDFS_PATH $BACKUP_HDFS_PATH
"distcp cannot be used for data ingestion from Local to HDFS as it works only on HDFS filesystem"
-> it can, use "file" (eg. "file:///tmp/test.txt") as schema in URL (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html)
Hint: use "hadoop distcp -D dfs.replication=1" to decrease the distcp copy time, and replicate the copied files afterwards.
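Putting those two comments together, a local-to-HDFS distcp could look like the sketch below; the namenode address and paths are placeholders, and the local file has to be readable on whichever node runs the map task:
hadoop distcp -D dfs.replication=1 file:///tmp/test.txt hdfs://namenode:8020/user/hduser/ingest/
hadoop fs -setrep -w 3 /user/hduser/ingest/test.txt   # restore the desired replication afterwards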
distcp is the command used for copying data from one cluster's HDFS location to another cluster's HDFS location. It runs a MapReduce job with 0 reducers to do the copy.
hadoop distcp webhdfs://source-ip/directory/filename webhdfs://target-ip/directory/
scp is the command used for copying data from one node's local filesystem to another node's local filesystem.
scp user@source-ip:/directory/filename user@target-ip:/directory/
hdfs put command - copies data from the local filesystem to HDFS. It does not create MapReduce jobs to process the data.
hadoop fs -put -f /path/file /hdfspath/file
hdfs get command - copies data from HDFS to the local filesystem.
First, go to the directory you want the file copied into, then run the command below:
hadoop fs -get /hdfsloc/file

How are files or directories stored in Hadoop HDFS?

I have created a file in HDFS using the command below:
hdfs dfs -touchz /hadoop/dir1/file1.txt
I can see the created file with:
hdfs dfs -ls /hadoop/dir1/
But I could not find the file's location itself using Linux commands (find or locate). I searched on the internet and found the following link:
How to access files in Hadoop HDFS? It says HDFS is virtual storage. In that case, how does it decide which partition to use and how much of it, and where is the metadata stored?
Is it using the DataNode location I specified in hdfs-site.xml as the virtual storage for all the data?
I looked into the DataNode location and there are files there, but I could not find anything related to the file or folder I created.
(I am using hadoop 2.6.0)
HDFS is a distributed storage system: the storage location is virtual and is created using the disk space from all the DataNodes. While installing Hadoop, you must have specified paths for dfs.namenode.name.dir and dfs.datanode.data.dir. These are the locations at which all the HDFS-related files are stored on the individual nodes.
When data is stored in HDFS, it is stored as blocks of a configured size (128 MB by default in Hadoop 2.x). When you use hdfs dfs commands you see the complete files, but internally HDFS stores these files as blocks. If you check the above-mentioned paths on a node's local filesystem, you will see a bunch of files which correspond to the files on your HDFS. But again, you will not see them as the actual files, because they are split into blocks.
Check the output of the command below to get more details on how much space from each DataNode is used to create the virtual HDFS storage.
hdfs dfsadmin -report #Or
sudo -u hdfs hdfs dfsadmin -report
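As a rough illustration of where the blocks actually live, you can search the dfs.datanode.data.dir path on a DataNode; the path below is a placeholder and the exact subdirectories depend on your block pool ID:
find /path/from/dfs.datanode.data.dir -name "blk_*" | head
# e.g. .../current/BP-1234567-127.0.0.1-1450000000000/current/finalized/subdir0/subdir0/blk_1073741825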
HTH
Just as you create a directory and a file in the local filesystem (LFS), for example:
$ mkdir MITHUN90
$ cd MITHUN90
$ nano file1.log
you first create a directory in HDFS, e.g. hdfs dfs -mkdir /mike90 (here "mike90" is the directory name). After creating the directory, send files from the LFS to HDFS with this command:
$ hdfs dfs -copyFromLocal /home/gopalkrishna/file1.log /mike90
Here '/home/gopalkrishna/file1.log' refers to the file in the present working directory and '/mike90' refers to the directory in HDFS. Running
$ hdfs dfs -ls /mike90
shows the list of files in it.

Hadoop distcp with file list

I would like to use distcp to copy a list of files (> 1K files) into HDFS. I have already stored the list of files in a local directory; can I use -f to copy all of them? If yes, what format do I have to maintain in the file list? Or is there a better way?
You don't have to use distcp if your use-case is copying data from the local filesystem (say, Linux) to HDFS. You can simply use the hdfs dfs -put command for that. Here is the syntax:
hdfs dfs -put /path/to/local/dir/* /path/on/hdfs/
e.g.
hdfs dfs -mkdir /user/hduser/destination-dir/
hdfs dfs -put /home/abc/mydir/* /user/hduser/destination-dir/
You have created a file containing a list of file paths, but that is not needed at all here. The -f option is mainly used with distcp when you are copying data from one cluster to another.
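For completeness, if you do use distcp between clusters, the -f option takes a source-list file with one URI per line; a hedged sketch with placeholder hosts and paths:
# srclist (stored in HDFS or as file:///...) contains one source URI per line:
#   hdfs://nn1:8020/data/a
#   hdfs://nn1:8020/data/b
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination/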

In Hadoop, is there any way to get the underlying filesystem file name for an HDFS block?

I understand that HDFS stores its files as blocks on DataNodes, and each block is actually stored as a file in the local filesystem of each DataNode.
So I would like to know if there is a way to get the actual filename in the local filesystem for an HDFS block, given the HDFS filename.
Thanks.
You can use Hadoop's fsck command on the file you have in mind. It returns the hosts and block names; it does not, however, give the full path to the block file on the local filesystem.
$ hadoop fsck /path/to/file -files -blocks -locations
Another option would be through the HDFS WebUI. If you browse to each file it will list the block names and hosts.
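Once fsck gives you a block ID, one way to find the corresponding file is to search the DataNode's data directory for it; the data directory path and block ID below are placeholders:
# run on a DataNode that fsck listed for the block
find /path/from/dfs.datanode.data.dir -name "blk_1073741826*"
# matches both the block file and its .meta checksum file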

Reading files from hdfs vs local directory

I am a beginner in Hadoop and I have two doubts:
1) How do I access files stored in HDFS? Is it the same as using a FileReader from java.io and giving the local path, or is it something else?
2) I have created a folder where I copied the file to be stored in HDFS and the jar file of the MapReduce program. When I run the command below in any directory
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current directory. So does that mean all the files got added without me explicitly adding them?
Yes, it's pretty much the same. Read this post to read files from HDFS.
You should keep in mind that HDFS is different from your local filesystem. With hadoop dfs you access HDFS, not the local filesystem. So hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory in HDFS, not the local one. That's why the output is the same no matter where you run it from.
If you want to "upload" / "download" files to/from HDFS, you should use the commands:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.
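To see the distinction in practice, here is a small illustrative session; the username and paths are placeholders:
ls /tmp                                  # lists the local /tmp directory
hadoop dfs -ls                           # lists your HDFS home directory, e.g. /user/<your_username>
hadoop dfs -cat /path/in/HDFS/file.txt   # prints an HDFS file's contents to stdout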
