Reading files from hdfs vs local directory - hadoop

I am a beginner in Hadoop and have two questions.
1) How do I access files stored in HDFS? Is it the same as using a FileReader from java.io and giving it the local path, or is it something else?
2) I created a folder into which I copied the file to be stored in HDFS and the jar file of the MapReduce program. When I run the following command from any directory
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current dir. Does that mean all the files got added without me explicitly adding them?

Yes, it's pretty much the same idea, except that you open the file through Hadoop's FileSystem API rather than java.io.FileReader. Read this post to read files from HDFS.
You should keep in mind that HDFS is separate from your local file system. With hadoop dfs you access HDFS, not the local file system. So hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory, not the local one. That's why the output is the same no matter where you run it from.
If you want to "upload" / "download" files to/from HDFS you should use the commands:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.
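The round trip described above can be sketched as a small shell helper. This is only a sketch: it assumes the hadoop command is on the PATH, the function name hdfs_round_trip and the example paths are made up for illustration, and it uses hadoop fs (the form other answers here use) instead of hadoop dfs.

```shell
#!/bin/sh
# Sketch: upload a local file, confirm it is listed in HDFS, and download it again.
# Assumes `hadoop` is on PATH; hdfs_round_trip is a hypothetical helper name.
hdfs_round_trip() (
    local_src=$1      # file on the local file system
    hdfs_dir=$2       # existing directory in HDFS
    local_copy=$3     # local path for the downloaded copy

    name=$(basename "$local_src")
    hadoop fs -copyFromLocal "$local_src" "$hdfs_dir/" || exit 1
    # The listing comes from HDFS, not from the local working directory.
    hadoop fs -ls "$hdfs_dir" || exit 1
    hadoop fs -copyToLocal "$hdfs_dir/$name" "$local_copy"
)

# Example (placeholder paths):
# hdfs_round_trip /local/path/input.txt /path/in/HDFS /tmp/input-copy.txt
```

Nothing runs until the function is called, so the file can be sourced safely on a machine without Hadoop.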

Related

Hadoop copying directory with contents

I am new to Hadoop and am doing a project for university. I have a folder called 'docs' that I have several text files in. When I look at it locally, I can see the various text files are there. When I copy it to Hadoop, the directory is empty.
The screenshot below shows the files in the local directory.
I use copyFromLocal to copy the directory to HDFS. As far as I can tell it should be copying the contents too?
hadoop fs -copyFromLocal ./docs
This screenshot shows the directory as empty (or is it?)
All directories (lines starting with d) show as having 0 size with the HDFS ls command. If you do hadoop fs -ls docs then you'll see all of the files and their sizes.
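To make the difference visible, here is a rough filter that keeps only the regular files from a listing and prints their sizes. It assumes the usual 8-column hadoop fs -ls output (permissions, replication, owner, group, size, date, time, path) and paths without spaces; list_files_only is a made-up name.

```shell
#!/bin/sh
# Sketch: print "<size> <path>" for regular files in `hadoop fs -ls` output.
# Directories (permission string starting with "d") and the "Found N items"
# header are skipped. Assumes the 8-column listing and no spaces in paths.
list_files_only() {
    awk 'NF >= 8 && $1 ~ /^-/ { print $5, $8 }'
}

# Example: hadoop fs -ls docs | list_files_only
# For a recursive listing: hadoop fs -ls -R docs | list_files_only
```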

How files or directories are getting stored in hadoop hdfs

I have created a file in HDFS using the command below
hdfs dfs -touchz /hadoop/dir1/file1.txt
I can see the created file using the command below
hdfs dfs -ls /hadoop/dir1/
But I could not find the file's location using Linux commands (find or locate). I searched the internet and found the following link:
How to access files in Hadoop HDFS? It says that HDFS is virtual storage. In that case, how does it decide which partition to use and how much of it, and where is the metadata stored?
Does it use the DataNode location that I specified in hdfs-site.xml as the virtual storage for all the data?
I looked into the DataNode location and there are files there, but I could not find anything related to the file or folder I created.
(I am using hadoop 2.6.0)
HDFS is a distributed storage system whose storage is virtual: it is pooled from the disk space of all the DataNodes. While installing Hadoop, you must have specified paths for dfs.namenode.name.dir and dfs.datanode.data.dir. These are the locations on the individual nodes' local file systems where all HDFS-related files are stored: the NameNode keeps the file system metadata under dfs.namenode.name.dir, and each DataNode keeps block data under dfs.datanode.data.dir.
When data is stored in HDFS, it is stored as blocks of a specified size (128 MB by default in Hadoop 2.x). When you use hdfs dfs commands you see complete files, but internally HDFS stores them as blocks. If you check the above-mentioned paths on your local file system, you will see a bunch of files which correspond to the files on your HDFS. But again, you will not see them under their HDFS names, because they are split into blocks.
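As a quick illustration of the block arithmetic, here is a small sketch that computes how many blocks a file of a given size would occupy. It assumes the 128 MB default unless another block size is passed; hdfs_block_count is a made-up helper. Note that a zero-length file, such as one created with hdfs dfs -touchz, occupies no blocks at all, which may be why nothing matching it shows up in the DataNode directory.

```shell
#!/bin/sh
# Sketch: number of HDFS blocks needed for a file of a given size in bytes.
# Assumes the Hadoop 2.x default block size of 128 MB unless a second
# argument overrides it. hdfs_block_count is a hypothetical helper name.
hdfs_block_count() {
    size_bytes=$1
    block_bytes=${2:-134217728}   # 128 * 1024 * 1024
    # Ceiling division; a zero-length file needs zero blocks.
    echo $(( (size_bytes + block_bytes - 1) / block_bytes ))
}

hdfs_block_count 0            # prints 0
hdfs_block_count 134217728    # prints 1
hdfs_block_count 300000000    # prints 3
```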
Check the output of the command below for more details on how much space from each DataNode is used to create the virtual HDFS storage.
hdfs dfsadmin -report #Or
sudo -u hdfs hdfs dfsadmin -report
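To go from one HDFS file to its blocks, hdfs fsck can list the block IDs and the DataNodes holding them. A minimal sketch, assuming the hdfs command is on the PATH (show_block_locations is a made-up wrapper name):

```shell
#!/bin/sh
# Sketch: show which blocks make up an HDFS file and which DataNodes hold them.
# Assumes `hdfs` is on PATH; show_block_locations is a hypothetical helper name.
show_block_locations() {
    hdfs fsck "$1" -files -blocks -locations
}

# Example: show_block_locations /hadoop/dir1/file1.txt
```

The block IDs it reports (of the form blk_...) are the names to look for under the DataNode data directory.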
HTH
Start in the local file system (LFS) by creating a directory, for example:
$ mkdir MITHUN94
Enter that directory with cd and create a new file in it:
$ nano file1.log
Now create a directory in HDFS, for example:
$ hdfs dfs -mkdir /mike90
Here "mike90" is the directory name. After creating the directory, send the file from the LFS to HDFS with this command:
$ hdfs dfs -copyFromLocal /home/gopalkrishna/file1.log /mike90
Here '/home/gopalkrishna/file1.log' is the path of the file in the local file system and '/mike90' is the directory in HDFS. Running
$ hdfs dfs -ls /mike90
lists the files in that directory.

Hadoop distcp with file list

I would like to use distcp to copy a list of files (> 1K files) into HDFS. I have already stored the list of files in a local directory. Can I now use -f to copy all the files? If yes, what format do I have to follow in my file list? Or is there a better way?
You don't have to use distcp if your use case is copying data from the local filesystem (say Linux) to HDFS. You can simply use the hdfs dfs -put command for that. Here is the syntax.
hdfs dfs -put /path/to/local/dir/* /path/on/hdfs/
e.g.
hdfs dfs -mkdir /user/hduser/destination-dir/
hdfs dfs -put /home/abc/mydir/* /user/hduser/destination-dir/
You have created a file containing a list of file paths, but that is not needed at all here. Such a list is mainly used with distcp, when you are copying data from one cluster to another.
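If the files are scattered across directories so that one glob does not cover them, the existing list can still drive -put. A rough sketch, assuming the list holds one local path per line and hdfs is on the PATH; put_from_list is a made-up helper, and all paths go into a single hdfs dfs -put call so only one JVM is started.

```shell
#!/bin/sh
# Sketch: upload every local path named in a list file with one `hdfs dfs -put`.
# Assumes one local path per line; blank lines are ignored.
# put_from_list is a hypothetical helper name.
put_from_list() (
    list_file=$1
    dest=$2
    set --                                   # collect the paths as positional parameters
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] && set -- "$@" "$path"
    done < "$list_file"
    [ "$#" -gt 0 ] || { echo "no paths found in $list_file" >&2; exit 1; }
    hdfs dfs -put "$@" "$dest"
)

# Example (placeholder paths):
# put_from_list /home/abc/files.txt /user/hduser/destination-dir/
```

For very long lists the command line can hit the operating system's argument-length limit, in which case the list would need to be split into batches.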

Where can i see my data in Hadoop HDFS

I set dfs.name.dir and dfs.data.dir on the master and slave nodes to
/home/hduser/hadoop/hdfs/name
/home/hduser/hadoop/hdfs/data
I copied a file from local disk to HDFS.
Where can I see that file's data in HDFS?
These configuration parameters determine where in the local filesystem Hadoop stores its image and raw data. When you import file data into HDFS, the destination path doesn't involve these values. In general, data is written into HDFS at the path you specify (when it is absolute), or at a path qualified by your username (by default, I believe, this is /user/your_username) when you use a relative path.
So, if I have a file named example in my (local) home directory and say
local:~ matt> hadoop fs -put example relative/path
I should be able to find it in HDFS at /user/matt/relative/path/example. On the other hand, if I do this
local:~ matt> hadoop fs -put example /absolute/path/in/hdfs
it will be in HDFS at /absolute/path/in/hdfs/example.
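The rule can be written down as a few lines of shell. This is just an illustration of the path resolution described above, assuming the default /user/<username> home directory; resolve_hdfs_path is a made-up helper and not part of Hadoop.

```shell
#!/bin/sh
# Sketch: absolute HDFS paths are used as given, relative ones are resolved
# against /user/<username> (assumed default home directory).
# resolve_hdfs_path is a hypothetical helper, not a Hadoop command.
resolve_hdfs_path() {
    path=$1
    user=${2:-$(whoami)}
    case $path in
        /*) echo "$path" ;;
        *)  echo "/user/$user/$path" ;;
    esac
}

resolve_hdfs_path relative/path matt            # prints /user/matt/relative/path
resolve_hdfs_path /absolute/path/in/hdfs matt   # prints /absolute/path/in/hdfs
```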

Using multiple local folders as source in hadoop mapreduce job

I have data in multiple local folders i.e. /usr/bigboss/data1, /usr/bigboss/data2 and many more folders. I want to use all of these folders as input source for my MapReduce command and store the result at HDFS. I can not find a working command to use Hadoop Grep example to do it.
The data will need to reside in HDFS for you to process it with the grep example. You can upload the folders to HDFS using the -put FsShell command:
hadoop fs -mkdir bigboss
hadoop fs -put /usr/bigboss/data* bigboss
This will create a folder in the current user's HDFS home directory and upload each of the data directories to it.
Now you should be able to run the grep example over the data.
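Put together, the steps might look like the sketch below. Several things here are assumptions: EXAMPLES_JAR must point at the MapReduce examples jar of your installation (its name and location vary by version), bigboss-out is a made-up output directory, and the input is given as the glob 'bigboss/*' because the uploaded data sits in subdirectories. run_grep_example is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: upload several local folders and run the bundled grep example on them.
# Assumptions: `hadoop` is on PATH, EXAMPLES_JAR points at your examples jar,
# and bigboss-out (hypothetical name) does not exist in HDFS yet.
run_grep_example() (
    pattern=$1
    shift                                   # remaining arguments: local folders
    [ -n "$EXAMPLES_JAR" ] || { echo "set EXAMPLES_JAR first" >&2; exit 1; }

    hadoop fs -mkdir bigboss || exit 1
    hadoop fs -put "$@" bigboss || exit 1
    # Arguments of the grep example: <input> <output dir> <regex>.
    # The glob is quoted so that HDFS, not the local shell, expands it.
    hadoop jar "$EXAMPLES_JAR" grep 'bigboss/*' bigboss-out "$pattern"
)

# Example (jar path is a placeholder):
# EXAMPLES_JAR=/path/to/hadoop-mapreduce-examples.jar
# run_grep_example 'error' /usr/bigboss/data*
```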
