HDFS path changing when trying to update files in HDFS

I am new to Hadoop and HDFS, so maybe I am doing something wrong when I copy from local (Ubuntu 10.04) to HDFS on a single-node setup on localhost. The initial copy works fine, but when I modify my local input folder and try to copy back to HDFS, the HDFS path changes.
~$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/anagram /user/hduser/anagram
~$ $HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/anagram
Found 1 items
-rw-r--r-- 1 hduser supergroup 4067675 2011-08-29 05:44 /user/hduser/anagram/SINGLE.TXT
After adding another file (COMMON.TXT) to the same local directory, I run the same copy of the local directory to HDFS, but this time it copies to a different location than the first time (into /user/hduser/anagram/anagram instead of /user/hduser/anagram).
~$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/anagram /user/hduser/anagram
~$ $HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/anagram
Found 2 items
-rw-r--r-- 1 hduser supergroup 4067675 2011-08-29 05:44 /user/hduser/anagram/SINGLE.TXT
drwxr-xr-x - hduser supergroup 0 2011-08-29 05:48 /user/hduser/anagram/anagram
~$ $HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/anagram/anagram
Found 2 items
-rw-r--r-- 1 hduser supergroup 805232 2011-08-29 05:48 /user/hduser/anagram/anagram/COMMON.TXT
-rw-r--r-- 1 hduser supergroup 4067675 2011-08-29 05:48 /user/hduser/anagram/anagram/SINGLE.TXT
Has anyone run into this? I found that to resolve it, you need to remove the first directory and then copy everything over again:
~$ $HADOOP_HOME/bin/hadoop dfs -rmr /user/hduser/anagram/anagram
Deleted hdfs://localhost:54310/user/hduser/anagram/anagram
~$ $HADOOP_HOME/bin/hadoop dfs -rmr /user/hduser/anagram
Deleted hdfs://localhost:54310/user/hduser/anagram
~$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal /tmp/anagram /user/hduser/anagram
~$ $HADOOP_HOME/bin/hadoop dfs -ls /user/hduser/anagram
Found 2 items
-rw-r--r-- 1 hduser supergroup 805232 2011-08-29 05:55 /user/hduser/anagram/COMMON.TXT
-rw-r--r-- 1 hduser supergroup 4067675 2011-08-29 05:55 /user/hduser/anagram/SINGLE.TXT
Does anyone know how to do this without having to delete the directory every time?

It seems to me that this is a side effect of how the copy checks its destination (see FileUtil.java, static method FileUtil.checkDest(String srcName, FileSystem dstFS, Path dst, boolean overwrite)): when the destination directory already exists, the source directory is nested inside it rather than merged with it.
Try this:
hadoop dfs -copyFromLocal /tmp/anagram/*.TXT /user/hduser/anagram
to update the directory; copying the files rather than the directory avoids the nesting.
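Alternatively, on Hadoop 2.x and later, put/copyFromLocal accept an -f flag that overwrites files already present at the destination, so the per-file copy can be re-run safely (a sketch, assuming your Hadoop version supports -f):
hadoop fs -put -f /tmp/anagram/*.TXT /user/hduser/anagram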

Related

Navigate file system in Hadoop

When running hadoop fs -ls I see
drwxr-xr-x - chiki supergroup 0 2019-01-14 17:03 Party_output
drwxr-xr-x - chiki supergroup 0 2018-01-22 18:25 party_uploads
but when I try to access the directory with
hadoop fs -ls /Party_output
the output is
`/Party_output': No such file or directory
That's because hadoop fs -ls with no path shows the contents of your HDFS home directory, /user/chiki.
You need to run hadoop fs -ls Party_output to see inside that directory (because it lives at /user/chiki/Party_output and not /Party_output).
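In other words, the relative and absolute forms below point at the same directory (assuming the default /user/<username> home layout):
hadoop fs -ls Party_output
hadoop fs -ls /user/chiki/Party_output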

Hadoop error du: java.util.ConcurrentModificationException

While working on my HDFS cluster, I get this error
du: java.util.ConcurrentModificationException
whenever I run
hdfs dfs -du -h -s /some/path/
A quick check on the Internet showed it was a bug in Hadoop 2.7.0.
To fix the issue, I had to delete some of my HDFS snapshots. I believe one or more snapshots had been corrupted, as one of my data nodes had been decommissioned uncleanly from the cluster a few days earlier.
hdfs lsSnapshottableDir
drwxr-xr-x 0 hdfs supergroup 0 2018-01-30 17:04 0 65536 /data
[hdfs@hmastera ~]$ hdfs dfs -ls /data/.snapshot
Found 5 items
drwxr-xr-x - hdfs supergroup 0 2017-08-19 01:06 /data/.snapshot/insight-dl-cluster_snapshot_20170819T010503
drwxr-xr-x - hdfs supergroup 0 2017-08-19 01:08 /data/.snapshot/insight-dl-cluster_snapshot_20170819T010746
drwxr-xr-x - hdfs supergroup 0 2017-08-19 01:12 /data/.snapshot/insight-dl-cluster_snapshot_20170819T011013
drwxr-xr-x - hdfs supergroup 0 2017-08-19 01:14 /data/.snapshot/insight-dl-cluster_snapshot_20170819T011219
drwxr-xr-x - hdfs supergroup 0 2018-01-13 16:24 /data/.snapshot/insight-dl-cluster_snapshot_20180113T162234
Then I started deleting the snapshots till I got my mojo back.
[hdfs@hmastera ~]$ hdfs dfs -deleteSnapshot /data insight-dl-cluster_snapshot_20170819T010503
hdfs dfs -deleteSnapshot /data insight-dl-cluster_snapshot_20170819T010746
hdfs dfs -deleteSnapshot /data insight-dl-cluster_snapshot_20170819T011013
hdfs dfs -deleteSnapshot /data insight-dl-cluster_snapshot_20170819T011219
hdfs dfs -deleteSnapshot /data insight-dl-cluster_snapshot_20180113T162234
[hdfs@hmastera ~]$ hdfs dfs -du -h -s /data
510.1 G /data
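Once the corrupted snapshots are gone and du works again, new snapshots can be taken as usual, assuming /data is still snapshottable (the snapshot name below is just an illustrative placeholder):
hdfs dfs -createSnapshot /data insight-dl-cluster_snapshot_<date>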

Confusion on HDFS 'pwd' equivalents

First, I have read this post: Is there an equivalent to `pwd` in hdfs? It says there is no 'pwd' in HDFS.
However, as I worked through the instructions in Hadoop: Setting up a Single Node Cluster, I failed on this command:
$ bin/hdfs dfs -put etc/hadoop input
put: `input': No such file or directory
It's weird that I succeeded with this command the first time I went through the instructions but failed the second time. It's also weird that the command succeeds on my friend's computer, which has the same system (Ubuntu 14.04) and Hadoop version (2.7.1) as mine.
Can anyone explain what happened here? Is there some 'pwd' in HDFS after all?
Firstly, you are trying to run the command $ bin/hdfs dfs -put etc/hadoop input as a user that doesn't exist in the VM/HDFS (i.e. one with no HDFS home directory).
Let me explain with the following example on the HDP VM.
[root@sandbox hadoop-hdfs-client]# bin/hdfs dfs -put /etc/hadoop input
put: `input': No such file or directory
Here I executed the command as the root user, which has no home directory in HDFS on the HDP VM. Use the following command to list the users that do:
[root@sandbox hadoop-hdfs-client]# hadoop fs -ls /user
Found 8 items
drwxrwx--- - ambari-qa hdfs 0 2015-08-20 08:33 /user/ambari-qa
drwxr-xr-x - guest guest 0 2015-08-20 08:47 /user/guest
drwxr-xr-x - hcat hdfs 0 2015-08-20 08:36 /user/hcat
drwx------ - hive hdfs 0 2015-09-04 09:52 /user/hive
drwxr-xr-x - hue hue 0 2015-08-20 09:05 /user/hue
drwxrwxr-x - oozie hdfs 0 2015-08-20 08:37 /user/oozie
drwxr-xr-x - solr hdfs 0 2015-08-20 08:41 /user/solr
drwxrwxr-x - spark hdfs 0 2015-08-20 08:34 /user/spark
In HDFS, if you copy a file without giving an absolute path for the destination argument, the path is resolved relative to the home directory of the logged-in user (/user/<username>) and the file is placed there. Here, no home directory exists for root, hence the error.
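If you do want to run the command as root, its HDFS home directory can be created first (a sketch, assuming an hdfs superuser account is reachable via sudo; adjust to your setup):
sudo -u hdfs hadoop fs -mkdir -p /user/root
sudo -u hdfs hadoop fs -chown root:root /user/root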
Now let's switch to the hive user and test:
[root@sandbox hadoop-hdfs-client]# su hive
[hive@sandbox hadoop-hdfs-client]$ bin/hdfs dfs -put /etc/hadoop input
[hive@sandbox hadoop-hdfs-client]$ hadoop fs -ls /user/hive
Found 1 items
drwxr-xr-x - hive hdfs 0 2015-09-04 10:07 /user/hive/input
Yay, successfully copied!
Hope it helps!
The error means that the input files need to be moved to an HDFS location first.
Suppose you have an input file named input.txt that needs to be moved to HDFS; then use the command below.
Command: hdfs dfs -put /input_location /hdfs_location
If you do not need a specific directory in HDFS:
hdfs dfs -put /home/Desktop/input.txt /
If you need a specific directory in HDFS (note: the directory must be created beforehand; see the mkdir example below):
hdfs dfs -put /home/Desktop/input.txt /MR_input
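For instance, the target directory can be created first with (using the /MR_input path from above):
hdfs dfs -mkdir /MR_input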
After that you can run the examples
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /input /output
Here /input and /output are paths that should be in HDFS.
Hope this helps.

hadoop copy a local file system folder to HDFS

I need to copy a folder from the local file system to HDFS. I could not find any example of moving a folder (including all its subfolders) to HDFS.
$ hadoop fs -copyFromLocal /home/ubuntu/Source-Folder-To-Copy HDFS-URI
You could try:
hadoop fs -put /path/in/linux /hdfs/path
or even
hadoop fs -copyFromLocal /path/in/linux /hdfs/path
By default, both put and copyFromLocal upload directories recursively to HDFS.
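For instance, a nested tree survives the copy intact (a quick sketch; dir/sub/file.txt is a hypothetical local tree, and -ls -R lists recursively):
hadoop fs -put dir /user/hduser/
hadoop fs -ls -R /user/hduser/dir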
In Short
hdfs dfs -put <localsrc> <dest>
In detail with an example:
Checking source and target before placing files into HDFS
[cloudera@quickstart ~]$ ll files/
total 132
-rwxrwxr-x 1 cloudera cloudera 5387 Nov 14 06:33 cloudera-manager
-rwxrwxr-x 1 cloudera cloudera 9964 Nov 14 06:33 cm_api.py
-rw-rw-r-- 1 cloudera cloudera 664 Nov 14 06:33 derby.log
-rw-rw-r-- 1 cloudera cloudera 53655 Nov 14 06:33 enterprise-deployment.json
-rw-rw-r-- 1 cloudera cloudera 50515 Nov 14 06:33 express-deployment.json
[cloudera@quickstart ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2017-11-14 00:45 .sparkStaging
Copy files to HDFS using the -put or -copyFromLocal command
[cloudera@quickstart ~]$ hdfs dfs -put files/ files
Verify the result in HDFS
[cloudera@quickstart ~]$ hdfs dfs -ls
Found 2 items
drwxr-xr-x - cloudera cloudera 0 2017-11-14 00:45 .sparkStaging
drwxr-xr-x - cloudera cloudera 0 2017-11-14 06:34 files
[cloudera@quickstart ~]$ hdfs dfs -ls files
Found 5 items
-rw-r--r-- 1 cloudera cloudera 5387 2017-11-14 06:34 files/cloudera-manager
-rw-r--r-- 1 cloudera cloudera 9964 2017-11-14 06:34 files/cm_api.py
-rw-r--r-- 1 cloudera cloudera 664 2017-11-14 06:34 files/derby.log
-rw-r--r-- 1 cloudera cloudera 53655 2017-11-14 06:34 files/enterprise-deployment.json
-rw-r--r-- 1 cloudera cloudera 50515 2017-11-14 06:34 files/express-deployment.json
If you copy a folder from local, it will be copied with all its subfolders to HDFS.
To copy a folder from local to HDFS, you can use
hadoop fs -put localpath
or
hadoop fs -copyFromLocal localpath
or
hadoop fs -put localpath hdfspath
or
hadoop fs -copyFromLocal localpath hdfspath
Note:
If you do not specify an HDFS path, the folder is copied to your HDFS home directory under the same folder name.
To copy from HDFS to local:
hadoop fs -get hdfspath localpath
You can use:
1. Loading data from local file to HDFS
Syntax: $ hadoop fs -copyFromLocal <local file> <HDFS directory>
EX: $ hadoop fs -copyFromLocal localfile1 HDIR
2. Copying data from HDFS to local
Syntax: $ hadoop fs -copyToLocal <HDFS file> <new local file name>
EX: $ hadoop fs -copyToLocal hdfs/filename myunx
To copy a folder or file from local to HDFS, you can use the command below:
hadoop fs -put /path/localpath /path/hdfspath
or
hadoop fs -copyFromLocal /path/localpath /path/hdfspath
Navigate to your "/install/hadoop/datanode/bin" folder, or whatever path you run your hadoop commands from:
To place the files in HDFS:
Format: hadoop fs -put "Local system path"/filename.csv "HDFS destination path"
e.g. ./hadoop fs -put /opt/csv/load.csv /user/load
Here /opt/csv/load.csv is the source file path on my local Linux system.
/user/load is the destination path on the HDFS cluster, i.e. "hdfs://hacluster/user/load".
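Equivalently, the fully qualified URI form can be used (hacluster being the example cluster name from above):
hadoop fs -put /opt/csv/load.csv hdfs://hacluster/user/load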
To get the files from HDFS to local system:
Format: hadoop fs -get "/HDFSsourcefilepath" "/localpath"
e.g. hadoop fs -get /user/load/a.csv /opt/csv/
After executing the above command, a.csv from HDFS is downloaded to the /opt/csv folder on the local Linux system.
The uploaded files can also be seen through the HDFS NameNode web UI.
Use the following commands:
hadoop fs -copyFromLocal <local-nonhdfs-path> <hdfs-target-path>
hadoop fs -copyToLocal <hdfs-input-path> <local-nonhdfs-path>
Or you can use the Hadoop FileSystem API (for example from Spark) to get or put HDFS files.
Hope this is helpful.

Browsing into a folder in Hadoop

I ssh to the dev box that is for Hadoop, and if I run hadoop fs -ls I get a lot of files, including
drwxr-xr-x - root hadoop 0 2013-07-11 17:49 sandeep
drwxr-xr-x - root hadoop 0 2013-04-10 14:13 testprocedure
drwxr-xr-x - root hadoop 0 2013-04-03 13:56 tmp
I need to go inside that tmp folder. I took a look at the Hadoop shell commands here, but still didn't find a command for it: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
So what's the command to go to that folder?
Specify the directory name, as follows:
hadoop fs -ls tmp
Sample output from my Demo VM:
[cloudera@localhost ~]$ hadoop fs -ls
Found 12 items
-rw-r--r-- 1 cloudera supergroup 46 2013-06-18 21:18 /user/cloudera/FileWrite.txt
-rw-r--r-- 1 cloudera supergroup 13 2013-06-18 15:34 /user/cloudera/HelloWorld.txt
drwxr-xr-x - cloudera supergroup 0 2013-07-01 22:07 /user/cloudera/hiveext
drwxr-xr-x - cloudera supergroup 0 2012-06-12 15:10 /user/cloudera/input
-rw-r--r-- 1 cloudera supergroup 176 2013-06-18 23:07 /user/cloudera/input_data.txt
drwxr-xr-x - cloudera supergroup 0 2012-09-06 15:44 /user/cloudera/movies_input
drwxr-xr-x - cloudera supergroup 0 2012-09-06 17:02 /user/cloudera/movies_output
drwxr-xr-x - cloudera supergroup 0 2012-09-06 14:53 /user/cloudera/output
drwxr-xr-x - cloudera supergroup 0 2013-07-01 23:45 /user/cloudera/sample_external_input
-rw-r--r-- 1 cloudera supergroup 16 2012-06-14 01:39 /user/cloudera/test.txt
drwxr-xr-x - cloudera supergroup 0 2012-06-13 00:00 /user/cloudera/weather_input
drwxr-xr-x - cloudera supergroup 0 2012-06-13 15:13 /user/cloudera/weather_output
When I specify a directory, e.g. hadoop fs -ls sample_external_input:
[cloudera@localhost ~]$ hadoop fs -ls sample_external_input
Found 2 items
-rw-r--r-- 1 cloudera supergroup 61 2013-07-01 23:17 /user/cloudera/sample_external_input/sample_external_data.txt
-rw-r--r-- 1 cloudera supergroup 13 2013-07-01 23:18 /user/cloudera/sample_external_input/sample_external_data2.txt
There is nothing like cd that can take you inside a directory, so you can't go into a folder the way you can on your local FS. You could use ls as others have suggested, but that just lists the contents of a directory; it doesn't take you there. If you really want to browse a particular directory, you can use the HDFS web UI: point your web browser to NameNode_Machine:50070. It lets you browse the entire HDFS, and you can view and download files as well.
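A minimal shell workaround (a sketch, assuming bash; HDFS_PWD is just a hypothetical variable name, since HDFS itself keeps no notion of a current directory) is to track the "current" HDFS path in a variable:
HDFS_PWD=/user/me/tmp
hadoop fs -ls "$HDFS_PWD"
hadoop fs -ls "$HDFS_PWD/someTmpStuff"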
If you specify nothing after -ls, the folders listed are those in your "home" directory. If you want to give a path relative to your home folder, you can do so:
hadoop fs -ls tmp/someTmpStuff
(assuming tmp is a folder in your home directory) or use a fully qualified path:
hadoop fs -ls /user/me/tmp/someTmpStuff
First you need to check whether you have Hadoop access. If yes, then use the command:
[yourhost]$ hadoop fs -ls /dir1/
It will list the directories and files inside dir1.
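To list everything under dir1 recursively, recent Hadoop versions also accept an -R flag on ls:
hadoop fs -ls -R /dir1/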
