Hadoop missing input which is present in HDFS

Evening All,
I'm trying to run a training sample on Hadoop mapreduce, but am receiving an error that the input path does not exist.
16/09/26 05:56:45 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: hdfs://bigtop1.vagrant:8020/training
However, looking inside the hdfs directory, it's clear that the "training" folder is present.
[vagrant@bigtop1 code]$ hadoop fs -ls
Found 3 items
drwx------ - vagrant hadoop 0 2016-09-26 05:47 .staging
drwxr-xr-x - vagrant hadoop 0 2016-09-26 04:28 hw2
drwxr-xr-x - vagrant hadoop 0 2016-09-26 04:14 training
Using HDFS commands:
[vagrant@bigtop1 code]$ hdfs dfs -ls training
Found 2 items
-rw-r--r-- 3 vagrant hadoop 0 2016-09-26 04:14 training/_SUCCESS
-rw-r--r-- 3 vagrant hadoop 3311720 2016-09-26 04:14 training/part-r-00000
Does anyone know of a possible reason that Hadoop would be missing data that is clearly present?
Invocation below; I had to hide one input (-f):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D mapreduce.job.reduces=5 -files lr -mapper "python lr/mapper.py -n 5 -r 0.4" -reducer "python lr/reducer.py -e 0.1 -c 0.0 -f ####" -input /training/ -output /models

Please change the input parameter to something like this.
From
-input /training/
To
-input training/

When you run $ hadoop fs -ls, it shows you the data in the current user's home directory.
Are you sure the path to your data isn't /user/vagrant/?
If the training directory isn't present when you run $ hadoop fs -ls /, then you have the path wrong.
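To confirm where the data actually lives, compare the relative and absolute listings (the vagrant user name is taken from the question):
hadoop fs -ls training
hadoop fs -ls /training
hadoop fs -ls /user/vagrant/training
If the first and third commands succeed but the second fails, the job's -input should point at training/ (or the full /user/vagrant/training/), not /training/.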

Please change the input parameter to something like this.
-input hdfs://<machinename>/user/vagrant/training/
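Applied to the invocation from the question, the corrected command might look like this; the host name comes from the error message, and the hidden -f value stays hidden:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D mapreduce.job.reduces=5 -files lr -mapper "python lr/mapper.py -n 5 -r 0.4" -reducer "python lr/reducer.py -e 0.1 -c 0.0 -f ####" -input hdfs://bigtop1.vagrant:8020/user/vagrant/training/ -output /models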

Related

Confusion on HDFS 'pwd' equivalents

First, I have read this post: Is there an equivalent to `pwd` in hdfs?. It says there is no such 'pwd' in HDFS.
However, as I progressed with the instructions of Hadoop: Setting up a Single Node Cluster, I failed on this command:
$ bin/hdfs dfs -put etc/hadoop input
put: 'input': No such file or directory
It's weird that I succeeded with this command the first time I went through the instructions, but failed the second time. It's also weird that the command succeeds on my friend's computer, which has the same system (Ubuntu 14.04) and Hadoop version (2.7.1) as mine.
Can anyone explain what happened here? Is there some 'pwd' in HDFS after all?
Firstly, you are trying to run the command $ bin/hdfs dfs -put etc/hadoop input as a user that doesn't have a home directory in HDFS.
Let me explain clearly with the following example on an HDP VM:
[root@sandbox hadoop-hdfs-client]# bin/hdfs dfs -put /etc/hadoop input
put: `input': No such file or directory
Here I executed the command as the root user, whose home directory doesn't exist in HDFS on the HDP VM. Run the following command to list the existing user directories:
[root@sandbox hadoop-hdfs-client]# hadoop fs -ls /user
Found 8 items
drwxrwx--- - ambari-qa hdfs 0 2015-08-20 08:33 /user/ambari-qa
drwxr-xr-x - guest guest 0 2015-08-20 08:47 /user/guest
drwxr-xr-x - hcat hdfs 0 2015-08-20 08:36 /user/hcat
drwx------ - hive hdfs 0 2015-09-04 09:52 /user/hive
drwxr-xr-x - hue hue 0 2015-08-20 09:05 /user/hue
drwxrwxr-x - oozie hdfs 0 2015-08-20 08:37 /user/oozie
drwxr-xr-x - solr hdfs 0 2015-08-20 08:41 /user/solr
drwxrwxr-x - spark hdfs 0 2015-08-20 08:34 /user/spark
In HDFS, if you copy a file without giving an absolute path for the destination argument, it uses the home directory of the logged-in user and places your file there. Here the root user's home directory was not found.
Now let's switch to hive user and test
[root@sandbox hadoop-hdfs-client]# su hive
[hive@sandbox hadoop-hdfs-client]$ bin/hdfs dfs -put /etc/hadoop input
[hive@sandbox hadoop-hdfs-client]$ hadoop fs -ls /user/hive
Found 1 items
drwxr-xr-x - hive hdfs 0 2015-09-04 10:07 /user/hive/input
Yay..Successfully Copied..
Hope it helps..!!!
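If you do want to keep working as root (or any user without an HDFS home directory), a common fix is to create that home directory first as the HDFS superuser; the hdfs user name below is an assumption based on typical HDP setups:
sudo -u hdfs hdfs dfs -mkdir -p /user/root
sudo -u hdfs hdfs dfs -chown root:hdfs /user/root
After that, the original bin/hdfs dfs -put etc/hadoop input should succeed for root as well.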
This means we need to move the input files to an HDFS location.
Suppose you have an input file named input.txt that needs to be moved to HDFS; then follow the commands below.
Command: hdfs dfs -put /input_location /hdfs_location
In case there is no specific directory in HDFS:
hdfs dfs -put /home/Desktop/input.txt /
In case of a specific directory in HDFS (note: we need to create the directory first; see the end-to-end example below):
hdfs dfs -put /home/Desktop/input.txt /MR_input
After that you can run the examples:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /input /output
Here the input and output paths should both be in HDFS.
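Putting it all together, a minimal end-to-end run might look like this; the /MR_input path is the example name used above, and /MR_output is an assumed output directory:
hdfs dfs -mkdir /MR_input
hdfs dfs -put /home/Desktop/input.txt /MR_input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /MR_input /MR_output
hdfs dfs -cat /MR_output/part-r-00000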
Hope this helps.

Need explanation on Hadoop file system

For the following command,
hadoop fs -put foo.txt bar.txt
After the operation succeeds, where will bar.txt be located on my local hard drive, given
a single node setup?
a pseudo-distributed setup?
Will bar.txt still get replicated 3 times for backup?
bar.txt will be placed in the current Hadoop user's home directory, /user/<hadoop-user>, as per the following code:
@Override
public Path getHomeDirectory() {
  return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
}
Source here
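A quick way to see this resolution from the shell (assuming your OS login matches your Hadoop user name) is to compare the implicit and explicit listings:
hadoop fs -ls
hadoop fs -ls /user/$(whoami)
After the -put above, both listings should show bar.txt.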
If the cluster is a single node, the block is stored only once even if you set dfs.replication to 3, because Hadoop will not store the same block on the same node more than once.
Pseudo-distributed mode runs all the Hadoop daemons on the same machine; it is effectively a single-node cluster.
If you set dfs.replication to 3 in that case, Hadoop just gives you a warning.
Hope it helps!
The above fs command tries to put the local file foo.txt into HDFS as bar.txt. Because you are not providing an absolute destination path, the HDFS path is determined by the current user performing the operation.
If /user is configured as the home directory root in HDFS, it takes the path /user/<current-user> and places the file there.
Also, if there is no folder in HDFS corresponding to the current user, the command fails, stating that the file doesn't exist.
E.g. if the current user is "testusr1", the above command places the file under /user/testusr1.
You can verify this by executing the command hadoop fs -ls /user/
AFAIK this should be the same for a pseudo-distributed or single-node setup.
[root@sandbox ~]# hadoop fs -ls /user
Found 11 items
drwx------ - root hdfs 0 2015-04-13 03:59 /user/root
.
.
.
.
.
drwxr-xr-x - root hdfs 0 2015-04-13 04:18 /user/testusr1
[root@sandbox ~]#
[root@sandbox ~]# su - testusr1
[testusr1@sandbox ~]$ whoami
testusr1
[testusr1@sandbox ~]$ pwd
/home/testusr1
[testusr1@sandbox ~]$ ll
total 7
-rw-rw-r-- 1 testusr1 testusr1 49 2015-04-13 04:17 foo-testusr2.txt
[testusr1@sandbox ~]$ hadoop fs -put foo-testusr2.txt bar-testusr2.txt
As for the replication factor, you can check it with the help of the basic hadoop fs -ls command.
[testusr1@sandbox ~]$ exit
logout
[root@sandbox ~]# hdfs dfs -ls /user/testusr1
Found 1 items
-rw-r--r-- 1 testusr1 hdfs 49 2015-04-13 04:18 /user/testusr1/bar-testusr2.txt
[root@sandbox ~]#
In the above sample output, you can see the number 1 right after the file permissions: that is the replication factor, and it shows 1 as per my HDFS configuration.
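If you prefer not to read column positions, the replication factor can also be printed directly with the -stat format option %r (available in Hadoop 2.x shells):
hdfs dfs -stat %r /user/testusr1/bar-testusr2.txt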

Why does my hadoop command not work?

I have my hadoop cluster set up with one master and two slaves.
when I type
hadoop fs -ls
ls: Cannot access .: No such file or directory.
But when I type the following:
hadoop fs -ls /
Found 1 items
drwxr-xr-x - Mike supergroup 0 2014-06-24 00:24 /usr
I get the same output on both the master and the slaves. Why does hadoop fs -ls not work?
Thanks!
hadoop fs -ls
This tries to list the current user's home directory on HDFS. Since the /user/{username} directory doesn't exist in your case, you get the error.
hadoop fs -ls /
Here you are specifically telling it to list the root directory, which it does successfully because that directory exists.
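If that is the case, creating the missing home directory should make the bare command work. The Mike user name below is taken from your listing, and the commands assume you run them as the HDFS superuser (usually the user that started the namenode); on Hadoop 2.x the -p flag creates missing parent directories:
hadoop fs -mkdir -p /user/Mike
hadoop fs -chown Mike /user/Mike
hadoop fs -ls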

How to open HDFS output file using gedit?

I have installed and executed a MapReduce program successfully on my system (Ubuntu 14.04).
I can see the output file as follows:
hadoopuser@arul-PC:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hadoopuser/MapReduceSample-output
Found 3 items
-rw-r--r-- 1 hadoopuser supergroup 0 2014-07-09 16:10 /user/hadoopuser/MapReduceSample-output/_SUCCESS
drwxr-xr-x - hadoopuser supergroup 0 2014-07-09 16:10 /user/hadoopuser/MapReduceSample-output/_logs
-rw-r--r-- 1 hadoopuser supergroup 880838 2014-07-09 16:10 /user/hadoopuser/MapReduceSample-output/part-00000
And I can open it in the terminal using the following command:
hadoopuser@arul-PC:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hadoopuser/MapReduceSample-output/part-00000
I can see the output file in the terminal, but I can't see the full result because the output has a large number of lines.
So I want to open it in gedit or nano.
Need a solution.
You can also use getmerge to copy an HDFS file to the local system.
hadoopuser@arul-PC:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hadoopuser/MapReduceSample-output/part-00000 /home/arul/MROutput
hadoop dfs -getmerge /path/to/HDFS /path/to/save
Instead of looking for a plugin, you can add the jar files from $HADOOP_INSTALL/bin in Eclipse, and the compiler issues should be gone.
You can't access an HDFS file from the local machine (as a system user), so you can't open an HDFS file directly with gedit.
To open it in gedit you have to copy it to the local machine.
To do that, open a terminal (Ctrl+Alt+T) and use copyToLocal, a Hadoop shell command, to copy the output file to the local machine.
Do the following:
hadoopuser@arul-PC:/usr/local/hadoop$ sudo bin/hadoop dfs -copyToLocal /user/hadoopuser/MapReduceSample-output/part-00000 /home/arul/Downloads/
Now you can open the output file using gedit as follows:
$ sudo gedit /home/arul/Downloads/part-00000
Note:
My HDFS username is hadoopuser.
You can also move files within HDFS: the Hadoop shell command fs -mv moves a file to a different HDFS location.
For more Hadoop shell commands (click here).
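If you only want to page through the result without copying it out of HDFS, you can also pipe the -cat output from above through a pager; this is just a convenience variation on the command already shown:
hadoopuser@arul-PC:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hadoopuser/MapReduceSample-output/part-00000 | less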
Update (another option to do the same, from Y-Prithvi's post)
You can also use getmerge to copy an HDFS file to the local system.
hadoopuser@arul-PC:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hadoopuser/MapReduceSample-output/part-00000 /home/arul/MROutput
hadoop dfs -getmerge /path/to/HDFS /path/to/save
Eclipse Setup for Hadoop Development
this should help

Hadoop HDFS copy with wildcards?

I want to copy a certain pattern of files from within hdfs to another location in the same hdfs cluster. The dfs shell does not seem to be able to handle this:
hadoop dfs -cp /tables/weblog/server=jeckle/webapp.log.1* /tables/tinylog/server=jeckle/
No error is returned, yet no files are copied.
You need to use double quotes around a path that contains a wildcard, like this:
hdfs dfs -cp "/path/to/foo*" /path/to/bar/
First of all, HDFS copy with wildcards is supported. Secondly, use of hadoop dfs is deprecated; you'd better use hadoop fs or hdfs dfs instead. If you're sure the operation was not successful (although it seems to have succeeded), you could check the namenode log files to see what went wrong.
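Whichever syntax you use, a quick sanity check is to list what the pattern actually matches; the quotes keep your local shell from expanding the wildcard against the local filesystem before Hadoop sees it:
hadoop fs -ls "/tables/weblog/server=jeckle/webapp.log.1*"
If that listing doesn't show the files you expect, the -cp has nothing to copy.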
Interesting. This is what I get in my local VM running Hadoop 0.18.0. What version are you using? I can try on 1.2.1 also
hadoop-user@hadoop-desk:~$ hadoop fs -ls /user/hadoop-user/testcopy
hadoop-user@hadoop-desk:~$ hadoop dfs -cp /user/hadoop-user/input/*.txt /user/hadoop-user/testcopy/
hadoop-user@hadoop-desk:~$ hadoop fs -ls /user/hadoop-user/testcopy
Found 2 items
-rw-r--r-- 1 hadoop-user supergroup 79 2014-01-06 04:35 /user/hadoop-user/testcopy/HelloWorld.txt
-rw-r--r-- 1 hadoop-user supergroup 140 2014-01-06 04:35 /user/hadoop-user/testcopy/SampleData.txt
These both worked for me:
~]$ hadoop fs -cp -f /user/cloudera/Dec_17_2017/cric* /user/cloudera/Dec_17_2017/Dec_18
~]$ hadoop fs -cp -f "/user/cloudera/Dec_17_2017/cric*" /user/cloudera/Dec_17_2017/Dec_18
I think the better way is to not use double or single quotes (" or ') at all.
In case anybody wants to copy the files and folders from the terminal's current directory:
hdfs dfs -put ./
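With a single local argument, recent versions of the shell resolve the destination to your HDFS home directory. To be explicit about the target, you could spell it out; <your-user> below is a placeholder to fill in:
hdfs dfs -put ./ /user/<your-user>/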
