"No such file or directory" in hadoop while executing WordCount program using jar command - hadoop

I am new to Hadoop and am trying to run the WordCount example.
Things I have done so far:
Set up a Hadoop single-node cluster by following the link below.
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program by following the link below.
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem occurs when I execute the last line to run the program:
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
This is the error I get:
The directory seems to be present.
The file is also present in the directory, with contents.
Finally, on a side note, I also tried the following directory structure in the jar command.
To no avail! :/
I would really appreciate it if someone could guide me here!
Regards,
Paul Alwin

Your first image shows the job taking its input from the local Hadoop installation directory, under /usr.
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running in pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
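As a rough sketch of the HDFS option described above, assuming a pseudo-distributed setup with HDFS already running: the function below copies the local input into HDFS and then runs the job against HDFS paths. The function name and the default HDFS base directory are made up for illustration; the jar name and the local input path are the ones from the question.

```shell
# Sketch only: stage local input into HDFS, then run WordCount on HDFS paths.
# run_wordcount_on_hdfs and the default HDFS base directory are hypothetical.
# Note: the HDFS output directory must not exist yet; the job creates it.
run_wordcount_on_hdfs() {
    local_input=${1:-/usr/local/hadoop/input}     # local directory from the question
    hdfs_base=${2:-/user/$(id -un)/wordcount}     # assumed HDFS location, adjust as needed

    hdfs dfs -mkdir -p "$hdfs_base/input" &&
    hdfs dfs -put "$local_input"/* "$hdfs_base/input" &&
    hadoop jar wordcount.jar "$hdfs_base/input" "$hdfs_base/output" &&
    hdfs dfs -ls "$hdfs_base/output"
}
```

To read straight from the local filesystem instead, the same job could be given file:// URIs, for example file:///usr/local/hadoop/input and file:///usr/local/hadoop/output in place of the two HDFS paths.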
Based on the stack trace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so in my opinion it's better to get familiar with those than to mess around with installing Hadoop yourself from scratch.

Related

How does the hadoop directory differ from hadoop-x.x.x?

I am new to Hadoop. Recently, while running MapReduce jobs on an OpenStack Hadoop cluster, I cd'd into a directory on a datanode machine and found two hadoop folders: one called "hadoop" and the other named "hadoop-2.7.1". Obviously, the latter makes more sense, as it tells the Hadoop version. The two folders contain the same subdirectories, but how do they differ from each other? If I'd like to disable HDFS permission checking on this machine, which one should I use?
Here is a screenshot
As the colors in the screenshot suggest, hadoop is not a separate directory but just a symbolic link, evidently pointing to hadoop-2.7.1. Run ls -l to check this.
You should cd into the hadoop directory. It exists intentionally, so that the Hadoop version does not have to be written explicitly. When a new version of Hadoop is deployed, a new versioned directory is created and the hadoop symbolic link is changed to point to the latest versioned directory. Like this:
hadoop-2.7.1
hadoop-2.7.2
hadoop-2.7.3
hadoop -> hadoop-2.7.3
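A small sketch of that layout, built in a throwaway scratch directory (nothing here touches a real Hadoop install; the directory names simply mirror the listing above):

```shell
# Sketch: reproduce the versioned-directory-plus-symlink layout in a scratch dir.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir hadoop-2.7.1
ln -s hadoop-2.7.1 hadoop        # "hadoop" is only a link to the versioned dir
ls -l                            # the link is listed as hadoop -> hadoop-2.7.1
before=$(readlink hadoop)

# Deploying a new version: add the new directory, then repoint the link.
mkdir hadoop-2.7.3
ln -sfn hadoop-2.7.3 hadoop
after=$(readlink hadoop)
echo "$before -> $after"
```

Because scripts and configuration refer to the hadoop path, only the link needs to change on an upgrade.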

Spark - java IOException: Failed to create local dir in /tmp/blockmgr*

I was trying to run a long-running Spark job. After a few hours of execution, I get the exception below:
Caused by: java.io.IOException: Failed to create local dir in /tmp/blockmgr-bb765fd4-361f-4ee4-a6ef-adc547d8d838/28
I tried to get around it by checking:
Permission issues in the /tmp dir. The Spark server is not running as root, but the /tmp dir should be writable by all users.
The /tmp dir has enough space.
Assuming that you are working with several nodes, you'll need to check every node that participates in the Spark operation (master/driver + slaves/nodes/workers).
Please confirm that each worker/node has enough disk space (especially check the /tmp folder) and the right permissions.
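One way to run that check on a node might look like the sketch below. check_scratch_dir is a hypothetical helper, not part of Spark; it reports free space and permissions, then tries to create and remove a subdirectory the way the block manager would need to.

```shell
# Sketch: report free space and permissions for a directory and verify that
# the current user can actually create a subdirectory in it.
# check_scratch_dir is a hypothetical helper, not part of Spark.
check_scratch_dir() {
    dir=${1:-/tmp}
    df -h "$dir"        # free space on the filesystem holding the directory
    ls -ld "$dir"       # owner and mode; /tmp is normally drwxrwxrwt
    probe=$(mktemp -d "$dir/blockmgr-probe.XXXXXX") || {
        echo "cannot create a directory in $dir" >&2
        return 1
    }
    rmdir "$probe"
    echo "$dir is writable"
}
```

Running check_scratch_dir /tmp as the user that runs Spark, on the driver and on every worker, should show whether any single node is short on space or refuses the write.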
Edit: The answer below did not solve my case in the end. Spark (or some of its dependencies) was able to create some of the subfolders, but not all of them. Having to create such paths by hand so often would make any project unviable. Therefore I ran Spark (PySpark in my case) as an Administrator, which solved the problem. So in the end it was probably a permission issue after all.
Original answer:
I solved the same problem I had on my local Windows machine (not a cluster). Since there was no problem with permissions, I created the dir that Spark was failing to create, i.e. I created the following folder as a local user and did not need to change any permissions on that folder.
C:\Users\<username>\AppData\Local\Temp\blockmgr-97439a5f-45b0-4257-a773-2b7650d17142
After verifying all the permissions and user access, I got the same issue when building the components in Talend Studio. It was resolved by providing the correct "/" in the Spark scratch directory (temp directory) setting in the Spark Configuration tab. This is required when building the jar on Windows and running it on a Linux cluster.

How do you transfer files onto the Hadoop FS (HDFS) on the Windows command line without Cygwin?

I have zero experience with Hadoop, but suddenly have to use it at work with Spark on Windows. My question, which has been asked a few times here, but I never could quite get the syntax for what I need, is this. I'm trying to transfer a simple file called:
gensortText.txt which let's say is at c:\gensortText.txt
I know you can use hadoop fs -copyFromLocal. I've tried these things:
hadoop fs -copyFromLocal C:\gensortText.txt hdfs://0.0.0.0:19000
ERROR: Relative path in absolute URI.
hadoop fs -copyFromLocal C:\gensortOutText.txt \tmp\hadoop-Administrator\dfs
ERROR: copyFromLocal: `tmphadoop-Administratordfs': No such file or directory
and a number of other variations with hdfs: and using the tmp directory which all returned similar errors.
I have Hadoop in c:\deploy as suggested in the Hadoop2Windows guide (which works and allowed me to run Hadoop; I can access the web GUI and all that). Hadoop has created my new HDFS at c:\temp. Please, someone help me figure out how to transfer files into the system. It can even be done manually if that's possible, but that doesn't seem to work, as the file doesn't show up in the web GUI when I go to "Utilities->Browse the Filesystem". Nothing shows up there, actually.
Can someone please help? I can provide any relevant information, but I'm so new to this that I don't really know what would be helpful. I think it's just my syntax for the command-line tool. Can someone give me a concrete example of how to use hadoop fs -copyFromLocal, or another simple way to do this? Sorry for my ignorance on the subject, and thanks for any help.
To be able to run hadoop commands on Windows, you need to have winutils installed and visible to the Hadoop process.

Unable to set up pseudo-distributed Hadoop cluster

I am using CentOS 7. I downloaded and untarred Hadoop 2.4.0 and followed the instructions in the linked guide, "Hadoop 2.4.0 setup".
I ran the following command:
./hdfs namenode -format
Got this error :
Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode
I see a number of posts with the same error with no accepted answers and I have tried them all without any luck.
This error can occur if the necessary jarfiles are not readable by the user running the "./hdfs" command or are misplaced so that they can't be found by hadoop/libexec/hadoop-config.sh.
Check the permissions on the jarfiles under: hadoop-install/share/hadoop/*:
ls -l share/hadoop/*/*.jar
and if necessary, chmod them as the owner of the respective files to ensure they're readable. Something like chmod 644 should be sufficient to at least check if that fixes the initial problem. For the more permanent fix, you'll likely want to run the hadoop commands as the same user that owns all the files.
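A minimal sketch of that check, assuming the install directory is passed as an argument. list_unreadable_jars is a hypothetical helper name; it prints every jar under share/hadoop that the current user cannot read. Note that when run as root, every file counts as readable, so run it as the user who will invoke ./hdfs.

```shell
# Sketch: list jar files under a Hadoop install that the current user cannot read.
# list_unreadable_jars is a hypothetical helper; pass the install directory.
list_unreadable_jars() {
    install_dir=$1
    find "$install_dir/share/hadoop" -name '*.jar' | while IFS= read -r jar; do
        [ -r "$jar" ] || echo "$jar"
    done
}
```

If it prints anything, the owner of those files can make them readable with chmod 644 on each listed path. Empty output means readability is not the problem, and a misplaced jar directory is the more likely cause.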
I followed the linked guide, "Setup hadoop 2.4.0", and was able to get past the error message.
It seems the documentation on the Hadoop site is not complete.

Hadoop on Mesos fails with "Could not find or load main class org.apache.hadoop.mapred.MesosExecutor"

I have a Mesos cluster set up -- I have verified that the master can see the slaves -- but when I attempt to run a Hadoop job, all tasks wind up with a status of LOST. The same error is present in all the slave stderr logs:
Error: Could not find or load main class org.apache.hadoop.mapred.MesosExecutor
and that is the only line in the stderr logs.
Following the instructions on http://mesosphere.io/learn/run-hadoop-on-mesos/, I have put a modified Hadoop distribution on HDFS which each slave can access.
In the lib directory of the Hadoop distribution, I have added hadoop-mesos-0.0.4.jar and mesos-0.14.2.jar.
I have verified that each slave does in fact download this Hadoop distribution, and that hadoop-mesos-0.0.4.jar contains the class org.apache.hadoop.mapred.MesosExecutor, so I cannot figure out why the class cannot be found.
I am using Hadoop from CDH4.4.0 and mesos-0.15.0-rc4.
Does anyone have any suggestions as to what might be the problem? I know I would always start by suspecting a CLASSPATH problem, but in this case the mesos-slave is downloading, unpacking, and attempting to run a Hadoop TaskTracker, so I would imagine any CLASSPATH would be set up by the mesos-slave.
In the stdout of the slave logs, the environment is printed. There is a MESOS_HADOOP_HOME which is empty. Should this be set to something? If it is supposed to be set to the downloaded Hadoop distribution, I cannot set it in advance because the Hadoop distribution is downloaded to a new location every time.
In case it is related (some permissions issue, maybe): when attempting to browse slave logs via the master UI, I get the error Error browsing path: ....
The user running mesos-slave can browse to the correct directory when I do so manually.
I found the problem. bin/hadoop of the downloaded Hadoop distribution attempts to find its location by running which $0. However, that will find a current Hadoop installation if one exists (i.e. /usr/lib/hadoop), and will load the jars under that installation's lib directory instead of the downloaded one's lib directory.
I had to modify bin/hadoop of the downloaded distribution to find its own location with dirname $0 instead of which $0.
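The effect of the dirname $0 approach can be sketched with two stand-in scripts in a scratch directory. Everything below is a throwaway demo with made-up paths, not the real bin/hadoop: one fake "installed" hadoop sits on PATH, while the "downloaded" one locates itself from the path it was invoked with and so ignores PATH entirely.

```shell
# Sketch: a script that locates itself from the path it was invoked with,
# so another "hadoop" found earlier on PATH cannot shadow it.
# All directories below are throwaway stand-ins created for the demo.
demo=$(mktemp -d)
mkdir -p "$demo/installed/bin" "$demo/downloaded/bin"

# Stand-in for an existing installation (e.g. /usr/lib/hadoop) that is on PATH.
printf '#!/bin/sh\necho installed\n' > "$demo/installed/bin/hadoop"
chmod +x "$demo/installed/bin/hadoop"

# Stand-in for bin/hadoop of the downloaded distribution, using dirname $0.
cat > "$demo/downloaded/bin/hadoop" <<'EOF'
#!/bin/sh
bin=$(dirname "$0")
bin=$(cd "$bin" && pwd)
echo "$bin"
EOF

# Run it by bare name from its own directory while the other one is on PATH.
resolved=$(cd "$demo/downloaded/bin" && PATH="$demo/installed/bin:$PATH" sh hadoop)
echo "$resolved"
```

Here $0 is the bare name hadoop, which is the case where a PATH search would pick up the other installation; dirname yields "." and the script resolves to the downloaded directory instead.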
