Hadoop+HBase cluster on windows: winutils not found - windows

I'm trying to set up a fully-distributed 4-node dev cluster with Hadoop 2.20 and HBase 0.98 on Windows. I've built Hadoop on Windows successfully, and more recently, also build HBase on Windows.
We have successfully ran the wordcount example from the Hadoop installation guide, as well as a custom WebHDFS job. As HBase fully-distributed on Windows isn't supported yet, I'm running HBase under cygwin.
When trying to start hbase from my master (./bin/start-hbase.sh), I get the following error:
2014-04-17 16:22:08,599 ERROR [main] util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.conf.Configuration.getStrings(Configuration.java:1514)
at org.apache.hadoop.hbase.zookeeper.ZKConfig.makeZKProps(ZKConfig.java:113)
at org.apache.hadoop.hbase.zookeeper.ZKServerTool.main(ZKServerTool.java:46)
Looking at the Shell.java source, what is here set as null, seems to be the HADOOP_HOME environment variable. With hadoop under D:/hadoop, and HBase under cygwin root at C:/cygwin/root/usr/local/hbase, the cygwin $HADOOP_HOME variable is /cygdrive/d/hadoop/, and the Windows system environment variable %HADOOP_HOME% is D:\hadoop . Seems to me like with those two variables, the variable should be found correctly...
Also potentially relevant: I'm running Windows Server 2012 x64.
Edit: I have verified that there actually is a winutils.exe in D:\hadoop\bin\ .

We've found it. So, in Hadoop's Shell.java, you'll find that there are two options to communicate the Hadoop-path.
// first check the Dflag hadoop.home.dir with JVM scope
String home = System.getProperty("hadoop.home.dir");
// fall back to the system/user-global env variable
if (home == null) {
home = System.getenv("HADOOP_HOME");
}
After trial and error, we found that in the HBase options (HBase's hbase-env.sh, HBASE_OPTS variable), you'll need to add in this option with the Windows(!) path to Hadoop. In our case, we needed to add -Dhadoop.home.dir=D:/hadoop .
Good luck to anyone else who happens to stumble across this ;).

Related

xmx1000m is not recognized as an internal or external command: pig on windows

I am trying to setup pig on windows 7. I already have hadoop 2.7 single node cluster running on windows 7.
To setup pig, I have taken following steps as of now.
Downloaded the tar: http://mirror.metrocast.net/apache/pig/
Extracted tar to: C:\Users\zeba\Desktop\pig
Have set the Environment (User) Variable to:
PIG_HOME = C:\Users\zeba\Desktop\pig
PATH = C:\Users\zeba\Desktop\pig\bin
PIG_CLASSPATH = C:\Users\zeba\Desktop\hadoop\conf
Also changed HADOOP_BIN_PATH in pig.cmd to %HADOOP_HOME%\libexec as suggested by (Apache pig on windows gives "hadoop-config.cmd' is not recognized as an internal or external command" error when running "pig -x local") as was getting the same error
When I enter pig, I encounter the following error:
xmx1000m is not recognized as an internal or external command
Please help!
The error went away by installing pig-0.17.0. I was working with pig-0.16.0 previously.
Finally i got it. I changed HADOOP BIN PATH in pig.cmd to "HADOOP_HOME%\hadoop-2.9.2\libexec", as you can see "hadoop-2.9.2" is a subfile where "libexec" from my hadoop version is located..
Fix your "HADOOP_HOME" according to given image don't provide bin path only provide hadoop path.

Spark installed but no command 'hdfs' or 'hadoop' found

I am a new pyspark user.
I just downloaded and installed a spark cluster ("spark-2.0.2-bin-hadoop2.7.tgz")
after installation I wanted to access the file system (upload local files to cluster). But when I tried to type hadoop or hdfs in command it will say "no command found".
Am I gonna install hadoop/HDFS (I thought it's built in the spark, I don't get)?
Thanks in advance.
You have to install hadoop first to access HDFS.
Follow this http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Choose the latest version of hadoop from the apache site.
Once you done with hadoop setup go to spark http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz download this, Extract files. Setup java_home and hadoop_home in spark-env.sh.
You don't have hdfs or hadoop on classpath so this is the reason why you are getting message: "no command found".
If you run \yourparh\hadoop-2.7.1\bin\hdfs dfs -ls / it should works and show root content.
But, You can add your hadoop/bin (hdfs, hadoop ...) commands to classpath with something like this:
export PATH $PATH:$HADOOP_HOME/bin
where HADOOP_HOME is your env. variable with path to hadoop installation folder (download and install is required)

Spark: Run Spark shell from a different directory than where Spark is installed on slaves and master

I have a small cluster (4 machines) set up with 3 slaves and a master node, all installed to /home/spark/spark. (I.e, $SPARK_HOME is /home/spark/spark)
When I use the spark shell: /home/spark/spark/bin/pyspark --master spark://192.168.0.11:7077 everything works fine. However I'd like for my colleagues to be able to connect to the cluster from a local instance of spark on their machine installed in whatever directory they wish.
Currently if somebody has spark installed in say /home/user12/spark and run /home/user12/spark/bin/pyspark --master spark://192.168.0.11:7077 the spark shell will connect to the master without problems but fails with an error when I try to run code:
class java.io.IOException: Cannot run program
"/home/user12/bin/compute-classpath.sh"
(in directory "."): error=2, No such file or directory)
The problem here is that Spark is looking for the spark installation in /home/user12/spark/, where as I'd like to just tell spark to look in /home/spark/spark/ instead.
How do I do this?
You need to edit three files, spark-submit, spark-class and pyspark (all in the bin folder).
Find the line
export SPARK_HOME = [...]
Then change it to
SPARK_HOME = [...]
Finally make sure you set SPARK_HOME to the directory where spark is installed on the cluster.
This works for me.
Here you can find a detailed explanation.
http://apache-spark-user-list.1001560.n3.nabble.com/executor-failed-cannot-find-compute-classpath-sh-td859.html

Hadoop on Mesos fails with "Could not find or load main class org.apache.hadoop.mapred.MesosExecutor"

I have a Mesos cluster setup -- I have verified that the master can see the slaves -- but when I attempt to run a Hadoop job, all tasks wind up with a status of LOST. The same error is present in all the slave stderr logs:
Error: Could not find or load main class org.apache.hadoop.mapred.MesosExecutor
and that is the only line in the stderr logs.
Following the instructions on http://mesosphere.io/learn/run-hadoop-on-mesos/, I have put a modified Hadoop distribution on HDFS which each slave can access.
In the lib directory of the Hadoop distribution, I have added hadoop-mesos-0.0.4.jar and mesos-0.14.2.jar.
I have verified that each slave does in fact download this Hadoop distribution, and that hadoop-mesos-0.0.4.jar contains the class org.apache.hadoop.mapred.MesosExecutor, so I cannot figure out why the class cannot be found.
I am using Hadoop from CDH4.4.0 and mesos-0.15.0-rc4.
Does any one have any suggestions as to what might be the problem? I know I would always start with a CLASSPATH problem, but, in this case, the mesos-slave is downloading, unpacking, and attempting to run a Hadoop TaskTracker so I would imagine any CLASSPATH would be setup by the mesos-slave.
In the stdout of the slave logs, the environment is printed. There is a MESOS_HADOOP_HOME which is empty. Should this be set to something? If it is supposed to be set to the downloaded Hadoop distribution, I cannot set it in advance because the Hadoop distribution is downloaded to a new location every time.
In the event that is related (some permissions issue maybe), when attempting to browse slave logs via the master UI, I get the error Error browsing path: ....
The user running mesos-slave can browse to the correct directory when I do so manually.
I found the problem. bin/hadoop of the downloaded Hadoop distribution attempts to find its location by running which $0. However, that will find a current Hadoop installation if one exists (i.e. /usr/lib/hadoop), and will load the jars under that installation's lib directory instead of the downloaded one's lib directory.
I had to modify bin/hadoop of the downloaded distribution to find its own location with dirname $0 instead of which $0.

HADOOP_HOME and hadoop streaming

Hi I am trying to run hadoop on a server that has hadoop installed but I have no idea the directory where hadoop resides. The server was configure by the server admin.
In order to load hadoop I use the use command from the dotkit package.
There may be several solutions but wanted to know where the hadoop package was installed, how to set up the $HADOOP_HOME variable, and how to approp run a hadoop streaming job, such as $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar, aka, http://wiki.apache.org/hadoop/HadoopStreaming.
Thanks! any help would be greatly appreciated!
If you're using a cloudera distribution then it's most probably in /usr/lib/hadoop, otherwise it could be anywhere (at the discretion of your system admin).
There are some tricks you can use to try and locate it:
locate hadoop-env.sh (assuming that locate has been installed and updatedb has been run recently)
If the machine you're running this on is running a hadoop service (such as data node, job tracker, task tracker, name node), then you can perform a process list and grep for the hadoop command: ps axww | grep hadoop
Failing the above two, look for the hadoop root directory in some common locations such as: /usr/lib, /usr/local, /opt
Failing all this, and assuming your current user has the permissions: find / -name hadoop-env.sh
If you're install with rpm then it's most probably in /etc/hadoop.
Why don't you try:
echo $HADOOP_HOME
Obiviously the above env variable has to be set before you could even issue hadoop executables from anywhere on the box.

Resources