Problems running Mahout and Hadoop

I'm new at Mahout and Hadoop.
I've successfully installed a Hadoop cluster with 3 machines and the cluster is running fine. I just installed Mahout on the main namenode for testing purposes, followed the installation instructions, and set JAVA_HOME. But when I try to run classify-20newsgroups.sh it downloads the dataset and then fails with the following error:
Error: JAVA_HOME is not set
I've checked .bashrc and confirmed that JAVA_HOME is set correctly, but it doesn't help.
Also, how do I verify that Mahout is configured to run on Hadoop correctly? Do you know of any example that can verify this configuration or environment?

The .bashrc is only read by interactive non-login shells; login shells read .bash_profile instead.
So you can either source .bashrc from .bash_profile (see What's the difference between .bashrc, .bash_profile, and .environment?) or just set JAVA_HOME in .bash_profile.
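For example, a common idiom (just a sketch; adapt it to your setup) is to have .bash_profile load .bashrc so that login and non-login shells see the same variables:
# ~/.bash_profile: pull in ~/.bashrc for login shells as well
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi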
There are also several other ways to set JAVA_HOME:
1) source .bashrc in the terminal
~$ source .bashrc
2) export JAVA_HOME in the open terminal before running classify-20newsgroups.sh
~$ export JAVA_HOME=/path
~$ classify-20newsgroups.sh
3) run classify-20newsgroups.sh with JAVA_HOME set inline, i.e.
~$ JAVA_HOME=/path classify-20newsgroups.sh
As for the question about configuring Mahout to run on Hadoop: the standard classify-20newsgroups example should run on Hadoop as long as HADOOP_HOME is set.
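A minimal sketch of what the environment could look like before launching the script (the paths below are placeholders for your own installation; MAHOUT_LOCAL is the switch Mahout's launcher uses to force local execution, so leave it unset for cluster runs):
# Placeholder paths - adjust to your installation
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export MAHOUT_HOME=/usr/local/mahout
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
# Leave MAHOUT_LOCAL unset so jobs are submitted to the cluster;
# any non-empty value forces local execution instead.
unset MAHOUT_LOCAL
cd $MAHOUT_HOME/examples/bin
./classify-20newsgroups.sh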

You might need to explicitly set JAVA_HOME in hadoop-env.sh
In hadoop-env.sh, look for the comment "#The java implementation to use", and modify the JAVA_HOME path under it.
It should look something like this:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Of course, adjust the JAVA_HOME path to match your own installation.
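If you are unsure which path to use, the following usually reveals the JDK directory on Linux (it assumes /usr/bin/java is a symlink chain into the JVM directory, which is the common layout but not guaranteed):
readlink -f $(which java)
# prints e.g. /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# strip the trailing /jre/bin/java (or /bin/java) and use the rest as JAVA_HOME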

Related

Why am I getting command not found in hadoop?

I am working on a Hadoop project on Ubuntu 14.04. Whenever I run start-all.sh or start-dfs.sh, I get a "command not found" message. What should I do?
You are not running the command in the right environment.
The start-all.sh (deprecated) and start-dfs.sh scripts live in Hadoop's bin or sbin directory (sbin in Hadoop 2.x and later). Find your Hadoop home directory, locate that folder inside it, and then run the command
./start-dfs.sh
Add the following to your ~/.bashrc:
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
then run source ~/.bashrc. The command should now work.
This usually means Hadoop's bin directory is not on your PATH.
Edit the /etc/profile file (e.g. with vi):
export HADOOP_HOME=/usr/hadoop   # the directory where your Hadoop is installed
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
then
source /etc/profile
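Once the PATH is updated, a quick check that the scripts now resolve (a sketch; start-dfs.sh sits under sbin in Hadoop 2.x, bin in older releases):
which start-dfs.sh    # should print the full path under your Hadoop directory
start-dfs.sh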

Cannot run the pyspark command from any directory on my Mac after installing Apache Spark

I have installed spark on my Mac, following the instructions in the book: "Apache Spark in 24 Hours". When I am in the spark directory, I am able to run pyspark by using the command:
./bin/pyspark
To install spark I created the env variable:
export SPARK_HOME=/opt/spark
Added it to the PATH:
export PATH=$SPARK_HOME/bin:$PATH
The book says that I should be able to run the "pyspark" or the "spark-shell" command from any directory, but it doesn't work:
pyspark: command not found
I followed instructions on similar questions asked by others on here:
I set my JAVA_HOME env variable:
export JAVA_HOME=$(/usr/libexec/java_home)
I also ran the following commands:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
When I run the env command this is the output:
SPARK_HOME=/opt/spark
TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/hq/z0wh5c357cbgp1dh33lfhjj40000gn/T/
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.fJdtLqZ7dN/Render
TERM_PROGRAM_VERSION=361.1
TERM_SESSION_ID=A8BD2144-72AD-402C-A591-5C8A43DD398B
USER=richardgray
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.cQeqaF2v1z/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0x0:0x0
PATH=/opt/spark/bin:/Library/Frameworks/Python.framework/Versions/3.5/bin: /Library/Frameworks/Python.framework/Versions/3.5/bin:/Library/Frameworks/Python.framework/Versions/2.7/bin:/usr/local/heroku/bin:/Users/richardgray/.rbenv/shims:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
PWD=/Users/richardgray
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_25.jdk/Contents/Home
LANG=en_GB.UTF-8
XPC_FLAGS=0x0
XPC_SERVICE_NAME=0
SHLVL=1
HOME=/Users/richardgray
PYTHONPATH=/opt/spark/python/lib/py4j-0.9-src.zip:/opt/spark/python/:
LOGNAME=richardgray
_=/usr/bin/env
Is there something I am missing? Thanks in advance.
You wrote that
When I am in the spark directory, I am able to run pyspark by using
the command: ./bin/pyspark
You created export SPARK_HOME=/opt/spark
Can you please confirm that the Spark directory is indeed /opt/spark?
If you actually run Spark from /Users/richardgray/opt/spark/bin, please set:
export SPARK_HOME=/Users/richardgray/opt/spark
followed by:
export PATH=$SPARK_HOME/bin:$PATH
Note: if this solves your problem, you'll need to add those two exports to your login script (e.g. .profile) so the path is set automatically.
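As a hedged sketch of persisting the exports (the /Users/richardgray/opt/spark path is only an assumption taken from the answer above; confirm where Spark is actually unpacked first):
cat >> ~/.profile <<'EOF'
export SPARK_HOME=/Users/richardgray/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
EOF
source ~/.profile
which pyspark    # should now resolve from any directory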

Apache Spark can not Run on Windows

I downloaded spark-2.0.1-bin-hadoop2.7 and installed it. I installed Java and set JAVA_HOME in the system variables.
But when I run it I get this error:
How can it be fixed?
I think the problem is the whitespace in your path.
Try placing the downloaded Spark in a path without spaces, for example F:\Msc\BigData\BigDataSeminar\Spark\
Also check that SPARK_HOME, JAVA_HOME and HADOOP_HOME point to paths without whitespace.

Hadoop 2.2.0 fails running start-dfs.sh with Error: JAVA_HOME is not set and could not be found

I have a work-in-progress installation of Hadoop on Ubuntu 12.x. I already had a deploy user which I plan to use to run Hadoop on a cluster of machines. The following session demonstrates my problem: I can ssh to olympus with no problems, but start-dfs.sh fails doing exactly that:
deploy@olympus:~$ ssh olympus
Welcome to Ubuntu 12.04.4 LTS (GNU/Linux 3.5.0-45-generic x86_64)
* Documentation: https://help.ubuntu.com/
Last login: Mon Feb 3 18:22:27 2014 from olympus
deploy@olympus:~$ echo $JAVA_HOME
/opt/dev/java/1.7.0_51
deploy@olympus:~$ start-dfs.sh
Starting namenodes on [olympus]
olympus: Error: JAVA_HOME is not set and could not be found.
You can edit the hadoop-env.sh file and set JAVA_HOME for Hadoop.
Open the file and find the line below:
export JAVA_HOME=/usr/lib/j2sdk1.6-sun
Uncomment the line and update JAVA_HOME to match your environment.
This will solve the JAVA_HOME problem.
Weird out-of-the-box bug on Ubuntu. The current line
export JAVA_HOME=${JAVA_HOME}
in /etc/hadoop/hadoop-env.sh should pick up the Java home from the host, but it doesn't.
Just edit the file and hard-code the Java home for now.
Alternatively you can edit /etc/environment to include:
JAVA_HOME=/usr/lib/jvm/[YOURJAVADIRECTORY]
This makes JAVA_HOME available to all users on the system and allows start-dfs.sh to see the value. My guess is that start-dfs.sh is kicking off a process as another user (or in a non-interactive shell) somewhere that does not pick up the variable unless it is explicitly set in hadoop-env.sh.
Using hadoop-env.sh is arguably clearer -- just adding this option for completeness.
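A quick way to see the difference (a sketch using the olympus host from the question): the start scripts launch the daemons over ssh, so compare what an interactive shell and a non-interactive ssh command each see:
echo $JAVA_HOME                  # interactive login shell: prints /opt/dev/java/1.7.0_51
ssh olympus 'echo $JAVA_HOME'    # non-interactive shell: often prints nothing
# If the second command prints nothing, JAVA_HOME is only set in a startup file
# that non-interactive shells do not read, which is why hadoop-env.sh or
# /etc/environment is the reliable place for it.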
Edit the Hadoop environment script /etc/hadoop/hadoop-env.sh and set JAVA_HOME explicitly.
For example:
Instead of export JAVA_HOME=${JAVA_HOME}, do
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.x86_64/jre
This example uses the java-1.8.0-openjdk version of Java.
I have Hadoop installed in /opt/hadoop/ and Java installed in /usr/lib/jvm/java-8-oracle.
In the end, adding the following to the bash profile files solved the problem:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_ROOT_LOGGER=INFO,console
export HADOOP_SECURITY_LOGGER=INFO,NullAppender
export HDFS_AUDIT_LOGGER=INFO,NullAppender
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_YARN_HOME=$HADOOP_HOME
export YARN_LOG_DIR=/tmp
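After reloading the profile, a quick sanity check could look like this (standard Hadoop 2.x commands; jps ships with the JDK):
source ~/.bashrc     # or whichever profile file holds the exports above
hadoop version       # confirms HADOOP_HOME/bin is on the PATH
start-dfs.sh         # confirms JAVA_HOME is being picked up
jps                  # should list NameNode, DataNode, SecondaryNameNode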

How do you install and run Accumulo and Hadoop on OS X 10.7.4

So I'm trying to run a MapReduce word count example, but I need Hadoop running first. I tried following the instructions from here, but it doesn't seem to be working. The problem is that the environment variable is not being set. I added the line setenv HADOOP_HOME /opt/hadoop-0.20.2 to /etc/launchd.conf, but when I run echo $HADOOP_HOME it doesn't print the path.
Set the HADOOP_HOME variable directly in Accumulo's conf/accumulo-env.sh script.
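A minimal sketch of what that could look like, assuming the Hadoop 0.20.2 tarball from the question is unpacked in /opt/hadoop-0.20.2:
# conf/accumulo-env.sh
export HADOOP_HOME=/opt/hadoop-0.20.2
export JAVA_HOME=$(/usr/libexec/java_home)   # standard way to locate the JDK on OS X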
