spark-shell throws error in Apache spark

I have installed Hadoop on Ubuntu in VirtualBox (host OS: Windows 7). I have also installed Apache Spark, configured SPARK_HOME in .bashrc, and added HADOOP_CONF_DIR to spark-env.sh. Now when I start spark-shell, it throws an error and does not initialize the Spark context or the SQL context. Am I missing something in the installation? I would also like to run it on a cluster (a 3-node Hadoop cluster is set up).

I had the same issue when trying to install Spark locally on Windows 7. Please make sure the paths below are correct and it should work for you. I answered the same question elsewhere, so you can follow the steps below.
Create JAVA_HOME variable: C:\Program Files\Java\jdk1.8.0_181
Add the following part to your path: ;%JAVA_HOME%\bin
Create SPARK_HOME variable: C:\spark-2.3.0-bin-hadoop2.7
Add the following part to your path: ;%SPARK_HOME%\bin
Most important: the Hadoop folder must contain a bin folder that holds winutils.exe, for example C:\Hadoop\bin. Make sure winutils.exe is located inside this path.
Create HADOOP_HOME Variable: C:\Hadoop
Add the following part to your path: ;%HADOOP_HOME%\bin
Now you can open cmd and type spark-shell; it should work.
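
Because each PATH entry in the steps above appends \bin to a *_HOME variable, a value that itself ends in \bin would expand to ...\bin\bin. The following is only a minimal Python sketch of a check for that slip; the function name find_bin_suffixed_homes and the list of variables it inspects are assumptions made for illustration, not part of Spark or Hadoop.

```python
from ntpath import basename, normpath  # Windows path rules on any platform

# Variables named in the steps above.
HOME_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def find_bin_suffixed_homes(env):
    """Return the *_HOME variables whose value ends in a 'bin' folder."""
    flagged = []
    for name in HOME_VARS:
        value = env.get(name)
        # normpath drops a trailing backslash, so 'C:\\Hadoop\\bin\\' is caught too.
        if value and basename(normpath(value)).lower() == "bin":
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    import os
    for name in find_bin_suffixed_homes(os.environ):
        print(name, "points at a bin folder; its PATH entry would end in bin\\bin")
```

Passing a plain dictionary instead of os.environ makes it easy to try the check against the values you intend to set before changing the system settings.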

Related

Spark installed but no command 'hdfs' or 'hadoop' found

I am a new pyspark user.
I just downloaded and installed a spark cluster ("spark-2.0.2-bin-hadoop2.7.tgz")
After installation I wanted to access the file system (upload local files to the cluster). But when I typed hadoop or hdfs at the command line, it said "no command found".
Do I need to install Hadoop/HDFS separately? (I thought it was built into Spark, so I don't understand.)
Thanks in advance.

You have to install hadoop first to access HDFS.
Follow this http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Choose the latest version of hadoop from the apache site.
Once you are done with the Hadoop setup, download Spark from http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz and extract the files. Set JAVA_HOME and HADOOP_HOME in spark-env.sh.

You don't have hdfs or hadoop on your PATH, which is why you are getting the message "no command found".
If you run \yourpath\hadoop-2.7.1\bin\hdfs dfs -ls / it should work and show the root content.
You can also add the commands in hadoop/bin (hdfs, hadoop ...) to your PATH with something like this:
export PATH=$PATH:$HADOOP_HOME/bin
where HADOOP_HOME is your environment variable holding the path to the Hadoop installation folder (downloading and installing Hadoop is required).
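
To confirm whether the hadoop and hdfs commands can be found on a given search path, a small Python sketch along these lines may help. The helper name missing_commands is hypothetical; shutil.which is the standard-library lookup that does the actual search.

```python
import os
import shutil


def missing_commands(names, path=None):
    """Return the commands from `names` that cannot be found on `path`.

    When `path` is None, shutil.which falls back to the PATH of the
    current process.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Simulate the export above: append $HADOOP_HOME/bin to the search path.
    search_path = os.environ.get("PATH", "")
    hadoop_home = os.environ.get("HADOOP_HOME")
    if hadoop_home:
        search_path = os.pathsep.join([search_path, os.path.join(hadoop_home, "bin")])
    print("still missing:", missing_commands(["hadoop", "hdfs"], search_path))
```

If both names are still reported as missing after HADOOP_HOME/bin is appended, Hadoop is most likely not installed at the location HADOOP_HOME points to.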

Spark: Run Spark shell from a different directory than where Spark is installed on slaves and master

I have a small cluster (4 machines) set up with 3 slaves and a master node, all installed to /home/spark/spark. (I.e, $SPARK_HOME is /home/spark/spark)
When I use the spark shell: /home/spark/spark/bin/pyspark --master spark://192.168.0.11:7077 everything works fine. However I'd like for my colleagues to be able to connect to the cluster from a local instance of spark on their machine installed in whatever directory they wish.
Currently, if somebody has Spark installed in, say, /home/user12/spark and runs /home/user12/spark/bin/pyspark --master spark://192.168.0.11:7077, the Spark shell connects to the master without problems but fails with an error when I try to run code:
class java.io.IOException: Cannot run program
"/home/user12/bin/compute-classpath.sh"
(in directory "."): error=2, No such file or directory)
The problem here is that Spark is looking for the Spark installation in /home/user12/spark/, whereas I'd like to just tell Spark to look in /home/spark/spark/ instead.
How do I do this?

You need to edit three files, spark-submit, spark-class and pyspark (all in the bin folder).
Find the line
export SPARK_HOME = [...]
Then change it to
SPARK_HOME = [...]
Finally make sure you set SPARK_HOME to the directory where spark is installed on the cluster.
This works for me.
Here you can find a detailed explanation.
http://apache-spark-user-list.1001560.n3.nabble.com/executor-failed-cannot-find-compute-classpath-sh-td859.html

Hadoop+HBase cluster on windows: winutils not found

I'm trying to set up a fully-distributed 4-node dev cluster with Hadoop 2.2.0 and HBase 0.98 on Windows. I've built Hadoop on Windows successfully and, more recently, also built HBase on Windows.
We have successfully run the wordcount example from the Hadoop installation guide, as well as a custom WebHDFS job. As HBase fully-distributed on Windows isn't supported yet, I'm running HBase under cygwin.
When trying to start hbase from my master (./bin/start-hbase.sh), I get the following error:
2014-04-17 16:22:08,599 ERROR [main] util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.conf.Configuration.getStrings(Configuration.java:1514)
at org.apache.hadoop.hbase.zookeeper.ZKConfig.makeZKProps(ZKConfig.java:113)
at org.apache.hadoop.hbase.zookeeper.ZKServerTool.main(ZKServerTool.java:46)
Looking at the Shell.java source, what is here set as null, seems to be the HADOOP_HOME environment variable. With hadoop under D:/hadoop, and HBase under cygwin root at C:/cygwin/root/usr/local/hbase, the cygwin $HADOOP_HOME variable is /cygdrive/d/hadoop/, and the Windows system environment variable %HADOOP_HOME% is D:\hadoop . Seems to me like with those two variables, the variable should be found correctly...
Also potentially relevant: I'm running Windows Server 2012 x64.
Edit: I have verified that there actually is a winutils.exe in D:\hadoop\bin\ .

We've found it. So, in Hadoop's Shell.java, you'll find that there are two options to communicate the Hadoop-path.
// first check the Dflag hadoop.home.dir with JVM scope
String home = System.getProperty("hadoop.home.dir");
// fall back to the system/user-global env variable
if (home == null) {
  home = System.getenv("HADOOP_HOME");
}
After trial and error, we found that in the HBase options (HBase's hbase-env.sh, HBASE_OPTS variable), you'll need to add in this option with the Windows(!) path to Hadoop. In our case, we needed to add -Dhadoop.home.dir=D:/hadoop .
Good luck to anyone else who happens to stumble across this ;).
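
The lookup order quoted from Shell.java explains why the -D option takes effect even when HADOOP_HOME holds a cygwin-style path. As a rough illustration only, here is the same order in Python; resolve_hadoop_home and its two dictionary arguments are hypothetical stand-ins for the JVM system properties and the process environment.

```python
def resolve_hadoop_home(jvm_properties, environment):
    """Mimic the lookup order of the quoted Shell.java snippet."""
    # First choice: the -Dhadoop.home.dir=... JVM system property.
    home = jvm_properties.get("hadoop.home.dir")
    # Fallback: the HADOOP_HOME environment variable.
    if home is None:
        home = environment.get("HADOOP_HOME")
    return home


if __name__ == "__main__":
    props = {"hadoop.home.dir": "D:/hadoop"}
    env = {"HADOOP_HOME": "/cygdrive/d/hadoop/"}
    # The system property wins over the cygwin-style environment value.
    print(resolve_hadoop_home(props, env))
```

When neither source is set the function returns None, which corresponds to the null that shows up in the error message.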

Pig installation not working

I have installed Pig 0.12.0 on my box. I have also installed Java and Hadoop and have set the JAVA_HOME and HADOOP_HOME paths. When I go to the bin directory of the Pig installation and type the following command at my command prompt:
pig -help
it errors out with the following message:
The system cannot find the path specified.
'-Xmx1000M' is not recognized as an internal or external command,
operable program or batch file.
What's wrong?
Should I be using cygwin? (That didn't work either.)

I just installed Pig 0.12.1 on Windows 7 without Hadoop installed. I also got this error message and resolved it by setting the "JAVA" environment variable to point to the java.exe executable.
In my case, I set JAVA=C:\Progra~1\Java\jdk1.8.0_05\bin\java.exe
I also set:
JAVA_HOME=C:\Progra~1\Java\jdk1.8.0_05
PIG_HOME=C:\pig-0.12.1 (This is where I extracted pig-0.12.1.tar.gz)
and added C:\pig-0.12.1\bin to my PATH environment variable.
Hope this helps anyone else with this issue!

I know this is a very late reply, but hope this will help someone to configure pig
Using OS Windows 8.1 ProN x64
Note -> Configuration will be easier if your directory paths don't contain whitespace.
Steps to configure Pig with Hadoop
java path location (jdk1.8.0_151)
JAVA_HOME C:\Java\jdk1.8.0_151
python path location (Python27)
C:\Python27
ant path location (apache-ant-1.10.1)
ANT_HOME F:\Hadoop\apache-ant-1.10.1
hadoop path location (hadoop-2.8.2)
HADOOP_HOME F:\Hadoop\hadoop-2.8.2
HADOOP_COMMON_LIB_NATIVE_DIR %HADOOP_HOME%\lib\native
HADOOP_CONF_DIR %HADOOP_HOME%\etc\hadoop
pig path location (pig-0.17.0)
PIG_HOME F:\Hadoop\pig-0.17.0
System Variables Path
C:\Python27\;C:\Python27\Scripts;C:\Java\jdk1.8.0_151\bin;F:\Hadoop\hadoop-2.8.2\bin;F:\Hadoop\hadoop-2.8.2\sbin;F:\Hadoop\pig-0.17.0\bin;
explore winutils-master.zip on github, pull and download, extract to %HADOOP_HOME%\bin directory
open %PIG_HOME%\bin\pig.cmd using notepad/notepad++ (recommended notepad++)
change the below said line and later save it, close it -
"set HADOOP_BIN_PATH=%HADOOP_HOME%\bin" to
"set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec"
so now the pig will access hadoop-config.cmd inside %HADOOP_HOME% path as we configured earlier.
start-all.cmd from hadoop to start the cluster with all dependencies.
go to %PIG_HOME%/bin, check with pig -help (if results are as parameterized).
pig (Enter to grunt shell.)
Note -> you are likely to get the exceptions below if you don't configure Pig as described above.
'F:\Hadoop\hadoop-2.8.2\bin\hadoop-config.cmd' is not recognized as an internal or external command, operable program or batch file.
'-Xmx1000M' is not recognized as an internal or external command, operable program or batch file.
hope, these illustrated steps will help you to configure and start the pig grunt shell, thanks.

This will solve your problem...
1.Download PIG ->http://mirrors.estointernet.in/apache/pig/pig-0.16.0/
2.Set properties->
PIG_HOME=C:\Users\lenovo\Downloads\pig-0.16.0\pig-0.16.0
path=C:\Users\lenovo\Downloads\pig-0.16.0\pig-0.16.0\bin
PIG_CLASSPATH=C:\Users\lenovo\Downloads\hadoop-2.7.3\hadoop-2.7.3\etc\hadoop (where core-site.xml and mapred-site.xml are present)
3.
->open the file pig.cmd(From bin directory of PIG)
->look for the line set HADOOP_BIN_PATH=%HADOOP_HOME%\bin
->replace this with set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec
4.Now in command prompt give->pig
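
The pig.cmd change described in the answers above can also be applied with a script rather than by hand. The following Python sketch assumes the line appears in pig.cmd exactly as quoted; patch_pig_cmd is a hypothetical helper that works on the file contents as a string and leaves reading and writing the file to the caller.

```python
# The line quoted in the answers above and its replacement.
OLD_LINE = r"set HADOOP_BIN_PATH=%HADOOP_HOME%\bin"
NEW_LINE = r"set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec"


def patch_pig_cmd(text):
    """Return pig.cmd contents with HADOOP_BIN_PATH pointing at libexec."""
    patched = []
    for line in text.splitlines(keepends=True):
        # Only touch the exact line; everything else passes through unchanged.
        if line.strip() == OLD_LINE:
            line = line.replace(OLD_LINE, NEW_LINE)
        patched.append(line)
    return "".join(patched)
```

Running the function a second time on its own output changes nothing, so it is safe to re-apply. Keeping a backup copy of pig.cmd before overwriting it is still advisable.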

Failed to locate the winutils binary in the hadoop binary path

I am getting the following error while starting namenode for latest hadoop-2.2 release. I didn't find winutils exe file in hadoop bin folder. I tried below commands
$ bin/hdfs namenode -format
$ sbin/yarn-daemon.sh start resourcemanager
ERROR [main] util.Shell (Shell.java:getWinUtilsPath(303)) - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:863)

Simple Solution:
Download it from here and add to $HADOOP_HOME/bin
(Source)
IMPORTANT UPDATE:
For hadoop-2.6.0 you can download binaries from Titus Barik blog >>.
I needed not only to point HADOOP_HOME to the extracted directory [path], but also to provide the system property -Djava.library.path=[path]\bin to load the native libs (DLLs).

If you face this problem when running a self-contained local application with Spark (i.e., after adding spark-assembly-x.x.x-hadoopx.x.x.jar or the Maven dependency to the project), a simpler solution is to put winutils.exe (download from here) in "C:\winutil\bin". You can then set the Hadoop home directory to C:\winutil by adding the following line to the code:
System.setProperty("hadoop.home.dir", "c:\\winutil\\");
Source: Click here

If we directly take the binary distribution of Apache Hadoop 2.2.0 release and try to run it on Microsoft Windows, then we'll encounter ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path.
The binary distribution of Apache Hadoop 2.2.0 release does not contain some windows native components (like winutils.exe, hadoop.dll etc). These are required (not optional) to run Hadoop on Windows.
So you need to build a Windows native binary distribution of Hadoop from source, following the "BUILDING.txt" file located inside the Hadoop source distribution. You can also follow the posts below for a step-by-step guide with screenshots.
Build, Install, Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS
ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path

The statement
java.io.IOException: Could not locate executable null\bin\winutils.exe
explains that null is received when expanding or replacing an environment variable. If you look at the source of Shell.java in the Common package, you will find that the HADOOP_HOME variable is not set, so null is used in its place, hence the error.
So either HADOOP_HOME or the hadoop.home.dir property needs to be set properly.
Hope this helps.
Thanks,
Kamleshwar.

Winutils.exe is used for running the shell commands for SPARK.
When you need to run the Spark without installing Hadoop, you need this file.
Steps are as follows:
Download the winutils.exe from following location for hadoop 2.7.1
https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin
[NOTE: If you are using a different Hadoop version, please download winutils from the corresponding Hadoop version folder on GitHub, at the location mentioned above.]
Now, create a folder 'winutils' in C:\ drive. Now create a folder 'bin' inside folder 'winutils' and copy the winutils.exe in that folder.
So the location of winutils.exe will be C:\winutils\bin\winutils.exe
Now, open environment variable and set HADOOP_HOME=C:\winutils
[NOTE: Please do not add \bin in HADOOP_HOME and no need to set HADOOP_HOME in Path]
Your issue should now be resolved!
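
To double-check the layout described in the steps above, the expected location of winutils.exe can be derived from HADOOP_HOME and tested for existence. This is only a sketch; winutils_location and has_winutils are hypothetical helper names, not Hadoop or Spark functions.

```python
from pathlib import Path


def winutils_location(hadoop_home):
    """Return where winutils.exe is expected: <HADOOP_HOME>/bin/winutils.exe."""
    return Path(hadoop_home) / "bin" / "winutils.exe"


def has_winutils(hadoop_home):
    """True if winutils.exe exists under the bin folder of `hadoop_home`."""
    return winutils_location(hadoop_home).is_file()


if __name__ == "__main__":
    import os
    home = os.environ.get("HADOOP_HOME")
    if home is None:
        print("HADOOP_HOME is not set")
    else:
        print(winutils_location(home), "found" if has_winutils(home) else "missing")
```

If HADOOP_HOME already includes \bin, the check looks in ...\bin\bin and reports the file as missing, which is why the note above says not to add \bin to HADOOP_HOME.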

I just ran into this issue while working with Eclipse. In my case, I had the correct Hadoop version downloaded (hadoop-2.5.0-cdh5.3.0.tgz), I extracted the contents and placed it directly in my C drive. Then I went to
Eclipse->Debug/Run Configurations -> Environment (tab) -> and added
variable: HADOOP_HOME
Value: C:\hadoop-2.5.0-cdh5.3.0

You can download winutils.exe here:
http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
Then copy it to your HADOOP_HOME/bin directory.

In PySpark, to run a local Spark application from PyCharm, use the lines below:
import os
os.environ['HADOOP_HOME'] = "C:\\winutils"
print(os.environ['HADOOP_HOME'])

winutils.exe is required for Hadoop to run Hadoop-related commands. Please download the hadoop-common-2.2.0 zip file; winutils.exe can be found in its bin folder. Extract the zip file and copy winutils.exe into your local hadoop/bin folder.

I was facing the same problem. Removing the bin\ from the HADOOP_HOME path solved it for me. The path for HADOOP_HOME variable should look something like.
C:\dev\hadoop2.6\
System restart may be needed. In my case, restarting the IDE was sufficient.

As most answers here refer to pretty old versions of winutils, I will leave a link to the most comprehensive repository, which supports all versions of Hadoop including the most recent ones:
https://github.com/kontext-tech/winutils
(find the directory corresponding to your Hadoop version, or try the most recent one).
If you have admin permissions on your machine:
Put the bin directory into C:\winutils
The whole path should be C:\winutils\bin\winutils.exe
Set HADOOP_HOME to C:\winutils
If you don't have admin permissions or want to put the binaries into user space:
Put the bin directory into C:\Users\vryabtse\AppData\Local\Programs\winutils or a similar user directory.
Set HADOOP_HOME to the path of this directory.

Set up HADOOP_HOME variable in windows to resolve the problem.
You can find answer in org/apache/hadoop/hadoop-common/2.2.0/hadoop-common-2.2.0-sources.jar!/org/apache/hadoop/util/Shell.java :
IOException from
public static final String getQualifiedBinPath(String executable)
    throws IOException {
  // construct hadoop bin path to the specified executable
  String fullExeName = HADOOP_HOME_DIR + File.separator + "bin"
      + File.separator + executable;
  File exeFile = new File(fullExeName);
  if (!exeFile.exists()) {
    throw new IOException("Could not locate executable " + fullExeName
        + " in the Hadoop binaries.");
  }
  return exeFile.getCanonicalPath();
}
HADOOP_HOME_DIR from
// first check the Dflag hadoop.home.dir with JVM scope
String home = System.getProperty("hadoop.home.dir");
// fall back to the system/user-global env variable
if (home == null) {
  home = System.getenv("HADOOP_HOME");
}
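
The literal text null\bin\winutils.exe in the error message follows directly from the string concatenation in the quoted Java code: when no home directory was resolved, Java renders the null reference as the word "null". A small Python sketch of that behaviour, using the hypothetical helper name qualified_bin_path and assuming the Windows separator (a backslash):

```python
def qualified_bin_path(hadoop_home, executable):
    """Build <home>\\bin\\<executable> the way the quoted Java code does."""
    # Java renders a null reference as the text "null" in string concatenation.
    home_text = "null" if hadoop_home is None else hadoop_home
    separator = "\\"  # File.separator on Windows
    return home_text + separator + "bin" + separator + executable


if __name__ == "__main__":
    # Neither hadoop.home.dir nor HADOOP_HOME was set.
    print(qualified_bin_path(None, "winutils.exe"))
    # HADOOP_HOME points at the folder that contains bin.
    print(qualified_bin_path("C:\\winutils", "winutils.exe"))
```

The first call reproduces the path from the error message; the second shows the path that is checked once HADOOP_HOME (or hadoop.home.dir) is set.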

Download the desired version of the hadoop folder from this link as a zip (if you are installing Spark on Windows, pick the Hadoop version your Spark was built for).
Extract the zip to the desired directory.
You need a directory of the form hadoop\bin (explicitly create such a hadoop\bin directory structure if you want), with bin containing all the files from the bin folder of the downloaded Hadoop. This will include many files such as hdfs.dll and hadoop.dll in addition to winutils.exe.
Now create the environment variable HADOOP_HOME and set it to <path-to-hadoop-folder>\hadoop. Then add ;%HADOOP_HOME%\bin; to the PATH environment variable.
Open a "new command prompt" and try rerunning your command.

Download [winutils.exe]
From URL :
https://github.com/steveloughran/winutils/hadoop-version/bin
Paste it under HADOOP_HOME/bin
Note : You should Set environmental variables:
User variable:
Variable: HADOOP_HOME
Value: Hadoop or spark dir

I used the "hbase-1.3.0" and "hadoop-2.7.3" versions. Setting the HADOOP_HOME environment variable and copying the 'winutils.exe' file into the HADOOP_HOME/bin folder solves the problem on Windows.
Be careful to set HADOOP_HOME to the Hadoop installation folder (appending /bin is not necessary for these versions).
Additionally, I preferred using the cross-platform tool Cygwin to get Linux-like functionality (as far as possible), because the HBase team recommends a Linux/Unix environment.

I was getting the same issue in windows. I fixed it by
Downloading hadoop-common-2.2.0-bin-master from link.
Create a user variable HADOOP_HOME in Environment variable and assign the path of hadoop-common bin directory as a value.
You can verify it by running hadoop in cmd.
Restart the IDE and Run it.

I recently got the same error message while running spark application on Intellij Idea. What I did was, I downloaded the winutils.exe that is compatible with the Spark version I was running and moved it to the Spark bin directory. Then in my Intellij, I edited the configuration.
The 'Environment variables' area was empty. So, I entered HADOOP_HOME = P:\spark-2.4.7-bin-hadoop2.7
Since, the winutils.exe is in the P:\spark-2.4.7-bin-hadoop2.7\bin directory, it will locate the file while running.
So, by setting HADOOP_HOME, the null would be the HADOOP_HOME directory. Complete path would be P:\spark-2.4.7-bin-hadoop2.7\bin\winutils.exe
That was how I resolved it
