Failed to start master for spark in windows 10 - hadoop

I am new to Spark and am trying to start the master manually (using MINGW64 on Windows 10).
When I do this,
~/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/sbin
$ ./start-master.sh
I get these logs:
ps: unknown option -- o
Try `ps --help' for more information.
starting org.apache.spark.deploy.master.Master, logging to /c/Users/Raunak/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-RINKU-CISPL.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch org.apache.spark.deploy.master.Master:
Spark Command: C:\Program Files\Java\jre1.8.0_77\bin\java -cp C:/Users/Raunak/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/sbin/../conf\;C:/Users/Raunak/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar;C:\Users\Raunak\Downloads\spark-1.5.1-bin-hadoop2.4\spark-1.5.1-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\Users\Raunak\Downloads\spark-1.5.1-bin-hadoop2.4\spark-1.5.1-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\Users\Raunak\Downloads\spark-1.5.1-bin-hadoop2.4\spark-1.5.1-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip RINKU-CISPL --port 7077 --webui-port 8080
What am I doing wrong? Do I also have to configure the Hadoop package for Spark?

Just found the answer here: https://spark.apache.org/docs/1.2.0/spark-standalone.html
"Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand."
I think Windows is not a good choice for Spark; anyway, good luck!
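Following that note, the master and a worker can be launched by hand with the spark-class launcher instead of the sbin scripts. A hedged sketch (the bin\spark-class.cmd launcher ships with the 1.5.x distribution; host name and ports are taken from the log above, and --ip matches the flag the launch script itself used):

```
:: From a cmd.exe prompt in the Spark root directory, start the master:
bin\spark-class.cmd org.apache.spark.deploy.master.Master --ip RINKU-CISPL --port 7077 --webui-port 8080
:: Then, in a second prompt, attach a worker to it:
bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://RINKU-CISPL:7077
```

The web UI should then be reachable at http://localhost:8080 to confirm the worker registered.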

Related

start-all.sh command not found

I have just installed the Cloudera VM setup for Hadoop. But when I open the command prompt and want to start all the Hadoop daemons using the command 'start-all.sh', I get an error stating "bash: start-all.sh: command not found".
I have tried 'start-dfs.sh' too, yet it still gives the same error. When I use the 'jps' command, I can see that none of the daemons have been started.
You can find the start-all.sh and start-dfs.sh scripts in the bin or sbin folders. Go to the Hadoop installation folder and run the following command to locate them:
find . -name 'start-all.sh' # Finds files having name similar to start-all.sh
Then you can start all the daemons by giving the full path: bash /path/to/start-all.sh
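As a runnable illustration of that find-then-invoke pattern (the /tmp/hadoop-demo tree and its stub start-all.sh are made up for the demo; in practice you would run find from the real Hadoop installation folder):

```shell
# Fake installation tree standing in for the Hadoop folder (demo only).
mkdir -p /tmp/hadoop-demo/sbin
printf '#!/bin/sh\necho started\n' > /tmp/hadoop-demo/sbin/start-all.sh

cd /tmp/hadoop-demo
# Locate the script relative to the installation folder...
script=$(find . -name 'start-all.sh')
# ...then run it by path, as suggested above.
bash "$script"
```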
If you're using the QuickStart VM then the right way to start the cluster (as @cricket_007 hinted) is by restarting it in the Cloudera Manager UI. The start-all.sh scripts will not work since those only apply to the Hadoop servers (Name Node, Data Node, Resource Manager, Node Manager ...) but not all the services in the ecosystem (like Hive, Impala, Spark, Oozie, Hue ...).
You can refer to the YouTube video and the official documentation, "Starting, Stopping, Refreshing, and Restarting a Cluster".

Spark submit with master as yarn-client (windows) gives Error "Could not find or load main class"

I have installed Hadoop 2.7.1 with Spark 1.4.1 on Windows 8.1.
When I execute the command below:
cd spark
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples*.jar 10
I get the error below in the JobHistoryServer log:
Error: Could not find or load main class '-Dspark.externalBlockStore.folderName=spark-262c4697-ef0c-4042-af0c-8106b08574fb'
I did further debugging (along with searching the net) and got hold of the container cmd script, in which the sections below appear (other lines are omitted):
...
#set CLASSPATH=C:/tmp/hadoop-xyz/nm-local-dir/usercache/xyz/appcache/application_1487502025818_0003/container_1487502025818_0003_02_000001/classpath-3207656532274684591.jar
...
#call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.fileserver.uri=http://192.168.1.2:34814' '-Dspark.app.name=Spark shell' '-Dspark.driver.port=34810' '-Dspark.repl.class.uri=http://192.168.1.2:34785' '-Dspark.driver.host=192.168.1.2' '-Dspark.externalBlockStore.folderName=spark-dd9f3f84-6cf4-4ff8-b0f6-7ff84daf74bc' '-Dspark.master=yarn-client' '-Dspark.driver.appUIAddress=http://192.168.1.2:4040' '-Dspark.jars=' '-Dspark.executor.id=driver' -Dspark.yarn.app.container.log.dir=/dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg '192.168.1.2:34810' --executor-memory 1024m --executor-cores 1 --num-executors 2 1> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stdout 2> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stderr
I checked the relevant files for CLASSPATH and they look OK. The main class org.apache.spark.deploy.yarn.ExecutorLauncher is available in the Spark assembly jar, which is part of the container jar.
So what could be the issue here? I searched the net and found many discussions, but they are for Unix variants; there is not much for Windows. I am wondering whether spark-submit really works on Windows (in yarn-client mode only; standalone cluster mode works) without any special setup.
BTW, if I run the above java command from a cmd.exe prompt, I get the same error, since all the command-line arguments are quoted with single quotes instead of double quotes (changing them to double quotes makes it work!). So is this a bug?
Note that spark-shell also fails (in yarn mode), but the yarn jar ... command works.
Looks like it was a defect in the earlier version. With the latest Hadoop 2.7.3 and Spark 2.1.0 it is working correctly. I could not find any reference for the fix, though.

Windows: Apache Spark History Server Config

I wanted to use Spark's History Server to make use of the logging mechanisms of my Web UI, but I am having some difficulty running it on my Windows machine.
I have done the following:
Set my spark-defaults.conf file to reflect
spark.eventLog.enabled=true
spark.eventLog.dir=file://C:/spark-1.6.2-bin-hadoop2.6/logs
spark.history.fs.logDirectory=file://C:/spark-1.6.2-bin-hadoop2.6/logs
My spark-env.sh to reflect:
SPARK_LOG_DIR "file://C:/spark-1.6.2-bin-hadoop2.6/logs"
SPARK_HISTORY_OPTS "-Dspark.history.fs.logDirectory=file://C:/spark-1.6.2-bin-hadoop2.6/logs"
I am using Git-BASH to run the start-history-server.sh file, like this:
USERA@SYUHUH MINGW64 /c/spark-1.6.2-bin-hadoop2.6/sbin
$ sh start-history-server.sh
And, I get this error:
USERA@SYUHUH MINGW64 /c/spark-1.6.2-bin-hadoop2.6/sbin
$ sh start-history-server.sh
C:\spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh: line 69: SPARK_LOG_DIR: command not found
C:\spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh: line 70: SPARK_HISTORY_OPTS: command not found
ps: unknown option -- o
Try `ps --help' for more information.
starting org.apache.spark.deploy.history.HistoryServer, logging to C:\spark-1.6.2-bin-hadoop2.6/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-SGPF02M9ZB.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch org.apache.spark.deploy.history.HistoryServer:
Spark Command: C:\Program Files (x86)\Java\jdk1.8.0_91\bin\java -cp C:\spark-1.6.2-bin-hadoop2.6/conf\;C:\spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
full log in C:\spark-1.6.2-bin-hadoop2.6/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-SGPF02M9ZB.out
The full log from the output can be found below:
Spark Command: C:\Program Files (x86)\Java\jdk1.8.0_91\bin\java -cp C:\spark-1.6.2-bin-hadoop2.6/conf\;C:\spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
I am running a sparkR script where I initialize my spark context and then call init().
Please advise: should I be running the history server before I run my Spark script?
Pointers and tips on how to proceed (with respect to logging) would be greatly appreciated.
On Windows you'll need to run Spark's .cmd files, not the .sh ones. From what I saw, there is no .cmd script for the Spark history server, so basically it needs to be run manually.
I have followed the history server's Linux script, and to run it manually on Windows you'll need to take the following steps:
All history server configurations should be set in the spark-defaults.conf file (remove the .template suffix), as described below.
Go to the Spark config directory and add the spark.history.* configurations to %SPARK_HOME%/conf/spark-defaults.conf, as follows:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/dir/path
After configuration is finished, run the following command from %SPARK_HOME%:
bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
The output should be something like this:
16/07/22 18:51:23 INFO Utils: Successfully started service on port 18080.
16/07/22 18:51:23 INFO HistoryServer: Started HistoryServer at http://10.0.240.108:18080
16/07/22 18:52:09 INFO ShutdownHookManager: Shutdown hook called
Hope that it helps! :-)
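One more note: the "SPARK_LOG_DIR: command not found" messages in the question come from spark-env.sh being sourced as a shell script, so the settings need assignment syntax (an equals sign and, typically, export) rather than a name followed by a value. A hedged sketch of corrected lines, reusing the paths from the question:

```shell
# spark-env.sh is sourced by the launch scripts; a line like
# `SPARK_LOG_DIR "..."` tries to run SPARK_LOG_DIR as a command,
# which is exactly the "command not found" error shown above.
export SPARK_LOG_DIR="C:/spark-1.6.2-bin-hadoop2.6/logs"
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///C:/spark-1.6.2-bin-hadoop2.6/logs"
```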
In case anyone gets the following exception:
17/05/12 20:27:50 ERROR FsHistoryProvider: Exception encountered when attempting to load application log file:/C:/Spark/Logs/spark--org.apache.spark.deploy.history.HistoryServer-1-Arsalan-PC.out
java.lang.IllegalArgumentException: Codec [out] is not available. Consider setting spark.io.compression.codec=snappy
at org.apache.spark.io.CompressionCodec$$anonfun$createCodec$1.apply(Com
Just go to SparkHome/conf/spark-defaults.conf
and set
spark.eventLog.compress false

How to start Datanode? (Cannot find start-dfs.sh script)

We are setting up automated deployments on a headless system: so using the GUI is not an option here.
Where is the start-dfs.sh script for HDFS in Hortonworks Data Platform? CDH / Cloudera packages those files under the hadoop/sbin directory. However, when we search for those scripts under HDP they are not found:
$ pwd
/usr/hdp/current
Which scripts exist in HDP?
[stack@s1-639016 current]$ find -L . -name \*.sh
./hadoop-hdfs-client/sbin/refresh-namenodes.sh
./hadoop-hdfs-client/sbin/distribute-exclude.sh
./hadoop-hdfs-datanode/sbin/refresh-namenodes.sh
./hadoop-hdfs-datanode/sbin/distribute-exclude.sh
./hadoop-hdfs-nfs3/sbin/refresh-namenodes.sh
./hadoop-hdfs-nfs3/sbin/distribute-exclude.sh
./hadoop-hdfs-secondarynamenode/sbin/refresh-namenodes.sh
./hadoop-hdfs-secondarynamenode/sbin/distribute-exclude.sh
./hadoop-hdfs-namenode/sbin/refresh-namenodes.sh
./hadoop-hdfs-namenode/sbin/distribute-exclude.sh
./hadoop-hdfs-journalnode/sbin/refresh-namenodes.sh
./hadoop-hdfs-journalnode/sbin/distribute-exclude.sh
./hadoop-hdfs-portmap/sbin/refresh-namenodes.sh
./hadoop-hdfs-portmap/sbin/distribute-exclude.sh
./hadoop-client/sbin/hadoop-daemon.sh
./hadoop-client/sbin/slaves.sh
./hadoop-client/sbin/hadoop-daemons.sh
./hadoop-client/etc/hadoop/hadoop-env.sh
./hadoop-client/etc/hadoop/kms-env.sh
./hadoop-client/etc/hadoop/mapred-env.sh
./hadoop-client/conf/hadoop-env.sh
./hadoop-client/conf/kms-env.sh
./hadoop-client/conf/mapred-env.sh
./hadoop-client/libexec/kms-config.sh
./hadoop-client/libexec/init-hdfs.sh
./hadoop-client/libexec/hadoop-layout.sh
./hadoop-client/libexec/hadoop-config.sh
./hadoop-client/libexec/hdfs-config.sh
./zookeeper-client/conf/zookeeper-env.sh
./zookeeper-client/bin/zkCli.sh
./zookeeper-client/bin/zkCleanup.sh
./zookeeper-client/bin/zkServer-initialize.sh
./zookeeper-client/bin/zkEnv.sh
./zookeeper-client/bin/zkServer.sh
Notice: there are zero start/stop .sh scripts.
In particular I am interested in the start-dfs.sh script that starts the namenode(s), journalnode, and datanodes.
How to start DataNode
su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode";
Github - Hortonworks Start Scripts
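By analogy with the datanode command above, the other HDFS daemons can be started the same way. A hedged sketch (the hadoop-daemon.sh path and --config directory are those shown above; daemon names follow the standard HDFS set):

```shell
# Same hadoop-daemon.sh launcher, other daemon names (sketch, not verified
# against every HDP release; adjust paths to your layout).
su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode"
su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start journalnode"
# Replace "start" with "stop" to shut a daemon down.
```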
Update
Decided to hunt for it myself.
Spun up a single node with Ambari, installed HDP 2.2 (a), HDP 2.3 (b)
sudo find / -name \*.sh | grep start
Found
(a) /usr/hdp/2.2.8.0-3150/hadoop/src/hadoop-hdfs-project/hadoop-hdfs/src/main/bin/start-dfs.sh
Weird that it doesn't exist in /usr/hdp/current, which should be symlinked.
(b) /hadoop/yarn/local/filecache/10/mapreduce.tar.gz/hadoop/sbin/start-dfs.sh
The recommended way to administer your Hadoop cluster is via the administrator panel. Since you are working with the Hortonworks distribution, it makes more sense for you to use Ambari instead.

Could not find and execute start-all.sh and Stop-all.sh on Cloudera VM for Hadoop

How do I start/stop services from the command line in CDH4? I am new to Hadoop. I installed the VM from Cloudera and could not find start-all.sh and stop-all.sh. How do I stop or start the task tracker or data node when I want to? It is a single-node cluster which I am using on CentOS. I haven't done any modifications.
Moreover, I see there are changes in the directory structures in all flavours. I could not locate these .sh files on the VM for my installation.
[cloudera@localhost ~]$ stop-all.sh
bash: stop-all.sh: command not found
Highly appreciate your support.
Use sudo su hdfs to start; to stop, just type exit and it will stop all the services.
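Alternatively, on CDH package installs the daemons are registered as system services, so individual ones can be driven with the service command. A hedged sketch (exact service names depend on which CDH4 packages are installed; the MRv1 names below are assumptions to verify against your install):

```shell
# Sketch assuming standard CDH4 package/service names:
sudo service hadoop-hdfs-datanode start
sudo service hadoop-0.20-mapreduce-tasktracker start
sudo service hadoop-hdfs-datanode stop
```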
