Windows: Apache Spark History Server Config - windows

I wanted to use Spark's History Server to make use of the logging mechanisms of my Web UI, but I find some difficulty in running this code on my Windows machine.
I have done the following:
Set my spark-defaults.conf file to reflect
spark.eventLog.enabled=true
spark.eventLog.dir=file://C:/spark-1.6.2-bin-hadoop2.6/logs
spark.history.fs.logDirectory=file://C:/spark-1.6.2-bin-hadoop2.6/logs
My spark-env.sh to reflect:
SPARK_LOG_DIR "file://C:/spark-1.6.2-bin-hadoop2.6/logs"
SPARK_HISTORY_OPTS "-Dspark.history.fs.logDirectory=file://C:/spark-1.6.2-bin-hadoop2.6/logs"
I am using Git-BASH to run the start-history-server.sh file, like this:
USERA#SYUHUH MINGW64 /c/spark-1.6.2-bin-hadoop2.6/sbin
$ sh start-history-server.sh
And, I get this error:
USERA#SYUHUH MINGW64 /c/spark-1.6.2-bin-hadoop2.6/sbin
$ sh start-history-server.sh
C:\spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh: line 69: SPARK_LOG_DIR: command not found
C:\spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh: line 70: SPARK_HISTORY_OPTS: command not found
ps: unknown option -- o
Try `ps --help' for more information.
starting org.apache.spark.deploy.history.HistoryServer, logging to C:\spark-1.6.2-bin-hadoop2.6/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-SGPF02M9ZB.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch org.apache.spark.deploy.history.HistoryServer:
Spark Command: C:\Program Files (x86)\Java\jdk1.8.0_91\bin\java -cp C:\spark-1.6.2-bin-hadoop2.6/conf\;C:\spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
full log in C:\spark-1.6.2-bin-hadoop2.6/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-SGPF02M9ZB.out
The full log from the output can be found below:
Spark Command: C:\Program Files (x86)\Java\jdk1.8.0_91\bin\java -cp C:\spark-1.6.2-bin-hadoop2.6/conf\;C:\spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
I am running a sparkR script where I initialize my spark context and then call init().
Please advise whether I should be running the history server before I run my spark script?
Pointers & tips to proceed(with respect to logging) would be greatly appreciated.

On Windows you'll need to run the .cmd files of Spark not .sh. According to what I saw, there is no .cmd script for Spark history server. So basically it needs to be run manually.
I have followed the history server Linux script and in order to run it manually on Windows you'll need to take the following steps:
All history server configurations should be set at the spark-defaults.conf file (remove .template suffix) as described below
You should go to spark config directory and add the spark.history.* configurations to %SPARK_HOME%/conf/spark-defaults.conf. As follows:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/dir/path
After configuration is finished run the following command from %SPARK_HOME%
bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
The output should be something like that:
16/07/22 18:51:23 INFO Utils: Successfully started service on port 18080.
16/07/22 18:51:23 INFO HistoryServer: Started HistoryServer at http://10.0.240.108:18080
16/07/22 18:52:09 INFO ShutdownHookManager: Shutdown hook called
Hope that it helps! :-)

in case any one gets the floowing exception:
17/05/12 20:27:50 ERROR FsHistoryProvider: Exception encountered when attempting
to load application log file:/C:/Spark/Logs/spark--org.apache.spark.deploy.hist
ory.HistoryServer-1-Arsalan-PC.out
java.lang.IllegalArgumentException: Codec [out] is not available. Consider setti
ng spark.io.compression.codec=snappy
at org.apache.spark.io.CompressionCodec$$anonfun$createCodec$1.apply(Com
Just go to SparkHome/config/spark-defaults.conf
and set
spark.eventLog.compress false

Related

Spark submit with master as yarn-client (windows) gives Error "Could not find or load main class"

I have installed Hadoop2.7.1 with spark 1.4.1 on windows 8.1
When I execute below command
cd spark
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client lib/spark-examples*.jar 10
I get below error in JobHistoryServer log
Error: Could not find or load main class '-Dspark.externalBlockStore.folderName=spark-262c4697-ef0c-4042-af0c-8106b08574fb'
I did further debugging(along searching net) and could get hold of container cmd script where below sections(other lines are omitted) are given
...
#set CLASSPATH=C:/tmp/hadoop-xyz/nm-local-dir/usercache/xyz/appcache/application_1487502025818_0003/container_1487502025818_0003_02_000001/classpath-3207656532274684591.jar
...
#call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.fileserver.uri=http://192.168.1.2:34814' '-Dspark.app.name=Spark shell' '-Dspark.driver.port=34810' '-Dspark.repl.class.uri=http://192.168.1.2:34785' '-Dspark.driver.host=192.168.1.2' '-Dspark.externalBlockStore.folderName=spark-dd9f3f84-6cf4-4ff8-b0f6-7ff84daf74bc' '-Dspark.master=yarn-client' '-Dspark.driver.appUIAddress=http://192.168.1.2:4040' '-Dspark.jars=' '-Dspark.executor.id=driver' -Dspark.yarn.app.container.log.dir=/dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg '192.168.1.2:34810' --executor-memory 1024m --executor-cores 1 --num-executors 2 1> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stdout 2> /dep/logs/userlogs/application_1487502025818_0003/container_1487502025818_0003_02_000001/stderr
I check relevant files for CLASSPATH, looks OK. The main class org.apache.spark.deploy.yarn.ExecutorLauncher is available in spark assembly jar which is part of container jar
So, what could be the issue here? I searched net and found many discussions, but are for unix variants, but not many for Windows. I am wondering whether spark submit really works on Windows (yarn-client mode only, standalone cluster mode works) without any special setup!!!
BTW, if I run the above java command from cmd.exe command prompt, I get the same error as all command line arguments are quoted with single quote instead of double quotes(making these double quotes work!!), so is this a bug
Note spark-shell also fails (in yarn mode) and but yarn jar ... command works
Looks like it was a defect in earlier version. With latest Hadoop 2.7.3 with spark 2.1.0, it is working correctly.!!! Could not find any reference though.

Failed to start master for spark in windows 10

I am new to Spark, I am trying to start master manually (using MINGW64 in windows 10).
When I do this,
~/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/sbin
$ ./start-master.sh
I got these logs,
ps: unknown option -- o
Try `ps --help' for more information.
starting org.apache.spark.deploy.master.Master, logging to /c/Users/Raunak/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-RINKU-CISPL.out
ps: unknown option -- o
Try `ps --help' for more information.
**failed to launch org.apache.spark.deploy.master.Master:**
Spark Command: C:\Program Files\Java\jre1.8.0_77\bin\java -cp C:/Users/Raunak/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/sbin/../conf\;C:/Users/Raunak/Downloads/spark-1.5.1-bin-hadoop2.4/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar;C:\Users\Raunak\Downloads\spark-1.5.1-bin-hadoop2.4\spark-1.5.1-bin-hadoop2.4\lib\datanucleus-api-jdo-3.2.6.jar;C:\Users\Raunak\Downloads\spark-1.5.1-bin-hadoop2.4\spark-1.5.1-bin-hadoop2.4\lib\datanucleus-core-3.2.10.jar;C:\Users\Raunak\Downloads\spark-1.5.1-bin-hadoop2.4\spark-1.5.1-bin-hadoop2.4\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip RINKU-CISPL --port 7077 --webui-port 8080
What am I doing wrong , Should I have to configure Hadoop package also for Spark?
Just found answer here: https://spark.apache.org/docs/1.2.0/spark-standalone.html
"Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand."
I think windows is not a good choice for Spark, anyway good luck!

Unable to run Hadoop on windows 7

I am new to Hadoop and trying to run it on Windows 7.
Whenever I am trying to run hadoop bash script, I get the following error :
'-Xmx32m' is not recognized as an internal or external command,
operable program or batch file.
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
credential interact with credential providers
key manage keys via the KeyProvider
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Also, when I run hdfs command ,
I get the following error :
-Xms1000m is not recognized as in internal or external command.
When I try to pass -Xmx and -Xms arguments, I get the following message :
Error occurred during initialization of VM
Could not reserve enough space for object heap
Can anyone help me out on this ?
The error message
is not recognized as an internal or external command
indicates that you attempted to run from the command line a program that Windows doesn't recognize. This likely has nothing to do with -Xms and -Xmx. The problem is Windows cannot find java.
Make sure you can ran java -version no matter what's the current folder you are in. If you can't, you need to add the java at the PATH environment variable.
This could also be an issue of installing java or hadoop in a folder that has spaces in the path e.g. C:\Program Files has a space in the folder and that can be a problem. If that's the cause then install java and hadoop on a different folder without spaces in the path.

hbase(main):007:0> create 'test', 'data' Error

I installed and configured hbase-0.94.2, while Connecting to the running instance of HBase using the hbase shell command and trying to create a table named test with a single column family named data :
hbase(main):007:0> create 'test', 'data'the shell displays an error and a stack trace .
what should I do to resolve this.
I followed this tutorial
Actually I resolved this problem by restarting Hbase
step1:
$cd /usr/local/hbase/bin
step2:
$./start-hbase.sh
localhost: zookeeper running as process 3669. Stop it first.
master running as process 3783. Stop it first.
localhost: regionserver running as process 3926. Stop it first.
step3:
$kill 3669
$kill 3783
$kill 3926
step4:
./start-hbase.sh
step5:
verify if it works : http://localhost:60010/
posting the stacktrace you will be useful for us to help you out.
also, 0.94.2 is old you should consider moving to a 0.98.x release.
here are my easy step to start using hbase:
$ wget https://archive.apache.org/dist/hbase/hbase-0.98.0/hbase-0.98.0-hadoop2-bin.tar.gz
$ tar xzvf hbase-0.98.0-hadoop2-bin.tar.gz
$ export HBASE_HOME=pwd/hbase-0.98.0-hadoop2
$ export PATH=$HBASE_HOME/bin:$PATH
$ start-hbase.sh
Now hbase is up and running and you can start using the shell
$ hbase shell

Error: Failed to create Data Storage while running embedded pig in java

I wrote a simple program to test the embedded pig in java to run in mapreduce mode.
The hadoop version in the server I am running is 0.20.2-cdh3u4a, and pig version is 0.10.0-cdh3u4a.
When I try to run in local mode, it runs successfully. But when I try to run in mapreduce mode, it gives me the error.
I run my program using the following commands as shown in http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
javac -cp pig.jar EmbedPigTest.java
javac -cp pig.jar:.:/etc/hadoop/conf EmbedPigTest.java input.txt
My program gives error as:
Exception in thread "main" java.lang.RuntimeException: Failed to create DataStorage
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
at org.apache.pig.PigServer.<init>(PigServer.java:226)
at org.apache.pig.PigServer.<init>(PigServer.java:215)
at org.apache.pig.PigServer.<init>(PigServer.java:211)
at org.apache.pig.PigServer.<init>(PigServer.java:207)
at WordCount.main(EmbedPigTest.java:9)
In some online resources they say that this problem occurs due to different hadoop version. But, I didn't understand what I should do. Suggestions please !!
This is happening because you are linking to the wrong jar, Please see the link below it describes this issue very well.
http://localsteve.wordpress.com/2012/09/30/embedding-pig-for-cdh4-java-apps-fer-realz/
I was faced same kind of issue when I tried to use pig in map reduce mode without starting the services.
Please check all services using jps before using pig in map reduce mode.

Resources