Debugging Spark standalone cluster with idea - macos

I am trying to debug a Spark Application on a local cluster using a master and a worker nodes. I have been successful at setting up the master node and worker nodes using Spark standalone cluster manager with start-master.sh and it works.But I want to how Spark Application works in the spark cluster, so I want to start the cluster in debug mode. I read the start-master.sh codeļ¼Œ mock the args and start org.apache.spark.deploy.master.Master main method.Unfortunately it gets a NoClassDefFoundError,I can't open the webui. I want to know where the problem is.
The Error is :
Exception in thread "dispatcher-event-loop-1" java.lang.NoClassDefFoundError: org/eclipse/jetty/util/thread/ThreadPool
at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:81)
at org.apache.spark.deploy.master.ui.MasterWebUI.initialize(MasterWebUI.scala:48)
at org.apache.spark.deploy.master.ui.MasterWebUI.<init>(MasterWebUI.scala:43)
at org.apache.spark.deploy.master.Master.onStart(Master.scala:131)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:122)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:216)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.util.thread.ThreadPool
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more
my debug configurations is:
enter image description here
Thanks!

I would suggest not to even use a spark standalone cluster for debugging.
You can run spark locally in the your IDE with breakpoints.
Spark provides you option to run locally pointing to local filesystem as HDFS.
Please follow the following link to know more about how to write test cases for local mode in spark
http://bytepadding.com/big-data/spark/word-count-in-spark/

Related

Accessing hdfs from docker-hadoop-spark--workbench via zeppelin

I have installed https://github.com/big-data-europe/docker-hadoop-spark-workbench
Then started it up with docker-compose up . I navigated to the various urls mentioned in the git readme and all appears to be up.
I then started a local apache zeppelin with:
./bin/zeppelin.sh start
In zeppelin interpreter settings i have navigated then to spark interpreter and updated the master to point to the local cluster installed with docker
master: updated from from local[*] to spark://localhost:8080
I then run in a notebook the following code:
import org.apache.hadoop.fs.{FileSystem,Path}
FileSystem.get( sc.hadoopConfiguration ).listStatus( new Path("hdfs:///")).foreach( x => println(x.getPath ))
I get this exception in zeppelin logs:
INFO [2017-12-15 18:06:35,704] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20171212-200101_1553252595 using null org.apache.zeppelin.interpreter.LazyOpenInterpreter#32d09a20
WARN [2017-12-15 18:07:37,717] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2064) - Job 20171212-200101_1553252595 is finished, status: ERROR, exception: null, result: %text java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:398)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:387)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:843)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:70)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:491)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
How can I access the hdfs from zeppelin and java/spark code?
Reason for the exception is that the sparkSession object is null for some reason in Zeppelin.
Reference:
https://github.com/apache/zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/SparkInterpreter.java
private SparkContext createSparkContext_2() {
return (SparkContext) Utils.invokeMethod(sparkSession, "sparkContext");
}
Might be a configuration related issue. Please cross-verify the settings/configuration and spark cluster settings. Make sure that spark is working fine.
Reference: https://zeppelin.apache.org/docs/latest/interpreter/spark.html
Hope this helps.

worker is getting restarted continuously with closedchannel exception in supervisor

worker which is in one of the supervisor is getting restarted continuously and getting Closedchannel exception . But if run the same topology in another storm cluster which is in another environment , it is running without giving any errors.
Below is the error i can see from Storm UI.
java.lang.RuntimeException: java.nio.channels.ClosedChannelException at org.apache.storm.kafka.ZkCoordinator.refresh(ZkCoordinator.java:103) at org.apache.storm.kafka.ZkCoordinator.getMyManagedPartitions(ZkCoordinator.java:69) at org.apache.storm.kafka.KafkaSpout.nextTuple(KafkaSpout.java:129) at org.apache.storm.daemon.executor$fn__7990$fn__8005$fn__8036.invoke(executor.clj:648) at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:484) at clojure.lang.AFn.run(AFn.java:22) at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.channels.ClosedChannelException at kafka.network.BlockingChannel.send(BlockingChannel.scala:100) at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:78) at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68) at kafka.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:127) at kafka.javaapi.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:79) at org.apache.storm.kafka.KafkaUtils.getOffset(KafkaUtils.java:75) at org.apache.storm.kafka.KafkaUtils.getOffset(KafkaUtils.java:65) at org.apache.storm.kafka.PartitionManager.(PartitionManager.java:94) at org.apache.storm.kafka.ZkCoordinator.refresh(ZkCoordinator.java:98) ... 6 mo
Can any one please help me to find out the exact issue.Please let me know if need any more information.
I faced this issue and problem was that ZooKeeper host names not being resolved from worker host.

How to submit a spark job on a remote master node in yarn client mode?

I need to submit spark apps/jobs onto a remote spark cluster. I have currently spark on my machine and the IP address of the master node as yarn-client. Btw my machine is not in the cluster.
I submit my job with this command
./spark-submit --class SparkTest --deploy-mode client /home/vm/app.jar
I have the address of my master hardcoded into my app in the form
val spark_master = spark://IP:7077
And yet all I get is the error
16/06/06 03:04:34 INFO AppClient$ClientEndpoint: Connecting to master spark://IP:7077...
16/06/06 03:04:34 WARN AppClient$ClientEndpoint: Failed to connect to master IP:7077
java.io.IOException: Failed to connect to /IP:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /IP:7077
Or instead if I use
./spark-submit --class SparkTest --master yarn --deploy-mode client /home/vm/test.jar
I get
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:251)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:228)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Do I really need to have hadoop configured as well in my workstation? All the work will be done remotely and this machine is not part of the cluster.
I am using Spark 1.6.1.
First of all, if you are setting conf.setMaster(...) from your application code, it takes highest precedence (over the --master argument). If you want to run in yarn client mode, do not use MASTER_IP:7077 in application code. You should supply hadoop client config files to your driver in the following way.
You should set environment variable HADOOP_CONF_DIR or YARN_CONF_DIR to point to the directory which contains the client configurations.
http://spark.apache.org/docs/latest/running-on-yarn.html
Depending upon which hadoop features you are using in your spark application, some of the config files will be used to lookup configuration. If you are using hive (through HiveContext in spark-sql), it will look for hive-site.xml. hdfs-site.xml will be used to lookup coordinates for NameNode reading/writing to HDFS from your job.

Job via Oozie HDP 2.1 not creating job.splitmetainfo

When trying to execute a sqoop job which has my Hadoop program passed as a jar file in -jarFiles parameter, the execution blows off with below error. Any resolution seems to be not available. Other jobs with same Hadoop user is getting executed successfully.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.FileNotFoundException: File does not exist: hdfs://sandbox.hortonworks.com:8020/user/root/.staging/job_1423050964699_0003/job.splitmetainfo
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1541)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1396)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1363)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:976)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:135)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1241)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1041)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1452)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1448)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1381)
So here is the way I solved it. We are using CDH5 to run Camus to pull data from kafka. We run CamusJob which is responsible for getting data from kafka using comman line:
hadoop jar...
The problem is that new hosts didn't get so-called "yarn-gateway". Cloudera names pack of configs related to service and copied to /etc/hadoop/conf
as "gateway". So I just clicked "deploy client configuration" in CM UI. YARN client conf has been copied to each YARN NodeManager node and it solved problem.

hbase-master fails to start

I am trying to set up a Hadoop development environment. I am using CDH4 and following the installation instructions in their website https://ccp.cloudera.com/display/CDH4DOC/.
I got to the point in which I was able to install CDH4 in pseudo-distributed mode and I am following the part regarding "Components that require additional configuration".
I have installed HBase-master package, but when I try to start the service I am getting the following error:
$ sudo /sbin/service hbase-master start
starting master, logging to /var/log/hbase/hbase-hbase-master-slc01euu.out
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.hadoop.util.PlatformName. Program will exit.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable
I suposse that it has something to do with some env variable (i believe HADOOP_HOME). But I am not sure where to look at since all the previous processes (name node, data node, job tracker,task tracker) started with no problem.
When I search for HADOOP_HOME variable it says that it is undefined.
Do you guys have any idea about how I could solve this?
Thanks a lot in advance.
Please set HADOOP_HOME to your hadoop installation directory
eg: export HADOOP_HOME=/usr/hadoop
Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.
Add a copy of hdfs-site.xml (and core-site.xml) in under ${HBASE_HOME}/conf folder

Resources