I am running a Spark job on my local Mac OS machine. All the work in the job completes fine, but the process never exits and just hangs there.
When a Spark job finishes gracefully, it shuts down all of its spawned processes, releases all of its associated ports, and prints something like the following in its console:
17/09/26 12:04:42 INFO SparkContext: Invoking stop() from shutdown hook
17/09/26 12:04:42 INFO SparkUI: Stopped Spark web UI at http://172.29.16.34:4040
17/09/26 12:04:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/09/26 12:04:42 INFO MemoryStore: MemoryStore cleared
17/09/26 12:04:42 INFO BlockManager: BlockManager stopped
17/09/26 12:04:42 INFO BlockManagerMaster: BlockManagerMaster stopped
17/09/26 12:04:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/09/26 12:04:42 INFO SparkContext: Successfully stopped SparkContext
17/09/26 12:04:42 INFO ShutdownHookManager: Shutdown hook called
17/09/26 12:04:42 INFO ShutdownHookManager: Deleting directory /private/var/folders/mc/fl1dm0jx48d086s5jsk18dtc0000gn/T/spark-088260a7-6c27-4dbf-878a-ec6a2efa14a3
I tried to debug what was going on with this Spark job. Initially I thought something inside the job was hanging, but then I put a print statement at the very end of the job and that line got printed, so now I'm confused about what could be causing it to hang there.
Any help please?
My Spark version: 1.0.1
My OS X version: 10.11.6
Here's my Spark job code:
SparkConf sparkConf = new SparkConf();
SparkContext sparkContext = new SparkContext(sparkConf);
JavaSparkContext sc = new JavaSparkContext(sparkContext);

// HBase configuration pointing at the ZooKeeper quorum used for the table scan
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", zkQuorum);
conf.set("zookeeper.znode.parent", zkZNodeParent);

// Read the HBase table as an RDD of (row key, Result) pairs
JavaPairRDD<ImmutableBytesWritable, Result> hBasePairRdd =
    sc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

// Embedded Solr server used to build the index
EmbeddedSolrServer server = solrManager.setupSolrServer(shardIndex);
Thanks!
Actually, I just figured it out.
Thanks @shridharama for poking me to post some example source code of my Spark job here.
I then basically did a binary search over my code to see which part was preventing it from shutting down gracefully.
I eventually narrowed it down to this single line:
EmbeddedSolrServer server = solrManager.setupSolrServer(shardIndex);
We use this Spark job to build our Solr index, and this server process was never being stopped.
So, then I happily added this line:
server.close();
And my Spark job now exits nicely on my Mac. Thanks!
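For reference, here is roughly how the fix fits into the job above (a minimal sketch; solrManager, setupSolrServer and shardIndex are names from my own code, and the try/finally is just one reasonable way to guarantee the close always happens):

EmbeddedSolrServer server = solrManager.setupSolrServer(shardIndex);
try {
    // ... index the data from hBasePairRdd into the embedded Solr core ...
} finally {
    try {
        // Without this close(), the embedded Solr server is never shut down and
        // keeps the driver JVM from exiting, which is exactly the hang described above.
        server.close();
    } catch (Exception e) {
        // log and continue; we still want the Spark context to stop
    }
    sc.stop();
}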
But I still don't understand how this exact same Spark job (same code, just different ZooKeeper configuration) manages to exit gracefully when running on my cluster.
Cheers!
Related
I am using an Apache Hadoop 2.7.1 high-availability cluster that consists of
two NameNodes (mn1, mn2) and 3 JournalNodes.
While working on the cluster I faced the following problem:
when I issue start-dfs.sh, mn1 is standby and mn2 is active,
but after that, if one of these two NameNodes goes down, there is no way
to bring it back up.
Here are the last lines of the log of one of these two NameNodes:
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 3 entries 72 lookups
2017-08-05 09:37:21,088 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 7052 msecs
2017-08-05 09:37:21,300 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to mn2:8020
2017-08-05 09:37:21,304 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2017-08-05 09:37:21,316 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2017-08-05 09:37:21,353 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2017-08-05 09:37:21,354 WARN org.apache.hadoop.hdfs.server.common.Util: Path /opt/hadoop/metadata_dir should be specified as a URI in configuration files. Please update hdfs configuration.
2017-08-05 09:37:21,361 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:5741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1063)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:678)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:664)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
2017-08-05 09:37:21,364 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-08-05 09:37:21,365 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at mn2/192.168.25.22
************************************************************/
This may be because:
1. The NameNode port may have been changed for each node.
This is a particularly vexing problem.
Swallow IllegalStateExceptions thrown by removeShutdownHook in FileSystem. The javadoc states:
public boolean removeShutdownHook(Thread hook)
Throws:
IllegalStateException - If the virtual machine is already in the process of shutting down
So if we are getting this exception, it means we are already in the process of shutting down, and we cannot, try what we may, remove the shutdown hook. If Runtime had a method Runtime.isShutdownInProgress(), we could have checked for it before the removeShutdownHook call; as it stands, there is no such method. In my opinion, this would be a good patch regardless of the needs of this JIRA.
Do not send SIGTERMs from the NM to the MR-AM in the first place. Rather, we should expose a mechanism for the NM to politely tell the AM it is no longer needed and should shut down ASAP. Even after this, if an admin were to kill the MRAppMaster with a SIGTERM, the JobHistory would be lost, defeating the purpose of 3614.
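To make the first suggestion concrete, the "swallow the exception" approach amounts to something like this (a generic sketch rather than the actual patch; hook stands for whichever shutdown hook was registered):

try {
    Runtime.getRuntime().removeShutdownHook(hook);
} catch (IllegalStateException e) {
    // The JVM is already shutting down, so the hook can no longer be removed.
    // Per the javadoc quoted above, this exception is the only signal we get,
    // since there is no Runtime.isShutdownInProgress() to check beforehand.
}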
I discovered that my problem was in the JournalNode and not in the NameNode,
even though the NameNode log shows the error mentioned in the question.
jps lists the JournalNode, but that is misleading: the JournalNode service had actually shut down
even though it still appeared in the jps output.
So, as a solution, I issued hadoop-daemon.sh stop journalnode
and then hadoop-daemon.sh start journalnode,
and after that the NameNode started to work again.
I think it has something to do with memory, because it was working fine for smaller data sets. The program runs for a while and then prematurely shuts down while using logistic regression from Spark MLlib. I am running the command below to start my Spark program on the cluster.
export SPARK_CONF_DIR=/home/gs/conf/spark/latest
export SPARK_HOME=/home/gs/spark/latest
$SPARK_HOME/bin/spark-submit --class algoRunner --master yarn --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true \
--executor-memory 8g --queue default --conf spark.hadoop.hive.querylog.location='${java.io.tmpdir}/hivelogs' \
~/spark/Product-Classifier-Pipeline-assembly-1.0.jar
I receive the following error:
17/08/02 21:53:40 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/08/02 21:53:40 INFO SparkContext: Invoking stop() from shutdown hook
17/08/02 21:53:40 INFO SparkUI: Stopped Spark web UI at http://gsrd219n01.red.ygrid.yahoo.com:45546
17/08/02 21:53:40 INFO DAGScheduler: Job 10 failed: treeAggregate at LogisticRegression.scala:1670, took 2.351935 s
17/08/02 21:53:40 INFO DAGScheduler: ShuffleMapStage 19 (treeAggregate at LogisticRegression.scala:1670) failed in 1.947 s due to Stage cancelled because SparkContext was shut down
17/08/02 21:53:40 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo#21bec75d)
17/08/02 21:53:40 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(10,1501710820713,JobFailed(org.apache.spark.SparkException: Job 10 cancelled because SparkContext was shut down))
17/08/02 21:53:40 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job 10 cancelled because SparkContext was shut down
org.apache.spark.SparkException: Job 10 cancelled because SparkContext was shut down
The driver memory wasn't large enough. Increasing it (for example via spark-submit's --driver-memory option) prevented these errors.
I have a Spark application, written in Scala, which performs a series of transformations and then writes the result to a Parquet file.
The transformation part finishes without problems and the output is written to HDFS correctly. The application runs on top of a YARN cluster of 30 nodes.
However, the Spark application itself will not complete and exit YARN; it remains in the ResourceManager.
After hanging for about an hour (consuming resources and vcores), it either finishes or throws an error and kills itself.
Here is the error log of the application. I'd appreciate it if anyone could shed some light on this matter.
16/08/24 14:51:12 INFO impl.ContainerManagementProtocolProxy: Opening proxy : phhdpdn013x.company.com:8041
16/08/24 14:51:22 INFO cluster.YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (phhdpdn013x.company.com:54175) with ID 1
16/08/24 14:51:22 INFO storage.BlockManagerMasterEndpoint: Registering block manager phhdpdn013x.company.com:24700 with 2.1 GB RAM, BlockManagerId(1, phhdpdn013x.company.com, 24700)
16/08/24 14:51:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
16/08/24 14:51:29 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
16/08/24 15:11:00 ERROR scheduler.LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
16/08/24 15:11:46 ERROR scheduler.LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
What is your version of Spark?
Your ERROR looks a lot like this issue
https://issues.apache.org/jira/browse/SPARK-12339
After I finished all the distribution and activation steps on the manager website,
I got the error below when I restarted the cluster:
2016-07-14 14:51:12,335 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup#UT190320.shis.uth.tmc.edu:50070
2016-07-14 14:51:12,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2016-07-14 14:51:12,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2016-07-14 14:51:12,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2016-07-14 14:51:12,436 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException:
File system image contains an old layout version -55.
An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:232)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1006)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:736)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:553)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:609)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:776)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:760)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1466)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1534)
2016-07-14 14:51:12,439 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
You will need to perform the upgrade as suggested by the error message. It is not clear what exactly you did, but I suggest you follow the documentation at http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_earlier_cdh5_upgrade.html
sudo service hadoop-hdfs-namenode upgrade is possibly what you need.
I tried to do a rolling upgrade from Hadoop 2.4.0 to Hadoop 2.7.1. As per http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#dfsadmin_-rollingUpgrade, one can roll back to the previous release provided the finalize step has not been done. I upgraded the setup but did not finalize the upgrade, then tried to roll back HDFS to 2.4.0.
I tried the following steps:
1. Shut down all NNs and DNs.
2. Restore the pre-upgrade release on all machines.
3. Start NN1 as active with the "-rollingUpgrade rollback" option (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade).
I am getting the following error after the 3rd step:
15/09/01 17:53:35 INFO namenode.AclConfigFlag: ACLs enabled? false
15/09/01 17:53:35 INFO common.Storage: Lock on <<NameNode dir>>/in_use.lock acquired by nodename 12152#VM-2
15/09/01 17:53:35 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/yarn/namenode. Reported: -63. Expecting = -56.
at org.apache.hadoop.hdfs.server.common.StorageInfo.setLayoutVersion(StorageInfo.java:178)
at org.apache.hadoop.hdfs.server.common.StorageInfo.setFieldsFromProperties(StorageInfo.java:131)
at org.apache.hadoop.hdfs.server.namenode.NNStorage.setFieldsFromProperties(NNStorage.java:608)
at org.apache.hadoop.hdfs.server.common.StorageInfo.readProperties(StorageInfo.java:228)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:309)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:882)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:639)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:455)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:511)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:670)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:655)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1304)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1370)
15/09/01 17:53:35 INFO mortbay.log: Stopped SelectChannelConnector#0.0.0.0:50070
15/09/01 17:53:35 INFO impl.MetricsSystemImpl: Stopping NameNode metrics system...
15/09/01 17:53:35 INFO impl.MetricsSystemImpl: NameNode metrics system stopped.
15/09/01 17:53:35 INFO impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
15/09/01 17:53:35 FATAL namenode.NameNode: Exception in namenode join
From the rolling upgrade documentation it can be inferred that rolling upgrade is supported from Hadoop 2.4.0 onwards, but rollingUpgrade rollback to Hadoop 2.4.0 seems to be broken: it throws the above-mentioned error.
Are there any other steps to perform a rollback (from a rolling upgrade), or is rolling back to Hadoop 2.4.0 not supported?