I have a topology running on a Storm cluster with 3 supervisor nodes(32GRAM each node). In the first several days, the topology goes well, everything is ok. But the following error always occurred and the topology gone down after several days running:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /brokers/topics/TOPICNAME/partitions at storm.kafka.ZkCoordinator.refresh
The topology uses a spout to consume messages from a remote Kafka service which sits on an remote server and this server is also the zookeeper service on.
I guess the reason for this exception is that the zookeeper server is instability, OR the network connection is unstable.
I have no permission to do anything with the remote kafka/zookeeper server, So I need a solution by my side to keep the topology running stably. Is there anyway to let the topology runs stably OR anyway to skip the exception while it comes out?
Or is there anyway to resubmit topology automatically?
Thank you very much!
The first thing you should have done is to google for what causes the connection loss error.
Then go to storm's log files and view which line of code is causing the error.
The right way to do things is to find out what is causing the error.
However, if you want the quicker temporary solution, then use Storm's REST API to kill the topology. Then you can use a normal Java program or a script in any language to re-launch the topology from the commandline.
Related
My streaming job has multiple custom receivers (10+) with checkpoint enabled.
When I start a job, I saw several receivers keeps restarting with the following error
[Executor task launch worker-1] receiver.ReceiverSupervisorImpl: Stopping receiver with message: Registered unsuccessfully because Driver refused to start receiver 3:
I searched and found a tip from http://scala4fun.tumblr.com/post/113172936582/how-to-spread-receivers-over-worker-hosts-in-spark. The idea is to delay the scheduler until enough number of executors are up.
It works when I start a job from a clean state. However, when I restart a job using a previous checkpoint, I am hitting the same issue. Many receivers keep restarting.
It looks like the trick didn't work when a checkpoint is used.
I am using the following function to create a StreamingContext, and the delay is done inside of fucntionToCreateContext
val ssc = StreamingContext.getOrCreate(checkpointDir, functionToCreateContext)
How can I start all the receivers when the job restarts with checkpoints?
I know there are other very similar questions on Stackoverflow but those either didn't get answered or didn't help me out. In contrast to those questions I put much more stack trace and log file information into this question. I hope that helps, although it made the question to become sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successful as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log I think the interesing parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executors stderr is empty. However, I think it shows nothing useful to tackle the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions but none of those got answered or it didn't help me to solve the issue. Anyway, things I tried and verified are:
Opened port 2552. Nothing changes.
Increased spark.akka.askTimeout which results in the Spark/Shark app to live longer but eventually it still gets removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, one could say that the timeouts and the fact long-running queries are successfully executed in local mode all indicate some misconfiguration. Though I cannot be sure and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).
As mentioned in the attempts, the root cause is a timeout between the master node, and one or more workers.
Another thing to try: Verify that all workers are reachable by hostname from the master, either via dns or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, the adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, then communication from the master using the worker's hostname failed and the master killed the job.
I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.
I'm trying to submit a sample job (SparkPi). When I ssh to cluster and submit it from there - everything works fine. However when driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
Results in the following indefinite warnings/errors:
WARN TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory 15/02/22 18:30:45
ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0 15/02/22 18:30:45
ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers - in Spark Web UI "Completed Drivers" with "State=ERROR" appear.
I've tried to pass limits for cores and memory to submit script but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ...
waiting before polling master for driver state ... polling master for
driver state State of driver-20150223023734-0007 is ERROR Exception
from cluster was: java.io.FileNotFoundException: File
file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
does not exist. java.io.FileNotFoundException: File
file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
does not exist. at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329) at
org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance how to deploy jobs from remote client. Thanks.
UPDATE:
So for the second issue in cluster mode, the file must be globally visible by each cluster node, so it has to be somewhere in accessible location. This solve IOException but leads to the same issue as in the client mode.
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connection to the ports. Seeing as a bunch of the ports are chosen at random, this means opening up a wide range of, if not all ports. So probably using --deploy-mode=cluster, or client from the cluster, is less painful.
I advise against submitting spark jobs remotely using the port opening strategy, because it can create security problems and is in my experience, more trouble than it's worth, especially due to having to troubleshoot the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver
I'm trying to set up HBase on Hadoop and have been follow various great tutorials online by Michael G. Noll and here. Basically all is fine, my Hdfs and MapRed works well on the web interface it shows that I have 2 nodes (my NameNode is both a NameNode and a DataNode but that's just for testing purposes).
When I got to the point of installing HBase, thats where I meet problems, I get lots of different errors. The latest one I have is on the log file in my slave node
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.2.xx.xx:43089 (no session established for client)
INFO org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
But when I type in
$ zkServer.sh status
It says shows the mode that both machines are running in!
Anyone has any idea what is this problem. Or does any one know of another guide/tutorial that I can follow to set this up? I've tried following the HBase documentation on setting up HBase on a distributed HDFS but it doesn't work too.
Thanks for any help offered!
Are both the zookeepers servers configured in a Qorum? If so have they managed to connect to one another and vote on who's the leader (this should all be in the logs for both servers).
Zookeeper may be running, but if they can't communicate with one another (firewall rules or miss configuration for example), then zookeeper will not accept in coming client connections
I am getting this exception when for a while i didn't communicated with HBase:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region because: Connection refused
is this something related with session expiry, if so, how can i extend session lifetime?
Query bin/hbase hbck and find in which machine root Regionserver is running..
You should get -ROOT- is okay on hbck. Make sure that all your
Regionserver is up and running.
use start regionserver for starting regionserver
I don't think this has anything to do with session lifetime.
Check your cluster to make sure that it is up and working correctly and all region servers are alive. Then check the logs to make sure that they are not reporting some error state.
HBase is complex software -- without more detailed information it is very difficult to diagnose what is going on. And often you can discover the problem by collecting the more detailed information.
This error shows that the client is not able to talk to Region server.
Check the region server associated with the region its trying to connect and check its up.
To identify the region server associated with the region please go through http://hbase.apache.org/0.94/book/regions.arch.html#regions.arch.assignment
Some factors have played a role here.
Please note the below steps which occur when you try to connect to Hbase from a client,
Hbase connects to Zookeeper to get the Ip of the regionservers which host the ROOT table.
The client caches this information about the IP's so that it doesnt have to contact the zookeeper again.
Your problem is that, your client is trying to connect to the zookeeper to get the IP. one of the below things may be going wrong,
Your client is not able to connect to the zookeeper.
The information about the ROOT contained inside the Znode in ZooKeeper is wrong.
Possible fixes.
Check if your zookeeper is working fine.
Delete the Znode for Hbase in your Zookeeper and restart the cluster. Don't worry, this wont delete your data.
Once this is achieved? the client can get the ROOT information and then query for the META table without any issue.