Storm - Supervisors crashing on reboot

This is an issue that is simply driving me nuts. I have a single-machine Storm instance running on my local LAN, currently on the v0.9.1-incubating release (from the Apache Incubator site). The issue is simply that my Storm supervisor process refuses to start after EVERY SINGLE reboot. The hack fix is quite simple: remove the supervisor and workers folders from the Storm local directory and re-run the process; things run hunky-dory from then on until the next reboot.
I'm providing every bit of information I think might be relevant to debugging this issue. Please ask for more if needed, but just help me get some resolution.
PS: It doesn't matter if I have topologies running or not.
Zookeeper version: 3.4.5
Storm version: 0.9.1-incubating (uses Netty transport)
Both Storm and Zookeeper run on the same machine.
supervisord version: 3.0b2
OS: Ubuntu 12.04 LTS
Processor: AMD Phenom(tm) II X6 1055T Processor × 6
RAM: 5.6 GiB
supervisord config
[program:zookeeper]
command=/path/to/zookeeper/bin/zkServer.sh "start-foreground"
process_name=zookeeper
directory=/path/to/zookeeper/bin
stdout_logfile=/var/log/zookeeper.log ; stdout log path
stderr_logfile=/var/log/err.zookeeper.log ; stderr log path
priority=2
user=root
[program:storm-nimbus]
command=/path/to/storm/bin/storm nimbus
user=root
autostart=true
autorestart=true
startsecs=10
startretries=2
log_stdout=true
log_stderr=true
stderr_logfile=/var/log/storm/nimbus.err.log
stdout_logfile=/var/log/storm/nimbus.out.log
logfile_maxbytes=20MB
logfile_backups=2
priority=10
[program:storm-ui]
command=/path/to/storm/bin/storm ui
user=root
autostart=true
autorestart=true
startsecs=10
startretries=2
log_stdout=true
log_stderr=true
stderr_logfile=/var/log/storm/ui.err.log
stdout_logfile=/var/log/storm/ui.out.log
logfile_maxbytes=20MB
logfile_backups=2
priority=500
[program:storm-supervisor]
command=/path/to/storm/bin/storm supervisor
user=root
autostart=true
autorestart=true
startsecs=10
startretries=2
log_stdout=true
log_stderr=true
stderr_logfile=/var/log/storm/supervisor.err.log
stdout_logfile=/var/log/storm/supervisor.out.log
logfile_maxbytes=20MB
logfile_backups=2
priority=600
[program:storm-logviewer]
command=/path/to/storm/bin/storm logviewer
user=root
autostart=true
autorestart=true
startsecs=10
startretries=2
log_stdout=true
log_stderr=true
stderr_logfile=/var/log/storm/log.err.log
stdout_logfile=/var/log/storm/log.out.log
logfile_maxbytes=20MB
logfile_backups=2
priority=900
Storm config
#Zookeeper
storm.zookeeper.servers:
  - "192.168.1.11"
# Nimbus
nimbus.host: "192.168.1.11"
nimbus.childopts: '-Xmx1024m -Djava.net.preferIPv4Stack=true -Dprocess=storm'
# UI
ui.port: 9090
ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true -Dprocess=storm"
# Supervisor
supervisor.childopts: '-Djava.net.preferIPv4Stack=true -Dprocess=storm'
# Worker
worker.childopts: '-Xmx768m -Djava.net.preferIPv4Stack=true -Dprocess=storm'
storm.local.dir: "/path/to/storm"
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
Error message
Pastebin with the full error log; I'm cross-posting the relevant bits here.
java.lang.RuntimeException: java.io.EOFException
at backtype.storm.utils.Utils.deserialize(Utils.java:86) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at backtype.storm.utils.LocalState.snapshot(LocalState.java:45) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at backtype.storm.utils.LocalState.get(LocalState.java:56) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:207) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.4.0.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
at clojure.core$apply.invoke(core.clj:603) ~[clojure-1.4.0.jar:na]
at clojure.core$partial$fn__4070.doInvoke(core.clj:2343) ~[clojure-1.4.0.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.4.0.jar:na]
at backtype.storm.event$event_manager$fn__2593.invoke(event.clj:39) ~[na:na]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
at java.lang.Thread.run(Thread.java:679) [na:1.6.0_27]
Caused by: java.io.EOFException: null
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2322) ~[na:1.6.0_27]
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2791) ~[na:1.6.0_27]
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798) ~[na:1.6.0_27]
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298) ~[na:1.6.0_27]
at backtype.storm.utils.Utils.deserialize(Utils.java:81) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
... 11 common frames omitted
2014-03-11 12:27:25 b.s.util [INFO] Halting process: ("Error when processing an event")

We had that exact same problem (supervisor crashing on start, with the same log error message) after a power outage on 2 of our development servers. I'd guess just stopping the server without previously stopping the supervisor would have the same effect: the supervisor keeps a serialized snapshot of its local state on disk, and an unclean shutdown can leave that file truncated, which matches the EOFException thrown during deserialization.
The only working solution we found was to remove the "storm-local/supervisor" folder (presumably something in there got corrupted).
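If you need to automate that workaround, here is a minimal shell sketch, assuming storm.local.dir is /path/to/storm as in the config above and that the supervisor is not running when it executes (e.g. run it at boot before supervisord launches Storm):
# Hedged sketch: clear potentially corrupted supervisor state before startup.
# Paths assume storm.local.dir is /path/to/storm; adjust to your setup.
rm -rf /path/to/storm/supervisor /path/to/storm/workers
# Then start the daemon as usual (here via the storm launcher):
/path/to/storm/bin/storm supervisor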

I faced a similar issue too; I always had to remove the local folder and restart the topology.

Related

AWS workers can't communicate due to Netty-Client hostname resolution

I'm working on a topology that takes data from Kafka and persists it into Elasticsearch. First, I used the basic KafkaSpout from the Storm dependency to listen for data coming from a specific Kafka topic, and I re-implemented the Elasticsearch bolt from the elasticsearch-hadoop project: https://github.com/elastic/elasticsearch-hadoop/blob/master/storm/src/main/java/org/elasticsearch/storm/EsBolt.java. The goal was to write to several indices in Elasticsearch.
When I process the messages coming from Kafka, I get exceptions once the amount of data in the Kafka queue grows. This is one part of the stack trace in the worker logs:
2016-04-13T22:24:44.641+0000 b.s.m.n.Client [ERROR] failed to send 580 messages to Netty-Client-ip-[internal-ip].ec2.internal/[internal-ip]:6700:
java.nio.channels.ClosedChannelException
2016-04-13T22:24:44.641+0000 b.s.m.n.Client [ERROR] failed to send 575 messages to Netty-Client-ip-[internal-ip].ec2.internal/[internal-ip]:6700:
java.nio.channels.ClosedChannelException
2016-04-13T22:25:05.970+0000 b.s.m.n.Client [WARN] Re-connection to ip-[internal-ip].ec2.internal/[internal-ip]:6701 was successful but 52890 messages
has been lost so far
2016-04-13T22:36:33.571+0000 b.s.m.n.StormClientHandler [INFO] Connection failed Netty-Client-ip-ip-[internal-ip].ec2.internal/[internal-ip]:6701
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_77]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_77]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_77]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_77]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_77]
at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-0.9.6.jar:0.9.6]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
I'm using a Storm cluster of 3 nodes (1 nimbus+UI+Zookeeper and 2 supervisors), Storm version 0.9.6. Each of these machines has 4 GB of RAM, and this is the content of my storm.yaml config file:
storm.zookeeper.servers:
  - "nimbus-ip"
storm.local.dir: "/mnt/storm"
nimbus.seeds: ["nimbus-ip"]
storm.zookeeper.port: 2181
ui.port: 8080
nimbus.host: "nimbus-ip"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
storm.messaging.netty.max_wait_ms: 10000
Can anyone help me understand why the workers can't communicate, apparently due to Netty client hostname resolution? I already saw one report of this issue in the 0.9.4 version of Storm: https://issues.apache.org/jira/browse/STORM-908. Is it possible that the 0.9.6 version does not fix this issue?
Many thanks!!
I got here from google looking for answers to a similar problem. In my case, the error was:
o.a.s.m.n.Client [ERROR] connection attempt 104 to Netty-Client-ip-XXX-XXX-XXX-XXX.ec2.internal/XXX.XXX.XXX.XXX:6703 failed: java.net.ConnectException: Connection refused: ip-XXX-XXX-XXX-XXX.ec2.internal/XXX.XXX.XXX.XXX:6703
This was appearing on a 2-node storm cluster (v1.0.1).
At first, I thought this was a networking issue with AWS (which is where I was deploying the nodes). I started looking at security group rules, /etc/hosts files, etc., none of which helped.
After some searching I discovered this: https://issues.apache.org/jira/browse/STORM-1382 and figured that maybe the issue wasn't the network at all, but something on the other end wasn't running.
So, I ssh'd into a worker node and took a look at the supervisor log, which showed me lots and lots of lines like this:
o.a.s.d.supervisor [INFO] 30236e62-d2e1-4d5c-b75c-f54ef07653a4 still hasn't started
When I looked at worker.log itself, I discovered there was a problem with the default Java version. That was my problem, but other people may find a worker failing for different reasons.
Anyway, once I set the correct default Java version, it all kicked into life.
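For anyone chasing the same symptom, a hedged shell sketch of the checks described above (log paths are illustrative and depend on your storm.yaml):
# See whether the supervisor is stuck waiting on workers that never start:
grep "still hasn't started" /var/log/storm/supervisor.log
# Check which JVM the workers will actually launch with:
java -version
readlink -f "$(which java)"
# On Debian/Ubuntu, switch the default JVM if it is wrong:
sudo update-alternatives --config java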

Failed to get broadcast_1_piece0 of broadcast_1 in Spark Streaming job

I am running Spark jobs on YARN in cluster mode. The job gets messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time, it runs fine without any issue. If I kill the job and restart it, it throws the exception below in the executor upon receiving a message from Kafka:
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1178)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at net.juniper.spark.stream.LogDataStreamProcessor$2.call(LogDataStreamProcessor.java:177)
at net.juniper.spark.stream.LogDataStreamProcessor$2.call(LogDataStreamProcessor.java:1)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$fn$1$1.apply(JavaDStreamLike.scala:172)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Does anyone have an idea how to resolve this error?
Spark version: 1.5.0
CDH 5.5.1
Whenever I encountered issues where only the first run works, they turned out to revolve around the checkpoint data. Moreover, the checkpoint is only exercised once there is something to checkpoint, which is the first message from Kafka; that is why the failure shows up only when a message arrives.
I suggest you check whether the job is indeed dead; the process may still be running on the machine that executed it.
Try running a simple ps -fe and see if something is still running. If there are 2 processes trying to use the same checkpoint folder, it will always fail.
Hope this helps.
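A hedged sketch of that check (the grep pattern is illustrative; match whatever your driver process is called):
# Look for a lingering driver still holding the checkpoint directory:
ps -fe | grep -i '[s]park'
# If a stale process shows up, stop it before resubmitting the job:
# kill <pid>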

Async loop died! org.zeromq.ZMQException

I have this in the worker log file. How can I solve it?
[ERROR] Async loop died! org.zeromq.ZMQException:
Address already in use(0x62)
at org.zeromq.ZMQ$Socket.bind(Native Method)
at zilch.mq$bind.invoke(mq.clj:69)
at backtype.storm.messaging.zmq.ZMQContext.bind(zmq.clj:57)
at backtype.storm.messaging.loader$launch_receive_thread_BANG_$fn__1629.invoke(loader.clj:26)
at backtype.storm.util$async_loop$fn__465.invoke(util.clj:375)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Unknown Source)
and the supervisor log says the worker "still hasn't started", while the spout in the UI didn't emit anything. This is the worker log file after executing the launch command:
[ERROR] Error on initialization of server mk-worker
java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(Unknown Source)
at backtype.storm.util$touch.invoke(util.clj:432)
at backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:331)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:601)
at backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
at clojure.lang.AFn.applyToHelper(AFn.java:172)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.worker.main(Unknown Source)
[INFO] Halting process: ("Error on initialization")
This is what happened after restarting Storm:
1- I tried to kill the topology.
2- I removed the contents of the storm-local folder.
3- I restarted and reconnected nimbus and the supervisor.
The result I have now:
1- Some executors in nimbus are not alive, and it keeps trying to clean up the topology.
2- The supervisor has this message:
[ERROR] Error when processing event java.io.FileNotFoundException:
File does not exist: storm-local/workers/361c029c-b9c5-4ca7-bced-f8ea084d45a3/heartbeats/1444899266048
3- The worker log file:
worker 361c029c-b9c5-4ca7-bced-f8ea084d45a3 for storm topology name on 9d05b304-6bb5-497e-85b3-656eb82fb37e:6704 has finished loading
2015-10-15 10:50:46 executor [INFO] Deactivating spout spout0:(57)
It seems that the port is not free. Make sure no other process or service uses the port, or reconfigure Storm to use a different one.
In order to change the port, you need to edit the conf/storm.yaml file (best on each machine that runs a supervisor); compare with defaults.yaml to find the correct parameter name (supervisor.slots.ports).
The second error seems to relate to the local Storm tmp directory. Try to shut down the cluster, clear this directory, and restart Storm.
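A hedged sketch of both steps (port numbers and paths are illustrative):
# Find out what is holding a worker port, e.g. 6700:
netstat -tlnp | grep 6700    # or: lsof -i :6700
# If the port cannot be freed, move the slots in conf/storm.yaml, e.g.:
# supervisor.slots.ports:
#   - 6800
#   - 6801
# Then shut the cluster down, clear the local dir, and restart:
rm -rf storm-local/*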

Ambari storm sink issue

I am getting the error below when my Storm topology receives its first message from Kafka, and the worker dies.
2015-08-13 12:44:58 b.s.d.executor [INFO] Finished loading executor hdfs-bolt:[3 3]
2015-08-13 12:44:58 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: Could not instantiate a class listed in config under section topology.metrics.consumer.register with fully qualified name org.apache.hadoop.metrics2.sink.storm.StormTimelineMetricsSink
at backtype.storm.metric.MetricsConsumerBolt.prepare(MetricsConsumerBolt.java:46) ~[storm-core-0.9.3.2.2.6.3-1.jar:0.9.3.2.2.6.3-1]
at backtype.storm.daemon.executor$fn__6414$fn__6427.invoke(executor.clj:732) ~[storm-core-0.9.3.2.2.6.3-1.jar:0.9.3.2.2.6.3-1]
at backtype.storm.util$async_loop$fn__451.invoke(util.clj:463) ~[storm-core-0.9.3.2.2.6.3-1.jar:0.9.3.2.2.6.3-1]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.metrics2.sink.storm.StormTimelineMetricsSink
Can someone help me solve this issue?
1- Check the version of Ambari installed with "rpm -qa | grep ambari" and check "/usr/lib/storm/lib" on all hosts for the Ambari metrics jar matching that version (example: ambari-metrics-storm-sink-with-common-2.0.0.151.jar).
2- Run "yum reinstall ambari-metrics-hadoop-sink" on all Storm supervisor nodes.
3- Restart the supervisors and re-deploy the topology.
4- Check "/usr/lib/storm/lib" again to ensure that the jar matching the Ambari version is present.
Hortonworks has posted a knowledge base article on exactly this issue: https://community.hortonworks.com/content/supportkb/49117/storm-worker-fails-with-javalangclassnotfoundexcep.html
In my case, I needed to install (not reinstall) ambari-metrics-hadoop-sink as it was not installed by default on the HDP sandbox.

InvalidResourceRequestException Yarn Exception while running Spark in Cluster mode with yarn in hadoop 2.4

Using Apache Spark 1.1.0 with Hadoop 2.4; my cluster is on CDH 5.1.3.
I tried the commands below to start Spark with YARN:
./spark-shell --master yarn
./spark-shell --master yarn-client
I got the following exception:
14/10/15 21:33:32 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
appMasterRpcPort: 0
appStartTime: 1413388999108
yarnAppState: RUNNING
14/10/15 21:33:44 ERROR cluster.YarnClientSchedulerBackend: Yarn
application already ended: FAILED
====== Node manager exception ======
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1408, maxMemory=1024
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:228)
at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:444)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy11.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
... 20 more
According to your YARN configuration, the maximum memory an application can request for a container is 1024 MB, but the Spark client is requesting a container with 1408 MB. Either change the Spark config to request less RAM or raise the maximum container memory in YARN.
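A hedged sketch of both options (values are illustrative; the 1408 MB request is typically the configured container memory plus a fixed overhead, here 1024 + 384):
# Option 1: ask for less memory on the Spark side:
./spark-shell --master yarn-client --executor-memory 512m
# Option 2: raise YARN's per-container ceiling in yarn-site.xml and
# restart the ResourceManager afterwards:
#   yarn.scheduler.maximum-allocation-mb = 2048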
