Error in nimbus log - apache-storm

When I tried to run a topology from my Storm client, I got an error pointing to a failed connection with Nimbus.
I checked my nimbus log and here is what it shows:
2014-04-25 11:05:03 nimbus [INFO] Uploading file from client to storm-local/nimbus/inbox/stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 11:05:03 nimbus [INFO] Finished uploading file from client: storm-local/nimbus/inbox/stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 11:05:03 nimbus [INFO] Received topology submission for beat with conf {"topology.max.task.parallelism" nil, "topology.acker.executors" 1, "topology.kryo.register" nil, "topology.kryo.decorators" (), "topology.nam$
2014-04-25 11:05:03 nimbus [INFO] Activating beat: beat-2-1398416703
2014-04-25 11:05:03 EvenScheduler [INFO] Available slots: (["c3a1bab3-ed50-4efc-b424-050d34d7d4bd" 6702] ["c3a1bab3-ed50-4efc-b424-050d34d7d4bd" 6703] ["8f506a92-4a1b-4cc6-8f80-ed53ea810256" 6701] ["8f506a92-4a1b-4cc6-8f80-e$
2014-04-25 11:05:03 nimbus [INFO] Setting new assignment for topology id beat-2-1398416703: #backtype.storm.daemon.common.Assignment{:master-code-dir "storm-local/nimbus/stormdist/beat-2-1398416703", :node->host {"c3a1bab3-e$
2014-04-25 12:08:03 nimbus [INFO] Cleaning inbox ... deleted: stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 13:59:47 TNonblockingServer [ERROR] Read an invalid frame size of -720899. Are you using TFramedTransport on the client side?
2014-04-25 14:00:16 TNonblockingServer [ERROR] Read an invalid frame size of -720899. Are you using TFramedTransport on the client side?
Any clarification?
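For context (not from the original post): Thrift's TNonblockingServer only accepts framed messages, so this error generally means something wrote unframed or non-Thrift bytes to Nimbus's Thrift port (6627 by default), for example a client built on a plain socket transport or a stray process probing the port. Below is a minimal, hypothetical Java sketch of a hand-rolled Nimbus client, only to show where TFramedTransport sits; the host name is a placeholder, the package names are an assumption (Storm releases shade Thrift, e.g. as org.apache.thrift7 in 0.9.x), and in practice NimbusClient.getConfiguredClient(conf) wires this up for you.

// Sketch only: illustrates the framed transport the error message asks about.
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import backtype.storm.generated.Nimbus;

public class NimbusPing {
    public static void main(String[] args) throws Exception {
        TSocket socket = new TSocket("nimbus-host", 6627);          // placeholder host, default Thrift port
        TFramedTransport transport = new TFramedTransport(socket);  // TNonblockingServer requires framing;
                                                                    // without it Nimbus logs "invalid frame size"
        Nimbus.Client client = new Nimbus.Client(new TBinaryProtocol(transport));
        transport.open();
        System.out.println(client.getClusterInfo());                // simple round trip to Nimbus
        transport.close();
    }
}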

Related

Accepted socket connection from /hostname:55306 (org.apache.zookeeper.server.NIOServerCnxnFactory)

I have configured the Kafka cluster, Storm cluster and Hadoop cluster. Everything works fine when there are no jobs.
When I submit the Storm jar (which gets data from Kafka, processes it, then stores it into HDFS) in standalone mode, it works fine.
After configuring the same code with the server properties and running it on the server, it gives the following error:
[2018-07-03 12:54:00,370] INFO Accepted socket connection from /192.168.3.222:55306 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2018-07-03 12:54:00,381] INFO Client attempting to establish new session at /192.168.3.222:55306 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-07-03 12:54:00,383] INFO Established session 0x3645ed69ca40031 with negotiated timeout 20000 for client /192.168.3.222:55306 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-07-03 12:54:02,429] WARN caught end of stream exception (org.apache.zookeeper.server.NIOServerCnxn)
EndOfStreamException: Unable to read additional data from client sessionid 0x3645ed69ca40031, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
[2018-07-03 12:54:02,433] INFO Closed socket connection for client /192.168.3.222:55306 which had sessionid 0x3645ed69ca40031
(org.apache.zookeeper.server.NIOServerCnxn)
[2018-07-03 12:54:06,000] INFO Expiring session 0x1645ed69c8c0041, timeout of 20000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2018-07-03 12:54:06,000] INFO Processed session termination for sessionid: 0x1645ed69c8c0041
(org.apache.zookeeper.server.PrepRequestProcessor)
Respective versions I am using:
apache-storm-1.0.6
kafka_2.11-1.0.1
zookeeper-3.4.12
hadoop-2.9.1
Nimbus log:
2018-07-04 12:28:54.455 o.a.s.d.nimbus timer [INFO] Setting new assignment for topology id test-topology-1-1530686803: #org.apache.storm.daemon.common.Assignment{:master-code-dir "/usr/local/apache-services/data/storm", :node->host {"7c98bf5a-38d5-4a13-95ad-966be3a51c49" "datanode2.sakha.com"}, :executor->node+port {[2 2] ["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700], [1 1] ["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700], [3 3] ["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700]}, :executor->start-time-secs {[1 1] 1530687534, [2 2] 1530687534, [3 3] 1530687534}, :worker->resources {["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700] [0.0 0.0 0.0]}, :owner "hduser"}
2018-07-04 12:28:54.520 o.a.s.d.nimbus pool-14-thread-7 [INFO] Created download session for test-topology-1-1530686803-stormjar.jar with id a9762861-224e-4f40-824b-ae0efa687452
Supervisor log:
2018-07-04 12:30:46.461 o.a.s.d.s.Container SLOT_6700 [INFO] Creating symlinks for worker-id: b9c3daa0-4f4d-42d7-9963-e93b6e6179a3 storm-id: test-topology-1-1530686803 for files(0): []
2018-07-04 12:30:46.461 o.a.s.d.s.Container SLOT_6700 [INFO] Topology jar for worker-id: b9c3daa0-4f4d-42d7-9963-e93b6e6179a3 storm-id: test-topology-1-1530686803 does not contain resources directory /usr/local/apache-services/data/storm/supervisor/stormdist/test-topology-1-1530686803/resources.
2018-07-04 12:30:46.461 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Launching worker with assignment LocalAssignment(topology_id:test-topology-1-1530686803, executors:[ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:3, task_end:3)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:hduser) for this supervisor 7c98bf5a-38d5-4a13-95ad-966be3a51c49 on port 6700 with id b9c3daa0-4f4d-42d7-9963-e93b6e6179a3
There is something wrong with your dependency tree. You posted that you got java.lang.NoSuchMethodError: org.apache.hadoop.security.authentication.util.KerberosUtil.hasKerberosTicket in your worker log. This points to you having the wrong Hadoop jar versions on your classpath when you submit the jar, or to the jars missing entirely.
Here's the pom for storm-hdfs: https://github.com/apache/storm/blob/v1.0.6/external/storm-hdfs/pom.xml. By default it compiles against Hadoop 2.6.1. If you want to use another Hadoop version, you need to replace the listed Hadoop dependencies with newer ones in your pom (i.e. you need to manually list e.g. hadoop-client in version 2.9.1 in your pom), as sketched below.
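For illustration, a hypothetical pom fragment along those lines, assuming a dependency on storm-hdfs 1.0.6 and a Hadoop 2.9.1 cluster (wildcard exclusions need Maven 3.2.1+):

<!-- Sketch: drop storm-hdfs' transitive Hadoop 2.6.1 jars and pin the cluster's version. -->
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-hdfs</artifactId>
    <version>1.0.6</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.9.1</version>
</dependency>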
A good tool for debugging this is mvn dependency:tree; running it in your project will tell you which versions of which jars end up in your build.

Storm Topology does not start with parallelism hint of 1200

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I have a Storm topology with 3 bolts (A, B, C), where the middle bolt takes around 450 ms mean time and the other two bolts take less than 1 ms.
I am able to run the topology with the following parallelism hint values:
A: 4
B: 700
C: 10
But when I increase the parallelism hint of B to 1200, the topology does not start.
In the topology logs, I see executor B being loaded multiple times, like this:
2018-05-18 18:56:37.462 o.a.s.d.executor main [INFO] Loading executor B:[111 111]
2018-05-18 18:56:37.463 o.a.s.d.executor main [INFO] Loaded executor tasks B:[111 111]
2018-05-18 18:56:37.465 o.a.s.d.executor main [INFO] Finished loading executor B:[111 111]
2018-05-18 18:56:37.528 o.a.s.d.executor main [INFO] Loading executor B:[355 355]
2018-05-18 18:56:37.529 o.a.s.d.executor main [INFO] Loaded executor tasks B:[355 355]
2018-05-18 18:56:37.530 o.a.s.d.executor main [INFO] Finished loading executor B:[355 355]
2018-05-18 18:56:37.666 o.a.s.d.executor main [INFO] Loading executor B:[993 993]
2018-05-18 18:56:37.667 o.a.s.d.executor main [INFO] Loaded executor tasks B:[993 993]
2018-05-18 18:56:37.669 o.a.s.d.executor main [INFO] Finished loading executor B:[993 993]
2018-05-18 18:56:37.713 o.a.s.d.executor main [INFO] Loading executor B:[765 765]
2018-05-18 18:56:37.714 o.a.s.d.executor main [INFO] Loaded executor tasks B:[765 765]
But in between, the worker process gets restarted. I don't see any error in the topology logs or Storm logs. The following are the Storm logs when the worker gets restarted:
2018-05-18 18:51:46.755 o.a.s.d.s.Container SLOT_6700 [INFO] Killing eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.204 o.a.s.d.s.BasicContainer Thread-7 [INFO] Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 109081 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.766 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.774 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700 all processes are dead...
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids/27798
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/heartbeats
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/tmp
2018-05-18 18:51:47.781 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers-users/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up BLOB references...
2018-05-18 18:51:47.784 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up basic files...
2018-05-18 18:51:47.785 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/supervisor/stormdist/myTopology-1-1526649581
2018-05-18 18:51:47.808 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 42 topo:myTopology-1-1526649581 worker:null -> EMPTY msInState: 0
This keeps happening and the topology never comes up, whereas it used to start perfectly when the parallelism hint for bolt B was 700; there is no other change.
One interesting log line here, though I am not yet sure what it means, is:
Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
Any Suggestions?
Edit:
Config:
topology.worker.childopts: -Xms1g -Xmx16g
topology.worker.logwriter.childopts: -Xmx1024m
topology.worker.max.heap.size.mb: 3072.0
worker.childopts: -Xms1g -Xmx16g -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseG1GC -XX:+AggressiveOpts -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/saurabh.mimani/apache-storm-1.2.1/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dorg.newsclub.net.unix.library.path=/usr/share/specter/uds-lib/
worker.gc.childopts:
worker.heap.memory.mb: 8192
supervisor.childopts: -Xms1g -Xmx16g
Edit:
Logs for strace -fp PID -e trace=read,write,network,signal,ipc are in a gist.
I am not yet able to understand it fully; some relevant-looking lines from it:
[pid 3362] open("/usr/lib/locale/UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 3362] kill(1487, SIGTERM) = 0
[pid 3362] close(1)
A quick Google search suggests 143 is the exit code for when the JVM receives a SIGTERM (e.g. Always app Java end with "Exit 143" Ubuntu). You might be running out of memory, or the OS might be killing the process for some other reason. Remember that setting the parallelism hint to 1200 means that you will get 1200 tasks (copies) of bolt B, where you only had 700 before.
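As a reference point (a hypothetical sketch with placeholder classes, not code from the question): in storm-core 1.2.1 the parallelism hint is the executor count, and with no explicit setNumTasks() the task count defaults to the same number; exit code 143 is the usual 128 + signal convention, i.e. 128 + 15 for SIGTERM.

import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class ParallelismSketch {
    // Stand-in for the question's bolt B; it does nothing, only the wiring matters.
    public static class BBolt extends BaseBasicBolt {
        @Override public void execute(Tuple input, BasicOutputCollector collector) { }
        @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("A", new TestWordSpout(), 4);   // placeholder for the upstream component
        // Parallelism hint = executors; without setNumTasks() tasks == executors,
        // so a hint of 1200 means 1200 copies of B that the hosting workers must
        // construct and heartbeat before the launch timeout expires.
        builder.setBolt("B", new BBolt(), 1200)
               .setNumTasks(1200)                        // spelled out only to show the default
               .shuffleGrouping("A");
        builder.createTopology();                        // would normally be passed to StormSubmitter
    }
}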
I was able to get this running by tweaking the following configurations. It seems it was timing out due to nimbus.task.launch.secs, which was set to 120, and the worker was being restarted if it had not started within 120 seconds.
Updated values of some of these configs:
drpc.request.timeout.secs: 1600
supervisor.worker.start.timeout.secs: 1200
nimbus.supervisor.timeout.secs: 1200
nimbus.task.launch.secs: 1200
About nimbus.task.launch.secs:
A special timeout used when a task is initially launched. During launch, this is the timeout used until the first heartbeat, overriding nimbus.task.timeout.secs.
A separate timeout exists for launch because there can be quite a bit of overhead to launching new JVM's and configuring them.

Amazon EC2 instance public IP not working for Phoenix Elixir app. Error: reason :eaddrnotavail (can't assign requested address)

I just bought an Amazon EC2 instance and installed Erlang, Elixir and PostgreSQL.
I just put a basic Phoenix app on it.
When I run mix phx.server,
it starts on localhost (http://localhost:4000/).
But I want to run it on the Amazon public IP.
So I put this in config/dev.exs:
http: [ip: {1, 2, 3, 4}, port: 4000]
After this I created a security group and allowed all traffic.
Now when I start the app using sudo mix phx.server,
I get the error below:
Compiling 10 files (.ex)
Generated myapp_test app
[error] Failed to start Ranch listener myappTestWeb.Endpoint.HTTP in :ranch_tcp:listen([port: 4000, ip: {1, 2, 3, 4}]) for reason :eaddrnotavail (can't assign requested address)
[info] Application myapp_test exited: myappTest.Application.start(:normal, []) returned an error: shutdown: failed to start child: myappTestWeb.Endpoint
** (EXIT) shutdown: failed to start child: Phoenix.Endpoint.Handler
** (EXIT) shutdown: failed to start child: {:ranch_listener_sup, myappTestWeb.Endpoint.HTTP}
** (EXIT) shutdown: failed to start child: :ranch_acceptors_sup
** (EXIT) {:listen_error, myappTestWeb.Endpoint.HTTP, :eaddrnotavail}
[info] Application phoenix_ecto exited: :stopped
[info] Application ecto exited: :stopped
[info] Application poolboy exited: :stopped
[info] Application postgrex exited: :stopped
[info] Application decimal exited: :stopped
[info] Application db_connection exited: :stopped
[info] Application connection exited: :stopped
[info] Application cowboy exited: :stopped
[info] Application cowlib exited: :stopped
[info] Application ranch exited: :stopped
[info] Application runtime_tools exited: :stopped
=INFO REPORT==== 23-Jan-2018::10:48:23 ===
application: logger
exited: stopped
type: temporary
** (Mix) Could not start application myapp_test: myappTest.Application.start(:normal, []) returned an error: shutdown: failed to start child: myappTestWeb.Endpoint
** (EXIT) shutdown: failed to start child: Phoenix.Endpoint.Handler
** (EXIT) shutdown: failed to start child: {:ranch_listener_sup, myappTestWeb.Endpoint.HTTP}
** (EXIT) shutdown: failed to start child: :ranch_acceptors_sup
** (EXIT) {:listen_error, myappTestWeb.Endpoint.HTTP, :eaddrnotavail}
When I put the public IP in the browser, it also does not work.
Do I need to install Apache or any other web server?
Or
do I need to bind the Amazon public IP anywhere on the system?
Any insight on how to fix the issue will be greatly appreciated.
Thanks
Your best bet at this point is to start isolating what is failing. Once you can identify components that should be working and aren't, you'll be able to make your question more focused. Some troubleshooting ideas to get you started:
can you ping the ec2 public address from your machine?
does it have that address (ip address show from the ec2 terminal)?
can the ec2 machine ping out to an external ip, like google's dns (ping 8.8.8.8)?
use netcat to see if the port is truly open: sudo nc -l 80 (on the ec2 host) and nc <ec2-ip> 80 on your machine. Then you should be able to type in your machine (make sure you hit enter after some characters) and see it appear on the ec2 host.
remove the address from your cowboy config and let it bind to 0.0.0.0 (the default), then see if you can reach it (a sketch of that config follows this list).
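For that last point, a hypothetical config/dev.exs fragment (app and endpoint names adapted from the logs above; adjust to your project). A common cause of :eaddrnotavail on EC2 is that the public IP is provided through NAT and never appears on the instance's own interfaces, so binding to all interfaces is the usual fix:

# Sketch of config/dev.exs: listen on all interfaces; the app is then reachable
# via the instance's private IP and, through AWS NAT, via the public IP as well.
config :myapp_test, MyappTestWeb.Endpoint,
  http: [ip: {0, 0, 0, 0}, port: 4000]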

Mage Resque - Job Class not found error

I'm trying to implement asynchronous functionality in Magento using Mage-Resque. I have followed the instructions at https://github.com/ajbonner/mage-resque and installed all the components except ext-pcntl.
Now I'm able to queue a job to the Redis server. I have tested it using the default Mns_Resque_Model_Job_Logmessage class, but I'm getting the following error:
[info] [11:09:42 2016-03-27] Checking default for jobs
[info] [11:09:42 2016-03-27] Found job on default
[notice] [11:09:42 2016-03-27] Starting work on (Job{default} | ID: 6fe2a430c10ff2920c3f66ec7d52e957 | Mns_Resque_Model_Job_Logmessage | [{"message":"Resque Test 1459057136"}])
[info] [11:09:42 2016-03-27] Forked 3759 at 2016-03-27 11:09:42
[info] [11:09:42 2016-03-27] Processing default since 2016-03-27 11:09:42
[critical] [11:09:42 2016-03-27] (Job{default} | ID: 6fe2a430c10ff2920c3f66ec7d52e957 | Mns_Resque_Model_Job_Logmessage | [{"message":"Resque Test 1459057136"}]) has failed Could not find job class Mns_Resque_Model_Job_Logmessage.
It's reporting that it cannot find the class Mns_Resque_Model_Job_Logmessage. What could be wrong? Have I missed something? Any help would be appreciated.

Unable to use remote slave node in Apache Storm cluster

I'm following http://jayatiatblogs.blogspot.com/2011/11/storm-installation.html to try to configure an Apache Storm remote cluster using a few virtual machines (EC2, Ubuntu 14.04 LTS) on Amazon Web Services.
My master node is 10.0.0.230 and my slave node is 10.0.0.79. My ZooKeeper resides on my master node. When I run storm jar storm-starter-0.9.4-jar-with-dependencies.jar storm.starter.RollingTopWords production-topology remote on the master node, the message below appears, indicating the topology was successfully submitted:
339 [main] INFO storm.starter.RollingTopWords - Topology name: production-topology
377 [main] INFO storm.starter.RollingTopWords - Running in remote (cluster) mode
651 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...
655 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar storm-starter-0.9.4-jar-with-dependencies.jar to assigned location: /home/ubuntu/storm/data/nimbus/inbox/stormjar-380bb1a2-1699-4ad1-8341-3d4b92c14764.jar
672 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: /home/ubuntu/storm/data/nimbus/inbox/stormjar-380bb1a2-1699-4ad1-8341-3d4b92c14764.jar
672 [main] INFO backtype.storm.StormSubmitter - Submitting topology production-topology in distributed mode with conf {"topology.debug":true}
714 [main] INFO backtype.storm.StormSubmitter - Finished submitting topology: production-topology
The Storm UI & storm list command show that the topology is active:
Topology_name Status Num_tasks Num_workers Uptime_secs
-------------------------------------------------------------------
production-topology ACTIVE 0 0 59
However, in the Cluster Summary of the Storm UI, there are 0 supervisors, 0 used slots, 0 free slots, 0 executors & 0 tasks. In the Topology Configuration, supervisor.slots.ports shows that it uses the default supervisor slot ports of the master node instead of the supervisor slot ports of the slave node.
Below is the zoo.cfg of my master node:
tickTime=2000
dataDir=/home/ubuntu/zookeeper-data
clientPort=2181
The storm.yaml of my master node:
storm.zookeeper.servers:
- "10.0.0.230"
storm.zookeeper.port: 2181
nimbus.host: "localhost"
nimbus.thrift.port: 6627
nimbus.task.launch.secs: 240
supervisor.worker.start.timeout.secs: 240
supervisor.worker.timeout.secs: 240
storm.local.dir: "/home/ubuntu/storm/data"
java.library.path: "/usr/lib/jvm/java-7-oracle"
The storm.yaml of my slave node:
storm.zookeeper.server:
- "10.0.0.230"
storm.zookeeper.port: 2181
nimbus.host: "10.0.0.230"
nimbus.thrift.port: 6627
storm.local.dir: "/home/ubuntu/storm/data"
java.library.path: "/usr/lib/jvm/java-7-oracle"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
I used zkCli.sh -server 10.0.0.230:2181 to connect to the ZooKeeper on the master node, and it works fine:
2015-05-04 03:40:20,866 [myid:] - INFO [main:ZooKeeper#438] - Initiating client connection, connectString=10.0.0.230:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher#63f78dde
2015-05-04 03:40:20,888 [myid:] - INFO [main-SendThread(10.0.0.230:2181):ClientCnxn$SendThread#975] - Opening socket connection to server 10.0.0.230/10.0.0.230:2181. Will not attempt to authenticate using SASL (unknown error)
Welcome to ZooKeeper!
2015-05-04 03:40:20,900 [myid:] - INFO [main-SendThread(10.0.0.230:2181):ClientCnxn$SendThread#852] - Socket connection established to 10.0.0.230/10.0.0.230:2181, initiating session
JLine support is enabled
2015-05-04 03:40:20,918 [myid:] - INFO [main-SendThread(10.0.0.230:2181):ClientCnxn$SendThread#1235] - Session establishment complete on server 10.0.0.230/10.0.0.230:2181, sessionid = 0x14d1ca1ab73001c, negotiated timeout = 30000
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[zk: 10.0.0.230:2181(CONNECTED) 0]
Below are the supervisor logs from my slave node:
2015-05-06T06:16:28.487+0000 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-05-06T06:16:28.487+0000 o.a.s.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.7.0_80]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744) ~[na:1.7.0_80]
at org.apache.storm.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) ~[storm-core-0.9.4.jar:0.9.4]
2015-05-06T06:16:28.589+0000 b.s.d.supervisor [ERROR] Error on initialization of server mk-supervisor
java.lang.RuntimeException: org.apache.storm.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /storm
at backtype.storm.util$wrap_in_runtime.invoke(util.clj:44) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.zookeeper$exists_node_QMARK_$fn__807.invoke(zookeeper.clj:102) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:98) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:114) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.cluster$mk_distributed_cluster_state.invoke(cluster.clj:43) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.cluster$mk_storm_cluster_state.invoke(cluster.clj:238) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.supervisor$supervisor_data.invoke(supervisor.clj:214) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.supervisor$fn__5518$exec_fn__1754__auto____5519.invoke(supervisor.clj:409) ~[storm-core-0.9.4.jar:0.9.4]
at clojure.lang.AFn.applyToHelper(AFn.java:167) [clojure-1.5.1.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
at clojure.core$apply.invoke(core.clj:617) ~[clojure-1.5.1.jar:na]
at backtype.storm.daemon.supervisor$fn__5518$mk_supervisor__5544.doInvoke(supervisor.clj:405) [storm-core-0.9.4.jar:0.9.4]
at clojure.lang.RestFn.invoke(RestFn.java:436) [clojure-1.5.1.jar:na]
at backtype.storm.daemon.supervisor$_launch.invoke(supervisor.clj:629) [storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.supervisor$_main.invoke(supervisor.clj:659) [storm-core-0.9.4.jar:0.9.4]
at clojure.lang.AFn.applyToHelper(AFn.java:159) [clojure-1.5.1.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
at backtype.storm.daemon.supervisor.main(Unknown Source) [storm-core-0.9.4.jar:0.9.4]
Caused by: org.apache.storm.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /storm
at org.apache.storm.zookeeper.KeeperException.create(KeeperException.java:99) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.zookeeper.KeeperException.create(KeeperException.java:51) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:157) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148) ~[storm-core-0.9.4.jar:0.9.4]
at org.apache.storm.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.zookeeper$exists_node_QMARK_$fn__807.invoke(zookeeper.clj:101) ~[storm-core-0.9.4.jar:0.9.4]
... 16 common frames omitted
2015-05-06T06:16:28.607+0000 b.s.util [ERROR] Halting process: ("Error on initialization")
java.lang.RuntimeException: ("Error on initialization")
at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:325) [storm-core-0.9.4.jar:0.9.4]
at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.5.1.jar:na]
at backtype.storm.daemon.supervisor$fn__5518$mk_supervisor__5544.doInvoke(supervisor.clj:405) [storm-core-0.9.4.jar:0.9.4]
at clojure.lang.RestFn.invoke(RestFn.java:436) [clojure-1.5.1.jar:na]
at backtype.storm.daemon.supervisor$_launch.invoke(supervisor.clj:629) [storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.supervisor$_main.invoke(supervisor.clj:659) [storm-core-0.9.4.jar:0.9.4]
at clojure.lang.AFn.applyToHelper(AFn.java:159) [clojure-1.5.1.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
at backtype.storm.daemon.supervisor.main(Unknown Source) [storm-core-0.9.4.jar:0.9.4]
Below are my nimbus logs from my master node:
2015-05-06T06:14:19.291+0000 b.s.d.nimbus [INFO] Using default scheduler
2015-05-06T06:14:19.304+0000 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [1000] the maxSleepTimeMs [30000] the maxRetries [5]
2015-05-06T06:14:19.415+0000 o.a.s.c.f.i.CuratorFrameworkImpl [INFO] Starting
2015-05-06T06:14:19.417+0000 o.a.s.z.ZooKeeper [INFO] Initiating client connection, connectString=10.0.0.230:2181 sessionTimeout=20000 watcher=org.apache.storm.curator.ConnectionState#795bca46
2015-05-06T06:14:19.436+0000 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server 10.0.0.230/10.0.0.230:2181. Will not attempt to authenticate using SASL (unknown error)
2015-05-06T06:14:19.448+0000 o.a.s.z.ClientCnxn [INFO] Socket connection established to 10.0.0.230/10.0.0.230:2181, initiating session
2015-05-06T06:14:19.457+0000 o.a.s.z.ClientCnxn [INFO] Session establishment complete on server 10.0.0.230/10.0.0.230:2181, sessionid = 0x14d27dbda310000, negotiated timeout = 20000
2015-05-06T06:14:19.459+0000 o.a.s.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
2015-05-06T06:14:19.460+0000 b.s.zookeeper [INFO] Zookeeper state update: :connected:none
2015-05-06T06:14:20.485+0000 o.a.s.z.ClientCnxn [INFO] EventThread shut down
2015-05-06T06:14:20.485+0000 o.a.s.z.ZooKeeper [INFO] Session: 0x14d27dbda310000 closed
2015-05-06T06:14:20.486+0000 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [1000] the maxSleepTimeMs [30000] the maxRetries [5]
2015-05-06T06:14:20.487+0000 o.a.s.c.f.i.CuratorFrameworkImpl [INFO] Starting
2015-05-06T06:14:20.487+0000 o.a.s.z.ZooKeeper [INFO] Initiating client connection, connectString=10.0.0.230:2181/storm sessionTimeout=20000 watcher=org.apache.storm.curator.ConnectionState#510d246b
2015-05-06T06:14:20.504+0000 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server 10.0.0.230/10.0.0.230:2181. Will not attempt to authenticate using SASL (unknown error)
2015-05-06T06:14:20.505+0000 o.a.s.z.ClientCnxn [INFO] Socket connection established to 10.0.0.230/10.0.0.230:2181, initiating session
2015-05-06T06:14:20.507+0000 o.a.s.z.ClientCnxn [INFO] Session establishment complete on server 10.0.0.230/10.0.0.230:2181, sessionid = 0x14d27dbda310001, negotiated timeout = 20000
2015-05-06T06:14:20.507+0000 o.a.s.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
2015-05-06T06:14:20.547+0000 b.s.d.nimbus [INFO] Starting Nimbus server...
I ran storm nimbus & storm ui on my master node, and storm supervisor on my slave node.
The supervisor.log from my slave node shows that my slave node tries to connect to ZooKeeper on localhost, although I specified in the storm.yaml of my slave node that my ZooKeeper is on my master node.
Why does this happen and how can I solve it?
Also, why does the Cluster Summary of the Storm UI show 0 supervisors, 0 used slots, 0 free slots, 0 executors & 0 tasks?
Why does it use the supervisor slot ports of the master node instead of those of the slave node?
When I click production-topology in the Topology Summary of the Storm UI, why are there 0 Num workers, 0 Num executors and 0 Num tasks?
Why is there no info displayed for Spouts & Bolts?
I discovered the problem. I should set up my ZooKeeper on my slave nodes, not on my master node. Now the problem is solved & the Storm cluster is up.
