Storm Nimbus fails to start when the first ZooKeeper server is down - apache-storm

I set up Apache Storm 0.9.3 in fully distributed mode (3 nodes), backed by a fully distributed Apache ZooKeeper cluster (3.4.6) of 3 nodes. I did the following testing and found that Storm Nimbus fails to start if the first ZooKeeper server listed in storm.yaml is down or temporarily unreachable.
Test #1:
bring up all three zookeeper nodes
start nimbus, supervisor, and ui on the storm master node, and launch supervisor on the other two nodes
in this case, everything goes well
Test #2:
shut down one of the three zookeeper nodes (zookeeper is still functional)
start nimbus, supervisor, and ui on the storm master node, and launch supervisor on the other two nodes
in this case, if the downed ZooKeeper node happens to be the first one listed in storm.zookeeper.servers, neither nimbus nor supervisor starts on the master node.
I am wondering whether any of you have run into this problem. Is this unexpected behavior, is something wrong with my configuration, or is it something else?
My configuration is listed below:
storm.zookeeper.servers :
- "zookeeper1.hostname.local"
- "zookeeper2.hostname.local"
- "zookeeper3.hostname.local"
nimbus.host : storm-master.hostname.local
nimbus.thrift.port : 6627
storm.zookeeper.port : 2181
supervisor.slots.ports :
- 6700
- 6701
- 6702
- 6703
ui.port : 8744
storm.local.dir : /opt/apache-storm-0.9.3/storm-local
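For reference, a quick way to confirm that each ZooKeeper host listed in storm.yaml is reachable on the client port from the Storm master (hostnames and port taken from the configuration above; requires nc):
for zk in zookeeper1.hostname.local zookeeper2.hostname.local zookeeper3.hostname.local; do
  nc -vz "$zk" 2181
done
In Test #2, the downed host should be the only one that fails this check.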

Related

Starting storm nimbus command doesn't work

I have zookeeper servers, and I'm trying to install storm using those zk servers.
My storm.yaml file looks like:
storm.zookeeper.servers:
- "ZKSERVER-0"
- "ZKSERVER-1"
- "ZKSERVER-2"
storm.local.dir: "/opt/apache-storm-2.2.0/data"
nimbus.host: "localhost"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
I tested ping with those ZKSERVERs, and it worked fine.
However, when I start nimbus with the ./storm nimbus command, it doesn't show any error, but it doesn't finish either.
root@69e55d266f5a:/opt/apache-storm-2.2.0/bin:> ./storm nimbus
Running: /usr/jdk64/jdk1.8.0_112/bin/java -server -Ddaemon.name=nimbus -Dstorm.options= -Dstorm.home=/opt/apache-storm-2.2.0 -Dstorm.log.dir=/opt/apache-storm-2.2.0/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib:/usr/lib64 -Dstorm.conf.file= -cp /opt/apache-storm-2.2.0/*:/opt/apache-storm-2.2.0/lib/*:/opt/apache-storm-2.2.0/extlib/*:/opt/apache-storm-2.2.0/extlib-daemon/*:/opt/apache-storm-2.2.0/conf -Xmx1024m -Djava.deserialization.disabled=true -Dlogfile.name=nimbus.log -Dlog4j.configurationFile=/opt/apache-storm-2.2.0/log4j2/cluster.xml org.apache.storm.daemon.nimbus.Nimbus
The terminal just shows the logs above, and nothing changes until I press Ctrl+C.
What could be a problem here?
Can you share the log of the nimbus?
Generally, the nimbus daemon stays running until you stop it or it hits an error. If you want to be sure about your nimbus status, you can check its log (./logs/nimbus.log).
Running the ./storm nimbus command starts the process in the foreground, as your example shows. This is the usual behavior.
If you want to run Storm in the background, run it with the nohup command:
nohup ./storm nimbus > storms.log &
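You can also watch what Nimbus is doing by tailing its log in another terminal; the log directory and file name come from the command line shown in the question:
tail -f /opt/apache-storm-2.2.0/logs/nimbus.log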

Unable to gossip with any seeds but continuing since node is in its own seed list

To remove a node from a 2-node cluster in AWS I ran
nodetool removenode <Host ID>
After this I expected to get my cluster back, provided I had configured cassandra.yaml and cassandra-rackdc.properties correctly.
I did that, but I am still not able to get my cluster back.
nodetool status is displaying only one node.
The significant part of system.log on Cassandra is:
INFO [main] 2017-08-14 13:03:46,409 StorageService.java:553 - Cassandra version: 3.9
INFO [main] 2017-08-14 13:03:46,409 StorageService.java:554 - Thrift API version: 20.1.0
INFO [main] 2017-08-14 13:03:46,409 StorageService.java:555 - CQL supported versions: 3.4.2 (default: 3.4.2)
INFO [main] 2017-08-14 13:03:46,445 IndexSummaryManager.java:85 - Initializing index summary manager with a memory pool size of 198 MB and a resize interval of 60 minutes
INFO [main] 2017-08-14 13:03:46,459 MessagingService.java:570 - Starting Messaging Service on /172.15.81.249:7000 (eth0)
INFO [ScheduledTasks:1] 2017-08-14 13:03:48,424 TokenMetadata.java:448 - Updating topology for all endpoints that have changed
WARN [main] 2017-08-14 13:04:17,497 Gossiper.java:1388 - Unable to gossip with any seeds but continuing since node is in its own seed list
INFO [main] 2017-08-14 13:04:17,499 StorageService.java:687 - Loading persisted ring state
INFO [main] 2017-08-14 13:04:17,500 StorageService.java:796 - Starting up server gossip
Content of files:
cassandra.yaml : https://pastebin.com/A3BVUUUr
cassandra-rackdc.properties: https://pastebin.com/xmmvwksZ
system.log : https://pastebin.com/2KA60Sve
netstat -atun https://pastebin.com/Dsd17i0G
Both nodes have the same error log.
All required ports are open.
Any suggestions?
It's usually a best practice to have one seed node per DC if you have just two nodes available in your datacenter. You shouldn't make every node a seed node in this case.
I noticed that node1 has - seeds: "node1,node2" and node2 has - seeds: "node2,node1" in your configuration. By default, a node will start without contacting any other seeds if it finds its own IP address as the first element of the - seeds: ... section in the cassandra.yaml configuration file. That's also what you can see in your logs:
... Unable to gossip with any seeds but continuing since node is in its own seed list ...
I suspect that in your case node1 and node2 are starting without contacting each other, since they each identify themselves as a seed node.
Try using just node1 as the seed node in both instances' configurations and restart your cluster.
If node1 is down and node2 is up, change the - seeds: ... section in node1's configuration to point only to node2's IP address and then boot node1.
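For example (a sketch; substitute node1's actual IP address), the relevant section of cassandra.yaml on both nodes would then look like:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "<node1-ip>"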
If your nodes can't find each other because of a firewall misconfiguration, it's usually a good approach to verify whether a specific port is accessible from the other machine. For example, you can use nc to check whether a certain port is open:
nc -vz node1 7000
References and Links
See the list of ports Cassandra uses at the following link:
http://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureFireWall.html
See also the detailed documentation on running multiple nodes, with plenty of sample commands:
http://docs.datastax.com/en/cassandra/2.1/cassandra/initialize/initializeMultipleDS.html
This is for future reference. My problem was solved just by opening port 7000 for the same security group in AWS. The port was open, but for a different security group.
When I ran:
[ec2-user@ip-Node1 ~]$ telnet Node2 7000
Trying Node2...
telnet: connect to address Node2: Connection timed out
I realized the problem could be with the security group.
And that is how it was solved.
As for seeds, I am using the IPs of both nodes, like this:
- seeds: "node1,node2"
It is the same on both nodes.

Storm UI Internal Server Error for local cluster

Internal Server Error
org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
at backtype.storm.thrift$nimbus_client_and_conn.invoke(thrift.clj:75)
at backtype.storm.ui.core$all_topologies_summary.invoke(core.clj:515)
at backtype.storm.ui.core$fn__8018.invoke(core.clj:851)
at compojure.core$make_route$fn__6199.invoke(core.clj:93)
at compojure.core$if_route$fn__6187.invoke(core.clj:39)
at compojure.core$if_method$fn__6180.invoke(core.clj:24)
at compojure.core$routing$fn__6205.invoke(core.clj:106)
at clojure.core$some.invoke(core.clj:2443)
at compojure.core$routing.doInvoke(core.clj:106)
at clojure.lang.RestFn.applyTo(RestFn.java:139)
at clojure.core$apply.invoke(core.clj:619)
at compojure.core$routes$fn__6209.invoke(core.clj:111)
at ring.middleware.reload$wrap_reload$fn__6234.invoke(reload.clj:14)
at backtype.storm.ui.core$catch_errors$fn__8059.invoke(core.clj:909)
at ring.middleware.keyword_params$wrap_keyword_params$fn__6876.invoke(keyword_params.clj:27)
at ring.middleware.nested_params$wrap_nested_params$fn__6915.invoke(nested_params.clj:65)
at ring.middleware.params$wrap_params$fn__6848.invoke(params.clj:55)
at ring.middleware.multipart_params$wrap_multipart_params$fn__6943.invoke(multipart_params.clj:103)
at ring.middleware.flash$wrap_flash$fn__7124.invoke(flash.clj:14)
I followed the method in https://hadooptips.wordpress.com/2014/05/26/configuring-single-node-storm-cluster/ to set up Storm on Ubuntu 14.04 LTS.
When I try to connect to the Storm UI, I get the error shown above.
My storm.yaml in /home/user/storm/conf is as below:
########### These MUST be filled in for a storm configuration
storm.zookeeper.servers:
- "localhost"
storm.zookeeper.port: 2181
nimbus.host: "localhost"
nimbus.thrift.port: 6627
# ui.port:8772
storm.local.dir: "/home/user/storm/data"
java.library.path: "/usr/lib/jvm/java-7-oracle"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
Does anyone know how to solve this? I'm a newbie, so a detailed solution would be helpful.
My zoo.cfg is as below:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/home/user/zookeeper-data
# The location of the log file
dataLogDir=/home/user/zookeeper/log/data_log
# the port at which the clients will connect
clientPort=2181
server.1=10.0.0.2:2888:3888
server.2=10.0.0.3:2888:3888
server.3=10.0.0.4:2888:3888
# The number of snapshots to retain in dataDir
autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
autopurge.purgeInterval=1
I run this in VMware on Ubuntu 14.04 LTS. What IP address should I put in server.1?
I think your ZooKeeper is not running properly. Before running ZooKeeper, you have to create a myid file on each node that contains only that node's id.
Please refer to: Zookeeper - three nodes and nothing but errors
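For example, using the dataDir from the zoo.cfg above, on the machine listed as server.1 you would create:
echo 1 > /home/user/zookeeper-data/myid
and run echo 2 and echo 3 on the server.2 and server.3 machines respectively, so that each node's myid matches its server.N entry in zoo.cfg.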

NameNode has not started on the slave node when I create a master-slave configuration in Hadoop

When I set up a master-slave configuration in Hadoop 2.2.0 and run the start-yarn.sh command, the ResourceManager starts successfully on the master node, but the NodeManager does not start on the slave node. I have already double-checked all the configuration and settings needed for a master-slave setup.
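A first step (a sketch; paths assume the default Hadoop 2.2.0 tarball layout under $HADOOP_HOME) is to look at the NodeManager log on the slave and confirm the slave is listed on the master:
jps                                                     # on the slave: is NodeManager in the list?
tail -n 50 $HADOOP_HOME/logs/yarn-*-nodemanager-*.log   # on the slave: look for the actual startup error
cat $HADOOP_HOME/etc/hadoop/slaves                      # on the master: is the slave's hostname listed?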

unable to determine zookeeper ensemble health

I set up a 3-node ZooKeeper CDH4 ensemble on RHEL 5.5 machines. I started the service by running zkServer.sh on each of the nodes. A ZooKeeper instance is running on all the nodes, but how do I know whether they are part of an ensemble or running as individual services?
I tried to start the service and check the ensemble as stated here, on Cloudera's site, but it throws a ClassNotFoundException.
You can use the stat four-letter word:
~$ echo stat | nc 127.0.0.1 <zkport>
which gives you output like:
Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
/127.0.0.1:55829[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x100000000
Mode: leader
Node count: 4
The Mode: line tells you what mode the server is running in: leader, follower, or standalone if the node is not part of a cluster.
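To check the whole ensemble from one machine, you can run the same check against each member (a sketch; zk1, zk2, zk3 stand in for your three hostnames):
for host in zk1 zk2 zk3; do
  echo -n "$host: "; echo stat | nc "$host" 2181 | grep Mode
done
A healthy three-node ensemble reports one leader and two followers.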