Apache Cassandra: Unable to gossip with any seeds - cassandra-2.0

I built Cassandra server 2.0.3 and then ran it. It starts and then stops with these messages:
X:\MyProjects\cassandra\apache-cassandra-2.0.3-src\bin>cassandra.bat >log.txt
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1160)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:416)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:608)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:576)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:475)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:461)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:504)
What can I change to make it run?

I had a similar problem with my cassandra v2.0.4 cluster running a single node.
Check your cassandra.yaml and make sure that your "listen_address" and "seeds" values match, with the exception that the seeds value requires quotes around it.
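For a single-node setup that means the relevant cassandra.yaml entries line up like this (a minimal sketch; 192.168.1.50 is a placeholder for your node's address):
listen_address: 192.168.1.50
rpc_address: 192.168.1.50
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.50"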

You might get this problem if your private IP address is different from the public one (as on AWS). For example, the host thinks it's "172.31.0.2" when it's reachable externally as "55.70.33.10".
The solution to this problem is:
listen_address: 172.31.0.2
broadcast_address: 55.70.33.10

in cassandra.yaml
Make sure the cluster_name entry matches on all nodes in the cluster
(you may need to delete your data directories if you changed the cluster name).
Verify that all nodes can ping each other.
broadcast_rpc_address and listen_address should be set to the node's local IP
(not localhost or 127.0.0.1).
seeds should point to the IP address(es) of the seed node(s).
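A quick sketch of how to check those points from each node (10.0.0.2 is a placeholder peer address, and the yaml path assumes a package install under /etc/cassandra):
ping -c 3 10.0.0.2                                   # basic reachability from this node
nc -vz 10.0.0.2 7000                                 # gossip/storage port must be open both ways
grep -E 'cluster_name|listen_address|seeds' /etc/cassandra/cassandra.yaml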

If you are on AWS and use the Ec2MultiRegionSnitch you will need to set the seeds to the public IP addresses rather than the private IPs.

I had the same problem on Ubuntu 16.04 and I'm not sure which of these changes made it work. With XXX.XXX.XXX.XXX standing for your public-facing IP address, below are the relevant selections from cassandra.yaml:
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "XXX.XXX.XXX.XXX"
listen_address: XXX.XXX.XXX.XXX
broadcast_address: XXX.XXX.XXX.XXX
broadcast_rpc_address: XXX.XXX.XXX.XXX
listen_on_broadcast_address: true
start_rpc: true
rpc_address: XXX.XXX.XXX.XXX
I also needed to restart my virtual machine for some reason. ¯\_(ツ)_/¯

For a quick single node setup on RHEL, I did the following:
Get info about your network interface setup:
# /sbin/ifconfig -a
It will list the interfaces and the IP addresses they are attached to.
Usually it will show an "Ethernet" interface and a "Local Loopback".
Note the associated IP addresses.
Then edit conf/cassandra.yaml:
rpc_address: [Local Loopback address]
broadcast_rpc_address: [Ethernet address]
listen_address: [Local Loopback address]
broadcast_address: [Ethernet address]
listen_on_broadcast_address: true
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "[Ethernet address]"
Then also open the required ports in the Linux firewall: 9042, 7000 and 7001. More info about opening ports on Linux here:
http://ask.xmodulo.com/open-port-firewall-centos-rhel.html
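On RHEL/CentOS 7 with firewalld, for example, opening those ports amounts to something like the following (a sketch; adapt to whatever firewall tooling you actually use):
sudo firewall-cmd --permanent --add-port=7000/tcp   # inter-node gossip/storage
sudo firewall-cmd --permanent --add-port=7001/tcp   # SSL inter-node port
sudo firewall-cmd --permanent --add-port=9042/tcp   # CQL native transport for clients
sudo firewall-cmd --reload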

In cassandra.yaml I updated the seeds entry from a domain name to an IP address, and it worked.

This happened to me because the "initial_token" setting was specified in my configuration (I think because I had just copied the configuration file over from another cluster member). After clearing the data directory, commenting out the setting, and restarting the node, it worked fine for me.
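For illustration, the relevant part of cassandra.yaml should end up roughly like this (a sketch assuming vnodes, where num_tokens covers token assignment):
num_tokens: 256
# initial_token:    # leave commented out unless you deliberately assign tokens by hand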

I experienced this error today and could not find any reason for it other than timing issues.
I restarted many times and after a while it stuck. It looks like the nodes expect bi-directional communication on the gossip channel, and if it does not happen quickly enough (the window looks very small to me) they drop the connection and raise that error.
In my case I had just upgraded my software and restarted the computer, so it was clearly not a connection issue between the machines (I have firewalls and SSL, to complicate matters) and the node had been connected before. So the one entry I found about this from DataStax did not apply:
https://support.datastax.com/hc/en-us/articles/209691483-Bootstap-fails-with-Unable-to-gossip-with-any-seeds-yet-new-node-can-connect-to-seed-nodes

I got the same error. There can be more than one cause; hopefully my mistake is the same as yours.
I had my localhost IP pointing to a domain name (I did that so that my Spring Boot application's server context would be a domain name like www.example.com:8080 instead of localhost:8080), so I had the following entry in my hosts file on Windows:
127.0.0.1 www.example.com
Meanwhile the Cassandra batch file was looking for localhost, which it didn't find. So I added another entry for localhost to my hosts file:
127.0.0.1 localhost
127.0.0.1 www.example.com
After adding it, I opened a new command prompt, ran the Cassandra batch file from the Cassandra bin directory, and it worked.

Disable the firewall and SELinux and try again.

In our case SSL was enabled, and the cassandra.yaml configuration looked fine per the comments above. We then enabled SSL debugging by adding the following JVM parameter in cassandra-env.sh: -Djavax.net.debug=ssl:handshake
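For reference, that amounts to appending a line like the following to cassandra-env.sh (a sketch; the exact file location depends on your install):
JVM_OPTS="$JVM_OPTS -Djavax.net.debug=ssl:handshake"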
After starting the node again we noticed the following in the Cassandra log file:
MessagingService-Outgoing-geo2_host/xx.xx.xx.xx, Exception while waiting for close javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
After digging further into the SSL debug logs we found that the certificate was not valid. After fixing this SSL issue the node was able to join the cluster.

Thanks to elvingt. His answer reminded me that I needed to verify that all nodes can talk to each other.
https://support.datastax.com/hc/en-us/articles/209691483-Bootstap-fails-with-Unable-to-gossip-with-any-seeds-yet-new-node-can-connect-to-seed-nodes
Gossip communications must be bi-directional.
To verify, use this command, and test from BOTH SIDES:
nc -vz {your_node_ip} 7000
Then I recalled that I had turned on my Ubuntu firewall the night before. I opened the port with:
sudo ufw allow 7000/tcp
And it is working now.

Getting the error
Unable to gossip with any seeds
during startup/bootstrap indicates there is some issue with broadcast_address. broadcast_address is responsible for communication with other nodes, not with clients.
This address must be set on the seed node (it is mandatory there). If you are using cloud VMs you might have different IPs (public and private); it is recommended to use your private IPs for broadcast_address, which will save network cost as well.
# Address to broadcast to other Cassandra nodes
# Leaving this blank will set it to the same value as listen_address
broadcast_address: 10.11.xx.xxx
In my scenario I was using IBM, and once I set broadcast_address on the seed nodes the issue was resolved.
Please make sure you start your seed node first and then the other nodes; this order is mandatory.

In cassandra.yaml, changing the listen_address value from localhost to the domain name solved my issue.

I had the same issue. I checked ports and used tcpdump and netcat to test connections, and finally it came down to expired SSL certificates for internode_encryption. I set internode_encryption to 'none', restarted all nodes, and it worked.
Before that, all neighbor nodes were down, and the node repair command was failing with:
"Did not get positive replies from all endpoints"
P.S. Don't leave internode_encryption set to none for long; regenerate the certificates and enable it again.
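For reference, the temporary workaround lives under server_encryption_options in cassandra.yaml; a sketch (valid values for internode_encryption include all, none, dc and rack, and the keystore paths/passwords below are placeholders):
server_encryption_options:
    internode_encryption: none    # temporary only; switch back to all once certificates are renewed
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra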

Related

SLURM controller not being able to connect to workers and state is set as UNKNOWN

I am trying to setup a small cluster, managed with SLURM. The controller is also a compute node. The config in /etc/slurm/slurm.conf is:
NodeName=controller,node[01-02] RealMemory=250000 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
When running sinfo I get:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 2 unk* node[01-02]
compute* up infinite 1 idle controller
However, when running slurmd -C on each node I get:
NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=257655
UpTime=0-00:30:44
The same on the other node. I have allowed the ports 6817 and 6818 (the default slurm ports) on all machines (for TCP - which I assume is the protocol). I have also checked that the /etc/slurm/slurm.conf and /etc/slurm/slurmdbd.conf are the same, along with the munge keys (this works).
Is there any way to debug the connection to a given machine?
Thanks in advance for any help.
I was able to go through the log files and found that the connections were being blocked. The cluster runs Fedora, so I added each machine to the firewall's trusted list following this link: whitelist source ip addresses in centos 7.
These updated firewall settings did not seem to be applied straight away, so I had to restart all machines; now SLURM is functioning correctly.
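On Fedora/CentOS with firewalld that boils down to something like the following on each machine (a sketch; 10.0.0.0/24 is a placeholder for your cluster subnet):
sudo firewall-cmd --permanent --zone=trusted --add-source=10.0.0.0/24
sudo firewall-cmd --reload
sudo firewall-cmd --zone=trusted --list-all    # confirm the source was added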

"not in dispatcher" - issues connecting a validator peer to genesis validator

I have been banging my head for a while on this one.
So, I have successfully (maybe) created a running sawtooth validator with a settings-tp and poet-validator-registry (all containers from scratch).
I created it with a config-genesis.batch - then "proposal create" with poet and a public key pem etc. for a config.batch - then "poet registration create" for a poet.batch - "proposal create" again with the additional poet settings which give a poet-settings.batch.
Basically, I am copying for the most part the docker-compose for poet default, but now rolled with my own containers from scratch (I want to know how everything pieces together in detail).
Anyway, one of those details concerns keys and auth... it's finally running, the settings-tp and poet-validator-registry are happy with it and communicating normally, and then it makes a genesis block as it should.
However, I then try to connect another validator to it as a peer...
"No chain head and not the genesis node: starting in peering mode" - GREAT!
However, when it tries to connect:
[2018-05-10 10:30:10.542 INFO dispatch] Can't send message PING_RESPONSE back to ee58844c071426276de533cadfafbd3c2448604e59fd81f4758edc07b5beea89476a6252e0a2144d43f14e06bf90c57dd2613562221954e3b2eddc6d2fcd9ef6 because connection OutboundConnectionThread-tcp://192.168.1.200:8800 not in dispatcher
[2018-05-10 10:30:10.542 INFO dispatch] Can't send last message AUTHORIZATION_VIOLATION back to ee58844c071426276de533cadfafbd3c2448604e59fd81f4758edc07b5beea89476a6252e0a2144d43f14e06bf90c57dd2613562221954e3b2eddc6d2fcd9ef6 because connection OutboundConnectionThread-tcp://192.168.1.200:8800 not in dispatcher
It's so hard to find explanations for this; the only places I can find anything are the original references in the source code, and I'm not going to reverse-engineer that anytime soon.
My settings for the validators on startup are:
The usual binds to 0.0.0.0
peering dynamic
scheduler serial
network trust
Any help would be so soooo appreciated!
Many thanks in advance :)
Aaron.
The usual problem behind
Can't send message PING_RESPONSE back to . . . because connection ... not in dispatcher
is the configuration of the peer endpoints:
1) If you are using Ubuntu directly instead of Docker, use the Validator's hostname or IP address instead of the default ("validator"), which only works with Docker, or "localhost", which may not be routable
2) If you are using Docker, make sure the Docker ports are mapped to the Ubuntu OS, and that the OS IP address/port is routable between the two machines. Check the expose: and ports: entries in your docker-compose.yaml file or similar file.
3) Verify network connectivity to the remote machine with ping
4) Verify port connectivity with telnet aremotehostname 8800 (replace aremotehostname with the remote peer's hostname or IP address)
5) Check the peer configuration in your /etc/sawtooth/validator.toml files: check the peering and endpoint lines, and the seeds line (for dynamic peering) or the peers line (for static peering); a rough sketch follows
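As a sketch of what the second validator's /etc/sawtooth/validator.toml might contain for dynamic peering (the 192.168.1.200 address comes from the question's log; 192.168.1.201 and eth0 are placeholders for this validator's own address and interface):
bind = [
  "network:tcp://eth0:8800",
  "component:tcp://127.0.0.1:4004"
]
endpoint = "tcp://192.168.1.201:8800"   # address that peers use to reach *this* validator
peering = "dynamic"
seeds = ["tcp://192.168.1.200:8800"]    # the genesis validator from the question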

Elasticsearch cluster configuration is not discovering any nodes under both unicast and multicast

I've been trying to use the lovely ansible-elasticsearch project to set up a nine-node Elasticsearch cluster.
Each node is up and running... but they are not communicating with each other. The master nodes think there are zero data nodes. The data nodes are not connecting to the master nodes.
They all have the same cluster.name. I have tried with multicast enabled (discovery.zen.ping.multicast.enabled: true) and disabled (previous setting to false, and discovery.zen.ping.unicast.hosts:["host1","host2",..."host9"]) but in either case the nodes are not communicating.
They have network connectivity to one another - verified via telnet over port 9300.
Sample output:
$ curl host1:9200/_cluster/health
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":"waited for [30s]"}],"type":"master_not_discovered_exception","reason":"waited for [30s]"},"status":503}
I cannot think of any more reasons why they wouldn't connect - looking for any more ideas of what to try.
Edit: I finally resolved this issue. The settings that worked were publish_host to "_non_loopback:ipv4_" and unicast with discovery.zen.ping.unicast.hosts set to ["host1:9300","host2:9300","host3:9300"] - listing only the dedicated master nodes. I have a minimum master node count of 2.
The only reasons I can think that can cause that behavior are:
Connectivity issues - Ping is not a good tool to check that nodes can connect to each other. Use telnet and try connecting from host1 to host2 on port 9300.
Your elasticsearch.yml is set to bind 127.0.0.1 or the wrong host (if you're not sure, bind 0.0.0.0 to see whether that solves the connectivity issue, then change it to bind only internal hosts so that Elasticsearch is not exposed directly to the internet).
Your publish_host is incorrect - this usually happens when you run ES inside a Docker container, for example; you need to make sure that publish_host is set to an address that other hosts can reach.
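Put together in elasticsearch.yml, the settings the asker reported working look roughly like this (host names are placeholders):
network.publish_host: _non_loopback:ipv4_
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1:9300", "host2:9300", "host3:9300"]   # dedicated master nodes only
discovery.zen.minimum_master_nodes: 2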

Wildfly clustering with VirtualBox

I am using VirtualBox on a Windows 7 host with two Debian 7.7 guests, deb1 and deb2. Each guest can communicate with the other one: using one guest's browser I can see the Wildfly instance welcome page running on the other guest. I run each instance in standalone-ha mode, the network interfaces have multicast enabled, and I can see on the Wildfly node named srv1 that the two instances build a cluster:
...
...ISPN000094: Received new cluster view: [srv2/web|3] (2) [srv2/web, srv1/web]
where srv1 and srv2 are the node names of the instances. A tcpdump shows UDP packets crossing the multicast address 230.0.0.4, exactly where JGroups is listening. Despite all this goodness, the HTTP session is not shared, and that is my problem.
The application I use is very simple and <distributable/>; I have already used it successfully in a multiple-nodes-on-a-single-host scenario.
UPDATE: I ran some tests using JGroups' test applications McastReceiverTest and McastSenderTest with the following addresses: 230.0.0.4:45688, 230.0.0.4:45700 and 224.0.1.105:23364. Every test worked; on the receiver guest I can read what the sender guest sent. I also tried changing the application; I use this one https://github.com/liweinan/cluster-demo but the HTTP session is still not shared.
Wildfly works well; I was looking at the problem as if I were still running multiple instances on a single host. As the JBoss forum suggests, I tried with curl, retrieving my JSESSIONID, and I saw the cluster responding as expected. Happy ending.
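For reference, the curl check looks roughly like this (the context path is a placeholder for whatever page your <distributable/> app serves):
curl -i http://deb1:8080/cluster-demo/                                    # note the JSESSIONID value in the Set-Cookie header
curl -i -b "JSESSIONID=<value-from-above>" http://deb2:8080/cluster-demo/ # the second node should recognize the same session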

Starting multiple remote servers with Akka

I'm running into some deployment issues using Akka remoting to implement a small search application.
I want to deploy my ActorSystem on a set of local cluster machines to use them as workers, but I'm a bit confused for what to put into my application.conf to make this happen. For example, I can use:
akka.remote {
  transport = "akka.remote.netty.NettyRemoteTransport"
  netty {
    hostname = "0.0.0.0"
    port = 2552
  }
}
Each worker just runs the ActorSystem at startup.
This allows my worker machines to bind to their address when they start up, but then they refuse to listen to messages:
beaker-24: [ERROR] ... dropping message DaemonMsgWatch for non-local recipient akka://SearchService#beaker-24:2552/remote at akka://SearchService#0.0.0.0:2552
The documentation I've found for this so far only discusses deployment on my localhost, which is not so useful :). I'm hoping there is a way to do this without generating a separate configuration for each host.
Update:
Using an empty string as the hostname allows for contacting the host via the normal IP address. Addressing using the hostname itself doesn't work at the moment.
Setting “0.0.0.0” as host name will currently basically disable remoting, because that is not a legal IP to send to. Background: actor references get the configured IP (or host name) inserted in their address part when they leave the local system, and that is exactly their “pointer home” for other systems to send messages back.
There has been an effort by Scott which would enable a system to receive replies at a different address here, but that is not included yet, and we may well choose a different solution to this problem.
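If you want to avoid a separate configuration per host, one sketch (assuming Typesafe Config's fallback of unresolved substitutions to environment variables) is to leave the hostname to an optional environment variable and export each worker's own address at start-up, e.g. AKKA_HOSTNAME=$(hostname -f):
akka.remote {
  transport = "akka.remote.netty.NettyRemoteTransport"
  netty {
    # AKKA_HOSTNAME is a placeholder environment variable; if it is unset,
    # the entry is omitted and the library default applies
    hostname = ${?AKKA_HOSTNAME}
    port = 2552
  }
}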
