Cassandra Node communication issue - amazon-ec2

I have a two-node cluster on AWS. Everything was working fine until yesterday.
Today, when I run nodetool status, each node reports the other as down.
Node1 thinks Node2 is down and vice versa.
From ip2
ip2$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN <ip1> ? 256 ? 27c91f95-4b58-492b-a16e-d9b99867a505 r1
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN <ip2> 9.11 GiB 256 ? e628324d-34dd-4c9c-a53d-99abfacb54af rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
From ip1
ip1$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN <ip2> ? 256 ? e628324d-34dd-4c9c-a53d-99abfacb54af r1
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN <ip1> 9.14 GiB 256 ? 27c91f95-4b58-492b-a16e-d9b99867a505 rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
The last line suggests there is a replication settings problem, but I am not able to figure out what it is. Any suggestions? The log also shows:
WARN [OptionalTasks:1] 2017-08-08 15:33:37,223 CassandraRoleManager.java:344 - CassandraRoleManager skipped default role setup: some nodes were not ready
INFO [OptionalTasks:1] 2017-08-08 15:33:37,223 CassandraRoleManager.java:383 - Setup task failed with error, rescheduling
INFO [HANDSHAKE-/172.15.14.106] 2017-08-08 15:33:37,340 OutboundTcpConnection.java:515 - Handshaking version with /172.15.14.106
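One detail visible in both outputs above is that each node lists its peer under datacenter DC1 (rack r1) while listing itself under datacenter1 (rack1). The following is a minimal Python sketch, not part of the original post, that pulls the datacenter, status code, address, and rack out of nodetool status text so that such differences are easier to spot. The function name parse_nodetool_status and the SAMPLE constant are made up for illustration; SAMPLE is the ip2 output from the question.

```python
import re

# nodetool status output as seen from ip2, copied from the question.
SAMPLE = """\
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN <ip1> ? 256 ? 27c91f95-4b58-492b-a16e-d9b99867a505 r1
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN <ip2> 9.11 GiB 256 ? e628324d-34dd-4c9c-a53d-99abfacb54af rack1
"""


def parse_nodetool_status(text):
    """Return a list of (datacenter, code, address, rack) tuples."""
    nodes = []
    datacenter = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "Datacenter:" and len(tokens) >= 2:
            datacenter = tokens[1]
            continue
        # Node rows start with a two-letter code: U/D (status), then N/L/J/M (state).
        if re.fullmatch(r"[UD][NLJM]", tokens[0]) and len(tokens) >= 3:
            # The address is the second column, the rack is the last one.
            nodes.append((datacenter, tokens[0], tokens[1], tokens[-1]))
    return nodes


if __name__ == "__main__":
    for dc, code, address, rack in parse_nodetool_status(SAMPLE):
        print(f"{address}: {code} in datacenter={dc} rack={rack}")
```

Whether the differing datacenter and rack names are related to the nodes seeing each other as down is not established by the post; the sketch only makes the difference easy to see.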

Related

Consul UI does not show

I am running a single-node Consul (v1.8.4) on Ubuntu 18.04. The consul service is up, and I have set ui to true (default).
But when I try to access http://192.168.37.128:8500/ui, the browser reports:
This site can’t be reached 192.168.37.128 took too long to respond.
ui.json
{
"addresses": {
"http": "0.0.0.0"
}
}
consul.service file:
[Unit]
Description=Consul
Documentation=https://www.consul.io/
[Service]
ExecStart=/usr/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind=-config-dir=/etc/consul.d/
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
systemctl status consul
● consul.service - Consul
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2020-10-04 19:19:08 CDT; 50min ago
Docs: https://www.consul.io/
Main PID: 9477 (consul)
Tasks: 9 (limit: 4980)
CGroup: /system.slice/consul.service
└─9477 /opt/consul/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind=1
agent.server.raft: heartbeat timeout reached, starting election: last-leader=
agent.server.raft: entering candidate state: node="Node at 192.168.37.128:8300 [Candid
agent.server.raft: election won: tally=1
agent.server.raft: entering leader state: leader="Node at 192.168.37.128:8300 [Leader]
agent.server: cluster leadership acquired
agent.server: New leader elected: payload=vault
agent.leader: started routine: routine="federation state anti-entropy"
agent.leader: started routine: routine="federation state pruning"
agent.leader: started routine: routine="CA root pruning"
agent: Synced node info
The log shows the agent binding at 192.168.37.128:8300.
The issue was the firewall; I had to open port 8500:
sudo ufw allow 8500/tcp
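To confirm whether the port is reachable from another machine before and after changing the firewall rule, a quick TCP check can help. This is a minimal Python sketch rather than anything Consul-specific; the helper name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, port_is_open("192.168.37.128", 8500), run from the machine where the browser is, should return True once the ufw rule is in place.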

YARN complains java.net.NoRouteToHostException: No route to host (Host unreachable)

Attempting to run H2O on an HDP 3.1 cluster and running into an error that appears to be about YARN resource capacity...
[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Looking in the YARN configs in Ambari UI, these properties are nowhere to be found. But checking the YARN logs in the YARN resource manager UI and checking some of the logs for the killed application, I see what appears to be unreachable-host errors...
Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
....
at java.net.Socket.<init>(Socket.java:211)
at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************
Note the "java.net.NoRouteToHostException: No route to host (Host unreachable)". However, all the nodes can reach and ping each other, so I am not sure what is going on here. Any suggestions for debugging or fixing this?
I think I found the problem. TL;DR: firewalld was still running (the nodes run CentOS 7), when it should be disabled on HDP clusters.
From another community post:
For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:
systemctl disable firewalld
service firewalld stop
So apparently iptables and firewalld need to be disabled across the cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services across the cluster (I recommend using clush), I was able to run the YARN job without incident.
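A successful ping only shows that ICMP gets through; it says nothing about whether a given TCP port is allowed. A firewall that rejects traffic (rather than silently dropping it) commonly shows up on the connecting side as "No route to host", which matches the exception above. The following Python sketch, with hypothetical helper names explain_errno and probe, tries a TCP connection and maps the resulting error code to a likely cause. The hints are rules of thumb, not a definitive diagnosis.

```python
import errno
import socket

# Rule-of-thumb interpretations of common connect() failures.
HINTS = {
    errno.EHOSTUNREACH: "no route to host: often a firewall rejecting the packet, or a routing problem",
    errno.ECONNREFUSED: "connection refused: host reachable, but nothing is listening on that port",
    errno.ETIMEDOUT: "timed out: packets are probably being dropped silently",
}


def explain_errno(code):
    """Map an errno value (or None) to a short, hedged explanation."""
    return HINTS.get(code, f"other error (errno={code})")


def probe(host, port, timeout=3.0):
    """Try a TCP connection and describe the outcome as a string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Python-level timeouts do not always carry an errno value.
        return explain_errno(errno.ETIMEDOUT)
    except OSError as exc:
        return explain_errno(exc.errno)
```

Running probe("172.18.4.49", 46015) from a worker node while the job is waiting would test the mapper->driver callback address printed in the log above; the port changes on every run, so it has to be taken from the current output.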
Normally, this problem is either due to bad DNS configuration, firewalls, or network unreachability. To quote this official doc:
The hostname of the remote machine is wrong in the configuration files
The client's host table /etc/hosts has an invalid IPAddress for the target host.
The DNS server's host table has an invalid IPAddress for the target host.
The client's routing tables (In Linux, iptables) are wrong.
The DHCP server is publishing bad routing information.
Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it is to deliberately lock down the Hadoop cluster.
The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6
The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs
For me, the problem was that the driver was running inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not on the same subnet. The solution, as given in this answer, was to set the following configuration properties:
spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
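As an illustration, these properties can be passed on the spark-submit command line as --conf key=value pairs. The Python sketch below only builds that argument list; the helper name driver_conf_args and the example IP address and port numbers are placeholders, not values from the original answer.

```python
def driver_conf_args(host_ip, driver_port, block_manager_port):
    """Build spark-submit --conf arguments for a driver running in a container."""
    conf = {
        # Address the workers use to reach the driver (the container's host).
        "spark.driver.host": host_ip,
        # Inside the container, listen on all interfaces.
        "spark.driver.bindAddress": "0.0.0.0",
        # Both ports must be forwarded from the host into the container.
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(["spark-submit"] + driver_conf_args("10.0.0.5", 40000, 40001)))
```

Fixing the two ports, rather than letting Spark pick random ones, is what makes it possible to forward them through Docker.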

Cassandra - unable to connect via cqlsh

I have a problem connecting to Cassandra via cqlsh. I've deployed a cluster consisting of 3 nodes on CentOS 7. I can see that the nodes are connecting to each other. The nodetool status output is below:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN ${SEED2} 226.47 KiB 1 60,3% <hash> rack1
UN ${SEED} 190.77 KiB 1 50,9% <hash> rack1
UN ${IP} 157.62 KiB 1 88,7% <hash> rack1
But connecting via cqlsh doesn't work. I've tried connection to localhost and to node IP. Here is the output of cqlsh command:
[root@node02 default.conf]# cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1':
error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error:
Connection refused")})
[root@node02 default.conf]# cqlsh ${IP}
connection error: ('Unable to connect to any servers', {'${IP}':
ConnectionShutdown('Connection to ${IP} was closed',)})
It's not obvious to me why 'Connection to ... was closed' is printed when connecting to rpc_address, but 'Connection refused' when connecting to localhost.
Does anyone know the cause of this problem?
The cassandra.yaml file is below:
# Cassandra storage config YAML
cluster_name: '${NAME}'
hinted_handoff_enabled: true
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
data_file_directories:
- /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
hints_directory: /var/lib/cassandra/hints
key_cache_size_in_mb: 2
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
saved_caches_directory: /var/lib/cassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
concurrent_reads: 32
concurrent_writes: 32
storage_port: 7000
ssl_storage_port: 7001
rpc_port: 9042
start_rpc: true
rpc_keepalive: true
rpc_server_type: sync
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
listen_address: ${IP}
rpc_address: ${IP}
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: ${IP},${SEED}
Found the issue: you set rpc_port to 9042. I think you're confusing RPC (Thrift) with the native transport (CQL). RPC is the old interface, which is deprecated in later releases. I would recommend setting start_rpc to false and setting rpc_port back to its default value, 9160.
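The check described above can be sketched in Python. This is an illustration only: the function name check_client_ports is hypothetical, the settings are passed in as a plain dict (for example, the result of loading cassandra.yaml with a YAML parser), and the fallback ports 9160 and 9042 are the usual defaults for rpc_port and native_transport_port.

```python
DEFAULT_RPC_PORT = 9160      # Thrift (old RPC interface)
DEFAULT_NATIVE_PORT = 9042   # CQL native transport, used by cqlsh


def check_client_ports(conf):
    """Return a list of warnings about the client-facing port settings."""
    warnings = []
    rpc_port = conf.get("rpc_port", DEFAULT_RPC_PORT)
    native_port = conf.get("native_transport_port", DEFAULT_NATIVE_PORT)
    # Both servers cannot listen on the same port at the same address.
    if conf.get("start_rpc", False) and rpc_port == native_port:
        warnings.append(
            f"rpc_port and native_transport_port are both {rpc_port}; "
            "Thrift would occupy the port cqlsh expects to speak CQL on"
        )
    return warnings


if __name__ == "__main__":
    # Settings from the question, followed by the suggested fix.
    print(check_client_ports({"rpc_port": 9042, "start_rpc": True}))
    print(check_client_ports({"rpc_port": 9160, "start_rpc": False}))
```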

Spark node is not starting in DSE cluster

Analytics node seems down.
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns
DN 172.20.10.20 4.44 MB 1 ?
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns
UN 172.20.10.18 281.94 GB 1 ?
UN 172.20.10.19 281.21 GB 1 ?
UN 172.20.10.17 281.23 GB 1 ?
Datacenter: Solr
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns
UN 172.20.10.22 277.97 GB 1 ?
UN 172.20.10.21 286.75 GB 1 ?
So I logged in to that node and tried to start Spark:
dse cassandra -k
But I get the exception below:
Exception encountered during startup: null
INFO 07:15:58 DSE shutting down...
INFO 07:15:58 All plugins are stopped.

Datastax Opscenter - Agent not connecting

I set up Cassandra, OpsCenter and the required DataStax agent on my Amazon EC2 machine. At the moment it's only one machine.
Everything seems to be running fine, except the node list is empty and so are the keyspaces in the Opscenter. The cassandra, datastax and opscenter logs show no errors and I followed the installation / configuration carefully. Then tried all the suggested fixes.
My guess is the problem lies in the communication between the agent and opscenter.
After a while these requests fail:
etc/cassandra/cassandra.yaml: (simplified)
cluster_name: 'CassandraCluster'
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "1.2.3.4"
listen_address: 1.2.3.4
rpc_address: 0.0.0.0
endpoint_snitch: Ec2Snitch
etc/opscenter/opscenterd.conf: (simplified)
[webserver]
port = 81
interface = 0.0.0.0
[authentication]
enabled = False
[stat_reporter]
[agents]
use_ssl = false
var/lib/datastax-agent/conf/address.yaml: (simplified)
stomp_interface: 1.2.3.4
local_interface: 1.2.3.4
use_ssl: 0
nodetool status output:
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: eu-west_1_cassandra
===============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 1.2.3.4 2.06 MB 256 100.0% 8a121c12-7cbf-4a2a-b111-4ad111c111d8 1a
Nothing really strange shows up in the logs except for the repeated occurrence of the following line in agent.log:
INFO [install-location-finder] 2015-03-11 15:26:04,690 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:27:04,698 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:28:04,709 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:29:04,716 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:30:04,724 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:31:04,731 New JMX connection (127.0.0.1:7199)
To supply all the info here are the logs:
opscenterd.log
agent.log
cassandra/system.log
In certain environments the persistent connection between the browser and opscenterd may fail. We're working on implementing a more robust connection that will work in all environments, but in the meantime you can use the following workaround:
http://www.datastax.com/documentation/opscenter/5.1/opsc/troubleshooting/opscTroubleshootingZeroNodes.html
The minimal configuration that I found to work was setting the options below in address.yaml:
stomp_interface: [opscenter-ip]
stomp_port: 61620
use_ssl: 0
cassandra_conf: /etc/cassandra/cassandra.yaml
jmx_host: [cassandra-node-ip]
jmx_port: 7199
Also make sure you have sysstat installed.

Resources