High CPU usage on idle AMQ Artemis cluster, related to locks with shared-store HA - high-availability

I have an AMQ Artemis cluster with shared-store HA (master-slave), version 2.17.0.
I noticed that all my clusters (active servers only) use 10% to 20% of CPU while idle (no one is using them), except one, which uses around 1% (totally normal). I started investigating...
Long story short: only one cluster has completely normal CPU usage. The only difference I've managed to find is that if I log in to that normal cluster's master node and run telnet slave 61616, it shows as connected. If I do the same in any other cluster (one with high CPU usage), it shows as rejected.
To better understand what is happening, I enabled DEBUG logs in instance/etc/logging.properties. Here is what the master node is spamming:
2021-05-07 13:54:31,857 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Backup is not active, trying original connection configuration now.
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying reconnection attempt 0/1
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying to connect with connectorFactory = org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnectorFactory@6cf71172, connectorConfig=TransportConfiguration(name=slave-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&port=61616&keyStorePassword=****&sslEnabled=true&host=slave-com&trustStorePath=/path/to/ssl/truststore-jks&keyStorePath=/path/to/ssl/keystore-jks
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Connector NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] using native epoll
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client] AMQ211002: Started EPOLL Netty Connector version 4.1.51.Final to slave.com:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Remote destination: slave.com/123.123.123.123:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.spi.core.remoting.ssl.SSLContextFactory] Creating SSL context with configuration
trustStorePassword=****
port=61616
keyStorePassword=****
sslEnabled=true
host=slave.com
trustStorePath=/path/to/ssl/truststore.jks
keyStorePath=/path/to/ssl/keystore.jks
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Added ActiveMQClientChannelHandler to Channel with id = 77c078c2
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connector towards NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] failed
This is what the slave node is spamming:
2021-05-07 14:06:53,177 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 1
2021-05-07 14:06:53,178 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] failed to lock position: 1
If I attempt to telnet from master node to slave node (same if I do it from slave to slave):
[root@master]# telnet slave.com 61616
Trying 123.123.123.123...
telnet: connect to address 123.123.123.123: Connection refused
However, if I attempt the same telnet in the only working cluster, I can successfully "connect" from master to slave...
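The telnet check above can also be scripted so that it runs the same way on every node. The following is a minimal sketch in Python using only the standard library; the function name port_is_open and the 3-second timeout are illustrative choices, not anything Artemis provides.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and completes the TCP handshake,
        # which is all that "telnet host port" verifies.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

With the names from the question, port_is_open("slave.com", 61616) would return False on the clusters where telnet reports "Connection refused". A False result only says that no TCP connection could be made; it does not say why.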
Here is what I suspect:
Master acquires lock in instance/data/journal/server.lock
Master keeps trying to connect to slave server
Slave unable to start, because it cannot acquire the same server.lock on shared storage.
Master uses high CPU because of such hard-trying to connect to slave, which is not running.
What am I doing wrong?
EDIT: This is what my NFS mounts look like (taken from the mount command):
some_server:/some_dir on /path/to/artemis/instance/data type nfs4 (rw,relatime,sync,vers=4.1,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=123.123.123.123,local_lock=none,addr=123.123.123.123)

It turns out the issue was in the broker.xml configuration. In static-connectors I had somehow decided to list only the "non-current" server (e.g. with srv0 and srv1, on srv0 I only added the connector of srv1, and vice versa).
What it used to be (on the 1st master node):
<cluster-connections>
   <cluster-connection name="abc">
      <connector-ref>srv0-connector</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>srv1-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
How it is now (on the 1st master node):
<cluster-connections>
   <cluster-connection name="abc">
      <connector-ref>srv0-connector</connector-ref>
      <message-load-balancing>ON_DEMAND</message-load-balancing>
      <max-hops>1</max-hops>
      <static-connectors>
         <connector-ref>srv0-connector</connector-ref>
         <connector-ref>srv1-connector</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
After listing all of the cluster's nodes, the CPU usage normalized and is now only ~1% on the active node. The issue was not related to the AMQ Artemis connection spamming or to file locks at all.
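To apply the same check to several broker.xml files, the difference between the two configurations above can be detected mechanically. This is a rough sketch in Python: the function name own_connector_missing is hypothetical, tag namespaces are stripped so that it works on a full broker.xml as well as on the fragments shown here, and it encodes only the fix described in this answer, not a rule taken from the Artemis documentation.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{urn:activemq:core}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def own_connector_missing(xml_text: str) -> dict:
    """Map each cluster-connection name to True if its own connector-ref
    does not appear in its static-connectors list."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter():
        if local_name(elem.tag) != "cluster-connection":
            continue
        own = None
        static = []
        for child in elem:
            name = local_name(child.tag)
            if name == "connector-ref":
                # The connector this node announces for itself.
                own = (child.text or "").strip()
            elif name == "static-connectors":
                # The connectors listed for initial cluster discovery.
                static = [(c.text or "").strip() for c in child
                          if local_name(c.tag) == "connector-ref"]
        result[elem.get("name")] = own not in static
    return result
```

For the first configuration above this returns {"abc": True}, and for the second {"abc": False}.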

Related

Using Fiware Draco 2.1.0 How to setup NiFi cluster using external zookeeper

I'm trying to set up a NiFi cluster using an external ZooKeeper (version 3.4.10). The container runs in a Kubernetes pod.
I have changed the following things in nifi.properties:
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=XXXXXXXXXXXXXXXX
nifi.cluster.node.protocol.port=8082
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=1 min
nifi.cluster.flow.election.max.candidates=
# cluster load balancing properties #
nifi.cluster.load.balance.host=
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=1
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=xxxx:2181,xxxx:2181,xxxx:2181
nifi.zookeeper.connect.timeout=10 secs
nifi.zookeeper.session.timeout=10 secs
nifi.zookeeper.root.node=/nifi
I am not able to connect to ZooKeeper and get a connection error:
2022-09-23 10:13:23,665 ERROR [main-EventThread] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:885)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:677)
at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
2022-09-23 10:13:24,331 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2022-09-23 10:13:24,331 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@30dbc287 Connection State changed to RECONNECTED
2022-09-23 10:13:24,431 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2022-09-23 10:13:24,431 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@30dbc287 Connection State changed to SUSPENDED
The connection state keeps alternating between RECONNECTED and SUSPENDED.
How can I set up a NiFi cluster to use an external ZooKeeper connection?
Has anyone faced similar issues with this combination?
Can you please assist?
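One way to narrow this down is to check, from inside the NiFi pod, whether every server in nifi.zookeeper.connect.string accepts TCP connections at all. The sketch below is a hypothetical Python helper, not part of NiFi or ZooKeeper: it splits a connect string of the form host:port,host:port (with an optional /chroot suffix) and reports the servers that cannot be reached. It assumes hostnames or IPv4 addresses, and it tests TCP reachability only, not whether the ZooKeeper ensemble is healthy.

```python
import socket


def parse_connect_string(connect_string: str, default_port: int = 2181):
    """Split a ZooKeeper connect string into (host, port) pairs.

    A chroot suffix such as '/nifi' after the last server is ignored.
    """
    servers, _, _chroot = connect_string.partition("/")
    pairs = []
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            pairs.append((host, int(port)))
        else:
            # No explicit port given: fall back to the default client port.
            pairs.append((entry, default_port))
    return pairs


def unreachable_servers(connect_string: str, timeout: float = 3.0):
    """Return (host, port, error) for every server that refuses TCP connections."""
    failed = []
    for host, port in parse_connect_string(connect_string):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:
            failed.append((host, port, str(exc)))
    return failed
```

Calling unreachable_servers with the value of nifi.zookeeper.connect.string returns an empty list when every server accepts connections. Running it in the pod matters because a name that resolves on a workstation may not resolve inside the cluster network.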

YARN complains java.net.NoRouteToHostException: No route to host (Host unreachable)

I'm attempting to run H2O on an HDP 3.1 cluster and running into an error that appears to be about YARN resource capacity...
[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts: -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name: default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 1.00
Maximum capacity: 1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
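The numbers in this output can be checked against the suggested cause. The sketch below is plain arithmetic in Python; the formula (heap size plus the extra-memory percentage) is inferred from the values printed by the driver, not taken from the H2O source, and 1 GB is assumed to be 1024 MB.

```python
def container_size_mb(mapper_xmx_mb: int, extra_memory_percent: int) -> int:
    """Heap size plus the extra-memory percentage, in MB (integer arithmetic)."""
    return mapper_xmx_mb + mapper_xmx_mb * extra_memory_percent // 100


mapper_xmx_mb = 10 * 1024                            # -mapperXmx 10g
requested_mb = container_size_mb(mapper_xmx_mb, 10)  # "Extra memory percent: 10"
node_capacity_mb = 15 * 1024                         # "0.0 / 15.0 GB used" per node

print(requested_mb)                      # 11264, same as mapreduce.map.memory.mb
print(requested_mb <= node_capacity_mb)  # True
```

Under these assumptions the 11264 MB container fits within the 15 GB reported for each node, so node memory alone does not explain the failure. The value of yarn.scheduler.maximum-allocation-mb is not shown in the output, so that limit cannot be checked this way.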
Looking in the YARN configs in Ambari UI, these properties are nowhere to be found. But checking the YARN logs in the YARN resource manager UI and checking some of the logs for the killed application, I see what appears to be unreachable-host errors...
Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
....
at java.net.Socket.<init>(Socket.java:211)
at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************
Taking note of "java.net.NoRouteToHostException: No route to host (Host unreachable)". However, I can access all the other nodes from each other and they can all ping each other, so not sure what is going on here. Any suggestions for debugging or fixing?
I think I found the problem. TL;DR: firewalld (the nodes run CentOS 7) was still running, when it should be disabled on HDP clusters.
From another community post:
For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:
systemctl disable firewalld
service firewalld stop
So apparently iptables and firewalld need to be disabled across the cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services across the cluster (I recommend using clush), I was able to run the YARN job without incident.
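A successful ping does not rule out a firewall, because ICMP echo and TCP traffic to a specific port are filtered independently. The following Python sketch, with hypothetical helper names, attempts a TCP connection and translates the resulting error code, which helps tell "nothing is listening" apart from "the packet was rejected on the way". It is a diagnostic aid under those assumptions, not a replacement for inspecting the firewall rules.

```python
import errno
import socket


def describe_connect_error(code: int) -> str:
    """Translate the result of socket.connect_ex into a short diagnosis."""
    if code == 0:
        return "connected"
    if code == errno.ECONNREFUSED:
        return "refused: host reachable, nothing listening on that port"
    if code == errno.EHOSTUNREACH:
        return "no route to host: routing problem or a firewall rejecting the packet"
    if code in (errno.ETIMEDOUT, errno.EAGAIN, errno.EWOULDBLOCK):
        return "timed out: packets may be dropped silently"
    return "error: " + errno.errorcode.get(code, str(code))


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection to host:port and describe the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            code = sock.connect_ex((host, port))
        except socket.gaierror as exc:
            # Name resolution failures are raised rather than returned.
            return "name resolution failed: %s" % exc
    return describe_connect_error(code)
```

In the run above the mappers call back to the driver at 172.18.4.49:46015, so while the job is waiting, probe("172.18.4.49", 46015) executed on a worker node would show what the container sees. The callback port is chosen per run, so the value has to be taken from the current driver output.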
Normally, this problem is either due to bad DNS configuration, firewalls, or network unreachability. To quote this official doc:
The hostname of the remote machine is wrong in the configuration files
The client's host table /etc/hosts has an invalid IPAddress for the target host.
The DNS server's host table has an invalid IPAddress for the target host.
The client's routing tables (In Linux, iptables) are wrong.
The DHCP server is publishing bad routing information.
Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it is to deliberately lock down the Hadoop cluster.
The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6.
The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs.
For me, the problem was that the driver was inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not in the same subnet. The solution, as given in this answer, was to set the following configurations:
spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>

Datastax Opscenter - Agent not connecting

I set up Cassandra, OpsCenter and the required DataStax agent on my Amazon EC2 machine. At the moment it's only one machine.
Everything seems to be running fine, except that the node list is empty and so are the keyspaces in OpsCenter. The Cassandra, DataStax agent and OpsCenter logs show no errors, and I followed the installation / configuration carefully. Then I tried all the suggested fixes.
My guess is the problem lies in the communication between the agent and opscenter.
After a while these requests fail:
etc/cassandra/cassandra.yaml: (simplified)
cluster_name: 'CassandraCluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "1.2.3.4"
listen_address: 1.2.3.4
rpc_address: 0.0.0.0
endpoint_snitch: Ec2Snitch
etc/opscenter/opscenterd.conf: (simplified)
[webserver]
port = 81
interface = 0.0.0.0
[authentication]
enabled = False
[stat_reporter]
[agents]
use_ssl = false
var/lib/datastax-agent/conf/address.yaml: (simplified)
stomp_interface: 1.2.3.4
local_interface: 1.2.3.4
use_ssl: 0
nodetool status output:
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: eu-west_1_cassandra
===============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load     Tokens  Owns    Host ID                               Rack
UN  1.2.3.4  2.06 MB  256     100.0%  8a121c12-7cbf-4a2a-b111-4ad111c111d8  1a
Nothing really strange shows up in the logs, except for the repeated occurrence of the following line in agent.log:
INFO [install-location-finder] 2015-03-11 15:26:04,690 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:27:04,698 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:28:04,709 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:29:04,716 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:30:04,724 New JMX connection (127.0.0.1:7199)
INFO [install-location-finder] 2015-03-11 15:31:04,731 New JMX connection (127.0.0.1:7199)
To supply all the info here are the logs:
opscenterd.log
agent.log
cassandra/system.log
In certain environments the persistent connection between the browser and opscenterd may fail. We're working on implementing a more robust connection that will work in all environments, but in the meantime you can use the following workaround:
http://www.datastax.com/documentation/opscenter/5.1/opsc/troubleshooting/opscTroubleshootingZeroNodes.html
The minimal configuration I found to work was setting the options below in address.yaml:
stomp_interface: [opscenter-ip]
stomp_port: 61620
use_ssl: 0
cassandra_conf: /etc/cassandra/cassandra.yaml
jmx_host: [cassandra-node-ip]
jmx_port: 7199
Also make sure you have sysstat installed.
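To compare an existing address.yaml with the option list above, a small check like the following can help. It is a sketch in Python with hypothetical names; the parser handles only flat "key: value" lines, which is enough for the files shown here but is not a general YAML parser, and the list of expected keys comes from this answer rather than from the DataStax documentation.

```python
# Keys used in the working configuration described in this answer.
ANSWER_KEYS = ("stomp_interface", "stomp_port", "use_ssl",
               "cassandra_conf", "jmx_host", "jmx_port")


def parse_flat_yaml(text: str) -> dict:
    """Parse flat 'key: value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(text: str, expected=ANSWER_KEYS) -> list:
    """Return the expected keys that the given address.yaml text does not set."""
    settings = parse_flat_yaml(text)
    return [key for key in expected if key not in settings]
```

Applied to the address.yaml from the question (stomp_interface, local_interface, use_ssl), missing_keys returns ['stomp_port', 'cassandra_conf', 'jmx_host', 'jmx_port'].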

hdfs data node disconnected from namenode

From time to time I get the following errors in Cloudera Manager:
This DataNode is not connected to one or more of its NameNode(s).
and
The Cloudera Manager agent got an unexpected response from this role's web server.
(usually together, sometimes only one of them)
In most references to these errors on SO and Google, the issue is a configuration problem (and the data node never connects to the name node).
In my case the data nodes usually connect at startup, but lose the connection after some time, so it doesn't appear to be a bad configuration.
Any other options?
Is it possible to force the data node to reconnect to the name node?
Is it possible to "ping" the name node from the data node (simulate the connection attempt of the data node)?
Could it be some kind of resource problem (too many open files / connections)?
Sample logs (the errors vary from time to time):
2014-02-25 06:39:49,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,180 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.56.144.18:50010, dest: /10.56.144.28:48089, bytes: 132096, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1315770947_27, offset: 0, srvID: DS-990970275-10.56.144.18-50010-1384349167420, blockid: BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440, duration: 480291679056
2014-02-25 06:39:49,180 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.144.18, storageID=DS-990970275-10.56.144.18-50010-1384349167420, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster16;nsid=7043943;c=0):Got exception while serving BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440 to /10.56.144.28:48089
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: host.com:50010:DataXceiver error processing READ_BLOCK operation src: /10.56.144.28:48089 dest: /10.56.144.18:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
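The figures in these sample logs can be cross-checked with a little arithmetic. The Python sketch below assumes that the clienttrace duration field is in nanoseconds, which its close match with the 480000 ms timeout supports; the function names are illustrative.

```python
def ms_to_minutes(millis: int) -> float:
    """Convert milliseconds to minutes."""
    return millis / 1000 / 60


def ns_to_seconds(nanos: int) -> float:
    """Convert nanoseconds to seconds, rounded to two decimals."""
    return round(nanos / 1e9, 2)


write_timeout_ms = 480000    # from "480000 millis timeout"
duration_ns = 480291679056   # from "duration: 480291679056"

print(ms_to_minutes(write_timeout_ms))  # 8.0
print(ns_to_seconds(duration_ns))       # 480.29
```

Both values describe the same eight-minute wait. The stack traces pass through BlockSender and DataXceiver.readBlock, so this particular timeout occurred while the data node was sending a block to a reader at 10.56.144.28 that stopped accepting data. Whether that is connected to the Cloudera Manager warnings cannot be determined from these lines alone.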
Hadoop uses specific ports for communication between the DataNode and the NameNode. It could be that a firewall is blocking those ports. Check the default ports on the Cloudera website and test the connectivity to the NameNode on those ports.
If you're using Linux, please make sure that you have configured the following correctly:
Disable SELINUX
Run the command getenforce on the CLI; if it shows Enforcing, SELinux is enabled. Change it in the /etc/selinux/config file.
Disable Firewall
Make sure you have NTP service installed.
Make sure your server can SSH to all client nodes.
Make sure all the nodes have an FQDN (Fully Qualified Domain Name) and an entry in /etc/hosts with name and IP.
If all of these settings are in place, please attach the log of one of the datanodes that got disconnected.
I ran into this error
"This DataNode is not connected to one or more of its NameNode(s). "
and I solved it by turning off safe mode and restarting the HDFS service.
I realize you took some steps to test this, but intermittent disconnects still make it sound like a connectivity issue.
If nodes really don't come back after a disconnect, that may be a configuration issue, which could well be completely independent from the reason why they disconnect in the first place.

JBoss AS 7 Infinispan Cluster

I have a two node JBoss AS 7.1.1.FINAL cluster setup in the following way -
master - running on Ubuntu Server 12.10 (VirtualBox VM)
slave - running on Windows 7 (VirtualBox host machine)
I have deployed a Spring web application on both nodes and I'm trying to set up a working replicated cache. My problem is that the cache does not seem to be replicated even though the clustering apparently works.
My config -
in domain.xml (both on master and slave)
<subsystem xmlns="urn:jboss:domain:infinispan:1.2" default-cache-container="cluster">
    <cache-container name="cluster" aliases="ha-partition" default-cache="default" jndi-name="java:jboss/infinispan/cluster" start="EAGER">
        <transport lock-timeout="60000" />
        <replicated-cache name="default" mode="SYNC" batching="true">
            <locking isolation="REPEATABLE_READ"/>
        </replicated-cache>
    </cache-container>
</subsystem>
This is pretty much the default config in domain.xml, except for the jndi-name and the EAGER start.
In spring configuration -
<infinispan:container-cache-manager id="cacheManager" cache-container-ref="springCacheContainer" />
<jee:jndi-lookup id="springCacheContainer" jndi-name="java:jboss/infinispan/cluster" />
With this setup, the caching works, but it's not replicated. The caches seem to operate independently of each other. Also, the EAGER start seems to have no effect. The caches seem to be initialized only when they are first used.
from master log (first time cache is used)-
[Server:server-one] 03:25:55,756 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000078: Starting JGroups Channel
[Server:server-one] 03:25:55,762 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000094: Received new cluster view: [master:server-one/cluster|1] [master:server-one/cluster, slave:server-one-slave/cluster]
[Server:server-one] 03:25:55,763 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000079: Cache local address is master:server-one/cluster, physical addresses are [192.168.2.13:55200]
[Server:server-one] 03:25:55,769 INFO [org.infinispan.factories.GlobalComponentRegistry] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000128: Infinispan version: Infinispan 'Brahma' 5.1.2.FINAL
[Server:server-one] 03:25:55,851 INFO [org.jboss.as.clustering.infinispan] (ajp-192.168.2.13-192.168.2.13-8009-3) JBAS010281: Started cluster cache from cluster container
from slave log (first time cache is used)-
[Server:server-one-slave] 03:29:38,124 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp--192.168.2.10-8009-2) ISPN000078: Starting JGroups Channel
[Server:server-one-slave] 03:29:38,129 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp--192.168.2.10-8009-2) ISPN000094: Received new cluster view: [master:server-one/cluster|1] [master:server-one/cluster, slave:server-one-slave/cluster]
[Server:server-one-slave] 03:29:38,130 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp--192.168.2.10-8009-2) ISPN000079: Cache local address is slave:server-one-slave/cluster, physical addresses are [192.168.2.10:55200]
[Server:server-one-slave] 03:29:38,133 INFO [org.infinispan.factories.GlobalComponentRegistry] (ajp--192.168.2.10-8009-2) ISPN000128: Infinispan version: Infinispan 'Brahma' 5.1.2.FINAL
[Server:server-one-slave] 03:29:38,195 INFO [org.jboss.as.clustering.infinispan] (ajp--192.168.2.10-8009-2) JBAS010281: Started cluster cache from cluster container
I don't think this is a udp/multicast issue, as I have mod_cluster, HornetQ and Quartz set up in this cluster and they all work as expected.
Putting <distributable/> in web.xml did the trick.
I had a similar issue where my cache wouldn't replicate until the application was first used. I was able to resolve this by setting the "start" attribute of the replicated-cache to EAGER, along with the cache-container attribute start="EAGER".
<replicated-cache name="default" mode="SYNC" batching="true" start="EAGER">

Resources