Controller issue: losing access - websphere-liberty

Using V8.5.5.9 with Java 7 on two different machines running Windows. I have been through the whole "RXA Setup for Collective Operations" process: UAC, registry sharing, and so on. I have googled it, but I am still struggling with this problem:
It looks like the connection works, and then after about 40 seconds it drops. Any ideas?
Machine 1: Main ( 192.168.0.39 )
Server.xml
<collectiveController replicaPort="10010"
                      replicaSet="cogito1:10010"
                      isInitialReplicaSet="false"
                      host="main"
                      replicaHost="main"/>
Log
[16-06-07 12:36:06:301 EDT] 00000011 com.ibm.ws.frappe.paxos.impl.CommandsExecutor:10010 I CWWKX6013I: The collective controller state is No Paxos Instance, last proposed command is -1, the last accepted command is -1, the last executed command is 0 and the log is 274.
[16-06-07 12:36:06:617 EDT] 00000029 .utils.service.multiplexed.impl.UniverseAndReplicaData:10010 I CWWKX6009I: The collective controller successfully connected to replica 192.168.0.39:10010. Current active replica set is []. The configured replica set is [192.168.0.162:10010]. The connected standby replicas are [192.168.0.39:10010].
[16-06-07 12:36:06:835 EDT] 00000029 .utils.service.multiplexed.impl.UniverseAndReplicaData:10010 I CWWKX6009I: The collective controller successfully connected to replica 192.168.0.162:10010. Current active replica set is [192.168.0.162:10010]. The configured replica set is [192.168.0.162:10010]. The connected standby replicas are [192.168.0.39:10010].
[16-06-07 12:36:46:355 EDT] 00000025 e.serviceregistry.backend.RegistryReplicationService:default E CWWKX6008E: The collective controller is unavailable, probably due to a loss of majority of the replica set, or a comm
Machine 2: Cogito1 ( 192.168.0.162 )
Server.xml
<collectiveController replicaPort="10010"
                      replicaSet="cogito1:10010"
                      isInitialReplicaSet="true" />
Log
[16-06-07 12:36:06:301 EDT] 00000011 com.ibm.ws.frappe.paxos.impl.CommandsExecutor:10010 I CWWKX6013I: The collective controller state is No Paxos Instance, last proposed command is -1, the last accepted command is -1, the last executed command is 0 and the log is 274.
[16-06-07 12:36:06:617 EDT] 00000029 .utils.service.multiplexed.impl.UniverseAndReplicaData:10010 I CWWKX6009I: The collective controller successfully connected to replica 192.168.0.39:10010. Current active replica set is []. The configured replica set is [192.168.0.162:10010]. The connected standby replicas are [192.168.0.39:10010].
[16-06-07 12:36:06:835 EDT] 00000029 .utils.service.multiplexed.impl.UniverseAndReplicaData:10010 I CWWKX6009I: The collective controller successfully connected to replica 192.168.0.162:10010. Current active replica set is [192.168.0.162:10010]. The configured replica set is [192.168.0.162:10010]. The connected standby replicas are [192.168.0.39:10010].
[16-06-07 12:36:46:355 EDT] 00000025 e.serviceregistry.backend.RegistryReplicationService:default E CWWKX6008E: The collective controller is unavailable, probably due to a loss of majority of the replica set, or a comm

I'm assuming that you're trying to set up two replica controllers, one on host main and the other on host cogito1. Based on that assumption, you need to ensure unique replicaPorts for each controller. In other words, your configuration for main should not be replicaPort="10010", but instead another free port, like replicaPort="10011".
Also, note that for High Availability, you need a minimum of 3 replica controllers.
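To illustrate the port suggestion above, a hedged sketch of what the main controller's element could look like; 10011 is only an example of a free port, and the rest of the element is taken unchanged from the question:
<collectiveController replicaPort="10011"
                      replicaSet="cogito1:10010"
                      isInitialReplicaSet="false"
                      host="main"
                      replicaHost="main"/>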

It appears that you have not completed the addReplica step for 192.168.0.39. It seems to always be in the standby set, which would be the case if it has not been added via this command. For example, you can see the command in the current beta documentation here: https://www.ibm.com/support/knowledgecenter/was_beta_liberty/com.ibm.websphere.wlp.nd.multiplatform.doc/ae/tagt_wlp_configure_replicas.html
With a replica set of two replicas and one of them not having been added, this results in an inoperable set: one replica is not a majority of the 2 total replicas, so the set cannot reach an operable state.
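For reference, the addReplica invocation looks roughly like the following when run from the controller's wlp/bin directory; the HTTPS port 9443 and the admin credentials are placeholders, and the exact options for your release are in the linked documentation:
collective addReplica 192.168.0.39:10010 --host=cogito1 --port=9443 --user=adminUser --password=adminPassword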

Related

Infinispan clustered REPL_ASYNC cache: command indefinitely bounced between two nodes

I'm running a Spring Boot application using Infinispan 10.1.8 in a 2-node cluster. The 2 nodes communicate via JGroups TCP. I configured several REPL_ASYNC caches.
The problem:
At some point, one of these caches causes the two nodes to exchange the same message over and over, driving up CPU and memory usage. The only way to stop this is to stop one of the two nodes.
More details: here is the configuration.
org.infinispan.configuration.cache.Configuration replAsyncNoExpirationConfiguration = new ConfigurationBuilder()
    .clustering()
        .cacheMode(CacheMode.REPL_ASYNC)
    .transaction()
        .lockingMode(LockingMode.OPTIMISTIC)
        .transactionMode(TransactionMode.NON_TRANSACTIONAL)
    .statistics().enabled(cacheInfo.isStatsEnabled())
    .locking()
        .concurrencyLevel(32)
        .lockAcquisitionTimeout(15, TimeUnit.SECONDS)
        .isolationLevel(IsolationLevel.READ_COMMITTED)
    .expiration()
        .lifespan(-1)        // entries do not expire
        .maxIdle(-1)         // even when they are idle for some time
        .wakeUpInterval(-1)  // disable the periodic eviction process
    .build();
One of these caches (named formConfig) causes abnormal communication between the two nodes; this is what happens:
with jmeter I generate traffic load targeting only node 1
for some time node 2 receives cache entries from node 1 via SingleRpcCommand, no anomalies, even formConfig cache behaves properly
after some time a new cache entry is sent to the formConfig cache
At this point the same message seems to keep bouncing between the two nodes:
node 1 sends entry mn-node1.company.acme-develop sending command to all: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 2 receives the entry mn-node2.company.acme-develop received command from mn-node1.company.acme-develop: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 2 sends the entry back to node 1 mn-node2.company.acme-develop sending command to all: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 1 receives the entry mn-node1.company.acme-develop received command from mn-node2.company.acme-develop: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850],
node 1 sends the entry to node 2 and so on and on...
Some other things:
the system is not under load; JMeter is running only a few users in parallel
even after stopping JMeter, the loop doesn't stop
formConfig is the only cache that behaves this way; all the other REPL_ASYNC caches work properly. With only the formConfig cache deactivated, the system works correctly.
I cannot reproduce the problem with two nodes running on my machine
Here's a more complete log file including logs from both nodes.
Other info:
OpenJDK 11 HotSpot
Spring Boot 2.2.7
Infinispan Spring Boot starter 2.2.4
using JbossUserMarshaller
I'm suspecting
something related to transactional configuration
or something related to serialization/deserialization of the cached object
The only scenario where this can happen is when the SimpleKey ends up with a different hashCode() on the two nodes.
Are there any exceptions in the log? Are you able to check if the hashCode() is the same after serialization & deserialization of the key?
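As a rough way to check the second point, a sketch only: it uses plain Java serialization as a stand-in for the cluster's configured JBoss marshaller, and a list literal as a stand-in for the real SimpleKey, so treat the result as an approximation.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

public class KeyRoundTripCheck {
    // serialize and deserialize an object with plain Java serialization
    static Object roundTrip(Object original) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // stand-in for the real cache key; replace with the actual SimpleKey contents
        Object key = Arrays.asList("form_config", "MECHANICAL", "DESIGN", "et", 7850);
        Object copy = roundTrip(key);
        System.out.println("hashCode stable: " + (key.hashCode() == copy.hashCode()));
        System.out.println("equals stable:   " + key.equals(copy));
    }
}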

Aerospike missing data when adding new node to cluster

I have an Aerospike (3.11.1.1) cluster with 6 nodes. When I try to add a new node, sometimes some objects are "temporarily" lost while the cluster is migrating data. After the migration finishes, the missing data returns. Is this a bug, or am I doing something wrong? How can I avoid it?
Notice that while the migration happens, the master object count is lower than the actual final object count after the migration finishes.
Master and replica count before finishing migrations:
Master and replica count after finishing migrations:
My aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        #address any
        port 3000
    }
    heartbeat {
        mode mesh
        mesh-seed-address-port 10.240.0.32 3002
        mesh-seed-address-port 10.240.0.33 3002
        port 3002
        interval 150
        timeout 20
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}
namespace mynamespace {
    replication-factor 2
    memory-size 1500M
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    ldt-enabled true
    write-commit-level-override master
    storage-engine device {
        file /data/aerospike.dat
        #device /dev/sdb
        write-block-size 1M
        filesize 280G
    }
}
Some of the discrepancy was due to an issue in the original migration/rebalance design and is addressed in the protocol change in Aerospike 3.13. Prior to the protocol change in 3.13, when running replication-factor 2, the operator must upgrade one node at a time and wait for migrations to complete afterwards.
Additional discrepancy comes from Aerospike avoiding over-counting master-objects and replica objects (i.e. prole-objects) during migration. Also, with 3.13 we added a stat for non-replica-objects, which are objects that are not currently acting as master or replica. These are either (a) objects on a partition that has inbound migrations and will eventually act as replica, or (b) objects on a partition that will not participate and will be dropped when migrations terminate for the partition.
Prior to 3.13, non-replica-objects of type (a) would reduce the counts for both master-objects and prole-objects. This is because, prior to the protocol change, when a previously-master partition returns it immediately resumes as master even though it doesn't have the new writes that took place while it was away. This isn't optimal behavior, but it isn't losing data, since we resolve the missing records from the non-replica-objects on other nodes. Post protocol change, a returning 'master' partition will not resume as 'master' until it has received all migrations from other nodes.
Prior to 3.13, non-replica-objects of type (b) would be dropped immediately and would reduce the count for prole-objects. This causes the replication-factor of records written while a node was away to be reduced by one (e.g. replication-factor 2 temporarily becomes replication-factor 1). This is also the reason it was important to wait for migrations to complete before proceeding to upgrade the next node. Post protocol change (unless running in-memory only), it is no longer necessary to wait for migrations to complete between node upgrades, because the interim 'subset partitions' aren't dropped, which prevents a record's replication-factor from being reduced (actually, with the new protocol, during migrations there are often replication-factor + 1 copies of a record).
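If you want to watch these counters while migrations run, something along these lines can be used against one of the nodes (asinfo ships with the server; the host and namespace come from the question, and the non-replica-objects stat only exists on 3.13 and later):
asinfo -h 10.240.0.32 -v "namespace/mynamespace" | tr ';' '\n' | grep -E 'master-objects|prole-objects|non-replica-objects'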

Redis cluster does not support simultaneous fail of several master nodes

I've got the following configuration:
redis_version: 3.2.0
3 master nodes and 3 slave nodes
Each master node is replicated to a slave. Everything is correct: when one master node is killed with a "kill" command, the corresponding slave node becomes the master as expected, and after a few seconds cluster_state returns to the OK state.
BUT if two master nodes fail simultaneously, none of the associated slave nodes becomes a master, and cluster_state stays in the "fail" state.
Output of the cluster nodes command:
b60c284a515b31aa6b11022fc07cf1a399171e04 127.0.0.1:7000 master,fail? - 1464690455030 1464690454930 1 disconnected 0-5460
637d1f074419963653b206c5ed7cbed4c3d0ace0 127.0.0.1:7001 master,fail? - 1464690455030 1464690454930 2 disconnected 5461-10922
d2aae2a3d87c6407e002076740c8febf80f37865 127.0.0.1:7003 myself,slave b60c284a515b31aa6b11022fc07cf1a399171e04 0 0 4 connected
72d4c9ce140fb57436c1b21702bf3c646ef29db3 127.0.0.1:7002 master - 0 1464690718480 3 connected 10923-16383
af34a7b2241943baf23e634e81b552d8bf23cdd0 127.0.0.1:7005 slave 72d4c9ce140fb57436c1b21702bf3c646ef29db3 0 1464690718480 6 connected
d0fec0609c9e786ac9ca4629f36cabd7c5c3130c 127.0.0.1:7004 slave 637d1f074419963653b206c5ed7cbed4c3d0ace0 0 1464690718480 5 connected
The slave auto-failover won't happen when at least half of the masters are disconnected, because the failover election requires more than half of the masters to reach consensus.
To start a manual failover, connect to the slave node with redis-cli and send a CLUSTER FAILOVER TAKEOVER command (the TAKEOVER option is required here).
In your case
redis-cli -h 127.0.0.1 -p 7003 cluster failover takeover
After :7003 becomes a master, the other slave will start an automatic failover as well, since more than half (2/3) of the masters are then alive.
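To confirm, something like the following should eventually report cluster_state:ok once both failovers are done (assuming redis-cli is available on the node):
redis-cli -h 127.0.0.1 -p 7003 cluster info | grep cluster_state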

Why an ephemeral node is not deleted from ZooKeeper after the session timeout

I am creating an ephemeral node with the CuratorFrameworkFactory.newClient method, which takes the ZooKeeper connection string, sessionTimeoutMs, connectionTimeoutMs, and a retry policy. I pass 5*1000 as sessionTimeoutMs and 15*1000 as connectionTimeoutMs. This creates the EPHEMERAL node in my ZooKeeper, but the node is not deleted for as long as the application keeps running.
Why does this happen, given that the session timeout is 5 seconds?
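A minimal sketch of the setup being described; the connection string, node path, and retry policy are assumptions, not taken from the question:
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class EphemeralNodeExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181",                      // ZooKeeper connection string (assumed)
                5 * 1000,                              // sessionTimeoutMs, as in the question
                15 * 1000,                             // connectionTimeoutMs, as in the question
                new ExponentialBackoffRetry(1000, 3)); // retry policy (assumed)
        client.start();
        // the node lives only as long as this client's session
        client.create().withMode(CreateMode.EPHEMERAL).forPath("/example-ephemeral", new byte[0]);
        Thread.sleep(Long.MAX_VALUE); // keep the session alive; the node stays until the session ends
    }
}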
The most probable cause is that your heartbeat setting for ZooKeeper (aka tickTime) is higher, and the minimum session timeout can't be lower than 2*tickTime.
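For example (my numbers, not from the question): with the default tickTime of 2000 ms the minimum negotiated session timeout is 4000 ms, so a requested 5000 ms is honoured; with tickTime=3000 the minimum becomes 6000 ms, and a requested 5000 ms is raised to it. The relevant zoo.cfg setting:
# zoo.cfg
tickTime=3000   # minimum session timeout = 2 * tickTime = 6000 ms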
To debug: when an ephemeral node is created, check its ephemeralOwner from zkCli; the value is the session id.
When the session of the client that owns the node terminates, you should see this line in the ZooKeeper logs:
INFO [ProcessThread(sid:0 cport:2182)::PrepRequestProcessor#486] -
Processed session termination for sessionid: 0x161988b731d000c
In this case the ephemeralOwner was 0x161988b731d000c. If you don't see that line, you probably hit some error instead; in my case it was an EOF exception, caused by a mismatch between the client library and the server.
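To inspect the owner from the CLI, run stat on the node's path in zkCli (the path below is a placeholder):
stat /example-ephemeral
and look for the ephemeralOwner field in the output; it should match the session id above.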

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a Cassandra cluster on AWS/EC2 within a standard VPC footprint (Cassandra nodes on private subnets). Because this is AWS, there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster, and I am seeing things that I thought a cluster was supposed to prevent. Specifically, if a node reboots, some data goes temporarily missing until the node completes its reboot. If a node terminates, it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces, then interrogated the contents of those keyspaces as I brought down nodes (either through reboot or terminate). I'm just using cqlsh SELECT at consistency level ONE to do the keyspace/column family interrogation of the cluster.
Note that even though I am performing no writes to the cluster while I am doing the SELECTs, rows temporarily disappear during a reboot and can go permanently missing during a termination.
I thought Netflix Priam might be able to help, but sadly it doesn't work in a VPC the last time I checked.
Also, because we are using ephemeral storage instances there is no equivalent of 'shutdown' so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before an instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and QUORUM writes, all data should be written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am using consistency level ONE for the read.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? unlikely
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families, so the data has had plenty of time (hours) to make it to all the nodes as required by the replication factor. Plus, the test dataset is REALLY small (2 column families with about 300-1000 values each). In other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the EC2 instance is no longer on the network. The reason I say this is that if I log on to a node and just do a cassandra stop, I see no loss of data. But if I reboot or terminate, I start getting the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a network communication issue, in that the cluster expects a node, for example 10.0.12.74, to be on the network after it has joined the cluster. If that IP suddenly becomes unreachable, either due to a reboot or a termination, the timeouts start happening.
When I do a nodetool status under all three scenarios (cassandra stop, reboot, or terminate), the status of the node shows up as DN, which is what you would expect. Eventually nodetool status returns to UN after a cassandra start or a reboot, but a termination obviously stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes). initial_token: (blank). I am really hoping this works, because all of our nodes run in autoscaling groups, so the thought that redistribution could be handled dynamically is appealing.
EC2 m1.large one seed and one non-seed node in each availability zone. (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstrapped, seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data. QUORUM/QUORUM is. So is ALL/ONE, but that will be intolerant to failure on write.
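Spelling out the usual rule of thumb (a read is guaranteed to see the latest write when read replicas + write replicas > replication factor), using the replication factor of 3 from the question:
QUORUM write (2) + ONE read (1):    2 + 1 = 3, which is not > 3, so the read may hit the one replica that missed the write.
QUORUM write (2) + QUORUM read (2): 2 + 2 = 4 > 3, so every read overlaps at least one replica that has the latest write.
ALL write (3) + ONE read (1):       3 + 1 = 4 > 3, also consistent, but the write fails if any replica is down.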
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application, which is faster, easier to use, and has more insight into the cluster state than Hector thanks to the native protocol it's built on.
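For reference, a minimal connection sketch with the DataStax Java Driver of that era (2.x); the contact point, keyspace, and table names are placeholders:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DriverSmokeTest {
    public static void main(String[] args) {
        // contact point is a placeholder; use one or more of your node IPs
        Cluster cluster = Cluster.builder().addContactPoint("10.0.12.74").build();
        Session session = cluster.connect("my_keyspace");                   // keyspace is a placeholder
        ResultSet rs = session.execute("SELECT * FROM my_table LIMIT 10");  // table is a placeholder
        for (Row row : rs) {
            System.out.println(row);
        }
        cluster.close(); // close() exists in driver 2.x; 1.x used shutdown()
    }
}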
