Infinispan clustered REPL_ASYNC cache: command indefinitely bounced between two nodes - spring-boot

I'm running a Spring Boot application using Infinispan 10.1.8 in a two-node cluster. The two nodes communicate via JGroups TCP. I configured several REPL_ASYNC caches.
The problem:
At some point, one of these caches causes the two nodes to exchange the same message over and over, leading to high CPU and memory usage. The only way to stop it is to shut down one of the two nodes.
More details: here is the configuration.
org.infinispan.configuration.cache.Configuration replAsyncNoExpirationConfiguration = new ConfigurationBuilder()
    .clustering()
        .cacheMode(CacheMode.REPL_ASYNC)
    .transaction()
        .lockingMode(LockingMode.OPTIMISTIC)
        .transactionMode(TransactionMode.NON_TRANSACTIONAL)
    .statistics().enabled(cacheInfo.isStatsEnabled())
    .locking()
        .concurrencyLevel(32)
        .lockAcquisitionTimeout(15, TimeUnit.SECONDS)
        .isolationLevel(IsolationLevel.READ_COMMITTED)
    .expiration()
        .lifespan(-1)       // entries do not expire
        .maxIdle(-1)        // even when they are idle for some time
        .wakeUpInterval(-1) // disable the periodic eviction process
    .build();
One of these caches (named formConfig) causes abnormal communication between the two nodes. This is what happens:
with JMeter I generate traffic load targeting only node 1
for some time node 2 receives cache entries from node 1 via SingleRpcCommand with no anomalies; even the formConfig cache behaves properly
after some time a new cache entry is sent to the formConfig cache
At this point the same message seems to keep bouncing between the two nodes:
node 1 sends the entry: mn-node1.company.acme-develop sending command to all: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 2 receives the entry: mn-node2.company.acme-develop received command from mn-node1.company.acme-develop: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 2 sends the entry back to node 1: mn-node2.company.acme-develop sending command to all: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 1 receives the entry: mn-node1.company.acme-develop received command from mn-node2.company.acme-develop: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850],
node 1 sends the entry to node 2 again, and so on and on...
Some other things:
the system is not under load; JMeter is running only a few users in parallel
even after stopping JMeter, the loop doesn't stop
formConfig is the only cache that behaves this way; all the other REPL_ASYNC caches work properly. With only the formConfig cache disabled, the system works correctly.
I cannot reproduce the problem with two nodes running on my local machine
Here's a more complete log file including logs from both nodes.
Other info:
OpenJDK 11 HotSpot
Spring Boot 2.2.7
Infinispan Spring Boot Starter 2.2.4
using JBossUserMarshaller
I suspect either:
something related to the transactional configuration, or
something related to serialization/deserialization of the cached object

The only scenario in which this can happen is when the SimpleKey has a different hashCode() on each node.
Are there any exceptions in the log? Can you check whether the hashCode() is the same after serialization and deserialization of the key?
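A quick way to check this is to round-trip a key and compare hash codes. The sketch below assumes the key in the log is Spring's org.springframework.cache.interceptor.SimpleKey and uses plain Java serialization as a stand-in for the JBoss marshaller actually configured, so treat it as an approximation of what happens on the wire:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.springframework.cache.interceptor.SimpleKey; // assumption: the key type seen in the log

public class KeyHashCheck {
    public static void main(String[] args) throws Exception {
        // Same fields as the key seen in the log.
        SimpleKey original = new SimpleKey("form_config", "MECHANICAL", "DESIGN", "et", 7850);

        // Serialize and deserialize the key (stand-in for the configured marshaller).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(original);
        }
        SimpleKey copy;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            copy = (SimpleKey) in.readObject();
        }

        // If these differ, the two nodes will not agree on the key's identity
        // after replication, which matches the symptom in the question.
        System.out.println("original hashCode = " + original.hashCode());
        System.out.println("copy hashCode     = " + copy.hashCode());
        System.out.println("equals: " + original.equals(copy));
    }
}

If the hash codes differ after the round trip, the key class (or the way it is marshalled) is what needs fixing.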

Related

High Availability in SymmetricDS

To all those SymmetricDS nerds over there, this one's for you all.
Right, so we have a main DB, DB-01. We have 3 instances of our application running, namely R1, R2, R3. Each instance has its own in-memory DB, namely D1, D2, D3, which the application accesses respectively. We are using SymmetricDS to do one-way sync from DB-01 to D1, D2, D3. So there is a server node, corporate C0, pointing to DB-01, and 3 client nodes, stores S1, S2, S3, pointing to D1, D2, D3 respectively.
All is working fine.
But now we would like to introduce High Availability, and thereby failover, into this topology, i.e., at any time there will be 2 server nodes running, say Master and Slave, both accessing the same DB-01. If the Master server goes down, clients should automatically connect to the Slave node and continue operation.
What configuration changes are required to accomplish this? Are there any examples or documentation that I can follow to understand this concept?
We do this via clustering, with 2 SymmetricDS services running on 2 app servers pointing to the High Availability (HA) connections. Then all you need is the HA connections to fail over as normal, and SymmetricDS clustering does the rest.
Link for the user manual on clustering.
https://www.symmetricds.org/doc/3.13/html/user-guide.html#_clustering
EDIT: here are some configs. Service 1:
engine.name=<SDS_SERVICE_1>
db.driver=net.sourceforge.jtds.jdbc.Driver
db.url=jdbc:jtds:sqlserver://<HA_connection1>:1433/<DB>;useCursors=true;bufferMaxMemory=10240;lobBuffer=5242880
db.user=***********
db.password=***********
registration.url=http://<IP>:7004/sync/<SDS_MAIN>
sync.url=http://<IP>:7004/sync/<SDS_SERVICE_1>
group.id=<GID>
external.id=100
auto.registration=true
initial.load.create.first=true
sync.table.prefix=sym
start.initial.load.extract.job=false
cluster.lock.enabled=true
cluster.server.id=11
cluster.lock.timeout.ms=600000
cluster.lock.refresh.ms=60000
compression.level=-1
compression.strategy=0
Service 2:
engine.name=<SDS_SERVICE_2>
db.driver=net.sourceforge.jtds.jdbc.Driver
db.url=jdbc:jtds:sqlserver://<HA_connection2>:1433/<DB>;useCursors=true;bufferMaxMemory=10240;lobBuffer=5242880
db.user=***********
db.password=***********
registration.url=http://<IP>:7004/sync/<SDS_MAIN>
sync.url=http://<IP>:7004/sync/<SDS_SERVICE_2>
group.id=<GID>
external.id=100
auto.registration=true
initial.load.create.first=true
sync.table.prefix=sym
start.initial.load.extract.job=false
cluster.lock.enabled=true
cluster.server.id=12
cluster.lock.timeout.ms=600000
cluster.lock.refresh.ms=60000
compression.level=-1
compression.strategy=0

Difference between expected_nodes and recover_after_nodes parameters

I can't see the difference between these two parameters for the recovery phase of the gateway module.
In the documentation:
The gateway.recover_after_nodes setting (which accepts a number) controls after how many (...) eligible nodes (...) recovery will start.
The gateway.expected_nodes allows to set how many (...) eligible nodes are expected to be in the cluster, and once met, (...) recovery starts
From what I understand, these two settings trigger the recovery phase once the number of nodes is equal to the value set.
Why use one over the other?
And what is the point of using both of them?
For example:
gateway:
  recover_after_nodes: 3
  expected_nodes: 5
In this case, what is the purpose of expected_nodes? Recovery will be triggered as soon as there are 3 nodes. There must be another reason to use it.
I hope my question is clear enough.
Thanks in advance!
When using recover_after_nodes, recover_after_data_nodes or recover_after_master_nodes, once all of the set conditions are met the cluster will wait recover_after_time before starting recovery:
The gateway.recover_after_time setting (which accepts a time value)
sets the time to wait till recovery happens once all
gateway.recover_after...nodes conditions are met.
When using expected_nodes, expected_data_nodes or expected_master_nodes, recovery will start once all conditions are met; the cluster will not wait. In addition, this defaults recover_after_time to 5 minutes.
In your test case:
gateway:
  recover_after_nodes: 3
  expected_nodes: 5
Once you hit 3 nodes, a countdown clock starts and the cluster will then recover either after 5 minutes (the default) or as soon as you hit 5 nodes. Basically this lets you set a minimum threshold (recover_after_nodes) with a timeout (recover_after_time) to wait for a desired state (expected_nodes). Recovery starts either recover_after_time after recover_after_nodes is hit, or as soon as expected_nodes is hit (no additional waiting), whichever comes first.
From the public documentation, there are some misunderstandings in this thread.
http://www.elastic.co/guide/en/elasticsearch/reference/1.x/modules-gateway.html
gateway:
  recover_after_time: 5m
  expected_nodes: 2
In an expected 2 nodes cluster will cause recovery to start 5 minutes
after the first node is up, but once there are 2 nodes in the cluster,
recovery will begin immediately (without waiting).
So the timer defined by recover_after_time starts as soon as the first node is up, not after the number of nodes defined in recover_after_nodes has been reached.

Redis sync fails. Redis copy keys and values works

I have two Redis instances, both running on the same machine on Win64. The version is the one from https://github.com/MSOpenTech/redis with no amendments, and the binaries are running as downloaded from GitHub (i.e. version 2.6.12).
I would like to create a slave and sync it to the master. I am doing this on the same machine to ensure it works before creating a slave on a WAN-located machine, which will take around an hour to transfer the data that exists in the primary.
However, I get the following error:
[4100] 15 May 18:54:04.620 * Connecting to MASTER...
[4100] 15 May 18:54:04.620 * MASTER <-> SLAVE sync started
[4100] 15 May 18:54:04.620 * Non blocking connect for SYNC fired the event.
[4100] 15 May 18:54:04.620 * Master replied to PING, replication can continue...
[4100] 15 May 18:54:28.364 * MASTER <-> SLAVE sync: receiving 2147483647 bytes from master
[4100] 15 May 18:55:05.772 * MASTER <-> SLAVE sync: Loading DB in memory
[4100] 15 May 18:55:14.508 # Short read or OOM loading DB. Unrecoverable error, aborting now.
The only way I can sync up is via a mini script, something along the lines of:
import orm.model

if __name__ == "__main__":
    src = orm.model.caching.Redis(**{"host": "source_host", "port": 6379})
    dest = orm.model.caching.Redis(**{"host": "source_host", "port": 7777})
    ks = src.handle.keys()
    for i, k in enumerate(ks):
        if i % 1000 == 0:
            print i, "%2.1f %%" % ((i * 100.0) / len(ks))
        dest.handle.set(k, src.handle.get(k))
where orm.model.caching.* are my middleware cache implementation bits (which for Redis just create a self.handle instance variable).
Firstly, I am very suspicious of the number of bytes being received, as it is 2^32-1, a very strange coincidence. Secondly, OOM can mean out of memory, yet I can fire up a second instance and sync it via the script, while doing the same via redis --slaveof fails with what appears to be an out-of-memory error. Surely this can't be right?
redis-check-dump does not run, as this is the Windows implementation.
Unfortunately there is sensitive data in the keys I am syncing so I can't offer it to anybody to investigate. Sorry about that.
I am definitely running the 64 bit version as it states this upon startup in the header.
I don't mind syncing via my mini script and then just enabling slave mode, but I don't think that is possible as the moment slaveof is executed, it drops all known data and resyncs from scratch (and then fails).
Any ideas ??
I have also seen this error earlier, but the latest bits from 2.8.4 seem to have resolved it: https://github.com/MSOpenTech/redis/tree/2.8.4_msopen

Hector is unable to read Cassandra data when nodes reboot or terminate

We are trying to run a Cassandra cluster on AWS/EC2 within a standard VPC footprint (Cassandra nodes on private subnets). Because this is AWS, there is always a chance that an EC2 instance will terminate or reboot with no warning. I have been simulating this case on a test cluster, and I am seeing things that I thought a cluster was supposed to prevent. Specifically, if a node reboots, some data goes temporarily missing until the node completes its reboot. If a node terminates, it appears that some data is lost forever.
For my test I just did a bunch of writes (using QUORUM consistency) to some keyspaces, then interrogated the contents of those keyspaces as I brought down nodes (either through reboot or terminate). I'm just using cqlsh SELECT to do the keyspace/column family interrogation of the cluster, at consistency level ONE.
Note that even though I am performing no writes to the cluster while I am doing the SELECTs, rows temporarily disappear when rebooting and can permanently go missing during termination.
I thought Netflix Priam might be able to help, but sadly it doesn't work in a VPC the last time I checked.
Also, because we are using ephemeral storage instances there is no equivalent of 'shutdown' so I cannot run any scripts during reboot/terminate of an instance to perform a nodetool decommission or nodetool removenode before an instance goes away. Terminate is the equivalent of kicking the plug out of the wall.
Since I am using a replication factor of 3 and QUORUM writes, all data should be written to at least 2 nodes. So, unless I am totally misunderstanding things (which is possible), losing one node should not mean that I lose any data for any period of time when I am reading at consistency level ONE.
Questions
Why wouldn't a 6 node cluster with a replication factor of 3 work?
Do I need to run something like a 12 node cluster with a replication factor of 7? Don't bother telling me that will fix the problem, because it doesn't.
Do I need to use consistency level of ALL on the writes then use ONE or QUORUM on the reads?
Is there something not quite right with virtual nodes? (unlikely)
Are there nodetool commands besides removenode that I need to run when a node terminates to recover missing data? As mentioned earlier, when a reboot occurs, eventually the missing data reappears.
Is there some cassandra savant who can look at my cassandra.yaml file below and send me on the path to salvation?
More Info added 7/19
I don't think QUORUM vs ONE vs ALL is the issue. The test I set up performs no writes to the keyspaces after the initial population of the column families, so the data has had plenty of time (hours) to make it to all the nodes as required by the replication factor. Plus the test dataset is REALLY small (2 column families with about 300-1000 values each). In other words, the data is completely static.
The behavior I am seeing seems to be tied to the fact that the EC2 instance is no longer on the network. The reason I say this is that if I log on to a node and just do a cassandra stop, I see no loss of data. But if I do the reboot or terminate, I start getting the following in a stack trace.
CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
CassandraHostRetryService - Downed Host retry shutdown complete
CassandraHostRetryService - Downed Host retry shutdown hook called
Caused by: TimedOutException()
Caused by: TimedOutException()
So it seems to be more of a network communication issue, in that the cluster expects, for example, 10.0.12.74 to be on the network after it has joined the cluster. If that IP is suddenly unreachable, either due to reboot or termination, the timeouts start happening.
When I do a nodetool status under all three scenarios (cassandra stop, reboot or terminate), the status of the node shows up as DN, which is what you would expect. Eventually nodetool status returns to UN after a cassandra start or reboot, but obviously a terminated node stays DN.
Details of my Configuration
Here are some details of my configuration (cassandra.yaml is at the bottom of this posting):
Nodes are running in private subnets of a VPC.
Cassandra 1.2.5 with num_tokens: 256 (virtual nodes). initial_token: (blank). I am really hoping this works, because all of our nodes run in autoscaling groups, so the thought that redistribution could be handled dynamically is appealing.
EC2 m1.large, one seed and one non-seed node in each availability zone (so 6 total nodes in the cluster).
Ephemeral storage, not EBS.
Ec2Snitch with NetworkTopologyStrategy and all keyspaces have replication factor of 3.
Non-seed nodes are auto_bootstrapped, seed nodes are not.
sample cassandra.yaml file
cluster_name: 'TestCluster'
num_tokens: 256
initial_token:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider
saved_caches_directory: /opt/company/dbserver/caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "SEED_IP_LIST"
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 8
memtable_flush_queue_size: 4
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: LISTEN_ADDRESS
start_native_transport: false
native_transport_port: 9042
start_rpc: true
rpc_address: 0.0.0.0
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
auto_bootstrap: AUTO_BOOTSTRAP
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
read_request_timeout_in_ms: 10000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 10000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: Ec2Snitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
internode_compression: all
I think http://www.datastax.com/documentation/cassandra/1.2/cassandra/dml/dml_config_consistency_c.html will clear up a lot of this. In particular, QUORUM/ONE is not guaranteed to return the most recent data; QUORUM/QUORUM is. So is ALL/ONE, but that will be intolerant of write failures.
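To make the overlap concrete: with a replication factor of 3, QUORUM is floor(3/2) + 1 = 2 replicas. A QUORUM write plus a QUORUM read touch at least 2 + 2 = 4 > 3 replicas, so the read always overlaps a replica that took the write. A QUORUM write plus a ONE read touch only 2 + 1 = 3 replicas, which need not overlap, so the single replica you read from may be the one that missed the write (for example, while it was down and its hints had not yet been delivered).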
Edit to go with the new information:
CassandraHostRetryService is part of Hector. I assumed you were testing with cqlsh like a sane person would. Lessons:
Use cqlsh for testing
Use the DataStax Java Driver for building your application; it is faster, easier to use, and has more insight into the cluster state than Hector, thanks to the native protocol it's built on. A minimal read example follows below.
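For illustration, here is a minimal sketch of a QUORUM read with the DataStax Java Driver (2.x-era API; the contact point, keyspace, and table names are placeholders, not taken from the question):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumReadExample {
    public static void main(String[] args) {
        // Placeholder contact point and keyspace.
        Cluster cluster = Cluster.builder().addContactPoint("10.0.12.74").build();
        Session session = cluster.connect("my_keyspace");

        // Read at QUORUM so the result reflects a majority of replicas,
        // matching the QUORUM writes used when the data was loaded.
        SimpleStatement stmt = new SimpleStatement("SELECT * FROM my_table");
        stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);

        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {
            System.out.println(row);
        }

        cluster.close();
    }
}

Note that the driver uses the native protocol port (9042), so start_native_transport would need to be set to true in cassandra.yaml for it to connect.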

XBee - XBee-API and multiple endpoints

Using Andrew Rapp's XBee-API, how can I sample I/O data via a coordinator from more than two endpoints?
I have 17 Series 1 XBees. I have programmed one to be a coordinator (API mode = 2) and the rest to be endpoints. Using XBee-API I am sending a Force I/O Sample ("IS") remote AT command, unicast to each endpoint. This works perfectly well when there are up to two endpoints, but as soon as a third is added, one of the three always becomes non-responsive (times out with XBeeTimeoutException). It's not always the same physical unit that stops responding, but it is always the third one (for example, if I send Force I/O Sample to Device1, Device2, and Device3, Device3 will time out, and if I change the order to Device3, Device1, Device2, Device2 will time out).
If I set up more than three XBees, about 1 out of 3 will time out - but not every third one.
I've verified that the XBees themselves are fine. I've searched the Internet and Stack Overflow in particular to no avail. I've tried using a simple ZNetRemoteAtRequest. I've tried opening and closing the XBee coordinator serial connection once for all three devices, once per device, and once per program run. I've tried varying the distance between the coordinator and endpoints (never more than five feet apart). I've tried different coordinator configuration parameters (from the Digi documentation). I've tried changing out the XBee for the coordinator.
This is the code I'm using to send the Force I/O Sample request to each endpoint and read the response:
xbee = new XBee(); // Coordinator
xbee.open("/dev/ttyUSB0", 115200); // Happens before any of the endpoints are contacted
... // Loop through known endpoint addresses
    XBeeRequest request = new ZBForceSampleRequest(new XBeeAddress64(endpointAddress));
    ZNetRemoteAtResponse response = null;
    response = (ZNetRemoteAtResponse) xbee.sendSynchronous(request, remoteXBeeTimeout);
    if (response.isOk()) {
        // Process response payload
    }
... // End loop and finally close coordinator connection
What might help with polling I/O samples from more than two endpoints?
EDIT: I found that Andrew Rapp's XBee-API library fakes multithreaded behavior, which causes the synchronization issues described in this question. I wrote a replacement library that is actually multithreaded and correctly maps responses from multiple XBee endpoints: https://github.com/steveperkins/xbee-api-for-java-1-4. When I wrote it, Java 1.4 was necessary for use on the BeagleBone, Plug, and Zotac single-board PCs, but it's an easy conversion to 1.7+.
Are you using hardware flow control on your serial port? Is it possible that you're sending requests out when the local XBee has deasserted CTS (e.g., asking you to stop sending)? I assume you're running at 115200 bps, so the XBee serial port can keep up with the network data rate.
Can you turn on debugging information, or connect some port monitoring hardware/software to display the data going over the serial port to the local XBee?
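As a starting point for the debugging suggestion, here is a minimal sketch assuming the library logs through Log4j 1.x under the com.rapplogic.xbee package (the logger name is an assumption; adjust it to whatever your XBee-API version actually uses):

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class XBeeDebugLogging {
    public static void main(String[] args) {
        // Route Log4j output to the console.
        BasicConfigurator.configure();

        // Assumed logger name for the XBee-API library; raise it to DEBUG so
        // the frames exchanged with the local XBee are printed.
        Logger.getLogger("com.rapplogic.xbee").setLevel(Level.DEBUG);

        // ... then open the coordinator and send requests as in the question.
    }
}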
