GridGain node stops automatically on Amazon EC2

I am using GridGain version 6.1.9 and have set up a couple of nodes on Amazon EC2. I have configured TCP discovery for discovering the nodes. The nodes start up fine and join each other, but after a few minutes (around 20 minutes) the GridGain node stops with the message
**GridGain node stopped OK [uptime=00:21:53:493]**
I have set the socket timeout to 30 seconds. Is there anything else that I might have missed configuring?
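My discovery setup looks roughly like the sketch below. This is only a minimal illustration: the class and setter names follow the GridGain 6.x API as best I can tell (the Grid-prefixed predecessors of today's Ignite classes), and the addresses and timeout values are placeholders rather than my exact configuration, so please check them against the Javadoc for your version.

```java
import java.util.Arrays;

import org.gridgain.grid.GridConfiguration;
import org.gridgain.grid.GridGain;
import org.gridgain.grid.spi.discovery.tcp.GridTcpDiscoverySpi;
import org.gridgain.grid.spi.discovery.tcp.ipfinder.vm.GridTcpDiscoveryVmIpFinder;

public class StartNode {
    public static void main(String[] args) throws Exception {
        GridTcpDiscoveryVmIpFinder ipFinder = new GridTcpDiscoveryVmIpFinder();
        // Private EC2 addresses of the other nodes (placeholders).
        ipFinder.setAddresses(Arrays.asList("10.0.0.1:47500", "10.0.0.2:47500"));

        GridTcpDiscoverySpi disco = new GridTcpDiscoverySpi();
        disco.setIpFinder(ipFinder);
        disco.setSocketTimeout(30000);   // the 30 s socket timeout from the question
        disco.setAckTimeout(30000);      // wait longer for discovery message acks
        disco.setNetworkTimeout(60000);  // overall discovery network timeout

        GridConfiguration cfg = new GridConfiguration();
        cfg.setDiscoverySpi(disco);

        GridGain.start(cfg);
    }
}
```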

Related

Apache Ignite 2.7 to 2.10 upgrade: Server Node cannot rejoin cluster

I have a 5-node Service Grid running on an Ignite 2.10.0 cluster. While testing the upgrade, I stop one server node (SIGTERM) and wait for it to rejoin, but it fails to stay connected to the cluster.
Each node is the primary provider of one microservice and the backup for another (cluster singletons). The service that was running on the node that left the cluster is properly picked up by its backup node. However, the server node can never stay connected to the cluster again!
Rejoin strategy:
1. Let systemd restart Ignite.
2. The node rejoins, but then the new server node invokes its shutdown hook.
3. Go back to 1.
I have no idea why the rejoined node shuts itself down. As far as I can tell, the coordinator did not kill this youngest server node. I am logging at DEBUG with IGNITE_QUIET set to false, and I still can't find anything in the logs.
I tried increasing network timeouts, but the newly rejoined node still shuts down.
Any idea what is going on or where to look?
Thanks in advance.
Greg
Environment:
RHEL 7.9, Java 11
Ignite configuration (sketched below):
persistence is set to false
clientReconnectDisabled is set to true
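Roughly, the server-node configuration looks like the sketch below; the discovery addresses and the raised timeout values are placeholders rather than the exact production values.

```java
import java.util.Arrays;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class ServerNode {
    public static void main(String[] args) {
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        // Placeholder addresses for the 5 server nodes.
        ipFinder.setAddresses(Arrays.asList("host1:47500..47509", "host2:47500..47509"));

        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setIpFinder(ipFinder);
        disco.setClientReconnectDisabled(true);   // as listed above

        // Persistence is off: the default data region is purely in-memory.
        DataRegionConfiguration region = new DataRegionConfiguration();
        region.setPersistenceEnabled(false);
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setDefaultDataRegionConfiguration(region);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(disco);
        cfg.setDataStorageConfiguration(storage);
        // The increased timeouts mentioned above (example values only).
        cfg.setFailureDetectionTimeout(30000);
        cfg.setNetworkTimeout(30000);

        Ignition.start(cfg);
    }
}
```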

Consul bootstrap-expect value

I have a Consul cluster which normally should have 5 servers and a bunch of clients. Our script to start the servers was originally configured like this:
consul agent -server -bootstrap-expect 5 -join <ips of all 5 servers>
However, we had to reinstall the OS on all servers and bootstrap again. One of our servers was down with hardware issues, and the bootstrap no longer works.
My question is: in a situation where there are 5 servers, but 3 are sufficient for quorum, should -bootstrap-expect be set to 3?
The documentation at https://www.consul.io/docs/agent/options.html#_bootstrap_expect seems to imply that -bootstrap-expect should be set to the total number of servers, which means that even a single machine being down will prevent the cluster from bootstrapping.
To be clear, our startup scripts are static files, so when I say there are 5 servers it means that up to 5 could be started with the server tag.
In your case, if you don't explicitly need all 5 servers to be online during initial cluster setup, you should set -bootstrap-expect to 3. This avoids the situation you ran into, i.e. having 5 servers and telling them they must all be online before initial cluster setup can proceed. As the documentation says:
When provided, Consul waits until the specified number of servers are available and then bootstraps the cluster. This allows an initial leader to be elected automatically.
With -bootstrap-expect=3, as soon as 3 of your 5 Consul servers have joined the cluster, leader election will start, and even if the last 2 join much later the cluster will function. For that matter, you can have any number of servers join at a later time.

Why is the Cassandra client slower in EC2 than on a machine outside AWS?

I have set up a 6-node cluster in EC2. I tried to scan a table with 100M rows (2000 partitions). I wrote a client that launches 20-50 threads to read the table by
for partitionkey in keys
select * from table where partitionkey=?
Each query is a task executed by a thread. When I run my application on my Mac, it is 2x faster than when it runs on an m3.2xlarge box in EC2.
I also noticed that when I run the application on my Mac, traffic is distributed fairly evenly across the 6 nodes. However, when my application runs on EC2, nearly half the traffic goes to one node. I tried setting pool options to limit connections per host, but it did not help.
Anyone have an idea? Thanks in advance.
I set broadcast_rpc_address to a public IP; can a client inside AWS use the private IP?
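The client is shaped roughly like the sketch below. This is a simplified, hypothetical version (a DataStax Java driver 3.x-style API is assumed, since the question doesn't pin a driver version; the keyspace, table, contact point, and the loadPartitionKeys helper are placeholders), using a token-aware, round-robin load-balancing policy, which is what normally spreads these per-partition queries across the nodes.

```java
import java.util.Collections;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TableScanClient {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("10.0.0.10")   // placeholder: an address the client can actually route to
            .withLoadBalancingPolicy(
                // Token-aware routing on top of round-robin sends each query to a
                // replica that owns the partition, spreading load across nodes.
                new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
            .build();
        Session session = cluster.connect("my_keyspace");   // placeholder keyspace

        PreparedStatement ps =
            session.prepare("SELECT * FROM my_table WHERE partitionkey = ?");   // placeholder table

        try {
            for (String key : loadPartitionKeys()) {         // hypothetical helper
                ResultSetFuture f = session.executeAsync(ps.bind(key));
                f.getUninterruptibly();                      // simplified: wait per query
            }
        } finally {
            cluster.close();
        }
    }

    // Hypothetical: supplies the ~2000 partition keys mentioned in the question.
    private static List<String> loadPartitionKeys() {
        return Collections.emptyList();
    }
}
```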

AppFabric Cache Cluster not detecting a node has failed in a timely fashion

Setup:
We're using AppFabric 1.1 on Windows 2008 Enterprise Edition VMs.
We set up a cluster with three nodes, using SQL Server for cluster configuration and also using offloading, so SQL Server is supposed to do the cluster management. We made sure to create the cluster with: New-AFCacheCluster -Offloading true. We then add the three nodes and start the cluster up. All is good.
We then set up a single cache instance, call it "Test", with HA enabled via the -Secondaries 1 option.
Test Scenario:
We then use a test app to put some test data into the cache and access that data, and everything works great. So then we go to the VM host and take down the NIC for one of the nodes in the cluster to simulate that node's failure.
Results:
As soon as the NIC is disabled on that node, reads from the cache start timing out instead of failing over cleanly.
If we run Get-AFCacheHostStatus on either of the other two hosts that are still up, the first call after the NIC is disabled takes a very long time to return the status of the hosts. Once it finally does return, it shows the node whose NIC we yanked as being in UNKNOWN status. Subsequent calls to Get-AFCacheHostStatus return quickly, but always show the error message that the one node is unreachable and report it in the UNKNOWN status.
OK, so AF itself detects that the node is in UNKNOWN status, but the test app is still getting timeouts at this point. Some minutes later, somewhere between 5 and 10 minutes, the app eventually starts working again with only the two nodes we have left.
So, what's going on here? Are we configuring something incorrectly? Why is the cluster taking so long to recover from this basic kind of failure?

Tomcat 6 - Cluster / BackupManager

I have a question regarding clustering (session replication/failover) in Tomcat 6 using BackupManager. The reason I chose BackupManager is that it replicates the session to only one other server.
I am going to run through the example below to try and explain my question.
I have 6 nodes set up in a Tomcat 6 cluster with BackupManager. The front end is one Apache server using mod_jk with sticky sessions enabled.
Each node has one session:
node1 has a session from client1
node2 has a session from client2
..
..
Now let's say node1 goes down; assuming node2 is the backup, node2 now has two sessions (for client2 and client1).
The next time client1 makes a request, what exactly happens ?
Does Apache "know" that node1 is down and does it send the request directly to node2 ?
=OR=
does it try each of the 6 instances and find out the hard way which one the backup is?
I'm not too sure about the inner workings of BackupManager, but my reading of this good URL suggests the replication is intelligent enough to identify the backup.
In-memory session replication, is session data replicated across all Tomcat instances within the cluster, Tomcat offers two solutions, replication across all instances within the cluster or replication to only its backup server, this solution offers a guaranteed session data replication ...
SimpleTcpCluster uses Apache Tribes to maintain communication with the communications group. Group membership is established and maintained by Apache Tribes; it handles server crashes and recovery. Apache Tribes also offers several levels of guaranteed message delivery between group members. This is achieved by updating the in-memory session to reflect any session data changes; the replication is done immediately between members ...
You can reduce the amount of data by using the BackupManager (send only to one node, the backup node).
You'll be able to see this from the logs if notifyListenersOnReplication="true" is set.
On the other hand, you could still use DeltaManager and split your cluster into 3 domains of 2 servers each.
Say these will be node 1 <-> node 2, 3 <-> 4 and 5 <-> 6.
In such a case, configuring the domain worker attribute will ensure that session replication only happens within the domain.
And mod_jk then definitely knows which server to look at when node1 fails.
http://tomcat.apache.org/tomcat-6.0-doc/cluster-howto.html states
Currently you can use the domain worker attribute (mod_jk > 1.2.8) to build cluster partitions with the potential of having a more scaleable cluster solution with the DeltaManager (you'll need to configure the domain interceptor for this).
And there is a better example at this link:
http://people.apache.org/~mturk/docs/article/ftwai.html
See the "Domain Clustering model" section.
