consul - Duplicate call generated during timeout - spring

I have two microservices, MS1 and MS2, running on two nodes, say node1 and node2. When MS1 calls MS2, Consul discovers the MS2 instance on node1. However, MS2 on node1 takes longer than the configured read timeout, and as soon as the timeout is reached I see a call arriving at MS2 on node2.
Is it expected behavior for Consul to redirect a call to a different node when the first node takes too long?
This happens only when the endpoint is a GET request; it does not happen for a POST request.
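Consul itself only serves the discovery data; it does not proxy or re-route calls. If MS1 reaches MS2 through a client-side load balancer such as Spring Cloud Netflix Ribbon (a common setup with Consul discovery, assumed here), the duplicate call is most likely Ribbon's retry: its defaults, MaxAutoRetriesNextServer=1 and OkToRetryOnAllOperations=false, retry a timed-out GET on the next discovered instance but never retry a POST, which matches the behavior described above. A minimal sketch of switching that retry off for the MS2 client (the class name and service name are illustrative):

import com.netflix.client.config.CommonClientConfigKey;
import com.netflix.client.config.DefaultClientConfigImpl;
import com.netflix.client.config.IClientConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class Ms2RibbonConfig {

    @Bean
    public IClientConfig ribbonClientConfig() {
        DefaultClientConfigImpl config = new DefaultClientConfigImpl();
        config.loadDefaultValues();
        // no retry on the same instance and no retry on another instance
        config.set(CommonClientConfigKey.MaxAutoRetries, 0);
        config.set(CommonClientConfigKey.MaxAutoRetriesNextServer, 0);
        // leave non-GET operations non-retryable (the default)
        config.set(CommonClientConfigKey.OkToRetryOnAllOperations, false);
        return config;
    }
}

The configuration would be attached to the client with @RibbonClient(name = "MS2", configuration = Ms2RibbonConfig.class); if you are not on the Ribbon stack, look for the equivalent retry setting in whatever load balancer or gateway sits between the two services.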

Related

Infinispan clustered REPL_ASYNC cache: command indefinitely bounced between two nodes

I'm running a Spring Boot application using Infinispan 10.1.8 in a 2-node cluster. The two nodes communicate via JGroups TCP. I configured several REPL_ASYNC caches.
The problem:
One of these caches, at some point, causes the two nodes to exchange the same message over and over, driving CPU and memory usage up. The only way to stop this is to stop one of the two nodes.
More details: here is the configuration.
org.infinispan.configuration.cache.Configuration replAsyncNoExpirationConfiguration = new ConfigurationBuilder()
        .clustering()
            .cacheMode(CacheMode.REPL_ASYNC)
        .transaction()
            .lockingMode(LockingMode.OPTIMISTIC)
            .transactionMode(TransactionMode.NON_TRANSACTIONAL)
        .statistics().enabled(cacheInfo.isStatsEnabled())
        .locking()
            .concurrencyLevel(32)
            .lockAcquisitionTimeout(15, TimeUnit.SECONDS)
            .isolationLevel(IsolationLevel.READ_COMMITTED)
        .expiration()
            .lifespan(-1)        // entries do not expire
            .maxIdle(-1)         // even when they are idle for some time
            .wakeUpInterval(-1)  // disable the periodic eviction process
        .build();
One of these caches (named formConfig) is causing abnormal communication between the two nodes. This is what happens:
with JMeter I generate traffic load targeting only node 1
for some time node 2 receives cache entries from node 1 via SingleRpcCommand with no anomalies; even the formConfig cache behaves properly
after some time a new cache entry is sent to the formConfig cache
at this point the same message seems to keep bouncing between the two nodes:
node 1 sends the entry: mn-node1.company.acme-develop sending command to all: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 2 receives the entry: mn-node2.company.acme-develop received command from mn-node1.company.acme-develop: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 2 sends the entry back to node 1: mn-node2.company.acme-develop sending command to all: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 1 receives the entry: mn-node1.company.acme-develop received command from mn-node2.company.acme-develop: SingleRpcCommand{cacheName='formConfig', command=PutKeyValueCommand{key=SimpleKey [form_config,MECHANICAL,DESIGN,et,7850]
node 1 sends the entry to node 2 again, and so on and on...
Some other things:
the system is not under load; JMeter is running only a few users in parallel
even after stopping JMeter, the loop does not stop
formConfig is the only cache that behaves this way; all the other REPL_ASYNC caches work properly. With only the formConfig cache disabled, the system works correctly.
I cannot reproduce the problem with two nodes running on my machine
Here's a more complete log file including logs from both nodes.
Other info:
OpenJDK 11 HotSpot
Spring Boot 2.2.7
Infinispan Spring Boot Starter 2.2.4
using JbossUserMarshaller
I'm suspecting
something related to transactional configuration
or something related to serialization/deserialization of the cached object
The only scenario in which this can happen is when the SimpleKey ends up with a different hashCode() on the two nodes.
Are there any exceptions in the log? Are you able to check whether the hashCode() is the same after serialization and deserialization of the key?
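A quick way to run that check, assuming the key is Spring's org.springframework.cache.interceptor.SimpleKey (as the log format suggests): round-trip it through serialization and compare equals()/hashCode(). Plain Java serialization is only an approximation of what the JBoss marshaller does on the wire, but a mismatch here would already explain the bouncing; the key values below are taken from the log above.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.springframework.cache.interceptor.SimpleKey;

public class KeyRoundTripCheck {
    public static void main(String[] args) throws Exception {
        SimpleKey original = new SimpleKey("form_config", "MECHANICAL", "DESIGN", "et", 7850);

        // serialize the key
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(original);
        }

        // deserialize it again
        Object copy;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            copy = in.readObject();
        }

        System.out.println("equals after round-trip   : " + original.equals(copy));
        System.out.println("hashCode after round-trip : " + (original.hashCode() == copy.hashCode()));
    }
}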

Consuming long running tasks using pika BlockingConnection on EC2

I have a queue system where I use pika's BlockingConnection to consume a RabbitMQ queue hosted on https://www.cloudamqp.com/.
Here's how the connection is setup:
params = pika.URLParameters(self.queue_url)
self.connection = pika.BlockingConnection(params)
self.channel = self.connection.channel()
self.channel.queue_declare(queue=self.queue_name, durable=True)
self.channel.basic_qos(prefetch_count=1)
# set up subscription on the queue
self.channel.basic_consume(self._on_new_task,
                           queue=self.queue_name,
                           no_ack=False)
self.channel.start_consuming()

def _on_new_task(self, ch, method, properties, body):
    logging.info("Processing new scan")
    """
    Do long running task here
    """
    self.channel.basic_ack(delivery_tag=method.delivery_tag)
    # Program crashes here
Processing a task from the queue can take up to 15 minutes. Most of the time, however, processing the task only takes 2-3 minutes, and then everything works fine. When it takes > 5 minutes I get the following errors from pika:
ERROR - pika.adapters.base_connection - Read empty data, calling disconnect
WARNING - pika.adapters.base_connection - Socket closed when connection was open
WARNING - pika.connection - Disconnected from RabbitMQ at squirrel.rmq.cloudamqp.com:5672 (0): Not specified
CRITICAL - pika.adapters.blocking_connection - Connection close detected
Apparently the basic_ack is not handled successfully, because the task remains on the queue, leading to an infinite loop where the task is never released from the queue.
I've tried adjusting the heartbeat, to 0, 30, 600, etc., but it looks like heartbeats do not work with a blocking connection, since the long-running task blocks the connection from sending heartbeats (?)
The program runs inside a Docker container, and I get the problem when hosting the container on AWS ECS2 but not when running the container on my home PC. It therefore seems to me that ECS2 is tearing down idle connections (?)
This assumption is backed up by the fact that when running
for i in range(1, 10):
    self.connection.sleep(60)
    self.connection.process_data_events()
instead of the long-running task, everything works fine, probably because the heartbeats are sent and keep the connection alive.
Not sure how to solve this. Is there any point in replacing pika's BlockingConnection with SelectConnection? Or is there any configuration in EC2/ECS2 that does not tear down idle connections?

consul query all service nodes in one request

https://www.consul.io/docs/agent/http/catalog.html
/v1/catalog/services: lists the services
I have a lot of services and have to query Consul for the nodes of each of them, so the following endpoint is called multiple times:
/v1/catalog/service/<service>: nodes for the given service
I need an HTTP API that returns the nodes for all services in just one request, something like:
/v1/catalog/servicesNodes: nodes for each service
{
  "service1": [{"Node":"2e6c1dbe173f","Address":"172.17.42.1",
                "ServiceID":"aa:80","ServiceName":"aaww", ...}, {}],
  "service2": [{"Node":"2e6c1dbe173ee","Address":"172.17.42.1",
                "ServiceID":"aaqq:80","ServiceName":"aaqqww", ...}, {}]
}
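As far as I know, the catalog API has no single endpoint that returns the nodes for every service at once, so the usual workaround is to aggregate on the client side: one call to /v1/catalog/services for the names, then one /v1/catalog/service/<name> call per service. A minimal sketch with Java 11's HttpClient, assuming a local agent on 127.0.0.1:8500 and hypothetical service names:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CatalogDump {
    private static final String CONSUL = "http://127.0.0.1:8500";

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. service names (the response is a JSON object keyed by service name)
        String services = get(http, CONSUL + "/v1/catalog/services");
        System.out.println("services: " + services);

        // 2. nodes per service -- parse the names from step 1 with your JSON library of choice, then:
        for (String name : new String[]{"service1", "service2"}) {   // hypothetical names from step 1
            System.out.println(name + ": " + get(http, CONSUL + "/v1/catalog/service/" + name));
        }
    }

    private static String get(HttpClient http, String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}

If the service list is large, the per-service calls can run in parallel, and the catalog endpoints support blocking queries, so re-fetching can be skipped when nothing has changed.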

Why is an ephemeral node not deleted from zookeeper after the session timeout value

I am creating an ephemeral node with the CuratorFrameworkFactory.newClient method, which takes the ZooKeeper addresses, sessionTimeoutMs, connectionTimeoutMs, and a retry policy. I pass 5*1000 as sessionTimeoutMs and 15*1000 as connectionTimeoutMs. The method creates the EPHEMERAL node in my ZooKeeper, but the EPHEMERAL node is not deleted for as long as the application runs.
Why does this happen, when the session timeout is 5 seconds?
The most probable cause is that your heartbeat setting for ZooKeeper (a.k.a. tickTime) is higher, and the minimum session timeout can't be lower than 2*tickTime.
To debug: when an ephemeral node is created, check its ephemeralOwner from zkCli; the value is the session id.
When the session of the client that owns the node terminates, you should see this line in the ZooKeeper logs:
INFO [ProcessThread(sid:0 cport:2182)::PrepRequestProcessor#486] -
Processed session termination for sessionid: 0x161988b731d000c
In this case the ephemeralOwner was 0x161988b731d000c. If you don't see that line, you would have got some error; in my case it was an EOF exception, which was caused by a mismatch between the client library and the server.
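A small sketch of that check from the Curator side, under the same assumptions as the question (a 5 s requested session timeout): the server clamps the requested value to the range [2*tickTime, 20*tickTime], so printing the negotiated timeout shows immediately whether a larger tickTime is in play. The connect string and node path are placeholders.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class EphemeralTimeoutCheck {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", 5 * 1000, 15 * 1000, new ExponentialBackoffRetry(1000, 3));
        client.start();
        client.blockUntilConnected();

        // create an ephemeral node tied to this session
        client.create().withMode(CreateMode.EPHEMERAL).forPath("/ephemeral-check", new byte[0]);

        // the timeout the server actually granted, after clamping to [2*tickTime, 20*tickTime]
        System.out.println("negotiated session timeout (ms): "
                + client.getZookeeperClient().getZooKeeper().getSessionTimeout());
    }
}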

pacemaker corosync service ignored

Two-node cluster: Node A, Node B.
Service X is running on Node A; Node B is the DC.
We are using the Corosync stack with Pacemaker.
The failure timeout is 10 sec.
The target-role is Started.
Events happen like this:
Node A sends an event to Node B: Service X is down.
Node B prints: Ignoring expired failure for Service X.
After this, Service X is never restarted by the cluster.
Now the questions are:
Why is Node B (the DC) ignoring the expired failure?
Even if the DC ignored it this time, since Service X is still down, Node A should keep monitoring the service and send the failure status to Node B again, at which point Node B should restart the service. Why is this not happening?
One reason for this may be a time difference between the two servers (the DC and the other machine).
So the DC thinks that this event is old and ignores it. Please sync the time and then try to re-create the issue.
You can add the following property to your crm configuration, which will try to start failed, expired resources:
start-failure-is-fatal="false"
