How to debug OKE (Oracle Kubernetes Engine) cluster's slowness - performance

We have an OKE cluster in the US East (Ashburn) region. Its API latency is high compared with a similar cluster in another region (India). The cluster configuration and network configuration are the same. Any pointers for debugging this latency issue? The Ashburn cluster is ~10 times slower.
Ashburn cluster: speed test from pod, ping from pod (output not shown).
India cluster: speed test from pod, ping from pod (output not shown).
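One way to narrow this down is to run the same call repeatedly from a pod in each cluster and compare the distributions instead of single readings. The sketch below is only an illustration of that idea: `time_calls` and `summarize` are hypothetical helper names, and the workload passed in is whatever request you want to compare between the two clusters.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_calls(call: Callable[[], object], repeats: int = 20) -> List[float]:
    """Return the wall-clock duration (milliseconds) of each invocation of `call`."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Summarize latency samples as min, median, and nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank method
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


if __name__ == "__main__":
    # Stand-in workload (assumption): replace the lambda with the real request,
    # for example an HTTPS call to the endpoint that feels slow.
    print(summarize(time_calls(lambda: time.sleep(0.001), repeats=10)))
```

Running the same script in a pod in each region gives directly comparable numbers. A large gap in `min` hints at network distance, while a gap only in `p95` hints at intermittent stalls.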

Related

(kubernetes) NodePort is too slow, when the pod is running on a different node

I have a Kubernetes cluster with 2 nodes (node1, node2).
I created a NodePort service whose Pod is running on node1.
When I send a request to the pod through node1's NodePort, it is fine. (The request has 1Mb of content in the body.)
But the same request through node2's NodePort is too slow (almost 5 times slower than through node1).
Any help here is appreciated.
Edit1#
FYI
Calico Environments

ElasticSearch Client Node Loses Connection on AWS EC2 with Kernel Log "Setting Capacity to 83886080"

I have an ElasticSearch 2.4.4 cluster with 3 client nodes, 3 master nodes, and 4 data nodes, all on AWS EC2. Servers are running Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64). An AWS Application ELB is in front of the client nodes.
At random times, one of the clients will suddenly write this message to the kernel log:
Jan 17 05:54:51 localhost kernel: [2101268.191447] Setting capacity to 83886080
Note, this is the size of the primary boot drive in sectors (it's 40GB). After this message is received, the client node loses its connection to the other nodes in the cluster, and reports:
[2018-01-17 05:56:21,483][INFO ][discovery.zen ] [prod_es_233_client_1] master_left [{prod_es_233_master_2}{0Sat6dx9QxegO2rM03_o9A}{172.31.101.13}{172.31.101.13:9300}{data=false, master=true}],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
The kernel message seems to be coming from xen-blkfront.c
https://github.com/analogdevicesinc/linux/blob/8277d2088f33ed6bffaafbc684a6616d6af0250b/drivers/block/xen-blkfront.c#L2383
This problem seems unrelated to the number or type of requests to ES at the time, or any other load-related parameter. It just occurs randomly.
The Load Balancer will record 504s and 460s when attempting to contact the bad client. Other client nodes are not affected and return with normal speed.
Is this a problem with EC2's implementation of Xen?
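As a quick sanity check on the capacity figure, the value in the kernel message can be converted to a size. This is a minimal sketch that assumes the capacity is reported in 512-byte sectors; `capacity_sectors` and `sectors_to_gib` are hypothetical helper names.

```python
import re
from typing import Optional

SECTOR_BYTES = 512  # assumption: capacity is counted in 512-byte sectors
CAPACITY_RE = re.compile(r"Setting capacity to (\d+)")


def capacity_sectors(log_line: str) -> Optional[int]:
    """Extract the sector count from a 'Setting capacity to N' kernel log line."""
    match = CAPACITY_RE.search(log_line)
    return int(match.group(1)) if match else None


def sectors_to_gib(sectors: int, sector_bytes: int = SECTOR_BYTES) -> float:
    """Convert a sector count to GiB."""
    return sectors * sector_bytes / 2**30


line = "Jan 17 05:54:51 localhost kernel: [2101268.191447] Setting capacity to 83886080"
sectors = capacity_sectors(line)
print(sectors, sectors_to_gib(sectors))  # 83886080 40.0
```

The result matches the 40GB boot drive, which is consistent with the message referring to the root volume and not to some other device.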

DRBD - automatic recover after disconnect

I have a high-availability cluster configured with a DRBD resource.
Master/Slave Set: RVClone01 [RV_data01]
Masters: [ rvpcmk01-cr ]
Slaves: [ rvpcmk02-cr ]
I performed a test that disconnects one of the network adapters connecting the DRBD network interfaces (for example, by shutting down the network adapter).
Now the cluster status shows that everything is OK, but "drbd-overview" shows the following on the primary server:
[root@rvpcmk01 ~]# drbd-overview
0:drbd0/0 WFConnection Primary/Unknown UpToDate/DUnknown /opt ext4 30G 13G 16G 45%
and on the secondary server:
[root@rvpcmk02 ~]# drbd-overview
0:drbd0/0 StandAlone Secondary/Unknown UpToDate/DUnknown
Now I have a few questions:
1. Why doesn't the cluster know about the problem with DRBD?
2. When I brought the network adapter back up and restored the connection between the DRBD nodes, why didn't DRBD recover from the failure and resync once the connection was OK?
3. I saw an article about solving a DRBD split-brain: https://www.hastexo.com/resources/hints-and-kinks/solve-drbd-split-brain-4-steps/
It explains how to recover from a disconnection and resync DRBD.
But how should I know that this kind of problem exists?
I hope I have explained my case clearly and provided enough information about what I have and what I need.
1) You aren't using fencing/STONITH devices in Pacemaker or DRBD, which is why nothing happens when you unplug your network interface that DRBD is using. This isn't a scenario that Pacemaker will react to without defining fencing policies within DRBD, and STONITH devices within Pacemaker.
2) You likely are only using one ring for the Corosync communications (the same as the DRBD device), which will cause the Secondary to promote to Primary (introducing a split-brain in DRBD), until the cluster communications are reconnected and realize they have two masters, demoting one to Secondary. Again, fencing/STONITH would prevent/handle this.
3) You can set up the split-brain notification handler in your DRBD configuration.
Once you have STONITH/fencing devices set up in Pacemaker, you would add the following definitions to your DRBD configuration to "fix" all the issues you mentioned in your question:
resource <resource> {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    ...
  }
  disk {
    fencing resource-and-stonith;
    ...
  }
  ...
}
Setting up fencing/STONITH in Pacemaker is a little too dependent on your hardware/software for me to give you pointers on setting that up for your cluster. This should get you pointed in the right direction:
http://clusterlabs.org/doc/crm_fencing.html
Hope that helps!
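For the "how should I know" part, besides the split-brain handler, a simple external check can flag a resource whose connection state is anything other than Connected. This is a rough sketch only: it assumes the whitespace-separated layout shown in the drbd-overview output in the question, and `parse_overview_line` and `unhealthy` are hypothetical names.

```python
from typing import Dict, Iterable, List


def parse_overview_line(line: str) -> Dict[str, str]:
    """Split one drbd-overview line into its first four fields.

    Assumed layout (taken from the output in the question):
    <minor>:<resource>/<volume> <connection state> <roles> <disk states> ...
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected drbd-overview line: {line!r}")
    return {
        "resource": fields[0],
        "connection": fields[1],
        "roles": fields[2],
        "disks": fields[3],
    }


def unhealthy(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return the parsed entries whose connection state is not 'Connected'."""
    problems = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        status = parse_overview_line(line)
        if status["connection"] != "Connected":
            problems.append(status)
    return problems


primary = "0:drbd0/0 WFConnection Primary/Unknown UpToDate/DUnknown /opt ext4 30G 13G 16G 45%"
secondary = "0:drbd0/0 StandAlone Secondary/Unknown UpToDate/DUnknown"
for entry in unhealthy([primary, secondary]):
    print(entry["resource"], entry["connection"])
```

Both lines from the question are flagged (WFConnection and StandAlone). A resync in progress (SyncSource/SyncTarget) would be flagged too, so a real monitor would probably treat those states separately.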

AWS network latency is high

I provisioned two i2.8xlarge VMs on AWS, which should have 10 Gbps networking according to the portal.
But when I run an iperf3 test (with "-P 64", which means running 64 parallel TCP connections) on the VMs, I get only ~5 Gbps throughput. While iperf3 is running, the ICMP ping latency is relatively high (about 19 ms).
By the way, iperf3 with 1 TCP connection gave me the best throughput, about 7 Gbps.
Am I missing any configuration or setting in my AWS deployment that keeps me from getting high throughput and reasonably low latency?
Thanks!

Setting up a 2 node (ec2 ubuntu instances) Cassandra cluster

I'm new to Cassandra, and I'm trying to set up a simple 2-node cluster on two test EC2 Ubuntu instances, but replication is not working and nodetool ring doesn't show both instances. What could I be doing wrong?
I'm using Cassandra version 2.0.11.
Here's what my config looks like on both machines:
listen_address: <private_ip>
rpc_address: <private_ip>
broadcast_address: <public_ip>
seeds: <private_ip_of_other_machine>
endpoint_snitch: Ec2Snitch
I have configured the EC2 security group to allow all traffic on all ports between these instances. What am I doing wrong here? I can provide the Cassandra logs if required.
Thank you.
EDIT: the error I'm getting currently is this:
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1340)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:766)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625)
ERROR 15:08:03 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1340) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:766) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.5.jar:2.2.5]
WARN 15:08:03 No local state or state is in silent shutdown, not announcing shutdown
The first thing I see is that your seeds: list is wrong. Both nodes should have the same seeds: list. For a simple 2-node test setup, you only need 1 seed (pick either one). If the nodes are in the same AZ, you can use the private IP.
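To illustrate the "same seeds: list on both nodes" point, here is a small sketch that compares the seeds configured on each node. The function name `seed_list_problems` and the 10.0.0.x addresses are made up for the example; substitute the private IPs of your instances.

```python
from typing import Dict, List


def seed_list_problems(seeds_by_node: Dict[str, List[str]]) -> List[str]:
    """Report nodes with empty seeds lists and disagreement between nodes."""
    problems = []
    for node, seeds in seeds_by_node.items():
        if not seeds:
            problems.append(f"{node} has an empty seeds list")
    # Order does not matter for this comparison, so sort before comparing.
    distinct = {tuple(sorted(seeds)) for seeds in seeds_by_node.values()}
    if len(distinct) > 1:
        problems.append("nodes do not share the same seeds list")
    return problems


# Setup from the question: each node lists only the other machine as its seed.
as_posted = {"10.0.0.1": ["10.0.0.2"], "10.0.0.2": ["10.0.0.1"]}
# Suggested setup: one seed, identical on both nodes.
suggested = {"10.0.0.1": ["10.0.0.1"], "10.0.0.2": ["10.0.0.1"]}

print(seed_list_problems(as_posted))  # ['nodes do not share the same seeds list']
print(seed_list_problems(suggested))  # []
```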
