Circuit Breaker envoy proxy

I am going to set up an Envoy proxy, but I am still confused about the circuit breaker. For example: max_connections (UInt32Value) is the maximum number of connections that Envoy will make to the upstream cluster. If not specified, the default is 1024.
Does that mean it limits max_connections per host in the cluster, or max_connections for the cluster as a whole?
Thank you in advance.

The circuit breaker is a cluster attribute, and max_connections applies to all hosts that form a cluster. Envoy's circuit breaking mechanism is fully distributed (not coordinated).
That means, for example, that if you set max_connections for HTTP/1 to 1024, this value applies globally across the cluster: the hosts in the cluster have 1024 connections to share, and not more.
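A minimal sketch of what that looks like in a cluster definition (the cluster name is hypothetical, and the threshold values shown are the documented defaults):

clusters:
- name: backend                    # hypothetical cluster name
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1024        # shared by ALL hosts in the cluster, not per host
      max_pending_requests: 1024
      max_requests: 1024
      max_retries: 3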
The best source is the documentation: Envoy: Circuit Breaking

Related

What is the maximum allowed round trip time (RTT) between 2 nodes?

The Consul reference architecture mentions the statement below:
"In any case, there should be high-bandwidth, low-latency (sub 8ms round trip) connectivity between the failure domains."
What happens if the RTT is more than 8ms? What is the maximum allowed RTT between 2 nodes in a cluster?
This limitation primarily applies to latency between Consul servers. Excessive latency between the servers could cause instability with Raft, which could affect the availability of the server clusters.
Clients, however, can operate with higher latency thresholds. HashiCorp is in the process of updating the documentation to clarify this, and to list acceptable latency thresholds for client agents.
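If you want to check where a deployment stands against that threshold, Consul can estimate the RTT between two nodes from its network coordinates (the node names here are hypothetical):

# Estimate RTT between two nodes from Consul's network coordinates;
# add -wan to compare servers across datacenters (addressed as name.datacenter)
consul rtt server-1 server-2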

large, multi-dc cassandra deployment in ec2

Our current Cassandra cluster is 26 nodes spread across four AWS EC2 regions. We use Elastic IPs for all of our nodes, and we use security groups to allow all nodes to talk to each other.
There is a hard limit of 250 security group rules per network interface (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html#vpc-limits-security-groups). From the documentation:
You can have 50 inbound and 50 outbound rules per security group (giving a total of 100 combined inbound and outbound rules). If you need to increase or decrease this limit, you can contact AWS Support — a limit change applies to both inbound and outbound rules. However, the multiple of the limit for inbound or outbound rules per security group and the limit for security groups per network interface cannot exceed 250
Since each node needs a security group entry for every other node in the cluster, that means we have a hard limit of a 250 node cluster.
I can think of some ways to mitigate the issue, like using two security groups, where one allows access from the 'local' nodes in the same region, and the other has the Elastic IPs of the 'remote' nodes in other regions. This would help, but only a little bit.
I have been in touch with AWS technical support about this, and they suggested using contiguous blocks of Elastic IPs (which I did not know was possible). This seems like it would solve the problem, but it turns out to be an involved process that requires us (my company) to become the ARIN owner of those IPs. The AWS reps are pushing me towards alternatives, such as DynamoDB. I am open to other technologies to solve our problem, but I get the feeling they are just trying to get us to use AWS managed services. On top of that, they are rather slow in getting back to me when I ask questions like "Is this typically how customers run multi-region Cassandra clusters in EC2?"
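As I understand the suggestion, the appeal of a contiguous block is that an entire range of remote node addresses would collapse into a single CIDR rule instead of one rule per node. A hypothetical example with the AWS CLI (the group ID, CIDR block, and port range are placeholders; 7000-7001 are Cassandra's inter-node ports):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 7000-7001 \
  --cidr 203.0.113.0/28

One rule like this covers up to 16 remote nodes, so the 250-rule ceiling stops being a per-node budget.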
Does anyone have experience with this approach?
What are some alternatives to using security groups? Does anyone have experience running a large (>250 node), multi-region Cassandra cluster in EC2? Or even a cluster with >50 nodes (at which point a single security group isn't feasible anymore)?

How will a server running multiple Docker virtual machines handle the TCP limitation?

Under a REALLY heavy load, a server doesn't seem to "recycle" the TCP connections quickly enough.
I'm looking into using Docker to deal with a higher than usual number of requests per second to an API by creating multiple instances of a node server on one machine vs using multiple machines.
If the following sysctl settings are set, the recycling does seem to happen faster but there is still a hard limit on how many sockets there can be in existence:
net.ipv4.ip_local_port_range='1024 65000'   # widen the ephemeral port range
net.ipv4.tcp_tw_reuse='1'                   # allow reusing TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_fin_timeout='15'               # shorten the FIN_WAIT_2 timeout
When running multiple docker instances, is the total cap on tcp connections still equal to the number of maximum tcp connections the "parent" machine can handle?
Yes, the total cap of TCP connections will be capped by the Docker host.
However, there are three very different limits:
total cap of open connections (regardless of the source/destination IP address), which is related to the maximum number of file descriptors, and can be extremely high (i.e. millions)
total cap of outbound connections for a given local IP address (limited to 64K per local IP address)
total cap of connections tracked by netfilter
TCP port recycling deals with the 2nd limit. If you run netstat -nt on the host and in the container, you should be able to easily check whether you're getting close to it. If that's the case, the sysctls that you used should help a lot.
If your container is handling outside traffic, it shouldn't be subject to that limit; however, you could hit the 3rd one. You can check the number of tracked connections with conntrack -S, and if necessary, bump up the maximum number of connections by tweaking /proc/sys/net/ipv4/netfilter/ip_conntrack_max.
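A quick way to check where you stand against each of the three limits from a shell on the host (standard Linux tools; /proc paths may vary slightly by kernel version):

# 1st limit: file descriptors (system-wide ceiling and per-process limit)
cat /proc/sys/fs/file-max
ulimit -n

# 2nd limit: socket states; a pile of TIME_WAIT sockets means
# ports aren't being recycled fast enough
netstat -nt | awk 'NR > 2 {print $6}' | sort | uniq -c

# 3rd limit: tracked connections vs. the conntrack ceiling
conntrack -S
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max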
It would also be helpful to know which symptoms you are seeing that make you think the server doesn't recycle the connections fast enough.

Active MQ load balancing to achieve high throughput

Currently my ActiveMQ configuration (non-persistent messaging) allows me to achieve 2000 msgs/sec. There are four queues and four consumers consuming the messages. There is only one ActiveMQ broker in this configuration. I would like to achieve a higher throughput of about 5000 msgs/sec (with the addition of more brokers). I'm pretty clueless about how to achieve this without splitting individual queues onto individual ActiveMQ instances. What topologies support higher throughput than an individual instance without splitting the queues among instances?
Adding a network of brokers might help, provided you have a decent number of consumers and a decent number of producers connecting to different brokers.
If you have a single producer or a single consumer, all traffic will still go over one of the brokers, making it the bottleneck in any case. So the actual setup of the servers using the AMQ broker is important.
You will also need to check what the bottleneck of your physical machines is. Is it I/O? CPU? Memory usage/heap size? Or even link speed? Use OS tools together with VisualVM to track this down. Then you will at least know what kind of server you need next.
In any case, some semi-manual load balancing is always possible over several nodes, whether you are using a network of brokers or not. Just make sure messages are routed through certain brokers depending on their content or some other attribute. If you cannot distinguish between message types in any logical way, you can find some integer in the message (be it the client IP, yesterday's temperature in Celsius, or whatever) and take that number modulo the number of brokers, then route the message to the destination you selected. Round robin is also an option. There is almost always a way to distribute the load in a logical way among several brokers; a sketch of the modulo approach follows.
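A minimal sketch of the modulo idea in Java with the ActiveMQ client (the broker URLs, queue name, and key derivation are all hypothetical; any stable integer taken from the message works as the key):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class ModuloRouter {
    // Hypothetical broker URLs; one entry per broker in the setup.
    private static final String[] BROKERS = {
        "tcp://broker1:61616", "tcp://broker2:61616", "tcp://broker3:61616"
    };

    // Pick a broker deterministically from any integer found in the message.
    static String pickBroker(int key) {
        return BROKERS[Math.floorMod(key, BROKERS.length)];
    }

    public static void main(String[] args) throws JMSException {
        int key = 12345; // e.g. derived from the client IP or another integer field
        ConnectionFactory factory = new ActiveMQConnectionFactory(pickBroker(key));
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(session.createQueue("orders"));
        producer.send(session.createTextMessage("payload"));
        connection.close();
    }
}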

max concurrent connection to amazon load balancer

My testing shows that the Amazon load balancer resets connections with its instances when it has about 10k concurrent connections to it. Is that a limit of the Amazon load balancer? If not, is there a setting for it? I need to support up to 1M concurrent connections for my testing.
Thanks,
Sean Nguyen
The ELB should scale way beyond that, but you need to be testing from multiple test clients that appear to come from unique source IPs. This will cause the ELB to spawn multiple load-balancer instances behind the scenes (which can be detected by DNS lookups). This is explained in the whitepaper that RightScale published:
http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/
Note that it takes a little while for ELB resources to scale out, so tests need to run for 20 minutes or more.
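One way to observe that scale-out is to resolve the ELB's DNS name repeatedly during the test and count how many A records come back (the hostname below is a placeholder):

dig +short my-elb-123456789.us-east-1.elb.amazonaws.com

As the ELB scales, the lookup starts returning more (and changing) IP addresses.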
You also need to be sure that you have enough resources behind the load balancer. EC2 instances (as shown in the whitepaper mentioned above) seem to hit a throughput limit of around 100k packets per second, which limits the number of concurrent connections that can be served (bear in mind the overhead of TCP and HTTP). You will need a lot of instances to cope with 1M concurrent connections, and I'm not sure at what point you will hit the limit of the ELB; in RightScale's test they only hit 19k.
Also, you need to be clear about exactly what you mean by 1M concurrent connections: do you mean total keep-alive connections (assuming keep-alive is enabled), or 1M transactions per second?
