Using Azure load balancer to reboot/update server with zero downtime - windows

I have a really simple setup: an Azure Load Balancer for HTTP(S) traffic, two application servers running Windows, and one database, which also contains session data.
The goal is to be able to reboot or update the software on the servers without a single request being dropped. The problem is that the health probe runs a test every 5 seconds and needs to fail 2 times in a row. This means that when I kill the application server, a lot of requests during those 10 seconds will time out. How can I avoid this?
I have already tried running the health probe on a different port, then denying all traffic to that port using Windows Firewall. The load balancer then thinks the application is down on that node and no longer sends new traffic to it. However, Azure LB does hash-based load balancing, so the traffic that was already going to the now-killed node keeps going there for a few seconds!

First of all, could you give us additional details: is your database load balanced as well? Do you perform reads and writes on this database, or only reads?
For your information, you can change the Azure Load Balancer distribution mode; please refer to this article for details: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-distribution-mode
I would suggest disabling the server you are updating at the load balancer level. Wait a couple of minutes (depending on your application) before starting your updates. This should "purge" your endpoint. When the update is done, update your load balancer again and put the server back in it.
The cloud concept is infrastructure as code: this could easily be scripted and included in your deployment / update procedure.
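As a rough illustration of scripting that procedure, here is a minimal Python sketch. The three callables (`remove_from_lb`, `update_server`, `add_to_lb`) are hypothetical placeholders for whatever your tooling does (Azure CLI, PowerShell, an SDK); they are not Azure APIs, and the 120-second drain time is only an assumed default.

```python
import time
from typing import Callable, Iterable


def rolling_update(
    servers: Iterable[str],
    remove_from_lb: Callable[[str], None],   # hypothetical: take node out of the backend pool
    update_server: Callable[[str], None],    # hypothetical: reboot / deploy on the node
    add_to_lb: Callable[[str], None],        # hypothetical: put node back in the pool
    drain_seconds: float = 120.0,            # assumed value, tune per application
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Update servers one at a time, draining each one before touching it."""
    for server in servers:
        remove_from_lb(server)   # stop new flows from reaching this node
        sleep(drain_seconds)     # let in-flight requests finish (the "purge")
        update_server(server)    # safe to reboot: the node receives no traffic
        add_to_lb(server)        # return the node before moving to the next one
```

The `sleep` parameter is injectable so the sequence can be dry-run without waiting. A real script would probably also wait for the updated server to pass its health probe before moving on to the next one.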
Another solution would be to use Traffic Manager. It could give you additional options for managing your endpoints (though it might be oversized for 2 VMs / endpoints).
The last solution is to migrate to a PaaS solution where these kinds of features are already available (Deployment Slots).
Hoping this will help.
Best regards

Related

503 error on server load tests on Wildfly server on Jelastic

I have an app deployed on a WildFly server on the Jelastic PaaS. The app works normally with a few users. I'm trying to run some load tests with JMeter, in this case calling a REST API 300 times in 1 second.
This leads to around a 60% error rate on the requests, all of them 503 (service temporarily unavailable). I don't know what things I have to tweak in the environment to get rid of those errors. I'm pretty sure it's not my app's fault, since it is not heavy and I get the same results even when load testing the index page.
The topology of the environment is simply 1 WildFly node (with 20 cloudlets) and a Postgres database with 20 cloudlets. I had fancier topologies, but to narrow the problem down I removed the load balancer (NGINX) and the extra WildFly nodes.
Requests via the shared load balancer (i.e. when your internet facing node does not have a public IP) face strict QoS limits to protect platform stability. The whole point of the shared load balancer is it's shared by many users, so you can't take 100% of its resources for yourself.
With a public IP, your traffic goes straight from the internet to your node and therefore those QoS limits are not needed or applicable.
As stated in the documentation, you need a public IP for production workloads (a load test should be considered 'production' in this context).
"I don't know what things I have to tweak in the environment to get rid of those errors"
We don't know either, and as your question doesn't provide a sufficient level of detail, we can only come up with generic suggestions like:
Check the WildFly log for any suspicious entries. HTTP 503 is a server-side error, so it should be logged along with a stack trace that will lead you to the root cause.
Check whether the WildFly instance(s) have enough headroom in terms of CPU, RAM, etc.; this can be done using, e.g., the JMeter PerfMon Plugin.
Check JVM and WildFly-specific JMX metrics using JVisualVM or the aforementioned JMeter PerfMon Plugin.
Double check the Undertow subsystem configuration for any connection/request/rate limiting entries.
Use a profiler tool like JProfiler or YourKit to see which are the slowest functions, largest objects, etc.
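Before digging into logs and profilers, it can help to confirm exactly which response codes the test produced. Below is a minimal Python sketch that tallies them from a JMeter results file. It assumes the results were saved as CSV with a header row containing a `responseCode` column; the function names are made up for this example.

```python
import csv
from collections import Counter
from typing import Iterable, Tuple


def summarize_response_codes(rows: Iterable[dict]) -> Tuple[Counter, float]:
    """Count samples per response code and return the share of non-2xx samples."""
    codes = Counter(row["responseCode"] for row in rows)
    total = sum(codes.values())
    # Anything that is not a 2xx code (including non-HTTP error text) counts as an error.
    errors = sum(n for code, n in codes.items() if not code.startswith("2"))
    return codes, (errors / total if total else 0.0)


def summarize_jtl(path: str) -> Tuple[Counter, float]:
    """Read a CSV results file (assumed to have a header row) and summarize it."""
    with open(path, newline="") as handle:
        return summarize_response_codes(csv.DictReader(handle))
```

If every failure turns out to be a 503 and the WildFly log shows nothing at the same timestamps, that points at something in front of the application server rather than at the application itself.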

JMeter for Clustered Scenarios

I have to perform load testing on a load balanced (clustered) system composed of three servers.
Is it good practice to test each server individually with JMeter? Or would it be better to test the whole cluster by calling the load balancer's dedicated endpoint?
Thanks !
A well-behaved load test needs to mimic real-life application usage as closely as possible. Therefore, if the load balancer acts as a single entry point to the system, JMeter needs to hit only this endpoint, so the whole system is treated as a "black box".
With regard to best practices for testing distributed systems, you can also consider the following areas:
Load balancers may route requests depending on their origin, so it might be a good idea to implement IP spoofing so that each JMeter virtual user has its own source IP address.
Load balancer endpoint host(s) may have multiple IP addresses, so consider adding a DNS Cache Manager to your Test Plan so that each JMeter virtual user resolves the endpoint address on its own. Otherwise, because DNS lookups are cached at the OS or JVM level, your test may hit only one node while the others sit idle.
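To see why the source address matters, here is a toy Python model of source-IP-based routing. It is only an illustration: the hashing scheme and the backend names are assumptions made for this sketch, not how any particular load balancer is implemented.

```python
import hashlib
from collections import Counter
from typing import List


def pick_backend(source_ip: str, backends: List[str]) -> str:
    """Map a client IP to a backend with a stable hash (toy source-IP affinity)."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def distribution(source_ips: List[str], backends: List[str]) -> Counter:
    """Count how many requests each backend receives, one request per list entry."""
    return Counter(pick_backend(ip, backends) for ip in source_ips)


backends = ["web1", "web2", "web3"]  # hypothetical node names
# One load generator, one source IP: every request lands on the same node.
print(distribution(["10.0.0.1"] * 100, backends))
# Many source IPs: requests spread across the nodes.
print(distribution([f"10.0.0.{i}" for i in range(1, 255)], backends))
```

With a single source IP the load test exercises one node only, which is the situation the two suggestions above are meant to avoid.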
Testing the whole system is always better: it lets you find problems with the load balancing itself. It's always better to test under the same conditions as the production environment.
After assessing the first test, you can adjust and run another test, which may reveal that one server is slowing down the chain.
The answer is both (and you should find more cases). You need to load test your system in conditions as close as possible to the real environment to know its capabilities.
But also, for example, when upgrading a version, sometimes only a few servers, or even just one, remain online, and you need to know what load they can sustain.

How to load test Apache HTTP load balanced servers

Apache JMeter allows us to hit the server with simultaneous connections. On the other hand, I have 4 web servers: one acting as a load balancer and the other 3 acting as application servers. So I want to load test these servers at once to check their performance. Is there a way to load test a load balancer? Are there any tools that would help with this? I will edit this question further with more information. For the time being, could someone point out a starting point?
Ramp your normal app-test as usual (through the load balancer).
Eventually, you'll get high response times. If you see your application servers are running fine, then it's (probably*) your load balancer that's the issue. If the application servers are falling over, then you don't need to worry about the performance of your load balancer; it's not the bottleneck.
*Obviously, there could be other problems, e.g., simple network throughput. But you should be able to tell what's going on with some simple monitoring.
Yes, you can run a load test against your Apache load balancer.
Target your HTTP requests at the Apache load balancer. I'm assuming the LB will distribute the load evenly across the 3 backend servers. After the test, evaluate the response times. Are they good? Great.
If they aren't good, you can run a second test targeting the 3 backend servers directly (bypassing the LB). If the response times are better now, then you know your LB is the problem; otherwise you may need to add more backend servers or optimize your applications (I guess this is what you'll do).
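The comparison between the two test runs can be sketched in a few lines of Python. This is only an illustration of the reasoning above: the function name and the 1.2 tolerance factor are assumptions made for the example, and it presumes you have already judged the response times through the LB to be too slow.

```python
from statistics import median
from typing import List


def likely_bottleneck(
    via_lb_ms: List[float],
    direct_ms: List[float],
    tolerance: float = 1.2,  # assumed margin: LB run must be 20% slower to blame the LB
) -> str:
    """Compare median latency through the LB with latency straight to the backends."""
    lb, direct = median(via_lb_ms), median(direct_ms)
    if lb > direct * tolerance:
        return "load balancer"    # backends are clearly faster when hit directly
    return "backend servers"      # bypassing the LB did not help much


print(likely_bottleneck([900, 1000, 1100], [190, 200, 210]))   # load balancer
print(likely_bottleneck([900, 1000, 1100], [950, 1000, 1050])) # backend servers
```

Medians are used here only to keep the sketch short; in practice you would probably look at percentiles and error rates as well.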

mod_jk vs mod_cluster

Can someone please tell me the pros and cons of mod_jk vs mod_cluster?
We are looking to do very simple load balancing. We are going to be using sticky sessions and just need something to route new requests to a new server if one server goes down. I feel that mod_jk does this and does a good job, so why do I need mod_cluster?
If your JBoss version is 5.x or above, you should use mod_cluster; it will give you better performance and reliability than mod_jk. Here are some reasons:
better load balancing between app servers: the load balancing logic is calculated from information and metrics provided directly by the application servers (bear in mind they have first-hand information about their own load), in contrast with mod_jk, where the logic is calculated by the proxy itself. For that, mod_cluster uses an extra connection between the servers and the proxy (apart from the data one) to send this load information.
better integration with the lifecycle of the applications deployed on the servers: the servers keep the proxy informed about changes to the application on each node (for example, if you undeploy the application on one of the nodes, the node informs the proxy (mod_cluster) immediately, which avoids the inconvenient 404 errors).
it doesn't require AJP: you can also use it with HTTP or HTTPS.
better management of server lifecycle events: when a server shuts down or is restarted, it informs the proxy about its state, so that the proxy can reconfigure itself automatically.
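The first and last points can be pictured with a toy Python model. This is not mod_cluster's actual protocol or balancing algorithm, only a sketch of the idea that nodes push their own load figures and lifecycle events to the proxy; the class and method names are made up for the example.

```python
from typing import Dict


class ToyProxy:
    """Toy proxy that routes on load figures pushed by the nodes themselves."""

    def __init__(self) -> None:
        self.load: Dict[str, float] = {}

    def report_load(self, node: str, load: float) -> None:
        # Node-side push: the app server tells the proxy how busy it is.
        self.load[node] = load

    def node_stopping(self, node: str) -> None:
        # Lifecycle event: drop the node as soon as it announces shutdown.
        self.load.pop(node, None)

    def route(self) -> str:
        # Pick the node that currently reports the lowest load.
        if not self.load:
            raise RuntimeError("no nodes available")
        return min(self.load, key=self.load.get)


proxy = ToyProxy()
proxy.report_load("node1", 0.8)
proxy.report_load("node2", 0.3)
print(proxy.route())          # node2, the less loaded node
proxy.node_stopping("node2")  # node2 announces it is shutting down
print(proxy.route())          # node1, no request is sent to the stopping node
```

With a proxy-side approach, the proxy would instead have to notice on its own (for example through failed requests) that a node is gone.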
You can use sticky sessions with mod_cluster as well, though of course, if one of the nodes fails, mod_cluster won't help to keep the user sessions (as would also happen with other balancers, unless you have the JBoss nodes clustered). But for the reasons given above (mainly keeping track of server lifecycle events and better load balancing), if one of the servers goes down, mod_cluster will handle it better and more transparently to the user (the proxy is informed immediately, so it won't send requests to that node until it's informed that the node has restarted).
Remember that you can use mod_cluster with JBoss AS/EAP 5.x or JBoss Web 2.1.1 or above (in the case of Tomcat I think it's version 6 or above).
To sum up, though your load balancing use case is simple, mod_cluster offers better performance and scalability.
You can look for more information on the JBoss site for mod_cluster, and in its documentation page.

Haproxy Load Balancer, EC2, writing my own availability script

I've been looking at high availability solutions such as heartbeat and keepalived to fail over when an HAProxy load balancer goes down. I realised that although we would like high availability, it's not really a requirement at this point to go as far as paying for 2 load balancer instances running at all times so that we get instant failover (particularly as one LB is going to be redundant in our setup).
My alternative solution is to fire up a new load balancer EC2 instance from an AMI if the current load balancer has stopped working, and associate it with the Elastic IP that our domain name points to. This should ensure that downtime is limited to the time it takes to fire up the new instance and associate the Elastic IP, which given our current circumstances seems like a reasonably cost-effective solution to high availability, particularly as we can easily do it across multiple availability zones. I am looking to do this using the following steps:
1. Prepare an AMI of the load balancer
2. Fire up a single ec2 instance acting as the load balancer and assign the Elastic IP to it
3. Have a micro server ping the current load balancer at regular intervals (we always have an extra micro server running anyway)
4. If the ping times out, fire up a new EC2 instance using the load balancer AMI
5. Associate the elastic ip to the new instance
6. Shut down the old load balancer instance
7. Repeat step 3 onwards with the new instance
I know how to run the commands in my script to start up and shut down EC2 instances, associate the elastic IP address to an instance, and ping the server.
My question is what would be a suitable ping here? Would a standard ping suffice at regular intervals, and what would be a good interval? Or is this a rather simplistic approach and there is a smarter health check that I should be doing?
Also, if anyone foresees any problems with this approach, please feel free to comment.
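For reference, the monitoring loop described in the steps above might look roughly like the Python sketch below. It swaps the plain ping for an HTTP request (one possible "smarter" check, since it also verifies that the proxy answers) and only fails over after several consecutive failures. The names, the 5-second interval and the threshold of 3 are assumptions made for the example, and `fail_over` is a placeholder for the EC2 commands mentioned above.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def http_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        return False  # refused, timed out, HTTP error status, or malformed reply


def watch(
    check: Callable[[], bool],
    fail_over: Callable[[], None],   # placeholder: launch AMI, move Elastic IP, stop old node
    max_failures: int = 3,           # assumed threshold, avoids reacting to a single blip
    interval: float = 5.0,           # assumed polling interval in seconds
    sleep: Callable[[float], None] = time.sleep,
    rounds: Optional[int] = None,    # None polls forever; a number is handy for dry runs
) -> int:
    """Poll `check` and call `fail_over` after `max_failures` consecutive failures."""
    failures = failovers = done = 0
    while rounds is None or done < rounds:
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                fail_over()
                failovers += 1
                failures = 0  # start counting afresh for the replacement instance
        sleep(interval)
        done += 1
    return failovers
```

A call such as `watch(lambda: http_check("http://lb.example.com/health"), start_replacement)` would tie the two together, where the URL and `start_replacement` are hypothetical. The sketch does not include a grace period while the replacement instance boots, which a real script would need.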
I understand exactly where you're coming from; my company is in the same position. We care about having a highly available, fault-tolerant system; however, the overhead cost simply isn't viable for the traffic we get.
One problem I have with your solution is that you're assuming the micro instance and the load balancer won't both die at the same time. From my experience with Amazon I can tell you it's definitely possible, however unlikely, that whatever causes your load balancer to die also takes down the micro instance.
Another potential problem is that you also assume you will always be able to start a replacement instance during downtime. This is simply not the case. Take for example an outage Amazon had in their us-east-1 region a few days ago. A power outage caused one of their zones to lose power. When they restored power and began to recover the instances, their APIs were not working properly because of the sheer load. During this time it took almost 1 hour before they were available. If an outage like this knocks out your load balancer and you're unable to start another, you'll be down.
That being said, I find the ELBs provided by Amazon are a better solution for me. I'm not sure what the reasoning is behind using HAProxy, but I recommend investigating the ELBs, as they will allow you to do things such as auto-scaling.
For each ELB you create, Amazon creates one load balancer in each zone that has an instance registered. These are still vulnerable to certain problems during severe outages at Amazon like the one described above. For example, during this downtime I could not add new instances to the load balancers, but my current instances (the ones not affected by the power outage) were still serving requests.
UPDATE 2013-09-30
Recently we've changed our infrastructure to use a combination of ELB and HAProxy. I find that ELB gives the best availability, but the fact that it uses DNS load balancing doesn't work well for my application. So our setup is ELB in front of a 2-node HAProxy cluster. Using HAProxyCloud, a tool I created for AWS, I can easily add auto scaling groups to the HAProxy servers.
I know this is a little old, but the solution you suggest is overcomplicated; there's a much simpler method that does exactly what you're trying to accomplish.
Just put your HAProxy machine, with your custom AMI, in an auto-scaling group with a minimum AND maximum of 1 instance. That way, when your instance goes down, the ASG will bring it right back up, EIP and all. No external monitoring is necessary, and the response to downed instances is as fast, if not faster.
