I have set up a simple load balancer using Apache 2.4 for 2 Tomcat servers. I have noticed that the BUSY column on the balancer-manager page never decreases and keeps increasing until both members reach around 200, at which point performance becomes very sluggish.
I cannot find any documentation describing the balancer-manager frontend, but I am guessing the BUSY column refers to the number of open connections to the balancer members. Is that right?
Is my Apache LB failing to close idle connections and instead opening new ones until it exhausts its resources?
Please guide me on this. I have to restart the Apache service every week in order to reset the BUSY column and make the LB run smoothly again.
Server running on Windows 2003 + Apache 2.4.4
We are working on an ASP.NET 5 Web API project that is in production now but we are experiencing an issue where it becomes unresponsive intermittently throughout the day.
A few notes about the application architecture. It is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully following the best practices provided by Microsoft. Heavy focus on async operations, minimizing query response times and offloading more expensive operations into background services.
The app is extremely responsive. It performs with sub 100ms responses on almost all requests, even the more complicated requests, and all the way up until it becomes unresponsive this high level of performance remains the same. We tend to see between 10-30 requests per second and 300-500 select queries per second at peak usage so not too extreme. However, randomly (2-3 times over a 24 hour period) it will begin hanging on requests and simply not respond to the request. During this time, the database is still extremely responsive and we are never over 300 connections out of our 512 connection limit.
The resources on the application server itself are never really taxed much at all. The CPU never gets above ~20% and the memory usage sits around 20-30%.
If I were to stop the site in IIS and start it again while this is happening, it will quickly come back online. If I don't, it will be down for a few minutes until IIS finally kills it due to a failed health check. There are no real errors generated as a response to the issue other than typical errors caused by the hanging of the process, such as connection terminated errors. The only thing I have seen that gave me pause was that there were a few connection timeouts when getting a connection from the pool, but like I said, the connections to the server are never close to the limit.
Also, this app and version has been in production for months and it wasn't until the traffic volume started to grow that we started seeing these issues. At this point, I am at a loss for next steps of troubleshooting and I'm seeking suggestions.
In IIS App Pool advanced settings set Start Mode to AlwaysRunning
I never found a root cause for this issue; however, after updating to newer versions of .NET MVC the issue went away. My best guess is that changes to Kestrel resolved it, although I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.
My site goes down every 2-3 days. It doesn't show any error up front; the browser keeps loading for a very long time, but no data appears. When I checked the Apache error logs I found that the MaxRequestWorkers limit had been exhausted. For the last 10 days I have been increasing that limit, and the interval between outages has grown to 5 days, but the site still goes down. The site was launched 45 days ago and ran perfectly for the first 30 days. We have not even observed any increase in traffic. The site is hosted on AWS on a t2.2xlarge instance.
Do you use many filters for layered navigation? When bots hit it and you are using SQL search, it will exceed max connections, lock things up, and repeat over and over. That is one possible area to look at. I already had this issue and had to block all bad bots in robots.txt. Check mostly for Chinese bots and block them by IP in .htaccess or the firewall, and tune robots.txt to set a crawl delay of 10 for bots. Connect your site to Cloudflare and tune things to disallow huge bursts of hits. In general, Chinese bots are mostly the ones that don't respect the rules and robots.txt, so I personally blocked all of China.
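One way to check whether bots are behind the load is to tally requests per user agent from the Apache access log. The following is a minimal Python sketch, assuming the log uses the standard combined format; the sample lines, the `ExampleBot/1.0` agent, and the function name `count_user_agents` are hypothetical illustrations, not data from the site in question.

```python
import re
from collections import Counter

# Matches the tail of a combined-format line:
#   "request" status bytes "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter of requests per user-agent string."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:  # silently skip lines that are not in combined format
            counts[match.group("ua")] += 1
    return counts


# Hypothetical sample lines; in practice, pass an open access-log file instead.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /shoes?color=red HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /shoes?color=red&size=9 HTTP/1.1" 200 498 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:55:40 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

counts = count_user_agents(SAMPLE_LINES)
for agent, hits in counts.most_common(10):
    print(hits, agent)
```

If one or two agents dominate the filtered-navigation URLs, those are the candidates for a robots.txt rule or an IP block.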
I have a single ELK stack with a single node running in a Vagrant VirtualBox VM on my machine. It has 3 indexes, which are 90 MB, 3.6 GB, and 38 GB.
At the same time, I also have a JavaScript application running on the host machine, consuming data from Elasticsearch. Locally it runs with no problems; the speed and everything else are perfect.
The issue comes when I put my JavaScript application in production, as the Elasticsearch endpoint in the application has to go from localhost:9200 to MyDomainName.com:9200. The application runs at a fine speed within the company, but when I access it from home, the speed drastically decreases and it often crashes. However, when I go to Kibana from home, running queries there is fine.
The company is using BT broadband and has a download speed of 60mb and an upload speed of 20mb. It doesn't use a fixed IP, so I have to update the A record manually whenever the IP changes, but I don't think that is relevant to the problem.
Is the internet speed the main issue that affected the loading speed outside of the company? How do I improve this? Is cloud (CDN?) the only option that would make things run faster? If so how much would it cost to host it in the cloud assuming I would index a lot of documents in the first time, but do a daily max. 10mb indexing after?
UPDATE1: Metrics from sending a request from Home using Chrome > Network
Queued at 32.77s
Started at 32.77s
Resource Scheduling
- Queueing 0.37 ms
Connection Start
- Stalled 38.32s
- DNS Lookup 0.22ms
- Initial Connection
Request/Response
- Request sent 48 μs
- Waiting (TTFB) 436.61 ms
- Content Download 0.58 ms
UPDATE2:
The stalling period seems to be much shorter when I use a VPN.
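To see which phase dominates, the UPDATE1 timings can be summed and expressed as shares of the total. This is a small Python sketch using only the figures listed above; "Initial Connection" is left out because no value was reported for it, and the names `PHASES_MS` and `phase_shares` are made up for illustration.

```python
# Phase durations in milliseconds, copied from UPDATE1.
# "Initial Connection" is omitted because no value was reported for it.
PHASES_MS = {
    "Queueing": 0.37,
    "Stalled": 38.32 * 1000,    # reported in seconds
    "DNS Lookup": 0.22,
    "Request sent": 48 / 1000,  # reported as 48 microseconds
    "Waiting (TTFB)": 436.61,
    "Content Download": 0.58,
}


def phase_shares(phases_ms):
    """Return each phase's fraction of the summed duration."""
    total = sum(phases_ms.values())
    return {name: ms / total for name, ms in phases_ms.items()}


shares = phase_shares(PHASES_MS)
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {share:7.2%}")
```

With these numbers "Stalled" accounts for well over 98% of the measured time, which suggests the delay happens before the request is even sent rather than while Elasticsearch is answering.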
I have a really simple setup: an Azure load balancer for HTTP(S) traffic, two application servers running Windows, and one database, which also contains session data.
The goal is being able to reboot or update the software on the servers, without a single request being dropped. The problem is that the health probe will do a test every 5 seconds and needs to fail 2 times in a row. This means when I kill the application server, a lot of requests during those 10 seconds will time out. How can I avoid this?
I have already tried running the health probe on a different port and then denying all traffic to that port using Windows Firewall. The load balancer will think the application is down on that node and therefore no longer send new traffic to it. However, Azure LB does hash-based load balancing, so the traffic which was already going to the now killed node will keep going there for a few seconds!
First of all, could you give us some additional details: is your database load balanced as well? Are you performing reads and writes on this database, or only reads?
For your information, you can change the Azure Load Balancer distribution mode; please refer to this article for details: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-distribution-mode
I would suggest disabling the server you are updating at the load balancer level. Wait a couple of minutes (depending on your application) before starting your updates. This should "purge" your endpoint. When the update is done, update your load balancer again and put the server back in it.
The cloud concept is infrastructure as code: this could easily be scripted and included in your deployment / update procedure.
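A drain step of this kind could be scripted roughly as follows. This is only a sketch in Python, and it swaps in a slightly different technique: instead of editing the load balancer, the node deliberately fails its own health probe while continuing to serve existing traffic. `DrainableHealth` and `seconds_until_marked_down` are hypothetical names, the 5-second interval and 2-failure threshold come from the question, the 30-second grace period is an assumed value, and wiring `status_code()` into the real probe endpoint is left out.

```python
import threading


class DrainableHealth:
    """Tracks whether this node should report itself healthy to the probe."""

    def __init__(self):
        self._draining = threading.Event()

    def start_drain(self):
        """Make subsequent probes fail while the app keeps serving requests."""
        self._draining.set()

    def status_code(self):
        """HTTP status the probe endpoint should return right now."""
        return 503 if self._draining.is_set() else 200


def seconds_until_marked_down(probe_interval_s, failures_required):
    """Rough upper bound on how long the probe needs to notice the drain."""
    return probe_interval_s * failures_required


health = DrainableHealth()
health.start_drain()

# Probe every 5 s, 2 consecutive failures required (values from the question).
detect_s = seconds_until_marked_down(5, 2)
grace_s = 30  # assumed extra time for in-flight requests to finish
print(f"Stop the application about {detect_s + grace_s} s after starting the drain")
```

The point of the design is that the application is only stopped after the probe has had time to fail and in-flight requests have had time to finish, so no request is sent to a node that is already gone.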
Another solution would be to use Traffic Manager. It could give you additional options to manage your endpoints (it might be a bit oversized for 2 VMs / endpoints).
A last solution is to migrate to a PaaS solution where these kinds of features are already available (Deployment Slots).
Hoping this will help.
Best regards
While looking into the resource balancer and dynamic load metrics on Service Fabric, we ran into some questions (Running devbox SDK GA 2.0.135).
In the Service Fabric Explorer (the portal and the standalone application) we can see that the balancing is run very often; most of the time it completes almost instantly, and this happens every second. However, the Load Metric Information on the nodes or partitions does not update its values as we report load.
We send a dynamic load report based on our interaction (an HTTP request to a service), increasing the reported load data of a single partition by a large amount. This spike becomes visible at some point within 5 minutes, at which point the balancer actually starts balancing. This seems to be an interval at which the load data gets refreshed. The last reported time gets updated all the time, but without the new value.
We added the metrics to applicationmanifest and the clustermanifest to make sure it gets used in the balancing.
This means the resource balancer uses the same data for 5 minutes. Is this a configurable setting? Is it constrained because it is running on a devbox?
We tried a lot of variables in the clustermanifest but none seem to affect this refresh time.
If this is not adaptable, can someone explain why you would run the balancer with stale data, and why this 5 minute interval was chosen?
This is indeed a configurable setting, and the default is 5 minutes. The idea behind it is that in prod you have tons of replicas all reporting load all the time, and so you want to batch them up so you don't spam the Cluster Resource Manager with all those as independent messages.
You're probably right in that this value is way too long for local development. We'll look into changing that for the local clusters, but in the meantime you can add the following to your local cluster manifest to change the amount of time we wait by default. If there are other settings already in there, just add the SendLoadReportInterval line. The value is in seconds and you can adjust it accordingly. The below would change the default load reporting interval from 5 minutes (300 seconds) to 1 minute (60 seconds).
<Section Name="ReconfigurationAgent">
  <Parameter Name="SendLoadReportInterval" Value="60" />
</Section>
Please note that doing so does increase load on some of the system services (TANSTAAFL), and as always if you're operating on a generated or complete cluster manifest be sure to Test-ServiceFabricClusterManifest before deploying it. If you're working with a local development cluster the easiest way to get it deployed is probably just to modify the cluster manifest template (by default here: "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\NonSecure\ClusterManifestTemplate.xml") and just add the line, then right click on the Service Fabric Local Cluster Manager in your system tray and select "Reset Local Cluster". This will regenerate the local cluster with your changes to the template.