RabbitMQ 3.7.13 on Microsoft Windows Server 2012 R2 Standard, 32 GB RAM, 48 GB page file.
Very low utilization: 10 queues, 20 clients, hundreds of messages per day, each < 1 MB in size.
It ran fine for a year, then started becoming unresponsive in a consistent pattern:
Restart the RabbitMQ Windows service
RabbitMQ accepts new connections and processes messages
Connections/sockets start ramping from 940 up to a maximum of 7,280 in ~10 minutes
RabbitMQ stops accepting new connections and becomes unresponsive; the management dashboard shows a 500 Internal Server Error
When this started happening two weeks ago, restarting the service would buy about 24 hours of working time before RabbitMQ became unresponsive again, but that window has progressively shrunk until now a restart only provides about 10 minutes of uptime.
The server's memory history shows occasional spikes to maximum capacity.
What could be causing this? What are some diagnostic techniques to apply?
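Some diagnostics that may help narrow this down (these assume a default installation and the standard rabbitmqctl / management tooling; 5672 is the default AMQP port):

# Overall node health: memory use, file descriptor and socket counts, alarms
rabbitmqctl status

# Every connection with its client host and state; look for clients that
# reconnect in a loop without ever closing their old connections
rabbitmqctl list_connections name peer_host state channels

# Channel and queue counts, to see whether channels or messages are piling up
rabbitmqctl list_channels
rabbitmqctl list_queues name messages consumers

# Full diagnostic dump for offline analysis
rabbitmqctl report > rabbitmq_report.txt

# On Windows, confirm which remote hosts hold the thousands of sockets
netstat -ano | findstr ":5672"

Comparing the connection list a few minutes apart should show whether the ramp from 940 to 7,280 comes from one misbehaving client or from many.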
I have a single-node ELK stack running in a Vagrant VirtualBox VM on my machine. It has three indices of 90 MB, 3.6 GB, and 38 GB.
At the same time, I have a JavaScript application running on the host machine that consumes data from Elasticsearch; locally it runs with no problem and the speed is perfect.
The issue comes when I put the JavaScript application into production, because the Elasticsearch endpoint in the application has to change from localhost:9200 to MyDomainName.com:9200. The application still runs fine within the company network, but when I access it from home the speed drops drastically and it often crashes. However, when I open Kibana from home, running queries there is fine.
The company uses BT broadband with a 60 Mb/s download speed and 20 Mb/s upload. It doesn't have a fixed IP, so I have to update the A record manually whenever the IP changes, but I don't think that is relevant to the problem.
Is the internet speed the main issue affecting the loading speed outside the company? How do I improve this? Is the cloud (a CDN?) the only option that would make things run faster? If so, how much would it cost to host it in the cloud, assuming I index a lot of documents initially but do at most 10 MB of indexing per day afterwards?
UPDATE 1: Metrics for a request sent from home, taken from Chrome > Network
Queued at 32.77 s
Started at 32.77 s
Resource Scheduling
- Queueing 0.37 ms
Connection Start
- Stalled 38.32 s
- DNS Lookup 0.22 ms
- Initial Connection
Request/Response
- Request sent 48 μs
- Waiting (TTFB) 436.61 ms
- Content Download 0.58 ms
UPDATE 2:
The stalled period seems to be much shorter when I use a VPN?
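One way to separate network time from Elasticsearch time is to compare the query's server-side "took" value with the total round-trip time measured from home. A rough sketch (the index name is a placeholder and the query is deliberately trivial):

# Elasticsearch reports its own query time in the "took" field (milliseconds)
curl -s "http://MyDomainName.com:9200/my-index/_search?size=0" | grep -o '"took":[0-9]*'

# curl can break the round trip into DNS, connect, and transfer phases
curl -s -o /dev/null \
  -w "dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
  "http://MyDomainName.com:9200/my-index/_search?size=0"

If "took" stays small while time_starttransfer is large from home, the time is being lost on the network path rather than inside Elasticsearch.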
I have set up a simple load balancer using Apache 2.4 in front of two Tomcat servers. I have noticed that the BUSY column on the balancer-manager page never decreases and keeps increasing; once both members reach around 200, performance becomes very sluggish.
I cannot find any documentation detailing the balancer-manager front end, but I am guessing the BUSY column refers to the number of open connections to the balancer members. Is that right?
Does my Apache load balancer fail to close idle connections and keep opening new ones until it exhausts its resources?
Please guide me on this. I have to restart the Apache service every week to reset the BUSY column and make the load balancer run smoothly again.
Server: Windows Server 2003 + Apache 2.4.4
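If the BUSY count only ever climbs, requests to the backends are probably hanging rather than completing, so one thing worth trying is putting explicit lifetimes and timeouts on the balancer members. A sketch only; the host names, ports, and values are illustrative, not a recommended configuration:

<Proxy "balancer://mycluster">
    # timeout: stop waiting on a backend request after 60 seconds
    # ttl: close pooled backend connections that have been idle for 60 seconds
    # retry: wait 30 seconds before retrying a member that was marked in error
    BalancerMember "http://tomcat1.example.com:8080" timeout=60 ttl=60 keepalive=On retry=30
    BalancerMember "http://tomcat2.example.com:8080" timeout=60 ttl=60 keepalive=On retry=30
</Proxy>
ProxyPass        "/app" "balancer://mycluster/app"
ProxyPassReverse "/app" "balancer://mycluster/app"

Comparing the BUSY column against netstat output on the load balancer (connections to port 8080 on the Tomcat hosts) should confirm whether the counter tracks open backend connections or requests that never finished.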
OS: Windows Server 2012 Standard
IIS: 8.0.9200.16384
Processor: 4x Xeon 2.67 GHz CPUs
RAM: 40 GB
Problem:
We recently enabled IIS's AutoStart feature, and since doing so the startup time for our application pools has increased considerably. An application pool appears to be running, but it ramps its CPU usage up to the 25% maximum for about 30 minutes, and the websites running in that pool don't respond until this has completed. We have checked the event log and there don't appear to be any faults. We have checked the logging in our preload function, and it appears to take only about 60-90 seconds.
How can we diagnose what is causing the delay in the application pools starting up?
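One way to see what the worker processes are doing during that warm-up window is IIS's appcmd tooling (paths assume a default IIS install; the elapsed-time threshold below is illustrative):

rem Which worker processes are running, and for which application pool
%windir%\system32\inetsrv\appcmd list wp

rem Requests that have been executing for longer than 30 seconds,
rem which shows whether warm-up requests are stuck rather than finished
%windir%\system32\inetsrv\appcmd list requests /elapsed:30000

Watching the "ASP.NET Applications" counters (for example Compilations Total and Requests Executing) in Performance Monitor over the same window can also show whether the CPU is going into site compilation rather than the preload code itself.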
Background:
We are serving multiple copies of the same ASP.NET MVC3 application from multiple application pools (20 sites per pool), approximately 8 pools serving 160 sites in total. We have an IProcessHostPreloadClient implementation that preloads some settings from the database when a site starts up. We have a second server with the same basic specs but only 3 pools of 20, and it takes only about 5 minutes per pool to start up.
For anyone interested, here is what we did to resolve/mitigate the issue:
Break the sites into smaller groups per application pool (this reduces the startup time of each pool). We went with 10 sites per pool.
Switch to the IIS 8 Application Initialization 'preloadEnabled' option rather than the serviceAutoStartProvider for site initialization (see the config sketch below).
When deploying new code, don't restart the application pools; instead, use the app_offline.htm feature to unload the application and restart it.
The app_offline.htm feature is the key one for us: it means we can deploy new versions of our software without stopping and starting the application pools and incurring the startup time penalty. Restarting the application pools incrementally also reduced the strain on the CPU, which gave us a consistent startup time for each pool; that is only needed when we do an IIS reset or a server restart (rarely).
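For reference, the Application Initialization preload described above is configured roughly like this; the pool name, site name, and warm-up URL are placeholders:

<!-- applicationHost.config: keep the pool running and preload the application -->
<applicationPools>
    <add name="ExamplePool" startMode="AlwaysRunning" />
</applicationPools>
<sites>
    <site name="ExampleSite">
        <application path="/" applicationPool="ExamplePool" preloadEnabled="true" />
    </site>
</sites>

<!-- web.config: optional warm-up request issued as the application starts -->
<system.webServer>
    <applicationInitialization>
        <add initializationPage="/warmup" />
    </applicationInitialization>
</system.webServer>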
I have Apache + mod_php installed on Windows, but I can't relate its process settings to the ones used on Linux.
It only has this:
# ThreadsPerChild: constant number of worker threads in the server process
# MaxRequestsPerChild: maximum number of requests a server process serves
ThreadsPerChild 250
MaxRequestsPerChild 0
regarding child processes.
httpd.exe only takes 12 MB of RAM, and if I run an ab test against a script that only calls sleep(10), with 30 concurrent connections, usage only rises to 30 MB and it handles all of them at once! I did the same on my Ubuntu VPS, also with mod_php, and to serve 30 concurrent connections it had to start 30 server processes; the VPS basically crashed because RAM usage went over 200 MB for the Apache processes alone. So the question is: why is so little RAM used on Windows?
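For comparison, the section I can't find on Windows is the Linux prefork MPM block, where each concurrent connection is served by a full process (and each process carries its own copy of mod_php). Something like this, using Apache 2.4 directive names and illustrative values (2.2 calls them MaxClients / MaxRequestsPerChild):

# Linux prefork MPM: one child process per concurrent connection
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers      150
    MaxConnectionsPerChild   0
</IfModule>

On Windows, the mpm_winnt MPM instead runs a single child process with ThreadsPerChild worker threads, so 30 sleeping connections cost only threads, not processes.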
I was wondering whether someone can shed some light on the following issue:
We've been seeing latency spikes for JDBC calls from within a Spring 2.5.6-based web service running on WebSphere 6.1 on AIX, calling into 64-bit Oracle 10.2.0.5.0. The JDBC driver version is 10.2.0.3.0.
We're hitting the database with a single thread. The average response time for the web service is 16 ms, but we're seeing 11 spikes of about 1 second or higher (among roughly 11,000 calls in 5 minutes). Introscope tells us that about half of these spikes are caused by "select 1 from dual" (which the WebSphere connection pool uses to validate the connection).
On the database side, we've traced the sessions created by the WebSphere connection pool, and the traces do not indicate any spikes inside the database.
Any ideas/suggestions on what could be causing these spikes?
EDIT:
Our connection pool is sized at 20 connections, and monitoring shows that only one connection is being used.
EDIT2:
We've upgraded our Oracle JDBC driver to 10.2.0.5 with no difference.
Perhaps it's a pool that's not sized properly.
11,000 calls in 5 minutes (300 seconds) is about 37 calls per second. At an average of 0.016 seconds per call, a single connection can serve roughly 60 calls per second, so a pool of 4-5 connections should handle this traffic with plenty of headroom. I wonder whether, when one of those queries runs a little long, a request ends up waiting for a connection to become available.
The 'SELECT 1 FROM DUAL' query is what the pool executes to check whether a connection is live and usable.
You could try increasing the size of the pool, or look at the other parameters that govern how the pool verifies that a connection is live.
The answer to this problem ended up being unrelated to WebSphere or Oracle: it was a good old-fashioned network configuration problem that resulted in TCP retransmission timeouts between the WebSphere server and the Oracle RAC cluster.
To arrive at that diagnosis I compared the output of netstat -p tcp before and after a test run and found that the "retransmit timeouts" counter was increasing. The retransmission timeout algorithm configuration on AIX can be viewed with:
$ no -a
...
rto_high = 64
rto_length = 13
rto_limit = 7
rto_low = 1
This indicates that retransmission timeouts will take between 1 and 64 seconds, backing off increasingly, which explains why we had been seeing spikes of 1 second, 2 seconds, 4 seconds, 10 seconds, and 22 seconds, but nothing in between those peaks (i.e. no 6-second spike).
Once the network config was fixed, the problem went away.
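For anyone who wants to repeat the check, the before/after comparison can be done roughly like this on AIX (the snapshot file names are just illustrative):

# Snapshot TCP statistics before the test run, then again afterwards
netstat -p tcp > tcp_before.txt
# ... run the load test ...
netstat -p tcp > tcp_after.txt

# Compare the "retransmit timeouts" counter between the two snapshots
grep -i "retransmit timeouts" tcp_before.txt tcp_after.txt

# View the current retransmission timeout tuning (rto_low, rto_high, etc.)
no -a | grep rto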
Does switching off "Pretest new connections" help?