Server 2008 R2 slowness, GUI lag, slow file access - performance

I have a box with Windows Server 2008 R2 on it that is running extremely slowly.
GUI lag is so bad that it can take several minutes just to open an Explorer window, for example. Explorer crashes. File access is very slow for connected users. Etc.
Specs:
RAM: 16GB
CPU: Intel Xeon
Disks:
C: 1TB (This has the server OS as well as SQL Server.)
D: 1TB (This is two 1TB disks in RAID 1 holding user data.)
Roles:
AD, GP, DHCP, DNS
What I've checked/noticed:
CPU utilization never really gets above 30%. Average over the space of a day is roughly 2%.
RAM is at 55% utilization. Most all of that of course being used by SQL Server.
Network utilization averages <2%
Disk stats for 5 minutes perfmon while server is symptomatic:
C: Avg. Idle %: ~99%
D: Avg. Idle %: ~99%
C: Avg. Disk Queue Length: Between 0 and 0.015.
D: Avg. Disk Queue Length: Between 0 and 0.023.
In Resource Monitor, I cannot pull a list of processes that are running. Nor can I see any disks. It's all just blank white boxes no matter how long I give it to load the data.
Rebooting the box makes the machine run perfectly. It's always when I come in in the morning that it's slowed back down.
Checking the event log doesn't show any obvious tells as to what's causing this. The only thing I spotted was the classic "The winlogon notification subscriber took 464 second(s) to handle the notification event (Logon)." error. But that really only serves to let me know that the server is in fact running slow.
I'm dying over here trying to figure out what's causing this. Any ideas or help would be most appreciated.

Despite just about every metric telling me the disks were running fine, replacing the disks ultimately fixed the issue. I guess I'll never know why they were behaving like that.

I had the same issue.
.. and I checked my settings by running this command
netsh interface tcp show global
Also if you just want to test by disabling it run this
netsh interface tcp set global autotuning=disabled
to set it back run this
netsh interface tcp set global autotuning=normal
Worth a try, but you could/should reboot the server after making the change.
... and when I only disabled it, my server performance was fixed.
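
For anyone who wants to record the current setting before changing it, here is a minimal sketch that pulls the auto-tuning level out of the text printed by netsh interface tcp show global. It assumes the output contains a line of the form "Receive Window Auto-Tuning Level : normal"; the sample text and the function name autotuning_level are illustrative only, not captured from a real server.

```python
def autotuning_level(netsh_output):
    """Return the auto-tuning level found in the given text, or None if absent."""
    for line in netsh_output.splitlines():
        key, sep, value = line.partition(":")
        # Assumed line layout: "<setting name>   : <value>"
        if sep and "auto-tuning level" in key.lower():
            return value.strip().lower()
    return None


# Hypothetical sample in the assumed layout; real output has more lines.
sample = """
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Receive Window Auto-Tuning Level    : normal
"""

print(autotuning_level(sample))
```

On the server itself, the text could be captured with subprocess.run(["netsh", "interface", "tcp", "show", "global"], capture_output=True, text=True).stdout and passed to the same function.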

Elasticsearch speed vs. Cloud (localhost to production)

I have a single-node ELK stack running in a Vagrant VirtualBox VM on my machine. It has 3 indexes, which are 90 MB, 3.6 GB, and 38 GB.
I also have a JavaScript application running on the host machine that consumes data from Elasticsearch. Locally it runs with no problems, and the speed is perfect.
The issue comes when I put my JavaScript application in production, as the Elasticsearch endpoint in the application has to change from localhost:9200 to MyDomainName.com:9200. The application is fast within the company, but when I access it from home it slows down drastically and often crashes. However, when I open Kibana from home, running queries there is fine.
The company is using BT broadband with 60 Mbps download and 20 Mbps upload. It doesn't have a fixed IP, so I have to update the A record manually whenever the IP changes, but I don't think that is relevant to the problem.
Is the internet speed the main thing slowing down loading outside of the company? How do I improve this? Is the cloud (a CDN?) the only option that would make things run faster? If so, how much would it cost to host it in the cloud, assuming I index a lot of documents initially but at most 10 MB per day after that?
UPDATE1: Metrics from sending a request from Home using Chrome > Network
Queued at 32.77s
Started at 32.77s
Resource Scheduling
- Queueing 0.37 ms
Connection Start
- Stalled 38.32s
- DNS Lookup 0.22ms
- Initial Connection
Request/Response
- Request sent 48 μs
- Waiting (TTFB) 436.61 ms
- Content Download 0.58 ms
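
To put the timings above in proportion, here is a small sketch that converts each reported phase to milliseconds and computes its share of the total. The dictionary is simply the numbers from the list above ("Initial Connection" is left out because no value was reported for it); the function name phase_shares is made up for this example.

```python
# Phase durations from the Chrome timing list, all converted to milliseconds.
phases_ms = {
    "Queueing": 0.37,
    "Stalled": 38.32 * 1000,   # reported as 38.32 s
    "DNS Lookup": 0.22,
    "Request sent": 0.048,     # reported as 48 microseconds
    "Waiting (TTFB)": 436.61,
    "Content Download": 0.58,
}


def phase_shares(phases):
    """Return each phase's percentage of the summed duration."""
    total = sum(phases.values())
    return {name: ms / total * 100 for name, ms in phases.items()}


shares = phase_shares(phases_ms)
for name, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:18s} {pct:6.2f}%")
```

Almost all of the time (roughly 99%) is in the Stalled phase, before the request is even sent, while the server's own response time (TTFB plus download) is under half a second.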
UPDATE2:
The stalled period seems to be much shorter when I use a VPN.

VMware ESXi, RHEL, LUKS and network latency

My company is running into a network performance problem that seemingly has all of the "experts" we're working with (VMware support, RHEL support, our managed services hosting provider) stumped.
The issue is that network latency between our VMs (even VMs residing on the same physical host) increases--up to 100x or more!--with network throughput. For example, without any network load, latency (measured by ping) might be ~0.1ms. Start transferring a couple 100MB files, and latency grows to 1ms. Initiate a bunch (~20 or so) concurrent data transfers between two VMs, and the latency between the VMs can increase to upwards of 10ms.
This is a huge problem for us because we have application server VMs hosting processes that might issue 1 million or so queries against a database server (different VM) per hour. Adding a millisecond or two to each query therefore increases our runtime substantially--sometimes doubling or tripling our expected durations.
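
As a rough illustration of why a millisecond matters at this volume, the sketch below multiplies the per-query latency increase by the query count. It assumes the queries are issued one after another, so the added latency accumulates directly; with concurrent queries the wall-clock impact would be smaller. The function name added_seconds_per_hour is only for this example.

```python
def added_seconds_per_hour(queries_per_hour, added_latency_ms):
    """Total extra waiting time, in seconds, caused by extra latency per query."""
    return queries_per_hour * added_latency_ms / 1000.0


# About 1 million queries per hour, with 1 ms or 2 ms of added latency each.
for added_ms in (1, 2):
    extra = added_seconds_per_hour(1_000_000, added_ms)
    print(f"{added_ms} ms extra per query -> {extra / 60:.1f} extra minutes")
```

Under that assumption, 1 ms of extra latency adds about 17 minutes to an hour's worth of queries, and 2 ms adds about 33 minutes.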
We've got what I would think is a pretty standard environment:
ESXi 6.0u2
4 Dell M620 blades with 2x Xeon E5-2650v2 processors and 128GB RAM
SolidFire SAN
And our base VM configuration consists of:
RHEL7, minimal install
Multiple LUNs configured for mount points at /boot, /, /var/log, /var/log/audit, /home, /tmp and swap
All partitions except /boot encrypted with LUKS (over LVM)
Our database server VMs are running Postgres 9.4.
We've already tried the following:
Change the virtual NIC from VMXNET3 to e1000 and back
Adjust RHEL ethernet stack settings
Using ESXi's "low latency" option for the VMs
Upgrading our hosts and vCenter from ESXi 5.5 to 6.0u2
Creating bare-bones VMs (setup as above with LUKS, etc., but without any of our production services on them) for testing
Moving the datastore from the SSD SolidFire SAN to local (on-blade) spinning storage
None of these improved network latency. The only test that showed expected (non-deteriorating) latency is when we set up a second pair of bare-bones VMs without LUKS encryption. Unfortunately, we need fully encrypted partitions (for which we manage the keys) because we are dealing with regulated, sensitive data.
I don't see how LUKS--in and of itself--can be to blame here. Rather, I suspect that LUKS running with some combination of ESX, our hosting hardware, and/or our VM hardware configuration is to blame.
I performed a test in a much wimpier environment (MacBook Pro, i5, 8GB RAM, VMware Fusion 6.0, CentOS 7 VMs configured similarly with LUKS on LVM and the same testing scripts) and was unable to reproduce the latency issue. Regardless of how much network traffic I sent between the VMs, latency remained steady at about 0.4ms. And this was on a laptop with a ton of other things going on!
Any pointers/tips/solutions will be greatly appreciated!

After much scrutiny and comparing the non-performing VMs against the performant VMs, we identified the issue as a bad selection for the advanced "Latency Sensitivity" setting.
For our poorly performing VMs, this was set to "Low". After changing the setting to "Normal" and restarting the VMs, latency dropped by ~100x and throughput (which we hadn't originally noticed was also a problem) increased by ~250x!

Postgres constant 30% CPU usage

I recently migrated my Postgres database from Windows to CentOS 6.7.
On Windows the database never used much CPU, but on Linux I see it using a constant ~30% CPU (according to top). The machine has 4 cores.
Does anyone know if this is normal, or why it would be doing this?
The application seems to run fine, and as fast as or faster than on Windows.
Note that it is a big installation: 100 GB+ of data across 1000+ databases.
I tried using pgAdmin to monitor the server status, but the server status window hangs and fails to run with the error "the log_filename parameter must be equal"

With 1000 databases, I expect the autovacuum workers and the stats collector to spend a lot of time checking what needs maintenance.
I suggest you do two things:
raise the autovacuum_naptime parameter to reduce the frequency of checks
put the stats_temp_directory on a ramdisk
You probably also set a high max_connections limit to let your clients use that many databases. This is another probable source of CPU load, because of the high number of 'slots' that must be checked every time a backend has to synchronize with the others.

There could be multiple reasons for the increased server load.
If you are looking for query-level load on the server, you should match a specific Postgres backend ID to a system process ID using the pg_stat_activity system view.
SELECT pid, datname, usename, query FROM pg_stat_activity;
Once you know what queries are running you can investigate further (EXPLAIN/EXPLAIN ANALYZE; check locks, etc.)
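
The matching step can be sketched as follows: the pid column of pg_stat_activity is the operating-system process ID, so it can be joined against the PIDs that top reports. The rows below are made-up sample data, and busiest_backends is a hypothetical helper; in practice the activity rows would come from the query above and the CPU figures from top.

```python
def busiest_backends(top_rows, activity_rows, limit=3):
    """Pair (pid, cpu_percent) tuples from top with pg_stat_activity rows by pid.

    Returns up to `limit` (cpu_percent, activity_row) pairs, highest CPU first.
    PIDs that are not Postgres backends are ignored.
    """
    by_pid = {row["pid"]: row for row in activity_rows}
    matched = [(cpu, by_pid[pid]) for pid, cpu in top_rows if pid in by_pid]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return matched[:limit]


# Hypothetical sample data, for illustration only.
top_rows = [(4121, 22.5), (977, 1.0), (4130, 6.3)]
activity_rows = [
    {"pid": 4121, "datname": "db_a", "usename": "app", "query": "SELECT 1"},
    {"pid": 4130, "datname": "db_b", "usename": "app", "query": "SELECT 2"},
]

for cpu, row in busiest_backends(top_rows, activity_rows):
    print(f"{cpu:5.1f}%  pid={row['pid']}  db={row['datname']}  {row['query']}")
```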
You may have lock contention issues, probably due to a very high max_connections. If so, consider lowering max_connections and using a connection pooler, although that can increase turnaround time for client connections.
It might be that Windows was blocking connections and not letting them use the system, and Linux is now allowing the connections to use the CPU and perform faster. :P
Also worth reading:
How to monitor PostgreSQL
Monitoring CPU and memory usage from Postgres

What can interfere with testing a server's performance?

My HTTP server can't take load tests... It gives really high latency when multiple connections are made.
Server Configuration:
5 instances of (CPU 0.5vCore, Memory 512MB, Disk 20GB)
A load balancer
10G shared bandwidth
When I transfer a 3.5 MB zip file, it takes about 1 second when there is only one connection. However, when over 30 connections are made, it goes up to 20 to 50 seconds.
I am testing with JMeter on my laptop. Is there a possibility that my testing environment interferes with the load-testing?
If so, what would be a solution to improve my testing environment?

First of all, you need to monitor and pin down the problem(s).
Start by collecting data on these four areas:
CPU Usage
Memory Usage
Network Usage
I/O Usage
Measure all of them at the OS level. (Monitoring tools will vary depending on your OS.)
Once you have this data and can narrow down the problem (CPU bound, network latency, I/O latency, or whatever), an answer should emerge. If this is the first time you are testing your app, doing this will also give you scaling information about your environment and your application in general.
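
On the network side specifically, a quick back-of-the-envelope check shows whether the load-generating laptop's own link could explain the numbers in the question. The sketch below computes the minimum time needed to pull 30 copies of a 3.5 MB file through a single client link. The 50 Mbit/s and 1000 Mbit/s link speeds are assumptions chosen for illustration (the question does not say how the laptop is connected), and protocol overhead is ignored.

```python
def min_transfer_seconds(connections, file_mb, link_mbit_per_s):
    """Lower bound on the time to download `connections` files over one shared link."""
    total_megabits = connections * file_mb * 8  # megabytes -> megabits
    return total_megabits / link_mbit_per_s


# Hypothetical client link speeds, in Mbit/s.
for link in (50, 1000):
    seconds = min_transfer_seconds(30, 3.5, link)
    print(f"{link:5d} Mbit/s link: at least {seconds:.2f} s for 30 downloads")
```

If the laptop's link is in the tens of Mbit/s, a large part of the 20 to 50 seconds could come from the test client rather than the server, which would be a reason to generate load from a machine closer to the server.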

What is the named.exe process and how can I stop it from consuming so much CPU?

I have a Windows Server 2008 machine with Plesk running two web sites.
Sometimes the server slows down, and there is a named.exe process pushing the CPU to 100%.
It lasts for a short period of time, and after a while it happens again.
I would like to know what this process is for and how to configure it so that it doesn't consume this much CPU and slow my sites down.

This must be the DNS service, also known as BIND. High CPU usage may indicate one of the following:
DNS is re-reading its configuration. In this case, the high CPU usage should coincide with your activity in Plesk, i.e. adding and removing domains.
Someone (normally another DNS server) is pulling data from your DNS server. This is a normal process. Since you say it only lasts for a short period of time, it doesn't look like a DNS DDoS.
AFAIK there is no default way in Windows to restrict software from taking 100% CPU if no other apps require CPU at the moment.
Check the system for "DNS Treewalk Suite", stop the process, and run an antivirus scan.
Check the system error log.
