High load average with NO CPU usage - linux-kernel

I have a webserver running on a VM, and we often see high server load whose root cause has not yet been found.
So far I have dug into disk activity, memory usage, swap usage and CPU usage during the incidents, and found the following.
High pgscand/s, pgsteal/s and %vmeff values show up during the load; %vmeff hits 100%. The relevant sar -B output:
             pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:00:01 PM      1.06     65.63    662.44      0.00   1790.95      0.00      0.00      0.00      0.00
03:10:01 PM      4.39     56.46   1744.16      0.00   2370.19      0.00    221.60    221.60    100.00
The server runs Red Hat 7.5. The webserver is Apache, and there are no outside hits to the webserver that could cause any load. There are also no cron jobs running.
I would like to know the root cause of the issue.
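For reference, the counters behind sar's pgscand/s and pgsteal/s come from /proc/vmstat (a non-zero pgscand/s means processes are doing direct reclaim, i.e. the box is under memory pressure even when CPU usage looks low). Below is a minimal Go sketch that dumps the raw counters between sar samples; the exact field names vary across kernel versions, so it just matches prefixes:

```go
// Minimal sketch: dump the page-reclaim counters from /proc/vmstat that sar
// aggregates into pgscand/s and pgsteal/s. Field names differ between kernel
// versions, so we only match the pgscan*/pgsteal* prefixes.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/vmstat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "pgscan") || strings.HasPrefix(line, "pgsteal") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```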

Related

Optimizing for update only files of size 4-20kb on restricted NVMe AWS disks (Low bandwidth, low latency)

I'm working on a problem where we can't use an off-the-shelf DB; instead, we have to implement our own.
Problem statement:
A request is made to the server with a blob (4 KB avg, 20 KB max) and a UUID, and we have to store the blob on disk. This sounds like a perfect use case for a key-value blob store, but the catch is that a request with the same UUID will be made again and we'll have to replace the previously written data with the new one, i.e. at any point in time we only care about the latest blob.
A request with the same UUID will be made every 10 seconds.
A SIGKILL may be sent to the server to check the integrity of the files. It's okay to lose the data from requests made in the last 3 seconds; however, we'd like to make this window as small as possible.
The number of writes is far greater than the number of reads.
We want to know the maximum number of UUIDs we can support on the given infrastructure.
Infrastructure:
We have a c6gd.large AWS instance, which comes with 4 GB RAM, 2 CPUs and a 120 GB NVMe drive. The problem with the NVMe drive is that its bandwidth is restricted by AWS (source); however, we do get low latency (within 150 microseconds).
As shown below, we get more bandwidth when writing with a block size (bs) of 5 KB than with 1 GB. Also, creating a new file is faster than overwriting an existing one.
What I've tried so far:
Because of the above benchmarks, for each request I create/update a file named after the UUID and write the blob to it.
I've tried both the xfs and ext4 filesystems; ext4 performs a bit better, giving around 8.5k requests/second over a 2-hour test. That means we can support 85k probes, as each probe sends a request only once every 10 seconds.
I've tested using wrk and noticed that average CPU usage is around 70-80% and RAM usage is around 3 GB (out of 4 GB).
I've mounted the disk as ext4 with these mount options: rw,seclabel,noatime,nodiratime,journal_async_commit,nobarrier.
NOTE: I've benchmarked the HTTP server alone and it supports 100k req/sec, so it won't be the bottleneck.
I've used Go for it, and this is how I'm writing to the file:
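(The original snippet isn't reproduced here; below is a minimal sketch of what such a write path can look like, assuming a write-to-temp-file-then-rename approach with optional fsync. Names and paths are illustrative, not taken from the original code.)

```go
// Sketch of a crash-safe per-UUID write path: write to a temp file, optionally
// sync, then rename over the final file so readers see either the old or the
// new blob, never a partial one.
package main

import (
	"os"
	"path/filepath"
)

func writeBlob(dir, uuid string, data []byte, sync bool) error {
	final := filepath.Join(dir, uuid)
	tmp := final + ".tmp"

	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if sync {
		// Sync calls fsync; this bounds how much data a SIGKILL or power loss can lose.
		if err := f.Sync(); err != nil {
			f.Close()
			return err
		}
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Atomic replace of the per-UUID file.
	return os.Rename(tmp, final)
}

func main() {
	dir := "/tmp/blobs" // illustrative location
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	if err := writeBlob(dir, "example-uuid", []byte("payload"), true); err != nil {
		panic(err)
	}
}
```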
I've read about the RocksDB architecture, and LSM trees look promising; however, I'm of the opinion that the compaction process won't give us a huge benefit, given that bandwidth is only around 100 MB/s. Am I wrong to think that?
Another question on my mind: when there are, say, 1000 writes in a batch going to disk, I'm assuming the filesystem (ext4) journaling will sort these operations. Is this assumption correct? If not, is there a way to enforce it? Also, can I batch these write requests to be processed every, say, 100 ms? (A sketch of one way to do that follows below.)
Are there any other ideas I can try?
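One way to do the 100 ms batching asked about above, sketched with a Go channel and a ticker. The types and the flushBatch placeholder are illustrative, not taken from the project:

```go
// Sketch: collect incoming blobs on a channel and flush the latest blob per
// UUID every interval; later writes for the same UUID overwrite earlier ones,
// so a batch never contains stale duplicates.
package main

import (
	"fmt"
	"time"
)

type blob struct {
	uuid string
	data []byte
}

func flushBatch(batch map[string][]byte) {
	// Placeholder: persist each UUID's latest blob here (file write, bulk upsert, ...).
	fmt.Printf("flushing %d blobs\n", len(batch))
}

func batcher(in <-chan blob, interval time.Duration) {
	pending := make(map[string][]byte)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case b, ok := <-in:
			if !ok {
				flushBatch(pending)
				return
			}
			pending[b.uuid] = b.data // keep only the latest blob per UUID
		case <-ticker.C:
			if len(pending) > 0 {
				flushBatch(pending)
				pending = make(map[string][]byte)
			}
		}
	}
}

func main() {
	in := make(chan blob, 1024)
	go batcher(in, 100*time.Millisecond)

	// HTTP handlers would send into `in` instead of writing to disk directly.
	in <- blob{uuid: "example-uuid", data: []byte("payload")}
	time.Sleep(200 * time.Millisecond)
	close(in)
	time.Sleep(50 * time.Millisecond)
}
```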
Ideas:
buffer requests (go channel, unix datagram socket, redis, ...)
bulk upsert every 100 ms
Tried:
Badger DB & go channel
RocksDB & unix datagram socket
SQLite3 & unix datagram socket
Not tried:
RabbitMQ
Kafka
NATS
PostgreSQL
Redis
front        | communication        | db            | result
net/http(go) | go channel           | badger(go)    | killed(OOM)
net/http(go) | unix datagram socket | rocksdb(rust) | many timeouts
net/http(go) | unix datagram socket | sqlite3(rust) | 18,000 rps
Benchmark result
Latency distribution (ms) and requests per second:

 50% |  75% |  90% |  99% | rps  | description
 0.6 |  0.7 |  1.2 |  9.9 | 176k | 1. create files(direct)
 2.4 |  4.5 |  7.1 |   14 |  54k | 2. create files(indirect)
 6.9 |   13 |   19 |   71 |  18k | 3. sqlite3
  10 |   18 |   24 |   46 |  13k | 4. upsert files(dirty)
 339 |  350 |  667 | 1320 | 0.4k | 5. upsert files(fdatasync)

connections: 128
threads: 128
duration: 60 seconds
1. create files(direct)
wrk -> GET request
GET request -> actix web server
create empty files
2. create files(indirect)
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> create empty files
3. sqlite3
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> channel(buffer=128)
channel -> sqlite3(bulk upserts, every 100 ms)
4. upsert files(dirty)
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> ext4(without fdatasync)
5. upsert files(fdatasync)
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> ext4(temp file)
fdatasync
rename
local pc info
storage
cat /etc/fstab
LABEL=cloudimg-rootfs / ext4 defaults 0 1
cpuinfo
fgrep 'model name' /proc/cpuinfo|cat -n
1 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
3 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
4 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
5 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
6 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
7 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
8 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
9 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
10 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
11 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
12 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
dd
% dd if=/dev/zero of=./zeros.dat bs=4096 count=262144
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.67881 s, 401 MB/s
% dd if=/dev/zero of=./zeros.dat bs=4096 count=262144 conv=fdatasync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.02852 s, 1.0 GB/s
% dd if=/dev/zero of=./zeros.dat bs=4096 count=262144 conv=fsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.875749 s, 1.2 GB/s

What is the Faults column in 'top'?

I'm trying to download Xcode (on El Capitan) and it seems to be stuck. When I run 'top', I see a process called 'storedownloadd', and the "STATE" column alternates between sleeping, stuck, and running. The 'FAULTS' column shows a quickly increasing number with a plus sign after it; it is now over 400,000 and still climbing. Other than in 'top', I see no sign of download activity. Does this indicate that something is amiss? Here's a screen shot:
Processes: 203 total, 2 running, 10 stuck, 191 sleeping, 795 threads 11:48:14
Load Avg: 4.72, 3.24, 1.69 CPU usage: 56.54% user, 6.41% sys, 37.3% idle SharedLibs: 139M resident, 19M data, 20M linkedit. MemRegions: 18620 total, 880M resident, 92M private, 255M shared. PhysMem: 7812M used (922M wired), 376M unused.
VM: 564G vsize, 528M framework vsize, 0(0) swapins, 512(0) swapouts. Networks: packets: 122536/172M in, 27316/2246K out. Disks: 78844/6532M read, 240500/6746M written.
PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP PPID STATE BOOSTS %CPU_ME %CPU_OTHRS UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH
354 storedownloadd 0.3 00:47.58 16 5 200 255M 0B 0B 354 1 sleeping *3[1] 155.53838 0.00000 501 412506+ 54329 359852+ 6620+ 2400843+ 1186426+
57 UserEventAgent 0.0 00:00.35 22 17 378 4524K+ 0B 0B 57 1 sleeping *0[1] 0.23093 0.00000 0 7359+ 235 15403+ 7655+ 24224+ 17770
384 Terminal 3.3 00:12.02 10 4 213 34M+ 12K 0B 384 1 sleeping *0[42] 0.11292 0.04335 501 73189+ 482 31076+ 9091+ 1138809+ 72076+
When top reports back FAULTS it's referring to "page faults", which are more specifically:
The number of major page faults that have occurred for a task. A page fault occurs when a process attempts to read from or write to a virtual page that is not currently present in its address space. A major page fault is when disk access is involved in making that page available.
If an application tries to access an address on a memory page that is not currently in physical RAM, a page fault occurs. When that happens, the virtual memory system invokes a special page-fault handler to respond to the fault immediately. The page-fault handler stops the code from executing, locates a free page of physical memory, loads the page containing the data needed from disk, updates the page table, and finally returns control to the program — which can then access the memory address normally. This process is known as paging.
Minor page faults can be common depending on the code that is attempting to execute and the current memory availability on the system; however, there are also different levels to be aware of (minor, major, invalid), which are described in more detail at the links below.
↳ Apple : About The Virtual Memory System
↳ Wikipedia : Page Fault
↳ Stackoverflow.com : page-fault
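To watch these counters for a process you control, getrusage(2) exposes the minor and major fault counts. A small sketch in Go (the Rusage field names come from the standard syscall package; the allocation size is arbitrary):

```go
// Minimal sketch: read this process's own minor/major fault counters via
// getrusage(2), then touch freshly allocated memory to generate faults.
package main

import (
	"fmt"
	"syscall"
)

func report(label string) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	fmt.Printf("%s: minor faults=%d major faults=%d\n", label, ru.Minflt, ru.Majflt)
}

func main() {
	report("before")

	// Touch 64 MiB page by page; each first touch is at most a minor fault.
	buf := make([]byte, 64<<20)
	for i := 0; i < len(buf); i += 4096 {
		buf[i] = 1
	}

	report("after touching 64 MiB")
}
```

Touching freshly allocated memory produces minor faults; major faults only appear when the data has to come from disk, e.g. a file-backed mapping or swapped-out pages.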

Any way to improve time spent in `Kernel#require`?

I'm developing a simple desktop GUI (Swing) app in JRuby 9.0.0.0-pre2 (the latest rvm version), which is a step up from 1.7.19. It uses mechanize to access a corporate website and upload a file. The app has a JFrame with 2 images (a few KB each), a JButton, and a bit of text, yet it takes about 8 seconds to load the window. These loading times are unacceptable.
The builtin profiler jruby --profile script.rb shows this:
total self children calls method
5.36 0.03 5.32 806 Kernel.require
4.54 0.00 4.53 371 Kernel.require
4.53 0.00 4.53 8 Kernel.require_relative
1.28 0.08 1.20 2691 Array#each
1.20 0.10 1.11 35 Kernel.load
Besides Array#each, all of these are Kernel methods. Is this what Aaron Patterson was talking about at RailsConf 2015? Or is this specific to the JRuby implementation? Can I speed this up? The JVM client mode didn't help, and I'm not sure I could turn it on once I package the app into a jar with Warbler, even if it did help.

Performance Issue on Nginx Static File Serving 10Gbps Server

I'm using Nginx to serve static files on dedicated servers.
The server hosts no website; it is only a file download server. File sizes range from MBs to GBs.
Previously I had 8 dedicated servers with 500 Mbps at unmetered.com. Each of them performed great.
I decided to buy a 10 Gbps server from FDCServers, because one server is easier to manage than multiple servers.
Below are the specs of the server:
Dual Xeon E5-2640 (15M Cache, 2.50 GHz, 7.20 GT/s Intel® QPI) - 24 Cores
128 GB RAM
10 Gbit/s Network Unmetered
Ubuntu 14.04 LTS
1.5 TB SATA
But my new giant server is not giving more than 500 to 600 Mbps. I installed nload to monitor traffic and upload/download speed, and it reports almost the same numbers as the previous unmetered.com servers.
Then I thought it might be due to the read-rate limitation of the SATA hard disk, so I purchased and installed 3 x 240 GB SSD drives in the new server.
I moved a file onto an SSD drive and downloaded it for testing. The speed is still not good: I'm getting only 250 to 300 Kbps, whereas it should give me at least 2 Mbps (which is the per-IP speed limit I set in the Nginx configuration files).
I then searched for gigabit Ethernet tuning settings and found a couple of sysctl settings that need to be tuned for a 10 Gbps network.
http://www.nas.nasa.gov/hecc/support/kb/Optional-Advanced-Tuning-for-Linux_138.html
I implemented them, but the throughput is still the same as on my previous 500 Mbps servers.
Can you please help me improve the network throughput of this server? I asked the FDCServers support team and they confirmed that their servers can easily give 3 to 5 Gbps, but they can't help me tune it.
After all the tuning and settings I'm getting only 700 Mbit at most.
Let me know if you need more details.
Test memory performance (for DDR3 1333 MHz PC10600):
$ dd if=/dev/zero bs=1024k count=512 > /dev/null
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 0.0444859 s, 12.1 GB/s
Test disk I/O:
$ pv ./100MB.bin > /dev/null
100MiB 0:00:00 [3.36GiB/s] [=================================================================================================================================================================================>] 100%
Test CPU speed with the help of a pipe:
$ dd if=/dev/zero bs=1024k count=512 2> /dev/null| pv > /dev/null
512MiB 0:00:00 [2.24GiB/s] [ <=> ]
The nginx download speed from localhost should be ~1.5-2 GB/s. Checking:
$ wget -O /dev/null http://127.0.0.1/100MB.bin
--2014-12-10 09:08:57-- http://127.0.0.1:8080/100MB.bin
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: ‘/dev/null’
100%[=======================================================================================================================================================================================>] 104,857,600 --.-K/s in 0.06s
2014-12-10 09:08:57 (1.63 GB/s) - ‘/dev/null’ saved [104857600/104857600]
Check this solution.
Remove these lines:
output_buffers 1 512k;
aio on;
directio 512;
and change
sendfile off;
tcp_nopush off;
tcp_nodelay off;
to
sendfile on;
tcp_nopush on;
tcp_nodelay on;
Good luck.
I think you need to split the issues and test independently to determine the real problem - it's no use guessing it's the disk and spending hundreds, or thousands, on new disks if it is the network. You have too many variables to just change randomly - you need to divide and conquer.
1) To test the disks, use a disk performance tool or good old dd to measure throughput in bytes/sec and latency in milliseconds. Read data blocks from disk and write to /dev/null to test read speed. Read data blocks from /dev/zero and write to disk to test write speed - if necessary.
Are your disks RAIDed by the way? And split over how many controllers?
2) To test the network, use nc (a.k.a. netcat) and thrash the network to see what throughput and latency you measure. Read data blocks from /dev/zero and send across network with nc. Read data blocks from the network and discard to /dev/null for testing in the other direction.
3) To test your nginx server, put some static files on a RAMdisk and then you will be independent of the physical disks.
Only then will you know what needs tuning...
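If nc isn't handy, the same thrash-the-network idea can be sketched in a few lines of Go: one side discards everything it receives, the other pushes zeros and reports the achieved throughput. The port (9000), transfer size and buffer size below are arbitrary choices, not values from this setup:

```go
// Minimal stand-in for the nc test: "sink" mode discards whatever it receives,
// sender mode pushes zeros to the sink and reports achieved throughput.
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: sink | <server-ip>")
		os.Exit(1)
	}

	if os.Args[1] == "sink" {
		ln, err := net.Listen("tcp", ":9000")
		if err != nil {
			panic(err)
		}
		conn, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		n, _ := io.Copy(io.Discard, conn) // discard everything, like > /dev/null
		fmt.Printf("received %d bytes\n", n)
		return
	}

	conn, err := net.Dial("tcp", os.Args[1]+":9000")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 1<<20) // 1 MiB of zeros per write
	const total = 10 << 30     // push 10 GiB in total
	start := time.Now()
	var sent int64
	for sent < total {
		n, err := conn.Write(buf)
		if err != nil {
			panic(err)
		}
		sent += int64(n)
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("sent %.1f GiB in %.1fs = %.2f Gbit/s\n",
		float64(sent)/(1<<30), secs, float64(sent)*8/secs/1e9)
}
```

Run it with `sink` as the argument on the server and with the server's IP from another machine; if this also tops out near 500-600 Mbps, the bottleneck is the network path rather than nginx or the disks.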

Is there a way to determine the filetype and read the file into a buffer in one step?

Mahoro is a libmagic wrapper. Right now my process for reading in a file is:
filetype = Mahoro.new.file(full_path)
File.open(full_path, get_access_string(filetype)) do |f|
  # ... read and parse the file ...
end
The problem is that Mahoro seems to read the entire file, and not just the header strings. So I get a profiling result like:
%self total self wait child calls name
6.02 0.26 0.26 0.00 0.00 1 Mahoro#file
5.81 4.36 0.25 0.00 4.11 1 Parser#read_from_file
Each is taking 0.25 seconds, which implies that they are duplicating each other's work. Is there a way to get the file contents as a string from libmagic? That seems to be the only way to make this process more efficient.
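This isn't Mahoro's API, but the underlying idea (read the file once, then detect the type from the in-memory bytes rather than letting the magic library re-read the file from disk) looks like this in Go, where http.DetectContentType sniffs only the first 512 bytes of the buffer. The path is hypothetical:

```go
// Minimal sketch of the "read once" idea: load the file into memory a single
// time, sniff the type from the buffer, then parse from the same buffer.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	data, err := os.ReadFile("example.bin") // hypothetical path
	if err != nil {
		panic(err)
	}

	filetype := http.DetectContentType(data)
	fmt.Println("detected:", filetype)

	// ... hand `data` to the parser here, no second read from disk ...
}
```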
