I'm using Nginx to serve static files on dedicated servers.
The server hosts no website; it is purely a file download server. File sizes range from MBs to GBs.
Previously I had 8 dedicated servers with 500 Mbps each at unmetered.com, and each of them performed great.
So I decided to buy a 10 Gbps server from FDCServers, since one server is easier to manage than many.
Below are the specs of the server:
Dual Xeon E5-2640 (15M Cache, 2.50 GHz, 7.20 GT/s Intel® QPI) - 24 Cores
128 GB RAM
10 Gbit/s Network Unmetered
Ubuntu 14.04 LTS
1.5 TB SATA
But my new giant server is not delivering more than 500 to 600 Mbps. I installed nload to monitor traffic and upload/download speed, and it reports almost the same numbers as the previous unmetered.com servers.
Then I thought it might be due to the read-rate limitation of the SATA hard disk.
So I purchased and installed three 240 GB SSD drives in the new, more powerful server.
I moved a file onto an SSD and downloaded it as a test. The speed is still poor: I'm getting only 250 to 300 Kbps, whereas it should give me at least 2 Mbps, which is the per-IP speed limit I placed in the Nginx configuration files.
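(The actual config isn't shown in the post; a cap of that sort would look roughly like the snippet below. Note that limit_rate is per connection, so a strict per-IP cap would also need limit_conn.)
location /files/ {
    limit_rate 250k;   # 250 KB/s, roughly 2 Mbit/s per connection
}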
I then searched for Gigabit Ethernet tuning settings and found a couple of sysctl settings that need to be tuned for a 10 Gbps network:
http://www.nas.nasa.gov/hecc/support/kb/Optional-Advanced-Tuning-for-Linux_138.html
I implemented them, but the throughput is still the same as on my previous 500 Mbps servers.
Can you please help me improve the network throughput of this server? I asked the FDCServers support team and they confirmed that their servers can easily push 3 to 5 Gbps, but that they can't help me tune mine.
After all this tuning I'm getting 700 Mbit/s at most.
Let me know if you need more details.
Test the memory (DDR3 1333 MHz PC10600):
$ dd if=/dev/zero bs=1024k count=512 > /dev/null
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 0.0444859 s, 12.1 GB/s
Test disk I/O:
$ pv ./100MB.bin > /dev/null
100MiB 0:00:00 [3.36GiB/s] [=================================================================================================================================================================================>] 100%
Test CPU speed through a pipe:
$ dd if=/dev/zero bs=1024k count=512 2> /dev/null| pv > /dev/null
512MiB 0:00:00 [2.24GiB/s] [ <=> ]
The Nginx download speed from localhost should then be ~1.5-2 GB/s. Checking:
$ wget -O /dev/null http://127.0.0.1/100MB.bin
--2014-12-10 09:08:57-- http://127.0.0.1:8080/100MB.bin
Connecting to 127.0.0.1:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: ‘/dev/null’
100%[=======================================================================================================================================================================================>] 104,857,600 --.-K/s in 0.06s
2014-12-10 09:08:57 (1.63 GB/s) - ‘/dev/null’ saved [104857600/104857600]
Check this solution.
Remove these lines:
output_buffers 1 512k;
aio on;
directio 512;
and change
sendfile off;
tcp_nopush off;
tcp_nodelay off;
to
sendfile on;
tcp_nopush on;
tcp_nodelay on;
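Put together, a minimal static-file server with these settings would look something like this (paths are placeholders, not from the original post):
http {
    sendfile on;       # kernel copies file to socket, no userspace buffer
    tcp_nopush on;     # send full packets while sendfile is streaming
    tcp_nodelay on;    # flush the final partial packet immediately
    server {
        listen 80;
        root /var/www/files;   # directory holding the downloads
    }
}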
good luck
I think you need to split the issues and test independently to determine the real problem - it's no use guessing it's the disk and spending hundreds, or thousands, on new disks if it is the network. You have too many variables to just change randomly - you need to divide and conquer.
1) To test the disks, use a disk performance tool or good old dd to measure throughput in bytes/sec and latency in milliseconds: read data blocks from disk and write to /dev/null to test read speed; read data blocks from /dev/zero and write to disk to test write speed, if necessary. (A sketch of all three tests follows after this list.)
Are your disks RAIDed by the way? And split over how many controllers?
2) To test the network, use nc (a.k.a. netcat) and thrash the network to see what throughput and latency you measure: read data blocks from /dev/zero and send them across the network with nc; read data blocks from the network and discard them to /dev/null to test the other direction.
3) To test your nginx server, put some static files on a RAM disk; then you will be independent of the physical disks.
Only then will you know what needs tuning...
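A rough sketch of all three tests (paths, host names and ports are placeholders; the nc flags assume the OpenBSD flavor of netcat):
# 1) Disk: raw read speed, bypassing the page cache...
dd if=/data/100MB.bin of=/dev/null bs=1M iflag=direct
# ...and raw write speed (dd prints MB/s when it finishes)
dd if=/dev/zero of=/data/ddtest.bin bs=1M count=4096 conv=fdatasync
# 2) Network: on the receiver, discard everything that arrives
nc -l 9000 > /dev/null
# ...and on the sender, pump zeros at it; dd's summary line is your throughput
dd if=/dev/zero bs=1M count=10240 | nc receiver.example.com 9000
# 3) Nginx without disks: serve a test file from a tmpfs RAM disk
sudo mount -t tmpfs -o size=2g tmpfs /var/www/ramdisk
cp /data/100MB.bin /var/www/ramdisk/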
Related
We're working on a problem statement where we can't use a DB; instead, we have to implement our own storage.
Problem statement:
A request is made to the server with a blob (4 KB avg, 20 KB max) and a UUID, and the blob is stored on disk. That sounds like a perfect use case for a key-value blob store, but the catch is that a request with the same UUID will be made again and we'll have to replace the previously written data with the new data; at any point in time we only care about the latest blob.
A request with the same UUID will be made every 10 seconds.
The server may be SIGKILLed to check the integrity of the files. It's okay to lose the data from requests made in the last 3 seconds, but we'd like to make this window as small as possible.
The number of writes is far higher than the number of reads.
We want to know the maximum number of UUIDs we can support on the given infrastructure.
Infrastructure:
We have a c6gd.large AWS instance, which comes with 4 GB RAM, 2 CPUs, and 120 GB of NVMe storage. The problem with the NVMe drive is that its bandwidth is restricted by AWS (source); however, we do get low latency (within 150 microseconds).
In my benchmarks, we get more bandwidth when writing with a block size of 5 KB than with 1 GB. Also, creating a new file is faster than overwriting an existing one.
What I've tried so far:
Because of the above benchmarks, for each request I create/update a file named after the UUID and write the blob to it.
I've tried both the xfs and ext4 filesystems; ext4 performs a bit better and gives around 8.5k requests/second over a 2-hour test. That means we can support 85k probes, since each probe sends a request only once every 10 seconds.
Testing with wrk, I noticed that average CPU usage is around 70-80% and RAM usage is around 3 GB (out of 4 GB).
I've mounted the disk with ext4 using the mount options rw,seclabel,noatime,nodiratime,journal_async_commit,nobarrier.
NOTE: I've benchmarked the HTTP server alone and it supports 100k req/sec, so it won't be the bottleneck.
I've used Go for this, and this is how I'm writing to the file.
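(The code itself wasn't reproduced in the post; the following is a minimal sketch of the write path as described, one file per UUID replaced atomically, not the asker's exact code.)
package blobstore

import (
	"os"
	"path/filepath"
)

// writeBlob stores the latest blob for a UUID: write a temp file, sync
// it, then rename it over the per-UUID file so a SIGKILL never leaves a
// half-written blob visible under the final name.
func writeBlob(dir, uuid string, blob []byte) error {
	tmp := filepath.Join(dir, uuid+".tmp")
	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	if _, err = f.Write(blob); err != nil {
		f.Close()
		return err
	}
	if err = f.Sync(); err != nil { // fsync; Go's stdlib has no portable fdatasync
		f.Close()
		return err
	}
	if err = f.Close(); err != nil {
		return err
	}
	// Readers see either the old blob or the new one, never a torn write.
	return os.Rename(tmp, filepath.Join(dir, uuid))
}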
I've read about the RocksDB architecture, and LSM trees look promising; however, I'm of the opinion that the compaction process won't give us a huge benefit given that the bandwidth is only around 100 MB/s. Am I wrong to think that?
Another question on my mind: when there are, say, 1000 writes in a batch headed for disk, I'm assuming the ext4 journaling will sort these operations. Is this assumption correct? If not, is there a way to enforce it? Also, can I batch these write requests to be processed every, say, 100 ms? (See the sketch below.)
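(To illustrate that last idea, here is a hypothetical 100 ms batcher in Go; flush would be whatever bulk write you choose, e.g. one transaction or one fdatasync for the whole group.)
package blobstore

import "time"

type request struct {
	uuid string
	blob []byte
}

// batch collects incoming writes and hands them to flush in groups,
// either every 100 ms or when the input channel is closed.
func batch(in <-chan request, flush func([]request)) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	var pending []request
	for {
		select {
		case r, ok := <-in:
			if !ok { // producer finished: flush the remainder and exit
				if len(pending) > 0 {
					flush(pending)
				}
				return
			}
			pending = append(pending, r)
		case <-ticker.C:
			if len(pending) > 0 {
				flush(pending) // e.g. bulk upsert plus a single fdatasync
				pending = nil
			}
		}
	}
}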
Are there any other ideas I can try?
Ideas:
buffer requests (go channel, unix datagram socket, redis, ...)
bulk upsert every 100 ms

Tried:
badger db & go channel
rocksdb & unix datagram socket
sqlite3 & unix datagram socket

Not tried:
RabbitMQ
Kafka
Nats
PostgreSQL
Redis

front        | communication        | db            | result
net/http(go) | go channel           | badger(go)    | killed (OOM)
net/http(go) | unix datagram socket | rocksdb(rust) | many timeouts
net/http(go) | unix datagram socket | sqlite3(rust) | 18,000 rps
Benchmark results: latency distribution (ms) and requests per second

50%  | 75%  | 90%  | 99%  | rps  | description
0.6  | 0.7  | 1.2  | 9.9  | 176k | 1. create files(direct)
2.4  | 4.5  | 7.1  | 14   | 54k  | 2. create files(indirect)
6.9  | 13   | 19   | 71   | 18k  | 3. sqlite3
10   | 18   | 24   | 46   | 13k  | 4. upsert files(dirty)
339  | 350  | 667  | 1320 | 0.4k | 5. upsert files(fdatasync)
connections: 128
threads: 128
duration: 60 seconds
1. create files(direct)
wrk -> GET request
GET request -> actix web server
create empty files
2. create files(indirect)
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> create empty files
3. sqlite3
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> channel(buffer=128)
channel -> sqlite3(bulk upserts, every 100 ms)
4. upsert files(dirty)
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> ext4(without fdatasync)
5. upsert files(fdatasync)
wrk -> GET request
GET request -> net/http server
data(4k) -> unix datagram socket
unix datagram socket -> ext4(temp file)
fdatasync
rename
local pc info
storage
cat /etc/fstab
LABEL=cloudimg-rootfs / ext4 defaults 0 1
cpuinfo
fgrep 'model name' /proc/cpuinfo|cat -n
1 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
3 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
4 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
5 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
6 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
7 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
8 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
9 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
10 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
11 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
12 model name : Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
dd
% dd if=/dev/zero of=./zeros.dat bs=4096 count=262144
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.67881 s, 401 MB/s
% dd if=/dev/zero of=./zeros.dat bs=4096 count=262144 conv=fdatasync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.02852 s, 1.0 GB/s
% dd if=/dev/zero of=./zeros.dat bs=4096 count=262144 conv=fsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.875749 s, 1.2 GB/s
Nginx is acting as a reverse proxy for an ad server, receiving 20k requests per minute. The ad server responds to Nginx within 100 ms.
It is running on a virtual machine with the following configuration:
128GB RAM
4 vCPU
100GB HDD
Given the above, what are good settings for Nginx and for sysctl.conf?
Please keep in mind that kernel tuning is complex and requires a lot of evaluation until you get the right results. If someone spots a mistake, please let me know so that I can adjust my own configuration :-)
Also, your memory is quite high for that number of requests. If this server is only running Nginx, you could check how much you actually use during peak hours and adjust accordingly.
An important thing to check is the number of file descriptors. In your situation I would set it to 65,000 to cope with the 20,000+ requests per second. In a normal situation you would only need about 4,000 file descriptors, because you would have about 4,000 simultaneously open connections (20,000 * 2 * 0.1). However, in case of an issue with a back end, it could take 1 second or more to load an advertisement; in that case the number of simultaneously open connections would be higher:
20,000 * 2 * 1.5 = 60,000.
So setting it to 65K would in my opinion be a safe value.
You can check the current maximum number of file descriptors via:
cat /proc/sys/fs/file-max
If this is below 65,000, you'll need to set it in /etc/sysctl.conf:
fs.file-max = 65000
Also, for Nginx you'll need to add the following in the file /etc/systemd/system/nginx.service.d/override.conf:
[Service]
LimitNOFILE=65000
In the nginx.conf file:
worker_rlimit_nofile 65000;
When added you will need to apply the changes:
sudo sysctl -p
sudo systemctl daemon-reload
sudo systemctl restart nginx
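To verify the limit actually applies to the running workers (the PID-file path below is the Ubuntu default; yours may differ):
cat /proc/$(cat /run/nginx.pid)/limits | grep 'open files'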
After that, the following settings will get you started:
vm.swappiness = 0 # Swap only to avoid an out-of-memory condition
vm.min_free_kbytes = 327680 # Keep at least this much memory (320 MB) free; reclaim kicks in below it
vm.vfs_cache_pressure = 125 # Reclaim memory used for VFS caches more quickly
vm.dirty_ratio = 15 # Writers are forced to flush when 15% of memory is dirty
vm.dirty_background_ratio = 10 # Background writeback starts when 10% of memory is dirty
Additionally, I use the following security settings in my sysctl configuration in conjunction with the tunables above. Feel free to use them.
# Avoid a smurf attack
net.ipv4.icmp_echo_ignore_broadcasts = 1
# Turn on protection for bad icmp error messages
net.ipv4.icmp_ignore_bogus_error_responses = 1
# Turn on syncookies for SYN flood attack protection
net.ipv4.tcp_syncookies = 1
# Turn on and log spoofed, source routed, and redirect packets
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
# No source routed packets here
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
# Turn on reverse path filtering
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# Make sure no one can alter the routing tables
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
# Don't act as a router
net.ipv4.ip_forward = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
# Turn on exec-shield (note: this sysctl exists only on Red Hat-derived kernels)
kernel.exec-shield = 1
kernel.randomize_va_space = 1
As you are proxying requests, I would add the following to your sysctl.conf file to make sure you are not running out of ports. It is optional, but if you are running into issues it is something to keep in mind:
net.ipv4.ip_local_port_range=1024 65000
As I normally evaluate the default settings and adjust accordingly, I did not supply tuned net.core and net.ipv4.tcp_* values for your case. You can find an example below, but please do not copy and paste it; you'll be required to do some reading before you start tuning these variables.
# Increase TCP max buffer size settable using setsockopt()
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 87380 8388608
# Increase Linux auto tuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, or higher if you use very high BDP paths
# Tcp Windows etc
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_window_scaling = 1
The above parameters are not everything you should consider; there are many more parameters you can tune. For example:
Set the number of worker processes to 4 (one per CPU core).
Tune the backlog queue.
If you do not need an access log, I would simply turn it off to remove the disk I/O.
Optionally: lower or disable gzip compression if your CPU usage is getting too high. (A sketch of these Nginx settings follows below.)
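In nginx.conf terms, those points would look something like this (the backlog value is only an example and should be paired with net.core.somaxconn):
worker_processes 4;              # one per vCPU
http {
    access_log off;              # no access log means no per-request disk I/O
    gzip off;                    # or lower gzip_comp_level if CPU is tight
    server {
        listen 80 backlog=4096;  # tune the listen backlog queue
    }
}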
We are doing a PoC by installing Greenplum in an AWS environment. We have set up each of our segment servers as a d2.8xlarge instance type, which has 240 GB of RAM and no swap.
I am now trying to set gp_vmem_protect_limit using the formula mentioned in the GPDB documents, and the value comes out to 25600 MB.
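For reference, the formula in the GPDB docs is roughly the following; the five-primaries-per-host figure is an assumption here, just to show the arithmetic:
gp_vmem = ((SWAP + RAM) - (7.5GB + 0.05 * RAM)) / 1.7
        = ((0 + 240) - (7.5 + 12)) / 1.7, approx. 129.7 GB per host
gp_vmem_protect_limit = gp_vmem / max_acting_primary_segments
                      = 129.7 GB / 5, approx. 26 GB, i.e. on the order of the 25600 MB computed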
But one of the Zendesk notes says that gp_vmem_protect_limit will be breached when "sessions executing on this segment are attempting together to use more than the configured limit." Does "segment" in this text mean the segment host or an individual primary segment?
Also, with the eager-free option set, I see that memory utilization is very poor when running the TPC-DS benchmark with 5 concurrent users. I would like to improve the memory utilization of the environment; below are the other memory configurations:
gpconfig -c gp_vmem_protect_limit -v 25600MB
gpconfig -c max_statement_mem -v 16384MB
gpconfig -c statement_mem -v 2400MB
Any suggestions?
Thanks,
Jayadeep
There is a calculator for it!
http://greenplum.org/calc/
You should also add a swap file or disk. It is pretty easy to do in Amazon too. I would add at least a 4GB swap file to each host when you have 240GB of RAM.
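For example (sizes and paths are just an illustration):
sudo fallocate -l 4G /swapfile     # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots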
I'm trying to understand the way bytes go from write() to the physical disk platter, to tune my picture server's performance.
The thing I don't understand is the difference between these two: the commit= mount option and dirty_writeback_centisecs. They look like they concern the same process of writing changes out to the storage device, but they are still different.
It is not clear to me which one fires first on my bytes' way to the disk.
Yeah, I just ran into this while investigating mount options for an SD-card Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user#chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(edit: These Dirty and Writeback figures are rather high; I had a compile running when I took them.)
So data waiting to be written out is dirty. Dirty data can still be eliminated (if, say, a temporary file is created, used, and deleted before it goes to writeback, it never has to be written out at all). As dirty data is moved into writeback, the kernel tries to combine smaller requests into single larger I/O requests; this is one reason why vm.dirty_expire_centisecs is usually not set too low. Dirty data is put into writeback when a) enough data is cached to reach vm.dirty_background_ratio, or b) the data becomes vm.dirty_expire_centisecs centiseconds old (the default of 3000 is 30 seconds). Separately, a writeback daemon runs every vm.dirty_writeback_centisecs centiseconds (default 500, i.e. 5 seconds) to actually flush out anything in writeback.
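You can inspect (or tune) these knobs directly; the values shown are the usual kernel defaults:
cat /proc/sys/vm/dirty_background_ratio     # 10 (% of memory before background writeback starts)
cat /proc/sys/vm/dirty_expire_centisecs     # 3000 (dirty data older than 30 s must be written)
cat /proc/sys/vm/dirty_writeback_centisecs  # 500 (flusher daemon wakes every 5 s)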
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to balance disk reads and writes; it stalls the device doing 100% writes until the sync completes.
The commit=5 default ext4 mount option forces a journal commit (effectively syncing that filesystem) every 5 seconds. This is intended to ensure that writes are not unduly delayed if there's heavy read activity, ideally losing at most 5 seconds of data if power is cut or whatever. What I found with an Ubuntu install on an SD card (in a Chromebook) is that this actually just leads to massive filesystem stalls every ~5 seconds if you're writing much to the card. ChromeOS uses commit=600, and I applied that on the Ubuntu side to good effect.
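If you want to try the same, commit= goes with the other mount options in /etc/fstab (the device label here is a placeholder):
LABEL=rootfs  /  ext4  defaults,commit=600  0  1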
The dirty_writeback_centisecs setting configures the Linux kernel daemons related to virtual memory (hence the vm. prefix), which are in charge of writing back from RAM to all the storage devices. So if you set dirty_writeback_centisecs and you have 25 different storage devices mounted on your system, all 25 get the same writeback interval.
commit, on the other hand, is set per storage device (actually per filesystem) and is related to the journal commit/sync process rather than to the virtual-memory daemons.
So you can see it as:
dirty_writeback_centisecs: writing from RAM to all filesystems
commit: each filesystem fetches its own changes from RAM
I have some VMs running on an IBM Power8 using QEMU-KVM, and I want to gather statistics about LLC misses.
How can I do that so as to get statistics for each VM separately?
Do you want these data for the whole VM, or for one application running in a VM?
I tested it on an Ubuntu 15.04 image over QEMU-KVM, and I am able to get the stats using perf. In this case, I am getting the LLC stats for a gzip operation. Take a look:
$ perf stat -e LLC-loads,LLC-load-misses gzip -9 /tmp/vmlinux
Performance counter stats for 'gzip -9 /tmp/vmlinux':
263,653 LLC-loads
10,753 LLC-load-misses # 4.08% of all LL-cache hits
4.006553608 seconds time elapsed
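If you want the numbers per VM rather than per in-guest process, one option (assuming perf can see the PMU on the host; the pgrep pattern is only a placeholder for however you identify each guest) is to attach perf on the host to each VM's QEMU process:
# Count LLC events for one VM's QEMU process over 60 seconds
perf stat -e LLC-loads,LLC-load-misses -p "$(pgrep -f 'qemu.*my-vm-name')" sleep 60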
For more detailed/explanatory content about some POWER events, refer to these documents:
Comprehensive PMU Event Reference – POWER7
Commonly Used Metrics for Performance Analysis – POWER7
The former is more of a reference, and the latter is more of a tutorial (including a section about the cache/memory hierarchy with hits/misses).
Those should be listed in: https://www.power.org/events/Power7