In our organization we have started an integration through a web service with api rest but we have a rare performance problem.
Data:
We have a virtual machine (VMWare) 4 core/8Gb ram. sufficient remote storage.
Ubuntu server 18.04
openjdk 11.0.7 2020-04-14
JAVA_OPTS='-Djava.awt.headless=true -Xms512m -Xmx2048m -XX:MaxPermSize=256m'
mysql: See 5.7.30-0ubuntu0.18.04.1 (It's running locally but the app connects by host name).
APP: Spring boot 2.1.3 (tomcat & spring data jpa & hikari & hibernate) All parameters by default.
top - 15:09:15 up 2 days, 14:21, 1 user, load average: 0.03, 0.01, 0.00
Tasks: 189 total, 1 running, 100 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8168140 total, 148740 free, 7590936 used, 428464 buff/cache
KiB Swap: 2097148 total, 1352428 free, 744720 used. 332048 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2383 app 20 0 41920 3944 3220 R 0.7 0.0 0:00.53 top
2698 app 20 0 5835612 402424 15312 S 0.7 4.9 23:13.92 java
1786 mysql 20 0 2680528 321892 8108 S 0.3 3.9 20:38.32 mysqld
2677 app 20 0 5850152 441440 15824 S 0.3 5.4 28:01.41 java <------
2769 app 20 0 5868308 977.2m 16868 S 0.3 12.3 49:25.72 java
ps -eaf | grep java
app 2677 2676 0 Jul07 ? 00:28:01 java -Dserver.port=4560 -jar app-ws-1.0.0-SNAPSHOT.jar <------
app 2698 2696 0 Jul07 ? 00:23:14 java -Dserver.port=4561 -jar app-ws-1.0.0-SNAPSHOT.jar
app 2769 2768 1 Jul07 ? 00:49:26 java -jar app-gui-1.0.0-SNAPSHOT.jar
We have 2 webservices, one functional (2677) and the other in testing (2698) and a web app (2768).
We have a problem with the first one. When processing calls the first one takes >30s, causing a timeout in the calling system, but the following calls are processed ok <5s.
The number of calls is minimum, 10 max. per day and never concurrent. Timeout can also occur if several hours pass without calls (>5h).
We have checked the code, we have checked WMware/Ubuntu (suspension options) and we haven't seen anything in the monitoring.
We have been told that it could be JVM and GC problems but I personally don't understand much and I haven't seen anything with the Memory analyzer.
Later on we have implemented in the app itself a dummy call (localhost) every 10 minutes to "warm up the machine" but even so the first call still takes >30s and the rest does not. The dummy call only answers ok.
We don't know what the cause could be and we don't know how to discard options since it is a productive environment and it doesn't admit many changes.
Related
run: top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13960 git 20 0 2032080 336220 13304 S 1.0 16.3 0:31.50 ruby
14284 git 20 0 554792 300168 10844 S 0.0 14.5 0:04.27 ruby
14287 git 20 0 546056 291068 10652 S 0.0 14.1 0:03.13 ruby
2705 mysql 20 0 1082876 287544 380 S 0.0 13.9 0:01.70 mysqld
14104 git 20 0 524072 276016 13324 S 0.0 13.4 0:24.69 ruby
14281 git 20 0 524072 267504 4812 S 0.0 13.0 0:00.00 ruby
13978 gitlab-+ 20 0 579824 39872 39280 S 0.0 1.9 0:00.12 postgres
1404 www 20 0 142196 31304 820 S 0.0 1.5 0:00.05 nginx
1405 www 20 0 142196 31304 820 S 0.0 1.5 0:00.05 nginx
1403 www 20 0 142196 30992 508 S 0.0 1.5 0:00.04 nginx
My machine only has 2GB of memory.
Is there a way to optimize the configuration and reduce the memory consumption?
Not really: see GitLab Requirements for memory
You need at least 8GB of addressable memory (RAM + swap) to install and use GitLab!
The operating system and any other running applications will also be using memory so keep in mind that you need at least 4GB available before running GitLab. With less memory GitLab will give strange errors during the reconfigure run and 500 errors during usage.
We recommend having at least 2GB of swap on your server, even if you currently have enough available RAM. Having swap will help reduce the chance of errors occurring if your available memory changes.
We also recommend configuring the kernel’s swappiness setting to a low value like 10 to make the most of your RAM while still having the swap available when needed.
I just installed Linux and Intel MPI to two machines:
(1) Quite old (~8 years old) SuperMicro server, which has 24 cores (Intel Xeon X7542 X 4). 32 GB memory.
OS: CentOS 7.5
(2) New HP ProLiant DL380 server, which has 32 cores (Intel Xeon Gold 6130 X 2). 64 GB memory.
OS: OpenSUSE Leap 15
After installing OS and Intel MPI, I compiled intel MPI benchmark and ran it:
$ mpirun -np 4 ./IMB-EXT
It is quite surprising that I find the same error when running IMB-EXT and IMB-RMA, though I have a different OS and everything (even GCC version used to compile Intel MPI benchmark is different -- in CentOS, I used GCC 6.5.0, and in OpenSUSE, I used GCC 7.3.1).
On the CentOS machine, I get:
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.05 0.00
4 1000 30.56 0.13
8 1000 31.53 0.25
16 1000 30.99 0.52
32 1000 30.93 1.03
64 1000 30.30 2.11
128 1000 30.31 4.22
and on the OpenSUSE machine, I get
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.04 0.00
4 1000 14.40 0.28
8 1000 14.04 0.57
16 1000 14.10 1.13
32 1000 13.96 2.29
64 1000 13.98 4.58
128 1000 14.08 9.09
When I don't use mpirun (which means there is only one process to run IMB-EXT), the benchmark runs through, but Unidir_Put needs >=2 processes, so doesn't help so much, and I also find that the functions with MPI_Put and MPI_Get is extremely slower than I expected (from my experience). Also, using MVAPICH on the OpenSUSE machine did not help. The output is:
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.03 0.00
4 1000 17.37 0.23
8 1000 17.08 0.47
16 1000 17.23 0.93
32 1000 17.56 1.82
64 1000 17.06 3.75
128 1000 17.20 7.44
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 49213 RUNNING AT iron-0-1
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
update: I tested OpenMPI, and it goes through smoothly (although my application does not recommend using openmpi, and I still don't understand why Intel MPI or MVAPICH doesn't work...)
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.06 0.00
4 1000 0.23 17.44
8 1000 0.22 35.82
16 1000 0.22 72.36
32 1000 0.22 144.98
64 1000 0.22 285.76
128 1000 0.30 430.29
256 1000 0.39 650.78
512 1000 0.51 1008.31
1024 1000 0.84 1214.42
2048 1000 1.86 1100.29
4096 1000 7.31 560.59
8192 1000 15.24 537.67
16384 1000 15.39 1064.82
32768 1000 15.70 2086.51
65536 640 12.31 5324.63
131072 320 10.24 12795.03
262144 160 12.49 20993.49
524288 80 30.21 17356.93
1048576 40 81.20 12913.67
2097152 20 199.20 10527.72
4194304 10 394.02 10644.77
Is there any chance that I am missing something in installing MPI, or installing OS in these servers? Actually, I assume that OS is the problem, but not sure where to start...
Thanks a lot in advance,
Jae
Although this question is well written, you were not explicit about
Intel MPI benchmark (please add header)
Intel MPI
Open MPI
MVAPICH
supported host network fabrics - for each MPI distribution
selected fabric while running MPI benchmark
Compilation settings
Debugging this kind of trouble with disparate host machines, multiple Linux distributions and compiler versions can be quite hard. Remote debugging on StackOverflow is even harder.
First of all ensure reproducibility. This seems to be the case. One of many debugging approaches, the one I would recommend, is to reduce complexity of the system as a whole, test smaller sub-systems and start shifting responsibility to third parties. You may replace self-compiled executables with software packages provided by distribution software/package repositories or third parties like Conda.
Intel recently started to provide its libraries through YUM/APT repos as well as for Conda and PyPI. I found that helps a lot with reproducible deployments of HPC clusters and even runtime/development environments. I recommend to use it for CentOS 7.5.
YUM/APT repository for Intel MKL, Intel IPP, Intel DAAL, and Intel® Distribution for Python* (for Linux*):
Installing Intel® Performance Libraries and Intel® Distribution for Python* Using YUM Repository
Installing Intel® Performance Libraries and Intel® Distribution for Python* Using APT Repository
Conda* package/ Anaconda Cloud* support (Intel MKL, Intel IPP, Intel DAAL, Intel Distribution for Python):
Installing Intel Distribution for Python and Intel Performance Libraries with Anaconda
Available Intel packages can be viewed here
Install from the Python Package Index (PyPI) using pip (Intel MKL, Intel IPP, Intel DAAL)
Installing the Intel® Distribution for Python* and Intel® Performance Libraries with pip and PyPI
I do not know much about OpenSUSE Leap 15.
After encountering situations where I found that rethinkdb service is down for unknown reason, I noticed it uses a lot of memory:
# free -m
total used free shared buffers cached
Mem: 7872 7744 128 0 30 68
-/+ buffers/cache: 7645 226
Swap: 4031 287 3744
# top
top - 23:12:51 up 7 days, 1:16, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 133 total, 1 running, 132 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061372k total, 7931724k used, 129648k free, 32752k buffers
Swap: 4128760k total, 294732k used, 3834028k free, 71260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1835 root 20 0 7830m 7.2g 5480 S 1.0 94.1 292:43.38 rethinkdb
29417 root 20 0 15036 1256 944 R 0.3 0.0 0:00.05 top
1 root 20 0 19364 1016 872 S 0.0 0.0 0:00.87 init
# cat log_file | tail -9
2014-09-22T21:56:47.448701122 0.052935s info: Running rethinkdb 1.12.5 (GCC 4.4.7)...
2014-09-22T21:56:47.452809839 0.057044s info: Running on Linux 2.6.32-431.17.1.el6.x86_64 x86_64
2014-09-22T21:56:47.452969820 0.057204s info: Using cache size of 3327 MB
2014-09-22T21:56:47.453169285 0.057404s info: Loading data from directory /rethinkdb_data
2014-09-22T21:56:47.571843375 0.176078s info: Listening for intracluster connections on port 29015
2014-09-22T21:56:47.587691636 0.191926s info: Listening for client driver connections on port 28015
2014-09-22T21:56:47.587912507 0.192147s info: Listening for administrative HTTP connections on port 8080
2014-09-22T21:56:47.595163724 0.199398s info: Listening on addresses
2014-09-22T21:56:47.595167377 0.199401s info: Server ready
It seems a lot considering the size of the files:
# du -h
4.0K ./tmp
156M .
Do I need to configure a different cache size? Do you think it has something to do with finding the service surprisingly gone? I'm using v1.12.5
There were a few leak in the previous version, the main one being https://github.com/rethinkdb/rethinkdb/issues/2840
You should probably update RethinkDB -- the current version being 1.15.
If you run 1.12, you need to export your data, but that should be the last time you need it since 1.14 introduced seamless migrations.
From Understanding RethinkDB memory requirements - RethinkDB
By default, RethinkDB automatically configures the cache size limit according to the formula (available_mem - 1024 MB) / 2. available_mem
You can change this via a config file as they document, or change it with a size (in MB) from the command line:
rethinkdb --cache-size 2048
I'm using ubuntu Ubuntu 14.04.1 LTS
atopsar -d 30 - shows that one of hard drive (sda) in the system is heavily used. This hard drive serves only mysql database. The most frequently used DBs where relocated to another hard drives (sdb, sdd) via symbolic links. Now atopsar shows nearly same load for sda and under 5% load to other HDDs.
Is there a way to know which files are heavily used on HDD?
Can it be that mysql InnoDB log files (ib_logfile) are fragmented? And therefore atopsar show such big load (50%-70%). What can be done in that case?
There are some output from atopsar -d 30:
08:52:47 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
08:53:17 sda 63% 0.0 0.0 50.2 14.6 1.1 12.57 ms
sdb 5% 0.0 0.0 9.4 19.8 4.2 5.81 ms
sdd 2% 0.0 0.0 3.7 18.9 1.4 5.82 ms
08:53:47 sda 60% 0.0 16.0 48.1 15.7 1.0 12.55 ms
sdb 5% 0.0 0.0 6.9 17.5 4.6 7.35 ms
sdd 2% 0.0 0.0 4.7 24.9 1.4 4.06 ms
08:54:17 sda 38% 0.5 16.0 30.6 15.6 1.2 12.25 ms
sdb 3% 0.0 0.0 5.6 18.3 3.3 5.50 ms
sdd 2% 0.0 0.0 3.3 19.2 1.1 4.86 ms
08:54:47 sda 53% 0.0 0.0 42.5 16.5 1.1 12.37 ms
sdb 6% 0.0 0.0 8.7 21.0 5.8 6.37 ms
sdd 2% 0.0 0.0 3.1 23.1 1.3 5.68 ms
08:55:17 sda 51% 0.0 4.0 42.7 16.9 1.1 11.94 ms
sdb 5% 0.0 0.0 9.4 20.5 5.0 5.51 ms
sdd 1% 0.0 0.0 1.5 17.6 1.1 7.73 ms
08:55:47 sda 52% 0.0 0.0 40.6 14.5 1.0 12.85 ms
sdb 5% 0.0 0.0 6.8 19.5 5.4 6.66 ms
sdd 2% 0.0 0.0 4.3 31.3 1.3 4.78 ms
There is sysdig tool which allow you to see system-wide activities just like strace does for single process: http://www.sysdig.org/
There are examples for Disk usage info: https://github.com/draios/sysdig/wiki/Sysdig%20Examples#disk-io
See the top processes in terms of disk bandwidth usage
sysdig -c topprocs_file
See the top files in terms of read+write bytes
sysdig -c topfiles_bytes
Print the top files that apache has been reading from or writing to
sysdig -c topfiles_bytes proc.name=httpd
See the top directories in terms of R+W disk activity
sysdig -c fdbytes_by fd.directory "fd.type=file"
See the top files in terms of R+W disk activity in the /tmp directory
sysdig -c fdbytes_by fd.filename "fd.directory=/tmp/"
Observe the I/O activity on all the files named 'passwd'
sysdig -A -c echo_fds "fd.filename=passwd"
Sysdig is modern and convenient tool. For older Linuxes is it possible to get similar information using SystemTap: http://lukas.zapletalovi.com/2014/05/systemtap-as-a-system-wide-strace-tool.html
PS Thanks to habrahabr.ru with this post about Sysdig http://habrahabr.ru/company/selectel/blog/222839/
PPS Brendan D. Gregg created this picture "A quick tour of many tools..." for his Linux Performance page:
To find out the most heavily used files in the system please use: sudo pt-ioprofile -cell sizes
Example of output:
total pread read pwrite fsync lseek filename
10862592 0 0 10862592 0 0 /var/mysqldata/mysql/ibdata1
827392 0 0 827392 0 0 /var/mysqllog/mysql/ib_logfile0
... (other trivial I/O records truncated)
Got it from https://dba.stackexchange.com/questions/21209/innodb-high-disk-write-i-o-on-ibdata1-file-and-ib-logfile0
Please be aware that by default Percona toolkit attaches only to mysqld. And to find out most heavily used file you have to run it to all processes that might create such load. In my case I was definitely sure that it's mysql server, so it's enough for me.
Please read http://www.percona.com/doc/percona-toolkit/2.0/pt-ioprofile.html before you use it.
Try investigating with
dstat --top-bio
it will give you processes that use most of IO.
In linux you have /proc/diskstats - it gives only block device level stats.
I have never seen a mechanism to determine which file is busy in linux.
my sphinx configuration is:
================================ config/sphinx.yml
development:
bin_path: "/usr/local/bin"
searchd_binary_name: searchd
indexer_binary_name: indexer
but everytime i run a rake ts:index
Sphinx cannot be found on your system. You may need to configure the following
settings in your config/sphinx.yml file:
* bin_path
* searchd_binary_name
* indexer_binary_name
For more information, read the documentation:
For more information, read the documentation:
http://freelancing-god.github.com/ts/en/advanced_config.html
Generating Configuration to config/development.sphinx.conf
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'config/development.sphinx.conf'...
indexing index 'post_core'...
collected 2 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 2 docs, 675 bytes
total 0.006 sec, 110510 bytes/sec, 327.43 docs/sec
skipping non-plain index 'post'...
total 6 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 12 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=19438).
Generating Configuration to config/development.sphinx.conf
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'config/development.sphinx.conf'...
indexing index 'post_core'...
collected 2 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 2 docs, 675 bytes
total 0.006 sec, 105567 bytes/sec, 312.79 docs/sec
skipping non-plain index 'post'...
total 6 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 12 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=19438).
So what's the problem? Why does the rake output that it cant find it even though its installed?
The warning from Thinking Sphinx could definitely be clearer... the problem is very likely to be how old your version of Thinking Sphinx is. Older TS versions don't know about Sphinx 2.0.x - so I'd recommend updating to the latest version of Thinking Sphinx (either 1.4.6 for Rails 1.2 and 2.x, or 2.0.5 for Rails 3).
There are two things that help to solve this problem. First, as Pat says, it is useful to update the Thinking Sphinx plugin or gem to the latest version (either 1.4.x for Rails 2, or 2.0.x for Rails 3). Second it helps sometimes to specify the version of Sphinx in the configuration file (you can find it out by calling "indexer"), especially if Sphinx is running on a remote server and Thinking Sphinx does not have access to Sphinx locally:
production:
..
version: 2.0.4 # <------- Version of Sphinx on remote server 192.168.1.10
port: 9312
address: 192.168.1.10
..
I was facing the same issue and looked everywhere for an answer without any resolution.
The trick that worked for me was to install older version of sphinx. v .9 instead of the latest beta.
Using the latest Thinking-Sphinx with this version of sphinx resolved the issue.