Relatively old Dell R620 server (32 cores / 128GB RAM) was working perfect for years with Ubuntu. Plain OS install, no Virtualization.
2 system disks in mirror (XFS)
6 RAID 5 disks for /var (XFS)
server is used for a nightly check of a MySQL Xtrabackup file.
Before the format and move to Centos 7 the process would finish by 08:00, Now running late at noon.
99% of the job is opening a large tar.gz file.
htop : there are only two processes doing something :
1. gzip -d : about 20% CPU
2. tar zxf Xtrabackup.tar.gz : about 4-7% CPU
iotop : it's steady at around 3M/s (Read) / 20-25 M/s (Write) which is about 25% of what i would expect at minimum.
Memory : Used : 1GB of 128GB
Server is fully updated both OS / HW / Firmware including the disks firmware.
IDRAC shows no problems.
Bottom line : Server is not working hard (to say the least) but performance is way off.
Any ideas would be appreciated.
vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 2 0 469072 0 130362040 0 0 57 341 0 0 0 0 98 2 0
0 2 0 456916 0 130374568 0 0 3328 24576 1176 3241 2 1 94 4 0
You have blocked processes and also io operations (around 20MB/s). And this mean for me you have few processes which concurrently access disc resources. What you can do to improve the performance is instead of
tar zxf Xtrabackup.tar.gz
use
gzip -d Xtrabackup.tar.gz|tar xvf -
The second add parallelism and can benefit from multy processor, You can also benefit from increase of the pipe (fifo) buffer. Check this answer for some ideas
Also consider to tune filesystem where are stored output files of tar
Related
What are some (reliable) tests to find the disk a certain partition is on and put that result into a variable?
For example, output of lsblk:
...
sda 8:0 0 9.1T 0 disk
└─sda1 8:1 0 9.1T 0 part /foopath
...
mmcblk0 179:0 0 29.7G 0 disk
├─mmcblk0p1 179:1 0 256M 0 part /barpath
└─mmcblk0p2 179:2 0 29.5G 0 part /foobarpath
If partition="/dev/mmcblk0p2", how can I put mmcblk0 as the disk it is a part of into a variable? Or similarly, if partition="/dev/sda1", how to put sda as the disk it is a part of into a variable?
disk=${partition::-1} seemed to be a hack until I encountered partitions such as mmcblk0p1, hence the request for a more reliable test...
The purpose of isolating the disk and using variable is to pass it to smartctl -n standby /dev/sda to find if disk is currently spinning, etc.
Operating environment is Linux Mint 19.3 and Ubuntu 20.
Any ideas?
Thanks to #KamilCuk and #don_crissti ;)
"Print just the parent device" using lsblk
#!/bin/bash
partition="/dev/sda1"
disk="$(lsblk -no pkname "${partition}")"
I want to disable the timer interrupt on some of the cores (1-2) on my machine which is a x86 running centos 7 with rt patch, both cores are isolated cores with nohz_full, (you can see the cmdline) but timer interrupt continues to interrupt the real time process which are running on core1 and core2.
1. uname -r
3.10.0-693.11.1.rt56.632.el7.x86_64
2. cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.1.rt56.632.el7.x86_64 \
root=/dev/mapper/centos-root ro crashkernel=auto \
rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet \
default_hugepagesz=2M hugepagesz=2M hugepages=1024 \
intel_iommu=on isolcpus=1-2 irqaffinity=0 intel_idle.max_cstate=0 \
processor.max_cstate=0 idle=mwait tsc=perfect rcu_nocbs=1-2 rcu_nocb_poll \
nohz_full=1-2 nmi_watchdog=0
3. cat /proc/interrupts
CPU0 CPU1 CPU2
0: 29 0 0 IO-APIC-edge timer
.....
......
NMI: 0 0 0 Non-maskable interrupts
LOC: 835205157 308723100 308384525 Local timer interrupts
SPU: 0 0 0 Spurious interrupts
PMI: 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 IRQ work interrupts
RTR: 0 0 0 APIC ICR read retries
RES: 347330843 309191325 308417790 Rescheduling interrupts
CAL: 0 935 935 Function call interrupts
TLB: 320 22 58 TLB shootdowns
TRM: 0 0 0 Thermal event interrupts
THR: 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 Machine check exceptions
MCP: 2 2 2 Machine check polls
CPUs/Clocksource:
4. lscpu | grep CPU.s
CPU(s): 3
On-line CPU(s) list: 0-2
NUMA node0 CPU(s): 0-2
5. cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Thanks a lot for any help.
Moses
Even with nohz_full= you get some ticks on the isolated CPUs:
Some process-handling operations still require the occasional scheduling-clock tick. These operations include calculating CPU load, maintaining sched average, computing CFS entity vruntime, computing avenrun, and carrying out load balancing. They are currently accommodated by scheduling-clock tick every second or so. On-going work will eliminate the need even for these infrequent scheduling-clock ticks.
(Documentation/timers/NO_HZ.txt, cf. (Nearly) full tickless operation in 3.10 LWN, 2013)
Thus, you have to check the rate of the local timer, e.g.:
$ perf stat -a -A -e irq_vectors:local_timer_entry sleep 120
(while your isolated threads/processes are running)
Also, nohz_full= is only effective if there is just one runnable task on each isolated core. You can check that with e.g. ps -L -e -o pid,tid,user,state,psr,cmd and cat /proc/sched_debug.
Perhaps you need to move some (kernel) tasks to you house-keeping core, e.g.:
# tuna -U -t '*' -c 0-4 -m
You can get more insights into what timers are still active by looking at /proc/timer_list.
Another method to investigate causes for possible interruption is to use the functional tracer (ftrace). See also Reducing OS jitter due to per-cpu kthreads for some examples.
I see nmi_watchdog=0 in your kernel parameters, but you don't disable the soft watchdog. Perhaps this is another timer tick source that would show up with ftrace.
You can disable all watchdogs with nowatchdog.
Btw, some of your kernel parameters seem to be off:
tsc=perfect - do you mean tsc=reliable? The 'perfect' value isn't documented in the kernel docs
idle=mwait - do you mean idle=poll? Again, I can't find the 'mwait' value in the kernel docs
intel_iommu=on - what's the purpose of this?
For a year my scheduling command with slurm was fine, but is now attempting to allocate N+1 nodes for my job now that someone else is scheduled on a node I'm running on.
I'm trying to run a job on 2 nodes that runs as 48 processes all with two cores. (so, 48 jobs with MPI, each using OMP_NUM_THREADS=2), so 96 all together.
In the past I've always run this with (using hostname as example since it illustrates the problem)
srun --partition=intel \
-N 2 \
--ntasks-per-node=24 \
--ntasks=48 \
--cpus-per-task=2 \
hostname
But now squeue is telling me that this is pending for resources awaiting 3 nodes instead of 2.
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1282 intel hostname gmp13x64 PD 0:00 3 (Resources)
1229 intel bwd-pm25 shunliu R 5-12:51:18 1 dena5
The intel queue is composed of two nodes, dena5 and dena6
gmp13x64#dena:script_dev$ sinfo --Node --long
Sat Nov 5 10:49:51 2016
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
dena5 1 intel allocated 48 2:12:2 128833 1951 1 (null) none
dena6 1 intel idle 48 2:12:2 128833 1951 1 (null) none
My interpretation is the info that sinfo is giving is that my setup should be fine. Plus, in the past this has always worked.
Why is slurm all of a sudden saying that this requires 3 nodes? How would that even be allowed considering the intel partition only has 2 nodes? Am I mis-interpreting some of the flags I'm using?
I'm using puppet as a provisioner for Vagrant, and am coming across an issue where Puppet will hang for an extremely long time when I do a "vagrant provision". Building the box from scratch using "vagrant up" doesn't seem to be a problem, only subsequent provisions.
If I turn puppet debug on and watch where it hangs, it seems to stop at various, seemingly arbitrary, points the first of which is:
Info: Applying configuration version '1401868442'
Debug: Prefetching yum resources for package
Debug: Executing '/bin/rpm --version'
Debug: Executing '/bin/rpm -qa --nosignature --nodigest --qf '%{NAME} %|EPOCH?{% {EPOCH}}:{0}| %{VERSION} %{RELEASE} %{ARCH}\n''
Executing this command on the server myself returns immediately.
Eventually, it gets past this and continues. Using the summary option, I get the following, after waiting for a very long time for it to complete:
Debug: Finishing transaction 70191217833880
Debug: Storing state
Debug: Stored state in 9.39 seconds
Notice: Finished catalog run in 1493.99 seconds
Changes:
Total: 2
Events:
Failure: 2
Success: 2
Total: 4
Resources:
Total: 18375
Changed: 2
Failed: 2
Skipped: 35
Out of sync: 4
Time:
User: 0.00
Anchor: 0.01
Schedule: 0.01
Yumrepo: 0.07
Augeas: 0.12
Package: 0.18
Exec: 0.96
Service: 1.07
Total: 108.93
Last run: 1401869964
Config retrieval: 16.49
Mongodb database: 3.99
File: 76.60
Mongodb user: 9.43
Version:
Config: 1401868442
Puppet: 3.4.3
This doesn't seem very helpful to me, as the amount of time total's 108 seconds, so where have the other 1385 seconds gone?
Throughout, Puppet seems to be hammering the box, using up a lot of CPU, but still doesn't seem to advance. The memory it uses seems to continually increase. When I kick off the command, top looks like this:
Cpu(s): 10.2%us, 2.2%sy, 0.0%ni, 85.5%id, 2.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4956928k total, 2849296k used, 2107632k free, 63464k buffers
Swap: 950264k total, 26688k used, 923576k free, 445692k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28486 root 20 0 439m 334m 3808 R 97.5 6.9 2:02.92 puppet
22 root 20 0 0 0 0 S 1.3 0.0 0:07.55 kblockd/0
18276 mongod 20 0 788m 31m 3040 S 1.3 0.6 2:31.82 mongod
20756 jboss-as 20 0 3081m 1.5g 21m S 1.3 31.4 7:13.15 java
20930 elastics 20 0 2340m 236m 6580 S 1.0 4.9 1:44.80 java
266 root 20 0 0 0 0 S 0.3 0.0 0:03.85 jbd2/dm-0-8
22717 vagrant 20 0 98.0m 2252 1276 S 0.3 0.0 0:01.81 sshd
28762 vagrant 20 0 15036 1228 932 R 0.3 0.0 0:00.10 top
1 root 20 0 19364 1180 964 S 0.0 0.0 0:00.86 init
To me, this seems fine, there's over 2GB of available memory and plenty of available swap. I have a max open files limit of 1024.
About 10-15 minutes later, still no advance in the console output, but top looks like this:
Cpu(s): 11.2%us, 1.6%sy, 0.0%ni, 86.9%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%s
Mem: 4956928k total, 3834376k used, 1122552k free, 64248k buffers
Swap: 950264k total, 24408k used, 925856k free, 445728k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28486 root 20 0 1397m 1.3g 3808 R 99.6 26.7 15:16.19 puppet
18276 mongod 20 0 788m 31m 3040 R 1.7 0.6 2:45.03 mongod
20756 jboss-as 20 0 3081m 1.5g 21m S 1.3 31.4 7:25.93 java
20930 elastics 20 0 2340m 238m 6580 S 0.7 4.9 1:52.03 java
8486 root 20 0 308m 952 764 S 0.3 0.0 0:06.03 VBoxService
As you can see, puppet is now using a lot more of the memory, and it seems to continue in this fashion. The box it's building has 5GB of RAM, so I wouldn't have expected it to have memory issues. However, further down the line, after a long wait, I do get "Cannot allocate memory - fork(2)"
Running unlimit -a, I get:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 38566
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Which, again looks fine to me...
To be honest, I'm completely at a loss as to how to go about solving this, or what is causing it.
Any help or insight would be greatly appreciated!
EDIT:
So I managed to fix this eventually... It came down to using recurse with a file directive for a large directory. The target directory in question contained around 2GB worth of files, and puppet took a huge amount of time loading this into memory and doing it's hashes and comparisons. The first time I stood the server up, the directory was relatively empty so the check was quick, but then other resources were placed in it that increased its size massively, meaning subsequent runs took much longer.
The memory error that eventually was thrown was because, I can only assume, Puppet was loading the whole thing into memory in order to do its stuff...
I found a way around using the recurse function, and am now trying to avoid it like the plague...
Yeah, the problem with the recurse parameter on the file type is that it checks every single file's checksum, which on a massive directory adds up real quick.
As Felix suggests, using checksum => none is one way to fix it, another is to accomplish the task you're trying to do (say chmod or chown a whole directory) with an exec performing the native task, with an unless to check if it's already been done.
Something like:
define check_mode($mode) {
exec { "/bin/chmod $mode $name":
unless => "/bin/sh -c '[ $(/usr/bin/stat -c %a $name) == $mode ]'",
}
}
Taken from http://projects.puppetlabs.com/projects/1/wiki/File_Permission_Check_Patterns
Is there a shell command to know about how much memory is being used at a particular moment and details of how much each process is using, how much virtual memory is left etc?
For "each process", how about top:
PhysMem: 238M wired, 865M active, 549M inactive, 1652M used, 395M free.
VM: 162G vsize, 1039M framework vsize, 124775(0) pageins, 9149(0) pageouts.
PID COMMAND %CPU TIME #TH #WQ #POR #MREG RPRVT RSHRD RSIZE VPRVT VSIZE PGRP PPID STATE UID
7233 top 5.7 00:00.53 1/1 0 24 33 1328K 264K 1904K 17M 2378M 7233 3766 running 0
e.g.:
rprvt Resident private address space size.
rshrd Resident shared address space size.
rsize Resident memory size.
vsize Total memory size.
vprvt Private address space size.
Let's also hear it for the old classic, vmstat.
$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 30160 15884 418680 281936 0 0 406 22 6 3 1 1 93 5
Depends on your operating system. In Linux, free answers two out of your three questions.
~> free
total used free shared buffers cached
Mem: 904580 895128 9452 0 63700 777728
-/+ buffers/cache: 53700 850880
Swap: 506036 0 506036
"Swap" refers to virtual memory.
If you're on Linux, give ps_mem.py a try.
If you are on an up-to-date Linux, cat /proc/$pid/smaps is the business.
If you are on OSX, check https://superuser.com/questions/97235/how-much-swap-is-a-given-mac-application-using.