hardware and system
CPU: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz * 2
MEMORY: 128G
POOL: 2T * 10 (raidz3) + 512G * 2 L2ARC
TRUENAS: TrueNAS-13.0-U3.1
dd test
# on truenas
root@freenas[/mnt/dev1/docker]# dd if=/dev/zero of=test.dat bs=1M count=400 conv=fsync oflag=direct
400+0 records in
400+0 records out
419430400 bytes transferred in 0.126497 secs (3315728885 bytes/sec)
root@freenas[/mnt/dev1/docker]# dd if=/dev/zero of=test.dat bs=512K count=800 conv=fsync oflag=direct
800+0 records in
800+0 records out
419430400 bytes transferred in 0.128923 secs (3253346675 bytes/sec)
root@freenas[/mnt/dev1/docker]# dd if=/dev/zero of=test.dat bs=4K count=20000 conv=fsync oflag=direct
20000+0 records in
20000+0 records out
81920000 bytes transferred in 0.135577 secs (604232041 bytes/sec)
# on nfs client (1Gbit network)
[root@node1 docker]# dd if=/dev/zero of=test.dat bs=1M count=400 conv=fsync oflag=direct
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 4.01762 s, 104 MB/s
[root@node1 docker]# dd if=/dev/zero of=test.dat bs=512K count=800 conv=fsync oflag=direct
800+0 records in
800+0 records out
419430400 bytes (419 MB) copied, 4.39617 s, 95.4 MB/s
[root@node1 docker]# dd if=/dev/zero of=test.dat bs=4K count=20000 conv=fsync oflag=direct
20000+0 records in
20000+0 records out
81920000 bytes (82 MB) copied, 6.95326 s, 11.8 MB/s
nfs
# client mount
192.168.10.16:/mnt/dev1/docker on /docker type nfs4 (rw,noatime,nodiratime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.10.120,local_lock=none,addr=192.168.10.16,_netdev)
other
The TrueNAS pool already has sync=disabled set.
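For reference, this can be confirmed from a TrueNAS shell (a minimal sketch; the dataset name dev1/docker is only inferred from the mount path above):
zfs get sync dev1/docker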
HELP
The 4K test is very slow on the NFS client. Can someone help me troubleshoot?
The problem with TrueNAS is that it ships preconfigured with periodic security checks that hog the system.
You should do the following:
Using vim, edit the file /usr/local/lib/python3.9/site-packages/middlewared/etc_files/local/collectd.conf and disable the python plugin by commenting out the line "LoadPlugin python".
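If you prefer to do that non-interactively, something along these lines should work (a sketch, assuming BSD sed, which needs an explicit backup suffix with -i, and assuming the LoadPlugin line appears literally in that file):
sed -i '' 's/^LoadPlugin python/#LoadPlugin python/' \
    /usr/local/lib/python3.9/site-packages/middlewared/etc_files/local/collectd.conf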
Go to the TrueNAS UI, in System, Reporting, and change the points from 1200 to 100, so that it purges and regenerates the RRD data.
Then execute:
service collectd onestatus
service collectd onerestart
Then, in a shell session, edit the file /etc/defaults/periodic.conf with vi and set
security_status_neggrpperm_enable="NO"
See other recommended settings in https://gist.github.com/adam12/b2a645adc55c4c076d9887838ab4bae7
Also edit the file /conf/base/etc/defaults/periodic.conf, because the change will not persist across a reboot otherwise.
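A quick non-interactive way to apply the same knob to both copies (a sketch; only do this if the variable is not already present in the files):
echo 'security_status_neggrpperm_enable="NO"' >> /etc/defaults/periodic.conf
echo 'security_status_neggrpperm_enable="NO"' >> /conf/base/etc/defaults/periodic.conf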
There are also some settings in the /data directory, in SQLite databases; these are written by the TrueNAS UI, so be careful.
Also, if like me you have ntopng in your network and see the message kex_exchange_identification pop up every 30 minutes on your FreeNAS terminal screen, then you have two options:
disable network device discovery in ntopng settings, or
Go to TrueNAS, Services menu, SSH, and change the port from 22 to another one, for example 8022.
Best regards
Related
I'd like to back up an SSD which I'm using for CentOS. I'm trying to learn dd. My drive is a fairly simple 120 GB GPT disk.
I run "dd" to copy the image of sda to a USB stick sdd1:
[root@localhost ~]# dd if=/dev/sda conv=sync,noerror status=progress bs=64k of=/dev/sdd1
120029118464 bytes (120 GB, 112 GiB) copied, 30810 s, 3.9 MB/s
1831575+1 records in
1831576+0 records out
120034164736 bytes (120 GB, 112 GiB) copied, 30810.8 s, 3.9 MB/s
But then when I examine the USB stick, there is nothing to be seen on it and I see no way to mount it; this is what appears under the Disks command.
Question is:
How do I access the image?
(As a side note, I read a claim that the dd command is like the IBM JCL statement of the same name. I was a mainframe programmer. The IBM DD command is often still called a "DD Card". It doesn't copy files. It just joins your file declaration in your program to some external file. To copy a file the old skool way is to use IEBGENER)
if=/dev/sda is reading the entire disk, while of=/dev/sdd1 is writing to a partition, which doesn't make much sense.
You may want to clone the entire disk onto another disk
dd if=/dev/sda conv=sync,noerror status=progress bs=64k of=/dev/sdd
Or better yet, clone to a compressed image
dd if=/dev/sda | gzip > /sda.img.gz
And restore like so
gzip -dc /sda.img.gz | dd of=/dev/sda
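If the goal is to actually browse the files inside such a raw image rather than restore it to a disk, one option is to attach it to a loop device (a sketch, assuming a Linux host with util-linux; device names will vary):
# decompress the image first (writes /sda.img next to the archive)
gunzip -c /sda.img.gz > /sda.img
# attach the image to a free loop device and scan its partition table
losetup -fP --show /sda.img        # prints e.g. /dev/loop0
# mount the first partition read-only and browse it
mount -o ro /dev/loop0p1 /mnt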
I want to find the CPU process usage for all Oracle processes on an AIX box.
On Solaris I can do the following:
prstat -n 400 -c -s cpu -p 9013 1 1
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
9013 oracle 3463M 2928M sleep 53 0 0:00:35 0.9% oracle/2
Total: 1 processes, 2 lwps, load averages: 2.25, 2.32, 2.40
This basically reports the CPU usage for a given process ID (in this case 9013). Given a list of all Oracle PIDs, I can use this command to get the CPU usage for each one, sum them up, and hey presto, I have my Oracle database CPU usage.
How can I get the same with AIX?
Thanks
You can try nmon or topas, which will show the current %CPU. You might also want to look into using WLM to create a class for all the Oracle processes, then use wlmstat to see the CPU usage for that class. That would save you the trouble of adding them up manually.
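If you just want a quick total without setting up WLM, a rough sketch (assuming AIX's ps accepts the POSIX-style -o format fields pcpu and user, and that all the database processes run as the oracle user) would be:
# sum the %CPU column of every process owned by the oracle user
ps -eo pcpu,user | awk '$2 == "oracle" {sum += $1} END {print sum "%"}'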
I know that using Activity Monitor we can see CPU utilisation, but I want to get this information from my remote system through a script. The top command is not helpful for me. Please suggest any other way to get it.
What is the objection to top in logging mode?
top -l 1 | grep -E "^CPU|^Phys"
CPU usage: 3.27% user, 14.75% sys, 81.96% idle
PhysMem: 5807M used (1458M wired), 10G unused.
Or use sysctl
sysctl vm.loadavg
vm.loadavg: { 1.31 1.85 2.00 }
Or use vm_stat
vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free: 3569.
Pages active: 832177.
Pages inactive: 283212.
Pages speculative: 2699727.
Pages throttled: 0.
Pages wired down: 372883.
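Since the question is about a remote system, any of these can be run from a script over ssh, for example (a minimal sketch; user@remote.host is a placeholder for your remote Mac):
ssh user@remote.host 'top -l 1 | grep -E "^CPU|^Phys"'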
I'm using puppet as a provisioner for Vagrant, and am coming across an issue where Puppet will hang for an extremely long time when I do a "vagrant provision". Building the box from scratch using "vagrant up" doesn't seem to be a problem, only subsequent provisions.
If I turn puppet debug on and watch where it hangs, it seems to stop at various, seemingly arbitrary, points, the first of which is:
Info: Applying configuration version '1401868442'
Debug: Prefetching yum resources for package
Debug: Executing '/bin/rpm --version'
Debug: Executing '/bin/rpm -qa --nosignature --nodigest --qf '%{NAME} %|EPOCH?{%{EPOCH}}:{0}| %{VERSION} %{RELEASE} %{ARCH}\n''
Executing this command on the server myself returns immediately.
Eventually, it gets past this and continues. Using the summary option, I get the following, after waiting for a very long time for it to complete:
Debug: Finishing transaction 70191217833880
Debug: Storing state
Debug: Stored state in 9.39 seconds
Notice: Finished catalog run in 1493.99 seconds
Changes:
Total: 2
Events:
Failure: 2
Success: 2
Total: 4
Resources:
Total: 18375
Changed: 2
Failed: 2
Skipped: 35
Out of sync: 4
Time:
User: 0.00
Anchor: 0.01
Schedule: 0.01
Yumrepo: 0.07
Augeas: 0.12
Package: 0.18
Exec: 0.96
Service: 1.07
Total: 108.93
Last run: 1401869964
Config retrieval: 16.49
Mongodb database: 3.99
File: 76.60
Mongodb user: 9.43
Version:
Config: 1401868442
Puppet: 3.4.3
This doesn't seem very helpful to me, as the time totals 108 seconds, so where have the other 1385 seconds gone?
Throughout, Puppet seems to be hammering the box, using up a lot of CPU, but still doesn't seem to advance. The memory it uses seems to continually increase. When I kick off the command, top looks like this:
Cpu(s): 10.2%us, 2.2%sy, 0.0%ni, 85.5%id, 2.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4956928k total, 2849296k used, 2107632k free, 63464k buffers
Swap: 950264k total, 26688k used, 923576k free, 445692k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28486 root 20 0 439m 334m 3808 R 97.5 6.9 2:02.92 puppet
22 root 20 0 0 0 0 S 1.3 0.0 0:07.55 kblockd/0
18276 mongod 20 0 788m 31m 3040 S 1.3 0.6 2:31.82 mongod
20756 jboss-as 20 0 3081m 1.5g 21m S 1.3 31.4 7:13.15 java
20930 elastics 20 0 2340m 236m 6580 S 1.0 4.9 1:44.80 java
266 root 20 0 0 0 0 S 0.3 0.0 0:03.85 jbd2/dm-0-8
22717 vagrant 20 0 98.0m 2252 1276 S 0.3 0.0 0:01.81 sshd
28762 vagrant 20 0 15036 1228 932 R 0.3 0.0 0:00.10 top
1 root 20 0 19364 1180 964 S 0.0 0.0 0:00.86 init
To me, this seems fine: there's over 2 GB of available memory and plenty of available swap. I have a max open files limit of 1024.
About 10-15 minutes later, still no advance in the console output, but top looks like this:
Cpu(s): 11.2%us, 1.6%sy, 0.0%ni, 86.9%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 4956928k total, 3834376k used, 1122552k free, 64248k buffers
Swap: 950264k total, 24408k used, 925856k free, 445728k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28486 root 20 0 1397m 1.3g 3808 R 99.6 26.7 15:16.19 puppet
18276 mongod 20 0 788m 31m 3040 R 1.7 0.6 2:45.03 mongod
20756 jboss-as 20 0 3081m 1.5g 21m S 1.3 31.4 7:25.93 java
20930 elastics 20 0 2340m 238m 6580 S 0.7 4.9 1:52.03 java
8486 root 20 0 308m 952 764 S 0.3 0.0 0:06.03 VBoxService
As you can see, puppet is now using a lot more of the memory, and it seems to continue in this fashion. The box it's building has 5GB of RAM, so I wouldn't have expected it to have memory issues. However, further down the line, after a long wait, I do get "Cannot allocate memory - fork(2)"
Running ulimit -a, I get:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 38566
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Which, again, looks fine to me...
To be honest, I'm completely at a loss as to how to go about solving this, or what is causing it.
Any help or insight would be greatly appreciated!
EDIT:
So I managed to fix this eventually... It came down to using recurse with a file directive for a large directory. The target directory in question contained around 2GB worth of files, and puppet took a huge amount of time loading this into memory and doing its hashes and comparisons. The first time I stood the server up, the directory was relatively empty, so the check was quick, but then other resources were placed in it that increased its size massively, meaning subsequent runs took much longer.
The memory error that eventually was thrown was because, I can only assume, Puppet was loading the whole thing into memory in order to do its stuff...
I found a way around using the recurse function, and am now trying to avoid it like the plague...
Yeah, the problem with the recurse parameter on the file type is that it checks every single file's checksum, which on a massive directory adds up real quick.
As Felix suggests, using checksum => none is one way to fix it; another is to accomplish the task you're trying to do (say, chmod or chown a whole directory) with an exec performing the native task, with an unless to check whether it's already been done.
Something like:
define check_mode($mode) {
exec { "/bin/chmod $mode $name":
unless => "/bin/sh -c '[ $(/usr/bin/stat -c %a $name) == $mode ]'",
}
}
Taken from http://projects.puppetlabs.com/projects/1/wiki/File_Permission_Check_Patterns
I have a Hadoop cluster that we assume is performing pretty "bad". The nodes are pretty beefy: 24 cores, 60+ GB RAM, etc. And we are wondering if there is some basic Linux/Hadoop default configuration that prevents Hadoop from fully utilizing our hardware.
There is a post here that described a few possibilities that I think might be true.
I tried logging in to the namenode as root, hdfs, and also myself, and looking at the output of lsof and the ulimit settings. Here is the output; can anyone help me understand why the setting doesn't match the number of open files?
For example, when I logged in as root, the lsof output looks like this:
[root@box ~]# lsof | awk '{print $3}' | sort | uniq -c | sort -nr
7256 cloudera-scm
3910 root
2173 oracle
1886 hbase
1575 hue
1180 hive
801 mapred
470 oozie
427 yarn
418 hdfs
244 oragrid
241 zookeeper
94 postfix
87 httpfs
...
But when I check out the ulimit output, it looks like this:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 806018
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I am assuming there should be no more than 1024 files opened by one user; however, when you look at the output of lsof, there are 7000+ files opened by one user. Can anyone help explain what is going on here?
Correct me if I have made any mistake in understanding the relation between ulimit and lsof.
Many thanks!
You need to check limits for the process. It may be different from your shell session:
Ex:
[root@ADWEB_HAPROXY3 ~]# cat /proc/$(pidof haproxy)/limits | grep open
Max open files 65536 65536 files
[root@ADWEB_HAPROXY3 ~]# ulimit -n
4096
In my case haproxy has a directive in its config file to change the maximum number of open files; there should be something similar for Hadoop as well.
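As a rough sketch of checking this for the Hadoop daemons themselves (assuming they run as the hdfs, yarn and mapred users and that pgrep is available; adjust the user list for your distribution):
# print the 'Max open files' line for every java process owned by the Hadoop service users
for pid in $(pgrep -u hdfs,yarn,mapred java); do
    echo "== PID $pid =="
    grep 'Max open files' /proc/$pid/limits
done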
I had a very similar issue, which caused the cluster's YARN TimeLine server to stop after reaching the magical 1024 open files limit and crashing with "too many open files" errors.
After some investigation it turned out that it had some serious issues dealing with too many files in the TimeLine's LevelDB store. For some reason YARN ignored the yarn.timeline-service.entity-group-fs-store.retain-seconds setting (by default it's set to 7 days, i.e. 604800 seconds). We had LevelDB files dating back over a month.
What seriously helped was applying the fix described here: https://community.hortonworks.com/articles/48735/application-timeline-server-manage-the-size-of-the.html
Basically, there are a couple of options I tried:
Shrink the TTL (time to live) settings. First, enable TTL:
<property>
<description>Enable age off of timeline store data.</description>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
Then set yarn.timeline-service.ttl-ms (set it to some low value for a period of time):
<property>
<description>Time to live for timeline store data in milliseconds.</description>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>
The second option, as described, is to stop the TimeLine server, delete the whole LevelDB store, and restart the server. This will start the ATS database from scratch. It works fine if the other options failed for you.
To do it, find the database location from yarn.timeline-service.leveldb-timeline-store.path, back it up and remove all subfolders from it. This operation will require root access to the server where TimeLine is located.
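A rough sketch of that cleanup (the /hadoop/yarn/timeline path is only a placeholder, substitute whatever yarn.timeline-service.leveldb-timeline-store.path points to on your cluster, and stop/start the TimeLine server through whatever manages it, e.g. Ambari):
# back up the LevelDB store, then clear it so ATS starts with an empty database
TIMELINE_DIR=/hadoop/yarn/timeline      # placeholder, check yarn-site.xml
tar czf /root/timeline-leveldb-backup.tar.gz "$TIMELINE_DIR"
rm -rf "${TIMELINE_DIR:?}"/*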
Hope it helps.