Elasticsearch increase total_in_bytes Memory - elasticsearch

I'm running a single node on an Ubuntu machine. Currently I don't have enough space on my node.
"mem" : {
  "total_in_bytes" : 25217441792,
  "free_in_bytes" : 674197504,
  "used_in_bytes" : 24543244288,
  "free_percent" : 3,
  "used_percent" : 97
}
On the other hand, when I execute the command df -h, I see that I still have enough space on my Linux server:
Filesystem                         Size  Used Avail Use% Mounted on
udev                                12G   12G     0 100% /dev
tmpfs                              2,4G  1,5M  2,4G   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv  158G  110G   42G  73% /      <=====
tmpfs                               12G     0   12G   0% /dev/shm
tmpfs                              5,0M     0  5,0M   0% /run/lock
tmpfs                               12G     0   12G   0% /sys/fs/cgroup
/dev/sda2                          974M  304M  603M  34% /boot
/dev/mapper/ubuntu--vg-lv--opt     206G   70G  127G  36% /opt   <=====
tmpfs                              2,4G     0  2,4G   0% /run/user/1000
/dev/loop0                          64M   64M     0 100% /snap/core20/1634
/dev/loop10                         56M   56M     0 100% /snap/core18/2620
/dev/loop8                          64M   64M     0 100% /snap/core20/1695
/dev/loop2                          56M   56M     0 100% /snap/core18/2632
/dev/loop7                          92M   92M     0 100% /snap/lxd/23991
/dev/loop3                          50M   50M     0 100% /snap/snapd/17883
/dev/loop4                          92M   92M     0 100% /snap/lxd/24061
How can I increase the value of total_in_bytes?
Thanks.

df -h shows disk space, not memory.
If you want to increase the heap memory, you can modify the jvm.options file:
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms4g
-Xmx4g
Here the heap will be 4 GB. The recommendation is to set this parameter to 50% of the node's physical RAM, and to no more than 32 GB.
You can read more about memory in Elasticsearch here:
https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory
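As a rough illustration of the 50% rule, here is a minimal Python sketch. It assumes that total_in_bytes in the question reports the machine's physical RAM. The helper name suggested_heap_gib and the 31 GiB cap (picked only to stay under the 32 GB limit mentioned above) are my own assumptions, not Elasticsearch settings.

```python
GIB = 1024 ** 3


def suggested_heap_gib(total_ram_bytes, cap_gib=31):
    """Half of the physical RAM in whole GiB, capped below 32 GiB.

    cap_gib is an assumed safety margin, not an official value.
    """
    half_gib = total_ram_bytes // 2 // GIB
    return max(1, min(half_gib, cap_gib))


# total_in_bytes taken from the question
total_in_bytes = 25217441792
heap = suggested_heap_gib(total_in_bytes)

# Lines in the form expected by jvm.options
print(f"-Xms{heap}g")
print(f"-Xmx{heap}g")
```

For the node in the question this prints -Xms11g and -Xmx11g. Note that this changes only the JVM heap; the sketch does not change total_in_bytes itself.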

Related

ansible print the stdout_lines in csv file in the exact format that playbook prints on the console

I have the following Ansible playbook, which prints some metrics of a remote server. I want to write the output to a CSV file in the exact msg format shown in the output below. How can I do this?
Ansible playbook:
tasks:
  - name: Get ip address of the remote node
    ansible.builtin.shell: hostname -i | awk '{print $2}'
    register: ipaddr
  - name: Check uptime
    shell: uptime | cut -d',' -f1
    register: uptime_op
  - debug:
      msg: "{{uptime_op.stdout_lines}}"
  - name: Get lsbkl value
    shell: lsblk
    register: lsblk_output
  - debug:
      msg: "{{lsblk_output.stdout_lines}}"
  - name: Get Disc space value
    shell: df -h
    register: df_output
  - debug:
      msg: "{{df_output.stdout_lines}}"
output:
PLAY [test_host] *************************************************************************************************************
TASK [Gathering Facts] ******************************************************************************************************
Tuesday 20 December 2022 10:07:07 -0800 (0:00:00.017) 0:00:00.017 ******
ok: [hostname.domain.com]
TASK [Get ip address of the remote node] ************************************************************************************
Tuesday 20 December 2022 10:07:14 -0800 (0:00:07.399) 0:00:07.417 ******
changed: [hostname.domain.com]
TASK [Check uptime] *********************************************************************************************************
Tuesday 20 December 2022 10:07:18 -0800 (0:00:03.860) 0:00:11.278 ******
changed: [hostname.domain.com]
TASK [debug] ****************************************************************************************************************
Tuesday 20 December 2022 10:07:22 -0800 (0:00:03.781) 0:00:15.059 ******
ok: [hostname.domain.com] => {
"msg": [
" 23:37pm up 359 days 5:53"
]
}
TASK [Get lsbkl value] ******************************************************************************************************
Tuesday 20 December 2022 10:07:22 -0800 (0:00:00.086) 0:00:15.145 ******
changed: [hostname.domain.com]
TASK [debug] ****************************************************************************************************************
Tuesday 20 December 2022 10:07:26 -0800 (0:00:03.815) 0:00:18.960 ******
ok: [hostname.domain.com] => {
"msg": [
"NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT",
"sda 8:0 0 1.1T 0 disk ",
"├─sda1 8:1 0 15G 0 part /",
"├─sda2 8:2 0 518M 0 part /boot/efi",
"├─sda3 8:3 0 1K 0 part ",
"├─sda5 8:5 0 2G 0 part /ctools",
"├─sda6 8:6 0 10G 0 part /var",
"├─sda7 8:7 0 48G 0 part [SWAP]",
"├─sda8 8:8 0 250M 0 part /dsm",
"├─sda9 8:9 0 501M 0 part /var/cfengine",
"├─sda10 8:10 0 10G 0 part /tmp",
"└─sda11 8:11 0 1T 0 part /infrastructure",
"sdb 8:16 0 1.8T 0 disk ",
"├─sdb1 8:17 0 484.3G 0 part /p4depot",
"├─sdb2 8:18 0 931.3G 0 part /p4meta",
"└─sdb3 8:19 0 372.9G 0 part /p4log"
]
}
TASK [Get Disc space value] *************************************************************************************************
Tuesday 20 December 2022 10:07:26 -0800 (0:00:00.088) 0:00:19.049 ******
changed: [hostname.domain.com]
TASK [debug] ****************************************************************************************************************
Tuesday 20 December 2022 10:07:30 -0800 (0:00:03.787) 0:00:22.836 ******
ok: [hostname.domain.com] => {
"msg": [
"Filesystem Size Used Avail Use% Mounted on",
"devtmpfs 189G 8.0K 189G 1% /dev",
"tmpfs 189G 0 189G 0% /dev/shm",
"tmpfs 189G 4.0G 185G 3% /run",
"tmpfs 189G 0 189G 0% /sys/fs/cgroup",
"/dev/sda1 15G 11G 4.8G 69% /",
"/dev/sda2 518M 0 518M 0% /boot/efi",
"/dev/sda10 10G 83M 10G 1% /tmp",
"/dev/sda11 1.1T 34M 1.1T 1% /infrastructure",
"/dev/sda8 247M 62M 185M 25% /dsm",
"/dev/sda6 10G 1.5G 8.6G 15% /var",
"/dev/sda9 498M 119M 379M 24% /var/cfengine",
"/dev/sdb2 931G 30G 902G 4% /p4meta",
"/dev/sdb3 373G 61M 373G 1% /p4log",
"/dev/sdb1 485G 112G 373G 23% /p4depot",
"/dev/sda5 2.1G 3.6M 1.8G 1% /ctools",
"tmpfs 1.0G 0 1.0G 0% /dsm/tmp/dsmbg.tmpfs",
"10.223.232.121:/new_itools 951G 497G 454G 53% /nfs/site/itools",
"incfs03n03b-04:/common_usr_local 11G 1.2G 8.9G 12% /nfs/iind/local",
"incfs04n08b-1:/prod 513M 1.5M 512M 1% /nfs/iind/proj/prod",
"incfs06n11b-1:/home0 351G 149G 202G 43% /nfs/iind/disks/home23",
"incfs02n10a-1:/iind_disks_home24 501G 59G 442G 12% /nfs/iind/disks/home24",
"incfs06n04a-05:/iind_gen_adm 301G 176G 125G 59% /nfs/site/gen/adm",
"incfs03n06b-1:/ba_ctg_home01 301G 263G 38G 88% /nfs/iind/disks/home110",
"inc08n07b-1:/home_tree 11G 79M 10G 1% /nfs/iind/home",
"incfs06n10a-1:/iind_gen_adm_netmeter_m 81G 28G 53G 35% /nfs/iind/disks/iind_gen_adm_netmeter",
"tmpfs 38G 0 38G 0% /run/user/37124",
"incfs07n05b-1:/common 201G 158G 43G 79% /nfs/site/disks/iind_gen_adm_common",
"tmpfs 38G 0 38G 0% /run/user/12142325"
]
}
PLAY RECAP ******************************************************************************************************************
hostname.domain.com : ok=8 changed=4 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
I'm attaching the expected CSV file to show how it should look.
Given the registered data df_output.stdout_lines, there must also be a df_output.stdout attribute. Use the filter community.general.jc to parse the registered data:
- set_fact:
    df: "{{ df_output.stdout|community.general.jc('df') }}"
gives
df:
  - available: 189
    filesystem: devtmpfs
    mounted_on: /dev
    size: 189G
    use_percent: 1
    used: 8
  - available: 189
    filesystem: tmpfs
    mounted_on: /dev/shm
    size: 189G
    use_percent: 0
    used: 0
  ...
Then, for each host, create a CSV file on the controller. For example,
- copy:
    dest: "/tmp/ansible_df_{{ item }}.csv"
    content: |
      {{ df_output.stdout_lines.0.split()[:-1]|join(',') }}
      {% for m in hostvars[item]['df'] %}
      {{ m.filesystem }},{{ m.size }},{{ m.used }},{{m.available }},{{ m.use_percent }},{{ m.mounted_on }}
      {% endfor %}
  loop: "{{ ansible_play_hosts }}"
  run_once: true
  delegate_to: localhost
will create
shell> cat /tmp/ansible_df_localhost.csv
Filesystem,Size,Used,Avail,Use%,Mounted
devtmpfs,189G,8,189,1,/dev
tmpfs,189G,0,189,0,/dev/shm
tmpfs,189G,4,185,3,/run
tmpfs,189G,0,189,0,/sys/fs/cgroup
/dev/sda1,15G,11,4,69,/
/dev/sda2,518M,0,518,0,/boot/efi
/dev/sda10,10G,83,10,1,/tmp
/dev/sda11,1.1T,34,1,1,/infrastructure
/dev/sda8,247M,62,185,25,/dsm
/dev/sda6,10G,1,8,15,/var
/dev/sda9,498M,119,379,24,/var/cfengine
/dev/sdb2,931G,30,902,4,/p4meta
/dev/sdb3,373G,61,373,1,/p4log
/dev/sdb1,485G,112,373,23,/p4depot
/dev/sda5,2.1G,3,1,1,/ctools
tmpfs,1.0G,0,1,0,/dsm/tmp/dsmbg.tmpfs
10.223.232.121:/new_itools,951G,497,454,53,/nfs/site/itools
incfs03n03b-04:/common_usr_local,11G,1,8,12,/nfs/iind/local
incfs04n08b-1:/prod,513M,1,512,1,/nfs/iind/proj/prod
incfs06n11b-1:/home0,351G,149,202,43,/nfs/iind/disks/home23
incfs02n10a-1:/iind_disks_home24,501G,59,442,12,/nfs/iind/disks/home24
incfs06n04a-05:/iind_gen_adm,301G,176,125,59,/nfs/site/gen/adm
incfs03n06b-1:/ba_ctg_home01,301G,263,38,88,/nfs/iind/disks/home110
inc08n07b-1:/home_tree,11G,79,10,1,/nfs/iind/home
incfs06n10a-1:/iind_gen_adm_netmeter_m,81G,28,53,35,/nfs/iind/disks/iind_gen_adm_netmeter
tmpfs,38G,0,38,0,/run/user/37124
incfs07n05b-1:/common,201G,158,43,79,/nfs/site/disks/iind_gen_adm_common
tmpfs,38G,0,38,0,/run/user/12142325
Given the data for testing
shell> cat data.json
{
"df_stdout_lines": [
"Filesystem Size Used Avail Use% Mounted on",
"devtmpfs 189G 8.0K 189G 1% /dev",
"tmpfs 189G 0 189G 0% /dev/shm",
"tmpfs 189G 4.0G 185G 3% /run",
"tmpfs 189G 0 189G 0% /sys/fs/cgroup",
"/dev/sda1 15G 11G 4.8G 69% /",
"/dev/sda2 518M 0 518M 0% /boot/efi",
"/dev/sda10 10G 83M 10G 1% /tmp",
"/dev/sda11 1.1T 34M 1.1T 1% /infrastructure",
"/dev/sda8 247M 62M 185M 25% /dsm",
"/dev/sda6 10G 1.5G 8.6G 15% /var",
"/dev/sda9 498M 119M 379M 24% /var/cfengine",
"/dev/sdb2 931G 30G 902G 4% /p4meta",
"/dev/sdb3 373G 61M 373G 1% /p4log",
"/dev/sdb1 485G 112G 373G 23% /p4depot",
"/dev/sda5 2.1G 3.6M 1.8G 1% /ctools",
"tmpfs 1.0G 0 1.0G 0% /dsm/tmp/dsmbg.tmpfs",
"10.223.232.121:/new_itools 951G 497G 454G 53% /nfs/site/itools",
"incfs03n03b-04:/common_usr_local 11G 1.2G 8.9G 12% /nfs/iind/local",
"incfs04n08b-1:/prod 513M 1.5M 512M 1% /nfs/iind/proj/prod",
"incfs06n11b-1:/home0 351G 149G 202G 43% /nfs/iind/disks/home23",
"incfs02n10a-1:/iind_disks_home24 501G 59G 442G 12% /nfs/iind/disks/home24",
"incfs06n04a-05:/iind_gen_adm 301G 176G 125G 59% /nfs/site/gen/adm",
"incfs03n06b-1:/ba_ctg_home01 301G 263G 38G 88% /nfs/iind/disks/home110",
"inc08n07b-1:/home_tree 11G 79M 10G 1% /nfs/iind/home",
"incfs06n10a-1:/iind_gen_adm_netmeter_m 81G 28G 53G 35% /nfs/iind/disks/iind_gen_adm_netmeter",
"tmpfs 38G 0 38G 0% /run/user/37124",
"incfs07n05b-1:/common 201G 158G 43G 79% /nfs/site/disks/iind_gen_adm_common",
"tmpfs 38G 0 38G 0% /run/user/12142325"
]
}
Example of a complete playbook for testing
- hosts: localhost
  vars_files:
    - data.json
  vars:
    df_output:
      stdout: |
        Filesystem Size Used Avail Use% Mounted on
        devtmpfs 189G 8.0K 189G 1% /dev
        tmpfs 189G 0 189G 0% /dev/shm
        tmpfs 189G 4.0G 185G 3% /run
        tmpfs 189G 0 189G 0% /sys/fs/cgroup
        /dev/sda1 15G 11G 4.8G 69% /
        /dev/sda2 518M 0 518M 0% /boot/efi
        /dev/sda10 10G 83M 10G 1% /tmp
        /dev/sda11 1.1T 34M 1.1T 1% /infrastructure
        /dev/sda8 247M 62M 185M 25% /dsm
        /dev/sda6 10G 1.5G 8.6G 15% /var
        /dev/sda9 498M 119M 379M 24% /var/cfengine
        /dev/sdb2 931G 30G 902G 4% /p4meta
        /dev/sdb3 373G 61M 373G 1% /p4log
        /dev/sdb1 485G 112G 373G 23% /p4depot
        /dev/sda5 2.1G 3.6M 1.8G 1% /ctools
        tmpfs 1.0G 0 1.0G 0% /dsm/tmp/dsmbg.tmpfs
        10.223.232.121:/new_itools 951G 497G 454G 53% /nfs/site/itools
        incfs03n03b-04:/common_usr_local 11G 1.2G 8.9G 12% /nfs/iind/local
        incfs04n08b-1:/prod 513M 1.5M 512M 1% /nfs/iind/proj/prod
        incfs06n11b-1:/home0 351G 149G 202G 43% /nfs/iind/disks/home23
        incfs02n10a-1:/iind_disks_home24 501G 59G 442G 12% /nfs/iind/disks/home24
        incfs06n04a-05:/iind_gen_adm 301G 176G 125G 59% /nfs/site/gen/adm
        incfs03n06b-1:/ba_ctg_home01 301G 263G 38G 88% /nfs/iind/disks/home110
        inc08n07b-1:/home_tree 11G 79M 10G 1% /nfs/iind/home
        incfs06n10a-1:/iind_gen_adm_netmeter_m 81G 28G 53G 35% /nfs/iind/disks/iind_gen_adm_netmeter
        tmpfs 38G 0 38G 0% /run/user/37124
        incfs07n05b-1:/common 201G 158G 43G 79% /nfs/site/disks/iind_gen_adm_common
        tmpfs 38G 0 38G 0% /run/user/12142325
      stdout_lines: "{{ df_stdout_lines }}"
  tasks:
    - debug:
        var: df_output.stdout_lines
    - debug:
        var: df_output.stdout
    - set_fact:
        df: "{{ df_output.stdout|community.general.jc('df') }}"
    - debug:
        var: df
    - copy:
        dest: "/tmp/ansible_df_{{ item }}.csv"
        content: |
          {{ df_output.stdout_lines.0.split()[:-1]|join(',') }}
          {% for m in hostvars[item]['df'] %}
          {{ m.filesystem }},{{ m.size }},{{ m.used }},{{m.available }},{{ m.use_percent }},{{ m.mounted_on }}
          {% endfor %}
      loop: "{{ ansible_play_hosts }}"
      run_once: true
      delegate_to: localhost
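The generated CSV shown earlier has lost the unit suffixes in the used and available columns (8.0K became 8). If the original strings are needed, one option is to convert the raw df -h text outside Ansible. The following is only a minimal Python sketch of that idea, not part of the playbook. The function name df_to_csv is hypothetical, and the sketch assumes that filesystem names contain no whitespace.

```python
import csv
import io


def df_to_csv(df_stdout):
    """Convert the text printed by `df -h` into CSV, keeping unit suffixes."""
    lines = [line for line in df_stdout.splitlines() if line.strip()]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # The header is written by hand because "Mounted on" contains a space.
    writer.writerow(["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"])
    for line in lines[1:]:
        # At most six fields; anything after the fifth gap is the mount point.
        writer.writerow(line.split(None, 5))
    return out.getvalue()


sample = (
    "Filesystem Size Used Avail Use% Mounted on\n"
    "devtmpfs 189G 8.0K 189G 1% /dev\n"
    "/dev/sda1 15G 11G 4.8G 69% /\n"
)
print(df_to_csv(sample), end="")
```

Unlike the jc-based approach, this keeps every value as the string df printed, so nothing is parsed into numbers.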

how to increase disk space of virtual machine created by vagrant

I use Vagrant to create CentOS virtual machines with the following Vagrantfile:
Vagrant.configure("2") do |config|
  (2..4).each do |i|
    config.vm.define "node#{i}" do |node|
      node.vm.provider "virtualbox" do |v|
        v.name = "node#{i}"
        v.memory = 3072
        v.cpus = 2
        config.disksize.size = '20GB'
      end
      node.vm.box = "cnode"
      node.vm.hostname = "node#{i}"
      node.vm.network :private_network, ip: "192.168.3.#{i}"
    end
  end
end
But / space is only 8.4GB:
[vagrant@node2 opt]$ df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  8.4G  1.1G  7.4G  13% /
devtmpfs                 1.4G     0  1.4G   0% /dev
tmpfs                    1.4G     0  1.4G   0% /dev/shm
tmpfs                    1.4G  8.3M  1.4G   1% /run
tmpfs                    1.4G     0  1.4G   0% /sys/fs/cgroup
/dev/sda1                497M  118M  379M  24% /boot
none                     234G  148G   87G  64% /vagrant
config.disksize.size is a parameter from the third-party plugin https://github.com/sprotheroe/vagrant-disksize
As mentioned in the README:
Depending on the guest, you may need to resize the partition and the
filesystem from within the guest. At present the plugin only resizes
the underlying disk.
You can refer to the question "vagrant no space left on device" for the steps to resize the partition.

How to tell sed to make changes only to 1st column of an output

So I have an output with 6 columns, and what I want to do is, ONLY for the first column, delete everything up to and including the last slash ("/").
What I have so far is this:
df -k | awk '{print $1}' | sed 's#.*/##'
But I don't want to use awk there to extract the first column like this; I want to find a way to tell sed to make these changes to the first column only.
So the original output is like this:
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0d0s0 12324895 5082804 7118843 42% /
/devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 8998420 1052 8997368 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap1.so.1 12324895 5082804 7118843 42% /lib/libc.so.1
fd 0 0 0 0% /dev/fd
/dev/dsk/c0d0s3 4136995 146364 3949262 4% /var
swap 9145604 148236 8997368 2% /tmp
swap 8997400 32 8997368 1% /var/run
and i want the first column to look like this:
Filesystem
c0d0s0
devices
ctfs
proc
mnttab
swap
objfs
sharefs
libc_hwcap1.so.1
fd
c0d0s3
swap
swap
$ awk '{sub(/.*\//,"",$1)}1' file
Filesystem kbytes used avail capacity Mounted on
c0d0s0 12324895 5082804 7118843 42% /
devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 8998420 1052 8997368 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
libc_hwcap1.so.1 12324895 5082804 7118843 42% /lib/libc.so.1
fd 0 0 0 0% /dev/fd
c0d0s3 4136995 146364 3949262 4% /var
swap 9145604 148236 8997368 2% /tmp
swap 8997400 32 8997368 1% /var/run
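To keep the columns aligned, rename the "Mounted on" header to a single field and pipe the result through column -t: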
$ awk 'NR==1{sub(/Mounted on/,"Mounted_on")} {sub(/.*\//,"",$1)}1' file | column -t
Filesystem kbytes used avail capacity Mounted_on
c0d0s0 12324895 5082804 7118843 42% /
devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 8998420 1052 8997368 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
libc_hwcap1.so.1 12324895 5082804 7118843 42% /lib/libc.so.1
fd 0 0 0 0% /dev/fd
c0d0s3 4136995 146364 3949262 4% /var
swap 9145604 148236 8997368 2% /tmp
swap 8997400 32 8997368 1% /var/run
Just split the first field in /-slices and replace the first field with the last of these slices whenever it occurs as the first part of the line:
awk '{n=split($1,a,"/"); gsub("^"$1,a[n])}1' file
Test
$ awk '{n=split($1,a,"/"); gsub("^"$1,a[n])}1' file
Filesystem kbytes used avail capacity Mounted on
c0d0s0 12324895 5082804 7118843 42% /
devices 0 0 0 0% /devices
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 8998420 1052 8997368 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
libc_hwcap1.so.1 12324895 5082804 7118843 42% /lib/libc.so.1
fd 0 0 0 0% /dev/fd
c0d0s3 4136995 146364 3949262 4% /var
swap 9145604 148236 8997368 2% /tmp
swap 8997400 32 8997368 1% /var/run
Note that awk '{n=split($1,a,"/"); $1=a[n]}1' would also work, but the original spacing would be lost, because awk rebuilds the whole record when one of its fields is modified.
df -k | awk '{print $1}' | perl -pe 's/^[\S]*\///g'
or
df -k | awk '{print $1}' | perl -lane '$F[0]=~s/.*\///g;print "@F"'
df -k|awk -F' ' '{print $1}'|sed "s/.*\///g"
sed solution
$ sed -r 's~.*/(\S+) ~\1~' file
or
$ sed -r 's~.*/(\S+)\s~\1~' file
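For comparison, the same "first column only" idea can be sketched in Python. This is just an illustration, not one of the answers above; the function name strip_dir_from_first_column is made up. The pattern is anchored at the start of the line and cannot cross whitespace, so only the first column is touched and the spacing of the rest of the line is kept.

```python
import re


def strip_dir_from_first_column(line):
    """Remove everything up to the last slash of the first column only."""
    # ^\S*/ : the leading run of non-space characters, up to its last slash
    return re.sub(r"^\S*/", "", line, count=1)


lines = [
    "Filesystem kbytes used avail capacity Mounted on",
    "/dev/dsk/c0d0s0 12324895 5082804 7118843 42% /",
    "swap 8998420 1052 8997368 1% /etc/svc/volatile",
]
for line in lines:
    print(strip_dir_from_first_column(line))
```

Lines whose first column contains no slash (such as the swap line) pass through unchanged, even though the mount point in the last column contains slashes.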

Oracle 11gr2 failed check of kernel parameters on hp-ux

I'm installing Oracle 11gR2 on a 64-bit Itanium HP-UX (v 11.31) system (for HP Operations Manager 9).
In accordance with the installation requirements, I've changed the kernel parameters, but when I start the installation process it doesn't recognize them.
Below are the parameters that I've set:
Parameter         Manual                                      On server
---------------------------------------------------------------------------
fs_async          0                                           0
ksi_alloc_max     (nproc*8)                                   10240*8 = 81920
executable_stack  0                                           0
max_thread_proc   1024                                        3003
maxdsiz           0x40000000 (1073741824)                     2063835136
maxdsiz_64bit     0x80000000 (2147483648)                     2147483648
maxfiles          256 (a)                                     4096
maxssiz           0x8000000 (134217728)                       134217728
maxssiz_64bit     0x40000000 (1073741824)                     1073741824
maxuprc           ((nproc*9)/10)                              9216
msgmni            (nproc)                                     10240
msgtql            (nproc)                                     32000
ncsize            35840                                       95120
nflocks           (nproc)                                     10240
ninode            (8*nproc+2048)                              83968
nkthread          (((nproc*7)/4)+16)                          17936
nproc             4096                                        10240
semmni            (nproc)                                     10240
semmns            (semmni*2)                                  20480
semmnu            (nproc-4)                                   10236
semvmx            32767                                       65535
shmmax            size of memory or 0x40000000 (higher one)   1073741824
shmmni            4096                                        4096
shmseg            512                                         1024
vps_ceiling       64                                          64
In case it helps:
[root@HUG30055 /] # swapinfo
              Kb      Kb       Kb   PCT  START/      Kb
TYPE       AVAIL    USED     FREE  USED   LIMIT RESERVE  PRI  NAME
dev      4194304       0  4194304    0%       0       -    1  /dev/vg00/lvol2
dev      8388608       0  8388608    0%       0       -    1  /dev/vg00/lvol10
reserve        -  742156  -742156
memory   7972944 3011808  4961136   38%
[root@HUG30055 /] # bdf /tmp
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol6    4194304 1773864 2401576   42% /tmp
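As a sanity check on the arithmetic, the formula-based parameters in the table can be recomputed from the server's nproc and semmni values. The following is a small Python sketch of that check, with integer division assumed for the formulas. It only verifies that the listed values match their formulas; it does not explain why the installer rejects them.

```python
# Values reported on the server (from the table in the question)
nproc = 10240
semmni = 10240

# Formulas from the "Manual" column, using integer division (an assumption)
expected = {
    "ksi_alloc_max": nproc * 8,
    "maxuprc": (nproc * 9) // 10,
    "ninode": 8 * nproc + 2048,
    "nkthread": (nproc * 7) // 4 + 16,
    "semmns": semmni * 2,
    "semmnu": nproc - 4,
}

# Values actually set on the server
on_server = {
    "ksi_alloc_max": 81920,
    "maxuprc": 9216,
    "ninode": 83968,
    "nkthread": 17936,
    "semmns": 20480,
    "semmnu": 10236,
}

mismatches = {
    name: (expected[name], on_server[name])
    for name in expected
    if expected[name] != on_server[name]
}
print(mismatches or "all formula-based values match")
```

With the numbers above, every formula-based value matches what is set on the server.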

Find a local minimum in a special graph

The issue at hand looks easy, but I could not find an easy solution so far.
I've got a histogram describing the value distribution of an array of floats, roughly looking like this:
As you can see, there is a local maximum near 0, which keeps falling down to a local minimum, then rising quickly to a plateau, and in the end falling to 0. I would like to detect the local minimum.
In practice, the histogram is not as smooth:
There are lots of spikes, and the local minimum may be stretched and uneven. I'm not sure how to tackle this problem.
There is little domain knowledge. The first max may even be higher than the second max. There may be spikes in any direction, values may be as low as 0.
This is a real life sample taken from 8 distinct runs. It's scaled to 0 - 10 to make it easier to understand.
0: 22% 12% 19% 17% 6% 5% 6% 5%
1: 3% 2% 1% 1% 4% 1% 4% 1%
2: 6% 2% 13% 5% 0% 2% 0% 2%
3: 62% 62% 52% 42% 2% 5% 2% 5%
4: 4% 19% 12% 28% 10% 13% 10% 13%
5: 0% 0% 3% 29% 30% 29% 30%
6: 37% 31% 37% 30%
7: 1% 7% 1% 7%
8: 6% 1% 6% 1%
9:
10:
Values rounded down. Missing values denote no occurrence of any value.
Explanation of the first line:
0: 22% the initial max
1: 3% local min
2: 6% still min
3: 62% plateau max
4: 4% second min
5: 0% 0
6: no more values
7:
8:
9:
10:
For reference, a list of the same data, this time scaled to 0 - 100 (there were no values in the 90-100 range at all). I messed up on the formatting, but it should give a rough idea.
0: 0% 0% 0% 1% 0% 0% 0% 0%
1: 0% 1% 1% 3% 0% 0% 0% 0%
2: 1% 2% 1% 3% 0% 0% 0% 0%
3: 4% 2% 3% 3% 0% 1% 0% 1%
4: 6% 1% 3% 2% 0% 0% 0% 0%
5: 2% 0% 3% 1% 0% 0% 0% 0%
6: 1% 0% 2% 0% 0% 0% 0% 0%
7: 1% 0% 1% 0% 0% 0% 0% 0%
8: 1% 0% 1% 0% 0% 0% 0% 0%
9: 1% 0% 1% 0% 1% 0% 1% 0%
10: 1% 0% 0% 0% 1% 0% 1% 0%
11: 0% 0% 0% 0% 0% 0% 0% 0%
12: 0% 0% 0% 0% 0% 0% 0% 0%
13: 0% 0% 0% 0% 0% 0% 0% 0%
14: 0% 0% 0% 0% 0% 0% 0% 0%
15: 0% 0% 0% 0% 0% 0% 0% 0%
16: 0% 0% 0% 0% 0% 0% 0% 0%
17: 0% 0% 0% 0% 0% 0% 0% 0%
18: 0% 0% 0% 0% 0% 0% 0% 0%
19: 0% 0% 0% 0% 0% 0% 0% 0%
20: 0% 0% 0% 0% 0% 0% 0% 0%
21: 0% 0% 0% 0% 0% 0% 0% 0%
22: 0% 0% 0% 0% 0% 0% 0% 0%
23: 0% 0% 0% 0% 0% 0% 0% 0%
24: 0% 0% 1% 0% 0% 0% 0% 0%
25: 0% 0% 1% 0% 0% 0% 0% 0%
26: 0% 0% 1% 0% 0% 0% 0% 0%
27: 0% 0% 1% 0% 0% 0% 0% 0%
28: 1% 0% 2% 1% 0% 0% 0% 0%
29: 3% 0% 2% 2% 0% 0% 0% 0%
30: 7% 1% 3% 2% 0% 0% 0% 0%
31: 10% 2% 4% 3% 0% 0% 0% 0%
32: 10% 3% 4% 4% 0% 0% 0% 0%
33: 6% 6% 5% 5% 0% 0% 0% 0%
34: 5% 5% 4% 4% 0% 0% 0% 0%
35: 5% 8% 6% 3% 0% 0% 0% 0%
36: 5% 10% 6% 4% 0% 0% 0% 0%
37: 5% 9% 5% 3% 0% 0% 0% 0%
38: 3% 8% 5% 5% 0% 0% 0% 0%
39: 2% 5% 5% 5% 0% 0% 0% 0%
40: 1% 4% 4% 5% 0% 1% 0% 1%
41: 1% 3% 2% 5% 0% 1% 0% 1%
42: 0% 1% 1% 4% 0% 0% 0% 0%
43: 0% 2% 0% 4% 1% 1% 1% 1%
44: 0% 1% 0% 3% 1% 1% 1% 1%
45: 0% 1% 0% 1% 0% 1% 0% 1%
46: 0% 1% 0% 1% 1% 1% 1% 1%
47: 0% 1% 0% 0% 1% 1% 1% 1%
48: 0% 1% 0% 0% 1% 1% 1% 1%
49: 0% 0% 0% 1% 1% 1% 1% 1%
50: 0% 1% 1% 1% 1% 1%
51: 0% 0% 2% 1% 2% 1%
52: 0% 1% 2% 1% 2% 1%
53: 0% 0% 4% 2% 4% 2%
54: 0% 2% 2% 2% 2%
55: 0% 2% 2% 2% 2%
56: 0% 2% 3% 2% 3%
57: 0% 2% 4% 2% 4%
58: 4% 6% 4% 6%
59: 3% 3% 3% 3%
60: 5% 5% 5% 5%
61: 5% 7% 5% 7%
62: 3% 5% 3% 5%
63: 4% 3% 4% 3%
64: 5% 2% 5% 2%
65: 3% 2% 2% 2%
66: 5% 1% 5% 1%
67: 1% 0% 1% 0%
68: 1% 0% 1% 0%
69: 0% 1% 0% 1%
70: 0% 0% 0% 0%
71: 0% 0% 0% 0%
72: 0% 0% 0% 0%
73: 0% 1% 0% 1%
74: 0% 0% 0% 0%
75: 0% 0% 0% 0%
76: 0% 1% 0% 1%
77: 0% 0% 0% 0%
78: 0% 0% 0% 0%
79: 0% 0% 0% 0%
80: 0% 0% 0% 1%
81: 0% 0% 0% 0%
82: 0% 0% 0% 0%
83: 0% 0% 0% 0%
84: 0% 0% 0% 0%
85: 1% 1%
86: 0% 0%
87: 1% 1%
88: 1% 1%
89: 0% 0%
Your "true" histogram is low frequency. Your noise is high frequency. Low-pass filtering the data with an appropriate bandwidth filter will get rid of most of the noise.
Here's an algorithm:
1. Smooth your data set by calculating a moving average for a small window.
2. Test your smoothed data for local minima (i.e. any single datum that is smaller than its neighbours).
3. If there are more than two local minima, increase the window size, and go to step 1.
Update:
Having looked at the sample data you posted, I've realised that you need to detect minimal plateaus rather than just individual points, so step two in the algorithm should be tweaked to identify a point as part of a minimum if there are no neighbours with smaller values between the nearest higher value neighbours on either side. Then when counting minima in step 3, a minimal plateau should count as a single minimum.
I've tested this algorithm on your example datasets and it performs well, picking minima at: 18, 12, 15, 13, 23, 20, 23 and 20 for your datasets respectively.
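The steps above, including the plateau tweak, can be sketched in Python roughly as follows. This is a minimal illustration under my own assumptions, not the code used for the results quoted above. It uses a centred window that shrinks at the edges, grows the window by 2 each round, reports the centre of a minimal plateau, and compares smoothed values for exact equality. The function names are made up.

```python
def moving_average(values, window):
    """Centred moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def plateau_minima(values):
    """Centre index of every interior minimum; a flat run counts once."""
    minima = []
    i, n = 1, len(values)
    while i < n - 1:
        if values[i] < values[i - 1]:
            j = i
            # Extend over a run of equal values (a minimal plateau).
            while j + 1 < n and values[j + 1] == values[i]:
                j += 1
            # It is a minimum only if the run ends with a rise.
            if j + 1 < n and values[j + 1] > values[i]:
                minima.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return minima


def find_minima(hist, max_minima=2, start_window=3):
    """Widen the smoothing window until at most max_minima minima remain."""
    window = start_window
    while window <= len(hist):
        minima = plateau_minima(moving_average(hist, window))
        if len(minima) <= max_minima:
            return minima, window
        window += 2
    return [], window


# First run of the 0 - 10 table from the question
print(find_minima([22, 3, 6, 62, 4, 0]))
```

For the first run of the 0 - 10 table this returns ([1], 3), i.e. a single minimum at bin 1 with a window of 3. Real data with float noise may need a tolerance instead of exact equality when detecting plateaus.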
A possible heuristic: use a spline approximation to smooth the histogram, making it polynomial-like, and then look for a local minimum.
Note that this is only a heuristic and might fail, but I think it will provide a good solution in most cases.
This actually sounds rather like histogram-based image segmentation to me (although this is not an image, so it's really just histogram segmentation). Sounds weird, but bear with me.
Is what's important about the minimum the fact that it's a minimum, or that it divides the small maximum from the large maximum? If it's the fact that it divides the maxima, then segmentation is definitely what you want.
Have a look at K-means clustering. You'd have two clusters. It's not a terribly complicated procedure, but Wikipedia (and other sources) do a much better job of explaining it than I could, so I'll leave it to them.

Resources