Get the most frequently occurring number in a bash array

I needed to find the most frequent number in an array. I did it this way:
# our array; the most frequent value is 55
declare -a array=(44 55 55 55 66 66)
# counting unique values with uniq, then sorting the counts as numbers
array=($(printf "%s\n" "${array[@]}" | uniq -c | sort -n -r))
# printing the 2nd element of the array, as the first one is the number of occurrences
printf ${array[1]}
Is there a better/more beautiful way to do it, instead of building a weird array (2nd step) that mixes counts and numbers together?
And am I sorting correctly? (uniq returns values in two columns, so I'm not sure how sort chooses the column.)

If I had to do this in bash, I would use awk to skip sorting anything and just count the elements:
printf '%s\n' "${array[@]}" | awk '{
    if (++arr[$0] > max) {
        max = arr[$0]
        ans = $0
    }
}
END { print ans }'
You can also implement the same algorithm in bash 4 or later using an associative array:
# These don't strictly need to be initialized, but it's safer
# to ensure they don't already have values.
declare -A counts=()
max=0
ans=
for i in "${array[@]}"; do
    if ((++counts[$i] > max)); then
        max=${counts[$i]}
        ans=$i
    fi
done
printf '%s\n' "$ans"

If you don't want to use awk for this, you can still do it with sort and uniq, but be careful: the input must ALREADY be sorted before counting, otherwise it will not work. For instance:
declare -a array=(34 3 45 45 66 55 44 55 55 55 66 45 45 8 6 45 45 66 32 9 18)
printf "%s\n" "${array[@]}" | sort -n -r | uniq -c | sort -n -r | head -1 | awk '{print $2}'
For this input the code correctly extracts the most repeated number (45). Without the initial sort it would report 55 instead, which is wrong, because uniq only counts consecutive items; if duplicates are sparse, they are counted incorrectly.
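To see why the pre-sort matters, here is a minimal demonstration of uniq -c counting runs rather than totals:
printf '%s\n' 55 45 55 45 45 | uniq -c
   1 55
   1 45
   1 55
   2 45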
Regards!

A bit more verbose version of chepner's logic using associative arrays, for bash v4+. We build the associative array hashMap with each array element as the key and its occurrence count as the value. Once the map is built, we find the entry with the maximum count and retrieve its key.
#!/usr/bin/env bash
declare -a array=(44 55 55 55 66 66)
declare -A hashMap
declare -i max=0
for element in "${array[@]}"; do
    ((hashMap["$element"]++))
done
for key in "${!hashMap[@]}"; do
    (( "${hashMap[$key]}" > max )) && { max="${hashMap[$key]}"; element="$key"; }
done
printf '%d\n' "$element"

Another minimalist awk. The herestring feeds the whole array as a single line; for each field the count a[$i] is incremented, and mi tracks the field with the highest count seen so far (on the sample array this prints 55):
$ awk '{for(mi=i=1;i<=NF;i++) if(a[$mi]<++a[$i]) mi=i; print $mi}' <<< "${array[@]}"

Related

bash: get size and [x,y]-element from a list of different size arrays

BASH: I have a list of arrays of different sizes, sourced from an external config file:
declare -a line0=( 00 01 02 )
declare -a line1=( 10 11 )
...
declare -a line9=( 90 91 92 93 )
rows=9
Getting the size of a single array works as so:
${#line0[@]}
This is practical only for a few arrays, as in the example.
I need to get the array size in a for loop. I tried this:
for ((r=0;r<rows;r++)) do
    line="line$r"
    echo line:$line
    cols="${#$line[@]}"  # 1st assignment
    cols="${#line$r[@]}" # 2nd assignment
done
but got a 'bad substitution' error for both assignments.
Then, supposing the max cols value is known, I need to extract the single elements from the arrays with two nested loops. I tried this:
cols=4
for ((r=0;r<rows;r++)) do
    line="line$r"
    echo line:$line
    for ((c=0;c<cols;c++)) do
        val=${$line[$c]}  # 1st assignment
        val=${line$r[$c]} # 2nd assignment
        echo val:$val
    done
done
but got a 'bad substitution' error for both assignments.
Edit: what is the right method to get the size and the [x,y] element from a list of different-sized arrays?
I looked at other questions, but found solutions for same-sized arrays only.
You can make use of indirect reference with declare -n ref=varname.
Would you please try:
for ((r = 0; r < rows; r++)); do
    declare -n line="line$r"  # now "line" is a reference to the array
    cols="${#line[@]}"        # you can access the array via the name "line"
    echo "line:$r cols:$cols"
done
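One small caveat sketch (my addition, not something the answer relies on): when re-pointing a nameref inside a loop, it can be safer to drop the previous reference first with unset -n:
for ((r = 0; r < rows; r++)); do
    unset -n line             # drop the previous reference before re-pointing
    declare -n line="line$r"
    echo "line:$r cols:${#line[@]}"
done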
BTW it might be better to say rows=10, because the count of rows is 10 (IMHO).
Another method could be
#!/bin/bash
source configfile
for arrname in ${!line*}; do
    [[ $arrname =~ ^line[0-9]+$ ]] || continue
    declare -n arrptr=$arrname
    printf 'line:%d size:%d\n' "${arrname#line}" "${#arrptr[@]}"
done
This assumes all arrays whose names match the regular expression ^line[0-9]+$ are of interest.
declare -p $(seq 1 "$rows"|xargs -I {} echo line{})|awk -F'[][ =]' '{print $3,$(NF-2)+1}'
line1 2
line2 2
line3 3
line4 3
line5 2
line6 2
line7 1
line8 2
line9 4
seq 1 "$rows"|xargs -I {} echo line{} lists all the array names;
declare -p <array-name> displays each array's attributes and values;
awk -F'[][ =]' '{print $3,$(NF-2)+1}' prints the array name and its size (the last index plus one).
Like so it's perfect:
declare -a line0=( 00 01 02 )
declare -a line1=( 10 11 )
...
declare -a line9=( 90 91 92 93 )
rows=10
for ((r=0;r<rows;r++)) do
    declare -n line="line$r"
    #echo line:"$line"
    cols="${#line[@]}"
    echo cols:$cols
    for ((c=0;c<cols;c++)) do
        val=${line[$c]}
        echo r:$r c:$c val:$val
    done
done
The only strange thing is that printing "line" returns something odd:
echo line:"$line"
shows (seemingly the 1st element of lineN):
line:00
line:10
...
but I do not need to print that, only to use it as a reference. It works, thanks!
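That behaviour is expected, for what it's worth: expanding an array variable (or a nameref to one) without a subscript yields element 0, so echo "$line" is equivalent to echo "${line[0]}". A minimal illustration:
declare -a line0=( 00 01 02 )
declare -n line=line0
echo "$line"       # prints 00, same as ${line[0]}
echo "${line[1]}"  # prints 01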

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy, and I'm wondering whether it can be trimmed down a bit.
The input file consists of pairs of lines: names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Expected output for the extra example:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
Another awk:
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
If the pairs are separated by empty lines, set RS= so that awk treats each blank-line-delimited block as one record:
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
    [[ $line ]] || continue                      # skip blank lines
    [[ -z $name ]] && { name=$line; continue; }  # first non-blank line becomes name
    number=$line                                 # second one becomes number
    if (( number >= 80 && number < 200 )); then
        name=${name%%-*}                         # prune everything after first "-"
        printf '%s %s\n' "$name" "$number"       # emit our output
    fi
    name=; number=                               # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
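As a quick illustration of the parameter expansion referenced above (using only what the answer itself relies on):
name=al12t5683-heapmemusage-latest.log
echo "${name%%-*}"   # -> al12t5683; %% removes the longest suffix matching -*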
Or with Perl, reading each name's partner line inside the loop:
perl -nle's/-.*//; $n=<>; print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With GNU sed:
sed -E '
    # append the number line to the name line
    N
    # numbers 80-99: jump to the printing label
    /\n[8-9][0-9]$/bA
    # otherwise delete the pair unless the number is 100-199
    /\n1[0-9]{2}$/!d
    :A
    # keep the name prefix and the number, separated by a space
    s/([^-]*).*\n([0-9]+$)/\1 \2/
' infile

Renumbering numbers in a text file based on an unique mapping

I have a big txt file with 2 columns and more than 2 million rows. Every value represents an id and there may be duplicates. There are about 100k unique ids.
1342342345345 34523453452343
0209239498238 29349203492342
2349234023443 99203900992344
2349234023443 182834349348
2923000444 9902342349234
I want to identify each id and re-number all of them starting from 1. It should re-number duplicates also using the same new id. If possible, it should be done using bash.
The output could be something like:
123 485934
34 44834
167 34564
167 2345
2 34564
Doing this in pure bash will be really slow. I'd recommend:
tr -s '[:blank:]' '\n' <file |   # put every id on its own line
sort -un |                       # numeric sort, unique ids only
awk '
    # first pass (NR==FNR): reading stdin ("-"), map each unique id to its rank
    NR == FNR {id[$1] = FNR; next}
    # second pass: rewrite both columns of the original file through the map
    {for (i=1; i<=NF; i++) {$i = id[$i]}; print}
' - file
4 8
3 7
5 9
5 2
1 6
With bash and sort:
#!/bin/bash
shopt -s lastpipe  # run the last pipeline element in the current shell,
                   # so the associative array filled below survives the loop
declare -A hash    # declare associative array
index=1
# read the file and fill the associative array
while read -r a b; do
    echo "$a"
    echo "$b"
done <file | sort -nu | while read -r x; do
    hash[$x]="$((index++))"
done
# read the file again and print the values from the associative array
while read -r a b; do
    echo "${hash[$a]} ${hash[$b]}"
done < file
Output:
4 8
3 7
5 9
5 2
1 6
See: man bash and man sort
Pure Bash, with a single read of the file:
declare -A hash
index=1
while read -r a b; do
    [[ ${hash[$a]} ]] || hash[$a]=$((index++))  # assign an index only if not already set
    [[ ${hash[$b]} ]] || hash[$b]=$((index++))  # assign an index only if not already set
    printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed
Notes:
the index is assigned in the order read (not based on sorting)
we make a single pass through the file (not two as in other solutions)
Bash's read is slower than awk; however, if the same logic is implemented in Perl or Python, it will be much faster
this solution is more CPU bound because of the hash lookups
Output:
1 2
3 4
5 6
5 7
8 9
Just keep a monotonic counter and a table of seen numbers; when you see a new id, give it the value of the counter and increment:
awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input
awk 'NR==FNR { ids[$1] = ++c; next }
     { print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in
Join the two columns into one, sort that numerically (-n) and make it unique (-u), then use awk to turn this sequence into an array mapping old ids to new ids.
Then, for each line of the input, swap the ids and print.

Get common values in 2 arrays in shell scripting [duplicate]

This question already has answers here: Intersection of two lists in Bash (5 answers). Closed 3 years ago.
I have two arrays:
array1 = (20,30,40,50)
array2 = (10,20,30,80,100,110,40)
I have to get the common values from these 2 arrays into a third array, in ascending sorted order:
array3 = (20,30,40)
Shell and standard Unix utilities are good at dealing with text files.
In that realm, arrays would be text files whose elements are the lines.
To find the common part between two such arrays, there's the standard comm command; comm expects alphabetically sorted input, though.
So, if you have two files A and B containing the elements of those two arrays, one per line (which also means the array elements can't contain newline characters), you can find the intersection with:
comm -12 <(sort A) <(sort B)
(-1 and -2 suppress the lines unique to A and to B respectively, leaving only the common lines.)
If you want to start with bash arrays (but using arrays in shells is generally a good indication that you're using the wrong tool for your task), you can convert back and forth between the bash arrays and our text file arrays of lines with printf '%s\n' and word splitting:
array_one=(20 30 40 50)
array_two=(10 20 30 80 100 110 40)
IFS=$'\n'; set -f
intersection=($(comm -12 <(
    printf '%s\n' "${array_one[@]}" | sort) <(
    printf '%s\n' "${array_two[@]}" | sort)))
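A follow-up sketch (my addition, not part of the original answer): IFS=$'\n' and set -f make the unquoted command substitution split on newlines only, with globbing disabled; you may want to restore the defaults once the array is built:
unset IFS; set +f
printf '%s\n' "${intersection[@]}"   # -> 20 30 40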
You almost certainly should not be using shell for this so here's ONE awk solution to your specific problem:
awk 'BEGIN{
    split("20,30,40,50",array1,/,/)
    split("10,20,30,80,100,110,40",array2,/,/)
    for (i=1;i in array1;i++)
        for (j=1;j in array2;j++)
            if (array1[i] == array2[j])
                array3[++k] = array1[i]
    for (k=1; k in array3; k++)
        printf "array3[%d] = %d\n",k,array3[k]
}'
array3[1] = 20
array3[2] = 30
array3[3] = 40
and if you tell us what you're really trying to do you can get a lot more help.
A pure bash solution using arrays:
#!/bin/bash
array1=(20,30,40,50)
array2=(10,20,30,80,100,110,40)
IFS=,
for i in $array1 $array2;{ ((++tmp[i]));}
for i in ${!tmp[*]};{ [ ${tmp[i]} -gt 1 ] && array3+=($i);}
echo ${array3[*]}
Output
20 30 40
As array3 is not an associative array, the indexes come out in ascending order with the ${!array[*]} notation. If you need a comma-separated list as output, use echo "${array3[*]}" (IFS is still set to a comma).
This can be used if the source elements are integers, and it works only if each of the source arrays contains unique numbers.
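A hypothetical counter-example for that last caveat: with a duplicate inside one array, tmp[20] reaches 2 without 20 appearing in the second array, so the loop above would wrongly put 20 into array3:
array1=(20,20,30)
array2=(40,50)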
Here's a solution with standard command line tools (sort and join):
join <(printf %s\\n "${array1[@]}" | sort -u) \
     <(printf %s\\n "${array2[@]}" | sort -u) | sort -n
join requires its inputs to be sorted, and does not recognize numerical sort order. Consequently, I sort both lists in the default collation order, join them, and then resort the result numerically.
I also assumed that you'd created the arrays really as arrays, i.e.:
array1=(20 30 40 50)
I think the rest is more or less self-evident, possibly with the help of help printf and man bash.
Maybe you can try Perl:
#!/bin/perl
use warnings;
use strict;
my @array1 = (20,30,40,50);
my @array2 = (10,20,30,80,100,110,40);
my @array3 = ();
foreach my $x (@array1) {
    if (grep(/$x/, @array2)) {
        print "found $x\n";
        @array3 = (@array3, $x);
    }
}
print @array3;
In addition to any of these fine answers, it seems that you also want to sort your array (containing the answer) in ascending order.
You can do that in a number of different ways, including this:
readarray array3 <<<"$(printf "%s\n" "${array3[@]}" | sort -n)"
This method also allows you to filter out duplicate values:
readarray array3 <<<"$(printf "%s\n" "${array3[@]}" | sort -n | uniq)"
And for the sake of the exercise, here's yet another way of solving it:
#!/bin/bash
array1=(20 30 40 50)
array2=(10 20 30 80 100 110 40)
declare -a array3
# sort both arrays
readarray array1 <<<"$(printf "%s\n" "${array1[@]}" | sort -n)"
readarray array2 <<<"$(printf "%s\n" "${array2[@]}" | sort -n)"
# look for values
i2=0
for i1 in ${!array1[@]}; do
    while (( i2 < ${#array2[@]} && ${array1[$i1]} > ${array2[$i2]} )); do (( i2++ )); done
    [[ ${array1[$i1]} == ${array2[$i2]} ]] && array3+=(${array1[$i1]})
done
echo ${array3[@]}
Consider using python:
In [6]: array1 = (20,30,40,50)
In [7]: array2 = (10,20,30,80,100,110,40)
In [8]: set(array1) & set(array2)
Out[8]: set([40, 20, 30])
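To get the ascending order the question asks for, wrap the intersection in sorted():
In [9]: sorted(set(array1) & set(array2))
Out[9]: [20, 30, 40]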

Sed - convert negative to positive numbers

I am trying to convert all negative numbers to positive numbers, and have so far come up with this:
echo "-32 45 -45 -72" | sed -re 's/\-([0-9])([0-9])\ /\1\2/p'
but it is not working, as it outputs:
3245 -45 -72
I thought that by using \1\2 I would have got the positive numbers back.
Where am I going wrong?
Why not just remove the -'s?
[root@vm ~]# echo "-32 45 -45 -72" | sed 's/-//g'
32 45 45 72
My first thought is not using sed, if you don't have to. awk can understand that they're numbers and convert them thusly:
echo "-32 45 -45 -72" | awk -vRS=" " -vORS=" " '{ print ($1 < 0) ? ($1 * -1) : $1 }'
-vRS sets the "record separator" to a space, and -vORS sets the "output record separator" to a space. Then it simply checks each value, sees if it's less than 0, and multiplies it by -1 if it is, and if it's not, just prints the number.
In my opinion, if you don't have to use sed, this is more "correct," since it treats numbers like numbers.
This might work for you:
echo "-32 45 -45 -72" | sed 's/-\([0-9]\+\)/\1/g'
The reasons why your regex is failing:
You're only doing a single substitution (no g flag).
Your replacement has no space at the end, although your match consumes one.
The last number has no trailing space, so it can never match.
This would work too, but less elegantly (and only for two-digit numbers):
echo "-32 45 -45 -72" | sed -rn 's/-([0-9])([0-9])(\s?)/\1\2\3/gp'
Of course for this example only:
echo "-32 45 -45 -72" | tr -d '-'
You are dealing with the numbers as a string of characters. It would be more appropriate to store the numbers in an array and use the built-in Shell Parameter Expansion to remove the minus sign:
[~] $ # Creating and array with an arbitrary name:
[~] $ array17=(-32 45 -45 -72)
[~] $ # Calling all elements of the array and removing the first minus sign:
[~] $ echo ${array17[*]/-}
32 45 45 72
[~] $
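One hedged refinement (my addition): ${array17[*]/-} removes the first - found anywhere in each element, so if an element could ever contain an interior minus sign, anchoring the pattern to the start is safer:
echo "${array17[*]/#-}"
The /#pattern form only replaces a match at the beginning of each element.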
