I have two arrays:
array1 = (20,30,40,50)
array2 = (10,20,30,80,100,110,40)
I have to get the common values from these 2 arrays in my array 3 like:
array3 = (20,30,40)
in ascending sorted order.
Shell and standard Unix utilities are good at dealing with text files.
In that realm, arrays would be text files whose elements are the lines.
To find the common part between two such arrays, there's the standard comm command. comm expects alphabetically sorted input though.
So, if you have two files A and B containing the elements of those two arrays, one per line (which also means the array elements can't contain newline characters), you can find the intersection with
comm -12 <(sort A) <(sort B)
If you want to start with bash arrays (but using arrays in shells is generally a good indication that you're using the wrong tool for your task), you can convert back and forth between the bash arrays and our text file arrays of lines with printf '%s\n' and word splitting:
array_one=(20 30 40 50)
array_two=(10 20 30 80 100 110 40)
IFS=$'\n'; set -f
intersection=($(comm -12 <(
printf '%s\n' "${array_one[@]}" | sort) <(
printf '%s\n' "${array_two[@]}" | sort)))
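A variant of the same conversion, sketched here on the assumption that bash 4+ is available for mapfile: it reads the lines produced by comm straight into the array, so IFS and globbing do not have to be changed for the rest of the script. The trailing sort -n is an addition to get ascending numeric order, since comm only ever sees lexically sorted input.

```shell
#!/usr/bin/env bash
array_one=(20 30 40 50)
array_two=(10 20 30 80 100 110 40)

# comm needs lexically sorted input; sort -n afterwards restores numeric order.
# mapfile -t stores one output line per array element (assumes bash 4+).
mapfile -t intersection < <(
  comm -12 <(printf '%s\n' "${array_one[@]}" | sort) \
           <(printf '%s\n' "${array_two[@]}" | sort) | sort -n
)

printf '%s\n' "${intersection[*]}"
```

As in the answer above, this assumes that no element contains a newline.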
You almost certainly should not be using shell for this so here's ONE awk solution to your specific problem:
awk 'BEGIN{
split("20,30,40,50",array1,/,/)
split("10,20,30,80,100,110,40",array2,/,/)
for (i=1;i in array1;i++)
for (j=1;j in array2;j++)
if (array1[i] == array2[j])
array3[++k] = array1[i]
for (k=1; k in array3; k++)
printf "array3[%d] = %d\n",k,array3[k]
}'
array3[1] = 20
array3[2] = 30
array3[3] = 40
and if you tell us what you're really trying to do you can get a lot more help.
A pure bash solution using arrays:
#!/bin/bash
array1=(20,30,40,50)
array2=(10,20,30,80,100,110,40)
IFS=,
for i in $array1 $array2;{ ((++tmp[i]));}
for i in ${!tmp[*]};{ [ ${tmp[i]} -gt 1 ] && array3+=($i);}
echo ${array3[*]}
Output
20 30 40
Because tmp is an indexed (not associative) array, the ${!tmp[*]} notation lists its indexes in ascending order, so array3 comes out sorted. If you need a comma-separated list as output, use echo "${array3[*]}" (IFS is still set to a comma).
This works only if the source elements are non-negative integers, and only if each source array contains unique numbers.
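If the source arrays may contain repeated values, one possible workaround is to track membership in associative arrays instead of counting. This is only a sketch, assuming bash 4+; the names in_first and emitted are made up for the illustration, and the sample arrays have duplicates added on purpose.

```shell
#!/usr/bin/env bash
# Sample data with duplicates added on purpose (not from the question).
array1=(20 30 40 50 20)
array2=(10 20 30 80 100 110 40 40)

# in_first and emitted are hypothetical helper names.
declare -A in_first=() emitted=()

# Remember every value that occurs in array1.
for v in "${array1[@]}"; do
  in_first[$v]=1
done

# Keep each value of array2 that is also in array1, but only once.
array3=()
for v in "${array2[@]}"; do
  if [[ ${in_first[$v]} && ! ${emitted[$v]} ]]; then
    array3+=("$v")
    emitted[$v]=1
  fi
done

# Associative arrays have no useful order, so sort the result explicitly.
mapfile -t array3 < <(printf '%s\n' "${array3[@]}" | sort -n)
echo "${array3[*]}"
```

Unlike the index-based version, this does not require the elements to be integers, apart from the final sort -n.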
Here's a solution with standard command line tools (sort and join):
join <(printf %s\\n "${array1[@]}" | sort -u) \
<(printf %s\\n "${array2[@]}" | sort -u) | sort -n
join requires its inputs to be sorted, and does not recognize numerical sort order. Consequently, I sort both lists in the default collation order, join them, and then re-sort the result numerically.
I also assumed that you'd created the arrays really as arrays, i.e.:
array1=(20 30 40 50)
I think the rest is more or less self-evident, possibly with the help of help printf and man bash.
Maybe you could try Perl:
#!/bin/perl
use warnings;
use strict;
my @array1 = (20,30,40,50);
my @array2 = (10,20,30,80,100,110,40);
my @array3 = ();
foreach my $x (@array1) {
# keep $x if it also occurs in @array2 (numeric equality, not a regex match)
if (grep { $_ == $x } @array2) {
print "found $x\n";
push @array3, $x;
}
}
print "@array3\n";
In addition to any of these fine answers, it seems that you also want to sort your array (containing the answer) in ascending order.
You can do that in a number of different ways, including this:
readarray -t array3 <<<"$(printf "%s\n" "${array3[@]}" | sort -n)"
This method also allows you to filter out duplicate values:
readarray -t array3 <<<"$(printf "%s\n" "${array3[@]}" | sort -n | uniq)"
And for the sake of the exercise, here's yet another way of solving it:
#!/bin/bash
array1=(20 30 40 50)
array2=(10 20 30 80 100 110 40)
declare -a array3
#sort both arrays
readarray -t array1 <<<"$(printf "%s\n" "${array1[@]}" | sort -n)"
readarray -t array2 <<<"$(printf "%s\n" "${array2[@]}" | sort -n)"
# look for values
i2=0
for i1 in ${!array1[@]}; do
while (( i2 < ${#array2[@]} && ${array1[$i1]} > ${array2[$i2]} )); do (( i2++ )); done
[[ ${array1[$i1]} == ${array2[$i2]} ]] && array3+=(${array1[$i1]})
done
echo ${array3[@]}
Consider using python:
In [6]: array1 = (20,30,40,50)
In [7]: array2 = (10,20,30,80,100,110,40)
In [8]: set(array1) & set(array2)
Out[8]: set([40, 20, 30])
Related
I want to sort 2 arrays at the same time. The arrays are the following: wordArray and numArray. Both are global.
These 2 arrays contain all the words (without duplicates) and the number of the appearances of each word from a text file.
Right now I am using Bubble Sort to sort both of them at the same time:
# Bubble Sort function
function bubble_sort {
local max=${#numArray[@]}
size=${#numArray[@]}
while ((max > 0))
do
local i=0
while ((i < max))
do
if [ "$i" != "$(($size-1))" ]
then
if [ ${numArray[$i]} \< ${numArray[$((i + 1))]} ]
then
local temp=${numArray[$i]}
numArray[$i]=${numArray[$((i + 1))]}
numArray[$((i + 1))]=$temp
local temp2=${wordArray[$i]}
wordArray[$i]=${wordArray[$((i + 1))]}
wordArray[$((i + 1))]=$temp2
fi
fi
((i += 1))
done
((max -= 1))
done
}
#Calling Bubble Sort function
bubble_sort "${numArray[@]}" "${wordArray[@]}"
But for some reason it won't sort them properly when large arrays are in place.
Does anyone know what's wrong with it, or another approach to sort the words together with their number of appearances, with or without arrays?
This:
wordArray = (because, maybe, why, the)
numArray = (5, 12, 20, 13)
Must turn to this:
wordArray = (why, the, maybe, because)
numArray = (20, 13, 12, 5)
Someone recommended to write the two arrays side by side in a text file and sort the file.
How will it work for this input:
1 Arthur
21 Zebra
to turn to this output:
21 Zebra
1 Arthur
Assuming the arrays do not contain tab character or newline character, how about:
#!/bin/bash
wordArray=(why the maybe because)
numArray=(20 13 12 5)
tmp1=$(mktemp tmp.XXXXXX) # file to be sorted
tmp2=$(mktemp tmp.XXXXXX) # sorted result
for (( i = 0; i < ${#wordArray[@]}; i++ )); do
echo "${numArray[i]}"$'\t'"${wordArray[i]}" # write the number and word delimited by a tab character
done > "$tmp1"
sort -nrk1,1 "$tmp1" > "$tmp2" # sort the file by number in descending order
while IFS=$'\t' read -r num word; do # read the lines splitting by the tab character
numArray_sorted+=("$num") # add the number to the array
wordArray_sorted+=("$word") # add the word to the array
done < "$tmp2"
rm -- "$tmp1" # unlink the temp file
rm -- "$tmp2" # same as above
echo "${wordArray_sorted[@]}" # see the result
echo "${numArray_sorted[@]}" # same as above
Output:
why the maybe because
20 13 12 5
If you prefer not to create temp files, here is the process substitution version, which will run faster without writing/reading temp files.
#!/bin/bash
wordArray=(why the maybe because)
numArray=(20 13 12 5)
while IFS=$'\t' read -r num word; do
numArray_sorted+=("$num")
wordArray_sorted+=("$word")
done < <(
sort -nrk1,1 < <(
for (( i = 0; i < ${#wordArray[@]}; i++ )); do
echo "${numArray[i]}"$'\t'"${wordArray[i]}"
done
)
)
echo "${wordArray_sorted[@]}"
echo "${numArray_sorted[@]}"
Or simpler (using the suggestion by KamilCuk):
#!/bin/bash
wordArray=(why the maybe because)
numArray=(20 13 12 5)
while IFS=$'\t' read -r num word; do
numArray_sorted+=("$num")
wordArray_sorted+=("$word")
done < <(
paste <(printf "%s\n" "${numArray[@]}") <(printf "%s\n" "${wordArray[@]}") | sort -nrk1,1
)
echo "${wordArray_sorted[@]}"
echo "${numArray_sorted[@]}"
You need numeric sort for the numbers. You can sort an array like this:
mapfile -t numArray < <(printf '%s\n' "${numArray[@]}" | sort -n)
But what you actually need is something like:
for num in "${numArray[@]}"; do
echo "$num: ${wordArray[j++]}"
done |
sort -n -k1,1
But, earlier in the process, you should have used only one array, where the word and frequency (or vice versa) are key value pairs. Then they always have a direct relationship, and can be printed similarly to the for loop above.
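The single-array idea could look roughly like this. It is a sketch under assumptions: bash 4+ for associative arrays, a made-up words list standing in for the text file, and hypothetical names count and sorted.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; in the question the words come from a text file.
words=(the why maybe the why why because)

# One associative array: word -> number of appearances.
declare -A count=()
for word in "${words[@]}"; do
  count[$word]=$(( ${count[$word]:-0} + 1 ))
done

# Emit "count<TAB>word" lines and sort by count, highest first.
# Ties are ordered by word so that the result is deterministic.
sorted=$(
  for word in "${!count[@]}"; do
    printf '%d\t%s\n' "${count[$word]}" "$word"
  done | sort -t $'\t' -k1,1nr -k2,2
)
printf '%s\n' "$sorted"
```

Because each count is stored under its word, the two can never drift apart the way two parallel arrays can during a hand-written sort.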
I have an array, array=(4,2,8,9,1,0). I don't want to sort it to find the highest number, because I need the index of that number in the original order so I can use it for further reference.
Expected output:
9 index value => 3
Can somebody help me to achieve this?
Slight variation with a loop using the ternary conditional operator and no assumptions about range of values:
arr=(4 2 8 9 1 0)
max=${arr[0]}
maxIdx=0
for ((i = 1; i < ${#arr[@]}; ++i)); do
maxIdx=$((arr[i] > max ? i : maxIdx))
max=$((arr[i] > max ? arr[i] : max))
done
printf '%s index => values %s\n' "$maxIdx" "$max"
The only assumption is that array indices are contiguous. If they aren't, it becomes a little more complex:
arr=([1]=4 [3]=2 [5]=8 [7]=9 [9]=1 [11]=0)
indices=("${!arr[@]}")
maxIdx=${indices[0]}
max=${arr[maxIdx]}
for i in "${indices[@]:1}"; do
((arr[i] <= max)) && continue
maxIdx=$i
max=${arr[i]}
done
printf '%s index => values %s\n' "$maxIdx" "$max"
This first gets the indices into a separate array and sets the initial maximum to the value corresponding to the first index; then, it iterates over the indices, skipping the first one (the :1 notation), checks if the current element is a new maximum, and if it is, stores the index and the maximum.
Without using sort, you can use a simple loop in shell. Here is a sample bash code:
#!/usr/bin/env bash
array=(4 2 8 9 1 0)
for i in "${!array[@]}"; do
[[ -z $max ]] || (( ${array[i]} > $max )) && { max="${array[i]}"; maxind=$i; }
done
echo "max=$max, maxind=$maxind"
max=9, maxind=3
arr=(4 2 8 9 1 0)
paste <(printf "%s\n" "${arr[@]}") <(seq 0 $((${#arr[@]} - 1)) ) |
sort -n -k1,1 |
tail -n1 |
sed 's/\t/ index value => /'
Print each array element on a newline with printf
Print array indexes with seq
Join both streams using paste
Numerically sort the lines using the first fields (ie. array value) sort
Print the last line tail -n1
The array value and its index are separated by a tab. Substitute the tab with the output string you want using sed. One could instead use, e.g., cut -f2 to get only the index, or read a b < <( ... ) to read the numbers into variables, etc.
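The read variant mentioned at the end could be sketched as follows. The variable names max and idx are chosen only for this illustration, and sort -n is used so that multi-digit values compare numerically.

```shell
#!/usr/bin/env bash
arr=(4 2 8 9 1 0)

# Pair each value with its index (tab-separated), sort numerically by value,
# keep the last line, and split it into two variables with read.
read -r max idx < <(
  paste <(printf '%s\n' "${arr[@]}") <(seq 0 $(( ${#arr[@]} - 1 ))) |
    sort -n -k1,1 |
    tail -n1
)

echo "$max index value => $idx"
```

The process substitution keeps read in the current shell, so max and idx remain set after the pipeline finishes.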
Using Perl
$ export data=4,2,8,9,1,0
$ echo $data | perl -ne ' map{$i++; if($_>$x) {$x=$_;$id=$i} } split(","); print "max=$x", " index=",--${id},"\n" '
max=9 index=3
$
I needed to find the most frequent number in an array. I did it this way:
# our array, the most frequent value is 55
declare -a array=(44 55 55 55 66 66)
# counting unique strings with uniq and then sorting as numbers
array=($(printf "%s\n" "${array[@]}"| uniq -c | sort -n -r))
# printing the 2nd element of the array, as the first one is the number of occurrences
printf ${array[1]}
Is there a better/more beautiful way to do it, instead of building a weird array (2nd step) that mixes counts and numbers together?
And am I doing sorting correctly? (uniq returns values in 2 columns, so I'm not sure how it chooses the column)
If I had to do this in bash, I would use awk to skip sorting anything and just count the elements:
printf '%s\n' "${array[@]}" | awk '{
if (++arr[$0] > max) {
max=arr[$0];
ans=$0
}
}
END {print ans}'
You can also implement the same algorithm in bash 4 or later using an associative array:
# These don't strictly need to be initialized, but it's safer
# to ensure they don't already have values.
declare -A counts=()
max=0
ans=
for i in "${array[@]}"; do
if ((++counts[$i] > max)); then
max=${counts[$i]}
ans=$i
fi
done
printf '%s\n' "$ans"
If you don't want to use awk for this, you can still do it with sort and uniq, but be careful: the input must ALREADY be sorted before counting, otherwise it will not work. For instance:
declare -a array=(34 3 45 45 66 55 44 55 55 55 66 45 45 8 6 45 45 66 32 9 18)
printf "%s\n" "${array[@]}" | sort -n -r | uniq -c | sort -n -r | head -1 | awk '{print $2}'
For the given input this correctly extracts the most repeated number, 45. Without the first sort, as in the sample you gave, it would report 55, which is wrong: uniq only counts adjacent repeated lines, so occurrences scattered through the input are counted as separate runs.
Regards!
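The effect of the missing sort can be checked directly. This is a small sketch that reuses the array from this answer; the variable names unsorted and sorted are made up for the comparison.

```shell
#!/usr/bin/env bash
array=(34 3 45 45 66 55 44 55 55 55 66 45 45 8 6 45 45 66 32 9 18)

# Without a prior sort, uniq -c only counts runs of adjacent equal lines,
# so the longest run (three 55s) wins.
unsorted=$(printf '%s\n' "${array[@]}" | uniq -c | sort -nr | head -n1 | awk '{print $2}')

# With a prior sort, all equal values become adjacent and are counted together.
sorted=$(printf '%s\n' "${array[@]}" | sort -n | uniq -c | sort -nr | head -n1 | awk '{print $2}')

echo "without sort: $unsorted"
echo "with sort:    $sorted"
```

The first pipeline picks 55 because it forms the longest uninterrupted run, while the second picks 45, which occurs six times in total.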
A slightly more verbose version of chepner's logic, using associative arrays in bash v4+. We build the associative array hashMap with each array element as a key and the count of its occurrences as the value. Once the array is built, we look for the key with the maximum count and print it.
#!/usr/bin/env bash
declare -a array=(44 55 55 55 66 66)
declare -A hashMap
declare -i max=0
for element in "${array[@]}"; do
((hashMap["$element"]++))
done
for key in "${!hashMap[@]}"; do
(( "${hashMap[$key]}" > max )) && { max="${hashMap[$key]}"; element="$key" ; }
done
printf '%d\n' "$element"
Another minimalist awk solution:
$ awk '{for(mi=i=1;i<=NF;i++) if(a[$mi]<++a[$i]) mi=i; print $mi}' <<< "${array[@]}"
I have a big txt file with 2 columns and more than 2 million rows. Every value represents an id and there may be duplicates. There are about 100k unique ids.
1342342345345 34523453452343
0209239498238 29349203492342
2349234023443 99203900992344
2349234023443 182834349348
2923000444 9902342349234
I want to identify each id and re-number all of them starting from 1. It should re-number duplicates also using the same new id. If possible, it should be done using bash.
The output could be something like:
123 485934
34 44834
167 34564
167 2345
2 34564
Doing this in pure bash will be really slow. I'd recommend:
tr -s '[:blank:]' '\n' <file |
sort -un |
awk '
NR == FNR {id[$1] = FNR; next}
{for (i=1; i<=NF; i++) {$i = id[$i]}; print}
' - file
4 8
3 7
5 9
5 2
1 6
With bash and sort:
#!/bin/bash
shopt -s lastpipe
declare -A hash # declare associative array
index=1
# read file and fill associative array
while read -r a b; do
echo "$a"
echo "$b"
done <file | sort -nu | while read -r x; do
hash[$x]="$((index++))"
done
# read file and print values from associative array
while read -r a b; do
echo "${hash[$a]} ${hash[$b]}"
done < file
Output:
4 8
3 7
5 9
5 2
1 6
See: man bash and man sort
Pure Bash, with a single read of the file:
declare -A hash
index=1
while read -r a b; do
[[ ${hash[$a]} ]] || hash[$a]=$((index++)) # assign index only if not set already
[[ ${hash[$b]} ]] || hash[$b]=$((index++)) # assign index only if not set already
printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed
Notes:
the index is assigned in the order read (not based on sorting)
we make a single pass through the file (not two as in other solutions)
Bash's read is slower than awk; however, if the same logic is implemented in Perl or Python, it will be much faster
this solution is more CPU bound because of the hash lookups
Output:
1 2
3 4
5 6
5 7
8 9
Just keep a monotonic counter and a table of seen numbers; when you see a new id, give it the value of the counter and increment:
awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input
awk 'NR==FNR { ids[$1] = ++c; next }
{ print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in
Join the two columns into one, sort that into numerical order (-n) and make it unique (-u), then use awk to turn this sequence into a mapping from old to new ids.
Then for each line in input, swap ids and print.
Is there a command in KornShell (ksh) scripting to sort an array of integers? In this specific case, I am interested in simplicity over efficiency. For example if the variable $UNSORTED_ARR contained values "100911, 111228, 090822" and I wanted to store the result in $SORTED_ARR
Is it actually an indexed array or a list in a string?
Array:
UNSORTED_ARR=(100911 111228 090822)
SORTED_ARR=($(printf "%s\n" ${UNSORTED_ARR[@]} | sort -n))
String:
UNSORTED_ARR="100911, 111228, 090822"
SORTED_ARR=$(IFS=, printf "%s\n" ${UNSORTED_ARR[@]} | sort -n | sed ':a;$s/\n/,/g;N;ba')
There are several other ways to do this, but the principle is the same.
Here's another way for a string using a different technique:
set -s -- ${UNSORTED_ARR//,}
SORTED_ARR=$@
SORTED_ARR=${SORTED_ARR// /, }
Note that this is a lexicographic sort so you would see this kind of thing when the numbers don't have leading zeros:
$ set -s -- 10 2 1 100 20
$ echo $@
1 10 100 2 20
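To compare the two orderings side by side, here is a small sketch. It is written for bash, though it uses only constructs I would expect ksh93 to accept as well; the variable names lexical and numeric are made up, and LC_ALL=C is set only to make the lexicographic result locale-independent.

```shell
#!/usr/bin/env bash
set -- 10 2 1 100 20

# Plain sort compares the values as strings, like set -s does.
lexical=$(printf '%s\n' "$@" | LC_ALL=C sort | paste -s -d ' ' -)

# sort -n compares them as numbers.
numeric=$(printf '%s\n' "$@" | sort -n | paste -s -d ' ' -)

echo "lexical: $lexical"
echo "numeric: $numeric"
```

When all values have the same width, as with the zero-padded 100911, 111228 and 090822 from the question, both orderings coincide.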
If I take that out then it works but I can't loop through it (because its a list of strings now) – pws5068 Mar 4 '11 at 21:01
Do this:
# create sorted array
set -s -A $@