I have 2 large arrays with hash values stored in them. I'm trying to find the best way to verify that all of the hash values in array_a are also found in array_b. The best I've got so far is:

1. Import the hash files into arrays.
2. Sort each array.
3. Loop through array_a.
4. Inside array_a's loop, run another loop over array_b (seems inefficient).
5. If a match is found, unset the value in array_b, set "found" to 1, and break out of the inner loop.
6. If an array_a value has no match, output it to a file.
I have large images that I need to verify have been uploaded to the site and that the hash values match. I've created a file of hashes from the original files and scraped the website's copies to create a second list of hash values. I'm trying to keep this as vanilla as possible, so I'm only using typical bash functionality.
#!/bin/bash
array_a=($(< original_sha_values.txt))
array_b=($(< sha_values_after_downloaded.txt))
# Sort to speed up.
IFS=$'\n' array_a_sorted=($(sort <<<"${array_a[*]}"))
unset IFS
IFS=$'\n' array_b_sorted=($(sort <<<"${array_b[*]}"))
unset IFS
for item1 in "${array_a_sorted[@]}" ; do
  found=0
  for item2 in "${!array_b_sorted[@]}" ; do
    if [[ $item1 == "${array_b_sorted[$item2]}" ]]; then
      unset 'array_b_sorted[item2]'
      found=1
      break
    fi
  done
  if [[ $found == 0 ]]; then
    echo "$item1" >> hash_is_missing_a_match.log
  fi
done
Sorting sped it up a lot:
IFS=$'\n' array_a_sorted=($(sort <<<"${array_a[*]}"))
unset IFS
IFS=$'\n' array_b_sorted=($(sort <<<"${array_b[*]}"))
unset IFS
Is this really the best way of doing this?
for item1 in "${array_a_sorted[@]}" ; do
  ...
  for item2 in "${!array_b_sorted[@]}" ; do
    if ...
      unset 'array_b_sorted[item2]'
      break
Both arrays have 12,000 lines of 64-bit hashes, and the comparison takes 20+ minutes. Is there a way to improve the speed?
You're doing it the hard way.
If the task is to find the entries in file1 that are not in file2, here is a shorter approach:
$ comm -23 <(sort f1) <(sort f2)
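For example (filenames and sample hashes invented for this demo), with sorted input comm prints only the lines unique to the first file:

```shell
cd "$(mktemp -d)"

# Two hypothetical hash lists: f1 has one entry that f2 lacks
printf '%s\n' aaa bbb ccc > f1
printf '%s\n' bbb ccc > f2

# -2 suppresses lines unique to f2, -3 suppresses common lines,
# leaving only the entries of f1 missing from f2
comm -23 <(sort f1) <(sort f2)   # prints: aaa
```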
I think karakfa's answer is probably the best approach if you just want to get it done and not worry about optimizing bash code.
However, if you still want to do it in bash, and you are willing to use some bash-specific features, you could shave off a lot of time using an associative array instead of two regular arrays:
# Read the original hash values into a bash associative array
declare -A original_hashes=()
while read -r hash; do
  original_hashes["$hash"]=1
done < original_sha_values.txt

# Then read the downloaded values and check each one to see if it exists
# in the associative array. Lookup time *should* be O(1)
while read -r hash; do
  if [[ -z "${original_hashes["$hash"]+x}" ]]; then
    echo "$hash" >> hash_is_missing_a_match.log
  fi
done < sha_values_after_downloaded.txt
This should be a lot faster than the nested loop implementation using regular arrays. Also, I didn't need any sorting, and all of the insertions and lookups on the associative array should be O(1), assuming bash implements associative arrays as hash tables. I couldn't find anything authoritative to back that up though, so take that with a grain of salt. Either way, it should still be faster than the nested loop method.
If you want the output sorted, you can just change the last line to:
done < <(sort sha_values_after_downloaded.txt)
in which case you're still only having to sort one file, not two.
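Here is a self-contained sketch of the same lookup logic, with here-strings standing in for the two files (the hash values are invented):

```shell
# Build the associative array from the "original" values
declare -A original_hashes=()
while read -r hash; do
    original_hashes["$hash"]=1
done <<< $'aaa\nbbb\nccc'

# Check each "downloaded" value; "ddd" is the only one with no original counterpart
missing=()
while read -r hash; do
    if [[ -z "${original_hashes["$hash"]+x}" ]]; then
        missing+=("$hash")
    fi
done <<< $'bbb\nddd'

printf '%s\n' "${missing[@]}"   # prints: ddd
```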
On the last element of an array of strings, the script hangs, uses 100% CPU, and its memory use climbs past 500 MB and keeps growing.
The array size is 1406
Bash is 4.3 (old server)
Does anyone know this bug?
# get sorted values (strings) of the associative array to build a simple array
final=($(for k in "${!namefiles_servers[@]}"; do echo "$k===${namefiles_servers[$k]}"; done | sort))
for val in "${final[@]}"; do
  ...
  # at the 1406th element, the last statement of the loop is executed and the script hangs
done
We need a lot more information about your problem, preferably the entire section of the script where the problem occurs. We need to know what you're doing in the loop, since the issue might be caused by one of the programs you call.
However, here's some information about speeding up your script. If the problem has something to do with excessive resource use, you can optimize your script easily.
For big problems, especially when you're using a slow (but loved) language like bash, it's better to store your data in a file than to hold it all in one for loop. That's because the for loop holds the entire array in memory, whereas the code below parses the file line by line. Of course, if you have separate blocks of data, you need to delimit them and use read's -d option or something similar.
# sort your file here
while read -r line; do
  # something
done < "${MYFILE}"
Another thing that might cause a problem is supplying entire arrays to functions, such as:
function test_array_1 {
  local arr=("$@")
  sleep 1
  (for i in "${arr[@]}"; do echo "$i"; done) >/dev/null
}
It's better to pass a reference instead, because that avoids making a copy. This makes the script much faster.
function test_array_2 {
  local -n arr=$1
  sleep 1
  (for i in "${arr[@]}"; do echo "$i"; done) >/dev/null
}
A copy would take the script below around 3 minutes, whereas the reference takes 44 seconds. You can test it yourself.
#! /bin/bash

function test_array_1 {
  local arr=("$@")
  sleep 1
  (for i in "${arr[@]}"; do echo "$i"; done) >/dev/null
}

function test_array_2 {
  local -n arr=$1
  sleep 1
  (for i in "${arr[@]}"; do echo "$i"; done) >/dev/null
}

test_function=$1
ARR=($(seq 1 10000000))

if [ "${test_function}" = "test_array_1" ]; then
  time test_array_1 "${ARR[@]}" &
elif [ "${test_function}" = "test_array_2" ]; then
  time test_array_2 ARR &
fi
wait

exit 0
EDIT:
Thinking about this some more, why do you need two arrays? Since the arrays have an equal number of entries, you can rewrite your for loop and make the final array redundant.
#! /bin/bash

# This array represents random data, which you seem to have. Otherwise, don't
# sort at all
namefiles_servers=($(seq 1 100 | sort -R))

# You can sort it as such, no need for loops.
namefiles_servers=($(printf '%s\n' "${namefiles_servers[@]}" | sort))

# And access it with a loop. ${#namefiles_servers[@]} is the number of
# elements in your array
for ((i=0; i<${#namefiles_servers[@]}; i++)); do
  echo "index $i has value ${namefiles_servers[$i]}"
done
Array indexing is 0-based in bash, and 1-based in zsh (unless option KSH_ARRAYS is set).
As an example: To access the first element of an array, is there something nicer than:
if [ -n "$BASH_VERSION" ]; then
  echo "${array[0]}"
else
  echo "${array[1]}"
fi
TL;DR:
To always get consistent behaviour, use:
${array[@]:offset:length}
Explanation
For code which works in both bash and zsh, you need to use the offset:length syntax rather than the [subscript] syntax.
Even for zsh-only code, you'll still need to do this (or use emulate -LR zsh) since zsh's array subscripting basis is determined by the option KSH_ARRAYS.
Eg, to reference the first element in an array:
${array[@]:0:1}
Here, array[@] is all the elements, 0 is the offset (which is always 0-based), and 1 is the number of elements desired.
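As a quick sanity check (array contents invented for the demo), the same two lines behave identically under bash and zsh:

```shell
array=( first second third )

echo "${array[@]:0:1}"   # one element at offset 0  -> first
echo "${array[@]:1:2}"   # two elements at offset 1 -> second third
```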
Is it possible to do something like:
for (a,b) in (1,2) (2,1)
do
run_program.py $a $b
done
I only know the for do done syntax in Linux. I want to run the program with the specific two (a,b) instances (of course, it easily generalizes to much more than two).
There is no tuple construct in bash, and also no destructuring (the behavior which you're relying on to assign a=1 and b=2 when iterating over (1,2)). What you can do is have multiple arrays, where the same index in each refers to corresponding data, and iterate by index.
#!/bin/bash
# ^^^^ - IMPORTANT: /bin/sh does not support arrays, you *must* use bash
a1=( 1 2 ) # a1[0]=1; a1[1]=2
a2=( 2 1 ) # a2[0]=2; a2[1]=1
for idx in "${!a1[@]}"; do   # iterate over indices: idx=0, then idx=1
  a=${a1[$idx]}              # look up idx in array a1
  b=${a2[$idx]}              # look up idx in array a2
  run_program.py "$a" "$b"   # ...and use both
done
Syntax pointers:
"${!array[#]}" expands to the list of indices for the array array.
a1=( 1 2 ) assigns to an array named a1. See BashFAQ #5 for an introduction to arrays in bash.
If you have constraints in your input that allows items to be split unambiguously, it's also possible to (hackishly) use that. For an example using a pattern of behaviors explained in BashFAQ #1:
inputs='1:2,2:1,'
while IFS=: read -r -d, a b <&3; do
run_program.py "$a" "$b" 3<&-
done 3<<<"$inputs"
Note that the use of 3 here is arbitrary: File descriptors 0, 1 and 2 are reserved for stdin, stdout and stderr; 3-9 are explicitly available for shell scripts to use; and in practice, higher FD numbers tend to also be available as well (but are prone to be dynamically auto-allocated to store backups or for other shell behavior; that said, a well-behaved shell won't stomp on a FD that a user has explicitly allocated, and will move an FD auto-allocated to store backups of temporarily-redirected descriptors out-of-the-way if the user puts it to explicit use).
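To see the splitting in isolation, here is the same loop with the run_program.py call replaced by collecting the pairs into an array (the input string is invented):

```shell
inputs='1:2,2:1,'
pairs=()
# IFS=: splits each comma-delimited record on the colon into a and b
while IFS=: read -r -d, a b <&3; do
    pairs+=("a=$a b=$b")   # stand-in for: run_program.py "$a" "$b"
done 3<<<"$inputs"

printf '%s\n' "${pairs[@]}"
# a=1 b=2
# a=2 b=1
```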
I want to generate a random number from a given list.
For example if I give the numbers
1,22,33,400,400,23,12,53 etc.
I want to select a random number from the given numbers.
Couldn't find an exact duplicate of this, so here goes my attempt, exactly what 123 mentions in the comments. The solution is portable across shell variants and doesn't invoke any external binaries, which keeps it fast.
You can run the below commands directly on the console.
# Read the elements into bash array, with IFS being the de-limiter for input
IFS="," read -ra randomNos <<< "1,22,33,400,400,23,12,53"
# Print the random numbers using the '$RANDOM' variable built-in modulo with
# array length.
printf "%s\n" "${randomNos[ $RANDOM % ${#randomNos[@]} ]}"
As per the comments below, if you want to pick from a range while ignoring a certain list of numbers, take the approach below:
#!/bin/bash
# Initializing the ignore list with the numbers you have mentioned
declare -A ignoreList='([21]="1" [25]="1" [53]="1" [80]="1" [143]="1" [587]="1" [990]="1" [993]="1")'
# Generating the random number
randomNumber="$(($RANDOM % 1023))"
# Printing the number if it is not in the ignore list
[[ ! -n "${ignoreList["$randomNumber"]}" ]] && printf "%s\n" "$randomNumber"
You can save it in a bash variable like
randomPortNumber=$([[ ! -n "${ignoreList["$randomNumber"]}" ]] && printf "%s\n" "$randomNumber")
Remember associative-arrays need bash version ≥4 to work.
I have a script that loops over database names and if the name of the current database is in my exclusion array I want to skip it. How would I accomplish this in bash?
excluded_databases=("template1" "template0")
for database in $databases
do
if ...; then
# perform something on the database...
fi
done
You can do it by testing each name in turn, but you might be better off filtering the list in one operation. (The following assumes that no name in $databases contains whitespace, which is implicit given your for loop).
for database in $(printf %s\\n $databases |
                  grep -Fvx "${excluded_databases[@]/#/-e}"); do
  # something
done
Explanation of the idioms:
printf %s\\n ... prints each of its arguments on its own line.
grep -Fvx searchs for exact matches (-F) of the whole line (-x) and inverts the match result (-v).
"${array[#]/#/-e}" prepends -e to each element of the array array, which is useful when you need to provide each element of the array as a (repeated) command-line option to a utility. In this case, the utility is grep and the -e flag is used to provide a match pattern.
I've been criticized in the past for printf %s\\n -- some people prefer printf '%s\n' -- but I find the first one easier to type. YMMV.
As a comment, it seems like it would be better to make $databases an array as well as $excluded_databases, which would allow for names including whitespace. The printf | grep solution still doesn't allow newlines in names; it's complicated to work around that. If you were to make that change, you'd only need to change the printf to printf %s\\n "${databases[#]}".
You can use this condition to check for presence of an element in an array:
if [[ "${excluded_databases[*]/$database}" == "${excluded_databases[*]}" ]]
Another option using case:
case "${excluded_databases[*]}" in *"$database"*) echo "found in array" ;; esac
If you are using bash 4 or greater then using an associative array will help you here.
declare -A excluded_databases=(["template1"]=1 ["template0"]=1)

for database in $databases
do
  if [ -n "${excluded_databases[$database]}" ]; then
    continue
  fi
  # ... do something with $database
done
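For a quick check, here is the same pattern with hypothetical database names, collecting the names that are not excluded:

```shell
declare -A excluded_databases=(["template1"]=1 ["template0"]=1)
databases="template0 mydb template1 otherdb"   # hypothetical list

processed=()
for database in $databases; do
    if [ -n "${excluded_databases[$database]}" ]; then
        continue   # name is in the exclusion map: skip it
    fi
    processed+=("$database")
done

printf '%s\n' "${processed[@]}"
# mydb
# otherdb
```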