Element matching within the different lists - bash

As usual :) I have two sets of input files with the same names but different extensions.
Using bash I made a simple script which creates two lists with identical element names while looping over the two sets of files from two directories, using only the names without extensions as the elements of those lists:
#!/bin/bash
workdir=/data2/Gleb/TEST/claire+7d4+md/water_analysis/MD1
traj_all=${workdir}/tr_all
top_all=${workdir}/top_all
#make 2 lists for both file types
Trajectories=('');
Topologies=('');
#looping of 1st input files
echo "Trr has been found in ${traj_all}:"
for tr in ${traj_all}/*; do # ????
tr_n_full=$(basename "${tr}")
tr_n="${tr_n_full%.*}"
Trajectories=("${Trajectories[#]}" "${tr_n}");
done
#sort elements within ${Trajectories[#]} lists!! >> HERE I NEED HELP!
#looping of 2nd files
echo "Top has been found in ${top_all}:"
for top in ${top_all}/*; do # ????
top_n_full=$(basename "${top}")
top_n="${top_n_full%.*}"
Topologies=("${Topologies[#]}" "${top_n}");
done
#sort elements within ${Topologies[#] lists!! >> HERE I NEED HELP!
#make input.in file for some program- matching of elements from both lists >> HERE I NEED HELP!
for i in $(seq 1 ${#Topologies[@]}); do
printf "parm $top_all/${Topologies[i]}.top \ntrajin $traj_all/${Trajectories[i]}.mdcrd\nwatershell ${Area} ${output}/watershell_${Topologies[i]}_${Area}.dat > output.in
done
I'd be thankful if someone could suggest a good way to improve this script:
1) I need to sort the elements of both lists in the same way after the last elements have been added to each of them;
2) I need to add a test to the LAST step of the script which creates the final output.in file only if the elements matched by the printf are the same in both lists (in principle, in this case they always should be the same!).
Thanks for the help,
Gleb

Here's a simpler way to create the arrays:
# Create an empty array
Trajectories=();
for tr in "${traj_all}"/*; do
# Remove the string of directories
tr_base=${tr##*/}
# Append the name without extension to the array
Trajectories+="${tr_base%.*}"
done
In bash, this will normally result in a sorted list, because the expansion of * in the glob is sorted. But you can sort it with sort; it is simplest if you are certain there are no newlines in the filenames:
mapfile -t sorted_traj < <(printf %s\\n "${Trajectories[@]}" | sort)
To compare two sorted arrays, you could use join:
# convenience function; could have helped above, too.
lines() { printf %s\\n "$@"; }
# Some examples:
# 1. compare a with b and print the lines which are only in a
join -v 1 -t '' <(lines "${a[@]}") <(lines "${b[@]}")
# 2. create c as an array with the lines which are only in b
mapfile -t c < <( join -v 2 -t '' <(lines "${a[@]}") <(lines "${b[@]}") )
If you create both difference lists, then the two arrays are equal if both lists are empty. If you are expecting both arrays to be the same, though, and if this is time-critical (probably not), you could do a simple precheck:
if [[ "${a[*]}" = "${b[*]}" ]]; then
# the arrays are the same
else
# the arrays differ; do some more work to see how.
fi
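Putting those pieces together for the original question, a minimal sketch might look like the following. It assumes Area and output are set elsewhere, exactly as in the original script, and is meant as a starting point rather than a drop-in replacement:
#!/bin/bash
# Sketch only: ties the pieces above back to the original script.
workdir=/data2/Gleb/TEST/claire+7d4+md/water_analysis/MD1
traj_all=${workdir}/tr_all
top_all=${workdir}/top_all

lines() { printf '%s\n' "$@"; }

# Collect base names without extensions from both directories.
Trajectories=()
Topologies=()
for tr in "${traj_all}"/*;  do f=${tr##*/};  Trajectories+=("${f%.*}"); done
for top in "${top_all}"/*;  do f=${top##*/}; Topologies+=("${f%.*}");   done

# Sort both lists the same way.
mapfile -t Trajectories < <(lines "${Trajectories[@]}" | sort)
mapfile -t Topologies   < <(lines "${Topologies[@]}"   | sort)

# Only generate output.in when both sorted lists are identical.
if [[ "${Trajectories[*]}" = "${Topologies[*]}" ]]; then
    for i in "${!Topologies[@]}"; do
        printf 'parm %s/%s.top\ntrajin %s/%s.mdcrd\nwatershell %s %s/watershell_%s_%s.dat\n' \
            "$top_all" "${Topologies[i]}" "$traj_all" "${Trajectories[i]}" \
            "$Area" "$output" "${Topologies[i]}" "$Area"
    done > output.in
else
    echo "Name mismatch between ${traj_all} and ${top_all}" >&2
fi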

Related

Compare multiple tsv files for match columns

I have 4466 .tsv files with this structure:
[image: file structure]
I want to compare the 4466 files to see how many IDs (the first column) match.
I only found bash commands for two files, with "comm". Could you tell me how I could do that?
Thank you
I read your question as:
Amongst all TSV files, which column IDs are found in every file?
If that's true, we want the intersection of all the sets of column IDs from all files. We can use the join command to get the intersection of any two files, and we can use the algebraic properties of intersection to effectively join all the files.
Consider the intersection of ID for these three files:
file1.tsv   file2.tsv   file3.tsv
---------   ---------   ---------
ID          ID          ID
1           1           2
2           3           3
3
"3" is the only ID shared between all three. We can only join two files together at a time, so we need some way to effectively get, join (join file1.tsv file2.tsv) file3.tsv. Fortunately for us intersections are idempotent and associative, so we can apply join iteratively in a loop over all the files, like so:
# "Prime" the common file
cp file1.tsv common.tsv
for TSV in file*.tsv; do
join "$TSV" common.tsv > myTmp
mv myTmp common.tsv
echo "After joining $TSV, common IDs are:"
cat common.tsv
done
When I run that it prints the following:
After joining file1.tsv, common IDs are:
ID
1
2
3
After joining file2.tsv, common IDs are:
ID
1
3
After joining file3.tsv, common IDs are:
ID
3
The first iteration joins file1 with itself (because we primed common with file1); this is where we need intersection to be idempotent.
The second iteration joins in file2, cutting out ID "2".
The third iteration joins in file3, cutting the IDs down to just "3".
Technically, join considers the string "ID" to be one of the things to evaluate... it doesn't know what a header line is, or what an ID is... it just knows to look in some number of fields for common values. In that example we didn't specify a field, so it defaulted to the first field, and it always found "ID" and it always found "3".
For your files, we need to tell join to:
separate on a tab character, with -t <TAB-CHAR>
only output the join field (which, by default, is the first field), with -o 0
Here's my full implementation:
#!/bin/sh
TAB="$(printf '\t')"
# myJoin joins tsvX with the previously-joined common on
# the first field of both files; saving the first field
# of the joined output back into common
myJoin() {
tsvX="$1"
join -t "$TAB" -o 0 common.tsv "$tsvX" > myTmp.tsv
mv myTmp.tsv common.tsv
}
# "Prime" common
cp input1.tsv common.tsv
for TSV in input*.tsv; do
myJoin "$TSV"
done
echo "The common IDs are:"
tail -n +2 common.tsv   # +2 skips the "ID" header line
For an explanation of why "$(printf '\t')" is used, check out the following on POSIX compliance:
https://www.shellcheck.net/wiki/SC3003
https://unix.stackexchange.com/a/468048/366399
The question sounds quite vague. So, assuming that you want to extract IDs that all 4466 files have in common, i.e. IDs such that each of them occurs at least once in all of the *.tsv files, you can do this (e.g.) in pure Bash using associative arrays and calculating “set intersections” on them.
#!/bin/bash
# removes all IDs from array $1 that do not occur in array $2.
intersect_ids() {
local -n acc="$1"
local -rn operand="$2"
local id
for id in "${!acc[#]}"; do
((operand["$id"])) || unset "acc['${id}']"
done
}
# prints IDs that occur in all files called *.tsv in directory $1.
get_ids_intersection() (
shopt -s nullglob
local -ar files=("${1}/"*.tsv)
local -Ai common_ids next_ids
local file id _
if ((${#files[@]})); then
while read -r id _; do ((++common_ids["$id"])); done < "${files[0]}"
for file in "${files[@]:1}"; do
while read -r id _; do ((++next_ids["$id"])); done < "$file"
intersect_ids common_ids next_ids
next_ids=()
done
fi
for id in "${!common_ids[#]}"; do printf '%s\n' "$id"; done
)
get_ids_intersection /directory/where/tsv/files/are

How do I print out 2 separate arrays with new lines in bash script

So basically I want to be able to print out 2 separate arrays with newlines between each element.
Sample output I'm looking for:
a x
b y
(a, b being part of one array; x, y being a separate array)
Currently I'm using:
printf "%s\n" "${words[@]} ${newWords[@]}"
But the output comes out like:
a
b x
y
As bash is tagged, you could use paste from GNU coreutils with each array as an input:
$ words=(a b)
$ newWords=(x y)
$ paste <(printf '%s\n' "${words[@]}") <(printf '%s\n' "${newWords[@]}")
a x
b y
TAB is the default column separator but you can change it with option -d.
If you have array items that might contain newlines, you can switch to e.g. NUL-delimited strings by using the -z flag and producing each input using printf '%s\0'.
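For example, a sketch of that NUL-delimited variant (this assumes GNU paste, which supports -z):
# NUL-delimited records in and out; the final tr is only for display,
# so items that contain newlines survive the pairing step intact.
paste -z <(printf '%s\0' "${words[@]}") <(printf '%s\0' "${newWords[@]}") | tr '\0' '\n'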
What does "${words[#]} ${newWords[#]}" produce? Let's put that expansion into another array and see what's inside it:
words=(a b)
newWords=(x y)
tmp=("${words[#]} ${newWords[#]}")
declare -p tmp
declare -a tmp=([0]="a" [1]="b x" [2]="y")
So, the last element of the first array and the first element of the second array are joined as a string; the other elements remain individual.
paste with 2 process substitutions is a good way to solve this. If you want to do it in plain bash, iterate over the indices of the arrays:
for idx in "${!words[#]}"; do
printf '%s\t%s\n' "${words[idx]}" "${newWords[idx]}"
done

Bash. Associative array iteration (ordered and without duplicates)

I have two problems handling associative arrays. The first is that I can't keep a custom order in them.
#!/bin/bash
#First part, I just want to print it ordered in the custom created order (non-alphabetical)
declare -gA array
array["PREFIX_THIS","value"]="true"
array["PREFIX_IS","value"]="false"
array["PREFIX_AN","value"]="true"
array["PREFIX_ORDERED","value"]="true"
array["PREFIX_ARRAY","value"]="true"
for item in "${!array[#]}"; do
echo "${item}"
done
Desired output is:
PREFIX_THIS,value
PREFIX_IS,value
PREFIX_AN,value
PREFIX_ORDERED,value
PREFIX_ARRAY,value
But I'm obtaining this:
PREFIX_IS,value
PREFIX_ORDERED,value
PREFIX_THIS,value
PREFIX_AN,value
PREFIX_ARRAY,value
So much for the first problem. For the second problem, the order is not important. I added more stuff to the associative array and I just want to loop over it without duplicates. Adding this:
array["PREFIX_THIS","text"]="Text for the var"
array["PREFIX_IS","text"]="Another text"
array["PREFIX_AN","text"]="Text doesn't really matter"
array["PREFIX_ORDERED","text"]="Whatever"
array["PREFIX_ARRAY","text"]="More text"
I just want to loop over "PREFIX_THIS", "PREFIX_IS", "PREFIX_AN", etc... printing each one only once. I just want to print doing an "echo" on loop (order is not important for this part, just to print each one only once). Desired output:
PREFIX_ORDERED
PREFIX_AN
PREFIX_ARRAY
PREFIX_IS
PREFIX_THIS
I achieved it doing "dirty" stuff, but there must be a more elegant way. This is my working but not very elegant approach:
already_set=""
var_name=""
for item in "${!array[#]}"; do
var_name="${item%,*}"
if [[ ! ${already_set} =~ "${var_name}" ]]; then
echo "${var_name}"
already_set+="${item}"
fi
done
Any help? Thanks.
Iteration Order
As Inian pointed out in the comments, you cannot fix the order in which "${!array[@]}" expands for associative arrays. However, you can store all keys inside a normal array that you can order manually.
keysInCustomOrder=(PREFIX_{THIS,IS,AN,ORDERED,ARRAY})
for key in "${keysInCustomOrder[#]}"; do
echo "do something with ${array[$key,value]}"
done
Unique Prefixes of Keys
For your second problem: a["key1","key2"] is the same as a["key1,key2"]. In bash, arrays are always 1D, so there is no perfect solution. However, you can use the following one-liner as long as , is never part of key1.
$ declare -A array=([a,1]=x [a,2]=y [b,1]=z [c,1]=u [c,2]=v)
$ printf %s\\n "${!array[@]}" | cut -d, -f1 | sort -u
a
b
c
When your keys may also contain line breaks, delimit each key with NUL (\0) instead.
printf %s\\0 "${!array[@]}" | cut -zd, -f1 | sort -zu
Alternatively, you could use reference variables to simulate 2D arrays, though I would advise against using them.
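For completeness, here is a small sketch of that nameref approach (bash 4.3+), using one associative array per "row"; the names row_a, row_b and row_c are made up for the example:
declare -A row_a=([1]=x [2]=y)
declare -A row_b=([1]=z)
declare -A row_c=([1]=u [2]=v)

for name in row_a row_b row_c; do
    declare -n row="$name"              # row now refers to the array named in $name
    for key in "${!row[@]}"; do
        printf '%s[%s]=%s\n' "$name" "$key" "${row[$key]}"
    done
    unset -n row                        # drop the reference before the next iteration
done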

Bash: nested loop one way comparison

I have one question about nested loops in bash.
I have an input file with one file name per line (full path).
I read this file and then I make a nested loop:
for i in $filelines ; do
echo $i
for j in $filelines ; do
./program $i $j
done
done
The program I run within the loop is pretty slow.
Basically it compares file A with file B.
I want to skip the A vs A comparison (i.e. comparing one file with itself) AND
I want to avoid permutations (i.e. for files A and B, only perform A against B and not B against A).
What is the simplest way to perform this?
Version 2: this one takes care of permutations
#!/bin/bash
tmpunsorted="/tmp/compare_unsorted"
tmpsorted="/tmp/compare_sorted"
>$tmpunsorted
while read linei
do
while read linej
do
if [ "$linei" != "$linej" ]
then
echo $linei $linej | tr " " "\n" | sort | tr "\n" " " >>$tmpunsorted
echo >>$tmpunsorted
fi
done <filelines
done <filelines
sort $tmpunsorted | uniq > $tmpsorted
while read linecompare
do
echo "./program $linecompare"
done <$tmpsorted
# Cleanup
rm -f $tmpunsorted
rm -f $tmpsorted
What is done here:
I use the while loop to read each line, twice, i and j
if the values of the lines are the same, forget them; there is no use considering them
if they are different, output them into a file ($tmpunsorted). They are sorted in alphabetical order before going into the $tmpunsorted file, so the arguments are always in the same order. This way "a b" and "b a" end up as the same line in the unsorted file.
I then apply sort | uniq on $tmpunsorted, so the result is a list of unique argument pairs.
Finally, loop over the $tmpsorted file and call the program on each individual pair.
Since I do not have your program, I did an echo, which you should remove to use the script.
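If you'd rather avoid the temporary files, a sketch of an index-based alternative reads filelines into an array and starts the inner loop one position past the outer index, which skips self-comparisons and permutations in one go:
#!/bin/bash
# Read the list of file names into an array, one path per element.
mapfile -t files < filelines

# j always starts one past i, so every unordered pair is visited
# exactly once and a file is never compared with itself.
for ((i = 0; i < ${#files[@]} - 1; i++)); do
    for ((j = i + 1; j < ${#files[@]}; j++)); do
        ./program "${files[i]}" "${files[j]}"
    done
done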

Loop over two associative arrays in Bash

Say I have two associative arrays in Bash
declare -A a
declare -A b
a[xz]=1
b[xz]=2
a[zx]=3
b[zx]=4
I want to do something like this
for arr in ${a[@]} ${b[@]}; do echo ${arr[zx]}; done
and get 3 and 4 as output,
but I get
$ for arr in ${a[@]} ${b[@]}; do echo ${arr[zx]}; done
1
3
2
4
Is there a way to do this in Bash?
You don't want to iterate over the contents; you want to iterate over the names of the arrays, then use indirect expansion to get the desired value of the fixed key from each array.
for arr in a b; do
t=$arr[zx] # first a[zx], then b[zx]
printf '%s\n' "${!t}"
done
Here, the variable "name" for use in indirect expansion is the name of the array along with the desired index.
Assuming the keys in both arrays match (a major assumption), you can use one array as a reference, loop over its keys, and print from each array.
for key in "${!a[#]}"; do
printf "Array-1(%s) %s Array-2(%s) %s\n" "$key" "${a[$key]}" "$key" "${b[$key]}"
done
which produces the output below. You can of course remove the fancy debug labels (Array-1, Array-2), which were added just for clarity.
Array-1(xz) 1 Array-2(xz) 2
Array-1(zx) 3 Array-2(zx) 4
One good general practice is to always quote your array expansions in bash (for key in "${!a[@]}"), so that the elements are not subjected to word-splitting by the shell.
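To see why the quoting matters, here is a tiny made-up example with a key that contains a space:
declare -A m=(["two words"]=1)
for key in ${!m[@]};   do echo "[$key]"; done   # unquoted: prints [two] and [words]
for key in "${!m[@]}"; do echo "[$key]"; done   # quoted:   prints [two words]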
