Execute a program over all pairs of files in a directory using bash script - bash

I have a directory with a bunch of files. I need to create a bash file to qsub and run a program over all pairs of all files:
for $file1, $file2 in all_pairs
do
/path/program -i $file1 $file2 -o $file1.$file2.result
done
So I could do:
qsub script.sh
to get:
file1.file2.result
file1.file3.result
file2.file3.result
for directory with:
file1
file2
file3

The following is probably the easiest:
If the pair a-b is different from b-a (this variant also pairs each file with itself):
set -- file1 file2 file3 file4 ...
for f1; do
for f2; do
/path/program -i "$f1" "$f2" -o "$f1.$f2.result"
done
done
If the pair a-b is the same as b-a:
set -- file1 file2 file3 file4 ...
for f1; do
shift
for f2; do
/path/program -i "$f1" "$f2" -o "$f1.$f2.result"
done
done
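As a dry run of the second variant, here is a minimal sketch that wraps the loop in a function and only prints each pair. The name print_pairs is made up for illustration. It relies on bash expanding the list of a bare `for f1` once, before the first iteration, so the shift inside the loop changes only what the inner `for f2` sees.

```shell
#!/bin/bash
# print_pairs ITEM...: print every unordered pair of the arguments, one per line.
# (print_pairs is a hypothetical helper name, not part of the original answer.)
print_pairs() {
    local f1 f2
    for f1; do
        shift                          # drop f1, so the inner loop sees only later items
        for f2; do
            printf '%s %s\n' "$f1" "$f2"
        done
    done
}

print_pairs file1 file2 file3
# prints:
# file1 file2
# file1 file3
# file2 file3
```

Once the printed pairs look right, the printf line can be swapped for the real program call.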

You can do it as in every other programming language:
files=(file1 file2 file3) # or use a glob to list the files automatically, for instance =(*)
max="${#files[@]}"
for ((i=0; i<max; i++)); do
for ((j=i+1; j<max; j++)); do
echo -i "${files[i]}" "${files[j]}" -o "${files[i]}.${files[j]}.result"
done
done
Replace echo with /path/program when you are happy with the result
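If the array is filled from a glob over a directory, the entries contain slashes, so building the output name from the full paths would no longer give a plain file name. The sketch below is one way to handle that, under two assumptions that are not in the original: the helper name pair_commands is made up, and the result files are named from the base names only. It prints the command lines instead of running anything.

```shell
#!/bin/bash
shopt -s nullglob    # an empty directory gives an empty array, not a literal '*'

# pair_commands DIR: print one command line per unordered pair of files in DIR.
# /path/program is the placeholder from the question; nothing is executed here.
pair_commands() {
    local dir=$1 i j
    local files=( "$dir"/* )
    for ((i = 0; i < ${#files[@]}; i++)); do
        for ((j = i + 1; j < ${#files[@]}; j++)); do
            # ${var##*/} strips the directory part, leaving the base name
            printf '/path/program -i %s %s -o %s.%s.result\n' \
                "${files[i]}" "${files[j]}" "${files[i]##*/}" "${files[j]##*/}"
        done
    done
}

# Demo on a throwaway directory
demo=$(mktemp -d)
touch "$demo/file1" "$demo/file2" "$demo/file3"
pair_commands "$demo"
rm -rf "$demo"
```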

Related

Bash pass the file names which are not in the ith element of loop

In a simple processing of files, where you want to do something on every file in a directory, you do something like this:
for i in file1 file2 file3 file5
do
echo "Processing $i"
done
What I want to do here is pass $i as well as the non-$i files as arguments to a command. Let's say my directory contains 4 files (file1, file2, file3, file5). For example, in the first iteration of the loop, when file1 is being processed, I want to pass the rest of the files (file2, file3, file5) to the -b argument of the command.
For example, first iteration of loop in bash should look something like this:
FILES=/path/to/directory
for i in $FILES
do
bedtools intersect -a $i -b file2 file3 file5
done
In the second iteration, file2 is in $i, so the rest of the files are passed to the -b argument:
for i in $FILES
do
bedtools intersect -a $i -b file1 file3 file5
done
and so on for all the files in the directory. In short, pass the current file to -a argument and rest of the files to -b argument.
It would be great if somebody could help me with this. Thank you.
You can just use a numeric loop and take slices out of the array:
shopt -s nullglob
files=( path/to/directory/* )
for (( i = 0; i < ${#files[@]}; ++i )); do
file=${files[i]}
others=( "${files[@]:0:i}" "${files[@]:i+1}" )
bedtools intersect -a "$file" -b "${others[@]}"
done
This loops through the indices of the array files and slices out the parts before and after the current index i to get the others.
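To see what the two slices produce at each index, here is a small sketch that prints the arguments instead of calling bedtools. The helper name others_of is hypothetical, and the file names are the ones from the question.

```shell
#!/bin/bash
# others_of INDEX ITEM...: print every ITEM except the one at INDEX.
# (others_of is a made-up helper for illustration only.)
others_of() {
    local i=$1
    shift
    local items=( "$@" )
    # slice before index i, then slice after it; either may be empty
    local others=( "${items[@]:0:i}" "${items[@]:i+1}" )
    printf '%s\n' "${others[*]}"
}

files=(file1 file2 file3 file5)
for ((i = 0; i < ${#files[@]}; i++)); do
    echo "-a ${files[i]} -b $(others_of "$i" "${files[@]}")"
done
# prints:
# -a file1 -b file2 file3 file5
# -a file2 -b file1 file3 file5
# -a file3 -b file1 file2 file5
# -a file5 -b file1 file2 file3
```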
You can also try it like this:
op=$(find /path/to/directory ! -iname ".*")
temp=$op
for i in $op;
do
rfile=${temp//$i/}
rfile=$(echo $rfile | tr '\n' ' ')
bedtools intersect -a $i -b $rfile
done
count=0; files=(*)
for i in ${files[*]}; do
unset files[count]
echo "bedtools intersect -a $i -b ${files[*]}"
files+=($i)
((count++))
done

how to ignore a newLine character in the compare script as below

#!/bin/bash
function compare {
for file1 in /dir1/*.csv
do
file2=/dir2/$(basename "$file1")
if [[ -e "$file2" ]] ### loop only if the file2 with same filename as file1 is present ###
then
awk 'BEGIN {FS=","} NR == FNR{arr[$0];next} ! ($0 in arr)' "$file1" "$file2" > "/dirDiff/$(basename "$file1")_diff"
fi
done
}
function removeNULL {
for i in /dirDiff/*_diff
do
if [[ ! -s "$i" ]] ### if file exists with zero size ###
then
\rm -- "$i"
fi
done
}
compare
removeNULL
file1 and file2 are formatted files from two different sources. Source1 inserts an arbitrary newline character that splits one record into two, which causes the script to fail and produce a wrong diff output.
I want the script to compare file1 and file2 while ignoring the newline inserted by Source1, but I am not sure how the script can tell an actual new record from the inserted newline.
file1:-
11447438218480362,6005560623,6005560623,11447438218480362,5,20160130103044,100,195031,,1,0,00,49256,0
,195031_5_00_6,0.1,6;
11447691224860640,6997557634,6997557634,11447691224860640,601511,20160130103457,500,195035,,2,0,00,45394,0
,195035_601511_00_6,0.5,6;
file2:-
11447438218480362,6005560623,6005560623,11447438218480362,5,20160130103044,100,195031,,1,0,00,49256,0,195031_5_00_6,0.1,6;
11447691224860640,6997557634,6997557634,11447691224860640,601511,20160130103457,500,195035,,2,0,00,45394,0,195035_601511_00_6,0.5,6;
Appreciate your support.
You could preprocess your file1 joining lines not ending in ; with the next line:
sed -r ":again; /;$/! { N; s/(.+)[\r\n]+(.+)/\1\2/g; b again; }" file1
so that file1 and file2 are comparable.
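The same joining step can also be sketched with awk, as an alternative to the sed command rather than a replacement for it. This assumes, as in the sample data, that every complete record ends in ; and that no line inside a record does. The function name join_records is made up for illustration.

```shell
#!/bin/bash
# join_records: read lines on stdin and glue them together until one ends in ';'.
# Assumption: a trailing ';' marks the end of a complete record.
join_records() {
    awk '{ sub(/\r$/, ""); printf "%s", $0 }   # drop a stray CR, print without newline
         /;$/ { print "" }'                    # record complete: end the output line
}

printf '%s\n' 'a,b,0' ',c,0.1,6;' 'd,e,0' ',f,0.5,6;' | join_records
# prints:
# a,b,0,c,0.1,6;
# d,e,0,f,0.5,6;
```

The joined output can be written to a temporary file and passed to the compare function in place of the original file1.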

How can I merge before move files?

I have some files (few millions) and I keep file list in files.txt like this:
/home/user/1.txt
/home/user/2.txt
/home/user/3.txt
/home/user/4.txt
/home/user/5.txt
I need to move all of them, but I must also merge them before moving.
I can move like this:
#!/bin/bash
for files in $(cat files.txt); do
mv $files /home/user/hop/
done
I can merge all with cat * but I need to merge by twos, like this:
1.txt and 2.txt merge --> 1.txt and move.
3.txt and 4.txt merge --> 3.txt and move.
5.txt --> 5.txt and move.
But I must merge before move, in /home/user/, not in /home/user/hop/
How can I do this?
You can run cat file1 file2 file3 file4 file5 file6 > out.txt after you have moved them; the order of the arguments also sets the order in which the files are merged.
This also works for binary files.
You can use this script:
while read -r f; do
if ((++i % 2)); then
p="$f"
else
cat "$f" >> "$p"
mv "$p" /home/user/hop/
rm "$f"
unset p
fi
done < files.txt
[[ -n $p ]] && mv "$p" /home/user/hop/
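For trying the pairwise merge safely before touching millions of files, the following sketch repeats the same logic as a function and exercises it in a throwaway directory. The function name merge_pairs and all paths in the demo are assumptions made for illustration.

```shell
#!/bin/bash
# merge_pairs LIST DEST: append every second file named in LIST to the file
# listed before it, delete the appended file, and move the merged file to DEST.
# A trailing unpaired file is moved unchanged.
merge_pairs() {
    local list=$1 dest=$2 first="" f
    while IFS= read -r f; do
        if [[ -z $first ]]; then
            first=$f                              # first file of the pair
        else
            cat -- "$f" >> "$first" && rm -- "$f" # merge in place, before moving
            mv -- "$first" "$dest"/
            first=""
        fi
    done < "$list"
    if [[ -n $first ]]; then mv -- "$first" "$dest"/; fi
}

# Demo: five small files, merged by twos and moved to hop/
work=$(mktemp -d)
mkdir "$work/hop"
for n in 1 2 3 4 5; do echo "content $n" > "$work/$n.txt"; done
printf '%s\n' "$work"/{1..5}.txt > "$work/files.txt"
merge_pairs "$work/files.txt" "$work/hop"
ls "$work/hop"          # lists 1.txt, 3.txt and 5.txt
rm -rf "$work"
```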

Taking line intersection of several files

I see comm can do 2 files and diff3 can do 3 files. I want to do this for more files (about five).
One way:
comm -12 file1 file2 >tmp1
comm -12 tmp1 file3 >tmp2
comm -12 tmp2 file4 >tmp3
comm -12 tmp3 file5
This process could be turned into a script
comm -12 $1 $2 > tmp1
for i in $(seq 3 1 $# 2>/dev/null); do
comm -12 tmp`expr $i - 2` $(eval echo '$'$i) >tmp`expr $i - 1`
done
if [ $# -eq 2 ]; then
cat tmp1
else
cat tmp`expr $i - 1`
fi
rm tmp*
This seems like poorly written code, even to a newbie like me. Is there a better way?
It's quite a bit more convoluted than it has to be. Here's another way of doing it.
#!/bin/bash
# Create some temp files to avoid trashing and deleting tmp* in the directory
tmp=$(mktemp)
result=$(mktemp)
# The intersection of one file is itself
cp "$1" "$result"
shift
# For each additional file, intersect with the intermediate result
for file
do
comm -12 "$file" "$result" > "$tmp" && mv "$tmp" "$result"
done
cat "$result" && rm "$result"
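One caveat: comm expects its inputs to be sorted. If the files may not be, a variant along these lines sorts each one on the fly. This is a sketch under that assumption, with a made-up function name (intersect_lines), and it also removes duplicate lines within each file.

```shell
#!/bin/bash
# intersect_lines FILE...: print the lines common to all FILEs.
# Each input is passed through sort -u because comm expects sorted input.
intersect_lines() {
    local result tmp file
    result=$(mktemp) || return
    tmp=$(mktemp) || return
    sort -u -- "$1" > "$result"       # the intersection of one file is itself
    shift
    for file; do
        # '-' makes comm read the freshly sorted file from standard input
        sort -u -- "$file" | comm -12 - "$result" > "$tmp" && mv -- "$tmp" "$result"
    done
    cat -- "$result"
    rm -f -- "$result" "$tmp"
}

# Demo with three unsorted files
d=$(mktemp -d)
printf '%s\n' apple banana cherry > "$d/f1"
printf '%s\n' cherry apple date   > "$d/f2"
printf '%s\n' fig cherry apple    > "$d/f3"
intersect_lines "$d/f1" "$d/f2" "$d/f3"
# prints:
# apple
# cherry
rm -rf "$d"
```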

How can I print the lines of four files together?

I have four files and I want to print the 1st line of file1, file2, file3 and file4, then the 2nd line of each file, then the 3rd line of each file, and so on.
I tried the following code, but it gave me an error:
for i in $(cat $file1)
do
for j in $(cat $file2)
do
for k in $(cat $file3)
do
for l in $(cat $file4)
echo "${i}"
echo "${j}"
echo "${k}"
echo "${l}"
done
done
done
done
So what can I use other than echo?
There is a tool for that already.
paste "$file1" "$file2" "$file3" "$file4"
Use paste -d $'\n' if you don't want columnar output. (Thanks, @AnsgarWiechers!)
Use paste.
paste file1 file2 file3 file4
Will this do it for you?
paste -d '\n' file1 file2 file3 ...
If you want the contents of the files on one line:
paste file1 file2 file3 ...
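To make the difference between the two forms concrete, here is a small sketch with two made-up two-line files. With -d '\n' the lines are interleaved one per output line; with the default delimiter they sit side by side, separated by a tab.

```shell
#!/bin/bash
# Two tiny sample files (contents invented for the demo)
d=$(mktemp -d)
printf '%s\n' a1 a2 > "$d/file1"
printf '%s\n' b1 b2 > "$d/file2"

# Line 1 of each file, then line 2 of each file, one per output line
interleaved=$(paste -d '\n' "$d/file1" "$d/file2")
printf '%s\n' "$interleaved"
# prints:
# a1
# b1
# a2
# b2

# Default delimiter: matching lines joined by a tab
columns=$(paste "$d/file1" "$d/file2")
printf '%s\n' "$columns"

rm -rf "$d"
```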
