Bash: pass the file names which are not the ith element of the loop

In a simple processing of files, where you want to do something on every file in a directory, you do something like this:
for i in file1 file2 file3 file5
do
echo "Processing $i"
done
What I want to do here is pass $i as well as the non-$i files as arguments to a command. Let's say my directory contains 4 files (file1, file2, file3, file5). For example, in the first iteration of the loop, when file1 is being processed, I want to pass the rest of the files (file2, file3, file5) to the -b argument of the command.
For example, first iteration of loop in bash should look something like this:
FILES=/path/to/directory
for i in $FILES
do
bedtools intersect -a $i -b file2 file3 file5
done
In the second iteration, as file2 is in $i, the rest of the files will be passed to the -b argument.
for i in $FILES
do
bedtools intersect -a $i -b file1 file3 file5
done
and so on for all the files in the directory. In short: pass the current file to the -a argument and the rest of the files to the -b argument.
It would be great if somebody could help me with this. Thank you.

You can just use a numeric loop and take slices out of the array:
shopt -s nullglob
files=( path/to/directory/* )
for (( i = 0; i < ${#files[@]}; ++i )); do
file=${files[i]}
others=( "${files[@]:0:i}" "${files[@]:i+1}" )
bedtools intersect -a "$file" -b "${others[@]}"
done
This loops through the indices of the array files and slices the parts before and after the current index i to get the others.
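As a quick sanity check of the slicing, here is the same trick with throwaway names in place of real files:

```shell
#!/usr/bin/env bash
# "${files[@]:0:i}" is everything before index i,
# "${files[@]:i+1}" is everything after it.
files=( a b c d )
i=1
others=( "${files[@]:0:i}" "${files[@]:i+1}" )
echo "current: ${files[i]}"   # current: b
echo "others:  ${others[*]}"  # others:  a c d
```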

You can try it like this as well:
op=$(find /path/to/directory ! -iname ".*")
temp=$op
for i in $op;
do
rfile=${temp//$i/}
rfile=$(echo $rfile | tr '\n' ' ')
bedtools intersect -a $i -b $rfile
done

count=0; files=(*)
for i in ${files[*]}; do
unset files[count]
echo "bedtools intersect -a $i -b ${files[*]}"
files+=($i)
((count++))
done

Related

Grab random files from a directory using just bash

I am looking to create a bash script that can grab files fitting a certain glob pattern and cp them to another folder. For example, given
$foo\
a.txt
b.txt
c.txt
e.txt
f.txt
g.txt
if I run a script that requests 2 files, I would get
$bar\
c.txt
f.txt
I am not sure if bash has a random number generator, nor how to use it to pull from a list. The directory is large as well (over 100K files), so some of the glob stuff won't work.
Thanks in advance
Using GNU shuf, this copies N random files matching the given glob pattern in the given source directory to the given destination directory.
#!/bin/bash -e
shopt -s failglob
n=${1:?} glob=${2:?} source=${3:?} dest=${4:?}
declare -i rand
IFS=
[[ -d "$source" ]]
[[ -d "$dest" && -w "$dest" ]]
cd "$dest"
dest=$PWD
cd "$OLDPWD"
cd "$source"
printf '%s\0' $glob |
shuf -zn "$n" |
xargs -0 cp -t "$dest"
Use like:
./cp-rand 2 '?.txt' /source/dir /dest/dir
This will work for a directory containing thousands of files. xargs will manage limits like ARG_MAX.
$glob, unquoted, undergoes filename expansion (glob expansion). Because IFS is empty, the glob pattern can contain whitespace.
Matching sub-directories will cause cp to error and exit prematurely (some files may have already been copied). Use cp -r to allow sub-directories.
cp -t target and xargs -0 are not POSIX.
Note that using a random number to select files from a list can cause duplicates, so you might copy fewer than N files. Hence using GNU shuf.
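A minimal illustration of the difference (throwaway names, GNU shuf assumed): shuf -n samples without replacement, so the picks are always distinct.

```shell
#!/usr/bin/env bash
# shuf -n 2 picks 2 distinct lines; piping through sort -u | wc -l
# confirms there are always 2 unique names, never a duplicate pair.
printf '%s\n' file1 file2 file3 file4 |
shuf -n 2 |
sort -u |
wc -l   # always 2 distinct lines remain
```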
Try this:
#!/bin/bash
sourcedir="files"
# Arguments processing
if [[ $# -ne 1 ]]
then
echo "Usage: random_files.bash NUMBER-OF-FILES"
echo " NUMBER-OF-FILES: how many random files to select"
exit 0
else
numberoffiles="$1"
fi
# Validations
listoffiles=()
while IFS='' read -r line; do listoffiles+=("$line"); done < <(find "$sourcedir" -type f -print)
totalnumberoffiles=${#listoffiles[@]}
# loop on the number of files the user wanted
for (( i=1; i<=numberoffiles; i++ ))
do
# Select a random number between 0 and $totalnumberoffiles
randomnumber=$(( RANDOM % totalnumberoffiles ))
echo "${listoffiles[$randomnumber]}"
done
build an array with the filenames
random a number from 0 to the size of the array
display the filename at that index
I built in a loop if you want to randomly select more than one file
you can setup another argument for the location of the files, I hard coded it here.
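The core index trick can be tried on its own (hypothetical array, not tied to the script above):

```shell
#!/usr/bin/env bash
# RANDOM % size always yields a valid index from 0 to size-1,
# so the lookup can never run off the end of the array.
list=( red green blue )
idx=$(( RANDOM % ${#list[@]} ))
echo "picked: ${list[idx]}"
```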
Another method, if this one fails because of too many files in the same directory, could be:
#!/bin/bash
sourcedir="files"
# Arguments processing
if [[ $# -ne 1 ]]
then
echo "Usage: random_files.bash NUMBER-OF-FILES"
echo " NUMBER-OF-FILES: how many random files to select"
exit 0
else
numberoffiles="$1"
fi
# Validations
find "$sourcedir" -type f -print >list.txt
totalnumberoffiles=$(wc -l list.txt | awk '{print $1}')
# loop on the number of files the user wanted
for (( i=1; i<=numberoffiles; i++ ))
do
# Select a random number between 1 and $totalnumberoffiles
randomnumber=$(( ( RANDOM % totalnumberoffiles ) + 1 ))
sed -n "${randomnumber}p" list.txt
done
/bin/rm -f list.txt
build a list of the files, so that each filename will be on one line
select a random number
in that one, the randomnumber must be +1 since line count starts at 1, not at 0 like in an array.
use sed to print the random line from the list of files
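The sed -n "${n}p" idiom can be checked in isolation (temporary file, throwaway names):

```shell
#!/usr/bin/env bash
# sed -n "Np" prints only line N (1-based) of its input and nothing else.
list=$(mktemp)
printf '%s\n' file1 file2 file3 > "$list"
n=2
sed -n "${n}p" "$list"   # file2
rm -f "$list"
```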

How to add grouping mechanism inside for loop in bash

I have a for loop that loops through a list of files, and inside the for loop a script is called that takes this file name as input.
Something like
for file in $(cat list_of_files) ; do
script $file
done
the file list_of_files has files like
file1
file2
file3
...
so with each iteration, one file is processed.
I have to design something that loops through all the files and groups them into groups of 3, so that in one loop iteration the script is called 3 times (not one by one), then the next 3 files are handled in the second iteration, and so on.
something like,
for file in $(cat list_of_files) ; do
# do somekind of grouping here
call one more loop to run the sript.sh 3 times, so something like
for i=1 to 3 and then next iteration from 4 to 6 and so on..
script.sh $file1
script.sh $file2
script.sh $file3
done
I am currently struggling with how to get this looping done; I am stuck and could not think of an efficient way to do it.
Change for ... in to while read
for file in $(cat list_of_files)
This style of loop is subtly dangerous and/or incorrect. It won't work right on file names with spaces, asterisks, or other special characters. As a general rule, avoid for x in $(...) loops. For more details, see:
Bash Pitfalls: for f in $(ls *.mp3).
A safer alternative is to use while read along with process substitution, like so:
while IFS= read -r file; do
...
done < <(cat list_of_files)
It's ugly, I'll admit, but it will handle special characters safely. It won't split apart file names with spaces and it won't expand * globs. For more details on what this is doing, see:
Unix.SE: Understanding “IFS= read -r line”.
You can then remove the Useless Use of Cat and use a simple redirection instead:
while IFS= read -r file; do
...
done < list_of_files
Read 3 at a time
So far these changes haven't answered your core question, how to group files 3 at a time. The switch to read has actually served a second purpose. It makes grouping easy. The trick is to call read multiple times per iteration. This is an easy change with while read; it's not so easy with for ... in.
Here's what that looks like:
while IFS= read -r file1 &&
IFS= read -r file2 &&
IFS= read -r file3
do
script.sh "$file1"
script.sh "$file2"
script.sh "$file3"
done < list_of_files
This calls read three times, and once all three succeed it proceeds to the loop body.
It will work great if you always have a multiple of 3 items to process. If not, it will mess up at the end and skip the last file or two. If that's an issue we can update it to try to handle that case.
while IFS= read -r file1; do
IFS= read -r file2
IFS= read -r file3
script.sh "$file1"
[[ -n $file2 ]] && script.sh "$file2"
[[ -n $file3 ]] && script.sh "$file3"
done < list_of_files
Run the scripts in parallel
If I understand your question right, you also want to run the scripts at the same time rather than sequentially, one after the other. If so, the way to do that is to append &, which will cause them to run in the background. Then call wait to block until they have all finished before proceeding.
while IFS= read -r file1; do
IFS= read -r file2
IFS= read -r file3
script.sh "$file1" &
[[ -n $file2 ]] && script.sh "$file2" &
[[ -n $file3 ]] && script.sh "$file3" &
wait
done < list_of_files
How about
xargs -d $'\n' -L 1 -P 3 script.sh <list_of_files
-P 3 runs 3 processes in parallel. Each of those gets the input of one line (due to -L 1), and the -d option ensures that spaces in an input line are not treated as separate arguments.
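Dropping -P, the line-handling part can be verified with echo standing in for script.sh (GNU xargs assumed, since -d is not POSIX):

```shell
#!/usr/bin/env bash
# -d $'\n' makes each whole line a single argument, so embedded
# spaces survive; -L 1 runs the command once per input line.
printf '%s\n' 'file with spaces' plain |
xargs -d $'\n' -L 1 echo got:
# got: file with spaces
# got: plain
```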
You can use bash arrays to store the filenames until you get 3 of them:
#!/bin/bash
files=()
while IFS= read -r f; do
files+=( "$f" )
(( ${#files[@]} < 3 )) && continue
script.sh "${files[0]}"
script.sh "${files[1]}"
script.sh "${files[2]}"
files=()
done < list_of_files
However, I think that John Kugelman's answer is simpler, and better: it uses fewer bash-specific features, so it can be more easily converted to a POSIX version.
You should not mix scripting languages if you don't absolutely have to.
You can start with this:
from os import listdir
from os.path import isfile, join

PATH_FILES = "/yourfolder"

def yourFunction(file_name):
    file_path = PATH_FILES + "/" + file_name
    print(file_path)  # or do something else

file_names = [f for f in listdir(PATH_FILES) if isfile(join(PATH_FILES, f))]
for file_name in file_names:
    yourFunction(file_name)
If mapfile (aka readarray) is available/acceptable (bash 4+ is required).
Assuming script.sh can accept multiple inputs:
#!/usr/bin/env bash
while mapfile -tn3 files && (( ${#files[*]} == 3 )); do
script.sh "${files[@]}"
done < list_of_files
Otherwise, loop through the array named files:
#!/usr/bin/env bash
while mapfile -tn3 files && (( ${#files[*]} == 3 )); do
for file in "${files[@]}"; do
script.sh "$file"
done
done < list_of_files
The body after the do will only execute when a full group of 3 lines was read. If the total is not a multiple of 3 and you also want the final, smaller group to be processed, change the guard to
&& (( ${#files[*]} ))
so the loop still stops at end of file but accepts a partial group. (Removing the guard entirely would loop forever, since mapfile returns success even at end of file.)
Or process them manually one by one; note this variant also expects full groups of 3 lines until the end of the file.
#!/usr/bin/env bash
while mapfile -tn3 files && (( ${#files[*]} == 3 )); do
script.sh "${files[0]}"
script.sh "${files[1]}"
script.sh "${files[2]}"
done < list_of_files

Execute a program over all pairs of files in a directory using bash script

I have a directory with a bunch of files. I need to create a bash file to qsub and run a program over all pairs of all files:
for $file1, $file2 in all_pairs
do
/path/program -i $file1 $file2 -o $file1.$file2.result
done
So I could do:
qsub script.sh
to get:
file1.file2.result
file1.file3.result
file2.file3.result
for directory with:
file1
file2
file3
The following is probably the easiest:
If the pair a-b is different from b-a:
set -- file1 file2 file3 file4 ...
for f1; do
for f2; do
/path/program -i "$f1" "$f2" -o "$f1.$f2.result"
done
done
If the pair a-b is equal to b-a:
set -- file1 file2 file3 file4 ...
for f1; do
shift
for f2; do
/path/program -i "$f1" "$f2" -o "$f1.$f2.result"
done
done
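Substituting echo for the program shows which pairs the unordered variant produces:

```shell
#!/usr/bin/env bash
# Each f1 pairs only with the names still left after the shift,
# so every unordered pair appears once and nothing pairs with itself.
set -- file1 file2 file3
for f1; do
    shift
    for f2; do
        echo "$f1 $f2"
    done
done
# file1 file2
# file1 file3
# file2 file3
```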
You can do it as in every other programming language:
files=(file1 file2 file3) # or use a glob to list the files automatically, for instance files=(*)
max="${#files[@]}"
for ((i=0; i<max; i++)); do
for ((j=i+1; j<max; j++)); do
echo -i "${files[i]}" "${files[j]}" -o "${files[i]}.${files[j]}.result"
done
done
Replace echo with /path/program when you are happy with the result

Take two at a time in a bash "for file in $list" construct

I have a list of files where two subsequent ones always belong together. I would like a for loop extract two files out of this list per iteration, and then work on these two files at a time (for an example, let's say I want to just concatenate, i.e. cat the two files).
In a simple case, my list of files is this:
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt"
I could hack around it and say
FILES="file1 file2"
for file in $FILES
do
actual_mateA=${file}_mateA.txt
actual_mateB=${file}_mateB.txt
cat $actual_mateA $actual_mateB
done
But I would like to be able to handle lists where mate A and mate B have arbitrary names, e.g.:
FILES="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
Is there a way to extract two values out of $FILES per iteration?
Use an array for the list:
files=(fileA1 fileA2 fileB1 fileB2)
for (( i=0; i<${#files[@]} ; i+=2 )) ; do
echo "${files[i]}" "${files[i+1]}"
done
You could read the values from a while loop and use xargs to restrict each read operation to two tokens.
files="fileA1 fileA2 fileB1 fileB2"
while read -r a b; do
echo $a $b
done < <(echo $files | xargs -n2)
You could use xargs(1), e.g.
ls -1 *.txt | xargs -n2 COMMAND
The switch -n2 lets xargs select 2 consecutive filenames from the pipe output, which are handed down to the COMMAND.
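With echo standing in for COMMAND, the pairing is easy to see (note this token-based splitting breaks on filenames containing whitespace):

```shell
#!/usr/bin/env bash
# -n2 passes the input tokens to the command two at a time.
printf '%s\n' file1 file2 file3 file4 | xargs -n2 echo
# file1 file2
# file3 file4
```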
To concatenate the 10 files file01.txt ... file10.txt pairwise
one can use
ls *.txt | xargs -n2 sh -c 'cat $@ > $1.$2.joined' dummy
to get the 5 result files
file01.txt.file02.txt.joined
file03.txt.file04.txt.joined
file05.txt.file06.txt.joined
file07.txt.file08.txt.joined
file09.txt.file10.txt.joined
Please see 'info xargs' for an explanation.
How about this:
park=''
for file in $files # wherever you get them from, maybe $(ls) or whatever
do
if [ "$park" = '' ]
then
park=$file
else
process "$park" "$file"
park=''
fi
done
In each odd iteration it just stores the value (in park) and in each even iteration it then uses the stored and the current value.
Seems like one of those things awk is suited for
$ awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES"
file1_mateA.txt file1_mateB.txt
file2_mateA.txt file2_mateB.txt
You could then loop over it by setting IFS=$'\n'
e.g.
#!/bin/bash
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt file3_mateA.txt"
input=$(awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES")
IFS=$'\n'
for set in $input; do
cat "$set" # or something
done
Which will try to do
$ cat file1_mateA.txt file1_mateB.txt
$ cat file2_mateA.txt file2_mateB.txt
And ignore the odd case without the match.
You can transform your string to an array and read this new array element by element:
#!/bin/bash
string="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
array=(${string})
size=${#array[*]}
idx=0
while [ "$idx" -lt "$size" ]
do
echo ${array[$idx]}
echo ${array[$(($idx+1))]}
let "idx=$idx+2"
done
If the delimiter in your string is different from a space (e.g. ;), you can use the following transformation to build the array:
array=(${string//;/ })
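For instance, with semicolon-separated names (unquoted expansion also glob-expands, so this suits simple tokens only):

```shell
#!/usr/bin/env bash
# Replace every ';' with a space, then let word splitting fill the array.
string="a.txt;b.txt;c.txt"
array=(${string//;/ })
echo "${#array[@]}"   # 3
echo "${array[1]}"    # b.txt
```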
You could try something like this:
echo file1 file2 file3 file4 | while read -d ' ' a; do read -d ' ' b; echo $a $b; done
file1 file2
file3 file4
Or this somewhat cumbersome technique:
echo file1 file2 file3 file4 |tr " " "\n" | while :;do read a || break; read b || break; echo $a $b; done
file1 file2
file3 file4

Nested for loop comparing files

I am trying to write a bash script that looks at two files with the same name, each in a different directory.
I know this can be done with diff -r, however, I would like to take everything that is in the second file that is not in the first file and output it into an new file (also with the same file name)
I have written a (nested) loop with a grep command but it's not good and gives back a syntax error:
#!/bin/bash
FILES=/Path/dir1/*
FILES2=/Path/dir2/*
for f in $FILES
do
for i in $FILES2
do
if $f = $i
grep -vf $i $f > /Path/dir3/$i
done
done
Any help much appreciated.
Try this:
#!/bin/bash
cd /Path/dir1/
for f in *; do
comm -13 <(sort $f) <(sort /Path/dir2/$f) > /Path/dir3/$f
done
The if syntax in shell is
if test_command; then commands; fi
commands are executed if test_command's exit code is 0
if [ $f = $i ] ; then grep ... ; fi
but in your case it will be more efficient to get the file name
for i in $FILES; do
f=/Path/dir2/`basename $i`
grep
done
finally, maybe this will be more efficient than grep -v
comm -13 <(sort $f) <(sort $i)
comm -13 will get everything which is in the second and not in the first; comm without arguments generates 3 columns of output: first is only in the first, second only in the second, and third what is common.
-13 (or -1 -3) removes the first and third columns.
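A tiny self-contained check of that behavior (both inputs must already be sorted):

```shell
#!/usr/bin/env bash
# Only "cherry" is unique to the second input, so only it survives -13.
comm -13 <(printf '%s\n' apple banana) <(printf '%s\n' banana cherry)
# cherry
```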
#!/bin/bash
DIR1=/Path/dir1
DIR2=/Path/dir2
DIR3=/Path/dir3
for f in $DIR1/*
do
for i in $DIR2/*
do
if [ "$(basename $f)" = "$(basename $i)" ]
then
grep -vf "$i" "$f" > "$DIR3/$(basename $i)"
fi
done
done
This assumes no special characters in filenames (e.g. whitespace; use double quotes if that is unacceptable):
a=/path/dir1
b=/path/dir2
for i in $a/*; do test -e $b/${i##*/} &&
diff $i $b/${i##*/} | sed -n '/^> /s///p'; done
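For reference, diff marks lines unique to its first operand with "< " and lines unique to its second with "> "; since the goal is everything in the second file that is not in the first, the filter wants the "> " lines. A quick check with throwaway files:

```shell
#!/usr/bin/env bash
# diff prefixes lines found only in the second file with "> ";
# sed selects those lines and strips the prefix (the empty s/// reuses
# the address regex as its pattern).
dir=$(mktemp -d)
printf '%s\n' two three     > "$dir/f1"
printf '%s\n' one two three > "$dir/f2"
diff "$dir/f1" "$dir/f2" | sed -n '/^> /s///p'   # one
rm -rf "$dir"
```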
