How can I select random files from a directory in bash?

How can I select random files from a directory in bash? - bash

I have a directory with about 2000 files. How can I select a random sample of N files through using either a bash script or a list of piped commands?

Here's a script that uses GNU sort's random option:
ls |sort -R |tail -$N |while read file; do
# Something involving $file, or you can leave
# off the while to just get the filenames
done

You can use shuf (from the GNU coreutils package) for that. Just feed it a list of file names and ask it to return the first line from a random permutation:
ls dirname | shuf -n 1
# probably faster and more flexible:
find dirname -type f | shuf -n 1
# etc..
Adjust the -n, --head-count=COUNT value to return the number of wanted lines. For example to return 5 random filenames you would use:
find dirname -type f | shuf -n 5

Here are a few possibilities that don't parse the output of ls and that are 100% safe regarding files with spaces and funny symbols in their name. All of them will populate an array randf with a list of random files. This array is easily printed with printf '%s\n' "${randf[#]}" if needed.
This one will possibly output the same file several times, and N needs to be known in advance. Here I chose N=42.
a=( * )
randf=( "${a[RANDOM%${#a[#]}]"{1..42}"}" )
This feature is not very well documented.
If N is not known in advance, but you really liked the previous possibility, you can use eval. But it's evil, and you must really make sure that N doesn't come directly from user input without being thoroughly checked!
N=42
a=( * )
eval randf=( \"\${a[RANDOM%\${#a[#]}]\"\{1..$N\}\"}\" )
I personally dislike eval and hence this answer!
The same using a more straightforward method (a loop):
N=42
a=( * )
randf=()
for((i=0;i<N;++i)); do
randf+=( "${a[RANDOM%${#a[#]}]}" )
done
If you don't want to possibly have several times the same file:
N=42
a=( * )
randf=()
for((i=0;i<N && ${#a[#]};++i)); do
((j=RANDOM%${#a[#]}))
randf+=( "${a[j]}" )
a=( "${a[#]:0:j}" "${a[#]:j+1}" )
done
Note. This is a late answer to an old post, but the accepted answer links to an external page that shows terrible bash practice, and the other answer is not much better as it also parses the output of ls. A comment to the accepted answer points to an excellent answer by Lhunath which obviously shows good practice, but doesn't exactly answer the OP.

ls | shuf -n 10 # ten random files

A simple solution for selecting 5 random files while avoiding to parse ls. It also works with files containing spaces, newlines and other special characters:
shuf -ezn 5 * | xargs -0 -n1 echo
Replace echo with the command you want to execute for your files.

This is an even later response to #gniourf_gniourf's late answer, which I just upvoted because it's by far the best answer, twice over. (Once for avoiding eval and once for safe filename handling.)
But it took me a few minutes to untangle the "not very well documented" feature(s) this answer uses. If your Bash skills are solid enough that you saw immediately how it works, then skip this comment. But I didn't, and having untangled it I think it's worth explaining.
Feature #1 is the shell's own file globbing. a=(*) creates an array, $a, whose members are the files in the current directory. Bash understands all the weirdnesses of filenames, so that list is guaranteed correct, guaranteed escaped, etc. No need to worry about properly parsing textual file names returned by ls.
Feature #2 is Bash parameter expansions for arrays, one nested within another. This starts with ${#ARRAY[#]}, which expands to the length of $ARRAY.
That expansion is then used to subscript the array. The standard way to find a random number between 1 and N is to take the value of random number modulo N. We want a random number between 0 and the length of our array. Here's the approach, broken into two lines for clarity's sake:
LENGTH=${#ARRAY[#]}
RANDOM=${a[RANDOM%$LENGTH]}
But this solution does it in a single line, removing the unnecessary variable assignment.
Feature #3 is Bash brace expansion, although I have to confess I don't entirely understand it. Brace expansion is used, for instance, to generate a list of 25 files named filename1.txt, filename2.txt, etc: echo "filename"{1..25}".txt".
The expression inside the subshell above, "${a[RANDOM%${#a[#]}]"{1..42}"}", uses that trick to produce 42 separate expansions. The brace expansion places a single digit in between the ] and the }, which at first I thought was subscripting the array, but if so it would be preceded by a colon. (It would also have returned 42 consecutive items from a random spot in the array, which is not at all the same thing as returning 42 random items from the array.) I think it's just making the shell run the expansion 42 times, thereby returning 42 random items from the array. (But if someone can explain it more fully, I'd love to hear it.)
The reason N has to be hardcoded (to 42) is that brace expansion happens before variable expansion.
Finally, here's Feature #4, if you want to do this recursively for a directory hierarchy:
shopt -s globstar
a=( ** )
This turns on a shell option that causes ** to match recursively. Now your $a array contains every file in the entire hierarchy.

If you have Python installed (works with either Python 2 or Python 3):
To select one file (or line from an arbitrary command), use
ls -1 | python -c "import sys; import random; print(random.choice(sys.stdin.readlines()).rstrip())"
To select N files/lines, use (note N is at the end of the command, replace this by a number)
ls -1 | python -c "import sys; import random; print(''.join(random.sample(sys.stdin.readlines(), int(sys.argv[1]))).rstrip())" N

If you want to copy a sample of those files to another folder:
ls | shuf -n 100 | xargs -I % cp % ../samples/
make samples directory first obviously.

MacOS does not have the sort -R and shuf commands, so I needed a bash only solution that randomizes all files without duplicates and did not find that here. This solution is similar to gniourf_gniourf's solution #4, but hopefully adds better comments.
The script should be easy to modify to stop after N samples using a counter with if, or gniourf_gniourf's for loop with N. $RANDOM is limited to ~32000 files, but that should do for most cases.
#!/bin/bash
array=(*) # this is the array of files to shuffle
# echo ${array[#]}
for dummy in "${array[#]}"; do # do loop length(array) times; once for each file
length=${#array[#]}
randomi=$(( $RANDOM % $length )) # select a random index
filename=${array[$randomi]}
echo "Processing: '$filename'" # do something with the file
unset -v "array[$randomi]" # set the element at index $randomi to NULL
array=("${array[#]}") # remove NULL elements introduced by unset; copy array
done

If you have more files in your folder, you can use the below piped command I found in unix stackexchange.
find /some/dir/ -type f -print0 | xargs -0 shuf -e -n 8 -z | xargs -0 cp -vt /target/dir/
Here I wanted to copy the files, but if you want to move files or do something else, just change the last command where I have used cp.

This is the only script I can get to play nice with bash on MacOS. I combined and edited snippets from the following two links:
ls command: how can I get a recursive full-path listing, one line per file?
http://www.linuxquestions.org/questions/linux-general-1/is-there-a-bash-command-for-picking-a-random-file-678687/
#!/bin/bash
# Reads a given directory and picks a random file.
# The directory you want to use. You could use "$1" instead if you
# wanted to parametrize it.
DIR="/path/to/"
# DIR="$1"
# Internal Field Separator set to newline, so file names with
# spaces do not break our script.
IFS='
'
if [[ -d "${DIR}" ]]
then
# Runs ls on the given dir, and dumps the output into a matrix,
# it uses the new lines character as a field delimiter, as explained above.
# file_matrix=($(ls -LR "${DIR}"))
file_matrix=($(ls -R $DIR | awk '; /:$/&&f{s=$0;f=0}; /:$/&&!f{sub(/:$/,"");s=$0;f=1;next}; NF&&f{ print s"/"$0 }'))
num_files=${#file_matrix[*]}
# This is the command you want to run on a random file.
# Change "ls -l" by anything you want, it's just an example.
ls -l "${file_matrix[$((RANDOM%num_files))]}"
fi
exit 0

I use this: it uses temporary file but goes deeply in a directory until it find a regular file and return it.
# find for a quasi-random file in a directory tree:
# directory to start search from:
ROOT="/";
tmp=/tmp/mytempfile
TARGET="$ROOT"
FILE="";
n=
r=
while [ -e "$TARGET" ]; do
TARGET="$(readlink -f "${TARGET}/$FILE")" ;
if [ -d "$TARGET" ]; then
ls -1 "$TARGET" 2> /dev/null > $tmp || break;
n=$(cat $tmp | wc -l);
if [ $n != 0 ]; then
FILE=$(shuf -n 1 $tmp)
# or if you dont have/want to use shuf:
# r=$(($RANDOM % $n)) ;
# FILE=$(tail -n +$(( $r + 1 )) $tmp | head -n 1);
fi ;
else
if [ -f "$TARGET" ] ; then
rm -f $tmp
echo $TARGET
break;
else
# is not a regular file, restart:
TARGET="$ROOT"
FILE=""
fi
fi
done;

How about a Perl solution slightly doctored from Mr. Kang over here:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
$ ls | perl -MList::Util=shuffle -e '#lines = shuffle(<>); print
#lines[0..4]'

Related

Bash for loop doesn't execute one sentence more that once [duplicate]

I have a directory with about 2000 files. How can I select a random sample of N files through using either a bash script or a list of piped commands?

Here's a script that uses GNU sort's random option:
ls |sort -R |tail -$N |while read file; do
# Something involving $file, or you can leave
# off the while to just get the filenames
done

You can use shuf (from the GNU coreutils package) for that. Just feed it a list of file names and ask it to return the first line from a random permutation:
ls dirname | shuf -n 1
# probably faster and more flexible:
find dirname -type f | shuf -n 1
# etc..
Adjust the -n, --head-count=COUNT value to return the number of wanted lines. For example to return 5 random filenames you would use:
find dirname -type f | shuf -n 5

Here are a few possibilities that don't parse the output of ls and that are 100% safe regarding files with spaces and funny symbols in their name. All of them will populate an array randf with a list of random files. This array is easily printed with printf '%s\n' "${randf[#]}" if needed.
This one will possibly output the same file several times, and N needs to be known in advance. Here I chose N=42.
a=( * )
randf=( "${a[RANDOM%${#a[#]}]"{1..42}"}" )
This feature is not very well documented.
If N is not known in advance, but you really liked the previous possibility, you can use eval. But it's evil, and you must really make sure that N doesn't come directly from user input without being thoroughly checked!
N=42
a=( * )
eval randf=( \"\${a[RANDOM%\${#a[#]}]\"\{1..$N\}\"}\" )
I personally dislike eval and hence this answer!
The same using a more straightforward method (a loop):
N=42
a=( * )
randf=()
for((i=0;i<N;++i)); do
randf+=( "${a[RANDOM%${#a[#]}]}" )
done
If you don't want to possibly have several times the same file:
N=42
a=( * )
randf=()
for((i=0;i<N && ${#a[#]};++i)); do
((j=RANDOM%${#a[#]}))
randf+=( "${a[j]}" )
a=( "${a[#]:0:j}" "${a[#]:j+1}" )
done
Note. This is a late answer to an old post, but the accepted answer links to an external page that shows terrible bash practice, and the other answer is not much better as it also parses the output of ls. A comment to the accepted answer points to an excellent answer by Lhunath which obviously shows good practice, but doesn't exactly answer the OP.

ls | shuf -n 10 # ten random files

A simple solution for selecting 5 random files while avoiding to parse ls. It also works with files containing spaces, newlines and other special characters:
shuf -ezn 5 * | xargs -0 -n1 echo
Replace echo with the command you want to execute for your files.

This is an even later response to #gniourf_gniourf's late answer, which I just upvoted because it's by far the best answer, twice over. (Once for avoiding eval and once for safe filename handling.)
But it took me a few minutes to untangle the "not very well documented" feature(s) this answer uses. If your Bash skills are solid enough that you saw immediately how it works, then skip this comment. But I didn't, and having untangled it I think it's worth explaining.
Feature #1 is the shell's own file globbing. a=(*) creates an array, $a, whose members are the files in the current directory. Bash understands all the weirdnesses of filenames, so that list is guaranteed correct, guaranteed escaped, etc. No need to worry about properly parsing textual file names returned by ls.
Feature #2 is Bash parameter expansions for arrays, one nested within another. This starts with ${#ARRAY[#]}, which expands to the length of $ARRAY.
That expansion is then used to subscript the array. The standard way to find a random number between 1 and N is to take the value of random number modulo N. We want a random number between 0 and the length of our array. Here's the approach, broken into two lines for clarity's sake:
LENGTH=${#ARRAY[#]}
RANDOM=${a[RANDOM%$LENGTH]}
But this solution does it in a single line, removing the unnecessary variable assignment.
Feature #3 is Bash brace expansion, although I have to confess I don't entirely understand it. Brace expansion is used, for instance, to generate a list of 25 files named filename1.txt, filename2.txt, etc: echo "filename"{1..25}".txt".
The expression inside the subshell above, "${a[RANDOM%${#a[#]}]"{1..42}"}", uses that trick to produce 42 separate expansions. The brace expansion places a single digit in between the ] and the }, which at first I thought was subscripting the array, but if so it would be preceded by a colon. (It would also have returned 42 consecutive items from a random spot in the array, which is not at all the same thing as returning 42 random items from the array.) I think it's just making the shell run the expansion 42 times, thereby returning 42 random items from the array. (But if someone can explain it more fully, I'd love to hear it.)
The reason N has to be hardcoded (to 42) is that brace expansion happens before variable expansion.
Finally, here's Feature #4, if you want to do this recursively for a directory hierarchy:
shopt -s globstar
a=( ** )
This turns on a shell option that causes ** to match recursively. Now your $a array contains every file in the entire hierarchy.

If you have Python installed (works with either Python 2 or Python 3):
To select one file (or line from an arbitrary command), use
ls -1 | python -c "import sys; import random; print(random.choice(sys.stdin.readlines()).rstrip())"
To select N files/lines, use (note N is at the end of the command, replace this by a number)
ls -1 | python -c "import sys; import random; print(''.join(random.sample(sys.stdin.readlines(), int(sys.argv[1]))).rstrip())" N

If you want to copy a sample of those files to another folder:
ls | shuf -n 100 | xargs -I % cp % ../samples/
make samples directory first obviously.

MacOS does not have the sort -R and shuf commands, so I needed a bash only solution that randomizes all files without duplicates and did not find that here. This solution is similar to gniourf_gniourf's solution #4, but hopefully adds better comments.
The script should be easy to modify to stop after N samples using a counter with if, or gniourf_gniourf's for loop with N. $RANDOM is limited to ~32000 files, but that should do for most cases.
#!/bin/bash
array=(*) # this is the array of files to shuffle
# echo ${array[#]}
for dummy in "${array[#]}"; do # do loop length(array) times; once for each file
length=${#array[#]}
randomi=$(( $RANDOM % $length )) # select a random index
filename=${array[$randomi]}
echo "Processing: '$filename'" # do something with the file
unset -v "array[$randomi]" # set the element at index $randomi to NULL
array=("${array[#]}") # remove NULL elements introduced by unset; copy array
done

If you have more files in your folder, you can use the below piped command I found in unix stackexchange.
find /some/dir/ -type f -print0 | xargs -0 shuf -e -n 8 -z | xargs -0 cp -vt /target/dir/
Here I wanted to copy the files, but if you want to move files or do something else, just change the last command where I have used cp.

This is the only script I can get to play nice with bash on MacOS. I combined and edited snippets from the following two links:
ls command: how can I get a recursive full-path listing, one line per file?
http://www.linuxquestions.org/questions/linux-general-1/is-there-a-bash-command-for-picking-a-random-file-678687/
#!/bin/bash
# Reads a given directory and picks a random file.
# The directory you want to use. You could use "$1" instead if you
# wanted to parametrize it.
DIR="/path/to/"
# DIR="$1"
# Internal Field Separator set to newline, so file names with
# spaces do not break our script.
IFS='
'
if [[ -d "${DIR}" ]]
then
# Runs ls on the given dir, and dumps the output into a matrix,
# it uses the new lines character as a field delimiter, as explained above.
# file_matrix=($(ls -LR "${DIR}"))
file_matrix=($(ls -R $DIR | awk '; /:$/&&f{s=$0;f=0}; /:$/&&!f{sub(/:$/,"");s=$0;f=1;next}; NF&&f{ print s"/"$0 }'))
num_files=${#file_matrix[*]}
# This is the command you want to run on a random file.
# Change "ls -l" by anything you want, it's just an example.
ls -l "${file_matrix[$((RANDOM%num_files))]}"
fi
exit 0

I use this: it uses temporary file but goes deeply in a directory until it find a regular file and return it.
# find for a quasi-random file in a directory tree:
# directory to start search from:
ROOT="/";
tmp=/tmp/mytempfile
TARGET="$ROOT"
FILE="";
n=
r=
while [ -e "$TARGET" ]; do
TARGET="$(readlink -f "${TARGET}/$FILE")" ;
if [ -d "$TARGET" ]; then
ls -1 "$TARGET" 2> /dev/null > $tmp || break;
n=$(cat $tmp | wc -l);
if [ $n != 0 ]; then
FILE=$(shuf -n 1 $tmp)
# or if you dont have/want to use shuf:
# r=$(($RANDOM % $n)) ;
# FILE=$(tail -n +$(( $r + 1 )) $tmp | head -n 1);
fi ;
else
if [ -f "$TARGET" ] ; then
rm -f $tmp
echo $TARGET
break;
else
# is not a regular file, restart:
TARGET="$ROOT"
FILE=""
fi
fi
done;

How about a Perl solution slightly doctored from Mr. Kang over here:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
$ ls | perl -MList::Util=shuffle -e '#lines = shuffle(<>); print
#lines[0..4]'

Bash: any file in current directory

Is there a shorthand in bash to select an arbitrary file? * enumerates all files in the current directory, but what if I only want one file and don't care which it is?
FWIW I'm testing several different ffmpeg commands in a directory with similarly named video files, so tab-complete is cumbersome.

Here's the robust way of getting the first or a random file in a directory, handling the edge case of not having any files:
#!/bin/bash
# Let globs expand to 0 elements instead of themselves if no matches
shopt -s nullglob
# Add all the files in the current dir to an array
files=(*)
# Check if the array has any elements
if [[ ${#files[#]} -gt 0 ]]
then
first_file=${files[0]}
random_file=${files[RANDOM%${#files[#]}]}
echo "The first file is ${first_file}"
echo "A random file is ${random_file}"
else
echo "There are no files in the current directory."
fi
If you just want something short and hacky for interactive testing, you can create an array and reference it unindexed to get the first element with minimal typing:
$ testfile=( *.avi )
$ ffmpeg -i "$testfile" test.mp3
You can also bind Tab to zsh style completion:
$ bind 'TAB:menu-complete'
now, for the rest of this session, when you press Tab you'll get a complete filename instead of just a prefix (press Tab again to cycle through matches). This will let you conveniently pick a file with a single keystroke.

Occasionally I was using the shuf:
find -name '*whatever*' | shuf | head -n 1
The shuf is a tool, part of GNU coreutils, which prints the input lines in random order. In other words, it shuffles the lines.

Shell script: Get name of last file in a folder by alphabetical order

I have a folder with backups from a MySQL database that are created automatically. Their name consists of the date the backup was made, like so:
2010-06-12_19-45-05.mysql.gz
2010-06-14_19-45-05.mysql.gz
2010-06-18_19-45-05.mysql.gz
2010-07-01_19-45-05.mysql.gz
What is a way to get the filename of the last file in the list, i.e. of the one which in alphabetical order comes last?
In a shell script, I would like to do something like
LAST_BACKUP_FILE= ???
gunzip $LAST_BACKUP_FILE;

ls -1 | tail -n 1
If you want to assign this to a variable, use $(...) or backticks.
FILE=`ls -1 | tail -n 1`
FILE=$(ls -1 | tail -n 1)

#Sjoerd's answer is correct, I'll just pick a few nits from it:
you don't need the -1 option to enforce one path per line if you pipe the output somewhere:
ls | tail -n 1
you can use -r to get the listing in reverse order, and take the first one:
ls -r | head -n 1
gunzip some.log.gz will write uncompressed data into some.log and remove some.log.gz, which may or may not be what you want (probably isn't). if you want to keep the compressed source, pipe it into gunzip:
gunzip < some.file.gz
you might want to protect the script against situation when the dir contains no files, since
gunzip $empty_variable
expands to just
gunzip
and such invocation will wait indefinitely for data on standard input:
latest="$(ls -r /some/where/*.gz | head -1)"
if test -z "$latest"; then
# there's no logs yet, bail out
exit
fi
gunzip < $latest

ls can yield unexpected results when parsed by other commands if the filenames have unusual characters. The following always works:
for LAST_BACKUP_FILE in *; do : ; done
for LAST_BACKUP_FILE in * loops through every filename (and folder name, if there are any) in order in the current directory, storing each in $LAST_BACKUP_FILE
do : does nothing
done finishes after the last file
Now, the last file is stored in $LAST_BACKUP_FILE.
If you happen to want the first file, use this:
for FIRST_BACKUP_FILE in *; do break; done
The break statement jumps out of the loop after the first file is stored in $FIRST_BACKUPT_FILE.
(from comment below) If you want hidden files included in the search, then use the command shopt -s dotglob before running the loops.

The shell is more powerful than many think. Just let it work for you. Assuming filenames without spaces,
set -- $(ls -r *.gz)
LAST_BACKUP_FILE=$1
does the trick with a single fork, no pipes, and you can even avoid the fork if your shell supports arithmetic expansion as in
set -- *.gz
shift $(($# - 1))
LAST_BACKUP_FILE=$1

shell scripting: search/replace & check file exist

I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file exists]
add to whatsleft list
end
done
but I'm looking for something more general.

#!/bin/sh
shopt -s extglob # allow extended glob syntax, for matching the filenames
LC_COLLATE=C # use a sort order comm is happy with
IFS=$'\n' # so filenames can have spaces but not newlines
# (newlines don't work so well with comm anyhow;
# shame it doesn't have an option for null-separated
# input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
$(comm -23 --nocheck-order \
<(printf "%s\n" "${files_todo[#]%.xml}") \
<(printf "%s\n" "${files_done[#]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[#]%.xml}"; do printf "%s\n" "${f}.txt"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing ls, which is considered very poor practice.
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+).xml ]] ; then
out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
: # ...handle here the fact that you have a noncompliant name...
fi

The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f ${file%.xml}.txt ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing it, that minimizes the chance or duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.

whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.

i am not exactly sure what you want, but you can check for existence of the file first, if it exists, create a new name? ( Or in your E (perl script) you do this check. )
if [ -f "$file" ];then
newname="...."
fi
...
jobserve E .... > $newname
if its not what you want, describe more clearly in your question what you mean by "don't overwrite files"..

for posterity's sake, this is what i found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft = `sort $TMPA $TMPB | uniq -u | sed "s/%/.xml" > xargs`;
rm $TMPA $TMPB;

Handle special characters in bash for...in loop

Suppose I've got a list of files
file1
"file 1"
file2
a for...in loop breaks it up between whitespace, not newlines:
for x in $( ls ); do
echo $x
done
results:
file
1
file1
file2
I want to execute a command on each file. "file" and "1" above are not actual files. How can I do that if the filenames contains things like spaces or commas?
It's a little trickier than I think find -print0 | xargs -0 could handle, because I actually want the command to be something like "convert input/file1.jpg .... output/file1.jpg" so I need to permutate the filename in the process.

Actually, Mark's suggestion works fine without even doing anything to the internal field separator. The problem is running ls in a subshell, whether by backticks or $( ) causes the for loop to be unable to distinguish between spaces in names. Simply using
for f in *
instead of the ls solves the problem.
#!/bin/bash
for f in *
do
echo "$f"
done

UPDATE BY OP: this answer sucks and shouldn't be on top ... #Jordan's post below should be the accepted answer.
one possible way:
ls -1 | while read x; do
echo $x
done

I know this one is LONG past "answered", and with all due respect to eduffy, I came up with a better way and I thought I'd share it.
What's "wrong" with eduffy's answer isn't that it's wrong, but that it imposes what for me is a painful limitation: there's an implied creation of a subshell when the output of the ls is piped and this means that variables set inside the loop are lost after the loop exits. Thus, if you want to write some more sophisticated code, you have a pain in the buttocks to deal with.
My solution was to take the "readline" function and write a program out of it in which you can specify any specific line number that you may want that results from any given function call. ... As a simple example, starting with eduffy's:
ls_output=$(ls -1)
# The cut at the end of the following line removes any trailing new line character
declare -i line_count=$(echo "$ls_output" | wc -l | cut -d ' ' -f 1)
declare -i cur_line=1
while [ $cur_line -le $line_count ] ;
do
# NONE of the values in the variables inside this do loop are trapped here.
filename=$(echo "$ls_output" | readline -n $cur_line)
# Now line contains a filename from the preceeding ls command
cur_line=cur_line+1
done
Now you have wrapped up all the subshell activity into neat little contained packages and can go about your shell coding without having to worry about the scope of your variable values getting trapped in subshells.
I wrote my version of readline in gnuc if anyone wants a copy, it's a little big to post here, but maybe we can find a way...
Hope this helps,
RT

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio