Properly handle lists of files with whitespace in filename - bash

I want to iterate over a list of files in Bash and perform some action. The problem: the file names may contain whitespace, which creates an obvious problem with wildcards or ls:
touch a\ b
FILES=* # or $(ls)
for FILE in $FILES; do echo $FILE; done
yields
a
b
Now, the conventional way to handle this is to use find … -print0 instead. However, this only works (well) in conjunction with xargs -0, not with Bash variables / loops.
My idea was to set $IFS to the null character to make this work. However, the consensus on comp.unix.shell seems to be that this is impossible in bash.
Bummer. Well, it’s theoretically possible to use another character, such as : (after all, $PATH uses this format, too):
IFS=$':'
FILES=$(find . -print0 | xargs -0 printf "%s:")
for FILE in $FILES; do echo $FILE; done
(The output is slightly different but fair enough.)
However, I can't help but feel that this is clumsy and that there should be a more direct way of doing this, preferably using wildcards or ls.

The best way to handle this is to store the file list as an array, rather than a string (and be sure to double-quote all variable substitutions):
files=(*)
for file in "${files[#]}"; do
echo "$file"
done
If you want to generate an array from find's output (e.g. if you need to search recursively), see this previous answer.
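A minimal sketch of that array-from-find technique, assuming bash 4.4+ for mapfile -d '':
# Read NUL-delimited find output into an array, then loop safely.
mapfile -t -d '' files < <(find . -type f -print0)
for file in "${files[@]}"; do
    echo "$file"
done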

Exactly what you have in the first example works fine for me in Msys Bash, Cygwin and on my Fedora box:
FILES=*
for FILE in $FILES
do
echo $FILE
done

It's very important to precede this with
IFS=""
otherwise files with two consecutive spaces will not be found.

Related

how to list just one file from a (bash) shell directory listing

A bit lowly a query but here goes:
bash shell script. POSIX, Mint 21
I just want one/any (mp3) file from a directory. As a sample.
In normal execution, a full run, the code would be such
for f in *.mp3; do
#statements
done
This works fine but if I wanted to sample just one file of such an array/glob (?) without looping, how might I do that? I don't care which file, just that it is an mp3 from the directory I am working in.
Should I just start this for-loop and then exit(break) after one statement, or is there a neater way more tailored-for-the-job way?
for f in *.mp3; do
#statement
break
done
Ta (cannot believe how dopey I feel asking this one, my forehead will hurt when I see the answers)
Since you are using Linux (Mint) you've got GNU find so one way to get one .mp3 file from the current directory is:
mp3file=$(find . -maxdepth 1 -mindepth 1 -name '*.mp3' -printf '%f' -quit)
-maxdepth 1 -mindepth 1 causes the search to be restricted to one level under the current directory.
-printf '%f' prints just the filename (e.g. foo.mp3). The -print option would print the path to the filename (e.g. ./foo.mp3). That may not matter to you.
-quit causes find to exit as soon as one match is found and printed.
Another option is to use the Bash : (colon) command and $_ (dollar underscore) special variable:
: *.mp3
mp3file=$_
: *.mp3 runs the : command with the list of .mp3 files in the current directory as arguments. The : command ignores its arguments and does nothing.
mp3file=$_ sets the value of the mp3file variable to the last argument supplied to the previous command (:).
The second option should not be used if the number of .mp3 files is large (hundreds or more) because it will find all of the files and sort them by name internally.
In both cases $mp3file should be checked to ensure that it really exists (e.g. [[ -e $mp3file ]]) before using it for anything else, in case there are no .mp3 files in the directory.
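Putting the first option and that check together, a minimal sketch:
mp3file=$(find . -maxdepth 1 -mindepth 1 -name '*.mp3' -printf '%f' -quit)
if [[ -e $mp3file ]]; then
    echo "sampling $mp3file"    # stand-in for the real processing
else
    echo "no .mp3 files found" >&2
fi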
I would do it like this in POSIX shell:
mp3file=
for f in *.mp3; do
if [ -f "$f" ]; then
mp3file=$f
break
fi
done
# At this point, the variable mp3file contains a filename which
# represents a regular file (or a symbolic link) with the .mp3
# extension, or the empty string if there is no such file.
The fact that you use
for f in *.mp3; do
suggests to me that the MP3s are named without too many strange characters in the filename.
In that case, if you really don't care which MP3, you could:
f=$(ls *.mp3 | head -n 1)
statement
Or, if you want a different one every time:
f=$(ls *.mp3|sort -R | tail -1)
Note: if your filenames get more complicated (including spaces or other special characters), this will not work anymore.
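For a whitespace-safe version of the same random sampling, one sketch is to glob into an array and index it with bash's RANDOM:
files=(*.mp3)
# Without nullglob an unmatched pattern stays literal, hence the -e check.
if [[ -e ${files[0]} ]]; then
    f=${files[RANDOM % ${#files[@]}]}
    echo "$f"
fi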
Assuming you don't have spaces in your filenames (and personally I don't understand why the collective taboo is against using ls in scripts at all, rather than against having spaces in filenames), then:
ls *.mp3 | tr ' ' '\n' | sed -n '1p'

How to remove unknown file extensions from files using script

I can remove file extensions if I know the extensions, for example to remove .txt from files:
foreach file (`find . -type f`)
mv $file `basename $file .txt`
end
However if I don't know what kind of file extension to begin with, how would I do this?
I tried:
foreach file (`find . -type f`)
mv $file `basename $file .*`
end
but it wouldn't work.
What shell is this? At least in bash you can do:
find . -type f | while read -r; do
mv -- "$REPLY" "${REPLY%.*}"
done
(The usual caveats apply: This doesn't handle files whose name contains newlines.)
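If your find supports -print0 (GNU and BSD versions both do), a sketch of a newline-safe variant of the same loop; the -name '?*.*' test also skips names without an extension:
find . -type f -name '?*.*' -print0 |
while IFS= read -r -d '' f; do
    mv -- "$f" "${f%.*}"
done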
You can use sed to compute base file name.
foreach file (`find . -type f`)
mv $file `echo $file | sed -e 's/^\(.*\)\.[^.]\+$/\1/'`
end
Be cautious: The command you seek to run could cause loss of data!
If you don't think your file names contain newlines or double quotes, then you could use:
find . -type f -name '?*.*' |
sed 's/\(.*\)\.[^.]*$/mv "&" "\1"/' |
sh
This generates your list of files (making sure that the names contain at least one character plus a .), runs each file name through the sed script to convert it into an mv command by removing the material from the last . onwards, and then pipes the stream of commands through a shell.
Clearly, you test this first by omitting the | sh part. Consider running it with | sh -x to get a trace of what the shell's doing. Consider making sure you capture the output of the shell, standard output and standard error, into a log file so you've got a record of the damage that occurred.
Do make sure you've got a backup of the original set of files before you start playing with this. It need only be a tar file stored in a different part of the directory hierarchy, and you can remove it as soon as you're happy with the results.
You can choose any shell; this doesn't rely on any shell constructs except pipes and single quotes and double quotes (pretty much common to all shells), and the sed script is version neutral too.
Note that if you have files xyz.c and xyz.h before you run this, you'll only have a file xyz afterwards (and what it contains depends on the order in which the files are processed, which needn't be alphabetic order).
If you think your file names might contain double quotes (but not single quotes), you can play with changing the quotes in the sed script. If you might have to deal with both, you need a more complex sed script. If you need to deal with newlines in file names, then it is time to (a) tell your user(s) to stop being silly and (b) fix the names so they don't contain newlines. Then you can use the script above. If that isn't feasible, you have to work a lot harder to get the job done accurately; you probably need to make sure you've got a find that supports -print0, a sed that supports -z and an xargs that supports -0 (installing the most recent GNU versions if you don't already have the right support in place).
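If it does come to that, a simpler newline-safe route than wiring up -print0, sed -z and xargs -0 is to have find hand the names straight to a shell; a sketch (same caveats as above, so test on a copy first):
find . -type f -name '?*.*' -exec sh -c '
    for f in "$@"; do
        mv -- "$f" "${f%.*}"
    done' _ {} +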
It's very simple:
$ set filename=/home/foo/bar.dat
$ echo ${filename:r}
/home/foo/bar
See more in man tcsh, in "History substitution":
r
Remove a filename extension '.xxx', leaving the root name.

naming output files using variable parameters and input filenames in bash

I have a series of files in sub-directories that I want to loop through, process, and name according to the input filename and the various parameters (models) I'm using to process the files.
For example, file names like AG005574, AG004788, AG003854 and parameter/model values like ATd, PZa, RTK1, so I want to end up with files like
AG005574_ATd
AG005574_PZa
AG005574_RTK1
AG004788_ATd
AG004788_PZa
etc.
I loop through the subfolders, run the process and output the results like so:
#!/usr/bin/bash
model=$1
for file in $(find /path/to/files/*/ -type f -name 'AG*.fa');
do output=${model}"_"${file} ;
process_call --out=$output."tab" --options ../Path/to/model/$1.hmm $file ;
echo $file
done
I want to be able to specify the model on the command-line (hence the model=$1). However, my approach does not work in general; I can get the output named by model using
do output=$model ;
but this also writes only the last file processed because it overwrites all the others (and no input filename is used). Any help/tutoring is much appreciated.
Pass ALL the model names as parameters to the script:
/path/to/script ATd PZa RTK1
then
#!/bin/bash
find /path/to/files/*/ -type f -name 'AG*.fa' |
while IFS= read -r file; do
echo "$file"
for model in "$#"; do
output="${file%.fa}_$model.tab"
process_call --out="$output" --options "../Path/to/model/$model.hmm" "$file"
done
done
If you already know all the models, you can build that into the script:
#!/bin/bash
models=( ATd PZa RTK1 )
...
for model in "${models[#]}"; do
...
I think your problem is that when the file name given by find is:
/path/to/files/xyz/AG002378.fa
your output parameter becomes, for $1 as ATd,
ATd_/path/to/files/xyz/AG002378.fa
instead of:
/path/to/files/xyz/AG002378_ATd
That is, you want the .fa removed, and the _ATd added.
The classic commands for this are dirname and basename:
dir=$(dirname "$file")
base=$(basename "$file" .fa)
output="$dir/${file}_$1"
There are tricks you can do with:
base_with_suffix=${file##*/}
base=${base_with_suffix%.fa}
which do not invoke an external command. The dirname operation can be done too:
dir=${file%/*}
but I think basename and dirname are clearer (but I could be biased by many years' experience during which there wasn't an alternative). Also, there are edge cases where the string manipulation expressions don't work well but the commands work correctly, though they are unlikely to actually impact your code.
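One such edge case, for illustration: when the name has no directory component, dirname still gives a usable answer while the string manipulation returns the name unchanged. A small sketch:
file="AG002378.fa"       # no slash in the name
dirname "$file"          # prints: .
echo "${file%/*}"        # prints: AG002378.fa (no slash to strip)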
It is not entirely clear from your question exactly what you want as the output, but variations on the themes shown should allow you to solve the problem.

Difference between using ls and find to loop over files in a bash script

I'm not sure I understand exactly why:
for f in `find . -name "strain_flame_00*.dat"`; do
echo $f
mybase=`basename $f .dat`
echo $mybase
done
works and:
for f in `ls strain_flame_00*.dat`; do
echo $f
mybase=`basename $f .dat`
echo $mybase
done
does not, i.e. the filename does not get stripped of the suffix. I think it's because what comes out of ls is formatted differently but I'm not sure. I even tried to put eval in front of ls...
The correct way to iterate over filenames here would be
for f in strain_flame_00*.dat; do
echo "$f"
mybase=$(basename "$f" .dat)
echo "$mybase"
done
Using for with a glob pattern, and then quoting all references to the filename is the safest way to use filenames that may have whitespace.
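One caveat to keep in mind: if no files match, the pattern is passed through unexpanded, so the loop body runs once with f set to the literal pattern. A sketch of guarding against that with bash's nullglob option:
shopt -s nullglob               # unmatched globs expand to nothing
for f in strain_flame_00*.dat; do
    echo "$f"
done
shopt -u nullglob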
First of all, never parse the output of the ls command.
If you MUST use ls and you DON'T know what ls alias is out there, then do this:
(
COLUMNS=
LANG=
NLSPATH=
GLOBIGNORE=
LS_COLORS=
TZ=
unalias ls 2>/dev/null
unset -f ls
for f in `ls -1 strain_flame_00*.dat`; do
echo $f
mybase=`basename $f .dat`
echo $mybase
done
)
It is surrounded by parentheses to protect the existing environment, aliases and shell variables.
Various environment names were NUKED (as ls does look those up).
One unalias command (self-explanatory).
One unset command (again, protection against an unscrupulous, over-lording ls function).
Now you can see why NOT to use ls.
Another difference that hasn't been mentioned yet is that find searches recursively by default, whereas ls does not (even though both can be switched between recursive and non-recursive behaviour through options, and find can be told to recurse only up to a specified depth).
And, as others have mentioned, if it can be achieved by globbing, you should avoid using either.
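And when find was only being used for its recursion, bash 4+ can do that with a glob as well; a sketch assuming the globstar option is available:
shopt -s globstar nullglob      # ** recurses; unmatched globs vanish
for f in **/strain_flame_00*.dat; do
    mybase=$(basename "$f" .dat)
    echo "$mybase"
done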

Shell script: execute cmd on a file, with additional processing of file name

So I am going to post a question about shell scripting again.
Problem Definition: For all files under a dir, ex.:
A_anything.txt, B_anything.txt, ......
I want to execute a script, say 'CMD', on each of them, with the output files named like:
A_result.txt, B_result.txt, ......
In addition, at the first line of these output file, I want to have the file name of the original one.
The 'find -exec' util seems to me unable to extract part of the file name.
Does someone know a solution to this problem, by any means (shell, python, find, etc.)? Thank you!
cd /directory
for file in *.txt ; do
newfilename=`echo "$file" | sed 's/\(.\+\)_.*/\1_result.txt/'`
echo "$file" > "$newfilename"
your-command "$file" >> "$newfilename"
done
HTH
Well, there's more than one way to do it (including using Perl, where that's the motto), but probably I'd write it like this:
find . -name '[A-Z]_*.txt' -type f -print0 |
xargs -0 modify_rename.sh
And then I'd write the script modify_rename.sh like this:
#!/bin/sh
for file in "$#"
do
dirname=$(dirname "$file")
basename=$(basename "$file" .txt)
leadname=${basename%_*}
outname="$dirname/${leadname}_result.txt"
# Optionally check for pre-existence of $outname
{
# Optionally echo "$basename.txt" instead of "$file"
echo "$file"
# Does this invocation of CMD write to standard output?
# If not, adjust invocation appropriately.
CMD "$file"
} > "$outname"
done
The advantage of this separation into separate scripting operations is that the rename/modify operation can be checked out separately from the search process - which runs less risk of zapping your entire directory structure with bad commands.
Bash has the tools to avoid invoking basename and dirname but the notation is moderately excruciating; I find the clarity of the command names worth having. I'd be happy if bash implemented them as built-ins. There are plenty of other ways to get the prefix of the file; this should be safe, though, even in the presence of spaces (tabs, newlines) in file or directory names because of the careful use of double quotes.
