Bash for loop not working over large dataset in OSX - macos

I have a directory with a large number of sub-directories, some of which have several zip files in them. I'm trying to write a bash script that will go through the directories, look for names containing "Archive-foo", enter those sub-directories, and if they contain zip files, unzip them and then trash the zip files.
The script I wrote works on my test directories (5 sub-directories), but when I tried to use it on the main archive directory (1200+ sub-directories) it fails to do anything.
Is there a max number of items a for loop can cycle through?
Here's my code:
#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
NUMBER=0
for i in $( ls )
do
    #echo "$i"" is in the Top Level"
    NUMBER=$[NUMBER+1]
    if ($(test -d "$i"))
    then
        #echo "$i"" is a Directory"
        if [[ "$i" == *Archive* ]]
        then
            #echo "$i"" has Archive in the name"
            cd "$i"
            unzip -n "*".zip
            mv *.zip ~/.Trash
        #else
            #echo "$i"" does not have Archive in the name"
        fi
    #else
        #echo "$i"" is NOT a Directory skipping"
    fi
done
echo "$NUMBER of items"
IFS=$SAVEIFS

There's a limit on the size of command lines, and for i in $( ls ) may be exceeding it.
Try this syntax instead:
ls | while read i;
do
...
done
The only problem with this is that the pipeline runs the while loop in a subshell, so assignments to NUMBER won't persist into the original shell process. You can have the loop print a line whenever it processes a file, and pipe the whole loop to wc -l to count the lines.
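For instance, here is a minimal sketch of that counting approach (the loop body is just a placeholder for the real work):
NUMBER=$(ls | while read i
do
    # ... process "$i" here ...
    echo "$i"        # one line per item handled
done | wc -l)
echo "$NUMBER of items"
Because the command substitution captures the loop's output in the parent shell, NUMBER survives the subshell.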

Barmar's answer hit the issue on the nose. Using for file in $(...) as a loop header is not a very good idea:
It is slower: The shell executes what is in $(..) first, then runs the for loop. It can't start the for until $(...) finishes.
It can overrun the command line buffer: The shell executes $(...) and then puts the result on the command line. The command line buffer may be around 32 kilobytes (maybe more now), but if you have 10,000 files and each file name averages 20 characters, you end up with over 200KB on the command line.
For loops are terrible at handling bad file names: If file names have whitespace in them, each word is treated as a separate file.
A much better construct is:
find . ... -print0 | while read -d $'\0' file
do
...
done
This can execute the while read loop while the find is executing, making it faster.
This can't overrun the command line buffer.
Most importantly, this construct handles almost any type of file name. The find will return each file separated by a NUL character - a character that cannot appear in a file name. The -d $'\0' tells the read command that the NUL character is the delimiter between file names. This handles spaces, tabs, and even newlines in file names.
The find command is also very flexible. You can limit the list to only files, files in a particular age range, etc. The most common ones needed to replace for loops are:
$ find . -depth 1
acts just like ls -a.
$ find . \! -name ".*" -prune -a -depth 1
acts just like ls, and will skip over file names that begin with a dot.
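To tie this back to the original question, here is a hedged sketch of what the Archive clean-up might look like with this construct (the directory layout and the ~/.Trash destination are taken from the question; everything else is an assumption, not the asker's final script):
#!/bin/bash
# Find top-level sub-directories whose names contain "Archive".
find . -maxdepth 1 -type d -name '*Archive*' -print0 |
while IFS= read -r -d '' dir
do
    for zip in "$dir"/*.zip; do
        [ -e "$zip" ] || continue          # no zip files in this directory
        unzip -n "$zip" -d "$dir" && mv "$zip" ~/.Trash/
    done
done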

Related

how list just one file from a (bash) shell directory listing

A bit lowly a query but here goes:
bash shell script. POSIX, Mint 21
I just want one/any (mp3) file from a directory. As a sample.
In normal execution, a full run, the code would be something like this:
for f in *.mp3; do
#statements
done
This works fine but if I wanted to sample just one file of such an array/glob (?) without looping, how might I do that? I don't care which file, just that it is an mp3 from the directory I am working in.
Should I just start this for-loop and then exit (break) after one statement, or is there a neater, more tailored-for-the-job way?
for f in *.mp3; do
#statement
break
done
Ta (cannot believe how dopey I feel asking this one; my forehead will hurt when I see the answers)
Since you are using Linux (Mint) you've got GNU find so one way to get one .mp3 file from the current directory is:
mp3file=$(find . -maxdepth 1 -mindepth 1 -name '*.mp3' -printf '%f' -quit)
-maxdepth 1 -mindepth 1 causes the search to be restricted to one level under the current directory.
-printf '%f' prints just the filename (e.g. foo.mp3). The -print option would print the path to the filename (e.g. ./foo.mp3). That may not matter to you.
-quit causes find to exit as soon as one match is found and printed.
Another option is to use the Bash : (colon) command and $_ (dollar underscore) special variable:
: *.mp3
mp3file=$_
: *.mp3 runs the : command with the list of .mp3 files in the current directory as arguments. The : command ignores its arguments and does nothing.
mp3file=$_ sets the value of the mp3file variable to the last argument supplied to the previous command (:).
The second option should not be used if the number of .mp3 files is large (hundreds or more) because it will find all of the files and sort them by name internally.
In both cases $mp3file should be checked to ensure that it really exists (e.g. [[ -e $mp3file ]]) before using it for anything else, in case there are no .mp3 files in the directory.
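For example, a small sketch of the second option together with the suggested existence check (assuming the current directory is the one of interest):
: *.mp3            # expands the glob; ':' discards its arguments
mp3file=$_         # last argument of the previous command
if [[ -e $mp3file ]]; then
    echo "sample file: $mp3file"
else
    echo "no .mp3 files here" >&2
fi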
I would do it like this in POSIX shell:
mp3file=
for f in *.mp3; do
if [ -f "$f" ]; then
mp3file=$f
break
fi
done
# At this point, the variable mp3file contains a filename which
# represents a regular file (or a symbolic link) with the .mp3
# extension, or empty string if there is no such a file.
The fact that you use
for f in *.mp3; do
suggests to me that the MP3s are named without too many strange characters in the filenames.
In that case, if you really don't care which MP3, you could:
f=$(ls *.mp3 | head -n 1)
statement
Or, if you want a different one every time:
f=$(ls *.mp3|sort -R | tail -1)
Note: if your filenames get more complicated (including spaces or other special characters), this will not work anymore.
Assuming you don't have spaces in your filenames (and personally I don't understand why the collective taboo is against using ls in scripts, rather than against having spaces in filenames), then:
ls *.mp3 | tr ' ' '\n' | sed -n '1p'

Automate a bash command for multiple files

I have a directory with multiple files
file1_1.txt
file1_2.txt
file2_1.txt
file2_2.txt
...
And I need to run a command structured like this
command [args] file1 file2
So I was wondering if there was a way to call the command just once on all the files, instead of having to call it separately for each pair of files.
Use find and xargs, with sort, since the order appears meaningful in your case:
find . -name 'file?_?.txt' | sort | xargs -n2 command [args]
If your command can take multiple pairs of files on the command line then it should be sufficient to run
command ... *_[12].txt
The files in expanded glob patterns (such as *_[12].txt) are automatically sorted so the files will be paired correctly.
If the command can only take one pair of files then it will need to be run multiple times to process all of the files. One way to do this automatically is:
for file1 in *_1.txt; do
file2=${file1%_1.txt}_2.txt
[[ -f $file2 ]] && echo command "$file1" "$file2"
done
You'll need to replace echo command with the correct command name and arguments.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${file1%_1.txt}.
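For illustration, a couple of throwaway lines showing what that suffix removal does (the filename is made up):
file1=file7_1.txt
echo "${file1%_1.txt}"          # prints: file7
echo "${file1%_1.txt}_2.txt"    # prints: file7_2.txt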
#!/bin/bash
cmd () {
    # split the space-joined argument list into an array
    readarray -d " " arr <<<"$@"
    for ((i=0; i<${#arr[@]}; i+=2))
    do
        n=$(($i+1))
        firstFile="${arr[$i]}"
        secondFile="${arr[$n]}"
        echo "pair -- ${firstFile} ${secondFile}"
    done
}
cmd file*_[12].txt
pair -- file1_1.txt file1_2.txt
pair -- file2_1.txt file2_2.txt

Iterate through several files in bash [duplicate]

This question already has answers here:
How to zero pad a sequence of integers in bash so that all have the same width?
(15 answers)
Closed 6 years ago.
I have a folder with several files that are named like this:
file.001.txt.gz, file.002.txt.gz, ... , file.150.txt.gz
What I want to do is use a loop to run a program with each file. I was thinking of something like this (just a sketch):
for i in {1:150}
gunzip file.$i.txt.gz
./my_program file.$i.txt output.$1.txt
gzip file.$1.txt
First of all, I don't know if something like this is gonna work, and second, I can't figure out how to keep the three-digit numbering the files have ('001' instead of just '1').
Thanks a lot
The syntax for ranges in bash is
{1..150}
not {1:150}.
Moreover, if your bash is recent enough, you can add the leading zeroes:
{001..150}
The correct syntax of the for loop needs do and done.
for i in {001..150} ; do
# ...
done
It's unclear what $1 contains in your script.
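Putting those pieces together, a sketch of the question's loop with the corrected range (assuming $1 was meant to be $i, i.e. each output is named after its input):
for i in {001..150}; do
    gunzip "file.$i.txt.gz"
    ./my_program "file.$i.txt" "output.$i.txt"
    gzip "file.$i.txt"
done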
To iterate over files I believe the simpler way is:
(assuming there are no files named 'file.*.txt' already in the directory and that your output file can have a different name)
for i in file.*.txt.gz; do
    gunzip "$i"
    ./my_program "${i%.gz}" "${i%.gz}-output.txt"
    gzip "${i%.gz}"
done
Using find command:
# Path to the source directory
dir="./"
while read file
do
output="$(basename "$file")"
output="$(dirname "$file")/"${output/#file/output}
echo "$file ==> $output"
done < <(find "$dir" \
-regextype 'posix-egrep' \
-regex '.*file\.[0-9]{3}\.txt\.gz$')
The same via pipe:
find "$dir" \
-regextype 'posix-egrep' \
-regex '.*file\.[0-9]{3}\.txt\.gz$' | \
while read file
do
output="$(basename "$file")"
output="$(dirname "$file")/"${output/#file/output}
echo "$file ==> $output"
done
Sample output
/home/ruslan/tmp/file.001.txt.gz ==> /home/ruslan/tmp/output.001.txt.gz
/home/ruslan/tmp/file.002.txt.gz ==> /home/ruslan/tmp/output.002.txt.gz
(for $dir=/home/ruslan/tmp/).
Description
The scripts iterate over the files in the $dir directory. The $file variable is filled with the next line read from the find command.
The find command returns a list of paths corresponding to the regular expression '.*file\.[0-9]{3}\.txt\.gz$'.
The $output variable is built from two parts: basename (path without directories) and dirname (path to file's directory).
${output/#file/output} expression replaces file with output at the front end of $output variable (see Manipulating Strings)
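As a quick illustration of that front-anchored replacement (the filename is just an example):
output=file.001.txt.gz
echo "${output/#file/output}"   # prints: output.001.txt.gz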
Try:
for i in $(seq -w 1 150)   # -w adds the leading zeroes
do
    gunzip file."$i".txt.gz
    ./my_program file."$i".txt output."$i".txt
    gzip file."$i".txt
done
The syntax for ranges is as choroba said, but when iterating over files you usually want to use a glob. If you know all the files have three digits in their names you can match on digits:
shopt -s nullglob
for f in file.0[0-9][0-9].txt.gz file.1[0-4][0-9].txt.gz file.150.txt.gz; do
    i=${f%.txt.gz}    # e.g. file.042
    i=${i#file.}      # e.g. 042
    gunzip "file.$i.txt.gz"
    ./my_program "file.$i.txt" "output.$i.txt"
    gzip "file.$i.txt"
done
This will only iterate through files that exist. If you use the range expression, you have to take extra care not to try to operate on files that don't exist.
for i in file.{000..150}.txt.gz; do
[[ -e "$i" ]] || continue
...otherstuff
done

Iterating a group of folders and files while removing certain files that are contained in a list

I have a set of files that I download that contain files I want to remove. I would like to create a list of some form; the script should support globbing so I can be pretty aggressive with file removal without getting into the complexities of using regex within the list of files.
I am also stumped in that I put a sleep command within the loop of my script, and it is not getting run after each iteration, but only once at the end of the run.
Here is the script
# Get to the place where all the dirty work happens
cd /Volumes/Videos

FILES=".DS_Store
*.txt
*.sample
*.sample.*
*.samples"

if [ "$(pwd)" == "/Volumes/Videos" ]; then
    echo "You are currently in $(pwd)"
    echo "You would not have read the above if this script were operating anywhere else"

    # Delete files from the list above
    for f in "$FILES"
    do
        echo "Removing $f";
        rm -f "$f";
        echo "$f has been deleted";
        sleep 10;
        echo "";
        echo "";
    done

    # See if dir is empty, ask if we want to delete it or keep it
    # Iterate every movie file, see if we want to nuke contents. Maybe use part of last opened to help find those files fast
else
    # Not in the correct directory
    echo "This script is trying to alter files in a location that it should not be working"
    echo "Script is currently trying to work in $(pwd)"
    exit 1
fi
The main thing that has me completely stumped is the sleep command. It runs once, not once per file iteration: if I have 100 files to go through I get 10 seconds of sleep, not 100*10.
I will be adding in some other features, like if a file is smaller than x bytes, go ahead and delete it too. These files will have spaces and other odd characters in the filenames; am I creating my variables correctly to make this script handle those scenarios as well as be as POSIX-compliant as possible? I will change the shebang to sh over bash and try to add in set -o nounset and set -o errexit, though I tend to have a lot of trouble when I do that.
Is there a better form of list I should be using? I am not opposed to storing the pattern-match list in a separate file. I can include it, or read it in with any of a few commands.
These are also nested: a dir that contains files, or a dir that contains a dir that contains some files. Something like this:
/Volumes/Videos:
The Great guy in a tree
The Great guy in a tree S01e01
sample.avi
readme.txt
The Great guy in a tree S01e01.mpg
The Great guy in a tree S01e02
The Great guy in a tree S01e02.mpg
The Great guy in a tree S01e03
The Great guy in a tree S01e03.mpg
The Great guy in a tree S01e04
The Great guy in a tree S01e04.mpg
Thank you.
The reason your script is not working as you expect is that your for loop is written incorrectly. This example shows what is going on:
$ i=0
$ FILES=".DS_Store
*.txt
*.sample
*.sample.*
*.samples"
$ for f in "$FILES"; do echo $((++i)) "$f"; done
1 .DS_Store
*.txt
*.sample
*.sample.*
*.samples
Note that only one number is output, indicating that the loop is only going around once. Also, no pathname expansion has occurred.
In order to make your script work as you expect, you can remove the quotes around "$FILES". This means that each word in your string will be evaluated separately, rather than all at once. It also means that pathname expansion of the wildcards that you are using will occur, so all files ending in .txt will be removed, which I guess is what you meant.
Instead of using a string to store your list of expressions, you might prefer to make use of an array:
FILES=( '.DS_Store' '*.txt' '*.sample' '*.sample.*' '*.samples' )
The quotes around each element prevent expansion (so the array only has 5 elements, not the fully expanded list). You could then change your loop to for f in ${FILES[@]} (again, no double quotes results in each element of the list being expanded).
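A short sketch of that array suggestion in use (the patterns are the ones from the question; leaving ${FILES[@]} unquoted is deliberate so the patterns expand at loop time):
FILES=( '.DS_Store' '*.txt' '*.sample' '*.sample.*' '*.samples' )
shopt -s nullglob                # patterns with no matches expand to nothing
for f in ${FILES[@]}
do
    echo "Removing $f"
    rm -f "$f"
    sleep 10
done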
Although removing the quotes fixes your script, I would agree with @hek2mgl's suggestion of using find. It allows you to find files by name, size, date modified and a lot more in one line. If you want to pause between the deletion of each file, you could use something like this:
find . \( -name "*.sample" -o -name "*.txt" \) -delete -exec sleep 10 \;
You can use find:
find . -type f \( -name '.DS_Store' -o -name '*.txt' -o -name '*.sample.*' -o -name '*.samples' \) -delete

shell scripting: search/replace & check file exist

I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file exists]
add to whatsleft list
end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob # allow extended glob syntax, for matching the filenames
LC_COLLATE=C # use a sort order comm is happy with
IFS=$'\n' # so filenames can have spaces but not newlines
# (newlines don't work so well with comm anyhow;
# shame it doesn't have an option for null-separated
# input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
$(comm -23 --nocheck-order \
<(printf "%s\n" "${files_todo[@]%.xml}") \
<(printf "%s\n" "${files_done[@]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]%.xml}"; do printf "%s\n" "${f}.txt"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing ls (which is considered very poor practice).
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+).xml ]] ; then
out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
: # ...handle here the fact that you have a noncompliant name...
fi
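For example, a quick check of that transform with a made-up path:
in_name="data/branchA/special/run07.xml"
if [[ $in_name =~ data/([^/]+)/special/([^/]+).xml ]] ; then
    out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
    echo "$out_name"    # prints: results/special/branchA-run07.dat
fi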
The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f ${file%.xml}.txt ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
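A sketch of that array variant (jobserve E is the asker's command; whether it accepts many files at once is an assumption carried over from the loop above):
todo=()
for file in *.xml
do
    [ -f "${file%.xml}.txt" ] || todo+=("$file")
done
(( ${#todo[@]} )) && jobserve E "${todo[@]}"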
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing it, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
I am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name (or do this check in your E perl script):
if [ -f "$file" ];then
newname="...."
fi
...
jobserve E .... > $newname
If it's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=$(sort $TMPA $TMPB | uniq -u | sed "s/$/.xml/");
rm $TMPA $TMPB;
