I have a directory with multiple files
file1_1.txt
file1_2.txt
file2_1.txt
file2_2.txt
...
And I need to run a command structured like this
command [args] file1 file2
So I was wondering if there is a way to call the command just once for all the files, instead of having to call it separately for each pair of files.
Use find and xargs, with sort, since the order appears meaningful in your case:
find . -name 'file?_?.txt' | sort | xargs -n2 command [args]
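If the file names might contain spaces or newlines, a NUL-delimited variant of the same pipeline is safer. This is only a minimal sketch: it assumes a find that supports -print0, a sort that supports -z, and an xargs that supports -0 (GNU tools and recent BSD tools do). pair_files is a hypothetical wrapper name, and echo stands in for the real command:

```shell
#!/bin/bash
# pair_files DIR CMD [ARGS...]: run CMD once per sorted pair of files in DIR.
# Hypothetical helper; relies on the non-POSIX -print0 / -z / -0 options.
pair_files() {
    dir=$1
    shift
    find "$dir" -name 'file?_?.txt' -print0 | sort -z | xargs -0 -n2 "$@"
}

# Dry run: print each pair instead of running a real command.
# pair_files . echo command
```

Sorting happens on the full path, so this only pairs files correctly when both halves of a pair live in the same directory.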
If your command can take multiple pairs of files on the command line then it should be sufficient to run
command ... *_[12].txt
The files in expanded glob patterns (such as *_[12].txt) are automatically sorted so the files will be paired correctly.
If the command can only take one pair of files then it will need to be run multiple times to process all of the files. One way to do this automatically is:
for file1 in *_1.txt; do
file2=${file1%_1.txt}_2.txt
[[ -f $file2 ]] && echo command "$file1" "$file2"
done
You'll need to replace echo command with the correct command name and arguments.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of ${file1%_1.txt}.
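To see what the ${file1%_1.txt} expansion does in isolation, here is a small sketch. partner_of is a hypothetical helper name, not something from the answer above:

```shell
#!/bin/bash
# partner_of NAME: print the "_2.txt" partner of a "_1.txt" file name.
# ${1%_1.txt} removes the shortest match of "_1.txt" from the end of $1.
partner_of() {
    printf '%s\n' "${1%_1.txt}_2.txt"
}

partner_of file1_1.txt   # prints file1_2.txt
```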
#!/bin/bash
cmd (){
arr=("$@")
for ((i=0; i<${#arr[@]}; i+=2))
do
n=$(($i+1))
firstFile="${arr[$i]}"
secondFile="${arr[$n]}"
echo "pair -- ${firstFile} ${secondFile}"
done
}
cmd file*_[12].txt
pair -- file1_1.txt file1_2.txt
pair -- file2_1.txt file2_2.txt
Is there a shorthand in bash to select an arbitrary file? * enumerates all files in the current directory, but what if I only want one file and don't care which it is?
FWIW I'm testing several different ffmpeg commands in a directory with similarly named video files, so tab-complete is cumbersome.
Here's the robust way of getting the first or a random file in a directory, handling the edge case of not having any files:
#!/bin/bash
# Let globs expand to 0 elements instead of themselves if no matches
shopt -s nullglob
# Add all the files in the current dir to an array
files=(*)
# Check if the array has any elements
if [[ ${#files[@]} -gt 0 ]]
then
first_file=${files[0]}
random_file=${files[RANDOM%${#files[@]}]}
echo "The first file is ${first_file}"
echo "A random file is ${random_file}"
else
echo "There are no files in the current directory."
fi
If you just want something short and hacky for interactive testing, you can create an array and reference it unindexed to get the first element with minimal typing:
$ testfile=( *.avi )
$ ffmpeg -i "$testfile" test.mp3
You can also bind Tab to zsh style completion:
$ bind 'TAB:menu-complete'
Now, for the rest of this session, when you press Tab you'll get a complete filename instead of just a prefix (press Tab again to cycle through matches). This will let you conveniently pick a file with a single keystroke.
Occasionally I use shuf:
find . -name '*whatever*' | shuf | head -n 1
shuf is a tool from GNU coreutils that prints its input lines in random order. In other words, it shuffles the lines.
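shuf can also pick from its arguments directly, which avoids the find pipeline. A minimal sketch, assuming GNU coreutils shuf (its -e option treats each argument as an input line, and -n limits how many lines are printed); random_file is a hypothetical function name:

```shell
#!/bin/bash
# random_file DIR: print one randomly chosen non-hidden entry of DIR.
# Assumes GNU shuf. Prints nothing and returns 1 if DIR has no entries.
random_file() {
    set -- "$1"/*
    [ -e "$1" ] || return 1      # the pattern stayed unexpanded: no entries
    shuf -n 1 -e -- "$@"
}
```

Names that contain newlines would not survive this, since shuf works line by line.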
I have a directory with a large number of sub-directories, some of which have several zip files in them. I'm trying to write a bash script that will go through the directories, look for the name "Archive-foo", enter the sub-directory and, if it contains zip files, unzip them and then trash the zip files.
The script I wrote works on my test directories (5 sub directories) but when I tried to use it on the main archive directory (1200+ sub-directories) it fails to do anything.
Is there a max number of items a for loop can cycle through?
Here's my code:
#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
NUMBER=0
for i in $( ls )
do
#echo "$i"" is in the Top Level"
NUMBER=$[NUMBER+1]
if ($(test -d "$i"))
then
#echo "$i"" is a Directory"
if [[ "$i" == *Archive* ]]
then
#echo "$i"" has Archive in the name"
cd "$i"
unzip -n "*".zip
mv *.zip ~/.Trash
#else
#echo "$i"" does not have Archive in the name"
fi
#else
#echo "$i"" is NOT a Directory skipping"
fi
done
echo "$NUMBER of items"
IFS=$SAVEIFS
There's a limit on the size of command lines, and for i in $( ls ) may be exceeding it.
Try this syntax instead:
ls | while read i;
do
...
done
The only problem with this is that the pipeline runs the while loop in a subshell, so assignments to NUMBER won't persist into the original shell process. You can have the loop print a line whenever it processes an item, and pipe the whole loop to wc -l to count the lines.
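If the counter has to survive the loop, another option is to avoid the pipeline altogether and loop over a glob, which runs in the current shell. This is a different technique from the ls | while read loop above, shown only as a minimal sketch; count_entries is a hypothetical name, and the loop body is where the per-entry work would go:

```shell
#!/bin/bash
# count_entries DIR: count the non-hidden entries of DIR without a pipeline,
# so "number" is updated in the current shell rather than in a subshell.
count_entries() {
    number=0
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue   # skip the unexpanded pattern if DIR is empty
        number=$((number + 1))
    done
    printf '%s\n' "$number"
}
```

After calling count_entries directly (not inside $(...)), the variable number is still set in the calling shell.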
Barmer's answer hits the issue on the nose. Using for file in $(...) as a loop header is not a very good idea:
It is slower: The shell executes what is in $(..) first, then runs the for loop. It can't start the for until $(...) finishes.
It can overrun the command line buffer: The shell executes $(..) and then puts the result on the command line. The command line buffer may be about 32 kilobytes, maybe more now, but if you have 10,000 files and each file name averages 20 characters, you end up with a command line of over 200 KB.
For loops are terrible at handling bad file names: If file names have white spaces in them, each word is treated like a file.
A much better construct is:
find . ... -print0 | while IFS= read -r -d $'\0' file
do
...
done
This can execute the while read loop while the find is executing, making it faster.
This can't overrun the command line buffer.
Most importantly, this construct handles almost any type of file name. The find will return each file separated by a NUL character, a character that cannot be in a file name. The -d $'\0' tells the read command that the NUL character is the delimiter between file names. This handles spaces, tabs, and even newlines in file names.
The find is also very flexible. You can limit the list to only files, files in a particular age range, etc. The most common ones needed to replace for loops are:
$ find . -depth 1
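acts just like ls -a (the numeric argument to -depth is a BSD find extension; with GNU find, use -mindepth 1 -maxdepth 1 instead), and
$ find . \! -name ".*" -prune -a -depth 1
acts just like ls, skipping file names that begin with a dot (.).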
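Putting the pieces together, a NUL-delimited loop might look like the sketch below. It assumes bash and a find that supports -print0, -mindepth, and -maxdepth (GNU and BSD find both do). list_dirs is a hypothetical name, and the printf stands in for whatever the loop body should do:

```shell
#!/bin/bash
# list_dirs DIR: print the immediate subdirectories of DIR, one per line,
# reading NUL-delimited names so that spaces in names are handled.
list_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -print0 |
    while IFS= read -r -d '' dir; do
        printf 'dir: %s\n' "${dir##*/}"
    done
}
```

IFS= and -r keep read from trimming whitespace or interpreting backslashes in the names.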
I've got a directory with a few thousand files in it, named things like:
filename.ext
filename (1).ext
filename (2).ext
otherfile.ext
otherfile (1).ext
etc.
Most of the files with bracketed numbers are duplicates of the original, but in some cases they're not.
How can I keep my original files, delete the duplicates, but not lose the files that are different?
I know that I could rm *\).ext, but that obviously doesn't make sure that files match the original.
I'm using OS X, so I have a md5 program that functions sort of like md5sum in Linux, though it puts the hash at the end of the line instead of the beginning. I was thinking I could use an awk script to take the output of md5 *.ext | awk 'some script', find duplicates by md5, and delete them, but the command line is too long (bash: /sbin/md5: Argument list too long).
And I don't know what to write in the script. I was thinking of storing things in an array with this:
awk '{a[$NF]++} a[$NF]>1{sub(/).*/,""); sub(/.*(/,""); system("rm " $0);}'
But that always seems to delete my original.
What am I doing wrong? How do I do it right?
Thanks.
Your awk script deletes original files because when you sort your files, "." (period) sorts after " " (space). So the first file that's seen is a numbered one, not the original, and subsequent checks (including the one against the original) compare files to the first numbered one.
Not only does rm *\).txt fail to match the original, it loses files that may not have an original in the first place.
I wouldn't do this quite this way. Rather than checking every numbered file and verifying whether it matches an original, you can go through your list of originals, then delete the numbered files that match them.
Instead:
$ for file in *[^\)].txt; do echo "-- Found: $file"; rm -v "$(basename "$file" .txt)"\ \(*\).txt; done
You can expand this to check MD5's along the way. But it's more code, so I'll break it into multiple lines, in a script:
#!/bin/bash
shopt -s nullglob # Show nothing if a fileglob matches no files
for file in *[^\)].ext; do
md5=$(md5 -q "$file") # The -q option gives you only the message digest
echo "-- Found: $file ($md5)"
for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
if [[ "$md5" = "$(md5 -q "$duplicate")" ]]; then
rm -v "$duplicate"
fi
done
done
As an alternative, you can probably get away with doing this a little more simply, with less CPU overhead than calculating MD5 digests. Unix and Linux have a shell tool called cmp, which is like diff without the output. So:
#!/bin/bash
shopt -s nullglob
for file in *[^\)].ext; do
for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
if cmp -s "$file" "$duplicate"; then
rm -v "$duplicate"
fi
done
done
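Because these loops delete files, it can be worth doing a dry run first. The sketch below assumes the same naming scheme (name.ext and name (N).ext) and only prints the numbered copies that are byte-identical to their original; find_duplicates is a hypothetical name, and nothing is deleted:

```shell
#!/bin/bash
# find_duplicates DIR EXT: print numbered copies ("name (N).EXT") in DIR whose
# contents are byte-identical to the original ("name.EXT"). Deletes nothing.
find_duplicates() {
    ext=$2
    for file in "$1"/*."$ext"; do
        case $file in
            *\)."$ext") continue ;;           # numbered copies are not originals
        esac
        for duplicate in "${file%.$ext}"\ \(*\)."$ext"; do
            [ -e "$duplicate" ] || continue   # pattern matched nothing
            if cmp -s "$file" "$duplicate"; then
                printf '%s\n' "$duplicate"
            fi
        done
    done
}
```

Once the printed list looks right, the printf can be swapped for rm -v.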
If you don't need to use AWK, you could maybe do something simpler in bash:
for file in *\([0-9]*\)*; do
[ -e "$(echo "$file" | sed -E 's/ \([0-9]+\)//')" ] && rm "$file"
done
Hope this helps a little =)
I've been handed a project that consists of several dozen (probably over 100, I haven't counted) bash scripts. Most of the scripts make at least one call to another one of the scripts. I'd like to get the equivalent of a call graph where the nodes are the scripts instead of functions.
Is there any existing software to do this?
If not, does anybody have clever ideas for how to do this?
Best plan I could come up with was to enumerate the scripts and check to see if the basenames are unique (they span multiple directories). If there are duplicate basenames, then cry, because the script paths are usually held in variable names so you may not be able to disambiguate. If they are unique, then grep the names in the scripts and use those results to build up a graph. Use some tool (suggestions?) to visualize the graph.
Suggestions?
Wrap the shell itself in your own implementation: log who called your wrapper, then exec the original shell.
Yes, you have to run the scripts in order to identify which script is really used. Otherwise you need a tool with the same knowledge as the shell engine itself, to support all the variable expansion, PATH lookups, etc. I have never heard of such a tool.
In order to visualize the calling graph use GraphViz's dot format.
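As a rough illustration of the dot format, the sketch below turns "caller callee" pairs (one per line on standard input) into a directed graph. edges_to_dot is a hypothetical helper, and it assumes that script names contain no whitespace or double quotes:

```shell
#!/bin/bash
# edges_to_dot: read "caller callee" pairs from stdin and write a GraphViz
# digraph to stdout. Assumes names have no whitespace or double quotes.
edges_to_dot() {
    printf 'digraph calls {\n'
    while read -r caller callee; do
        [ -n "$callee" ] || continue          # skip blank or incomplete lines
        printf '    "%s" -> "%s";\n' "$caller" "$callee"
    done
    printf '}\n'
}
```

The result can then be rendered with something like dot -Tpng calls.dot -o calls.png.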
Here's how I wound up doing it (disclaimer: a lot of this is hack-ish, so you may want to clean up if you're going to use it long-term)...
Assumptions:
- Current directory contains all scripts/binaries in question.
- Files for building the graph go in subdir call_graph.
Created the script call_graph/make_tgf.sh:
#!/bin/bash
# Run from dir with scripts and subdir call_graph
# Parameters:
# $1 = sources (default is call_graph/sources.txt)
# $2 = targets (default is call_graph/targets.txt)
SOURCES=$1
if [ "$SOURCES" == "" ]; then SOURCES=call_graph/sources.txt; fi
TARGETS=$2
if [ "$TARGETS" == "" ]; then TARGETS=call_graph/targets.txt; fi
if [ ! -d call_graph ]; then echo "Run from parent dir of call_graph" >&2; exit 1; fi
(
# cat call_graph/targets.txt
for file in `cat $SOURCES `
do
for target in `grep -v -E '^ *#' $file | grep -o -F -w -f $TARGETS | grep -v -w $file | sort | uniq`
do echo $file $target
done
done
)
Then, I ran the following (I wound up doing the scripts-only version):
cat /dev/null | tee call_graph/sources.txt > call_graph/targets.txt
for file in *
do
if [ -d "$file" ]; then continue; fi
echo $file >> call_graph/targets.txt
if file $file | grep text >/dev/null; then echo $file >> call_graph/sources.txt; fi
done
# For scripts only:
bash call_graph/make_tgf.sh call_graph/sources.txt call_graph/sources.txt > call_graph/scripts.tgf
# For scripts + binaries (binaries will be leaf nodes):
bash call_graph/make_tgf.sh > call_graph/scripts_and_bin.tgf
I then opened the resulting tgf file in yEd, and had yEd do the layout (Layout -> Hierarchical). I saved as graphml to separate the manually-editable file from the automatically-generated one.
I found that there were certain nodes that were not helpful to have in the graph, such as utility scripts/binaries that were called all over the place. So, I removed these from the sources/targets files and regenerated as necessary until I liked the node set.
Hope this helps somebody...
Insert a line at the beginning of each shell script, after the #! line, which logs a timestamp, the full pathname of the script, and the argument list.
Over time, you can mine this log to identify likely candidates, i.e. two lines logged very close together have a high probability of the first script calling the second.
This also allows you to focus on the scripts which are still actually in use.
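Such a logging line could look like the sketch below. The log_call name, the CALL_LOG variable, and the default log path are all hypothetical; the function appends an epoch timestamp, the script path, and the arguments, separated by tabs:

```shell
#!/bin/bash
# log_call SCRIPT [ARGS...]: append "timestamp<TAB>script<TAB>args" to a log.
# CALL_LOG and its default path are assumptions made for this sketch.
log_call() {
    script=$1
    shift
    printf '%s\t%s\t%s\n' "$(date +%s)" "$script" "$*" \
        >> "${CALL_LOG:-/tmp/script_calls.log}"
}

# In each script, right after the #! line (once log_call has been sourced):
# log_call "$0" "$@"
```

Note that $0 is only a full pathname if the script was invoked with one, so relative invocations may need extra handling.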
You could use an ed script
1a
log blah blah blah
.
wq
and run it like so:
find / -type f -perm -u+x -exec sh -c 'ed -s "$1" < edscript' sh {} \;
Make sure you test the find command with -print instead of the exec clause. And / is probably not the path that you want to use. If you have to include bin directories then you will probably need to switch to grep in order to identify the pathnames to include, then when you have a file full of the right names, use xargs instead of find to run the script.
I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file does not exist]
add to whatsleft list
end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob # allow extended glob syntax, for matching the filenames
LC_COLLATE=C # use a sort order comm is happy with
IFS=$'\n' # so filenames can have spaces but not newlines
# (newlines don't work so well with comm anyhow;
# shame it doesn't have an option for null-separated
# input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
$(comm -23 --nocheck-order \
<(printf "%s\n" "${files_todo[@]%.xml}") \
<(printf "%s\n" "${files_done[@]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]}"; do printf "%s\n" "${f}.xml"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing ls, which is considered very poor practice.
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml$ ]] ; then
out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
: # ...handle here the fact that you have a noncompliant name...
fi
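Wrapped in a function, the same mapping can be tried on a sample path. out_name_for is a hypothetical name, and in this sketch the pattern is anchored at both ends with the dot escaped:

```shell
#!/bin/bash
# out_name_for IN: map data/BRANCH/special/PATTERN.xml to
# results/special/BRANCH-PATTERN.dat; return 1 for non-matching names.
out_name_for() {
    local re='^data/([^/]+)/special/([^/]+)\.xml$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat"
    else
        return 1
    fi
}

out_name_for data/main/special/AB01a.xml   # prints results/special/main-AB01a.dat
```

Keeping the regular expression in a variable avoids quoting surprises on the right-hand side of =~.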
The question title suggests that you might be looking for:
set -o noclobber
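For reference, this is roughly what noclobber does: with the option set, a plain > redirection refuses to overwrite an existing file. safe_write is a hypothetical helper used only to demonstrate the effect:

```shell
#!/bin/bash
# safe_write FILE TEXT: write TEXT to FILE unless FILE already exists.
# The subshell body keeps the noclobber setting from leaking into the caller.
safe_write() (
    set -o noclobber
    printf '%s\n' "$2" > "$1"
) 2>/dev/null
```

The first call for a given file succeeds; a second call returns a non-zero status and leaves the existing contents alone.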
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
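A bash-only variant using an array might look like the sketch below. It keeps file names with spaces intact. pending_xml is a hypothetical name, and the usage comment assumes that jobserve accepts several file names at once, which the question does not confirm:

```shell
#!/bin/bash
# pending_xml DIR: fill the global array "todo" with the .xml files in DIR
# that have no matching .txt file next to them.
pending_xml() {
    local file
    todo=()
    for file in "$1"/*.xml; do
        [ -e "$file" ] || continue            # pattern matched nothing
        [ -f "${file%.xml}.txt" ] || todo+=("$file")
    done
}

# Usage (echo stands in for the real call):
# pending_xml .
# [ "${#todo[@]}" -gt 0 ] && echo jobserve E "${todo[@]}"
```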
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing it, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
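If a one-sided difference is wanted instead (names that appear in the first list but not in the second), comm -23 on sorted input gives that. A minimal sketch, assuming bash for the process substitution; only_in_first is a hypothetical name and takes two files of names, one per line:

```shell
#!/bin/bash
# only_in_first A B: print lines that occur in file A but not in file B.
# comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
only_in_first() {
    LC_ALL=C comm -23 <(LC_ALL=C sort -u "$1") <(LC_ALL=C sort -u "$2")
}
```

Unlike sort | uniq -u, a name that appears only in the second list is not reported.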
I am not exactly sure what you want, but you could check for the existence of the file first and, if it exists, create a new name. (Or do this check in your E Perl script.)
if [ -f "$file" ];then
newname="...."
fi
...
jobserve E .... > $newname
If it's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=`sort $TMPA $TMPB | uniq -u | sed 's/$/.xml/'`;
rm $TMPA $TMPB;