In shell, how do I delete numbered duplicate files? - bash

I've got a directory with a few thousand files in it, named things like:
filename.ext
filename (1).ext
filename (2).ext
otherfile.ext
otherfile (1).ext
etc.
Most of the files with bracketed numbers are duplicates of the original, but in some cases they're not.
How can I keep my original files, delete the duplicates, but not lose the files that are different?
I know that I could rm *\).ext, but that obviously doesn't make sure that files match the original.
I'm using OS X, so I have an md5 program that works much like md5sum on Linux, though it puts the hash at the end of the line instead of the beginning. I was thinking I could use an awk script on the output of md5 *.ext | awk 'some script' to find duplicates by md5 and delete them, but the command line is too long (bash: /sbin/md5: Argument list too long).
And I don't know what to write in the script. I was thinking of storing things in an array with this:
awk '{a[$NF]++} a[$NF]>1{sub(/\).*/,""); sub(/.*\(/,""); system("rm " $0);}'
But that always seems to delete my original.
What am I doing wrong? How do I do it right?
Thanks.

Your awk script deletes the original files because of sort order: a . (period) sorts after a space, so the first file seen for each name is a numbered copy, not the original, and every subsequent file (including the original itself) gets compared against that first numbered copy.
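You can see the collation at work directly; for example, in the C locale a space sorts before a period, so the numbered copies come out first:
$ printf '%s\n' 'filename.ext' 'filename (1).ext' 'filename (2).ext' | LC_ALL=C sort
filename (1).ext
filename (2).ext
filename.ext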
Not only does rm *\).ext fail to check the numbered files against the original, it also deletes files that may not have an original in the first place.
I wouldn't do it quite this way. Rather than checking every numbered file and verifying whether it matches an original, go through your list of originals and delete the numbered files that match them by name. For example:
$ for file in *[^\)].ext; do echo "-- Found: $file"; rm -v "$(basename "$file" .ext)"\ \(*\).ext; done
You can expand this to check MD5s along the way. But it's more code, so I'll break it into multiple lines, in a script:
#!/bin/bash
shopt -s nullglob    # expand to nothing when a glob matches no files
for file in *[^\)].ext; do
    md5=$(md5 -q "$file")    # the -q option gives you only the message digest
    echo "-- Found: $file ($md5)"
    for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
        if [[ "$md5" = "$(md5 -q "$duplicate")" ]]; then
            rm -v "$duplicate"
        fi
    done
done
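Note that this also sidesteps the "Argument list too long" error from your question, since md5 is run on one file at a time. If you do want a single batch of digests for every file in the directory, find can feed md5 without hitting the limit; a rough sketch:
find . -maxdepth 1 -name '*.ext' -exec md5 {} +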
As an alternative, you can probably get away with doing this a little more simply, with less CPU overhead than calculating MD5 digests. Unix and Linux have a shell tool called cmp, which is like diff without the output. So:
#!/bin/bash
shopt -s nullglob
for file in *[^\)].ext; do
    for duplicate in "$(basename "$file" .ext)"\ \(*\).ext; do
        if cmp -s "$file" "$duplicate"; then
            rm -v "$duplicate"
        fi
    done
done

If you don't need to use AWK, you could maybe do something simpler in bash:
for file in *\([0-9]*\)*; do
    [ -e "$(echo "$file" | sed -e 's/ ([0-9][0-9]*)//')" ] && rm "$file"
done
Hope this helps a little =)

Related

How to recursively copy files while removing part of the path

I have hundreds of image files in a structure like this:
path/to/file/100/image1.jpg
path/to/file/9999/image765.jpg
path/to/file/333/picture2.jpg
I'd like to remove the 4th part of the path (100,9999,333, ...) so that I get this:
path/to/file/image1.jpg
path/to/file/image765.jpg
path/to/file/picture2.jpg
In this case the image file names have no duplicates, and the target directory could have an entirely different name if that makes things easier (e.g. the target could be "another/path/to/the/images/image1.jpg").
The solution might be some combination of the find/cut/rename commands.
How can I do this in bash?
Since you only have "hundreds" of files, it's quite possible that you don't need to do anything special, and can just write:
mv path/to/file/*/*.jpg path/to/file/
But depending on the number of files and lengths of their names, this may turn out to be more than the kernel will let you pass to a single command, in which case you may need to write a for-loop instead:
for file in path/to/file/*/*.jpg ; do
    mv "$file" path/to/file/
done
(Of course, this assumes you have mv on your path. There's no Bash builtin for renaming a file, so any approach will depend on what else is available on your system. If you don't have mv, you'll need to adjust the above accordingly.)
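If you'd rather not loop, find can also drive mv one file at a time, which avoids the argument-list limit as well; a sketch, assuming the directory layout from the question:
find path/to/file -mindepth 2 -maxdepth 2 -name '*.jpg' -exec mv {} path/to/file/ \;
(With GNU mv you could batch the moves instead, using -exec mv -t path/to/file/ {} +.)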
I recommend using ruakh's solution if it will work, but if you need to explicitly test for those numeric directories, here's an alternative.
I'm just using echo to pipe the list of names in, and to show the mv at the end, but you could use find (example in a comment) and remove the echo on the mv to make it live.
IFS=/
echo "path/to/file/100/image1.jpg
path/to/file/9999/image765.jpg
path/to/file/333/picture2.jpg" |
# find path/to/file -name "*.jpg" |
while read -r orig
do  this=""
    read -a line <<< "$orig"
    for sub in "${line[@]}"
    do  if [[ "$sub" =~ ^[0-9]+$ ]]
        then continue
        else this="$this$sub/"
        fi
    done
    old="${line[*]}"
    echo mv "$old" "${this%/}"
done
mv path/to/file/100/image1.jpg path/to/file/image1.jpg
mv path/to/file/9999/image765.jpg path/to/file/image765.jpg
mv path/to/file/333/picture2.jpg path/to/file/picture2.jpg

grep files based on name prefixes

I have a question on how to approach a problem I've been trying to tackle at multiple points over the past month. The scenario is like so:
I have a base directory with multiple sub-directories, all following the same sub-directory format:
A/{B1,B2,B3} where all B* have a pipeline/results/ directory structure under them.
All of these results directories have multiple *.xyz files in them. These *.xyz files have a certain hierarchy based on their naming prefixes. The naming prefixes in turn depend on how far they've been processed. They could be, for example, select.xyz, select.copy.xyz, and select.copy.paste.xyz, where the operations are select, copy and paste. What I wish to do is write a ls | grep or a find that picks these files based on their processing levels.
EDIT:
The processing pipeline goes select -> copy -> paste. The "most processed" file would be the one with the most of those stages as prefixes in its filename, i.e. select.copy.paste.xyz is more processed than select.copy.xyz, which in turn is more processed than select.xyz.
For example, let's say
B1/pipeline/results/ has select.xyz and select.copy.xyz,
B2/pipeline/results/ has select.xyz
B3/pipeline/results/ has select.xyz, select.copy.xyz, and select.copy.paste.xyz
How can I write a ls | grep/find that picks the most processed file from each subdirectory? This should give me B1/pipeline/results/select.copy.xyz, B2/pipeline/results/select.xyz and B3/pipeline/results/select.copy.paste.xyz.
Any pointer on how I can think about an approach would help. Thank you!
For this answer, we will ignore the upper part A/B{1,2,3} of the directory structure. All files in some .../pipeline/results/ directory will be considered, even if the directory is A/B1/doNotIncludeMe/forbidden/pipeline/results. We assume that the file extension xyz is constant.
A simple solution would be to loop over the directories and check whether the files exist from back to front. That is, check if select.copy.paste.xyz exists first. In case the file does not exist, check if select.copy.xyz exists and so on. A script for this could look like the following:
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
for d in **/pipeline/results; do
    if [ -f "$d/select.copy.paste.xyz" ]; then
        echo "$d/select.copy.paste.xyz"
    elif [ -f "$d/select.copy.xyz" ]; then
        echo "$d/select.copy.xyz"
    elif [ -f "$d/select.xyz" ]; then
        echo "$d/select.xyz"
    else
        :    # there is no file at all
    fi
done
It does the job, but is not very nice. We can do better!
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
for dir in **/pipeline/results; do
    for file in "$dir"/select{.copy{.paste,},}.xyz; do
        [ -f "$file" ] && echo "$file" && break
    done
done
The second script does exactly the same thing as the first one, but is easier to maintain, adapt, and so on. Both scripts work with file and directory names that contain spaces or even newlines.
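The brace pattern also makes the second script easy to extend: if the pipeline ever grew a hypothetical fourth stage, say trim, only the pattern in the inner loop would change, and the candidates would still come out most-processed first:
$ echo select{.copy{.paste{.trim,},},}.xyz
select.copy.paste.trim.xyz select.copy.paste.xyz select.copy.xyz select.xyz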
In case you don't have whitespace in your paths, the following (hacky, but loop-free) script can also be used; the sed inserts a space after the last slash so that sort can treat the directory as the key and keep only the first (most processed) entry for each one.
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
files=(**/pipeline/results/select{.copy{.paste,},}.xyz)
printf '%s\n' "${files[@]}" | sed -r 's#(.*/)#\1 #' | sort -usk1,1 | tr -d ' '

Performance with bash loop when renaming files

Sometimes I need to rename a number of files, for example to add a prefix or to remove something.
At first I wrote a python script. It works well, but I want a shell version, so I wrote something like this:
$1 - the directory to list,
$2 - the pattern to replace,
$3 - the replacement.
echo "usage: dir pattern replacement"
for fname in `ls $1`
do
    newName=$(echo $fname | sed "s/^$2/$3/")
    echo 'mv' "$1/$fname" "$1/$newName" &&
    mv "$1/$fname" "$1/$newName"
done
It works, but very slowly, probably because it has to create a process (here sed and mv), destroy it, and create the same process again just to run it with a different argument. Is that true? If so, how do I avoid that and get a faster version?
I thought about generating all the new names up front (using sed to process them all at once), but it still needs mv in the loop.
Please tell me, how do you do it? Thanks. If you find my question hard to understand, please be patient; my English is not very good, sorry.
--- update ---
I am sorry for my description. My core question is: if we have to run some command inside a loop, will that hurt performance? For example, in for i in {1..100000}; do ls 1>/dev/null; done, creating and destroying a process takes most of the time. So what I want to know is whether there is any way to reduce that cost.
Thanks to kev and S.R.I for giving me a rename solution to rename files.
Every time you call an external binary (ls, sed, mv), bash has to fork itself to exec the command and that takes a big performance hit.
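To get a feel for the cost, you can time a do-nothing external command against a do-nothing builtin; the exact numbers depend on the machine, but the gap is usually dramatic:
time for i in {1..1000}; do /bin/true; done   # one fork+exec per iteration
time for i in {1..1000}; do :; done           # builtin, no processes created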
You can do everything you want to do in pure bash 4.x, and only need to call mv:
pat_rename(){
    if [[ ! -d "$1" ]]; then
        echo "Error: '$1' is not a valid directory"
        return
    fi
    shopt -s globstar
    cd "$1"
    for file in **; do
        echo "mv $file ${file//$2/$3}"
    done
}
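For example, to preview renaming everything under a hypothetical ./photos directory so that IMG_ becomes holiday_ (the function only echoes the mv commands; drop the echo once the output looks right):
pat_rename ./photos IMG_ holiday_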
Simplest first. What's wrong with rename?
mkdir tstbin
for i in `seq 1 20`
do
    touch tstbin/filename$i.txt
done
rename .txt .html tstbin/*.txt
Or are you using an older *nix machine?
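Note that there are two common rename commands: the util-linux one used above (rename from to file...), and the Perl rename script shipped with some distributions, which takes a substitution expression instead. If yours is the Perl one, the equivalent would be something like:
rename 's/\.txt$/.html/' tstbin/*.txt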
To avoid re-executing sed on each file, you could instead set up two name streams, one original and one transformed, then sip from the ends:
exec 3< <(ls)
exec 4< <(ls | sed 's/from/to/')
IFS=`echo`
while read -u3 orig && read -u4 to; do
    mv "${orig}" "${to}";
done;
I think you can store all of the file names in a file or a string, and use awk and sed to do it all at once instead of one by one.

In a small script to monitor a folder for new files, the script seems to be finding the wrong files

I'm using this script to monitor the downloads folder for new .bin files being created. However, it doesn't seem to be working. If I remove the grep, I can make it copy any file created in the Downloads folder, but with the grep it's not working. I suspect the problem is how I'm trying to compare the two values, but I'm really not sure what to do.
#!/bin/sh
downloadDir="$HOME/Downloads/"
mbedDir="/media/mbed"
inotifywait -m --format %f -e create $downloadDir -q | \
while read line; do
    if [ $(ls $downloadDir -a1 | grep '[^.].*bin' | head -1) == $line ]; then
        cp "$downloadDir/$line" "$mbedDir/$line"
    fi
done
The ls $downloadDir -a1 | grep '[^.].*bin' | head -1 is the wrong way to go about this. To see why, suppose you had files named a.txt and b.bin in the download directory, and then c.bin was added. inotifywait would print c.bin, ls would print a.txt\nb.bin\nc.bin (with actual newlines, not \n), grep would thin that to b.bin\nc.bin, head would remove all but the first line leaving b.bin, which would not match c.bin. You need to be checking $line to see if it ends in .bin, not scanning a directory listing. I'll give you three ways to do this:
First option, use grep to check $line, not the listing:
if echo "$line" | grep -q '[.]bin$'; then
Note that I'm using the -q option to suppress grep's output, and instead simply letting the if command check its exit status (success if it found a match, failure if not). Also, the RE is anchored to the end of the line, and the period is in brackets so it'll only match an actual period (normally, . in a regular expression matches any single character). \.bin$ would also work here.
Second option, use the shell's ability to edit variable contents to see if $line ends in .bin:
if [ "${line%.bin}" != "$line" ]; then
the "${line%.bin}" part gives the value of $line with .bin trimmed from the end if it's there. If that's not the same as $line itself, then $line must've ended with .bin.
Third option, use bash's [[ ]] expression to do pattern matching directly:
if [[ "$line" == *.bin ]]; then
This is (IMHO) the simplest and clearest of the bunch, but it only works in bash (i.e. you must start the script with #!/bin/bash).
Other notes: to avoid some possible issues with whitespace and backslashes in filenames, use while IFS= read -r line; do and follow @shellter's recommendation about double-quotes religiously.
Also, I'm not very familiar with inotifywait, but AIUI its -e create option will notify you when the file is created, not when its contents are fully written out. Depending on the timing, you may wind up copying partially-written files.
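Putting those pieces together (using the third test, so this version is bash-only, plus the safer read invocation), a sketch of the corrected loop might look like this; the duplicate handling below can be merged in as well:
#!/bin/bash
downloadDir="$HOME/Downloads/"
mbedDir="/media/mbed"
inotifywait -m --format %f -e create "$downloadDir" -q | \
while IFS= read -r line; do
    if [[ "$line" == *.bin ]]; then
        cp "$downloadDir/$line" "$mbedDir/$line"
    fi
done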
Finally, you don't have any checking for duplicate filenames. What should happen if you download a file named foo.bin, it gets copied, you delete the original, and then download a different file that's also named foo.bin? As the script stands, it'll silently overwrite the first foo.bin. If that isn't what you want, you should add something like:
if [ ! -e "$mbedDir/$line" ]; then
    cp "$downloadDir/$line" "$mbedDir/$line"
elif ! cmp -s "$downloadDir/$line" "$mbedDir/$line"; then
    echo "Eeek, a duplicate filename!" >&2
    # or possibly something more constructive than that...
fi

shell scripting: search/replace & check file exist

I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
    replace xml suffix with txt
    if [that file exists]
        add to whatsleft list
    end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob   # allow extended glob syntax, for matching the filenames
LC_COLLATE=C       # use a sort order comm is happy with
IFS=$'\n'          # so filenames can have spaces but not newlines
                   # (newlines don't work so well with comm anyhow;
                   # shame it doesn't have an option for null-separated
                   # input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
  $(comm -23 --nocheck-order \
    <(printf "%s\n" "${files_todo[@]%.xml}") \
    <(printf "%s\n" "${files_done[@]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]%.xml}"; do printf "%s\n" "${f}.txt"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing the output of ls (which is considered very poor practice).
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] ; then
    out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
    : # ...handle here the fact that you have a noncompliant name...
fi
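For instance, with a hypothetical input name the mapping works out like this:
in_name=data/branch1/special/foo01a.xml   # hypothetical example
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] ; then
    echo "results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat"
fi
# prints: results/special/branch1-foo01a.dat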
The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
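A sketch of that array variant (so that filenames with spaces survive intact) might be:
todo=()
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo+=("$file")
done
[ "${#todo[@]}" -gt 0 ] && jobserve E "${todo[@]}"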
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' into the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
I am not exactly sure what you want, but you could check for the existence of the file first and, if it exists, create a new name. (Or you could do this check in your E Perl script.)
if [ -f "$file" ];then
    newname="...."
fi
...
jobserve E .... > $newname
If that's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=`sort $TMPA $TMPB | uniq -u | sed "s/$/.xml/" | xargs`;
rm $TMPA $TMPB;
