How to delete all but two instances of a file? - bash

I have a directory with similarly named files, in this pattern:
00002_930831_fa.ppm 00398_940422_fa.ppm 00714_960530_fa.ppm
00002_930831_fb.ppm 00398_940422_fb.ppm 00714_960530_fb.ppm
00002_931230_fa.ppm 00399_940422_fa.ppm 00714_960620_fa.ppm
00002_931230_fb.ppm 00399_940422_fb.ppm 00714_960620_fb.ppm
00002_940128_fa.ppm 00400_940422_fa.ppm 00715_941201_fa.ppm
00002_940128_fb.ppm 00400_940422_fb.ppm 00715_941201_fb.ppm
00002_940422_fa.ppm 00401_940422_fa.ppm 00715_941205_fa.ppm
00002_940422_fb.ppm 00401_940422_fb.ppm 00715_941205_fb.ppm
00002_940928_fa.ppm 00402_940422_fa.ppm 00716_941201_fa.ppm
00002_940928_fb.ppm 00402_940422_fb.ppm 00716_941201_fb.ppm
What I need to do is remove for example all but two instances of the 00002 sample (doesn't matter which ones), so that I'm left for example with 00002_930831_fa.ppm and 00002_930831_fb.ppm. The problem is I need this done for all samples, 00003, 00004 and so on. I need to be left with two files for each sample.
I've tried with find but I'm not sure how to phrase my condition.
Can this be solved by simply piping commands or do I have to solve it with a bash script?

Just use head or tail to filter your filename list:
ls 00002_* | tail -n +3 | xargs rm

Create an array that contains all matching file names, then use the substring parameter expansion operator to pass all but the first two elements as arguments to rm.
while read -r sample; do
matching_files=( "${sample}"_* )
# To make sure at least two files survive:
(( ${#matching_files[@]} > 2 )) && rm -- "${matching_files[@]:2}"
done < samples.txt
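The loop above reads its sample IDs from samples.txt, but the answer does not show how that file is produced. Here is a minimal sketch of one way to generate it, assuming each sample ID is the part of the file name before the first underscore. The function name list_samples is hypothetical, not part of the original answer.

```bash
# list_samples DIR: print each distinct sample ID (the text before the
# first "_") found among the *.ppm files in DIR, one ID per line.
# Assumption: IDs never contain an underscore themselves.
list_samples() {
    local dir=${1:-.} f name
    for f in "$dir"/*.ppm; do
        [[ -e $f ]] || continue        # the glob matched nothing
        name=${f##*/}                  # strip the directory part
        printf '%s\n' "${name%%_*}"    # keep everything before the first "_"
    done | sort -u
}
```

Running list_samples . > samples.txt in the image directory would then give the loop its input.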

Using an associative array:
#!/bin/bash
[[ BASH_VERSINFO -ge 4 ]] || {
echo "You need Bash 4.0 or newer to run this script." >&2
exit 1
}
declare -A COUNTER=()
for A in *.ppm; do
IFS=_ read I __ <<< "$A"
(( ++COUNTER[$I] > 2 )) && rm "$A"
done
Simulation:
Skip 00002_930831_fa.ppm
Skip 00002_930831_fb.ppm
rm 00002_931230_fa.ppm
rm 00002_931230_fb.ppm
rm 00002_940128_fa.ppm
rm 00002_940128_fb.ppm
rm 00002_940422_fa.ppm
rm 00002_940422_fb.ppm
rm 00002_940928_fa.ppm
rm 00002_940928_fb.ppm
Skip 00398_940422_fa.ppm
Skip 00398_940422_fb.ppm
Skip 00399_940422_fa.ppm
Skip 00399_940422_fb.ppm
Skip 00400_940422_fa.ppm
Skip 00400_940422_fb.ppm
Skip 00401_940422_fa.ppm
Skip 00401_940422_fb.ppm
Skip 00402_940422_fa.ppm
Skip 00402_940422_fb.ppm
Skip 00714_960530_fa.ppm
Skip 00714_960530_fb.ppm
rm 00714_960620_fa.ppm
rm 00714_960620_fb.ppm
Skip 00715_941201_fa.ppm
Skip 00715_941201_fb.ppm
rm 00715_941205_fa.ppm
rm 00715_941205_fb.ppm
Skip 00716_941201_fa.ppm
Skip 00716_941201_fb.ppm
Note: Test it first on some dummy files.
Come to think of it:
IFS=_ read I __ <<< "$A"
Can just be
I=${A%%_*}

with bash version 4:
declare -A files
for f in *ppm; do
files[${f%%_*}]+="$f "
done
for i in "${!files[@]}"; do
set -- ${files[$i]}
shift 2
(($# > 0)) && echo rm $*
done
Remove echo if you're satisfied it's selecting the right files to delete.
Won't work if there are any filenames with whitespace.
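If file names may contain whitespace, a counting loop that never splits a string of names avoids the problem. The following is only a sketch under the same assumptions as the answers above (Bash 4 or newer, sample ID is the text before the first underscore). The names prune_samples and DRY_RUN are hypothetical.

```bash
# prune_samples DIR [KEEP]: for each sample ID, act on every *.ppm file
# beyond the first KEEP (default 2). By default it only prints the rm
# commands; call it with DRY_RUN=0 to really delete.
prune_samples() {
    local dir=${1:-.} keep=${2:-2} f id n
    local -A seen=()                      # sample ID -> files seen so far
    for f in "$dir"/*.ppm; do
        [[ -e $f ]] || continue           # the glob matched nothing
        id=${f##*/}                       # file name without directory
        id=${id%%_*}                      # text before the first "_"
        [[ -n $id ]] || continue          # skip names that start with "_"
        n=$(( ${seen[$id]:-0} + 1 ))
        seen[$id]=$n
        if (( n > keep )); then
            if [[ ${DRY_RUN:-1} == 0 ]]; then
                rm -- "$f"
            else
                printf 'rm %q\n' "$f"     # %q quotes awkward names for display
            fi
        fi
    done
}
```

Because globs expand in sorted order, the two survivors of each sample are the first two names alphabetically, which the question says is acceptable.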

Related

Why doesn't counting files with "for file in $0/*; let i=$i+1; done" work?

I'm new to shell scripting and have the following script, which I created based on a simpler one. I want to pass it an argument with the path of the folder whose files should be counted. I cannot find the logical mistake that keeps it from working; the output is always "1".
#!/bin/bash
i=0
for file in $0/*
do
let i=$i+1
done
echo $i
To execute the code I use
sh scriptname.sh /path/to/folder/to/count/files
$0 is the name with which your script was invoked (roughly, subject to several exceptions that aren't pertinent here). The first argument is $1, and so it's $1 that you want to use in your glob expression.
#!/bin/bash
i=0
for file in "$1"/*; do
i=$(( i + 1 )) ## $(( )) is POSIX-compliant arithmetic syntax; let is deprecated.
done
echo "$i"
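As for why the original script always printed 1: with $0 expanding to the script's own name, the pattern scriptname.sh/* matches nothing, and by default Bash leaves an unmatched glob in place as a literal string, so the loop body runs exactly once. The corrected loop has the same quirk for an empty directory. A small sketch that shows the effect and one way to guard against it (both function names are hypothetical):

```bash
# Count loop iterations over DIR/*, in the same way as the loop above.
count_naive() {
    local i=0 f
    for f in "$1"/*; do
        i=$(( i + 1 ))
    done
    echo "$i"
}

# Same loop, but skip the literal pattern left behind by an unmatched glob.
count_checked() {
    local i=0 f
    for f in "$1"/*; do
        [ -e "$f" ] || [ -L "$f" ] || continue
        i=$(( i + 1 ))
    done
    echo "$i"
}
```

With nullglob off (the default), count_naive prints 1 for an empty directory while count_checked prints 0; for a directory with visible files the two agree.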
That said, you can get this number more directly:
#!/bin/bash
shopt -s nullglob # allow globs to expand to an empty list
files=( "$1"/* ) # put list of files into an array
echo "${#files[@]}" # count the number of items in the array
...or even:
#!/bin/sh
set -- "$1"/* # override $@ with the list of files matching the glob
if [ -e "$1" ] || [ -L "$1" ]; then # if $1 exists, then it had matches
echo "$#" # ...so emit their number.
else
echo 0 # otherwise, our result is 0.
fi
If you want to count the number of files in a directory, you can run something like this:
ls /path/to/folder/to/count/files | wc -l

Recursively list hidden files without ls, find or extendedglob

As an exercise I have set myself the task of recursively listing files using bash builtins. I particularly don't want to use ls or find and I would prefer not to use setopt extendedglob. The following appears to work but I cannot see how to extend it with /.* to list hidden files. Is there a simple workaround?
g() { for k in "$1"/*; do # loop through directory
[[ -f "$k" ]] && { echo "$k"; continue; }; # echo file path
[[ -d "$k" ]] && { [[ -L "$k" ]] && { echo "$k"; continue; }; # echo symlinks but don't follow
g "$k"; }; # start over with new directory
done; }; g "/Users/neville/Desktop" # original directory
Added later: sorry - I should have said: 'bash-3.2 on OS X'
Change
for k in "$1"/*; do
to
for k in "$1"/* "$1"/.[^.]* "$1"/..?*; do
The second glob matches all files whose names start with a dot followed by anything other than a dot, while the third matches all files whose names start with two dots followed by something. Between the two of them, they will match all hidden files other than the entries . and ...
Unfortunately, unless the shell option nullglob is set, those (like the first glob) could remain as-is if there are no files whose names match (extremely likely in the case of the third one) so it is necessary to verify that the name is actually a file.
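To make the existence check concrete, here is a minimal sketch that uses only the two hidden-file globs and filters out patterns that were left unexpanded. The function name list_hidden is hypothetical.

```bash
# list_hidden DIR: print the hidden entries of DIR, excluding "." and "..".
list_hidden() {
    local k
    for k in "$1"/.[^.]* "$1"/..?*; do
        # Without nullglob an unmatched pattern stays as a literal string,
        # so confirm the entry really exists (or is a dangling symlink).
        [[ -e $k || -L $k ]] || continue
        printf '%s\n' "$k"
    done
}
```

The same test can be placed at the top of the loop body in g() so that unexpanded patterns are never echoed or recursed into.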
An alternative would be to use the much simpler glob "$1"/.*, which will always match the . and .. directory entries, and will consequently always be substituted. In that case, it's necessary to remove the two entries from the list:
for k in "$1"/* "$1"/.*; do
if ! [[ $k =~ /\.\.?$ ]]; then
# ...
fi
done
(It is still possible for "$1"/* to remain in the list, though. So that doesn't help as much as it might.)
Set the GLOBIGNORE variable to exclude . and .., which implicitly enables the dotglob option (as if by "shopt -s dotglob"). Then your original code works with no other changes.
user@host [/home/user/dir]
$ touch file
user@host [/home/user/dir]
$ touch .dotfile
user@host [/home/user/dir]
$ echo *
file
user@host [/home/user/dir]
$ GLOBIGNORE=".:.."
user@host [/home/user/dir]
$ echo *
.dotfile file
Note that this is bash-specific. In particular, it does not work in ksh.
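Since GLOBIGNORE changes globbing for the whole shell, it may be worth confining it. One possible sketch, assuming it is acceptable to print names relative to the directory: the function body is written with parentheses, so it runs in a subshell and neither the cd nor the GLOBIGNORE setting leaks out. The name list_all is hypothetical.

```bash
# list_all DIR: print every entry of DIR, hidden or not, one per line.
list_all() (
    cd -- "$1" || exit 1
    GLOBIGNORE='.:..'      # also enables dotglob as a side effect
    shopt -s nullglob      # an empty directory yields no iterations
    for k in *; do
        printf '%s\n' "$k"
    done
)
```

After the call returns, echo * in the calling shell behaves exactly as it did before.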
You can specify multiple arguments to for:
for k in "$1"/* "$1"/.*; do
But if you do search for .* in directories, you should be aware that it also gives you the . and .. entries. You may also be given a nonexistent file if the "$1"/* glob does not match anything, so I would check for that too.
With that in mind, this is how I would correct the loop:
g() {
local k subdir
for k in "$1"/* "$1"/.*; do # loop through directory
[[ -e "$k" ]] || continue # Skip missing files (unmatched globs)
subdir=${k##*/}
[[ "$subdir" = . ]] || [[ "$subdir" = .. ]] && continue # Skip the pseudo-directories "." and ".."
if [[ -f "$k" ]] || [[ -L "$k" ]]; then
printf %s\\n "$k" # Echo the paths of files and symlinks
elif [[ -d "$k" ]]; then
g "$k" # start over with new directory
fi
done
}
g ~neville/Desktop
Here the funky-looking ${k##*/} is just a fast way to take the basename of the file, while local was put in so that the variables don't modify any existing variables in the shell.
One more thing I've changed is echo "$k" to printf %s\\n "$k", because echo is irredeemably flawed in its argument handling and should be avoided for the purpose of echoing an unknown variable. (See Rich's sh tricks for an explanation of how; it boils down to -n and -e throwing a spanner in the works.)
By the way, this will NOT print sockets or fifos - is that intentional?

bash string length in a loop

I am looping through a folder and want to run certain commands depending on the length of the files, but I can't get it right. I can evaluate and print the length of a string in the terminal:
echo $file|wc -c gives me the answer for every file in the terminal.
But I cannot manage to incorporate this into a loop:
for file in `*.zip`; do
if [[ echo $file|wc -c ==9]]; then
some commands
where I want to operate on files that have a length of nine characters
Try this one:
for file in *.zip ; do
wcout=$(wc -c "$file")
if [[ ${wcout%% *} -eq 9 ]] ; then
# some commands
fi
done
The %% operator in a parameter expansion deletes the longest suffix that matches the pattern after it. This is a glob pattern, not a regular expression.
Contrary to what most programmers would expect, the == operator in Bash compares strings, not numbers.
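A short sketch of both points, using hypothetical helper names: the difference between %% and %, and the difference between == and -eq inside [[ ]].

```bash
# first_word STRING: drop the longest suffix matching " *",
# i.e. everything from the first space onward.
first_word() { printf '%s\n' "${1%% *}"; }

# drop_last_word STRING: drop only the shortest suffix matching " *",
# i.e. everything from the last space onward.
drop_last_word() { printf '%s\n' "${1% *}"; }

# String comparison (right side quoted, so no pattern matching takes place).
same_string() { [[ $1 == "$2" ]]; }

# Integer comparison: both sides are evaluated as arithmetic expressions.
same_number() { [[ $1 -eq $2 ]]; }
```

For example, first_word '9 notes.zip' prints 9 and drop_last_word 'a b c' prints a b. same_string 007 7 fails because the strings differ, whereas same_number 007 7 succeeds because 007 is read as the (octal) number 7.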
Alternatively (following the comment) you can:
for file in *.zip ; do
wcout=$(wc -c < "$file")
if [[ ${wcout} -eq 9 ]] ; then
# some commands
fi
done
An additional observation: if Bash cannot expand *.zip because there are no ZIP files in the current directory, it passes the literal string "*.zip" into $file and runs a single iteration of the loop. That leads to an error reported by the wc command. So it is advisable to add:
if [[ -e ${file} ]] ; then ...
as a prevention mechanism.
The comments lead to another form of this solution (plus I added my safety mechanism):
for file in *.zip ; do
if [[ -e "$file" ]] && (( $(wc -c < "$file") == 9 )) ; then
# some commands
fi
done
using filter outside the loop
ls -1 *.zip \
| grep -E '^.{9}$' \
| while read FileName
do
# Your action
done
using filter inside loop
ls -1 *.zip \
| while read FileName
do
if [ ${#FileName} -eq 9 ]
then
# Your action
fi
done
An alternative to ls -1, which is always a bit dangerous, is find . -name '*.zip' -print [but you need to add 2 characters to the length or strip the leading ./ from the name, and maybe limit the search to the current folder depth].
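If the goal is the length of the file name (as in the asker's echo $file|wc -c) rather than the file size, the check can also be done with a plain glob loop and ${#name}, without parsing ls output. A minimal sketch; the function name names_of_length is hypothetical.

```bash
# names_of_length DIR N: print the *.zip names in DIR that are exactly
# N characters long (extension included).
names_of_length() {
    local dir=$1 n=$2 f name
    for f in "$dir"/*.zip; do
        [[ -e $f ]] || continue       # an unmatched glob stays literal
        name=${f##*/}                 # file name without the directory
        (( ${#name} == n )) && printf '%s\n' "$name"
    done
    return 0
}
```

Note that echo $file|wc -c also counts the trailing newline, so a name that reports 9 there is 8 characters long by this measure.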

Removing Duplicate Files in Unix

I want to be able to delete duplicate files and at the same time create a symbolic link in place of each removed duplicate. So far I can display the duplicate files; the problem is the removal and linking, since I want to retain one copy.
find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate
Output:
1463b527b1e7ed9ed8ef6aa953e9ee81 ./tope5final
1463b527b1e7ed9ed8ef6aa953e9ee81 ./Tests/tope5
2a6dfec6f96c20f2c2d47f6b07e4eb2f ./tope3final
2a6dfec6f96c20f2c2d47f6b07e4eb2f ./Tests/tope3
5baa4812f4a0838dbc283475feda542a ./tope1bfinal
5baa4812f4a0838dbc283475feda542a ./Tests/tope1b
69d7799197049b64f8675ed4500df76c ./tope3afinal
69d7799197049b64f8675ed4500df76c ./Tests/tope3a
945fe30c545fc0d7dc2d1cb279cf9c04 ./Tests/butter6
945fe30c545fc0d7dc2d1cb279cf9c04 ./Tests/tope6
98340fa2af27c79da7efb75ae7c01ac6 ./tope2cfinal
98340fa2af27c79da7efb75ae7c01ac6 ./Tests/tope2c
d15df73b8eaf1cd237ce96d58dc18041 ./tope1afinal
d15df73b8eaf1cd237ce96d58dc18041 ./Tests/tope1a
d5ce8f291a81c1e025d63885297d4b56 ./tope4final
d5ce8f291a81c1e025d63885297d4b56 ./Tests/tope4
ebde372904d6d2d3b73d2baf9ac16547 ./tope1cfinal
ebde372904d6d2d3b73d2baf9ac16547 ./Tests/tope1c
In this case for example I want to delete ./tope1cfinal and remain with ./Tests/tope1c. After deleting I also want to create a symbolic link with name /tope1cfinal pointing to /Tests/tope1c.
One possibility: create an associative array whose keys are the md5sums and whose values are the corresponding first file found (the one that won't be deleted). Each time an md5sum is already present in this associative array, the file is deleted and replaced by a link to the file stored under that key (after checking that the file to delete isn't the original file itself). The script takes the directories to search as arguments; with no arguments, the search is performed inside the current directory.
#!/bin/bash
shopt -s globstar nullglob
(($#==0)) && set .
declare -A md5sum=() || exit 1;
while(($#)); do
[[ $1 ]] || { shift; continue; } # skip empty arguments (shift first, or the loop never ends)
for file in "$1"/**/*; do
[[ -f $file ]] || continue
h=$(md5sum < "$file") || continue
read h _ <<< "$h" # This line is optional: it removes the hyphen from the md5sum output
if [[ ${md5sum[$h]} ]]; then
# already seen this md5sum
[[ "$file" -ef "${md5sum[$h]}" ]] && continue # prevent unwanted removal!
rm -- "$file" || continue
ln -rs -- "${md5sum[$h]}" "$file"
else
# first time seeing this file
md5sum[$h]=$file
fi
done
shift
done
(Untested, use at your own risk!)
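Because the script above deletes files, it may help to look at what it would pair up first. The following is only a sketch of a read-only planner, simplified to the regular files directly inside one directory (no recursion). It assumes Bash 4 or newer and GNU md5sum; the name plan_dedupe is hypothetical.

```bash
# plan_dedupe DIR: for every regular file in DIR whose content matches an
# earlier file, print "duplicate -> original". Nothing is deleted or linked.
plan_dedupe() {
    local dir=${1:-.} f h
    local -A first=()                    # digest -> first file seen with it
    for f in "$dir"/*; do
        [[ -f $f && ! -L $f ]] || continue   # regular files only, no symlinks
        h=$(md5sum < "$f") || continue
        h=${h%% *}                       # keep only the hex digest
        if [[ -n ${first[$h]:-} ]]; then
            printf '%s -> %s\n' "$f" "${first[$h]}"
        else
            first[$h]=$f
        fi
    done
}
```

Each output line names the file that would be removed and the file its symbolic link would point to.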

in Bash, how can I remove a numeric range of directories?

I am writing an alias to remove a range of directories that contain integers. I can't figure out how to replace the values e.g. 322 and 394 with 2 variables (arguments) that I add when entering the command.
This is the alias in its current state.
alias rRange='ls -1 | awk -F'"'"'v'"'"' '"'"'{if ( ($2>=322) && ($2<=394) ) print "rm -fRv " $0 }'"'"''
but I would like to be able to enter:
rRange 322 394
to be able to drive that alias instead. Currently those values are hard coded in there.
A step by step, bottom-up, deconstruction on how to solve this and many similar problems:
To generate an integer sequence, use bash brace-expansion (see details in man bash):
$ echo {2..5}
2 3 4 5
You may also generate non-consecutive and multiple ranges:
$ echo {2..4} {8..9}
2 3 4 8 9
If you have variables instead of constants, you can use eval to expand them:
$ a=2 b=5
$ eval echo {$a..$b}
2 3 4 5
To list any file/directory which contains these values, enclose with *s:
$ eval echo *{$a..$b}*
To remove the files instead of listing them, use rm instead of echo:
$ eval rm *{$a..$b}*
To remove directories use rm -r (if the directories are non-empty) or rmdir (if they are empty):
$ eval rm -r *{$a..$b}*
$ eval rmdir *{$a..$b}*
Use a function instead of an alias:
function rRange {
for (( I = $1; I <=$2; ++I )); do
[[ -d $I ]] && rm -fRv "$I"
done
}
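The function above only matches directories whose whole name is the number. The asker's awk command splits names on "v", which suggests names such as v322. Under that assumption (names ending in "v" followed by digits, which is a guess about the asker's layout), a listing variant could look like the sketch below. The name rrange_list is hypothetical, and it only prints matches so they can be reviewed before anything is removed.

```bash
# rrange_list LOW HIGH [DIR]: print the directories in DIR (default ".")
# whose names end in "v<number>" with LOW <= number <= HIGH.
rrange_list() {
    local low=$1 high=$2 dir=${3:-.} d num
    if ! [[ $low =~ ^[0-9]+$ && $high =~ ^[0-9]+$ ]]; then
        echo "usage: rrange_list LOW HIGH [DIR]" >&2
        return 2
    fi
    for d in "$dir"/*v*/; do              # trailing slash: directories only
        [[ -d $d ]] || continue           # unmatched glob stays literal
        d=${d%/}
        num=${d##*v}                      # text after the last "v"
        [[ $num =~ ^[0-9]+$ ]] || continue
        # 10# forces base 10 so that leading zeros are not read as octal
        (( 10#$num >= 10#$low && 10#$num <= 10#$high )) && printf '%s\n' "$d"
    done
    return 0
}
```

Once the list looks right, its output can be fed to the removal command, for example rrange_list 322 394 | while IFS= read -r d; do rm -fRv -- "$d"; done.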
More expansion.
rm ... *{322..394}*
