I need some help with a shell script. I currently have this code:
find $dirname -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 33 | cut -c 35-
This code finds duplicate files (files with the same content) in a given directory. I need to extend it: within each group of duplicates, find the most recently modified file, print its name, and offer to delete it from the terminal.
Doing this in pure bash is a tad awkward, it would be a lot easier to write
this in perl or python.
Also, if you were looking to do this with a bash one-liner, it might be feasible,
but I really don't know how.
Anyhoo, if you really want a pure bash solution below is an attempt at doing
what you describe.
Please note that:
I am not actually calling rm, just echoing it - don't want to destroy your files
There's a "read -u 1" in there that I'm not entirely happy with.
Here's the code:
#!/bin/bash
buffer=''
function process {
if test -n "$buffer"
then
nbFiles=$(printf "%s" "$buffer" | wc -l)
echo "================================================================================="
echo "The following $nbFiles files are byte identical and sorted from oldest to newest:"
ls -lt -c -r $buffer
lastFile=$(ls -t -c -r $buffer | tail -1)
echo
while true
do
read -u 1 -p "Do you wish to delete the last file $lastFile (y/n/q)? " answer
case $answer in
[Yy]* ) echo rm $lastFile; break;;
[Nn]* ) echo skipping; break;;
[Qq]* ) exit;;
* ) echo "please answer yes, no or quit";;
esac
done
echo
fi
}
find . -type f -exec md5sum '{}' ';' |
sort |
uniq --all-repeated=separate -w 33 |
cut -c 35- |
while read -r line
do
if test -z "$line"
then
process
buffer=''
else
buffer=$(printf "%s\n%s" "$buffer" "$line")
fi
done
process
echo "done"
Here's a "naive" solution implemented in bash (except for two external commands: md5sum, of course, and stat used only for user's comfort, it's not part of the algorithm). The thing implements a 100% Bash quicksort (that I'm kind of proud of):
#!/bin/bash
# Finds similar (based on md5sum) files (recursively) in given
# directory. If several files with same md5sum are found, sort
# them by modified (most recent first) and prompt user for deletion
# of the oldest
die() {
printf >&2 '%s\n' "$@"
exit 1
}
quicksort_files_by_mod_date() {
if ((!$#)); then
qs_ret=()
return
fi
# the return array is qs_ret
local first=$1
shift
local newers=()
local olders=()
qs_ret=()
for i in "$@"; do
if [[ $i -nt $first ]]; then
newers+=( "$i" )
else
olders+=( "$i" )
fi
done
quicksort_files_by_mod_date "${newers[@]}"
newers=( "${qs_ret[@]}" )
quicksort_files_by_mod_date "${olders[@]}"
olders=( "${qs_ret[@]}" )
qs_ret=( "${newers[@]}" "$first" "${olders[@]}" )
}
[[ -n $1 ]] || die "Must give an argument"
[[ -d $1 ]] || die "Argument must be a directory"
dirname=$1
shopt -s nullglob
shopt -s globstar
declare -A files
declare -A hashes
for file in "$dirname"/**; do
[[ -f $file ]] || continue
read md5sum _ < <(md5sum -- "$file")
files[$file]=$md5sum
((hashes[$md5sum]+=1))
done
has_found=0
for hash in "${!hashes[@]}"; do
((hashes[$hash]>1)) || continue
files_with_same_md5sum=()
for file in "${!files[@]}"; do
[[ ${files[$file]} = $hash ]] || continue
files_with_same_md5sum+=( "$file" )
done
has_found=1
echo "Found ${hashes[$hash]} files with md5sum=$hash, sorted by modified (most recent first):"
# sort them by modified date (using quicksort :p)
quicksort_files_by_mod_date "${files_with_same_md5sum[@]}"
for file in "${qs_ret[@]}"; do
printf " %s %s\n" "$(stat --printf '%y' -- "$file")" "$file"
done
read -p "Do you want to remove the oldest? [yn] " answer
if [[ ${answer,,} = y ]]; then
echo rm -fv -- "${qs_ret[@]:1}"
fi
done
if((!has_found)); then
echo "Didn't find any similar files in directory \`$dirname'. Yay."
fi
I guess the script is self-explanatory (you can read it like a story). It uses the best practices I know of, and is 100% safe regarding any silly characters in file names (e.g., spaces, newlines, file names starting with hyphens, file names ending with a newline, etc.).
It uses bash's globs, so it might be a bit slow if you have a bloated directory tree.
There are a few error checks, but many are missing, so don't use it as-is in production! (Adding them is a trivial but rather tedious task.)
The algorithm is as follows: scan each file in the given directory tree; for each file, compute its md5sum and store it in two associative arrays:
files, with the file names as keys and the md5sums as values.
hashes, with the md5sums as keys and, as values, the number of files that have that md5sum.
After this is done, we scan through all the md5sums found, keep only the ones that correspond to more than one file, collect all files with that md5sum, quicksort them by modification date, and prompt the user.
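The sorting step rests entirely on the shell's -nt ("newer than") file test. If all you need is the single most recent file of a group, one pass is enough. Here is a minimal sketch of that idea; the helper name newest_of is made up for illustration and is not part of the script above, and it assumes every argument is an existing file:

```shell
# newest_of FILE... -- print the most recently modified of the given files.
# Hypothetical helper for illustration; assumes at least one existing file.
newest_of() {
    newest=$1
    shift
    for f in "$@"; do
        # -nt is true when "$f" has a later modification time than "$newest"
        # (the same comparison the quicksort uses)
        if [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    printf '%s\n' "$newest"
}
```

For example, newest_of "${files_with_same_md5sum[@]}" would print the newest file of one duplicate group without sorting the whole list.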
A sweet effect when no dups are found: the script nicely informs the user about it.
I would not say it's the most efficient way of doing things (might be better in, e.g., Perl), but it's really a lot of fun, surprisingly easy to read and follow, and you can potentially learn a lot by studying it!
It uses a few bashisms and features that are only available in bash version ≥ 4 (associative arrays, globstar, and the ${answer,,} lowercase expansion).
Hope this helps!
Remark. If on your system date has the -r switch, you can replace the stat command by:
date -r "$file"
Remark. I left the echo in front of rm. Remove it if you're happy with how the script behaves. Then you'll have a script that uses 3 external commands :).
I want to move all JSON files created within a jenkins job to a different folder.
It is possible that the job does not create any json file.
In that case the mv command is raising an error and so that job is failing.
How do I prevent mv command from raising error in case no file is found?
Welcome to SO.
Why do you not want the error?
If you just don't want to see the error, then you could always just throw it away with 2>/dev/null, but PLEASE don't do that. Not every error is the one you expect, and this is a debugging nightmare. You could write it to a log with 2>$logpath and then build in logic to read that to make certain it's ok, and ignore or respond accordingly --
mv *.json /dir/ 2>$someLog
executeMyLogParsingFunction # verify expected err is the ONLY err
If it's because you have set -e or a trap in place, and you know it's ok for the mv to fail (which might not be because there is no file!), then you can use this trick -
mv *.json /dir/ || echo "(Error ok if no files found)"
or
mv *.json /dir/ ||: # : is a no-op synonym for "true" that returns 0
see https://www.gnu.org/software/bash/manual/html_node/Conditional-Constructs.html
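To see why the || trick keeps a set -e script alive, here is a small self-contained sketch. The function name demo_or_true is made up, and false merely stands in for an mv that had nothing to move. Keep in mind that this swallows every failure of the command, not only the "no files" case:

```shell
# Under `set -e` a failing command normally aborts the script, but a command
# on the left-hand side of `||` is exempt, so execution carries on.
demo_or_true() (
    set -e
    false || :    # `false` stands in for an mv that found nothing to move
    echo "still running"
)

demo_or_true      # prints: still running
```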
(If it's failing simply because the mv is returning a nonzero as the last command, you could also add an explicit exit 0, but don't do that either - fix the actual problem rather than patching the symptom. Any of these other solutions should handle that, but I wanted to point out that unless there's a set -e or a trap that catches the error, it shouldn't cause the script to fail unless it's the very last command.)
Better would be to specifically handle the problem you expect without disabling error handling on other problems.
shopt -s nullglob # globs with no match do not eval to the glob as a string
for f in *.json; do mv "$f" /dir/; done # no match means no loop entry
c.f. https://www.gnu.org/software/bash/manual/html_node/The-Shopt-Builtin.html
or if you don't want to use shopt,
for f in *.json; do [[ -e "$f" ]] && mv "$f" /dir/; done
Note that I'm only testing existence, so that will include any match, including directories, symlinks, named pipes... you might want [[ -f "$f" ]] && mv "$f" /dir/ instead.
c.f. https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html
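If it isn't obvious why the existence test is needed when nullglob is off: an unmatched pattern is handed to the loop as a literal string, so the body runs once with a name that doesn't exist. A rough sketch that counts matches portably (count_json is a made-up helper name, not something from the answers here):

```shell
# count_json DIR -- print how many entries in DIR match *.json,
# without relying on nullglob. Hypothetical helper for illustration.
count_json() {
    n=0
    for f in "$1"/*.json; do
        # With no match, the loop runs once with the literal pattern
        # "DIR/*.json"; the existence tests filter that case out.
        [ -e "$f" ] || [ -L "$f" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}
```

With shopt -s nullglob set in bash, the loop body would simply never run for an empty match and the guard becomes redundant, though harmless.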
This is expected behavior -- it's why the shell leaves *.json unexpanded when there are no matches, to allow mv to show a useful error.
If you don't want that, though, you can always check the list of files yourself, before passing it to mv. As an approach that works with all POSIX-compliant shells, not just bash:
#!/bin/sh
# using a function here gives us our own private argument list.
# that's useful because minimal POSIX sh doesn't provide arrays.
move_if_any() {
dest=$1; shift # shift makes the old $2 be $1, the old $3 be $2, etc.
# so, we then check how many arguments were left after the shift;
# if it's only one, we need to also check whether it refers to a filesystem
# object that actually exists.
if [ "$#" -gt 1 ] || [ -e "$1" ] || [ -L "$1" ]; then
mv -- "$@" "$dest"
fi
}
# put destination_directory/ in $1 where it'll be shifted off
# $2 will be either nonexistent (if we were really running in bash with nullglob set)
# ...or the name of a legitimate file or symlink, or the string '*.json'
move_if_any destination_directory/ *.json
...or, as a more bash-specific approach:
#!/bin/bash
files=( *.json )
if (( ${#files[@]} > 1 )) || [[ -e ${files[0]} || -L ${files[0]} ]]; then
mv -- "${files[@]}" destination/
fi
Loop over all json files and move each of them, if it exists, in a oneliner:
for X in *.json; do [[ -e $X ]] && mv "$X" /dir/; done
I'm using this script to generate a list of the available commands with manual pages on the system. Running this with time shows an average of about 49 seconds on my computer.
#!/usr/local/bin/bash
for x in $(for f in $(compgen -c); do which $f; done | sort -u); do
dir=$(dirname $x)
cmd=$(basename $x)
if [[ ! $(man --path "$cmd" 2>&1) =~ 'No manual entry' ]]; then
printf '%b\n' "${dir}:\n${cmd}"
fi
done | awk '!x[$0]++'
Is there a way to optimize this for faster results?
This is a small sample of my current output. The goal is to group commands by directory. This will later be fed into an array.
/bin: # directories generated by $dir
[ # commands generated by $cmd (compgen output)
cat
chmod
cp
csh
date
Going for a complete disregard of built-ins here. That's what which does, anyway. Script not thoroughly tested.
#!/bin/bash
shopt -s nullglob # need this for "empty" checks below
MANPATH=${MANPATH:-/usr/share/man:/usr/local/share/man}
IFS=: # chunk up PATH and MANPATH, both colon-delimited
# just look at the directory!
has_man_p() {
local needle=$1 manp manpp result=()
for manp in $MANPATH; do
# man? should match man0..man9 and a bunch of single-char things
# do we need 'man?*' for longer suffixes here?
for manpp in "$manp"/man?; do
# assumption made for filename formats. section not checked.
result=("$manpp/$needle".*)
if (( ${#result[@]} > 0 )); then
return 0
fi
done
done
return 1
}
unset seen
declare -A seen # for deduplication
for p in $PATH; do
printf '%b:\n' "$p" # print the path first
for exe in "$p"/*; do
cmd=${exe##*/} # the sloppy basename
if [[ ! -x $exe || ${seen[$cmd]} == 1 ]]; then
continue
fi
seen["$cmd"]=1
if has_man_p "$cmd"; then
printf '%b\n' "$cmd"
fi
done
done
Time on Cygwin with a truncated PATH (the full one with Windows has too many misses for the original version):
$ export PATH=/usr/local/bin:/usr/bin
$ time (sh ./opti.sh &>/dev/null)
real 0m3.577s
user 0m0.843s
sys 0m2.671s
$ time (sh ./orig.sh &>/dev/null)
real 2m10.662s
user 0m20.138s
sys 1m5.728s
(Caveat for both versions: most stuff in Cygwin's /usr/bin comes with a .exe extension)
What I want to achieve is the following:
I want the subtitles for my TV show to be downloaded automatically.
The script "getSubtitle.sh" is run as soon as the show is downloaded, but it can happen that no subtitles have been released yet.
So this is what I am doing to work around it:
Each time "getSubtitle.sh" is run, it creates a file containing the location of the script with its arguments, for example:
/Users/theo/logSubtitle/getSubtitle.sh "The Walking Dead - 5x10 - Them.mp4" "The.Walking.Dead.S05E10.480p.HDTV.H264.mp4" "/Volumes/Window HD/Série/The Walking Dead"
If a subtitle has been found, this file contains only this line; if no subtitle has been found, it has 2 lines (the first one being "no subtitle downloaded", and the second one being the path to the script as explained above).
Now, once I have this, I'm planning to run a daily cron job that will do the following:
Remove all files that have only 1 line (subtitle found), and execute the script again for the remaining files. Here is the full script:
cd ~/logSubtitle/waiting/
for f in *
do nbligne=$(wc -l $f | cut -c 8)
if [ "$nbligne" = "1" ]
then
rm $f
else
command=$(sed -n "2 p" $f)
sh $command 3>&1 1>&2 2>&3 | grep down > $f ; echo $command >> $f
fi
done
This is unfortunately not working, I have the feeling that the script is not called.
When I replace $command by the line in the text file, it is working.
I am sure that $command match the line because of the "echo $command >> $f" at the end of my script.
So I really don't get what I am missing here, any ideas ?
Thanks.
I'm not sure what you're trying to achieve with the cut -c 8 part in wc -l $f | cut -c 8. cut -c 8 will select the 8th character of the output of wc -l.
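My guess (and it is only a guess) is that your wc pads the count to eight columns, so a single-digit count happens to land in column 8. That breaks as soon as the count has two digits, and on systems where wc doesn't pad at all. A small sketch of the problem and of a way to get the bare number; count_lines is a made-up helper name:

```shell
# What `cut -c 8` does to a padded two-digit count:
printf '%8d %s\n' 12 file | cut -c 8    # prints 2, not 12

# count_lines FILE -- print the number of newline characters in FILE.
count_lines() {
    # Reading from stdin keeps the file name out of wc's output; the
    # arithmetic expansion drops any padding wc may add.
    echo "$(( $(wc -l < "$1") ))"
}
```

Usage would be something like nbligne=$(count_lines "$f").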
A suggestion: to check whether your file contains 1 or two lines (and since you'll need the content of the second line, if any, anyway), use mapfile. This will slurp the file in an array, one line per field. You can use the option -n 2 to read at most 2 lines. This will be much more efficient, safe and nice than your solution:
mapfile -t -n 2 ary < file
Then:
if ((${#ary[@]}==1)); then
printf 'File contains one line only: %s\n' "${ary[0]}"
elif ((${#ary[@]}==2)); then
printf 'File contains (at least) two lines:\n'
printf ' %s\n' "${ary[@]}"
else
printf >&2 'Error, no lines found in file\n'
fi
Another suggestion: use more quotes!
With this, a better way to write your script:
#!/bin/bash
dir=$HOME/logSubtitle/waiting/
shopt -s nullglob
for f in "$dir"/*; do
mapfile -t -n 2 ary < "$f"
if ((${#ary[@]}==1)); then
rm -- "$f" || printf >&2 "Error, can't remove file %s\n" "$f"
elif ((${#ary[@]}==2)); then
{ sh -c "${ary[1]}" 3>&1 1>&2 2>&3 | grep down; echo "${ary[1]}"; } > "$f"
else
printf >&2 'Error, file %s contains no lines\n' "$f"
fi
done
After the done keyword you can even add the redirection 2>> logfile to a log file if you wish. Make sure the cron job is run with your user: check crontab -l and, if needed, edit it with crontab -e.
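Use eval instead of sh. The difference is in how many times the text gets parsed. With sh $command, the contents of $command are only split on whitespace: the quote characters stored in the variable stay literal, so the script receives arguments like "The and Walking instead of one quoted title. eval parses the string again as shell code, so the quoted arguments are grouped as intended.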
Briefly explained.
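A small demonstration of that difference, as a sketch: count_args is a throwaway helper, and line plays the role of the command line read from your log file.

```shell
# count_args prints how many arguments it received.
count_args() { echo "$#"; }

line='count_args "a b" c'   # a command line stored as text

$line          # prints 3: split on whitespace, the quote characters stay literal
eval "$line"   # prints 2: the text is parsed again, so the quotes group "a b"
```

Note that eval (like sh -c) runs whatever the string contains, so only use it on files you trust.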
I have a directory with lots of files. I want to keep only the 6 newest. I guess I can look at their creation date and run rm on all those that are too old, but is there a better way of doing this? Maybe some Linux command I could use?
Thanks!
:)
rm -v $(ls -t mysvc-*.log | tail -n +7)
ls -t, list sorted by time
tail -n +7, +7 here means "start at line 7", so it prints everything except the first 6 lines (the 6 newest files)
$() makes a list of strings from the enclosed command output
rm to remove the files, of course
Beware files with space in their names, $() splits on any white-space!
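If spaces are a concern, one way around the word splitting is to read the ls -t output line by line instead of through $(). A minimal sketch under some assumptions: prune_old is a made-up name, it only echoes the rm commands, it considers every entry in the directory (subdirectories included), and it still breaks on file names that contain newlines.

```shell
# prune_old KEEP DIR -- print an rm command for every entry in DIR except
# the KEEP most recently modified ones. Hypothetical helper; it only echoes.
prune_old() {
    keep=$1 dir=$2 count=0
    ls -t -- "$dir" | while IFS= read -r name; do
        count=$((count + 1))
        if [ "$count" -gt "$keep" ]; then
            echo rm -- "$dir/$name"
        fi
    done
}
```

Calling prune_old 6 /path/to/logs lists what would be removed; drop the echo once the list looks right.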
Here's my take on it, as a script. It does handle spaces in file names even if it is a bit of a hack.
#!/bin/bash
eval set -- $(ls -t1 | sed -e 's/.*/"&"/')
if [[ $# -gt 6 ]] ; then
shift 6
while [[ $# -gt 0 ]] ; do
echo "remove this file: $1" # rm "$1"
shift
done
fi
The second option to ls up there is a "one" for one file name per line. Doesn't actually seem to matter, though, since that appears to be the default when ls isn't feeding a tty.
How can I test if a command outputs an empty string?
Previously, the question asked how to check whether there are files in a directory. The following code achieves that, but see rsp's answer for a better solution.
Empty output
Commands don’t return values – they output them. You can capture this output by using command substitution; e.g. $(ls -A). You can test for a non-empty string in Bash like this:
if [[ $(ls -A) ]]; then
echo "there are files"
else
echo "no files found"
fi
Note that I've used -A rather than -a, since it omits the symbolic current (.) and parent (..) directory entries.
Note: As pointed out in the comments, command substitution doesn't capture trailing newlines. Therefore, if the command outputs only newlines, the substitution will capture nothing and the test will return false. While very unlikely, this is possible in the above example, since a single newline is a valid filename! More information in this answer.
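The newline caveat is easy to reproduce. In this sketch only_newlines is a stand-in for any command whose entire output is newlines; counting bytes instead of capturing the string sees the output that command substitution throws away:

```shell
# Stand-in for a command that prints nothing but newlines.
only_newlines() { printf '\n\n'; }

captured=$(only_newlines)
if [ -z "$captured" ]; then
    echo "command substitution: looks empty"
fi

if [ "$(only_newlines | head -c1 | wc -c)" -ne 0 ]; then
    echo "byte count: output detected"
fi
```

Both messages are printed, which shows that the two tests disagree for this kind of output.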
Exit code
If you want to check that the command completed successfully, you can inspect $?, which contains the exit code of the last command (zero for success, non-zero for failure). For example:
files=$(ls -A)
if [[ $? != 0 ]]; then
echo "Command failed."
elif [[ $files ]]; then
echo "Files found."
else
echo "No files found."
fi
More info here.
TL;DR
if [[ $(ls -A | head -c1 | wc -c) -ne 0 ]]; then ...; fi
Thanks to netj for a suggestion to improve my original:
if [[ $(ls -A | wc -c) -ne 0 ]]; then ...; fi
This is an old question but I see at least two things that need some improvement or at least some clarification.
First problem
First problem I see is that most of the examples provided here simply don't work. They use the ls -al and ls -Al commands - both of which output non-empty strings in empty directories. Those examples always report that there are files even when there are none.
For that reason you should use just ls -A - Why would anyone want to use the -l switch which means "use a long listing format" when all you want is test if there is any output or not, anyway?
So most of the answers here are simply incorrect.
Second problem
The second problem is that while some answers work fine (those that don't use ls -al or ls -Al but ls -A instead) they all do something like this:
run a command
buffer its entire output in RAM
convert the output into a huge single-line string
compare that string to an empty string
What I would suggest doing instead would be:
run a command
count the characters in its output without storing them
or, even better, count at most 1 character using head -c1 (thanks to netj for posting this idea in the comments below)
compare that number with zero
So for example, instead of:
if [[ $(ls -A) ]]
I would use:
if [[ $(ls -A | wc -c) -ne 0 ]]
# or:
if [[ $(ls -A | head -c1 | wc -c) -ne 0 ]]
Instead of:
if [ -z "$(ls -lA)" ]
I would use:
if [ $(ls -lA | wc -c) -eq 0 ]
# or:
if [ $(ls -lA | head -c1 | wc -c) -eq 0 ]
and so on.
For small outputs it may not be a problem but for larger outputs the difference may be significant:
$ time [ -z "$(seq 1 10000000)" ]
real 0m2.703s
user 0m2.485s
sys 0m0.347s
Compare it with:
$ time [ $(seq 1 10000000 | wc -c) -eq 0 ]
real 0m0.128s
user 0m0.081s
sys 0m0.105s
And even better:
$ time [ $(seq 1 10000000 | head -c1 | wc -c) -eq 0 ]
real 0m0.004s
user 0m0.000s
sys 0m0.007s
Full example
Updated example from the answer by Will Vousden:
if [[ $(ls -A | wc -c) -ne 0 ]]; then
echo "there are files"
else
echo "no files found"
fi
Updated again after suggestions by netj:
if [[ $(ls -A | head -c1 | wc -c) -ne 0 ]]; then
echo "there are files"
else
echo "no files found"
fi
Additional update by jakeonfire:
grep will exit with a failure if there is no match. We can take advantage of this to simplify the syntax slightly:
if ls -A | head -c1 | grep -E '.'; then
echo "there are files"
fi
if ! ls -A | head -c1 | grep -E '.'; then
echo "no files found"
fi
Discarding whitespace
If the command that you're testing could output some whitespace that you want to treat as an empty string, then instead of:
| wc -c
you could use:
| tr -d ' \n\r\t ' | wc -c
or with head -c1:
| tr -d ' \n\r\t ' | head -c1 | wc -c
or something like that.
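Wrapped up as a function, the whitespace-discarding variant could look like the sketch below. The name has_visible_output is made up, and it treats only space, tab, carriage return and newline as whitespace:

```shell
# has_visible_output -- read stdin; succeed if it contains any byte other
# than space, tab, carriage return or newline. Hypothetical helper name.
has_visible_output() {
    [ "$(tr -d ' \n\r\t' | head -c1 | wc -c)" -ne 0 ]
}
```

It is used at the end of a pipeline, for example: if some_command | has_visible_output; then ...; fi (some_command being whatever you want to test).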
Summary
First, use a command that works.
Second, avoid unnecessary storing in RAM and processing of potentially huge data.
The question didn't specify that the output is always small, so the possibility of large output needs to be considered as well.
if [ -z "$(ls -lA)" ]; then
echo "no files found"
else
echo "There are files"
fi
This will run the command and check whether the returned output (string) has a zero length.
You might want to check the 'test' manual pages for other flags.
Use the "" around the argument that is being checked: unquoted, an empty result leaves the test with no operand at all (so a check such as [ -n $(...) ] would wrongly succeed), and output that contains several words causes a "too many arguments" error.
Note: that ls -la always returns . and .. so using that will not work, see ls manual pages. Furthermore, while this might seem convenient and easy, I suppose it will break easily. Writing a small script/application that returns 0 or 1 depending on the result is much more reliable!
For those who want an elegant, bash version-independent solution (in fact should work in other modern shells) and those who love to use one-liners for quick tasks. Here we go!
ls | grep . && echo 'files found' || echo 'files not found'
(Note: as one of the comments mentioned, ls -al, and in fact just -l or -a on their own, will always return something, so in my answer I use plain ls.)
Bash Reference Manual
6.4 Bash Conditional Expressions
-z string
True if the length of string is zero.
-n string
string
True if the length of string is non-zero.
You can use shorthand version:
if [[ $(ls -A) ]]; then
echo "there are files"
else
echo "no files found"
fi
As Jon Lin commented, ls -al will always output (for . and ..). You want ls -Al to avoid these two directories.
You could for example put the output of the command into a shell variable:
v=$(ls -Al)
An older, non-nestable, notation is
v=`ls -Al`
but I prefer the nestable notation $( ... )
Then you can test if that variable is non-empty
if [ -n "$v" ]; then
echo there are files
else
echo no files
fi
And you could combine both as if [ -n "$(ls -Al)" ]; then
Sometimes, ls may be some shell alias. You might prefer to use $(/bin/ls -Al). See ls(1) and hier(7) and environ(7) and your ~/.bashrc (if your shell is GNU bash; my interactive shell is zsh, defined in /etc/passwd - see passwd(5) and chsh(1)).
I'm guessing you want the output of the ls -al command, so in bash, you'd have something like:
LS=`ls -la`
if [ -n "$LS" ]; then
echo "there are files"
else
echo "no files found"
fi
Sometimes the "something" is written not to stdout but to the stderr of the command being tested, so here is a more universal variant that captures both:
if [[ $(partprobe ${1} 2>&1 | wc -c) -ne 0 ]]; then
echo "require fixing GPT partitioning"
else
echo "no GPT fix necessary"
fi
Here's a solution for more extreme cases:
if [ `command | head -c1 | wc -c` -gt 0 ]; then ...; fi
This will work
for all Bourne shells;
if the command output is all zeroes;
efficiently regardless of output size;
however,
the command or its subprocesses will be killed once anything is output.
All the answers given so far deal with commands that terminate and output a non-empty string.
Most are broken in the following senses:
They don't deal properly with commands outputting only newlines;
from Bash 4.4 onward, most will spam standard error if the command outputs null bytes (as they use command substitution);
most will slurp the full output stream, so will wait until the command terminates before answering. Some commands never terminate (try, e.g., yes).
So to fix all these issues, and to answer the following question efficiently,
How can I test if a command outputs an empty string?
you can use:
if read -n1 -d '' < <(command_here); then
echo "Command outputs something"
else
echo "Command doesn't output anything"
fi
You may also add some timeout so as to test whether a command outputs a non-empty string within a given time, using read's -t option. E.g., for a 2.5 seconds timeout:
if read -t2.5 -n1 -d '' < <(command_here); then
echo "Command outputs something"
else
echo "Command doesn't output anything"
fi
Remark. If you think you need to determine whether a command outputs a non-empty string, you very likely have an XY problem.
Here's an alternative approach that writes the stdout and stderr of some command to a temporary file, and then checks whether that file is empty. A benefit of this approach is that it captures both outputs, and does not use sub-shells or pipes. These latter aspects are important because they can interfere with bash exit-trap handling (e.g. here)
tmpfile=$(mktemp)
some-command &> "$tmpfile"
if [[ $? != 0 ]]; then
echo "Command failed"
elif [[ -s "$tmpfile" ]]; then
echo "Command generated output"
else
echo "Command has no output"
fi
rm -f "$tmpfile"
Sometimes you want to save the output, if it's non-empty, to pass it to another command. If so, you could use something like
list=`grep -l "MY_DESIRED_STRING" *.log `
if [ $? -eq 0 ]
then
/bin/rm $list
fi
This way, the rm command won't hang if the list is empty.
As mentioned by tripleee in the question comments, use moreutils ifne (if input not empty).
In this case we want ifne -n which negates the test:
ls -A /tmp/empty | ifne -n command-to-run-if-empty-input
The advantage of this over many of the other answers shows when the output of the initial command is non-empty: ifne starts writing it to STDOUT straight away, rather than buffering the entire output and writing it later. That matters if the initial output is generated slowly, or is so long that it would overflow the maximum length of a shell variable.
There are a few utils in moreutils that arguably should be in coreutils -- they're worth checking out if you spend a lot of time living in a shell.
Of particular interest to the OP may be the dirempty/exists tool, which at the time of writing is still under consideration, and has been for some time (it could probably use a bump).