Check file size is larger than 1.6GB - bash

I have the following step in my Bash code which checks to see if a file is bigger than 1.6GB:
#!/bin/sh
SIZE_THRESHOLD=1717986918
if [[ $(find /home/methuselah/file -type f -size +${SIZE_THRESHOLD}c 2>/dev/null) ]]; then
somecmdbecausefileisbigger
else
somecmdbecausefileissmaller
fi
The command somecmdbecausefileisbigger is never triggered even when the file size is greater than 1.6GB. Why is this?

I don't know why your find command doesn't work, but I do know a simpler way to do this:
if [ $(stat -f %z /home/methuselah/file) -gt ${SIZE_THRESHOLD} ]; then
Though unfortunately you have to replace -f %z with -c %s on Linux (the former works on BSD and MacOS).

Just use stat (note that your version of stat may differ; check your man page for details):
if [ "$(stat -c '%s' /home/methuselah/file)" -gt "$SIZE_THRESHOLD" ]; then

Related

find and gzip a directory recursively without a directory/file test

I'm working on improving our bash backup script, and would like to move away from rsync and towards using gzip and a "find since last run timestamp" system. I would like to have a mirror of the original tree, except have each destination file gzipped. However, if I pass a destination path to gzip that does not exist, it complains. I created the test below, but I can't believe that this is the most efficient solution. Am I going about this wrong?
Also, I'm not crazy about using while read either, but I can't get the right variable expansion with the alternatives I've tried, such as a for file in $(find ...) loop.
Centos 6.x. Relevant snip below, simplified for focus:
cd /mnt/${sourceboxname}/${drive}/ && eval find . -newer timestamp | while read objresults;
do
if [[ -d "${objresults}" ]]
then
mkdir -p /backup/${sourceboxname}/${drive}${objresults}
else
cat /mnt/${sourceboxname}/${drive}/"${objresults}" | gzip -fc > /backup/${sourceboxname}/${drive}"${objresults}".gz
fi
done
touch timestamp #if no stderr
With proposed changes from my comments incorporated, I suggest this code:
#!/bin/bash
src="/mnt/$sourceboxname/$drive"
dst="/backup/$sourceboxname/$drive"
timestamp="$src/timestamp"
errors=$({ cd "$src" && find -newer "$timestamp" | while read objresults;
do
mkdir -p $(basename "$dst/$objresults")
[[ -d "$objresults" ]] || gzip -fc < "$objresults" > "$dst/$objresults.gz"
done; } 2>&1)
if [[ -z "$errors" ]]
then
touch "$timestamp"
else
echo "$errors" >&2
exit 1
fi
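If file names may contain newlines, a null-delimited variant of the same loop is safer; this is just a sketch assuming GNU find (for -print0) and bash:
errors=$({ cd "$src" && find . -newer "$timestamp" -print0 | while IFS= read -r -d '' objresults;
do
mkdir -p "$(dirname "$dst/$objresults")"
[[ -d "$objresults" ]] || gzip -fc < "$objresults" > "$dst/$objresults.gz"
done; } 2>&1)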

How to find latest modified files and delete them with SHELL code

I need some help with a shell code. Now I have this code:
find $dirname -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 33 | cut -c 35-
This code finds duplicated files (with same content) in a given directory. What I need to do is to update it: find the most recently modified file (from the list of duplicates), print that file name, and also give the opportunity to delete that file in the terminal.
Doing this in pure bash is a tad awkward; it would be a lot easier to write
this in perl or python.
Also, if you were looking to do this with a bash one-liner, it might be feasible,
but I really don't know how.
Anyhoo, if you really want a pure bash solution, below is an attempt at doing
what you describe.
Please note that:
I am not actually calling rm, just echoing it - don't want to destroy your files
There's a "read -u 1" in there that I'm not entirely happy with.
Here's the code:
#!/bin/bash
buffer=''
function process {
if test -n "$buffer"
then
nbFiles=$(printf "%s" "$buffer" | wc -l)
echo "================================================================================="
echo "The following $nbFiles files are byte identical and sorted from oldest to newest:"
ls -lt -c -r $buffer
# use ls without -l here so lastFile is just the file name, not a whole listing line
lastFile=$(ls -t -c -r $buffer | tail -1)
echo
while true
do
read -u 1 -p "Do you wish to delete the last file $lastFile (y/n/q)? " answer
case $answer in
[Yy]* ) echo rm $lastFile; break;;
[Nn]* ) echo skipping; break;;
[Qq]* ) exit;;
* ) echo "please answer yes, no or quit";;
esac
done
echo
fi
}
find . -type f -exec md5sum '{}' ';' |
sort |
uniq --all-repeated=separate -w 33 |
cut -c 35- |
while read -r line
do
if test -z "$line"
then
process
buffer=''
else
buffer=$(printf "%s\n%s" "$buffer" "$line")
fi
done
process
echo "done"
Here's a "naive" solution implemented in bash (except for two external commands: md5sum, of course, and stat used only for user's comfort, it's not part of the algorithm). The thing implements a 100% Bash quicksort (that I'm kind of proud of):
#!/bin/bash
# Finds similar (based on md5sum) files (recursively) in given
# directory. If several files with same md5sum are found, sort
# them by modified (most recent first) and prompt user for deletion
# of the oldest
die() {
printf >&2 '%s\n' "$#"
exit 1
}
quicksort_files_by_mod_date() {
if ((!$#)); then
qs_ret=()
return
fi
# the return array is qs_ret
local first=$1
shift
local newers=()
local olders=()
qs_ret=()
for i in "$#"; do
if [[ $i -nt $first ]]; then
newers+=( "$i" )
else
olders+=( "$i" )
fi
done
quicksort_files_by_mod_date "${newers[#]}"
newers=( "${qs_ret[#]}" )
quicksort_files_by_mod_date "${olders[#]}"
olders=( "${qs_ret[#]}" )
qs_ret=( "${newers[#]}" "$first" "${olders[#]}" )
}
[[ -n $1 ]] || die "Must give an argument"
[[ -d $1 ]] || die "Argument must be a directory"
dirname=$1
shopt -s nullglob
shopt -s globstar
declare -A files
declare -A hashes
for file in "$dirname"/**; do
[[ -f $file ]] || continue
read md5sum _ < <(md5sum -- "$file")
files[$file]=$md5sum
((hashes[$md5sum]+=1))
done
has_found=0
for hash in "${!hashes[#]}"; do
((hashes[$hash]>1)) || continue
files_with_same_md5sum=()
for file in "${!files[#]}"; do
[[ ${files[$file]} = $hash ]] || continue
files_with_same_md5sum+=( "$file" )
done
has_found=1
echo "Found ${hashes[$hash]} files with md5sum=$hash, sorted by modified (most recent first):"
# sort them by modified date (using quicksort :p)
quicksort_files_by_mod_date "${files_with_same_md5sum[#]}"
for file in "${qs_ret[#]}"; do
printf " %s %s\n" "$(stat --printf '%y' -- "$file")" "$file"
done
read -p "Do you want to remove the oldest? [yn] " answer
if [[ ${answer,,} = y ]]; then
echo rm -fv -- "${qs_ret[#]:1}"
fi
done
if((!has_found)); then
echo "Didn't find any similar files in directory \`$dirname'. Yay."
fi
I guess the script is self-explanatory (you can read it like a story). It uses the best practices I know of, and is 100% safe regarding any silly characters in file names (e.g., spaces, newlines, file names starting with hyphens, file names ending with a newline, etc.).
It uses bash's globs, so it might be a bit slow if you have a bloated directory tree.
There is some error checking, but much is missing, so don't use as-is in production! (It's a trivial but rather tedious task to add it.)
The algorithm is as follows: scan each file in the given directory tree; for each file, compute its md5sum and store it in associative arrays:
files, with the file names as keys and the md5sums as values.
hashes, with the md5sums as keys and the number of files having that md5sum as values.
After this is done, we scan through all the md5sums found, keep only those that correspond to more than one file, gather all files with each such md5sum, quicksort them by modification date, and prompt the user.
A sweet effect when no dups are found: the script nicely informs the user about it.
I would not say it's the most efficient way of doing things (might be better in, e.g., Perl), but it's really a lot of fun, surprisingly easy to read and follow, and you can potentially learn a lot by studying it!
It uses a few bashisms and features that are only in Bash version ≥ 4.
Hope this helps!
Remark. If on your system date has the -r switch, you can replace the stat command by:
date -r "$file"
Remark. I left the echo in front of rm. Remove it if you're happy with how the script behaves. Then you'll have a script that uses 3 external commands :).

Best Way to Get File Modified Time in Seconds

This seems to be a classic case of non-standard features on various platforms.
Quite simply, I want a universally (or at least widely) supported method for getting the modified time of a file as a unix timestamp in seconds.
Now I know of various ways to do this with stat, but most are platform specific; for example stat -c %Y $file works for some, but won't work on OS X (and presumably other BSD systems), which uses stat -f %m $file instead.
Likewise, some platforms support date -r $file +%s; however, OS X/FreeBSD again does not, as there the -r option takes a seconds value to format (an alternative to +%s) rather than a reference file as on other platforms.
The other alternative I'm familiar with is to use find with the -printf option, but again this is not widely supported. The last method I know of is parsing ls which, aside from being an unpleasant thing to have to do, is not something I believe can (or at least should) be relied upon either.
So, is there a more compatible method for getting a file's modified time? Currently I'm just throwing different variations of stat into a script and running them until one exits with a status of zero, but this is far from ideal, even if I cache the successful command to run first in future.
Since it seems like there might not be a "correct" solution I figured I'd post my current one for comparison:
if stat -c %Y . >/dev/null 2>&1; then
get_modified_time() { stat -c %Y "$1" 2>/dev/null; }
elif stat -f %m . >/dev/null 2>&1; then
get_modified_time() { stat -f %m "$1" 2>/dev/null; }
elif date -r . +%s >/dev/null 2>&1; then
get_modified_time() { date -r "$1" +%s 2>/dev/null; }
else
echo 'get_modified_time() is unsupported' >&2
get_modified_time() { printf '%s' 0; }
fi
[edit]
I'm updating this to reflect the more up-to-date version of the code I use. Basically, it tests the two main stat methods and a somewhat common date method in an attempt to get the modified time of the current working directory, and if one of the methods succeeds it creates a function encapsulating it for use later in the script.
This method differs from the previous one I posted since it always does some processing, even if get_modified_time is never called, but it's more efficient overall if you do need to call it most of the time. It also lets you catch an unsupported platform earlier on.
If you prefer the function that only tests functions when it is called, then here's the other form:
get_modified_time() {
modified_time=$(stat -c %Y "$1" 2> /dev/null)
if [ "$?" -ne 0 ]; then
modified_time=$(stat -f %m "$1" 2> /dev/null)
if [ "$?" -ne 0 ]; then
modified_time=$(date -r "$1" +%s 2> /dev/null)
[ "$?" -ne 0 ] && modified_time=0
fi
fi
echo "$modified_time"
}
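Either way, the function is used the same way; for example, a small sketch (the path and the one-hour threshold are just illustrative):
now=$(date +%s)
mtime=$(get_modified_time /var/log/syslog)
# warn if the file has not been modified within the last hour
if [ $((now - mtime)) -gt 3600 ]; then
echo "file has not been updated in over an hour" >&2
fi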
Why so complicated?
After not finding anything on the web, I simply read the manual of ls:
man ls
which gave me
ls --time-style=full-iso -l
which shows the time in the format hh:mm:ss.sssssssss.
With
ls --time-style=+FORMAT -l
FORMAT is used as with date (see man date), so
ls --time-style=+%c
will give you the local date and time, with seconds as an integer (no decimal point).
You can strip off the additional ls information (e.g. file name, owner, ...) by piping through awk.
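For example, with GNU ls (just a sketch; file.txt is a placeholder), you can print the modification time as epoch seconds and strip everything else with awk:
# GNU ls only: the time is field 6 of the long listing when --time-style is a single token
ls -l --time-style=+%s file.txt | awk '{print $6}'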
Just ask the operating system for its name (e.g. with uname), and go from there. Alternatively, write a C program, or use Python or something else that's pretty common and more standardized: How to get file creation & modification date/times in Python?

How can I check the size of a file using Bash?

I've got a script that checks for 0-size, but I thought there must be an easier way to check for file sizes instead. I.e. file.txt is normally 100 kB; how can I make a script check if it is less than 90 kB (including 0), and make it Wget a new copy because the file is corrupt in this case?
What I'm currently using...
if [ -n file.txt ]
then
echo "everything is good"
else
mail -s "file.txt size is zero, please fix. " myemail#gmail.com < /dev/null
# Grab wget as a fallback
wget -c https://www.server.org/file.txt -P /root/tmp --output-document=/root/tmp/file.txt
mv -f /root/tmp/file.txt /var/www/file.txt
fi
[ -n file.txt ] doesn't check its size. It checks that the string file.txt is non-zero length, so it will always succeed.
If you want to say "size is non-zero", you need [ -s file.txt ].
To get a file's size, you can use wc -c to get the size (file length) in bytes:
file=file.txt
minimumsize=90000
actualsize=$(wc -c <"$file")
if [ $actualsize -ge $minimumsize ]; then
echo size is over $minimumsize bytes
else
echo size is under $minimumsize bytes
fi
In this case, it sounds like that's what you want.
But FYI, if you want to know how much disk space the file is using, you could use du -k to get the size (disk space used) in kilobytes:
file=file.txt
minimumsize=90
actualsize=$(du -k "$file" | cut -f 1)
if [ $actualsize -ge $minimumsize ]; then
echo size is over $minimumsize kilobytes
else
echo size is under $minimumsize kilobytes
fi
If you need more control over the output format, you can also look at stat. On Linux, you'd start with something like stat -c '%s' file.txt, and on BSD and Mac OS X, something like stat -f '%z' file.txt.
stat can also check the file size. Some methods are definitely better: using -s to find out whether the file is empty or not is easier than anything else if that's all you want. And if you want to find files of a certain size, then find is certainly the way to go.
I also like du a lot to get file size in kb, but, for bytes, I'd use stat:
size=$(stat -f%z $filename) # BSD stat
size=$(stat -c%s $filename) # GNU stat?
An alternative solution with AWK and double parentheses:
FILENAME=file.txt
SIZE=$(du -sb $FILENAME | awk '{ print $1 }')
if ((SIZE<90000)) ; then
echo "less";
else
echo "not less";
fi
If your find handles this syntax, you can use it:
find -maxdepth 1 -name "file.txt" -size -90k
This will output file.txt to stdout if and only if the size of file.txt is less than 90k. To execute a script script if file.txt has a size less than 90k:
find -maxdepth 1 -name "file.txt" -size -90k -exec script \;
If you are looking for just the size of a file:
cat $file | wc -c
Sample output:
203233
This works in both Linux and macOS:
function filesize
{
local file=$1
size=`stat -c%s "$file" 2>/dev/null` # Linux
if [ $? -eq 0 ]
then
echo $size
return 0
fi
eval "$(stat -s "$file")" # macOS; stat -s prints st_* shell assignments
if [ $? -eq 0 ]
then
echo $st_size
return 0
fi
return 1
}
Use:
python -c 'import os; print (os.path.getsize("... filename ..."))'
It is portable, for all flavours of Python, and it avoids variation in stat dialects.
For getting the file size in both Linux and Mac OS X (and presumably other BSD systems), there are not many options, and most of the ones suggested here will only work on one system.
Given f=/path/to/your/file,
what does work in both Linux and Mac's Bash:
size=$( perl -e 'print -s shift' "$f" )
or
size=$( wc -c "$f" | awk '{print $1}' )
The other answers work fine in Linux, but not in Mac:
du doesn't have a -b option in Mac, and the BLOCKSIZE=1 trick doesn't work ("minimum blocksize is 512", which leads to a wrong result)
cut -d' ' -f1 doesn't work because on Mac, the number may be right-aligned, padded with spaces in front.
So if you need something flexible, it's either perl's -s operator , or wc -c piped to awk '{print $1}' (awk will ignore the leading white space).
And of course, regarding the rest of your original question, use the -lt (or -gt) operator:
if [ $size -lt $your_wanted_size ]; then, etc.
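Putting that together for the original question's use case, a sketch (the URL and paths are the ones from the question, and the size check uses the portable wc -c form above):
f=/var/www/file.txt
minimumsize=90000
size=$( wc -c "$f" | awk '{print $1}' )
if [ "$size" -lt "$minimumsize" ]; then
# file looks truncated or corrupt: fetch a fresh copy
wget -c https://www.server.org/file.txt -P /root/tmp --output-document=/root/tmp/file.txt
mv -f /root/tmp/file.txt /var/www/file.txt
fi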
Based on gniourf_gniourf’s answer,
find "file.txt" -size -90k
will write file.txt to stdout if and only if the size of file.txt is less than 90K, and
find "file.txt" -size -90k -exec command \;
will execute the command command if file.txt has a size less than 90K. 
I have tested this on Linux. 
From find(1),
…  Command-line arguments following (the -H, -L and -P options) are taken to be names of files or directories to be examined, up to the first argument that begins with ‘-’, …
(emphasis added).
ls -l "$file" | awk '{print $5}'
assuming that the ls command reports the file size in column #5, as a standard ls -l listing does (parsing ls output like this is fragile, though)
I would use du's --threshold for this. Not sure if this option is available in all versions of du but it is implemented in GNU's version.
Quoting from du(1)'s manual:
-t, --threshold=SIZE
exclude entries smaller than SIZE if positive, or entries greater
than SIZE if negative
Here's my solution, using du --threshold= for OP's use case:
THRESHOLD=90k
if [[ -z "$(du --threshold=${THRESHOLD} file.txt)" ]]; then
mail -s "file.txt size is below ${THRESHOLD}, please fix. " myemail#gmail.com < /dev/null
mv -f /root/tmp/file.txt /var/www/file.txt
fi
The advantage of this is that du accepts the argument to that option in a known format - either human-readable, as in 10K or 10MiB, or whatever you feel comfortable with - so you don't need to manually convert between formats/units; du handles that.
For reference, here's the explanation on this SIZE argument from the man page:
The SIZE argument is an integer and optional unit (example: 10K is
10*1024). Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers
of 1000). Binary prefixes can be used, too: KiB=K, MiB=M, and so on.
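So both directions are available for this question's file (a quick sketch of how the threshold sign behaves):
du --threshold=90k file.txt    # prints file.txt only if it is at least 90k
du --threshold=-90k file.txt   # prints file.txt only if it is at most 90k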
Okay, if you're on a Mac, do this:
stat -f %z "/Users/Example/config.log"
That's it!

Why does this bash script not work as I want

I have the following bash script
for s in $(ls -1 fig/*.py); do
name=`basename $s .py`
if [ -e "fig/$name.pdf" -o "fig/$name.pdf" -ot "fig/$name.data" -ot "fig/$name.py" ]; then
$s
fi
done
It is supposed to invoke a python script if the output pdf does not exist, or the pdf is older than the py or data file.
Unfortunately, the script is now never invoked. What did I do wrong?
EDIT
Thanks Benoit! My final script is:
for s in fig/*.py ; do # */ fix highlighting
name="$(basename "$s" .py)"
if test ! -e "fig/$name.pdf" -o "fig/$name.pdf" -ot "fig/$name.data" -o "fig/$name.pdf" -ot "fig/$name.py"
then
"$s"
fi
done
Many mistakes. See Bash Pitfalls, especially the first one (never rely on ls to get a list of files).
It also seems that the test is not well-formed: Two -ot in a row seem strange to me. Using -o to perform an OR instead of an AND seems weird also.
for s in fig/*.py ; do # */ for code colouring on SO
name="$(basename "$s" .py)"
if test -e "fig/$name.pdf" -o "fig/$name.pdf" -ot "fig/$name.data" -ot "fig/$name.py"
then
"$s"
fi
done
