How to find files bigger than some size, and sort them by last modification date? - bash

I need to write a script that lists every file in a selected catalog (directory) that is bigger than some size. I also need to be able to sort the results by size, by name, and by last modification date.
So far I have handled the first two cases:
Sort by size
RESULTS=`find $CATALOG -size +$SIZE | sort -n -r | sed 's_.*/__'`
Sort by name
RESULTS=`find $CATALOG -size +$SIZE | sed 's_.*/__' | sort -n `
But I have no idea how to sort results by last modification date.
Any help would be appreciated.

One of the best approaches, provided you don't have too many files, is to use ls to do the sorting itself.
Sort by name and print one file per line:
find $CATALOG -size +$SIZE -exec ls -1 {} +
Sort by size and print one file per line:
find $CATALOG -size +$SIZE -exec ls -S1 {} +
Sort by modification time and print one file per line:
find $CATALOG -size +$SIZE -exec ls -t1 {} +
You can also play with the ls switches. Sort by modification time (oldest first) with the long listing format and human-readable sizes:
find $CATALOG -size +$SIZE -exec ls -hlrt {} +
Oh, you might want to only find the files (and ignore the directories):
find $CATALOG -size +$SIZE -type f -exec ls -hlrt {} +
Finally, some remarks: avoid capitalized variable names in bash (it's considered bad practice), and avoid backticks; use $(...) instead. E.g.,
results=$(find "$catalog" -size +$size -type f -exec ls -1rt {} +)
Also, you probably don't want to put all the results into a single string as in the previous line; what you most likely want is an array. In that case, use mapfile like this:
mapfile -t results < <(find "$catalog" -size +$size -type f -exec ls -1rt {} +)
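Once the results are in an array, you can iterate over its elements without worrying about word splitting. A minimal usage sketch:
for f in "${results[@]}"; do
    printf '%s\n' "$f"
done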

Try xargs (which runs a command, treating its standard input as a list of arguments) together with the -t and -r flags to ls.
i.e. something like this:
find $CATALOG -size +$SIZE | xargs ls -ltr
That will give you the files sorted by last modification date.
Sorting by multiple attributes at once is going to be really awkward to do with shell utilities and pipes though — I think you'll need to use a scripting language (ruby, perl, php, whatever), unless your shell fu is strong.
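If any of the paths may contain spaces, a slightly safer variant of the same idea (assuming your find supports -print0) is:
find "$CATALOG" -size +"$SIZE" -print0 | xargs -0 ls -ltr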

Related

How to find duplicate JPGs by content?

I'd like to find and remove an image in a series of folders. The problem is that the image names are not necessarily the same.
What I did was to copy an arbitrary string from the image's raw bytes and use it like
grep -ir 'YA'uu�KU���^H2�Q�W^YSp��.�^H^\^Q��P^T' .
But since there are thousands of images, this method takes forever. Also, some images are ImageMagick-generated copies of the originals, so I can't use the file size to find them all.
So I'm wondering what is the most efficient way to do so?
Updated Answer
If you have the checksum of a specific file in mind that you want to compare with, you can checksum all files in all subdirectories and find the one that is the same:
find . -name \*.jpg -exec bash -c 's=$(md5 < {}); echo $s {}' \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"
Or this may work for you too:
find . -name \*.jpg -exec md5 {} \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"
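On Linux, where md5sum from GNU coreutils is the usual tool, the same idea should work as follows (a sketch; md5sum prints the checksum followed by the filename):
find . -name \*.jpg -exec md5sum {} + | grep "94b48ea6e8ca3df05b9b66c0208d5184"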
Original Answer
The easiest way is to generate an md5 checksum once for each file. Depending on how your md5 program works, you would do something like this:
find . -name \*.jpg -exec bash -c 's=$(md5 < {}); echo $s {}' \;
94b48ea6e8ca3df05b9b66c0208d5184 ./a.jpg
f0361a81cfbe9e4194090b2f46db5dad ./b.jpg
c7e4f278095f40a5705739da65532739 ./c.jpg
Or maybe you can use
md5 -r *.jpg
94b48ea6e8ca3df05b9b66c0208d5184 a.jpg
f0361a81cfbe9e4194090b2f46db5dad b.jpg
c7e4f278095f40a5705739da65532739 c.jpg
Now you can use uniq to find all duplicates.
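For example (a sketch using awk rather than uniq, assuming the checksum-then-filename output above), this prints every file whose checksum has already been seen, so the first copy of each image is treated as the original:
find . -name \*.jpg -exec bash -c 's=$(md5 < {}); echo $s {}' \; | awk 'seen[$1]++'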

Find duplicates in variable

I am trying to find duplicates in a list. Right now I am searching for files with specific extensions and storing them in a variable called 'files'.
For each file in files I am formatting the entries so that I only have the filename.
I then want to check this list for duplicates but I can't get my head around it.
files=$(find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \))
for file in $files; do
formatted=$(echo ${file##*/})
unique=$(echo $formatted | sort | uniq -c)
done
echo $unique
Any help is much appreciated!!
I guess you don't need to reinvent the wheel; simply use fdupes or fslint.
Depending on your system, you can install it by using:
yum -y install fdupes
or
apt-get install fdupes
Usage of fdupes is pretty straightforward:
fdupes /path/to/dir
If you just need the .txt files, you can pipe the result to grep, i.e.:
fdupes /path/to/dir | grep "\.txt"
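If the files may be spread across subdirectories, fdupes can also recurse into them (the -r flag), e.g.:
fdupes -r /path/to/dir | grep "\.txt"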
$files is not an array. It is a string.
You are splitting it on whitespace. This is not safe for filenames with spaces.
You are also globbing. This isn't safe for filenames with globbing metacharacters in the names.
See Bash FAQ 001 for how to safely operate over data line-by-line. Also see Don't read lines with for.
You can also get find to spit out arbitrarily formatted output with the -printf argument. (i.e. -printf %f will print out just the file name (no path information).)
You don't need echo for that variable assignment. (i.e. formatted=${file##*/} works just fine.)
$formatted contains a single filename. You can't really sort or uniq a single item.
Putting all the above together and assuming that you want to detect duplicates by suffix-less name (and not file contents) then...
If you aren't worried about filenames with newlines then you can just use this:
find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \) -printf '%f\n' | sort | uniq -c
If you are worried about them then you need to read the lines manually (something like this for bash 4+):
declare -A files
while IFS= read -r -d '' file; do
    ((files["$file"]+=1))
done < <(find /root/123 -type f \( -iname "*.txt" -o -iname "*.bat" \) -printf '%f\0')
declare -p files
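To then report only the names that occur more than once, a minimal follow-up sketch:
for name in "${!files[@]}"; do
    if (( ${files[$name]} > 1 )); then
        printf '%s appears %d times\n' "$name" "${files[$name]}"
    fi
done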

OSX bash recursively find files sorted by size except in folder

I came across the following command, which nearly does what I need:
find . -type f -print0 | xargs -0 ls -l | sort -k5,5rn > ~/files.txt
Now, I don't have a clue what any of this means (would love an explanation, but not that important).
The one thing I need to add is a way to skip specific folders (e.g. I have a Documents folder with tens of thousands of Word docs, which is making this command take a very long time).
Can anyone suggest an addition to the above command that will have find ignore a given folder(s)?
Exclude paths matching */Documents/* from find:
find . -type f ! -path "*/Documents/*" -print0 | ...
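Combined with the rest of your original pipeline, that would look like this (substitute whichever folder names you want to skip):
find . -type f ! -path "*/Documents/*" -print0 | xargs -0 ls -l | sort -k5,5rn > ~/files.txt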
Since you asked for an explanation...
find . -type f -print0
That's the find utility, which travels through the file system looking for anything that matches your criteria. The . tells it to start from the current directory, and since you specified -type f it will only find "regular files." -print0, as you may have guessed, simply prints the full path of each match to the standard output (useful for piping). It terminates each entry with a null character instead of a newline (as opposed to -print; this will be relevant in a moment).
xargs -0 ls -l
xargs takes a list of things from standard input and then executes a given command ("utility") using what is passed to it as arguments. In this case, the utility is the command ls -l, so xargs takes the results from find and runs ls -l on them, giving you the long listing; this is basically just a way to turn your list of files into a list of files with information such as size. The -0 option tells xargs to interpret null characters as the separator between entries, which exists (almost?) solely to allow it to work with the -print0 option above.
sort -k5,5rn > ~/files.txt
sort is pretty cool. It sorts things. -k tells it which column to sort by, in this case column 5 (and only column 5). The rn bit means sort numerically and reverse the order. The default puts the largest at the bottom, so this puts the largest first. Note that numeric sorting gets confusing if you use ls -lh, because the unit suffixes (B, K, M, G, etc.) don't compare as plain numbers.
Different options or other ways to find large files:
find ~ -size +100M ! -path ~/Documents\* ! -path ~/Library\*
find ~ -size +100M | grep -v "^$HOME/Documents/" | while IFS= read -r l; do stat -f'%z %N' "$l"; done | sort -rn
shopt -s extglob; find ~/!(Documents) -type f -exec stat -f'%z %N' {} \; | sort -rn | head -n200
mdfind 'kMDItemFSSize>=1e8&&kMDItemContentTypeTree!=public.directory' | while IFS= read -r l; do stat -f'%z %N' "$l"; done | sort -rn
You might also just use Finder's search, which can filter by file size.

Ignore spaces in Solaris 'find' output

I am trying to remove all empty files that are older than 2 days. Also I am ignoring hidden files, starting with dot. I am doing it with this code:
find /u01/ -type f -size 0 -print -mtime +2 | grep -v "/\\." | xargs rm
It works fine until there is a space in a file's name. How can I make my code handle such filenames?
OS is Solaris.
Option 1
Install GNU find and GNU xargs in an appropriate location (not /usr/bin) and use:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -print0 | xargs -0 rm
(Note that I removed (what I think is) a stray -print from your find options. The options shown removes empty files modified more than 2 days ago where the name does not start with a ., which is the condition that your original grep seemed to deal with.)
Option 2
The problem is primarily that xargs is defined to split its input at spaces. An alternative is to write your own xargs surrogate that behaves sensibly with spaces in names; I've done that. You then only run into problems if the file names contain newlines — which the file system allows. Using a NUL ('\0') terminator is guaranteed safe; it is the only character that can't appear in a path name (which is why GNU chose to use it with -print0 etc).
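For illustration only (this is not the program mentioned above, just a minimal sketch assuming a POSIX-compatible shell such as ksh or bash), a plain read loop that copes with spaces, though still not with embedded newlines:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' | while IFS= read -r name; do
    rm "$name"
done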
Option 3
A final better option is perhaps:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -exec rm {} \;
This avoids using xargs at all and handles all file names (path names) correctly — at the cost of executing rm once for each file found. That's not too painful if you're only dealing with a few files on each run.
POSIX 2008 introduces the notation + in place of the \; and then behaves rather like xargs, collecting as many arguments as will conveniently fit in the space it allocates for the command line before running the command:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -exec rm {} +
The versions of Solaris I've worked on do not support that notation, but I know I work on antique versions of Solaris. GNU find does support the + marker and therefore renders the -print0 and xargs -0 workaround unnecessary.
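And if you do install GNU find (as in Option 1), note that it also provides a -delete action, which removes the matched files without running rm at all:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -delete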

How can I convert JPG images recursively using find?

Essentially what I want to do is search the working directory recursively, then use the paths given to resize the images. For example, find all *.jpg files, resize them to 300x300 and rename to whatever.jpg.
Should I be doing something along the lines of $(find | grep *.jpg) to get the paths? When I do that, the output is a list of paths not enclosed in quotation marks, meaning that I would have to add the quotes myself before it would be useful, right?
I use mogrify with find.
Let's say I need every *.jpg inside my nested folder/another/folder to become a *.png.
find . -name "*.jpg" -print0|xargs -I{} -0 mogrify -format png {}
And a bit of explanation:
find . -name "*.jpg" -- finds all the JPGs inside the nested folders.
-print0 -- prints each filename without any nasty surprises (e.g. filenames containing spaces), using a NUL terminator instead of a newline.
xargs -I{} -0 -- reads the NUL-separated list and processes the files one by one with mogrify.
and lastly, {} is just a placeholder for each filename coming from find.
You can use something like this with GNU find:
find . -iname \*jpg -exec /your/image/conversion/script.sh {} +
This will be safer in terms of quoting, and spawn fewer processes. As long as your script can handle the length of the argument list, this solution should be the most efficient option.
If you need to handle really long file lists, you may have to pay the price and spawn more processes by having find run the command once per file. For example:
find . -iname \*jpg -exec /your/image/conversion/script.sh {} \;
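For completeness, here is a minimal sketch of what such a conversion script could contain (the script path above is just a placeholder; this version uses ImageMagick's mogrify to resize each argument in place, along the lines of the 300x300 resize mentioned in the question):
#!/bin/bash
# Resize every image passed as an argument to fit within 300x300.
# Note: mogrify modifies files in place, so keep backups if you need the originals.
mogrify -resize 300x300 "$@"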
