How to sort find results by file size - shell

How can I sort the results of the find command by file size?
I'm trying to sort the result of this find command:
find ./src -type f -print0
I don't want the size of directories; I just need the files' relative paths, sorted by size.

Here is how to do it using the find command:
find . -type f -exec ls -al {} \; | sort -k 5 -n | sed 's/ \+/\t/g' | cut -f 9
Here is how to do it using a recursive ls:
ls -lSR | sort -k 5 -n
Or, if you want to display only file names:
ls -lSR | sort -k 5 -n | sed 's/ \+/\t/g' | cut -f 9

Parsing ls or variations thereof will add unnecessary complexity.
File paths sorted by file size:
find src -type f -printf '%s\t%p\n' | sort -n | cut -f2-
Notes:
Change sort -n to sort -nr to get reverse (largest-first) order.
The question used -print0, but catering for file names containing newlines seems pedantic here.
The question mentioned relative paths; changing %p to %P will print paths relative to src/, as sketched below.
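For instance, a minimal sketch of the %P variant (GNU find assumed; output paths are relative to src):
find src -type f -printf '%s\t%P\n' | sort -n | cut -f2-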

find -type f -exec du -sm {} \; | sort -nk1
Size is in MiB (du -m rounds up to the nearest mebibyte), and paths are relative.

find /folder -type f -exec ls -S {} +
WARNING! I use not -exec ... \;, but -exec ... {} +. This construction doesn't run the command once per file as each one is found; instead it waits until all files have been found and then passes them to one command (here, ls) as one big argument list.
ls then looks at the files and, because of the -S flag, sorts them by size, largest first. The difference between the two forms is easy to see with echo, as sketched below.
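A minimal sketch of the difference, using echo as a stand-in for the real command:
find . -type f -exec echo {} \;   # runs echo once per file
find . -type f -exec echo {} +    # batches files into as few echo invocations as possible
Note that if the file list exceeds the system argument-length limit, find runs the command more than once, so with ls -S the sorting is only guaranteed within each batch.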

Literally none of these answers actually worked.
Here's what I made.
#!/bin/bash
################# ls-by-min #################
##          List By Min File Size          ##
##  Copyright (c) 2020 Theodore R. Smith   ##
##              License: MIT               ##
if [ -z "$1" ]; then
    echo "Error: You must specify a minimum file size (in MB)."
    exit 1
fi
FILE_SIZE=$1
if [ "$2" = "-r" ]; then
    MAXDEPTH=512
else
    MAXDEPTH=1
fi
find . -maxdepth ${MAXDEPTH} -type f -size "+${FILE_SIZE}M" -exec du -sm {} \; | sort -rnk1 | sed 's/^[0-9]\+\t*//g'
https://github.com/hopeseekr/BashScripts/blob/master/ls-by-min
You run it by doing:
ls-by-min 100 [-r]
and it will list, ordered biggest to smallest, only files that are 100 MB or bigger in the current directory. Pass in -r for it to be recursive.

Related

How to count files in subdir and filter output in bash

Hi, hoping someone can help: I have some directories on disk, and I want to count the number of files in them (as well as the dir size, if possible) and then strip info from the output. So far I have this:
find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'echo -e $(find "{}" | wc -l) "{}"' | sort -n
This gets me all the dirs that match my pattern, as well as the number of files in each - great!
This gives me something like
2 ./bob/sourceimages/psd/dzv_body.psd,d
2 ./bob/sourceimages/psd/dzv_body_nrm.psd,d
2 ./bob/sourceimages/psd/dzv_body_prm.psd,d
2 ./bob/sourceimages/psd/dzv_eyeball.psd,d
2 ./bob/sourceimages/psd/t_zbody.psd,d
2 ./bob/sourceimages/psd/t_gear.psd,d
2 ./bob/sourceimages/psd/t_pupil.psd,d
2 ./bob/sourceimages/z_vehicles_diff.tga,d
2 ./bob/sourceimages/zvehiclesa_diff.tga,d
5 ./bob/sourceimages/zvehicleswheel_diff.jpg,d
From that I would like to filter based on a maximum number of files, so > 4 for example, and I would like to capture the filetype as a variable for each remaining result, e.g. ./bob/sourceimages/zvehicleswheel_diff.jpg,d.
I guess I could use awk for this?
Then finally I would like to remove all the results from disk. With find I normally just do something like -exec rm -rf {} \;, but I'm not clear how it would work here.
Thanks a lot
EDITED
While this is clearly not the answer, these commands get me the info I want in the form I want it. I just need a way to put it all together and not search multiple times, as that's total rubbish (a combined sketch follows the code below).
filetype=$(find . -type d -name "*,d" -print0 | awk 'BEGIN { FS = "." }; { print $3 }' | cut -d',' -f1)
filesize=$(find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'du -h {};' | awk '{ print $1 }')
filenumbers=$(find . -type d -name "*,d" -print0 | xargs -0 -I {} sh -c 'echo -e $(find "{}" | wc -l);')
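A rough single-pass sketch of the whole job (assuming the threshold is 4 files and that the extension sits just before the trailing ,d; test with echo before enabling rm):
find . -type d -name "*,d" -print0 | while IFS= read -r -d '' dir; do
    count=$(find "$dir" | wc -l)
    if [ "$count" -gt 4 ]; then
        filetype="${dir##*.}"      # e.g. "jpg,d"
        filetype="${filetype%,d}"  # strip the trailing ,d
        echo "$count $filetype $dir"
        # rm -rf "$dir"            # uncomment once the output looks right
    fi
done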
files_count=`ls -keys | nl`
For instance:
ls | nl
nl numbers the lines of its input.

How to call a function while using find in bash?

So my objective here is to print a small graph, followed by the file size and the path, for the 15 largest files. However, I'm running into issues trying to call the create_graph function on each line. Here's what isn't working:
find $path -type f | sort -nr | head -$n | while read line; do
    size=$(stat -c '%s' $line)
    create_graph $largest $size 50
    echo "$size $line"
done
My problem is that it isn't sorting the files, and the files aren't the n largest files. So it appears my "while read line" is messing it all up.
Any suggestions?
The first command,
find $path -type f
just prints out file names. So it can't sort them by size. If you want to sort them by size, you need to make it print out the size. Try this:
find $path -type f -exec du -b {} \; | sort -nr | cut -f 2 | head -$n | ...
Update:
Actually, only the first part of that seems to do everything you want from it:
find $path -type f -exec du -b {} \; | sort -nr | head -$n
will print out a table with size and filename, sorted by file size, and limited to $n rows.
Of course, I don't know what create_graph does.
Explanation:
find $path -type f -exec du -b {} \;
Find all files (not directories or links) in ${path} or its subdirectories, and execute the command du -b <file> on each.
du -b <file>
will output the size of the file (disk usage). See man du for details.
This will produce something like this:
8880 ./line_too_long/line.o
4470 ./line_too_long/line.f
934 ./random/rand.f
9080 ./random/rand
23602 ./random/monte
7774 ./random/monte.f90
13610 ./format/form
288 ./format/form.f90
411 ./delme.f90
872 ./delme_mod.mod
9029 ./delme
So for each file, it prints the size (-b for 'in bytes').
Then you can do a numerical sort on that.
$ find . -type f -exec du -b {} \; | sort -nr
23602 ./random/monte
13610 ./format/form
9080 ./random/rand
9029 ./delme
8880 ./line_too_long/line.o
7774 ./random/monte.f90
4470 ./line_too_long/line.f
934 ./random/rand.f
872 ./delme_mod.mod
411 ./delme.f90
288 ./format/form.f90
And if you then cut it off after the first, say five entries:
$ find . -type f -exec du -b {} \; | sort -nr | head -5
23602 ./random/monte
13610 ./format/form
9080 ./random/rand
9029 ./delme
8880 ./line_too_long/line.o
Here's one idea to put that back together (note that du's output is tab-separated, so cut's default tab delimiter does the splitting):
find . -type f -exec du -b {} \; | sort -nr | head -$n | while read -r line; do
    size=$(cut -f 1 <<< "$line")   # size in bytes
    file=$(cut -f 2- <<< "$line")  # path, which may contain spaces
    create_graph $largest $size 50
    echo "$line"
done
Note that I have no idea what create_graph is or what $largest contains. I took that straight out of your script.

How can I count the number of words in a directory recursively?

I'm trying to calculate the number of words written in a project. There are a few levels of folders and lots of text files within them.
Can anyone help me find out a quick way to do this?
bash or vim would be good!
Thanks
Use find to scan the dir tree, and wc will do the rest:
$ find path -type f | xargs wc -w | tail -1
The last line gives the total. (A variant that copes with spaces in file names is sketched below.)
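If file names may contain spaces, a safer sketch using NUL-delimited names (GNU find and xargs assumed):
find path -type f -print0 | xargs -0 wc -w | tail -1
Bear in mind that if xargs must split the list across several wc invocations, tail -1 only reports the last batch's total; the awk-based answer below handles that case.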
tldr;
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
Explanation:
The find . -type f -exec wc -w {} + will run wc -w on all the files (recursively) contained by . (the current working directory). find will execute wc as few times as possible, but as many times as is necessary to stay within ARG_MAX, the system command-length limit. When the number of files (and/or the combined length of their names) exceeds ARG_MAX, find invokes wc -w more than once, giving multiple total lines:
$ find . -type f -exec wc -w {} + | awk '/total/{print $0}'
8264577 total
654892 total
1109527 total
149522 total
174922 total
181897 total
1229726 total
2305504 total
1196390 total
5509702 total
9886665 total
Isolate these partial sums by printing only the first whitespace-delimited field of each total line:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}'
8264577
654892
1109527
149522
174922
181897
1229726
2305504
1196390
5509702
9886665
paste the partial sums with a + delimiter to give an infix summation:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+
8264577+654892+1109527+149522+174922+181897+1229726+2305504+1196390+5509702+9886665
Evaluate the infix summation using bc, which supports both infix expressions and arbitrary precision:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
30663324
References:
https://www.cyberciti.biz/faq/argument-list-too-long-error-solution/
https://www.in-ulm.de/~mascheck/various/argmax/
https://linux.die.net/man/1/find
https://linux.die.net/man/1/wc
https://linux.die.net/man/1/awk
https://linux.die.net/man/1/paste
https://linux.die.net/man/1/bc
You could find and print all the content and pipe it to wc:
find path -type f -exec cat {} \; -exec echo \; | wc -w
Note: the -exec echo \; is needed in case a file doesn't end with a newline character, in which case the last word of one file and the first word of the next will not be separated.
Or you could find and wc each file, and use awk to aggregate the counts:
find . -type f -exec wc -w {} \; | awk '{ sum += $1 } END { print sum }'
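If the per-file \; invocations are too slow on a big tree, a sketch that batches with + and skips wc's intermediate total lines (this assumes no file name itself ends in the word total):
find . -type f -exec wc -w {} + | awk '$NF != "total" { sum += $1 } END { print sum }'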
If there's one thing I've learned from all the bash questions on SO, it's that a filename with a space will mess you up. This script will work even if you have whitespace in the file names.
#!/usr/bin/env bash
shopt -s globstar
count=0
for f in **/*.txt
do
    words=$(wc -w "$f" | awk '{print $1}')
    count=$(($count + $words))
done
echo $count
Assuming you don't need to count the words recursively and that you want to include all the files in the current directory, you can use a simple approach such as:
wc -w *
10 000292_0
500 000297_0
510 total
If you want to count the words for only a specific extension in the current directory, you could try:
cat *.txt | wc -w

Print the content of all the files in the newest directory in BASH [duplicate]

Is there any sort option available in the find command to get the directory with the least access date/time?
find . -type d -printf "%A@ %p\n" | sort -n | tail -n 1 | cut -d " " -f 2-
If you prefer the filename without leading path, replace %p by %f.
The Linux command below displays the access and modified time along with the size:
stat -f
find -type d -printf '%T+ %p\n' | sort | head -1
find -type d -printf '%T+ %p\n' | sort
This sounds like more of a job for ls:
ls -ultd * | grep ^d
The problem with using find, at least on my system (cygwin/bash), is that find accesses the dirs, so all access times end up as the current time, defeating your apparent purpose.
A simple shell script will also do:
unset -v oldest
for i in "$dir"/*; do
    [ "$i" -ot "$oldest" -o "$oldest" = "" ] && oldest="$i"
done
Note: to find the oldest directory, use "$dir"/*/ above (thanks Cyrus) and -type d below with the find command.
In bash, if you need a recursive solution, you can rewrite it as a while loop with process substitution, using find:
unset -v oldest
while IFS= read -r i; do
    [ "$i" -ot "$oldest" -o "$oldest" = "" ] && oldest="$i"
done < <(find "$dir" -type f)
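If GNU find is available, a non-loop sketch along the same lines (this one compares modification times, oldest first):
find "$dir" -type f -printf '%T@ %p\n' | sort -n | head -1 | cut -d ' ' -f 2-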

Bash script to limit a directory size by deleting files accessed last

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date, and my bash-writing skills are a bit rusty. Here's what I need it to do:
1) Check the size of the directory /PATH/TO/FILES
2) If the size from 1) is greater than X, get a list of the files by access date
3) Delete files, oldest first, until the size is less than X
The benefit here is for cache and backup directories: I will only delete what I need to in order to stay within the limit, whereas the simplified method might go over the limit if one day's files are particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
echo "usage: $0 [directory] [maxsize in megabytes]" >&2
exit 1
fi
find "$DELETEDIR" -type f -printf "%T#::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
BEGIN { curSize=0; }
{
curSize += $3;
if (curSize > maxbytes) { print $2; }
}
' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy-to-read method I came up with to do this:
DIRSIZE=$(du -s /PATH/TO/FILES | awk '{print $1}')
if [ "$DIRSIZE" -gt "$SOMELIMIT" ]
then
    for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do
        FILESIZE=`stat -c "%s" $f`
        FILESIZE=$(($FILESIZE/1024))
        rm -f "$f"  # delete the least-recently-accessed file
        DIRSIZE=$(($DIRSIZE - $FILESIZE))
        if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
            break
        fi
    done
fi
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
    | sed 's/ /\\ /g' \
    | xargs stat -f "%a::%z::%N" \
    | sort -r \
    | awk -v maxsize="$X_SIZE" '
        BEGIN { curSize=0; FS="::" }
        { curSize += $2 }
        curSize > maxsize { print $3 }
    ' \
    | sed 's/ /\\ /g' \
    | xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use (this is the BSD stat syntax; GNU stat takes -c with different format codes), with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. The awk script then adds up the sizes as it goes through the list and begins outputting names once the running total exceeds $X_SIZE (passed in via awk -v above). The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces, and then to xargs, which runs rm on them.
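For example, to cap the directory at 500 MB you would set (an illustrative value, in bytes, since %z reports bytes):
X_SIZE=$((500 * 1024 * 1024))
before running the pipeline above.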
