du summary isn't equal to sum of the elements - filesize

The directory contains normal files in normal directories, with no symlinks or remote filesystems (it's actually a Maildir++ storage, so not even sparse files are expected). I don't readily see how it's possible that the sum of the individual directory sizes is significantly larger than the total that du provides:
$ du * .[a-zA-Z]* -bsc | tail -n1
2722800257 total
$ du * .[a-zA-Z]* -b | awk '{sum+=$1} END {print sum}'
3341577554
Reality seems to match the larger number.

Your second command du -b ... | awk ... is overstating the total because it counts subdirectory sizes multiple times. Each subdirectory size is counted on its own, then counted again as part of the size of each of its ancestor directories.
It's easier to see what's happening in a small example like this, on a filesystem where an empty directory happens to consume 4KB:
$ mkdir -p foo/bar/baz
$ du -bsc foo
12288 foo
12288 total
$ du -b foo
4096 foo/bar/baz
8192 foo/bar
12288 foo
$ du -b foo | awk '{t += $1} END {print t}'
24576
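If the goal is to count each byte only once, one hedged alternative (assuming GNU find, and accepting that the space consumed by the directory entries themselves is left out) is to sum the apparent sizes of the regular files directly:
$ find . -type f -printf '%s\n' | awk '{sum += $1} END {print sum}'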

Related

Compare size of directory with the free space on disk

I'm creating a script which compares the size of a folder, /home/user/files, with the free space in /backups to ensure the copy can be done.
This returns the size (in blocks?) and the path:
du -B 1 /home/user/files | cut -f 1 -d " "
12288 /home/user/files
But this returns GB:
df -h /var | tail -1 | awk '{print $4}'
3.9G
And I can't verify it with an if.
How can I get the same unit of measure (blocks, KB, ...) for both and compare them to make sure there is enough free space for the copy?
Thanks!
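A minimal sketch of the comparison, assuming GNU coreutils (du -B1 and df --output=avail) and the paths from the question: get both numbers in bytes and then compare them with a plain test.
# Both numbers in bytes, so they can be compared directly.
used=$(du -s -B1 /home/user/files | cut -f1)
avail=$(df -B1 --output=avail /backups | awk 'NR==2 {print $1}')

if [ "$used" -lt "$avail" ]; then
    echo "enough free space, starting the copy"
else
    echo "not enough free space in /backups" >&2
fi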

bash: one-liner for selecting the maximum number of files within a total size limit

In a directory, I want to select the maximum number of files that fit within a given total size and move them to a different directory. While listing the files for selection, they need to be sorted by name.
As an example, to make it clear, let us say the total size limit is 500 MB, with each file being smaller than 500 MB.
Use case 1:
a.bz2 200MB
b.bz2 100MB
c.bz2 300MB
d.bz2 400MB
Move a.bz2 and b.bz2 (total = 300 MB) to directory ../selected (because selecting the 3rd file would make the total size > 500 MB).
Use case 2:
a.bz2 200MB
b.bz2 200MB
c.bz2 100MB
d.bz2 400MB
Move a.bz2, b.bz2 and c.bz2 (total = 500 MB) to directory ../selected.
I know how to add up the size of each file, but breaking out of the loop as in a C program would require me to write a script. Instead, I want a one-liner using pipes ( | ).
not sure if this qualifies as a one-liner, but...
find . -type f -print0 | xargs -r -0 du -k |
awk '{sum+=$1; if(sum>500000){exit}; print}' | cut -f2- | tr '\n' '\0' |
xargs -r -0 mv -t ../selected
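The question also asks for the files to be sorted by name before selection; a hedged variant of the same pipeline (assuming GNU sort and mv, and filenames without embedded newlines) inserts a null-terminated sort between find and du:
find . -maxdepth 1 -type f -name '*.bz2' -print0 | sort -z |
xargs -r -0 du -k |
awk '{sum+=$1; if(sum>500000){exit}; print}' | cut -f2- | tr '\n' '\0' |
xargs -r -0 mv -t ../selected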

What's wrong with these two Bash pipeline methods for calculating the average file size in a directory?

The problem: I am trying to calculate an average file size for the directory I'm in (ignoring sub-directories) using one-liners. I have two methods:
ls -l | gawk '{sum += $5; n++;} END {print sum/n;}'
and
var1=$(du -Ss| awk '{print $1}') ; var2=$(ls -l | wc -l) ; echo $var1/$var2 | bc
They seem to yield similar numbers, albeit different units (first one in kB, second one in MB).
The numbers themselves however are slightly wrong. What's going on? Which one is more right?
du and ls report differently. Consider this part of the du man page:
--apparent-size
print apparent sizes, rather than disk usage; although the
apparent size is usually smaller, it may be larger due to holes
in ('sparse') files, internal fragmentation, indirect blocks,
and the like
That gives an idea about the possible differences between what ls shows (apparent size) and what du shows (by default, the actual disk usage).
$ truncate -s 10737418240 sparse
$ ls -l sparse
-rw-rw-r-- 1 ec2-user ec2-user 10737418240 Feb 20 00:19 sparse
$ du sparse
0 sparse
$ ls -ls sparse
0 -rw-rw-r-- 1 ec2-user ec2-user 10737418240 Feb 20 00:19 sparse
The above shows the difference in reporting for a sparse file.
Also, the counting of files using ls -l will include subdirectories, symlinks, etc. You can instead use find to show only files:
find . -maxdepth 1 -type f
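Building on that, a hedged one-liner for the average itself (assuming GNU find's -printf, and using apparent sizes in bytes) could be:
find . -maxdepth 1 -type f -printf '%s\n' | awk '{sum += $1; n++} END {if (n) print sum/n}'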

Get total size of a list of files in UNIX

I want to run a find command that will find a certain list of files and then iterate through that list of files to run some operations. I also want to find the total size of all the files in that list.
I'd like to make the list of files FIRST, then do the other operations. Is there an easy way I can report just the total size of all the files in the list?
In essence I am trying to find a one-liner for the 'total_size' variable in the code snippet below:
#!/bin/bash
loc_to_look='/foo/bar/location'
file_list=$(find $loc_to_look -type f -name "*.dat" -size +100M)
total_size=???
echo 'total size of all files is: '$total_size
for file in $file_list; do
# do a bunch of operations
done
You should simply be able to pass $file_list to du:
du -ch $file_list | tail -1 | cut -f 1
du options:
-c display a total
-h human readable (e.g. 17M)
du will print an entry for each file, followed by the total (with -c), so we use tail -1 to trim to only the last line and cut -f 1 to trim that line to only the first column.
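Plugged into the snippet from the question (still relying, as the original script does, on filenames without spaces), that would be something like:
total_size=$(du -ch $file_list | tail -1 | cut -f 1)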
The methods explained here have a hidden bug: when the file list is long, it exceeds the shell's command-line size limit. It's better to use du like this:
find <some_directories> <filters> -print0 | du <options> --files0-from=- --total -s|tail -1
find produces a null-terminated file list, and du reads it from stdin and adds everything up.
This is independent of the shell's command-line size limit.
Of course, you can add some switches to du to get the logical (apparent) file size, because by default du tells you how much physical space the files take.
But I think that is a question for unix admins rather than programmers :), so it's off topic for Stack Overflow.
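As an example of such a switch, GNU du's -b (shorthand for --apparent-size --block-size=1) would make the total report logical sizes in bytes:
find <some_directories> <filters> -print0 | du -b --files0-from=- --total -s|tail -1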
This code adds up the allocated sizes reported by ls -s for all files (it excludes all directories... apparently they come to about 8 KB per folder/directory):
cd /; find -type f -exec ls -s \; | awk '{sum+=$1;} END {print sum/1000;}'
Note: Execute as root. Result in megabytes.
The problem with du is that it adds up the size of the directory nodes as well. That is an issue when you want to sum up only the file sizes. (By the way, I find it strange that du has no option for ignoring directories.)
In order to add the size of files under the current directory (recursively), I use the following command:
ls -laUR | grep -e "^\-" | tr -s " " | cut -d " " -f5 | awk '{sum+=$1} END {print sum}'
How it works: it lists all the files recursively ("R"), including hidden files ("a"), showing their file size ("l") and without sorting them ("U"). (Skipping the sort can matter when the directories contain many files.) Then we keep only the lines that start with "-" (these are the regular files, so we ignore directories and other entries). Then we squeeze runs of spaces into one, so that the aligned, tabular output of ls becomes a single-space-separated list of fields on each line. Then we cut the 5th field of each line, which holds the file size. The awk script sums these values into the sum variable and prints the result.
ls -l | tr -s ' ' | cut -d ' ' -f <field number> is something I use a lot.
The 5th field is the size. Put that command in a for loop, add the size to an accumulator, and you'll get the total size of all the files in a directory. It's easier than learning AWK. Plus, in the command substitution part, you can grep to limit what you're looking for (^- for regular files, and so on), as in the variant after the loop below.
total=0
for size in $(ls -l | tr -s ' ' | cut -d ' ' -f 5) ; do
total=$(( ${total} + ${size} ))
done
echo ${total}
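To apply the ^- filter mentioned above, so that only regular files are summed (skipping the total line, directories and symlinks), a slightly extended variant would be:
total=0
for size in $(ls -l | grep '^-' | tr -s ' ' | cut -d ' ' -f 5) ; do
total=$(( ${total} + ${size} ))
done
echo ${total}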
The method provided by @Znik helps with the bug encountered when the file list is too long.
However, on Solaris (which is a Unix), du does not have the -c or --total option, so it seems there is a need for a counter to accumulate file sizes.
In addition, if your file names contain special characters, this will not go through the pipe too well (see Properly escaping output from pipe in xargs).
Based on the initial question, the following works on Solaris (with a small amendment to the way the variable is created):
file_list=($(find $loc_to_look -type f -name "*.dat" -size +100M))
printf '%s\0' "${file_list[@]}" | xargs -0 du -k | awk '{total=total+$1} END {print total}'
The output is in KiB.

Command to list all file types and their average size in a directory

I am working on a specific project where I need to work out the make-up of a large extract of documents so that we have a baseline for performance testing.
Specifically, I need a command that can recursively go through a directory and, for each file type, inform me of the number of files of that type and their average size.
I've looked at solutions like:
Unix find average file size,
How can I recursively print a list of files with filenames shorter than 25 characters using a one-liner? and https://unix.stackexchange.com/questions/63370/compute-average-file-size, but nothing quite gets me to what I'm after.
This du and awk combination should work for you:
du -a mydir/ | awk -F'[.[:space:]]' '/\.[a-zA-Z0-9]+$/ { a[$NF]+=$1; b[$NF]++ }
END{for (i in a) print i, b[i], (a[i]/b[i])}'
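If you prefer the averages as apparent sizes in bytes rather than du's block counts, a hedged alternative (assuming GNU find's -printf) groups by extension with find and awk:
find mydir/ -type f -printf '%s\t%f\n' |
awk -F'\t' '{ n = split($2, parts, "."); if (n > 1) { ext = parts[n]; count[ext]++; bytes[ext] += $1 } }
END { for (e in count) print e, count[e], bytes[e]/count[e] }'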
To give you something to start with: with the script below, you will get a list of files and their sizes, line by line.
#!/usr/bin/env bash
DIR=ABC
cd "$DIR" || exit 1
find . -type f | while IFS= read -r line
do
    # size=$(stat --format="%s" "$line") # For systems that have the stat command
    size=$(perl -e 'print -s $ARGV[0],"\n"' "$line") # @Mark Setchell provided the command, but I have no OS X system to test it on.
    echo "$size $line"
done
Output sample
123 ./a.txt
23 ./fds/afdsf.jpg
Then it is your homework: with the above output, it should be easy to work out each file type and its average size.
You can use "du" maybe:
du -a -c *.txt
Sample output:
104 M1.txt
8 in.txt
8 keys.txt
8 text.txt
8 wordle.txt
136 total
The output is in 512-byte blocks, but you can change it with "-k" or "-m".
