Sum of file sizes with awk on a list of files - bash

I have a list of files and want to sum their file sizes.
So I created a (global) variable as a counter and am trying to loop over the list, get each file's size with ls, and cut out and add the size field with
export COUNTER=1
for x in $(cat ./myfiles.lst); do ls -all $x | awk '{COUNTER+=$5}'; done
However, my counter never changes:
> echo $COUNTER
> 1
Does someone have an idea what I am missing here?
Cheers and thanks,
Thomas
OK, I found a way by piping the result of the awk pipe into a variable
(which is probably not elegant, but it works ;) )
for x in $(cat ./myfiles.lst); do a=$(ls -all $x |awk '{print $5}'); COUNTER=$(($COUNTER+$a)) ; done
> echo $COUNTER
> 4793061514

awk is started as a separate process for every file, so the COUNTER it increments is awk's own variable, which disappears when that awk exits; the shell's exported COUNTER is never touched.
A better solution is to run awk once over the whole listing (ls does not read file names from standard input, so feed the list through xargs):
xargs ls -l < myfiles.lst | awk '{COUNTER+=$5} END {print COUNTER}'
But you are reinventing the wheel here. You can do something like
xargs du -c < myfiles.lst
(If you have du installed; du does not read names from stdin either, hence the xargs. Note: see the comments below my answer about du. I had tested this with cygwin, and with that it worked like a charm.)
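If GNU coreutils is available, another sketch is to ask stat for each size directly and sum with awk; this assumes GNU stat (-c %s; BSD/macOS would spell it stat -f %z) and one plain path per line in myfiles.lst with no embedded whitespace:
xargs stat -c %s < myfiles.lst | awk '{s += $1} END {print s}'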

Shorter version of the last:
ls -l | awk '{sum += $5} END {print sum}'
Now, say you want to filter by certain types of files, age, etc... Just throw the ls -l into a find, and you can filter using find's extensive filter parameters:
find . -type f -exec ls -l {} \; | awk '{sum += $5} END {print sum}'
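If your find is GNU find, you can skip ls entirely: the -printf '%s\n' action (a GNU extension, not in POSIX find) emits each file's size in bytes, one per line:
find . -type f -printf '%s\n' | awk '{sum += $1} END {print sum}'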

ls -ltS | awk '{print $5}' | awk '{s+=$1} END {print s}'

Related

Searching for .extension files recursively and print the number of lines in the files found?

I ran into a problem I am trying to solve, and I can't think of a way to do it without redoing the whole thing from the beginning. My script gets an extension and searches recursively for every .extension file, then outputs "filename:row #:word #". I would also like to print the total number of rows found in those files. Is there any way to do it using the existing code?
for i in `find . -name "*.$1" | awk -F/ '{print $NF}'`
do
    echo "$i:`wc -l <$i|bc`:`wc -w <$i|bc`" >>temp.txt
done
sort -r -t : -k3 temp.txt
cat temp.txt
I think you're almost there, unless I am missing something in your requirements:
#!/bin/bash
total=0
for f in `find . -name "*.$1"` ; do
    lines=`wc -l < $f`
    words=`wc -w < $f`
    total=`echo "$lines+$total" | bc`
    echo "* $f:$lines:$words"
done
echo "# Total: $total"
Edit:
Per the recommendation of @Mark Setchel in the comments, this is a more refined version of the script above:
#!/bin/bash
total=0
for f in `find . -name "*.$1"` ; do
    read lines words _ < <(wc -wl "$f")
    total=$(($lines+$total))
    echo "* $f:$lines:$words"
done
echo "# Total: $total"
Cheers
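Note that both loops above break on file names containing whitespace, because the unquoted find substitution is word-split. A sketch of a whitespace-safe variant, assuming bash and a find supporting -print0 (GNU and BSD find both do):
#!/bin/bash
total=0
while IFS= read -r -d '' f; do
    read -r lines words _ < <(wc -wl "$f")   # wc always reports lines before words
    total=$((total + lines))
    echo "* $f:$lines:$words"
done < <(find . -name "*.$1" -print0)
echo "# Total: $total"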
This is a one-liner printing the lines found per file, the path of the file and at the end the sum of all lines found in all the files:
find . -name "*.go" -exec wc -l {} \; | awk '{s+=$1} {print $1, $2} END {print s}'
In this example, find locates all files ending in .go and executes wc -l on each to get the number of lines, printing the output to stdout; awk is then used to sum column 1 into the variable s, which is printed only at the end: END {print s}
In case you would also like to get the words and the total sum at the end you could use:
find . -name "*.go" -exec wc {} \; | \
awk '{s+=$1; w+=$2} {print $1, $2, $4} END {print "Total:", s, w}'
Hopefully this gives you an idea of how to format, sum, etc. your data based on the input.
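As a side note, wc prints its own grand total when handed several files in one invocation, so letting find batch the arguments with + can replace the awk sum, assuming the list is small enough that find runs wc only once (otherwise several total lines appear):
find . -name "*.go" -exec wc -l {} + | tail -1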

In loop cat file - echo name of file - count

I am trying to make a one-line command for the following operation:
the folder "data" contains 570 files; each file holds some lines of text; the files are named 1.txt through 570.txt.
I want to cat each file, grep for a word, and count how many times that word occurs.
For the moment I am trying to get this using for:
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES |grep word ; done |wc -l
but while this counts correctly, it does not display which file each count came from.
I would like it to look :
----> 1.txt <----
210
---> 2.txt <----
15
etc, etc, etc..
How do I get that?
grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but only prints the match, in this case "word". Each line is prefixed with the filename it was found in.
uniq -c gives only one line per file so to say and prefixes it with the count.
You can further format it to your needs with awk or whatever, though, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'
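If all you need is the per-file count of matching lines, grep alone can do it: with multiple file arguments, -c prints name:count for every file (counting matching lines, not individual occurrences):
grep -c word *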
You can try this:
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop will iterate over the files inside the folder "data" ($file already expands to the full path, so don't prefix the path again).
For each of these files, it prints the name and searches for the number of occurrences of "word_to_count" (grep -c directly outputs a count of matching lines).
Be careful: if your search word appears more than once on a line, this solution counts that line only once.
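To count every occurrence rather than matching lines, one variation is grep -o piped into wc -l inside the same loop; the ${file##*/} expansion just strips the directory so the bare file name is printed, as in the desired output:
for file in /home/my/data/* ; do
    echo "----> ${file##*/} <----"
    grep -o "word_to_count" "$file" | wc -l
done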
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
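If the file really is one number per line, another option is to join the lines with + and hand the expression to bc, which uses arbitrary precision and so sidesteps the 2^31 issue entirely:
paste -sd+ mydatafile | bc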
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l

How to use compound bash functions from different directories

I have a directory, e.g. /ice/cream, which contains some files that I want to sort by size and then find a minimum value in the largest file; however, I want to do this from the parent directory /ice.
The bash line I wrote only works from within /ice/cream; I'd like to make it work from /ice. I tried
awk 'BEGIN {min = 0} {if($7<min) min=$7} END {print min}' $(ls -lS cream/ | head -n 2 | awk '{print $9}')
which does not work because awk doesn't know the path to the file found by the $(...) command substitution; please help! Cheers
A safer way to get the largest file; the call to stat may differ depending on your implementation:
max_file () {
    local max_file max_size size
    max_size=0
    for f in "$1"/*; do
        size=$(stat -c %s "$f")
        if (( size > max_size )); then
            max_file="$f"
            max_size="$size"
        fi
    done
    echo "$max_file"
}
awk '...' "$(max_file cream/)"
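As noted, stat is not portable between implementations: the function above assumes GNU stat's -c %s; a BSD/macOS system would spell the same query as, for example:
size=$(stat -f %z "$f")   # BSD/macOS: %z is size in bytes (GNU equivalent: stat -c %s)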
Your ls pipeline is way too complicated, and you need a * after the dir/ to get the relative name output:
awk 'BEGIN {min = 0} {if($7<min) min=$7} END {print min}' $(ls -S cream/* | head -1)
As first answered by @Etan Reisner in a comment, the line was missing a *; the working code is:
awk 'BEGIN {min = 0} {if($7<min) min=$7} END {print min}' $(ls -lS cream/* | head -n 1 | awk '{print $9}')
Thank you.

How to list all files and put number in front of them , using shell

I want to list all files in my directory and put a number in front of each one, each on a new line. For example, given:
file.txt nextfile.txt example.txt
the output should be:
1.file.txt
2.nextfile.txt
3.example.txt
and so on.
I am trying something with: ls -L |
You can do this if you have nl installed:
ls -1 | nl
(Note: when ls writes to a pipe rather than a terminal it prints one entry per line anyway, so the -1 part is not strictly needed. This applies to the solutions below too.)
Or with awk:
ls -1 | awk '{print NR, $0}'
Or with a single awk command, listing the arguments in a BEGIN block (ARGV[0] is "awk" itself, so indexing starts at 1):
awk 'BEGIN {for (i = 1; i < ARGC; i++) print i, ARGV[i]}' *
Or with cat:
cat -n <(ls -1)
You can do this by using the shell built-in printf in a for loop:
n=0
for i in *; do
    printf "%d.%s\n" $((++n)) "$i"   # pre-increment so numbering starts at 1
done
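Given the three files from the question (a glob expands in alphabetical order, so the order differs from the listing above), the loop prints:
1.example.txt
2.file.txt
3.nextfile.txt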

Return two variables in awk

At the moment, here is what I'm doing:
ret=$(ls -la | awk '{print $3 " " $9}')
usr=$(echo $ret | awk '{print $1}')
fil=$(echo $ret | awk '{print $2}')
The problem is that I'm not actually running ls; I'm running a command that takes time, so you can understand the logic.
Is there a way I can have the command set two external shell variables, something such as
ls -la | awk -r usr=x -r fil=y '{x=$3; y=$9}'
so that the command runs once and I can minimize it to one line?
It's not pretty, but if you really need to do this in one line you can use awk/bash's advanced meta-programming capabilities :)
eval $(ls -la | awk '{usr = $3 " " usr;fil = $9 " " fil} END{print "usr=\""usr"\";fil=\""fil"\""}')
To print:
echo -e $usr
echo -e $fil
Personally, I'd stick with what you have - it's much more readable and performance overhead is tiny compared to the above:
$time <three line approach>
real 0m0.017s
user 0m0.006s
sys 0m0.011s
$time <one line approach>
real 0m0.009s
user 0m0.004s
sys 0m0.007s
A workaround using read
usr=""
fil=""
while read u f; do usr="$usr\n$u"; fil="$fil\n$f"; done < <(ls -la | awk '{print $3 " " $9}')
For performance you could use <<< instead, but avoid it if the returned text is large (quote the substitution so the line breaks are preserved):
while read u f; do usr="$usr\n$u"; fil="$fil\n$f"; done <<< "$(ls -la | awk '{print $3 " " $9}')"
A more portable way inspired from #WilliamPursell's answer:
$ usr=""
$ fil=""
$ while read u f; do usr="$usr\n$u"; fil="$fil\n$f"; done << EOF
> $(ls -la | awk '{print $3 " " $9}')
> EOF
What you want to do is capture the output of ls, or any other command, and then process it later:
ls=$(ls -l)
first=$(echo "$ls" | awk '{print $1}')
second=$(echo "$ls" | awk '{print $2}')
(Quoting "$ls" preserves the captured line breaks.)
Using a bash v4 associative array, filled via a while read loop (an associative array cannot be populated directly from a pipeline; the -n test skips ls -la's header line, which has no file name):
unset FILES
declare -A FILES
while read -r fil usr; do [ -n "$fil" ] && FILES["$fil"]=$usr; done < <(ls -la | awk '{print $9 " " $3}')
Print the list of owner & file:
for fil in "${!FILES[@]}"
do
    usr=${FILES["$fil"]}
    echo -e "$usr" "\t" "$fil"
done
My apologies, I cannot test this on my computer because my bash v3.2 does not support associative arrays :-(.
Please report any issues...
The accepted answer uses process substitution, which is a bashism that only works on certain platforms. A more portable solution is to use a heredoc:
read u f << EOF
$( ls ... )
EOF
It is tempting to try:
ls ... | read u f
but the read then runs in a subshell. A common technique is:
ls ... | {
    read u f
    # use $u and $f here
}
but to make the variables available in the remainder of the script, the interpolated heredoc is the most portable approach. Note that it requires the shell to read all of the output of the program into memory, so is not suitable if the output is expected to be large or if the process is long running.
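If you can assume bash 4.2 or newer, there is also shopt -s lastpipe, which runs the last element of a pipeline in the current shell so that variables assigned there survive; it only takes effect when job control is off, as in a non-interactive script. A minimal sketch:
#!/bin/bash
shopt -s lastpipe
echo "alice file.txt" | read u f   # read runs in the current shell, not a subshell
echo "$u $f"                       # prints: alice file.txt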
You could use a bash array or the positional parameters as temporary holding place:
ret_ary=( $(command | awk '{print $3, $9}') )
usr=${ret_ary[0]}
fil=${ret_ary[1]}
set -- $(command | awk '{print $3, $9}')
usr=$1
fil=$2
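If the slow command prints a single line, bash's read with process substitution assigns both variables in one shot, with no temporary (command here stands for your long-running command):
read -r usr fil < <(command | awk '{print $3, $9}')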
