How to get largest file in directory in bash?

I have the following question. In variable DIR there is an unknown number of files and folders. I would like to get the name and size in bytes of the largest one, in the following order: name size. For example: file.txt 124.
I tried:
cd $DIR
du -a * | sort | head -1
But it does not show the size in bytes, and the output is in "size name" order. How can I improve it, please?
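For reference, the du attempt can be patched up directly; a minimal sketch, assuming GNU du (whose -b reports apparent size in bytes) and filenames free of whitespace:
cd "$DIR"
# -b prints sizes in bytes, sort -n sorts numerically (largest last),
# tail -1 keeps the largest entry, awk swaps the columns into "name size" order.
du -ab * | sort -n | tail -1 | awk '{print $2, $1}'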

This should do the trick:
ls -larS | awk -F' {1,}' 'END{print $NF," ",$5}'
ls produces a long listing (-l) reverse-sorted by size (-rS), then awk, using one or more spaces as the field separator, prints the last field ($NF, the name) and the 5th field (the size in bytes) of the last line. Because of the reverse sort, the last line is the largest entry.
Edit:
It was mentioned that a space in the file name might cause an issue. My first suggestion is: don't use spaces in filenames, it is just plain wrong. But if you have to:
ls -larS | awk -F' {1,}' 'END{for (i=9; i<=NF; i++) printf $i" "; print " ",$5}'
This will handle a space, or two, or three, or however many.

What about the following pipeline? I'm using GNU findutils and GNU coreutils. If you work on a Mac you might have to install them.
find -maxdepth 1 -type f -printf '%s %f\0' \
| sort -z -k1,1nr \
| head -zn1 \
| cut -zd' ' -f2-
Explanation:
find -maxdepth 1 -type f -printf '%s %f\0'
Find files in the current folder and print them along with their filesize in bytes, zero terminated. Zero terminated because filenames may contain newlines in UNIX.
sort -z -k1,1nr
Sort the listing by the filesize in bytes, column 1, in reverse order (largest first). -z reads input zero terminated.
head -zn1
Print the first item, which after the previous sort is the largest. -z reads input zero terminated.
cut -zd' ' -f2-
Cut off the filesize, print only the filename. -z reads input zero terminated.

A variation which should produce the exact output requested:
find -maxdepth 1 -type f -printf "%f %s\0" \
| sort -znr -k2 \
| head -zn1 \
| tr "\0" "\n"

Related

Get Average of Found Numbers in Each File to Two Decimal Places

I have a script that searches through all files in the directory and pulls the number next to the word <Overall>. I now want to get the average of the numbers from each file and output the filename next to the average, to two decimal places. I've gotten most of it to work except displaying the average. I should say I think it works; I'm not sure whether it's pulling all of the instances in the file, and I'm definitely not sure whether it's finding the average, since it's hard to tell without the precision. I'm also sorting by the average at the end. I'm trying to use awk and bc to get the average; there's probably a better method.
What I have now:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc`
done) | sort -g -k 2
The output I get is:
John 4
Lucy 4
Matt 5
Sara 5
But it shouldn't be an integer and it should be to two decimal places.
Additionally, the files I'm searching through look like this:
<Student>John
<Math>2
<English>3
<Overall>5
<Student>Richard
<Math>2
<English>2
<Overall>4
In general, your script does not extract all numbers from each file, but only the first digit of the first number. Consider the following file:
<Overall>123 ...
<Overall>4 <Overall>56 ...
<Overall>7.89 ...
<Overall> 0 ...
The command grep '<Overall>' | head -c 10 | tail -c 1 will only extract 1.
To extract all numbers preceded by <Overall> you can use grep -Eo '<Overall> *[0-9.]*' | grep -o '[0-9.]*' or (depending on your version) grep -Po '<Overall>\s*\K[0-9.]*'.
To compute the average of these numbers you can use your awk command or specialized tools like ... | average (from the package num-utils) or ... | datamash mean 1.
To print numbers with two decimal places (that is 1.00 instead of 1 and 2.35 instead of 2.34567) you can use printf.
#! /bin/bash
path=/home/Downloads/scores/
for i in "$path"/*; do
avg=$(grep -Eo '<Overall> *[0-9.]*' "$i" | grep -o '[0-9.]*' |
awk '{total += $1} END {print total/NR}')
printf '%s %.2f\n' "$(basename "$i" .dat)" "$avg"
done |
sort -g -k 2
Sorting works only if file names are free of whitespace (like space, tab, newline).
Note that you can swap out the two lines after avg=$( with any method mentioned above.
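For example, the datamash swap might look like this (a sketch assuming GNU datamash is installed and a grep that supports -P):
avg=$(grep -Po '<Overall>\s*\K[0-9.]+' "$i" | datamash mean 1)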
You can use a sed command and retrieve the values to calculate their average with bc:
# Read stdin, store the values in an array, and perform a bc call
function avg() { mapfile -t l ; local IFS=+ ; bc <<< "scale=2; (${l[*]})/${#l[@]}" ; }
# Browse the .dat files, then display for each file the average
find . -iname "*.dat" |
while read -r f
do
b=${f##*/} # Strip the dirname to get the basename (keep $f as the full path)
# Echo the file basename and a tab (no newline)
echo -en "${b%.dat}\t"
# Retrieve only the "Overall" values (-n with p prints just the substituted lines) and pass them to our avg function
sed -E -n -e 's/<Overall>([0-9]+)/\1/p' "$f" | avg
done
Output example:
score-2 1.33
score-3 1.33
score-4 1.66
score-5 .66
The pipeline head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc needs improvement.
head -c 10 | tail -c 1 leaves only the 10th character of the first Overall line from each file; better drop that.
Instead, use awk to "remove" the prefix <Overall> and extract the number; we can do this by using <Overall> for the input field separator.
Also use awk to format the result to two decimal places.
Since awk did the job, there's no more need for bc; drop it.
The above pipeline becomes awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'.
Don't forget to keep the closing backtick (`) after it.
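Applied to the script from the question, the whole loop then reads something like this (a sketch that keeps the OP's backticks and structure):
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'`
done) | sort -g -k 2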

Ignoring lines from grep matching any element in a bash array

I have an array (superStringIgnoreArray) containing superstrings like "formula", "forest", "foreign", "fortify", and I am running the following grep lines:
eval 'find "$SEARCH_DIR" -type f -print0 | xargs -0 grep -HniI "$hitWord" >> "$OUTPUT_FILE"'
eval 'find "$SEARCH_DIR" -type f -print0 | xargs -0 grep -HniI --color=always "$hitWord" | more'
Where hitWord in this instance would be "for".
How can I return all hits that do not match any entry in my superStringIgnoreArray? (So lines containing "for", "form", "fort", "fork", "forming" would be returned, but "fortify", "forest", etc. would not.)
Example output:
srcToSearch/open_source_licenses.txt:12:source software packages. One or more such open_source_licenses.txt files may there**for**e
srcToSearch/open_source_licenses.txt:19:-- **For** vCenter Server 5.5u2 GA, the license in**for**mation listed in Parts 2,
srcToSearch/open_source_licenses.txt:22:-- **For** vCenter Server on Linux Virtual Appliance 5.5u2 GA, the license
srcToSearch/open_source_licenses.txt:23:in**for**mation listed in Parts 1, 2, 3, 4, 5 and 6 are applicable.
srcToSearch/open_source_licenses.txt:29:document. This list is provided **for** your convenience; please read further if
grep + bash solution:
superStringIgnoreArray=("formula" "forest" "foreign" "fortify")
grep -HniIr "$hitWord" "$SEARCH_DIR"/* \
| grep -v -f <(printf '%s\n' "${superStringIgnoreArray[@]}") | tee "$OUTPUT_FILE"
Since you're outputting the filenames, chaining another grep won't be trivial, but you can achieve the same with awk:
$ grep -HniIFr "$hitWord" "$SEARCH_DIR" |
awk 'BEGIN {OFS=FS=":"}
NR==FNR {a[tolower($0)]; next}
{f=$1;n=$2;$1=$2="";
for(k in a) if(tolower($0)~k) next}
{$1=f;$2=n;print}' blacklist -
Here awk limits the matching to the part after the filename, using : as the delimiter. If your "hitWord" is a literal string, adding -F will help. awk is still doing pattern matching, though. tolower() is there to make the second step case-insensitive too.
Since the delimiter ":" can appear within the body, we can't depend on $3 in awk; instead, we store $1 and $2, remove them from the line, match, and add them back before printing. I guess at this point you could fold the first grep's functionality into this awk as well.
However, I think that without the -o flag, this and other line-based solutions will fail when there is an actual match and an unwanted match on the same line. If the unwanted superstrings are few, perhaps a negative lookbehind/lookahead pattern is a better solution (see the sketch after this answer).
If your blacklist is not a file but an array, you can use process substitution as in the other answer and replace blacklist with
... | awk '...' <(printf '%s\n' "${superStringIgnoreArray[@]}") -
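A sketch of the lookahead idea mentioned above, assuming GNU grep with PCRE support (-P), the example hitWord "for", and the example blacklist "formula", "forest", "foreign", "fortify":
# Match "for" only where it is NOT the start of one of the blacklisted superstrings.
grep -HniIrP 'for(?!mula|est|eign|tify)' "$SEARCH_DIR" | tee "$OUTPUT_FILE"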

get last line from grep search on multiple files

I'm currently having a problem with a grep command.
I've found the way to show only the last line of a grep search:
grep PATTERN FILE_NAME | tail -1
I also found the way to run a grep search over multiple selected files:
find . -name "FILE_NAME" | xargs -I name grep PATTERN name
Now I would like to get only the last line of the grep result for each single file.
I tried this:
find . -name "FILE_NAME" | xargs -I name grep PATTERN name | tail -1
This returns only the last match from the last file, whereas I would like the last matching pattern for every file.
for f in $(find . -name "FILE_NAME"); do grep PATTERN $f | tail -1; done
sort has a unique option (-u) that allows you to select just one line from many. Try this:
grep PATTERN FILENAMES* | tac | sort -u -t: -k1,1
Explanation:
Grep will return one line for each match in a file. This looks like:
$ grep match file*
file1.txt:match
file1.txt:match2
file2.txt:match3
file2.txt:match4
And what we want is two lines from that output:
$ ???
file1.txt:match2
file2.txt:match4
You can treat this as a sort of table, in which the first column is the filename and the second is the match, where the column separator is the ':' character.
Our first pipe reverses the output:
$ grep match file* | tac
file2.txt:match4
file2.txt:match3
file1.txt:match2
file1.txt:match
Our second pipe, to sort, says: keep just the first line for each unique key (-u), where the key to group by is the first column (-k1,1, key from column 1 to column 1), and split the data into columns using ':' as the delimiter (-t:). It will also sort the output. And its output:
$ grep match file* | tac | sort -u -t: -k1,1
file1.txt:match2
file2.txt:match4
An alternative to this could be done with awk instead of grep. A POSIX version would read:
awk '(FNR==1)&&s{print s; s=""}/PATTERN/{s=$0}END{if(s) print s}' file1 file2 file3 ...
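For readability, the same POSIX one-liner could be spelled out like this (a sketch, identical logic):
awk '
(FNR == 1) && s { print s; s = "" }   # entering a new file: print the previous file last match
/PATTERN/       { s = $0 }            # remember the most recent matching line
END             { if (s) print s }    # print the last match of the final file
' file1 file2 file3 ...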
Using GNU awk, you can use ENDFILE
awk 'BEGINFILE{s=""}/PATTERN/{s=$0}ENDFILE{if(s) print s}' file1 file2 file3 ...
You can use find to execute commands too:
find . -name "<file-name-to-find>" -exec grep "<pattern-to-match>" "{}" ";" | tail -1
"{}" is the file name; take care with shell globbing and expansion when writing the command.
Another way to find the last line is to reverse the file and output the first match.
find . -name "FILE_NAME" | xargs -I name sh -c 'tac name|sed -n "/PATTERN/{p;q}"'
You could start with grep's -B (before) parameter. For example, to get 5 lines before the match:
duli@i5 /etc/php5/apache2 $ grep -i -B5 timezone php.ini
[CLI Server]
; Whether the CLI web server uses ANSI color coding in its terminal output.
cli_server.color = On
[Date]
; Defines the default timezone used by the date functions
; http://php.net/date.timezone
;date.timezone =
Get last line of each file (prefixed with file name). Then, filter output based on pattern.
find . -name "*" -exec tail -v -n1 {} \; | grep "some_string" -B1
On macOS, you have to do it in a slightly different way:
find . -name "*" | xargs tail -1 | grep "some_string" -B1
7 years too late to the party. A slower variant of the command line above:
find . -name "FILE_NAME" | xargs -I name sh -c "grep PATTERN name | tail -1"
If you need to show the file name in each line:
find . -name "FILE_NAME" | xargs -I name sh -c "grep -H PATTERN name | tail -1"
There is a solution without the need for loops; it gives what the OP wants.
find . -type f -exec sh -c "fgrep print {} /dev/null |tail -1" \;
./tway.pl:print map(lambda x : x[1], filter(lambda x : x[0].startswith('volume'), globals().items()))
./txml.py: print("%s does not exist: %s\n" % (host, error))
./utils.py:print combine_dicts(a, b, operator.mul)
./xml_example.py:print ET.tostring(root, method="text")
For comparison, the same command without the tail -1 gives too many lines per file, but it proves the above works.
find . -type f -exec sh -c "fgrep print {} /dev/null" \;
gives:
./tway.pl:print map(lambda x : x[1], filter(lambda x : x[0].startswith('volume'), globals().items()))
./txml.py: print("%s resolved to --> %s\n" % (host, ip))
./txml.py: print("%s does not exist: %s\n" % (host, error))
./utils.py:print "a", a
./utils.py:print "b", b
./utils.py:print combine_dicts(a, b, operator.mul)
./xml_example.py: print ">>"
./xml_example.py: print ET.tostring(e, method="text")
./xml_example.py: print "<<"
./xml_example.py:print ET.tostring(root, method="text")
EDIT - remove the /dev/null if you don't want the filename included in the output.
The sed version
# As soon as we find pattern
# we save that line in hold space
save_pattern_line='/PATTERN/{h;d}'
# switch pattern and hold space
switch_spaces='x'
# At the end of the file
# if the pattern is in the pattern space
# (which we swapped with our hold space)
# switch again, print and exit
eof_print='${/PATTERN/{x;p;d}}'
# Else, switch pattern and hold space
switch_spaces='x'
find . -name 'FILE_NAME' |
xargs sed -s -n -e $save_pattern_line \
-e $switch_spaces \
-e $eof_print \
-e $switch_spaces
The quickest way to do this would be to get the last line (or more) from each of the files and then grep through that. So:
tail -1 filenames.* | grep "what you want to grep for"

Total Size in a Module

I am trying to find the total size of the libc.a module using the unix size command.
When running it, I get total sizes for a lot of different files. What is one possible way to take one column of the output of 'size' (the dec column in this case, which has the total size of the file in decimal) and add the values up?
Is that possible to do in the command line?
One way to do this:
size libc.a | awk '{ print " " $4 " " }' | tail -n +2 | paste -sd'+' | xargs expr
The steps:
Use size to generate the output
Use awk to print the fourth field inserted between spaces
Use tail to print from the second line (ie. skip the first line)
Use paste to join all the lines, separated by a +
Use xargs to pass the result as a parameter to expr.
Use expr to add up the values
Assuming you are not interested in the actual file size of libc.a, but would like to calculate the sum of the dec column in the output of size, here is a command line:
size libc.a | tail -n +2 | cut -f 4 | tr -d '\n' | sed "s/^ *//;s/ *$//;s/ */+/g;s/$/\n/" | bc
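If you only need the final sum, a single awk invocation can also do the whole job; a minimal sketch that skips the header line and adds up the dec column (field 4):
size libc.a | awk 'NR > 1 { total += $4 } END { print total }'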

Only get hash value using md5sum (without filename)

I use md5sum to generate a hash value for a file.
But I only need to receive the hash value, not the file name.
md5=`md5sum ${my_iso_file}`
echo ${md5}
Output:
3abb17b66815bc7946cefe727737d295 ./iso/somefile.iso
How can I 'strip' the file name and only retain the value?
A simple array assignment works... Note that the first element of a Bash array can be addressed by just the name without the [0] index, i.e., $md5 contains only the 32-character hash from the md5sum output.
md5=($(md5sum file))
echo $md5
# 53c8fdfcbb60cf8e1a1ee90601cc8fe2
Using AWK:
md5=`md5sum ${my_iso_file} | awk '{ print $1 }'`
You can use cut to split the line on spaces and return only the first such field:
md5=$(md5sum "$my_iso_file" | cut -d ' ' -f 1)
On Mac OS X:
md5 -q file
md5="$(md5sum "${my_iso_file}")"
md5="${md5%% *}" # remove the first space and everything after it
echo "${md5}"
Another way is to do:
md5sum filename | cut -f 1 -d " "
cut will split the line to each space and return only the first field.
By leaning on head:
md5_for_file=`md5sum ${my_iso_file}|head -c 32`
One way:
set -- $(md5sum $file)
md5=$1
Another way:
md5=$(md5sum $file | while read sum file; do echo $sum; done)
Another way:
md5=$(set -- $(md5sum $file); echo $1)
(Do not try that with backticks unless you're very brave and very good with backslashes.)
The advantage of these solutions over other solutions is that they only invoke md5sum and the shell, rather than other programs such as awk or sed. Whether that actually matters is then a separate question; you'd probably be hard pressed to notice the difference.
If you need to print it and don't need a newline, you can use:
printf $(md5sum filename)
md5=$(md5sum < $file | tr -d ' -')
md5=`md5sum ${my_iso_file} | cut -b-32`
md5sum puts a backslash before the hash if there is a backslash in the file name. The first 32 characters or anything before the first space may not be a proper hash.
It will not happen when using standard input (file name will be just -), so pixelbeat's answer will work, but many others will require adding something like | tail -c 32.
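A quick illustration of that escaping behaviour (the file name here is hypothetical; the hash shown is the well-known MD5 of an empty file):
touch 'back\slash.iso'
md5sum 'back\slash.iso'
# prints: \d41d8cd98f00b204e9800998ecf8427e  back\\slash.iso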
If you're concerned about screwy filenames:
md5sum < "${file_name}" | awk NF=1
f244e67ca3e71fff91cdf9b8bd3aa7a5
Other messier ways to deal with this:
md5sum "${file_name}" | awk NF=NF OFS= FS=' .*$'
or
| awk '_{ exit }++_' RS=' '
f244e67ca3e71fff91cdf9b8bd3aa7a5
To do it entirely inside awk:
mawk 'BEGIN {
__ = ARGV[ --ARGC ]
_ = sprintf("%c",(_+=(_^=_<_)+_)^_+_*++_)
RS = FS
gsub(_,"&\\\\&",__)
( _=" md5sum < "((_)(__)_) ) | getline
print $(_*close(_)) }' "${file_name}"
f244e67ca3e71fff91cdf9b8bd3aa7a5
Well, I had the same problem today, but I was trying to get the file MD5 hash when running the find command.
I took the most voted answer and wrapped it in a function called md5 to run in the find command. My mission was to calculate the hash for all files in a folder and output it as hash:filename.
md5() { md5sum $1 | awk '{ printf "%s",$1 }'; }
export -f md5
find -type f -exec bash -c 'md5 "$0"' {} \; -exec echo -n ':' \; -print
So, I got some pieces from here and also from the question "'find -exec' a shell function in Linux".
For the sake of completeness, a way with sed using a regular expression and a capture group:
md5=$(md5sum "${my_iso_file}" | sed -r 's:\\*([^ ]*).*:\1:')
The regular expression captures everything up to the first space in a group. For the substitution to leave only the captured group, the expression has to match the whole line, hence the trailing .*.
(More about sed and capture groups here: How can I output only captured groups with sed?)
As delimiter in sed, I use colons because they are not valid in file paths and I don't have to escape the slashes in the filepath.
Another way:
md5=$(md5sum ${my_iso_file} | sed 's/ .*//' )
md5=$(md5sum < index.html | head -c -4)
