sorting in unix by file size

I have a unix file with data like:
35|ag
0|ca
22.0 K|nt
43.8 G|ct
90.0 M|se
2.4 M|ew
1.6 K|et
0|er
0|dr
18|ld
Output:
43.8 G|ct
90.0 M|se
2.4 M|ew
22.0 K|nt
1.6 K|et
35|ag
18|ld
0|ca
0|er
0|dr
I need to sort this in decreasing order of size. I could convert it to a uniform unit such as bytes and then sort, but I was hoping it could be sorted directly.
Thanks in advance!

Using GNU sort. man sort:
-h, --human-numeric-sort
compare human readable numbers (e.g., 2K 1G)
but we need to remove the spaces from the first field. Using tr
$ cat file | tr -d ' ' | sort -h -r -t\| -k 1
43.8G|ct
90.0M|se
2.4M|ew
22.0K|nt
1.6K|et
35|ag
18|ld
0|er
0|dr
0|ca
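Note that tr -d ' ' also removes the space from the output (43.8G instead of 43.8 G). If the original spacing matters, a decorate-sort-undecorate sketch (assuming bash and GNU sort, with the same file name as above) keeps it:
$ paste -d'|' <(tr -d ' ' < file) file | sort -t'|' -hr -k1,1 | cut -d'|' -f3-
This sorts on a space-stripped copy of each line pasted in front of it, then cuts the copy away, so the lines come out in the same order as above but with the original "43.8 G" formatting intact.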

Related

bash grep only numbers and compare them

The index.html fetched by the curl command looks like below.
<html>
<head><title>Index of myorg/release/builds/production/</title>
</head>
<body>
<h1>Index of myorg/release/builds/production/</h1>
<pre>Name Last modified Size</pre><hr/>
<pre>../
1.0.60/ 06-Jul-2022 07:47 -
1.0.63/ 06-Jul-2022 10:21 -
1.0.64/ 09-Jul-2022 18:08 -
1.0.65/ 09-Jul-2022 18:42 -
1.0.71/ 10-Jul-2022 10:23 -
1.0.73/ 14-Jul-2022 17:28 -
1.0.75/ 20-Jul-2022 07:25 -
{STOCKIO}/ 24-May-2022 11:09 -
dashboard-react-module-1.0.29.tar.gz 24-May-2022 07:27 87.74 MB
dashboard-react-module-1.0.29.tar.gz.md5 24-May-2022 07:27 32 bytes
dashboard-react-module-1.0.29.tar.gz.sha1 24-May-2022 07:27 40 bytes
dashboard-react-module-1.0.29.tar.gz.sha256 24-May-2022 07:27 64 bytes
dashboard-react-module.tar.gz 24-May-2022 07:27 87.74 MB
dashboard-react-module.tar.gz.md5 24-May-2022 07:27 32 bytes
dashboard-react-module.tar.gz.sha1 24-May-2022 07:27 40 bytes
</pre>
<hr/><address style="font-size:small;">Artifactory/6.23.41 Server .myorg.com Port 80</address></body></html>
I'm unable to construct the logic to find the largest version entry in the file; here it's 1.0.75.
I tried grepping only the numbers with grep -E "[[:digit:]]\.[[:digit:]]\.[[:digit:]]{1,4}" index.html but it returns the same output as above.
My idea is to get all the numeric entries like 1.0.60, 1.0.63 ... into an array, cut the last part of each number and compare them to get the largest one, but I'm unable to find the right grep command that gives only the numeric values.
Or is there a more efficient way to do it?
Using sed to filter the data, sort to arrange it (in case it is unsorted) and tail to show the last (largest) entry:
$ sed -En '/href/s~[^>]*>([0-9][^/]*).*~\1~p' input_file | sort -n | tail -1
1.0.75
Match lines containing the string href
Capture the match within parentheses and exclude everything else
Return the match with backreference \1
sort the piped output numerically
Print the last line (highest value)
With your shown samples and attempts, please try the following GNU awk + sort + head solution.
awk 'match($0,/<a href="([0-9]+(\.[0-9]+)*)/,arr){print arr[1] | "sort -rV | head -1"}' Input_file
Explanation: The awk program parses Input_file with its match() function and the regex <a href="([0-9]+(\.[0-9]+)*), which creates a capturing group containing only the version number. GNU awk can store the captured groups in an array, so arr holds just the version values. Each value is piped through the shell command sort -rV (reverse version sort, i.e. descending order), and once all values are printed the output goes to head -1, which prints only the first line, i.e. the highest version.
No doubt a lot of ways to do this..
cat foo1.x | grep 'href="[0-9]' | sed -E 's/.*href=.1.0.([0-9]+).*/\1/' | sort -u -n | tail -1
Versions are sorted in index.html..
Getting the last one
awk -F'["/]' '/href="([0-9]+\.[0-9]+\.[0-9]+)\/"/{n=$2}END{print n}' index.html
1.0.75
If versions are not sorted
awk -F'["/]' '
    /href="([0-9]+\.[0-9]+\.[0-9]+)\/"/ { a[NR]=$2 }
    END {
        asorti(a, b, "@val_num_desc")
        print a[b[1]]
    }
' index.html
1.0.75
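If the real index.html contains anchors like <a href="1.0.60/"> (the answers above assume so; the anchors just don't survive in the listing rendered in the question), a plain grep + GNU sort -V sketch is another option:
grep -o 'href="[0-9][^"/]*' index.html | cut -d'"' -f2 | sort -V | tail -1
Only hrefs that start with a digit are taken, so {STOCKIO}/ and the tar.gz entries are skipped, and sort -V orders the remaining version strings properly.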

sort does not work with -h with text file

In my OS, I can find
-h, --human-numeric-sort
compare human readable numbers (e.g., 2K 1G)
And I have a file aaa.txt:
2M
5904K
1G
Then I type
sort -h aaa.txt
The output is
5904K
2M
1G
It's wrong. It should be
2M
5904K
1G
Questions:
Why does sort -h not work? The result looks wrong even from a lexicographic-order perspective. How do I sort the aaa.txt file by human-readable numbers?
Or does it only work with du -h? But the most-voted answer seems to work with awk.
With du -h, sort does not need a field specification such as sort -k1h,1. Why? What would happen if the size were not in the first field?
Why does sort -h not work?
Below is a comment from GNU sort's source code.
/* Compare numbers ending in units with SI xor IEC prefixes
<none/unknown> < K/k < M < G < T < P < E < Z < Y
Assume that numbers are properly abbreviated.
i.e. input will never have both 6000K and 5M. */
It's not mentioned in the man page, but -h is not supposed to work with your input.
How to sort the aaa.txt file in human readable numbers.
You can use numfmt to perform a Schwartzian transform as shown below.
$ numfmt --from=auto < aaa.txt | paste - aaa.txt | sort -n | cut -f2
2M
5904K
1G
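As for the last question: if the human-readable size were not in the first field, numfmt can still normalize just that field with --field before sorting. A sketch, assuming whitespace-separated data with the size in the third column of a hypothetical data.txt (and that every line has a value there):
$ numfmt --from=auto --field=3 < data.txt | sort -k3,3n
The sizes come out as plain byte counts in that column; if the original human-readable form must be kept, the same Schwartzian-transform idea as above (paste the converted value alongside the original line, sort, then cut) works here too.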

How do I sort a "MON_YYYY_day_NUM" time with UNIX tools?

I'm wondering how to sort this example based on time. I have already sorted it on everything else, but I just cannot figure out how to sort by the time (the 07:30 part, for example).
My current code:
sort -t"_" -k3n -k2M -k5n (still need to implement the time sort for the last sort)
What still needs to be sorted is the time:
Dunaj_Dec_2000_day_1_13:00.jpg
Rim_Jan_2001_day_1_13:00.jpg
Ljubljana_Nov_2002_day_2_07:10.jpg
Rim_Jan_2003_day_3_08:40.jpg
Rim_Jan_2003_day_3_08:30.jpg
Any help or just a point in the right direction is greatly appreciated!
Alphabetically: a 24-hour time with a fixed number of digits is fine to sort with a plain alphabetic sort.
sort -t"_" -k3n -k2M -k5n -k6 # default (alphabetic) sorting
sort -t"_" -k3n -k2M -k5n -k6V # version-number sort
There's also a version sort, V, shown in the second line, which works fine here as well.
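For the sample filenames in the question, either variant should print:
Dunaj_Dec_2000_day_1_13:00.jpg
Rim_Jan_2001_day_1_13:00.jpg
Ljubljana_Nov_2002_day_2_07:10.jpg
Rim_Jan_2003_day_3_08:30.jpg
Rim_Jan_2003_day_3_08:40.jpg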
I have to admit to shamelessly stealing from this answer on SO:
How to split log file in bash based on time condition
awk -F'[_:.]' '
    BEGIN {
        months["Jan"] = 1
        months["Feb"] = 2
        months["Mar"] = 3
        months["Apr"] = 4
        months["May"] = 5
        months["Jun"] = 6
        months["Jul"] = 7
        months["Aug"] = 8
        months["Sep"] = 9
        months["Oct"] = 10
        months["Nov"] = 11
        months["Dec"] = 12
    }
    { print mktime($3" "months[$2]" "$5" "$6" "$7" 00"), $0 }
' input | sort -n | cut -d' ' -f2-
Use _, : and . as field separator characters to parse each file name.
Initialize an associative array so we can map month names to numerical values (1-12).
Use the awk function mktime(); it takes a string in the format "YYYY MM DD HH MM SS [ DST ]" as per https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html. Each input line is printed with a prepended column containing the time in epoch seconds.
The results are piped to sort -n, which sorts numerically on that first column.
Now that the results are sorted, we can remove the first column with cut.
I'm on a Mac, so I had to use gawk to get the mktime() function (it's not available in the stock macOS awk). I've read that mawk is another option.

Bash: uniq count large dataset

I have a set of CSV files spanning over 70GB, with about 35GB of that being the field I'm interested in (around 100 bytes for this field in each row).
The data is highly duplicated (a sampling shows that the top 1000 values cover 50%+ of the rows) and I'm interested in getting the total unique count.
With a not-so-large data set I would do
cat my.csv | cut -f 5 | sort | uniq -c | sort --numeric and it works fine.
However, the problem I have is that (to my understanding), because of the intermediate sort, this command will need to hold the whole data set in RAM (and then spill to disk, since it does not fit my 16GB of RAM) before streaming it to uniq -c.
I would like to know if there's a command or awk/python script that does the sort | uniq -c in one step, so that the RAM consumption is far lower.
You can try this:
perl -F, -MDigest::MD5=md5 -lanE 'say unless $seen{ md5($F[4]) }++' < file.csv >unique_field5.txt
it will hold in memory a 16-byte MD5 digest for every unique value of field 5 (i.e. $F[4]). Or you can use
cut -d, -f5 csv | perl -MDigest::MD5=md5 -lnE 'say unless $seen{md5($_)}++'
for the same result.
Of course, MD5 isn't cryptographically safe these days, but it will probably be enough here. It is also possible to use SHA-1 or SHA-256; just use -MDigest::SHA=sha256. The SHA digests are longer, i.e. they need more memory.
It is similar to the awk solution linked in the comments, with one difference: the hash key here is not the whole input line but just its 16-byte MD5 digest.
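For reference, the one-step count the question asks about can also be done directly in awk; this is only a sketch, assuming comma-separated input (adjust -F for tab-separated files) and that the distinct field-5 values fit in memory as hash keys:
awk -F, '{count[$5]++} END {for (v in count) print count[v], v}' my.csv | sort -rn | head
It keeps one hash entry per distinct value, including the raw value string itself, which is exactly the memory cost the digest trick above reduces.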
EDIT
Because I was wondering about the performance, I created this test case:
# this perl creates 400,000,000 records,
# each 100 bytes plus an attached random number;
# total size of data: 40GB.
# each invocation generates the same data (srand(1)).
# because the random number is between 0 and 50,000,000
# roughly 50 million of the records are unique (about 12.5%).
gendata() {
    perl -E '
        BEGIN{ srand(1) }
        say "x"x100, int(rand()*50_000_000) for 1..400_000_000
    '
}
# the unique filtering - by digest.
# also using the Devel::Size perl module to get the final size of the data held in memory.
# using md5
domd5() {
    perl -MDigest::MD5=md5 -MDevel::Size=total_size -lnE '
        say unless $seen{md5($_)}++;
        END {
            warn "total: " . total_size(\%seen);
        }'
}
# using sha256
dosha256() {
    perl -MDigest::SHA=sha256 -MDevel::Size=total_size -lnE '
        say unless $seen{sha256($_)}++;
        END {
            warn "total: " . total_size(\%seen);
        }'
}
#MAIN
time gendata | domd5 | wc -l
time gendata | dosha256 | wc -l
results:
total: 5435239618 at -e line 4, <> line 400000000.
49983353
real 10m12,689s
user 12m43,714s
sys 0m29,069s
total: 6234973266 at -e line 4, <> line 400000000.
49983353
real 15m51,884s
user 18m23,900s
sys 0m29,485s
That is:
for the md5:
memory usage: 5,435,239,618 bytes, i.e. approx. 5.4 GB
unique records: 49,983,353
time to run: 10 min
for the sha256:
memory usage: 6,234,973,266 bytes, i.e. approx. 6.2 GB
unique records: 49,983,353
time to run: 16 min
In contrast, doing the plain-text unique search using the "usual" approach:
doplain() {
    perl -MDevel::Size=total_size -lnE '
        say unless $seen{$_}++;
        END {
            warn "total: " . total_size(\%seen);
        }'
}
e.g. running:
time gendata | doplain | wc -l
result:
memory usage is much bigger: 10,022,600,682 bytes; my notebook with 16GB RAM starts swapping heavily (it has an SSD, so not a big deal, but still...)
unique records: 49,983,353
time to run: 8:30 min
Result?
Just use
cut -d, -f5 csv | perl -MDigest::MD5=md5 -lnE 'say unless $seen{md5($_)}++'
and you should get the unique lines fast enough.
You can try this:
split --filter='sort | uniq -c | sed "s/^\s*//" > $FILE' -b 15G -d "dataset" "dataset-"
At this point you should have around 5 dataset-<i> files, each of which should be much less than 15G.
To merge the files you can save the following bash script as merge.bash:
#! /bin/bash
#
read prev_line
prev_count=${prev_line%% *}
prev_line=${prev_line#* }     # strip the count from the first line as well
while read line; do
    count="${line%% *}"
    line="${line#* }"         # This line does not handle blank lines correctly
    if [ "$line" != "$prev_line" ]; then
        echo "$prev_count $prev_line"
        prev_count=$count
        prev_line=$line
    else
        prev_count=$((prev_count + count))
    fi
done
echo "$prev_count $prev_line"
And run the command:
sort -m -k 2 dataset-* | bash merge.bash > final_dataset
Note: blank lines are not handled correctly; if it suits your needs you can remove them from your dataset or fix merge.bash accordingly.
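As a tiny made-up illustration of what the merge step does (file names and contents are hypothetical, not from the real dataset):
$ printf '3 apple\n1 pear\n' > dataset-00
$ printf '2 apple\n' > dataset-01
$ sort -m -k 2 dataset-0* | bash merge.bash
5 apple
1 pear
Counts of identical values coming from different chunks are summed while everything stays streamed, so nothing large is held in memory.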

How do I calculate the standard deviation in my shell script?

I have a shell script:
dir=$1
cd $dir
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2; L[$1]++; next} END{for (i in A) {print i, A[i]/L[i]}}' |
sort -nr -k2 |
awk '{ sub(/.dat/, " "); print }'
which sums up all of the numbers that follow the <rating> field in each file of my folder, but now I need to calculate the standard deviation of the numbers rather than the average: sum the squared difference of each rating from the mean, then divide by the sample size minus 1. I do not need to do this for every file in the folder, but only for 2 specific files, hotel_188937.dat and hotel_203921.dat. Here is an example of the contents of one of these files:
<Overall Rating>
<Avg. Price>$155
<URL>
<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5
<Author>...
repeat fields again...
The sample size of the first file is 127 with a mean of 4.78, compared with a sample size of 324 and a mean of 4.78 for the second file. Is there any way I can alter my script to calculate the standard deviation for these two specific files rather than calculating the average for every file in my directory? Thanks for your time.
You can do it all in one awk script:
$ awk -F'>' '
$1=="<rating" {k=FILENAME;sub(/.dat/,"",k);
s[k]+=$2;ss[k]+=$2^2;c[k]++}
END{for(i in s)
print i,m=s[i]/c[i],sqrt(ss[i]/c[i]-m^2)}' r1.dat r2.dat
r1 2.5 1.11803
r2 3 1.41421
s is for sum, ss for square sum, c for count, m for mean. Note that this computes the population standard deviation, not the sample standard deviation. For the latter you need a scaling adjustment with (count-1).
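If you want the sample standard deviation the question describes (divide by count-1), a small variation of the same idea, sketched here with the two hotel files named in the question (assumes each file has at least two ratings):
$ awk -F'>' '
    $1=="<rating" {k=FILENAME; sub(/\.dat$/,"",k); s[k]+=$2; ss[k]+=$2^2; c[k]++}
    END{for (i in s) {m=s[i]/c[i]; print i, m, sqrt((ss[i]-c[i]*m^2)/(c[i]-1))}}
' hotel_188937.dat hotel_203921.dat
It relies on the identity sum((x-m)^2) = sum(x^2) - count*m^2, so only the running sum, square sum and count per file are kept.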
Yes.
The * in the grep line tells it to search in all the files.
Change the line
grep -P -o '(?<=<rating>).*' * |
to
grep -P -o '(?<=<rating>).*' hotel_188937.dat hotel_203921.dat |
