How to convert file size to human readable and print with other columns? - bash

I want to convert the 5th column of this command output to a human-readable format.
For example, if this is my input:
-rw-rw-r-- 1 bhagyaraj bhagyaraj 280000 Jun 17 18:34 demo1
-rw-rw-r-- 1 bhagyaraj bhagyaraj 2800000 Jun 17 18:34 demo2
-rw-rw-r-- 1 bhagyaraj bhagyaraj 28000000 Jun 17 18:35 demo3
To something like this:
-rw-rw-r-- 280K demo1
-rw-rw-r-- 2.8M demo2
-rw-rw-r-- 28M demo3
I tried this command, but it returns only the file size column.
ls -l | tail -n +2 | awk '{print $5 | "numfmt --to=si"}'
ls is just an example; my real use case is very large, and repeated execution must be avoided.
Any help would be appreciated :)

Just use -h --si
-h, --human-readable with -l and -s, print sizes like 1K 234M 2G etc.
--si likewise, but use powers of 1000 not 1024
So the command would be
ls -lh --si | tail -n +2
If you don't use ls, and the command you intend to run doesn't have an option similar to ls's -h --si, then numfmt already has the --field option to specify which column you want to format. For example:
$ df | LC_ALL=en_US.UTF-8 numfmt --header --field 2-4 --to=si
Filesystem 1K-blocks Used Available Use% Mounted on
udev 66M 0 66M 0% /dev
tmpfs 14M 7.2K 14M 1% /run
/dev/mapper/vg0-lv--0 4.1G 3.7G 416M 90% /
tmpfs 5.2K 4 5.2K 1% /run/lock
/dev/nvme2n1p1 524K 5.4K 518K 2% /boot/efi
Unfortunately, although numfmt does try to preserve the columnation, it fails if there is large variation in line length after inserting group separators, as you can see above. So sometimes you might still need to reformat the table with column:
df | LC_ALL=en_US.UTF-8 numfmt --header --field 2-4 --to=si | column -t -R 2,3,4,5
The -R 2,3,4,5 option is for right alignment, but some column versions, like the default one in Ubuntu, don't support it, so you may need to remove it.
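If your numfmt supports --padding, you can also ask it to pad each converted field to a fixed width, which often keeps the table aligned without a second tool (a sketch; the width 8 is an arbitrary choice):
# pad each converted field to 8 characters (right-aligned)
df | LC_ALL=en_US.UTF-8 numfmt --header --field 2-4 --to=si --padding=8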
Alternatively, you can use awk to format only the columns you want, for example column 5 in the case of ls:
$ ls -l demo* | awk -v K=1e3 -v M=1e6 -v G=1e9 'function format(v) {
if (v > G) return v/G "G"; else if (v > M) return v/M "M";
else if (v > K) return v/K "K"; else return v
} { $5 = format($5); print $0 }' | column -t
-rw-rw-r-- 1 ph ph 280K Jun 18 09:23 demo1
-rw-rw-r-- 1 ph ph 2.8M Jun 18 09:24 demo2
-rw-rw-r-- 1 ph ph 28M Jun 18 09:23 demo3
-rw-rw-r-- 1 ph ph 2.8G Jun 18 09:30 demo4
And columns 2, 3 and 4 in the case of df:
# M=1000 and G=1000000 because df output is in 1K blocks, not bytes
$ df | awk -v M=1000 -v G=1000000 'function format(v) {
if (v > G) return v/G "G"; else if (v > M) return v/M "M"; else return v
}
{
# Format only columns 2, 3 and 4, ignore header
if (NR > 1) { $2 = format($2); $3 = format($3); $4 = format($4) }
print $0
}' OFS="\t" | column -t
Filesystem 1K-blocks Used Available Use% Mounted on
udev 65.8273G 0 65.8273G 0% /dev
tmpfs 13.1772G 7M 13.1702G 1% /run
/dev/mapper/vg0-lv--0 4073.78G 3619.05G 415.651G 90% /
tmpfs 65.8861G 0 65.8861G 0% /dev/shm
tmpfs 5.12M 4 5.116M 1% /run/lock
tmpfs 65.8861G 0 65.8861G 0% /sys/fs/cgroup
/dev/nvme2n1p2 999.32M 363.412M 567.096M 40% /boot
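Coming back to the original ls question: if you want exactly the three columns shown in the desired output (permissions, size, name), here is a minimal sketch combining the pieces above (assuming GNU numfmt; file names with spaces would need more care):
# keep permissions, size and name, then convert the size column
ls -l | tail -n +2 | awk '{print $1, $5, $9}' | numfmt --field=2 --to=si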

UPDATE 1:
If you need just a barebones module for byte-size formatting (it's set up for base-2 now, but modifying it for --si should be trivial):
{m,g}awk '
BEGIN { OFS="="
_____=log(__=(_+=_+=_^=_)^(____=++_))
} gsub("[.]0+ bytes",
" -bytes-",
$(!__*($++NF = sprintf("%#10.*f %s" ,____,
(_ = $!!__) / __^(___=int(log(_)/_____)),
!___ ? "bytes" : substr("KMGTPEZY",___,!!__)"iB"))))^!__'
=
734 734 -bytes-
180043 175.82324 KiB
232819 227.36230 KiB
421548373 402.01986 MiB
838593829 799.74540 MiB
3739382399 3.48257 GiB
116601682159 108.59378 GiB
147480014471 137.35147 GiB
11010032230111 10.01357 TiB
19830700070261 18.03592 TiB
111120670776601 101.06366 TiB
15023323323323321 13.34339 PiB
85255542224555233 75.72213 PiB
444444666677766611 394.74616 PiB
106941916666944416909 92.75733 EiB
111919999919911191991 97.07513 EiB
767777766776776776777767 650.33306 ZiB
5558888858993555888686669 4.59821 YiB
========================
This is probably way overkill, but I wrote it a while back; it can calculate the human-readable value, as well as a comma-formatted form of the raw byte value, supporting everything from kilobits to yottabytes,
with options for:
base 2 or base 10 (enter 10 or "M/m" for metric)
bytes (B) or bits (b)
The only things that need to be hard-coded are the letters themselves, since they grow linearly with either
every 3rd power of 10 (1,000), or
every 5th power of 4 (1,024).
{m,g}awk '
BEGIN {
    FS = OFS = "="
}
$!NF = substr(bytesformat($2, 10, "B"), 1, 15)\
       substr(bytesformat($2, 2, "B"), 1, 15)\
       bytesformat($2, 2, "b")

# Functions, listed alphabetically
function bytesformat(_,_______,________,__, ___, ____, _____, ______)
{
    _____=__=(____^=___*=((____=___+=___^= "")/___)+___+___)
    ___/=___
    sub("^0+","",_)
    ____=_____-= substr(_____,index(_____,index(_____,!__))) * (_______~"^(10|[Mm])$")
    _______=length((____)____)^(________~"^b(it)?$")
    if ((____*__) < (_______*_)) {
        do {
            ____*=_____
            ++___
        } while ((____*__) < (_______*_))
    }
    __=_
    sub("(...)+$", ",&", __)
    gsub("[^#-.][^#-.][^#-.]", "&,", __)
    gsub("[,]*$|^[,]+", "", __)
    sub("^[.]", "0&", __)
    return \
    sprintf("%10.4f %s%s | %s byte%.*s",
        _=="" ? +_:_/(_____^___)*_______,
        substr("KMGTPEZY", ___, _^(_<_)),
        --_______?"b":"B",__==""?+__:__,(_^(_<_))<_,"s")
}'
In this sample, it shows metric bytes, binary bytes, binary bits, and the raw input byte value:
180.0430 KB | 175.8232 KB | 1.3736 Mb | 180,043 bytes
232.8190 KB | 227.3623 KB | 1.7763 Mb | 232,819 bytes
421.5484 MB | 402.0199 MB | 3.1408 Gb | 421,548,373 bytes
838.5938 MB | 799.7454 MB | 6.2480 Gb | 838,593,829 bytes
3.7394 GB | 3.4826 GB | 27.8606 Gb | 3,739,382,399 bytes
116.6017 GB | 108.5938 GB | 868.7502 Gb | 116,601,682,159 bytes
147.4800 GB | 137.3515 GB | 1.0731 Tb | 147,480,014,471 bytes
11.0100 TB | 10.0136 TB | 80.1085 Tb | 11,010,032,230,111 bytes
19.8307 TB | 18.0359 TB | 144.2873 Tb | 19,830,700,070,261 bytes
111.1207 TB | 101.0637 TB | 808.5093 Tb | 111,120,670,776,601 bytes
15.0233 PB | 13.3434 PB | 106.7471 Pb | 15,023,323,323,323,321 bytes
85.2555 PB | 75.7221 PB | 605.7771 Pb | 85,255,542,224,555,233 bytes
444.4447 PB | 394.7462 PB | 3.0840 Eb | 444,444,666,677,766,611 bytes
106.9419 EB | 92.7573 EB | 742.0586 Eb | 106,941,916,666,944,416,909 bytes
111.9200 EB | 97.0751 EB | 776.6010 Eb | 111,919,999,919,911,191,991 bytes
767.7778 ZB | 650.3331 ZB | 5.0807 Yb | 767,777,766,776,776,776,777,767 bytes
5.5589 YB | 4.5982 YB | 36.7856 Yb | 5,558,888,858,993,555,888,686,669 bytes
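If you just want the underlying idea without the obfuscation, here is a plain sketch of the base-2 byte case: scale down by 1024 and index into the hard-coded suffix letters:
awk 'function human(x,    i, s) {
    s = "KMGTPEZY"                      # the only hard-coded part, as noted above
    for (i = 0; x >= 1024 && i < length(s); i++)
        x /= 1024                       # divide until below one unit
    return i ? sprintf("%.4f %siB", x, substr(s, i, 1)) : x " bytes"
}
{ print $1, human($1) }'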

Related

How do you get items from a txt file into a presentable table in bash?

I'm trying to retrieve items from Node01.pc and put them in a table.
For example:
echo ${NodeCPU[0]} is able to print the item from the line.
But when I use printf or echo, it either breaks or does not display the output from the array item.
The formatting of the table seems to work, and it displays fine as long as the array items aren't used. Could it be that there's more to the file than I can see?
Node01.pc contains
192.168.0.99
2
70
16
80
4
4
100
4
VS122:NMAD:20:20:1:1
VS122:NAMD:20:20:1:1
RS123:FEM:10:20:1:1
QV999:BEM:20:20:1:1
But I only need lines 3, 5, 7 and 9.
I'm not sure what the best way to do this is, or whether I even need to store the items in arrays.
I thought about retrieving all the text from the text files and making a new file containing all the data, but I'm not sure how to do that.
This is the code that I have right now.
#!/bin/bash
Node01=($(cat Node01.pc))
Node02=($(cat Node02.pc))
Node03=($(cat Node03.pc))
Node04=($(cat Node04.pc))
Node05=($(cat Node05.pc))
NodeCPU=("${Node01[2]}" "${Node02[2]}" "${Node03[2]}" "${Node04[2]}" "${Node05[2]}")
NodeMEM=("${Node01[4]}" "${Node02[4]}" "${Node03[4]}" "${Node04[4]}" "${Node05[4]}")
NodeHDD=("${Node01[6]}" "${Node02[6]}" "${Node03[6]}" "${Node04[6]}" "${Node05[6]}")
NodeNET=("${Node01[8]}" "${Node02[8]}" "${Node03[8]}" "${Node04[8]}" "${Node05[8]}")
seperator=----------------------
seperator=$seperator$seperator
rows="%-10s| %-7s| %-7s| %-7s| %-7s\n"
TableWidth=140
printf "%-10s| %-7s| %-7s| %-7s| %-7s\n" NodeNumber CPU MEM HDD NET
printf "%.${TableWidth}s\n" "$seperator"
for((i=0;i<=4;i++))
do
printf "$rows" "$(( $i+1 ))" "${NodeCPU[i]}" "${NodeMEM[i]}" "${NodeHDD[i]}" "${NodeNET[i]}"
done
read
This is an example of what I want to display
NodeNumber | CPU | MEM | HDD | NET
----------------------------------
1 | 10 | 20 | 20 | 40
2 | 10 | 20 | 20 | 40
3 | 10 | 20 | 20 | 40
4 | 10 | 20 | 20 | 40
5 | 10 | 20 | 20 | 40
EDIT This is what I'm currently getting:
NodeNumber| CPU | MEM | HDD | NET
--------------------------------------------
| 4 | 70
| 5 | 90
| 6 | 100
| 6 | 70
| 40 | 40
The issue I'm having is with
printf "$rows" "$(( $i+1 ))" "${NodeCPU[i]}" "${NodeMEM[i]}" "${NodeHDD[i]}" "${NodeNET[i]}"
Why worry about all the separate arrays? Simply loop over all Node*.pc files in the current directory, read the contents of each file into an array with readarray, and then output the file count and element nos. 2, 4, 6 and 8 of the array in the proper format (adjust the elements output as needed), e.g.
#!/bin/bash
cnt=1 ## file counter
## print heading
printf "NodeNumber | CPU | MEM | HDD | NET\n----------------------------------\n"
for i in Node*.pc; do ## loop over all Node*.pc files in directory
readarray -t node < "$i" ## read contents into array
## output count and elements 2, 4, 6, 8 in proper format
printf "%-11s| %-4s| %-4s| %-4s| %s\n" $((cnt++)) \
"${node[2]}" "${node[4]}" "${node[6]}" "${node[8]}"
done
Example Use/Output
With the example data shown copied to the file Node01.pc in the current directory, you would get:
$ bash node.sh
NodeNumber | CPU | MEM | HDD | NET
----------------------------------
1 | 70 | 80 | 4 | 4
(I called the script node.sh)
It would output the information from each file as separate lines numbered 1, 2, ... Look things over and let me know if this is what you intended. (You can also do the same thing faster with awk, by setting FS="\n" and treating the lines as columns in a single record.)
You can do the same thing in awk with:
awk '
BEGIN {
RS=""; FS="\n"
printf "NodeNumber | CPU | MEM | HDD | NET\n----------------------------------\n"
}
NF >= 9 {
printf "%-11s| %-4s| %-4s| %-4s| %s\n",++cnt,$3,$5,$7,$9
}
' Node*.pc
(note: in awk the field numbers are 1-based, while in bash the array indexes are 0-based)
Output is the same.

How to sum file sizes from an ls-like output log with Bytes, KiB, MiB, GiB

I have a pre-computed, ls-like output (it is not from the actual ls command) and I cannot modify or recalculate it. It looks as follows:
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
The log may contain at least GiB, MiB, KiB, and Bytes. Possible values include zeroes, and values with or without a decimal part, for each prefix:
0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB
A similar approach is the following
awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1; x = $1; y = 1; while( x > 1024 ) { x = (x + 1023)/1024; y++; } printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 ) { total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'
but it does not work correctly in my case:
$ head -n 10 files_sizes.log | awk '{print $3,$4}' | sort | awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1; x = $1; y = 1; while( x > 1024 ) { x = (x + 1023)/1024; y++; } printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 ) { total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'
0K Bytes
1.1K KiB
201K Bytes
201K Bytes
201K Bytes
201K Bytes
201K Bytes
411K Bytes
590K Bytes
920K Bytes
Total: 3.8M
This output calculates the size wrongly. My desired output is the correct total sum of the input log file:
0 Bytes
201 Bytes
201 Bytes
201 Bytes
411 Bytes
201 Bytes
1.1 KiB
201 Bytes
920 Bytes
590 Bytes
Total: 3.95742 KiB
NOTE
The correct value resulting from the sum of the Bytes is
201 * 5 + 411 + 920 + 590 = 2926, so the total after adding the KiB is
2.857422 + 1.1 = 3.957422 KiB = 4052.4 Bytes
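That arithmetic can be double-checked with bc (note that bc truncates rather than rounds at the given scale):
$ echo '201*5 + 411 + 920 + 590' | bc
2926
$ echo 'scale=7; 2926/1024 + 1.1' | bc
3.9574218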
[UPDATE]
I have updated with a comparison of the results from the KamilCuk, Ted Lyngmo and Walter A solutions, which give pretty much the same values:
$ head -n 10 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
117538 Bytes
$ head -n 1000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
1225857 Bytes
$ head -n 10000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
12087518 Bytes
$ head -n 1000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
77238840381 Bytes
$ head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
2306569381835 Bytes
and
$ head -n 10 files_sizes.log | ./count_files.sh
3.957422 KiB
$ head -n 1000 files_sizes.log | ./count_files.sh
1.168946 MiB
$ head -n 10000 files_sizes.log | ./count_files.sh
11.526325 MiB
$ head -n 1000000 files_sizes.log | ./count_files.sh
71.934024 GiB
$ head -n 100000000 files_sizes.log | ./count_files.sh
2.097807 TiB
and
(head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
2306563692898.8
where
2.097807 TiB = 2.3065631893 TB = 2306569381835 Bytes
Computationally, I have compared all three solutions for speed:
$ time head -n 100000000 files_sizes.log | ./count_files.sh
2.097807 TiB
real 2m7.956s
user 2m10.023s
sys 0m1.696s
$ time head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
2306569381835 Bytes
real 4m12.896s
user 5m45.750s
sys 0m4.026s
$ time (head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
2306563692898.8
real 4m31.249s
user 6m40.072s
sys 0m4.252s
Use numfmt to convert those numbers.
cat <<EOF |
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
2019-02-17 14:43:10 3.9 KiB folder/file9.txt
2019-02-17 14:43:10 2.7 MiB folder/file9.txt
2019-02-17 14:43:10 1.3 GiB folder/file9.txt
EOF
# extract 3rd and 4th column
tr -s ' ' | cut -d' ' -f3,4 |
# Remove space, remove "Bytes", remove "B"
sed 's/ //; s/Bytes//; s/B//' |
# convert to numbers
numfmt --from=auto |
# sum
awk '{s+=$1}END{print s}'
outputs the sum.
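If you also want the total back in human-readable form, the final sum can be fed through numfmt once more (a sketch; --to=iec-i prints binary prefixes such as Ki and Mi):
# same pipeline, with the sum converted back to a binary prefix
tr -s ' ' | cut -d' ' -f3,4 |
sed 's/ //; s/Bytes//; s/B//' |
numfmt --from=auto |
awk '{s+=$1} END {print s}' |
numfmt --to=iec-i --suffix=B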
For input like that described:
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
You could use a table of units that you'd like to be able to decode:
BEGIN {
unit["Bytes"] = 1;
unit["kB"] = 10**3;
unit["MB"] = 10**6;
unit["GB"] = 10**9;
unit["TB"] = 10**12;
unit["PB"] = 10**15;
unit["EB"] = 10**18;
unit["ZB"] = 10**21;
unit["YB"] = 10**24;
unit["KB"] = 1024;
unit["KiB"] = 1024**1;
unit["MiB"] = 1024**2;
unit["GiB"] = 1024**3;
unit["TiB"] = 1024**4;
unit["PiB"] = 1024**5;
unit["EiB"] = 1024**6;
unit["ZiB"] = 1024**7;
unit["YiB"] = 1024**8;
}
Then just sum it up in the main loop:
{
if($4 in unit) total += $3 * unit[$4];
else printf("ERROR: Can't decode unit at line %d: %s\n", NR, $0);
}
And print the result at the end:
END {
binaryunits[0] = "Bytes";
binaryunits[1] = "KiB";
binaryunits[2] = "MiB";
binaryunits[3] = "GiB";
binaryunits[4] = "TiB";
binaryunits[5] = "PiB";
binaryunits[6] = "EiB";
binaryunits[7] = "ZiB";
binaryunits[8] = "YiB";
for(i = 8;; --i) {
if(total >= 1024**i || i == 0) {
printf("%.3f %s\n", total/(1024**i), binaryunits[i]);
break;
}
}
}
Output:
3.957 KiB
Note that you can add a shebang at the beginning of the awk script to make it possible to run it on its own, so you won't have to put it in a bash script:
#!/usr/bin/awk -f
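For instance, if the three fragments above are saved together in one file, say sum.awk (an illustrative name), with the shebang as its first line, the script runs on its own:
$ chmod +x sum.awk
$ ./sum.awk files_sizes.log
3.957 KiB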
You can parse the input before sending them to bc:
echo "0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB" |
sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;
s/GiB/* 1024 * 1024 * 1024/; s/$/ + /' |
tr -d '\n' |
sed 's/+ $/\n/' |
bc
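For clarity, this is roughly what bc receives on its standard input after the substitutions (one long expression; spacing approximate):
0 + 3.9 * 1024 + 201 + 2.0 * 1024 + 2.7 * 1024 * 1024 + 1.3 * 1024 * 1024 * 1024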
When your sed doesn't support \n in the replacement, you can try replacing the '\n' with a real newline, like
sed 's/+ $/
/'
or add an echo after the parsing (and move part of the last sed into the first command to remove the last +):
(echo "0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB" | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;
s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
Very good idea from @KamilCuk to make use of numfmt. Based on his answer, here is an alternative command which uses a single awk call wrapping numfmt with a two-way pipe. It requires a recent version of GNU awk (OK with 5.0.1, unstable with 4.1.4, not tested in between).
LC_NUMERIC=C gawk '
BEGIN {
conv = "numfmt --from=auto"
PROCINFO[conv, "pty"] = 1
}
{
sub(/B.*/, "", $4)
print $3 $4 |& conv
conv |& getline val
sum += val
}
END { print sum }
' input
Notes
LC_NUMERIC=C (bash/ksh/zsh) is for portability on systems using a non-English locale.
PROCINFO[conv, "pty"] = 1 makes the output of numfmt flush on each line (to avoid a deadlock).
Let me give you a better way to work with ls: don't use it as a command, but as a find switch:
find . -maxdepth 1 -ls
This returns file sizes in a uniform unit, as explained in find's manpage, which makes them far easier to do calculations on.
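For example, the size in bytes is the 7th field of -ls output, so a total becomes a one-liner (a sketch):
# sum the 7th field (size in bytes) of find's -ls listing
find . -maxdepth 1 -type f -ls | awk '{ s += $7 } END { print s " Bytes" }'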
Good luck

Grep rows from top command based on a condition

[xxxxx@xxxx3 ~]$ top
top - 16:29:00 up 197 days, 19:06, 12 users, load average: 19.16, 21.08, 21.58
Tasks: 3668 total, 21 running, 3646 sleeping, 0 stopped, 1 zombie
Cpu(s): 14.1%us, 6.8%sy, 0.0%ni, 79.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264389504k total, 53305000k used, 211084504k free, 859908k buffers
Swap: 134217720k total, 194124k used, 134023596k free, 12854016k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19938 jai_web 20 0 3089m 2.9g 7688 R 100.0 1.1 0:10.26 Engine
19943 jai_web 20 0 3089m 2.9g 7700 R 100.0 1.1 0:10.14 Engine
20147 jai_web 20 0 610m 454m 3556 R 78.4 0.2 0:02.54 java
77169 jai_web 20 0 9414m 1.4g 29m S 21.3 0.6 38:51.69 java
20160 jai_web 20 0 362m 196m 3336 R 16.7 0.1 0:00.54 java
272287 jai_web 20 0 20.1g 2.0g 5784 S 15.1 0.8 165:39.50 java
26597 jai_web 20 0 6371m 134m 3444 S 9.6 0.1 429:41.97 java
From the snippet of the top command above, I want to grep the PIDs which have a TIME+ value greater than 10:00.00 and that belong to the 'java' process,
so I am expecting grep output as below:
77169 jai_web 20 0 9414m 1.4g 29m S 21.3 0.6 38:51.69 java
272287 jai_web 20 0 20.1g 2.0g 5784 S 15.1 0.8 165:39.50 java
26597 jai_web 20 0 6371m 134m 3444 S 9.6 0.1 429:41.97 java
I have tried the below:
top -p "$(pgrep -d ',' java)"
But it doesn't satisfy my condition. Please assist.
I would just do this for a one-time analysis.
$ top -n 1 -b | awk '$NF=="java" && $(NF-1) >= "10:00.00"'
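Note that the string comparison above is a quick hack: TIME+ is formatted as minutes:seconds.hundredths, and comparing it lexically misbehaves when the minute part has a different number of digits (for example "9:59.99" compares greater than "10:00.00"). A sketch that converts TIME+ to seconds before comparing:
top -n 1 -b | awk '$NF == "java" {
    split($(NF-1), t, ":")          # TIME+ looks like 429:41.97 (min:sec.hundredths)
    if (t[1] * 60 + t[2] >= 600)    # 600 seconds = 10 minutes
        print
}'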
Ok here is what I came up with...
You need to get the output of top, filter only the java lines, then check each line to see if the TIME is bigger than your limit. Here is what I did:
#!/bin/bash
#
tmpfile="/tmp/top.output"
top -b -o TIME -n 1 | grep java >$tmpfile
# filter each line and keep only the ones where TIME is bigger than a certain value
limit=10
while read line
do
# take the line and keep only the 12th field, which is the time value
# In that time value, keep only the first number (the minutes)
timevalue=$(echo $line | awk '{print $12}' | cut -d':' -f1)
# compare timevalue to the limit we set
if [ $timevalue -gt $limit ]
then
# output the entire line
echo $line
fi
done <$tmpfile
# cleanup
rm -f /tmp/top.output
The trick here is to extract the TIME value and keep only its first number (the minutes). The other digits are not significant, as long as the first number is bigger than the limit of 10.
Someone might know of a way to do it via grep, but I doubt it; I have never seen conditionals in grep.

Hadoop fs -du -h sorting by size for M, G, T, P, E, Z, Y

I am running this command --
sudo -u hdfs hadoop fs -du -h /user | sort -nr
and the output is not sorted in terms of gigabytes and terabytes.
I found this command -
hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
but it did not seem to work.
Is there a way, or a command-line flag I can use, to make it sort? The output should look like:
123T /xyz
124T /xyd
126T /vat
127G /ayf
123G /atd
Please help
hdfs dfs -du -h <PATH> | awk '{print $1$2,$3}' | sort -hr
Short explanation:
The hdfs command gets the input data.
The awk glues the size and its unit together ($1$2) and prints them followed by the path ($3); the comma in print inserts the output field separator (a space).
The -h of sort compares human readable numbers like 2K or 4G, while the -r reverses the sort order.
hdfs dfs -du -h <PATH> | sed 's/ //' | sort -hr
sed will strip out the space between the number and the unit, after which sort will be able to understand it.
This is a rather old question, but I stumbled across it while trying to do the same thing. As you were providing the -h (human-readable) flag, it was converting the sizes to different units to make them easier for a human to read. By leaving that flag off, we get the aggregate summary of file lengths (in bytes).
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr
Not as easy to read but means you can sort it correctly.
See https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#du for more details.
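If you still want human-readable sizes after sorting, and GNU coreutils is available, numfmt can convert the size column back for display (a sketch):
# sort on raw byte counts, then convert the size column for display
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr | numfmt --field=1 --to=iec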
I would use a small script. It's primitive but reliable:
#!/bin/bash
PATH_TO_FOLDER="$1"
hdfs dfs -du -h $PATH_TO_FOLDER > /tmp/output
cat /tmp/output | awk '$2 ~ /^[0-9]+$/ {print $1,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "K" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "M" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "G" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "T" ) print $1,$2,$NF}' | sort -k1,1n
rm /tmp/output
Try this to sort: hdfs dfs -ls -h /path | sort -r -n -k 5
-rw-r--r-- 3 admin admin 108.5 M 2016-05-05 17:23 /user/admin/2008.csv.bz2
-rw-r--r-- 3 admin admin 3.1 M 2016-05-17 16:19 /user/admin/warand_peace.txt
Found 11 items
drwxr-xr-x - admin admin 0 2016-05-16 17:34 /user/admin/oozie-oozi
drwxr-xr-x - admin admin 0 2016-05-16 16:35 /user/admin/Jars
drwxr-xr-x - admin admin 0 2016-05-12 05:30 /user/admin/.Trash
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_21
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_20
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_19
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_18
drwx------ - admin admin 0 2016-05-16 17:38 /user/admin/.staging

Ways to speed up my bash script?

I know it's a lot faster than doing things by hand, but is there any way to speed this script up? Multi-threading or something? I'm new to Unix and this is my first script =). I'm open to suggestions or any changes. The script seems to pause a lot, at random, on certain generated domains.
#!/bin/bash
for domain in $(pwgen -1A0B 2 10000);
do
whois $domain.com | egrep -q '^No match|^NOT FOUND|^Not fo|AVAILABLE|^No Data Fou|has not been regi|No entri'
if [ $? -eq 0 ]; then
echo "$domain.com : available"
else
echo "$domain.com"
fi
done
Before splitting and distributing:
WARNING This seems not to be useful: you are asking pwgen to build 10,000 lines formed by two characters between a and z, yet there are only echo $((26*26)) -> 676 possibilities (in fact, as pwgen tries to build speakable words, there are only 625 possibilities).
pwgen -1A0B 2 10000 | sort | uniq -c | sort -n | tail
27 ju
27 mu
27 vs
27 xt
27 zx
28 df
28 sy
28 zc
29 dp
29 zd
So with this command, you will do the same thing up to 29 times.
Running pwgen -1A0B 2 10000 ten times, to print how many different combinations are proposed, and which combinations were proposed the most and the fewest times:
for ((i=10;i--;)); do
echo $(
(
(
pwgen -1A0B 2 10000 |
sort |
uniq -c |
sort -n |
tee /dev/fd/6 |
wc -l >/dev/fd/7
) 6>&1 | (
head -n1
tail -n1
)
) 7>&1
)
done
6 bd 625 31 bn
3 bj 625 29 sq
6 je 625 30 ey
4 ac 625 30 sz
5 ds 625 29 wf
4 xw 625 28 qb
4 jj 625 30 pa
6 io 625 29 sg
4 vw 625 30 kb
5 fz 625 31 os
this prints:
| | | | |
| | | | \- max proposed pattern
| | | \---- number of times the max proposed pattern was issued
| | \-------- number of different proposed combinations
| \----------- min proposed pattern
\-------------- number of times the min proposed pattern was issued
Create a file with desired domain names first. Call this domains.lst:
pwgen -1A0B 2 10000 > domains.lst
Then create smaller files out of this:
split --lines=100 domains.lst domains.lst.
Then create a script which takes a file name and processes that file using whois, also creating an output file input.out.
Then create another script that uses & to start the above script in the background for all the small chunks, and merge the outputs after all background tasks finish, as sketched below.
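A sketch of those two scripts; the names check.sh, domains.lst.* and domains.out are illustrative, not prescribed above:
#!/bin/bash
# check.sh: test every domain in one chunk file with whois,
# writing the results next to the chunk as <chunk>.out
chunk="$1"
while read -r domain; do
    if whois "$domain.com" |
        grep -Eq '^No match|^NOT FOUND|^Not fo|AVAILABLE|^No Data Fou|has not been regi|No entri'
    then
        echo "$domain.com : available"
    else
        echo "$domain.com"
    fi
done < "$chunk" > "$chunk.out"
And the driver that fans the chunks out in the background:
for f in domains.lst.*; do
    ./check.sh "$f" &   # one background whois job per chunk
done
wait                    # wait for all background jobs to finish
cat domains.lst.*.out > domains.out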
