How to sum file sizes from an ls-like output log with Bytes, KiB, MiB, GiB - bash

I have a pre-computed ls-like output (it is not from the actual ls command) and I cannot modify or recalculate it. It looks as follows:
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
The log may contain at least GiB, MiB, KiB and Bytes. Possible values include zero and values with or without a decimal part, for each of the prefixes:
0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB
A similar approach is the following:
awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1; x = $1; y = 1; while( x > 1024 ) { x = (x + 1023)/1024; y++; } printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 ) { total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'
but it does not work correctly in my case:
$ head -n 10 files_sizes.log | awk '{print $3,$4}' | sort | awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1; x = $1; y = 1; while( x > 1024 ) { x = (x + 1023)/1024; y++; } printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 ) { total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'
0K Bytes
1.1K KiB
201K Bytes
201K Bytes
201K Bytes
201K Bytes
201K Bytes
411K Bytes
590K Bytes
920K Bytes
Total: 3.8M
This output calculates the size incorrectly. My desired output is the correct total sum of the input log file:
0 Bytes
201 Bytes
201 Bytes
201 Bytes
411 Bytes
201 Bytes
1.1 KiB
201 Bytes
920 Bytes
590 Bytes
Total: 3.95742 KiB
NOTE
The correct value for the sum of the Bytes entries is
201 * 5 + 411 + 590 + 920 = 2926 Bytes, so the total after adding the KiB entry is
2.857422 + 1.1 = 3.957422 KiB = 4052.400 Bytes
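As a quick check (not part of the original post), the same arithmetic can be reproduced with bc:
echo '(201 * 5 + 411 + 590 + 920 + 1.1 * 1024) / 1024' | bc -l
which evaluates to 3.957421875 KiB.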
[UPDATE]
I have updated the question with a comparison of the results from the KamilCuk, Ted Lyngmo and Walter A solutions, which give pretty much the same values:
$ head -n 10 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
117538 Bytes
$ head -n 1000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
1225857 Bytes
$ head -n 10000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
12087518 Bytes
$ head -n 1000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
77238840381 Bytes
$ head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
2306569381835 Bytes
and
$ head -n 10 files_sizes.log | ./count_files.sh
3.957422 KiB
$ head -n 1000 files_sizes.log | ./count_files.sh
1.168946 MiB
$ head -n 10000 files_sizes.log | ./count_files.sh
11.526325 MiB
$ head -n 1000000 files_sizes.log | ./count_files.sh
71.934024 GiB
$ head -n 100000000 files_sizes.log | ./count_files.sh
2.097807 TiB
and
(head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
2306563692898.8
where
2.097807 TiB = 2.3065631893 TB = 2306569381835 Bytes
I have also compared all three solutions for speed:
$ time head -n 100000000 files_sizes.log | ./count_files.sh
2.097807 TiB
real 2m7.956s
user 2m10.023s
sys 0m1.696s
$ time head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/ //; s/Bytes//; s/B//' | gnumfmt --from=auto | awk '{s+=$1}END{print s " Bytes"}'
2306569381835 Bytes
real 4m12.896s
user 5m45.750s
sys 0m4.026s
$ time (head -n 100000000 files_sizes.log | tr -s ' ' | cut -d' ' -f3,4 | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
2306563692898.8
real 4m31.249s
user 6m40.072s
sys 0m4.252s

Use numfmt to convert those numbers.
cat <<EOF |
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
2019-02-17 14:43:10 3.9 KiB folder/file9.txt
2019-02-17 14:43:10 2.7 MiB folder/file9.txt
2019-02-17 14:43:10 1.3 GiB folder/file9.txt
EOF
# extract 3rd and 4th column
tr -s ' ' | cut -d' ' -f3,4 |
# Remove space, remove "Bytes", remove "B"
sed 's/ //; s/Bytes//; s/B//' |
# convert to numbers
numfmt --from=auto |
# sum
awk '{s+=$1}END{print s}'
outputs the sum.
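If you also want the total printed back in human-readable IEC units, the byte sum can be fed through numfmt a second time with --to=iec-i. A sketch based on the same pipeline (files_sizes.log is the log file name used in the question):
tr -s ' ' < files_sizes.log | cut -d' ' -f3,4 |
# remove space, remove "Bytes", remove "B", as above
sed 's/ //; s/Bytes//; s/B//' |
numfmt --from=auto |
# sum and print as a plain integer
awk '{s+=$1} END {printf "%.0f\n", s}' |
# convert back to a human-readable size, e.g. 4052 -> 4.0KiB
numfmt --to=iec-i --suffix=B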

For input like the one described:
2016-10-14 14:52:09 0 Bytes folder/
2020-04-18 05:19:04 201 Bytes folder/file1.txt
2019-10-16 00:32:44 201 Bytes folder/file2.txt
2019-08-26 06:29:46 201 Bytes folder/file3.txt
2020-07-08 16:13:56 411 Bytes folder/file4.txt
2020-04-18 03:03:34 201 Bytes folder/file5.txt
2019-10-16 08:27:11 1.1 KiB folder/file6.txt
2019-10-16 10:13:52 201 Bytes folder/file7.txt
2019-10-16 08:44:35 920 Bytes folder/file8.txt
2019-02-17 14:43:10 590 Bytes folder/file9.txt
You could use a table of units that you'd like to be able to decode:
BEGIN {
    unit["Bytes"] = 1;

    unit["kB"] = 10**3;
    unit["MB"] = 10**6;
    unit["GB"] = 10**9;
    unit["TB"] = 10**12;
    unit["PB"] = 10**15;
    unit["EB"] = 10**18;
    unit["ZB"] = 10**21;
    unit["YB"] = 10**24;

    unit["KB"] = 1024;
    unit["KiB"] = 1024**1;
    unit["MiB"] = 1024**2;
    unit["GiB"] = 1024**3;
    unit["TiB"] = 1024**4;
    unit["PiB"] = 1024**5;
    unit["EiB"] = 1024**6;
    unit["ZiB"] = 1024**7;
    unit["YiB"] = 1024**8;
}
Then just sum it up in the main loop:
{
    if ($4 in unit) total += $3 * unit[$4];
    else printf("ERROR: Can't decode unit at line %d: %s\n", NR, $0);
}
And print the result at the end:
END {
    binaryunits[0] = "Bytes";
    binaryunits[1] = "KiB";
    binaryunits[2] = "MiB";
    binaryunits[3] = "GiB";
    binaryunits[4] = "TiB";
    binaryunits[5] = "PiB";
    binaryunits[6] = "EiB";
    binaryunits[7] = "ZiB";
    binaryunits[8] = "YiB";
    for (i = 8;; --i) {
        if (total >= 1024**i || i == 0) {
            printf("%.3f %s\n", total/(1024**i), binaryunits[i]);
            break;
        }
    }
}
Output:
3.957 KiB
Note that you can add a shebang at the beginning of the awk script to make it runnable on its own, so that you won't have to put it in a bash script:
#!/usr/bin/awk -f
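Assuming the three blocks above are saved together in one executable file, a typical invocation could then look like this (sum_sizes.awk is a placeholder name, not one from the answer):
chmod +x sum_sizes.awk
./sum_sizes.awk files_sizes.log
head -n 10 files_sizes.log | ./sum_sizes.awk   # reading from a pipe also works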

You can parse the input before sending it to bc:
echo "0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB" |
sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;
s/GiB/* 1024 * 1024 * 1024/; s/$/ + /' |
tr -d '\n' |
sed 's/+ $/\n/' |
bc
If your sed doesn't support \n in the replacement, you can try replacing the '\n' with a real newline, like
sed 's/+ $/
/'
or add an echo after parsing (and move the removal of the trailing + from the last sed into the first command):
(echo "0 Bytes
3.9 KiB
201 Bytes
2.0 KiB
2.7 MiB
1.3 GiB" | sed 's/Bytes//; s/KiB/* 1024/; s/MiB/* 1024 * 1024/;
s/GiB/* 1024 * 1024 * 1024/; s/$/ + /; $s/+ //' | tr -d '\n' ; echo) | bc
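For reference (added here for clarity, not part of the original answer), the expression handed to bc after the sed and tr stages is a single line that looks roughly like this, so bc simply evaluates one long sum of byte values:
0 + 3.9 * 1024 + 201 + 2.0 * 1024 + 2.7 * 1024 * 1024 + 1.3 * 1024 * 1024 * 1024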

Very good idea from @KamilCuk to make use of numfmt. Based on his answer, here is an alternative command which uses a single awk call wrapping numfmt with a two-way pipe. It requires a recent version of GNU awk (OK with 5.0.1, unstable with 4.1.4, not tested in between).
LC_NUMERIC=C gawk '
BEGIN {
    conv = "numfmt --from=auto"
    PROCINFO[conv, "pty"] = 1
}
{
    sub(/B.*/, "", $4)
    print $3 $4 |& conv
    conv |& getline val
    sum += val
}
END { print sum }
' input
Notes
LC_NUMERIC=C (bash/ksh/zsh) is for portability on systems using a non-English locale.
PROCINFO[conv, "pty"] = 1 causes the output of numfmt to be flushed on each line (to avoid a deadlock).

Let me give you a better way to get such a listing: don't use the ls command, but find's -ls switch:
find . -maxdepth 1 -ls
This returns file sizes in a uniform unit, as explained in find's manpage, which makes it far easier to do calculations on.
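For example (a sketch, not part of the original answer), with GNU find the size in bytes is the 7th field of the -ls output, so the total can be summed directly:
find . -maxdepth 1 -type f -ls | awk '{sum += $7} END {print sum, "bytes"}'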
Good luck

Related

How to convert file size to human readable and print with other columns?

I want to convert the 5th column in this command output to human readable format.
For ex if this is my input :
-rw-rw-r-- 1 bhagyaraj bhagyaraj 280000 Jun 17 18:34 demo1
-rw-rw-r-- 1 bhagyaraj bhagyaraj 2800000 Jun 17 18:34 demo2
-rw-rw-r-- 1 bhagyaraj bhagyaraj 28000000 Jun 17 18:35 demo3
To something like this :
-rw-rw-r-- 280K demo1
-rw-rw-r-- 2.8M demo2
-rw-rw-r-- 28M demo3
I tried this command, but it returns only the file size column.
ls -l | tail -n +2 |awk '{print $5 | "numfmt --to=si"}'
ls is just an example; my use case is very large and repeated execution must be avoided.
Any help would be appreciated :)
Just use -h --si
-h, --human-readable with -l and -s, print sizes like 1K 234M 2G etc.
--si likewise, but use powers of 1000 not 1024
So the command would be
ls -lh --si | tail -n +2
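If you also want only the permissions, size and name columns as in the desired output, one option (a sketch; the field numbers assume the standard ls -l layout and filenames without spaces) is:
ls -lh --si | tail -n +2 | awk '{print $1, $5, $9}'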
If you don't use ls, and the command you intend to run doesn't have an option similar to ls's -h --si, then numfmt has the --field option to specify which column you want to format. For example:
$ df | LC_ALL=en_US.UTF-8 numfmt --header --field 2-4 --to=si
Filesystem 1K-blocks Used Available Use% Mounted on
udev 66M 0 66M 0% /dev
tmpfs 14M 7.2K 14M 1% /run
/dev/mapper/vg0-lv--0 4.1G 3.7G 416M 90% /
tmpfs 5.2K 4 5.2K 1% /run/lock
/dev/nvme2n1p1 524K 5.4K 518K 2% /boot/efi
Unfortunately, although numfmt does try to preserve the columnation, it fails if there is a large variation in line length after inserting group separators, as you can see above. So sometimes you might still need to reformat the table with column:
df | LC_ALL=en_US.UTF-8 numfmt --header --field 2-4 --to=si | column -t -R 2,3,4,5
The -R 2,3,4,5 option is for right alignment, but some column versions, like the default one in Ubuntu, don't support it, so you may need to remove that option.
Alternatively, you can use awk to format only the columns you want, for example column 5 in the case of ls:
$ ls -l demo* | awk -v K=1e3 -v M=1e6 -v G=1e9 'func format(v) {
if (v > G) return v/G "G"; else if (v > M) return v/M "M";
else if (v > K) return v/K "K"; else return v
} { $5 = format($5); print $0 }' | column -t
-rw-rw-r-- 1 ph ph 280K Jun 18 09:23 demo1
-rw-rw-r-- 1 ph ph 2.8M Jun 18 09:24 demo2
-rw-rw-r-- 1 ph ph 28M Jun 18 09:23 demo3
-rw-rw-r-- 1 ph ph 2.8G Jun 18 09:30 demo4
And columns 2, 3 and 4 in the case of df:
# M=1000 and G=1000000 because df output is in 1K blocks, not bytes
$ df | awk -v M=1000 -v G=1000000 'func format(v) {
if (v > G) return v/G "G"; else if (v > M) return v/M "M"; else return v
}
{
# Format only columns 2, 3 and 4, ignore header
if (NR > 1) { $2 = format($2); $3 = format($3); $4 = format($4) }
print $0
}' OFS="\t" | column -t
Filesystem 1K-blocks Used Available Use% Mounted on
udev 65.8273G 0 65.8273G 0% /dev
tmpfs 13.1772G 7M 13.1702G 1% /run
/dev/mapper/vg0-lv--0 4073.78G 3619.05G 415.651G 90% /
tmpfs 65.8861G 0 65.8861G 0% /dev/shm
tmpfs 5.12M 4 5.116M 1% /run/lock
tmpfs 65.8861G 0 65.8861G 0% /sys/fs/cgroup
/dev/nvme2n1p2 999.32M 363.412M 567.096M 40% /boot
UPDATE 1:
If you need just a barebones module for byte-size formatting (it's set up for base 2 now, but modifying it for --si should be trivial):
{m,g}awk '
BEGIN { OFS="="
_____=log(__=(_+=_+=_^=_)^(____=++_))
} gsub("[.]0+ bytes",
" -bytes-",
$(!__*($++NF = sprintf("%#10.*f %s" ,____,
(_ = $!!__) / __^(___=int(log(_)/_____)),
!___ ? "bytes" : substr("KMGTPEZY",___,!!__)"iB"))))^!__'
=
734 734 -bytes-
180043 175.82324 KiB
232819 227.36230 KiB
421548373 402.01986 MiB
838593829 799.74540 MiB
3739382399 3.48257 GiB
116601682159 108.59378 GiB
147480014471 137.35147 GiB
11010032230111 10.01357 TiB
19830700070261 18.03592 TiB
111120670776601 101.06366 TiB
15023323323323321 13.34339 PiB
85255542224555233 75.72213 PiB
444444666677766611 394.74616 PiB
106941916666944416909 92.75733 EiB
111919999919911191991 97.07513 EiB
767777766776776776777767 650.33306 ZiB
5558888858993555888686669 4.59821 YiB
========================
This is probably way overkill, but I wrote it a while back; it can calculate the human-readable value as well as a comma-formatted form of the raw byte value, supporting everything from kilobits to yottabytes,
with options for:
base 2 or base 10 (enter 10 or "M"/"m" for metric)
bytes (B) or bits (b)
The only things that need to be hard-coded are the unit letters themselves, since they grow linearly with either
every 3rd power of 10 (1,000), or
every 5th power of 4 (1,024).
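As an illustration of that idea, here is a minimal, un-obfuscated sketch (not the author's code) that picks the unit letter by counting how many times the base divides the value; the deliberately obfuscated version below does essentially this, plus comma-grouping of the raw byte count:
awk -v base=1024 '{
    n = $1 + 0; e = 0
    # divide by the base until the value fits, counting the exponent
    while (n >= base && e < 8) { n /= base; e++ }
    printf "%.4f %s\n", n, (e ? substr("KMGTPEZY", e, 1) (base == 1024 ? "iB" : "B") : "bytes")
}'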
{m,g}awk '
BEGIN {
1 FS = OFS = "="
}
2302812 $!NF = substr(bytesformat($2, 10, "B"), 1, 15)\
substr(bytesformat($2, 2, "B"), 1, 15)\
bytesformat($2, 2, "b")'
# Functions, listed alphabetically
6908436 function bytesformat(_,_______,________,__, ___, ____, _____, ______)
{
6908436 _____=__=(____^=___*=((____=___+=___^= "")/___)+___+___)
6908436 ___/=___
6908436 sub("^0+","",_)
6908436 ____=_____-= substr(_____,index(_____,index(_____,!__))) * (_______~"^(10|[Mm])$")
6908436 _______=length((____)____)^(________~"^b(it)?$")
6908436 if ((____*__) < (_______*_)) { # 6906267
24438981 do {
24438981 ____*=_____
24438981 ++___
} while ((____*__) < (_______*_))
}
6908436 __=_
6908436 sub("(...)+$", ",&", __)
6908436 gsub("[^#-.][^#-.][^#-.]", "&,", __)
6908436 gsub("[,]*$|^[,]+", "", __)
6908436 sub("^[.]", "0&", __)
6908436 return \
sprintf("%10.4f %s%s | %s byte%.*s",
_=="" ? +_:_/(_____^___)*_______,
substr("KMGTPEZY", ___, _^(_<_)),
--_______?"b":"B",__==""?+__:__,(_^(_<_))<_,"s")
}
In this sample, it's showing metric bytes, binary bytes, binary bits, and the raw input byte value:
180.0430 KB | 175.8232 KB | 1.3736 Mb | 180,043 bytes
232.8190 KB | 227.3623 KB | 1.7763 Mb | 232,819 bytes
421.5484 MB | 402.0199 MB | 3.1408 Gb | 421,548,373 bytes
838.5938 MB | 799.7454 MB | 6.2480 Gb | 838,593,829 bytes
3.7394 GB | 3.4826 GB | 27.8606 Gb | 3,739,382,399 bytes
116.6017 GB | 108.5938 GB | 868.7502 Gb | 116,601,682,159 bytes
147.4800 GB | 137.3515 GB | 1.0731 Tb | 147,480,014,471 bytes
11.0100 TB | 10.0136 TB | 80.1085 Tb | 11,010,032,230,111 bytes
19.8307 TB | 18.0359 TB | 144.2873 Tb | 19,830,700,070,261 bytes
111.1207 TB | 101.0637 TB | 808.5093 Tb | 111,120,670,776,601 bytes
15.0233 PB | 13.3434 PB | 106.7471 Pb | 15,023,323,323,323,321 bytes
85.2555 PB | 75.7221 PB | 605.7771 Pb | 85,255,542,224,555,233 bytes
444.4447 PB | 394.7462 PB | 3.0840 Eb | 444,444,666,677,766,611 bytes
106.9419 EB | 92.7573 EB | 742.0586 Eb | 106,941,916,666,944,416,909 bytes
111.9200 EB | 97.0751 EB | 776.6010 Eb | 111,919,999,919,911,191,991 bytes
767.7778 ZB | 650.3331 ZB | 5.0807 Yb | 767,777,766,776,776,776,777,767 bytes
5.5589 YB | 4.5982 YB | 36.7856 Yb | 5,558,888,858,993,555,888,686,669 bytes

Calculating percentile for each request from a log file based on start time and end time using bash script

I have a simulation.log file containing results like the ones below, and I want to calculate the 5th, 25th, 95th and 99th percentile for each request using a shell script that reads the file.
Below is a sample simulation.log file, where 1649410339141 and 1649410341026 are the start and end times in milliseconds.
REQUEST1 somelogprinted TTP123099SM000202 002 1649410339141 1649410341026 OK
REQUEST2 somelogprinted TTP123099SM000202 001 1649410339141 1649410341029 OK
......
I tried the code below, but it did not give me any result (I am not a Unix developer):
FILE=filepath
sort -n $* > $FILE
N=$(wc -l $FILE | awk '{print $1}')
P50=$(dc -e "$N 2 / p")
P90=$(dc -e "$N 9 * 10 / p")
P99=$(dc -e "$N 99 * 100 / p") echo ";;
50th, 90th and 99th percentiles for
$N data points" awk "FNR==$P50 || FNR==$P90 || FNR==$P99" $FILE
Sample output:
Request | 5thpercentile | 25Percentile | 95Percentile | 99Percentile
Request1 | 657 | 786 | 821 | 981
Request2 | 453 | 654 | 795 | 854
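One hedged sketch of an approach (not from the original thread): take the duration as end minus start (fields 6 and 5 in the sample log), group by the request name in field 1, sort each group, and pick values by index. It assumes GNU awk for asort():
gawk '
{ dur[$1][++n[$1]] = $6 - $5 }          # duration in ms, grouped by request
END {
    print "Request | 5thPercentile | 25Percentile | 95Percentile | 99Percentile"
    for (r in dur) {
        m = asort(dur[r], s)            # sort this request's durations ascending
        printf "%s | %d | %d | %d | %d\n", r,
            s[int(m*0.05)+1], s[int(m*0.25)+1], s[int(m*0.95)+1], s[int(m*0.99)+1]
    }
}' simulation.log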

Join two csv files if value is between interval in file 2

I have two csv files that I need to join. F1 has millions of lines, F2 has thousands of lines. I need to join these files if the position in file F1 (F1.pos) is between F2.start and F2.end. Is there any way to do this in bash? I already have code in Python (pandas + sqlite3), but I am looking for something quicker.
Table F1 looks like:
| name | pos |
|------ |------ |
| a | 1020 |
| b | 1200 |
| c | 1800 |
Table F2 looks like:
| interval_name | start | end |
|--------------- |------- |------ |
| int1 | 990 | 1090 |
| int2 | 1100 | 1150 |
| int3 | 500 | 2000 |
Result should look like:
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int1 | 990 | 1090 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
DISCLAIMER: Use dedicated/local tools if available, this is hacking:
There is an apparent error in your desired output: name b should not match int1.
$ tail -n+1 *.csv
==> f1.csv <==
name,pos
a,1020
b,1200
c,1800
==> f2.csv <==
interval_name,start,end
int1,990,1090
int2,1100,1150
int3,500,2000
$ awk -F, -vOFS=, '
BEGIN {
    print "name,pos,interval_name,start,end"
    PROCINFO["sorted_in"]="#ind_num_asc"
}
FNR==1 {next}
NR==FNR {Int[$1] = $2 "," $3; next}
{
    for(i in Int) {
        split(Int[i], I)
        if($2 >= I[1] && $2 <= I[2]) print $0, i, Int[i]
    }
}
' f2.csv f1.csv
Outputs:
name,pos,interval_name,start,end
a,1020,int1,990,1090
a,1020,int3,500,2000
b,1200,int3,500,2000
c,1800,int3,500,2000
This is not particularly efficient in any way; the only sorting used is to ensure that the Int array is parsed in the correct order, which changes if your sample data is not indicative of the actual schema. I would be very interested to know how my solution performs vs pandas.
Here's one in awk. It hashes the smaller file's records into arrays, and for each record of the bigger file it iterates through the hashes, so it is slow:
$ awk '
NR==FNR {                  # hash f2 records
    start[NR]=$4
    end[NR]=$6
    data[NR]=substr($0,2)
    next
}
FNR<=2 {                   # mind the front matter
    print $0 data[FNR]
}
{                          # check if in range and output
    for(i in start)
        if($4>start[i] && $4<end[i])
            print $0 data[i]
}' f2 f1
Output:
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
I doubt that a bash script would be faster than a python script. Just don't import the files into a database – write a custom join function instead!
The best way to join depends on your input data. If nearly all F1.pos values are inside nearly all intervals, then a naive approach would be the fastest. The naive approach in bash would look like this:
#! /bin/bash
join --header -t, -j99 F1 F2 |
sed 's/^,//' |
awk -F, 'NR>1 && $2 >= $4 && $2 <= $5'
# NR>1 is only there to skip the column headers
(With -j99, join is told to use a field that does not exist, so every line gets an empty join key and the output is the full cross product of F1 and F2; the leading empty key field is what the sed strips before awk filters on the interval.) However, this will be very slow if there are only a few intersections, for instance when the average F1.pos lies in only 5 intervals. In that case the following approach will be way faster. Implement it in a programming language of your choice – bash is not appropriate for this:
Sort F1 by pos in ascending order.
Sort F2 by start and then by end in ascending order.
For each sorted file, keep a pointer to a line, starting at the first line.
Repeat until F1's pointer reaches the end:
For the current F1.pos advance F2's pointer until F1.pos ≥ F2.start.
Lock F2's pointer, but continue to read lines until F1.pos ≤ F2.end. Print the read lines in the output format name,pos,interval_name,start,end.
Advance F1's pointer by one line.
Only sorting the files could actually be faster in bash. Here is a script to sort both files:
#! /bin/bash
sort -t, -n -k2 F1-without-headers > F1-sorted
sort -t, -n -k2,3 F2-without-headers > F2-sorted
Consider using LC_ALL=C, -S N% and --parallel N to speed up the sorting process.
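For example, the three tweaks could be combined like this (a sketch; the 50% buffer size and 4 threads are placeholder values):
LC_ALL=C sort -t, -n -k2 -S 50% --parallel=4 F1-without-headers > F1-sorted
LC_ALL=C sort -t, -n -k2,3 -S 50% --parallel=4 F2-without-headers > F2-sorted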

Split a column into separate columns based on value

I have a tab-delimited file that looks as follows:
cat myfile.txt
gives:
1 299
1 150
1 50
1 57
2 -45
2 62
3 515
3 215
3 -315
3 -35
3 3
3 6789
3 34
5 66
5 1334
5 123
I'd like to use Unix commands to get a tab-delimited file in which, based on the values in column #1, each column of the output file holds all the corresponding values of column #2.
(I'm using the separator "|" here instead of a tab only to illustrate my desired output file):
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
The corresponding headers (1, 2, 3, 5; based on the column #1 values) would be a nice addition to the code (as shown below), but the main request is to split the information of the first file into separate columns. Thanks!
1 | 2 | 3 | 5
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
Here's a one-liner that matches your output. It builds a string $ARGS containing as many process substitutions as there are unique values in the first column. Then $ARGS is used as the argument for the paste command:
HEADERS=$(cut -f 1 file.txt | sort -n | uniq); ARGS=""; for h in $HEADERS; do ARGS+=" <(grep ^"$h"$'\t' file.txt | cut -f 2)"; done; echo $HEADERS | tr ' ' '|'; eval "paste -d '|' $ARGS"
Output:
1|2|3|5
299|-45|515|66
150|62|215|1334
50||-315|
57||-35|
||3|
You can use gnu-awk:
awk '
BEGIN{max=0;}
{
    d[$1][length(d[$1])+1] = $2;
    if(length(d[$1])>max)
        max = length(d[$1]);
}
END{
    PROCINFO["sorted_in"] = "#ind_num_asc";
    line = "";
    flag = 0;
    for(j in d){
        line = line (flag?"\t|\t":"") j;
        flag = 1;
    }
    print line;
    for(i=1; i<=max; ++i){
        line = "";
        flag = 0;
        for(j in d){
            line = line (flag?"\t|\t":"") d[j][i];
            flag = 1;
        }
        print line;
    }
}' file.txt
you get:
1 | 2 | 3 | 5
299 | -45 | 515 | 66
150 | 62 | 215 | 1334
50 | | -315 |
57 | | -35 |
| | 3 |
Or you can use Python, for example in split2Columns.py:
import sys
records = [line.split() for line in open(sys.argv[1])]
import collections
records_dict = collections.defaultdict(list)
for key, val in records:
records_dict[key].append(val)
from itertools import izip_longest
print "\t|\t".join(records_dict.keys())
print "\n".join(("\t|\t".join(map(str,l)) for l in izip_longest(*records_dict.values(), fillvalue="")))
python split2Columns.py file.txt
you get the same result (note that this script is Python 2; for Python 3, use itertools.zip_longest and print() instead).
@Jose Ricardo Bustos M. - thanks for your answer! Unfortunately I couldn't install gnu-awk on my Mac, but based on your answer I've done something similar using plain awk:
HEADERS=$(cut -f 1 try.txt | awk '!x[$0]++');
H=( ${HEADERS// / });
MAXUNIQNUM=$(cut -f 1 try.txt |uniq -c|awk '{print $1}'|sort -nr|head -1);
awk -v header="${H[*]}" -v max=$MAXUNIQNUM \
'BEGIN {
    split(header, headerlist, " ");
    for (q = 1; q <= length(headerlist); q++)
        {counter[q] = 1;}
}
{
    for (z = 1; z <= length(headerlist); z++){
        if (headerlist[z] == $1){
            arr[counter[z], headerlist[z]] = $2;
            counter[z]++
        };
    }
}
END {
    for (x = 1; x <= max; x++){
        for (y = 1; y <= length(headerlist); y++){
            printf "%s\t", arr[x, headerlist[y]];
        }
        printf "\n"
    }
}' try.txt
This uses an array to keep track of the column headings, uses them to name temporary files, and pastes everything together at the end:
#!/bin/bash
infile=$1
filenames=()
idx=0
while read -r key value; do
if [[ "${filenames[$idx]}" != "$key" ]]; then
(( ++idx ))
filenames[$idx]="$key"
echo -e "$key\n----" > "$key"
fi
echo "$value" >> "$key"
done < "$1"
paste "${filenames[#]}"
rm "${filenames[#]}"

Sum of Columns for multiple variables

Using a shell script (Bash), I am trying to sum the columns for all the different variables in a list. Suppose I have the following input in a Test.tsv file:
Win Lost
Anna 1 1
Charlotte 3 1
Lauren 5 5
Lauren 6 3
Charlotte 3 2
Charlotte 4 5
Charlotte 2 5
Anna 6 4
Charlotte 2 3
Lauren 3 6
Anna 1 2
Anna 6 2
Lauren 2 1
Lauren 5 5
Lauren 6 6
Charlotte 1 3
Anna 1 4
And I want to sum up how much each of the participants has won and lost. So I want to get this as a result:
Sum Win Sum Lost
Anna 57 58
Charlotte 56 57
Lauren 53 56
What I would usually do is take the sum per person and per column and repeat that process over and over. See below how I would do it for the example mentioned:
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f2 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f2 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bAnna\b' | cut -f3 -d$'\t' |paste -sd+ | bc > Output.tsv
cat Test.tsv | grep -Pi '\bCharlotte\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
cat Test.tsv | grep -Pi '\bLauren\b' | cut -f3 -d$'\t' |paste -sd+ | bc >> Output.tsv
However, I would need to repeat these lines for every participant, which becomes a pain when there are too many variables to sum.
What would be the way to write this script?
Thanks!
This is pretty straightforward with awk. Using GNU awk:
awk -F '\t' 'BEGIN { OFS = FS } NR > 1 { won[$1] += $2; lost[$1] += $3 } END { PROCINFO["sorted_in"] = "#ind_str_asc"; print "", "Sum Win", "Sum Lost"; for(p in won) print p, won[p], lost[p] }' filename
-F '\t' makes awk split lines at tabs, then:
BEGIN { OFS = FS } # the output should be separated the same way as the input
NR > 1 { # From the second line forward (skip header)
won[$1] += $2 # tally up totals
lost[$1] += $3
}
END { # When done, print the lot.
# GNU-specific: sorted traversal of player names
PROCINFO["sorted_in"] = "#ind_str_asc"
print "", "Sum Win", "Sum Lost"
for(p in won) print p, won[p], lost[p]
}
