Awk: for loop comparing two files

I have two big files.
File 1 looks like the following:
10 2864001 2864012
10 5942987 5943316
File 2 looks like the following:
10 2864000 28
10 2864001 28
10 2864002 28
10 2864003 27
10 2864004 28
10 2864005 26
10 2864006 26
10 2864007 26
10 2864008 26
10 2864009 26
10 2864010 26
10 2864011 26
10 2864012 26
So I want to create a for loop in such a way that:
The first column of File 1 must match the first column of File 2, AND
the loop starts where the second column of File 1 matches the second
column of File 2, AND
it sums the third column of File 2 until the third column of File 1 matches the
second column of File 2.
So the output of the above example should be the sum of the third column of File 2 for the first line of File 1, which is 319 (28+28+27+28, then eight 26s, covering rows 2864001 through 2864012). I tried to use NR and FNR but I have not been able to do it so far. Could you please help me write an awk script?
Thank you so much

Transcribed, so there may be typos:
awk '
BEGIN { lastFNR=0; acount=0; FIRST="T" }
# FNR resets on the second file: switch modes (no "next" here, so this
# line still falls through to the range tests below)
FNR < lastFNR { FIRST="F"; aindex=0 }
FIRST=="T" {
    sta[acount] = $2
    fna[acount] = $3
    acount += 1
    lastFNR = FNR
}
FIRST=="F" && $2 >= sta[aindex] && $2 <= fna[aindex] {
    sum[aindex] += $3
    lastFNR = FNR
}
FIRST=="F" && $2 > fna[aindex] {
    aindex += 1
    if (aindex >= acount) { FIRST="E" }
}
END {
    for (aindex=0; aindex<acount; aindex++) {
        print sta[aindex], "through", fna[aindex], "totals", sum[aindex]
    }
}
' file1 file2

You could try
awk -f s.awk file1 file2
where s.awk is
NR==FNR {
    a[$1,$2]=$3            # remember the end position for each (col1, start) pair
    next
}
($1,$2) in a {
    stop = a[$1,$2]        # save the end position before getline overwrites $1/$2
    do {
        s+=$3
        if ($2 == stop) break
    } while ((getline) > 0)
    print s
}
{ s=0 }
output:
319
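If the getline juggling feels fragile, the same idea can be written with a plain state flag instead, so every line still passes through awk's normal rule list. A minimal sketch, assuming file2's rows are sorted and the ranges in file1 don't overlap:
awk '
NR==FNR { stop[$1,$2] = $3; next }   # file1: remember the end position per (col1, start)
($1,$2) in stop { s = 0; lim = stop[$1,$2]; active = 1 }
active {
    s += $3                          # the start line itself is summed too
    if ($2 == lim) { print s; active = 0 }
}
' file1 file2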

Related

awk super slow processing many rows but not many columns

While looking into this question, the challenge was to take this matrix:
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
And turn it into:
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2 # top of next 2 columns...
6 7
4 2
... each N elements from each row of the matrix -- in this example, N=2...
3 4
4 1
d f
5 9
q w # last element is lower left of matrix
The OP stated the input was 'much bigger' than the example without specifying the shape of the actual input (millions of rows? millions of columns? or both?)
I assumed (mistakenly) that the file had millions of rows (it was later specified to have millions of columns)
BUT the interesting thing is that most of the awks written ran at perfectly acceptable speed IF the shape of the data was millions of columns.
Example: @glennjackman posted a perfectly usable awk, so long as the long end was in columns, not in rows.
You can use his Perl to generate an example matrix of rows x columns. Here is that Perl:
perl -E '
my $cols = 2**20;   # 1,048,576 columns - the long end
my $rows = 2**3;    # 8 rows
my @alphabet = ( "a" .. "z", 0 .. 9 );
my $size = scalar @alphabet;
for ($r = 1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}' >file
Here are some candidate scripts that turn file (from that Perl script) into the output of 2 columns taken from the front of each row:
This Python is the speed champ regardless of the shape of the input:
$ cat col.py
import sys

cols = int(sys.argv[2])
offset = 0
delim = "\t"

# read the whole file into a list of row-lists
with open(sys.argv[1], "r") as f:
    dat = [line.split() for line in f]

# emit `cols` columns at a time, walking left to right
while offset <= len(dat[0]) - cols:
    for sl in dat:
        print(delim.join(sl[offset:offset + cols]))
    offset += cols
Here is a Perl that is also quick enough regardless of the shape of the data:
$ cat col.pl
push @rows, [@F];
END {
    my $delim = "\t";
    my $cols_per_group = 2;
    my $col_start = 0;
    while ( 1 ) {
        for my $row ( @rows ) {
            print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
        }
        $col_start += $cols_per_group;
        last if ($col_start + $cols_per_group - 1) > $#F;
    }
}
Here is an alternate awk that is slower but runs at a consistent speed (the number of lines in the file needs to be pre-calculated and passed in as nl):
$ cat col3.awk
function join(start, end, result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    col_offset = 0
    for (i=1; i<=NF; i+=cols) {
        col[NR + col_offset*nl] = join(i, i+cols-1)
        col_offset++
        ++cnt
    }
}
END {
    for (i=1; i<=cnt; i++) printf "%s", col[i]
}
And Glenn Jackman's awk (not to pick on him since ALL the awks had the same bad result with many rows):
function join(start, end, result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    c = 0
    for (i=1; i<NF; i+=n) {
        c++
        col[c] = col[c] join(i, i+n-1)
    }
}
END {
    for (i=1; i<=c; i++)
        printf "%s", col[i]    # the value already ends with a newline
}
Here are the timings with many columns (i.e., in the Perl script that generates file above, my $cols = 2**20 and my $rows = 2**3):
echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1
echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5
echo 'perl'
time perl -lan columnize.pl file >file2
echo 'dawg Python'
time python3 col.py file 2 >file3
echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4
Prints:
# 2**20 COLUMNS; 2**3 ROWS
glenn jackman awk
real 0m4.460s
user 0m4.344s
sys 0m0.113s
glenn jackman gawk
real 0m4.493s
user 0m4.379s
sys 0m0.109s
perl
real 0m3.005s
user 0m2.774s
sys 0m0.230s
dawg Python
real 0m2.871s
user 0m2.721s
sys 0m0.148s
dawg awk
real 0m11.356s
user 0m11.038s
sys 0m0.312s
But transpose the shape of the data by setting my $cols = 2**3 and my $rows = 2**20 and run the same timings:
# 2**3 COLUMNS; 2**20 ROWS
glenn jackman awk
real 23m15.798s
user 16m39.675s
sys 6m35.972s
glenn jackman gawk
real 21m49.645s
user 16m4.449s
sys 5m45.036s
perl
real 0m3.605s
user 0m3.348s
sys 0m0.228s
dawg Python
real 0m3.157s
user 0m3.065s
sys 0m0.080s
dawg awk
real 0m11.117s
user 0m10.710s
sys 0m0.399s
So question:
What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?
It is the same number of elements processed and the same total data. The join function is called the same number of times.
String concatenation that is saved in a variable is one of the slowest operations in awk (IIRC it's slower than I/O), because awk constantly has to find a new memory location to hold the result of the concatenation. There's more of that happening in these scripts as the rows get longer, so it's probably all of the string concatenation in the posted solutions that's causing the slowdown.
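As a stripped-down illustration of the difference (my sketch, not one of the posted solutions), compare these two ways of accumulating input:
# Pattern 1 (gets slower as input grows): every record copies the whole
# accumulated string to a fresh memory location.
awk '{ big = big $0 ORS } END { printf "%s", big }' file

# Pattern 2 (constant cost per record): each record lands in its own
# array slot, so nothing already stored gets recopied.
awk '{ line[NR] = $0 } END { for (i=1; i<=NR; i++) print line[i] }' file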
Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:
$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (i=1; i<=numVals; i+=2) {
        valNr = i + ((i-1) * NF)    # <- not correct, fix it!
        print vals[valNr], vals[valNr+1]
    }
}
I don't have time right now to figure out the correct math to calculate the index for the single-loop approach above (see the comment in the code), so here's a working version with 2 loops that doesn't require as much thought and shouldn't run much, if any, slower:
$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    inc = NF - 1
    for (i=0; i<NF; i+=2) {
        for (j=1; j<=NR; j++) {
            valNr = i + j + ((j-1) * inc)
            print vals[valNr], vals[valNr+1]
        }
    }
}
$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
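For the curious, here is one stab at the single-loop index math that the earlier version leaves as an exercise. This is my arithmetic, not the author's fix; it assumes rectangular input (in END, NF still holds the last line's field count) and the same vals[]/numVals setup as above:
END {
    for (i = 1; i <= numVals; i += 2) {
        k = (i - 1) / 2                               # 0-based pair counter
        valNr = int(k / NR) * 2 + (k % NR) * NF + 1   # column-group offset + row offset
        print vals[valNr], vals[valNr+1]
    }
}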
A play with strings:
$ awk '
{
    a[NR]=$0    # hash rows to a
    c[NR]=1     # index pointer
}
END {
    b=4              # buffer size for match
    i=NR             # row count
    n=length(a[1])   # process til the first row is done
    while (c[1]<n)
        for (j=1; j<=i; j++) {                          # for each row
            match(substr(a[j],c[j],b),/([^ ]+ ?){2}/)   # read 2 fields and their separators
            print substr(a[j],c[j],RLENGTH)             # output them
            c[j]+=RLENGTH                               # advance the index pointer
        }
}' file
b=4 is a buffer size that is optimal for two single-digit fields and two single-space separators (a b ), as given in the original question, but if the data is real-world data, b should be set to something more suitable. Omitting it, so that the match line becomes match(substr(a[j],c[j]),/([^ ]+ ?){2}/), kills the performance for data with lots of columns.
I got times around 8 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
Ed Morton's approach did fix the speed issue.
Here is the awk I wrote that supports variable columns:
$ cat col.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (col_offset=0; col_offset + cols <= NF; col_offset+=cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}
$ time awk -f col.awk -v cols=2 file >file.cols
real 0m5.810s
user 0m5.468s
sys 0m0.339s
This is about 6 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
But MAN it sure is nice to have strong support for arrays of arrays (such as in Perl or Python...)
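For what it's worth, GNU awk 4.0 and later does have true arrays of arrays (a gawk extension, not POSIX awk); a tiny sketch:
# gawk-only: load the matrix into a real two-dimensional structure
gawk '{ for (i = 1; i <= NF; i++) m[NR][i] = $i }
      END { print m[2][1] }' file   # prints the first field of row 2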

Merging sums of numbers from different files and deleting select duplicate lines

I've checked other threads here on merging, but they seem to be mostly about merging text, and not quite what I needed, or at least I couldn't figure out a way to connect their solutions to my own problem.
Problem
I have 10+ input files, each consisting of two columns of numbers (think of them as x,y data points for a graph). Goals:
Merge these files into 1 file for plotting
For any duplicate x values in the merge, add their respective y-values together, then print one line with x in field 1 and the added y-values in field 2.
Consider this example for 3 files:
y1.dat
25 16
27 18
y2.dat
24 10
27 9
y3.dat
24 2
29 3
According to my goals above, I should be able to merge them into one file with output:
final.dat
24 12
25 16
27 27
29 3
Attempt
So far, I have the following:
#!/bin/bash
loops=3
for i in `seq $loops`; do
if [ $i == 1 ]; then
cp -f y$i.dat final.dat
else
awk 'NR==FNR { arr[NR] = $1; p[NR] = $2; next } {
for (n in arr) {
if ($1 == arr[n]) {
print $1, p[n] + $2
n++
}
}
print $1, $2
}' final.dat y$i.dat >> final.dat
fi
done
Output:
25 16
27 18
24 10
27 27
27 9
24 12
24 2
29 3
On closer inspection, it's clear I have duplicates of the original x-values.
The problem is that my script needs to print all the x-values first, and only then can I add them together for the output. However, I don't know how to go back and remove the lines with the old x-values that I needed to make the addition.
If I blindly use uniq, I don't know whether the old x-values or the new x-value gets deleted. With awk '!duplicate[$1]++' the order of deleted lines was reversed over the loop, so it deletes correctly on the first pass but deletes the wrong ones after that.
I've been at this for a long time and would appreciate any help. Thank you!
I am assuming you have already merged all the files into a single one before doing the calculation. Once that's done, the script is as simple as:
awk '{ if ( $1 != "" ) { coord[$1]+=$2 } } END { for ( k in coord ) { print k " " coord[k] } }' input.txt
Hope it helps!
Edit: How does this work?
if ( $1 != "" ) { coord[$1]+=$2 }
This line gets executed for each line of your input. It first checks whether there is a value for X; otherwise it simply ignores the line, which helps to skip empty lines should your file have any. The block that gets executed, coord[$1]+=$2, is the heart of the script: it builds a dictionary with X as the key of each entry while summing every Y value found for that X.
END { for ( k in coord ) { print k " " coord[k] } }
This block executes after awk has iterated over all the lines in your file. It simply grabs each key from the dictionary and prints it, then a space, and finally the sum of all the values that were found for that specific key.
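One caveat worth adding: for (k in coord) makes no ordering guarantee, so if the plot needs the x values in order, append a numeric sort. A sketch of the same program spread over lines for readability (input.txt being the pre-merged file the answer assumes):
awk '
$1 != "" { coord[$1] += $2 }                 # sum the y values per distinct x
END { for (k in coord) print k, coord[k] }
' input.txt | sort -n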
Using a Perl one-liner:
> cat y1.dat
25 16
27 18
> cat y2.dat
24 10
27 9
> cat y3.dat
24 2
29 3
> perl -lane ' $kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for(sort keys %kv) }' y*dat
24 12
25 16
27 27
29 3
>

How to put lines 4, 5, and 30 at the beginning, followed by the rest of the lines

I'm looking for a way to put lines 5, 7, and 8 first, followed by the rest of the lines.
The file I have looks like:
3 0.14239002E-02 0.22510807E-04 -0.26742979E-05
4 0.57704593E-03 0.68034193E-03 0.68119554E-03
5 0.64948134E-03 0.18797759E-04 0.92341181E-04
6 -0.70701827E-03 0.14093323E-02 -0.88504803E-04
7 -0.99123291E-03 0.53649558E-05 0.56815134E-03
8 -0.10869857E-02 0.17371795E-02 -0.25683281E-03
9 -0.16270520E-02 -0.44482889E-06 -0.97268563E-05
I need the output to be like:
5 0.64948134E-03 0.18797759E-04 0.92341181E-04
7 -0.99123291E-03 0.53649558E-05 0.56815134E-03
8 -0.10869857E-02 0.17371795E-02 -0.25683281E-03
3 0.14239002E-02 0.22510807E-04 -0.26742979E-05
4 0.57704593E-03 0.68034193E-03 0.68119554E-03
6 -0.70701827E-03 0.14093323E-02 -0.88504803E-04
9 -0.16270520E-02 -0.44482889E-06 -0.97268563E-05
Any suggestions using sort or awk or some other good way? Thank you.
If you don't mind reading the file twice, then you can do it with awk:
awk '(NR == FNR && (FNR == 5 || FNR == 7 || FNR == 8)) \
|| (NR != FNR && !(FNR == 5 || FNR == 7 || FNR == 8))' file file
Or if you prefer, and your version of awk supports it, then you can use xor:
awk '!xor(NR == FNR, FNR == 5 || FNR == 7 || FNR == 8)' file file
If it's the contents of the first field $1 that you want, not the line number NR, then change the comparisons FNR == 5 to $1 == 5, etc.
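For example, keying on the first field instead of the physical line number, the two-pass version becomes:
awk '(NR == FNR && ($1 == 5 || $1 == 7 || $1 == 8)) \
  || (NR != FNR && !($1 == 5 || $1 == 7 || $1 == 8))' file file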
The result of the command will go to standard output. If you want to redirect it to a file, then add a redirection to the end of the command.
A generic solution:
awk -v NewOrder='5 7 8' '
# prepare the list
BEGIN {
    NOSize = split( NewOrder, NOs)
    for ( i=1; i<=NOSize; i++) Big = Big < NOs[i] ? NOs[i] : Big
}
# print lines out of scope (of the order change) straight away
$1 > Big { print; next }
# keep lines in memory up to the last one affected by the order change
{ Ls[$1]=$0; Rs[++j]=$1 }
# on reaching the last line to reorder, print the new order
$1 == Big {
    # print the relocated lines first
    for ( k=1; k<=NOSize; k++) { i=NOs[k]; print Ls[i]; Ps[i]=1 }
    # then print the others in FIFO order
    for ( k=1; k<=(j-1); k++) { i=Rs[k]; if (! Ps[i]) print Ls[i] }
}
' YourFile
Put the list of lines to move in the variable NewOrder, separated by spaces.
It takes the first field as the reference for reordering, not the line number.

Take input from a column of one file, and match a condition in two columns of a second file

I have two files like this:
File 1:
1 1987969 1987970 . 7.078307 33
1 2066715 2066716 . 7.426998 34
1 2066774 2066775 . 6.851217 33
File 2:
1 HANASAI gelliu 1186928 1441229
1 FEBRUCA sepaca 3455487 3608150
I want to take each value of column 3 in File 1 and search File 2 with a condition like (if File1_col3_value >= File2_col4_value && File1_col3_value <= File2_col5_value), then print the whole line of File 2 to a new file.
One more thing is also important: for every value from file_1 searched in file_2, the value in column 1 should be the same in both files; e.g., for '1987970' of file_1 the corresponding value in column 1 is '1', so in file_2 the first column should also be '1'.
Thanks
EDIT: Only considers lines with matching "class" values in column 1
$ cat msh.awk
# Save all the pairs of class and third-column values from file1
NR==FNR { a[$1,$3]; next }
# For each line of file2, if there exists a third-column-file1
# value between the values of columns 4 and 5 in a record of the
# same class, print the line
{
    for (cv in a) {
        split(cv, class_val, SUBSEP);
        c = class_val[1];
        v = class_val[2];
        if (c == $1 && v >= $4 && v <= $5) {
            print
            break
        }
    }
}
$ cat file1
1 1987969 1987970 . 7.078307 33
1 2066715 2066716 . 7.426998 34
1 2066774 1200000 . 6.851217 33
1 2066774 2066775 . 6.851217 33
$ cat file2
1 HANASAI gelliu 1186928 1441229
1 FEBRUCA sepaca 3455487 3608150
$ awk -f msh.awk file1 file2
1 HANASAI gelliu 1186928 1441229

{awk} How to read a line and compare a field with its next/previous line?

The command below is used to read an input file containing 7682 lines.
I use --field-separator to split the fields, convert some of them into what I need, and the grep gets rid of the first 2 lines, which I do not need.
awk --field-separator=";" '($1<15) {print int(a=(($1-1)/480)+1) " " ($1-((int(a)-1)*480)) " " (20*log($6)/log(10))}' 218_DW.txt | grep -v "0 480 -inf"
I used ($1<15) so that I only print 14 lines, which is better for testing. The output I get is exactly what I want, but there is more I need to do with it:
1 1 48.2872
1 2 48.3021
1 3 48.1691
1 4 48.1502
1 5 48.1564
1 6 48.1237
1 7 48.1048
1 8 48.015
1 9 48.0646
1 10 47.9472
1 11 47.8469
1 12 47.8212
1 13 47.8616
1 14 47.8047
From the output above, $1 increments from 1-16 and $2 from 1-480; it's always continuous, so when it gets to 1 480 47.8616 it restarts at 2 1 47.8616, and so on until the last line, 16 480 10.2156.
So I get 16*480 = 7680 lines.
What I want to do is simple, but I don't get it :)
I want to compare the current line with the next one, but not all fields, only $3; it's a value in dB that decreases as $2 increases.
For example:
The current line is 1 1 48.2872 = a
The next line is 1 2 48.3021 = b
If [ (a - b) > 6 ] then print $1 $2 $3
Of course (a - b) has to be an absolute value, always > 0.
The real beast will be comparing the current line ($3 only) with both its next and its previous line ($3).
Something like this:
1 3 48.1691=a
1 4 48.1502=b
1 5 48.1564=c
If [ ABS(b - a) > 6 ] OR If [ ABS(b - c) > 6 ] then print $1 $2 $3
But of course the first line can only be compared with its next one, and the last one with its previous one. Is it possible?
Try this:
#!/usr/bin/awk -f
function abs(x) {
    if (x >= 0)
        return x;
    else
        return -1 * x;
}
function compare(a, b) {
    return abs(a - b) > 6;
}
function update() {
    before_value = current_value;
    current_line = $0;
    current_value = $3;
}
BEGIN {
    line_n = 1;
}
# Edit: added to skip blank lines and differently formatted lines in
# general. You could add some error message and/or exit function
# here to detect badly formatted data.
NF != 3 {
    next;
}
line_n == 1 {
    update();
    line_n += 1;
    next;
}
line_n == 2 {
    if (compare(current_value, $3))
        print current_line;
    update();
    line_n += 1;
    next;
}
{
    if (compare(current_value, before_value) && compare(current_value, $3))
        print current_line;
    update();
}
END {
    if (compare(current_value, before_value)) {
        print current_line;
    }
}
The funny thing is that I had this code lying around from an old project where I had to do basically the same thing. I adapted it a little for you. I think it solves your problem (as I understood it, at least); if it doesn't, it should point you in the right direction.
Instructions to run the awk script:
Supposing you saved the code under the name "awkscript" and the data file is named "datafile", both in the current folder, you should first mark the script as executable with chmod +x awkscript and then execute it, passing the data file as a parameter, with ./awkscript datafile, or use it as part of a sequence of pipes, as in cat datafile | ./awkscript.
Comparing the current line to the previous one is trivial, so I think the problem you're having is that you can't figure out how to compare the current line to the next one. Just keep 2 previous lines instead of 1 and always operate on the line before the one that's actually being read as $0, i.e. the line stored in the array p1 in this example (p2 is the line before it and $0 is the line after it):
function abs(val) { return (val > 0 ? val : -val) }
NR==2 {
    if ( abs(p1[3] - $3) > 6 ) {
        print p1[1], p1[2], p1[3]
    }
}
NR>2 {
    if ( ( abs(p1[3] - p2[3]) > 6 ) || ( abs(p1[3] - $3) > 6 ) ) {
        print p1[1], p1[2], p1[3]
    }
}
{ prev2=prev1; prev1=$0; split(prev2,p2); split(prev1,p1) }
END {
    if ( ( abs(p1[3] - p2[3]) > 6 ) ) {
        print p1[1], p1[2], p1[3]
    }
}
