Awk - using an array with multiple files - bash

I'm new to shell scripting, and I need to compute an average file1, and then the average of the result and a number in file2, so far i came up with this, but it doesn't print anything.
awk '{if ($FILENAME == "spring") array[$1]=($2+$3+$4+$5+$6+$7+$8)/7; if($FILENAME == "fall") array[$1]=(array[$1]+$2)/2 } END { for (var in array) print var,array[var]}' ./spring ./fall
Any way to solve this problem?

How about awk '{s+=$1}ENDFILE{print FILENAME,s/FNR;s=0}' RS=" " file1 file2:
$ cat file1
1 2 3 4 5 6 7 8
$ cat file2
1 2
$ awk '{s+=$1}ENDFILE{print FILENAME,s/FNR;s=0}' RS=" " file1 file2
file1 4.5
file2 1.5

There are no sigils in awk. Try dropping the $:
awk 'FILENAME ~ /spring/ { array[$1]=($2+$3+$4+$5+$6+$7+$8)/7 }
FILENAME ~ /fall/ { array[$1]=(array[$1]+$2)/2 }
END { for (var in array) print var,array[var]}' ./spring ./fall
In short, FILENAME is the name of the file currently being processed, but $FILENAME is equivalent to $0 when FILENAME starts with a letter.

Related

Count number of nonempty entries in each column of, e.g., comm output

The Unix command comm file1 file2 has a 3 column output with lines unique to file1 in the first column, lines unique to file2 in the second, and lines shared by both in the 3rd (assuming file1 and file2 are sorted). It ends up looking something like this:
$ echo -e "alpha\nbravo\ncharlie" > file1
$ echo -e "alpha\nbravo\ndelta" > file2
$ comm file1 file2
alpha
bravo
charlie
delta
If I want the number of nonempty lines in each column, is there a general way to parse the output of comm and count those?
I know that for comm in particular I could just run
for i in {12,23,31}; do comm -$i file1 file2 | wc -l; done
but I'm curious about solutions that take the comm output file as a starting point, for the sake of getting better at Unix command line. I added the awk tag because I have a hunch there's a good awk solution.
The other answer covers your question of using awk to do the job quite well, but it is also worth mentioning that the GNU version of comm has a --total option which will print the sum of each column in a similar manner.
You may use this awk:
comm file1 file2 |
awk -F '\t' -v OFS='\n' '{ if ($1=="") if ($2=="") c3++; else c2++; else c1++ }
END { print c3, c2, c1 }'
2
1
1
Note that output of comm is tab delimited with these cases:
1st and 2nd empty column in common lines
1st empty column in lines unique to file2
1st non-empty column in lines unique to file1
The question is interesting, but not as easy as one would imagine, especially if you do not have the --total option.
A couple of things about comm:
comm works on sorted files
if a line appears n times in file1 and m times n < m times in file2, comm will output n-m entries in column 2 and n entries in column 3.
$ comm <(echo -e "1\n2\n3") <(echo "2\n2\n3\n4")
1
2
2
3
4
comm uses <tab>-character as a default separator, processing its output becomes problematic if your input contains this character.
$ comm <(echo -e "1\t2\n3") <(echo "2\n3\n4")
1 2 << this is the weird line
2
3
4
Luckily it has an option to define the delimiter (--output-delimiter=STR)
comm only adds a delimiter if other non-empty fields are following
$ comm --output-delimiter=SEP <(echo -e "1\n2\n3") <(echo "2\n3\n4")
1 << NO SEP (1 field)
SEPSEP2 << TWO SEP (3 fields)
SEPSEP3 << TWO SEP (3 fields)
SEP4 << ONE SEP (2 fields)
How can we solve it now:
We should clearly not use an ASCII symbol as a delimiter, this is asking for problems when processing ASCII files, so what you can do is use a non-printable character as a delimiter. You could use for example <start-of-heading>-character with octal value \001 (it does not accept the <null>-character). This generally solves the issues you might have due to point (3)
$ comm --output-delimiter=$'\001' <(echo -e "1\t2\n3") <(echo "2\n3\n4")
this output can now be piped into an extremely simple awk
$ awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
the above works because of point (4).
So you can just do:
$ comm --output-delimiter=$'\001' file1 file2 \
| awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
But I don't have that --output-delimiter option: This calls for the pure awk solution. We keep track of 3 arrays. a for file1 b for file2 and c for the combination. (c keeps track of all the entries). We make sure to keep point (2) into account.
$ awk '(NR==FNR) { a[$0]++; c[$0]++ }
(NR!=FNR) { b[$0]++; c[$0]-- }
END { for(i in c) {
if (c[i] < 0) { countb+=-c[i]; countc+=a[i] }
else if (c[i] == 0) { countc+=a[i] }
else { counta+= c[i]; countc+=b[i] }
}
print counta, countb, countc
}' file1 file2
We could essentially get rid of the array b as it can be derived from a and c, but I wanted to make it a bit more clear how it works; the other version would be:
$ awk '(NR==FNR) { a[$0]++; c[$0]++; next } { c[$0]-- }
END { for(i in c) {
counta+=(c[i]>0 ? c[i] : 0)
countb-=(c[i]<0 ? c[i] : 0)
countc+=a[i] - (c[i]>0 ? c[i] : 0)
}
print counta, countb, countc
}' file1 file2
Using Perl
$ comm file1 file2 | perl -lne ' /^\t\t/ and $kv{2}++; /^\t\S+/ and $kv{1}++; /^\S+/ and $kv{3}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
or
$ comm file1 file2 | perl -lne ' /(^\t\t)|(^\t\S+)|(^.)/ and $x=$+[0]>2?3:$+[0]; $kv{$x}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
where
col-1 -> first file
col-3 -> second file
col-2 -> both file
obviously you can do all in awk without comm or requiring sorted inputs.
$ awk 'NR==FNR {a[$1]; next}
{if($1 in a) {c3++; delete a[$1]}
else c2++}
END {print length(a),c2,c3}' file1 file2
1 1 2
that's counts for file1 only, file2 only, and common.
Note, this requires that the records are unique in each file.

awk - Compare columns from two files and replace text in first file

I have two files. The first has 1 column and the second has 3 columns. I want to compare first columns of both files. If there is a coincidence, replace column 2 and 3 for specific values; if not, print the same line.
File 1:
$ cat file1
26
28
30
File 2:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,r,1510139756
27,a,0
28,r,1510244156
29,a,0
30,r,1510157364
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Desired output:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
I am using gawk to do this (it's inside a shell script and I am using solaris) but I can't get the output right. It only prints the lines that matches:
$fuente="file2"
gawk -v fuente="$fuente" 'FNR==NR{a[FNR]=$1; next}{print $1,$2="a",$3="0" }' $fuente file1 > file3
The output I got:
$ cat file3
26 a 0
28 a 0
30 a 0
awk one-liner:
awk 'NR==FNR{ a[$1]; next }$1 in a{ $2="a"; $3=0 }1' file1 FS=',' OFS=',' file2
The output:
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Really spread out for clarity; called (fuente.awk) like so:
awk -F \, -v fuente=file1 -f fuente.awk file2 # -F == IFS
BEGIN {
OFS="," # set OFS to make printing easier
while (getline x < fuente > 0) # safe way; read file into array
{
a[++i]=x # stuff indexed array
}
}
{ # For each line in file2
for (k=1 ; k<=i ; k++) # Lop over array (elements in file1)
{
if (($1==a[k]) && (! flag))
{
print($1,"a",0) # Found print new line
flag=1 # print only once
}
}
if (! flag) # Not found
{
print($0) # print original
}
flag=0 # reset flag
}
END { }

using awk how to merge 2 files, say A & B and do a left outer join function and include all columns in both files

I have multiple files with different number of columns, i need to do a merge on first file and second file and do a left outer join in awk respective to first file and print all columns in both files matching the first column of both files.
I have tried below codes to get close to my output. But i can't print the ",', where no matching number is found in second file. Below is the code. Join needs sorting and takes more time than awk. My file sizes are big, like 30 million records.
awk -F ',' '{
if (NR==FNR){ r[$1]=$0}
else{ if($1 in r)
r[$1]=r[$1]gensub($1,"",1)}
}END{for(i in r){print r[i]}}' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
if ( FNR == 1 ) {
empty = gensub(/[^,]/,"","g",tail)
}
file2[$1] = tail
next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
idx[$1] = NR
file2[NR] = tail
if ( FNR == 1 ) {
file2[""] = gensub(/[^,]/,"","g",tail)
}
next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
you can try,
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
you get,
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

Problems in mapping indices using awk

Hi all I have this data files
File1
1 The hero
2 Chainsaw and the gang
3 .........
4 .........
where the first field is the id and the second field is the product name
File 2
The hero 12
The hero 2
Chainsaw and the gang 2
.......................
From these two files I want to have a third file
File 3
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
.......................
As you can see I am just adding the indices reading from file 1
I used this method
awk -F '\t' 'NR == FNR{a[$2]=$1; next}; {print $0, a[$1]}' File1 File2 > File 3
where I am creating this associated array using File 1 and doing just lookup using product names from file 2
However my files are huge, I have like 20 million product names and this process is taking a lot of time. Any suggestions, how I can speed it up?
You can use this awk:
awk 'FNR==NR{p=$1; $1=""; sub(/^ +/, ""); a[$0]=p;next} {q=$NF; $NF=""; sub(/ +$/, "")}
($0 in a) {print $0, q, a[$0]}' f1 f2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
The script you posted won't produce the output you want from the input files you posted so let's fix that first:
$ cat file1
1 The hero
2 Chainsaw and the gang
$ cat file2
The hero 12
The hero 2
Chainsaw and the gang 2
$ awk -F'\t' 'NR==FNR{map[$2]=$1;next} {key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key]}' file1 file2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
Now, is that really too slow or were you doing some pre or post-processing and that was the real speed issue?
The obvious speed up is if your "file2" is sorted then you can delete the corresponding map[] value whenever the key changes so your map[] gets smaller every time you use it. e.g. something like this (untested):
$ awk -F'\t' '
NR==FNR {map[$2]=$1; next}
{ key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key] }
key != prev { delete map[prev] }
{ prev = key }
' file1 file2
Alternative approach when populating map[] uses too much time/memory and file2 is sorted:
$ awk '
{ key=$0
sub(/[[:space:]]+[^[:space:]]+$/,"",key)
if (key != prev) {
cmd = "awk -F\"\t\" -v key=\"" key "\" \047$2 == key{print $1;exit}\047 file1"
cmd | getline val
close(cmd)
}
print $0, val
prev = key
}' file2
From comments you're having scaling problems with your lookups. The general fix for that is to merge sorted sequences:
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 \
<( sort -t $'\t' -k2 file1) \
<( sort -t $'\t' -sk1,1 file2)
I gather Windows can't do process substitution, so you have to use temporary files:
sort -t $'\t' -k2 file1 >idlookup.bykey
sort -t $'\t' -sk1,1 file2 >values.bykey
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 idlookup.bykey values.bykey
If you need to preserve the value lookup sequence use nl to put line numbers on the front and sort on those at the end.
If your issue is performance then try this perl script:
#!/usr/bin/perl -l
use strict;
use warnings;
my %h;
open my $fh1 , "<", "file1.txt";
open my $fh2 , "<", "file2.txt";
open my $fh3 , ">", "file3.txt";
while (<$fh1>) {
my ($v, $k) = /(\d+)\s+(.*)/;
$h{$k} = $v;
}
while (<$fh2>) {
my ($k, $v) = /(.*)\s+(\d+)$/;
print $fh3 "$k $v $h{$k}" if exists $h{$k};
}
Save the above script in say script.pl and run it as perl script.pl. Make sure the file1.txt and file2.txt are in the same directory as the script.

How do i store the output of AWK as a variable to use in a SED command in a BASH script?

I have used this to create a character string:
awk -F "|" 'NR==1 {print $2$4}' file
I now want to use this string to replace another string using SED, using this code, which i know works:
sed -i.bak 's/\(<remote>\).*\(<\/remote>\)/<remote> http:\/\/blabla\/'$var' <\/remote>/g' file2.xml
But what i can't figure out is how to save the output of the awk as $var first to use as an input for SED. can anyone help? thanks.
Don't do that, just use awk, e.g.:
awk -F'|' '
NR==FNR { if (NR==1) var = $2$4; next }
{ gsub(/<remote>.*<\/remote>/,"<remote> http://blabla/" var " </remote>"); print }
' file file2.xml > tmp && mv tmp file2.xml
With GNU awk you could make it more efficient but it won't make a noticeable difference unless your first file is huge:
awk -F'|' '
NR==FNR { var = $2$4; nextfile }
{ gsub(/<remote>.*<\/remote>/,"<remote> http://blabla/" var " </remote>"); print }
' file file2.xml > tmp && mv tmp file2.xml
wrt a comment I made in someone else's post, here's why you should us -v var=val instead of populating variables in the awk args list unless you need to do that to change initial values between files (not necessary with GNU awk courtesy of BEGINFILE):
Here's a script to add the values stored in 3 files to some initial seed value and print the result for each file and across all files (the seed is just added once to the total). file2 is empty:
$ cat file1
3
6
$ cat file2
$ cat file3
2
5
$ cat tst.awk
BEGIN {
print "seed value =", seed
for (i=1; i<ARGC; i++)
subtotal[ARGV[i]] = seed
total = seed
}
{
subtotal[FILENAME] += $0
total += $0
}
END {
for (filename in subtotal)
print "File", filename, "subtotal =", subtotal[filename]
print "total =", total
}
$
$ awk -v seed=7 -f tst.awk file1 file2 file3
seed value = 7
File file1 subtotal = 16
File file2 subtotal = 7
File file3 subtotal = 14
total = 23
All makes sense right? Now lets move the setting of "seed" into the argument list:
$ awk -f tst.awk seed=7 file1 file2 file3
seed value =
File file1 subtotal = 9
File file2 subtotal =
File file3 subtotal = 7
File seed=7 subtotal =
total = 16
Note that the seed value is missing from the first print, the subtotals and total are wrong, and there's an output trying to print a value for a file named "seed=7".
It's all explicable and predictable once you know exactly what you're doing with awk but I bet it's pretty baffling to a newcomer so IMHO populating variables in the arg list should not be how we advise people to initialize their variables by default since it's got far less intuitive semantics than -v variable=value.
var=$(awk -F "|" 'NR==1 {print $2$4}' file)
But its better you give us data and what to do with it. This way we can make an awk or sed that does it all.
Edit:
var="test"
echo "https://blabla/bla/bla </remote>" | awk '{$NF="</"v">"}1' v="$var"
https://blabla/bla/bla </test>
This will replace the something with value of $var

Resources