awk to check order of header line in text files - bash

In the bash below I am trying to use awk to verify that the order of the headers is exactly the same between the tab-delimited files (key has the order of fields, and there are usually 3 text files in a directory).
If the order matches between the files, then print "FILENAME has the expected order of fields", but if the order does not match, then print "FILENAME order of $i is not correct", where $i is the field that is out of order, using key as the reference order. Thank you :)
key
Index Chr Start End Ref Alt Inheritance Score
file1.txt
Index Chr Start End Ref Alt Inheritance Score
1 1 10 100 A - . 2
file2.txt
Index Chr Start End Ref Alt Inheritance
1 1 10 100 A - . 2
2 1 20 100 A - . 5
file3.txt
Index Chr Start End Ref Alt Inheritance
1 1 10 100 A - . 2
2 1 20 100 A - . 5
3 1 75 100 A - . 2
4 1 25 100 A - . 5
awk
for f in /home/cmccabe/Desktop/validate/*.txt ; do
bname=`basename $f`
awk '
FNR==NR {
order=(awk '!seen[$0]++ {lines[i++]=$0}
END {for (i in lines) if (seen[lines[i]]==1) print lines[i]})'
k=(awk '!seen[$0]++ {lines[i++]=$0}
END {for (i in lines) if (seen[lines[i]]==1) print lines[i]})'
if($order==$k) print FILENAME " has expected order of fields"
else
print FILENAME " order of $i is not correct"
}' key $f
done
desired output
/home/cmccabe/Desktop/validate/file1.txt has expected order of fields
/home/cmccabe/Desktop/validate/file2.txt order of Score is not correct
/home/cmccabe/Desktop/validate/file3.txt order of Score is not correct

Given those inputs, you can do something like:
awk 'FNR==NR { hn=split($0,header); next }
     FNR==1  { n=split($0,fh)
               for (i=1; i<=hn; i++)
                   if (fh[i] != header[i]) {
                       printf "%s order of %s is not correct\n", FILENAME, header[i]
                       next
                   }
               if (hn==n)
                   print FILENAME, "has expected order of fields"
               else
                   print FILENAME, "has extra fields"
               next
             }' key f{1..3}
Prints:
f1 has expected order of fields
f2 order of Score is not correct
f3 order of Score is not correct

$ cat tst.awk
NR==FNR { split($0,keys); next }
FNR==1 {
    allmatched = 1
    for (i=1; i in keys; i++) {
        if ($i != keys[i] ) {
            printf "%s order of %s is not correct\n", FILENAME, keys[i]
            allmatched = 0
        }
    }
    if ( allmatched ) {
        printf "%s has expected order of fields\n", FILENAME
    }
    nextfile
}
$ awk -f tst.awk key file1 file2 file3
file1 has expected order of fields
file2 order of Score is not correct
file3 order of Score is not correct
The above uses GNU awk for nextfile for efficiency. With other awks just delete that statement and accept that the whole of each file will be read.
You didn't include in your sample a case where a header appears in a file but was NOT present in keys, so I assume that can't happen and you don't need the script to handle it.
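If that case ever did need handling, here is a hedged sketch of one way to extend tst.awk (untested; the extra loop and the inkeys array are my additions, not part of the original answer):
NR==FNR { n=split($0,keys); for (i=1; i<=n; i++) inkeys[keys[i]]=1; next }
FNR==1 {
    allmatched = 1
    for (i=1; i in keys; i++) {
        if ($i != keys[i] ) {
            printf "%s order of %s is not correct\n", FILENAME, keys[i]
            allmatched = 0
        }
    }
    for (i=1; i<=NF; i++) {              # flag header fields that are not in keys at all
        if ( !($i in inkeys) ) {
            printf "%s has unexpected field %s\n", FILENAME, $i
            allmatched = 0
        }
    }
    if ( allmatched ) {
        printf "%s has expected order of fields\n", FILENAME
    }
    nextfile
}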

Related

AWK: Concatenate and process three or more files with a method similar to FNR==NR approach

Since I am learning awk, I found out that the FNR==NR approach is a very common method for processing two files. If FNR==NR, then it is the first file; when FNR resets to 1 while reading lines from the concatenated files, !(FNR==NR) holds and it is obviously the second file.
When it comes to three or more files I can't see a way to tell which is the second and which is the third file, as both have the same !(FNR==NR) condition. This made me try to figure out how there can be something like FNR2 and FNR3.
So I implemented a method to process three files in one awk, pretending there is an FNR1, FNR2 and FNR3 for each file. For every file I made a for loop that runs separately. The condition is the same for every loop, NR==FNR#, and I actually get what I expected.
So I wonder if there are more sober, concise methods that deliver similar results to the awk code below.
Sample File Contents
$ cat file1
X|A1|Z
X|A2|Z
X|A3|Z
X|A4|Z
$ cat file2
X|Y|A3
X|Y|A4
X|Y|A5
$ cat file3
A1|Y|Z
A4|Y|Z
AWK for loop
$ cat fnrarray.sh
awk -v FS='[|]' '{ for(i=FNR ; i<=NR && i<=FNR && NR==FNR; i++) {x++; print "NR:",NR,"FNR1:",i,"FNR:",FNR,"\tfirst file\t"}
for(i=FNR ; i+x<=NR && i<=FNR && NR==FNR+x; i++) {y++; print "NR:",NR,"FNR2:",i+x,"FNR:",FNR,"\tsecond file\t"}
for(i=FNR ; i+x+y<=NR && i<=FNR && NR==FNR+x+y; i++) {print "NR:",NR,"FNR3:",i+x+y,"FNR:",FNR,"\tthird file\t"}
}' file1 file2 file3
Current and desired output
$ sh fnrarray.sh
NR: 1 FNR1: 1 FNR: 1 first file
NR: 2 FNR1: 2 FNR: 2 first file
NR: 3 FNR1: 3 FNR: 3 first file
NR: 4 FNR1: 4 FNR: 4 first file
NR: 5 FNR2: 5 FNR: 1 second file
NR: 6 FNR2: 6 FNR: 2 second file
NR: 7 FNR2: 7 FNR: 3 second file
NR: 8 FNR3: 8 FNR: 1 third file
NR: 9 FNR3: 9 FNR: 2 third file
You can see that NR aligns with FNR#, and it is easy to read off which NR belongs to which file.
Another Method
I found the method FNR==1{++f} f==1 {} in Handling 3 Files using awk.
But with this method arr1[1] is replaced every time a new line is read.
Fail attempt 1
$ awk -v FS='[|]' 'FNR==1{++f} f==1 {split($2,arr); print arr1[1]}' file1 file2 file3
A1
A2
A3
A4
Success with for loop (arr1[1] is not changed)
$ awk -v FS='[|]' '{for(i=FNR ; i<=NR && i<=FNR && NR==FNR; i++) {arr1[++k]=$2; print arr1[1]}}' file1 file2 file3
A1
A1
A1
A1
When it comes to three or more files I can't see a way to tell which is the second
and which is the third file, as both have the same !(FNR==NR) condition. This made
me try to figure out how there can be something like FNR2 and FNR3.
Here is example:
$ cat f1
X|A1|Z
X|A2|Z
X|A3|Z
X|A4|Z
$ cat f2
X|Y|A3
X|Y|A4
X|Y|A5
$ cat f3
A1|Y|Z
A4|Y|Z
Sample output:
$ awk -F '|' 'FNR==1{file++}{array[file, FNR]=$0; max=max>FNR?max:FNR}END{for(f=1; f<=file; f++){ for(row=1; row<=max; row++){ key=f SUBSEP row; if(key in array)print "file: "f,"row :"row,"record: "array[key] } }}' f1 f2 f3
file: 1 row :1 record: X|A1|Z
file: 1 row :2 record: X|A2|Z
file: 1 row :3 record: X|A3|Z
file: 1 row :4 record: X|A4|Z
file: 2 row :1 record: X|Y|A3
file: 2 row :2 record: X|Y|A4
file: 2 row :3 record: X|Y|A5
file: 3 row :1 record: A1|Y|Z
file: 3 row :2 record: A4|Y|Z
Explanation:
awk -F '|' 'FNR==1{        # FNR resets for every file
    file++                 # so whenever FNR==1, increment variable file
}
{
    # array name  : array
    # array key   : file, FNR
    # array value : $0, the current record/row
    array[file, FNR] = $0;
    # track the maximum row count seen across all files
    max = max > FNR ? max : FNR
}
END{                       # END block runs after all files are read
    # iterate over the files;
    # variable file now holds the total number of files read
    for(f=1; f<=file; f++)
    {
        # iterate over the records of each file;
        # variable max holds the maximum row count
        for(row=1; row<=max; row++)
        {
            # the key is file-number SUBSEP row-number
            key=f SUBSEP row;
            # if the key exists in the array,
            # print the array value
            if(key in array)
                print "file: "f,"row :"row,"record: "array[key]
        }
    }
}' f1 f2 f3
Another option would be to use true multi-dimensional arrays as below (gawk-specific, of course).
This assumes filenames are unique; otherwise use FNR==1{ file++ } and use file in place of FILENAME (a sketch of that variant follows the example).
$ awk --version
GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2018 Free Software Foundation.
$ awk -F '|' '{
    true_multi_array[FILENAME][FNR] = $0
}
END{
    for(file in true_multi_array)
        for(row in true_multi_array[file])
            print "file:",file, "row :" row, "record:" true_multi_array[file][row]
}' f1 f2 f3
file: f1 row :1 record:X|A1|Z
file: f1 row :2 record:X|A2|Z
file: f1 row :3 record:X|A3|Z
file: f1 row :4 record:X|A4|Z
file: f2 row :1 record:X|Y|A3
file: f2 row :2 record:X|Y|A4
file: f2 row :3 record:X|Y|A5
file: f3 row :1 record:A1|Y|Z
file: f3 row :2 record:A4|Y|Z
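As mentioned above, if filenames might repeat in the arg list you can key on a running file counter instead of FILENAME. A minimal sketch of that variant (still gawk-specific because of the true multi-dimensional array):
awk -F '|' '
FNR==1 { file++ }                            # bump the counter at the start of each file
{ true_multi_array[file][FNR] = $0 }         # store the record under file-number, row-number
END{
    for(f in true_multi_array)
        for(row in true_multi_array[f])
            print "file:", f, "row :" row, "record:" true_multi_array[f][row]
}' f1 f2 f3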
To identify files in order using GNU awk no matter what:
awk '
ARGIND == 1 { do 1st file stuff }
ARGIND == 2 { do 2nd file stuff }
ARGIND == 3 { do 3rd file stuff }
' file1 file2 file3
e.g. to get the text under "output" in your question from the 3 sample input files you provided:
awk '
ARGIND == 1 { pos = "first" }
ARGIND == 2 { pos = "second" }
ARGIND == 3 { pos = "third" }
{ print "NR:", NR, "FNR" ARGIND ":", NR, "FNR:", FNR, pos " file" }
' file1 file2 file3
NR: 1 FNR1: 1 FNR: 1 first file
NR: 2 FNR1: 2 FNR: 2 first file
NR: 3 FNR1: 3 FNR: 3 first file
NR: 4 FNR1: 4 FNR: 4 first file
NR: 5 FNR2: 5 FNR: 1 second file
NR: 6 FNR2: 6 FNR: 2 second file
NR: 7 FNR2: 7 FNR: 3 second file
NR: 8 FNR3: 8 FNR: 1 third file
NR: 9 FNR3: 9 FNR: 2 third file
or using any awk if all file names are unique whether any of them are empty or not:
awk '
FILENAME == ARGV[1] { do 1st file stuff }
FILENAME == ARGV[2] { do 2nd file stuff }
FILENAME == ARGV[3] { do 3rd file stuff }
' file1 file2 file3
or if the files aren't empty then whether unique or not (note file1 twice in the arg list):
awk '
FNR == 1 { argind++ }
argind == 1 { do 1st file stuff }
argind == 2 { do 2nd file stuff }
argind == 3 { do 3rd file stuff }
' file1 file2 file1
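Applied to the output format in the question, that portable counter approach might look like this (a hedged sketch; the ternary that spells out first/second/third is my addition):
awk '
FNR == 1 { argind++ }
{ print "NR:", NR, "FNR" argind ":", NR, "FNR:", FNR, (argind==1 ? "first" : argind==2 ? "second" : "third") " file" }
' file1 file2 file3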
If file names can appear multiple times in the arg list and some of the files could be empty, then it becomes trickier with a non-GNU awk, which is why GNU awk has ARGIND. For example, something like this (untested):
awk '
BEGIN {
    for (i=1; i<ARGC; i++) {
        fname = ARGV[i]
        if ( (getline line < fname) > 0 ) {
            # file is not empty so save its position in the args
            # list in an array indexed by its name and the number
            # of times that name has been seen so far
            arginds[fname,++tmpcnt[fname]] = i
        }
        close(fname)
    }
}
FNR == 1 { argind = arginds[FILENAME,++cnt[FILENAME]] }
argind == 1 { do 1st file stuff }
argind == 2 { do 2nd file stuff }
argind == 3 { do 3rd file stuff }
' file1 file2 file1

Awk - Count Each Unique Value and Match Values Between 2 Files

I have two files. I am trying to get the count of each unique field in column 8 in file 1, and then match the unique field value from the 6th column of the 2nd file.
So essentially, I am trying to take each unique value and its count from column 8 of File1, if there is a match in column 6 of File2.
File1:
2020-12-23 23:59:12,235911688,\N,34,20201223233739,797495497,404,819,\N,
2020-12-23 23:59:12,235911419,\N,34,265105814,718185263,200,819,\N,
2020-12-23 23:59:12,235912029,\N,34,20201223233739,748362773,404,819,\N,
2020-12-23 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
2020-12-23 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
2020-12-24 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
File2:
public static String status_code = "819";
public static String DeActivate = "400";
Expected output:
total count of status_code,819 : 3
total count of DeActivate,400 : 3
My code:
awk 'NR==FNR{a[$8]++}NR!=FNR{gsub(/"/,"",$6);b[$6]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
Algorithm
1. Take the 8th field from the 1st file (e.g. 819)
2. Count how many times the unique field (819) occurs in the file (based on date)
3. Take the corresponding value of 819 from the 4th field of file2
4. Print the output together
I believe I should be able to do this with awk, but for some reason I am really struggling with this.
(It is something like SQL JOINing two relational database tables on File1's $8 being equal to File2's $6.)
awk '
NR==FNR {                    # For the first file
    a[$8]++;                 # count each $8
}
NF&&NR!=FNR {                # For non empty lines of file 2
    gsub(/"/,"",$6);
    gsub(/[^0-9]/,"",$6);    # remove non-digits from $6
    b[$6]=$4                 # save name of constant to b
}
END{
    for(i in b){             # for constants occurring in File2
        if(a[i]) {           # if File1 had non zero count
            printf( "Total count of %s,%d : %d\n",b[i],i,a[i]);
            #print data
        }
    }
}' "FS=," File1 FS=" " File2
The above code works with your sample input. It produces the following output:
Total count of DeActivate,400 : 3
Total count of status_code,819 : 3
I think the main problem is that you do not specify comma as field separator for File1. See Processing two files with different field separators in awk
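For reference, a minimal illustration of that per-file field-separator trick. The var=value arguments between the file names are standard awk behavior and take effect for the files that follow them; a.csv and b.txt here are hypothetical:
# print the 2nd field of a comma-separated file, then of a space-separated file
awk '{ print FILENAME, $2 }' FS="," a.csv FS=" " b.txt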
A shorter, more efficient, way without the second array and for loop:
$ cat demo.awk
NR == FNR {
    a[$8]++
    next
}
{
    gsub(/[^0-9]/,"",$6)
    printf "Total count of %s,%d : %d\n", $4, $6, a[$6]
}
$ awk -f demo.awk FS="," file1 FS=" " file2
Total count of status_code,819 : 3
Total count of DeActivate,400 : 3
$

Merging two files column and row-wise in bash

I would like to merge two files, column and row-wise but am having difficulty doing so with bash. Here is what I would like to do.
File1:
1 2 3
4 5 6
7 8 9
File2:
2 3 4
5 6 7
8 9 1
Expected output file:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
This is just an example. The actual files are two 1000x1000 data matrices.
Any thoughts on how to do this? Thanks!
Or use paste + awk
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }'
Note that this script adds a trailing space after the last value. This can be avoided with a more complicated awk script or by piping the output through an additional command, e.g.
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }' | sed 's/ $//'
awk solution without additional sed. Thanks to Jonathan Leffler. (I knew it is possible but was too lazy to think about this.)
awk '{ n=NF/2; pad=""; for(i=1; i<=n; i++) { printf "%s%s/%s", pad, $i, $(i+n); pad=" "; } printf "\n"; }'
paste + perl version that works with an arbitrary number of columns without having to hold an entire file in memory:
paste file1.txt file2.txt | perl -MList::MoreUtils=pairwise -lane '
my @a = @F[0 .. (@F/2 - 1)];                  # The values from file1
my @b = @F[(@F/2) .. $#F];                    # The values from file2
print join(" ", pairwise { "$a/$b" } @a, @b); # Merge them together again'
It uses the non-standard but useful List::MoreUtils module; install through your OS package manager or favorite CPAN client.
Assumptions:
no blank lines in files
both files have the same number of rows
both files have the same number of fields
no idea how many rows and/or fields we'll have to deal with
One awk solution:
awk '
# first file (FNR==NR):
FNR==NR { for ( i=1 ; i<=NF ; i++)             # loop through fields
              { line[FNR,i]=$(i) }             # store field in array; array index = row number (FNR) + field number (i)
          next                                 # skip to next line in file
        }
# second file:
        { pfx=""                               # init printf prefix as empty string
          for ( i=1 ; i<=NF ; i++)             # loop through fields
              { printf "%s%s/%s",              # print our results:
                       pfx, line[FNR,i], $(i)  # prefix, corresponding field from file #1, "/", current field
                pfx=" "                        # prefix for rest of fields in this line is a space
              }
          printf "\n"                          # append linefeed on end of current line
        }
' file1 file2
NOTES:
remove comments to declutter code
memory usage will climb as the size of the matrix increases (probably not an issue for the smallish fields and the OP's comment about a 1000 x 1000 matrix); a streaming sketch that avoids this is shown after the output below
The above generates:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
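If memory ever did become a concern for much larger matrices, here is a hedged sketch of a streaming variant: it reads file2 line by line with getline while scanning file1, so nothing is held in memory. The file name "file2" is hardcoded purely for illustration.
awk '
{
    if ( (getline line2 < "file2") <= 0 ) exit   # read the matching row from file2; stop if it runs out
    split(line2, b)                              # split the file2 row into whitespace-separated fields
    pfx = ""
    for (i = 1; i <= NF; i++) {                  # pair field i of file1 with field i of file2
        printf "%s%s/%s", pfx, $i, b[i]
        pfx = " "
    }
    printf "\n"
}' file1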

Merging word counts with Bash and Unix

I made a Bash script that extracts words from a text file with grep and sed, then sorts them with sort and counts the repetitions with wc, then sorts again by frequency. The example output looks like this:
12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy
Now I'd like to merge all words with the same frequency into one line, like this:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Is there any way to do that with Bash and standard Unix toolset? Or I would have to write a script / program in some more sophisticated scripting language?
With awk:
$ echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
You can do something similar with Bash 4 associative arrays (a sketch appears at the end of this answer), but awk is easier and POSIX. Use that.
Explanation:
awk splits the line apart by the separator in FS, in this case the default of horizontal whitespace;
$1 is the first field, the count - use that to collect items with the same count in an associative array keyed by the count, with cnt[$1];
cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment - if cnt[$1] has no value, just assign the second field $2 to it (the expression after the :). If it does have a previous value, concatenate $2 onto it, separated by the value of OFS (the expression before the :);
At the end, print out the values of the associative array.
Since awk associative arrays are unordered, you need to sort again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted, so you can eliminate that part of the pipeline.
If you want the digits to be right justified (as you have in your example):
$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
If you want gawk to sort numerically by descending values, you can add PROCINFO["sorted_in"]="#ind_num_desc" prior to traversing the array:
$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {PROCINFO["sorted_in"]="#ind_num_desc"
for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
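And here is the Bash 4 associative-array sketch mentioned earlier in this answer. It is only a minimal illustration of the same grouping idea, assuming the "count word" lines arrive on stdin:
#!/usr/bin/env bash
declare -A cnt
while read -r n w; do
    cnt[$n]+=${cnt[$n]:+ }$w          # append the word, adding a space only if the entry is non-empty
done
for n in "${!cnt[@]}"; do
    printf '%s %s\n' "$n" "${cnt[$n]}"
done | sort -nr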
With single GNU awk expression (without sort pipeline):
awk 'BEGIN{ PROCINFO["sorted_in"]="#ind_num_desc" }
{ a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file
The output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Bonus alternative solution using GNU datamash tool:
datamash -W -g1 collapse 2 <file
The output (comma-separated collapsed fields):
12 the
7 code,with,add
5 quite
3 do,well
1 quick,can,pick,easy
awk:
awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file
sed:
sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'
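For reference, the sed one-liner expects the count/word list already grouped by count (as in the question); counts.txt below is just a hypothetical file holding that list:
sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D' counts.txt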
You start with sorted data, so you only need a new line when the first field changes.
echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" |
awk '
{
    if ($1==last) {
        printf(" %s",$2)
    } else {
        last=$1
        printf("%s%s",(NR>1?"\n":""),$0)
    }
}
END { print "" }    # print "" supplies the newline that ends the last output line
'
Next time you find yourself trying to manipulate text with a combination of grep and sed and shell and so on, stop and just use awk instead - the end result will be clearer, simpler, more efficient, more portable, etc.
$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.
.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<NF; i++) {
        word2cnt[tolower($i)]++
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        # printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
4 was
4 the
4 of
4 it
2 times
2 age
1 worst
1 wisdom
1 foolishness
1 best
.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<NF; i++) {
        word2cnt[tolower($i)]++
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        # printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
4 it was of the
2 age times
1 best worst wisdom foolishness
Just uncomment whichever printf line you like in the above script to get whichever type of output you want. The above will work in any awk on any UNIX system.
Using miller's nest verb:
mlr -p nest --implode --values --across-records -f 2 --nested-fs ' ' file
Output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line from FILE2, provided that the first column in FILE1 equals the first column in FILE2 (as a string match) AND the 2nd column in FILE1 is greater than column 2 of FILE2 but less than column 3 of FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions because 230402 is between 220320 and 439292 and 1 would be equal to chr1 after I make the strings match, therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
$F1="FILE1.txt"
read COL1 COL2
do
grep -w "chr$COL1" FILE2.tsv \
| awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++;
    next
}
{
    for(key in seq) {
        split(key, tmp, SUBSEP);
        if(tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3 ) {
            print $0
        }
    }
}' file1 file2
chr1 220320 439292
We read the first file into an array, using columns 1 and 2 as the key. We prepend the string "chr" to column 1 while building the key, for easy comparison later on.
When we process file 2, we iterate over our array and split each key.
We compare the first piece of the key to column 1 and check whether the second piece of the key is in the range of the second and third columns.
If it satisfies our condition, we print the line.
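If file1 is large, scanning every stored key for every file2 line can get slow. Purely as a hedged sketch (not part of the answer above), the same inclusive comparison can be done by grouping file1's positions per chromosome first:
awk '
NR==FNR { pos["chr" $1] = pos["chr" $1] " " $2; next }      # file1: collect positions per chromosome
{
    n = split(pos[$1], p, " ")                               # positions recorded for this chromosome, if any
    for (i = 1; i <= n; i++)
        if (p[i]+0 >= $2 && p[i]+0 <= $3) { print; break }   # print the file2 line once if any position falls in range
}' file1 file2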
awk 'BEGIN { i = 0 }
     FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
     FNR < NR  { for (c in chr) {
                     if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
                 }
               }' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
Thanks very much!
These answers work and are very helpful.
Also at long last I realized I should have had:
awk -v C2=$COL2 'if (C2>$1 && C2<$2); {print $0}'
with the brace in a different place and I would have been fine.
At any rate, thank you very much!

Resources