Awk - Count Each Unique Value and Match Values Between 2 Files - shell

I have two files. I am trying to get the count of each unique field in column 8 of file 1, and then match that unique field value against the 6th column of the 2nd file.
So essentially, I am trying to take each unique value and its count from column 8 of File1, if there is a match in column 6 of file2.
File1:
2020-12-23 23:59:12,235911688,\N,34,20201223233739,797495497,404,819,\N,
2020-12-23 23:59:12,235911419,\N,34,265105814,718185263,200,819,\N,
2020-12-23 23:59:12,235912029,\N,34,20201223233739,748362773,404,819,\N,
2020-12-23 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
2020-12-23 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
2020-12-24 23:59:12,235911839,\N,34,20201223233738,745662697,404,400,\N,
File2:
public static String status_code = "819";
public static String DeActivate = "400";
Expected output:
total count of status_code,819 : 3
total count of DeActivate,400 : 3
My code:
awk 'NR==FNR{a[$8]++}NR!=FNR{gsub(/"/,"",$6);b[$6]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
Algorithm
1. Take the 8th field from the 1st file (e.g. 819)
2. Count how many times that unique field value (819) occurs in the file (based on the date)
3. Take the corresponding name for 819 from the 4th field of file2
4. Print the output together
I believe I should be able to do this with awk, but for some reason I am really struggling with this.

(It is something like SQL JOINing two relational database tables on File1's $8 being equal to File2's $6.)
awk '
NR==FNR {                    # for the first file
    a[$8]++                  # count each $8
}
NF && NR!=FNR {              # for non-empty lines of file 2
    gsub(/[^0-9]/,"",$6)     # remove non-digits from $6
    b[$6]=$4                 # save the name of the constant in b
}
END {
    for (i in b) {           # for constants occurring in File2
        if (a[i]) {          # if File1 had a non-zero count
            printf("Total count of %s,%d : %d\n", b[i], i, a[i])
        }
    }
}' FS="," File1 FS=" " File2
The above code works with your sample input. It produces the following output:
Total count of DeActivate,400 : 3
Total count of status_code,819 : 3
I think the main problem is that you do not specify the comma as the field separator for File1. See Processing two files with different field separators in awk.
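For example, assignments of the form var=value placed between file names on the awk command line take effect just before the next file is read, so each input file can get its own separator. A minimal sketch (with hypothetical files a.csv and b.txt):

awk '{ print $2 }' FS="," a.csv FS=" " b.txt
# prints the 2nd comma-separated field of each a.csv line,
# then the 2nd space-separated field of each b.txt line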

A shorter, more efficient, way without the second array and for loop:
$ cat demo.awk
NR == FNR {
    a[$8]++
    next
}
{
    gsub(/[^0-9]/,"",$6)
    printf "Total count of %s,%d : %d\n", $4, $6, a[$6]
}
$ awk -f demo.awk FS="," file1 FS=" " file2
Total count of status_code,819 : 3
Total count of DeActivate,400 : 3
$

Related

awk to get the first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F '[,TPF=]' '{print $1}' but it prints the whole line that has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses the match function of GNU awk with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. It then checks the condition: if the 2nd element of arr is greater than 15, the 1st element of the arr array is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
    print arr[1]
}
' Input_file
I would harness GNU AWK for this task the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= is, and substr, to get the 2 characters after ETA= (4 is used because ETA= is 4 characters long and index gives the start position). I use +0 to convert that to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
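If the number after ETA= could be more or fewer than 2 digits, one variation (my assumption, not part of the answer above) is to let awk's string-to-number conversion cut the value off at the first non-digit:

awk '{
    p = index($0, "ETA=")                        # start of "ETA=", 0 if absent
    if (p && substr($0, p+4) + 0 > 15) print $1
}' file.txt

Here substr($0, p+4) returns everything from the value onward (e.g. "23:00,dom=sss.co.uk,user2=lis") and +0 converts that to the leading number 23.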
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line to obtain the first, space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 is the data after the first separator (and up until the next one, if any). Adding +0 forces a numeric conversion, which simply ignores any non-numeric text after the number at the beginning of the field. So, for example, on the first line $2 is literally 12:00, team=xyz,user1=tom,dom=dby.com, but the comparison effectively checks whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
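You can watch that coercion in isolation (a quick check, not part of the answer itself):

$ awk 'BEGIN { print "12:00, team=xyz" + 0 }'
12

The string-to-number conversion stops at the first character that cannot be part of a number, so only the leading 12 survives.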
Using awk you could match ETA= followed by 1 or more digits, take the matched text without the ETA= part, and, if that number is greater than 15, print the first field.
awk 'match($0, /ETA=[0-9]+/) {
    if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
    if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Unix : lookup value from one file to other files based on two joining keys

I'm looking for help on joining (at the unix level) 2 files (file1 and file2), picking up values from file2 in priority over file1: i.e. if a value exists in file2 for a given key, it should be taken instead of file1's srcValue; if file2 has no value for that key, use file1's srcValue as the final value.
Note: the lookup has to be done based on 2 key columns: id and dept.
sample
file1
------
id^dept^name^srcValue
1^d1^a^s123
2^d2^b^s456
3^d3^c^
file2
--------
id^dept^name
1^d1^Tva
3^d3^TVb
4^d4^Tvm
Desired output
---------------------
id^dept^name^FinalValue
1^d1^a^Tva
2^d2^b^s456
3^d3^c^TVb
The sample code below works fine if I consider only 1 column (id) as the key column, but I'm not sure how to specify both id and dept as key columns.
awk -F"^" 'BEGIN{OFS="^"}
{
if (NR==FNR) {
a[$1]=$3;
next
}
if ($1 in a){
$4=a[$1]
}
print
}' file2 file1
Output of the same (above) code:
id^dept^name^name
1^d1^a^Tva
2^d2^b^s456
3^d3^c^TVb
I'm not sure how to mention both id and dept as key columns.
Store both.
a[$1, $2]=$3;
And compare both.
if (($1, $2) in a) {
Comma just "merges" values using OFS. So $1, $2 is basically equal to $1 OFS $2 - it's a string with ^ in between. You can print ($1, $2) for example to inspect it.

How can I reorder lines in a text file based on a pattern?

I have a text file that contains batches of 4 lines. The first line of each batch is in the correct position; however, the next 3 lines are not always in the correct order.
name cat
label 4
total 5
value 4
name dog
total 4
label 3
value 6
name cow
value 6
total 1
label 4
name fish
total 3
label 5
value 6
I would like each 4-line batch to be in the following format:
name cat
value 4
total 5
label 4
so the output would be:
name cat
value 4
total 5
label 4
name dog
value 6
total 4
label 3
name cow
value 6
total 1
label 4
name fish
value 6
total 3
label 5
The file contains thousands of lines in total, so I would like to build a command that can deal with all potential orders of the 3 lines and re-arrange them if they are not in the correct format.
I am aware I can use awk to search for lines that begin with a particular string and re-arrange them:
awk '$1 == "value" { print $3, $4, $1, $2; next; } 1'
However, I cannot figure out how to achieve something similar that processes over multiple lines.
How can I achieve this?
By setting RS to the empty string, each block of text separated by at least one empty line is considered a single record. From there it's easy to capture each key-value pair and output them in the desired order.
BEGIN {RS=""}
{
for (i=1; i<=NF; i+=2) a[$i] = $(i+1)
print "name", a["name"] ORS \
"value", a["value"] ORS \
"total", a["total"] ORS \
"label", a["label"] ORS
}
$ awk -f a.awk file
name cat
value 4
total 5
label 4
name dog
value 6
total 4
label 3
name cow
value 6
total 1
label 4
name fish
value 6
total 3
label 5
Could you please try the following.
awk '
/^name/{
    if (name) {
        print name ORS array["value"] ORS array["total"] ORS array["label"] ORS
        delete array
    }
    name=$0
    next
}
{
    array[$1]=$0
}
END{
    print name ORS array["value"] ORS array["total"] ORS array["label"]
}
' Input_file
EDIT: Adding a refined version of the above solution, suggested by Kvantour.
awk -v OFS="\n" '
(!NF) && ("name" in a){
print a["name"],a["value"],a["total"],a["label"] ORS
delete a
next
}
{
a[$1]=$0
}
END{
print a["name"],a["value"],a["total"],a["label"]
}
' Input_file
The simplest way is the following:
awk 'BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"}
     { for(i=1;i<=NF;++i) { k=substr($i,1,index($i," ")-1); a[k]=$i } }
     { print a["name"],a["value"],a["total"],a["label"] }' file
How does this work?
Awk knows the concepts of records and fields. Files are split into records, where consecutive records are separated by the record separator RS. Each record is split into fields, where consecutive fields are separated by the field separator FS. By default, the record separator RS is set to the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="" and the field separator FS="\n".
Each record looks simplified as:
key1 string1 << field $1
key2 string2 << field $2
key3 string3 << field $3
key4 string4 << field $4
...
keyNF stringNF << field $NF
When awk reads a record, we first parse it by storing all key-value pairs in an array a. Afterwards, we print the values we find interesting. For this, we need to define the output field separator OFS and the output record separator ORS.
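For instance, a tiny illustration of those two variables (separate from the script above):

$ printf 'a b\nc d\n' | awk 'BEGIN { OFS="-"; ORS="!\n" } { print $1, $2 }'
a-b!
c-d!

Each comma in print inserts OFS, and every print statement ends with ORS.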
In Vim you could sort the file in sections using reverse order sort!:
for i in range(1,line("$"))
    /^name/+1,/^name/+3sort!
endfor
Same command issued from the shell:
$ ex -s '+for i in range(1,line("$"))|/^name/+1,/^name/+3sort!|endfor' '+%p' '+q!' inputfile

awk to check order of header line in text files

In the bash below I am trying to use awk to verify that the order of the headers is exactly the same between the tab-delimited files (key has the expected order of fields; there are usually 3 text files in a directory).
If the order is correct, or matches are found between the files, then print "FILENAME has the expected order of fields", but if the order does not match between the files, then print "FILENAME order of $i is not correct", where $i is the field that is out of order, using key as the reference order. Thank you :)
key
Index Chr Start End Ref Alt Inheritance Score
file1.txt
Index Chr Start End Ref Alt Inheritance Score
1 1 10 100 A - . 2
file2.txt
Index Chr Start End Ref Alt Inheritance
1 1 10 100 A - . 2
2 1 20 100 A - . 5
file3.txt
Index Chr Start End Ref Alt Inheritance
1 1 10 100 A - . 2
2 1 20 100 A - . 5
3 1 75 100 A - . 2
4 1 25 100 A - . 5
awk
for f in /home/cmccabe/Desktop/validate/*.txt ; do
bname=`basename $f`
awk '
FNR==NR {
order=(awk '!seen[$0]++ {lines[i++]=$0}
END {for (i in lines) if (seen[lines[i]]==1) print lines[i]})'
k=(awk '!seen[$0]++ {lines[i++]=$0}
END {for (i in lines) if (seen[lines[i]]==1) print lines[i]})'
if($order==$k) print FILENAME " has expected order of fields"
else
print FILENAME " order of $i is not correct"
}' key $f
done
desired output
/home/cmccabe/Desktop/validate/file1.txt has expected order of fields
/home/cmccabe/Desktop/validate/file2.txt order of Score is not correct
/home/cmccabe/Desktop/validate/file3.txt order of Score is not correct
Given those inputs, you can do something like:
awk 'FNR==NR { hn=split($0,header); next }
FNR==1 {
    n=split($0,fh)
    for (i=1; i<=hn; i++)
        if (fh[i]!=header[i]) {
            printf "%s: order of %s is not correct\n", FILENAME, header[i]
            next
        }
    if (hn==n)
        print FILENAME, "has expected order of fields"
    else
        print FILENAME, "has extra fields"
    next
}' key f{1..3}
Prints:
f1 has expected order of fields
f2 order of Score is not correct
f3 order of Score is not correct
$ cat tst.awk
NR==FNR { split($0,keys); next }
FNR==1 {
    allmatched = 1
    for (i=1; i in keys; i++) {
        if ($i != keys[i]) {
            printf "%s order of %s is not correct\n", FILENAME, keys[i]
            allmatched = 0
        }
    }
    if (allmatched) {
        printf "%s has expected order of fields\n", FILENAME
    }
    nextfile
}
$ awk -f tst.awk key file1 file2 file3
file1 has expected order of fields
file2 order of Score is not correct
file3 order of Score is not correct
The above uses GNU awk's nextfile for efficiency. With other awks, just delete that statement and accept that the whole of each file will be read.
You didn't include in your sample a case where a header appears in a file but was NOT present in keys, so I assume that can't happen and the script doesn't need to handle it.

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line from FILE2, provided that the first column in FILE1 equals the first column in FILE2 (as a string match, once the chr prefix is accounted for), AND the 2nd column in FILE1 is greater than column 2 of FILE2 but less than column 3 of FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions because 230402 is between 220320 and 439292 and 1 would be equal to chr1 after I make the strings match, therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
F1="FILE1.txt"
while read COL1 COL2
do
    grep -w "chr$COL1" FILE2.tsv \
    | awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++
    next
}
{
    for (key in seq) {
        split(key, tmp, SUBSEP)
        if (tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3) {
            print $0
        }
    }
}' file1 file2
chr1 220320 439292
We read the first file into an array, using columns 1 and 2 as the key. We prepend the string "chr" to column 1 while building the key, for easy comparison later on.
When we process file 2, we iterate over our array and split each key.
We compare the first piece of the key to column 1 and check whether the second piece is in the range given by the second and third columns.
If a line satisfies our condition, we print it.
awk 'BEGIN { i = 0 }
FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
FNR < NR {
    for (c in chr) {
        if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
    }
}' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
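A quick way to see the difference between the two counters (hypothetical two-line files a.txt and b.txt):

$ awk '{ print FILENAME, NR, FNR }' a.txt b.txt
a.txt 1 1
a.txt 2 2
b.txt 3 1
b.txt 4 2

so a condition like FNR == NR is only true while the first file is being read.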
Thanks very much!
These answers work and are very helpful.
Also, at long last I realized I should have had:
awk -v C2=$COL2 '{ if (C2 > $2 && C2 < $3) print $0 }'
with the braces in a different place (and comparing against the range columns $2 and $3), and I would have been fine.
At any rate, thank you very much!
