I am at the initial stage of learning shell scripting, so please explain the steps for better understanding.
Consider that I have two files. Their contents are as below:
File1.txt
ABC=10
DEF=20
XYZ=30
File2.txt
DEF=15
XYZ=20
I want to write a simple shell script that checks both files, adds the values, and prints the final output as below:
ABC=10
DEF=35
XYZ=50
You can use awk:
awk 'BEGIN{FS=OFS="="} FNR==NR{a[$1]=$2;next} {a[$1]+=$2}
END{for (i in a) print i, a[i]}' file1 file2
ABC=10
XYZ=50
DEF=35
Breakdown:
NR == FNR { # While processing the first file
a[$1] = $2 # store the second field by the first in an array
next # move to next record
}
{ # while processing the second file
a[$1]+=$2 # add already stored value by 2nd field in 2nd file
}
END{..} # iterate the array and print the values
If you want to keep the original ordering intact, then use:
awk 'BEGIN{FS=OFS="="} FNR==NR{if (!($1 in a)) b[++n]=$1; a[$1]=$2;next} {a[$1]+=$2}
END{for (i=1; i<=n; i++) print b[i], a[b[i]]}' file1 file2
ABC=10
DEF=35
XYZ=50
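If you want to try this end to end, here is a minimal sketch that recreates the two files from the question and runs the order-preserving version (note that this variant only prints keys that appear in File1; a key found only in File2 would be summed but never printed):

```shell
# Recreate the sample files from the question
printf 'ABC=10\nDEF=20\nXYZ=30\n' > File1.txt
printf 'DEF=15\nXYZ=20\n' > File2.txt

# Sum values per key, preserving the key order of File1
awk 'BEGIN{FS=OFS="="}
     FNR==NR{if (!($1 in a)) b[++n]=$1; a[$1]=$2; next}
     {a[$1]+=$2}
     END{for (i=1; i<=n; i++) print b[i], a[b[i]]}' File1.txt File2.txt
# ABC=10
# DEF=35
# XYZ=50
```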
I am currently using
$ awk 'NR==FNR{a[$1];next} ($3 in a)' find.txt path_to_100_files/*
to search a directory containing multiple files for strings from a .txt file (find.txt).
find.txt contains
example1
example 2
example#eampol.com
exa exa exa123
...
Example lines from the .txt files within the directory:
example example example.com
example 2 example example lol
Currently it searches for the string within column 3, using ($3 in a), where $3 means column #3. But sometimes the string can be in $1 or $5 and so on. How can I get it to search every column instead of just the 3rd?
awk '
NR==FNR{a[$1];next}
{ for (i=1; i<=NF; i++) if ($i in a) { print; next } }
' find.txt path_to_100_files/*
The above assumes your existing script behaves as desired given lines like exa exa exa123 (only the first whitespace-separated field, exa, is stored as a key).
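To see the all-columns version in action, here is a small self-contained run (sample.txt is a hypothetical stand-in for the files in path_to_100_files/):

```shell
# find.txt holds the search keys (only the first field of each line is used)
printf 'example1\nexample#eampol.com\n' > find.txt
# sample.txt is a hypothetical data file
printf 'foo example1 bar\nno match here\nx y example#eampol.com\n' > sample.txt

awk '
    NR==FNR { a[$1]; next }                                  # load keys from find.txt
    { for (i=1; i<=NF; i++) if ($i in a) { print; next } }   # print a line if ANY field is a key
' find.txt sample.txt
# foo example1 bar
# x y example#eampol.com
```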
I have the following two files (the real data is tab-delimited instead of semicolon-delimited):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt (updated)
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column 1 of input.txt is a substring of column 2 of db.txt. Only the first two "fields" separated by | are important here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
Column 2 of db.txt contains strings from column 1 of input.txt, delimited by spaces. There are many more strings in the real data than shown in this short excerpt.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
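The index() variant can be checked end to end by recreating the sample files from the question with literal tabs, which prints the two expected tab-delimited lines:

```shell
# Tab-delimited versions of the question's sample data
printf 'Astring|2042\tMAR0303\tfoo1\tB\nDstring|2929\tMAR0283\tfoo2\tC\n' > input.txt
printf 'TG9284\tAstring|2042|morefoohere_foo_foo\nTG9281\tCstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo\n' > db.txt

# For each input line, find the db key that contains $1 as a substring
awk 'BEGIN{FS=OFS="\t"} NR==FNR {a[$2]=$1; next}
     {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
```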
Going with the semicolons, which you can later replace with tabs:
$ awk -F\; '
NR==FNR { # hash the db file
a[$2]=$1
next
}
{
for(i in a) # for each record in input file
if($1~i) { # see if $1 matches a key in a
print $0 ";" a[i] # output
# delete a[i] # delete entry from a for speed (if possible?)
break # on match, break from for loop for speed
}
}' db input # note the file order: db first, then input
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input, the script matches $1 against every entry in db, so it's slow. The break exits the loop on the first match, and deleting the matched entry from a (commented out above) can speed things up further, if your data allows it.
I have a sorted CSV file where several entries are duplicates except for the last column. How can I concatenate all the last columns onto the first occurrence of each entry?
Input:
Test1,123,somestuff
Test1,123,differentstuff
Test2,345,otherstuff
Output:
Test1,123,somestuff, differentstuff
Test2,345,otherstuff
EDIT:
Obtaining the last column is easy (cut -d, -f3 test.csv); now I need to append it to the first occurrence of each entry.
Use the awk utility:
awk -F, '{ k=$1 FS $2; a[k] = (k in a)? a[k] FS $3 : $3 }
END{ for(i in a) print i,a[i] }' OFS=',' csvfile
The output:
Test1,123,somestuff,differentstuff
Test2,345,otherstuff
-F, - field separator
k=$1 FS $2 - associative array key (grouping records by the first 2 field values)
I have two files, file1 and file2; both files have 5 columns.
I want to compare the first 4 columns of file1 with those of file2.
If they are equal, I need to compare the 5th column. If the 5th column values are different, I need to print file1's 5th column as file2's 6th column.
I have used the awk below to compare two columns in two different files, but how can I compare multiple columns and append the particular column to the other file when a match is found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat
Awk can simulate multidimensional arrays by concatenating the indices. Underneath, the indices are joined using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)
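The SUBSEP mechanics described above can be observed directly in a one-liner:

```shell
# a["x","y"] is really a["x" SUBSEP "y"]; the parenthesized (i,j) in arr syntax tests membership
awk 'BEGIN {
    a["x","y"] = 1
    if (("x","y") in a) print "found"
    for (k in a) { n = split(k, parts, SUBSEP); print n, parts[1], parts[2] }
}'
# found
# 2 x y
```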
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat
I am using the awk command below, which prints each unique value of field $11 and the number of its occurrences in the file, separated by commas. In addition to that, I need the sum of field $14 (the last value) in the output. Please help me with it.
A sample line from the file:
EXSTAT|BNK|2014|11|05|15|29|46|23169|E582754245|QABD|S|000|351
Here $14 is the last value, 351.
bash-3.2$ grep 'EXSTAT|' abc.log|grep '|S|' |
awk -F"|" '{ a[$11]++ } END { for (b in a) { print b"," a[b] ; } }'
QDER,3
QCOL,1
QASM,36
QBEND,23
QAST,3
QGLBE,30
QCD,30
TBENO,1
QABD,9
QABE,5
QDCD,5
TESUB,1
QFDE,12
QCPA,3
QADT,80
QLSMR,6
bash-3.2$ grep 'EXSTAT|' abc.log
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226
Just add another associative array:
awk -F"|" '{a[$11]++;c[$11]+=$14}END{for(b in a){print b"," a[b]","c[b]}}'
tested below:
> cat temp
(same contents as abc.log shown above)
> awk -F"|" '{a[$11]++;c[$11]+=$14}END{for(b in a){print b"," a[b]","c[b]}}' temp
QGLBE,1,424
QADT,20,5510
QASM,1,1592
QCD,1,373
>
You need not use grep to select the lines containing EXSTAT; awk can do that for you as well.
For example:
awk 'BEGIN{FS="|"; OFS=","} $1=="EXSTAT" && $12=="S" {sum[$11]+=$14; count[$11]++} END{for (i in sum) print i,count[i],sum[i]}' abc.log
for the input file abc.log with the contents shown above,
it will give an output as
QASM,1,1592
QGLBE,1,424
QADT,20,5510
QCD,1,373
What it does:
BEGIN{FS="|"; OFS=","} is executed before the input file is processed. It sets FS, the input field separator, to | and OFS, the output field separator, to ,
$1=="EXSTAT" && $12=="S" {sum[$11]+=$14; count[$11]++} is the action performed for each line
$1=="EXSTAT" && $12=="S" checks whether the first field is EXSTAT and the 12th field is S
sum[$11]+=$14 sums field $14 into the array sum, indexed by $11
count[$11]++ increments the array count, indexed by $11
END{for (i in sum) print i,count[i],sum[i]} is executed at end of file; it prints the contents of the arrays
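As a quick self-check of this count-and-sum pattern, here is a run on a hypothetical three-line file (mini.log is an assumed name; the filler fields a..i just pad the line so $11, $12 and $14 land in the right places):

```shell
# Three padded sample records: two QADT, one QCD, all with $12 == "S"
printf 'EXSTAT|a|b|c|d|e|f|g|h|i|QADT|S|000|10\nEXSTAT|a|b|c|d|e|f|g|h|i|QADT|S|000|5\nEXSTAT|a|b|c|d|e|f|g|h|i|QCD|S|000|7\n' > mini.log

# Count and sum per $11; sort because for-in order is unspecified
awk 'BEGIN{FS="|"; OFS=","} $1=="EXSTAT" && $12=="S" {sum[$11]+=$14; count[$11]++}
     END{for (i in sum) print i,count[i],sum[i]}' mini.log | sort
# QADT,2,15
# QCD,1,7
```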
You can use a second array.
awk -F"|" '/EXSTAT\|/&&/\|S\|/{a[$11]++;s[$11]+=$14}\
END{for(b in a)print b","a[b]","s[b];}' abc.log
Explanation
/EXSTAT\|/&&/\|S\|/{a[$11]++;s[$11]+=$14} on lines that contain both EXSTAT| and |S| (the same filter as your grep pipeline), increment the count a[$11] and add $14 to the sum s[$11].
END{for(b in a)print b","a[b]","s[b];} print out all keys in array a, values of array a, and values of array s, separated by commas.
#!/usr/bin/awk -f
BEGIN {
    FS = "|"
}
$1 == "EXSTAT" && $12 == "S" {
    count[$11]++
    foo[$11] += $14
}
END {
    for (bar in foo)
        printf "%s,%s,%s\n", bar, count[bar], foo[bar]
}