Printing content in two-file processing in awk - shell

I am referring to this link: https://stackoverflow.com/a/54767231/11084572.
I have a config file where the 2nd column is a feature and the 3rd column is an action. I have another large file whose 1st column I need to match against the 1st column of the config file, performing the action according to the feature.
Assumption: in File.txt the columns are named Min (3rd col), Median (4th), Max (5th).
Config.txt
Apple All Max
Car abc Median
Car xyz Min
Book cvb Median
Book pqr Max
File.txt
Apple first 10 20 30
Apple second 20 30 40
Car abc 10 20 30
Car xyz 20 30 40
Car wxyz 10 20 30
Book cvb 60 70 80
Book pqr 80 90 100
Expected Output:
Apple first 30
Apple second 40
Car abc 20
Car xyz 20
Car wxyz 10
Book cvb 70
Book pqr 100
The above output is generated with the following approach:
1) Since File.txt is large, if the feature (2nd col) of the config file is All, then every line with a matching 1st column performs the action given in the 3rd col of the config file.
2) Otherwise the action is performed if the 2nd col of the config file matches as a **substring** of the 2nd col of File.txt.
Here is what I have tried:
awk 'BEGIN {m["Min"]=3;m["Median"]=4;m["Max"]=5}
NR==FNR{ arr[$1]=$2;brr[$1]=$3;next}
($1 in arr && arr[$1]=="All") {print $1,$2,$m[brr[$1]]}
($1 in arr && $2==arr[$1] ) {print $1 ,$2,$m[brr[$1]]}
' Config.txt File.txt
Code output:
Apple first 30
Apple second 40
Book pqr 100
Car xyz 20
The above output prints only one line per matched 1st col (e.g. Book cvb 70 is not printed). Also, how could I match the string as an ending string (e.g. xyz defined in Config.txt matching both xyz and wxyz of File.txt)?
Please help me solve the above challenge. Thanks!

If I got it correctly, could you please try the following.
awk '
BEGIN{
  b["min"]=3
  b["max"]=5
  b["median"]=4
}
FNR==NR{
  c[$1]
  ++d[$1]
  a[$1 d[$1]]=tolower($NF)
  next
}
($1 in c){
  if(e[$1]<d[$1]){
    ++e[$1]
  }
  else{
    e[$1]!=""?e[$1]:++e[$1]
  }
  print $1,$2,$b[a[$1 e[$1]]]
}' config.txt file.txt
Output will be as follows.
Apple first 30
Apple second 40
Car abc 20
Car xyz 20
Car wxyz 10
Book cvb 70
Book pqr 100
Explanation: adding an explanation for the above code now.
awk ' ##Starting the awk program here.
BEGIN{ ##BEGIN section, executed once, before any Input_file is read.
  b["min"]=3 ##Creating an array named b whose index is the string min and whose value is 3.
  b["max"]=5 ##Index max, value 5.
  b["median"]=4 ##Index median, value 4.
} ##Closing the BEGIN section here.
FNR==NR{ ##Condition FNR==NR is true only while the 1st Input_file, config.txt, is being read.
  c[$1] ##Creating an array named c whose index is $1.
  ++d[$1] ##Creating an array named d, indexed by $1, whose value increases by 1 on each occurrence of $1.
  a[$1 d[$1]]=tolower($NF) ##Creating an array named a, indexed by $1 concatenated with d[$1], whose value is the lower-cased $NF (last column) of the current line.
  next ##Using the next keyword to skip all further statements.
}
($1 in c){ ##If $1 of the current line is present in array c then do the following.
  if(e[$1]<d[$1]){ ##If e[$1] is less than d[$1] then do the following.
    ++e[$1] ##Increment e[$1] (indexed by $1) by 1.
  }
  else{ ##Else part of the above condition.
    e[$1]!=""?e[$1]:++e[$1] ##If e[$1] is NULL, set it to 1; otherwise leave it as it is.
  }
  print $1,$2,$b[a[$1 e[$1]]] ##Printing the 1st and 2nd fields along with the field whose number b[...] gives, looked up through array a with index $1 e[$1].
}' config.txt file.txt ##Mentioning the Input_file names here.
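As a side note on the suffix-matching part of the question (making config entry xyz match both xyz and wxyz): the original attempt also lost entries because arr[$1]=$2 keeps only the last config line per key, which is why Book cvb 70 disappeared. A minimal sketch, assuming the config stays small enough for a linear scan per line, that keeps every config line and does a literal suffix test with substr()/length():
awk '
BEGIN{ m["Min"]=3; m["Max"]=5; m["Median"]=4 }
FNR==NR{                          # store every config line, not just one per key
  key[FNR]=$1; pat[FNR]=$2; act[FNR]=$3; n=FNR
  next
}
{
  for(i=1; i<=n; i++)
    if($1==key[i] && (pat[i]=="All" || substr($2, length($2)-length(pat[i])+1)==pat[i])){
      print $1, $2, $(m[act[i]])  # literal suffix test on $2
      next                        # first matching config line wins
    }
}' Config.txt File.txt
With the shown samples this prints the expected output, including Car wxyz 10 via the xyz suffix.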

Related

Print the output if the first column of three files matches

I'm having 3 files with data as below:
File1-
a 10
b 20
c 30
File2-
a 11
b 22
c 45
d 33
File3-
a 23
b 33
c 46
I need to print the output like below if the first column of the three files matches:
a 10 11 23
b 20 22 33
c 30 45 46
I tried the code below but am not getting the required output:
#!/bin/bash
awk 'FNR==NR{a[$1]=$2;next} {print $0,$1 in a?a[$1]:""}' File1 File2 File3
With your shown samples, could you please try the following. Written and tested with GNU awk.
awk '
{
  arr[$1]=(arr[$1]?arr[$1] OFS:"")$2
  count[$1]++
}
END{
  for(key in arr){
    if(count[key]==(ARGC-1)){
      print key,arr[key]
    }
  }
}
' Input_file1 Input_file2 Input_file3
NOTE: just want to add this as a new answer, distinct from all the answers mentioned in the duplicate link shared under the comments of the question.
With the shown samples, the output will be as follows.
a 10 11 23
b 20 22 33
c 30 45 46
Explanation: adding a detailed explanation for the above.
awk ' ##Starting the awk program from here.
{
  arr[$1]=(arr[$1]?arr[$1] OFS:"")$2 ##Creating arr, indexed by the first field; keep appending the 2nd field to its value.
  count[$1]++ ##Creating the count array, indexed by the 1st field, and keep increasing it.
}
END{ ##Starting the END block of this program.
  for(key in arr){ ##Traversing through all items in arr.
    if(count[key]==(ARGC-1)){ ##If count for this key equals ARGC-1 (the number of input files), print the current item with its value.
      print key,arr[key]
    }
  }
}
' file1 file2 file3 ##Mentioning the Input_file names here.
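One caveat, as an assumption about your data: if the same key can repeat within a single file, count[$1] overcounts and the count[key]==(ARGC-1) test could pass even when a key is missing from one file. A hedged sketch that counts each key at most once per file (values from duplicate lines would still all be appended to arr):
awk '
!seen[FILENAME, $1]++ {            # count a key only once per file
  count[$1]++
}
{
  arr[$1]=(arr[$1]?arr[$1] OFS:"")$2
}
END{
  for(key in arr){
    if(count[key]==(ARGC-1)){
      print key,arr[key]
    }
  }
}
' Input_file1 Input_file2 Input_file3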
Using join:
join <(join File1 File2) File3
join works on two files at a time, so the result of joining File1 and File2 is redirected into another join command that combines it with File3.
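Note that join expects both inputs to be sorted on the join field; if the files might not already be sorted, a variant along these lines (bash process substitution) should do:
join <(join <(sort File1) <(sort File2)) <(sort File3)
The inner join's output is already ordered on the join field, so the outer join sees sorted input on both sides.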

How can I reorder lines in a text file based on a pattern?

I have a text file that contains batches of 4 lines. The first line of each batch is in the correct position, but the next 3 lines are not always in the correct order.
name cat
label 4
total 5
value 4
name dog
total 4
label 3
value 6
name cow
value 6
total 1
label 4
name fish
total 3
label 5
value 6
I would like each 4 line batch to be in the following format:
name cat
value 4
total 5
label 4
so the output would be:
name cat
value 4
total 5
label 4
name dog
value 6
total 4
label 3
name cow
value 6
total 1
label 4
name fish
value 6
total 3
label 5
The file contains thousands of lines in total, so I would like to build a command that can deal with all potential orders of the 3 lines and re-arrange them if they are not in the correct format.
I am aware I can use awk to search for lines that begin with a particular string and re-arrange them:
awk '$1 == "value" { print $3, $4, $1, $2; next; } 1'
However, I cannot figure out how to achieve something similar that processes over multiple lines.
How can I achieve this?
By setting RS to the empty string, each block of text separated by at least one empty line is considered a single record. From there it's easy to capture each key-value pair and output them in the desired order.
BEGIN {RS=""}
{
    for (i=1; i<=NF; i+=2) a[$i] = $(i+1)
    print "name", a["name"] ORS \
          "value", a["value"] ORS \
          "total", a["total"] ORS \
          "label", a["label"] ORS
}
$ awk -f a.awk file
name cat
value 4
total 5
label 4
name dog
value 6
total 4
label 3
name cow
value 6
total 1
label 4
name fish
value 6
total 3
label 5
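One small caveat with the version above: the array a is never cleared between records, so if a block were missing one of the four keys, the value left over from the previous block would be printed for it. A minimal guard, under the same blank-line-separated-blocks assumption (split("", a) is the portable way to empty an array; delete a also works in gawk/mawk):
BEGIN {RS=""}
{
    split("", a)                       # empty the array before each block
    for (i=1; i<=NF; i+=2) a[$i] = $(i+1)
    print "name", a["name"] ORS \
          "value", a["value"] ORS \
          "total", a["total"] ORS \
          "label", a["label"] ORS
}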
Could you please try the following.
awk '
/^name/{
  if(name){
    print name ORS array["value"] ORS array["total"] ORS array["label"] ORS
    delete array
  }
  name=$0
  next
}
{
  array[$1]=$0
}
END{
  print name ORS array["value"] ORS array["total"] ORS array["label"]
}
' Input_file
EDIT: adding a refined version of the above, suggested by Kvantour.
awk -v OFS="\n" '
(!NF) && ("name" in a){
  print a["name"],a["value"],a["total"],a["label"] ORS
  delete a
  next
}
{
  a[$1]=$0
}
END{
  print a["name"],a["value"],a["total"],a["label"]
}
' Input_file
The simplest way is the following:
awk 'BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"}
{ for(i=1;i<=NF;++i) { k=substr($i,1,index($i," ")-1); a[k]=$i } }
{ print a["name"],a["value"],a["total"],a["label"] }' file
How does this work?
Awk knows the concepts of records and fields. Files are split into records, where consecutive records are split by the record separator RS. Each record is split into fields, where consecutive fields are split by the field separator FS. By default, the record separator RS is set to the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="" and the field separator FS="\n".
Each record looks simplified as:
key1 string1 << field $1
key2 string2 << field $2
key3 string3 << field $3
key4 string4 << field $4
...
keyNF stringNF << field $NF
When awk reads a record, we first parse it by storing all key-value pairs in an array a. Afterwards, we print the values we find interesting. For this, we need to define the output field separator OFS and the output record separator ORS.
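A tiny self-contained demo of this record/field splitting, with made-up data:
$ printf 'a 1\nb 2\n\nc 3\nd 4\n' |
  awk 'BEGIN{RS=""; FS="\n"} {print "record " NR " has " NF " fields"}'
record 1 has 2 fields
record 2 has 2 fields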
In Vim you could sort each batch with a reverse-order sort! (the desired order value, total, label happens to be reverse alphabetical):
for i in range(1,line("$"))
/^name/+1,/^name/+3sort!
endfor
Same command issued from the shell:
$ ex -s '+for i in range(1,line("$"))|/^name/+1,/^name/+3sort!|endfor' '+%p' '+q!' inputfile

Add unique value from first column before each group

I have following file contents:
T12 19/11/19 2000
T12 18/12/19 2040
T15 19/11/19 2000
T15 18/12/19 2080
How can I get the following output with awk, bash, etc.? I searched for similar examples but didn't find one so far:
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Thanks,
S
Could you please try the following. This code will print the output in the same order in which the first field occurs in the Input_file.
awk '
!a[$1]++ && NF{
  b[++count]=$1
}
NF{
  val=$1
  $1=""
  sub(/^ +/,"")
  c[val]=(c[val]?c[val] ORS:"")$0
}
END{
  for(i=1;i<=count;i++){
    print b[i] ORS c[b[i]]
  }
}
' Input_file
Output will be as follows.
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Explanation: adding a detailed explanation for the above code here.
awk ' ##Starting the awk program from here.
!a[$1]++ && NF{ ##If $1 has NOT been seen before in array a and the line is NOT empty then do the following.
  b[++count]=$1 ##Creating array b, indexed by the increasing variable count, whose value is the first field of the current line.
} ##Closing this block.
NF{ ##If the line is NOT empty then do the following.
  val=$1 ##Creating variable val whose value is $1 of the current line.
  $1="" ##Nullifying $1 of the current line.
  sub(/^ +/,"") ##Substituting the initial space with NULL in the line.
  c[val]=(c[val]?c[val] ORS:"")$0 ##Creating array c, indexed by val; keep appending the current line to its value, separated by ORS.
} ##Closing this block.
END{ ##Starting the END block of this awk program.
  for(i=1;i<=count;i++){ ##Running a for loop from i=1 to the value of count.
    print b[i] ORS c[b[i]] ##Printing b[i] (the group name) and then the collected lines c[b[i]].
  }
} ##Closing the END block.
' Input_file ##Mentioning the Input_file name here.
Here is a quick awk:
$ awk 'BEGIN{RS="";ORS="\n\n"}{printf "%s\n",$1; gsub($1" +",""); print}' file
How does it work?
Awk knows the concepts of records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
By default, the field separator is set to be any sequence of blanks. So $1 points to that particular word we want on its own line. We print it with printf, and then remove every reference to it with gsub.
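One caveat with gsub($1" +",""): the first argument is a dynamic regular expression, so a first word containing regex metacharacters (., *, [ ...) could match more than intended. A hedged, literal-minded sketch under the same blank-line-separated-groups assumption:
awk 'BEGIN{RS=""; ORS="\n\n"}
{
  key = $1
  out = key
  n = split($0, line, "\n")
  for (i = 1; i <= n; i++) {
    rest = line[i]
    if (index(rest, key " ") == 1)        # drop the leading key literally
      rest = substr(rest, length(key) + 2)
    sub(/^ +/, "", rest)                  # trim any remaining leading blanks
    out = out "\n" rest
  }
  print out
}' file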
awk is very flexible and provides a number of ways to solve the same problem. The answers you already have are excellent. Another way to approach the problem is to simply keep a single variable that holds the current field 1 as its value (unset by default). When the first field changes, you output the first field as the new heading. Otherwise you output the 2nd and 3rd fields. If a blank line is encountered, simply output a newline.
awk -v h= '
NF < 3 {print ""; next}
$1 != h {h=$1; print $1}
{printf "%s %s\n", $2, $3}
' file
Above are the 3 rules. If the line has fewer than three fields (NF < 3), which covers the empty lines, output a newline and skip to the next record. The second rule checks whether the first field differs from the current heading variable h -- if it does, set h to the new heading and output it. All non-empty records have their 2nd and 3rd fields output.
Result
Just paste the command above at the command line and you will get the desired result, e.g.
awk -v h= '
> NF < 3 {print ""; next}
> $1 != h {h=$1; print $1}
> {printf "%s %s\n", $2, $3}
> ' file
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080

Detect increment made in any column

I have the following data as input. I am trying to find the increment per group.
col1 col2 col3 group
1 2 100 alpha
1 2 100 alpha
1 2 100 alpha
3 4 200 beta
3 4 200 beta
3 4 200 beta
3 4 300 beta
5 6 700 charlie
7 8 400 tango
7 8 300 tango
7 8 700 tango
Example output:
tango: 300
charlie:0
beta:100
alpha:0
I am trying this approach, but the answers are incorrect because sometimes the values increase in between the samples:
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 200 beta
5 6 700 charlie
7 8 300 tango
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |tac|awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 300 beta
5 6 700 charlie
7 8 700 tango
Awk solution:
awk 'NR==1{ next }
g && $4 != g{ print g":"(v - gr[g]) }
!($4 in gr){ gr[$4]=$3 }{ g=$4; v=$3 }
END{ print g":"(v - gr[g]) }' file
NR==1{ next } - skip the 1st record
g - variable aimed to hold group name
v - variable aimed to hold group value
!($4 in gr){ gr[$4]=$3 } - on the 1st occurrence of a distinct group name $4 - save its first value $3 into array gr
g && $4 != g{ print g":"(v - gr[g]) } - if the current group name $4 differs from the previous one g - print the delta between the last and 1st values of the previous group
The output:
alpha:0
beta:100
charlie:0
tango:300
The following should do the trick, this solution does not require the file to be sorted by group name.
awk '(NR==1){next}
{groupc[$4]++}
(groupc[$4]==1){groupv[$4]=$3}
{groupl[$4]=$3}
END{for(i in groupc) { print i":",groupl[i]-groupv[i]} }
' foo
The following things happen:
skip the first line: (NR==1){next}
count how many times each group occurs: {groupc[$4]++}
if the group count equals 1, record the group's first value in groupv
record the last seen value in groupl
at the END, run over all array keys (which are the groups) and print the last value minus the first value.
Output:
tango: 300
alpha: 0
beta: 100
charlie: 0
The following awk may help too. It prints the groups in the same order in which their last-column values appear in the Input_file.
awk '
FNR==1{
  next
}
prev!=$NF && prev{
  val=prev_val!=a[prev]?prev_val-a[prev]:0
  printf("%s %d\n",prev,val>0?val:0)
}
!a[$NF]{
  a[$NF]=$(NF-1)
}
{
  prev=$NF
  prev_val=$(NF-1)
}
END{
  val=prev_val!=a[prev]?prev_val-a[prev]:0
  printf("%s %d\n",prev,val>0?val:0)
}
' Input_file
Output will be as follows; an explanation is added below.
alpha 0
beta 100
charlie 0
tango 300
Explanation: adding an explanation of the code, for the learning purposes of all.
awk ' ##Starting the awk program from here.
FNR==1{ ##Skipping the first line of the Input_file, which is the heading.
  next ##next skips all further statements for this line.
}
prev!=$NF && prev{ ##If variable prev is NOT equal to the current line's $NF and prev is NOT NULL then do the following:
  val=prev_val!=a[prev]?prev_val-a[prev]:0 ##Create variable val: if prev_val differs from a[prev] then prev_val minus a[prev], else zero.
  printf("%s %d\n",prev,val>0?val:0) ##Print prev (the group name from the last column), then val if it is greater than 0, else 0.
}
!a[$NF]{ ##If array a has no value yet for index $NF, fill it with the current $(NF-1); this captures the very first value of each group so it can later be subtracted from the group's last value, as requested.
  a[$NF]=$(NF-1)
}
{
  prev=$NF ##prev holds the last column of the current line.
  prev_val=$(NF-1) ##prev_val holds the second-to-last column of the current line.
}
END{ ##The END block runs once the Input_file has been read completely.
  val=prev_val!=a[prev]?prev_val-a[prev]:0 ##Compute val for the final group, exactly as above.
  printf("%s %d\n",prev,val>0?val:0) ##Print the final group and its val, exactly as above.
}
' Input_file ##Mentioning the Input_file name here.
$ cat tst.awk
NR==1 { next }
!($4 in beg) { beg[$4] = $3 }
{ end[$4] = $3 }
END {
    for (grp in beg) {
        print grp, end[grp] - beg[grp]
    }
}
$ awk -f tst.awk file
tango 300
alpha 0
beta 100
charlie 0
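Since for (grp in beg) visits keys in an unspecified order, here is a sketch of an order-preserving variant (tst2.awk is just an illustrative name); it records first-seen order in a plain indexed array, with no GNU extensions assumed:
$ cat tst2.awk
NR==1 { next }
!($4 in beg) { beg[$4] = $3; order[++n] = $4 }   # remember first-seen order
{ end[$4] = $3 }
END {
    for (i = 1; i <= n; i++) {
        print order[i], end[order[i]] - beg[order[i]]
    }
}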

awk code to collapse rows and keep the whole line based on one column's value

I'm trying to write an awk script to collapse identical rows (defined by several columns) and keep the whole line that has the minimum value.
This is my example input:
A 20 30 Boston US 3 tempCity top
A 20 30 London UK 2 coldCity top
A 20 30 Singapore SG 4 hotCity top
B 10 20 Tokyo JP 3 coldCity mid
I would like to keep only one row, the one with the minimum value in the sixth column, whenever the first, second, third and eighth columns are the same. This is my expected output:
A 20 30 London UK 2 coldCity top
B 10 20 Tokyo JP 3 coldCity mid
I have tried to write this code:
awk -v OFS='\t' '{par=$1 OFS $2 OFS $3 OFS $8} $6<a[par]{a[par]=(par in a)?a[par]$0:$0} END {for (i in a) print i, a[i]}' cityList.txt
but I only got the following output:
A 20 30 top
B 10 20 mid
I'm a newbie in awk, so any help is much appreciated! Thanks in advance!
You're almost there!
awk -v OFS='\t' '!(($1,$2,$3,$8) in min) || $6 < min[$1,$2,$3,$8] { min[$1,$2,$3,$8] = $6; a[$1,$2,$3,$8] = $0 } END {for (i in a) print a[i]}' file
I changed the condition on setting the value in the array a, so that the line is stored when the key is not yet defined or when $6 is smaller than the minimum seen so far. The minimum itself is kept in a separate array min, so the comparison is a numeric one against the sixth column rather than against the whole saved line.
I've chosen to use $1,$2,$3,$8 everywhere - you could set a variable equal to this using $1 SUBSEP $2 SUBSEP $3 SUBSEP $8 if you want to avoid repetition. SUBSEP is a control character, which is very unlikely to clash with the contents of the key.
The loop in the END block only prints out the line stored in a[i], rather than concatenating the key, which you were attempting to do.
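For reference, a sketch of that variable form (same logic as above, with the key built once per line):
awk -v OFS='\t' '
{ key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $8 }
!(key in min) || $6 < min[key] { min[key] = $6; a[key] = $0 }
END { for (i in a) print a[i] }' file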
An alternative to awk, perhaps easier to read as well:
$ sort -k6,6n cities | sort -u -k1,3 -k8
A 20 30 London UK 2 coldCity top
B 10 20 Tokyo JP 3 coldCity mid
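How this works: the first sort orders all lines numerically on the sixth column, so within every duplicate group the minimum comes first; the second sort -u then keeps a single line per distinct key built from columns 1-3 and 8. POSIX leaves unspecified which of the equal-key lines -u keeps, but GNU sort retains the first of each equal run, which is what this relies on.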
