I have 3 files with the data below:
File1-
a 10
b 20
c 30
File2-
a 11
b 22
c 45
d 33
File3-
a 23
b 33
c 46
I need to print the output like below if the first column of the three files matches:
a 10 11 23
b 20 22 33
c 30 45 46
I tried the code below but am not getting the required output:
#!/bin/bash
awk 'FNR==NR{a[$1]=$2;next} {print $0,$1 in a?a[$1]:""}' File1 File2 File3
With your shown samples, could you please try following. Written and tested with GNU awk.
awk '
{
arr[$1]=(arr[$1]?arr[$1] OFS:"")$2
count[$1]++
}
END{
for(key in arr){
if(count[key]==(ARGC-1)){
print key,arr[key]
}
}
}
' Input_file1 Input_file2 Input_file3
NOTE: This answer is different from all the answers in the duplicate link shared in the comments under the question.
With shown samples output will be as follows.
a 10 11 23
b 20 22 33
c 30 45 46
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
arr[$1]=(arr[$1]?arr[$1] OFS:"")$2 ##Creating arr with index of first field and value is 2nd field and keep appending its value in array.
count[$1]++ ##Creating count array with index of 1st field and keep increasing it.
}
END{ ##Starting END block of this program from here.
for(key in arr){ ##Traversing through all items in arr.
if(count[key]==(ARGC-1)){ ##Checking condition if count with index of key is equal to ARGC-1 then print current item with its value.
print key,arr[key]
}
}
}
' file1 file2 file3 ##Mentioning Input_file names here.
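Note that for (key in arr) visits keys in an unspecified order, so the lines are not guaranteed to come out as a, b, c. Here is a minimal sketch of one way to keep first-seen order, assuming each key occurs at most once per file. The order array, the counter n, the prev variable and the temporary directory are illustrative names, not part of the answer above:

```shell
tmp=$(mktemp -d)
printf 'a 10\nb 20\nc 30\n' > "$tmp/File1"
printf 'a 11\nb 22\nc 45\nd 33\n' > "$tmp/File2"
printf 'a 23\nb 33\nc 46\n' > "$tmp/File3"

result=$(awk '
!($1 in arr) { order[++n] = $1 }          # record each key the first time it is seen
{
  prev = ($1 in arr) ? arr[$1] OFS : ""   # values collected so far plus a separator, if any
  arr[$1] = prev $2
  count[$1]++
}
END {
  for (i = 1; i <= n; i++) {
    key = order[i]
    if (count[key] == ARGC - 1)           # key was seen once for every input file
      print key, arr[key]
  }
}' "$tmp/File1" "$tmp/File2" "$tmp/File3")
printf '%s\n' "$result"
rm -rf "$tmp"
```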
Using join:
join <(join File1 File2) File3
join works on two files at a time, so the result of joining File1 and File2 is fed into a second join command that compares it with File3.
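join expects both inputs to be sorted on the join field. Below is a sketch of the same idea written with a pipe instead of process substitution, sorting each file first (the sample files happen to be sorted already). The temporary directory and the .sorted file names are only for illustration:

```shell
tmp=$(mktemp -d)
printf 'a 10\nb 20\nc 30\n' > "$tmp/File1"
printf 'a 11\nb 22\nc 45\nd 33\n' > "$tmp/File2"
printf 'a 23\nb 33\nc 46\n' > "$tmp/File3"

# Sort every input on its first field so join can pair the lines.
for f in File1 File2 File3; do
  sort -k1,1 "$tmp/$f" > "$tmp/$f.sorted"
done

# "-" makes the second join read its first input from standard input.
result=$(join "$tmp/File1.sorted" "$tmp/File2.sorted" | join - "$tmp/File3.sorted")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Keys that are missing from any file (d in this sample) are dropped, because join prints only paired lines by default.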
I'm trying to compare the first column of file1 with the first column of file2. If there is a match, print the corresponding value from column 2 of file2.
I checked some suggestions around but didn't find the right code.
file1 (single column)
987
675
21
23
21
2645
file2 (two columns)
234 def
987 one
22 abc
21 two
675 three
24 rty
25 qwe
Expected output:
one
three
two
two
I'm using:
awk 'FNR==NR { r[$1] = $0; next; } r[$1] { print r[$1]; next }' file2 file1
and I get this:
987 one
675 three
21 two
21 two
Any suggestions?
Thank you!
This should work:
$ awk 'FNR==NR{r[$1]=$2; next} $1 in r{print r[$1]}' file2 file1
Essentially, if you don't want to print the first field, just store the second field in your r array.
The second next is redundant. Also, check for the existence of the key in the array with in, since the value might be zero (or the null string), in which case r[$1] will be false.
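To illustrate the last point, here is a small sketch with made-up data (not the asker's files) in which one value in file2 is 0. The truthiness test silently drops that line, while the in test keeps it. The variable names by_value and by_key are illustrative:

```shell
tmp=$(mktemp -d)
printf '987 one\n21 0\n' > "$tmp/file2"
printf '987\n21\n' > "$tmp/file1"

# Truthiness test: r[21] holds 0, which counts as false, so the line is skipped.
by_value=$(awk 'FNR==NR{r[$1]=$2; next} r[$1]{print r[$1]}' "$tmp/file2" "$tmp/file1")

# Membership test: only asks whether the key 21 exists in r.
by_key=$(awk 'FNR==NR{r[$1]=$2; next} $1 in r{print r[$1]}' "$tmp/file2" "$tmp/file1")

printf 'truthiness test:\n%s\n' "$by_value"
printf 'membership test:\n%s\n' "$by_key"
rm -rf "$tmp"
```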
I am referring to this link: https://stackoverflow.com/a/54767231/11084572.
I have a config file where the 2nd column is the feature and the 3rd column is the action. I have another large file where I need to match its 1st column against the 1st column of the config file and perform the action according to the feature.
Assumption: In File.txt the columns are named Min (3rd col), Median (4th), and Max (5th).
Config.txt
Apple All Max
Car abc Median
Car xyz Min
Book cvb Median
Book pqr Max
File.txt
Apple first 10 20 30
Apple second 20 30 40
Car abc 10 20 30
Car xyz 20 30 40
Car wxyz 10 20 30
Book cvb 60 70 80
Book pqr 80 90 100
Expected Output:
Apple first 30
Apple second 40
Car abc 20
Car xyz 20
Car wxyz 10
Book cvb 70
Book pqr 100
The above output is generated with the following approach:
1) Since File.txt is large, if the feature (2nd col) of the config file is All, every line with a matching 1st column performs the action given in the 3rd col of the config file.
2) Otherwise, the action is performed only if the 2nd col of the config file matches as a **substring** of the 2nd col of File.txt.
Here what I have tried:
awk 'BEGIN {m["Min"]=3;m["Median"]=4;m["Max"]=5}
NR==FNR{ arr[$1]=$2;brr[$1]=$3;next}
($1 in arr && arr[$1]=="All") {print $1,$2,$m[brr[$1]]}
($1 in arr && $2==arr[$1] ) {print $1 ,$2,$m[brr[$1]]}
' Config.txt File.txt
Code output:
Apple first 30
Apple second 40
Book pqr 100
Car xyz 20
The above output prints only one entry per matched 1st col (for example, Book cvb 70 is not printed). Also, how could I match the string as an ending substring (e.g. xyz defined in Config.txt should match both xyz and wxyz in File.txt)?
Please help me to solve above challenge. Thanks!
Your expected sample output does not match your sample input (e.g. Car abc 200, where there is no 200 in file.txt). If I understood it correctly, could you please try the following.
awk '
BEGIN{
b["min"]=3
b["max"]=5
b["median"]=4
}
FNR==NR{
c[$1]
++d[$1]
a[$1 d[$1]]=tolower($NF)
next
}
($1 in c){
if(e[$1]<d[$1]){
++e[$1]
}
else{
e[$1]!=""?e[$1]:++e[$1]
}
print $1,$2,$b[a[$1 e[$1]]]
}' config.txt file.txt
Output will be as follows.
Apple first 30
Apple second 40
Car abc 20
Car xyz 20
Car wxyz 10
Book cvb 70
Book pqr 100
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
BEGIN{ ##Mentioning BEGIN section here which will be executed once and before reading Input_file only.
b["min"]=3 ##Creating an array named b whose index is string min and value is 3.
b["max"]=5 ##Creating an array named b whose index is string max and value is 5.
b["median"]=4 ##Creating an array named b whose index is string median and value is 4.
} ##Closing BEGIN section here.
FNR==NR{ ##Checking condition FNR==NR which will be executed when 1st Input_file named config.txt is being read.
c[$1] ##Creating an array named c whose index is $1.
++d[$1] ##Creating an array named d with index $1 whose value keeps increasing by 1 on each occurrence.
a[$1 d[$1]]=tolower($NF) ##Creating an array named a whose index is $1 and value of d[$1] and value is small letters value of $NF(last column) of current line.
next ##Using next keyword of awk to skip all further statements from here.
}
($1 in c){ ##Checking conditions if $1 of current line is present of array c then do following.
if(e[$1]<d[$1]){ ##Checking condition if value of e[$1] is lesser than d[$1] then do following.
++e[$1] ##Creating array named e whose index is $1 and incrementing its value with 1 here.
}
else{ ##Using else for above if condition here.
e[$1]!=""?e[$1]:++e[$1] ##Checking if e[$1] is NULL then increment it with 1 or leave it as it is.
}
print $1,$2,$b[a[$1 e[$1]]] ##Printing 1st, 2nd fields value along with field value of array b whose index is value of array a with index of $1 e[$1] here.
}' config.txt file.txt ##Mentioning Input_files here.
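The script above pairs config rows with data rows by position. As an alternative, here is a sketch that applies the rules stated in the question directly: All matches every line with the same 1st column, and any other feature must match the end of the 2nd column. It assumes the first matching rule wins. The array names feat, act, n and col are illustrative:

```shell
tmp=$(mktemp -d)
printf 'Apple All Max\nCar abc Median\nCar xyz Min\nBook cvb Median\nBook pqr Max\n' > "$tmp/Config.txt"
printf 'Apple first 10 20 30\nApple second 20 30 40\nCar abc 10 20 30\nCar xyz 20 30 40\nCar wxyz 10 20 30\nBook cvb 60 70 80\nBook pqr 80 90 100\n' > "$tmp/File.txt"

result=$(awk '
BEGIN { col["Min"] = 3; col["Median"] = 4; col["Max"] = 5 }
NR == FNR {                      # first file: remember every rule per key
  n[$1]++
  feat[$1, n[$1]] = $2
  act[$1, n[$1]]  = $3
  next
}
($1 in n) {                      # second file: try the rules for this key in order
  for (i = 1; i <= n[$1]; i++) {
    f = feat[$1, i]
    # a rule applies if it is "All" or if field 2 ends with the feature
    if (f == "All" || substr($2, length($2) - length(f) + 1) == f) {
      print $1, $2, $(col[act[$1, i]])
      break
    }
  }
}' "$tmp/Config.txt" "$tmp/File.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```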
I'm using awk to merge multiple (>3) files, and I want to keep the headers. I found a previous post that does exactly what I need, but I don't quite understand what's happening. I was hoping someone could walk me through it so I can learn from it! (I tried commenting on the original post but did not have enough reputation)
This code
awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*
transforms the input files as desired. See example tables below.
Input files:
file1.txt:
id value1
a 10
b 30
c 50
file2.txt:
id value2
a 90
b 30
c 20
file3.txt:
id value3
a 0
b 1
c 25
desired output
merge.txt:
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
Again, here's the code
awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f* > merge.txt
I'm having trouble understanding the first part of the code {a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}, but understand the loop in the second part of the code.
I think that in the first part of the code an array is being built. The code runs through and checks for matching records on the first column id, and if there's a match it appends the second column ($2) value and prints the entire record ($0).
But...I don't understand the beginning syntax. When is it established that the first column id is the same across all three files and to only add the second column?
That code is buggy and unnecessarily complicated, use this instead:
$ awk 'NR==FNR{a[FNR]=$0; next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
Pipe the output to column -t for alignment if you like:
$ awk 'NR==FNR{a[NR]=$0;next} {a[FNR] = a[FNR] OFS $2} END{for (i=1;i<=FNR;i++) print a[i]}' file1 file2 file3 | column -t
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
If you NEED to key off ids (e.g. because they differ across the files) then it'd be:
$ awk '
BEGIN { OFS="\t" }
!($1 in a) { ids[++numIds]=$1 }
{ a[$1][ARGIND]=$2 }
END {
for (i=1;i<=numIds;i++) {
id = ids[i]
printf "%s%s", id, OFS
for (j=1;j<=ARGIND;j++) {
printf "%s%s", a[id][j], (j<ARGIND ? OFS : ORS)
}
}
}
' file1 file2 file3 | column -s$'\t' -t
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 25
x 20
That last script used GNU awk for multi-dimensional arrays and just had c changed to x in input file2 to test it.
Feel free to ask if you have questions but I THINK that code is pretty clear.
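For awks without arrays of arrays or ARGIND, the same id-keyed merge can be sketched with a composite key and a hand-rolled file counter. This is an assumption-laden variant, not the answer's script: nfiles, seen and val are illustrative names, and the counter only advances for non-empty files. As in the test above, file2 has x in place of c:

```shell
tmp=$(mktemp -d)
printf 'id value1\na 10\nb 30\nc 50\n' > "$tmp/file1"
printf 'id value2\na 90\nb 30\nx 20\n' > "$tmp/file2"
printf 'id value3\na 0\nb 1\nc 25\n' > "$tmp/file3"

result=$(awk '
BEGIN { OFS = "\t" }
FNR == 1 { nfiles++ }                          # stands in for ARGIND
!($1 in seen) { seen[$1]; ids[++numIds] = $1 } # remember ids in first-seen order
{ val[$1, nfiles] = $2 }                       # composite key: id SUBSEP file number
END {
  for (i = 1; i <= numIds; i++) {
    line = ids[i]
    for (j = 1; j <= nfiles; j++)
      line = line OFS val[ids[i], j]           # missing cells come out empty
    print line
  }
}' "$tmp/file1" "$tmp/file2" "$tmp/file3")
printf '%s\n' "$result"
rm -rf "$tmp"
```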
First the data:
file1                 file2                 file3
NR  FNR  $1  $2       NR  FNR  $1  $2       NR  FNR  $1  $2
===================   ===================   ===================
 1  1    id  value1    5  1    id  value2    9  1    id  value3
 2  2    a   10        6  2    a   90       10  2    a   0
 3  3    b   30        7  3    b   30       11  3    b   1
 4  4    c   50        8  4    c   20       12  4    c   25
The first part: a[FNR]=( (a[FNR]) ? a[FNR]FS$2 : $0 ) could be written as:
if(a[FNR]=="") # actually if(a[FNR]=="" || a[FNR]==0)
a[FNR]=$0 # a[FNR] is "id value1" when NR==1
else
a[FNR]=a[FNR] FS $2 # a[FNR]="id value1" FS "value2" when NR==5
Each file has 4 records, so FNR==4 on the last record of each file. Because FNR keeps its value after the last file has been processed, it is still 4 in the END block:
END { # after hashing all record in all files
for(i=1;i<=FNR;i++) # i=1, 2, 3, 4
print a[i] # print "id value1 value2 value3" etc.
}
James has explained pretty well the awk logic in his answer.
In case you're looking for an alternative here is a paste based solution:
paste file1 file2 file3 | awk '{print $1, $2, $4, $6}' OFS='\t'
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
FNR is the record number relative to the current input file, i.e. the line number within file1, file2, etc. http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/?ref=binfind.com/web
The ? is the ternary operator and is saying, if there's already something in a[FNR] then append $2 of the current record to what's there, else it's null so store the whole record (i.e. $0).
Pseudo code that may help explain things:
if (a[FNR] != "")
a[FNR] = a[FNR] FS $2
else
a[FNR] = $0
You can see that the a, b, c from every record after the first file are dropped. They could be x, y, z and this program wouldn't care. It takes the second field and appends it to a[2], a[3], etc.
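A quick way to see how NR and FNR differ is to print both next to each record. This is a small sketch with two shortened sample files; the names f1 and f2 and the temporary directory are illustrative:

```shell
tmp=$(mktemp -d)
printf 'id value1\na 10\n' > "$tmp/f1"
printf 'id value2\na 90\n' > "$tmp/f2"

# NR keeps counting across files, FNR restarts at 1 for each file.
result=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, $0 }' f1 f2)
printf '%s\n' "$result"
rm -rf "$tmp"
```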
You can use awk with pr to do this:
$ pr -mts$'\t' f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
(Those are tabs in between the columns)
Or use paste the same way:
$ paste f1 <(awk '{print $2}' f2) <(awk '{print $2}' f3)
id value1 value2 value3
a 10 90 0
b 30 30 1
c 50 20 25
I have a CSV file with, say, 20 headers, and the corresponding values for those headers are in the next row for a particular record.
Example : Source file
Age,Name,Salary
25,Anand,32000
I want my output file to be in this format.
Example : Output file
Age
25
Name
Anand
Salary
32000
Which awk/grep/sed command should be used to do this?
I'd say
awk -F, 'NR == 1 { split($0, headers); next } { for(i = 1; i <= NF; ++i) { print headers[i]; print $i } }' filename
That is
NR == 1 { # in the first line
split($0, headers) # remember the headers
next # do nothing else
}
{ # after that:
for(i = 1; i <= NF; ++i) { # for all fields:
print headers[i] # print the corresponding header
print $i # followed by the field
}
}
Addendum: Obligatory, crazy sed solution (not recommended for production use; written for fun, not profit):
sed 's/$/,/; 1 { h; d; }; G; :a s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/; ta; s/^\n\n//' filename
That works as follows:
s/$/,/ # Add a comma to all lines for more convenient processing
1 { h; d; } # first line: Just put it in the hold buffer
G # all other lines: Append hold buffer (header fields) to the
# pattern space
:a # jump label for looping
# isolate the first fields from the data and header lines,
# move them to the end of the pattern space
s/\([^,]*\),\([^\n]*\n\)\([^,]*\),\(.*\)/\2\4\n\3\n\1/
ta # do this until we got them all
s/^\n\n// # then remove the two newlines that are left as an artifact of
# the algorithm.
Here is one awk solution:
awk -F, 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} {for (i=1;i<=NF;i++) print a[i] RS $i}' file
Age
25
Name
Anand
Salary
32000
The first for loop stores the headers in array a.
The second for loop prints each header from array a with its corresponding data.
Using GNU awk 4.* for 2D arrays:
$ awk -F, '{a[NR][1];split($0,a[NR])} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print a[j][i]}' file
Age
25
Name
Anand
Salary
32000
In general to transpose rows and columns:
$ cat file
11 12 13
21 22 23
31 32 33
41 42 43
with GNU awk:
$ awk '{a[NR][1];split($0,a[NR])} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) printf "%s%s", a[j][i], (j<NR?OFS:ORS)}' file
11 21 31 41
12 22 32 42
13 23 33 43
or with any awk:
$ awk '{for (i=1;i<=NF;i++) a[NR,i]=$i} END{for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) printf "%s%s", a[j,i], (j<NR?OFS:ORS)}' file
11 21 31 41
12 22 32 42
13 23 33 43
Here I have tried an awk script to compare fields from two different files.
awk 'NR == FNR {if (NF >= 4) a[$1] b[$4]; next} {for (i in a) for (j in b) if (i >= $2 && i <=$3 && j>=$2 && j<=$3 ) {print $1, $2, $3, i, j; next}}' file1 file2
Input files:
File1:
24926 17 206 25189 5.23674 5.71882 4.04165 14.99721 c
50760 17 48 50874 3.49903 4.25043 7.66602 15.41548 c
104318 15 269 104643 2.94218 5.18301 5.97225 14.09744 c
126088 17 70 126224 3.12993 5.32649 6.14936 14.60578 c
174113 16 136 174305 4.32339 2.36452 8.60971 15.29762 c
196474 14 89 196626 2.24367 5.16966 7.33723 14.75056 c
......
......
File2:
GT_004279 1 280
GT_003663 19891 20217
GT_003416 22299 23004
GT_003151 24916 25391
GT_001715 39470 39714
GT_001585 40896 41380
....
....
The output which I got is:
GT_004279 1 280 2465483 2639576
GT_003663 19891 20217 2005645 2005798
GT_003416 22299 23004 2291204 2269898
GT_003151 24916 25391 2501183 25189
GT_001715 39470 39714 3964440 3950417
......
......
The desired output should contain the lines where the 1st and 4th field values from file1 lie between the 2nd and 3rd field values from file2. For example, taking the lines given above as input files, the output must be:
GT_003151 24916 25391 24926 25189
If I guess correctly, the problem is within the if condition. Could someone help to rectify this problem?
Thanks
You need to make composite keys and iterate through them. When you create such composite keys, their parts are separated by the SUBSEP variable. So you just split on that and do the check.
awk '
NR==FNR{ flds[$1,$4]; next }
{
for (key in flds) {
split (key, fld, SUBSEP)
if ($2<=fld[1] && $3>=fld[2])
print $0, fld[1], fld[2]
}
}' file1 file2
GT_003151 24916 25391 24926 25189
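As a small illustration of how a composite key is stored and taken apart again, here is a sketch using the first four fields of one line from File1. The names k, parts and n are illustrative:

```shell
result=$(printf '24926 17 206 25189\n' | awk '
{ k[$1, $4] }                        # stored under the key $1 SUBSEP $4
END {
  for (key in k) {
    n = split(key, parts, SUBSEP)    # recover the two components of the key
    print n, parts[1], parts[2]
  }
}')
printf '%s\n' "$result"
```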