I am trying to compare columns of two CSV files and save all matched lines to a new CSV file with a header. Below are the example files:
file1:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775339,marker,gene3,1895,1962,Parent=mRNA1
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
file2:
Id,start,End
C2002,895,1166
C2003,1895,2166
C2004,2795,2962
Here I am trying to compare the 4th and 5th columns of file1 with the 2nd and 3rd columns of file2, and save the line to a new CSV file if they match.
Using this command: awk -F',' 'NR==FNR{A[$2,$3]=$0;next} A[$4,$5]' file2 file1 I am getting this output:
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
But I want the header of file1 as well, which is only achieved if the header names are identical in both files, for example if startpos and endpos of file1 are changed to start and End, or vice versa.
Is there any way it can be done without identical header names? My expected output file would be:
output:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
You may add another condition, FNR == 1, so that the header line of file1 is always printed:
awk -F, 'NR==FNR {A[$2,$3]=$0; next} FNR == 1 || ($4,$5) in A' f2 f1
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
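Spelled out as a commented multi-line script, the same logic reads:

```shell
awk -F, '
NR == FNR {          # still reading the first file given (file2)
    A[$2,$3] = $0    # remember each (start, End) pair as a key
    next
}
FNR == 1 ||          # line 1 of file1: always print the header
($4,$5) in A         # otherwise print only if (startpos, endpos) was seen in file2
' file2 file1
```

Using ($4,$5) in A instead of A[$4,$5] also avoids creating empty array entries for unmatched keys.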
I want to iterate over a CSV file and, while writing to an output file, discard any row that does not have all of its columns.
I have an input file mtest.csv like this:
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP2##TestProcess2##TestDevice2
TestIP3##TestProcess3##TestDevice3##TestID3
I am trying to write only those records where all 4 columns are present. The output should not contain the TestIP2 row, as it only has 3 columns.
Sample output should look like this:
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP3##TestProcess3##TestDevice3##TestID3
I used to do this to get all the columns, but it writes the TestIP2 row as well, which has only 3 columns:
awk -F "\##" '{print $1"\##"substr($2,1,50)"\##"substr($3,1,50)"\##"substr($4,1,50)}' mtest.csv >output2.csv
But when I try to ensure that it only writes to the file when all 4 columns are present, it doesn't work:
awk -F "\##", 'NF >3 {print $1"\##"substr($2,1,50)"\##"substr($3,1,50)"\##"substr($4,1,50); exit}' mtest.csv >output2.csv
You are making things harder than they need to be. All you need to do is check NF==4 to output only those records containing four fields. Your whole awk expression would be:
awk -F'##' NF==4 < mtest.csv
(note: the default action in awk is print, so no explicit print is required.)
Example Use/Output
With your sample input in mtest.csv, you would receive:
$ awk -F'##' NF==4 < mtest.csv
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP3##TestProcess3##TestDevice3##TestID3
Thanks David and vukung.
Both your solutions are okay. I want to write to a file so that I can also trim the length of each field.
I think this statement below works:
awk -F "##" 'NF>3 {print $1"\##"substr($2,1,50)"\##"substr($3,1,2)"\##"substr($4,1,3)}' mtest.csv >output2.csv
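As an aside, the backslashes inside the print strings are unnecessary (awk may warn about the unknown escape sequence \#); a slightly cleaner sketch of the same command, assuming the same trim widths, lets OFS rejoin the fields:

```shell
awk -F'##' -v OFS='##' 'NF > 3 {
    # trim fields 2-4 to their maximum widths, then print with ## as the output separator
    print $1, substr($2, 1, 50), substr($3, 1, 2), substr($4, 1, 3)
}' mtest.csv > output2.csv
```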
I have two files,
File 1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
File 2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
I want to look up the column 7 value from File 1 to check if it matches the column 7 value from File 2 and, when matched, replace that line in File 2 with the corresponding line from File 1.
So the output would be
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Thanks in advance.
You can do that with the following script:
BEGIN { FS="," }

NR == FNR {
    lookup[$7] = $0
    next
}

{
    if (lookup[$7] != "") {
        $0 = lookup[$7]
    }
    print
}

END {
    print ""
    print "Lookup table used was:"
    for (i in lookup) {
        print "  Key '" i "', Value '" lookup[i] "'"
    }
}
The BEGIN section simply sets the field separator to , so individual fields can be easily processed.
The NR and FNR variables are, respectively, the line number of the full input stream (all files) and the line number of the current file in the input stream. When you are processing the first (or only) file, these will be equal, so we use this as a means to simply store the lines from the first file, keyed on field seven.
When NR and FNR are not equal, it's because you've started the second file and this is where we want to replace lines if their key exists in the first file.
This is done by simply checking whether a line exists in the lookup table with the desired key and, if it does, replacing the current line with the lookup table line. Then we print the (original or replaced) line.
The END section is there just for debugging purposes, it outputs the lookup table that was created and used, and you can remove it once you're satisfied the script works as expected.
You'll see the output in the following transcript, which hopefully illustrates that it is working correctly:
pax$ cat file1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
pax$ cat file2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
pax$ awk -f sudarshan.awk file1 file2
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Lookup table used was:
Key '36', Value '3,3,0,0,Test3,1540591243,36'
Key '52', Value '2,1,1,1,Test1,1540584051,52'
Key '54', Value '6,5,1,1,Test2,1540579206,54'
If you need it as a "short as possible" one-liner to use from your script, just use:
awk -F, 'NR==FNR{x[$7]=$0;next}{if(x[$7]!=""){$0=x[$7]};print}' file1 file2
though I prefer the readable version myself.
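One small refinement (my aside, not part of the original answer): the test x[$7]!="" creates an empty array entry for every file2 key that was never stored; the in operator sidesteps that, and the trailing 1 is the idiomatic "print every line" pattern:

```shell
awk -F, 'NR==FNR {x[$7] = $0; next} $7 in x {$0 = x[$7]} 1' file1 file2
```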
This might work for you (GNU sed):
sed -r 's|^([^,]*,){6}([^,]*).*|/^([^,]*,){6}\2/s/.*/&/p|' file1 | sed -rnf - file2
This turns file1 into a sed script which, using the 7th field as a lookup key, replaces any matching line in file2.
In your example the 7th field is the last one, so a short version of the above solution is:
sed -r 's|.*,(.*)|/.*,\1/s/.*/&/p|' file1 | sed -nf - file2
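To see how this works, run only the first sed on its own: each line of file1 becomes one sed command, where \1 is the 7th-field key and & (the whole matched file1 line) becomes the replacement text:

```shell
sed -r 's|.*,(.*)|/.*,\1/s/.*/&/p|' file1
# generated script (one command per file1 line):
#   /.*,52/s/.*/2,1,1,1,Test1,1540584051,52/p
#   /.*,54/s/.*/6,5,1,1,Test2,1540579206,54/p
#   /.*,36/s/.*/3,3,0,0,Test3,1540591243,36/p
```

The second sed (-nf -) then runs that generated script against file2, printing only the replaced lines.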
I have this small file small.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-002559|110|7|15
SUCCEEDED|staging|L3-002241|46||24
And this bigger file big.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-004082|16|0|8
SUCCEEDED|staging|L3-002730|85||57
SUCCEEDED|staging|L3-002722|83||56
SUCCEEDED|fe|L3-002559|100|7|15
I need a command (probably awk) that loops over small.csv, checks whether the 1st, 2nd and 3rd columns match a record in big.csv, and, when they match, calculates the difference small - big based on the 4th column. So in the example above, since the 1st record's first 3 columns match the 4th record in big.csv, the output would be:
SUCCEEDED|fe|L3-002559|10
where 10 is 110-100
Thank you
Assuming that lines with the same first three fields occur at most twice in the two files taken together, this works:
awk -F '|' 'FNR!=1 { key = $1 "|" $2 "|" $3; if(a[key]) print key "|" a[key]-$4; else a[key]=$4 }' small.csv big.csv
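The same one-liner, expanded with comments (same at-most-two-occurrences assumption):

```shell
awk -F'|' '
FNR != 1 {                        # skip the header line of each file
    key = $1 "|" $2 "|" $3        # the first three fields form the lookup key
    if (a[key])                   # key already stored while reading small.csv
        print key "|" a[key] - $4
    else                          # first sighting: remember the TOTAL TIMING
        a[key] = $4
}' small.csv big.csv
# prints: SUCCEEDED|fe|L3-002559|10
```

Note that if (a[key]) also treats a stored value of 0 as unseen; with timings that are always positive this is harmless.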
I have two files as below:
File1:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
a5|f4|c5|d5|e5
File2:
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
The output file should have lines from both files interleaved such that the rows are sorted according to the 2nd field.
Output file:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
a5|f4|c5|d5|e5
I tried appending File2 to File1 and then sorting on the 2nd field, but it does not maintain the order present in the source files.
file_1:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
a5|f4|c5|d5|e5
file_2:
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
awk -F"|" '{a[$2] = a[$2]"\n"$0;} END {for (var in a) print a[var]}' file_1 file_2 | sed '/^\s*$/d'
awk
-F : tokenize the data on the '|' character.
a[$2] : builds a hash table whose key is the string in $2 and whose value is the previous data at a[$2] plus the current complete line ($0), separated by a newline.
sed
removes the empty lines from the output (each value starts with a newline because a[$2] is initially empty).
Output:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
a5|f4|c5|d5|e5
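One caveat about the command above: POSIX awk does not guarantee any particular order for for (var in a), so the sorted output here is implementation luck. With GNU awk the traversal order can be pinned explicitly, and building the value without a leading newline removes the need for the trailing sed (a sketch, assuming gawk is available):

```shell
gawk -F'|' '
{ a[$2] = ($2 in a) ? a[$2] "\n" $0 : $0 }   # group lines under their field-2 key, in arrival order
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk-only: iterate keys in ascending string order
    for (k in a) print a[k]
}' file_1 file_2
```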