I am trying to compare columns of two CSV files and save all matched lines to a new CSV file with a header. Below are the example files:
file1:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775339,marker,gene3,1895,1962,Parent=mRNA1
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
file2:
Id,start,End
C2002,895,1166
C2003,1895,2166
C2004,2795,2962
Here I am trying to compare the 4th and 5th columns of file1 with the 2nd and 3rd columns of file2, and save the line to a new CSV file if they match.
Using this command: awk -F',' 'NR==FNR{A[$2,$3]=$0;next} A[$4,$5]' file2 file1 I am getting this output:
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
But I want the header of file1 as well, which is only achieved if the header names are identical in both files, for example if startpos and endpos of file1 are changed to start and End, or vice versa.
Is there any way it can be done without identical header names? My expected output file would be:
output:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
You may add another condition, FNR == 1, so that the header line of file1 is always printed:
awk -F, 'NR==FNR {A[$2,$3]=$0; next} FNR == 1 || ($4,$5) in A' f2 f1
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
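Spelled out as a commented multi-line script, the same logic reads:

```shell
awk -F, '
NR == FNR {          # still reading the first file given (file2)
    A[$2,$3] = $0    # remember each (start, End) pair as a key
    next
}
FNR == 1 ||          # line 1 of file1: always print the header
($4,$5) in A         # otherwise print only if (startpos, endpos) was seen in file2
' file2 file1
```

Using ($4,$5) in A instead of A[$4,$5] also avoids creating empty array entries for unmatched keys.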
I want to iterate over a CSV file and, while writing to an output file, discard any row that does not have all of its columns.
I have an input file mtest.csv like this:
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP2##TestProcess2##TestDevice2
TestIP3##TestProcess3##TestDevice3##TestID3
I am trying to write only those records where all 4 columns are present. The output should not contain the TestIP2 row, as it only has 3 columns.
Sample output should look like this:
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP3##TestProcess3##TestDevice3##TestID3
I used to do this to get all the columns, but it writes the TestIP2 row as well, which has only 3 columns:
awk -F "\##" '{print $1"\##"substr($2,1,50)"\##"substr($3,1,50)"\##"substr($4,1,50)}' mtest.csv >output2.csv
But when I try to ensure that it only writes to the file when all 4 columns are present, it doesn't work:
awk -F "\##", 'NF >3 {print $1"\##"substr($2,1,50)"\##"substr($3,1,50)"\##"substr($4,1,50); exit}' mtest.csv >output2.csv
You are making things harder than they need to be. All you need to do is check NF==4 to output only those records containing four fields. Your whole awk expression would be:
awk -F'##' NF==4 < mtest.csv
(note: the default action in awk is print, so no explicit print is required.)
Example Use/Output
With your sample input in mtest.csv, you would receive:
$ awk -F'##' NF==4 < mtest.csv
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP3##TestProcess3##TestDevice3##TestID3
Thanks David and vukung.
Both your solutions are okay. I want to write to a file so that I can also trim the length of each field.
I think this statement below works:
awk -F "##" 'NF>3 {print $1"\##"substr($2,1,50)"\##"substr($3,1,2)"\##"substr($4,1,3)}' mtest.csv >output2.csv
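As an aside, the backslashes inside the print strings are unnecessary (awk may warn about the unknown escape sequence \#); a slightly cleaner sketch of the same command, assuming the same trim widths, lets OFS rejoin the fields:

```shell
awk -F'##' -v OFS='##' 'NF > 3 {
    # trim fields 2-4 to their maximum widths, then print with ## as the output separator
    print $1, substr($2, 1, 50), substr($3, 1, 2), substr($4, 1, 3)
}' mtest.csv > output2.csv
```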
I have two files,
File 1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
File 2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
I want to look up the column 7 value from File 1 to check if it matches the column 7 value from File 2 and, when matched, replace that line in File 2 with the corresponding line from File 1.
So the output would be
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Thanks in advance.
You can do that with the following script:
BEGIN { FS="," }

NR == FNR {
    lookup[$7] = $0
    next
}

{
    if (lookup[$7] != "") {
        $0 = lookup[$7]
    }
    print
}

END {
    print ""
    print "Lookup table used was:"
    for (i in lookup) {
        print "  Key '" i "', Value '" lookup[i] "'"
    }
}
The BEGIN section simply sets the field separator to , so individual fields can be easily processed.
The NR and FNR variables are, respectively, the line number of the full input stream (all files) and the line number of the current file in the input stream. When you are processing the first (or only) file, these will be equal, so we use this as a means to simply store the lines from the first file, keyed on field seven.
When NR and FNR are not equal, it's because you've started the second file and this is where we want to replace lines if their key exists in the first file.
This is done by simply checking whether a line exists in the lookup table with the desired key and, if it does, replacing the current line with the lookup table line. Then we print the (original or replaced) line.
The END section is there just for debugging purposes, it outputs the lookup table that was created and used, and you can remove it once you're satisfied the script works as expected.
You'll see the output in the following transcript, which hopefully illustrates that it is working correctly:
pax$ cat file1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
pax$ cat file2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
pax$ awk -f sudarshan.awk file1 file2
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Lookup table used was:
Key '36', Value '3,3,0,0,Test3,1540591243,36'
Key '52', Value '2,1,1,1,Test1,1540584051,52'
Key '54', Value '6,5,1,1,Test2,1540579206,54'
If you need it as a "short as possible" one-liner to use from your script, just use:
awk -F, 'NR==FNR{x[$7]=$0;next}{if(x[$7]!=""){$0=x[$7]};print}' file1 file2
though I prefer the readable version myself.
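One small refinement (my aside, not part of the original answer): the test x[$7]!="" creates an empty array entry for every file2 key that was never stored; the in operator sidesteps that, and the trailing 1 is the idiomatic "print every line" pattern:

```shell
awk -F, 'NR==FNR {x[$7] = $0; next} $7 in x {$0 = x[$7]} 1' file1 file2
```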
This might work for you (GNU sed):
sed -r 's|^([^,]*,){6}([^,]*).*|/^([^,]*,){6}\2/s/.*/&/p|' file1 | sed -rnf - file2
This turns file1 into a sed script which, using the 7th field as a lookup key, replaces any matching line in file2.
In your example the 7th field is the last one, so a short version of the above solution is:
sed -r 's|.*,(.*)|/.*,\1/s/.*/&/p|' file1 | sed -nf - file2
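To see how this works, run only the first sed on its own: each line of file1 becomes one sed command, where \1 is the 7th-field key and & (the whole matched file1 line) becomes the replacement text:

```shell
sed -r 's|.*,(.*)|/.*,\1/s/.*/&/p|' file1
# generated script (one command per file1 line):
#   /.*,52/s/.*/2,1,1,1,Test1,1540584051,52/p
#   /.*,54/s/.*/6,5,1,1,Test2,1540579206,54/p
#   /.*,36/s/.*/3,3,0,0,Test3,1540591243,36/p
```

The second sed (-nf -) then runs that generated script against file2, printing only the replaced lines.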
I have this small file small.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-002559|110|7|15
SUCCEEDED|staging|L3-002241|46||24
And this bigger file big.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-004082|16|0|8
SUCCEEDED|staging|L3-002730|85||57
SUCCEEDED|staging|L3-002722|83||56
SUCCEEDED|fe|L3-002559|100|7|15
I need a command (probably awk) that loops over small.csv, checks whether the 1st, 2nd and 3rd columns match a record in big.csv, and, when they match, calculates the difference small - big based on the 4th column. So in the example above, since the 1st record's first 3 columns match the 4th record in big.csv, the output would be:
SUCCEEDED|fe|L3-002559|10
where 10 is 110-100
Thank you
Assuming that lines with the same first three fields occur at most twice in the two files taken together, this works:
awk -F '|' 'FNR!=1 { key = $1 "|" $2 "|" $3; if(a[key]) print key "|" a[key]-$4; else a[key]=$4 }' small.csv big.csv
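The same one-liner, expanded with comments (same at-most-two-occurrences assumption):

```shell
awk -F'|' '
FNR != 1 {                        # skip the header line of each file
    key = $1 "|" $2 "|" $3        # the first three fields form the lookup key
    if (a[key])                   # key already stored while reading small.csv
        print key "|" a[key] - $4
    else                          # first sighting: remember the TOTAL TIMING
        a[key] = $4
}' small.csv big.csv
# prints: SUCCEEDED|fe|L3-002559|10
```

Note that if (a[key]) also treats a stored value of 0 as unseen; with timings that are always positive this is harmless.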
I have two files as below:
File1:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
a5|f4|c5|d5|e5
File2:
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
The output file should have lines from both files interleaved such that the rows are sorted according to the 2nd field.
Output file:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
a5|f4|c5|d5|e5
I tried appending File2 to File1 and then sorting on the 2nd field, but it does not maintain the order present in the source files.
file_1:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
a5|f4|c5|d5|e5
file_2:
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
awk -F"|" '{a[$2] = a[$2]"\n"$0;} END {for (var in a) print a[var]}' file_1 file_2 | sed '/^\s*$/d'
awk
-F : tokenize the data on the '|' character.
a[$2] : builds a hash table whose key is the string in $2 and whose value is the previous data at a[$2] plus the current complete line ($0), separated by a newline.
sed
removes the empty lines from the output (each value starts with a newline because a[$2] is initially empty).
Output:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
a5|f4|c5|d5|e5
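One caveat about the command above: POSIX awk does not guarantee any particular order for for (var in a), so the sorted output here is implementation luck. With GNU awk the traversal order can be pinned explicitly, and building the value without a leading newline removes the need for the trailing sed (a sketch, assuming gawk is available):

```shell
gawk -F'|' '
{ a[$2] = ($2 in a) ? a[$2] "\n" $0 : $0 }   # group lines under their field-2 key, in arrival order
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk-only: iterate keys in ascending string order
    for (k in a) print a[k]
}' file_1 file_2
```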