Merge two pipe-separated files into one file based on a condition (sorting)

I have two files as below:
File1:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
a5|f4|c5|d5|e5
File2:
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
The output file should interleave lines from both files so that the rows are sorted on the 2nd field.
Output file:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
a5|f4|c5|d5|e5
I tried appending File2 to File1 and then sorting on the 2nd field, but that does not preserve the original order of lines within each file.

file_1:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
a5|f4|c5|d5|e5
file_2:
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
awk -F"|" '{a[$2] = a[$2]"\n"$0;} END {for (var in a) print a[var]}' file_1 file_2 | sed '/^\s*$/d'
awk
-F"|" : split each line into fields on the '|' character.
a[$2] : builds an associative array keyed by the 2nd field; each value is the previous contents of a[$2] plus the current complete line ($0), separated by a newline. Because file_1 is read before file_2, its lines come first within each key.
Note that the iteration order of for (var in a) is unspecified in POSIX awk; it happens to come out sorted here, but GNU awk can guarantee sorted traversal with PROCINFO["sorted_in"] = "@ind_str_asc".
sed
removes the empty lines (caused by the leading newline on each value) from the output.
Output:
a1|f1|c1|d1|e1
a2|f1|c2|d2|e2
z1|f1|c1|d1|e1
z2|f1|c2|d2|e2
a3|f2|c3|d3|e3
a4|f2|c4|d4|e4
z3|f2|c3|d3|e3
z4|f2|c4|d4|e4
z5|f3|c5|d5|e5
a5|f4|c5|d5|e5
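Incidentally, the append-and-sort idea from the question can be rescued: sort's -s flag makes the sort stable, preserving the input order of lines whose key fields compare equal. A minimal sketch, with the sample data recreated in a scratch directory:

```shell
cd "$(mktemp -d)"                       # scratch dir for the demo
printf 'a1|f1|c1|d1|e1\na2|f1|c2|d2|e2\na3|f2|c3|d3|e3\na4|f2|c4|d4|e4\na5|f4|c5|d5|e5\n' > file_1
printf 'z1|f1|c1|d1|e1\nz2|f1|c2|d2|e2\nz3|f2|c3|d3|e3\nz4|f2|c4|d4|e4\nz5|f3|c5|d5|e5\n' > file_2
# Stable sort (-s) on the 2nd pipe-delimited field only (-k2,2):
# lines with equal keys keep their input order, so file_1 lines
# stay ahead of file_2 lines within each key group.
sort -s -t'|' -k2,2 file_1 file_2 > merged
cat merged
```

Because the concatenation order puts file_1 before file_2, the stable sort yields exactly the interleaving shown above.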

Related

csv file operation: compare two csv files and return all matched line with header

I am trying to compare columns of two CSV files and save all matched lines to a new CSV file with a header. Below are the example files
file1:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775339,marker,gene3,1895,1962,Parent=mRNA1
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
file2:
Id,start,End
C2002,895,1166
C2003,1895,2166
C2004,2795,2962
Here I am trying to compare the 4th and 5th columns of file1 with the 2nd and 3rd columns of file2, and save a line to a new CSV file if they match.
Using this command:
awk -F',' 'NR==FNR{A[$2,$3]=$0;next} A[$4,$5]' file2 file1
I get this output:
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
But I want the header of file1 as well, which only happens if the header names are identical in both files, for example if startpos and endpos in file1 were renamed to start and End (or vice versa), so that the header line itself matches.
Is there any way to do this without identical header names? My expected output file would be:
output:
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
You may add another condition, FNR == 1; it is only tested while reading file1 (the second file on the command line, since file2's lines hit next in the first block), so it passes file1's header line through unconditionally:
awk -F, 'NR==FNR {A[$2,$3]=$0; next} FNR == 1 || ($4,$5) in A' f2 f1
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
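The command can be verified end to end; this sketch recreates the two sample files (as f1 and f2, the names used in the answer) and writes the matched lines plus header to output.csv:

```shell
cd "$(mktemp -d)"                       # scratch dir for the demo
cat > f1 <<'EOF'
ID,type,gene,startpos,endpos,product
C20775336,marker,gene1,1895,2166,ID=gene1;Name=maker-C20
C20775337,marker,gene2,895,1166,ID=mRNA1;Parent=gene1;N
C20775339,marker,gene3,1895,1962,Parent=mRNA1
C20775335,marker,gene4,2795,2962,ID=CDS1;Parent=mRNA1
C20775338,marker,gene5,895,1166,ID=mRNA1;Parent=gene1;N
EOF
cat > f2 <<'EOF'
Id,start,End
C2002,895,1166
C2003,1895,2166
C2004,2795,2962
EOF
# FNR==1 is only reached while reading f1 (f2 lines are consumed
# by the first block), so f1's header is always printed.
awk -F, 'NR==FNR {A[$2,$3]=$0; next} FNR == 1 || ($4,$5) in A' f2 f1 > output.csv
cat output.csv
```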

bash: output contents of diff function into 2 columns

I have a file that looks like so:
file1.txt
rs13339951:45007956:T:C 45007956
rs2838331 45026728
rs5647 12335
rs4687576 5353566
file2.txt
rs13339951 45007956
rs2838331 45026728
rs5647 12335
rs4687576:ATCFHF 5353566
More descriptions:
Some of the values in column1 are identical between the 2 files but not all
The values in column2 are all identical between the 2 files
I want to identify the rows for which the value in column1 differs between the 2 files, i.e. rows 1 and 4 in my example. I can do this with diff file1.txt file2.txt.
However, I would like to obtain an end file like the one below, because I aim to use sed to replace the names in one file with those from the other so that both files match completely.
rs13339951:45007956:T:C rs13339951
rs4687576 rs4687576:ATCFHF
awk is perfect for this
awk 'FNR==NR {a[$2]=$1; next} a[$2]!=$1 {print a[$2] " " $1}' file1 file2
outputs
rs13339951:45007956:T:C rs13339951
rs4687576 rs4687576:ATCFHF
We are passing two files to awk, and it processes them consecutively.
FNR==NR {.... next} { ... }
FNR is the record number within the current file, while NR counts records across all files, so FNR==NR only holds while the first file is being read. With this "trick" the first action is executed for the first file and the second action for the second file.
a[$2]=$1
A key-value lookup table: the second column is the key, the first column is the value. We build this lookup table while reading the first file.
a[$2]!=$1 {print a[$2] " " $1}
While iterating over the second file, compare the current first column with the value in the lookup table. If they do not match print the desired output.

Split one file into multiple files based on pattern with awk

I have a binary file with the following format:
file
04550525023506346054(....)64645634636346346344363468badcafe268664363463463463463463463464647(....)474017497417428badcafe34376362623626(....)262
and I need to split it in multiple files (using awk) that look like this:
file1
045505250235063460546464563463634634634436346
file2
8badcafe26866436346346346346346346346464747401749741742
file3
8badcafe34376362623626262
I have found on stackoverflow the following line:
cat file |
awk -v RS="\x8b\xad\xca\xfe" 'NR > 1 { print RS $0 > "file" (NR-1); close("file" (NR-1)) }'
and it works for all the files but the first.
Indeed, the file I called file1 is not created, because its content does not start with the eye-catcher 8badcafe.
How can I fix the previous command line in order to have the output I need?
Thanks!
try:
awk '{gsub(/8badcafe/,"\n&");num=split($0, a,"\n");for(i=1;i<=num;i++){print a[i] > "file"++e}}' Input_file
gsub(/8badcafe/,"\n&") prefixes every occurrence of the string "8badcafe" with a newline (& stands for the matched text). split then breaks the modified line into an array named a on those newlines, and the loop prints each element to file1, file2, ..., using "file" plus the incrementing counter variable e.
Output files as follows:
cat file1
045505250235063460546464563463634634634436346
cat file2
8badcafe26866436346346346346346346346464747401749741742
cat file3
8badcafe34376362623626262
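Alternatively, the RS-based one-liner from the question needs only a small change to also emit the first chunk: print every record, prefixing the separator from the second record onward. A sketch with a shortened, made-up input string (GNU awk and mawk treat a multi-character RS as a regex; strict POSIX awk only honors the first character):

```shell
cd "$(mktemp -d)"                       # scratch dir for the demo
printf '%s' '0455aaa8badcafe2686bbb8badcafe3437ccc' > file
# Every record is written out; records after the first get the
# separator 8badcafe restored at the front.
awk -v RS='8badcafe' '{ printf "%s%s", (NR==1 ? "" : RS), $0 > ("file" NR); close("file" NR) }' file
cat file1 file2 file3   # concatenating them reconstitutes the stream
```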

Extract first 5 fields from semicolon-separated file

I have a semicolon-separated file with 10 fields on each line. I need to extract only the first 5 fields.
Input:
A.txt
1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;
Output file:
B.txt
1;abc ;xyz ;0.0000;3.0;
You can cut from field1-5:
cut -d';' -f1-5 file
If the ending ; is needed, you can append it with another tool, or use grep (assuming your grep supports the -P option):
kent$ grep -oP '^(.*?;){5}' file
1;abc ;xyz ;0.0000;3.0;
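As a concrete example of the "append it with another tool" route (sed is just one choice of tool here), the trailing ; can be restored after cut:

```shell
cd "$(mktemp -d)"                       # scratch dir for the demo
printf '1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;\n' > A.txt
# cut keeps fields 1-5 but drops the trailing delimiter;
# sed adds it back at end of line.
cut -d';' -f1-5 A.txt | sed 's/$/;/' > B.txt
cat B.txt
```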
In sed you can match the pattern [^;]*; exactly 5 times:
sed 's/\(\([^;]*;\)\{5\}\).*/\1/' A.txt
or, if your sed supports -r (extended regular expressions):
sed -r 's/(([^;]*;){5}).*/\1/' A.txt
cut -f-5 -d";" A.txt > B.txt
Where:
- -f selects the fields (-5 means from the start through field 5)
- -d sets the delimiter (here the semicolon)
Given that the input is field-based, using awk is another option:
awk 'BEGIN { FS=OFS=";"; ORS=OFS"\n" } { NF=5; print }' A.txt > B.txt
If you're using BSD/macOS, insert $1=$1; after NF=5; to make this work.
FS=OFS=";" sets both the input field separator, FS, and the output field separator, OFS, to a semicolon.
The input field separator is used to break each input record (line) into fields.
The output field separator is used to rebuild the record when individual fields, or the number of fields, are modified.
ORS=OFS"\n" sets the output record separator to a semicolon followed by a newline, given that a trailing ; should be output.
Simply omit this statement if the trailing ; is undesired.
{ NF=5; print } truncates the input record to 5 fields, by setting NF, the number (count) of fields to 5 and then prints the modified record.
It is at this point that OFS comes into play: the first 5 fields are concatenated to form the output record, using OFS as the separator.
Note: BSD/macOS Awk doesn't modify the record just by setting NF; you must additionally modify a field explicitly for the changed field count to take effect: a dummy operation such as $1=$1 (assigning field 1 to itself) is sufficient.
awk '{print $1,$2,$3}' A.txt >B.txt
1;abc ;xyz ;0.0000;3.0;
(Note that this only happens to reproduce the expected output because of where the spaces fall in the sample line: it prints the first three whitespace-separated fields, not the first five ;-separated ones, so it is fragile for other data.)

search 2 fields in a file in another huge file, passing 2nd file only once

file1 has 100,000 lines. Each line has 2 fields such as:
test 12345678
test2 43213423
Another file has millions of lines. Here is an example of how the above file entries look in file2:
'99' 'databases' '**test**' '**12345678**'
'1002' 'exchange' '**test2**' '**43213423**'
I would like a way to grep these 2 fields from file1 so that I can find any line in file2 that contains both. The gotcha: I want to make only one pass over the 2nd file, because looping grep over the entries is very slow; it could mean 100,000 x 10,000,000 comparisons.
Is that at all possible?
You can do this in awk:
awk -F"['[:blank:]]+" 'NR == FNR { a[$1,$2]; next } $4 SUBSEP $5 in a' file1 file2
First set the field separator so that the quotes around the fields in the second file are consumed.
The first block applies to the first file and sets keys in the array a. The comma in the array index translates to the control character SUBSEP in the key.
Lines are printed in the second file when the third and fourth fields (with the SUBSEP in between) match one of the keys. Due to the ' at the start of the line, the first field $1 is actually an empty string, so the fields you want are $4 and $5.
If your fields are always quoted in the second file, then you can do this instead:
awk -v q="'" 'NR == FNR { a[q $1 q,q $2 q]; next } $3 SUBSEP $4 in a' file file2
This inserts the quotes into the array a, so the fields in the second file match without having to consume the quotes.
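Assuming the ** markers in the question's sample are bold-formatting residue and file2's fields are plainly quoted, the first command can be exercised like this (an extra non-matching line is included to show filtering):

```shell
cd "$(mktemp -d)"                       # scratch dir for the demo
printf 'test 12345678\ntest2 43213423\n' > file1
cat > file2 <<'EOF'
'99' 'databases' 'test' '12345678'
'1002' 'exchange' 'test2' '43213423'
'7' 'other' 'test' '99999999'
EOF
# Quotes and blanks are both field separators, so the leading '
# makes $1 empty and the quoted words land in $4 and $5.
awk -F"['[:blank:]]+" 'NR == FNR { a[$1,$2]; next } $4 SUBSEP $5 in a' file1 file2 > matches
cat matches
```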
fgrep and sed method:
sed "s/\b/'/g;s/\b/**/g" file1 | fgrep -f - file2
Modify a stream from file1 with sed to match the format of the second file (i.e. surround the fields with single quotes and asterisks), and send the stream to standard output. fgrep -f - reads that stream as a list of fixed strings (not regexps) and finds every matching line in file2.
Output:
'99' 'databases' '**test**' '**12345678**'
'1002' 'exchange' '**test2**' '**43213423**'
