I have two files:
file 1:
rs3094315 1 0 742429 G A
rs12124819 1 0 766409 G A
rs2272756 1 0 871896 A G
rs3128126 1 0 952073 G A
rs3934834 1 0 995669 A G
rs3766192 1 0 1007060 G A
file 2:
rs12565286 1 0 711153 C G
rs12138618 1 0 740098 A G
rs3094315 1 0 742429 G A
rs3131968 1 0 744055 A G
rs12562034 1 0 758311 A G
rs2905035 1 0 765522 A G
rs12124819 1 0 766409 G A
rs2980319 1 0 766985 A T
rs4040617 1 0 769185 G A
rs2980300 1 0 775852 T C
rs4951864 1 0 787889 C T
rs12132517 1 0 788664 A G
rs950122 1 0 836727 C G
rs2272756 1 0 871896 A G
rs3128126 1 0 952073 G A
rs3121561 1 0 980243 T C
rs3813193 1 0 988364 C G
rs4075116 1 0 993492 C T
rs3934834 1 0 995669 T C
rs3766193 1 0 1007033 C G
rs3766192 1 0 1007060 C T
rs3766191 1 0 1007450 T C
The files have many more matches in the first column after these shown here, there are about 500k lines in both files.
I'm trying to use the following command to find matches in the first column (rs####) and, where one is found, put the matching lines from both files on one line in a new file.
awk 'NF==FNR{s=$1; a[s]=$0; next} a[$1]{print $0" "a[$1]}' file1 file2 > mergedfiles
However, this command gives only one match (shown below) in mergedfiles, and I can't figure out what is going wrong. It's probably something really simple. Thanks in advance if you can clear this up.
rs3766192 1 0 1007060 C T rs3766192 1 0 1007060 G A
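Use NR==FNR in place of NF==FNR:
awk 'NR==FNR{s=$1; a[s]=$0; next} a[$1]{print $0" "a[$1]}' file1 file2 > mergedfiles
NF is the number of fields on the current line and FNR is the line number within the current file, so NF==FNR is true only on the sixth line (every line has 6 fields). Only line 6 of file1 was stored in the array, which is why you got a single match.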
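To see the idiom in isolation, here is a minimal sketch on made-up three-line stand-ins for the two files (the names f1.txt and f2.txt and the rs IDs are hypothetical). NR counts lines across all inputs while FNR restarts at 1 for each file, so NR==FNR holds only while the first file is being read.

```shell
# Hypothetical sample data: two small stand-ins for file1 and file2.
dir=$(mktemp -d)
printf '%s\n' 'rs1 1 0 100 G A' 'rs2 1 0 200 A G' 'rs3 1 0 300 G A' > "$dir/f1.txt"
printf '%s\n' 'rs0 1 0 50 C G' 'rs1 1 0 100 G A' 'rs3 1 0 300 C T' > "$dir/f2.txt"

# First file (NR==FNR): remember every line under its ID in column 1.
# Second file: print each line whose ID was seen, followed by the stored line.
merged=$(awk 'NR==FNR { a[$1] = $0; next } $1 in a { print $0, a[$1] }' "$dir/f1.txt" "$dir/f2.txt")
printf '%s\n' "$merged"
```

The sketch tests membership with `$1 in a` rather than `a[$1]`; both work here, but `in` does not create an empty array entry for every unmatched ID, which may matter with 500k lines.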
I have a tab separated text file below. I want to match values in column 2 and replace the values in column 5. The condition is if there are X or Y in column 2, I want column 5 to have 1 just like in the result below.
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 0
1:943937:C:T X 0 943937 0
1:944858:A:G Y 0 944858 0
1:945010:C:A X 0 945010 0
1:946247:G:A 1 0 946247 0
result:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
I tried awk -F'\t' '{ $5 = ($2 == X ? 1 : $2) } 1' OFS='\t' file.txt but I am not sure how to match both X and Y in one step.
With awk:
awk 'BEGIN{FS=OFS="\t"} $2=="X" || $2=="Y"{$5="1"}1' file
Output:
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Assuming you want $5 to be zero (as opposed to remaining unchanged) if the condition is false:
$ awk 'BEGIN{FS=OFS="\t"} {$5=($2 ~ /^[XY]$/)} 1' file
1:935662:C:CA 1 0 935662 0
1:941119:A:G 2 0 941119 0
1:942934:G:C 3 0 942934 0
1:942951:C:T X 0 942951 1
1:943937:C:T X 0 943937 1
1:944858:A:G Y 0 944858 1
1:945010:C:A X 0 945010 1
1:946247:G:A 1 0 946247 0
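The two answers differ only on rows where column 2 is not X or Y. A small sketch, using made-up input whose last row already has a non-zero value in column 5, shows the difference (the variable names keep and reset are just for illustration):

```shell
# Hypothetical tab-separated input; the last row has 7 in column 5.
input=$(printf '%s\t%s\t%s\t%s\t%s\n' v1 1 0 100 0  v2 X 0 200 0  v3 Y 0 300 0  v4 2 0 400 7)

# Variant A: set column 5 to 1 only when column 2 is X or Y; other rows are untouched.
keep=$(printf '%s\n' "$input" | awk 'BEGIN{FS=OFS="\t"} $2=="X" || $2=="Y" {$5=1} 1')

# Variant B: overwrite column 5 on every row with the result of the match (1 or 0).
reset=$(printf '%s\n' "$input" | awk 'BEGIN{FS=OFS="\t"} {$5 = ($2 ~ /^[XY]$/)} 1')

printf 'A:\n%s\nB:\n%s\n' "$keep" "$reset"
```

With this input, variant A leaves the 7 in the last row alone while variant B replaces it with 0.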
I have two files:
file1:
1 imm_1_898835 0 908972 0 A
1 vh_1_1108138 0 1118275 T C
1 vh_1_1110294 0 1120431 A G
1 rs9729550 0 1135242 C A
file2:
1 exm1916089 0 865545 0 0
1 exm44 0 865584 0 G
1 exm46 0 865625 0 G
1 exm47 0 865628 A G
1 exm51 0 908972 0 G
1 exmF 0 1120431 C A
I want to obtain a file that is the overlap between file1 and file2 based on columns 1 and 4. For each match I would like to print the common values of columns 1 and 4, followed by column 2 from file1 and column 2 from file2.
For example, I want:
1 908972 imm_1_898835 exm51
1 1120431 vh_1_1110294 exmF
Could you please try the following:
awk 'FNR==NR{a[$1,$4]=$2;next} (($1,$4) in a){print $1,$4,a[$1,$4],$2}' file1 file2
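For readers who find the one-liner dense, the same logic can be sketched in a more verbose, commented form. The sample rows are taken from the question; the array name `name` and the temporary directory are only illustrative.

```shell
dir=$(mktemp -d)
printf '%s\n' '1 imm_1_898835 0 908972 0 A' '1 vh_1_1110294 0 1120431 A G' '1 rs9729550 0 1135242 C A' > "$dir/file1"
printf '%s\n' '1 exm44 0 865584 0 G' '1 exm51 0 908972 0 G' '1 exmF 0 1120431 C A' > "$dir/file2"

overlap=$(awk '
    # First file: index the name (column 2) by the pair (column 1, column 4).
    FNR == NR { name[$1, $4] = $2; next }
    # Second file: if the same pair was seen, print the pair and both names.
    ($1, $4) in name { print $1, $4, name[$1, $4], $2 }
' "$dir/file1" "$dir/file2")
printf '%s\n' "$overlap"
```

The comma in `name[$1, $4]` joins the two values with awk's SUBSEP character, so the pair behaves as a single composite key.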
I have a large text file that looks like this:
1 1:49298 0 49298 T C
1 1:54676 0 54676 T C
1 1:54676 0 54676 A G
1 1:86028 0 86028 C T
1 1:86028 0 86028 T G
1 1:86028 0 86028 A G
1 1:91536 0 91536 T G
The second column contains some multiples - there are definitely duplicates and it is possible that there are triplicates etc, but I have not explored this fully.
I would like to add the letter 'b' to the end of the second occurrence in column 2, and 'c' to the third occurrence, 'd' to the fourth occurrence, and so on. So the output file should look like this:
1 1:49298 0 49298 T C
1 1:54676 0 54676 T C
1 1:54676b 0 54676 A G
1 1:86028 0 86028 C T
1 1:86028b 0 86028 T G
1 1:86028c 0 86028 A G
1 1:91536 0 91536 T G
I thought this could be done using awk, but I have not yet figured out any viable options.
This MIGHT be what you're looking for:
$ awk 'cnt[$2]++ { $2=sprintf("%s%c", $2, 96 + cnt[$2]) } 1' file | column -t
1 1:49298 0 49298 T C
1 1:54676 0 54676 T C
1 1:54676b 0 54676 A G
1 1:86028 0 86028 C T
1 1:86028b 0 86028 T G
1 1:86028c 0 86028 A G
1 1:91536 0 91536 T G
Another awk, which lets you control the codes you append (the {b..z} brace expansion needs bash or a similar shell):
$ awk -v codes="$(echo {b..z})" 'BEGIN{split(codes,s)}
{$2=$2 s[c[$2]++]}1' file | column -t
1 1:49298 0 49298 T C
1 1:54676 0 54676 T C
1 1:54676b 0 54676 A G
1 1:86028 0 86028 C T
1 1:86028b 0 86028 T G
1 1:86028c 0 86028 A G
1 1:91536 0 91536 T G
Or perl:
perl -lane '
$F[1] .= chr(96 + $count{$F[1]}) if $count{$F[1]}++ > 0;
print join "\t", @F
' file
And also this:
awk '{if ($4 == previous) {i++; print $1, $2sprintf("%c", 97+ i),$3,$4,$5,$6} else {previous = $4; i = 0; print;}}' file
1 1:49298 0 49298 T C
1 1:54676 0 54676 T C
1 1:54676b 0 54676 A G
1 1:86028 0 86028 C T
1 1:86028b 0 86028 T G
1 1:86028c 0 86028 A G
1 1:91536 0 91536 T G
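One difference between the variants is worth spelling out: the last one compares each line with the previous line, so it relies on duplicates being adjacent, whereas the counting variants do not. A minimal sketch on made-up, unsorted input (the row order here is hypothetical, not from the question):

```shell
# Hypothetical input in which the two 1:54676 rows are not adjacent.
rows=$(printf '%s\n' '1 1:54676 0 54676 T C' '1 1:86028 0 86028 C T' '1 1:54676 0 54676 A G')

# Count occurrences of column 2; from the second occurrence on, append b, c, d, ...
# (96 + 2 is the ASCII code of the letter b).
labelled=$(printf '%s\n' "$rows" | awk 'cnt[$2]++ { $2 = sprintf("%s%c", $2, 96 + cnt[$2]) } 1')
printf '%s\n' "$labelled"
```

Note that any of the letter-based schemes runs past z after 26 occurrences of the same value, so it is worth checking the maximum count in the real file first.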
I have a large text file that I would like to filter by excluding lines that have a number of columns matching a certain character. I had previously removed lines where all columns from 2 onwards contained a 0 or a . like so:
awk '{
for (i=2; i<=NF; i++)
if ($i!~/^(\.|0)/) {
print
break
}
}'
but now I would like to print only lines that have fewer than a specific number of columns with this value (".").
For example with data:
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
and a match value of 2 I would expect the bottom two lines to be excluded so that the output would be:
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
Any ideas?
With awk:
$ awk '{c=0;for(i=1;i<=NF;i++) c += ($i == ".")}c<2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
Basically it iterates over each column and adds one to the counter if the column equals a period (.).
The c<2 part prints the line only if fewer than two columns are periods.
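Since the question talks about "a match value", the threshold can be passed in rather than hard-coded. A minimal sketch, where `max` is a hypothetical variable name and the data is the sample from the question:

```shell
data=$(printf '%s\n' 'A B C D E' '0 1 . 0 0' '1 ./. 0 1 1' '1 1 0 0 0' '0 0 . . 0' '. ./. . . .')

# max is a hypothetical variable name: lines with max or more lone "." fields are dropped.
kept=$(printf '%s\n' "$data" | awk -v max=2 '
{
    c = 0
    for (i = 1; i <= NF; i++)   # visit every field, including the last one
        c += ($i == ".")        # a comparison yields 1 or 0
}
c < max')
printf '%s\n' "$kept"
```

Because the comparison is `$i == "."`, a field such as ./. is not counted; only fields that consist of a single period are.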
With sed one can use:
$ sed -r 'h;s/[^. ]+//g;s/\.\. *//g;/\. \./d;x' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
-r enables extended regular expressions (-E on *BSD).
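A copy of the pattern space is first stored in the hold buffer (h), then everything except spaces and periods is removed, and the ".." left over from "./." is dropped.
If the pattern space still contains two periods separated by a single space, the line is deleted (d); otherwise the pattern space is exchanged with the hold buffer (x), so the original line is printed. Note that /\. \./ requires exactly one space between the two periods, so periods separated by a non-period column such as 0 are not counted together.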
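The two substitutions are easier to follow when run one at a time on a single line. A small sketch using the last sample row (the variable names step1 and step2 are only for illustration, and -E is used in place of -r):

```shell
line='. ./. . . .'

# Step 1: drop everything that is neither a period nor a space.
step1=$(printf '%s\n' "$line" | sed -E 's/[^. ]+//g')

# Step 2: drop the ".." left over from "./." together with any spaces after it.
step2=$(printf '%s\n' "$step1" | sed -E 's/\.\. *//g')

printf '%s\n%s\n' "$step1" "$step2"
```

What remains after step 2 is only the lone periods and the spaces around them, which is what the final /\. \./ test inspects.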
$ awk '{delete a; for(i=1;i<=NF;i++) a[$i]++; if(a["."]>=2) next} 1' foo
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
It iterates over all fields (for), counts how often each field value occurs, and if a record contains 2 or more . fields, skips it without printing (next). If you want to count the periods only from field 3 onward, change the start value of i in the for: for(i=3; ...).
$ cat ip.txt
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
0 0 . . 0
. ./. . . .
$ perl -ne '(@c)=/\.\/\.|\./g; print if $#c < 1' ip.txt
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
(@c)=/\.\/\.|\./g array of ./. or . matches from current line
$#c indicates index of last element, i.e (size of array - 1)
So, to ignore lines containing 3 elements like ./. or . use $#c < 2
Similar to @spasic's answer, but easier (for me) to read!
perl -ane 'print if (grep { /^\.$/} @F) < 2' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
The -a separates the space-separated fields into an array called @F for me. I then grep in the array @F looking for elements that consist of just a period - i.e. those that start with a period and end immediately after the period. That counts the lone periods in each line and I print the line if that number is less than 2.
Perhaps this is alright.
awk '$0 !~/\. \./' file
A B C D E
0 1 . 0 0
1 ./. 0 1 1
1 1 0 0 0
I wish to replace blank fields with zeros using awk but when I use sed 's/ /0/' file, I seem to replace all white spaces when I only wish to consider missing data. Using awk '{print NF}' file returns different field numbers (i.e. 9,4) due to some empty fields
input
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 M *BB
501666604 0 M *OO
90007162 7 M +178852
90007568 3 M +189182
output
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 0 M *BB 0 0 0 0
501666604 0 0 M *OO 0 0 0 0
90007162 0 7 M +178852 0 0 0 0
90007568 0 3 M +189182 0 0 0 0
Using GNU awk's FIELDWIDTHS feature for fixed-width processing:
$ awk '{for(i=1;i<=NF;i++)if($i~/^ *$/)$i=0}1' FIELDWIDTHS="11 9 5 2 16 3 2 11 2" file | column -t
590073920 20120523 0 M $480746499 CM C 500081532 SP
501298333 0 0 M *BB 0 0 0 0
501666604 0 0 M *OO 0 0 0 0
90007162 0 7 M +178852 0 0 0 0
90007568 0 3 M +189182 0 0 0 0
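FIELDWIDTHS is a GNU awk extension. Where gawk is not available, the same idea can be sketched with substr. The widths "4 3 2" and the sample rows below are made up for illustration; the real widths would have to be read off the actual file.

```shell
# Hypothetical fixed-width input: columns of width 4, 3 and 2, padded with spaces.
raw=$(printf '%s\n' 'ab  12 x ' 'cd     y ' 'ef  34   ')

fixed=$(printf '%s\n' "$raw" | awk -v widths='4 3 2' '
BEGIN { n = split(widths, w, " ") }
{
    pos = 1; out = ""
    for (i = 1; i <= n; i++) {
        f = substr($0, pos, w[i])      # cut the i-th fixed-width slice
        gsub(/^ +| +$/, "", f)         # strip padding spaces
        if (f == "") f = 0             # blank slice means missing data
        out = out (i > 1 ? " " : "") f
        pos += w[i]
    }
    print out
}')
printf '%s\n' "$fixed"
```

The output is space-separated, so it can be piped through column -t in the same way as the gawk version.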