How to transform genotypes into T/F -1/0/1 format using awk? - bash

I have a very large dataset that I would like to transform from genotypes to a coded format. The genotypes should be represented as follows:
A A -> -1
A B -> 0
B B -> 1
I have thought about this using awk but I cannot seem to get a working solution that can read two columns and output a single code in place of the genotypes. The input file looks like this:
AnimalID Locus Allele1 Allele2
1 1 A B
1 2 A A
1 3 B B
2 1 B A
2 2 B A
2 3 A A
And it should be coded into an output file that looks like this:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1
I am assuming this can be done using boolean T/F? Any suggestions would be welcomed. Thanks.

Here is something to get you started:
I have stored the mapping in a BEGIN block. If the locus is missing for a particular ID, this will just print a blank for it. You didn't specify what B A would mean, so I took the liberty of mapping it to 0 based on your output.
awk '
BEGIN {
    map["A","A"] = -1;
    map["A","B"] =  0;
    map["B","B"] =  1;
    map["B","A"] =  0;
}
NR>1 {
    idCount = (idCount < $1) ? $1 : idCount;          # highest AnimalID seen
    locusCount = (locusCount < $2) ? $2 : locusCount; # highest Locus seen
    code[$1,$2] = map[$3,$4]
}
END {
    printf "%s ", "AnimalID";
    for (cnt = 1; cnt <= locusCount; cnt++) {
        printf "%s%s", "Locus" cnt, ((cnt == locusCount) ? "\n" : " ")
    }
    for (cnt = 1; cnt <= idCount; cnt++) {
        printf "%s\t", cnt;
        for (locus = 1; locus <= locusCount; locus++) {
            printf "%s%s", code[cnt,locus], ((locus == locusCount) ? "\n" : "\t")
        }
    }
}' inputFile
Output:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1
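If you would rather print an explicit marker than a blank for a missing locus, you can test whether the key exists before printing. A minimal tweak to the inner printf (the "NA" marker is just an assumption; use whatever your downstream tools expect):
    printf "%s%s", (((cnt,locus) in code) ? code[cnt,locus] : "NA"), ((locus == locusCount) ? "\n" : "\t")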

Related

Understanding AWK and CSV files

How can I write an AWK program that analyses a list of fields in CSV files, counts the number of each different string in the specified field, and prints out the count of each string that is found? I have only coded in C and Java, so I am completely confused by the syntax of AWK. I understand the simplest of concepts; however, AWK is structured much differently. Any time is appreciated, thank you!
BEGIN {
    FS = ""
}
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
    PROCINFO ["sorted_in"] = "#val_num_desc" #this got the desired result
}
END {
    for {this in freq)
        printf "%s\t%d\n", this, freq[this]
}
On a CSV file containing:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
I am able to obtain the result:
A 2
B 1
C 3
Q 1
D 1
E 2
F 1
, 12
G 1
W 1
Field1,Field2,Field3,Field4 1
Z 2
This is the desired result:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
Field1 1
Field2 1
F 1
Field3 1
G 1
Field4 1
W 1
The edit to my code is marked with a comment.
Fixed your code:
$ awk '
BEGIN {               # you need a BEGIN block for FS
    FS = ", *"        # your data had both ", " and "," seps
                      # ... based on your sample output
}
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (this in freq)   # fixed a parenthesis
        printf "%s\t%d\n", this, freq[this]
}' file
Output (using GNU awk; other awks may display the output in a different order):
A 2
B 1
C 3
Q 1
D 2
Field1 1
E 2
Field2 1
F 1
Field3 1
G 1
Field4 1
W 1
Z 2
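If you also want the descending-by-count ordering shown in your desired result, GNU awk can sort the array traversal. Note that the spelling gawk actually accepts is "@val_num_desc" (the "#" variant in the question's code has no effect). A gawk-only sketch of the END block:
END {
    PROCINFO["sorted_in"] = "@val_num_desc"   # gawk only: iterate values high to low
    for (this in freq)
        printf "%s\t%d\n", this, freq[this]
}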
AWK really isn't the right tool for this job. While AWK can interpret comma- or tab-separated data, it has no concept of field enclosures or escapes. So it could handle a simple example like:
Month,Day
January,Sunday
February,Monday
but would fail with this valid example:
Month,Day
January,"Sunday"
February,"Monday"
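You can see why with a one-liner: the quotes become part of the field, and a quoted field containing the separator gets split in two (a quick demonstration):
$ echo 'January,"Sunday, the 1st"' | awk -F, '{ print NF; print $2 }'
3
"Sunday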
Because of that, I would recommend considering another language. Something like Python:
import csv
o = open('a.csv')
for m in csv.DictReader(o):
    print(m)
https://docs.python.org/library/csv.html
Or Ruby:
require 'csv'
CSV.table('a.csv').each do |m|
  p m
end
https://ruby-doc.org/stdlib/libdoc/csv/rdoc/CSV.html
Or even PHP:
<?php
$r = fopen('a.csv', 'r');
$a_head = fgetcsv($r);                       // first row: column names
while (($a_row = fgetcsv($r)) !== false) {   // false signals EOF, and this
    $m_row = array_combine($a_head, $a_row); // also handles a missing final newline
    print_r($m_row);
}
https://php.net/function.fgetcsv

AWK printing fields in multiline records

I have an input file whose fields are spread over several lines. In this file, the field pattern repeats according to the query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per complete group of fields, with empty fields marked with an "X". Example:
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how I can get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!!!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
When you process the first line in each pair, scan through it finding the positions of the beginning of each field name.
awk 'NR%3 == 1 {                # header line: record where each field name starts
    delete fieldpos;
    delete fieldlen;
    lastspace = 1;
    fieldindex = 0;
    for (i = 1; i <= length(); i++) {
        if (substr($0, i, 1) != " ") {
            if (lastspace) {
                fieldpos[fieldindex] = i;
                if (fieldindex > 0) {
                    fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
                }
                fieldindex++;
            }
            lastspace = 0;
        } else {
            lastspace = 1;
        }
    }
}
NR%3 == 2 {                     # data line: cut out the columns found above
    for (i = 0; i < fieldindex; i++) {
        if (i in fieldlen) {
            f = substr($0, fieldpos[i], fieldlen[i]);
        } else { # last field, go to end of line
            f = substr($0, fieldpos[i]);
        }
        gsub(/^ +| +$/, "", f); # trim surrounding spaces
        if (f == "") { f = "X" }
        printf("%s ", f);
    }
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec != "") print rec; rec = "" }
/^[[:upper:]]/ {                   # header line: build FIELDWIDTHS from it
    FIELDWIDTHS = ""
    while ( match($0, /\S+\s*/) ) {
        FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
        $0 = substr($0, RLENGTH+1)
    }
    next
}
NF {                               # data line: trim each fixed-width field
    for (i=1; i<=NF; i++) {
        gsub(/^\s+|\s+$/, "", $i)
        $i = ($i == "" ? "X" : $i)
    }
    rec = (rec == "" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect: it truncates a character off 21293. That's because I'm not convinced your input file is accurate; if it is, you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
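For reference, here is a minimal portable sketch of that match()/substr() idea, assuming a simple alternating header/data layout (it is not a drop-in replacement for the ZZZZ record handling above):
awk '
NR % 2 == 1 {                       # header line: record each column start/width
    n = 0; off = 0; s = $0
    while (match(s, /[^ ]+ */)) {   # field name plus its trailing blanks
        pos[++n] = off + RSTART
        len[n] = RLENGTH
        off += RSTART + RLENGTH - 1
        s = substr(s, RSTART + RLENGTH)
    }
}
NR % 2 == 0 {                       # data line: cut the same columns back out
    for (i = 1; i <= n; i++) {
        f = substr($0, pos[i], len[i])
        gsub(/^ +| +$/, "", f)      # trim, then mark empty fields
        printf "%s%s", (f == "" ? "X" : f), (i == n ? "\n" : " ")
    }
}' file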

Formatting text files in bash - adding new rows, changing the sign of numbers within a column etc

I would be very grateful for any input from you on the following issue. Apologies in advance if there are one too many questions in this post.
I have text files with 3 columns (tab separated) and n rows. I would like to:
switch rows and columns (which I have done using the script below)
add 3 columns of zero to each row
switch row 1 and 2
change the sign of the numbers within the newly-set 2nd row (original 2nd column)
within one script (if possible).
In other words, from a file with the following format:
1 2 3
1 2 3
1 2 3
1 2 3
.....
I want to get:
0 0 0 2 2 2 2 ...
0 0 0 -1 -1 -1 -1...
0 0 0 3 3 3 3 ...
switch rows & columns:
awk '
{
    for (i = 1; i <= NF; i++) {
        a[NR,i] = $i
    }
}
NF > p { p = NF }
END {
    for (j = 1; j <= p; j++) {
        str = a[1,j]
        for (i = 2; i <= NR; i++) {
            str = str " " a[i,j];
        }
        print str
    }
}' "$WD"/grads > "$WD"/vect
Thank you for your help in advance.
Best,
R
There are several things you could do, for example:
awk '
NF>n {
    n = NF                      # remember the widest row (number of columns)
}
{
    A[1,NR] = -$1               # negate column 1 while storing
    for (i=2; i<=NF; i++) A[i,NR] = $i
}
END {
    # visit columns in the order 2, 1, 3, 4, ... so the first two
    # output rows come out swapped
    for (i=2; i<=n; i=(i==2)?1:(i==1)?3:i+1) {
        for (j=1; j<=NR; j++) $j = A[i,j]
        print 0,0,0,$0          # prepend the three zero columns
    }
}
' file
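Run against the sample above (assuming exactly four data rows), this should print the requested layout:
0 0 0 2 2 2 2
0 0 0 -1 -1 -1 -1
0 0 0 3 3 3 3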

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the big file with an awk script to compute the mean.
So for each small file small you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    print "start =", range_start, "end =", range_end, "mean =", sum / count
}
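To process every small file in one go, you could wrap the invocation in a bash loop (the small_* glob and the script filename are assumptions; adjust to your actual names):
for small in small_*; do
    awk -v start="$(head -n 1 "$small")" -v end="$(tail -n 1 "$small")" -f script bigfile
done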
You can try below:
for r in *; do
    awk -v r="$r" -F' ' \
        'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first pass it saves columns 2 & 4 into temp variables. On each subsequent pass it checks whether the filename r is between the begin range (the previous column 2) and the end range (the current column 2).
It then works out the mean and prints the result.

sum,count,distinct count fields with awk

I have a text file like this:
A B C D E
----------------------
x x e 2 10
y y g 1 8
z o e 2 9
o o q 1 10
p z e 3 22
x x e 1 11
z o a 1 24
y z b 1 25
I want to use awk to do the same thing as this SQL does:
select A,
       B,
       count(distinct C),
       sum(D),
       sum(case when E > 20 then E else 0 END)
from test
group by A, B
output:
A B count(distinct C) sum(D) sum(case when E>20 then E else 0 END)
-------------------------------------------------------
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24
Here is my solution, but the distinct part is not complete:
awk '
{
    idx4[$1"|"$2] = idx4[$1"|"$2] + $4;
    idx5[$1"|"$2] = $5>20 ? idx5[$1"|"$2]+$5 : idx5[$1"|"$2]
}
END {
    for (i in idx4) print i, idx4[i], idx5[i]
}' OFS="\t" test
=============================================================================
I have now completed it after a few hours; here is my code:
BEGIN { OFS = "\t" }    # OFS must be set in BEGIN (or via the command line);
                        # a bare OFS="\t" after the END block acts as a pattern
                        # and would print every input line
{
    if (idx3[$1"|"$2, $3] == 0) {
        idx3[$1"|"$2, $3] += 1;
    }
    idx4[$1"|"$2] = idx4[$1"|"$2] + $4;
    idx5[$1"|"$2] = $5>20 ? idx5[$1"|"$2]+$5 : idx5[$1"|"$2]
}
END {
    for (j in idx3) {
        split(j, idx, SUBSEP)
        count[idx[1]]++
    }
    for (i in idx4) {
        print i, count[i], idx4[i], idx5[i]
    }
}
Scrutinizer has given more readable code below; I think that is better.
Try this (similar to your own solution):
awk '
NR<3 {
    next
}
{
    i = $1 OFS $2
    D[i] += $4
}
!A[i,$3]++ {
    C[i]++
}
$5>20 {
    E[i] += $5
}
END {
    for (i in D) print i, C[i], D[i], E[i]+0
}
' OFS='\t' infile
NR<3 is used to skip the two header lines. If they are not present in the input file you can leave that section out.
Please try this script; I have tested it and it outputs the result you expect.
awk '
{
    if (NR<3) { next }                   # skip the two header lines
    idx4[$1"|"$2] = idx4[$1"|"$2] + $4;
    idx5[$1"|"$2] = $5>20 ? idx5[$1"|"$2]+$5 : idx5[$1"|"$2]
    if ( index(TR[$1"|"$2], $3) == 0 )   # track distinct $3 values per group
    {
        TR[$1"|"$2] = TR[$1"|"$2] "|" $3;
        TRD[$1"|"$2] += 1;
    }
}
END {
    for (i in idx4) print i, TRD[i], idx4[i], idx5[i]+0
}' OFS="\t" test
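One caveat: index() does plain substring matching, so with multi-character values in column 3 one value could be found inside another (e.g. "a" inside "ab"). Wrapping the stored value in the delimiter on both sides makes the membership test exact; a small sketch of that change:
    if ( index(TR[$1"|"$2], "|" $3 "|") == 0 )
    {
        TR[$1"|"$2] = TR[$1"|"$2] "|" $3 "|";
        TRD[$1"|"$2] += 1;
    }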
$ gawk '{ a[$1,$2]["C"][$3]; a[$1,$2]["D"]+=$4; a[$1,$2]["E"]+=($5>20 ? $5 : 0) }  # needs gawk 4+ (arrays of arrays)
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: traverse indices in sorted order
    for (i in a) {
        split(i, b, SUBSEP)
        print b[1], b[2], length(a[i]["C"]), a[i]["D"], a[i]["E"]
    }
}' OFS='\t' file
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24
