Organize the file by AWK - bash

Well, I have the following file:
week ID Father Mother Group C_Id Weight Age System Gender
9 107001 728 7110 922 107001 1287 56 2 2
10 107001 728 7110 1022 107001 1319 63 2 2
11 107001 728 7110 1122 107001 1491 70 2 2
1 107002 702 7006 111 107002 43 1 1 1
2 107002 702 7006 211 107002 103 7 1 1
4 107002 702 7006 411 107002 372 21 1 1
1 107003 729 7112 111 107003 40 1 1 1
2 107003 729 7112 211 107003 90 7 1 1
5 107003 729 7112 511 107003 567 28 1 1
7 107003 729 7112 711 107003 1036 42 1 1
I need to transpose the Age ($8) and Weight ($7) columns so that the Age values (1, 7, 21, 28, 42, 56, 63, 70) become the new column labels, in ascending order. Not all animals have measurements at every age; where a measurement is missing, the symbol "NS" should be used. The ID, Father, Mother, System, and Gender columns are kept, but after the transposition they should appear only once per animal rather than repeated as in the first table. The week, Group, and C_Id columns are not required. Visually, I need the file to look like this:
ID Father Mother System Gender 1 7 21 28 42 56 63 70
107001 728 7110 2 2 NS NS NS NS NS 1287 1319 1491
107002 702 7006 1 1 43 103 372 NS NS NS NS NS
107003 729 7112 1 1 40 90 NS 567 1036 NS NS NS
I tried this program:
#!/bin/bash
awk 'NR==1{h=$2 OFS $3 OFS $4 OFS $9 OFS $10; next}
{a[$2]=(($1 in a)?(a[$1] OFS $NF):(OFS $3 OFS $4 OFS $9 OFS $10));
if(!($8 in b)) {h=h OFS $8; b[$8]}}
END{print h; for(k in a) print k,a[k]}' banco.txt | column -t > a
But this is what I got:
ID Father Mother System Gender
56 63 70 1 7 21 28 42
107001 728 7110 2 2
107002 702 7006 1 1
107003 729 7112 1 1
And I'm stuck at that point. Any suggestions, please? Thanks.

With GNU awk for "sorted_in":
$ cat tst.awk
{
    id = $2
    weight = $7
    age = $8
    idAge2weight[id,age] = weight
    id2common[id] = $2 OFS $3 OFS $4 OFS $9 OFS $10
    ages[age]                                   # record every age label seen
}
END {
    PROCINFO["sorted_in"] = "#ind_num_asc"      # traverse arrays in numeric index order
    printf "%s", id2common["ID"]                # the header line was stored under id "ID"
    for (age in ages) {
        printf "%s%s", OFS, age
    }
    print ""
    delete id2common["ID"]
    for (id in id2common) {
        printf "%s", id2common[id]
        for (age in ages) {
            weight = ((id,age) in idAge2weight ? idAge2weight[id,age] : "NS")
            printf "%s%s", OFS, weight
        }
        print ""
    }
}
$ awk -f tst.awk file | column -t
ID Father Mother System Gender Age 1 7 21 28 42 56 63 70
107001 728 7110 2 2 NS NS NS NS NS NS 1287 1319 1491
107002 702 7006 1 1 NS 43 103 372 NS NS NS NS NS
107003 729 7112 1 1 NS 40 90 NS 567 1036 NS NS NS
I added the pipe to column -t just so you could see the field alignment.
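Note the extra Age column in the header (and the extra leading NS in every data row): they appear because the header line is processed as data, so the literal label Age lands in the ages array. If you want the output to match the requested layout exactly, here is a minimal variation of the same script that consumes the header up front (a sketch along the same lines, not tested beyond the sample):
$ cat tst2.awk
NR == 1 {
    common = $2 OFS $3 OFS $4 OFS $9 OFS $10    # keep the labels of the retained columns
    next
}
{
    idAge2weight[$2,$8] = $7
    id2common[$2] = $2 OFS $3 OFS $4 OFS $9 OFS $10
    ages[$8]
}
END {
    PROCINFO["sorted_in"] = "#ind_num_asc"
    printf "%s", common
    for (age in ages) printf "%s%s", OFS, age
    print ""
    for (id in id2common) {
        printf "%s", id2common[id]
        for (age in ages)
            printf "%s%s", OFS, ((id,age) in idAge2weight ? idAge2weight[id,age] : "NS")
        print ""
    }
}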

Related

Sort text file after the 16th column with p-values in bash

I have a tab-delimited text file with 18 columns and more than 300,000 rows, plus a header line. I would like to sort the whole file by the 16th column, which contains p-values, with the lowest p-values at the top and the header line left in place.
I already have some code; it doesn't give me any error message, but the output file contains only the header line and nothing else.
Here is my file:
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
Output should look like this:
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
Here is my code:
awk 'NR==1; NR > 1 {print $0 | "sort -g -rk 16,16"}' file.txt > file_out.txt
I'm guessing your sort doesn't have a -g option and so it's failing and not producing any output. Try this instead just using POSIX options:
$ awk 'NR==1; NR > 1 {print | "sort -nrk 16,16"}' file
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
Would you please try the following:
cat <(head -n 1 file.txt) <(tail -n +2 file.txt | sort -nk16,16) > file_out.txt
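If process substitution is not available (e.g. in plain sh), a command group gives the same effect with a single redirection; adding -n also makes the p-value comparison explicitly numeric (a minor variation, sketched):
{ head -n 1 file.txt; tail -n +2 file.txt | sort -nk16,16; } > file_out.txt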
Using GNU awk (for array sorting):
awk 'NR==1 { print;next } { map[$3][$16]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for(i in map) { for(j in map[i]) { print map[i][j] } } }' file
Explanation
awk 'NR==1 {
    print; next            # Header record - print it and skip to the next line
}
{
    map[$3][$16]=$0        # Non-header line - build a two-dimensional array indexed by ID
                           # (assuming it is unique in the file) and by the 16th field,
                           # with the whole line as the value
}
END {
    PROCINFO["sorted_in"]="#ind_num_asc"   # Set array traversal to numeric index, ascending
    for(i in map) {
        for(j in map[i]) {
            print map[i][j]                # Loop through the array printing the values
        }
    }
}' file
I suggest you try the following script:
#!/bin/bash
head -n 1 file.txt > file_out.txt
tail -n +2 file.txt | sort -k 16 >> file_out.txt
This definitely works on your output sample once the blanks are converted into tabs.
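If the real file is tab-delimited, you can also make the separator explicit and the comparison numeric (assuming bash, for the $'\t' quoting):
head -n 1 file.txt > file_out.txt
tail -n +2 file.txt | sort -t $'\t' -nk16,16 >> file_out.txt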
awk to the rescue!
$ awk 'NR==1; NR>1{print | "sort -k16n"}' file | column -t
filename CHROM ID x11_CT x12_CT CT1 CT2 SampleSize x21_CT x21 x22_CT x22 x11 x12 chIGSFA P_value GZD ZGSR
V1003 1 rs12562034 128 0 1068 14 541 940 0.868761552680222 14 0.0129390018484288 0.118299445471349 0 0.951515008754774 0.329333964471109 0.00612270697448755 0.041938142300103
V1003 1 rs3131972 212 1 1068 14 541 856 0.791127541589649 13 0.0120147874306839 0.195933456561922 0.000924214417744917 0.70567673346914 0.400882778478405 0.00649170940375354 0.0361163844076152
V1003 1 rs12131377 78 0 1060 14 537 982 0.914338919925512 14 0.0130353817504655 0.0726256983240224 0 0.555433052966582 0.456106209942983 0.0037868148101911 0.0321609387794883
V1003 1 rs3131962 170 1 1066 14 540 896 0.82962962962963 13 0.012037037037037 0.157407407407407 0.000925925925925926 0.40191966550969 0.526099523335894 0.00450617283950613 0.027281782875571

How to merge files depending on a string in a specific column

I have two files that I need to merge together based on what string they contain in a specific column.
File 1 looks like this:
1 1655 1552 189
1 1433 1552 185
1 1623 1553 175
1 691 1554 182
1 1770 1554 184
1 1923 1554 182
1 1336 1554 181
1 660 1592 179
1 743 1597 179
File 2 looks like this:
1 1552 0 0 2 -9 G A A A
1 1553 0 0 2 -9 A A G A
1 1554 0 751 2 -9 A A A A
1 1592 0 577 1 -9 G A A A
1 1597 0 749 2 -9 A A G A
1 1598 0 420 1 -9 A A A A
1 1600 0 0 1 -9 A A G G
1 1604 0 1583 1 -9 A A A A
1 1605 0 1080 2 -9 G A A A
I want to match column 3 of file 1 to column 2 of file 2, with my output looking like this:
1 1655 1552 189 0 0 2 -9 G A A A
1 1433 1552 185 0 0 2 -9 G A A A
1 1623 1553 175 0 0 2 -9 A A G A
1 691 1554 182 0 751 2 -9 A A A A
1 1770 1554 184 0 751 2 -9 A A A A
1 1923 1554 182 0 751 2 -9 A A A A
1 1336 1554 181 0 751 2 -9 A A A A
1 660 1592 179 0 577 1 -9 G A A A
1 743 1597 179 0 749 2 -9 A A G A
I am not interested in keeping any lines in file 2 that are not in file 1. Thanks in advance!
Thanks to @Abelisto I managed to figure something out 4 hours later!
sort -k 3,3 File1.txt > Pheno1.txt                      # sort file 1 on its join field (column 3)
awk '($2 > 0)' File2.ped > Ped1.ped                     # drop rows with negative values in column 2
sort -k 2,2 Ped1.ped > Ped2.ped                         # sort file 2 on its join field (column 2)
join -1 3 -2 2 Pheno1.txt Ped2.ped > Ped3.txt           # join on file1 column 3 / file2 column 2
cut -d ' ' -f 1,4,5 --complement Ped3.txt > Output.ped  # drop the unwanted duplicate fields
My real File2 actually contained negative values in the 2nd column (thankfully my real File1 didn't have any negatives), hence the use of awk to remove those rows.
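For what it's worth, the same steps can be compacted with process substitution, filtering and sorting inline (a sketch of the pipeline above, not tested on the real data):
join -1 3 -2 2 <(sort -k 3,3 File1.txt) <(awk '$2 > 0' File2.ped | sort -k 2,2) > Ped3.txt
cut -d ' ' -f 1,4,5 --complement Ped3.txt > Output.ped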
Using awk:
awk 'NR == FNR { arr[$2]=$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10 } NR != FNR { print $1" "$2" "$3" "$4" "arr[$3] }' file2 file1
Process file2 first (NR==FNR): build an array called arr, indexed by the 2nd space-delimited field, with fields 3 through 10 (joined with spaces) as the value. Then, when processing file1 (NR!=FNR), print the 1st to 4th space-delimited fields followed by the contents of arr indexed by field 3.
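One defensive tweak: as written, a file1 line whose $3 never occurs in file2 still prints, just with nothing appended. A guarded variant that skips such lines and builds the value with OFS instead of hand-written spaces (same logic otherwise, sketched):
awk 'NR == FNR { arr[$2] = $3 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10; next }
     $3 in arr { print $1, $2, $3, $4, arr[$3] }' file2 file1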
Since $1 seems to be a constant 1, and I have no idea about the row counts of either file (800,000 columns in file2 sounded like a lot), I'm hashing file1 instead:
$ awk '
NR==FNR {                              # first file: remember fields 2-4, keyed by $3
    a[$3]=a[$3] (a[$3]==""?"":ORS) $2 OFS $3 OFS $4
    next
}
($2 in a) {                            # second file: only keys that were seen in file1
    n=split(a[$2],t,ORS)
    for(i=1;i<=n;i++) {
        $2=t[i]                        # splice the stored file1 fields into the record
        print
    }
}' file1 file2
Output:
1 1655 1552 189 0 0 2 -9 G A A A
1 1433 1552 185 0 0 2 -9 G A A A
1 1623 1553 175 0 0 2 -9 A A G A
1 691 1554 182 0 751 2 -9 A A A A
1 1770 1554 184 0 751 2 -9 A A A A
1 1923 1554 182 0 751 2 -9 A A A A
1 1336 1554 181 0 751 2 -9 A A A A
1 660 1592 179 0 577 1 -9 G A A A
1 743 1597 179 0 749 2 -9 A A G A
When posting a question, please add details such as row and column counts to it. Better requirements yield better answers.

Reshape table and complete voids with NA (or -999) using bash

I'm trying to create a table based on the ASCII file below. What I need is to arrange the numbers from the 2nd column into a matrix; the first and third columns of the ASCII file give the column and row positions in the new matrix. The new matrix needs to be fully populated, so missing positions must be filled with NA (or -999).
This is what I have
$ cat infile.txt
1 68 2
1 182 3
1 797 4
2 4 1
2 70 2
2 339 3
2 1396 4
3 12 1
3 355 3
3 1854 4
4 7 1
4 85 2
4 333 3
5 9 1
5 68 2
5 182 3
5 922 4
6 10 1
6 70 2
and what I would like to have:
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
I can only use standard UNIX commands (e.g. awk, sed, grep, etc).
So, what I have so far...
I can mimic a 2D array in bash:
irows=(`awk '{print $1 }' infile.txt`) # rows positions
jcols=(`awk '{print $3 }' infile.txt`) # columns positions
values=(`awk '{print $2 }' infile.txt`) # values
declare -A matrix # the new matrix
nrows=(`sort -k3 -n in.txt | tail -1 | awk '{print $3}'`) # numbers of rows
ncols=(`sort -k1 -n in.txt | tail -1 | awk '{print $1}'`) # numbers of columns
nelem=(`echo "${#values[@]}"`) # number of elements I want to pass to the new matrix
# Creating a matrix (i,j) with -999
for ((i=0;i<=$((nrows-1));i++)) do
    for ((j=0;j<=$((ncols-1));j++)) do
        matrix[$i,$j]=-999
    done
done
and even print it on the screen:
for ((i=0;i<=$((nrows-1));i++)) do
    for ((j=0;j<=$((ncols-1));j++)) do
        printf " %i" ${matrix[$i,$j]}
    done
    echo
done
But when I try to assign the elements, something goes wrong:
for ((i=0;i<=$((nelem-1));i++)) do
    matrix[${irows[$i]},${jcols[$i]}]=${values[$i]}
done
Thanks in advance for any help with this, really.
A solution in plain bash, simulating a 2D array with an associative array, could look something like this. (Note that the row and column counts are not hard-coded, and the code works with any permutation of input lines, provided each line has the format specified in the question.)
$ cat printmat
#!/bin/bash
declare -A mat
nrow=0
ncol=0
while read -r col elem row; do
    mat[$row,$col]=$elem
    if ((row > nrow)); then nrow=$row; fi
    if ((col > ncol)); then ncol=$col; fi
done
for ((row = 1; row <= nrow; ++row)); do
    for ((col = 1; col <= ncol; ++col)); do
        elem=${mat[$row,$col]}
        if [[ -z $elem ]]; then elem=NA; fi
        if ((col == ncol)); then elem+=$'\n'; else elem+=$'\t'; fi
        printf "%s" "$elem"
    done
done
Running it prints out:
$ ./printmat < infile.txt
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
Any time you find yourself writing a loop in shell just to manipulate text, you have the wrong approach. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for many of the reasons why.
Using any awk in any shell on every UNIX box:
$ cat tst.awk
{
    vals[$3,$1] = $2
    numRows = ($3 > numRows ? $3 : numRows)
    numCols = $1
}
END {
    OFS = "\t"
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            val = ((rowNr,colNr) in vals ? vals[rowNr,colNr] : "NA")
            printf "%s%s", val, (colNr < numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk infile.txt
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
Here is one way to get you started. Note that this is not intended to be "the" answer, but to encourage you to try to learn the toolkit.
$ join -a1 -e NA -o2.2 <(printf "%s\n" {1..4}"_"{1..6}) \
<(awk '{print $3"_"$1,$2}' file | sort -n) |
pr -6at
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
This works; however, the row and column counts are hard-coded, which is not the proper way to do it.
The preferred solution would be to fill an awk 2D array with the data and print it in matrix form at the end, as sketched below.
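For reference, a sketch of that preferred approach using GNU awk's true multidimensional arrays (assumes gawk 4+ for arrays of arrays), with row and column counts discovered from the data:
$ gawk '
{
    vals[$3][$1] = $2                      # row = $3, column = $1, value = $2
    nrows = ($3 > nrows ? $3 : nrows)
    ncols = ($1 > ncols ? $1 : ncols)
}
END {
    for (r = 1; r <= nrows; r++)
        for (c = 1; c <= ncols; c++)
            printf "%s%s", ((r in vals) && (c in vals[r]) ? vals[r][c] : "NA"), (c < ncols ? "\t" : "\n")
}' infile.txt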

Specific sorting to category by awk and bash

Dear all, I have a question.
I have input like this (the second column is only an index):
chr1 1 30
chr1 2 40.5
chr1 3 30.5
chr1 4 41
chr2 10 60
chr2 15 40.1
And I want to get this:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
...
40 - 41 1 1 etc..
I need to categorize the data into bins of width 1, from 30 to 60: from the input data, I count for chr1 all the rows whose $3 falls in the category 30-31, and so on for each bin and chromosome. I have this code, but I do not understand where the problem is (some problem with the loop):
samtools view /home/filip/Desktop/AMrtin\ Hynek/54321Odfiltrovany.bam | awk '{ n=length($10); print $3,"\t",NR,"\t", gsub(/[GCCgcs]/,"",$10)/n;}' | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}' > data.txt
for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do
export $j
awk -v sop=$j '{if($1 == $sop) print $0}' data.txt |
awk '{d=int($3)
      a[d]++
      if (NR==1) {min=d}
      min=(min>=d?d:min)
      max=(max>d?max:d)}
     END{for (i=min; i<=max; i++) print i, "-", i+1, a[i]+0}' ;
done
Part of the code I made with help from "fedorqui".
Using awk:
awk '
!($1 in chrs) { chr[++c] = $1 ; chrs[$1]++ }
{
    val = int($3);
    map[$1,val]++;
    min = (NR==1?val:min>=val?val:min);
    max = (max>val?max:val)
}
END {
    printf "\t\t"
    for (j=1; j<=c; j++) {
        printf "%s%s", sep, chr[j]
        sep = "\t"
    }
    print ""
    for (i=min; i<=max; i++) {
        printf "%d - %d\t", i, i+1
        for (j=1; j<=c; j++) {
            printf "\t%s", map[chr[j],i] + 0
        }
        print ""
    }
}' file
chr1 chr2
30 - 31 2 0
31 - 32 0 0
32 - 33 0 0
...
38 - 39 0 0
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
...
59 - 60 0 0
60 - 61 0 1
The chr array records the chromosomes in the order they are first seen.
The rest of the main block is pretty much your code, except that we also create a map array indexed by chromosome and range, with counts as its values.
In the END block we first iterate over the chr array and print the chromosomes.
Then, using our min and max variables, we loop and print the values from the map array, indexed by chromosome and range.
I have truncated some lines from the output. As you can see, it prints every bin from min to max.
First, you could use:
for j in {1..22}; do
    chrj="chr$j"
    # now you could use $chrj instead of $j in this loop
done
Instead of :
for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do
# ...
done
Then, you don't need to multiply awk calls and pipes; a single awk should be enough.
For example:
... | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}'
should be:
awk '($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}'
# or
awk '{if ($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}}'
Otherwise:
export $j
What is the purpose of this export?
I haven't read all of your code, but at this point there are many optimizations to be made! For example, the per-chromosome filtering and binning can be folded into a single awk call, as sketched below.
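As a sketch of that consolidation for the data.txt loop: note that inside awk the variable passed with -v must be referenced as sop, not $sop ($sop means "the field whose number is sop", which is likely the actual bug in the posted loop):
for j in chr{1..22}; do
    awk -v sop="$j" '
        $1 == sop {
            d = int($3)
            a[d]++
            min = (min == "" || d < min ? d : min)
            max = (max == "" || d > max ? d : max)
        }
        END { if (min != "") for (i = min; i <= max; i++) print i, "-", i + 1, a[i] + 0 }
    ' data.txt
done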
If you are using gawk, this should work. There's a filter for $1 that should handle everything you were doing with $j (unless you truly need only chr1..chr22, in which case it should still be possible to develop a regex for it).
BEGIN {
    for(i = 30; i <= 60; i++) {
        rstring = i " - " i + 1;
        rows[rstring] = 0;
    }
}
$1 ~ /^chr[0-9][0-9]?$/ {
    row = int($3) " - " int($3) + 1;
    columns[$1] = 0;
    rows[row] = 0;
    data[row][$1] += 1;
    rowwidth = length(row) > rowwidth ? length(row) : rowwidth;
    colwidth = length($1) > colwidth ? length($1) : colwidth;
}
END {
    rowheader = "%-" (rowwidth * 2) "s";
    colheader = "%" colwidth "s\t";
    dataformat = "%" int(colwidth / 2) "d\t";
    asorti(columns, sortedcolumns);
    asorti(rows, sortedrows);
    printf rowheader, "";
    for(c in sortedcolumns) printf "%s\t", sortedcolumns[c];
    print "";
    for(r in sortedrows) {
        printf rowheader, sortedrows[r];
        for(c in sortedcolumns)
            printf dataformat, data[sortedrows[r]][sortedcolumns[c]];
        print ""
    }
}
Running it with gawk -f [scriptfile from above] < data.txt should produce something like:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
. . .
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
. . .
59 - 60 0 0
60 - 61 0 1
The following can be used if you want to use Perl:
perl -ane '
    $h{$F[0]}{int $F[2]}++;
    push @range, int $F[2];
}{
    @range = sort @range;
    print "\t\t", join "\t", sort { $a cmp $b } keys %h; print "\n";
    for $i ($range[0] .. $range[-1]) {
        print "$i - ", $i + 1, "\t\t";
        print $h{$_}{$i} + 0, "\t" for sort { $a cmp $b } keys %h; print "\n"
    }' file
The output should look like this:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
32 - 33 0 0
33 - 34 0 0
34 - 35 0 0
35 - 36 0 0
36 - 37 0 0
37 - 38 0 0
38 - 39 0 0
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
43 - 44 0 0
44 - 45 0 0
45 - 46 0 0
46 - 47 0 0
47 - 48 0 0
48 - 49 0 0
49 - 50 0 0
50 - 51 0 0
51 - 52 0 0
52 - 53 0 0
53 - 54 0 0
54 - 55 0 0
55 - 56 0 0
56 - 57 0 0
57 - 58 0 0
58 - 59 0 0
59 - 60 0 0

printing selected rows from a file using awk

I have a text file with data in the following format.
1 0 0
2 512 6
3 992 12
4 1536 18
5 2016 24
6 2560 29
7 3040 35
8 3552 41
9 4064 47
10 4576 53
11 5088 59
12 5600 65
13 6080 71
14 6592 77
15 7104 83
I want to print all the lines where $1 > 1000.
awk 'BEGIN {$1 > 1000} {print " " $1 " "$2 " "$3}' graph_data_tmp.txt
This doesn't seem to give the output that I am expecting. What am I doing wrong?
You can do this:
awk '$1>1000 {print $0}' graph_data_tmp.txt
print $0 prints the whole line. Your version fails because a BEGIN block runs once, before any input is read, so the condition $1 > 1000 there is evaluated against an empty record and its result is simply discarded; the main block then prints every line unconditionally.
If you want to print the lines after the 1000th line/row instead, you can do the same thing replacing $1 with NR. NR holds the current record (row) number.
awk 'NR>1000 {print $0}' graph_data_tmp.txt
All you need is:
awk '$1>1000' file
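That works because a bare pattern with no action defaults to { print $0 }, so all three of these are equivalent:
awk '$1>1000' file
awk '$1>1000 { print }' file
awk '$1>1000 { print $0 }' file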
