Formatting text files in bash - adding new rows, changing the sign of numbers within a column etc

I would be very grateful for any input on the following issue. Apologies in advance if there are one too many questions in this post.
I have text files with 3 columns (tab separated) and n rows. I would like to:
switch rows and columns (which I have done using the script below)
add 3 columns of zero to each row
switch row 1 and 2
change the sign of the numbers within the newly-set 2nd row (original 1st column)
within one script (if possible).
Or from a file with the following format:
1 2 3
1 2 3
1 2 3
1 2 3
.....
I want to get:
0 0 0 2 2 2 2 ...
0 0 0 -1 -1 -1 -1...
0 0 0 3 3 3 3 ...
switch rows & columns:
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}' "$WD"/grads > "$WD"/vect
Thank you for your help in advance.
Best,
R

There are several things you could do, for example:
awk '
NF>n{
n=NF
}
{
A[1,NR]=-$1
for(i=2; i<=NF; i++) A[i,NR]=$i
}
END{
for(i=2; i<=n; i=(i==2)?1:(i==1)?3:i+1) {
for(j=1; j<=NR; j++) $j=A[i,j]
print 0,0,0,$0
}
}
' file
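The loop header in the END block above visits the rows in the order 2, 1, 3, 4, ..., which is easy to misread. The following is a minimal sketch of the same idea with the row order spelled out; `transpose_swap` is a hypothetical helper name, and it assumes whitespace- or tab-separated input on stdin with at least two columns.

```shell
# transpose_swap: hypothetical helper, reads the table on stdin.
transpose_swap() {
    awk '
    {
        if (NF > n) n = NF
        # Store the transpose, negating the first column as it is read.
        for (i = 1; i <= NF; i++) A[i, NR] = (i == 1 ? -$i : $i)
    }
    END {
        # Row order: original column 2, then column 1, then the rest.
        order = "2 1"
        for (i = 3; i <= n; i++) order = order " " i
        m = split(order, row, " ")
        for (k = 1; k <= m; k++) {
            line = "0 0 0"                     # three leading zero columns
            for (j = 1; j <= NR; j++) line = line " " A[row[k], j]
            print line
        }
    }'
}

# Sample data from the question (four identical rows).
printf '1\t2\t3\n1\t2\t3\n1\t2\t3\n1\t2\t3\n' | transpose_swap
# 0 0 0 2 2 2 2
# 0 0 0 -1 -1 -1 -1
# 0 0 0 3 3 3 3
```

Building the output line in a separate variable avoids relying on the contents of $0 inside END.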

Related

Count the occurences of a number in all the columns in bash

I have a data set like this:
1 3 3 4 5 2 3 3
2 2 2 1 2 2 2 2
1 3 3 3 3 3 3 3
1 4 4 4 4 4 4 3
I would like to count the number of times that the number "one" appears per column, so I would like the output like:
3 0 0 1 0 0 0 0
Does anyone know how to do it in bash?
Thank you very much!
Ana
Do it in awk. Iterate over the fields, and whenever a field equals 1, increment that column's counter in an array. At the end, print the array.
awk '{ for (i = 1; i <= NF; ++i) { if ($i == 1) { ++c[i]; } } }
END { for (i = 1; i <= NF; ++i) { printf "%d%s", c[i], (i != NF ? OFS : ORS); } }' file
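As a sketch of the same counting idea in a reusable form: the value to count is passed in as a variable, and the widest row seen is tracked so the END block does not depend on the last line's NF. The function name `count_value` and the variable names are made up for illustration.

```shell
# count_value: hypothetical helper; counts how often TARGET (default 1)
# occurs in each column of the table read from stdin.
count_value() {
    awk -v target="${1:-1}" '
    {
        if (NF > maxnf) maxnf = NF            # remember the widest row
        for (i = 1; i <= NF; i++)
            if ($i == target) c[i]++          # one counter per column
    }
    END {
        for (i = 1; i <= maxnf; i++)
            printf "%d%s", c[i], (i < maxnf ? OFS : ORS)
    }'
}

# Data set from the question.
printf '%s\n' '1 3 3 4 5 2 3 3' '2 2 2 1 2 2 2 2' \
              '1 3 3 3 3 3 3 3' '1 4 4 4 4 4 4 3' | count_value 1
# 3 0 0 1 0 0 0 0
```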

Find specific keyword on column 1 and append new line on column 2 shell script

I have a text file that looks like the following:
empty 2
23 8
19 1
empty
11
I am trying to insert a new line in column 2 wherever column 1 has the keyword "empty", so that the remaining column 2 values shift down. Does anyone know how to do this? The following is the expected output:
empty
23 2
19 8
empty
11 1
Here is a script for gnu awk:
{ col1[ FNR ] = $1
col2[ FNR ] = sprintf("%s %s",$2, $3)
}
END {
k2 = 0;
for( k1 = 1; k1 <= FNR; k1++) {
if( col1[ k1 ] != "empty" ){
k2++
print col1[ k1], col2[ k2]
}
else print col1[ k1]
}
}
It stores the values of column 1 and of (column 2 + column 3) in two separate arrays. During output (in the END block) it consumes a value from the second array only if the first column is not "empty".
awk to the rescue!
$ awk 'p{t=$2;$2=p;p=t} $1=="empty"{if($2!=""){p=$2;$2=""}}1' file
empty
23 2
19 8
empty
11 1

AWK printing fields in multiline records

I have an input file with fields in several lines. In this file, the field pattern is repeated according to query size.
ZZZZ
21293
YYYYY XXX WWWW VV
13242 MUTUAL BOTH NO
UUUUU TTTTTTTT SSSSSSSS RRRRR QQQQQQQQ PPPPPPPP
3 0 3 0
NNNNNN MMMMMMMMM LLLLLLLLL KKKKKKKK JJJJJJJJ
2 0 5 3
IIIIII HHHHHH GGGGGGG FFFFFFF EEEEEEEEEEE DDDDDDDDDDD
5 3 0 3
My desired output is one line per total group of fields. Empty fields should be marked, for example with "X":
21293 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
12345 67890 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
I have been thinking about how can I get the desired output with awk/unix scripts but can't figure it out. Any ideas? Thank you very much!!!
This isn't really a great fit for awk's style of programming, which is based on fields that are delimited by a pattern, not fields with variable positions on the line. But it can be done.
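When you process the first line in each pair, scan through it to find the position where each field name begins. The data line that follows is then cut at those positions. Note that the script treats the input in groups of three lines (NR%3): a header line, a data line, and a third line that is ignored, presumably a blank separator.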
awk 'NR%3 == 1 {
delete fieldpos;
delete fieldlen;
lastspace = 1;
fieldindex = 0;
for (i = 1; i <= length(); i++) {
if (substr($0, i, 1) != " ") {
if (lastspace) {
fieldpos[fieldindex] = i;
if (fieldindex > 0) {
fieldlen[fieldindex-1] = i - fieldpos[fieldindex-1];
}
fieldindex++;
}
lastspace = 0;
} else {
lastspace = 1;
}
}
}
NR%3 == 2 {
for (i = 0; i < fieldindex; i++) {
if (i in fieldlen) {
f = substr($0, fieldpos[i], fieldlen[i]);
} else { # last field, go to end of line
f = substr($0, fieldpos[i]);
}
gsub(/^ +| +$/, "", f); # trim surrounding spaces
if (f == "") { f = "X" }
printf("%s ", f);
}
}
NR%15 == 14 { print "" } # print newline after 5 data blocks
'
Assuming your fields are separated by blank chars and not tabs, GNU awk's FIELDWIDTHS is designed to handle this sort of situation:
/^ZZZZ/ { if (rec!="") print rec; rec="" }
/^[[:upper:]]/ {
FIELDWIDTHS = ""
while ( match($0,/\S+\s*/) ) {
FIELDWIDTHS = (FIELDWIDTHS ? FIELDWIDTHS " " : "") RLENGTH
$0 = substr($0,RLENGTH+1)
}
next
}
NF {
for (i=1;i<=NF;i++) {
gsub(/^\s+|\s+$/,"",$i)
$i = ($i=="" ? "X" : $i)
}
rec = (rec=="" ? "" : rec " ") $0
}
END { print rec }
$ awk -f tst.awk file
2129 13242 MUTUAL BOTH NO 3 0 X 3 0 X 2 0 X 5 3 5 3 0 X 3 X
In other awks you'd use match()/substr(). Note that the above isn't perfect in that it truncates a char off 21293 - that's because I'm not convinced your input file is accurate and if it is you haven't told us why that number is longer than the string on the preceding line or how to deal with that.
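For awks without FIELDWIDTHS, the match()/substr() approach could look roughly like the sketch below. It only shows the slicing step and prints one output line per data line instead of joining a whole record. The helper name `slice_by_header` is made up, and it assumes, like the script above, that header lines start with an uppercase letter, data lines do not, and fields are padded with spaces rather than tabs. Unlike FIELDWIDTHS, the last field runs to the end of the line, so a value longer than its header is not truncated.

```shell
# slice_by_header: hypothetical helper, reads header/data lines on stdin.
slice_by_header() {
    awk '
    /^[A-Z]/ {
        # Header line: width of each name plus the blanks around it.
        n = 0; rest = $0
        while (match(rest, /[^ ]+ */)) {
            width[++n] = RSTART + RLENGTH - 1
            rest = substr(rest, RSTART + RLENGTH)
        }
        next
    }
    NF {
        # Data line: cut it at the widths taken from the last header.
        pos = 1; out = ""
        for (i = 1; i <= n; i++) {
            f = (i < n) ? substr($0, pos, width[i]) : substr($0, pos)
            pos += width[i]
            gsub(/^ +| +$/, "", f)             # trim surrounding spaces
            out = out (i > 1 ? " " : "") (f == "" ? "X" : f)
        }
        print out
    }'
}

# Made-up two-line sample: the middle field of the data line is empty.
printf 'AAA BBB CCC\n1       3\n' | slice_by_header
# 1 X 3
```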

How to transform genotypes into T/F -1/0/1 format using awk?

I have a very large dataset that I would like to transform from genotypes to a coded format. The genotypes should be represented as follows:
A A -> -1
A B -> 0
B B -> 1
I have thought about this using awk but I cannot seem to get a working solution that can read two columns and output a single code in place of the genotypes. The input file looks like this:
AnimalID Locus Allele1 Allele2
1 1 A B
1 2 A A
1 3 B B
2 1 B A
2 2 B A
2 3 A A
And should be coded to an output file to look like this:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1
I am assuming this can be done using boolean T/F? Any suggestions would be welcomed. Thanks.
Here is something to get you started:
I have stored the mapping in the BEGIN block. If the locus is missing for a particular ID, this will just print a blank for that. You didn't specify what B A would mean, so I took the liberty of mapping it to 0 based on your output.
awk '
BEGIN {
map["A","A"] = -1;
map["A","B"] = 0;
map["B","B"] = 1;
map["B","A"] = 0;
}
NR>1 {
idCount = (idCount<$1) ? $1 : idCount;
locusCount = (locusCount<$2) ? $2 : locusCount
code[$1,$2] = map[$3,$4]
}
END {
printf "%s ", "AnimalID";
for(cnt=1; cnt<=locusCount; cnt++) {
printf "%s%s", "Locus" cnt, ((cnt==locusCount) ? "\n" : " ")
}
for(cnt=1; cnt<=idCount; cnt++) {
printf "%s\t", cnt;
for(locus=1; locus<=locusCount; locus++) {
printf "%s%s", code[cnt,locus], ((locus==locusCount) ? "\n" : "\t")
}
}
}' inputFile
Output:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first row and the last row to get the min and max. Then you just need to go through the bigfile with an awk script to compute the mean.
So for each smallfile small you would run the script
awk -v start="$(head -n 1 small)" -v end="$(tail -n 1 small)" -f script bigfile
Where script can be something simple like
BEGIN {
sum = 0;
count = 0;
range_start = -1;
range_end = -1;
}
{
irow = int($1)
ival = $2 + 0.0
if (irow >= start && end >= irow) {
if (range_start == -1) {
range_start = NR;
}
sum = sum + ival;
count++;
}
else if (irow > end) {
if (range_end == -1) {
range_end = NR - 1;
}
}
}
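END {
if (count == 0) { print "no rows in range"; exit } # avoid dividing by zero
if (range_end == -1) range_end = NR # range runs to the last row
print "start =", range_start, "end =", range_end, "mean =", sum / count
}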
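If only the mean is needed, the same idea fits in a small shell function. This is a sketch under the assumptions stated in the answer (each small file is sorted, so its first and last lines are the min and max); `range_mean` is a made-up name, and printing "NA" for an empty range is a choice made here, not part of the original script.

```shell
# range_mean: hypothetical helper. Usage: range_mean SMALLFILE BIGFILE
range_mean() {
    awk -v start="$(head -n 1 "$1")" -v end="$(tail -n 1 "$1")" '
        # Force numeric comparison, then accumulate column 2 inside the range.
        $1 + 0 >= start + 0 && $1 + 0 <= end + 0 { sum += $2; count++ }
        END { if (count) print sum / count; else print "NA" }
    ' "$2"
}

# Made-up sample files for a quick check.
small=$(mktemp); big=$(mktemp)
printf '3\n10\n' > "$small"
printf '2 0.004\n4 0.003\n6 0.034\n12 1\n' > "$big"
range_mean "$small" "$big"    # mean of column 2 over rows with 3 <= $1 <= 10
rm -f "$small" "$big"
```

To process many small files, call the function in a loop, for example `for f in small_*; do range_mean "$f" bigfile; done` (the `small_*` pattern is a placeholder).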
You can try below:
for r in *; do
awk -v r=$r -F' ' \
'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first line it saves columns 2 and 4 into temporary variables. For every later line it checks whether the filename r lies between the start of the range (the previous column 2) and the end of the range (the current column 2).
If so, it works out the mean and prints the result.
