from xyz to matrix with awk

I have a problem that I managed to solve with a workaround, so I am here hoping to learn more elegant solutions from you ;-)
I have to parse the output of a program: it writes a file of three columns x y z like this
1 1 11
1 2 12
1 3 13
1 4 14
2 1 21
2 2 22
2 3 23
2 4 24
3 1 31
3 2 32
3 3 33
3 4 34
4 1 41
4 2 42
4 3 43
4 4 44
and I need to turn it into a matrix like this
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
I solved it with a two-line bash script like this
dim_matrix=$(awk 'END{print sqrt(NR)}' file_xyz) #since I know that the matrix has to be square and there are no blank lines in file_xyz
awk '{printf("%s%s",$3, !(NR%'${dim_matrix}'==0) ? OFS :ORS ) }' file_xyz
Can you please suggest a way to do the same with awk alone?

awk does not do real multidimensional arrays, but you can fake it with a properly constructed string:
awk '
{mx[$1 "," $2] = $3}
END {
size=sqrt(NR)
for (x=1; x<=size; x++) {
for (y=1; y<=size; y++)
printf("%s ",mx[x "," y])
print ""
}
}
' filename
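For example, assuming the question's sample data is saved as file_xyz, running the same script (collapsed to one line here) reproduces the matrix; note each row carries a trailing space from the "%s " format:
$ awk '{mx[$1 "," $2] = $3} END {size=sqrt(NR); for (x=1; x<=size; x++) {for (y=1; y<=size; y++) printf("%s ",mx[x "," y]); print ""}}' file_xyz
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44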
You can accomplish your example with a single awk call and a call to wc
awk -v "nlines=$(wc -l < filename)" '
BEGIN {size = sqrt(nlines)}
{printf("%s%s", $3, (NR % size == 0 ? ORS : OFS))
}' filename

A "not so" readable version:
awk '($0=$NF x)&&ORS=NR%4?FS:RS' infile
Parameters added as per OP's request:
awk '
($0 = $NF x) && ORS = NR % n ? FS : RS
' n="$1" infile
In the script above I'm using $1, but you can use any shell variable.
The explanation follows:
$0 = $NF - set $0 (the entire current input record)
to the current value of the last field ($NF).
ORS = NR % n ? FS : RS - using the ternary operator
(expression ? return_this_if_true : return_this_otherwise),
set the OutputRecordSeparator:
when NR % n evaluates true (i.e. returns a value other than 0),
set ORS to the current value of FS (the FieldSeparator - runs of whitespace
characters by default);
otherwise set it to RS (which defaults to a newline).
The x (an uninitialized variable, and thus a NULL string when used in concatenation)
is needed in order to handle the output correctly
when the last field is 0 (or an empty string).
This is because an assignment in awk returns the assigned value:
if $NF were 0, $0 = $NF alone would evaluate to false,
and the rest of the && expression (the ORS assignment) would never run.
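For example (assuming the question's sample data is in file_xyz), passing n directly on the command line instead of through a shell variable rebuilds the matrix:
$ awk '($0 = $NF x) && ORS = NR % n ? FS : RS' n=4 file_xyz
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44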

I am not totally sure what you are trying to do, but try this:
awk 'NR%4==0{print s " " $NF;s="";next}{s=s?s " " $NF:$NF}' file1

Related

How to split a string depending on a pattern in another column (UNIX environment)

I have a TAB file something like:
V I 280 6 - VRSSAI
N V 2739 7 - SAVNATA
A R 203 5 - AEERR
Q A 2517 7 - AQSTPSP
S S 1012 5 - GGGSS
L A 281 11 - AAEPALSAGSL
And I would like to check the last column against the order of the letters in the 1st and 2nd columns. If the first and last letters of the last column already match the 1st and 2nd columns respectively, the line remains identical. Otherwise, I would like to locate the reversed pattern (the 2nd-column letter followed by the 1st-column letter) in the last column, then print the string from the 1st-column letter to the end, followed by the part from the first letter up to the 2nd-column letter. The desired output would be:
V I 280 6 - VRSSAI
N V 2739 7 - NATASAV
A R 203 5 - AEERR
Q A 2517 7 - QSTPSPA
S S 1012 5 - SGGGS
L A 281 11 - LSAGSLAAEPA
I have tried different scripts along these lines, but they do not work correctly and I don't know exactly why.
awk 'BEGIN {FS=OFS="\t"}{gsub(/$2$1/,"\t",$6); print $1$7$6$2}' "input" > "output";
Other way is:
awk 'BEGIN {FS=OFS="\t"} {len=split($11,arrseq,"$7$6"); for(i=0;i<len;i++){printf "%s ",arrseq[i],arrseq[i+1]}' "input" > "output";
I also tried using the substr function, but none of my attempts work correctly. Is it possible to do this in bash? Thanks in advance.
Here is an example to make the question clearer.
$1 $2 $6
L A AAEPALSAGSL (reverse pattern 'AL' $2$1)
desired output in $6: the part from the letter corresponding to $1 inside the reversed pattern to the end, followed by the part from the first letter up to and including the letter corresponding to $2 inside the reversed pattern
$1 $2 $6
L A LSAGSLAAEPA
If I understood the question correctly, this awk should do it:
awk '( substr($6, 1, 1) != $1 || substr($6, length($6), 1) != $2 ) && i = index($6, $2$1) { $6 = substr($6, i+1) substr($6, 1, i) }1' OFS=$'\t' data
You basically want to rotate the string so that the beginning of the string matches the char in $1 and the end of the string matches the char in $2. Strings that cannot be rotated to match that condition are left unchanged, for example:
A B 3 3 - BCAAB
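If you prefer something more verbose, the same rotation can be spelled out as a small standalone awk script (just a sketch; the rotate() helper and the rotate.awk filename are my own naming, the logic is the same substr/index arithmetic as the one-liner above):
# rotate.awk - sketch of the same logic with an explicit helper
function rotate(s, i) { return substr(s, i+1) substr(s, 1, i) }
BEGIN { FS = OFS = "\t" }
{
    i = index($6, $2 $1)                                      # position of the reversed pair
    if (i && (substr($6,1,1) != $1 || substr($6,length($6),1) != $2))
        $6 = rotate($6, i)                                    # cut after the pair, swap the halves
    print
}
Save it and run it as awk -f rotate.awk data.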
You can try this awk; it's not perfect but it gives you a starting point.
awk '{i=(match($6,$1));if(i==1)print;else{a=$6;b=substr(a,i);c=substr(a,1,(i-1));$6=b c;print}}' OFS='\t' infile
gawk '
BEGIN{
OFS="\t"
}
$6 !~ "^"$1".*"$2"$" {
$6 = gensub("(.*"$2")("$1".*)", "\\2\\1", 1, $6)
}
{print}
' input.txt
Output
V I 280 6 - VRSSAI
N V 2739 7 - NATASAV
A R 203 5 - AEERR
Q A 2517 7 - QSTPSPA
S S 1012 5 - SGGGS
L A 281 11 - LSAGSLAAEPA

Subset a file by row and column numbers

We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and the row names (column 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it takes too long: 15 minutes plus and still running. I think cutting the columns works fine; selecting the rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software like R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: the solution provided below by @JamesBrown was mixing the order of columns on my system, as I am using a different version of awk; my version is GNU Awk 3.1.7.
Even though in If programming languages were countries, which country would each language represent? they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
split(cols,c); for (i in c) col[c[i]] # extract cols to print
split(rows,r); for (i in r) row[r[i]] # extract rows to print
}
(NR-1 in row){
for (i=2;i<=NF;i++)
(i-1) in col && line=(line ? line OFS $i : $i); # pick columns
print line; line="" # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.
Here is one for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and column numbers are read from files:
$ awk '
BEGIN {
PROCINFO["sorted_in"]="#ind_num_asc";
}
FILENAME==ARGV[1] { # process rows file
n=split($0,t,",");
for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] { # process cols file
m=split($0,t,",");
for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) { # process data file
for(i in c)
printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
Some performance gain could probably come from moving the ARGV[3] processing block to the top, before the other two, and adding a next at its end.
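That rearrangement could look roughly like this (an untested sketch: same logic, just reordered so the data-file block is tried first, with a next so data lines skip the two remaining FILENAME tests):
$ awk '
BEGIN {
PROCINFO["sorted_in"]="#ind_num_asc";
}
FILENAME==ARGV[3] { # process data file first
if ((FNR-1) in r)
for(i in c)
printf "%s%s", $(i+1), (++j%m?OFS:ORS)
next
}
FILENAME==ARGV[1] { # process rows file
n=split($0,t,",");
for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] { # process cols file
m=split($0,t,",");
for(i=1;i<=m;i++) c[t[i]]
}' subsetRows.txt subsetCols.txt inputFile.txt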
Not to take anything away from the two excellent answers above; just because this problem involves a large data set, I am posting a combination of the two answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
for (i in r)
row[r[i]]
}
(NR-1) in row {
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk versions or non-GNU awk as well.
To refine @anubhava's solution, we can get rid of searching over 10k values for each row to see if we are on the right row, by taking advantage of the fact that the input is already sorted:
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
n = split(cols, c, /,/)
split(rows, r, /,/)
j=1;
}
(NR-1) == r[j] {
j++
for (i=1; i<=n; i++)
printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv
with open('foo.txt') as f:
    gwas = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])

Extract column after pattern from file

I have a sample file which looks like this:
5 6 7 8
55 66 77 88
A B C D
1 2 3 4
2 4 6 8
3 8 12 16
E F G H
11 22 33 44
and so on...
I would like to enter a command in a bash script or just in a bash terminal to extract one of the columns independently of the others. For instance, I would like to do something like a grep/awk command with the pattern=C and get the following output:
C
3
6
12
How can I extract a specific column independently of the others, and also specify a number of lines to extract after the pattern, so that I don't get the column with the 7's above or the G column in my output?
If it's always 3 records after the found term:
awk '{for(i=1;i<=NF;i++) {if($i=="C") col=i}} col>0 && rcount<=3 {print $col; rcount++}' test
This will look at each field in your record and if it finds a "C", it will capture the column number in col. Once the column number is greater than 0, it prints the contents of that column; after the header match it prints three more records and then stops printing.
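For example, with the sample data from the question saved as test (the filename used above):
$ awk '{for(i=1;i<=NF;i++) {if($i=="C") col=i}} col>0 && rcount<=3 {print $col; rcount++}' test
C
3
6
12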
$ cat tst.awk
!prevNF { delete f; for (i=1; i<=NF; i++) f[$i] = i }
NF && (tgt in f) { print $(f[tgt]) }
{ prevNF = NF }
$ awk -v tgt=C -f tst.awk file
C
3
6
12
$ awk -v tgt=F -f tst.awk file
F
22

Loop for changing the column - how not to confuse a column's $ with $i

I would like to make a loop to change columns in an awk condition. However, the $ symbol gets in the way of substituting "i". Any idea how to fix it?
#!/bin/bash
for i in {2..5}
do
awk '$$i>=10 && $$i<=20' permut1.txt >> out.txt
done
input:
abc 1 1 2 3 4
bbb 0 1 2 0 1
ccc 1 1 0 2 2
ddd 0 1 3 1 3
fff 15 15 4 15 15
ggg 15 15 15 15 15
I want this output:
ggg 15 15 15 15 15
In awk, $ is a prefix operator whose argument must be a non-negative integer. That's quite different from the meaning of the $ in bash.
The easiest way to pass a variable from bash to awk is to use the -v var=value command line option in the awk command:
awk -v field=2 '$field >= 10 && $field <= 20' permut1.txt
The above will print all lines whose second field is between 10 and 20. You could iterate in bash to do multiple scans of the data, each one scanning a different column:
for i in 2 3 4; do
awk -v field=$i '$field >= 10 && $field <= 20' permut1.txt
done
But I suspect that what you are trying to do is to iterate in awk over the fields, and print the lines which satisfy all three tests. Again, the fact that the awk $ is an operator can make this relatively simple. Another awk feature which simplifies the logic is the next command, which reads the next input line and restarts the pattern matching loop. That makes it easy to require that all three tests match:
awk '{ for (field = 2; field < 5; ++field) {
if ($field < 10 || $field > 20) next;
}
# We can only get here if none of the fields were outside
# the range. $0 is the entire line.
print $0;
}' permut1.txt
Because the default pattern action is precisely print $0, we can shorten that script:
awk '{ for (field = 2; field < 5; ++field)
if ($field < 10 || $field > 20) next;
} 1' permut1.txt
The 1 at the end is a condition which will always be true, with no action (or, in other words, the default action); if the preceding rule doesn't execute the next command for any of the fields, then the 1 condition will be executed, and the default action will cause the line to be printed.
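Run against the sample permut1.txt from the question, either version keeps only the line whose fields 2 through 4 are all between 10 and 20:
$ awk '{ for (field = 2; field < 5; ++field)
if ($field < 10 || $field > 20) next
} 1' permut1.txt
ggg 15 15 15 15 15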

Sum of all rows of all columns - Bash

I have a file like this
1 4 7 ...
2 5 8
3 6 9
And I would like to have as output
6 15 24 ...
That is, the sum of all the lines for every column. I know that to sum all the lines of a certain column (say column 1) you can do it like this:
awk '{sum+=$1;}END{print $1}' infile > outfile
But I can't do it automatically for all the columns.
One more awk
awk '{for(i=1;i<=NF;i++)$i=(a[i]+=$i)}END{print}' file
Output
6 15 24
Explanation
{for (i=1;i<=NF;i++) Loop over every field in the record
$i=(a[i]+=$i) Add the field's value to its running total a[i] and write the total back into the field
END{print} Print the last line, which now contains the sums
As with the other answers this will retain the order of the fields regardless of the number of them.
You want to sum every column differently. Hence, you need an array, not a scalar:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) print sum[i]}' file
6
15
24
This stores sum[column] and finally prints it.
To have the output in the same line, use:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) printf "%d%s", sum[i], (i==NF?"\n":" ")}' file
6 15 24
This uses the trick printf "%d%s", sum[i], (i==NF?"\n":" "): print the number followed by a character. If we are in the last field, that character is a newline; otherwise, it is just a space.
There is a very simple command called numsum to do this:
numsum -c FileName
-c --- Print out the sum of each column.
For example:
cat FileName
1 4 7
2 5 8
3 6 9
Output :
numsum -c FileName
6 15 24
Note:
If the command is not installed on your system, you can install it with:
apt-get install num-utils
echo "1 4 7
2 5 8
3 6 9 " \
| awk '{for (i=1;i<=NF;i++){
sums[i]+=$i;maxi=i}
}
END{
for(i=1;i<=maxi;i++){
printf("%s ", sums[i])
}
print}'
output
6 15 24
My recollection is that you can't rely on for (i in sums) to produce the keys in any particular order, but maybe this is "fixed" in newer versions of gawk.
In case you're using an old-line Unix awk, this solution will keep your output in the same column order, regardless of how "wide" your file is.
IHTH
AWK Program
#!/usr/bin/awk -f
{
print($0);
len=split($0,a);
if (maxlen < len) {
maxlen=len;
}
for (i=1;i<=len;i++) {
b[i]+=a[i];
}
}
END {
for (i=1;i<=maxlen;i++) {
printf("%s ", b[i]);
}
print ""
}
Output (the script echoes each input line before printing the column sums; the input here was three lines of 1 2 3 4 5)
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
3 6 9 12 15
Your approach is correct; you just forgot to print sum (the END block prints $1 instead). Try this:
awk '{sum+=$1;} END{print sum;}' infile > outfile
