I'm trying to write an AWK command that allows me to perform matrix multiplication between two tab separated files.
example:
cat m1
1 2 3 4
5 6 7 8
cat m2
1 2
3 4
5 6
7 8
desired output:
50 60
114 140
without any validation of the input files for the sizes.
it will be easier to break into two scripts, one for transposing the second matrix and one to create a dot product of vectors. Also to simply awk code, you can resort to join.
$ awk '{m=NF/2; for(i=1;i<=m;i++) sum[NR] += $i*$(i+m)}
END {for(i=1;i<=NR;i++)
printf "%s", sum[i] (i==sqrt(NR)?ORS:OFS);
print ""}' <(join -j99 m1 <(transpose m2))
where transpose function is defined as
$ function transpose() { awk '{for(j=1;j<=NF;j++) a[NR,j]=$j}
END {for(i=1;i<=NF;i++)
for(j=1;j<=NR;j++)
printf "%s",a[j,i] (j==NR?ORS:OFS)}' "$1"; }
I would suggest going with GNU Octave:
octave --eval 'load("m1"); load("m2"); m1*m2'
Output:
ans =
50 60
114 140
However, assuming well-formatted files you can do it like this with GNU awk:
matrix-mult.awk
ARGIND == 1 {
for(i=1; i<=NF; i++)
m1[FNR][i] = $i
m1_width = NF
m1_height = FNR
}
ARGIND == 2 {
for(i=1; i<=NF; i++)
m2[FNR][i] = $i
m2_width = NF
m2_height = FNR
}
END {
if(m1_width != m2_height) {
print "Matrices are incompatible, unable to multiply!"
exit 1
}
for(i=1; i<=m1_height; i++) {
for(j=1; j<=m2_width; j++) {
for(k=1; k<=m1_width; k++)
sum += m1[i][k] * m2[k][j]
printf sum OFS; sum=0
}
printf ORS
}
}
Run it like this:
awk -f matrix-mult.awk m1 m2
Output:
50 60
114 140
If you process the second matrix before the first matrix, then you don't have to transpose the second matrix or to store both matrices in an array:
awk 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;w=NF;next}{for(i=1;i<=w;i++){s=0;for(j=1;j<=NF;j++)s+=$j*a[j,i];printf"%s"(i==w?RS:FS),s}}' m2 m1
When I replaced multidimensional arrays with arrays of arrays by replacing a[NR,i] with a[NR][i] and a[j,i] with a[j][i], it made the code about twice as fast in gawk. But arrays of arrays are not supported by nawk, which is /usr/bin/awk on macOS.
Or another option is to use R:
Rscript -e 'as.matrix(read.table("m1"))%*%as.matrix(read.table("m2"))'
Or this gets the names of the input files as command line arguments and prints the result without column names or row names:
Rscript -e 'write.table(Reduce(`%*%`,lapply(commandArgs(T),function(x)as.matrix(read.table(x)))),col.names=F,row.names=F)' m1 m2
Related
I would like to merge two files, column and row-wise but am having difficulty doing so with bash. Here is what I would like to do.
File1:
1 2 3
4 5 6
7 8 9
File2:
2 3 4
5 6 7
8 9 1
Expected output file:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
This is just an example. The actual files are two 1000x1000 data matrices.
Any thoughts on how to do this? Thanks!
Or use paste + awk
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }'
Note that this script adds a trailing space after the last value. This can be avoided with a more complicated awk script or by piping the output through an additional command, e.g.
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }' | sed 's/ $//'
awk solution without additional sed. Thanks to Jonathan Leffler. (I knew it is possible but was too lazy to think about this.)
awk '{ n=NF/2; pad=""; for(i=1; i<=n; i++) { printf "%s%s/%s", pad, $i, $(i+n); pad=" "; } printf "\n"; }'
paste + perl version that works with an arbitrary number of columns without having to hold an entire file in memory:
paste file1.txt file2.txt | perl -MList::MoreUtils=pairwise -lane '
my #a = #F[0 .. (#F/2 - 1)]; # The values from file1
my #b = #F[(#F/2) .. $#F]; # The values from file2
print join(" ", pairwise { "$a/$b" } #a, #b); # Merge them together again'
It uses the non-standard but useful List::MoreUtils module; install through your OS package manager or favorite CPAN client.
Assumptions:
no blank lines in files
both files have the same number of rows
both files have the same number of fieldds
no idea how many rows and/or fields we'll have to deal with
One awk solution:
awk '
# first file (FNR==NR):
FNR==NR { for ( i=1 ; i<=NF ; i++) # loop through fields
{ line[FNR,i]=$(i) } # store field in array; array index = row number (FNR) + field number (i)
next # skip to next line in file
}
# second file:
{ pfx="" # init printf prefix as empty string
for ( i=1 ; i<=NF ; i++) # loop through fields
{ printf "%s%s/%s", # print our results:
pfx, line[FNR,i], $(i) # prefix, corresponding field from file #1, "/", current field
pfx=" " # prefix for rest of fields in this line is a space
}
printf "\n" # append linefeed on end of current line
}
' file1 file2
NOTES:
remove comments to declutter code
memory usage will climb as the size of the matrix increases (probably not an issue for the smallish fields and OPs comment about a 1000 x 1000 matrix)
The above generates:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
I would like to compute spectrum using awk or shell scripting. I have a data, e.g.,
ifile.txt
1
2
3
4
1
3
2
2
3
99
Where 99 is an undefined value.
Formula to compute spectrum is
for k=1,2,3,4,...
I was doing it in the following way.
for i in {1..10};do
awk '{if($1 != 99) printf "%f %f\n",
$1*sin(2*3.14*'$i'*NR/10),
$1*cos(2*3.14*'$i'*NR/10)}' ifile.txt > ifile1.txt
sum_1=$(awk '{sum += $1} END {print sum}' ifile1.txt)
sum_2=$(awk '{sum += $2} END {print sum}' ifile1.txt)
awk '{printf "%f\n", (1/2)*(((1/5)*('$sum_1')^2)+((1/5)*('$sum_2')^2))}'
>> ofile.txt
done
Would please suggest where I am doing mistake. The computation is neither printing anything nor ending. However I am getting values in ifile1.txt
After our successful dialog I compiled our common effort into a self-standing awk script (spectrum.awk):
$1 < 99 {
for (i = 1; i <= 10; ++i) {
sum1[i] += $1*sin(2*3.14*i*NR/10)
sum2[i] += $1*cos(2*3.14*i*NR/10)
}
next
}
END {
for (i = 1; i <= 10; ++i) {
printf "%f\n", (1/2)*(((1/5)*(sum1[i])^2)+((1/5)*(sum2[i])^2))
}
}
It uses arrays (sum1 and sum2) to compute all 10 values in one run.
Unfortunately, I don't know anything about the theoretical background. I cannot see your image (due to proxy issues of my company). Thus, you may give feedback if the computation is wrong.
Sample session:
$ echo '1
2
3
4
1
3
2
2
3
99' | awk -f spectrum.awk
1.417046
2.019819
0.288438
2.688501
0.100023
2.680672
0.296974
1.993613
1.338068
44.097319
At least, it looks "nice".
I am new to text preprocessing and AWK language.
I am trying to loop through each record in a given field(field1) and find the max and min of values and store it in a variable.
Algorithm :
1) Set Min = 0 and Max = 0
2) Loop through $1(field 1)
3) Compare FNR of the field 1 and set Max and Min
4) Finally print Max and Min
this is what I tried :
BEGIN{max = 0; min = 0; NF = 58}
{
for(i = 0; i < NF-57; i++)
{
for(j =0; j < NR; j++)
{
min = (min < $j) ? min : $j
max = (max > $j) ? max : $j
}
}
}
END{print max, min}
#Dataset
f1 f2 f3 f4 .... f58
0.3 3.3 0.5 3.6
0.9 4.7 2.5 1.6
0.2 2.7 6.3 9.3
0.5 3.6 0.9 2.7
0.7 1.6 8.9 4.7
Here, f1,f2,..,f58 are the fields or columns in Dataset.
I need to loop through column one(f1) and find Min-Max.
Output Required:
Min = 0.2
Max = 0.9
What I get as a result:
Min = ''(I dont get any result)
Max = 9.3(I get max of all the fields instead of field1)
This is for learning purpose so I asked for one column So that I can try on my own for multiple columns
These is what I have:
This for loop would only loop 4 times as there r only four fields. Will the code inside the for loop execute for each record that is, for 5 times?
for(i = 0; i < NF; i++)
{
if (min[i]=="") min[i]=$i
if (max[i]=="") max[i]=$i
if ($i<min[i]) min[i]=$i
if ($i>max[i]) max[i]=$i
}
END
{
OFS="\t";
print "min","max";
#If I am not wrong, I saved the data in an array and I guess this would be the right way to print all min and max?
for(i=0; i < NF; i++;)
{
print min[i], max[i]
}
}
Here is a working solution which is really much easier than what you are doing:
/^-?[0-9]*(\.[0-9]*)?$/ checks that $1 is indeed a valid number, otherwise it is discarded.
sort -n | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {a[c++]=$1} END {OFS="\t"; print "min","max";print a[0],a[c-1]}'
If you don't use this, then min and max need to be initialized, for example with the first value:
awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {if (min=="") min=$1; if (max=="") max=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {OFS="\t"; print "min","max";print min, max}'
Readable versions:
sort -n | awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
a[c++]=$1
}
END {
OFS="\t"
print "min","max"
print a[0],a[c-1]
}'
and
awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
if (min=="") min=$1
if (max=="") max=$1
if ($1<min) min=$1
if ($1>max) max=$1
}
END {
OFS="\t"
print "min","max"
print min, max
}'
On your input, is outputs:
min max
0.2 0.9
EDIT (replying to the comment requiring more information on how awk works):
Awk loops through lines (named records) and for each line you have columns (named fields) available. Each awk iteration reads a line and provides among others the NR and NF variables. In your case, you are only interested in the first column, so you will only use $1 which is the first column field. For each record where $1 is matching /^-?[0-9]*(\.[0-9]*)?$/ which is a regex matching positive and negative integers or floats, we are either storing the value in an array a (in the first version) or setting the min/max variables if needed (in the second version).
Here is the explanation for the condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/:
$1 ~ means we are checking if the first field $1 matches the regex between slashes
^ means we start matching from the beginning of the $1 field
-? means an optional minus sign
[0-9]* is any number of digits (including zero, so .1 or -.1 can be matched)
()? means an optional block which can be present or not
\.[0-9]* if that optional block is present, it should start with a dot and contain zero or more digits (so -. or . can be matched! adapt the regex if you have uncertain input)
$ means we are matching until the last character from the $1 field
If you wanted to loop through fields, you would have to use a for loop from 1 to NF (included) like this:
echo "1 2 3 4" | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$(i); if (max=="") max=$(i); if ($(i)<min) min=$(i); if ($(i)>max) max=$(i)}} END {OFS="\t"; print "min","max";print min, max}'
(please note that I have not checked the input here for simplicity purposes)
Which outputs:
min max
1 4
If you had more lines as an input, awk would also process them after reading the first record, example with this input:
1 2 3 4
5 6 7 8
Outputs:
min max
1 8
To prevent this and only work on the first line, you can add a condition like NR == 1 to process only the first line or add an exit statement after the for loop to stop processing the input after the first line.
If you're looking to only column 1, you may try this:
awk '/^[[:digit:]].*/{if($1<min||!min){min=$1};if($1>max){max=$1}}END{print min,max}' dataset
The script looks for line starting with digit and set the min or max if it didn't find it before.
If I have an arbitrary number of files, say n files, and each file contains a matrix, how can I use bash or awk to sum up all the matrices in each file and get an output?
For example, if n=3, and I have these 3 files with the following contents
$ cat mat1.txt
1 2 3
4 5 6
7 8 9
$cat mat2.txt
1 1 1
1 1 1
1 1 1
$ cat mat3.txt
2 2 2
2 2 2
2 2 2
I want to get this output:
$ cat output.txt
4 5 6
7 8 9
10 11 12
Is there a simple one liner to do this?
Thanks!
$ awk '{for (i=1;i<=NF;i++) total[FNR","i]+=$i;} END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print "";}}' mat1.txt mat2.txt mat3.txt
4 5 6
7 8 9
10 11 12
This will automatically adjust to different size matrices. I don't believe that I have used any GNU features so this should be portable to OSX and elsewhere.
How it works:
This command reads from each line from each matrix, one matrix at a time.
For each line read, the following command is executed:
for (i=1;i<=NF;i++) total[FNR","i]+=$i
This loops over every column on the line and adds it to the array total.
GNU awk has multidimensional arrays but, for portability, they are not used here. awk's arrays are associative and this creates an index from the file's line number, FNR, and the column number i, by combining them together with a comma. The result should be portable.
After all the matrices have been read, the results in total are printed:
END{for (j=1;j<=FNR;j++) {for (i=1;i<=NF;i++) printf "%3i ",total[j","i]; print ""}}
Here, j loops over each line up to the total number of lines, FNR. Then i loops over each column up to the total number of columns, NF. For each row and column, the total is printed via printf "%3i ",total[j","i]. This prints the total as a 3-character-wide integer. If you numbers are float or are bigger, adjust the format accordingly.
At the end of each row, the print "" statement causes a newline character to be printed.
You can use awk with paste:
awk -v n=3 '{for (i=1; i<=n; i++) printf "%s%s", ($i + $(i+n) + $(i+n*2)),
(i==n)?ORS:OFS}' <(paste mat{1,2,3}.txt)
4 5 6
7 8 9
10 11 12
GNU awk has multi-dimensional arrays.
gawk '
{
for (i=1; i<=NF; i++)
m[i][FNR] += $i
}
END {
for (y=1; y<=FNR; y++) {
for (x=1; x<=NF; x++)
printf "%d ", m[x][y]
print ""
}
}
' mat{1,2,3}.txt
I'm very new to Bash so I'm sorry if this question is actually very simple. I am dealing with a text file that contains many vertical lists of numbers 2-32 counting up by 2, and each number has a line of other text following it. The problem is that some of the lists are missing numbers. Any pointers for a code that could go through and check to see if each number is there, and if not add a line and put the number in.
One list might look like:
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
12 dlsagflakdjshgflksdhflksdahfl
All the way down to 32. How would I in this case make it so the 10 is recognized as missing and then added in above the 12? Thanks!
Here's one awk-based solution (formatted for readability, not necessarily how you would type it):
awk ' { value[0 + $1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i]
}' input.txt
It basically just records the existing lines in a key/value pair (associative array), then at the end, prints all the records you care about, along with the (possibly empty) value saved earlier.
Note: if the first column needs to be seen as a string instead of an integer, this variant should work:
awk ' { value[$1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i ""]
}' input.txt
You can use awk to figure out the missing line and add it back:
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
Testing:
cat file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
20 daflgkdsakfjhasdlkjhfasdjkhf
24 dlsagflakdjshgflksdhflksdahfl
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8
10
12
14
16
18
20 daflgkdsakfjhasdlkjhfasdjkhf
22
24 dlsagflakdjshgflksdhflksdahfl
26
28
30
32