I tried to convert two data files into a matrix in Stata.
In the first data file there are only 10 columns, so I used:
mkmat d1 d2 d3 d4 d5 d6 d7 d8 d9 d10, matrix(dataname)
However, the second data file contains more than 100 columns.
Do I have to manually include all the variable names in mkmat, or is there a better way to do this?
Consider the following toy example:
clear
set obs 5
forvalues i = 1 / 5 {
generate d`i' = rnormal()
}
list
+-----------------------------------------------------------+
| d1 d2 d3 d4 d5 |
|-----------------------------------------------------------|
1. | .2347558 .255076 -1.309553 1.202226 -1.188903 |
2. | .1994864 .5560354 -.7548561 1.353276 -1.836232 |
3. | 1.444645 -1.798258 1.189875 -.0599763 .4022007 |
4. | .2568011 -1.27296 .5404224 -.1167567 1.853389 |
5. | -.4792487 .175548 1.846101 .4198408 -1.182597 |
+-----------------------------------------------------------+
You could simply use wildcard characters:
mkmat d*, matrix(d)
or
mkmat d?, matrix(d)
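The two wildcards are not interchangeable: * matches zero or more characters, while ? matches exactly one, so d? would stop at d9 in a file with more than 100 columns. A minimal sketch of that difference, using Python's fnmatch as a stand-in for Stata's varlist expansion (the helper name expand and the variable list are made up for illustration):

```python
from fnmatch import fnmatchcase

def expand(pattern, names):
    """Return the names matching a wildcard pattern, in their original order."""
    return [n for n in names if fnmatchcase(n, pattern)]

# Hypothetical variable list: d1 ... d12 plus an unrelated variable.
names = [f"d{i}" for i in range(1, 13)] + ["id"]

print(expand("d*", names))  # every name starting with d, including d10, d11, d12
print(expand("d?", names))  # only d followed by exactly one character
```

With names like d1 through d100, d* is therefore the pattern that picks up all of them.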
Alternatively, the commands ds and unab can be used to create a local macro containing a list of qualifying variable names, which can then be used in mkmat:
ds d*
mkmat `r(varlist)', matrix(d1)
matrix list d1
d1[5,5]
d1 d2 d3 d4 d5
r1 .23475575 .25507599 -1.3095527 1.2022264 -1.1889035
r2 .19948645 .5560354 -.75485611 1.3532759 -1.8362321
r3 1.4446446 -1.7982582 1.1898755 -.0599763 .4022007
r4 .25680107 -1.2729601 .54042244 -.11675671 1.8533887
r5 -.47924873 .175548 1.846101 .41984081 -1.1825972
unab varlist : d*
mkmat `varlist', matrix(d2)
matrix list d2
d2[5,5]
d1 d2 d3 d4 d5
r1 .23475575 .25507599 -1.3095527 1.2022264 -1.1889035
r2 .19948645 .5560354 -.75485611 1.3532759 -1.8362321
r3 1.4446446 -1.7982582 1.1898755 -.0599763 .4022007
r4 .25680107 -1.2729601 .54042244 -.11675671 1.8533887
r5 -.47924873 .175548 1.846101 .41984081 -1.1825972
The advantage of ds is that it can be used to further filter results with its has() or not() options.
For example, if some of your variables are strings, mkmat will complain:
tostring d3 d5, force replace
mkmat d*, matrix(d)
string variables not allowed in varlist;
d3 is a string variable
However, the following will work fine:
ds d*, has(type numeric)
d1 d2 d4
mkmat `r(varlist)', matrix(d)
matrix list d
d[5,3]
d1 d2 d4
r1 -1.5934615 2.1092126 -.99447298
r2 -.51445526 -.62898564 .56975317
r3 -1.8468649 -.68184066 .26716048
r4 -.02007644 -.29140079 2.2511463
r5 -.62507766 .6255222 1.0599482
Type help ds or help unab from Stata's command prompt for full syntax details.
I have a file that looks something like this:
function(a, b, c1, d1, e1, f1);
function(a, b, c2
,d2, e2
f2);
useless things
function(a, b, c3,
/* something lol */
// something else */
d3, e3
, f3
);
The idea is to get something like:
c1
d1
e1
f1
c2
d2
e2
f2
c3
d3
e3
f3
I am using sed to remove the useless things between the functions, so I came up with
sed -n '/function(/,/);/p' file
Here I get the three functions without the useless things.
Now I am trying to put everything inside each function call onto one line, and maybe also delete whatever comes after // or sits between /* and */. But I don't know how to "concat" the lines so that I get 3 lines instead of 10.
Using grep
grep -o '[a-z][0-9]' < input_File
Demo :
$cat file.txt
function(a, b, c1, d1, e1, f1);
function(a, b, c2
,d2, e2
f2);
useless things
function(a, b, c3,
/* something lol */
// something else */
d3, e3
, f3
);
$grep -o '[a-z][0-9]' < file.txt
c1
d1
e1
f1
c2
d2
e2
f2
c3
d3
e3
f3
$
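If the goal is one line per call, as the question asks, the grep approach loses the grouping. A rough sketch of an alternative, in Python rather than sed: strip both comment styles, then collect the arguments of each function(...); call. The name extract_args and the choice to skip the first two arguments (a and b) are assumptions based on the sample, not something stated in the question.

```python
import re

def extract_args(text, skip=2):
    """Return one list of arguments per function(...); call, minus the first `skip`."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)  # drop /* ... */ comments
    text = re.sub(r"//[^\n]*", "", text)                    # drop // comments
    calls = []
    for body in re.findall(r"function\((.*?)\);", text, flags=re.DOTALL):
        # Split on commas and whitespace so arguments spread over lines still work.
        tokens = [t for t in re.split(r"[,\s]+", body) if t]
        calls.append(tokens[skip:])
    return calls

sample = """function(a, b, c1, d1, e1, f1);
function(a, b, c2
,d2, e2
f2);
useless things
function(a, b, c3,
/* something lol */
// something else */
d3, e3
, f3
);
"""

for args in extract_args(sample):
    print(" ".join(args))
```

Splitting on whitespace as well as commas is deliberate: in the sample, e2 and f2 are separated only by a newline.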
I have multiple tab-delimited files where only the first two columns are in common. I'm trying to combine them into one tab-delimited file.
Example: let's say we have 3 files (file1, file2, file3) that we want to combine into file4.
(row and column names are just for demonstration purposes and are not included in any of the files)
Input files =>
File1: 2 rows(r1,r2), 3 columns(c1,c2,c3)
c1 c2 c3
r1 a b c
r2 d e f
File2: 3 rows(r3,r4,r5), 3 columns(c1,c2,c4)
c1 c2 c4
r3 1 2 3
r4 4 5 6
r5 7 8 9
File3: 1 row(r6), 4 columns(c1, c2, c5, c6)
c1 c2 c5 c6
r6 w x y z
Output file =>
for all 3 files, the 2 first columns (c1, c2) have the same name
File4:
c1 c2 c3 c4 c5 c6
r1 a b c - - -
r2 d e f - - -
r3 1 2 - 3 - -
r4 4 5 - 6 - -
r5 7 8 - 9 - -
r6 w x - - y z
What I'm trying to do is: for each file, add the needed empty columns so that all files have the same number of columns, then reorder the columns with "awk", then use "cat" to stack them vertically. But I don't know whether this is the best way or whether there is a more efficient way to do it.
Thanks,
The following does the task, assuming each file starts with a header line of column names and each data line starts with its row name. It builds up a matrix, entry, which is indexed by the row and column names.
awk '(FNR==1) {
for(i=1;i<=NF;++i) {
if (!($i in columns)) { column_order[++cn] = $i; columns[$i] }
c[i+1]=$i
}
next
}
!($1 in rows) { row_order[++rn] = $1; rows[$1] }
{ for(i=2;i<=NF;++i) entry[$1,c[i]]=$i }
END {
s="";for(j=1;j<=cn;++j) s=s OFS column_order[j]; print s
for(i=1;i<=rn;++i) {
row_name=row_order[i]
s=row_name
for(j=1;j<=cn;++j) {
col_name = column_order[j]
s=s OFS ((row_name,col_name) in entry ? entry[row_name,col_name] : "-")
}
print s
}
}' file1 file2 file3 file4 ... filen
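The same idea can be sketched in Python, which may be easier to adapt. This is only an illustration under the same assumptions as the awk script (a header line of column names, then data lines that start with a row name); the function name merge_tables and the in-memory list-of-lists input are made up for the example.

```python
def merge_tables(tables, missing="-"):
    """Merge tables of the form [header, row, row, ...] on the union of their columns."""
    columns = []  # column names in first-seen order
    rows = []     # row names in first-seen order
    entry = {}    # (row name, column name) -> value
    for header, *data in tables:
        for name in header:
            if name not in columns:
                columns.append(name)
        for row_name, *values in data:
            if row_name not in rows:
                rows.append(row_name)
            for col_name, value in zip(header, values):
                entry[row_name, col_name] = value
    merged = [[""] + columns]
    for row_name in rows:
        merged.append([row_name] + [entry.get((row_name, c), missing) for c in columns])
    return merged

file1 = [["c1", "c2", "c3"], ["r1", "a", "b", "c"], ["r2", "d", "e", "f"]]
file2 = [["c1", "c2", "c4"], ["r3", "1", "2", "3"], ["r4", "4", "5", "6"], ["r5", "7", "8", "9"]]
file3 = [["c1", "c2", "c5", "c6"], ["r6", "w", "x", "y", "z"]]

for line in merge_tables([file1, file2, file3]):
    print("\t".join(line))
```

Cells that no input file supplies are filled with the missing marker, which matches the "-" placeholders in the desired File4.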
Hi guys I have two files each of them with N columns and M rows.
File1
1 2 4 6 8
20 4 8 10 12
15 5 7 9 11
File2
1 a1 b1 c5 d1
2 a1 b2 c4 d2
3 a2 b3 c3 d3
19 a3 b4 c2 d4
14 a4 b5 c1 d5
What I need is to search for the closest value in column 1 and print specific columns in the output. For example, the output should be:
File3
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
Since 1 = 1, 19 is the closest to 20 and 14 to 15, the output are those lines.
How can I do this in awk or any other tool?
Help!
This is what I have until now:
echo "ARGIND == 1 {
s1[\$1]=\$1;
s2[\$1]=\$2;
s3[\$1]=\$3;
s4[\$1]=\$4;
s5[\$1]=\$5;
}
ARGIND == 2 {
bestdiff=-1;
for (v in s1)
if (bestdiff < 0 || (v-\$1)**2 <= bestdiff)
{
s11=s1[v];
s12=s2[v];
s13=s3[v];
s14=s4[v];
s15=s5[v];
bestdiff=(v-\$1)**2;
if (bestdiff < 2){
print \$0
print s11,s12,s13,s14,s15}}">diff.awk
awk -f diff.awk file2 file1
output:
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 1
14 a4 b5 c1 d5
1 2
1 1
14 15
I have no idea why the last three lines.
Here is what I ended up with as one way to answer:
function closest(b,i) { # define a function
distance=999999; # this should be higher than the max index to avoid returning null
for (x in b) { # loop over the array to get its keys
tmp = (x+0 > i+0) ? x - i : i - x # +0 forces a numeric comparison; the ternary computes the absolute difference between the key and the target
if (tmp < distance) { # if the distance is less than the preceding one, update
distance = tmp
found = x # and save the key actually found closest
}
}
return found # return the closest key
}
{ # parse the files for each line (no condition)
if (NR>FNR) { # If we changed file (File Number Record is less than Number Record) change array
b[$1]=$0 # make an array with $1 as key
} else {
akeys[max++] = $1 # store the array keys to ensure order at end as for (x in array) does not guarantee the order
a[$1]=$0 # make an array with $1 as key
}
}
END { # Now we ended parsing the two files, print the result
for (i = 0; i < max; i++) { # loop over the first file keys in input order (for (i in akeys) would not guarantee it)
print a[akeys[i]] # print the value for this file
if (akeys[i] in b) { # if the same key exist in second file
print b[akeys[i]] # then print it
} else {
bindex = closest(b,akeys[i]) # call the function to find the closest key from second file
print b[bindex] # print what we found
}
}
}
I hope the comments make this clear enough; feel free to comment if needed.
Warning: this may become really slow if the second file has a large number of lines, as the second array will be scanned for each key of the first file that is not present in the second file.
Given your sample inputs a1 and a2:
$ mawk -f closest.awk a1 a2
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
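The slowness mentioned in the warning comes from scanning every key of the second file for each lookup. One possible way around it, sketched here in Python rather than awk, is to sort the second file's keys once and binary-search them. The names closest_key and pair_closest are made up for this sketch, and it assumes integer keys in column 1 and a non-empty second file.

```python
from bisect import bisect_left

def closest_key(sorted_keys, target):
    """Return the key nearest to target; on a tie the smaller key wins."""
    pos = bisect_left(sorted_keys, target)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_keys[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda k: abs(k - target))

def pair_closest(first, second):
    """For each line of first, emit it followed by the closest line of second."""
    by_key = {int(line.split()[0]): line for line in second}
    keys = sorted(by_key)
    out = []
    for line in first:
        out.append(line)
        out.append(by_key[closest_key(keys, int(line.split()[0]))])
    return out

file1 = ["1 2 4 6 8", "20 4 8 10 12", "15 5 7 9 11"]
file2 = ["1 a1 b1 c5 d1", "2 a1 b2 c4 d2", "3 a2 b3 c3 d3",
         "19 a3 b4 c2 d4", "14 a4 b5 c1 d5"]

print("\n".join(pair_closest(file1, file2)))
```

Each lookup then costs a binary search instead of a full pass over the second file.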
Here it is written that the output parameters of MPI_Cart_shift are the ranks of the source and destination processes. However, in this tutorial (code below), what is returned as the source process is later used in MPI_Isend to send messages. Can anyone clear this up: what do "source" and "destination" actually mean?
#include "mpi.h"
#include <stdio.h>
#define SIZE 16
#define UP 0
#define DOWN 1
#define LEFT 2
#define RIGHT 3
int main(int argc, char *argv[]) {
int numtasks, rank, source, dest, outbuf, i, tag=1,
inbuf[4]={MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,MPI_PROC_NULL,},
nbrs[4], dims[2]={4,4},
periods[2]={0,0}, reorder=0, coords[2];
MPI_Request reqs[8];
MPI_Status stats[8];
MPI_Comm cartcomm;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE) {
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cartcomm);
MPI_Comm_rank(cartcomm, &rank);
MPI_Cart_coords(cartcomm, rank, 2, coords);
MPI_Cart_shift(cartcomm, 0, 1, &nbrs[UP], &nbrs[DOWN]);
MPI_Cart_shift(cartcomm, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);
printf("rank= %d coords= %d %d neighbors(u,d,l,r)= %d %d %d %d\n",
rank,coords[0],coords[1],nbrs[UP],nbrs[DOWN],nbrs[LEFT],
nbrs[RIGHT]);
outbuf = rank;
for (i=0; i<4; i++) {
dest = nbrs[i];
source = nbrs[i];
MPI_Isend(&outbuf, 1, MPI_INT, dest, tag,
MPI_COMM_WORLD, &reqs[i]);
MPI_Irecv(&inbuf[i], 1, MPI_INT, source, tag,
MPI_COMM_WORLD, &reqs[i+4]);
}
MPI_Waitall(8, reqs, stats);
printf("rank= %d inbuf(u,d,l,r)= %d %d %d %d\n",
rank,inbuf[UP],inbuf[DOWN],inbuf[LEFT],inbuf[RIGHT]); }
else
printf("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
}
MPI_Cart_shift: Returns the shifted source and destination ranks, given a shift direction and amount
int MPI_Cart_shift(MPI_Comm comm, int direction, int displ, int *source, int *dest)
What you pass in to the function is comm, direction and displ, where direction specifies the dimension along which the displacement is taken and displ is the distance.
Example
Imagine a 2D cart topology like this (names are not ranks but process-names, only for explanation):
A1 A2 A3 A4 A5
B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
D1 D2 D3 D4 D5
E1 E2 E3 E4 E5
As you might already have understood, you are writing SPMD code in MPI, so we can pick, w.l.o.g., one process to show what is happening. Let's pick C3.
The general idea of MPI_Cart_shift is that we get the rank of a specified process in our topology.
First, we have to decide in which direction we want to go, let's pick 0, which is the column dimension.
Then we have to specify a distance to the other process, let's say this is 2.
So the call would be like:
MPI_Cart_shift(cartcomm, 0, 2, &source, &dest);
Now, the ranks which are placed into the source and dest variables are those respectively of the processes A3 and E3.
How to interpret the results
I (C3) want to send data to the process in the same column with a distance of 2. So this is the dest rank.
If you do the same from the viewpoint of A3: process A3 gets as its dest field the rank of C3.
And this is what source says: what is the rank of the process which is sending me those data if it calls the same MPI_Cart_shift.
If there is no process at the specified place the variable contains MPI_PROC_NULL.
So the results of the call at each process would look like this (with source|dest for each process, using - for MPI_PROC_NULL):
MPI_Cart_shift(cartcomm, 0, 2, &source, &dest);
A1 A2 A3 A4 A5
-|C1 -|C2 -|C3 -|C4 -|C5
B1 B2 B3 B4 B5
-|D1 -|D2 -|D3 -|D4 -|D5
C1 C2 C3 C4 C5
A1|E1 A2|E2 A3|E3 A4|E4 A5|E5
D1 D2 D3 D4 D5
B1|- B2|- B3|- B4|- B5|-
E1 E2 E3 E4 E5
C1|- C2|- C3|- C4|- C5|-
Additional bit of information
If you create the cart with periods = 1 for any dimension, there is a virtual edge between the first and the last node of the cart along that dimension. In this example, periods[0] = 1 would make a connection between A1 and E1, between A2 and E2, and so on. If you then call MPI_Cart_shift, the counting wraps around the edges, so your output would be:
A1 A2 A3 A4 A5
D1|C1 D2|C2 D3|C3 D4|C4 D5|C5
B1 B2 B3 B4 B5
E1|D1 E2|D2 E3|D3 E4|D4 E5|D5
C1 C2 C3 C4 C5
A1|E1 A2|E2 A3|E3 A4|E4 A5|E5
D1 D2 D3 D4 D5
B1|A1 B2|A2 B3|A3 B4|A4 B5|A5
E1 E2 E3 E4 E5
C1|B1 C2|B2 C3|B3 C4|B4 C5|B5
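The two tables above can be reproduced with a small simulation. This is only a sketch in plain Python that mimics the coordinate arithmetic and makes no MPI calls; the functions cart_shift and name are made up for illustration, and None stands in for MPI_PROC_NULL.

```python
def cart_shift(coords, dims, periods, direction, displ):
    """Return (source, dest) coordinates for a shift, or None where no process exists."""
    def neighbour(offset):
        shifted = list(coords)
        shifted[direction] += offset
        if periods[direction]:
            # Python's % is non-negative for a positive modulus, so this wraps both ways.
            shifted[direction] %= dims[direction]
        elif not 0 <= shifted[direction] < dims[direction]:
            return None  # stands in for MPI_PROC_NULL
        return tuple(shifted)
    return neighbour(-displ), neighbour(displ)

def name(coords):
    """Map (row, column) to the process names used in the grid above, e.g. (2, 2) -> C3."""
    return None if coords is None else "ABCDE"[coords[0]] + str(coords[1] + 1)

dims = (5, 5)
for periods in ((0, 0), (1, 0)):
    print("periods =", periods)
    for start in ((0, 0), (2, 2), (4, 4)):
        source, dest = cart_shift(start, dims, periods, 0, 2)
        print(" ", name(start), "source:", name(source), "dest:", name(dest))
```

Calling cart_shift with displ = -2 returns the same pair with source and dest swapped.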
MPI_Cart_shift is a convenience function. Its primary use is for data shifts, i.e. operations in which each rank sends data in a certain direction (to destination) and receives data from the opposite direction (from source); this is the forward operation. When source is used as destination and destination as source, data flows in the opposite direction (backward operation). An example of such an operation is halo swapping, which usually requires two shifts along each dimension: one forward and one backward.
MPI_Cart_shift is a convenience function since its action is equivalent to the following set of MPI calls:
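// 1. Determine the rank of the current process
int rank;
MPI_Comm_rank(cartcomm, &rank);
// 2. Transform the rank into topology coordinates
int coords[ndims];
MPI_Cart_coords(cartcomm, rank, ndims, coords);
// 3. Save the current coordinate along the given direction
int saved_coord = coords[direction];
// 4. Compute the "+"-shifted position
coords[direction] = saved_coord + displ;
// Adjust for periodic boundary if necessary; C's % can return a negative
// value, so the result is wrapped back into [0, dims[direction])
if (periods[direction])
    coords[direction] = (coords[direction] % dims[direction] + dims[direction]) % dims[direction];
// 5. Convert to rank; a position outside a non-periodic dimension has no process
if (coords[direction] < 0 || coords[direction] >= dims[direction])
    destination = MPI_PROC_NULL;
else
    MPI_Cart_rank(cartcomm, coords, &destination);
// 6. Compute the "-"-shifted position
coords[direction] = saved_coord - displ;
// Adjust for periodic boundary, wrapping negative values as above
if (periods[direction])
    coords[direction] = (coords[direction] % dims[direction] + dims[direction]) % dims[direction];
// 7. Convert to rank, again guarding the non-periodic case
if (coords[direction] < 0 || coords[direction] >= dims[direction])
    source = MPI_PROC_NULL;
else
    MPI_Cart_rank(cartcomm, coords, &source);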
One could also compute the rank<->coordinate transforms using arithmetic without calls to MPI_Cart_rank or MPI_Cart_coords but it would be very inflexible as the formulas change when the dimensionality of the topology changes.
Something very important. The ranks as computed by MPI_Cart_shift (or by the equivalent code above) are related to the cartcomm communicator. Those match the ranks in the original communicator (the one used in MPI_Cart_create) only if reorder = 0. When reordering is allowed, the ranks could differ and therefore one should not use those ranks within the context of the original communicator. The following code of yours is valid but strongly dependent on the fact that reorder = 0 in the call to MPI_Cart_create:
dest = nbrs[i];
source = nbrs[i];
MPI_Isend(&outbuf, 1, MPI_INT, dest, tag,
MPI_COMM_WORLD, &reqs[i]);
MPI_Irecv(&inbuf[i], 1, MPI_INT, source, tag,
MPI_COMM_WORLD, &reqs[i+4]);
Here nbrs are computed within cartcomm and then used within MPI_COMM_WORLD. The correct code should use cartcomm in both communication calls:
MPI_Isend(&outbuf, 1, MPI_INT, dest, tag,
cartcomm, &reqs[i]);
MPI_Irecv(&inbuf[i], 1, MPI_INT, source, tag,
cartcomm, &reqs[i+4]);
Some algorithms require that data travels the other way round, i.e. forward and backward are swapped. For such algorithms the displacement displ specified could be negative. In general, a call to MPI_Cart_shift with negative displacement is equivalent to a call with positive displacement but source and destination swapped.
I heard a lot about the amazing performance of programs written in Haskell, and wanted to run some tests. So I wrote a 'library' for matrix operations just to compare its performance with the same thing written in pure C.
First of all I tested the performance of 500000 matrix multiplications, and noticed that it was... never-ending (i.e. ending with an out-of-memory exception after 10 minutes or so)! After studying Haskell a bit more I managed to get rid of the laziness, and the best result I managed to get is ~20 times slower than its equivalent in C.
So, the question: could you review the code below and tell me whether its performance can be improved a bit more? 20 times slower is still a bit disappointing.
import Prelude hiding (foldr, foldl, product)
import Data.Monoid
import Data.Foldable
import Text.Printf
import System.CPUTime
import System.Environment
data Vector a = Vec3 a a a
              | Vec4 a a a a
              deriving Show
instance Foldable Vector where
  foldMap f (Vec3 a b c) = f a `mappend` f b `mappend` f c
  foldMap f (Vec4 a b c d) = f a `mappend` f b `mappend` f c `mappend` f d
data Matr a = Matr !a !a !a !a
                   !a !a !a !a
                   !a !a !a !a
                   !a !a !a !a
instance Show a => Show (Matr a) where
  show m = foldr f [] $ matrRows m
    where f a b = show a ++ "\n" ++ b
matrCols (Matr a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3)
  = [Vec4 a0 a1 a2 a3, Vec4 b0 b1 b2 b3, Vec4 c0 c1 c2 c3, Vec4 d0 d1 d2 d3]
matrRows (Matr a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3)
  = [Vec4 a0 b0 c0 d0, Vec4 a1 b1 c1 d1, Vec4 a2 b2 c2 d2, Vec4 a3 b3 c3 d3]
matrFromList [a0, b0, c0, d0, a1, b1, c1, d1, a2, b2, c2, d2, a3, b3, c3, d3]
  = Matr a0 b0 c0 d0
         a1 b1 c1 d1
         a2 b2 c2 d2
         a3 b3 c3 d3
matrId :: Matr Double
matrId = Matr 1 0 0 0
              0 1 0 0
              0 0 1 0
              0 0 0 1
normalise (Vec4 x y z w) = Vec4 (x/w) (y/w) (z/w) 1
mult a b = matrFromList [f r c | r <- matrRows a, c <- matrCols b] where
  f a b = foldr (+) 0 $ zipWith (*) (toList a) (toList b)
First, I doubt that you'll ever get stellar performance with this implementation. There are too many conversions between different representations. You'd be better off basing your code on something like the vector package. Also, you don't provide all your testing code, so there are probably other issues that we can't see here. This matters because the pipeline from production to consumption has a big impact on Haskell performance, and you haven't provided either end.
Now, two specific problems:
1) Your vector is defined as either a 3 or 4 element vector. This means that for every vector there's an extra check to see how many elements are in use. In C, I imagine your implementation is probably closer to
struct vec {
    double *vec;
    int length;
};
You should do something similar in Haskell; this is how vector and bytestring are implemented for example.
Even if you don't change the Vector definition, make the fields strict. You should also either add UNPACK pragmas (to Vector and Matr) or compile with -funbox-strict-fields.
2) Change mult to
mult a b = matrFromList [f r c | r <- matrRows a, c <- matrCols b] where
  f a b = Data.List.foldl' (+) 0 $ zipWith (*) (toList a) (toList b)
The extra strictness of foldl' will give much better performance in this case than foldr.
This change alone might make a big difference, but without seeing the rest of your code it's difficult to say.
Answering my own question just to share new results I got yesterday:
I upgraded GHC to the most recent version, and performance indeed became not that bad (only ~7 times slower).
Also I tried implementing the matrix in a stupid and simple way (see the listing below) and got really acceptable performance - only about 2 times slower than C equivalent.
{-# LANGUAGE BangPatterns #-}  -- needed for the ! patterns in mult
data Matr a = Matr ( a, a, a, a
                   , a, a, a, a
                   , a, a, a, a
                   , a, a, a, a)
mult (Matr (!a0, !b0, !c0, !d0,
            !a1, !b1, !c1, !d1,
            !a2, !b2, !c2, !d2,
            !a3, !b3, !c3, !d3))
     (Matr (!a0', !b0', !c0', !d0',
            !a1', !b1', !c1', !d1',
            !a2', !b2', !c2', !d2',
            !a3', !b3', !c3', !d3'))
  = Matr ( a0'', b0'', c0'', d0''
         , a1'', b1'', c1'', d1''
         , a2'', b2'', c2'', d2''
         , a3'', b3'', c3'', d3'')
  where a0'' = a0 * a0' + b0 * a1' + c0 * a2' + d0 * a3'
        b0'' = a0 * b0' + b0 * b1' + c0 * b2' + d0 * b3'
        c0'' = a0 * c0' + b0 * c1' + c0 * c2' + d0 * c3'
        d0'' = a0 * d0' + b0 * d1' + c0 * d2' + d0 * d3'
        a1'' = a1 * a0' + b1 * a1' + c1 * a2' + d1 * a3'
        b1'' = a1 * b0' + b1 * b1' + c1 * b2' + d1 * b3'
        c1'' = a1 * c0' + b1 * c1' + c1 * c2' + d1 * c3'
        d1'' = a1 * d0' + b1 * d1' + c1 * d2' + d1 * d3'
        a2'' = a2 * a0' + b2 * a1' + c2 * a2' + d2 * a3'
        b2'' = a2 * b0' + b2 * b1' + c2 * b2' + d2 * b3'
        c2'' = a2 * c0' + b2 * c1' + c2 * c2' + d2 * c3'
        d2'' = a2 * d0' + b2 * d1' + c2 * d2' + d2 * d3'
        a3'' = a3 * a0' + b3 * a1' + c3 * a2' + d3 * a3'
        b3'' = a3 * b0' + b3 * b1' + c3 * b2' + d3 * b3'
        c3'' = a3 * c0' + b3 * c1' + c3 * c2' + d3 * c3'
        d3'' = a3 * d0' + b3 * d1' + c3 * d2' + d3 * d3'
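For anyone wanting to check the sixteen unrolled formulas above, each one is the usual row-by-column product over a flat, row-major layout (a0 is index 0, b0 is index 1, ..., d3 is index 15). A small reference sketch in Python, assuming that layout; the function name mult simply mirrors the Haskell one and is not part of any library:

```python
def mult(a, b):
    """Multiply two 4x4 matrices stored as flat, row-major lists of 16 numbers."""
    return [sum(a[4 * r + k] * b[4 * k + c] for k in range(4))
            for r in range(4) for c in range(4)]

identity = [1.0 if r == c else 0.0 for r in range(4) for c in range(4)]
m = [float(i) for i in range(16)]

# Multiplying by the identity on either side should give the matrix back.
print(mult(m, identity) == m, mult(identity, m) == m)
```

The entry at index 0 corresponds to a0'' = a0 * a0' + b0 * a1' + c0 * a2' + d0 * a3' in the listing above.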