I have a file that resembles something like this:
function(a, b, c1, d1, e1, f1);
function(a, b, c2
,d2, e2
f2);
useless things
function(a, b, c3,
/* something lol */
// something else */
d3, e3
, f3
);
The idea is to get something like:
c1
d1
e1
f1
c2
d2
e2
f2
c3
d3
e3
f3
I am using sed to remove the useless things between the function calls, so I came up with
sed -n '/function(/,/);/p' file
This gives me the three function calls without the useless lines.
Now I am trying to put everything between function( and ); onto a single line, and maybe also delete anything after // or between /* */. But I don't know how to concatenate the lines so that I get 3 lines instead of 10.
Using grep
grep -o '[a-z][0-9]' < input_File
Demo:
$cat file.txt
function(a, b, c1, d1, e1, f1);
function(a, b, c2
,d2, e2
f2);
useless things
function(a, b, c3,
/* something lol */
// something else */
d3, e3
, f3
);
$grep -o '[a-z][0-9]' < file.txt
c1
d1
e1
f1
c2
d2
e2
f2
c3
d3
e3
f3
$
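If you also want each call collapsed onto a single line first (and the // and /* */ comments removed), as the question describes, a pipeline roughly like the following should do it. This is only a sketch: it assumes GNU sed and that ); only ever appears at the end of a call.
sed -n '/function(/,/);/p' file.txt \
  | sed -e 's@/\*.*\*/@@g' -e 's@//.*@@' \
  | tr -d '\n' \
  | sed 's@);@);\n@g'
The first sed keeps only the function(...) blocks, the second strips the comments, tr joins everything into one line, and the last sed re-inserts a newline after every );, giving one call per line; the grep above then extracts the arguments.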
I have a quick question about the doParallel package in R. I have an optimize.R file which contains roughly 18 functions A1, A2, A3, A4, ..., A18, and A18 basically contains all of the functions A1, A2, A3, ..., A17. I have another file, result.R, in which I import every function from optimize.R into the Global Environment of RStudio (shown in the upper-right corner of RStudio). In addition, within result.R I have a function F that calls function E inside the loop foreach(..., .export = c(".GlobalEnv")) %dopar% {do something}.
Now, in the final.R file (which is the main file), I use source("C:/Users/[Name]/Desktop/result.R", echo = F). Within final.R, I have a function res that calls function F.
However, when I execute final.R, I get the error message task 1: cannot find variable C, which is weird because C is a function, not a variable. It also seems that .export = c(".GlobalEnv") fails to work, as function C is already in the Global Environment. Can anyone suggest some ways to overcome this kind of issue? I tried to export
Flow of the 3 files:
final.R file
source("C:/Users/[Name]/Desktop/result.R", echo = F)
library(doParallel)
registerDoParallel(4)
res <- function(var_x, var_y) {
  best.rev <- foreach(i = 1:5, .export = c(".GlobalEnv")) %dopar% {
    F(var_x, var_y)
    # do something else
  }
  output <- best.rev
}
result.R file
source("C:/Users/[Name]/Desktop/optimize.R", echo = F)
F <- function(var1, var2) {
  E(var1, var2)
  # do something else
}
optimize.R file
A1 <- function(x1, x2) {....}
A2 <- function(x1, x3, x4) {...}
A3 <- function(x1, x2, x3) {...}
....
A18 <- function(x1, x2, x3, x4, ..., x8) {
  a1 <- A1(x1, x2)
  a2 <- A2(a1, x3, x4)
  a3 <- A3(a1, a2, x3)
  return(list(a1, a2, a3))
}
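For what it's worth, one thing to check (my own hedged sketch, not part of the original question): as far as I can tell, .export expects a character vector of object names to copy to the workers, so passing ".GlobalEnv" as a string does not make the sourced functions available on them. Listing the needed functions explicitly would look roughly like this:
# sketch only: function names are the ones described above
best.rev <- foreach(i = 1:5,
                    .export = c("F", "E", paste0("A", 1:18))) %dopar% {
  F(var_x, var_y)
}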
I am learning Z3 and perhaps my question does not apply, so please be patient.
Suppose I have the following:
c1, c2 = z3.BitVec('c1', 32), z3.BitVec('c2', 32)
c1 = c1 + c1
c2 = c2 + c2
c2 = c2 + c1
c1 = c1 + c2
e1 = z3.simplify(c1)
e2 = z3.simplify(c2)
When I print their sexpr():
print "e1=", e1.sexpr()
print "e2=", e2.sexpr()
Output:
e1= (bvadd (bvmul #x00000004 c1) (bvmul #x00000002 c2))
e2= (bvadd (bvmul #x00000002 c2) (bvmul #x00000002 c1))
My question is, how can I evaluate the numerical value of 'e1' and 'e2' for user supplied values of c1 and c2?
For example, e1(c1=1, c2=1) == 6, e2(c1=1, c2=1) == 4
Thanks!
I figured it out. I had to introduce two separate variables to hold the expressions, and then two result variables whose values I can query from the model:
import z3

e1, e2, c1, c2, r1, r2 = (z3.BitVec('e1', 32), z3.BitVec('e2', 32),
                          z3.BitVec('c1', 32), z3.BitVec('c2', 32),
                          z3.BitVec('r1', 32), z3.BitVec('r2', 32))
e1 = c1
e2 = c2
e1 = e1 + e1
e2 = e2 + e2
e2 = e2 + e1
e1 = e1 + e2
e1 = z3.simplify(e1)
e2 = z3.simplify(e2)
print "e1=", e1
print "e2=", e2
s = z3.Solver()
s.add(c1 == 5, c2 == 1, e1 == r1, e2 == r2)
if s.check() == z3.sat:
    m = s.model()
    print 'r1=', m[r1].as_long()
    print 'r2=', m[r2].as_long()
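For reference, z3py's substitute() gives a shorter route that skips the solver entirely; the sketch below is my own and simply rebuilds the simplified e1/e2 from the question before plugging in concrete values:
import z3

c1, c2 = z3.BitVec('c1', 32), z3.BitVec('c2', 32)
e1 = z3.simplify(4*c1 + 2*c2)   # same shape as e1 in the question
e2 = z3.simplify(2*c1 + 2*c2)   # same shape as e2 in the question

# Substitute the user-supplied values and let simplify() fold the constants.
vals = [(c1, z3.BitVecVal(1, 32)), (c2, z3.BitVecVal(1, 32))]
print(z3.simplify(z3.substitute(e1, *vals)).as_long())   # 6
print(z3.simplify(z3.substitute(e2, *vals)).as_long())   # 4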
I tried to convert two data files into a matrix in Stata.
In the first data file there are only 10 columns, so I used:
mkmat d1 d2 d3 d4 d5 d6 d7 d8 d9 d10, matrix(dataname)
However, the second data file contains more than 100 columns.
Do I have to manually list every variable name in mkmat, or is there a better way to do this?
Consider the following toy example:
clear
set obs 5
forvalues i = 1 / 5 {
    generate d`i' = rnormal()
}
list
+-----------------------------------------------------------+
| d1 d2 d3 d4 d5 |
|-----------------------------------------------------------|
1. | .2347558 .255076 -1.309553 1.202226 -1.188903 |
2. | .1994864 .5560354 -.7548561 1.353276 -1.836232 |
3. | 1.444645 -1.798258 1.189875 -.0599763 .4022007 |
4. | .2568011 -1.27296 .5404224 -.1167567 1.853389 |
5. | -.4792487 .175548 1.846101 .4198408 -1.182597 |
+-----------------------------------------------------------+
You could simply use wildcard characters:
mkmat d*, matrix(d)
or
mkmat d?, matrix(d)
Alternatively, the commands ds and unab can be used to create a local macro containing a list of qualifying variable names, which can then be used in mkmat:
ds d*
mkmat `r(varlist)', matrix(d1)
matrix list d1
d1[5,5]
d1 d2 d3 d4 d5
r1 .23475575 .25507599 -1.3095527 1.2022264 -1.1889035
r2 .19948645 .5560354 -.75485611 1.3532759 -1.8362321
r3 1.4446446 -1.7982582 1.1898755 -.0599763 .4022007
r4 .25680107 -1.2729601 .54042244 -.11675671 1.8533887
r5 -.47924873 .175548 1.846101 .41984081 -1.1825972
unab varlist : d*
mkmat `varlist', matrix(d2)
matrix list d2
d2[5,5]
d1 d2 d3 d4 d5
r1 .23475575 .25507599 -1.3095527 1.2022264 -1.1889035
r2 .19948645 .5560354 -.75485611 1.3532759 -1.8362321
r3 1.4446446 -1.7982582 1.1898755 -.0599763 .4022007
r4 .25680107 -1.2729601 .54042244 -.11675671 1.8533887
r5 -.47924873 .175548 1.846101 .41984081 -1.1825972
The advantage of ds is that it can be used to further filter results with its has() or not() options.
For example, if some of your variables are strings, mkmat will complain:
tostring d3 d5, force replace
mkmat d*, matrix(d)
string variables not allowed in varlist;
d3 is a string variable
However, the following will work fine:
ds d*, has(type numeric)
d1 d2 d4
mkmat `r(varlist)', matrix(d)
matrix list d
d[5,3]
d1 d2 d4
r1 -1.5934615 2.1092126 -.99447298
r2 -.51445526 -.62898564 .56975317
r3 -1.8468649 -.68184066 .26716048
r4 -.02007644 -.29140079 2.2511463
r5 -.62507766 .6255222 1.0599482
Type help ds or help unab from Stata's command prompt for full syntax details.
I have heard a lot about the amazing performance of programs written in Haskell, and I wanted to run some tests. So I wrote a 'library' for matrix operations, just to compare its performance with the same thing written in pure C.
First of all I tested the performance of 500,000 matrix multiplications, and noticed that it was... never-ending (i.e. it ended with an out-of-memory exception after 10 minutes or so)! After studying Haskell a bit more I managed to get rid of the laziness, and the best result I could get is ~20 times slower than the equivalent in C.
So, the question: could you review the code below and tell me whether its performance can be improved any further? A 20x slowdown is still a bit disappointing.
import Prelude hiding (foldr, foldl, product)
import Data.Monoid
import Data.Foldable
import Text.Printf
import System.CPUTime
import System.Environment
data Vector a = Vec3 a a a
              | Vec4 a a a a
              deriving Show

instance Foldable Vector where
    foldMap f (Vec3 a b c)   = f a `mappend` f b `mappend` f c
    foldMap f (Vec4 a b c d) = f a `mappend` f b `mappend` f c `mappend` f d

data Matr a = Matr !a !a !a !a
                   !a !a !a !a
                   !a !a !a !a
                   !a !a !a !a

instance Show a => Show (Matr a) where
    show m = foldr f [] $ matrRows m
        where f a b = show a ++ "\n" ++ b

matrCols (Matr a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3)
    = [Vec4 a0 a1 a2 a3, Vec4 b0 b1 b2 b3, Vec4 c0 c1 c2 c3, Vec4 d0 d1 d2 d3]

matrRows (Matr a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3)
    = [Vec4 a0 b0 c0 d0, Vec4 a1 b1 c1 d1, Vec4 a2 b2 c2 d2, Vec4 a3 b3 c3 d3]

matrFromList [a0, b0, c0, d0, a1, b1, c1, d1, a2, b2, c2, d2, a3, b3, c3, d3]
    = Matr a0 b0 c0 d0
           a1 b1 c1 d1
           a2 b2 c2 d2
           a3 b3 c3 d3

matrId :: Matr Double
matrId = Matr 1 0 0 0
              0 1 0 0
              0 0 1 0
              0 0 0 1

normalise (Vec4 x y z w) = Vec4 (x/w) (y/w) (z/w) 1

mult a b = matrFromList [f r c | r <- matrRows a, c <- matrCols b] where
    f a b = foldr (+) 0 $ zipWith (*) (toList a) (toList b)
First, I doubt that you'll ever get stellar performance with this implementation. There are too many conversions between different representations. You'd be better off basing your code on something like the vector package. Also, you don't provide all of your testing code, so there are probably other issues that we can't see here. This is because the pipeline from production to consumption has a big impact on Haskell performance, and you haven't provided either end.
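To make that concrete, here is a rough, untuned sketch (my own, not the poster's code) of the flat representation that the vector package makes easy, with a 4x4 matrix stored as 16 unboxed Doubles in row-major order:
import qualified Data.Vector.Unboxed as U

newtype Matr = Matr (U.Vector Double)

-- element (r, c) of the product is the dot product of row r of a and column c of b
mult :: Matr -> Matr -> Matr
mult (Matr a) (Matr b) = Matr $ U.generate 16 $ \i ->
    let (r, c) = i `divMod` 4
    in sum [ a U.! (r * 4 + k) * b U.! (k * 4 + c) | k <- [0 .. 3] ]
The whole matrix then lives in one contiguous unboxed buffer, which is much closer to the memory layout a C implementation would use.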
Now, two specific problems:
1) Your vector is defined as either a 3 or 4 element vector. This means that for every vector there's an extra check to see how many elements are in use. In C, I imagine your implementation is probably closer to
struct vec {
    double *vec;
    int length;
};
You should do something similar in Haskell; this is how vector and bytestring are implemented for example.
Even if you don't change the Vector definition, make the fields strict. You should also either add UNPACK pragmas (to Vector and Matrix) or compile with -funbox-strict-fields.
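For concreteness, strict unpacked fields look like the sketch below (my own illustration). Note that UNPACK only has an effect on concrete field types such as Double, so the polymorphic Vector a would need to be specialised first:
data Vec4D = Vec4D {-# UNPACK #-} !Double
                   {-# UNPACK #-} !Double
                   {-# UNPACK #-} !Double
                   {-# UNPACK #-} !Double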
2) Change mult to
mult a b = matrFromList [f r c | r <- matrRows a, c <- matrCols b] where
    f a b = Data.List.foldl' (+) 0 $ zipWith (*) (toList a) (toList b)
The extra strictness of foldl' will give much better performance in this case than foldr.
This change alone might make a big difference, but without seeing the rest of your code it's difficult to say.
Answering my own question, just to share the new results I got yesterday:
I upgraded GHC to the most recent version, and performance indeed became not that bad (only ~7 times slower).
Also, I tried implementing the matrix in a stupid and simple way (see the listing below) and got really acceptable performance: only about 2 times slower than the C equivalent.
data Matr a = Matr ( a, a, a, a
                   , a, a, a, a
                   , a, a, a, a
                   , a, a, a, a)

mult (Matr (!a0, !b0, !c0, !d0,
            !a1, !b1, !c1, !d1,
            !a2, !b2, !c2, !d2,
            !a3, !b3, !c3, !d3))
     (Matr (!a0', !b0', !c0', !d0',
            !a1', !b1', !c1', !d1',
            !a2', !b2', !c2', !d2',
            !a3', !b3', !c3', !d3'))
    = Matr ( a0'', b0'', c0'', d0''
           , a1'', b1'', c1'', d1''
           , a2'', b2'', c2'', d2''
           , a3'', b3'', c3'', d3'')
  where a0'' = a0 * a0' + b0 * a1' + c0 * a2' + d0 * a3'
        b0'' = a0 * b0' + b0 * b1' + c0 * b2' + d0 * b3'
        c0'' = a0 * c0' + b0 * c1' + c0 * c2' + d0 * c3'
        d0'' = a0 * d0' + b0 * d1' + c0 * d2' + d0 * d3'
        a1'' = a1 * a0' + b1 * a1' + c1 * a2' + d1 * a3'
        b1'' = a1 * b0' + b1 * b1' + c1 * b2' + d1 * b3'
        c1'' = a1 * c0' + b1 * c1' + c1 * c2' + d1 * c3'
        d1'' = a1 * d0' + b1 * d1' + c1 * d2' + d1 * d3'
        a2'' = a2 * a0' + b2 * a1' + c2 * a2' + d2 * a3'
        b2'' = a2 * b0' + b2 * b1' + c2 * b2' + d2 * b3'
        c2'' = a2 * c0' + b2 * c1' + c2 * c2' + d2 * c3'
        d2'' = a2 * d0' + b2 * d1' + c2 * d2' + d2 * d3'
        a3'' = a3 * a0' + b3 * a1' + c3 * a2' + d3 * a3'
        b3'' = a3 * b0' + b3 * b1' + c3 * b2' + d3 * b3'
        c3'' = a3 * c0' + b3 * c1' + c3 * c2' + d3 * c3'
        d3'' = a3 * d0' + b3 * d1' + c3 * d2' + d3 * d3'
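In case anyone wants to reproduce the timing, a driver along these lines can be used; the identity seed, the loop count and the timing code are my own additions, and it assumes the tuple-based Matr and mult from the listing above are in scope:
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

matId :: Matr Double
matId = Matr ( 1, 0, 0, 0
             , 0, 1, 0, 0
             , 0, 0, 1, 0
             , 0, 0, 0, 1 )

main :: IO ()
main = do
    start <- getCPUTime
    let go :: Int -> Matr Double -> Matr Double
        go 0 m = m
        go n m = m `seq` go (n - 1) (mult m matId)
    -- force the result before reading the clock a second time
    end <- go 500000 matId `seq` getCPUTime
    printf "500000 multiplications took %.3f s of CPU time\n"
           (fromIntegral (end - start) / 1e12 :: Double)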
I have a data.frame named "d" of ~1,300,000 lines and 4 columns and another data.frame named "gc" of ~12,000 lines and 2 columns (but see the smaller example below).
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )
Here is how "d" looks like:
gene val ind exp
1 a 1.38711902 i1 e1
2 b -0.25578496 i1 e1
3 c 0.49331256 i1 e1
4 a -1.38015272 i1 e2
5 b 1.46779219 i1 e2
6 c -0.84946320 i1 e2
7 a 0.01188061 i2 e1
8 b -0.13225808 i2 e1
9 c 0.16508404 i2 e1
10 a 0.70949804 i2 e2
11 b -0.64950167 i2 e2
12 c 0.12472479 i2 e2
And here is "gc":
gene chr
1 a c1
2 b c2
3 c c3
I want to add a 5th column to "d" by incorporating data from "gc" that matches the 1st column of "d". For the moment I am using sapply:
d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
But on the real data it takes a very long time (I have been running the command under system.time() for more than 30 minutes and it still has not finished).
Do you have any idea of how I could rewrite this in a clever way? Or should I consider using plyr, maybe with the "parallel" option (I have four cores on my computer)? In such a case, what would be the best syntax?
Thanks in advance.
I think you can just use the factor as index:
gc[ d[,1], 2]
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
does the same as:
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
But is much faster:
> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
user system elapsed
5.03 0.00 5.02
>
> system.time(replicate(1000,gc[ d[,1], 2]))
user system elapsed
0.12 0.00 0.13
Edit:
To expand a bit on my comment. The gc dataframe requires one row for each level of gene in the order of the levels for this to work:
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("c","a","b"), chr=c("c1","c2","c3") )
gc[ d[,1], 2]
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
But it is not hard to fix that:
levels(gc$gene) <- levels(d$gene) # Seems redundant as this is done right quite often automatically
gc <- gc[order(gc$gene),]
gc[ d[,1], 2]
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
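A way to sidestep the level-order requirement entirely (not in the original answer, but worth noting) is to look the genes up by value with match(), so the row order of gc no longer matters:
d$chr <- gc$chr[ match(d$gene, gc$gene) ]   # match() compares the labels, not the level codes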
An alternative solution that does not beat Sasha's approach timing-wise, but is more generalizable and readable, is to simply merge the two data frames:
d <- merge(d, gc)
I have a slower system, so here are my timings:
> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
user system elapsed
11.22 0.12 11.86
> system.time(replicate(1000,gc[ d[,1], 2]))
user system elapsed
0.34 0.00 0.35
> system.time(replicate(1000, merge(d, gc, by="gene")))
user system elapsed
3.35 0.02 3.40
The benefit is that you could have multiple keys, fine control over non-matching items, etc.
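For instance (my own illustration), to keep every row of d even when its gene has no counterpart in gc:
d <- merge(d, gc, by = "gene", all.x = TRUE)   # unmatched genes get NA in chr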