I'm interested if there's a package in R to support call-chain style data manipulation, like in C#/LINQ, F#?
I want to enable style like this:
var list = new[] {1,5,10,12,1};
var newList = list
.Where(x => x > 5)
.GroupBy(x => x%2)
.OrderBy(x => x.Key.ToString())
.Select(x => "Group: " + x.Key)
.ToArray();
I don't know of one, but here's the start of what it could look like:
`%then%` = function(x, body) {
x = substitute(x)
fl = as.list(substitute(body))
car = fl[[1L]]
cdr = {
if (length(fl) == 1)
list()
else
fl[-1L]
}
combined = as.call(
c(list(car, x), cdr)
)
eval(combined, parent.frame())
}
df = data.frame(x = 1:7)
df %then% subset(x > 2) %then% print
This prints
x
3 3
4 4
5 5
6 6
7 7
If you keep using hacks like that it should be pretty simple to get the kind of
syntax you find pleasing ;-)
edit: combined with plyr, this becomes not bad at all:
(data.frame(
x = c(1, 1, 1, 2, 2, 2),
y = runif(6)
)
%then% subset(y > 0.2)
%then% ddply(.(x), summarize,
ysum = sum(y),
ycount = length(y)
)
%then% print
)
dplyr chaining syntax resembles LINQ (stock example):
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Introduction to dplyr - Chaining
(Not an answer. More an extended comment on Owen's answer.] Owen's answer helped me understand what you were after and I thoroughly enjoyed reading his insightful answer. This "outside to inside" style reminded me of an example on the help(Reduce) page where the Funcall function is defined and then successively applied:
## Iterative function application:
Funcall <- function(f, ...) f(...)
## Compute log(exp(acos(cos(0))
Reduce(Funcall, list(log, exp, acos, cos), 0, right = TRUE)
What I find especially intriguing about Owen's macro is that it essentially redefines the argument processing of existing functions. I tried thinking of how I might provide arguments to the "interior" functions for the Funcall aproach and then realized that his %then% function had already sorted that task out. He was using the function names without their leftmost arguments but with all their other right-hand arguments. Brilliant!
https://github.com/slycoder/Rpipe
c(1,1,1,6,4,3) %|% sort() %|% unique()
# result => c(1,3,4)
Admittedly, it would be nice to have a where function here, or alternatively to allow anonymous functions to be passed in, but hey the source code is there: fork it and add it if you want.
Related
Sorry if it's a dumb question, but I am having trouble figuring out how to use pmodels in the drc package. I've searched everywhere online and all I can find is the definition, which is: "a data frame with a many columns as there are parameters in the non-linear function. Or a list containing a formula for each parameter in the nonlinear function." There are examples online, but I have no what it represents. For example, for the commands:
sel.m2 <- drm(dead/total~conc, type, weights=total, data=selenium, fct=LL.2(),
type="binomial", pmodels=list(~1, ~factor(type)-1))
met.as.m1<-drm(gain ~ dose, product, data = methionine, fct = AR.3(),
pmodels = list(~1, ~factor(product), ~factor(product)))
plot(met.as.m1, log = "", ylim = c(1450, 1800))
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h), fct = LL.4(), data = auxins), method = "anova")
I see pmodels as a list and data frame, but what does the "-1"vs "~1" mean or what does it mean to list a factor, what's the significance of the order within the parenthesis?
I agree that it's not well explained for new people. Unfortunately, I can only answer you in part.
A late response but for anyone else:
Two resources are available for reference with drc:
a) The writers published about drc. See main text and supplementary (S3 in this example) DOI:10.1371/journal.pone.0146021
b) See the drc.pdf and ctrl+f for pmodel to inspect the various uses.
data.frame vs. list depends on the grouping level I believe.
After playing around with my data (subsets), I found that
pmodels() = parameter/pooled models
aka how you set those parameters to equal (i.e., global/shared or not).
With your last example using the auxins df
library(drc)
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h),
fct = LL.4(), data = auxins), method = "anova")
## changed names to familiar terms by a non-statistician
auxins.m1 <- boxcox(drm(y ~ dose, h, pmodels = data.frame(h, h, 1, h),
fct = LL.4(names=c("hill.slope","bot","top","ed50"), data = auxins), method = "anova")
Shows that the top is set to 1.
The order is the same as the LL.4(names...)
So if you set
pmodels = data.frame(h, 1, 1, h) ## ("hill.slope","bot","top","ed50")
as they do in the drc.pdf on pg.10, you'll see that it's to set a common/shared bottom and top.
Check out pg.9 of their supplementary article, it shows that for LL.2, the two-parameter logistic fit has pre-set top = 1 and bottom = 0. The output of
selenium.LL.2.2 <- drm(dead/total~conc, type, weights = total,
data = selenium, fct = LL.2(), type="binomial",
pmodels = list(~factor(type)-1, ~1)) ## ("hill-slope", "ed50")
Shows that ed50 is assumed constant.
Alternatively from pg.91 of the drc.pdf:
## Fitting the model with freely varying ED50 values
mecter.free <- drm(rgr ~ dose, pct, data = mecter,
fct = LL.4(), pmodels = list(~1, ~1, ~1, ~factor(pct) - 1))
Unfortunately, it's really not clear what the object-1 means vs. just the object.
A better approach might be to use the base drm() without the special case of LL.#()
Check
getMeanFunctions()
to see all available functions
if you're trying to fix a value at a certain value you can
fct = LL.4(fixed = c(NA,0,1,NA))
## effectively becomes the standard LL.2()
## or
fct = LL.4(fixed = c(1,0,NA,NA))
## common hill slope = 1; assumes baseline correction hence = 0
Related in part; see a lot of drm functions laid out:
https://stackoverflow.com/a/39257095
A tidy number is a number whose digits are in non-decreasing order, e.g. 1234. Here is a way of finding tidy numbers written in Ruby:
def tidy_number(n)
n.to_s.chars.sort.join.to_i == n
end
p tidy_number(12345678) # true
p tidy_number(12345878) # false
I tried to write the same thing in Scala and arrived at the following:
object MyClass {
def tidy_number(n:Int) = n.toString.toList.sorted.mkString.toInt == n;
def main(args: Array[String]) {
println(tidy_number(12345678)) // true
println(tidy_number(12345878)) // false
}
}
The only way I could do it in Scala was by converting an integer to a string to a list and then sorting the list and going back again. My question: is there a better way? 'Better' in the sense there are fewer conversions. I'm mainly looking for a concise piece of Scala but I'd be grateful if someone pointed out a more concise way of doing it in Ruby as well.
You can use sorted on strings in Scala, so
def tidy_number(n: Int) = {
val s = n.toString
s == s.sorted
}
Doing it in two parts also avoids an extra toInt conversion.
I've never used ruby, but this post suggests you're doing it the best way
You can check each adjacent pair of digits to make sure that the first value is <= the second:
def tidy_number(n:Int) =
n.toString.sliding(2,1).forall(p => p(0) <= p(1))
Update following from helpful comments
As noted in the comments, this fails for single-digit numbers. Taking this and another comment together gives this:
def tidy_number(n:Int) =
(" "+n).sliding(2,1).forall(p => p(0) <= p(1))
Being even more pedantic it would be better to convert back to Int before comparing so that you don't rely on the sort order of the characters representing digits being the same as the sort order for the digits themselves.
def tidy_number(n:Int) =
(" "+n).sliding(2,1).forall(p => p(0).toInt <= p(1).toInt)
The only way I could do it in Scala was by converting an integer to a string to a list and then sorting the list and going back again. My question: is there a better way?
String conversion and sorting is not required.
def tidy_number(n :Int) :Boolean =
Iterator.iterate(n)(_/10)
.takeWhile(_>0)
.map(_%10)
.sliding(2)
.forall{case Seq(a,b) => a >= b
case _ => true} //is a single digit "tidy"?
Better? Hard to say. One minor advantage is that the math operations (/ and %) stop after the first false. So for the Int value 1234567890 there are only 3 ops (2 modulo and 1 division) before the result is returned. (Digits are compared in reverse order.)
But seeing as an Int is, at most, only 10 digits long, I'd go with Joel Berkeley's answer just for its brevity.
def tidyNum(n:Int) = {
var t =n; var b = true;
while(b && t!=0){b=t%10>=(t/10)%10;t=t/10};b
}
In Scala REPL:
scala> tidyNum(12345678)
res20: Boolean = true
scala> tidyNum(12343678)
res21: Boolean = false
scala> tidyNum(12)
res22: Boolean = true
scala> tidyNum(1)
res23: Boolean = true
scala> tidyNum(121)
res24: Boolean = false
This can work for any size with just the modification of type such as Long or BigInt in function signature like tidyNum(n:Long) or tidyNum(n:BigInt) as below:
def tidyNum(n:BigInt) = {
var t =n; var b = true;
while(b && t!=0){b=t%10>=(t/10)%10;t=t/10};b
}
scala> tidyNum(BigInt("123456789"))
res26: Boolean = true
scala> tidyNum(BigInt("1234555555555555555555555567890"))
res29: Boolean = false
scala> tidyNum(BigInt("123455555555555555555555556789"))
res30: Boolean = true
And the BigInt can be used for types Int, Long as well.
I use code like this to find data values for my calculations:
sub get_data {
$x =0 if($_[1] eq "A"); #get column number by name
$data{'A'}= [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12];
return $data{$_[0]}[$x];
}
Data is stored like this in Perl file. I plan no more than 100 columns. Then to get value I call:
get_data(column, row);
Now I realized that that is terribly slow way to look up data in table. How can I do it faster? SQL?
Looking at your github code, the main problem you have is that your
big hash of arrays is initialized every time the function is called.
Your current code:
my #atom;
# {'name'}= radius, depth, solvation_parameter, volume, covalent_radius, hydrophobic, H_acceptor, MW
$atom{'C'}= [2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12];
$atom{'A'}= [2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, ''];
$atom{'N'}= [1.75000, 0.16000, -0.00162, 22.44930, 0.75, 0, 1, 14];
$atom{'O'}= [1.60000, 0.20000, -0.00251, 17.15730, 0.73, 0, 1, 16];
...
Time taken for your test case on the slow netbook I'm typing this on: 6m24.400s.
The most important thing to do is to move this out of the function, so it's
initialized only once, when the module is loaded.
Time taken after this simple change: 1m20.714s.
But since I'm making suggestions, you could write it more legibly:
my %atom = (
C => [ 2.00000, 0.15000, -0.00143, 33.51030, 0.77, 1, 0, 12 ],
A => [ 2.00000, 0.15000, -0.00052, 33.51030, 0.77, 0, 0, '' ],
...
);
Note that %atom is a hash in both cases, so your code doesn't do what you
were imagining: it declares a lexically-scoped array #atom, which is unused, then proceeds to fill up an unrelated global variable %atom. (Also do you really want an empty string for MW of A? And what kind of atom is A anyway?)
Secondly, your name-to-array-index mapping is also slow. Current code:
#take correct value from data table
$x = 0 if($_[1] eq "radius");
$x = 1 if($_[1] eq "depth");
$x = 2 if($_[1] eq "solvation_parameter");
$x = 3 if($_[1] eq "volume");
$x = 4 if($_[1] eq "covalent_radius");
$x = 5 if($_[1] eq "hydrophobic");
$x = 6 if($_[1] eq "H_acceptor");
$x = 7 if($_[1] eq "MW");
This is much better done as a hash (again, initialized outside the function):
my %index = (
radius => 0,
depth => 1,
solvation_parameter => 2,
volume => 3,
covalent_radius => 4,
hydrophobic => 5,
H_acceptor => 6,
MW => 7
);
Or you could be snazzy if you wanted:
my %index = map { [qw[radius depth solvation_parameter volume
covalent_radius hydrophobic H_acceptor MW
]]->[$_] => $_ } 0..7;
Either way, the code inside the function is then simply:
$x = $index{$_[1]};
Time now: 1m13.449s.
Another approach is just to define your field numbers as constants.
Constants are capitalized by convention:
use constant RADIUS=>0, DEPTH=>1, ...;
Then the code in the function is
$x = $_[1];
and you then need to call the function using the constants instead of strings:
get_atom_parameter('C', RADIUS);
I haven't tried this.
But stepping back a bit and looking at how you are using this function:
while($ligand_atom[$x]{'atom_type'}[0]) {
print STDERR $ligand_atom[$x]{'atom_type'}[0];
$y=0;
while($protein_atom[$y]) {
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
- get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
- get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
$y++;
}
$x++;
print STDERR ".";
}
Each time through the loop you are calling get_atom_parameter twice to
retrieve the radius.
But for the inner loop, one atom is constant throughout. So hoist the call
to get_atom_parameter out of the inner loop, and you've almost halved the
number of calls:
while($ligand_atom[$x]{'atom_type'}[0]) {
print STDERR $ligand_atom[$x]{'atom_type'}[0];
$y=0;
my $lig_radius = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
while($protein_atom[$y]) {
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
- $lig_radius
- get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
$y++;
}
$x++;
print STDERR ".";
}
But there's more. In your test case the ligand has 35 atoms and the
protein 4128 atoms. This means that your initial code made
4128*35*2 = 288960 calls to get_atom_parameter, and while now it's
only 4128*35 + 35 = 144515 calls, it's easy to just make some arrays with
the radii so that it's only 4128 + 35 = 4163 calls:
my $protein_size = $#protein_atom;
my $ligand_size;
{
my $x=0;
$x++ while($ligand_atom[$x]{'atom_type'}[0]);
$ligand_size = $x-1;
}
#print STDERR "protein_size = $protein_size, ligand_size = $ligand_size\n";
my #protein_radius;
for my $y (0..$protein_size) {
$protein_radius[$y] = get_atom_parameter::get_atom_parameter($protein_atom[$y]{'atom_type'}[0], 'radius');
}
my #lig_radius;
for my $x (0..$ligand_size) {
$lig_radius[$x] = get_atom_parameter::get_atom_parameter($ligand_atom[$x]{'atom_type'}[0], 'radius');
}
for my $x (0..$ligand_size) {
print STDERR $ligand_atom[$x]{'atom_type'}[0];
my $lig_radius = $lig_radius[$x];
for my $y (0..$protein_size) {
$d[$x][$y] = sqrt(distance_sqared($ligand_atom[$x],$protein_atom[$y]))
- $lig_radius
- $protein_radius[$y]
}
print STDERR ".";
}
And finally, the call to distance_sqared [sic]:
#distance between atoms
sub distance_sqared {
my $dxs = ($_[0]{'x'}-$_[1]{'x'})**2;
my $dys = ($_[0]{'y'}-$_[1]{'y'})**2;
my $dzs = ($_[0]{'z'}-$_[1]{'z'})**2;
return $dxs+$dys+$dzs;
}
This function can usefully be replaced with the following, which uses
multiplication instead of **.
sub distance_sqared {
my $dxs = ($_[0]{'x'}-$_[1]{'x'});
my $dys = ($_[0]{'y'}-$_[1]{'y'});
my $dzs = ($_[0]{'z'}-$_[1]{'z'});
return $dxs*$dxs+$dys*$dys+$dzs*$dzs;
}
Time after all these modifications: 0m53.639s.
More about **: elsewhere you declare
use constant e_math => 2.71828;
and use it thus:
$Gauss1 += e_math ** (-(($d[$x][$y]*2)**2));
The built-in function exp() calculates this for you (in fact, ** is commonly
implemented as x**y = exp(log(x)*y), so each time you are doing this you are
performing an unnecessary logarithm the result of which is just slightly less
than 1 as your constant is only accurate to 6 d.p.). This change would alter
the output very slightly. And again, **2 should be replaced by multiplication.
Anyway, this answer is probably long enough for now, and calculation of d[]
is no longer the bottleneck it was.
Summary: hoist constant values out of loops and functions! Calculating the
same thing repeatedly is no fun at all.
Using any kind of database for this would not help your performance in the
slightest. One thing that might help you though is Inline::C. Perl is
not really built for this kind of intensive computation, and Inline::C
would allow you to easily move performance-critical bits into C while
keeping your existing I/O in Perl.
I would be willing to take a shot at a partial C port. How stable
is this code, and how fast do you want it to be? :)
Putting this in a DB will make it MUCH easier to maintain, scale, expand, etc.... Using a DB can also save you a lot of RAM -- it gets and stores in RAM only the desired result instead of storing ALL values.
With regards to speed it depends. With a text file you take a long time to read all the values into RAM, but once it is loaded, retrieving the values is super fast, faster than querying a DB.
So it depends on how your program is written and what it is for. Do you read all the values ONCE and then run 1000 queries? The TXT file way is probably faster. Do you read all the values every time you make a query (to make sure you have the latest value set) -- then the DB would be faster. Do you 1 query/day? use a DB. etc......
I have a Validation object
val v = Validation[String, Option[Int]]
I need to make a second validation, to check if actual Integer value is equals to 100 for example. If I do
val vv = v.map(_.map(intValue => if (intValue == 100)
intValue.success[String]
else
"Bad value found".fail[Integer]))
I get:
Validation[String, Option[Validation[String, Int]]]
How is it possible to get vv also as Validation[String, Option[Int]] in most concise way
=========
Found possible solution from my own:
val validation: Validation[String, Option[Int]] = Some(100).success[String]
val validatedTwice: Validation[String, Option[Int]] = validation.fold(
_ => validation, // if Failure then return it
_.map(validateValue _) getOrElse validation // validate Successful result
)
def validateValue(value: Int): Validation[String, Option[Int]] = {
if (value == 100)
Some(value).success[String]
else
"Bad value".fail[Option[Int]]
}
Looks not concise and elegant although it works
==============
Second solution from my own, but also looks over-compicated:
val validatedTwice2: Validation[String, Option[Int]] = validation.flatMap(
_.map(validateValue _).map(_.map(Some(_))) getOrElse validation)
def validateValue(value: Int): Validation[String, Int] = {
if (value == 100)
value.success[String]
else
"Bad value".fail[Int]
}
Your solution is over-complicated. The following will suffice!
v flatMap (_.filter(_ == 100).toSuccess("Bad value found"))
The toSuccess comes from OptionW and converts an Option[A] into a Validation[X, A] taking the value provided for the failure case in the event that the option is empty. The flatMap works like this:
Validation[X, A]
=> (A => Validation[X, B])
=> (via flatMap) Validation[X, B]
That is, flatMap maps and then flattens (join in scalaz-parlance):
Validation[X, A]
=> (A => Validation[X, B]]
=> (via map) Validation[X, Validation[X, B]]
=> (via join) Validation[X, B]
First, let's set up some type aliases because typing this out repeatedly will get old pretty fast. We'll tidy up your validation logic a little too while we're here.
type V[X] = Validation[String, X]
type O[X] = Option[X]
def checkInt(i: Int): V[Int] = Validation.fromEither(i != 100 either "Bad value found" or i)
val v: V[O[Int]] = _
this is where we're starting out - b1 is equivalent to your vv situation
val b1: V[O[V[Int]]] = v.map(_.map(checkInt))
so let's sequence the option to flip over the V[O[V[Int]]] into a V[V[O[Int]]]
val b2: V[V[O[Int]]] = v.map(_.map(checkInt)).map(_.sequence[V, Int])
or if you're feeling lambda-y it could have been
sequence[({type l[x] = Validation[String, x]})#l, Int]
next we flatten out that nested validation - we're going to pull in the Validation monad because we actually do want the fastfail behaviour here, although it's generally not the right thing to do.
implicit val monad = Validation.validationMonad[String]
val b3: V[O[Int]] = v.map(_.map(checkInt)).map(_.sequence[V, Int]).join
So now we've got a Validation[String, Option[Int]], so we're there, but this is still pretty messy. Lets use some equational reasoning to tidy it up
By the second functor law we know that:
X.map(_.f).map(_.g) = X.map(_.f.g) =>
val i1: V[O[Int]] = v.map(_.map(checkInt).sequence[V, Int]).join
and by the definition of a monad:
X.map(f).join = X.flatMap(f) =>
val i2: V[O[Int]] = v.flatMap(_.map(checkInt).sequence[V, Int])
and then we apply the free theorem of traversal:
(I struggled with that bloody paper so much, but it looks like some of it sunk in!):
X.map(f).sequence = X.traverse(f andThen identity) = X.traverse(f) =>
val i3: V[O[Int]] = v.flatMap(_.traverse[V, Int](checkInt))
so now we're looking at something a bit more civilised. I imagine there's some trickery to be played with the flatMap and traverse, but I've run out of inspiration.
Use flatMap, like so:
v.flatMap(_.parseInt.fail.map(_.getMessage).validation)
I am new to Scala and am trying to get a list of random double values:
The thing is, when I try to run this, it takes way too long compared to its Java counterpart. Any ideas on why this is or a suggestion on a more efficient approach?
def random: Double = java.lang.Math.random()
var f = List(0.0)
for (i <- 1 to 200000)
( f = f ::: List(random*100))
f = f.tail
You can also achieve it like this:
List.fill(200000)(math.random)
the same goes for e.g. Array ...
Array.fill(200000)(math.random)
etc ...
You could construct an infinite stream of random doubles:
def randomList(): Stream[Double] = Stream.cons(math.random, randomList)
val f = randomList().take(200000)
This will leverage lazy evaluation so you won't calculate a value until you actually need it. Even evaluating all 200,000 will be fast though. As an added bonus, f no longer needs to be a var.
Another possibility is:
val it = Iterator.continually(math.random)
it.take(200000).toList
Stream also has a continually method if you prefer.
First of all, it is not taking longer than java because there is no java counterpart. Java does not have an immutable list. If it did, performance would be about the same.
Second, its taking a lot of time because appending lists have linear performance, so the whole thing has quadratic performance.
Instead of appending, prepend, which had constant performance.
if your using mutable state anyways you should use a mutable collection like buffer which you can add too with += (which then would be the real counterpart to java code).
but why dont u use list comprehension?
val f = for (_ <- 1 to 200000) yield (math.random * 100)
by the way: var f = List(0.0) ... f = f.tail can be replaced by var f: List[Double] = Nil in your example. (no more performance but more beauty ;)
Yet more options! Tail recursion:
def randlist(n: Int, part: List[Double] = Nil): List[Double] = {
if (n<=0) part
else randlist(n-1, 100*random :: part)
}
or mapped ranges:
(1 to 200000).map(_ => 100*random).toList
Looks like you want to use Vector instead of List. List has O(1) prepend, Vector has O(1) append. Since you are appending, but using concatenation, it'll be faster to use Vector:
def random: Double = java.lang.Math.random()
var f: Vector[Double] = Vector()
for (i <- 1 to 200000)
f = f :+ (random*100)
Got it?