Generate a list of primes up to a certain number - algorithm

I'm trying to generate a list of primes below 1 billion. I'm trying this, but this kind of structure is pretty awful. Any suggestions?
a <- 1:1000000000
d <- 0
b <- for (i in a) {for (j in 1:i) {if (i %% j !=0) {d <- c(d,i)}}}

That sieve posted by George Dontas is a good starting point. Here's a much faster version with running times for 1e6 primes of 0.095s as opposed to 30s for the original version.
sieve <- function(n)
{
  n <- as.integer(n)
  if (n > 1e8) stop("n too large")
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  fsqr <- floor(sqrt(n))
  while (last.prime <= fsqr)
  {
    primes[seq.int(2L * last.prime, n, last.prime)] <- FALSE
    sel <- which(primes[(last.prime + 1):(fsqr + 1)])
    if (any(sel)) {
      last.prime <- last.prime + min(sel)
    } else {
      last.prime <- fsqr + 1
    }
  }
  which(primes)
}
Here are some alternative algorithms coded about as fast as possible in R. They are slower than the sieve but a heck of a lot faster than the questioner's original post.
Here's a recursive function that uses mod but is vectorized. It returns almost instantaneously for 1e5 and takes under 2 s for 1e6.
primes <- function(n) {
  primesR <- function(p, i = 1) {
    f <- p %% p[i] == 0 & p != p[i]
    if (any(f)) {
      p <- primesR(p[!f], i + 1)
    }
    p
  }
  primesR(2:n)
}
The next one isn't recursive and is faster again. The code below finds primes up to 1e6 in about 1.5 s on my machine.
primest <- function(n) {
  p <- 2:n
  i <- 1
  while (p[i] <= sqrt(n)) {
    p <- p[p %% p[i] != 0 | p == p[i]]
    i <- i + 1
  }
  p
}
BTW, the spuRs package has a number of prime-finding functions, including a sieve of Eratosthenes. I haven't checked what their speed is like.
And while I'm writing a very long answer... here's how you'd check in R if one value is prime.
isPrime <- function(x) {
  div <- 2:ceiling(sqrt(x))
  !any(x %% div == 0)
}
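One corner case worth noting (a later answer below patches it): because the trial divisors run from 2 up to ceiling(sqrt(x)), this version misclassifies 2 itself.
isPrime(2) # FALSE -- wrong: 2:ceiling(sqrt(2)) is just 2, which divides 2
isPrime(7) # TRUE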

This is an implementation of the Sieve of Eratosthenes algorithm in R.
sieve <- function(n)
{
  n <- as.integer(n)
  if (n > 1e6) stop("n too large")
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  last.prime <- 2L
  for (i in last.prime:floor(sqrt(n)))
  {
    primes[seq.int(2L * last.prime, n, last.prime)] <- FALSE
    last.prime <- last.prime + min(which(primes[(last.prime + 1):n]))
  }
  which(primes)
}
sieve(1000000)

Prime Numbers in R
The OP asked to generate all prime numbers below one billion. All of the answers provided thus far are either not capable of doing this, will take a long time to execute, or are currently not available in R (see the answer by @Charles). The package RcppAlgos (I am the author) is capable of generating the requested output in just over 1 second using only one thread. It is based on the segmented sieve of Eratosthenes by Kim Walisch.
RcppAlgos
library(RcppAlgos)
system.time(primeSieve(1e9)) ## using 1 thread
user system elapsed
1.099 0.077 1.176
Using Multiple Threads
And in recent versions (i.e. >= 2.3.0), we can utilize multiple threads for even faster generation. For example, now we can generate the primes up to 1 billion in under half a second!
system.time(primeSieve(10^9, nThreads = 8))
user system elapsed
2.046 0.048 0.375
Summary of Available Packages in R for Generating Primes
library(schoolmath)
library(primefactr)
library(sfsmisc)
library(primes)
library(numbers)
library(spuRs)
library(randtoolbox)
library(matlab)
## and 'sieve' from @John
Before we begin, we note that the problems pointed out by @Henrik in schoolmath still exist. Observe:
## 1 is NOT a prime number
schoolmath::primes(start = 1, end = 20)
[1] 1 2 3 5 7 11 13 17 19
## This should return 1, however it is saying that 52
## "prime" numbers less than 10^4 are divisible by 7!!
sum(schoolmath::primes(start = 1, end = 10^4) %% 7L == 0)
[1] 52
The point is, don't use schoolmath for generating primes at this point (no offense to the author... In fact, I have filed an issue with the maintainer).
Let's look at randtoolbox as it appears to be incredibly efficient. Observe:
library(microbenchmark)
## the argument for get.primes is for how many prime numbers you need
## whereas most packages get all primes less than a certain number
microbenchmark(priRandtoolbox = get.primes(78498),
priRcppAlgos = RcppAlgos::primeSieve(10^6), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
priRandtoolbox 1.00000 1.00000 1.000000 1.000000 1.000000 1.0000000 100
priRcppAlgos 12.79832 12.55065 6.493295 7.355044 7.363331 0.3490306 100
A closer look reveals that it is essentially a lookup table (found in the file randtoolbox.c from the source code).
#include "primes.h"

void reconstruct_primes()
{
    int i;
    if (primeNumber[2] == 1)
        for (i = 2; i < 100000; i++)
            primeNumber[i] = primeNumber[i-1] + 2*primeNumber[i];
}
Where primes.h is a header file that contains an array of "halves of differences between prime numbers". Thus, you will be limited by the number of elements in that array for generating primes (i.e. the first one hundred thousand primes). If you are only working with smaller primes (less than 1,299,709 (i.e. the 100,000th prime)) and you are working on a project that requires the nth prime, randtoolbox is the way to go.
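To make the storage trick concrete, here is a tiny illustration (the vectors below are made up; they are not the actual contents of primes.h): store halves of the gaps between consecutive odd primes and rebuild the primes with a cumulative sum.
oddPrimes <- c(3, 5, 7, 11, 13, 17, 19, 23, 29)
halfGaps <- diff(oddPrimes) / 2 # 1 1 2 1 2 1 2 3
rebuilt <- cumsum(c(3, 2 * halfGaps)) # recovers oddPrimes
identical(rebuilt, oddPrimes) # TRUE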
Below, we perform benchmarks on the rest of the packages.
Primes up to One Million
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^6),
priNumbers = numbers::Primes(10^6),
priSpuRs = spuRs::primesieve(c(), 2:10^6),
priPrimes = primes::generate_primes(1, 10^6),
priPrimefactr = primefactr::AllPrimesUpTo(10^6),
priSfsmisc = sfsmisc::primes(10^6),
priMatlab = matlab::primes(10^6),
priJohnSieve = sieve(10^6),
unit = "relative")
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.000000 1.00000 1.00000 1.000000 1.00000 1.00000 100
priNumbers 21.550402 23.19917 26.67230 23.140031 24.56783 53.58169 100
priSpuRs 232.682764 223.35847 233.65760 235.924538 236.09220 212.17140 100
priPrimes 46.591868 43.64566 40.72524 39.106107 39.60530 36.47959 100
priPrimefactr 39.609560 40.58511 42.64926 37.835497 38.89907 65.00466 100
priSfsmisc 9.271614 10.68997 12.38100 9.761438 11.97680 38.12275 100
priMatlab 21.756936 24.39900 27.08800 23.433433 24.85569 49.80532 100
priJohnSieve 10.630835 11.46217 12.55619 10.792553 13.30264 38.99460 100
Primes up to Ten Million
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^7),
priNumbers = numbers::Primes(10^7),
priSpuRs = spuRs::primesieve(c(), 2:10^7),
priPrimes = primes::generate_primes(1, 10^7),
priPrimefactr = primefactr::AllPrimesUpTo(10^7),
priSfsmisc = sfsmisc::primes(10^7),
priMatlab = matlab::primes(10^7),
priJohnSieve = sieve(10^7),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 20
priNumbers 30.57896 28.91780 31.26486 30.47751 29.81762 40.43611 20
priSpuRs 533.99400 497.20484 490.39989 494.89262 473.16314 470.87654 20
priPrimes 125.04440 114.71349 112.30075 113.54464 107.92360 103.74659 20
priPrimefactr 52.03477 50.32676 52.28153 51.72503 52.32880 59.55558 20
priSfsmisc 16.89114 16.44673 17.48093 16.64139 18.07987 22.88660 20
priMatlab 30.13476 28.30881 31.70260 30.73251 32.92625 41.21350 20
priJohnSieve 18.25245 17.95183 19.08338 17.92877 18.35414 32.57675 20
Primes up to One Hundred Million
For the next two benchmarks, we only consider RcppAlgos, numbers, sfsmisc, matlab, and the sieve function by #John.
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^8),
priNumbers = numbers::Primes(10^8),
priSfsmisc = sfsmisc::primes(10^8),
priMatlab = matlab::primes(10^8),
priJohnSieve = sieve(10^8),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 20
priNumbers 35.64097 33.75777 32.83526 32.25151 31.74193 31.95457 20
priSfsmisc 21.68673 20.47128 20.01984 19.65887 19.43016 19.51961 20
priMatlab 35.34738 33.55789 32.67803 32.21343 31.56551 31.65399 20
priJohnSieve 23.28720 22.19674 21.64982 21.27136 20.95323 21.31737 20
Primes up to One Billion
N.B. We must remove the condition if(n > 1e8) stop("n too large") in the sieve function.
## See top section
## system.time(primeSieve(10^9))
## user system elapsed
## 1.099 0.077 1.176 ## RcppAlgos single-threaded
## gc()
system.time(matlab::primes(10^9))
user system elapsed
31.780 12.456 45.549 ## ~39x slower than RcppAlgos
## gc()
system.time(numbers::Primes(10^9))
user system elapsed
32.252 9.257 41.441 ## ~35x slower than RcppAlgos
## gc()
system.time(sieve(10^9))
user system elapsed
26.266 3.906 30.201 ## ~26x slower than RcppAlgos
## gc()
system.time(sfsmisc::primes(10^9))
user system elapsed
24.292 3.389 27.710 ## ~24x slower than RcppAlgos
From these comparisons, we see that RcppAlgos scales much better as n gets larger.
 ___________________________________________________
|           |   1e6   |   1e7   |   1e8   |   1e9   |
|-----------|---------|---------|---------|---------|
| RcppAlgos |    1.00 |    1.00 |    1.00 |    1.00 |
| sfsmisc   |    9.76 |   16.64 |   19.66 |   23.56 |
| JohnSieve |   10.79 |   17.93 |   21.27 |   25.68 |
| numbers   |   23.14 |   30.48 |   32.25 |   34.86 |
| matlab    |   23.43 |   30.73 |   32.21 |   38.73 |
 ---------------------------------------------------
The difference is even more dramatic when we utilize multiple threads:
microbenchmark(ser = primeSieve(1e6),
par = primeSieve(1e6, nThreads = 8), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
ser 1.741342 1.492707 1.481546 1.512804 1.432601 1.275733 100
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
microbenchmark(ser = primeSieve(1e7),
par = primeSieve(1e7, nThreads = 8), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
ser 2.632054 2.50671 2.405262 2.418097 2.306008 2.246153 100
par 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 100
microbenchmark(ser = primeSieve(1e8),
par = primeSieve(1e8, nThreads = 8), unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
ser 2.914836 2.850347 2.761313 2.709214 2.755683 2.438048 20
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
microbenchmark(ser = primeSieve(1e9),
par = primeSieve(1e9, nThreads = 8), unit = "relative", times = 10)
Unit: relative
expr min lq mean median uq max neval
ser 3.081841 2.999521 2.980076 2.987556 2.961563 2.841023 10
par 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
And multiplying the table above by the respective median times for the serial results:
 _______________________________________________________
|               |   1e6   |   1e7   |   1e8   |   1e9   |
|---------------|---------|---------|---------|---------|
| RcppAlgos-Par |    1.00 |    1.00 |    1.00 |    1.00 |
| RcppAlgos-Ser |    1.51 |    2.42 |    2.71 |    2.99 |
| sfsmisc       |   14.76 |   40.24 |   53.26 |   70.39 |
| JohnSieve     |   16.32 |   43.36 |   57.62 |   76.72 |
| numbers       |   35.01 |   73.70 |   87.37 |  104.15 |
| matlab        |   35.44 |   74.31 |   87.26 |  115.71 |
 -------------------------------------------------------
Primes Over a Range
microbenchmark(priRcppAlgos = RcppAlgos::primeSieve(10^9, 10^9 + 10^6),
priNumbers = numbers::Primes(10^9, 10^9 + 10^6),
priPrimes = primes::generate_primes(10^9, 10^9 + 10^6),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
priRcppAlgos 1.0000 1.0000 1.000 1.0000 1.0000 1.0000 20
priNumbers 115.3000 112.1195 106.295 110.3327 104.9106 81.6943 20
priPrimes 983.7902 948.4493 890.243 919.4345 867.5775 708.9603 20
Primes up to 10 billion in Under 6 Seconds
## primes less than 10 billion
system.time(tenBillion <- RcppAlgos::primeSieve(10^10, nThreads = 8))
user system elapsed
26.077 2.063 5.602
length(tenBillion)
[1] 455052511
## Warning!!!... Large object created
tenBillionSize <- object.size(tenBillion)
print(tenBillionSize, units = "Gb")
3.4 Gb
Primes Over a Range of Very Large Numbers:
Prior to version 2.3.0, we simply used the same algorithm for numbers of every magnitude. This is okay for smaller numbers, when most of the sieving primes have at least one multiple in each segment (generally, the segment size is limited by the size of the L1 cache, ~32 KiB). However, when we are dealing with larger numbers, the sieving primes will contain many numbers that have fewer than one multiple per segment. This situation creates a lot of overhead, as we perform many worthless checks that pollute the cache. Thus, we observe much slower generation of primes when the numbers are very large. Observe for version 2.2.0 (see Installing an older version of an R package):
## Install version 2.2.0
## packageurl <- "http://cran.r-project.org/src/contrib/Archive/RcppAlgos/RcppAlgos_2.2.0.tar.gz"
## install.packages(packageurl, repos=NULL, type="source")
system.time(old <- RcppAlgos::primeSieve(1e15, 1e15 + 1e9))
user system elapsed
7.932 0.134 8.067
And now, using the cache-friendly improvement originally developed by Tomás Oliveira e Silva, we see drastic improvements:
## Reinstall current version from CRAN
## install.packages("RcppAlgos"); library(RcppAlgos)
system.time(cacheFriendly <- primeSieve(1e15, 1e15 + 1e9))
user system elapsed
2.258 0.166 2.424 ## Over 3x faster than older versions
system.time(primeSieve(1e15, 1e15 + 1e9, nThreads = 8))
user system elapsed
4.852 0.780 0.911 ## Over 8x faster using multiple threads
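To make the idea of sieving in segments concrete, here is a minimal base R sketch (illustrative only; the real implementation is compiled C++ with cache-sized segments and many further optimizations, and the function name segSieve and the segment size below are my own choices):
segSieve <- function(lo, hi, segSize = 32768) {
  ## sieving primes up to sqrt(hi), from a plain sieve (assumes hi >= 16)
  base <- local({
    n <- floor(sqrt(hi))
    s <- rep(TRUE, n)
    s[1] <- FALSE
    for (i in 2:floor(sqrt(n))) if (s[i]) s[seq(i * i, n, i)] <- FALSE
    which(s)
  })
  out <- integer(0)
  for (start in seq(max(lo, 2), hi, by = segSize)) {
    end <- min(start + segSize - 1, hi)
    isP <- rep(TRUE, end - start + 1)
    for (p in base) {
      ## first multiple of p inside [start, end] that is not p itself
      first <- max(p * p, p * ceiling(start / p))
      if (first <= end) isP[seq(first, end, by = p) - start + 1] <- FALSE
    }
    out <- c(out, which(isP) + start - 1)
  }
  out
}
all(segSieve(2, 1e6) == RcppAlgos::primeSieve(1e6)) # TRUE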
Take Away
1. There are many great packages available for generating primes.
2. If you are looking for speed in general, there is no match for RcppAlgos::primeSieve, especially for larger numbers.
3. If you are working with small primes, look no further than randtoolbox::get.primes.
4. If you need primes in a range, the packages numbers, primes, & RcppAlgos are the way to go.
5. The importance of good programming practices cannot be overemphasized (e.g. vectorization, using correct data types, etc.). This is most aptly demonstrated by the pure base R solution provided by @John. It is concise, clear, and very efficient.

The best way that I know of to generate all primes (without getting into crazy math) is to use the Sieve of Eratosthenes.
It is pretty straightforward to implement and lets you calculate primes without using division or modulus. The only downside is that it is memory intensive, but various optimizations can be made to reduce the memory use (ignoring all even numbers, for instance, as sketched below).
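Here is a sketch of that even-skipping optimization (my own toy implementation, not taken from any package): index i of the logical vector stands for the odd number 2*i + 1, so the array is half the size.
oddSieve <- function(n) {
  stopifnot(n >= 3)
  m <- (n - 1) %/% 2 # odd[i] represents the odd number 2*i + 1
  odd <- rep(TRUE, m)
  i <- 1
  while ((2 * i + 1)^2 <= n) {
    if (odd[i]) {
      p <- 2 * i + 1
      ## strike out p^2, p^2 + 2p, ...; consecutive odd multiples of p
      ## sit p indices apart in this half-size array
      odd[seq((p * p - 1) / 2, m, by = p)] <- FALSE
    }
    i <- i + 1
  }
  c(2, 2 * which(odd) + 1)
}
oddSieve(30) # 2 3 5 7 11 13 17 19 23 29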

This method should be faster and simpler.
allPrime <- function(n) {
  primes <- rep(TRUE, n)
  primes[1] <- FALSE
  for (i in 1:sqrt(n)) {
    if (primes[i]) primes[seq(i^2, n, i)] <- FALSE
  }
  which(primes)
}
0.12 seconds on my computer for n = 1e6.
I implemented this in function AllPrimesUpTo in package primefactr.

I recommend primegen, Dan Bernstein's implementation of the Atkin-Bernstein sieve. It's very fast and will scale well to other problems. You'll need to pass data out to the program to use it, but I imagine there are ways to do that?

You can also cheat and use the primes() function in the schoolmath package :D

The isPrime() function posted above could use sieve(): one only needs to check whether any of the primes below ceiling(sqrt(x)) divide x with no remainder. The cases 1 and 2 also need to be handled.
isPrime <- function(x) {
  div <- sieve(ceiling(sqrt(x)))
  (x > 1) & ((x == 2) | !any(x %% div == 0))
}
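A quick check of the corner cases mentioned above (using either sieve() definition from earlier in this thread):
sapply(c(1, 2, 3, 4, 25, 97), isPrime) # FALSE TRUE TRUE FALSE FALSE TRUE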

No suggestions, but allow me an extended comment of sorts. I ran this experiment with the following code:
get_primes <- function(n_min, n_max) {
  options(scipen = 999)
  result <- vector()
  for (x in seq(max(n_min, 2), n_max)) {
    has_factor <- FALSE
    upper <- floor(sqrt(x))
    if (upper >= 2) { # only trial-divide when a divisor candidate exists (handles x = 2 and 3)
      for (p in 2:upper) {
        if (x %% p == 0) has_factor <- TRUE
        if (has_factor) break
      }
    }
    if (!has_factor) result <- c(result, x)
  }
  result
}
and after almost 24 hours of uninterrupted computation, I got a list of 5,245,897 primes. Since π(1,000,000,000) = 50,847,534, it would have taken roughly 10 days to complete this calculation.
Here is the file of these first ~ 5 million prime numbers.

for (i in 2:1000) {
  a <- 2:(i - 1)
  b <- as.matrix(i %% a)
  c <- colSums(b != 0)
  if (c == i - 2) {
    print(i)
  }
}

Each candidate number i is checked against the list n of primes already found while checking the numbers before it. Thanks for the suggestions:
prime <- function(a, n) {
  n <- c(2)
  i <- 3
  while (i <= a) {
    r <- 0 # initialized before the loop so an empty divisor list counts as prime
    for (j in n[n <= sqrt(i)]) {
      if (i %% j == 0) {
        r <- 1
      }
      if (r == 1) break
    }
    if (r != 1) n <- c(n, i)
    i <- i + 2
  }
  print(n)
}

Related

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty())?

I am working on a dataset to test the association between empirical antibiotics (variable emp, the antibiotics are cefuroxime or ceftriaxone compared with a reference antibiotic) and 30-day mortality (variable mort30). The data comes from patients admitted in 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients on hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals, there were violations of the proportional hazards assumption, so I tried adding a time transformation (tt) to the model. Unfortunately, coxme() does not offer the possibility of time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and with cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than using coxme() or frailty().
**Does anyone know what the explanation for this is, and which option would provide the most reliable estimates?**
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>

A variant of the Knapsack algorithm

I have a list of items, a, b, c,..., each of which has a weight and a value.
The 'ordinary' Knapsack algorithm will find the selection of items that maximises the value of the selected items, whilst ensuring that the weight is below a given constraint.
The problem I have is slightly different. I wish to minimise the value (easy enough by using the reciprocal of the value), whilst ensuring that the weight is at least the value of the given constraint, not less than or equal to the constraint.
I have tried re-routing the idea through the ordinary Knapsack algorithm, but this can't be done. I was hoping there is another combinatorial algorithm that I am not aware of that does this.
In the German wiki it's formalized as:
finite set of objects U
w: weight-function
v: value-function
w: U -> R
v: U -> R
B in R # constraint rhs
Find a subset K of U subject to:
sum( w(u) : u in K ) <= B
such that:
sum( v(u) : u in K ) is maximal
So there is no restriction like nonnegativity.
Just use negative weights, negative values and a negative B.
The basic concept is:
sum( w(u) : u in K ) <= B
<->
-sum( w(u) : u in K ) >= -B
So in your case:
classic constraint:  x0 + x1 <= B  |  3 + 7 <= 12 Y  |  3 + 10 <= 12 N
becomes:            -x0 - x1 <= -B | -3 - 7 <= -12 N | -3 - 10 <= -12 Y
So for a given implementation, it depends on the software whether this is allowed. In terms of the optimization problem, there is no issue. The integer-programming formulation for your case is as natural as the classic one (and bounded).
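For a quick R-side illustration of the same trick (the demo below uses Python/cylp; this sketch instead uses the lpSolve package, with the instance taken from the demo's output): minimise total value subject to the total weight being at least the capacity.
library(lpSolve)
weight <- c(37, 43, 12, 8, 9)
value <- c(11, 5, 15, 0, 16)
capacity <- 50
sol <- lp(direction = "min",              # minimise total value ...
          objective.in = value,
          const.mat = matrix(weight, nrow = 1),
          const.dir = ">=",               # ... subject to covering the capacity
          const.rhs = capacity,
          all.bin = TRUE)                 # 0-1 knapsack
sol$solution # 0 1 0 1 0 -> pick items 2 and 4
sum(sol$solution * weight) # 51 (>= 50)
sol$objval # 5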
Python Demo based on Integer-Programming
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex

np.random.seed(1)

""" INSTANCE """
weight = np.random.randint(50, size=5)
value = np.random.randint(50, size=5)
capacity = 50

""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = value                        # MODIFICATION: default = minimize!
model += sp.eye(n) * x >= np.zeros(n)          # could be improved
model += sp.eye(n) * x <= np.ones(n)           # """
model += np.matrix(-weight) * x <= -capacity   # MODIFICATION
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)  # assumes existence

print("INSTANCE")
print(" weights: ", weight)
print(" values: ", value)
print(" capacity: ", capacity)
print("Solution")
print(x_sol)
print("sum weight: ", x_sol.dot(weight))
print("value: ", x_sol.dot(value))
Small remarks
This code is just a demo using a somewhat low-level library; there are other tools available which might be better suited (e.g. on Windows: pulp).
It's the classic integer-programming formulation from the wiki, modified as mentioned above.
It will scale very well, as the underlying solver is pretty good.
As written, it's solving the 0-1 knapsack (only the variable bounds would need to be changed for other variants).
Small look at the core-code:
# create model
model = CyClpSimplex()
# create one variable for each how-often-do-i-pick-this-item decision
# variable needs to be integer (or binary for 0-1 knapsack)
x = model.addVariable('x', n, isInt=True)
# the objective value of our IP: a linear-function
# cylp only needs the coefficients of this function: c0*x0 + c1*x1 + c2*x2...
# we only need our value vector
model.objective = value # MODIFICATION: default = minimize!
# WARNING: typically one should always use variable-bounds
# (cylp problems...)
# workaround: express bounds lower_bound <= var <= upper_bound as two constraints
# a constraint is an affine-expression
# sp.eye creates a sparse-diagonal with 1's
# example: sp.eye(3) * x >= 5
# 1 0 0 -> 1 * x0 + 0 * x1 + 0 * x2 >= 5
# 0 1 0 -> 0 * x0 + 1 * x1 + 0 * x2 >= 5
# 0 0 1 -> 0 * x0 + 0 * x1 + 1 * x2 >= 5
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
# cylp somewhat outdated: need numpy's matrix class
# apart from that it's just the weight-constraint as defined at wiki
# same affine-expression as above (but only a row-vector-like matrix)
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
# internal type conversion needed to treat it as an IP (or else it would be an LP)
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
# type-casting
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is 4.88372 - 0.00 seconds
Cgl0004I processed model has 1 rows, 4 columns (4 integer (4 of which binary)) and 4 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 5
Cbc0038I Before mini branch and bound, 4 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I After 0.00 seconds - Feasibility pump exiting with objective of 5 - took 0.00 seconds
Cbc0012I Integer solution of 5 found by feasibility pump after 0 iterations and 0 nodes (0.00 seconds)
Cbc0001I Search completed - best objective 5, took 0 iterations and 0 nodes (0.00 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 5 to 5
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
INSTANCE
weights: [37 43 12 8 9]
values: [11 5 15 0 16]
capacity: 50
Solution
[0 1 0 1 0]
sum weight: 51
value: 5

Performance of Prime Testing with Haskell

I have two ways of testing for primes. One of them is called isPrime and the other isBigPrime. What I originally wanted was to test "big" primes against "small" primes that I have already computed, so that the testing becomes faster. Here are my implementations:
intSqrt :: Integer -> Integer
intSqrt n = round $ sqrt $ fromIntegral n

isPrime' :: Integer -> Integer -> Bool
isPrime' 1 m = False
isPrime' n m =
  if m > intSqrt n
    then True
    else if rem n (m + 1) == 0
      then False
      else isPrime' n (m + 1)

isPrime :: Integer -> Bool
isPrime 2 = True
isPrime 3 = True
isPrime n = isPrime' n 1

isBigPrime' :: Integer -> Int -> Bool
isBigPrime' n i =
  if (smallPrimes !! i) > intSqrt n
    then True
    else if rem n (smallPrimes !! i) == 0
      then False
      else isBigPrime' n (i + 1)

smallPrimes = [2, 3, List of Primes until 1700]

-- Start at 1 because we only go through odd numbers
isBigPrime n = isBigPrime' n 1

primes m = [2] ++ [k | k <- [3,5..m], isPrime k]
bigPrimes m = smallPrimes ++ [k | k <- [1701,1703..m], isBigPrime k]

main = do
  print $ (sum $ [Enter Method] 2999999)
I chose 1700 as the upper limit because I wanted primes up to 3e6 and sqrt(3e6) ≈ 1700. I took the sum of them to compare the two algorithms. I thought bigPrimes would be way faster than primes because, first of all, it does way fewer calculations, and it has a head start, which is not too big, but still. However, to my surprise, bigPrimes was slower than primes. Here are the results:
For primes
Performance counter stats for './p10':
16768,627686 task-clock (msec) # 1,000 CPUs utilized
58 context-switches # 0,003 K/sec
1 cpu-migrations # 0,000 K/sec
6.496 page-faults # 0,387 K/sec
53.416.641.157 cycles # 3,186 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
160.411.056.099 instructions # 3,00 insns per cycle
34.512.352.987 branches # 2058,150 M/sec
10.673.742 branch-misses # 0,03% of all branches
16,773316435 seconds time elapsed
and for bigPrimes
Performance counter stats for './p10':
19111,667046 task-clock (msec) # 0,999 CPUs utilized
259 context-switches # 0,014 K/sec
3 cpu-migrations # 0,000 K/sec
6.278 page-faults # 0,328 K/sec
61.027.453.425 cycles # 3,193 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
198.207.905.034 instructions # 3,25 insns per cycle
34.632.138.061 branches # 1812,094 M/sec
106.102.114 branch-misses # 0,31% of all branches
19,126843560 seconds time elapsed
I was wondering why that would be the case. I suspect that using smallPrimes !! i makes bigPrimes somewhat slower, but I am not entirely sure.
A common antipattern brought from other languages is to iterate over indices and use (!!) to index into a list. In Haskell, it is instead idiomatic to simply iterate over the list itself. So:
isBigPrime' :: Integer -> [Integer] -> Bool
isBigPrime' n [] = True
isBigPrime' n (p:ps) = p > intSqrt n || (rem n p /= 0 && isBigPrime' n ps)

isBigPrime n = isBigPrime' n (drop 1 smallPrimes)
On my machine, your primes takes 25.3s; your bigPrimes takes 20.9s; and my bigPrimes takes 3.2s. There are several other pieces of low-hanging fruit (e.g. using p*p > n instead of p > intSqrt n), but this is by far the most significant one.

order of growth in algorithms

Suppose that you time a program as a function of N and produce
the following table.
N seconds
-------------------
19683 0.00
59049 0.00
177147 0.01
531441 0.08
1594323 0.44
4782969 2.46
14348907 13.58
43046721 74.99
129140163 414.20
387420489 2287.85
Estimate the order of growth of the running time as a function of N.
Assume that the running time obeys a power law T(N) ~ a N^b. For your
answer, enter the constant b. Your answer will be marked as correct
if it is within 1% of the target answer - we recommend using
two digits after the decimal separator, e.g., 2.34.
Can someone explain how to calculate this?
Well, it is a simple mathematical problem.
I  : a * 387420489^b = 2287.85 -> a = 2287.85 / 387420489^b
II : a * 43046721^b = 74.99 -> a = 74.99 / 43046721^b
III: (I and II) -> 2287.85 / 387420489^b = 74.99 / 43046721^b
     -> (387420489 / 43046721)^b = 2287.85 / 74.99, i.e. 9^b = 30.51
-> http://www.purplemath.com/modules/solvexpo2.htm
Use logarithms to solve.
1. You should calculate the ratio of growth from one row to the next:
N seconds
--------------------
14348907 13.58
43046721 74.99
129140163 414.2
387420489 2287.85
2. Calculate the ratio of change for N:
43046721 / 14348907 = 3
129140163 / 43046721 = 3
therefore the rate of change for N is 3.
3. Calculate the ratio of change for seconds:
74.99 / 13.58 = 5.52
Now let's check the ratio between one more pair of rows to be sure:
414.2 / 74.99 = 5.52
so the change's ratio for seconds is 5.52
4. Build the following equation:
3^b = 5.52
b = log(5.52) / log(3) ≈ 1.55
Finally, we get that the order of growth of the running time is 1.55.
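As a cross-check in R: under the power law T(N) ~ a*N^b we have log T = log a + b*log N, so b is just the slope of a log-log regression (using only the rows where the clock resolution is meaningful):
N <- c(1594323, 4782969, 14348907, 43046721, 129140163, 387420489)
t <- c(0.44, 2.46, 13.58, 74.99, 414.20, 2287.85)
coef(lm(log(t) ~ log(N)))[2] # slope b, roughly 1.55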

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million dfs of the same structure, same field names, etc. into a single data.frame. Each data.frame has one row and 21 columns.
The data started out as a JSON file, which I converted to lists using fromJSON, then ran another lapply to extract part of the list and converted to data.frame and ended up with a list of data.frames.
I've tried:
df <- do.call("rbind", list)
df <- ldply(list)
but I've had to kill the process after letting it run up to 3 hours and not getting anything back.
Is there a more efficient method of doing this? How can I troubleshoot what is happening and why is it taking so long?
FYI - I'm using RStudio server on a 72GB quad-core server with RHEL, so I don't think memory is the problem. sessionInfo below:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] multicore_0.1-7 plyr_1.7.1 rjson_0.2.6
loaded via a namespace (and not attached):
[1] tools_2.14.1
>
Given that you are looking for performance, it appears that a data.table solution should be suggested.
There is a function rbindlist, which does the same thing but is much faster than do.call(rbind, list):
library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
## user system elapsed
## 0.00 0.01 0.02
It is also very fast for a list of data.frame
Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
## user system elapsed
## 0.03 0.00 0.03
For comparison
system.time(docall <- do.call(rbind, Xdf))
## user system elapsed
## 50.72 9.89 60.88
And some proper benchmarking
library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
rbindlist.data.frame = rbindlist(Xdf),
docall = do.call(rbind, Xdf),
replications = 5)
## test replications elapsed relative user.self sys.self
## 3 docall 5 276.61 3073.444445 264.08 11.4
## 2 rbindlist.data.frame 5 0.11 1.222222 0.11 0.0
## 1 rbindlist.data.table 5 0.09 1.000000 0.09 0.0
and against @JoshuaUlrich's solutions
benchmark(use.rbl.dt = rbl.dt(X),
use.rbl.ju = rbl.ju (Xdf),
use.rbindlist =rbindlist(X) ,
replications = 5)
## test replications elapsed relative user.self
## 3 use.rbindlist 5 0.10 1.0 0.09
## 1 use.rbl.dt 5 0.10 1.0 0.09
## 2 use.rbl.ju 5 0.33 3.3 0.31
I'm not sure you really need to use as.data.frame, because a data.table inherits class data.frame
rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.
# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
  Names <- names(X[[1]]) # Get data.frame names from first list element.
  # For each name, extract its values from each data.frame in the list.
  # This provides a list with an element for each name.
  Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
  names(Xb) <- Names # Give Xb the correct names.
  Xb.df <- as.data.frame(Xb) # Convert Xb to a data.frame.
})
# user system elapsed
# 3.356 0.024 3.388
system.time(X1 <- do.call(rbind, X))
# user system elapsed
# 169.627 6.680 179.675
identical(X1,Xb.df)
# [1] TRUE
Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)
# My "rbind list" function
rbl.ju <- function(x) {
  u <- unlist(x, recursive = FALSE)
  n <- names(u)
  un <- unique(n)
  l <- lapply(un, function(N) unlist(u[N == n], FALSE, FALSE))
  names(l) <- un
  d <- as.data.frame(l)
}
# simple wrapper to rbindlist that returns a data.frame
rbl.dt <- function(x) {
  as.data.frame(rbindlist(x))
}
library(data.table)
if(packageVersion("data.table") >= '1.8.2') {
system.time(dt <- rbl.dt(X)) # rbindlist only exists in recent versions
}
# user system elapsed
# 0.02 0.00 0.02
system.time(ju <- rbl.ju(X))
# user system elapsed
# 0.05 0.00 0.05
identical(dt,ju)
# [1] TRUE
Your observation that the time taken increases super-linearly with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.
This simple experiment seems to confirm that that's a very fruitful path to take:
## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
## First, rbind together all 50,000 data.frames in a single step
system.time({
X1 <- do.call(rbind, X)
})
# user system elapsed
# 137.08 57.98 200.08
## Doing it in two stages cuts the processing time by >95%
## - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
## - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
X3 <- do.call(rbind, X2)
})
# user system elapsed
# 6.14 0.05 6.21
## Checking that the results are the same
identical(X1, X3)
# [1] TRUE
You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
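For instance, here is a minimal sketch of that idea (assuming every column is numeric; unlist would coerce mixed column types to character):
X1row <- replicate(1000, data.frame(a = rnorm(1), b = runif(1)), simplify = FALSE)
m <- do.call(rbind, lapply(X1row, unlist)) # one matrix row per one-row data.frame
df <- as.data.frame(m)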
However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer at Can rbind be parallelized in R?
do.call.rbind <- function(lst) {
  while (length(lst) > 1) {
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) {
        return(lst[[i]])
      }
      return(rbind(lst[[i]], lst[[i + 1]]))
    })
  }
  lst[[1]]
}
I have been using this function for several months, and have found it to be faster and use less memory than do.call(rbind, ...) [the disclaimer is that I've pretty much only used it on xts objects]
The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.
If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have a list of length one billion, this will be better.
> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1 a 100 327.728 1.755965 248.620 79.099 0 0
2 b 100 186.637 1.000000 181.874 4.751 0 0
The relative speed-up grows dramatically as you increase the length of the list.
