Modified cumsum function

Modified cumsum function - performance

Below is a simplified version of a segment of code that I'm working on (a lot of additional calculations are left out to avoid confusion). It's just a modified form of the cumsum function. I don't want to re-invent the wheel, so does this function already exist? If not, what scheme would provide the best speed?
#Set up the data
set.seed(1)
junk <- rnorm(1000000)
junk1 <- rnorm(1000000)
cumval <- numeric(1000000)
#Initialize the accumulator
cumval[1] <- 1
#Perform the modified cumsum
system.time({
for (i in 2:1000000) cumval[i] <- junk[i] + (junk1[i] * cumval[i-1])
})
#Plot the result
plot(cumval, type="l")

This algorithm is something that fits the compiler package perfectly!
#Set up the data
set.seed(1)
junk <- rnorm(1000000)
junk1 <- rnorm(1000000)
# The original code
f <- function(junk, junk1) {
cumval <- numeric(1000000)
cumval[1] <- 1
for (i in 2:1000000) cumval[i] <- junk[i] + (junk1[i] * cumval[i-1])
cumval
}
system.time( f(junk, junk1) ) # 4.11 secs
# Now try compiling it...
library(compiler)
g <- cmpfun(f)
system.time( g(junk, junk1) ) # 0.98 secs
...so it would be interesting to know if this algorithm is in any way "typical" - in that case the compiler could perhaps be even more optimized for situations like this...

It is faster but doesn't give correct results.
Run this
set.seed(1)
N <- 10
junk <- rnorm(N)
junk1 <- rnorm(N)
cumval <- numeric(N)
cumval.1 <- numeric(N)
cumval[1] <- 1
for( i in 2:N ) cumval[i] <- junk[i] + junk1[i]*cumval[i-1]
cumval
cumval.1 <- cumsum( junk[-1] + (junk1[-1] * cumval.1[-N]) )
cumval.1
and you'll see that cumval and cumval.1 are not even the same length.
One needs to rewrite the recurrence relation.
I don't see a way to convert the recurrence to a non recurrence formula.

Consider cumval[5]. Using j[] for junk and jk[] for junk1 and omitting * symbols, its expansion would be:
j[5] +jk[5]j[4] + jk[5]jk[4]j[3] + jk[5]jk[4]jk[3]j[2] + jk[5]jk[4]jk[3]jk[2]
The pattern suggests this might be (close to ?) an expression for the 5th term:
sum( j[1:5] * c(1, Reduce("*" , rev(jk[2:5]), accumulate=TRUE) )

Related

How to set my inequality constraint into my R function?

I am working on a project consisting of the analysis of different portfolio constructions in a universe of various assets. I work on 22 assets and I recalibrate my portfolio every 90 days. This is why a weights penalties (see code) constraint is applied as the allocation changes every period.
I am currently implementing a construction based on independent components. My objective is to minimize the modified value at risk based on its components. (See code below).
My function runs correctly and everything seems to be OK, my function "MVaR.IC.port" and "MVaR.cm" work well. However, I can only implement this model in the case where short selling is allowed. I would now like to operate only in "Long only", i.e. that my weight vectors w only contain elements >=0. Concretely, i want that the expression "w <- t(w.IC)%*%a$A" in my code be >=0.
Do you know how to help me? Thank you in advance.
[results w.out.MVaR.IC.22,][1] Here are the results that must be positive. I also constraint that the sum of the weights must be equal to 1 (the investor allocates 100% of his wealth.).
Thomas
PS: train and test represent my rolling windows. In fact, I calibrate my models on 'train' (in sample) and apply them on 'test' (out of sample) in order to analyse their performance.
########################################
######### MVar on IC with CM #########
########################################
lower = rep(-5,k)
upper = rep(5,k)
#Set up objective function and constraint
MVaR.IC.cm.port <- function(S, weights, alpha, MixingMatrix)
{
obj <- MVaR(S, weights, alpha)
w.ICA <- t(weights)%*%MixingMatrix
weight.penalty = abs(1000*(1-sum(w.ICA)))
down.weight.penalty = 1000*sum(w.ICA[w.ICA > 1])
up.weight.penalty = 1000*abs(sum(w.ICA[w.ICA < -1]))
return(obj + weight.penalty + down.weight.penalty + up.weight.penalty)
}
#Out of sample return portfolio computation
ret.out.MVaR.IC.cm.22 <- c()
w.out.MVaR.IC.cm.22 <- matrix(ncol = n, nrow = 10)
for (i in 0:9) {
train <- as.matrix(portfolioReturns.new[((1+i*90):(8*90+i*90)),])
test <- as.matrix(portfolioReturns.new[(1+8*90+i*90):(9*90+i*90),])
a <- myfastICA(train, k, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 2000,
tol = 0.0000000001, verbose = TRUE)
x <- DEoptim(MVaR.IC.cm.port,lower,upper,
control=list(NP=(10*k),F=0.8,CR=0.9, trace=50),
S=a$S, alpha = alpha, MixingMatrix = a$A)
w.IC <- matrix(x$optim$bestmem, ncol=1)
w <- t(w.IC)%*%a$A
for (j in 1:ncol(train)){
w.out.MVaR.IC.cm.22[(i+1),j] <- w[j]
}
ret.out.MVaR.IC.cm.22 <- rbind(ret.out.MVaR.IC.cm.22, test %*% t(w))
}
w.out.MVaR.IC.cm.22

How to debug this for loop that masks multiple rasters using a single shapefile?

I have a group of multiband rasters in a folder that I'd like to mask using a single shapefile. I'd like for the masked outputs (rasters) to go into a separate folder. I received help on here a while back in making a for loop to accomplish this. It worked great, but when I scaled it up (increased the number of polygons in the shapefile, and added more rasters) it didn't work as well anymore. To be specific, it will sometimes mask the rasters correctly, and sometimes not. I can't discern any pattern as i've run this code multiple times and each time the set of rasters that don't get masked are different.
Code so far:
library(terra)
#Creating directory to store inputs
ras_dir <- "/Users/USERID/rasters"
if (!file.exists(ras_dir)) {
ras_dir <- dir.create("/Users/USERID/rasters")
}
#Creating directory to store outputs
mask_dir <- "/Users/USERID/masks"
if (!file.exists(mask_dir)) {
mask_dir <- dir.create("/Users/USERID/masks")
}
#Twelve polygons in a shapefile
v <- vect(system.file("ex/lux.shp", package="terra"))
v <- v[c(1,4,5,7,9,12)]
#10 rasters with 5 layers each. I'm not a good enough coder to programmatically write #these rasters to a directory and have them be different.
f <- system.file("ex/elev.tif", package="terra")
r <- rast(f)
r <- rep(r, 5) * 1:5
names(r) <- paste0("band", 1:5)
writeRaster(r, "/Users/USERID/ras1.tif")
writeRaster(r, "/Users/USERID/ras2.tif")
writeRaster(r, "/Users/USERID/ras3.tif")
writeRaster(r, "/Users/USERID/ras4.tif")
writeRaster(r, "/Users/USERID/ras5.tif")
inf <- list.files("/Users/USERID/rasters", pattern="tif$", full.names=TRUE)
outf <- gsub("/Users/USERID/rasters", "/Users/USERID/masks", inf)
for (i in 1:length(inf)) {
r <- rast(inf[i])
c <- crop(r, v) #Here I crop first as it saves lots of time
m <- mask(c, v, filename = outf[i], overwrite = TRUE)
}
To be clear, I know the code above runs correctly. For some reason it doesn't run with my longer dataset and I was wondering if anyone could shed light on any potential pitfalls of this type of for loop.

This looks good to me. One change I would make is to actually use the variables that represent the paths. That is, only hard-code them once, as I do below. But it would seem that in your actual implementation there is some mistake. It can help to use statements like print(outf[i]) in the loop, and first try with a few files, and then more to see where the error first comes in.
library(terra)
ras_dir <- "rasters"
mask_dir <- "masks"
dir.create(ras_dir, FALSE, FALSE)
dir.create(mask_dir, FALSE, FALSE)
#Twelve polygons in a shapefile
v <- vect(system.file("ex/lux.shp", package="terra"))
v <- v[c(1,4,5,7,9,12)]
#10 rasters with 5 layers each.
r <- rast(system.file("ex/elev.tif", package="terra"))
r <- rep(r, 5) * 1:5
names(r) <- paste0("band", 1:5)
for (i in 1:5) {
writeRaster(r * i, file.path(ras_dir, paste0("ras", i, ".tif")), overwrite=TRUE)
}
inf <- list.files(ras_dir, pattern="tif$", full.names=TRUE)
outf <- gsub(ras_dir, mask_dir, inf)
for (i in 1:length(inf)) {
r <- rast(inf[i])
c <- crop(r, v)
m <- mask(c, v, filename = outf[i], overwrite = TRUE)
}

Slow nested loop in R

I'm new to R and having trouble vectorizing a nested loop that is particularly slow. The loop goes through a list of centers (vectors stored in a structure) and finds the distance between these vectors and the rows of an array called x below. I know this needs to be vectorized for speed, but cannot figure out the appropriate functions to or use of apply to do so.
clusterCenters <- matrix(runif(10000),nrow=100)
clusterMembers <- matrix(runif(400000),nrow=4000)
features <- matrix(0,(dim(clusterMembers)[1]),(dim(clusterCenters)[1]))
for(c in 1:dim(clusterCenters)[1]){
center <- clusterCenters[c,]
for(v in 1:(dim(clusterMembers)[1])){
vector <- clusterMembers[v,]
features[v,c] <- sqrt(sum((center - vector)^2))
}
}
Thanks for any help.

You can take advantage of R's recycling rules to make this a bit faster. But you have to know and account for the fact that R stores matrices in column-major order. You do that by transposing clusterMembers and then the center vector will be recycled along the columns of t(clusterMembers).
set.seed(21)
clusterCenters <- matrix(runif(10000),nrow=100)
clusterMembers <- matrix(runif(400000),nrow=4000)
# your original code in function form
seven <- function() {
features <- matrix(0,(dim(clusterMembers)[1]),(dim(clusterCenters)[1]))
for(c in 1:dim(clusterCenters)[1]){
center <- clusterCenters[c,]
for(v in 1:(dim(clusterMembers)[1])){
vector <- clusterMembers[v,]
features[v,c] <- sqrt(sum((center - vector)^2))
}
}
features
}
# my fancy function
josh <- function() {
tcm <- t(clusterMembers)
Features <- matrix(0,ncol(tcm),nrow(clusterCenters))
for(i in 1:nrow(clusterCenters)) {
# clusterCenters[i,] returns a vector because drop=TRUE by default
Features[,i] <- colSums((clusterCenters[i,]-tcm)^2)
}
Features <- sqrt(Features) # outside the loop to avoid function calls
}
system.time(seven())
# user system elapsed
# 2.7 0.0 2.7
system.time(josh())
# user system elapsed
# 0.28 0.11 0.39
identical(seven(),josh())
# [1] TRUE

What's the ideal implementation for the Sieve of Eratosthenes between Lists, Arrays, and Mutable Arrays?

In Haskell, I've found three simple implementations of the Sieve of Eratosthenes on the Rosetta Code page.
Now my question is, which one should be used in which situations?
Correcting my initial reasoning would be helpful too:
I'm assuming the List one is the most idiomatic and easy to read for a Haskeller. Is it correct, though? I'm wondering if it suffers from the same problems as another list-based sieve that I then learned was not actually implementing the algorithm:
(edit: shown here is the list-based sieve I know has problems, not the one from RosettaCode, which I pasted at the bottom)
primes = sieve [2..] where
sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
In terms of performance, the immutable Array seems to be the winner. With an upper bound m of 2000000, the times were about:
1.3s for List
0.6s for Array
1.8s for Mutable Array
So I'd pick Array for performance.
And of course, the Mutable Array one is also easy to reason about since I have a more imperative language background. I'm not sure why I would pick this one if I'm coding in Haskell, though, since it's both slower than the others and non-idiomatic.
Code copied here for reference:
List:
primesTo m = 2 : eratos [3,5..m] where
eratos (p : xs) | p*p>m = p : xs
| True = p : eratos (xs `minus` [p*p, p*p+2*p..])
minus a#(x:xs) b#(y:ys) = case compare x y of
LT -> x : minus xs b
EQ -> minus xs ys
GT -> minus a ys
minus a b = a
Immutable Array:
import Data.Array.Unboxed
primesToA m = sieve 3 (array (3,m) [(i,odd i) | i<-[3..m]] :: UArray Int Bool)
where
sieve p a
| p*p > m = 2 : [i | (i,True) <- assocs a]
| a!p = sieve (p+2) $ a//[(i,False) | i <- [p*p, p*p+2*p..m]]
| otherwise = sieve (p+2) a
Mutable Array:
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
primeSieve :: Integer -> UArray Integer Bool
primeSieve top = runSTUArray $ do
a <- newArray (2,top) True -- :: ST s (STUArray s Integer Bool)
let r = ceiling . sqrt $ fromInteger top
forM_ [2..r] $ \i -> do
ai <- readArray a i
when ai $ do
forM_ [i*i,i*i+i..top] $ \j -> do
writeArray a j False
return a
-- Return primes from sieve as list:
primesTo :: Integer -> [Integer]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]
EDIT
I showed Turner's Sieve at the top but that's not the list-based example I'm comparing with the other two. I just wanted to know if the list-based example suffers from the same "not the correct Sieve of Eratosthenes" problems as Turner's.
It appears the performance comparison is unfair because the Mutable Array example goes through evens as well and uses Integer rather than Int...

This
primes = sieve [2..] where
sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]
is not a sieve. It's very inefficient trial division. Don't use that!
I'm curious about how you got your times, there is no way that the Turner "sieve" could produce the primes not exceeding 2,000,000 in mere seconds. Letting it find the primes to 200,000 took
MUT time 6.38s ( 6.39s elapsed)
GC time 9.19s ( 9.20s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 15.57s ( 15.59s elapsed)
on my box (64-bit Linux, ghc-7.6.1, compiled with -O2). The complexity of that algorithm is O(N² / log² N), almost quadratic. Letting it proceed to 2,000,000 would take about twenty minutes.
Your times for the array versions are suspicious too, though in the other direction. Did you measure interpreted code?
Sieving to 2,000,000, compiled with optimisations, the mutable array code took 0.35 seconds to run, and the immutable array code 0.12 seconds.
Now, that still has the mutable array about three times slower than the immutable array.
But, it's an unfair comparison. For the immutable array, you used Ints, and for the mutable array Integers. Changing the mutable array code to use Ints - as it should, since under the hood, arrays are Int-indexed, so using Integer is an unnecessary performance sacrifice that buys nothing - made the mutable array code run in 0.15 seconds. Close to the mutable array code, but not quite there. However, you let the mutable array do more work, since in the immutable array code you only eliminate odd multiples of the odd primes, but in the mutable array code, you mark all multiples of all primes. Changing the mutable array code to treat 2 specially, and only eliminate odd multiples of odd primes brings that down to 0.12 seconds too.
But, you're using range-checked array indexing, which is slow, and, since the validity of the indices is checked in the code itself, unnecessary. Changing that to using unsafeRead and unsafeWrite brings down the time for the immutable array to 0.09 seconds.
Then you have the problem that using
forM_ [x, y .. z]
uses boxed Ints (fortunately, GHC eliminates the list). Writing a loop yourself, so that only unboxed Int#s are used, the time goes down to 0.02 seconds.
{-# LANGUAGE MonoLocalBinds #-}
import Control.Monad (forM_, when)
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primeSieve :: Int -> UArray Int Bool
primeSieve top = runSTUArray $ do
a <- newArray (0,top) True
unsafeWrite a 0 False
unsafeWrite a 1 False
let r = ceiling . sqrt $ fromIntegral top
mark step idx
| top < idx = return ()
| otherwise = do
unsafeWrite a idx False
mark step (idx+step)
sift p
| r < p = return a
| otherwise = do
prim <- unsafeRead a p
when prim $ mark (2*p) (p*p)
sift (p+2)
mark 2 4
sift 3
-- Return primes from sieve as list:
primesTo :: Int -> [Int]
primesTo top = [p | (p,True) <- assocs $ primeSieve top]
main :: IO ()
main = print .last $ primesTo 2000000
So, wrapping up, for a Sieve of Eratosthenes, you should use an array - not surprising, its efficiency depends on being able to step from one multiple to the next in short constant time.
You get very simple and straightforward code with immutable arrays, and that code performs decently for not too high limits (it gets relatively worse for higher limits, since the arrays are still copied and garbage-collected, but that's not too bad).
When you need better performance, you need mutable arrays. Writing efficient mutable array code is not entirely trivial, one has to know how the compiler translates the various constructs to choose the right one, and some would consider such code unidiomatic. But you can also use a library (disclaimer: I wrote it) that provides a fairly efficient implementation rather than writing it yourself.

Mutable array will always be the winner in terms of performance (and you really should've copied the version that works on odds only as a minimum; it should be the fastest of the three - also because it uses Int and not Integer).
For lists, tree-shaped merging incremental sieve should perform better than the one you show. You can always use it with takeWhile (< limit) if needed. I contend that it conveys the true nature of the sieve most clearly:
import Data.List (unfoldr)
primes :: [Int]
primes = 2 : _Y ((3 :) . gaps 5 . _U . map (\p -> [p*p, p*p+2*p..]))
_Y g = g (_Y g) -- recursion
_U ((x:xs):t) = (x :) . union xs . _U -- ~= nub . sort . concat
. unfoldr (\(a:b:c) -> Just (union a b, c)) $ t
gaps k s#(x:xs) | k < x = k : gaps (k+2) s -- ~= [k,k+2..]\\s, when
| otherwise = gaps (k+2) xs -- k<=x && null(s\\[k,k+2..])
union a#(x:xs) b#(y:ys) = case compare x y of -- ~= nub . sort .: (++)
LT -> x : union xs b
EQ -> x : union xs ys
GT -> y : union a ys
_U reimplements Data.List.Ordered.unionAll, and gaps 5 is (minus [5,7..]), fused for efficiency, with minus and union from the same package.
Of course nothing beats the brevity of Data.List.nubBy (((>1).).gcd) [2..] (but it's very slow).
To your 1st new question: not. It does find the multiples by counting up, as any true sieve should (although "minus" on lists is of course under-performant; the above improves on that by re-arranging a linear subtraction chain ((((xs-a)-b)-c)- ... ) into a subtraction of tree-folded additions, xs-(a+((b+c)+...))).

As has been said, using mutable arrays has the best performance. The following code is derived from this "TemplateHaskell" version converted back to something more in line with the straight mutable Array solution as the "TemplateHaskell" doesn't seem to make any difference, with some further optimizations. I believe it is faster than the usual mutable unboxed array versions due to the further optimizations and especially due to the use of the "unsafeRead" and "unsafeWrite" functions that avoid range checking of the arrays, probably also internally using pointers for array access:
{-# OPTIONS -O2 -optc-O3 #-}
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primesToUA :: Word32-> [Word32]
primesToUA top = do
let sieveUA top = runSTUArray $ do
let m = ((fromIntegral top) - 3) `div` 2 :: Int
buf <- newArray (0,m) True -- :: ST s (STUArray s Int Bool)
let cullUA i = do
let p = i + i + 3
strt = p * (i + 1) + i
let cull j = do
if j > m then cullUA (i + 1)
else do
unsafeWrite buf j False
cull (j + p)
if strt > m then return ()
else do
e <- unsafeRead buf i
if e then cull strt else cullUA (i + 1)
cullUA 0; return buf
if top > 1 then 2 : [2 * (fromIntegral i) + 3 | (i,True) <- assocs $ sieveUA top]
else []
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
print (last (primesToUA x)) -- 0.01 0.02 0.09 1.26 seconds
EDIT: The above code has been corrected and the times below edited to reflect the correction and as well the comments comparing the paged to the non-paged version have been edited.
The times to run this to the indicated top ranges are as shown in the comment table at the bottom of the above source code as measured by ideone.com and are about exactly five times faster than the answer posted by #WillNess as also measured at ideone.com. This code takes a trivial amount of time to cull the primes to two million and only 1.26 seconds to cull to a hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz) at 0.44 seconds to a hundred million, and takes 6.81 seconds to run to one billion. The memory use is just over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (bit packed) array. This array also explains the non-linear performance in that as the array size exceeds the CPU cache sizes the average memory access times get worse per composite number representation cull.
EDIT_ADD: A page segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the L1 or L2 CPU caches, and as well has the advantage that it is unbounded so that the upper range does not have to be specified in advance and a much smaller memory footprint being just the base primes less than the square root of the range used plus the page buffer memory. The following code has been written as a page segmented implementation and is somewhat faster than the non-paged version; it also offers the advantage that one can change the output range specification at the top of the code to 'Word64' from 'Word32' so it is then not limited to the 32-bit number range, at only a slight cost in processing time (for 32-bit compiled code) for any range that is in common. The code is as follows:
-- from http://www.haskell.org/haskellwiki/Prime_numbers#Using_ST_Array
{-# OPTIONS -O2 -optc-O3 #-}
import Data.Word
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Data.Array.Base
primesUA :: () -> [Word32]
primesUA () = do
let pgSZBTS = 262144 * 8
let sieveUA low bps = runSTUArray $ do
let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
buf <- newArray (0,pgSZBTS - 1) True -- :: ST s (STUArray s Int Bool)
let cullUAbase i = do
let p = i + i + 3
strt = p * (i + 1) + i
when (strt < pgSZBTS) $ do
e <- unsafeRead buf i
if e then do
let loop j = do
if j < pgSZBTS then do
unsafeWrite buf j False
loop (j + p)
else cullUAbase (i + 1)
loop strt
else cullUAbase (i + 1)
let cullUA ~(p:t) = do
let bp = (fromIntegral p)
i = (bp - 3) `div` 2
s = bp * (i + 1) + i
when (s < nxt) $ do
let strt = do
if s >= low then fromIntegral (s - low)
else do
let b = (low - s) `rem` bp
if b == 0 then 0 else fromIntegral (bp - b)
let loop j = do
if j < pgSZBTS then do
unsafeWrite buf j False
loop (j + (fromIntegral p))
else cullUA t
loop strt
if low <= 0 then cullUAbase 0 else cullUA bps
return buf
let sieveList low bps = do
[2 * ((fromIntegral i) + low) + 3 | (i,True) <- assocs $ sieveUA low bps]
let sieve low bps = do
(sieveList low bps) ++ sieve (low + (fromIntegral pgSZBTS)) bps
let primes' = ((sieveList 0 []) ++ sieve (fromIntegral pgSZBTS) primes') :: [Word32]
2 : sieve 0 primes'
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln
-- 0.02 0.03 0.13 1.13 seconds
print (length (takeWhile ((>=) (fromIntegral x)) (primesUA ())))
The above code has quite a few more lines of code than the non-paged case due to the need to cull the composite number representations from the first page array differently than succeeding pages. The code also has the fixes so that there aren't memory leaks due to the base primes list and the output list now not being the same list (thus avoiding holding onto the whole list in memory).
Note that this code takes close to linear time (over the range sieved) as the range gets larger due to the culling buffer being of a constant size less than the L2 CPU cache. Memory use is a fraction of that used by the non-paged version at just under 600 kilobytes for a hundred million and just over 600 kilobytes for one billion, which slight increase is just the extra space required for the base primes less than the square root of the range list.
On ideone.com this code produces the number of primes to a hundred million in about 1.13 seconds and about 12 seconds to one billion (32-bit setting). Probably wheel factorization and definitely multi-core processing would make it even faster on a multi-core CPU. On my i7 (3.5 GHz), it takes 0.44 seconds to sieve to a hundred million and 4.7 seconds to one billion, with the roughly linear performance with increasing range as expected. It seems that there is some sort of non-linear overhead in the version of GHC run on ideone.com that has some performance penalty for larger ranges that is not present for the i7 and that is perhaps related to different garbage collection, as the page buffers are being created new for each new page. END_EDIT_ADD
EDIT_ADD2: It seems that much of the processing time for the above page segmented code is used in (lazy) list processing, so the code is accordingly reformulated with several improvements as follows:
Implemented a prime counting function that does not use list processing and uses "popCount" table look ups to count the number of 'one' bits in a 32 bit word at a time. In this way, the time to find the results is insignificant compared to the actual sieve culling time.
Stored the base primes as a list of bit packed page segments, which is much more space efficient than storing a list of primes, and the the time to convert the page segments to primes as required is not much of a computational overhead.
Tuned the prime segment make function so that for the initial zero page segment, it uses its own bit pattern as a source page, thus making the composite number cull code shorter and simpler.
The code then becomes as follows:
{-# OPTIONS -O3 -rtsopts #-} -- -fllvm ide.com doesn't support LLVM
import Data.Word
import Data.Bits
import Control.Monad
import Control.Monad.ST
import Data.Array.ST (runSTUArray)
import Data.Array.Unboxed
import Data.Array.Base
pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache
type PrimeType = Word32
type Chunk = UArray PrimeType Bool
-- makes a new page chunk and culls it
-- if the base primes list provided is empty then
-- it uses the current array as source (for zero page base primes)
mkChnk :: Word32 -> [Chunk] -> Chunk
mkChnk low bschnks = runSTUArray $ do
let nxt = (fromIntegral low) + (fromIntegral pgSZBTS)
buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True
let cull ~(p:ps) =
let bp = (fromIntegral p)
i = (bp - 3) `shiftR` 1
s = bp * (i + 1) + i in
let cullp j = do
if j >= pgSZBTS then cull ps
else do
unsafeWrite buf j False
cullp (j + (fromIntegral p)) in
when (s < nxt) $ do
let strt = do
if s >= low then fromIntegral (s - low)
else do
let b = (low - s) `rem` bp
if b == 0 then 0 else fromIntegral (bp - b)
cullp strt
case bschnks of
[] -> do bsbf <- unsafeFreezeSTUArray buf
cull (listChnkPrms [bsbf])
_ -> cull $ listChnkPrms bschnks
return buf
-- creates a page chunk list starting at the lw value
chnksList :: Word32 -> [Chunk]
chnksList lw =
mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS)
-- converts a page chunk list to a list of primes
listChnkPrms :: [Chunk] -> [PrimeType]
listChnkPrms [] = []
listChnkPrms ~(hdchnk#(UArray lw _ rng _):tlchnks) =
let nxtp i =
if i >= rng then [] else
if unsafeAt hdchnk i then
(case ((lw + fromIntegral i) `shiftL` 1) + 3 of
np -> np) : nxtp (i + 1)
else nxtp (i + 1) in
(hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks
-- the base page chunk list used to cull the higher order pages,
-- note that it has special treatment for the zero page.
-- It is more space efficient to store this as chunks rather than
-- as a list of primes or even a list of deltas (gaps), with the
-- extra processing to convert as needed not too much.
basePrmChnks :: [Chunk]
basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS)
-- the full list of primes could be accessed with the following function.
primes :: () -> [PrimeType]
primes () = 2 : (listChnkPrms $ chnksList 0)
-- a quite fast prime counting up to the given limit using
-- chunk processing to avoid innermost list processing loops.
countPrimesTo :: PrimeType -> Int
countPrimesTo limit =
let lmtb = (limit - 3) `div` 2 in
let sumChnks acc chnks#(chnk#(UArray lo hi rng _):chnks') =
let cnt :: UArray PrimeType Word32 -> Int
cnt bfw =
case if lmtb < hi then fromIntegral (lmtb - lo) else rng of
crng -> case crng `shiftR` 5 of
rngw ->
let cnt' i ac =
ac `seq` if i >= rngw then
if (i `shiftL` 5) >= rng then ac else
case (-2) `shiftL` fromIntegral (lmtb .&. 31) of
msk -> msk `seq`
case (unsafeAt bfw rngw) .&.
(complement msk) of
bts -> bts `seq` case popCount bts of
c -> c `seq` case ac + c of nac -> nac
else case ac + (popCount $ unsafeAt bfw i) of
nacc -> nacc `seq` cnt' (i + 1) (nacc)
in cnt' 0 0 in
acc `seq` case runST $ do -- make UArray _ Bool into a UArray _ Word32
stbuf <- unsafeThawSTUArray chnk
stbufw <- castSTUArray stbuf
bufw <- unsafeFreezeSTUArray stbufw
return $ cnt bufw of
c -> c `seq` case acc + c of
nacc -> nacc `seq` if hi >= lmtb then nacc
else sumChnks nacc chnks' in
if limit < 2 then 0 else if limit < 3 then 1 else
lmtb `seq` sumChnks 1 (chnksList 0)
main = do
x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln 1000mln
-- 0.02 0.03 0.06 0.45 4.60 seconds
-- 7328 7328 8352 8352 9424 Kilobytes
-- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com,
-- and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz).
-- The above ratio of about 1.6 is much better than usual due to
-- the extremely low memory use of the page segmented algorithm.
-- It seems thaat the Windows Native Code Generator (NCG) from GHC
-- is particularly poor, as the Linux 32-bit version takes
-- less than two thirds of the time for exactly the same program...
print $ countPrimesTo x
-- print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this
The times and memory requirements given in the code are as observed when run on ideone.com, with 0.02, 0.03, 0.05, 0.30, 3.0, and 9.1 seconds required to run on my i7-2700K (3.5 GHz) for one, two, ten, a hundred, a thousand (one billion), and three thousand (three billion) million range, respectively, with a pretty much constant memory footprint slowly increasing with the number of base primes less than the square root of the range as required. When compiled with the LLVM compiler back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35, and 4.15 seconds, respectively, due to its more efficient use of registers and machine instructions; this last is quite close to the same speed as if compiled with a 64-bit compiler rather than the 32-bit compiler used as the efficient use of registers means that availability of extra registers doesn't make much difference.
As commented in the code, the ratio between performance on my real machine and the ideone.com servers becomes much less than for much more memory wasteful algorithms due to not being throttled by memory access bottlenecks so the limit to speed is mostly just the ratio of CPU clock speeds as well as CPU processing efficiency per clock cycle. However, as commented there, there is some strange inefficiency with the GHC Native Code Generator (NCG) when run under Windows (32-bit compiler) in that the run times are over 50% slower than if run under Linux (as the ideone.com server uses). AFAIK they both have a common code base for the same Haskell GHC version and the only divergence is in the linker used (which is also used with the LLVM backend, which is not affected) as GHC NCG does not use GCC but only the mingw32 assembler, which should also be the same.
Note that this code when compiled with the LLVM compiler back end is about the same speed as the same algorithm written for highly optimized 'C/C++' implementations indicating that Haskell really has the ability to develop very tight loop coding. It might be said that the Haskell code is quite a bit more readable and secure than equivalent 'C/C++' code once one gets used to the Haskell paradigms of monadic and non-strict code. The further refinements in execution speed for the Sieve of Eratosthenes are purely a function of the tuning of the implementations used and not the choice of language between Haskell and 'C/C++'.
Summary: Of course, this isn't yet the ultimate in speed for a Haskell version of the Sieve of Eratosthenes in that we still haven't tuned the memory access to more efficiently use the fast CPU L1 cache, nor have we significantly reduced the total number of composite culling operations necessary using extreme wheel factorization other than to eliminate the odds processing. However, this is enough to answer the question in showing that mutable arrays are the most efficient way of addressing such tight loop type of problems, with potential speed gains of about 100 times over the use of lists or immutable arrays. END_EDIT_ADD2

r: for loop operation with nested indices runs super slow

I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to changes depends on information that changes by row; depending on information in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time(
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
)
This takes 70 seconds with ten columns, and 360 when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?
I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but its not the solution.
My current best idea came while preparing this test case. For me, only 10 of the columns are relevant (and 75-11 columns are irrelevant). Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes irrelevant columns. That will get me down to just over a minute. But is "for loop with nested indices" even the best way to think about my problem?

It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame. Any way to convert your data in such a way that it can remain in a matrix?
By the way, great question and a very good example.
Here's an illustration of how much faster loops are on matrices than on data.frames:
> n <- 20000
> t.df <- (matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
0.084 0.001 0.084
>
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
31.543 57.664 89.224

Using row and col seems less complicated to me:
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
I think Tommy's is still faster, but using row and col might be easier to understand.

#JD Long is right that if t.df can be represented as a matrix, things will be much faster.
...And then you can actually vectorize the whole thing so that it is lightning fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
m <- as.matrix(t.df)
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
t2.df <- as.data.frame(m)
}) # 0.00 secs
Unfortunately, the matrix indexing I use here does not seem to work on a data.frame.
EDIT
A variant where I create a logical matrix to index works on data.frame, and is almost as fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
}) # 0.01 secs

UPDATE: Added the matrix version of Tommy's solution to the benchmarking exercise.
You can vectorize it. Here is my solution and a comparison with the loop
n <- 20000
t.df <- (matrix(1:5000, ncol=10, nrow=n))
f_ramnath <- function(x){
idx <- x[,1] %% 10 + 1
x[cbind(1:NROW(x), idx)] <- 99
return(x)
}
f_long <- function(t.df){
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
return(t.df)
}
f_joran <- function(t.df){
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
return(t.df)
}
f_tommy <- function(t.df){
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
return(t2.df)
}
f_tommy_mat <- function(m){
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
}
To compare the performance of the different approaches, we can use rbenchmark.
library(rbenchmark)
benchmark(f_long(t.df), f_ramnath(t.df), f_joran(t.df), f_tommy(t.df),
f_tommy_mat(t.df), replications = 20, order = 'relative',
columns = c('test', 'elapsed', 'relative')
test elapsed relative
5 f_tommy_mat(t.df) 0.135 1.000000
2 f_ramnath(t.df) 0.172 1.274074
4 f_tommy(t.df) 0.311 2.303704
3 f_joran(t.df) 0.705 5.222222
1 f_long(t.df) 2.411 17.859259

Another option for when you do need mixed column types (and so you can't use matrix) is := in data.table. Example from ?":=" :
require(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
# 591 seconds
system.time(for (i in 1:1000) DT[i,V1:=i])
# 1.16 seconds ( 509 times faster )

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Modified cumsum function - performance

Related

How to set my inequality constraint into my R function?

How to debug this for loop that masks multiple rasters using a single shapefile?

Slow nested loop in R

What's the ideal implementation for the Sieve of Eratosthenes between Lists, Arrays, and Mutable Arrays?

r: for loop operation with nested indices runs super slow

Categories

Resources