Discrepancy between R and Matlab speed [duplicate]

Possible duplicate: Why are loops slow in R?
Consider the following task. A dataset has 40 variables for 20,000 "users". Each user has between 1 and 150 observations. All users are stacked in a matrix called data; the first column is the id of the user. All ids are stored in a 20,000 × 1 matrix called userid.
Consider the following R code
useridl = length(userid)
itime = proc.time()[3]
for (i in 1:useridl) {
  temp = data[data[,1]==userid[i],]
}
etime = proc.time()[3]
etime - itime
This code just goes through the 20,000 users, creating the temp matrix each time with the subset of observations belonging to userid[i]. It takes about 6 minutes on a Mac Pro.
In MATLAB, the same task
tic
for i=1:useridl
    temp = data(data(:,1)==userid(i),:);
end
toc
takes 1 minute.
Why is R so much slower? This is a standard task and I am using matrices in both cases. Any ideas?

As @joran commented, that's bad R practice. Instead of repeatedly subsetting your original matrix, just put the subsets in a list once and then iterate over the list with lapply or similar; a quick sketch follows the timing below.
# make example data
set.seed(21)
userid <- 1:1e4
obs <- sample(150, length(userid), TRUE)
users <- rep(userid, obs)
Data <- cbind(users,matrix(rnorm(40*sum(obs)),sum(obs),40))
# reorder so Data isn't sorted by userid
Data <- Data[order(Data[,2]),]
# note that you have to call the data.frame method explicitly,
# the default method returns a vector
system.time(temp <- split.data.frame(Data, Data[,1])) ## Returns times in seconds
# user system elapsed
# 2.84 0.08 2.92
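From here the per-user work runs over the list instead of re-scanning Data 20,000 times. A minimal sketch (per_user_summary is a hypothetical stand-in for whatever the loop body actually does with temp):
# temp is the list returned by split.data.frame above;
# each element is one user's matrix of observations
per_user_summary <- function(m) colMeans(m[, -1, drop = FALSE])
results <- lapply(temp, per_user_summary)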
My guess is that the garbage collector is slowing down your R code, since you're continually overwriting the temp object.

Related

Input to different attributes values from a random.sample list

So this is what I'm trying to do, and I'm not sure how because I'm new to Python. I've tried a few options and I'm not sure why this doesn't work.
So I have 6 different nodes in Maya, called aiSwitch. I need to generate distinct random numbers from 0 to 5 and input each value into aiSwitch*.index.
In short the result should be
aiSwitch1.index = (random number from 0 to 5)
aiSwitch2.index = (another random number from 0 to 5, different from the one before)
And so on until aiSwitch6.index.
I tried the following:
import maya.cmds as mc
import random

allswitch = mc.ls('aiSwitch*')
for i in allswitch:
    print i
    S = range(0, 6)
    print S
    shuffle = random.sample(S, len(S))
    print shuffle
    for w in shuffle:
        print w
        mc.setAttr(i + '.index', w)
This is the result I get from the prints:
aiSwitch1 <-- from print i
[0, 1, 2, 3, 4, 5] <--- from print S
[2, 3, 5, 4, 0, 1] <--- from print shuffle (random.sample results)
2
3
5
4
0
1 <--- from print w, each separate item in the random.sample list.
Now, this happens for every aiSwitch, because it's in a loop of course. And the random numbers are a different list each time, because they are generated every time the loop runs.
So where is the problem then?
aiSwitch1.index = 1
All the other aiSwitch*.index also end up with only the last item in the list by the time the setAttr runs. It seems that w is retaining the last value of the for loop. I don't quite understand how to:
Get a random value from 0 to 5
Input that value in aiSwitch1.index
Get another random value from 0 to 5, different from the one before
Input that value in aiSwitch2.index
Repeat until aiSwitch6.index.
I did get it to work with the following form:
allSwitch = mc.ls('aiSwitch*')
for i in allSwitch:
    mc.setAttr(i + '.index', random.uniform(0, 5))
This gave a random number from 0 to 5 to every aiSwitch*.index, but some of them repeat. I think this works because the value is generated every time the loop runs, hence setting the attribute with a random number. But the numbers repeat, and I was trying to avoid that. I also tried a shuffle but failed to get any values from it.
My main mistake seems to be that I'm generating a list and sampling it, but failing to assign each distinct item from that list to a different aiSwitch*.index node. And I'm running out of ideas for this.
Any clues would be greatly appreciated.
Thanks.
Jonathan.
Here is a somewhat Pythonic way: shuffle the list of indices, then iterate over it using zip (which is useful for iterating over structures in parallel, which is what you need to do here):
import maya.cmds as mc
import random

index = list(range(6))
random.shuffle(index)
allSwitch = mc.ls('aiSwitch*')
for i, j in zip(allSwitch, index):
    mc.setAttr(i + '.index', j)
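A variant of the same idea (a sketch, continuing from the imports above): random.sample can produce the distinct indices in one call, and zip quietly stops at the shorter of its two sequences, so a surplus of nodes or indices is simply ignored:
# random.sample(range(6), 6) returns the six indices in random order,
# so every node receives a distinct value
for node, idx in zip(mc.ls('aiSwitch*'), random.sample(range(6), 6)):
    mc.setAttr(node + '.index', idx)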

Julia: Parallel for loop over partitions iterator

So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = @parallel (+) for p in partitions(1:n)
    int(is_valid(p))
end
println(valid_num)
This would use the @parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = @parallel (+) for i = 1:200000000
    Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
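The failure is easy to reproduce in isolation, since @parallel splits its loop range by indexing into it, and SetPartitions only supports the iteration protocol. A sketch (error message abbreviated):
p = partitions(1:4)
first(p)  # fine: SetPartitions supports iteration
p[3]      # ERROR: `getindex` has no method matching getindex(::SetPartitions{...}, ::Int64)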
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running over night and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!
The below code gives 511, the number of partitions of size 2 of a set of 10.
using Iterators
s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p) == 2
valid_num = @parallel (+) for i = 1:30
    sum(map(is_valid, takenth(chain(1:29, drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
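To see the striding concretely, here is the same trick applied to plain integers instead of partitions (a toy sketch; worker i visits elements i, i+30, i+60, ...):
using Iterators
# chain(1:29, ...) pads the front so takenth's fixed stride of 30 first
# lands on worker i's element; drop shifts the start accordingly
i = 2
collect(takenth(chain(1:29, drop(1:100, i-1)), 30)) # => [2, 32, 62, 92]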
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p) == 2
valid_num = @parallel (+) for i = 1:30
    foldl((x,y)->(x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next

immutable TakeEvery{Itr}
    itr::Itr
    start::Any
    value::Any
    flag::Bool
    skip::Int64
end

function take_every(itr, offset, skip)
    value, state = Nothing, start(itr)
    for i = 1:(offset+1)
        if done(itr, state)
            return TakeEvery(itr, state, value, false, skip)
        end
        value, state = next(itr, state)
    end
    if done(itr, state)
        TakeEvery(itr, state, value, true, skip)
    else
        TakeEvery(itr, state, value, false, skip)
    end
end

function start{Itr}(itr::TakeEvery{Itr})
    itr.value, itr.start, itr.flag
end

function next{Itr}(itr::TakeEvery{Itr}, state)
    value, state_, flag = state
    for i = 1:itr.skip
        if done(itr.itr, state_)
            return state[1], (value, state_, false)
        end
        value, state_ = next(itr.itr, state_)
    end
    if done(itr.itr, state_)
        state[1], (value, state_, !flag)
    else
        state[1], (value, state_, false)
    end
end

function done{Itr}(itr::TakeEvery{Itr}, state)
    done(itr.itr, state[2]) && !state[3]
end
One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter, state, n)
    i = n
    arr = Array[]
    while !done(iter, state) && (i > 0)
        a, state = next(iter, state)
        push!(arr, a)
        i = i - 1
    end
    return arr, state
end

function get_part(npart, npar)
    valid_num = 0
    p = partitions(1:npart)
    s = start(p)
    while !done(p, s)
        arr, s = my_take(p, s, npar)
        valid_num += @parallel (+) for a in arr
            length(a)
        end
    end
    return valid_num
end

valid_num = @time get_part(10, 30)
I was going to use the take() method to realize up to npar items from the iterator, but take() appears to be deprecated, so I've included my own implementation, which I've called my_take(). The get_part() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.
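A back-of-envelope estimate makes the scale concrete (a sketch; the rate of one million partitions per second per core is an optimistic assumption):
total = 474869816156751 # partitions of 1:21, from above
rate = 30 * 10^6        # 30 cores at 1e6 partitions/second each
total / rate / 86400    # ≈ 183 days just to enumerate and test them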

Operating in parallel on a large constant datastructure in Julia

I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state::State, datum::Vector{String})
    ...
    score::Vector{Float64} # length(score) == 10100
I have what amounts to a nested loop. The outer loop typically runs for 500 iterations:
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
    score_total = zeros(10100)
    for datum in data
        score_total += MyScoringOperation(state, datum)
    end
end
On one computer, on a small test case of 3,000 items (rather than 50,000), this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia 0.3.9 installed (and can get 3 more easily, and then can get hundreds more at the next scale).
I have basic experience with @parallel; however, it seems to spend a lot of time copying the constant (it more or less hangs on the smaller test case).
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
    score_total = @parallel (+) for datum in data
        MyScoringOperation(state, datum)
    end
    state = update(state, score_total)
end
My understanding of the way this implementation works with @parallel is that, for each ii, it:
1. partitions data into a chunk for each worker
2. sends that chunk to each worker
3. has each worker process its chunk
4. has the main procedure sum the results as they arrive.
I would like to remove step 2, so that instead of sending a chunk of data to each worker, I just send a range of indexes to each worker and they look it up from their own copy of data. Or, even better, give each worker only its own chunk and have it reuse that chunk each time (saving a lot of RAM).
Profiling backs up my belief about the functioning of @parallel.
For a similarly scoped problem (with even smaller data), the non-parallel version runs in 0.09 seconds, while the parallel version runs in 185 seconds. The profiler shows almost 100% of this time is spent interacting with network IO.
This should get you started:
function get_chunks(data::Vector, nchunks::Int)
    base_len, remainder = divrem(length(data), nchunks)
    chunk_len = fill(base_len, nchunks)
    chunk_len[1:remainder] += 1 # remainder will always be less than nchunks
    function _it()
        for ii in 1:nchunks
            chunk_start = sum(chunk_len[1:ii-1]) + 1
            chunk_end = chunk_start + chunk_len[ii] - 1
            chunk = data[chunk_start:chunk_end]
            produce(chunk)
        end
    end
    Task(_it)
end
function r_chunk_data(data::Vector)
    all_chunks = get_chunks(data, nworkers()) |> collect
    remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chunks[ii]) for (ii, pid) in enumerate(workers())]
    # Have to add the type annotation, as otherwise it thinks that RemoteRef(pid) might return a RemoteValue
end
function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
    total = nothing
    # TODO: consider wrapping total in a lock once on 0.4, so that it is guaranteed safe
    @sync for rr in rem_results
        function gather(rr)
            res = fetch(rr)
            if total === nothing
                total = res
            else
                total = red_acc(total, res)
            end
        end
        @async gather(rr)
    end
    total
end
using Pipe # for the @pipe macro

function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
    rem_results = map(r_chunks) do rchunk
        function do_mapred()
            @assert rchunk.where == myid()
            @pipe rchunk |> fetch |> map(map_fun, _) |> reduce(red_acc, _)
        end
        remotecall(rchunk.where, do_mapred)
    end
    @pipe rem_results |> convert(Vector{RemoteRef}, _) |> fetch_reduce(red_acc, _)
end
r_chunk_data breaks the data into chunks (as defined by the get_chunks method) and sends each chunk to a different worker, where it is stored in a RemoteRef. The RemoteRefs are references to memory on your other processes (and potentially other computers).
prechunked_mapreduce does a variation on a kind of map-reduce: each worker first runs map_fun on each of its chunk's elements, then reduces over all the elements in its chunk using red_acc (a reduction accumulator function). Finally, each worker returns its result, which is then combined by reducing them all together with red_acc, this time using fetch_reduce so that we can add the first ones completed first.
fetch_reduce is a nonblocking fetch-and-reduce operation. I believe it has no race conditions, though this may be because of an implementation detail in @async and @sync. When Julia 0.4 comes out, it would be easy enough to put a lock in to make it obviously free of race conditions.
This code isn't really battle hardened. You also might want to look at making the chunk size tunable, so that you can send more data to faster workers (if some have better network or faster CPUs).
You need to reexpress your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100), eye(100)])[:] for _ in 1:3000] # 480MB
remote_chunks = r_chunk_data(data)
@time prechunked_mapreduce(remote_chunks, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
@time reduce(+, map(mean, data))
took ~0.06 seconds.

Stata -- extract regression output for 3500 regressions run in a loop

I am using a forval loop to run 3,500 regressions, one for each group. I then need to summarize the results. Typically, when I use loops to run regressions, I use estimates store followed by estout. Below is a sample code, but I believe estimates store can hold at most 300 sets of results. I would very much appreciate it if someone could let me know how to automate the process for 3,500 regressions.
Sample code:
forval j = 1/3500 {
    regress y x if group == `j'
    estimates store m`j', title(Model `j')
}
estout m* using "Results.csv", cells(b t) ///
legend label varlabels(_cons constant) ///
stats(r2 df_r N, fmt(3 0 1) label(R-sqr dfres N)) replace
Here's an example using statsby, where I run a regression of price on mpg for each of the 5 groups defined by the rep78 variable and store the results in a Stata dataset called my_regs:
sysuse auto, clear
statsby _b _se, by(rep78) saving(my_regs): reg price mpg
use my_regs.dta
If you prefer, you can omit the saving() option and then your dataset will be replaced in memory by the regression results, so you won't need to open the file directly with use.
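Transferred to the question's setting (a sketch, assuming the y, x, and group variables from the original loop; the file name is arbitrary), the same pattern would look something like:
* one row of coefficients and standard errors per group
statsby _b _se, by(group) saving(my_regs, replace): regress y x
use my_regs.dta, clear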

add columns to data frame using foreach and %dopar%

In Revolution R 2.12.2 on Windows 7 and Ubuntu 64-bit 11.04 I have a data frame with over 100K rows and over 100 columns, and I derive ~5 columns (sqrt, log, log10, etc) for each of the original columns and add them to the same data frame. Without parallelism using foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, it will not access the global environment (to prevent race conditions or something like that), so I cannot modify the data frame because the data frame object is 'not found.'
My question is how can I make this faster? In other words, how to parallelize either the columns or the transformations?
Simplified example:
require(foreach)
require(doSMP)
w <- startWorkers()
registerDoSMP(w)
transform_features <- function()
{
  cols <- c(1,2,3,4) # in my real code I select certain columns (not all)
  foreach(thiscol=cols, mydata) %dopar% {
    name <- names(mydata)[thiscol]
    print(paste('transforming variable ', name))
    mydata[,paste(name, 'sqrt', sep='_')] <<- sqrt(mydata[,thiscol])
    mydata[,paste(name, 'log', sep='_')] <<- log(mydata[,thiscol])
  }
}
n<-10 # I often have 100K-1M rows
mydata <- data.frame(
a=runif(n,1,100),
b=runif(n,1,100),
c=runif(n,1,100),
d=runif(n,1,100)
)
ncol(mydata) # 4 columns
transform_features()
ncol(mydata) # if it works, there should be 8
Notice that if you change %dopar% to %do%, it works fine.
Try the := operator in data.table to add the columns by reference. You'll need with=FALSE so you can put the call to paste on the LHS of :=.
See When should I use the := operator in data.table?
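A minimal sketch of that idea on the question's data (note: current data.table versions accept a parenthesized expression on the LHS of := in place of with=FALSE):
library(data.table)
DT <- as.data.table(mydata)
for (col in c("a", "b", "c", "d")) {
  newcol <- paste(col, "sqrt", sep = "_")
  DT[, (newcol) := sqrt(get(col))] # adds the column by reference, no copy of DT
}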
Might it be easier if you did something like
n<-10
mydata <- data.frame(
a=runif(n,1,100),
b=runif(n,1,100),
c=runif(n,1,100),
d=runif(n,1,100)
)
mydata_sqrt <- sqrt(mydata)
colnames(mydata_sqrt) <- paste(colnames(mydata), 'sqrt', sep='_')
mydata <- cbind(mydata, mydata_sqrt)
producing something like
> mydata
a b c d a_sqrt b_sqrt c_sqrt d_sqrt
1 29.344088 47.232144 57.218271 58.11698 5.417018 6.872565 7.564276 7.623449
2 5.037735 12.282458 3.767464 40.50163 2.244490 3.504634 1.940996 6.364089
3 80.452595 76.756839 62.128892 43.84214 8.969537 8.761098 7.882188 6.621340
4 39.250277 11.488680 38.625132 23.52483 6.265004 3.389496 6.214912 4.850240
5 11.459075 8.126104 29.048527 76.17067 3.385126 2.850632 5.389669 8.727581
6 26.729365 50.140679 49.705432 57.69455 5.170045 7.081008 7.050208 7.595693
7 42.533937 7.481240 59.977556 11.80717 6.521805 2.735186 7.744518 3.436157
8 41.673752 89.043099 68.839051 96.15577 6.455521 9.436265 8.296930 9.805905
9 59.122106 74.308573 69.883037 61.85404 7.689090 8.620242 8.359607 7.864734
10 24.191878 94.059012 46.804937 89.07993 4.918524 9.698403 6.841413 9.438217
There are two ways you can handle this:
Loop over each column (or, better yet, a subset of the columns), apply the transformations to create a temporary data frame, return that, and then cbind the list of data frames, as @Henry suggested.
Loop over the transformations, apply each to the data frame, return the transformed data frames, cbind, and proceed.
Personally, the way I tend to do things like this is to create a big.matrix object (either in memory or on disk, using the bigmemory package), so that all of the columns can be accessed in shared memory. Just pre-allocate the columns you will fill in, and you won't need to do a post hoc cbind. I tend to do it on disk. Just be sure to run flush() to make sure everything is written to disk.
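A rough sketch of that pattern on the question's 4-column example (the file names and the doSMP backend are just illustrative; any foreach backend works):
library(bigmemory)
library(foreach)
library(doSMP)

w <- startWorkers()
registerDoSMP(w)

# pre-allocate the 4 original columns plus 4 derived ones, file-backed
# so every worker can attach the same storage
X <- filebacked.big.matrix(nrow = nrow(mydata), ncol = 8, type = "double",
                           backingfile = "mydata.bin",
                           descriptorfile = "mydata.desc")
X[, 1:4] <- as.matrix(mydata)

desc <- describe(X)
foreach(j = 1:4) %dopar% {
  require(bigmemory)
  x <- attach.big.matrix(desc) # each worker attaches the shared matrix
  x[, 4 + j] <- sqrt(x[, j])   # fill a pre-allocated derived column
  flush(x)                     # make sure everything is written to disk
  NULL
}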
