Parallelize row-operation in Julia

Coming from an R background, I was exploring the parallel possibilities offered by Julia. My objective is to replicate the performance of mclapply (R's parallel apply).
**The problem:**
I iterate a function over the rows of a DataFrame; the loop looks like this:
for i in 1:_nrow   # _nrow = number of rows of my DataFrame
    lat1 = Raw_Data[i,"lat1"]
    lat2 = Raw_Data[i,"lat2"]
    lon1 = Raw_Data[i,"long1"]
    lon2 = Raw_Data[i,"long2"]
    iata1 = Raw_Data[i,"iata1"]
    iata2 = Raw_Data[i,"iata2"]
    a[i] = [(iata1::String, iata2::String, trunc(i,2), get_intermediary_points(lat1,lon1,lat2,lon2,j)) for j in 0:.1:1]
end
Now, as a step toward parallelization, I can also create an anonymous function that does quite similar work, running the calculation on each chunk of my DataFrame:
Raw_Data["selector"] = rand(1:nproc,_nrow) # Define how I split my dataframe. 1 chunck per proc
B = by(Raw_Data,:selector,intermediary_points)
Is there a way to speed up calculations with a parallelized "by"? Otherwise, please suggest a good alternative.
Thanks!
Note: this is what my DataFrame Raw_Data looks like:
6x7 DataFrame:
           iata1  lat1      long1        iata2  lat2      long2
[1,]  1    "ELH"  0.444616  -1.3384      "FLL"  0.455079  -1.39891
[2,]  2    "BCN"  0.720765  0.0362729    "UFA"  0.955274  0.976218
[3,]  3    "ACE"  0.505053  -0.237426    "VCE"  0.794214  0.215582
[4,]  4    "PVG"  0.543669  2.12552      "LZH"  0.425277  1.91171
[5,]  5    "CDG"  0.855379  0.0444809    "VLC"  0.689233  -0.00835298
[6,]  6    "HLD"  0.858699  2.08915      "CGQ"  0.765906  2.18718

I figured out what happened. I hadn't made all the inputs available to all processors.
Basically, if you are running into the same problem:
All functions should have @everywhere in front of them
All packages should also be loaded with @everywhere, e.g. @everywhere using DataFrames
All parameters should also be declared with @everywhere in front of them (see the sketch below)
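For illustration, here is a minimal sketch of the whole pattern (process_row is a hypothetical stand-in for the per-row computation in the question, and on Julia 1.x you also need the Distributed standard library):
using Distributed              # Julia 1.x; on older versions addprocs/@everywhere are in Base
addprocs(4)                    # start some worker processes

@everywhere using DataFrames   # packages must be loaded on every worker

# Functions must be defined on every worker as well.
@everywhere function process_row(iata1, iata2, lat1, lon1, lat2, lon2)
    # hypothetical stand-in for the real per-row work (get_intermediary_points etc.)
    [(iata1, iata2, j, lat1 + lon1 + lat2 + lon2 + j) for j in 0:0.1:1]
end

# Collect the per-row inputs on the master process, then let pmap farm them out to the workers.
rows = [(Raw_Data[i, "iata1"], Raw_Data[i, "iata2"],
         Raw_Data[i, "lat1"],  Raw_Data[i, "long1"],
         Raw_Data[i, "lat2"],  Raw_Data[i, "long2"]) for i in 1:nrow(Raw_Data)]
a = pmap(r -> process_row(r...), rows)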
Now, that's a lot of work. You can follow http://julia.readthedocs.org/en/latest/manual/parallel-computing/ or use stand-alone packages that simplify the process a bit.
Cheers.

Related

Robust Standard Errors in lm() using stargazer()

I have read a lot about the pain of replicating Stata's easy robust option in R to get robust standard errors. I replicated the following approaches: StackExchange and Economic Theory Blog. They work, but the problem I face is that I want to print my results using the stargazer function (which produces the .tex code for LaTeX files).
Here is an illustration of my problem:
reg1 <- lm(rev ~ id + source + listed + country, data = data2_rev)
stargazer(reg1)
This prints the R output as .tex code (with non-robust SEs). If I want to use robust SEs, I can do it with the sandwich package as follows:
vcov <- vcovHC(reg1, "HC1")
If I now use stargazer(vcov), only the output of the vcovHC function is printed and not the regression output itself.
With the lmtest package it is possible to print at least the estimates, but not the number of observations, R2, adjusted R2, residual standard error, and the F-statistic.
lmtest::coeftest(reg1, vcov. = sandwich::vcovHC(reg1, type = 'HC1'))
This gives the following output:
t test of coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.54923    6.85521  -0.3719 0.710611
id           0.39634    0.12376   3.2026 0.001722 **
source       1.48164    4.20183   0.3526 0.724960
country     -4.00398    4.00256  -1.0004 0.319041
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can I get an output that also includes the following statistics?
Residual standard error: 17.43 on 127 degrees of freedom
Multiple R-squared: 0.09676, Adjusted R-squared: 0.07543
F-statistic: 4.535 on 3 and 127 DF, p-value: 0.00469
Did anybody face the same problem, and can you help me out?
How can I use robust standard errors in the lm function and apply the stargazer function?
You already calculated robust standard errors, and there's an easy way to include them in the stargazer output:
library("sandwich")
library("plm")
library("stargazer")
data("Produc", package = "plm")
# Regression (pooled OLS; note that plm's argument is `model`, not `method`)
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
             data = Produc,
             index = c("state", "year"),
             model = "pooling")
# Adjust standard errors
cov1 <- vcovHC(model, type = "HC1")
robust_se <- sqrt(diag(cov1))
# Stargazer output (with and without robust SEs)
stargazer(model, model, type = "text",
          se = list(NULL, robust_se))
Solution found here: https://www.jakeruss.com/cheatsheets/stargazer/#robust-standard-errors-replicating-statas-robust-option
Update: I'm not so much into F-tests. People are discussing those issues, e.g. https://stats.stackexchange.com/questions/93787/f-test-formula-under-robust-standard-error
When you follow http://www3.grips.ac.jp/~yamanota/Lecture_Note_9_Heteroskedasticity:
"A heteroskedasticity-robust t statistic can be obtained by dividing an OLS estimator by its robust standard error (for zero null hypotheses). The usual F-statistic, however, is invalid. Instead, we need to use the heteroskedasticity-robust Wald statistic."
should you then use a Wald statistic here?
This is a fairly simple solution using coeftest:
library(sandwich)
library(lmtest)
reg1 <- lm(rev ~ id + source + listed + country, data = data2_rev)
cl_robust <- coeftest(reg1, vcov = vcovCL, type = "HC1", cluster = ~country)
se_robust <- cl_robust[, 2]
stargazer(reg1, reg1, cl_robust, se = list(NULL, se_robust, NULL))
Note that I only included cl_robust in the output as a verification that the results are identical.

How to calculate number of missing values summed over time dimension in a netcdf file in bash

I have a netCDF file with data as a function of lon, lat and time. I would like to calculate the total number of missing entries in each grid cell, summed over the time dimension, preferably with CDO or NCO so I do not need to invoke R, Python, etc.
I know how to get the total number of missing values
ncap2 -s "nmiss=var.number_miss()" in.nc out.nc
as I answered in this related question:
count number of missing values in netcdf file - R
and CDO can tell me the total summed over space with
cdo info in.nc
but I can't work out how to sum over time. Is there a way, for example, to specify the dimension to sum over with number_miss in ncap2?
We added the missing() function to ncap2 to solve this problem elegantly as of NCO 4.6.7 (May, 2017). To count missing values through time:
ncap2 -s 'mss_val=three_dmn_var_dbl.missing().ttl($time)' in.nc out.nc
Here ncap2 chains two methods together, missing(), followed by a total over the time dimension. The 2D variable mss_val is in out.nc. The response below does the same but averages over space and reports through time (because I misinterpreted the OP).
Old/obsolete answer:
There are two ways to do this with NCO/ncap2, though neither is as elegant as I would like. Either assemble the answer one record at a time by calling number_miss() on each record, or (my preference) use a boolean comparison followed by the total operator along the axes of choice:
zender@aerosol:~$ ncap2 -O -s 'tmp=three_dmn_var_dbl;mss_val=tmp.get_miss();tmp.delete_miss();tmp_bool=(tmp==mss_val);tmp_bool_ttl=tmp_bool.ttl($lon,$lat);print(tmp_bool_ttl);' ~/nco/data/in.nc ~/foo.nc
tmp_bool_ttl[0]=0
tmp_bool_ttl[1]=0
tmp_bool_ttl[2]=0
tmp_bool_ttl[3]=8
tmp_bool_ttl[4]=0
tmp_bool_ttl[5]=0
tmp_bool_ttl[6]=0
tmp_bool_ttl[7]=1
tmp_bool_ttl[8]=0
tmp_bool_ttl[9]=2
or
zender@aerosol:~$ ncap2 -O -s 'for(rec=0;rec<time.size();rec++){nmiss=three_dmn_var_int(rec,:,:).number_miss();print(nmiss);}' ~/nco/data/in.nc ~/foo.nc
nmiss = 0
nmiss = 0
nmiss = 8
nmiss = 0
nmiss = 0
nmiss = 1
nmiss = 0
nmiss = 2
nmiss = 1
nmiss = 2
Even though you asked for a non-Python solution, I would like to show you that it takes only one very short line to find the answer with the help of Python. The variable m_data has exactly the same shape as a variable with missing values read using the netCDF4 package. A single np.sum call with the correct axis specified gives you the answer.
import numpy as np
import matplotlib.pyplot as plt
import netCDF4 as nc4
# Generate random data for this experiment.
data = np.random.rand(365, 64, 128)
# Masked data, this is how the data is read from NetCDF by the netCDF4 package.
# For this example, I mask all values less than 0.1.
m_data = np.ma.masked_array(data, mask=data<0.1)
# It only takes one operation to find the answer.
n_values_missing = np.sum(m_data.mask, axis=0)
# Just a plot of the result.
plt.figure()
plt.pcolormesh(n_values_missing)
plt.colorbar()
plt.xlabel('lon')
plt.ylabel('lat')
plt.show()
# Save a netCDF file of the results.
f = nc4.Dataset('test.nc', 'w', format='NETCDF4')
f.createDimension('lon', 128)
f.createDimension('lat', 64 )
n_values_missing_nc = f.createVariable('n_values_missing', 'i4', ('lat', 'lon'))
n_values_missing_nc[:,:] = n_values_missing[:,:]
f.close()

Julia: How to copy data to another processor in Julia

How do you move data from one processor to another in Julia?
Say I have an array
a = [1:10]
Or some other data structure. What is the proper way to put it on all other available processors so that it will be available on those processors as the same variable name?
I didn't know how to do this at first, so I spent some time figuring it out.
Here are some functions I wrote to pass objects:
sendto
Send an arbitrary number of variables to specified processes.
New variables are created in the Main module on specified processes. The
name will be the key of the keyword argument and the value will be the
associated value.
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
function sendto(ps::Vector{Int}; args...)
    for p in ps
        sendto(p; args...)
    end
end
Examples
# creates an integer x and Matrix y on processes 1 and 2
sendto([1, 2], x=100, y=rand(2, 3))
# create a variable here, then send it everywhere else
z = randn(10, 10); sendto(workers(), z=z)
getfrom
Retrieve an object defined in an arbitrary module on an arbitrary
process. Defaults to the Main module.
The name of the object to be retrieved should be a symbol.
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))
Examples
# get an object named x from the Main module on process 2 and name it x here
x = getfrom(2, :x)
passobj
Pass an arbitrary number of objects from one process to arbitrary
processes. The variable must be defined in the from_mod module of the
src process and will be copied under the same name to the to_mod
module on each target process.
function passobj(src::Int, target::Vector{Int}, nm::Symbol;
                 from_mod=Main, to_mod=Main)
    r = RemoteRef(src)
    @spawnat(src, put!(r, getfield(from_mod, nm)))
    for to in target
        @spawnat(to, eval(to_mod, Expr(:(=), nm, fetch(r))))
    end
    nothing
end
function passobj(src::Int, target::Int, nm::Symbol; from_mod=Main, to_mod=Main)
    passobj(src, [target], nm; from_mod=from_mod, to_mod=to_mod)
end

function passobj(src::Int, target, nms::Vector{Symbol};
                 from_mod=Main, to_mod=Main)
    for nm in nms
        passobj(src, target, nm; from_mod=from_mod, to_mod=to_mod)
    end
end
Examples
# pass variable named x from process 2 to all other processes
passobj(2, filter(x->x!=2, procs()), :x)
# pass variables t, u, v from process 3 to process 1
passobj(3, 1, [:t, :u, :v])
# Pass a variable from the `Foo` module on process 1 to Main on workers
passobj(1, workers(), [:foo]; from_mod=Foo)
Use @eval @everywhere ... and escape (interpolate) the local variable, like this:
julia> a=collect(1:3)
3-element Array{Int64,1}:
1
2
3
julia> addprocs(1)
1-element Array{Int64,1}:
2
julia> @eval @everywhere a=$a
julia> @fetchfrom 2 a
3-element Array{Int64,1}:
1
2
3
Just so everyone here knows, I put these ideas together into a package ParallelDataTransfer.jl for this. So you just need to do
using ParallelDataTransfer
(after installing) in order to use the functions mentioned in the answers here. Why? These functions are pretty useful! I added some testing, some new macros, and updated them a bit (they pass on v0.5, fail on v0.4.x). Feel free to put in pull requests to edit these and add more.
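A quick usage sketch (assuming the package keeps the same sendto/getfrom/passobj signatures defined in the answers above):
using Distributed                  # Julia 1.x; on older versions addprocs is in Base
addprocs(2)
using ParallelDataTransfer

z = randn(3, 3)
sendto(workers(), z = z)           # define z in Main on every worker
z2 = getfrom(2, :z)                # fetch it back by name from process 2
passobj(2, workers()[end], :z)     # copy z from process 2 to the last worker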
To supplement @spencerlyon2's answer, here are some macros:
function sendtosimple(p::Int, nm, val)
    ref = @spawnat(p, eval(Main, Expr(:(=), nm, val)))
end

macro sendto(p, nm, val)
    return :( sendtosimple($p, $nm, $val) )
end

macro broadcast(nm, val)
    quote
        @sync for p in workers()
            @async sendtosimple(p, $nm, $val)
        end
    end
end
The @spawnat macro binds a value to a symbol on a particular process:
julia> @sendto 2 :bip pi/3
RemoteRef{Channel{Any}}(9,1,5340)
julia> @fetchfrom 2 bip
1.0471975511965976
The @broadcast macro binds a value to a symbol on all processes except 1 (I found that including process 1 made future expressions using the name pick up the copy from process 1):
julia> @broadcast :bozo 5
julia> @fetchfrom 2 bozo
5
julia> bozo
ERROR: UndefVarError: bozo not defined
julia> bozo = 3 #these three lines are why I exclude pid 1
3
julia> @fetchfrom 7 bozo
3
julia> @fetchfrom 7 Main.bozo
5

Very simple setting of array cell values: program very slow when it writes to a specific column

I am working on an old professional program that is still in continuous use. My program builds several simple data arrays and writes the values to Excel cells like this:
Sheets("toto").Cells(4,i) = "blabla"
But for one value of i, the write time is very long and I don't understand why.
Here is my code:
...
For No_Bug = 0 To Indtab - 1
    If mesComments(No_Bug) <> "" Then
        Sheets(feuille_LBT).Cells(Ligne_Bug, 1) = Ligne_Bug - 5
        Sheets(feuille_LBT).Cells(Ligne_Bug, 2) = mesID_Test(No_Bug)
        Sheets(feuille_LBT).Cells(Ligne_Bug, 3) = mesResultats(No_Bug)
        Sheets(feuille_LBT).Cells(Ligne_Bug, 4) = mesComments(No_Bug)
        Sheets(feuille_LBT).Cells(Ligne_Bug, 5).FormulaLocal = mesScreens(No_Bug)
        Sheets(feuille_LBT).Cells(Ligne_Bug, 6) = 2 'If I comment out only this line the program is fast; otherwise it is very slow (~1-2 seconds per loop iteration). Why?
        Sheets(feuille_LBT).Cells(Ligne_Bug, 7) = 1
    End If
...
Is this cell referenced by other cells? Check whether any complicated computations depend on this cell; formulas that recalculate every time you write to that column would explain the slowdown.

idata.frame: Why error "is.data.frame(df) is not TRUE"?

I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp,
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
With idata.frame, I get Error: is.data.frame(df) is not TRUE:
library(plyr)
df.median<-ddply(idata.frame(exp),
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
So, I thought, perhaps it is my data. So I tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow). But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball),
.(id,year,team),
numcolwise(median),
na.rm=TRUE)
Error: is.data.frame(df) is not TRUE
Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
Given that you are working with 'big' data and looking for performance, this seems a perfect fit for data.table.
Specifically, the lapply(.SD, FUN) idiom and the .SDcols argument, combined with by.
Set up the data.table:
library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Which columns are numeric
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median),
                by = list(groupname, starttime, fPhase, fCycle),
                .SDcols = numeric_columns]
Some benchmarking:
library(rbenchmark)
benchmark(data.table = DT[, lapply(.SD, median),
                          by = list(groupname, starttime, fPhase, fCycle),
                          .SDcols = numeric_columns],
          plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle),
                       numcolwise(median), na.rm = TRUE),
          idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle),
                             function(x) data.frame(inadist = median(x$inadist),
                                 smldist = median(x$smldist), lardist = median(x$lardist),
                                 inadur = median(x$inadur), smldur = median(x$smldur),
                                 lardur = median(x$lardur), emptyct = median(x$emptyct),
                                 entct = median(x$entct), inact = median(x$inact),
                                 smlct = median(x$smlct), larct = median(x$larct),
                                 na.rm = TRUE)),
          aggregate = aggregate(exp[, numeric_columns],
                                exp[, c("groupname", "starttime", "fPhase", "fCycle")],
                                median),
          replications = 5)
##         test replications elapsed relative user.self
## 4  aggregate            5    5.42    1.789      5.30
## 1 data.table            5    3.03    1.000      3.03
## 3 idataframe            5   11.81    3.898     11.77
## 2       plyr            5    9.47    3.125      9.45
Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
In any case, this cuts about 20% off the time (on my system):
system.time(df.median <- ddply(exp, .(groupname, starttime, fPhase, fCycle),
    function(x) data.frame(
        inadist = median(x$inadist),
        smldist = median(x$smldist),
        lardist = median(x$lardist),
        inadur  = median(x$inadur),
        smldur  = median(x$smldur),
        lardur  = median(x$lardur),
        emptyct = median(x$emptyct),
        entct   = median(x$entct),
        inact   = median(x$inact),
        smlct   = median(x$smlct),
        larct   = median(x$larct),
        na.rm   = TRUE))
)
Shane asked you in another post whether you could cache the results of your script. I don't really know your workflow, but it may be best to set up a cron job to run this and store the results, daily/hourly, whatever fits.
