Obtain the number of CPU cores in Julia - parallel-processing

I want to obtain the number of cores available in Julia. Currently I am doing the following:
using PyCall
@pyimport psutil
nCores = psutil.cpu_count()
This calls a Python function. I would like, however, to use some Julia procedure. How can it be done?

Sys.CPU_CORES is not defined in Julia v1.1.0. However, the following does the job.
length(Sys.cpu_info())
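If you want something that works across Julia versions, here is a small sketch (the helper name logical_cpu_count is my own, not part of Base):
# Hypothetical helper: logical CPU count across Julia versions
function logical_cpu_count()
    if isdefined(Sys, :CPU_THREADS)       # Julia 0.7 and later
        return Sys.CPU_THREADS
    elseif isdefined(Sys, :CPU_CORES)     # Julia 0.6 and earlier
        return Sys.CPU_CORES
    else
        return length(Sys.cpu_info())     # fallback, works everywhere
    end
end
logical_cpu_count()  # e.g. 8 on a 4-core machine with hyper-threading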

I'm not 100% certain about this, but CPU_CORES returns the number of (hyper-threaded) logical cores on my machine (OS X 10.9.5 and Julia 0.3.5), even when I start Julia in serial mode. I've been checking the number of available cores using nworkers() and nprocs(). When starting Julia without the -p flag, both return 1.
When I start julia as julia -p 4
julia> nprocs()
5
julia> nworkers()
4
In both cases CPU_CORES returns 8.
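To illustrate the relationship (a sketch for Julia 1.x, where these functions live in the Distributed standard library):
using Distributed      # Julia 1.x; in 0.x these functions are available without the import
addprocs(4)            # add 4 worker processes, equivalent to starting with julia -p 4
nprocs()               # 5: the master process plus the 4 workers
nworkers()             # 4: only the workers
Sys.CPU_THREADS        # hardware logical CPUs, unaffected by -p / addprocs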

In recent versions of Julia, you can use Sys.CPU_CORES (and not Base.CPU_CORES as some answers mentioned). Tested on 0.6.

According to the documentation, the number of threads available to Julia is controlled by the JULIA_NUM_THREADS environment variable.
To see the number of threads available to Julia, use
Threads.nthreads()
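For example (a sketch; the environment variable must be set before Julia starts):
# In bash: export JULIA_NUM_THREADS=4, then start julia and check:
Threads.nthreads()   # number of threads Julia was started with, e.g. 4
Sys.CPU_THREADS      # number of logical CPUs the hardware reports (Julia 0.7+)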

Sys.CPU_CORES is undefined in Julia 1.0.0 (at least when running on a MacBook, though I don't imagine that would make a difference). Instead, use Sys.CPU_THREADS.

I don't know Julia but "psutil.cpu_count(logical=False)" in Python gives you the number of physical CPUs (hyper threaded ones are not counted).
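If you want the physical-core count from within Julia itself, the third-party Hwloc.jl package can report it (a sketch; assumes Hwloc.jl is installed and that num_physical_cores is provided by the version you have):
using Hwloc                  # third-party package: ] add Hwloc
Hwloc.num_physical_cores()   # physical cores; hyper-threaded siblings are not counted
Sys.CPU_THREADS              # logical CPUs, for comparison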

Related

A fast solution to obtain the best ARIMA model in R (function `auto.arima`)

I have a data series composed of 2775 elements:
mean(series)
[1] 21.24862
length(series)
[1] 2775
max(series)
[1] 81.22
min(series)
[1] 9.192
I would like to obtain the best ARIMA model by using function auto.arima of package forecast:
library(forecast)
fit=auto.arima(Netherlands,stepwise=F,approximation = F)
But I am having a big problem: RStudio has been running for an hour and a half without results. (I developed R code to perform these calculations on a Windows machine equipped with a 2.80 GHz Intel(R) Core(TM) i7 CPU and 16.0 GB of RAM.) I suspect that this is due to the length of the time series. Could parallelization be a solution? (But I don't know how to apply it.)
Anyway, any suggestions to speed up this code? Thanks!
The forecast package has many of its functions built with parallel processing in mind. One of the arguments of the auto.arima() function is 'parallel'.
According to the package documentation, "If [parallel = ] TRUE and stepwise = FALSE, then the specification search is done in parallel. This can give a significant speedup on multicore machines."
If parallel = TRUE, it will automatically select how many 'cores' to use. For a laptop or desktop this is often the number of physical cores times 2; for example, I have 4 cores and each core has 2 hardware threads, giving 8 'cores'. If you want to set the number of cores manually, also use the argument num.cores.
I'd recommend checking out the e-book written by Hyndman all about the package. It is like a time-series forecasting bible.

How to make H2O driverless AI use more cores on CPU?

My machine has 20 cores on its CPU, but when running Driverless AI, it uses only 4 of them. How can I make it use more cores for faster results?
It will depend on your setup and what version of DAI you are using, but you can specify the number of CPUs you want to use in the config.toml file. For your convenience I have pasted the relevant section of the toml file below as well as provided the documentation link that includes details for setting up this file.
## Hardware: Configure hardware settings here (GPUs, CPUs, Memory, etc.)
# Max number of CPU cores to use per experiment. Set to <= 0 to use all cores.
# One can also set environment variable "OMP_NUM_THREADS" to number of cores to use for OpenMP
# e.g. In bash: export OMP_NUM_THREADS=32 and export OPENBLAS_NUM_THREADS=32
#Set to -1 for all available cores.
#max_cores = -1
documentation link

Is bash in windows implemented differently from native bash, specifically for loops

I ran the following command, in an ad hoc fashion, on Macs in a Mac store:
time for x in {1..5000000}; do if ! (($x % 10000)); then echo $x; fi done
to perform a very rudimentary benchmark. What this does is create the list 1 to 5000000, check whether each number is divisible by 10000, and print it if it is; time then measures how long the whole process takes. I've been getting around 40 seconds on MacBook Airs and 32 on Pros, all with 8th-gen Intel processors. A particular pattern I noticed is that it freezes for a long time before printing anything out; presumably this is because it is creating the list from 1 to 5000000 and putting it in memory.
However, my friend who uses Windows reported faster times, on the order of 15 seconds, on a 5th-gen Core m processor with the native bash shell on Windows 10. I suspect this is because the Windows bash treats for x in {1..5000000} as a generator; that way the list is never materialized in memory, everything only needs to be stored in cache, and it achieves greater speed. Can anyone confirm whether for loops in bash are implemented the same or differently across the Windows implementation and the Linux/Mac implementations?

Parallel computing in Julia - running a simple for-loop on multiple cores

For starters, I have to say I'm completely new to parallel computing (and know close to nothing about computer science), so my understanding of what things like "workers" or "processes" actually are is very limited. I do, however, have a question about running in parallel a simple for-loop that presumably has no dependencies between iterations.
Let's say I wanted to do the following:
for N in 1:5:20
println("The N of this iteration in $N")
end
If I simply wanted these messages to appear on screen and the order of appearance didn't matter, how could one achieve this in Julia 0.6, and for future reference in Julia 0.7 (and therefore 1.0)?
Just to add an example to Chris's answer. Since the release of Julia 1.3 you can do this easily with Threads.@threads:
Threads.@threads for N in 1:5:20
println("The number of this iteration is $N")
end
Here you are running only one julia session with multiple threads instead of using Distributed where you run multiple julia sessions.
See, e.g. multithreading blog post for more information.
Distributed Processing
Start julia with e.g. julia -p 4 if you want to use 4 CPUs (or use the function addprocs(4)). In Julia 1.x, you write a parallel loop as follows:
using Distributed
@distributed for N in 1:5:20
println("The N of this iteration in $N")
end
Note that every process has its own variables by default.
For any serious work, have a look at the manual https://docs.julialang.org/en/v1.4/manual/parallel-computing/, in particular the section about SharedArrays.
Other options for distributed computing are the function pmap or the package MPI.jl.
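For example, a minimal pmap sketch (slow_square is just a toy work function made up for illustration):
using Distributed
addprocs(4)                                       # or start julia with -p 4
@everywhere slow_square(x) = (sleep(0.1); x^2)    # define the function on all workers
results = pmap(slow_square, 1:20)                 # distributes the calls over the workers
println(results)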
Threads
Since Julia 1.3, you can also use Threads as noted by wueli.
Start julia with e.g. julia -t 4 to use 4 threads (the -t flag requires Julia 1.5 or newer). Alternatively, you can set the environment variable JULIA_NUM_THREADS before starting julia.
For example Linux/Mac OS:
export JULIA_NUM_THREADS=4
On Windows, you can use set JULIA_NUM_THREADS=4 at the cmd prompt.
Then in julia:
Threads.@threads for N = 1:20
println("N = $N (thread $(Threads.threadid()) out of $(Threads.nthreads()))")
end
All CPUs are assumed to have access to shared memory in the examples above (e.g. "OpenMP style" parallelism) which is the common case for multi-core CPUs.
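As a small illustration of that shared-memory model, each thread can write to its own slot of an ordinary array and the results are combined afterwards (a sketch; start Julia with more than one thread to see a speedup):
using Base.Threads
partial = zeros(nthreads())        # one accumulator slot per thread (shared memory)
@threads for N in 1:1000
    partial[threadid()] += N       # each thread only touches its own slot
end
total = sum(partial)               # 500500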

Performance penalty of persistent variables in MATLAB

Recently I profiled some MATLAB code and I was shocked to see the following in a heavily used function:
5.76 198694 58 persistent CONSTANTS;
3.44 198694 59 if isempty(CONSTANTS) % initialize CONSTANTS
In other words, MATLAB spent about 9 seconds, over 198694 function calls, declaring the persistent CONSTANTS and checking if it has been initialized. That represents 13% of the total time spent in that function.
Do persistent variables really carry that much of a performance penalty in MATLAB? Or are we doing something terribly wrong here?
UPDATE
@Andrew I tried your sample script and I am very, very perplexed by the output:
time calls line
6 function has_persistent
6.48 200000 7 persistent CONSTANTS
1.91 200000 8 if isempty(CONSTANTS)
9 CONSTANTS = 42;
10 end
I tried the bench() command and it showed my machine in the middle range of the sample machines. Running Ubuntu 64 bits on a Intel(R) Core(TM) i7 CPU, 4GB RAM.
That's the standard way of using persistent variables in Matlab. You're doing what you're supposed to. There will be noticeable overhead for it, but your timings do seem surprisingly high.
Here's a similar test I ran in 32-bit Matlab R2009b on a 3.0 GHz Intel Core 2 QX9650 machine under Windows XP x64. Similar results on other machines and versions. About 5x faster than your timings.
Test:
function call_has_persistent
    for i = 1:200000
        has_persistent();
    end

function has_persistent
    persistent CONSTANTS
    if isempty(CONSTANTS)
        CONSTANTS = 42;
    end
Results:
0.89 200000 7 persistent CONSTANTS
0.25 200000 8 if isempty(CONSTANTS)
What Matlab version, OS, and CPU are you running on? What does CONSTANTS get initialized with? Does Matlab's bench() output seem reasonable for your machine?
Your timings do seem high. There may be a bug or config issue there to fix. But if you really want to get Matlab code fast, the standard advice is to "vectorize" it: restructure the code so that it makes fewer function calls on larger input arrays, and makes use of Matlab's built in vectorized functions instead of loops or control structures, to avoid having 200,000 calls to the function in the first place. If possible. Matlab has relatively high overhead per function or method call (see Is MATLAB OOP slow or am I doing something wrong? for some numbers), so you can often get more mileage by refactoring to eliminate function calls instead of making the individual function calls faster.
It may be worth benchmarking some other basic Matlab operations on your machine, to see if it's just "persistent" that seems slow. Also try profiling just this little call_has_persistent test script in isolation to see if the context of your function makes a difference.
