How to make H2O driverless AI use more cores on CPU? - cpu

My machine has 20 cores on its CPU, but when running Driverless AI, it uses only 4 of them.How can I make it use more cores for faster results?

It will depend on your setup and what version of DAI you are using, but you can specify the number of CPUs you want to use in the config.toml file. For your convenience I have pasted the relevant section of the toml file below as well as provided the documentation link that includes details for setting up this file.
## Hardware: Configure hardware settings here (GPUs, CPUs, Memory, etc.)
# Max number of CPU cores to use per experiment. Set to <= 0 to use all cores.
# One can also set environment variable "OMP_NUM_THREADS" to number of cores to use for OpenMP
# e.g. In bash: export OMP_NUM_THREADS=32 and export OPENBLAS_NUM_THREADS=32
#Set to -1 for all available cores.
#max_cores = -1
documentation link

Related

Setting Hugging Face dataloader_num_workers for multi-GPU training

Should the HuggingFace transformers TrainingArguments dataloader_num_workers argument be set per GPU? Or total across GPUs? And does this answer change depending whether the training is running in DataParallel or DistributedDataParallel mode?
For example if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any expected value in setting dataloader_num_workers greater than 12 (48 / 4)? Or would they all start contending over the same resources?
As I understand when running in DDP mode (with torch.distributed.launch or similar), one training process manages each device, but in the default DP mode one lead process manages everything. So maybe the answer to this is 12 for DDP but ~47 for DP?

Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings

I'm confused about how multiple launches of same python command bind to cores on a NUMA Xeon machine.
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process. So if I ran numactl --physcpubind=4-7 --membind=0 python -u test.py with OMP_NUM_THREADS=4 on a hyperthreaded HT machine (lscpu output below) it'd limit the this numactl process to 4 threads.
But since machine has HT, it's not clear to me if 4-7 in the above are 4 physical or 4 logical.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
Where is OpenMP library getting invoked in binding threads to physical cores?
(from my limited understanding I could just launch command python main.py in a sh shell 20 times with different numactl bindings and OMP_NUM_THREADS still applies, even though I didn't explicitly use MPI lib anywhere, is that correct?)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 9242 CPU # 2.30GHz
Stepping: 7
Frequency boost: enabled
CPU MHz: 1000.026
CPU max MHz: 2301,0000
CPU min MHz: 1000,0000
BogoMIPS: 4600.00
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 96 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-23,96-119
NUMA node1 CPU(s): 24-47,120-143
NUMA node2 CPU(s): 48-71,144-167
NUMA node3 CPU(s): 72-95,168-191
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process.
numactl do not launch threads. It controls NUMA policy for processes or shared memory. However, OpenMP runtimes may adapt the number of threads created by a region based on the environment set by numactl (although AFAIK this behaviour is undefined by the standard). You should use the environment variable OMP_NUM_THREADS to set the number of threads. You can check the OpenMP configuration using the environment variable OMP_DISPLAY_ENV.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
This is a bit complex. Physical IDs are the ones available in /proc/cpuinfo. They are not guaranteed to stay the same over time (eg. they can change when the machine is restarted) nor "intuitive" (ie. following rules like being contiguous for threads/cores close to each other). One should avoid hard-coding them manually. e.g. a BIOS update or kernel update might lead to enumerating logical cores in a different order.
You can use the great tool hwloc to convert well-defined deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two threads sharing the same core (although this is probably true here for your processor, where it looks like the kernel enumerated one logical core from each physical core as cores 0..95, then 96..191 for the other logical core on each physical core). The other common possibility is for Linux to do both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
--physcpubind of numctl accepts physical cpu numbers as shown in the "processor" fields of /proc/cpuinfo regarding the documentation. Thus, 4-7 here should be interpreted as physical thread IDs. Two threads IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).
Where is OpenMP library getting invoked in binding threads to physical cores?
AFAIK, this is implementation dependent of the OpenMP runtime used (eg. GOMP, IOMP, etc.). The initialization of the OpenMP runtime is often done lazily when the first parallel section is encountered. For the binding, some runtimes read /proc/cpuinfo manually while some other use hwloc. If you want deterministic bindings, then you should use the OMP_PLACES and OMP_PROC_BIND environment variables to tell the runtime to bind threads using a custom user-defined method and not the default one.
If you want to be safe and portable, use the following configuration (using Bash):
OMP_NUM_THREADS=4
OMP_PROC_BIND=TRUE
OMP_PLACES={$(hwloc-calc --physical-output --sep "},{" --intersect PU core:all.pu:0)}
The OpenMP threads will be scheduled on OpenMP places. The above configuration configure the OpenMP runtime so that there will be 4 threads statically map on 4 different fixed cores.

Different profiling modes for different cores using perf

I have the following questions regarding perf.
a) Is it possible that I run different profiling modes on different cores simultaneously. e.g. Core 0 with event based sampling (sampling every N events) and Core 1 with free running counter based sampling?
b) In case a) is not possible. Then is it possible to get a snapshot of the PMU counters on the other cores (Core 1) for every sample (overflow at N events) on Core 0?
P.S: The platform is a RPi 3b+ based on the Arm Cortex A53
It is possible to operate different profiling modes on different cores of the CPU simultaneously.
perf also has a processor-wide mode wherein all threads running on the designated processors are monitored. Counts and samples are thus aggregated per CPU/core.
-C, --cpu=
Count only on the list of CPUs provided. Multiple CPUs can be
provided as a comma-separated list with no space: 0,1. Ranges of
CPUs are specified with -: 0-2. In per-thread mode, this option
is ignored. The -a option is still necessary to activate
system-wide monitoring. Default is to count on all CPUs.
Running both the free-running counter as well as the sampling mechanism of perf simultaneously, is possible on different cores of the CPU like below -
eg. for cpu 0:
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000
and for cpu 1:
perf record --cpu 1 sleep 20

A fast solution to obtain the best ARIMA model in R (function `auto.arima`)

I have a data series composed by 2775 elements:
mean(series)
[1] 21.24862
length(series)
[1] 2775
max(series)
[1] 81.22
min(series)
[1] 9.192
I would like to obtain the best ARIMA model by using function auto.arima of package forecast:
library(forecast)
fit=auto.arima(Netherlands,stepwise=F,approximation = F)
But I am having a big problem: RStudio is running for an hour and a half without results. (I developed an R code to perform these calculations, employed on a Windows machine equipped with a 2.80GHz Intel(R) Core(TM) i7 CPU and 16.0 GB RAM.) I suspect that this is due to the length of time series. A solution could be the parallelization? (But I don't know how apply it).
Anyway, suggestions to speed this code? Thanks!
The forecast package has many of its functions built with parallel processing in mind. One of the arguments of the auto.arima() function is 'parallel'.
According to the package documentation, "If [parallel = ] TRUE and stepwise = FALSE, then the specification search is done in parallel.This can give a significant speedup on mutlicore machines."
If parallel = TRUE, it will automatically select how many 'cores' to use (for a laptop or desktop, it is often the number of cores * 2. For example, I have 4 cores and each core has 2 processors = 8 'cores'). If you want to manually set the number of cores, also use the argument num.cores.
I'd recommend checking out the e-book written by Hyndman all about the package. It is like a time-series forecasting bible.

Mapping logical processors to physical processors

On a dual quad-core GetProcessAffinityMask (or the dialog from "Set affinity" in taskman.exe) will report eight logical processors. How do I find out which logical processor is on which physical processor? Especially: which logical processors are on the same physical processor?
EDIT: If it is not possible to do this programmatically, do anyone just know what the normal mapping is? Are the first four on the first processor and the second four on the second or are the odd numbered on the first and the even numbered on the second?
You can use Win32_Processor WMI class to query the number of cores, number of logical processors, architecture, cache memory and other information about the CPUs on the system.
To query information about the relationship between the logical processors in a system, you can use GetLogicalProcessorInformation API function.
In case you don't want to write the code yourself, SysInternal's handy coreinfo utility comes closest to answering your questions. It implements GetLogicalProcessorInformation as Mehrdad recommends. For a Xeon E5640 (quad core, 8 threads), you get from coreinfo:
c:\App\SysInternals>Coreinfo.exe -c
Coreinfo v3.0 - Dump information on system CPU and memory topology
Copyright (C) 2008-2011 Mark Russinovich
Sysinternals - www.sysinternals.com
Logical to Physical Processor Map:
**------ Physical Processor 0 (Hyperthreaded)
--**---- Physical Processor 1 (Hyperthreaded)
----**-- Physical Processor 2 (Hyperthreaded)
------** Physical Processor 3 (Hyperthreaded)
There are 8 * for the 8 hyperthreads, two per core, as expected for this chip. What's not clear, though, is how the arrangement of * matches up with the list of logical processors as Windows presents them. For instance, Task Manager gives me a dialog for assigning the processor affinity, labeled CPU 0 through CPU 7, for any process. It's fair (but not necessary) to assume that you can take coreinfo's output and number the logical processors left-to-right. So "CPU 5" would be the second hyperthread running on physical processor 2.
The numbering is done in a sequential manner: first all physical cores followed by the logical cores [1] .
[1] CPU Numbering on a hypertheading enabled system

Resources