How to solve UPC Runtime error: out of shared memory

I am trying to run a Berkeley UPC code on a computer with 64 cores and 256 GB RAM. However, the code fails to run because it cannot find enough memory. The following should work, because 51 x 5 = 255 GB < 256 GB:
upcrun -n 51 -shared-heap=5GB xcorupc_sac inputpgas_sac{$rc1}.txt
..
UPCR: UPC thread 3 of 51 on range (pshm node 0 of 1, process 3 of 51, pid=191914)
UPCR: UPC thread 16 of 51 on range (pshm node 0 of 1, process 16 of 51, pid=191927)
UPC Runtime warning: Requested shared memory (5120 MB) > available (2515 MB) on node 0 (range): using 2515 MB per thread instead
UPC Runtime error: out of shared memory
Local shared memory in use: 1594 MB per-thread, 81340 MB total
Global shared memory in use: 0 MB per-thread, 1 MB total
Total shared memory limit: 2515 MB per-thread, 128281 MB total
upc_alloc unable to service request from thread 0 for 1672245248 more bytes
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
I don't understand why the total shared memory limit is 128 GB, which is half of the physical memory present. I cannot override it even with the -shared-heap flag, where I am clearly asking for 5 GB per thread. Any suggestions?
cat /proc/meminfo
MemTotal: 263378836 kB
The UPC build was configured with the flag --with-sptr-packed-bits=20,9,35, which allows up to 2^35 bytes = 32 GB of shared memory per thread.
EDIT1: The following is the output of the command upcc --version:
[avinash@range jointinvsurf5_cajoint_compile]$ upcc --version
This is upcc (the Berkeley Unified Parallel C compiler), v. 2019.4.4
(getting remote translator settings...)
----------------------+---------------------------------------------------------
UPC Runtime | v. 2019.4.4, built on Feb 11 2020 at 23:31:40
----------------------+---------------------------------------------------------
UPC-to-C translator | v. 2.28.0, built on Jul 19 2018 at 20:29:47
| host aphid linux-x86_64/64
| gcc v4.2.4 (Ubuntu 4.2.4-1ubuntu4)
----------------------+---------------------------------------------------------
Translator location | http://upc-translator.lbl.gov/upcc-2019.4.0.cgi
----------------------+---------------------------------------------------------
networks supported | smp udp mpi ibv
----------------------+---------------------------------------------------------
default network | ibv
----------------------+---------------------------------------------------------
pthreads support | available (if used, default is 2 pthreads per process)
----------------------+---------------------------------------------------------
Configured with | '--with-translator=http://upc-translator.lbl.gov/upcc-2
| 019.4.0.cgi' '--with-sptr-packed-bits=20,9,35'
| '--prefix=/usr/local/berkeley_upc/opt'
| '--with-multiconf-magic=opt'
----------------------+---------------------------------------------------------
Configure features | trans_bupc,pragma_upc_code,driver_upcc,runtime_upcr,
| gasnet,upc_collective,upc_io,upc_memcpy_async,
| upc_memcpy_vis,upc_ptradd,upc_thread_distance,upc_tick,
| upc_sem,upc_dump_shared,upc_trace_printf,
| upc_trace_mask,upc_local_to_shared,upc_all_free,
| upc_atomics,pupc,upc_types,upc_castable,upc_nb,nodebug,
| notrace,nostats,nodebugmalloc,nogasp,nothrille,
| segment_fast,os_linux,cpu_x86_64,cpu_64,cc_gnu,
| packedsptr,upc_io_64
----------------------+---------------------------------------------------------
Configure id | range Tue Feb 11 23:18:39 PST 2020 gnome-initial-setup
----------------------+---------------------------------------------------------
Binary interface | 64-bit x86_64-unknown-linux-gnu
----------------------+---------------------------------------------------------
Runtime interface # | Runtime supports 3.0 -> 3.13: Translator uses 3.6
----------------------+---------------------------------------------------------
| --- BACKEND SETTINGS (for ibv network) ---
----------------------+---------------------------------------------------------
C compiler | /usr/bin/gcc
| GNU/4.8.5/4.8.5 20150623 (Red Hat 4.8.5-39)
| gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39) Copyright
| (C) 2015 Free Software Foundation, Inc.
----------------------+---------------------------------------------------------
C compiler flags | -O3 --param max-inline-insns-single=35000 --param
| inline-unit-growth=10000 --param
| large-function-growth=200000 -Wno-unused
| -Wunused-result -Wno-unused-parameter -Wno-address
| -std=gnu99
----------------------+---------------------------------------------------------
linker | /data/seismo82/avinash/Programs/openmpiinstall/bin/mpic
| c
| GNU/4.8.5/4.8.5 20150623 (Red Hat 4.8.5-39)
| gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39) Copyright
| (C) 2015 Free Software Foundation, Inc.
----------------------+---------------------------------------------------------
linker flags | -D_GNU_SOURCE=1 -O3 --param
| max-inline-insns-single=35000 --param
| inline-unit-growth=10000 --param
| large-function-growth=200000 -Wno-unused
| -Wunused-result -Wno-unused-parameter -Wno-address
| -std=gnu99 -L/data/seismo82/avinash/Programs/myupc/opt
| -L/data/seismo82/avinash/Programs/myupc/opt/umalloc
| -lupcr-ibv-seq -lumalloc
| -L/data/seismo82/avinash/Programs/myupc/opt/gasnet/ibv-
| conduit -lgasnet-ibv-seq -libverbs -lpthread -lrt
| -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -lgcc -lm
----------------------+---------------------------------------------------------
EDIT2: The following is the output of the df -h /dev/shm command:
[avinash@range jointinvsurf5_cajoint_compile]$ df -h /dev/shm
Filesystem Size Used Avail Use% Mounted on
tmpfs 126G 21M 126G 1% /dev/shm

By default, Berkeley UPC uses kernel shared memory services to cross-map the UPC shared segments between co-located processes. For smp-conduit, this is the only mode of operation.
Assuming this is a Linux system with configure defaults, the most likely explanation is exhaustion of the kernel-provided POSIX shared memory space. You can confirm this by looking at the virtual file system where that resides. Here's an example from a system configured for up to 20G of shared memory:
$ df -h /dev/shm /var/shm /run/shm
df: '/var/shm': No such file or directory
df: '/run/shm': No such file or directory
Filesystem Size Used Avail Use% Mounted on
tmpfs 20G 504K 20G 1% /dev/shm
This value limits the total per-node shared memory segment space. This limit can usually be raised by an administrator adjusting kernel settings, although the details vary with distribution.
For more info, see the section 'System Settings for POSIX Shared Memory' in https://gasnet.lbl.gov/dist-ex/README
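As an illustration only (the 220G size below is an example, and the exact mechanism varies by distribution), an administrator could enlarge the tmpfs mounted at /dev/shm like this:
# temporarily grow the shared-memory filesystem (requires root); 220G is just an example size
mount -o remount,size=220G /dev/shm
# to make the change persistent across reboots, add/adjust the tmpfs entry in /etc/fstab, e.g.:
#   tmpfs  /dev/shm  tmpfs  defaults,size=220G  0 0
df -h /dev/shm   # verify the new size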
Finally, note that even once the above issue is addressed, asking for 255 GB of shared memory heap on a system with 256 GB of physical DRAM (99.6%) may be inadvisable. This leaves very little space for the non-shared portions of application memory (stack, static data, malloc heap) and for the memory overheads of the kernel and daemon processes. Depending on your kernel settings, this may trigger the kernel's out-of-memory killer to start killing processes. We generally recommend a safe rule-of-thumb limit of 85% of physical memory (assuming the system is otherwise idle), and "proceed with caution" beyond that.
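Applied to the numbers in the question, a more conservative request (a sketch, not a tuned value) would be roughly 4 GB per thread, since 51 x 4 = 204 GB is about 80% of the 256 GB of physical memory:
upcrun -n 51 -shared-heap=4GB xcorupc_sac inputpgas_sac{$rc1}.txt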

Related

What is "hard_xmit called while tx busy"?

I am trying to understand some kernel messages related to the CAN bus driver for the mcp251x.
In syslog I have several instances of
hard_xmit called while tx busy
and finally
mcp251x spi0.0 can0: bus-off
What is hard_xmit and what causes it?
uname -a
Linux cilix-19 5.15.32-v7l+ #1538 SMP Thu Mar 31 19:39:41 BST 2022 armv7l GNU/Linux
dmesg | grep model
[ 0.000000] OF: fdt: Machine model: Raspberry Pi Compute Module 4 Rev 1.0

Intel MPI benchmark fails when # bytes > 128: IMB-EXT

I just installed Linux and Intel MPI on two machines:
(1) Quite old (~8 years old) SuperMicro server, which has 24 cores (Intel Xeon X7542 X 4). 32 GB memory.
OS: CentOS 7.5
(2) New HP ProLiant DL380 server, which has 32 cores (Intel Xeon Gold 6130 X 2). 64 GB memory.
OS: OpenSUSE Leap 15
After installing the OS and Intel MPI, I compiled the Intel MPI benchmark and ran it:
$ mpirun -np 4 ./IMB-EXT
It is quite surprising that I get the same error when running IMB-EXT and IMB-RMA on both machines, even though the OS and everything else differ (even the GCC version used to compile the Intel MPI benchmark is different -- on CentOS I used GCC 6.5.0, and on OpenSUSE I used GCC 7.3.1).
On the CentOS machine, I get:
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.05 0.00
4 1000 30.56 0.13
8 1000 31.53 0.25
16 1000 30.99 0.52
32 1000 30.93 1.03
64 1000 30.30 2.11
128 1000 30.31 4.22
and on the OpenSUSE machine, I get
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.04 0.00
4 1000 14.40 0.28
8 1000 14.04 0.57
16 1000 14.10 1.13
32 1000 13.96 2.29
64 1000 13.98 4.58
128 1000 14.08 9.09
When I don't use mpirun (which means there is only one process running IMB-EXT), the benchmark runs through, but Unidir_Put needs >= 2 processes, so that doesn't help much. I also find that the functions using MPI_Put and MPI_Get are much slower than I expected (from my experience). Using MVAPICH on the OpenSUSE machine did not help either. The output is:
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 6 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.03 0.00
4 1000 17.37 0.23
8 1000 17.08 0.47
16 1000 17.23 0.93
32 1000 17.56 1.82
64 1000 17.06 3.75
128 1000 17.20 7.44
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 49213 RUNNING AT iron-0-1
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Update: I tested Open MPI, and it runs through smoothly (although my application does not recommend using Open MPI, and I still don't understand why Intel MPI or MVAPICH don't work...)
#---------------------------------------------------
# Benchmarking Unidir_Put
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.06 0.00
4 1000 0.23 17.44
8 1000 0.22 35.82
16 1000 0.22 72.36
32 1000 0.22 144.98
64 1000 0.22 285.76
128 1000 0.30 430.29
256 1000 0.39 650.78
512 1000 0.51 1008.31
1024 1000 0.84 1214.42
2048 1000 1.86 1100.29
4096 1000 7.31 560.59
8192 1000 15.24 537.67
16384 1000 15.39 1064.82
32768 1000 15.70 2086.51
65536 640 12.31 5324.63
131072 320 10.24 12795.03
262144 160 12.49 20993.49
524288 80 30.21 17356.93
1048576 40 81.20 12913.67
2097152 20 199.20 10527.72
4194304 10 394.02 10644.77
Is there any chance that I am missing something when installing MPI, or when installing the OS on these servers? Actually, I assume the OS is the problem, but I am not sure where to start...
Thanks a lot in advance,
Jae
Although this question is well written, you were not explicit about:
Intel MPI benchmark (please add header)
Intel MPI
Open MPI
MVAPICH
supported host network fabrics - for each MPI distribution
selected fabric while running MPI benchmark
Compilation settings
Debugging this kind of trouble with disparate host machines, multiple Linux distributions and compiler versions can be quite hard. Remote debugging on StackOverflow is even harder.
First of all, ensure reproducibility. That seems to be the case here. One of many debugging approaches, and the one I would recommend, is to reduce the complexity of the system as a whole, test smaller sub-systems, and start shifting responsibility to third parties. You may replace self-compiled executables with software packages provided by distribution package repositories or third parties like Conda.
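As one concrete example of reducing complexity (a sketch only; the accepted I_MPI_FABRICS values differ between Intel MPI releases), you can make Intel MPI print which fabric it selects and force the shared-memory transport to take the network out of the picture:
# print fabric/provider selection details and force shared memory for this intra-node run
I_MPI_DEBUG=5 I_MPI_FABRICS=shm mpirun -np 2 ./IMB-EXT
If the benchmark still crashes past 128 bytes with the network excluded, the problem is more likely in the MPI installation or the one-sided (RMA) path than in the fabric.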
Intel recently started to provide its libraries through YUM/APT repos as well as via Conda and PyPI. I found that this helps a lot with reproducible deployments of HPC clusters and even runtime/development environments. I recommend using it for CentOS 7.5.
YUM/APT repository for Intel MKL, Intel IPP, Intel DAAL, and Intel® Distribution for Python* (for Linux*):
Installing Intel® Performance Libraries and Intel® Distribution for Python* Using YUM Repository
Installing Intel® Performance Libraries and Intel® Distribution for Python* Using APT Repository
Conda* package/ Anaconda Cloud* support (Intel MKL, Intel IPP, Intel DAAL, Intel Distribution for Python):
Installing Intel Distribution for Python and Intel Performance Libraries with Anaconda
Available Intel packages can be viewed here
Install from the Python Package Index (PyPI) using pip (Intel MKL, Intel IPP, Intel DAAL)
Installing the Intel® Distribution for Python* and Intel® Performance Libraries with pip and PyPI
I do not know much about OpenSUSE Leap 15.

Vultr virtual cpu vs DigitalOcean Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

When I run a uname -ar command on the Vultr command line, I see the following:
Linux my.vultr.account.com 4.12.10-coreos #1 SMP Tue Sep 5 20:29:13
UTC 2017 x86_64 Virtual CPU a7769a6388d5 GenuineIntel GNU/Linux
On DigitalOcean I get:
Linux master 4.11.11-coreos #1 SMP Tue Jul 18 23:06:59 UTC 2017 x86_64
Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz GenuineIntel GNU/Linux
I don't know what the difference means. Is a virtual CPU worse than, the same as, or better than the "Intel(R) Xeon(R)" I see in the DigitalOcean output?
The real Intel Xeon E5-2650 v4 is a CPU with 12 cores. Depending on your VPS configuration, you are assigned some number of cores from that CPU; hence "virtual CPUs".
Regarding the specs of Vultr, the official response from Vultr support is:
"We do not provide specific information on the CPUs we offer. They are all late-model Intel Xeon CPUs."
The a7769a6388d5 is a 2.4 GHz virtual CPU, according to:
wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash
From there it could be any of a variety of 2.4 GHz Intel E5 Xeons from the v2, v3, or v4 generations. You can get to the bottom of it with:
cat /proc/cpuinfo
Family 6 Model 61 Stepping 2 = Broadwell? etc.
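For example, to pull out just those identifying fields (a convenience one-liner, not part of the original answer):
# family/model/stepping identify the microarchitecture; sort -u collapses per-core duplicates
grep -E 'model|cpu family|stepping' /proc/cpuinfo | sort -u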
Tip: CPU speed is not the best way to compare your VPS though. Focus more on I/O speed, datacenter location, uplink speed and ping times.

# of OpenCL devices on 2012 Macbook pro

I'm writing an OpenCL program on a mid-2012 13" MacBook Pro with the following specs:
Processor: 2.9 GHz Intel Core i7
Graphics: Intel HD Graphics 4000
In my program I do the following to check how many devices I have access to:
// get first platform
cl_platform_id platform;
err = clGetPlatformIDs(1, &platform, NULL);
// get device count
cl_uint gpuCount;
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &gpuCount);
cl_uint cpuCount;
err |= clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 0, NULL, &cpuCount);
std::cout<<"NUM CPUS: "<<cpuCount<<" NUM GPUS: "<<gpuCount<<std::endl;
After execution, my program states that I have only one CPU and zero GPUs.
How can that be? Is OpenCL not compatible with the Intel HD Graphics 4000? And I thought my computer had a dual-core processor, so shouldn't there be 2 CPUs and 1 GPU?
Or am I simply not fetching the data correctly?
EDIT: I have found the issue. After upgrading my OS to Mavericks (I was previously running Mountain Lion), OpenCL now recognizes my graphics card as a valid device.
Your processor has multiple cores, which are recognized as compute units. Run the following code snippet and check that the number of compute units is as expected:
cl_platform_id platform;
cl_device_id device;
cl_uint max_compute_units;
clGetPlatformIDs(1, &platform, NULL);                               // first available platform
clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL); // default device on that platform
cl_int ret = clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &max_compute_units, NULL);
printf("Number of computing units: %u\n", max_compute_units);
This doesn't answer your question (and please don't downvote it), but it will hopefully help you work out what you actually have installed. I would have posted it as a comment, but the formatting there would be useless.
If you want a legible list of your installed CPU and Graphics equipment, the following command does it nicely:
system_profiler | awk '/^Hardware/ || /^Graphics/{p=1;print;next} /^[A-Za-z]/{p=0} p'
Graphics/Displays:
AMD Radeon HD 6970M:
Chipset Model: AMD Radeon HD 6970M
Type: GPU
Bus: PCIe
PCIe Lane Width: x16
VRAM (Total): 1024 MB
Vendor: ATI (0x1002)
Device ID: 0x6720
Revision ID: 0x0000
ROM Revision: 113-C2960H-203
EFI Driver Version: 01.00.560
Displays:
iMac:
Display Type: LCD
Resolution: 2560 x 1440
Pixel Depth: 32-Bit Color (ARGB8888)
Main Display: Yes
Mirror: Off
Online: Yes
Built-In: Yes
Hardware:
Hardware Overview:
Model Name: iMac
Model Identifier: iMac12,2
Processor Name: Intel Core i7
Processor Speed: 3.4 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 8 MB
Memory: 16 GB
Boot ROM Version: IM121.0047.B1F
SMC Version (system): 1.72f2
Serial Number (system): DGKH90PWDHJW
Hardware UUID: 1025AC04-9F8E-5342-9EF4-XXXXXXXXXXXXX
And also this for the actual CPU details:
sysctl -a | grep "brand_string"
machdep.cpu.brand_string: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
And this for OpenCL version:
system_profiler | grep -A 11 OpenCL:
OpenCL:
Version: 2.3.59
Obtained from: Apple
Last Modified: 19/09/2014 10:28
Kind: Intel
64-Bit (Intel): Yes
Signed by: Software Signing, Apple Code Signing Certification Authority, Apple Root CA
Get Info String: 2.3.59, Copyright 2008-2013 Apple Inc.
Location: /System/Library/Frameworks/OpenCL.framework
Private: No
P.S. If there is a better way to provide additional, useful information (which is not really a proper answer) on SO than this, please let me know.

How to measure performance impact of GCC linking option -Wl,-z,relro,-z,now on binary startup on ARM

I'm trying to find a way to measure the start-up performance impact of using relro and early binding linkage options on an ARM platform.
Can someone suggest how to find the time spent linking shared libraries for a binary compiled with those options?
Many thanks.
Edit 1:
There is no timing information in the output on my machine.
root@arm:/# LD_DEBUG=statistics /bin/date
1470: number of relocations: 90
1470: number of relocations from cache: 3
1470: number of relative relocations: 1207 Thu Jan 1 00:17:00 UTC 1970
1470:
1470: runtime linker statistics:
1470: final number of relocations: 108
1470: final number of relocations from cache: 3
If you are using GLIBC:
$ LD_DEBUG=statistics /bin/date
4494:
4494: runtime linker statistics:
4494: total startup time in dynamic loader: 932928 clock cycles
4494: time needed for relocation: 299052 clock cycles (32.0%)
4494: number of relocations: 106
4494: number of relocations from cache: 4
4494: number of relative relocations: 1276
4494: time needed to load objects: 420660 clock cycles (45.0%)
Fri Feb 28 16:40:48 PST 2014
Build your binary with and without -z,relro and compare the numbers.
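For example (a sketch with a hypothetical test.c; note that, as Edit 1 above shows, some glibc builds do not print the timing lines at all):
# link the same program twice; the only difference is the RELRO/eager-binding linker options
gcc -O2 -o app_lazy test.c
gcc -O2 -o app_now  test.c -Wl,-z,relro,-z,now
# LD_DEBUG output goes to stderr; compare the relocation and load times of the two builds
LD_DEBUG=statistics ./app_lazy 2>&1 | grep 'time needed'
LD_DEBUG=statistics ./app_now  2>&1 | grep 'time needed'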
