ESP32 SPIRAM / PSRAM management for VSCode PlatformIO - ESP32

I am troubleshooting the memory management of my firmware (brand new but nearing dev completion): it runs on a custom PCB with a WROVER module.
I want to take advantage of the extra PSRAM/SPIRAM so that the PSRAM is used automatically by the libs.
But I have doubts that this is working as I expect.
Therefore I wanted to make sure the following features are actually triggered as I request them from the platformio.ini:
SPIRAM is taken into account;
and all libs take advantage of it.
[UPDATE]
According to the Espressif doc (https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/external-ram.html#external-ram-config-malloc), this should be the default behaviour (Provide External RAM via malloc() (default))
Here is my platformio.ini:
[env:esp32dev]
platform = espressif32
board = esp-wrover-kit
;esp32dev
framework = arduino
monitor_speed = 115200
lib_deps =
    adafruit/Adafruit NeoPixel#^1.10.0
    plerup/EspSoftwareSerial#^6.13.2
    lbernstone/Tone32#^1.0.0
    miwagner/ESP32CAN#^0.0.1
board_build.partitions = huge_app.csv
;no_ota.csv
extra_scripts = pre:build_script_versioning.py
build_flags =
    -DBOARD_HAS_PSRAM
    -mfix-esp32-psram-cache-issue
    -DCONFIG_MBEDTLS_DYNAMIC_BUFFER=1
    -DCONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST=1
    -DCONFIG_SPIRAM_CACHE_WORKAROUND=1
And the information related to my ESP32 WROVER:
b1453|—|🌩 nano: Firmware v0.1 (build 1453) by Sdl
b1453|—|🌩 nano: -- only suited for PCB REV 3 --
b1453|—|🌩 nano: Free heap size: 212076
b1453|—|🌩 nano: Min free heap size: 206600
b1453|—|🌩 nano: Max alloc heap size: 113792
b1453|—|🌩 nano: PsRam size: 4194252
b1453|—|🌩 nano: Running on chip:
b1453|—|🌩 nano: - model 1 (1 is ESP32)
b1453|—|🌩 nano: - rev. 1
b1453|—|🌩 nano: Chip has the following features:
b1453|—|🌩 nano: - has flash. 0
b1453|—|🌩 nano: - has 2.4GHz. 1
b1453|—|🌩 nano: - has BLE. 1
b1453|—|🌩 nano: - has BT classic. 1
[UPDATE]
As you can see in my platformio.ini file, I use the following flags:
-mfix-esp32-psram-cache-issue: passed to gcc, this is similar to CONFIG_SPIRAM_CACHE_WORKAROUND=1;
-DCONFIG_MBEDTLS_DYNAMIC_BUFFER=1: this should tell mbedTLS to allocate its buffers with malloc, hence use my PSRAM;
-DCONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST=1: this should let the Bluetooth stack use the PSRAM for its own allocations;
-DCONFIG_SPIRAM_CACHE_WORKAROUND=1: and this is similar to the first point (so it is probably redundant with -mfix-esp32-psram-cache-issue).
How can I make sure that each and every flag above is taken into account?
Additionally, for example, how can I make sure that the IoT module actually allocates its TLS buffers from the PSRAM?
Is there any code I can run from my firmware?
And any tips on PSRAM usage are VERY WELCOME.
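For reference, a compile-time probe along these lines (just a sketch; the function name and output strings are arbitrary) should at least show whether the -D defines from build_flags reach my own compilation units:
#include <Arduino.h>

// Sketch only: each #ifdef mirrors one of the -D flags above.
void reportBuildFlags() {
#ifdef BOARD_HAS_PSRAM
  Serial.println("BOARD_HAS_PSRAM is defined");
#else
  Serial.println("BOARD_HAS_PSRAM is NOT defined");
#endif
#ifdef CONFIG_MBEDTLS_DYNAMIC_BUFFER
  Serial.println("CONFIG_MBEDTLS_DYNAMIC_BUFFER is defined");
#endif
#ifdef CONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST
  Serial.println("CONFIG_BT_ALLOCATION_FROM_SPIRAM_FIRST is defined");
#endif
#ifdef CONFIG_SPIRAM_CACHE_WORKAROUND
  Serial.println("CONFIG_SPIRAM_CACHE_WORKAROUND is defined");
#endif
}
(As far as I understand, this says nothing about the precompiled ESP-IDF libraries shipped with the Arduino core, which keep the sdkconfig they were built with; and -mfix-esp32-psram-cache-issue is a code-generation option, not a define, so it cannot be detected this way.)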

According to the Espressif documentation, there are four ways to use the PSRAM. If you want to use PSRAM explicitly to store something, you need to use ps_malloc() to allocate the memory. This simple sketch shows the PSRAM usage before the allocation, after allocating a buffer in PSRAM, and after freeing it.
void setup() {
  // PSRAM usage before any explicit allocation
  log_d("Used PSRAM: %d", ESP.getPsramSize() - ESP.getFreePsram());

  // Explicitly allocate a 500 kB buffer from PSRAM
  byte* psdRamBuffer = (byte*)ps_malloc(500000);
  log_d("Used PSRAM: %d", ESP.getPsramSize() - ESP.getFreePsram());

  // Release it again
  free(psdRamBuffer);
  log_d("Used PSRAM: %d", ESP.getPsramSize() - ESP.getFreePsram());
}
Please note that there is a threshold above which a single allocation will prefer external memory; when allocating a size smaller than the threshold, the allocator tries internal memory first. Read the documentation for more details.
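To check whether plain malloc() calls really end up in external RAM (the default behaviour the question refers to), you can also log how much SPIRAM the heap still has and look at where a large allocation lands. The sketch below is only an illustration: the helper name is arbitrary, and the address check relies on the fact that on the classic ESP32 external RAM data is mapped at 0x3F800000 - 0x3FBFFFFF.
#include <Arduino.h>
#include <esp_heap_caps.h>

// Arbitrary helper: on the classic ESP32, external RAM data is mapped into
// the 0x3F800000 - 0x3FBFFFFF address window.
static bool isInPsram(const void* p) {
  uintptr_t addr = (uintptr_t)p;
  return addr >= 0x3F800000 && addr < 0x3FC00000;
}

void setup() {
  Serial.begin(115200);

  // How much of the heap currently lives in external RAM.
  Serial.printf("Free SPIRAM: %u bytes\n",
                (unsigned)heap_caps_get_free_size(MALLOC_CAP_SPIRAM));

  // A large plain malloc() should land in PSRAM when "provide external RAM
  // via malloc()" is active; small requests stay internal because of the
  // threshold mentioned above.
  void* big = malloc(100 * 1024);
  Serial.printf("malloc(100 KB) -> %p (%s)\n",
                big, isInPsram(big) ? "PSRAM" : "internal RAM");
  free(big);

  // Forcing an allocation into external RAM always works once PSRAM is up.
  void* forced = heap_caps_malloc(100 * 1024, MALLOC_CAP_SPIRAM);
  Serial.printf("heap_caps_malloc(MALLOC_CAP_SPIRAM) -> %p\n", forced);
  free(forced);
}

void loop() {}
Logging heap_caps_get_free_size(MALLOC_CAP_SPIRAM) periodically from the running firmware also shows whether the libraries (mbedTLS, Bluetooth, ...) are actually eating into external RAM over time.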

Related

Why are OpenGL and CUDA contexts memory greedy?

I develop software which usually includes both OpenGL and the Nvidia CUDA SDK. Recently, I also started to look for ways to optimize the run-time memory footprint. I noticed the following (Debug and Release builds differ only by 4-7 MB):
Application startup - less than 1 MB total
OpenGL 4.5 context creation (+ GLEW loader init) - 45 MB total
CUDA 8.0 context (Driver API) creation - 114 MB total
If I create the OpenGL context in "headless" mode, the GL context uses 3 MB less, which probably goes to the default frame buffer allocation. That makes sense, as the window size is 640x360.
So after the OpenGL and CUDA contexts are up, the process already consumes 114 MB.
Now, I don't have deep knowledge of the OS-specific work that happens under the hood during GL and CUDA context creation, but 45 MB for GL and 68 MB for CUDA seems like a whole lot to me. I know that usually several megabytes go to system frame buffers and function pointers (probably the bulk of the allocations happens on the driver side). But hitting over 100 MB with just "empty" contexts looks like too much.
I would like to know:
Why GL/CUDA context creation consumes such a considerable amount of memory?
Are there ways to optimize that?
The system setup under test:
Windows 10 64-bit. NVIDIA GTX 960 GPU (driver version 388.31). 8 GB RAM. Visual Studio 2015, 64-bit C++ console project.
I measure memory consumption using the Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process at the bottom, marked in yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window. But here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320K looks way too high, because, as I said above, the process does nothing at startup except create the CUDA and OpenGL contexts.
Partial answer: This is an OS-specific issue; on Linux, CUDA takes 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <cuda.h>
#include <malloc.h>
#include <stdlib.h>

/* Print glibc heap statistics with a caption. */
static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");

    int status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");

    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");

    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");

    return EXIT_SUCCESS;
}
(Note that this uses a glibc-specific function, not one from the standard library.)
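(As a side note, and not part of the measurements below: malloc_stats() only covers the glibc heap, so the kernel's view of the process is a useful cross-check. A sketch of a helper that reads the VmRSS line from /proc/self/status:)
#include <stdio.h>
#include <string.h>

/* Print the resident-set size as reported by the kernel. */
static void print_vm_rss(const char* s)
{
    FILE* f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f) return;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            printf("%s - %s", s, line);  /* e.g. "After context creation - VmRSS:  9520 kB" */
            break;
        }
    }
    fclose(f);
}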
Summarizing the results and snipping irrelevant parts:
Point in program                   Total bytes    In-use    Max MMAP regions    Max MMAP bytes
Initially                               135168      1632                   0                 0
After CUDA driver initialization        552960    439120                   2            307200
After context creation                 9314304   6858208                   8           6643712
After context destruction              7016448    580688                   8           6643712
So CUDA starts with 0.5 MB and, after creating a context, takes up 9.3 MB (going back down to 7.0 MB on destroying the context). 9 MB is still a lot of memory for not having done anything; but maybe some of it is all-zeros, or uninitialized, or copy-on-write, in which case it doesn't really take up that much memory.
It's possible that memory use improved dramatically over the two years between the driver releases for CUDA 8 and CUDA 10, but I doubt it. So it looks like your problem is Windows-specific.
Also, I should mention that I did not create an OpenGL context, which is the other part of OP's question, so I haven't estimated how much memory that takes. OP brings up the question of whether the sum is greater than its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...

Linux - dtb - RAM address and size passing via dtb to kernel

I'm a newbie to Linux.
While booting Linux on my custom embedded development board, I can see this log:
Memory: 405860K/509952K available (2604K kernel code, 188K rwdata, 1068K rodata, 164K init, 131K bss, 87708K reserved, 16384K cma-reserved)
176 Virtual kernel memory layout:
This means Linux has detected 512 MB of RAM (even though I have 2 GB of RAM).
I assume this information needs to be passed via the dtb. Can someone help me with how this node looks and how I can increase its size?
You need to change the memory node in the device tree so that Linux can see the 2 GB: the node's reg property is a <base-address size> pair, and the kernel will only use what that node declares. You can refer to this link. Also, you might want to set CONFIG_VMSPLIT_2G so that the 32-bit kernel can directly map more of the RAM as lowmem.

Error using Tensorflow with GPU

I've tried a bunch of different TensorFlow examples, which work fine on the CPU but generate the same error when I try to run them on the GPU. One small example is this:
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print sess.run(c)
The error is always the same, CUDA_ERROR_OUT_OF_MEMORY:
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0a:00.0
Total memory: 11.25GiB
Free memory: 105.73MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0b:00.0
Total memory: 11.25GiB
Free memory: 133.48MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:0b:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 105.48MiB bytes.
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 105.48M (110608384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr Could not allocate GPU device memory for device 0. Tried to allocate 105.48MiB
Aborted (core dumped)
I guess that the problem has to do with my configuration rather than the memory usage of this tiny example. Does anyone have any idea?
Edit:
I've found out that the problem may be as simple as someone else running a job on the same GPU, which would explain the small amount of free memory. In that case: sorry for taking up your time...
There appear to be two issues here:
By default, TensorFlow allocates a large fraction (95%) of the available GPU memory (on each GPU device) when you create a tf.Session. It uses a heuristic that reserves 200MB of GPU memory for "system" uses, but doesn't set this aside if the amount of free memory is smaller than that.
It looks like you have very little free GPU memory on either of your GPU devices (105.73MiB and 133.48MiB). This means that TensorFlow will attempt to allocate memory that should probably be reserved for the system, and hence the allocation fails.
Is it possible that you have another TensorFlow process (or some other GPU-hungry code) running while you attempt to run this program? For example, a Python interpreter with an open session—even if it is not using the GPU—will attempt to allocate almost the entire GPU memory.
Currently, the only way to restrict the amount of GPU memory that TensorFlow uses is the following configuration option (from this question):
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
This can happen because your TensorFlow session is not able to get a sufficient amount of memory on the GPU. Maybe little free memory is left for TensorFlow because of other processes, or there is another TensorFlow session running on your system, so you have to configure the amount of memory the TensorFlow session will use.
If you are using TensorFlow 1.x:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
TensorFlow 2.x has undergone major changes from 1.x. If you want to use a TensorFlow 1.x method/function, there is a compatibility module kept in TensorFlow 2.x, so TensorFlow 2.x users can use this piece of code:
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

How does the MLO (minimal bootloader) work?

I am trying to understand how an MLO is loaded into the on-chip RAM of a SoC and does the minimal configuration. I am using the TI DM8168 SoC.
I have gone through the following link to understand the MLO or x-loader:
http://omappedia.org/wiki/Bootloader_Project
I got to know that the ROM code loads the MLO (x-loader) into the on-chip RAM of the SoC, which does the minimal configuration and finally loads u-boot (the universal bootloader), which in turn starts the Linux kernel.
My doubt here is that my on-chip RAM size is 64 KB and the MLO size is 116 KB, so how is the ROM code loading the MLO into the on-chip RAM?
It seems that the DM8168 has more than 64 KiB of internal RAM: as explained in
the DM816x AM389x PSP 04.00.01.13 Feature Performance Guide, it has at least two more blocks of internal RAM, referenced as OCMC0 and OCMC1, both 256 KiB in size.
Those two banks can be used by u-boot according to this document:
OCMC0 0x40300000 - 0x4033FFFF OCMC 0 will be used by ROM Code and U-boot. Once Linux kernel boots, OCMC0 is free and kernel can use it. OCMC0 should not be used to load u-boot if loaded using CCS.
OCMC1 0x40400000 - 0x4043FFFF OCMC 1 will be used by ROM Code and U-boot. Once Linux kernel boots, OCMC0 is free and kernel can use it.
From u-boot-omap3/board/ti/ti8168/config.mk, it seems u-boot is using OCMC1:
TI_LOAD_ADDR = 0x40400000
This would explain why your 116 KiB u-boot image can fit in the DM8168 internal RAM.

dma_alloc_coherent fails when buffer > 2M on kernel 3.2

I have this x86 device and a kernel module that tries to allocate DMA memory. It has a parameter called dmasize that allows controlling the size of the allocated memory.
I've noticed that allocation succeeds when dmasize=2M but not if larger, even at boot time.
I heard there was a limitation by CONSISTENT_DMA_SIZE, but looking at lxr, I can't find it for arch x86 on kernel 3.2.
Not sure if it is relevant, but this is a 32 bit machine with 8GB of RAM and a pae enabled kernel.
This is the call to dma_alloc_coherent:
dma_addr_t dma_handle;
if (!(_dma_vbase = dma_alloc_coherent(0, alloc_size, &dma_handle, GFP_KERNEL)) || !dma_handle) {
    gprintk("_alloc_mpool: Kernel failed to allocate the memory pool of size 0x%lx\n", (unsigned long)alloc_size);
    return;
}
Appreciate anyone who can help with this.
Just in case anyone comes across this, the answer is as follows:
The config flag CONFIG_FORCE_MAX_ZONEORDER, which defaults to 11 on most architectures, is the cause of this limitation: the buddy allocator cannot hand out a physically contiguous block larger than 2^(MAX_ORDER - 1) pages, and dma_alloc_coherent needs contiguous memory.
Increasing it to 12 (and recompiling the kernel) fixes the problem.
I suspect using CMA would also be possible, but since my kernel doesn't support it, I cannot say for sure.
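To make the arithmetic behind this concrete, here is a small illustration (a sketch assuming 4 KiB pages; the real MAX_ORDER value comes from the kernel configuration):
#include <stdio.h>

/* Largest physically contiguous block the buddy allocator can hand out:
 * (1 << (MAX_ORDER - 1)) pages. Assumes 4 KiB pages. */
int main(void)
{
    unsigned long page_size = 4096UL;
    for (int max_order = 11; max_order <= 12; ++max_order) {
        unsigned long max_bytes = (1UL << (max_order - 1)) * page_size;
        printf("MAX_ORDER = %d -> largest contiguous allocation = %lu MiB\n",
               max_order, max_bytes >> 20);
    }
    return 0;
}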
