How to free a CPU buffer defined in Halide

My project uses the buffer on both the GPU and the CPU.
The code is similar to this:
xxx = HalideBuffer_n(width, height, 1); // buffer in CPU
xxx.device_malloc(device); // buffer in GPU
PS: using HalideBuffer_n = Halide::Runtime::Buffer<uint8_t, 3>
I can use
xxx.device_free();
to free the buffer on the GPU.
But which API can I use to free the buffer on the CPU?
I have tried free and deallocate, but they don't seem to work.

A Halide::Buffer is a shared pointer to the underlying allocation. xxx.deallocate() drops the reference to it. If this is the only copy of that HalideBuffer_n object then it will free the underlying memory too. If it's not freeing it means that another copy of that HalideBuffer_n object exists somewhere.
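To make the sharing behaviour concrete, here is a minimal sketch that models it with std::shared_ptr. It is not Halide code and makes no claim about Halide's internals: BufferHandle, drop() and storage_survives_drop() are hypothetical names invented for the illustration. It only shows why dropping one reference frees the memory when that reference was the last one, and not otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for the underlying allocation (hypothetical, not Halide's type).
using Storage = std::vector<std::uint8_t>;

// Hypothetical reference-counted handle modelling the behaviour described above.
struct BufferHandle {
    std::shared_ptr<Storage> storage;

    explicit BufferHandle(std::size_t bytes)
        : storage(std::make_shared<Storage>(bytes)) {}

    // Drop this handle's reference; the storage is freed only if no
    // other handle still refers to it.
    void drop() { storage.reset(); }
};

// Returns true if the storage is still alive after `owner` drops its reference.
bool storage_survives_drop(bool keep_second_copy) {
    BufferHandle owner(64);
    std::weak_ptr<Storage> probe = owner.storage;  // observes without owning
    assert(!probe.expired());                      // alive while owner holds it

    std::vector<BufferHandle> copies;
    if (keep_second_copy) copies.push_back(owner);  // a second handle shares the storage

    owner.drop();
    return !probe.expired();
}
```

Calling storage_survives_drop(false) reports that the memory is gone after the drop, while storage_survives_drop(true) reports that it is still alive, which is the situation the answer describes when another copy of the buffer object exists somewhere.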

Related

How can I improve SPDK performance on userspace DMA access?

I am working on a userspace PCI driver which uses SPDK/VFIO APIs to do dma access.
Currently, for each DMA allocation request I need to fill in a spdk_vfio_dma_map structure and then call ioctl(fd, VFIO_IOMMU_MAP_DMA, &dma_map) to map the DMA region through the IOMMU. Later I call ioctl(fd, VFIO_IOMMU_UNMAP_DMA, &dma_map) to remove the IOMMU mapping.
This works fine so far and appears to be what the SPDK examples do. However, I am wondering whether there is a way to pre-allocate all memory buffers in userspace and then, for each DMA allocation request, just use the pre-allocated memory instead of making an ioctl call each time.
Any ideas are much appreciated.
I'm not sure I understand the issue, but the whole idea of DPDK and SPDK is to allocate all the memory you are going to use at application start or driver probe.
If you are using memory that is under application control all the time, then you don't need to do VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every DMA transaction. If this is not the case, you have two options:
1) Do the VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA for every IO
2) Copy the payload to memory that is already registered with VFIO_IOMMU_MAP_DMA.
The first option is better for huge memory blocks, while the second is better for small IO chunks.
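The "map once, reuse afterwards" idea can be sketched as a small fixed-size chunk pool. This is only an illustration under stated assumptions: DmaPool, map_region_once() and g_map_calls are hypothetical names, and map_region_once() is a stub that merely counts calls where a real driver would issue its single ioctl(fd, VFIO_IOMMU_MAP_DMA, ...). The backing memory here is an ordinary std::vector; real DMA memory would have to be pinned and suitably aligned, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the one-time VFIO_IOMMU_MAP_DMA ioctl.
// It only counts invocations so the sketch runs without VFIO hardware.
static int g_map_calls = 0;
static void map_region_once(void* /*base*/, std::size_t /*len*/) { ++g_map_calls; }

class DmaPool {
public:
    DmaPool(std::size_t chunk_size, std::size_t chunk_count)
        : chunk_size_(chunk_size), backing_(chunk_size * chunk_count) {
        // Register the whole region a single time, at start-up.
        map_region_once(backing_.data(), backing_.size());
        for (std::size_t i = 0; i < chunk_count; ++i)
            free_.push_back(backing_.data() + i * chunk_size);
    }

    // Hand out a chunk of the already-mapped region; nullptr if exhausted.
    std::uint8_t* acquire() {
        if (free_.empty()) return nullptr;
        std::uint8_t* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Return a chunk to the pool; no unmap is needed.
    void release(std::uint8_t* p) {
        assert(p >= backing_.data() && p < backing_.data() + backing_.size());
        free_.push_back(p);
    }

    std::size_t chunk_size() const { return chunk_size_; }

private:
    std::size_t chunk_size_;
    std::vector<std::uint8_t> backing_;  // stands in for pinned DMA memory
    std::vector<std::uint8_t*> free_;    // chunks currently available
};
```

With this shape, a small IO copies its payload into a chunk from acquire() and gives it back with release(), so the mapping cost is paid once rather than per transaction.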

FFmpeg, av_frame_free - Does not free all memory allocated by av_frame_clone

I want to clone AVFrame. For this purpose, I call av_frame_clone function.
Then I want to free all memory allocated by the old AVFrame. For this purpose I call the av_frame_free function. The memory pointed to by data is not freed by av_frame_free. So what is the correct way of cloning and deleting an AVFrame in FFmpeg?
Thanks for responses.
The docs for av_frame_clone() say:
Create a new frame that references the same data as src. This is a
shortcut for av_frame_alloc()+av_frame_ref().
Those for av_frame_free() say:
Free the frame and any dynamically allocated objects in it, e.g.
extended_data. If the frame is reference counted, it will be
unreferenced first.
So, combining these two functions looks correct.
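What happens to the original frame? It probably needs an unref as well, since the underlying data is only released once every frame that references it has been unreferenced.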

STM32F4 running FreeRTOS in external RAM

We have a thesis project at work where the guys are trying to get external RAM to work for the STM32F417 MCU.
The project is trying out some stuff that is really resource hungry and the internal RAM just isn't enough.
The question is how to best do this.
The current approach has been to just replace the RAM address in the linker script (GNU ld) with the address of the external RAM.
The problem with that approach is that during initialisation the chip has to run from internal RAM, since the FSMC has not been initialized yet.
It seems to work, but as soon as pvPortMalloc is run we get a hard fault, probably due to dereferencing bogus addresses. We can see that variables are not initialized correctly at system init (which makes sense, I guess, since the internal RAM is not used at all when it probably should be).
I realize that this is a vague question, but what is the general approach when running code in external RAM on a Cortex M4 MCU, more specifically the STM32F4?
Thanks
FreeRTOS defines and uses a single big memory area for stack and heap management; this is simply an array of bytes, the size of which is specified by the configTOTAL_HEAP_SIZE symbol in FreeRTOSConfig.h. FreeRTOS allocates task stacks in this memory area using its pvPortMalloc function, so the main goal here is to place the FreeRTOS heap area in external SRAM.
The FreeRTOS heap memory area is defined in heap_*.c (with the exception of heap_3.c that uses the standard library malloc and it doesn't define any custom heap area), the variable is called ucHeap. You can use your compiler extensions to set its section. For GCC, that would be something like:
static uint8_t ucHeap[ configTOTAL_HEAP_SIZE ] __attribute__ ((section (".sram_data")));
Now we need to configure the linker script to place this custom section into external SRAM. There are several ways to do this and it depends again on the toolchain you're using. With GCC one way to do this would be to define a memory region for the SRAM and a section for ".sram_data" to append to the SRAM region, something like:
MEMORY
{
    ...
    /* Define SRAM region */
    sram : ORIGIN = <SRAM_START_ADDR>, LENGTH = <SRAM_SIZE>
}
SECTIONS
{
    ...
    /* Define .sram_data section and place it in sram region */
    .sram_data :
    {
        *(.sram_data)
    } >sram
    ...
}
This will place the ucHeap area in external SRAM, while all the other text and data sections will be placed in the default memory regions (internal flash and ram).
A few notes:
make sure you initialize the SRAM controller/FSMC prior to calling any FreeRTOS function (like xTaskCreate)
once you start the tasks, all stack-allocated variables will be placed in ucHeap (i.e. ext RAM), but global variables are still allocated in internal RAM. If you still have internal RAM size issues, you can configure other global variables to be placed in the ".sram_data" section using compiler extensions (as shown for ucHeap)
if your code uses dynamic memory allocation, make sure you use pvPortMalloc/vPortFree, instead of the stdlib malloc/free. This is because only pvPortMalloc/vPortFree will use the ucHeap area in ext RAM (and they are thread-safe, which is a plus)
if you're doing a lot of dynamic task creation/deletion and memory allocation with pvPortMalloc/vPortFree with different memory block sizes, consider using heap_4.c instead of heap_2.c. heap_2.c has memory fragmentation problems when using several different block sizes, whereas heap_4.c is able to combine adjacent free memory blocks into a single large block
Another (and possibly simpler) solution would be to define the ucHeap variable as a pointer instead of an array, like this:
static uint8_t * const ucHeap = (uint8_t *) <SRAM_START_ADDR>;
This wouldn't require any special linker script editing; everything can be placed in the default sections. Note that with this solution the linker won't explicitly reserve any memory for the heap, and you will lose some potentially useful information/errors (like the heap area not fitting in ext RAM). But as long as you only have ucHeap in external RAM and configTOTAL_HEAP_SIZE is smaller than the external RAM size, that might work just fine.
When the application starts up it will try to initialise data by either clearing it to zero, or initialising it to a non-zero value, depending on the section the variable is placed in. Using a normal run time model, that will happen before main() is called. So you have something like:
1) Reset vector calls init code
2) C run time init code initialises variables
3) C run time init code calls main()
If you use the linker to place variables in external RAM, then you need to ensure the RAM is accessible before that initialisation takes place; otherwise you will get a hard fault. Therefore you need either a boot loader that sets up the system for you and then starts your application or, more simply, start-up code edited to do the following:
1) Reset vector calls init code
2) >>>C run time init code configures external RAM<<<
3) C run time init code initialises variables
4) C run time init code calls main().
That way the RAM is available before you try to access it.
However, if all you want to do is have the FreeRTOS heap in external RAM, then you can leave the init code untouched, and just use an appropriate heap implementation - basically one that does not just declare a large static array. For example, if you use heap_5 then all you need to do is ensure the heap init function is called before any allocation is performed, because the heap init just describes which RAM to use as the heap, rather than statically declaring the heap.

Doing a zero-copy move of data from a Linux kernel buffer to hard disk

I am trying to move data from a buffer in kernel space to the hard
disk without incurring any additional copies from the kernel buffer to
user buffers or any other kernel buffers. Any ideas/suggestions would be
most helpful.
The use case is basically a demux driver which collects data into a
demux buffer in kernel space and this buffer has to be emptied
periodically by copying the contents into a FUSE-based partition on the
disk. As the buffer gets full, a user process is signalled which then
determines the sector numbers on the disk the contents need to be copied
to.
I was hoping to mmap the above demux kernel buffer into user address
space and issue a write system call to the raw partition device. But
from what I can see, this data is being cached by the kernel on its
way to the hard disk driver, and so I am assuming that involves
additional copies by the Linux kernel.
At this point I am wondering if there is any other mechanism to do this
without involving additional copies by the kernel. I realize this is an
unusual usage scenario for non-embedded environments, but I would
appreciate any feedback on possible options.
BTW - I have tried using O_DIRECT when opening the raw partition, but
the subsequent write call fails if the buffer being passed is the
mmapped buffer.
Thanx!
You need to expose your demux buffer as a file descriptor (presumably, if you're using mmap() then you're already doing this - great!).
On the kernel side, you then need to implement the splice_read member of struct file_operations.
On the userspace side, create a pipe(), then use splice() twice - once to move the data from the demux file descriptor into the pipe, and a second time to move the data from the pipe to the disk file. Use the SPLICE_F_MOVE flag.
As documented in the splice() man page, it will avoid actual copies where it can, by copying references to pages of kernel memory rather than the pages themselves.
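The userspace side of this can be sketched roughly as follows. This is a minimal sketch, not a tested driver client: splice_copy() is a hypothetical helper name, in_fd stands for the demux file descriptor (which would need the kernel-side splice_read support mentioned above), and out_fd for the disk file. Only pipe(), splice(), close() and SPLICE_F_MOVE are real Linux interfaces here.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // splice() is a Linux extension
#endif
#include <fcntl.h>      // splice, SPLICE_F_MOVE
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // pipe, close
#include <cassert>
#include <cstddef>

// Move up to `len` bytes from in_fd to out_fd through an intermediate pipe.
// Returns the number of bytes moved, or -1 on error.
ssize_t splice_copy(int in_fd, int out_fd, std::size_t len) {
    assert(in_fd >= 0 && out_fd >= 0);

    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;

    std::size_t total = 0;
    bool ok = true;
    while (ok && total < len) {
        // First splice: source descriptor -> write end of the pipe.
        ssize_t n = splice(in_fd, nullptr, pipefd[1], nullptr,
                           len - total, SPLICE_F_MOVE);
        if (n < 0) { ok = false; break; }
        if (n == 0) break;  // end of input

        // Second splice: read end of the pipe -> destination descriptor.
        // Loop until everything queued in the pipe has been drained.
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], nullptr, out_fd, nullptr,
                               static_cast<std::size_t>(left), SPLICE_F_MOVE);
            if (m <= 0) { ok = false; break; }
            left -= m;
        }
        total += static_cast<std::size_t>(n);
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return ok ? static_cast<ssize_t>(total) : -1;
}
```

Each pass through the outer loop moves at most one pipe's worth of data, so large transfers simply take several iterations. Whether the kernel really avoids copying depends on the descriptors involved; SPLICE_F_MOVE is only a hint.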

Usage of CoTaskMemAlloc?

When is it appropriate to use CoTaskMemAlloc? Can someone give an example?
Use CoTaskMemAlloc when returning a char* from a native C++ library to .NET as a string.
C#
[DllImport("test.dll", CharSet=CharSet.Ansi)]
extern static string Foo();
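C++
#include <objbase.h>  // CoTaskMemAlloc
#include <string.h>   // strcpy_s
#include <string>

// Exported unmangled so that P/Invoke can find it by name.
extern "C" __declspec(dllexport) char* Foo()
{
    std::string response("response");
    size_t len = response.length() + 1;  // include the terminating NUL
    char* buff = static_cast<char*>(CoTaskMemAlloc(len));
    if (buff == nullptr) return nullptr;  // allocation failed
    strcpy_s(buff, len, response.c_str());
    return buff;  // freed on the .NET side with CoTaskMemFree
}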
Since .NET uses CoTaskMemFree to release the returned string, you have to allocate it like this. You can't allocate it on the stack, or on the heap using malloc / new.
Gosh, I had to think for a while for this one -- I've done a fair amount of small-scale COM programming with ATL and rarely have had to use it.
There is one situation though that comes to mind: Windows Shell extensions. If you are dealing with a set of filesystem objects you might have to deal with PIDLs (pointer to an ID list). These are bizarre little filesystem object abstractions and they need to be explicitly allocated/deallocated using a COM-aware allocator such as CoTaskMemAlloc. There is also an alternative, the IMalloc interface pointer obtained from SHGetMalloc (deprecated) or CoGetMalloc -- it's just an abstraction layer to use, so that your code isn't tied to a specific memory allocator and can use any appropriate one.
The point of using CoTaskMemAlloc or IMalloc rather than malloc() is that the memory allocation/deallocation needs to be something that is "COM-aware" so that its allocation and deallocation are performed consistently at run-time, even if the allocation and deallocation are done by completely unrelated code (e.g. Windows allocates memory, transfers it to your C++ code which later deallocates, or your C++ code allocates, transfers it to someone else's VB code which later deallocates). Neither malloc() nor new are capable of interoperating with the system's run-time heap so you can't use them to allocate memory to transfer to other COM objects, or to receive memory from other COM objects and deallocate.
This MSDN article compares a few of the various allocators exposed by Win32, including CoTaskMemAlloc. It's mainly used in COM programming--most specifically when the implementation of a COM server needs to allocate memory to return back to a client. If you aren't writing a COM server, then you probably don't need to use it.
(However, if you call code that allocates memory using CoTaskMemAlloc and returns it back to you, you'll need to free the returned allocation(s) using CoTaskMemFree.)
There is not really much that can go wrong, as the following calls all end up with the same allocation:
CoTaskMemAlloc/SHAlloc -> IMalloc.Alloc -> GlobalAlloc(GMEM_FIXED)
Things will only go wrong if you use non-Windows (compiler library) calls like malloc().
Officially one should use CoTaskMemAlloc for COM calls (like allocating a FORMATETC.ptd field).
That CoTaskMemAlloc will stay equivalent to GlobalAlloc() can be seen by comparing the clipboard API with COM's STGMEDIUM: STGMEDIUM uses the clipboard structures and methods, and while STGMEDIUM is COM and thus uses CoTaskMemAlloc, the clipboard APIs prescribe GlobalAlloc().
CoTaskMemAlloc is the same as malloc, except that the former is used to allocate memory for data that is passed across process boundaries.
For example, suppose we have two processes, process1 and process2, where process1 is a COM server and process2 is a COM client that uses the interfaces exposed by process1.
If process1 has to send some data, it can allocate memory with CoTaskMemAlloc and copy the data into it.
That data can then be accessed by process2.
The COM library automatically does the marshalling and unmarshalling.
