Shared Memory in a BMC system and Frame Buffer - memory-management

I have a system that includes an AST BMC. The BMC can provide a KVM for a user to control the host.
The BMC can give the user a KVM because it has access to the host frame buffer.
On the host, the frame buffer is the physical memory of the PCI bus and you can get it using lspci.
On the BMC I have found the memory ranges that hold the frame buffer values.
My question is: how exactly does the BMC get access to these values? Who is responsible for giving the BMC access to these values in these exact memory ranges?
Thank you very much.

Related

How IOMMU unmaps the IOVA coming from different peripherals through DMA

I have been trying to get the information on this for so long and still haven't got anything solid. So, what I have learned so far is that the IOMMU converts the IOVA provided by the DMA to the physical address and reads or writes from/to the memory. My questions are as follows:
1) Does the IOMMU store a different memory map for every single device? Does each device see an address range starting from zero in its virtual address space?
2) Where are these IOMMU memory maps stored?
3) How does the IOMMU know which device a request is coming from if every device sees virtual addresses starting from zero in its virtual address space?
4) Does the device also transmit some kind of device-specific ID that the IOMMU recognizes and uses to unmap the IOVA and to protect other memory addresses from being seen or written by this device?

Allocating a physical memory buffer in linux

I have an SoC which has both DSP and ARM cores on it and I would like to create a section of shared memory that both my userspace software, and DSP software are able to access. What would be the best way to allocate a buffer like this in Linux? Here is a little background, right now what I have is a kernel module in which I use kmalloc() to get a kernel buffer, I then use the __pa() macro from asm/page.h to get the physical address of my kernel buffer. I save this address as a sysfs entry so that my userspace code can get the physical address of this buffer. I can then write this address to the DSP so it knows where the shared memory location is, and I can also mmap /dev/mem or my own kernel module so that I can access this buffer from userspace (I could also use the read/write fileops).
For some reason I feel like this is overboard but I cannot find the best way to do what I am trying to do.
Would it be possible to just mmap a section of /dev/mem and read and write to this section? My feeling is that this would not 'lock' this section of memory from the kernel, so the kernel could still read/write to this memory without me knowing. Is this the case? After reading the memory management chapter of LDD3 I see that mmap creates a new VMA for the mapping. Would this lock this area of memory so that other processes would not get allocated this section of memory?
Any and all help is appreciated
Depending on the kind of DMA you're using, you need to allocate the buffer with dma_alloc_coherent(), or use standard allocations and the dma_map_* functions.
(You must not use __pa(); physical addresses are not necessarily the same as DMA bus addresses.)
To map the buffers to user space, use dma_mmap_coherent() for coherent buffers, or map the memory pages manually for streaming buffers.
For a similar requirement of mine, I reserved about 16 MB of memory at the end of RAM and used it in both kernel and user space. Suppose you have 128 MB of RAM: you can set the BOOTMEM argument to 112 MB in your boot loader (I am assuming you are using U-Boot). This keeps the last 16 MB of RAM out of the kernel's hands. Now in kernel and user space you can map this area and use it as shared memory.
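A minimal user-space sketch of the reserved-region approach, assuming the numbers from the answer above (128 MB of RAM, the last 16 MB kept out of the kernel). The helper names `reserved_base` and `map_phys` are hypothetical, and the `ram_base` parameter is an assumption: RAM starts at physical address 0 on some boards but not on all of them.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MiB (1024ULL * 1024ULL)

/* Physical start address of a region reserved at the top of RAM.
 * ram_base is board specific (assumption: supplied by the caller). */
uint64_t reserved_base(uint64_t ram_base, uint64_t ram_size,
                       uint64_t reserved_size)
{
    assert(reserved_size <= ram_size);
    return ram_base + ram_size - reserved_size;
}

/* Map len bytes of physical memory starting at phys through /dev/mem.
 * phys must be page aligned. Returns NULL on any failure. */
void *map_phys(uint64_t phys, size_t len)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)phys);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would do something like `map_phys(reserved_base(0, 128 * MiB, 16 * MiB), 16 * MiB)` and hand the same physical address to the DSP. This needs root, and kernels built with CONFIG_STRICT_DEVMEM may refuse the mapping, so treat it as an illustration rather than a drop-in solution.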

How does the CPU know the PCI adress-space

I understand that PCI and PCIe devices can be configured by the CPU (via code in the BIOS or OS) to respond to certain physical memory addresses by writing to specific areas of the device's configuration space.
In fact the Linux kernel has quite the complicated algorithm for doing this taking into account a lot of requirements of the device (memory alignment, DMA capabilities etc).
Seeing that software seems to be in control of if, when and where this memory is mapped, my question is: How can a piece of software control mapping of physical memory?
After this configuration, the PCI device will know to respond to the given address range, but how does the CPU know that it should go on the PCI bus for those specific addresses that were just dynamically decided?
The northbridge is programmed with the address range(s) that are to be routed to the memory controller(s).
All other addresses go to the external bus.
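The routing rule in this answer can be modelled in a few lines. This is only a toy sketch of the decision, not how any particular chipset is programmed; the `range` struct, the `route` function, and the example ranges are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum target { TARGET_DRAM, TARGET_EXTERNAL_BUS };

/* One programmed DRAM window, covering [base, limit). */
struct range {
    uint64_t base;
    uint64_t limit;
};

/* Decide where an access goes: addresses inside any programmed DRAM
 * window go to the memory controller, everything else goes out on
 * the external bus (PCI/PCIe in the question). */
enum target route(const struct range *dram, size_t n, uint64_t addr)
{
    assert(dram != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= dram[i].base && addr < dram[i].limit)
            return TARGET_DRAM;
    }
    return TARGET_EXTERNAL_BUS;
}
```

With a single window covering the first 2 GB, an access to 0x1000 is routed to DRAM while an access to 0xF0000000 falls through to the external bus, where whichever device was configured to respond to that address will claim it.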
It is based on the address-mapping information that the CPU has.
A 64-bit processor can in principle address 2^64 bytes, although real CPUs implement fewer physical address lines than that.
Installed memory today is around 16 GB, which is 2^34 bytes.
So all the devices the CPU can reach (legacy PCI as well as PCIe devices), together with their config space, can be mapped
to addresses outside the range occupied by RAM, for example above it.
Any I/O to this space can be forwarded to the respective device.
In our case, when the CPU finds that the address it wants to access belongs to a PCI or PCIe device, it forwards the
access to the host bridge (run lspci on a box and you will typically see the host bridge at BDF 00:00.0).
Once it finds that the target device is behind the host bridge, the access (I/O or memory) is converted to the appropriate TLP request.
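To make the "config space mapped into the address space" part concrete: on PCIe this is commonly done with ECAM, where each function gets a 4 KiB window and the bus/device/function numbers select the window. The sketch below computes only the offset inside that region; the base address is platform specific (the firmware reports it) and is deliberately left out. The function name `ecam_offset` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of a config-space register inside a PCIe ECAM region.
 * Layout: bus in bits 27..20, device in bits 19..15,
 * function in bits 14..12, register offset in bits 11..0. */
uint64_t ecam_offset(unsigned bus, unsigned dev, unsigned fn, unsigned reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 4096);
    return ((uint64_t)bus << 20) |
           ((uint64_t)dev << 15) |
           ((uint64_t)fn  << 12) |
           (uint64_t)reg;
}
```

The host bridge at 00:00.0 sits at offset 0 of the region, and a read of register 0x10 on 01:02.3 lands at offset 0x113010 from the base.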

DMA vs Cache difference

Probably a stupid question for most that know DMA and caches... I just know cache stores memory to somewhere closer to where you can access so you don't have to spend as much time for the I/O.
But what about DMA? It lets you access that main memory with less delay?
Could someone explain the differences, both, or why I'm just confused?
DMA is a hardware device that can move data to/from memory without using CPU instructions.
For instance, a hardware device (let's say your PCI sound device) wants audio to play back. You can either:
1) Write a word at a time via CPU mov instructions.
2) Configure the DMA device. You give it a start address, a destination, and the number of bytes to copy. The transfer now occurs while the CPU does something else instead of spoon-feeding the audio device.
DMA can be very complex (scatter gather, etc), and varies by bus type and system.
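As a rough picture of option 2, here is a software model of a DMA descriptor holding the three things mentioned above (start address, destination, byte count). Real controllers are programmed through device registers and run concurrently with the CPU; here `dma_engine_run` is a hypothetical stand-in that simply performs the copy, and the struct layout is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What the CPU hands to the DMA engine before it goes off to do
 * other work. Field names are illustrative, not from any real device. */
struct dma_desc {
    const void *src; /* start address */
    void *dst;       /* destination */
    size_t len;      /* number of bytes to copy */
    int done;        /* set by the engine on completion */
};

/* Stand-in for the hardware: performs the whole transfer and flags
 * completion. On real hardware this happens without CPU instructions
 * and completion is usually signalled by an interrupt. */
void dma_engine_run(struct dma_desc *d)
{
    assert(d != NULL && d->src != NULL && d->dst != NULL);
    memcpy(d->dst, d->src, d->len);
    d->done = 1;
}
```

The CPU's share of the work is just filling in the descriptor and later checking `done`, instead of issuing one mov per word.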
I agree fully with the first answer, and there are some common additions...
On most DMA hardware you can also set it up to do memory-to-memory transfers, so external devices are not always involved. Also, depending on the system, you may or may not need to sync the CPU cache in software before (or after) the transfer, since the data the DMA moves into or out of memory may be transferred without the knowledge of the CPU cache.
The benefit of doing any DMA is that the CPU(s) is/are able to do other things simultaneously.
Of course when the CPU also needs to access the memory, only one can gain access and the other must wait.
Mem to mem DMA is often used in embedded systems to increase performance, or may be vital to be able to access some parts of the memory at all.
To answer the question, DMA and CPU-cache are totally different things and not comparable.
I know it's a bit late, but answering this question will help someone like me, I guess. Agreeing with the above answers, I think the question was in relation to cache.
So yes, a cache does store information somewhere closer to the processor; this could be recently used data or the results of earlier computations. Moreover, whenever a piece of data is found in the cache (called a cache hit), the value is used directly. When it's not found (called a cache miss), the processor goes on to fetch the required value from main memory. Peripheral devices (SD cards, USB devices, etc.) can also access this data in memory, which is why on startup we usually invalidate the cache so that its lines are clean. We also flush cache data on startup so that everything in the cache is written back to main memory for the CPU to use, after which we proceed to reset or initialize the cache.
DMA (Direct Memory Access): yes, it does let you access main memory. But I think the better definition is that it lets a peripheral access memory and memory-mapped system registers that would otherwise be accessed only through the processor. #Ronnie and #Yann Ramin were both correct in that DMA can be a hardware device, so it can be used by your serial peripheral to access system registers, but it can also be used for memory-to-memory transfers between two cores.
You can read further about DMA on Wikipedia, in particular about the modes in which DMA can access system memory. I'll explain them simply:
Burst mode: DMA takes full control of the bus, CPU is idle during this time. Data is transferred in burst (as a whole) without interruption.
Cycle stealing mode: data is transferred one byte at a time. The transfer is slower, but the CPU is not idle.
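A toy way to see the difference between the two modes is to print who owns the bus in each cycle. This is a deliberately simplified model (one unit of work per cycle, strict alternation in cycle stealing mode), and the function name `bus_timeline` is made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill out with one character per bus cycle: 'D' when the DMA engine
 * owns the bus, 'C' when the CPU does. out must have room for
 * dma + cpu + 1 characters. */
void bus_timeline(int burst, size_t dma, size_t cpu, char *out)
{
    assert(out != NULL);
    size_t i = 0;

    if (burst) {
        /* Burst mode: DMA keeps the bus until its transfer is finished. */
        while (dma > 0) {
            out[i++] = 'D';
            dma--;
        }
    }

    /* Cycle stealing (or whatever is left after a burst): DMA takes
     * one cycle, then the CPU gets one, until both are finished. */
    while (dma > 0 || cpu > 0) {
        if (dma > 0) {
            out[i++] = 'D';
            dma--;
        }
        if (cpu > 0) {
            out[i++] = 'C';
            cpu--;
        }
    }
    out[i] = '\0';
}
```

For four DMA cycles and four CPU cycles, burst mode yields DDDDCCCC (the CPU waits for the whole transfer), while cycle stealing yields DCDCDCDC (the transfer finishes later, but the CPU keeps making progress).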

CUDA Memory Allocation accessible for both host and device

I'm trying to figure out a way to allocate a block of memory that is accessible by both the host (CPU) and device (GPU). Other than using cudaHostAlloc() function to allocate page-locked memory that is accessible to both the CPU and GPU, are there any other ways of allocating such blocks of memory? Thanks in advance for your comments.
The only way for the host and the device to "share" memory is using the newer zero-copy functionality. This is available on the GT200 architecture cards and some newer laptop cards. This memory must be, as you note, allocated with cudaHostAlloc so that it is page locked. There is no alternative, and even this functionality is not available on older CUDA capable cards.
If you're just looking for an easy (possibly non-performant) way to manage host to device transfers, check out the Thrust library. It has a vector class that lets you allocate memory on the device, but read and write to it from host code as if it were on the host.
Another alternative is to write your own wrapper that manages the transfers for you.
There is no way to allocate a buffer that is accessible by both the GPU and the CPU unless you use cudaHostAlloc(). This is because not only must you allocate the pinned memory on the CPU (which you could do outside of CUDA), but also you must map the memory into the GPU's (or more specifically, the context's) virtual memory.
It's true that on a discrete GPU zero-copy does incur a bus transfer. However if your access is nicely coalesced and you only consume the data once it can still be efficient, since the alternative is to transfer the data to the device and then read it into the multiprocessors in two stages.
No, there is no "automatic way" of uploading buffers to GPU memory.

Resources