Beaglebone Black Rev C PRU Shared Memory - memory-management

I am currently working with the BeagleBone Black Rev C, and the PRUs have the following arrangement:
I have successfully been able to have one PRU access its own 8K data RAM, the other PRU's 8K, and the shared 12K.
That leaves me with a total of 28K. Storing floats, I can hold 7,168 values. I would like to capture 20,000 values in about 250 ms.
The Derek Molloy tutorial on PRUs (http://exploringbeaglebone.com/chapter15/) claims that "a pool of 2,000,000 bytes is allocated for the sample data". Things appear to have changed since that tutorial was written: the dynamic device-tree overlays with UIO are gone, and the PRUs are now accessed through remoteproc. I am not sure whether the memory architecture changed as well. This makes most of the printed material on PRUs out of date, and it is painful to figure out what is current and what is not. I would love to have those 2,000,000 bytes available under the new architecture. I need help with the addressing, and some code samples would be excellent. Hoping someone out there can help me get it.
Question: Is there a way, technique, or resource that can be applied to let me do this? Remember, 20,000 samples in 250 ms, when providing a response.
Thanks in advance.
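For the addressing part, here is a minimal sketch of how this can look from the PRU side, assuming the standard AM335x PRU-ICSS local memory map and TI's clpru compiler. The DDR address and the read_adc() helper are hypothetical stand-ins; the address must correspond to a reserved-memory region set aside in your device tree so Linux never hands that range to anyone else:

```c
/*
 * PRU firmware sketch (TI clpru), assuming the standard AM335x PRU-ICSS
 * local memory map. Error/overflow handling omitted for brevity.
 *
 *   0x00000000  this PRU's 8 KB data RAM
 *   0x00002000  the other PRU's 8 KB data RAM
 *   0x00010000  the 12 KB shared data RAM
 *
 * Anything outside the local map is treated as a global address and goes
 * out through the PRU's OCP master port (enable it first by clearing the
 * STANDBY_INIT bit in the PRU-ICSS CFG SYSCFG register).
 */
#define SHARED_DRAM   ((volatile float *)0x00010000)
#define SHARED_FLOATS 3072u                    /* 12 KB / 4 bytes per float */

/* HYPOTHETICAL DDR carveout (example address only): it must match a
 * reserved-memory node in your device tree. 2,000,000 bytes would hold
 * 500,000 floats; the 20,000 samples asked for need only 80,000 bytes. */
#define DDR_SAMPLES   ((volatile float *)0x9E000000)

extern float read_adc(void);    /* placeholder for the actual sampling */

void capture(unsigned int nsamples)
{
    for (unsigned int i = 0u; i < nsamples; i++)
        DDR_SAMPLES[i] = read_adc();   /* straight into DDR via the OCP port */
}
```

On the Linux side, the same physical range can be read back by mmap()ing /dev/mem at the reserved address. At 20,000 samples per 250 ms (about 320 KB/s of floats) the OCP path to DDR should have bandwidth to spare, but treat this as a sketch to validate against TI's current remoteproc examples rather than a drop-in answer.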

Related

External memory Data Copy through SPI -- Speed

No amount of experience seems to be enough for the strange issues that pop up on serial communication buses. We are trying to implement a data copy from an external flash into SRAM. Below are the details of how we have configured our system.
Controller: RH850 (D1M1), PLL speed at 60 MHz
External flash: IS25LP128
SPI speed: 5 MHz (clock observed on an oscilloscope)
Data size: 4 MB
Now, in theory, if my SPI is operating at 5 MHz it should copy 5 Mbit/s. We are trying to copy 4 MB, which is 32 Mbit, so in theory the transfer should take roughly 6.5 seconds. OK, we have some implementation overheads. My driver code can accept only up to 64 KB per read call, so we chose to copy 40 KB about 100 times, running it in a for loop. Let me add a whopping 5 seconds of overhead (sorry, RH850!), so 12 seconds in total; well, let's add some more buffer and make it a comfortable 15 seconds (max expected!). But when we run the code, it takes a whole 40 seconds to finish the copy. We have checked the clock: it is 5 MHz as expected, and at least the clocks are continuous.
Has anyone here faced this? Where should we look? I know I have a vendor-provided flash driver to dig into, but before I do that I wanted to be sure! Any help will be really appreciated.
At first glance, I can think of at least ten things that may be responsible for this. One thing I'm sure of: this problem is complex, and there is no simple one-line solution. The main suspect is the part that is not yours: the flash driver. So isolate the pieces one by one and verify them, starting from the bottom.
Is there an operating system? Is DMA in use? Any issue with memory or resource arbitration/sharing? Are interrupts in use, or polling? Are any higher-priority jobs running? Is data read from registers or memory-mapped? Does the driver use a generic SPI peripheral or a dedicated serial-flash controller (I don't know the RH850; some microcontrollers have one)?
Your post is not precise enough, so maybe these questions will help you. What would I do? Write my own driver!
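One cheap first isolation step, before opening the vendor driver: time each chunk individually and see whether the extra seconds are spread evenly (a per-byte cost, e.g. dead time between bytes) or concentrated in a few calls (per-call setup or preemption). A sketch, where spi_flash_read() and get_ticks_ms() are hypothetical stand-ins for the vendor's read call and whatever free-running timer the project provides:

```c
#include <stdint.h>

#define CHUNK_BYTES  (40u * 1024u)   /* 40 KB per read, as in the post */
#define NUM_CHUNKS   100u            /* 100 x 40 KB = 4 MB total       */

/* Hypothetical stand-ins: replace with the vendor driver's read call
 * and the project's free-running millisecond counter. */
extern int      spi_flash_read(uint32_t flash_addr, uint8_t *dst, uint32_t len);
extern uint32_t get_ticks_ms(void);

static uint8_t sram_buf[CHUNK_BYTES];

/* Results left in globals so they can be inspected with a debugger. */
volatile uint32_t total_ms, worst_ms;

void profile_copy(void)
{
    total_ms = 0u;
    worst_ms = 0u;

    for (uint32_t i = 0u; i < NUM_CHUNKS; i++) {
        uint32_t t0 = get_ticks_ms();
        spi_flash_read(i * CHUNK_BYTES, sram_buf, CHUNK_BYTES);
        uint32_t dt = get_ticks_ms() - t0;

        total_ms += dt;
        if (dt > worst_ms)
            worst_ms = dt;
    }
    /* Ideal time per 40 KB chunk at 5 Mbit/s is ~66 ms
     * (40 * 1024 * 8 bits / 5e6 bit/s). If every chunk takes ~400 ms,
     * the cost is per byte (look for gaps between bytes on the scope);
     * if a few chunks dominate, look at per-call setup, interrupts or
     * higher-priority tasks. */
}
```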

Problem with RAM on a Zynq 7020 FPGA, can someone give me advice?

Hello, I get a strange message when I try to run MAP. I set up the RAM properly and also checked that the design uses only 80% of the resources I have on the device. Why do I get this message, and can anyone advise me what to do?
This is the error I got when I ran the "map" step to generate a bit file:
[image: summary of the resources]
ERROR:Place:543 - This design does not fit into the number of slices available
in this device due to the complexity of the design and/or constraints.
Unplaced instances by type:
BLOCKRAM 77 (55.0)
Please evaluate the following:
BLOCKRAM
u_xyz2lcd_for_test/u_send_to_zedboard/dpr_2/U0/xst_blk_mem_generator/gnativebmg.native_blk_mem_gen/valid.cstr/ramloop[6].ram.r/v6_noinit.ram/NO_BMM_INFO.SDP.SIMPLE_PRIM18.ram
BLOCKRAM
It simply means that you want to use more RAM than the device has.
I suggest you check your resources again and check the amount of memory used.
Your 80% may be LUTs or FFs, or you may have read something wrong.
There is another possibility, although it is very rare:
Your memory usage may increase in Place and Route if it has to split the memory over multiple blocks because you have some unusual configuration.
This example may not be exactly valid, but it tries to show what can happen:
Suppose you use bit-write enables. Synthesis thinks you have enough memory, but PAR has to use a whole byte lane for each bit, so PAR has to split the data over more blocks and in the end runs out.
The case where I saw this was a very complex one involving DSPs.
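A hypothetical worked example of that splitting effect (illustrative only; the exact mapping depends on the device family and tool version): a 1024 x 32 RAM is 32 Kb of data and would normally fit in two 18 Kb primitives, but if every bit needs its own independent write enable, each bit slice can end up as its own primitive:

$$
\underbrace{1024 \times 32}_{32\ \text{Kb}\ \approx\ 2\ \text{RAMB18}}
\;\xrightarrow{\ \text{per-bit write enables}\ }\;
32 \times (1024 \times 1) = 32\ \text{RAMB18 primitives}
$$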

When to use cudaHostRegister() and cudaHostAlloc()? What is the meaning of "pinned or page-locked" memory? What are the equivalents in OpenCL?

I am new to these NVIDIA APIs and some expressions are not clear to me. I was wondering if somebody could help me understand when and how to use these CUDA commands in a simple way. To be more precise:
While studying how to speed up applications with parallel execution of a kernel (with CUDA, for example), at some point I ran into the problem of speeding up host-device interaction.
I have some information, gathered from the web, but I am a little bit confused.
It is clear that you can go faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that
"you can use the cudaHostRegister() command to take some data (already allocated) and pin it, avoiding an extra copy to get it onto the GPU".
What is the meaning of "pin the memory"? Why is it so fast? How can I pin memory that I have already allocated? Later in the same video in the link, they continue explaining that
"if you are transferring PINNED memory, you can use the asynchronous memory transfer, cudaMemcpyAsync(), which lets the CPU keep working during the memory transfer".
Are the PCIe transactions managed entirely by the CPU? Is there a bus manager that takes care of this?
Partial answers are also really appreciated, to re-compose the puzzle at the end.
Links about the equivalent APIs in OpenCL would also be appreciated.
What is the meaning of "pin the memory"?
It means making the memory page-locked, i.e. telling the operating system's virtual memory manager that those pages must stay in physical RAM so that they can be directly accessed by the GPU across the PCI Express bus.
Why is it so fast?
In one word: DMA. When the memory is page-locked, the GPU's DMA engine can run the transfer directly without requiring the host CPU, which reduces overall latency and decreases net transfer times.
Are the PCIe transactions managed entirely by the CPU?
No. See above.
Is there a bus manager that takes care of this?
No. The GPU manages the transfers. In this context there is no such thing as a bus master.
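To make the difference between the two calls concrete, here is a minimal sketch (plain C, compiled with nvcc); the buffer size and stream usage are illustrative, and error checking is omitted for brevity:

```c
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;   /* 1 MiB per buffer, arbitrary size */
    char *pinned, *regular, *dev;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMalloc((void **)&dev, 2 * n);

    /* Option 1: allocate host memory that is page-locked from the start. */
    cudaHostAlloc((void **)&pinned, n, cudaHostAllocDefault);

    /* Option 2: pin memory that was already allocated with malloc(). */
    regular = (char *)malloc(n);
    cudaHostRegister(regular, n, cudaHostRegisterDefault);

    /* Because both buffers are page-locked, these copies can run
     * asynchronously: the DMA engine does the work and the calls return
     * immediately, leaving the CPU free until the synchronize. */
    cudaMemcpyAsync(dev,     pinned,  n, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dev + n, regular, n, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaHostUnregister(regular);
    free(regular);
    cudaFreeHost(pinned);
    cudaFree(dev);
    cudaStreamDestroy(stream);
    return 0;
}
```

On the OpenCL side, the commonly cited analogue is creating a buffer with CL_MEM_ALLOC_HOST_PTR and accessing it via clEnqueueMapBuffer()/clEnqueueUnmapMemObject(); OpenCL does not expose pinning as explicitly as CUDA, so whether the memory is actually page-locked is implementation-dependent.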
EDIT: It seems CUDA treats pinned and page-locked as the same, as per the Pinned Host Memory section in this blog post written by Mark Harris. This means my answer is moot and the best answer should be taken as is.
I bumped into this question while looking for something else. For all future readers: I think talonmies answers the question perfectly, but I'd like to point out a slight difference between locking and pinning pages. The former ensures that the memory is not paged out, but the kernel is free to move it around; the latter ensures that it stays in memory (i.e. is non-pageable) and also stays mapped at the same address.
Here's a reference on the same.

What does the memory map of a Windows process look like?

This might be a duplicate question. I wish to know what the memory map of a Windows process looks like. I am looking for details. Kindly provide links to blogs, articles, and other relevant literature.
I always like to actually be able to see things rather than just read theory. It turns out, according to this blog post, that if you open a program in WinDbg even when it isn't running, it still gets mapped into an address space as if it were. Thus your disassembly window shows you, figuratively (nothing guarantees your code loads at those exact addresses), what sits at those addresses in terms of code:
Of course, you can't rely on those addresses thanks to ASLR, but it gives you an idea and gets you to think: memory addresses are also just code. Code and data are stored in the same (virtual) space, as per the von Neumann architecture that most modern computers implement. Unfortunately, since the program isn't actually running, there is no stack or heap yet, so you can't go and look at those.
This blog post from Microsoft gives you a high-level overview of the virtual address space. As you can see, half of it is reserved for use by the operating system, and the other half you can fill with whatever you have (code, malloc allocations, stack allocations, etc.).
In terms of how the address space works on the user side, this diagram helped me understand it. It's linked from this question, which provides a series of decent links on the various possible maps. Remember, though, that the layout of the parts in memory will differ.
The important point to remember is that all of it (program, data, stack, heap, kernel stuff) is one big sequential series of memory addresses, although these may or may not actually translate to physical memory addresses.
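If you want to see that big series of addresses for yourself, a small sketch using VirtualQuery() can walk the regions of the current process; the exact output differs from run to run thanks to ASLR:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char *addr = NULL;

    /* Walk the user-mode address space region by region, from the
     * lowest address upward, printing each region's state. */
    while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        printf("%p  %10zu KB  %s\n",
               mbi.BaseAddress,
               mbi.RegionSize / 1024,
               mbi.State == MEM_COMMIT  ? "committed" :
               mbi.State == MEM_RESERVE ? "reserved"  : "free");

        addr = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;
        if (addr < (unsigned char *)mbi.BaseAddress)  /* wrapped around */
            break;
    }
    return 0;
}
```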
Whilst you're at it, you might also be interested in how the executable appears on disk. This article and this article provide some in-depth analysis of the PE file format, and the latter also has a little diagram showing roughly how the file's data is mapped into memory.

Report Direct3D memory usage

I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable, though.
The free memory may change at any instant, and the available memory often takes AGP memory into account (which is strictly not video memory). You can use the numbers to make a good guess about the default texture resolutions and detail level of your application/game, but that's it.
You may wonder why there is no way to get better numbers; after all, it can't be that hard to track resource usage.
From an application point of view that is correct. You may think the video memory just contains surfaces, textures, index and vertex buffers, and some shader programs, but that's not true on the low-level side.
There are lots of other resources as well. All of these are created and managed by the Direct3D driver to make rendering as fast as possible. Among others, there are hierarchical z-buffer acceleration structures and pre-compiled command lists (i.e., the data required to render something, in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings. A driver may decide to optimize the geometry in these cases for better cache usage. This all happens in parallel and under the control of the driver. All this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move the data between system RAM, AGP memory, and video RAM for you. In practice you never have to worry about running out of video memory. Sure, once you need more video memory than is available the performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of overall memory usage progression (see the sketch at the end of this answer).
Assuming you develop on NVIDIA hardware, PerfHUD includes a graphical representation of consumed AGP/video memory (separately).
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers, etc.) vs. memory location (AGP, video, system), as
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.
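Regarding the first suggestion, here is a minimal sketch of sampling GetAvailableTextureMem() around an allocation, written in C against the d3d9.h C interface macros; the device is assumed to be created elsewhere, and the numbers are rough by design:

```c
/* Compile as C; d3d9.h provides the C interface macros used below.
 * `device` is assumed to be an IDirect3DDevice9* created elsewhere.
 * Error checking omitted for brevity. */
#include <d3d9.h>
#include <stdio.h>

void report_texture_mem_delta(IDirect3DDevice9 *device)
{
    UINT before, after;
    IDirect3DTexture9 *tex = NULL;

    before = IDirect3DDevice9_GetAvailableTextureMem(device);

    /* Example allocation: a 1024x1024 ARGB texture, roughly 4 MB. */
    IDirect3DDevice9_CreateTexture(device, 1024, 1024, 1, 0,
                                   D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT,
                                   &tex, NULL);

    after = IDirect3DDevice9_GetAvailableTextureMem(device);

    /* The returned values are rounded, so treat the delta only as a
     * rough indicator of where the memory went. */
    printf("available: %u MB before, %u MB after\n",
           before / (1024u * 1024u), after / (1024u * 1024u));

    if (tex)
        IDirect3DTexture9_Release(tex);
}
```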
