How to configure and capture buffer overflow in Intel Processor Trace - linux-kernel

I am trying simple-pt from https://github.com/andikleen/simple-pt
It seems that the buffer is used as a ring buffer.
If an internal buffer overflow occurs, can tracing pause itself and let the kernel module handle it?

You can use a double-buffering approach with two ToPA tables, each with the STOP bit set on its final entry. The Intel PT hardware fills one buffer while your userspace program reads from the other; the user is responsible for swapping buffers once it has read all the data from its buffer. If the hardware buffer fills up, the STOP bit stops tracing and sets the Stopped bit in the IA32_RTIT_STATUS MSR. You can then check this bit to determine whether a buffer overflow occurred.
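
For illustration, here is a minimal sketch of that check from a kernel module, assuming you are running on the CPU that was tracing. The RTIT_STATUS_STOPPED define is mine; per the Intel SDM, Stopped is bit 5 of IA32_RTIT_STATUS.

#include <linux/types.h>
#include <asm/msr.h>

#define RTIT_STATUS_STOPPED (1ULL << 5)  /* Stopped flag, bit 5 per the Intel SDM */

/* Returns true if tracing stopped because a ToPA STOP region filled up. */
static bool pt_buffer_full(void)
{
        u64 status;

        rdmsrl(MSR_IA32_RTIT_STATUS, status);
        return status & RTIT_STATUS_STOPPED;
}

A real implementation would also clear the Stopped flag and repoint the output-base MSRs at the fresh table before re-enabling tracing.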

Related

Is a buffer within kmalloc also a DMA safe buffer?

I'm in the middle of writing a framebuffer driver for an SPI-connected LCD. I use kmalloc to allocate the buffer, which is quite large - 150 KB. Given the way kmalloc allocates, ksize reports that considerably more memory is actually in use - 256 KB or so.
The SPI spi_transfer structure takes pointers to tx and rx buffers, both of which have to be DMA safe. Since I want the tx buffer to be about 16 KB, can I allocate that buffer within the kmalloc'ed video buffer and still be DMA safe?
This could be considered premature optimisation but there's so much spare space within the video buffer it feels bad not to use it! Essentially there is no difference in allocated memory between:
kmalloc(videosize)
and
kmalloc(PAGE_ALIGN(videosize) + txbufsize)
so one could take the kptr returned and do:
txbuf = (u8 *)kptr + PAGE_ALIGN(videosize);
I'm aware that part of the requirement for "DMA safe" is appropriate alignment - to the CPU cache-line size, I believe - but shouldn't page alignment be fine for this?
As an aside, I'm not sure whether tx and rx can point to the same place. The spi.h header is unclear too (explicitly unclear, actually). Given that the rx buffer will never be more than a few bytes, it would be silly to invite trouble by trying to find out!
The answer appears to be yes, with provisos (specifically, "it's more complicated than that").
If you acquire your memory via __get_free_page*() or the generic memory allocator (kmalloc), then you may DMA to/from that memory using the addresses returned from those routines. The implication is that a page-aligned buffer inside a kmalloc'ed region, even one spanning multiple pages, is DMA safe: the underlying physical memory is guaranteed to be contiguous, and a page-aligned buffer necessarily starts on a cache-line boundary.
One proviso is whether the device is capable of driving the full bus width (e.g. ISA): the physical address of the memory must fall within the dma_mask of the device.
Another is the cache-coherency requirement, which operates at the granularity of the cache-line width. To prevent two separate memory regions from sharing one cache line, the memory for DMA must begin exactly on a cache-line boundary and end exactly on one. Since the cache-line size may not be known, the DMA API documentation recommends mapping only virtual regions that begin and end on page boundaries (which, as stated above, are guaranteed also to be cache-line boundaries).
A driver can use dma_alloc_coherent() to allocate DMA-able space in this situation; it guarantees that the DMA region is uncacheable. As this may be expensive, a streaming method also exists for one-way communication, where coherency is limited to cache flushes on write: use dma_map_single() on a previously allocated buffer.
In my case, passing the tx and rx buffers to spi_sync without dma_map_single is fine - the SPI routines do it for me. I could use dma_map_single myself, together with either dma_unmap_single or dma_sync_single_for_cpu() to keep everything in sync. I won't bother for the moment, though - performance tweaking after the driver works is a better strategy.
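
For reference, a minimal sketch of the streaming mapping mentioned above (the function name xmit_frame is illustrative; dev, txbuf, and txbufsize would come from the driver):

#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int xmit_frame(struct device *dev, void *txbuf, size_t txbufsize)
{
        dma_addr_t handle;

        /* Flushes the CPU cache for the buffer and hands ownership to the device. */
        handle = dma_map_single(dev, txbuf, txbufsize, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... start the hardware transfer using "handle" and wait for completion ... */

        /* Returns ownership of the buffer to the CPU. */
        dma_unmap_single(dev, handle, txbufsize, DMA_TO_DEVICE);
        return 0;
}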
See also:
Does every dma_map_single call require a corresponding dma_unmap_single?
Linux kernel device driver to DMA into kernel space

DirectShow: How to get the physical address of the delivery buffer returned by IMediaSample::GetPointer

I have written a decoder transform filter, but it plays the video slowly.
There are two memcpy operations: copying from the source media sample, and then copying to the destination delivery buffer of the output pin.
I cannot avoid the memory copies entirely, but the copy into the output pin's delivery buffer can be avoided.
It would help me avoid the second copy if I could get the physical address of the output pin's delivery buffer and assign that physical address directly to my hardware decoder's register.
Exploring the methods of "m_pOutput", there is no function that returns the physical address of the pointer returned by IMediaSample::GetPointer.
Please guide me: how can I get this physical address? Is there any other way to achieve the same thing?
You are asking for the address you already have by calling IMediaSample::GetPointer, so it appears you already have what you are asking for.
The output pin is responsible for the memory allocator on the pin, so your decoder can use its own memory allocator and thereby control the buffers behind IMediaSample::GetPointer in particular. You can use buffers you allocate yourself and share them with the hardware decoder implementation (this also lets you enforce a specific alignment there, etc., or, vice versa, take buffers from the decoder for further use in the DirectShow pipeline).
I see it's Win CE; even so, it does not seem very likely that an excess memcpy alone is visibly slow - perhaps you have other bottlenecks as well.

dma mapping which does not cross some power-of-2 boundary

I want to set up a DMA mapping for a memory buffer allocated outside my control. dma_map_single appears to be the right API to use, but my hardware has a restriction: the mapping must not cross some power-of-two boundary, say 1K. The buffer being mapped is always smaller than the boundary value but is otherwise of variable size, so it looks like DMA pools may not work, since they need a fixed size - even though the "allocation" part is roughly what I need.
Should I just call dma_map_single, check whether the mapping meets my requirement, and release it if it does not? Could that keep returning the same mapping, causing a never-ending search? If so, I could hold on to the unfit mappings until a fit one is found and then release all the unfit ones in one go. None of these sound like good ideas, though.
Does anyone have other/better ideas?
Thanks.
If you can't guarantee that the buffer you are passed meets your criteria, you may need to allocate an auxiliary buffer and copy to/from it before you DMA. On platforms without an IOMMU or other address-translation hardware (e.g. classic x86, ARM, etc.), the DMA mapping operation really just converts the address to a physical one, so if you unmap and try again with the same buffer you'll always get back the same DMA address.
On most (all?) platforms that do have an IOMMU, the translation is still done in chunks >= PAGE_SIZE. In other words, on a platform with 4K pages, if you DMA-map a buffer at 0xABCDExxx you'll always get a DMA address of the form 0xFGHIJxxx, where the low part of the address ("xxx") stays the same. (This is because the IOMMU works like a normal MMU: it looks up only the page translation and leaves the low 12 or however many bits alone.)
So in essentially all cases, on all platforms, you can't use the DMA API to fix up the alignment of a buffer you are handed. As I said, I think the only alternative, if the buffers you get don't meet your alignment requirement, is to use a bounce buffer. The DMA pool API is a fine way to allocate these bounce buffers - it's no problem if you sometimes need a smaller buffer; it's fine to leave some of the memory you get back unused.
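
To illustrate, here's a rough sketch of a bounce-buffer path built on the DMA pool API, under the assumption of a 1K boundary and an upper bound on transfer size (BOUNCE_SIZE and bounce_dma are made-up names). Note that dma_pool_create() takes a boundary argument for exactly this purpose: allocations from the pool will not cross it.

#include <linux/dmapool.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/string.h>

#define BOUNCE_SIZE 512   /* assumed upper bound on buffer size */
#define BOUNDARY   1024   /* the hardware's power-of-two limit  */

/* Create the pool once, e.g. at probe time:
 *   pool = dma_pool_create("bounce", dev, BOUNCE_SIZE, 0, BOUNDARY);
 * The last argument guarantees no allocation crosses a 1K boundary. */

static int bounce_dma(struct dma_pool *pool, const void *src, size_t len)
{
        dma_addr_t handle;
        void *bounce;

        bounce = dma_pool_alloc(pool, GFP_KERNEL, &handle);
        if (!bounce)
                return -ENOMEM;

        memcpy(bounce, src, len);       /* len <= BOUNCE_SIZE assumed */
        /* ... run the DMA transfer using "handle" ... */

        dma_pool_free(pool, bounce, handle);
        return 0;
}

The pool would normally be destroyed with dma_pool_destroy() at teardown.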

Magic number with MmMapIoSpace

Upon mapping a memory space with MmMapIoSpace, I noticed that past a certain point the data was simply being discarded when written. No errors, breakpoints, or bugchecks were thrown; everything worked as normal, just without any effect.
I decided to do a write/read test: the driver writes 1s to every byte for the intended length, and a user-mode reader reports where the 1s end.
The number it came up with was 3208, which seemed like a suspiciously specific number (it divides evenly by 8, for example: 3208/8 = 401).
What's up with this? Why can't I map the full buffer space?
EDIT: In 64-bit it drops to 2492.
I'm no expert, but I don't see how MmMapIoSpace can be relied upon to do what you're asking it to, because there's no guarantee that the user-space buffer is contiguous in physical memory.
Instead, I think you should be using IoAllocateMdl and MmProbeAndLockPages to lock down the user buffer and then MmGetSystemAddressForMdlSafe to map it into the system address space. This process is described here.
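For illustration, a rough sketch of that sequence (MapUserBuffer is a made-up helper; UserBuffer and Length would come from the user-mode request):

#include <ntddk.h>

NTSTATUS MapUserBuffer(PVOID UserBuffer, ULONG Length,
                       PVOID *SystemVa, PMDL *OutMdl)
{
    PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
    if (!mdl)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* Probing user memory can raise an exception, so guard it. */
    __try {
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    *SystemVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (!*SystemVa) {
        MmUnlockPages(mdl);
        IoFreeMdl(mdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    *OutMdl = mdl;   /* caller later calls MmUnlockPages + IoFreeMdl */
    return STATUS_SUCCESS;
}
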
As previously stated, I think the point at which the mapping fails (3208/2492 bytes into the buffer) is probably just the end of the page, but that's easy enough for you to verify: have the user-space application report the (virtual) address of the first byte that didn't get written, rather than the offset, and check whether it is a multiple of 4096 or not.
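
A user-mode sketch of that check (report_first_failure is a made-up helper; first_unwritten is assumed to be the address of the first byte the test found unwritten):

#include <stdio.h>
#include <stdint.h>

static void report_first_failure(const unsigned char *first_unwritten)
{
    /* A page offset of 0 means the failure sits exactly on a 4 KB boundary. */
    printf("first unwritten byte at %p (page offset %lu)\n",
           (const void *)first_unwritten,
           (unsigned long)((uintptr_t)first_unwritten % 4096));
}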

OS X, gcc, x86, segmentation, paging, seg fault, bus error

In the case of OS X, gcc, and modern x86:
How is the x86 segmentation h/w and paging h/w used?
For the most part [1], the segmentation hardware isn't used. Most current OSes set CS, DS, SS, and ES to all point to all of memory (base address of 0, limit of 4 GB), each allowing full access to that memory (CS: execute; DS, ES, SS: read/write).
That means nearly all real access control is done by the paging unit. The basic idea is that the pages accessible to a particular process are mapped into that process. Pages that have been paged out to disk remain mapped but are marked not present, so attempting to read or write them causes an exception; the OS reads the data from the paging file back into RAM, marks the page present, and restarts the instruction.
As for how pages are marked: most executable code is marked read-only and is shared between processes, while most data and stack pages are marked read/write and are not shared. Depending on the exact system, stack space will usually also have the NX bit set to prevent it from being executed.
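
A small user-space illustration of this page-granularity protection, using mprotect() to flip a page read-only and provoke the same fault the kernel's page marking produces:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "writable");            /* fine: page is read/write    */
    mprotect(p, pagesz, PROT_READ);   /* now the page is read-only   */
    puts(p);                          /* reading still works         */
    p[0] = 'X';                       /* SIGSEGV: write to R/O page  */
    return 0;
}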
There are a few other bits and pieces that work a little differently. For example, most OSes (including OS X, if memory serves) set up a stack guard page - a page at the top of the stack that allows no access. When/if you try to access it, the OS catches the exception, allocates another page of stack space, and restarts the instruction. This means you can reserve (say) 4 megabytes of address space for the stack but allocate actual RAM only for roughly the space that has actually been used (in page-sized increments, obviously).
The hardware also supports "large" pages (4 MB, or 2 MB when PAE or long mode is enabled). These are used primarily for mapping large chunks of contiguous memory, such as the part of the memory on the graphics card that is directly visible to the CPU.
That's only a very high-level view, but it's hard to provide more detail without knowing what you care about. Trying to cover all the uses of paging by an entire OS could occupy an entire (large) book.
[1] Windows (unlike most other systems) does make minimal use of segmentation - it sets up FS as a pointer to the Thread Information Block (TIB), which gives access to some basic information about the current thread. This is useful (and used) particularly by Windows' Structured Exception Handling (and Vectored Exception Handling).
