I am trying to debug a program that runs partly from cached data memory and cached instruction memory. My question is about how the debugger behaves when it examines such memory. Does it access the cached copy when examining a specific location? If so, does it actually modify the cache, since it has to fetch the data on a miss? Does that mean the program might behave differently under the debugger than without it? Is there any way to debug cache-related issues without the debugger affecting the caches?
Update: The specific CPU core is an ARM Cortex-A5. The debugger is DSTREAM/DS-5.
I think the question is a bit generic because it will depend on the CPU.
However, some very general rules:
The debugger will try to see what the CPU sees on a data access, which includes lookups in the data cache.
This is different for the instruction cache: the debugger performs data accesses, so it normally does not look up the instruction cache. This is usually not a problem, because the instruction cache does not contain dirty data. Depending on the debugger, it can clean the DCache and invalidate the corresponding ICache line when data is written.
Debug accesses try not to be intrusive, and the debugger can force a mode in which no linefill is performed on a miss. But this is really dependent on the CPU and not a general rule.
The DS-5 uses a JTAG probe connected to the CPU. To read the CPU's addressable memory, it has to run the CPU through its micro-operations to fetch memory. This perturbs the cache differently than if the CPU were simply running the program.
You can minimize the effect by not stopping the CPU until after the critical (suspect) code, and then trying to piece together what must have happened from the contents of registers and memory. If you can run the program from its beginning to the breakpoint, especially if that is 10,000+ instructions, the cache will probably be in the correct state, unless there is asynchronous activity.
To identify whether an issue is due to caching, maybe you can simply disable the cache?
I'm doing this as a personal project. I want to make a visualizer for this data, but the first step is getting the data.
My current plan is to:
make my program debug the target process and step through it,
at each step, record the EIP from every thread's context within the target process,
construct the memory address the instruction uses from the context and store it.
Is there an easier or built in way to do this?
Have a look at Intel PIN for dynamic binary instrumentation, i.e. running a hook for every load/store instruction.
Instead of actually single-stepping in a debugger (extremely slow), it does binary-to-binary JIT to add calls to your hooks.
https://software.intel.com/sites/landingpage/pintool/docs/81205/Pin/html/index.html
Honestly, the best way to do this is probably instrumentation like Peter suggested, depending on your goals. Have you ever run a script that stepped through code in a debugger? Even automated, it's incredibly slow.
The only other alternative I see is page faults, which would also be incredibly slow but should still be faster than single-stepping. Basically, you make every page outside the currently executing section inaccessible. Any read/write access outside the executing code will trigger an exception where you can log details and handle it. Of course this has a lot of flaws: you can't detect reads and writes in the current page, it's still going to be slow, and it can get complicated (handling execution transfers between pages, multiple threads, etc.).
The final possible solution I have would be a timer interrupt that checks read/write access for each page. This would be incredibly fast and, although it would provide no specific addresses, it would give you an aggregate of pages written to and read from. I'm actually not entirely sure off the top of my head whether Windows already exposes that information, and I'm also not sure there's a reliable way to guarantee your timer would fire before the kernel clears those bits.
I'm looking for a way to flush the L1-L2 cache using a kernel module.
Is there a way to completely flush the whole cluster cache (4 core configuration) or even better, write back the dirty cache lines into main memory?
It sounds weird that you want to flush your caches from a kernel module. That should be done by the core-kernel part and as a driver you should not have to worry about that.
Is there any specific reason you need to do that in a driver?
I think you want to have a look at 3.9 of "Understanding the Linux Virtual Memory Manager" [1] by Mel Gorman. I think what you are looking for is flush_cache_page(...).
[1] https://www.kernel.org/doc/gorman/
Well, it seems that the way caches are flushed actually differs between architectures. Nevertheless, I didn't find an implementation that works. What I did instead was find the page table entry (PTE) of the particular page I wanted to flush and change its memory attributes to Non-Cacheable. After that, the data went directly to DRAM. (ARMv8)
Cheers
As I understand the creation of processes, every process has its own space in RAM for its heap, data, etc., which is allocated upon its creation. Many processes can share their data and storage space in some ways. But since terminating a process would erase its allocated memory (and so also its caches), I was wondering whether it is possible for many (similar) processes to share a cache in memory that is not allocated to any specific process, so that it can still be used after these processes are terminated and other ones are created.
This is a theoretical question from a student's perspective, so I am merely interested in the general sense of an operating system, without adding more functionality to it to achieve this.
For example I think of a webserver that uses only single-threaded processes (maybe due to lack of multi-threading support), so that most of the processes created do similar jobs, like retrieving a certain page.
There are at least four ways in which what you describe can occur.
First, the system address space is shared by all processes. The operating system can save data there that survives the death of a process.
Second, processes can map logical pages to the same physical page frame. The termination of one process does not cause the page frame to be deallocated while other processes still map it.
Third, some operating systems have support for writable shared libraries.
Fourth, memory mapped files.
There are probably others as well.
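As a rough illustration of the fourth option (memory-mapped files), here is a minimal Python sketch. It assumes a POSIX-like system and uses a temporary file as the backing store; the helper names `write_via_mapping`, `read_via_mapping` and `demo` are made up for this example. Both mappings are created in the same process to keep it short, but the point carries over: the data belongs to the file-backed pages, not to whichever mapping (or process) wrote it.

```python
import mmap
import os
import tempfile

SIZE = 4096  # one page is enough for the demo


def write_via_mapping(path, payload):
    """Map the file, write payload at offset 0, then drop the mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), SIZE) as m:
            m[: len(payload)] = payload
            m.flush()  # push the dirty page to the backing file


def read_via_mapping(path, length):
    """Create a fresh, independent mapping and read the bytes back."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), SIZE, access=mmap.ACCESS_READ) as m:
            return bytes(m[:length])


def demo():
    payload = b"cached page"
    fd, path = tempfile.mkstemp()
    try:
        try:
            os.ftruncate(fd, SIZE)  # a mapping needs a non-empty file
        finally:
            os.close(fd)
        write_via_mapping(path, payload)          # the "first process" is done here
        return read_via_mapping(path, len(payload))  # a later reader still sees it
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

In a real setup the writer and the reader would be separate processes that agree on the file path; the first mapping going away does not discard the data.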
I think so: when a process is terminated, its RAM is cleared. However, you're right that things such as webpages will be stored in the cache for when they're requested again. For example:
You open Google, then go to another tab and close the open Google page. When you next go to Google, it loads faster.
However, what I think you're asking is whether, if the entire program (e.g. Google Chrome or Safari) is closed, the webpage you just had open stays in the cache. No: when the program is closed, all its related data is also discarded in order to fully close the program.
I guess this page has some info on it -
https://www.wikipedia.org/wiki/Shared_memory
In a processor, what happens to the cache when the operating system replaces a page, if there is not enough space to hold all running processes' pages in memory? Does it need to flush the cache on every page replacement?
Thanks in advance for your replies.
When a page is swapped in, the contents are read off the disk and into memory. Typically this is done using DMA. So the real question is, "How is the cache kept coherent with DMA?". You can either have DMA talk to the cache controller on each access, or make the OS manage the cache manually. See http://en.wikipedia.org/wiki/Direct_memory_access#Cache_coherency.
I am not 100% sure of what happens in detail, but caches and virtual memory using paging are similar: both are divided into "pages".
In the same way that only one page needs to be replaced on a page fault, only one line of the cache needs to be replaced when a cache miss occurs. The cache has several "pages" (lines), but only the problematic one will be replaced.
There are other things that may play a part in such replacements, though I do not know exactly how: cache size, cache coherency, write-through vs. write-back, and so on. I hope someone else can give you a more detailed answer.
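To make the "only one line is replaced" point concrete, here is a toy model of a direct-mapped cache in Python. It is only a sketch: the class name `DirectMappedCache` and the sizes are invented for illustration, and real hardware caches are usually set-associative with more elaborate replacement policies.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only, not data."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # None means the line is empty

    def access(self, address):
        """Return True on a hit; on a miss, replace only the indexed line."""
        block = address // self.line_size
        index = block % self.num_lines   # which line this address maps to
        tag = block // self.num_lines    # identifies the block within that line
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag           # every other line is left untouched
        return False


if __name__ == "__main__":
    cache = DirectMappedCache(num_lines=4, line_size=64)
    for addr in (0, 64, 0, 256, 64, 0):
        print(addr, "hit" if cache.access(addr) else "miss")
```

Address 256 maps to the same line as address 0 and evicts it, while the line holding address 64 stays valid, so the whole cache does not need to be flushed.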
One of my apps needs a function that frees inactive/used/wired memory, just like the 'purge' command.
I have checked and googled a lot, but could not find anything.
Any comments are welcome.
Purge doesn't do what you seem to think it does. It doesn't "free inactive/used/wired memory". As the manpage says:
It does not affect anonymous memory that has been allocated through malloc, vm_allocate, etc.
All it does is purge the disk cache. This is only useful if you're running performance tests and want to simulate the effects of "first run after cold boot" without actually cold booting. Again, from the manpage:
Purge can be used to approximate initial boot conditions with a cold disk buffer cache for performance analysis.
There is no public API for this, although a quick scan of the symbols shows that it seems to call a function CPOSXPurgeAllDiskBuffers from the CoreProfile private framework. I believe the underlying kernel and userland disk cache code is all or mostly available on http://www.opensource.apple.com, so you could probably implement the same thing yourself, if you really want.
As iMysak says, you can just exec (or NSTask, etc.) the tool if you want to.
As a side note, if you could free used/wired memory, presumably that memory is used by something; even if you don't have pointers into it in your own data structures, malloc probably does. Are you trying to segfault your code?
Freeing inactive memory is a different story. Just freeing something up to malloc doesn't necessarily make malloc return it to the OS. And there's no way you can force it to. If you think about the way traditional UNIX works, it makes sense: When you ask it to allocate more memory, it uses sbrk to expand your data segment; if you free up memory at the top, it can sbrk back down, but if you free up memory in the middle, there's no way it can do that. Of course modern UNIX systems don't work that way, but the POSIX and C APIs are all designed to be compatible with systems that do. So, if you want to make sure memory gets freed, you have to handle memory allocation directly.
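The sbrk argument can be sketched with a toy model. The `ToyHeap` class below is purely illustrative (it is not how any real malloc is implemented, and it never reuses holes): blocks sit contiguously below a "break", and the break can only move down past blocks that are free at the very top.

```python
class ToyHeap:
    """Blocks are laid out contiguously from address 0 up to the break."""

    def __init__(self):
        self.blocks = []  # [size, in_use] pairs, in address order

    @property
    def brk(self):
        """Current top of the data segment."""
        return sum(size for size, _ in self.blocks)

    def malloc(self, size):
        # Always grow at the top, like sbrk(+size).
        self.blocks.append([size, True])
        return len(self.blocks) - 1  # handle = position of the block

    def free(self, handle):
        self.blocks[handle][1] = False
        # Only free blocks at the very top can be returned, like sbrk(-size).
        while self.blocks and not self.blocks[-1][1]:
            self.blocks.pop()


if __name__ == "__main__":
    heap = ToyHeap()
    a, b, c = heap.malloc(100), heap.malloc(200), heap.malloc(300)
    heap.free(b)
    print(heap.brk)  # the hole in the middle cannot be given back
    heap.free(c)
    print(heap.brk)  # freeing the top block releases it and the hole below it
```

Freeing the middle block leaves the break where it was; only once the block above it is also freed can the segment shrink.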
The simplest and most portable way to do this is to create and mmap a temporary backing file, or just MAP_ANON, and explicitly unmap pages when you're done with them. (This works on all POSIX systems—and, with a pretty simple wrapper, even Windows.) If you need even more control (e.g., to manually handle flushing pages to disk, etc.), you can use the mach/mach_vm.h APIs.
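For illustration, here is a minimal Python sketch of the anonymous-mapping approach, using the standard `mmap` module (passing `-1` as the file descriptor requests an anonymous mapping). The function name `use_scratch_buffer` is made up for this example; in C the equivalent would be `mmap` with `MAP_ANON` followed by `munmap`.

```python
import mmap


def use_scratch_buffer(size):
    """Allocate an anonymous mapping, use it, and explicitly unmap it."""
    buf = mmap.mmap(-1, size)  # anonymous memory, not managed by malloc
    try:
        buf[:5] = b"hello"
        return bytes(buf[:5])
    finally:
        buf.close()  # unmaps the pages, handing them back to the OS


if __name__ == "__main__":
    print(use_scratch_buffer(1 << 16))
```

Because the mapping is released with an explicit unmap rather than through `free`, the pages do not linger in the allocator's free lists.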
You can run it directly from the OS with the exec() function.