Questions on retrieving CPU core frequency and uncore frequency with DeviceIoControl on Windows

I'm trying to retrieve the CPU core frequency and uncore frequency on Windows with the DeviceIoControl function. I've checked the declaration at https://learn.microsoft.com/zh-cn/windows/win32/api/ioapiset/nf-ioapiset-deviceiocontrol?redirectedfrom=MSDN, but I don't know how to define the dwIoControlCode parameter or the layout of the output buffer whose size lpBytesReturned reports. I've checked the control codes listed on learn.microsoft.com, which are very limited, and I can't find one for CPU frequency. I also can't find an entry for a data structure describing CPU info.
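As far as I know there is no documented control code that returns core or uncore frequency; the listed IOCTLs target device drivers. Purely to illustrate how the parameters fit together, here is a minimal sketch using IOCTL_DISK_GET_DRIVE_GEOMETRY (a disk IOCTL, not a CPU one): the output structure is whatever the chosen control code defines, and lpBytesReturned just receives how many bytes the driver wrote.

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main() {
        // Open a device object; the control code you pass must be one the
        // target driver actually implements.
        HANDLE h = CreateFileW(L"\\\\.\\PhysicalDrive0", 0,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        DISK_GEOMETRY geom = {};   // lpOutBuffer: layout is defined by the control code
        DWORD bytesReturned = 0;   // lpBytesReturned: number of bytes the driver wrote
        BOOL ok = DeviceIoControl(h,
                                  IOCTL_DISK_GET_DRIVE_GEOMETRY, // dwIoControlCode
                                  NULL, 0,                       // no input buffer
                                  &geom, sizeof(geom),
                                  &bytesReturned, NULL);
        if (ok) printf("bytes per sector: %lu\n", geom.BytesPerSector);
        CloseHandle(h);
        return 0;
    }

If the goal is just core frequency, CallNtPowerInformation with the ProcessorInformation class (which fills an array of PROCESSOR_POWER_INFORMATION records) is a documented user-mode alternative; I'm not aware of a documented way to read uncore frequency without a kernel driver.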

Related

Is there a way to measure cache coherence misses

Given a program running on multiple cores, if two or more cores are operating on the same cache line, is there a way to measure the number of cache coherence invalidations/misses there are (i.e. when Core1 writes to the cache line, which then forces Core2 to refresh its copy of the cache line so that both cores are consistent)?
Let me know if I'm using the wrong terminology for this concept.
Yes, hardware performance counters can be used to do so.
However, the way to fetch them tends to depend on the operating system and your processor. On Linux, the perf tool can be used to track performance counters (more specifically perf stat -e COUNTER_NAME_1,COUNTER_NAME_2,etc.). Alternatively, on both Linux and Windows, Intel VTune can do this too.
The list of available hardware counters can be retrieved with perf list (or with PMU-Tools).
The kind of metric you want to measure looks like Request For Ownership (RFO) in the MESI cache-coherence protocol. Fortunately, most modern (x86_64) processors include hardware events to measure RFOs. On Intel Skylake processors, there is a hardware event called l2_rqsts.all_rfo, and more precisely l2_rqsts.rfo_hit and l2_rqsts.rfo_miss, to do this at the L2-cache level. Alternatively, there are many more advanced RFO-related hardware events that can be used at the offcore level.
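To make the scenario concrete, here is a small hypothetical C++ program that provokes exactly this kind of cache-line ping-pong: two threads repeatedly write to counters that share a cache line. Running it under something like perf stat -e l2_rqsts.all_rfo,l2_rqsts.rfo_hit,l2_rqsts.rfo_miss ./a.out on a Skylake machine (event names differ on other microarchitectures) should show heavy RFO traffic, which drops if the counters are padded onto separate cache lines.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct Counters {
        std::atomic<long> a{0};  // both counters sit in the same cache line,
        std::atomic<long> b{0};  // so writes from different cores force RFOs
    };

    int main() {
        Counters c;
        auto bump = [](std::atomic<long>& x) {
            for (long i = 0; i < 50000000; ++i)
                x.fetch_add(1, std::memory_order_relaxed);
        };
        std::thread t1(bump, std::ref(c.a));
        std::thread t2(bump, std::ref(c.b));
        t1.join();
        t2.join();
        printf("%ld %ld\n", c.a.load(), c.b.load());
        return 0;
    }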

In which cases do GetSystemInfo/GetLogicalProcessorInformationEx return a different processor count within the same program run?

There are Windows API functions to obtain CPU and CPU Cache topology.
GetSystemInfo fills SYSTEM_INFO with dwActiveProcessorMask and dwNumberOfProcessors.
GetLogicalProcessorInformationEx can be used to obtain more precise information about each processor, like cache size, cache line size, cache associativity, etc. Some of this information can be obtained with _cpuidex as well.
I'm asking in which cases the obtained values for one call are not consistent with obtained values for another call, if both calls are made without program restart.
Specifically, can CPU count change:
Should the user hibernate, install a new processor, and wake the system? Or would even this not work?
Or can the operating system / hardware just dynamically decide to plug in another processor?
Or can it be achieved with virtual machines, but not on real hardware?
Can cache line size and cache size change:
For GetLogicalProcessorInformationEx, because a processor is replaced at runtime
For _cpuid, just because the system contains processors with different cache properties, and subsequent calls run on a different processor
The practical reason for these questions is that I want to obtain this information only once, and cache it:
I'm going to use it for allocator tuning, so changing the allocation strategy for already-allocated memory is likely to cause memory corruption.
These functions may be expensive; they may be kernel calls, which I want to avoid.
If a thread is scheduled to run on a core that belongs to a different processor group with a different number of processors, GetSystemInfo and GetLogicalProcessorInformation would return a different number of processors, which is the number of processors in the group that core belongs to.
On server-grade systems that support hot-adding CPUs, that number of processors can change as follows:
Hot-adding a CPU such that there is a processor group with fewer than 64 processors, while the thread is running in that group, would change the number of processors reported by GetSystemInfo, GetLogicalProcessorInformation, and GetLogicalProcessorInformationEx.
The maximum size of a processor group is 64, and Windows always creates groups in such a way as to minimize the total number of groups. So the number of groups is the ceiling of the total number of logical cores divided by 64. Hot-adding a CPU may result in the creation of a new group. This not only changes the total number of processors, but also the number of groups, which can only be obtained via GetLogicalProcessorInformationEx.
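For completeness, here is a minimal sketch (mine, not from the original answer) of querying the group layout directly, which avoids parsing the GetLogicalProcessorInformationEx buffer; these APIs exist since Windows 7.

    #include <windows.h>
    #include <stdio.h>

    int main() {
        WORD groups = GetActiveProcessorGroupCount();
        printf("active processor groups: %u\n", (unsigned)groups);
        for (WORD g = 0; g < groups; ++g)
            printf("  group %u: %lu logical processors\n",
                   (unsigned)g, GetActiveProcessorCount(g));
        // ALL_PROCESSOR_GROUPS sums the logical processors across all groups.
        printf("total: %lu\n", GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
        return 0;
    }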
On a virtual machine, the number of processors can also increase dynamically without hot-plugging.
You can simulate hot-adding processors using the PNPCPU tool. This is useful for testing the correctness of a piece of code that depends on the number of processors.
It's possible to receive a notification from the system when a new processor is added by writing a device driver and registering a synchronous or asynchronous driver notification. I don't think this is possible without a device driver.
I think that, on current systems, the cache properties returned by GetLogicalProcessorInformationEx can only change on thread migration to another core or CPU where one or more of these properties are different. For example, on an Intel Lakefield processor, the properties of the L2 cache depend on which core the thread is running on, because different cores have L2 caches with different properties.
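To illustrate the call being discussed, here is a rough sketch (assuming Windows 7 or later) of dumping cache properties via GetLogicalProcessorInformationEx with RelationCache; these are the values that could, in principle, differ after a migration such as the Lakefield case above.

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main() {
        DWORD len = 0;
        // The first call just reports the required buffer size.
        GetLogicalProcessorInformationEx(RelationCache, NULL, &len);
        std::vector<char> buf(len);
        auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data());
        if (!GetLogicalProcessorInformationEx(RelationCache, info, &len)) return 1;

        // The buffer holds variable-sized records; step through them with the Size field.
        for (DWORD off = 0; off < len; ) {
            auto* p = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data() + off);
            if (p->Relationship == RelationCache)
                printf("L%d cache: %lu KiB, line size %u bytes\n",
                       (int)p->Cache.Level, p->Cache.CacheSize / 1024,
                       (unsigned)p->Cache.LineSize);
            off += p->Size;
        }
        return 0;
    }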

Where is the segment table stored?

In the segmentation scheme, every time a memory access is made, the MMU does a translation from the logical address to the actual address by looking up the segment table.
Is the segment table stored inside the TLB or in RAM?
This depends on which type of CPU and which mode the CPU is in.
For 80x86, when a segment register is loaded the CPU stores "base address, limit and attributes" for the segment in a hidden part of the segment register.
For real mode, virtual8086 mode and system management mode, when a segment register is loaded the CPU just does "hidden segment base = segment value * 16" and there are no tables in RAM.
For protected mode and long mode, when a segment register is loaded the CPU uses the value being loaded into the segment register as an index into a table in RAM, and (after doing protection checks) loads the "base address, limit and attributes" information from the corresponding table entry into the hidden part of the segment register.
Note that (for protected mode) almost nobody used segmentation, because segment register loads are slow (due to protection checks and table lookups); so CPU manufacturers optimised the CPU for "no segmentation" (e.g. if segment bases are zero, instead of doing "linear address = virtual address + segment base" a modern CPU will just do "linear address = virtual address", avoiding the cost of an unnecessary addition and starting the cache/memory lookup sooner) and didn't bother optimising segment register loads much either. Then, when AMD designed long mode, they realised nobody wants segmentation and disabled most of it for 64-bit code (ignoring segment bases for most segment registers to get rid of the extra addition, and ignoring segment limits to get rid of the cost of segment limit checks). However, operating systems that don't use segmentation were using gs and fs as a hack to get fast access to CPU-specific or thread-specific data (because, unlike some other CPUs, 80x86 doesn't have registers that can only be modified by supervisor code, which would be more convenient for this purpose); so AMD kept the "linear address = virtual address + segment base" behaviour for these 2 segment registers and added the ability to modify the hidden "base address" part of gs and fs (via MSRs and swapgs) to make it easier to port operating systems (Windows) to long mode.
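As a small aside, that fs/gs trick is easy to observe from user mode. A hypothetical snippet (assuming x86-64 Windows and the MSVC compiler) reads the current thread's TEB both through gs and through the official API:

    #include <windows.h>
    #include <intrin.h>
    #include <cstdio>

    int main() {
        // On x86-64 Windows, the hidden gs base points at the current thread's TEB;
        // offset 0x30 is the TEB's "Self" pointer, so both values should match.
        void* teb_via_gs  = (void*)__readgsqword(0x30);
        void* teb_via_api = NtCurrentTeb();
        printf("%p %p\n", teb_via_gs, teb_via_api);
        return 0;
    }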
In other words, for 80x86 there are 3 different ways to set a segment's information (by calculation, by table lookup, or by MSR).
Also note that for most instructions (excluding things like segment register loads) 80x86 CPUs don't care how a segment's information was set and only use the hidden parts of segment registers. This means that the CPU doesn't have to consult a table every time it fetches code through cs or fetches data from memory. It also means that the majority of the CPU doesn't care which mode the CPU is in (e.g. instructions like mov eax,[ds:address] only depend on the values in the hidden part of segment registers and don't depend on the CPU mode), which is why there's no benefit to removing obsolete CPU modes (removing support for real mode wouldn't reduce the size or complexity of the CPU).
For other CPUs: most don't support segmentation (and only support paging or nothing), and I'm not familiar with how it works for any that do support it. However, I doubt any CPU would do a table lookup every time anything is fetched (it'd be far too slow/expensive to be practical), and I'd expect that for all CPUs that support segmentation, information for "currently in use" segments is stored internally somehow.
The segment table is referenced whenever memory is used, so it has to be stored persistently for later use; it is therefore stored at a physical address, i.e., in RAM.

Which files are useful to calculate the RAM consumed by a kernel process/thread?

I want to know how to calculate the physical RAM used by kernel processes/threads and kernel modules from the proc file system.
Is there any command or file which gives useful information about this?
Can anyone help?
Utilities like top/htop are helpful in such scenarios: they report the percentage of physical memory (RAM) usage in the %MEM field, and the RES field gives the amount of physical RAM consumed, in kilobytes.
There is also a tool called GNOME System Monitor that provides process-related information, but make sure your system has sufficient resources, as it consumes a significant amount itself.
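To stay within procfs, as the question asks: per-module footprints are listed in /proc/modules (second column, in bytes), while kernel-wide memory such as slab caches, kernel stacks and page tables shows up in /proc/meminfo (Slab, KernelStack, PageTables). A rough C++ sketch, assuming Linux, that prints the module sizes:

    #include <fstream>
    #include <iostream>
    #include <limits>
    #include <string>

    int main() {
        // Each line of /proc/modules starts with "name size ..."; we only need
        // the first two fields and skip the rest of every line.
        std::ifstream modules("/proc/modules");
        std::string name;
        unsigned long size;
        while (modules >> name >> size) {
            std::cout << name << ": " << size << " bytes\n";
            modules.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        }
        return 0;
    }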

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to see (get a trace of) the actual dynamic reordering done for a given program.
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable such tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool lets you see how a linear fragment of code will be split into micro-operations and how they will be scheduled onto execution ports. The tool has some limitations and is only an inexact model of CPU u-op reordering and execution.
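For reference, IACA is driven by marking the fragment you want analysed with macros from its iacaMarks.h header and then running the iaca command-line tool on the compiled object file. A rough sketch (the loop body here is just a made-up example):

    #include "iacaMarks.h"  // ships with IACA; defines the IACA_START / IACA_END markers

    // IACA analyses the instructions between the markers, i.e. one loop iteration.
    void scale(float* a, const float* b, int n) {
        for (int i = 0; i < n; ++i) {
            IACA_START
            a[i] = 2.0f * b[i] + 1.0f;
        }
        IACA_END
    }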
There are also some "external" tools for emulating x86/x86_64 CPU internals; I can recommend PTLsim (or the derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models a generic "PTL" CPU, not a real AMD or Intel CPU. The good news is that this PTL core is out-of-order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even supports a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro-)instructions you give them: they first convert those instructions into micro-operations and then schedule those. What these micro-operations are, and the entire process of instruction reordering, is a closely guarded secret, so they don't exactly want you to know what is going on.

Resources