I am on Linux kernel 2.6.32.
I am facing an issue in which one of two ISRs (serial and Ethernet) takes more time than expected (hundreds of microseconds) on several occasions / under some scenarios I haven't identified. I would like to get the time difference every time the ISR executes.
What would be the best way (least expensive in terms of overhead involved)? I don't see that the ARM architecture has a TSC register (or a read_tsc API) which would give me direct access to time, as some other architectures offer.
So the idea is:
1) The moment the ISR is invoked, measure the time.
2) The moment the ISR completes, measure the time.
3) Get the difference of 1 and 2 and store it in some variable.
4) Keep doing steps 1 to 3, and when the value obtained in step 3 is greater than the stored value, overwrite it (i.e. keep/preserve the maximum latency).
When the issue happens (some abrupt condition), print the value (or an array of the last 10 values).
I need to do this in a kernel driver, so let me know what would be the least expensive way.
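A minimal sketch of steps 1 to 4 above, assuming you can instrument the handler directly (my_isr, isr_max_ns and the ring buffer are illustrative names, not from your driver); ktime_get() is used for portability, though a raw cycle counter, as in the answer below, is cheaper per read:

#include <linux/interrupt.h>
#include <linux/ktime.h>

static s64 isr_max_ns;          /* worst-case duration seen so far */
static s64 isr_last_ns[10];     /* ring buffer of the last 10 durations */
static int isr_last_idx;

static irqreturn_t my_isr(int irq, void *dev_id)
{
    ktime_t t0 = ktime_get();                       /* step 1 */
    s64 ns;

    /* ... existing interrupt handling work ... */

    ns = ktime_to_ns(ktime_sub(ktime_get(), t0));   /* steps 2 and 3 */
    isr_last_ns[isr_last_idx] = ns;
    isr_last_idx = (isr_last_idx + 1) % 10;
    if (ns > isr_max_ns)                            /* step 4: keep the maximum */
        isr_max_ns = ns;

    return IRQ_HANDLED;
}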
OMAP3 has a Cortex-A8 core. That does have a Performance Monitoring Unit (PMU). The Cycle Count register (CCNT) would correspond to the x86 TSC, except that you probably have to enable counting before you read it. There is good info in a BeagleBoard post.
In 2.6.32.55 I see that arch/arm/oprofile/op_model_v7.c gives full access and control. My need was bare-metal; I used ARM example code that was simple and worked for me.
It would also be possible to use an OMAP3 GPT (general-purpose timer), but that would be more work, e.g. to get its clock input set up from the PRCM.
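For reference, a sketch of enabling and reading the CCNT along the lines of that ARM example code; it assumes privileged (kernel or bare-metal) mode and that nothing else, such as oprofile or perf, is reprogramming the PMU:

/* Enable the Cortex-A8 cycle counter once, then read it cheaply. */
static inline void ccnt_init(void)
{
    unsigned int pmcr;
    /* PMCR: enable all counters (bit 0) and reset CCNT (bit 2) */
    asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1 << 0) | (1 << 2);
    asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));
    /* PMCNTENSET: enable the cycle counter (bit 31) */
    asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

static inline unsigned int ccnt_read(void)
{
    unsigned int cycles;
    /* PMCCNTR: current cycle count */
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}

Two ccnt_read() calls around the code of interest and a subtraction give the duration in core clock cycles; divide by the core clock frequency to convert to time, and note that the 32-bit counter wraps after a few seconds at typical OMAP3 clock rates.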
Related
Consider that I have some software and want to study its behavior using a black-box approach. I have a 3.0 GHz CPU with 2 sockets and 4 cores. As you know, in order to find the instructions per second (IPS) we have to use the following formula:
IPS = sockets * (cores/socket) * clock * (instructions/cycle)
At first, I wanted to find the number of instructions per cycle for my specific algorithm. Then I realised it's almost impossible to count it using a black-box approach and I need to do an in-depth analysis of the algorithm.
But now I have two questions: Regardless of what kind of software is running on my machine and its CPU usage, is there any way to count the number of instructions per second executed by the CPU (millions of instructions per second, MIPS)? And is it possible to find the types of instructions executed (add, compare, in, jump, etc.)?
Any script or tool recommendation would be appreciated (in any language).
perf stat --all-user ./my_program on Linux will use CPU performance counters to record how many user-space instructions it ran, how many core clock cycles it took, and how much CPU time it used, and it will calculate the average instructions per core clock cycle for you, e.g.
3,496,129,612 instructions:u # 2.61 insn per cycle
It calculates IPC for you; this is usually more interesting than instructions per second. uops per clock is usually even more interesting in terms of how close you are to maxing out the front-end, though. You can manually calculate MIPS from instructions and task-clock. For most other events perf prints a comment with a per-second rate.
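As a worked example with a hypothetical task-clock figure: if the 3,496,129,612 instructions above were retired over 1,200 ms of task-clock, that would be 3,496,129,612 / 1.2 s ≈ 2.9e9 instructions per second, i.e. roughly 2,900 MIPS.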
(If you don't use --all-user, you can use perf stat -e task-clock:u,instructions:u,... to have those specific events count in user-space only, while other events can count always, including inside interrupt handlers and system calls.)
But see How to calculate MIPS using perf stat for more detail on instructions / task-clock vs. instructions / elapsed_time if you do actually want total or average MIPS across cores, and counting sleep or not.
For an example output from using it on a tiny microbenchmark loop in a static executable, see Can x86's MOV really be "free"? Why can't I reproduce this at all?
How can I get real-time information at run-time
Do you mean from within the program, to profile only part of it? There's a perf API where you can do perf_event_open or something. Or use a different library for direct access to the HW perf counters.
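A hedged sketch of that approach, based on the perf_event_open(2) man page (Linux-specific, error handling trimmed): it counts user-space instructions for just the bracketed region of the calling thread.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;        /* behaves like instructions:u */
    attr.exclude_hv = 1;

    /* measure the calling thread (pid 0), on any CPU (-1) */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code region you want to profile ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}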
perf stat is great for microbenchmarking a loop that you've isolated into a stand-alone program that just runs the hot loop for a second or so.
Or maybe you mean something else. perf stat -I 1000 ... ./a.out will print counter values every 1000 ms (1 second), to see how program behaviour changes in real time with whatever time window you want (down to 10ms intervals).
sudo perf top is system-wide, slightly like Unix top
There's also perf record --timestamp to record a timestamp with each event sample. perf report -D might be useful along with this. See http://www.brendangregg.com/perf.html, he mentions something about -T (--timestamp). I haven't really used this; I mostly isolate single loops I'm tuning into a static executable I can run under perf stat.
And is it possible to find the types of instructions executed (add, compare, in, jump, etc.)?
Intel x86 CPUs at least have a counter for branch instructions, but other types aren't differentiated, other than FP instructions. This is probably common to most architectures that have perf counters at all.
For Intel CPUs, there's ocperf.py, a wrapper for perf with symbolic names for more microarchitectural events. (Update: plain perf now knows the names of most uarch-specific counters so you don't need ocperf.py anymore.)
perf stat -e task_clock,cycles,instructions,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.scalar_double,uops_executed.x87 ./my_program
It's not designed to tell you what instructions are running; you can already tell that from tracing execution. Most instructions are fully pipelined, so the interesting thing is which ports have the most pressure. The exception is the divide/sqrt unit: there's a counter for arith.divider_active: "Cycles when divide unit is busy executing divide or square root operations. Accounts for integer and floating-point operations". The divider isn't fully pipelined, so a new divps or sqrtps can't always start even if no older uops are ready to execute on port 0. (http://agner.org/optimize/)
Related: linux perf: how to interpret and find hotspots for using perf to identify hotspots. Especially using top-down profiling you have perf sample the call-stack to see which functions make a lot of expensive child calls. (I mention this in case that's what you really wanted to know, rather than instruction mix.)
Related:
How do I determine the number of x86 machine instructions executed in a C program?
How to characterize a workload by obtaining the instruction type breakdown?
How do I monitor the amount of SIMD instruction usage
For exact dynamic instruction counts, you might use an instrumentation tool like Intel PIN, if you're on x86. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
The perf stat counts for the instructions:u hardware event should also be more or less exact, and in practice they are very repeatable across runs of the same program doing the same work.
On recent Intel CPUs, there's HW support for recording which way conditional / indirect branches went, so you can reconstruct exactly which instructions ran in which order, assuming no self-modifying code and that you can still read any JIT buffers: Intel PT (Processor Trace).
Sorry I don't know what the equivalents are on AMD CPUs.
Under Windows, my application makes use of QueryPerformanceCounter (and QueryPerformanceFrequency) to perform "high resolution" timestamping.
Since Windows 10 (and only tested on Intel i7 processors so far), we observe erratic behaviours in the values returned by QueryPerformanceCounter.
Sometimes, the value returned by the call will jump far ahead and then back to its previous value.
It feels as if the thread has moved from one core to another and was returned a different counter value for a lapse of time (no proof, just a gut feeling).
This has never been observed under XP or 7 (no data about Vista, 8 or 8.1).
A "simple" workaround has been to enable the UsePlatformClock boot opiton using BCDEdit (which makes everything behaves wihtout a hitch).
I know about the potentially superior GetSystemTimePreciseAsFileTime but as we still support 7 this is not exactly an option unless we write totatlly different code for different OSes, which we really don't want to do.
Has such behaviour been observed/explained under Windows 10 ?
I'd need much more knowledge about your code, but let me highlight a few things from MSDN:
When computing deltas, the values [from QueryPerformanceCounter] should be clamped to ensure that any bugs in the timing values do not cause crashes or unstable time-related computations.
And especially this:
Set that single thread to remain on a single processor by using the Windows API SetThreadAffinityMask ... While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for multiple processors, bugs in the BIOS or drivers may result in these routines returning different values as the thread moves from one processor to another. So, it's best to keep the thread on a single processor.
Your case might be hitting one of those bugs. In short:
You should always query the timestamp from one thread (setting the CPU affinity to be sure it won't change) and read that value from any other thread (just an interlocked read, no need for fancy synchronization).
Clamp the calculated delta (at least to be sure it's not negative)...
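A minimal Win32 sketch of those two points, assuming a dedicated timestamp thread is acceptable (the names and the polling loop are illustrative, not a drop-in for your Delphi class):

#include <windows.h>

static volatile LONG64 g_qpc;   /* last counter value published by the timestamp thread */

static DWORD WINAPI TimestampThread(LPVOID param)
{
    LARGE_INTEGER now;
    SetThreadAffinityMask(GetCurrentThread(), 1);   /* pin to CPU 0 */
    for (;;) {
        QueryPerformanceCounter(&now);
        InterlockedExchange64(&g_qpc, now.QuadPart);
        Sleep(1);   /* polling only for the sketch; pick whatever update period you need */
    }
    return 0;
}

/* Any thread: read the published value atomically and clamp the delta. */
static LONG64 ElapsedTicks(LONG64 start)
{
    LONG64 now = InterlockedCompareExchange64(&g_qpc, 0, 0);   /* atomic 64-bit read */
    LONG64 delta = now - start;
    return delta < 0 ? 0 : delta;
}

Note that the effective resolution here is the timestamp thread's update period; if that is too coarse, the alternative is to have each measuring thread pin itself with SetThreadAffinityMask before calling QueryPerformanceCounter directly.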
Notes:
QueryPerformanceCounter() uses the TSC if possible (see MSDN). The algorithm used to synchronize the TSC (if available, and in your case it should be) changed substantially from Windows 7 to Windows 8; however, note that:
With the advent of multi-core/hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC cannot be relied upon to provide accurate results — unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. Therefore, a program can get reliable results only by limiting itself to run on one specific CPU.
Then, even if in theory QPC is monotonic, you must always call it from the same thread to be sure of this.
Another note: if synchronization is done by software, you may read in the Intel documentation that:
...It may be difficult for software to do this in a way that ensures that all logical processors will have the same value for the TSC at a given point in time...
Edit: if your application is multithreaded and you can't (or don't want to) set CPU affinity (especially if you need precise timestamping at the cost of having de-synchronized values between threads), then you may use GetSystemTimePreciseAsFileTime() when running on Windows 8 (or later) and fall back to timeGetTime() for Windows 7 (after you set the granularity to 1 ms with timeBeginPeriod(1), and assuming 1 ms resolution is enough). A very interesting read: The Windows Timestamp Project.
Edit 2: directly suggested by the OP! This, when applicable (because it's a system setting, not local to your application), might be an easy workaround. You can force QPC to use the HPET instead of the TSC using bcdedit (see MSDN). Latency and resolution should be worse, but it's intrinsically safe from the issues described above.
I'm currently learning C for my next emulation project, a cycle-accurate 68000 core (my last project being a non-cycle-accurate Sega Master System emulator written in Java which is now on its third release). My query regards cycle-level accuracy, as taking things to this level is new to me.
To break things down to a granularity of 1 CPU cycle, presumably I need to know how long memory accesses take and so on, but my question is: for instructions that take multiple cycles in their memory fetch/write stages, what is the CPU doing on each cycle - e.g. is a fixed number of bits copied per cycle?
With my SMS emulator I didn't have to worry too much about M1 stages etc., as it just used a cycle count for each instruction - in other words it was only accurate to an instruction level, not a cycle level. I'm not looking for architecture-specific details, merely an idea of what sort of things I should look out for when going to this level of granularity.
68k details are welcome, however. Basically I'm wondering what is supposed to happen if a video chip reads from an area of memory whilst the CPU is still writing the data to it, midway through that phase of an instruction, and other similar situations. I hope I've made it clear enough; thank you.
For a really cycle-accurate emulation you first have to decide on a master clock you want to use as a reference. That should be the fastest clock at whose granularity the running software can detect differences in order of occurrence. This could be the CPU clock, but in most cases the bus cycle time decides at which granularity events can be discerned (and that is often only a fraction of the CPU clock).
Then you need to find out the precedence order of the different devices (ICs) connected to that bus (if there is more than one bus master). An example would be whether (and how) video DMA can delay the CPU.
Generally there are no "at the same time" events. Either the CPU writes before the DMA reads, or the other way around (that is still true in the case of dual-ported devices; you just need to consider the device's inherent precedence mechanism).
Once you have a solid understanding of which clock effectively controls the granularity of discernible events, you can think about how to structure the emulator to reproduce that behaviour exactly.
This way you can create a 100% cycle-exact emulation, provided you have enough information about all the devices' behavior.
Sorry I can't give you more detailed info; I know nothing about the specifics of the Sega hardware.
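A minimal sketch of that structure in C, with hypothetical device hooks (not a real 68000 or video-chip model): every device is stepped once per master-clock tick in a fixed precedence order, so "simultaneous" bus events always resolve the same way.

#include <stdint.h>

typedef struct {
    void (*tick)(void *ctx);   /* advance this device by one master-clock cycle */
    void *ctx;
} device_t;

static void run_frame(device_t *devices, int ndev, uint64_t ticks_per_frame)
{
    for (uint64_t t = 0; t < ticks_per_frame; t++) {
        /* Fixed order encodes bus precedence, e.g. video DMA before CPU,
           so DMA can stall the CPU on this tick if it owns the bus. */
        for (int i = 0; i < ndev; i++)
            devices[i].tick(devices[i].ctx);
    }
}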
My guess is that you don't have to get into excruciating detail to get good enough results for the timing of this sort of thing - which you can't do anyway if you don't want to get into the specifics of the architecture.
Your main question seemed to be "what is supposed to happen if a video chip reads from an area of memory whilst a CPU is still writing the data to it". Generally on these older chips, the bus protocols are pretty simple (they're not packetized) and there is usually a pin that indicates that the bus is busy. So if the CPU is writing to memory, the video chip will simply have to wait until the CPU is done. Because of these sorts of limitations, dual ported ram was popular for a while so that the frame buffer could be simultaneously written by the CPU and read by the RAMDAC.
Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for the Windows thread management code, on a fairly frequent basis (about once every few seconds), to switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason why I need to be fast in this code is due to the size of the buffer in milliseconds I have chosen in my DirectShow filter graphs. To keep latency at a minimum audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation prevention mechanism. There's a so-called Balance Set Manager that wakes up every second and looks for ready threads that haven't been run for about 3 or 4 seconds, and if there is one, it'll boost its priority to 15 and give it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior; threads may not be run for seconds.
You may need to elevate priorities, or otherwise consider which other threads are competing for CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for one of the high-precision timers. Some details in this MSDN link.
First of all, I am not sure Delphi's Now is a good choice for millisecond-precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision in critical section locking, everything runs pretty fast; however, if you are trying to enter a critical section which is currently locked by another thread, you eventually hit a wait operation on an internal kernel object (a mutex or event), which involves yielding control of the thread and waiting for the scheduler to give control back later.
How much "later" depends on a few things, including the priorities mentioned above, and there is one important thing you omitted from your test: the overall CPU load at the time of testing. The higher the load, the lower the chance that the thread continues execution soon. 16 ms perhaps still looks to be within reasonable tolerance, and all in all it may depend on your actual implementation.
As per SDIO specification, the sequence of operations (for write transaction) take place as:
Command53 -- CommandLatency -- Command53Response -- ResponseLatency -- startbit -- write-number-of-bytes -- CRC -- endbit -- WriteLatency -- startbit -- CRC -- endbit -- busybit.
During benchmarking of SDIO UART driver, the time values which I got were more than expected. A lot of latency was found especially during write transaction.
Reasons for latency could be scheduler allocating processor time to other processes, delay in work queues, etc.
I would like to analyze and understand the latency. Maybe understanding the mapping between the device driver code and the logic analyzer waveform can lead to some clue.
Can somebody shed some light on this?
Thank you.
EDIT 1:
Sorry! I assumed a few things.
In sdio_uart_transmit_chars() there is a call to sdio_out(), which in turn calls sdio_writeb(), and this call writes byte-wise (one byte at a time) to the SDIO UART device. I modified the driver to use sdio_writesb(), i.e. multi-byte mode. This reduced the time taken to write X bytes. Interestingly, as the size of the write data increased, there was an exponential increase in WriteLatency (as mentioned above).
This latency could be because of many reasons. I would like to understand these reasons.
Setup: I am using a Linux (v2.6.32) laptop and a loadable kernel module (which is a modified sdio_uart.c).
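For illustration, the shape of the sdio_writeb()-to-sdio_writesb() change described above (func, reg_addr, buf and len are placeholders, not the real sdio_uart.c names): each sdio_writeb() is a separate bus transaction, while sdio_writesb() pushes the whole buffer as one multi-byte transfer to the same fixed register address.

#include <linux/mmc/sdio_func.h>

static int tx_bytewise(struct sdio_func *func, unsigned int reg_addr,
                       const u8 *buf, int len)
{
    int i, err = 0;
    /* one bus transaction per byte */
    for (i = 0; i < len && !err; i++)
        sdio_writeb(func, buf[i], reg_addr, &err);
    return err;
}

static int tx_multibyte(struct sdio_func *func, unsigned int reg_addr,
                        u8 *buf, int len)
{
    /* a single multi-byte transfer to the fixed register address */
    return sdio_writesb(func, reg_addr, buf, len);
}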
EDIT 2:
Maybe adding "SDIO" to this question is misleading (not sure at the moment). The reasons for the delay could be generic to any device driver interacting with hardware, and may be independent of the SDIO write process.
If somebody can point me to a related online resource, I would be happy to explore and update the results here.
Hope I added more clarity this time. Please comment if the question is still not clear.
Thank you for your time.
EDIT 3:
Yes, I am looking at the signals on a logic analyzer (LA) and there are longer delays during and between writes than I expected.
To give an idea about the time values:
For a 512-byte transfer: at the hardware level the write should theoretically take 50 microseconds (us); however, in reality I got 200 us.
This gap of 150 us is what I want to understand.
Note:
1) I am rounding off the time values to simplify the case.
2) All the time values are calculated at Kernel level and no user space issue is involved here.
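One low-overhead way to start attributing that 150 us, sketched below under the assumption that you can edit the modified sdio_uart.c (func/addr/buf/len are placeholders): bracket the multi-byte write, and separately the other steps of the transmit path, with ktime and log the outliers, so you can see whether the time is spent inside the MMC/SDIO host path or elsewhere.

#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/mmc/sdio_func.h>

static int timed_writesb(struct sdio_func *func, unsigned int addr,
                         void *buf, int len)
{
    ktime_t t0 = ktime_get();
    int ret = sdio_writesb(func, addr, buf, len);
    s64 us = ktime_to_us(ktime_sub(ktime_get(), t0));

    if (us > 100)   /* arbitrary reporting threshold, in microseconds */
        pr_info("sdio_uart: writesb of %d bytes took %lld us (ret=%d)\n",
                len, us, ret);
    return ret;
}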
One thing worth looking at is whether your SD interface works by DMA, such that the driver can program the state machine and then it just runs by itself, or whether getting the message out requires repetitive service by the driver, which might be delayed by other kernel obligations.
You could also see if there may be an I/O bottleneck; for example, is the SD interface, or whatever bus it hangs off of, also used for something else?
Finally, you could search for ways to increase the priority. At an extreme, you could switch to a real-time SD driver rather than a normal one.