Finding the effective switched capacitance of a processor

I need to determine the power consumption of a processor using the equation,
P = C*(V^2)*f
where C is the effective switched capacitance, V is the supply voltage and f is the processor frequency. Could someone please explain where I could find sample C and V values for a typical processor? I have gone through some of the Intel processor datasheets, but haven't been able to figure out the typical C value of a processor.
The following link has a generic explanation of CPU power consumption. However, in order to perform this power calculation, I need to know the C_L values referred to in this answer.
Could someone please give me some tips or useful links where I could get these values for an Intel processor?

I suspect you can't use that power equation in any useful way to measure the power of a modern Intel processor. The supply voltage should be essentially constant, because modern designs require it, and you can pull the actual power figures from the design specs -- again, the designers of the motherboard and power supply need to know them. Since the measured power varies widely while V stays fixed, the equation implies that the effective capacitance of the processor varies moment by moment. Though this is possible, it may also mean that this model of the processor's internals is simply wrong.
I'm not saying that the venerable P = C*V^2*f is wrong, just that it probably doesn't apply here. It gives you the dynamic switching power of a single CMOS gate. You can scale it up by multiplying by the number of transistors in a processor (near 5.5B), but I'm sure that would be way off the mark. The power of a modern Intel processor can vary by at least a couple of orders of magnitude from moment to moment.
As an interesting pair of exercises: (1) take the max and min power figures from the spec and compute the required C (which is linear in P in the equation); and (2) get a spec for one switching device in the process technology of a modern Intel processor, and multiply its power by the number of such devices (>5.5B) to get the "required" power.
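As a rough worked sketch of exercise (1), here is a small C program that back-solves the equation for C; the 95 W, 1.2 V and 4 GHz figures below are placeholders for illustration only, not values taken from any particular datasheet:

#include <stdio.h>

int main(void)
{
    /* Assumed, illustrative spec values -- substitute numbers from a real
       datasheet (package power, core voltage, sustained clock). */
    double P = 95.0;      /* watts */
    double V = 1.2;       /* volts */
    double f = 4.0e9;     /* hertz */

    /* Rearranged from P = C * V^2 * f. */
    double C = P / (V * V * f);

    printf("effective switched capacitance C = %.3e F (%.1f nF)\n",
           C, C * 1e9);
    return 0;
}

Repeating this with the min and max power figures gives you the spread of "effective C" the equation would demand, which is the point of the exercise.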

Related

How to accurately measure performance of sorting algorithms

I have a bunch of sorting algorithms in C that I wish to benchmark, and I am concerned about using a good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): the specific coding of the implementation, the programming language, the compiler (and compiler options), the benchmarking machine, and, critically, the input data and the time-measuring method. How do I minimize the effect of said variables on the benchmark's results?
To give you a few examples: I've considered multiple implementations in two different languages to adjust for the first two variables. Moreover, I could compile the code with different compilers using fairly mundane (and specified) options. I'm going to be running the tests on my machine, which features turbo boost and whatnot and often boosts a busy core to the moon; of course I will be disabling that, doing multiple runs, and likely taking their mean completion time to adjust for that as well. Regarding the input data, I will be using different array sizes, from very small to relatively large. I do not know what the size increments should ideally be, nor what the range of the elements should be. I also presume duplicate elements should be allowed.
I know that theoretical analysis of algorithms abstracts away most of these factors, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjusting for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with strict methodology for studying a topic. Thank you.
You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.
Choose a couple of different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen 2, an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those too. (Simulation with GEM5 or similar performance simulators might be worth trying. Also possibly relevant are low-power implementations like Intel Silvermont, whose out-of-order window is much smaller, but which also have a shorter pipeline and thus a smaller branch-mispredict penalty.)
If some algorithm allows a useful micro-optimization in the source, or one that a compiler can find on its own, that's a real advantage of that algorithm.
Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.
Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.
(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test on.)
Re: sizes: yes, a variety of sizes would be good.
You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
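For example (a hypothetical record type, not something from the question), a benchmark element with an int key, a payload whose size you can vary, and a qsort-style comparison callback might look like this minimal C sketch:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record: an int key plus a payload whose size you can vary
   to explore how element size affects the sort. */
struct record {
    int  key;
    char payload[28];   /* try 4, 28, 60, ... bytes */
};

/* Comparison callback: slightly more work than a single compare instruction. */
static int cmp_record(const void *a, const void *b)
{
    const struct record *ra = a, *rb = b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

int main(void)
{
    struct record r[4] = { { 3, "" }, { 1, "" }, { 4, "" }, { 2, "" } };
    qsort(r, 4, sizeof r[0], cmp_record);
    for (int i = 0; i < 4; i++)
        printf("%d ", r[i].key);
    printf("\n");
    return 0;
}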
As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults), CPU frequency, and more; see also: Idiomatic way of performance evaluation?
taking their mean completion time
You might want to discard high outliers, or take the median, which will have that effect for you. Usually a high outlier means "something happened" during that run to disturb it. If you're running the same code on the same data, you can often expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect whether branches alias each other in predictors, or data-cache conflict misses, but tiny changes in one part of the code can change the performance of other code via effects like that if you're re-compiling.)
If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
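Putting several of the points above together (fixed PRNG seed, page warm-up via a copy, repeated runs, median instead of mean), a minimal C harness could look like the following sketch; the sort under test, the run count and the array size are placeholders, and clock_gettime assumes a POSIX system:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define RUNS 15

/* Placeholder: replace the body with a call to the implementation under test. */
static void sort_under_test(int *a, size_t n)
{
    (void)a; (void)n;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    size_t n = 1u << 20;                 /* one of several sizes you'd sweep */
    int *data = malloc(n * sizeof *data);
    int *work = malloc(n * sizeof *work);
    double times[RUNS];

    srand(12345);                        /* fixed seed: same data every time */
    for (size_t i = 0; i < n; i++)
        data[i] = rand();

    for (int r = 0; r < RUNS; r++) {
        memcpy(work, data, n * sizeof *work);   /* also warms the pages */
        double t0 = now_sec();
        sort_under_test(work, n);
        times[r] = now_sec() - t0;
    }

    /* Median is robust against a run disturbed by the rest of the system. */
    qsort(times, RUNS, sizeof times[0], cmp_double);
    printf("n=%zu  median %.6f s\n", n, times[RUNS / 2]);

    free(data);
    free(work);
    return 0;
}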
Things that could affect benchmark performance include (but are not limited to): the specific coding of the implementation, the programming language, the compiler (and compiler options), the benchmarking machine, and, critically, the input data and the time-measuring method.
Let's look at this from a very different perspective - how to present information to humans.
With 2 variables you get a nice 2-dimensional grid of results, maybe like this:
            A = 1        A = 2
B = 1       4 seconds    2 seconds
B = 2       6 seconds    3 seconds
This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations - "A=1 is twice as fast as A=2 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").
With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on 2-dimensional screen" and more dimensions becomes a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids); but that doesn't help humans much.
Your primary goal is to reduce the number of variables.
To reduce the number of variables:
a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).
b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.
Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.
Your secondary goal is to reduce the number of values each variable can have.
You don't want massive tables with 12345678 million rows, and you don't want to spend the rest of your life benchmarking to generate such large tables.
To reduce the number of values each variable can have:
a) Figure out which values matter most
b) Select the right number of values in order of importance (and ignore/skip all other values)
For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
How do I minimize the effect of said variables on the benchmark's results?
How would you go about resolving the mentioned issues, and adjusting for these variables once the data is collected?
By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).
You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").
My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.

How computers calculate logarithm?

I want to know how computers calculate logarithms.
I don't mean the library functions themselves. For example, Python provides the math.log() function, but what exactly does that function do? And can it be reimplemented, perhaps more accurately?
Is there a formula for it? Or an algorithm? (I don't think the computer has a log table!)
Thanks
The GNU C library, for example, uses the fyl2x assembler instruction, which means that the logarithm is calculated directly by the hardware.
Hence one should ask: what algorithm do computers use for calculating logarithms?
It depends on the CPU. For Intel IA-64, they use a Taylor series combined with a table.
More info can be found here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.5177
and here: http://www.computer.org/csdl/proceedings/arith/1999/0116/00/01160004.pdf
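To give a flavour of the general approach (this is only an illustrative C sketch, not the algorithm any particular libm or CPU actually uses), the natural log can be computed by splitting the argument into mantissa and exponent and summing a short series for the mantissa part; production implementations use carefully tuned polynomials and table lookups to reach full double precision:

#include <math.h>
#include <stdio.h>

/* Illustrative only: natural log via range reduction plus a short series.
   Assumes x > 0. */
static double my_log(double x)
{
    /* Split x into m * 2^e with m in [0.5, 1). */
    int e;
    double m = frexp(x, &e);

    /* ln(m) = 2 * atanh(z) with z = (m - 1) / (m + 1); sum a few odd terms. */
    double z    = (m - 1.0) / (m + 1.0);
    double z2   = z * z;
    double sum  = 0.0;
    double term = z;
    for (int k = 1; k <= 11; k += 2) {   /* z, z^3/3, ..., z^11/11 */
        sum += term / k;
        term *= z2;
    }
    double ln_m = 2.0 * sum;

    /* Recombine: ln(x) = e * ln(2) + ln(m). */
    return e * 0.6931471805599453 + ln_m;
}

int main(void)
{
    printf("my_log(10) = %.15f\n", my_log(10.0));
    printf("   log(10) = %.15f\n", log(10.0));
    return 0;
}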
This is hugely open-ended and broad, and the answer is very much "it depends".
For every programming language, every core library, and every system, different algorithms/mechanisms and machine-code instructions may exist for performing mathematical (and any other kind of) calculations.
Furthermore, even if all the programming languages in the world used the same algorithm X (which is itself unlikely), that still would not mean that the computer calculates the logarithm in way X, because the machine-level work will most likely still be done differently under different circumstances, even when the algorithms are nominally the same.
Bear in mind that computer architectures differ, operating systems differ, and assembler instructions can be very different from CPU to CPU.
I really think you should come up with a more specific and concrete question on this website.

How to interpret NVIDIA Visual Profiler analysis/recommendations?

I'm relatively new to CUDA and am currently working on a project to accelerate computer vision applications on embedded systems with attached GPUs (NVIDIA TX1). What I'm trying to do is choose between two libraries: OpenCV and VisionWorks (which includes OpenVX).
Currently, I have written test code to run the Canny edge detection algorithm, and the two libraries showed different execution times (the VisionWorks implementation takes about 30-40% less time).
So I wondered what the reason might be, and profiled the kernel that takes the most time in each implementation: 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra, which takes up 37.2% of the entire application, and 'edgesHysteresisLocal' from VisionWorks.
I followed the 'guided analysis', and the profiler suggested that both applications are latency-bound. Below are the captures of 'edgesHysteresisLocal' from VisionWorks and 'canny::edgesHysteresisLocalKernel' from OpenCV4Tegra.
OpenCV4Tegra - canny::edgesHysteresisLocalKernel
VisionWorks - edgesHysteresisLocal
So, my question is,
from the analysis, what can I tell about the causes of the performance difference?
Moreover, when profiling CUDA applications in general, where is a good place to start? I mean, there are a bunch of metrics and it's very hard to tell what to look at.
Is there any educational material on profiling CUDA applications in general? (I've looked at many slides from NVIDIA, and I think they just give the definitions of the metrics, not where to start in general.)
-- By the way, as far as I know, NVIDIA doesn't provide the source code of VisionWorks or OpenCV4Tegra. Correct me if I'm wrong.
Thank you in advance for your answers.
1) The shared memory usage is different between the two libraries; this is probably the cause of the performance divergence.
2) Generally I use three metrics to judge whether my algorithm is well coded for CUDA devices:
the memory usage of the kernels (bandwidth)
the number of registers used: is there register spilling or not?
the number of shared memory bank conflicts
3) I think there are many resources on the internet for this.
One more thing:
If you just want to compare one library against another in order to select the best one, why do you need to understand each implementation? (It's interesting, but not a prerequisite, is it?)
Why not measure algorithm performance with the execution time and the quality of the produced results according to a metric (false positives, average error against a set of known reference results, ...)?
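As a sketch of that approach in C (the run_canny_* wrappers below are hypothetical stand-ins for whichever OpenCV/VisionWorks calls you are comparing; for GPU work each wrapper should include the necessary synchronization, and you would still check result quality separately):

#include <stdio.h>
#include <time.h>

/* Hypothetical wrappers: replace the bodies with the actual library calls. */
static void run_canny_opencv(void)      { /* ... */ }
static void run_canny_visionworks(void) { /* ... */ }

/* Average wall-clock time per call over 'reps' repetitions. */
static double time_it(void (*fn)(void), int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
}

int main(void)
{
    int reps = 100;
    printf("OpenCV:      %.6f s per frame\n", time_it(run_canny_opencv, reps));
    printf("VisionWorks: %.6f s per frame\n", time_it(run_canny_visionworks, reps));
    return 0;
}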

Portability and Optimization of OpenCL between Radeon Graphic Cards

I'm planning on diving into OpenCL and have been reading (only surface knowledge) on what OpenCL can do, but have a few questions.
Let's say I have an AMD Radeon 7750, I have another computer with an AMD Radeon 5870, and I have no plans to use a computer with an Nvidia card. I've heard that optimizing the code for a particular device brings performance benefits. What exactly does optimizing mean? From what I've read and a little bit of guessing, it sounds like it means writing the code in a way that a GPU in general likes (without concern for whether it's an AMD or Nvidia card), as well as in a way that matches how the graphics card handles memory (I'm guessing this is specific to each compute device, or is it only brand-specific?).
So if I write code and optimize it for the Radeon 7750, would I be able to bring that code to the other computer with the Radeon 5870 and, without changing any part of it, still retain a reasonable amount of the performance benefit from the optimization? In the event that the code doesn't work, would changing parts of it be a minor issue, or would it involve rewriting so much code that it would have been a better idea to write code optimized for the Radeon 5870 in the first place?
Without more information about the algorithms and applications you intend to write, the question is a little vague. But I think I can give you some high-level strategies to keep in mind as you develop your code for these two different platforms.
The Radeon 7750 is based on the new Graphics Core Next (GCN) architecture, while your HD 5870 is based on the older VLIW5 (Cypress) architecture.
For your code to perform well on the HD 5870 hardware, you must make as heavy use of the packed primitive datatypes as possible, especially the int4 and float4 types. This is because the OpenCL compiler has a difficult time automatically discovering parallelism and packing data into the wide vectors for you. If you can structure your code so that you have already taken this into account, then you will be able to fill more of the VLIW5 slots and thus use more of your stream processors.
GCN is more like Nvidia's Fermi architecture, where the code's path to the functional units (ALUs, etc.) of the stream processors does not go through explicitly scheduled VLIW instructions. So more parallelism can be detected automatically at runtime, keeping your functional units busy doing useful work without you having to think as hard about how to make that happen.
Here's an over-simplified example to illustrate my point:
// multiply four factors
// A[0] = B[0] * C[0]
// ...
// A[3] = B[3] * C[3];
float *A, *B, *C;
for (int i = 0; i < 4; i++) {
    A[i] = B[i] * C[i];
}
That code will probably run OK on a GCN architecture (except for suboptimal memory-access performance -- an advanced topic). But on your HD 5870 it would be a disaster, because those four multiplies would take up four VLIW5 instructions instead of one! So you would write the code above using the float4 type:
float4 A, B, C;
A = B * C;
And it would run really well on both of your cards. Plus it would kick ass in a CPU OpenCL context and make great use of MMX/SSE wide registers as a bonus. It's also a much better use of the memory system.
In a tiny nutshell, using the packed primitive types is the one thing I can recommend keeping in mind as you start deploying code on these two systems at the same time.
Here's one more example that illustrates more clearly what you need to be careful about on your HD 5870. Say we had implemented the previous example using separate work units:
// multiply four factors
// as separate work units
// A = B * C
float A, B, C;
A = B * C;
And we had four separate work units instead of one. That would be an absolute disaster on the VLIW device, while the GCN device would show tremendously better performance. That is something you will also want to look for when writing your code: can you use float4 types to reduce the number of work units doing the same work? If so, you will see good performance on both platforms.

Estimating area required by a VHDL implementation

I've got a few VHDL files, which I can compile with ghdl on Debian. The same files have been adapted by some for an ASIC implementation. There's one "large area" implementation and one "compact" implementation for an algorithm. I'd like to write some more implementations, but to evaluate them I'd need to be able to compare how much area the different implementations would take.
I'd like to do the evaluation without installing any proprietary compilers or obtaining any hardware. A sufficient evaluation criterion would be an estimate of the GE (gate equivalent) area, or the number of logic slices needed by some FPGA implementation.
Start by counting the flip-flops (FFs). Their number is (almost) uniquely defined by the RTL code that you have written. With some experience, you can get this number by inspecting the code.
Typically, there is a good correlation between the #FFs and the overall area. An old rule of thumb is that for many designs, the combinatorial area will be about the same as the sequential area. For example, suppose the area count of a flip-flop is 10 gates in a gate array technology, then #FFs * 20 would give you an initial estimation.
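For instance, with purely illustrative numbers: a design that you estimate at 2,000 FFs in such a gate-array library would come out at roughly 2,000 * 20 = 40,000 gate equivalents as a first-order figure.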
Of course, the design characteristics have a significant influence. For datapath-oriented designs, the combinatorial area will be relatively larger. For control-oriented designs, the opposite is true. For standard-cell designs, the sequential area may be smaller because FFs are more efficient. For timing-critical designs, the combinatorial area may be much larger as a result of timing optimization by the synthesis tool.
Therefore, the remaining issue is to find out what a good multiplication factor is for your type of designs and target technology. The strategy could be to carry out some experiments, or to look at prior design results, or to ask others. From then on, estimating is a matter of multiplying the #FFs, known from your code, with that factor.
I'd like to do the evaluation without installing any proprietary compilers or obtaining any hardware.
Inspection will give you a rough idea but with all the optimisations that occur during synthesis you may find this level of accuracy too far removed from the end result.
I would suggest that you re-examine your reasons for avoiding "proprietary compilers" to perform the evaluation. I'm unaware of any non-proprietary synthesis tools for VHDL (though it has been discussed). The popular FPGA vendors provide free versions of their software for Windows and Linux which you could use to obtain accurate counts of resource usage. It should be feasible to translate the FPGA resource usage into something more meaningful for your target technology.
I'm not very familiar with the ASIC world but again there may be free (but proprietary) tools available for you to use.
