Emulation and Energy Consumption in QEMU

Emulation and Energy Consumption in QEMU - performance

I need to emulate a particular architecture namely Cortex ARM A9.
I was thinking about QEMU (what do you think about it) with a distro linux.
The problem is that I would to measure the energy consumption of a mathematical operation on this architecture, for example an exponentiation, or an elliptic curve addition/multiplication. The programming language could be python or C.
Do you have an idea on what I can to do that?
Furthermore, in an emulated architecture who guarantee me that I will have the right timings about the execution of these operations?

QEMU is neither cycle accurate or a microarchictecture simulation. You may be able to build your own energy model with the recently added plugins but I suspect you need more low level tools.

Related

How does Intel's RAPL estimate the power consumption

First of all, I do not know whether I should be asking this here or in the Electronics StackExchange, so please let me know if you think I should ask it there.
I am interested in measuring the energy consumption of each CPU core in Intel CPUs. I have read Intel's Intel 64 developer manual, and, as I understand, RAPL will provide energy estimations for:
The whole Package
The Cores
An unspecified Uncore device (Only in client processors)
The DRAM (Only in server processors)
This would indicate that the best I can aspire to is a value for the collective energy consumption of all the cores in the CPU. However, I also know that "RAPL is not an analog power meter, but rather uses a software power model", according to https://01.org/blogs/2014/running-average-power-limit-%E2%80%93-rapl.
What I would like to know is, is the way this model works known, or publicly available? And, would it be possible to get an estimation of individual core power consumption using metrics provided by RAPL or other interfaces? I know that, if Intel isn't providing this information through RAPL it is probably impossible to get it, but I would like to at least find a source that confirms that.
Thanks for your help!

Here is a post on different tools that you can use to get energy measurements for different Operating Systems. If you are using Linux, consider using the Perf since it uses the RAPL interface to get energy measurements. As far as I know, Perf does not offer energy consumption per core but as a whole (package) and you can get energy measurements for an executable (of any kind: Python, Java, SHELL, C, and so on) using the following command:
sudo perf stat -e power/energy-cores/ ./executable

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, new Apple processor A11 Bionic gains more points than the mobile Intel Core i7 in the Geekbench benchmark.
As I understand, there are a lot of different tests in this benchmark. These tests simulate a different load, including the load, which can occur in everyday use.
Some people state that these results can not be compared to x86 results. They say that x86 is able to perform "more complex tasks". As an example, they lead Photoshop, video conversion, scientific calculations. I agree that the software for the ARM is often only a "lighweight" version of software for desktops. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc), and not by the performance of ARM.
As an opposite example, let's look at Safari. A browser is a complex program. And on the iPad Safari works just as well as on the Mac. Moreover, if we take the results of Sunspider (JS benchmark), it turns out that Safari on the iPad is gaining more points.
I think that in everyday tasks (Web, Office, Music/Films) ARM (A10X, A11) and x86 (dual core mobile Intel i7) performance are comparable and equal.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing a laptop on ARM? They already do same thing with migration from POWER to x86. This is technical restrictions, or just marketing?

(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regards to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc), and how much it contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
There are of course other factors that must be aligned (and sometimes aren't due to negligence or an intention to cheat) - the workload should be the same (i've seen different versions compared..), the compiler should be as close as possible (although generating arm vs x86 code is already a difference, but the compiler intermediate optimizations should be similar. When comparing 2 x86 like intel and AMD you should prefer the same binary, unless you also want to allow machine specific optimizations).
Finally, the system should also be similar, which is not the case when comparing a smartphone against a pc/macbook. The memory could differ, the core count, etc. This could be legitimate difference, but it's not really related to one architecture being better than the other.

the topic is bogus, from the ISA to an application or source code there are many abstraction level and the only metric that we have (execution time, or throughput) depends on many factors that could advantage one or the other: the algorithm choices, the optimization written in source code, the compiler/interpreter implementation/optimizations, the operating system behaviour. So they are not exactly/mathematically comparable.
However, looking at the numbers, and the utility of the mobile application written by talking as a management engeneer, ARM chip seems to be capable of run quite good.
I think the only reason is inertia of standard spread around (if you note microsoft propose a variant of windows running on ARM processors, debian ARM variant are ready https://www.debian.org/distrib/netinst).
the ARMv8 cores seems close to x86/64 ones by looking at raw numbers
note i7-3770k results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
summary of last Armv8 CPU characteristics, note the quantity of decode, dispatch, caches, and compare the last column on cortex A73 to the i7 3770k
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
intel ivy bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
A75 details. https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
the topic of power consumption is complex again, the basic rule that go under all the frequency/tension rule (used and abused) over www is: transistors raise time. https://en.wikipedia.org/wiki/Rise_time
There is a fixed time delay in the switching of a transistor, this determinates the maximum frequency that a transistor could switch, and with more of them linked in a cascade way this time sums up in a nonlinear way (need some integration to demonstrate it), as a result 10 years ago to increase the GHz companies try to split in more stage the execution of an operation and runs them (operations) in a pipeline way, even inside the logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
the raise time depends of physical characteristics (materials and shape of transistors). It can be reduced by increasing the voltage, so the transistor switch faster, as the switching is associated (let me the term) to a the charge/discharge of a capacitor that trigger the transistor channel opening/closing.
These ARM chips are designed to low power applications, by changing the design they could easily gain MHz, but they will use much power, how much? again not comparable if you don't work inside a foundry and have the numbers.
an example of server applications of ARM processors that could be closer to desktop/workstation CPU as power consumption are Cavium or qualcomm Falkor CPUs, and some benchmark report that they are not bad.

What types of code domains is OpenCL suited to?

I read the OpenCL overview, and it states it is suitable for code that runs of CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math and image type operations. I didn't see anything for say strings.
This makes me wonder what would you run on a CPU via OpenCL?
Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires the processing of arrays of strings.
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this infer that the CPU cores are not OpenCL compliant, or is there no such inference?
EDIT
In the interests of non-debate and constructiveness, I would appreciate if anyone could point me to official references that would answer my question.

You can think of OpenCL as a combination of a runtime (for device discovery, queueing) and a C-based programming language. This programming language has native vector types and built-in functions and operations for doing all sorts fun stuff to these vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it it the responsibility of the implementation to map that to the actual vector ISA of your hardware.
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of
which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you wouldn't have to worry about writing two versions, one SSE and one NEON. Just write OpenCL C and be done with it. Yes, I know. This assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he doesn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.

No links, but I would assume this is because algorithms that use strings may do a lot of dynamic memory allocation and branching, both of which GPGPUs are not well-suited for. GPGPUs also have a lot in common with vector processing, so doing units of work with different sized blocks of memory (which a string algorithm will generally work on, you usually don't have a homogeneous group of strings), yields poorer performance and is hard to program.
GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vector or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.

This makes me wonder what would you run on a CPU via OpenCL?
I prefer to use ocl to offload work from the cpu to my graphics hardware. Sometimes there is a limitation with my video card, so I like having a backup kernel for cpu use. Such limitations can be memory size, memory bottleneck, low clock speed, or when the pci-e bus gets in the way.
I say I like using a separate kernel for cpu, because I think all kernels should be tweaked to run on their target hardware. I even like to have an openmp backup plan, as most algorithms I use get tested out in this manner ahead of time.
I suppose it is best practice to test out a gpu kernel on the cpu to make sure it runs as expected. If a user of your software has opencl installed, but only a cpu (or a low-end gpu) it's nice to be able to execute the same code on the different devices.

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic bit more I want to know if there is ANY way to (get the trace) see the actual dynamic reordering done for a given program?
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.

You have no access to actual reordering done inside the CPU (there is no publically known way to enable tracing). But there is some emulators of reordering and some of them can give you useful hints.
For modern Intel CPUs (core 2, nehalem, Sandy and Ivy) there is "Intel(R) Architecture Code Analyzer" (IACA) from Intel. It's homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool allows you to look how some linear fragment of code will be splitted into micro-operations and how they will be planned into execution Ports. This tool has some limitations and it is only inexact model of CPU u-op reordering and execution.
There are also some "external" tools for emulating x86/x86_84 CPU internals, I can recommend the PTLsim (or derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models some "PTL" cpu, not real AMD or Intel CPU. The good news is that this PTL is Out-Of-Order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, in arbeit http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf is said that JavaHASE applet is capable of emulating different simple CPUs and even supports Tomasulo example.

Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.

VHDL and FPGA's

I'm relatively new to the FPGA sceen and was looking to get experience with them and VHDL. I'm not quite sure what the benefit would be over using a standard MCU but looking for experience since many companies are looking for it.
What would be a good platform to start out on and get experience for not to much money. Ive been looking and all I can find are 200 - 300 dollar boards if not 1000's. What should one look for in an FPGA development board, I hear high speed peripheral interfaces, and what I guess I'm really confused about is that an MCU dev board with around 50/100 GPIO can go for around 100 while that same functionality on an FPGA board is much more expensive! I know you can reprogram an FPGA, but so can an MCU. Should I even fiddle with FPGA's will the market keep using them or are we moving towards MCU's only?

Hmm...I was able to find three evaluation boards under $100 pretty quickly:
$79: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=593
$79: http://www.arrownac.com/solutions/bemicro-sdk/
$89: http://www.xilinx.com/products/boards-and-kits/AES-S6MB-LX9.htm
As to what to look for in an evaluation board, that depends entirely on what you want to do. If you have a specific design task to accomplish, you want a board that supports as many of the same functions and I/O as your final circuit. You can get boards with various memory options (SRAM, DDR2, DDR3, Flash, etc), Ethernet, PCI/PCIe bus, high-speed optical transceivers, and more. If you just want to get started, just about any board will work for you. Virtually anything sold today should have enough space for even non-trivial example designs (ie: build your own microcontroller with a soft-core CPU and design/select-your-own peripheral mix).
Even if your board only has a few switches and LEDs you can get started designing a hardware "Hello World" (a.k.a. the blinking LED :), simple state machines, and many other applications. Where you start and what you try to do should depend on your overall goals. If you're just looking to gain general experience with FPGAs, I suggest:
Start with any of low-cost evaluation boards
Run through their demo application (typically already programmed into the HW) to get familiar with what it does
Build the demo program from source and verify it works to get familiar with the FPGA tool chain
Modify the demo application in some way to get familiar with designing hardware for FPGAs
Use your new-found experience to determine what to try next
As for the market continuing to use FPGAs, they are definitely here to stay, but that does not mean they are suitable for every application. An MCU by itself is fine for many applications, but cannot handle everything. For example, you can easily "bit-bang" an I2C or even serial UART with most micro-controllers, but you would be hard pressed to talk to an Ethernet port, a VGA display, or a PCI/PCIe bus without some custom hardware. It's up to you to decide how to mix the available technology (MCUs, FPGAs, custom logic designed in-house, licensed IP cores, off-the-shelf standard hardware chips, etc) to create a functional product or device, and there typically isn't any single 'right' answer.

FPGAs win over microcontrollers if you need some or all of:
Huge amounts of maths to be done (even more than a DSP makes sense for)
Huge amounts of memory bandwidth (often goes hand in hand with the previous point - not much point having lots of maths to do if you have no data to do it on!)
Extremely predictable hard real-time performance - the timing analyser will tell you how fast you can clock you device given the logic you've designed. You can (with a certain - high - statistical likelihood) "guarantee" to operate at that speed. And therefore you can design logic which you know will always meet certain real-time response times, even if those deadlines are in the nano-second realm.
If not, then you are likely better off with a micro or DSP.

The OpenCores web site is an excellent resource, especially the Programming Tools section. The articles link on the site is a good place to start to survey FPGA boards.
The biggest advantage of an FPGA over a microprocessor is architecture. The microprocessor has a fixed set of functional units that solve most problems reasonably well. I've seen computational efficiency figures for microprocessors form 6% to 15%. In an FPGA you are creating functional units specifically for your problem and nothing else, so you can reach 90-100% computational efficiency.
As for the difference in cost, think of volume sales. High volume of microprocessor sales vs. relatively lower FPGA sales.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio