Where did code morphing go? [closed] - processor

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Linus Torvalds used to work for a processor company called Transmeta. The processor they made was a RISC based object in the core. If I remember correctly, the idea was that the core ran an arbitrary and upgradable "processor emulation layer" (could be x86, powerpc etc), which translated the high level opcodes into the RISC core instruction set.
What happened to this idea, and what in your opinion were the pros, cons and situations where such approach could have had an advantage (in terms of programming)?

The company did not do as well as they expected, and were eventually acquired by Novafora for it's power-saving technology. ( http://www.novafora.com/pr01-28-09.html )
From all accounts that I am aware of, the technology simply did not compete with existing systems. They fell far short of their performance numbers. Also, while it may have been possible to put another translator on top of their VLIW design, I'm not aware of any products they produced that did. I don't remember the Crusoe chip being able to accept an alternative "translation" microcode download.
I personally owned a device that used a Crusoe processor, and while it certainly delivered on battery life, the performance of the device was dismal. Some of the blame could probably be leveled on the special version of Windows it used, but it was still slow.
At best, it was good for portable remote desktop.
IMHO, the technology has the same benefits as software VM's like .Net and the JVM:
The upside is that you can probably
accelerate the code faster with a
hardware solution (like IBM does with
it's Java accelerator processors)
than pure software JIT.
The downside is that you never get the raw
performance that processors executing
native code get.
From some perspectives you can think of modern x86 chips as code morphing, although as very specialized ones. They translate the x86 architecture into a more efficient RISC-like subinstruction set, and then execute those.
Another example of this sort of technology could be FPGAs which can be programmed to emulate on a circuit level various kinds of processors or raw circuits. I believe that some Cray systems can come with "accelerator nodes" of this sort.

For one thing most CISC processors internally translate their opcodes to uops micro-ops which are similar to RISC ops. Pipelining and multiple cores have closed the gap on RISC processors to the point where it's a very small difference between them, if any. If you need cross compatibility from C source or another assembly front end you can use LLVM. http://llvm.org/

Obvious pros:
Ability to run any OS (just switch the processor emulation to what is needed)
Possibility (with kernel support of course) of running binaries for different architectures on the same processor/OS without software support.
Obvious con:
Extra emulation layer == more overhead == faster processor needed to get equivalent performance for EVERYTHING.

I would say that cost reductions come with quantity, so something like the Transmeta chip has to sell a lot of volume before it can compete on price with existing high volume x86 chips.
If I recall, the point of the Transmeta chip was that it was low power. Having less silicon gates to flip back and forth every clock cycle saves energy. The code morphing was so you could run a complex instruction set (CISC) on a low power RISC chip.
Transmeta's first processor, the Crusoe, didn't do very well due to problems even running benchmark software. Their second processor, the Efficeon, did manage to use less power than the Intel Atom (in the same performance category), and perform better than the Centrino in the same power envelope.
Now, looking at it from the software and flexibility standpoint like you are, Code Morphing is just a form of Just-In-Time compilation, with all the benefits and detriments of that technology. Your x86 code is essentially running on a virtual machine and being emulated by another processor. The biggest benefit of virtualization right now is the ability to share a single processor among many virtual machines so you have fewer idle CPU cycles, which is more efficient (hardware cost and energy cost).
So it seems to me that code morphing, just like any form of virtualization, is all about being more efficient with resources.

For another approach to hardware-assisted x86 ISA virtualization, you may want to read about the Loongson 3 CPU.

Most modern processors actually implement their instruction sets using microcode.
There are many reasons for this, including compatibility concerns, but there are also other
reasons.
The distinction between what is "hardware" and what is "software" is actually hard to make.
Modern virtual machines such as the JVM or CIL (.NET) might as well be implemented in hardware, but that would probably just be done using microcode anyway.
One of the reasons for having several layers of abstraction in a system is that the
programmers/engineers do not have to think about irrelevant details when they are working at
a higher level.
The operating system and system libraries also provide additional abstraction layers. But having these layers only makes the system "slower" if one does not need the features they provide (i.e. the thread scheduling done by the OS). It is no easy task to get your own program-specific scheduler to beat the one in the Linux kernel.

Related

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, new Apple processor A11 Bionic gains more points than the mobile Intel Core i7 in the Geekbench benchmark.
As I understand, there are a lot of different tests in this benchmark. These tests simulate a different load, including the load, which can occur in everyday use.
Some people state that these results can not be compared to x86 results. They say that x86 is able to perform "more complex tasks". As an example, they lead Photoshop, video conversion, scientific calculations. I agree that the software for the ARM is often only a "lighweight" version of software for desktops. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc), and not by the performance of ARM.
As an opposite example, let's look at Safari. A browser is a complex program. And on the iPad Safari works just as well as on the Mac. Moreover, if we take the results of Sunspider (JS benchmark), it turns out that Safari on the iPad is gaining more points.
I think that in everyday tasks (Web, Office, Music/Films) ARM (A10X, A11) and x86 (dual core mobile Intel i7) performance are comparable and equal.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing a laptop on ARM? They already do same thing with migration from POWER to x86. This is technical restrictions, or just marketing?
(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regards to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc), and how much it contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
There are of course other factors that must be aligned (and sometimes aren't due to negligence or an intention to cheat) - the workload should be the same (i've seen different versions compared..), the compiler should be as close as possible (although generating arm vs x86 code is already a difference, but the compiler intermediate optimizations should be similar. When comparing 2 x86 like intel and AMD you should prefer the same binary, unless you also want to allow machine specific optimizations).
Finally, the system should also be similar, which is not the case when comparing a smartphone against a pc/macbook. The memory could differ, the core count, etc. This could be legitimate difference, but it's not really related to one architecture being better than the other.
the topic is bogus, from the ISA to an application or source code there are many abstraction level and the only metric that we have (execution time, or throughput) depends on many factors that could advantage one or the other: the algorithm choices, the optimization written in source code, the compiler/interpreter implementation/optimizations, the operating system behaviour. So they are not exactly/mathematically comparable.
However, looking at the numbers, and the utility of the mobile application written by talking as a management engeneer, ARM chip seems to be capable of run quite good.
I think the only reason is inertia of standard spread around (if you note microsoft propose a variant of windows running on ARM processors, debian ARM variant are ready https://www.debian.org/distrib/netinst).
the ARMv8 cores seems close to x86/64 ones by looking at raw numbers
note i7-3770k results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
summary of last Armv8 CPU characteristics, note the quantity of decode, dispatch, caches, and compare the last column on cortex A73 to the i7 3770k
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
intel ivy bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
A75 details. https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
the topic of power consumption is complex again, the basic rule that go under all the frequency/tension rule (used and abused) over www is: transistors raise time. https://en.wikipedia.org/wiki/Rise_time
There is a fixed time delay in the switching of a transistor, this determinates the maximum frequency that a transistor could switch, and with more of them linked in a cascade way this time sums up in a nonlinear way (need some integration to demonstrate it), as a result 10 years ago to increase the GHz companies try to split in more stage the execution of an operation and runs them (operations) in a pipeline way, even inside the logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
the raise time depends of physical characteristics (materials and shape of transistors). It can be reduced by increasing the voltage, so the transistor switch faster, as the switching is associated (let me the term) to a the charge/discharge of a capacitor that trigger the transistor channel opening/closing.
These ARM chips are designed to low power applications, by changing the design they could easily gain MHz, but they will use much power, how much? again not comparable if you don't work inside a foundry and have the numbers.
an example of server applications of ARM processors that could be closer to desktop/workstation CPU as power consumption are Cavium or qualcomm Falkor CPUs, and some benchmark report that they are not bad.

Developing Software For Multi-Core CPUs - Do Programs Have To Be Manually Optimised To Use All Cores? Or Does This Occur Automatically?

The vast majority of CPUs coming out nowadays contain multiple cores which can operate at the same time - in parallel.
I'm just wondering, from the point of executing a program as quickly as possible using all available CPU cores, does a programmer need to take into consideration that the software being developed will be running on a multi-core CPU? For instance, would the software being developed have to be manually configured to assign different tasks to each CPU core? Or does the OS/CPU automatically identify and choose which parts of a program can run - in parallel - on different cores?
Apologies if this may seem like a simple or silly question. I'm completely new to the topic of parallel programming and I've come across some conflicting information early on in my research - some sources state that the programmer must manually configure their software in order to utilise more than one CPU core (the more believable option in my opinion) - and other sources state that the OS/CPU automatically identifies and chooses which tasks can be run in parallel on different CPU cores (the less believable option in my opinion due to the complexity involved in automatically identifying this).
Just in case different Operating Systems, CPUs or Programming Languages perform differently in a parallel computing or multi-core environment - I will be using Windows 7 as my OS, an Intel Dual Core i7 Processor, and OpenCL as the programming language.
Any help is much appreciated.
In practice this occurs semi-automatically.
More detailed answer will depend on your application nature, preferred programming model and target architecture.
More explanation:
In order to exploit multicore hardware efficiently (in your case, keeping as much cores busy as possible) you first of all 1) need to "parallelize" algorithm itself - make it "concurrent", 2) use one of multi-threading (most often) or multi-process (rare case) parallel programming APIs, like for example "OpenMP", "Intel TBB", "OpenCL", "Posix Threads" or (for multi-process) "MPI" in order to efficiently and often automatically assign different "pieces" of your concurrent program to different threads (or, rare case, processes).
One of the simplest possible examples of such kind of parallel programming (using OpenMP) is given here.
Now, you've told that you are using OpenCL as a programming model for CPU. In certain cases, when you use vendor-provided OpenCL implementations (like Intel OpenCL) you could semi-automatically assign OpenCL kernel to be executed by various threads using "NDRange" and other OpenCL concepts, like explained here for Intel Xeon Phi co-processor (not exactly CPU-programming, but similar idea) or here (more general, but more advanced article).
However, using OpenCL as a general-purpose multi-threading programming API for CPU - is definitely not the simplest approach; and it is not always optimal in terms of final performance. There are certain application types, where OpenCL makes some little sense for general-purpose CPU multi-threading programming, but again it very much depends on your algorithm nature and target architecture..
There is one very obsolete, but still reasonable post about OpenCL vs. OpenMP/TBB on stackoverflow. This is obsolete in sense that OpenMP 4.0 now also provides solid capabilities to do Threading*+SIMD* programming (which will make you interested in some future if you explore given topic in more details). That's why I would tell that OpenMP seems to be number-one choice nowadays, bug TBB, MPI or OpenCL might also be appropriate in certain cases.

Looking to get started with FPGAs -- speed up? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I'm very interested in learning FPGA development. I've found a bunch of "getting started with FPGA" questions here, and other tutorials and resources on the internet. But I'm primarily interested in using FPGAs as an accelerator, and I can't figure out what devices will actually offer a speed up over a desktop CPU (say a recent i7).
My particular interest at the moment is cellular automata (and other parallel environments like neural networks and agent based modeling). I'd like to experiment with 3d or higher dimensional cellular automatas. My question is - will the low-cost $100-$200 starter kits provide something that have potential to produce a significant speed up over a desktop CPU? Or would I need to spend more and get a higher end model FPGA?
The FPGA can be a very good accelerator, but (and this is a big BUG) it is usually is very expensive. We have here machines like the beecube, a convey or from Dini godzillas part time nanny, and they are all very expensive (>10k$) and even with these machines many applications can be better accelerated with a standard cpu cluster or gpus. That FPGA is a bit better when the total cost of ownership is considered as you have there usually a better engery efficiency.
But there are applications that you can accelerate. On the lower scale you can/should do an rough estimate if its worth for you application, but you need there more concrete numbers for your application. Consider an standard deskop cpu: usually it has at least 4 cores (or dual with hyperthreading, not to mention the vector units), and clocks at, say 3 GHz. This results in 12 GCycles per second computation power. The (cheap) FPGAs you can get to 250 MHz (better can reach up to 500 MHz, but that must be very friendly designs and very good speed grades), so you need approx. 50 Operations in parallel, to compete with the CPU (actually its a bit better because the cpu has usually not 1 cycle ops, but it also has vector operations so we are equal).
50 Operations sounds much, and is hard, but is doable (the magic word here is pipeling). So you should know exactly how you are going to implement you design in hardware and which degree of parallelism you can use.
Even if you solve that parallelism problem, we come now to the real problem: The memory.
Above mentioned accelerators have so much compute capacity, they could do thousands things in parallel, but the real problem with such computation power is: how to get the data into/out of them. And you also have this problem in your small scale. In your desktop PC the cpu transfers more than 20GB/s to/from memory (good GPU card make 100GB/s and more), while your small accelerator for 100-200$ has at most (when you get lucky) 1-2 GB/s per PCI-Exp.
If its worth for you, depends completely on your application (and here you need far more details than: 3D Cellular Automatas, you must know the neighbourhoods, the required precision (do you double, single float, or integers or fixpoint...?), and your use case (do you transfer initial cell values, let the machine compute 2 day, and than transfer cell values back, or do you need the cell values after every step (this makes a huge difference in the required bandwidth while computation)).
But overall, without knowing more, I would say: Its 100$-200$ worth.
But not because you can compute your cellular automatas faster (which I dont believe), but because you will lern. And you will not only learn to design hardware and the development on FPGAs, but I see with our students that we have here, always that they get with the hardware design knowledge also a far better understanding on how the hardware actually look and behaves like. Sure nothing what you do on your FPGA is direct related to the interior of the cpu, but many get a better feeling for what hardware in general is capable of, which in turn make them even more effective software developer.
But I have also to admit: You are going to pay a much higher price than just the 100-200$: You have to spent really much time on it.
Disclaimer: I work for a reconfigurable system developer/manufacturer.
A short answer to your question "will the low-cost $100-$200 starter kits provide something that have potential to produce a significant speed up over a desktop CPU" is probably not.
A longer answer:
A microprocessor is a set of fixed, shared functional units tuned to perform reasonably well across a broad range of applications. The operating system and compilers do a good job of making sure that these fixed, shared functional units are utilized appropriately.
FPGA based systems get their performance from dedicated, dense, computational efficiency. You create exactly what you need to execute your application, no more, no less - and whatever you create is not shared with any other user, process, operating system, whatever. If you need 80 floating point units, you create 80 dedicated floating point units that run in parallel. Compare that to a microprocessor scheduling floating point operations across some smaller number of floating point units. To get performance faster than a microprocessor, you have to instantiate enough dedicated FPGA-based functional units to make a performance difference vs. a microprocessor. This often requires the resources in the larger FPGA devices.
An FPGA alone is not enough. If you create a large number of efficient computational engines in an FPGA you -have- to keep these engines fed with data. This requires some number of high bandwidth connections to large amounts of data memory around the FPGA. What you often see with I/O based FPGA cards is some of the potential performance gain is usually diminished by moving data back and forth across the I/O bus.
As a data point, my company uses the '530 Stratix IV FPGA from Altera. We surround it with several directly coupled memories and tie this subsystem directly into the microprocessor memory. We get several advantages over microprocessor systems for many applications, but this is not a $100-$200 starter kit, this is a full-blown integrated system.

Why not using GPUs as a CPU?

I know the question is only partially programming-related because the answer I would like to get is originally from these two questions:
Why are CPU cores number so low (vs GPU)? and Why aren't we using GPUs instead of CPUs, GPUs only or CPUs only? (I know that GPUs are specialized while CPUs are more for multi-task, etc.). I also know that there are memory (Host vs GPU) limitations along with precision and caches capability. But, In term of hardware comparison, high-end to high-end CPU/GPU comparison GPUs are much much more performant.
So my question is: Could we use GPUs instead of CPUs for OS, applications, etc
The reason I am asking this questions is because I would like to know the reason why current computers are still using 2 main processing units (CPU/GPU) with two main memory and caching systems (CPU/GPU) even if it is not something a programmer would like.
Current GPUs lack many of the facilities of a modern CPU that are generally considered important (crucial, really) to things like an OS.
Just for example, an OS normally used virtual memory and paging to manage processes. Paging allows the OS to give each process its own address space, (almost) completely isolated from every other process. At least based on publicly available information, most GPUs don't support paging at all (or at least not in the way an OS needs).
GPUs also operate at much lower clock speeds than CPUs. Therefore, they only provide high performance for embarrassingly parallel problems. CPUs are generally provide much higher performance for single threaded code. Most of the code in an OS isn't highly parallel -- in fact, a lot of it is quite difficult to make parallel at all (e.g., for years, Linux had a giant lock to ensure only one thread executed most kernel code at any given time). For this kind of task, a GPU would be unlikely to provide any benefit.
From a programming viewpoint, a GPU is a mixed blessing (at best). People have spent years working on programming models to make programming a GPU even halfway sane, and even so it's much more difficult (in general) than CPU programming. Given the difficulty of getting even relatively trivial things to work well on a GPU, I can't imagine attempting to write anything even close to as large and complex as an operating system to run on one.
GPUs are designed for graphics related processing (obviously), which is inherently something that benefits from parallel processing (doing multiple tasks/calculations at once). This means that unlike modern CPUs, which as you probably know usually have 2-8 cores, GPUs have hundreds of cores. This means that they are uniquely suited to processing things like ray tracing or anything else that you might encounter in a 3D game or other graphics intensive activity.
CPUs on the other hand have a relatively limited number of cores because the tasks that a CPU faces usually do not benefit from parallel processing nearly as much as rendering a 3D scene would. In fact, having too many cores in a CPU could actually degrade the performance of a machine, because of the nature of the tasks a CPU usually does and the fact that a lot of programs would not be written to take advantage of the multitude of cores. This means that for internet browsing or most other desktop tasks, a CPU with a few powerful cores would be better suited for the job than a GPU with many, many smaller cores.
Another thing to note is that more cores usually means more power needed. This means that a 256-core phone or laptop would be pretty impractical from a power and heat standpoint, not to mention the manufacturing challenges and costs.
Usually operating systems are pretty simple, if you look at their structure.
But parallelizing them will not improve speeds much, only raw clock speed will do.
GPU's simply lack parts and a lot of instructions from their instruction sets that an OS needs, it's a matter of sophistication. Just think of the virtualization features (Intel VT-x or AMD's AMD-v).
GPU cores are like dumb ants, whereas a CPU is like a complex human, so to speak. Both have different energy consumption because of this and produce very different amounts of heat.
See this extensive superuser answer here on more info.
Because nobody will spend money and time on this. Except for some enthusiasts like that one: http://gerigeri.uw.hu/DawnOS/history.html (now here: http://users.atw.hu/gerigeri/DawnOS/history.html)
Dawn now works on GPU-s: with a new OpenCL capable emulator, Dawn now
boots and works on Graphics Cards, GPU-s and IGP-s (with OpenCL 1.0).
Dawn is the first and only operating system to boot and work fully on
a graphics chip.

Performing numerical calculations in a virtualized setting?

I'm wondering what kind of performance hit numerical calculations will have in a virtualized setting? More specifically, what kind of performance loss can I expect from running CPU-bound C++ code in a virtualized windows OS as opposed to a native Linux one, on rather fast x86_64 multi-core machines?
I'll be happy to add precisions as needed, but as I don't know much about virtualization, I don't know what info is relevant.
Processes are just bunches of threads which are streams of instructions executing in a sequential fashion. In modern virtualisation solutions, as far as the CPU is concerned, the host and the guest processes execute together and differ only in that the I/O of the latter is being trapped and virtualised. Memory is also virtualised but that occurs more or less in the hardware MMU. Guest instructions are directly executed by the CPU otherwise it would not be virtualisation but rather emulation and as long as they do not access any virtualised resources they would execute just as fast as the host instructions. At the end it all depends on how well the CPU could cope with the increased number of running processes.
There are lightweight virtualisation solutions like zones in Solaris that partition the process space in order to give the appearance of multiple copies of the OS but it all happens under the umbrella of a single OS kernel.
The performance hit for pure computational codes is very small, often under 1-2%. The catch is that in reality all programs read and write data and computational codes usually read and write lots of data. Virtualised I/O is usually much slower than direct I/O even with solutions like Intel VT-* or AMD-V.
Exact numbers depend heavily on the specific hardware.
Goaded by #Mitch Wheat's unarguable assertion that my original post here was not an answer, here's an attempt to recast it as an answer:
I work mostly on HPC in the energy sector. Some of the computations that my scientist colleagues run take O(10^5) CPU-hours, we're seriously thinking about O(10^6) CPU-hours jobs in the near future.
I get well paid to squeeze every last drop of performance out of our codes, I'd think it was a good day's work if I could knock 1% off the run-time of some of our programs. Sometimes it has taken me a month to get that sort of performance improvement, sure I may be slow, but it's still cost-effective for our scientists.
I shudder therefore, when bright salespeople offering the latest and best in data center software (of which virtualization is one aspect) which will only, as I see it, shackle my codes to a pile of anchor chain from a 250,00dwt tanker (that was a metaphor).
I have read the question carefully and understand that OP is not proposing that virtualization would help, I'm offering the perspective of a practitioner. If this is still too much of a comment, do the SO thing and vote to close, I promise I won't be offended !

Resources