Intel Xeon Phi programming with gcc

I kind of want to get the Intel Xeon Phi coprocessor, since there is a model which seems to be selling for $230. I have two questions. Can I fully utilize its capabilities using just gcc along with OpenMP, or will I need the Intel compiler? Also, what is it about this model that makes it so cheap?
http://www.amazon.com/Intel-BC31S1P-Xeon-31S1P-Coprocessor/dp/B00OMCB4JI/ref=sr_1_2?ie=UTF8&qid=1444411560&sr=8-2&keywords=intel+xeon+phi

The 3100 series is the first generation of Xeon Phi (codenamed Knights Corner, abbreviated KNC).
Using GCC for Xeon Phi KNC programming is definitely not a perfect idea. See for example: Xeon Phi Knights Corner intrinsics with GCC
So using the Intel Compiler for KNC is strongly recommended. And yes, for non-commercial use, you can apply for a free Intel Compilers license here: https://software.intel.com/en-us/qualify-for-free-software (this is a fairly new program that was unavailable in the past).
The quoted KNC price tag is indeed low, although I periodically see KNC cards on sale at similar prices (so at least it's not an "incomplete" Phi, and it's not cheating, although Gilles' passive cooling point is valid). I don't know which problems you work on, but you should be aware that KNC is mainly suitable for highly parallel workloads. There is a good reference on the types of applications that could benefit from Xeon Phi KNC: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog
As I mentioned in the beginning, you are asking about the first-generation Xeon Phi. Many things (including the GCC answer) will likely change with the introduction of the second generation of Xeon Phi (codenamed Knights Landing, KNL), to be publicly released in roughly the next year.

GCC lets you compile code and run it on Xeon Phi, and I believe it does quite a good job of that. Indeed, AFAIK, GCC is the compiler used to build the Linux environment available on Xeon Phi. However, to take full advantage of the potential performance of Xeon Phi, I would strongly encourage you to use the Intel compiler. As a matter of fact, and unless I'm greatly mistaken, you can download and install the Intel compiler suite for free for personal use.
Regarding the Xeon Phi card, it comes cheap not really because it lacks anything one would want from a Xeon Phi card, but because it is a passively cooled card. That means that, unless you tinker together some cooling device with cardboard and fans, you won't be able to slot the card into a standard PC and use it. You'll need a rackable server, which doesn't come cheap and is usually very noisy. So if you've got a server to put the card in, this is a bargain. But if you don't, you'd better think it through.

Related

Can I compile Go programs on Xeon Phi (Knight's Landing) processors?

I'm a hobbyist who likes to run my own programs in Go, and as Xeon Phi processors become older they're also becoming extremely cheap. So cheap I can build a dual socket machine from 2015/16 for <$1000
I'm trying to find out if I can run Go programs on these. From what I've seen, this thread says they won't run (and to try gccgo), but it says it won't run because it partially runs on an x87 ISA. Confusingly, in Go release notes they say they're dropping x87 support in 1.16, implying it was supported in the past. I've seen in other threads that all programs will run on the compatibility layer, but that's an extremely slow layer which only has access to a small portion of the cpu's cache.
I feel like I'm moving farther and farther out of my element; I was wondering if someone who's used Xeon Phi knows if it will run Go code? Or just in general, after booting up Ubuntu (or FreeBSD, something that I've seen done and is listed in motherboard specs) what sort of things aren't going to work and what will?
I appreciate any and all help!
You're basing your Knight's Landing worries on this quote about Knight's Corner:
The Knight's Corner processor is based on an x86-64 foundation, yes, but it in fact has its own floating-point instruction set—no x87, no AVX, no SSE, no MMX... Oh, and then you can throw all that away when Knight's Landing (KNL) comes out.
By "throw all that away", they mean all the worries and incompatibilities. KNL is based on Silvermont and is fully x86-64 compatible (including x87, SSE, and SSE2 for both standard ways of doing FP math). It also supports AVX-512F, AVX-512ER, and a few other AVX-512 extensions, along with AVX and AVX2 and SSE up to SSE4.2. A lot like a Skylake-server CPU, except a different set of AVX-512 extensions.
The point of this is exactly to solve the problem you're worried about: so any legacy binary can run on KNL. To get good performance out of it, you want to be running code vectorized with AVX-512 vectors in the loops that do the heavy lifting, but all the surrounding code and other programs in the rest of the Linux distro or whatever can be running ordinary bog-standard code that uses whatever x87 and/or SSE.
Knight's Corner (first-gen commercial Xeon Phi) has its own variant / precursor of AVX-512 in a core based on P5-Pentium, and no other FP hardware.
Knight's Landing (second-gen commercial Xeon Phi) is based on Silvermont, with AVX-512, and is the first that can act as a "host" processor (bootable) instead of just a coprocessor.
This "host" mode is another reason for including enough hardware to decode and execute x87 and SSE: if you're running a whole system on KNL, you're much more likely to want to execute some legacy binaries for non-perf-sensitive tasks, not only binaries compiled specifically for it.
Its x87 performance is not great, though: about one scalar fmul per 2 clocks (https://agner.org/optimize), versus 2-per-clock SSE mulsd (0.5c reciprocal throughput). The same 0.5c throughput applies to other SSE/AVX math, including AVX-512 vfmadd132ps zmm, which does 16 single-precision fused multiply-add operations in one instruction.
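To put those throughput figures side by side, here is a small back-of-the-envelope calculation of floating-point operations per clock on one core. It uses only the numbers quoted above (one x87 fmul per 2 clocks, 0.5c reciprocal throughput for SSE mulsd and for the AVX-512 FMA, 16 single-precision lanes per ZMM register) and counts a fused multiply-add as two operations. The function name is made up for illustration.

```python
def flops_per_clock(recip_throughput_cycles, lanes=1, flops_per_lane=1):
    """Floating-point operations per clock for one instruction type on one core."""
    instructions_per_clock = 1.0 / recip_throughput_cycles
    return instructions_per_clock * lanes * flops_per_lane


# Figures taken from the paragraph above.
x87_fmul = flops_per_clock(2.0)                                   # scalar x87 multiply
sse_mulsd = flops_per_clock(0.5)                                  # scalar SSE multiply
avx512_fma_ps = flops_per_clock(0.5, lanes=16, flops_per_lane=2)  # FMA = mul + add

print(x87_fmul, sse_mulsd, avx512_fma_ps)  # 0.5 2.0 64.0
print(avx512_fma_ps / x87_fmul)            # 128.0
```

This is a peak-rate comparison only; real code is also limited by memory bandwidth, latency chains, and the rest of the pipeline.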
So hopefully Go's compiler doesn't use x87 much. The normal way to do scalar math in 64-bit mode (that C compilers and their math libraries use) is SSE, in XMM registers. x86-64 C compilers only use x87 for types like long double.
Yes:
Xeon Phi is a series of x86 manycore processors designed and made by Intel. It is intended for use in supercomputers, servers, and high-end workstations. Its architecture allows use of standard programming languages and application programming interfaces (APIs) such as...
See also https://en.wikipedia.org/wiki/Xeon_Phi
If you can compile Go on an x86 processor, then you will be able to compile it on this specific x86 processor, which is manufactured by Intel.
Xeon is not Itanium :)
On such systems you would also be able to compile Go; you would just need to provide a suitable C compiler...
What makes you think you would otherwise not be able to compile Go on, say, an Atari or perhaps an Arduino?
If you can elaborate on that perhaps I can improve my terrible answer further.

Is it possible to compare ARM and x86 performance via benchmarks?

Judging by the latest news, new Apple processor A11 Bionic gains more points than the mobile Intel Core i7 in the Geekbench benchmark.
As I understand, there are a lot of different tests in this benchmark. These tests simulate a different load, including the load, which can occur in everyday use.
Some people state that these results cannot be compared to x86 results. They say that x86 is able to perform "more complex tasks". As examples, they cite Photoshop, video conversion, and scientific calculations. I agree that software for ARM is often only a "lightweight" version of desktop software. But it seems to me that this limitation is caused by the format of mobile operating systems (do your work on the go, no mouse, etc.), and not by the performance of ARM.
As an opposite example, let's look at Safari. A browser is a complex program. And on the iPad Safari works just as well as on the Mac. Moreover, if we take the results of Sunspider (JS benchmark), it turns out that Safari on the iPad is gaining more points.
I think that in everyday tasks (Web, Office, Music/Films) ARM (A10X, A11) and x86 (dual core mobile Intel i7) performance are comparable and equal.
Are there any kinds of tasks where ARM really lags far behind x86? If so, what is the reason for this? What's stopping Apple from releasing a laptop on ARM? They already did the same thing with the migration from PowerPC to x86. Is this a technical restriction, or just marketing?
(Intended this as a comment since this question is off topic, but it got long..).
Of course you can compare, you just need to be very careful, which most people aren't. The fact that companies publishing (or "leaking") results are biased also doesn't help much.
The common misconception is that you can compare a benchmark across two systems and get a single score for each. That ignores the fact that different systems have different optimization points, most often with regards to power (or "TDP"). What you need to look at is the power/performance curve - this graph shows how the system reacts to more power (raising the frequency, enabling more performance features, etc), and how much it contributes to its performance.
One system can win over the low power range, but lose when the available power increases since it doesn't scale that well (or even stops scaling at some point). This is usually the case with Arm, as most of these CPUs are tuned for low power, while x86 covers a larger domain and scales much better.
If you are forced to observe a single point along the graph (which is a legitimate scenario, for example if you're looking for a CPU for a low-power device), at least make sure the comparison is fair and uses the same power envelope.
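As a rough sketch of what "same power envelope" means in practice, the snippet below interpolates each system's power/performance curve and reads off both scores at a shared power budget. The function name and the two curves are invented placeholders, not measurements of any real CPU; they only mimic the shape described above (one curve flattening early, the other continuing to scale).

```python
from bisect import bisect_left


def score_at_power(curve, watts):
    """Linearly interpolate a benchmark score at a given power budget.

    curve: list of (watts, score) points sorted by watts.
    Returns None when the budget lies outside the measured range,
    because extrapolating a power/performance curve is not meaningful.
    """
    powers = [p for p, _ in curve]
    if watts < powers[0] or watts > powers[-1]:
        return None
    i = bisect_left(powers, watts)
    if powers[i] == watts:
        return curve[i][1]
    (p0, s0), (p1, s1) = curve[i - 1], curve[i]
    return s0 + (s1 - s0) * (watts - p0) / (p1 - p0)


# Placeholder curves (hypothetical numbers, NOT real measurements).
chip_a = [(1, 100), (3, 220), (5, 260), (10, 280)]   # tuned for low power, flattens early
chip_b = [(3, 150), (5, 250), (10, 450), (20, 700)]  # keeps scaling with more power

for budget in (3, 5, 10):
    print(budget, score_at_power(chip_a, budget), score_at_power(chip_b, budget))
```

With these placeholder curves the first chip is ahead at 3 W and 5 W and behind at 10 W, which is exactly why a single score per system hides the interesting part.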
There are of course other factors that must be aligned (and sometimes aren't, due to negligence or an intention to cheat). The workload should be the same (I've seen different versions compared), and the compiler should be as close as possible: generating ARM versus x86 code is already a difference, but the compiler's intermediate optimizations should be similar. When comparing two x86 CPUs, such as Intel and AMD, you should prefer the same binary, unless you also want to allow machine-specific optimizations.
Finally, the system should also be similar, which is not the case when comparing a smartphone against a pc/macbook. The memory could differ, the core count, etc. This could be legitimate difference, but it's not really related to one architecture being better than the other.
The topic is bogus. Between the ISA and an application or its source code there are many abstraction levels, and the only metric we have (execution time, or throughput) depends on many factors that could favor one side or the other: the choice of algorithm, the optimizations written into the source code, the compiler/interpreter implementation and its optimizations, and the operating system's behaviour. So they are not exactly/mathematically comparable.
However, looking at the numbers and at the usefulness of current mobile applications, and speaking as a management engineer, ARM chips seem capable of running quite well.
I think the only reason is the inertia of the established standard (note that Microsoft offers a variant of Windows running on ARM processors, and Debian ARM variants are ready: https://www.debian.org/distrib/netinst).
The ARMv8 cores seem close to x86-64 ones when looking at raw numbers.
note i7-3770k results: https://en.wikipedia.org/wiki/Instructions_per_second#MIPS
A summary of recent ARMv8 CPU characteristics; note the decode and dispatch figures and the caches, and compare the last column for the Cortex-A73 with the i7-3770K:
https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
intel ivy bridge characteristics:
https://en.wikichip.org/wiki/intel/microarchitectures/ivy_bridge_(client)
A75 details. https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55
The topic of power consumption is complex as well. The basic principle behind all the frequency/voltage rules (used and abused) across the web is transistor rise time. https://en.wikipedia.org/wiki/Rise_time
There is a fixed time delay in the switching of a transistor, which determines the maximum frequency at which it can switch. With more transistors linked in a cascade, this time adds up in a nonlinear way (it takes some integration to demonstrate). As a result, about 10 years ago companies trying to increase the GHz split the execution of an operation into more stages and ran the operations in a pipelined way, even inside a logical pipeline stage. https://en.wikipedia.org/wiki/Instruction_pipelining
The rise time depends on physical characteristics (the materials and shape of the transistors). It can be reduced by increasing the voltage, so the transistor switches faster, since the switching is associated (if you allow me the term) with the charge/discharge of a capacitor that triggers the opening/closing of the transistor channel.
These ARM chips are designed for low-power applications. By changing the design they could easily gain MHz, but they would use much more power. How much? Again, this is not comparable unless you work inside a foundry and have the numbers.
Examples of server-class ARM processors that could be closer to desktop/workstation CPUs in power consumption are the Cavium CPUs or Qualcomm's Falkor, and some benchmarks report that they are not bad.

Can I run CUDA or OpenCL on Intel Iris?

I have a mid-2014 MacBook Pro with Intel Iris graphics, an Intel Core i5 processor, and 16GB of RAM. I am planning to learn some ray-traced 3D, but I am not sure whether my laptop can render fast without any Nvidia hardware.
So I would appreciate it if someone could tell me whether I can use CUDA. If not, could you please explain in a very simple way how to enable OpenCL in After Effects? I am also looking for any beginner tutorial on how to create or build OpenCL programs.
CUDA works only on Nvidia hardware, but there may be some libraries that convert it to run on CPU cores (not the iGPU).
AMD is working on "hipify"ing old CUDA kernels to translate them to OpenCL or similar code so they can become more general.
OpenCL works everywhere as long as both the hardware and the OS support it. AMD, Nvidia, Intel, Xilinx, Altera, Qualcomm, MediaTek, Marvell, Texas Instruments and others support it. Maybe even the Raspberry Pi will support it in the future.
Documentation for opencl in stackoverflow.com is under development. But there are some sites:
Amd's tutorial
Amd's parallel programming guide for opencl
Nvidia's learning material
Intel's HD graphics coding tutorial
Some overview of hardware, benchmark and parallel programming subjects
blog
Scratch-a-pixel-raytracing-tutorial (I read it then wrote its teraflops gpu version)
If it is Iris Graphics 6100:
Your integrated GPU has 48 execution units, each with 8 ALUs that can do add, multiply, and many more operations. Its clock frequency can rise to 1 GHz. This means a maximum of 48 * 8 * 2 (1 add + 1 multiply) * 1G = 768 giga floating-point operations per second, but only if each ALU can perform 1 addition and 1 multiplication concurrently. 768 GFLOPS is more than a low-end discrete GPU such as AMD's R7 240. (As of 19.10.2017, AMD's low end is the RX 550 with 1200 GFLOPS, faster than Intel's Iris Plus 650 at nearly 900 GFLOPS.) Ray tracing has to re-access a large amount of geometry data, so the device should ideally have its own memory (as Nvidia and AMD cards do), leaving the CPU free to do its own work.
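The peak-throughput estimate above is just a product of four factors, which a few lines of Python make explicit. The function name is invented for illustration; the inputs are the figures from the paragraph (48 execution units, 8 ALUs each, up to 2 operations per ALU per clock, 1 GHz).

```python
def peak_gflops(execution_units, alus_per_unit, flops_per_alu_per_clock, clock_ghz):
    """Theoretical peak in GFLOPS: units * ALUs * operations per clock * GHz."""
    return execution_units * alus_per_unit * flops_per_alu_per_clock * clock_ghz


# Assuming each ALU can issue one add and one multiply per clock.
dual_issue = peak_gflops(48, 8, 2, 1.0)
# If an ALU can only do one operation per clock, the peak halves.
single_issue = peak_gflops(48, 8, 1, 1.0)

print(dual_issue)    # 768.0
print(single_issue)  # 384.0
```

Either figure is an upper bound; a ray tracer that is limited by memory access will land well below it.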
How you install opencl on a computer can change by OS and hardware type, but building a software with an opencl-installed computer is similar:
Query platforms. The result can be AMD, Intel, Nvidia, duplicates of these caused by overlapping installations of wrong drivers, or experimental platforms that predate support for newer OpenCL versions.
Query devices of a platform(or all platforms). This gives individual devices (and their duplicates if there are driver errors or some other things to fix).
Create a context(or multiple) using a platform
Using a context(so everything will have implicit sync in it):
Build programs using kernel strings. Usually a CPU takes less time than a GPU to build a program. (There is a binary load option to shortcut this.)
Build kernels(as objects now) from programs.
Create buffers from host-side buffers or opencl-managed buffers.
Create a command queue (or multiple)
Just before computing(or an array of computations):
Select buffers for a kernel as its arguments.
Enqueue buffer write(or map/unmap) operations on "input" buffers
Compute:
Enqueue nd range kernel(with specifying which kernel runs and with how many threads)
Enqueue buffer read(or map/unmap) operations on "output" buffers
Don't forget to synchronize with host using clFinish() if you haven't used blocking type enqueueBufferRead.
Use your accelerated data.
After OpenCL is no longer needed:
Be sure all command queues are empty / finished doing kernel work.
Release all in the opposite order of creation
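The steps above can be sketched with the PyOpenCL bindings. This is a minimal illustration, assuming the third-party packages numpy and pyopencl plus a working OpenCL driver are installed; the kernel, the function name, and the choice of "first platform, first device" are made up for the example and are not from the answer. On a machine with several platforms you would normally pick the device more carefully.

```python
# OpenCL C source for a trivial kernel: one work-item per array element.
KERNEL_SRC = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""


def vec_add_on_first_device(a, b):
    """Add two equal-length 1-D arrays on the first OpenCL device found."""
    # Third-party imports kept local: both are assumptions about your setup.
    import numpy as np
    import pyopencl as cl

    a = np.ascontiguousarray(a, dtype=np.float32)
    b = np.ascontiguousarray(b, dtype=np.float32)

    platform = cl.get_platforms()[0]               # query platforms
    device = platform.get_devices()[0]             # query devices of a platform
    ctx = cl.Context([device])                     # create a context

    program = cl.Program(ctx, KERNEL_SRC).build()  # build program from kernel string
    kernel = cl.Kernel(program, "vec_add")         # kernel object from the program
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY, size=a.nbytes)      # OpenCL-managed buffers
    b_buf = cl.Buffer(ctx, mf.READ_ONLY, size=b.nbytes)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=a.nbytes)
    queue = cl.CommandQueue(ctx)                   # create a command queue

    kernel.set_args(a_buf, b_buf, out_buf)         # select buffers as kernel arguments
    cl.enqueue_copy(queue, a_buf, a)               # write "input" buffers
    cl.enqueue_copy(queue, b_buf, b)

    cl.enqueue_nd_range_kernel(queue, kernel, (a.size,), None)  # one work-item per element
    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)           # read the "output" buffer
    queue.finish()                                 # synchronize with the host

    # Release in the opposite order of creation (PyOpenCL would also do this
    # automatically when the objects are garbage collected).
    for buf in (out_buf, b_buf, a_buf):
        buf.release()
    return out
```

A call such as `vec_add_on_first_device([1, 2, 3], [10, 20, 30])` should return the element-wise sums as a float32 array, provided an OpenCL device is available. The same sequence of calls exists in the C API under the `cl*` names (for example `clEnqueueNDRangeKernel` and `clFinish`).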
If you need to accelerate an open source software, you can switch a hotspot parallelizable loop with a simple opencl kernel, if it doesn't have another acceleration support already. For example, you can accelerate air-pressure and heat-advection part of powdertoy sand-box simulator.
Yes, you can, because OpenCL is supported by MacOS natively.
From your question it appears you are not seeking advice on programming, which would have been the appropriate subject for Stack Overflow. The first search hit on Google explains how to turn on OpenCL accelerated effects in After Effects (Project Settings dialog -> Video Rendering and Effects), but I have no experience with that myself.

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic bit more I want to know if there is ANY way to (get the trace) see the actual dynamic reordering done for a given program?
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool lets you see how a linear fragment of code will be split into micro-operations and how they will be scheduled onto execution ports. The tool has some limitations, and it is only an inexact model of CPU u-op reordering and execution.
There are also some "external" tools for emulating x86/x86_64 CPU internals. I can recommend PTLsim (or the derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models some "PTL" cpu, not real AMD or Intel CPU. The good news is that this PTL is Out-Of-Order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the work at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating various simple CPUs and even includes a Tomasulo example.
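To get a feel for what such an emulator's trace looks like, here is a deliberately tiny toy scheduler. It is not a model of any real CPU and not taken from any of the tools above: it assumes perfect register renaming (only true read-after-write dependences matter, as in Tomasulo's algorithm), an unlimited instruction window, and made-up latencies. The function name, instruction format, and example program are all invented for illustration.

```python
def out_of_order_trace(program, issue_width=1):
    """Return [(cycle, name)] in the order instructions start executing.

    program: list of (name, dest_register, source_registers, latency)
    in program order, with latency >= 1. Each cycle, the oldest ready
    instructions (up to issue_width) start executing.
    """
    # For every instruction, find the earlier instructions producing its sources.
    last_writer = {}
    producers = []
    for idx, (_, dest, sources, _) in enumerate(program):
        producers.append([last_writer[r] for r in sources if r in last_writer])
        last_writer[dest] = idx

    finish = {}       # instruction index -> cycle at which its result is available
    started = set()
    trace = []
    cycle = 0
    while len(started) < len(program):
        issued = 0
        for idx, (name, _, _, latency) in enumerate(program):
            if idx in started or issued == issue_width:
                continue
            # Ready when every producer has started and its result is available.
            if all(p in finish and finish[p] <= cycle for p in producers[idx]):
                started.add(idx)
                finish[idx] = cycle + latency
                trace.append((cycle, name))
                issued += 1
        cycle += 1
    return trace


# Hypothetical program: the MUL does not depend on the slow LOAD.
program = [
    ("LOAD", "r1", [], 3),            # r1 = memory, 3 cycles
    ("ADD", "r2", ["r1"], 1),         # r2 = r1 + r1, waits for LOAD
    ("MUL", "r3", ["r4", "r5"], 2),   # r3 = r4 * r5, independent
    ("SUB", "r6", ["r3", "r2"], 1),   # r6 = r3 - r2, waits for MUL and ADD
]

print(out_of_order_trace(program))
# [(0, 'LOAD'), (1, 'MUL'), (3, 'ADD'), (4, 'SUB')]
```

The MUL starts before the ADD even though it comes later in program order, which is the essence of the reordering; real cores add reservation stations, a limited number of ports, a reorder buffer for in-order retirement, and much more.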
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.

Learn Nvidia CUDA

I am a C++ programmer who develops image and video algorithms. Should I learn Nvidia CUDA, or is it one of those technologies that will disappear?
CUDA is currently a single vendor technology from NVIDIA and therefore doesn't have the multi vendor support that OpenCL does.
However, it's more mature than OpenCL, has great documentation, and the skills learnt using it will transfer easily to other parallel data processing toolkits.
As an example of this, read Data Parallel Algorithms by Steele and Hillis and then look at the Nvidia tutorials: there's a clear link between the two, yet the Steele/Hillis paper was written over 20 years before CUDA was introduced.
Finally, the FCUDA project is working to allow CUDA projects to target non-Nvidia hardware (FPGAs).
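One pattern commonly associated with that paper, and one that shows up again in GPU tutorials, is the data-parallel prefix sum (scan). The sketch below simulates it sequentially in plain Python, so it only illustrates the structure; the function name is made up, and on a GPU each element of a step would be handled by its own thread.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum in about log2(n) data-parallel steps."""
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        # Every element is updated "at once" from the previous step's data:
        # element i adds the value that sits `offset` positions to its left.
        data = [data[i] + data[i - offset] if i >= offset else data[i]
                for i in range(n)]
        offset *= 2
    return data


print(hillis_steele_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

Building a fresh list at each step plays the role of the double buffering a GPU version needs, so that no thread reads a value another thread has already overwritten in the same step.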
CUDA should stick around for a while, but if you're just starting out, I'd recommend looking at OpenCL or DirectCompute. Both of these run on ATI as well as NVidia hardware, in addition to also working on the vector units (SSE) of CPUs.
I think you should rather stick with OpenCL, which is an open standard and supported by ATI, nVidia and more. CUDA might not disappear in the next years, but anyway it is not compatible with non-nVidia GPUs.
OpenCL might take some time to become pervasive, but I found learning CUDA very informative, and I don't think CUDA is going to be out of the limelight anytime soon. Besides, CUDA is easy enough that the time it takes to learn it is much shorter than CUDA's shelf life.
This is the era of high-performance computing and parallel computing. CUDA and OpenCL are the emerging technologies of GPU computing, which is itself high-performance computing! If you are a passionate programmer and want to achieve top performance in parallel algorithms, you should really go for these technologies. The data-parallel part of your program can execute within a fraction of a second on a many-core GPU architecture, where it usually takes much longer on your CPU.

Resources