Latency to/from Xeon Phi - gpgpu

What is a typical latency figure for moving a "small amount" of data (a few KB, say) from a CPU cache to a coprocessor like the Xeon Phi? I assume that the return trip would take a similar amount of time, but if not, please specify that in your answer.
I know that this depends on a lot of things, but I'm just looking for order-of-magnitude numbers, and I don't have a similar setup to test.

I'm afraid the question, as you ask it, really doesn't have an answer. You can ask what the raw bandwidth and latency of a PCIe bus is, but that doesn't really tell you anything. And you wouldn't really want to read a word into the processor's cache and then send it to the coprocessor. You want to keep the processor itself out of all this as much as possible.
At a minimum, what you need to know before you can ask a question like this is what protocol you are using to move the data, where the data is, and how big the transfer is.
I could suggest you read the Intel® Xeon Phi™ Coprocessor System Software Developers Guide if you want to know about the Intel Xeon Phi coprocessor in particular. (I can't speak to any other architectures - I'm ignorant there.) But the System Software Developers Guide is way more detail than you want or need at this point. If you want a general idea of what is going on, I would tell you that the Intel Xeon Phi coprocessor mostly uses something called SCIF to communicate between the host and coprocessor, and you can find the basics in chapter six of Rezaur's book Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers (which you can find on Google Books if you want to read just that chapter).
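For a flavor of what host-side SCIF code looks like, here is a rough, untested sketch in C. It assumes the scif.h header and library that ship with Intel MPSS; the node number, port and buffer size are arbitrary placeholders, and a matching listener (scif_open/scif_bind/scif_listen/scif_accept) has to be running on the coprocessor side:

    #include <stdio.h>
    #include <scif.h>                       /* ships with Intel MPSS */

    int main(void)
    {
        /* Open an endpoint on the host and connect to node 1
         * (the first coprocessor) on an arbitrary port. */
        scif_epd_t epd = scif_open();
        if (epd == SCIF_OPEN_FAILED) { perror("scif_open"); return 1; }

        struct scif_portID dst = { .node = 1, .port = 2050 };
        if (scif_connect(epd, &dst) < 0) { perror("scif_connect"); return 1; }

        /* Send a few KB and wait for an echo; timing this pair of calls
         * gives a rough round-trip figure for small messages. */
        char buf[4096] = {0};
        if (scif_send(epd, buf, sizeof buf, SCIF_SEND_BLOCK) < 0)
            perror("scif_send");
        if (scif_recv(epd, buf, sizeof buf, SCIF_RECV_BLOCK) < 0)
            perror("scif_recv");

        scif_close(epd);
        return 0;
    }

The two-sided send/recv interface above is the simplest path; SCIF also offers registered memory and RMA operations for bulk transfers, which is what you would use for anything beyond small control messages.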
As I say, I can't speak to any other architecture; I just don't know. But I'm sure you can find information out there.

Data is not transferred from the host's cache to the coprocessor. It can be transferred from the host's memory to the coprocessor's.
Keep in mind that this doesn't occur in native execution; it can only happen in the offload model.
Now, if you want to benchmark the speed of data transfer, it will depend on your motherboard and the PCIe bus bandwidth/latency.
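If you do end up benchmarking it yourself, here is a minimal sketch of timing an offload round trip in C. It assumes the Intel C/C++ compiler with its offload pragmas and an installed coprocessor; the buffer size and the trivial work inside the region are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define N (4 * 1024)    /* "a few KB", as in the question */

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        char *buf = malloc(N);
        if (!buf) return 1;

        /* Warm-up offload: the first one also pays for device
         * initialization, so it should not be the one you time. */
        #pragma offload target(mic:0) inout(buf : length(N))
        { buf[0] = 1; }

        double t0 = now();
        #pragma offload target(mic:0) inout(buf : length(N))
        { buf[0] += 1; }    /* trivial work so the region is not empty */
        double t1 = now();

        printf("offload round trip for %d bytes: %.3f ms\n",
               N, (t1 - t0) * 1e3);
        free(buf);
        return 0;
    }

The number will mostly reflect offload bookkeeping and PCIe latency rather than the few KB of payload.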

Related

How does Intel's RAPL estimate the power consumption

First of all, I do not know whether I should be asking this here or in the Electronics StackExchange, so please let me know if you think I should ask it there.
I am interested in measuring the energy consumption of each CPU core in Intel CPUs. I have read Intel's Intel 64 developer manual and, as I understand it, RAPL will provide energy estimations for:
The whole Package
The Cores
An unspecified Uncore device (Only in client processors)
The DRAM (Only in server processors)
This would indicate that the best I can aspire to is a value for the collective energy consumption of all the cores in the CPU. However, I also know that "RAPL is not an analog power meter, but rather uses a software power model", according to https://01.org/blogs/2014/running-average-power-limit-%E2%80%93-rapl.
What I would like to know is: is the way this model works known, or publicly available? And would it be possible to get an estimation of individual core power consumption using metrics provided by RAPL or other interfaces? I know that, if Intel isn't providing this information through RAPL, it is probably impossible to get, but I would like to at least find a source that confirms that.
Thanks for your help!
Here is a post on different tools that you can use to get energy measurements for different operating systems. If you are using Linux, consider using perf, since it uses the RAPL interface to get energy measurements. As far as I know, perf does not report energy consumption per core, only for the package as a whole, and you can get energy measurements for an executable (of any kind: Python, Java, shell, C, and so on) using the following command:
sudo perf stat -e power/energy-cores/ ./executable
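If you would rather read the RAPL counters yourself, Linux also exposes them through the powercap sysfs interface. Here is a minimal sketch in C; the /sys/class/powercap/intel-rapl:0 path is the usual package domain, but the available domains and their numbering vary between machines, and reading them may require root:

    #include <stdio.h>
    #include <unistd.h>

    /* Package-level RAPL energy counter, in microjoules.  The core
     * (PP0) domain, where present, is typically intel-rapl:0:0. */
    #define RAPL_PKG "/sys/class/powercap/intel-rapl:0/energy_uj"

    static long long read_energy_uj(void)
    {
        long long uj = -1;
        FILE *f = fopen(RAPL_PKG, "r");
        if (f) {
            if (fscanf(f, "%lld", &uj) != 1)
                uj = -1;
            fclose(f);
        }
        return uj;
    }

    int main(void)
    {
        long long before = read_energy_uj();
        sleep(1);                       /* ...or run your workload here */
        long long after = read_energy_uj();

        if (before < 0 || after < 0) {
            fprintf(stderr, "could not read %s\n", RAPL_PKG);
            return 1;
        }
        /* The counter wraps around eventually; ignored in this sketch. */
        printf("package consumed about %.6f J\n", (after - before) / 1e6);
        return 0;
    }

Note that, like perf, this still only gives you package- and domain-level numbers, not per-core ones.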

What is the minimum amount of RAM required to run Linux kernel on an Embedded device?

What is the minimum amount of RAM required to run Linux kernel on an Embedded device? In Linux-0.11 for 80x86, the minimum RAM required was 2MB to load the kernel data structures and interrupt vectors.
How much is the minimum RAM needed for the present Linux-3.18 kernel? Do different architectures like x86 and ARM have different requirements for the minimum RAM required for booting? How does one calculate it?
It's possible to shrink it down to ~600 KiB. Check the work done by Tom Zanussi from Intel.
Presentation from Tom and Wiki page about the topic.
UPDATE. Tom published interesting statistics about memory use by the different subsystems in the kernel. He did this research while he was working on the project.
Yet another interesting project is Gray486linux.
This site suggests:
A minimal uClinux configuration could be run from 4MB RAM, although the recommendation we are giving to our customers is that they should design in at least 16 MB's worth of RAM.
If you are using SDRAM, the problem would be getting a part any smaller than 16Mb at reasonable volume cost and availability, so maybe it is a non-problem? For SRAM however, that is a large and relatively expensive part.
eLinux.org has a lot of information on embedded kernel size, how to determine it, and how to minimise it.
It depends how you define Linux. If you mean current general-purpose operating systems, then we are talking about way above 100 MB, better 1000 MB, of memory.
If we are talking about "Linux from Scratch", then we are also talking about how much pain you are willing to suffer. In the mid-1990s I built a Linux system by compiling every binary myself and made it run on a 386sx16 with 1.5 MB of memory. While it had a 40 MB hard drive, it was mostly empty. I compiled my own kernel 1.0.9, my own libc5, my own base tools, SVGAlib. That system was somewhat usable for text-mode and SVGAlib applications. Increasing the memory to 2 MB helped a lot. And believe me, the system was extremely bare. Today all components need at least twice the memory, but then there is also uClibc instead of libc, and BusyBox.
At 8 MB of memory I can create a very basic system today from scratch. At 512 MB of memory you might have a somewhat modern-looking but slow desktop system.

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to see (get a trace of) the actual dynamic reordering done for a given program.
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable such tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool allows you to see how a linear fragment of code will be split into micro-operations and how they will be scheduled onto execution ports. The tool has some limitations and is only an inexact model of CPU uop reordering and execution.
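To give an idea of how IACA is driven: you mark the code fragment of interest with the macros from the iacaMarks.h header that ships with the tool, compile, and then run the iaca executable on the resulting object file with the target microarchitecture selected. A rough sketch (the loop itself is just a placeholder):

    #include "iacaMarks.h"      /* ships with the IACA download */

    /* IACA reports how the marked loop body is decoded into uops and
     * which execution ports they are expected to occupy. */
    float sum(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            IACA_START          /* marker: start of analyzed block */
            s += a[i];
        }
        IACA_END                /* marker: end of analyzed block */
        return s;
    }

Keep in mind the output is a static estimate of the scheduling, not a trace of what a real CPU actually did with your program.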
There are also some "external" tools for emulating x86/x86_64 CPU internals; I can recommend PTLsim (or the derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models a generic "PTL" CPU, not a real AMD or Intel CPU. The good news is that this PTL core is out-of-order and based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even includes a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.

GPU access to system RAM

I am currently involved in developing a large scientific computing project, and I am exploring the possibility of hardware acceleration with GPUs as an alternative to the MPI/cluster approach. We are in a mainly memory-bound situation, with too much data to fit in GPU memory. To this end, I have two questions:
1) The books I have read say that it is illegal to access memory on the host with a pointer on the device (for obvious reasons). Instead, one must copy the memory from the host's memory to the device memory, then do the computations, and then copy back. My question is whether there is a work-around for this -- is there any way to read a value in system RAM from the GPU?
2) More generally, what algorithms/solutions exist for optimizing the data transfer between the CPU and the GPU during memory-bound computations such as these?
Thanks for your help in this! I am enthusiastic about making the switch to CUDA, simply because the parallelization is much more intuitive!
1) Yes, you can do this with most GPGPU packages.
The one I'm most familiar with -- the AMD Stream SDK -- lets you allocate a buffer in "system" memory and use that as a texture that is read or written by your kernel. CUDA and OpenCL have the same ability; the key is to set the correct flags on the buffer allocation (see the CUDA sketch below).
BUT...
You might not want to do that because the data is being read/written across the PCIe bus, which has a lot of overhead.
The implementation is free to interpret your requests liberally. I mean you can tell it to locate the buffer in system memory, but the software stack is free to do things like relocate it into GPU memory on the fly -- as long as the computed results are the same.
2) All of the major GPGPU software environments (CUDA, OpenCL, the Stream SDK) support DMA transfers, which is what you probably want.
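For the CUDA case, the "correct flags" are the mapped, page-locked host allocation flags. A host-side sketch in C of zero-copy access to system RAM (error checking and the kernel itself are omitted; the kernel name in the comment is hypothetical):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        /* Allow the device to map page-locked host memory into its
         * address space; must be set before the mapped allocation. */
        cudaSetDeviceFlags(cudaDeviceMapHost);

        const size_t n = 1 << 20;
        float *h_buf = NULL, *d_alias = NULL;

        /* Page-locked host allocation that the GPU can see. */
        cudaHostAlloc((void **)&h_buf, n * sizeof(float),
                      cudaHostAllocMapped);

        /* Device-side pointer aliasing the same host memory; a kernel
         * can dereference it, but every access crosses PCIe. */
        cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

        /* my_kernel<<<blocks, threads>>>(d_alias, n);  (hypothetical) */

        printf("host %p is visible on the device as %p\n",
               (void *)h_buf, (void *)d_alias);

        cudaFreeHost(h_buf);
        return 0;
    }

As noted above, this is convenient but every access pays the PCIe cost, so it only makes sense for data that is touched once or sparsely.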
Even if you could do this, you probably wouldn't want to, since transfers over PCI-whatever will tend to be a bottleneck, whereas bandwidth between the GPU and its own memory is typically very high.
Having said that, if you have relatively little computation to perform per element on a large data set then GPGPU is probably not going to work well for you anyway.
I suggest the CUDA Programming Guide; you will find many answers there.
Check for streams, unified addressing, and cudaHostRegister.
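To make the streams/cudaHostRegister suggestion concrete, here is a rough host-side sketch in C of chunked, double-buffered transfers; error checking is omitted, the kernel launch is only indicated as a comment, and the chunk sizes are arbitrary:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHUNK   (1 << 20)       /* elements per chunk (arbitrary) */
    #define NCHUNKS 8

    int main(void)
    {
        /* An existing host buffer, e.g. from the rest of the application. */
        float *h_data = malloc((size_t)CHUNK * NCHUNKS * sizeof(float));

        /* Pin it so asynchronous DMA transfers are possible. */
        cudaHostRegister(h_data, (size_t)CHUNK * NCHUNKS * sizeof(float),
                         cudaHostRegisterDefault);

        float *d_buf[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; i++) {
            cudaMalloc((void **)&d_buf[i], CHUNK * sizeof(float));
            cudaStreamCreate(&stream[i]);
        }

        /* Double-buffer: while one chunk is being processed, the next
         * one is already being copied in on the other stream. */
        for (int c = 0; c < NCHUNKS; c++) {
            int s = c & 1;
            cudaMemcpyAsync(d_buf[s], h_data + (size_t)c * CHUNK,
                            CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            /* process_chunk<<<grid, block, 0, stream[s]>>>(d_buf[s], CHUNK); */
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < 2; i++) {
            cudaStreamDestroy(stream[i]);
            cudaFree(d_buf[i]);
        }
        cudaHostUnregister(h_data);
        free(h_data);
        return 0;
    }

The same pattern works with cudaHostAlloc instead of registering an existing buffer; which one is more convenient depends on how the rest of the application allocates its memory.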

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which allows me to measure how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only interested in whether the memory bandwidth is being saturated or not.
If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.
I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.
It would be hard to find a tool that measures memory bandwidth utilization for your application.
But since the issue you face is a suspected memory bandwidth problem, you could try to measure whether your application is generating a lot of page faults per second, which would definitely mean that you are nowhere near the theoretical memory bandwidth.
You should also measure how cache friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" for good sources that tell you how to do this.
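One rough, OS-independent sanity check is to measure the streaming bandwidth a plain copy loop can sustain on your machine and compare it with what your application appears to need. A C11 sketch (buffer sizes are arbitrary; a single thread usually cannot reach the platform's full bandwidth, so run one copy per core to approximate the aggregate):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Stream buffers much larger than the last-level cache and
     * report the sustained copy bandwidth in GB/s. */
    #define N (32u * 1024u * 1024u)     /* 32M doubles = 256 MB each */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        if (!a || !b) return 1;
        /* Touch both buffers first so page faults are not timed. */
        memset(a, 0, N * sizeof *a);
        memset(b, 1, N * sizeof *b);

        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);

        for (size_t i = 0; i < N; i++)  /* simple copy kernel */
            a[i] = b[i];

        timespec_get(&t1, TIME_UTC);
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        /* One read + one write per element (write-allocate traffic
         * is ignored, so this undercounts somewhat). */
        double gbytes = 2.0 * N * sizeof(double) / 1e9;
        printf("checksum %g, ~%.1f GB/s for a single thread\n",
               a[N / 2], gbytes / secs);

        free(a);
        free(b);
        return 0;
    }

If your multi-threaded runs already sit close to the aggregate figure you get this way, the bandwidth-saturation hypothesis is plausible.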
It isn't possible to properly measure memory bus utilisation with any kind of software-only solution. (It used to be, back in the '80s or so, but then we got pipelining, caches, out-of-order execution, multiple cores, non-uniform memory architectures with multiple buses, etc.)
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specifically for Intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for linux) which has a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably trained computer engineer and quite high-performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy with some buses - e.g. old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are usually used between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even if it normally only triggers occasionally, can be monitored for excessively 'long' levels: for the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB FIFO hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
You monitor utilisation either way by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity occurring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements.
This isn't as straightforward as just looking for periods of no activity - all DRAM needs regular refresh cycles at the very least, which may or may not happen along with obvious bus activity (some DRAMs do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc.).
So the instrument needs to be able to analyse the data deeply enough for you to extract how busy it is.
Your best and simplest bet is to find a PC hardware (CPU) vendor who has tools that do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows - there are Windows guest drivers for it), and also pin your VM to specific CPUs, while you configure Linux to avoid putting other jobs on those same CPUs.

Resources