What CPU instructions use the most power?

The background: next week our office will have one day with no heating, due to maintenance. The outdoor temperature is expected to be between 7 and 12 degrees Celsius, so it might get chilly. There are too few portable electric heaters to go around.
However, I have, in my office of about 6-8 m2, a big honkin' (3-year-old) workstation (HP xw8600 with a 3.0 GHz quad-core Xeon) that should be able to output a couple of hundred watts of heat. Running FurMark will max out the GPU, but I'm not sure how best to work the CPU.
Last time I was in a cold office I either compiled more often or just launched 4-8 DOSBoxes running Norton Commander, but I think one can do better by using SSE/SSE2/SSE3/SSE4, MMX, etc., i.e. stuff that does more work per cycle.
So, what CPU instructions toggle the most transistors each cycle, and thus cause the CPU to draw the most power and give off the most heat?
If I had a power meter available I could benchmark this myself, but I figure it would be a fun challenge for the SO crowd. :)

For your specific goal, if you really want to use your system as a heat generator, you first need to make sure the cooling system is working really well (throwing the heat out of the box). Processors today are designed to throttle themselves when they reach a critical temperature, which happens when a proper heatsink is used and the processor is running at TDP (Thermal Design Power, the maximum power for the processor under normal programs). If you have a better heatsink and good ventilation (box fan?), you can probably get beyond TDP, assuming your power supply can handle it. If you turn the fan off, you will basically hit the thermal limit right away.
To be more explicit, the individual instructions that burn the most power are generally load instructions that miss in the caches and go out to memory. To guarantee misses, allocate a chunk of memory that is bigger than the last-level CPU cache and hop around in it. The hopping pattern for the maximum-power case is a bit complex, because you are trying to keep the maximum number of misses outstanding at every level of the cache hierarchy simultaneously. If you have 3 levels of cache, then in a given period of time you can have more misses outstanding to the L1 than to the L2, more to the L2 than to the L3, and more to the L3 than to the DRAM page. (And the specific design of your processor will have a total limit on outstanding misses.) Between misses, the instruction doesn't matter too much, but I'd guess that one of the packed SSE multiplies (PMULUDQ) is probably the best, since on a lot of modern processors it executes fairly quickly and does a whole lot of work per instruction (compared to, say, an add).
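A rough C++ sketch of that recipe (my own illustration, not code from the answer): stride through a buffer much larger than the last-level cache so most loads miss, and do SSE2 multiplies (PMULUDQ via the _mm_mul_epu32 intrinsic) between loads. The buffer size, stride and print interval are assumptions to tune per machine; run one copy per core (or add threads) to load the whole package.

    #include <emmintrin.h>   // SSE2 intrinsics: _mm_mul_epu32 generates PMULUDQ
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t bytes = 256u * 1024u * 1024u;   // much bigger than any L3 (assumed)
        const std::size_t n = bytes / sizeof(std::uint64_t);
        std::vector<std::uint64_t> buf(n, 1);

        const std::size_t stride = 9973;                  // large odd prime: a new cache line almost every access
        __m128i acc = _mm_set1_epi32(3);

        std::size_t idx = 0;
        for (unsigned long long iter = 0; ; ++iter) {     // run until killed
            std::uint64_t v = buf[idx];                   // likely cache/DRAM miss
            acc = _mm_mul_epu32(acc, _mm_set1_epi32(static_cast<int>(v) | 1));  // PMULUDQ between misses
            buf[idx] = v + 1;                             // dirty the line so it must be written back
            idx = (idx + stride) % n;
            if ((iter & 0xFFFFFFFULL) == 0)               // print rarely; also keeps 'acc' observable
                std::printf("iter %llu acc %d\n", iter, _mm_cvtsi128_si32(acc));
        }
    }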
The funny thing is that running the GPU may limit the amount of heat you can generate from misses to the L3 cache, since memory may be bogged down by the GPU. In that case, you should make sure that all accesses to the L3 are hits, but that you are still missing in the other levels.

For GeForce graphics, my CudaMFLOPS program (free) is quite handy for obtaining high temperatures on the graphics card. If you have an appropriate card, details are at:
http://www.roylongbottom.org.uk/cuda1.htm#anchor8
I find that my tests that execute SSE instructions with data from L1 cache generally produce the highest CPU temperatures.

For the CPU, use Prime95. It is lightweight and will load up all cores nicely. You aren't really going to get much heat out of a 3 GHz Xeon, though. Chips of that age are usually good for over 4 GHz with average cooling, and close to 5 GHz with high-end water loops. With a 6-core chip at over 4 GHz with extra voltage added you might be pushing 200 W, but with that system you will be lucky to get the CPU to 100 W.
As for the GPU, the Heaven benchmark is a good one for quickly getting it up to temperature. Again, unless you have a high-end card, a couple of hundred watts of heat is unlikely. Another alternative on AMD GPUs (maybe NVIDIA too?) is to use crypto-currency mining software; maybe get a USB stick with a mining Linux distribution installed and ready to go. You could also run Prime95 on the same rig, as mining software uses very little CPU time.
I actually kept a couple of rooms warm over winter with the heat from a computer, only rarely needing extra heating. This was done with a crypto-currency mining rig, which had four GPUs running at ~80 degrees C, 24/7, with a box fan to circulate the air around the room. That rig had a 1300 W PSU. Might I suggest that instead of trying to use the computer to keep you warm, you wear more clothes?

Related

How to reduce time taken for large calculations in MATLAB

When using the desktop PCs in my university (which have 4 GB of RAM), calculations in MATLAB are fairly speedy, but on my laptop (which also has 4 GB of RAM), the exact same calculations take ages. My laptop is much more modern, so I assume it has a similar clock speed to the desktops.
For example, I have written a program that calculates the solid angle subtended by 50 disks at 500 points. On the desktop PCs this calculation takes about 15 seconds; on my laptop it takes about 5 minutes.
Is there a way to reduce the time taken to perform these calculations? E.g., can I allocate more RAM to MATLAB, or can I boot up my PC in a way that optimises it for running MATLAB? I'm thinking that if the processor on my laptop is also doing calculations to run other programs, this will slow down the MATLAB calculations. I've closed all other applications, but I know there's probably a lot of stuff going on that I can't see. Can I boot my laptop up in a way that has fewer of these things going on in the background?
I can't modify the code to make it more efficient.
Thanks!
You might run some of my benchmarks which, along with example results, can be found via:
http://www.roylongbottom.org.uk/
The CPU core design at any particular point in time is essentially the same across Pentiums, Celerons, Core 2s, Xeons and the rest; the main differences are L2/L3 cache sizes and external memory bus speeds. So you can compare most results with those from similar-vintage 2 GHz CPUs. Things to try, besides simple number-crunching tests:
1 - Try a memory test, such as my BusSpeed, to show that the caches are being used and RAM is not dead slow (a minimal timing sketch follows this list).
2 - Assuming Windows, check in Task Manager that the offending program is the one using most of the CPU time, and that with the program not running, CPU utilisation is around zero.
3 - Check that the CPU temperature is not too high, for example with SpeedFan (a free download).
4 - If the disk light is flashing, too much RAM might be in use, with some being swapped in and out. Task Manager's Performance tab would show this. Increasing RAM demands can be checked by some of my reliability tests.
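A minimal sketch of the kind of check item 1 describes (my own illustration, not BusSpeed itself; the buffer sizes and repeat counts are assumptions to adjust per machine): time sequential sweeps over a small, cache-resident buffer and a large, RAM-resident one and compare the throughput.

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <vector>

    static double sweep_gbytes_per_sec(std::size_t bytes, int repeats) {
        std::vector<std::uint64_t> data(bytes / sizeof(std::uint64_t), 1);
        std::uint64_t sink = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            sink += std::accumulate(data.begin(), data.end(), std::uint64_t{0});
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::cout << "(checksum " << sink << ") ";   // printing the sum stops the compiler removing the loop
        return static_cast<double>(bytes) * repeats / secs / 1e9;
    }

    int main() {
        std::cout << "32 KB buffer (cache): "
                  << sweep_gbytes_per_sec(32 * 1024, 100000) << " GB/s\n";
        std::cout << "64 MB buffer (RAM):   "
                  << sweep_gbytes_per_sec(64 * 1024 * 1024, 50) << " GB/s\n";
    }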
There are many things that go into computing power besides RAM. You mention processor speed, but there is also number of cores, GPU capability and more. Programs like MATLAB are designed to take advantage of features like parallelism.
Summary: You can't compare only RAM between two machines and expect to know how they will perform with respect to one another.
Side note: 4 GB is not very much RAM for a modern laptop.
Firstly you should perform a CPU performance benchmark on both computers.
Modern operating systems usually apply the most aggressive power-management schemes when running on a laptop. This usually means turning off one or more cores, or setting them to a very low frequency. For example, a quad-core CPU that normally runs at 2.0 GHz could be throttled down to 700 MHz on one core while the other three are basically put to sleep, while on battery. (The numbers are illustrative, not taken from a real example.)
The OS manages the CPU frequency dynamically, tweaking it on the order of seconds, so you will need a monitoring tool that actually polls the CPU frequency every second (without doing busy work itself) to know whether this is happening.
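For illustration, a minimal version of such a monitor, assuming Linux and the standard cpufreq sysfs interface (on Windows, a tool like CPU-Z shows the same information):

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    int main() {
        // Standard cpufreq sysfs file for core 0 (present when the cpufreq driver is loaded).
        const std::string path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
        while (true) {
            std::ifstream f(path);
            long khz = 0;
            if (f >> khz)
                std::cout << "core 0: " << khz / 1000.0 << " MHz" << std::endl;
            else
                std::cout << "cpufreq interface not available" << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }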
Plugging in the laptop will make the OS use a less aggressive power management scheme.
(If this turns out to be unrelated to MATLAB, please "flag" this post and ask a moderator to move the question to the Super User site.)

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
"High-speed Thunderbolt I/O (input/output) technology delivers an amazing 10 gigabits per second of transfer speeds in both directions."
1.25 GB/s sounds nice, but most processors of the day are clocked around 2 GHz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N) [1] chunks per second. Although this hints at a rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P = processor frequency (cycles per second); N = algorithm overhead (cycles per chunk)
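To make that estimate concrete, a toy evaluation of P / (2 * N) with invented numbers (P, N and the chunk size in the final comment are all assumptions):

    #include <iostream>

    int main() {
        // Illustrative numbers only: a 2 GHz core and 500 cycles of per-chunk overhead.
        const double P = 2.0e9;          // processor frequency, cycles per second (assumed)
        const double N = 500.0;          // algorithm overhead, cycles per chunk (assumed)
        const double chunks_per_second = P / (2.0 * N);
        std::cout << chunks_per_second << " chunks per second\n";   // prints 2e+06 with these numbers
        // Converting this to bytes per second needs one more assumption, the chunk size:
        // 4 KB chunks would give roughly 8 GB/s with these numbers.
    }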
The hardware limiting factors are probably the I/O bus performance (PCIe, say) and, less so recently, FSB clock rates, since memory controllers are moving from the northbridge onto the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it takes to produce the output. These, at least for conventional software running on a CPU, depend on the processor clock, but not only on it. Writing your code to take advantage of hardware facilities like caches, instruction-level parallelism, etc. is still a black art, but it can give you an order-of-magnitude performance boost.
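A small illustration of that point (my own example, not tied to the question's workload): summing the same 64 MB matrix in an order that matches its memory layout versus one that fights it; on typical hardware the second loop is several times slower.

    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        const std::size_t n = 4096;                 // 4096 x 4096 ints = 64 MB, bigger than most caches
        std::vector<int> m(n * n, 1);

        auto time_sum = [&](bool row_major) {
            long long sum = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j)
                    sum += row_major ? m[i * n + j]   // walks memory contiguously
                                     : m[j * n + i];  // jumps 16 KB between accesses
            auto t1 = std::chrono::steady_clock::now();
            std::cout << (row_major ? "row-major:    " : "column-major: ")
                      << std::chrono::duration<double, std::milli>(t1 - t0).count()
                      << " ms (sum " << sum << ")\n";
        };

        time_sum(true);
        time_sum(false);
    }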
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Likely, hard-disk controllers will decide hard-disk I/O performance, graphics cards will decide maximum resolution and refresh performance, and so on. I don't really understand the question; the CPU has been getting less and less involved in these kinds of things (for the last 10 years, at least).
I doubt the question even has a bearing on CPUs with integrated GPUs, since the buffer to be output to the screen is in external memory, sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about what the minimum CPU model is that supports the kind of bus speeds required by the Thunderbolt chipset version in the machine in question.
Thunderbolt is a raw data-traffic system and its performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and, in general, give lag-free, intelligent data shuffling when doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data stays the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost (a lack of lag) because the data is there when it is needed. For I/O, the bottleneck becomes the read/write speed of your hard disk instead, and the CPU bottleneck (for file copy operations, likely at least some code in the Finder) stays the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.

What is the relation between CPU utilization and energy consumption?

What is the function that describes the relation between CPU utilization and energy consumption (electricity/heat-wise)?
I wonder if it's linear, sub-linear, exponential, etc.
I am writing a program that decreases the CPU utilization/load of other programs, and my main concern is how much I benefit energy-wise.
Moreover, my server is mostly being used as a web server or a DB in a data center (headless).
If the data center needs more power for cooling, I need to take that into account as well.
I also need to know what effect CPU utilization has on the power consumption of the entire machine.
Here you can find a short PPT answering your questions and providing additional info.
Although there is no copyright notice in the PPT, the work is probably protected, so I will copy here only the three graphs relevant to your main question and the follow-ups in the comments.
HTH!
For the CPU alone, linear would be the most likely.
It gets complicated with CPUs that can reduce their clock speed under low load (like laptops), but for a server it's probably a good approximation.
Remember, though, that the CPU isn't the only component: you have to scale by the share of total system power that the CPU accounts for.
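A hedged sketch of that linear approximation, with the rest of the system's draw held constant (all wattages below are invented examples, not measurements of any particular server):

    #include <initializer_list>
    #include <iostream>

    // Estimated whole-system draw as a function of CPU utilisation (0.0 .. 1.0),
    // using the linear approximation described above.
    double estimated_power_watts(double cpu_util) {
        const double idle_system_watts = 120.0;  // whole box at 0% CPU (assumed)
        const double cpu_idle_watts    = 20.0;   // CPU's share of that idle figure (assumed)
        const double cpu_max_watts     = 95.0;   // CPU at full load, roughly its TDP (assumed)
        return idle_system_watts + (cpu_max_watts - cpu_idle_watts) * cpu_util;
    }

    int main() {
        for (double u : {0.0, 0.25, 0.5, 0.75, 1.0})
            std::cout << "utilisation " << u * 100 << "% -> ~"
                      << estimated_power_watts(u) << " W\n";
    }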

At what rate are the number of cores per CPU increasing?

I'm designing a system that will come online in 2016 and run on commodity 1U or 2U server boxes. I'd like to understand how parallel the software will need to be, so I'd like to estimate the number of cores per physical machine. I'm not interested in more exotic hardware like video-game console processors, GPUs or DSPs. I could extrapolate based on when chips were released by Intel or AMD, but this historical information seems scarce.
Thanks.
I found the following charts from Design for Manycore Systems:
As the great computer scientist Yogi Berra said, "It's tough to make predictions, especially about the future." Given the relative recency of multicore systems, I think you're right to be wary of extrapolations. Still, you need a number to aim for.
M. Spinelli's graphs are very valuable, and (I think) have the benefit of being based on real plans out to 2014. Other than that, if you want a simple, easily calculable and defensible number, I'd take as a starting point the number of cores in current (say) 2U systems at your price point (high-range systems: 24-32 cores at $15k; mid-range: 12-16 cores at $8k; lower-end: 8-12 cores at $5k). Then note that Moore's law suggests 8-16x as many transistors per unit of silicon in 2016 as now, and that on current trends these mainly go into more cores. That suggests 64-512 cores per node, depending on how much you spend on each -- and these numbers are consistent with the graphs Matt Spinelli posted above.
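The arithmetic behind that 64-512 range can be spelled out in a few lines (the starting core counts and doubling periods are the assumptions stated above, nothing new):

    #include <cmath>
    #include <initializer_list>
    #include <iostream>

    int main() {
        const double years = 6.0;                     // roughly now until 2016
        const int cores_today[] = {8, 16, 32};        // low / mid / high-end 2U boxes (from the answer)
        for (double doubling_period : {1.5, 2.0}) {   // Moore's-law doubling every 18-24 months (assumed)
            const double factor = std::pow(2.0, years / doubling_period);
            std::cout << "doubling every " << doubling_period << " years (x" << factor << "): ";
            for (int c : cores_today)
                std::cout << c * factor << " ";
            std::cout << "cores per node\n";
        }
    }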
Cores per physical machine doesn't seem to be a particularly good metric, I think. We haven't really seen that number grow in particularly non-linear ways, and many-core hardware has been available COTS since the 90's (though it was relatively specialized at that point). If your task is really that parallel, quadrupling the number of cores shouldn't change it that much. We've always had the option of faster-but-fewer-cores, which should still be available to you in 6 years if you find that you don't scale well with the current number of cores.
If your application is really embarrassingly parallel, why are you unwilling to consider GPU solutions?
How quickly do you plan to rotate the hardware? Leave old machines till they die, or replace them proactively as they start to slow the cluster down? How many machines are we talking about? What kind of interconnect technology are you considering? For many cluster applications that is the limiting factor.
The Dr. Dobb's article above is not a bad analysis, but I think it misses the point just a tad. It's going to be a significant while before many mainstream apps can take advantage of really parallel general compute hardware (and many tasks simply can't be parallelized much), and when they do, they'll be using graphics cards and (to a lesser extent) sound cards as the specialized hardware to do it.

Easiest way to determine compilation performance hardware bottleneck on single PC?

I've now saved a bit of money for a hardware upgrade. What I'd like to know is: what is the easiest way to measure which piece of hardware is the bottleneck for compilation and should be upgraded?
Are there any clever techniques I could use? I've looked into perfmon, but it has too many counters and isn't very helpful without exact knowledge of what to look at.
Conditions: Home development, Windows XP Pro, Visual Studio 2008
Thanks!
The question is really "what is maxed out during compilation?"
If you don't want to use perfmon, you can use something like Task Manager (or the small CPU-usage poller sketched after this list).
Run a compile.
See what's maxed out.
Did you go to 100% CPU for the whole time? Get more CPU -- faster or more cores or something.
Did you go to 100% memory use for the whole time? Which number matters on the display? The only memory you can buy is "physical" memory, so that's the only figure that matters; the other things you see on the meter aren't things you can buy, they're adjustments to the way Windows works.
Did you see "huge" amounts of I/O? You can't easily tell what counts as "huge", but you can reason as follows: if you're not maxing out memory and not maxing out the CPU, then you're using the only resource that's left -- you're I/O-bound and need a faster bus, which usually means a whole new machine.
A faster HDD is of little or no value -- the bus clock speed is one limiting factor. The bus width is the other limiting factor. No one designs an ass-kicking I/O bus and then saddles it with junk HDDs. Usually, they design the bus to fit a specific cost target based on available HDDs.
"A faster HDD is of little or no value -- the bus clock speed is one limiting factor. The bus width is the other limiting factor. No one designs an ass-kicking I/O bus and then saddles it with junk HDDs. Usually, they design the bus to fit a specific cost target based on available HDDs."
Garbage. Modern HDDs are slow compared to the I/O buses they are connected to. Name a single HDD that can max out a SATA 2 interface (and that is already a generation old now) for random IOPS... A hard drive is lucky to hit 10 MB/s of random I/O when the bus is capable of around 280 MB/s.
E.g. http://www.anandtech.com/show/2948/3. Even there the SSDs are only hitting 50 MB/s. It's clear the interface is NOT the bottleneck; otherwise the HDD would do just as much as the SSDs.
I've never seen a computer bus-bound rather than HDD-bound. It doesn't happen.
Using Task Manager has already been suggested, but the Sysinternals Process Explorer gives you more information than the built-in Windows Task Manager:
Sysinternals Process Explorer
You might also want to see what other things are running on your PC that are using up memory and/or CPU processing power. It may be possible to remove, or only run on demand, things which are affecting performance.
Windows XP will only give applications more than 2 GB of address space (up to 3 GB) if you turn on the /3GB boot switch, and I seem to remember that applications need to be built as large-address-aware to actually take advantage of it.
