After learning a little about how computer programs run, I had some thoughts concerning the CPU and RAM. The YouTube videos I have watched (Linus Tech Tips and others) all seem to show that increasing RAM speed (frequency) does not give much of a performance improvement in real-world applications and games on a general desktop computer. My first question is: why is this? Is it because of the high hit rates (95% and above) of the cache on most modern CPUs, which in turn would lead to less and less need for the CPU to reach out to RAM? Also, in which situations would faster RAM frequency be beneficial?
NOTE: this is a very broad question, and the answers can vary widely depending on the architecture and OS of the system. I am answering from a best-judgement standpoint on how these things generally work.
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the RAM of the computer matters less than the clock speed of the CPU cache. Because:
the CPU gets its instructions from the cache, not straight from RAM
with the larger cache sizes of modern CPUs, it is necessary to go out to RAM less often.
When the cache does need to go out to RAM, the request is handled asynchronously by the memory controller, and the CPU can keep working on other instructions or another hardware thread in the meantime, which hides part of the wait.
Besides that, the clock speed of the motherboard's various pipelines (DMA) could create a chokepoint that limits the overall transfer rate.
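To make the hit-rate argument a bit more concrete, here is a minimal sketch of the usual average-access-time estimate for a single cache level in front of RAM. The latencies are made-up illustrative values, not measurements, and real CPUs have several cache levels, so treat this only as a way to see the trend: the higher the hit rate, the less a change in RAM latency shows up in the average.

```python
def average_access_ns(hit_rate, cache_ns, ram_ns):
    """Average memory access time for one cache level in front of RAM."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * ram_ns

# Assumed, illustrative latencies (not measured on any real system).
CACHE_NS = 1.0
SLOW_RAM_NS = 80.0
FAST_RAM_NS = 70.0

for hit_rate in (0.90, 0.95, 0.99):
    slow = average_access_ns(hit_rate, CACHE_NS, SLOW_RAM_NS)
    fast = average_access_ns(hit_rate, CACHE_NS, FAST_RAM_NS)
    gain = (slow - fast) / slow
    print(f"hit rate {hit_rate:.0%}: {slow:.2f} ns vs {fast:.2f} ns "
          f"({gain:.1%} less time per access)")
```

Memory access is also only one part of a program's run time, so the effect on a whole application would be smaller again.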
In which situations would faster RAM frequency be beneficial?
I would say that, overall, any one of the core pieces of hardware involved in storing and moving memory (the CPU, the CPU cache, the various memory pipelines, the various memory transfer devices such as DMA, and the RAM itself) can become a chokepoint, so faster RAM might or might not affect overall performance. It is really a case-by-case issue.
Assuming someone is doing some big computations (I know that's relative, and I'm not going to specify the nature of the operation, just to keep the question open; it may be sorting data, searching for elements, calculating the prime factors of a really long number...) using badly designed, brute-force algorithms or just an iterative process to get the results, can this approach have any bad effects on the CPU or the RAM over a long period of time?
Intensive processing will increase the heat generated by the CPU (or GPU) and even the RAM (to a much smaller degree).
Recent CPU chips have the ability to slow themselves down once the heat exceeds certain thresholds to prevent damage to the CPU. That would typically indicate a failure in the cooling system though.
I do not believe there are many other issues besides electricity consumption and the risk of overheating.
I would like to upload two images to the GPU memory, and I'm interested how fast I can do this?
In fact - will it be faster to compare two bitmaps in RAM with CPU, or upload them to GPU and use GPU parallelism to do it?
If you run the CUDA device bandwidth sample, you'll get a benchmark for the upload speed.
Assuming DDR3 tri-channel 1600MHz RAM, you'll get something like 38 GB/s memory bandwidth.
Take a typical midrange card like a GTX 460 and you'll get something like 84 GB/s memory bandwidth. Note that you'll have to make a hop across the bus, which is something like 8 GB/s theoretical and ~5.5 GB/s in practice for a PCI-E 2.0 x16 link.
Note that kotlinski's answer isn't quite correct. You can do the comparisons in parallel and then do a parallel reduction, in which case the bigger GPU device bandwidth can win out eventually.
I think the answer is likely to be: a loss to upload to GPU and do comparison once. Possible gain if comparison is made multiple times (kept and modified on the GPU, for example).
Edit:
The multiple-comparison case refers to modifying the images in GPU memory in situ. That would merit another comparison (caching the previous result doesn't cut it) without incurring the penalty of another copy across the bus.
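As a rough sanity check of the numbers above, here is a small bandwidth-only sketch. It assumes that a comparison costs nothing but reading both images once, and it ignores kernel launch overhead, latency and the cost of copying the result back, so it is a back-of-the-envelope model rather than a benchmark. The function names and the image size are my own; the 38, 84 and 5.5 GB/s figures are the ones quoted in this answer.

```python
def transfer_seconds(num_bytes, gb_per_s):
    """Time to move num_bytes at a sustained rate of gb_per_s (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9)

def cpu_compare_seconds(image_bytes, comparisons=1, ram_gbps=38.0):
    # Each comparison reads both images once from system RAM.
    return comparisons * transfer_seconds(2 * image_bytes, ram_gbps)

def gpu_compare_seconds(image_bytes, comparisons=1, pcie_gbps=5.5, gpu_gbps=84.0):
    # Both images cross the PCI-E bus once, then every comparison reads them
    # from GPU memory.
    upload = transfer_seconds(2 * image_bytes, pcie_gbps)
    return upload + comparisons * transfer_seconds(2 * image_bytes, gpu_gbps)

image_bytes = 1920 * 1080 * 4  # hypothetical 32-bit full-HD bitmap
for n in (1, 10, 100):
    cpu_ms = cpu_compare_seconds(image_bytes, n) * 1e3
    gpu_ms = gpu_compare_seconds(image_bytes, n) * 1e3
    print(f"{n:3d} comparisons: CPU {cpu_ms:.2f} ms, GPU {gpu_ms:.2f} ms")
```

With these figures the upload dominates a single comparison, and in this simplified model the GPU only pulls ahead after roughly 13 comparisons on data that stays on the card.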
Since memory access is the bottleneck here, it is extremely likely that it is faster to just do it on the CPU. Making it run in parallel is not likely to give you anything, since memory access is essentially a serial operation.
The answer to this question is highly debatable and depends entirely on your system's configuration. This means that you'll have to do the benchmarks yourself. Factors that could influence your situation:
Speed of your RAM
Speed of the GPU Bus
Whether or not you have shared RAM between GPU & CPU
However, I do think that in the general case (e.g. with bus speeds on the order of GB/s) it's faster to upload the images to the GPU and do the difference comparison there.
I have been asked to measure how "efficiently" my code uses the GPU, i.e. what % of peak performance my algorithms are achieving. I am not sure how to do this comparison. Until now I have basically put timers in my code and measured the execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler but couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read and write and dividing by the run time (I use CUDA events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
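The arithmetic for the memory-bound case is simple enough to sketch. The function names below are my own, and the kernel in the example (a copy of 256 MiB finishing in 4 ms) is hypothetical; only the 177.4 GB/s peak for the GTX 480 comes from this answer. The elapsed time is whatever your timer (e.g. CUDA events) reports for the kernel.

```python
def effective_bandwidth_gbps(bytes_read, bytes_written, elapsed_s):
    """Bytes moved by the kernel divided by its run time, in GB/s (1 GB = 1e9 bytes)."""
    return (bytes_read + bytes_written) / elapsed_s / 1e9

def bandwidth_efficiency(bytes_read, bytes_written, elapsed_s, peak_gbps):
    """Fraction of the card's peak memory bandwidth that the kernel achieved."""
    return effective_bandwidth_gbps(bytes_read, bytes_written, elapsed_s) / peak_gbps

# Hypothetical kernel: copies 256 MiB (reads it once, writes it once) in 4 ms.
n_bytes = 256 * 1024 * 1024
achieved = effective_bandwidth_gbps(n_bytes, n_bytes, 0.004)
fraction = bandwidth_efficiency(n_bytes, n_bytes, 0.004, peak_gbps=177.4)
print(f"{achieved:.1f} GB/s, {fraction:.0%} of peak")
```

Count only the bytes the algorithm actually needs to move; if the kernel re-reads data because of poor access patterns, that shows up as a lower efficiency, which is the point of the measurement.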
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally, "efficiently" would probably be a measure of how much memory and how many GPU cycles (average, min, max) your program is using. The efficiency measure would then be avg(memory used)/total memory for the time period, and likewise avg(GPU cycles)/max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU intensive programs of your choice. That'd be how I'd do it but I've never thought to try so good luck!
As for bottlenecks and "optimal" performance, these are probably NP-complete problems that no one can help you with. Get out the old profiler and debuggers and start working your way through your code.
I can't help with the profiler and micro-optimisation, but there is a CUDA occupancy calculator, http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls , which tries to estimate how well your CUDA code uses the hardware resources, based on these values:
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
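To give an idea of what such a calculator does with those three inputs, here is a deliberately simplified sketch. It is not NVIDIA's spreadsheet logic: it ignores register and shared-memory allocation granularity and warp rounding, and the per-multiprocessor limits are default arguments that I assume for a Fermi-class (compute capability 2.0) part, so check them against your own card.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       max_threads_per_sm=1536, max_blocks_per_sm=8,
                       regs_per_sm=32768, smem_per_sm=49152):
    """Rough fraction of a multiprocessor's thread slots that can be resident.

    The hardware limits are assumptions passed in as defaults; the real
    calculator also accounts for allocation granularity, which this skips.
    """
    # Each resource caps how many blocks fit on one multiprocessor at a time.
    limits = [max_blocks_per_sm, max_threads_per_sm // threads_per_block]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    resident_blocks = min(limits)
    return resident_blocks * threads_per_block / max_threads_per_sm

print(estimate_occupancy(256, 20, 4096))  # limited by thread slots and registers
print(estimate_occupancy(256, 32, 4096))  # registers become the limiting resource
```

The useful part is seeing which resource produces the smallest block count, since that is the one to reduce if you want more threads resident.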
On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working with where a program serially generates and dumps around 1-2GB of text logging data to a raw text file on the hard disk and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.
Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it, time a huge fwrite or something.) Let's say 100 MB/s. Now take your CPU speed in megahertz; let's say 3 GHz = 3000 MHz. Divide the CPU speed by the disk write speed. That's the number of cycles the CPU spends idle per byte written, which you can spend on compression instead. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could compress your data by 20% (output at 80% of the input size) for an effective 125 MB/s write speed, you would have 24 cycles per byte to run it in, and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
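The budget calculation above is easy to redo for your own machine. This is just the same division written out; the function name is mine, and the clock and disk figures are the example values from this answer, so substitute what you actually measure.

```python
def cycles_per_byte(cpu_hz, bytes_per_second):
    """CPU cycles that elapse while one byte is produced at the given rate."""
    return cpu_hz / bytes_per_second

CPU_HZ = 3_000_000_000          # 3 GHz, example value from above
RAW_DISK_BPS = 100_000_000      # 100 MB/s raw write speed
EFFECTIVE_BPS = 125_000_000     # 125 MB/s of input absorbed once it is compressed

print(cycles_per_byte(CPU_HZ, RAW_DISK_BPS))         # budget at the raw disk speed
budget = cycles_per_byte(CPU_HZ, EFFECTIVE_BPS)      # budget at the effective speed
print(budget, "cycles per byte,", budget * 128, "per 128-byte cache line")
```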
We do this all the time when reading optical media.
If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
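Here is a minimal sketch of that hand-off, assuming Python with its standard queue, threading and zlib modules. A bounded queue plays the role of the ring of buffers: the producer only blocks when every slot is full. The function names, the ring size and the compression level are my own choices, and as far as I know CPython's zlib releases the interpreter lock while compressing, which is what lets the second core do useful work here.

```python
import queue
import threading
import zlib

def compress_writer(buffers, path, level=1):
    """Consumer: compress each buffer as it arrives and write it to `path`."""
    compressor = zlib.compressobj(level)
    with open(path, "wb") as out:
        while True:
            chunk = buffers.get()
            if chunk is None:               # sentinel: the producer is finished
                break
            out.write(compressor.compress(chunk))
        out.write(compressor.flush())       # emit whatever zlib still holds

def run_logger(lines, path, ring_size=8):
    """Producer: hand log lines to the consumer thread through a bounded queue."""
    buffers = queue.Queue(maxsize=ring_size)    # bounded, so it acts as the ring
    worker = threading.Thread(target=compress_writer, args=(buffers, path))
    worker.start()
    for line in lines:
        buffers.put(line.encode("utf-8"))       # blocks only when the ring is full
    buffers.put(None)
    worker.join()
```

In a real logger you would batch many lines into each buffer rather than queueing them one at a time, since the per-item overhead of the queue would otherwise dominate.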
Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression on lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.
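The popen(3) trick translates fairly directly to other languages. Below is a rough Python equivalent using subprocess, assuming a gzip executable is available on the PATH; the function name and the choice of level 1 are mine. The compression runs in a separate process, so the operating system is free to schedule it on the otherwise idle core.

```python
import subprocess

def write_log_through_gzip(lines, path, level=1):
    """Pipe log lines into an external gzip process that writes to `path`."""
    with open(path, "wb") as out:
        proc = subprocess.Popen(["gzip", f"-{level}", "-c"],
                                stdin=subprocess.PIPE, stdout=out)
        try:
            for line in lines:
                proc.stdin.write(line.encode("utf-8"))
        finally:
            proc.stdin.close()      # end of input lets gzip finish and exit
            proc.wait()
    return proc.returncode
```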
For what it is worth, Sun's ZFS filesystem can enable on-the-fly compression to decrease the amount of disk IO without a significant increase in overhead, as an example of this in practice.
The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on
the compression algorithm and settings used,
the file workload and
the characteristics of your machine.
For example, in the measurements from the paper, with a textual workload in a server environment, using lzop with low compression effort is faster than a plain write, but bzip and gz aren't.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.
CPU speed has grown at a faster rate than hard drive access speed. Even back in the 1980s, many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That will not have changed.
Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.
As for the usefulness of a second core, it only counts if the system will also be doing a significant number of other things, and your program would have to be multi-threaded to take advantage of the additional CPU.
Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
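As a small illustration of the size difference, here is a sketch using Python's struct module. The record layout (a timestamp, a sensor id and a reading) is entirely hypothetical; the point is only that a fixed binary record can be noticeably smaller than its formatted text and skips the number-to-text conversion.

```python
import struct

# Hypothetical record: float64 timestamp, uint32 sensor id, float32 reading.
RECORD = struct.Struct("<dIf")

def as_text(timestamp, sensor_id, value):
    """The record as a human-readable log line."""
    return f"{timestamp:.6f} {sensor_id} {value:.4f}\n".encode("ascii")

def as_binary(timestamp, sensor_id, value):
    """The same record packed into a fixed 16-byte binary layout."""
    return RECORD.pack(timestamp, sensor_id, value)

sample = (1700000000.123456, 42, 3.1416)
print(len(as_text(*sample)), "bytes as text,", len(as_binary(*sample)), "bytes as binary")
```

Note that the float32 field loses precision compared with the text form, so pick field widths to match what you actually need to read back.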
Windows already supports File Compression in NTFS, so all you have to do is to set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.
This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates in the 10's of MBytes/second, this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely need easily compressible data, and you would just have to benchmark it with some sanity-check experiments and find out.
As for a specific opinion (guess!?) on the point about additional cores: if you thread up the compression of the data and keep the core(s) fed, then with the high compression ratio of text it is likely such a technique would bear some fruit. But this is just a guess. In a single-threaded application alternating between disk writes and compression operations, it seems much less likely to me.
If it's just text, then compression could definitely help. Just choose a compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2", and both have parameters that you can tweak to favor speed or compression ratio.
If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core, and a relatively fast 100 MB/s streaming hard drive,
halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data.
With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there are a few cycles available (not many) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.)
then you have even more time for a more sophisticated compressor to ponder the data.
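The reasoning above can be written as a tiny model, assuming the compressor runs on its own core and fully overlaps with the disk writes, so the total time is whichever of the two stages is slower. The function name and the 1 GB log size are my own; the 2 GHz, 100 MB/s, 2:1 and 10 cycles-per-byte figures are the ones used in this answer.

```python
def logging_time_s(total_bytes, cpu_hz, cycles_per_byte, disk_bytes_per_s, ratio):
    """Time to log total_bytes when compression overlaps with disk writes.

    ratio is the compression ratio (2.0 means 2:1); pass ratio=1.0 and
    cycles_per_byte=0 to model a plain, uncompressed write.
    """
    compress_s = total_bytes * cycles_per_byte / cpu_hz
    write_s = (total_bytes / ratio) / disk_bytes_per_s
    return max(compress_s, write_s)     # the slower stage sets the pace

LOG_BYTES = 1_000_000_000               # hypothetical 1 GB of log text
plain = logging_time_s(LOG_BYTES, 2e9, 0, 100e6, 1.0)
packed = logging_time_s(LOG_BYTES, 2e9, 10, 100e6, 2.0)
slow_disk = logging_time_s(LOG_BYTES, 2e9, 10, 50e6, 2.0)
print(plain, packed, slow_disk)
```

With a slower disk the write stage dominates again, which is the situation where a more expensive compressor with a better ratio starts to pay for itself.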
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.
If this particular machine is often IO bound,
another way to speed it up is to install a RAID array.
That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with the same 4 total disks, gives a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).
This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible.
However, as Shog9 commented:
"Rules of thumb aren't going to help you here. It's your disk, your CPU, and your data. Set up a test case and measure throughput and CPU load with and without compression - see if it's worth the tradeoff."
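A test case along those lines does not need to be elaborate. Here is a minimal sketch in Python that writes the same data with and without gzip compression and reports wall-clock time, CPU time and bytes on disk. The function name, the sample data and the file names are placeholders of mine; for a meaningful result you would feed it a real chunk of your own log output and write to the actual target disk.

```python
import gzip
import os
import tempfile
import time

def timed_write(path, data, compress_level=None):
    """Write data to path (optionally gzip-compressed) and time it.

    Returns (wall_seconds, cpu_seconds, bytes_on_disk).
    """
    wall0, cpu0 = time.perf_counter(), time.process_time()
    with open(path, "wb") as raw:
        if compress_level is None:
            raw.write(data)
        else:
            with gzip.GzipFile(fileobj=raw, mode="wb",
                               compresslevel=compress_level) as gz:
                gz.write(data)
        raw.flush()
        os.fsync(raw.fileno())      # wait until the data has really hit the disk
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return wall, cpu, os.path.getsize(path)

if __name__ == "__main__":
    # Placeholder data; replace with a sample of your real log output.
    sample = b"2024-01-01 12:00:00 INFO request handled in 12 ms\n" * 200_000
    with tempfile.TemporaryDirectory() as tmp:
        for label, level in (("plain", None), ("gzip -1", 1), ("gzip -6", 6)):
            wall, cpu, size = timed_write(os.path.join(tmp, "out.log"), sample, level)
            print(f"{label:8s} wall {wall:.3f}s  cpu {cpu:.3f}s  {size} bytes")
```

Repetitive placeholder text like this compresses far better than real logs, so the numbers it prints say more about the harness than about your workload.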