How much computing power does it take to decompress MP3? - performance

Assuming you are doing it on the CPU, not using dedicated hardware, how much computing power, in megaflops, does it take to decompress a moderate quality MP3 stream?

Here is a link to a paper explaining some optimization details for an MP3 decoder application for a non-pc device:
http://www.analog.com/media/en/technical-documentation/application-notes/EE-255.pdf
Although it targets a DSP rather than a PC, the paper sheds some light on the issue by giving measured instruction counts: the original implementation of the algorithm takes 330 MIPS, and after some optimizations that number drops to 110 MIPS. So it is a very modest requirement, a small fraction of what today's CPUs can deliver.
Your mileage may vary depending on the instruction set you use and the requirements of your application. Also keep in mind that the implementation makes a big difference: as the paper shows, missing a couple of optimizations can slow decoding down by a factor of three.
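To put those numbers in perspective, here is a trivial back-of-the-envelope comparison in Python; the 3 GHz, roughly-one-instruction-per-cycle CPU is an assumption of mine, not something from the paper:

    # MIPS figures quoted in the EE-255 application note
    unoptimized_mips = 330
    optimized_mips = 110

    # assume a generic 3 GHz CPU retiring very roughly one instruction per cycle
    cpu_mips = 3000

    for mips in (unoptimized_mips, optimized_mips):
        print(f"{mips} MIPS is about {100 * mips / cpu_mips:.0f}% of one such core")
    # -> 330 MIPS is about 11% of one such core
    # -> 110 MIPS is about 4% of one such core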

Related

Choosing minibatch size for deep learning

In his blog post A Brief Overview of Deep Learning, Ilya Sutskever describes how important it is to choose the right minibatch size to train a deep neural network efficiently. He gives the advice "use the smaller minibatch that runs efficiently on your machine". See the full quote below.
I've seen similar statements by other well-known deep learning researchers, but it is still unclear to me how to find the correct minibatch size. Seeing as a greater minibatch can allow for a greater learning rate, it seems to require a lot of experiments to determine whether a certain minibatch size yields better performance in terms of training speed.
I have a GPU with 4 GB of RAM and use the libraries Caffe and Keras. What, in this case, is a practical heuristic for choosing a good minibatch size, given that each observation has a certain memory footprint M?
Minibatches: Use minibatches. Modern computers cannot be efficient if you process one training case at a time. It is vastly more efficient to train the network on minibatches of 128 examples, because doing so will result in massively greater throughput. It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches. But don't use very large minibatches because they tend to work less well and overfit more. So the practical recommendation is: use the smaller minibatch that runs efficiently on your machine.
When we train a network and compute a forward pass, we have to keep all the intermediate activation outputs for the backward pass. You simply need to compute how much memory it will cost to store all the relevant activation outputs in your forward pass, in addition to the other memory constraints (storing your weights on the GPU, etc.). So note that if your net is quite deep, you might want to use a smaller batch size, as you may not have enough memory.
Selecting a minibatch size is a mixture of memory constraints and performance/accuracy (usually evaluated using cross validation).
I personally guess-timate/compute by hand how much GPU memory my forward/backward pass will use and try out a few values. If, for example, the largest batch I can fit is roughly 128, I may cross-validate using 32, 64, 96, etc., just to be thorough and see if I can get better performance. This is usually for a deeper net which is going to push my GPU memory (I also only have a 4 GB card and don't have access to the monster NVIDIA cards).
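To make that guess-timate concrete, here is a rough sketch of the kind of arithmetic involved; the layer sizes, 4-byte floats, and the factor of 2 for keeping gradients alongside activations are all assumptions, and real frameworks add their own overhead:

    def max_batch_size(activation_counts, n_params, gpu_bytes,
                       bytes_per_value=4, backward_factor=2.0):
        """Crude upper bound on batch size from memory alone.

        activation_counts: number of activation values per example, per layer
        n_params: total weight count (stored once, plus once more for gradients)
        """
        weights_bytes = 2 * n_params * bytes_per_value
        per_example = backward_factor * sum(activation_counts) * bytes_per_value
        return int((gpu_bytes - weights_bytes) // per_example)

    # e.g. a small convnet on 64x64 images (made-up layer sizes)
    acts = [64*64*32, 32*32*64, 16*16*128, 1024, 10]
    print(max_batch_size(acts, n_params=5_000_000, gpu_bytes=4 * 1024**3))

You would then cross-validate a few sizes well below that bound, as described above.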
I think there tends to be a greater emphasis on network architecture, optimization techniques/tricks of the trade, and data pre-processing than on the exact minibatch size.

Is there a constant-time algorithm for generating a bandlimited sawtooth?

I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample. This puts some interesting restrictions on what algorithms can be used - any algorithm that refers to a previous set of samples cannot be implemented in this fashion.
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
This makes synthesizing bandlimited waveforms difficult. One approach is additive synthesis of the partials using the Fourier series. However, this runs in O(n) time, and is especially slow on a GPU, to the point that the gain from parallelism is lost. If there were an algorithm that ran in O(1) time, this would eliminate branching AND be up to 1000x faster when dealing with the audible range.
I'm specifically looking for something like a DSF (discrete summation formula) for a sawtooth. I've been trying to work out a simplification of the Fourier series by hand, but that's really, really hard, mainly because it involves harmonic numbers, AKA the only singularity of the Riemann zeta function.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
That's not right. IIR filters do need previous results, but FIR filters only need previous input; that is pretty typical of the things GPUs were designed to do, so it's not likely a problem to let every processing core access, say, 64 input samples to produce one output sample -- in fact, the cache architectures that Nvidia and AMD use lend themselves to that.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
It is! In two respects:
- as mentioned above, FIR filters only need multiple samples of immutable input, so they can be parallelized heavily without problems (see the sketch below), and
- even if you need to calculate your input first and would like to parallelize that (I don't see a reason for that -- generating a sawtooth is not CPU-limited but memory-bandwidth-limited), every core could simply calculate the last N samples -- sure, that's N-1 redundant operations per core, but as long as your number of cores is much bigger than N, you will still be faster, and every core will have a constant run time.
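A minimal sketch of that first point, with NumPy on the CPU standing in for per-thread GPU work; the 64-tap moving-average filter is an arbitrary assumption, just to show that each output sample reads only immutable input:

    import numpy as np

    def fir_output_sample(x, taps, n):
        """Compute output sample n of a FIR filter from immutable input x.

        Each call only reads x[n - len(taps) + 1 : n + 1]; it never needs a
        previously computed *output*, so every n could be handled by its own
        GPU thread (here emulated sequentially on the CPU).
        """
        k = len(taps)
        window = x[n - k + 1 : n + 1]          # the last k input samples
        return float(np.dot(window, taps[::-1]))

    # toy example: 64-tap moving average over a noisy ramp
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 4096) + 0.01 * rng.standard_normal(4096)
    taps = np.ones(64) / 64.0

    y = [fir_output_sample(x, taps, n) for n in range(64, len(x))]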
Comments on your approach:
I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample.
From a higher-up perspective, that sounds too fine-granular. I mean, let's say you have 3000 stream processors (high-end consumer GPU). Assuming you have a sampling rate of 44.1kHz, and assuming each of these processors does only one sample, letting them all run once only gives you 1/14.7 of a second of audio (mono). Then you'd have to move on to the next part of audio.
In other words: there are bound to be many more samples than processors. In these situations, it's typically far more efficient to let one processor handle a sequence of samples; for example, if you want to generate 30 s of audio, that'd be 1.323 MSamples. Simply splitting the problem into 3000 chunks, one for each processor, and giving each of them the 44100*30/3000 = 441 samples it should process, plus 64 samples of "history" before the first of its "own" samples, will still easily fit into local memory.
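A rough sketch of that partitioning in plain Python, reusing the numbers from the paragraph above (3000 processors, 441 samples per chunk, 64 samples of history); it only does the bookkeeping and is not a real GPU kernel launch:

    SAMPLE_RATE = 44100
    DURATION_S = 30
    N_CORES = 3000
    HISTORY = 64                      # extra samples each chunk needs for filtering

    total = SAMPLE_RATE * DURATION_S  # 1,323,000 samples
    per_core = total // N_CORES       # 441 "own" output samples per core

    chunks = []
    for core in range(N_CORES):
        start = core * per_core
        # each chunk carries HISTORY earlier samples so a FIR filter can
        # compute its first "own" output without touching other chunks
        chunks.append((max(0, start - HISTORY), start + per_core))

    print(per_core, chunks[0], chunks[1])   # 441 (0, 441) (377, 882)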
Yet another thought:
I'm coming from a software defined radio background, where there's usually millions of samples per second, rather than a few kHz of sampling rate, in real time (i.e. processing speed > sampling rate). Still, doing computation on the GPU only pays for the more CPU-intense tasks, because there's significant overhead in exchanging data with the GPU, and CPUs nowadays are blazingly fast. So, for your relatively simple problem, it might never work faster to do things on the GPU compared to optimizing them on the CPU; things of course look different if you've got to process lots of samples, or a lot of streams, at once. For finer-granular tasks, the problem of filling a buffer, moving it to the GPU, and getting the result buffer back into your software usually kills the advantage.
Hence, I'd like to challenge you: Download the GNU Radio live DVD, burn it to a DVD or write it to a USB stick (you might as well run it in a VM, but that of course reduces performance if you don't know how to optimize your virtualizer; really - try it from a live medium), run
volk_profile
to let the VOLK library test which algorithms work best on your specific machine, and then launch
gnuradio-companion
Then open and run the following two signal processing flow graphs:
"classical FIR": This single-threaded implementation of the FIR filter yields about 50MSamples/s on my CPU.
FIR Filter implemented with the FFT, running on 4 threads: This implementation reaches 160MSamples/s (!!) on my CPU alone.
Sure, with the help of FFTs on my GPU, I could be faster, but the thing here is: Even with the "simple" FIR filter, I can, with a single CPU core, get 50 Megasamples out of my machine -- meaning that, with a 44.1kHz audio sampling rate, per single second I can process roughly 19 minutes of audio. No copying in and out of host RAM. No GPU cooler spinning up. It might really not be worth optimizing further. And if you optimize and take the FFT-Filter approach: 160MS/s means roughly one hour of audio per processing second, including sawtooth generation.

Parallelizable JPEG-like compression using only DCT and run-length encoding stages: what sort of compression/performance is possible?

We have to compress a ton o' (monochrome) image data and move it quickly. If one were to use just the parallelizable stages of JPEG compression (the DCT and run-length encoding of the quantized results) and run it on a GPU so each block is compressed in parallel, I am hoping that would be very fast and still yield a very significant compression factor, like full JPEG does.
Does anyone with more GPU / image compression experience have any idea how this would compare, both compression- and performance-wise, to using libjpeg on a CPU? (If it is a stupid idea, feel free to say so - I am extremely novice in my knowledge of CUDA and the various stages of JPEG compression.) Certainly it will compress less, and hopefully(?) be faster, but I have no idea how significant those factors may be.
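Just to make the proposed pipeline concrete, here is a small CPU-side sketch (NumPy/SciPy, not CUDA) of the per-block stages: 2-D DCT, flat quantization, and a naive run-length encoding. The 8x8 block size and the quantizer value are assumptions, and real JPEG adds a zig-zag scan and entropy coding on top:

    import numpy as np
    from scipy.fftpack import dct

    def encode_block(block, q=16):
        """DCT + uniform quantization + run-length encoding of one 8x8 block.

        Each block is independent, which is what makes this stage easy to
        map onto one GPU thread block per image block.
        """
        # 2-D DCT-II (apply the 1-D transform along both axes)
        coeffs = dct(dct(block.astype(np.float64), axis=0, norm='ortho'),
                     axis=1, norm='ortho')
        quantized = np.round(coeffs / q).astype(np.int32).ravel()

        # simple (value, run-length) encoding of the flattened coefficients
        rle = []
        run_val, run_len = int(quantized[0]), 1
        for v in map(int, quantized[1:]):
            if v == run_val:
                run_len += 1
            else:
                rle.append((run_val, run_len))
                run_val, run_len = v, 1
        rle.append((run_val, run_len))
        return rle

    # toy usage: encode every 8x8 block of a random "image" independently
    img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
    blocks = [encode_block(img[y:y+8, x:x+8])
              for y in range(0, 64, 8) for x in range(0, 64, 8)]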
You could hardly get more compression out of a GPU - there are just no complex-enough algorithms that can use that MUCH power.
When working with simple algorithms like JPEG, it's so simple that you'll spend most of the time transferring data over the PCI-E bus (which has significant latency, especially when the card does not support DMA transfers).
The positive side is that if the card has DMA, you can free up the CPU for more important stuff and get image compression "for free".
In the best case, you can get about 10x improvement on top-end GPU compared to top-end CPU provided that both CPU & GPU code is well-optimized.

Algorithms FPGAs dominate CPUs on

For most of my life, I've programmed CPUs; and although for most algorithms the big-O running time remains the same on CPUs and FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around, whereas on FPGAs it's often compute bound).
I would like to learn more about this -- does anyone know of good books / reference papers / tutorials that deal with the questions of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per joule)
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like dedicated ASICs, but to get rapid development, and you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism. (Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining. This is also a kind of parallelism, but it's hard to apply to a single task, so it helps if you can get many separate tasks to work on in parallel.
(Mostly) fixed data flow paths. Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.
High total bandwidth to many small memories. FPGAs have hundreds of small (O(1 KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work). The ideal FPGA task has small inputs and outputs but requires a lot of internal work. This way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes) -- but doing that requires a custom board architected for such bandwidth; in most scenarios, your external I/O will be a bottleneck.
Simple enough for HW (aka good SW/HW partitioning). Many tasks consist of 90% irregular glue logic and only 10% hard work ("kernel" in the DSP sense). If you put all that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and the HW fully utilized for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic onto medium area, if you can't offload it to a real CPU.)
Weird bit manipulations are a plus. Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you - most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia.
Lots of floating point specifically is a minus, because both CPUs and GPUs crunch it on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)
Low latency / real-time requirements are a plus. Hardware can really shine under such demands.
EDIT: Several of these conditions -- esp. fixed data flows and many separate tasks to work on -- also enable bit slicing on CPUs, which somewhat levels the field.
Well, the newest generation of Xilinx parts, just announced, brags 4.7 TMACS and general-purpose logic at 600 MHz. (These are basically Virtex-6s fabbed on a smaller process.)
On a beast like this, if you can implement your algorithms with fixed-point operations, primarily multiplies, adds and subtracts, and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive, in terms of both power and processing.
You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18-bit MACC with a 48-bit sum. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke a truckload of performance out of these (e.g. use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24 bits). Double-precision floats are going to eat a lot of resources, so if you need them, you will probably do better on a PC.
If your algorithms can be expressed in terms of add and subtract operations, then the general-purpose logic in these parts can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
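As an illustration of the kind of integer-only, add/subtract-dominated algorithm meant here, a plain-Python sketch of Bresenham's line algorithm (nothing FPGA-specific; the point is that the inner loop needs only additions, subtractions, and comparisons):

    def bresenham_line(x0, y0, x1, y1):
        """Integer-only Bresenham line: the loop body uses only add, subtract
        and compare -- exactly the operations FPGA fabric is full of."""
        points = []
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy
        while True:
            points.append((x0, y0))
            if x0 == x1 and y0 == y1:
                break
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x0 += sx
            if e2 <= dx:
                err += dx
                y0 += sy
        return points

    print(bresenham_line(0, 0, 6, 3))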
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used on old-school machines like that will find they still apply. All the tricks that modern programmers are told to "let the library do for you" are the kinds of things you need to know to implement math on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement things with integers only.
ACTUALLY, any algorithm that can be implemented using lookup tables will be VERY well suited to FPGAs. Not only do you have block RAMs distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini RAMs.
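For instance, here is a minimal lookup-table sine, the kind of "limited range plus a table" approach the previous two paragraphs allude to; the 256-entry quarter-wave table and the 8.8 fixed-point output format are arbitrary assumptions for the sketch:

    import math

    TABLE_BITS = 8                       # 256-entry quarter-wave table
    SCALE = 256                          # 8.8 fixed-point output
    QUARTER = [round(math.sin(i * (math.pi / 2) / (1 << TABLE_BITS)) * SCALE)
               for i in range(1 << TABLE_BITS)]

    def sin_lut(phase):
        """sin() for a 10-bit phase (0..1023 == one full cycle), returned in
        8.8 fixed point. Only table lookups, masks and a sign flip -- no
        floating point, which is why this style maps so well onto LUTs/BRAM."""
        quadrant = (phase >> TABLE_BITS) & 0x3
        index = phase & ((1 << TABLE_BITS) - 1)
        if quadrant in (1, 3):           # second/fourth quadrant: mirror the table
            index = (1 << TABLE_BITS) - 1 - index
        value = QUARTER[index]
        return value if quadrant < 2 else -value

    print(sin_lut(0), sin_lut(256), sin_lut(512), sin_lut(768))  # ~0, ~256, ~0, ~-256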
You can view things like fixed bit manipulations as FREE! They are handled simply by routing. Fixed shifts or bit reversals cost nothing. Dynamic bit operations, like a shift by a variable amount, cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder (4 6-bit LUTs per slice, or 8 5-bit LUTs per slice, depending on configuration).
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algos for a living.
We've done HW implementations of regular-expression engines that will run thousands of rule sets in parallel at speeds up to 10 Gb/s. The target market for that is routers, where anti-virus and IPS/IDS can run in real time as the data streams by, without slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost in real time... it takes about 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video-on-demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is security video cameras. The government has a massive number of video cameras generating huge streams of real-time data. They zip it down in real time before sending it over their network, and then unzip it in real time on the other end.
Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas, and from the time delta of arrival figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to figure out the fingerprint of specific transmitters, so we could know that a signal was coming from a specific Russian SAM site that used to be stationed at a different border, and so track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Parallelizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA

Compression to Improve Hard Disk Write Performance

On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working with where a program serially generates and dumps around 1-2GB of text logging data to a raw text file on the hard disk and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.
Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it; time a huge fwrite or something.) Let's say 100 MB/s. Now take your CPU speed in megahertz; let's say 3 GHz = 3000 MHz. Divide the CPU speed by the disk write speed. That's the number of cycles the CPU spends idle that you can spend per byte on compression. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could compress your data to 80% of its original size, for an effective 125 MB/s write speed, you would have 24 cycles per byte to run it in, and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
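A throwaway helper for that back-of-the-envelope calculation; the 100 MB/s and 3 GHz figures are just the example numbers from above, so measure your own:

    def compression_cycle_budget(disk_mb_s, cpu_ghz, size_ratio):
        """Cycles per uncompressed byte available 'for free' while the disk is
        busy, if compression shrinks the data to size_ratio of its original size."""
        effective_mb_s = disk_mb_s / size_ratio            # bytes leave the app faster
        cycles_per_byte = cpu_ghz * 1000 / effective_mb_s  # MHz / (MB/s) = cycles/byte
        return effective_mb_s, cycles_per_byte

    print(compression_cycle_budget(100, 3.0, 1.0))   # (100.0, 30.0) -- no compression
    print(compression_cycle_budget(100, 3.0, 0.8))   # (125.0, 24.0) -- 80% of original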
We do this all the time when reading optical media.
If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
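A hedged sketch of that hand-off in Python; a bounded queue stands in for the ring of buffers, zlib stands in for whatever compressor you pick, and the buffer size and queue depth are arbitrary:

    import queue
    import threading
    import zlib

    ring = queue.Queue(maxsize=8)            # bounded => producer never outruns the writer

    def writer_thread(path):
        """Consumer: compress each buffer and append it to the log file
        (each buffer becomes its own zlib stream in this sketch)."""
        with open(path, "ab") as f:
            while True:
                buf = ring.get()
                if buf is None:              # sentinel: producer is done
                    break
                f.write(zlib.compress(buf, 1))   # cheap/fast compression level

    t = threading.Thread(target=writer_thread, args=("app.log.z",))
    t.start()

    # Producer side: the logging code just fills buffers and hands them off;
    # it only blocks if all 8 buffers are already waiting to be written.
    for i in range(100):
        ring.put((("log line %d\n" % i) * 1000).encode())

    ring.put(None)                           # tell the writer to finish
    t.join()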
Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression at lower compression levels is pretty fast; if that's not fast enough, you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.
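Roughly the same popen trick in Python terms, assuming a gzip binary on the PATH (the file name is made up):

    import subprocess

    # Pipe log output through an external gzip process, which runs on a spare core.
    with open("app.log.gz", "wb") as out:
        gz = subprocess.Popen(["gzip", "-1", "-c"], stdin=subprocess.PIPE, stdout=out)
        for i in range(100_000):
            gz.stdin.write(("event %d happened\n" % i).encode())
        gz.stdin.close()   # EOF lets gzip flush and exit
        gz.wait()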
For what it is worth, Sun's filesystem ZFS can enable on-the-fly compression to decrease the amount of disk I/O without a significant increase in overhead, as an example of this in practice.
The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on the compression algorithm and settings used, the file workload, and the characteristics of your machine.
For example, in the measurements from the paper, for a textual workload in a server environment, lzop with low compression effort is faster than a plain write, but bzip2 and gzip are not.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.
CPUs have grown faster at a faster rate than hard drive access. Even back in the 80's, many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That has not changed.
Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.
As to the usefulness of a second core: it only counts if the system will also be doing a significant number of other things - and your program would have to be multi-threaded to take advantage of the additional CPU.
Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
Windows already supports File Compression in NTFS, so all you have to do is to set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.
This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates, in the tens of MB/s, this is a pretty high hurdle to get over. To the point of some of the other answers: you would likely have to have easily compressible data, and would just have to benchmark it with some reasonable test experiments to find out.
As a specific opinion (a guess!) on the point about additional cores: if you thread up the compression of the data and keep the core(s) fed, then with the high compression ratio of text it is likely such a technique would bear some fruit. But this is just a guess. In a single-threaded application alternating between disk writes and compression operations, it seems much less likely to me.
If it's just text, then compression could definitely help. Just choose a compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2", and both have parameters that you can tweak to favor speed or compression ratio.
If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core and a relatively fast 100 MB/s streaming hard drive, then halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data. With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there are a few cycles available (not many) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.), then you have even more time for a more sophisticated compressor to ponder the data.
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.
If this particular machine is often IO bound, another way to speed it up is to install a RAID array. That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with the same 4 total disks, gives a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x that of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).
This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible.
However, as Shog9 commented:
Rules of thumb aren't going to help you here. It's your disk, your CPU, and your data. Set up a test case and measure throughput and CPU load with and without compression - see if it's worth the tradeoff.
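A minimal version of that test case, with Python's zlib standing in for whatever compressor you would actually use; the ~200 MB of synthetic log text and the level-1 setting are assumptions, and real log data will compress differently:

    import os
    import time
    import zlib

    # ~200 MB of repetitive, text-like "log" data (real logs compress less well)
    line = b"2024-01-01 12:00:00 INFO request handled in 3 ms\n"
    data = line * 64 * (1 << 16)

    def timed_write(path, payload):
        t0 = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())      # make sure we time the actual disk write
        return time.perf_counter() - t0

    raw_s = timed_write("log_raw.txt", data)

    t0 = time.perf_counter()
    compressed = zlib.compress(data, 1)   # cheapest/fastest zlib level
    comp_cpu_s = time.perf_counter() - t0
    comp_io_s = timed_write("log_comp.z", compressed)

    print(f"raw:        {raw_s:.2f} s for {len(data) / 1e6:.0f} MB")
    print(f"compressed: {comp_cpu_s + comp_io_s:.2f} s total "
          f"({comp_cpu_s:.2f} s CPU + {comp_io_s:.2f} s IO, "
          f"{len(compressed) / 1e6:.0f} MB on disk)")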
