Choosing minibatch size for deep learning

Choosing minibatch size for deep learning - performance

In a blog post by Ilya Sutskever, A brief overview of Deep Learning, he describes how it is important to choose the right minibatch size to train a deep neural network efficiently. He gives the advice "use the smaller minibatch that runs efficiently on your machine". See the full quote below.
I've seen similar statements by other well-known deep learning researchers, but it is still unclear to me how to find the correct minibatch size. Seeing as a greater minibatch can allow for a greater learning rate, it seems like it requires a lot of experiments to determine whether a certain minibatch size yields a better performance in terms of training speed.
I have a GPU with 4gb of RAM and use the libraries Caffe and Keras. What is in this case a practical heuristic for choosing a good minibatch size given that each observation has a certain memory footprint M?
Minibatches: Use minibatches. Modern computers cannot be efficient if
you process one training case at a time. It is vastly more efficient
to train the network on minibatches of 128 examples, because doing so
will result in massively greater throughput. It would actually be nice
to use minibatches of size 1, and they would probably result in
improved performance and lower overfitting; but the benefit of doing
so is outweighed the massive computational gains provided by
minibatches. But don’t use very large minibatches because they tend to
work less well and overfit more. So the practical recommendation is:
use the smaller minibatch that runs efficiently on your machine.

When we are training a network, when we compute a forward pass, we have to keep all the intermediate activation outputs for the backwards pass. You simply need to compute how much memory it will cost you to store all the relevant activation outputs in your forward pass, in addition to the other memory constraints (storing your weights on the GPU, etc). So observe that if your net is quite deep, you might want to take a smaller batchsize as you may not have enough memory.
Selecting a minibatch size is a mixture of memory constraints and performance/accuracy (usually evaluated using cross validation).
I personally guess-timate/compute by hand how much GPU memory my forward/backward pass will use up and try out of a few values. If for example the largest I can fit is roughly 128, I may cross validate using 32, 64, 96, etc. just to be thorough and see if I can get better performance. This is usually for a deeper net which is going to push my GPU memory (I also only have a 4 GB card, don't have access to the monster NVIDIA cards).
I think there tends to be a greater emphasis on network architecture, optimization techniques/tricks of the trade, data pre-processing.

Related

Roofline model: How does increasing Arithmetic Intensity allow room for improvements to performance?

Intel Tip: If you can’t break a memory roof, try to rework your algorithm for
higher arithmetic intensity. This will move you to the right and give
you more room to increase performance before hitting the memory
bandwidth roof.
For algorithms in the memory-bound region of a roofline plot, Intel suggests increasing the arithmetic intensity so that they move to the right (compute-bound region) hence providing room to improve the performance, since the performance roof would be higher.
I'm unable to understand how increasing the arithmetic intensity (say, increasing the no. of operations in the algorithm) can possibly improve a performance metric like the wall-clock time taken for the algorithm to run. Wouldn't you need to do more no. of computations even for a higher performance (in FLOPS)? Could someone explain how this is possible?

Increasing the arithmetic alone is not sufficient to make the algorithm faster. The idea is if you have a choice between multiple algorithm and one of them is memory bound, then it is probably better to pick the other one assuming it is not much slower in practice since you can hardly optimize memory-bound algorithm while this is often much easier for compute-bound one. The memory latency did almost not improve much over the last decade and the bandwidth is only slowly increasing (much less than the number of FLOPS of processors). This is known as the Memory Wall (stated several decades ago). Moving data becomes so expensive nowadays that it is sometimes better to recompute operations rather than storing the previous results. This is especially true for very large data since the bigger the data structure the slower it is. This situation is expected to become worse over the next decades. Thus, a slower compute-bound algorithm can become faster than a memory-bound one in a near future (especially if it can-be/is parallelized).

What is the most important feature of caches in scientific computing?

I have started recently learning parallel programming techniques and what to give attention to when trying to create efficient programs. For example knowing specific details about the caches of your processor is essential if you want to write efficient programs.
I want to know what is the most important (if one is more important than the other) feature of a cache between the block size and the number of sets e.g. 4-way or 8-way associative.

Associativity matters more than line size. Many accesses in HPC are sequential, so smaller line size is mostly just a waste of tag overhead.
Having more smaller sets (because of a smaller line size) might help for a histogram problem, which is one of the major things that can't easily be optimized to sequential accesses.
Of course, latency and bandwidth are usually even more important than 4 vs. 8-way.

Understanding perf detail when comparing two different implementations of a BFS algorithm

The results below are measured using perf on a compute server with 32 cores. I know my implementation is unoptimized but purposely as I want to make comparisons. I understand that graph algorithms tend to have low locality which researchers try to address.
I'm unclear of the results, though. The time elapsed is misleading. My implementation runs through a graph with about 4mm nodes in about 10 seconds and the rest of the time pre processing. The optimized version uses the same input and traverses about 10 times with each less than a second each so it's really just pre-processing time. I'm not trying to achieve the same. Just understand why that may be based on perf.
I see my page faults are substantially higher. I'm not 100 sure why this is this the case as the annotations (from what I can tell) do not point to any specific piece of my code from mine...
__gnu_cxx::new_allocator<std::_List_node<int> >::construct<int, int const&>
This seems to be when I process the graph itself since I create linked lists for the adjacency lists. I figured this may actually cause issues and wanted to investigate anyway. I should be able to improve page faults (and hopefully performance) by switching to jagged arrays?
The optimized algorithm has a much higher last level cache miss which I thought would explain the primary issue with BFS / graph algorithms with low locality but performance seems to be unaffected by this and my unoptimized is significantly lower.
Then there are the front / back end cycles which seems to be the opposite in terms of performance issues when comparing the two - I'm worse in frontend and the optimized is worse in backend.
Am I missing or not understanding something obvious? I thought there would be something obvious in terms of low locality that would be of issue when looking at perf but I'm confused by the optimized version.
This is my implementation of unoptimized parallel BFS (running once)...
This is using an optimized parallel BFS from a benchmark suite (running 10 times)...
Both take about 40 seconds to pre-process the data once, before doing parallel searching.

Unfortunately perf stat often doesn't given enough information to really determine where the bottleneck in your application is. It is possible to have two applications with wildly different underlying bottlenecks but with very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect and the other way may almost be not impacted at all, depending on the amount and nature of overlapping work.
So if you try to analyze in depth from these high level counters, you are often just taking stabs in the dark. Still we can make a few observations. You mention:
The optimized algorithm has a much higher last level cache miss which
I thought would explain the primary issue with BFS / graph algorithms
with low locality but performance seems to be unaffected by this and
my unoptimized is significantly lower.
First, LLC misses are ~620 million for the optimized algorithm and ~380 for your algorithm, but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses, and your algorithm has six times the number of LLC misses. Yes, your algorithm has a lower LLC miss rate - but the absolute number of LLC misses is what counts for performance. The lower miss rates just means that you are making every more total accesses than the 6x figure: basically you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.
All of this points to accessing more total memory in your unoptimized algorithm, or perhaps accessing it in a much more cache unfriendly fashion. That's would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC) and given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory access problems, stalls while waiting for memory are very likely.
Luckily, there is a better way that just trying to reverse engineer what might be the bottleneck based on perf stat output. Intel has developed a whole methodology which tries to this type of top-down analysis in a way that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology effect using Andi Kleen's toplev. I highly recommend you start there.

Algorithms FPGAs dominate CPUs on

For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).
I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deals with the issue of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per jule)
Note: marked community wiki

[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like dedicated ASICs, but to get rapid development, and you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism.
(Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining.
This is also a kind of parallelism, but it's hard to apply it to a
single task, so it helps if you can get many separate tasks to
work on in parallel.
(Mostly) Fixed data flow paths.
Some muxes are OK, but massive random accesses are bad, cause you
can't parallelize them. But see below about memories.
High total bandwidth to many small memories.
FPGAs have hundreds of small (O(1KB)) internal memories
(BlockRAMs in Xilinx parlance), so if you can partition you
memory usage into many independent buffers, you can enjoy a data
bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work).
The ideal FPGA task has small inputs and outputs but requires a
lot of internal work. This way your FPGA won't starve waiting for
I/O. (CPUs already suffer from starving, and they alleviate it
with very sophisticated (and big) caches, unmatchable in FPGAs.)
It's perfectly possible to connect a huge I/O bandwidth to an
FPGA (~1000 pins nowdays, some with high-rate SERDESes) -
but doing that requires a custom board architected for such
bandwidth; in most scenarios, your external I/O will be a
bottleneck.
Simple enough for HW (aka good SW/HW partitioning).
Many tasks consist of 90% irregular glue logic and only 10%
hard work ("kernel" in the DSP sense). If you put all that
onto an FPGA, you'll waste precious area on logic that does no
work most of the time. Ideally, you want all the muck
to be handled in SW and fully utilize the HW for the kernel.
("Soft-core" CPUs inside FPGAs are a popular way to pack lots of
slow irregular logic onto medium area, if you can't offload it to
a real CPU.)
Weird bit manipulations are a plus.
Things that don't map well onto traditional CPU instruction sets,
such as unaligned access to packed bits, hash functions, coding &
compression... However, don't overestimate the factor this gives
you - most data formats and algorithms you'll meet have already
been designed to go easy on CPU instruction sets, and CPUs keep
adding specialized instructions for multimedia.
Lots of Floating point specifically is a minus because both
CPUs and GPUs crunch them on extremely optimized dedicated silicon.
(So-called "DSP" FPGAs also have lots of dedicated mul/add units,
but AFAIK these only do integers?)
Low latency / real-time requirements are a plus.
Hardware can really shine under such demands.
EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.

Well the newest generation of Xilinx parts just anounced brag 4.7TMACS and general purpose logic at 600MHz. (These are basically Virtex 6s fabbed on a smaller process.)
On a beast like this if you can implement your algorithms in fixed point operations, primarily multiply, adds and subtracts, and take advantage of both Wide parallelism and Pipelined parallelism you can eat most PCs alive, in terms of both power and processing.
You can do floating on these, but there will be a performance hit. The DSP blocks contain a 25x18 bit MACC with a 48bit sum. If you can get away with oddball formats and bypass some of the floating point normalization that normally occurs you can still eek out a truck load of performance out of these. (i.e. Use the 18Bit input as strait fixed point or float with a 17 bit mantissia, instead of the normal 24 bit.) Doubles floats are going to eat alot of resources so if you need that, you probably will do better on a PC.
If your algorithms can be expressed as in terms of add and subtract operations, then the general purpose logic in these can be used to implement gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high percision trig functions, not so much... Again it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then your golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used to use on the old school machine like that will still apply. All the tricks that modern programmer tell you "let the libary do for you" are the types of things that you need to know to implement maths on these. If yo can find a book that talks about doing 3d on a 68000 based Atari or Amiga, they will discuss alot of how to implement stuff in integer only.
ACTUALLY any algorithms that can be implemented using look up tables will be VERY well suited for FPGAs. Not only do you have blockrams distributed through out the part, but the logic cells themself can be configured as various sized LUTS and mini rams.
You can view things like fixed bit manipulations as FREE! It's simply handle by routing. Fixed shifts, or bit reversals cost nothing. Dynamic bit operations like shift by a varable amount will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3960 multipliers! And 142,200 slices which EACH one can be an 8 bit adder. (4 6Bit Luts per slice or 8 5bit Luts per slice depending on configuration.)

Pick a gnarly SW algorithm. Our company does HW acceleration of SW algo's for a living.
We've done HW implementations of regular expression engines that will do 1000's of rule-sets in parallel at speeds up to 10Gb/sec. The target market for that is routers where anti-virus and ips/ids can run real-time as the data is streaming by without it slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost real-time...it takes almost 2 seconds of processing to convert 1 second of film. Netflix's used our HW almost exclusively for their video on demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is for security video cameras. The government has some massive amount of video cameras generating huge streams of real-time data. They zip it down in real-time before sending it over their network, and then unzip it in real-time on the other end.
Heck, another company I worked for used to do radar receivers using FPGA's. They would sample the digitized enemy radar data directly several different antennas, and from the time delta of arrival, figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGA's to figure out the fingerprint of specific transmitters, so we could know that this signal is coming from a specific Russian SAM site that used to be stationed at a different border, so we could track weapons movements and sales.
Try doing that in software!! :-)

For pure speed:
- Paralizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA

efficiency /algorithms vs system specifications

We are all talking about the efficiency of the algorithms and it depends on input size -basically.
How about the system specifications of current computer that runs the algorithm? does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHZ, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
I am sure that there must be a performance difference. But, I want to know what is the real relationship between algorithms and system specifications...

An increase in hardware performance will give you a constant C times the running time of your algorithm. Meaning if you have computer A which is overall 2 times slower than computer B. Than your algorithm will be twice as fast on computer B. Twice as fast though really makes hardly no difference when you consider big input values to an algorithm though.
In big O notation that is to say you will have something like O(n) compared to CO(n) = O(cn) = O(n). The complexity of the algorithm and general running time for large values will be about the same on both Computer A and Computer B.
If you analyze an algorithm's running time using something like big O notation, then you will have a much better idea about how the algorithm really works. Computer performance won't give you any kind of advantage when you are comparing an algorithm that is O(logn) compared to O(n^2).
Take a look at some of the data values for n:
I will assume 1 second per operation for the slow computer, and 2 operations for second for the fast computer. I will compare the better algorithm with the slow computer with the worse algorithm with the fast computer.
for n = 10:
Algorithm 1: O(logn): 4 operations
Slow computer: 4 seconds
Algorithm 2: O(n^2): 100 operations
Fast computer: 50 seconds
for n = 100:
Algorithm 1: O(logn): 7 operations
Slow computer: 7 seconds
Algorithm 2: O(n^2): 10,000
operations Fast computer: 1.4 hours
Large difference
for n = 1,000:
Algorithm 1: O(logn): 10 operations
Slow computer: 10 seconds
Algorithm 2: O(n^2): 1,000,000
operations Fast computer: 5.8 days
Huge difference
As n increases, the difference gets bigger and bigger.
Now if you tried to run each of these algorithms on a faster/slower computer for a large input size. It wouldn't matter. Hands down the O(logn) would be faster.

I don't like the answers provided by Brian Bondy and Czimi...
Perhaps this is because I started in a different era, when 32K was considered a lot of memory, and most "personal computers" had 8K bytes, and that now I work in scientific computing where the largest data sets are processed on some of the world's largest systems with thousands of processing nodes and seemingly unbelievable quantities of storage. Therefore I don't overlook certain other elements of the question.
The size of the data set in question makes a fantastic difference. Most all the answers on this question so far ignore this and work for very small numbers N. The other people who have answered have all presumed "it all fits in memory," or something close to that.
For large data sets other factors come into play, and "large" depends on what resources you have to use in solving your problem. Modern systems have the opportunity for off-line storage (e.g. DVDs), networked storage (e.g. nfs), on-line storage (e.g. serial ATA), and two levels of memory storage, system main memory and on-chip cache. How these are leveraged matters and the larger the data set the more they matter. You may or may not need to design access to these into your "algorithm", but if you do, it really matters!
As you increase scale beyond some particular point - the limit of a single CPU and its local memory is about right - these other factors become an increasingly large factor in the overhead of the workload. When I was a Digital, we did some of the first real commercial work on multi-CPU systems and I remember running a benchmark that showed that using a single-CPU as one "unit" of CPU workload capability, a second CPU (in a tightly coupled system) would give you a total of about 1.8. That is, the second CPU added about 0.8. For three, the increase dropped to about 0.6, and four it dropped a lot more, to about 0.2, for a grand total of about 2.6 for a four CPU arrangement, though we had some troubles keeping good numbers with four CPUs due to other effects (the measurement effort became a large fraction of the additional resource). ...The bottom line was that multi-CPUs weren't necessarily all they were cracked up to be - four times the CPU does NOT give you four times the processing power, even though in theory you get four times the flops. ...We repeated the work on the Alpha chip, the first multi-core in history, and the results held up pretty well. Surely there could have been optimizations to improve the fraction each additional CPU gave, and surely there has been a lot of work since then to split computing threads more smartly, but you'll never get it all the way to 100% of each new one, in part because they all slow down some (extra overhead) to coordinate.
Small interjection - we had a saying about this work: "Religate all the Important Stuff to the Compiler!" RISC, get it? This was because the compiler itself had to organize the workload so competing threads didn't step on one another!
Ultimately performing processing of really massive data crunching requires a really smart strategy of moving the data in and out of farther afield data storage through to local memory. And, division of labor within the algorithm is absolutely vital. In work I was doing with Roberto Mechoso at UCLA doing Global Circulation Modeling, they had a data-broker design that is illustrative of the attempts people make to do a great job. Frankly, the result wasn't as good as it could have been, but the design ideas that went into it are worth study. ...Presuming you consider this part of your "algorithm" - and not just the bit twiddling part, then the algorithms management of resources is one of the most vital aspects of reasonable if not optimal resource utilization doing substantial computing.
...I hope this helps answer your inquiry.

One thing not raised so far is that alogorithms are often described in terms of speed, e.g. O(n), O(n log(n)), etc... but they also have characteristics in terms of resource usage, where improved speed, say O(n) versus O (n log(n)), is at the cost of much greater memory usage. In modern computers as resources become exhausted, they are typically replaced with larger slower resources, e.g. swapping memory for disk, where the slower resource is orders of magnitude slower. Thus when we graph the performance of our algorithm against time, and expect a straight line, n log n curve, etc... we often see spikes for large values of n as memory gets exhuasted. In this case, the difference between 1GB and 2GB of RAM can be huge, so in practical terms, the answer to your question is yes, System specification is very important, and selection of algorithms requires knowledge of the system specification and the size of the input data.
For example, I develeop surface modelling and analysis software, and I know that my programs work well on a 32bit XP box for TIN models of 4 million points. The performance difference between 3.5 million and 4 million points is minor. At 4.5 million points the performance degradation is so severe the software is unusable.

Yes, it does depend on system specification. One system might be 10 times faster than another, so it will run bubblesort and quicksort on a set of data 10 times faster than the other.
But when you do analysis of algorithms, you often ignore constant factors like that, which is one thing that big-O notation does. So bubblesort is O(n^2) and quicksort is O(nlogn) (in the average case), and that holds no matter how fast your hardware is.
The interesting thing is when you start comparing apples and oranges. If you're running bubblesort on your fast hardware, you may find it's faster than quicksort on the slow hardware -- but only up to a point. Eventually, with a large enough input set, the quicksort on the slow hardware is going to be faster than bubblesort on the fast hardware.
If you want to start making comparisons like that, you need to do two things together: determine algorithmic complexity including the constant factors, and develop a speed model (e.g. how many iterations of a particular loop it can perform per second) for the actual hardware you're running on. One of the interesting things about Knuth's Art of Computer Programming, compared with other books on algorithms, is that he does both, so that for each algorithm he examines, he calculates how many units of execution time it will take for a given size of input on his (mythical) MIX computer. You could then adjust the calculation for faster or slower hardware -- something that big-O notation doesn't help with.

By your question, do you mean to ask why the efficiency of an algorithm is described only in terms of the input size?
Algorithms are usually described using the Big O Notation. This notation describes the asymptotic behavior of an algorithm; it describes the behavior when the input data is very very large.
So for example, we have two algorithms for sorting.
Algo#1 with O(n)
Algo#2 with O(n^2)
And let's take two PCs:
PC1
PC2 100x faster than PC1
And we have two setups:
PC1 running Algo#1
PC2 running Algo#2
When n is very very large (like billions?) PC1 will still beat PC1 :)

The efficiency of an algorithm doesn't depend on the system specification. The efficiency is described by the Ordo number, which gives you a relation of the processing effort and the size of the input.

Certainly yes. With high CPU the execution time will reduce.
Similarly with higher memory, the time taken to swap data (if applicable) will definitely reduce.

Be aware of the particularities of the language in which you implement your algorithm.
For instance, in java world, As illustrated in this article, a faster computer does not always means a faster runtime:
Same Java program on a single CPU machine can actually run a lot faster than on a multiprocess/multi-core machine!!

Does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHZ, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
In some cases absolutely! If your data set does not fit into memory you will need to use a disk based sorting algorithm such as merge sort. Quoting from Wikipedia:
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.

Yes. Remember, we have jumped ~4 orders of magnitude since the early 80s. (1 MHz, 10 MHz, 100 MHz, 1000MHz).
But that's only the difference between n=10 and n=10000, in terms of data set sizes. I can purchase a terabyte hard drive...over 6?7? orders of magnitude than my old 20 megabyte drive.
There's a lot more data floating around out there than there is compute power. So while you might be confused about how useful the big-O is at n=50, n=500 kind of sizes...when n=1,000,00 you want to minimize n as much as you can. Anything supralinear is just rough on your compute power...non-polynomial is even worse. This extends all the way from the top of the system to the bottom. So, efficiency is king as soon as you deal with real-world dataset sizes.
Let me give you an example.
I did a junior level database design. By the end I had maybe 5 tables with maybe 20-40 pre-defined categories in them. Added rows, 10, 20 rows. No big deal. I was a whiz. I did it in PHP, not Perl. I was all that and a bag o' chips.
Move to now, a few years later. I'm doing a hobby project in datamining the stock market. I harvest data off a financial site every day - 6100 stocks, with about 10 columns in each stock. Thirty-thousand+ rows per week. My initial design was "normalized", with static data and dynamic data in different tables. As I played around with my queries, learning about things, if I did a bad join, I'd literally crash my server and make it unavailable. So I denormalized. My next phase is tagging and starting actual mining. I don't plan to start making serious predictions until Christmas-time; roughly 11x30K = 330K rows to mine and analyze. Algorithm efficiency will matter, if I want to get my data processed in a timely fashion. Doesn't matter if my CPU was 10 times as fast...if I use a N^2 algorithm, it'd only get done 2x as fast. :-)

But, I want to know what is the real
relationship between algorithms and
system specifications...
I am confused. I see a lot of people here writing a lot of things, however, when I read the quote above as a question, here's all I can say about it:
A better system (= faster CPU, more RAM) runs faster, a worse one (= slower CPU, less RAM) runs slower. The same algorithm (no matter how good or bad it is) will most likely run faster on a better system and slower on the worse one.
A faster algorithm runs faster than a slower one. It will run faster on the slower system and it will run faster on the faster system.
So what exactly was your question again? Is your question "Do we really need a fast algorithm if the system is already that fast? Won't a slow one do as well?" Yes, maybe. But in that case I would ask two questions:
Why selecting a slow algorithm just because the system is fast? That way your code will only run at decent speed on a very fast system. If you choose a fast algorithm, your code might even run at decent speed on a much worse system.
Why trying to intentionally achieve worse performance? Even though a bad algorithm might run within five seconds, what you consider fast enough on the fast machine, a good one might run in 100 milliseconds. So why making your program perform a task in 5 seconds it could perform exactly the same one in 100 milliseconds?
Actually it's point number (2) that really bugs me quite often. So often people say "Hey, don't over optimize, it won't really matter. This code is only such a small system in such a big system". Yes, if you just look at this code isolated, that is true. But there is a saying "Many a mickle makes a muckle". Of course you should optimize the most processor intensive parts first. However, if a system consists out of 100 modules and each of them uses only one percent of the CPU time, optimizing one of them to be twice as fast will only get an overall processing time improvement of 0.5%, close to nothing; that's why most people refrain from doing that. But what people overlook is that optimizing all of them to be twice as fast will get a processing time improvement of 50% (or IOW, the app will run twice as fast as a whole).
So unless there is any good reason for not doing it, why not always using the best algorithm known to solve a problem? The best means the one that shows good performance and a good CPU time/memory usage ratio (as it's useless to take the fastest one if this one needs more memory than a normal customer PC can even take).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio