Calculate the theoretical speed of a program? - ruby

I've written two different ways of doing the same thing. I would like to compare which one will execute faster. Of course it is always possible to benchmark, but benchmark results can differ from machine to machine and can be affected by many outside factors. How could I calculate which is faster without benchmarking? My thought would be that you would sum the times of all the operations done in the program. Is this a standard thing to do? It seems like when you benchmark there is a lot of room for error.

My thought would be that you would sum the times of all the operations done in the program.
Yes, but you can't easily / reliably figure out those times by any method other than benchmarking.
The problem is that these times depend on the dynamic context of what happened previously in your program (or even system-wide). CPUs are complex beasts, and cache effects (data cache and instruction cache) are often a major factor. So is branch prediction. Why is it faster to process a sorted array than an unsorted array?
Static analysis of a small loop in assembly language is possible. e.g. I can accurately predict how many cycles per iteration a simple loop can run on Intel Haswell, assuming no cache misses, based on Agner Fog's microarchitecture PDF and instruction tables. Going beyond that involves more and more guesswork.
Performance in a high-level interpreted language like Ruby might be somewhat predictable for experts who spend a lot of time tuning code in it, but almost certainly not at the level of "this will take this many microseconds", only "this is probably somewhat or a lot faster than that".
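To make the sorted-vs-unsorted point above concrete, here is a minimal C++ sketch (my illustration, not part of the original answer): it does the exact same work on the same data twice, and only the predictability of one branch changes. The array size and threshold are arbitrary, and with aggressive optimization the compiler may turn the branch into a branchless select, which hides the effect.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Sum only the elements >= 128, timing the same loop on unsorted and
    // then sorted data. The instruction count is identical; only how well
    // the branch predictor copes with the `x >= 128` branch differs.
    static long long conditional_sum(const std::vector<int>& data) {
        long long sum = 0;
        for (int x : data)
            if (x >= 128) sum += x;   // hard to predict on random data
        return sum;
    }

    int main() {
        std::vector<int> data(1 << 22);          // ~4M elements
        std::mt19937 rng(1);
        for (auto& x : data) x = rng() % 256;

        auto time_it = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            volatile long long s = conditional_sum(data);  // keep the result alive
            auto t1 = std::chrono::steady_clock::now();
            (void)s;
            std::printf("%s: %.1f ms\n", label,
                        std::chrono::duration<double, std::milli>(t1 - t0).count());
        };

        time_it("unsorted");
        std::sort(data.begin(), data.end());
        time_it("sorted");   // typically much faster when the branch is kept
    }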

Algorithmic complexity will give you a theoretical speed comparison for an algorithm.
Your question is about an arbitrary program, but a program is more than a collection of algorithms.
The execution speed of a program depends on the context in which it runs (I/O, the operating system (multitasking or not), the hardware).
So there is no method other than statistics over a bunch of measurements, which is essentially the definition of a benchmark.

Related

Most relevant performance indicators for C/C++

I am looking for relevant performance indicators to benchmark and optimize my C/C++ code. For example, virtual memory usage is a simple but efficient indicator, but I know some are more specialized and help in optimizing specific domains: cache hits/misses, context switches, and so on.
I believe here is a good place to have a list of performance indicators, what they measure, and how to measure them, in order to help people who want to start optimizing their programs know where to start.
Time is the most relevant indicator.
This is why most profilers default to measuring / sampling time or core clock cycles. Understanding where your code spends its time is an essential first step to looking for speedups. First find out what's slow, then find out why it's slow.
There are 2 fundamentally different kinds of speedups you can look for, and time will help you find both of them.
Algorithmic improvements: finding ways to do less work in the first place. This is often the most important kind, and the one Mike Dunlavey's answer focuses on. You should definitely not ignore this. Caching a result that's slow to recompute can be very worth it, especially if it's slow enough that loading from DRAM is still faster.
Using data structures / algorithms that can more efficiently solve your problem on real CPUs is somewhere between these two kinds of speedups. (e.g. linked lists are in practice often slower than arrays because pointer-chasing latency is a bottleneck, unless you end up copying large arrays too often...)
Applying brute force more efficiently to do the same work in fewer cycles. (And/or more friendly to the rest of the program with smaller cache footprint and/or less branching that takes up space in the branch predictors, or whatever.)
Often involves changing your data layout to be more cache friendly, and/or manually vectorizing with SIMD. Or doing so in a smarter way. Or writing a function that handles a common special case faster than your general-case function. Or even hand-holding the compiler into making better asm for your C source.
Consider summing an array of float on modern x86-64: Going from latency-bound scalar addition to AVX SIMD with multiple accumulators can give you a speedup of 8 (elements per vector) * 8 (latency / throughput on Skylake) = 64x for a medium-sized array (still on a single core/thread), in the theoretical best case where you don't run into another bottleneck (like memory bandwidth if your data isn't hot in L1d cache). Skylake vaddps / vaddss has 4 cycle latency, and 2-per-clock = 0.5c reciprocal throughput (https://agner.org/optimize/). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about multiple accumulators to hide FP latency. But this still loses hard vs. storing the total somewhere, and maybe even updating the total with a delta when you change an element. (FP rounding error can accumulate that way, though, unlike integers.)
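As a sketch of what "multiple accumulators" looks like in practice, here is a hedged C++ intrinsics version (my illustration, not from the answer): the function name is made up, it assumes AVX support and a length that is a multiple of 32, and it uses 4 accumulators where Skylake would ideally want 8 (4 cycles latency x 2 adds per clock) to fully hide vaddps latency.

    #include <immintrin.h>
    #include <cstddef>

    // Sum an array of float using 4 independent AVX accumulators, so the
    // loop isn't serialized on one long add-latency dependency chain.
    float sum_avx_multiple_accumulators(const float* a, std::size_t n) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();

        for (std::size_t i = 0; i < n; i += 32) {      // 4 vectors of 8 floats
            acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
            acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
            acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
            acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
        }

        // Combine the accumulators, then horizontally sum the final vector.
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                   _mm256_add_ps(acc2, acc3));
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));        // add upper pair onto lower pair
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));    // add element 1 onto element 0
        return _mm_cvtss_f32(s);
    }

A real version also needs a scalar cleanup loop for the tail and a runtime CPU-feature check before calling it.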
If you don't see an obvious algorithmic improvement, or want to know more before making changes, check whether the CPU is stalling on anything, or whether it's efficiently chewing through all the work the compiler is making it do.
Instructions per clock (IPC) tells you whether the CPU is close to its max instruction throughput or not. (Or more accurately, fused-domain uops issued per clock on x86, because for example one rep movsb instruction is a whole big memcpy and decodes to many many uops. And cmp/jcc fuses from 2 instructions to 1 uop, increasing IPC but the pipeline width is still fixed.)
Work done per instruction is a factor, too, but isn't something you can measure with a profiler: if you have the expertise, look at compiler-generated asm to see if the same work with fewer instructions is possible. If the compiler didn't auto-vectorize, or did so inefficiently, you can maybe get a lot more work done per instruction by manually vectorizing with SIMD intrinsics, depending on the problem. Or by hand-holding the compiler into emitting better asm by tweaking your C source to compute things in a way that is natural for asm. e.g. What is the efficient way to count set bits at a position or lower?. And see also C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
If you find low IPC, figure out why by considering possibilities like cache misses or branch misses, or long dependency chains (often a cause of low IPC when not bottlenecked on the front-end or memory).
Or you might find that it's already close to optimally applying the available brute force of the CPU (unlikely but possible for some problems). In that case your only hope is algorithmic improvements to do less work.
(CPU frequency isn't fixed, but core clock cycles is a good proxy. If your program doesn't spend time waiting for I/O, then core clock cycles is maybe more useful to measure.)
A mostly-serial portion of a multi-threaded program can be hard to detect; most tools don't have an easy way to find threads using cycles when other threads are blocked.
Time spent in a function isn't the only indicator, though. A function can make the rest of the program slow by touching a lot of memory, resulting in eviction of other useful data from cache. So that kind of effect is possible. Or having a lot of branches somewhere can maybe occupy some of the branch-prediction capacity of the CPU, resulting in more branch misses elsewhere.
But note that simply finding where the CPU is spending a lot of time executing is not the most useful, in a large codebase where functions containing hotspots can have multiple callers. e.g. lots of time spent in memcpy doesn't mean you need to speed up memcpy, it means you need to find which caller is calling memcpy a lot. And so on back up the call tree.
Use profilers that can record stack snapshots, or just hit control-C in a debugger and look at the call stack a few times. If a certain function usually appears in the call stack, it's making expensive calls.
Related: linux perf: how to interpret and find hotspots; Mike Dunlavey's answer there especially makes this point.
Algorithmic improvements to avoid doing work at all are often much more valuable than doing the same work more efficiently.
But if you find very low IPC for some work you haven't figured out how to avoid yet, then sure take a look at rearranging your data structures for better caching, or avoiding branch mispredicts.
Or if high IPC is still taking a long time, manually vectorizing a loop can help, doing 4x or more work per instruction.
@PeterCordes' answers are always good. I can only add my own perspective, coming from about 40 years of optimizing code:
If there is time to be saved (which there is), that time is spent doing something unnecessary that you can get rid of, if you know what it is.
So what is it? Since you don't know what it is, you also don't know how much time it takes, but it does take time. The more time it takes, the more worthwhile it is to find, and the easier it is to find it. Suppose it takes 30% of the time. That means a random-time snapshot has a 30% chance of showing you what it is.
I take 5-10 random snapshots of the call stack, using a debugger and the "pause" function.
If I see it doing something on more than one snapshot, and that thing can be done faster or not at all, I've got a substantial speedup, guaranteed.
Then the process can be repeated to find more speedups, until I hit diminishing returns.
The important thing about this method is that no "bottleneck" can hide from it. That sets it apart from profilers which, because they summarize, allow speedups to hide from them.

For parallel algorithm with N threads, can performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible for the general case. At best we can perfectly divide the work by N. The reason for this is that given no serial portion, Amdahl's formula for speedup becomes:
Speedup = 1/(1/N)
where N is the number of processors. This of course reduces to just N.
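For reference, the general form of Amdahl's law, where P is the fraction of the work that can be parallelized and N the number of processors, is shown below; the formula above is just the special case P = 1 (no serial portion).

    % Amdahl's law: overall speedup when a fraction P of the work is
    % parallelized over N processors and the rest (1 - P) stays serial.
    \[
      \text{Speedup}(N) = \frac{1}{(1 - P) + \dfrac{P}{N}}
    \]
    % With P = 1 this reduces to 1 / (1/N) = N, as stated above.
    % The superlinear cases described in the other answers fall outside this
    % model, because adding a processor also adds cache and so changes the
    % amount of work per item.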

CUDA: Bigger problems in threads

Almost all of the CUDA exemplar code describes doing near-atomic operations on large data sets. What kind of practical limitations are there to the size of a problem each thread can do?
For example, I have another question open at the minute that involves per-thread matrix solving. Is this kind of thing too large to put within each thread?
CUDA is a data parallel programming model for what is effectively an SIMD architecture, so obviously it isn't as flexible as a general purpose multithreaded or MIMD architecture. Certainly kernels can be a lot more complex than simple arithmetic operations.
In my own work I use CUDA a lot for solving partial differential equations (so the finite element, finite difference and finite volume methods), where every thread processes a cell or element from a discretised continuum. In that sort of calculation, there are a lot of FLOPs per thread per cell/element.
The key area to be mindful of is branch divergence. Because it is an SIMD architecture under the hood, code with a lot of branching within a warp of threads (which is effectively the SIMD width) will suffer performance penalties. But branch divergence and code complexity need not be synonymous; you can write very "branchy" and "loopy" code which will run well, as long as threads within any given warp don't diverge too often. In FLOP- and IOP-heavy algorithms, that is usually not too hard to achieve.
I just want to reiterate what talonmies said: there is no real limit to the "size" of a kernel in number of operations. As long as the computation is parallel, CUDA will be effective!
As far as practical considerations go, I would just add a few small notes:
Long-running kernels can time out, depending on the OS (or when profiling with cudaProf). You might have to change a setting somewhere to increase the maximum kernel execution time.
Long-running kernels on systems without a dedicated GPU can freeze the display (interrupting the UI).
Warps are executed asynchronously: one warp can access memory while another performs arithmetic, in order to use clock cycles effectively. Long-running kernels might benefit more from attention to this kind of optimization. I'm not really sure about this last one.

measuring real running time of an algorithm

Approximately how many physical MIPS instructions does an abstract algorithm operation amortize to? By an abstract algorithm operation I mean a basic operation, such as add, divide, etc.
I see this is not a strict measuring technique :-)
Kejia
There is a list of the basic MIPS instructions here. Most of the "basic operations" that you mentioned are a single MIPS instruction or perhaps two, which probably holds true on most current CPU families.
However this does not take into account at all the architecture and performance characteristics of any modern CPU. Different instructions often have different completion times. Current CPUs usually implement branch prediction, instruction pipelines, memory caching, parallelisation and a whole list of other techniques to make code execution faster.
Therefore just having the assembly code implementation of an algorithm says nothing about its execution speed. You would have to measure and profile the code on the actual hardware to obtain comparable results. In fact, some algorithms may be far more effective on certain CPUs, even within the same CPU family.
A common and rather understandable example is the effect of the instruction cache. Unrolling a loop will eliminate a number of branch operations, which intuitively makes code faster. If you run that code on a CPU of the same family with very little instruction cache memory, though, the added accesses to the main memory can make it far slower than the simple branch-based loop.
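As a hedged illustration of that trade-off (mine, not part of the answer), here is a minimal C++ sketch: the unrolled version executes one compare-and-branch per four elements instead of one per element, at the price of a loop body roughly four times larger in the instruction cache.

    #include <cstddef>

    // Straightforward loop: one compare-and-branch per element.
    long long sum_simple(const int* a, std::size_t n) {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Manually unrolled by 4: fewer branches, larger code. For brevity this
    // sketch assumes n is a multiple of 4; a real version needs a cleanup
    // loop for the remaining elements.
    long long sum_unrolled(const int* a, std::size_t n) {
        long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (std::size_t i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }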
Computers are complicated. If you want to get down to this level you need to start considering what kind of CPU you are using, how well your compiler can use this CPU's instruction set, which variables are being kept in which registers, what their bit-level representations are, etc. Even then, the number of instructions does not always map easily to the actual running time. Different instructions can take different amounts of clock cycles to execute, and this is not even considering OS threading and your program's cache miss rate.
In the end, there is a good reason we use big-O notation in the first place :)
BTW, most simple operations (add, subtract) on integers should map to a single machine instruction, in case you are worried.
It depends on the CPU architecture. Some processors require several cycles for a single instruction such as divide, while others manage to execute all machine code instructions in a single cycle each.
It is sometimes relevant to measure an algorithm in how many floating point operations it requires. However this does not take I/O (such as reading memory) into consideration.
The speed of a CPU is sometimes provided in FLOPS (Floating Point OPerations per Second) which could help to give you a time estimate. Again, not taking I/O into consideration - and not multi-threading issues (also a very important measuring factor).
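As a rough back-of-the-envelope formula (my addition, and only a lower bound because it ignores memory, I/O and threading):

    % Crude time estimate from an operation count and a FLOPS rating:
    \[
      t_{\text{estimate}} \approx
        \frac{\text{floating point operations required}}{\text{sustained FLOP/s of the machine}}
    \]
    % e.g. 10^9 operations on a machine sustaining 10 GFLOP/s gives roughly
    % 0.1 s; in practice memory and I/O usually make the real time larger.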
Donald Knuth addressed this very problem in writing Volume 1 of "The Art of Computer Programming".
In the preface he gives a lengthy justification for presenting algorithms in the assembly code for an imaginary machine -
... To avoid this dilemma, I have attempted to design an "ideal" computer called "MIX," with very simple rules of operation ...
That way, one can talk sensibly about how many "cycles" an algorithm would take, without having to care about differences between machines, caching, latency, pipelines, or any of the other ways computers have been optimized to save time, at the expense of knowing how long they will take.

efficiency /algorithms vs system specifications

We all talk about the efficiency of algorithms, and it depends on input size, basically.
How about the system specifications of the computer that runs the algorithm? Does it make any difference to run a different sorting algorithm on a Core 2 Duo 2.6 GHz, 4 GB RAM computer or on a P-2 with 256 MB RAM?
I am sure that there must be a performance difference. But I want to know what the real relationship between algorithms and system specifications is...
An increase in hardware performance will change the running time of your algorithm by a constant factor C. Meaning if you have computer A which is overall 2 times slower than computer B, then your algorithm will be twice as fast on computer B. Twice as fast, though, makes hardly any difference when you consider big input values to an algorithm.
In big-O notation, that is to say you will have something like O(n) compared to c*O(n) = O(cn) = O(n). The complexity of the algorithm and the general running time for large values will be about the same on both Computer A and Computer B.
If you analyze an algorithm's running time using something like big O notation, then you will have a much better idea about how the algorithm really works. Computer performance won't give you any kind of advantage when you are comparing an algorithm that is O(logn) compared to O(n^2).
Take a look at some of the data values for n:
I will assume 1 operation per second for the slow computer, and 2 operations per second for the fast computer. I will compare the better algorithm on the slow computer with the worse algorithm on the fast computer.
for n = 10:
Algorithm 1: O(logn): 4 operations
Slow computer: 4 seconds
Algorithm 2: O(n^2): 100 operations
Fast computer: 50 seconds
for n = 100:
Algorithm 1: O(logn): 7 operations
Slow computer: 7 seconds
Algorithm 2: O(n^2): 10,000 operations
Fast computer: 1.4 hours
Large difference
for n = 1,000:
Algorithm 1: O(logn): 10 operations
Slow computer: 10 seconds
Algorithm 2: O(n^2): 1,000,000 operations
Fast computer: 5.8 days
Huge difference
As n increases, the difference gets bigger and bigger.
Now if you tried to run each of these algorithms on a faster or slower computer for a large input size, it wouldn't matter: hands down, the O(logn) algorithm would be faster.
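For reference, a small C++ sketch that reproduces those numbers under the same assumed cost model (1 operation per second on the slow machine running the O(log n) algorithm, 2 per second on the fast machine running the O(n^2) one); it just prints the raw times in seconds, so 5,000 s is the 1.4 hours and 500,000 s the 5.8 days quoted above.

    #include <cmath>
    #include <cstdio>

    int main() {
        const double slow_ops_per_sec = 1.0;  // slow computer, O(log n) algorithm
        const double fast_ops_per_sec = 2.0;  // fast computer, O(n^2) algorithm

        for (double n : {10.0, 100.0, 1000.0}) {
            double logn_ops = std::ceil(std::log2(n));  // cost model for algorithm 1
            double n2_ops   = n * n;                    // cost model for algorithm 2
            std::printf("n = %6.0f: O(log n) on slow = %8.0f s, O(n^2) on fast = %10.0f s\n",
                        n, logn_ops / slow_ops_per_sec, n2_ops / fast_ops_per_sec);
        }
    }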
I don't like the answers provided by Brian Bondy and Czimi...
Perhaps this is because I started in a different era, when 32K was considered a lot of memory, and most "personal computers" had 8K bytes, and that now I work in scientific computing where the largest data sets are processed on some of the world's largest systems with thousands of processing nodes and seemingly unbelievable quantities of storage. Therefore I don't overlook certain other elements of the question.
The size of the data set in question makes a fantastic difference. Most all the answers on this question so far ignore this and work for very small numbers N. The other people who have answered have all presumed "it all fits in memory," or something close to that.
For large data sets other factors come into play, and "large" depends on what resources you have to use in solving your problem. Modern systems have the opportunity for off-line storage (e.g. DVDs), networked storage (e.g. nfs), on-line storage (e.g. serial ATA), and two levels of memory storage, system main memory and on-chip cache. How these are leveraged matters and the larger the data set the more they matter. You may or may not need to design access to these into your "algorithm", but if you do, it really matters!
As you increase scale beyond some particular point - the limit of a single CPU and its local memory is about right - these other factors become an increasingly large factor in the overhead of the workload. When I was at Digital, we did some of the first real commercial work on multi-CPU systems and I remember running a benchmark that showed that, using a single CPU as one "unit" of CPU workload capability, a second CPU (in a tightly coupled system) would give you a total of about 1.8. That is, the second CPU added about 0.8. For three, the increase dropped to about 0.6, and for four it dropped a lot more, to about 0.2, for a grand total of about 2.6 for a four-CPU arrangement, though we had some trouble keeping good numbers with four CPUs due to other effects (the measurement effort became a large fraction of the additional resource). ...The bottom line was that multi-CPUs weren't necessarily all they were cracked up to be - four times the CPU does NOT give you four times the processing power, even though in theory you get four times the flops. ...We repeated the work on the Alpha chip, the first multi-core in history, and the results held up pretty well. Surely there could have been optimizations to improve the fraction each additional CPU gave, and surely there has been a lot of work since then to split computing threads more smartly, but you'll never get it all the way to 100% of each new one, in part because they all slow down some (extra overhead) to coordinate.
Small interjection - we had a saying about this work: "Relegate all the Important Stuff to the Compiler!" RISC, get it? This was because the compiler itself had to organize the workload so competing threads didn't step on one another!
Ultimately, processing really massive data sets requires a really smart strategy for moving the data in and out of farther-afield data storage through to local memory. And division of labor within the algorithm is absolutely vital. In work I was doing with Roberto Mechoso at UCLA on Global Circulation Modeling, they had a data-broker design that is illustrative of the attempts people make to do a great job. Frankly, the result wasn't as good as it could have been, but the design ideas that went into it are worth study. ...Presuming you consider this part of your "algorithm" - and not just the bit-twiddling part - then the algorithm's management of resources is one of the most vital aspects of reasonable, if not optimal, resource utilization when doing substantial computing.
...I hope this helps answer your inquiry.
One thing not raised so far is that algorithms are often described in terms of speed, e.g. O(n), O(n log(n)), etc., but they also have characteristics in terms of resource usage, where improved speed, say O(n) versus O(n log(n)), comes at the cost of much greater memory usage. In modern computers, as resources become exhausted, they are typically replaced with larger, slower resources, e.g. swapping memory to disk, where the slower resource is orders of magnitude slower. Thus when we graph the performance of our algorithm against time and expect a straight line, an n log n curve, etc., we often see spikes for large values of n as memory gets exhausted. In this case, the difference between 1 GB and 2 GB of RAM can be huge, so in practical terms the answer to your question is yes: system specification is very important, and selection of algorithms requires knowledge of the system specification and the size of the input data.
For example, I develop surface modelling and analysis software, and I know that my programs work well on a 32-bit XP box for TIN models of 4 million points. The performance difference between 3.5 million and 4 million points is minor. At 4.5 million points the performance degradation is so severe that the software is unusable.
Yes, it does depend on system specification. One system might be 10 times faster than another, so it will run bubblesort and quicksort on a set of data 10 times faster than the other.
But when you do analysis of algorithms, you often ignore constant factors like that, which is one thing that big-O notation does. So bubblesort is O(n^2) and quicksort is O(nlogn) (in the average case), and that holds no matter how fast your hardware is.
The interesting thing is when you start comparing apples and oranges. If you're running bubblesort on your fast hardware, you may find it's faster than quicksort on the slow hardware -- but only up to a point. Eventually, with a large enough input set, the quicksort on the slow hardware is going to be faster than bubblesort on the fast hardware.
If you want to start making comparisons like that, you need to do two things together: determine algorithmic complexity including the constant factors, and develop a speed model (e.g. how many iterations of a particular loop it can perform per second) for the actual hardware you're running on. One of the interesting things about Knuth's Art of Computer Programming, compared with other books on algorithms, is that he does both, so that for each algorithm he examines, he calculates how many units of execution time it will take for a given size of input on his (mythical) MIX computer. You could then adjust the calculation for faster or slower hardware -- something that big-O notation doesn't help with.
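As a hedged sketch of the "speed model" step (my addition, not from the answer): time both algorithms on your own hardware to measure their constant factors empirically, then scale by the ratio of machine speeds to estimate where the crossover falls. The sizes are arbitrary and std::sort stands in for quicksort here.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Simple O(n^2) bubble sort, for comparison only.
    static void bubble_sort(std::vector<int>& v) {
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            for (std::size_t j = 0; j + 1 < v.size() - i; ++j)
                if (v[j] > v[j + 1]) std::swap(v[j], v[j + 1]);
    }

    static double time_ms(void (*sort_fn)(std::vector<int>&), std::vector<int> v) {
        auto start = std::chrono::steady_clock::now();
        sort_fn(v);                                  // sorts a private copy
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(stop - start).count();
    }

    int main() {
        std::mt19937 rng(42);
        for (std::size_t n : {1000, 5000, 20000}) {
            std::vector<int> data(n);
            for (auto& x : data) x = static_cast<int>(rng());
            double bubble = time_ms(bubble_sort, data);
            double quick  = time_ms([](std::vector<int>& v) { std::sort(v.begin(), v.end()); }, data);
            std::printf("n = %6zu: bubble sort %9.2f ms, std::sort %7.2f ms\n", n, bubble, quick);
        }
    }

Divide the measured per-operation costs by however much faster the other machine is, and you can predict (roughly) at what n the slow machine with the better algorithm overtakes the fast machine with the worse one.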
By your question, do you mean to ask why the efficiency of an algorithm is described only in terms of the input size?
Algorithms are usually described using the Big O Notation. This notation describes the asymptotic behavior of an algorithm; it describes the behavior when the input data is very very large.
So for example, we have two algorithms for sorting.
Algo#1 with O(n)
Algo#2 with O(n^2)
And let's take two PCs:
PC1
PC2 100x faster than PC1
And we have two setups:
PC1 running Algo#1
PC2 running Algo#2
When n is very, very large (like billions?) PC1 will still beat PC2 :)
The efficiency of an algorithm doesn't depend on the system specification. The efficiency is described by its order of growth (the big-O), which gives you the relation between the processing effort and the size of the input.
Certainly yes. With a faster CPU the execution time will be reduced.
Similarly, with more memory, the time taken to swap data (if applicable) will definitely be reduced.
Be aware of the particularities of the language in which you implement your algorithm.
For instance, in the Java world, as illustrated in this article, a faster computer does not always mean a faster runtime:
The same Java program on a single-CPU machine can actually run a lot faster than on a multiprocessor/multi-core machine!
Does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHZ, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
In some cases absolutely! If your data set does not fit into memory you will need to use a disk based sorting algorithm such as merge sort. Quoting from Wikipedia:
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.
Yes. Remember, we have jumped roughly three orders of magnitude in clock speed since the early 80s (1 MHz, 10 MHz, 100 MHz, 1000 MHz).
But that's only the difference between n=10 and n=10,000 in terms of data set sizes. I can purchase a terabyte hard drive... nearly five orders of magnitude larger than my old 20 megabyte drive.
There's a lot more data floating around out there than there is compute power. So while you might be unsure how useful big-O is at n=50 or n=500 kinds of sizes, when n=1,000,000 you want your cost to grow in n as slowly as you can manage. Anything supralinear is just rough on your compute power; non-polynomial is even worse. This extends all the way from the top of the system to the bottom. So efficiency is king as soon as you deal with real-world dataset sizes.
Let me give you an example.
I did a junior level database design. By the end I had maybe 5 tables with maybe 20-40 pre-defined categories in them. Added rows, 10, 20 rows. No big deal. I was a whiz. I did it in PHP, not Perl. I was all that and a bag o' chips.
Move to now, a few years later. I'm doing a hobby project in datamining the stock market. I harvest data off a financial site every day - 6100 stocks, with about 10 columns per stock. Thirty-thousand-plus rows per week. My initial design was "normalized", with static data and dynamic data in different tables. As I played around with my queries, learning about things, if I did a bad join I'd literally crash my server and make it unavailable. So I denormalized. My next phase is tagging and starting actual mining. I don't plan to start making serious predictions until Christmas-time; roughly 11x30K = 330K rows to mine and analyze. Algorithm efficiency will matter if I want to get my data processed in a timely fashion. Doesn't matter if my CPU were 10 times as fast... if I use an N^2 algorithm, it'd only get done 2x as fast. :-)
But, I want to know what is the real relationship between algorithms and system specifications...
I am confused. I see a lot of people here writing a lot of things, however, when I read the quote above as a question, here's all I can say about it:
A better system (= faster CPU, more RAM) runs faster, a worse one (= slower CPU, less RAM) runs slower. The same algorithm (no matter how good or bad it is) will most likely run faster on a better system and slower on the worse one.
A faster algorithm runs faster than a slower one. It will run faster on the slower system and it will run faster on the faster system.
So what exactly was your question again? Is your question "Do we really need a fast algorithm if the system is already that fast? Won't a slow one do as well?" Yes, maybe. But in that case I would ask two questions:
Why select a slow algorithm just because the system is fast? That way your code will only run at decent speed on a very fast system. If you choose a fast algorithm, your code might run at decent speed even on a much worse system.
Why intentionally settle for worse performance? Even though a bad algorithm might run within five seconds, which you consider fast enough on the fast machine, a good one might run in 100 milliseconds. So why make your program take 5 seconds for a task it could perform in 100 milliseconds?
Actually it's point number (2) that really bugs me quite often. So often people say "Hey, don't over-optimize, it won't really matter. This code is only such a small part of such a big system". Yes, if you just look at this code in isolation, that is true. But there is a saying, "Many a mickle makes a muckle". Of course you should optimize the most processor-intensive parts first. However, if a system consists of 100 modules and each of them uses only one percent of the CPU time, optimizing one of them to be twice as fast will only get an overall processing time improvement of 0.5%, close to nothing; that's why most people refrain from doing that. But what people overlook is that optimizing all of them to be twice as fast will get a processing time improvement of 50% (or IOW, the app will run twice as fast as a whole).
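The arithmetic behind that last claim, for a program made of 100 equal parts that each take 1% of the total time:

    % Halving just one of the 100 equal parts:
    \[
      T_{\text{new}} = 99 \cdot 1\% + \tfrac{1\%}{2} = 99.5\%
      \quad\text{(a 0.5\% saving)}
    \]
    % Halving all 100 parts:
    \[
      T_{\text{new}} = 100 \cdot \tfrac{1\%}{2} = 50\%
      \quad\text{(the whole program runs twice as fast)}
    \]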
So unless there is a good reason for not doing it, why not always use the best algorithm known to solve a problem? "Best" means the one that shows good performance and a good CPU time/memory usage ratio (as it's useless to take the fastest one if it needs more memory than a normal customer PC even has).
