The results below were measured using perf on a compute server with 32 cores. I know my implementation is unoptimized; that is on purpose, as I want to make comparisons. I understand that graph algorithms tend to have low locality, which researchers try to address.
I'm unclear about the results, though. The elapsed time is misleading. My implementation traverses a graph with about 4 million nodes in about 10 seconds and spends the rest of the time pre-processing. The optimized version uses the same input and traverses it about 10 times, each traversal taking less than a second, so its elapsed time is almost entirely pre-processing. I'm not trying to achieve the same performance, just to understand the difference based on perf.
I see my page faults are substantially higher. I'm not 100% sure why this is the case, as the annotations (from what I can tell) do not point to any specific piece of my own code...
__gnu_cxx::new_allocator<std::_List_node<int> >::construct<int, int const&>
This seems to come from processing the graph itself, since I create linked lists for the adjacency lists. I figured this might cause issues and wanted to investigate anyway. I should be able to improve page faults (and hopefully performance) by switching to jagged arrays?
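For reference, the jagged-array idea can be sketched roughly as follows. This is written in Python for brevity rather than in the C++ of the implementation in question, and the names `build_csr` and `bfs_levels` are made up for illustration. It packs all adjacency lists into two flat arrays (a compressed sparse row layout), so each vertex's neighbours sit next to each other instead of in separately allocated list nodes.

```python
from array import array
from collections import deque


def build_csr(num_nodes, edges):
    """Pack directed edges (u, v) into flat offsets/neighbors arrays."""
    # Count the out-degree of every node, shifted by one slot.
    offsets = array("l", [0] * (num_nodes + 1))
    for u, _ in edges:
        offsets[u + 1] += 1
    # Prefix sum: offsets[u] is where the neighbours of u start.
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]
    neighbors = array("l", [0] * len(edges))
    cursor = offsets[:num_nodes]  # copy: next free slot for each node
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
    return offsets, neighbors


def bfs_levels(offsets, neighbors, source):
    """Return the BFS distance from source for every node (-1 if unreachable)."""
    level = [-1] * (len(offsets) - 1)
    level[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        # The neighbours of u are one contiguous slice of the flat array.
        for i in range(offsets[u], offsets[u + 1]):
            v = neighbors[i]
            if level[v] == -1:
                level[v] = level[u] + 1
                queue.append(v)
    return level
```

Whether this actually reduces page faults for a given graph is something to confirm with perf; the sketch only shows the layout.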
The optimized algorithm has a much higher last level cache miss which I thought would explain the primary issue with BFS / graph algorithms with low locality but performance seems to be unaffected by this and my unoptimized is significantly lower.
Then there are the frontend / backend stalled cycles, which seem to point in opposite directions when comparing the two - I'm worse in frontend and the optimized version is worse in backend.
Am I missing or not understanding something obvious? I thought there would be something obvious in terms of low locality that would be of issue when looking at perf but I'm confused by the optimized version.
This is my implementation of unoptimized parallel BFS (running once)...
This is using an optimized parallel BFS from a benchmark suite (running 10 times)...
Both take about 40 seconds to pre-process the data once, before doing parallel searching.
Unfortunately perf stat often doesn't give enough information to really determine where the bottleneck in your application is. It is possible to have two applications with wildly different underlying bottlenecks but with very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect while the other is almost unaffected, depending on the amount and nature of overlapping work.
So if you try to analyze in depth from these high level counters, you are often just taking stabs in the dark. Still we can make a few observations. You mention:
The optimized algorithm has a much higher last level cache miss which
I thought would explain the primary issue with BFS / graph algorithms
with low locality but performance seems to be unaffected by this and
my unoptimized is significantly lower.
First, LLC misses are ~620 million for the optimized algorithm and ~380 million for your algorithm, but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses per run, and your algorithm has about six times that number. Yes, your algorithm has a lower LLC miss rate - but the absolute number of LLC misses is what counts for performance. The lower miss rate just means that you are making even more total accesses than the 6x figure suggests: basically you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.
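The per-run comparison is just a division. Here it is spelled out, assuming the "~380" above means 380 million and taking the rounded figures as quoted:

```python
# Rounded figures quoted above; "380" is assumed to mean 380 million.
optimized_llc_misses_total = 620e6   # summed over 10 BFS runs
optimized_runs = 10
unoptimized_llc_misses = 380e6       # a single BFS run

optimized_per_run = optimized_llc_misses_total / optimized_runs
ratio = unoptimized_llc_misses / optimized_per_run

print(f"optimized: {optimized_per_run / 1e6:.0f} million LLC misses per run")
print(f"unoptimized vs optimized, per run: {ratio:.1f}x")
```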
All of this points to accessing more total memory in your unoptimized algorithm, or perhaps accessing it in a much more cache-unfriendly fashion. That would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC). Given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory access problems, stalls while waiting for memory are very likely.
Luckily, there is a better way than just trying to reverse engineer what might be the bottleneck based on perf stat output. Intel has developed a whole methodology which tries to do this type of top-down analysis in a way that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology using Andi Kleen's toplev. I highly recommend you start there.
How does the layout of data in memory affect algorithm performance?
For example, merge sort is known for its computational complexity of O(n log n).
But on a real-world machine, the processing algorithm will load and unload blocks of memory into CPU caches and CPU registers and spend additional time on that.
The elements of the collection to be sorted could be scattered throughout memory, and I wonder whether that will result in slower performance compared with sorting elements that are stored together.
Is it necessary to take into account how collections actually store their data in memory?
In terms of big O notation - no. The time you read each block from
RAM to cpu cache is bounded by some constant, let it be C, so even
if you need to load each element in every iteration from RAM to
cache, you are going to need O(C*nlogn) time, but since C is
constant - it remains O(nlogn) time complexity.
In real-world applications, especially real-time apps, cache performance can indeed be a factor and should be considered, so the order in which data is accessed can matter. This is one of the reasons why quicksort is usually regarded as "faster" - it tends to have nice cache performance.
In addition, there are some algorithms that were developed to enjoy the "best of both worlds": O(nlogn) worst case together with better constants, such as Timsort.
However, as a rule of thumb, you should usually first implement the "easy way", then benchmark to see if it's fast enough, profile if it's not, and optimize the bottleneck. If you try to optimize every piece of your code for best cache performance, you will probably never finish writing it.
Profiling, profiling, profiling.
Modern computer architectures have become so complicated that accurate predictions on the running time have become impossible. You should prefer an experimental approach.
Also note that running times are no longer deterministic, so you should resort to statistical methods.
Architecture killed the algorithmician.
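A minimal sketch of what "experimental plus statistical" can look like in practice: time the same call several times and summarise the samples instead of trusting one run. The helper name `benchmark` and the choice of summary statistics are assumptions for illustration, not a prescribed methodology.

```python
import random
import statistics
import timeit


def benchmark(func, repeats=7, number=5):
    """Time func repeatedly and summarise the per-call samples (seconds)."""
    # timeit.repeat returns one total per repeat, each covering `number` calls.
    totals = timeit.repeat(func, repeat=repeats, number=number)
    per_call = [total / number for total in totals]
    return {
        "min": min(per_call),
        "median": statistics.median(per_call),
        "stdev": statistics.stdev(per_call),
    }


if __name__ == "__main__":
    data = [random.random() for _ in range(100_000)]
    stats = benchmark(lambda: sorted(data))
    print({name: f"{value * 1e3:.2f} ms" for name, value in stats.items()})
```

A large spread between the minimum and the median is itself a hint that caches, the scheduler or other noise are influencing the measurement.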
How does the layout of data in memory affect algorithm performance?
Layout is very important, especially for large amounts of data, because access to main memory is still expensive even for modern CPUs:
http://mechanical-sympathy.blogspot.ru/2013/02/cpu-cache-flushing-fallacy.html
And your algo may spend a lot of time on each cache miss:
http://mechanical-sympathy.blogspot.ru/2012/08/memory-access-patterns-are-important.html
Moreover, there is now a special area of Computer Science devoted to cache-friendly data structures and algos. See, for example (just googled):
http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/LFN02.pdf
etc etc
For many parallel programs, parallelization brings substantial cost, making the speedup sublinear. In this case, the parallel versions are less energy efficient than the sequential one.
However, people may care about both time performance and energy efficiency. Are there any specific metrics commonly used for this purpose?
More specifically, I am looking for a metric that can determine the number of threads that best meets a combined energy and performance goal.
The most common metric is performance per watt. Take a look at the "Green500 List". Wikipedia also has an article on performance per watt. The metric is not as clear cut as it first appears because "performance" is not clear cut. FLOPS is very popular at the moment but it has a lot of deficiencies. I disagree that performance/watt can't be used to evaluate the performance of software. Depending upon your application, you may want to use performance/watt/sec.
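As a rough sketch of how performance per watt could be used to pick a thread count: measure run time and average power for each candidate thread count, then compare. The function names and the shape of the `measurements` dictionary are assumptions for illustration, and the numbers in the demo are placeholders, not measurements.

```python
def performance_per_watt(work_units, seconds, avg_watts):
    """Throughput (work per second) divided by average power draw."""
    return (work_units / seconds) / avg_watts


def energy_joules(seconds, avg_watts):
    """Energy consumed by one run: power multiplied by time."""
    return seconds * avg_watts


def best_thread_count(measurements, work_units):
    """Pick the thread count with the highest performance per watt.

    measurements maps thread count -> (seconds, avg_watts).
    """
    return max(
        measurements,
        key=lambda t: performance_per_watt(work_units, *measurements[t]),
    )


if __name__ == "__main__":
    # Placeholder numbers only; substitute your own measurements.
    measurements = {1: (100.0, 50.0), 2: (55.0, 70.0), 4: (32.0, 110.0)}
    print(best_thread_count(measurements, work_units=1_000_000))
```

Note that for a fixed amount of work, performance per watt is simply work divided by energy, so maximising it is the same as minimising the energy of the run.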
I don’t know why you want to determine energy efficiency if parallelism is costing you. In fact, I don’t really understand how parallelism can be decreasing energy efficiency unless you are using a single core machine, doing pure computation, and are doing a lot of thrashing between threads. I’m guessing that this is not your own code.
Software power efficiency: The most important two factors are:
getting your computation done faster
making sure that periods between computation are truly idle
These factors break down into a whole host of other more concrete guidelines:
avoid timing interrupts and (shudder) polling
minimize synchronization constructs
exploit parallelism (thread and vectorization)
use a good optimizing compiler
use a thread pool if you are continuously creating and terminating a lot of threads
use efficient high performance libraries
avoid virtual machines (e.g. java and flash)
use a modern (tickless) OS
etc. etc. etc
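To illustrate one of the guidelines above (reuse a thread pool rather than continuously creating and terminating threads), here is a minimal sketch. `checksum` is a made-up stand-in for a real unit of work, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(chunk):
    """Stand-in for a real unit of work."""
    return sum(chunk) % 65521


def process_all(chunks, workers=4):
    """Run checksum over every chunk using one long-lived pool."""
    # The pool starts its worker threads once and reuses them for all tasks,
    # instead of creating and tearing down one thread per chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, chunks))
```

This only shows the reuse pattern; in CPython, pure-Python CPU-bound work like this will not run in parallel across threads.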
Dividing your computation between parallel threads should decrease computation times, or else why add its complications? (Yes, I understand that some programming constructs, such as recursion, can result in simpler and cleaner code but worse performance, but these are exceptions.) Decreasing computation should increase energy efficiency. If it doesn't, look at the algorithm and code practice.
If you can give me more detail about your app, I may be able to make more concrete suggestions.
We all talk about the efficiency of algorithms, and it depends basically on input size.
How about the system specifications of the computer that runs the algorithm? Does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHz, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
I am sure that there must be a performance difference. But, I want to know what is the real relationship between algorithms and system specifications...
An increase in hardware performance changes the running time of your algorithm by a constant factor C. That is, if computer A is overall 2 times slower than computer B, then your algorithm will be twice as fast on computer B. Being twice as fast, though, makes hardly any difference when you consider big input values to an algorithm.
In big O notation that is to say you will have something like O(n) compared to C*O(n) = O(Cn) = O(n). The complexity of the algorithm and the general growth of running time for large values will be about the same on both computer A and computer B.
If you analyze an algorithm's running time using something like big O notation, then you will have a much better idea of how the algorithm really behaves. Computer performance won't give you any kind of advantage when you are comparing an algorithm that is O(logn) with one that is O(n^2).
Take a look at some of the data values for n:
I will assume 1 second per operation for the slow computer, and 2 operations per second for the fast computer. I will compare the better algorithm on the slow computer with the worse algorithm on the fast computer.
for n = 10:
Algorithm 1: O(logn): 4 operations
Slow computer: 4 seconds
Algorithm 2: O(n^2): 100 operations
Fast computer: 50 seconds
for n = 100:
Algorithm 1: O(logn): 7 operations
Slow computer: 7 seconds
Algorithm 2: O(n^2): 10,000 operations
Fast computer: 1.4 hours
Large difference
for n = 1,000:
Algorithm 1: O(logn): 10 operations
Slow computer: 10 seconds
Algorithm 2: O(n^2): 1,000,000 operations
Fast computer: 5.8 days
Huge difference
As n increases, the difference gets bigger and bigger.
Now if you tried to run each of these algorithms on a faster or slower computer for a large input size, it wouldn't matter. Hands down, the O(logn) algorithm would be faster.
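The figures above can be reproduced with a few lines. This is just the same toy model (log2(n) rounded up versus n^2 operations, 1 operation per second on the slow machine, 2 on the fast one); the function name `compare` is made up for illustration.

```python
import math

SLOW_OPS_PER_SEC = 1   # slow computer, as assumed above
FAST_OPS_PER_SEC = 2   # fast computer, as assumed above


def compare(n):
    """Return (seconds for O(log n) on the slow machine,
    seconds for O(n^2) on the fast machine)."""
    log_ops = math.ceil(math.log2(n))
    square_ops = n * n
    return log_ops / SLOW_OPS_PER_SEC, square_ops / FAST_OPS_PER_SEC


if __name__ == "__main__":
    for n in (10, 100, 1000):
        slow_log, fast_square = compare(n)
        print(f"n={n}: O(log n) on slow = {slow_log:.0f} s, "
              f"O(n^2) on fast = {fast_square:.0f} s")
```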
I don't like the answers provided by Brian Bondy and Czimi...
Perhaps this is because I started in a different era, when 32K was considered a lot of memory, and most "personal computers" had 8K bytes, and that now I work in scientific computing where the largest data sets are processed on some of the world's largest systems with thousands of processing nodes and seemingly unbelievable quantities of storage. Therefore I don't overlook certain other elements of the question.
The size of the data set in question makes a fantastic difference. Most all the answers on this question so far ignore this and work for very small numbers N. The other people who have answered have all presumed "it all fits in memory," or something close to that.
For large data sets other factors come into play, and "large" depends on what resources you have to use in solving your problem. Modern systems have the opportunity for off-line storage (e.g. DVDs), networked storage (e.g. nfs), on-line storage (e.g. serial ATA), and two levels of memory storage, system main memory and on-chip cache. How these are leveraged matters and the larger the data set the more they matter. You may or may not need to design access to these into your "algorithm", but if you do, it really matters!
As you increase scale beyond some particular point - the limit of a single CPU and its local memory is about right - these other factors become an increasingly large part of the overhead of the workload. When I was at Digital, we did some of the first real commercial work on multi-CPU systems and I remember running a benchmark that showed that, using a single CPU as one "unit" of CPU workload capability, a second CPU (in a tightly coupled system) would give you a total of about 1.8. That is, the second CPU added about 0.8. For three, the increase dropped to about 0.6, and for four it dropped a lot more, to about 0.2, for a grand total of about 2.6 for a four CPU arrangement, though we had some troubles keeping good numbers with four CPUs due to other effects (the measurement effort became a large fraction of the additional resource). ...The bottom line was that multi-CPUs weren't necessarily all they were cracked up to be - four times the CPU does NOT give you four times the processing power, even though in theory you get four times the flops. ...We repeated the work on the Alpha chip, the first multi-core in history, and the results held up pretty well. Surely there could have been optimizations to improve the fraction each additional CPU gave, and surely there has been a lot of work since then to split computing threads more smartly, but you'll never get it all the way to 100% of each new one, in part because they all slow down some (extra overhead) to coordinate.
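Putting the recalled per-CPU contributions side by side makes the diminishing return easy to see. The numbers below are the ones from the anecdote above (1.0, 0.8, 0.6, 0.2), nothing more:

```python
from itertools import accumulate

# Contribution of the 1st, 2nd, 3rd and 4th CPU, as recalled above.
increments = [1.0, 0.8, 0.6, 0.2]

totals = list(accumulate(increments))  # cumulative throughput in "CPU units"
efficiency = [total / cpus for cpus, total in enumerate(totals, start=1)]

for cpus, (total, eff) in enumerate(zip(totals, efficiency), start=1):
    print(f"{cpus} CPU(s): total {total:.1f} units, {eff:.0%} of ideal")
```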
Small interjection - we had a saying about this work: "Relegate all the Important Stuff to the Compiler!" RISC, get it? This was because the compiler itself had to organize the workload so competing threads didn't step on one another!
Ultimately, really massive data crunching requires a really smart strategy for moving the data in and out of farther-afield data storage through to local memory. And division of labor within the algorithm is absolutely vital. In work I was doing with Roberto Mechoso at UCLA on Global Circulation Modeling, they had a data-broker design that is illustrative of the attempts people make to do a great job. Frankly, the result wasn't as good as it could have been, but the design ideas that went into it are worth study. ...Presuming you consider this part of your "algorithm" - and not just the bit twiddling part - then the algorithm's management of resources is one of the most vital aspects of reasonable, if not optimal, resource utilization when doing substantial computing.
...I hope this helps answer your inquiry.
One thing not raised so far is that algorithms are often described in terms of speed, e.g. O(n), O(n log(n)), etc... but they also have characteristics in terms of resource usage, where improved speed, say O(n) versus O(n log(n)), comes at the cost of much greater memory usage. In modern computers, as resources become exhausted, they are typically replaced with larger, slower resources, e.g. swapping memory for disk, where the slower resource is orders of magnitude slower. Thus when we graph the running time of our algorithm against n, and expect a straight line, an n log n curve, etc... we often see spikes for large values of n as memory gets exhausted. In this case, the difference between 1GB and 2GB of RAM can be huge, so in practical terms, the answer to your question is yes: system specification is very important, and selection of algorithms requires knowledge of the system specification and the size of the input data.
For example, I develop surface modelling and analysis software, and I know that my programs work well on a 32-bit XP box for TIN models of 4 million points. The performance difference between 3.5 million and 4 million points is minor. At 4.5 million points the performance degradation is so severe the software is unusable.
Yes, it does depend on system specification. One system might be 10 times faster than another, so it will run bubblesort and quicksort on a set of data 10 times faster than the other.
But when you do analysis of algorithms, you often ignore constant factors like that, which is one thing that big-O notation does. So bubblesort is O(n^2) and quicksort is O(nlogn) (in the average case), and that holds no matter how fast your hardware is.
The interesting thing is when you start comparing apples and oranges. If you're running bubblesort on your fast hardware, you may find it's faster than quicksort on the slow hardware -- but only up to a point. Eventually, with a large enough input set, the quicksort on the slow hardware is going to be faster than bubblesort on the fast hardware.
If you want to start making comparisons like that, you need to do two things together: determine algorithmic complexity including the constant factors, and develop a speed model (e.g. how many iterations of a particular loop it can perform per second) for the actual hardware you're running on. One of the interesting things about Knuth's Art of Computer Programming, compared with other books on algorithms, is that he does both, so that for each algorithm he examines, he calculates how many units of execution time it will take for a given size of input on his (mythical) MIX computer. You could then adjust the calculation for faster or slower hardware -- something that big-O notation doesn't help with.
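A very crude version of such a speed model, to show where the crossover mentioned above lands. Everything here is an assumption for illustration: both constant factors are taken as 1, the cost functions are plain n*log2(n) and n^2, and the fast machine is 10 times faster, as in the example at the top of this answer.

```python
import math


def quicksort_cost(n, ops_per_sec):
    """Modelled seconds for an n*log2(n) algorithm (constant factor assumed 1)."""
    return n * math.log2(n) / ops_per_sec


def bubblesort_cost(n, ops_per_sec):
    """Modelled seconds for an n^2 algorithm (constant factor assumed 1)."""
    return n * n / ops_per_sec


def crossover(slow_ops_per_sec, fast_ops_per_sec, limit=10**7):
    """Smallest n >= 2 at which quicksort on the slow machine beats
    bubblesort on the fast machine, or None if not found below limit."""
    for n in range(2, limit):
        if quicksort_cost(n, slow_ops_per_sec) < bubblesort_cost(n, fast_ops_per_sec):
            return n
    return None


if __name__ == "__main__":
    print(crossover(slow_ops_per_sec=1, fast_ops_per_sec=10))
```

With real constant factors measured on real hardware the crossover point will move, but in this model it exists for any finite speed ratio.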
By your question, do you mean to ask why the efficiency of an algorithm is described only in terms of the input size?
Algorithms are usually described using the Big O Notation. This notation describes the asymptotic behavior of an algorithm; it describes the behavior when the input data is very very large.
So for example, we have two algorithms for sorting.
Algo#1 with O(n)
Algo#2 with O(n^2)
And let's take two PCs:
PC1
PC2 100x faster than PC1
And we have two setups:
PC1 running Algo#1
PC2 running Algo#2
When n is very very large (like billions?) PC1 will still beat PC2 :)
The efficiency of an algorithm doesn't depend on the system specification. The efficiency is described by the Ordo (big-O) notation, which gives you the relation between the processing effort and the size of the input.
Certainly yes. With a faster CPU the execution time will go down.
Similarly, with more memory, the time taken to swap data (if applicable) will go down.
Be aware of the particularities of the language in which you implement your algorithm.
For instance, in the Java world, as illustrated in this article, a faster computer does not always mean a faster runtime:
Same Java program on a single CPU machine can actually run a lot faster than on a multiprocess/multi-core machine!!
Does it make any difference to run a different sorting algorithm in a Core 2 Duo 2.6 GHz, 4 GB RAM-computer or in a P-2, 256 MB RAM-computer?
In some cases absolutely! If your data set does not fit into memory you will need to use a disk based sorting algorithm such as merge sort. Quoting from Wikipedia:
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.
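A minimal sketch of the disk-based merge sort idea mentioned above, assuming the input is a stream of integers: sort chunks that fit in memory, write each one to a temporary file as a sorted run, then merge the runs lazily. The function names, the one-integer-per-line file format and the `chunk_size` parameter are all assumptions for illustration.

```python
import heapq
import os
import tempfile


def _write_run(sorted_values):
    """Write one sorted run to a temporary file, one integer per line."""
    handle = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    with handle:
        handle.writelines(f"{value}\n" for value in sorted_values)
    return handle.name


def _read_run(path):
    """Stream integers back from a run file without loading it whole."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_sort(values, chunk_size):
    """Yield the integers from values in sorted order, holding at most
    chunk_size of them in memory while the runs are being created."""
    run_paths = []
    chunk = []
    try:
        for value in values:
            chunk.append(value)
            if len(chunk) >= chunk_size:
                run_paths.append(_write_run(sorted(chunk)))
                chunk = []
        if chunk:
            run_paths.append(_write_run(sorted(chunk)))
        # heapq.merge pulls from the runs lazily, one element per run at a time.
        yield from heapq.merge(*(_read_run(path) for path in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)
```

Each run is read and written sequentially, which is the access pattern the quoted passage says matters once the data no longer fits in RAM.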
Yes. Remember, we have jumped ~3 orders of magnitude in clock speed since the early 80s (1 MHz, 10 MHz, 100 MHz, 1000 MHz).
But that's only the difference between n=10 and n=10000, in terms of data set sizes. I can purchase a terabyte hard drive... more than 4 orders of magnitude larger than my old 20 megabyte drive.
There's a lot more data floating around out there than there is compute power. So while you might be confused about how useful the big-O is at n=50, n=500 kind of sizes...when n=1,000,000 you want to minimize n as much as you can. Anything supralinear is just rough on your compute power...non-polynomial is even worse. This extends all the way from the top of the system to the bottom. So, efficiency is king as soon as you deal with real-world dataset sizes.
Let me give you an example.
I did a junior level database design. By the end I had maybe 5 tables with maybe 20-40 pre-defined categories in them. Added rows, 10, 20 rows. No big deal. I was a whiz. I did it in PHP, not Perl. I was all that and a bag o' chips.
Move to now, a few years later. I'm doing a hobby project in datamining the stock market. I harvest data off a financial site every day - 6100 stocks, with about 10 columns in each stock. Thirty-thousand+ rows per week. My initial design was "normalized", with static data and dynamic data in different tables. As I played around with my queries, learning about things, if I did a bad join, I'd literally crash my server and make it unavailable. So I denormalized. My next phase is tagging and starting actual mining. I don't plan to start making serious predictions until Christmas-time; roughly 11x30K = 330K rows to mine and analyze. Algorithm efficiency will matter if I want to get my data processed in a timely fashion. It doesn't matter if my CPU were 10 times as fast...if I use an N^2 algorithm, that would only let me handle about 3x as many rows in the same time. :-)
But, I want to know what is the real
relationship between algorithms and
system specifications...
I am confused. I see a lot of people here writing a lot of things, however, when I read the quote above as a question, here's all I can say about it:
A better system (= faster CPU, more RAM) runs faster, a worse one (= slower CPU, less RAM) runs slower. The same algorithm (no matter how good or bad it is) will most likely run faster on a better system and slower on the worse one.
A faster algorithm runs faster than a slower one. It will run faster on the slower system and it will run faster on the faster system.
So what exactly was your question again? Is your question "Do we really need a fast algorithm if the system is already that fast? Won't a slow one do as well?" Yes, maybe. But in that case I would ask two questions:
Why select a slow algorithm just because the system is fast? That way your code will only run at decent speed on a very fast system. If you choose a fast algorithm, your code might even run at decent speed on a much worse system.
Why intentionally try to achieve worse performance? Even though a bad algorithm might run within five seconds, which you consider fast enough on the fast machine, a good one might run in 100 milliseconds. So why make your program perform a task in 5 seconds when it could perform exactly the same task in 100 milliseconds?
Actually it's point number (2) that really bugs me quite often. So often people say "Hey, don't over optimize, it won't really matter. This code is only such a small part of such a big system". Yes, if you just look at this code in isolation, that is true. But there is a saying "Many a mickle makes a muckle". Of course you should optimize the most processor intensive parts first. However, if a system consists of 100 modules and each of them uses only one percent of the CPU time, optimizing one of them to be twice as fast will only get an overall processing time improvement of 0.5%, close to nothing; that's why most people refrain from doing that. But what people overlook is that optimizing all of them to be twice as fast will get a processing time improvement of 50% (or IOW, the app will run twice as fast as a whole).
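The 0.5% versus 50% arithmetic can be checked in a few lines. The helper name `total_time_after` is made up for illustration; the 100 modules at 1% each and the 2x speedup are the figures from the paragraph above.

```python
def total_time_after(shares, speedups):
    """Remaining fraction of total run time when module i, which used
    shares[i] of the time, becomes speedups[i] times faster."""
    return sum(share / speedup for share, speedup in zip(shares, speedups))


shares = [0.01] * 100  # 100 modules, 1% of the CPU time each

one_module = total_time_after(shares, [2.0] + [1.0] * 99)
all_modules = total_time_after(shares, [2.0] * 100)

print(f"one module 2x faster:  {1 - one_module:.1%} less time")
print(f"all modules 2x faster: {1 - all_modules:.1%} less time")
```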
So unless there is a good reason for not doing it, why not always use the best algorithm known to solve a problem? The best means the one that shows good performance and a good CPU time/memory usage ratio (as it's useless to take the fastest one if it needs more memory than a normal customer PC can even take).