String compare performance system-verilog - performance

I was wondering if there was a performance difference in system-verilog when doing a string compare when using these two different methods:
1. str.compare(other_str);
2. str == other_str
If there is a difference, why is there a difference, and where did you get your information from?

I think there are a lot more factors that might affect performance than what you have shown here. Realize that SystemVerilog comes from the merging of multiple languages. Sometimes there are duplication of features which, for historical reasons, prevented removal of the redundancies.
The Questions are:
is the compiler sophisticated enough to generate the same
implementation for both?
if not, do some conditions favor one implementation over another? For example: characterization of the variable types, or storage class of the variables.
and do those conditions affect the compilers ability in generating the same code for both?

Related

How to accurately measure performance of sorting algorithms

I have a bunch of sorting algorithms in C I wish to benchmark. I am concerned regarding good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method. How do I minimize the effect of said variables on the benchmark's results?
To give you a few examples, I've considered multiple implementations on two different languages to adjust for the first two variables. Moreover I could compile the code with different compilers on fairly mundane (and specified) arguments. Now I'm going to be running the test on my machine, which features turbo boost and whatnot and often boosts a core running stuff to the moon. Of course I will be disabling that and doing multiple runs and likely taking their mean completion time to adjust for that as well. Regarding the input data, I will be taking different array sizes, from very small to relatively large. I do not know what the increments should ideally be like, and what the range of the elements should be as well. Also I presume duplicate elements should be allowed.
I know that theoretical analysis of algorithms accounts for all of these methods, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with strict methodology for studying a topic. Thank you.
You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.
Choose a couple different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen2, and an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those, too. (Simulation with GEM5 or similar performance simulators might be worth trying. Also maybe relevant are low-power implementations like Intel Silvermont whose out-of-order window is much smaller, but also have a shorter pipeline so smaller branch mispredict penalty.)
If some algorithm allows a useful micro-optimization in the source, or that a compiler finds, that's a real advantage of that algorithm.
Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.
Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.
(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test one.
Re: sizes: yes, a variety of sizes would be good.
You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults) and CPU frequency, and more. Idiomatic way of performance evaluation?
taking their mean completion time
You might want to discard high outliers, or take the median which will have that effect for you. Usually that means "something happened" during that run to disturb it. If you're running the same code on the same data, often you can expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect branches aliasing each other in predictors or not, or data-cache conflict misses, but tiny changes in one part of the code can change performance of other code via effects like that if you're re-compiling.)
If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method.
Let's look at this from a very different perspective - how to present information to humans.
With 2 variables you get a nice 2-dimensional grid of results, maybe like this:
A = 1 A = 2
B = 1 4 seconds 2 seconds
B = 2 6 seconds 3 seconds
This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations - "A=1 is twice as fast as A=2 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").
With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on 2-dimensional screen" and more dimensions becomes a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids); but that doesn't help humans much.
Your primary goal is to reduce the number of variables.
To reduce the number of variables:
a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).
b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.
Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.
Your secondary goal is to reduce number of values each variable can have.
You don't want massive table/s with 12345678 million rows; and you don't want to spend the rest of your life benchmarking to generate such a large table.
To reduce the number of values each variable can have:
a) Figure out which values matter most
b) Select the right number of values in order of importance (and ignore/skip all other values)
For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
How do I minimize the effect of said variables on the benchmark's results?
How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected?
By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).
You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").
My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.

When to use references versus types versus boxes and slices versus vectors as arguments and return types?

I've been working with Rust the past few days to build a new library (related to abstract algebra) and I'm struggling with some of the best practices of the language. For example, I implemented a longest common subsequence function taking &[&T] for the sequences. I figured this was Rust convention, as it avoided copying the data (T, which may not be easily copy-able, or may be big). When changing my algorithm to work with simpler &[T]'s, which I needed elsewhere in my code, I was forced to put the Copy type constraint in, since it needed to copy the T's and not just copy a reference.
So my higher-level question is: what are the best-practices for passing data between threads and structures in long-running processes, such as a server that responds to queries requiring big data crunching? Any specificity at all would be extremely helpful as I've found very little. Do you generally want to pass parameters by reference? Do you generally want to avoid returning references as I read in the Rust book? Is it better to work with &[&T] or &[T] or Vec<T> or Vec<&T>, and why? Is it better to return a Box<T> or a T? I realize the word "better" here is considerably ill-defined, but hope you'll understand my meaning -- what pitfalls should I consider when defining functions and structures to avoid realizing my stupidity later and having to refactor everything?
Perhaps another way to put it is, what "algorithm" should my brain follow to determine where I should use references vs. boxes vs. plain types, as well as slices vs. arrays vs. vectors? I hesitate to start using references and Box<T> returns everywhere, as I think that'd get me a sort of "Java in Rust" effect, and that's not what I'm going for!

Are muxes more "expensive" than other logic?

This is mostly out of curiosity.
One fragment from some VHDL code that I've been working on recently resembles the following:
led_q <= (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena;
This is a mux-style expression, of course. But it's also equivalent to the following basic logic expression (at least when ignoring non-binary states):
led_q <= ch_ena and (pwm_d or not pwm_ena);
Is one "better" than the other in terms of logic utilisation or efficiency when actually implemented in an FPGA? Is it preferable to use one over the other, or is the compiler smart enough to pick the "best" on its own?
(For the curious, the purpose of the expression is to define the state of an LED -- if ch_ena is false it should always be off as the channel is disabled, otherwise it should either be on solidly or flashing according to pwm_d, according to pwm_ena (PWM enable). I think the first form describes this more obviously than the second, although it's not too hard to realise how the second behaves.)
For a simple logical expression, like the one shown, where the synthesis tool can easily create a complete truth table, the expression is likely to be converted to an internal truth table, which is then directly mapped to the available FPGA LUT resources. Since the truth table is identical for the two equivalent expressions, the hardware will also be the same.
However, for complex expressions where a complete truth table can't be generated, e.g. when using arithmetic operations, and/or where dedicated resources are available, the synthesis tool may choose to hold an internal representation that is more closely related to the original VHDL code, and in this case the VHDL coding style can have a great impact on the resulting logic, even for equivalent expressions.
In the end, the implementation is tool specific, so the best way to find out what logic is generated is to try it with the specific tool, in special for large or timing critical parts of the design, where the implementation is critical.
In general it depends on the target architecture. For Xilinx FPGAs the logic is mostly mapped into LUTs with sporadic use of the hard logic resources where the mapper can make use of them. Every possible LUT configuration has essentially equal performance so there's little benefit to scrutinizing the mapper's work unless you're really pushing the speed limits of the device where you'd be forced into manually instantiating hand-mapped LUTs.
Non-LUT based architectures like the Actel/Microsemi device families use 2-input muxes as the main logic primitive and everything is mapped down to them. You can't generalize what is best across all types of FPGAs and CPLDs but nowadays you can mostly trust that the mapper will do a decent enough job using timing constraints to push it toward the results you need.
With regards to the question I think it is best to avoid obscure Boolean expressions where possible. They tend to be hard to decipher months later when you forgot what you meant them to do. I would lean toward the when-else simply from a code maintenance point of view. Even for this trivial example you have to think closely about what behavior it describes whereas the when-else describes the intended behavior directly in human level syntax.
HDLs work best when you use the highest abstraction possible and avoid wallowing around with low-level bit twiddling. This is a place where VHDL truly shines if you leverage the more advanced features of the language and move away from describing raw logic everywhere. Let the synthesizer do the work. Introductory learning materials focus on the low level structural gate descriptions and logic expressions because that is easiest for beginners to get a start on but it is not the best way to use VHDL for complex designs in the long run.
Of course there are situations where Booleans are better, particularly when doing bitwise operations across vectors in parallel which requires messy loops to do the same imperatively. It all depends on the context.

Computing Efficiency Question

I'm wondering about computational efficiency. I'm going to use Java in this example, but it is a general computing question. Lets say I have a string and I want to get the value of the first letter of the string, as a string. So I can do
String firstletter = String.valueOf(somestring.toCharArray()[0]);
Or I could do:
char[] stringaschar = somestring.toCharArray();
char firstchar = stringaschar[0];
String firstletter = String.valueOf(firstchar);
My question is, are the two ways essentially the same, computationally? I mean, the second way I explicitly had to create 2 intermediate variables, to be stored in memory (the stack?) temporarily.
But the first way, too, the computer will have to still create the same variables, implicitly, right? And the number of operations doesn't change. My thinking is, the two ways are the same. But I'd like to know for sure.
In most cases the two ways should produce the same, or nearly the same, object code. Optimizing compilers usually detect that the intermediate variables in the second option are not necessary to get the correct result, and will collapse the call graph accordingly.
This all depends on how your Java interpreter decides to translate your code into an intermediary language for runtime execution. It may actually have optimizations which translate the two approaches to be the same exact byte code.
The two should be essentially the same. In both cases you make the same calls converting the string to an array, finding the first character, and getting the value of the character. There may be minor differences in how the compiler handles these, but they should be insignficant.
The earlier answers are coincident and right, AFAIK.
However, I think there are a few additional and general considerations you should be aware of each time you wonder about the efficiency of any computational asset (code, for example).
First, if everything is under your strict control you could in principle count clock cycles one by one from assembly code. Or from some more abstract reasoning find the computational cost of an operation/algorithm.
So far so good. But don't forget to measure afterwards. You may find that measuring execution times is not so easy and straightforward, and sometimes is elusive (How to account for interrupts, for I/O wait, for network bottlenecks ...). But it pays. You ask here for counsel, but YOUR Compiler/Interpreter/P-code generator/Whatever could be set with just THAT switch in the third layer of your config scripts.
The other consideration, more to your current point is the existence of Black Boxes. You are not alone in the world and a Black Box is any piece used to run your code, which is essentially out of your control. Compilers, Operating Systems, Networks, Storage Systems, and the World in general fall into this category.
What we do with Black Boxes (they are black, either because their code is not public or because we just happen to use our free time fishing instead of digging library source code) is establishing mental models to help us understand how they work. (BTW, This is an extraordinary book about how we humans forge our mental models). But you should always beware that they are models, not the real thing. Models help us to explain things ... to a certain extent. Classical Mechanics reigned until Relativity and Quantum Mechanics fluorished. None of them is wrong They have limits, and so have all our models.
Even if you happen to be friend with your router OS, or your Linux kernel, when confronting an efficiency problem, design a good experiment and measure.
HTH!
NB: By design a good experiment I mean beware of the tar pits. Examples: measuring your measurement code instead the target of the experiment, being influenced by external factors, forget external factors that will influence the production code, test with data whose cardinality, orthogonality, or whatever-ality is dissimilar with the "real world", mapping wrongly the production and testing Client/server workhorses, et c, et c, et c.
So go, and meassure your code. Your results will be the most interesting thing in this page.

Is there a relation between static code analysis and application performance

My Question:
Performance tests are generally done after an application is integrated with various modules and ready for deploy.
Is there any way to identify performance bottlenecks during the development phase. Does code analysis throw any hints # performance?
It all depends on rules that you run during code analysis but I don't think that you can prevent performance bottlenecks just by CA.
From my expired it looks that performance problems are usually quite complicated and to find real problems you have to run performance tests.
No, except in very minor cases (eg for Java, use StringBuilder in a loop rather than string appends).
The reason is that you won't know how a particular piece of code will affect the application as a whole, until you're running the whole application with relevant dataset.
For example: changing bubblesort to quicksort wouldn't significantly affect your application if you're consistently sorting lists of a half-dozen elements. Or if you're running the sort once, in the middle of the night, and it doesn't delay other processing.
If we are talking .NET, then yes and no... FxCop (or built-in code analysis) has a number of rules in it that deal with performance concerns. However, this list is fairly short and limited in nature.
Having said that, there is no reason that FxCop could not be extended with a lot more rules (heuristic or otherwise) that catch potential problem areas and flag them. It's simply a fact that nobody (that I know of) has put significant work into this (yet).
Generally, no, although from experience I can look at a system I've never seen before and recognize some design approaches that are prone to performance problems:
How big is it, in terms of lines of code, or number of classes? This correlates strongly with performance problems caused by over-design.
How many layers of abstraction are there? Each layer is a chance to spend more cycles than necessary, and this effect compounds, especially if each operation is perceived as being "pretty efficient".
Are there separate data structures that need to be kept in agreement? If so, how is this done? If there is an attempt, through notifications, to keep the data structures tightly in sync, that is a red flag.
Of the categories of input information to the system, does some of it change at low frequency? If so, chances are it should be "compiled" rather than "interpreted". This can be a huge win both in performance and ease of development.
A common motif is this: Programmer A creates functions that wrap complex operations, like DB access to collect a good chunk of information. Programmer A considers this very useful to other programmers, and expects these functions to be used with a certain respect, not casually. Programmer B appreciates these powerful functions and uses them a lot because they get so much done with only a single line of code. (Programmers B and A can be the same person.) You can see how this causes performance problems, especially if distributed over multiple layers.
Those are the first things that come to mind.

Resources