Better or Faster: String.Contains vs List.Contains - performance

I know this is a stupid question, but I feel like someone might want to know (or inform a <redacted> co-worker about) this. I am not attaching a specific programming language to this since I think it could apply to all of them. Correct me if I am wrong about that.
Main question: Is it faster and/or better to look for entries in a constant String or in a List<String>?
Details: Let's say that I want to see if a given extension is in a list of supported extensions. Which of the following is best (with regard to programming style) and/or fastest:
const String SUPPORTED = ".exe.bin.sh.png.bmp" /*etc*/;

public static bool IsSupported(String ext) {
    ext = Normalize(ext); // put the extension in some expected state, like lowercase
    // String.Contains
    return SUPPORTED.Contains(ext);
}
final static List<String> SUPPORTED = MakeAnImmutableList(".exe", ".bin", ".sh",
        ".png", ".bmp" /*etc*/);

public static bool IsSupported(String ext) {
    ext = Normalize(ext); // put the extension in some expected state, like lowercase
    // List.Contains
    return SUPPORTED.Contains(ext);
}

First, it is important to note that the two solutions are not functionally equivalent. The substring search will return true for strings like "x" and "exe.bin", while List<String>.contains() will not. In that sense the List<String> version is likely to be closer to the semantics you want. Any performance comparison should keep that in mind.
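To make the difference concrete, here is a quick sketch in Java (List.of needs Java 9+; the extension string is the one from the question):

import java.util.List;

public class ContainsDemo {
    static final String SUPPORTED = ".exe.bin.sh.png.bmp";
    static final List<String> SUPPORTED_LIST = List.of(".exe", ".bin", ".sh", ".png", ".bmp");

    public static void main(String[] args) {
        System.out.println(SUPPORTED.contains("x"));             // true: "x" occurs inside ".exe"
        System.out.println(SUPPORTED.contains(".exe.bin"));      // true: spans two entries
        System.out.println(SUPPORTED_LIST.contains("x"));        // false: exact element match only
        System.out.println(SUPPORTED_LIST.contains(".exe.bin")); // false
    }
}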
Now, on to performance.
Theoretical
From an asymptotic, algorithm-complexity point of view, the List<String>.contains() approach will be faster than the alternative as the length of the strings grows. Conceptually, the String.contains version needs to look for a match at each position in the SUPPORTED String, while the List.contains() version only needs to match starting at the start of each candidate - as soon as it finds a mismatch in the current candidate, it skips to the next. This is related to the note above that the options aren't functionally equivalent: the String.contains option can in theory match a much wider universe of inputs, so it has to do more work before rejecting candidates.
Complexity-wise, this difference could be something like O(N) for List.contains() versus O(N^2) for String.contains(), if you take N to be the number of candidates, assume each candidate has a bounded length, and assume String.contains() uses the usual brute-force "look for a match starting at each position" algorithm. As it turns out, the Java String.contains() implementation isn't exactly doing the basic O(N^2) search, but it isn't doing Boyer-Moore either. In general you can expect that once the substrings get long enough, the List.contains() approach will be faster.
Close(r) to the Metal
From a closer-to-the-metal perspective, both approaches have their advantages. The String.contains() solution avoids the overhead of iterating over the List elements: the entire call will be spent in the intrinsified String.contains implementation, and all the chars making up the SUPPORTED String are contiguous, which is memory-friendly. The List.contains() approach will spend a lot of time doing the double dereferencing needed to go from each List element to the contained String and then to the contained char[] array, and this is likely to dominate if the strings you are comparing against are very short.
On the other hand, the List.contains solution ultimately calls into String.equals, which is likely implemented in terms of Arrays.equals(char[], char[]) - heavily optimized with SSE and AVX intrinsics on x86 platforms and likely to be blazing fast, even compared to the optimized version of String.contains(). So if the Strings become long, again expect List.contains() to pull ahead.
All that said, there is a simple, canonical way to do this quickly: a HashSet<String> containing all the candidate strings. That's just a single String.hashCode() computation (which is cached, and so often "free") and a single lookup in the hash table.
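As a minimal Java sketch of that approach (Set.of needs Java 9+, and toLowerCase stands in for the Normalize step from the question):

import java.util.Set;

public class ExtensionCheck {
    // Hash-based lookup: one (cached) hashCode() plus an equals() against the bucket hit.
    private static final Set<String> SUPPORTED =
            Set.of(".exe", ".bin", ".sh", ".png", ".bmp");

    public static boolean isSupported(String ext) {
        return SUPPORTED.contains(ext.toLowerCase());
    }
}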

Well, it can vary from implementation to implementation, but if you want to look at this problem in a generalized way, let's see.
If you want to look for a specific substring inside a string - say, a file extension inside an immutable string containing different extensions - you only need to traverse the string of extensions once.
On the other hand, with a list of immutable strings, you still need to traverse each one of the strings in that list, plus bear the overhead of iterating over the list itself.
As a conclusion, in a generalized way, you can see that using a list to store the strings needs more processing.
But you can also weigh both solutions by their readability, maintainability, etc. For example, if you want to add or remove extensions, or apply more complex operations, the overhead of using a list of strings may be worth it.

Related

What's the most efficient way of combining switch/if statements

This question doesn't address any programming language in particular but of course I'm happy to hear some examples.
Imagine a big number of files, let's say 5000, whose names are all kinds of letters and numbers. Then there is a method that receives user input that acts as an alias in order to display a file. Without having the files sorted into folders, the method(s) need to return the file name that is associated with the alias the user provided.
So let's say the user input "gd322" stands for the file named "k4e23"; the method would look like:
if (input.equals("gd322")) {
    return "k4e23";
}
Now, imagine having 4 values in that method:
switch (input) {
    case gd322: return fw332;
    case g344d: return 5g4gh;
    case s3red: return 536fg;
    case h563d: return h425d;
} // switch on string, no break, no string indicators, ..., pls ignore the syntax, it's just pseudo
Keeping in mind we have 5000 entries, there are probably more than just 2 entries starting with g. Now, if the user input starts with 's', instead of wasting CPU cycles checking all the a's, b's, c's, ..., we could make another switch on the first character, which then dispatches to the 'next' methods like this:
switch (input[0]) { // implying we could access strings like that
    case a: switchA(input);
    case b: switchB(input);
    // [...]
    case g: switchG(input);
    case s: switchS(input);
}
So the CPU doesn't have to check on all of them, but rather calls a method like this:
switchG(String input) {
    switch (input) {
        case gd322: return fw332;
        case g344d: return 5g4gh;
        // [...]
    }
}
Is there any field of computer science dealing with this? I don't know what to call it, and therefore don't know how to search for it, but I think my thoughts make sense on a large scale. Please move the thread if it doesn't belong here, but I really want to see your thoughts on this.
EDIT: don't quote me on that "5000"; I am not in the situation described above and I wanted to discuss this purely theoretically. It could also be 3 entries or 300'000, maybe even less or more.
If you have 5000 options, you're probably better off hashing them than having hard-coded if/switch statements. In C++ you could also use std::map to pair a function pointer or other option-handling information with each possible option.
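For illustration, a minimal sketch of the hashing idea in Java rather than C++ (Map.of needs Java 9+; the alias/file pairs are the hypothetical ones from the question):

import java.util.Map;

public class AliasLookup {
    // Alias -> file name; average O(1) lookup regardless of how many entries there are.
    private static final Map<String, String> FILES = Map.of(
            "gd322", "fw332",
            "g344d", "5g4gh",
            "s3red", "536fg",
            "h563d", "h425d");

    public static String fileFor(String alias) {
        return FILES.get(alias); // null if the alias is unknown
    }
}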
Interesting, but I don't think you can give a generic answer. It all depends on how the code is executed. Many compilers will have all kinds of optimizations, in the if and switch, but also in the way strings are compared.
That said, if you have actual (disk) files with those lists, then reading the file will probably take much longer than processing it, since disk I/O is very slow compared to memory access and CPU processing.
And if you have a list like that, you may want to build a hash table, or simply a sorted list/array in which you can perform a binary search. Sorting it also takes time, but if you have to do many lookups in the same list, it may be well worth the time.
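A small Java sketch of the sorted-array variant, assuming the aliases are known up front:

import java.util.Arrays;

public class SortedLookup {
    public static void main(String[] args) {
        String[] aliases = {"s3red", "gd322", "h563d", "g344d"};
        Arrays.sort(aliases); // pay the sort cost once

        // Arrays.binarySearch is O(log n); a negative result means "not present".
        System.out.println(Arrays.binarySearch(aliases, "gd322") >= 0); // true
        System.out.println(Arrays.binarySearch(aliases, "zzzzz") >= 0); // false
    }
}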
Is there any field of computer science dealing with this?
Yes, the science of efficient data structures. Well, isn't that what CS is all about? :-)
The algorithm you described resembles a trie. It wouldn't be statically encoded in the source code with switch statements; it would use dynamic lookups in a structure loaded from somewhere instead, but the idea is the same.
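A minimal Java sketch of such a trie (the node layout here is illustrative only):

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value; // set only at the end of a stored key
    }

    private final Node root = new Node();

    public void put(String key, String value) {
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.value = value;
    }

    public String get(String key) {
        Node n = root;
        for (char c : key.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return null; // dead end: no stored key along this path
        }
        return n.value;
    }
}

Each character consumed narrows the search to a single subtree, which is exactly what the hand-written switch-per-character above does, only data-driven.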
Yes, the problem has been known, and solved, for decades: hash functions.
Basically you have a set of values (here, strings like "gd322" and "g344d") and you want to know whether some other value v is among them.
The idea is to put the strings in a big array, at an index calculated from their values by some function. Given a value v, you compute an index the same way, and check whether the value v is there or not. Much faster than checking the whole array.
Of course there is a problem with different values landing in the same place: collisions. Some magic is needed then: perfect hash functions, whose coefficients are tweaked so that the values from the initial set don't cause any collisions.
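A toy Java sketch of that idea (String.hashCode stands in for the tuned function; the collision check marks exactly the spot where a perfect hash function earns its keep):

public class ToyHashTable {
    private final String[] table;

    public ToyHashTable(String[] values, int size) {
        table = new String[size];
        for (String v : values) {
            int i = index(v);
            if (table[i] != null)
                throw new IllegalStateException("collision: tweak the size or the function");
            table[i] = v;
        }
    }

    private int index(String v) {
        return Math.floorMod(v.hashCode(), table.length); // the "some function" above
    }

    public boolean contains(String v) {
        return v.equals(table[index(v)]); // one hash, one comparison
    }
}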

How does JIT optimize branching while processing elements of collections? (in Scala)

This is a question about performance of code written in Scala.
Consider the following two code snippets, assume that x is some collection containing ~50 million elements:
def process[T](x: Traversable[T]) = {
    processFirst(x.head)
    x reduce processPair
    processLast(x.last)
}
Versus something like this (assume for now we have some way to determine if we're operating on the first element versus the last element):
def isFirstElement[T](x: T) = ???
def isLastElement[T](x: T) = ???
def process[T](x: Traversable[T]) = {
    x reduce { (left, right) =>
        if (isFirstElement(left))
            processFirst(left)
        else if (isLastElement(right))
            processLast(right)
        processPair(left, right)
    }
}
Which approach is faster? And for ~50 million elements, how much faster?
It seems to me that the first example would be faster, because there are fewer conditional checks occurring for all but the first and last elements. However, for the latter example there is some argument to suggest that the JIT might be clever enough to optimize away those additional head/last conditional checks that would otherwise occur for all but the first/last elements.
Is the JIT clever enough to perform such optimizations? The obvious advantage of the latter approach is that all the business logic can be placed in the same function body, while in the former case the logic must be partitioned into three separate function bodies invoked separately.
** EDIT **
Thanks for all the great responses. While I am leaving the second code snippet above to illustrate its incorrectness, I want to revise the first approach slightly to better reflect the problem I am attempting to solve:
// x is some iterator
def process[T](x: Iterator[T]) = {
    if (x.hasNext) {
        var previous = x.next
        var current: T = null.asInstanceOf[T] // nullable mutable state, see below
        processFirst(previous)
        while (x.hasNext) {
            current = x.next
            processPair(previous, current)
            previous = current
        }
        processLast(previous)
    }
}
While there are no additional checks occurring in the body, there is an additional reference assignment that appears to be unavoidable (previous = current). This is also a much more imperative approach that relies on nullable mutable variables. Implementing this in a functional yet high-performance manner would be another exercise for another question.
How does this code snippet stack up against the latter of the two examples above (the single-iteration block approach containing all the branches)? The other thing I realize is that the latter of the two examples is also broken on collections containing fewer than two elements.
If your underlying collection has inexpensive head and last methods (not true for a generic Traversable), and the reduction operations are relatively inexpensive, then the second way takes about 10% longer (maybe a little less) than the first on my machine. (You can use a var to get the first element, and you can keep updating a second var with the right-hand argument to obtain the last, and then do the final operation outside of the loop.)
If you have an expensive last (i.e. you have to traverse the whole collection), then the first operation takes about 10% longer (maybe a little more).
Mostly you shouldn't worry too much about it and instead worry more about correctness. For instance, in a 2-element list your second code has a bug (because there is an else instead of a separate test). In a 1-element list, the second code never calls reduce's lambda at all, so again fails to work.
This argues that you should do it the first way unless you're sure last is really expensive in your case.
Edit: if you switch to a manual reduce-like operation using an iterator, you might be able to shave off up to about 40% of your time compared to the expensive-last case (e.g. a list). For an inexpensive last, probably not so much (up to ~20%). (I get these values when operating on lengths of strings, for example.)
First of all, note that, depending on the concrete implementation of Traversable, doing something like x.last may be really expensive. Like, more expensive than all the rest of what's going on here.
Second, I doubt the cost of conditionals themselves is going to be noticeable, even on a 50 million collection, but actually figuring out whether a given element is the first or the last, might again, depending on implementation, get pricey.
Third, JIT will not be able to optimize the conditionals away: if there was a way to do that, you would have been able to write your implementation without conditionals to begin with.
Finally, if you are at a point where it starts looking like an extra if statement might affect performance, you might consider switching to Java or even C. Don't get me wrong, I love Scala, it is a great language, with lots of power and useful features, but being super-fast just isn't one of them.

Mathematica's pattern matching poorly optimized?

I recently inquired about why PatternTest was causing a multitude of needless evaluations: PatternTest not optimized? Leonid replied that it is necessary for what seems to me a rather questionable method. I can accept that, though I would prefer a more efficient alternative.
I now realize, which I believe Leonid has been saying for some time, that this problem runs much deeper in Mathematica, and I am troubled. I cannot understand why this is not or cannot be better optimized.
Consider this example:
list = RandomReal[9, 20000];
Head /@ list; // Timing
MatchQ[list, {x__Integer, y__}] // Timing
{0., Null}
{1.014, False}
Checking the heads of the list is essentially instantaneous, yet checking the pattern takes over a second. Surely Mathematica could recognize that since the first element of the list is not an Integer, the pattern cannot match, and unlike the case with PatternTest I cannot see how there is any mutability in the pattern. What is the explanation for this?
There appears to be some confusion regarding packed arrays, which as far as I can tell have no bearing on this question. Rather, I am concerned with the O(n^2) time complexity on all lists, packed or unpacked.
MatchQ unpacks for these kinds of tests. The reason is that no special case for this has been implemented. In principle it could contain anything.
On["Packing"]
MatchQ[list, {x_Integer, y__}] // Timing
MatchQ[list, {x__Integer, y__}] // Timing
Improving this is very tricky - if you break the pattern matcher you have a serious problem.
Edit 1:
It is true that the unpacking is not the cause of the O(n^2) complexity. It does, however, show that for the MatchQ[list, {x__Integer, y__}] part the code goes to another part of the algorithm (which needs the lists to be unpacked). Some other things to note: this complexity arises only if both patterns are __; if either one of them is _, the algorithm has better complexity.
The algorithm then goes through all n*n potential matches, and there seems to be no early bailout. Presumably this is because other patterns could be constructed that would need this complexity - the issue is that the above pattern forces the matcher into a very general algorithm.
I was then hoping for MatchQ[list, {Shortest[x__Integer], __}] and friends, but to no avail.
So, my two cents: either use a different pattern (and turn On["Packing"] to see whether it goes to the general matcher) or do a pre-check like Developer`PackedArrayQ[expr] && Head[expr[[1]]] === Integer or some such.
@the author of the first answer: As far as I know from reverse-engineering and reading the available information, it may be due to the different ways the patterns are checked. In fact - as they say - a special hash code is used for pattern matching. This hash (basically an FNV-1 round) makes it very easy to check for particular patterns related to the type of expression involved (a matter of a few XOR operations). The hashing algorithm cycles through the expression, and each subpart is XORed with the output of the previous one. Special XOR values are used for each atomic expression - machine ints, machine reals, bignums, rationals and so on. Hence, for example, _Integer is easy to check, because the hash of any integer is formed with the integer's XOR value, so all we need to do is the inverse operation and see whether it matches - i.e. whether we get some particular value or something like that (sorry if I'm vague on actual implementation details; it's WIP). For general or uncommon patterns the check may not be able to take advantage of this hashing and requires something different.
@the OP: Head[] simply acts on the internal expression, taking the value of the first pointer of the expression (expressions are implemented as arrays of pointers). So doing that is as easy as copying and printing a string - very, very fast. The pattern-matching engine is not even called in this case.

Finding substring in assembly

I'm wondering if there is a more efficient method of finding a substring in assembly than what I am currently planning to do.
I know the string instructions scasb/scasw/scasd can compare a value in AL/AX/EAX to a value addressed by EDI. However, as far as I understand, I can only search for one character at a time using this approach.
So, if I want to find the location of "help" in the string "pleasehelpme", I could use scasb to find the offset of the 'h', then jump to another routine where I compare the remainder. If the remainder isn't correct, I jump back to scasb and try searching again, this time after the previous offset.
However, I would hate to do this and then discover there is a more efficient method. Any advice? Thanks in advance
There are indeed more efficient ways, both instruction-wise and algorithmically.
If you have the hardware, you can use the SSE 4.2 string-compare instructions, which are very fast. See an overview at http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/intref_cls/common/intref_sse42_comp.htm and an example using the C intrinsics at http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/
If you have long substrings or multiple search patterns, the Boyer-Moore, Knuth-Morris-Pratt and Rabin-Karp algorithms may be more efficient.
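For reference, here is a compact Knuth-Morris-Pratt search, sketched in Java rather than assembly; the same table-driven skeleton is what you would port down to asm:

public class Kmp {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    public static int indexOf(String haystack, String needle) {
        if (needle.isEmpty()) return 0;
        int[] fail = failureTable(needle);
        int j = 0; // number of needle chars matched so far
        for (int i = 0; i < haystack.length(); i++) {
            while (j > 0 && haystack.charAt(i) != needle.charAt(j)) {
                j = fail[j - 1]; // fall back in the needle instead of re-scanning the haystack
            }
            if (haystack.charAt(i) == needle.charAt(j)) j++;
            if (j == needle.length()) return i - j + 1;
        }
        return -1;
    }

    // fail[k] = length of the longest proper prefix of needle[0..k] that is also a suffix.
    private static int[] failureTable(String needle) {
        int[] fail = new int[needle.length()];
        int j = 0;
        for (int i = 1; i < needle.length(); i++) {
            while (j > 0 && needle.charAt(i) != needle.charAt(j)) j = fail[j - 1];
            if (needle.charAt(i) == needle.charAt(j)) j++;
            fail[i] = j;
        }
        return fail;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("pleasehelpme", "help")); // 6
    }
}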
I don't think there is a more efficient method (only some optimizations that can be done to this method). Also this might be of interest.
scasb is the assembly counterpart of strchr, not strstr. If you want a really efficient method, you have to use a better algorithm.
For example, if you search in a long string, then you could try some special algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm

Linq to Objects: filtering performance question

I was thinking about the way LINQ computes things, and it made me wonder:
If I write
var count = collection.Count(o => o.Category == 3);
Will that perform any differently than:
var count = collection.Where(o => o.Category == 3).Count();
After all, IEnumerable<T>.Where() returns IEnumerable<T>, which doesn't have a Count property, so a subsequent Count() would actually have to walk through the items to determine the count, which should cause extra time to be spent on this.
I wrote some quick test code to get some metrics, but they seem to beat each other at random. I won't put the test code in here initially, but if anyone requests it, I will.
So, am I missing something?
There won't be a lot in it, really - both forms will iterate over the collection, check the predicate against each item, and count the matches. Both approaches will stream the data - it's not like Where is actually building an in-memory list of all matches, for example.
The first form has one fewer (thin) layer of indirection in, that's all. The main reason for using it (IMO) is for readability/simplicity, rather than performance.
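For comparison, here is a rough Java-streams analogue of the two forms (hypothetical data; note that Java has no count(predicate) overload, so the fused variant is spelled differently):

import java.util.List;

public class CountDemo {
    public static void main(String[] args) {
        List<Integer> categories = List.of(1, 3, 3, 7, 3);

        // Analogue of collection.Where(o => o.Category == 3).Count():
        // filter() is lazy, no intermediate list is materialized; count() drives the iteration.
        long viaWhere = categories.stream().filter(c -> c == 3).count();

        // Closest analogue of collection.Count(o => o.Category == 3),
        // with the predicate fused into the terminal step:
        long viaFused = categories.stream().mapToLong(c -> c == 3 ? 1 : 0).sum();

        System.out.println(viaWhere + " " + viaFused); // 3 3
    }
}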
As Jon Skeet says, both techniques will have to do essentially the same thing - enumerate the sequence while conditionally incrementing a counter when the predicate is matched. Any performance differences between the two should be slight: insignificant for almost all use-cases. If there is a token winner though, I would think it should be the first one, since from Reflector it appears that the overload of Count that takes a predicate uses its own foreach to enumerate, rather than the more obvious way of offloading the work to a streaming Where piped into a parameterless Count as in your second example. This means technique #1 is likely to have two minor performance benefits:
Fewer argument-validation checks (null tests etc.). Technique #2's Count will also check whether its (piped) input is an ICollection or ICollection<T>, which it can't possibly be.
A single constructed enumerator vs. two enumerators piped together (an additional state machine has costs).
There is one minor point in favour of technique #2, though: Where is slightly more sophisticated in constructing an enumerator for the source sequence; it uses a different one for lists and arrays. This may make it more performant in certain scenarios.
Of course, I should reiterate that I might be plain wrong about my analysis - reasoning about performance through static code analysis, especially when the differences are likely to be slight, is not a good idea. There is only one way to find out - measuring the execution times for your specific setup.
FYI, the source I reflected was from .NET 3.5 SP1.
I know what you are thinking here. At least, I think I do; Count() will look to see if Count is available as a property, and will simply return that if so. Otherwise, it has to enumerate the items to get its return value.
The version of Count() which accepts the predicate, though, will always cause the collection to be iterated, since it has to do it to see which ones match.
The above answers make good points. Consider also that if you break away into any LINQ-to-X implementation with deferred execution (LINQ to SQL being the primary one), the Expression parameters used in these methods may cause different results.
