Finding substring in assembly - algorithm

I'm wondering if there is a more efficient method of finding a substring in assembly than what I am currently planning to do.
I know the string instructions scasb/scasw/scasd can compare the value in AL/AX/EAX to the value addressed by EDI. However, as far as I understand, I can only search for one character at a time this way.
So, if I want to find the location of "help" in the string "pleasehelpme", I could use scasb to find the offset of the h, then jump to another routine where I compare the remainder. If the remainder doesn't match, I jump back to scasb and search again, this time starting after the previous offset.
However, I would hate to do this and then discover there is a more efficient method. Any advice? Thanks in advance
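For reference, here is a minimal high-level sketch of that plan in Java (the method name naiveFind and the use of regionMatches are just illustrative); it is the same control flow that a repne scasb scan followed by a comparison loop would implement:

static int naiveFind(String haystack, String needle) {
    if (needle.isEmpty()) return 0;
    char first = needle.charAt(0);
    for (int i = 0; i + needle.length() <= haystack.length(); i++) {
        if (haystack.charAt(i) != first) continue;                 // the "scasb" step: scan for the first character
        if (haystack.regionMatches(i, needle, 0, needle.length()))
            return i;                                              // the remainder matched
        // otherwise keep scanning after this offset
    }
    return -1;
}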

There are indeed more efficient ways, both instruction-wise and algorithmically.
If you have the hardware you can use the SSE 4.2 string-comparison instructions, which are very fast. See an overview at http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/intref_cls/common/intref_sse42_comp.htm and an example using the C intrinsics at http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/
If you have long substrings or multiple search patterns, the Boyer-Moore, Knuth-Morris-Pratt and Rabin-Karp algorithms may be more efficient.
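To make the algorithmic option concrete, here is a minimal Knuth-Morris-Pratt sketch in Java (the method and array names are illustrative); the point is that the text is never re-scanned after a partial match, unlike the restart-after-the-first-character approach in the question:

static int kmpSearch(String text, String pattern) {
    if (pattern.isEmpty()) return 0;
    // failure[i] = length of the longest proper prefix of pattern[0..i] that is also a suffix of it
    int[] failure = new int[pattern.length()];
    for (int i = 1, k = 0; i < pattern.length(); i++) {
        while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = failure[k - 1];
        if (pattern.charAt(i) == pattern.charAt(k)) k++;
        failure[i] = k;
    }
    // scan the text without ever stepping backwards in it
    for (int i = 0, k = 0; i < text.length(); i++) {
        while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = failure[k - 1];
        if (text.charAt(i) == pattern.charAt(k)) k++;
        if (k == pattern.length()) return i - pattern.length() + 1;
    }
    return -1;
}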

I don't think there is a more efficient method (only some optimizations that can be done to this method). Also this might be of interest.

scasb is the assembly building block for something like strchr (scanning for a single byte), not for strstr. If you want a really efficient method, then you have to use a better algorithm.
For example, if you are searching in a long string, you could try one of the specialized algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm

Related

Better or Faster: String.Contains vs List.Contains

I know this is a stupid question, but I feel like someone might want to know (or inform a <redacted> co-worker about) this. I am not attaching a specific programming language to this since I think it could apply to all of them. Correct me if I am wrong about that.
Main question: Is it faster and/or better to look for entries in a constant String or in a List<String>?
Details: Let's say that I want to see if a given extension is in a list of supported extensions. Which of the following is the best (in regards to programming style) and/or fastest:
const String SUPPORTED = ".exe.bin.sh.png.bmp" /*etc*/;
public static bool IsSupported(String ext){
    ext = Normalize(ext); //put the extension in some expected state like lowercase
    //String.Contains
    return SUPPORTED.Contains(ext);
}

final static List<String> SUPPORTED = MakeAnImmutableList(".exe", ".bin", ".sh",
                                                           ".png", ".bmp" /*etc*/);
public static bool IsSupported(String ext){
    ext = Normalize(ext); //put the extension in some expected state like lowercase
    //List.Contains
    return SUPPORTED.Contains(ext);
}
First, it is important to note that the solutions are not functionally equivalent. The substring search will return true for strings like x and exe.bin while the List<String>.contains() will not. In that sense the List<String> version is likely to be closer to the semantic you want. Any possible performance comparison should keep that in mind.
Now, on to performance.
Theoretical
From an asymptotic and algorithmic complexity point of view, the List<String>.contains() approach will be faster than the other alternative as the length of the strings grows. Conceptually, the String.contains version needs to look for a match at each position in the SUPPORTED String, while the List.contains() version only needs to match starting at the start of each substring - as soon as it finds a mismatch in the current candidate, it skips to the next. This is related to the above note that the options aren't functionally equivalent: the String.contains option can in theory match a much wider universe of inputs, so it has to do more work before rejecting candidates.
Complexity-wise, this difference could be something like O(N) for List.contains() versus O(N^2) for String.contains(), if you take N to be the number of candidates, assume each candidate has a bounded length, and assume String.contains() uses the usual brute-force "look for a match starting at each position" algorithm. As it turns out, the Java String.contains() implementation isn't exactly doing the basic O(N^2) search, but it isn't doing Boyer-Moore either. In general you can expect that once the substrings get long enough, the List<String> approach will be faster.
Close(r) to the Metal
From a closer-to-the-metal perspective, both approaches have their advantages. The String.contains() solution avoids the overhead of iterating over the List elements: the entire call will be spent in the intrinsified String.contains implementation, and all the chars making up the SUPPORTED String are contiguous, so this is memory-friendly. The List.contains() approach will spend a lot of time doing the double dereferencing needed to go from each List element to the contained String and then to the contained char[] array, and this is likely to dominate if the strings you are comparing against are very short.
On the other hand, the List.contains solution ultimately calls into String.equals, which is likely implemented in terms of Arrays.equals(char[], char[]); that method is heavily optimized with SSE and AVX intrinsics on x86 platforms and likely to be blazing fast, even compared to the optimized version of String.contains(). So if the Strings become long, again expect List.contains() to pull ahead.
All that said, there is a simple, canonical way to do this quickly: a HashSet<String> with all the candidate strings. That's just a simple String.hashCode() (which is cached and so often "free") and a single lookup in the hash table.
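A minimal sketch of that HashSet approach (Java 9+ Set.of, with toLowerCase standing in for the Normalize step from the question):

import java.util.Set;

final class Extensions {
    private static final Set<String> SUPPORTED =
            Set.of(".exe", ".bin", ".sh", ".png", ".bmp" /* etc */);

    static boolean isSupported(String ext) {
        ext = ext.toLowerCase();          // stand-in for Normalize(ext) in the question
        return SUPPORTED.contains(ext);   // one hash computation plus one bucket lookup, no scanning
    }
}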
Well, it can vary from implementation to implementation, but if we want to look at this problem in a general way, let's see.
If you want to look for a specific substring inside a string, say a file extension inside an immutable string containing the different extensions, you only need to traverse that string of extensions once.
On the other hand, with a list of immutable strings, you still need to traverse each string in the list, plus you have the overhead of iterating over the list itself.
As a conclusion, viewed in this general way, you can see that using a list to store the strings needs more processing.
But you can also judge both solutions by their readability, maintainability, etc. For example, if you want to add or remove extensions or apply more complex operations, the overhead of using a list of strings may be worth it.

Algorithm for a one-character checksum

I am desperately searching for an algorithm that produces a checksum at most two characters long and can detect transposed characters in the input sequence. When testing different algorithms, such as Luhn, CRC24 or CRC32, the checksums were always longer than two characters. If I reduce the checksum to two or even one character, then not all transpositions are detected any more.
Does any of you know an algorithm that meets my needs? Even just a name with which I could continue my search would help. I would be very grateful for your help.
Given that your data is alphanumeric, that you want to detect all the permutations (in the perfect case), and that you can afford to use a binary checksum (i.e. the full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by @Paul Hankin in the comments), as it is more information-dense than check-digit algorithms like Luhn or Damm, and is more "generic" when it comes to the possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT); you can give it a try here to see how it works for you.
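For reference, a minimal bitwise sketch of CRC-16-CCITT in Java (polynomial 0x1021, initial value 0xFFFF, often called CRC-16/CCITT-FALSE); a table-driven version would be faster, but this shows the idea:

static int crc16Ccitt(byte[] data) {
    int crc = 0xFFFF;                                       // initial value
    for (byte b : data) {
        crc ^= (b & 0xFF) << 8;                             // bring the next byte into the top of the register
        for (int bit = 0; bit < 8; bit++) {
            if ((crc & 0x8000) != 0)
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF;       // shifting out a 1: reduce by the polynomial
            else
                crc = (crc << 1) & 0xFFFF;
        }
    }
    return crc;   // 16-bit checksum, e.g. crc16Ccitt("123456789".getBytes()) == 0x29B1
}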

What is the relationship between LCS and string similarity?

I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed that tool. For example, the authors say that it is based on the GNU diff C library and analyze.c; maybe it refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation to the article. From what I read, the article shows an algorithm for finding the LCS (longest common subsequence) between a pair of strings, using a modification of the dynamic programming algorithm normally used for solving this problem. The modification is the use of a shortest-path algorithm to find the LCS with the minimum number of modifications.
At this point I am lost, because I do not know how the authors of the tool I first mentioned used the LCS to find how similar two sequences are. Also, they have put in a limit value of 0.4; what does that mean? Can anybody help me with this, or have I misunderstood the article?
Thanks
I think the description on the string similarity tool is not being entirely honest, because I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that those characters that are not in the LCS are precisely the characters you would need to add, delete or substitute in order to convert the first string to the second, and the number of these differing characters is usually used as the difference score, as in the Levenshtein Distance.
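As an illustration of that connection, here is a small Java sketch that computes the LCS length with the textbook dynamic program and then derives a normalized score, assuming the common convention similarity = 2 * LCS / (len(a) + len(b)); the actual tool uses the faster O(ND) diff algorithm, but the resulting score has the same meaning:

static double similarity(String a, String b) {
    int n = a.length(), m = b.length();
    int[][] lcs = new int[n + 1][m + 1];
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1))
                lcs[i][j] = lcs[i - 1][j - 1] + 1;                  // this character is part of a common subsequence
            else
                lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
        }
    }
    // Characters outside the LCS are exactly the ones that must be added,
    // deleted or substituted; normalizing gives a score between 0 and 1.
    return (n + m == 0) ? 1.0 : 2.0 * lcs[n][m] / (n + m);
}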

Improvement to Algorithm

I was asked this question in an interview.
If you had two numbers represented in binary form and stored as strings, how would you perform simple addition? This was the easy part. (My solution: run through the shorter one while keeping track of the carry, then repeat for the remaining digits.)
The difficult part was when he asked me:
How would you use hardware to make the process faster?
Any suggestions, SO community?
I'd say, convert them to proper integers, and use the hardware (ALU) to perform the addition, then convert the result back to a string if needed.
Converting the numbers to an integer variable and letting the CPU do the addition immediately springs to mind. You can then convert the number back into a bit string if you choose to.
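A minimal Java sketch of that suggestion, assuming the inputs fit in 64 bits (anything longer would need BigInteger or the manual carry loop from the question):

static String addBinary(String a, String b) {
    long sum = Long.parseLong(a, 2) + Long.parseLong(b, 2);   // the actual addition is done by the ALU
    return Long.toBinaryString(sum);                          // back to a binary string if needed
}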

Branchless Binary Search

I'm curious if anyone could explain a branchless binary search implementation to me. I saw it mentioned in a recent question but I can't imagine how it would be implemented. I assume it could be useful to avoid branches if the number of items is quite large.
I'm going to assume you're talking about the sentence "Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it." found in this answer.
A "branchless" binary search is basically just an unrolled binary search loop. This only works if you know in advance the number of items in the array you're searching (as you would if it's static const). You can write a program to write the unrolled code if it's too long to do by hand.
Then, you must benchmark your solution to see whether it really is faster than a loop. If your branchless code is too big, it won't fit inside the CPU's fast instruction cache and will take longer to run than the equivalent loop.
If one has a function which returns +1, -1, or 0 based upon the position of the correct item relative to the current one, one could initialize position to size/2 and stepsize to position/2, and then after each comparison do position += direction * stepsize; stepsize = stepsize / 2. Iterate until stepsize is zero.
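Here is a small Java sketch in the same spirit, written as the common branch-reduced "lower bound" formulation (the bookkeeping differs slightly from the comment above: the range is halved each round and the position advances via a data-dependent select). Whether the ternary actually compiles to a conditional move rather than a branch depends on the JIT, so benchmark it as suggested:

// Returns the index of the first element >= key in a sorted array
// (a.length if every element is smaller). Assumes a.length >= 1.
static int lowerBound(int[] a, int key) {
    int base = 0;
    int n = a.length;
    while (n > 1) {
        int half = n / 2;
        base = (a[base + half] < key) ? base + half : base;   // step right while the probe is still below key
        n -= half;
    }
    return base + ((a[base] < key) ? 1 : 0);
}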
