What is the relationship between LCS and string similarity?

I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed that tool. For example, they say it is based on the C library of GNU diff and on analyze.c; maybe that refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation to the article. From what I read, the article presents an algorithm for finding the LCS (longest common subsequence) of a pair of strings, using a modification of the dynamic programming algorithm normally used to solve this problem. The modification is to use a shortest-path algorithm to find the LCS that corresponds to the minimum number of edits.
At this point I am lost, because I do not know how the authors of the tool I first mentioned used the LCS to measure how similar two sequences are. Also, they have set a limit value of 0.4; what does that mean? Can anybody help me with this, or have I misunderstood the article?
Thanks

I think the description on the string similarity tool is not being entirely honest, because I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that the characters that are not in the LCS are precisely the characters you would need to add, delete or substitute in order to convert the first string into the second. The number of these differing characters is usually used as the difference score, as in the Levenshtein distance.
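To make that concrete, here is a minimal Python sketch of turning an LCS length into a 0-to-1 similarity score. The normalisation 2*LCS/(len(a)+len(b)) is an assumption about how fstrcmp-style scores are usually defined, not something taken from the tool itself:

def lcs_length(a, b):
    # Classic dynamic programming table, kept one row at a time.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))

print(similarity("kitten", "sitting"))  # LCS "ittn" has length 4, so 2*4/13 is about 0.615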

Related

Algorithm for a one-character checksum

I am desperately searching for an algorithm to create a checksum that is at most two characters long and can detect when characters in the input sequence have been mixed up (transposed). When testing different algorithms, such as Luhn, CRC-24 or CRC-32, the checksums were always longer than two characters. If I truncate the checksum to two or even one character, then not all transpositions are detected any more.
Does any of you know an algorithm that meets my needs? Even a name with which I can continue my search would already help. I would be very grateful for your help.
Given that your data is alphanumeric, that you want to detect all the transpositions (in the ideal case), and that you can afford a binary checksum (i.e. the full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by @Paul Hankin in the comments), as it is more information-dense than check-digit algorithms like Luhn or Damm, and more "generic" when it comes to the possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT); you can give it a try here to see how it works for you.
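If it helps, here is a minimal Python sketch of a bit-by-bit CRC-16-CCITT (polynomial 0x1021). The initial value 0xFFFF is the "CCITT-FALSE" convention; other variants use 0x0000, so treat that parameter as an assumption:

def crc16_ccitt(data, init=0xFFFF):
    # Process each byte MSB-first against the polynomial 0x1021.
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

print(hex(crc16_ccitt(b"AB123")))  # a 16-bit value, i.e. two bytes of checksum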

The best string reconstruction algorithm out there? (Best as in 'most accurate')

I've been searching for and testing all kinds of string reconstruction algorithms, i.e. reconstructing spaceless text into normal text.
My result, posted here (Solution working partially in Ruby), achieves about 90% reconstruction for two- or three-word sentences with a complete dictionary, but I can't get it to do better than this!
I think my algorithm, inspired by dynamic programming, is bad and contains a lot of patchwork.
Can you propose another algorithm (in pseudocode) that would work reliably with a complete dictionary?
You need more than just a dictionary, because you can have multiple possible phrases from the same spaceless string. For example, "themessobig" could be "the mess so big" or "themes so big" or "the mes so big", etc.
Those are all valid possibilities, but some are far more likely than others. So what you want to do is pick the most likely one given how the language is actually used. For this you need a huge corpus of text along with some NLP algorithms. Probably the simplest one is to count how likely a word is to occur after another word. So for "the mess so big", its likelihood would be:
P(the | <START>) * P(mess | the) * P(so | mess) * P(big | so)
For "themes so big", the likelihood would be:
P(themes | <START>) * P(so | themes) * P(big | so)
Then you can pick the most likely of the possibilities. You can also condition on the two preceding words instead of one (e.g. P(so | the + mess)), which will require a bigger corpus to be effective.
This won't be foolproof, but you can keep improving it by using better corpora or tweaking the algorithm.
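As an illustration, here is a toy Python sketch of scoring candidate phrases with such bigram probabilities. The table and its numbers are placeholders; a real model would be estimated from corpus counts:

# Hypothetical bigram probabilities P(word | previous word).
BIGRAM = {
    ("<START>", "the"): 0.05, ("the", "mess"): 0.002, ("mess", "so"): 0.01,
    ("<START>", "themes"): 0.0004, ("themes", "so"): 0.002, ("so", "big"): 0.02,
}

def phrase_likelihood(words, unseen=1e-9):
    # Multiply P(w_i | w_{i-1}) along the phrase, starting from <START>.
    score, prev = 1.0, "<START>"
    for w in words:
        score *= BIGRAM.get((prev, w), unseen)
        prev = w
    return score

candidates = [["the", "mess", "so", "big"], ["themes", "so", "big"]]
best = max(candidates, key=phrase_likelihood)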
With a unigram language model, which is essentially a table of word frequencies, it is possible to find the most probable segmentation of a string. Example code is given in Russell & Norvig (2003, p. 837); look for the function viterbi_segment.
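That book code isn't reproduced here, but a minimal memoised sketch of the same idea looks like this, assuming a unigram probability table P (the entries below are placeholders):

from functools import lru_cache

P = {"the": 0.05, "mess": 0.001, "themes": 0.0005, "so": 0.02, "big": 0.003}
UNKNOWN = 1e-10  # heavy penalty for out-of-vocabulary "words"

def word_prob(w):
    return P.get(w, UNKNOWN)

@lru_cache(maxsize=None)
def segment(text):
    # Return (probability, words) for the most probable segmentation of text.
    if not text:
        return (1.0, [])
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_prob, rest_words = segment(rest)
        candidates.append((word_prob(first) * rest_prob, [first] + rest_words))
    return max(candidates, key=lambda c: c[0])

print(segment("themessobig")[1])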

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
f(x, y) = 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when a character has relatively few "similar" characters, its performance degrades exponentially as the number of similar characters per character grows. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not depend blindly on edit distance alone).
One thing to note is that in the wild, the dictionary is going to be really large.
I might try cosine similarity, using the position of each character as a feature and mapping the product between features with a match function based on your character relations.
Not very specific advice, I know, but I hope it helps you.
Edit: expanded answer below.
With cosine similarity you compute how similar two vectors are; in your case the normalisation might not make sense. So what I would do is something very simple (I might be oversimplifying the problem): first, treat the C x C matrix as a dependency matrix holding the probability that two characters are related (e.g. P('t' | 'l') = 1). This also allows partial dependencies, so you can differentiate between perfect and partial matches. After that, compute for each position the probability that the letters of the two words are not the same (using the complement of P(t_i, t_j)), and then aggregate the results with a sum.
This counts the number of positions that differ for a specific pair of words while still allowing partial dependencies. Furthermore, the implementation is very simple and should scale well, which is why I am not sure whether I have misunderstood your question.
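One possible reading of that idea, sketched in Python: score each alignment position with a character-relation function and sum over positions. The relation pairs come from the question; the normalisation and the threshold are my own assumptions:

SIMILAR = {('t', 'l'), ('a', 'o'), ('f', 't')}  # pairs from the question's f(x, y)

def char_sim(x, y):
    if x == y:
        return 1.0
    if (x, y) in SIMILAR or (y, x) in SIMILAR:
        return 1.0  # could be a value below 1 to model partial dependencies
    return 0.0

def match_score(word, text, start):
    # Average per-position similarity of word against text[start:start+len(word)].
    window = text[start:start + len(word)]
    if len(window) < len(word):
        return 0.0
    return sum(char_sim(a, b) for a, b in zip(word, window)) / len(word)

def search(dictionary, text, threshold=1.0):
    matches = []
    for word in dictionary:
        for start in range(len(text) - len(word) + 1):
            if match_score(word, text, start) >= threshold:
                matches.append((word, start))
    return matches

print(search({'cat', 'cot', 'catalyst'}, 'cofatyst'))  # all three dictionary words match at offset 0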
I am using the Fuse JavaScript library for a project of mine. It is a JavaScript file that works on a JSON dataset, and it is quite fast. Have a look at it.
It implements the full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool (as described on its site).
The code is simple enough that you can follow how the algorithm is implemented.

Word Suggestion program

Suggest a program or approach for handling a word correction / suggestion system.
- Let's say the input is 'Suggset'; it should suggest 'Suggest'.
Thanks in advance. I'm using Python and AJAX. Please don't suggest jQuery modules, because I need the algorithmic part.
The algorithm that solves your problem is called "edit distance". Given a list of words in some language and a mistyped/incomplete word, you need to build a list of the dictionary words closest to it. For example, the distance between "suggest" and "suggset" is 2: you need one deletion and one insertion. As an optimization you can assign different weights to each operation; for example, you can say that substitution is cheaper than deletion, and that substituting two letters that lie close together on the keyboard (for example 'v' and 'b') is cheaper than substituting letters that are far apart (for example 'q' and 'l').
The first description of an algorithm for spelling correction appeared in 1964. In 1974 an efficient algorithm based on dynamic programming appeared in the paper "The String-to-String Correction Problem" by Robert A. Wagner and Michael J. Fischer. Any algorithms book has a more or less detailed treatment of it.
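For reference, the Wagner-Fischer dynamic program is only a few lines of Python; the weights here are uniform, but you could replace the substitution cost with a keyboard-distance weight as suggested above:

def edit_distance(a, b):
    prev = list(range(len(b) + 1))            # distance from a[:0] to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to b[:0]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1       # substitution cost
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(edit_distance("suggest", "suggset"))    # 2: one deletion plus one insertion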
For Python there is a library for this: the Levenshtein distance library
Also check this earlier discussion on Stack Overflow
It will take a lot of work to make one of these yourself. There is a spell-checking library written in Python called PyEnchant that I've found to be quite nice. Here's an example from their website:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

Finding substring in assembly

I'm wondering if there is a more efficient method of finding a substring in assembly than what I am currently planning to do.
I know the string instructions scasb/scasw/scasd can compare the value in AL/AX/EAX to a value addressed by EDI. However, as far as I understand, I can only search for one character at a time with this approach.
So, if I want to find the location of "help" in the string "pleasehelpme", I could use scasb to find the offset of the 'h', then jump to another routine where I compare the remainder. If the remainder isn't correct, I jump back to scasb and try searching again, this time starting after the previous offset.
However, I would hate to do this and then discover there is a more efficient method. Any advice? Thanks in advance
There are indeed more efficient ways, both instruction-wise and algorithmically.
If you have the hardware, you can use the SSE 4.2 string-comparison instructions, which are very fast. See an overview at http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/intref_cls/common/intref_sse42_comp.htm and an example using the C intrinsics at http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4/
If you have long substrings or multiple search patterns, the Boyer-Moore, Knuth-Morris-Pratt and Rabin-Karp algorithms may be more efficient.
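For illustration (in Python rather than assembly), here is a minimal sketch of the Boyer-Moore-Horspool simplification of the Boyer-Moore idea: precompute a bad-character shift table so that a mismatching window can skip ahead by more than one position:

def horspool_find(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift to apply when the character ending the current window mismatches.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        pos += shift.get(text[pos + m - 1], m)
    return -1

print(horspool_find("pleasehelpme", "help"))  # 6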
I don't think there is a more efficient method (only some optimizations that can be done to this method). Also this might be of interest.
scasb is the assembly counterpart of strchr (it scans for a single byte), not of strstr. If you want a really efficient method, you have to use a better algorithm.
For example, if you search in a long string, you could try some specialized algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm
