grouping strings by similarity - ruby

I have an array of strings, not many (maybe a few hundred) but often long (a few hundred characters each).
Those strings are, generally, nonsense and different from one another, but within the set, maybe 5 out of 300, there is great similarity. In fact they are the same string; what differs is formatting, punctuation and a few words.
How can I work out that group of strings?
By the way, I'm writing in ruby, but if nothing else an algorithm in pseudocode would be fine.
thanks

Assuming that you are not worried about misspellings or other errors in each word, you could do the following:
Build an inverted index, which is basically a hash keyed by word, pointing to a list of pointers to the strings that contain that word (how you handle duplicate occurrences is up to you). To determine which strings are similar to a given query string, look up each query word in the index and, across the resulting lists, count how many times each source string appears. The strings with the highest counts are your best candidates for similarity, because they share the most words with the query.
Then you can compute the edit distance between the query and each candidate, or whatever other metric you want. This way you avoid the O(n^2) cost of comparing each string with every other string.
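A minimal Ruby sketch of that approach (the word-splitting rule and the method names here are my own, illustrative choices):

    # Build an inverted index: word => indices of the strings containing it.
    def build_index(strings)
      index = Hash.new { |h, k| h[k] = [] }
      strings.each_with_index do |s, i|
        s.downcase.scan(/\w+/).uniq.each { |word| index[word] << i }
      end
      index
    end

    # For each indexed string, count how many of the query's words it shares.
    def candidates(query, index)
      counts = Hash.new(0)
      query.downcase.scan(/\w+/).uniq.each do |word|
        index[word].each { |i| counts[i] += 1 }
      end
      counts.sort_by { |_i, count| -count }   # best candidates first
    end

    strings = ["The quick, brown fox!", "the quick brown fox", "something else entirely"]
    p candidates("quick brown fox", build_index(strings)).first(5)

Only the top few candidates then need the expensive edit-distance check.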

Related

How do I select the best match for a string in multiple documents, where the score is equal for both?

I have implemented an algorithm in Elm that compares a sentence (user input) to multiple other sentences (data). Both the user input and the data are split into words, and I compare them word by word. The algorithm marks the sentence from the data that shares the most words with the user input as the best match.
On the first pass, the first sentence from the data is counted as the best match; then the algorithm moves to the second sentence and looks for matches. If its number of matches is greater than the previous one's, the second sentence becomes the best match, otherwise the previous one stays.
If two sentences have an equal number of matches, I currently compare their lengths and select the shorter one as the best match.
There is no semantic meaning involved, so is selecting the shorter sentence the best way to break the tie, or are there better options? I have tried to find scientific references on this but couldn't.
Edit:
To summarize: if you compare one sentence to two other sentences based on word occurrences, and both contain the same number of words that also occur in the sentence you are comparing against, which one should be marked as most similar? Which methods are used to measure this similarity?
Some factors you can add in to improve the comparison:
String similarity (e.g. Levenshtein, Jaro-Winkler, ...)
Add a parameter for sentence length, applying a linear or geometric penalty for a difference in length (at either the character or the word level)
Clean the strings (remove stopwords, special signs etc.)
Add the sequence (position) of words as a parameter, i.e. which word comes before/after another word.
Use Sentence Embeddings for similarity to also capture some semantics (https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/)
Finally, there will always be some sentences that differ from each other yet are equally distant from your input. That's OK, as long as they really are similarly distant from your input sentence.
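As a rough illustration in Ruby of combining word overlap with a length penalty (the stopword list, tokenisation and weighting below are arbitrary assumptions, not a recommendation):

    STOPWORDS = %w[the a an and or of to is].freeze   # toy stopword list

    def tokens(sentence)
      sentence.downcase.scan(/\w+/) - STOPWORDS
    end

    # Word overlap, damped by a linear penalty for differing sentence length.
    def score(input, candidate)
      a = tokens(input)
      b = tokens(candidate)
      overlap = (a & b).size.to_f
      penalty = 1.0 / (1 + (a.size - b.size).abs)
      overlap * penalty
    end

    data = ["please help me reset my password for the online account",
            "password reset form"]
    p data.max_by { |s| score("reset password", s) }
    # => "password reset form" (same overlap, smaller length penalty)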

storing streaming data

Assuming streaming data (i.e. 10 million strings every 10 minutes), what would be a fast and memory-efficient way of storing strings such that if two strings have exactly the same characters but in different orders, they get stored only once?
I have a solution to check whether two strings satisfy this criterion, which works in O(n) time and is based on building a frequency histogram of the characters in each string and checking whether those histograms are the same. But this wouldn't work well, since each new string must be compared with (up to 10 million) stored strings. I assume that if we store each string as a histogram and then separate the histograms into blocks by size, things get a bit more efficient, but the time complexity can still be huge. The ideal solution in terms of time would be a perfect hash function that operates on a histogram input (string: "cacao" -> histogram: "a2:c2:o1").
If your strings are short enough, then comparing sorted strings might be faster than comparing histograms (it is worth checking). Note that sorting is executed only once. Just place the sorted string into some kind of map: a hash map, tree map, etc.
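For example, in Ruby the sorted-character canonical key might look like this (the Set and the helper names are just illustrative):

    require 'set'

    # Canonical key: the string's characters in sorted order.
    def canonical(s)
      s.chars.sort.join
    end

    def store_if_new(s, seen)
      seen.add?(canonical(s)) ? :stored : :duplicate
    end

    seen = Set.new
    p store_if_new("cacao", seen)  # => :stored
    p store_if_new("aacco", seen)  # => :duplicate (same characters, different order)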
I would imagine that a slightly tailored version of a trie would actually be what you're interested in.
Benefits:
It takes O(m) time to look up a string in the trie, where m is the length of the string
It has a worst-case performance of O(m) to insert a string
If you wanted to keep track of the number of occurrences of specific portions, you could augment each node with a counter that is incremented when a terminal string is reached (so you can count occurrences of terminal "thou", "thought", etc.; see the sketch after this list)
Drawback(s):
This can be memory intensive; you'll need to store each character of each word, plus the links connecting them for every word and phrase
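A bare-bones Ruby sketch of such a trie (the hash-based node layout and the occurrence counter are my own illustration):

    class Trie
      def initialize
        @root = { children: {}, count: 0 }
      end

      # Insert in time proportional to the string's length; bump a counter at the terminal node.
      def insert(word)
        node = word.each_char.reduce(@root) do |n, ch|
          n[:children][ch] ||= { children: {}, count: 0 }
        end
        node[:count] += 1
      end

      # Lookup in time proportional to the string's length: how often was it inserted?
      def count(word)
        node = @root
        word.each_char { |ch| node = node[:children][ch] or return 0 }
        node[:count]
      end
    end

    t = Trie.new
    t.insert("thou"); t.insert("thought"); t.insert("thou")
    p t.count("thou")     # => 2
    p t.count("though")   # => 0 (prefix exists, but was never inserted as a word)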

Is there an efficient algorithm for fuzzy deduplication of string lists? [duplicate]

This question already has answers here: Fuzzy matching deduplication in less than exponential time? (6 answers). Closed 9 years ago.
For example, I have a long list of strings, each about 30-50 characters long, and I want to remove strings that are similar to some other string in the list (leaving only one occurrence from each family of duplicates).
I looked at various string similarity algorithms, for example Levenshtein distance and the method presented in this article. They do work, but it's painfully slow: the best algorithm I came up with has O(n^2) complexity and takes ~1.5 s to process a list of 3000 strings.
Is there some fast way to deduplicate those lists?
If your measure of similarity is strong (e.g. Levenshtein distance 1), then you can process your string list in order, generating all possible "close" strings to the current string and looking up that close string in your hashtable. If it is there, skip the original string. If not, output it and add it to the hashtable.
This algorithm depends on being able to generate all close strings to a string, and there not being too many of them. (This is what I mean by "strong" above.)
As a possible optimization, you could store more than just the original strings in the hashtable. For instance, if you wanted Levenshtein distance 3, you could store all strings distance 1 from your outputted strings in the hashtable, then look up distance 2 strings when checking a new string.
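A sketch of the distance-1 version in Ruby (the alphabet and the neighbour generator are assumptions made for illustration):

    ALPHABET = ('a'..'z').to_a

    # All strings at Levenshtein distance <= 1 from s: deletions, substitutions, insertions.
    def neighbours(s)
      result = [s]
      (0..s.length).each do |i|
        result << s[0...i] + s[i + 1..] if i < s.length                # deletion
        ALPHABET.each do |c|
          result << s[0...i] + c + s[i + 1..] if i < s.length          # substitution
          result << s[0...i] + c + s[i..]                              # insertion
        end
      end
      result.uniq
    end

    # Keep a string only if none of its close variants has already been kept.
    def dedupe(strings)
      seen = {}
      strings.select do |s|
        next false if neighbours(s).any? { |n| seen[n] }
        seen[s] = true
      end
    end

    p dedupe(%w[cat cap cat dog dig elephant])   # => ["cat", "dog", "elephant"]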
This problem occurs frequently when matching DNA strings (or re-assembling fragments). A first approach would be to split the strings into k-mers: substrings of, say, 4 adjacent letters. So
abcdefgh
Would become:
abcd + bcde + cdef + defg + efgh
For the complete dictionary, these substrings can be entered into a hashtable, each carrying as a payload a list of the original strings (their numbers) that contain them (and possibly the offset at which they can be found).
To search, treat the string under test the same way as the dictionary strings, and look its fragments up in the hashtable. An exact hit will result in all five fragments being found, with the correct offsets. A partial hit will yield fewer than five fragments, but still with the correct offsets.
Of course a lot of spurious hits will result from the search, but by combining (logical AND) the inverted index lists, and by only selecting the hits at about the right offset, things get unique pretty fast.
For the problem size in the OP's question, the running time would probably be a few (tens of) milliseconds.
BTW: as a side effect of this method, substitutions behave almost the same as indels. In the example they would spoil one (at the ends) to four (in the middle) k-mer matches. For larger strings this is not a problem; for small strings like in the example it is, and you could use smaller fragments.
Update: I just read the link, and it appears they use 2-mers, too (and throw some statistics at it)
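A rough Ruby sketch of the k-mer index described above (k = 4; the hash layout and the names are my own):

    K = 4

    # All (fragment, offset) pairs of length-K substrings of s.
    def kmers(s)
      (0..s.length - K).map { |i| [s[i, K], i] }
    end

    # dictionary: id => string; index: fragment => list of [string id, offset]
    def build_kmer_index(dictionary)
      index = Hash.new { |h, k| h[k] = [] }
      dictionary.each do |id, s|
        kmers(s).each { |frag, off| index[frag] << [id, off] }
      end
      index
    end

    # Vote per (string id, alignment): fragments hitting at a consistent offset pile up.
    def search(query, index)
      votes = Hash.new(0)
      kmers(query).each do |frag, qoff|
        index[frag].each { |id, off| votes[[id, off - qoff]] += 1 }
      end
      votes.sort_by { |_key, count| -count }
    end

    dict = { 1 => "abcdefgh", 2 => "xxabcdefghyy", 3 => "unrelated" }
    p search("abcdefgh", build_kmer_index(dict)).first(3)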

Efficient Means of Implementing Collation & Sorting?

I'm writing lexicography software, which may theoretically need to sort tens of thousands of strings with arbitrary (dictionary-project-specific) collations. There are two means of specifying custom collations:
a map of graphemes to unicode-style multi-level collation keys.
an array of alphabetic graphemes (possibly including digraphs, etc.) in sorting order, which can be internally transformed into a map of collation keys.
The naive method of comparing strings is to check grapheme-by-grapheme until you find a mismatch, and then look up the collation keys for the mismatched graphemes to compare, but I'm hoping there's a more efficient way of doing it.
The best idea I've got so far depends on noticing that strings of equal length can be treated as little-endian base-n numbers, so I can pre-compute an integer key for each string which turns collation into cheap integer comparison. But, this breaks for strings of different length (a big deal when sorting a dictionary), and there's no bound on the size of integers that could be generated.
To account for length differences, I thought I could compute a list of keys for all prefixes of each string, and then just compare the keys for prefixes of length equal to the shorter string being compared. That seems to do pretty well, but key sizes are still unbounded, and storing the keys could use a lot of memory.
Is there a way to improve that approach? Or am I just going about it entirely wrong, and there's a much better means of sorting strings with arbitrary collations?
How about a grapheme-by-grapheme radix sort? You get O(n*m) sorting, where n is the number of words and m is the length of the longest word. The idea is fairly simple: put all the words that start with A in the A bucket, the Bs in the B bucket, and so on down the characters of each word.
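A sketch of that bucket (MSD radix) sort in Ruby, assuming single-character graphemes and an array giving the custom alphabet order:

    # order: graphemes in the desired collation order; unknown graphemes sort last.
    def radix_sort(words, order, depth = 0)
      return words if words.size <= 1
      rank = order.each_with_index.to_h
      done, buckets = [], Hash.new { |h, k| h[k] = [] }
      words.each do |w|
        g = w[depth]
        if g.nil?
          done << w            # word exhausted at this depth: it sorts before longer words
        else
          buckets[g] << w
        end
      end
      done + buckets.keys.sort_by { |g| rank.fetch(g, order.size) }
                    .flat_map { |g| radix_sort(buckets[g], order, depth + 1) }
    end

    order = ('a'..'z').to_a
    p radix_sort(%w[banana apple band ape], order)   # => ["ape", "apple", "banana", "band"]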
I'm no expert, but I might suggest some kind of hybrid between the naive approach and yours: look at a fixed number of bytes in each string, treat it as a little-endian number and compare using a pre-calculated collation key. If they are equal, move to the next chunk of the same length and do the same. The tricky part is dealing with variable-length graphemes (such as UTF-8 sequences or digraphs). The simplest solution would be to use a fixed-width representation in the dictionary, but there might be another, more sophisticated solution which I can't think of right now.
Once you get to the end of the shorter string, you zero-extend it to meet the next boundary and then do the comparison.
You could also look at open-source implementations of collations, and see if they do something more sophisticated (for instance the GNU implementation of the strcoll C function).

How to split a word into different ways such that it is a concatenation of two other words

I just came across this interesting question online and am quite stumped as to how to even progress on it.
Write a function that finds all the different ways you can split up a word into a
concatenation of two other words.
Is this something that Suffix Trees are used for?
I'm not looking for code, just conceptual way to move forward with this.
some pseudocode:
foreach place you can split the word:
split the word.
check if both sides are valid words.
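In Ruby, with the dictionary as a Set (the dictionary itself is an assumed input here), that pseudocode might become:

    require 'set'

    # Every [left, right] pair of dictionary words whose concatenation is `word`.
    def splits(word, dictionary)
      (1...word.length).filter_map do |i|
        left, right = word[0...i], word[i..]
        [left, right] if dictionary.include?(left) && dictionary.include?(right)
      end
    end

    dict = Set.new(%w[book keeper bookkeep er])
    p splits("bookkeeper", dict)   # => [["book", "keeper"], ["bookkeep", "er"]]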
If you are looking for a nice answer then please let us know your definition of a valid word.
Assuming a word is a string defined over an alphabet and of length greater than zero, you can use suffix trees.
Below is a simplified algorithm that checks all O(n) split points:
Convert the word into a character array.
For each index i along the array, take the two substrings (0 to i) and (i+1 to length of the array - 1) and check whether each is a valid word.
Remember to cover the base conditions, like length greater than zero.
The total number of different ways can be greater than one if and only if this condition holds:
one of the two words must be a repetition of the other, e.g. "abcd" and "abcdabcd".
Using these two words you can form the string "abcdabcdabcdabcd" in many different ways.
So first check this condition.
Then check whether the string can be written from the two words at all; simple math should then give you the answer.
