What algorithm does VSCode use to achieve highlighting in search results?

For example, searching for "test" returns "Terminal: Split Terminal" as one of the results, with "Te", "S", and "T" being highlighted. How is this achieved? As far as I'm aware, regular fuzzy search (based on Levenshtein distance) will only provide you with distances between strings.
I've been looking up different approximate string matching algorithms, and wasn't able to find anything similar.

I don't use VSCode, but based on the screenshot I would guess that it combines results from substring matching with some kind of "word-initial matching": the query matches if you can split the search string (for instance "test" into "te", "s", "t") so that each part appears at the beginning of a word, with the parts in the same order as the words.
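Here's a rough Python sketch of that guess; the function name and the backtracking details are my own invention, purely illustrative and not taken from VSCode's source:

# Rough sketch of the "word-initial matching" guess above.
# Returns the character indices of `candidate` to highlight, or None.
def word_initial_match(query, candidate):
    q, c = query.lower(), candidate.lower()
    # indices where a new word begins, e.g. "Terminal: Split Terminal" -> [0, 10, 16]
    word_starts = [i for i in range(len(c))
                   if c[i].isalnum() and (i == 0 or not c[i - 1].isalnum())]

    def solve(qi, next_word):
        if qi == len(q):
            return []                      # the whole query has been consumed
        for w in range(next_word, len(word_starts)):
            start = word_starts[w]
            k = 0
            # try consuming 1, 2, ... query characters at this word start
            while (qi + k < len(q) and start + k < len(c)
                   and c[start + k] == q[qi + k]):
                k += 1
                rest = solve(qi + k, w + 1)
                if rest is not None:
                    return list(range(start, start + k)) + rest
        return None

    return solve(0, 0)

print(word_initial_match("test", "Terminal: Split Terminal"))
# [0, 1, 10, 16] -> highlights "Te", "S" and "T"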

Related

Minhashing on Strings with K-length

I have an application where I should implement Bloom filters and MinHashing to find similar items.
I have the Bloom filter implemented, but I need to make sure I understand the MinHashing part to do it:
The application generates a number of k-length strings and stores them in a document; all of those are then inserted into the Bloom filter.
Where I want to use MinHash is in giving the user the option to enter a string, then comparing it against the document and finding the most similar ones.
Do I have to shingle all the strings in the document? The problem is that I can't really find anything to help me with this; all I find concerns two documents, never one string against a set of strings.
So: the user enters a string and the application finds the most similar strings within a single document. By "similarity", do you mean something like Levenshtein distance (whereby "cat" is deemed similar to "rat" and "cart"), or some other measure? And are you (roughly speaking) looking for similar paragraphs, similar sentences, similar phrases or similar words? These are important considerations.
Also, you say you are comparing one string to a set of strings. What are these strings? Sentences? Paragraphs? If you are sure you don't want to find any similarities spanning multiple paragraphs (or multiple sentences, or what-have-you) then it makes sense to think of the document as multiple separate strings; otherwise, you should think of it as a single long string.
The MinHash algorithm is for comparing many documents to each other, when it's impossible to store all documents in memory simultaneously and individually comparing every document to every other would be an n-squared problem. MinHash overcomes these problems by storing hashes for only some shingles, and it sacrifices some accuracy as a result. You don't need MinHash, as you could simply store every shingle in memory, using, say, 4-character-grams for your shingles. But if you don't expect word orderings to be switched around, you may find the Smith-Waterman algorithm more suitable.
If you're expecting the user to enter long strings of words, you may get better results basing your shingles on words; so 3-word-grams, for instance, ignoring differences in whitespacing, case and punctuation.
Generating 4-character-grams is simple: "The cat sat on the mat" would yield "The ", "he c", "e ca", " cat", etc. Each of these would be stored in memory, along with the paragraph number it appeared in. When the user enters a search string, that would be shingled in identical manner, and the paragraphs containing the greatest number of shared shingles can be retrieved. For efficiency of comparison, rather than storing the shingles as strings, you can store them as hashes using FNV1a or a similar cheap hash.
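A minimal sketch of that character-gram approach (the FNV-1a constants are the standard 32-bit ones; the paragraphs and query are just made up for the example):

# Build a shingle index over paragraphs and score a query against it.
def fnv1a_32(s):
    h = 0x811c9dc5                          # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        h ^= b
        h = (h * 0x01000193) & 0xffffffff   # FNV-1a 32-bit prime
    return h

def shingles(text, k=4):
    return {fnv1a_32(text[i:i + k]) for i in range(len(text) - k + 1)}

paragraphs = ["The cat sat on the mat", "A dog sat on the log"]

# index: shingle hash -> set of paragraph numbers containing it
index = {}
for p, para in enumerate(paragraphs):
    for h in shingles(para):
        index.setdefault(h, set()).add(p)

# rank paragraphs by the number of shingles shared with the query
query = "the cat sat"
scores = {}
for h in shingles(query):
    for p in index.get(h, ()):
        scores[p] = scores.get(p, 0) + 1
print(max(scores, key=scores.get))          # paragraph 0 shares the most shingles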
Shingles can also be built up from words rather than characters (e.g. "the cat sat", "cat sat on", "sat on the"). This tends to be better with larger pieces of text: say, 30 words or more. I would typically ignore all differences in whitespace, case and punctuation if taking this approach.
If you want to find matches that can span paragraphs as well, it becomes quite a bit more complex, as you have to store the character positions for every shingle and consider many different configurations of possible matches, penalizing them according to how widely scattered their shingles are. That could end up being quite complex code, and I would seriously consider just sticking with a Levenshtein-based solution such as Smith-Waterman, even if it doesn't deal well with inversions of word order.
I don't think a Bloom filter is likely to help you much, though I'm not sure how you're using it. Bloom filters might be useful if your document is highly structured: a limited set of possible strings, where you're searching for the existence of one of them. For natural language, though, I doubt it will be very useful.

English misspelling correction sequences

I am building a bit of a search engine. One of its features is an attempt to correct spelling if nothing is found. I replace the following phonetic sequences: ph <-> f, ee <-> i, oo <-> u, ou <-> o (colour <-> color). Where can I find a full list of things like that for English?
Thank you.
You might want to start here (Wikipedia on Soundex) and then start tracing through the "see also" links. (Metaphone has a list of replacements, for example.)
If you're creating a search engine, you have to realize that there are plenty of web pages which contain incorrectly spelled words. But, of course, you need a strategy to make those pages searchable too. So there are no general rules for implementing a spelling corrector (correctness becomes a relative concept on the web), but there are some tricks for doing it in practice :-)
I'd suggest using an n-gram index + Levenshtein distance (or any similar distance) to correct spelling.
Strings with a small Levenshtein distance are presumably variations of the same word.
Assume you want to correct the word "fantoma". If you have a large number of words, it would be very costly to iterate through the dictionary and calculate the distance to each word, so you have to find words with a presumably small distance to "fantoma" very quickly.
The main idea is, while crawling and indexing web pages, to index n-grams (for example, bigrams) into a separate index. Split each word into n-grams and add them to the n-gram index:
1) Split each word from dictionary,
for example: "phantom" -> ["ph", "ha", "an", "nt", "to", "om"]
2) Create index:
...
"ph" -> [ "phantom", "pharmacy", "phenol", ... ]
"ha" -> [ "phantom", "happy" ... ]
"an" -> [ "phantom", "anatomy", ... ]
...
Now you have an index, and you may quickly find candidates for your word.
For example:
1) "fantoma" -> ["fa", "an", "nt", "to", "om", "ma"]
2) get the lists of words for each n-gram from the index,
and extract the most frequent words from these lists - these words are the candidates
3) calculate the Levenshtein distance to each candidate;
the word with the smallest distance is probably the spell-corrected variant of the searched word.
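A compact Python sketch of steps 1-3 above; the dictionary and the candidate cut-off are made up for the example:

from collections import Counter, defaultdict

def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

dictionary = ["phantom", "pharmacy", "phenol", "happy", "anatomy", "fan"]

# 1) build the bigram index: "ph" -> {"phantom", "pharmacy", ...}
index = defaultdict(set)
for word in dictionary:
    for g in bigrams(word):
        index[g].add(word)

def correct(word, top=5):
    # 2) candidates = dictionary words sharing the most bigrams with `word`
    counts = Counter()
    for g in bigrams(word):
        counts.update(index[g])
    candidates = [w for w, _ in counts.most_common(top)]
    if not candidates:
        return word
    # 3) pick the candidate with the smallest Levenshtein distance
    return min(candidates, key=lambda w: levenshtein(word, w))

print(correct("fantoma"))   # -> "phantom"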
I'd also suggest looking through the book "Introduction to Information Retrieval".

Algorithm to find all possible results using text search

I am currently making a web crawler to crawl all the possible characters on a video game site (Final Fantasy XIV Lodestone).
My interface for doing this is using the site's search. http://lodestone.finalfantasyxiv.com/rc/search/characterForm
If the search finds more than 1000 characters it only returns the first 1000. The text search does not seem to understand either *, ? or _.
If I search for the letter a, I get all the characters that have an a in their names rather than all characters whose names start with a.
I'm guessing I could do searches for all character combinations: aa, ab, ba, etc. But that doesn't guarantee that:
I will never get more than 1000 results.
It doesn't seem very efficient, as many characters would appear multiple times and would need to be filtered out.
I'm looking for an algorithm on how to construct my search text.
Considered as a practical problem, have you asked Square Enix for some kind of API access or database dump? They might prefer this to having you scrape their search results.
Considered purely in the abstract, it's not clear that any search strategy will succeed in finding all the results. For suppose there were a character called "Ar": how would you find it? If you search for "ar", the results only go as far as Ak—. If you search for "a" or "r", the situation is even worse. Any other search fails to find this character. (In practice you might be able to find "Ar" by guessing its world and/or main skill, but in theory there might be so many characters with that skill on that world that this remains ineffective.)
The main question here is what you are planning to do with all those characters. What is the purpose of your program? Putting that aside, you can search for a single letter and filter by both main skill and world (using a double loop). It is highly unlikely that you will ever get more than 1000 hits that way for any consonant. If you want to search for names starting with a vowel, then use a bigraph vowel + other_letter in a loop that iterates other_letter from A to Z.
An additional optimization is to try to guess at which page the list for the needed letter will start. If you have the total number of pages (TNOP), then your list will start somewhere near page TNOP * LETTER / 27, where LETTER is the position of the letter in the alphabet.
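A tiny illustration of that page estimate (the total page count here is made up):

def guess_start_page(letter, total_pages):
    order = ord(letter.upper()) - ord('A') + 1   # A=1, B=2, ..., Z=26
    return max(1, round(total_pages * order / 27))

print(guess_start_page('m', 5400))   # names around "M" start near page 2600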

Fast filtering of a string collection by substring?

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?
The Wikipedia article on substring indexes lists a few ways to index substrings. You've got:
Suffix tree
Suffix array
N-gram index, an inverted file for all N-grams of the text
Compressed suffix array
FM-index
LZ-index
Yes. You could, for example, create an index for all two-character combinations in the strings. A string like "hello" would be added to the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would collect the strings that appear in all of the "he", "el" and "ll" indexes, then loop through those to check the actual content of the strings.
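A minimal sketch of that idea (bigrams only; the verification at the end is needed because sharing every bigram doesn't guarantee an actual substring match):

from collections import defaultdict

strings = ["hello", "shell", "hill", "yellow"]

# index each string under every two-character combination it contains
index = defaultdict(set)
for s in strings:
    for i in range(len(s) - 1):
        index[s[i:i + 2]].add(s)

def contains(query):
    grams = [query[i:i + 2] for i in range(len(query) - 1)]
    # candidates must appear in the index entry of every bigram of the query
    candidates = set.intersection(*(index[g] for g in grams)) if grams else set(strings)
    # verify against the actual string content
    return sorted(s for s in candidates if query in s)

print(contains("hell"))   # ['hello', 'shell']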
If you can preprocess the collection then you can do a lot of different things.
For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.
If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.
There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.
Not really anything that's viable, no, unless you have additional a priori knowledge of your data and/or search term. For instance, if you're only searching for matches at the beginning of your strings, then you could sort the strings and only look at the ones within the bounds of your search term (or even store them in a binary tree and only look at the branches that could possibly match). Likewise, if your potential search terms are limited, you could run all the possible searches against a string when it's initially input and then just store a table of which terms match and which don't.
Aside from that kind of thing, just iterating through is basically it.
That depends on whether the substring is at the beginning of the string or can be anywhere in the string.
If it's anywhere then you pretty much need to iterate over the entire list unless your list is so large and the query happens sufficiently often that it's worth building a more sophisticated indexing solution.
If the substring is at the beginning of the string then it's easy: sort the list, find the start/end by bisection search and take that subset.
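For that prefix-only case, a short sketch using Python's bisect module (the sample strings are invented, and the "\uffff" sentinel assumes it sorts after any character in the data):

import bisect

strings = sorted(["apple", "applet", "banana", "band", "bandit", "cherry"])

def starts_with(prefix):
    lo = bisect.bisect_left(strings, prefix)
    hi = bisect.bisect_right(strings, prefix + "\uffff")
    return strings[lo:hi]

print(starts_with("ban"))   # ['banana', 'band', 'bandit']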

How do I compare phrases for similarity?

When entering a question, stackoverflow presents you with a list of questions that it thinks likely to cover the same topic. I have seen similar features on other sites or in other programs, too (Help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.
The first approach that comes to my mind is splitting the phrase into words and looking for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does', etc.), and then you will want to rank the results.
Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...
No, seriously, what are the common ways to solve this problem?
One approach is the so called bag-of-words model.
As you guessed, first you count how many times each word appears in the text (usually called a document in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.
You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words:
"aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
Now you can take your word bags and turn them into vectors. For example, if your document contains two references to aardvarks and nothing else, it would look like this:
[2 0 0 ... 70k zeroes ... 0].
After this you can compute the "angle" between two vectors using the dot product. The smaller the angle, the closer the documents are.
This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.
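A minimal sketch of that bag-of-words comparison, using word-count dicts and cosine similarity instead of full 70k-element vectors (the stop-word list and sample phrases are just stand-ins):

import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "or", "and", "does"}

def bag(text):
    words = (w.strip(".,?!") for w in text.lower().split())
    return Counter(w for w in words if w and w not in STOP_WORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = bag("How do I compare phrases for similarity?")
d2 = bag("Comparing phrases for similarity with bag of words")
d3 = bag("The aardvark ate an apple")
print(cosine(d1, d2), cosine(d1, d3))   # d1 is closer to d2 than to d3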
@Hanno, you should try the Levenshtein distance algorithm. Given an input string s and a list of strings t, iterate over each string u in t and return the one with the minimum Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance
See a Java implementation example in http://www.javalobby.org/java/forums/t15908.html
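For illustration, the same idea in Python, using the third-party Levenshtein package for the distance (any edit-distance implementation would do):

import Levenshtein   # pip install Levenshtein

def closest(s, t):
    # return the string in t with the minimum edit distance to s
    return min(t, key=lambda u: Levenshtein.distance(s, u))

print(closest("similarty", ["similarity", "dissimilar", "celerity"]))  # "similarity"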
To augment the bag-of-words idea:
There are a few ways you can also pay some attention to n-grams, strings of two or more words kept in order. You might want to do this because a search for "space complexity" is much more than a search for things with "space" AND "complexity" in them, since the meaning of this phrase is more than the sum of its parts; that is, if you get a result that talks about the complexity of outer space and the universe, this is probably not what the search for "space complexity" really meant.
A key idea from natural language processing here is that of mutual information, which allows you (algorithmically) to judge whether or not a phrase is really a specific phrase (such as "space complexity") or just words which are coincidentally adjacent. Mathematically, the main idea is to ask, probabilistically, if these words appear next to each other more often than you would guess by their frequencies alone. If you see a phrase with a high mutual information score in your search query (or while indexing), you can get better results by trying to keep these words in sequence.
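A toy illustration of that mutual-information test (the corpus is made up, and bigrams are counted naively across document boundaries):

import math
from collections import Counter

docs = ["the space complexity of the algorithm",
        "space complexity and time complexity",
        "outer space is vast",
        "the complexity of the universe"]
words = [w for d in docs for w in d.split()]
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))   # crude: ignores document boundaries
n = len(words)

def pmi(w1, w2):
    # pointwise mutual information: log2( p(w1,w2) / (p(w1) * p(w2)) )
    p_pair = bigrams[(w1, w2)] / n
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_pair / (p1 * p2)) if p_pair else float("-inf")

print(pmi("space", "complexity"), pmi("the", "complexity"))
# "space complexity" scores much higher, so it is worth treating as a phrase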
From my (rather small) experience developing full-text search engines: I would look up questions which contain some words from the query (in your case, the query is your question).
Sure, noise words should be ignored, and we might want to check the query for 'strong' words like 'ASP.Net' to narrow down the search scope.
Inverted indexes (http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices) are commonly used to find questions with the words we are interested in.
After finding questions with words from the query, we might want to calculate the distance between the words we are interested in within each question, so a question containing the text 'phrases similarity' ranks higher than one containing 'discussing similarity, you hear the following phrases...'.
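A minimal sketch of that inverted-index lookup (the noise-word list and sample questions are invented; the distance-based ranking step is only noted in a comment):

from collections import defaultdict

questions = ["How do I compare phrases for similarity?",
             "Discussing similarity, you hear the following phrases",
             "How do I sort a list?"]
NOISE = {"how", "do", "i", "a", "the", "you", "for"}

def terms(text):
    return [w.strip("?,.").lower() for w in text.split() if w.lower() not in NOISE]

# inverted index: term -> set of question ids containing it
index = defaultdict(set)
for qid, q in enumerate(questions):
    for t in terms(q):
        index[t].add(qid)

def candidates(query):
    hits = defaultdict(int)
    for t in terms(query):
        for qid in index[t]:
            hits[qid] += 1
    # questions sharing more query terms come first; the word-distance
    # ranking described above would then break ties between them
    return sorted(hits, key=hits.get, reverse=True)

print([questions[qid] for qid in candidates("phrases similarity")])
# both matching questions come back; distance-based ranking would then
# put "How do I compare phrases for similarity?" first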
Here is a bag-of-words solution with TfidfVectorizer in Python 3:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
s = set(stopwords.words('english'))

# train_x is a list of phrases, train_y the corresponding labels
train_x_cleaned = []
for i in train_x:
    # drop stop words before vectorizing
    sentence = filter(lambda w: w not in s, i.split())
    train_x_cleaned.append(' '.join(sentence))

vectorizer = TfidfVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x_cleaned)
print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

test_x = vectorizer.transform(["test phrase 1", "test phrase 2", "test phrase 3"])
print(type(test_x))
print(clf_svm.predict(test_x))

Resources