Is there any Go module for doing fuzzy string matching? If I have an array of strings, I want to check whether a given string fuzzy-matches any of the elements in the array.
Please help.
Thank you.
You are probably looking for a library that implements the Levenshtein distance algorithm.
This is also the algorithm Elasticsearch uses for fuzzy searching.
See here for a list of Go packages.
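If you'd rather avoid a dependency, the core idea is small enough to sketch directly. Below is a minimal, illustrative implementation of Levenshtein distance in Go (the names `levenshtein` and `fuzzyMatch` are my own, not from any particular package):

```go
package main

import "fmt"

// levenshtein computes the edit distance between two strings using a
// dynamic-programming table, keeping only one previous row at a time.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = minInt(prev[j]+1, minInt(cur[j-1]+1, prev[j-1]+cost))
		}
		prev = cur
	}
	return prev[len(rb)]
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

// fuzzyMatch reports whether target is within maxDist edits of any candidate.
func fuzzyMatch(candidates []string, target string, maxDist int) bool {
	for _, c := range candidates {
		if levenshtein(c, target) <= maxDist {
			return true
		}
	}
	return false
}

func main() {
	words := []string{"kitten", "flour", "gopher"}
	fmt.Println(fuzzyMatch(words, "sitting", 3)) // kitten -> sitting is 3 edits
}
```

In practice you'd pick a distance threshold (or normalize by string length) to decide what counts as a "fuzzy match".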
Related
I am trying to find out if there is an algorithm that exists that is capable of the following:
Given a list of strings:
{"56B99Z", "78K80F", "50B49J", "28F11F"}
And given an input string of:
"??B?9?"
Then the algorithm should output:
{"56B99Z", "50B49J"}
Where each ? is an unknown character.
I think some sort of trie-tree with additional links between nodes could work, but I don't want to re-invent the wheel if this has been done before.
Your question is really vague and you need to be more specific: do all the strings have the same size? If so, you can simply compare the non-question-mark positions of your pattern against each string. If you are looking for string-matching algorithms in general, I suggest reading about the KMP algorithm, which has linear complexity in the input size: https://en.m.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Use a regular expression that matches positions 1, 2, 4 and 6 (the ? positions) with \w.
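That suggestion can be sketched in Go with the standard `regexp` package: translate each `?` to `\w` and anchor the pattern (`globToRegexp` is a hypothetical helper name; this assumes the strings contain only word characters):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// globToRegexp turns a pattern in which '?' means "any single character"
// into an anchored regular expression.
func globToRegexp(pattern string) *regexp.Regexp {
	var b strings.Builder
	b.WriteString("^")
	for _, r := range pattern {
		if r == '?' {
			b.WriteString(`\w`) // any word character, as suggested above
		} else {
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	b.WriteString("$")
	return regexp.MustCompile(b.String())
}

func main() {
	re := globToRegexp("??B?9?")
	for _, s := range []string{"56B99Z", "78K80F", "50B49J", "28F11F"} {
		if re.MatchString(s) {
			fmt.Println(s)
		}
	}
	// prints 56B99Z and 50B49J
}
```

For repeated queries over a large, fixed list, a trie as the question suggests would avoid rescanning every string, but for modest lists the regex approach is hard to beat for simplicity.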
How to extend naive string matching algorithm to search for a pattern of real numbers when both Pattern and Text are saved in floating point arrays? What are the assumptions we have to make?
Just treat the floating-point array as a string and then apply the string-matching algorithm.
example: Text[1.2,3.45,5.123] and Pattern[1.2,3.45]
Then the strings will look like:
Text[]="1.23.455.123" and Pattern[]="1.23.45"
Now we can match the pattern against the text using the naive algorithm. (In practice you would insert a delimiter between the numbers when concatenating; otherwise a pattern can falsely match across number boundaries.)
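Concatenating without a delimiter can create false matches across number boundaries (e.g. the single number 2.3 would "match" across the boundary of 1.2 and 3.45 in "1.23.45..."). A safer variant of the same idea is to run the naive algorithm directly on the float slices, comparing element-wise. A sketch in Go (this assumes exact float equality; for real data you might compare within a tolerance):

```go
package main

import "fmt"

// naiveFloatSearch applies the naive string-matching algorithm directly to
// float slices: slide the pattern over the text and compare element-wise.
// It returns the index of the first match, or -1 if there is none.
func naiveFloatSearch(text, pattern []float64) int {
	for i := 0; i+len(pattern) <= len(text); i++ {
		match := true
		for j := range pattern {
			if text[i+j] != pattern[j] { // exact equality assumed
				match = false
				break
			}
		}
		if match {
			return i
		}
	}
	return -1
}

func main() {
	text := []float64{1.2, 3.45, 5.123}
	pattern := []float64{1.2, 3.45}
	fmt.Println(naiveFloatSearch(text, pattern)) // 0
}
```

The only assumption the naive algorithm needs is an equality test on elements, so it carries over to numeric arrays unchanged.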
I've been doing a bit of research and can't for the life of me find out if this is possible. Is it possible to use a binary search tree for strings? The way I see it, if I were to use a binary search tree for strings, I'd have to represent those strings as numbers to make the comparisons valid. I know it's probably better to use a suffix tree, but if I were to use a binary search tree for strings, what would be the best method for comparing string values such as names? Thanks.
You don't need to convert strings to numbers: strings have a natural lexicographic ordering (what strcmp gives you), so they can be compared directly in the tree. The other way would be to decompose the string and use part of it as the key; this is very common in databases, although not very recommended.
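A minimal sketch in Go of a BST keyed directly on strings, using `strings.Compare` for the lexicographic ordering (the `node` type and function names are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// node is a binary-search-tree node keyed directly on strings;
// strings.Compare supplies the lexicographic ordering, so no
// numeric encoding of the keys is needed.
type node struct {
	key         string
	left, right *node
}

// insert adds key to the tree rooted at n, ignoring duplicates.
func insert(n *node, key string) *node {
	if n == nil {
		return &node{key: key}
	}
	switch c := strings.Compare(key, n.key); {
	case c < 0:
		n.left = insert(n.left, key)
	case c > 0:
		n.right = insert(n.right, key)
	}
	return n
}

// contains reports whether key is present in the tree rooted at n.
func contains(n *node, key string) bool {
	for n != nil {
		switch c := strings.Compare(key, n.key); {
		case c == 0:
			return true
		case c < 0:
			n = n.left
		default:
			n = n.right
		}
	}
	return false
}

func main() {
	var root *node
	for _, name := range []string{"carol", "alice", "bob"} {
		root = insert(root, name)
	}
	fmt.Println(contains(root, "bob"), contains(root, "dave")) // true false
}
```

An in-order traversal of such a tree yields the names in sorted order, which is one of the main reasons to prefer a BST over a hash table here.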
Can anyone point to the best algorithm for searching for a substring in another string?
Or for searching for a char array within another char array?
The best from what point of view? Knuth-Morris-Pratt is a good one. You can find more of them discussed on the Wikipedia entry for string searching algorithms.
It depends on what types of searching you are doing. Specific substring over a specific string? Specific substring over many different strings? Many different substrings over a specific string?
Here's a popular algorithm for a specific substring over many different strings.
Boyer-Moore algorithm: http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm
This strstr() implementation seems pretty slick.
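For reference, here is a minimal sketch of the Knuth-Morris-Pratt algorithm mentioned above, in Go; it runs in O(n + m) time for a text of length n and pattern of length m (for everyday use, Go's built-in `strings.Index` is the practical choice):

```go
package main

import "fmt"

// kmpSearch returns the index of the first occurrence of pat in text,
// or -1 if pat does not occur, in O(len(text)+len(pat)) time.
func kmpSearch(text, pat string) int {
	if len(pat) == 0 {
		return 0
	}
	// failure[i] is the length of the longest proper prefix of
	// pat[:i+1] that is also a suffix of it.
	failure := make([]int, len(pat))
	k := 0
	for i := 1; i < len(pat); i++ {
		for k > 0 && pat[i] != pat[k] {
			k = failure[k-1]
		}
		if pat[i] == pat[k] {
			k++
		}
		failure[i] = k
	}
	// Scan the text, falling back via the failure table on mismatch
	// instead of re-examining text characters.
	k = 0
	for i := 0; i < len(text); i++ {
		for k > 0 && text[i] != pat[k] {
			k = failure[k-1]
		}
		if text[i] == pat[k] {
			k++
		}
		if k == len(pat) {
			return i - len(pat) + 1
		}
	}
	return -1
}

func main() {
	fmt.Println(kmpSearch("ababcabcab", "abcab")) // 2
}
```

Boyer-Moore tends to win on long patterns over large texts because it can skip ahead, while KMP's strength is its worst-case linear guarantee with no backtracking over the text.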
I am working on a small search engine that displays matching file names with their full paths. The important thing is that I need to provide wildcard (glob) search, such as *.doc, *list*.xlx, *timesheet*, ???.doc, or something like that.
I found a related solution:
Search for strings matching the pattern "abc:*:xyz" in less than O(n)
but I am looking for an efficient algorithm that can find matches among a million file names in less than a second, so better than O(n) is required.
I am thinking of a two-phase algorithm: a substring-array (suffix array + prefix array) search in the first phase, followed by a normal regex search through the results of the first phase.
Any help would be greatly appreciated...
As far as I know there is no way to do better than O(n) for generalized glob searching.
However for the special cases of prefix and suffix search you can make yourself sorted indexes to do a binary search on, resulting in O(log n) for prefix and suffix search. The prefix index would be sorted based on the first character, then the second, and so on. The suffix index would be sorted based on the last character, then the second last, and so on (the reverse strings).
I would do as you suggest in your post and do the search in two phases: search the prefix or suffix indexes first, then brute-force through the reduced list from the first phase using a regex built from the glob.
Since string-length comparisons are faster than regexes, I would also pre-filter on the minimum matching string length (or the exact length, for the ???.doc example).
From the sound of your original post the indexes would need to refer to the full path of each entry as well, so that you can display it after the final results are found.
Check out self-indexing: this Stack Overflow question, and this Dr. Dobb's article on it.