We have a list of strings. Given a smaller pattern containing at most three underscores (wildcards) in between the letters, I have to find the maximum number of strings from the list that it matches.
E.g.
1243, 3452, 2343, 124
1_4_
The answer is 2, as both 1243 and 124 match: each underscore can either be filled with any single character or left empty.
Can anyone suggest some efficient hashing techniques for this?
Hashing wouldn't be a good approach for this problem. I suggest stringifying your numbers and then using a regex to match the characters based on their position in the string.
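A minimal sketch of that idea, assuming each underscore may match any single character or nothing (the translation to '.?' and the function name are my own):

import re

def count_matches(pattern, strings):
    # '_' may be filled with any character or left empty -> regex '.?'
    regex = re.compile("^" + pattern.replace("_", ".?") + "$")
    return sum(1 for s in strings if regex.match(s))

print(count_matches("1_4_", ["1243", "3452", "2343", "124"]))  # -> 2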
I have a very interesting problem.
I have a set of strings, and I would like to know how best to match a combination of these strings inside another string against a maximization function.
An example: say I have the set
['aabbcaa', 'bbc']
and I have the string
'fgabbcdaabbcaaef'
and the possible matches for this are:
fga[bbc]daa[bbc]aaef
or
fga[bbc]d[aabbcaa]ef
Now, given a simple maximization function, I would say that fga[bbc]d[aabbcaa]ef is the winner due to the total number of characters matched. A different maximization function could give more weight to larger words being replaced, instead of total characters.
I would love to know if someone could point me to some algorithms for this. What I'm stumped by: after I find the set of potential matches, I'm not sure how to choose the subset of words that maximizes the score in an efficient way.
The dictionary, the words of the dictionary, and the word being matched against could each be of any size.
Would appreciate any help I could get with this. Thank you!
Found the answer, and it works nicely. Pseudocode:
Loop over the set and find everywhere each set string matches in the target string. Store the start_index and end_index, and give that match a score; I currently use the length of the string.
Then run all the matches found through the "Weighted Interval Scheduling" algorithm to find the optimal set of matches (a sketch follows the link below).
https://courses.cs.washington.edu/courses/cse521/13wi/slides/06dp-sched.pdf
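In Python, that pipeline might look like the following. This is a sketch, not the poster's actual code: the score is the length of the matched word, intervals use inclusive end indices, and best_matches is an illustrative name.

import bisect

def best_matches(words, text):
    # 1. Collect every occurrence of every word as (start, end, score).
    intervals = []
    for w in words:
        start = text.find(w)
        while start != -1:
            intervals.append((start, start + len(w) - 1, len(w)))
            start = text.find(w, start + 1)
    # 2. Weighted interval scheduling: sort by end index, then DP, using
    #    binary search to find the matches ending strictly before each start.
    intervals.sort(key=lambda iv: iv[1])
    ends = [iv[1] for iv in intervals]
    dp = [0] * (len(intervals) + 1)   # dp[i] = best score using first i intervals
    for i, (s, _end, score) in enumerate(intervals, 1):
        j = bisect.bisect_left(ends, s)   # intervals ending before s
        dp[i] = max(dp[i - 1], dp[j] + score)
    return dp[-1]

print(best_matches(['aabbcaa', 'bbc'], 'fgabbcdaabbcaaef'))  # -> 10 (bbc + aabbcaa)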
I'm currently working on an application where I have a large number of hash values (strings).
When a query hash value (string) is given, the search goes through those strings and returns every string whose Hamming distance to the query string is less than a given threshold.
The hash values are not binary strings, e.g. "1000302014771944008".
All hash values (strings) have the same fixed length.
The threshold is not small (normally t > 25) and can vary.
I want to implement this search with an efficient algorithm rather than the brute-force approach.
I have read some research papers (like this & this), but they are for binary strings or for low threshold values. I also tried locality-sensitive hashing, but the implementations I found focus on binary strings.
Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.
Additional Information
Hamming Distance between non-binary strings
string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
               ^     ^        ^
          3 differing positions  <-- Hamming distance = 3
Considered brute-force approach (sketched below):
1. Calculate the Hamming distance between the first hash string and the query hash string.
2. If the Hamming distance is less than the threshold, add the hash string to the results list.
3. Repeat steps 1 and 2 for all hash strings.
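For reference, that brute force fits in a few lines (a sketch; hamming and brute_force_search are illustrative names, and it is O(n*m) for n hashes of length m):

def hamming(a, b):
    # the hashes all have the same fixed length, so zip covers every position
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def brute_force_search(query, hashes, threshold):
    return [h for h in hashes if hamming(query, h) < threshold]

print(hamming("0014479902266110001131133", "0014409902226110001111133"))  # -> 3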
Read the 7th section of the paper:
"HmSearch: An Efficient Hamming Distance Query Processing Algorithm".
The state-of-the-art result for the d-query problem can be found in
"Dictionary matching and indexing with errors and don't cares", which solves the d-query problem in O(m + log^d(nm) + occ) time using O(n * log^d(nm)) space, where
occ is the number of query results.
If the threshold is not small, there are practical solutions for binary strings, described in HmSearch.
I think it is possible to adapt those practical solutions from HmSearch to arbitrary strings, but I've never seen it done.
Something like this could work for you.
http://blog.mafr.de/2011/01/06/near-duplicate-detection/
The general idea is to use two for loops: take each character from string1 and compare it against every character of string2; if all of them are found, string1 is "included" in string2.
So we loop over every character of string1 and, for each one, scan every character of string2, which gives O(n^2) running time.
The interviewer said this is not a good idea.
I've been thinking about it since, but I can't come up with an approach that avoids the two loops.
Perhaps I could first take all the characters of string1, convert them to ASCII codes, and build those numbers into a tree, so that the lookups while comparing against string2 become very fast.
Does anyone have a better idea?
For example, if string1 is abc and string2 is cbattt, then every character of string1 is included in string2.
It's not a substring check.
As iccthedral says, Boyer-Moore is probably what the interviewer was looking for.
Searching a text for a given pattern (pattern matching) is a very well-known problem. Known solutions:
KMP
witness tables
Boyer-Moore
suffix trees
All these solutions vary in some minor aspects: whether they can be generalized to 2D pattern matching, whether they need pre-processing, whether they can be generalized to an unbounded alphabet, running time, etc.
EDIT:
If you just want to know whether all the letters of one string appear in some other string, why not use a table the size of your alphabet indicating whether a given char occurs in the string? If the alphabet is unbounded or extremely large (more than O(1)), use a hash table.
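A minimal sketch of that table idea, with a set (i.e. a hash table) standing in for the alphabet-sized array so it also covers the unbounded-alphabet case (the function name is my own):

def all_chars_included(s1, s2):
    present = set(s2)                     # one pass over string2
    return all(c in present for c in s1)  # one pass over string1

print(all_chars_included("abc", "cbattt"))  # -> True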
I just came across this interesting question online and am quite stumped as to how to even progress on it.
Write a function that finds all the different ways you can split up a word into a
concatenation of two other words.
Is this something that Suffix Trees are used for?
I'm not looking for code, just conceptual way to move forward with this.
Some pseudocode (a direct Python version follows):
foreach place you can split the word:
    split the word.
    check if both sides are valid words.
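That pseudocode translated directly, assuming the valid words are given as a set (the names are illustrative):

def two_word_splits(word, valid_words):
    # try every split point; keep the splits where both halves are valid words
    return [(word[:i], word[i:])
            for i in range(1, len(word))
            if word[:i] in valid_words and word[i:] in valid_words]

print(two_word_splits("catdog", {"cat", "dog", "ca", "tdog"}))
# -> [('ca', 'tdog'), ('cat', 'dog')]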
If you are looking for a nice answer, then please let us know your definition of a valid word.
Assuming a word is a string defined over an alphabet and with length greater than zero, you can use suffix trees.
Below is a simplified algorithm which needs only O(n) split positions.
Convert the word into a character array.
Traverse the array and, for each index i, take the two substrings (0 to i) and (i+1 to length-1).
Remember to cover the base conditions, like length greater than zero.
The total number of different ways to do it can be greater than one if and only if this condition holds:
-> one of the two words must be a multiple of the other, e.g. "abcd" and "abcdabcd".
Using these two words you can form the string "abcdabcdabcdabcd" in several different ways.
So first check this condition (a small helper sketch follows).
Then check whether the string can be written from the two words at all; simple math should then give you the answer.
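A small helper for the condition above, checking whether one word is a whole-number repetition of the other (the function is my own illustration, not part of the answer):

def is_multiple(small, big):
    # big is a "multiple" of small if it equals small repeated k times
    if len(big) % len(small) != 0:
        return False
    return small * (len(big) // len(small)) == big

print(is_multiple("abcd", "abcdabcd"))  # -> True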
I have two strings which must be compared for similarity. The algorithm must be designed to find the maximal similarity. In this instance, the ordering matters, but intervening (or missing) characters do not. Edit distance cannot be used in this case for various reasons.
The situation is basically as follows:
string 1: ABCDEFG
string 2: AFENBCDGRDLFG
the resulting algorithm would find the substrings A, BCD, FG
I currently have a recursive solution, but because this must be run on massive amounts of data, any improvements would be greatly appreciated.
Looking at your sole example, it looks like you want to find the longest common subsequence.
Take a look at LCS.
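For reference, here is the textbook bottom-up LCS dynamic program, O(len(a) * len(b)) time and space, which replaces the exponential recursion (this is the standard algorithm, not the poster's code):

def lcs(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("ABCDEFG", "AFENBCDGRDLFG"))  # -> "ABCDFG", i.e. A + BCD + FG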
Is it just me, or is this NP-hard? – David Titarenco (from comment)
If you want the LCS of an arbitrary number of strings, it's NP-hard. But if the number of input strings is constant (as in this case, 2), it can be done in polynomial time.