I am planning to store a large number of sentences in a given language, such that every element has a pointer to the translation of that sentence into other languages.
By sorting the sentences I can create a balanced binary tree and add/delete/search in O(log n).
Is there a way to save the elements in a hash table? What would the hash function look like in this situation?
I want to be able to search a sentence in any language and return the translation of it in any other language.
If I have k languages, using balanced trees I get O(k log n) search time across the k trees.
Is there a better way?
Yes, a trie is better suited for string search. I'd also store all sentences in all languages in a single trie, with the value being the list of all available translations.
The search in this case will be O(k), where k is the length of the sentence being searched.
UPD
If you are searching only for a known source language and translating into a target language known in advance, then:
Make each value in the trie a Map<(Source language) → Map<(Target language) → Translation>>.
You'll have the following workflow: find the value that corresponds to the given sentence in the trie. In the found Map, search for the source language. If found, this means the sentence exists in the source language (the same string can occur in several languages). The inner Map found at this point represents all available translations into target languages. Search it for the desired one.
The most performance-efficient way to implement both Maps is a plain array, where the i-th element represents the value for the i-th language (assuming you have a finite number of languages and can assign a distinct index to each of them). The lookup for a known language is O(1): just access the known index in the array. But this approach is also the most memory-consuming: if the values in the Map are sparse, the backing array still consumes a fixed amount of memory (proportional to the number of all possible languages).
Still, this approach works well for the inner Map (where, as I understood it, translations into all possible languages are stored).
For the outer Map, which I expect to be sparse, you have several options:
If the average number of collisions between sentences in different languages is small (1-2), you can just store Key→Value pairs in a list. The average search time will be constant.
If the average number of collisions is significant, but still not big enough to justify an array (the approach for the inner Map), back this Map with a hashtable whose size is the expected number of values in the Map, i.e. the average number of collisions. Then again you'll get amortized O(1) search time.
If the Map turns out to be dense, go with the array approach, as for the inner Map.
UPD2 Example:
Suppose you have the following sentences, which translate to each other:
Hello [en]
Hola [es]
Hallo [nl]
Hallo [de]
You should add all of them to the trie, with the following values:
Hello -> Map([en] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Hola -> Map([es] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Hallo -> Map([nl] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo),
[de] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Square brackets here have nothing to do with arrays; they are only used to visually distinguish languages.
So, when you perform a search in your trie, you don't know which language your sentence is in. When you get a value from the trie (a Map of Maps), you can "map" the original language of your sentence to the translation in the desired language.
Back to the example: you have the sentence "Hallo", you know it's in [de], and you want to translate it into [en].
You search for "Hallo" in the trie and get:
Map([nl] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo),
[de] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Now you get the value for [de] from this map (as you know your sentence is in [de]). You'll get:
Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo)
Now search this map for the target language ([en]). You'll get "Hello".
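For illustration, here is a minimal Python sketch of this Trie→Map→Map lookup, with plain dicts standing in for the trie nodes and for both Maps (the names are illustrative, and this is a sketch of the idea rather than a tuned implementation):

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.value = None   # source language -> {target language -> translation}

class TranslationTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, sentence, source_lang, translations):
        # translations: {target language -> translated sentence}
        node = self.root
        for ch in sentence:
            node = node.children.setdefault(ch, TrieNode())
        if node.value is None:
            node.value = {}
        node.value[source_lang] = translations

    def translate(self, sentence, source_lang, target_lang):
        node = self.root
        for ch in sentence:  # O(k), k = length of the sentence
            node = node.children.get(ch)
            if node is None:
                return None
        if node.value is None or source_lang not in node.value:
            return None  # sentence unknown in the source language
        return node.value[source_lang].get(target_lang)

Mirroring the example above:

trie = TranslationTrie()
translations = {"en": "Hello", "es": "Hola", "nl": "Hallo", "de": "Hallo"}
for lang, sentence in translations.items():
    trie.add(sentence, lang, translations)

print(trie.translate("Hallo", "de", "en"))  # -> Hello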
Of course, instead of a Trie→Map→Map structure you could have Map→Trie→Map, where the first Map maps source languages to per-language dictionaries, each represented by a trie. But I think this is less memory-efficient: it seems that words in different languages often share the same prefix, so keeping them all in a single trie preserves memory. It's the same in terms of performance, so it's up to you.
I want to implement a search filter, as efficient as possible, that manages the book titles in my "library". The search should work as follows:
The user types in the first b letters of a book title. Let n be the number of book titles that begin exactly with the entered sequence of letters, and let k be a preset constant that specifies how many book titles should be output. That means: if n ≤ k, an alphabetically sorted list of the n book titles is output.
The main problem I'm currently facing is that I don't know which data type to pick and which data structure to implement it on, because I need it to be as efficient as possible.
And a follow-up question: if I use an array for this, I should definitely choose a sorted array, right?
Any help is gladly appreciated, and I'm NOT asking for an implementation.
This is a very good candidate for a trie. At each keystroke you know exactly how many books start with the prefix, provided each node keeps a count; walking the prefix is a linear operation, so the search for a prefix of n characters takes O(n) time. In a simple trie implementation, to get the book titles you would need to traverse the rest of the paths that start with the prefix, but this can also be remedied by having pointers at each node of the trie to the titles that start with that prefix (in other words, the titles that continue from that node). You can also sort those pointers in advance.
So the whole operation of searching for all the titles that start with a prefix of length n and returning them is O(n + m), where m is the number of titles with that prefix.
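A small Python sketch of the node layout just described, with a per-node count and up to k pre-sorted titles cached at each node (all names are illustrative; this is only to make the structure concrete, not the requested implementation):

class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.count = 0      # how many titles start with this prefix
        self.titles = []    # up to k titles with this prefix, already sorted

def build(titles, k):
    root = Node()
    for title in sorted(titles):  # inserting in sorted order keeps the lists sorted
        node = root
        for ch in title:
            node = node.children.setdefault(ch, Node())
            node.count += 1
            if len(node.titles) < k:  # cache at most k titles per node
                node.titles.append(title)
    return root

def suggest(root, prefix):
    node = root
    for ch in prefix:  # O(b) for a prefix of b letters
        node = node.children.get(ch)
        if node is None:
            return 0, []
    return node.count, node.titles  # if count <= k, titles holds all n matches

n, titles = suggest(build(["walden", "war and peace", "watership down"], k=2), "wa")
print(n, titles)  # 3 ['walden', 'war and peace']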
My problem statement is that I am given millions of strings, and I have to find the strings that contain a given substring.
E.g., given "xyzoverflowasxs", "werstackweq", etc., a search for the substring "stack" should return "werstackweq". What kind of data structure can we use to solve this problem?
I think we can use a suffix tree for this, but I wanted some more suggestions.
I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One option is suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, computing the Rabin fingerprints of all its length-k substrings is efficient and easy (implementations exist in any language).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a query of length k or greater, you would choose a length-k subword, calculate its Rabin fingerprint, find the words associated with this fingerprint, and check whether they indeed contain the query.
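A rough Python sketch of the scheme for a single fixed k, with Python's built-in hash standing in for a Rabin fingerprint (a real rolling hash would let you compute all length-k fingerprints of a string in linear time; the structure of the index is what matters here):

from collections import defaultdict

K = 5  # fingerprint length; tune experimentally as discussed below
strings = ["xyzoverflowasxs", "werstackweq"]

index = defaultdict(set)  # fingerprint -> words containing it
for s in strings:
    for i in range(len(s) - K + 1):  # every length-K window of s
        index[hash(s[i:i + K])].add(s)

def find(query):
    if len(query) < K:
        return [s for s in strings if query in s]  # fall back to a full scan
    candidates = index.get(hash(query[:K]), set())
    return [s for s in candidates if query in s]  # fingerprints only suggest; verify

print(find("stack"))  # ['werstackweq']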
The question is which k to use, and whether to use multiple such k. I would decide this experimentally (starting with a few small values simultaneously, say k = 1, 2, and 3, plus a couple of larger ones). The performance of this heuristic ultimately depends on the distribution of your dictionary and queries.
I'm trying to write a program in Java that stores a dictionary in a HashMap (each word under a different key), compares a given word to the words in the dictionary, and comes up with a spelling suggestion if it is not found in the dictionary: basically a spell-check program.
I already came up with the comparison algorithm (Needleman-Wunsch, then Levenshtein distance), etc., but got stuck figuring out which words in the dictionary HashMap to compare the word to, e.g. "hellooo".
I cannot compare "ohelloo" (which should be corrected to "hello") to every word in the dictionary, because that would take too long, and I cannot compare it only to the words in the dictionary starting with "o", because it's supposed to be "hello".
Any ideas?
The most common spelling mistakes are:
Deleting a letter (smaller word OR word split)
Swapping adjacent letters
Altering a letter (to a QWERTY-adjacent one)
Inserting a letter
Some reports say that 70-90% of mistakes fall into the above categories (edit distance 1).
Take a look at the URL below, which provides a solution for single or double mistakes (edit distance 1 or 2). Almost everything you'll need is there!
How to write a spelling corrector
FYI: you can find implementations in various programming languages at the bottom of the aforementioned article. I've used it in some of my projects; the practical accuracy is really good, sometimes above 95%, as claimed by the author.
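The core of that approach is small enough to sketch here: generate all edit-distance-1 variants of the word (the four mistake categories listed above) and keep the ones that appear in the dictionary. A minimal Python version, assuming a lowercase alphabet:

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    alters = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + alters + inserts)

dictionary = {"hello", "help", "hell"}
print([w for w in edits1("hellp") if w in dictionary])  # hello, help, hell (in some order)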
--Based on OP's comment--
If you don't want to pre-compute every possible alteration and then search the map, I suggest that you use a Patricia trie (radix tree) instead of a HashMap. Unfortunately, you will again need to handle the "first-letter mistake" (e.g. remove the first letter, swap the first letter with the second, or just replace it with a QWERTY-adjacent one), but then you can limit your search with high probability.
You can even combine it with an extra index Map or trie of "reversed" words, or an extra index that omits the first N characters (e.g. the first 2), so you can also catch errors that occur only in the prefix.
The problem
I am given N arrays of C booleans. I want to organize these into a data structure that allows me to do the following operation as fast as possible: given a new array, return true if this array is a "superset" of any of the stored arrays. By superset I mean: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each with C possible elements) in a data structure so you can quickly look up whether a given set is a superset of any of the stored sets.
Building the data structure can take as long as needed, but the lookup should be as efficient as possible, and the data structure can't take too much space.
Some context
I think this is an interesting problem in its own right, but for the problem I'm really trying to solve you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked-up arrays are random (so not sparse)
What I've come up with so far
For O(N*C) lookup: just iterate over all the arrays. This is too slow, though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C this large, it takes too much space.
I hope that in between the O(N*C) and O(C) solutions there is maybe an O(log(N)*C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)*C) lookup: store the arrays in a prefix trie. When looking up an array A, go to the appropriate subtree if A[i] = 0, but visit both subtrees if A[i] = 1 (see the sketch below).
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)*C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse; and 2. it's only intuition, I can't prove it.
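For concreteness, a minimal Python sketch of this lookup, using nested dicts as trie nodes and assuming all stored arrays have the same length C (the "end" key marks a stored array):

def insert(trie, bits):
    node = trie
    for b in bits:
        node = node.setdefault(b, {})
    node["end"] = True

def has_subset(node, query, i=0):
    # True if some stored array B satisfies B[j] <= query[j] for all j
    if "end" in node:
        return True
    if i == len(query):
        return False
    # a stored 0 at this position is always compatible with the query
    if 0 in node and has_subset(node[0], query, i + 1):
        return True
    # a stored 1 is only compatible if the query also has a 1 here
    if query[i] == 1 and 1 in node and has_subset(node[1], query, i + 1):
        return True
    return False

trie = {}
insert(trie, [0, 1, 0, 0])
insert(trie, [1, 0, 0, 1])
print(has_subset(trie, [1, 1, 0, 0]))  # True: superset of [0,1,0,0]
print(has_subset(trie, [0, 0, 1, 1]))  # False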
I will try out both this new idea and the BDD method, and see which of the two works best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
Just to add some background information to the prefix-trie solution: I recently found the following paper:
I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure (container), which provides efficient storage and querying of sets of sets using a trie, supporting operations like finding all supersets/subsets of a given set in a collection of sets.
For any Python users interested in an actual implementation, I came up with a Python 3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on GitHub.
I think a prefix trie is a great start.
Since your arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, then both are included. So the idea is to OR-pack the arrays by pairs, and to reiterate until there is only one "root" array (this takes only twice as much space). It allows you to answer 'yes' to your question earlier, which is mainly useful if you don't need to know which array is actually contained.
Independently, you can apply to each array a hash function that preserves ordering.
I.e.: B ⊂ A ⇒ h(B) ≺ h(A)
ORing the bits together is such a function, but you can also count the 1-bits in suitable partitions of the array. Here, you can eliminate candidates faster (answering 'no' for a particular array).
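A small Python sketch of such a signature: count the 1-bits in fixed partitions. If B ⊂ A, every partition count of B is at most the corresponding count of A, so one cheap comparison can rule a candidate out without scanning the full arrays (the partition size is a tuning knob; this is illustrative only):

def signature(bits, parts=4):
    step = (len(bits) + parts - 1) // parts
    return tuple(sum(bits[i:i + step]) for i in range(0, len(bits), step))

def may_contain(sig_a, sig_b):
    # False means B is certainly not a subset of A; True means "maybe"
    return all(b <= a for a, b in zip(sig_a, sig_b))

A = [1, 1, 0, 0, 1, 0, 1, 1]
B = [1, 0, 0, 0, 0, 0, 1, 0]  # subset of A
C = [0, 0, 1, 1, 0, 0, 0, 0]  # not a subset of A
print(may_contain(signature(A), signature(B)))  # True
print(may_contain(signature(A), signature(C)))  # False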
You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other stored set. The problem remains the same, because if some input set A is a superset of some set B that you removed, then A is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
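The reduction itself can be naive, since preprocessing time is not the bottleneck here; an O(N^2) Python sketch:

def minimal_sets(sets):
    sets = [frozenset(s) for s in sets]
    # keep a set only if no other stored set is a proper subset of it
    return [s for s in sets if not any(other < s for other in sets)]

print(minimal_sets([{1, 2, 3}, {1, 2}, {4}]))  # keeps {1, 2} and {4}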
From there I would use some kind of ID3 or C4.5 algorithm.
Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a subset lookup using existing efficient trie implementations for Python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of set elements to about 110000.
from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):
    # setarr is the query set encoded as a sorted string (one character
    # per element); trieState tracks the current position in the trie
    if trieState is None:
        trieState = BaseState(trie)
    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        # try to extend the current trie path with this element
        if trieState2.walk(elem):
            # either a stored set ends here, or we keep matching the
            # rest of setarr deeper in the trie
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2):
                return True
            trieState.copy_to(trieState2)  # backtrack and skip this element
    return False
The trie can be used like a dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

# `sets` is your collection of stored sets (iterables of small ints)
for subset in sets:
    trie["".join(chr(i) for i in subset)] = 0  # the assigned value does not matter
Note that the trie implementation above works only with set elements strictly greater than 0; otherwise the integer-to-character mapping does not work properly. This problem can be solved with an index shift.
A cython implementation that also covers the conversion of elements can be found here.
I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. The terms to search for will be contained in a list and can have 1000+ entries. The search terms may be one or more words.
Obviously I could make multiple passes through the text, comparing against each search term. Not too efficient.
I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. The language is C++ and I can use Boost.
An example of search terms could be a list of Fortune 500 company names.
Ideas?
Don't reinvent the wheel
This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.
And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.
A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.
As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.
An alternative, hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the other algorithms, but compensates by reducing the linear factor via hashing. Which one is better ultimately depends on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
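The question is about C++, but the shape of multi-pattern Rabin-Karp is easy to show in a short Python sketch: hash a fixed-length prefix of every pattern once, slide a window over the text, and verify full patterns only on fingerprint hits (a real implementation would use a rolling hash instead of rehashing each window):

def multi_search(text, patterns):
    m = min(len(p) for p in patterns)  # common window length
    prefixes = {}
    for p in patterns:
        prefixes.setdefault(hash(p[:m]), []).append(p)
    matches = []
    for i in range(len(text) - m + 1):
        for p in prefixes.get(hash(text[i:i + m]), []):
            if text.startswith(p, i):  # verify the full pattern
                matches.append((i, p))
    return matches

text = "International Business Machines bought Apple, reportedly."
print(multi_search(text, ["Apple", "International Business Machines"]))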
Assuming that the large body of text is static English text and you need to match whole words, you can try the following (you should really clarify in your question what exactly counts as a 'match', what kind of text you are looking at, etc.).
First preprocess the whole document into a Trie or a DAWG.
A trie/DAWG has the following property:
Given a trie/DAWG and a search term of length K, you can look up the data associated with the word (or tell that there is no match) in O(K) time.
Using a DAWG can save you more space than a trie. Tries exploit the fact that many words share a common prefix; DAWGs exploit the common prefix as well as the common suffix.
In the trie, also maintain the list of positions of each word. For example, if the text is
That is that and so it is.
the node for the last t in that will have the list {1,3}, and the node for the s in is will have the list {2,7} associated with it.
Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.
If you get a multi-word search term, you can do the following:
Walk the trie with the first word of the search term. Get the list of matches and insert it into a hashtable H1.
Now walk the trie with the second word of the search term and get its list of matches. For each match position x, check whether x-1 exists in H1. If so, add x to a new hashtable H2.
Walk the trie with the third word and get its list of matches. For each match position y, check whether y-1 exists in H2; if so, add y to a new hashtable H3.
Continue in this fashion.
At the end you get a list of matches for the search phrase, which gives the positions of the last word of the phrase.
You could potentially optimize the phrase-matching step by maintaining a sorted list of positions for each word and doing a binary search: e.g., for each key k in H2, binary-search for k+1 in the sorted position list of the third search term, and add k+1 to H3 if you find it, and so on.
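A compact Python sketch of the whole phrase-matching scheme, with a plain dict of word -> position list standing in for the trie, and set membership standing in for the hashtables H1, H2, ...:

from collections import defaultdict

words = "That is that and so it is".lower().split()
positions = defaultdict(list)
for pos, word in enumerate(words, start=1):
    positions[word].append(pos)  # e.g. positions["is"] == [2, 7]

def find_phrase(phrase):
    terms = phrase.lower().split()
    current = set(positions[terms[0]])  # H1
    for term in terms[1:]:
        # keep a position x only if the previous word matched at x - 1
        current = {x for x in positions[term] if x - 1 in current}
    return sorted(current)  # positions of the last word of the phrase

print(find_phrase("that is"))  # [2]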
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length N, this can be built in O(N) time.
Then all k occurrences of a string of length m can be found optimally in O(m + k).
Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
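A naive Python sketch of the suffix-array variant: sort all suffix start positions, then binary-search for the pattern as a prefix. Construction here is far from the O(N) algorithms mentioned above, but the query structure is the same:

import bisect

text = "banana"
suffixes = sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(pattern):
    keys = [text[i:] for i in suffixes]  # materialized only for clarity
    lo = bisect.bisect_left(keys, pattern)
    # "\xff" as an upper sentinel assumes all characters are below chr(255)
    hi = bisect.bisect_right(keys, pattern + "\xff")
    return sorted(suffixes[lo:hi])

print(occurrences("ana"))  # [1, 3]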
See also
Wikipedia/Suffix tree
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
So you have lots of search terms and want to see if any of them are in the document?
Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use them as a regular expression, provided the regex engine, looking at /ant|ape/, will properly short-circuit the a in "ape" if it didn't find it in "ant". If not, you could "precompile" the regex and "squish" the results down to their minimum overlap: in the above case /a(nt|pe)/, and so on, recursively for each letter.
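A toy Python sketch of that squish: push the terms into a character trie, then emit a regex whose alternations share common prefixes (illustrative only; the question's language is C++):

import re

def build_trie(words):
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}  # the empty key marks the end of a word
    return trie

def to_regex(node):
    branches = ["" if ch == "" else re.escape(ch) + to_regex(child)
                for ch, child in sorted(node.items())]
    return branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"

pattern = to_regex(build_trie(["ant", "ape"]))
print(pattern)  # a(?:nt|pe)
print(bool(re.search(pattern, "an ape walked")))  # True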
However, doing the above is pretty much like putting all your search strings in a 26-ary tree (26 characters, more if you also include digits). Push your strings onto the tree, using one level of depth per character.
You can do this with your search terms to build a hyper-fast "does this word match anything in my list of search terms" check, if the number of search terms is large.
You could theoretically do the reverse as well: pack your document into the tree and then run the search terms against it, if your document is static and the search terms change a lot.
Depends on how much optimization you need...
Are the search terms single words, or can they be full sentences too?
If it's only words, then I would suggest building a red-black tree from all the words, and then searching for each word in the tree.
If they can be sentences, then it could get a lot more complex... (?)