Trie Autocomplete with word weight (frequency) - algorithm

I was asked this during a recent phone interview -
Given a Dictionary of words and each word's weight (frequency, higher is better), like so -
var words = new Dictionary<string,int>();
words.Add("am",7);
words.Add("ant", 5);
words.Add("amazon", 10);
words.Add("amazing", 8);
words.Add("an", 4);
words.Add("as", 11);
words.Add("be", 8);
words.Add("bee", 2);
words.Add("bed", 4);
words.Add("best", 12);
words.Add("amuck", 1);
words.Add("amock", 2);
words.Add("bestest", 1);
Design an API method that, given a prefix and a number k, returns the top k words that match the prefix.
The words should be sorted by their weight, the higher the better.
So, prefix = "am", k = 5 returns amazon, amazing, am, amock, amuck - in that specific order.
Performance of the prefix lookup is paramount; you can pre-process and use as much space as you like, as long as the prefix lookup is fast.
This is a Trie problem, but my question is how best to handle the word weight and optimise the lookup. In my mind the options are -
a. For each node in the Trie, also store a sorted collection of all the words that start with this prefix (SortedDictionary<int, List<string>>) - more space, but faster lookup.
b. For each node, store the child nodes in some kind of sorted list; you would then still need a DFS from each child node to get the k words needed - less space compared to a., but slower.
I decided to go with option a.
public class TrieWithSuggestions
{
    TrieWithSuggestions _trieRoot;

    public TrieWithSuggestions()
    {
    }

    public char Character { get; set; }
    public int WordCount { get; set; } = 1;
    public TrieWithSuggestions[] ChildNodes { get; set; } = new TrieWithSuggestions[26];

    // Stores all words with this prefix, keyed by negated weight so that
    // enumerating the SortedDictionary yields the highest-weighted words first.
    public SortedDictionary<int, HashSet<string>> PrefixWordsDictionary = new SortedDictionary<int, HashSet<string>>();

    public TrieWithSuggestions ConstructTrie(Dictionary<string, int> words)
    {
        if (words.Count > 0)
        {
            _trieRoot = new TrieWithSuggestions() { Character = default(char) };
            foreach (var word in words)
            {
                var node = _trieRoot;
                for (int i = 0; i < word.Key.Length; i++)
                {
                    var c = word.Key[i];
                    if (node.ChildNodes[c - 'a'] != null)
                    {
                        // Path already exists: record the word at each node along it.
                        node = node.ChildNodes[c - 'a'];
                        UpdateParentNodeInformation(node, word.Key, words[word.Key]);
                        node.WordCount++;
                    }
                    else
                    {
                        InsertIntoTrie(node, word.Key, i, words);
                        break;
                    }
                }
            }
        }
        return _trieRoot;
    }

    public List<string> GetMatchingWords(string prefix, int k)
    {
        if (_trieRoot != null)
        {
            var node = _trieRoot;
            foreach (var ch in prefix)
            {
                if (node.ChildNodes[ch - 'a'] != null)
                    node = node.ChildNodes[ch - 'a'];
                else
                    return null;
            }
            return GetWords(node, k);
        }
        return null;
    }

    List<string> GetWords(TrieWithSuggestions node, int k)
    {
        List<string> output = new List<string>();
        foreach (var dictEntry in node.PrefixWordsDictionary)
        {
            var entries = dictEntry.Value;
            var take = Math.Min(entries.Count, k);
            output.AddRange(entries.Take(take));
            k -= take;
            if (k == 0)
                break;
        }
        return output;
    }

    void InsertIntoTrie(TrieWithSuggestions parentNode, string word, int startIndex, Dictionary<string, int> words)
    {
        for (int i = startIndex; i < word.Length; i++)
        {
            var c = word[i];
            var childNode = new TrieWithSuggestions() { Character = c };
            parentNode.ChildNodes[c - 'a'] = childNode;
            UpdateParentNodeInformation(parentNode, word, words[word]);
            parentNode = childNode;
            if (i == word.Length - 1)
                UpdateParentNodeInformation(parentNode, word, words[word]);
        }
    }

    void UpdateParentNodeInformation(TrieWithSuggestions parentNode, string word, int wordWeight)
    {
        // Negate the weight: SortedDictionary sorts keys ascending, so negated
        // weights put the heaviest words first.
        wordWeight *= -1;
        if (parentNode.PrefixWordsDictionary.ContainsKey(wordWeight))
        {
            if (!parentNode.PrefixWordsDictionary[wordWeight].Contains(word))
                parentNode.PrefixWordsDictionary[wordWeight].Add(word);
        }
        else
            parentNode.PrefixWordsDictionary.Add(wordWeight, new HashSet<string>() { word });
    }
}
Construct Trie - Runtime O(N * M * logN), Space O(N² * M), where N = # of words and M = average word length.
Justification -
If there were no SortedDictionary, construction would be O(N * M); insertion into a SortedDictionary is O(logN), so the worst-case runtime must be O(N * M * logN).
Space seems trickier, but as before, without the SortedDictionary space would be O(N * M), and in the worst case a node's dictionary could hold all N words, so the space complexity looks like O(N² * M).
GetMatchingWords - RunTime O(len(prefix) + k)
Function call -
var trie = new TrieWithSuggestions();
trie.ConstructTrie(words);
var list = trie.GetMatchingWords("am", 10); // amazon, amazing, am, amock, amuck
QUESTION:
Given the conditions on space and pre-processing, is there a better way to do this?
EDIT 1 -
a. Given this setup, it is best to sort the words by weight and then insert them into the Trie. In that case a simple List<string> would suffice, since higher-frequency words would automatically be inserted first.
b. Now let's say that, in addition to being initialized with a Dictionary<string, int>, we are also going to receive additional word/frequency pairs. We would still want a lookup that is as fast as possible. Given this requirement, what is now the best data structure to store the sorted list of words within a TrieNode - is a SortedDictionary<int, HashSet<string>> the best option?

You could first sort the input with respect to the weights. Then you could use Lists instead of Dictionaries on the nodes of the trie. Since the words come in increasing (or decreasing) order of weight, checking the last element of the list is enough to decide where to put each new word. This gets rid of the O(logN) time taken by the Dictionary. (A sketch of this idea follows below.)
The input can be sorted in O(N * logN) with a comparison sort, or in O(N + W) with a counting sort where W is the maximum weight.
Time complexity of setting up the trie becomes O(N * logN + N * M). This is better than O(N * M * logN). Query time does not change.
(Last paragraph assumes HashSet operations execute in O(1) as in the question. It is wrong to make this assumption for arbitrary inputs and hash functions.)
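
To make this concrete, here is a minimal sketch of the sort-first approach (my illustration, not the answer's actual code; the type and method names are made up). The input is sorted once by descending weight, each trie node keeps a plain List<string>, and because words arrive heaviest-first, appending keeps every list sorted; a query walks the prefix and takes the first k entries, so lookup stays O(len(prefix) + k):
using System;
using System.Collections.Generic;
using System.Linq;

public class SortedInsertTrieNode
{
    public SortedInsertTrieNode[] Children = new SortedInsertTrieNode[26];
    public List<string> WordsWithThisPrefix = new List<string>();
}

public class SortedInsertTrie
{
    readonly SortedInsertTrieNode _root = new SortedInsertTrieNode();

    public SortedInsertTrie(Dictionary<string, int> words)
    {
        // O(N logN) comparison sort by weight, highest first.
        foreach (var word in words.OrderByDescending(w => w.Value))
        {
            var node = _root;
            foreach (var c in word.Key)
            {
                node = node.Children[c - 'a'] ?? (node.Children[c - 'a'] = new SortedInsertTrieNode());
                node.WordsWithThisPrefix.Add(word.Key); // appended in weight order
            }
        }
    }

    public List<string> GetMatchingWords(string prefix, int k)
    {
        var node = _root;
        foreach (var c in prefix)
        {
            node = node.Children[c - 'a'];
            if (node == null) return new List<string>();
        }
        return node.WordsWithThisPrefix.Take(k).ToList();
    }
}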

Related

Last remaining number

I was asked this question in an interview.
Given an array 'arr' of positive integers and a starting index 'k' into the array: delete the element at k, then jump arr[k] steps forward through the array in circular fashion. Repeat until only one element remains, and find that last remaining element.
I thought of an O(n log n) solution using an ordered map. Is an O(n) solution possible?
My guess is that there is no O(n) solution to this problem, based on the fact that it seems to require doing something impossible. The obvious thing you would need to solve this problem in linear time is an array-like data structure that exposes two operations on an ordered collection of values:
O(1) order-preserving deletes from the data structure.
O(1) lookups of the nth undeleted item in the data structure.
However, such a data structure has been formally proven not to exist; see "Optimal Algorithms for List Indexing and Subset Rank" and its citations. Observing that the natural way to solve a problem relies on an impossible data structure is not a proof that the problem itself is impossible, but it is often a correct intuition.
Anyway, there are lots of ways to do this in O(n log n). Below is an implementation that maintains a tree of undeleted ranges in the array. GetIndex() below returns an index into the original array given a zero-based index into the array as it would look after deletions. Such a tree is not self-balancing, so it will have O(n) operations in the worst case, but in the average case Delete and GetIndex will be O(log n).
namespace CircleGame
{
    class Program
    {
        class ArrayDeletes
        {
            private class UndeletedRange
            {
                private int _size;
                private int _index;
                private UndeletedRange _left;
                private UndeletedRange _right;

                public UndeletedRange(int i, int sz)
                {
                    _index = i;
                    _size = sz;
                }

                public bool IsLeaf()
                {
                    return _left == null && _right == null;
                }

                public int Size()
                {
                    return _size;
                }

                public void Delete(int i)
                {
                    if (i >= _size)
                        throw new IndexOutOfRangeException();
                    if (!IsLeaf())
                    {
                        int left_range = _left._size;
                        if (i < left_range)
                            _left.Delete(i);
                        else
                            _right.Delete(i - left_range);
                        _size--;
                        return;
                    }
                    if (i == _size - 1)
                    {
                        _size--; // Can delete the last item in a range by decrementing its size
                        return;
                    }
                    if (i == 0) // Can delete the first item in a range by incrementing the index
                    {
                        _index++;
                        _size--;
                        return;
                    }
                    // Deleting from the middle of a leaf splits it into two ranges.
                    _left = new UndeletedRange(_index, i);
                    int right_index = i + 1;
                    _right = new UndeletedRange(_index + right_index, _size - right_index);
                    _size--;
                    _index = -1; // the index field of a non-leaf is no longer necessarily valid.
                }

                public int GetIndex(int i)
                {
                    if (i >= _size)
                        throw new IndexOutOfRangeException();
                    if (IsLeaf())
                        return _index + i;
                    int left_range = _left._size;
                    if (i < left_range)
                        return _left.GetIndex(i);
                    else
                        return _right.GetIndex(i - left_range);
                }
            }

            private UndeletedRange _root;

            public ArrayDeletes(int n)
            {
                _root = new UndeletedRange(0, n);
            }

            public void Delete(int i)
            {
                _root.Delete(i);
            }

            public int GetIndex(int indexRelativeToDeletes)
            {
                return _root.GetIndex(indexRelativeToDeletes);
            }

            public int Size()
            {
                return _root.Size();
            }
        }

        static int CircleGame(int[] array, int k)
        {
            var ary_deletes = new ArrayDeletes(array.Length);
            while (ary_deletes.Size() > 1)
            {
                int next_step = array[ary_deletes.GetIndex(k)];
                ary_deletes.Delete(k);
                k = (k + next_step - 1) % ary_deletes.Size();
            }
            return array[ary_deletes.GetIndex(0)];
        }

        static void Main(string[] args)
        {
            var array = new int[] { 5, 4, 3, 2, 1 };
            int last_remaining = CircleGame(array, 2); // third element, this call is zero-based...
        }
    }
}
Also note that if the values in the array are known to be bounded such that they are always less than some m less than n, there are lots of O(nm) algorithms -- for example, just using a circular linked list.
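For illustration, here is a minimal sketch of that circular-linked-list variant (my code, with hypothetical names, not part of the answer above). Deleting a node is O(1), and each jump costs at most m hops, giving O(n * m) overall:
using System;
using System.Collections.Generic;

class ListCircleGame
{
    // Assumes every value in 'array' is a positive integer bounded by some m.
    static int LastRemainingLinkedList(int[] array, int k)
    {
        var list = new LinkedList<int>(array);
        var node = list.First;
        for (int i = 0; i < k; i++)
            node = node.Next;                   // walk to the starting index k
        while (list.Count > 1)
        {
            int steps = node.Value;
            var toRemove = node;
            node = node.Next ?? list.First;     // successor, wrapping around
            list.Remove(toRemove);              // O(1) delete
            for (int i = 1; i < steps; i++)     // the successor already counts as
                node = node.Next ?? list.First; // the first of the arr[k] steps
        }
        return list.First.Value;
    }

    static void Main()
    {
        Console.WriteLine(LastRemainingLinkedList(new[] { 5, 4, 3, 2, 1 }, 2)); // 1
    }
}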
I couldn't think of an O(n) solution. However, we could get O(n log n) average time by using a treap, or an augmented BST that stores in each node the size of its subtree. Either one enables us to find and remove the kth entry in O(log n) average time.
For example, A = [1, 2, 3, 4] and k = 3 (as Sumit reminded me in the comments, use the array indexes as values in the tree since those are ordered):
        2(0.9)
       /      \
  1(0.81)    4(0.82)
             /
        3(0.76)
Find and remove the 3rd element. Start at 2, which counts as 2 elements (itself plus its left subtree). Go right: there the left subtree has size 1, which together makes 3, so we have found the 3rd element. Remove it:
        2(0.9)
       /      \
  1(0.81)    4(0.82)
Now we're starting on the third element in an array with n - 1 = 3 elements and looking for the 3rd element from there. We'll use zero-indexing to line up with our modular arithmetic, so the third element in modulus 3 would be index 2, and 2 + 3 = 5 mod 3 = 2, the second element (zero-indexed). We find it immediately, since the root together with its left subtree has size 2. Remove:
    4(0.82)
    /
 1(0.81)
Now we're starting on the second element in modulus 2, so index 1, and we're adding 2. 3 mod 2 is 1. Removing that first element, we are left with 4 as the last element.
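
To make the treap/augmented-BST idea concrete, here is a minimal C# sketch (my own illustration, with invented names like OrderStatTree; it is not the poster's code). Each node stores its subtree size, so the k-th smallest surviving index can be found and removed in O(height). The tree is built balanced over the array indexes but is not rebalanced after deletions, matching the caveat above that a real treap's random priorities are what guarantee O(log n) average time:
using System;

class OrderStatNode
{
    public int Value;  // array index stored in the tree
    public int Size;   // size of this subtree, including this node
    public OrderStatNode Left, Right;
}

class OrderStatTree
{
    OrderStatNode _root;
    static int SizeOf(OrderStatNode n) => n == null ? 0 : n.Size;

    // Build a balanced tree over the indexes 0..n-1.
    public static OrderStatTree OverRange(int n) => new OrderStatTree { _root = Build(0, n - 1) };

    static OrderStatNode Build(int lo, int hi)
    {
        if (lo > hi) return null;
        int mid = lo + (hi - lo) / 2;
        return new OrderStatNode
        {
            Value = mid,
            Left = Build(lo, mid - 1),
            Right = Build(mid + 1, hi),
            Size = hi - lo + 1
        };
    }

    public int Count => SizeOf(_root);

    // Remove and return the k-th smallest value (zero-based).
    public int RemoveKth(int k)
    {
        int removed = 0;
        _root = RemoveKth(_root, k, ref removed);
        return removed;
    }

    static OrderStatNode RemoveKth(OrderStatNode n, int k, ref int removed)
    {
        int leftSize = SizeOf(n.Left);
        if (k < leftSize)
            n.Left = RemoveKth(n.Left, k, ref removed);
        else if (k > leftSize)
            n.Right = RemoveKth(n.Right, k - leftSize - 1, ref removed);
        else
        {
            removed = n.Value;
            if (n.Left == null) return n.Right;  // splice out a node with
            if (n.Right == null) return n.Left;  // at most one child
            int successor = 0;                   // two children: replace this value
            n.Right = RemoveKth(n.Right, 0, ref successor); // with its successor
            n.Value = successor;
        }
        n.Size--;
        return n;
    }
}

class TreeCircleGame
{
    static int LastRemaining(int[] array, int k)
    {
        var tree = OrderStatTree.OverRange(array.Length);
        while (tree.Count > 1)
        {
            int idx = tree.RemoveKth(k);           // delete the k-th survivor
            k = (k + array[idx] - 1) % tree.Count; // jump arr[idx] steps, circularly
        }
        return array[tree.RemoveKth(0)];
    }

    static void Main()
    {
        Console.WriteLine(LastRemaining(new[] { 5, 4, 3, 2, 1 }, 2)); // 1
    }
}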

Word ladder complexity analysis

I'd like to make sure that I am doing the time complexity analysis correctly; there seem to be many different analyses of this.
Just in case people don't know the problem, here is the description.
Given two words (beginWord and endWord), and a dictionary's word list, find the length of shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
For example,
Given:
beginWord = "hit"
endWord = "cog"
wordList = ["hot","dot","dog","lot","log","cog"]
As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" -> "cog",
return its length 5.
And this is simple BFS algorithm.
static int ladderLength(String beginWord, String endWord, List<String> wordList) {
    int level = 1;
    Deque<String> queue = new LinkedList<>();
    queue.add(beginWord);
    queue.add(null); // null marks the end of a BFS level
    Set<String> visited = new HashSet<>();
    // worst case we can add the whole dictionary, thus N (len(dict)) computation
    while (!queue.isEmpty()) {
        String word = queue.removeFirst();
        if (word != null) {
            if (word.equals(endWord)) {
                return level;
            }
            // m * 26 * log N
            for (int i = 0; i < word.length(); i++) {
                char[] chars = word.toCharArray();
                for (char c = 'a'; c <= 'z'; c++) {
                    chars[i] = c;
                    String newStr = new String(chars);
                    if (!visited.contains(newStr) && wordList.contains(newStr)) {
                        queue.add(newStr);
                        visited.add(newStr);
                    }
                }
            }
        } else {
            level++;
            if (!queue.isEmpty()) {
                queue.add(null);
            }
        }
    }
    return 0;
}
wordList (the dictionary) contains N elements, and the length of beginWord is m.
In the worst case, the queue would hold every element of the word list, so the outer while loop runs O(N) times.
For each word (length m), it tries 26 characters (a to z), so the inner nested for loops are O(26 * m); inside them it calls wordList.contains, assumed to be O(logN).
So overall it's O(N * m * 26 * logN) => O(N * m * logN).
Is this correct?
The List<T> type does not automatically sort its elements, but instead "faithfully" keeps all elements in the order they were added. So wordList.contains is in fact O(N), not O(logN). However, for a HashSet such as visited, this operation is O(1) (amortized), so consider copying wordList into a HashSet once before the search and testing membership against that instead.

Algorithm to list unique permutations of string with duplicate letters

For example, string "AAABBB" will have permutations:
"ABAABB",
"BBAABA",
"ABABAB",
etc
What's a good algorithm for generating the permutations? (And what's its time complexity?)
For a multiset, you can solve recursively by position (JavaScript code):
function f(multiset, counters, result){
    if (counters.every(x => x === 0)){
        console.log(result);
        return;
    }
    for (var i = 0; i < counters.length; i++){
        if (counters[i] > 0){
            var _counters = counters.slice(); // copy, so sibling branches are unaffected
            _counters[i]--;
            f(multiset, _counters, result + multiset[i]);
        }
    }
}

f(['A','B'], [3,3], '');
This is not a full answer, just an idea.
If your strings have a fixed number of only two letters, I'd go with a binary tree and a good recursive function.
Each node is an object that contains a name (its parent's name plus the suffix A or B) together with counts of the A and B letters in that name.
The node constructor gets the parent's name and its A and B counts, so it only needs to add 1 to the count of A or B and one letter to the name.
It doesn't construct the next node if there would be more than three A's (in the case of an A node), or B's respectively, or if their sum equals the length of the starting string.
Now you can collect the leaves of the 2 trees (their names) and you have all the permutations that you need.
Scala or some functional language (with object-like features) would be perfect for implementing this algorithm. Hope this helps or just sparks some ideas; a rough sketch follows below.
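Here is a rough C# sketch of my reading of this idea (hypothetical names; the two trees are collapsed into one recursion over the remaining letter counts):
using System;

class TwoLetterPermutations
{
    // Grow the "name" one letter at a time, pruning any branch that would use
    // more As or Bs than are available; each full-length leaf is one unique permutation.
    static void Grow(string name, int aLeft, int bLeft)
    {
        if (aLeft == 0 && bLeft == 0)
        {
            Console.WriteLine(name);
            return;
        }
        if (aLeft > 0) Grow(name + "A", aLeft - 1, bLeft);
        if (bLeft > 0) Grow(name + "B", aLeft, bLeft - 1);
    }

    static void Main()
    {
        Grow("", 3, 3); // all unique permutations of "AAABBB"
    }
}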
Since you actually want to generate the permutations instead of just counting them, the best complexity you can hope for is O(size_of_output).
Here's a good solution in Java that meets that bound and runs very quickly, while consuming negligible space. It first sorts the letters to find the lexicographically smallest permutation, and then generates all permutations in lexicographic order.
It's known as the Pandita algorithm: https://en.wikipedia.org/wiki/Permutation#Generation_in_lexicographic_order
import java.util.Arrays;
import java.util.function.Consumer;

public class UniquePermutations
{
    static void generateUniquePermutations(String s, Consumer<String> consumer)
    {
        char[] array = s.toCharArray();
        Arrays.sort(array);
        for (;;)
        {
            consumer.accept(String.valueOf(array));

            // Find the rightmost position whose character is smaller than its successor.
            int changePos = array.length - 2;
            while (changePos >= 0 && array[changePos] >= array[changePos + 1])
                --changePos;
            if (changePos < 0)
                break; // all done

            // Find the rightmost character greater than array[changePos] and swap.
            int swapPos = changePos + 1;
            while (swapPos + 1 < array.length && array[swapPos + 1] > array[changePos])
                ++swapPos;
            char t = array[changePos];
            array[changePos] = array[swapPos];
            array[swapPos] = t;

            // Reverse the suffix to get the next permutation in lexicographic order.
            for (int i = changePos + 1, j = array.length - 1; i < j; ++i, --j)
            {
                t = array[i];
                array[i] = array[j];
                array[j] = t;
            }
        }
    }

    public static void main(String[] args) throws java.lang.Exception
    {
        StringBuilder line = new StringBuilder();
        generateUniquePermutations("banana", s -> {
            if (line.length() > 0)
            {
                if (line.length() + s.length() >= 75)
                {
                    System.out.println(line.toString());
                    line.setLength(0);
                }
                else
                    line.append(" ");
            }
            line.append(s);
        });
        System.out.println(line);
    }
}
Here is the output:
aaabnn aaanbn aaannb aabann aabnan aabnna aanabn aananb aanban aanbna
aannab aannba abaann abanan abanna abnaan abnana abnnaa anaabn anaanb
anaban anabna ananab ananba anbaan anbana anbnaa annaab annaba annbaa
baaann baanan baanna banaan banana bannaa bnaaan bnaana bnanaa bnnaaa
naaabn naaanb naaban naabna naanab naanba nabaan nabana nabnaa nanaab
nanaba nanbaa nbaaan nbaana nbanaa nbnaaa nnaaab nnaaba nnabaa nnbaaa

Is there an efficient algorithm that could do this?

I have two lists of integers of equal length, each with no duplicates, and I need to map them to each other based on the (absolute value of) their differences, such that nothing could be switched in the output to make the total of all pair differences smaller. The 'naive' approach I could think of would be this (in condensed C#, but I think it's pretty easy to follow):
Dictionary<int, int> output = new Dictionary<int, int>();
List<int> list1, list2;
while (list1.Count > 0) // while we haven't arranged all the pairs
{
    int bestDistance = Int32.MaxValue;  // best distance between numbers so far
    int bestFirst = 0, bestSecond = 0;  // best numbers so far
    foreach (int i in list1)
    {
        foreach (int j in list2)
        {
            int distance = Math.Abs(i - j);
            // if the distance is better than the best so far, make it the new best
            if (distance < bestDistance)
            {
                bestDistance = distance;
                bestFirst = i;
                bestSecond = j;
            }
        }
    }
    output[bestFirst] = bestSecond; // add the best pair to the dictionary
    list1.Remove(bestFirst);        // remove it from the lists
    list2.Remove(bestSecond);
}
Essentially, it just finds the best pair, removes it, and then repeats until it's done. But this runs in cubic time, if I see it correctly, and would take incredibly long for large lists. Is there any faster way to do this?
This is less trivial than my initial hunch suggested. The key to keeping this O(N log(N)) is to work with sorted lists, and to search the second sorted list for the "pivot" element with the smallest difference to the first element of the first sorted list.
Thus the steps to take become:
Sort both input lists.
Find the pivot element in the second sorted list.
Return this pivot element together with the first element of the first sorted list.
Keep track of the element index to the left of the pivot and to the right of the pivot.
Iterate the first list in sorted order, returning either the left or the right element, depending on which difference is smaller, and adjusting the left and right indexes.
As in (C# example):
public static IEnumerable<KeyValuePair<int, int>> FindSmallestDistances(List<int> first, List<int> second)
{
    Debug.Assert(first.Count == second.Count); // precondition.

    // sort the input: O(N log(N)).
    first.Sort();
    second.Sort();

    // determine pivot: O(N).
    var min_first = first[0];
    var smallest_abs_dif = Math.Abs(second[0] - min_first);
    var pivot_ndx = 0;
    for (int i = 1; i < second.Count; i++)
    {
        var abs_dif = Math.Abs(second[i] - min_first);
        if (abs_dif < smallest_abs_dif)
        {
            smallest_abs_dif = abs_dif;
            pivot_ndx = i;
        }
    }

    // return the first one.
    yield return new KeyValuePair<int, int>(min_first, second[pivot_ndx]);

    // Iterate the rest: O(N)
    var left = pivot_ndx - 1;
    var right = pivot_ndx + 1;
    for (var i = 1; i < first.Count; i++)
    {
        if (left >= 0)
        {
            if (right < first.Count && Math.Abs(first[i] - second[left]) > Math.Abs(first[i] - second[right]))
                yield return new KeyValuePair<int, int>(first[i], second[right++]);
            else
                yield return new KeyValuePair<int, int>(first[i], second[left--]);
        }
        else
            yield return new KeyValuePair<int, int>(first[i], second[right++]);
    }
}
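A hypothetical usage of the sketch above, tracing a small input:
var pairs = FindSmallestDistances(new List<int> { 9, 1, 5 }, new List<int> { 6, 2, 8 });
foreach (var p in pairs)
    Console.WriteLine(p.Key + " -> " + p.Value);
// After sorting, first = [1, 5, 9] and second = [2, 6, 8]; the pivot for 1 is 2,
// then 5 pairs with 6 and 9 pairs with 8.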

Google search results: How to find the minimum window that contains all the search keywords?

What is the complexity of the algorithm that is used to find the smallest snippet containing all the search keywords?
As stated, the problem is solved by a rather simple algorithm:
Just look through the input text sequentially from the very beginning and check each word: whether it is in the search key or not. If the word is in the key, add it to the end of the structure that we will call The Current Block. The Current Block is just a linear sequence of words, each word accompanied by the position at which it was found in the text. The Current Block must maintain the following Property: the very first word in The Current Block must be present in The Current Block once and only once. If you add a new word to the end of The Current Block and the above Property becomes violated, you have to remove the very first word from the block. This process is called normalization of The Current Block. Normalization is a potentially iterative process, since once you remove the very first word from the block, the new first word might also violate the Property, so you'll have to remove it as well, and so on.
So, basically, The Current Block is a FIFO sequence: new words arrive at the right end and get removed by the normalization process from the left end.
All you have to do to solve the problem is look through the text, maintaining The Current Block and normalizing it when necessary so that it satisfies the Property. The shortest block with all the keywords in it that you ever build is the answer to the problem.
For example, consider the text
CxxxAxxxBxxAxxCxBAxxxC
with keywords A, B and C. Looking through the text you'll build the following sequence of blocks
C
CA
CAB - all words, length 9 (CxxxAxxxB...)
CABA - all words, length 12 (CxxxAxxxBxxA...)
CABAC - violates The Property, remove first C
ABAC - violates The Property, remove first A
BAC - all words, length 7 (...BxxAxxC...)
BACB - violates The Property, remove first B
ACB - all words, length 6 (...AxxCxB...)
ACBA - violates The Property, remove first A
CBA - all words, length 4 (...CxBA...)
CBAC - violates The Property, remove first C
BAC - all words, length 6 (...BAxxxC)
The best block we built has length 4, which is the answer in this case
CxxxAxxxBxxAxx CxBA xxxC
The exact complexity of this algorithm depends on the input, since it dictates how many iterations the normalization process will make, but ignoring the normalization the complexity would trivially be O(N * log M), where N is the number of words in the text and M is the number of keywords, and O(log M) is the complexity of checking whether the current word belongs to the keyword set.
Now, having said that, I have to admit that I suspect that this might not be what you need. Since you mentioned Google in the caption, it might be that the statement of the problem you gave in your post is not complete. Maybe in your case the text is indexed? (With indexing the above algorithm is still applicable, just becomes more efficient). Maybe there's some tricky database that describes the text and allows for a more efficient solution (like without looking through the entire text)? I can only guess and you are not saying...
I think the solution proposed by AndreyT assumes that no duplicates exist in the keywords/search terms. Also, the current block can get as big as the text itself if the text contains lots of duplicate keywords.
For example:
Text: 'ABBBBBBBBBB'
Keyword text: 'AB'
Current Block: 'ABBBBBBBBBB'
Anyway, I have implemented it in C# and did some basic testing; it would be nice to get some feedback on whether it works or not :)
static string FindMinWindow(string text, string searchTerms)
{
    Dictionary<char, bool> searchIndex = new Dictionary<char, bool>();
    foreach (var item in searchTerms)
    {
        searchIndex.Add(item, false);
    }
    Queue<Tuple<char, int>> currentBlock = new Queue<Tuple<char, int>>();
    int noOfMatches = 0;
    int minLength = Int32.MaxValue;
    int startIndex = 0;
    for (int i = 0; i < text.Length; i++)
    {
        char item = text[i];
        if (searchIndex.ContainsKey(item))
        {
            if (!searchIndex[item])
            {
                noOfMatches++;
            }
            searchIndex[item] = true;
            var newEntry = new Tuple<char, int>(item, i);
            currentBlock.Enqueue(newEntry);

            // Normalization step.
            while (currentBlock.Count(o => o.Item1.Equals(currentBlock.First().Item1)) > 1)
            {
                currentBlock.Dequeue();
            }

            // Figuring out minimum length.
            if (noOfMatches == searchTerms.Length)
            {
                var length = currentBlock.Last().Item2 - currentBlock.First().Item2 + 1;
                if (length < minLength)
                {
                    startIndex = currentBlock.First().Item2;
                    minLength = length;
                }
            }
        }
    }
    return noOfMatches == searchTerms.Length ? text.Substring(startIndex, minLength) : String.Empty;
}
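For example, a quick check against AndreyT's sample text from above (this call is my own illustration):
var snippet = FindMinWindow("CxxxAxxxBxxAxxCxBAxxxC", "ABC");
// snippet == "CxBA", the length-4 block found in the earlier walkthrough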
This is an interesting question.
To restate it more formally:
Given a list L (the web page) of length n and a set S (the query) of size k, find the smallest sublist of L that contains all the elements of S.
I'll start with a brute-force solution in hopes of inspiring others to beat it.
Note that set membership can be done in constant time, after one pass through the set. See this question.
Also note that this assumes all the elements of S are in fact in L, otherwise it will just return the sublist from 1 to n.
best = (1,n)
For i from 1 to n-k:
    Reset counter = 0 and a hash found[] mapping each element of S to False.
    For j from i to n, or until counter == k:
        If L[j] is in S and not found[L[j]], then let found[L[j]] = True and counter++.
    If counter == k and j-i < best[2]-best[1], then let best = (i,j).
Time complexity is O((n+k)(n-k)). I.e., n^2-ish.
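For concreteness, here is a hypothetical C# rendering of that brute force (the method name and types are mine, not part of the original post):
static Tuple<int, int> BruteForceMinWindow(string[] L, HashSet<string> S)
{
    int n = L.Length, k = S.Count;
    var best = Tuple.Create(0, n - 1);
    for (int i = 0; i <= n - k; i++)
    {
        var found = new HashSet<string>(); // keywords seen in the window so far
        for (int j = i; j < n; j++)
        {
            if (S.Contains(L[j]))
                found.Add(L[j]);
            if (found.Count == k) // window [i, j] covers all keywords
            {
                if (j - i < best.Item2 - best.Item1)
                    best = Tuple.Create(i, j);
                break;
            }
        }
    }
    return best; // inclusive word-index range of the smallest window
}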
Here's a solution using Java 8.
static Map.Entry<Integer, Integer> documentSearch(Collection<String> document, Collection<String> query) {
    Queue<KeywordIndexPair> queue = new ArrayDeque<>(query.size());
    HashSet<String> words = new HashSet<>();
    query.stream()
         .forEach(words::add);
    AtomicInteger idx = new AtomicInteger();
    IndexPair interval = new IndexPair(0, Integer.MAX_VALUE);
    AtomicInteger size = new AtomicInteger();
    document.stream()
            .map(w -> new KeywordIndexPair(w, idx.getAndIncrement()))
            .filter(pair -> words.contains(pair.word)) // Queue.contains is O(n) so we trade space for efficiency
            .forEach(pair -> {
                // only the first and last elements are useful to the algorithm, so we don't bother removing
                // an element from any other index. note that removing an element using equality
                // from an ArrayDeque is O(n)
                KeywordIndexPair first = queue.peek();
                if (pair.equals(first)) {
                    queue.remove();
                }
                queue.add(pair);
                first = queue.peek();
                int diff = pair.index - first.index;
                if (size.incrementAndGet() == words.size() && diff < interval.interval()) {
                    interval.begin = first.index;
                    interval.end = pair.index;
                    size.set(0);
                }
            });
    return new AbstractMap.SimpleImmutableEntry<>(interval.begin, interval.end);
}
There are 2 static nested classes, KeywordIndexPair and IndexPair, the implementations of which should be apparent from the names. Using a smarter programming language that supports tuples, those classes wouldn't be necessary.
Test:
Document: apple, banana, apple, apple, dog, cat, apple, dog, banana, apple, cat, dog
Query: banana, cat
Interval: 8, 10
For all the words, maintain the min and max index in case there is more than one entry; if not, the min and max index will be the same.
import edu.princeton.cs.algs4.ST;

public class DicMN {
    ST<String, Words> st = new ST<>();

    public class Words {
        int min;
        int max;

        public Words(int index) {
            min = index;
            max = index;
        }
    }

    public int findMinInterval(String[] sw) {
        int begin = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (int i = 0; i < sw.length; i++) {
            if (st.contains(sw[i])) {
                Words w = st.get(sw[i]);
                begin = Math.min(begin, w.min);
                end = Math.max(end, w.max);
            }
        }
        if (begin != Integer.MAX_VALUE) {
            return (end - begin) + 1;
        }
        return 0;
    }

    public void put(String[] dw) {
        for (int i = 0; i < dw.length; i++) {
            if (!st.contains(dw[i])) {
                st.put(dw[i], new Words(i));
            }
            else {
                Words w = st.get(dw[i]);
                w.min = Math.min(w.min, i);
                w.max = Math.max(w.max, i);
            }
        }
    }

    public static void main(String[] args) {
        DicMN dic = new DicMN();
        String[] arr1 = { "one", "two", "three", "four", "five", "six", "seven", "eight" };
        dic.put(arr1);
        String[] arr2 = { "two", "five" };
        System.out.print("Interval:" + dic.findMinInterval(arr2));
    }
}
