Finding the longest substring with no repetition in a string. Time complexity?

I recently interviewed with a company for a software engineering position. I was asked the question of finding the longest unique substring in a string. My algorithm was as follows:
Start from the left-most character, and keep storing each character in a hash table with the character as the key and the index where it last occurred as the value. Add the character to the answer string as long as it's not present in the hash table. If I encounter a stored character again, I stop and note down the length. I empty the hash table and then start again from the index just to the right of where the repeated character last occurred, which I retrieve from the (index_where_it_last_occurred) value. If I ever reach the end of the string, I stop and return the longest length.
For example, say the string was, abcdecfg.
I start with a and store it in the hash table. I store b and so on up to e, along with their indexes. When I encounter c again, I stop since it's already in the hash table, and note down the length, which is 5. I empty the hash table and start again from the index to the right of the repeated character's first occurrence. The repeated character being c, I start again from position 3, i.e., the character d. I keep doing this until I reach the end of the string.
I am interested in knowing what the time complexity of this algorithm will be. IMO, it'll be O(n^2).
This is the code.
import java.util.*;

public class longest
{
    static int longest_length = -1;

    public static void main(String[] args)
    {
        Scanner in = new Scanner(System.in);
        String str = in.nextLine();
        calc(str, 0);
        System.out.println(longest_length);
    }

    public static void calc(String str, int index)
    {
        if (index >= str.length()) return;
        int temp_length = 0;
        LinkedHashMap<Character, Integer> map = new LinkedHashMap<Character, Integer>();
        for (int i = index; i < str.length(); i++)
        {
            if (!map.containsKey(str.charAt(i)))
            {
                map.put(str.charAt(i), i);
                ++temp_length;
            }
            else
            {
                if (longest_length < temp_length)
                {
                    longest_length = temp_length;
                }
                int last_index = map.get(str.charAt(i));
                // System.out.println(last_index);
                calc(str, last_index + 1);
                break;
            }
        }
        if (longest_length < temp_length)
            longest_length = temp_length;
    }
}

If the alphabet is of size K, then when you restart counting you jump back at most K-1 places, so you read each character of the string at most K times. So the algorithm is O(nK).
An input string consisting of n/K copies of the alphabet exhibits this worst-case behavior. For example, if the alphabet is {a, b, c}, strings of the form "abcabcabc...abc" have the property that nearly every character is read 3 times by your algorithm.
You can solve the original problem in O(K+n) time, using O(K) storage space, by using dynamic programming.
Let the string be s. We'll keep a number M, which will be the length of the longest unique-char substring ending at i; an array P, which stores where each character was previously seen; and best, the length of the longest unique-char substring found so far.
Start:
Set P[c] = -1 for each c in the alphabet.
M = 0
best = 0
Then, for each i:
M = min(M+1, i-P[s[i]])
best = max(best, M)
P[s[i]] = i
This is trivially O(K) in storage, and O(K+n) in running time.
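For concreteness, here is a minimal Java sketch of the recurrence above (the names M, P, and best follow the pseudocode; it assumes the characters fit in a small fixed alphabet such as extended ASCII, which the original answer only calls "the alphabet"):

static int longestUnique(String s) {
    int[] P = new int[256];              // P[c] = index where character c was last seen
    java.util.Arrays.fill(P, -1);        // assumes chars are in 0..255; use a map for full Unicode
    int M = 0;                           // length of the longest unique-char run ending at i
    int best = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        M = Math.min(M + 1, i - P[c]);
        best = Math.max(best, M);
        P[c] = i;
    }
    return best;
}

On the question's example "abcdecfg" this returns 5, matching the length found by hand above.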

Related

Understanding this Time Complexity issue (Cracking the coding interview)

Question:
String joinWords(String[] words) {
String sentence = "";
for (String w : words) {
sentence = sentence + w;
}
return sentence;
}
So the solution of this is O(xn^2)
From what I understand, for every iteration, the amount of letters in the variable 'sentence' increases by one. Which, I think, will be O(n^2)?
Is the 'x' just the number of letters in 'words' array?
Answer: On each concatenation, a new copy of the string is created, and the two strings are copied over, character by character. The first iteration requires us to copy x characters. The second iteration requires copying 2x characters. The third iteration requires 3x, and so on. The total time therefore is O(x + 2x + ... + nx). This reduces to O(xn^2).
The answer is wrong in two ways. O(xn^2) doesn't exist. In big O you drop all constants, so it would be O(n^2) if that were right (it isn't).
The next part depends on the language, the implementation of + on strings, and the string class itself. If the string class is mutable, then + should be O(n), assuming it has a decent implementation (it doesn't cause a reallocation and copy on each use of +). If the string class is immutable, it depends on how the String is implemented: does it use a single character buffer for all data, or can it handle multiple character pointers in an ordered list? Of course even with the worst implementation it wouldn't be O(n^2), more like O(2n), which is O(n) (2 copies per iteration). Anyone giving me an answer of O(n^2) would be marked wrong. But really any modern String class implementation would be O(n) without a constant, and pretty much O(n*l) in space (where n is the number of words and l is the average word length).
class String {
    String base;       // used for appended strings
    String additional; // used for appended strings
    char baseData[];   // used for pure strings

    String(String base, String additional) {
        this.base = base;
        this.additional = additional;
    }

    operator + (String newString) {
        return new String(this, newString);
    }

    // As an example of how this works
    int length() {
        if (base != null) {
            return base.length() + additional.length(); // This can be cached, and should be for efficiency.
        }
        else {
            return baseData.length;
        }
    }
}
Notice that + is O(1). Yes, I know Java doesn't have operator overloading; the function is there to show how it's implemented.
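For comparison (this is not part of the answer above), the usual way to make joinWords linear in the total number of characters in Java is to accumulate into a mutable StringBuilder instead of using + in a loop; a minimal sketch:

static String joinWords(String[] words) {
    StringBuilder sentence = new StringBuilder();
    for (String w : words) {
        sentence.append(w);         // amortized O(length of w) per append
    }
    return sentence.toString();     // one final O(total length) copy
}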

Reconstructing a string of words using a dictionary into an English sentence

I am completely stumped. The question is: given a string like "thisisasentence" and a function isWord() that returns true if its argument is an English word, reconstruct the sentence. With a greedy approach I would get stuck on "this is a sent".
How can I recursively return and keep track of where I am each time?
You need backtracking, which is easily achievable using recursion. The key observation is that you do not need to keep track of where you are beyond the moment when you are ready to return a solution.
You have a valid "split" when one of the following is true:
The string w is empty (base case), or
You can split non-empty w into substrings p and s, such that p+s=w, p is a word, and s can be split into a sentence (recursive call).
An implementation can return a list of words when successful split is found, or null when it cannot be found. Base case will always return an empty list; recursive case will, upon finding a p, s split that results in non-null return for s, construct a list with p prefixed to the list returned from the recursive call.
The recursive case will have a loop in it, trying all possible prefixes of w. To speed things up a bit, the loop could terminate upon reaching the prefix that is equal in length to the longest word in the dictionary. For example, if the longest word has 12 characters, you know that trying prefixes 13 characters or longer will not result in a match, so you could cut enumeration short.
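A rough Java sketch of the recursion described above; isWord() stands in for the dictionary lookup the question assumes and is stubbed here with a tiny hard-coded word set just so the snippet runs (it deliberately contains "sent" so the example has to backtrack out of the "this is a sent" dead end):

import java.util.*;

public class WordSplit {
    // Hypothetical stand-in for the question's isWord()
    static final Set<String> DICT =
            new HashSet<>(Arrays.asList("this", "is", "a", "sent", "sentence"));

    static boolean isWord(String s) { return DICT.contains(s); }

    // Returns a list of words whose concatenation is w, or null if no split exists
    static List<String> split(String w) {
        if (w.isEmpty()) return new ArrayList<>();           // base case: empty string
        for (int len = 1; len <= w.length(); len++) {        // try every prefix p of w
            String p = w.substring(0, len);
            if (isWord(p)) {
                List<String> rest = split(w.substring(len)); // recurse on the suffix s
                if (rest != null) {                          // s splits into a sentence
                    rest.add(0, p);                          // prefix p onto that split
                    return rest;
                }
            }
        }
        return null;                                         // no prefix works: backtrack
    }

    public static void main(String[] args) {
        System.out.println(split("thisisasentence"));        // [this, is, a, sentence]
    }
}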
Just adding to the answer above.
In my experience, many people understand recursion better when they see a «linearized» version of a recursive algorithm, which means «implemented as a loop over a stack». Linearization is applicable to any recursive task.
Assuming that isWord() has two parameters (1st: string to test; 2nd: its length) and returns a boolean-compatible value, a C implementation of backtracking is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int isWord(const char *start, int length); /* assumed to be provided, as described above */

void doSmth(char *phrase, int *words, int total) {
    int i;
    for (i = 0; i < total; ++i)
        printf("%.*s ", words[i + 1] - words[i], phrase + words[i]);
    printf("\n");
}

void parse(char *phrase) {
    int current, length, *words;
    if (phrase) {
        words = (int*)calloc((length = strlen(phrase)) + 2, sizeof(int));
        current = 1;
        while (current) {
            for (++words[current]; words[current] <= length; ++words[current])
                if (isWord(phrase + words[current - 1],
                           words[current] - words[current - 1])) {
                    words[current + 1] = words[current];
                    current++;
                }
            if (words[--current] == length)
                doSmth(phrase, words, current); /** parse successful! **/
        }
        free(words);
    }
}
As can be seen, for each word, a pair of stack values are used, the first of which being an offset to the current word's first character, whereas the second is a potential offset of a character exactly after the current word's last one (thus being the next word's first character). The second value of the current word (the one whose pair is at the top of our «stack») is iterated through all characters left in the phrase.
When a word is accepted, a new second value (equalling the current, to only look at positions after it) is pushed to the stack, making the former second the first in a new pair. If the current word (the one just found) completes the phrase, something useful is performed; see doSmth().
If there are no more acceptable words in the remaining part of our phrase, the current word is considered unsuitable, and its second value is discarded from the stack, effectively repeating a search for words at a previous starting location, while the ending location is now farther than the word previously accepted there.

Find K most frequent words from billions of given words [duplicate]

Input: A positive integer K and a big text. The text can actually be viewed as a word sequence, so we don't have to worry about how to break it down into a word sequence.
Output: The most frequent K words in the text.
My thinking is like this.
Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word frequency". This takes O(n) time.
Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than N, it is actually O(n*lg(n)).
We can improve this. Actually, we just want the top K words; the other words' frequencies are of no concern to us. So we can use "partial heap sorting". For steps 2) and 3), we don't do a full sort. Instead, we change them to:
2') Build a heap of (word, word-frequency) pairs with "word-frequency" as the key. It takes O(n) time to build the heap.
3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(k*lg(n)).
To summarize, this solution costs O(n + k*lg(n)) time.
This is just my thought. I haven't found a way to improve step 1).
I Hope some Information Retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count the words and hash them, which will end up in a structure like this:
var hash = {
    "I"      : 13,
    "like"   : 3,
    "meow"   : 3,
    "geek"   : 3,
    "burger" : 2,
    "cat"    : 1,
    "foo"    : 100,
    ...
    ...
};
Traverse through the hash and find the most frequently used word (in this case "foo", 100), then create an array of that size.
Then we can traverse the hash again and use the number of occurrences of each word as the array index; if there is nothing at that index yet, create an array there, else append the word to it. We end up with an array like:
index:   0     1      2         3                   ...   100
        [[ ], [cat], [burger], [like, meow, geek], [ ] ... [foo]]
Then just traverse the array from the end, and collect the k words.
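A minimal Java sketch of this bucket idea, assuming the counting from the first step has already produced a word-to-count map (the method and variable names here are mine, not from the answer):

// assumes: import java.util.*;
static List<String> topKByBuckets(Map<String, Integer> counts, int k) {
    int maxCount = 0;
    for (int c : counts.values()) maxCount = Math.max(maxCount, c);

    // bucket.get(c) holds every word that occurs exactly c times
    List<List<String>> bucket = new ArrayList<>();
    for (int i = 0; i <= maxCount; i++) bucket.add(new ArrayList<>());
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        bucket.get(e.getValue()).add(e.getKey());
    }

    // walk the buckets from the highest count down and collect k words
    List<String> result = new ArrayList<>();
    for (int c = maxCount; c >= 1 && result.size() < k; c--) {
        for (String w : bucket.get(c)) {
            if (result.size() == k) break;
            result.add(w);
        }
    }
    return result;
}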
Solution 2:
Steps:
Same as above
Use a min heap and keep its size at k. For each word in the hash, compare its number of occurrences with the heap's minimum: 1) if the heap already has k entries and the new count is greater than the minimum, remove the minimum and insert the new entry; 2) otherwise (the simpler cases), just insert it while the heap has fewer than k entries.
After traversing the hash, we just convert the min heap to an array and return it.
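And a minimal sketch of Solution 2 in Java, again assuming the counts are already in a map; PriorityQueue serves as the size-k min heap:

// assumes: import java.util.*;
static List<String> topKByMinHeap(Map<String, Integer> counts, int k) {
    // min heap ordered by count, never allowed to grow beyond k entries
    PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        heap.offer(e);
        if (heap.size() > k) heap.poll();   // drop the current minimum
    }
    List<String> result = new ArrayList<>();
    while (!heap.isEmpty()) result.add(heap.poll().getKey());
    Collections.reverse(result);            // most frequent first
    return result;
}

Only k entries are ever kept in the heap, so the extra space beyond the count map is O(k).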
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization comes in again after we run through the list, inserting into the hash table. We can use the median-of-medians algorithm to select the Kth most frequent element. This algorithm is provably O(n).
After selecting the Kth most frequent element, we partition the list around it just as in quicksort. This is obviously also O(n). Everything on the "more frequent" side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along). A rough code sketch follows after the step list below.
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth most frequent element: O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
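A rough Java sketch of this select-and-partition strategy. It substitutes a randomized quickselect for median-of-medians, so the O(n) bound is expected rather than worst-case, and it assumes the counting step has already filled a word-to-count map:

// assumes: import java.util.*;
static List<Map.Entry<String, Integer>> topKBySelection(Map<String, Integer> counts, int k) {
    List<Map.Entry<String, Integer>> a = new ArrayList<>(counts.entrySet());
    if (k >= a.size()) return a;
    quickselect(a, 0, a.size() - 1, k - 1);   // Kth most frequent ends up at index k-1
    return new ArrayList<>(a.subList(0, k));  // everything to its left has a count >= it
}

// Partially orders a[lo..hi] (descending by count) so that index target holds the
// element it would hold in a full descending sort.
static void quickselect(List<Map.Entry<String, Integer>> a, int lo, int hi, int target) {
    Random rnd = new Random();
    while (lo < hi) {
        int p = partition(a, lo, hi, lo + rnd.nextInt(hi - lo + 1));
        if (p == target) return;
        if (p < target) lo = p + 1; else hi = p - 1;
    }
}

// Lomuto partition, descending by count; returns the pivot's final index.
static int partition(List<Map.Entry<String, Integer>> a, int lo, int hi, int pivotIdx) {
    int pivot = a.get(pivotIdx).getValue();
    Collections.swap(a, pivotIdx, hi);
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (a.get(i).getValue() > pivot) Collections.swap(a, i, store++);
    }
    Collections.swap(a, store, hi);
    return store;
}

Sorting the k returned entries afterwards gives the ranked version in the extra O(k*lg(k)) mentioned above.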
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut down the time further by partitioning using the first letter of the words, then partitioning the largest multi-word set using the next character, until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids the dependence on k, which is very nice for large k [by the way, your posted running time is wrong; it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side, you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so you should probably just optimize how the hash is built.
Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min heap to solve it efficiently.
If what you're after is the list of k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all, 2. sort the counts, 3. take the best) is O(n + m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
Practically, the long step is counting.
Utilize a memory-efficient data structure to store the words.
Use a max heap to find the top K frequent words.
Here is the code
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here is the unit tests
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details refer to this test case.
Use a hash table to record all words' frequencies while traversing the whole word sequence. In this phase, the key is "word" and the value is "word frequency". This takes O(n) time, the same as everyone explained above.
While inserting into the hashmap, keep a TreeSet (specific to Java; there are implementations in every language) of size 10 (k = 10) to hold the top 10 frequent words. While its size is less than 10, keep adding entries. Once the size equals 10, insert a new element only if it is greater than the minimum element, i.e. the first element; if so, remove the minimum and insert the new element.
To restrict the size of the TreeSet, see this link.
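A hedged sketch of that TreeSet idea (k as a parameter rather than a hard-coded 10; ties on the count are broken by the word itself so that distinct words with equal counts can coexist in the set):

// assumes: import java.util.*;
static List<String> topKByTreeSet(Map<String, Integer> counts, int k) {
    Comparator<Map.Entry<String, Integer>> byCountThenWord =
            (a, b) -> a.getValue().equals(b.getValue())
                    ? a.getKey().compareTo(b.getKey())
                    : Integer.compare(a.getValue(), b.getValue());
    TreeSet<Map.Entry<String, Integer>> top = new TreeSet<>(byCountThenWord);
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        top.add(e);
        if (top.size() > k) top.pollFirst();     // evict the current minimum
    }
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Integer> e : top.descendingSet()) result.add(e.getKey());
    return result;                               // most frequent first
}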
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
As you mentioned, "partitioning using the first letter of words", we get
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") and ("com", "come", "cold"), but the first partition ("ad", "ad") is missed, while "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please detail your process about partition?
I believe this problem can be solved by an O(n) algorithm. We could do the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency, as follows. Every time we increment a counter, we check its index in the ranked array and check whether its count exceeds its predecessor's in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n), where n is the number of words in the original text.
I was struggling with this as well and got inspired by @aly's answer. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), and a word will be in the set at position X, where X is the current count of the word. In general, here's how it works:
For each word, store it as part of a map of its occurrence count: Map<String, Integer>.
Then, based on the count, remove it from the previous count's set and add it to the new count's set.
The drawback of this is that the list may be big; this can be optimized by using a TreeMap<Integer, Set<String>>, but that adds some overhead. Ultimately we can use a mix of a HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just found another solution for this problem, but I am not sure it is right.
Solution:
Use a hash table to record all words' frequencies: T(n) = O(n).
Choose the first k elements of the hash table and store them in one buffer (whose space = k): T(n) = O(k).
Each time, we first find the current minimum element of the buffer and compare it with the remaining (n - k) elements of the hash table one by one. If an element of the hash table is greater than this minimum element of the buffer, drop the buffer's current minimum and add that element of the hash table. Each time, finding the minimum in the buffer takes T(n) = O(k), and traversing the whole hash table takes T(n) = O(n - k). So the whole time complexity for this process is T(n) = O((n-k) * k).
After traversing the whole hash table, the result is in this buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is in general much smaller than n, for this solution the time complexity is T(n) = O(kn). That is linear time when k is really small. Is it right? I am really not sure.
Try to think of a special data structure to approach these kinds of problems. In this case a special kind of tree, like a trie, stores strings in a specific way and is very efficient. A second way is to build your own solution, like counting words. I guess this TB of data would be in English, and we have around 600,000 words in general, so it would be possible to store only those words and count which strings are repeated; plus this solution will need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to research, and I was able to find this paper related to Top-K: https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
Simplest code to get the occurrence count of the most frequently used word:
function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length, temp = {};
    while(length--){
        if(temp[arr[length]] == undefined && arr[length].trim().length > 0)
        {
            temp[arr[length]] = 1;
        }
        else if(arr[length].trim().length > 0)
        {
            temp[arr[length]] = temp[arr[length]] + 1;
        }
    }
    console.log(temp);
    // index the words by their counts (note: sparse array, one word kept per count)
    var max = [];
    for(var i in temp)
    {
        max[temp[i]] = i;
    }
    // the highest occupied index is max.length - 1
    console.log(max[max.length - 1]);
    // if you want the second highest (only valid when the counts are consecutive)
    console.log(max[max.length - 2]);
}
In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words by using a HashMap. Then I push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
C++11 implementation of the above idea:
class Solution {
public:
vector<int> topKFrequent(vector<int>& nums, int k) {
unordered_map<int,int> map;
for(int num : nums){
map[num]++;
}
vector<int> res;
// we use a priority queue (max-heap); we will keep the (size - k) smallest elements in the queue
// pair<first, second>: first is frequency, second is number
priority_queue<pair<int,int>> pq;
for(auto it = map.begin(); it != map.end(); it++){
pq.push(make_pair(it->second, it->first));
// once the size is bigger than size-k, we pop the top value, which is one of the top k frequent elements
if(pq.size() > (int)map.size() - k){
res.push_back(pq.top().second);
pq.pop();
}
}
return res;
}
};

Find the lexicographically largest unique string

I need an algorithm to find the largest unique (no duplicate characters) substring of a string, formed by removing characters (no rearranging).
String A is greater than String B if it satisfies these two conditions.
1. Has more characters than String B
Or
2. Is lexicographically greater than String B if equal length
For example, if the input string is dedede, then the possible unique combinations are de, ed, d, and e.
Of these combinations, the largest one is therefore ed since it has more characters than d and e and is lexicographically greater than de.
The algorithm must be more efficient than generating all possible unique strings and sorting them to find the largest one.
Note: this is not a homework assignment.
How about this:
string getLargest(string s)
{
    string result = "";
    if (s.length() == 1) return s;
    for (int i = 0; i < (int)s.length();)
    {
        // position of the largest character in s[i..end); start by assuming it is i
        int largest_char_pos = i;
        for (int j = i + 1; j < (int)s.length(); j++)
        {
            if (s[largest_char_pos] < s[j]) largest_char_pos = j;
        }
        result += s[largest_char_pos];
        i = largest_char_pos + 1;
    }
    return result;
}
This code snippet just gives you the lexicographically larger string. If you don't want duplicates, you can just keep track of already added characters.
Let me state the rules for ordering in a way that I think is more clear.
String A is greater than string B if
- A is longer than B
OR
- A and B are the same length and A is lexicographically greater than B
If my restatement of the rules is correct, then I believe I have a solution that runs in O(n^2) time and O(n) space. My solution is a greedy algorithm based on the observation that the longest valid subsequence has as many characters as there are unique characters in the input string. I wrote this in Go; hopefully the comments are sufficient to describe the algorithm.
func findIt(str string) string {
// exc keeps track of characters that we cannot use because they have
// already been used in an earlier part of the subsequence
exc := make(map[byte]bool)
// ret is where we will store the characters of the final solution as we
// find them
var ret []byte
for len(str) > 0 {
// inc keeps track of unique characters as we scan from right to left so
// that we don't take a character until we know that we can still make the
// longest possible subsequence.
inc := make(map[byte]bool, len(str))
fmt.Printf("-%s\n", str)
// best is the largest character we have found that can also get us the
// longest possible subsequence.
var best byte
// best_pos is the lowest index that we were able to find best at, we
// always want the lowest index so that we keep as many options open to us
// later if we take this character.
best_pos := -1
// Scan through the input string from right to left
for i := len(str) - 1; i >= 0; i-- {
// Ignore characters we've already used
if _, ok := exc[str[i]]; ok { continue }
if _, ok := inc[str[i]]; !ok {
// If we haven't seen this character already then it means that we can
// make a longer subsequence by including it, so it must be our best
// option so far
inc[str[i]] = true
best = str[i]
best_pos = i
} else {
// If we've already seen this character it might still be our best
// option if it is a lexicographically larger or equal to our current
// best. If it is equal we want it because it is at a lower index,
// which keeps more options open in the future.
if str[i] >= best {
best = str[i]
best_pos = i
}
}
}
if best_pos == -1 {
// If we didn't find any valid characters on this pass then we are done
break
} else {
// include our best character in our solution, and exclude it for
// consideration in any future passes.
ret = append(ret, best)
exc[best] = true
// run the same algorithm again on the substring that is to the right of
// best_pos
str = str[best_pos+1:]
}
}
return string(ret)
}
I am fairly certain you can do this in O(n) time, but I wasn't sure of my solution so I posted this one instead.

Algorithm for finding first repeated substring of length k

There is a homework assignment I should do and I need help. I should write a program to find the first substring of length k that is repeated in the string at least twice.
For example in the string "banana" there are two repeated substrings of length 2: "an" , "na". In this case, the answer is "an" because it appeared sooner than "na"
Note that the simple O(n^2) algorithm is not useful since there is a time limit on the execution time of the program, so I guess it should run in linear time.
There is a hint too: Use Hash table.
I don't want the code. I just want you to give me a clue because I have no idea how to do this using a hash table. Should I use a specific data structure too?
Iterate over the character indexes of the string (0, 1, 2, ...) up to and including the index of the second-from-last character (i.e. up to strlen(str) - 2). For each iteration, do the following...
Extract the 2-char substring starting at the character index.
Check whether your hashtable contains the 2-char substring. If it does, you've got your answer.
Insert each 2-char substring into the hashtable.
This is easily modifiable to cope with substrings of length k.
Combine Will A's algorithm with a rolling hash to get a linear-time algorithm.
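A hedged Java sketch of that combination: roll a polynomial hash across every window of length k, keep the start indexes of windows seen so far keyed by their rolling hash, and verify candidates character by character so that a rare hash collision cannot produce a wrong answer. Like the answer above, it returns the first repeat it detects:

// assumes: import java.util.*;
static String firstRepeatedRolling(String s, int k) {
    if (s.length() < k) return null;
    long base = 256, mod = 1_000_000_007L;
    long pow = 1;                                            // base^(k-1) mod mod
    for (int i = 1; i < k; i++) pow = pow * base % mod;

    long h = 0;                                              // hash of the current window
    for (int i = 0; i < k; i++) h = (h * base + s.charAt(i)) % mod;

    // rolling hash -> start indexes of windows with that hash (collisions kept in a list)
    Map<Long, List<Integer>> seen = new HashMap<>();
    seen.computeIfAbsent(h, x -> new ArrayList<>()).add(0);

    for (int i = 1; i + k <= s.length(); i++) {
        h = (h - s.charAt(i - 1) * pow % mod + mod) % mod;   // drop the leftmost character
        h = (h * base + s.charAt(i + k - 1)) % mod;          // append the new rightmost one
        for (int start : seen.getOrDefault(h, Collections.emptyList())) {
            if (s.regionMatches(start, s, i, k)) {
                return s.substring(start, start + k);        // the earlier occurrence
            }
        }
        seen.computeIfAbsent(h, x -> new ArrayList<>()).add(i);
    }
    return null;                                             // no k-length substring repeats
}

For "banana" with k = 2 this returns "an", matching the expected answer in the question.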
You can use a LinkedHashMap.
public static String findRepeated(String s, int k) {
    Map<String, Integer> map = new LinkedHashMap<String, Integer>();
    // note: use <= so the last substring of length k is also checked
    for (int i = 0; i <= s.length() - k; i++) {
        String temp = s.substring(i, i + k);
        if (!map.containsKey(temp)) {
            map.put(temp, 1);
        } else {
            map.put(temp, map.get(temp) + 1);
        }
    }
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        if (entry.getValue() > 1) {
            return entry.getKey();
        }
    }
    return "no such value";
}
