How to generate lookup table for counting leading zeroes (clzlut)? - algorithm
I found this function, but there is no explanation of where the clzlut lookup table came from (I searched the web for many hours and couldn't find anything):
static uint8_t clzlut[256] = {
8,7,6,6,5,5,5,5,
4,4,4,4,4,4,4,4,
3,3,3,3,3,3,3,3,
3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0
};
uint32_t clz(uint32_t val)
{
uint32_t accum = 0;
accum += clzlut[val >> 24];
accum += (accum == 8 ) ? clzlut[(val >> 16) & 0xFF] : 0;
accum += (accum == 16) ? clzlut[(val >> 8) & 0xFF] : 0;
accum += (accum == 24) ? clzlut[ val & 0xFF] : 0;
return accum;
}
How do you generate this lookup table? What is the algorithm (in C, JavaScript, or the like) for generating it? I am going to use it for a "count leading zeroes" (clz) implementation. I know there are builtins for clz; I just want to know how to generate a fallback, and for curiosity/learning's sake.
How do you generate this lookup table?
...to know how to generate a fallback, and for curiosity/learning's sake.
For each index 0 to 255:
    Start at count = 8. (Bit width of uint8_t)
    Copy index to i.
    While i > 0:
        Divide i by 2 (integer division).
        Decrement count.
    clzlut[index] = count
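The steps above can be sketched in Python as follows. This is a minimal illustration, not the original author's code: the names build_clz_table, CLZ_LUT and clz32 are made up for this example, and clz32 scans bytes from the top instead of using the accumulator form of the C function, which is assumed to give the same result.

```python
def build_clz_table(bits=8):
    """Build a leading-zero-count table for all values of a `bits`-wide integer."""
    table = []
    for index in range(1 << bits):
        count = bits          # start with "all bits are leading zeroes"
        i = index
        while i > 0:          # each halving removes one significant bit
            i //= 2
            count -= 1
        table.append(count)
    return table


# Hypothetical name; plays the role of clzlut in the C code.
CLZ_LUT = build_clz_table()


def clz32(val):
    """Count leading zeroes of a 32-bit value using the byte table."""
    val &= 0xFFFFFFFF
    for shift in (24, 16, 8, 0):
        byte = (val >> shift) & 0xFF
        if byte:
            # All higher bytes were zero: 8 leading zeroes for each of them.
            return (24 - shift) + CLZ_LUT[byte]
    return 32
```

Printing CLZ_LUT eight values per row should reproduce the table in the question, starting with 8,7,6,6,5,5,5,5.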
Related
Any faster way to replace substring in AWK
I have a long string, about 50,000,000 characters long, and I am substituting it part by part:
cat FILE | tail -n+2 | awk -v k=100 '{
    i = 1
    while (i<length($0)-k+1) {
        x = substr($0, i, k)
        if (CONDITION) {
            x changed sth
            $0 = substr($0,1,i-1) x substr($0,i+k)
        }
        i += 1
    }
    gsub(sth,sth,$0)
    printf("%s",$0) >> FILE
}'
Is there any way to replace $0 at position i with x of length k without using this method? The string is too long and the command runs extremely slowly.
sample input:
NNNNNNNNNNggcaaacagaatccagcagcacatcaaaaagcttatccacAGTAATTCATTATATCAAAATGCTCCAggccaggcgtggtggcttatgcc
sample output:
NNNNNNNNNNggcnnncngnnnccngcngcncnncnnnnngcnnnnccncNGNNNNNCNNNNNNNCNNNNNGCNCCNggccnggcgnggnggcnnnngcc
If a substring of length k=10 contains >50% of A || a || T || t (so there are length($0)-k+1 substrings), substitute A and T with N, and a and t with n. The $0 string must maintain its size and sequence (case sensitive).
EDIT: I misunderstood the requirement of this problem, and reposted the question here.
Basically:
1. read a window of characters into two buffers: a scratch buffer and an output buffer
2. if the scratch buffer contains more than some count of the characters ATat, replace all characters ATat in the output buffer with Nn respectively
3. output one character from the output buffer
4. flush one character from both buffers
5. go to step 1 to repeat reading characters into the buffers
6. when the end of the line is encountered, just flush the output buffer and reset everything
A small C program is surely going to be the fastest:
// The window size
#define N 10
// The percent of the window that has to be equal to one of [AaTt]
#define PERCENT 50

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

// output a string
static void output(char *outme, size_t n) {
    fwrite(outme, n, 1, stdout);
}

// is one of [AaTt]
static bool is_one_of_them(char c) {
    switch(c) {
    case 'A':
    case 'a':
    case 'T':
    case 't':
        return true;
    }
    return false;
}

// Convert one of characters to n/N depending on case
static char convert_them_to_n(char c) {
    // switch(c){ case 'T': case 'A': return true; } return false;
    // ASCII is assumed
    const char m = ~0x1f;
    const char w = 'n' & ~m;
    return (c & m) | w;
}

static const unsigned threshold = N * PERCENT / 100;
// Store the input in buf
static char buf[N];
// Store the output to-be-outputted in out
static char out[N];
// The current position in buf and out
// The count of characters read
static size_t pos;
// The count of one of searched characters in buf
static unsigned count_them;

static void buf_reset(void) {
    pos = 0;
    count_them = 0;
}

static void buf_flush(void) {
    output(out, pos);
    buf_reset();
}

static void buf_replace_them(void) {
    // TODO: this could keep count of characters already replaced in out to save CPU
    for (size_t i = 0; i < N; ++i) {
        if (is_one_of_them(out[i])) {
            out[i] = convert_them_to_n(out[i]);
        }
    }
}

static void buf_flush_one(void) {
    assert(pos > 0);
    assert(pos == N);
    output(out, 1);
    count_them -= is_one_of_them(buf[0]);
    memmove(buf, buf + 1, pos - 1);
    memmove(out, out + 1, pos - 1);
    pos--;
}

static void buf_add(char c) {
    buf[pos] = out[pos] = c;
    pos++;
    count_them += is_one_of_them(c);
    // if we reached the substring length
    if (pos == N) {
        // if the count reached the threshold
        if (count_them >= threshold) {
            // convert the characters to n
            buf_replace_them();
        }
        // flush one character only at a time
        buf_flush_one();
    }
}

int main() {
    int c;
    buf_reset();
    while ((c = getchar()) != EOF) {
        if (c == '\n') {
            // If it's a newline, just flush what we have buffered
            buf_flush();
            output("\n", 1);
            continue;
        }
        buf_add(c);
    }
    buf_flush();
}
Such a C program is easily transferable to, for example, an awk script; one just needs to read one character at a time. Below I split the line into characters with split:
awk -v N=10 -v percent=50 '
BEGIN{ threshold = N * percent / 100; pos=0 }
function is_one_of_them(c) {
    return c ~ /^[aAtT]$/;
}
function buf_flush(i) {
    for (i = 0; i < pos; ++i) {
        printf "%s", out[i]
    }
    pos = 0
    count_them = 0
}
function buf_replace_them(i) {
    for (i = 0; i < pos; ++i) {
        if (is_one_of_them(out[i])) {
            out[i] = out[i] ~ /[AT]/ ? "N" : "n";
        }
    }
}
function buf_flush_one(i) {
    printf "%s", out[0]
    count_them -= is_one_of_them(buf[0])
    if(0 && debug) {
        printf(" count_them %s ", count_them)
        for (i = 0; i < pos-1; ++i) { printf("%s", buf[i+1]) }
        printf(" ");
        for (i = 0; i < pos-1; ++i) { printf("%s", out[i+1]) }
        printf("\n");
    }
    for (i = 0; i < pos-1; ++i) {
        buf[i] = buf[i+1]
        out[i] = out[i+1]
    }
    pos--
}
function buf_add(c) {
    buf[pos]=c; out[pos]=c;
    pos++
    count_them += is_one_of_them(c)
    if (pos == N) {
        if (count_them >= threshold) {
            buf_replace_them()
        }
        buf_flush_one()
    }
}
{
    split($0, chars, "")
    for (idx = 0; idx <= length($0); idx++) {
        buf_add(chars[idx])
    }
    buf_flush();
    printf "\n";
}
'
Both programs, when run with the input presented in the first line, produce the output presented in the second line (note that the lone a near the end is not replaced, because there are not 5 ATat characters in any window of 10 characters containing it):
NNNNNNNNNNggcaaacagaatccagcagcacatcaaaaagcttatccacAGTAATTCATTATATCAAAATGCTCCAggccaggcgtggtggcttatgcc
NNNNNNNNNNggcnnncngnnnccngcngcncnncnnnnngcnnnnccncNGNNNNNCNNNNNNNCNNNNNGCNCCNggccaggcgnggnggcnnnngcc
Both solutions were tested on repl.
You need to be careful with how you address this problem. You cannot work on the substituted string. You need to keep track of the original string. Here is a simple example. Assume we have a string consisting of x and y and we want to replace all y with z if there are 8 y in a substring of 10. Imagine your input looks like:
yyyyyyyyxxy
The first substring of 10 reads yyyyyyyyxx and would be translated into zzzzzzzzxx. If you perform the substitution directly in the original string, you get zzzzzzzzxxy. The second substring now reads zzzzzzzxxy and does not contain 8 times y, while in the original string it does. So the approach of the OP leads to inconsistent results, depending on whether you start from the front or the back. A quick solution would be:
awk -v N=10 -v p=50 '
BEGIN { n = N*p/100 }
{ s = $0 }
{ for(i=1;i<=length-N+1;++i) {
    str=substr($0,i,N)
    c=gsub(/[AT]/,"N",str) + gsub(/[at]/,"n",str)
    if(c >= n) s = substr(s,1,i-1) str substr(s,i+N)
  }
}
{ print s }' file
There is of course quite some work done twice here. Imagine you have a string of the form xxyyyyyyyyxx: you would perform 4 concatenations while you only need to do one. So the best idea is to minimise the work and only check the substrings which end with one of the relevant characters:
awk -v N=10 -v p=50 '
BEGIN { n = N*p/100 }
{ s = $0 }
{ i=N;
  while (match(substr($0,i),/[ATat]/)) {
    str=substr($0,i+RSTART-N,N)
    c=gsub(/[AT]/,"N",str) + gsub(/[at]/,"n",str)
    if(c >= n) { s = substr(s,1,i+RSTART-N-1) str substr(s,i+RSTART)}
    i=i+RSTART
  }
}
{ print s }' file
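The point about always deciding from the original string can be sketched in Python as below. This is an illustration under stated assumptions, not a translation of the awk scripts: the function name mask_at_rich_windows is made up, the threshold uses >= as in the answers (the question itself says >50%), and a difference array marks which positions fall inside a qualifying window so the input is never modified while it is still being scanned.

```python
def mask_at_rich_windows(seq, k=10, percent=50):
    """Replace A/T with N and a/t with n at every position covered by a
    length-k window whose A/T/a/t count reaches the threshold.
    All window counts are taken from the original, unmodified sequence."""
    threshold = k * percent / 100
    is_at = [c in "ATat" for c in seq]

    # diff is a difference array: +1 where a qualifying window starts,
    # -1 just past where it ends.
    diff = [0] * (len(seq) + 1)
    count = sum(is_at[:k])                      # count for the first window
    for start in range(len(seq) - k + 1):
        if start > 0:
            # Slide the window one step: add the new char, drop the old one.
            count += is_at[start + k - 1] - is_at[start - 1]
        if count >= threshold:
            diff[start] += 1
            diff[start + k] -= 1

    out = []
    depth = 0                                   # qualifying windows covering i
    for i, c in enumerate(seq):
        depth += diff[i]
        if depth > 0 and is_at[i]:
            out.append("N" if c.isupper() else "n")
        else:
            out.append(c)
    return "".join(out)
```

Because the output is built in one pass at the end, the result does not depend on whether windows are visited from the front or the back, and each character is touched a constant number of times.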
To replace $0 at position i with x do:
awk 'BEGIN{i=12345;x="blubber"} { printf("%s",substr($0,1,i-1)); printf("%s",x); printf("%s",substr($0,i+length(x))); }'
I don't think there is any faster method. To replace AGCT with N and agct with n use tr. To replace them only within a range of n characters starting at position i using awk, copy the range into a variable first, because gsub modifies its target in place and returns the number of substitutions rather than the new string:
awk 'BEGIN{i=12345;n=123} { s=substr($0,i,n); gsub(/[ATGC]/,"N",s); gsub(/[atgc]/,"n",s); printf("%s%s%s",substr($0,1,i-1),s,substr($0,i+n)); }'
To do more advanced and faster processing you should consider C/C++.
Recursive algorithm to find all possible solutions in a nonogram row
I am trying to write a simple nonogram solver, in a kind of brute-force way, but I am stuck on a relatively easy task. Let's say I have a row with clues [2,3] that has a length of 10, so the solutions are:
$$-$$$----
$$--$$$---
$$---$$$--
$$----$$$-
$$-----$$$
-$$----$$$
--$$---$$$
---$$--$$$
----$$-$$$
-$$---$$$-
--$$-$$$--
I want to find all the possible solutions for a row. I know that I have to consider each block separately, and each block will have an available space of n-(sum of remaining blocks length + number of remaining blocks), but I do not know how to progress from here.
Well, this question already has a good answer, so think of this one more as an advertisement of Python's prowess.
def place(blocks,total):
    if not blocks: return ["-"*total]
    if blocks[0]>total: return []
    starts = total-blocks[0] #starts = 2 means possible starting indexes are [0,1,2]
    if len(blocks)==1: #this is a special case
        return [("-"*i+"$"*blocks[0]+"-"*(starts-i)) for i in range(starts+1)]
    ans = []
    for i in range(total-blocks[0]): #append current solutions
        for sol in place(blocks[1:],starts-i-1): #with all possible other solutions
            ans.append("-"*i+"$"*blocks[0]+"-"+sol)
    return ans
To test it:
for i in place([2,3,2],12):
    print(i)
Which produces output like:
$$-$$$-$$---
$$-$$$--$$--
$$-$$$---$$-
$$-$$$----$$
$$--$$$-$$--
$$--$$$--$$-
$$--$$$---$$
$$---$$$-$$-
$$---$$$--$$
$$----$$$-$$
-$$-$$$-$$--
-$$-$$$--$$-
-$$-$$$---$$
-$$--$$$-$$-
-$$--$$$--$$
-$$---$$$-$$
--$$-$$$-$$-
--$$-$$$--$$
--$$--$$$-$$
---$$-$$$-$$
This is what i got: #include <iostream> #include <vector> #include <string> using namespace std; typedef std::vector<bool> tRow; void printRow(tRow row){ for (bool i : row){ std::cout << ((i) ? '$' : '-'); } std::cout << std::endl; } int requiredCells(const std::vector<int> nums){ int sum = 0; for (int i : nums){ sum += (i + 1); // The number + the at-least-one-cell gap at is right } return (sum == 0) ? 0 : sum - 1; // The right-most number don't need any gap } bool appendRow(tRow init, const std::vector<int> pendingNums, unsigned int rowSize, std::vector<tRow> &comb){ if (pendingNums.size() <= 0){ comb.push_back(init); return false; } int cellsRequired = requiredCells(pendingNums); if (cellsRequired > rowSize){ return false; // There are no combinations } tRow prefix; int gapSize = 0; std::vector<int> pNumsAux = pendingNums; pNumsAux.erase(pNumsAux.begin()); unsigned int space = rowSize; while ((gapSize + cellsRequired) <= rowSize){ space = rowSize; space -= gapSize; prefix.clear(); prefix = init; for (int i = 0; i < gapSize; ++i){ prefix.push_back(false); } for (int i = 0; i < pendingNums[0]; ++i){ prefix.push_back(true); space--; } if (space > 0){ prefix.push_back(false); space--; } appendRow(prefix, pNumsAux, space, comb); ++gapSize; } return true; } std::vector<tRow> getCombinations(const std::vector<int> row, unsigned int rowSize) { std::vector<tRow> comb; tRow init; appendRow(init, row, rowSize, comb); return comb; } int main(){ std::vector<int> row = { 2, 3 }; auto ret = getCombinations(row, 10); for (tRow r : ret){ while (r.size() < 10) r.push_back(false); printRow(r); } return 0; } And my output is: $$-$$$---- $$--$$$--- $$---$$$-- $$----$$$-- $$-----$$$ -$$-$$$---- -$$--$$$-- -$$---$$$- -$$----$$$- --$$-$$$-- --$$--$$$- --$$---$$$ ---$$-$$$- ---$$--$$$ ----$$-$$$ For sure, this must be absolutely improvable. Note: i did't test it more than already written case Hope it works for you
Parsing through Vectors
I am new to C++ and learning it using the Programming Principles ... book by Bjarne Stroustrup. I am working on one problem and can't figure out how to make my code work. I know the issue is with if (words[i]==bad[0, bad.size() - 1]), in particular bad.size() - 1]). I am trying to output all words in the words vector, except that a bleep should be displayed instead of any word that matches one of the words in the bad vector. So I need to know if words[i] matches any of the values in the bad vector.
#include "../std_lib_facilities.h"
using namespace std;
int main()
{
    vector<string> words; //declare Vector
    vector<string> bad = {"idiot", "stupid"};
    //Read words into Vector
    for(string temp; cin >> temp;)
        words.push_back(temp);
    cout << "Number of words currently entered " << words.size() << '\n';
    //sort the words
    sort(words);
    //read out words
    for(int i = 0; i < words.size(); ++i)
        if (i==0 || words[i-1]!= words[i])
            if (words[i]==bad[0, bad.size() - 1])
                cout << "Bleep!\n";
            else
                cout << words[i] << '\n';
    return 0;
}
You need to go through all of the entries in the bad vector for each entry in the words vector. Something like this: for(const string& word : words) { bool foundBadWord = false; for(const string& badWord : bad) { if(0 == word.compare(badWord)) { foundBadWord = true; break; } } if(foundBadWord) { cout << "Bleep!\n"; } else { cout << word << "\n"; } }
Parsing morse code
I am trying to solve this problem. The goal is to determine the number of ways a morse string can be interpreted, given a dictionary of word. What I did is that I first "translated" words from my dictionary into morse. Then, I used a naive algorithm, searching for all the ways it can be interpreted recursively. #include <iostream> #include <vector> #include <map> #include <string> #include <iterator> using namespace std; string morse_string; int morse_string_size; map<char, string> morse_table; unsigned int sol; void matches(int i, int factor, vector<string> &dictionary) { int suffix_length = morse_string_size-i; if (suffix_length <= 0) { sol += factor; return; } map<int, int> c; for (vector<string>::iterator it = dictionary.begin() ; it != dictionary.end() ; it++) { if (((*it).size() <= suffix_length) && (morse_string.substr(i, (*it).size()) == *it)) { if (c.find((*it).size()) == c.end()) c[(*it).size()] = 0; else c[(*it).size()]++; } } for (map<int, int>::iterator it = c.begin() ; it != c.end() ; it++) { matches(i+it->first, factor*(it->second), dictionary); } } string encode_morse(string s) { string ret = ""; for (unsigned int i = 0 ; i < s.length() ; ++i) { ret += morse_table[s[i]]; } return ret; } int main() { morse_table['A'] = ".-"; morse_table['B'] = "-..."; morse_table['C'] = "-.-."; morse_table['D'] = "-.."; morse_table['E'] = "."; morse_table['F'] = "..-."; morse_table['G'] = "--."; morse_table['H'] = "...."; morse_table['I'] = ".."; morse_table['J'] = ".---"; morse_table['K'] = "-.-"; morse_table['L'] = ".-.."; morse_table['M'] = "--"; morse_table['N'] = "-."; morse_table['O'] = "---"; morse_table['P'] = ".--."; morse_table['Q'] = "--.-"; morse_table['R'] = ".-."; morse_table['S'] = "..."; morse_table['T'] = "-"; morse_table['U'] = "..-"; morse_table['V'] = "...-"; morse_table['W'] = ".--"; morse_table['X'] = "-..-"; morse_table['Y'] = "-.--"; morse_table['Z'] = "--.."; int T, N; string tmp; vector<string> dictionary; cin >> T; while (T--) { 
morse_string = ""; cin >> morse_string; morse_string_size = morse_string.size(); cin >> N; for (int j = 0 ; j < N ; j++) { cin >> tmp; dictionary.push_back(encode_morse(tmp)); } sol = 0; matches(0, 1, dictionary); cout << sol; if (T) cout << endl << endl; } return 0; } Now the thing is that I only have 3 seconds of execution time allowed, and my algorithm won't work under this limit of time. Is this the good way to do this and if so, what am I missing ? Otherwise, can you give some hints about what is a good strategy ? EDIT : There can be at most 10 000 words in the dictionary and at most 1000 characters in the morse string.
A solution that combines dynamic programming with a rolling hash should work for this problem. Let's start with a simple dynamic programming solution. We allocate a vector dp in which dp[i] stores the number of ways to build the prefix of morse_string of length i. We then iterate through morse_string, and at each position i we iterate through all words and look back to see if they fit at the end of that prefix. If a word fits, we add dp[i-dictionaryWord.size()], the number of ways we could have built the prefix that precedes it.
vector<long> dp(morse_string.size() + 1, 0);
dp[0] = 1; // the empty prefix can be built in exactly one way
for (size_t i = 1; i <= morse_string.size(); i++) {
    for (size_t j = 0; j < dictionary.size(); j++) {
        size_t len = dictionary[j].size();
        if (len > i) continue;
        // does dictionary[j] match the len characters ending at position i?
        if (morse_string.compare(i - len, len, dictionary[j]) == 0) {
            dp[i] += dp[i - len];
        }
    }
}
result = dp[morse_string.size()];
The problem with this solution is that it is too slow. Let's say that N is the length of morse_string, M is the size of the dictionary and K is the size of the largest word in the dictionary. It will do O(N*M*K) operations. With N=1000 and M=10,000, and assuming K=1000, this is about 10^10 operations, which is too slow on most machines. The K cost comes from the string comparison
morse_string.compare(i - len, len, dictionary[j]) == 0
If we could speed up this string matching to constant or log complexity we would be okay. This is where rolling hashing comes in. If you build a rolling hash array of morse_string, you can compute the hash of any substring of morse_string in O(1). So you could then do
hash(dictionary[j]) == hash(morse_string.substr(i - len, len))
This is good, but in the presence of imperfect hashing you could have multiple words from the dictionary with the same hash. That would mean that after getting a hash match you would still need to compare the strings as well as the hashes. In programming contests, people often assume perfect hashing and skip the string matching.
This is often a safe bet especially on a small dictionary. In case it doesn't produce a perfect hashing (which you can check in code) you can always adjust your hash function slightly and maybe the adjusted hash function will produce a perfect hashing.
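The combination of dynamic programming and rolling hashes described above might look like the following Python sketch. Everything specific here is an assumption made for illustration: the function name count_interpretations, the BASE and MOD constants, and the grouping of dictionary hashes by word length (which limits the inner loop to the distinct lengths). As in the answer, it assumes the hash has no collisions and skips the final string comparison.

```python
from collections import Counter

MOD = (1 << 61) - 1   # assumed modulus (a large Mersenne prime)
BASE = 131            # assumed base for the polynomial hash


def count_interpretations(morse, encoded_words):
    """Count the ways `morse` can be split into words from `encoded_words`
    (dictionary words already translated to Morse). Assumes no hash collisions."""
    n = len(morse)
    # prefix[i] is the hash of morse[:i]; power[i] is BASE**i mod MOD.
    prefix = [0] * (n + 1)
    power = [1] * (n + 1)
    for i, ch in enumerate(morse):
        prefix[i + 1] = (prefix[i] * BASE + ord(ch)) % MOD
        power[i + 1] = (power[i] * BASE) % MOD

    def word_hash(word):
        h = 0
        for ch in word:
            h = (h * BASE + ord(ch)) % MOD
        return h

    # by_len[length][hash] = number of dictionary words with that encoding.
    by_len = {}
    for word in encoded_words:
        if word:
            by_len.setdefault(len(word), Counter())[word_hash(word)] += 1

    dp = [0] * (n + 1)
    dp[0] = 1                                   # one way to build the empty prefix
    for i in range(1, n + 1):
        for length, hashes in by_len.items():
            if length > i:
                continue
            # O(1) hash of morse[i - length:i] from the prefix hashes.
            h = (prefix[i] - prefix[i - length] * power[length]) % MOD
            dp[i] += dp[i - length] * hashes.get(h, 0)
    return dp[n]
```

For example, with the encodings of A, E and T as the dictionary, count_interpretations(".-", [".-", ".", "-"]) gives 2, for "A" and "ET". Two different words that share a Morse encoding are counted separately, which is what the Counter is for.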
Finding shortest repeating cycle in word?
I'm about to write a function that returns the shortest period of a group of letters which, when repeated, creates the given word. For example, the word abkebabkebabkeb is created by repeating the word abkeb. I would like to know how to efficiently analyze the input word to get the shortest period of characters that creates it.
Here is a correct O(n) algorithm. The first for loop is the table building portion of KMP. There are various proofs that it always runs in linear time. Since this question has 4 previous answers, none of which are O(n) and correct, I heavily tested this solution for both correctness and runtime.
def pattern(inputv):
    if not inputv:
        return inputv
    nxt = [0]*len(inputv)
    for i in range(1, len(nxt)):
        k = nxt[i - 1]
        while True:
            if inputv[i] == inputv[k]:
                nxt[i] = k + 1
                break
            elif k == 0:
                nxt[i] = 0
                break
            else:
                k = nxt[k - 1]
    smallPieceLen = len(inputv) - nxt[-1]
    if len(inputv) % smallPieceLen != 0:
        return inputv
    return inputv[0:smallPieceLen]
O(n) solution. Assumes that the entire string must be covered. The key observation is that we generate the pattern and test it, but if we find something along the way that doesn't match, we must include the entire string that we already tested, so we don't have to re-examine those characters.
def pattern(inputv):
    pattern_end = 0
    for j in range(pattern_end+1, len(inputv)):
        pattern_dex = j % (pattern_end+1)
        if inputv[pattern_dex] != inputv[j]:
            pattern_end = j
            continue
        if j == len(inputv)-1:
            print(pattern_end)
            return inputv[0:pattern_end+1]
    return inputv
This is an example for PHP: <?php function getrepeatedstring($string) { if (strlen($string)<2) return $string; for($i = 1; $i<strlen($string); $i++) { if (substr(str_repeat(substr($string, 0, $i),strlen($string)/$i+1), 0, strlen($string))==$string) return substr($string, 0, $i); } return $string; } ?>
The easiest one in Python (it returns the length of the shortest period):
def pattern(self, s):
    ans = (s+s).find(s, 1, -1)
    return len(s) if ans == -1 else ans
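A standalone sketch of the same doubling trick, returning the repeating unit itself instead of its length. The function name shortest_period is made up for this example. The idea: if s reappears inside s+s at an offset strictly between 0 and len(s), that offset is the length of the shortest unit that tiles s.

```python
def shortest_period(s):
    """Return the shortest substring that, repeated a whole number of times,
    gives s. Returns s itself when no shorter unit exists."""
    if not s:
        return s
    # Search s+s for s, skipping offset 0 and excluding offset len(s)
    # (the end bound -1 cuts off the last character of s+s).
    shift = (s + s).find(s, 1, -1)
    return s if shift == -1 else s[:shift]
```

For the word from the question, shortest_period("abkebabkebabkeb") returns "abkeb". A string that ends with only part of the pattern, such as "abcabca", is returned unchanged.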
I believe there is a very elegant recursive solution. Many of the proposed solutions handle the extra complexity where the string ends with part of the pattern, like abcabca. But I do not think that is asked for. My solution for the simple version of the problem in Clojure:
(defn find-shortest-repeating [pattern string]
  (if (empty? (str/replace string pattern ""))
    pattern
    (find-shortest-repeating (str pattern (nth string (count pattern))) string)))
(find-shortest-repeating "" "abcabcabc") ;; "abc"
But be aware that this will not find patterns that are incomplete at the end.
I found a solution based on your post, that could take an incomplete pattern: (defn find-shortest-repeating [pattern string] (if (or (empty? (clojure.string/split string (re-pattern pattern))) (empty? (second (clojure.string/split string (re-pattern pattern))))) pattern (find-shortest-repeating (str pattern (nth string (count pattern))) string)))
My Solution: The idea is to find a substring from the position zero such that it becomes equal to the adjacent substring of same length, when such a substring is found return the substring. Please note if no repeating substring is found I am printing the entire input String. public static void repeatingSubstring(String input){ for(int i=0;i<input.length();i++){ if(i==input.length()-1){ System.out.println("There is no repetition "+input); } else if(input.length()%(i+1)==0){ int size = i+1; if(input.substring(0, i+1).equals(input.substring(i+1, i+1+size))){ System.out.println("The subString which repeats itself is "+input.substring(0, i+1)); break; } } } }
This is a solution I came up with using the queue, it passed all the test cases of a similar problem in codeforces. Problem No is 745A. #include<bits/stdc++.h> using namespace std; typedef long long ll; int main() { ios_base::sync_with_stdio(false); cin.tie(NULL); string s, s1, s2; cin >> s; queue<char> qu; qu.push(s[0]); bool flag = true; int ind = -1; s1 = s.substr(0, s.size() / 2); s2 = s.substr(s.size() / 2); if(s1 == s2) { for(int i=0; i<s1.size(); i++) { s += s1[i]; } } //cout << s1 << " " << s2 << " " << s << "\n"; for(int i=1; i<s.size(); i++) { if(qu.front() == s[i]) {qu.pop();} qu.push(s[i]); } int cycle = qu.size(); /*queue<char> qu2 = qu; string str = ""; while(!qu2.empty()) { cout << qu2.front() << " "; str += qu2.front(); qu2.pop(); }*/ while(!qu.empty()) { if(s[++ind] != qu.front()) {flag = false; break;} qu.pop(); } flag == true ? cout << cycle : cout << s.size(); return 0; }
A simpler answer, which I could come up with in an interview, is an O(n^2) solution that tries every prefix of the string as the candidate unit.
string findSmallestUnit(string str){
    for(int i=1;i<str.length();i++){
        int j=0;
        for(;j<str.length();j++){
            if(str[j%i] != str[j]){
                break;
            }
        }
        if(j==str.length()) return str.substr(0,i);
    }
    return str;
}
Now if someone is interested in an O(n) solution to this problem in C++:
string findSmallestUnit(string str){
    vector<int> lps(str.length(),0);
    int i=1;
    int len=0;
    while(i<str.length()){
        if(str[i] == str[len]){
            len++;
            lps[i] = len;
            i++;
        }
        else{
            if(len == 0) i++;
            else{
                len = lps[len-1];
            }
        }
    }
    int n=str.length();
    int x = lps[n-1];
    if(n%(n-x) == 0){
        return str.substr(0,n-x);
    }
    return str;
}
The above is just @Buge's answer in C++, since someone asked in the comments.
Regex solution: Use the following regex replacement to find the shortest repeating substring, keeping only that substring:
^(.+?)\1*$
$1
Explanation:
^(.+?)\1*$
^        $   # Start and end, to match the entire input-string
 (   )       # Capture group 1:
  .+         #  One or more characters,
    ?        #  with a reluctant instead of greedy match†
      \1*    # Followed by the first capture group repeated zero or more times
$1           # Replace the entire input-string with the first capture group match,
             # removing all other duplicated substrings
† Greedy vs reluctant would in this case mean: greedy = consumes as many characters as it can; reluctant = consumes as few characters as it can. Since we want the shortest repeating substring, we would want a reluctant match in our regex.
Example input: "abkebabkebabkeb"
Example output: "abkeb"
Try it online in Retina. Here an example implementation in Java.
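The same pattern can be tried with Python's re module. This is only a sketch: the function name shortest_repeating_unit is made up, re.fullmatch stands in for the ^ and $ anchors, and the DOTALL flag is an added assumption so that inputs containing newlines are handled too.

```python
import re


def shortest_repeating_unit(text):
    """Return the shortest unit whose repetition gives `text`,
    using a reluctant capture group followed by a backreference."""
    # (.+?) grows one character at a time; \1* then has to cover the rest.
    match = re.fullmatch(r"(.+?)\1*", text, flags=re.DOTALL)
    # An empty string has no match because .+ needs at least one character.
    return match.group(1) if match else text
```

Because the group is reluctant, the engine tries units of length 1, 2, 3, and so on, and stops at the first one that tiles the whole input, so the worst case on long inputs can be noticeably slower than the KMP-based answers.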
Super delayed answer, but I got the question in an interview, here was my answer (probably not the most optimal but it works for strange test cases as well). private void run(String[] args) throws IOException { File file = new File(args[0]); BufferedReader buffer = new BufferedReader(new FileReader(file)); String line; while ((line = buffer.readLine()) != null) { ArrayList<String> subs = new ArrayList<>(); String t = line.trim(); String out = null; for (int i = 0; i < t.length(); i++) { if (t.substring(0, t.length() - (i + 1)).equals(t.substring(i + 1, t.length()))) { subs.add(t.substring(0, t.length() - (i + 1))); } } subs.add(0, t); for (int j = subs.size() - 2; j >= 0; j--) { String match = subs.get(j); int mLength = match.length(); if (j != 0 && mLength <= t.length() / 2) { if (t.substring(mLength, mLength * 2).equals(match)) { out = match; break; } } else { out = match; } } System.out.println(out); } } Testcases: abcabcabcabc bcbcbcbcbcbcbcbcbcbcbcbcbcbc dddddddddddddddddddd adcdefg bcbdbcbcbdbc hellohell Code returns: abc bc d adcdefg bcbdbc hellohell
Works in cases such as bcbdbcbcbdbc. function smallestRepeatingString(sequence){ var currentRepeat = ''; var currentRepeatPos = 0; for(var i=0, ii=sequence.length; i<ii; i++){ if(currentRepeat[currentRepeatPos] !== sequence[i]){ currentRepeatPos = 0; // Add next character available to the repeat and reset i so we don't miss any matches inbetween currentRepeat = currentRepeat + sequence.slice(currentRepeat.length, currentRepeat.length+1); i = currentRepeat.length-1; }else{ currentRepeatPos++; } if(currentRepeatPos === currentRepeat.length){ currentRepeatPos = 0; } } // If repeat wasn't reset then we didn't find a full repeat at the end. if(currentRepeatPos !== 0){ return sequence; } return currentRepeat; }
I came up with a simple solution that works flawlessly even with very large strings. PHP Implementation: function get_srs($s){ $hash = md5( $s ); $i = 0; $p = ''; do { $p .= $s[$i++]; preg_match_all( "/{$p}/", $s, $m ); } while ( ! hash_equals( $hash, md5( implode( '', $m[0] ) ) ) ); return $p; }