Find word in string buffer/paragraph/text - algorithm

This was asked in an Amazon phone interview: "Can you write a program (in your preferred language, C/C++/etc.) to find a given word in a large string buffer, i.e. count its number of occurrences?"
I am still looking for the perfect answer that I should have given to the interviewer. I tried to write a linear search (char-by-char comparison) and, obviously, I was rejected.
Given 40-45 minutes for a phone interview, what algorithm was he/she looking for?

The KMP Algorithm is a popular string matching algorithm.
KMP Algorithm
Checking char by char is inefficient. If the string has 1000 characters and the keyword has 100 characters, you don't want to perform unnecessary comparisons. The KMP Algorithm handles many cases which can occur, but I imagine the interviewer was looking for the case where: When you begin (pass 1), the first 99 characters match, but the 100th character doesn't match. Now, for pass 2, instead of performing the entire comparison from character 2, you have enough information to deduce where the next possible match can begin.
// C program for implementation of KMP pattern searching
// algorithm
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
void computeLPSArray(char *pat, int M, int *lps);
void KMPSearch(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int *lps = (int *)malloc(sizeof(int)*M);
int j = 0; // index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
while (i < N)
{
if (pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free(lps); // to avoid memory leak
}
void computeLPSArray(char *pat, int M, int *lps)
{
int len = 0; // length of the previous longest prefix suffix
int i;
lps[0] = 0; // lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while (i < M)
{
if (pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
if (len != 0)
{
// This is tricky. Consider the example
// AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char *txt = "ABABDABACDABABCABAB";
char *pat = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
This code is taken from a really good site that teaches algorithms:
Geeks for Geeks KMP
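Since the question asks for the number of occurrences rather than the match positions, here is a minimal Python sketch of the same KMP idea that returns a count. The names build_lps and count_occurrences are made up for this illustration, and overlapping matches are counted, which is an assumption about what the interviewer wanted.

```python
def build_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length:
            # fall back to a shorter prefix-suffix; i is not advanced
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    lps = build_lps(pattern)
    count = 0
    j = 0  # number of pattern characters currently matched
    for ch in text:
        while j and ch != pattern[j]:
            j = lps[j - 1]  # reuse what is already known to match
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            count += 1
            j = lps[j - 1]  # keep going to find further matches
    return count


print(count_occurrences("ABABDABACDABABCABAB", "ABABCABAB"))
```

Each text character is examined once apart from the fallback steps, so the scan stays linear in the length of the text plus the length of the pattern.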

Amazon and similar companies expect knowledge of the Boyer–Moore string search and/or Knuth–Morris–Pratt algorithms.
Those are good if you want to show perfect knowledge. Otherwise, try to be creative and write something relatively elegant and efficient.
Did you ask about delimiters before you wrote anything? They might have simplified your task by providing some extra information about the string buffer.
Even the code below could be OK (it's really not) if you provide enough information in advance and properly explain the runtime, the space requirements, and the choice of data containers.
#include <sstream>
#include <string>
#include <unordered_map>
int find( const std::string & the_word, const std::string & text )
{
    std::stringstream ss( text ); // !!! could be really bad idea if 'text' is really big
    std::string word;
    std::unordered_map< std::string, int > umap;
    while( ss >> word ) ++umap[word]; // you have to assume that each word is separated by white-spaces.
    return umap[the_word];
}

Related

Make unique array with minimal sum

It is an interview question. Given an array, e.g., [3,2,1,2,7], we want to make all elements in this array unique by incrementing duplicate elements, and we require that the sum of the resulting array be minimal. For example, the answer for [3,2,1,2,7] is [3,2,1,4,7] and its sum is 17. Any ideas?
It's not quite as simple as my earlier comment suggested, but it's not terrifically complicated.
First, sort the input array. If it matters to be able to recover the original order of the elements then record the permutation used for the sort.
Second, scan the sorted array from left to right (i.e. from low to high). If an element is less than or equal to the element to its left, set it to be one greater than that element.
Pseudocode
sar = sort(input_array)
for index = 2:size(sar) ! I count from 1
if sar(index)<=sar(index-1) sar(index) = sar(index-1)+1
forend
Is the sum of the result minimal? I've convinced myself that it is through some head-scratching and trials, but I haven't got a formal proof.
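The sort-and-bump scan described above can be sketched in Python roughly as follows. The function name make_unique_min_sum is made up for this illustration, and the sketch returns the values in sorted order rather than restoring the original order.

```python
def make_unique_min_sum(values):
    """Sort, then raise every element that is not strictly greater
    than its left neighbour to (left neighbour + 1)."""
    result = sorted(values)
    for i in range(1, len(result)):
        if result[i] <= result[i - 1]:
            result[i] = result[i - 1] + 1
    return result


unique = make_unique_min_sum([3, 2, 1, 2, 7])
print(unique, sum(unique))  # [1, 2, 3, 4, 7] 17
```

The cost is dominated by the sort, so this runs in O(n log n) time.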
If you only need to find ONE of the best solutions, here's the algorithm with some explanations.
The idea of this problem is to find an optimal solution, which can be found only by testing all existing solutions (well, they're infinite, so let's stick with the reasonable ones).
I wrote a program in C, because I'm familiar with it, but you can port it to any language you want.
The program does this: it tries to increment one value to the maximum possible (I'll explain how to find it in the comments under the code sections), then if the solution is not found, decreases this value and goes on with the next one, and so on.
It's an exponential algorithm, so it will be very slow on inputs with many duplicated values (yet, it assures you the best solution is found).
I tested this code with your example, and it worked; not sure if there's any bug left, but the code (in C) is this.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
typedef int BOOL; //just to ease meanings of values
#define TRUE 1
#define FALSE 0
Just to ease comprehension, I did some typedefs. Don't worry.
typedef struct duplicate { //used to speed up the algorithm; it uses some more memory just to assure it's ok
int value;
BOOL duplicate;
} duplicate_t;
int maxInArrayExcept(int *array, int arraySize, int index); //find the max value in array except the value at the index given
//the result is the max value in the array, not counting the index
int *findDuplicateSum(int *array, int arraySize);
BOOL findDuplicateSum_R(duplicate_t *array, int arraySize, int *tempSolution, int *solution, int *totalSum, int currentSum); //recursive function used to find solution
BOOL check(int *array, int arraySize); //checks if there's any repeated value in the solution
These are all the functions we'll need, split up for ease of comprehension.
First, we have a struct. This struct is used to avoid checking, on every iteration, whether the value at a given index was originally duplicated. We don't want to modify any value that was not originally duplicated.
Then, we have a couple of functions. First, we need to consider the worst case scenario: every value after the duplicated ones is already occupied, so we need to increment the duplicated value up to the maximum value reached + 1.
Then, there are the main functions, which we'll discuss later.
The check function only checks whether there's any duplicated value in a temporary solution.
int main() { //testing purpose
int i;
int testArray[] = { 3,2,1,2,7 }; //test array
int nTestArraySize = 5; //test array size
int *solutionArray; //needed if you want to use the solution later
solutionArray = findDuplicateSum(testArray, nTestArraySize);
for (i = 0; i < nTestArraySize; ++i) {
printf("%d ", solutionArray[i]);
}
return 0;
}
This is the main Function: I used it to test everything.
int * findDuplicateSum(int * array, int arraySize)
{
int *solution = malloc(sizeof(int) * arraySize);
int *tempSolution = malloc(sizeof(int) * arraySize);
duplicate_t *duplicate = calloc(arraySize, sizeof(duplicate_t));
int i, j, currentSum = 0, totalSum = INT_MAX;
for (i = 0; i < arraySize; ++i) {
tempSolution[i] = solution[i] = duplicate[i].value = array[i];
currentSum += array[i];
for (j = 0; j < i; ++j) { //to find ALL the best solutions, we should also mark the first found value as true; it's just one more line
//yet, this saves the algorithm half of the duplicated numbers (best/this case scenario)
if (array[j] == duplicate[i].value) {
duplicate[i].duplicate = TRUE;
}
}
}
if (!findDuplicateSum_R(duplicate, arraySize, tempSolution, solution, &totalSum, currentSum)) {
printf("No solution found\n");
}
free(tempSolution);
free(duplicate);
return solution;
}
This function does a lot of things: first, it sets up the solution array, then it initializes both the solution values and the duplicate array, which is the one used to check for duplicated values at startup. Then, we find the current sum and we set the maximum available sum to the maximum integer possible.
Then, the recursive function is called; this one tells us whether a solution was found (which should always be the case), then we return the solution as an array.
BOOL findDuplicateSum_R(duplicate_t * array, int arraySize, int * tempSolution, int * solution, int * totalSum, int currentSum)
{
    int i, limit = 0;
    BOOL found = FALSE;
    if (check(tempSolution, arraySize)) {
        if (currentSum < *totalSum) { //optimal solution checking
            for (i = 0; i < arraySize; ++i) {
                solution[i] = tempSolution[i];
            }
            *totalSum = currentSum;
        }
        return TRUE; //just to ensure a solution is found
    }
    for (i = 0; i < arraySize; ++i) { //hard upper bound: no value ever needs to exceed the original maximum + arraySize
        if (array[i].value > limit) {
            limit = array[i].value;
        }
    }
    limit += arraySize;
    for (i = 0; i < arraySize; ++i) {
        if (array[i].duplicate == TRUE) {
            //worst case scenario: stop the recursion on this value once it is above every other value (or at the hard bound)
            if (tempSolution[i] <= maxInArrayExcept(tempSolution, arraySize, i) && tempSolution[i] < limit) {
                tempSolution[i]++;
                if (findDuplicateSum_R(array, arraySize, tempSolution, solution, totalSum, currentSum + 1)) {
                    found = TRUE;
                }
                tempSolution[i]--; //backtracking
            }
        }
    }
    return found; //FALSE only if no branch reached a valid solution
}
This is the recursive Function. It first checks if the solution is ok and if it is the best one found until now. Then, if everything is correct, it updates the actual solution with the temporary values, and updates the optimal condition.
Then, we iterate on every repeated value (the if excludes other indexes) and we progress in the recursion until (if unlucky) we reach the worst case scenario: the check condition not satisfied above the maximum value.
Then we have to backtrack and continue with the iteration, that will go on with other values.
PS: an optimization is possible here, if we move the optimal condition from the check into the for: if the solution is already not optimal, we can't expect to find a better one just adding things.
The hard code has ended, and there are the supporting functions:
int maxInArrayExcept(int *array, int arraySize, int index) {
int i, max = 0;
for (i = 0; i < arraySize; ++i) {
if (i != index) {
if (array[i] > max) {
max = array[i];
}
}
}
return max;
}
BOOL check(int *array, int arraySize) {
int i, j;
for (i = 0; i < arraySize; ++i) {
for (j = 0; j < i; ++j) {
if (array[i] == array[j]) return FALSE;
}
}
return TRUE;
}
I hope this was useful.
Write if anything is unclear.
Well, I got the same question in one of my interviews.
Not sure if you still need it. But here's how I did it. And it worked well.
num_list1 = [2,8,3,6,3,5,3,5,9,4]
def UniqueMinSumArray(num_list):
    for i in range(len(num_list)):
        # bump this element until no other element has the same value
        while num_list.count(num_list[i]) > 1:
            num_list[i] += 1
    return num_list
print(sum(UniqueMinSumArray(num_list1)))
You can try with your list of numbers and I am sure it will give you the correct unique minimum sum.
I got the same interview question too. But my answer is in JS in case anyone is interested.
For sure it can be improved to get rid of for loop.
function getMinimumUniqueSum(arr) {
// [1,1,2] => [1,2,3] = 6
// [1,2,2,3,3] = [1,2,3,4,5] = 15
if (arr.length > 1) {
var sortedArr = [...arr].sort((a, b) => a - b);
var current = sortedArr[0];
var res = [current];
for (var i = 1; i + 1 <= arr.length; i++) {
// check current equals to the rest array starting from index 1.
if (sortedArr[i] > current) {
res.push(sortedArr[i]);
current = sortedArr[i];
} else if (sortedArr[i] == current) {
current = sortedArr[i] + 1;
// sortedArr[i]++;
res.push(current);
} else {
current++;
res.push(current);
}
}
return res.reduce((a,b) => a + b, 0);
} else {
return arr.length === 1 ? arr[0] : 0; // a single element is already unique
}
}

Parsing morse code

I am trying to solve this problem.
The goal is to determine the number of ways a morse string can be interpreted, given a dictionary of words.
What I did is first "translate" the words from my dictionary into morse. Then, I used a naive algorithm, recursively searching for all the ways the string can be interpreted.
#include <iostream>
#include <vector>
#include <map>
#include <string>
#include <iterator>
using namespace std;
string morse_string;
int morse_string_size;
map<char, string> morse_table;
unsigned int sol;
void matches(int i, int factor, vector<string> &dictionary) {
int suffix_length = morse_string_size-i;
if (suffix_length <= 0) {
sol += factor;
return;
}
map<int, int> c;
for (vector<string>::iterator it = dictionary.begin() ; it != dictionary.end() ; it++) {
if (((*it).size() <= suffix_length) && (morse_string.substr(i, (*it).size()) == *it)) {
c[(*it).size()]++; // operator[] starts a missing count at 0, so the first match counts as 1
}
}
for (map<int, int>::iterator it = c.begin() ; it != c.end() ; it++) {
matches(i+it->first, factor*(it->second), dictionary);
}
}
string encode_morse(string s) {
string ret = "";
for (unsigned int i = 0 ; i < s.length() ; ++i) {
ret += morse_table[s[i]];
}
return ret;
}
int main() {
morse_table['A'] = ".-"; morse_table['B'] = "-..."; morse_table['C'] = "-.-."; morse_table['D'] = "-.."; morse_table['E'] = "."; morse_table['F'] = "..-."; morse_table['G'] = "--."; morse_table['H'] = "...."; morse_table['I'] = ".."; morse_table['J'] = ".---"; morse_table['K'] = "-.-"; morse_table['L'] = ".-.."; morse_table['M'] = "--"; morse_table['N'] = "-."; morse_table['O'] = "---"; morse_table['P'] = ".--."; morse_table['Q'] = "--.-"; morse_table['R'] = ".-."; morse_table['S'] = "..."; morse_table['T'] = "-"; morse_table['U'] = "..-"; morse_table['V'] = "...-"; morse_table['W'] = ".--"; morse_table['X'] = "-..-"; morse_table['Y'] = "-.--"; morse_table['Z'] = "--..";
int T, N;
string tmp;
vector<string> dictionary;
cin >> T;
while (T--) {
morse_string = "";
dictionary.clear(); // each test case comes with its own dictionary
cin >> morse_string;
morse_string_size = morse_string.size();
cin >> N;
for (int j = 0 ; j < N ; j++) {
cin >> tmp;
dictionary.push_back(encode_morse(tmp));
}
sol = 0;
matches(0, 1, dictionary);
cout << sol;
if (T)
cout << endl << endl;
}
return 0;
}
Now the thing is that I am only allowed 3 seconds of execution time, and my algorithm does not finish within this limit.
Is this a good way to do it, and if so, what am I missing? Otherwise, can you give some hints about a good strategy?
EDIT :
There can be at most 10 000 words in the dictionary and at most 1000 characters in the morse string.
A solution that combines dynamic programming with a rolling hash should work for this problem.
Let's start with a simple dynamic programming solution. We allocate a vector which we will use to store known counts for prefixes of morse_string. We then iterate through morse_string, and at each position we iterate through all words and look back to see if they fit into morse_string. If a word fits, we use the dynamic programming vector to determine how many ways we could have built the prefix of morse_string up to i-dictionary[j].size().
vector<long> dp;
dp.push_back(1); // dp[0]: the empty prefix can be built in exactly one way
for (size_t i = 1; i <= morse_string.size(); i++) {
    long count = 0;
    for (size_t j = 0; j < dictionary.size(); j++) {
        size_t len = dictionary[j].size();
        if (len > i) continue; // word is longer than the prefix ending at i
        if (dictionary[j] == morse_string.substr(i - len, len)) {
            count += dp[i - len];
        }
    }
    dp.push_back(count); // dp[i]: number of ways to build the first i characters
}
long result = dp[morse_string.size()];
The problem with this solution is that it is too slow. Let N be the length of morse_string, M the size of the dictionary, and K the size of the largest word in the dictionary. It will do O(N*M*K) operations. With N=1000 and M=10 000 from the problem statement, and assuming K=1000, this is about 10^10 operations, which is too slow on most machines.
The K cost comes from the line dictionary[j] == morse_string.substr(i - len, len)
If we could speed up this string matching to constant or log complexity we would be okay. This is where rolling hashing comes in. If you build a rolling hash array of morse_string, you can compute the hash of any substring of morse_string in O(1). So you could then do hash(dictionary[j]) == hash(morse_string.substr(i - len, len)), with the hash of each dictionary word computed once up front.
This is good but in the presence of imperfect hashing you could have multiple words from the dictionary with the same hash. That would mean that after getting a hash match you would still need to match the strings as well as the hashes. In programming contests, people often assume perfect hashing and skip the string matching. This is often a safe bet especially on a small dictionary. In case it doesn't produce a perfect hashing (which you can check in code) you can always adjust your hash function slightly and maybe the adjusted hash function will produce a perfect hashing.
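To make the rolling-hash idea concrete, here is a rough Python sketch. Everything in it is an assumption made for illustration: the names (count_decodings, sub_hash), the constants BASE and MOD, and the extra trick of grouping dictionary words by (length, hash) so that each position only tries each distinct word length once. The words are expected to be already encoded as morse, and, as discussed above, hash collisions are simply assumed not to happen.

```python
MOD = (1 << 61) - 1   # large prime modulus (arbitrary choice for this sketch)
BASE = 131            # arbitrary base larger than the alphabet size


def count_decodings(morse, encoded_words):
    """Count the ways to split `morse` into dictionary words.
    `encoded_words` holds the dictionary already translated to morse."""
    n = len(morse)
    # prefix[i] is the hash of morse[:i]; power[i] is BASE**i mod MOD
    prefix = [0] * (n + 1)
    power = [1] * (n + 1)
    for i, ch in enumerate(morse):
        prefix[i + 1] = (prefix[i] * BASE + ord(ch)) % MOD
        power[i + 1] = power[i] * BASE % MOD

    def sub_hash(lo, hi):
        """Hash of morse[lo:hi], computed in O(1) from the prefix hashes."""
        return (prefix[hi] - prefix[lo] * power[hi - lo]) % MOD

    # (length, hash) -> number of dictionary words with that encoding
    word_counts = {}
    for word in encoded_words:
        h = 0
        for ch in word:
            h = (h * BASE + ord(ch)) % MOD
        key = (len(word), h)
        word_counts[key] = word_counts.get(key, 0) + 1
    lengths = sorted({length for length, _ in word_counts})

    dp = [0] * (n + 1)
    dp[0] = 1  # one way to build the empty prefix
    for i in range(1, n + 1):
        for length in lengths:
            if length > i:
                break
            matches = word_counts.get((length, sub_hash(i - length, i)))
            if matches:
                dp[i] += matches * dp[i - length]
    return dp[n]


print(count_decodings(".-", [".", "-", ".-"]))  # 2
```

With this grouping the inner loop runs over distinct word lengths instead of over all words, so the work is roughly N times the number of distinct lengths, plus the one-off cost of hashing the dictionary.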

How do I write an algorithm that allows for no-overflow natural number decrementing?

How can I write a function that takes a string denoting a natural number (>0) such as "100100000000" or "1234567890123456788912345678912345678901234567890" and returns a string denoting the input number decreased by 1? I cannot convert this string to an integer because it could overflow.
I am open to implementing this function in any popular language. I personally know C, C++, Java, JavaScript, Python, and PHP.
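x = list(x)  # strings are immutable, so work on a list of digit characters
k = len(x) - 1
while True:
    if x[k] != '0':
        x[k] = str(int(x[k]) - 1)
        break
    else:
        x[k] = '9'
        k -= 1
x = ''.join(x)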
I am leaving boundary conditions for you to work out.
Decreasing by 1 is rather easy. The algorithm is simple:
Find the last non-zero digit, if any
Copy the digits before it, if any
Decrease the found digit
Convert the digits after it to 9
Remove any leading 0 from the beginning of the string
C# code
string res = "";
int nonZeroPos = -1;
int pos = s.Length - 1;
// Search for non-zero. TODO: check for digit
while((pos >= 0) && (nonZeroPos == -1))
{
if(s[pos] != '0')
{
nonZeroPos = pos;
}
pos--;
}
// TODO: if digit is NOT found
// Non changed part of number
for(int i = 0; i < nonZeroPos; i++)
{
res += s[i];
}
res += (char)(s[nonZeroPos] - 1);
for(int i = nonZeroPos + 1; i < s.Length; i++)
{
res += "9";
}
// TODO: kill 0 in the begining
If you want a near-unlimited capacity and want to write the algorithm yourself, process the string one digit at a time, from right to left, exactly as you would by hand.
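As a sketch of that right-to-left, by-hand procedure with the boundary cases filled in, something like the following Python function should work. The name decrement is made up for this example, and two choices are assumptions: an all-zero input raises an error, and "1" becomes "0".

```python
def decrement(number: str) -> str:
    """Return the decimal string for (number - 1), working digit by digit."""
    digits = list(number)
    k = len(digits) - 1
    # borrow: every trailing 0 turns into 9
    while k >= 0 and digits[k] == '0':
        digits[k] = '9'
        k -= 1
    if k < 0:
        raise ValueError("input must denote a number greater than 0")
    # the last non-zero digit absorbs the borrow
    digits[k] = str(int(digits[k]) - 1)
    # drop a leading zero left behind by e.g. "1000" -> "0999"
    result = ''.join(digits).lstrip('0')
    return result or '0'


print(decrement("100100000000"))  # 100099999999
```

Only the trailing zeros and one more digit are touched, so the work is proportional to the length of the string in the worst case.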
In Python overflow does not happen; Python integers can hold arbitrarily large numbers in practice. In C/C++ it is easy to write a string decrement similar to the above algorithm by ElKamina. And Java has a BigInteger class.

Reorder a string by half the character

This is an interview question.
Given a string such as 123456abcdef, consisting of n/2 integers followed by n/2 characters, reorder the string to read 1a2b3c4d5e6f. The algorithm should be in-place.
The solution I gave was trivial, O(n^2): just shift the characters by n/2 places to the left.
I tried using recursion as -
a. Swap later half of the first half with the previous half of the 2nd part - eg
123 456 abc def
123 abc 456 def
b. Recurse on the two halves.
The problem where I am stuck is that the swapping varies with the number of elements, e.g.:
What to do next?
123 abc
12ab 3c
And what to do for: 12345 abcde
123abc 45de
This is a pretty old question and may be a duplicate. Please let me know.. :)
Another example:
Input: 38726zfgsa
Output: 3z8f7g2s6a
Here's how I would approach the problem:
1) Divide the string into two partitions, number part and letter part
2) Divide each of those partitions into two more (equal sized)
3) Swap the second and third partitions (inner number and inner letter)
4) Recurse on the original two partitions (with their newly swapped bits)
5) Stop when the partition has a size of 2
For example:
123456abcdef -> 123456 abcdef -> 123 456 abc def -> 123 abc 456 def
123abc -> 123 abc -> 12 3 ab c -> 12 ab 3 c
12 ab -> 1 2 a b -> 1 a 2 b
... etc
And the same for the other half of the recursion..
All can be done in place with the only gotcha being swapping partitions that aren't the same size (but it'll be off by one, so not difficult to handle).
It is easy to permute an array in place by chasing elements round cycles if you have a bit-map to mark which elements have been moved. We don't have a separate bit-map, but IF your characters are letters (or at least have the high order bit clear) then we can use the top bit of each character to mark this. This produces the following program, which is not recursive and so does not use stack space.
class XX
{
/** new position given old position */
static int newFromOld(int x, int n)
{
if (x < n / 2)
{
return x * 2;
}
return (x - n / 2) * 2 + 1;
}
private static int HIGH_ORDER_BIT = 1 << 15; // 16-bit chars
public static void main(String[] s)
{
// input data - create an array so we can modify
// characters in place
char[] x = s[0].toCharArray();
if ((x.length & 1) != 0)
{
System.err.println("Only works with even length strings");
return;
}
// Character we have read but not yet written, if any
char holding = 0;
// where character in hand was read from
int holdingPos = 0;
// whether picked up a character in our hand
boolean isHolding = false;
int rpos = 0;
while (rpos < x.length)
{ // Here => moved out everything up to rpos
// and put in place with top bit set to mark new occupant
if (!isHolding)
{ // advance read pointer to read new character
char here = x[rpos];
holdingPos = rpos++;
if ((here & HIGH_ORDER_BIT) != 0)
{
// already dealt with
continue;
}
int targetPos = newFromOld(holdingPos, x.length);
// pick up char at target position
holding = x[targetPos];
// place new character, and mark as new
x[targetPos] = (char)(here | HIGH_ORDER_BIT);
// Now holding a character that needs to be put in its
// correct place
isHolding = true;
holdingPos = targetPos;
}
int targetPos = newFromOld(holdingPos, x.length);
char here = x[targetPos];
if ((here & HIGH_ORDER_BIT) != 0)
{ // back to where we picked up a character to hold
isHolding = false;
continue;
}
x[targetPos] = (char)(holding | HIGH_ORDER_BIT);
holding = here;
holdingPos = targetPos;
}
for (int i = 0; i < x.length; i++)
{
x[i] ^= HIGH_ORDER_BIT;
}
System.out.println("Result is " + new String(x));
}
}
These days, if I asked someone that question, what I'm looking for them to write on the whiteboard first is:
assertEquals("1a2b3c4d5e6f",funnySort("123456abcdef"));
...
and then maybe ask for more examples.
(And then, depending, if the task is to interleave numbers & letters, I think you can do it with two walking-pointers, indexLetter and indexDigit, and advance them across swapping as needed til you reach the end.)
In your recursive solution why don't you just make a test if n/2 % 2 == 0 (n%4 ==0 ) and treat the 2 situations differently
As templatetypedef commented your recursion cannot be in-place.
But here is a solution (not in place) using the way you wanted to make your recursion :
def f(s):
    n = len(s)
    if n <= 2:  # base case
        return s
    elif n % 4 == 0:  # if n%4 == 0 it's easy
        return f(s[:n//4] + s[n//2:3*n//4]) + f(s[n//4:n//2] + s[3*n//4:])
    else:  # otherwise, (n-2) % 4 == 0
        return s[0] + s[n//2] + f(s[1:n//2] + s[n//2+1:])
Here we go. Recursive, cuts it in half each time, and in-place. Uses the approach outlined by @Chris Mennie. Getting the splitting right was tricky. A lot longer than Python, innit?
/* In-place, divide-and-conquer, recursive riffle-shuffle of strings;
* even length only. No wide characters or Unicode; old school. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void testrif(const char *s);
void riffle(char *s);
void rif_recur(char *s, size_t len);
void swap(char *s, size_t midpt, size_t len);
void flip(char *s, size_t len);
void if_odd_quit(const char *s);
int main(void)
{
testrif("");
testrif("a1");
testrif("ab12");
testrif("abc123");
testrif("abcd1234");
testrif("abcde12345");
testrif("abcdef123456");
return 0;
}
void testrif(const char *s)
{
char mutable[20];
strcpy(mutable, s);
printf("'%s'\n", mutable);
riffle(mutable);
printf("'%s'\n\n", mutable);
}
void riffle(char *s)
{
if_odd_quit(s);
rif_recur(s, strlen(s));
}
void rif_recur(char *s, size_t len)
{
/* Turn, e.g., "abcde12345" into "abc123de45", then recurse. */
size_t pivot = len / 2;
size_t half = (pivot + 1) / 2;
size_t twice = half * 2;
if (len < 4)
return;
swap(s + half, pivot - half, pivot);
rif_recur(s, twice);
rif_recur(s + twice, len - twice);
}
void swap(char *s, size_t midpt, size_t len)
{
/* Swap s[0..midpt] with s[midpt..len], in place. Algorithm from
* Programming Pearls, Chapter 2. */
flip(s, midpt);
flip(s + midpt, len - midpt);
flip(s, len);
}
void flip(char *s, size_t len)
{
/* Reverse order of characters in s, in place. */
char *p, *q, tmp;
if (len < 2)
return;
for (p = s, q = s + len - 1; p < q; p++, q--) {
tmp = *p;
*p = *q;
*q = tmp;
}
}
void if_odd_quit(const char *s)
{
if (strlen(s) % 2) {
fputs("String length is odd; aborting.\n", stderr);
exit(1);
}
}
By comparing 123456abcdef and 1a2b3c4d5e6f we can note that only the first and the last characters are in their correct position. We can also note that for each of the remaining n-2 characters we can compute its correct position directly from its original position. When it gets there, the element that was there surely was not in its correct position, so it will have to replace another one. By doing n-2 such steps all the elements get to their correct positions, provided the n-2 positions form a single cycle under this mapping. That holds for n = 12, but not for every n: for n = 8 the chain starting at position 1 is 1 -> 2 -> 4 -> 1, so positions 3, 5 and 6 are never visited.
void funny_sort(char* arr, int n){
    int pos = 1; // first unordered element
    char aux = arr[pos];
    for (int iter = 0; iter < n-2; iter++) { // n-2 unordered elements
        pos = (pos < n/2) ? pos*2 : (pos-n/2)*2+1; // correct pos for aux
        char tmp = arr[pos]; // swap aux with the element currently at pos
        arr[pos] = aux;
        aux = tmp;
    }
}
Score each digit as its numerical value. Score each letter as a = 1.5, b = 2.5 c = 3.5 etc. Run an insertion sort of the string based on the score of each character.
[ETA] Simple scoring won't work so use two pointers and reverse the piece of the string between the two pointers. One pointer starts at the front of the string and advances one step each cycle. The other pointer starts in the middle of the string and advances every second cycle.
123456abcdef
 ^    ^
1a65432bcdef
  ^   ^
1a23456bcdef
   ^   ^
1a2b6543cdef
    ^  ^
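The two-pointer reversal walk traced above might be coded like this in Python. The function name interleave_by_reversal and the way the pointer positions are derived from a cycle counter are my own framing of the description, so treat it as a sketch. It works on a list of characters because Python strings are immutable; the reversals themselves are done in place on that list.

```python
def interleave_by_reversal(text):
    """Interleave the two halves of an even-length string by repeatedly
    reversing the slice between a front pointer and a back pointer."""
    chars = list(text)
    n = len(chars)
    if n % 2:
        raise ValueError("only even-length strings are supported")
    cycle = 0
    while True:
        front = 1 + cycle            # advances one step every cycle
        back = n // 2 + cycle // 2   # advances every second cycle
        if front >= back:
            break
        # reverse chars[front..back] in place
        lo, hi = front, back
        while lo < hi:
            chars[lo], chars[hi] = chars[hi], chars[lo]
            lo += 1
            hi -= 1
        cycle += 1
    return "".join(chars)


print(interleave_by_reversal("123456abcdef"))  # 1a2b3c4d5e6f
```

Every even-numbered cycle pulls the next letter to the front pointer and every odd-numbered cycle puts the remaining digits back in order, so the approach uses O(1) extra space but O(n^2) time.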

Remove duplicate items with minimal auxiliary memory?

What is the most efficient way to remove duplicate items from an array under the constraint that auxiliary memory usage must be kept to a minimum, preferably small enough to not even require any heap allocations? Sorting seems like the obvious choice, but this is clearly not asymptotically efficient. Is there a better algorithm that can be done in place or close to in place? If sorting is the best choice, what kind of sort would be best for something like this?
I'll answer my own question since, after posting, I came up with a really clever algorithm to do this. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in auxiliary space (the recursion is a tail call), and is typically O(N) time complexity. The algorithm is as follows:
Take the first element of the array, this will be the sentinel.
Reorder the rest of the array, as much as possible, such that each element is in the position corresponding to its hash. As this step is completed, duplicates will be discovered. Set them equal to sentinel.
Move all elements for which the index is equal to the hash to the beginning of the array.
Move all elements that are equal to sentinel, except the first element of the array, to the end of the array.
What's left between the properly hashed elements and the duplicate elements will be the elements that couldn't be placed in the index corresponding to their hash because of a collision. Recurse to deal with these elements.
This can be shown to be O(N) provided no pathological scenario in the hashing:
Even if there are no duplicates, approximately 2/3 of the elements will be eliminated at each recursion. Each level of recursion is O(n) where small n is the amount of elements left. The only problem is that, in practice, it's slower than a quick sort when there are few duplicates, i.e. lots of collisions. However, when there are huge amounts of duplicates, it's amazingly fast.
Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in full 32-bit space. Collisions may, however, occur frequently in the modulus space. However, this assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.
Here's an implementation in the D programming language:
void uniqueInPlace(T)(ref T[] dataIn) {
uniqueInPlaceImpl(dataIn, 0);
}
void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
if(dataIn.length - start < 2)
return;
invariant T sentinel = dataIn[start];
T[] data = dataIn[start + 1..$];
static hash_t getHash(T elem) {
static if(is(T == uint) || is(T == int)) {
return cast(hash_t) elem;
} else static if(__traits(compiles, elem.toHash)) {
return elem.toHash;
} else {
static auto ti = typeid(typeof(elem));
return ti.getHash(&elem);
}
}
for(size_t index = 0; index < data.length;) {
if(data[index] == sentinel) {
index++;
continue;
}
auto hash = getHash(data[index]) % data.length;
if(index == hash) {
index++;
continue;
}
if(data[index] == data[hash]) {
data[index] = sentinel;
index++;
continue;
}
if(data[hash] == sentinel) {
swap(data[hash], data[index]);
index++;
continue;
}
auto hashHash = getHash(data[hash]) % data.length;
if(hashHash != hash) {
swap(data[index], data[hash]);
if(hash < index)
index++;
} else {
index++;
}
}
size_t swapPos = 0;
foreach(i; 0..data.length) {
if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
swap(data[i], data[swapPos++]);
}
}
size_t sentinelPos = data.length;
for(size_t i = swapPos; i < sentinelPos;) {
if(data[i] == sentinel) {
swap(data[i], data[--sentinelPos]);
} else {
i++;
}
}
dataIn = dataIn[0..sentinelPos + start + 1];
uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}
Keeping auxiliary memory usage to a minimum, your best bet would be to do an efficient sort to get them in order, then do a single pass of the array with a FROM and TO index.
You advance the FROM index every time through the loop. You only copy the element from FROM to TO (and increment TO) when the key is different from the last.
With Quicksort, that'll average O(n log n) for the sort, plus O(n) for the final pass.
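A small Python sketch of that sort-then-compact pass, with the FROM and TO indexes spelled out. The function name dedupe_sorted_in_place is made up, and it leans on Python's built-in list.sort rather than a hand-written Quicksort, so the memory behaviour of the sort itself is whatever the runtime provides.

```python
def dedupe_sorted_in_place(items):
    """Sort the list, then squeeze out duplicates with two indexes."""
    items.sort()
    to = 0  # next slot to write a kept element into
    for frm in range(len(items)):  # FROM advances on every iteration
        # copy only when the key differs from the last element kept
        if to == 0 or items[frm] != items[to - 1]:
            items[to] = items[frm]
            to += 1
    del items[to:]  # drop the leftover tail
    return items


print(dedupe_sorted_in_place([5, 2, 3, 5, 1, 2]))  # [1, 2, 3, 5]
```

The compaction pass only ever writes at or before the position it is reading from, so it needs no extra storage beyond the two indexes.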
If you sort the array, you will still need another pass to remove duplicates, so the complexity is O(N*N) in the worst case (assuming Quicksort), or O(N*sqrt(N)) using Shellsort.
You can achieve O(N*N) by simply scanning the array for each element removing duplicates as you go.
Here is an example in Lua:
function removedups (t)
local result = {}
local count = 0
local found
for i,v in ipairs(t) do
found = false
if count > 0 then
for j = 1,count do
if v == result[j] then found = true; break end
end
end
if not found then
count = count + 1
result[count] = v
end
end
return result, count
end
I don't see any way to do this without something like a bubblesort. When you find a dupe, you need to reduce the length of the array. Quicksort is not designed for the size of the array to change.
This algorithm is always O(n^2), but it also uses almost no extra memory, stack or heap.
// returns the new size
int bubblesqueeze(int* a, int size) {
for (int j = 0; j < size - 1; ++j) {
for (int i = j + 1; i < size; ++i) {
// when a dupe is found, move the end value to index j
// and shrink the size of the array
while (i < size && a[i] == a[j]) {
a[i] = a[--size];
}
if (i < size && a[i] < a[j]) {
int tmp = a[j];
a[j] = a[i];
a[i] = tmp;
}
}
}
return size;
}
If you have two different variables for traversing a dataset instead of just one, then you can limit the output by dismissing all duplicates that are already in the dataset.
Obviously this example in C is not an efficient sorting algorithm, but it is just an example of one way to look at the problem.
You could also blindly sort the data first and then relocate the data to remove the dups, but I'm not sure that would be faster.
#include <stdio.h>
#include <stdlib.h>
#define ARRAY_LENGTH 15
int stop = 1;
int scan_sort[ARRAY_LENGTH] = {5,2,3,5,1,2,5,4,3,5,4,8,6,4,1};
void step_relocate(char tmp,char s,int *dataset)
{
for(;tmp<s;s--)
dataset[s] = dataset[s-1];
}
int exists(int var,int *dataset)
{
int tmp=0;
for(;tmp < stop; tmp++)
{
if( dataset[tmp] == var)
return 1;/* value exsist */
if( dataset[tmp] > var)
tmp=stop;/* Value not in array*/
}
return 0;/* Value not in array*/
}
void main(void)
{
int tmp1=0;
int tmp2=0;
int index = 1;
while(index < ARRAY_LENGTH)
{
if(exists(scan_sort[index],scan_sort))
;/* Dismiss all values currently in the final dataset */
else if(scan_sort[stop-1] < scan_sort[index])
{
scan_sort[stop] = scan_sort[index];/* Insert the value as the highest one */
stop++;/* One more value adde to the final dataset */
}
else
{
for(tmp1=0;tmp1<stop;tmp1++)/* find where the data shall be inserted */
{
if(scan_sort[index] < scan_sort[tmp1])
{
index = index;
break;
}
}
tmp2 = scan_sort[index]; /* Store in case this value is the next after stop*/
step_relocate(tmp1,stop,scan_sort);/* Relocated data already in the dataset*/
scan_sort[tmp1] = tmp2;/* insert the new value */
stop++;/* One more value adde to the final dataset */
}
index++;
}
printf("Result: ");
for(tmp1 = 0; tmp1 < stop; tmp1++)
printf( "%d ",scan_sort[tmp1]);
printf("\n");
system( "pause" );
}
I liked the problem so I wrote a simple C test prog for it as you can see above. Make a comment if I should elaborate or you see any faults.

Resources