Make unique array with minimal sum - algorithm

This is an interview question. Given an array, e.g. [3,2,1,2,7], we want to make all of its elements unique by incrementing duplicate elements, and we require the sum of the resulting array to be minimal. For example, the answer for [3,2,1,2,7] is [3,2,1,4,7], with a sum of 17. Any ideas?

It's not quite as simple as my earlier comment suggested, but it's not terrifically complicated.
First, sort the input array. If it matters to be able to recover the original order of the elements then record the permutation used for the sort.
Second, scan the sorted array from left to right (i.e. from low to high). If an element is less than or equal to the element to its left, set it to be one greater than that element.
Pseudocode
sar = sort(input_array)
for index = 2:size(sar) ! I count from 1
    if sar(index)<=sar(index-1) sar(index) = sar(index-1)+1
forend
Is the sum of the result minimal? I've convinced myself that it is through some head-scratching and trials, but I haven't got a formal proof.
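For anyone who wants to try the sort-and-scan idea above, here is a minimal Python sketch. It assumes integer input; the function name make_unique_min_sum is just illustrative, and instead of recording a permutation it sorts the indices so the result comes back in the original order.

```python
def make_unique_min_sum(values):
    """Return a copy of values in which duplicates were incremented until unique."""
    # Visit positions in ascending order of value, so the original
    # order of the elements can be kept in the result.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = list(values)
    prev = None
    for i in order:
        # An element that is not above its left neighbour (in sorted
        # order) is set to one more than that neighbour.
        if prev is not None and result[i] <= prev:
            result[i] = prev + 1
        prev = result[i]
    return result


print(make_unique_min_sum([3, 2, 1, 2, 7]))  # prints [4, 2, 1, 3, 7]
```

The sum is 17, the same as in the question, even though a different element ends up carrying the value 4.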

If you only need to find ONE of the best solutions, here's the algorithm with some explanations.
The idea is to find an optimal solution by testing all candidate solutions (strictly speaking there are infinitely many, so let's stick with the reasonable ones).
I wrote the program in C because I'm familiar with it, but you can port it to any language you want.
The program does this: it tries to increment one value up to the maximum possible (I'll explain how to find it in the comments under the code sections); then, if no solution is found, it decrements this value again and goes on with the next one, and so on.
It's an exponential algorithm, so it will be very slow on inputs with many duplicated values (yet it assures you that the best solution is found).
I tested this code with your example and it worked; I'm not sure whether there are any bugs left, but the code (in C) is this.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
typedef int BOOL; //just to ease meanings of values
#define TRUE 1
#define FALSE 0
Just to ease comprehension, I did some typedefs. Don't worry.
typedef struct duplicate { //used to speed up the algorithm; it uses some more memory just to be safe
int value;
BOOL duplicate;
} duplicate_t;
int maxInArrayExcept(int *array, int arraySize, int index); //find the max value in array except the value at the given index
//the result is the max value in the array, not counting the index
int *findDuplicateSum(int *array, int arraySize);
BOOL findDuplicateSum_R(duplicate_t *array, int arraySize, int *tempSolution, int *solution, int *totalSum, int currentSum); //recursive function used to find the solution
BOOL check(int *array, int arraySize); //checks that no value is repeated in the solution
These are all the functions we'll need, split up for ease of comprehension.
First, we have a struct. It saves us from checking, on every iteration, whether the value at a given index was originally duplicated. We don't want to modify any value that was not duplicated originally.
Then we have a couple of functions. First, we need to handle the worst-case scenario, where every value after the duplicated ones is already occupied: in that case we need to increment the duplicated value up to the maximum value reached + 1.
Then there is the main function, which we'll discuss later.
The check function only checks whether there is any duplicated value in a temporary solution.
int main() { //testing purpose
int i;
int testArray[] = { 3,2,1,2,7 }; //test array
int nTestArraySize = 5; //test array size
int *solutionArray; //needed if you want to use the solution later
solutionArray = findDuplicateSum(testArray, nTestArraySize);
for (i = 0; i < nTestArraySize; ++i) {
printf("%d ", solutionArray[i]);
}
return 0;
}
This is the main Function: I used it to test everything.
int * findDuplicateSum(int * array, int arraySize)
{
int *solution = malloc(sizeof(int) * arraySize);
int *tempSolution = malloc(sizeof(int) * arraySize);
duplicate_t *duplicate = calloc(arraySize, sizeof(duplicate_t));
int i, j, currentSum = 0, totalSum = INT_MAX;
for (i = 0; i < arraySize; ++i) {
tempSolution[i] = solution[i] = duplicate[i].value = array[i];
currentSum += array[i];
for (j = 0; j < i; ++j) { //to find ALL the best solutions, we should also flag the first occurrence as a duplicate; it's just one more line
//yet, leaving it unflagged saves the algorithm half of the duplicated numbers (best/this case scenario)
if (array[j] == duplicate[i].value) {
duplicate[i].duplicate = TRUE;
}
}
}
if (!findDuplicateSum_R(duplicate, arraySize, tempSolution, solution, &totalSum, currentSum)) {
printf("No solution found\n");
}
free(tempSolution);
free(duplicate);
return solution;
}
This function does a lot of things: first, it sets up the solution array, then it initializes both the solution values and the duplicate array, which is the one used to flag the values that were duplicated at startup. Then we compute the current sum and set the best sum found so far to the maximum possible integer.
Then the recursive function is called; it tells us whether a solution was found (which should always be the case), and we return the solution as an array.
int findDuplicateSum_R(duplicate_t * array, int arraySize, int * tempSolution, int * solution, int * totalSum, int currentSum)
{
int i;
if (check(tempSolution, arraySize)) {
if (currentSum < *totalSum) { //optimal solution checking
for (i = 0; i < arraySize; ++i) {
solution[i] = tempSolution[i];
}
*totalSum = currentSum;
}
return TRUE; //just to ensure a solution is found
}
for (i = 0; i < arraySize; ++i) {
if (array[i].duplicate == TRUE) {
if (tempSolution[i] <= maxInArrayExcept(tempSolution, arraySize, i)) { //worst case scenario, you need it to stop the recursion on that value
tempSolution[i]++;
if (findDuplicateSum_R(array, arraySize, tempSolution, solution, totalSum, currentSum + 1)) {
return TRUE;
}
tempSolution[i]--; //backtracking
}
}
}
return FALSE; //just in case the solution is not found, but we won't need it
}
This is the recursive function. It first checks whether the temporary solution is valid and whether it is the best one found so far. If so, it copies the temporary values into the actual solution and updates the best sum.
Then we iterate over every repeated value (the if excludes the other indexes) and go deeper into the recursion until (if unlucky) we reach the worst-case scenario: the check still fails once the value is above the maximum.
Then we have to backtrack and continue with the iteration, which moves on to the other values.
PS: an optimization is possible here if we move the optimality test from the check into the for loop: if the current sum is already no better than the best one found, we can't expect to find a better solution just by adding more.
That was the hard part of the code; here are the supporting functions:
int maxInArrayExcept(int *array, int arraySize, int index) {
int i, max = 0;
for (i = 0; i < arraySize; ++i) {
if (i != index) {
if (array[i] > max) {
max = array[i];
}
}
}
return max;
}
BOOL check(int *array, int arraySize) {
int i, j;
for (i = 0; i < arraySize; ++i) {
for (j = 0; j < i; ++j) {
if (array[i] == array[j]) return FALSE;
}
}
return TRUE;
}
I hope this was useful.
Write if anything is unclear.

Well, I got the same question in one of my interviews.
Not sure if you still need it, but here's how I did it, and it worked well.
num_list1 = [2,8,3,6,3,5,3,5,9,4]
def UniqueMinSumArray(num_list):
    for i in range(len(num_list)):
        # keep bumping this element until its value is unique in the list
        while num_list.count(num_list[i]) > 1:
            num_list[i] += 1
    return num_list
print(sum(UniqueMinSumArray(num_list1)))
You can try it with your own list of numbers; it should give you the minimum unique sum.

I got the same interview question too. My answer is in JS, in case anyone is interested.
It can surely be improved to get rid of the for loop.
function getMinimumUniqueSum(arr) {
  // [1,1,2] => [1,2,3] = 6
  // [1,2,2,3,3] => [1,2,3,4,5] = 15
  if (arr.length > 0) {
    var sortedArr = [...arr].sort((a, b) => a - b);
    var current = sortedArr[0];
    var res = [current];
    for (var i = 1; i < arr.length; i++) {
      // compare each remaining element (from index 1 on) with the last value used
      if (sortedArr[i] > current) {
        res.push(sortedArr[i]);
        current = sortedArr[i];
      } else if (sortedArr[i] == current) {
        current = sortedArr[i] + 1;
        res.push(current);
      } else {
        current++;
        res.push(current);
      }
    }
    return res.reduce((a,b) => a + b, 0);
  } else {
    return 0;
  }
}

Related

Find word in string buffer/paragraph/text

This was asked in an Amazon telephone interview: "Can you write a program (in your preferred language, C/C++/etc.) to find a given word in a large string buffer, i.e. count its number of occurrences?"
I am still looking for the perfect answer that I should have given to the interviewer. I tried to write a linear search (char-by-char comparison) and, obviously, I was rejected.
Given 40-45 minutes for a telephone interview, what was the perfect algorithm he/she was looking for?
The KMP Algorithm is a popular string matching algorithm.
KMP Algorithm
Checking char by char is inefficient. If the string has 1000 characters and the keyword has 100 characters, you don't want to perform unnecessary comparisons. The KMP Algorithm handles many cases which can occur, but I imagine the interviewer was looking for the case where: When you begin (pass 1), the first 99 characters match, but the 100th character doesn't match. Now, for pass 2, instead of performing the entire comparison from character 2, you have enough information to deduce where the next possible match can begin.
// C program for implementation of KMP pattern searching
// algorithm
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
void computeLPSArray(char *pat, int M, int *lps);
void KMPSearch(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int *lps = (int *)malloc(sizeof(int)*M);
int j = 0; // index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
while (i < N)
{
if (pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free(lps); // to avoid memory leak
}
void computeLPSArray(char *pat, int M, int *lps)
{
int len = 0; // length of the previous longest prefix suffix
int i;
lps[0] = 0; // lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while (i < M)
{
if (pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
if (len != 0)
{
// This is tricky. Consider the example
// AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char *txt = "ABABDABACDABABCABAB";
char *pat = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
This code is taken from a really good site that teaches algorithms:
Geeks for Geeks KMP
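Since the question asked for the number of occurrences rather than their positions, here is a small Python sketch of the same KMP idea that returns a count. The function names build_lps and count_occurrences are mine, not from the site above. It counts overlapping matches and does no word-boundary checking, so treat it as an illustration of the technique only.

```python
def build_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    lps = [0] * len(pattern)
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # Fall back to a shorter prefix; i is not advanced here.
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    lps = build_lps(pattern)
    count = 0
    j = 0  # number of pattern characters currently matched
    for ch in text:
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            count += 1
            j = lps[j - 1]
    return count


print(count_occurrences("ABABDABACDABABCABAB", "ABABCABAB"))  # prints 1
```

Each text character is looked at once and the fallback loop only ever shrinks j, so the scan is linear in the text length plus the pattern length.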
Amazon and similar companies expect knowledge of the Boyer–Moore string search and/or Knuth–Morris–Pratt algorithms.
Those are good if you want to show perfect knowledge. Otherwise, try to be creative and write something relatively elegant and efficient.
Did you ask about delimiters before you wrote anything? It may be that they would have simplified your task by providing some extra information about the string buffer.
Even the code below could be OK (it's really not) if you provide enough information in advance and properly explain the runtime, space requirements and choice of data containers.
#include <sstream>
#include <string>
#include <unordered_map>
int find( const std::string & the_word, const std::string & text )
{
    std::stringstream ss( text ); // !!! could be a really bad idea if 'text' is really big: it copies the buffer
    std::string word;
    std::unordered_map< std::string, int > umap;
    while( ss >> word ) ++umap[word]; // you have to assume that words are separated by white-space
    return umap[the_word];
}

Shortest string to try all 3 digit lock

I was asked this question in one of my recent interviews. A three-digit lock can have its key value in the range "000" - "999", so basically 1000 combinations have to be tried to open the lock. I had to generate the shortest string such that all possible combinations (i.e. "000" - "999") would be checked. For example, the string "01234" would check the combinations "012", "123" and "234". I tried to use a hashset to implement this: I started with "000", then took the last two characters of the string (i.e. "00"), appended a new digit from 0 to 9 and checked whether the resulting combination already existed in the hashset. If not, I appended that digit to the output string and repeated the process. Is there any other efficient and clean way to solve this problem?
The procedure you described is based on the assumption that the shortest string has every code exactly once. It turns out that this assumption is correct.
Here's a simple backtracking implementation (C++):
#include <stdio.h>
bool used[1000];
int digits[33333];
bool backtrack(int index, int total)
{
if (total == 1000)
{
printf("%d\n", index);
for (int i = 0; i < index; ++i) {
printf("%d", digits[i]);
}
printf("\n");
return true;
}
for (int d = 0; d < 10; ++d)
{
int prev = 100*digits[index-2]+10*digits[index-1]+d;
if (!used[prev]) {
digits[index] = d;
used[prev] = true;
if (backtrack(index+1, total+1))
return true;
used[prev] = false;
}
}
return false; // every digit at this position leads to a dead end, so backtrack
}
int main(void) {
digits[0] = 0;
backtrack(2, 0);
return 0;
}
Output:
1002
00010020030040050060070080090110120130140150160170\
18019021022023024025026027028029031032033034035036\
03703803904104204304404504604704804905105205305405\
50560570580590610620630640650660670680690710720730\
74075076077078079081082083084085086087088089091092\
09309409509609709809911121131141151161171181191221\
23124125126127128129132133134135136137138139142143\
14414514614714814915215315415515615715815916216316\
41651661671681691721731741751761771781791821831841\
85186187188189192193194195196197198199222322422522\
62272282292332342352362372382392432442452462472482\
49253254255256257258259263264265266267268269273274\
27527627727827928328428528628728828929329429529629\
72982993334335336337338339344345346347348349354355\
35635735835936436536636736836937437537637737837938\
43853863873883893943953963973983994445446447448449\
45545645745845946546646746846947547647747847948548\
64874884894954964974984995556557558559566567568569\
57657757857958658758858959659759859966676686696776\
78679687688689697698699777877978878979879988898999\
00
The procedure is efficient.
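The length can be checked independently: 1000 codes need at least 1000 + 2 = 1002 characters, and a string of exactly that length containing every code once is a linearized de Bruijn sequence. Here is a minimal Python sketch that builds one with the standard Lyndon-word construction, which is a different technique from the backtracking code above. The names de_bruijn and lock_string are mine.

```python
def de_bruijn(k, n):
    """Cyclic de Bruijn sequence over digits 0..k-1 for windows of length n."""
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            # Only prefixes whose period divides n contribute digits.
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def lock_string(k=10, n=3):
    """Unroll the cyclic sequence by repeating its first n-1 digits at the end."""
    cyclic = de_bruijn(k, n)
    linear = cyclic + cyclic[:n - 1]
    return "".join(str(d) for d in linear)


s = lock_string()
print(len(s))  # prints 1002
```

The string it produces need not be the same one as in the output above; many different strings of length 1002 cover all 1000 codes.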

Parallel radix sort with virtual memory and write-combining

I'm attempting to implement the variant of parallel radix sort described in http://arxiv.org/pdf/1008.2849v2.pdf (Algorithm 2), but my C++ implementation (for 4 digits in base 10) contains a bug that I'm unable to locate.
For debugging purposes I'm using no parallelism, but the code should still sort correctly.
For instance the line arr.at(i) = item accesses indices outside its bounds in the following
std::vector<int> v = {4612, 4598};
radix_sort2(v);
My implementation is as follows
#include <set>
#include <array>
#include <vector>
void radix_sort2(std::vector<int>& arr) {
std::array<std::set<int>, 10> buckets3;
for (const int item : arr) {
int d = item / 1000;
buckets3.at(d).insert(item);
}
//Prefix sum
std::array<int, 10> outputIndices;
outputIndices.at(0) = 0;
for (int i = 1; i < 10; ++i) {
outputIndices.at(i) = outputIndices.at(i - 1) +
buckets3.at(i - 1).size();
}
for (const auto& bucket3 : buckets3) {
std::array<std::set<int>, 10> buckets0, buckets1;
std::array<int, 10> histogram2 = {};
for (const int item : bucket3) {
int d = item % 10;
buckets0.at(d).insert(item);
}
for (const auto& bucket0 : buckets0) {
for (const int item : bucket0) {
int d = (item / 10) % 10;
buckets1.at(d).insert(item);
int d2 = (item / 100) % 10;
++histogram2.at(d2);
}
}
for (const auto& bucket1 : buckets1) {
for (const int item : bucket1) {
int d = (item / 100) % 10;
int i = outputIndices.at(d) + histogram2.at(d);
++histogram2.at(d);
arr.at(i) = item;
}
}
}
}
Can anyone spot my mistake?
I took a look at the paper you linked. You haven't made any mistakes that I can see. In fact, in my estimation, you corrected a mistake in the algorithm.
I wrote out the algorithm and ended up with the exact same problem as you did. After reviewing Algorithm 2, either I woefully misunderstand how it is supposed to work, or it is flawed. There are at least a couple of problems with the algorithm, specifically revolving around outputIndices and histogram2.
Looking at the algorithm, the final index of an item is determined by the counting sort stored in outputIndices (let's ignore the histogram for now).
If you had an initial array of numbers {0100, 0103, 0102, 0101}, the prefix sum of that would be 4.
The algorithm gives no indication that I can find of lagging the result by 1. That being said, in order for the algorithm to work the way they intend, it does have to be lagged, so, moving on.
Now the prefix sums are 0, 4, 4, ... The algorithm doesn't use the MSD as the index into the outputIndices array; it uses "MSD - 1". So, taking 1 as the index into the array, the starting index for the first item, without the histogram, is 4! Outside the array on the first try.
outputIndices is built with the MSD, so it makes sense for it to be accessed by the MSD.
Further, even if you tweak the algorithm to correctly use the MSD as the index into outputIndices, it still won't sort correctly. With your initial inputs (swapped), {4598, 4612}, they will stay in that order. They are sorted (locally) as if they were 2-digit numbers. If you add other numbers that don't start with 4, they will be sorted globally, but the local sort is never finished.
According to the paper, the goal is to use the histogram to do that, but I don't see that happening.
Ultimately, I'm assuming that what you want is an algorithm that works the way described. I've modified the algorithm, keeping with the paper's overall stated goal of using the MSD to do a global sort and sorting the rest of the digits by reverse LSD.
I don't think these changes should have any impact on your desire to parallelize the function.
void radix_sort2(std::vector<int>& arr)
{
std::array<std::vector<int>, 10> buckets3;
for (const int item : arr)
{
int d = item / 1000;
buckets3.at(d).push_back(item);
}
//Prefix sum
std::array<int, 10> outputIndices;
outputIndices.at(0) = 0;
for (int i = 1; i < 10; ++i)
{
outputIndices.at(i) = outputIndices.at(i - 1) + buckets3.at(i - 1).size();
}
for (const auto& bucket3 : buckets3)
{
if (bucket3.size() <= 0)
continue;
std::array<std::vector<int>, 10> buckets0, buckets1, buckets2;
for (const int item : bucket3)
buckets0.at(item % 10).push_back(item);
for (const auto& bucket0 : buckets0)
for (const int item : bucket0)
buckets1.at((item / 10) % 10).push_back(item);
for (const auto& bucket1 : buckets1)
for (const int item : bucket1)
buckets2.at((item / 100) % 10).push_back(item);
int count = 0;
for (const auto& bucket2 : buckets2)
{
for (const int item : bucket2)
{
int d = (item / 1000) % 10;
int i = outputIndices.at(d) + count;
++count;
arr.at(i) = item;
}
}
}
}
For extensibility, it would probably make sense to create a helper function that does the local sorting. That way you should be able to extend it to handle numbers with any number of digits.
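To illustrate the helper-function idea, here is a rough Python sketch (not the paper's algorithm, and not a translation of the C++ above line by line). It assumes non-negative integers with at most total_digits digits; lsd_sort and radix_sort are names I made up for the example.

```python
def lsd_sort(items, num_digits, base=10):
    """Stable LSD radix sort on the lowest num_digits digits."""
    for pos in range(num_digits):
        buckets = [[] for _ in range(base)]
        divisor = base ** pos
        for item in items:
            buckets[(item // divisor) % base].append(item)
        # Concatenating the buckets in order keeps the pass stable.
        items = [item for bucket in buckets for item in bucket]
    return items


def radix_sort(arr, total_digits=4, base=10):
    """Global split on the most significant digit, then a local LSD sort per bucket."""
    top = base ** (total_digits - 1)
    buckets = [[] for _ in range(base)]
    for item in arr:
        buckets[(item // top) % base].append(item)
    out = []
    for bucket in buckets:
        # Each MSD bucket is independent, which is where parallelism could go.
        out.extend(lsd_sort(bucket, total_digits - 1, base))
    return out


print(radix_sort([4612, 4598, 100, 103, 102, 101]))
# prints [100, 101, 102, 103, 4598, 4612]
```

Changing total_digits is all it takes to handle longer numbers, since the local sort just loops over the remaining digit positions.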

Most efficient way to sort parallel arrays in a restricted-feature language

The environment: I am working in a proprietary scripting language where there is no such thing as a user-defined function. I have various loops and local variables of primitive types that I can create and use.
I have two related arrays, "times" and "values". They both contain floating point values. I want to numerically sort the "times" array but have to be sure that the same operations are applied on the "values" array. What's the most efficient way I can do this without the benefit of things like recursion?
You could maintain an index table and sort the index table instead.
This way you will not have to worry about times and values being consistent.
And whenever you need a sorted value, you can lookup on the sorted index.
And if in the future you decided there was going to be a third value, the sorting code will not need any changes.
Here's a sample in C#, but it shouldn't be hard to adapt to your scripting language:
static void Main() {
var r = new Random();
// initialize random data
var index = new int[10]; // the index table
var times = new double[10]; // times
var values = new double[10]; // values
for (int i = 0; i < 10; i++) {
index[i] = i;
times[i] = r.NextDouble();
values[i] = r.NextDouble();
}
// a naive bubble sort
for (int i = 0; i < 10; i++)
for (int j = 0; j < 10; j++)
// compare time value at current index
if (times[index[i]] < times[index[j]]) {
// swap index value (times and values remain unchanged)
var temp = index[i];
index[i] = index[j];
index[j] = temp;
}
// check if the result is correct
for (int i = 0; i < 10; i++)
Console.WriteLine(times[index[i]]);
Console.ReadKey();
}
Note: I used a naive bubble sort there, so watch out. In your case an insertion sort is probably a good candidate, since you don't want complex recursion.
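Here is roughly what the insertion-sort variant could look like, sketched in Python. The function wrapper sorted_index is only there to make the example easy to run; since your scripting language has no user-defined functions, the loop body would simply be written inline. The sample data is made up.

```python
def sorted_index(times):
    """Return the indices that visit times in ascending order (insertion sort)."""
    index = list(range(len(times)))
    for i in range(1, len(index)):
        current = index[i]
        j = i - 1
        # Shift larger entries one slot to the right; the strict '>'
        # keeps entries with equal times in their original order.
        while j >= 0 and times[index[j]] > times[current]:
            index[j + 1] = index[j]
            j -= 1
        index[j + 1] = current
    return index


times = [3.0, 1.5, 2.25, 0.5]
values = [30.0, 15.0, 22.5, 5.0]
index = sorted_index(times)
print([times[i] for i in index])   # prints [0.5, 1.5, 2.25, 3.0]
print([values[i] for i in index])  # prints [5.0, 15.0, 22.5, 30.0]
```

Only the index table is rearranged; times and values are never modified, so they cannot get out of step.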
Just take your favourite sorting algorithm (e.g. Quicksort or Mergesort) and use it to sort the "times" array. Whenever two values are swapped in "times", also swap the values with the same indices in the "values" array.
So basically you can take any fast sorting algorithm and modify the swap() operation so that elements in both arrays are swapped.
Take a look at the Bottom-Up mergesort at Algorithmist. It's a non-recursive way of performing a mergesort. The version presented there uses function calls, but that can be inlined easily enough.
Like martinus said, every time you change a value in one array, do the exact same thing in the parallel array.
Here's a C-like version of a stable, non-recursive mergesort that makes no function calls.
const int arrayLength = 40;
float times_array[arrayLength];
float values_array[arrayLength];
// Fill the two arrays....
// Allocate two buffers
float times_buffer[arrayLength];
float values_buffer[arrayLength];
int blockSize = 1;
while (blockSize <= arrayLength)
{
    int i = 0;
    while (i < arrayLength-blockSize)
    {
        int begin1 = i;
        int end1 = begin1 + blockSize;
        int begin2 = end1;
        int end2 = begin2 + blockSize;
        if (end2 > arrayLength)
            end2 = arrayLength; // the last block may be shorter than blockSize
        int bufferIndex = begin1;
        while (begin1 < end1 && begin2 < end2)
        {
            if ( times_array[begin1] > times_array[begin2] ) // compare the sort keys
            {
                times_buffer[bufferIndex] = times_array[begin2];
                values_buffer[bufferIndex++] = values_array[begin2++];
            }
            else
            {
                times_buffer[bufferIndex] = times_array[begin1];
                values_buffer[bufferIndex++] = values_array[begin1++];
            }
        }
        while ( begin1 < end1 )
        {
            times_buffer[bufferIndex] = times_array[begin1];
            values_buffer[bufferIndex++] = values_array[begin1++];
        }
        while ( begin2 < end2 )
        {
            times_buffer[bufferIndex] = times_array[begin2];
            values_buffer[bufferIndex++] = values_array[begin2++];
        }
        // copy the merged run back, never past the end of the arrays
        for (int k = i; k < end2; ++k)
        {
            times_array[k] = times_buffer[k];
            values_array[k] = values_buffer[k];
        }
        i += 2 * blockSize;
    }
    blockSize *= 2;
}
I wouldn't suggest writing your own sorting routine, as the sorting routines provided as part of the Java language are well optimized.
The way I'd solve this is to copy the code of the java.util.Arrays class into your own class, e.g. org.mydomain.util.Arrays, and add some comments telling yourself not to use the class except when you must have the additional functionality that you're going to add. The Arrays class is quite stable, so this is less of a problem than it would seem, but it's still less than ideal. However, the methods you need to change are private, so you've no real choice.
You then want to create an interface along the lines of:
public static interface SwapHook {
void swap(int a, int b);
}
You then need to add this to the sort method you're going to use, and to every subordinate method called in the sorting procedure that swaps elements in your primary array. You arrange for the hook to be called by your modified sorting routine, and you can then implement the SwapHook interface to achieve the behaviour you want in any secondary (e.g. parallel) arrays.
HTH.

Remove duplicate items with minimal auxiliary memory?

What is the most efficient way to remove duplicate items from an array under the constraint that auxiliary memory usage must be kept to a minimum, preferably small enough to not even require any heap allocations? Sorting seems like the obvious choice, but this is clearly not asymptotically efficient. Is there a better algorithm that can be done in place or close to in place? If sorting is the best choice, what kind of sort would be best for something like this?
I'll answer my own question since, after posting, I came up with a really clever algorithm to do this. It uses hashing, building something like a hash set in place. It's guaranteed to be O(1) in auxiliary space (the recursion is a tail call), and is typically O(N) in time complexity. The algorithm is as follows:
Take the first element of the array, this will be the sentinel.
Reorder the rest of the array, as much as possible, such that each element is in the position corresponding to its hash. As this step is completed, duplicates will be discovered. Set them equal to sentinel.
Move all elements for which the index is equal to the hash to the beginning of the array.
Move all elements that are equal to sentinel, except the first element of the array, to the end of the array.
What's left between the properly hashed elements and the duplicate elements will be the elements that couldn't be placed in the index corresponding to their hash because of a collision. Recurse to deal with these elements.
This can be shown to be O(N), provided there is no pathological scenario in the hashing:
Even if there are no duplicates, approximately 2/3 of the elements will be eliminated at each recursion. Each level of recursion is O(n), where small n is the number of elements left. The only problem is that, in practice, it's slower than a quick sort when there are few duplicates, i.e. lots of collisions. However, when there are huge numbers of duplicates, it's amazingly fast.
Edit: In current implementations of D, hash_t is 32 bits. Everything about this algorithm assumes that there will be very few, if any, hash collisions in the full 32-bit space (collisions may, however, occur frequently in the modulus space). This assumption will in all likelihood be true for any reasonably sized data set. If the key is less than or equal to 32 bits, it can be its own hash, meaning that a collision in the full 32-bit space is impossible. If it is larger, you simply can't fit enough of them into a 32-bit memory address space for it to be a problem. I assume hash_t will be increased to 64 bits in 64-bit implementations of D, where datasets can be larger. Furthermore, if this ever did prove to be a problem, one could change the hash function at each level of recursion.
Here's an implementation in the D programming language:
void uniqueInPlace(T)(ref T[] dataIn) {
uniqueInPlaceImpl(dataIn, 0);
}
void uniqueInPlaceImpl(T)(ref T[] dataIn, size_t start) {
if(dataIn.length - start < 2)
return;
invariant T sentinel = dataIn[start];
T[] data = dataIn[start + 1..$];
static hash_t getHash(T elem) {
static if(is(T == uint) || is(T == int)) {
return cast(hash_t) elem;
} else static if(__traits(compiles, elem.toHash)) {
return elem.toHash;
} else {
static auto ti = typeid(typeof(elem));
return ti.getHash(&elem);
}
}
for(size_t index = 0; index < data.length;) {
if(data[index] == sentinel) {
index++;
continue;
}
auto hash = getHash(data[index]) % data.length;
if(index == hash) {
index++;
continue;
}
if(data[index] == data[hash]) {
data[index] = sentinel;
index++;
continue;
}
if(data[hash] == sentinel) {
swap(data[hash], data[index]);
index++;
continue;
}
auto hashHash = getHash(data[hash]) % data.length;
if(hashHash != hash) {
swap(data[index], data[hash]);
if(hash < index)
index++;
} else {
index++;
}
}
size_t swapPos = 0;
foreach(i; 0..data.length) {
if(data[i] != sentinel && i == getHash(data[i]) % data.length) {
swap(data[i], data[swapPos++]);
}
}
size_t sentinelPos = data.length;
for(size_t i = swapPos; i < sentinelPos;) {
if(data[i] == sentinel) {
swap(data[i], data[--sentinelPos]);
} else {
i++;
}
}
dataIn = dataIn[0..sentinelPos + start + 1];
uniqueInPlaceImpl(dataIn, start + swapPos + 1);
}
To keep auxiliary memory usage to a minimum, your best bet would be to do an efficient sort to get the elements in order, then do a single pass over the array with a FROM and a TO index.
You advance the FROM index every time through the loop. You only copy the element from FROM to TO (and increment TO) when the key is different from the last.
With Quicksort, that'll average O(n log n) for the sort and O(n) for the final pass.
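A small Python sketch of the FROM/TO pass described above, in case it helps. Note one assumption: it uses Python's built-in list.sort, which is not Quicksort and may use O(n) scratch memory, so for a strictly minimal-memory version you would swap in an in-place sort such as heapsort. The function name dedupe_in_place is made up.

```python
def dedupe_in_place(items):
    """Sort items in place, compact the unique values to the front, return the new length."""
    if not items:
        return 0
    items.sort()
    to = 1  # next free slot for a unique value
    for frm in range(1, len(items)):
        # Copy only when the key differs from the last one kept.
        if items[frm] != items[to - 1]:
            items[to] = items[frm]
            to += 1
    del items[to:]  # drop the leftover tail
    return to


data = [5, 2, 3, 5, 1, 2, 5, 4, 3, 5, 4, 8, 6, 4, 1]
n = dedupe_in_place(data)
print(n, data)  # prints 7 [1, 2, 3, 4, 5, 6, 8]
```

The compaction pass itself needs only the two indices, so all of the extra memory cost comes from whichever sort is chosen.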
If you sort the array, you will still need another pass to remove duplicates, so the complexity is O(N*N) in the worst case (assuming Quicksort), or O(N*sqrt(N)) using Shellsort.
You can achieve O(N*N) by simply scanning the array for each element, removing duplicates as you go.
Here is an example in Lua:
function removedups (t)
local result = {}
local count = 0
local found
for i,v in ipairs(t) do
found = false
if count > 0 then
for j = 1,count do
if v == result[j] then found = true; break end
end
end
if not found then
count = count + 1
result[count] = v
end
end
return result, count
end
I don't see any way to do this without something like a bubblesort. When you find a dupe, you need to reduce the length of the array. Quicksort is not designed for the size of the array to change.
This algorithm is always O(n^2), but it also uses almost no extra memory, stack or heap.
// returns the new size
int bubblesqueeze(int* a, int size) {
for (int j = 0; j < size - 1; ++j) {
for (int i = j + 1; i < size; ++i) {
// when a dupe is found, move the end value to index j
// and shrink the size of the array
while (i < size && a[i] == a[j]) {
a[i] = a[--size];
}
if (i < size && a[i] < a[j]) {
int tmp = a[j];
a[j] = a[i];
a[i] = tmp;
}
}
}
return size;
}
If you use two different variables for traversing the dataset instead of just one, you can limit the output by dismissing all duplicates that are already in the dataset.
Obviously this example in C is not an efficient sorting algorithm, but it is just an example of one way to look at the problem.
You could also blindly sort the data first and then relocate the data to remove the dups, but I'm not sure that would be faster.
#include <stdio.h>
#include <stdlib.h>
#define ARRAY_LENGTH 15
int stop = 1;
int scan_sort[ARRAY_LENGTH] = {5,2,3,5,1,2,5,4,3,5,4,8,6,4,1};
void step_relocate(int tmp,int s,int *dataset)
{
for(;tmp<s;s--)
dataset[s] = dataset[s-1];
}
int exists(int var,int *dataset)
{
int tmp=0;
for(;tmp < stop; tmp++)
{
if( dataset[tmp] == var)
return 1;/* value exsist */
if( dataset[tmp] > var)
tmp=stop;/* Value not in array*/
}
return 0;/* Value not in array*/
}
int main(void)
{
int tmp1=0;
int tmp2=0;
int index = 1;
while(index < ARRAY_LENGTH)
{
if(exists(scan_sort[index],scan_sort))
;/* Dismiss all values currently in the final dataset */
else if(scan_sort[stop-1] < scan_sort[index])
{
scan_sort[stop] = scan_sort[index];/* Insert the value as the highest one */
stop++;/* One more value adde to the final dataset */
}
else
{
for(tmp1=0;tmp1<stop;tmp1++)/* find where the data shall be inserted */
{
if(scan_sort[index] < scan_sort[tmp1])
{
index = index;
break;
}
}
tmp2 = scan_sort[index]; /* Store in case this value is the next after stop*/
step_relocate(tmp1,stop,scan_sort);/* Relocated data already in the dataset*/
scan_sort[tmp1] = tmp2;/* insert the new value */
stop++;/* One more value adde to the final dataset */
}
index++;
}
printf("Result: ");
for(tmp1 = 0; tmp1 < stop; tmp1++)
printf( "%d ",scan_sort[tmp1]);
printf("\n");
system( "pause" );
}
I liked the problem, so I wrote a simple C test program for it, as you can see above. Leave a comment if I should elaborate or if you see any faults.
