Remembering the "original" index of elements after sorting - sorting

Say, I employ merge sort to sort an array of Integers. Now I need to also remember the positions that elements had in the unsorted array, initially. What would be the best way to do this?
A very very naive and space consuming way to do would be to (in C), to maintain each number as a "structure" with another number storing its index:
struct integer {
int value;
int orig_pos;
};
But, obviously there are better ways. Please share your thoughts and solution if you have already tackled such problems. Let me know if you would need more context. Thank you.

Clearly for an N-long array you do need to store SOMEwhere N integers -- the original position of each item, for example; any other way to encode "1 out of N!" possibilities (i.e., what permutation has in fact occurred) will also take at least O(N) space (since, by Stirling's approximation, log(N!) is about N log(N)...).
So, I don't see why you consider it "space consuming" to store those indices most simply and directly. Of course there are other possibilities (taking similar space): for example, you might make a separate auxiliary array of the N indices and sort THAT auxiliary array (based on the value at that index) leaving the original one alone. This means an extra level of indirectness for accessing the data in sorted order, but can save you a lot of data movement if you're sorting an array of large structures, so there's a performance tradeoff... but the space consumption is basically the same!-)

Is the struct such a bad idea? The alternative, to me, would be an array of pointers.

It feels to me that in this question you have to consider the age old question: speed vs size. In either case, you are keeping both a new representation of your data (the sorted array) and an old representation of your data (the way the array use to look), so inherently your solution will have some data replication. If you are sorting n numbers, and you need to remember after they were sorted where those n numbers were, you will have to store n amount of information somewhere, there is no getting around that.
As long as you accept that you are doubling the amount of space you need to be able to keep this old data, then you should consider the specific application and decide what will be faster. One option is to just make a copy of the array before you sort it, however resolving which was where later might turn into a O(N) problem. From that point of view your suggestion of adding another int to your struct doesn't seem like such a bad idea, if it fits with the way you will be using the data later.

This looks like the case where I use an index sort. The following C# example shows how to do it with a lambda expression. I am new at using lambdas, but they can do some complex tasks very easily.
// first, some data to work with
List<double> anylist = new List<double>;
anylist.Add(2.18); // add a value
... // add many more values
// index sort
IEnumerable<int> serial = Enumerable.Range(0, anylist.Count);
int[] index = serial.OrderBy(item => (anylist[item])).ToArray();
// how to use
double FirstValue = anylist[index[0]];
double SecondValue = anylist[index[1]];
And, of course, anylist is still in the origial order.

you can do it the way you proposed
you can also remain a copy of the original unsorted array (means you may use a not inplace sorting algorithm)
you can create an additional array containing only the original indices
All three ways are equally space consuming, there is no "better" way. you may use short instead of int to safe space if you array wont get >65k elements (but be aware of structure padding with your suggestion).

Related

Algorithm for selection the most frequent object during factorization

I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just contained a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects present in most of the sets, and I want to utilize this.
My idea is to represent sets as stacks (i.e. single-directional linked lists), whereas their bottom parts can be shared across different sets. It can also be defined as a tree, whereas each node/leaf has a pointer to its parent, but not children.
Such a data structure will allow to use the most common subsets of objects as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as a recursive pseudo-code.
BuildAllChains()
{
BuildSubChains(allSets, NULL);
}
BuildSubChains(sets, pParent)
{
if (sets is empty)
return;
trgObj = the most frequent object from sets;
pNode = new Node;
pNode->Object = trgObj;
pNode->pParent = pParent;
newSets = empty;
for (each set in sets that contains the trgObj)
{
remove trgObj from set;
remove set from sets;
if (set is empty)
set->pHead = pNode;
else
newSets.Insert(set);
}
BuildSubChains(sets, pParent);
BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
Practically I use a combination of loop + recursion, whereas recursion always invoked on a smaller part.
So, the idea is to select each time the most common object, create a "subset" which inherits its parent subset, and all the sets that include it, as well as all the predecessors selected so far - should be based on this subset.
Now, I'm trying to figure-out an effective way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects, and sort it once. Then, during the recursion, whenever we remove an object and select only sets that contain/don't contain it - deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select each time the most frequent object directly, i.e. O(N*M). But it also looks inferior, in a degenerate case, where an object exists in either almost all or almost none sets we may need to repeat this O(N) times. OTOH for those specific cases in-place adjustment of the sorted histogram may be preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
#Ivan: first thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than the count only. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, corss-linked pointers and etc. I planned this from the beginning, because than it seemed to me that the histogram adjustment after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure-out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven creating a new histogram for the smaller part is too expensive. But restricting it to only existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort is broken - swap the histogram element with its neighbor until the sort is restored.
This seems good if we remove small number of objects, we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However when removing large number of objects it seems better to just remove all the objects and then re-sort the array via quick-sort.
So, do you have a better idea regarding this?
Update2:
I think about the following: The histogram should be a data structure which is a binary search tree (auto-balanced of course), whereas each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criteria is the size of this list.
Each set should contain the list of objects it contains now, whereas the "object" has the direct pointer to the element histogram. In addition each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node also should contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), whereas M is the number of elements that are currently affected, which is smaller than N if only a little number is affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For the clarity I denote with set the initial sets and with family the set of these sets.
Indeed, let's store a histogram. In BuildSubChains function we maintain a histogram of all elements which are presented in the sets at the moment, sorted by frequency. It may be something like std::set of pairs (frequency, value), maybe with cross-references so you could find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let there total sizes be T' and T''. Take the family with the smallest total size and remove all elements from its sets from the histogram, making the new histogram on the run. Now you have a histogram for both families, and it is built in time O(min(T', T'') log n), where log n comes from operations with std::set.
At the first glance it seems that it works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram the size of its family at least halves, so each element will directly participate in no more than log T removals. So there will be O(T log T) operations with histograms in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element) you'll be able to efficiently select all sets that contain the most frequent element.

Programming : find the first unique string in a file in just 1 pass

Given a very long list of Product Names, find the first product name which is unique (occurred exactly once ). You can only iterate once in the file.
I am thinking of taking a hashmap and storing the (keys,count) in a doubly linked list.
basically a linked hashmap
can anyone optimize this or give a better approach
Since you can only iterate the list once, you have to store
each string that occurs exactly once, because it could be the output
their relative position within the list
each string that occurs more than once (or their hash, if you're not afraid)
Notably, you don't have to store the relative positions of strings that occur more than once.
You need
efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings.
efficient lookup by value. This rules out a bare list. A hash-set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set.
efficient lookup of the minimum. This asks for a linked list.
Conclusion:
Use a linked hash-set for the set of strings, and a flag indicating if they're unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.
In total, your nodes could look like: Node:{Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique}; Node firstUnique, Node[] hashMap
If you strive for ease of implementation, you can have two hash-sets instead (one linked).
The following algorithm solves it in O(N+M) time.
where
N=number of strings
M=total number of characters put together in all strings.
The steps are as follows:
`1. Create a hash value for each string`
`2. Xor it and find the one which didn't have a pair`
Xor has this useful property that if you do a xor a=0 and b xor 0=b.
Tips to generate the hash value for a string:
Use a 27 base number system, and give a a value of 1, b a value of 2 and so on till z which gets 26, and so if string is "abc" , we compute hash value as:
H=3*(27 power 0)+2*(27 power 1)+ 1(27 power 2)
=786
You could use modulus operator to make hash values small enough to fit in 32-bit integers.If you do that keep an eye out for collisions, which are basically two strings which are different but get the same hash value due to the modulus operation.
Mostly I guess you won't be needing it.
So compute the hash for each string, and then start from the first hash and keep xor-ing, the result will hold the hash value of the string which din't have a pair.
Caution:This is useful only when strings occur in pairs.Still this is a good idea to start with, that's why I answered it.
Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap style data structure where the strings are ordered by count. So as soon as you are done reading the input, the root of your tree is unique if a unique string exists. Unlike a linked hash map, insertion takes at most O(log n) as opposed to O(n). You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also in your linked hashmap you may have to travel O(n) to find your first unique key. With a TreeMap style data structure, your look up is O(1) -- the root. Even if more unique keys exist, the first one you encountered will be the root. The subsequent ones will be children of the root.

Data structure to represent piecewise continuous range?

Say that I have an integer-indexed array of length 400, and I want to drop out a few elements from the beginning, lots from the end, and something from the middle too, but without actually altering the original array. That is, instead of looping through the array using indices {0...399}, I want to use a piecewise continuous range such as
{3...15} ∪ {18...243} ∪ {250...301} ∪ {305...310}
What is a good data structure to describe this kind of index ranges? An obvious solution is to make another "index mediator" array, containing mappings from continuos zero-based indexing to the new coordinates above, but it feels quite wasteful, since almost all elements in it would be simply sequential numbers, with just a few occasional "jumps". Besides, what if I find that, oh, I want to modify the range a bit? The whole index array would have to be rebuilt. Not nice.
A few points to note:
The ranges never overlap. If a new range is added to the data structure, and it overlaps with existing ranges, the whole thing should get merged. That is, if I add to the above example the range {300... 308}, it should instead replace the last two ranges with {250...310}.
It should be quite cheap to simply loop through the whole range.
It should also be relatively cheap to query a value directly: "Give me the original index corresponding to the 42nd index in the mapped coordinates".
It should be possible (though maybe not quite cheap) to work other way round: "Give me the mapped coordinate corresponding to 42 in the original coordinates, or tell if it's mapped at all."
Before rolling my own solution, I'd like to know if there exists a well-known data structure that solves this class of problems elegantly.
Thanks!
Seems like an array or list of integer pairs would be the best data structure. Your choice as to whether the second integer of the pair is a end point or a count from the first integer.
Edit: On further reflection, this problem is exactly what a database index has to do. If the integer pairs don't have to be in numeric order, you can handle splits easier. If the number sequence has to remain in order, you need a data structure that allows you to add integer pairs to the middle of the array or list.
A split would be having to change the (6, 12) integer pair to (6, 9) (11, 12), when 10 is removed, as an example.
Besides, what if I find that, oh, I want to modify the range a bit? The whole index array would have to be rebuilt. Not nice.
True. Perhaps one integer pair needs to change. Worst case, you'd have to rebuild the entire array or list.

Find a common element within N arrays

If I have N arrays, what is the best(Time complexity. Space is not important) way to find the common elements. You could just find 1 element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index, with elements as keys, counts as values. Loop through all values and update the count in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), combined with looping through all M elements should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
if you know that the elements are (positive) integers with a maximum number that is not too high, you could just use a normal array as "hash" index to keep counts, where the number are just the array index.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
EDIT when I answered the question, the question did not have the part of "a better way" than hashing in it.
The most direct method is to intersect the first 2 arrays and then intersecting this intersection with the remaining N-2 arrays.
If 'intersection' is not defined in the language in which you're working or you require a more specific answer (ie you need the answer to 'how do you do the intersection') then modify your question as such.
Without sorting there isn't an optimized way to do this based on the information given. (ie sorting and positioning all elements relatively to each other then iterating over the length of the arrays checking for defined elements in all the arrays at once)
The question asks is there a better way than hashing. There is no better way (i.e. better time complexity) than doing a hash as time to hash each element is typically constant. Empirical performance is also favorable particularly if the range of values is can be mapped one to one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since this will still need to visit each element at least once, and then there is the log N for sorting each array.
Back to hashing, from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding onto the next array. This will take advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the array (e.g. common elements at the start of all arrays.) Worst case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
I dont think approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then answer should be 1 and 3. But if we use hashtable approach, 1 will have count 3 and we will never find 1, int his situation.
Also problems becomes more complex if we have input something like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here i think we should give output as 1,1. Suggested approach fails in both cases.
Solution :
read first array and put into hashtable. If we find same key again, dont increment counter. Read second array in same manner. Now in the hashtable we have common elelements which has count as 2.
But again this approach will fail in second input set which i gave earlier.
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
Without sorting and scanning the best operational speed you'll get for comparing 2 arrays for common elements is O(N2).

Most efficient way to count occurrences?

I'm looking to calculate entropy and mutual information a huge number of times in performance-critical code. As an intermediate step, I need to count the number of occurrences of each value. For example:
uint[] myArray = [1,1,2,1,4,5,2];
uint[] occurrences = countOccurrences(myArray);
// Occurrences == [3, 2, 1, 1] or some permutation of that.
// 3 occurrences of 1, 2 occurrences of 2, one each of 4 and 5.
Of course the obvious ways to do this are either using an associative array or by sorting the input array using a "standard" sorting algorithm like quick sort. For small integers, like bytes, the code is currently specialized to use a plain old array.
Is there any clever algorithm to do this more efficiently than a hash table or a "standard" sorting algorithm will offer, such as an associative array implementation that heavily favors updates over insertions or a sorting algorithm that shines when your data has a lot of ties?
Note: Non-sparse integers are just one example of a possible data type. I'm looking to implement a reasonably generic solution here, though since integers and structs containing only integers are common cases, I'd be interested in solutions specific to these if they are extremely efficient.
Hashing is generally more scalable, as another answer indicates. However, for many possible distributions (and many real-life cases, where subarrays just happen to be often sorted, depending on how the overall array was put together), timsort is often "preternaturally good" (closer to O(N) than to O(N log N)) -- I hear it's probably going to become the standard/default sorting algorithm in Java at some reasonably close future data (it's been the standard sorting algorithm in Python for years now).
There's no really good way to address such problems except to benchmark on a selection of cases that are representative of the real-life workload you expect to be experiencing (with the obvious risk that you may choose a sample that actually happened to be biased/non-representative -- that's not a small risk if you're trying to build a library that will be used by many external users outside of your control).
Please tell more about your data.
How many items are there?
What is the expected ratio of unique items to total items?
What is the distribution of actual values of your integers? Are they usually small enough to use a simple counting array? Or are they clustered into reasonably narrow groups? Etc.
In any case, I suggest the following idea: a mergesort modified to count duplicates.
That is, you work in terms of not numbers but pairs (number, frequency) (you might use some clever memory-efficient representation for that, for example two arrays instead of an array of pairs etc.).
You start with [(x1,1), (x2,1), ...] and do a mergesort as usual, but when you merge two lists that start with the same value, you put the value into the output list with their sum of occurences. On your example:
[1:1,1:1,2:1,1:1,4:1,5:1,2:1]
Split into [1:1, 1:1, 2:1] and [1:1, 4:1, 5:1, 2:1]
Recursively process them; you get [1:2, 2:1] and [1:1, 2:1, 4:1, 5:1]
Merge them: (first / second / output)
[1:2, 2:1] / [1:1, 2:1, 4:1, 5:1] / [] - we add up 1:2 and 1:1 and get 1:3
[2:1] / [2:1, 4:1, 5:1] / [1:3] - we add up 2:1 and 2:1 and get 2:2
[] / [4:1, 5:1] / [1:3, 2:2]
[1:3, 2:2, 4:1, 5:1]
This might be improved greatly by using some clever tricks to do an initial reduction of the array (obtain an array of value:occurence pairs that is much smaller than the original, but the sum of 'occurence' for each 'value' is equal to the number of occurences of 'value' in the original array). For example, split the array into continuous blocks where values differ by no more than 256 or 65536 and use a small array to count occurences inside each block. Actually this trick can be applied at later merging phases, too.
With an array of integers like in the example, the most effient way would be to have an array of ints and index it based using your values (as you appear to be doing already).
If you can't do that, I can't think of a better alternative than a hashmap. You just need to have a fast hashing algorithm. You can't get better than O(n) performance if you want to use all your data. Is it an option to use only a portion of the data you have?
(Note that sorting and counting is asymptotically slower (O(n*log(n))) than using a hashmap based solution (O(n)).)

Resources