Is there such a data structure: "linked list with samples"?

Is there a data structure like this:
There is a slow list-like data structure, such as a linked list or data saved on disk.
There is a relatively small array of pointers to some of the elements in the "slow list", ideally evenly distributed.
When you search, you first check the array to narrow the range and then perform the normal search (linked list traversal, or binary search in the case of disk data).
This looks very similar to jump search, sample search and skip lists, but I think it is a different algorithm.
Please note that I am giving the linked list and the file on disk as examples because they are slow structures.

I don't know if there's a name for this algorithm (I don't think it deserves one, though if there isn't, it could bear mine:), but I did implement something like that 10 years ago for an interview.
You can have an array of pointers to the elements of a list. An array of fixed size, say, of 256 pointers. When you construct the list or traverse it for the first time, you store pointers to its elements in the array. So, for a list of 256 or fewer elements you'd have a pointer to each element.
As the list grows beyond 256 elements, you drop every odd-numbered pointer by moving the 128 even-numbered pointers to the beginning of the array. When the array of pointers fills up again, you repeat the procedure. At every such point you double the step between the list elements whose addresses end up in the array of pointers. Initially you'd place every element's address there, then every other one's, then one out of four, and so on.
You end up with an array of pointers to list elements spaced roughly (list length / 256) elements apart.
If the list is singly-linked, locating the i-th element from the beginning or the end of it is reduced to searching in 1/256th of the list.
If the list is sorted, you can perform binary search on the array to locate the bin (the 1/256th portion of the list) in which to look further.
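The scheme described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the names `SampledList`, `append`, `get` and `contains_sorted` are hypothetical, the list is append-only, and `contains_sorted` assumes the values were appended in ascending order.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class SampledList:
    """Singly linked list plus a fixed-size array of evenly spaced node pointers."""

    def __init__(self, capacity=256):
        self.capacity = capacity   # size of the pointer array (256 in the text)
        self.head = self.tail = None
        self.length = 0
        self.step = 1              # distance between sampled nodes
        self.samples = []          # samples[k] points at node number k * step

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        if self.length % self.step == 0:
            if len(self.samples) == self.capacity:
                # Array is full: keep the even-numbered pointers, double the step.
                self.samples = self.samples[::2]
                self.step *= 2
            if self.length % self.step == 0:
                self.samples.append(node)
        self.length += 1

    def get(self, i):
        """Return the i-th value, walking at most step - 1 links."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        node = self.samples[i // self.step]
        for _ in range(i % self.step):
            node = node.next
        return node.value

    def contains_sorted(self, target):
        """Membership test; assumes values were appended in ascending order."""
        lo, hi = 0, len(self.samples)
        while lo < hi:             # find the first sample with value > target
            mid = (lo + hi) // 2
            if self.samples[mid].value <= target:
                lo = mid + 1
            else:
                hi = mid
        if lo == 0:
            return False           # target is smaller than the first element
        node = self.samples[lo - 1]
        while node is not None and node.value < target:
            node = node.next       # linear walk stays inside one bin
        return node is not None and node.value == target
```

With the default capacity of 256, the linear part of a lookup never walks more than one bin of the list.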

Related

Algorithm for selecting the most frequent object during factorization

I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is: each just contained a table (array) of its objects. I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent sets as stacks (i.e. singly linked lists) whose bottom parts can be shared across different sets. It can also be viewed as a tree in which each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows the most common subsets of objects to be used as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}
BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;
    trgObj = the most frequent object from sets;
    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;
    newSets = empty;
    for (each set in sets that contains the trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;
        else
            newSets.Insert(set);
    }
    BuildSubChains(sets, pParent);
    BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of loop + recursion, where the recursion is always invoked on the smaller part.
So, the idea is to select the most common object each time and create a "subset" which inherits its parent subset; all the sets that include this object, as well as all the predecessors selected so far, should be based on this subset.
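For illustration, the pseudo-code above might be translated into runnable form like this. It is a naive sketch, not the optimized version being asked about: it recounts frequencies with a `Counter` at every step, it uses a loop for the sets that do not contain the chosen object and recursion for those that do (not necessarily the smaller part), and the names `ChainNode`, `build_sub_chains`, `decode` and the `heads` dictionary are my own.

```python
from collections import Counter


class ChainNode:
    def __init__(self, obj, parent):
        self.obj = obj
        self.parent = parent   # link toward the shared root; no child links


def build_sub_chains(sets, parent, heads):
    """sets: dict name -> non-empty set of remaining objects.
    heads: dict name -> ChainNode, filled in as sets become fully encoded."""
    while sets:
        # Naive selection: recount every remaining object each time.
        counts = Counter(obj for s in sets.values() for obj in s)
        target = counts.most_common(1)[0][0]
        node = ChainNode(target, parent)
        with_target, without_target = {}, {}
        for name, s in sets.items():
            if target in s:
                rest = s - {target}
                if rest:
                    with_target[name] = rest
                else:
                    heads[name] = node     # this set is fully encoded
            else:
                without_target[name] = s
        build_sub_chains(with_target, node, heads)
        sets = without_target              # the loop replaces the second recursive call


def decode(head):
    """Recover a set's objects by walking its chain up to the root."""
    objs = set()
    while head is not None:
        objs.add(head.obj)
        head = head.parent
    return objs
```

Usage would be something like `heads = {}; build_sub_chains(all_sets, None, heads)`, after which `decode(heads[name])` gives back the original set.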
Now, I'm trying to figure out an efficient way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and select only sets that contain/don't contain it, deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, i.e. in O(N*M). But this also looks inferior: in a degenerate case, where each object exists in either almost all or almost none of the sets, we may need to repeat this O(N) times. OTOH, for those specific cases in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than the count only. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because then it seemed to me that the histogram adjustment after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven, creating a new histogram for the smaller part is too expensive. But restricting it to only existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was an immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort order is broken, swap the histogram element with its neighbor until the order is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However, when removing a large number of objects it seems better to just remove all the objects and then re-sort the array via quick-sort.
So, do you have a better idea regarding this?
Update2:
I am thinking about the following: the histogram should be a binary search tree (self-balancing, of course), where each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it contains now, where each "object" has a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements that are currently affected, which is smaller than N if only a few are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For clarity, I use "set" for the initial sets and "family" for a set of these sets.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements which are present in the sets at the moment, sorted by frequency. It may be something like a std::set of pairs (frequency, value), maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram on the fly. Now you have a histogram for both families, and it is built in time O(min(T', T'') log N), where log N comes from the operations on std::set.
At first glance it seems that this works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element will directly participate in no more than log T removals. So there will be O(T log T) operations on histograms in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element) you'll be able to efficiently select all sets that contain the most frequent element.
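The "move the smaller subfamily out of the histogram" step could look roughly like the sketch below. It is simplified on purpose: the histogram is a plain dictionary from element to the ids of the sets containing it, so it has no frequency ordering and no log N factor; a real version would keep an ordered structure alongside. The names `build_hist`, `extract`, `hist` and `members` are hypothetical.

```python
def build_hist(members):
    """members: dict set id -> set of elements. Returns element -> set of ids."""
    hist = {}
    for sid, elems in members.items():
        for elem in elems:
            hist.setdefault(elem, set()).add(sid)
    return hist


def extract(hist, members, ids):
    """Move the sets listed in `ids` out of `hist` into a new histogram.

    Only the elements of the moved sets are touched, so the work is
    proportional to their total size (the min(T', T'') term when the
    caller passes the smaller subfamily).
    """
    small = {}
    for sid in ids:
        for elem in members[sid]:
            owners = hist[elem]
            owners.discard(sid)
            if not owners:
                del hist[elem]     # never keep zero-frequency entries
            small.setdefault(elem, set()).add(sid)
    return small
```

After the call, `hist` describes the larger subfamily and the returned dictionary describes the smaller one, and neither contains elements with zero frequency.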

Time Complexity of searching

There is a sorted array of very large size. Every element is repeated more than once except one element. How much time will it take to find that element?
Options are:
1. O(1)
2. O(n)
3. O(log n)
4. O(n log n)
The answer to the question is O(n) and here's why.
Let's first summarize the knowledge we're given:
A large array containing elements
The array is sorted
Every item except for one occurs more than once
Question is what is the time growth of searching for that one item that only occurs once?
The sorted property of the array, can we use this to speed up the search for the item? Yes, and no.
First of all, since the array isn't sorted by the property we must use to look for the item (only one occurrence), we cannot use the sorted property in this regard. This means that optimized search algorithms, such as binary search, are out.
However, we know that if the array is sorted, then all items that have the same value will be grouped together. This means that when we look at an item we see for the first time we only have to compare it to the following item. If it's different, we've found the item we're looking for.
"see for the first time" is important, otherwise we would pick the first value since there will be a boundary between two groups of items where the two items are different.
So we have to move from one end of the array to the other, and compare each item to the following item, and this is an O(n) operation.
Basically, since the array isn't sorted by the property we're looking at, we're back to a linear search.
Must be O(n).
The fact that it's sorted doesn't help. Suppose you tried a binary method, jumping into the middle somewhere. You see that the value there has a neighbour that is the same. Now which half do you go to?
How would you write a program to find the value? You'd start at one end and check for an element whose neighbour is not the same. You'd have to walk the whole array until you found the value. So O(n).
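A minimal sketch of that linear walk, assuming the input is a sorted Python list (the function name `find_single` is my own):

```python
def find_single(arr):
    """Return the value that occurs exactly once in a sorted list, or None."""
    n = len(arr)
    for i, value in enumerate(arr):
        # In a sorted array equal values are adjacent, so a value is unique
        # exactly when it differs from both of its neighbours.
        differs_left = i == 0 or arr[i - 1] != value
        differs_right = i == n - 1 or arr[i + 1] != value
        if differs_left and differs_right:
            return value
    return None
```

In the worst case the unique element is at the far end, so every position is visited once.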

Understanding these questions about binary search on linear data structures?

The answers are (1) and (5) but I am not sure why. Could someone please explain this to me, and why the other answers are incorrect? How can I understand how things like binary/linear search will behave on different data structures?
Thank you
I am hoping you already know about binary search.
(1) True
Explanation
To perform binary search, we have to get to the middle of the sorted list. In a linked list, getting to the middle means traversing half of the list starting from the head, while in an array we can jump directly to the middle index if we know the length. So a linked list takes O(n/2), i.e. O(n), time for a step that an array does in O(1). Therefore a linked list is not an efficient way to implement binary search.
(2) False
Same explanation as above
(3) False
Explanation
As explained in point 1, a linked list cannot be used efficiently to perform binary search, but an array can.
(4) False
Explanation
The worst-case time of binary search is O(log n), because we don't need to traverse the whole list. In the first iteration, if the key is less than the middle value, we discard the second half of the list. We then repeat on the remaining part. Since every iteration discards the part of the list that we don't have to traverse, it clearly takes less than O(n).
(5) True
Explanation
If the element is found in O(1) time, only one iteration of the loop was run. In the first iteration we always compare with the middle element of the list, so the search takes O(1) time only if the middle element is the key.
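To make points (4) and (5) concrete, here is a small sketch of binary search on a sorted Python list that also counts how many middle elements it inspects. The `probes` counter and the function name are additions of mine for illustration only.

```python
def binary_search(items, key):
    """Return (index, probes); index is -1 when key is absent."""
    lo, hi = 0, len(items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2       # O(1) on an array, O(n) on a linked list
        probes += 1
        if items[mid] == key:
            return mid, probes
        if items[mid] < key:
            lo = mid + 1           # discard the left half
        else:
            hi = mid - 1           # discard the right half
    return -1, probes
```

When the key happens to sit at the first midpoint the function returns after a single probe; otherwise the number of probes grows with the logarithm of the list length.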
In short, binary search is an elimination based searching technique that can be applied when the elements are sorted. The idea is to eliminate half the keys from consideration by keeping the keys in sorted order. If the search key is not equal to the middle element, one of the two sets of keys to the left and to the right of the middle element can be eliminated from further consideration.
Now coming to your specific question,
True
Basic binary search requires that the mid-point can be found in O(1) time, which is not possible in a linked list and can be even more expensive if the size of the list is unknown.
True.
False
False
In binary search, the mid-point calculation should be done in O(1) time, which is only possible in arrays, as the indices of an array are known. Secondly, binary search can only be applied to arrays which are in sorted order.
False
The answer by Vaibhav Khandelwal explained it nicely. But I wanted to add a variation of the array to which binary search can still be applied. If the given array is sorted but rotated by X positions and contains duplicates, for example,
3 5 6 7 1 2 3 3 3
then binary search still applies to it, but in the worst case we may need to go linearly through this list to find the required element, which is O(n).
True
If the element is found on the first attempt, i.e. it is situated at the mid-point, then the search completes in O(1) time.
MidPointOfArray = (LeftSideOfArray + RightSideOfArray)/ 2
The best way to understand binary search is to think of exam papers which are sorted according to last names. In order to find a particular student's paper, the teacher searches in the section for that student's name and rules out the papers that are not alphabetically close to it.
For example, if the name is Alex Bob, the teacher starts her search directly from "B", takes out all the papers with surnames starting with "B", then repeats the process, skipping papers up to the letter "o", and so on until she finds the paper or determines it is not there.
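For the rotated array with duplicates mentioned above, one common way to adapt binary search is sketched below. This is a standard textbook variant, not necessarily what the answerer had in mind; the function name is hypothetical. When the two ends and the middle are all equal, it cannot tell which half is sorted and shrinks the range by one element on each side, which is where the O(n) worst case comes from.

```python
def search_rotated(nums, target):
    """Return True if target is in a rotated sorted list that may hold duplicates."""
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == target:
            return True
        if nums[lo] == nums[mid] == nums[hi]:
            # Cannot decide which half is sorted: drop one element per side.
            lo += 1
            hi -= 1
        elif nums[lo] <= nums[mid]:              # left part is sorted
            if nums[lo] <= target < nums[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                    # right part is sorted
            if nums[mid] < target <= nums[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return False
```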

Break the linked list into smaller linked lists

I need to break a singly linked list into smaller linked lists after every 2 nodes. The approach I thought of was:
Create an array containing n/2 head pointers.
Hop along the links of the list and store the address in the array after every 2 nodes are encountered.
Can there be a better approach for this?
Thanks.
That seems like a good approach.
You also need to remember to set the next member of the 2nd, 4th, etc... elements to null to break the long list into smaller pieces. Remember to store the old value before you overwrite it as you will need to use it while you iterate.
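A short sketch of this, assuming a minimal `Node` class with `value` and `next` fields (both the class and the function name `split_in_pairs` are my own):

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def split_in_pairs(head):
    """Cut a singly linked list after every 2nd node; return the list of heads."""
    heads = []
    node = head
    while node is not None:
        heads.append(node)
        # The piece ends at the 2nd node, or at the 1st if the tail is odd.
        last = node.next if node.next is not None else node
        following = last.next      # save the old link before overwriting it
        last.next = None           # break the long list here
        node = following
    return heads
```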

Find a common element within N arrays

If I have N arrays, what is the best way (in terms of time complexity; space is not important) to find the common elements? It is enough to find just 1 element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index, with elements as keys, counts as values. Loop through all values and update the count in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), combined with looping through all M elements should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
If you know that the elements are (positive) integers with a maximum that is not too high, you could just use a normal array as the "hash" index to keep counts, where the numbers themselves are the array indices.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
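As a sketch of the counting idea, including the "only update if the current element count == i-1" adaptation for repeated values (written here with 0-based array indices, so the check becomes `== i`), something like the following should work. The function name is hypothetical.

```python
def common_elements(arrays):
    """Return the set of values present in every array (duplicates allowed)."""
    counts = {}
    for i, array in enumerate(arrays):
        for value in array:
            # Count a value only if it was seen in all i earlier arrays and
            # not yet in this one, so duplicates never inflate the count.
            if counts.get(value, 0) == i:
                counts[value] = i + 1
    n = len(arrays)
    return {value for value, count in counts.items() if count == n}
```

Each element is looked up once, so the running time is proportional to the total number of elements, assuming constant-time dictionary operations.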
EDIT: when I answered the question, it did not yet include the part about "a better way" than hashing.
The most direct method is to intersect the first 2 arrays and then intersect this intersection with the remaining N-2 arrays.
If 'intersection' is not defined in the language in which you're working or you require a more specific answer (ie you need the answer to 'how do you do the intersection') then modify your question as such.
Without sorting there isn't an optimized way to do this based on the information given. (ie sorting and positioning all elements relatively to each other then iterating over the length of the arrays checking for defined elements in all the arrays at once)
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity) than using a hash, as the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one to one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N factor for sorting each array.
Back to hashing, from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding onto the next array. This will take advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the array (e.g. common elements at the start of all arrays.) Worst case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have count 3 and we will never find 1 in this situation.
The problem also becomes more complex if we have input like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give 1,1 as the output. The suggested approach fails in both cases.
Solution:
Read the first array and put it into the hashtable. If we find the same key again, don't increment the counter. Read the second array in the same manner. Now the hashtable holds the common elements, which have a count of 2.
But again, this approach will fail on the second input set which I gave earlier.
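If the desired output really is 1,1 for the second input (i.e. common elements counted with multiplicity), one possible sketch uses `collections.Counter`, whose `&` operator keeps the minimum count of each key. The function name is my own, and whether multiplicity should be preserved is an assumption about what the question wants.

```python
from collections import Counter
from functools import reduce


def common_with_multiplicity(arrays):
    """Return the common elements as a sorted list, repeated as often as
    they occur in every array."""
    if not arrays:
        return []
    # Counter & Counter keeps each key with the minimum of the two counts.
    common = reduce(lambda a, b: a & b, map(Counter, arrays))
    return sorted(common.elements())
```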
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
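The sequential "carry" variant could be sketched as follows. Note the assumption: this version uses Python sets (i.e. hashing) for the pairwise step, and the function name is hypothetical. It stops early once the carried intersection is empty.

```python
def common_by_intersection(arrays):
    """Intersect the arrays one by one, carrying the common values forward."""
    if not arrays:
        return set()
    carry = set(arrays[0])
    for array in arrays[1:]:
        carry.intersection_update(array)
        if not carry:
            break                  # nothing in common, no need to go on
    return carry
```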
Without sorting or hashing, the best you'll get when comparing 2 arrays of size N for common elements pairwise is O(N^2).
