I have a tree structure like this:
1 ABC
1.1 DEF
1.1.2 GHI
1.2 JKL
1.2.1 MNO
2 PQR
2.1
... with no limit on the depth or on the length of each level. What happens is that I remove some of the elements all around the tree structure, and in the end I want to have a proper, restructured hierarchy numbering.
How do you usually re-sort and apply proper hierarchy numbering with the least amount of work in such a case? It's a fairly basic use case, but I'm looking for some room for improvement.
I suppose you are keeping both the number and the text in the same value variable.
The simplest thing you could possibly do is to:
Separate that into two variables: number and text.
Whenever you swap two tree nodes (i.e. during sorting), swap just the text values and keep the number values where they are.
Whenever you add a new element as the last sub-element, just use previousLastElement.number + 1.
Whenever you print out an element's number, append all of its parents' numbers in reverse order, separated by dots.
The only remaining complexity is insertion, where you have to "push" the numbers of the elements after the inserted one (but only on that one level), and removal, where you have to pull them back.
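For illustration, a rough sketch of the print-time part (the Node type is just one possible layout; here the per-level numbers are simply recomputed during the printing pass, whereas in your case you would keep the number field maintained incrementally as described above):

#include <iostream>
#include <string>
#include <vector>

// Hypothetical node type: the text and the per-level number are kept separately.
struct Node {
    std::string text;
    int number = 0;                 // position within its own level only
    std::vector<Node> children;
};

// Renumber one level, print the full hierarchical number by appending the
// parent's prefix, then recurse into the children.
void renumberAndPrint(std::vector<Node>& level, const std::string& prefix) {
    for (std::size_t i = 0; i < level.size(); ++i) {
        level[i].number = static_cast<int>(i) + 1;
        std::string full = prefix.empty()
            ? std::to_string(level[i].number)
            : prefix + "." + std::to_string(level[i].number);
        std::cout << full << " " << level[i].text << "\n";
        renumberAndPrint(level[i].children, full);
    }
}

int main() {
    std::vector<Node> roots(2);
    roots[0].text = "ABC";
    roots[0].children.resize(2);
    roots[0].children[0].text = "DEF";
    roots[0].children[1].text = "JKL";
    roots[0].children[1].children.resize(1);
    roots[0].children[1].children[0].text = "MNO";
    roots[1].text = "PQR";
    renumberAndPrint(roots, "");   // prints 1 ABC, 1.1 DEF, 1.2 JKL, 1.2.1 MNO, 2 PQR
}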
You could use a priority queue, for example one backed by a self-balancing tree.
Related
I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is: each simply contained a table (array) of its objects. But I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent the sets as stacks (i.e. singly linked lists) whose bottom parts can be shared across different sets. It can also be viewed as a tree in which each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows the most common subsets of objects to be used as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm, written as recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}

BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;

    // pick the object shared by the largest number of remaining sets
    trgObj = the most frequent object in sets;

    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;

    newSets = empty;
    for (each set in sets that contains trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;      // this chain now encodes the whole set
        else
            newSets.Insert(set);     // still has objects left to encode
    }

    BuildSubChains(sets, pParent);   // the sets that did not contain trgObj
    BuildSubChains(newSets, pNode);  // the sets that did, now hanging off pNode
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of a loop and recursion, where the recursion is always invoked on the smaller part.
So the idea is to select the most common object each time and create a "subset" node which inherits from its parent subset; all the sets that include this object, as well as all the predecessors selected so far, should then be based on this subset.
Now I'm trying to figure out an efficient way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and keep only the sets that contain (or don't contain) it, derive the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, i.e. in O(N*M). But this also looks inferior: in a degenerate case, where an object exists in either almost all or almost none of the sets, we may need to repeat this O(N) times. On the other hand, for those specific cases an in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first of all, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram, rather than the count only. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, cross-linked pointers and so on. I planned this from the beginning, because back then it seemed to me that adjusting the histogram after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the split is very uneven, creating a new histogram for the smaller part would be too expensive. But restricting it to only the existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was an immediate fix-up of the histogram after every single element removal: for every set we remove, and for every object in that set, remove it from the histogram, and if the sort order is broken, swap the histogram element with its neighbour until the order is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However, when removing a large number of objects it seems better to just remove all the objects and then re-sort the array with quicksort.
So, do you have a better idea regarding this?
Update2:
I'm thinking about the following: the histogram should be a binary search tree (auto-balanced, of course), where each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it currently contains, where each "object" holds a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in two linked lists simultaneously: in the list of a histogram element and in the list of a set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements that are currently affected, which is smaller than N when only a few of them are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
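A rough sketch of the layout I have in mind (illustrative only; my real code uses intrusive containers rather than std::list):

#include <list>

struct HistogramItem;
struct Set;

// "Cross-link": lives in two lists at once, one per histogram item, one per set.
struct CrossLink {
    HistogramItem* pItem;   // the object's entry in the histogram
    Set*           pSet;    // the set this occurrence belongs to
};

struct HistogramItem {
    int                   objectId;
    std::list<CrossLink*> occurrences;  // the sets that still contain this object
    // frequency == occurrences.size(); it is the ordering key of the
    // auto-balanced search tree that holds the histogram items
};

struct Set {
    std::list<CrossLink*> members;      // objects of this set not yet matched
    int                   matchedSoFar = 0;
};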
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For clarity, I denote by set the initial sets and by family the set of these sets.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements that are present in the sets at the moment, sorted by frequency. It may be something like a std::set of pairs (frequency, value), maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. Maintaining it, however, is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element and one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram as you go. Now you have a histogram for both families, and it is built in time O(min(T', T'') log N), where the log N comes from the operations on std::set.
At first glance it seems that this works in quadratic time. However, it is faster. Take a look at any single element: every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element directly participates in no more than log T removals. So there will be O(T log T) histogram operations in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element) you'll be able to efficiently select all sets that contain the most frequent element.
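For illustration, here is a minimal sketch of such a histogram, with a std::set of (frequency, value) pairs and a map as the cross-reference back from value to frequency (just one possible encoding; here the most frequent element sits at the end of the set rather than the front):

#include <set>
#include <unordered_map>
#include <utility>

// Histogram ordered by frequency; the most frequent element is *byFreq.rbegin().
struct Histogram {
    std::set<std::pair<int, int>> byFreq;   // (frequency, value)
    std::unordered_map<int, int>  freqOf;   // value -> current frequency

    void add(int value) {
        int& f = freqOf[value];
        if (f > 0) byFreq.erase({f, value});
        ++f;
        byFreq.insert({f, value});
    }

    // Called once per occurrence removed from the current family of sets.
    // Assumes the value is currently present in the histogram.
    void remove(int value) {
        int& f = freqOf[value];
        byFreq.erase({f, value});
        if (--f > 0) byFreq.insert({f, value});
        else         freqOf.erase(value);    // no zero-frequency entries kept
    }

    bool empty() const { return byFreq.empty(); }
    int  mostFrequent() const { return byFreq.rbegin()->second; }
};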
I just want to know whether you would split a leaf node after the insert or before the insert. Let's say the leaf capacity is 4 elements and we already have 3 elements in there. Would you add the 4th element and immediately split after the insert, so that we now have two nodes holding 2 elements each? Or would you just add the 4th element so that the leaf is full, and only when you add the 5th element (which would cause an overflow) do the split and then add the element, resulting in 2 leaf nodes, one holding 2 and one holding 3 elements?
EDIT: Since I have seen both approaches out there on the web, I would like to know the reasons for choosing solution 1 or 2, or whether one of them is actually incorrect for some reason.
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
This visualization is very useful to understand B+ tree logic.
Apparently you could do either, but the former is more common.
Why would you choose the latter and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531, which made me think that they do it. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that. They might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you either have to store all nodes with the same key on the right side or the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red black tree.
Is the reason why i can't find much on it that it's just a bad idea?
Downsides of storing duplicates as multiple nodes:
It expands the tree size, which makes searches slower.
If you want to retrieve all the values for key K, you need O(M*log(N)) time, where N is the total number of nodes and M is the number of values for key K, unless you introduce extra code (which complicates the data structure) to keep these values in a linked list. (With collection storage the lookup only takes O(log(N)), and it is simple to implement.)
Deletion is more costly: with the multi-node method you need to remove a node on every delete, whereas with collection storage you only need to remove node K when the last value for key K is deleted.
I can't think of any upside to the multi-node method.
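For comparison, a minimal sketch of collection storage using standard containers (std::map is typically a red-black tree underneath; the keys and values here are made up):

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // One tree node per key; duplicates go into the vector attached to that key.
    std::map<int, std::vector<std::string>> byKey;

    byKey[1].push_back("first 1");
    byKey[1].push_back("second 1");
    byKey[2].push_back("only 2");

    // Retrieving all values for key 1: one O(log N) lookup, then O(M) output.
    for (const std::string& v : byKey[1])
        std::cout << v << "\n";
}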
Binary search trees, by definition, cannot contain duplicates. If you use them to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I was working on an implementation of red-black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence value to the node data type. When a duplicate is encountered, just increment occurrence. When walking the tree to produce output, just repeat the value by the number of occurrences. I think I would still have a valid BST, and I would avoid a whole chain of duplicate values, which preserves the optimal search time.
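A minimal sketch of that idea, in C++ rather than PHP, with a plain unbalanced BST standing in for the red-black tree:

#include <iostream>
#include <memory>

// Each node stores the value once, plus how many times it occurred.
struct Node {
    int value;
    int occurrences = 1;
    std::unique_ptr<Node> left, right;
    explicit Node(int v) : value(v) {}
};

void insert(std::unique_ptr<Node>& root, int v) {
    if (!root)                 root = std::make_unique<Node>(v);
    else if (v < root->value)  insert(root->left, v);
    else if (v > root->value)  insert(root->right, v);
    else                       ++root->occurrences;   // duplicate: just count it
}

// In-order walk repeats each value by its occurrence count.
void emitSorted(const std::unique_ptr<Node>& root) {
    if (!root) return;
    emitSorted(root->left);
    for (int i = 0; i < root->occurrences; ++i) std::cout << root->value << " ";
    emitSorted(root->right);
}

int main() {
    std::unique_ptr<Node> root;
    for (int v : {5, 3, 5, 5, 8}) insert(root, v);
    emitSorted(root);   // prints: 3 5 5 5 8
    std::cout << "\n";
}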
Usually ‘expandable’ grids are represented as a list of lists (a list of rows, where each row has a list of cells), and those lists are some kind of linked lists.
Manipulating (removing, inserting) rows in this data structure is easy and inexpensive, it's just a matter of re-linking the previous nodes. But when it comes to columns, removing a column, for instance, becomes a very long operation: I need to loop over all the rows to remove the cell at that index. Clearly this isn't good behaviour, at least for my case.
I'm not talking about databases here; a good example I've found for this is a text file in a text editor. As far as I know, text editors mostly split the text into lines/rows, so it's easy to remove a line. I want removing a column to be as inexpensive and efficient as removing a row.
Finally, what I need is really a multi-dimensional grid, but I think any simple 2D grid approach should also be applicable to the multi-dimensional case. Am I right?
You could have a two dimensional "linked matrix" (I forget the proper terminology):
... Col 3 ... Col 4 ...
| |
... --X-- ... --Y-- ...
| |
... ... ... ... ...
Each cell has four neighbours, as shown. Additionally you need row and column headers that might indicate the row/column position, as well as pointing to the first cell in each row or column. These are most easily represented as special cells without an up neighbour (for column headers).
Inserting a new column between 3 and 4 means iterating down the cells X in column 3 and inserting a new right neighbour Z into each row. Each new cell Z links leftward to X and rightward to Y. You also need to add a new column header and link the new cells vertically. Then the positions of all the columns from the old column 4 onward are renumbered (col 4 becomes col 5).
... Col 3 Col 4 Col 5 ...
| | |
... --X-----Z-----Y-- ...
| | |
... ... ... ... ...
The cost of inserting a column is O(n) for inserting and linking the new cells (n being the number of rows), plus O(m) for renumbering the column headers (m being the number of columns). Deletion is a similar process.
Because each cell is just four links, the same algorithms are used for row insertion/deletion.
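A bare-bones sketch of such a cell and a column header (field names are illustrative):

#include <cstddef>

// Each cell links to its four neighbours; headers are special cells without an
// "up" neighbour (for column headers) or "left" neighbour (for row headers).
struct Cell {
    Cell* up    = nullptr;
    Cell* down  = nullptr;
    Cell* left  = nullptr;
    Cell* right = nullptr;
    int   value = 0;
};

struct ColumnHeader {
    std::size_t   position;            // renumbered after inserts/deletes
    Cell*         firstCell = nullptr;
    ColumnHeader* nextCol   = nullptr;
};

// Deleting a column: walk down its cells and splice each one out of its row.
// (The header list also needs to be updated and renumbered, omitted here.)
void deleteColumn(Cell* top) {
    for (Cell* c = top; c != nullptr; ) {
        if (c->left)  c->left->right = c->right;
        if (c->right) c->right->left = c->left;
        Cell* next = c->down;
        delete c;
        c = next;
    }
}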
Keep your existing data structure as is. In addition, give each column a unique id when it is created. When you delete a column, just add its id to a hash table of deleted column ids. Every time you walk a row, check each element's column id (which needs to be stored along with the element's other data) against the hash table, and splice the element out of the row if its column has been deleted.
The hash table and ids are unnecessary if you have a per-column data structure that each grid element can point to. Then you just need a deleted bit in that data structure.
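A minimal sketch of the id/tombstone idea (the cell layout and names are hypothetical):

#include <list>
#include <unordered_set>

struct Cell {
    long columnId;   // assigned when the column is created, never reused
    int  value;
};

using Row = std::list<Cell>;

std::unordered_set<long> deletedColumns;

// Deleting a column is O(1): just remember its id.
void deleteColumn(long columnId) { deletedColumns.insert(columnId); }

// While walking a row, lazily splice out cells whose column has been deleted.
void compactRow(Row& row) {
    for (auto it = row.begin(); it != row.end(); ) {
        if (deletedColumns.count(it->columnId))
            it = row.erase(it);
        else
            ++it;
    }
}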
By the way, Edmund's scheme would be fine for you as well. Even though it takes O(n) time to delete a row or column of length n, you can presumably amortize that cost against the cost of creating those n elements, making the delete O(1) amortized time.
I know that "Linked-Lists" are usually appreciated from a theoretical point of view, but in practice they are generally inefficient.
I would suggest moving toward random-access containers to get some speed. The simplest would be an array, but a double-ended queue or an indexed skip list / B*-tree could be better, depending on the data size we are talking about.
Conceptually it doesn't change much (yet), but you get the ability to move to a given index in O(1) (array, deque) or O(log N) (skip list / B*-tree) operations, rather than O(N) with a simple linked list.
And then it's time for magic.
Keith has already explained the basic idea: rather than actually deleting the column, you just mark it as deleted and then 'jump' over it when you walk your structure. However, a hash table requires a linear walk to get to the Nth live column. Using a Fenwick tree gives an efficient way to compute the real index, and you can then jump directly there.
Note that a key benefit of marking a column (or row) as deleted is the obvious possibility of an undo operation.
Also note that you might want to build a compacting function, to eliminate the deleted columns from time to time, and not let them accumulate.
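For illustration, a rough sketch of the Fenwick-tree bookkeeping (1-based indices, with 1 meaning the column is still present and 0 meaning deleted; the O(log^2 n) search is kept deliberately simple):

#include <vector>

// Fenwick tree over "column alive" flags: prefix sums tell, for each physical
// column, how many live columns are at or before it, so a logical index
// (ignoring deleted columns) can be mapped back to a physical one.
struct Fenwick {
    std::vector<int> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}

    void add(int i, int delta) {              // i is 1-based
        for (; i < (int)t.size(); i += i & -i) t[i] += delta;
    }
    int prefix(int i) const {                 // sum of positions [1..i]
        int s = 0;
        for (; i > 0; i -= i & -i) s += t[i];
        return s;
    }
    // Smallest physical index whose prefix sum equals k, i.e. the k-th live
    // column (assumes at least k columns are alive).
    int kthAlive(int k) const {
        int lo = 1, hi = (int)t.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (prefix(mid) >= k) hi = mid; else lo = mid + 1;
        }
        return lo;
    }
};

// Usage: call add(i, 1) for every column i at creation; deleting column i is
// add(i, -1); "give me logical column k" becomes kthAlive(k).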
Intro
Consider you have a list of key/value pairs:
(0,a) (1,b) (2,c)
You have a function that inserts a new value between two existing pairs, and you need to give it a key that keeps the order:
(0,a) (0.5,z) (1,b) (2,c)
Here the new key was chosen as the average of the keys of the bounding pairs.
The problem is that your list may see millions of inserts. If these inserts all land close to each other, you may end up with keys like 2^(-1000000), which are not easily storable in any standard or special number class.
The problem
How can you design a system for generating keys that:
Gives the correct result (larger/smaller than) when compared to all the rest of the keys.
Takes up only O(log n) memory (where n is the number of items in the list).
My attempts
First I tried different number classes, like fractions and even polynomials, but I could always find examples where the key size would grow linearly with the number of inserts.
Then I thought about saving pointers to a number of other keys and recording the lower/greater-than relationship, but that would always require at least O(sqrt(n)) memory and time for comparison.
Extra info: Ideally the algorithm shouldn't break when pairs are deleted from the list.
I agree with snowlord. A tree would be ideal in this case. A red-black tree would prevent things from getting unbalanced. If you really need keys, though, I'm pretty sure you can't do better than using the average of the keys on either side of the value you need to insert. That will increase your key length by 1 bit each time. What I recommend is renormalizing the keys periodically. Every x inserts, or whenever you detect keys being generated too close together, renumber everything from 1 to n.
Edit:
You don't need to compare keys if you're inserting by position instead of by key. The compare function for the red-black tree would just use the order in the conceptual list, which lines up with the in-order traversal of the tree. If you're inserting at position 4 in the list, insert a node at position 4 in the tree (using in-order position). If you're inserting after a certain node (such as "a"), it's the same. You might have to use your own implementation if whatever language/library you're using requires a key.
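A small sketch of the midpoint-plus-periodic-renumbering idea (the integer keys, the gap size and the retry protocol are all arbitrary choices for the sketch; renumbering uses a fixed gap rather than 1..n so there is immediately room for further inserts):

#include <cstdint>
#include <map>
#include <string>

// Keys are 64-bit integers spaced out with a large gap; a new key between two
// neighbours is their midpoint. When no room is left, renumber everything.
constexpr std::int64_t GAP = 1LL << 20;     // arbitrary spacing

// conceptual list kept as key -> value, ordered by key
using List = std::map<std::int64_t, std::string>;

void renumber(List& list) {                  // O(n), amortized over many inserts
    List fresh;
    std::int64_t k = GAP;
    for (auto& kv : list) { fresh.emplace(k, std::move(kv.second)); k += GAP; }
    list.swap(fresh);
}

// Insert `value` between the entries whose keys are `before` and `after`.
// Returns false if a renumbering happened; the caller should then re-read the
// neighbour keys and retry the insert.
bool insertBetween(List& list, std::int64_t before, std::int64_t after,
                   const std::string& value) {
    if (after - before < 2) {                // keys too close together
        renumber(list);
        return false;
    }
    list.emplace(before + (after - before) / 2, value);
    return true;
}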
I don't think you can avoid getting size O(n) keys without reassigning the key during operation.
As a practical solution I would build an inverted search tree, with pointers from the children to the parents, where each pointer is marked to indicate whether it comes from a left or a right child. To compare two elements you find their closest common ancestor, where the paths to the elements diverge.
Reassigning keys then corresponds to rebalancing the tree, which you can do with rotations that don't change the order.
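A sketch of that comparison, assuming each node stores only a parent pointer and a left/right flag (types and names are illustrative):

#include <algorithm>
#include <vector>

struct Node {
    Node* parent = nullptr;
    bool  isRightChild = false;   // meaningless for the root
};

// Sequence of left(false)/right(true) steps from the root down to n.
static std::vector<bool> pathFromRoot(const Node* n) {
    std::vector<bool> path;
    for (; n->parent != nullptr; n = n->parent) path.push_back(n->isRightChild);
    std::reverse(path.begin(), path.end());
    return path;
}

// Returns true if a precedes b in the in-order sequence of the tree.
bool lessThan(const Node* a, const Node* b) {
    std::vector<bool> pa = pathFromRoot(a), pb = pathFromRoot(b);
    std::size_t i = 0;
    while (i < pa.size() && i < pb.size() && pa[i] == pb[i]) ++i;  // common prefix
    if (i < pa.size() && i < pb.size()) return !pa[i] && pb[i];    // paths diverge below the common ancestor
    if (i < pa.size()) return !pa[i];   // b is an ancestor of a: a precedes b iff a is in b's left subtree
    if (i < pb.size()) return pb[i];    // a is an ancestor of b: a precedes b iff b is in a's right subtree
    return false;                       // same node
}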