Suitable tree data structure - algorithm

I have been reading about tree data structures to model a problem. I need to build an in-memory representation of data that is very similar to the folder/file layout of a file system (I don't mean the actual files stored on disk, but the Explorer-like structure). The tree may be at most 10 levels deep. The intermediate nodes may have only a moderate number of children (say 10), but there could be thousands of leaf nodes (that is, thousands of files in a folder, where a file is a leaf node).
Some thoughts:
A binary tree cannot work, as one node can have at most 2 children (say we can have 3 subfolders).
A very generic tree implementation may be inefficient, as my data can be ordered: a left sibling is smaller than the siblings to its right. I hope this allows efficient traversal.
A B-tree sounds very close, but does it insist on balancing requirements? In my case the depth won't be more than 10, but not all branches are necessarily that deep (say c:/windows , C:/MyDoc../A/B/C).
Please help with your experience. Should I build a custom tree, or is there a suitable data structure available (I don't mean one specific to a programming language)?

You have two different kinds of nodes: files and folders.
A folder node contains a set (or map) of children, where the children may themselves be files or folders.
Alternatively, you might prefer for a folder node to contain a set of files and a set of folders.
For the sets, just use your favorite representation of ordered sets (probably the one that comes with whatever language you are using). Depending on the exact details of your situation, you might prefer to use a map instead.
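As a rough illustration of the folder/file split described above, here is a minimal Python sketch. The class names File and Folder and the helper find are made up for this example, and a plain dict with on-demand sorting stands in for an ordered map; treat it as one possible shape, not a definitive design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class File:
    """Leaf node (hypothetical name)."""
    name: str


@dataclass
class Folder:
    """Inner node: maps each child name to a File or a Folder."""
    name: str
    children: Dict[str, Union[File, "Folder"]] = field(default_factory=dict)

    def add(self, node):
        # Insert or replace the child stored under node.name.
        self.children[node.name] = node
        return node

    def sorted_names(self) -> List[str]:
        # A dict is not ordered by key, so sort on demand for ordered traversal.
        return sorted(self.children)


def find(root: Folder, path: List[str]) -> Optional[Union[File, Folder]]:
    """Follow the path components down from root; return None if any is missing."""
    node = root
    for part in path:
        if not isinstance(node, Folder) or part not in node.children:
            return None
        node = node.children[part]
    return node
```

With at most about 10 levels, a lookup touches at most about 10 dictionaries, no matter how many thousands of files a folder holds.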

Use two separate data structures:
A binary search tree for search
And a general binary tree for representation
and link these two together.
Note:
In the general tree, put the folders first, in order, and put all the files in a BST attached as one last node.
Or use:
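struct BST_Node;  // node type of the per-folder search tree of files

struct Node {
    Node* Left_Most_Child_Folder;  // first subfolder of this folder
    Node* Right_Sibling_Folder;    // next folder at the same level
    BST_Node* Files_Root;          // root of the BST holding this folder's files
};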
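The same layout can be sketched in Python. The names FolderNode, add_folder, add_file and has_file are invented for this example, and a sorted list searched with the standard bisect module stands in for the BST of files, so this illustrates the shape of the structure rather than the exact one above. Duplicate names are not handled.

```python
import bisect


class FolderNode:
    """Left-child/right-sibling folder node (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.first_child = None    # plays the role of Left_Most_Child_Folder
        self.next_sibling = None   # plays the role of Right_Sibling_Folder
        self.files = []            # sorted list standing in for the BST of files

    def add_folder(self, name):
        """Insert a subfolder, keeping the sibling chain ordered by name."""
        child = FolderNode(name)
        if self.first_child is None or name < self.first_child.name:
            child.next_sibling = self.first_child
            self.first_child = child
            return child
        prev = self.first_child
        while prev.next_sibling is not None and prev.next_sibling.name < name:
            prev = prev.next_sibling
        child.next_sibling = prev.next_sibling
        prev.next_sibling = child
        return child

    def subfolders(self):
        """Yield the subfolders in name order by walking the sibling chain."""
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling

    def add_file(self, name):
        bisect.insort(self.files, name)  # keeps the list sorted

    def has_file(self, name):
        i = bisect.bisect_left(self.files, name)  # binary search
        return i < len(self.files) and self.files[i] == name
```

Walking the sibling chain is linear in the number of subfolders, which is acceptable when a folder has only around 10 of them; the thousands of files are found by binary search.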

In a typical file system, the "directory-tree" and the search tree are not the same thing, and are usually maintained separately. The "directory-tree", which tells you what files/sub-folders a folder has, or the path to a particular file, simply reflects how the user organizes the files and is only useful to the user. The search tree on the other hand maintains the global index of all files, so as to facilitate a fast search.
For example, you can implement a Linux-like file system, where a folder is a file that records pointers to the other files/folders it contains. At the same time you maintain a B+ tree, which has every file pointer as a leaf. The balance condition of the B+ tree has nothing to do with how the user organizes the folders.
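A toy Python sketch of keeping the two structures side by side might look like this. The class FileSystem and its methods are hypothetical, nested dicts play the directory tree, and a sorted list searched with bisect is only a stand-in for the B+ tree index. It also assumes that a file and a folder never share a path.

```python
import bisect


class FileSystem:
    """Directory tree plus a separate global index (illustrative only)."""

    def __init__(self):
        self.tree = {}    # directory tree: nested dicts, None marks a file
        self.index = []   # global index: sorted list of (file name, full path)

    def add_file(self, path):
        parts = path.strip("/").split("/")
        node = self.tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})   # create missing folders
        node[parts[-1]] = None
        bisect.insort(self.index, (parts[-1], path))

    def list_dir(self, path):
        """What the user sees: the names directly inside one folder."""
        node = self.tree
        for part in (p for p in path.strip("/").split("/") if p):
            node = node[part]
        return sorted(node)

    def search(self, name):
        """Global lookup by file name; never walks the directory tree."""
        i = bisect.bisect_left(self.index, (name, ""))
        found = []
        while i < len(self.index) and self.index[i][0] == name:
            found.append(self.index[i][1])
            i += 1
        return found
```

The index is organized purely by file name, so its shape is independent of how deep or lopsided the folder hierarchy is.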

One way to do this would be to use a binary tree of binary trees. For example:
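struct Node {
    Node* Children;  // root of the binary tree that holds this node's children
    Node* Left;      // left subtree among the siblings at this level
    Node* Right;     // right subtree among the siblings at this level
};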
And the root of your tree is a Node*.
This makes for easy traversal and quick insertion and removal of a node. Provided, of course, you know the path to the level where you want to insert the node, or the path to the node that you want to delete. But since you indicate that you want a model similar to Explorer, I assume that finding a particular level doesn't pose a problem.
Searching for a node at a particular level is as simple as searching a binary tree.
Without a little bit more information about what you're trying to model, that's the best I can do.
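To make the "binary tree of binary trees" idea concrete, here is a small Python sketch. It assumes each node carries a name used as the search key, which the struct above does not show, and the helpers bst_find, bst_insert, insert_path and find_path are names invented for this example. The sibling trees are plain unbalanced binary search trees.

```python
class Node:
    def __init__(self, name):
        self.name = name       # assumed search key
        self.children = None   # root of the BST holding this node's children
        self.left = None       # siblings with smaller names
        self.right = None      # siblings with larger names


def bst_find(root, name):
    """Ordinary binary search among the siblings at one level."""
    while root is not None and root.name != name:
        root = root.left if name < root.name else root.right
    return root


def bst_insert(root, name):
    """Insert name into one level; return (root of that level, node for name)."""
    if root is None:
        node = Node(name)
        return node, node
    cur = root
    while True:
        if name == cur.name:
            return root, cur            # already present
        side = "left" if name < cur.name else "right"
        nxt = getattr(cur, side)
        if nxt is None:
            node = Node(name)
            setattr(cur, side, node)
            return root, node
        cur = nxt


def insert_path(top, path):
    """Create every missing node along path below top; return the last node."""
    node = top
    for name in path:
        level_root, child = bst_insert(node.children, name)
        node.children = level_root
        node = child
    return node


def find_path(top, path):
    """Descend one level per path component; return None if a name is missing."""
    node = top
    for name in path:
        node = bst_find(node.children, name)
        if node is None:
            return None
    return node
```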

Related

Can I use a trie that has a whole word on each node?

I want to implement a trie to check the validity of paths, so I would build a tree that contains all the possible paths, broken down by directory. Something like /guest/friendsList/search would go from the root node to its child guest, then to guest's child friendsList, and then to friendsList's child search. If search is a leaf node, then my string /guest/friendsList/search would be considered valid.
Is this something a trie would be useful for? All the implementations of tries I've seen deal with individual letters at each node, but can they be whole strings instead? Is a trie specific to that kind of implementation, and is what I'm trying to do just a basic tree?
Thanks!
You can absolutely do this, though I'd typically call this a directory tree rather than a trie since you're essentially modeling the file system as a tree structure rather than storing lots of prefixes of different strings. In fact, the OS probably has a similar data structure on disk for representing the file system!
The directory tree can definitely do that. Creating a root node at first and then adding remaining directory entries to the root node as children, and so on. So checking if a path name is valid is just parsing the whole string and going through the directory tree.
If you want to make it faster, you can store the children of each node in a dictionary keyed by name, so that looking up a name within one level takes expected constant time instead of a linear scan. Searching for a path name in the directory tree then takes O(h), where h is the height of the directory tree. Further, to avoid redundant searching, you can keep track of the height of the directory tree: when the number of components in the parsed path name exceeds the height, we know the path cannot be in the tree and we do not need to search it.
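A short Python sketch of this kind of path check, with one dictionary per node, could look as follows. The function names build_tree, height and is_valid are made up for illustration, and a path counts as valid only if it ends on a leaf, as described in the question.

```python
def build_tree(paths):
    """Build nested dicts: each key is one path component, {} marks a leaf."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return root


def height(node):
    """Number of levels below node (0 for a leaf)."""
    return 0 if not node else 1 + max(height(child) for child in node.values())


def is_valid(root, path, max_depth=None):
    """True if path leads from the root to a leaf; one dict lookup per level."""
    parts = path.strip("/").split("/")
    if max_depth is not None and len(parts) > max_depth:
        return False          # longer than any stored path, so skip the search
    node = root
    for part in parts:
        if part not in node:
            return False
        node = node[part]
    return not node           # an empty dict means we stopped on a leaf
```

For example, after tree = build_tree(["/guest/friendsList/search"]), the call is_valid(tree, "/guest/friendsList/search") is true, while is_valid(tree, "/guest/friendsList") is false because friendsList is not a leaf.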

Data structure: a graph that's similar to a tree - but not a tree

I have implemented a data structure in C, based upon a series of linked lists, that appears to be similar to a tree - but not enough to be referred as such, because in theory it allows the existence of cycles. Here's a basic outline of the nodes:
There is a single, identifiable root that doesn't have a parent node or brothers;
Each node contains a pointer to its "father", its nearest "brother" and the first of its "children";
There are "outer" nodes without children and brothers.
What should I call such a data structure? It cannot be a tree, because even if the pointers are clearly labelled and used differently, cycles like father->child->brother->father may very well exist. My question is: can terms such as "father", "children" and "brother" be used in the context of a graph, or are they reserved for trees? After quite a bit of research I'm still unable to clarify this matter.
Thanks in advance!
I'd say you can still call it a tree, because the foundation is a tree data structure. There is precedent for my claim: "Patricia Tries" are referred to as trees even though their leaf nodes may point up the tree (creating cycles). I'm sure there are other examples as well.
It sounds like the additional links you have are essentially just for convenience, and could be determined implicitly (rather than stored explicitly). By storing them explicitly you impose additional invariants on your tree operations (insert, delete, etc), but do not affect the underlying organization, which is a tree.
Precisely because you are naming and treating those additional links separately, they can be thought of as an "overlay" on top of your tree.
In the end it doesn't really matter what you call it or what category it falls into if it works for you. Reading a book like Sedgewick's "Algorithms in C" you realize that there are tons of data structures out there, and nothing wrong with inventing your own!
One more point: Trees are a special case of graphs, so there is nothing wrong with referring to it as a graph (or using graph algorithms on it) as well.

Easy tree traversal and fast random node access

Edited after Alex Taggart's remark below.
I am using a zipper to easily traverse and edit a tree which can grow to many thousands of nodes. Each node is incomplete when it is first created. Data is going to be added/removed all the time in random positions, leaf nodes are going to be replaced by branches, etc.
The tree can be very unbalanced.
Fast random access to a node is also important.
An implementation would be to traverse the tree using a zipper and create a hash table of the nodes indexed by key. Needless to say the above would be very inefficient as:
2 copies of each node need to be created
any changes need to be consistently mirrored between the 2 data structures (tree and hashmap).
In short, is there a time/space-efficient way to combine the ease of traversing/updating with a zipper and the fast access of a hash table in Clojure?
Clojure's data structures are persistent and use structural sharing. This means that operations like adding, removing or accumulating are not as inefficient as you describe. The memory cost will be minimal since you are not duplicating what's already there.
By default Clojure's data structures are immutable. The nodes in your tree-like structure will thus not update themselves unless you use some sort of reference type (like a Var). I don't know enough about your specific use case to advise on the best way to access nodes. One way to access nodes in a nested structure is the get-in function, where you supply the path to the node and it returns the node's value.
Hope this helps solve your problem.

Binary tree, deleting item and reconnecting node

I'm learning data structures and found out that, for binary search trees, there are two ways to reconnect the nodes when you delete an item. Are those two ways (below) correct?
[Image: the two ways of reconnecting the nodes after a deletion]
Yes, they are. Note that you could also do the "mirror image" version of each way, so it's actually 4 ways in total.
In fact, there are quite a few ways that would produce a valid binary search tree. All you need to take care of is that every key in the left subtree of a node is less than the node itself, and every key in the right subtree is greater. However, the ways you have listed are the simplest ones and the ones typically used (unless it's a balanced tree and you need to rebalance it).
The two methods look correct. The first method also rebalances the tree, while the second simply reconnects the nodes.
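For reference, here is a Python sketch of one common textbook way to delete from an unbalanced binary search tree: a node with two children takes the key of its in-order successor, which is then removed from the right subtree. This is not necessarily either of the two ways in the question's image, and the "mirror image" version would use the in-order predecessor instead. All names here are illustrative.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Standard BST insert; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def delete(root, key):
    """Remove key from the subtree rooted at root; return the new subtree root."""
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        # Zero or one child: reconnect the parent directly to the other subtree.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Two children: copy the in-order successor's key, then delete it below.
        succ = root.right
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root


def inorder(root):
    """Keys in sorted order; handy for checking the BST property after a delete."""
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)
```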

How to walk two arbitrarily complex tree structures simultaneously and create a superset?

I have two tree structures that represent snapshots of a directory structure at two different points in time. Directories may have been added, removed or modified between the snapshots. I need to walk the two trees simultaneously and mark the newer with the differences between the two - i.e. flag nodes as New, Modified, Deleted, Unchanged, adding any deleted nodes, so that the end result is the full superset of the two snapshots.
Typically, the trees are likely to be about 10 deep but very wide, containing hundreds of thousands, potentially millions of nodes. I want to skip large chunks of the trees by comparing hash codes at each node and only continuing to recurse where the codes don't match.
Is there an algorithm that could be my friend here? Any other advice?
Imagine unrolling each tree into a sorted list of files and directories. A method could obtain the next input from each unrolled tree from an iterator for that tree. I could then compare the hash codes and skip ahead on one tree or the other, note deletions, and note modifications.
The paper "Fast and Simple XML Tree Differencing by Sequence Alignment" by Lindholm, Kangasharju, and Tarkoma has some pointers:
1) rsync does the sort of thing you are interested in. Have a look at http://samba.anu.edu.au/ftp/rsync/rsync.html, and it might be worth checking to see if rsync --list-only does what it sounds like.
2) One trick is to turn the tree hierarchy into a sequence, by traversing it with a depth first search, and then compare the two sequences. Your idea about comparing hash codes can then be implemented by using a rolling hash (http://en.wikipedia.org/wiki/Rolling_hash).
I suspect that you will end up generating two entire sequences and then running some equivalent of diff or xdelta between them, rather than trying to do the job incrementally. A completely incremental approach might have problems when some sub-directory is moved a long way in the tree structure.
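As a contrast to the sequence-based approach, the hash-guided recursive walk described in the question can be sketched in Python roughly as follows. The Snap class, its digest field (assumed to be a hash covering the node and everything below it) and the diff function are all invented for this example, and moved subtrees simply show up as one deletion plus one addition.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Snap:
    """One node of a snapshot (hypothetical layout)."""
    name: str
    digest: str                                   # assumed hash of the whole subtree
    children: Dict[str, "Snap"] = field(default_factory=dict)
    status: str = ""


def mark_all(node, status):
    """Flag a node and all of its descendants."""
    node.status = status
    for child in node.children.values():
        mark_all(child, status)


def diff(old: Optional[Snap], new: Optional[Snap]) -> Snap:
    """Annotate the newer tree in place and graft deleted nodes into it."""
    if old is None:
        mark_all(new, "New")
        return new
    if new is None:
        mark_all(old, "Deleted")
        return old
    if old.digest == new.digest:
        # Equal hashes: flag only the subtree root and skip everything below it.
        new.status = "Unchanged"
        return new
    new.status = "Modified"
    for name in sorted(set(old.children) | set(new.children)):
        new.children[name] = diff(old.children.get(name), new.children.get(name))
    return new
```

Descendants of an "Unchanged" node are deliberately left unvisited, which is where the savings on very wide trees would come from; a consumer of the result has to treat them as inheriting that status.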

Resources