Trie with custom insert and delete functions - algorithm

I need to create four custom functions for a Trie data structure without changing its O(n) complexity:
makeDir("pathToDir")
make("pathToFile")
delete("pathToDir")
forceDelete("pathToDirOrFile")
makeDir("pathToDir"): adds path to the trie only if it is a valid path
make("pathToFile"): adds path to the trie only if it is a valid path (a file node can't call makeDir() )
delete("pathToDir"): deletes the path from trie only if it has no child dirs
forceDelete("pathToDirOrFile"): deletes path and its child dirs
For example a list of commands would be:
makeDir("\dir");
makeDir("\dir\foo")
makeDir("\dir\foo\lol\ok"); /* incorrect path */
make("\dir\file");
makeDir("\dir\file\abc"); /* file can't have sub dirs */
delete("\dir"); /* has children, incorrect */
delete("\dir\file");
forceDelete("\dir");
Does anybody have any idea how to recognize that a node represents the path of a file? What is the best way to implement these functions?

Validating and splitting the path
It's OS specific, so just pick any library that works with paths for your target system.
The trie
Once you can split a path into pieces, you can build a trie. Keep strings in its edges. For instance, if you have a foo/bar path, there'll be 3 nodes and two edges: the first one (1->2) is marked with foo and the second one (2->3) is marked with bar.
You can store a flag in each node to indicate if it's a regular file or a directory.
To check if a directory is empty, just make sure its node has no children.
To check if a directory/file can be created, take its base dir (all parts of the path except the last one), check that it exists by traversing your trie from the root and that its node is a directory, not a regular file.
Efficient traversal
You can store the edges of each node in a hash table that maps a string (the path component) to a child node.
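The points above can be put together in a minimal sketch. The class and method names (Node, PathTrie, make_dir, force_delete and so on) are made up for illustration, and the path splitting is deliberately naive: it assumes a backslash separator and does no OS-specific validation.

```python
class Node:
    """One trie node; edges to children live in a dict keyed by path component."""
    def __init__(self, is_dir):
        self.is_dir = is_dir   # flag: directory (True) or regular file (False)
        self.children = {}     # component name -> Node


class PathTrie:
    def __init__(self, sep="\\"):
        self.sep = sep
        self.root = Node(is_dir=True)

    def _split(self, path):
        # Naive splitting, assumed for illustration; a real path library is safer.
        return [part for part in path.split(self.sep) if part]

    def _find(self, parts):
        """Walk from the root; return the node for parts, or None if missing."""
        node = self.root
        for part in parts:
            node = node.children.get(part)
            if node is None:
                return None
        return node

    def _create(self, path, is_dir):
        parts = self._split(path)
        if not parts:
            return False
        parent = self._find(parts[:-1])
        # The base dir must exist and be a directory; the name must be free.
        if parent is None or not parent.is_dir or parts[-1] in parent.children:
            return False
        parent.children[parts[-1]] = Node(is_dir)
        return True

    def make_dir(self, path):
        return self._create(path, is_dir=True)

    def make(self, path):
        return self._create(path, is_dir=False)

    def _remove(self, path, force):
        parts = self._split(path)
        if not parts:
            return False
        parent = self._find(parts[:-1])
        node = parent.children.get(parts[-1]) if parent is not None else None
        # A plain delete refuses nodes that still have children.
        if node is None or (node.children and not force):
            return False
        del parent.children[parts[-1]]  # drops the whole subtree
        return True

    def delete(self, path):
        return self._remove(path, force=False)

    def force_delete(self, path):
        return self._remove(path, force=True)


trie = PathTrie()
print(trie.make_dir("\\dir"))                # True
print(trie.make_dir("\\dir\\foo\\lol\\ok"))  # False: \dir\foo\lol does not exist
print(trie.make("\\dir\\file"))              # True
print(trie.make_dir("\\dir\\file\\abc"))     # False: a file can't have sub dirs
print(trie.delete("\\dir"))                  # False: it still has children
print(trie.force_delete("\\dir"))            # True
```

Each operation walks the trie once, doing one hash lookup per path component, so the cost stays linear in the length of the path.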

Related

Search/Parse XML and exclude certain nodes without removing them?

The command below allows me to parse the text in all nodes except for nodes 'wp14:sizeRelH' & 'wp14:sizeRelV'
XML.search('//wp14:sizeRelH', '//wp14:sizeRelV').remove.search('//text()')
I would like to do the same thing but I do not want to remove nodes 'wp14:sizeRelH' and 'wp14:sizeRelV' from the XML.
This way I can parse through the XML tree and make changes to the text in each node without affecting nodes 'wp14:sizeRelH' and 'wp14:sizeRelV'
EDIT: It appears if nodes '//wp14:sizeRelH' or '//wp14:sizeRelV' are not in the XML, then my command also returns nothing which is not good :(
Looks like I found the answer. I used //text()[not...] but had to find the ancestors names of the text I didn't want to include:
XML.search('//text()[not(ancestor::wp14:pctHeight or ancestor::wp14:pctWidth or ancestor::wp:posOffset)]')

What is the difference between the 'Avl' suffix and the original function in the Win32 API?

I've dumped the export table of ntdll.dll looking for specific APIs, and found that some functions appear twice: once under the original name and once with 'Avl' appended to it. What does it mean?
like:
RtlDeleteElementGenericTable
RtlDeleteElementGenericTableAvl
The meaning is simple: the two variants operate on different table types. For example, take RtlDeleteElementGenericTable and RtlDeleteElementGenericTableAvl: the table parameter of RtlDeleteElementGenericTable is an RTL_GENERIC_TABLE, but for RtlDeleteElementGenericTableAvl it is an RTL_AVL_TABLE.
The RTL_AVL_TABLE structure contains file system-specific data for an
Adelson-Velsky/Landis (AVL) tree. An AVL tree ensures a more balanced,
shallower tree implementation than a splay tree implementation of a
generic table (RTL_GENERIC_TABLE).
So that is also the difference between the * and *Avl functions.

Fast filesystem path matching with some pattern (but no wildcard)

Assume a set of (Unix) paths, such as
/usr
/lib
/var/log
/home/myname/somedir
....
Given a path /some/path, I want to test whether it matches any path in the set above, where by "match" I mean that
/some/path is exactly one of the paths above, or
it is a subpath of one of the paths above.
I know I can split the path by / and do string matching one by one, but I want to do this very fast, probably by using some hashing technique or something similar such that I can transform those string matching to some integer matching.
Are there any algorithms for that? Or is there a proof that none exists?
Hash table approach
Since paths are generally not very deep, you may be able to afford storing all possible matching subpaths.
For every path in the input set, add each of its subpaths to a hash table. For example, this set:
/usr
/lib
/var/log
/home/myname/somedir
will produce this table:
hash0 -> /usr
hash1 -> /lib
hash2 -> /var
hash3 -> /var/log
hash4 -> /home
hash5 -> /home/myname
hash6 -> /home/myname/somedir
Now the search query boils down to finding an exact match in this hash table. String comparison will only be needed in case of a hash collision.
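A minimal Python sketch of this idea, using the built-in set as the hash table. The function names (build_prefix_set, normalize, matches) and the normalization of trailing slashes are assumptions made for illustration, not part of the original answer.

```python
def build_prefix_set(paths):
    """Collect every path in the set together with all of its leading subpaths."""
    prefixes = set()
    for path in paths:
        parts = [p for p in path.split("/") if p]
        for end in range(1, len(parts) + 1):
            prefixes.add("/" + "/".join(parts[:end]))
    return prefixes


def normalize(path):
    # Assumed normalization: drop empty components such as a trailing slash.
    return "/" + "/".join(p for p in path.split("/") if p)


def matches(prefixes, query):
    # A single hash lookup; strings are compared only when hashes coincide.
    return normalize(query) in prefixes


prefixes = build_prefix_set(["/usr", "/lib", "/var/log", "/home/myname/somedir"])
print(len(prefixes))                      # 7, the seven entries listed above
print(matches(prefixes, "/home/myname"))  # True
print(matches(prefixes, "/var/log/x"))    # False
```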
One major drawback of this method is that in the general case it needs a superlinear amount of memory (with respect to the size of the input set).
Consider a 600-character-long path:
[400characterlongprefix]/a/a/a/...[100 times].../a/a/a/
And the corresponding table, which contains 50500 characters in total:
hash0 -> [400characterlongprefix]
hash1 -> [400characterlongprefix]/a
hash2 -> [400characterlongprefix]/a/a
...
hash100 -> [400characterlongprefix]/a/a/a/...[100 times].../a/a/a/
Trie approach
Precomputation step
Split every path in the set into its components.
Assign every distinct component an index and add the pair (component, index) to a hash table.
For every path, add the sequence of its component indices to a prefix tree.
Example
Input set:
/usr
/var/log
/home/log/usr
Component indices:
usr -> 0
var -> 1
log -> 2
home -> 3
Prefix tree:
0 // usr
1 -> 2 // var, log
3 -> 2 -> 0 // home, log, usr
Search query
Split the path into its components.
For every component find its index in the hash table.
If one of the components does not have a corresponding index, report no match.
Search the prefix tree for the sequence of component indices.
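The precomputation and the search query can be sketched in Python as follows. The names build and matches are hypothetical, and the prefix tree is represented as nested dicts keyed by component index, which is only one possible layout.

```python
def build(paths):
    index = {}  # component -> integer index
    root = {}   # prefix tree as nested dicts: component index -> subtree
    for path in paths:
        node = root
        for comp in filter(None, path.split("/")):
            # Assign the next free index to a component seen for the first time.
            idx = index.setdefault(comp, len(index))
            node = node.setdefault(idx, {})
    return index, root


def matches(index, root, query):
    node = root
    for comp in filter(None, query.split("/")):
        idx = index.get(comp)
        # An unknown component, or a known one in the wrong place, means no match.
        if idx is None or idx not in node:
            return False
        node = node[idx]
    return True


index, root = build(["/usr", "/var/log", "/home/log/usr"])
print(index)                              # {'usr': 0, 'var': 1, 'log': 2, 'home': 3}
print(matches(index, root, "/home/log"))  # True
print(matches(index, root, "/log"))       # False
```

Inside the tree only integers are compared; string hashing happens once per component when it is translated to its index.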

Rename all contents of directory with a minimum of overhead

I am currently in the position where I need to rename all files in a directory. The chance that a file does not change name is minimal, and the chance that an old filename is the same as a new filename is considerable, making renaming conflicts likely.
Thus, simply looping over the files and renaming old->new is not an option.
The easy / obvious solution is to rename everything to have a temporary filename: old->tempX->new. Of course, to some degree, this shifts the issue, because now there is the responsibility of checking nothing in the old names list overlaps with the temporary names list, and nothing in the temporary names list overlaps with the new list.
Additionally, since I'm dealing with slow media and virus scanners that love to slow things down, I would like to minimize the actual actions on disk. Besides that, the user will be impatiently waiting to do more stuff. So if at all possible, I would like to process all files on disk in a single pass (by smartly re-ordering rename operations) and avoid exponential time shenanigans.
This last bit has brought me to a 'good enough' solution where I first create a single temporary directory inside my directory, I move-rename everything into that, and finally, I move everything back into the old folder and delete the temporary directory. This gives me a complexity of O(2n) for disk actions.
If possible, I'd love to get the on-disk complexity to O(n), even if it comes at a cost of increasing the in-memory actions to O(99999n). Memory is a lot faster after all.
I am personally not at-home enough in graph theory, and I suspect the entire 'rename conflict' thing has been tackled before, so I was hoping someone could point me towards an algorithm that meets my needs. (And yes, I can try to brew my own, but I am not smart enough to write an efficient algorithm, and I probably would leave in a logic bug that rears its ugly head rarely enough to slip through my testing. xD)
One approach is as follows.
Suppose file A renames to B and B is a new name, we can simply rename A.
Suppose file A renames to B and B renames to C and C is a new name, we can follow the list in reverse and rename B to C, then A to B.
In general this will work providing there is not a loop. Simply make a list of all the dependencies and then rename in reverse order.
If there is a loop we have something like this:
A renames to B
B renames to C
C renames to D
D renames to A
In this case we need a single temporary file per loop.
Rename the first in the loop, A to ATMP.
Then our list of modifications becomes:
ATMP renames to B
B renames to C
C renames to D
D renames to A
This list no longer has a loop so we can process the files in reverse order as before.
The total number of file moves with this approach will be n + number of loops in your rearrangement.
Example code
So in Python this might look like this:
D = {1: 2, 2: 3, 3: 4, 4: 1, 5: 6, 6: 7, 10: 11}  # Map from start name to final name
moved = set()

def rename(start, dest):
    moved.add(start)
    print('Rename {} to {}'.format(start, dest))

filenames = set(D.keys())
tmp = 'tmp file'
for start in list(D.keys()):  # iterate over a copy: D gains a tmp entry below
    if start in moved:
        continue
    A = []  # List of files to rename
    p = start
    while True:
        A.append(p)
        dest = D[p]
        # Stop at a free name, or at a file that an earlier pass already moved away
        if dest not in filenames or dest in moved:
            break
        if dest == start:
            # Found a loop
            D[tmp] = D[start]
            rename(start, tmp)
            A[0] = tmp
            break
        p = dest
    for f in A[::-1]:
        rename(f, D[f])
This code prints:
Rename 1 to tmp file
Rename 4 to 1
Rename 3 to 4
Rename 2 to 3
Rename tmp file to 2
Rename 6 to 7
Rename 5 to 6
Rename 10 to 11
Looks like you're looking at a sub-problem of topological sort.
However it's simpler, since each file can depend on just one other file.
Assuming that there are no loops, and supposing map is the mapping from old names to new names:
In a loop, just select any file that still has to be renamed, and send it to a function which:
1. if its destination name is not conflicting (a file with the new name doesn't exist), just renames it
2. else (a conflict exists)
2.1 renames the conflicting file first, by sending it to the same function recursively
2.2 renames this file
A sort-of Java pseudo code would look like this:
// map is the mapping: map.get(oldName) == newName
Set<String> oldNames = new HashSet<String>(map.keySet());
while (!oldNames.isEmpty())
{
    String file = oldNames.iterator().next(); // Just selects any filename from the set
    renameFile(map, oldNames, file);
}
...
void renameFile(Map<String, String> map, Set<String> oldNames, String file)
{
    String target = map.get(file);
    if (oldNames.contains(target))
    {
        // The target name is still taken: rename that file first (assumes no loops)
        renameFile(map, oldNames, target);
    }
    OS.rename(file, target); // actual renaming of file on disk (OS.rename is a placeholder)
    map.remove(file);
    oldNames.remove(file);
}
I believe you are interested in a Graph Theory modeling of the problem so here is my take on this:
You can build the bidirectional mapping of old file names to new file names as a first stage.
Now, you compute the intersection set I of the old filenames and the new filenames. Each target "new filename" appearing in this set requires the file currently holding that name to be renamed first. This is a dependency relationship that you can model in a graph.
Now, to build that graph, we iterate over that I set. For each element e of I:
Insert a vertex in the graph representing the file e needing to be renamed if it doesn't exist yet
Get the "old filename" o that has to be renamed into e
Insert a vertex representing o into the graph if it doesn't already exist
Insert a directed edge (e, o) in the graph. This edge means "e must be renamed before o". If that edge introduces a cycle (*), do not insert it and mark o as a file that needs to be moved-and-renamed.
You now have to iterate over the roots of your graph (vertices that have no in-edges) and perform a BFS using them as a starting point and perform the renaming each time you discover a vertex. The renaming can be a common rename or a move-and-rename depending on if the vertex was tagged.
The last step is to move the moved-and-renamed files back from their sandbox directory to the target directory.
C++ Live Demo to illustrate the graph processing.

`esprima` AST Tree: How to easily detect and add function parens?

TL;DR: I want to do the same thing as https://github.com/nolanlawson/optimize-js, but with esprima, while traversing the AST with estraverse.
Esprima gives the same output nodes for the following code:
!function (){}()
and
!(function (){})()
http://esprima.org/demo/parse.html?code=!function%20()%7B%7D()%0A%0A!(function%20()%7B%7D)()
For example, I traverse the AST. On an ExpressionStatement node that contains a FunctionExpression, I want to check whether the function has parens around it and, if it doesn't, add them.
So, how can I detect function parens, and how can I add them? I looked at the tokens, but I have no idea how to associate the flat token list with a specific AST node.
It seems this is a task not for esprima but for escodegen:
https://github.com/estools/escodegen/issues/315

Resources