Fast filesystem path matching with a set of patterns (but no wildcards) - algorithm

Assume a set of (Unix) paths, such as
/usr
/lib
/var/log
/home/myname/somedir
....
given a path /some/path, I want to test whether /some/path matches any path in the set above, where by "match" I mean:
/some/path is exactly one of the paths above, or
it is a subpath of one of the paths above.
I know I can split the path by / and do string matching component by component, but I want to do this very fast, probably by using some hashing technique or something similar, so that the string comparisons turn into integer comparisons.
Are there any algorithms for that? Or is there a proof that no such algorithm exists?

Hash table approach
Since paths are generally not very deep, you may be able to afford storing all possible matching subpaths.
For every path in the input set, add each of its subpaths (prefixes) to a hash table. For example, this set:
/usr
/lib
/var/log
/home/myname/somedir
will produce this table:
hash0 -> /usr
hash1 -> /lib
hash2 -> /var
hash3 -> /var/log
hash4 -> /home
hash5 -> /home/myname
hash6 -> /home/myname/somedir
Now the search query boils down to finding an exact match in this hash table. String comparison will only be needed in case of a hash collision.
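A minimal Python sketch of this approach, using the built-in set (which hashes the strings and falls back to string comparison only on collision); the helper names are illustrative:

# Build a set containing every path in the input plus all of its
# ancestor prefixes; a query is then a single hashed membership test.
def build_prefix_table(paths):
    table = set()
    for path in paths:
        parts = path.strip("/").split("/")
        for i in range(1, len(parts) + 1):
            table.add("/" + "/".join(parts[:i]))
    return table

def matches(table, query):
    return query.rstrip("/") in table

table = build_prefix_table(["/usr", "/lib", "/var/log", "/home/myname/somedir"])
print(matches(table, "/var"))      # True: prefix of /var/log
print(matches(table, "/var/run"))  # False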
One major drawback of this method is that in the general case it needs superlinear amount of memory (with respect to the size of the input set).
Consider a 600-character-long path:
[400characterlongprefix]/a/a/a/...[100 times].../a/a/a
And the corresponding table, which contains 50500 characters in total (the prefix lengths 400, 402, ..., 600 sum to 50500):
hash0 -> [400characterlongprefix]
hash1 -> [400characterlongprefix]/a
hash2 -> [400characterlongprefix]/a/a
...
hash100 -> [400characterlongprefix]/a/a/a/...[100 times].../a/a/a
Trie approach
Precomputation step
Split every path in the set into its components.
Assign every distinct component an index and add the pair (component, index) to a hash table.
For every path, add the sequence of its component indices to a prefix tree.
Example
Input set:
/usr
/var/log
/home/log/usr
Component indices:
usr -> 0
var -> 1
log -> 2
home -> 3
Prefix tree:
0 // usr
1 -> 2 // var, log
3 -> 2 -> 0 // home, log, usr
Search query
Split the path into its components.
For every component find its index in the hash table.
If one of the components does not have a corresponding index, report no match.
Search the prefix tree for the sequence of component indices.
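A rough Python sketch of the precomputation and search steps above, using a dict-of-dicts as the prefix tree (the class and method names are illustrative):

class PathTrie:
    def __init__(self):
        self.index = {}  # component string -> integer index
        self.root = {}   # prefix tree over component indices

    def add(self, path):
        node = self.root
        for comp in path.strip("/").split("/"):
            idx = self.index.setdefault(comp, len(self.index))
            node = node.setdefault(idx, {})

    def search(self, path):
        node = self.root
        for comp in path.strip("/").split("/"):
            idx = self.index.get(comp)
            if idx is None or idx not in node:  # unknown component: no match
                return False
            node = node[idx]
        return True

trie = PathTrie()
for p in ["/usr", "/var/log", "/home/log/usr"]:
    trie.add(p)
print(trie.search("/home/log"))  # True
print(trie.search("/home/usr"))  # False

After the precomputation, each query costs one hash lookup per component, and the tree traversal itself is done purely on integer keys.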

Related

Filtering on a list of paths affects the contents of the list variable

I am using neo4j Community Edition, version 3.4.0 on Windows.
I have a simple use case where I wish to collect a number of path results, combine these into a single list and then process the contents of that list.
I wish to filter the common list for specific node attributes and process those nodes based on the category of the filter.
I then wish to apply further filters to the common list and process the resulting nodes in a similar way.
Some nodes may be selected by more than one filter so it is important that one filter does not remove any nodes from the common list.
The problem I have is that, after the first filter, the contents of the common list are reduced to only those paths that include nodes matching that filter.
It appears the filter is affecting the contents of the list it is processing, not just returning a new list of nodes that match the filter criteria.
The following queries are contrived but they demonstrate the issue I am facing:
Create the test data:
CREATE (b:B)-[:FollowedBy]->(c:C)-[:FollowedBy]->(d:D)
RETURN b, c, d;
The query:
// Establish two related paths
MATCH p1 = (:B)-[:FollowedBy]->(c)
MATCH p2 = (c)-[:FollowedBy]->()
// Join the two paths to create a single list
WITH collect(p1) + collect(p2) AS pList
// Unwind the common list so that it can be filtered for specific categories
UNWIND pList AS path
// Filter for nodes in the 'D' category
WITH filter(n1 IN nodes(path) WHERE 'D' IN labels(n1)) AS dNodes, pList, path
// Unwind the filtered set of 'D' nodes so that they can be processed
UNWIND dNodes AS dNode
// ... do some dNode stuff
// Filter for nodes in the 'B' category
WITH filter(n2 IN nodes(path) WHERE 'B' IN labels(n2)) AS bNodes, pList, path
// Unwind the filtered set of 'B' nodes so that they can be processed
UNWIND bNodes AS bNode
// ... do some bNode stuff
RETURN path, pList;
If I run this query, 0 rows are returned.
What is actually happening is:
1) After collecting and joining the two paths, the common list "pList" looks as expected. It returns a single collection with two path elements.
+------------------------------------------------------+
| pList |
+------------------------------------------------------+
| [(:B)-[:FollowedBy]->(:C), (:C)-[:FollowedBy]->(:D)] |
+------------------------------------------------------+
2) After unwinding pList to path, pList now appears as two identical records, one for each path value. Could somebody please explain why this is the case, i.e. why has the "unwind of pList into path" affected pList itself?
+---------------------------------------------------------------------------------+
| pList | path |
+---------------------------------------------------------------------------------+
| [(:B)-[:FollowedBy]->(:C), (:C)-[:FollowedBy]->(:D)] | (:B)-[:FollowedBy]->(:C) |
| [(:B)-[:FollowedBy]->(:C), (:C)-[:FollowedBy]->(:D)] | (:C)-[:FollowedBy]->(:D) |
+---------------------------------------------------------------------------------+
3) After filtering for nodes in the 'D' category and unwinding the resulting list "dNodes", only a single record remains, and path now only contains the path corresponding to the filtered node?
+---------------------------------------------------------------------------------+
| pList | path |
+---------------------------------------------------------------------------------+
| [(:B)-[:FollowedBy]->(:C), (:C)-[:FollowedBy]->(:D)] | (:C)-[:FollowedBy]->(:D) |
+---------------------------------------------------------------------------------+
4) After filtering for nodes in the 'B' category and unwinding the resulting list "bNodes", returning pList or path results in zero rows. This means it is not possible to process the 'B' node?
I guess I have a fundamental misunderstanding of how Cypher handles variables and filters and would appreciate it if somebody would explain the behavior I have described above.
Also, considering my requirement, how should I be doing this? I could perform multiple queries but it appears that my requirement is simple enough that I should be able to carry out the whole process in one.
Thanks in advance.
cybersam answered your question for #2.
As for #3 and #4, it's important to understand that UNWIND will provide a row per element of the list, and when done on an empty list it will wipe out the row (no elements, so no rows). This is what happened when you unwound the results of the filter, since one path didn't have :D nodes (and so that row was removed), and the remaining path didn't have :B nodes (and was removed).
We have an entry describing this in the documentation, along with a workaround in case you want to keep the row with a null result if the list is empty.
In your case, it would probably be better to use FOREACH to process the filtered nodes list (provided you're only using SET, CREATE, MERGE, REMOVE, or DELETE):
MATCH p1 = (:B)-[:FollowedBy]->(c)
MATCH p2 = (c)-[:FollowedBy]->()
WITH collect(p1) + collect(p2) AS pList
UNWIND pList AS path
FOREACH(dNode in [n in nodes(path) WHERE n:D] |
// ... do some dNode stuff
)
FOREACH(bNode in [n in nodes(path) WHERE n:B] |
// ... do some bNode stuff
)
RETURN path, pList;
Otherwise, if you have more complicated things to do for dNode and bNode processing, you can use the trick in the linked documentation to UNWIND [null] using CASE when the filtered list is empty.
In step #2, pList has NOT changed (into 2 identical records).
At that step, you have only pList and path as variables. Neo4j would represent every possible variable value combination as a separate row of data, and process each row. Since there is a single pList value and 2 path values, that results in 2 rows of data, which is exactly what you displayed in the table in #2.
Also, you did not show your full Cypher code, so it is unknown why the overall query returned nothing. There is possibly an un-shown MATCH clause that is not matching, which would abort the remainder of the query.

Trie with custom insert and delete functions

I need to create four custom functions for a Trie data structure, without changing its O(n) complexity:
makeDir("pathToDir")
make("pathToFile")
delete("pathToDir")
forceDelete("pathToDirOrFile")
makeDir("pathToDir"): adds path to the trie only if it is a valid path
make("pathToFile"): adds path to the trie only if it is a valid path (a file node can't call makeDir() )
delete("pathToDir"): deletes the path from trie only if it has no child dirs
forceDelete("pathToDirOrFile"): deletes path and its child dirs
For example a list of commands would be:
makeDir("\dir");
makeDir("\dir\foo")
makeDir("\dir\foo\lol\ok"); /* incorrect path */
make("\dir\file");
makeDir("\dir\file\abc"); /* file can't have sub dirs */
delete("\dir"); /* has childs, incorrect */
delete("\dir\file");
forceDelete("\dir");
Does anybody have any idea how to recognize that a node represents the path of a file? What is the best way to implement these functions?
Validating and splitting the path
It's OS specific, so just pick any library that works with paths for your target system.
The trie
Once you can split a path into pieces, you can build a trie. Keep strings on its edges. For instance, if you have a foo/bar path, there will be three nodes and two edges: the first edge (1->2) is labeled foo and the second (2->3) is labeled bar.
You can store a flag in each node to indicate if it's a regular file or a directory.
To check whether a directory is empty, just make sure its node has no children.
To check whether a directory/file can be created, take its base dir (all parts of the path except the last one), check that it exists by traversing your trie from the root, and check that its node is a directory, not a regular file.
Efficient traversal
You can store the edges in a hash table that maps a string to a node.
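Putting the pieces together, here is one possible Python sketch of such a trie, with an is_file flag per node as described above; the class and method names, and the backslash-separated paths, are illustrative:

class Node:
    def __init__(self, is_file=False):
        self.is_file = is_file
        self.children = {}  # edge label (path component) -> Node

class FsTrie:
    def __init__(self):
        self.root = Node()

    def _walk(self, parts):
        # Follow the edges labeled by 'parts'; None if the path is missing.
        node = self.root
        for p in parts:
            node = node.children.get(p)
            if node is None:
                return None
        return node

    def _add(self, path, is_file):
        parts = path.strip("\\").split("\\")
        parent = self._walk(parts[:-1])       # base dir must exist...
        if parent is None or parent.is_file:  # ...and be a directory
            return False
        parent.children.setdefault(parts[-1], Node(is_file))
        return True

    def make_dir(self, path):
        return self._add(path, is_file=False)

    def make(self, path):
        return self._add(path, is_file=True)

    def delete(self, path):
        parts = path.strip("\\").split("\\")
        node = self._walk(parts)
        if node is None or node.children:  # missing, or still has children
            return False
        del self._walk(parts[:-1]).children[parts[-1]]
        return True

    def force_delete(self, path):
        parts = path.strip("\\").split("\\")
        parent = self._walk(parts[:-1])
        if parent is None or parts[-1] not in parent.children:
            return False
        del parent.children[parts[-1]]  # drops the whole subtree at once
        return True

fs = FsTrie()
fs.make_dir("\\dir")             # True
fs.make("\\dir\\file")           # True
fs.make_dir("\\dir\\file\\abc")  # False: file can't have sub dirs
fs.delete("\\dir")               # False: has children
fs.force_delete("\\dir")         # True

Each operation walks the path once, so the O(n) bound (n = path length) is preserved.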

Rename all contents of directory with a minimum of overhead

I am currently in the position where I need to rename all files in a directory. The chance that a file does not change name is minimal, and the chance that an old filename is the same as a new filename is considerable, making renaming conflicts likely.
Thus, simply looping over the files and renaming old->new is not an option.
The easy / obvious solution is to rename everything to have a temporary filename: old->tempX->new. Of course, to some degree, this shifts the issue, because now there is the responsibility of checking nothing in the old names list overlaps with the temporary names list, and nothing in the temporary names list overlaps with the new list.
Additionally, since I'm dealing with slow media and virus scanners that love to slow things down, I would like to minimize the actual actions on disk. Besides that, the user will be impatiently waiting to do more stuff. So if at all possible, I would like to process all files on disk in a single pass (by smartly re-ordering rename operations) and avoid exponential time shenanigans.
This last bit has brought me to a 'good enough' solution where I first create a single temporary directory inside my directory, move-rename everything into it, and finally move everything back into the old folder and delete the temporary directory. This costs roughly 2n on-disk actions.
If possible, I'd love to get the on-disk count down to about n actions, even if it comes at the cost of increasing the in-memory work to O(99999n). Memory is a lot faster, after all.
I am personally not at-home enough in graph theory, and I suspect the entire 'rename conflict' thing has been tackled before, so I was hoping someone could point me towards an algorithm that meets my needs. (And yes, I can try to brew my own, but I am not smart enough to write an efficient algorithm, and I probably would leave in a logic bug that rears its ugly head rarely enough to slip through my testing. xD)
One approach is as follows.
Suppose file A renames to B, and B is a new name: we can simply rename A.
Suppose file A renames to B, B renames to C, and C is a new name: we can follow the list in reverse and rename B to C, then A to B.
In general this will work provided there is no loop. Simply make a list of all the dependencies and then rename in reverse order.
If there is a loop we have something like this:
A renames to B
B renames to C
C renames to D
D renames to A
In this case we need a single temporary file per loop.
Rename the first in the loop, A to ATMP.
Then our list of modifications becomes:
ATMP renames to B
B renames to C
C renames to D
D renames to A
This list no longer has a loop so we can process the files in reverse order as before.
The total number of file moves with this approach will be n + number of loops in your rearrangement.
Example code
So in Python this might look like this:
D = {1: 2, 2: 3, 3: 4, 4: 1, 5: 6, 6: 7, 10: 11}  # map from start name to final name

moved = set()
filenames = set(D.keys())
tmp = 'tmp file'

def rename(start, dest):
    moved.add(start)
    print('Rename {} to {}'.format(start, dest))

for start in list(D.keys()):  # list() so adding the tmp entry below is safe
    if start in moved:
        continue
    A = []  # chain of files to rename
    p = start
    while True:
        A.append(p)
        dest = D[p]
        if dest not in filenames:
            break  # the chain ends at a fresh name
        if dest == start:
            # found a loop: break it by moving the first file aside
            D[tmp] = D[start]
            rename(start, tmp)
            A[0] = tmp
            break
        p = dest
    for f in A[::-1]:  # rename in reverse order so each target is free
        rename(f, D[f])
This code prints:
Rename 1 to tmp file
Rename 4 to 1
Rename 3 to 4
Rename 2 to 3
Rename tmp file to 2
Rename 6 to 7
Rename 5 to 6
Rename 10 to 11
Looks like you're looking at a sub-problem of topological sort.
However, it's simpler, since each file can depend on at most one other file.
Assuming that there are no loops, and supposing map is the mapping from old names to new names:
In a loop, just select any file to rename and send it to a function which:
1. if its destination name does not conflict (no file with the new name exists), just renames it;
2. else (a conflict exists):
2.1 renames the conflicting file first, by sending it to the same function recursively;
2.2 then renames this file.
A sort-of Java pseudo code would look like this:
// map is the map: map[oldName] = newName
HashSet<String> oldNames = new HashSet<String>(map.keys());
while (oldNames.size() > 0)
{
    String file = oldNames.first(); // just selects any filename from the set
    renameFile(map, oldNames, file);
}
...
void renameFile (map, oldNames, file)
{
    if (oldNames.contains(map[file]))
    {
        renameFile(map, oldNames, map[file]); // rename the conflicting file first
    }
    OS.rename(file, map[file]); // actual renaming of the file on disk
    map.remove(file);
    oldNames.remove(file);
}
I believe you are interested in a Graph Theory modeling of the problem so here is my take on this:
You can build the bidirectional mapping of old file names to new file names as a first stage.
Now, compute the intersection set I of the old filenames and the new filenames. Each target "new filename" appearing in this set requires the existing file of that name (an "old filename") to be renamed first. This is a dependency relationship that you can model in a graph.
To build that graph, iterate over the set I. For each element e of I:
Insert a vertex representing the file e (which needs to be renamed) into the graph, if it doesn't exist yet.
Get the "old filename" o that has to be renamed into e.
Insert a vertex representing o into the graph if it doesn't already exist.
Insert a directed edge (e, o) into the graph. This edge means "e must be renamed before o". If that edge would introduce a cycle, do not insert it and instead mark o as a file that needs to be moved-and-renamed.
You now have to iterate over the roots of your graph (vertices that have no in-edges), perform a BFS using them as starting points, and perform a renaming each time you discover a vertex. The renaming is either a plain rename or a move-and-rename, depending on whether the vertex was tagged.
The last step is to move the moved-and-renamed files back from their sandbox directory to the target directory.
C++ Live Demo to illustrate the graph processing.
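For reference, a rough Python sketch of this scheme (not the linked demo). It assumes distinct old names and distinct new names, so every vertex has at most one blocker; the ready queue plays the role of the BFS from the roots, and plan_renames and the sandbox naming are illustrative:

from collections import deque

def plan_renames(mapping, sandbox="sandbox/"):
    # mapping: old name -> new name. Returns an ordered list of
    # (src, dst) operations; cycle members pass through the sandbox.
    todo = {o: n for o, n in mapping.items() if o != n}
    old = set(todo)
    # unblocks[e]: the single file waiting for the name e to be vacated
    unblocks = {n: o for o, n in todo.items() if n in old}
    # roots: files whose target name is not held by a to-be-renamed file
    ready = deque(o for o, n in todo.items() if n not in old)
    ops, deferred, done = [], [], set()
    while len(done) < len(todo):
        if ready:
            o = ready.popleft()
            ops.append((o, todo[o]))
        else:
            # only cycles remain: park one member in the sandbox
            o = next(f for f in todo if f not in done)
            ops.append((o, sandbox + o))
            deferred.append((sandbox + o, todo[o]))  # final move, done last
        done.add(o)
        # renaming o vacates the name o, freeing the file waiting on it
        if o in unblocks and unblocks[o] not in done:
            ready.append(unblocks[o])
    return ops + deferred

print(plan_renames({"A": "B", "B": "C", "C": "D", "D": "A"}))
# [('A', 'sandbox/A'), ('D', 'A'), ('C', 'D'), ('B', 'C'), ('sandbox/A', 'B')]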

Hashing table design in C

I have a design issue regarding a hash function.
In my program I am using a hash table of size 2^13, where the slot is calculated from the value of the node (the hash key) that I want to insert.
Now, say each node has two values, |A|B|; however, I am inserting values into the hash table using A.
Later on, I want to search for a particular node by B, not A.
Is that possible? If yes, could you highlight some design approaches?
The constraint is that I have to use A as the hash key.
Sorry, I can't share the code. Small example:
Value[] = {Part1, Part2, Part3};
insert(value)
check_for_index(value.part1)
value.part1 is used to calculate the index of the slot.
Once the slot is found, the "value" is inserted.
Later on,
search_in_hash(part2)
check_for_index("but here I need value.part1 to check for the slot index")
So, how can I relate part1, part2 & part3 such that later on I can find the slot by either part2 or part3?
If the problem statement is vague kindly let me know.
Unless you intend to do an element-by-element search (in which case you don't need a hash, just a plain list), what you are basically asking for is a hash such that hash(X) == hash(Y) while X != Y, so that you could map to a location using part1 and then map to the same one using part2 or part3. That goes completely against what hashing stands for.
What you should do instead is (as viraptor also suggested) create 3 structures, each hashed using a different part of the value, and push the full value into all 3. Then, when you need to search, use the appropriate hash for the part you want to search by.
For example:
value[] = {part1, part2, part3};
hash1.insert(part1, value)
hash2.insert(part2, value)
hash3.insert(part3, value)
then
hash2.search_in_hash(part2)
or
hash3.search_in_hash(part3)
The above 2 should produce the exact same values.
Also make sure that all data manipulations (removing values, changing them) are done on all 3 structures simultaneously. For example:
value = hash2.search_in_hash(part2)
hash1.remove(value.part1)
hash2.remove(part2) // you can assert that part2 == value.part2
hash3.remove(value.part3)
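The same idea in runnable Python, with one dict per searchable part and records stored as 3-tuples (the layout and names are illustrative):

class MultiIndex:
    # Three hash tables over the same records, so a record can be found
    # by any one of its parts.
    def __init__(self):
        self.by_part = [{}, {}, {}]  # one table per part

    def insert(self, value):
        for i, table in enumerate(self.by_part):
            table[value[i]] = value

    def search(self, part_index, key):
        return self.by_part[part_index].get(key)

    def remove(self, value):
        # keep all three tables consistent, as noted above
        for i, table in enumerate(self.by_part):
            del table[value[i]]

idx = MultiIndex()
idx.insert(("A1", "B1", "C1"))
assert idx.search(1, "B1") == idx.search(2, "C1") == ("A1", "B1", "C1")
idx.remove(("A1", "B1", "C1"))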

Compressed array of file paths and random access

I'm developing a file management Windows application. The program should keep an array of paths to all files and folders that are on the disk. For example:
0 "C:"
1 "C:\abc"
2 "C:\abc\def"
3 "C:\ghi"
4 "C:\ghi\readme.txt"
The array "as is" will be very large, so it should be compressed and stored on the disk. However, I'd like to have random access to it:
to retrieve any path in the array by index (e.g., RetrievePath(2) = "C:\abc\def")
to find index of any path in the array (e.g., IndexOf("C:\ghi") = 3)
to add a new path to the array (indexes of any existing paths should not change), e.g., AddPath("C:\ghi\xyz\file.dat")
to rename some file or folder in the database;
to delete existing path (again, any other indexes should not change).
For example, delete path 1 "C:\abc" from the database and still have 4 "C:\ghi\readme.txt".
Can someone suggest some good algorithm/data structure/ideas to do these things?
Edit:
At the moment I've come up with the following solution:
0 "C:"
1 "[0]\abc"
2 "[1]\def"
3 "[0]\ghi"
4 "[3]\readme.txt"
That is, common prefixes are compressed: [i] stands for the string at index i.
RetrievePath(2) = "[1]\def" = RetrievePath(1) + "\def" = "[0]\abc\def" = RetrievePath(0) + "\abc\def" = "C:\abc\def"
IndexOf() also works iteratively, something like that:
IndexOf("C:") = 0
IndexOf("C:\abc") = IndexOf("[0]\abc") = 1
IndexOf("C:\abc\def") = IndexOf("[1]\def") = 2
To add new path, say AddPath("C:\ghi\xyz\file.dat"), one should first add its prefixes:
5 [3]\xyz
6 [5]\file.dat
Renaming/moving file/folder involves just one replacement (e.g., replacing [0]\ghi with [1]\klm will rename directory "ghi" to "klm" and move it to the directory "C:\abc")
DeletePath() involves setting it (and all subpaths) to empty strings. In future, they can be replaced with new paths.
After DeletePath("C:\abc"), the array will be:
0 "C:"
1 ""
2 ""
3 "[0]\ghi"
4 "[3]\readme.txt"
The whole array still needs to be loaded into RAM to perform fast operations. With, for example, 1,000,000 files and folders in total and an average filename length of 10, the array will occupy over 10 MB.
Also, the function IndexOf() is forced to scan the array sequentially.
Edit (2): I just realised that my question can be reformulated:
How can I assign each file and each folder on the disk a unique integer index so that I can quickly find a file/folder by its index, find the index of a known file/folder, and perform basic file operations without changing many indices?
Edit (3): Here is a question about a similar but Linux-related problem. There it is suggested to use filename and content hashing to identify files. Are there any Windows-specific improvements?
Your solution seems decent. You could also try to compress more using ad-hoc tricks, such as using only a few bits for common characters like "\", drive letters, and maybe common file extensions. You could also have a look at tries (http://en.wikipedia.org/wiki/Trie).
Regarding your second edit, this seems to match the features of a hash table, but a hash table handles the indexing, not the compressed storage.
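For what it's worth, here is a small Python sketch of the scheme from the question's edit, extended with a reverse hash map so that IndexOf() becomes a lookup instead of a sequential scan (class and method names are illustrative; deletion and renaming are omitted):

class PathArray:
    # Each entry stores (parent index, component); a dict keyed by the
    # same pair gives O(1) IndexOf. -1 stands for "no parent".
    def __init__(self):
        self.entries = []  # index -> (parent, component)
        self.lookup = {}   # (parent, component) -> index

    def add_path(self, path):
        parent = -1
        for comp in path.split("\\"):
            key = (parent, comp)
            if key not in self.lookup:  # add missing prefixes on the way
                self.lookup[key] = len(self.entries)
                self.entries.append(key)
            parent = self.lookup[key]
        return parent

    def index_of(self, path):
        parent = -1
        for comp in path.split("\\"):
            parent = self.lookup.get((parent, comp))
            if parent is None:
                return None
        return parent

    def retrieve_path(self, index):
        parts = []
        while index != -1:
            index, comp = self.entries[index]
            parts.append(comp)
        return "\\".join(reversed(parts))

pa = PathArray()
pa.add_path("C:\\abc\\def")
pa.add_path("C:\\ghi\\readme.txt")
print(pa.index_of("C:\\ghi"))  # 3, as in the example above
print(pa.retrieve_path(2))     # C:\abc\def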
