The fastest way to implement mktree in FTP - algorithm

What is generally the fastest algorithm to recursively make a directory (similar to UNIX mkdir -p) using the FTP protocol?
I have considered one approach:
1. MKDIR node
2. if error and nodes left, go to 1 with the next node
3. end
But this might perform badly if part of the path most likely already exists. For example, with some amortization the "/a/b/c/d" part of a "/a/b/c/d/e/f/g" path exists 99% of the time.

Considering that sending a command and receiving the response is taking most of the time, the fastest way to create a directory path is using as few commands as possible.
As there is no way to check for a directory's existence other than trying to create it or cd into it, just issuing mkdir a; mkdir a/b; ...; mkdir a/b/c/d/e/f would generally be the fastest way (do not cd into the subdirectories to create the next one, as this would prolong the process).
If you create multiple directories this way, you could of course keep track of which top-level directories you already created. Also, depending on the length of your paths and the likelihood that the upper directories already exist, you could try to start with e.g. mkdir a/b/c (for a/b/c/d/e/f) and then backtrack if it did not succeed. However if it's more likely that directories do not exist this will actually be slower in the long run.
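As a rough sketch of the "one MKD per prefix, ignore errors" approach using Python's ftplib (the make_tree name and the already-connected ftp object are illustrative assumptions, not part of the question):

from ftplib import FTP, error_perm

def make_tree(ftp: FTP, path: str) -> None:
    # Issue one MKD per prefix of `path`, ignoring "already exists" errors,
    # so the whole call costs one round trip per path component.
    parts = [p for p in path.split("/") if p]
    prefix = ""
    for part in parts:
        prefix = f"{prefix}/{part}"
        try:
            ftp.mkd(prefix)      # MKD /a, then /a/b, ... without any CWD
        except error_perm:
            pass                 # most likely the directory already exists

# Usage (connection details are placeholders):
# ftp = FTP("ftp.example.com"); ftp.login("user", "password")
# make_tree(ftp, "/a/b/c/d/e/f")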

If the existing directory hierarchy is equally likely to end at any given depth, then binary searching for the start position will be the fastest way. But as dseifert points out, if most of the time the directories already exist down to say level k, then it will be faster to start binary searching at level k rather than level n/2.
BTW, you'd have to be creating a lot of very deep directories for this sort of optimisation to be worth your time. Are you sure you're not optimising prematurely?
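For reference, a sketch of the binary search over depths with ftplib; it assumes existence can be probed with a single CWD per level and that every prefix of an existing directory also exists (the function names are mine):

from ftplib import FTP, error_perm

def deepest_existing(ftp: FTP, parts: list) -> int:
    # Largest k such that /parts[0]/.../parts[k-1] exists, found by binary search.
    def exists(k: int) -> bool:
        if k == 0:
            return True
        try:
            ftp.cwd("/" + "/".join(parts[:k]))   # one round trip per probe
            return True
        except error_perm:
            return False
    lo, hi = 0, len(parts)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if exists(mid):
            lo = mid                             # prefix of length mid exists
        else:
            hi = mid - 1                         # it does not, search shallower
    return lo

def make_tree_bsearch(ftp: FTP, path: str) -> None:
    parts = [p for p in path.split("/") if p]
    start = deepest_existing(ftp, parts)
    prefix = "/" + "/".join(parts[:start]) if start else ""
    for part in parts[start:]:
        prefix = f"{prefix}/{part}"
        ftp.mkd(prefix)                          # create only the missing tail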

Related

Efficient way to walk directory tree containing link cycles

Is there a more efficient way to walk a directory tree that contains link cycles than tracking which files have already been visited?
For example consider walking a directory containing these files:
symlink "parent" -> ".."
symlink "uh_oh" -> "/"
regular file "reg"
symlink "reg2" -> "reg"
You should also track which directories have been visited, as per your first example, but otherwise there is no better solution than maintaining visited flags for every file.
Maintaining the flags would be easier if there were a portable way of getting a short unique identifier for a mounted filesystem. Even then, you need to think through the consequences of mount and umount operations occurring during the scan, particularly since such a scan might take quite a long time if the filesystem tree includes remote filesystems.
In theory, you can get a "filesystem id" from the statvfs interface, but in practice that is not totally portable. Quoting man statfs from a Linux distro:
Nobody knows what f_fsid is supposed to contain…
…The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
This latter restriction -- that f_fsid is presented as 0 to non-privileged users -- does not violate the POSIX standard cited above, because that standard includes a very general disclaimer: "It is unspecified whether all members of the statvfs structure have meaningful values on all file systems."
The tree walk algorithm guarantees that you'll visit every file under a directory, so instead of tracking individual files you can maintain a list of search "roots":
Add the initial directory to the list of roots
Walk the directory tree for each search root
For every symlink you find, check if it's already contained by a search root. If it's not, add it as a new search root.
This way you'll visit every file and directory, will never get stuck in a loop, but may visit files and directories more than once. That can happen only when you find a symlink to an ancestor of an existing root. To avoid doing that, you can check if a directory is a search root before entering it.
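A rough Python sketch of the search-roots idea (the walk_with_roots name and the use of os.walk/os.path.realpath are my choices; the commonpath test stands in for "already contained by a search root", and the possible double visits described above are accepted):

import os

def walk_with_roots(start: str):
    # Keep a list of search roots instead of per-file visited flags.
    roots = [os.path.realpath(start)]
    i = 0
    while i < len(roots):
        root = roots[i]
        i += 1
        for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
            # Don't re-enter a directory that is itself another search root.
            dirnames[:] = [d for d in dirnames
                           if os.path.realpath(os.path.join(dirpath, d)) not in roots]
            for name in filenames + dirnames:
                full = os.path.join(dirpath, name)
                yield full
                if os.path.islink(full):
                    target = os.path.realpath(full)
                    covered = any(os.path.commonpath([target, r]) == r for r in roots)
                    if os.path.isdir(target) and not covered:
                        roots.append(target)     # symlink escapes all known roots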

Do algorithms for performing hierarchical permission checking exist?

I have a data structure which represents a hierarchy.
folder
    folder
        folder
            file
            file
etc.
Permissions are stored in a flat table:
| pKey | type | bitperms |
When performing global operations like search, we need to check permissions recursively within the tree.
Checking permissions inline with the individual leaves of the tree structure is easy. However, accounting for permissions on the parent nodes requires one of two known approaches:
1. After fetching the filtered leaves, post-process each one to check its parents' perms.
   (The cost is deferred until after the query; there might be lots of initial leaves found, but after processing the parents nothing remains, so useless work has been done.)
2. Pre-calculate all the roots (nodes which grant the permission) ahead of time and use that as a query filter while getting the leaves.
   (Potentially a huge query if many roots exist, resulting in excessive time spent processing each leaf.)
Do any algorithms exist for doing this in a more efficient way? Perhaps reorganizing the permission data or adding more information to the hierarchy?
Perhaps adding some heuristics to deal with extremes?
Dunno about a complete paper about that, but here are my thoughts.
You obviously need to check the whole path from the leaf to the root at some point.
I assume no permission rules are introduced from the side (i.e. you're working on a tree, not a general graph).
I assume lots of leaves on few "folder" nodes.
I also assume that you have a method for including permissions (ORing on a bitmask) or excluding them (NOT-ANDing on a bitmask).
Permissions are mostly granted to roles/groups, not individual users (in the latter case, you'd need to create something like an "ad-hoc role/group" for that user).
Permissions will not go up the tree, only down to the leaves.
Then I'd pre-calculate all permissions on folders from the root down and save them along with the folder nodes whenever the permissions on a folder change (or a role is added, etc.). When a specific file/leaf is accessed, you only have to check the file's/leaf's permissions and its folder's permissions.
You could also mark some folders as "do not inherit permissions from parent", which may shorten your calculations when the root's permissions change...
This would make it cheap for the following operations:
checking a leaf's permissions (join the leaf's and its parent folder's permissions).
changing the permissions of a folder which does not contain more folders.
These operations are costly, but since they do not need to touch any leaf/file, they only affect a minor part of the whole tree:
changing/extending the permission model (e.g. by adding a role/group, which might broaden your bitmask, depending on your implementation).
changing the root's permissions.
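A minimal sketch of that pre-calculation, assuming OR-style inheritance on bitmasks and a per-folder "do not inherit" flag (the Node fields and function names are made up for illustration):

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    own_perms: int = 0                 # bits granted directly on this folder
    inherit: bool = True               # False == "do not inherit from parent"
    effective: int = 0                 # pre-calculated and stored with the node
    children: list = field(default_factory=list)

def precalculate(node: Node, parent_effective: int = 0) -> None:
    # Recompute the stored effective permissions from the root down; run this
    # whenever folder permissions or the role model change.
    inherited = parent_effective if node.inherit else 0
    node.effective = inherited | node.own_perms
    for child in node.children:
        precalculate(child, node.effective)

def check(leaf_perms: int, folder: Node, needed: int) -> bool:
    # Checking a leaf is now cheap: join its own bits with the folder's bits.
    return (leaf_perms | folder.effective) & needed == needed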

Best Practice - Directory Structures

Is there an optimum number of directories for holding images on a drive before grouping them into sub-directories?
For example, I have a collection of approximately 600,000 image files.
I can logically sub-group these into several layers but I'm not sure of the optimum for fastest retrieval. I don't need to search the disk because I will always know each file's absolute path.
My basic options are:
1 directory with 600,000 files (my instincts tell me this is no good!)
OR
1 directory with 1500 sub-directories each with an average of 400 files (min 200 max 600)
OR
1 directory with 75 sub-directories each with an average of 20 sub-directories with an average of 400 files in each.
The second scenario would be my ideal choice but am concerned that this number of sub-directories will affect performance.
Discuss please !
Roger
In my experience this is filesystem (and even storage vendor) dependent...with the exception that choice #1 ("Just dump everything in one place") is almost certainly going to be a poor performer.
We faced a similar problem and went with a variant of #2. In our case, we had tens of millions of users, each with somewhere between 10 and ~1000 files. We ended up with a structure that looked like this:
ab\cd\ef\all_the_files
The ab portion specified the mount point, and cd\ef were the two levels of sub-folders underneath.
If you're going to be seeing significant IO load, I'd urge you to test out your configuration on the hardware and network you're going to be using at scale. And, of course, give thought to how you can do backups and restores of portions of the data, if required.
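A sketch of how such a bucketed path could be derived; hashing the file's identifier is an assumption here (the original layout may well have keyed on user ids instead), but it spreads files evenly regardless of how names are assigned:

import hashlib
import os

def shard_path(base: str, file_id: str) -> str:
    # Two hex characters per level gives 256 buckets per level (ab\cd\ef).
    digest = hashlib.md5(file_id.encode("utf-8")).hexdigest()
    return os.path.join(base, digest[0:2], digest[2:4], digest[4:6], file_id)

# shard_path(r"D:\images", "IMG_0042.jpg")
# -> D:\images\<ab>\<cd>\<ef>\IMG_0042.jpg, where ab/cd/ef come from the hash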
This previous question favours flat files on NTFS after experiments. This makes sense, since modern file systems store directory contents in a structure with logarithmic search times, so you get to choose between log(n) for one flat directory and something that is >= 2 log(sqrt(n)) = log(n) for a two-level split, i.e. at best equal.

Shortest sequence of operations transforming a file tree to another

Given two file trees A and B, is it possible to determine the shortest sequence of operations (or at least a short one) necessary to transform A into B?
An operation can be:
Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder
A and B are considered identical when they have the same files with the same contents (or the same size and CRC) and the same names, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following, basic idea:
Compute a database:
Store file names and their CRCs
Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
Ascend the tree to make a CRC for each parent folder
Use the following loop having database A and database B:
Compute A ∩ B and remove this intersection from both databases.
Use an inner join to find matching CRCs in A and B, folders first, order by size desc
While there is a result, use the first result to perform a folder or file move (possibly creating new folders if necessary), and remove the source rows of the result from both databases. If there was a move, update the CRCs of the new location's parent folders in database A.
Then remove all files and folders referenced in database A and create those referenced in database B.
However I think that this is really a suboptimal way to do that. What could you give me as advice?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step is to figure out which files need to be created/renamed/deleted.
A) Create a hash map of the files of Tree A
B) Go through the files of Tree B
B.1) If there is an identical (name and contents) file in the hash map, then leave it alone
B.2) If the contents are identical but the name is different, rename the file to the name in the hash map
B.3) If the file's contents don't exist in the hash map, remove it
B.4) (if one of 1,2,3 was true) Remove the file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
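A sketch of that hash-map pass, keyed by a content digest (the function names are mine, and duplicate contents within a tree are collapsed to a single entry to keep the sketch short):

import hashlib
import os

def digest(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

def file_map(root: str) -> dict:
    # content digest -> relative path
    out = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            out.setdefault(digest(full), os.path.relpath(full, root))
    return out

def plan(src_root: str, dst_root: str):
    # Classify source files against the target tree purely by content.
    src, dst = file_map(src_root), file_map(dst_root)
    keep   = [(p, dst[h]) for h, p in src.items() if h in dst and p == dst[h]]
    rename = [(p, dst[h]) for h, p in src.items() if h in dst and p != dst[h]]
    delete = [p for h, p in src.items() if h not in dst]
    create = [p for h, p in dst.items() if h not in src]
    return keep, rename, delete, create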
After the file differences have been resolved, it gets rather tricky. I wouldn't be surprised if there is no efficient optimal solution to this problem (it may be NP-complete/hard).
The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann
Enumerate all files in B and their associated sizes and checksums;
sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums;
sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
The only non-trivial problem is moving folders and files. Renaming, deleting and creating are trivial and can be done in a first step (or better, in a last step when you finish).
You can then reduce this to the problem of transforming one tree into another where both have the same leaves but a different topology.
You decide which files will be moved out of a folder/bucket and which files will be left in the folder. The decision is based on the number of identical files in the source and destination.
You apply the same strategy to move folders into the new topology.
I think that you should be near-optimal or optimal if you forget about the names of folders and think just about files and topology.

Building a directory tree from a list of file paths

I am looking for a time efficient method to parse a list of files into a tree. There can be hundreds of millions of file paths.
The brute force solution would be to split each path on occurrences of the directory separator and traverse the tree, adding directory and file entries by doing string comparisons, but this would be exceptionally slow.
The input data is usually sorted alphabetically, so the list would be something like:
C:\Users\Aaron\AppData\Amarok\Afile
C:\Users\Aaron\AppData\Amarok\Afile2
C:\Users\Aaron\AppData\Amarok\Afile3
C:\Users\Aaron\AppData\Blender\alibrary.dll
C:\Users\Aaron\AppData\Blender\and_so_on.txt
From this ordering my natural reaction is to partition the directory listings into groups... somehow... before doing the slow string comparisons. I'm really not sure. I would appreciate any ideas.
Edit: It would be better if this tree were lazy loaded from the top down if possible.
You have no choice but to do full string comparisons since you can't guarantee where the strings might differ. There are a couple tricks that might speed things up a little:
As David said, form a tree, but search for the new insertion point from the previous one (perhaps with the aid of some sort of matchingPrefix routine that will tell you where the new one differs).
Use a hash table for each level of the tree if there may be very many files within and you need to count duplicates. (Otherwise, appending to a stack is fine.)
If it's possible, you can generate your tree structure with the tree command.
To take advantage of the "usually sorted" property of your input data, begin your traversal at the directory where your last file was inserted: compare the directory name of the current pathname to the previous one. If they match, you can just insert here; otherwise pop up a level and try again.
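A sketch of that idea in Python: remember the directory node of the previous insertion, pop up only as far as the common prefix, and create the rest (the Node/build_tree names are mine, and backslash-separated paths as in the question are assumed):

class Node:
    __slots__ = ("name", "parent", "children")
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, {}

def build_tree(sorted_paths, sep="\\"):
    # Insert each path starting from the previous file's directory
    # instead of the root, exploiting the (usually) sorted input.
    root = Node("")
    prev_parts, prev_node = [], root
    for path in sorted_paths:
        parts = path.split(sep)
        # length of the common directory prefix with the previous path
        common = 0
        limit = min(len(parts) - 1, len(prev_parts) - 1)
        while common < limit and parts[common] == prev_parts[common]:
            common += 1
        # pop up from the previous directory to the common ancestor
        node = prev_node
        for _ in range(len(prev_parts) - 1 - common):
            node = node.parent
        # descend/create the remaining directories, then add the file
        for part in parts[common:-1]:
            node = node.children.setdefault(part, Node(part, node))
        node.children.setdefault(parts[-1], Node(parts[-1], node))
        prev_parts, prev_node = parts, node
    return root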
