Rename all contents of directory with a minimum of overhead - algorithm

I am currently in the position where I need to rename all files in a directory. The chance that a file does not change name is minimal, and the chance that an old filename is the same as a new filename is considerable, making renaming conflicts likely.
Thus, simply looping over the files and renaming old->new is not an option.
The easy / obvious solution is to rename everything to a temporary filename first: old->tempX->new. Of course, this shifts the issue to some degree, because now I am responsible for checking that nothing in the old names overlaps the temporary names, and that nothing in the temporary names overlaps the new names.
Additionally, since I'm dealing with slow media and virus scanners that love to slow things down, I would like to minimize the actual actions on disk. Besides that, the user will be impatiently waiting to do more stuff. So if at all possible, I would like to process all files on disk in a single pass (by smartly re-ordering rename operations) and avoid exponential time shenanigans.
This last bit has brought me to a 'good enough' solution where I first create a single temporary directory inside my directory, move-rename everything into that, and finally move everything back into the old folder and delete the temporary directory. This costs 2n disk actions (two renames per file).
If possible, I'd love to get the on-disk cost down to n actions, even if it comes at the cost of increasing the in-memory work to O(99999n). Memory is a lot faster, after all.
I am personally not at home enough in graph theory, and I suspect the entire 'rename conflict' problem has been tackled before, so I was hoping someone could point me towards an algorithm that meets my needs. (And yes, I can try to brew my own, but I am not smart enough to write an efficient algorithm, and I would probably leave in a logic bug that rears its ugly head rarely enough to slip through my testing. xD)

One approach is as follows.
Suppose file A renames to B and B is a new name, we can simply rename A.
Suppose file A renames to B and B renames to C and C is a new name, we can follow the list in reverse and rename B to C, then A to B.
In general this will work provided there is no loop. Simply make a list of all the dependencies and then rename in reverse order.
If there is a loop we have something like this:
A renames to B
B renames to C
C renames to D
D renames to A
In this case we need a single temporary file per loop.
Rename the first in the loop, A to ATMP.
Then our list of modifications becomes:
ATMP renames to B
B renames to C
C renames to D
D renames to A
This list no longer has a loop so we can process the files in reverse order as before.
The total number of file moves with this approach will be n + number of loops in your rearrangement.
Example code
So in Python this might look like this:
D = {1: 2, 2: 3, 3: 4, 4: 1, 5: 6, 6: 7, 10: 11}  # Map from start name to final name

moved = set()
filenames = set(D.keys())
tmp = 'tmp file'

def rename(start, dest):
    moved.add(start)
    print('Rename {} to {}'.format(start, dest))

for start in list(D):  # list() because we may add the temp entry while iterating
    if start in moved:
        continue
    A = []  # Chain of files to rename
    p = start
    while True:
        A.append(p)
        dest = D[p]
        if dest not in filenames or dest in moved:
            break  # Chain ends at a free (or already vacated) name
        if dest == start:
            # Found a loop: break it by renaming the first file to a temp name
            D[tmp] = D[start]
            rename(start, tmp)
            A[0] = tmp
            break
        p = dest
    for f in A[::-1]:  # Rename in reverse order so each target is free
        rename(f, D[f])
This code prints:
Rename 1 to tmp file
Rename 4 to 1
Rename 3 to 4
Rename 2 to 3
Rename tmp file to 2
Rename 6 to 7
Rename 5 to 6
Rename 10 to 11

Looks like you're looking at a sub-problem of topological sort.
However it's simpler, since each file can depend on just one other file.
Assuming that there are no loops:
Supposing map is the mapping from old names to new names:
Loop over the files: select any file to rename, and send it to a function which:
1. if its destination name is not conflicting (a file with the new name doesn't exist), just renames it;
2. else (a conflict exists):
2.1 renames the conflicting file first, by sending it to the same function recursively;
2.2 then renames this file.
A sort-of Java pseudo code would look like this:
// map is the map: map[oldName] = newName
HashSet<String> oldNames = new HashSet<String>(map.keySet());
while (oldNames.size() > 0)
{
    String file = oldNames.first(); // Just selects any filename from the set
    renameFile(map, oldNames, file);
}
...
void renameFile(map, oldNames, file)
{
    if (oldNames.contains(map[file]))
    {
        renameFile(map, oldNames, map[file]); // Vacate the target name first
    }
    OS.rename(file, map[file]); // Actual renaming of the file on disk
    map.remove(file);
    oldNames.remove(file);
}
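For illustration, here is a minimal runnable sketch of this recursive approach in Python (the function names and the cycle guard are my own; like the pseudo-code above it assumes the mapping contains no rename loops, and it raises instead of recursing forever if it meets one):

def rename_all(mapping, do_rename):
    """mapping: old name -> new name; do_rename: callback performing the disk rename."""
    old_names = set(mapping)

    def rename_file(name, in_progress):
        if name in in_progress:
            raise ValueError('rename loop detected at {!r}'.format(name))
        target = mapping[name]
        if target in old_names:
            # The target name is still occupied: vacate it first, recursively
            rename_file(target, in_progress | {name})
        do_rename(name, target)
        old_names.discard(name)

    while old_names:
        rename_file(next(iter(old_names)), set())

# Example: prints 'Rename b to c' then 'Rename a to b'
rename_all({'a': 'b', 'b': 'c'}, lambda s, d: print('Rename', s, 'to', d))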

I believe you are interested in a Graph Theory modeling of the problem so here is my take on this:
You can build the bidirectional mapping of old file names to new file names as a first stage.
Now, you compute the intersection set I of the old filenames and the new filenames. Each target "new filename" appearing in this set requires the corresponding "old filename" to be renamed first. This is a dependency relationship that you can model in a graph.
Now, to build that graph, we iterate over the set I. For each element e of I:
Insert a vertex representing the file e into the graph, if it doesn't exist yet
Get the "old filename" o that has to be renamed into e
Insert a vertex representing o into the graph, if it doesn't already exist
Insert a directed edge (e, o) in the graph. This edge means "e must be renamed before o". If that edge introduces a cycle, do not insert it and instead mark o as a file that needs to be moved-and-renamed.
You now have to iterate over the roots of your graph (vertices that have no in-edges), perform a BFS from each of them, and do the renaming each time you discover a vertex. The renaming is either a plain rename or a move-and-rename, depending on whether the vertex was tagged.
The last step is to move the moved-and-renamed files back from their sandbox directory to the target directory.
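A rough Python sketch of this construction, for illustration (the helper names are mine; it assumes the old-to-new mapping is a bijection, i.e. no two files rename to the same target):

import os
import tempfile

def rename_with_graph(directory, mapping):
    """mapping: old name -> new name, all relative to `directory`."""
    old_names = set(mapping)
    # Edge (e, o): file o wants the name e currently holds,
    # so e must be renamed (vacated) before o.
    succ = {}                                    # e -> o
    indegree = {name: 0 for name in mapping}
    for o, new in mapping.items():
        if new in old_names and new != o:
            succ[new] = o
            indegree[o] += 1

    sandbox = tempfile.mkdtemp(dir=directory)    # for the move-and-renamed files
    seen = set()

    # Walk each chain from its root (no in-edges means the target name is free)
    for root in (v for v in mapping if indegree[v] == 0):
        v = root
        while v is not None and v not in seen:
            seen.add(v)
            os.rename(os.path.join(directory, v),
                      os.path.join(directory, mapping[v]))
            v = succ.get(v)

    # Everything left sits on a cycle: break each cycle via the sandbox
    for v in mapping:
        if v in seen:
            continue
        seen.add(v)
        os.rename(os.path.join(directory, v), os.path.join(sandbox, v))
        u = succ[v]                              # the file waiting for v's name
        while u != v:
            seen.add(u)
            os.rename(os.path.join(directory, u),
                      os.path.join(directory, mapping[u]))
            u = succ[u]
        # Last step: move the parked file back under its new name
        os.rename(os.path.join(sandbox, v),
                  os.path.join(directory, mapping[v]))
    os.rmdir(sandbox)

Each chain is walked from its root, and each remaining cycle costs exactly one extra move through the sandbox, matching the n + number-of-loops bound from the first answer.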

Related

Differential file saving algorithm or tool

I'm looking for any information or algorithms that allow differential file saving and merging.
To be more clear: when modifying the content of a file, the original file should stay the same, and every modification must be saved in a separate file (the same idea as a differential backup, but for individual files). When the file is accessed, the latest version should be reconstructed from the original file and the last differential file.
For calculating diffs you could use something like diff_match_patch.
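Assuming the Python port of that library (the diff-match-patch package on PyPI), basic usage looks roughly like this:

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()

original = 'the quick brown fox'
modified = 'the quick red fox jumps'

# Compute a patch from the original to the modified version and serialise it
patches = dmp.patch_make(original, modified)
patch_text = dmp.patch_toText(patches)   # store this next to the original file

# Later: reconstruct the latest version from the original plus the stored patch
restored, results = dmp.patch_apply(dmp.patch_fromText(patch_text), original)
assert restored == modified and all(results)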
You could store, for each file version, a series of DeltaDiff entries.
DeltaDiff would be a tuple of one of 2 types: INSERT or DELETE.
Then you could store the series of DeltaDiff as follows:
Diff = [DeltaDiff_1, DeltaDiff_2, ..., DeltaDiff_n] = [
    (INSERT, byte offset relative to the initial file, bytes),
    (DELETE, byte offset relative to the initial file, length),
    ...
]
Applying the DeltaDiffs to the initial file would give you the next file version, and so on; for example:
FileVersion1 + Diff1 -> FileVersion2 + Diff2 -> FileVersion3 + ....
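The DeltaDiff format above is informal; one possible interpretation in Python, as a sketch (my assumption: offsets refer to the version being patched, so applying from the highest offset down keeps earlier offsets valid):

INSERT, DELETE = 'INSERT', 'DELETE'

def apply_diff(data, diff):
    # Apply one version's DeltaDiff list to `data` (bytes), returning the next
    # version. Operations are applied from the highest offset down so that
    # earlier offsets remain valid while we edit.
    out = bytearray(data)
    for op in sorted(diff, key=lambda t: t[1], reverse=True):
        if op[0] == INSERT:
            _, offset, chunk = op
            out[offset:offset] = chunk
        else:  # DELETE
            _, offset, length = op
            del out[offset:offset + length]
    return bytes(out)

# FileVersion1 + Diff1 -> FileVersion2
v1 = b'hello world'
v2 = apply_diff(v1, [(DELETE, 0, 5), (INSERT, 0, b'goodbye')])
assert v2 == b'goodbye world'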

Has Directory Content Changed?

How can I check a directory to see if its contents have changed since a given point in time?
I don't need to be informed when it changes, or what has changed. I just need a way to check if it has changed.
Create a file at the point in time you wish to start monitoring, using any method you like, e.g.:
touch time_marker
Then, when you want to check if anything has been added, use "find" like this:
find . -newer time_marker
This will only tell you files that have been modified or added since time_marker was created - it won't tell you if anything has been deleted. If you want to look again at a future point, "touch" time_marker again to create a new reference point.
If you just need to know if names have changed or files have been added/removed, you can try this:
Dir.glob('some_directory/**/*').hash
Just store and compare the hash values. You can obviously go further by getting more information out of a call to ls, for example, or out of File objects that represent each of the files in your directory structure, and hashing that.
Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.hash
UM ACTUALLY I'm being dumb: hash is only consistent within a single run of the Ruby interpreter. Let's use the standard Zlib::crc32 instead, e.g.
Zlib::crc32(Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.to_s)
My concern is that this approach will be memory-hungry and slow if you're checking a very large filesystem. Perhaps globbing the entire structure and mapping it isn't the way; if you have a lot of subdirectories, you could walk them recursively and calculate a checksum for each, then combine the checksums.
This might be better for larger directories:
require 'zlib'

Dir.glob('some_directory/**/*').map do |name|
  s = [name, File.mtime(name)].to_s
  [Zlib::crc32(s), s.length]
end.inject(Zlib::crc32('')) do |combined, x|
  Zlib::crc32_combine(combined, x[0], x[1])
end
This would be less prone to collisions:
require 'digest'

Dir.glob('some_directory/**/*').map do |name|
  [name, File.mtime(name)].to_s
end.inject(Digest::SHA512.new) do |digest, x|
  digest.update x
end.to_s
I've amended this to include timestamp and file size.
dir_checksum = Zlib::crc32(Dir.glob(File.join(dispatch, '/**/*')).map { |path|
  path.to_s + "_" + File.mtime(path).to_s + "_" + File.size(path).to_s
}.to_s)

Print all the files in a given folder and sub-folders without using recursion/stack

I recently had an interview with a reputable company for the position of Software Developer and this was one of the questions asked:
"Given the following methods:
List subDirectories(String directoryName){ ... };
List filesInDirectory(String directoryName) { ... };
As the names suggest, the first method returns a list of names of immediate sub-directories in the input directory ('directoryName') and the second method returns a list of names of all files in this folder.
Print all the files in the file system."
I thought about it and gave the interviewer a pretty obvious recursive solution. She then told me to do it without recursion. Since recursion makes use of the call stack, I told her I would use an auxiliary stack instead, at which point she told me not to use a stack either. Unfortunately, I wasn't able to come up with a solution. I did ask how it can be done without recursion or a stack, but she wouldn't say.
How can this be done?
You want to use a queue and a BFS algorithm.
I guess some pseudo-code would be nice:
files = filesInDirectory("/")
foreach (file in files) {
    fileQ.append(file)
}
dirQ = subDirectories("/")
while (dirQ != empty) {
    dir = dirQ.pop()
    files = filesInDirectory(dir)
    foreach (file in files) {
        fileQ.append(file)
    }
    dirQ.append(subDirectories(dir))
}
while (fileQ != empty) {
    print fileQ.pop()
}
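A runnable Python equivalent of that pseudo-code; the os-based bodies of the two given methods are my own stubs, since the interview only specified their signatures:

import os
from collections import deque

# The two methods from the question, stubbed with os (assumption):
def sub_directories(directory):
    return [os.path.join(directory, e) for e in os.listdir(directory)
            if os.path.isdir(os.path.join(directory, e))]

def files_in_directory(directory):
    return [os.path.join(directory, e) for e in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, e))]

def print_all_files(root):
    dir_q = deque([root])      # queue of directories still to visit
    while dir_q:
        d = dir_q.popleft()    # BFS: take from the front
        for f in files_in_directory(d):
            print(f)
        dir_q.extend(sub_directories(d))

print_all_files('/tmp')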
If I understood correctly, immediate sub-directories are only the directories directly in that folder. I mean, if we have these three paths /home/user, /home/config and /home/user/u001, we can say that both user and config are immediate subdirectories of /home/, but u001 isn't. The same applies if user and u001 are files (user is immediate while u001 isn't).
So you don't really need recursion or stack to return a list of immediate subdirectories or files.
EDIT: I thought that the OP wanted to implement the subDirectories() and filesInDirectory() functions.
So, you can do something like this to print all files (kind of pseudo-code):
List subd = subDirectories(current_dir);
List files = filesInDirectory(current_dir);
foreach (file in files) {
    print file.name();
}
while (!subd.empty()) {
    dir = subd.pop();
    files = filesInDirectory(dir.name());
    foreach (file in files) {
        print file.name();
    }
    subd.append(subDirectories(dir.path()));
}
I think that what #lqs suggests is indeed an acceptable answer that she might have been looking for: store the full path in a variable, and append the directory name to it if you enter a subdirectory, and clip off the last directory name when you leave it. This way, your full path acts as the pointer to where you currently are in the file system.
Because the full path is always modified at the end, the full path behaves (not surprisingly) as your stack.
Interview questions aside, I think I would still pick a real stack over string manipulation though...
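For completeness, here is a rough Python sketch of that path-as-pointer walk, reusing the sub_directories/files_in_directory stubs from the sketch above (it also assumes directory listings stay stable while walking):

import os

def print_all_files_no_stack(root):
    # The current path is the only state: descend into the first subdirectory,
    # and when a directory is exhausted, clip the last component to go back up
    # and move on to the next sibling.
    current = root
    for f in files_in_directory(current):
        print(f)
    while True:
        subs = sub_directories(current)
        if subs:
            current = subs[0]                  # enter the first child
        else:
            while current != root:
                parent = os.path.dirname(current)
                siblings = sub_directories(parent)
                i = siblings.index(current)
                if i + 1 < len(siblings):
                    current = siblings[i + 1]  # next unvisited sibling
                    break
                current = parent               # leave this directory
            else:
                return                         # climbed back to the root: done
        for f in files_in_directory(current):
            print(f)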

Compressed array of file paths and random access

I'm developing a file management Windows application. The program should keep an array of paths to all files and folders that are on the disk. For example:
0 "C:"
1 "C:\abc"
2 "C:\abc\def"
3 "C:\ghi"
4 "C:\ghi\readme.txt"
The array "as is" will be very large, so it should be compressed and stored on the disk. However, I'd like to have random access to it:
to retrieve any path in the array by index (e.g., RetrievePath(2) = "C:\abc\def")
to find index of any path in the array (e.g., IndexOf("C:\ghi") = 3)
to add a new path to the array (indexes of any existing paths should not change), e.g., AddPath("C:\ghi\xyz\file.dat")
to rename some file or folder in the database;
to delete existing path (again, any other indexes should not change).
For example, delete path 1 "C:\abc" from the database and still have 4 "C:\ghi\readme.txt".
Can someone suggest some good algorithm/data structure/ideas to do these things?
Edit:
At the moment I've come up with the following solution:
0 "C:"
1 "[0]\abc"
2 "[1]\def"
3 "[0]\ghi"
4 "[3]\readme.txt"
That is, common prefixes are compressed.
RetrievePath(2) = "[1]\def" = RetrievePath(1) + "\def" = "[0]\abc\def" = RetrievePath(0) + "\abc\def" = "C:\abc\def"
IndexOf() also works iteratively, something like that:
IndexOf("C:") = 0
IndexOf("C:\abc") = IndexOf("[0]\abc") = 1
IndexOf("C:\abc\def") = IndexOf("[1]\def") = 2
To add new path, say AddPath("C:\ghi\xyz\file.dat"), one should first add its prefixes:
5 [3]\xyz
6 [5]\file.dat
Renaming/moving file/folder involves just one replacement (e.g., replacing [0]\ghi with [1]\klm will rename directory "ghi" to "klm" and move it to the directory "C:\abc")
DeletePath() involves setting it (and all subpaths) to empty strings. In future, they can be replaced with new paths.
After DeletePath("C:\abc"), the array will be:
0 "C:"
1 ""
2 ""
3 "[0]\ghi"
4 "[3]\readme.txt"
The whole array still needs to be loaded in RAM to perform fast operations. With, for example, 1000000 files and folders in total and average filename length of 10, the array will occupy over 10 MB.
Also, the function IndexOf() is forced to scan the array sequentially.
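For concreteness, here is a small Python sketch of the scheme above; the (parent index, suffix) pairs play the role of the "[i]\suffix" strings, and a reverse dictionary is one way to avoid the sequential IndexOf() scan (rename and delete are left out; delete would blank an entry as described):

class PathTable:
    """Prefix-compressed path array: entry i is (parent_index, suffix),
    where parent_index is None for roots such as 'C:'."""
    def __init__(self):
        self.entries = []    # index -> (parent index or None, suffix)
        self.lookup = {}     # (parent index or None, suffix) -> index

    def retrieve_path(self, i):
        parent, suffix = self.entries[i]
        if parent is None:
            return suffix
        return self.retrieve_path(parent) + '\\' + suffix

    def index_of(self, path):
        parent = None
        for part in path.split('\\'):    # resolve one component at a time
            parent = self.lookup[(parent, part)]
        return parent

    def add_path(self, path):
        parent = None
        for part in path.split('\\'):    # add missing prefixes first
            key = (parent, part)
            if key not in self.lookup:
                self.lookup[key] = len(self.entries)
                self.entries.append(key)
            parent = self.lookup[key]
        return parent

t = PathTable()
for p in ['C:\\abc\\def', 'C:\\ghi\\readme.txt']:
    t.add_path(p)
print(t.retrieve_path(2))      # C:\abc\def
print(t.index_of('C:\\ghi'))   # 3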
Edit (2): I just realised that my question can be reformulated:
How can I assign each file and each folder on the disk a unique integer index so that I will be able to quickly find a file/folder by index, find the index of a known file/folder, and perform basic file operations without changing many indices?
Edit (3): Here is a question about a similar but Linux-related problem. It suggests using filename and content hashing to identify files. Are there Windows-specific improvements?
Your solution seems decent. You could also try to compress more using ad-hoc tricks, such as using only a few bits for common characters like "\", drive letters, and maybe common file extensions. You could also have a look at tries ( http://en.wikipedia.org/wiki/Trie ).
Regarding your second edit: that matches the features of a hash table, but a hash table helps with indexing, not with compressed storage.

Advanced Array Sorting in Ruby

I'm currently working on a project in Ruby, and I hit a wall on how I should proceed. In the project I'm using Dir.glob to search a directory and all of its subdirectories for certain file types, placing them into arrays. The files I'm working with share filenames and are differentiated by their extensions. For example,
txt_files = Dir.glob("**/*.txt")
doc_files = Dir.glob("**/*.doc")
rtf_files = Dir.glob("**/*.rtf")
Would return something similar to,
FILECON.txt
ASSORTED.txt
FIRST.txt
FILECON.doc
ASSORTED.doc
FIRST.doc
FILECON.rtf
ASSORTED.rtf
FIRST.rtf
So, the question I have is how I could efficiently break down these arrays (I'm dealing with thousands of files) and place all files with the same filename into an array. The new array would look like,
FILECON.txt
FILECON.doc
FILECON.rtf
ASSORTED.txt
ASSORTED.doc
ASSORTED.rtf
etc. etc.
I'm not even sure if glob would be the correct way to do this (all the files with the same file name are in the same folders). Any help would be greatly appreciated!
Get all your files into a single array with Dir.glob("**/*.{txt,doc,rtf}")
Don't forget that all the filenames have the directory too, so if you want to sort by the basename, then
files = Dir.glob("**/*.{txt,doc,rtf}").sort_by {|f| File.basename f}
Not sure if this is exactly what you need, but you can try this:
# first get all files
all_files = Dir.glob('**/*')
# then you can group them by name
by_name = all_files.group_by{|f| m = f.match(/([^\/]+)\.[^.\/]+$/); m[1] if m}
# and by extension
by_ext = all_files.group_by{|f| m = f.match(/[^\/]+\.([^.\/]+)$/); m[1] if m}
BTW, I don't see any relation of the question with sorting.
