Compressed array of file paths and random access - Windows

I'm developing a file management Windows application. The program should keep an array of paths to all files and folders that are on the disk. For example:
0 "C:"
1 "C:\abc"
2 "C:\abc\def"
3 "C:\ghi"
4 "C:\ghi\readme.txt"
The array "as is" will be very large, so it should be compressed and stored on the disk. However, I'd like to have random access to it:
to retrieve any path in the array by index (e.g., RetrievePath(2) = "C:\abc\def")
to find index of any path in the array (e.g., IndexOf("C:\ghi") = 3)
to add a new path to the array (indexes of any existing paths should not change), e.g., AddPath("C:\ghi\xyz\file.dat")
to rename some file or folder in the database;
to delete existing path (again, any other indexes should not change).
For example, delete path 1 "C:\abc" from the database and still have 4 "C:\ghi\readme.txt".
Can someone suggest some good algorithm/data structure/ideas to do these things?
Edit:
At the moment I've come up with the following solution:
0 "C:"
1 "[0]\abc"
2 "[1]\def"
3 "[0]\ghi"
4 "[3]\readme.txt"
That is, common prefixes are compressed.
RetrievePath(2) = "[1]\def" = RetrievePath(1) + "\def" = "[0]\abc\def" = RetrievePath(0) + "\abc\def" = "C:\abc\def"
IndexOf() also works iteratively, something like this:
IndexOf("C:") = 0
IndexOf("C:\abc") = IndexOf("[0]\abc") = 1
IndexOf("C:\abc\def") = IndexOf("[1]\def") = 2
To add a new path, say AddPath("C:\ghi\xyz\file.dat"), one should first add its prefixes:
5 "[3]\xyz"
6 "[5]\file.dat"
Renaming/moving a file or folder involves just one replacement (e.g., replacing [0]\ghi with [1]\klm renames directory "ghi" to "klm" and moves it into directory "C:\abc").
DeletePath() involves setting the entry (and all sub-paths) to empty strings. In the future, these slots can be reused for new paths.
After DeletePath("C:\abc"), the array will be:
0 "C:"
1 ""
2 ""
3 "[0]\ghi"
4 "[3]\readme.txt"
The whole array still needs to be loaded into RAM to perform fast operations. With, for example, 1,000,000 files and folders in total and an average name length of 10, the array will occupy over 10 MB.
Also, the IndexOf() function is forced to scan the array sequentially.
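A minimal sketch of this scheme (illustrative only; it stores a parent index and a name per slot instead of the [n] prefix notation, and uses None for deleted slots):

entries = []                          # index -> (parent_index, name), or None when deleted

def add_entry(parent, name):
    entries.append((parent, name))
    return len(entries) - 1

def retrieve_path(i):
    parent, name = entries[i]
    return name if parent is None else retrieve_path(parent) + "\\" + name

def delete_entry(i):
    entries[i] = None                 # sub-entries would be cleared the same way

c   = add_entry(None, "C:")           # 0
abc = add_entry(c, "abc")             # 1
add_entry(abc, "def")                 # 2
ghi = add_entry(c, "ghi")             # 3
add_entry(ghi, "readme.txt")          # 4

print(retrieve_path(2))               # C:\abc\def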
Edit (2): I just realised that my question can be reformulated:
How can I assign each file and folder on the disk a unique integer index so that I can quickly find a file/folder by its index, find the index of a known file/folder, and perform basic file operations without changing many indices?
Edit (3): Here is a question about a similar but Linux-related problem. It suggests using filename and content hashing to identify a file. Are there any Windows-specific improvements?

Your solution seems decent. You could also try to compress more using ad-hoc tricks, such as using only a few bits for common characters such as "\", drive letters, and maybe common file extensions. You could also have a look at tries ( http://en.wikipedia.org/wiki/Trie ).
Regarding your second edit, this seems to match the features of a hash table, but that is for indexing, not compressed storage.
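For instance (a sketch of that idea, not part of the original answer), an in-memory hash map keyed by (parent index, name) can sit next to the compressed array from the sketch above, so IndexOf() walks only the path components instead of scanning every entry:

index_by_key = {}                     # (parent_index, name) -> index

def add_entry_indexed(parent, name):
    i = add_entry(parent, name)       # append (parent, name) to the array as before
    index_by_key[(parent, name)] = i
    return i

def index_of(path):
    parent = None
    for part in path.split("\\"):     # "C:\abc\def" -> ["C:", "abc", "def"]
        parent = index_by_key[(parent, part)]   # KeyError if the path is unknown
    return parent

# Assuming the example entries were added via add_entry_indexed,
# index_of("C:\\abc\\def") resolves component by component and returns 2.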

Related

Differential file saving algorithm or tool

I'm looking for any information or algorithms that allow differential file saving and merging.
To be clearer: when the content of a file is modified, the original file should stay the same and every modification should be saved in a separate file (the same idea as a differential backup, but per file). When the file is accessed, the latest version should be reconstructed from the original file and the most recent differential file.
What I need to do is described in the diagram below:
For calculating diffs you could use something like diff_match_patch.
For each file version you could store a series of DeltaDiff entries.
A DeltaDiff would be a tuple of one of two types: INSERT or DELETE.
Then you could store the series of DeltaDiffs as follows:
Diff = [DeltaDiff_1, DeltaDiff_2, ... DeltaDiff_n ] = [
(INSERT, byte offset relative to the initial file, bytes)
(DELETE, byte offset relative to the initial file, length)
....
(....)
]
Applying the DeltaDiffs to the initial file would give you the next file version, and so on, for example:
FileVersion1 + Diff1 -> FileVersion2 + Diff2 -> FileVersion3 + ....
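As a rough sketch (the exact tuple semantics are an assumption, not spelled out above: INSERT carries the bytes to insert, DELETE carries a length, and offsets refer to the initial file), applying one Diff could look like this:

INSERT, DELETE = "INSERT", "DELETE"

def apply_diff(initial: bytes, diff) -> bytes:
    data = bytearray(initial)
    # Apply from the highest offset down so earlier offsets stay valid.
    for op, offset, payload in sorted(diff, key=lambda d: d[1], reverse=True):
        if op == INSERT:
            data[offset:offset] = payload        # payload: bytes to insert
        elif op == DELETE:
            del data[offset:offset + payload]    # payload: number of bytes to remove
    return bytes(data)

# FileVersion1 + Diff1 -> FileVersion2
v1 = b"hello world"
diff1 = [(INSERT, 11, b"!"), (DELETE, 0, 6)]
print(apply_diff(v1, diff1))                     # b'world!'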

How to speed up XQuery string search?

I am using XQuery/BaseX to look through large XML files to find historical data for some counters. All the files are zipped and stored somewhere on a drive. The important part of a file looks as follows:
<measInfo xmlns="http://www.hehe.org/foo/" measInfoId="uplink speed">
<granPeriod duration="DS222S" endTime="2020-09-03T08:15:00+02:00"/>
<repPeriod duration="DS222S"/>
<measTypes>AFD123 AFD124 AFD125 AFD156</measTypes>
<measValue measObjLdn="PLDS-PLDS/STBHG-532632">
<measResults>23 42 12 43</measResults>
</measValue>
</measInfo>
I built the following query:
declare default element namespace "http://www.hehe.org/foo/";
let $sought := ["AFD124", "AFD125"]
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $filename := concat(convert:dateTime-to-integer(current-dateTime()), ".xml")
for $meas in $datasource/measCollecFile/measData/measInfo return
for $measType at $i in $meas/tokenize(measTypes)[. = $sought] return
file:append($filename,
<meas
measInfoId="{data($meas/@measInfoId)}"
measObjLdn="{data($meas/measValue/@measObjLdn)}"
>
{$meas/granPeriod}
{$meas/repPeriod}
<measType>{$measType}</measType>
<measValue>{$meas/measValue/tokenize(measResults, " ")[$i]}</measValue>
</meas>)
The script works, but it takes a lot of time for some counters (measType). I read the documentation about indexing, and my idea is to somehow index all the measTypes (parts of the string), so that when I need to look through the whole archive for a counter, it can be accessed quickly. I am not sure whether that is possible when operating directly on archives. Would I have to create a new database from them? I would prefer not to, due to the size of the files. How do I create indexes for such a case?
It is not the answer to my question, but I have noticed that the execution time is much longer when I write XML nodes to a file. It is much faster to append any other string to a file:
file:append($filename, concat($measInfo/@measInfoId, ",", $measInfo/measValue/@measObjLdn, ",",
    $measInfo/granPeriod, ",", $measInfo/repPeriod, ",", $measType, ",",
    $tokenizedValues[$i], "&#10;"))
Why is that, and how can I speed up writing XML nodes to a file?
Also, I have noticed that appending a value to a file inside a for loop takes much longer, and I suspect that this is because the file has to be opened again in each iteration. Is there a way to keep the file open throughout the whole query?

Rename all contents of directory with a minimum of overhead

I am currently in the position where I need to rename all files in a directory. The chance that a file does not change name is minimal, and the chance that an old filename is the same as a new filename is considerable, making renaming conflicts likely.
Thus, simply looping over the files and renaming old->new is not an option.
The easy / obvious solution is to rename everything to have a temporary filename: old->tempX->new. Of course, to some degree, this shifts the issue, because now there is the responsibility of checking nothing in the old names list overlaps with the temporary names list, and nothing in the temporary names list overlaps with the new list.
Additionally, since I'm dealing with slow media and virus scanners that love to slow things down, I would like to minimize the actual actions on disk. Besides that, the user will be impatiently waiting to do more stuff. So if at all possible, I would like to process all files on disk in a single pass (by smartly re-ordering rename operations) and avoid exponential time shenanigans.
This last bit has brought me to a 'good enough' solution where I first create a single temporary directory inside my directory, move-rename everything into that, and finally move everything back into the old folder and delete the temporary directory. This gives me O(2n) on-disk actions.
If possible, I'd love to get the on-disk complexity to O(n), even if it comes at a cost of increasing the in-memory actions to O(99999n). Memory is a lot faster after all.
I am personally not at-home enough in graph theory, and I suspect the entire 'rename conflict' thing has been tackled before, so I was hoping someone could point me towards an algorithm that meets my needs. (And yes, I can try to brew my own, but I am not smart enough to write an efficient algorithm, and I probably would leave in a logic bug that rears its ugly head rarely enough to slip through my testing. xD)
One approach is as follows.
Suppose file A renames to B and B is a new name; then we can simply rename A.
Suppose file A renames to B, B renames to C, and C is a new name; then we can follow the list in reverse and rename B to C, then A to B.
In general this will work provided there is no loop. Simply make a list of all the dependencies and then rename in reverse order.
If there is a loop we have something like this:
A renames to B
B renames to C
C renames to D
D renames to A
In this case we need a single temporary file per loop.
Rename the first in the loop, A to ATMP.
Then our list of modifications becomes:
ATMP renames to B
B renames to C
C renames to D
D renames to A
This list no longer has a loop so we can process the files in reverse order as before.
The total number of file moves with this approach will be n + number of loops in your rearrangement.
Example code
So in Python this might look like this:
D = {1: 2, 2: 3, 3: 4, 4: 1, 5: 6, 6: 7, 10: 11}  # Map from start name to final name

def rename(start, dest):
    moved.add(start)
    print('Rename {} to {}'.format(start, dest))

moved = set()
filenames = set(D.keys())
tmp = 'tmp file'
for start in list(D.keys()):   # snapshot the keys; D gains a tmp entry inside the loop
    if start in moved:
        continue
    A = []                     # Chain of files to rename, starting from `start`
    p = start
    while True:
        A.append(p)
        dest = D[p]
        if dest not in filenames:
            break              # Chain ends at a brand-new name
        if dest == start:
            # Found a loop: move the first file to a temporary name
            D[tmp] = D[start]
            rename(start, tmp)
            A[0] = tmp
            break
        p = dest
    for f in A[::-1]:          # Rename in reverse order so each target is free
        rename(f, D[f])
This code prints:
Rename 1 to tmp file
Rename 4 to 1
Rename 3 to 4
Rename 2 to 3
Rename tmp file to 2
Rename 6 to 7
Rename 5 to 6
Rename 10 to 11
Looks like you're looking at a sub-problem of topological sort.
However it's simpler, since each file can depend on just one other file.
Assuming that there are no loops, and supposing map is the mapping from old names to new names:
In a loop, just select any file to rename and send it to a function which:
1. if its destination (new) name is not conflicting (a file with that name doesn't exist), just renames it;
2. else (a conflict exists):
2.1 renames the conflicting file first, by sending it to the same function recursively;
2.2 then renames this file.
A sort-of Java pseudo code would look like this:
// map is the map, map[oldName] = newName;
HashSet<String> oldNames = new HashSet<String>(map.keySet());
while (oldNames.size() > 0)
{
String file = oldNames.iterator().next(); // Just selects any filename from the set;
renameFile(map, oldNames, file);
}
...
void renameFile (map, oldNames, file)
{
if (oldNames.contains(map[file]))
{
renameFile(map, oldNames, map[file]);
}
OS.rename(file, map[file]); //actual renaming of file on disk
map.remove(file);
oldNames.remove(file);
}
I believe you are interested in a Graph Theory modeling of the problem so here is my take on this:
You can build the bidirectional mapping of old file names to new file names as a first stage.
Now, you compute the intersection set I of the old filenames and the new filenames. Each target "new filename" appearing in this set requires the corresponding "old filename" to be renamed first. This is a dependency relationship that you can model in a graph.
Now, to build that graph, we iterate over that I set. For each element e of I:
Insert a vertex in the graph representing the file e needing to be renamed if it doesn't exist yet
Get the "old filename" o that has to be renamed into e
Insert a vertex representing o into the graph if it doesn't already exist
Insert a directed edge (e, o) in the graph. This edge means "e must be renamed before o". If that edge introduces a cycle (*), do not insert it and mark o as a file that needs to be moved-and-renamed.
You now have to iterate over the roots of your graph (vertices with no in-edges), perform a BFS from them, and perform the renaming each time you discover a vertex. The renaming can be a plain rename or a move-and-rename, depending on whether the vertex was tagged.
The last step is to move back the moved-and-renamed files back from their sandbox directory to the target directory.
C++ Live Demo to illustrate the graph processing.
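Not the linked demo, but a rough Python sketch of the same processing order (files whose target name is free go first, freeing a name unblocks whoever wanted it, and pure cycles are broken by moving one file aside; the temporary-name convention is made up for illustration):

from collections import deque

def plan_renames(mapping):
    # mapping: old name -> new name; target names are assumed unique.
    remaining = {old: new for old, new in mapping.items() if old != new}
    wants = {new: old for old, new in remaining.items()}   # target name -> who wants it
    ops = []

    # Roots: files whose target name is not a still-pending old name.
    ready = deque(old for old, new in remaining.items() if new not in remaining)

    def finish(old):
        ops.append((old, remaining.pop(old)))   # rename; the name `old` is now free
        waiter = wants.get(old)
        if waiter in remaining:                 # someone was blocked on that name
            ready.append(waiter)

    while remaining:
        while ready:
            finish(ready.popleft())
        if remaining:                           # only cycles are left
            old = next(iter(remaining))
            tmp = remaining[old] + ".rename_tmp"    # hypothetical temp name, no collision check
            ops.append((old, tmp))                  # move aside, freeing the name `old`
            remaining[tmp] = remaining.pop(old)     # the temp file still needs its real name
            wants[remaining[tmp]] = tmp
            waiter = wants.get(old)
            if waiter in remaining:
                ready.append(waiter)
    return ops

# One cycle (a <-> b) plus one chain (c -> d):
print(plan_renames({"a": "b", "b": "a", "c": "d"}))
# -> [('c', 'd'), ('a', 'b.rename_tmp'), ('b', 'a'), ('b.rename_tmp', 'b')]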

FindNextFile order NTFS

The FindNextFile WinAPI function is used to list the contents of directories. Microsoft states in the documentation that the order is file-system dependent. However, NTFS should be in alphabetical order most of the time:
The order in which this function returns the file names is dependent on the file system type. With the NTFS file system and CDFS file systems, the names are usually returned in alphabetical order. With FAT file systems, the names are usually returned in the order the files were written to the disk, which may or may not be in alphabetical order. However, as stated previously, these behaviors are not guaranteed.
My application needs some ordering of objects in directories. Because the majority of Windows users use NTFS, I would like to optimize my application for that case. Therefore I use the function _wcsicmp for name comparison. Most of the time this is correct and the results from FindNextFile are sorted according to _wcsicmp. However, sometimes the results are not sorted. I thought that this was natural, because FindFirstFile doesn't guarantee the order and I must sort anyway (in case of another file system). Then I noticed a strange pattern. It looks like the character '_' is returned after letters. A folder with content (a.txt, b.txt, _.txt) is returned in the order a, b, _. The function _wcsicmp will sort that as _, a, b. Tested on Windows 8.1. I ran some tests and this behavior is consistent.
Can someone explain to me what comparison criteria NTFS uses? Or why FindNextFile returns names out of alphabetical order?
Because NTFS sorting rules are not as simple as plain alphabetical order. Here is an MSDN blog article that sheds some light on the problem:
Why do NTFS and Explorer disagree on filename sorting?
One reason for this can be that NTFS captures the case-mapping table at the time the drive is formatted and continues to use that table, even if the OS's case-mapping tables change subsequently.
You can use CompareStringEx and set the flag SORT_DIGITSASNUMBERS.
The minimum system requirement for this function is Windows Vista.
LINK
int result = CompareStringEx(NULL, 0x00000008/*SORT_DIGITSASNUMBERS*/,
lpString1, cchCount1, lpString2, cchCount2, NULL, NULL, 0);
The comparison result for this function is a bit unusual: it returns 1, 2, or 3:
#define CSTR_LESS_THAN 1 // string 1 less than string 2
#define CSTR_EQUAL 2 // string 1 equal to string 2
#define CSTR_GREATER_THAN 3 // string 1 greater than string 2
You can also try _wcsicoll for older systems. If I recall correctly, _wcsicoll works better, but still not the same as Windows' sort.
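For what it's worth, here is a rough, Windows-only sketch (not from the answer) of calling CompareStringEx from Python via ctypes and using it as a sort key; 0x00000008 is the SORT_DIGITSASNUMBERS value shown above, and error handling (a return value of 0) is omitted:

import ctypes
from functools import cmp_to_key

SORT_DIGITSASNUMBERS = 0x00000008
CSTR_EQUAL = 2                        # CompareStringEx returns 1, 2 or 3

kernel32 = ctypes.windll.kernel32

def win_compare(a, b):
    # CompareStringEx(localeName, flags, s1, cch1, s2, cch2, NULL, NULL, 0);
    # -1 as a length means "null-terminated string".
    r = kernel32.CompareStringEx(None, SORT_DIGITSASNUMBERS,
                                 a, -1, b, -1, None, None, 0)
    return r - CSTR_EQUAL             # maps 1/2/3 to -1/0/+1

names = ["file10.txt", "file2.txt", "_.txt", "a.txt"]
print(sorted(names, key=cmp_to_key(win_compare)))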

Parsing text files in Ruby when the content isn't well formed

I'm trying to read files and create a hashmap of the contents, but I'm having trouble at the parsing step. An example of the text file is
put 3
returns 3
between
3
paragraphs 1
4
3
#foo 18
****** 2
The word becomes the key and the number is the value. Notice that the spacing is fairly erratic. The word isn't always a word (which doesn't get picked up by /\w+/) and the number associated with that word isn't always on the same line. This is why I'm calling it not well-formed. If there were one word and one number on one line, I could just split it, but unfortunately, this isn't the case. I'm trying to create a hashmap like this.
{"put"=>3, "#foo"=>18, "returns"=>3, "paragraphs"=>1, "******"=>2, "4"=>3, "between"=>3}
Coming from Java, it's fairly easy. Using Scanner I could just use scanner.next() for the next key and scanner.nextInt() for the number associated with it. I'm not quite sure how to do this in Ruby when it seems I have to use regular expressions for everything.
I'd recommend just using split, as in:
h = Hash[*s.split]
where s is your text (e.g. s = open('filename').read). Believe it or not, this will give you precisely what you're after.
EDIT: I realized you wanted the values as integers. You can add that as follows:
h.each{|k,v| h[k] = v.to_i}
