Differential file saving algorithm or tool

I'm looking for any information or algorithms that allow differential file saving and merging.
To be clearer: when the content of a file is modified, the original file should stay the same and every modification should be saved in a separate file (the same idea as a differential backup, but for individual files). When the file is accessed, the latest version should be reconstructed from the original file plus the last differential file.
What I need to do is described in the diagram below:

For calculating diffs you could use something like diff_match_patch.
You could store, for each file version, a series of DeltaDiffs.
A DeltaDiff would be a tuple of one of two types: INSERT or DELETE.
Then you could store the series of DeltaDiff as follows:
Diff = [DeltaDiff_1, DeltaDiff_2, ... DeltaDiff_n ] = [
(INSERT, byte offset relative to the initial file, bytes)
(DELETE, byte offset relative to the initial file, length)
....
(....)
]
Applying the DeltaDiffs to the initial file gives you the next file version, and so on, for example:
FileVersion1 + Diff1 -> FileVersion2 + Diff2 -> FileVersion3 + ....
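As a rough sketch (not tied to any particular library; the function name and the convention that offsets refer to the version the diff was computed against are assumptions), applying such a series in Python could look like this:

INSERT, DELETE = "INSERT", "DELETE"

def apply_delta(base, delta):
    # Apply a list of (op, byte offset, payload) tuples to `base` and return
    # the next version. Ops are applied from the highest offset down so that
    # edits at lower offsets are not shifted by edits made after them.
    out = bytearray(base)
    for op, offset, payload in sorted(delta, key=lambda d: d[1], reverse=True):
        if op == INSERT:
            out[offset:offset] = payload          # payload is the bytes to insert
        elif op == DELETE:
            del out[offset:offset + payload]      # payload is a length
        else:
            raise ValueError("unknown op: {}".format(op))
    return bytes(out)

# FileVersion1 + Diff1 -> FileVersion2
v1 = b"hello world"
v2 = apply_delta(v1, [(DELETE, 5, 6), (INSERT, 5, b" there")])  # b"hello there"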

Related

How to speed up XQuery string search?

I am using XQuery/BaseX to look through large XML files to find historical data for some counters. All the files are zipped and stored somewhere on a drive. The important part of each file looks as follows:
<measInfo xmlns="http://www.hehe.org/foo/" measInfoId="uplink speed">
<granPeriod duration="DS222S" endTime="2020-09-03T08:15:00+02:00"/>
<repPeriod duration="DS222S"/>
<measTypes>AFD123 AFD124 AFD125 AFD156</measTypes>
<measValue measObjLdn="PLDS-PLDS/STBHG-532632">
<measResults>23 42 12 43</measResults>
</measValue>
</measInfo>
I built the following query:
declare default element namespace "http://www.hehe.org/foo/";
let $sought := ["AFD124", "AFD125"]
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $filename := concat(convert:dateTime-to-integer(current-dateTime()), ".xml")
for $meas in $datasource/measCollecFile/measData/measInfo return
for $measType at $i in $meas/tokenize(measTypes)[. = $sought] return
file:append($filename,
<meas
measInfoId="{data($meas/@measInfoId)}"
measObjLdn="{data($meas/measValue/@measObjLdn)}"
>
{$meas/granPeriod}
{$meas/repPeriod}
<measType>{$measType}</measType>
<measValue>{$meas/measValue/tokenize(measResults, " ")[$i]}</measValue>
</meas>)
The script works, but it takes a lot of time for some counters (measType). I read the documentation about indexing, and my idea is to somehow index all the measTypes (parts of the string), so that once I need to look through the whole archive for a counter, it can be accessed quickly. I am not sure whether that is possible when operating directly on archives, or whether I would have to create a new database from them. I would prefer not to, due to the size of the files. How do I create indexes for such a case?
It is not the answer to my question, but I have noticed that the execution time is much longer when I write XML nodes to a file. It is much faster to append any other string to a file:
concat($measInfo/@measInfoId, ",", $measInfo/measValue/@measObjLdn, ",",
$measInfo/granPeriod, ",", $measInfo/repPeriod, ",", $measType, ",",
$tokenizedValues[$i], "
"))
Why is that, and how can I speed up writing XML nodes to a file?
Also, I have noticed that appending a value to a file inside a for loop takes much longer, and I suspect that this is because the file has to be opened again in each iteration. Is there a way to keep the file open throughout the whole query?

Rename all contents of directory with a minimum of overhead

I am currently in the position where I need to rename all files in a directory. The chance that a file does not change name is minimal, and the chance that an old filename is the same as a new filename is considerable, making renaming conflicts likely.
Thus, simply looping over the files and renaming old->new is not an option.
The easy / obvious solution is to rename everything to have a temporary filename: old->tempX->new. Of course, to some degree, this shifts the issue, because now there is the responsibility of checking nothing in the old names list overlaps with the temporary names list, and nothing in the temporary names list overlaps with the new list.
Additionally, since I'm dealing with slow media and virus scanners that love to slow things down, I would like to minimize the actual actions on disk. Besides that, the user will be impatiently waiting to do more stuff. So if at all possible, I would like to process all files on disk in a single pass (by smartly re-ordering rename operations) and avoid exponential time shenanigans.
This last bit has brought me to a 'good enough' solution where I first create a single temporary directory inside my directory, move-rename everything into that, and finally move everything back into the old folder and delete the temporary directory. This gives me O(2n) actions on disk.
If possible, I'd love to get the on-disk complexity to O(n), even if it comes at a cost of increasing the in-memory actions to O(99999n). Memory is a lot faster after all.
I am personally not at-home enough in graph theory, and I suspect the entire 'rename conflict' thing has been tackled before, so I was hoping someone could point me towards an algorithm that meets my needs. (And yes, I can try to brew my own, but I am not smart enough to write an efficient algorithm, and I probably would leave in a logic bug that rears its ugly head rarely enough to slip through my testing. xD)
One approach is as follows.
If file A renames to B and B is a new name, we can simply rename A.
If file A renames to B, B renames to C, and C is a new name, we can follow the list in reverse and rename B to C, then A to B.
In general this will work provided there is no loop: simply make a list of all the dependencies and then rename in reverse order.
If there is a loop we have something like this:
A renames to B
B renames to C
C renames to D
D renames to A
In this case we need a single temporary file per loop.
Rename the first in the loop, A to ATMP.
Then our list of modifications becomes:
ATMP renames to B
B renames to C
C renames to D
D renames to A
This list no longer has a loop so we can process the files in reverse order as before.
The total number of file moves with this approach will be n + number of loops in your rearrangement.
Example code
So in Python this might look like this:
D = {1: 2, 2: 3, 3: 4, 4: 1, 5: 6, 6: 7, 10: 11}  # Map from start name to final name

def rename(start, dest):
    moved.add(start)
    print('Rename {} to {}'.format(start, dest))

moved = set()
filenames = set(D.keys())
tmp = 'tmp file'

for start in list(D):  # copy the keys, since D gains an entry for tmp inside the loop
    if start in moved:
        continue
    A = []  # List of files to rename
    p = start
    while True:
        A.append(p)
        dest = D[p]
        if dest not in filenames:
            break
        if dest == start:
            # Found a loop
            D[tmp] = D[start]
            rename(start, tmp)
            A[0] = tmp
            break
        p = dest
    for f in A[::-1]:
        rename(f, D[f])
This code prints:
Rename 1 to tmp file
Rename 4 to 1
Rename 3 to 4
Rename 2 to 3
Rename tmp file to 2
Rename 6 to 7
Rename 5 to 6
Rename 10 to 11
Looks like you're looking at a sub-problem of topological sort.
However it's simpler, since each file can depend on just one other file.
Assuming that there are no loops:
Supposing map is the mapping from old names to new names:
In a loop, just select any file to rename and send it to a function which:
1. if its destination name is not conflicting (a file with the new name doesn't exist), just renames it
2. else (a conflict exists):
   2.1 renames the conflicting file first, by sending it to the same function recursively
   2.2 then renames this file
A sort-of Java pseudo-code would look like this:
// map is the map, map[oldName] = newName;
HashSet<String> oldNames = new HashSet<String>(map.keys());
while (oldNames.size() > 0)
{
    String file = oldNames.first(); // Just selects any filename from the set
    renameFile(map, oldNames, file);
}
...
void renameFile (map, oldNames, file)
{
    if (oldNames.contains(map[file]))
    {
        renameFile(map, oldNames, map[file]); // rename the conflicting file first
    }
    OS.rename(file, map[file]); // actual renaming of the file on disk
    map.remove(file);
    oldNames.remove(file);
}
I believe you are interested in a graph-theory modeling of the problem, so here is my take on it:
You can build the bidirectional mapping of old file names to new file names as a first stage.
Now, you compute the intersection set I of the old filenames and the new filenames. Each target "new filename" appearing in this set requires the corresponding "old filename" to be renamed first. This is a dependency relationship that you can model in a graph.
Now, to build that graph, we iterate over that I set. For each element e of I:
Insert a vertex in the graph representing the file e needing to be renamed if it doesn't exist yet
Get the "old filename" o that has to be renamed into e
Insert a vertex representing o into the graph if it doesn't already exist
Insert a directed edge (e, o) in the graph. This edge means "e must be renamed before o". If that edge introduces a cycle, do not insert it and instead mark o as a file that needs to be moved-and-renamed.
You now have to iterate over the roots of your graph (vertices that have no in-edges), perform a BFS using them as starting points, and rename each file when its vertex is discovered. The renaming is either a plain rename or a move-and-rename, depending on whether the vertex was tagged.
The last step is to move the moved-and-renamed files back from their sandbox directory to the target directory.
C++ Live Demo to illustrate the graph processing.
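For comparison with the Python example earlier on this page, here is a rough Python sketch of the graph processing described above; the function name, the sandbox directory name, and the returned (source, destination) operation list are illustrative assumptions, and the caller would still apply the operations with os.rename / shutil.move:

import os
from collections import deque

def plan_graph_renames(renames, tmp_dir="_rename_tmp"):
    # renames: old name -> new name. Self-renames are dropped up front.
    renames = {old: new for old, new in renames.items() if old != new}
    old_names = set(renames)
    succ = {}          # succ[e] = o  means "e must be renamed before o"
    sandboxed = set()  # files diverted through tmp_dir to break a cycle
    for o, e in renames.items():
        if e not in old_names:
            continue               # target name is free, no dependency
        # Adding edge e -> o: check whether it would close a cycle.
        node, cycle = o, False
        while node in succ:
            node = succ[node]
            if node == e:
                cycle = True
                break
        if cycle:
            sandboxed.add(o)       # move-and-rename o via the sandbox instead
        else:
            succ[e] = o
    ops = []
    has_pred = set(succ.values())
    queue = deque(v for v in renames if v not in has_pred)  # roots: no in-edges
    while queue:
        v = queue.popleft()
        dst = renames[v]
        ops.append((v, os.path.join(tmp_dir, dst) if v in sandboxed else dst))
        if v in succ:
            queue.append(succ[v])  # its dependant can be renamed now
    # Last step: move the sandboxed files back into the target directory.
    for v in sandboxed:
        ops.append((os.path.join(tmp_dir, renames[v]), renames[v]))
    return ops

Because each old name has at most one dependant and each target name is claimed by at most one file, the graph is a set of simple chains once cycles are broken, so every file is renamed exactly once, plus one extra move per cycle.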

Perl efficiency in writing in file

I am creating a database with some information about files,
e.g.: file_name | size | modify_date ...
I was wondering which is more efficient in this situation:
1) For each file get the info and print them in my file
foreach my $file (@listOfFiles) {
    my %temporary_hash = get_info_for_file($file);   # store the information for the current file in a temporary hash
    print_info(%temporary_hash, $output_file);       # print the information to my output file
}
2) Store the info for every file in a hash and print all the hash at once
foreach my $file (@listOfFiles) {
    store_info_in_hash(get_info_for_file($file), %hash);   # for each file, store the information in a global hash
}
print_all_info(%hash, $output_file);   # after I have the information for each file, print the whole hash to my output file
You are wrong to consider efficiency before you have even got your program working.
You should write your code as clearly as possible and debug it. Only then, if it is not running fast enough for your purpose, should you put your code through a profiler to discover the bottlenecks that take the most time.
The two options you show will probably not be very different unless your files are enormous.
Doing a benchmark test on the two options, I got these results (and if I increase the amount of information per file, the difference between the two gets even bigger).

Update existing excel file template formulas using ruby

I had been using spreadsheet to read in a template Excel file, modify it and output a new file for the end user.
As far as I can identify from the documentation, spreadsheet provides no way to input or edit formulas in the produced document.
However, the purpose of my script is to read an undefined number of items from a site and enter them into the spreadsheet, then calculate totals and subtotals.
The end user (using excel or libreoffice etc) is then able to make slight modifications to the quantity of items whilst the totals update (due to formulas) as they are accustomed.
I have looked into the writeexcel gem which claims to be able to input formulas, but I can't see how to take an existing template file and modify it to produce my output. I can only create fresh workbooks.
Any tips please? I do not want to use Win32OLE.
This is surprisingly difficult; apparently all Gems for handling Excel files are missing some crucial functionality.
I can think of two approaches for this problem:
use a combination of spreadsheet (to read the Excel file) and writeexcel (to write the output file)
use an input file that already contains the required formulas on a separate "formula" sheet, and copy those formulas into the "real" sheet
Here's a simplistic version of the second approach:
require 'rubygems'
require 'spreadsheet'
Dir.chdir(File.dirname(__FILE__))
# input file, contains this data
# Sheet0: headers + data (for this simple demo, we will generate the data on-the-fly)
# Sheet1: Formula '=SUM(Worksheet1.A2:A255)' in cell A1
book = Spreadsheet.open 'in.xls'
sheet = book.worksheet 0
formulasheet = book.worksheet 1
# insert some input data (in a real application,
# this data would already be present in the input sheet)
rows = rand(20) + 1
(1..rows).each do |i|
  sheet[i,0] = i
end
# add total at bottom of column C
sheet[rows+1,2] = formulasheet[0,0]
# write output file
book.write 'out.xls'
However, this will fail if
you're using the same column for your input data and your totals (since then, the total will try to include itself in the calculation)

Compressed array of file paths and random access

I'm developing a file management Windows application. The program should keep an array of paths to all files and folders that are on the disk. For example:
0 "C:"
1 "C:\abc"
2 "C:\abc\def"
3 "C:\ghi"
4 "C:\ghi\readme.txt"
The array "as is" will be very large, so it should be compressed and stored on the disk. However, I'd like to have random access to it:
to retrieve any path in the array by index (e.g., RetrievePath(2) = "C:\abc\def")
to find index of any path in the array (e.g., IndexOf("C:\ghi") = 3)
to add a new path to the array (indexes of any existing paths should not change), e.g., AddPath("C:\ghi\xyz\file.dat")
to rename some file or folder in the database;
to delete existing path (again, any other indexes should not change).
For example, delete path 1 "C:\abc" from the database and still have 4 "C:\ghi\readme.txt".
Can someone suggest some good algorithm/data structure/ideas to do these things?
Edit:
At the moment I've come up with the following solution:
0 "C:"
1 "[0]\abc"
2 "[1]\def"
3 "[0]\ghi"
4 "[3]\readme.txt"
That is, common prefixes are compressed.
RetrievePath(2) = "[1]\def" = RetrievePath(1) + "\def" = "[0]\abc\def" = RetrievePath(0) + "\abc\def" = "C:\abc\def"
IndexOf() also works iteratively, something like that:
IndexOf("C:") = 0
IndexOf("C:\abc") = IndexOf("[0]\abc") = 1
IndexOf("C:\abc\def") = IndexOf("[1]\def") = 2
To add new path, say AddPath("C:\ghi\xyz\file.dat"), one should first add its prefixes:
5 [3]\xyz
6 [5]\file.dat
Renaming/moving file/folder involves just one replacement (e.g., replacing [0]\ghi with [1]\klm will rename directory "ghi" to "klm" and move it to the directory "C:\abc")
DeletePath() involves setting it (and all subpaths) to empty strings. In future, they can be replaced with new paths.
After DeletePath("C:\abc"), the array will be:
0 "C:"
1 ""
2 ""
3 "[0]\ghi"
4 "[3]\readme.txt"
The whole array still needs to be loaded into RAM to perform fast operations. With, for example, 1,000,000 files and folders in total and an average filename length of 10, the array will occupy over 10 MB.
Also, the function IndexOf() is forced to scan the array sequentially.
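To make the scheme concrete, here is a small Python sketch of it, storing each entry as a (parent index, name) pair instead of a literal "[n]\name" string; the class and method names are made up for illustration, and recursive deletion of sub-paths is left out:

class PathArray:
    SEP = "\\"

    def __init__(self):
        self.entries = []  # index -> (parent_index, name), or None once deleted

    def retrieve_path(self, index):
        parts = []
        while index is not None:
            parent, name = self.entries[index]
            parts.append(name)
            index = parent
        return self.SEP.join(reversed(parts))

    def index_of(self, path):
        # Sequential scan, as noted above; a reverse dictionary or a trie
        # would speed this up at the cost of extra memory.
        for i, entry in enumerate(self.entries):
            if entry is not None and self.retrieve_path(i) == path:
                return i
        return -1

    def add_path(self, path):
        existing = self.index_of(path)
        if existing != -1:
            return existing
        parent_path, _, name = path.rpartition(self.SEP)
        parent = self.add_path(parent_path) if parent_path else None
        self.entries.append((parent, name))
        return len(self.entries) - 1

    def delete_path(self, index):
        # Mark the slot as free; the indices of all other entries never change.
        self.entries[index] = None

arr = PathArray()
for p in ["C:", "C:\\abc", "C:\\abc\\def", "C:\\ghi", "C:\\ghi\\readme.txt"]:
    arr.add_path(p)
print(arr.retrieve_path(2))     # C:\abc\def
print(arr.index_of("C:\\ghi"))  # 3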
Edit (2): I just realised that my question can be reformulated:
How can I assign each file and each folder on the disk a unique integer index so that I can quickly find a file/folder by index, find the index of a known file/folder, and perform basic file operations without changing many indices?
Edit (3): Here is a question about a similar but Linux-related problem. It is suggested to use filename and content hashing to identify a file. Are there some Windows-specific improvements?
Your solution seems decent. You could also try to compress more using ad-hoc tricks, such as using only a few bits for common characters like "\", drive letters, and maybe common file extensions. You could also have a look at tries (http://en.wikipedia.org/wiki/Trie).
Regarding your second edit, this seems to match the features of a hash table, but this is for indexing, not compressed storage.
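As a sketch of the hash-table direction for the reformulated question in Edit (2): keep both directions of the mapping in memory and hand out integer ids that are never reused, so renames and deletions leave every other index untouched (all names below are illustrative, and this trades the compressed storage away for lookup speed):

class PathIndex:
    def __init__(self):
        self.path_of = {}  # id -> path
        self.id_of = {}    # path -> id
        self.next_id = 0

    def add(self, path):
        if path in self.id_of:
            return self.id_of[path]
        i, self.next_id = self.next_id, self.next_id + 1
        self.path_of[i] = path
        self.id_of[path] = i
        return i

    def rename(self, old_path, new_path):
        i = self.id_of.pop(old_path)
        self.id_of[new_path] = i
        self.path_of[i] = new_path

    def delete(self, path):
        i = self.id_of.pop(path)
        del self.path_of[i]  # id i is never reused, so other ids stay stable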
