Data Structure Selection - Printing Filepaths - algorithm

I had a question regarding printing the names of files. Say I start with something like a list of strings such as
files = [['documents', 'pics', 'cool.zip'], ['documents', 'homework'], ['Desktop', 'documents', 'file.jpg'], ['awesome.jpg'], ['turtles', 'homework']]
Essentially this is a list of lists of file paths. I'd like to try to take this and organize it into a data structure that will help to identify the links between the file paths.
I was thinking that a graph may be the best way to represent this, but typically I've seen graphs built from adjacency lists, which are also lists of lists, except that each sublist is a pair of items. Does anyone have feedback on the best data structure to use here? I'd ultimately like to reconstruct the graph and then print out its contents, depth first.

Usually, files are organised in a tree. You start with a "root" directory, which has a set of children, each of which is either a file or a directory with its own set of children (or a link/shortcut, but those make things more complicated than it sounds like you need here).
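To make that concrete, here is a minimal sketch (assuming the nested-list input from the question; the helper names are made up for illustration) that folds the path lists into a tree of nested dicts and prints it depth first:

    files = [['documents', 'pics', 'cool.zip'],
             ['documents', 'homework'],
             ['Desktop', 'documents', 'file.jpg'],
             ['awesome.jpg'],
             ['turtles', 'homework']]

    def build_tree(paths):
        root = {}  # every node is a dict mapping a name to its child node
        for path in paths:
            node = root
            for part in path:
                node = node.setdefault(part, {})
        return root

    def print_depth_first(node, depth=0):
        for name, child in sorted(node.items()):
            print('  ' * depth + name)
            print_depth_first(child, depth + 1)

    print_depth_first(build_tree(files))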

Related

Build hierarchy from sets of attributes

I'm having some trouble with what I believe to be some pretty basic stuff. Nevertheless I can't seem to find anything. Probably because I'm not asking the correct question.
Let's say I have three (potentially redundant) sets of data: A, B, C = (a,b,c), (a,b,d), (a,e,f).
What I need is for some tool to suggest a hierarchy for me.
Like so:
      (a)
     /   \
   (b)   (ef)
   /  \
 (c)  (d)
In reality there are far more sets and a lot of attributes within each set, but they are all closely related, and I don't want to find and build the hierarchy manually.
If you want to build a hierarchy out of plain tuples, go build a tree (or, rather, a forest) out of them!
In your case the tree would look like:
      c
     /
    b - d
   /
  a - e - f
The algorithm is trivial (see the sketch after this list):
pick the first element from the tuple
find a top-level element in the forest with this value (or create one if not found)
pick the next value from the tuple
find a matching element among the children of the previously found node (or create one if not found)
repeat until PROFIT
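A minimal sketch of that algorithm, with nested dicts standing in for forest nodes (names are illustrative):

    data = [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'e', 'f')]

    def build_forest(tuples):
        forest = {}  # top-level nodes; every node is a dict of its children
        for tup in tuples:
            node = forest
            for value in tup:
                # find the matching child, or create one if not found
                node = node.setdefault(value, {})
        return forest

    print(build_forest(data))
    # {'a': {'b': {'c': {}, 'd': {}}, 'e': {'f': {}}}}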

Shortest sequence of operations transforming a file tree to another

Given two file trees A and B, is it possible to determine the shortest sequence of operations, or at least a short sequence of operations, necessary to transform A into B?
An operation can be:
Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder
A and B are identical when they have the same files with the same contents (or the same size and CRC) and the same names, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following, basic idea:
Compute a database:
Store file names and their CRCs
Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
Ascend the tree to make a CRC for each parent folder
Then run the following loop over database A and database B:
Compute A ∩ B and remove this intersection from both databases.
Use an inner join to find matching CRCs in A and B, folders first, ordered by size descending.
While there is a result, use the first result to perform a folder or file move (creating new folders if necessary), and remove the result's source rows from both databases. If a move was made, update the CRCs of the new location's parent folders in database A.
Then remove all files and folders referenced in database A and create those referenced in database B.
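As a rough illustration of the database step, here is a sketch that computes a CRC for every file with zlib.crc32 and derives folder CRCs bottom-up; the way child CRCs are combined here is an assumption for illustration, not the asker's exact scheme:

    import os
    import zlib

    def file_crc(path, chunk=1 << 20):
        crc = 0
        with open(path, 'rb') as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                crc = zlib.crc32(block, crc)
        return crc

    def build_database(root):
        db = {}  # path -> CRC, for files and folders alike
        # topdown=False visits children before parents, so every folder
        # CRC can be combined from already-computed child CRCs
        for dirpath, dirnames, filenames in os.walk(root, topdown=False):
            child_crcs = []
            for name in filenames:
                p = os.path.join(dirpath, name)
                db[p] = file_crc(p)
                child_crcs.append(db[p])
            for name in dirnames:
                child_crcs.append(db[os.path.join(dirpath, name)])
            folder_crc = 0
            # sort so the folder CRC is independent of listing order
            for c in sorted(child_crcs):
                folder_crc = zlib.crc32(c.to_bytes(4, 'big'), folder_crc)
            db[dirpath] = folder_crc
        return db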
However, I think this is a rather suboptimal way to do it. What advice could you give me?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step is to figure out which files need to be created/renamed/deleted.
A) Create a hash map of the files of Tree B
B) Go through the files of Tree A
B.1) If there is an identical (name and contents) file in the hash map, leave it alone
B.2) If the contents are identical but the name is different, rename the file to the name in the hash map
B.3) If the file's contents don't exist in the hash map, remove it
B.4) (if one of 1, 2, 3 was true) Remove the file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
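A rough sketch of that diff step, with content hashes standing in for file contents; the trees are given as dicts mapping relative path to content hash, and all names are made up for illustration:

    def plan_file_ops(tree_a, tree_b):
        # hash map of Tree B: content hash -> set of paths in B
        by_content = {}
        for path, digest in tree_b.items():
            by_content.setdefault(digest, set()).add(path)

        ops = []
        for path, digest in tree_a.items():
            candidates = by_content.get(digest)
            if not candidates:
                ops.append(('delete', path))   # contents not wanted in B
            elif path in candidates:
                candidates.discard(path)       # identical file: leave it alone
            else:
                # same contents under a different name
                ops.append(('rename', path, candidates.pop()))
        # whatever is left in the hash map must be created
        for paths in by_content.values():
            for path in paths:
                ops.append(('create', path))
        return ops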
After the file differences have been resolved, it gets rather tricky. I wouldn't be surprised if there is no efficient optimal solution to this problem (NP-complete/hard).
The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
Demaine, Mozes, Rossman, and Weimann, "An Optimal Decomposition Algorithm for Tree Edit Distance"
Enumerate all files in B and their associated sizes and checksums;
sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums;
sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in both A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, set it aside in a separate list first; if you later need that file from A as a source, use the set-aside copy.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
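A sketch of that ordered comparison, with each file represented as a (size, checksum, path) tuple and the actual filesystem operations left as labels (all names are illustrative):

    def plan_sync(a, b):
        a, b = sorted(a), sorted(b)
        ops, i, j = [], 0, 0
        while i < len(a) or j < len(b):
            ka = a[i][:2] if i < len(a) else None
            kb = b[j][:2] if j < len(b) else None
            if kb is None or (ka is not None and ka < kb):
                ops.append(('delete', a[i][2]))   # in A but not in B
                i += 1
            elif ka is None or kb < ka:
                ops.append(('create', b[j][2]))   # in B but not in A
                j += 1
            else:
                if a[i][2] != b[j][2]:
                    # same size/checksum, different location: rename/move
                    ops.append(('rename', a[i][2], b[j][2]))
                i += 1
                j += 1
        return ops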
The only non-trivial problem is moving folders and files. Renaming, deleting, and creating are trivial and can be done in the first step (or better, in the last step, when you finish).
You can then reduce this to the problem of transforming one tree into another with the same leaves but a different topology.
You decide which files will be moved out of a given folder/bucket and which will be left in it; the decision is based on the number of identical files in the source and destination.
Apply the same strategy to move folders into the new topology.
I think you should be optimal, or near optimal, if you forget about the names of folders and think just about files and topology.

Merge sorted files efficiently

I need to merge about 30 gzip-ed text files, each about 10-15GB compressed, each containing multi-line records, each sorted by the same key. The files reside on an NFS share, I have access to them from several nodes, and each node has its own /tmp filesystem. What would be the fastest way to go about it?
Some possible solutions:
A. Leave it all to sort -m. To do that, I need to pass every input file through awk/sed/grep to collapse each record into a line and extract a key that sort would understand. So I would get something like
sort -m -k [...] <(preprocess file1) [...] <(preprocess filen) | postprocess
B. Look into python's heapq.merge.
C. Write my own C code to do this. I could merge the files in small batches, make an OMP thread for each input file, one for the output, and one actually doing the merging in RAM, etc.
Options for all of the above:
D. Merge a few files at a time, in a tournament.
E. Use several nodes for this, copying intermediate results in between the nodes.
What would you recommend? I don't have much experience with secondary-storage efficiency, so I find it hard to estimate how any of these would perform.
If you go for your solution B involving heapq.merge, then you will be delighted to know that Python 3.5 will add a key parameter to heapq.merge(), according to docs.python.org, bugs.python.org and github.com. This will be a great solution to your problem.
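For illustration, here is a minimal sketch of that approach on Python 3.5+; it assumes each record is a single line whose sort key is its first whitespace-separated field (adapt parse_key for real multi-line records), and the file names are made up:

    import gzip
    import heapq

    def parse_key(line):
        return line.split(None, 1)[0]

    def merge(in_paths, out_path):
        files = [gzip.open(p, 'rt') for p in in_paths]
        try:
            with open(out_path, 'w') as out:
                # heapq.merge consumes its inputs lazily, so memory use
                # stays bounded by roughly one line per input file
                for line in heapq.merge(*files, key=parse_key):
                    out.write(line)
        finally:
            for f in files:
                f.close()

    merge(['file1.gz', 'file2.gz'], 'merged.txt')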

Compression and Lookup of huge list of words

I have a huge list of multi-byte sequences (let's call them words) that I need to store in a file and be able to look up quickly. Huge means: about 2 million of them, each 10-20 bytes in length.
Furthermore, each word shall have a tag value associated with it, so that I can use that to reference more (external) data for each item (hence, a spellchecker's dictionary is not working here as that only provides a hit-test).
If this were just in memory, and if memory was plenty, I could simply store all words in a hashed map (aka dictionary, aka key-value pairs), or in a sorted list for a binary search.
However, I'd like to compress the data highly, and would also prefer not to have to read the data into memory but rather search inside the file.
As the words are mostly based on the English language, there's a certain likelihood that some "syllables" occur in the words more often than others, which is probably helpful for an efficient algorithm.
Can someone point me to an efficient technique or algorithm for this?
Or even code examples?
Update
I figure that a DAWG, or anything similar that routes paths into common suffixes, won't work for me, because then I won't be able to tag each complete word path with an individual value. If I were to detect common suffixes, I'd have to put them into their own dictionary (lookup table) so that a trie node could reference them, yet each node would keep its own ending node for storing that path's tag value.
In fact, that's probably the way to go:
Instead of building the tree nodes for single chars only, I could try to find often-used character sequences, and make a node for those as well. That way, single nodes can cover multiple chars, maybe leading to better compression.
Now, if that's viable, how would I actually find often-used sub-sequences in all my phrases?
With about 2 million phrases consisting of usually 1-3 words, it'll be tough to run all permutations of all possible substrings...
There exists a data structure called a trie. I believe this data structure is perfectly suited to your requirements. Basically, a trie is a tree where each node is a letter and each node has child nodes; in a letter-based trie there would be at most 26 children per node.
Depending on the language you are using, it may be easier or more efficient to store the children as a variable-length list during creation.
This structure gives:
a) Fast searching. For a word of length n, you can find the string by following n links in the tree.
b) Compression. Common prefixes are stored.
Example: the words BANANA and BANAL will share the B, A, N, A nodes, and then the last A node will have two children, L and N. Your nodes can also store other information about the word.
(http://en.wikipedia.org/wiki/Trie)
Andrew JS
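For concreteness, here is a minimal trie sketch (the node layout and names are my own illustration, not from the answer above) that tags each complete word with a value, as the question requires:

    class TrieNode:
        def __init__(self):
            self.children = {}  # char -> TrieNode
            self.tag = None     # set only on nodes that end a stored word

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word, tag):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.tag = tag  # tag the complete word path

        def lookup(self, word):
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return None
            return node.tag

    t = Trie()
    t.insert('banana', 42)
    t.insert('banal', 7)
    print(t.lookup('banana'), t.lookup('banal'), t.lookup('ban'))  # 42 7 None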
I would recommend using a Trie or a DAWG (directed acyclic word graph). There is a great lecture from Stanford on doing exactly what you want here: http://academicearth.org/lectures/lexicon-case-study
Have a look at the paper "How to squeeze a lexicon". It explains how to build a minimized finite state automaton (which is just another name for a DAWG) with a one-to-one mapping of words to numbers and vice versa. Exactly what you need.
You should get familiar with indexed files.
Have you tried just using a hash map? Thing is, on a modern OS architecture, the OS will use virtual memory to swap out unused memory segments to disk anyway. So it may turn out that just loading it all into a hash map is actually efficient.
And as jkff points out, your list would only be about 40 MB, which is not all that much.

Building a directory tree from a list of file paths

I am looking for a time efficient method to parse a list of files into a tree. There can be hundreds of millions of file paths.
The brute-force solution would be to split each path on every occurrence of a directory separator and traverse the tree, adding directory and file entries by doing string comparisons, but this would be exceptionally slow.
The input data is usually sorted alphabetically, so the list would be something like:
C:\Users\Aaron\AppData\Amarok\Afile
C:\Users\Aaron\AppData\Amarok\Afile2
C:\Users\Aaron\AppData\Amarok\Afile3
C:\Users\Aaron\AppData\Blender\alibrary.dll
C:\Users\Aaron\AppData\Blender\and_so_on.txt
From this ordering my natural reaction is to partition the directory listings into groups... somehow... before doing the slow string comparisons. I'm really not sure. I would appreciate any ideas.
Edit: It would be better if this tree were lazy loaded from the top down if possible.
You have no choice but to do full string comparisons, since you can't guarantee where the strings might differ. There are a couple of tricks that might speed things up a little:
As David said, form a tree, but search for the new insertion point from the previous one (perhaps with the aid of some sort of matchingPrefix routine that will tell you where the new one differs).
Use a hash table for each level of the tree if there may be very many files within and you need to count duplicates. (Otherwise, appending to a stack is fine.)
If it's possible, you can generate your tree structure with the tree command.
To take advantage of the "usually sorted" property of your input data, begin your traversal at the directory where your last file was inserted: compare the directory part of the current pathname to the previous one. If they match, you can just insert here; otherwise, pop up a level and try again.
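A sketch of that idea, resuming from the previous insertion point instead of walking from the root for every path (node layout and names are illustrative):

    def build_tree(paths, sep='\\'):
        root = {}
        prev_parts, prev_stack = [], [root]  # prev_stack[i] = node for prev_parts[:i]
        for path in paths:
            parts = path.split(sep)
            # length of the directory prefix shared with the previous path
            common, limit = 0, min(len(parts), len(prev_parts)) - 1
            while common < limit and parts[common] == prev_parts[common]:
                common += 1
            node, stack = prev_stack[common], prev_stack[:common + 1]
            for part in parts[common:-1]:    # descend, creating directories
                node = node.setdefault(part, {})
                stack.append(node)
            node[parts[-1]] = None           # a leaf marks the file itself
            prev_parts, prev_stack = parts, stack
        return root

    tree = build_tree([
        r'C:\Users\Aaron\AppData\Amarok\Afile',
        r'C:\Users\Aaron\AppData\Amarok\Afile2',
        r'C:\Users\Aaron\AppData\Blender\alibrary.dll',
    ])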
