What is the fastest way to use a binary search tree when inserting a large number of sequential files? - binary-tree

I'm writing a program which sorts through folders of files and checks them against each other for duplicate names. To do this I get a list of all the file names and then run them through a binary tree. If the name exists in the tree, it marks the file as a duplicate and if it doesn't exist it adds the name to the tree.
The problem I'm running into is when a large batch of files is sequential (e.g. picture files where the entire name is identical except for a final number that increases sequentially), which causes the files to be placed on the right over and over and the depth of the tree to balloon. I'm looking for a way to reduce the time it takes to process these files.
I've tried an AVL Tree but the time it takes to continually balance the tree as hundreds of thousands of files are added (and again, constantly rebalancing due to the sequential nature of the file names) ends up taking longer than simply allowing the depth to reach the tens of thousands. Any help would be greatly appreciated.

Shihab Shahriar suggested randomly shuffling the array and that did the trick beautifully.
The tests were run on a folder containing 233,738 picture files. Prior to shuffling, the sequential nature of the picture files' names resulted in a binary tree depth of 34,227 and a processing time of just over 26 minutes; various batches of picture files were producing O(n) insertion and search on the binary tree. After simply shuffling the array containing all the files prior to inserting them into the binary tree, the depth was reduced to the mid-40s and the time to process the files dropped to around 2 minutes.
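For anyone who runs into the same thing, here is a minimal sketch of the shuffle-then-insert idea (the Node/insert names and the generated file names are just for illustration, not my actual code):

import random

class Node:
    # Hypothetical minimal node; the real program stores more per file.
    def __init__(self, name):
        self.name = name
        self.left = None
        self.right = None

def insert(root, name):
    """Insert name into the BST; return (root, was_duplicate)."""
    if root is None:
        return Node(name), False
    cur = root
    while True:
        if name == cur.name:
            return root, True                 # duplicate file name
        elif name < cur.name:
            if cur.left is None:
                cur.left = Node(name)
                return root, False
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(name)
                return root, False
            cur = cur.right

# Sequentially named files are the worst case for an unbalanced BST;
# shuffling first breaks up the sorted runs so the tree stays shallow.
file_names = [f"IMG_{i:06d}.jpg" for i in range(100_000)]
random.shuffle(file_names)

root, duplicates = None, []
for name in file_names:
    root, dup = insert(root, name)
    if dup:
        duplicates.append(name)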
Thanks for the help!

Related

Searching for ordered series in a file

I have two files where each line contains an ordered series of id/value pairs of varying length, as follows:
(id,value):(id,value):...
(1,2):(3,0):(60:3):....
Each line/series is considered a coverage point. File one is a list of all the coverage points that I need to find (this file could be big, 5,000,000+ lines). This file should never change, as it is a master list of all the points I need to cover. The second file is a run report that is generated by my program.
What I have to do now is write a script that takes the report file and, for every line/coverage point in it, searches the master file for an exact line that matches that coverage point. I need to find the line number so I can record the number of times I hit that coverage point.
First option: I go through each line in the report file and compare it to each line of the master file.
Second option: I have some way of initially sorting the master file so it is easier to search.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Fourth option: I use some sort of data structure initially loaded with the master file:
Linked list
Tree structure
Database
I think there are many ways of doing this, but I want to do it efficiently and without being too complex. If there are other approaches I haven't thought of, let me know. Any guidance at this point would be great.
Here are a few points as example
(53,0):
(53,1):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,0):(56,1):(57,1):
(53,2):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,1):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,0):
(53,1):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,1):
(53,2):(54,0):(55,0):(56,1):(57,1):
(53,1):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,1):(59,1):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,2):
(53,2):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,1):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,6):(59,1):(59,0):(59,1):(59,0):(59,1):(59,0):(60,0):
(53,1):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,1):
Let m = number of lines in the master-file
Let n = number of lines in the report-file
For 5,000,000+ lines it is most likely infeasible to load it into memory, so you have to cope with slow file I/O.
Some Comments:
First option: I go through each line in the report file and compare it to each line of the master file
This would be extremely slow (O(m*n)), and all of it is very slow file I/O.
If you can load the whole report-file into RAM, you could do so, then read the master-file line by line and search for each line in the report-file (in RAM). Still O(m*n), but you read both files only once.
Second option: I have some way of initially sorting the master file so it is easier to search
This would only be an advantage if you find a way to search the file at random line positions. [Without extra effort this is only possible if all lines have the exact same size.] So searching would speed up from O(m) to O(log m).
So overall performance would still be O((log m)*n), and all of it is very slow file I/O.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID
Implementing a hash function working on files is possible but kind of tricky. You could split up the master-file into many smaller files, where each file contains all lines corresponding to the same hash value (which is also the file name).
Fourth option: I use some sort of data structure initially loaded with the master file
My Answer:
You cannot keep the whole master-file in RAM (most likely).
If your report-file is much smaller, you can keep it in RAM.
Approach 1 (quite simple):
Build a HashSet of the report-file in RAM, then iterate over the master-file (once) and check whether each line is in your HashSet.
Takes O(m) time [+ O(n) to create the HashSet].
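For concreteness, a minimal Python sketch of this approach; a Counter plays the role of the HashSet so repeated report lines are counted, and the file names are placeholders:

from collections import Counter

with open("report.txt") as report:                       # O(n): build the set/counter
    report_hits = Counter(line.rstrip("\n") for line in report)

with open("master.txt") as master:                       # O(m): one sequential pass
    for line_no, line in enumerate(master, start=1):
        point = line.rstrip("\n")
        if point in report_hits:
            print(line_no, report_hits[point])           # master line number, hit count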
Approach 2: split the master-file into smaller files matching their hash value:
Load your report-file and store it in a HashSet. Then, for each hash value in the report-file hash set, look in the corresponding master-file-hash-value part to check for a match.
Takes O(n) time [+ O(n) to create the HashSet + O(m) to create the master-file hash files].
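A rough sketch of the same idea, with an arbitrary bucket count and placeholder paths; the bucket files play the role of the master-file-hash-value parts:

import os
from zlib import crc32

BUCKETS = 1024                                  # arbitrary choice for illustration
BUCKET_DIR = "master_buckets"

def bucket_path(point):
    return os.path.join(BUCKET_DIR, f"{crc32(point.encode()) % BUCKETS:04d}.txt")

# One-time O(m) pass: distribute master lines into per-hash bucket files.
os.makedirs(BUCKET_DIR, exist_ok=True)
handles = {}
with open("master.txt") as master:
    for line_no, line in enumerate(master, start=1):
        path = bucket_path(line.rstrip("\n"))
        if path not in handles:
            handles[path] = open(path, "w")
        handles[path].write(f"{line_no}\t{line.rstrip(chr(10))}\n")
for fh in handles.values():
    fh.close()

# Per report line, only the matching bucket has to be searched.
with open("report.txt") as report:
    for line in report:
        point = line.rstrip("\n")
        path = bucket_path(point)
        if not os.path.exists(path):
            continue                            # no master line hashed into this bucket
        with open(path) as bucket:
            for entry in bucket:
                master_line_no, master_point = entry.rstrip("\n").split("\t", 1)
                if master_point == point:
                    print(point, "is master line", master_line_no)
                    break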

Why is it important to delete files in-order to remove them faster?

Some time ago I learned that rsync deletes files much faster than many other tools.
A few days ago I came across this wonderful answer on Serverfault which explains why rsync is so good at deleting files.
Quotation from that answer:
I revisited this today, because most filesystems store their directory
structures in a btree format, the order of which you delete files is
also important. One needs to avoid rebalancing the btree when you
perform the unlink. As such I added a sort before deletes occur.
Could you explain how removing files in order prevents or reduces the number of btree rebalancings?
I expect the answer to show how deleting in order increases deletion speed, with details of what happens at the btree level. People who wrote rsync and other programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding to be able to write better software.
It is not important, nor is it a b-tree issue. It is just a coincidence.
First of all, this is very much implementation dependent and very much ext3 specific. That's why I said it's not important (for general use). Otherwise, put the ext3 tag or edit the summary line.
Second of all, ext3 does not use b-tree for the directory entry index. It uses Htree. The Htree is similar to b-tree but different and does not require balancing. Search "htree" in fs/ext3/dir.c.
Because of the htree-based index, a) ext3 has a faster lookup compared to ext2, but b) readdir() returns entries in hash value order. The hash value order is random relative to file creation time or the physical layout of data. As we all know, random access is much slower than sequential access on rotating media.
A paper on ext3 published for OLS 2005 by Mingming Cao, et al. suggests (emphasis mine):
to sort the directory entries returned by readdir() by inode number.
Now, onto rsync. Rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().
I did not test the following hypothesis because I do not have the data sets from which #MIfe got 43 seconds, but I assume that sorted-by-name was much closer to the optimal order compared to the random order returned by readdir(). That is why you saw a much faster result with rsync on ext3. What if you generate 1,000,000 files with random file names and then delete them with rsync? Do you see the same result?
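For illustration, a minimal sketch of the paper's suggestion (the directory path is a placeholder): take the scandir()/readdir() entries and delete them in inode order rather than hash order.

import os

directory = "/path/to/huge/dir"

entries = [(entry.inode(), entry.path) for entry in os.scandir(directory)
           if entry.is_file(follow_symlinks=False)]
entries.sort()                       # sort by inode number
for _inode, path in entries:
    os.unlink(path)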
Let's assume that the answer you posted is correct, and that the given file system does indeed store things in a balanced tree. Balancing a tree is a very expensive operation. Keeping a tree "partially" balanced is pretty simple, in that when you allow a tree to be slightly imbalanced, you only worry about moving things around the point of insertion/deletion. However, when talking about completely balanced trees, when you remove a given node, you may find that suddenly the children of this node could belong on the complete opposite side of the tree, or a child node on the opposite side has become the root node, and all of its children need to be rotated up the tree. This requires you to do either a long series of rotations, or to place all the items into an array and re-create the tree.
    5
  3   7
 2 4 6 8
now remove the 7, easy right?
    5
  3   8
 2 4 6
Now remove the 6, still easy, yes...?
    5
  3   8
 2 4
Now remove the 8, uh oh
    5
  3
 2 4
Getting this tree to be the proper balanced form like:
    4
  3   5
 2
Is quite expensive, at least compared to the other removals we have done, and gets exponentially worse as the depth of our tree increases. We could make this go much (exponentially) faster by removing the 2 and the 4 before removing the 8, particularly if our tree were more than 3 levels deep.
Without sorting, removal is on average O(K * log_I(N)^2), where N is the total number of elements, K the number to be removed, I the number of children a given node is permitted, and log_I(N) the depth; for each level of depth we increase the number of operations quadratically.
Removal with some ordering help is on average O(K * log_I(N)), though sometimes ordering cannot help you and you are stuck removing something that will require a re-balance. Still, minimizing this is optimal.
EDIT:
Another possible tree ordering scheme:
    8
  6   7
 1 2 3 4
Accomplishing optimal removal under this circumstance would be easier, because we can take advantage of our knowledge of how things are sorted. Either situation is possible (in fact both are identical); under this one the logic is just a little simpler to understand, because the ordering is more human-friendly for the given scenario. In either case, in-order is defined as "remove the farthest leaf first"; in this case it just so happens that the farthest leaves are also the smallest numbers, a fact we could take advantage of to make it even a little more optimal, but this fact is not necessarily true for the file system example presented (though it may be).
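To make "remove the farthest leaf first" concrete, here is a small sketch using the first example tree above: a post-order walk deletes every node while it is still a leaf, so no deletion ever has to reattach children (the analogue of avoiding the expensive rebalancing described above). The dict-based tree representation is just for illustration.

def postorder_deletion_order(tree, node):
    """tree maps node -> (left, right); None means no child."""
    if node is None:
        return []
    left, right = tree[node]
    return (postorder_deletion_order(tree, left)
            + postorder_deletion_order(tree, right)
            + [node])

# The first example tree: 5 with children (3, 7), which have children (2, 4) and (6, 8).
tree = {5: (3, 7), 3: (2, 4), 7: (6, 8),
        2: (None, None), 4: (None, None), 6: (None, None), 8: (None, None)}
print(postorder_deletion_order(tree, 5))   # [2, 4, 3, 6, 8, 7, 5]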
I am not convinced that the number of B-tree rebalancing changes significantly if you delete the files in-order. However I do believe that the number of different seeks to external storage will be significantly smaller if you do this. At any time, the only nodes in the B-tree that need be visited will then be the far right boundary of the tree, whereas, with a random order, each leaf block in the B tree is visited with equal probability for each file.
Rebalancing for B-trees is cheaper than for B+-tree implementations; that's why most filesystem and database index implementations use them.
There are many approaches to deletion; depending on the approach, it can be more efficient in terms of time and the need to rebalance the tree. You'll also have to consider the size of a node, since the number of keys the node can store will affect the need to rebalance the tree. A large node size will just reorder keys inside the node, but a small one will probably make the tree rebalance many times.
A great resource for understanding this is the famous CLR (Thomas Cormen) book "Introduction to Algorithms".
On storage systems hosting huge directories, the buffer cache will be under stress and buffers may get recycled. So, if deletes are spaced apart in time, the number of disk reads needed to bring the btree back into the buffer cache between deletes may be high.
If you sort the files to be deleted, you are effectively delaying the deletes and bunching them. This may have the side effect of more deletes per block of btree paged in. If there are stats showing the buffer cache hits for the two experiments, they may tell whether this hypothesis is wrong or not.
But if there is no stress on the buffer cache during the deletes, then the btree blocks could stay in core and my hypothesis is not a valid one.

Shortest sequence of operations transforming a file tree to another

Given two file trees A and B, is it possible to determine the shortest sequence of operations or a short sequence of operations that is necessary in order to transform A to B?
An operation can be:
Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder
A and B are identical when they have the same files with the same contents (or the same size and same CRC) and the same names, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following, basic idea:
Compute a database:
Store file names and their CRCs
Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
Ascend the tree to make a CRC for each parent folder
Use the following loop having database A and database B:
Compute A ∩ B and remove this intersection from both databases.
Use an inner join to find matching CRCs in A and B, folders first, order by size desc
while there is a result, use the first result to make a folder or file move (possibly creating new folders if necessary), and remove the source rows of the result from both databases. If there was a move, then update the CRCs of the new location's parent folders in db A.
Then remove all files and folders referenced in database A and create those referenced in database B.
However, I think that this is really a suboptimal way to do it. What advice could you give me?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step is to figure out which files need to be created/renamed/deleted.
A) Create a hash map of the files of Tree A
B) Go through the files of Tree B
B.1) If there is an identical (name and contents) file in the hash map, then leave it alone
B.2) If the contents are identical but the name is different, rename the file to the name in the hash map
B.3) If the file contents don't exist in the hash map, remove it
B.4) (if one of 1,2,3 was true) Remove the file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
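A rough Python sketch of steps A and B, assuming hypothetical helpers (file_hash, list_tree) and ignoring duplicate contents within a tree:

import hashlib
import os

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def list_tree(root):
    """Relative path -> content hash for every file under root."""
    out = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            out[os.path.relpath(full, root)] = file_hash(full)
    return out

a_files = list_tree("tree_a")                    # step A: hash map of tree A
a_by_hash = {h: p for p, h in a_files.items()}   # content hash -> path in A

to_rename, to_delete = [], []
for path_b, h in list_tree("tree_b").items():    # step B: go through tree B
    if a_files.get(path_b) == h:                 # B.1: identical name and contents
        a_by_hash.pop(h, None)                   # B.4: remove from the hash map
    elif h in a_by_hash:                         # B.2: same contents, different name
        to_rename.append((path_b, a_by_hash.pop(h)))
    else:                                        # B.3: contents not in the hash map
        to_delete.append(path_b)

to_create = list(a_by_hash.values())             # leftovers: files that must be created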
After the file differences have been resolved, it gets rather tricky. I wouldn't be surprised if there isn't an efficient optimal solution to this problem (NP-complete/hard).
The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann
Enumerate all files in B and their associated sizes and checksums;
sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums;
sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
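A sketch of the ordered-list comparison, assuming each tree has already been enumerated into (size, checksum, path) records; the save-aside list from step (c) is left out:

def compare(a_records, b_records):
    """Ordered comparison of two sorted lists of (size, checksum, path)."""
    a_records = sorted(a_records)              # sort by (size, checksum)
    b_records = sorted(b_records)
    i = j = 0
    to_delete, to_create, to_match = [], [], []
    while i < len(a_records) and j < len(b_records):
        key_a, key_b = a_records[i][:2], b_records[j][:2]
        if key_a < key_b:                      # in A but not B: delete it
            to_delete.append(a_records[i][2])
            i += 1
        elif key_a > key_b:                    # in B but not A: create it
            to_create.append(b_records[j][2])
            j += 1
        else:                                  # in both: rename/copy A's file to B's path
            to_match.append((a_records[i][2], b_records[j][2]))
            i += 1
            j += 1
    to_delete += [r[2] for r in a_records[i:]]
    to_create += [r[2] for r in b_records[j:]]
    return to_delete, to_create, to_match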
The only non-trivial problem is moving folders and files. Renaming, deleting and creating are trivial and can be done in the first step (or better, in the last, when you finish).
You can then transform this problem into the problem of transforming one tree into another with the same leaves but a different topology.
You decide which files will be moved out of some folder/bucket and which files will be left in the folder. The decision is based on the number of identical files in the source and destination.
You apply the same strategy to move folders in the new topology.
I think that you should be near optimal or optimal if you forget about the names of folders and think just about files and topology.

Find common words from two files

Given two files containing lists of words (around a million each), we need to find the words that are common to both.
Use some efficient algorithm; there is also not enough memory available (certainly not for 1 million words). Some basic C programming code, if possible, would help.
The files are not sorted. We can use some sort of algorithm... please support it with basic code.
Sorting the external file with minimum memory available: how can it be implemented in C?
Anybody game for external sorting of a file? Please share some code for this.
Yet another approach.
General. First, notice that doing this sequentially takes O(N^2). With N = 1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2*N*log(N) + 2N) = O(N*log(N)).
Sorting a file. Now let's address the fact that working with files is much slower than working with memory, especially when sorting, where you need to move things around. One way to solve this is to decide on a chunk size that can be loaded into memory, load the file one chunk at a time, sort it efficiently, and save it into a separate temporary file. The sorted chunks can then be merged (again, see below) into one sorted file in one pass.
Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry in each list. In each step, compare the values the pointers point to, move the smaller value to the merged list (the one you are constructing), and advance its pointer.
You can modify the merge algorithm easily to make it find the intersection: if the pointed-to values are equal, move the value to the results (consider how you want to deal with duplicates).
For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm for using k pointers.
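A hedged sketch of the whole pipeline, in Python rather than C for brevity: sort each word file in memory-sized chunks, k-way merge the sorted runs, then stream the two sorted files together with two pointers to find the common words. The chunk size and file names are placeholders.

import heapq
import os
import tempfile

def external_sort(path, chunk_size=100_000):
    """Sort a word file that may not fit in RAM: sorted runs + k-way merge."""
    runs = []
    with open(path) as f:
        while True:
            chunk = [line if line.endswith("\n") else line + "\n"
                     for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()                              # sort one chunk in memory
            run = tempfile.NamedTemporaryFile("w", delete=False)
            run.writelines(chunk)
            run.close()
            runs.append(run.name)
    handles = [open(r) for r in runs]
    out_path = path + ".sorted"
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*handles))         # merge all runs in one pass
    for h in handles:
        h.close()
    for r in runs:
        os.remove(r)
    return out_path

def common_words(sorted_a, sorted_b):
    """Two-pointer merge over two sorted files, yielding words found in both."""
    with open(sorted_a) as fa, open(sorted_b) as fb:
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a < b:
                a = fa.readline()
            elif a > b:
                b = fb.readline()
            else:                        # equal: a common word (duplicates not collapsed here)
                yield a.rstrip("\n")
                a, b = fa.readline(), fb.readline()

for word in common_words(external_sort("words1.txt"), external_sort("words2.txt")):
    print(word)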
If you had enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word), looping over the words of the second file and testing whether each word is contained in that dictionary. Memory for a million words is not much today.
If you do not have enough memory, split the first file into chunks that fit into memory and do as I said above for each of those chunks. For example, fill the dictionary with the first 100,000 words, find every common word for that, then read the file a second time extracting words 100,001 up to 200,000, find the common words for that part, and so on.
And now the hard part: you need a dictionary structure, and you said "basic C". When you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors. In basic C, you should also try to use a ready-made library for that; read this SO post to find a link to a free library which seems to support that.
Your problem is: given two sets of items, find the intersection (items common to both), while staying within the constraints of inadequate RAM (less than the size of any set).
Since finding an intersection requires comparing/searching each item in the other set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.
Assume that you know for a fact that the intersection is much smaller than both sets and fits completely inside available memory -- otherwise you'll have to do further work in flushing the results to disk.
If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts that fit the second 1/3. The remaining 1/3 of memory is used to store the results.
Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.
First find the intersection of both partitions through a double loop, storing common items in the results set and removing them from the original sets to save on comparisons further down the loop.
Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.
After running through the entire smaller set, repeat with the next partition of the larger set.
Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.
I would give prefix trees (aka tries) a shot.
My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.
Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.
Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory -- and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long -- then you have to process even the first file by parts. In that case, I would actually recommend using the trie for the larger file, not the smaller. Just partition it in parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.
Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.
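A simplified stand-in for the bounded-depth trie, written in Python for brevity: a dict keyed on the first DEPTH characters, whose values are lists of (offset, length) "file pointers" into the smaller file. The names and the depth of 3 are illustrative, not prescriptive.

DEPTH = 3                                  # arbitrary prefix depth; tune to fit RAM

def build_prefix_index(path, depth=DEPTH):
    """prefix (first `depth` bytes) -> list of (offset, length) into `path`."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            word = line.rstrip(b"\n")
            index.setdefault(word[:depth], []).append((offset, len(word)))
            offset += len(line)
    return index

def words_in_both(small_path, big_path, depth=DEPTH):
    index = build_prefix_index(small_path, depth)
    with open(small_path, "rb") as small, open(big_path, "rb") as big:
        for line in big:
            word = line.rstrip(b"\n")
            # A missing prefix fails fast without touching the small file at all.
            for offset, length in index.get(word[:depth], ()):
                if length != len(word):
                    continue
                small.seek(offset)
                if small.read(length) == word:
                    yield word.decode()
                    break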
If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in python, but should be relatively easy to implement in any language.
with open(file1) as file_1:
    current_word_1 = read_to_delim(file_1, delim)
    while current_word_1:
        # Re-opening file2 rewinds it for every word of file1.
        with open(file2) as file_2:
            current_word_2 = read_to_delim(file_2, delim)
            while current_word_2:
                if current_word_2 == current_word_1:
                    print(current_word_2)
                current_word_2 = read_to_delim(file_2, delim)
        current_word_1 = read_to_delim(file_1, delim)
I leave read_to_delim to you, but this is the extreme case that is memory-optimal but time-least-optimal.
Depending on your application, of course, you could load the two files into a database, perform a left outer join, and discard the rows for which one of the two columns is null.

Building a directory tree from a list of file paths

I am looking for a time efficient method to parse a list of files into a tree. There can be hundreds of millions of file paths.
The brute force solution would be to split each path on occurrence of a directory separator, and traverse the tree adding in directory and file entries by doing string comparisons but this would be exceptionally slow.
The input data is usually sorted alphabetically, so the list would be something like:
C:\Users\Aaron\AppData\Amarok\Afile
C:\Users\Aaron\AppData\Amarok\Afile2
C:\Users\Aaron\AppData\Amarok\Afile3
C:\Users\Aaron\AppData\Blender\alibrary.dll
C:\Users\Aaron\AppData\Blender\and_so_on.txt
From this ordering my natural reaction is to partition the directory listings into groups... somehow... before doing the slow string comparisons. I'm really not sure. I would appreciate any ideas.
Edit: It would be better if this tree were lazy loaded from the top down if possible.
You have no choice but to do full string comparisons since you can't guarantee where the strings might differ. There are a couple tricks that might speed things up a little:
As David said, form a tree, but search for the new insertion point from the previous one (perhaps with the aid of some sort of matchingPrefix routine that will tell you where the new one differs).
Use a hash table for each level of the tree if there may be very many files within and you need to count duplicates. (Otherwise, appending to a stack is fine.)
If it's possible, you can generate your tree structure with the tree command.
To take advantage of the "usually sorted" property of your input data, begin your traversal at the directory where your last file was inserted: compare the directory name of the current pathname to the previous one. If they match, you can just insert here; otherwise pop up a level and try again.
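A sketch of that idea: keep a stack of (path components, node) pairs for the previous insertion point, and only pop and push where the new path diverges, so shared prefixes are never re-traversed from the root. The nested-dict tree and the backslash separator are placeholder choices.

def build_tree(sorted_paths, sep="\\"):
    root = {}                                  # directory node: name -> child dict (None for files)
    stack = [([], root)]                       # (components so far, node)
    for path in sorted_paths:
        parts = path.split(sep)
        dirs, fname = parts[:-1], parts[-1]
        # Pop until the stack's prefix matches the start of this path.
        while len(stack) > 1 and stack[-1][0] != dirs[:len(stack[-1][0])]:
            stack.pop()
        # Descend (creating nodes) only for the components not already on the stack.
        comps, node = stack[-1]
        for comp in dirs[len(comps):]:
            node = node.setdefault(comp, {})
            comps = comps + [comp]
            stack.append((comps, node))
        node[fname] = None                     # leaf entry for the file
    return root

tree = build_tree([
    r"C:\Users\Aaron\AppData\Amarok\Afile",
    r"C:\Users\Aaron\AppData\Amarok\Afile2",
    r"C:\Users\Aaron\AppData\Blender\alibrary.dll",
])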
