Which data structure is suitable for file comparison? - data-structures

Two files, each terabytes in size. A file comparison tool compares the i-th line of file1 with
the i-th line of file2; if they are the same, it prints them. Which data structure is suitable?
B-tree
Linked list
Hash tables
None of them

It can be done using Longest Common Subsequence, check this out...

Depends on how much memory you have and how fast it needs to go, though this really feels like an exam question rather than a real question. I'd go as far as to say that any of the above answers could be 'correct' depending on what exactly the machine specs were.

First, you'd need to make sure that both files are sorted (this could be done with a merge sort). Then you compare the two files line by line.
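Since the tool only ever compares the i-th line of one file with the i-th line of the other, a plain streaming read is arguably enough; nothing needs to be held in memory beyond the current pair of lines. A minimal sketch, with placeholder file names:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineCompare {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names; the real files are terabytes, so only the
        // current pair of lines is ever kept in memory.
        try (BufferedReader r1 = Files.newBufferedReader(Paths.get("file1.txt"));
             BufferedReader r2 = Files.newBufferedReader(Paths.get("file2.txt"))) {
            String a, b;
            long lineNo = 1;
            while ((a = r1.readLine()) != null && (b = r2.readLine()) != null) {
                if (a.equals(b)) {
                    System.out.println(lineNo + ": " + a);  // print matching lines
                }
                lineNo++;
            }
        }
    }
}
```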

Related

About multipass sort algorithm

I am reading Programming Pearls by Jon Bentley (reference).
Here the author mentions various sorting algorithms such as merge sort and multipass sort.
Questions:
How does the merge sort algorithm work by reading the input file only once, using work files, and writing the output file only once?
How does the author show that the 40-pass (i.e. multipass) sort algorithm works by writing to the output file only once and with no work files?
Can someone explain the above with a simple example, such as having memory to store 3 digits while needing to sort 10 digits, e.g. 9,0,8,6,5,4,1,2,3,7?
This is from Chapter 1 of Jon Bentley's
Programming Pearls, 2nd Edn (1999), which is an excellent book. The equivalent example from the first edition is slightly different; the multipass algorithm only made 27 passes over the data (and there was less memory available).
The sort described by Jon Bentley has special setup constraints.
File contains at most 10 million records.
Each record is a 7 digit number.
There is no other data associated with the records.
There is only 1 MiB of memory available when the sort must be done.
Question 1
The single read of the input file slurps as many lines from the input as will fit in memory, sorts that data, and writes it out to a work file. Rinse and repeat until there is no more data in the input file.
Then, complete the process by reading the work files and merging the sorted contents into a single output file. In extreme cases, it might be necessary to create new, bigger work files because the program can't read all the work files at once. If that happens, you arrange for the final pass to have the maximum number of inputs that can be handled, and have the intermediate passes merge appropriate numbers of files.
This is a general-purpose algorithm.
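A minimal sketch of that run-then-merge scheme, assuming plain text lines and hypothetical file names; it is not Bentley's program, and it assumes the final pass can open all work files at once (otherwise intermediate merge passes would be added, as noted above):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalMergeSort {
    static final int RUN_SIZE = 1_000_000;   // lines that fit in memory (an assumption)

    // One entry per work file in the merge heap: that file's current line plus its reader.
    static class Head implements Comparable<Head> {
        final String line; final BufferedReader src;
        Head(String line, BufferedReader src) { this.line = line; this.src = src; }
        public int compareTo(Head o) { return line.compareTo(o.line); }
    }

    public static void main(String[] args) throws IOException {
        // Single pass over the input: slurp as many lines as fit, sort, write a work file.
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"))) {
            List<String> buf = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == RUN_SIZE) { runs.add(writeRun(buf)); buf.clear(); }
            }
            if (!buf.isEmpty()) runs.add(writeRun(buf));
        }

        // Final pass: merge the sorted work files into one output file.
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String first = r.readLine();
            if (first != null) heap.add(new Head(first, r));
        }
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("output.txt")))) {
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                out.println(h.line);
                String next = h.src.readLine();
                if (next != null) heap.add(new Head(next, h.src));
                else h.src.close();
            }
        }
    }

    static Path writeRun(List<String> buf) throws IOException {
        Collections.sort(buf);                        // in-memory sort of one run
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, buf);                        // work file holding the sorted run
        return run;
    }
}
```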
Question 2
This is where the peculiar properties of the data are exploited. Since the numbers are unique and limited in range, the algorithm can read the file the first time, extracting numbers from the first fortieth of the range, sorting and writing those; then it extracts the second fortieth of the range, then the third, ..., then the last fortieth.
This is a special-purpose algorithm, exploiting the nature of the numbers.
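A rough sketch of that idea, under the book's stated constraints (unique 7-digit numbers, one per line); the file names and the 40-way split are illustrative, and this is not the author's code. Each pass re-reads the whole input, keeps only the numbers falling in the current fortieth of the range in a small bitmap (about 31 KB here), and appends them in order to the single output file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.BitSet;

public class MultipassBitmapSort {
    public static void main(String[] args) throws IOException {
        final int RANGE = 10_000_000;        // 7-digit numbers: 0..9,999,999
        final int PASSES = 40;               // number of passes over the input
        final int SLICE = RANGE / PASSES;    // 250,000 values per pass -> ~31 KB of bits

        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("sorted.txt")))) {
            for (int pass = 0; pass < PASSES; pass++) {
                int lo = pass * SLICE, hi = lo + SLICE;   // [lo, hi) handled this pass
                BitSet seen = new BitSet(SLICE);
                // Re-read the entire input, keeping only numbers in this slice.
                try (BufferedReader in = Files.newBufferedReader(Paths.get("numbers.txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        int n = Integer.parseInt(line.trim());
                        if (n >= lo && n < hi) seen.set(n - lo);  // numbers are unique, so one bit suffices
                    }
                }
                // Scan the bitmap in order and append to the single output file.
                for (int i = seen.nextSetBit(0); i >= 0; i = seen.nextSetBit(i + 1)) {
                    out.println(lo + i);
                }
            }
        }
    }
}
```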

Cleaning doubles out of a massive word list

I have a word list which is 56 GB and I would like to remove duplicates.
I've tried to approach this in Java but I run out of space on my laptop after 2.5M words.
So I'm looking for an (online) program or algorithm which would allow me to remove all duplicates.
Thanks in advance,
Sir Troll
edit:
What I did in Java was put the words in a TreeSet so they would be ordered and deduplicated.
I think the problem here is the huge amount of data. As a first step I would try to split the data into several files: e.g. make a file for every first character, putting words whose first character is 'a' into a.txt, words whose first character is 'b' into b.txt, and so on:
a.txt
b.txt
c.txt
...
Afterwards I would try using default sorting algorithms and check whether they work with files of that size. After sorting, cleaning out the doubles should be easy.
If the files remain too big, you can also split using more than one character (a rough sketch of the first-character split follows the list below),
e.g.:
aa.txt
ab.txt
ac.txt
...
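A rough sketch of the first-character split, assuming the words start with file-name-safe characters; the file names are hypothetical, and the per-bucket sort/dedup step is left as a comment:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SplitByPrefix {
    public static void main(String[] args) throws IOException {
        // One output file per first character, e.g. a.txt, b.txt, ...
        Map<Character, Writer> buckets = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("wordlist.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue;
                char key = Character.toLowerCase(word.charAt(0));
                Writer w = buckets.get(key);
                if (w == null) {
                    w = Files.newBufferedWriter(Paths.get(key + ".txt"));
                    buckets.put(key, w);
                }
                w.write(word);
                w.write('\n');
            }
        } finally {
            for (Writer w : buckets.values()) w.close();
        }
        // Each bucket should now (hopefully) fit in memory: load it into a TreeSet
        // to sort and deduplicate, then write it back out.
    }
}
```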
Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must have been done before. A quick search on Stack Overflow gave this
I suggest you use a Bloom Filter for this.
For each word, check if it's already present in the filter; otherwise insert it (or, rather, some good hash value of it).
It should be fairly efficient and you shouldn't need to provide it with more than a gigabyte or two for it to have practically no false positives. I leave it to you to work out the math.
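A minimal, hand-rolled sketch of the Bloom filter idea; the bit-array size, the number of hashes, and the double-hashing scheme here are crude illustrations (a real run would use a proper library filter), and note that any false positive silently drops a unique word:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.BitSet;

public class BloomDedup {
    // A deliberately simple Bloom filter: K positions derived from two base hashes.
    static final int BITS = 1 << 30;        // ~128 MB of bits; size to taste
    static final int K = 5;                 // number of hash functions
    static final BitSet bits = new BitSet(BITS);

    static int secondHash(int h) { return h >>> 16 | h << 16; }  // crude rotation, illustration only

    static boolean mightContain(String s) {
        int h1 = s.hashCode(), h2 = secondHash(h1);
        for (int i = 0; i < K; i++) {
            if (!bits.get(Math.abs((h1 + i * h2) % BITS))) return false;
        }
        return true;
    }

    static void put(String s) {
        int h1 = s.hashCode(), h2 = secondHash(h1);
        for (int i = 0; i < K; i++) {
            bits.set(Math.abs((h1 + i * h2) % BITS));
        }
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("wordlist.txt"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("unique.txt")))) {
            String word;
            while ((word = in.readLine()) != null) {
                if (!mightContain(word)) {   // a false positive here would drop a unique word
                    put(word);
                    out.println(word);
                }
            }
        }
    }
}
```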
I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5 million words, something is going wrong with your original approach. Even if we assume each word is unique within those 2.5 million (which basically rules out that what we're talking about is text in a natural language) and assume each word is on average 100 Unicode characters long, we're at 500 MB for storing the unique strings, plus some overhead for the set structure. Meaning: you should be doing fine, since those numbers are already heavily overestimated. Maybe before installing Hadoop you could try increasing your heap size?

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many tens of gigabytes) and contain variable-length records? All the algorithms I've seen assume that 1) the data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I want to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line consists of a fixed number of columns (delimiter-separated values). A file is sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Sorting and Searching, Vol. 3 of The Art of Computer Programming: Knuth: Addison-Wesley: ISBN 0-201-89685-0 (v. 3)
A standard merge sort approach will work. The common scheme is as follows (a sketch of the key-comparison piece follows the list):
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
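The run-and-merge machinery is the same as in any external merge sort (see the sketch in the Programming Pearls answer above); the CSV-specific piece is a comparator that orders whole lines by their 4th field. A minimal sketch, whose naive split assumes no commas inside quoted fields (true for the sample data, not for arbitrary CSV):

```java
import java.util.Comparator;

// Orders CSV lines by their 4th comma-separated field ("BirthDate").
public class BirthDateComparator implements Comparator<String> {
    // Extract the 4th field; a real CSV parser would be needed if quoted
    // fields could themselves contain commas.
    static String key(String line) {
        String[] fields = line.split(",", -1);
        return fields.length > 3 ? fields[3] : "";
    }

    @Override
    public int compare(String a, String b) {
        // "1984/01/01"-style dates compare correctly as plain strings.
        return key(a).compareTo(key(b));
    }
}
```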
No need to sort. Read the file ALL.CSV and append each line you read to a per-day file, like 19841231.CSV. Then, for each day that actually has data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible, for example by processing the original file more than once or by recording which days actually occur in ALL.CSV.
So a line containing "1985/02/28" would be appended to the file 19850228.CSV. The file 19850228.CSV is appended to NEW.CSV after the file 19850227.CSV has been appended to NEW.CSV. The numerical order avoids the use of any sorting algorithm, although it could torture the file system.
In reality, the file ALL.CSV could first be split into one file per year, for example: 1984.CSV, 1985.CSV, and so on.
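A deliberately naive sketch of this bucket-by-day idea, reusing the ALL.CSV/NEW.CSV names from the answer and assuming BirthDate is the 4th field; appending one line at a time is exactly the "torture the file system" caveat above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.TreeMap;

public class BucketByDay {
    public static void main(String[] args) throws IOException {
        // Pass 1: append every data line of ALL.CSV to a per-day file such as 19850228.CSV.
        // A TreeMap remembers the days that actually occur, already in numerical order.
        Map<String, Path> days = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("ALL.CSV"))) {
            String header = in.readLine();           // skip the "Id,UserId,Name,BirthDate" header
            String line;
            while ((line = in.readLine()) != null) {
                // "1984/01/01" (4th field, quoted) -> "19840101"
                String date = line.split(",", -1)[3].replace("\"", "").replace("/", "");
                Path day = days.computeIfAbsent(date, d -> Paths.get(d + ".CSV"));
                // Appending one line at a time is slow; real code would keep a
                // bounded pool of open writers instead.
                Files.write(day, (line + System.lineSeparator()).getBytes(),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
        // Pass 2: concatenate the per-day files in date order into NEW.CSV.
        try (Writer out = Files.newBufferedWriter(Paths.get("NEW.CSV"))) {
            for (Path day : days.values()) {
                for (String row : Files.readAllLines(day)) {
                    out.write(row);
                    out.write(System.lineSeparator());
                }
            }
        }
    }
}
```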

How to sort (million/billion/...) integers?

Sometimes interviewers ask how to sort a million/billion 32-bit integers (e.g. here and here). I guess they expect the candidates to compare O(N log N) sorts with radix sort. For a million integers an O(N log N) sort is probably better, but for a billion they are probably about the same. Does that make sense?
If you get a question like this, they are not looking for the answer. What they are trying to do is see how you think through a problem. Do you jump right in, or do you ask questions about the project requirements?
One question you had better ask is, "How optimal a solution does the problem require?" Maybe a bubble sort of records stored in a file is good enough, but you have to ask. Ask what happens if the input changes to 64-bit numbers; should the sort process be easily updated? Ask how long the programmer has to develop the program.
Those types of questions show me that the candidate is wise enough to see there is more to the problem than just sorting numbers.
I expect they're looking for you to expand on the difference between internal sorting and external sorting. Apparently people don't read Knuth nowadays.
As aaaa bbbb said, it depends on the situation. You would ask questions about the project requirements. For example, if they want to count the ages of the employees, you would probably use counting sort and could sort the data in memory. But when the data are totally random, you would probably use external sorting. For example, you can divide the data of the source file into different files, where every file covers a unique range (File1 covers 0-1M, File2 covers 1M+1 to 2M, etc.), then you sort every single file, and lastly merge them into a new file.
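For the employee-age case, counting sort needs nothing more than an array indexed by age. A tiny sketch; the 0..130 age range is an assumption:

```java
// Counting sort over a tiny key range; the 0..130 age range is an assumption.
public class AgeCountingSort {
    static int[] sortAges(int[] ages) {
        int[] counts = new int[131];                 // one counter per possible age
        for (int age : ages) counts[age]++;
        int[] sorted = new int[ages.length];
        int pos = 0;
        for (int age = 0; age < counts.length; age++) {
            for (int c = 0; c < counts[age]; c++) sorted[pos++] = age;
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] ages = {34, 21, 67, 21, 45};
        System.out.println(java.util.Arrays.toString(sortAges(ages)));   // [21, 21, 34, 45, 67]
    }
}
```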
Use a bit map. You need some 512 MB (2^32 bits) to represent the whole 32-bit integer range. For every integer in the given array, just set the corresponding bit (this assumes the integers are distinct, since a bitmap cannot record duplicates). Then simply scan your bit map from left to right and your integers come out sorted.
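A sketch of the bitmap idea, assuming the integers are distinct and that roughly 512 MB of heap is available for the bit array (e.g. run with -Xmx1g):

```java
public class BitmapSort {
    // One bit per possible 32-bit value: 2^32 bits = 512 MB worth of longs.
    // Duplicates collapse to a single bit, so the input must be distinct.
    static final long[] bitmap = new long[1 << 26];            // 2^32 / 64 longs

    static void set(int v) {
        long idx = (long) v - Integer.MIN_VALUE;               // order-preserving map to 0..2^32-1
        bitmap[(int) (idx >>> 6)] |= 1L << (int) (idx & 63);
    }

    static boolean get(long idx) {
        return (bitmap[(int) (idx >>> 6)] & (1L << (int) (idx & 63))) != 0;
    }

    public static void main(String[] args) {
        int[] data = {42, -7, 1_000_000, 0, -2_000_000_000};
        for (int v : data) set(v);
        // Scanning the bitmap left to right emits the values in sorted order.
        // The full scan over 2^32 positions is slow but strictly linear.
        for (long idx = 0; idx < (1L << 32); idx++) {
            if (get(idx)) System.out.println(idx + Integer.MIN_VALUE);
        }
    }
}
```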
It depends on the data structure they're stored in. Radix sort beats N log N sorts on fairly small problem sizes if the input is in a linked list, because it doesn't need to allocate any scratch memory; and if you can afford to allocate a scratch buffer the size of the input at the beginning of the sort, the same is true for arrays. It's really only the wrong choice (for integer keys) when you have very limited additional storage space and your input is in an array.
I would expect the crossover point to be well below a million regardless.
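A sketch of an LSD radix sort with 8-bit digits and the scratch buffer mentioned above; for simplicity it assumes non-negative integers (signed inputs would need the top byte's sign bit flipped):

```java
public class LsdRadixSort {
    // LSD radix sort on 8-bit digits: four stable counting-sort passes over the array.
    // Assumes non-negative ints for simplicity.
    public static void sort(int[] a) {
        int[] scratch = new int[a.length];                // the extra buffer radix sort needs
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : a) count[((v >>> shift) & 0xFF) + 1]++;
            for (int i = 1; i < 257; i++) count[i] += count[i - 1];   // prefix sums -> start offsets
            for (int v : a) scratch[count[(v >>> shift) & 0xFF]++] = v;
            System.arraycopy(scratch, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
        sort(data);
        System.out.println(java.util.Arrays.toString(data));   // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```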

Building a directory tree from a list of file paths

I am looking for a time efficient method to parse a list of files into a tree. There can be hundreds of millions of file paths.
The brute-force solution would be to split each path on occurrence of a directory separator and traverse the tree, adding directory and file entries by doing string comparisons, but this would be exceptionally slow.
The input data is usually sorted alphabetically, so the list would be something like:
C:\Users\Aaron\AppData\Amarok\Afile
C:\Users\Aaron\AppData\Amarok\Afile2
C:\Users\Aaron\AppData\Amarok\Afile3
C:\Users\Aaron\AppData\Blender\alibrary.dll
C:\Users\Aaron\AppData\Blender\and_so_on.txt
From this ordering my natural reaction is to partition the directory listings into groups... somehow... before doing the slow string comparisons. I'm really not sure. I would appreciate any ideas.
Edit: It would be better if this tree were lazy loaded from the top down if possible.
You have no choice but to do full string comparisons, since you can't guarantee where the strings might differ. There are a couple of tricks that might speed things up a little:
As David said, form a tree, but search for the new insertion point starting from the previous one (perhaps with the aid of some sort of matchingPrefix routine that will tell you where the new path first differs).
Use a hash table for each level of the tree if there may be very many files within and you need to count duplicates. (Otherwise, appending to a stack is fine.)
If it's possible, you can generate your tree structure with the tree command; see here.
To take advantage of the "usually sorted" property of your input data, begin your traversal at the directory where your last file was inserted: compare the directory name of the current pathname to the previous one. If they match, you can just insert here; otherwise pop up a level and try again.
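A sketch that combines these suggestions: a hash map of children per node (per the earlier answer) plus resuming from the previous insertion's chain of nodes when the new path shares a prefix. The class and method names are made up, and lazy top-down loading is not addressed:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathTreeBuilder {
    static class Node {
        final String name;
        final Map<String, Node> children = new HashMap<>();   // hash lookup per level
        Node(String name) { this.name = name; }
    }

    final Node root = new Node("");
    private String[] prevParts = new String[0];
    private List<Node> prevChain = new ArrayList<>();          // nodes along the previously inserted path

    // Insert one path, resuming from wherever it still matches the previous path,
    // so mostly-sorted input skips re-comparing the already-built prefix.
    void add(String path) {
        String[] parts = path.split("\\\\");                   // split on backslashes
        int common = 0;
        while (common < parts.length && common < prevParts.length
                && parts[common].equals(prevParts[common])) {
            common++;
        }
        Node node = common == 0 ? root : prevChain.get(common - 1);
        List<Node> chain = new ArrayList<>(prevChain.subList(0, common));
        for (int i = common; i < parts.length; i++) {
            node = node.children.computeIfAbsent(parts[i], Node::new);
            chain.add(node);
        }
        prevParts = parts;
        prevChain = chain;
    }

    public static void main(String[] args) {
        PathTreeBuilder b = new PathTreeBuilder();
        b.add("C:\\Users\\Aaron\\AppData\\Amarok\\Afile");
        b.add("C:\\Users\\Aaron\\AppData\\Amarok\\Afile2");
        b.add("C:\\Users\\Aaron\\AppData\\Blender\\alibrary.dll");
        System.out.println(b.root.children.keySet());          // [C:]
    }
}
```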
