I have two files in which each line contains an ordered series of id/value pairs, of varying length, as follows:
(id,value):(id,value):...
(1,2):(3,0):(60,3):....
Each line/series is considered a coverage point. File one is a list of all the coverage points that I need to find (this file could be big: 5,000,000+ lines). It should never change, as it is the master list of all the points I need to cover. The second file is a run report that is generated by my program.
What I have to do now is write a script that takes the report file and, for every line (coverage point) in it, searches the master file for an exactly matching line. I need the matching line number so I can record how many times I hit that coverage point.
First option: I go through each line in the report file and compare it to each line of the master file.
Second option: I have some way of initially sorting the master file so it is easier to search.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Fourth option: I use some sort of data structure initially loaded with the master file:
Linked list
Tree structure
Database
I think there are many ways of doing this, but I want to do it as efficiently as possible without it being too complex. If there are other options I haven't thought of, let me know. Any guidance at this point would be great.
Here are a few points as example
(53,0):
(53,1):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,0):(56,1):(57,1):
(53,2):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,1):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,0):
(53,1):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,1):
(53,2):(54,0):(55,0):(56,1):(57,1):
(53,1):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,1):(59,1):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,2):
(53,2):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,1):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,6):(59,1):(59,0):(59,1):(59,0):(59,1):(59,0):(60,0):
(53,1):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,1):
Let m = number of lines in the master-file
Let n = number of lines in the report-file
For 5,000,000+ lines it is most likely infeasible to load the master-file into memory, so you have to cope with slow file-IO.
Some Comments:
First option: I go through each line in the report file and compare it to each line of the master file.
This would be extremely slow (O(m*n)), and all of it is very slow file-IO.
If you can load the whole report-file into RAM, you could do so, then read the master-file line by line and search for each line in the in-memory report-file.
This is still O(m*n), but you read both files only once.
Second option: I have some way of initially sorting the master file so it is easier to search.
This would only be an advantage if you find a way to jump to arbitrary line positions in the file. [Without extra effort this is only possible if all lines have exactly the same size.] Searching for one line would then speed up from O(m) to O(log m).
So overall performance would be O(n*log m), and all of it is still very slow file-IO.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Implementing a hash function that works on files is possible but kind of tricky. You could split the master-file into a lot of smaller files, where each file contains all lines that share the same hash value (which is also the file name).
Fourth option: I use some sort of data structure initially loaded with the master file.
My Answer:
You cannot keep the whole master-file in RAM (most likely).
If your report-file is much smaller, you can keep it in RAM.
Approach 1 (quite simple):
Build a HashSet of the report-file in RAM, then iterate over the master-file (once) and check for each line whether it is in your HashSet.
Takes O(m) time [+ O(n) to create the HashSet].
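A minimal Python sketch of Approach 1, assuming the report fits in RAM. The function name count_hits and the 1-based line numbering are my own choices for illustration; a Counter is used instead of a plain set so that the number of hits per coverage point is kept, as the question asks.

```python
from collections import Counter

def count_hits(master_lines, report_lines):
    """Return {master line number: hits} for master lines that occur in the report."""
    # Count how often each coverage point occurs in the (smaller) report.
    report_counts = Counter(line.rstrip("\n") for line in report_lines)
    hits = {}
    # Stream the master list once; only the current line is held in memory.
    for line_no, line in enumerate(master_lines, start=1):
        n = report_counts.get(line.rstrip("\n"))
        if n:
            hits[line_no] = n
    return hits
```

Both arguments can be open file objects, so the master-file is never loaded as a whole.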
Approach 2: split the master-file into smaller files according to hash value:
Load your report-file and store it in a HashSet. Then, for each hash value in the report-file HashSet, look in the corresponding master-file part to check for a match.
Takes O(n) time [+ O(n) to create the HashSet + O(m) to create the master-file hash files].
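A rough Python sketch of Approach 2 under some assumptions of mine: zlib.crc32 serves as the hash, 64 bucket files are used, and each bucket row stores the original master line number followed by a tab. The names bucket_of, split_master and find_line_number are hypothetical.

```python
import os
import zlib

def bucket_of(line, n_buckets):
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(line.encode("utf-8")) % n_buckets

def split_master(master_lines, out_dir, n_buckets=64):
    """Write each master line, tagged with its line number, to its bucket file."""
    os.makedirs(out_dir, exist_ok=True)
    files = [open(os.path.join(out_dir, f"{b}.txt"), "w", encoding="utf-8")
             for b in range(n_buckets)]
    try:
        for line_no, line in enumerate(master_lines, start=1):
            line = line.rstrip("\n")
            files[bucket_of(line, n_buckets)].write(f"{line_no}\t{line}\n")
    finally:
        for f in files:
            f.close()

def find_line_number(point, out_dir, n_buckets=64):
    """Return the master line number of `point`, or None if it is not listed."""
    path = os.path.join(out_dir, f"{bucket_of(point, n_buckets)}.txt")
    with open(path, encoding="utf-8") as f:
        for row in f:
            line_no, _, line = row.rstrip("\n").partition("\t")
            if line == point:
                return int(line_no)
    return None
```

The split is done once, since the master-file never changes. A lookup then scans only one bucket, so more buckets mean shorter scans, limited by how many files can be open at once during the split.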
I have two UTF-8 text files. Each row of a file contains a string that can include language-specific characters like Ü, Ö, ą, ę. The strings are in random order, vary in length, and can repeat. The first file has at least 3 million rows (it can easily exceed 1 billion rows). The second file is smaller; it usually has about 400 thousand rows (but can be much bigger).
I need to create a new file that contains the entries from file one, with all entries that appear in file two and all repeated entries removed.
Currently I sort both files and remove repeated entries. Then I write the entries of file one to a new file while checking whether they appear in the second file.
Is there any faster way to do this?
Edit
Memory is a problem. I don't copy these strings to memory, but operate on files. My friend suggested not copying to memory but working on file streams. After this change the execution time dropped significantly.
The administrator of the computer doesn't want to install a database on it.
After sorting, my code runs like this in a loop:
if stringFromFile1 < stringFromFile2 then writeToFile3 and get next stringFromFile1
else if stringFromFile1 == stringFromFile2 then dropStringFromFile1 and get next stringFromFile1
else if stringFromFile1 > stringFromFile2 then get next stringFromFile2 and go to line 1
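For reference, the loop above might be sketched in Python roughly as follows. The name sorted_difference is made up for illustration, and the sketch assumes both inputs were sorted with the same ordering that Python's string comparison uses (by code point); it also skips repeats in file one on the fly.

```python
def sorted_difference(sorted1, sorted2):
    """Yield unique items of sorted1 that do not occur in sorted2 (both ascending)."""
    it2 = iter(sorted2)
    cur2 = next(it2, None)
    prev = None
    for s in sorted1:
        if s == prev:
            continue  # repeated entry in file one
        prev = s
        # Advance file two until its current entry is >= s.
        while cur2 is not None and cur2 < s:
            cur2 = next(it2, None)
        if cur2 is None or s < cur2:
            yield s  # s does not occur in file two
        # Otherwise s == cur2, so s is dropped.
```

Because it is a generator over two iterators, both inputs can be file streams and only one line per file is held in memory.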
If you have a data structure such as a hash set available, you could just iterate over the files and add each line. Sets do not allow repetition, and a hash set should let you check in constant time whether an element already exists (in Java at least, the add method checks whether an element exists and, if it does not, adds the item to the set in constant time).
Once you have gone through both files, you can iterate over the hash set and write its contents to the output file. This should give you an algorithm that runs in linear time.
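A small Python sketch of the hash set idea, adapted to the goal stated in the question (entries of file one that are not in file two, without repeats). This is one possible variant rather than a literal transcription: the set is seeded with file two, and file one is streamed against it. The name unique_minus is hypothetical, and it assumes the unique strings fit in memory.

```python
def unique_minus(file1_lines, file2_lines):
    """Yield each line of file one once, in first-seen order, skipping lines in file two."""
    # Seed the set with file two so its entries are treated as already seen.
    seen = {line.rstrip("\n") for line in file2_lines}
    for line in file1_lines:
        line = line.rstrip("\n")
        if line not in seen:
            seen.add(line)  # later repeats of this line will be skipped
            yield line
```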
Forgot to mention: I am assuming that you do not have restrictions on memory consumption. If you do, you might want to try saving each line to a database, using the hash of each line as the primary key. Inserting a second row with the same primary key should fail, which ensures that the strings in the database are unique. Once you are done with the insertions, you can retrieve the values from the database and store them in a file.
My proposal is to preprocess file two and build a tree structure from it. For example, say file two looks like this:
bad
bass
absent
then your tree structure would be like this:
BEGIN -> b -> a -> d -> END
  |           |
  |           + -> s -> s -> END
  |
  +-> a -> b -> s -> e -> n -> t -> END
END designates the word delimiter (be it a space, a newline or something else).
Then you open file one as a file stream and read it byte by byte. At the beginning of the file, or at the first character after a delimiter, you start walking the tree from BEGIN. If the streamed bytes let you walk all the way to an END, you have found a matching word and should discard it. If not, the word is unique and need not be dropped. A word found to be unique must then be added to the tree so that its later repetitions are discarded.
The tree structure will take a substantial amount of memory, but still less than holding the unique words in some sort of array.
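A compact Python sketch of this idea, using nested dicts as the tree. It works on whole words and characters rather than on a raw byte stream, which is a simplification on my part; the names END, trie_insert and filter_unique are made up for illustration.

```python
END = object()  # marker key meaning "a word ends here"

def trie_insert(root, word):
    """Insert word into the trie; return True if it was new, False if already present."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    if END in node:
        return False
    node[END] = True
    return True

def filter_unique(file1_words, file2_words):
    """Return words of file one that are not in file two, each only once."""
    root = {}
    for w in file2_words:
        trie_insert(root, w)
    # A word from file one is kept only if inserting it creates a new END marker.
    return [w for w in file1_words if trie_insert(root, w)]
```

With file two containing bad, bass and absent as in the example, a word such as "bas" is kept: the path b, a, s exists in the tree, but no word ends there.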
There are a number of possible optimizations.
As Roman Saveljev suggested, you can keep a trie structure in memory. Depending on the entropy of the data, it may well fit in memory.
As the second file is sorted, you can run a binary search to check whether a record is in it (if you aren't doing this yet).
You can also keep a Bloom filter in memory to quickly rule out records that are not duplicates, so you avoid going to disk every time.
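To illustrate the last point, here is a minimal Bloom filter sketch in Python. The sizes, the use of SHA-256 with double hashing, and the names BloomFilter, add and might_contain are all assumptions for the sake of the example, not a tuned implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

If might_contain returns False for a record, it is certainly not in the second file and the disk lookup can be skipped; a True result still has to be confirmed on disk, because false positives are possible.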
There are two articles, A and B, which are very large. Get three or more successive words in A and check if they appear in B, and count how many times they appear. For example, if 'book' 'his' and 'her' appear in A, how many times do they appear in B?
I thought about splitting the entire content of B and then checking all 3 words in A with StringToken, but I am not sure about the algorithm efficiency.
Look into what a Hashtable is. Scan your file B word by word (you can use split if you don't care about memory usage on large files), and either put each word you find into the hashtable (when it is not there yet) or increment the count of times that word has been seen.
Then you just scan over A, looking at each set of 3 words with a rolling (sliding) window. This way you can increase the length of the window later without rewriting anything.
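A short Python sketch of these two steps. Note that, as described, it counts each word of the window separately in B; it does not check whether the three words occur next to each other in B. The name window_counts and the list-of-dicts result are my own choices.

```python
from collections import Counter, deque

def window_counts(words_a, words_b, size=3):
    """For each run of `size` successive words in A, look up each word's count in B."""
    counts_b = Counter(words_b)    # hashtable: word -> occurrences in B
    window = deque(maxlen=size)    # rolling window over A
    result = []
    for word in words_a:
        window.append(word)
        if len(window) == size:
            result.append({w: counts_b[w] for w in window})
    return result
```

Changing size is all that is needed to widen the window later.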
For reference, you should really tag homework questions as such.
It is obvious that you need to scan/parse the entire content of B once to get the results. You just cannot avoid doing that. Read it line by line. For every line, search for the given query terms and count their occurrences in that line. Add up the per-line counts to get the final result.
If you want to do this kind of computation many times on the content of B, for the same or different terms, creating an inverted index for B would be the best way.
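As a rough illustration, an inverted index over B could be built like this in Python. Splitting on whitespace and recording line numbers are simplifying assumptions, and build_inverted_index is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(lines):
    """Map each word to the list of line numbers (1-based) where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for word in line.split():
            index[word].append(line_no)
    return index
```

Once built, the number of occurrences of a term is the length of its list, so repeated queries no longer need to rescan B.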
What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (i.e. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
I suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)
A standard merge sort approach will work. The common scheme is:
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
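The three steps above can be sketched in Python as follows. This is an illustration under several assumptions: the header row has already been removed, every line ends with a newline, no quoted field contains an embedded newline, and the number of parts stays below the open-file limit. The names external_sort and _key, and the chunk_size default, are made up.

```python
import csv
import heapq
import itertools
import tempfile

def _key(line, field):
    # csv.reader copes with quoted fields such as "Peter Smith".
    return next(csv.reader([line]))[field]

def external_sort(lines, out, field=3, chunk_size=100_000):
    """Sort an iterable of CSV lines by one column using bounded memory."""
    key = lambda line: _key(line, field)
    runs = []
    lines = iter(lines)
    while True:
        # Step 1: take the next part of the input.
        chunk = list(itertools.islice(lines, chunk_size))
        if not chunk:
            break
        # Step 2: sort the part in memory and park it in a temporary file.
        chunk.sort(key=key)
        run = tempfile.TemporaryFile("w+", encoding="utf-8", newline="")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    # Step 3: heapq.merge streams the sorted parts, one line per part in memory.
    out.writelines(heapq.merge(*runs, key=key))
    for run in runs:
        run.close()
```

Dates written as "1984/01/01" compare correctly as plain strings, so no date parsing is needed for the BirthDate example.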
No need to sort. Read the file ALL.CSV and append each line read to a file per day, like 19841231.CSV. Then, for each day that has data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible, for example by processing the original file more than once or by recording which days actually occur in ALL.CSV.
So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV has been appended to NEW.CSV. The numerical order avoids the use of any sort algorithm, although it could torture the file system.
In practice, ALL.CSV could first be split into one file per year, for example: 1984.CSV, 1985.CSV, and so on.
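A small Python sketch of the per-day file idea, assuming the header row is already removed and the date is the 4th field in "YYYY/MM/DD" form. The name bucket_sort_by_day is hypothetical. It records the days that actually occur and orders only that small set of day names, not the data lines themselves.

```python
import csv
import os

def bucket_sort_by_day(lines, work_dir, out, field=3):
    """Distribute lines into one file per day, then concatenate the day files in order."""
    os.makedirs(work_dir, exist_ok=True)
    days = set()
    for line in lines:
        day = next(csv.reader([line]))[field].replace("/", "")
        days.add(day)
        # Reopening per line is slow but keeps the number of open files at one.
        with open(os.path.join(work_dir, day + ".CSV"), "a", encoding="utf-8") as f:
            f.write(line)
    # Visit only the days that occurred, in numerical order.
    for day in sorted(days):
        with open(os.path.join(work_dir, day + ".CSV"), encoding="utf-8") as f:
            out.writelines(f)
```

Lines that share a day keep their original relative order, since each day file is only ever appended to.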