Need help designing a more efficient search algorithm - algorithm

I have a problem from the field of biology. Right now I have 4 VERY LARGE files (each with 0.1 billion lines), but the structure is rather simple: each line of these files has only 2 fields, and both stand for a type of gene.
My goal is to design an efficient algorithm that achieves the following:
Find a circle within the contents of these 4 files. The circle is defined as:
field #1 in a line in file 1 == field #1 in a line in file 2 and
field #2 in a line in file 2 == field #1 in a line in file 3 and
field #2 in a line in file 3 == field #1 in a line in file 4 and
field #2 in a line in file 4 == field #2 in a line in file 1
I cannot think of a decent way to solve this, so for now I just wrote a brute-force, 4-layer nested loop. I'm thinking about sorting the files in alphabetical order, which might help a little, but it's also obvious that the computer's memory would not allow me to load everything at once. Can anybody tell me a good way to solve this problem that is efficient in both time and space? Thanks!!

First of all, I note that you can sort a file without holding it in memory all at once, and that most operating systems have some program that does this, often called just "sort". Usually you can get it to sort on a field within a file, but if not you can rewrite each line to get it to sort the way you want.
Given this, you can connect two files by sorting them so that the first is sorted on field #1 and the second on field #2. You can then create one record for each match, combining all the fields, and only holding in memory a chunk from each file where all the fields you have sorted on have the same value. This will allow you to connect the result with another file - four such connections should solve your problem.
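The connection step described above can be sketched as a sort-merge join. This is only a minimal sketch: the name `merge_join` and the convention that records are tuples indexed by field position are assumptions made for illustration, and both inputs are assumed to be already sorted on their join field.

```python
from itertools import groupby

def merge_join(left, right, left_key, right_key):
    """Join two iterables of tuples that are already sorted on their join field.

    Only one group of equal-key records from the right side is held in memory.
    """
    left_groups = groupby(left, key=lambda rec: rec[left_key])
    right_groups = groupby(right, key=lambda rec: rec[right_key])
    lk, lg = next(left_groups, (None, None))
    rk, rg = next(right_groups, (None, None))
    while lg is not None and rg is not None:
        if lk < rk:
            lk, lg = next(left_groups, (None, None))
        elif lk > rk:
            rk, rg = next(right_groups, (None, None))
        else:
            # Same key on both sides: emit every combination of the two chunks.
            right_rows = list(rg)
            for left_row in lg:
                for right_row in right_rows:
                    yield left_row + right_row
            lk, lg = next(left_groups, (None, None))
            rk, rg = next(right_groups, (None, None))
```

Each output record carries all fields of both inputs, so the result can be sorted again on whichever field the next file has to match.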
Depending on your data, the time it takes to solve your problem may depend on the order in which you make the connections. One rather naive way to make use of this is, at each stage, to take a small random sample from each file, and use this to see how many results will follow from each possible connection, and choose the connection that produces the fewest results. One way to take a random sample of N items from a large file is to take the first N lines in the file and then, when you have read in m lines so far, read the next line, and then with probability N/(m + 1) exchange one of the N lines held for it, else throw it away. Keep on until you have read through the whole file.
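The sampling procedure described above is commonly known as reservoir sampling. A minimal sketch follows; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for illustration.

```python
import random

def reservoir_sample(lines, n, rng=random):
    """Return a uniform random sample of up to n items from an iterable."""
    sample = []
    for m, line in enumerate(lines):  # m items have been read before this one
        if m < n:
            sample.append(line)       # keep the first n items
        elif rng.random() < n / (m + 1):
            # Replace a randomly chosen held item with the new one.
            sample[rng.randrange(n)] = line
    return sample
```

The file is read once from start to finish, and only `n` lines are held in memory at any time.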

Here is one algorithm:
Select an appropriate lookup structure: if field #1 is an integer, use bit-fields; if it is a string, use a dictionary (or a set). Use one lookup structure per file, i.e. 4 in your case.
Initialization phase: for each file, parse the file line by line and set the appropriate bit in the bit-field, or add the field to the dictionary, in the lookup structure that corresponds to the file.
After initializing the lookup structures above, check the condition in your question.
The complexity of this depends on the lookup structure implementation. For bit fields, a lookup is O(1); for a set or dictionary it is O(lg(n)) when implemented as a balanced search tree (hash-based implementations give O(1) on average). The complete complexity will be O(n) or O(n lg(n)); your solution in the question has a complexity of O(n^4).
You can get the code and solution for bit fields from here
HTH

Here is one approach:
We will use the notation Fxy, where x = field number and y = file number.
Sort each of the 4 files on the first field.
For each field F11, find a match in file 2. This will be linear. Save these matches with all four fields to a new file. Now use the corresponding field in this new file to get all the matches from file 3. Continue for file 4 and back to file 1.
In this way, as you progress to each new file, you are dealing with fewer lines. And since you have sorted the files, the search is linear and can be done by reading from disk.
Here the complexity is O(n log n) for sorting and O(m log n) for lookup, assuming m << n.

It's a bit easier to explain if your File 1 is the other way around (so each second element points to a first element in the next file).
1. Start with File 1, copy it to a new file writing each A, B pair as B, A, 'REV'
2. Append the contents of File 2 to it writing each A, B pair as A, B, 'FWD'
3. Sort the file
4. Process the file in chunks with the same initial value
5. Within that chunk group the lines into REV's and FWD's
6. Take the cartesian product of the revs and the fwds (nested loop)
7. Write a line with reverse(fwd) concat (rev) excluding the repeated token
   e.g. B, A, 'REV' and B, C, 'FWD' -> C, B, A, 'REV'
8. Append the next file to this new output file (adding 'FWD' to each line)
9. Repeat from step 3
In essence you are building up a chain in reverse order and using a file-based sort algorithm to put sequences together that can be combined.
Of course it would be even easier to just read these files into a database and let it do the work ...
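As a rough illustration of the database route, here is a minimal sketch using Python's built-in `sqlite3` module. The table names `f1` to `f4`, the column names `a` and `b` (field #1 and field #2), and the function name `find_circles` are assumptions made for this example; for files of the size in the question you would pass a file path instead of `":memory:"` and load the rows in batches.

```python
import sqlite3

def find_circles(files, db_path=":memory:"):
    """files: four iterables of (field1, field2) pairs, in file order 1..4."""
    con = sqlite3.connect(db_path)
    for i, rows in enumerate(files, start=1):
        con.execute(f"CREATE TABLE f{i} (a TEXT, b TEXT)")
        con.executemany(f"INSERT INTO f{i} VALUES (?, ?)", rows)
    # Index the columns that are looked up by the joins below.
    con.execute("CREATE INDEX idx_f2 ON f2 (a)")
    con.execute("CREATE INDEX idx_f3 ON f3 (a)")
    con.execute("CREATE INDEX idx_f4 ON f4 (a, b)")
    query = """
        SELECT f1.a, f1.b, f2.b, f3.b
        FROM f1
        JOIN f2 ON f1.a = f2.a                  -- file 1 field #1 == file 2 field #1
        JOIN f3 ON f2.b = f3.a                  -- file 2 field #2 == file 3 field #1
        JOIN f4 ON f3.b = f4.a AND f4.b = f1.b  -- closes the circle
    """
    result = con.execute(query).fetchall()
    con.close()
    return result
```

The query planner then picks the join order and the on-disk sort or index strategy, which is essentially the work the file-based steps above do by hand.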

Related

Searching for ordered series in a file

I have 2 files where each line contains an ordered series of id/value pairs of different sizes, as follows:
(id,value):(id,value):...
(1,2):(3,0):(60,3):....
Each line/series is considered a coverage point. File one is a list of all the coverage points that I need to find (this file could be big, 5,000,000+ lines). This file should never change, as it is a master list of all the points I need to cover. The second file is a run report that is generated by my program.
What I have to do now is write a script that takes the report file and, for every line/coverage point in it, searches the master file to see if there is an exact line that matches the coverage point. I need to find the line number so I can save the number of times I hit that coverage point.
First option: I go through each line in the report file and compare it to each line of the master file
Second option: I have some way of initially sorting the master file so it is easier to search
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Fourth option: I use some sort of data structure initially loaded with the master file
Linked list
Tree structure
Database
I think there are many ways of doing this, but I want to do it efficiently and without being too complex. If there are other options I haven't thought of, let me know. Any guidance at this point would be great.
Here are a few points as an example:
(53,0):
(53,1):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,0):(56,1):(57,1):
(53,2):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,1):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,0):
(53,1):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,1):
(53,2):(54,0):(55,0):(56,1):(57,1):
(53,1):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,1):(59,1):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,2):
(53,2):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,1):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,6):(59,1):(59,0):(59,1):(59,0):(59,1):(59,0):(60,0):
(53,1):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,1):
Let m= number of lines in the master-file
Let n= number of lines in the report-file
For 5,000,000+ lines it is most likely infeasible to load the file into memory, so you have to cope with slow file I/O.
Some comments:
First option: I go through each line in the report file and compare it to each line of the master file
This would be extremely slow (O(m*n)), and all of it is very slow file I/O.
If you can load the whole report-file into RAM, you could do so, then read the master-file line by line and search for each line in the report-file (in RAM).
This is still O(m*n), but you read both files only once.
Second option: I have some way of initially sorting the master file so it is easier to search
This would only be an advantage if you find a way to jump to random line positions in the file. [Without extra effort this is only possible if all lines have exactly the same size.] Searching would then speed up from O(m) to O(log m).
So overall performance would still be O((log m)*n), and all of it is very slow file I/O.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Implementing a hash function that works on files is possible but kind of tricky. You could split up the master-file into a lot of files. Each file includes all lines corresponding to the same hash value (which is also the file name).
Fourth option: I use some sort of data structure initially loaded with the master file
My Answer:
You cannot keep the whole master-file in RAM (most likely).
If your report-file is much smaller, you can keep it in RAM.
1. Approach (quite simple):
Make a HashSet of the report-file in RAM, then iterate over the master-file (once) and check for each line whether it is in your HashSet.
Takes O(m) time [+ O(n) to create the HashSet]
2. Approach: split the master-file into smaller files according to their hash value:
Load your report-file and store it in a HashSet. Then, for each hash value in the report-file HashSet, look in the corresponding master-file part for that hash value to check for a match.
Takes O(n) time [+ O(n) to create the HashSet + O(m) to create the master-file-hash-files]
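The first approach can be sketched as follows. This is a minimal sketch: the function name `count_hits` is an assumption, and a `Counter` is used in place of a plain hash set so that the number of hits per coverage point is available as well as the master line number.

```python
from collections import Counter

def count_hits(master_lines, report_lines):
    """Map master-file line number (1-based) to the number of report hits."""
    # Build the in-memory hash table from the (smaller) report file.
    hits = Counter(line.rstrip("\n") for line in report_lines)
    result = {}
    # Stream the master file once; each lookup is O(1) on average.
    for lineno, line in enumerate(master_lines, start=1):
        count = hits.get(line.rstrip("\n"))
        if count:
            result[lineno] = count
    return result
```

Both arguments can be open file objects, so neither file has to be read into a list first.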

What is the best data structure for only inserting and searching for particular words in a text file, line by line?

I am struggling with my homework: I have to design data structure(s) suitable for a specific scenario. I have a text to load line by line. After that I have to:
Print line numbers on which a given word exists.
Print the total number of times a given word occurs (on a specific line)
Print whole line of words.
I must not use a programming language; it must be only a description of the solution.
I was thinking about a linked list of arrays, where one node is a line that contains an array of words. I do not need much space for that, but in the worst case the search operation will be O(n*n).
I also have tries in mind; however, the number of lines and the number of words in a particular line are not defined, only bounded by a 4-byte integer, so it could use a lot of space.

Fast data extraction algorithm

I have 2 UTF-8 text files. Each row of a file contains a string that can include language-specific characters like Ü, Ö, ą, ę. The strings are in random order, of varying length, and can repeat. The first file has at least 3 million rows (it can easily exceed 1 billion rows). The second file is smaller; it usually has about 400 thousand rows (but can be much bigger).
I need to create a new file that contains the entries from file one, with the entries that appear in file two and all repeated entries removed.
Currently I'm sorting both files and removing repeated entries. Next I write them to a new file while checking whether they appear in the second file.
Is there any faster way to do this?
Edit
Memory is a problem. I don't copy these strings to memory, but operate on files. My friend suggested not copying to memory but working on file streams. After this, execution time dropped significantly.
The administrator of the computer doesn't want to install a database on it.
After sorting, my code runs like this in a loop:
if stringFromFile1 < stringFromFile2 then writeToFile3 and get next stringFromFile1
else if stringFromFile1 == stringFromFile2 then dropStringFromFile1 and get next stringFromFile1
else if stringFromFile1 > stringFromFile2 then get next stringFromFile2 and go to line 1
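The loop above can be sketched in Python as a single streaming pass over both sorted inputs. This is a minimal sketch: the name `sorted_difference` is an assumption, and both inputs are assumed to be sorted with the same ordering that Python uses to compare strings (code point order).

```python
def sorted_difference(file1, file2):
    """Yield entries of sorted file1 that are absent from sorted file2, once each."""
    it2 = iter(file2)
    s2 = next(it2, None)
    previous = None
    for s1 in file1:
        if s1 == previous:
            continue  # repeated entry in file one
        previous = s1
        # Advance file two until it catches up with the current entry.
        while s2 is not None and s2 < s1:
            s2 = next(it2, None)
        if s2 is None or s1 < s2:
            yield s1  # not present in file two
```

Only the current entry from each file is held in memory, so the cost after sorting is one sequential read of each file.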
If you have a data structure such as a hash set available, you could just iterate over the files and add each line. Sets do not allow repetition, and a hash set should give you a constant-time way of checking whether an element already exists (in Java at least, the add method checks if an element exists, and if it does not, it adds the item to the set in constant time).
Once you have gone through both files, you can then iterate over the hash set and store its content to the file. This should give you an algorithm that runs in linear time.
Forgot to mention: I am assuming that you do not have restrictions on memory consumption. If you do, you might want to try saving each line to a database, using the hash of each line as a primary key. Inserting an element whose primary key already exists should fail, thus making sure that you have unique strings in the database. Once you are done with the insertions, you can retrieve the values from the database and store them to a file.
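A minimal sketch of the hash-set idea, assuming memory is not a constraint. Note that adding the lines of both files to one set would produce their union, so this sketch keeps file two in its own set and uses it to exclude entries while streaming file one. The name `unique_not_in_second` is an assumption made for illustration.

```python
def unique_not_in_second(file1_lines, file2_lines):
    """Yield each entry of file one once, skipping entries found in file two."""
    excluded = set(file2_lines)  # entries that must be removed
    seen = set()                 # entries already written
    for line in file1_lines:
        if line not in excluded and line not in seen:
            seen.add(line)
            yield line
```

Unlike the sort-based approach, this preserves the original order of file one, at the cost of holding all unique entries in memory.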
My proposal is to preprocess file two and form a tree structure from it. For example, say you have this kind of file two:
bad
bass
absent
then your tree structure would be like this:
BEGIN -> b -> a -> d -> END
|             |
|             +-> s -> s -> END
|
+-> a -> b -> s -> e -> n -> t -> END
END designates the word delimiter (be it a space, a new line, or something else).
Then you open file one as a file stream and read it byte by byte. At the beginning of the file, or at the first character after a delimiter, you start walking your tree from the root. If the streamed bytes let you walk it to END, you have found a matching word and you should discard it. If not, the word is unique and should be kept. Once found to be unique, the word must be added to the tree structure so that its further repetitions are discarded.
The tree structure will take a substantial amount of memory, but it is in any case less than holding the unique words in some sort of array.
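The tree described above is usually called a trie. A minimal sketch using nested dictionaries follows; the names `trie_add`, `trie_contains`, and `filter_with_trie` are assumptions, and for brevity it works on whole words rather than on a raw byte stream.

```python
END = object()  # marker stored in a node where a word ends

def trie_add(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

def filter_with_trie(file1_words, file2_words):
    """Yield words of file one not in file two, each only once."""
    root = {}
    for word in file2_words:      # preprocess file two
        trie_add(root, word)
    for word in file1_words:
        if not trie_contains(root, word):
            trie_add(root, word)  # remember it to drop later repetitions
            yield word
```

Words that share a prefix share the corresponding nodes, which is where the memory saving over a flat list of words comes from.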
There are a number of possible optimizations.
As Roman Saveljev suggested, you can keep a trie structure in memory. Depending on the entropy of the data, it can easily fit in memory.
As the 2nd file is sorted, you can run a binary search to check if the record is there (if you aren't doing this yet).
You can also keep a Bloom filter in memory to cheaply identify records that are definitely not duplicates, which avoids going to disk every time.
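A Bloom filter can be sketched in a few lines. This is a minimal sketch, not a tuned implementation: the class name, the use of SHA-256 with double hashing to derive bit positions, and the parameter names are all assumptions made for illustration. A negative answer is always correct; a positive answer may be a false positive and still needs a real lookup.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

In the scenario above, the filter would be filled with the entries of file two; only entries for which `entry in bloom` is true need the more expensive exact check.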

Find successive words from article A that appear in article B

There are two articles, A and B, which are very large. Get three or more successive words in A and check if they appear in B, and count how many times they appear. For example, if 'book' 'his' and 'her' appear in A, how many times do they appear in B?
I thought about splitting the entire content of B and then checking all 3 words in A with StringToken, but I am not sure about the algorithm efficiency.
Look into what a hashtable is. Scan your file B for words one by one (you can split if you don't care about memory usage on large files), and for each word you find, either insert it into the hashtable (when not found) or increment its count, to get the number of times the word is seen.
Then you just scan over A, looking for each set of 3 words, with a rolling sliding window. This way you can increase the length of the window later without rewriting anything.
For reference, you should really tag homework questions as such.
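One way to read the suggestion above is to store each run of successive words from B in the hashtable, rather than single words, and then slide a window over A. The following is a minimal sketch under that assumption; the names `ngrams` and `count_shared_ngrams` are hypothetical, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter

def ngrams(words, n):
    """Yield every run of n successive words as a tuple."""
    return zip(*(words[i:] for i in range(n)))

def count_shared_ngrams(a_text, b_text, n=3):
    """Map each n-word run of A that occurs in B to its count in B."""
    b_counts = Counter(ngrams(b_text.split(), n))
    result = {}
    for gram in ngrams(a_text.split(), n):  # sliding window over A
        if gram in b_counts:
            result[gram] = b_counts[gram]
    return result
```

Changing `n` is enough to look for longer runs, which matches the point about widening the window later.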
It is obvious that you need to scan/parse the entire content of B once to get the results; you just cannot avoid that. Read it line by line. For every line, search for the given query terms and count their occurrences in the line. Keep adding up the per-line counts to get the final result.
If you want to do such a computation many times on the content of B, for the same or different terms, creating an inverted index for B would be the best way.
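A positional inverted index maps each word to the positions at which it occurs, so repeated phrase queries do not need to rescan B. A minimal sketch, with the hypothetical names `build_index` and `count_phrase`:

```python
from collections import defaultdict

def build_index(words):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index

def count_phrase(index, phrase):
    """Count occurrences of the successive words in phrase."""
    later = [set(index.get(word, ())) for word in phrase[1:]]
    return sum(
        # The k-th later word must sit exactly k + 1 positions after the first.
        all(pos + k + 1 in positions for k, positions in enumerate(later))
        for pos in index.get(phrase[0], ())
    )
```

The index is built once with a single scan of B; each query then only touches the position lists of the words in the phrase.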

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):
Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"
I know that:
This would run on one machine (not distributed).
The machine that I'd be running this on would have several processors.
The files I'd be sorting could be larger than the physical memory of the machine.
A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (ie. the 4th field in the file).
An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
This isn't homework!
Thanks! ♥
This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.
Suggest the following resources:
Merge Sort: http://en.wikipedia.org/wiki/Merge_sort
Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)
A standard merge sort approach will work. The common scheme is:
Split the file into N parts of roughly equal size
Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
Merge the sorted parts
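The three steps above can be sketched with the standard library. This is a minimal sketch under stated assumptions: the names `external_sort` and `birth_date` are hypothetical, `chunk_size` is the number of lines that fits comfortably in memory, the header line is assumed to have been removed by the caller, and no quoted field contains a line break. With a very large number of parts, the merge may need to be done in several passes to stay under the open-file limit.

```python
import csv
import heapq
import itertools
import tempfile

def birth_date(line):
    """Sort key: the 4th field of a CSV line."""
    return next(csv.reader([line]))[3]

def external_sort(lines, key, chunk_size):
    """Yield lines in sorted order while holding at most chunk_size in memory."""
    runs = []
    it = iter(lines)
    try:
        while True:
            # Steps 1 and 2: cut off one part and sort it in memory.
            chunk = sorted(itertools.islice(it, chunk_size), key=key)
            if not chunk:
                break
            run = tempfile.TemporaryFile("w+", encoding="utf-8")
            run.writelines(l if l.endswith("\n") else l + "\n" for l in chunk)
            run.seek(0)
            runs.append(run)
        # Step 3: k-way merge of the sorted parts, read back line by line.
        yield from heapq.merge(*runs, key=key)
    finally:
        for run in runs:
            run.close()
```

Because the parts are stored as text lines rather than fixed-size records, variable-length lines need no special handling.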
No need to sort. Read the file ALL.CSV and append each line read to a file per day, like 19841231.CSV. Then, for each day that has data, in numerical order, read that day's CSV file and append its lines to a new file. Optimizations are possible by, for example, processing the original file more than once or by recording the days actually occurring in the file ALL.CSV.
So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV was appended to NEW.CSV. The numerical order avoids the use of any sort algorithm, although it could torture the file system.
In reality the file ALL.CSV could first be split into one file per, for example, year: 1984.CSV, 1985.CSV, and so on.
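The per-day file approach can be sketched as follows. This is a minimal sketch: the function name `bucket_by_day`, its parameters, and the choice to record the days seen in a set are assumptions made for illustration. Ordering the small set of day names is trivial compared with sorting the records themselves. Reopening a day file for every line keeps the sketch short but is slow; a real version would cache open file handles.

```python
import csv
import os

def bucket_by_day(all_csv, new_csv, work_dir, date_column=3):
    """Split all_csv into one file per day, then concatenate them in day order."""
    days = set()
    with open(all_csv, encoding="utf-8", newline="") as src:
        header = src.readline()
        for line in src:
            if not line.endswith("\n"):
                line += "\n"
            # "1985/02/28" -> "19850228"
            day = next(csv.reader([line]))[date_column].replace("/", "")
            days.add(day)
            path = os.path.join(work_dir, day + ".CSV")
            with open(path, "a", encoding="utf-8", newline="") as bucket:
                bucket.write(line)
    with open(new_csv, "w", encoding="utf-8", newline="") as out:
        out.write(header)
        for day in sorted(days):  # YYYYMMDD names sort in date order
            path = os.path.join(work_dir, day + ".CSV")
            with open(path, encoding="utf-8", newline="") as bucket:
                out.writelines(bucket)
```

Lines with the same date keep their original relative order, since each day file is only ever appended to.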

Resources