Fast data extraction algorithm

I have 2 UTF-8 text files. Each row of a file contains a string that can include language-specific characters like Ü, Ö, ą, ę. The strings are in random order, have varying lengths, and can repeat. The first file has at least 3 million rows (it can easily exceed 1 billion rows). The second file is smaller; it usually has about 400 thousand rows (but can be much bigger).
I need to create a new file that contains the entries from file one, with the entries that appear in file two removed, along with all repeated entries.
Currently I'm sorting both files and removing repeated entries. Then I write them to the new file while checking whether they appear in the second file.
Is there any faster way to do this?
Edit
Memory is a problem. I don't copy these strings into memory but operate on the files. My friend suggested not copying to memory but working on file streams. After this, execution time dropped significantly.
The administrator of the computer doesn't want to install a database on it.
After the sort, my code runs like this in a loop:
if stringFromFile1 < stringFromFile2 then writeToFile3 and get next stringFromFile1
else if stringFromFile1 == stringFromFile2 then dropStringFromFile1 and get next stringFromFile1
else if stringFromFile1 > stringFromFile2 then get next stringFromFile2 and go to line 1
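For reference, a minimal Python sketch of that merge loop over the two sorted files (the file names here are placeholders):
def merge_difference(sorted_file1, sorted_file2, out_file):
    with open(sorted_file1, encoding="utf-8") as f1, \
         open(sorted_file2, encoding="utf-8") as f2, \
         open(out_file, "w", encoding="utf-8") as out:
        line2 = f2.readline()
        previous = None
        for line1 in f1:
            if line1 == previous:           # drop repeated entries from file one
                continue
            while line2 and line2 < line1:  # advance file two up to line1
                line2 = f2.readline()
            if line1 != line2:              # keep only lines absent from file two
                out.write(line1)
            previous = line1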

If you have a data structure available such as a hash set, you could just iterate over the files and add each line. Sets do not allow repetition, and a hash set should give you a constant-time way of checking whether an element already exists (in Java at least, the add method checks whether an element exists and, if it does not, adds the item to the set in constant time).
Once you have gone through both files, you can iterate over the hash set and write its contents to the file. This gives you an algorithm that runs in linear time.
Forgot to mention: I am assuming that you do not have restrictions on memory consumption. If you do, you might want to try saving each line to a database, using the hash of each line as the primary key. Inserting two rows with the same primary key should fail, ensuring that you end up with unique strings in the database. Once you are done with the insertions, you can retrieve the values from the database and store them in a file.
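A set-based sketch in Python, adapted so that the entries from file two are removed as well (this assumes everything fits in memory; the file names are placeholders):
def set_difference(file1, file2, out_file):
    with open(file2, encoding="utf-8") as f2:
        to_remove = set(line.rstrip("\n") for line in f2)
    seen = set()
    with open(file1, encoding="utf-8") as f1, \
         open(out_file, "w", encoding="utf-8") as out:
        for line in f1:
            word = line.rstrip("\n")
            if word in to_remove or word in seen:  # already in file two, or a repeat
                continue
            seen.add(word)
            out.write(word + "\n")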

My proposal is to preprocess file two and form a tree structure from it. For example, say you have this kind of file two:
bad
bass
absent
then your tree structure would be like this:
BEGIN -> b -> a -> d -> END
|             |
|             +-> s -> s -> END
|
+-> a -> b -> s -> e -> n -> t -> END
END designates a word delimiter (be it a space, a new line, or something else).
Then you open file one as a file stream and read it byte by byte. Whenever you are at the beginning of the file or pick the next character after a delimiter, you start walking your tree. If, with the streamed bytes, you can walk all the way to an END, it means you found a matching word and should discard it. If not, the word is unique and need not be dropped. If it is found to be unique, the word must be added to the tree structure so that its further repetitions are discarded.
The tree structure will take a substantial amount of memory, but it is still less than holding the unique words in some sort of array.
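A rough Python sketch of this idea (it reads whole lines rather than streaming byte by byte, and the file names are placeholders):
class TrieNode:
    def __init__(self):
        self.children = {}
        self.end = False        # marks END (a complete word)

def trie_add(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.end = True

def trie_contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.end

def filter_file(file_one, file_two, out_path):
    root = TrieNode()
    with open(file_two, encoding="utf-8") as f2:
        for line in f2:
            trie_add(root, line.rstrip("\n"))
    with open(file_one, encoding="utf-8") as f1, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f1:
            word = line.rstrip("\n")
            if not trie_contains(root, word):  # unique so far: keep it and remember it
                out.write(word + "\n")
                trie_add(root, word)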

There are a number of possible optimizations.
As Roman Saveljev suggested, you can keep a trie structure in memory. Depending on the entropy of the data, it can easily fit in memory.
As the 2nd file is sorted, you can run a binary search to check if the record is there (if you aren't doing this yet).
You can also keep a Bloom filter in memory to cheaply identify records that aren't duplicated, so you avoid going to disk every time.

Related

Search data from a data set without reading each element

I have just started learning algorithms and data structures and I came across an interesting problem.
I need some help in solving the problem.
There is a data set given to me. Within the data set are characters, each with a number associated with it. I have to evaluate the sum of the largest numbers associated with each of the characters present. The list is not sorted by character; however, the entries for each character appear as one contiguous group, with no further instance of that character later in the data set.
Moreover, the largest number associated with each character always appears at the last position for that character in the data set. We know the length of the entire data set, and we can retrieve an entry by specifying its line number.
For example:
C-7
C-9
C-12
D-1
D-8
A-3
M-67
M-78
M-90
M-91
M-92
K-4
K-7
K-10
L-13
length=15
get(3) = D-1 (stored in a class with character D and value 1)
The answer for the above should be 13+10+92+3+8+12 as they are the highest numbers associated with L,K,M,A,D,C respectively.
The simplest solution is, of course, to go through all of the elements, but what is the most efficient algorithm (one that reads fewer entries than the length of the data set)?
You'll have to go through them one by one, since you can't be certain what the key is.
Just for the sake of easy manipulation, I would loop over the data set and check whether the key at index i is equal to the key at index i+1; if it's not, you have a local max.
Then store that value in a hash or dictionary if there isn't already a key:value pair for that key; if there is, check whether the existing value is less than the current value, and overwrite it if so.
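A minimal Python sketch of that scan (the data and the get() accessor follow the example above; everything else is an assumption):
data = ["C-7", "C-9", "C-12", "D-1", "D-8", "A-3",
        "M-67", "M-78", "M-90", "M-91", "M-92",
        "K-4", "K-7", "K-10", "L-13"]

def get(i):
    char, value = data[i].split("-")
    return char, int(value)

best = {}
for i in range(len(data)):
    char, value = get(i)
    if char not in best or value > best[char]:
        best[char] = value      # keep the largest value seen for each character
print(sum(best.values()))       # 13 + 10 + 92 + 3 + 8 + 12 = 138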
You could use statistics to optimistically skip some entries: say you read A 1, skip 5 entries, and read A 10 - good. You skip 5 more and get B 3, so you need to go back and also read what is in between.
But in reality it won't work. Not on text.
That's because IO happens in blocks. Data is stored in chunks of usually around 8 KB, so that is the minimum read size (even if your programming language provides reads of other sizes, they will eventually be translated into reading blocks and buffering them).
How do you find the next line? Well you read until you find a \n...
So you don't save anything on this kind of data. It would be different if you had much larger records (several KB, like files) and an index. But building that index will require reading all at least once.
So as presented, the fastest approach would likely be to linearly scan the entire data once.

Perl processing a trillion records

Looking for some advice or insight on what I consider a simple method in Perl to compare text files to one another.
Let's assume you have 90,000 text files that are all structured similarly, say they have a common theme with a small amount of unique data in each.
My logic says to simply loop through the files (breaking them into 1000-line chunks for simplicity), then loop through the 90,000 files, then loop through the 90,000 files again to compare each against the others. This becomes a virtually endless loop of a bazillion lines or processes.
Now the mandatory step here is to "remove" any line that is found in any file except the file we are working on. The ultimate goal is to scrub all the files down to content that is unique across the entire collection, even if it means some files end up empty.
I am saying files, but this could be rows in a database, or elements in an array. (I've tried all of them.) The fastest solution so far has been to load all the files into MySQL, then run
UPDATE table SET column=REPLACE(column, find, replace); I also tried Parallel::ForkManager when working with MySQL.
The slowest approach actually exhausted my 32 GB of RAM - that was loading all 90k files into an array. 90k files didn't work at all; smaller batches like 1000 work fine, but then they aren't compared against the other 89,000.
Server specs if helpful: single quad-core E3-1240 (4 cores x 3.4 GHz w/ HT), 32 GB DDR3 ECC RAM 1600 MHz, 1x 256 GB SSD.
So how does an engineer solve this problem? I am just a Perl hacker...
Tag every line with the filename (and maybe the line number) and sort all the lines using Sort::External. Then you can read the sorted records in order and write only a single unique line to the result files.
A Bloom filter is perfect for this, if you can handle arbitrarily small error.
To quote wikipedia: "A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either 'possibly in set' or 'definitely not in set'."
In essence, you'll use k hashes to hash each row to k spots on a bit array. Each time you encounter a new row, you are guaranteed you haven't seen it if at least one of the k hashed indices has a '0' bit. You can read up on Bloom filters to see how to size the array and choose k to make false positives arbitrarily small.
Then you go through your files, and either delete rows where you get a positive match, or copy the negative match rows into a new file.
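A toy Bloom filter sketch in Python (the bit-array size, the choice of k, and the SHA-256-derived hashing are my assumptions, not part of the answer; m and k should be sized for the false-positive rate you can tolerate):
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8 * 1024 * 1024, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _indices(self, item):
        # derive k bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        # False means "definitely not seen"; True means "possibly seen"
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(item))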
Sort the items using an external merge sort algorithm and remove the duplicates on the merge phase.
Actually, you can do that efficiently just by calling the sort command with the -u flag. From Perl:
system "sort -u @files >output";
Your sort command may provide several adjustable parameters to improve its performance. For instance, the number of parallel processes or the amount of memory it can allocate.

Removing Duplicate Words Across Multiple and Large Dictionary Files

I have roughly ~600 GB of dictionaries I've accumulated over the years, and I decided I want to clean them up and sort them.
First of all, each file is on average very large, anywhere from 500 MB to 9 GB in size. A prerequisite for what I want to do is to sort each dictionary. My end goal is to entirely remove duplicate words within and across all dictionary files.
The reason for this is that most of my dictionaries are sorted and organized by categories, but duplicates still often exist.
Load file
Read each line and put into data structure
Sort and remove any and all duplicates
Load next file and repeat
Once all files are individually unique, compare them against each other and remove duplicates
For Dictionaries D{1} to D{N}:
1) Sort D{1} through D{N} individually.
2) Check uniqueness of each word in D{i}
3) For each word in D{i}, check ALL words across D{i+1} to D{N} and delete any duplicates, keeping the copy in D{i}, where it appeared first.
I am considering using a sort of "hash" to improve this algorithm. Possibly by only checking the first one or two characters, since the list will be sorted (e.g. hash beginning line location for words starting with a, b, etc.).
4) Save and exit.
Example before (but far smaller):
Dictionary 1        Dictionary 2    Dictionary 3
]a                  0u3TGNdB        2 KLOCK
all                 avisskriveri    4BZ32nKEMiqEaT7z
ast                 chorion         4BZ5
astn                chowders        bebotch
apiala              chroma          bebotch
apiales             louts           bebotch
avisskriveri        lowlander       chorion
avisskriverier      namely          PC-Based
avisskriverierne    silking         PC-Based
avisskriving        underwater      PC-Based
So it would see that avisskriveri, chorion, bebotch and PC-Based are words that repeat both within and among the three dictionaries. I see avisskriveri in D{1} first, so I remove it from every other place I have seen it. Then I see chorion in D{2} first, and remove it everywhere else, and so forth. In D{3}, bebotch and PC-Based are replicated, so I want to delete all but one entry of each (unless I've seen it before). Then save all files and close.
Example after:
Dictionary 1        Dictionary 2    Dictionary 3
]a                  0u3TGNdB        2 KLOCK
all                 chorion         4BZ32nKEMiqEaT7z
ast                 chowders        4BZ5
astn                chroma          bebotch
apiala              louts           PC-Based
apiales             lowlander
avisskriveri        namely
avisskriverier      silking
avisskriverierne    underwater
avisskriving
Remember: I do NOT want to create any new dictionaries, only remove duplicates across all dictionaries.
Options:
"Hash" the amount of unique words for each file, allowing the program to estimate the computation time.
Specify a way to give the location of the first word beginning with the desired first letter, so that the search may "jump" to a line and skip unnecessary computation.
Run on GPU for high performance parallel computing. (This is an issue because getting the data off of the GPU is tricky)
Goal: Reduce computational time and space consumption so that the method is affordable on a standard machine or server with limited abilities. Or devise a method for running it remotely on a GPU cluster.
tl;dr - Sorting unique words across hundreds of files, where each file is 1-9GB in size.
Assuming the dictionaries are in alphabetical order and line by line, one word per line (as are most dictionaries), you could do something like this:
Open a file stream to each file.
Open a file stream to the compiled list file.
Read 1 entry from each file and put it onto a heap, priority queue, or other sorted data structure.
while you still have entries
find & remove the first entry, storing the word (it is not necessary to store the file)
read in the next entry from that file, if one exists
find & remove any duplicates of the stored entry
read in the next entry for each of those files, if one exists
write the stored word to your compiled list file
Close all of the streams
The efficiency of this is something like O(n*m*log(n)) and the space efficiency is O(n), where n is the number of files and m is the average number of entries.
Note that you'll want to create a data type that pairs entries (strings) with file pointers/references and sorts by the stored string. You'll also need a data structure that allows you to peek before you pop.
If you have questions in implementation, ask me.
A more thorough analysis of the efficiency:
Space efficiency is pretty easy. You fill the data structure, and for every item you put on, you take one off, so it stays at O(n).
Computational efficiency is more complex. The looping itself is O(n*m), because you will consider each entry, and there are n*m entries. Some c percent of those will be valid, but that's a constant, so we don't care.
Next, adding and removing from a priority queue is log(n) both ways, so to find & remove is 2*log(n).
Because we add and remove each entry, we get n*m add and removes, so O(n*m*log(n)). I think it might actually be a theta in this case, but meh.
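For illustration, a Python sketch of this merge with deduplication, using heapq.merge in place of a hand-rolled priority queue (the file paths are placeholders; each input is assumed to be already sorted, one word per line):
import heapq

def merge_unique(input_paths, output_path):
    files = [open(p, encoding="utf-8") for p in input_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            previous = None
            # heapq.merge repeatedly yields the smallest line across all sorted streams
            for line in heapq.merge(*files):
                word = line.rstrip("\n")
                if word != previous:        # write each word only once
                    out.write(word + "\n")
                    previous = word
    finally:
        for f in files:
            f.close()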
As far as I understand, there is no pattern to exploit in a clever way. So we want to do raw sorting.
Let us assume that no cluster farm is available (we could do other things then)
Then I would start with the easiest approach possible, the command line tool sort:
sort -u inp1 inp2 -o sorted
This will sort inp1 and inp2 together into the output file sorted, without duplicates (-u = unique). sort typically uses a customized mergesort algorithm that can work within a limited amount of memory, so you should not run into memory problems.
You should have at least 600 GB (double the size) of free disk space.
You should test with only 2 input files to see how long it takes and what happens. My tests did not show any problems, but they used different data and an AFS server (which is rather slow, but is a better emulation of some HPC filesystem providers):
$ ll
2147483646 big1
2147483646 big2
$ time sort -u big1 big2 -o bigsorted
1009.674u 6.290s 28:01.63 60.4% 0+0k 0+0io 0pf+0w
$ ll
2147483646 big1
2147483646 big2
117440512 bigsorted
I'd start with something like:
#include <iostream>
#include <string>
#include <set>

int main()
{
    typedef std::set<std::string> Words;
    Words words;
    std::string word;
    while (std::cin >> word)
        words.insert(word); // insert only succeeds if the word was not seen before
    for (Words::const_iterator i = words.begin(); i != words.end(); ++i)
        std::cout << *i << '\n';
}
Then just:
cat file1 file2... | ./this_wonderful_program > greatest_dictionary.txt
This should be fine assuming the number of non-duplicate words fits in memory (likely on any modern PC, especially with 64 bits and more than 4 GB of RAM); it will probably be I/O bound anyway, so there's no point fussing over unordered map vs. (binary-tree) map, etc. You may want to convert to lower case, strip spurious characters, etc. before inserting into the set.
EDIT:
If the unique words don't fit in memory, or you're just stubbornly determined to sort each individual input then merge them, you can use the unix sort command on each file, then sort -m to efficiently merge the pre-sorted files. If you're not on UNIX/Linux, you can probably still find a port of sort (e.g. from Cygwin for Windows), your OS may have an equivalent program, or you could try compiling the sort source code. Note that this approach is a little different from tb-'s suggestion of asking one invocation of sort to sort everything (presumably in memory) - I'm not sure how well that would work, so best to try/compare.
At that scale of 300 GB+, you may want to consider using Hadoop or some other scalable store - otherwise, you will have to deal with memory issues through your own coding. You can try other, more direct methods (UNIX scripting, small C/C++ programs, etc.), but you will likely run out of memory unless you have a ton of duplicate words in your data.
Addendum
Just came across memcached which seems very close to what you are trying to accomplish: but you may have to tweak it not to throw away the oldest values. I don't have time to check right now, but you should do a search on Distributed Hash Tables.

Need help designing a search algorithm in a more efficient way

I have a problem from the field of biology. Right now I have 4 VERY LARGE files (each with 0.1 billion lines), but the structure is rather simple: each line of these files has only 2 fields, and each field stands for a type of gene.
My goal is to design an efficient algorithm that can achieve the following:
Find a circle within the contents of these 4 files. The circle is defined as:
field #1 in a line in file 1 == field #1 in a line in file 2 and
field #2 in a line in file 2 == field #1 in a line in file 3 and
field #2 in a line in file 3 == field #1 in a line in file 4 and
field #2 in a line in file 4 == field #2 in a line in file 1
I cannot think of a decent way to solve this, so I just wrote a brute-force, stupid, 4-layer nested loop for now. I'm thinking about sorting the files in alphabetical order; even if that might help a little, it's also obvious that the computer's memory would not allow me to load everything at once. Can anybody tell me a good way to solve this problem in a way that is both time and space efficient? Thanks!!
First of all, I note that you can sort a file without holding it in memory all at once, and most operating systems have a program that does this, often called just "sort". Usually you can get it to sort on a field within the file, but if not you can rewrite each line to get it to sort the way you want.
Given this, you can connect two files by sorting them so that the first is sorted on field #1 and the second on field #2. You can then create one record for each match, combining all the fields, and only holding in memory a chunk from each file where all the fields you have sorted on have the same value. This will allow you to connect the result with another file - four such connections should solve your problem.
Depending on your data, the time it takes to solve your problem may depend on the order in which you make the connections. One rather naive way to make use of this is, at each stage, to take a small random sample from each file, and use this to see how many results will follow from each possible connection, and choose the connection that produces the fewest results. One way to take a random sample of N items from a large file is to take the first N lines in the file and then, when you have read in m lines so far, read the next line, and then with probability N/(m + 1) exchange one of the N lines held for it, else throw it away. Keep on until you have read through the whole file.
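A quick Python sketch of that sampling step (the file path and sample size are placeholders):
import random

def reservoir_sample(path, n):
    sample = []
    with open(path, encoding="utf-8") as f:
        for m, line in enumerate(f):
            if m < n:
                sample.append(line)            # keep the first N lines
            elif random.random() < n / (m + 1):
                # replace a random held line with probability N/(m + 1)
                sample[random.randrange(n)] = line
    return sample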
Here is one algorithm:
Select an appropriate lookup structure: if field #1 is an integer, use bit-fields; if it is a string, use a dictionary (or a set). Use one lookup structure for each file, i.e. 4 in your case.
Initialization phase: for each file, parse the file line by line and either set the appropriate bit in the bit-field or add the field to the dictionary in the lookup structure corresponding to that file.
After initializing the lookup structures above, check the condition in your question.
The complexity of this depends on the lookup structure implementation. For bit-fields it will be O(1), and for a set or dictionary it will be O(lg(n)), since they are usually implemented as balanced search trees. The complete complexity will be O(n) or O(n lg(n)); your solution in the question has complexity O(n^4).
You can get the code and solution for bit fields from here
HTH
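A minimal Python sketch of the dictionary-based lookup and the circle check (whitespace-separated fields, the file names, and the chain traversal are my assumptions):
import collections

def load(path):
    # map field #1 -> set of field #2 values seen with it
    lookup = collections.defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            f1, f2 = line.split()
            lookup[f1].add(f2)
    return lookup

def find_circles(file1, file2, file3, file4):
    l2, l3, l4 = load(file2), load(file3), load(file4)
    with open(file1, encoding="utf-8") as f:
        for line in f:
            a, b = line.split()
            for c in l2.get(a, ()):         # field #1 of file 2 matches a
                for d in l3.get(c, ()):     # field #1 of file 3 matches c
                    if b in l4.get(d, ()):  # file 4 closes the circle back to b
                        yield a, c, d, b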
Here is one approach:
We will use the notation Fxy, where x = field number and y = file number.
Sort each of the 4 files on the first field.
For each field F11, find a match in file 2. This will be linear. Save these matches, with all four fields, to a new file. Now use this file, take the corresponding field in it, and get all the matches from file 3. Continue with file 4 and back to file 1.
In this way, as you progress to each new file, you are dealing with a smaller number of lines. And since you have sorted the files, the search is linear and can be done by reading from disk.
Here the complexity is O(n log n) for sorting and O(m log n) for lookup, assuming m << n.
It's a bit easier to explain if your File 1 is the other way around (so each second element points to a first element in the next file).
Start with File 1, copy it to a new file writing each A, B pair as B, A, 'REV'
Append the contents of File 2 to it writing each A, B pair as A, B, 'FWD'
Sort the file
Process the file in chunks with the same initial value
Within that chunk group the lines into REV's and FWD's
Take the cartesian product of the revs and the fwds (nested loop)
Write a line with reverse(fwd) concat (rev) excluding the repeated token
e.g. B, A, 'REV' and B, C, 'FWD' -> C, B, A, 'REV'
Append the next file to this new output file (adding 'FWD' to each line)
Repeat from step 3
In essence you are building up a chain in reverse order and using a file-based sort algorithm to put sequences together that can be combined.
Of course it would be even easier to just read these files into a database and let it do the work ...

Find common words from two files

Given two files containing lists of words (around a million each), we need to find the words that are common to both.
Use some efficient algorithm; also, not enough memory is available (certainly not enough for 1 million words). Some basic C programming code, if possible, would help.
The files are not sorted. We can use some sort of algorithm... Please support it with basic code...
Sorting the external file... with the minimum memory available, how can it be implemented in C?
Anybody game for external sorting of a file? Please share some code for this.
Yet another approach.
General. First, notice that doing this naively takes O(N^2). With N = 1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2N*log(N) + 2N) = O(N*log(N)).
Sorting a file. Now let's address the fact that working with files is much slower than working with memory, especially when sorting, where you need to move things around. One way to solve this is to decide on the size of a chunk that can be loaded into memory, load the file one chunk at a time, sort each chunk efficiently, and save it into a separate temporary file. The sorted chunks can then be merged (again, see below) into one sorted file in one pass.
Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry in each list. In each step, compare the values the pointers point to. Move the smaller value to the merged list (the one you are constructing) and advance its pointer.
You can modify the merge algorithm easily to make it find the intersection - if pointed values are equal move it to the results (consider how do you want to deal with duplicates).
For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm for using k pointers.
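A Python sketch of the merge-based intersection (both inputs are assumed to be already sorted, one word per line; since the question asks for basic C, treat this as executable pseudocode):
def sorted_intersection(path_a, path_b, out_path):
    with open(path_a, encoding="utf-8") as fa, \
         open(path_b, encoding="utf-8") as fb, \
         open(out_path, "w", encoding="utf-8") as out:
        a, b = fa.readline(), fb.readline()
        last = None
        while a and b:
            if a < b:
                a = fa.readline()
            elif a > b:
                b = fb.readline()
            else:
                if a != last:       # emit each common word only once
                    out.write(a)
                    last = a
                a, b = fa.readline(), fb.readline()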
If you have enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word), looping over the words of the second file, and testing whether each word is contained in that dictionary. Memory for a million words is not much today.
If you do not have enough memory, split the first file into chunks that fit into memory and do as I said above for each chunk. For example, fill the dictionary with the first 100,000 words, find every common word for those, then read the file a second time extracting words 100,001 up to 200,000, find the common words for that part, and so on.
And now the hard part: you need a dictionary structure, and you said "basic C". If you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors. In basic C, you should also try to use a ready-made library for that; read this SO post to find a link to a free library which seems to support that.
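A Python sketch of the chunked variant (the chunk size and file names are placeholders; again, treat this as pseudocode for a C implementation):
def common_words(big_path, other_path, chunk_size=100_000):
    common = set()
    with open(big_path, encoding="utf-8") as big:
        while True:
            chunk = set()
            for line in big:                 # resumes where the last chunk stopped
                chunk.add(line.strip())
                if len(chunk) >= chunk_size:
                    break
            if not chunk:
                break
            with open(other_path, encoding="utf-8") as other:
                for line in other:
                    if line.strip() in chunk:
                        common.add(line.strip())
    return common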
Your problem is: given two sets of items, find the intersection (items common to both) while staying within the constraint of inadequate RAM (less than the size of either set).
Since finding an intersection requires comparing/searching each item against another set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.
Assume that you know for a fact that the intersection is much smaller than both sets and fits completely inside available memory - otherwise you'll have to do further work to flush the results to disk.
If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts the fit the second 1/3. The remaining 1/3 memory is used to store the results.
Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.
First find the intersection of both partitions through a double loop, storing common items in the results set and removing them from the original sets to save on comparisons further down the loop.
Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.
After running through the entire smaller set, repeat with the next partition of the larger set.
Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.
I would give prefix trees (aka tries) a shot.
My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.
Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.
Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory - and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long - then you have to process even the first file in parts. In that case, I would actually recommend using the trie for the larger file, not the smaller one. Just partition it into parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.
Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.
If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in python, but should be relatively easy to implement in any language.
with open(file1) as file_1:
    current_word_1 = read_to_delim(file_1, delim)
    while current_word_1:
        with open(file2) as file_2:
            current_word_2 = read_to_delim(file_2, delim)
            while current_word_2:
                if current_word_2 == current_word_1:
                    print current_word_2
                current_word_2 = read_to_delim(file_2, delim)
        current_word_1 = read_to_delim(file_1, delim)
I leave read_to_delim to you, but this is the extreme case that is memory-optimal but time-least-optimal.
Depending on your application, of course, you could load the two files into a database, perform a left outer join, and discard the rows for which one of the two columns is null.
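For completeness, a sketch of that database route using SQLite from Python (the answer above doesn't name a particular database; the table and file names are placeholders):
import sqlite3

def common_words_db(path_a, path_b):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (word TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE b (word TEXT PRIMARY KEY)")
    with open(path_a, encoding="utf-8") as fa:
        con.executemany("INSERT OR IGNORE INTO a VALUES (?)",
                        ((line.strip(),) for line in fa))
    with open(path_b, encoding="utf-8") as fb:
        con.executemany("INSERT OR IGNORE INTO b VALUES (?)",
                        ((line.strip(),) for line in fb))
    # keep only the rows where the outer join found a match in both tables
    rows = con.execute("SELECT a.word FROM a LEFT OUTER JOIN b ON a.word = b.word "
                       "WHERE b.word IS NOT NULL")
    return [r[0] for r in rows]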

Resources