How to find unique entries in a large data set? - data-structures

I have 100 million lines of data; each line is a single word no longer than 15 characters, one word per line. The data is stored in multiple files.
My goal is to find the unique words among all the files.
One solution is to import all the words into a database and add a unique key on the field, but that is too slow for a data set this large.
Is there any faster solution?
Thank you

I'm not sure that there'll be many faster ways than using a database. Personally, I usually use a UNIX shell script for this:
cat * | sort | uniq
I don't know how fast that would be with 100,000,000 words, and I'm not sure how fast you want it to be. (E.g., do you need to run it lots of times or just once? If just once, I'd go with the sort and uniq option and let it run overnight if you can).
Alternatively, you could write a script in Ruby or a similar language that stores the words in an associative array. I suspect that would almost certainly be slower than the database approach, though.
I guess if you really want speed, and you need to carry out this task (or ones like it) often, then you might want to write something in C, but to me that feels a bit like overkill.
Ben

Using a database for this is insane. 100 million records of 15 chars fit in RAM. If there is at least some duplication, simply build a trie. You should be able to process 50 MB/second or so on a modern machine.
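A rough sketch of the trie idea, using nested Perl hashes to stand in for trie nodes (a packed C trie is what actually delivers the memory savings; this only shows the structure and the duplicate check):

    use strict;
    use warnings;

    # Sketch only: nested hashes stand in for trie nodes. Each character of a
    # word descends one level; an 'end' marker records a complete word.
    my %trie;

    sub add_word {                                       # returns true if already seen
        my ($word) = @_;
        my $node = \%trie;
        $node = $node->{$_} //= {} for split //, $word;  # walk/create the path
        return $node->{end}++;
    }

    while (my $word = <>) {
        chomp $word;
        print "$word\n" unless add_word($word);          # print first occurrences only
    }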

If you have to stick with the file structure, then you need some way of indexing the files and then maintaining the index.
Otherwise, I would recommend moving to a database and migrating all operations on that file to work with the database.

You could store the words in a hash table. Assuming there are quite a number of duplicates, the O(1) search time will be a big performance boost. The loop is simply (a minimal sketch follows the steps):
Read a line.
Search for the word in the hash table.
If not found, add it to the table.
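A minimal Perl sketch of those steps, reading whatever files are passed on the command line:

    use strict;
    use warnings;

    # One pass over every line; the hash %seen is the table, with average O(1)
    # insert and lookup per word.
    my %seen;
    while (my $word = <>) {
        chomp $word;
        $seen{$word} = 1;            # found or not, the word ends up in the table
    }
    print "$_\n" for keys %seen;     # the unique words

Bear in mind that with 100 million distinct short words the hash itself can run to several gigabytes, which is the memory trade-off other answers here mention.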

If you have this much data, then it needs to be in a SQL server. This is why SQL was designed in the first place. If you continue to use these files, you will forever be stuck with performance issues.
Even if these files are modified by external programs (or via FTP), you need to create an import process to run nightly.

You can conserve speed, space, or your sanity. Pick any two.
Throwing it all into a database sacrificed both speed and space, as you found out. But it was easy.
If space is your main problem (memory, disk space), then partition the work. Filter all of the 1-character lines from the files and use one of the above solutions (sort, uniq) on them. Repeat with the 2-character lines from each file, and so on. The unique results from each pass form your solution set.
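For example, the pass for 3-character words might look like this (the length and file names are placeholders):

    # One pass of the partition-by-length idea: keep only lines of one length,
    # dedupe that slice, and append the result to the combined answer.
    perl -lne 'print if length == 3' file1.txt file2.txt | sort -u >> unique_words.txt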
If your main problem is speed, then read each file exactly once creating a hash table (dictionary, whatever) to look for duplicates. Depending on the hash implementation, this could eat up bucketloads of memory (or disk). But it'll be fast.
If you need to conserve speed and space, then consider blending the two techniques. But be prepared to sacrifice the third item.

If there's significant duplication within individual files, it may be quicker to do it file by file and then merge the results. Something along the lines of:
{ for n in * ; do sort -u "$n" ; done ; } | sort -u
(I'm assuming GNU bash and GNU sort)
I think the choice of best solution will depend heavily on the distribution of duplicates and the number of separate files, though, which you haven't shared with us.
Given myhusky's clarification (plenty of dupes, 10~20 files), I'll definitely suggest this as a good solution. In particular, dense duplication will speed up sort -u versus sort|uniq

Related

Fortran95 access large files fast using direct access

I am currently working on a problem which requires me to store a large amount of well structured information in a file.
It is more data than I can keep in memory, but I need to access different parts of it very often and would like to do so as quickly as possible (of course).
Unfortunately, the file would be large enough that actually reading through it would take quite some time as well.
From what I have gathered so far, it seems to me that ACCESS="DIRECT" would be a good way of handling this problem. Do I understand correctly that with direct access I am basically pointing at a specific chunk of memory and asking "What's in there?"? And am I right to infer from that that reading time does not depend on the overall file size?
Thank you very much in advance!
You can think of an ACCESS='DIRECT' file as a file consisting of a number of fixed size records. You can do operations like read or write record #N in O(1) time. That is, in order to access record #N you don't need to scan through all the preceding #M (M<N) records in the file.
If this maps reasonably well to the problem you're trying to solve, then ACCESS='DIRECT' might be the correct solution in your case. If not, ACCESS='STREAM' offers a little more flexibility in that the size of each record does not need to be fixed, though you need to be able to compute the correct file offset yourself. If you need even more flexibility, there are things like NetCDF or HDF5, as #HighPerformanceMark suggested, or even something like SQLite.
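If it helps to see the record arithmetic outside Fortran, here is a rough sketch of the same fixed-size-record idea in Perl (the file name and record length are assumptions): record #N starts at byte offset (N-1) * RECLEN, so fetching it costs one seek plus one read, regardless of how big the file is.

    use strict;
    use warnings;

    my $reclen = 64;                                    # assumed record length in bytes
    open my $fh, '<:raw', 'records.dat' or die $!;      # placeholder file name

    sub read_record {
        my ($n) = @_;                                   # 1-based record number
        seek $fh, ($n - 1) * $reclen, 0 or die "seek: $!";
        my $got = read $fh, my $buf, $reclen;
        die "short read" unless defined $got and $got == $reclen;
        return $buf;
    }

    my $rec = read_record(1_000_000);                   # cost does not grow with file size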

Faster searching through files in Perl

I have a problem where my current algorithm uses a naive linear search algorithm to retrieve data from several data files through matching strings.
It is something like this (pseudo code):
while count < total number of files
    open current file
    extract line from this file
    build an arrayofStrings from this line
    foreach string in arrayofStrings
        foreach file in arrayofDataReferenceFiles
            search in these files
    close file
    increment count
For a large real life job, a process can take about 6 hours to complete.
Basically I have a large set of strings that the program uses to search through the same set of files (for example, 10 files in one run and maybe 3 in the next run). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking, since the search gets repetitive after a while, whether it is possible to prebuild an index of the locations of specific lines in the reference data files, without using any external Perl libraries, once the file array gets built (the files are known). This script is going to be ported onto a server that probably only has standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!
It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in once, then match it against all the patterns while it's in memory before moving on. This will reduce IO over looping for each pattern.
On the other hand, if the matching is what's taking the time you're probably better off using a library which can simultaneously match lots of patterns.
You could probably replace this:
foreach file in arrayofDataReferenceFiles
    search in these files
with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support:
dbmopen HASH,DBNAME,MASK
This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.
You'd normally access this stuff through tie, but that's not important: every Perl should have some support for at least one hash-on-disk library without needing non-core packages installed.
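A rough sketch of that preprocessing step using only core Perl (the index name and the whitespace word-splitting are assumptions about your data):

    use strict;
    use warnings;

    # Build an on-disk reverse index: word => space-separated list of the
    # reference files containing it. dbmopen keeps the hash in a DBM file, so
    # the index survives between runs and never has to fit in RAM.
    my %index;
    dbmopen(%index, 'reverse_index', 0644) or die "dbmopen: $!";

    for my $file (@ARGV) {
        open my $fh, '<', $file or die "$file: $!";
        my %seen_in_file;
        while (my $line = <$fh>) {
            for my $word (split ' ', $line) {
                next if $seen_in_file{$word}++;          # note each file only once per word
                $index{$word} = defined $index{$word} ? "$index{$word} $file" : $file;
            }
        }
        close $fh;
    }
    dbmclose(%index);

    # A later lookup is then just:  my @files = split ' ', ($index{$some_word} // '');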
As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check the file, first look to see whether an index exists and, if not, build one before doing the search) would be an excellent approach: when a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then doing a search off the index will be slower than a linear search by such a small margin as to be largely irrelevant.
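The freshness check itself is cheap; a sketch (the .idx naming and build_index() are hypothetical):

    # Rebuild the index only when it is missing or older than the reference
    # file it covers; build_index() stands for whatever indexing you settle on.
    sub index_for {
        my ($data_file) = @_;
        my $index_file = "$data_file.idx";
        if (!-e $index_file or -M $index_file > -M $data_file) {
            build_index($data_file, $index_file);        # missing or stale: rebuild
        }
        return $index_file;
    }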

What is the best way to sort 30 GB of strings on a computer with 4 GB of RAM, using Ruby as the scripting language?

Hi, I saw this as an interview question and thought it was interesting, but I am not sure about the answer.
What would be the best way?
Assuming *nix:
system("sort <input_file >output_file")
"sort" can use temporary files to work with input files larger than memory. It has switches to tune the amount of main memory and the number of temporary files it will use, if needed.
If not *nix, or the interviewer frowns because of the sideways answer, then I'll code an external merge sort. See #psyho's answer for a good summary of an external sorting algorithm.
Put them in a database and let the database worry about it.
One way to do this is to use an external sorting algorithm (a code sketch follows the steps):
1. Read a chunk of the file into memory.
2. Sort that chunk using any regular sorting algorithm (like quicksort).
3. Write the sorted strings out to a temporary file.
4. Repeat steps 1-3 until you have processed the whole file.
5. Apply the merge step of merge sort, reading the temporary files line by line.
6. Profit!
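Those steps, compressed into a sketch (shown in Perl; the chunk size and file names are illustrative, and the same structure ports directly to Ruby):

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Steps 1-4: sort fixed-size chunks in memory and spill each to a temp file.
    my $chunk_size = 1_000_000;                          # lines per chunk (illustrative)
    my @chunks;
    open my $in, '<', 'input.txt' or die $!;             # placeholder input name
    until (eof $in) {
        my @lines;
        while (@lines < $chunk_size and defined(my $line = <$in>)) {
            push @lines, $line;
        }
        my ($tmp, $name) = tempfile();
        print {$tmp} sort @lines;
        close $tmp;
        push @chunks, $name;
    }
    close $in;

    # Step 5: merge the sorted chunks line by line, always emitting the smallest head.
    my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @chunks;
    my @heads = map { scalar readline $_ } @fhs;
    open my $out, '>', 'sorted.txt' or die $!;
    while (my @live = grep { defined $heads[$_] } 0 .. $#heads) {
        my ($best) = sort { $heads[$a] cmp $heads[$b] } @live;
        print {$out} $heads[$best];
        $heads[$best] = readline $fhs[$best];
    }
    close $out;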
Well, this is an interesting interview question... almost all questions of this kind are meant to test your skills and, fortunately, don't directly apply to real-life problems. This looks like one of those, so let's get into the puzzle.
When your interviewer asks for "best", I believe he/she talks about performance only.
Answer 1
30 GB of strings is a lot of data. All comparison-based sorting algorithms are Omega(n log n), so it will take a long time. There are O(n) algorithms, such as counting sort, but they are not in-place, so you would be multiplying the 30 GB while you have only 4 GB of RAM (consider the amount of swapping...), so I would go with quicksort.
Answer 2 (partial)
Start thinking along the lines of counting sort. You may want to first split the strings into groups (a radix-sort approach), one for each letter. You may want to scan the file and, for each initial letter, move the string (copy and delete, so no space is wasted) into a temporary file. You may want to repeat the process for the first 2, 3 or 4 characters of each string. Then, to reduce the complexity of sorting lots of files, you can sort the strings within each file separately (using quicksort now) and finally merge all the files in order. This way you still have O(n log n), but each sort runs on a far smaller n.
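A rough sketch of that first partitioning pass in Perl (the input name is a placeholder; it buckets on the first character only):

    use strict;
    use warnings;

    # Radix-style first pass: distribute strings into one bucket file per
    # initial letter so each bucket can later be sorted on its own and the
    # results concatenated in key order.
    my %bucket;
    open my $in, '<', 'strings.txt' or die $!;
    while (my $line = <$in>) {
        my $key = lc substr($line, 0, 1);
        $key = '_' unless $key =~ /^[a-z]$/;             # lump everything else together
        unless ($bucket{$key}) {
            open $bucket{$key}, '>>', "bucket_$key.txt" or die $!;
        }
        print { $bucket{$key} } $line;
    }
    close $in;
    close $_ for values %bucket;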
Database systems are already handling this particular problem well.
A good answer is to use the merge-sort algorithm, adapting it to spool data to and from disk as needed for the merge steps. This can be done with minimal demands on memory.

Partial sorting algorithm

Say I have 50 million features, and each feature comes from disk.
At the beginning of my program, I handle each feature and, depending on some conditions, apply modifications to some of them.
At this point in my program, I am reading a feature from disk, processing it, and writing it back, because, well, I don't have enough RAM to hold all 50 million features at once.
Now say I want to sort these 50 million features: is there an optimal algorithm to do this, given that I can't load everything at the same time?
Like a partial sorting algorithm or something like that?
In general, the class of algorithms you're looking for is called external sorting. Perhaps the most widely known example of such sorting algorithm is called Merge sort.
The idea of this algorithm (the external version) is that you split the data into pieces that you can sort in memory (say 100 thousand elements) and sort each block independently (using some standard algorithm such as quicksort). Then you take the blocks and merge them (so you merge two 100k blocks into one 200k block), which can be done by reading elements from both blocks into buffers, since each block is already sorted. At the end, you merge the last two blocks into one block that contains all the elements in the right order.
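A minimal sketch of a single merge pass (in Perl, with placeholder file names): two already-sorted chunk files are combined while holding only one line from each in memory.

    use strict;
    use warnings;

    open my $left,  '<', 'chunk_a.txt' or die $!;
    open my $right, '<', 'chunk_b.txt' or die $!;
    open my $out,   '>', 'merged.txt'  or die $!;

    my $l = <$left>;
    my $r = <$right>;
    while (defined $l and defined $r) {                  # emit the smaller head each time
        if ($l le $r) { print {$out} $l; $l = <$left>;  }
        else          { print {$out} $r; $r = <$right>; }
    }
    while (defined $l) { print {$out} $l; $l = <$left>;  }   # drain the leftovers
    while (defined $r) { print {$out} $r; $r = <$right>; }
    close $_ for $left, $right, $out;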
If you are on Unix, use sort ;)
It may seem stupid but the command-line tool has been programmed to handle this case and you won't have to reprogram it.

What's the performance difference between DBI's fetchall_hashref and fetchall_arrayref?

I am writing some Perl scripts to manipulate large amounts (in total about 42 million rows, but it won't be done in one hit) of data in two PostgreSQL databases.
For some of my queries it makes good sense to use fetchall_hashref because I have synthetic keys. However, in other instances, I'm going to have to use an array of three columns as the unique key.
This has got me wondering about performance differences between fetchall_arrayref and fetchall_hashref. I know that in both cases everything goes into memory, so selecting several GB of data probably isn't a good idea, but other than that there appears to be very little guidance in the documentation when it comes to performance.
My googling has been unsuccessful so if anyone can point me in the direction of some general performance studies I'd be grateful.
(I know I could benchmark this myself but unfortunately for dev purposes I don't have access to a machine which has identical hardware to production which is why I'm looking for general guidelines or even best practices).
Most of the choices between fetch methods depend on what format you want the data to end up in and how much of the work for that you want DBI to do for you.
My recollection is that iterating with fetchrow_arrayref and using bind_columns is the fastest (least DBI overhead) way to read through returned data.
First question is whether you really need to use a fetchall in the first place. If you don't need all 42 million rows in memory at once, then don't read them all in at once! bind_columns and fetchrow_arrayref are generally the way to go whenever possible, as ysth already pointed out.
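For reference, that pattern looks something like this (the DSN, credentials, and query are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Stream the rows instead of slurping them: bind_columns plus fetch (an
    # alias for fetchrow_arrayref) keeps exactly one row in the bound variables
    # at a time, so memory use stays flat no matter how many rows come back.
    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 1 });
    my $sth = $dbh->prepare('SELECT key_a, key_b, key_c, payload FROM big_table');
    $sth->execute;
    $sth->bind_columns(\my ($key_a, $key_b, $key_c, $payload));
    while ($sth->fetch) {
        # process one row here via $key_a, $key_b, $key_c, $payload
    }
    $dbh->disconnect;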
Assuming that fetchall really is needed, my gut intuition is that fetchall_arrayref will be marginally faster, since an array is a simpler data structure and doesn't need to compute hashes of the inserted keys, but the savings in time would be dwarfed by database read times, so it's unlikely to be significant.
Memory requirements are another matter entirely, though. The structure returned by fetchall_hashref is a hash of id => row, with each row represented as a hash of field name => field value. If you get 42 million rows, that means your list of field names is repeated in each of 42 million sets of hash keys... That's going to require a good deal more memory to store than the array of arrays returned by fetchall_arrayref. (Unless DBI is doing some magic with tie to optimize the fetchall_hashref structure, I suppose.)
