Faster searching through files in Perl - performance

I have a problem where my current algorithm uses a naive linear search to retrieve data from several data files by matching strings.
It is something like this (pseudo code):
while count < total number of files
    open current file
    extract line from this file
    build an arrayofStrings from this line
    foreach string in arrayofStrings
        foreach file in arrayofDataReferenceFiles
            search in these files
    close file
    increment count
For a large real life job, a process can take about 6 hours to complete.
Basically I have a large set of strings that the program uses to search through the same set of files (for example, 10 files in one instance and 3 in the next instance the program runs). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking, since the search gets repetitive after a while, is it possible to prebuild an index of the locations of specific lines in the data reference files, without using any external Perl libraries, once the file array gets built (the files are known)? This script is going to be ported onto a server that probably has only standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!

It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in only once and match it against all the patterns while it's in memory, before moving on. This will reduce I/O compared with looping over the files once per pattern.
On the other hand, if the matching is what's taking the time you're probably better off using a library which can simultaneously match lots of patterns.
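To make that concrete, here is a minimal Perl sketch of the single-pass idea (the string list and file names are placeholders, not from the original post): all the search strings are combined into one alternation so each reference file is read exactly once and each line is scanned once.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inputs: the strings to search for and the reference files.
my @search_strings  = ('alpha', 'beta', 'gamma');
my @reference_files = ('ref1.dat', 'ref2.dat');

# Combine all strings into a single alternation so every line is scanned once.
my $alternation = join '|', map { quotemeta } @search_strings;
my $regex       = qr/($alternation)/;

my %found;    # string => list of "file:line" locations
for my $file (@reference_files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        while ($line =~ /$regex/g) {
            push @{ $found{$1} }, "$file:$.";
        }
    }
    close $fh;    # closing the handle also resets the line counter $.
}

for my $string (@search_strings) {
    my $hits = $found{$string} || [];
    print "$string found at: @$hits\n";
}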

You could probably replace this:
foreach file in arrayofDataReferenceFiles
    search in these files
with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support:
dbmopen HASH,DBNAME,MASK
This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.
You'd normally access this through tie, but that's not important; every Perl installation should have some support for at least one hash-on-disk library without needing non-core packages installed.
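As a rough sketch of that preprocessing step, assuming whitespace-separated words and made-up file names, the reverse index could be built once and then reused for each lookup:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical reference files to index.
my @reference_files = ('ref1.dat', 'ref2.dat');

# On-disk hash mapping each word to a space-separated list of files.
# (SDBM, the usual fallback, limits key+value size, so very long file
# lists may need a different DBM backend.)
my %index;
dbmopen(%index, 'word_index', 0644) or die "Cannot open index: $!";

for my $file (@reference_files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %seen;    # avoid listing the same file twice for a word
    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            next if $seen{$word}++;
            $index{$word} = defined $index{$word}
                          ? "$index{$word} $file"
                          : $file;
        }
    }
    close $fh;
}

# Later, a lookup is a single hash access instead of a linear scan.
my $word = 'example';
print "$word appears in: ",
      (defined $index{$word} ? $index{$word} : 'no files'), "\n";
dbmclose(%index);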

As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check a file, first look to see whether an index exists and, if not, build one before doing the search) would be an excellent approach: when a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then doing a search off the index will be slower than a linear search by such a small margin as to be largely irrelevant.
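A minimal sketch of that on-demand check in Perl, using modification times to decide whether an existing index is still valid (the one-index-per-file naming scheme is just an assumption):

use strict;
use warnings;

# Hypothetical layout: one index per data file, named "<datafile>.idx".
sub index_is_fresh {
    my ($data_file) = @_;
    my $index_file  = "$data_file.idx";
    return 0 unless -e $index_file;
    # The index is usable only if it is at least as new as the data file.
    return (stat $index_file)[9] >= (stat $data_file)[9];
}

for my $data_file (@ARGV) {
    if (index_is_fresh($data_file)) {
        print "$data_file: index is up to date, search it directly\n";
    } else {
        print "$data_file: index missing or stale, rebuild before searching\n";
    }
}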

Related

Fortran95 access large files fast using direct access

I am currently working on a problem which requires me to store a large amount of well structured information in a file.
It is more data than I can keep in memory, but I need to access different parts of it very often and would like to do so as quickly as possible (of course).
Unfortunately, the file would be large enough that actually reading through it would take quite some time as well.
From what I have gathered so far, it seems to me that ACCESS="DIRECT" would be a good way of handling this problem. Do I understand correctly that with direct access, I am basically pointing at a specific chunk of memory and asking "What's in there?"? And do I correctly infer from that that reading time does not depend on the overall file size?
Thank you very much in advance!
You can think of an ACCESS='DIRECT' file as a file consisting of a number of fixed size records. You can do operations like read or write record #N in O(1) time. That is, in order to access record #N you don't need to scan through all the preceding #M (M<N) records in the file.
If this maps reasonably well to the problem you're trying to solve, then ACCESS='DIRECT' might be the correct solution in your case. If not, ACCESS='STREAM' offers a little more flexibility in that the size of each record does not need to be fixed, though you need to be able to compute the correct file offset yourself. If you need even more flexibility, there are things like NetCDF, or HDF5 as @HighPerformanceMark suggested, or even things like sqlite.
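The question is about Fortran, but the fixed-size-record idea itself is language-neutral; as a rough illustration in Perl (the record length and file name are invented), seek jumps straight to record N without touching any of the preceding records:

use strict;
use warnings;

my $record_size = 64;            # assumed fixed record length in bytes
my $file        = 'records.dat'; # hypothetical data file

# Read record number $n (0-based) in O(1) time, independent of file size.
sub read_record {
    my ($fh, $n) = @_;
    seek $fh, $n * $record_size, 0 or die "seek failed: $!";
    my $got = read $fh, my $buffer, $record_size;
    die "short read" unless defined $got && $got == $record_size;
    return $buffer;
}

open my $fh, '<:raw', $file or die "Cannot open $file: $!";
my $record = read_record($fh, 1000);   # fetch record #1000 directly
close $fh;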

How Duplicate File search is implemented in Gemini For Mac os

I tried to search for duplicate files on my Mac machine via the command line.
That process took almost half an hour for 10 GB of data files, whereas the Gemini and CleanMyMac apps take much less time to find them.
So my question is: how is this speed achieved in these apps, what is the concept behind it, and in which language is the code written?
I tried googling for information but did not find anything related to duplicate finders.
If you have any ideas, please share them here.
First of all, Gemini locates files of equal size; then it uses its own hash-like, type-dependent algorithm to compare file contents. That algorithm is not 100% accurate but is much quicker than classical hashes.
I contacted support, asking them what algorithm they use. Their response was that they compare parts of each file to each other, rather than the whole file or doing a hash. As a result, they can only check maybe 5% (or less) of each file that's reasonably similar in size to each other, and get a reasonably accurate result. Using this method, they don't have to pay the cost of comparing the whole file OR the cost of hashing files. They could be even more accurate, if they used this method for the initial comparison, and then did full comparisons among the potential matches.
Using this method, files that are minor variants of each other may be detected as identical. For example, I've had two songs (original mix and VIP mix) that counted as the same. I also had two images, one with a watermark and one without, listed as identical. In both these cases, the algorithm just happened to pick parts of the file that were identical across the two files.
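None of this is Gemini's actual code, but a rough Perl sketch of the approach described above (size grouping first, then comparing a few sampled chunks) might look like this; the chunk size and sample offsets are arbitrary choices:

use strict;
use warnings;

my $chunk_size = 4096;   # bytes compared per sample; an arbitrary choice

# Group candidate files by size; only same-size files can be duplicates.
my %by_size;
for my $file (@ARGV) {
    push @{ $by_size{ -s $file } }, $file if -f $file;
}

# Compare two files by sampling chunks at the start, middle, and end.
sub probably_identical {
    my ($path_a, $path_b, $size) = @_;
    open my $fh_a, '<:raw', $path_a or die "Cannot open $path_a: $!";
    open my $fh_b, '<:raw', $path_b or die "Cannot open $path_b: $!";
    my $last = $size > $chunk_size ? $size - $chunk_size : 0;
    for my $offset (0, int($size / 2), $last) {
        seek $fh_a, $offset, 0;
        seek $fh_b, $offset, 0;
        read $fh_a, my $buf_a, $chunk_size;
        read $fh_b, my $buf_b, $chunk_size;
        return 0 unless $buf_a eq $buf_b;
    }
    return 1;   # all sampled chunks matched; likely (not certainly) duplicates
}

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next if @files < 2;
    for my $i (0 .. $#files - 1) {
        for my $j ($i + 1 .. $#files) {
            print "Possible duplicates: $files[$i] and $files[$j]\n"
                if probably_identical($files[$i], $files[$j], $size);
        }
    }
}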

Java 7 parallel search for files recursively in a folder

I would like to use the Visitor API in Java 7 to search for some files recursively in a folder. Since I will be searching big folders, with 100,000+ files spread across subfolders, I would like to do this in parallel.
But I can't, for example, spawn a thread for each folder. Maybe Fork/Join is an idea, but from what I've understood, FJ is usually used when you know the data; for example, you have a given array and you want to process it in chunks of 5 elements. So divide and conquer can be applied very well in that case.
So can you please share your thoughts on an approach that would allow me to search recursively for files fast (it must be parallel) and also allow cancellation if the user desires it.
Thank you,
Ryu
I bet there will be no gain from parallel search on a single disk drive; the disk access/read time is significantly larger than any possible name comparisons you can make.
Did you actually write the code? Did you test it? Did you profile it? What have you deduced from profiling?
Remember that the first rule of optimization is: don't do it.
You cannot use Files.walkFileTree for that (I assume that is what you mean when you say "the Visitor API, in Java 7"); you have to implement the directory traversal yourself to be able to parallelize it.
Fork/join does, actually, fit this problem very well. There is even a relevant example in Fork and Join: Java Can Excel at Painless Parallel Programming Too!. There is an example program in that article that "count[s] the occurrences of a word in a set of documents" by traversing the files in a directory and all its subdirectories (recursively).
The author provides some seemingly positive speedup measurements in the discussion section, but you should consider what Dariusz said about the problem probably being IO-bound rather than CPU-bound (i.e., just throwing lots of threads at it will not result in any speedup beyond some, probably low, number of threads). It is surprising, to me at least, that the example program from the article was faster with 12 threads than with 8.
Cancellation, afaics, is an orthogonal issue to this, and can be implemented in some standard way (e.g., polling a volatile flag).

How to find unique entries in Large data set?

I have 100 million lines of data; each line is a word no longer than 15 characters, one word per line. The data is stored in multiple files.
My goal is to find the unique words among all the files.
One solution is to import all the words into a database and add a unique key on the field, but that is too slow for this large a data set.
Is there any faster solution?
Thank you
I'm not sure that there'll be many faster ways than using a database. Personally, I usually use UNIX shell script for this:
cat * | sort | uniq
I don't know how fast that would be with 100,000,000 words, and I'm not sure how fast you want it to be. (E.g., do you need to run it lots of times or just once? If just once, I'd go with the sort and uniq option and let it run overnight if you can).
Alternatively, you could write a script in ruby or a similar language that stored the words in an associative array. I suspect that would almost certainly be slower than the database approach though.
I guess if you really want speed, and you need to carry out this task (or ones like it) often, then you might want to write something in C, but to me that feels a bit like overkill.
Ben
Using a database for this is insane. 100 million records of 15 chars fit in RAM. If there is at least some duplication, simply build a trie. It should be able to process 50 MB/second or so on a modern machine.
If you have to stick with the file structure, then you need some way of indexing the files and then maintaining the index.
Otherwise, I would recommend moving to a database and migrating all operations on that file to work with the database.
You could store the words in a hash table (see the sketch below). Assuming there are quite a number of duplicates, the O(1) search time will be a big performance boost.
Read a line.
Search for the word in the hash table.
If not found, add it to the table.
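A minimal Perl sketch of that hash approach, reading the files named on the command line exactly once (memory use grows with the number of unique words, which, as another answer notes, should still fit in RAM for words this short):

use strict;
use warnings;

my %seen;   # word => 1 once the word has been encountered

# Read every input file exactly once, deduplicating as we go.
while (my $word = <>) {
    chomp $word;
    $seen{$word} = 1;
}

# The hash keys are the unique words.
print "$_\n" for sort keys %seen;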
If you have this much data, then it needs to be in a SQL server. This is why SQL was designed in the first place. If you continue to use these files you will forever be stuck with performance issues.
Even if these files are modified from external programs (or via FTP) you need to create an import process to run nightly.
You can conserve speed, space, or your sanity. Pick any two.
Throwing it all into a database sacrificed both speed and space, as you found out. But it was easy.
If space is your main problem (memory, disk space) then partition the work. Filter all of the 1 character lines from the files and use one of the above solutions (sort, uniq). Repeat with the 2 character lines for each file. And so on. The unique solutions from each pass form your solution set.
If your main problem is speed, then read each file exactly once creating a hash table (dictionary, whatever) to look for duplicates. Depending on the hash implementation, this could eat up bucketloads of memory (or disk). But it'll be fast.
If you need to conserve speed and space, then consider blending the two techniques. But be prepared to sacrifice the third item.
If there's significant duplication within individual files, it may be quicker to do it file by file then merge the results. Something along the lines of:
{ for n in * ; do sort -u $n ; done } | sort -u
(I'm assuming GNU bash and GNU sort)
I think the choice of best solution will depend heavily on the distribution of duplicates and the number of separate files, though, which you haven't shared with us.
Given myhusky's clarification (plenty of dupes, 10~20 files), I'll definitely suggest this as a good solution. In particular, dense duplication will speed up sort -u versus sort | uniq.

Cache directory structure

I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example: for storing files, let's say our file is named IMG_PARTY.JPG; a common way is to put it in a directory path like:
files/i/m/IMG_PARTY.JPG
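For concreteness, a small Perl sketch of that fan-out scheme (the two-level depth and the 'files' root are arbitrary choices, not a standard):

use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec;

# Map a file name to a fanned-out cache path,
# e.g. IMG_PARTY.JPG -> files/i/m/IMG_PARTY.JPG
sub cache_path {
    my ($name) = @_;
    my @fanout = map { lc } (substr($name, 0, 1), substr($name, 1, 1));
    return File::Spec->catfile('files', @fanout, $name);
}

my $path = cache_path('IMG_PARTY.JPG');
make_path(dirname($path));   # create files/i/m if it does not exist yet
print "$path\n";             # prints files/i/m/IMG_PARTY.JPG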
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems doing linear lookups find files faster when there are fewer of them in a directory. Such a structure spreads the files thin.
To avoid messing up *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (you have to pass them through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates. Since you will remove by date. :)
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.
So even if the FS has a capability of coping with incredibly large directory sizes, there are good reasons not to have large flat structures (They're also a pig to back up)
I've benchmarked GFS2 (clustered) with 32,000 files either in one directory or arranged in a tree structure; recursive listings of the tree were around 300 times faster than getting a listing when the files were all in a flat structure (which could take up to 10 minutes).
EXT4 showed similar ratios but as the end point was only a couple of seconds most people wouldn't notice.
