Storage for Write Once Read Many - algorithm

I have a list of 1 million digits. Every time the user submit an input, I would need to do a matching of the input with the list.
As such, the list would have the Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL Database but is it suitable for WORM (UPDATE: using VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.

To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize process time.
Storing the data in memory will give you the fastest processing time, especially if you could optimize the datastructure for your most common operations (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file, and read the data during startup.
SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).

Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.

If you're concerned about speed and don't want to care about file system storage, probably SQL is your best shot. You can optimize your table indexes but also will add another external dependency on your project.
EDIT: Seems MySQL have an ARCHIVE Storage Engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.

Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires DBMS (extra app dependency, design expertise), fast, but not as fast as holding the whole list in memeory

If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list on it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy an USB stick which has a "write protect" switch but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With todays computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.

Given that 1M numbers is not a huge amount of numbers for todays computers, why not just do pretty much the simplest thing that could work. Just store the numbers in a text file and read them into a hash set on application startup. On my computer reading in 1M numbers from a text file takes under a second and after that I can do about 13M lookups per second.

Related

What makes Everything's file search and index so efficient?

Everything is a file searching program. As its author hasn't released the source code, I am wondering how it works.
How could it index files so efficiently?
What data structures does it use for file searching?
How can its file searching be so fast?
To quote its FAQ,
"Everything" only indexes file and folder names and generally takes a
few seconds to build its database. A fresh install of Windows 10
(about 120,000 files) will take about 1 second to index. 1,000,000
files will take about 1 minute.
If it takes only one second to index the whole Windows 10, and takes only 1 minute to index one million files, does this mean it can index 120,000 files per second?
To make the search fast, there must be a special data structure. Searching by file name doesn't only search from the start, but also from the middle in most cases. This makes it some widely used indexing structures such as Trie and Red–black tree ineffective.
The FAQ clarifies further.
Does "Everything" hog my system resources?
No, "Everything" uses very little system resources. A fresh install of
Windows 10 (about 120,000 files) will use about 14 MB of ram and less
than 9 MB of disk space. 1,000,000 files will use about 75 MB of ram
and 45 MB of disk space.
Short Answer: MFT (Master File Table)
Getting the data
Many search engines used to recursively walk through the entire disk structure so that it finds all the files. Therefore it used to take longer time to complete the indexing process (even when contents are not indexed). If contents were also indexed, it would take a lot longer.
From my analysis of Everything, it does not recurse at all. If we observe the speed, in about 5 seconds it indexed an entire 1tb drive (SSD). Even if it had to recurse it would take longer - since there are thousands of small files - each with its own file size, date etc - all spread across.
Instead, Everything does not even touch the actual data, it reads and parses the 'Index' of the hard drive. For NTFS, MFT store all the file names, its physical location (like concept of iNodes in Linux). So, in one small contiguous area (a file), all the data inside MFT is present. So, the search indexer does not have waste time finding where the info about next file is, it does not have to seek. Since MFT by design is contiguous (rare exception if there are many more files and MFT for some reason is filled up or corrupt, it will link to a new one which will cause a seek time - but that edge case is very rare).
MFT is not plain text, it needs to be parsed. Folks at Everything have designed a superfast parser and decoder for NFT and hence all is well.
FSCTL_GET_NTFS_VOLUME_DATA (declared in winioctl.h) will get you the cluster locations for mft. Alternatively, you could use NTFSInfo (Microsoft SysInternals - NTFSInfo v1.2).
MFT zone clusters : 90400352 - 90451584
Storing and retrieving
The .db file from my index at C:\Users\XXX\AppData\Local\Everything I assume this is a regular nosql file-based database. Since it uses a DB and not a flat file, that contributes to the speed. And also, at start of program, it loads this db file into RAM, so all the queries do not look up the DB on disk, instead on RAM. All this combined makes it slick.
How could it index files so efficiently?
First, it indexes only file/directory names, not contents.
I don't know if it's efficient enough for your needs, but the ordinary way is with FindFirstFile function. Write a simple C program to list folders/files recursively, and see if it's fast enough for you. The second step through optimization would be running threads in parallel, but maybe disk access would be the bottleneck, if so multiple threads would add little benefit.
If this is not enough, finally you could try to dig into the even lower Native API functions; I have no experience with these, so I can't give you further advice. You'd be pretty close to the metal, and maybe the Linux NTFS project has some concepts you need to learn.
What data structures does it use for file searching?
How can its file searching be so fast?
Well, you know there are many different data structures designed for fast searching... probably the author ran a lot of benchmarks.

VSAM Search VS COBOL search/loop

I have a file that could contain about 3 million records. Certain records of this file will need to be updated multiple times throughout the run of the program. If I need to pull specific records from this file, which of the following is more efficient:
Indexed VSAM search
Indexed flat file with a COBOL search all
Buffering all of the data into working storage and writing a loop to handle the search
Obviously, if you can buffer all of the data into memory (and if the host system can support a working-set of pages which is big enough to allow all of it to actually remain in RAM without paging, then this would probably be the fastest possible approach.
But, be very careful to consider "hidden disk-I/O" caused by the virtual-memory paging subsystem! If the requested "in-memory" data is, in fact, not "in memory," a page-fault will occur and your process will stop in its tracks until the page has been retrieved. (And if "page stealing" occurs, well, you're in trouble. Your "in-memory" strategy just turned into a possibly very-inefficient(!) disk-based one. If keys are distributed randomly, then your process has a gigantic working-set that it is accessing randomly. If all of that memory is not actually in memory, and will stay there, you're in trouble.
If you are making updates to a large file, consider sorting the updates-delta file before processing it, so that all occurrences of the same key will be adjacent. You can now write your COBOL program to take advantage of this (and, of course, to abend if an out-of-sequence record is ever detected!). If the key in "this" record is identical to the key of the "previous" one, then you do not need to re-read the record. (And, you do not actually need to write the old record, until the key does change.) As the indexed-file access method is presented with the succession of keys, each key is likely to be "close to" the one previously-requested, such that some of the necessary index-tree pages will already be in-memory. Obviously, you will need to benchmark this, but the amount of time spent sorting the file can be far less than the amount of time spent in index-lookups. (Which actually can be considerable.)
The answer of Mike has the important issue about "hidden I/O" in (depends on the machine, configuration, amount of data)...
If you very likely need to update many records the option Mike suggest is the most useful one.
If you very likely need to update not much records (I'd guess you're likely below 2%) another approach can be quite faster (needs a benchmark !):
read every key via indexed VSAM search
store the changed record in memory (big occurs table), if you will only change some values and the record is quite big then only store all possible changed values + key in the table without an actual REWRITE
before doing a VSAM search: look in your occurs table if you read the key
already, take the values either from there or get a new one
...
at program end: go through your occurs and REQRITE all records (if you have the complete record a REWRITE is enough, otherwise you'd need a READ first to get the complete record)
Performance is often: "know your data and possible program flow, then try the best 2-3 approach, benchmark and decide".

Column Store Strategies

We have an app that requires very fast access to columns from delimited flat files, whithout knowing in advance what columns will be in the files. My initial approach was to store each column in a MongoDB document as a array, but it does not scale as the files get bigger as I hit the 16MB document limit.
My second approach is to split the file columnwise and essentially treat them as blobs that can be served off disk to the client app. I'd intuitively think that storing the location in database and the files on disk is the best approach, and that storing them in a database as blobs (or in mongo gridfs) is adding unnessasary overhead, however there may be advatages that are not apparant to me at the moment. So my question is: what would be the advantage to storing them as blobs in a database such as (Oracle/Mongo) and are there any databases that are particularly well suited to this task.
Thanks,
Vackar

Running out of RAM memory

I may need to build a hash table that may grow very large in size. I am wondering if the hash table does not fit in memory what is the best way to address this problem as to avoid having the application crash when it runs out of memory.
Use case: This hash table contains a bunch of ids that are referenced in a for loop that needs to consult the id for a particular word.
Any time you have data that can not be easily recreated on the fly, then you need to make provisions to get it out of RAM and onto disk. Any sort of data store will do that. You could use a flat or text file, or a YAML file.
If you need fast access then you'll be looking at some sort of database, because reading a flat/text file doesn't easily allow random access. SQLLite can do it, or a no-sql database.
If you need to allow multiple processes access to the data and have good access restriction, and/or store the data on one machine and access it from another, then you'll be looking at a database of some sort. At that point I'd look into MySQL or Postgres. I prefer the later, but they'll both work.
If you really think the hash will grow so big, then maybe you should not store this data in a hash in your ram. I don't think you can easily avoid a crash when your app runs out of memory. I guess the key is create mechanisms to avoid major memory consumption.
I don't know your situation, but I really doubt the hash table you described would make a reasonable computer run out of memory. If you really think so, maybe you should use a key value storage database (Redis is fairly easy to learn http://redis.io/) or other kind of NoSQL database.

Filesystem seek performance with lots of tiny files

I'm looking to build a server with lots of tiny files delivered by an XML API. It won't be doing a whole lot of iterating over directories or blocks of sequential files--we're talking lots and lots of seeks for discontinuous data.
Will seek time on BSD UFS degrade over time for requests for individual files? I understand that the filesystem's inode limit is based on the size of the partition/slice, but the hard drive has to step through the inode table for every file request before it can discover the location of the data. What filesystem yields the best performance for seek time?
The alternative is to setup 2-4GB "blob" files and have a separate system of seeking a file contained in them from within the software. The software's "inode table" could be optimized for delivery based on currently logged in user, etc... These "inode tables" would likely be cached in RAM and would only relate to the users currently logged in so that there are fewer wasted resources.
Where do these two solutions rate on a scalability and maintenance standpoint? What sort of performance gains, if any, could I expect by using the second solution?
The most obvious and time-proven mitigation technique is to use a good hierarchical design for directories (and pathname search strategies), and have more directories with fewer files in each.
For recent FreeBSD versions with dirhash and softupdates I have seen no problems with a few ten thousand files per directory. You probably don't want to go north of 500.000 files or so. E.g. deleting a directory with 2.500.000 files took me three days.
I'm not sure i understand you question correctly, but if you want to seek over lots of files, why not use a partioned mysql table laid out on a RAID0 or VFS filesystem?
Edit: as far as i know, lots of files in one folder will degrade any FS speed as it has to maintain bigger lists of files, permissions and names, a database is designed to keep lists of data in memory and seek in a very optimized way through it.
More details of your situation would be helpful, are the files existing or would they be created by your application? If you need a way to store arbitrary data with out the structure of a relational database have you looked at object databases
Another option, if your objects should or can be accessed via HTTP, is to use a varnish cache in front of a small web server. Initially objects would be stored on disk, but varnish would store and serve objects from memory after the first access to a given object.

Resources