VSAM Search VS COBOL search/loop - performance

I have a file that could contain about 3 million records. Certain records of this file will need to be updated multiple times throughout the run of the program. If I need to pull specific records from this file, which of the following is more efficient:
Indexed VSAM search
Indexed flat file with a COBOL search all
Buffering all of the data into working storage and writing a loop to handle the search

Obviously, if you can buffer all of the data into memory (and if the host system can support a working set of pages big enough to allow all of it to actually remain in RAM without paging), then this would probably be the fastest possible approach.
But be very careful to consider the "hidden disk I/O" caused by the virtual-memory paging subsystem! If the requested "in-memory" data is, in fact, not in memory, a page fault will occur and your process will stop in its tracks until the page has been retrieved (and if "page stealing" occurs, you're in trouble: your "in-memory" strategy just turned into a possibly very inefficient disk-based one). If keys are distributed randomly, your process has a gigantic working set that it is accessing randomly; unless all of that memory actually is in memory, and stays there, you're in trouble.
If you are making updates to a large file, consider sorting the updates-delta file before processing it, so that all occurrences of the same key will be adjacent. You can now write your COBOL program to take advantage of this (and, of course, to abend if an out-of-sequence record is ever detected!). If the key in "this" record is identical to the key of the "previous" one, then you do not need to re-read the record. (And, you do not actually need to write the old record, until the key does change.) As the indexed-file access method is presented with the succession of keys, each key is likely to be "close to" the one previously-requested, such that some of the necessary index-tree pages will already be in-memory. Obviously, you will need to benchmark this, but the amount of time spent sorting the file can be far less than the amount of time spent in index-lookups. (Which actually can be considerable.)
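To make that flow concrete, here is a rough Python sketch of the sorted-delta pattern (not COBOL; read_master/rewrite_master are hypothetical stand-ins for the keyed READ/REWRITE, deltas are assumed to be (key, changes) pairs, and records are treated as dicts):

```python
# Sketch of applying a key-sorted delta file against an indexed master file.
# read_master(key) and rewrite_master(record) stand in for the keyed READ and
# REWRITE of the access method; each delta is assumed to be a (key, changes) pair.

def apply_sorted_deltas(deltas, read_master, rewrite_master):
    prev_key = None
    current = None
    for key, changes in deltas:            # deltas must be sorted by key
        if prev_key is not None and key < prev_key:
            raise ValueError("delta file out of sequence")   # the "abend" case
        if key != prev_key:
            if current is not None:
                rewrite_master(current)    # write only when the key changes
            current = read_master(key)     # one keyed read per distinct key
            prev_key = key
        current.update(changes)            # apply this delta to the in-memory copy
    if current is not None:
        rewrite_master(current)            # flush the record for the last key
```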

Mike's answer raises the important issue of "hidden I/O" (how much it matters depends on the machine, the configuration and the amount of data)...
If you are likely to update many records, the option Mike suggests is the most useful one.
If you are likely to update only a few records (I'd guess you're below 2%), another approach can be quite a bit faster (it needs a benchmark!):
read every key via an indexed VSAM search
store the changed record in memory (a big OCCURS table); if you will only change some values and the record is quite big, then store only the possibly changed values plus the key in the table, without an actual REWRITE
before doing a VSAM search, check your OCCURS table to see whether you have already read that key, and take the values either from there or read a new record
...
at program end, go through your OCCURS table and REWRITE all records (if you have the complete record, a REWRITE is enough; otherwise you need a READ first to get the complete record)
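Roughly, the idea looks like this in Python pseudocode (vsam_read/vsam_rewrite are hypothetical stand-ins for the keyed READ/REWRITE, and records are treated as dicts):

```python
# Sketch of caching changed records in memory and rewriting them once at the end.
# vsam_read(key) and vsam_rewrite(record) stand in for the keyed READ / REWRITE
# calls against the indexed file.

changed = {}   # key -> record with pending changes (the "OCCURS table")

def get_record(key, vsam_read):
    if key in changed:                 # already read and changed earlier
        return changed[key]
    return vsam_read(key)              # otherwise one keyed read

def update_record(key, record, new_values):
    record.update(new_values)          # apply the changed values
    changed[key] = record              # remember it; no REWRITE yet

def flush(vsam_rewrite):
    for record in changed.values():    # one REWRITE per changed key, at program end
        vsam_rewrite(record)
```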
Performance is often: "know your data and the possible program flow, then try the best two or three approaches, benchmark and decide".

Related

Collecting large statistical sets with pg_stat_statements?

According to Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be - say at 100k or 500k (the default is 5k). Is it safe to set it that high, and what could be the negative repercussions of such high levels? Would aggregating statistics into an external database via logstash/fluentd be a preferred approach above a certain size?
1.
From what I have read, it hashes the query and keeps the hash in the database, saving the text to the file system. So the following concern is more to be expected than overloaded shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
The hash of the text is so much smaller than the text itself that I don't think you need to worry about the extension's memory consumption even for long queries, especially knowing that the extension uses the query analyser (which will run for EVERY query ANYWAY):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max ten times bigger should take ten times more shared memory, I believe. The growth should be linear. The documentation does not say so explicitly, but logically it should be the case.
There is no definitive answer on whether a particular value is safe, because there is no data on your other configuration values and hardware. But since the growth should be linear, consider this: if you set it to 5K and query runtime has grown almost nothing, then setting it to 50K will prolong it by almost nothing times ten. BTW, my question - who is going to dig through 50,000 slow statements? :)
2.
This extension already pre-aggregates statistics per normalized statement. You can query it straight on the DB, so moving the data to another database and querying it there only gives you the benefit of unloading the original DB and loading the other one. In other words, you save 50MB worth of querying on the original, but spend the same on the other. Does it make sense? For me - yes. This is what I do myself. But I also save execution plans for statements (which is not part of the pg_stat_statements extension). I believe it depends on what you have and what you need. There is definitely no need for it just because of the number of queries - unless the query-text file grows so large that the extension has to fall back to its recovery behaviour:
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields
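For illustration, a minimal sketch of pulling the pre-aggregated statistics straight from the view with psycopg2 (connection details are placeholders; the column is total_exec_time on PostgreSQL 13+ and total_time on older versions):

```python
# Sketch: read the pre-aggregated statistics straight from the pg_stat_statements view.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=monitor")   # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT queryid, calls, total_exec_time, rows, left(query, 120)
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 20
    """)
    for row in cur.fetchall():
        print(row)                                     # ship these wherever you aggregate
conn.close()
```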

Serving millions of routes with good performance

I'm doing some research for a new project, for which the constraints and specifications have yet to be set. One thing that is wanted is a large number of paths, directly under the root domain. This could ramp up to millions of paths. The paths don't have a common structure or unique parts, so I have to look for exact matches.
Now I know it's more efficient to break up those paths, which would also help in the path lookup. However I'm researching the possibility here, so bear with me.
I'm evaluating methods to accomplish this, while maintaining excellent performance. I thought of the following methods:
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
Storing the paths in a key-value store like Redis. This would be a lot better, and perform quite well I think (have to benchmark it though).
Doing string/regex matching - like many frameworks do out of the box - for this amount of possible matches is nuts and thus not really an option. But I could see how doing some sort of algorithm where you compare letter-by-letter, in combination with some smart optimizations, could work.
But maybe there are tools/methods I don't know about that are far more suited for this type of problem. I could use any tips on how to accomplish this though.
Oh and in case anyone is wondering, no this isn't homework.
UPDATE
I've tested the Redis approach. Based on two sets of keywords, I got 150 million paths. I've added each of them using the SET command, with the value being a serialized string of IDs I can use to identify the actual keywords in the request. (SET 'keyword1-keyword2' '<serialized_string>')
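(For reference, a minimal sketch of this load/lookup flow, assuming redis-py; the serialization and the keyword pairs are placeholders for the real data.)

```python
# Sketch of the loading and lookup steps described above, using redis-py.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load(pairs):
    # pairs: iterable of ((keyword1, keyword2), ids) tuples
    pipe = r.pipeline()
    for (kw1, kw2), ids in pairs:
        pipe.set(f"{kw1}-{kw2}", json.dumps(ids))   # SET 'keyword1-keyword2' '<serialized>'
    pipe.execute()

def lookup(path):
    # exact-match lookup for an incoming request path like "keyword1-keyword2"
    raw = r.get(path)
    return json.loads(raw) if raw is not None else None
```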
A quick test in a local VM with a data set of one million records returned promising results: benchmarking 1000 requests took 2ms on average. And this was on my laptop, which runs tons of other stuff.
Next I did a complete test on a VPS with 4 cores and 8GB of RAM, with the complete set of 150 million records. This yielded a database of 3.1G in file size, and around 9GB in memory. Since the database could not be loaded in memory entirely, Redis started swapping, which caused terrible results: around 100ms on average.
Obviously this will not work and scale nicely. Either each web server needs an enormous amount of RAM for this, or we'll have to use a dedicated Redis routing server. I've read an article by the engineers at Instagram, who came up with a trick to decrease the database size dramatically, but I haven't tried it yet. Either way, this does not seem to be the right way to do it. Back to the drawing board.
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
You're probably underestimating what a database can do. Can I invite you to reconsider your position there?
For Postgres (or MySQL w/ InnoDB), a million entries is a notch above tiny. Store the whole path in a field, add an index on it, vacuum, analyze. Don't do nutty joins until you've identified the ID of your key object, and you'll be fine in terms of lookup speeds. Say a few ms when running your query from psql.
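A minimal sketch of that setup, assuming psycopg2 and a made-up routes table (the names are purely illustrative):

```python
# Sketch of the plain-SQL approach: one table, whole path in an indexed column.
import psycopg2

conn = psycopg2.connect("dbname=routes user=app")    # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS routes (
            path      text PRIMARY KEY,   -- the primary key gives us the index
            object_id bigint NOT NULL
        )
    """)

def lookup(path):
    with conn.cursor() as cur:
        cur.execute("SELECT object_id FROM routes WHERE path = %s", (path,))
        row = cur.fetchone()
        return row[0] if row else None    # None -> 404 (or a cached redirect)
```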
Your real issue will be the disk-I/O bottleneck if you get material amounts of traffic. The operating motto here is: the less, the better. Besides the basics such as installing APC on your PHP server, using Passenger if you're using Ruby, etc.:
Make sure the server has plenty of RAM to fit that index.
Cache a reference to the object related to each path in memcached (see the sketch after this list).
If you can categorize all routes in a dozen or so regex, they might help by allowing the use of smaller, more targeted indexes that are easier to keep in memory. If not, just stick to storing the (possibly trailing-slashed) entire path and move on.
Worry about misses. If you've a non-canonical URL that redirects to a canonical one, store the redirect in memcached without any expiration date and begone with it.
Did I mention lots of RAM and memcached?
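A rough sketch of that cache-aside pattern, assuming pymemcache and the hypothetical lookup() from the database sketch above:

```python
# Sketch of a cache-aside lookup: memcached first, database only on a miss.
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))        # placeholder memcached address

def resolve(path):
    cached = mc.get(path)
    if cached is not None:
        return int(cached)               # cache hit: no database round trip
    object_id = lookup(path)             # database lookup from the sketch above
    if object_id is not None:
        mc.set(path, str(object_id))     # cache the hit; misses could be cached too
    return object_id
```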
Oh, and don't overrate that ORM you're using, either. Chances are it's taking more time to build your query than your data store is taking to parse, retrieve and return the results.
RAM... Memcached...
To be very honest, Redis isn't so different from an SQL + memcached option, except when it comes to memory management (as you found out), sharding, replication, and syntax. And familiarity, of course.
Your key decision point (besides excluding iterating over more than a few regexes) ought to be how your data is structured. If it's highly structured with critical needs for atomicity, SQL + memcached ought to be your preferred option. If you've got custom fields all over and obese EAV tables, then playing with Redis or CouchDB or another NoSQL store ought to be on your radar.
In either case, it'll help to have lots of RAM to keep those indexes in memory, and a memcached cluster in front of the whole thing will never hurt if you need to scale.
Redis is your best bet I think. SQL would be slow and regular expressions from my experience are always painfully slow in queries.
I would do the following steps to test Redis:
Fire up a Redis instance either with a local VM or in the cloud on something like EC2.
Download a dictionary or two and pump this data into Redis, for example something from here: http://wordlist.sourceforge.net/. Make sure you normalize the data - for example, always lower-case the strings and remove whitespace at the start/end of each string, etc. (a sketch follows after this list).
I would ignore the hash. I don't see why you need to hash the URL: it would be impossible to read later if you wanted to debug things, and it doesn't seem to "buy" you anything. I went to http://www.sha1-online.com/, entered ryan, and got ea3cd978650417470535f3a4725b6b5042a6ab59 as the hash. The original text is much smaller to put in RAM, which will help Redis. Obviously for longer paths the hash would be better, but your examples were very small. =)
Write a tool to read from Redis and see how well it performs.
Profit!
Keep in mind that Redis needs to keep the entire data set in RAM, so plan accordingly.
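A small sketch of steps 2 and 3, assuming redis-py and a plain one-word-per-line wordlist; here the words go into a single Redis set for exact-match membership checks (one possible layout, not the only one):

```python
# Sketch: normalize a wordlist and pump it into Redis as a set, then spot-check lookups.
# "words.txt" stands in for a list downloaded from wordlist.sourceforge.net.
import redis

r = redis.Redis(host="localhost", port=6379)

def load_wordlist(path):
    pipe = r.pipeline()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip().lower()       # normalize: trim whitespace, lower-case
            if word:
                pipe.sadd("words", word)
    pipe.execute()

load_wordlist("words.txt")
print(r.sismember("words", "zebra"))          # True if "zebra" made it into the set
```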
I would suggest using some kind of key-value store (i.e. a hashing store), possibly along with hashing the key so it is shorter (something like SHA-1 would be OK IMHO).

Storing arrays of integers in database

I am creating a database that will store 100,000 (and probably more in the future) users. While this obviously happens in a table with one row per user, every user can (and will) store hundreds of items. In programming-language terms this would mean the user has two arrays (or one two-dimensional array) of integers: a column for the item IDs and a column for the amounts.
My instincts tell me to create a table to hold all these items, with rows like (userid, itemid, amount). However, this would result in a huge table. 200,000 users with 250 items each... that's 50 million entries in one table. This, plus the fact that the table will undergo continuous and rapid change, frightens me. (How rapid? I estimate up to 100 modifications per second.)
Typically there will be anywhere between 100 and 2000 users, all adding and removing items, and modifying amounts. These actions can and will happen in programming code. It would go as follows:
User starts session, program loads all the users items from the database
User modifies the item list
Every few minutes, the changes are saved into the database
When the user ends the session, it is also saved into the database
It is worth noting that there is a maximum to the number of items a user can store.
Are there any alternatives to using a separate table? Perhaps save the values in a formatted text string? Or is this one of the instances where using a MySQL database is actually a Bad Idea™?
Thank you for your time and insights.
My instincts tell me to create a table to hold all these items
Your instincts are right.
1) avoid premature optimisation
2) don't break the rules of normalization unless you've got a very good and real reason to do so
3) why do you suspect that the multi-table approach will be faster?
that's 50 million entries in one table
So what? Even if you only have an index on userid, the single big table will not be noticeably slower than a table per user (in practice, with 200,000 users, it will be much, much faster - a DBMS can hardly keep an open file handle for each of 200,000 tables!).
I estimate up to 100 modifications per second
Should be possible using MySQL and fairly basic hardware, but if it were me, and I wanted a bit of headroom, I'd go with a pair of mirrored SATA disks, tables on one mirror, indexes on the other.
The only issue I'd be concerned about (which applies regardless of which of the 2 models you choose) is supporting 2000 concurrent connections. Do the connections have to be concurrent? Or can each user download a working set (optionally using an optimistic locking strategy) and close off the connection, then push back the changes on a new connection? If not, then you'll probably want a good whack of memory and CPU.
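A minimal sketch of that optimistic-locking idea, using a hypothetical version column (sqlite3 is used here only to keep the sketch self-contained; the same statements work in MySQL):

```python
# Sketch: optimistic locking on the items table via a version counter.
# The UPDATE only succeeds if nobody else changed the row since we read it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE items (
    userid  INTEGER,
    itemid  INTEGER,
    amount  INTEGER,
    version INTEGER DEFAULT 0,
    PRIMARY KEY (userid, itemid))""")
conn.execute("INSERT INTO items VALUES (1, 42, 10, 0)")

def save_amount(userid, itemid, new_amount, version_read):
    cur = conn.execute(
        "UPDATE items SET amount = ?, version = version + 1 "
        "WHERE userid = ? AND itemid = ? AND version = ?",
        (new_amount, userid, itemid, version_read))
    conn.commit()
    return cur.rowcount == 1        # False -> someone else updated it first

print(save_amount(1, 42, 15, 0))    # True: first push-back succeeds
print(save_amount(1, 42, 20, 0))    # False: stale version, client must re-read
```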
But leaving aside whether to use one big table or lots of little ones, if this is the only use for the data, and access is not concurrent to particular data items, then why bother with a relational database at all? NoSQL or a shared filesystem might work just as well.
Putting data into one field as an array is almost always a mistake. It makes querying the data much harder and much more time-consuming, as well as much less likely to use indexes. It is OK if the values are just text where you would never need to find one or more elements of the array, but in my experience that situation is rarely encountered. Modern databases can handle 50 million records without even breaking a sweat. That's a small table in database terms.
It should be OK to do it as you described using two tables. The database should be able to handle millions of records.
The important points to look at:
1- Optimize your queries as much as possible.
2- Create the appropriate index(es) to speed up your queries.
3- Use InnoDB if you have concurrent read/update operations as it supports row-level locking as opposed to MyISAM.
4- Provide good hardware to support the database server.
5- Run the database server on a dedicated server if affordable.
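A compact sketch of points 1-3, again using sqlite3 purely as a stand-in for MySQL/InnoDB: a composite primary key covers the per-user reads, and the periodic save is one batched statement.

```python
# Sketch: normalized items table with a composite key, loaded per user at session
# start and saved back in one batched statement every few minutes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user_items (
    userid INTEGER,
    itemid INTEGER,
    amount INTEGER,
    PRIMARY KEY (userid, itemid))""")    # also serves as the index for per-user reads

def load_items(userid):
    rows = conn.execute(
        "SELECT itemid, amount FROM user_items WHERE userid = ?", (userid,))
    return dict(rows.fetchall())          # itemid -> amount, kept in the session

def save_items(userid, items):
    conn.executemany(
        "INSERT OR REPLACE INTO user_items (userid, itemid, amount) VALUES (?, ?, ?)",
        [(userid, itemid, amount) for itemid, amount in items.items()])
    conn.commit()

save_items(1, {42: 10, 43: 3})
print(load_items(1))                      # {42: 10, 43: 3}
```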

Are random updates disk bound mostly in standard and append only databases?

If I have a large dataset and do random updates, then I think the updates are mostly disk bound (in the case of append-only databases it is not about seeks but about bandwidth, I think). When I update a record slightly, one data page must be updated, so if my disk can write 10MB/s of data and the page size is 16KB, then I can have at most 640 random updates per second. In append-only databases it is about 320 per second, because one update can take two pages - index and data. In other databases, because of the random seeks needed to update a page in place, it can be even worse, something like 100 updates per second.
I assume that one page in the cache receives only one update before being written (random updates). The same should hold for random inserts spread across all data pages (for example non-time-ordered UUIDs), or even worse.
I am referring to the situation where dirty pages (after an update) must be flushed to disk and synced (they can't stay in the cache any longer). So is the updates-per-second count in this situation bounded by disk bandwidth? Are my calculations, like 320 updates per second, realistic? Maybe I am missing something?
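(For reference, the back-of-the-envelope arithmetic above as a tiny script; it is only an upper bound, ignoring caching, write coalescing and seek time.)

```python
# Rough upper bound on random updates/second if every update flushes dirty pages.
bandwidth = 10 * 1024 * 1024      # 10 MB/s of sustained writes
page_size = 16 * 1024             # 16 KB pages

pages_per_second = bandwidth / page_size
print(pages_per_second)           # 640.0 -> ~640 updates/s, one page per update
print(pages_per_second / 2)       # 320.0 -> ~320 updates/s if each update touches
                                  #          an index page and a data page
```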
"It depends."
To be complete, there are other things to consider.
First, the only thing distinguishing a random update from an append is the head seek involved. A random update will have the head dancing all over the platter, whereas an append will ideally just track like a record player. This also assumes that each disk write is a full write and completely independent of all other writes.
Of course, that's in a perfect world.
With most modern databases, each update will typically involve, at a minimum, 2 writes. One for the actual data, the other for the log.
In a typical scenario, if you update a row, the database will make the change in memory. If you commit that row, the database will acknowledge it by making a note in the log, while keeping the actual dirty page in memory. Later, when the database checkpoints, it will write the dirty pages to disk. But when it does this, it will sort the blocks and write them as sequentially as it can. Then it will write a checkpoint to the log.
During recovery, when the DB crashed and could not checkpoint, the database reads the log up to the last checkpoint, "rolls it forward" by applying those changes to the actual disk pages, marks the final checkpoint, then makes the system available for service.
The log write is sequential, the data writes are mostly sequential.
Now, if the log is part of a normal file (typical today), then you write the log record, which appends to the disk file. The FILE SYSTEM will then (likely) append to ITS log the change you just made, so that it can update its local file system structures. Later, the file system will also commit its dirty pages and make its metadata changes permanent.
So, you can see that even a simple append can invoke multiple writes to the disk.
Now consider an "append only" design like CouchDB. Couch does not have a separate log: the file is its own log. CouchDB files grow without end and need compaction during maintenance. But when it does a write, it writes not just the data page but also any affected indexes, and when indexes are affected, Couch rewrites the entire BRANCH of the index from root to leaf. So a simple write in this case can be more expensive than you would first think.
Now, of course, throw in all of the random reads disrupting your random writes and it all gets quite complicated quite quickly. What I've learned, though, is that while streaming bandwidth is an important aspect of IO, overall operations per second are even more important. You can have two disks with the same bandwidth, but the one with the slower platter and/or head speed will have fewer ops/sec, just from head travel and platter rotation time.
Ideally, your DB would use dedicated raw storage rather than a file system, but most do not do that today. The operational advantages of file-system-based stores typically outweigh the performance benefits of raw storage.
If you're on a file system, then preallocated, sequential files are a benefit so that your "append only" isn't simply skipping around other files on the file system, thus becoming similar to random updates. Also, by using preallocated files, your updates are simply updating DB data structures during writes rather than DB AND file system data structures as the file expands.
Putting logs, indexes, and data on separate disks allows multiple drives to work simultaneously with less interference. Your log can truly be append only, for example, compared to fighting with the random data reads or index updates.
So, all of those things factor in to throughput on DBs.

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match that input against the list.
As such, the list would have Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database - but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would be taking too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize process time.
Storing the data in memory will give you the fastest processing time, especially if you can optimize the data structure for your most common operation (in this case a lookup) at the cost of memory space. For persistence, you could store the data in a flat file and read it during startup.
SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to worry about file system storage, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (extra app dependency, design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store that still carries them ...), write the list to it, and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick that has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work: store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
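A sketch of that simplest-thing approach (the file name is a placeholder; one number per line):

```python
# Sketch: load ~1M numbers from a text file into a hash set and benchmark lookups.
import random
import time

with open("numbers.txt", encoding="ascii") as f:     # placeholder file
    numbers = {line.strip() for line in f if line.strip()}

probes = [str(random.randrange(10**9)) for _ in range(1_000_000)]
start = time.perf_counter()
hits = sum(1 for p in probes if p in numbers)        # constant-time membership tests
elapsed = time.perf_counter() - start
print(f"{hits} hits, {len(probes) / elapsed:,.0f} lookups/second")
```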
