What's the performance difference between DBI's fetchall_hashref and fetchall_arrayref?

I am writing some Perl scripts to manipulate large amounts (in total about 42 million rows, but it won't be done in one hit) of data in two PostgreSQL databases.
For some of my queries it makes good sense to use fetchall_hashref because I have synthetic keys. However, in other instances, I'm going to have to use an array of three columns as the unique key.
This has got me wondering about performance differences between fetchall_arrayref and fetchall_hashref. I know that in both cases everything is going into memory, so selecting several GB of data probably isn't a good idea, but other than that there appears to be very little guidance in the documentation when it comes to performance.
My googling has been unsuccessful so if anyone can point me in the direction of some general performance studies I'd be grateful.
(I know I could benchmark this myself but unfortunately for dev purposes I don't have access to a machine which has identical hardware to production which is why I'm looking for general guidelines or even best practices).

Most of the choices between fetch methods depend on what format you want the data to end up in and how much of the work for that you want DBI to do for you.
My recollection is that iterating with fetchrow_arrayref and using bind_columns is the fastest (least DBI overhead) way to read through returned data.

First question is whether you really need to use a fetchall in the first place. If you don't need all 42 million rows in memory at once, then don't read them all in at once! bind_columns and fetchrow_arrayref are generally the way to go whenever possible, as ysth already pointed out.
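The streaming-versus-fetchall contrast is not specific to Perl. As a rough illustration only, here is the same idea using Python's built-in sqlite3 module instead of DBI; the `readings` table, its columns, and the function names are made up for the example.

```python
import sqlite3

def make_demo_db(n_rows):
    """Build an in-memory table with n_rows rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (a INTEGER, b INTEGER, c INTEGER, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        ((i, i % 7, i % 13, i * 0.5) for i in range(n_rows)),
    )
    return conn

def total_streaming(conn):
    """Walk the result set one row at a time; rows are not accumulated."""
    total = 0.0
    for a, b, c, value in conn.execute("SELECT a, b, c, value FROM readings"):
        total += value
    return total

def total_fetchall(conn):
    """Materialise every row first, then process; memory grows with row count."""
    rows = conn.execute("SELECT a, b, c, value FROM readings").fetchall()
    return sum(value for _, _, _, value in rows)

conn = make_demo_db(1000)
streamed = total_streaming(conn)
loaded = total_fetchall(conn)
```

Both functions compute the same answer; the difference is only in how many rows are held in memory at once.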
Assuming that fetchall really is needed, my gut intuition is that fetchall_arrayref will be marginally faster, since an array is a simpler data structure and doesn't need to compute hashes of the inserted keys, but the savings in time would be dwarfed by database read times, so it's unlikely to be significant.
Memory requirements are another matter entirely, though. The structure returned by fetchall_hashref is a hash of id => row, with each row being represented as a hash of field name => field value. If you get 42 million rows, that means your list of field names is repeated in each of 42 million sets of hash keys. That's going to require a good deal more memory to store than the array of arrays returned by fetchall_arrayref. (Unless DBI is doing some magic with tie to optimize the fetchall_hashref structure, I suppose.)
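To get a feel for the per-row overhead of keyed rows versus positional rows, here is a small Python sketch. It is an analogy, not a measurement of Perl or DBI: the column names are invented, and `sys.getsizeof` only counts each row container itself, not the values inside it.

```python
import sys

def rows_as_tuples(n):
    """Positional rows, similar in spirit to an array of arrays."""
    return [(i, "item", 1.5) for i in range(n)]

def rows_as_dicts(n):
    """Rows keyed by id, each row keyed by field name, like a hash of hashes."""
    return {i: {"id": i, "name": "item", "price": 1.5} for i in range(n)}

def container_bytes(rows):
    """Sum the shallow size of each row container (values are not counted)."""
    return sum(sys.getsizeof(row) for row in rows)

n = 10_000
tuple_bytes = container_bytes(rows_as_tuples(n))
dict_bytes = container_bytes(rows_as_dicts(n).values())
```

In CPython the dict-based rows come out noticeably larger per row. The exact ratio in Perl will differ, so treat this only as a reason to measure on your own data.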

Related

Performance-wise, is it worth it to rename every mongo key name for production? [duplicate]

As far as I know, every key name is stored "as-is" in the mongo database. It means that a field "name" will be stored using the 4 letters everywhere it is used.
Would it be wise, if I want my app to be ready to store a large amount of data, to rename every key in my mongo documents? For instance, "name" would become "n" and "description" would become "d".
I expect it to significantly reduce the space used by the database as well as the amount of data sent to the client (not to mention that it kind of uglifies the content of the mongo documents). Am I right?
If I undertake the rename of every key in my code (no need to rename the existing data, I can rebuild it from scratch), is there a good practice or any additional advice I should know?
Note: this is mainly speculation, I don't have benchmarking results to back this up
While "minifying" your keys technically would reduce the size of your memory/diskspace footprint, I think the advantages of this are quite minimal if not actually disadvantageous.
The first thing to realize is that data stored in MongoDB is not stored in its raw JSON format; it's stored as binary using a standard known as BSON. This allows Mongo to do all sorts of internal optimizations, such as compression if you're using WiredTiger as your storage engine (thanks for pointing that out, @Jpaljasma).
Second, let's say you do minify your keys. Well, then you need to minify your keys. Every time. Forever. That's a lot of work on your application side. Plus you need to unminify your keys when you read (because users won't know what n is). Every time. Forever. All of a sudden your minor memory optimization becomes a major runtime slowdown.
Third, that minifying/unminifying process is kind of complicated. You need to maintain mappings between the two, keep them tested and up to date, and never allow any overlap (if you do, that's pretty much the end of all your data). I wouldn't ever work on that.
So overall, I think it's a pretty terrible idea to minify your keys to save a couple of characters. It's important to keep the big picture in mind: the VAST majority of your data will be not in the keys, but in the values. If you want to optimize data size, look there.
The full name of every field is included in every document. So when your field-names are long and your values rather short, you can end up with documents where the majority of the used space is occupied by redundant field names.
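As a rough way to see how much of a small document is field names, the sketch below compares encoded sizes for long and short keys. It uses compact JSON as a stand-in, which is an assumption: BSON has a different layout, and storage-engine compression would change the on-disk numbers, so only the general effect carries over. The documents and field names are invented.

```python
import json

def encoded_size(doc):
    """Bytes of the compact JSON encoding (a stand-in for BSON, not the same format)."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

# The same values stored under descriptive and under one-letter field names.
long_doc = {"name": "Ann", "description": "ok", "quantity": 3}
short_doc = {"n": "Ann", "d": "ok", "q": 3}

long_size = encoded_size(long_doc)
short_size = encoded_size(short_doc)

# Fraction of the long document that the renaming would save.
saved_fraction = (long_size - short_size) / long_size
```

With short values like these, the key names account for a large share of each document; with longer values the share shrinks quickly.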
This affects the total storage size and decreases the number of documents which can be cached in RAM, which can negatively affect performance. But using descriptive field-names does of course improve readability of the database content and queries, which makes the whole application easier to develop, debug and maintain.
Depending on how flexible your driver is, it might also require quite a lot of boilerplate code to convert between your application field-names and the database field-names.
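The conversion boilerplate might look roughly like this. It is a minimal sketch with hypothetical names (`FIELD_TO_SHORT`, `to_db`, `from_db`) and flat documents only; a real application would also have to handle nested documents, queries, and indexes.

```python
# Application field name -> short database field name (hypothetical mapping).
FIELD_TO_SHORT = {"name": "n", "description": "d"}

def build_reverse(mapping):
    """Invert the mapping, refusing to continue if two fields share a short key."""
    reverse = {}
    for field, short in mapping.items():
        if short in reverse:
            raise ValueError(
                f"short key {short!r} used by both {reverse[short]!r} and {field!r}"
            )
        reverse[short] = field
    return reverse

SHORT_TO_FIELD = build_reverse(FIELD_TO_SHORT)

def to_db(doc):
    """Rename application fields to their short database names before writing."""
    return {FIELD_TO_SHORT[key]: value for key, value in doc.items()}

def from_db(doc):
    """Rename short database fields back to application names after reading."""
    return {SHORT_TO_FIELD[key]: value for key, value in doc.items()}

stored = to_db({"name": "Ann", "description": "likes tea"})
restored = from_db(stored)
```

The overlap check in `build_reverse` is the part that protects against two fields silently mapping to the same short key.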
Whether or not this is worth it depends on how complex your database is and how important performance is to you.

Is a huge array faster than a hash map for look-up?

I'm receiving "order updates" from a stock exchange. Each order id is between 1 and 100 000 000, so I can use a 100-million-element array to store 100 million orders, and when an update is received I can look up the order very fast just by accessing it by index, array[orderId]. I will spend several gigabytes of memory but this is OK.
Alternatively I can use a hashmap, and because at any moment the number of "active" orders is limited (to, very roughly, 100 000), look-up will be pretty fast too, but probably a little bit slower than the array.
The question is: will the hashmap actually be slower? Is it reasonable to create a 100-million-element array?
I need latency and nothing else, I completely don't care about memory, what should I choose?
Whenever considering performance issues, one experiment is worth a thousand expert opinions. Test it!
That said, I'll take a wild stab in the dark: it's likely that if you can convince your OS to keep your multi-gigabyte array resident in physical memory (this isn't necessarily easy - consider looking at the mlock and munlock syscalls), you'll have relatively better performance. Any such performance gain you notice (should one exist) will likely be by virtue of bypassing the cost of the hashing function, and avoiding the overheads associated with whichever collision-resolution and memory allocation strategies your hashmap implementation uses.
It's also worth cautioning that many hash table implementations have non-constant complexity for some operations (e.g., separate chaining could degrade to O(n) in the worst case). Given that you are attempting to optimize for latency, an array with very aggressive signaling to the OS memory manager (e.g., madvise and mlock) is likely to give you the closest thing to constant-latency lookups that you can easily get on a microprocessor.
While the only way to objectively answer this question is with performance tests, I will argue for using a Hashtable Map. (Caching and memory access can be so full of surprises; I do not have the expertise to speculate on which one will be faster, and when. Also consider that localized performance differences may be marginalized by other code.)
My first reason for "initially choosing" a hash is based off of the observation that there are 100M distinct keys but only 0.1M active records. This means that if using an array, index utilization will only be 0.1% - this is a very sparse array.
If the data is stored as values in the array then it needs to be relatively small or the array size will balloon. If the data is not stored in the array (e.g. array is of pointers) then the argument for locality of data in the array is partially mitigated. Either way, the simple array approach requires lots of unused space.
Since all the keys are already integers, the distribution (hash) function can be efficiently implemented - there is no need to create a hash of a complex type/sequence so the "cost" of this function should approach zero.
So, my simple proposed hash:
Use linear probing backed by contiguous memory. It is simple, has good locality (especially during the probe), and avoids needing to do any form of dynamic allocation.
Pick a suitable initial bucket size; say, 2x (or 0.2M buckets, primed). Don't even give the hash a chance of resize. Note that this suggested bucket array size is only 0.2% the size of the simple array approach and could be reduced further as the size vs. collision rate can be tuned.
Create a good distribution function for the hash. It can also exploit knowledge of the ID range.
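The three rules above can be sketched as a small class. This is only an illustration under stated assumptions: the class and method names are hypothetical, the modulo distribution function is a placeholder for something tuned to the real ID pattern, and deletion (which needs tombstones or re-insertion with linear probing) is left out.

```python
class FixedProbeTable:
    """Open-addressing table with linear probing and no resizing."""

    _EMPTY = -1  # sentinel for a free slot; assumes real order ids are positive

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [self._EMPTY] * capacity  # key slots, allocated once up front
        self.values = [None] * capacity       # parallel value slots
        self.count = 0

    def _slot(self, order_id):
        # Placeholder distribution function for integer ids (an assumption).
        return order_id % self.capacity

    def put(self, order_id, order):
        i = self._slot(order_id)
        for _ in range(self.capacity):
            if self.keys[i] == self._EMPTY:
                self.keys[i] = order_id
                self.values[i] = order
                self.count += 1
                return
            if self.keys[i] == order_id:
                self.values[i] = order  # overwrite an existing order
                return
            i = (i + 1) % self.capacity  # probe the next slot
        raise RuntimeError("table is full; it never resizes by design")

    def get(self, order_id):
        i = self._slot(order_id)
        for _ in range(self.capacity):
            if self.keys[i] == self._EMPTY:
                return None  # hit a free slot, so the key is absent
            if self.keys[i] == order_id:
                return self.values[i]
            i = (i + 1) % self.capacity
        return None

table = FixedProbeTable(capacity=11)
table.put(1, "order A")
table.put(12, "order B")  # 12 % 11 == 1, so this collides and probes forward
```

In a language with real contiguous arrays the two slot lists would be flat memory blocks; Python lists are used here only to show the probing logic.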
While I've presented specialized hashtable rules "optimized" for the given case, I would start with a normal Map implementation (be it a hashtable or tree) and test it .. if a standard implementation works suitably well, why not use it?
Now, test different candidates under expected and extreme loads - and pick the winner.
This seems to depend on the clustering of the IDs.
If the active IDs are clustered suitably already then, without hashing, the OS and/or L2 cache have a fair shot at holding on to the good data and keeping it low-latency.
If they're completely random then you're going to suffer just as soon as the number of active transactions exceeds the number of available cache lines or the size of those transactions exceeds the size of the cache (it's not clear which is likely to happen first in your case).
However, if the active IDs work out to have some unfortunate pattern which causes a high rate of contention (eg., it's a bit-pack of different attributes, and the frequently-varying attribute hits the hardware where it hurts), then you might benefit from using a 1:1 hash of the index to get back to the random case, even though that's usually considered a pretty bad case on its own.
As far as hashing for compaction goes; noting that some people are concerned about worst-case fallback behaviour for a hash collision, you might simply implement a cache of the full-sized table in contiguous memory, since that has a reasonably constrained worst case. Simply keep the busiest entry in the map, and fall back to the full table on collisions. Move the other entry into the map if it's more active (if you can find a suitable algorithm to decide this).
Even so, it's not clear that the necessary hash table size is sufficient to reduce the working set to being cacheable. How big are your orders?
The overhead of a hashmap vs. an array is almost none. I would bet on a hashmap of 100,000 records over an array of 100,000,000, without a doubt.
Remember also that, while you "don't care about memory", this also means you'd better have the memory to back it up - an array of 100,000,000 4-byte integers will take up 400 MB, even if all of them are empty. You run the risk of your data being swapped out. If your data gets swapped out, you will get a performance hit of several orders of magnitude.
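The arithmetic behind that figure is easy to redo for other slot sizes. The sketch below assumes 4-byte integers for the 400 MB case and 8-byte pointers as a second example; the size of an actual order record is not given in the question, so it is not modelled.

```python
TOTAL_IDS = 100_000_000   # size of the id range from the question
ACTIVE_ORDERS = 100_000   # rough number of live orders from the question

def array_megabytes(bytes_per_slot):
    """Memory for one slot per possible id, in decimal megabytes."""
    return TOTAL_IDS * bytes_per_slot / 1_000_000

int32_mb = array_megabytes(4)    # one 4-byte integer per id
pointer_mb = array_megabytes(8)  # one 8-byte pointer per id

# Share of the slots that hold a live order at any moment.
utilization = ACTIVE_ORDERS / TOTAL_IDS
```

Storing whole order structs in the array instead of integers or pointers multiplies the total by the struct size.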
You should test and profile, as others have said. My random stab in the dark, though: A high-load-factor hash table will be the way to go here. One huge array is going to cost you a TLB miss and then a last-level cache miss per access. This is expensive. A hash table, given the working set size you mentioned, is probably only going to cost some arithmetic and an L1 miss.
Again, test both alternatives on representative examples. We're all just stabbing in the dark.

Caching sortable/filterable data in Redis

I have a variety of data that I've got cached in a standard Redis hashmap, and I've run into a situation where I need to respond to client requests for ordering and filtering. Order rankings for name, average rating, and number of reviews can change regularly (multiple times a minute, possibly). Can anyone advise me on a proper strategy for attacking this problem? Consider the following example to help understand what I'm looking for:
Client makes an API request to /api/v1/cookbooks?orderBy=name&limit=20&offset=0
I should respond with the first 20 entries, ordered by name
Strategies I've considered thus far:
for each type of hashmap store (cookbooks, recipes, etc), creating a sorted set for each ordering scheme (alphabetical, average rating, etc) from a Postgres ORDER BY; then pulling out ZRANGE slices based on limit and offset
storing ordering data directly into the JSON string data for each key.
hitting Postgres with a SELECT id FROM table ORDER BY _, and using the ids to pull directly from the hashmap store
Any additional thoughts or advice on how to best address this issue? Thanks in advance.
So, as mentioned in a comment below Sorted Sets are a great way to implement sorting and filtering functionality in cache. Take the following example as an idea of how one might solve the issue of needing to order objects in a hash:
Given a hash called "movies" with the scheme of bucket:objectId -> object, which is a JSON string representation (read about "bucketing" your hashes for performance here).
Create a sorted set called "movieRatings", where each member is an objectId from your "movies" hash, and its score is an average of all rating values (computed by the database). Just use a numerical representation of whatever you're trying to sort, and Redis gives you a lot of flexibility on how you can extract the slices you need.
This simple scheme has a lot of flexibility in what can be achieved - you simply ask your sorted set for a set of keys that fit your requirements, and look up those keys with HMGET from your "movies" hash. Two swift Redis calls, problem solved.
Rinse and repeat for whatever type of ordering you need, such as "number of reviews", "alphabetically", "actor count", etc. Filtering can also be done in this manner, but normal sets are probably quite sufficient for that purpose.
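To make the two-step lookup concrete without needing a running Redis server, here is a sketch that imitates the two structures with plain Python objects. Everything in it is a stand-in: the dict `movies` plays the role of the hash, the dict `movie_ratings` plays the role of the sorted set (member -> score), and the helper names are hypothetical. Against a real server the two helpers would correspond to a sorted-set range query followed by HMGET.

```python
import json

# In-memory stand-ins for the two Redis structures (assumption: no server).
movies = {}         # plays the role of the "movies" hash: objectId -> JSON string
movie_ratings = {}  # plays the role of the "movieRatings" sorted set: objectId -> score

def add_movie(object_id, title, avg_rating):
    """Store the cached object and register its score in the ordering structure."""
    movies[object_id] = json.dumps({"id": object_id, "title": title})
    movie_ratings[object_id] = avg_rating

def ids_by_rating(offset, limit):
    """Highest-rated ids first, then sliced; ties are broken by id in this sketch."""
    ranked = sorted(movie_ratings, key=lambda oid: (-movie_ratings[oid], oid))
    return ranked[offset:offset + limit]

def fetch_movies(object_ids):
    """Look up the cached JSON for each id, as a single multi-get would."""
    return [json.loads(movies[oid]) for oid in object_ids]

add_movie("m1", "Alien", 4.6)
add_movie("m2", "Brazil", 4.1)
add_movie("m3", "Clue", 4.8)

page = fetch_movies(ids_by_rating(offset=0, limit=2))
titles = [movie["title"] for movie in page]
```

Unlike this sketch, which re-sorts on every call, a real sorted set keeps its members ordered as scores change, which is what makes the range query cheap.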
This depends on your needs. Each of your strategies could work.
Your first approach of storing an auxiliary sorted set for each way you want to order is the best way to do this if you have a very big hash and/or you run your order queries frequently. This approach will require a lot of RAM if your hash is big, but it will also scale well in terms of time complexity as your hash gets bigger and you start running order queries more frequently. On the other hand, it introduces complexity in your data structures, and feels like you're trying to use Redis for something a typical DB like Postgres, MySQL, or Mongo would be better at.
Storing ordering data directly into your keys means you need to pull your entire hash every time you do an order query. Maybe that's not so bad if your hash is very small, or you don't do ordered queries very often, but this won't scale at all.
If you're already hitting Postgres to get keys, why not just store the values in Postgres as well? That would be much cheaper than hitting Postgres and then hitting Redis, and would have your code depend on fewer things. IMO, this is probably your best option and would work most naturally. Do this, unless you have some really good reason to not store values in Postgres, or some really big speed concerns, in which case go with your first strategy.

How to design database to store and retrieve large item/skill lists in ruby

I am planning a role-playing game where characters are supposed to carry/use items and train skills. When it comes to storing the (possibly numerous) items/skills possessed by characters, I can't think of a better way than adding a row for every possible item and skill to each character instantiated. However, this seems like overkill to me.
To be clear, if this would be an exercise or a small game where total number of items/skills is ~30, I would add an items and a skills hash to the character class and methods to add and remove them like:
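def initialize
  @inventory = Hash.new(0)  # item => count; missing items default to 0
  @skills = {}              # skill => level
end

# Add `number` copies of `item` to the inventory.
def add_item(item, number)
  @inventory[item] += number
end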
Given that I would like to store the number of each item and the level of each skill, what else can I try in order to handle ~1000 possible items, ~150 of them in an inventory, and possibly 100 skills?
Plan for Data Retrieval
Generally, it's a good idea to design your database around how you plan to look up and retrieve your data, rather than how you want to store it. A bad design makes your data very expensive to collect from the database.
In your example, having a separate model for each inventory item or skill would be hugely expensive in terms of lookups whenever you want to load a character. Do you really want to do 1,000 lookups every time you load someone's inventory? Probably not.
Denormalize for Speed
You typically want to normalize data that needs to be consistent, and denormalize data that needs to be retrieved/updated quickly. One option might be to serialize your character attributes.
For example, it should be faster to store a serialized Character#inventory_items field than update 100 separate records with a has_many :through or has_and_belongs_to_many relationship. There are certainly trade-offs involved with denormalization in general and serialization in particular, but it might be a good fit for your specific use case.
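The serialization idea can be sketched independently of any ORM. The example below uses Python and JSON purely for illustration; the `Character` class, its method names, and the choice of JSON as the serialized format are all assumptions, and the resulting string is what would be written to a single column.

```python
import json

class Character:
    """Keeps the whole inventory in one dict that is stored as a single blob."""

    def __init__(self, name, inventory=None):
        self.name = name
        self.inventory = dict(inventory or {})  # item name -> count

    def add_item(self, item, number):
        self.inventory[item] = self.inventory.get(item, 0) + number

    def inventory_blob(self):
        """Serialize the inventory to one string, e.g. for a single text column."""
        return json.dumps(self.inventory, sort_keys=True)

    @classmethod
    def from_blob(cls, name, blob):
        """Rebuild a character from a previously stored blob."""
        return cls(name, json.loads(blob))

hero = Character("hero")
hero.add_item("potion", 3)
hero.add_item("potion", 2)
hero.add_item("rope", 1)

blob = hero.inventory_blob()
restored = Character.from_blob("hero", blob)
```

Loading a character then costs one read instead of one lookup per item, at the price of not being able to query individual items in SQL.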
Consider a Document Database
Character sheets are documents. Unless you need the relational power of a SQL database, a document-oriented database might be a better fit for the data you want to manage. CouchDB seems particularly well-suited for this example, but you should certainly evaluate all your NoSQL options to see if any offer the features you need. Your mileage will definitely vary.
Always Benchmark
Don't take my word for what's optimal. Try a design. Benchmark it. See what the design does with your data. In the end, that's the only thing that matters.
I can't think of a better way than putting a row for every possible item and skill to each character instantiated.
Do characters evolve independently?
Assuming yes, there is no other choice but having each and every relevant combination physically represented in the database.
If not, then you can "reuse" the same set of items/skills for multiple characters, but this is probably not what is going on here.
In any case, relational databases are very good at managing huge amounts of data and the numbers you mentioned don't even qualify as "huge". By correctly utilizing techniques such as clustering, you can ensure that a lookup of all items/skills for a given character is done in a minimal number of I/O operations, i.e. very fast.
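A minimal sketch of the row-per-owned-item layout, using Python's built-in sqlite3 module as a stand-in for whatever database the game ends up on. The table and column names are hypothetical. `WITHOUT ROWID` is SQLite's way of storing a table ordered by its primary key, which is used here as one example of clustering; other databases have their own mechanisms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE character_items (
        character_id INTEGER NOT NULL,
        item_id      INTEGER NOT NULL,
        quantity     INTEGER NOT NULL,
        PRIMARY KEY (character_id, item_id)
    ) WITHOUT ROWID
    """
)

def add_item(character_id, item_id, number):
    """Increase the stored quantity, creating the row on first acquisition."""
    cur = conn.execute(
        "UPDATE character_items SET quantity = quantity + ? "
        "WHERE character_id = ? AND item_id = ?",
        (number, character_id, item_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO character_items VALUES (?, ?, ?)",
            (character_id, item_id, number),
        )

def inventory(character_id):
    """Fetch one character's whole inventory with a single query."""
    rows = conn.execute(
        "SELECT item_id, quantity FROM character_items "
        "WHERE character_id = ? ORDER BY item_id",
        (character_id,),
    )
    return dict(rows.fetchall())

add_item(1, 10, 2)
add_item(1, 10, 3)
add_item(1, 42, 1)
add_item(2, 10, 7)
hero_items = inventory(1)
```

Only items a character actually owns get a row, so the table grows with what is held rather than with every possible character/item pair.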

Efficient storage of external index of strings

Say you have a large collection with n objects on disk and each one has a variable-sized string. What are common practices or efficient ways to make an index of those objects with plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have a high latency, storing only references isn't a good idea, either.
I've been thinking on using a B-Tree-like design with tries but can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information.)
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."
A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
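The separator choice can be sketched in a few lines. This is a simplified illustration, not taken from any particular database: `shortest_separator` is a hypothetical helper that returns the shortest prefix of the right-hand key that still sorts strictly after the left-hand key, using plain string comparison.

```python
def shortest_separator(left, right):
    """Return the shortest prefix s of `right` such that left < s <= right."""
    if not left < right:
        raise ValueError("keys must be distinct and in sorted order")
    for length in range(1, len(right) + 1):
        prefix = right[:length]
        if prefix > left:
            return prefix
    # Unreachable: the full `right` string always satisfies prefix > left.
    raise AssertionError("no separator found")

sep = shortest_separator("astronomy", "azimuth")
```

For 'astronomy' and 'azimuth' this yields 'az': everything below 'az' goes left, everything from 'az' upward goes right, and the index node stores two characters instead of a full key.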
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).
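The fixed-length pseudo-key part of that idea can be sketched as follows. This is deliberately not a full extendible hashing implementation: the directory and bucket-splitting machinery is left out, a Python dict stands in for the on-disk structure, and the class and function names are hypothetical. What it does show is hashing variable-length strings to fixed-size keys and confirming candidates against the real string to handle collisions.

```python
import hashlib

def pseudo_key(text, nbytes=8):
    """Hash a variable-length string to a fixed-length key of nbytes bytes."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=nbytes).digest()

class HashedStringIndex:
    """Maps fixed-length pseudo-keys to record positions."""

    def __init__(self, records, nbytes=8):
        self.records = records  # stands in for the objects stored on disk
        self.nbytes = nbytes
        self.slots = {}         # pseudo-key -> list of record positions
        for position, text in enumerate(records):
            key = pseudo_key(text, nbytes)
            self.slots.setdefault(key, []).append(position)

    def find(self, text):
        """Return the position of `text`, or None if it is not indexed."""
        for position in self.slots.get(pseudo_key(text, self.nbytes), []):
            # Different strings can share a pseudo-key, so confirm the match.
            if self.records[position] == text:
                return position
        return None

records = [f"object-{i}" for i in range(300)]
# A 1-byte pseudo-key forces collisions, to exercise the confirmation step.
index = HashedStringIndex(records, nbytes=1)
```

The index itself only holds fixed-size keys and positions; the full strings are touched only to confirm a candidate.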
As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).
What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. This would be much simpler than implementing B-tree from scratch and possibly having it be buggy.
DBMSs also have caching and various other features that might make your life easier.
Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, Knuth's "The Art of Computer Programming" volume three covers sorting and searching in about as much detail as you're likely to want.
