Performance-wise, is it worth it to rename every mongo key name for production? [duplicate]

This question already has answers here:
Is shortening MongoDB property names worthwhile?
(7 answers)
Closed 5 years ago.
As far as I know, every key name is stored "as-is" in the mongo database. It means that a field "name" will be stored using the 4 letters everywhere it is used.
Would it be wise, if I want my app to be ready to store a large amount of data, to rename every key in my mongo documents? For instance, "name" would become "n" and "description" would become "d".
I expect it to significantly reduce the space used by the database as well as the amount of data sent to the client (not to mention that it kind of uglifies the content of the mongo documents). Am I right?
If I undertake renaming every key in my code (no need to rename the existing data, I can rebuild it from scratch), is there a good practice or any additional advice I should know?

Note: this is mainly speculation, I don't have benchmarking results to back this up
While "minifying" your keys technically would reduce the size of your memory/diskspace footprint, I think the advantages of this are quite minimal if not actually disadvantageous.
The first thing to realize is that data stored in MongoDB is not actually stored in its raw JSON format; it's stored as pure binary using a standard known as BSON. This allows Mongo to do all sorts of internal optimizations, such as compression if you're using WiredTiger as your storage engine (thanks for pointing that out, @Jpaljasma).
Second, let's say you do minify your keys. Well, then you need to minify your keys. Every time. Forever. That's a lot of work on your application side. Plus you need to unminify your keys when you read (because users won't know what n is). Every time. Forever. All of a sudden your minor memory optimization becomes a major runtime slowdown.
Third, that minifying/unminifying process is kind of complicated. You need to maintain and test mappings between the two, keep them up to date, and never have any overlap (if you do, that's pretty much the end of all your data). I wouldn't ever work on that.
So overall, I think it's a pretty terrible idea to minify your keys to save a couple of characters. It's important to keep the big picture in mind: the VAST majority of your data will be not in the keys, but in the values. If you want to optimize data size, look there.

The full name of every field is included in every document. So when your field-names are long and your values rather short, you can end up with documents where the majority of the used space is occupied by redundant field names.
This affects the total storage size and decreases the number of documents which can be cached in RAM, which can negatively affect performance. But using descriptive field-names does of course improve readability of the database content and queries, which makes the whole application easier to develop, debug and maintain.
Depending on how flexible your driver is, it might also require quite a lot of boilerplate code to convert between your application field-names and the database field-names.
Whether or not this is worth it depends on how complex your database is and how important performance is to you.
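To put a rough number on the field-name overhead described above, here is a minimal Python sketch that computes the uncompressed BSON size of a flat document by hand. It only handles string values, the function name and the sample documents are made up for illustration, and it says nothing about the on-disk size after the storage engine has compressed the data.

```python
def bson_size_of_string_doc(doc):
    """Return the uncompressed BSON size in bytes of a flat dict of str -> str.

    Illustration only: string values are the only type handled here.
    """
    size = 4  # int32 holding the total document length
    for key, value in doc.items():
        size += 1                               # element type byte (0x02 = string)
        size += len(key.encode("utf-8")) + 1    # field name as a NUL-terminated cstring
        size += 4                               # int32 length prefix of the string value
        size += len(value.encode("utf-8")) + 1  # value bytes plus trailing NUL
    size += 1  # trailing NUL byte that ends the document
    return size


# Hypothetical documents: same values, long versus single-letter field names.
long_keys = {"name": "Ada", "description": "First programmer"}
short_keys = {"n": "Ada", "d": "First programmer"}

print(bson_size_of_string_doc(long_keys))
print(bson_size_of_string_doc(short_keys))
```

The difference between the two results is exactly the number of characters removed from the field names, and that saving is paid once per document. Whether it matters depends on how short the values are compared with the names.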

Related

Fortran95 access large files fast using direct access

I am currently working on a problem which requires me to store a large amount of well structured information in a file.
It is more data than I can keep in memory, but I need to access different parts of it very often and would like to do so as quickly as possible (of course).
Unfortunately, the file would be large enough that actually reading through it would take quite some time as well.
From what I have gathered so far, it seems to me that ACCESS="DIRECT" would be a good way of handling this problem. Do I understand correctly that with direct access, I am basically pointing at a specific chunk of the file and asking "What's in there?"? And do I correctly infer from that, that reading time does not depend on the overall file size?
Thank you very much in advance!
You can think of an ACCESS='DIRECT' file as a file consisting of a number of fixed size records. You can do operations like read or write record #N in O(1) time. That is, in order to access record #N you don't need to scan through all the preceding #M (M<N) records in the file.
If this maps reasonably well to the problem you're trying to solve, then ACCESS='DIRECT' might be the correct solution in your case. If not, ACCESS='STREAM' offers a little bit more flexibility in that the size of each record does not need to be fixed, though you need to be able to compute the correct file offset yourself. If you need even more flexibility, there are things like NetCDF, or HDF5 as @HighPerformanceMark suggested, or even things like sqlite.
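The fixed-size-record idea is not specific to Fortran. As a rough illustration (in Python rather than Fortran, with a made-up record layout of one 32-bit integer and one 64-bit float), reading record N boils down to seeking to N times the record size:

```python
import struct
import tempfile

# Hypothetical record layout: little-endian int32 id + float64 value (12 bytes).
RECORD = struct.Struct("<id")


def write_record(f, n, rec_id, value):
    """Write record number n (0-based here; Fortran record numbers start at 1)."""
    f.seek(n * RECORD.size)
    f.write(RECORD.pack(rec_id, value))


def read_record(f, n):
    """Read record number n without touching any of the preceding records."""
    f.seek(n * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))


with tempfile.TemporaryFile() as f:
    for n in range(1000):
        write_record(f, n, n, n * 0.5)
    # Jumps straight to byte offset 742 * RECORD.size.
    print(read_record(f, 742))
```

The seek cost does not grow with the file size, which is the O(1) property mentioned above. The actual read time still depends on whether the block is already cached by the operating system.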

Working with a Set that does not fit in memory

Let's say I have a huge list of fixed-length strings, and I want to be able to quickly determine if a new given string is part of this huge list.
If the list remains small enough to fit in memory, I would typically use a set: I would feed it first with the list of strings, and by design, the data structure would allow me to quickly check whether or not a given string is part of the set.
But as far as I can see, the various standard implementations of this data structure store data in memory, and I already know that the huge list of strings won't fit in memory, and that I'll somehow need to store this list on disk.
I could rely on something like SQLite to store the strings in an indexed table, then query the table to know whether a string is part of the initial set or not. However, using SQLite for this seems unnecessarily heavy to me, as I definitely don't need all the querying features it supports.
Have you guys faced this kind of problem before? Do you know any library that might be helpful? (I'm quite language-agnostic, feel free to throw whatever you have)
There are multiple solutions to efficiently find if a string is a part of a huge set of strings.
A first solution is to use a trie to make the set much more compact. Indeed, many strings will likely start with the same prefix, and rewriting it over and over in memory is not space efficient. This may or may not be enough to keep the full set in memory. If not, the root part of the trie can be stored in memory, referencing leaf-like nodes stored on disk. This enables the application to quickly find which part of the leaf-like nodes needs to be loaded, at a relatively small cost. If the number of strings is not so huge, most leaf parts of the trie related to a given leaf of the root part can be loaded in one big sequential chunk from the storage device.
Another solution is to use a hash table to quickly find whether a given string exists in the set with low latency (e.g. with only 2 fetches). The idea is just to hash the searched string and perform a lookup at a specific item of a big array stored on the storage device. Open addressing can be used to make the structure more compact at the expense of a possibly higher latency, while only 2 fetches are needed with closed addressing (the first gets the location of the item list associated with the given hash and the second gets all the actual items).
One simple way to implement such data structures so that they can work on a storage device is to make use of mapped memory. Mapped memory enables you to access data on a storage device transparently as if it were in memory (whatever the language used). However, the cost of accessing the data is that of the storage device and not that of memory. Thus, the data structure implementation should be adapted to the use of mapped memory for better performance.
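As a minimal sketch of the hash table plus mapped memory combination, the Python code below stores fixed-length keys in a memory-mapped file using open addressing with linear probing. The key length, the file layout and all function names are assumptions made for this example. It also assumes that no real key consists only of NUL bytes and that the table has more slots than keys, otherwise a lookup for a missing key would never terminate.

```python
import hashlib
import mmap
import os
import tempfile

KEY_LEN = 8                 # assumed fixed string length in bytes
EMPTY = b"\x00" * KEY_LEN   # marker for an unused slot


def _slot(key, n_slots):
    """Map a key to its preferred slot index."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_slots


def build_table(path, keys, n_slots):
    """Create the on-disk table and insert every key with linear probing."""
    with open(path, "wb") as f:
        f.write(EMPTY * n_slots)  # fine for a sketch; write in chunks for huge tables
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as table:
        for key in keys:
            i = _slot(key, n_slots)
            while table[i * KEY_LEN:(i + 1) * KEY_LEN] not in (EMPTY, key):
                i = (i + 1) % n_slots  # slot taken by another key: try the next one
            table[i * KEY_LEN:(i + 1) * KEY_LEN] = key


def contains(table, key, n_slots):
    """Probe the mapped table until the key or an empty slot is found."""
    i = _slot(key, n_slots)
    while True:
        slot = table[i * KEY_LEN:(i + 1) * KEY_LEN]
        if slot == key:
            return True
        if slot == EMPTY:
            return False
        i = (i + 1) % n_slots


path = os.path.join(tempfile.mkdtemp(), "set.bin")
build_table(path, [b"aaaaaaaa", b"bbbbbbbb", b"cccccccc"], n_slots=16)
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as table:
    print(contains(table, b"bbbbbbbb", 16))
    print(contains(table, b"zzzzzzzz", 16))
```

Only the pages that a lookup touches are read from the storage device, so a keeping the table sparsely filled keeps probe sequences short and the number of page faults per lookup low.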
Finally, you can cache data so that some fetches can be much faster. One way to do that is to use Bloom filters. A Bloom filter is a very compact probabilistic hash-based data structure. It can be used to cache data in memory without actually storing any string item. False positive matches are possible, but false negatives are not. Thus, Bloom filters are good at discarding searched strings that are often not in the set without the need to do any (slow) fetch on the storage device. A big Bloom filter can provide very good accuracy. This data structure needs to be combined with the ones above if deterministic results are required. LRU/LFU caches might also help, depending on the distribution of the searched items.
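A Bloom filter is short enough to sketch directly. The Python class below is a minimal illustration, not a tuned implementation: the class and method names are made up, and the bit positions are derived from one SHA-256 digest with double hashing, which is one common choice among several.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte strings."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: combine two 64-bit values taken from one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # force an odd step
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bloom = BloomFilter(n_bits=1 << 16, n_hashes=4)
for s in (b"aaaaaaaa", b"bbbbbbbb"):
    bloom.add(s)

if bloom.might_contain(b"bbbbbbbb"):
    print("possibly in the set: confirm with the on-disk structure")
```

Only a True answer requires the slow lookup on the storage device. A False answer can be returned immediately.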

MongoDB why use an embedded list instead of separate collection?

Assuming we have a server with large enough RAM, why should we worry about extra querying required when we use separate collections instead of embedded list of objects? Since queries will be really fast, would it be worth it to store objects as embedded list?
There is a 16MB size limit for BSON documents in MongoDB. So, depending on the data model, embedding imposes an artificial limit on what can be stored.
Depending on your storage engine, frequently increasing the document size can cause the document to be moved within the data files, which is a rather costly operation you really want to prevent.
With complicated data models, queries tend to get more complicated, leading to problems, as you can often see here on SO. Complicated queries are not necessarily faster.
Usually, embedded documents stem from the fact that developers are used to SQL JOINs and want their data all within one query. But if you boil it down, usually you have questions like
For a given X, what are the Ys belonging to it?
So usually you already have X. There is no need to prematurely load data you never need in the majority of cases. Think of an overview page of Xs where you select the X whose Ys you want to see. Even with a pagination of 10, 9/10 of the data loaded would be useless if you had all the data embedded. Fun fact: this applies to SQL, too, though nobody seems to care about real optimizations nowadays.
This is the summary of my blog post "The problem with overembedding", where you find an in-detail explanation of the points mentioned above.

Efficient storage of external index of strings

Say you have a large collection with n objects on disk and each one has a variable-sized string. What are common, efficient ways to build an index of those objects using plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have a high latency, storing only references isn't a good idea, either.
I've been thinking of using a B-Tree-like design with tries but can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information.)
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."
A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
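The separator choice of a simple prefix B-tree can be sketched in a few lines of Python. The function names are made up for illustration, and a real index would apply this when splitting a node, using the last key of the left half and the first key of the right half.

```python
def distinguishing_prefixes(a, b):
    """Return the shortest prefixes of a and b that already differ (a != b)."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1  # k ends up as the length of the common prefix
    return a[:k + 1], b[:k + 1]


def shortest_separator(a, b):
    """For a < b, return a shortest string s with a < s <= b.

    Storing s in an inner node is enough to route a search to the
    correct side, so the full keys only need to live in the leaves.
    """
    return distinguishing_prefixes(a, b)[1]


print(distinguishing_prefixes("astronomy", "azimuth"))
print(shortest_separator("astronomy", "azimuth"))
```

Note that the shared leading 'a' is still repeated in every separator. Removing that kind of redundancy is what the full prefix B-tree adds.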
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).
As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).
What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. This would be much simpler than implementing a B-tree from scratch and possibly ending up with a buggy one.
DBMSs also have caching and various other features that might make your life easier.
Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, Knuth's "The Art of Computer Programming" volume three covers sorting and searching in about as much detail as you're likely to want.

Wanted: DB for fast read operations to be accessed from ruby apps

Basically it's a financial database, with both daily and intraday data (date,symbol,open,high,low,close,vol,openinterest) -- very simple structure. Updates are just once a day. A typical query would be: date and close price of MSFT for all dates in DB. I was thinking that there's got to be something out there that's been optimized for lots of reads and not many writes, as opposed to a general-purpose RDBMS like MySQL. I searched rubyforge.org, and I didn't see anything that specifically addressed this (as far as I could tell).
MS SQL Server can be optimized like this with the fairly simple:
ALTER DATABASE myDatabase
SET READ_COMMITTED_SNAPSHOT ON
SQL Server will automatically cache your data in memory if it is being used heavily for reads.
You can always use a RAM disk for your MySQL installation if your database footprint is small enough. One way to make your tables small enough to fit is to create them as ARCHIVE tables. They are very compact because they are compressed, but they can only be appended to or read from, not updated. (http://dev.mysql.com/tech-resources/articles/storage-engine.html)
Generally a properly indexed and well organized MySQL table is really fast, especially when using MyISAM, and even more so when loaded from memory. The key is to denormalize the data as heavily as you can, optimizing for your particular read scenarios.
For example, having a stock_id, date, price tuple is going to be fairly slow to sort and retrieve. If you have, instead, stock_id and a column with some serialized data, the retrieval time will be very quick.
Another solution that is likely faster is to push all the data into an alternative DBMS like Tokyo Cabinet or something similar, especially if your data fits neatly into a key/value store.
Look at MySQL, but run the database from memory instead of disk. Depends on the size of your dataset and your budget, but you could then update memory from disk once a day, and have a very, very fast read time afterwards.
The best-known (to me at least!) time series database is Fame but it's expensive and I strongly doubt that there's anything like, say, an ActiveRecord implementation for it. Unless it's changed a lot in the 10 or so years since I last touched it, it isn't SQL-friendly at all.
With a fairly tightly-focused application, you can take a more flexible view of your data. For example, consider what information you're actually looking to store. Is it the atomic price/hi/lo/close/vol/whatever, or is it more appropriately a time series of such values? If you always want to view the series, store a series per row, not a value.
Throwing a few ideas out here...
How might it look if you stored a year or a month of a single value for a single stock in one row? Maybe as an XML string, or JSON or something more terse of your own devising. Compressed CSV, perhaps? That ought to fit a month's values into a 255-character column. (Use something like Huffman coding to do the encoding, perhaps - a single dictionary ought to work for all instances of such similar data).
You can still hold a horizontal view as well: with the extremely low update rate you'll have (should only be data fixes, I'd guess) you can probably stand to build that stuff.
There's an obvious downside to this: you'll have a bunch of extra work to do.
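The series-per-row idea could look something like the following Python sketch, which packs one month of closing prices for one symbol into a single compressed CSV blob. The function names, the key layout and the sample prices are invented for illustration, and zlib stands in for whatever encoding you settle on. The sketch makes no claim about the blob fitting into a particular column size.

```python
import zlib


def pack_series(closes):
    """Serialize a list of (date, close) pairs into one compressed blob."""
    csv_text = "\n".join(f"{d},{c}" for d, c in closes)
    return zlib.compress(csv_text.encode("ascii"))


def unpack_series(blob):
    """Inverse of pack_series: return the list of (date, close) pairs."""
    rows = zlib.decompress(blob).decode("ascii").splitlines()
    return [(d, float(c)) for d, c in (row.split(",") for row in rows)]


# A dict stands in for the table: one row per (symbol, month).
table = {
    ("MSFT", "2009-01"): pack_series([
        ("2009-01-02", 20.33),  # made-up sample values
        ("2009-01-05", 20.52),
        ("2009-01-06", 20.76),
    ]),
}

for date, close in unpack_series(table[("MSFT", "2009-01")]):
    print(date, close)
```

A query like "date and close price of MSFT for all dates" then touches one row per month instead of one row per trading day, at the cost of the packing and unpacking work mentioned above.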
I don't have any personal experience, but MongoDB claims to offer relational-style flexibility with key-value performance.
As mentioned elsewhere, a key-value database might be worth looking at: Tokyo Cabinet, CouchDB, or one of the others, again perhaps with a concatenated value for the time series.

Resources