I know a bit about database internals. I've actually implemented a small, simple relational database engine before, using ISAM structures on disk and BTree indexes and all that sort of thing. It was fun, and very educational. I know that I'm much more cognizant about carefully designing database schemas and writing queries now that I know a little bit more about how RDBMSs work under the hood.
But I don't know anything about multidimensional OLAP data models, and I've had a hard time finding any useful information on the internet.
How is the information stored on disk? What data structures comprise the cube? If a MOLAP model doesn't use tables, with columns and records, then... what? Especially in highly dimensional data, what kinds of data structures make the MOLAP model so efficient? Do MOLAP implementations use something analogous to RDBMS indexes?
Why are OLAP servers so much better at processing ad hoc queries? The same sorts of aggregations that might take hours to process in an ordinary relational database can be processed in milliseconds in an OLTP cube. What are the underlying mechanics of the model that make that possible?
I've implemented a couple of systems that mimicked what OLAP cubes do, and here are a couple of things we did to get them to work.
The core data was held in an n-dimensional array, all in memory, and all the keys were implemented via hierarchies of pointers to the underlying array. In this way we could have multiple different sets of keys for the same data. The data in the array was the equivalent of the fact table, often it would only have a couple of pieces of data, in one instance this was price and number sold.
The underlying array was often sparse, so once it was created we used to remove all the blank cells to save memory - lots of hardcore pointer arithmetic but it worked.
As we had hierarchies of keys, we could write routines quite easily to drill down/up a hierarchy easily. For instance we would access year of data, by going through the month keys, which in turn mapped to days and/or weeks. At each level we would aggregate data as part of building the cube - made calculations much faster.
We didn't implement any kind of query language, but we did support drill down on all axis (up to 7 in our biggest cubes), and that was tied directly to the UI which the users liked.
We implemented core stuff in C++, but these days I reckon C# could be fast enough, but I'd worry about how to implement sparse arrays.
Hope that helps, sound interesting.
The book Microsoft SQL Server 2008 Analysis Services Unleashed spells out some of the particularities of SSAS 2008 in decent detail. It's not quite a "here's exactly how SSAS works under the hood", but it's pretty suggestive, especially on the data structure side. (It's not quite as detailed/specific about the exact algorithms.) A few of the things I, as an amateur in this area, gathered from this book. This is all about SSAS MOLAP:
Despite all the talk about multi-dimensional cubes, fact table (aka measure group) data is still, to a first approximation, ultimately stored in basically 2D tables, one row per fact. A number of OLAP operations seem to ultimately consist of iterating over rows in 2D tables.
The data is potentially much smaller inside MOLAP than inside a corresponding SQL table, however. One trick is that each unique string is stored only once, in a "string store". Data structures can then refer to strings in a more compact form (by string ID, basically). SSAS also compresses rows within the MOLAP store in some form. This shrinking I assume lets more of the data stay in RAM simultaneously, which is good.
Similarly, SSAS can often iterate over a subset of the data rather than the full dataset. A few mechanisms are in play:
By default, SSAS builds a hash index for each dimension/attribute value; it thus knows "right away" which pages on disk contain the relevant data for, say, Year=1997.
There's a caching architecture where relevant subsets of the data are stored in RAM separate from the whole dataset. For example, you might have cached a subcube that has only a few of your fields, and that only pertains to the data from 1997. If a query is asking only about 1997, then it will iterate only over that subcube, thereby speeding things up. (But note that a "subcube" is, to a first approximation, just a 2D table.)
If you're predefined aggregates, then these smaller subsets can also be precomputed at cube processing time, rather than merely computed/cached on demand.
SSAS fact table rows are fixed size, which presumibly helps in some form. (In SQL, in constrast, you might have variable-width string columns.)
The caching architecture also means that, once an aggregation has been computed, it doesn't need to be refetched from disk and recomputed again and again.
These are some of the factors in play in SSAS anyway. I can't claim that there aren't other vital things as well.
Related
I have a question about NoSQL type databases, in particular MongoDB, but it applies in general to most key-value or document based storages. Some of the selling points of NoSQL are speed and scalability, but it seems to me that there is significant overhead compared to relational databases.
You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases. I'm more concerned about the next ones:
There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
I know you can write your own views or map/reduce algorithms to get something like an index, but it seems at first glance that for the general case NoSQL must be terribly inefficient space and CPU wise.
Is it really that bad? What kinds of optimizations are in place in NoSQL databases (say MongoDB)? What's the overhead in storing lots of identical complex JSON documents compared to using a relational database?
First, any overhead or inefficiency is more often than not simply represent a choice of priorities; an overhead somewhere gives you an advantage somewhere else.
As for your specifics points, again, I think answers will depends a lot depending on the exact NoSQL products, even among the key-value or document-based subgroup, but here some thoughts :
1- You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases.
Actually, most (if not all) key-value databases can be used with any schema you want. So you can have a "normalized schema" laid upon a key-value store, resulting in no duplication. Don't forget that there are SQL solutions available for some (or most?) key-value databases.
2- There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
I guess this depends on how the database engine is implemented, but compression - either complicated or simple "tokenization" - can be used and result in no significant overhead there neither.
3- The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
Again, nothing prevent a key-value or document-based database from using any kind of trees under the hood or to store integers in a compact way (for example, it can have a simple binary flag to indicate if the data is stored as string or "compact integer"). As for creating indices, that is also possible (for the same reasons stated in 1, or done manually by the application).
I am currently trying to improve the performance of a web application. The goal of the application is to provide (real time) analytics. We have a database model that is similiar to a star schema, few fact tables and many dimensional tables. The database is running with Mysql and MyIsam engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow if the dimension tables get joined on the fact tables and also aggretations are done. First thing that comes in mind when hearing this is, why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly it wasn't invented yet. So I came to the idea to combine 2 existing systems. Mixing a row oriented and a column oriented database (e.g. like infinidb or infobright). Keeping the mysql MyIsam solution (for fast inserts and row based queries) and add a column oriented database (for fast aggregation operations on few columns) to it and fill it periodically (nightly) via cronjob. Problem would be when the current data (it must be real time) is queried, therefore I maybe would need to get data from both databases which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is, is this a good idea? Has somebody maybe already done this? Maybe there is are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with high customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compresed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.
Basically it's a financial database, with both daily and intraday data (date,symbol,open,high,low,close,vol,openinterest) -- very simple structure. Updates are just once a day. A typical query would be: date and close price of MSFT for all dates in DB. I was thinking that there's got to be something out there that's been optimized for lots of reads and not many writes, as opposed to a general-purpose RDBMS like MySQL. I searched rubyforge.org, and I didn't see anything that specifically addressed this (as far as I could tell).
MS SQL Server can be optimized like this with the fairly simple:
ALTER DATABASE myDatabase
SET READ_COMMITTED_SNAPSHOT ON
SQL Server will automatically cache your data in memory if it is being used heavily for reads.
You can always use a RAMdisk for your MySQL installation if your database footprint is small enough. One way to make your tables small enough to fit is to create them as MyISAM ARCHIVE tables. While they are very compact, compressed, they can only be appended to or read from, but not updated. (http://dev.mysql.com/tech-resources/articles/storage-engine.html)
Generally a properly indexed and well organized MySQL table is really fast, especially when using MyISAM, and even more so when loaded from memory. They key is in denormalizing the data as heavily as you can optimizing for your particular read scenarios.
For example, having a stock_id, date, price tuple is going to be fairly slow to sort and retrieve. If you have, instead, stock_id and a column with some serialized data, the retrieval time will be very quick.
Another solution that is likely faster is to push all the data into an alternative DBMS like Toyko Cabinet or something similar, especially if your data fits neatly into a key/value store.
Look at MySQL, but run the database from memory instead of disk. Depends on the size of your dataset and your budget, but you could then update memory from disk once a day, and have a very, very fast read time afterwards.
The best-known (to me at least!) time series database is Fame but it's expensive and I strongly doubt that there's anything like, say, an ActiveRecord implementation for it. Unless it's changed a lot in the 10 or so years since I last touched it, it isn't SQL-friendly at all.
With a fairly tightly-focused application, you can take a more flexible view of your data. For example, consider what is the information that you're actually looking to store? Is it the atomic price/hi/lo/close/vol/whatever, or is it more appropriately a time series of such values? If you always want to view the series, store a series per row, not a value.
Throwing a few ideas out here...
How might it look if you stored a year or a month of a single value for a single stock in one row? Maybe as an XML string, or JSON or something more terse of your own devising. Compressed CSV, perhaps? That ought to fit a month's values into a 255-character column. (Use something like Huffman coding to do the encoding, perhaps - a single dictionary ought to work for all instances of such similar data).
You can still hold a horizontal view as well: with the extremely low update rate you'll have (should only be data fixes, I'd guess) you can probably stand to build that stuff.
There's an obvious downside to this: you'll have a bunch of extra work to do.
I don't have any personal experience, but MogoDB claims to offer relational-style flexibility with key-value performance.
As mentioned elsewhere key-value database might be worth looking at: Tokyo Cabinet, CouchDB or one of the others again, perhaps, with concatenated value for the time series.
What are the best practices for database design and normalization for high traffic websites like stackoverflow?
Should one use a normalized database for record keeping or a normalized technique or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
That said, for ridiculous amounts of traffic good relational db performance can be hard to achieve. That is why many bigger sites use key value stores(e.g. memcached) and other caching mechanisms.
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of stack overflow on thier podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: Define for yourself what hight-traffic means:
50.000 Page-Viewss per day?
500.000 Page-Views per day?
5.000.000 Page-Views per day?
more?
Then calculate this down to probable peak page-views per minute and per seconds.
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
Relational-database can be, if fully optimized, amazingly fast, when joining tables!
A relational-database could be hit seldom when to as a back-end, to populate a cache or fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search, look into e.g. lucene or something similar, if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone for the denormalized table route as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find read http://highscalability.com/ for examples of how the "big sites" cope with the volume - Stack Overflow was recently featured.
Neither matters if you aren't caching properly.
We have a system that generates many events as the result of a phone call/web request/sms/email etc, each of these events need to be able to be stored and be available for reporting (for MI/BI etc) on, each of these events have many variables and does not fit any one specific scheme.
The structure of the event document is a key-value pair list (cdr= 1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) document, of which we have reports that run against the data, for many different reasons I am looking at other solutions.
I am looking for suggestions of a system (open or closed) that allows us to push these events in (regardless of the schema) and provide reporting and anlytics on top of it.
I have seen Pentaho and Jasper, but most of the seem to connect to a system to get the data out of it to then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submitting of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server etc)
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business are prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking. It's even less trivial when you try to do such a system and build it for high uptimes.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions..
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey they can import it into excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have larger data volumes over 65535 respondents per survey as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1048575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
Edit: I don't think there are other solutions that are sufficiently flexible for this type of applicaiton. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy but this is certainly quite feasible. I have written VBA code that programatically creates pivot tables in the past so I can say from personal experience that this is feasible to do.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times as you have to construct the displays. You can programatically construct pivot charts if necessary but automating report construction through excel in this way will be a fair bit of work.
You might get some mileage from R (www.r-project.org) as you can construct a framework that lets you import data and generate bespoke reports with a bit of R Code. This is not an end-user tool but your client base sounds like they want canned reports anyway.