Super large matrix generation from MySQL tables?

I have two MySQL tables: one contains a set of 6000 users, and the other a set of 10000 ratings they have provided for products. I'd like to build a matrix of feature vectors in which each row represents a user and each column holds a 1 or 0 depending on whether that user has rated a particular product (or even the rating value itself). What is the best way to accomplish this, given that the matrix will be sparse?
I'm curious as to what implementations I can test out with tools at my disposal (like MySQL or MATLAB) - the end purpose is to perform clustering of similar users. Somehow I think a 10,000 column MySQL table won't make my db admin happy... at all.

The obvious way of storing a sparse matrix in SQL is to use three columns, where user and product together are the primary key, and the extra column is the rating.
It does not make sense to do the actual processing in the SQL database. That adds a huge overhead and makes things slow. Just get the data out into a primitive, fast data structure, do the analysis, and then translate the result into whatever output format you need.
SQL is good when you need only part of the data, have to make changes, or need locking and the like. But I'd never run the computation directly on the database: unless you can load your low-level linear algebra libraries into the database, it will be slow.
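To make the "get the data out into a primitive structure" step concrete, here is a minimal sketch in Python. It assumes the ratings come back from the database as (user_id, product_id, rating) tuples; the sample rows and the function name build_csr are made up for illustration. It packs the triples into the three arrays of a compressed sparse row (CSR) layout, which is one common sparse format, not the only option.

```python
from collections import defaultdict

def build_csr(triples):
    """Pack (user_id, product_id, rating) rows into CSR-style arrays."""
    users = sorted({u for u, _, _ in triples})
    products = sorted({p for _, p, _ in triples})
    col_of = {p: j for j, p in enumerate(products)}  # product id -> column

    by_user = defaultdict(list)
    for u, p, r in triples:
        by_user[u].append((col_of[p], r))

    indptr, indices, values = [0], [], []
    for u in users:
        for j, r in sorted(by_user[u]):   # columns in ascending order
            indices.append(j)
            values.append(r)
        indptr.append(len(indices))       # end of this user's row
    return users, products, indptr, indices, values

# Hypothetical rows, e.g. from: SELECT user_id, product_id, rating FROM ratings
rows = [(1, 10, 5), (1, 12, 3), (2, 10, 4), (3, 11, 1)]
users, products, indptr, indices, values = build_csr(rows)
```

Row i of the matrix is the slice indptr[i]:indptr[i+1] of indices and values. Users with no ratings at all do not appear in the triples, so they would have to be added as empty rows if the clustering needs them. For a 0/1 matrix, replace every value with 1.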

Related

Business Intelligence Datasource Performance - Large Table

I use Tableau and have a table with 140 fields. Due to the size/width of the table, the performance is poor. I would like to remove fields to increase reading speed, but my user base is so large, that at least one person uses each of the fields, while 90% use the same ~20 fields.
What is the best solution to this issue? (Tableau is our BI tool, BigQuery is our database)
What I have done thus far:
In Tableau, it isn't clear how to use dynamic data sources that change based on the field selected. Ideally, I would like to have smaller views OR denormalized tables. As users make their selections in Tableau, the underlying data source would switch to the table or view containing that field.
I have tried a simple version of a large view, but it performed worse than my large table and read significantly more data (remember, I am on BigQuery, so I care very much about bytes read due to costs).
Suggestion 1: Extract your data.
Especially with data sources that are billed per byte queried (BigQuery, Athena, etc.), extracts make a great deal of sense, depending on how 'fresh' the data must be for the users. (Of course all users will say 'live is the only way to go', but dig into this a little and see what the real requirement is.) Refreshes can be scheduled as often as every 15 minutes. The real power of refreshes comes in the form of 'incremental refreshes', whereby only new records are added (along an int or date index). This is a great way to reduce costs if your BigQuery table is partitioned (which it should be). Since Tableau extracts are stored in .hyper files, a structure of Tableau's own design and control, they are extremely fast and well optimized for use in Tableau.
Suggestion 2: Create 3 data sources (or more). Certify these data sources after validating that they provide correct information. Provide users with clear descriptions.
1. Original Large Dataset.
2. Subset of ~20 fields for the 90%.
3. Remainder of fields for the 10%.
4. Extract of 1
5. Extract of 2
6. Extract of 3
Importantly, if field names match in each data source (i.e., they are never changed manually), then it should be easy for a user to 'scale up' to larger datasets as needed. This means that they could generally start out with a small subset of data to begin their exploration, and then use the 'replace datasource' feature to switch to a different data source while keeping the same views. (This wouldn't work as well, if at all, for scaling down.)

Querying a view using multiple keys

Given the following view for the gamesim-sample example:
function (doc, meta) {
  if (doc.jsonType == "player" && doc.experience) {
    emit([doc.experience, meta.id], doc.id);
  }
}
I would like to query the leaderboard only for users who belong to a specific group (the grouping data is maintained in an external system).
For example, if the view has users "orange", "purple", "green", "blue" and "red", I would like the leaderboard to give me the rankings of only "orange" and "purple" without having to query their respective current experience points.
...view/leaderboard?keys=[[null,"orange"],[null,"purple"]]
The following works, but it requires additional queries beforehand to find the experience points of "orange" and "purple", which does not scale.
...view/leaderboard?keys=[[1,"orange"],[5,"purple"]]
Thanks in advance!
Some NoSQL vs. SQL Background
First, you have to remember that, specifically with Couchbase, the advantage is the super-fast storage and retrieval of records. Indices were added later as a way to make storage a little more useful and less error-prone (think of them as an automated inventory), and their design really forces you to move away from SQL-style thinking. Your query above is a perfect example:
select *
from leaderboard
where id in ('orange','purple')
order by experience
This is a retrieval, a computation, and a filter all in one shot. This is exactly what NoSQL databases are optimized not to do (and what SQL databases are optimized to do, which often makes them hopelessly complex, but that is another topic).
This leads to the primary difference between a SQL and a NoSQL database: NoSQL is optimized for storage, while SQL is optimized for querying. It forces you to adjust how you think about the role of the database, which in my opinion should be more the former than the latter.
The creators of Couchbase originally focused purely on the storage aspect of the database. However, storage makes a lot more sense when you know what it is you have stored, and indices were added later as a feature (originally you had to keep track of your own stuff, and it was not much fun!). They also added map-reduce in a way that takes advantage of CB's ability to store and retrieve massive quantities of records simultaneously. Neither of these features was really intended to solve complex query problems (even though this query is simple, it is a perfect example of this). That is the job of your application logic.
Addressing Your Specific Issue
So, now on to your question. The query itself appears to be a simple one, and indeed it is. However,
select * from leaderboard
is not actually simple. It is instead a 2-layer deep query, as your definition of leaderboard implies a sorted list from largest to smallest player experience. Therefore, this query, expanded out, becomes:
select * from players order by experience desc
Couchbase supports the above natively in the index mechanism (remember, it inventories your objects), and you have accurately described in your question how to leverage views to achieve this output. What Couchbase does not support is the third-level query, which represents your where clause. Typically, a where in Couchbase is executed in either the view "map" definition or the index selection parameters. You can't do it in "map" because you don't always want the same selection, and you can't do it in the index selection parameter because the index is sorted on experience level first.
Method 1
Let's assume that you are displaying this to a user on a web page. You can easily implement this filter client-side (or in your web service) by pulling the data as-is and throwing out values that you don't want. Use the limit and skip parameters to ask for more as the user scrolls down (or clicks more pages, or whatever).
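A rough sketch of Method 1 in Python, assuming the view is read page by page with skip and limit. The function fetch_page and the sample rows are hypothetical stand-ins for the actual view query, and rows are assumed to arrive as (experience, player_id) pairs in the order the view returns them.

```python
def filtered_leaderboard(fetch_page, wanted_ids, page_size=2):
    """Walk the experience-sorted view page by page, keeping only wanted players."""
    results, skip = [], 0
    remaining = set(wanted_ids)
    while remaining:
        page = fetch_page(skip=skip, limit=page_size)
        if not page:                      # ran off the end of the view
            break
        for experience, player_id in page:
            if player_id in remaining:
                results.append((player_id, experience))
                remaining.discard(player_id)
        skip += len(page)
    return results

# Hypothetical stand-in for the view; a real client would issue the
# view query with skip/limit parameters instead of slicing a list.
VIEW_ROWS = [(1, "orange"), (3, "green"), (5, "purple"), (7, "blue"), (9, "red")]

def fake_fetch_page(skip, limit):
    return VIEW_ROWS[skip:skip + limit]

board = filtered_leaderboard(fake_fetch_page, ["orange", "purple"])
```

The loop stops as soon as every wanted player has been seen, so the cost depends on how far down the view the last wanted player sits.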
Method 2
Reverse the order of your index, and sort by "group" (aka color) first, then experience level. Run separate queries to select the top 'N' users of each color, then merge and sort on the client side. This will take longer to load up-front but will give you a larger in-memory data set to work with if you need it for that reason. This method may not work well if you have a very uneven distribution of categories, in which case 'N' would need to be tailored to match the statistical distribution(s) within the categories.
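The merge step of Method 2 might look like the sketch below. It assumes each per-group query returns (experience, player_id) rows; the dictionary rows_by_group is an invented stand-in for those per-group query results, and the player names are made up.

```python
import heapq

def top_n_per_group(rows_by_group, groups, n):
    """Take the top n rows of each group, then merge them into one ranking."""
    per_group = []
    for group in groups:
        # Each list is sorted by descending experience, as the merge requires.
        top = sorted(rows_by_group.get(group, []), reverse=True)[:n]
        per_group.append(top)
    # heapq.merge combines already-sorted inputs without re-sorting everything.
    return list(heapq.merge(*per_group, reverse=True))

# Hypothetical per-group results (one list per group/color).
rows_by_group = {
    "orange": [(4, "ann"), (9, "bob"), (1, "cy")],
    "purple": [(7, "dee"), (2, "eve")],
    "green": [(8, "fay")],
}
ranking = top_n_per_group(rows_by_group, ["orange", "purple"], 2)
```

With an uneven distribution, a fixed n can cut off players from a strong group who would outrank the best of a weak one, which is the caveat mentioned above.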
Bottom Line
One parting thought is that NoSQL databases were designed to deal with highly dynamic data sets. This requires some statistical thinking, because there is no longer a single "right" answer. Some degree of inconsistency and error is to be expected (as there always is in the real world). You can't expect a NoSQL database to return a perfect query result, because there is no perfection. You have to settle for "good enough", which is often much better than what is needed anyway.

Efficient searching in huge multi-dimensional matrix

I am looking for a way to search in an efficient way for data in a huge multi-dimensional matrix.
My application contains data that is characterized by multiple dimensions. Imagine keeping data about all sales in a company (my application is totally different, but this is just to demonstrate the problem). Every sale is characterized by:
the product that is being sold
the customer that bought the product
the day on which it has been sold
the employee that sold the product
the payment method
the quantity sold
I have millions of sales, done on thousands of products, by hundreds of employees, on lots of days.
I need a fast way to calculate e.g.:
the total quantity sold by an employee on a certain day
the total quantity bought by a customer
the total quantity of a product paid by credit card
...
I need to store the data in the most detailed way, and I could use a map where the key is the combination of all dimensions, like this:
class Combination
{
    Product *product;
    Customer *customer;
    Day *day;
    Employee *employee;
    Payment *payment;
};
std::map<Combination,quantity> data;
But since I don't know beforehand which queries are performed, I need multiple combination classes (where the data members are in different order) or maps with different comparison functions (using a different sequence to sort on).
Possibly, the problem could be simplified by giving each product, customer, ... a number instead of a pointer to it, but even then I end up with lots of memory.
Are there any data structures that could help with this kind of efficient search?
EDIT:
Just to clarify some things: On disk my data is stored in a database, so I'm not looking for ways to change this.
The problem is that to perform my complex mathematical calculations, I have all this data in memory, and I need an efficient way to search this data in memory.
Could an in-memory database help? Maybe, but I fear that an in-memory database might have a serious impact on memory consumption and on performance, so I'm looking for better alternatives.
EDIT (2):
Some more clarifications: my application will perform simulations on the data, and in the end the user is free to save this data into my database or not. So the data itself changes the whole time. While these simulations run and the data changes, I need to query the data as explained before.
So again, simply querying the database is not an option. I really need (complex?) in-memory data structures.
EDIT: to replace earlier answer.
Can you imagine you have any other possible choice besides running qsort( ) on that giant array of structs? There's just no other way that I can see. Maybe you can sort it just once at time zero and keep it sorted as you do dynamic insertions/deletions of entries.
Using a database (in-memory or not) to work with your data seems like the right way to do this.
If you don't want to do that, you don't have to implement lots of combination classes, just use a collection that can hold any of the objects.
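One in-memory alternative to many differently ordered maps is to keep a single list of sales plus one index per dimension, and answer a query by intersecting the matching row-id sets. The sketch below shows that idea in Python for brevity; the class name SalesIndex and the sample values are invented, and it covers inserts only (updates during a simulation would also have to maintain the indexes).

```python
from collections import defaultdict

class SalesIndex:
    DIMENSIONS = ("product", "customer", "day", "employee", "payment")

    def __init__(self):
        self.rows = []  # row id -> (dimension values, quantity)
        self.index = {d: defaultdict(set) for d in self.DIMENSIONS}

    def add(self, quantity, **dims):
        """Store one sale and register its row id under every dimension value."""
        row_id = len(self.rows)
        self.rows.append((dims, quantity))
        for d in self.DIMENSIONS:
            self.index[d][dims[d]].add(row_id)
        return row_id

    def total(self, **criteria):
        """Sum the quantity over all sales matching every given dimension value."""
        if not criteria:
            return sum(q for _, q in self.rows)
        id_sets = sorted(
            (self.index[d].get(v, set()) for d, v in criteria.items()), key=len
        )
        matching = set.intersection(*id_sets)  # start from the smallest set
        return sum(self.rows[i][1] for i in matching)

idx = SalesIndex()
idx.add(3, product="p1", customer="c1", day="mon", employee="e1", payment="card")
idx.add(2, product="p1", customer="c2", day="mon", employee="e1", payment="cash")
idx.add(5, product="p2", customer="c1", day="tue", employee="e2", payment="card")
```

Memory grows with the number of rows times the number of dimensions, not with the number of possible query orders, which is the trade-off compared with keeping one sorted map per ordering.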

Mixing column and row oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real-time) analytics. We have a database model that is similar to a star schema: a few fact tables and many dimension tables. The database runs on MySQL with the MyISAM engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
The problem is that select queries can get awfully slow when the dimension tables are joined to the fact tables and aggregations are performed as well. The first thing that comes to mind is: why not precalculate the data? That is not possible, because users are allowed to apply several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly, it hasn't been invented yet. So I came up with the idea of combining two existing systems: a row-oriented and a column-oriented database (e.g. infinidb or infobright). I would keep the MySQL MyISAM solution (for fast inserts and row-based queries), add a column-oriented database alongside it (for fast aggregation over a few columns), and fill the latter periodically (nightly) via a cron job. The problem arises when current data is queried (it must be real time): I would then probably need to fetch data from both databases, which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is: is this a good idea? Has somebody already done this? Maybe there are better ways to do it.
I have no experience with column-oriented databases yet, and I'm not sure what the schema should look like. First tests showed good performance both with the same star-schema-like structure and with a single big-table structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with highly customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compressed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.
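To see why heavy redundancy costs so little in a compressed column, here is a toy run-length encoding sketch in Python. It is only an illustration of the principle, not a description of what Greenplum or Infobright actually do internally, and the customer names are made up.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeats in a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# A denormalized column repeating one customer name 40,000 times.
column = ["Acme"] * 40000 + ["Zenith"] * 3
runs = rle_encode(column)
```

Here 40,003 logical values shrink to two stored pairs. Real engines combine several techniques, and the ratio depends on how the data is sorted, since run-length encoding only helps when equal values sit next to each other.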

Anyone know anything about OLAP Internals?

I know a bit about database internals. I've actually implemented a small, simple relational database engine before, using ISAM structures on disk and BTree indexes and all that sort of thing. It was fun, and very educational. I know that I'm much more cognizant about carefully designing database schemas and writing queries now that I know a little bit more about how RDBMSs work under the hood.
But I don't know anything about multidimensional OLAP data models, and I've had a hard time finding any useful information on the internet.
How is the information stored on disk? What data structures comprise the cube? If a MOLAP model doesn't use tables, with columns and records, then... what? Especially in highly dimensional data, what kinds of data structures make the MOLAP model so efficient? Do MOLAP implementations use something analogous to RDBMS indexes?
Why are OLAP servers so much better at processing ad hoc queries? The same sorts of aggregations that might take hours to process in an ordinary relational database can be processed in milliseconds in an OLAP cube. What are the underlying mechanics of the model that make that possible?
I've implemented a couple of systems that mimicked what OLAP cubes do, and here are a couple of things we did to get them to work.
The core data was held in an n-dimensional array, all in memory, and all the keys were implemented via hierarchies of pointers into the underlying array. In this way we could have multiple different sets of keys for the same data. The data in the array was the equivalent of the fact table; often it would hold only a couple of pieces of data, in one instance price and number sold.
The underlying array was often sparse, so once it was created we used to remove all the blank cells to save memory - lots of hardcore pointer arithmetic but it worked.
Because we had hierarchies of keys, we could quite easily write routines to drill down or up a hierarchy. For instance, we would access a year of data by going through the month keys, which in turn mapped to days and/or weeks. At each level we would aggregate data as part of building the cube, which made calculations much faster.
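The aggregate-at-build-time idea can be sketched roughly as follows. This is a simplified Python illustration with dictionaries instead of the pointer hierarchies described above; the function names and the sample figures are invented.

```python
from collections import defaultdict

def build_rollups(daily):
    """daily maps (year, month, day) -> quantity; precompute month and year totals."""
    by_month = defaultdict(int)
    by_year = defaultdict(int)
    for (year, month, _day), qty in daily.items():
        by_month[(year, month)] += qty
        by_year[year] += qty
    return dict(by_month), dict(by_year)

def drill_down(daily, year, month):
    """Return the day-level cells underneath one month."""
    return {k: v for k, v in daily.items() if k[0] == year and k[1] == month}

daily = {(2020, 1, 1): 2, (2020, 1, 2): 3, (2020, 2, 1): 4, (2021, 1, 1): 1}
by_month, by_year = build_rollups(daily)
```

A query for a year's total then becomes a single lookup in by_year instead of a scan over every day, at the cost of extra memory and a longer build.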
We didn't implement any kind of query language, but we did support drill-down on all axes (up to 7 in our biggest cubes), and that was tied directly to the UI, which the users liked.
We implemented the core in C++. These days I reckon C# could be fast enough, but I'd worry about how to implement sparse arrays.
Hope that helps; it sounds interesting.
The book Microsoft SQL Server 2008 Analysis Services Unleashed spells out some of the particularities of SSAS 2008 in decent detail. It's not quite a "here's exactly how SSAS works under the hood", but it's pretty suggestive, especially on the data structure side. (It's not quite as detailed/specific about the exact algorithms.) Here are a few of the things I, as an amateur in this area, gathered from this book. This is all about SSAS MOLAP:
Despite all the talk about multi-dimensional cubes, fact table (aka measure group) data is still, to a first approximation, ultimately stored in basically 2D tables, one row per fact. A number of OLAP operations seem to ultimately consist of iterating over rows in 2D tables.
The data is potentially much smaller inside MOLAP than inside a corresponding SQL table, however. One trick is that each unique string is stored only once, in a "string store". Data structures can then refer to strings in a more compact form (by string ID, basically). SSAS also compresses rows within the MOLAP store in some form. This shrinking I assume lets more of the data stay in RAM simultaneously, which is good.
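The "string store" trick is essentially dictionary encoding. A minimal Python sketch of the idea follows; the class and method names are invented, and this is not SSAS's actual on-disk layout.

```python
class StringStore:
    """Keep each distinct string once and hand out small integer ids."""

    def __init__(self):
        self._ids = {}       # string -> id
        self._strings = []   # id -> string

    def intern(self, s):
        """Return the id for s, assigning a new one the first time s is seen."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, string_id):
        return self._strings[string_id]

store = StringStore()
encoded = [store.intern(name) for name in ["Alice", "Bob", "Alice", "Alice"]]
```

Fact rows then carry fixed-width integers instead of variable-width strings, which also fits the observation that fixed-size rows are easier to scan.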
Similarly, SSAS can often iterate over a subset of the data rather than the full dataset. A few mechanisms are in play:
By default, SSAS builds a hash index for each dimension/attribute value; it thus knows "right away" which pages on disk contain the relevant data for, say, Year=1997.
There's a caching architecture where relevant subsets of the data are stored in RAM separate from the whole dataset. For example, you might have cached a subcube that has only a few of your fields, and that only pertains to the data from 1997. If a query is asking only about 1997, then it will iterate only over that subcube, thereby speeding things up. (But note that a "subcube" is, to a first approximation, just a 2D table.)
If you've predefined aggregates, then these smaller subsets can also be precomputed at cube processing time, rather than merely computed/cached on demand.
SSAS fact table rows are fixed size, which presumably helps in some form. (In SQL, in contrast, you might have variable-width string columns.)
The caching architecture also means that, once an aggregation has been computed, it doesn't need to be refetched from disk and recomputed again and again.
These are some of the factors in play in SSAS anyway. I can't claim that there aren't other vital things as well.
