Multi-dimensional data structure - performance

Which of the following data structure
R-tree,
R*-tree,
X-tree,
SS-tree,
SR-tree,
VP-tree,
metric-trees
provide reasonably good performance in insert, update and searching of multidimensional data stored in its corresponding form?
Is there a better data structure out there for handling multidimensional data?

What kind of multi-dimensional data are you talkign about? In the R-tree wiki, it states that itis used for indexing multi-dimensional data, but it seems clear that it will be primarily useful for data which is multi-dimensional in the same kind of feature -- i.e. vertical location and horizontal location, longitude and latitude, etc.
If the data is multi-dimensional simply because there are a lot of attributes for the data and it needs to be analyzed along many of these dimensions, then a relational representation is probably best.
The real issue is how do you optimize the relations and indices for the type of queries you need to answer? For this, you need to do some domain analysis beforehand, and some performance analysis after the first iteration, to determine if there are better ways to structure and index your tables.

Related

Data Structures and data representation in a given context

I started dedicating time for learning algorithms and data structures. So my first and basic question is, how do we represent the data depending on the context.
I have given it time and thought and came up with this conclusion.
Groups of same data -> List/Arrays
Classification of data [Like population on gender, then age etc.] -> Trees
Relations [Like relations between a product brought and others] -> Graphs
I am posting this question to know our stack overflow community thought about my interpretation of datastructures. Since it is a generic topic I could not get a justification for my thought online. Please help me if I am wrong.
This looks like oversimplifying things.
The data structure we want to use depends on what we are going to do with the data.
For example, when we store records about people and need fast access by index, we can use an array.
When we store the same records about people but need to find by name fast, we can use a search tree.
Graphs are a theoretical concept, not a data structure.
They can be stored as an adjacency matrix (two-dimensional array, suitable for small or dense graphs), or as lists of adjacent edges (array/list of dynamic arrays/lists, suitable for large or sparse graphs), or implicitly (generated on the fly), or otherwise.

best data structure for multidimensional data?

I would like to implement a simple, in-memory OLAP cube storage engine for read and write (writeback) - functionally similar to SSAS cube with multiple dimensions but one measure and only with 1 type of aggregation (sum). As in OLAP cube each axis in the multidimensional space can be a multi-level hierarchy.
Can the community provide me with some hints at which data-structures and related algorithms should I be looking at? I understand that I need something capable of indexing data in many dimensions at once, and storing intermediate precomputed aggregation values.
I'd rather not be gluing multiple nested maps together but implement something from scratch - the goal of the excercise is not just to implement this beast but also to better understand multidimensional data structures and algorithms.
Just to clarify - I am focused on the core data structure of storing multidimensional hierarchical data for reads and writes. I do not seek to implement MDX parser, make the cube persistent, etc.
Have a look at the list of spatial indexes at Wikipedia, one of them, like R-tree or k-d tree might be what you are looking for.

Best tree structure for Multi-dimensional data

To organize multi-dimensional data,
What is the most useful and efficient tree data structure?
(eg, K-D-B tree, region quadtree, R-tree)
I want to know best search time and best space utilization tree structure.
It highly depends on how your data is distributed in the space and how you want to search for it (what are the criteria you query for?).
It is very easy to find the right quad-tree bin given a location in space, on the other hand it introduces more overhead than a well-shaped kd-tree. There is a reason why all of these techniques are still in use.
Specify the problem you want to solve with the data structure.
Different data structures, including trees and information about them and source code of their implementation is found at https://ece.uwaterloo.ca/~ece250/Algorithms/
Furthermore, runtime information and asymptotic analysis on different types of tree structures is found under section 4 at https://ece.uwaterloo.ca/~ece250/Lectures/Slides/
These are very useful and reliable and this way you can choose the best structure depending on your specific needs/ data
I hope this helps!

NoSQL or YesSQL

I have a huge dictionary of words:
"word1" => [value1]
"word2" => [value2]
"word3" => [value3, value2]
...
"word400000000" => [value455, value3435, ..., value3423]
number of words is really big.
Now I want to be able to retrieve, really fast, all the values which are being pointed by word. word is string value.
What are the best tools to use? I thought of simple DB solution, but DBA guys said that it will not work really fast.
So, before I open Cormen's book, is there some ready solutions for that problem?
Look at key/value storage engines such as Berkeley DB. They are very fast at that sort of thing.
In RDMSs (YesSQL) you will most probably search values with LIKE or = operators on all records, i.e. search will take O(n). What you actually need is a data structure called inverted index, which allows you to find list of needed values in O(1). For description of structure and algorithms see Wikipedia article, for ready-to-use tools keep reading.
There's plenty of implementations of inverted index in search engines like Lucene/Solr, Sphinx (which, by the way, supports several databases as data source), and also in some key-value stores like Berkeley DB or Apache Cassandra. Distinction between search engines and key-value stores is in that:
Search engines implement inverted index more directly (AFAIK, key-value DBs use BigTable-like structures, that are much more complex then inverted index itself).
Search engines have a plenty of tools for text analysis (parsing, stemming). I don't know, if you actually need it, but if you do, use search engines.
Key-value DBs are real databases. I.e., unlike search engines they have real data types, not only strings. Moreover, some of such DBs (e.g. Berkeley DB) can store programming language native data types without converting them to any inner format. So, if you need a real database with all features, use key-value stores.
Also note, that inverted index is really simple structure, so you can easily implement it by yourself, if none of previous options is suitable for you.
It really depends on what behavior you want. If you just want to be able to do an exact text search, then a hash table is probably a really great idea. It has expected O(1) lookup, which is about as fast as you're going to get.
If you need the elements in sorted order (for example, so you can iterate across them in a reasonable order), then one of the myriad balanced search trees might be a good candidate; for example, a red-black tree or an AVL tree.
If you're working with a huge data set that can't all fit into main memory, then a very good choice might be a B-tree, which is a type of balanced binary search tree that minimizes the number of disk reads required to find a given element. Most database systems use some flavor of B-trees for their lookups.
You can use cassandra (http://cassandra.apache.org/). Is Easy to start, has pretty much documentation and is a really fast solution for your problem.
Hope this helps,
If you know that you will only want to search for values based on words and not the other way around, use a simple Key-Value store. Maybe Redis would be best.
If you think you will ever need to search based on the values, then you'll likely need Secondary Indices or off-line MapReduce jobs. Maybe Cassandra would be best.

Anyone know anything about OLAP Internals?

I know a bit about database internals. I've actually implemented a small, simple relational database engine before, using ISAM structures on disk and BTree indexes and all that sort of thing. It was fun, and very educational. I know that I'm much more cognizant about carefully designing database schemas and writing queries now that I know a little bit more about how RDBMSs work under the hood.
But I don't know anything about multidimensional OLAP data models, and I've had a hard time finding any useful information on the internet.
How is the information stored on disk? What data structures comprise the cube? If a MOLAP model doesn't use tables, with columns and records, then... what? Especially in highly dimensional data, what kinds of data structures make the MOLAP model so efficient? Do MOLAP implementations use something analogous to RDBMS indexes?
Why are OLAP servers so much better at processing ad hoc queries? The same sorts of aggregations that might take hours to process in an ordinary relational database can be processed in milliseconds in an OLTP cube. What are the underlying mechanics of the model that make that possible?
I've implemented a couple of systems that mimicked what OLAP cubes do, and here are a couple of things we did to get them to work.
The core data was held in an n-dimensional array, all in memory, and all the keys were implemented via hierarchies of pointers to the underlying array. In this way we could have multiple different sets of keys for the same data. The data in the array was the equivalent of the fact table, often it would only have a couple of pieces of data, in one instance this was price and number sold.
The underlying array was often sparse, so once it was created we used to remove all the blank cells to save memory - lots of hardcore pointer arithmetic but it worked.
As we had hierarchies of keys, we could write routines quite easily to drill down/up a hierarchy easily. For instance we would access year of data, by going through the month keys, which in turn mapped to days and/or weeks. At each level we would aggregate data as part of building the cube - made calculations much faster.
We didn't implement any kind of query language, but we did support drill down on all axis (up to 7 in our biggest cubes), and that was tied directly to the UI which the users liked.
We implemented core stuff in C++, but these days I reckon C# could be fast enough, but I'd worry about how to implement sparse arrays.
Hope that helps, sound interesting.
The book Microsoft SQL Server 2008 Analysis Services Unleashed spells out some of the particularities of SSAS 2008 in decent detail. It's not quite a "here's exactly how SSAS works under the hood", but it's pretty suggestive, especially on the data structure side. (It's not quite as detailed/specific about the exact algorithms.) A few of the things I, as an amateur in this area, gathered from this book. This is all about SSAS MOLAP:
Despite all the talk about multi-dimensional cubes, fact table (aka measure group) data is still, to a first approximation, ultimately stored in basically 2D tables, one row per fact. A number of OLAP operations seem to ultimately consist of iterating over rows in 2D tables.
The data is potentially much smaller inside MOLAP than inside a corresponding SQL table, however. One trick is that each unique string is stored only once, in a "string store". Data structures can then refer to strings in a more compact form (by string ID, basically). SSAS also compresses rows within the MOLAP store in some form. This shrinking I assume lets more of the data stay in RAM simultaneously, which is good.
Similarly, SSAS can often iterate over a subset of the data rather than the full dataset. A few mechanisms are in play:
By default, SSAS builds a hash index for each dimension/attribute value; it thus knows "right away" which pages on disk contain the relevant data for, say, Year=1997.
There's a caching architecture where relevant subsets of the data are stored in RAM separate from the whole dataset. For example, you might have cached a subcube that has only a few of your fields, and that only pertains to the data from 1997. If a query is asking only about 1997, then it will iterate only over that subcube, thereby speeding things up. (But note that a "subcube" is, to a first approximation, just a 2D table.)
If you're predefined aggregates, then these smaller subsets can also be precomputed at cube processing time, rather than merely computed/cached on demand.
SSAS fact table rows are fixed size, which presumibly helps in some form. (In SQL, in constrast, you might have variable-width string columns.)
The caching architecture also means that, once an aggregation has been computed, it doesn't need to be refetched from disk and recomputed again and again.
These are some of the factors in play in SSAS anyway. I can't claim that there aren't other vital things as well.

Resources