Serializing boost::uuids has a cost, and using them to index into a vector or unordered_map requires additional hashing. What are the use cases where boost::uuids are the ideal data structure to use?
UUIDs are valuable if you want IDs that are stable in time and across storage systems.
Imagine having two databases, each with auto-generated IDs.
Merging them would be a headache if the IDs are generated by incrementing integral values from 0.
Merging them would be a breeze if all relevant IDs are UUIDs.
Likewise, handing a lot of data to an external party who records operations offline, and subsequently applying those operations back to the original data, is much easier with UUIDs - even if relations between elements have changed, new elements have been created, etc.
UUID is also handy for universal "identification" (not authentication/authorization!) - like in driver versions, plugin ids etc. Think about detecting that an MSI is an update to a particular installed software package.
In general, I wouldn't rate UUIDs a characteristic of any data structure. I'd rate them a tool in designing your infrastructure. They play at the level of persistence and exchange, not so much at the level of algorithms and in-memory manipulation.
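For example, here is a minimal C++17 sketch (using Boost.Uuid; the record type and store layout are just placeholders) of the merge scenario above. It also shows that boost::hash<uuid> covers the unordered_map indexing the question mentions:

#include <iostream>
#include <string>
#include <unordered_map>
#include <boost/uuid/uuid.hpp>
#include <boost/uuid/uuid_generators.hpp>
#include <boost/uuid/uuid_io.hpp>
#include <boost/functional/hash.hpp>

int main() {
    using boost::uuids::uuid;

    // Records keyed by UUID; Boost provides hash_value(uuid), so
    // boost::hash<uuid> works directly as the unordered_map hasher.
    using Store = std::unordered_map<uuid, std::string, boost::hash<uuid>>;

    boost::uuids::random_generator gen;

    // Two independently populated stores (think: two databases).
    Store a{{gen(), "record from store A"}};
    Store b{{gen(), "record from store B"}};

    // Merging is trivial: random UUIDs won't collide in practice,
    // unlike auto-incremented integer IDs that both start from 0.
    a.insert(b.begin(), b.end());

    for (const auto& [id, value] : a)
        std::cout << boost::uuids::to_string(id) << " -> " << value << '\n';
}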
I'm considering replacing Redis with Aerospike and I wanted to know if Aerospike is capable of delivering the same capabilities and performance as Redis's sorted sets for leaderboards within an application. I need to be able to quickly insert, read and update items in the set. I also need to be able to do range queries on them and retrieve the rank of an arbitrary item in the set quickly.
Aerospike does not currently have a built-in Leaderboard feature. However, this is one of many functions that anyone can build with User Defined Functions (UDFs) and Large Data Types (LDTs).
The way this would work is you would have a set of UDFs that employs two Large Ordered List (LLIST) LDTs. One LLIST would manage the primary collection, and the other LLIST would provide the Leaderboard/Scoreboard ordering (basically used as an index into the primary collection).
The UDFs would manage the user interaction (read/write/delete primary value and read/scan leaderboard value) and pass the work on to the LDT functions.
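To make the shape of that two-collection design concrete, here is a small, purely illustrative in-memory C++ sketch (it does not use the Aerospike LDT API; the class and names are invented): a primary map holds each player's current score, an ordered multimap acts as the leaderboard index, and every update goes through both.

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

// Illustrative only: a primary collection plus an ordered "index",
// mirroring the two-LLIST layout described above.
class Leaderboard {
    std::unordered_map<std::string, long> primary_;                 // player -> score
    std::multimap<long, std::string, std::greater<long>> byScore_;  // score -> player
public:
    void upsert(const std::string& player, long score) {
        auto it = primary_.find(player);
        if (it != primary_.end()) {
            // Remove the stale index entry before re-inserting.
            auto range = byScore_.equal_range(it->second);
            for (auto r = range.first; r != range.second; ++r)
                if (r->second == player) { byScore_.erase(r); break; }
            it->second = score;
        } else {
            primary_.emplace(player, score);
        }
        byScore_.emplace(score, player);
    }

    // Top-N scan, analogous to reading the leaderboard LLIST in order.
    void printTop(std::size_t n) const {
        std::size_t i = 0;
        for (const auto& [score, player] : byScore_) {
            if (i++ == n) break;
            std::cout << i << ". " << player << " (" << score << ")\n";
        }
    }
};

int main() {
    Leaderboard lb;
    lb.upsert("alice", 120);
    lb.upsert("bob", 95);
    lb.upsert("alice", 150);  // an update routes through both structures
    lb.printTop(10);
}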
We've talked internally about building these examples to show the power of UDFs and LDTs. Perhaps, with a little incentive, we could raise the priority of getting these examples done.
The other issue is performance. What are your latency and throughput requirements?
I have a question about NoSQL type databases, in particular MongoDB, but it applies in general to most key-value or document based storages. Some of the selling points of NoSQL are speed and scalability, but it seems to me that there is significant overhead compared to relational databases.
You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases. I'm more concerned about the next ones:
There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
I know you can write your own views or map/reduce algorithms to get something like an index, but it seems at first glance that for the general case NoSQL must be terribly inefficient space and CPU wise.
Is it really that bad? What kinds of optimizations are in place in NoSQL databases (say MongoDB)? What's the overhead in storing lots of identical complex JSON documents compared to using a relational database?
First, any overhead or inefficiency more often than not simply represents a choice of priorities; an overhead somewhere gives you an advantage somewhere else.
As for your specific points, again, I think the answers will depend a lot on the exact NoSQL products, even among the key-value or document-based subgroup, but here are some thoughts:
1- You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases.
Actually, most (if not all) key-value databases can be used with any schema you want. So you can have a "normalized schema" laid upon a key-value store, resulting in no duplication. Don't forget that there are SQL solutions available for some (or most?) key-value databases.
2- There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
I guess this depends on how the database engine is implemented, but compression - either sophisticated or simple "tokenization" - can be used, resulting in no significant overhead there either.
3- The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
Again, nothing prevents a key-value or document-based database from using any kind of trees under the hood or from storing integers in a compact way (for example, it can have a simple binary flag to indicate if the data is stored as a string or as a "compact integer"). As for creating indices, that is also possible (for the same reasons stated in 1, or done manually by the application).
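As a rough illustration of points 2 and 3 (not any particular engine's storage format): field names can be tokenized into small ids stored once, and a one-byte tag can mark whether a value is held as a compact integer or as a string.

#include <cstdint>
#include <string>
#include <variant>

// Hypothetical in-memory representation of one stored field.
// keyId replaces the repeated key string ("age", "name", ...) with a
// small id from a shared dictionary; tag is the "binary flag" that says
// how the value is physically stored.
enum class Tag : std::uint8_t { CompactInt = 0, Str = 1 };

struct Field {
    std::uint16_t keyId;                           // tokenized key
    Tag tag;                                       // storage form
    std::variant<std::int64_t, std::string> value;
};

inline Field intField(std::uint16_t keyId, std::int64_t v) {
    return {keyId, Tag::CompactInt, v};
}
inline Field strField(std::uint16_t keyId, std::string v) {
    return {keyId, Tag::Str, std::move(v)};
}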
Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be
either a List, SortedSet or an Array
and E can be either a String or byte-array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document where you need to be able to quickly access the qth word of the pth sentence given documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.
In terms of platform choice you have two options. A full document database will support arbitrarily complex objects, but won't have built-in commands for working with specific data structures. Something like Redis, which does have optimised code for specific data structures, can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure. DS1<DS2<E>> becomes DS1<int> and DS2<E>, with the int from DS1 and a prefix giving you the key holding DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
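Here is a rough sketch of that scheme using hiredis and plain Redis lists, based on the tokenized-document example from the question (the doc:<id>:sent:<p> key naming is just my own convention):

#include <cstdio>
#include <string>
#include <hiredis/hiredis.h>

int main() {
    redisContext* c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;

    // DS1<int>: doc:42 is a list whose p-th element is the id of sentence p.
    redisReply* r = (redisReply*)redisCommand(c, "RPUSH doc:42 0");
    freeReplyObject(r);

    // DS2<E>: doc:42:sent:0 holds the words of that sentence.
    r = (redisReply*)redisCommand(c, "RPUSH doc:42:sent:0 the quick brown fox");
    freeReplyObject(r);

    // Two operations to reach any E: LINDEX doc:<id> p for the sentence id,
    // then LINDEX doc:<id>:sent:<sid> q for the word itself.
    r = (redisReply*)redisCommand(c, "LINDEX doc:42 0");
    std::string sid = (r && r->type == REDIS_REPLY_STRING) ? r->str : "";
    freeReplyObject(r);

    r = (redisReply*)redisCommand(c, "LINDEX doc:42:sent:%s 2", sid.c_str());
    if (r && r->type == REDIS_REPLY_STRING)
        std::printf("word: %s\n", r->str);   // "brown"
    freeReplyObject(r);

    redisFree(c);
}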
I hesitate to "recommend" it, but one of the only storage engines I know of which handles multi-dimensional data of this sort efficiently is Intersystems Cache. I had to use it at my last job, mostly coding against it using its built-in MUMPS-based language. I would not recommend the native approach, unless you hate yourself or your developers. However, they do have decent Java adapters, which appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. However, this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages that aren't M or C.
Any key-value store supports complex values; you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)
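A minimal sketch of building such a composite key (the delimiter and layout are arbitrary choices, not a requirement of any particular store):

#include <string>

// Hypothetical composite key: one entry per (docID, p, q) triple, so a
// single GET returns the word without deserializing a nested structure.
inline std::string wordKey(const std::string& docId, int p, int q) {
    return docId + ":" + std::to_string(p) + ":" + std::to_string(q);
}
// e.g. store.put(wordKey("doc42", 3, 7), "horse");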
We have a system that generates many events as the result of a phone call/web request/SMS/email etc. Each of these events needs to be stored and be available for reporting (for MI/BI etc), and each of them has many variables and does not fit any one specific schema.
The structure of the event document is a key-value pair list (cdr=1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) documents, and we have reports that run against that data. For many different reasons I am looking at other solutions.
I am looking for suggestions of a system (open or closed) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of it.
I have seen Pentaho and Jasper, but most of them seem to connect to a system to get the data out of it and then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submitting of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server etc)
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business is prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking. It's even less trivial when you try to build such a system for high uptime.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
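Not tied to any particular DBMS, here is a toy C++ sketch of the same shape (all names invented): a small unindexed lead buffer that is cheap to trickle-insert into, a keyed historical structure standing in for the indexed partitions, and a roll-over step that consolidates the former into the latter.

#include <map>
#include <string>
#include <vector>

// Toy illustration of the lead-partition pattern.
struct Event { std::string key; long value; };

class HybridStore {
    std::vector<Event> lead_;                // unindexed: cheap trickle inserts
    std::multimap<std::string, long> hist_;  // "indexed" historical partition
public:
    void insert(Event e) { lead_.push_back(std::move(e)); }  // low-latency path

    // Queries scan the small lead buffer and use an indexed lookup on history.
    long sum(const std::string& key) const {
        long total = 0;
        for (const auto& e : lead_)
            if (e.key == key) total += e.value;              // table scan
        auto range = hist_.equal_range(key);                 // index seek
        for (auto it = range.first; it != range.second; ++it)
            total += it->second;
        return total;
    }

    // Nightly roll-over: consolidate the lead buffer into the indexed store.
    void rollOver() {
        for (auto& e : lead_) hist_.emplace(std::move(e.key), e.value);
        lead_.clear();
    }
};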
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
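To make the junk-dimension idea concrete, here is a rough sketch (the code columns are hypothetical) of mapping each distinct combination of optional codes to a surrogate key that fact rows then reference:

#include <map>
#include <string>
#include <tuple>

// Each distinct combination of optional codes gets one surrogate key;
// fact rows then store only that key. A zero-length string ("") is the
// placeholder for "not recorded", as suggested above.
using CodeCombo = std::tuple<std::string, std::string>;  // e.g. (campaign, channel)

class JunkDimension {
    std::map<CodeCombo, int> rows_;
public:
    int keyFor(const std::string& campaign, const std::string& channel) {
        CodeCombo combo{campaign, channel};
        auto it = rows_.find(combo);
        if (it != rows_.end()) return it->second;
        int surrogate = static_cast<int>(rows_.size()) + 1;
        rows_.emplace(combo, surrogate);
        return surrogate;
    }
};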
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey they can import it into excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have data volumes of over 65535 respondents per survey, as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1048575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy but this is certainly quite feasible. I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that this is feasible to do.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
You might get some mileage from R (www.r-project.org) as you can construct a framework that lets you import data and generate bespoke reports with a bit of R Code. This is not an end-user tool but your client base sounds like they want canned reports anyway.
I know a bit about database internals. I've actually implemented a small, simple relational database engine before, using ISAM structures on disk and BTree indexes and all that sort of thing. It was fun, and very educational. I know that I'm much more cognizant about carefully designing database schemas and writing queries now that I know a little bit more about how RDBMSs work under the hood.
But I don't know anything about multidimensional OLAP data models, and I've had a hard time finding any useful information on the internet.
How is the information stored on disk? What data structures comprise the cube? If a MOLAP model doesn't use tables, with columns and records, then... what? Especially in highly dimensional data, what kinds of data structures make the MOLAP model so efficient? Do MOLAP implementations use something analogous to RDBMS indexes?
Why are OLAP servers so much better at processing ad hoc queries? The same sorts of aggregations that might take hours to process in an ordinary relational database can be processed in milliseconds in an OLAP cube. What are the underlying mechanics of the model that make that possible?
I've implemented a couple of systems that mimicked what OLAP cubes do, and here are a couple of things we did to get them to work.
The core data was held in an n-dimensional array, all in memory, and all the keys were implemented via hierarchies of pointers to the underlying array. In this way we could have multiple different sets of keys for the same data. The data in the array was the equivalent of the fact table; often it would only have a couple of pieces of data, and in one instance this was price and number sold.
The underlying array was often sparse, so once it was created we used to remove all the blank cells to save memory - lots of hardcore pointer arithmetic but it worked.
As we had hierarchies of keys, we could quite easily write routines to drill down/up a hierarchy. For instance, we would access a year of data by going through the month keys, which in turn mapped to days and/or weeks. At each level we would aggregate data as part of building the cube - that made calculations much faster.
We didn't implement any kind of query language, but we did support drill down on all axes (up to 7 in our biggest cubes), and that was tied directly to the UI, which the users liked.
We implemented core stuff in C++, but these days I reckon C# could be fast enough, but I'd worry about how to implement sparse arrays.
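As a rough sketch of that shape (my own simplification, not the original system): a sparse cube can be a hash map keyed by coordinate tuples so that only non-empty cells cost memory, with a separate hierarchy table used to roll day-level cells up to months.

#include <array>
#include <cstdint>
#include <iostream>
#include <map>
#include <unordered_map>

// Sketch only: a 3-dimensional sparse cube (day, product, region) holding
// the "fact" measures, plus a day -> month hierarchy used to drill up.
struct Measures { double price = 0; long sold = 0; };

using Cell = std::array<std::uint32_t, 3>;   // member ids along each axis

struct CellHash {
    std::size_t operator()(const Cell& c) const {
        std::size_t h = 0;
        for (auto v : c) h = h * 1000003u + v;   // simple mixing
        return h;
    }
};

int main() {
    std::unordered_map<Cell, Measures, CellHash> cube;   // only non-empty cells stored
    std::unordered_map<std::uint32_t, std::uint32_t> dayToMonth = {{1, 1}, {2, 1}, {32, 2}};

    cube[{1, 10, 5}]  = {9.99, 3};
    cube[{2, 10, 5}]  = {9.99, 1};
    cube[{32, 10, 5}] = {8.49, 7};

    // Drill up: aggregate day-level cells into month-level totals.
    std::map<std::uint32_t, long> soldPerMonth;
    for (const auto& [cell, m] : cube)
        soldPerMonth[dayToMonth[cell[0]]] += m.sold;

    for (const auto& [month, sold] : soldPerMonth)
        std::cout << "month " << month << ": " << sold << " sold\n";
}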
Hope that helps, sounds interesting.
The book Microsoft SQL Server 2008 Analysis Services Unleashed spells out some of the particularities of SSAS 2008 in decent detail. It's not quite a "here's exactly how SSAS works under the hood", but it's pretty suggestive, especially on the data structure side. (It's not quite as detailed/specific about the exact algorithms.) A few of the things I, as an amateur in this area, gathered from this book. This is all about SSAS MOLAP:
Despite all the talk about multi-dimensional cubes, fact table (aka measure group) data is still, to a first approximation, ultimately stored in basically 2D tables, one row per fact. A number of OLAP operations seem to ultimately consist of iterating over rows in 2D tables.
The data is potentially much smaller inside MOLAP than inside a corresponding SQL table, however. One trick is that each unique string is stored only once, in a "string store". Data structures can then refer to strings in a more compact form (by string ID, basically). SSAS also compresses rows within the MOLAP store in some form. This shrinking I assume lets more of the data stay in RAM simultaneously, which is good.
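A minimal sketch of the string-store idea (my own simplification of it, not the SSAS on-disk format): each distinct string is kept once, and rows carry fixed-width 32-bit ids instead of the text.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Rows reference strings by id instead of repeating the text, which
// keeps fact rows small (and fixed width).
class StringStore {
    std::unordered_map<std::string, std::uint32_t> ids_;
    std::vector<std::string> strings_;
public:
    std::uint32_t intern(const std::string& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        std::uint32_t id = static_cast<std::uint32_t>(strings_.size());
        strings_.push_back(s);
        ids_.emplace(s, id);
        return id;
    }
    const std::string& lookup(std::uint32_t id) const { return strings_[id]; }
};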
Similarly, SSAS can often iterate over a subset of the data rather than the full dataset. A few mechanisms are in play:
By default, SSAS builds a hash index for each dimension/attribute value; it thus knows "right away" which pages on disk contain the relevant data for, say, Year=1997.
There's a caching architecture where relevant subsets of the data are stored in RAM separate from the whole dataset. For example, you might have cached a subcube that has only a few of your fields, and that only pertains to the data from 1997. If a query is asking only about 1997, then it will iterate only over that subcube, thereby speeding things up. (But note that a "subcube" is, to a first approximation, just a 2D table.)
If you've predefined aggregates, then these smaller subsets can also be precomputed at cube processing time, rather than merely computed/cached on demand.
SSAS fact table rows are fixed size, which presumably helps in some form. (In SQL, in contrast, you might have variable-width string columns.)
The caching architecture also means that, once an aggregation has been computed, it doesn't need to be refetched from disk and recomputed again and again.
These are some of the factors in play in SSAS anyway. I can't claim that there aren't other vital things as well.