Disk based indexing for multi dimensional data - algorithm

I want to use some kinda disk based indexing for multi dimensional data. I want to be able
to perform range searches - (10 - 20% of application usage)
faster retrieval - (80%)
data size ( in order of GBs) and record count in order of billions
To be more specific, I want to implement something like R-Tree, or X-Tree. But I thought it is a good idea to get started with B-Trees. Although all the databases offer very efficient
implementations of B-Tree, i want to be able to tune the design, add possible
application based heuristics to the design so I would prefer to implement something
of my own or to use some library as a starting point.
Any pointers to libraries, or suggestions would be very helpful. Thanks in advance

"Retrieval" - by what? Window queries? Radius queries? Nearest neighbor queries?
How many dimensions - if it's just 2D, even simple grid approaches may work very well.
Note that most quality SQL systems (pretty much everything except MySQL actually) have support for R-trees to some extend.

Related

How large can state be in an EventStore projection?

When creating a projection using the JavaScript API in eventstore, how large can the state object become? Is this limited to the amount of memory on the machine or is this saved to disk? I would think the later would be more impactful in terms of how large of a state you could hold.
In the ideal world projection should be as small as possible and really small.
If you need several bunches of data - use several projections. This is the right way to simple scaling (in the worst case - one node - one projection).
Also, I suggest deciding about data type you want to store. IMHO, projection in event-sourced system should be organized document-oriented - in this case, projection will be small.
If you want to store GBs of info, in any case, use so db as projection. In theory, it's ok, on practice you'll create another one abstraction (adapter) to work with different projection types. This concept you can investigate in resolvejs framework.

How couchdb 1.6 inherently take advantage of Map reduce when it is Single Server Database

I am new to couch db, while going through documentation of Couch DB1.6, i came to know that it is single server DB, so I was wondering how map reduce inherently take advantage of it.
If i need to scale this DB then do I need to put more RAID hardware, of will it work on commodity hardware like HDFS?
I came to know that couch db 2.0 planning to bring clustering feature, but could not get proper documentation on this.
Can you please help me understanding how exactly internally file get stored and accessed.
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce" as described by Google in this paper using that stylized term, and implemented in Hadoop also using that same styling implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer to https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions that a developer can provide for each index "map", and optionally "reduce", form very simple building blocks that are easy to reason about, at least after your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how you each document or set of documents "should be" found. You say, I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor == ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregate functions purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only a document at once, the total set of data can be bigger than fits in memory at once. Because the "reduce" function is limited in its size, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer of what the actual advantage and reasons were! I did not design or implement CouchDB, was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So then if you have designed an app to work in CouchDB 1.x it may then scale in the newer version without further intervention on your part.

Efficient Data Structure for Query Sync

I have a giant lists of query searches with cached image results for a few different servers and I want to sync the queries efficiently. I know that one way would be to do it in two steps. First comparing the queries, and second, only syncing non-identical results. Instead though I'd like it to be faster and more efficient by only exchanging a small fixed amount of data and then syncing non-identical results based on that data (it's fine if it happens to sync a small amount of identical results).
What kind of data structure for these queries would be recommended to accomplish this? I've been looking at https://en.wikipedia.org/wiki/List_of_data_structures to try to get a better idea, but I don't have a lot of experience in algorithms so I could really use some direction. I'm planning to do this in C++ if that needs to be taken into consideration. All suggestions appreciated, thanks.

How to efficiently search large datasets by location and date range?

I have a MongoDB collection containing attributes such as:
longitude, latitude, start_date, end_date, price
I have over 500 million documents.
My question is how to search by lat/long, date range and price as efficiently as possible?
As I see it my options are:
Create an Geo-spatial index on lat/long and use MongoDB's proximity search... and then filter this based on date range and price.
I have yet to test this but, am worrying that the amount of data would be too much to search this quickly, when we have around 1 search a second.
have you had experience with how MongoDB would react under these circumstances?
Split the data into multiple collections by location. i.e. by cities like london_collection, paris_collection, new_york_collection.
I would then have to query by lat/long first, find the nearest city collection and then do a MongoDB spatial search on that subset data in that collection with date and price filters.
I would have uneven distribution of documents as some cities would have more documents than others.
Create collections by dates instead of location. Same as above but each document is allocated a collection based on it's date range.
problem with searches that have a date range that straddles multiple collections.
Create unique ids based on city_start_date_end_date for each document.
Again I would have to use my lat/long query to find the nearest city append the date range to access the key. This seems to be pretty fast but I don't really like the city look up aspect... it seems a bit ugly.
I am in the process of experimenting with option 1.) but would really like to hear your ideas before I go too far down one particular path?
How do search engines split up and manage their data... this must be a similar kind of problem?
Also I do not have to use MongoDB, I'm open to other options?
Many thanks.
Indexing and data access performance is a deep and complex subject. A lot of factors can effect the most efficient solution including the size of your data sets, the read to write ratio, the relative performance of your IO and backing store, etc.
While I can't give you a concrete answer, I can suggest investigating using morton numbers as an efficient way of pulling multiple similar numeric values like lat longs.
Morton number
Why do you think option 1 would be too slow? Is this the result of a real world test or is this merely an assumption that it might eventually not work out?
MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.
Don't reinvent the wheel, but try to optimize the most obvious path (1) first:
Set up geo indexes
Use explain to make sure your queries actually use the index
Make sure your indexes fit into RAM
Profile the database using the built-in profiler
Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
If possible, try not to use geoNear if possible, and stick to the faster (but not perfectly spherical) near queries
If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.

Mixing column and row oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real time) analytics. We have a database model that is similiar to a star schema, few fact tables and many dimensional tables. The database is running with Mysql and MyIsam engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow if the dimension tables get joined on the fact tables and also aggretations are done. First thing that comes in mind when hearing this is, why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly it wasn't invented yet. So I came to the idea to combine 2 existing systems. Mixing a row oriented and a column oriented database (e.g. like infinidb or infobright). Keeping the mysql MyIsam solution (for fast inserts and row based queries) and add a column oriented database (for fast aggregation operations on few columns) to it and fill it periodically (nightly) via cronjob. Problem would be when the current data (it must be real time) is queried, therefore I maybe would need to get data from both databases which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is, is this a good idea? Has somebody maybe already done this? Maybe there is are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with high customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compresed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.

Resources