DB candidate as CouchDB/Schema replacement - performance

The idea is to redesign data structure and/or change DB.
I just started to review this project and plan to start optimization from this one.
Currently i have CouchDb with about 80GB of document data, around 30M records.
From that subset for the most of documents properties like id, group_id, location, type can be considered as generic, but unfortunately for now such are even stored with different property naming around the set. Also a lot of deeply nested can be found.
Structure isn't hardly defined, that's why NoSQL db was selected way before some picture was seen.
Data is calculated and populated in DB in a separate Job on powerful cluster. This isn't done too often. From that perspective i can conclude that general write/update performance isn't very important. Also size decrease would be great, but isn't most important. There are only like 1-10 active customers at a time.
Actually read performance with various filtering/grouping etc is most important.
But no heavy summary calculations should be done, this one is already done while population.
This one is a data analytical tool for displaying compare and other reports to quality engineers and data analyst, so they can browse the results, group them or filter from the Web UI.
Now such tasks like searching a subset of document properties for a text isn't possible due to performance.
For sure i've done some initial investigations(like http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf) and it looks Cassandra seems to be good choice among NoSql.
Also it's quite interesting trying to port this data into the new PostgreSQl.
Any ideas would be highly appreciated :-)

Hello please check the following articles:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL json(and jsonb!) capabilities allow to start schema-less, have transactions, indexes, grouping, aggregate functions with very good performance, just from the start. And when ready(and if needed), you can go for the schema, with internal data migration.
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck

Related

How to start on query tuning?

looking for advice on how to start with Oracle tuning, I've never done it before.
I'd like to know how to do performance tuning of queries. We build applications for clients, so I don't have production database here. I basically know which of our queries are important, and want to make sure they will run fast.
In MSSQL it is pretty straight-forward, when you enable execution plan, it shows additional indexes which will help the query. Is there anything similar to that in Oracle? Or perhaps another approach?
Performance tuning is a big subject. There are two basic types of approaches that people take: science or guesswork.
A. Guesswork (otherwise known as the shotgun approach)
Look at the query
Guess which columns need indexes that don't have them already *
Create the indexes and re-run the query
If it made no difference, go back to step 1 and repeat.
B. Science
Find out how fast the query "should" be.
Determine how fast the query actually is (under "realistic" conditions as far as possible).
If it's already fast enough, stop here. Otherwise:
Determine where most of the time is going - is it spinning or waiting?
If it's waiting, find out what it's waiting for, and eliminate that if possible.
If it's spinning, find out what it's doing that it doesn't need to do, and eliminate that if possible.
Go back to step 2 and repeat.
(* if there are no more indexes to create, look for something else to change - anything. e.g. change a few session or system settings, rebuild a table, etc.)
(*** in case it's not obvious, I'm not advocating approach A)
First thing I would suggest is if you are not going to do tuning on prod databases you should import the table statics from prod db to your Dev one. This will make your Dev system behave as it has similar data as prod. If you google for importing stats you will find loads of articles. The next stage would to look at queries where u can add indexes generally where you see full table scan is a good place to look.
Do you use Bind variables?
Oracle does a lot of caching data so see where you can use bind variables this will do filtering in the private space of server process and save the db from doing physical reads and hence data is returned quickly.
Hope this helps.

Best strategy for retrieving large dynamically-specified tables on an ASP.NET page

Looking for a bit of advice on how to optimise one of our projects. We have a ASP.NET/C# system that retrieves data from a SQL2008 data and presents it on a DevExpress ASPxGridView. The data that's retrieved can come from one of a number of databases - all of which are slightly different and are being added and removed regularly. The user is presented with a list of live "companies", and the data is retrieved from the corresponding database.
At the moment, data is being retrieved using a standard SqlDataSource and a dynamically-created SQL SELECT statement. There are a few JOINs in the statement, as well as optional WHERE constraints, again dynamically-created depending on the database and the user's permission level.
All of this works great (honest!), apart from performance. When it comes to some databases, there are several hundreds of thousands of rows, and retrieving and paging through the data is quite slow (the databases are already properly indexed). I've therefore been looking at ways of speeding the system up, and it seems to boil down to two choices: XPO or LINQ.
LINQ seems to be the popular choice, but I'm not sure how easy it will be to implement with a system that is so dynamic in nature - would I need to create "definitions" for each database that LINQ could access? I'm also a bit unsure about creating the LINQ queries dynamically too, although looking at a few examples that part at least seems doable.
XPO, on the other hand, seems to allow me to create a XPO Data Source on the fly. However, I can't find too much information on how to JOIN to other tables.
Can anyone offer any advice on which method - if any - is the best to try and retro-fit into this project? Or is the dynamic SQL model currently used fundamentally different from LINQ and XPO and best left alone?
Before you go and change the whole way that your app talks to the database, have you had a look at the following:
Run your code through a performance profiler (such as Redgate's performance profiler), the results are often surprising.
If you are constructing the SQL string on the fly, are you using .Net best practices such as String.Concat("str1", "str2") instead of "str1" + "str2". Remember, multiple small gains add up to big gains.
Have you thought about having a summary table or database that is periodically updated (say every 15 mins, you might need to run a service to update this data automatically.) so that you are only hitting one database. New connections to databases are quiet expensive.
Have you looked at the query plans for the SQL that you are running. Today, I moved a dynamically created SQL string to a sproc (only 1 param changed) and shaved 5-10 seconds off the running time (it was being called 100-10000 times depending on some conditions).
Just a warning if you do use LINQ. I have seen some developers who have decided to use LINQ write more inefficient code because they did not know what they are doing (pulling 36,000 records when they needed to check for 1 for example). This things are very easily overlooked.
Just something to get you started on and hopefully there is something there that you haven't thought of.
Cheers,
Stu
As far as I understand you are talking about so called server mode when all data manipulations are done on the DB server instead of them to the web server and processing them there. In this mode grid works very fast with data sources that can contain hundreds thousands records. If you want to use this mode, you should either create the corresponding LINQ classes or XPO classes. If you decide to use LINQ based server mode, the LINQServerModeDataSource provides the Selecting event which can be used to set a custom IQueryable and KeyExpression. I would suggest that you use LINQ in your application. I hope, this information will be helpful to you.
I guess there are two points where performance might be tweaked in this case. I'll assume that you're accessing the database directly rather than through some kind of secondary layer.
First, you don't say how you're displaying the data itself. If you're loading thousands of records into a grid, that will take time no matter how fast everything else is. Obviously the trick here is to show a subset of the data and allow the user to page, etc. If you're not doing this then that might be a good place to start.
Second, you say that the tables are properly indexed. If this is the case, and assuming that you're not loading 1,000 records into the page at once and retreiving only subsets at a time, then you should be OK.
But, if you're only doing an ExecuteQuery() against an SQL connection to get a dataset back I don't see how Linq or anything else will help you. I'd say that the problem is obviously on the DB side.
So to solve the problem with the database you need to profile the different SELECT statements you're running against it, examine the query plan and identify the places where things are slowing down. You might want to start by using the SQL Server Profiler, but if you have a good DBA, sometimes just looking at the query plan (which you can get from Management Studio) is usually enough.

Free data warehouse - Infobright, Hadoop/Hive or what?

I need to store large amount of small data objects (millions of rows per month). Once they're saved they wont change. I need to :
store them securely
use them to analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community - just a column-oriented, read-only storing mechanism for MySQL
On the other hand, people says that NoSQL approach could be better. Hadoop+Hive looks promissing, but the documentation looks poor and the version number is less than 1.0 .
I heard about Hypertable, Pentaho, MongoDB ....
Do you have any recommendations ?
(Yes, I found some topics here, but it was year or two ago)
Edit:
Other solutions : MonetDB, InfiniDB, LucidDB - what do you think?
Am having the same problem here and made researches; two types of storages for BI :
column oriented. Free and known : monetDB, LucidDb, Infobright. InfiniDB
Distributed : hTable, Cassandra (also column oriented theoretically)
Document oriented / MongoDb, CouchDB
The answer depends on what you really need :
If your millions of row are loaded at once (nighly batch or so), InfiniDB or other column oriented DB are the best; They have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
If the rows are added in real time.. then column oriented DB are bad. You can either choose two have two separate DB (that's my choice : one noSQL for real feeding of the stats by the front, and real time stats. The other DB column-oriented for BI). Or turn towards something that mixes column oriented (for out requests) and distribution (for writes) / like Cassandra.
Document oriented DBs are not suited for BI, they are more useful for CRM/CMS issues where you need frequent access to a particular row
As for the exact choice inside a category, I'm still undecided. Cassandra in distributed, and Monet or InfiniDB for CODB, are leaders. Monet is reported to have problem loading very big tables because it runs indexes in memory.
You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well, however you may want also to look at plain-old MySQL and the Archive table. Each record is compressed and stored and, IIRC, it's designed for your type of workload, however I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (E.g. NoSQL), yes, they can work as well and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any, so I don't know how well any of them work.
My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.

How to stop thinking "relationally"

At work, we recently started a project using CouchDB (a document-oriented database). I've been having a hard time un-learning all of my relational db knowledge.
I was wondering how some of you overcame this obstacle? How did you stop thinking relationally and start think documentally (I apologise for making up that word).
Any suggestions? Helpful hints?
Edit: If it makes any difference, we're using Ruby & CouchPotato to connect to the database.
Edit 2: SO was hassling me to accept an answer. I chose the one that helped me learn the most, I think. However, there's no real "correct" answer, I suppose.
I think, after perusing about on a couple of pages on this subject, it all depends upon the types of data you are dealing with.
RDBMSes represent a top-down approach, where you, the database designer, assert the structure of all data that will exist in the database. You define that a Person has a First,Last,Middle Name and a Home Address, etc. You can enforce this using a RDBMS. If you don't have a column for a Person's HomePlanet, tough luck wanna-be-Person that has a different HomePlanet than Earth; you'll have to add a column in at a later date or the data can't be stored in the RDBMS. Most programmers make assumptions like this in their apps anyway, so this isn't a dumb thing to assume and enforce. Defining things can be good. But if you need to log additional attributes in the future, you'll have to add them in. The relation model assumes that your data attributes won't change much.
"Cloud" type databases using something like MapReduce, in your case CouchDB, do not make the above assumption, and instead look at data from the bottom-up. Data is input in documents, which could have any number of varying attributes. It assumes that your data, by its very definition, is diverse in the types of attributes it could have. It says, "I just know that I have this document in database Person that has a HomePlanet attribute of "Eternium" and a FirstName of "Lord Nibbler" but no LastName." This model fits webpages: all webpages are a document, but the actual contents/tags/keys of the document vary soo widely that you can't fit them into the rigid structure that the DBMS pontificates from upon high. This is why Google thinks the MapReduce model roxors soxors, because Google's data set is so diverse it needs to build in for ambiguity from the get-go, and due to the massive data sets be able to utilize parallel processing (which MapReduce makes trivial). The document-database model assumes that your data's attributes may/will change a lot or be very diverse with "gaps" and lots of sparsely populated columns that one might find if the data was stored in a relational database. While you could use an RDBMS to store data like this, it would get ugly really fast.
To answer your question then: you can't think "relationally" at all when looking at a database that uses the MapReduce paradigm. Because, it doesn't actually have an enforced relation. It's a conceptual hump you'll just have to get over.
A good article I ran into that compares and contrasts the two databases pretty well is MapReduce: A Major Step Back, which argues that MapReduce paradigm databases are a technological step backwards, and are inferior to RDBMSes. I have to disagree with the thesis of the author and would submit that the database designer would simply have to select the right one for his/her situation.
It's all about the data. If you have data which makes most sense relationally, a document store may not be useful. A typical document based system is a search server, you have a huge data set and want to find a specific item/document, the document is static, or versioned.
In an archive type situation, the documents might literally be documents, that don't change and have very flexible structures. It doesn't make sense to store their meta data in a relational databases, since they are all very different so very few documents may share those tags. Document based systems don't store null values.
Non-relational/document-like data makes sense when denormalized. It doesn't change much or you don't care as much about consistency.
If your use case fits a relational model well then it's probably not worth squeezing it into a document model.
Here's a good article about non relational databases.
Another way of thinking about it is, a document is a row. Everything about a document is in that row and it is specific to that document. Rows are easy to split on, so scaling is easier.
In CouchDB, like Lotus Notes, you really shouldn't think about a Document as being analogous to a row.
Instead, a Document is a relation (table).
Each document has a number of rows--the field values:
ValueID(PK) Document ID(FK) Field Name Field Value
========================================================
92834756293 MyDocument First Name Richard
92834756294 MyDocument States Lived In TX
92834756295 MyDocument States Lived In KY
Each View is a cross-tab query that selects across a massive UNION ALL's of every Document.
So, it's still relational, but not in the most intuitive sense, and not in the sense that matters most: good data management practices.
Document-oriented databases do not reject the concept of relations, they just sometimes let applications dereference the links (CouchDB) or even have direct support for relations between documents (MongoDB). What's more important is that DODBs are schema-less. In table-based storages this property can be achieved with significant overhead (see answer by richardtallent), but here it's done more efficiently. What we really should learn when switching from a RDBMS to a DODB is to forget about tables and to start thinking about data. That's what sheepsimulator calls the "bottom-up" approach. It's an ever-evolving schema, not a predefined Procrustean bed. Of course this does not mean that schemata should be completely abandoned in any form. Your application must interpret the data, somehow constrain its form -- this can be done by organizing documents into collections, by making models with validation methods -- but this is now the application's job.
may be you should read this
http://books.couchdb.org/relax/getting-started
i myself just heard it and it is interesting but have no idea how to implemented that in the real world application ;)
One thing you can try is getting a copy of firefox and firebug, and playing with the map and reduce functions in javascript. they're actually quite cool and fun, and appear to be the basis of how to get things done in CouchDB
here's Joel's little article on the subject : http://www.joelonsoftware.com/items/2006/08/01.html

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or and aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. which pagename, useragent, ip, etc.) and a bunch of values (i.e. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT SUM(hits), SUM(bytes),
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (various reasons) and this meant a practical limitation on the size (in MiB) you could make these things.
I'm currently learning the ins and outs of Hadoop and the likes.
Running the above query as a mapreduce on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
Can you run a query like I showed above (or at least with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface and the user get's their answer ASAP) instead of batch mode?
If not; what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the likes?
It's even kind of been done (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A google search turned up a google code project "mroll" but it doesn't have anything except contact info (no code, nothing). Still, might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low latency OLAP in HBase by preagragating a SQL query and mapping it into appropriate Hbase qualifiers. For more detail visit below site.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir made an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery does automatic scale-out on the back-side that gives interactive response. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce but it is geared towards high-speed table scan like what you described.

Resources