Couchbase and NoSQL DBs for a serious web application? - performance

I have read some posts here, but I didn't find a real answer.
Normally I work with traditional SQL databases (MS SQL, MySQL) when developing applications (ERP, CRM, PPS, web shops, etc.). Real contact with or experience of document-oriented databases in an actual business setting was not possible.
Only privately (hobby and experimental projects) have I tested MongoDB and CouchDB. The experience was good, but not enough to say "Yes, let's use it for business!", because I could not test it in a real environment.
But now there is a chance to build something from scratch, which could be a big start for a business.
So my question:
Can I use Couchbase for a big business application that thousands of users would work with? Is it fast enough, and does it perform well enough, to handle thousands of queries, requests/responses, etc.?
What does backup and restore look like?
Where are the limits of Couchbase?
Thank you for the answer.

In short, yes.
Your questions are too broad to fully address here. Couchbase has many real-world installations with clients doing production work at large scale. You can see several references with write-ups of their uses on the Couchbase site. (Note this is not a complete list of customers, only the ones that have agreed to have their use highlighted.) You will definitely recognize some names.

Related

NoSQL for multi-site archival logging with full-text search

I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide full-text search of those logs at each site, fairly instantaneously, for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras:
The log data is (as usual) highly compressible. Any solution that compresses data in transit from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that Elasticsearch would probably be really good for this, though from what I understand it doesn't support multi-site.
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data-model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations on multi-site with some of the options that don't support multi-site by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster on multiple geographical sites (Couchbase calls these "datacenters") and then XDCR would automatically replicate (bidirectionally) the data between sites. If I understand your requirements correctly, this sounds like just what you need.
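For a concrete flavour of what that looks like, here is a rough sketch of creating a remote cluster reference and a continuous replication through Couchbase's admin REST API (port 8091). The endpoint paths, field names, bucket name, and credentials are assumptions from memory of older server releases, so verify them against the documentation for the version you deploy.

```python
# Hedged sketch: set up XDCR from site A to site B via the admin REST API.
# Endpoints and field names are assumptions; check your Couchbase release.
import requests

ADMIN = "http://site-a-node:8091"        # any node of the local (site A) cluster
AUTH = ("Administrator", "password")     # cluster admin credentials

# 1. Register cluster B as a remote cluster reference on cluster A.
requests.post(
    f"{ADMIN}/pools/default/remoteClusters",
    auth=AUTH,
    data={
        "name": "site-b",
        "hostname": "site-b-node:8091",
        "username": "Administrator",
        "password": "password",
    },
).raise_for_status()

# 2. Start a continuous replication of the 'logs' bucket to site B.
requests.post(
    f"{ADMIN}/controller/createReplication",
    auth=AUTH,
    data={
        "fromBucket": "logs",
        "toCluster": "site-b",
        "toBucket": "logs",
        "replicationType": "continuous",
    },
).raise_for_status()

# Run the same two calls on site B pointing back at site A to make the
# replication bidirectional.
```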

MonetDB - anyone using it in production?

I am very interested in using MonetDB as a datamart, holding some huge data tables for querying and reporting.
However, after some searching, I am unable to find any online posts/blogs regarding the use of MonetDB in any kind of production capacity.
Also, there seems to be little or next to no activity online regarding MonetDB.
Is this a bad sign for the future of MonetDB?
"I am very interested in using MonetDB as a datamart, holding some huge data tables for querying and reporting"
My boss is also interested in MonetDB and I had the same reaction as you. No one is writing about MonetDB... is no one using MonetDB?
Regardless, I have been running performance tests on datasets of 500,000 to 1,000,000 records comparing MonetDB (column-oriented DBMS) vs. MySQL (row-oriented DBMS), and MonetDB beats MySQL in all regards, even in bulk inserts, which hypothetically it should not be as good at.
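For reference, the shape of that comparison was roughly the following. This is a minimal sketch assuming pymonetdb and mysql-connector-python are installed; the connection details and the throwaway `bench` table are placeholders.

```python
# Hedged sketch: time the same bulk insert against MonetDB and MySQL.
import time
import pymonetdb
import mysql.connector

rows = [(i, f"name-{i}", i * 0.5) for i in range(500_000)]

def bulk_insert_seconds(conn):
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS bench")
    cur.execute("CREATE TABLE bench (id INT, name VARCHAR(32), val DOUBLE)")
    start = time.perf_counter()
    cur.executemany("INSERT INTO bench VALUES (%s, %s, %s)", rows)
    conn.commit()
    return time.perf_counter() - start

monet = pymonetdb.connect(username="monetdb", password="monetdb",
                          hostname="localhost", database="demo")
mysql_conn = mysql.connector.connect(user="root", password="secret",
                                     host="localhost", database="test")

print("MonetDB bulk insert:", bulk_insert_seconds(monet), "s")
print("MySQL bulk insert:  ", bulk_insert_seconds(mysql_conn), "s")
```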
I can't speculate as to what all this means for MonetDB's future, but while it's around you might want to check it out because it performs well.
(I run Windows 7 and am communicating with each database using PHP)
I'm reacting a bit late to this post, but I'd like to add my voice to those using MonetDB in a production environment. We use it as the back-end of Spinque, a framework for designing complex search solutions. I've been using MonetDB for about 10 years, but only in the past 3 years in a production environment. Clearly, it has pros and cons and bugs like all other products, but it is being developed and improved very actively (I don't understand the low-activity signs that you refer to). If you want a DB that allows you to be ahead of the market standards, it's a good choice. Otherwise, just go for MS SQL ;)
I've been evaluating it lately for a client so I've had some time with it. My impression at this point is that it is just finishing "growing up" from being an academic experimental playground. It clearly has yet to be really discovered, though it does have some rough edges which might hinder certain applications.
As I write, I'm in the process of trying to load over 100 million rows into an instance (at 27 million presently). So far, it performs startlingly well in some areas (aggregates), but is oddly sluggish in others (most joins I've tried so far); that said, I've not yet run the recommended sampling process, and I'm forcing it to live on just a single server with 32 GB RAM.
I've found a few little glitches and one thing that caused a full service crash (obscure and reported), but I'm thinking that for many applications MonetDB could be just the ticket. Columnar storage (rather than NoSQL) seems to be the future IMO.
I'll update this if I find anything particularly interesting.
MonetDB is first and foremost a research system, but it has progressed far beyond the level of the average research prototype. It is the only open-source relational column-store platform that I know of that supports full SQL. I have used it myself at CWI in many research projects that are not core DB research, but do need advanced DB technology.
You can see on the users' mailing list that deployments happen in many different organisations. As Roberto Cornacchia stated in a different answer, it is the backend of all Spinque deployments and we are happy MonetDB users. MonetDB is also used in a variety of non-profit projects like OpenStreetMap and OpenKvK.
More and more commercial parties deploy MonetDB for analytics. (They do not always like to advertise that their analyses depend on an open source system.) Recently, MonetDB Solutions has started to provide dedicated commercial support for these deployments.
We have been using MonetDB in our business. We analyse very large data sets with many millions of rows. Traditional methods of data warehousing on SQL databases had become too slow. The problem we were facing was that the data was only going to get bigger! The only way forward was to go columnar.
The results have been amazing. When you have very few joins it is staggeringly quick. Even with joins on the data sets we are looking at it is still frightening how fast it comes back.
Having seen some of the commercial partnerships, I think MonetDB is going to boom over the next few years. I believe some of the major BI suppliers are using MonetDB under the hood to handle the large data work.

SQLite for client-server

I've seen a couple of SQLite performance questions here on Stackoverflow, but the focus was on websites, and I'm considering using this DB in a client-server scenario:
I expect 1-10 clients for one server for now, could go up to 50 or more in the future.
slightly more reads than writes
the DB would sit behind a server process (i.e., no direct DB access over the network)
Would using SQLite make the app less responsive as opposed to using PostgreSQL? My intuition tells me that it should be ok for these loads, but maybe someone has some practical experience with this kind of scenario.
I did use SQLite for a major client/server product used with ~10 concurrent users and I deeply regret that decision. In my opinion - PostgreSQL is much more suitable for client/server scenarios than SQLite due to its fine locking granularity.
You simply can't get very far when the entire database is locked whenever someone needs to write something.
I like SQLite very much (I even wrote a commercial utility for comparing SQLite databases, SQLite Compare), but I don't think it fits the bill when you have client/server scenarios.
Even SQLite's author says that it should be used as a replacement for custom file formats and not as a full-blown database server. I wish I had taken his advice more seriously.
You didn't mention which operating system and Postgres versions you are using. However, before considering a change of database engine, try to do some logging and benchmarking of your current database under typical usage, then optimize the "heaviest" queries. And maybe your backend processing load makes DB query time irrelevant? As SQLite is a file-based DBMS, concurrent access from multiple processes will degrade performance as the number of clients grows. (edited after comment)
The following question may be helpful: How Scalable is SQLite?
I would second S.Lott's answer.
I don't know how SQLite performs in comparison to PostgreSQL, since I don't know of any recent measurements, but my own experience with SQLite in a rather similar environment is rather good.
The only thing that might cause trouble, in my view, is that you have rather many writes. But it all depends on the total number per second, I would say.
Also, your setup with a single server process is optimal for SQLite in my opinion -- that way you circumvent its weakness with concurrent writers.
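To make that concrete, here is a minimal sketch of the single-writer pattern: one server process owns the SQLite file and funnels every write through one worker thread, so the whole-database write lock is never contended. The class and schema are made up for illustration.

```python
# Hedged sketch: serialize all SQLite writes through a single worker thread.
import queue
import sqlite3
import threading

class SQLiteWriter:
    def __init__(self, path):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, args=(path,), daemon=True)
        self._thread.start()

    def submit(self, sql, params=()):
        """Queue a write; the worker applies queued writes in order."""
        self._queue.put((sql, params))

    def _run(self, path):
        conn = sqlite3.connect(path)   # the connection lives only in this thread
        while True:
            sql, params = self._queue.get()
            conn.execute(sql, params)
            conn.commit()

writer = SQLiteWriter("app.db")
writer.submit("CREATE TABLE IF NOT EXISTS events (ts REAL, msg TEXT)")
# Request-handling threads only ever enqueue writes; reads can use their
# own short-lived read-only connections.
writer.submit("INSERT INTO events VALUES (?, ?)", (1700000000.0, "hello"))
```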

Best scaling methodologies for a high-traffic web application?

We have a new project for a web app that will display banners ads on websites (as a network) and our estimate is for it to handle 20 to 40 billion impressions a month.
Our current language is ASP... but we are moving to PHP. Does PHP 5 have limits when scaling a web application? Or should I have our team invest in picking up JSP?
Or is it a matter of the app server and/or DB? We plan to use Oracle 10g as the database.
No offense, but I strongly suspect you're vastly overestimating how many impressions you'll serve.
That said:
PHP or other languages used in the application tier really have little to do with scalability. Since the application tier delegates its state to the database or equivalent, it's straightforward to add as much capacity as you need behind appropriate load balancing. Choice of language does influence per-server efficiency and hence costs, but that's different from scalability.
It's scaling the state/data storage that gets more complicated.
For your app, you have three basic jobs:
what ad do we show?
serving the ad
logging the impression
Each of these will require thought and likely different tools.
The second, serving the ad, is the simplest: use a CDN. If you actually serve the volume you claim, you should be able to negotiate favorable rates.
Deciding which ad to show is going to be very specific to your network. It may be as simple as reading a few rows from a database that give ad placements for a given property for a given calendar period. Or it may be complex contextual advertising like Google's. Assuming it's more the former, and that the database of placements is small, this is the simple task of scaling database reads. You can use replication trees or, alternatively, a caching layer like memcached.
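As an illustration of that caching layer, here is a hedged cache-aside sketch for the placement lookup; the pymemcache client, the key scheme, the SQLite stand-in for the placements database, and the query are all assumptions, not a prescribed design.

```python
# Hedged sketch: serve ad placements from memcached, falling back to the DB.
import json
import sqlite3  # stand-in for whatever relational store holds placements
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = sqlite3.connect("placements.db")

def placements_for(property_id, ttl=60):
    key = f"placements:{property_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no database read
    rows = db.execute(
        "SELECT ad_id, weight FROM placements WHERE property_id = ?",
        (property_id,),
    ).fetchall()
    cache.set(key, json.dumps(rows), expire=ttl)
    return rows
```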
The last will ultimately be the most difficult: how to scale the writes. A common approach would be to still use databases, but to adopt a sharding strategy for scaling. More exotic options might be to use a key/value store supporting counter operations, such as Redis, or a scalable OLAP database such as Vertica.
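For the counter route specifically, a minimal sketch of how that might look with redis-py follows; the key layout and the drain job are illustrative assumptions.

```python
# Hedged sketch: one atomic INCR per impression, drained later into analytics.
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_impression(ad_id):
    minute = int(time.time()) // 60
    key = f"impressions:{ad_id}:{minute}"
    r.incr(key)              # O(1) atomic increment, no relational write path
    r.expire(key, 7200)      # keep two hours of buckets for the drain job

def drain(ad_id, minute):
    """Read-and-reset one per-minute counter so it can be loaded into the warehouse."""
    count = r.getset(f"impressions:{ad_id}:{minute}", 0)
    return int(count or 0)
```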
All of the above assumes that you're able to secure data center space and network provisioning capable of serving this load, which is not trivial at the numbers you're talking.
You do realize that 40 billion per month is roughly 15,500 per second, right?
Scaling isn't going to be your problem - infrastructure period is going to be your problem. No matter what technology stack you choose, you are going to need an enormous amount of hardware - as others have said in the form of a farm or cloud.
This question (and the entire subject) is a bit subjective. You can write a dog slow program in any language, and host it on anything.
I think your best bet is to see how your current implementation works under load. Maybe just a few tweaks will make things work for you - but changing your underlying framework seems a bit much.
That being said - your infrastructure team will also have to be involved as it seems you have some serious load requirements.
Good luck!
I think it is not a matter of language; it can be a matter of database speed as much as of CPU processing speed. Have you considered a web farm? That way you can have more than one machine serving your application. There are several ways to implement this solution. You can start with two servers and add more servers as the app requires more processing volume.
On another point, Oracle 10g is a very good database server; in my humble opinion you only need a standalone Oracle server to handle the volume of requests. Remember that a SQL server is faster when people request more or less the same things each time, and that is what happens in web applications if you plan your database schema carefully.
You should also check the existing ad-server application solutions; there are some very good ones. Just try googling "open source ad servers".
PHP will be capable of serving your needs. However, as others have said, your first limits will be your network infrastructure.
But your second limit will be writing scalable code. You will need good abstraction and isolation so that resources can easily be added at any level. Things like a fast data-object mapper, multiple data-caching mechanisms, separate configuration files, and so on.

Would SQLite be a 'better' choice for Joomla than MySQL, if it were available?

Since this doesn't touch a real problem of mine, I'm somewhat uncertain whether it is even worth asking here. However, maybe some of you would like to share your opinion on it.
In general I have to admit that 'better' can mean anything and nothing at all at the same time. So I probably should be more specific, but I tried not to overload the topic. In a regular hosted environment on one of those cheap webhosters (like Dreamhost), with around 1000 articles in Joomla, a couple of users and a few hundred visitors a day, would a SQLite database with a persistent connection (sqlite_popen) perform noticeably faster than the MySQL equivalent (with the TCP/IP overhead etc.)?
Or in short: Would it be wise to call for Joomla to support SQLite?
I have never used SQLite on a website, but I have used it extensively for other purposes and I quite like it. The truth is, you won't know till you try. If you try, I recommend creating a DB abstraction layer first so that you can easily swap in other DBs.
The downside to SQLite is that it's not really meant to be a multi-user database. If you rarely write to the DB but do lots of reading, SQLite will probably be fine. If you find that you need multiple processes writing to the same DB, I believe SQLite uses file-level locking to maintain database consistency. So, if all your tables are in the same file, you'll lock the whole file while it's being written to, even if another process wants to modify a completely different table.
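If you want to see that behaviour directly, here is a small hedged demonstration (assuming the default rollback-journal mode): while one connection holds the write lock, a second connection cannot write even to a different table in the same file.

```python
# Hedged demo: SQLite's write lock covers the whole file, not one table.
import sqlite3

path = "joomla_demo.db"
setup = sqlite3.connect(path)
setup.execute("CREATE TABLE IF NOT EXISTS articles (id INTEGER, title TEXT)")
setup.execute("CREATE TABLE IF NOT EXISTS sessions (id INTEGER, user TEXT)")
setup.commit()
setup.close()

writer = sqlite3.connect(path, isolation_level=None)
writer.execute("BEGIN IMMEDIATE")                      # take the write lock
writer.execute("INSERT INTO articles VALUES (1, 'hello')")

other = sqlite3.connect(path, timeout=1)               # give up after 1 second
try:
    other.execute("INSERT INTO sessions VALUES (1, 'alice')")
except sqlite3.OperationalError as exc:
    print("second writer blocked:", exc)               # -> "database is locked"

writer.execute("COMMIT")
```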
In my opinion it's not the big multi-user databases of the world that should be worried about competition from SQLite... It's all the regular files out there (and their custom file formats) that applications create and use that should be shaking in their boots about SQLite...
Linux ISPs for whatever reason seem to have settled on MySQL. This is what they offer, and you will lock yourself into a limited number of service providers if you wander outside the norm.
