I am very interested in using monetdb as a datamart, holding some huge data tables for querying and reporting
However, after some searching, I am unable to find any online posts / blogs regarding their use of Monetdb in any kind of production capacity.
Also, there seems to be little or next to no activity online regarding Monetdb.
Is this a bad sign for the future of Monetdb ?
I am very interested in using monetdb as a datamart, holding some huge data tables for querying and reporting
My boss is also interested in MonetDB and I had the same reaction as you. No one is writing about MonetDB... is no one using MonetDB?
Regardless, I have been running performance tests on datasets of 500,000 to 1,000,000 records comparing MonetDB (a column-oriented DBMS) with MySQL (a row-oriented DBMS), and MonetDB beats MySQL in every respect, even in bulk inserts, where in theory a column store should be at a disadvantage.
I can't speculate as to what all this means for MonetDB's future, but while it's around you might want to check it out because it performs well.
(I run Windows 7 and am communicating with each database using PHP)
I'm a bit late to this post, but I'd like to add my voice to the ones using MonetDB in a production environment. We use it as the back-end of Spinque, a framework for designing complex search solutions. I've been using MonetDB for about 10 years, but only in the past 3 years in a production environment. Clearly, it has pros, cons and bugs like all other products, but it is being developed and improved very actively (I don't understand the low-activity signs that you refer to). If you want a DB that allows you to be ahead of the market standards, it's a good choice. Otherwise, just go for MS SQL ;)
I've been evaluating it lately for a client so I've had some time with it. My impression at this point is that it is just finishing "growing up" from being an academic experimental playground. It clearly has yet to be really discovered, though it does have some rough edges which might hinder certain applications.
As I write, I'm in the process of trying to load over 100 million rows into an instance (at 27 million presently). So far, it performs startlingly well in some areas (aggregates), but is oddly sluggish in others (most joins I've tried so far); that said, I've not yet run the recommended sampling process and I'm forcing it to live in just a single service with 32GB RAM.
I've found a few little glitches and one thing that caused a full service crash (obscure and reported), but I'm thinking that for many applications MonetDB could be just the ticket. Columnar storage (rather than NoSQL) seems to be the future IMO.
I'll update this if I find anything particularly interesting.
MonetDB is first and foremost a research system, but it has progressed far beyond the level of the average research prototype. It is the only open-source relational column-store platform I know of that supports full SQL. I have used it myself at CWI in many research projects that are not core DB research, but do need advanced DB technology.
You can see on the users' mailing list that deployments happen in many different organisations. As Roberto Cornacchia stated in a different answer, it is the backend of all Spinque deployments and we are happy MonetDB users. MonetDB is also used in a variety of non-profit projects like OpenStreetMap and Open KvK.
More and more commercial parties deploy MonetDB for analytics. (They do not always like to advertise that their analyses depend on an open source system.) Recently, MonetDB Solutions has started to provide dedicated commercial support for these deployments.
We have been using MonetDB in our business. We analyse very large data sets with many millions of rows. Traditional data warehousing on SQL databases became too slow, and the data was only going to get bigger! The only way forward was to go columnar.
The results have been amazing. When you have very few joins it is staggeringly quick. Even with joins on the data sets we are looking at it is still frightening how fast it comes back.
Having seen some of the commercial partnerships I think MonetDB is going to boom over the next few years. I believe some of the major BI suppliers are using Monet under their hood to perform the large data work.
We have a new project for a web app that will display banner ads on websites (as a network) and our estimate is for it to handle 20 to 40 billion impressions a month.
Our current code is in ASP, but we are moving to PHP. Does PHP 5 have limits when it comes to scaling web applications? Or should I have our team invest in picking up JSP?
Or, is it a matter of the app server and/or DB? We plan to use Oracle 10g as the database.
No offense, but I strongly suspect you're vastly overestimating how many impressions you'll serve.
That said:
PHP or other languages used in the application tier really have little to do with scalability. Since the application tier delegates its state to the database or equivalent, it's straightforward to add as much capacity as you need behind appropriate load balancing. Choice of language does influence per-server efficiency and hence costs, but that's different from scalability.
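To make the "just add capacity" point concrete, here is a minimal sketch of round-robin load balancing over stateless app servers. The class and the server names are made up for illustration; a real deployment would use a hardware or software load balancer rather than anything like this.

```python
class RoundRobinBalancer:
    """Toy load balancer: hands requests to app servers in turn."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def add_server(self, name):
        # App servers hold no state of their own, so adding capacity
        # is nothing more than extending the pool.
        self.servers.append(name)

    def pick(self):
        # Choose the next server in rotation.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


balancer = RoundRobinBalancer(["app1", "app2"])  # hypothetical server names
first_four = [balancer.pick() for _ in range(4)]
balancer.add_server("app3")  # scale out without touching application code
print(first_four)
```

Nothing in the request path depends on which server was picked, which is exactly why the language used on those servers matters so little for scalability.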
It's scaling the state/data storage that gets more complicated.
For your app, you have three basic jobs:
what ad do we show?
serving the ad
logging the impression
Each of these will require thought and likely different tools.
The second, serving the ad, is the simplest: use a CDN. If you actually serve the volume you claim, you should be able to negotiate favorable rates.
Deciding which ad to show is going to be very specific to your network. It may be as simple as reading a few rows from a database that give ad placements for a given property for a given calendar period. Or it may be complex contextual advertising like Google's. Assuming it's more the former, and that the database of placements is small, then this is the simple task of scaling database reads. You can use replication trees or alternatively a caching layer like memcached.
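As a rough illustration of the caching-layer idea, here is a read-through cache sketch. A plain dict stands in for memcached and another dict stands in for the placements database; the names `ReadThroughCache`, `PLACEMENTS` and `load_placements` are all hypothetical.

```python
import time


class ReadThroughCache:
    """Serve reads from memory and fall back to a loader on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.misses = 0     # number of times the loader had to be called

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh cached value, no database read
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical placements table standing in for the real database.
PLACEMENTS = {"news-site": ["ad-17", "ad-42"], "blog": ["ad-3"]}


def load_placements(site):
    return PLACEMENTS.get(site, [])


cache = ReadThroughCache(load_placements, ttl_seconds=30)
print(cache.get("news-site"), cache.get("news-site"), cache.misses)
```

The second lookup never reaches the loader, so the database only sees one read per key per TTL window, however many ad requests arrive in between.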
The last will ultimately be the most difficult: how to scale the writes. A common approach would be to still use databases, but to adopt a sharding scaling strategy. More exotic options might be to use a key/value store supporting counter instructions, such as Redis, or a scalable OLAP database such as Vertica.
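A minimal sketch of the sharding idea for impression counts follows. It assumes you shard by ad id using a stable hash, and it uses in-memory counters as stand-ins for separate database servers; `ShardedCounters` and the ad ids are invented for the example.

```python
import hashlib
from collections import Counter


class ShardedCounters:
    """Spread impression counters across several independent shards."""

    def __init__(self, shard_count):
        # Each Counter stands in for one database server.
        self.shards = [Counter() for _ in range(shard_count)]

    def shard_for(self, key):
        # A stable hash (not Python's per-process hash()) so that every
        # app server maps the same key to the same shard.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def record_impression(self, ad_id, count=1):
        self.shards[self.shard_for(ad_id)][ad_id] += count

    def total(self, ad_id):
        return self.shards[self.shard_for(ad_id)][ad_id]


counters = ShardedCounters(shard_count=4)
for ad_id in ["ad-17", "ad-42", "ad-17"]:  # hypothetical ad ids
    counters.record_impression(ad_id)
print(counters.total("ad-17"), counters.total("ad-42"))
```

Sharding by ad id keeps each ad's count on one shard, which makes reads easy, but a single very popular ad would still hammer one shard. Sharding by a different key (site, or a random bucket per ad) is a reasonable alternative if that becomes a problem.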
All of the above assumes that you're able to secure data center space and network provisioning capable of serving this load, which is not trivial at the numbers you're talking.
You do realize that 40 billion per month is roughly 15,500 per second, right?
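For anyone who wants to check the arithmetic, here is a quick sketch. It assumes a 30-day month and perfectly even traffic; both are simplifications, and real peaks will sit well above the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(impressions_per_month, days_in_month=30):
    """Average impressions per second if traffic were spread evenly."""
    return impressions_per_month / (days_in_month * SECONDS_PER_DAY)


low = average_per_second(20e9)   # lower estimate from the question
high = average_per_second(40e9)  # upper estimate from the question
print(f"20 billion/month is about {low:,.0f} per second")
print(f"40 billion/month is about {high:,.0f} per second")
```

With these assumptions the upper estimate comes out at about 15,432 per second on average, in line with the rough figure above.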
Scaling isn't going to be your problem - infrastructure, period, is going to be your problem. No matter what technology stack you choose, you are going to need an enormous amount of hardware - as others have said, in the form of a farm or cloud.
This question (and the entire subject) is a bit subjective. You can write a dog slow program in any language, and host it on anything.
I think your best bet is to see how your current implementation works under load. Maybe just a few tweaks will make things work for you - but changing your underlying framework seems a bit much.
That being said - your infrastructure team will also have to be involved as it seems you have some serious load requirements.
Good luck!
I think it is not a matter of language, but it can be a matter of database speed and CPU processing speed. Have you considered a web farm? That way you can have more than one machine serving your application. There are several ways to implement this. You can start with two servers and add more as the app demands more processing capacity.
On another point, Oracle 10g is a very good database server; in my humble opinion a single stand-alone Oracle server is all you need to handle that volume of requests. Remember that a SQL server gets faster when people request more or less the same things each time, which is what happens in a web application if you plan your database schema carefully.
You should also check out the existing ad server applications; there are some very good ones. Just try Google with "Open Source AD servers".
PHP will be capable of serving your needs. However, as others have said, your first limits will be your network infrastructure.
But your second limits will be writing scalable code. You will need good abstraction and isolation so that resources can easily be added at any level. Things like a fast data-object mapper, multiple data caching mechanisms, separate configuration files, and so on.
Our company is considering moving from hosting our own servers to EC2 and I was wondering if this was a good idea.
I have seen a lot of stuff about can cloud computing (and specifically EC2) do x, or can it do y, but my real question is why would you NOT want to use it?
If you were setting up a business, what are the reasons (outside of cost) that you would choose to go through the trouble of managing your own servers?
I know there are a lot of cost calculations you can put in regarding bandwidth, disk usage etc, but there are of course, other costs regarding maintenance of your own server. For the sake of this discussion I am willing to consider the costs roughly equal.
I seem to remember that Joel Spolsky wrote a little blurb on this at one time, but I was unable to find it.
Anyone have any reasons?
Thanks!
I can think of several reasons not to use EC2 (and I am talking about EC2, not grid computing in general):
Reliability: Amazon makes no guarantee as to the availability / down time / safety of EC2
Security: Amazon makes no guarantee about whom it will disclose your data to
Persistence: ensuring persistence of your data (including the effort to set up the system) is complicated on EC2
Management: there are very few integrated management tools for a cloud deployed on EC2
Network: the virtual network that allows EC2 instances to communicate has some quite painful limitations (latency, no multicast, arbitrary topological location)
And to finish that:
Cost: in the long run, if you are not using EC2 to absorb peak traffic, it is going to be much more costly than investing in your own servers (cheapo servers like Supermicro cost just a couple of hundred bucks...)
On the other side, I still think EC2 is a great way to soak up non-sensitive peak traffic, if your architecture allows it.
Some questions to ask:
What is the expected uptime, and how does downtime affect your business? What sort of service level agreement can you get, what are the penalties for missing it, and how confident are you that the SLA uptime goals will be met? (They may be better or worse at keeping the systems up than you are.)
How sensitive is the data you're proposing to put into the cloud? Again, we get into the questions of how secure the provider promises to be, what the contractual penalties and indemnities are, and how confident you are that the provider will live up to the agreement. Further, there may be external requirements. If you deal with health-related data in the US, you are subject to very strict requirements. If you deal with credit card data, you also have responsibilities (contractual, not legal).
How easy will it be to back out of the arrangement, should service not be what was expected, or if you find a better deal elsewhere? This includes not only getting your data back, but also some version of the applications you've been using. Consider the possibilities of your provider going bankrupt (Amazon isn't going to go bankrupt any time soon, but they could split off a cloud provider which could then go bankrupt), or having an internal reorganization. Bear in mind that a company in serious trouble may not be able to live up to your expectations of service.
How much independence are you going to have? Are you going to be running their software or software you pick? How easy will it be to reconfigure?
What is the pricing scheme? Is it possible for the bills to hit unacceptable levels without adequate warning?
What is the disaster plan? Ideally, it's running your software on servers in a different location from where the disaster hit.
What does your legal department (or retained corporate attorney) think of the contract? Is there a dispute resolution mechanism, and, if so, is it fair to you?
Finally, what do you expect to get out of moving to the cloud? What are you willing to pay? What can you compromise on, and what do you need?
Highly sensitive data might be better kept under your own control. And there's legislation; some privacy-sensitive information, for example, might not be allowed to leave the country.
Also, except for Microsoft Azure in combination with SDS, the data stores tend to be not relational, which is a nuisance in certain cases.
Maybe the concern is that a big company is more likely than a small provider somewhere to be approached by an Agent Smith from the government who wants to spy on everyone.
Big company - more customers - more data to aggregate and recognize patterns - more resources to organize a sophisticated watch system.
Maybe it's more of a fantasy but who ever knows?
Just because you're not paranoid doesn't mean you're not being watched.
The big one is: if Amazon goes down, there's nothing you can do to bring it back up.
I'm not talking about doomsday scenarios where the company disappears. I mean that you're at the mercy of their downtime, with little recourse of your own.
Security -- you don't know what is being done to your data
Dependency -- your business is now directly intertwined with the provider
There are different kinds of cloud computing, with lots of different vendors providing it. It would make me nervous to code my apps to work with a single cloud vendor's platform. With Amazon and Microsoft I believe you need to code specifically for their platform, and maybe with Google too.
That said, I recently jettisoned my own dedicated servers and moved to Rackspace's Mosso cloud platform (which requires no proprietary coding) and I am really, really pleased with it so far. It cut my costs in half, and performance is way better than before. My SQL Server databases are now running on 64-bit Enterprise editions of SQL Server with 32 GB of RAM, which would have cost me a fortune on my previous provider's infrastructure.
As for being out of luck when the cloud is down, that was just as true with my dedicated server. It never went down, but if there had been a hardware crash, I am not sure it would have been back online any quicker than Rackspace could bring their cloud back up.
Lack of control.
Putting your software on someone else's cloud means handing over some control. They might institute a file upload size limit or memory limits, which could ruin your application. A security vulnerability in their control panel could get your site hacked.
Security issues are not relevant if your application does its own encryption. Amazon is then storing encrypted data that they have no way of decrypting.
But in addition to the uptime issues, Amazon could decide to increase their prices to whatever they want. If you're dependent on them, you'll just have to pay it.
Depends how much you trust your own infrastructure in comparison to a 3rd party cloud service. In my opinion, most businesses (at least those that are not IT related) should choose the latter.
Another thing you lose with the cloud is the ability to choose exactly what operating system you want to run. For example, the latest Fedora Linux kernel available on EC2 is FC8, and the latest Windows version is Server 2003.
Besides the issues raised regarding dependability, reliability, and cost is the issue of data ownership. When you locate data on someone else's server, you no longer control who views, accesses, modifies, or uses that data. While the cloud operators can limit your access, you possess no way of limiting theirs or limiting who they give access to. Yes, you can encrypt all the data on the server but you lack any way of knowing who possesses root access to the server itself and any means to stop others from downloading your encrypted data and cracking it open. You lose control over your data; depending on what type of apps you are running and the proprietary nature of the data involved, this could engender corporate security and/or liability risks.
The other factor to consider is what would happen to your company if Amazon and/or EC2 were to suddenly vanish overnight. While a seemingly preposterous position, it could happen. Would you be able to quickly fill the hole and restore service, or would your potentially revenue generating apps languish while the IT staff scramble to obtain servers and bandwidth to get them back online? Also, what would happen to your data? The cloud hard drive holding all your information still exists, somewhere, and could pose a potential liability risk depending on the information you stored there--items such as personal information, business transaction records etc.
If I were starting my own business now, I would go through the hassle of purchasing and maintaining my own servers so I retained data ownership. I could control root access to the hardware, as well as control who can access and modify the data.
Unanswered security questions.
Really, do you want your IP out there, where you're not the one in control of it?
Most cloud computing environments are at least partially vendor specific. There's no good way to move stuff from one cloud to another without having to do a lot of rewriting. That sort of lock-in puts you at the mercy of one vendor when it comes to downtime, price increases, etc. If you rent or own your own servers, hosting providers and colos are pretty much interchangeable. You always have the option of moving somewhere else.
This may change in the future, as these things become standardized, but for now tying yourself to the cloud means tying yourself to a specific vendor.
This is kind of like the "Why would you use Linux" comment I received from management many years ago. The response I got was that it is a solution in search of a problem.
So what are your goals and objectives in moving to EC2?
I'd be interested to know if you'd still want to move to a cloud, if it was your own.
Cloud computing has brought parallel programming a little closer to the masses, but you still have to understand how best to use it - otherwise you're going to waste compute cycles and bandwidth.
Re-architecting your application for most efficient use of a cloud computing service is non-trivial.
Besides what has already been said here, we have to consider uniformity across the business. Are all of your applications going to be hosted in the cloud, or only most? Is most enough to pull the trigger on using the cloud when you still have to have personnel to handle a few special servers?
In particular, there might be special hardware that you need to communicate with, such as modems to accept incoming data, or voice cards that make automated phone calls. I don't know how such things could be handled in a cloud environment.
What makes a site good for high traffic?
Does it have more to do with the hardware/infrastructure, or with how one writes the software, using Java as the example, if it matters?
I'm wondering how the software changes just because it is expected that billions of users will be on the site, if at all.
My understanding up to this point is that the code doesn't change, but that it is deployed on multiple servers, in a cluster, and a load balancer distributes the load, so really, on any one server/deployment, the application is just as any other standard application/website.
I highly recommend reading Jeff Atwood's blog post on micro-optimization. In previous posts he talks about how this site was created and the hardware upgrades it has had (in short: better hardware helps only to the extent that it is faster), but the real speed of a site comes from good programming, and that article should answer many of your questions about programming a site.
Hardware is cheap. Programming is expensive.
There are some programming techniques to make sure your code can handle multiple simultaneous views/updates. If you're using an existing framework, much of that work is (hopefully) done for you, but otherwise you're going to find stuff that worked for a few hundred hits an hour on one server isn't going to work when you're getting hundreds of thousands of hits and you have to deploy multiple load balancing machines.
Well, it is primarily an issue of hardware scaling but there are a few things to keep in mind with respect to the software involved in scaling. For example, if you are on a server farm, you'll need to work with a session management server (either via SQL Server or via a state server - which has implications in that your session variables need to be serializable).
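Here is a small sketch of why the serializable requirement shows up once sessions live outside the web server process. An in-memory dict stands in for the external state server, JSON is used as the wire format purely for illustration, and `SharedSessionStore` is a made-up name.

```python
import json


class SharedSessionStore:
    """In-memory stand-in for an external session/state server."""

    def __init__(self):
        self._blobs = {}  # session_id -> serialized session data

    def save(self, session_id, data):
        # Data has to cross a process boundary, so it must serialize.
        # json.dumps raises TypeError for objects it cannot encode.
        self._blobs[session_id] = json.dumps(data)

    def load(self, session_id):
        blob = self._blobs.get(session_id)
        return {} if blob is None else json.loads(blob)


store = SharedSessionStore()
store.save("abc123", {"user_id": 7, "cart": [1, 2]})  # hypothetical session
print(store.load("abc123"))
```

Any server in the farm can now pick up the session, but anything you put into it (an open connection, a file handle, an arbitrary object) has to be reducible to plain data first.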
But, in the bigger picture, there are a variety of things that you would want to do to scale to an enterprise level. For example, it becomes particularly important that you abstract out your database calls to a DAL because you may well need to adopt the use of a middleware package for high volume environments.
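A minimal sketch of what abstracting database calls behind a DAL can look like. The interface, the in-memory implementation and the `greeting` function are all hypothetical; the point is only that application code depends on the interface, so the storage behind it can be swapped later.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class UserRepository(ABC):
    """The only way application code is allowed to reach user data."""

    @abstractmethod
    def find_name(self, user_id: int) -> Optional[str]:
        ...

    @abstractmethod
    def save_name(self, user_id: int, name: str) -> None:
        ...


class InMemoryUserRepository(UserRepository):
    """Stand-in backend; a SQL or middleware-backed class would replace it."""

    def __init__(self) -> None:
        self._names: Dict[int, str] = {}

    def find_name(self, user_id: int) -> Optional[str]:
        return self._names.get(user_id)

    def save_name(self, user_id: int, name: str) -> None:
        self._names[user_id] = name


def greeting(repo: UserRepository, user_id: int) -> str:
    # Application logic knows nothing about how or where data is stored.
    name = repo.find_name(user_id)
    return f"Hello, {name}!" if name else "Hello, guest!"


repo = InMemoryUserRepository()
repo.save_name(1, "Ada")
print(greeting(repo, 1), greeting(repo, 2))
```

Moving to a middleware package or a different database then means writing one new `UserRepository` implementation rather than hunting down queries scattered through the pages.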
I have been doing some catching up lately by reading about cloud hosting.
For a client that has about the same characteristics as StackOverflow (Windows stack, same amount of visitors), I need to set up a hosting environment. Stackoverflow went from renting to buying.
The question is why didn't they choose cloud hosting?
Since Stackoverflow doesn't use any weird stuff that needs to run on a dedicated server and supposedly cloud hosting is 'the' solution, why not use it?
By getting answers to this question I hope to be able to make a weighted decision myself.
I honestly do not know why SO runs like it does, on privately owned servers.
However, I can assume why a website would prefer this:
Maintainability - when things DO go wrong, you want to be hands-on on the problem, and solve it as quickly as possible, without needing to count on some third-party. Of course the downside is that you need to be available 24/7 to handle these problems.
Scalability - Cloud hosting (or any external hosting, for that matter) is very convenient for a small to medium-sized site. And most of the hosting providers today do give you the option to start small (shared hosting for example) and grow to private servers/VPS/etc... But if you truly believe you will need that extra growth space, you might want to count only on your own infrastructure.
Full Control - with your own servers, you are never bound to any restrictions or limitations a hosting service might impose on you. Run whatever you want, hog your CPU or your RAM, whatever. It's your server. Many hosting providers do not give you this freedom (unless you pay up, of course :) )
Again, this is a cost-effectiveness issue, and each business will handle it differently.
I think this might be a big reason why:
Cloud databases are typically more limited in functionality than their local counterparts. App Engine returns up to 1000 results. SimpleDB times out within 5 seconds. Joining records from two tables in a single query breaks databases optimized for scale. App Engine offers specialized storage and query types such as geographical coordinates.
The database layer of a cloud instance can be abstracted as a separate best-of-breed layer within a cloud stack but developers are most likely to use the local solution for both its speed and simplicity.
From Niall Kennedy
Obviously I cannot say for StackOverflow, but I have a few clients that went the "cloud hosting" route. All of which are now frantically trying to get off of the cloud.
In a lot of cases, it just isn't 100% there yet. Limitations in user tracking (passing of the requestor's IP address), fluctuating performance due to other load on the cloud, and unknown usage numbers are just a few of the issues that have come up.
From what I've seen (and this is just based on reading various blogged stories) most of the time the dollar-costs of cloud hosting just don't work out, especially given a little bit of planning or analysis. It's only really valuable for somebody who expects highly fluctuating traffic which defies prediction, or seasonal bursts. I guess in its infancy it's just not quite competitive enough.
IIRC Jeff and Joel said (in one of the podcasts) that they did actually run the numbers and it didn't work out cloud-favouring.
I think Jeff said in one of the Podcasts that he wanted to learn a lot of things about hosting, and generally has fun doing it. Some headaches aside (see the SO blog), I think it's a great learning experience.
Cloud computing definitely has its advantages as many of the other answers have noted, but sometimes you just want to be able to control every bit of your server.
I looked into it once for quite a small site. Running a small Amazon instance for a year would cost around £700 + bandwidth costs + S3 storage costs. VPS hosting with similar specs and a decent bandwidth allowance chucked in is around £500. So I think cost has a lot to do with it unless you are going to have fluctuating traffic and lots of it!
I'm sure someone from SO will answer this, but isn't it just more hassle? Old-school hosting is still cheap, and unless you've got big scalability problems, why would you do cloud hosting?