I'm just curious, there's a few services like MongoLab where data is hosted on remote servers. Anyone who's worked with databases knows that there's a certain amount of network latency, even when all servers are internal. Is a remote data storage service such as MongoLab a good idea for production environments?
This question is mainly for AJAX based web apps or websites in general.
I've found MongoLab to be pretty good. Obviously, you need to think about round-trips in general, and optimising those will minimise your overall latency.
It also makes sense to put yourself into the same data-center as MongoLab (you can choose where). They also have a (beta) service on Azure now.
I've been running services with high-latency (three different geographical regions for browser, web servers and Mongo and it still performs adequately in my case because my interactions are not "chatty".
As you probably know, one of the design constraints with Mongo is a lack of joins, so my data structures have naturally lent themselves to simple Q&A fetching of data. I don't read one collection and then use that information to go look in another (manual joins). As a result, I'm not adding up latency costs with those complex interactions. The worst case is generally a single request/response (or a series of parallel, single request/response queries) so it's the difference of about 200ms total which is acceptable.
But of course, the closer you can get your web servers to your DB the better you'll be.
Presumably, if you're spending enough money, MongoLab et al could roll you a custom configuration, possibly where you can have local secondaries.
Related
It seems that in the traditional microservice architecture, each service gets its own database with a different understanding of the data (described here). Sometimes it is considered permissible for databases to duplicate data. For instance, the "Users" service might know essentially everything about a user, whereas the "Posts" service might just store primary keys and usernames (so that the author of a post can have their name displayed, for instance). This page talks about eventual consistency, sources of truth, and other related concepts when data is duplicated. I understand that microservice architectures sometimes include a shared database, but most places I look suggest that this is a rare strategy.
As for why each service typically gets its own database, all I've seen so far is "so that each service owns its own resources," but I'm not convinced that a) the service layer in any way "owns" the persisted resources accessed through the database to begin with, or that b) services even need to own the resources they require rather than accessing necessary subsets of the master resources through a shared database.
So what are some of the justifications that each service in a microservice architecture should get its own database?
There are a few reasons why it does make sense to use a separate database per micro-service. Some of them are:
Scaling
Splitting your domain in micro-services is fine. You can scale your particular micro-service on the deployed web-server on demand or scale out as needed. That it obviously one of the benefits when using micro-services. More importantly you can have micro-service-1 running for example on 10 servers as it demands this traffic but micro-service-2 only requires 1 web-server so you deploy it on 1 server. The good thing is that you control this and you can manage your computing resources like in order to save money as Cloud providers are not cheap.
Considering this what about the database?
If you have one database for multiple services you could not do this. You could not scale the databases individually as they would be on one server.
Data partitioning to reduce size
Automatically as you split your domain in micro-services with each containing 1 database you split the amount of data that is stored in each database. Ideally if you do this you can have smaller database servers with less computing power and/or RAM.
In general paying for multiple small servers is cheaper then one large one.
So in this case you could make use of this fact and save some resources as well.
If it happens that the already spited by domain database have large amount of data techniques like data sharding or data partitioning could be applied additional, but this is another topic.
Which db technology fits the business requirement
This is very important pro fact for having multiple databases. It would allow you to pick the database technology which fits your Business requirement best in order to get the best performance or usage of it. For example some specific micro-service might have some Read-heavy operations with very complex filter options and a full text search requirement. Using Elastic Search in this case would be a good choice. Some other micro-service might use SQL Server as it requires SQL specific features like transnational behavior or similar. If for some reason you have one database for all services you would be stuck with the particular database technology which might not be so performant for those requirement. It is a compromise for sure.
Developer discipline
If for some reason you would have a couple micro-services which would share their database you would need to deal with the human factor. The developers would need to be disciplined to not cross domains and access/modify the other micro-services database(tables, collections and etc) which would be hard to achieve and control. In large organisations with a lot of developers this could be a serious problem. With a hard/physical split this is not an issue.
Summary
There are some arguments for having database per micro-service but also some against it. In general the guidelines and suggestions when using micro-services are to have the micro-service together with its data autonomous in order to work independent in Ideal case(this is not the case always). It is defiantly a compromise as well as using micro-services in general. As always the rule is the rule but there are exceptions to it. Micro-services architecture is flexible and very dependent of your Domain needs and requirements. If you and your team identify that it makes sense to merge multiple micro-service databases to 1 and that it solves a lot of your problems then go for it.
Microservices
Microservices advocate design constraints where each service is developed, deployed and scaled independently. This philosophy is only possible if you have database per service. How can i continue my business if i have DB failure and what steps i can take to mitigate this?DB is essential part of any enterprise application. I agree there are different number of challenges when services has its own databases.
Why Independent database?
Unlike other approaches this approach not only keeps your code-base clean and extendable but you truly omit the single point of failure in your business. To achieve this services sometimes can have duplicated data as well, as long as my service is autonomous and services can only be autonomous if i have database per service.
From business point of view, Lets take eCommerce application. you have microserivces like Booking, Order, Payment, Recommendation , search and so on. Database is shared. What happens if the DB is down ? All your services are down ! and there is no point using Microservies architecture other than you have clean code base.
If you have each service having it's own database , i don't mind if my recommendation service is not working but i can still search and book the order and i haven't lost the customer. that's the whole point.
It comes at cost and challenges, but in longer run it pays off.
SQL / NoSQL
Each service has it's own needs. To get the best performance I can use SQL for payment service (transaction) and I can use (I should) NoSQL for recommendation service. Shared database wouldn't help me in this case. In modern cloud Architectures like CQRS, Event Sourcing, Materialized views, we sometimes use 2 different databases for same service to get the performance out of it.
Again Database per service is not only about resources or how much data should it own. But we really have to see the bigger picture. Yes we have certain practices how much data and duplication is good or bad but that's another debate.
Hope that helps !
We have many web services and web app applications which have caching needs so we are trying to come up with caching strategy which can help all the teams irrespective of their technical choices. We have used Memcached(not replicated) & Couchbase(multi master) running locally on each server node and applications connect to them locally using Memcached protocol but going forward we are planning to go with centralized cache cluster exposed via REST APIs which can be used by all the applications running on different server nodes in a datacenter. Following are reasons behind this thought process:
Easy maintenance of a cluster without worrying about app server
nodes.
Single protocol(HTTP) used to access the cache without worrying
about underlying implementation.(We might use Redis or Couchbase or
Aerospike cluster)
But we are not sure about this strategy because we are worried about performance impact due to network overhead because of HTTP.
Has anyone tried this strategy? Is it a good idea to make cache as centralized service or local caches are the best?
While it's true that HTTP and network add latency, generally you need a cache because the actual operation takes significantly longer. So the question is: if you add 1-2 milliseconds to the cache access, does that still shorten the un-cached operation time significantly? If the answer is yes, and you follow some common best practices, having a centralized cache could be a good idea.
You might want to look into low latency, high throughput server-side frameworks for the HTTP service, like Node.js or Go. Also, you will probably benefit from implementing proper ETag support in your cache HTTP API.
Another alternative might be centralizing the cache server(s) without wrapping them in an HTTP layer. There are standard cache provider implementations for all the technologies you mentioned available for most modern web frameworks.
Disclaimer: I work for Redis Labs, a commercial company that makes tools for managing Redis and Memcached clusters. My employer, Redis Labs, has made a business of the strategy that you want affirmed :)
Cache is a dish best served close, but remote caching has benefits (e.g., offloading the DB) even if the latency penalty suggests differently. In most cases, compared to the time spent in the application, the local area network latency becomes negligible, so using a shared network-attached cache makes a lot of sense.
To get the best performance, interact from your app directly with the shared cache using its own protocol. An HTTP API, unless provided by the caching engine itself, could add latency to the client app's requests. OTOH, formalizing your apps access to the cache with a custom layer (such as a REST API) has a lot of nice benefits too, so you should evaluate the cost in the context of your latency budgets.
Your strategy is sound and it is used everywhere to build scalable and performant applications. Feel free to hit me if you need further advice.
For the first time I am developing an app that requires quite a bit of scaling, I have never had an application need to run on multiple instances before.
How is this normally achieved? Do I cluster SQL servers then mirror the programming across all servers and use load balancing?
Or do I separate out the functionality to run some on one server some on another?
Also how do I push out code to all my EC2 windows instances?
This will depend on the requirements you have. But as a general guideline (I am assuming a website) I would separate db, webserver, caching server etc to different instance(s) and use s3(+cloudfont) for static assets. I would also make sure that some proper rate limiting is in place so that only legitimate load is on the infrastructure.
For RDBMS server I might setup a master-slave db setup (RDS makes this easier), use db sharding etc. DB cluster solutions also exists which will be more complex to setup but simplifies database access for the application programmer. I would also check all the db queries and the tune db/sql queries accordingly. In some cases pure NoSQL type databases might be better than RDBMS or a mix of both where the application switches between them depending on the data required.
For webserver I will setup a loadbalancer and then use autoscaling on the webserver instance(s) behind the loadbalancer. Something similar will apply for app server if any. I will also tune the web servers settings.
Caching server will also be separated into its on cluster of instance(s). ElastiCache seems like a nice service. Redis has comparable performance to memcache but has more features(like lists, sets etc) which might come in handy when scaling.
Disclaimer - I'm not going to mention any Windows specifics because I have always worked on Unix machines. These guidelines are fairly generic.
This is a subjective question and everyone would tailor one's own system in a unique style. Here are a few guidelines I follow.
If it's a web application, separate the presentation (front-end), middleware (APIs) and database layers. A sliced architecture scales the best as compared to a monolithic application.
Database - Amazon provides excellent and highly available services (unless you are on us-east availability zone) for SQL and NoSQL data stores. You might want to check out RDS for Relational databases and DynamoDb for NoSQL. Both scale well and you need not worry about managing and load sharding/clustering your data stores once you launch them.
Middleware APIs - This is a crucial part. It is important to have a set of APIs (preferably REST, but you could pretty much use anything here) which expose your back-end functionality as a service. A service oriented architecture can be scaled very easily to cater multiple front-facing clients such as web, mobile, desktop, third-party widgets, etc. Middleware APIs should typically NOT be where your business logic is processed, most of it (or all of it) should be translated to database lookups/queries for higher performance. These services could be load balanced for high availability. Amazon's Elastic Load Balancers (ELB) are good for starters. If you want to get into some more customization like blocking traffic for certain set of IP addresses, performing Blue/Green deployments, then maybe you should consider HAProxy load balancers deployed to separate instances.
Front-end - This is where your presentation layer should reside. It should avoid any direct database queries except for the ones which are limited to the scope of the front-end e.g.: a simple Redis call to get the latest cache keys for front-end fragments. Here is where you could pretty much perform a lot of caching, right from the service calls to the front-end fragments. You could use AWS CloudFront for static assets delivery and AWS ElastiCache for your cache store. ElastiCache is nothing but a managed memcached cluster. You should even consider load balancing the front-end nodes behind an ELB.
All this can be bundled and deployed with AutoScaling using AWS Elastic Beanstalk. It currently supports ASP .NET, PHP, Python, Java and Ruby containers. AWS Elastic Beanstalk still has it's own limitations but is a very cool way to manage your infrastructure with the least hassle for monitoring, scaling and load balancing.
Tip: Identifying the read and write intensive areas of your application helps a lot. You could then go ahead and slice your infrastructure accordingly and perform required optimizations with a read or write focus at a time.
To sum it all, Amazon AWS has pretty much everything you could possibly use to craft your server topology. It's upon you to choose components.
Hope this helps!
The way I would do it would be, to have 1 server as the DB server with mysql running on it. All my data on memcached, which can span across multiple servers and my clients with a simple "if not on memcached, read from db, put it on memcached and return".
Memcached is very easy to scale, as compared to a DB. A db scaling takes a lot of administrative effort. Its a pain to get it right and working. So I choose memcached. Infact I have extra memcached servers up, just to manage downtime (if any of my memcached) servers.
My data is mostly read, and few writes. And when writes happen, I push the data to memcached too. All in all this works better for me, code, administrative, fallback, failover, loadbalancing way. All win. You just need to code a "little" bit better.
Clustering mysql is more tempting, as it seems more easy to code, deploy, maintain and keep up and performing. Remember mysql is harddisk based, and memcached is memory based, so by nature its much more faster (10 times atleast). And since it takes over all the read load from the db, your db config can be REALLY simple.
I really hope someone points to a contrary argument here, I would love to hear it.
We have a new project for a web app that will display banners ads on websites (as a network) and our estimate is for it to handle 20 to 40 billion impressions a month.
Our current language is in ASP...but are moving to PHP. Does PHP 5 has its limit with scaling web application? Or, should I have our team invest in picking up JSP?
Or, is it a matter of the app server and/or DB? We plan to use Oracle 10g as the database.
No offense, but I strongly suspect you're vastly overestimating how many impressions you'll serve.
That said:
PHP or other languages used in the application tier really have little to do with scalability. Since the application tier delegates it's state to the database or equivalent, it's straightforward to add as much capacity as you need behind appropriate load balancing. Choice of language does influence per server efficiency and hence costs, but that's different than scalability.
It's scaling the state/data storage that gets more complicated.
For your app, you have three basic jobs:
what ad do we show?
serving the add
logging the impression
Each of these will require thought and likely different tools.
The second, serving the add, is most simple: use a CDN. If you actually serve the volume you claim, you should be able to negotiate favorable rates.
Deciding which ad to show is going to be very specific to your network. It may be as simple as reading a few rows from a database that give ad placements for a given property for a given calendar period. Or it may be complex contextual advertising like google. Assuming it's more the former, and that the database of placements is small, then this is the simple task of scaling database reads. You can use replication trees or alternately a caching layer like memcached.
The last will ultimately be the most difficult: how to scale the writes. A common approach would be to still use databases, but to adopt a sharding scaling strategy. More exotic options might be to use a key/value store supporting counter instructions, such as Redis, or a scalable OLAP database such as Vertica.
All of the above assumes that you're able to secure data center space and network provisioning capable of serving this load, which is not trivial at the numbers you're talking.
You do realize that 40 billion per month is roughly 15,500 per second, right?
Scaling isn't going to be your problem - infrastructure period is going to be your problem. No matter what technology stack you choose, you are going to need an enormous amount of hardware - as others have said in the form of a farm or cloud.
This question (and the entire subject) is a bit subjective. You can write a dog slow program in any language, and host it on anything.
I think your best bet is to see how your current implementation works under load. Maybe just a few tweaks will make things work for you - but changing your underlying framework seems a bit much.
That being said - your infrastructure team will also have to be involved as it seems you have some serious load requirements.
Good luck!
I think that it is not matter of language, but it can be be a matter of database speed as CPU processing speed. Have you considered a web farm? In this way you can have more than one machine serving your application. There are some ways to implement this solution. You can start with two server and add more server as the app request more processing volume.
In other point, Oracle 10g is a very good database server, in my humble opinion you only need a stand alone Oracle server to commit the volume of request. Remember that a SQL server is faster as the people request more or less the same things each time and it happens in web application if you plan your database schema carefully.
You also have to check all the Ad Server application solutions and there are a very good ones, just try Google with "Open Source AD servers".
PHP will be capable of serving your needs. However, as others have said, your first limits will be your network infrastructure.
But your second limits will be writing scalable code. You will need good abstraction and isolation so that resources can easily be added at any level. Things like a fast data-object mapper, multiple data caching mechanisms, separate configuration files, and so on.
What makes a site good for high traffic?
Does it have more to do with the hardware/infrastructure, or with how one writes the software, using Java as the example, if it matters?
I'm wondering how the software changes just because it is expected that billions of users will be on the site, if at all.
My understanding up to this point is that the code doesn't change, but that it is deployed on multiple servers, in a cluster, and a load balancer distributes the load, so really, on any one server/deployment, the application is just as any other standard application/website.
I highly recommend reading Jeff Atwood's blog on Micro-Optimization. In previous blogs he talks somewhat about how this site was created and the hardware upgrades he has had (which quickly summarized said that better hardware performs better only the extent that it is faster/better), but the real speed of a site comes from good programming, and this article seems like it should sum up some of your site programming questions quite well.
Hardware is cheap. Programming is expensive.
There are some programming techniques to make sure your code can handle multiple simultaneous views/updates. If you're using an existing framework, much of that work is (hopefully) done for you, but otherwise you're going to find stuff that worked for a few hundred hits an hour on one server isn't going to work when you're getting hundreds of thousands of hits and you have to deploy multiple load balancing machines.
Well, it is primarily an issue of hardware scaling but there are a few things to keep in mind with respect to the software involved in scaling. For example, if you are on a server farm, you'll need to work with a session management server (either via SQL Server or via a state server - which has implications in that your session variables need to be serializable).
But, in the bigger picture, there are a variety of things that you would want to do to scale to an enterprise level. For example, it becomes particularly important that you abstract out your database calls to a DAL because you may well need to adopt the use of a middleware package for high volume environments.