High traffic web sites - performance

What makes a site good for high traffic?
Does it have more to do with the hardware/infrastructure, or with how one writes the software, using Java as the example, if it matters?
I'm wondering how the software changes just because it is expected that billions of users will be on the site, if at all.
My understanding up to this point is that the code doesn't change, but that it is deployed on multiple servers, in a cluster, and a load balancer distributes the load, so really, on any one server/deployment, the application is just as any other standard application/website.

I highly recommend reading Jeff Atwood's blog on Micro-Optimization. In previous blogs he talks somewhat about how this site was created and the hardware upgrades he has had (which quickly summarized said that better hardware performs better only the extent that it is faster/better), but the real speed of a site comes from good programming, and this article seems like it should sum up some of your site programming questions quite well.

Hardware is cheap. Programming is expensive.

There are some programming techniques to make sure your code can handle multiple simultaneous views/updates. If you're using an existing framework, much of that work is (hopefully) done for you, but otherwise you're going to find stuff that worked for a few hundred hits an hour on one server isn't going to work when you're getting hundreds of thousands of hits and you have to deploy multiple load balancing machines.

Well, it is primarily an issue of hardware scaling but there are a few things to keep in mind with respect to the software involved in scaling. For example, if you are on a server farm, you'll need to work with a session management server (either via SQL Server or via a state server - which has implications in that your session variables need to be serializable).
But, in the bigger picture, there are a variety of things that you would want to do to scale to an enterprise level. For example, it becomes particularly important that you abstract out your database calls to a DAL because you may well need to adopt the use of a middleware package for high volume environments.

Related

how does one identify why a website is slow? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I was asked this question once at an interview:
"Suppose you own a website where the server is at some remote location. One day, some user calls/emails you saying the site is abominably slow. How would you identify why the site is slow? Also, when you check the website yourself as any user would (using your browser), the site behaves just fine."
I could think of only one thing (which was shot down):
Check the server logs to analyse incoming traffic. Maybe a DoS attack or exceptionally high traffic. Interviewer told me to assume the server has normal traffic and no DoS.
I was kind of lost because I had never thought of this problem. I have almost no idea how running a server/website works. So if someone could highlight a few approaches, it would be nice.
While googling around, I could find only this relevant, wonderful article. That article is kind of too technical for me now, but I'm slowly breaking it down and understanding it.
Since you already said when you check the site yourself the speed is fine, this means that (at least for the pages you checked) there is nothing wrong with the server and it can serve those pages at a good speed. What you should be figuring out at this point is what the difference is between you and the user that reports your site is slow. It might be a lot of different things:
Is the user using a slow network connection (mobile for example)?
Does the user experience the same problems with other websites hosted at the same webhoster? If so, this could indicate a network problem. Normally this could also indicate a resource problem at the webserver, but in that case the site would also be slow for you.
If neither of the above leads to an answer, you could assume that the connection to the server and the server itself are fine. This means the problem must be in the users device. Find out which browser/OS he uses and try to replicate the problem. If that fails find out if he uses any antivirus or similar software that might cause problems.
This is a great tool to find the speed of web pages and tells you what makes it slow: https://developers.google.com/speed/pagespeed/insights
I think one of the important thing that is missing from above answers is the server location, which can play a vital in web performance.
When someone is saying that it is taking a longer time to open a web page that means high latency. High latency can be caused due to server location.
Let's assume as you are the owner of the web page then the server and client are co-located, so it will have a low latency.
But, now if client is across the border, then latency time will increase drastically. And hence a slow perfomance.
Another factor is caching which drastically affects the latency time.
Taking the example of facebook, they have server all over the world to reduce the latency time (and also to provide several other advantages) and they use huge caching system to cache their hot data (trending topics) whereas cold data (old data) are stored in hard disk so it takes a longer time to load an older photo or post.
So, a user might would have complained about this as they were trying load up some cold data.
I can think of these few reasons (first two are already mentioned above):
High Latency due to location of client
Server memory might need to be increased
Number of service calls from the page.
If a service could be down at the time of complaint, it could prevent page from loading.
The server load might be too high at the time of the poor experience. The server might need to increase the resources (e.g. adding another server/web server to the cluster).
Check if there was any background job running on the server at that time.
It is important to check the logs and schedules of the batch jobs to determine what all was running at that time.
Hope this help.
Normally the user takes the page loading time as a measure to find out that the site is slow. But if you really want to know that what is taking the maximum time the you can open the browser debugger by pressing f12. if your browser is chrome the click on network and see what calls your application is making and which are taking maximum time. If you are using Firefox the you need to install firebug. If you have that, then again press f12 and click on Net.
One reason could be the role of the user is different of your role. You might be having suppose an administrator privilege (some thing like super user role) and the code might be just allowing everything for such role that means it does not really do much of conditional checking to see what is allowed or not. Some times, it's a considerable over ahead to get all the privileges of the user and have the conditions checking, how course depends how how the authorization is implemented. That means, the page might be really slow for specific roles. Hence, you should find out the roles of the user and see if that is a reason.
Obviously an issue with the connection of the person connecting to your site OR it's possible it was a temporary issue and by the time you checked your site, everything was dandy. You could check your logs or ask your host if there was an issue at the time the slow down occured.
This is usually a memory issue and it can be resolved by increasing the Heap Size of the Web Server hosting the application. In case the application is running on Weblogic Server. Heap size can be increased in "setEnv" file located in Application Home.
Goodluck!
Michael Orebe
Though your question is quite clear, web site optimisation is a very extensive subject.
The majority of the popular web developing frameworks are for some reason, extremely processor inefficient.
The old fashioned way of developing n-tier web applications is still very relevant and is still considered to be best practice according the W3C. If you take a little time to read the source code structure of the most popular web developing frameworks you will see that they run much more code at the server than is necessary.
This may seem a bit of a simple answer but, the less code you run at the server and the more code you run at the client the faster your servers will work.
Sometimes contrasting framework code against the old fashioned way is the best way to get an understanding of this. Here is a link to a fully working mini web application which represents W3C best practices and runs the minimum amount of code at the server and the maximum amount of code at the client: http://developersfound.com/W3C_MVC_EX.zip this codebases is also MVC compliant.
This codebase comes with a MySQL database dump, php and client side code. To see this code in action you will need to restore the SQL dump to a MySQL instance (sql dump came from MySQL 8 Community) and add the user and schema permissions that are found in the php file (conn_include.php); setting the user to have execute permissions on the schema.
If you contrast this code base against all of the most popular web frameworks, it will really open your eyes to just how inefficient these frameworks are. The popular PHP frameworks that claim to be MVC frameworks aren’t actually MVC compliant at all. This is because they rely on embedding PHP tags inside HTML tags or visa-versa (considered very bad practice according the W3C). Also most popular node frameworks run way more code at the server than is necessary. Embedded tags also stop asynchronous calls from working properly unless the framework supports AJAX dumps such as Yii 2.
Two of the most important rules to follow with MVC compliance is: never embed server side tags (such as PHP tags) in HTML tags or visa-versa (unless there is a very good excuse such as SEO) and religiously never write code to run at the server if it can be run at the client. Also true MVC is based on tier separation, where as the MVC frameworks are based on code separation. True MVC compliance is very processor efficient. Don’t get me wrong MVC frameworks are very useful for a lot of things, but if you’re developing a site that is going to get millions of hits, they are quite useless, or at least they will drive your cloud bills so high that it will really eat into your company’s profits.
In summary frameworks don’t give much control over what code runs at the client or server and are very inefficient but you can get prototypes up and running quicker with less code.
In contrast the old fashioned way takes a bit more elbow grease but you have complete control over what runs at the server and what runs at the client.
As an additional bit of advice for optimisation avoid using pass-through queries and triggers and instead opt for stored procedures. Historically stored procedures weren’t invented at the time MVC was present as a paradigm but it definitely increases separation of concerns between the tiers and is much more processor efficient.
Hope this advice helps.

CDN vs Homegrown Caching

My understanding of a CDN (like Akamai or Limelight) is that they are heavy-duty caching services.
I also understand that they are very expensive. So I'm wondering why I can't just create my own cluster of replicated caching servers (using, say, EhCache or Memcached) for all my web app's caching needs (images, URL hits/responses, Javascripts, etc) and basically get the same thing?
In essence, from a developer's perspective, what are the benefits (technical or otherwise) of paying a CDN vs. just using your own caching solution? Or, if I have completely misunderstood what CDNs are, please correct my understanding! Thanks in advance!
The key to this is the N in CDN, Content Delivery Network.
The advantage of a CDN (a good one, at least) is that it's geographically distributed. What the big CDN providers do (the ones you mentioned along with others) is have hundreds or thousands of servers in facilities all over the world. This lets what ever resource the CDN is serving sit as geographically close as possible to the end user no matter where they are in the world.
You can certainly replicate the functionality. At a very basic level most CDN's are simply an object/value store which plenty of software can do - what you can't do, at the scale these guys do it at any rate, is have servers all over the world to serve and replicate these objects.
If all you're interested in is an efficient way to store and serve static files then an object store coupled with something like nginx would do just fine. A CDN is used to make sites load as fast as they possibly can, but they come at a price.
For the record, Amazon's Cloud Front, while not as distributed as Limelight or Akamai is a lot cheaper and a good middle ground compromise on cost vs service.
The advantage of big CDNs is that they own distributed resources all over the world, allowing them to serve the users from a servers near them. Besides caching, this is the major point that makes them fast.
To build your own CDN, you would have to install servers on multiple continents, arrange good Internet connectivity for them, setup caching and make sure you have the servers all synchronized. It's not impossible to do that, but it will be very cost-intensive and might not be profitable.
With all due respect, I disagree with the other two answers given thus far to this question. Just to be clear, I'm not saying they are wrong, I'm just offering a different perspective.
While it is certainly true that the key to CDN performance is the "N," as Bulk eloquently explained, that doesn't mean you can't build your own CDN. The question is whether or not it's worth the time (and by definition, the money).
We live in a world of cheap servers and even cheaper virtual machines. Sure the big CDN networks have thousands/millions of servers all over the world, but that's because they have thousands/millions of sites to serve. Depending on the size of your site/app, all you really need in terms of resources is as much as your site(s) need. If you're small, the minimum might be a VPS on each US coasts, one in Europe, maybe two in Asia, and one in Australia. Sure, the hardware costs are too high for your typical homepage, but they are certainly not extreme, and if you're looking at CDNs in the first place, they are probably within your budget.
To me, commercial CDN services just provide PaaS convenience, but there is nothing preventing you from getting IaaS and building up your own platform.
One more thing on this topic:
I once read a comment either by David Heinemeier Hansson (the creator of Ruby on Rails) or by someone referring to him that went something along the lines of: The owner(s) of 37 Signals were concerned DHH was using Ruby to build their application. At that point Ruby was still very obscure. Almost all web hosts were offering PHP, Perl, and Microsoft technologies. When asked about the fact that there were only a handful of Ruby hosts in the world, DHH asked, "well how many do you need?"
To me the point is you need to look at what's best for your needs and the needs of your application, and not necessarily what the guys with millions of servers think.

Is perl the fastest way to write a high performance page?

I was inspired by Slashdot, I was heard that it uses very limited servers to support a lot of users with fast response. And there is a website named slashcode, not sure if slashdot uses its source code.
I am wondering if Perl is the best to write a high performance web page? I know using Apache or IIS will be having a lot of overhead?
Any idea, books, papers, tutorials?
I'm going to assume that by "high performance" you mean both in the real time taken to produce a page and also how many it can serve concurrently.
The programming language isn't so important as your servers and algorithms. You may want to look into The C10k Problem which is a series of new technologies and refinement of techniques with the aim to allow a single web server to concurrently handle more than 10,000 concurrent connections. Things like the Nginx and lighttpd web servers and varnish cache came out of this project.
Big wins come from using a very light, very fast, very modular web server (Apache and IIS ain't it) with a very light, very fast cache in front of it to avoid having to process the same thing twice. For a high concurrency server, even caching for a few seconds can save you hundreds or thousands of processes. By chopping up a static page into a series of AJAX requests you can cache the more static bits and pieces independently of the bits that change frequently.
Instead of using mod_blah that embeds your program into a web server, use FastCGI or similar that puts your programs into their own little application servers. This allows them to run independent of the web server, possibly on remote machines and with load balancing. This lets you easily scale your processing power.
Eventually you're going to micro-optimize really important bits of your application code to the point where the language matters, but you can focus on the really important bits rather than having to do the whole project solely according to raw performance.
Regardless of how fast your code is, at some point the bottleneck will stop being your code, and start being the web server itself.
As long as you're not using the CGI interface[1] to talk to the web server, the language isn't going to have a noticeable impact on performance in 99% of cases. The exceptions are those in which you're doing heavy back-end processing rather than simply grabbing something out of a database, lightly massaging it, and sending it off to the user - and, if you are doing that kind of thing, you're likely better off doing it asynchronously if possible and stuffing the results into a database to be lightly massaged and viewed later.
The reason is, quite simply, that network connection and data transfer times will be so much longer than your program's execution time that it's not even funny. If it's taking 2 seconds to establish a network connection to the server and do the data transmission in each direction, nobody is going to care whether the processing on the server adds 0.1s or 0.2s on top of that 2s of network activity.
[1] Note that I am talking here about the vanilla CGI "start up a new process to service each incoming request" model, not the Perl CGI module (CGI.pm/use CGI). There are ways to use CGI while also making use of a long-lived process which handles multiple requests over its lifetime.
Architecture and system design are more important than language choice for a high traffic app.
But selecting a language is not the first thing you should do, unless you are planning to write everything from the ground up.
You should be selecting a toolset.
If you want to have something soonest, look at existing web applications. What meets your needs? How customizable is it? Does it meet your performance/scalability requirements? If so, the language you use will be the language your app uses.
If you can't find a good match in existing apps, look at different frameworks, Catalyst, Rails, Squatting, Camping, Jifty, Django. There's a nice list of them on Wikipedia.
You should be able to find a framework that will do the job, many of them. Pick some contenders and choose one. The language you use will be the language your framework uses.
There's really no such thing as a "high performance page". That's like asking what the fastest car is (and if you watch enough Top Gear, you know that's not a simple answer). You have to think about what you actually want to do (i.e. the particular task), what you have to do to make that happen, and which tools would work best for that.
Are you going to have a lot of people doing a lot of small things, or fewer people doing really big things? Is it all going to happen at once (i.e. spikes), or is it going to be constant demand? Are you send back small chunks of data or serving up really large files?
Suppose that every portion were as fast as possible. It's a fantasy for sure, but consider it anyway. Now that everything is fast as possible, rank every part according to how relatively fast they are. What's the slowest part? Is it disk access? Network IO? Socket availability?
If you aren't at the point where you're already thinking about this, the language probably isn't that important beyond your skill with it.
There are a lot of books on web performance out there. :)
This post on serverfault suggestst that you could write an extension module to nginx for serving dynamic content.
Such modules need to be compiled to native machine code, so most likely are faster than running Perl.
I don't believe it would be faster than other common choices such as PHP, Python, Ruby, Java, or C#.

Best scaling methodologies for a highly traffic web application?

We have a new project for a web app that will display banners ads on websites (as a network) and our estimate is for it to handle 20 to 40 billion impressions a month.
Our current language is in ASP...but are moving to PHP. Does PHP 5 has its limit with scaling web application? Or, should I have our team invest in picking up JSP?
Or, is it a matter of the app server and/or DB? We plan to use Oracle 10g as the database.
No offense, but I strongly suspect you're vastly overestimating how many impressions you'll serve.
That said:
PHP or other languages used in the application tier really have little to do with scalability. Since the application tier delegates it's state to the database or equivalent, it's straightforward to add as much capacity as you need behind appropriate load balancing. Choice of language does influence per server efficiency and hence costs, but that's different than scalability.
It's scaling the state/data storage that gets more complicated.
For your app, you have three basic jobs:
what ad do we show?
serving the add
logging the impression
Each of these will require thought and likely different tools.
The second, serving the add, is most simple: use a CDN. If you actually serve the volume you claim, you should be able to negotiate favorable rates.
Deciding which ad to show is going to be very specific to your network. It may be as simple as reading a few rows from a database that give ad placements for a given property for a given calendar period. Or it may be complex contextual advertising like google. Assuming it's more the former, and that the database of placements is small, then this is the simple task of scaling database reads. You can use replication trees or alternately a caching layer like memcached.
The last will ultimately be the most difficult: how to scale the writes. A common approach would be to still use databases, but to adopt a sharding scaling strategy. More exotic options might be to use a key/value store supporting counter instructions, such as Redis, or a scalable OLAP database such as Vertica.
All of the above assumes that you're able to secure data center space and network provisioning capable of serving this load, which is not trivial at the numbers you're talking.
You do realize that 40 billion per month is roughly 15,500 per second, right?
Scaling isn't going to be your problem - infrastructure period is going to be your problem. No matter what technology stack you choose, you are going to need an enormous amount of hardware - as others have said in the form of a farm or cloud.
This question (and the entire subject) is a bit subjective. You can write a dog slow program in any language, and host it on anything.
I think your best bet is to see how your current implementation works under load. Maybe just a few tweaks will make things work for you - but changing your underlying framework seems a bit much.
That being said - your infrastructure team will also have to be involved as it seems you have some serious load requirements.
Good luck!
I think that it is not matter of language, but it can be be a matter of database speed as CPU processing speed. Have you considered a web farm? In this way you can have more than one machine serving your application. There are some ways to implement this solution. You can start with two server and add more server as the app request more processing volume.
In other point, Oracle 10g is a very good database server, in my humble opinion you only need a stand alone Oracle server to commit the volume of request. Remember that a SQL server is faster as the people request more or less the same things each time and it happens in web application if you plan your database schema carefully.
You also have to check all the Ad Server application solutions and there are a very good ones, just try Google with "Open Source AD servers".
PHP will be capable of serving your needs. However, as others have said, your first limits will be your network infrastructure.
But your second limits will be writing scalable code. You will need good abstraction and isolation so that resources can easily be added at any level. Things like a fast data-object mapper, multiple data caching mechanisms, separate configuration files, and so on.

What are the reasons for a "simple" website not to choose Cloud Based Hosting?

I have been doing some catching up lately by reading about cloud hosting.
For a client that has about the same characteristics as StackOverflow (Windows stack, same amount of visitors), I need to set up a hosting environment. Stackoverflow went from renting to buying.
The question is why didn't they choose cloud hosting?
Since Stackoverflow doesn't use any weird stuff that needs to run on a dedicated server and supposedly cloud hosting is 'the' solution, why not use it?
By getting answers to this question I hope to be able to make a weighted decision myself.
I honestly do not know why SO runs like it does, on privately owned servers.
However, I can assume why a website would prefer this:
Maintainability - when things DO go wrong, you want to be hands-on on the problem, and solve it as quickly as possible, without needing to count on some third-party. Of course the downside is that you need to be available 24/7 to handle these problems.
Scalability - Cloud hosting (or any external hosting, for that matter) is very convenient for a small to medium-sized site. And most of the hosting providers today do give you the option to start small (shared hosting for example) and grow to private servers/VPN/etc... But if you truly believe you will need that extra growth space, you might want to count only on your own infrastructure.
Full Control - with your own servers, you are never bound to any restrictions or limitations a hosting service might impose on you. Run whatever you want, hog your CPU or your RAM, whatever. It's your server. Many hosting providers do not give you this freedom (unless you pay up, of course :) )
Again, this is a cost-effectiveness issue, and each business will handle it differently.
I think this might be a big reason why:
Cloud databases are typically more
limited in functionality than their
local counterparts. App Engine returns
up to 1000 results. SimpleDB times out
within 5 seconds. Joining records from
two tables in a single query breaks
databases optimized for scale. App
Engine offers specialized storage and
query types such as geographical
coordinates.
The database layer of a cloud instance
can be abstracted as a separate
best-of-breed layer within a cloud
stack but developers are most likely
to use the local solution for both its
speed and simplicity.
From Niall Kennedy
Obviously I cannot say for StackOverflow, but I have a few clients that went the "cloud hosting" route. All of which are now frantically trying to get off of the cloud.
In a lot of cases, it just isn't 100% there yet. Limitations in user tracking (passing of requestor's IP address), fluctuating performance due to other load on the cloud, and unknown usage number are just a few of the issues that have came up.
From what I've seen (and this is just based on reading various blogged stories) most of the time the dollar-costs of cloud hosting just don't work out, especially given a little bit of planning or analysis. It's only really valuable for somebody who expects highly fluctuating traffic which defies prediction, or seasonal bursts. I guess in it's infancy it's just not quite competitive enough.
IIRC Jeff and Joel said (in one of the podcasts) that they did actually run the numbers and it didn't work out cloud-favouring.
I think Jeff said in one of the Podcasts that he wanted to learn a lot of things about hosting, and generally has fun doing it. Some headaches aside (see the SO blog), I think it's a great learning experience.
Cloud computing definitely has it's advantages as many of the other answers have noted, but sometimes you just want to be able to control every bit of your server.
I looked into it once for quite a small site. Running a small Amazon instance for a year would cost around £700 + bandwidth costs + S3 storage costs. VPS hosting with similar specs and a decent bandwidth allowance chucked in is around £500. So I think cost has a lot to do with it unless you are going to have fluctuating traffic and lots of it!
I'm sure someone from SO will answer it but "Isn't just more hassle"? Old school hosting is still cheap and unless you got big scalability problems why would you do cloud hosting?

Resources