I need to study load balancers, such as Network Load Balancing, Linux Virtual Server, HAProxy, etc. There are some under-the-hood things I need to know:
What algorithms/technologies are used in these load balancers? Which is the most popular? The most effective?
I expect that these algorithms/technologies will not be too complicated. Are there any resources written about them?
Load balancing in Apache, for example, is taken care of by the module called mod_proxy_balancer. This module supports 3 load balancing algorithms:
Request counting
Weighted traffic counting
Pending request counting
For more details, take a look here: mod_proxy_balancer
Not sure if this belongs on serverfault or not, but some load balancing techniques are:
Round Robin
Least Connections
I used least connections. It just made the most sense to send the person to the machine which had the least amount of load.
In general, load balancing is all about sending new client requests to the servers that are least busy. Based on the application you're running, assign a 'busy factor' to each server: a number reflecting one or several points of interest for your load balancing algorithm (connected clients, CPU/memory usage, etc.), and then, at runtime, choose the server with the lowest such score. Basically ANY load balancing technique is based on something like this (a sketch follows the list below):
Round robin does not implement a 'busy score' per se, but assigns each consecutive request to the next server in a circular queue.
Least connections has its score = number_of_open_connections to the server. Obviously, a server with fewer connections is a better choice.
Random assignment is a special case - you make an uninformed decision about each server's load, but you rely on the random choice having a statistically even distribution.
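To make the 'busy score' idea concrete, here is a minimal Go sketch of the three strategies above (the server fields and the pool are illustrative, not taken from any particular product):

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Server tracks the only "busy factor" least-connections needs.
    type Server struct {
        Name  string
        Conns int // currently open connections
    }

    // roundRobin keeps no score: it just walks a circular index.
    func roundRobin(servers []*Server, next *int) *Server {
        s := servers[*next]
        *next = (*next + 1) % len(servers)
        return s
    }

    // leastConnections picks the lowest score, where score = open connections.
    func leastConnections(servers []*Server) *Server {
        best := servers[0]
        for _, s := range servers[1:] {
            if s.Conns < best.Conns {
                best = s
            }
        }
        return best
    }

    // randomChoice makes an uninformed pick; over many requests the
    // load evens out statistically.
    func randomChoice(servers []*Server) *Server {
        return servers[rand.Intn(len(servers))]
    }

    func main() {
        pool := []*Server{{"web1", 3}, {"web2", 1}, {"web3", 7}}
        next := 0
        for i := 0; i < 3; i++ {
            fmt.Println("round robin ->", roundRobin(pool, &next).Name)
        }
        s := leastConnections(pool)
        s.Conns++ // a real balancer bumps the count when it assigns a request
        fmt.Println("least conns ->", s.Name)
        fmt.Println("random      ->", randomChoice(pool).Name)
    }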
In addition to those already mentioned, a simple random assignment can be a good enough algorithm for load balancing, especially with a large number of servers.
Here's one link from Oracle: http://download-llnw.oracle.com/docs/cd/E11035_01/wls100/cluster/load_balancing.html
I'm trying to understand what "farm" means in computing. If a farm can be exemplified as a cluster, there must be something that a farm is independently of what a cluster is.
How is it different from a grid?
Do these concepts have a general meaning, where talking about, for example, web servers is only one common scenario, or do they differ depending on the context completely? If so, what are the different meanings (or the most common ones)?
Also, should I be asking this somewhere else? (If so, my apologies).
I'll have a go:
A server farm is a group of servers, which may or may not be clustered, that together offer a higher computing capacity for a particular goal than an individual server. An example is a Web farm, which typically consists of a number of load-balanced Web servers, each with the same content and configurations.
While a farm may be clustered, a cluster typically consists of servers configured in a failover scenario, for instance active/passive, meaning that, say, half the servers are on active duty while the rest become active only when the servers on active duty are no longer accessible. So when a server in the cluster crashes, it is not a problem, because another server takes over.
A grid then is typically a configuration of computers that work together (in parallel) to solve a complicated task, such as organic chemistry computations or playing chess against the world champion. The work is divided into parts and each computer takes on a part and reports its findings to the main node which then synthesizes the final result.
I hope this helps.
Some links:
Wikipedia definition of server farm: http://en.wikipedia.org/wiki/Server_farm
Wikipedia definition of computing cluster: http://en.wikipedia.org/wiki/Cluster_(computing)
Wikipedia definition of grid computing: http://en.wikipedia.org/wiki/Grid_computing
Is it logical to say: "If the average service time for a request is X and the affordable waiting time for requests is Y, then the maximum number of concurrent requests to serve would be Y / X"?
I think what I'm asking is whether there are any hidden factors that I'm not taking into account.
If you're talking specifically about webservers, then no, your formula doesn't work, because webservers are designed to handle multiple, simultaneous requests, using forking or threading.
This turns the formula into something far harder to quantify - in my experience, web servers can handle LOTS (i.e. hundreds or thousands) of concurrent requests which consume little or no time, but tend to reduce that concurrency quite dramatically as the requests consume more time.
That means that "average service time" isn't massively useful - it can hide wide variations, and it's actually the outliers that affect you the most.
Broadly yes, but your service provider (a web server in your case) is capable of handling more than one request in parallel, so you should take that into account. I assume you measured end-to-end service time and haven't already averaged it over the number of parallel streams. One other thing you didn't, and cannot realistically, measure is the network delay to/from your website.
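To make that concrete with made-up numbers: if the average service time is X = 100 ms, the affordable wait is Y = 2 s, and the server can work on P = 50 requests in parallel, then each parallel slot can serve Y / X = 20 requests within the deadline, so the back-of-envelope ceiling is P * Y / X = 50 * 20 = 1000 concurrent requests rather than 20. That still ignores queueing effects and the outliers mentioned above.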
What you are heading towards is the Erlang unit (not the language of the same name), which is used to describe how much load a system can take. Erlangs are unitless (it is just a number) and originated in old-school telephony (POTS), where the unit was used to describe how many wires were needed to handle X calls per time period with a low blocking probability. Beyond Erlang is the Engset formula, which is used more for high-capacity systems, such as mobile systems.
It also gets used in expensive consultant reports on realtime computer systems and databases to describe the point at which performance degradation is likely to occur. Wikipedia has an article on this (http://en.wikipedia.org/wiki/Erlang_(unit)) and the book 'Fixed and Mobile Telecommunications, Network Systems and Services' has a good chapter on performance analysis.
While the unit is aimed at telephone systems, just swap in 'web server' and it behaves the same: a web server is the same concept, in that load arrives at random intervals at a system with finite parallel capacity. In your case, you can probably measure total load with load tools more easily than parallel capacity, and then back-calculate through the formulas. This is widely done to gain a level of confidence in overall system models.
Erlang/Engset formulas are really useful when you have randomly arriving load over parallel streams (i.e. web requests) and a service time that can only be averaged or estimated (i.e. it varies in real life). You can then calculate the blocking probability, which is the probability that a new request will need to wait while current requests are serviced, and how long it will wait. They also help you analyse whether you need to handle more requests in parallel or make each request faster (number of lines and holding time, in Erlang-speak).
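As an illustration (the function and numbers below are mine, not from the book): the Erlang B blocking probability for an offered load of E erlangs on m parallel servers can be computed with the standard recurrence B(0) = 1, B(k) = E*B(k-1) / (k + E*B(k-1)). A quick Go sketch:

    package main

    import "fmt"

    // erlangB returns the blocking probability for an offered load
    // (arrival rate x average service time, in erlangs) on m parallel
    // servers, via the numerically stable Erlang B recurrence.
    func erlangB(erlangs float64, m int) float64 {
        b := 1.0
        for k := 1; k <= m; k++ {
            b = erlangs * b / (float64(k) + erlangs*b)
        }
        return b
    }

    func main() {
        // e.g. 100 req/s arriving, 0.2 s average service time
        // => an offered load of 20 erlangs.
        load := 100.0 * 0.2
        for _, workers := range []int{20, 25, 30} {
            fmt.Printf("%d parallel workers: %.2f%% of requests blocked\n",
                workers, 100*erlangB(load, workers))
        }
    }

Running it shows how quickly the blocking probability drops once capacity exceeds the offered load.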
You will probably look into queuing systems analysis next: as soon as requests block (queue), the models change slightly.
Many factors are not taken into account:
memory limits
data locking constraints such as people wanting to update the same data
application latency
caching mechanisms
different users will perform different tasks on the site and generate different loads
That said, one easy way to get a rough estimate is with Apache's ab tool (Apache Bench).
For example, to fetch the homepage 1000 times, 100 concurrent requests at a time:
ab -c 100 -n 1000 http://www.example.com/
I'm working on creating a version of Pastry natively in Go. From the design [PDF]:
It is assumed that the application provides a function that allows each Pastry node to determine the "distance" of a node with a given IP address to itself. A node with a lower distance value is assumed to be more desirable. An application is expected to implement this function depending on its choice of a proximity metric, using network services like traceroute or Internet subnet maps, and appropriate caching and approximation techniques to minimize overhead.
I'm trying to figure out the best way to determine the "proximity" (i.e., network latency) between two EC2 instances programmatically from Go. Unfortunately, I'm not familiar enough with low-level networking to differentiate between the types of requests I could use. Googling did not turn up any suggestions for measuring latency from Go, and general latency techniques always seem to be Linux binaries, which I'm hoping to avoid in the name of fewer dependencies. Any help?
Also, I note that the latency should be on the scale of 1ms between two EC2 instances. While I plan to use the implementation on EC2, it could hypothetically be used anywhere. Is latency generally so bad that I should expend the effort to ensure the network proximity of two nodes? Keep in mind that most Pastry requests can be served in log base 16 of the number of servers in the cluster (so for 10,000 servers, it would take approximately 3 requests, on average, to find the key being searched for). Is the latency from, for example, EC2's Asia-Pacific region to EC2's US-East region enough to justify the increased complexity and the overhead introduced by the latency checks when adding nodes?
A common distance metric in networking is to count the number of hops (node-hops in-between) a packet needs to reach its destination. This metric was also mentioned in the text you quoted. This could give you adequate distance values even for the low-latency environment you mentioned (EC2 “local”).
For the Go logic itself, one would think the net package is what you are looking for. And indeed, for latency tests (ICMP ping) you could use it to create a raw IP connection:
conn, err := net.Dial("ip4", "127.0.0.1")
create your ICMP packet structure and data, and send it. (See the Wikipedia page on ICMP; IPv6 needs a different format.) Unfortunately you can't create an ICMP connection directly, like you can with TCP and UDP, so you will have to handle the packet structure yourself.
As conn (of type net.Conn) is a Writer, you can then pass it the ICMP data you defined.
In the ICMP Type field you specify the message type. The values you are looking for are 8, 0 and 30: 8 for your echo request, 0 for the echo reply that comes back, and maybe type 30 (traceroute) gives you some more information.
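If you would rather not marshal the echo request by hand, the golang.org/x/net/icmp and golang.org/x/net/ipv4 packages (extension repos outside the standard net package) handle the packet format for you. A rough sketch of a timed ping, assuming raw-socket privileges (i.e. root):

    package main

    import (
        "fmt"
        "net"
        "os"
        "time"

        "golang.org/x/net/icmp"
        "golang.org/x/net/ipv4"
    )

    // pingOnce sends a single ICMP echo request and returns the
    // round-trip time. Needs raw-socket privileges on most systems.
    func pingOnce(addr string) (time.Duration, error) {
        c, err := icmp.ListenPacket("ip4:icmp", "0.0.0.0")
        if err != nil {
            return 0, err
        }
        defer c.Close()

        msg := icmp.Message{
            Type: ipv4.ICMPTypeEcho, // type 8: echo request
            Code: 0,
            Body: &icmp.Echo{
                ID:   os.Getpid() & 0xffff,
                Seq:  1,
                Data: []byte("latency-probe"),
            },
        }
        b, err := msg.Marshal(nil)
        if err != nil {
            return 0, err
        }

        start := time.Now()
        if _, err := c.WriteTo(b, &net.IPAddr{IP: net.ParseIP(addr)}); err != nil {
            return 0, err
        }
        c.SetReadDeadline(time.Now().Add(3 * time.Second))

        reply := make([]byte, 1500)
        n, _, err := c.ReadFrom(reply)
        if err != nil {
            return 0, err
        }
        rtt := time.Since(start)

        parsed, err := icmp.ParseMessage(1, reply[:n]) // 1 = ICMPv4 protocol number
        if err != nil {
            return 0, err
        }
        if parsed.Type != ipv4.ICMPTypeEchoReply { // type 0: echo reply
            return 0, fmt.Errorf("unexpected ICMP type %v", parsed.Type)
        }
        return rtt, nil
    }

    func main() {
        rtt, err := pingOnce("127.0.0.1")
        if err != nil {
            fmt.Println("ping failed:", err)
            return
        }
        fmt.Println("rtt:", rtt)
    }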
Unfortunately, for counting the network hops, you will need the IP packet header fields. This means you will have to construct your own IP packets, which net does not seem to allow.
Checking the source of Dial(), it uses internetSocket, which is not exported/public. I'm not really sure if I'm missing something, but it seems there is no simple way to construct your own IP packets to send with customizable header values. You'd have to check further how DialIP sends packets with internetSocket, then duplicate and adapt that code/concept. Alternatively, you could use cgo and a system library to construct your own packets (though this would add yet more complexity).
If you are planning on using IPv6, you will (also) have to look into ICMPv6. Both protocols use a different packet structure than their v4 versions.
So, I'd suggest using simple latency (a timed ping) as the simple(r) implementation, and adding node-hop counting later if you need it. If you have both in place, maybe you'll also want to combine the two (fewer hops does not automatically mean better; think of long overseas cables, etc.).
DNS Round Robin (DRR) lets you do cheap load balancing (distribution is a better term). Its pro is that it permits infinite horizontal scaling. The con is that if one of the web servers goes down, some clients continue to use the broken IP for minutes (min TTL 300s) or more, even if the DNS implements fail-over.
A Hardware Load Balancer (HLB) handles such web server failures transparently, but it cannot scale its bandwidth indefinitely. A hot spare is also needed.
A good solution seems to be using DRR in front of a group of HLB pairs. Each HLB pair never goes down, and therefore DRR never leaves clients stranded. Plus, when bandwidth isn't enough, you can add a new HLB pair to the group.
Problem: DRR moves clients randomly between the HLB pairs and therefore (AFAIK) session stickiness cannot work.
I could just avoid using session stickiness, but it makes better use of caches, so it is something I want to preserve.
Question: is there an HLB implementation where an instance can share its (sessionid, webserver) mapping with the other instances?
If this were possible, a client would be routed to the same web server regardless of which HLB routed the request.
Thanks in advance.
Modern load balancers have very high throughput capabilities (gigabit). So unless you're running a huuuuuuuuuuge site (e.g. google), adding bandwidth is not why you'll need a new pair of load balancers, especially since most large sites offload much of their bandwidth to CDNs (Content Delivery Networks) like Akamai. If you're pumping a gigabit of un-CDN-able data through your site and don't already have a global load-balancing strategy, you've got bigger problems than cache affinity. :-)
Instead of bandwidth limits, sites tend to add additional LB pairs for geo-distribution of servers at separate data centers to ensure users spread across the world can talk to a server closest to them.
For that latter scenario, load balancer companies offer geo-location solutions, which (at least until a few years ago, which was when I was following this stuff) were based on custom DNS implementations that looked at client IPs and resolved to the load balancer pair's Virtual IP address that is "closest" (in network topology or performance) to the client. These days, CDNs like Akamai also offer global load balancing services (e.g. http://www.akamai.com/html/technology/products/gtm.html). Amazon's EC2 hosting also supports this kind of feature for sites hosted there (see http://aws.amazon.com/elasticloadbalancing/).
Since users tend not to move across continents in the course of a single session, you automatically get affinity (aka "stickiness") with geographic load balancing, assuming your pairs are located in separate data centers.
Keep in mind that geo-location is really hard since you also have to geo-locate your data to ensure your back-end cross-data-center network doesn't get swamped.
I suspect that F5 and other vendors also offer single-datacenter solutions which achieve the same ends, if you're really concerned about the single point of failure of network infrastructure (routers, etc.) inside your datacenter. But router and switch vendors have high-availability solutions which may be more appropriate to address that issue.
Net-net, if I were you I wouldn't worry about multiple pairs of load balancers. Get one pair and, unless you have a lot of money and engineering time to burn, partner with a hoster who's good at keeping their data center network up and running.
That said, if cache affinity is such a big deal for your app that you're thinking about shelling out big $$$ for multiple pairs of load balancers, it may be worth considering some app architecture changes (like using an external caching cluster). Solutions like memcached (for linux) are designed for this scenario. Microsoft also has one coming called "Velocity".
Anyway, hope this is useful info-- it's admittedly been a while since I've been deeply involved in this space (I was part of the team which designed an application load balancing product for a large software vendor) so you might want to double-check my assumptions above with facts you can pull off the web from F5 and other LB vendors.
OK, this is an ancient question, which I just found through a Google search. But for any future visitors, here are some additional clarifications:
Problem: [DNS Round Robin] moves clients randomly between the HLB pairs and therefore (AFAIK) session stickiness cannot work.
This premise is, as best I can tell, not accurate. It seems nobody really knows what old browsers might do, but presumably each browser window will stay on the same IP address as long as it's open. Newer operating systems probably obey the longest-matching-prefix rule. Thus there shouldn't be much 'flapping', i.e. random switching from one load balancer IP to another.
However, if you're still worried about users getting randomly reassigned to a new load balancer pair, then a small modification of the classic L3/4 & L7 load balancing setup can help:
Publish DNS Round Robin records that go to Virtual high-availability IPs that are handled by L4 load balancers.
Have the L4 load balancers forward to pairs of L7 load balancers based on the origin IP address, i.e. use consistent hashing on the end user's IP to always route a given end user to the same L7 load balancer (a sketch of this hashing step follows the list).
Have your L7 load balancers use "sticky sessions" as you want them to.
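A small Go sketch of the consistent-hashing step from point 2 (the ring construction and the pair names are illustrative; real L4 balancers implement this internally):

    package main

    import (
        "crypto/sha1"
        "encoding/binary"
        "fmt"
        "sort"
    )

    // ring is a minimal consistent-hash ring mapping client IPs to
    // L7 balancer pairs.
    type ring struct {
        points []uint32          // sorted hash points on the ring
        owner  map[uint32]string // hash point -> balancer pair
    }

    func hash(s string) uint32 {
        h := sha1.Sum([]byte(s))
        return binary.BigEndian.Uint32(h[:4])
    }

    func newRing(pairs []string, replicas int) *ring {
        r := &ring{owner: make(map[uint32]string)}
        for _, p := range pairs {
            for i := 0; i < replicas; i++ { // virtual nodes smooth the split
                pt := hash(fmt.Sprintf("%s#%d", p, i))
                r.points = append(r.points, pt)
                r.owner[pt] = p
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // pick maps a client IP to a pair; the same IP always lands on the
    // same pair, and adding/removing a pair only remaps ~1/n of clients.
    func (r *ring) pick(clientIP string) string {
        h := hash(clientIP)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) {
            i = 0 // wrap around the ring
        }
        return r.owner[r.points[i]]
    }

    func main() {
        r := newRing([]string{"lb-pair-a", "lb-pair-b", "lb-pair-c"}, 64)
        fmt.Println(r.pick("203.0.113.7")) // stable for this client
        fmt.Println(r.pick("198.51.100.23"))
    }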
Essentially this is just a small modification to what Willy Tarreau (the creator of HAProxy) wrote years ago.
Thanks for putting things in the right perspective.
I agree with you.
I did some reading and found:
Flickr: http://highscalability.com/flickr-architecture
4 billion queries per day --> about 50000 queries/s
Youtube: http://highscalability.com/youtube-architecture
100 million video views/day --> about 1200 video views/second
PlentyOfFish: http://highscalability.com/plentyoffish-architecture
600 pages/second
200 Mbps used
CDN used
Twitter: http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster
300 tweets/second
600 req/s
A very top-end LB like this can scale up to:
200,000 SSL handshakes per second
1 million TCP connections per second
3.2 million HTTP requests per second
36 Gbps of TCP or HTTP throughput
Therefore, you are right: an LB could hardly become a bottleneck.
Anyway, I found this (old) article, http://www.tenereillo.com/GSLBPageOfShame.htm, which explains that geo-aware DNS could create availability issues.
Could someone comment on that article?
Thanks,
Valentino
So why not keep it simple and have the DNS server give out a certain IP address (or addresses) based on the origin IP address, i.e. use consistent hashing on the end user's IP to always give a given end user the same IP address(es)?
I'm aware that this only provides a simple and cheap load distribution mechanism.
I have been looking for this but haven't found a DNS server that implements it (although BIND has some possibilities with views).
Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN, while others may be connected using a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, latency induced) and b) nodes with differing computational power (or even memory constraints for that matter)?
In other words, how do I prioritize local nodes with lots of computational power over nodes that have high latency and may be less powerful? And how would I ideally prioritize high-performance remote nodes with high transmission latencies, so that they specifically run processes with a relatively high computation/transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking in terms of benchmarking each node in the cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be calculated, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node completes a given task).
Probably, something like that would have to be done repeatedly: on the one hand to get representative (that is, averaged) data, and on the other hand because it might even be useful at runtime to adjust dynamically to changing conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines)
This would be meant to hopefully optimize internal job dispatch so that specific nodes handle specific jobs.
We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
1. The overhead in simply making a remote call over a local (internal) call
2. The network latency to the node (as a function of the request/result payload)
3. The performance of the remote node
4. The compute power needed to execute the function
5. Whether batching of calls provides any performance improvement if there was a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others)
For 2. we actively measured it using probe messages to measure round trip time, and we collated information from actual calls made
For 3. we measured it on the node and had the nodes broadcast that information (it changed depending on the load currently active on the node)
For 4 and 5. we worked it out empirically for the given batch
Then the caller solved to get the minimum solution for a batch of calls (in our case pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
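A rough sketch of this kind of greedy "earliest finish time" assignment, written in Go for illustration (the cost model, names, and numbers are all made up, not the original system's logic):

    package main

    import (
        "fmt"
        "time"
    )

    // Node holds the measurements described above.
    type Node struct {
        Name  string
        RTT   time.Duration // probe-measured round trip (factor 2)
        Speed float64       // relative compute power, broadcast by node (factor 3)
        busy  time.Duration // work already queued on this node
    }

    // Job carries an empirically estimated compute cost (factors 4/5),
    // expressed as runtime on a node of Speed 1.0.
    type Job struct {
        Name string
        Cost time.Duration
    }

    // assign greedily places each job on the node that would finish it
    // earliest, given network latency, node speed, and queued work.
    func assign(jobs []Job, nodes []*Node) map[string]string {
        placement := make(map[string]string)
        for _, j := range jobs {
            var best *Node
            var bestDone time.Duration
            for _, n := range nodes {
                run := time.Duration(float64(j.Cost) / n.Speed)
                done := n.RTT + n.busy + run
                if best == nil || done < bestDone {
                    best, bestDone = n, done
                }
            }
            best.busy = bestDone - best.RTT // new queue depth on the chosen node
            placement[j.Name] = best.Name
        }
        return placement
    }

    func main() {
        nodes := []*Node{
            {Name: "local-fast", RTT: 1 * time.Millisecond, Speed: 2.0},
            {Name: "remote-fast", RTT: 80 * time.Millisecond, Speed: 2.0},
            {Name: "local-slow", RTT: 1 * time.Millisecond, Speed: 0.5},
        }
        jobs := []Job{
            {Name: "batch-1", Cost: 10 * time.Second},
            {Name: "batch-2", Cost: 10 * time.Second},
            {Name: "tiny", Cost: 5 * time.Millisecond},
        }
        // The two big jobs spread across the fast nodes; the tiny job
        // goes to the idle slow node instead of queueing behind them.
        fmt.Println(assign(jobs, nodes))
    }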
We got much better utilization of our calculation "grid" using this technique but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment so we had a lot more control. Adding in an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity with possible diminishing returns...
The problem you are talking about has been tackled in many different ways in the context of Grid computing (e.g., see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e. is there unexpected external load, etc.?]).
Implementing an adaptive job dispatcher will usually also require adjusting the frequency with which you probe the available resources (otherwise the overhead due to probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).