Networked applications often benefit from the ability to estimate the bandwidth between two end-points on the Internet. This may be advantageous not only for rate-control purposes, but also for identifying preferred connections where a number of alternatives exist.
Although there are a couple of rigorous treatments of packet-pair probing, a summary of the high-level principles and salient points, covering both the how and the why of the method, would be very beneficial - even if only to serve as a bootstrap to more in-depth study.
Any pointers to implementations or usage of packet-pair probing that serve as good examples would also be much appreciated.
Update:
I found some good introductory material in a USENIX paper derived from work on the nettimer tool - in particular, the discussion concerning the use of cross-talk filters and sampling windows for increased agility makes a lot of sense.
About high-level principles: traditional means of estimating bandwidth send one packet to the target and wait for it to return, then send another packet and wait for its return, and so on, in a sequential way. One then computes some kind of average/median of the round-trip time per kilobyte (or any other unit). This information is then compared against the theoretical maximum bandwidth (when available) to estimate the available unused bandwidth.
Packet-pair probing sends a group of packets to the target at once (i.e., in a parallel way) and waits for them to return. Then a similar average/median is computed and evaluated against the maximum theoretical bandwidth.
If you send more packets at once, you are disturbing the system you are trying to measure and you have to take this into account in your estimates, but it goes faster than the one-by-one method and feels more like a snapshot. The bottom-line question is: what is the trade-off between measurement accuracy and speed of measurement in the two cases? Is there any value in that trade-off?
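To make the mechanics concrete, below is a minimal Python sketch of the classic packet-pair calculation, assuming a cooperating UDP echo service at a placeholder address (hypothetical); real tools such as nettimer take many samples and filter out pairs distorted by cross traffic. Two equal-sized packets are sent back to back, and the spacing between the echoed replies is used to estimate the bottleneck capacity as packet size divided by dispersion.

import socket, time

TARGET, PORT = "192.0.2.1", 9000    # placeholder echo service (assumption)
PACKET_SIZE = 1500                  # bytes per probe packet

def probe_pair():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)
    payload = b"x" * PACKET_SIZE
    # Send two packets back to back so the bottleneck link spaces them out.
    sock.sendto(payload, (TARGET, PORT))
    sock.sendto(payload, (TARGET, PORT))
    # Time the two echoed replies; their spacing reflects the bottleneck,
    # not the sender.
    sock.recvfrom(PACKET_SIZE + 64)
    t1 = time.perf_counter()
    sock.recvfrom(PACKET_SIZE + 64)
    t2 = time.perf_counter()
    dispersion = t2 - t1
    return (PACKET_SIZE * 8) / dispersion   # capacity estimate in bits/second

samples = sorted(probe_pair() for _ in range(10))
print("median estimate: %.0f bit/s" % samples[len(samples) // 2])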
I have written a program for bandwidth estimation using the packet-pair method. If anyone wants to have a look at it, I will be happy to share it.
EDIT:
Here is how I implemented it in a class assignment:
https://github.com/npbendre/Bandwidth-Estimation-using-Packet-Pair-Probing-Algorithm
Hope this helps!
I am currently testing the performance of two different messaging systems, and there are many independent tuning and load parameters that influence the message rate. I would like to develop an automated method that would produce the maximum messages-per-second rate on given hardware.
CONSTANTS:
# of CPUs, memory, network and other hardware parameters are constant (or at least there is a known maximum).
Versions of the OS and the messaging products are constant (WebSphere MQ 7.5, for instance)
Message size
3 minutes to complete one test measurement
DESIRED RESULT:
Maximum rate of messages per second
VARIABLES:
OS tuning options (kernel, TCP and other network and memory stuff)
Messaging software tuning settings - of which there are several dozen (buffer sizes of all sorts, number of threads, number of Queue Managers, number of queues, different kinds of bindings for clients, security settings, log types, number and sizes of log and data files, etc.)
Number of message producers and consumers (say a range from 1 to 10,000)
Currently performance tuning is done as an iterative process - change one variable at a time, measure msg/sec, rinse and repeat.
I have developed a script framework that can iterate over different settings, but if I do this as a nested loop that is 50 levels deep and iterate over every possible value for each variable, the process will take hundreds of years to complete. Including a random factor might help, but what I would like to do is use a neural network to find the optimum settings for my tuning variables.
I found some research papers that talk about this kind of thing, but could not find any real implementations. Has anybody done it? Any recommendations on how to proceed? I have not worked with neural networks, but have done mathematical programming (optimization) in university and normal optimization methods will take too long to solve this problem.
This kind of problem doesn't sound very suitable for neural networks. Neural networks are probably most often used as classifiers with supervised learning: typically, you have to have a training set with examples that are already classified. There are some unsupervised approaches too, most notably perhaps the deep neural networks, but that is an active area of research and I doubt there are ready-made tools for what you need.
You have a problem with combinatorial explosion in the parameter space. That kind of problem is often well suited to evolutionary algorithms, for example genetic algorithms or simulated annealing.
With genetic algorithms the problem solving shifts to finding
a good problem representation that allows you to do crossover and mutations,
a suitable fitness function.
This often proves to be very difficult.
Your problem, however, seems to be very suitable for this method:
individual solution representation: a list of parameters i = [p_1, p_2, ..., p_n]
crossover of i_1 and i_2: choose a splitting point between 1 and n and exchange the respective parts of the individuals' lists
mutation: choose a parameter and adjust it
fitness function: rate of messages per second
It's not guaranteed that you would get an optimal solution this way but it can help you to combat the combinatorial explosion quite effectively.
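For concreteness, here is a rough Python sketch of such a genetic loop, assuming a measure_throughput(params) function (hypothetical) that applies one parameter list, runs a single 3-minute test, and returns the measured messages per second; the parameter ranges below are placeholders. Keep in mind that every fitness evaluation is a full test run, so the population size and generation count have to stay small.

import random

PARAM_RANGES = [(1, 64), (1024, 1048576), (1, 10000)]   # e.g. threads, buffer size, clients (placeholders)

def random_individual():
    return [random.randint(lo, hi) for lo, hi in PARAM_RANGES]

def crossover(a, b):
    # Single split point; swap the tails of the two parameter lists.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind, rate=0.1):
    # Occasionally re-draw one parameter within its allowed range.
    return [random.randint(lo, hi) if random.random() < rate else v
            for v, (lo, hi) in zip(ind, PARAM_RANGES)]

def evolve(measure_throughput, pop_size=10, generations=20):
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness = messages per second from one real test run per individual.
        scored = sorted(population, key=measure_throughput, reverse=True)
        parents = scored[:pop_size // 2]            # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            c1, c2 = crossover(*random.sample(parents, 2))
            children += [mutate(c1), mutate(c2)]
        population = parents + children[:pop_size - len(parents)]
    return max(population, key=measure_throughput)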
What is a baseline and what is a benchmark? What is the best definition for these, and how do you baseline a set of numbers and benchmark another set?
Interesting definitions from SPR (Software Productivity Research)
Baseline and benchmark are similar but distinct activities.
Figuratively, a baseline is a "line in the sand" for an organization whereby it measures important performance characteristics for future reference.
This is not necessarily a "good" state, just a reference.
A benchmark is best understood by way of the original derivation of the word itself:
Tradesmen engaged in repetitive tasks, such as sawing lumber to consistent lengths, often placed notches on their workbenches to indicate placement of boards prior to cutting. Literally, a benchmark became a standard for comparison and an indicator of past success.
Basically:
a baseline is about identifying a significant state: your set of numbers has met an approval status that is publicly recognized.
a benchmark is about assessing the relative performance of an application.
Hi Gagneet, I'm on the Windows performance team: here is how we use these terms.
A baseline is a measurement of a known configuration that is used as a reference for subsequent measurements. For a baseline, we characterize the thing being measured: let's take cold boot time as an example. Here we have a set of machines that are well characterized - this means we know how they work, that we have good drivers for them, and that the hardware isn't broken or flawed.
On this hardware, we have several "baseline" measurements such as XP-RTM, XP-SP2, Vista-RTM, Vista-SP1, Vista-SP2, etc.
For each of these baselines, we have a set of well-characterized and understood measurements, including all the phases of boot, the amount of CPU, disk and memory utilization, the number of DLL loads, etc.
After a baseline is established, we can then take other measurements and compare them to the baseline. For example, we are currently working on Windows 7. For each (daily) build we run a set of boot-time tests. We compare all the characteristics of each Windows 7 build to the baseline measurements, including all the previous Windows 7 builds. This lets us see where the differences lie and helps us drill into the problem areas. Here are some more details.
In scientific research, a benchmark is a kind of test and a baseline is a kind of result.
Let's look at an example of a benchmark test: we might take a collection of 5,000 sentences in English and use the lab's four-core Dell machine to translate them into Spanish using various algorithms. Because we've kept the data and the machine constant, we can meaningfully compare the time taken by the different algorithms to complete the task, as well as their relative accuracy (measured against gold-standard human translations).
To find a baseline for this benchmark test, we might write a very naive translation algorithm that just finds the commonest translation for each individual word, with no regard for the context. Measuring the accuracy of this algorithm against our human translations gives us an idea of the minimum score - the baseline - that the others must beat, and gives us a feel for what level of accuracy counts as "good".
At the other end of the scale from a baseline, an upper bound is a useful yardstick too. In the translation example, we might find the upper bound by measuring the accuracy of one of our human translations with respect to the others. This gives us an idea of how high it's possible to get on our "accuracy" measure before you hit the ceiling of human disagreement. We expect our machine translation algorithms to perform at a level between the baseline and the upper bound.
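As a toy illustration of the word-by-word baseline (using a made-up three-word lexicon; the data is purely hypothetical), the translator and a crude per-word accuracy score might look like this:

# Commonest translation per word, ignoring context (hypothetical lexicon).
lexicon = {"the": "el", "cat": "gato", "sleeps": "duerme"}

def baseline_translate(sentence):
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())

def word_accuracy(hypothesis, reference):
    # Crude score: fraction of positions where the words match.
    hyp, ref = hypothesis.split(), reference.split()
    return sum(h == r for h, r in zip(hyp, ref)) / max(len(ref), 1)

print(word_accuracy(baseline_translate("The cat sleeps"), "el gato duerme"))   # 1.0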
Correct me if I'm wrong, but I believe "baseline" refers to a known good state, while "benchmark" refers to the current state. You would do a benchmark and compare it to the baseline.
This question is about a whole class of similar problems, but I'll ask it as a concrete example.
I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.
It doesn't really matter what it is -- it might, for example, be a queue of "work".
During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:
Some other (possibly external) component that adds work may run out of control
Some component that removes work seizes up, but remains undetected
The statistical characteristics of the process are basically unknown.
What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, to avoid numbing the brain of the sysadmin who gets the alarm.
I appreciate that there are alternative solutions like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times the space wasn't enough.
Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.
I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.
There are three cases, I think, of interest, not in order:
The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.
It's the last one that's (I think) most interesting, and challenging, from a sysadmin's point of view.
If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect whether there is a statistically significant rising trend in resource usage that is likely to lead to problems if it continues. You may also be able to predict how long it must continue before it becomes a problem: set a threshold for "problem" and use the slope of the trend to determine how long it will take to reach it. You would have to play around with this, and with the variables you collect, to see whether there is any statistically significant relationship you can discover in the first place.
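As a rough sketch of that idea (assuming evenly spaced samples of free space; the numbers below are made up), you could fit a least-squares line to the recent history and project when it would cross a threshold:

import numpy as np

def time_to_exhaustion(free_bytes, sample_interval_s, threshold=0.0):
    t = np.arange(len(free_bytes)) * sample_interval_s
    slope, intercept = np.polyfit(t, free_bytes, 1)   # least-squares trend line
    if slope >= 0:
        return None                 # not shrinking; no projected exhaustion
    # Solve intercept + slope * t = threshold, then express it relative to now.
    t_hit = (threshold - intercept) / slope
    return max(t_hit - t[-1], 0.0)

samples = [900e9, 880e9, 850e9, 830e9, 790e9]               # free bytes (made up)
print(time_to_exhaustion(samples, sample_interval_s=3600))  # seconds until projected full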
Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.
edit: as per my comment I think the problem is somewhat analogous to the GW problem. You have short term bursts of activity which average out to zero, and long term trends superimposed that you are interested in. Also there is probably more than one long term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this then you could perhaps identify a significant change in the trend. Unfortunately it may only be identifiable after the fact, as you may need to accumulate a lot of data to get significance. But it might still be in time to head off resource depletion. At least it may give you a robust way to determine what kind of safety margin and resources in reserve you need in future.
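Lacking the exact post, here is a rough sketch of one way to approximate that idea; the window size and jump threshold are assumptions to tune, not anything taken from Tamino. It fits short local regressions along the series and flags points where the local slope changes abruptly.

import numpy as np

def local_slopes(y, window):
    # Slope of a least-squares line fitted to each sliding window.
    return [np.polyfit(np.arange(window), y[i:i + window], 1)[0]
            for i in range(len(y) - window + 1)]

def trend_change_points(y, window=10, threshold=2.0):
    slopes = local_slopes(y, window)
    jumps = np.abs(np.diff(slopes))
    cutoff = threshold * (np.median(jumps) + 1e-12)
    return [i for i, j in enumerate(jumps) if j > cutoff]

# Toy usage: a flat series that starts draining halfway through.
series = [100.0] * 50 + [100.0 - 2.0 * i for i in range(50)]
print(trend_change_points(series))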
Generally speaking what do you get out of extending an artificial neural net by adding more nodes to a hidden layer or more hidden layers?
Does it allow for more precision in the mapping, or does it allow for more subtlety in the relationships it can identify, or something else?
There's a very well-known result in machine learning which states that a single hidden layer is enough to approximate any smooth, bounded function (the paper was called "Multilayer feedforward networks are universal approximators" and it's now almost 20 years old). There are several things to note, however.
The single hidden layer may need to be arbitrarily wide.
This says nothing about the ease with which an approximation may be found; in general, large networks are hard to train properly and fall victim to overfitting quite frequently (the exception is so-called "convolutional neural networks", which really are only meant for vision problems).
This also says nothing about the efficiency of the representation. Some functions require exponential numbers of hidden units if done with one layer but scale much more nicely with more layers (for more discussion of this read Scaling Learning Algorithms Towards AI)
The problem with deep neural networks is that they're even harder to train. You end up with very very small gradients being backpropagated to the earlier hidden layers and the learning not really going anywhere, especially if weights are initialized to be small (if you initialize them to be of larger magnitude you frequently get stuck in bad local minima). There are some techniques for "pre-training" like the ones discussed in this Google tech talk by Geoff Hinton which attempt to get around this.
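To make the single-hidden-layer point above concrete, here is a small sketch; it is not how real networks are trained (the hidden weights stay random and only the output layer is fit by least squares), but it shows that a modestly wide tanh layer can approximate a smooth function such as sin(x):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

n_hidden = 50
W = rng.normal(size=(1, n_hidden))     # random input-to-hidden weights
b = rng.normal(size=n_hidden)
H = np.tanh(x @ W + b)                 # hidden-layer activations

# Fit only the output weights with least squares.
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
y_hat = H @ w_out
print("max abs error:", float(np.max(np.abs(y_hat - y))))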
This is a very interesting question, but it's not so easy to answer. It depends on the problem you are trying to solve and on the kind of neural network you are trying to use. There are several neural network types.
In general it's not so clear that more nodes equals more precision. Research suggests that in most cases you need only one hidden layer. The number of nodes should be the minimum number required to solve the problem. If you don't have enough of them, you will not reach a solution.
On the other hand, once you have reached the number of nodes sufficient to solve the problem, you can add more and more of them and you will not see any further progress in the results.
That's why there are so many types of neural networks: they try to solve different types of problems. So you have NNs to solve static problems, NNs to solve time-related problems, and so on. The number of nodes is not as important as their design.
Having a hidden layer means you are creating combined features of the input. So, is the problem better tackled by more features of the existing input, or through higher-order features that come from combining existing features? This is the trade-off for a standard feed-forward network.
You have a theoretical reassurance that any function can be represented by a neural network with two hidden layers and non-linear activation.
Also, consider using additional resources for boosting, instead of adding more nodes, if you're not certain of the appropriate topology.
Very rough rules of thumb
generally more elements per layer for bigger input vectors.
more layers may let you model more non-linear systems.
If the kind of network you are using has delays in propagation, more layers may allow modelling of time series. Take care to have time jitter in the delays or it won't work very well. If this is just gobbledegook to you, ignore it.
More layers let you insert recurrent features. This can be very useful for discrimination tasks. Your ANN implementation may not permit this.
HTH
The number of units per hidden layer accounts for the ANN's potential to describe an arbitrarily complex function. Some (complicated) functions may require many hidden nodes, or possibly more than one hidden layer.
When a function can be roughly approximated by a certain number of hidden units, any extra nodes will provide more accuracy - but this is only true if the training samples used are enough to justify this addition; otherwise what will happen is "overconvergence". Overconvergence means that your ANN has lost its generalization abilities because it has overemphasized the particular samples.
In general it is best to use the fewest hidden units possible, if the resulting network can give good results. The additional training patterns required to justify more hidden nodes cannot be found easily in most cases, and accuracy is not the NNs' strong point.
In general, what kinds of design decisions help an application scale well?
(Note: Having just learned about Big O Notation, I'm looking to gather more principles of programming here. I've attempted to explain Big O Notation by answering my own question below, but I want the community to improve both this question and the answers.)
Responses so far
1) Define scaling. Do you need to scale for lots of users, traffic, objects in a virtual environment?
2) Look at your algorithms. Will the amount of work they do scale linearly with the actual amount of work - i.e. number of items to loop through, number of users, etc?
3) Look at your hardware. Is your application designed such that you can run it on multiple machines if one can't keep up?
Secondary thoughts
1) Don't optimize too much too soon - test first. Maybe bottlenecks will happen in unforeseen places.
2) Maybe the need to scale will not outpace Moore's Law, and maybe upgrading hardware will be cheaper than refactoring.
The only thing I would say is write your application so that it can be deployed on a cluster from the very start. Anything above that is a premature optimisation. Your first job should be getting enough users to have a scaling problem.
Build the code as simple as you can first, then profile the system second and optimise only when there is an obvious performance problem.
Often the figures from profiling your code are counter-intuitive; the bottlenecks tend to reside in modules you didn't think would be slow. Data is king when it comes to optimisation. If you optimise the parts you think will be slow, you will often optimise the wrong things.
Ok, so you've hit on a key point in using the "big O notation". That's one dimension that can certainly bite you in the rear if you're not paying attention. There are also other dimensions at play that some folks don't see through the "big O" glasses (but if you look closer they really are).
A simple example of that dimension is a database join. There are "best practices" in constructing, say, a left inner join which will help to make the sql execute more efficiently. If you break down the relational calculus or even look at an explain plan (Oracle) you can easily see which indexes are being used in which order and if any table scans or nested operations are occurring.
The concept of profiling is also key. You have to be instrumented thoroughly and at the right granularity across all the moving parts of the architecture in order to identify and fix any inefficiencies. Say for example you're building a 3-tier, multi-threaded, MVC2 web-based application with liberal use of AJAX and client side processing along with an OR Mapper between your app and the DB. A simplistic linear single request/response flow looks like:
browser -> web server -> app server -> DB -> app server -> XSLT -> web server -> browser JS engine execution & rendering
You should have some method for measuring performance (response times, throughput measured in "stuff per unit time", etc.) in each of those distinct areas, not only at the box and OS level (CPU, memory, disk i/o, etc.), but specific to each tier's service. So on the web server you'll need to know all the counters for the web server you're using. In the app tier, you'll need that plus visibility into whatever virtual machine you're using (jvm, clr, whatever). Most OR mappers manifest inside the virtual machine, so make sure you're paying attention to all the specifics if they're visible to you at that layer. Inside the DB, you'll need to know everything that's being executed and all the specific tuning parameters for your flavor of DB. If you have big bucks, BMC Patrol is a pretty good bet for most of it (with appropriate knowledge modules (KMs)). At the cheap end, you can certainly roll your own but your mileage will vary based on your depth of expertise.
Presuming everything is synchronous (no queue-based things going on that you need to wait for), there are tons of opportunities for performance and/or scalability issues. But since your post is about scalability, let's ignore the browser except for any remote XHR calls that will invoke another request/response from the web server.
So given this problem domain, what decisions could you make to help with scalability?
Connection handling. This is also bound to session management and authentication. That has to be as clean and lightweight as possible without compromising security. The metric is maximum connections per unit time.
Session failover at each tier. Necessary or not? We assume that each tier will be a cluster of boxes horizontally under some load balancing mechanism. Load balancing is typically very lightweight, but some implementations of session failover can be heavier than desired. Also whether you're running with sticky sessions can impact your options deeper in the architecture. You also have to decide whether to tie a web server to a specific app server or not. In the .NET remoting world, it's probably easier to tether them together. If you use the Microsoft stack, it may be more scalable to do 2-tier (skip the remoting), but you have to make a substantial security tradeoff. On the Java side, I've always seen it at least 3-tier. No reason to do it otherwise.
Object hierarchy. Inside the app, you need the cleanest possible, lightest weight object structure possible. Only bring the data you need when you need it. Viciously excise any unnecessary or superfluous getting of data.
OR mapper inefficiencies. There is an impedance mismatch between object design and relational design. The many-to-many construct in an RDBMS is in direct conflict with object hierarchies (person.address vs. location.resident). The more complex your data structures, the less efficient your OR mapper will be. At some point you may have to cut bait in a one-off situation and do a more...uh...primitive data access approach (Stored Procedure + Data Access Layer) in order to squeeze more performance or scalability out of a particularly ugly module. Understand the cost involved and make it a conscious decision.
XSL transforms. XML is a wonderful, normalized mechanism for data transport, but man can it be a huge performance dog! Depending on how much data you're carrying around with you and which parser you choose and how complex your structure is, you could easily paint yourself into a very dark corner with XSLT. Yes, academically it's a brilliantly clean way of doing a presentation layer, but in the real world there can be catastrophic performance issues if you don't pay particular attention to this. I've seen a system consume over 30% of transaction time just in XSLT. Not pretty if you're trying to ramp up 4x the user base without buying additional boxes.
Can you buy your way out of a scalability jam? Absolutely. I've watched it happen more times than I'd like to admit. Moore's Law (as you already mentioned) is still valid today. Have some extra cash handy just in case.
Caching is a great tool to reduce the strain on the engine (increasing speed and throughput is a handy side-effect). It comes at a cost though in terms of memory footprint and complexity in invalidating the cache when it's stale. My decision would be to start completely clean and slowly add caching only where you decide it's useful to you. Too many times the complexities are underestimated and what started out as a way to fix performance problems turns out to cause functional problems. Also, back to the data usage comment. If you're creating gigabytes worth of objects every minute, it doesn't matter if you cache or not. You'll quickly max out your memory footprint and garbage collection will ruin your day. So I guess the takeaway is to make sure you understand exactly what's going on inside your virtual machine (object creation, destruction, GCs, etc.) so that you can make the best possible decisions.
Sorry for the verbosity. Just got rolling and forgot to look up. Hope some of this touches on the spirit of your inquiry and isn't too rudimentary a conversation.
Well there's this blog called High Scalability that contains a lot of information on this topic. Some useful stuff.
Often the most effective way to do this is by a well thought through design where scaling is a part of it.
Decide what scaling actually means for your project. Is it an infinite number of users? Is it being able to handle a Slashdotting on a website? Is it development cycles?
Use this to focus your development efforts.
Jeff and Joel discuss scaling in the Stack Overflow Podcast #19.
FWIW, most systems will scale most effectively by ignoring this until it's a problem - Moore's law is still holding, and unless your traffic is growing faster than Moore's law does, it's usually cheaper to just buy a bigger box (at $2K or $3K a pop) than to pay developers.
That said, the most important place to focus is your data tier; that is the hardest part of your application to scale out, as it usually needs to be authoritative, and clustered commercial databases are very expensive- the open source variations are usually very tricky to get right.
If you think there is a high likelihood that your application will need to scale, it may be intelligent to look into systems like memcached or MapReduce relatively early in your development.
One good idea is to determine how much work each additional task creates. This can depend on how the algorithm is structured.
For example, imagine you have some virtual cars in a city. At any moment, you want each car to have a map showing where all the cars are.
One way to approach this would be:
# assumes cars is a list of car objects, each with a position and a map
for car in cars:
    position = car.determine_position()
    for other_car in cars:
        other_car.map.append(position)   # N passes over a list of N cars
This seems straightforward: look at the first car's position, add it to the map of every other car. Then look at the second car's position, add it to the map of every other car. Etc.
But there is a scalability problem. When there are 2 cars, this strategy takes 4 "add my position" steps; when there are 3 cars, it takes 9 steps. For each "position update," you have to cycle through the whole list of cars - and every car needs its position updated.
Ignoring how many other things must be done to each car (for example, it may take a fixed number of steps to calculate the position of an individual car), for N cars, it takes N² "visits to cars" to run this algorithm. This is no problem when you've got 5 cars and 25 steps. But as you add cars, you will see the system bog down. 100 cars will take 10,000 steps, and 101 cars will take 10,201 steps!
A better approach would be to undo the nesting of the for loops.
# first pass: each car publishes its position to a shared list
positions = [car.determine_position() for car in cars]
# second pass: each car receives one copy of the master list
for car in cars:
    car.map = list(positions)
With this strategy, the number of steps is a multiple of N, not of N². So 100 cars will take 100 times the work of 1 car - NOT 10,000 times the work.
This concept is sometimes expressed in "big O notation" - the number of steps needed is "big O of N" or "big O of N²."
Note that this concept is only concerned with scalability - not optimizing the number of steps for each car. Here we don't care if it takes 5 steps or 50 steps per car - the main thing is that N cars take (X * N) steps, not (X * N²).