Which distributions can be used to generate job start times when no real observations are available?

I need to generate start times for a set of jobs (30 jobs). I have no way of getting real data, so how can I generate data that resembles a plausible distribution? Which distribution would be a good choice in this case?

A common technique used in simulation models where you don't have any data yet (e.g., data is very expensive, or it's a prospective system that does not even exist yet so where would you get the data from?) is to use a triangular distribution parameterized by subject matter experts (or your own best guesses) about the smallest, largest, and most common value you might see.
A relatively new, but quite powerful extension to this would be to vary the parameter choices in a designed set of experiments to see how much it matters if your guesstimates are off. A well-designed experiment would allow you to assess and characterize how much your results change as a function of the parameter values.
A more comprehensive variant would be to incorporate the distribution choice itself (triangle vs exponential vs anything else you think is plausible) into the design, to see whether that makes much of a difference. In the happy event that it doesn't, you can freely use a simple and convenient distribution choice such as the triangle; if it makes a big difference, you now know for certain that you should get your hands on real data ASAP, because without that data-based knowledge you're operating in a garbage-in-garbage-out mode. This also assumes that you control for, say, the first two moments as you switch between distribution choices, so that your experiments are testing the shape of the distribution rather than the effect of its mean and variance.
If designed experiments tell you it doesn't much matter, that's wonderful news. If it does matter, you now know more about the system than you did before and know where to focus your efforts going forward.
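As a minimal sketch of the triangular approach described above, the following draws 30 start times using Python's standard `random.triangular`. The function names and the guessed parameters (earliest 0, latest 8, most common 2, in hours) are illustrative assumptions, not values from the question. The second helper gives the closed-form mean and variance of a triangular distribution, which is what you would need when matching the first two moments across distribution choices.

```python
import random

def generate_start_times(n_jobs, earliest, latest, most_likely, seed=None):
    """Draw n_jobs start times from a triangular distribution and sort them."""
    rng = random.Random(seed)
    # random.triangular takes (low, high, mode), in that order
    return sorted(rng.triangular(earliest, latest, most_likely)
                  for _ in range(n_jobs))

def triangular_mean_var(low, high, mode):
    """Closed-form mean and variance of a triangular distribution."""
    mean = (low + high + mode) / 3
    var = (low**2 + high**2 + mode**2
           - low * high - low * mode - high * mode) / 18
    return mean, var

# Hypothetical guesses: jobs start between hour 0 and hour 8,
# most often around hour 2.
start_times = generate_start_times(30, 0.0, 8.0, 2.0, seed=1)
```

Rerunning this with different guesses for the three parameters is the simplest form of the sensitivity experiment the answer recommends.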

Related

Uncertainty versus randomness

I would like to know the difference between uncertainty and randomness in mathematical terms. I tried to find it but got confused, as some people say they are the same. Can anyone provide the logical reasoning behind this? If they are not the same, please explain why.
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increase the number of people you look at by a factor of 100, you reduce the standard error, the uncertainty, by a factor of 10.
I bet you can tell that people who take polls for a living care about this stuff very much.
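To make the factor-of-10 claim concrete, here is a small sketch. The standard error of a sample mean is the population standard deviation divided by the square root of the sample size; the simulation repeats the "pick n people" experiment many times and measures the spread of the resulting sample means. The weight distribution (mean 70, standard deviation 15) is an assumption made up for illustration.

```python
import math
import random
import statistics

def standard_error(population_sd, n):
    """Theoretical standard error of the mean of n independent samples."""
    return population_sd / math.sqrt(n)

def simulated_standard_error(n, trials=2000, seed=0):
    """Repeat the 'average n random people' experiment and measure the spread."""
    rng = random.Random(seed)
    sample_means = [
        statistics.fmean([rng.gauss(70, 15) for _ in range(n)])  # assumed weights
        for _ in range(trials)
    ]
    return statistics.stdev(sample_means)

# 100 times more people gives a 10 times smaller standard error
ratio = standard_error(15, 10) / standard_error(15, 1000)
```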
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naivety about statistics is definitely a serious programming problem.
My view looks at a scenario using three different coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three different coloured balls.
This remains true when each ball has an equal chance of being selected, i.e. equal probabilities. However, things soon get complex when each ball has its own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development, where the highest probability would almost always be selected, compromising the meaning of randomness.
Having said all of this - I believe these concepts remain confusing which has just made me realise the time I need to dedicate on clearly distinguishing between the two to make sure my current research is not confusing. My own predicament is that I need to work on stochastic vs deterministic views. Based on the current view stochastic would be more uncertain than random whereas deterministic would be more probability based i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and loses its meaning as soon as the ball is picked, which should result in its probability being revised. I personally think the terms have theoretical differences, which perhaps still allow them to be used interchangeably.
Uncertainty in math and science typically means there is a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.

Search space data

I was wondering if anyone knew of a source which provides 2D model search spaces to test a GA against. I believe I read a while ago that there are a bunch of standard search spaces which are typically used when evaluating these types of algorithms.
If not, is it just a case of randomly generating this data yourself each time?
Edit: View from above and from the side.
The search space is completely dependent on your problem. The idea of a genetic algorithm is that you modify the "genome" of a population of individuals to create the next generation, measure the fitness of the new generation, and modify the genomes again, with some randomness thrown in to try to prevent getting stuck in local minima. The search space, however, is completely determined by what you have in your genome, which in turn is completely determined by what the problem is.
There might be standard search spaces (i.e. genomes) that have been found to work well for particular problems (I haven't heard of any) but usually the hardest part in using GAs is defining what you have in your genome and how it is allowed to mutate. The usefulness comes from the fact that you don't have to explicitly declare all the values for the different variables for the model, but you can find good values (not necessarily the best ones though) using a more or less blind search.
EXAMPLE
One example used quite heavily is the evolved radio antenna (Wikipedia). The aim is to find a configuration for a radio antenna such that the antenna itself is as small and lightweight as possible, with the restriction that it has to respond to certain frequencies and have low noise and so on.
So you would build your genome specifying
the number of wires to use
the number of bends in each wire
the angle of each bend
maybe the distance of each bend from the base
(something else, I don't know what)
run your GA, see what comes out the other end, analyse why it didn't work. GAs have a habit of producing results you didn't expect because of bugs in the simulation. Anyhow, you discover that maybe the genome has to encode the number of bends individually for each of the wires in the antenna, meaning that the antenna isn't going to be symmetric. So you put that in your genome and run the thing again. Simulating stuff that needs to work in the physical world is usually the most expensive because at some point you have to test the individual(s) in the real world.
There's a reasonable tutorial of genetic algorithms here with some useful examples about different encoding schemes for the genome.
One final point: when people say that GAs are simple and easy to implement, they mean that the framework around the GA (generating a new population, evaluating fitness etc.) is simple. What usually is not said is that setting up a GA for a real problem is very difficult and usually requires a lot of trial and error, because coming up with an encoding scheme that works well is not simple for complex problems. The best way to start is to start simple and make things more complex as you go along. You can of course make another GA to come up with the encoding for the first GA :).
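To illustrate how small the "framework around the GA" is, here is a minimal sketch, assuming a real-valued genome, truncation selection, uniform crossover, Gaussian mutation, and single-individual elitism. All of these are illustrative choices rather than the answer's method, and `toy_fitness` is a made-up stand-in for a real problem, where the encoding is the hard part.

```python
import random

def toy_fitness(genome):
    """Hypothetical fitness to minimise: squared distance from the origin."""
    return sum(x * x for x in genome)

def run_ga(fitness, genome_len=2, pop_size=30, generations=50,
           mutation_sd=0.3, bounds=(-5.0, 5.0), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(genome_len)]
           for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]   # truncation selection
        children = [pop[0][:]]           # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            # Gaussian mutation, clamped to the search bounds
            child = [min(hi, max(lo, g + rng.gauss(0, mutation_sd)))
                     for g in child]
            children.append(child)
        pop = children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = run_ga(toy_fitness)
```

Because the best individual is always carried over, the recorded best fitness never gets worse from one generation to the next.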
There are several standard benchmark problems out there.
BBOB (Black Box Optimization Benchmarks) -- have been used in recent years as part of a continuous optimization competition
DeJong functions -- pretty old, and really too easy for most practical purposes these days. Useful for debugging perhaps.
ZDT/DTLZ multiobjective functions -- multi-objective optimization problems, but you could scalarize them yourself I suppose.
Many others
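For a feel of what such benchmark functions look like, here is a sketch of three widely used ones written for any dimension (so they work as 2D search spaces). The sphere and Rosenbrock functions are the first two De Jong functions; Rastrigin is a common multimodal test function. These are the standard textbook formulas, not code taken from any benchmark suite.

```python
import math

def sphere(x):
    """De Jong F1: smooth and unimodal, global minimum 0 at the origin."""
    return sum(v * v for v in x)

def rosenbrock(x):
    """De Jong F2: narrow curved valley, global minimum 0 at (1, ..., 1)."""
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (1 - x[i]) ** 2
               for i in range(len(x) - 1))

def rastrigin(x):
    """Many regularly spaced local minima, global minimum 0 at the origin."""
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)
```

Any of these can be passed as the fitness function to a GA that minimises, which makes them handy for debugging before moving to a real problem.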

Why is average so popular when measuring application performance

When measuring application performance (response time, for example) it's very common to come across averages (the mean). ab, httperf and a bunch of other utilities report the mean and standard deviation. But from a theoretical point of view this doesn't make a lot of sense to me. Here is why.
The mean is good at describing a symmetrically distributed population, because for a symmetrical distribution the mean is equal to the population mode and expected value. But response times are not distributed symmetrically. They are more like exponential. In this case the average tells us nothing.
It's more convenient to work with percentile values, which tell us what response time we can expect for what percentage of responses.
Am I missing something, or is the mean popular just because it's very simple to calculate?
All kinds of tools get their features not necessarily from what makes sense, but from users' expectations.
You're absolutely right that the distributions are non-negative and heavily skewed, and that percentiles would be more informative.
Alternatively, a distribution more like lognormal or chi-square would be a little better.
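A quick sketch of the difference between the mean and percentiles on skewed data. The response times here are synthetic, drawn from a lognormal distribution with made-up parameters, and `percentile` is a simple nearest-rank helper written for this example rather than a library function.

```python
import math
import random
import statistics

def percentile(sorted_values, p):
    """Nearest-rank percentile (0 < p <= 100) of an already sorted list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

rng = random.Random(42)
# Synthetic response times in milliseconds (assumed lognormal, made-up parameters)
response_ms = sorted(rng.lognormvariate(4.0, 1.0) for _ in range(10_000))

summary = {
    "mean": statistics.fmean(response_ms),
    "p50": percentile(response_ms, 50),
    "p95": percentile(response_ms, 95),
    "p99": percentile(response_ms, 99),
}
```

On data this skewed the mean sits well above the median, so reporting it alone hides both the typical request and the slow tail.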
Yes, you are missing something.
The whole point of descriptive statistics is to present a few numbers to describe (or represent or model or ...) a large number of numbers. They aid the comprehension of large datasets, the extraction of information from data, the approximate comparison of datasets whose exact comparison is large and bewildering to the limitations of the human mind.
But no single descriptive statistic is always fit for all purposes, and no one is dictating to you that you must or should or ought to use the mean. If it doesn't suit your purposes, use something else.
As it happens you are quite wrong to write "They are more like exponential. In this case average tells us nothing." For an exponential distribution with rate parameter lambda the mean is simply 1/lambda, so the mean tells you everything about an exponential distribution.
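To spell that out: if response times really were exponential, every percentile would follow from the mean alone, because the p-th percentile of an exponential distribution with rate lambda is -ln(1 - p/100) / lambda, and 1/lambda is the mean. A minimal sketch, with a function name chosen for this example:

```python
import math

def exponential_percentile(mean, p):
    """p-th percentile (0 <= p < 100) of an exponential distribution
    with the given mean, i.e. rate lambda = 1 / mean."""
    return -mean * math.log(1 - p / 100)

# With a hypothetical mean response time of 100 ms:
median_ms = exponential_percentile(100, 50)   # mean * ln(2)
p99_ms = exponential_percentile(100, 99)      # mean * ln(100)
```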
I'm not an expert in statistics, but I believe average values are used so much because they are the values that help to measure the scalability of a system.
You need to consider your average values first to know how your system needs to behave under certain workloads, and those need to be predictable; you usually are not very interested in outliers, at least not at first.
Of course you need to look at your min values and peak values to know when your system is going to hit a bottleneck, but the average values show you, as I said, correct and predictable behavior.

Initial Genetic Programming Parameters

I did a little GP (note: very little) work in college and have been playing around with it recently. My question is in regards to the initial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work better when you have a noisy fitness function, I think this is because the growth of sub-groups in the population over successive generations acts to give more sampling of the fitness function. I typically use 100 for less noisy/deterministic functions, 1000+ for noisy.
For number of generations it is best to measure improvements in the fitness function and stop when it meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out, if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an implementation without explicit limits but penalise or eliminate programs that run too long (which is probably what you really care about....). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation / recombination operators don't seem to matter all that much. As long as you have a comprehensive set of mutations, any reasonably balanced distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
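The rules of thumb above could be collected into a settings object along these lines. This is only a sketch: the class, field, and function names are invented for illustration, the defaults simply echo the numbers quoted in the answer, and the stopping rule assumes that lower fitness values are better.

```python
from dataclasses import dataclass

@dataclass
class GPSettings:
    population_size: int = 100    # ~100 for deterministic fitness functions
    max_generations: int = 300    # "a few hundred" before reassessing
    max_nodes: int = 1000         # hard cap on total nodes per program
    target_fitness: float = 0.0   # assumed: lower is better

def settings_for(noisy_fitness):
    """Use a larger population when the fitness function is noisy."""
    return GPSettings(population_size=1000 if noisy_fitness else 100)

def should_stop(generation, best_fitness, settings):
    """Stop once the target is met or the generation budget is spent."""
    return (best_fitness <= settings.target_fitness
            or generation >= settings.max_generations)
```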
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
Any problem in computer science can be solved with another layer of indirection (except for too many layers of indirection.)
-David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data by varying parameters on a very simple problem and link given operators and parameter values (such as mutation rates, etc.) to given results as a function of population size etc.
Once I started getting into GA a bit more I then realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
Talking from my (limited) experience: suppose you simplify the problem by fixing the way crossover and selection are implemented and only play with population size and mutation rate (implemented in a given way), hoping to come up with general results. You'll soon realize that too many variables are still in play. At the end of the day, the number of generations after which you will statistically get a decent result (however you want to define decent) still depends primarily on the problem you're solving, and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of the effect of given GA parameters!).
It is certainly possible to draft a set of guidelines, as the (rare but good) literature proves, but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in the exact same way and the fitness is evaluated in a somehow equivalent way (which more often than not means you're dealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient whereas Koza and others often don't deem it worthy to start a GP run with less than a million individuals in the GP population ;-)
As mentioned before it depends on your personal taste and experiences, resources and probably the GP system used!
Cheers,
Jan

How to get scientific results from non-experimental data (datamining?)

I want to obtain maximum performance out of a process with many variables, many of which cannot be controlled.
I cannot run thousands of experiments, so it'd be nice if I could run hundreds of experiments and
vary many controllable parameters
collect data on many parameters indicating performance
'correct,' as much as possible, for those parameters I couldn't control
Tease out the 'best' values for those things I can control, and start all over again
It feels like this would be called data mining, where you're going through tons of data which doesn't immediately appear to relate, but does show correlation after some effort.
So... Where do I start looking at algorithms, concepts, theory of this sort of thing? Even related terms for purposes of search would be useful.
Background: I like to do ultra-marathon cycling, and keep logs of each ride. I'd like to keep more data, and after hundreds of rides be able to pull out information about how I perform.
However, everything varies - routes, environment (temp, pres., hum., sun load, wind, precip., etc), fuel, attitude, weight, water load, etc, etc, etc. I can control a few things, but running the same route 20 times to test out a new fuel regime would just be depressing, and it would take years to perform all the experiments that I'd like to do. I can, however, record all these things and more (telemetry on bicycle FTW).
It sounds like you want to do some regression analysis. You certainly have plenty of data!
Regression analysis is an extremely common modeling technique in statistics and science. (It could be argued that statistics is the art and science of regression analysis.) There are many statistics packages out there to do the computation you'll need. (I'd recommend one, but I'm years out of date.)
Data mining has gotten a bad name because far too often people assume correlation equals causation. I found that a good technique is to start with variables you know have an influence and build a statistical model around them first. So you know that wind, weight and climb have an influence on how fast you can travel, and statistical software can take your dataset and calculate how strongly each of those factors relates to speed. That will give you a statistical model or linear equation:
speed = x*weight + y*wind + z*climb + constant
When you explore new variables, you will be able to see if the model is improved or not by comparing a goodness of fit metric like R-squared. So you might check if temperature or time of day adds anything to the model.
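A sketch of that workflow in plain Python, with no statistics package assumed: fit the linear model by least squares via the normal equations, then compare R-squared for a small model and a larger one. The ride data is made up, and the speeds are generated from a known linear formula purely so the result can be checked; real logs would be noisy.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(rows, y):
    """Least-squares coefficients [constant, b1, b2, ...] via normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(rows, y, coef):
    """Goodness of fit: 1 - residual sum of squares / total sum of squares."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

# Made-up rides: (weight kg, headwind km/h, climb m); not real ride data
rides = [(70, 0, 100), (72, 5, 300), (68, 10, 50),
         (75, 3, 800), (71, 8, 400), (69, 2, 600)]
speeds = [30 - 0.1 * w - 0.5 * wind - 0.01 * climb for w, wind, climb in rides]

weight_only = [(w,) for w, _, _ in rides]
r2_small = r_squared(weight_only, speeds, fit_linear(weight_only, speeds))
coef_full = fit_linear(rides, speeds)
r2_full = r_squared(rides, speeds, coef_full)
```

Adding a variable never lowers R-squared on the data used for fitting, so a small gain on its own is weak evidence; what matters is whether the improvement is substantial.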
You may want to apply a transformation to your data. For instance, you might find that you perform better on colder days. But really cold days and really hot days might hurt performance. In that case, you could assign temperatures to bins or segments: < 0°C; 0°C to 40°C; > 40°C, or some such. The key is to transform the data in a way that matches a rational model of what is going on in the real world, not just the data itself.
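The binning idea might look like this in code. The bin edges are the ones quoted above; the label names and the indicator encoding are assumptions for illustration. Turning each bin into a 0/1 indicator column lets a linear model give each temperature range its own effect, with one bin left out as the baseline so the columns are not redundant with the constant term.

```python
def temperature_bin(temp_c):
    """Assign a temperature in degrees Celsius to one of three bins."""
    if temp_c < 0:
        return "below_0C"
    if temp_c <= 40:
        return "0_to_40C"
    return "above_40C"

def temperature_indicators(temp_c):
    """0/1 columns for a regression; '0_to_40C' is the omitted baseline."""
    label = temperature_bin(temp_c)
    return (int(label == "below_0C"), int(label == "above_40C"))
```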
In case someone thinks this is not a programming related topic, notice that you can use these same techniques to analyze system performance.
With that many variables you have too many dimensions and you may want to look at Principal Component Analysis. It takes some of the "art" out of regression analysis and lets the data speak for itself. Some software to do that sort of analysis is shown at the bottom of the link.
I have used the Perl module Statistics::Regression for somewhat similar problems in the past. Be warned, however, that regression analysis is definitely an art. As the warning in the Perl module says, it won't make sense to you if you haven't learned the appropriate math.

Resources