Examples of machine code using fuzzy logic (all values in (0,1) interval and the extended set of discrete logical operators) - machine-code

I'm interested in knowing how machine code would look using such a fuzzy system, and what progress has been made in building systems based upon this concept.
In other words, where instead of dealing simply with 0/1 values, and the "and/or/not" primitives, one is using any values on (0,1) interval, and the extended set of operators that would accompany such computation (min, max, product, average, modular difference, mod 1 sum, etc).
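For concreteness, here is a minimal sketch (Python is my choice; the question names no language) of what such an extended operator set looks like on (0, 1) values. The pairing of names with formulas follows one common fuzzy-logic convention and is an assumption, not something the question specifies:

```python
# Fuzzy-logic counterparts of the Boolean primitives, operating on
# values in the (0, 1) interval. The name/formula pairing is one
# common convention, not the only one.

def f_not(a):            # complement: generalizes Boolean NOT
    return 1.0 - a

def f_and(a, b):         # min t-norm: generalizes AND
    return min(a, b)

def f_or(a, b):          # max t-conorm: generalizes OR
    return max(a, b)

def f_prod(a, b):        # product t-norm: another AND-like operator
    return a * b

def f_avg(a, b):         # average: has no Boolean counterpart
    return (a + b) / 2.0

def f_mod_diff(a, b):    # "modular difference": read here as |a - b|;
    return abs(a - b)    # a wrap-around reading would be min(d, 1 - d)

def f_mod_sum(a, b):     # mod-1 sum: wraps around the unit interval
    return (a + b) % 1.0

print(f_and(0.3, 0.8), f_or(0.3, 0.8), f_not(0.3))  # 0.3 0.8 0.7
```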
I've tried endless searches on Google and on here but found nothing that explicitly relates to this seemingly fundamental idea.

Related

XGBoost/LightGBM over-fitting despite no indication in cross-validation test scores?

We aim to identify predictors that may influence the risk of a relatively rare outcome.
We are using a semi-large clinical dataset, with data on nearly 200,000 patients.
The outcome of interest is binary (i.e. yes/no), and quite rare (~ 5% of the patients).
We have a large set of nearly 1,200 mostly dichotomized possible predictors.
Our objective is not to create a prediction model, but rather to use the boosted-trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is possible that none of the candidate predictors we are considering have any influence on the risk of developing the condition, so if we were aiming to develop a prediction model it would likely have been a rather bad one. For this work, we use the R implementations of XGBoost/LightGBM.
We have been having difficulties tuning the models. Specifically, when running cross-validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see the figure below for nrounds = 600,000 from xgboost). This is observed even when increasing the learning rate (eta), or when adding regularization parameters (e.g. max_delta_step, lambda, alpha, gamma, even at high values for these).
As expected, the CV test score is always worse than the train score, but it continues to improve without ever showing a clear sign of overfitting. This is true regardless of the evaluation metric used (the example below is for logloss, but the same is observed for auc/aucpr/error rate, etc.). Relatedly, the same phenomenon is observed when using a grid search to find the optimal tree depth (max_depth): CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of overfitting.
Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
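For reference, a minimal sketch of the stratified CV loop described above, using the Python xgboost API rather than the R implementation the question mentions (data and parameter values are placeholders):

```python
import numpy as np
import xgboost as xgb

# Placeholder data standing in for the clinical dataset:
# ~5% positive outcome, as in the question.
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = (rng.random(1000) < 0.05).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.1,        # learning rate
    "max_depth": 6,
}

# Stratified CV; early_stopping_rounds halts when the CV test score
# stops improving, which is one way to pick nrounds.
cv = xgb.cv(params, dtrain, num_boost_round=5000, nfold=5,
            stratified=True, early_stopping_rounds=50, seed=0)
print("best nrounds:", len(cv))
```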
Are there situations in which overfitting happens despite continuous improvement in the CV test (or test split) scores? If so, why is that, and how would one choose optimal values for the hyperparameters?
Relatedly, again, the idea is not to create a prediction model (since it would be a rather bad one, given that we don't know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees are not the optimal method for this, do other methods come to mind? Again, part of the reason we chose boosted trees was to enable the identification of higher-order (i.e. more than two-way) interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.).
Welcome to Stack Overflow!
In the absence of some code and representative data it is hard to offer more than general suggestions.
Your descriptive statistics step may give some pointers to a starting model.
What does existing theory (if it exists!) suggest about the cause of the medical condition?
Is there a male/female difference or old/young age difference that could help get your foot in the door?
Your medical data has similarities to the fraud detection problem where one is trying to predict rare events usually much rarer than your cases.
It may pay you to check out the use of xgboost/lightgbm in the fraud detection literature.

Machine learning: optimal parameter values in reasonable time

Sorry if this is a duplicate.
I have a two-class prediction model; it has n configurable (numeric) parameters. The model can work pretty well if you tune those parameters properly, but good values for those parameters are hard to find. I used grid search for that (providing, say, m values for each parameter). This yields m^n training runs, which is very time-consuming even when run in parallel on a machine with 24 cores.
I tried fixing all parameters but one and varying only that one parameter (which yields m × n training runs), but it's not obvious to me what to do with the results I got. This is a sample plot of precision (triangles) and recall (dots) for negative (red) and positive (blue) samples:
Simply taking the "winner" value for each parameter obtained this way and combining them doesn't lead to the best (or even good) prediction results. I thought about building a regression on parameter sets with precision/recall as the dependent variable, but I don't think a regression with more than 5 independent variables will be much faster than the grid-search scenario.
What would you propose to find good parameter values, but with reasonable estimation time? Sorry if this has some obvious (or well-documented) answer.
I would use a randomized grid search (pick random values for each of your parameters in a given range that you deem reasonable and evaluate each such randomly chosen configuration), which you can run for as long as you can afford to. This paper runs some experiments that show this is at least as good as a grid search:
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.
For what it's worth, I have used scikit-learn's random grid search for a problem that required optimizing about 10 hyper-parameters for a text classification task, with very good results in only around 1000 iterations.
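A minimal sketch of that approach using scikit-learn's RandomizedSearchCV (the estimator, parameter ranges, and data here are placeholders):

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

# Each parameter gets a distribution to sample from; n_iter random
# configurations are drawn and evaluated by cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 20),
        "max_features": loguniform(0.1, 1.0),
    },
    n_iter=50,          # the budget: run as many as you can afford
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```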
I'd suggest the simplex algorithm (Nelder-Mead) with simulated annealing (a minimal sketch follows this list):
Very simple to use. Simply give it n + 1 points and let it run until it hits a configurable stopping condition (either a number of iterations or convergence).
Implemented in every possible language.
Doesn't require derivatives.
More resilient to local optima than the method you're currently using.
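A minimal sketch using SciPy's plain Nelder-Mead implementation (SciPy does not bundle simulated annealing with the simplex method, so this is the plain variant; the objective is a stand-in for your cross-validated score):

```python
import numpy as np
from scipy.optimize import minimize

def objective(params):
    # Placeholder: in practice, train the model with these parameter
    # values and return a score to minimize (e.g. 1 - F1 on validation).
    a, b = params
    return (a - 0.3) ** 2 + (b - 0.7) ** 2   # dummy quadratic bowl

# SciPy builds the initial simplex of n + 1 points around x0
# automatically (an explicit one can be passed via initial_simplex).
x0 = np.array([0.5, 0.5])
result = minimize(objective, x0,
                  method="Nelder-Mead",
                  options={"maxiter": 200, "xatol": 1e-3})
print(result.x, result.fun)
```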

Comparing two algorithms on a single dataset using total cost - Which statistical test to use?

I have to run three different kinds of comparisons between different data mining algorithms.
The only type of comparison that is problematic for me is the most basic one: two algorithms on a single data set.
I am aware of the Dietterich (1998) paper, which refers to McNemar and 5x2CV as the options of choice and states that the resampled t-test is infeasible. As the analysis forms part of a larger setup using subsamples, 60:40 training:test splits, and total cost as the performance measure, I cannot use those, though.
Which other options are there to evaluate the performance in this case?
Sign test: just count the number of cases where each of the two algorithms performs better, then check the p-value using the binomial distribution. Problematic, as the test is very weak.
Wilcoxon signed-rank test: as the non-parametric alternative to the t-test, it was the first one I thought of, but it is not mentioned in any paper for this kind of comparison, only for comparing two algorithms on several datasets using the average performance over several iterations. Is it infeasible, and if so, why?
One obvious difference between the two is that the Wilcoxon signed-rank test requires you to compute a difference between the two members of a pair and then rank these differences. If the only information you have for each member of a pair is whether the data-mining procedure guessed the class of that member correctly, then there are only three possible signed values (-1, 0, 1), and the Wilcoxon signed-rank test becomes equivalent to the McNemar test, which is in fact simply a way of calculating an approximate tail value of the sign test. If it makes sense to compare the results from the two members of a pair, but not to subtract them and get a number, then again you are back with the sign test.
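To make that concrete, a small sketch of the sign test on discordant pairs (the correctness vectors are hypothetical; SciPy's exact binomial test is my choice of tool):

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical paired results: 1 = algorithm classified the case correctly.
correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
correct_b = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0])

# Only discordant pairs (one right, the other wrong) carry information;
# ties correspond to a signed value of 0 and drop out.
a_wins = int(np.sum((correct_a == 1) & (correct_b == 0)))
b_wins = int(np.sum((correct_a == 0) & (correct_b == 1)))

# Under H0, each discordant pair is a fair coin flip between A and B.
result = binomtest(a_wins, a_wins + b_wins, p=0.5)
print(a_wins, b_wins, result.pvalue)
```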
This sounds like an exercise designed to get you to do a number of statistical tests, but if this were something in real life, my first thought would be to work out why you really care about running the data-mining exercise, perhaps reduce this to a value in terms of money, and then look for the test that represents that best.

How does bootstrapping improve the quality of a phylogenetic reconstruction?

My understanding of bootstrapping is that you:
1. Build a "tree" using some algorithm from a matrix of sequences (nucleotides, let's say).
2. Store that tree.
3. Perturb the matrix from step 1 and rebuild the tree.
My question is: what is the purpose of step 3 from a sequence-bioinformatics perspective? I can try to "guess" that, by changing characters in the original matrix, you can remove artifacts in the data. But I have a problem with that guess: I am not sure why removing such artifacts is necessary. A sequence alignment is supposed to deal with artifacts by finding long stretches of similarity, by its very nature.
Bootstrapping, in phylogenetics as elsewhere, doesn't improve the quality of whatever you're trying to estimate (a tree in this case). What it does do is give you an idea of how confident you can be about the result you get from your original dataset. A bootstrap analysis answers the question "If I repeated this experiment many times, using a different sample each time (but of the same size), how often would I expect to get the same result?" This is usually broken down by edge ("How often would I expect to see this particular edge in the inferred tree?").
Sampling Error
More precisely, bootstrapping is a way of approximately measuring the expected level of sampling error in your estimate. Most evolutionary models have the property that, if your dataset had an infinite number of sites, you would be guaranteed to recover the correct tree and correct branch lengths*. But with a finite number of sites this guarantee disappears. What you infer in these circumstances can be considered to be the correct tree plus sampling error, where the sampling error tends to decrease as you increase the sample size (number of sites). What we want to know is how much sampling error we should expect for each edge, given that we have (say) 1000 sites.
What We Would Like To Do, But Can't
Suppose you used an alignment of 1000 sites to infer the original tree. If you somehow had the ability to sequence as many sites as you wanted for all your taxa, you could extract another 1000 sites from each and perform this tree inference again, in which case you would probably get a tree that was similar but slightly different to the original tree. You could do this again and again, using a fresh batch of 1000 sites each time; if you did this many times, you would produce a distribution of trees as a result. This is called the sampling distribution of the estimate. In general it will have highest density near the true tree. Also it becomes more concentrated around the true tree if you increase the sample size (number of sites).
What does this distribution tell us? It tells us how likely it is that any given sample of 1000 sites generated by this evolutionary process (tree + branch lengths + other parameters) will actually give us the true tree -- in other words, how confident we can be about our original analysis. As I mentioned above, this probability-of-getting-the-right-answer can be broken down by edge -- that's what "bootstrap probabilities" are.
What We Can Do Instead
We don't actually have the ability to magically generate as many alignment columns as we want, but we can "pretend" that we do, by simply regarding the original set of 1000 sites as a pool of sites from which we draw a fresh batch of 1000 sites with replacement for each replicate. This generally produces a distribution of results that is different from the true 1000-site sampling distribution, but for large site counts the approximation is good.
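A minimal sketch of that resampling step, with the alignment as a 2-D character array and the actual tree-inference call left as a placeholder:

```python
import numpy as np

def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Resample alignment columns (sites) with replacement.

    alignment: 2-D array of shape (n_taxa, n_sites).
    Each resampled alignment would then be fed to the same
    tree-inference procedure as the original data.
    """
    rng = np.random.default_rng(seed)
    n_sites = alignment.shape[1]
    for _ in range(n_replicates):
        cols = rng.integers(0, n_sites, size=n_sites)  # with replacement
        yield alignment[:, cols]

# Tiny placeholder alignment: 3 taxa x 8 sites.
aln = np.array([list("ACGTACGT"), list("ACGAACGT"), list("TCGTACGA")])
for rep in bootstrap_replicates(aln, n_replicates=2):
    print("".join(rep[0]))   # first taxon's resampled sites
```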
* That is assuming that the dataset was in fact generated according to this model -- which is something that we cannot know for certain, unless we're doing a simulation. Also some models, like uncorrected parsimony, actually have the paradoxical quality that under some conditions, the more sites you have, the lower the probability of recovering the correct tree!
Bootstrapping is a general statistical technique that has applications outside of bioinformatics. It is a flexible means of coping with small samples, or samples from a complex population (which I imagine is the case in your application.)

Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real-life performance measurements? "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario.
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers, often with a multiple of the average time. (The behavior is understandable considering the factors contributing to the first-execution penalty.)
My goal is to model summary values that reasonably reflect this distribution and can be calculated from running aggregates (as with the Gaussian, where you calculate mu and sigma from N, the sum of values, and the sum of squares). In other words, the number of repetitions is unbounded, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately, and its average is strongly biased by even a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distribution models, and I think I could work something out, but my statistics knowledge is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two-humped" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all the data. However, to accumulate this data in production, the amount of data produced needs to be limited. For analysis of isolated pieces of code, a Gaussian distribution plus a first-run penalty is OK. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
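A minimal sketch tying this to the aggregate-values constraint in the question: keep only N, the sum of logs, and the sum of squared logs, and recover the log-normal parameters from those three numbers (the class and variable names are mine):

```python
import math

class LogNormalAggregate:
    """Streaming estimate of log-normal parameters.

    Stores only three aggregates, as the question requires: the count,
    the sum of log-values, and the sum of squared log-values.
    """
    def __init__(self):
        self.n = 0
        self.sum_log = 0.0
        self.sum_log_sq = 0.0

    def add(self, x):                 # x: one positive measurement
        lx = math.log(x)
        self.n += 1
        self.sum_log += lx
        self.sum_log_sq += lx * lx

    def params(self):                 # (mu, sigma) of the log-values
        mu = self.sum_log / self.n
        var = self.sum_log_sq / self.n - mu * mu
        return mu, math.sqrt(max(var, 0.0))

    def geometric_mean(self):         # equals the median for a log-normal
        return math.exp(self.params()[0])

agg = LogNormalAggregate()
for t in [1.9, 2.1, 2.0, 2.2, 8.5]:   # hypothetical timings, one outlier
    agg.add(t)
print(agg.params(), agg.geometric_mean())
```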
If you want, you can consider that to have multiple humps, i.e. to be a mixture of two normals with different means. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
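If you do want to attempt the multi-hump case, here is a sketch using scikit-learn's GaussianMixture on logged measurements (my choice of library; the data are synthetic two-hump timings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical two-hump log-timing data: a fast path and a slow path.
logs = np.concatenate([rng.normal(0.7, 0.1, 900),
                       rng.normal(2.0, 0.3, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(logs)
# EM assigns each measurement a probability of belonging to each hump,
# which is exactly the per-measurement estimation mentioned above.
print(gmm.means_.ravel(), gmm.weights_)
```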
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average; you deal with its geometric mean, which is the same as its median.
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ (quantile-quantile) plot: generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal-distribution point and Y is the log-sample point. The result should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
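A sketch of that recipe (using numpy's normal generator instead of the sum-of-12-uniforms trick, and synthetic log-normal data as a stand-in for real measurements):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
measurements = rng.lognormal(mean=1.0, sigma=0.4, size=500)  # hypothetical data

log_sample = np.sort(np.log(measurements))               # sorted log-measurements
normal_ref = np.sort(rng.normal(size=log_sample.size))   # same-sized normal sample

# If the logs are normal, the points fall on a straight line.
plt.plot(normal_ref, log_sample, ".")
plt.xlabel("normal quantiles")
plt.ylabel("log-measurement quantiles")
plt.show()
```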
The problem you describe is called "distribution fitting" and is not specific to performance measurements; it is the generic problem of fitting a suitable distribution to any gathered/measured data sample.
The standard process is something like this:
1. Guess a candidate distribution.
2. Estimate its parameters from the data.
3. Run goodness-of-fit hypothesis tests to check how well it describes the gathered data.
Repeat 1-3 if the fit is not good enough.
You can find an interesting article describing how this can be done with the open-source R software system here. I think the function fitdistr may be especially useful to you.
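fitdistr lives in R's MASS package; a rough SciPy equivalent of the guess/fit/test loop, with placeholder data and candidate distributions of my choosing, looks like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(1.0, 0.4, size=500)        # placeholder measurements

for dist in (stats.lognorm, stats.gamma):       # step 1: candidate guesses
    params = dist.fit(data)                     # step 2: estimate parameters (MLE)
    stat, p = stats.kstest(data, dist.name, args=params)  # step 3: goodness of fit
    print(dist.name, "KS p-value:", round(p, 3))
```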
In addition to the answers already given, consider empirical distributions. I have had success using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build a histogram of performance measurements, with the measurements discretized to a given accuracy. Once you have the histogram, you can do several useful things (a sketch of all three follows this list):
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate a sequence of response times according to the distribution. This one is very useful for performance modeling.
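A minimal sketch of all three operations (the bin count stands in for the "given accuracy"; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.lognormal(1.0, 0.4, size=10_000)     # hypothetical measurements

counts, edges = np.histogram(samples, bins=100)    # discretized histogram
pdf = counts / counts.sum()                        # probability of each bin
cdf = np.cumsum(pdf)                               # cumulative distribution

# Probability of a given value, up to bin accuracy:
i = np.searchsorted(edges, 3.0) - 1
print("P(bin containing 3.0) =", pdf[i])

# Generate new response times by inverse-CDF sampling from the histogram
# (each draw returns the left edge of the sampled bin):
u = rng.random(5)
new_times = edges[np.searchsorted(cdf, u)]
print(new_times)
```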
Try the gamma distribution: http://en.wikipedia.org/wiki/Gamma_distribution
From Wikipedia:
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized arrival times in performance modelling is either the exponential distribution (for inter-arrival times) or the Poisson distribution (for the count of arrivals in a fixed interval, which follows when inter-arrival times are exponential).
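For illustration, a short sketch of generating such arrivals (the rate is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0                                     # placeholder: 2 arrivals/second
inter_arrival = rng.exponential(1.0 / rate, size=10)
arrival_times = np.cumsum(inter_arrival)       # arrival instants
print(arrival_times)

# Counts per unit window are then Poisson(rate)-distributed.
counts = rng.poisson(rate, size=5)
print(counts)
```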
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the sizes of jobs submitted to her distributed task-assignment system followed a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task-assignment system, most importantly that jobs should be well load-balanced, had awful consequences for submitters of jobs. She has done good follow-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as the assumption that measurements follow a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?
