manufacturing ambiguous data set - probability

I hope questions like this belong here.
So here is the problem I am dealing with right now:
I have some data collected from a manufacturing process (sensor data, process parameters etc.) and for every part that leaves the production line i know if it is scrap or not.
So for each part I have the its process data and the quality (0: good 1:bad)
My goal is to optimize the manufacturing process, i.e. find the optimal process parameters to produce the least amount of scrap.
What i did so far: I tried different classification algorithms (random forest, SVM, neural network) but none are able to achieve a good accuracy.
I think the reason is that the data is very ambiguous, i.e if i have parts with the same process parameters some of them might be scrap while some might be good. But there is definitely a connection between quality and process parameters.
What i want to now is to predict the "probability" for a part to be good or bad. Imo i want to estimate the probability density? Can i do this with K-nearest neighbours?

A step you could try is to, for each parameter, estimate , where x is the parameter value and is the good/bad indicator variable.
There's a chance that does not adhere to any particular distribution, and not knowing the type of values they take on it would be hard for me to make a suggestion.
A "model free" approach would be to, given a set of n observations , "discretize" the parameter x so that
Then you can estimate the pmf via
and similarly for the "bad" case.
After you have for each parameter, you can compute the relative entropy/KL divergence between that parameter's "good" and "bad" cases. Those that have larger divergence between the two classes are the parameters that matter most, and their pmfs will hopefully show you which values are indicative of bad performance.
This is of course assuming the paramters iid, which they may in fact not be, but a similar process can be performed by considering co-parameters that are not independent and discretizing accordingly.

Related

XGBOOST/lLightgbm over-fitting despite no indication in cross-validation test scores?

We aim to identify predictors that may influence the risk of a relatively rare outcome.
We are using a semi-large clinical dataset, with data on nearly 200,000 patients.
The outcome of interest is binary (i.e. yes/no), and quite rare (~ 5% of the patients).
We have a large set of nearly 1,200 mostly dichotomized possible predictors.
Our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, that may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is somewhat possible that none of the possible predictors we are considering have any influence on the risk of developing the condition, so if we were aiming to develop a prediction model it would have likely been a rather bad one. For this work, we use the R implementation of XGBoost/lightgbm.
We have been having difficulties tuning the models. Specifically when running cross validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see figure below for nrounds=600,000 from xgboost). This is observed even when increasing the learning rate (eta), or when adding some regularization parameters (e.g. max_delta_step, lamda, alpha, gamma, even at high values for these).
As expected, the CV test score is always lower than the train score, but continuous to improve without ever showing a clear sign of over fitting. This is true regardless of the evaluation metrics that is used (example below is for logloss, but the same is observed for auc/aucpr/error rate, etc.). Relatedly, the same phenomenon is also observed when using a grid search to find the optimal value of tree depth (max_depth). CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of over fitting.
Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
Are there situations in which over fitting happens despite continuous improvements in the CV-test (or test split) scores? If so, why is that and how would one choose the optimal values for the hyper parameters?
Relatedly, again, the idea is not to create a prediction model (since it would be a rather bad one, owing that we don’t know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees is not the optimal method for this, are there others to come to mind? Again, part of the reason we chose to use boosted trees was to enable the identification of higher (i.e. more than 2) order interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.).
welcome to Stackoverflow!
In the absence of some code and representative data it is not easy to make other than general suggestions.
Your descriptive statistics step may give some pointers to a starting model.
What does existing theory (if it exists!) suggest about the cause of the medical condition?
Is there a male/female difference or old/young age difference that could help get your foot in the door?
Your medical data has similarities to the fraud detection problem where one is trying to predict rare events usually much rarer than your cases.
It may pay you to check out the use of xgboost/lightgbm in the fraud detection literature.

Evaluating a specific Information retrieval system with P#1

I am working on a information retrieval system which aims to select the first result and to link it to other database. Indeed, our system is based on a Keyword description of a video and try to interlink the video to a DBpedia entity which has the same meaning of the description. In the step of evaluation, i noticid that the majority of evaluation set the minimum of the precision cut-off to 5, whereas in our system is not suitable. I am thinking to put an interval [1,5]: (P#1,...P#5).Will it be possible? !!
Please provide your suggestions and your reference to some notes.. Thanks..
You can definitely calculate P#1 for a retrieval system, if you have truth labels. (In this case, it sounds like they would be [Video, DBPedia] matching pairs generated by humans).
People generally look at this measure for things like Question-Answering or recommendation systems. The only caveat is that you typically wouldn't use it to train a learning to rank system or any other learning system -- it's not "continuous enough" a near miss (best at rank 2) and a total miss (best at rank 4 million) get equivalent scores, so it can be hard to smoothly improve a system by tuning weights in such a case.
For those kinds of tasks, using Mean Reciprocal Rank is pretty common, if you need something tunable. Also NDCG tends to be okay, too, since it has an exponential discounting factor.
But there's nothing in the definition of precision that prevents you from calculating it at rank 1. It may be more correct to describe it as a "success#1" feature, since you're going to get 0/1 or 1/1 as your two options.

Neural Network Basics

I'm a computer science student and for this years project, I need to create and apply a Genetic Algorithm to something. I think Neural Networks would be a good thing to apply it to, but I'm having trouble understanding them. I fully understand the concepts but none of the websites out there really explain the following which is blocking my understanding:
How the decision is made for how many nodes there are.
What the nodes actually represent and do.
What part the weights and bias actually play in classification.
Could someone please shed some light on this for me?
Also, I'd really appreciate it if you have any similar ideas for what I could apply a GA to.
Thanks very much! :)
Your question is quite complex and I don't think a small answer will fully satisfy you. Let me try, nonetheless.
First of all, there must be at least three layers in your neural network (assuming a simple feedforward one). The first is the input layer and there will be one neuron per input. The third layer is the output one and there will be one neuron per output value (if you are classifying, there might be more than one f you want to assign a "belong to" meaning to each neuron).. The remaining layer is the hidden one, which will stand between the input and output. Determining its size is a complex task as you can see in the following references:
comp.ai faq
a post on stack exchange
Nevertheless, the best way to proceed would be for you to state your problem more clearly (as weel as industrial secrecy might allow) and let us think a little more on your context.
The number of input and output nodes is determined by the number of inputs and outputs you have. The number of intermediate nodes is up to you. There is no "right" number.
Imagine a simple network: inputs( age, sex, country, married ) outputs( chance of death this year ). Your network might have a 2 "hidden values", one depending on age and sex, the other depending on country and married. You put weights on each. For example, Hidden1 = age * weight1 + sex * weight2. Hidden2 = country * weight3 + married * weight4. You then make another set of weights, Hidden3 and Hidden4 connecting to the output variable.
Then you get a data from, say the census, and run through your neural network to find out what weights best match the data. You can use genetic algorithms to test different sets of weights. This is useful if you have so many edges you could not try every possible weighting. You need to find good weights without exhaustively trying every possible set of weights, so GA lets you "evolve" a good set of weights.
Then you test your weights on data from a different census to see how well it worked.
... my major barrier to understanding this though is understanding how the hidden layer actually works; I don't really understand how a neuron functions and what the weights are for...
Every node in the middle layer is a "feature detector" -- it will (hopefully) "light up" (i.e., be strongly activated) in response to some important feature in the input. The weights are what emphasize an aspect of the previous layer; that is, the set of input weights to a neuron correspond to what nodes in the previous layer are important for that feature.
If a weight connecting myInputNode to myMiddleLayerNode is 0, then you can tell that myInputNode is not important to whatever feature myMiddleLayerNode is detecting. If, though, the weight connecting myInputNode to myMiddleLayerNode is very large (either positive or negative), you know that myInputNode is quite important (if it's very negative it means "No, this feature is almost certainly not there", while if it's very positive it means "Yes, this feature is almost certainly there").
So a corollary of this is that you want the number of your middle-layer nodes to have a correspondence to how many features are needed to classify the input: too few middle-layer nodes and it will be hard to converge during training (since every middle-layer node will have to "double up" on its feature-detection) while too many middle-layer nodes may over-fit your data.
So... a possible use of a genetic algorithm would be to design the architecture of your network! That is, use a GA to set the number of middle-layer nodes and initial weights. Some instances of the population will converge faster and be more robust -- these could be selected for future generations. (Personally, I've never felt this was a great use of GAs since I think it's often faster just to trial-and-error your way into a decent NN architecture, but using GAs this way is not uncommon.)
You might find this wikipedia page on NeuroEvolution of Augmenting Topologies (NEAT) interesting. NEAT is one example of applying genetic algorithms to create the neural network topology.
The best way to explain an Artificial Neural Network (ANN) is to provide the biological process that it attempts to simulate - a neural network. The best example of one is the human brain. So how does the brain work (highly simplified for CS)?
The functional unit (for our purposes) of the brain is the neuron. It is a potential accumulator and "disperser". What that means is that after a certain amount of electric potential (think filling a balloon with air) has been reached, it "fires" (balloon pops). It fires electric signals down any connections it has.
How are neurons connected? Synapses. These synapses can have various weights (in real life due to stronger/weaker synapses from thicker/thinner connections). These weights allow a certain amount of a fired signal to pass through.
You thus have a large collection of neurons connected by synapses - the base representation for your ANN. Note that the input/output structures described by the others are an artifact of the type of problem to which ANNs are applied. Theoretically, any neuron can accept input as well. It serves little purpose in computational tasks however.
So now on to ANNs.
NEURONS: Neurons in an ANN are very similar to their biological counterpart. They are modeled either as step functions (that signal out "1" after a certain combined input signal, or "0" at all other times), or slightly more sophisticated firing sequences (arctan, sigmoid, etc) that produce a continuous output, though scaled similarly to a step. This is closer to the biological reality.
SYNAPSES: These are extremely simple in ANNs - just weights describing the connections between Neurons. Used simply to weight the neurons that are connected to the current one, but still play a crucial role: synapses are the cause of the network's output. To clarify, the training of an ANN with a set structure and neuron activation function is simply the modification of the synapse weights. That is it. No other change is made in going from a a "dumb" net to one that produces accurate results.
STRUCTURE:
There is no "correct" structure for a neural network. The structures are either
a) chosen by hand, or
b) allowed to grow as a result of learning algorithms (a la Cascade-Correlation Networks).
Assuming the hand-picked structure, these are actually chosen through careful analysis of the problem and expected solution. Too few "hidden" neurons/layers, and you structure is not complex enough to approximate a complex function. Too many, and your training time rapidly grows unwieldy. For this reason, the selection of inputs ("features") and the structure of a neural net are, IMO, 99% of the problem. The training and usage of ANNs is trivial in comparison.
To now address your GA concern, it is one of many, many efforts used to train the network by modifying the synapse weights. Why? because in the end, a neural network's output is simply an extremely high-order surface in N dimensions. ANY surface optimization technique can be use to solve the weights, and GA are one such technique. The simple backpropagation method is alikened to a dimension-reduced gradient-based optimization technique.

Why is average so popular when measuring application performance

When measuring application performance (response time for example) it's so easy to come across averages (mean). ab, httpref and bunch of other utilities are reporting mean and standard deviation. But from theoretical point of view it doesn't make a lot of sense to me. And there is why.
Mean value is good at describing symmetrical distributed population, because in case of symmetrical distribution mean is equal to population mode and expected value. But response times are not distributed symmetrical. They are more like exponential. In this case average tells us nothing.
It's more convenient to work with percentile values, which tells us what response time we could afford in what percentage of responses.
Am I missing something or mean is popular just because it's very simple to calculate?
All kinds of tools get their features not necessarily from what makes sense, but from users' expectations.
You're absolutely right that the distributions are non-negative and heavily skewed, and that percentiles would be more informative.
Alternatively, a distribution more like lognormal or chi-square would be a little better.
Yes, you are missing something.
The whole point of descriptive statistics is to present a few numbers to describe (or represent or model or ...) a large number of numbers. They aid the comprehension of large datasets, the extraction of information from data, the approximate comparison of datasets whose exact comparison is large and bewildering to the limitations of the human mind.
But no single descriptive statistic is always fit for all purposes, and no one is dictating to you that you must or should or ought to use the mean. If it doesn't suit your purposes, use something else.
As it happens you are quite wrong to write They are more like exponential. In this case average tells us nothing. For an exponential distribution with rate parameter lambda the mean is simply 1/lambda so the mean tells you everything about an exponential distribution.
I'm not an expert in statistics but i believe the average values are used so much because those are the values that help to measure the scalability of a system.
You need to consider first your average values to know how your system needs to bahevae under certains workloads and those needs to be predictable, you usually are not very interested in outliers at least not at first.
Of course you need to look into your min values and the peak values to know the moment your system its going to have a bottleneck but the average values show you as i said a correct and predictable behavior.

How to automatically tune parameters of an algorithm?

Here's the setup:
I have an algorithm that can succeed or fail.
I want it to succeed with highest probability possible.
Probability of success depends on some parameters (and some external circumstances):
struct Parameters {
float param1;
float param2;
float param3;
float param4;
// ...
};
bool RunAlgorithm (const Parameters& parameters) {
// ...
// P(return true) is a function of parameters.
}
How to (automatically) find best parameters with a smallest number of calls to RunAlgorithm ?
I would be especially happy with a readl library.
If you need more info on my particular case:
Probability of success is smooth function of parameters and have single global optimum.
There are around 10 parameters, most of them independently tunable (but some are interdependent)
I will run the tunning overnight, I can handle around 1000 calls to Run algorithm.
Clarification:
Best parameters have to found automatically overnight, and used during the day.
The external circumstances change each day, so computing them once and for all is impossible.
More clarification:
RunAlgorithm is actually game-playing algorithm. It plays a whole game (Go or Chess) against fixed opponent. I can play 1000 games overnight. Every night is other opponent.
I want to see whether different opponents need different parameters.
RunAlgorithm is smooth in the sense that changing parameter a little does change algorithm only a little.
Probability of success could be estimated by large number of samples with the same parameters.
But it is too costly to run so many games without changing parameters.
I could try optimize each parameter independently (which would result in 100 runs per parameter) but I guess there are some dependencies.
The whole problem is about using the scarce data wisely.
Games played are very highly randomized, no problem with that.
Maybe you are looking for genetic algorithms.
Why not allow the program fight with itself? Take some vector v (parameters) and let it fight with v + (0.1,0,0,0,..,0), say 15 times. Then, take the winner and modify another parameter and so on. With enough luck, you'll get a strong player, able to defeat most others.
Previous answer (much of it is irrevelant after the question was edited):
With these assumptions and that level of generality, you will achieve nothing (except maybe an impossiblity result).
Basic question: can you change the algorithm so that it will return probability of success, not the result of a single experiment? Then, use appropriate optimization technique (nobody will tell you which under such general assumptions). In Haskell, you can even change code so that it will find the probability in simple cases (probability monad, instead of giving a single result. As others mentioned, you can use a genetic algorithm using probability as fitness function. If you have a formula, use a computer algebra system to find the maximum value.
Probability of success is smooth function of parameters and have single global optimum.
Smooth or continuous? If smooth, you can use differential calculus (Lagrange multipliers?). You can even, with little changes in code (assuming your programming language is general enough), compute derivatives automatically using automatic differentiation.
I will run the tunning overnight, I can handle around 1000 calls to Run algorithm.
That complex? This will allow you to check two possible values (210=1024), out of many floats. You won't even determine order of magnitude, or even order of order of magnitude.
There are around 10 parameters, most of them independently tunable (but some are interdependent)
If you know what is independent, fix some parameters and change those that are independent of them, like in divide-and-conquer. Obviously it's much better to tune two algorithms with 5 parameters.
I'm downvoting the question unless you give more details. This has too much noise for an academic question and not enough data for a real-world question.
The main problem you have is that, with ten parameters, 1000 runs is next to nothing, given that, for each run, all you have is a true/false result rather than a P(success) associated with the parameters.
Here's an idea that, on the one hand, may make best use of your 1000 runs and, on the other hand, also illustrates the the intractability of your problem. Let's assume the ten parameters really are independent. Pick two values for each parameter (e.g. a "high" value and a "low" value). There are 1024 ways to select unique combinations of those values; run your method for each combination and store the result. When you're done, you'll have 512 test runs for each value of each parameter; with the independence assumption, that might give you a decent estimate on the conditional probability of success for each value. An analysis of that data should give you a little information about how to set your parameters, and may suggest refinements of your "high" and "low" values for future nights. The back of my mind is dredging up ANOVA as a possibly useful statistical tool here.
Very vague advice... but, as has been noted, it's a rather vague problem.
Specifically for tuning parameters for game-playing agents, you may be interested in CLOP
http://remi.coulom.free.fr/CLOP/
Not sure if I understood correctly...
If you can choose the parameters for your algorithm, does it mean that you can choose it once for all?
Then, you could simply:
have the developper run all/many cases only once, find the best case, and replace the parameters with the best value
at runtime for your real user, the algorithm is already parameterized with the best parameters
Or, if the best values change for each run ...
Are you looking for Genetic Algorithms type of approach?
The answer to this question depends on:
Parameter range. Can your parameters have a small or large range of values?
Game grading. Does it have to be a boolean, or can it be a smooth function?
One approach that seems natural to this problem is Hill Climbing.
A possible way to implement would be to start with several points, and calculate their "grade". Then figure out a favorable direction for the next point, and try to "ascend".
The main problems that I see in this question, as you presented it, is the huge range of parameter values, and the fact that the result of the run is boolean (and not a numeric grade). This will require many runs to figure out whether a set of chosen parameters are indeed good, and on the other hand, there is a huge set of parameters values yet to check. Just checking all directions will result in a (too?) large number of runs.

Resources