Probability of failure - Limit State Function - Monte Carlo Method - probability

I want to calculate the probability of failure, pf, using the Monte Carlo method.
The limit state equation is obtained by comparing the substance content at a time t, C(x=a,t), and the critical content, Ccrit:
LSF: g(Ccrit, C(x=a,t)) = Ccrit - C(x=a, t), with failure when g < 0
Ccrit follows a beta distribution, Ccrit ~ B(mean=0.6, s=0.15, a=0.20, b=2.0). Generated distribution:
from scipy.stats import beta
r = ((mean - a) / (b - a)) * ((((mean - a) * (b - mean)) / s**2) - 1)
t = ((b - mean) / (b - a)) * ((((mean - a) * (b - mean)) / s**2) - 1)
Ccrit = beta.rvs(r, t, loc=a, scale=b - a, size=int(1e6))
C(x=a, t) is a function of 11 other variables (beta, normal, deterministic, lognormal, etc.) and varies with time t. These variables have been defined using scipy.stats, e.g.:
Var1 = truncnorm.rvs(0, 1000, loc=60e-3, scale=6e-3, size=int(1e6))
(...)
Var11=Csax=dist.lognormal(l, z, 1e6)
After all the variables are generated I am having difficulty computing the pf.
I have seen that:
P(Ccrit < C) = ∫ F_Ccrit(c) · f_C(c) dc, integrated from -inf to +inf
leads to the pf but I am clueless on how to calculate it.
Will appreciate your help,
Thank you

Well, as I understood your question, this is the way to compute the probability of failure from a crude Monte Carlo simulation:
pf = sum(I(g(x))) / N
where:
N - is the number of simulations
x - is the vector of all the involved random variables
I(arg) - is an indicator function, defined as:
if arg < 0
I = 1
else
I = 0
end
The simulation methods are basically invented to circumvent complicated or impossible integrals, no need in this case for the integration you mentioned.
Keep in mind that the coefficient of variation of the estimate is proportional to 1/sqrt(N).
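As a minimal numpy/scipy sketch of this estimator (the two distributions below are only illustrative placeholders; substitute the Ccrit and C(x=a, t) samples generated from your own input variables):
import numpy as np
from scipy.stats import beta, norm

N = int(1e6)

# placeholder samples -- replace with your own Ccrit and C(x=a, t) realisations
Ccrit = beta.rvs(1.2, 2.8, loc=0.20, scale=1.8, size=N)
C = norm.rvs(loc=0.45, scale=0.15, size=N)

g = Ccrit - C                          # limit state function, one value per realisation
failures = g < 0                       # indicator I(g): 1 where g < 0
pf = failures.mean()                   # crude Monte Carlo estimate, sum(I)/N
cov = np.sqrt((1.0 - pf) / (pf * N))   # coefficient of variation ~ 1/sqrt(N)
print(pf, cov)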
I tried to be as clear as possible with the notation; in case it is hard to follow, see these lecture notes for better formatting.
I assumed you used crude Monte Carlo, but for importance sampling you can find the formulas in the linked source as well.
The above formulation is time-invariant; the fact that your problem involves time makes the task much harder in general.
The solution technique depends on the nature of the time-variance; since no details are given in this regard, I can only recommend a book (Melchers, Structural Reliability Analysis and Prediction) where the question is treated in detail:
In general, time-variant problems can be reduced (at least in an approximate manner) to time-invariant problems, and the above formulation can be used. Or you might calculate the probability of failure at every time instant with the 'method' sketched above, if that makes sense for your problem.
Because C is a substance content, the problem might contain no stochastic process but only a random variable that increases monotonically in time. In that case the probability of failure equals the time-invariant probability of failure at the last time instant (when the concentration is closest to the critical value), so the Monte Carlo technique mentioned above can be used directly. This type of problem is called a right-boundary problem; more details:
Construction Reliability: Safety, Variability and Sustainability. Chapter 10.
If this is not what you want to accomplish, please give us more details.

Related

Should one calculate QR decomposition before Least Squares to speed up the process?

I am reading the book "Introduction to linear algebra" by Gilbert Strang. The section is called "Orthonormal Bases and Gram-Schmidt". The author several times emphasised the fact that with orthonormal basis it's very easy and fast to calculate Least Squares solution, since Qᵀ*Q = I, where Q is a design matrix with orthonormal basis. So your equation becomes x̂ = Qᵀb.
And I got the impression that it's a good idea to always calculate the QR decomposition before applying Least Squares. But later I worked out the time complexity of QR decomposition, and it turned out that calculating the QR decomposition and then applying Least Squares is more expensive than the regular x̂ = inv(AᵀA)Aᵀb.
Is that right that there is no point in using QR decomposition to speed up Least Squares? Or maybe I got something wrong?
So the only purpose of QR decomposition regarding Least Squares is numerical stability?
There are many ways to do least squares; typically these vary in applicability, accuracy and speed.
Perhaps the Rolls-Royce method is to use SVD. This can be used to solve under-determined (fewer obs than states) and singular systems (where A'*A is not invertible) and is very accurate. It is also the slowest.
QR can only be used to solve non-singular systems (that is we must have A'*A invertible, ie A must be of full rank), and though perhaps not as accurate as SVD is also a good deal faster.
The normal equations ie
compute P = A'*A
solve P*x = A'*b
is the fastest (perhaps by a large margin if P can be computed efficiently, for example if A is sparse) but is also the least accurate. This too can only be used to solve non singular systems.
Inaccuracy should not be taken lightly nor dismissed as some academic fanciness. If you happen to know that the problems you will be solving are nicely behaved, then it might well be fine to use an inaccurate method. But otherwise the inaccurate routine might well fail (ie say there is no solution when there is, or worse come up with a totally bogus answer).
I'm a bit confused that you seem to be suggesting forming and solving the normal equations after performing the QR decomposition. The usual way to use QR in least squares is, if A is nObs x nStates:
decompose A as A = Q * [R; 0], where R is upper triangular
transform b into b~ = Q'*b
solve R * x = b# for x, where b# is the first nStates entries of b~
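For completeness, a minimal numpy/scipy sketch of that QR route with a made-up A and b (the reduced QR already drops the zero block), alongside the normal equations for comparison:
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))     # nObs x nStates, assumed full rank
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)                # reduced QR: Q is nObs x nStates, R upper triangular
b_tilde = Q.T @ b                     # b~ = Q'*b (already just the first nStates entries)
x_qr = solve_triangular(R, b_tilde)   # solve R*x = b~

x_ne = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations: faster but less accurate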

why overfitting gives a bad hypothesis function

In linear or logistic regression, if we find a hypothesis function that fits the training set perfectly, that should be a good thing, because in that case we have used 100% of the given information to predict new information.
Yet this is called overfitting and is said to be a bad thing.
By making the hypothesis function simpler we may be actually increasing the noise instead of decreasing it.
Why is it so?
Overfitting occurs when you try "too hard" to make the examples in the training set fit the classification rule.
It is considered a bad thing for two main reasons:
The data might have noise. Trying too hard to classify 100% of the examples correctly makes the noise count as well and gives you a bad rule; ignoring this noise would usually be much better.
Remember that the classified training set is just a sample of the real data. A solution that fits it perfectly is usually more complex than what you would have got if you had tolerated a few wrongly classified samples. According to Occam's razor, you should prefer the simpler solution, so ignoring some of the samples will be better.
Example:
According to Occam's razor, you should tolerate the misclassified sample, assume it is noise or insignificant, and adopt the simpler solution (the green line) for such a data set.
Because you actually didn't "learn" anything from your training set, you've just fitted to your data.
Imagine you have a one-dimensional regression
x_1 -> y_1
...
x_n -> y_n
The function defined as
f(x) = y_i, if x = x_i for some i
f(x) = 0, otherwise
will give you a perfect fit, but it's actually useless.
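In Python terms, that memorising "model" is literally just a lookup table (a sketch with made-up training pairs):
# training pairs x_i -> y_i (made-up values)
train = {0.1: 1.3, 0.5: 2.7, 0.9: 0.4}

def f(x):
    # returns y_i for an exact training input, 0 for everything else
    return train.get(x, 0.0)

print(f(0.5))    # 2.7 -- reproduces the training point exactly
print(f(0.51))   # 0.0 -- says nothing useful about nearby inputs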
Hope, this helped a bit:)
Assuming that your regression accounts for all source of deviation in your data, then you might argue that your regression perfectly fits the data. However, if you know all (and I mean all) of the influences in your system, then you probably don't need a regression. You likely have an analytic solution that perfectly predicts new information.
In actuality, the information you possess will fall short of this perfect level. Noise (measurement error, partial observability, etc) will cause deviation in your data. In response, a regression (or other fitting mechanism) should seek the general trend of the data while minimizing the influence of noise.
Actually, the statement is not quite correct as written. It is perfectly fine to match 100% of your data if your hypothesis function is linear. Every continuous nonlinear function may be approximated locally by a linear function, which gives important information on its local behavior.
It is also fine to match 100 points of data to a quadratic curve if that data matches 100%. You can have high confidence that you are not overfitting your data, since the data consistently shows quadratic behavior.
However, one can always get a 100% fit by using a polynomial function of high enough degree. Even without the noise that others have pointed out, though, you shouldn't assume your data has some high-degree polynomial behavior without having some kind of theoretical or experimental confirmation of that hypothesis. Two good indicators of polynomial behavior are:
You have some theoretical reason for expecting the data to grow as x^n in one of the directional limits.
You have data that has been supporting a fixed degree polynomial fit as more and more data has been collected.
Notice, though, that even though exponential and reciprocal relationships may have data that fits a polynomial of high enough degree, they don't tend to satisfy either of the two conditions above.
The point is that your data fit needs to be useful to prediction. You always know that a linear fit will give information locally, but that information becomes more useful the more points are fit. Even if there are only two points and noise, a linear fit still gives the best theoretical look at the data collected so far, and establishes the first expectations of the data. Beyond that, though, using a quadratic fit for three points or a cubic fit for four is not validly giving more information, as it assumes both local and asymptotic behavior information with the addition of one point. You need justification for your hypothesis function. That justification can come from more points or from theory.
(A third reason that sometimes comes up is
You have theoretical and experimental reason to believe that error and noise do not contribute more than some bounds, and you can take a polynomial hypothesis to look at local derivatives and the behavior needed to match the data.
This is typically used in understanding data to build theoretical models without having a good starting point for theory. You should still strive to use the smallest polynomial degree possible, and look to substitute out patterns in the coefficients with what they may indicate (reciprocal, exponential, gaussian, etc.) in infinite series.)
Try imagining it this way. You have a function from which you pick n different values to represent a sample / training set:
y(n) = x(n), with x(n) in [0, 1]
But, since you want to build a robust model, you want to add a little noise to your training set, so you actually add a little noise when generating the data:
data(n) = y(n) + noise(n) = x(n) + u(n)
where by u(n) I denote random noise with mean 0 and standard deviation 1 (e.g. Gaussian, N(0,1)). Quite simply, it's a noise signal that is most likely to take the value 0, and less likely to take a value the farther it is from 0.
And then you draw, let's say, 10 points to be your training set. If there were no noise, they would all lie on the line y = x. Since there is noise, the lowest-degree polynomial that can pass through all of them exactly is in general of 9th order, a function like: y = a_9 * x^9 + a_8 * x^8 + ... + a_1 * x + a_0.
If instead you just estimate the general trend from the training set, you will probably get a much simpler function than the 9th-order polynomial, and it will be closer to the real function.
Consider further that your real function can have values outside the [0, 1] interval, but for some reason the samples for the training set could only be collected from this interval. Now, a simple estimate would probably behave significantly better outside the interval of the training set, whereas fitting the training set perfectly would give an overfitted function that meanders with lots of ups and downs all over :)
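A small numpy sketch of that experiment with made-up numbers: ten noisy samples of y = x on [0, 1], a degree-9 interpolating polynomial versus a straight-line fit, both evaluated just outside the training interval:
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 10))
y = x + rng.normal(0.0, 0.05, 10)     # samples of y = x plus a little noise

p9 = np.polyfit(x, y, 9)              # degree-9 polynomial: (near-)perfect fit
p1 = np.polyfit(x, y, 1)              # simple straight-line fit

x_test = 1.2                          # just outside the sampled interval
print(np.polyval(p9, x_test))         # typically far from the true value 1.2
print(np.polyval(p1, x_test))         # close to the true value 1.2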
Overfitting is termed bad because of the bias it has toward the training data rather than the true solution. An overfitted solution fits the training data 100%, but with the addition of any small new data point the model changes drastically; this is the variance of the model. Hence the bias-variance tradeoff, where we try to strike a balance between the two factors so that the model does not change drastically on small data changes but still predicts the output reasonably well.

An algorithm for checking if a nonlinear function f is always positive

Is there an algorithm to check if a given (possibly nonlinear) function f is always positive?
The idea that I currently have is to find the roots of the function (using the Newton-Raphson algorithm or similar techniques, see http://en.wikipedia.org/wiki/Root-finding_algorithm) and check derivatives, or to find the minimum of f, but these don't seem to be the best solutions to this problem; there are also a lot of convergence issues with root-finding algorithms.
For example, in Maple the function verify can do this, but I need to implement it in my own program.
Maple Help on verify: http://www.maplesoft.com/support/help/Maple/view.aspx?path=verify/function_shells
Maple example:
assume(x,'real');
verify(x^2+1,0,'greater_than' ); --> returns true, since for every x we have x^2+1 > 0
[edit] Some background on the question:
The function $f$ is the right-hand side of a nonlinear differential model for a circuit. A nonlinear circuit can be modeled as a set of ordinary differential equations by applying modified nodal analysis (MNA). For the sake of simplicity, let's consider only systems with 1 dimension, so $x' = f(x)$ where $f$ describes the circuit; for example $f$ can be $f(x) = 10x - 100x^2 + 200x^3 - 300x^4 + 100x^5$ (a model for a nonlinear tunnel diode) or $f(x) = 10 - 2\sin(4x) + 3x$ (a model for a Josephson junction).
$x$ is bounded and $f$ is only defined on an interval $[a,b] \subset \mathbb{R}$. $f$ is continuous.
I can also make an assumption that $f$ is Lipschitz with Lipschitz constant L>0, but I don't want to unless I have to.
If I understand your problem correctly, it boils down to counting the number of (real) roots in an interval without necessarily identifying them. In fact, you don't even need to get the exact number, just whether or not it's equal to zero.
If your function is a polynomial, I think that Sturm's theorem may be applicable. The Wikipedia article claims two other procedures are preferred, so you might want to check those out, too. I'm not sure if Descartes' rule of signs works on an interval, but Budan's theorem does appear to.
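If your f is one of the polynomial circuit models from the question, here is a quick sketch using sympy; as far as I know its Poly.count_roots counts the real roots on an interval with a Sturm-type sequence:
import sympy as sp

x = sp.symbols('x')
# tunnel-diode example from the question
f = 10*x - 100*x**2 + 200*x**3 - 300*x**4 + 100*x**5
a, b = sp.Rational(1, 10), sp.Rational(9, 10)   # interval [a, b] of interest (made up)

poly = sp.Poly(f, x)
n_roots = poly.count_roots(a, b)   # number of real roots in [a, b]

# f is continuous, so if it has no root in [a, b] its sign is constant there;
# one interior sample then decides strict positivity on the whole interval.
always_positive = (n_roots == 0) and (f.subs(x, (a + b) / 2) > 0)
print(n_roots, always_positive)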

Frequency determination from sparsely sampled data

I'm observing a sinusoidally-varying source, i.e. f(x) = a sin (bx + d) + c, and want to determine the amplitude a, offset c and period/frequency b - the shift d is unimportant. Measurements are sparse, with each source measured typically between 6 and 12 times, and observations are at (effectively) random times, with intervals between observations roughly between a quarter and ten times the period (just to stress, the spacing of observations is not constant for each source). In each source the offset c is typically quite large compared to the measurement error, while amplitudes vary - at one extreme they are only on the order of the measurement error, while at the other extreme they are about twenty times the error. Hopefully that fully outlines the problem; if not, please ask and I'll clarify.
Thinking naively about the problem, the average of the measurements will be a good estimate of the offset c, while half the range between the minimum and maximum value of the measured f(x) will be a reasonable estimate of the amplitude, especially as the number of measurements increases, so that the prospects of having observed the maximum offset from the mean improve. However, if the amplitude is small then it seems to me that there is little chance of accurately determining b, while the prospects should be better for large-amplitude sources even if they are only observed the minimum number of times.
Anyway, I wrote some code to do a least-squares fit to the data for the range of periods, and it identifies best-fit values of a, b and d quite effectively for the larger-amplitude sources. However, I see it finding a number of possible periods, and while one is the 'best' (in as much as it gives the minimum error-weighted residual) in the majority of cases the difference in the residuals for different candidate periods is not large. So what I would like to do now is quantify the possibility that the derived period is a 'false positive' (or, to put it slightly differently, what confidence I can have that the derived period is correct).
Does anybody have any suggestions on how best to proceed? One thought I had was to use a Monte-Carlo algorithm to construct a large number of sources with known values for a, b and c, construct samples that correspond to my measurement times, fit the resultant samples with my fitting code, and see what percentage of the time I recover the correct period. But that seems quite heavyweight, and I'm not sure that it's particularly useful other than giving a general feel for the false-positive rate.
And any advice for frameworks that might help? I have a feeling this is something that can likely be done in a line or two in Mathematica, but (a) I don't know it, and (b) I don't have access to it. I'm fluent in Java, competent in IDL and can probably figure out other things...
This looks tailor-made for working in the frequency domain. Apply a Fourier transform and identify the frequency based on where the power is located, which should be clear for a sinusoidal source.
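For unevenly sampled data, one concrete way to work in the frequency domain is the Lomb-Scargle periodogram (a least-squares generalization of the usual periodogram to irregular sampling). A minimal scipy sketch with made-up signal parameters:
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0.0, 50.0, 12))      # ~12 observations at random times
y = 3.0 * np.sin(0.7 * t + 0.4) + 10.0       # a*sin(b*t + d) + c
y += rng.normal(0.0, 0.3, t.size)            # measurement noise

omegas = np.linspace(0.05, 3.0, 2000)        # angular frequencies to test
power = lombscargle(t, y - y.mean(), omegas) # remove the offset c first
b_est = omegas[np.argmax(power)]             # peak gives an estimate of b
print(b_est)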
ADDENDUM: To get an idea of how accurate your estimate is, I'd try a resampling approach such as cross-validation. I think this is the direction you're heading with the Monte Carlo idea; lots of work is out there, so hopefully that's a wheel you won't need to reinvent.
The trick here is to do what might at first seem to make the problem more difficult. Rewrite f in the equivalent form:
f(x) = a1*sin(b*x) + a2*cos(b*x) + c
This is based on the angle-addition identity for sin(u+v).
Recognize that if b is known, then the problem of estimating {a1, a2, c} is a simple LINEAR regression problem. So all you need to do is use a 1-variable minimization tool, working on the value of b, to minimize the sum of squares of the residuals from that linear regression model. There are many such univariate optimizers to be found.
Once you have those parameters, it is easy to find the parameter a in your original model, since that is all you care about.
a = sqrt(a1^2 + a2^2)
The scheme I have described is called partitioned least squares.
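A sketch of that partitioned scheme in Python, using scipy's minimize_scalar as the univariate optimizer (the frequency bounds are made up and would come from whatever range of periods is plausible for your sources):
import numpy as np
from scipy.optimize import minimize_scalar

def sum_sq(b, t, y):
    # for a fixed frequency b, {a1, a2, c} follow from ordinary linear least squares
    A = np.column_stack([np.sin(b * t), np.cos(b * t), np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((A @ coef - y) ** 2), coef

def fit_sinusoid(t, y, b_lo=0.05, b_hi=3.0):
    # one-variable minimization over the frequency b only
    res = minimize_scalar(lambda b: sum_sq(b, t, y)[0],
                          bounds=(b_lo, b_hi), method='bounded')
    b = res.x
    a1, a2, c = sum_sq(b, t, y)[1]
    return np.hypot(a1, a2), b, c      # amplitude a, frequency b, offset c
Note that the residual as a function of b has many local minima (which is exactly the multiple-candidate-period behaviour you observed), so in practice it is safer to scan a coarse grid of b values first and then refine around the best candidates.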
If you have a reasonable estimate of the size and the nature of your noise (e.g. white Gaussian with SD sigma), you can
(a) invert the Hessian matrix to get an estimate of the error in your position and
(b) should be able to easily derive a significance statistic for your fit residuals.
For (a), compare http://www.physics.utah.edu/~detar/phys6720/handouts/curve_fit/curve_fit/node6.html
For (b), assume that your measurement errors are independent and thus the variance of their sum is the sum of their variances.

Create a function for given input and output

Imagine, there are two same-sized sets of numbers.
Is it possible, and how, to create a function, an algorithm, or a subroutine which exactly maps input items to output items? Like:
Input = 1, 2, 3, 4
Output = 2, 3, 4, 5
and the function would be:
f(x): return x + 1
And by "function" I mean something slightly more comlex than [1]:
f(x):
if x == 1: return 2
if x == 2: return 3
if x == 3: return 4
if x == 4: return 5
This would be useful for creating special hash functions or function approximations.
Update:
What I am trying to find out is whether there is a way to compress the trivial mapping example [1] from above.
Finding the shortest program that outputs some string (sequence, function etc.) is equivalent to finding its Kolmogorov complexity, which is undecidable.
If "impossible" is not a satisfying answer, you have to restrict your problem. In all appropriately restricted cases (polynomials, rational functions, linear recurrences) finding an optimal algorithm will be easy as long as you understand what you're doing. Examples:
polynomial - Lagrange interpolation
rational function - Pade approximation
boolean formula - Karnaugh map
approximate solution - regression, linear case: linear regression
general packing of data - data compression; some techniques, like run-length encoding, are lossless, some not.
In the case of polynomial sequences, it often helps to consider the difference sequence b_n = a_(n+1) - a_n; this reduces a quadratic relation to a linear one, a linear one to a constant sequence, and so on. But there's no silver bullet. You might build some heuristics (e.g. Mathematica has FindSequenceFunction - check that page to get an impression of how complex this can get) using genetic algorithms, random guesses, checking many built-in sequences and their compositions, and so on. No matter what, any such program is - in theory - infinitely far from perfection due to the undecidability of Kolmogorov complexity. In practice, you might get satisfactory results, but this requires a lot of man-years.
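For instance, a quick sketch of the polynomial case on the example from the question, which recovers x + 1 and shows the difference trick at work:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 4.0, 5.0])

# Lagrange-style interpolation: a polynomial of degree len(x)-1 always fits n points exactly
coeffs = np.polyfit(x, y, len(x) - 1)
print(np.round(coeffs, 10))    # ~ [0, 0, 1, 1]  ->  f(x) = x + 1

# the difference sequence b_n = a_(n+1) - a_n is constant -> the mapping is linear
print(np.diff(y))              # [1. 1. 1.]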
See also another SO question. You might also implement some wrapper to OEIS in your application.
Fields:
Mostly, the limits of what can be done are described in
complexity theory - describing what problems can be solved "fast", like finding the shortest path in a graph, and what cannot, like playing a generalized version of checkers (it is EXPTIME-complete).
information theory - describing how much "information" is carried by a random variable. For example, take coin tossing. Normally, it takes 1 bit to encode the result, and n bits to encode n results (using a long 0-1 sequence). Suppose now that you have a biased coin that gives tails 90% of the time. Then it is possible to find another way of describing n results that on average gives a much shorter sequence. The number of bits per toss needed for optimal coding (less than 1 in that case!) is called entropy; the plot in that article shows how much information is carried (1 bit for 1/2-1/2, less than 1 for a biased coin, 0 bits if the coin always lands on the same side).
algorithmic information theory - that attempts to join complexity theory and information theory. Kolmogorov complexity belongs here. You may consider a string "random" if it has large Kolmogorov complexity: aaaaaaaaaaaa is not a random string, f8a34olx probably is. So, a random string is incompressible (Volchan's What is a random sequence is a very readable introduction.). Chaitin's algorithmic information theory book is available for download. Quote: "[...] we construct an equation involving only whole numbers and addition, multiplication and exponentiation, with the property that if one varies a parameter and asks whether the number of solutions is finite or infinite, the answer to this question is indistinguishable from the result of independent tosses of a fair coin." (in other words no algorithm can guess that result with probability > 1/2). I haven't read that book however, so can't rate it.
Strongly related to information theory is coding theory, that describes error-correcting codes. Example result: it is possible to encode 4 bits to 7 bits such that it will be possible to detect and correct any single error, or detect two errors (Hamming(7,4)).
The "positive" side are:
symbolic algorithms for Lagrange interpolation and Pade approximation are a part of computer algebra/symbolic computation; von zur Gathen, Gerhard "Modern Computer Algebra" is a good reference.
data compression - here you'd better ask someone else for references :)
Ok, I don't understand your question, but I'm going to give it a shot.
If you only have 2 sets of numbers and you want to find f where y = f(x), then you can try curve-fitting to give you an approximate "map".
In this case, it's linear so curve-fitting would work. You could try different models to see which works best and choose based on minimizing an error metric.
Is this what you had in mind?
It seems to me that you want a hashtable. These are based on hash functions, and there are known hash functions that work better than others depending on the expected input and desired output.
If what you want is an algorithmic way of mapping arbitrary input to arbitrary output, this is not feasible in the general case, as it depends entirely on the input and output sets.
For example, in the trivial sample you have there, the function is immediately obvious: f(x): x+1. In other cases it may be very hard or even impossible to generate an exact function describing the mapping; you would have to approximate it or just use a map directly.
In some cases (such as your example), linear regression or similar statistical models could find the relation between your input and output sets.
Doing this in the general case is arbitrarily difficult. For example, consider a block cipher used in ECB mode: it maps an input integer to an output integer, but - by design - deriving any general mapping from specific examples is infeasible. In fact, for a good cipher, even with the complete set of mappings between input and output blocks, you still couldn't determine how to calculate that mapping on a general basis.
Obviously, a cipher is an extreme example, but it serves to illustrate that there's no (known) general procedure for doing what you ask.
Discerning an underlying map from input and output data is exactly what Neural Nets are about! You have unknowingly stumbled across a great branch of research in computer science.
