How should I filter this data? - algorithm

I have several series of data points that need to be graphed. For each graph, some points may need to be thrown out due to error. An example is the following:
The circled areas are errors in the data.
What I need is an algorithm to filter this data so that it eliminates the error by replacing the bad points with flat lines, like so:
Are there any algorithms out there that are especially good at detecting error points? Do you have any tips that could point me in the right direction?
EDIT: Error points are any points that don't look consistent with the data on both sides. There can be large jumps, as long as the data after the jump still looks consistent. If it's on the edge of the graph, large jumps should probably be considered error.

This is a problem that is hard to solve generically; your final solution will end up being very process-dependent, and unique to your situation.
That being said, you need to start by understanding your data: from one sample to the next, what kind of variation is possible? Using that, you can use previous data samples (and maybe future data samples) to decide if the current sample is bogus or not. Then, you'll end up with a filter that looks something like:
const int MaxQueueLength = 100; // adjust these two values as necessary
const double MaxProjectionError = 5;
List<double> FilterData(List<double> rawData)
{
    List<double> toRet = new List<double>(rawData.Count);
    Queue<double> history = new Queue<double>(MaxQueueLength); // adjust queue length as necessary
    foreach (double raw_Sample in rawData)
    {
        while (history.Count > MaxQueueLength)
            history.Dequeue();
        double ProjectedSample = GuessNext(history, raw_Sample);
        double CurrentSample = (Math.Abs(ProjectedSample - raw_Sample) > MaxProjectionError) ? ProjectedSample : raw_Sample;
        toRet.Add(CurrentSample);
        history.Enqueue(CurrentSample);
    }
    return toRet;
}
The magic, then, is coming up with your GuessNext function. Here, you'll be getting into stuff that is specific to your situation, and should take into account everything you know about the process that is gathering data. Are there physical limits to how quickly the input can change? Does your data have known bad values you can easily filter?
Here is a simple example of a GuessNext function that works off the first derivative of your data (i.e. it assumes that your data is roughly a straight line when you only look at a small section of it):
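double GuessNext(Queue<double> history, double nextSample)
{
    // Simple first-derivative projection: assume the input approximates a
    // straight line through the last two accepted samples.
    double[] accepted = history.ToArray(); // oldest sample first
    if (accepted.Length == 0)
        return nextSample; // nothing to compare against yet
    double last = accepted[accepted.Length - 1];
    if (accepted.Length == 1)
        return last; // no slope available yet, so project a flat line
    double previous = accepted[accepted.Length - 2];
    return last + (last - previous);
}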
If your data is particularly noisy, you may want to apply a smoothing filter to it before you pass it to GuessNext. You'll just have to spend some time with the algorithm to come up with something that makes sense for your data.
Your example data appears to be parametric in that each sample defines both an X and a Y value. You might be able to apply the above logic to each dimension independently, which would be appropriate if only one dimension is giving you bad numbers. This can be particularly successful in cases where one dimension is a timestamp, for instance, and the timestamp is occasionally bogus.
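The filter loop and projection step described above can be sketched in Python as follows. This is only an illustration under assumptions, not a tuned implementation: the names filter_data, guess_next and hold_last, the default thresholds and the sample data are all hypothetical, and the predictor is deliberately pluggable because the right projection depends on your process.

```python
from collections import deque


def guess_next(history, raw):
    """Linear projection from the last two accepted samples (assumed model)."""
    if not history:
        return raw              # nothing to compare against yet
    if len(history) == 1:
        return history[-1]      # no slope yet, project a flat line
    return history[-1] + (history[-1] - history[-2])


def hold_last(history, raw):
    """Alternative predictor: repeat the last accepted value (flat line)."""
    return history[-1] if history else raw


def filter_data(raw_data, max_error=5.0, max_history=100, predictor=guess_next):
    """Replace samples that stray too far from the projection."""
    history = deque(maxlen=max_history)   # old samples drop off automatically
    filtered = []
    for raw in raw_data:
        projected = predictor(history, raw)
        sample = projected if abs(projected - raw) > max_error else raw
        filtered.append(sample)
        history.append(sample)
    return filtered


# Hypothetical data: a ramp with one bad reading.
cleaned = filter_data([0.0, 1.0, 2.0, 3.0, 100.0, 5.0, 6.0])
flattened = filter_data([1.0, 1.0, 50.0, 1.0], predictor=hold_last)
```

Passing hold_last as the predictor gives the "flat line" replacement asked for in the question, while guess_next follows the local slope instead.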

If removing the outliers by eye is not possible, try kriging (with error terms) as in http://www.ipf.tuwien.ac.at/cb/publications/pipeline.pdf . This seems to work quite well to automatically deal with occasional extreme noise. I know that French meteorologists use such an approach to remove outliers in their data (like a fire next to a temperature sensor or something kicking a wind sensor for instance).
Please note that it is a difficult problem in general. Any information about the errors is precious. Did someone kick the measuring device? Then you cannot do much except remove the offending data by hand. Is your noise systematic? Then you can do a lot by making (reasonable) hypotheses about it.

Related

scheduling prefetch in halide rdom update stage

I've been trying to recreate a hand-tuned C function in Halide. It is a series of histograms computed on vertical scanlines of the source image. As such, I'm using a one-dimensional RDom to iterate over the source image.
RDom reductionY(0, input.height());
parade(x,y,c) = Halide::cast<uint16_t>(0);
parade(x, input(x, reductionY, c), c) += Halide::cast<uint16_t>(1);
To increase locality, I'm wrapping the rdom in another func so I can schedule it with compute_at.
wrapper(x,y,c) = parade(x, y, c);
parade.update(0).reorder(c, reductionY, x);
parade.update(0).split(x, x_outer, x_inner, THREADWIDTH);
parade.compute_at(wrapper, x_outer);
This (plus some vectorization/parallelization I've stripped out for this question) closely matches my hand tuned original. One thing the original benefits from that I can't figure out how to schedule, is to prefetch the first read of each vertical line from input in the update(0) stage. If I schedule
parade.update(0).prefetch(inputParam, x_inner, 3);
it seems to prefetch every pixel to be read? My hope is to issue a single prefetch to the first pixel to be read.
At first glance, the code you posted doesn't seem to be complete: parade is computed at the x_outer dimension of wrapper, but wrapper has never been split to create such a dimension. Seeing the exact code would help, and you may also find both print_loop_nest and compiling to a lowered statement file useful in seeing the exact structure and figuring out where you want the prefetch to be executed.
Quickly, though, I don't believe prefetches can be issued for only a subset of the used data—logically, they apply to the whole block of the data to be used at a given granularity. Do you observe poor performance due to prefetching the whole column rather than a single pixel? Explicitly prefetching a single pixel seems likely to help only insofar as it may trigger the hardware prefetcher to speculatively fetch the whole column.
If this is a case where a known-better approach is not representable in the current Halide model, however, you should share it with the halide-dev list or as an issue on GitHub with a simple reproducer for your target platform (x86?).

Re-use Eigen::SimplicialLLT's symbolic decomposition

I am struggling a bit with the API of the Eigen Library, namely the SimplicialLLT class for Cholesky factorization of sparse matrices.
I have three matrices that I need to factor and later use to solve many equation systems (changing only the right side) - therefore I would like to factor these matrices only once and then just re-use them. Moreover, they all have the same sparsity pattern, so I would like to do the symbolic decomposition only once and then use it for the numerical decomposition of all three matrices. According to the documentation, this is exactly what the SimplicialLLT::analyzePattern and SimplicialLLT::factorize methods are for. However, I can't seem to find a way to keep all three factors in memory.
This is my code:
I have these member variables in my class I would like to fill with the factors:
Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> choleskyA;
Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> choleskyB;
Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> choleskyC;
Then I create the three sparse matrices A, B and C and want to factor them:
choleskyA.analyzePattern(A);
choleskyA.factorize(A);
choleskyB.analyzePattern(B); // this has already been done!
choleskyB.factorize(B);
choleskyC.analyzePattern(C); // this has already been done!
choleskyC.factorize(C);
And later I can use them for solutions over and over again, changing just the b vectors of right sides:
xA = choleskyA.solve(bA);
xB = choleskyB.solve(bB);
xC = choleskyC.solve(bC);
This works (I think), but the second and third call to analyzePattern are redundant. What I would like to do is something like:
choleskyA.analyzePattern(A);
choleskyA.factorize(A);
choleskyB = choleskyA.factorize(B);
choleskyC = choleskyA.factorize(C);
But that is not an option with the current API (we use Eigen 3.2.3, but if I see correctly there is no change in this regard in 3.3.2). The problem here is that the subsequent calls to factorize on the same instance of SimplicialLLT will overwrite the previously computed factor and at the same time, I can't find a way to make a copy of it to keep. I took a look at the sources but I have to admit that didn't help much as I can't see any simple way to copy the underlying data structures. It seems to me like a rather common usage, so I feel like I am missing something obvious, please help.
What I have tried:
I tried using simply choleskyB = choleskyA hoping that the default copy constructor will get things done, but I have found out that the base classes are designed to be non-copyable.
I can get the L and U matrices (there's a getter for them) from choleskyA, make a copy of them and store only those and then basically copy-paste the content of SimplicialCholeskyBase::_solve_impl() (copy-pasted below) to write the method for solving myself using the previously stored L and U directly.
template<typename Rhs,typename Dest>
void _solve_impl(const MatrixBase<Rhs> &b, MatrixBase<Dest> &dest) const
{
    eigen_assert(m_factorizationIsOk && "The decomposition is not in a valid state for solving, you must first call either compute() or symbolic()/numeric()");
    eigen_assert(m_matrix.rows()==b.rows());
    if(m_info!=Success)
        return;
    if(m_P.size()>0)
        dest = m_P * b;
    else
        dest = b;
    if(m_matrix.nonZeros()>0) // otherwise L==I
        derived().matrixL().solveInPlace(dest);
    if(m_diag.size()>0)
        dest = m_diag.asDiagonal().inverse() * dest;
    if (m_matrix.nonZeros()>0) // otherwise U==I
        derived().matrixU().solveInPlace(dest);
    if(m_P.size()>0)
        dest = m_Pinv * dest;
}
...but that's quite an ugly solution plus I would probably screw it up since I don't have that good understanding of the process (I don't need the m_diag from the above code since I am doing LLT, right? that would be relevant only if I was using LDLT?). I hope this is not what I need to do...
A final note - adding the necessary getters/setters to the Eigen classes and compiling "my own" Eigen is not an option (well, not a good one) as this code will (hopefully) be further redistributed as open source, so it would be troublesome.
This is quite an unusual pattern. In practice the symbolic factorization is very cheap compared to the numerical factorization, so I'm not sure it's worth bothering much. The cleanest solution to address this pattern would be to make SimplicialLLT and SimplicialLDLT copyable.

Integrating multiple raymarching samples

Let's say I'm using raymarching to render a field function. (This is on the CPU, not the GPU.) I have an algorithm like this crudely-written pseudocode:
pixelColour = arbitrary;
pixelTransmittance = 1.0;
t = min_view_distance; // hypothetical near bound; must be > 0 because t grows by a factor each step
while (t < max_view_distance) {
    point = rayStart + t*rayDirection;
    emission, opacity = sampleFieldAt(point);
    pixelColour, pixelTransmittance =
        integrate(pixelColour, pixelTransmittance, emission, opacity);
    t = t * stepFactor;
}
return pixelColour;
The logic is all really simple... but how does integrate() work?
Each sample actually represents a volume in my field, not a point, even though the sample is taken at a point; therefore the effect on the final pixel colour will vary according to the size of the volume.
I don't know how to do this. I've had a look around, but while I've found lots of code which does it (usually on Shadertoy), it all does it differently and I can't find any explanations of why. How does this work, and more importantly, what magic search terms will let me look it up on Google?
It's the Beer-Lambert law, which governs extinction through participating homogeneous media. No wonder I was unable to find any keywords which worked.
There's a good writeup here, which tells me almost everything I need to know, although it does rather gloss over the calculation of the phase functions. But at least now I know what to read up on.
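As a rough illustration of how Beer-Lambert could drive integrate(), here is a minimal Python sketch. It rests on assumptions that are not in the question: the medium is treated as homogeneous within each step, extinction is a coefficient per unit length, emission is the colour the medium would show if fully opaque, colour is a single scalar channel, the accumulated colour starts at 0, and step_length is an extra parameter carrying the size of the sampled segment. Phase functions and scattering are ignored.

```python
import math


def integrate(colour, transmittance, emission, extinction, step_length):
    """Accumulate one ray segment using Beer-Lambert attenuation."""
    # Fraction of light that survives crossing this segment.
    segment_transmittance = math.exp(-extinction * step_length)
    # The segment contributes what it emits, dimmed by everything in front of it.
    colour = colour + transmittance * (1.0 - segment_transmittance) * emission
    # Everything behind this segment is dimmed further.
    transmittance = transmittance * segment_transmittance
    return colour, transmittance


# One long step versus two half steps through the same medium.
one_step = integrate(0.0, 1.0, 0.8, 0.5, 2.0)
half = integrate(0.0, 1.0, 0.8, 0.5, 1.0)
two_steps = integrate(half[0], half[1], 0.8, 0.5, 1.0)
```

Because the attenuation is exponential in the segment length, splitting a uniform segment into smaller steps gives the same result, which is what makes the sample's volume size enter the calculation correctly.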

How to implement a part of histogram equalization in matlab without using for loops and influencing speed and performance

Suppose that I have these three variables in MATLAB: OldGrayLevels, NewGrayLevels and OldHistogram.
I want to extract the distinct values in NewGrayLevels and, for each distinct value, sum the rows of OldHistogram that sit in the same rows as that value.
For example, you can see in NewGrayLevels that the first six rows are equal to zero. It means that 0 in NewGrayLevels has taken its value from (0 1 2 3 4 5) of OldGrayLevels. So the corresponding rows in OldHistogram should be summed.
So 0+2+12+38+113+163=328 would be the frequency of the gray level 0 in the equalized histogram and so on.
Those who are familiar with image processing know that it's part of the histogram equalization algorithm.
Note that I don't want to use built-in function "histeq" available in image processing toolbox and I want to implement it myself.
I know how to write the algorithm with for loops. I'm asking whether there is a faster way without using for loops.
The code using for loops:
for k=0:255
    Condition = NewGrayLevels==k;
    ConditionMultiplied = Condition.*OldHistogram;
    NewHistogram(k+1,1) = sum(ConditionMultiplied);
end
I'm afraid this code will be slow for large, high-resolution images. The variables I have uploaded are for a small image downloaded from the internet, but my code may be used for satellite images.
I know you say you don't want to use histeq, but it might be worth your time to look at the MATLAB source file to see how the developers wrote it and copy the parts of their code that you would like to implement. Just do edit('histeq') or edit('histeq.m'), I forget which.
Usually the MATLAB code is vectorized where possible and runs pretty quick. This could save you from having to reinvent the entire wheel, just the parts you want to change.
I can't think of a way to implement this without a for loop somewhere, but one optimisation you could make would be using indexing instead of multiplication:
for k=0:255
    Condition = NewGrayLevels==k; % These act as logical indices to OldHistogram
    NewHistogram(k+1,1) = sum(OldHistogram(Condition)); % Removes a vector multiplication, some additions, and an index-to-double conversion
end
Edit:
On rereading your initial post, I think that the way to do this without a for loop is to use accumarray (I find this a difficult function to understand, so read the documentation and search online and on here for examples to do so):
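NewHistogram = accumarray(1+double(NewGrayLevels),OldHistogram,[256 1]);
The double() conversion matters if NewGrayLevels is stored as uint8, where 255+1 would saturate at 255. The third argument fixes the output at 256 bins, so NewHistogram has the same length as OldHistogram even when the maximum value in NewGrayLevels is below 255.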
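Since accumarray can be hard to picture, here is a small Python sketch of what that call computes: every value is added into the bin named by its index. The helper name accumulate is hypothetical, the loop is written out only to show the semantics (it is not a performance suggestion), and apart from the first six counts, which come from the question, the data is made up.

```python
def accumulate(bins, values, size):
    """Sum each value into the bin given by the matching index."""
    totals = [0] * size
    for bin_index, value in zip(bins, values):
        totals[bin_index] += value
    return totals


# First six counts are from the question; the remaining rows are hypothetical.
new_gray_levels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
old_histogram = [0, 2, 12, 38, 113, 163, 200, 250, 300]

new_histogram = accumulate(new_gray_levels, old_histogram, size=256)
```

Here new_histogram[0] is 0+2+12+38+113+163 = 328, matching the worked example in the question, and bins that no old gray level maps to stay at zero.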
Well, I realized that there's no need to write the code that @Hugh Nolan suggested. See the explanation here:
%The green lines are because after writing the code, I understood that
%there's no need to calculate the equalized histogram in
%"HistogramEqualization" function and after gaining the equalized image
%matrix you can pass it to the "ExtractHistogram" function
% (which there's no loops in it) to acquire the
%equalized histogram.
%But I didn't delete those lines of code because I had tried a lot to
%understand the algorithm and write them.
For more information and studying the code, please see my next question.

Tree Algorithm

I was thinking earlier today about an idea for a small game and got stuck on how to implement it. The idea is that the player can make a series of moves that each cause a small effect, but if done in a specific sequence would cause a greater effect. So far so good, this I know how to do. Obviously, I had to make it more complicated (because we love to make things more complicated), so I thought that there could be more than one possible path for the sequence, each causing a greater effect, albeit different ones. Also, part of some sequences could be the beginning of other sequences, or even whole sequences could be contained in other, bigger sequences. Now I don't know for sure the best way to implement this. I had some ideas, though.
1) I could implement a circular n-linked list. But since the list of moves never ends, I fear it might cause a stack overflow ™. The idea is that every node would have n children and, upon receiving a command, it might lead you to one of its children or, if no child was available for that command, lead you back to the beginning. Upon arrival at any child, a couple of functions would be executed, causing the small and big effects. This might, though, lead to a lot of duplicated nodes in the tree to cope with all the possible sequences ending on that specific move with different effects, which might be a pain to maintain, but I am not sure. I have never tried something this complex in code, only theoretically. Does this algorithm exist and have a name? Is it a good idea?
2) I could implement a state machine. Then instead of wandering around a linked list, I'd have some giant nested switch that would call functions and update the machine state accordingly. Seems simpler to implement, but... well... doesn't seem fun... nor elegant. Giant switches always seem ugly to me, but would this work better?
3) Suggestions? I am good, but I am far from experienced. The good thing about the coding field is that no matter how weird your problem is, someone has solved it in the past, but you must know where to look. Someone might have a better idea than the ones I had, and I would really like to hear suggestions.
I'm not absolutely sure that I understand exactly what you're saying, but as an analogous situation, say someone's inputting an endless stream of numbers on the keyboard. '117' is a magic sequence, '468' is another one, '411799' is another (which contains the first one).
So if the user enters:
55468411799
you want to fire 'magic events' at the *s:
55468*4117*99*
or something like that, right? If that's analogous to the problem you're talking about, then what about something like (Java-like pseudocode):
MagicSequence fireworks = new MagicSequence(new FireworksAction(), 1, 1, 7);
MagicSequence playMusic = new MagicSequence(new MusicAction(), 4, 6, 8);
MagicSequence fixUserADrink = new MagicSequence(new ManhattanAction(), 4, 1, 1, 7, 9, 9);
Collection<MagicSequence> sequences = ... all of the above ...;
while (true) {
    int num = readNumberFromUser();
    for (MagicSequence seq : sequences) {
        seq.handleNumber(num);
    }
}
while MagicSequence has something like:
Action action = ... populated from constructor ...;
int[] sequence = ... populated from constructor ...;
int position = 0;
public void handleNumber(int num) {
    if (num == sequence[position]) {
        // They've entered the next number in the sequence
        position++;
        if (position == sequence.length) {
            // They've got it all!
            action.fire();
            position = 0; // Or disable this Sequence from accepting more numbers if it's a once-off
        }
    } else {
        // Missed a number, start again; the number just entered may itself
        // be the first number of the sequence
        position = (num == sequence[0]) ? 1 : 0;
    }
}
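A matcher that resets its position on a mismatch can miss a sequence whose beginning overlaps the input that caused the reset (for example the input 1, 1, 1, 7 against the sequence 1, 1, 7). One way around that, sketched here in Python as an alternative rather than as the approach above, is for each sequence to remember only the last few inputs and check whether they equal the whole sequence. The class and function names are hypothetical and effects are represented by plain strings.

```python
from collections import deque


class MagicSequence:
    """Fires when the most recent inputs equal the whole sequence."""

    def __init__(self, name, *sequence):
        self.name = name
        self.sequence = tuple(sequence)
        # Keep only as many recent inputs as the sequence is long.
        self.recent = deque(maxlen=len(self.sequence))

    def handle(self, value):
        self.recent.append(value)
        return tuple(self.recent) == self.sequence


def fired_events(stream, sequences):
    """Return (input index, sequence name) for every completed sequence."""
    events = []
    for index, value in enumerate(stream):
        for seq in sequences:
            if seq.handle(value):
                events.append((index, seq.name))
    return events


sequences = [
    MagicSequence("fireworks", 1, 1, 7),
    MagicSequence("music", 4, 6, 8),
    MagicSequence("drink", 4, 1, 1, 7, 9, 9),
]
events = fired_events([5, 5, 4, 6, 8, 4, 1, 1, 7, 9, 9], sequences)
overlap = fired_events([1, 1, 1, 7], [MagicSequence("fireworks", 1, 1, 7)])
```

Each input costs one comparison of at most the sequence length per sequence, which is fine for short combos. Note that this variant does not reset after firing, so a sequence such as 1, 1 would fire twice on the input 1, 1, 1; clear the window after a match if that is not wanted.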
You might want to implement a state machine anyway, but you don't have to hardcode state transitions.
Try to make a graph of states, where link between state A to state B will mean A can lead to B.
Then you can traverse graph at runtime to find where player goes.
Edit: You can define graph node as:
-state-id
-list of links to other states,
where every link defines:
-state-id
-precondition, a list of states what must be visited before going to this state
What you're describing sounds very similar to the technology tree in a game like Civilization.
I don't know how the Civ authors built theirs, but I'd be inclined to use a multigraph to represent possible 'moves' - there will be some you can start at with no 'experience', and once you're in them, there will be multiple paths through to the end.
Draw out the potential options you can have at each stage of the game, and then draw lines going from some options to others.
That should give you a start on implementation, as graphs are [relatively] easy concepts to implement and utilize.
Sounds like a neural network. You could create one and train it to recognize the patterns that cause the various effects you are looking for.
What you're describing sounds somewhat similar to a dependency graph or a word graph. You might look into those.
@Cowan, @Javier: Nice idea, mind if I add to it?
Let the MagicSequence objects listen to the incoming stream of user input, that is, notify them of the input (broadcast) and let each of them add the input to their internal input stream. This stream is cleared when the input is not the expected next input in the pattern that would have the MagicSequence fire its action. As soon as the pattern is completed, fire the action and clear the internal input stream.
Optimize this by only feeding input to the MagicSequences that are waiting for it. This could be done two ways:
You have an object that lets all MagicSequences connect with events that correspond with numbers in their patterns. MagicSequence(1,1,7) would add itself to got1 and got7, for example:
UserInput.got1 += MagicSequence[i].SendMeInput;
You could optimize this such that after each input MagicSequences deregister from invalid events and register with valid ones.
Create a small state machine for each effect that you want. At each user action, 'broadcast' it to all state machines. Most of them won't care, but some will advance, or maybe go backwards. When one of them reaches its goal, produce the desired effect.
To keep the code neat, don't hardcode the state machines; instead, build a simple data structure that encodes the state graph: each state is a node with a list of interesting events, each of which points to the next state's node. Each machine's state is simply a reference to the appropriate state node.
Edit: It seems Cowan's advice is equivalent to this, but he optimises his state machines to express only simple sequences. That seems enough for your specific problem, but more complex conditions could need a more general solution.
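A data-driven state graph of this kind might look like the following Python sketch, in which sequences that share a beginning share nodes, so a short combo can sit inside a longer one. All names (StateNode, build_graph, run) and the move names are hypothetical, and effects are plain strings. The restart rule on a dead end is deliberately simple: it lets the current event begin a new sequence but does not track every possible overlap.

```python
class StateNode:
    """One state: outgoing transitions keyed by event, plus an optional effect."""

    def __init__(self):
        self.next = {}       # event -> StateNode
        self.action = None   # effect fired when this node is reached


def build_graph(sequences):
    """Build a shared-prefix graph from {effect name: sequence of events}."""
    root = StateNode()
    for action, events in sequences.items():
        node = root
        for event in events:
            node = node.next.setdefault(event, StateNode())
        node.action = action
    return root


def run(root, stream):
    """Feed events through the graph and collect the effects that fire."""
    fired = []
    state = root
    for event in stream:
        following = state.next.get(event)
        if following is None:
            # Dead end: go back to the start, letting this event begin a new sequence.
            following = root.next.get(event, root)
        state = following
        if state.action is not None:
            fired.append(state.action)
    return fired


root = build_graph({
    "combo": ("punch", "kick"),
    "super": ("punch", "kick", "jump"),
    "guard": ("punch", "block"),
})
fired = run(root, ["punch", "kick", "jump", "punch", "block"])
```

Adding a new effect is then a matter of adding an entry to the dictionary rather than editing a nested switch.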

Resources