Regularization vs. Validation

What I always see in papers and articles about under/overfitting is a falling curve for training error and a U-shaped curve for test error, with the area to the left of the bottom of the U labelled as underfitting and the area to the right of it as overfitting.
To find the best model, we can test each configuration (e.g. changing the number of nodes and layers) and compare the test error values to find the minimum point (typically via cross-validation). That looks straightforward and perfect.
Do we need a regularizer to reach this point? This is where I am not sure I understand the topic well. To me, it seems that we don't need a regularizer if we can test different model configurations. The only case where a regularizer comes into play is when we have a fixed model configuration (e.g. a fixed number of nodes and layers) and don't want to try other configurations, so we use a regularizer to limit the model complexity by forcing other model parameters (e.g. network weights) to low values. Is this view right?
But if it is right, then what is the intuition behind it? First of all, when using a regularizer we don't know in advance whether this network configuration/complexity puts us to the right or left of the minimum of the test error curve. The model may already be underfit, overfit, or well fit. Putting the math aside, why does forcing weights to lower values make the network more generalizable and less overfit? Is there any analogy between this method and the previous method of moving along the test loss curve to find its minimum? Also, a regularizer does its job during training; it cannot do anything with the test data. How can it help to move toward the minimum test error?
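To make the fixed-configuration case concrete, here is a minimal sketch, not a definitive answer. It assumes a degree-9 polynomial regression as a stand-in for a network with a fixed architecture, an L2 (ridge) penalty as the regularizer, and synthetic data; all names and values are illustrative. The model size never changes; only the penalty strength `lam` is swept, and a held-out validation set is used to pick it, much like picking the number of nodes by validation error.

```python
import numpy as np

def poly_features(x, degree):
    # Columns x^0, x^1, ..., x^degree (a deliberately flexible model)
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y
    # (for simplicity the intercept column is penalized too)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (assumption): noisy samples of sin(3x)
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
x_val = rng.uniform(-1, 1, 200)
y_train = np.sin(3 * x_train) + 0.3 * rng.normal(size=x_train.size)
y_val = np.sin(3 * x_val) + 0.3 * rng.normal(size=x_val.size)

degree = 9  # fixed "architecture", never changed below
X_train = poly_features(x_train, degree)
X_val = poly_features(x_val, degree)

results = []
for lam in [1e-4, 1e-2, 1.0, 100.0]:
    w = fit_ridge(X_train, y_train, lam)
    results.append((lam,
                    mse(X_train, y_train, w),
                    mse(X_val, y_val, w),
                    float(np.linalg.norm(w))))

# The penalty strength is chosen on held-out data, not on the training set
best_lam = min(results, key=lambda r: r[2])[0]
for lam, train_err, val_err, w_norm in results:
    print(f"lambda={lam:g} train={train_err:.3f} val={val_err:.3f} |w|={w_norm:.2f}")
print("lambda with lowest validation error:", best_lam)
```

In this sketch the training error can only grow and the weight norm can only shrink as `lam` increases, so `lam` acts as a continuous complexity knob on a fixed model, and the held-out error is what decides where to stop.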

Related

Binary classification of sensor data

My problem is the following: I need to classify a data stream coming from a sensor. I have managed to get a baseline using the median of a window, and I subtract the values from that baseline (I want to avoid negative peaks, so I only use the absolute value of the difference).
Now I need to distinguish an event (= something triggered the sensor) from the noise near the baseline.
The problem is that I don't know which method to use.
I have thought of several approaches:
sum up the values in a window, if the sum is above a threshold the class should be EVENT ('Integrate and dump')
sum up the differences of the values in a window and get the mean value (which gives something like the first derivative), if the value is positive and above a threshold set class EVENT, set class NO-EVENT otherwise
combination of both
(unfortunately these approaches have the drawback that I need to guess the threshold values and set the window size)
using an SVM that learns from manually classified data (but I don't know how to set up this algorithm properly: which features should I look at, e.g. median/mean of a window, integral, first derivative?)
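For reference, the first approach in the list (sum over a window, compare against a threshold) could be sketched as follows. This is a minimal illustration that assumes non-overlapping windows and baseline-subtracted absolute values as input; the function name, window size and threshold are made up for the example and would have to be tuned on real data.

```python
import numpy as np

def integrate_and_dump(signal, window, threshold):
    """Label each non-overlapping window as EVENT (True) if its sum exceeds threshold."""
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window
    # Drop trailing samples that do not fill a whole window
    trimmed = signal[: n_windows * window]
    sums = trimmed.reshape(n_windows, window).sum(axis=1)
    return sums > threshold

# Baseline-subtracted absolute values (made-up numbers)
samples = [0.1, 0.0, 0.2, 0.1, 2.0, 2.5, 1.8, 2.2, 0.1, 0.0, 0.1, 0.2]
labels = integrate_and_dump(samples, window=4, threshold=3.0)
print(labels)  # one boolean per window
```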
What would you suggest? Are there better/simpler methods to get this task done?
I know there are a lot of sophisticated algorithms, but I'm confused about which would be the best way. Please have a little patience with a newbie who has no machine learning/DSP background :)
Thank you a lot and best regards.

The key to evaluating your heuristic is to develop a model of the behaviour of the system.
For example, what is the model of the physical process you are monitoring? Do you expect your samples, for example, to be correlated in time?
What is the model for the sensor output? Can it be modelled as, for example, a discretized linear function of the voltage? Is there a noise component? Is the magnitude of the noise known or unknown but constant?
Once you've listed your knowledge of the system that you're monitoring, you can then use that to evaluate and decide upon a good classification system. You may then also get an estimate of its accuracy, which is useful for consumers of the output of your classifier.
Edit:
Given the more detailed description, I'd suggest trying some simple models of behaviour that can be tackled using classical techniques before moving to a generic supervised learning heuristic.
For example, suppose:
The baseline, event threshold and noise magnitude are all known a priori.
The underlying process can be modelled as a Markov chain: it has two states (off and on) and the transition times between them are exponentially distributed.
You could then use a hidden Markov model (HMM) approach to determine the most likely underlying state at any given time. Even when the noise parameters and thresholds are unknown, you can use the HMM forward-backward training method (Baum-Welch) to train the parameters (e.g. mean, variance of a Gaussian) associated with the output for each state.
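As a rough sketch of the known-parameter case, the snippet below runs Viterbi decoding on a two-state (off/on) HMM with Gaussian outputs. Note that Viterbi returns the single most likely state sequence, which is one common way to decode; per-sample posteriors would need forward-backward instead. The means, standard deviations, transition matrix and start probabilities are assumed values chosen for illustration, not something given in the question.

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, trans, start):
    """Most likely state sequence for an HMM with one Gaussian output per state."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    log_trans = np.log(np.asarray(trans, dtype=float))
    n_states = len(means)

    # log_emit[t, s]: log density of obs[t] under the Gaussian of state s
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds * np.sqrt(2 * np.pi)))

    score = np.log(np.asarray(start, dtype=float)) + log_emit[0]
    back = np.zeros((len(obs), n_states), dtype=int)
    for t in range(1, len(obs)):
        # cand[i, j]: best score of a path that is in state i at t-1 and moves to j
        cand = score[:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_states)] + log_emit[t]

    # Trace the best path backwards from the best final state
    path = np.zeros(len(obs), dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(len(obs) - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# State 0 = off (near baseline), state 1 = on (event); all parameters assumed
readings = [0.1, 0.3, 0.2, 2.1, 1.9, 2.3, 0.2, 0.1]
states = viterbi_gaussian(readings,
                          means=[0.2, 2.0], stds=[0.3, 0.3],
                          trans=[[0.9, 0.1], [0.1, 0.9]],
                          start=[0.9, 0.1])
print(states)
```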
If you know even more about the events, you can get by with simpler approaches: for example, if you knew that the event signal always reached a level above the baseline + noise, and that events were always separated in time by an interval larger than the width of the event itself, you could just do a simple threshold test.
Edit:
The classic intro to HMMs is Rabiner's tutorial (a copy can be found here). Relevant also are these errata.

from your description a correctly parameterized moving average might be sufficient
Try to understand the sensor and its output. Build a model and write a simulator that provides mock data covering the expected data, including noise and the like
Record lots of real sensor data
visualize the data and verify your assumptions and model
annotate your sensor data, i.e. generate ground truth (your simulator should do that for the mock data)
from what you learned till now propose one or more algorithms
make a test system that can verify your algorithms against ground truth and do regression against previous runs
implement your proposed algorithms and run them against ground truth
try to understand the false positives and false negatives from the recorded data (and try to adapt your simulator to reproduce them)
adapt your algorithm(s)
some other tips
you may implement hysteresis on thresholds to avoid bouncing
you may implement delays to avoid bouncing
beware of delays if implementing debouncers or low pass filters
you may implement multiple algorithms and voting
for testing relative improvements you may run regression tests on large amounts of unannotated data, then check only the detections that flipped to find the performance increase/decrease
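The hysteresis tip above can be sketched in a few lines. This is a minimal illustration assuming two hand-picked thresholds, `high` to switch the detection on and `low` to switch it off; the names and values are hypothetical.

```python
def hysteresis_detect(values, high, low):
    """Turn on when a value reaches `high`; turn off only when it drops to `low`."""
    state = False
    flags = []
    for v in values:
        if not state and v >= high:
            state = True
        elif state and v <= low:
            state = False
        flags.append(state)
    return flags

# Values between low and high keep the previous state, so the output does not bounce
flags = hysteresis_detect([0.1, 0.9, 1.2, 0.8, 0.7, 0.3, 0.6, 1.1], high=1.0, low=0.5)
print(flags)
```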

Indoor positioning of a moving object in 3D space

I am working on a project which determines the indoor position of an object which moves in 3D space (e.g. a quadcopter).
I have built some prototypes which use a combination of gyroscope, accelerometer and compass. However, the results were far from satisfactory, especially regarding the distance moved, which I calculated using the accelerometer. Determining the orientation using a fusion of gyroscope and compass was close to perfect.
In my opinion I am missing some more sensors to get some acceptable results. Which additional sensors would I need for my purpose? I was thinking about adding one or more infrared cameras/distance sensors. I have never worked with such sensors and I am not sure which sensor would lead to better results.
I appreciate any suggestions, ideas and experiences.

Distance measurements would definitely help. The whole approach of any ground geodetic survey is based on the concept of a start/finish check. You know the start, then you add error-prone steps, and arrive at a finish that you also know. But you have accumulated some total error along the way. You then distribute that error among all the steps taken, with the opposite sign, of course.
Interestingly, in most cases this not only somewhat reduces the effect of random errors, but almost eliminates the systematic ones, because these are mostly linear or close to linear, and such a linear distribution of the error found simply cancels them.
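A minimal sketch of this start/finish check, assuming the simplest possible rule (every step gets an equal share of the closing error, with the opposite sign). The function name and the numbers are illustrative; a real adjustment would weight the steps, as described further on.

```python
import numpy as np

def distribute_closure_error(start, steps, known_finish):
    """Spread the closing error equally over all steps and return corrected positions."""
    start = np.asarray(start, dtype=float)
    steps = np.asarray(steps, dtype=float)
    computed_finish = start + steps.sum(axis=0)
    misclosure = computed_finish - np.asarray(known_finish, dtype=float)
    # Each step gets an equal share of the error, with the opposite sign
    corrected_steps = steps - misclosure / len(steps)
    return start + np.cumsum(corrected_steps, axis=0)

# Four measured steps of 1 m along x, each with a constant +0.1 m bias (made up)
measured = [[1.1, 0.0, 0.0]] * 4
positions = distribute_closure_error([0, 0, 0], measured, known_finish=[4, 0, 0])
print(positions)
```

With a constant bias like the one in this example, the equal distribution removes the error completely, which illustrates why systematic errors are the ones that benefit most.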
That is only an illustration of the idea. Any non-trivial task involves collecting all the data and finding their dependencies, linearizing them, and setting up parametric or correlate systems of equations. Solving them gives you the optimal corrections to the measured values. With the parametric method you can also easily estimate the errors of these new values.
The foundation of these methods is Gauss's method of least squares. More concrete procedures can be found in old books on geodesy/geomatics/triangulation/geodetic networks. Books written after the introduction of GPS are of little use, because GPS simplified everything enormously. Look for books with matrix formulae for least squares solutions.
Sorry if I have mistranslated some terms into English.

Dividing the world in a thousand or so locations

Background: I want to create a weather service, and since most available APIs limit the number of daily calls, I want to divide the planet into a thousand or so areas.
Obviously, internet users are not uniformly distributed, so the sampling should be finer around densely populated regions.
How should I go about implementing this?
Where can I find data regarding geographical internet user density?
The algorithm will probably be something similar to k-means. However, implementing it on a sphere with oceans may be a bit tricky. Any insight?
Finally, maybe there is a way I can avoid doing all of this?

Very similar to k-means is the centroidal Voronoi diagram (it is the continuous version of k-means). However, this would produce a uniform tessellation of your sphere that does not account for user density as you wish.
So a similar solution is to use the same technique with a power diagram: a power diagram is a Voronoi diagram that accounts for a density (by assigning a weight to each Voronoi seed). Such a diagram can be computed using an embedding in 3D space (instead of 2D) that consists of the first two (x, y) coordinates plus a third one, which is the square root of [any large positive constant minus the weight of the given point].
Using that, you can obtain a tessellation of your domain that accounts for user density.
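As a simpler stand-in for experimenting, and explicitly not the power-diagram construction described above, here is a sketch of density-weighted k-means (Lloyd iterations) on the unit sphere. It assumes the density is available as one positive weight per sample point; the function name, the random test data and the iteration count are all hypothetical.

```python
import numpy as np

def weighted_spherical_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd iterations on the unit sphere with per-point weights (e.g. user density)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    points = points / np.linalg.norm(points, axis=1, keepdims=True)

    # Start from k distinct sample points, drawn in proportion to their weight
    idx = rng.choice(len(points), size=k, replace=False, p=weights / weights.sum())
    centers = points[idx].copy()

    for _ in range(n_iter):
        # On the unit sphere, the nearest center is the one with the largest dot product
        labels = np.argmax(points @ centers.T, axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                mean = (weights[mask, None] * points[mask]).sum(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centers[j] = mean / norm  # project the weighted mean back onto the sphere
    labels = np.argmax(points @ centers.T, axis=1)
    return centers, labels

# Made-up data: random directions with random positive "density" weights
rng = np.random.default_rng(1)
sample_points = rng.normal(size=(500, 3))
sample_weights = rng.uniform(0.1, 1.0, size=500)
centers, labels = weighted_spherical_kmeans(sample_points, sample_weights, k=10)
print(centers.shape, np.bincount(labels, minlength=10))
```

Points with zero weight (open ocean, say) have no influence on where centers move, so centers tend to concentrate where the weighted samples are.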

You don't care about internet user density in general. You care about the density of users using your service - and you don't care where those users are, you care where they ask about. So once your site has been going for more than a day you can use the locations people ask about the previous day to work out what the areas should be for the next day.
Dynamic programming on a tree is easy. What I would do for an algorithm is to build a tree of successively more finely divided cells. More cells mean a smaller error, because people get predictions for points closer to them, and you can work out the error, or at least the relative error between more cells and fewer cells. Starting from the bottom up, work out the smallest possible total error contributed by each subtree when it is allowed to be divided into 1, 2, 3, ..., N cells. You can work out the best possible division and smallest possible error for each k = 1..N for a node by looking at the smallest possible error you have already calculated for each of its descendants, and working out how best to share out the available k divisions between them.
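The bottom-up computation could be sketched like this. It is a minimal illustration that assumes each node is a dict with an `error` (the error if the node is kept as a single cell) and an optional list of `children`; that representation and the numbers are invented for the example.

```python
import math

def best_errors(node, max_cells):
    """best[k] = smallest total error when this subtree is covered by exactly k cells."""
    best = [math.inf] * (max_cells + 1)  # index 0 is unused
    best[1] = node["error"]              # keep the node as one undivided cell
    children = node.get("children", [])
    if not children:
        return best

    # combined[k]: smallest error covering the children seen so far with k cells in total
    combined = [0.0] + [math.inf] * max_cells
    for child in children:
        child_best = best_errors(child, max_cells)
        merged = [math.inf] * (max_cells + 1)
        for used in range(max_cells + 1):
            if combined[used] == math.inf:
                continue
            # Every child needs at least one cell
            for k in range(1, max_cells - used + 1):
                total = combined[used] + child_best[k]
                if total < merged[used + k]:
                    merged[used + k] = total
        combined = merged

    for k in range(2, max_cells + 1):
        best[k] = min(best[k], combined[k])
    return best

# Toy tree: the root splits into A and B, and A splits further into two leaves
tree = {"error": 10.0, "children": [
    {"error": 6.0, "children": [{"error": 1.0}, {"error": 1.0}]},
    {"error": 2.0},
]}
errors = best_errors(tree, max_cells=3)
print(errors[1:])  # error for exactly 1, 2 and 3 cells
```

The best result for "up to N cells" is then the minimum over `errors[1:]`; recovering the actual division would need the chosen split to be recorded alongside each entry.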
I would try to avoid doing this by thinking of a different idea. Depending on the way you look at life, there are at least two disadvantages of this:
1) You don't seem to be adding anything to the party. It looks like you are interposing yourself between organizations that actually make weather forecasts and their clients. Organizations lose direct contact with their clients, which might for instance lose them advertising revenue. Customers get a poorer weather forecast.
2) Most sites have legal terms of service, which most clients can ignore without worrying. My guess is that you would be breaking those terms of service, and if your service gets popular enough to be noticed they will be enforced against you.

Action constraints in actor-critic reinforcement learning

I've implemented the natural actor-critic RL algorithm on a simple grid world with four possible actions (up,down,left,right), and I've noticed that in some cases it tends to get stuck oscillating between up-down or left-right.
Now, in this domain up-down and left-right are opposites, and I feel that learning might be improved if I were somehow able to make the agent aware of this fact. I was thinking of simply adding a step after the action activations are calculated (e.g. subtracting the left activation from the right activation and vice versa). However, I'm afraid of this causing convergence issues in the general case.
It seems as though adding constraints would be a common desire in the field, so I was wondering if anyone knows of a standard method I should be using for this purpose. And if not, then whether my ad-hoc approach seems reasonable.
Thanks in advance!

I'd stay away from using heuristics in the selection of actions, if at all possible. If you want to add heuristics to your training, I'd do it in the calculation of the reward function. That way the agent will learn and embody the heuristic as a part of the value function it is approximating.
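One way to read this suggestion, as a hedged sketch only: leave action selection alone and subtract a small penalty from the reward whenever the agent immediately reverses its previous move. The penalty size and the function name are hypothetical, and because the shaped reward depends on the previous action, that action would have to be part of the state the agent sees.

```python
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def shaped_reward(env_reward, prev_action, action, reversal_penalty=0.1):
    """Environment reward minus a small penalty for undoing the previous move."""
    if prev_action is not None and OPPOSITE[prev_action] == action:
        return env_reward - reversal_penalty
    return env_reward

print(shaped_reward(1.0, "up", "down"))   # reversal: penalized
print(shaped_reward(1.0, "up", "left"))   # no reversal: unchanged
```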
About the oscillation behavior, do you allow for the action of no movement (i.e. stay in the same location)?
Finally, I wouldn't worry too much about violating the general case and convergence guarantees. They are merely guidelines when doing applied work.

Edge detection : Any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of my algorithm. If anyone could give me a reference or a method on how to proceed, it would be really helpful. :)
I do not have ground truth, and the data set includes color as well as gray images.
Thank you.

Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc). Calculate stats on your algo vs reference algo performance, in terms of (a) number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positive), or (b) number of points which the reference algo classifies as edge that your algo does not (false negative). You can also calculate a rank correlation-type number for algos by looking at each point and looking at which algos do (or don't) classify that as an edge.
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
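The voting and false positive/false negative counts from the second option could be sketched as follows, assuming every algorithm outputs a binary edge map of the same shape. The helper names and the tiny example maps are made up for illustration.

```python
import numpy as np

def vote(edge_maps, min_votes):
    """Consensus edge map: a pixel is an edge if at least min_votes maps say so."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in edge_maps])
    return stacked.sum(axis=0) >= min_votes

def compare(candidate, reference):
    """Count pixels where the candidate disagrees with the reference."""
    candidate = np.asarray(candidate, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    false_pos = int(np.sum(candidate & ~reference))  # edge in candidate only
    false_neg = int(np.sum(~candidate & reference))  # edge in reference only
    return false_pos, false_neg

# Three reference detectors and one candidate on a 2x3 image (made-up values)
ref_a = [[1, 1, 0], [0, 0, 0]]
ref_b = [[1, 0, 0], [0, 1, 0]]
ref_c = [[1, 1, 0], [0, 1, 0]]
mine = [[1, 0, 1], [0, 1, 0]]

consensus = vote([ref_a, ref_b, ref_c], min_votes=2)  # majority of 3
print(compare(mine, consensus))
```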
Good luck!

For comparisons, quantitative measures like those @Alex I explained are best. To use them, you need to define what is "correct" with a ground truth set, and a way to consistently determine whether a given result is correct or, on a more granular level, how correct it is (some number like a percentage). @Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. They are usually less desirable, as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after, or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
By user study I mean polling people on how good a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm.
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).
