What if my Snorkel labeling function has a very low coverage over a development set? - labeling

I am trying to label a span recognition dataset using Snorkel and am currently at the stage of improving labeling functions. One of the LF has a rather low coverage because it only labels a subclass of one of the entity spans. What would be the impact of low coverage labeling functions on the final downstream span recognition model?

Even if the labeling function is low coverage, it might have high empirical accuracy over the class it is labeling. According to this video on "Best Practices for Improving Your Labeling Functions" from Snorkel co-founder Paroma Verma, those Snorkel LF's that have low coverage but good empirical accuracy should not be discarded.

Related

Differences in FP and FN rates between two algorithems

I am conducting binary classification using logistic regression with and without applying PCA. The application of PCA before logistic regression gives a higher accuracy and lower FNs in comparison to logistic regression alone. I would like to find out why this is happening, specifically why PCA produces less FNs. I have read that cost sensitivity analysis could help explain this, but I am not sure if this is correct. Any suggestions?
There is no need of fancy analysis to explain this behavior.
PCA is used just for "clean" the data by limiting its variance. Let me explain this concept with an example, and then I will turn back to your question.
In general, in any ML problem, the available samples are never sufficient in number to cover all the possible variety of the sample space. You can never have a dataset with all the possible human faces, with all the possible expressions, etc.
So, instead of using all the available features you engineer the features (the pixels, in this example) in a way that you get more meaningful higher level features. You can reduce the resolution of the pictures, as easy example; you will loose the informations on the pictures background, but your model will focus better on the most important part of the picture, i.e. the faces.
When you deal with tabular data, a technique similar to the resolution lowering is cutting off parts of the original features, and that's what PCA do: it keeps the most important components of the features, the "Principal Components", dropping the less important ones.
So, the model trained with PCA gives better results because, by cutting off part of the features, your model focus better on the most important part of your samples, and so it gains robustness against overfitting.
cheers

Binary classification of sensor data

My problem is the following: I need to classify a data stream coming from an sensor. I have managed to get a baseline using the
median of a window and I subtract the values from that baseline (I want to avoid negative peaks, so I only use the absolute value of the difference).
Now I need to distinguish an event (= something triggered the sensor) from the noise near the baseline:
The problem is that I don't know which method to use.
There are several approaches of which I thought of:
sum up the values in a window, if the sum is above a threshold the class should be EVENT ('Integrate and dump')
sum up the differences of the values in a window and get the mean value (which gives something like the first derivative), if the value is positive and above a threshold set class EVENT, set class NO-EVENT otherwise
combination of both
(unfortunately these approaches have the drawback that I need to guess the threshold values and set the window size)
using SVM that learns from manually classified data (but I don't know how to set up this algorithm properly: which features should I look at, like median/mean of a window?, integral?, first derivative?...)
What would you suggest? Are there better/simpler methods to get this task done?
I know there exist a lot of sophisticated algorithms but I'm confused about what could be the best way - please have a litte patience with a newbie who has no machine learning/DSP background :)
Thank you a lot and best regards.
The key to evaluating your heuristic is to develop a model of the behaviour of the system.
For example, what is the model of the physical process you are monitoring? Do you expect your samples, for example, to be correlated in time?
What is the model for the sensor output? Can it be modelled as, for example, a discretized linear function of the voltage? Is there a noise component? Is the magnitude of the noise known or unknown but constant?
Once you've listed your knowledge of the system that you're monitoring, you can then use that to evaluate and decide upon a good classification system. You may then also get an estimate of its accuracy, which is useful for consumers of the output of your classifier.
Edit:
Given the more detailed description, I'd suggest trying some simple models of behaviour that can be tackled using classical techniques before moving to a generic supervised learning heuristic.
For example, suppose:
The baseline, event threshold and noise magnitude are all known a priori.
The underlying process can be modelled as a Markov chain: it has two states (off and on) and the transition times between them are exponentially distributed.
You could then use a hidden Markov Model approach to determine the most likely underlying state at any given time. Even when the noise parameters and thresholds are unknown, you can use the HMM forward-backward training method to train the parameters (e.g. mean, variance of a Gaussian) associated with the output for each state.
If you know even more about the events, you can get by with simpler approaches: for example, if you knew that the event signal always reached a level above the baseline + noise, and that events were always separated in time by an interval larger than the width of the event itself, you could just do a simple threshold test.
Edit:
The classic intro to HMMs is Rabiner's tutorial (a copy can be found here). Relevant also are these errata.
from your description a correctly parameterized moving average might be sufficient
Try to understand the Sensor and its output. Make a model and do a Simulator that provides mock-data that covers expected data with noise and all that stuff
Get lots of real sensor data recorded
visualize the data and verify your assuptions and model
annotate your sensor data i. e. generate ground truth (your simulator shall do that for the mock data)
from what you learned till now propose one or more algorithms
make a test system that can verify your algorithms against ground truth and do regression against previous runs
implement your proposed algorithms and run them against ground truth
try to understand the false positives and false negatives from the recorded data (and try to adapt your simulator to reproduce them)
adapt your algotithm(s)
some other tips
you may implement hysteresis on thresholds to avoid bouncing
you may implement delays to avoid bouncing
beware of delays if implementing debouncers or low pass filters
you may implement multiple algorithms and voting
for testing relative improvements you may do regression tests on large amounts data not annotated. then you check the flipping detections only to find performance increase/decrease

Which is suitable for car detection, CascadeClassifier or HOGDescriptor?

My goal is to detect cars in images and recognize its model. For car-detection, from http://docs.opencv.org/trunk/opencv_cheatsheet.pdf, it says:
CascadeClassifier
Viola’s Cascade of Boosted classifiers us-ing Haar or LBP features.
Suits for de-tecting faces, facial features and some other objects
without diverse textures. See facedetect.cpp
HOGDescriptor
N. Dalal’s object detector using Histogram-of-Oriented-Gradients (HOG)
features. Suits for detect-ing people, cars and other objects with
well-defined silhouettes. See peopledetect.cpp
and from http://docs.opencv.org/opencv2refman.pdf , chapter 8.1 Cascade Classification ,page 421:
First, a classifier (namely a cascade of boosted classifiers working
with haar-like features) is trained with a few hundred sample views of
a particular object (i.e., a face or a car), called positive examples,
that are scaled to the same size (say, 20x20), and negative examples -
arbitrary images of the same size.
Both of these two methods mention about applying for car detection: in opencv_cheatsheet.pdf it says HOGDescriptor suits for cars and in opencv2refman.pdf it also mentions cars for Cascade Classification. So my question is how to choose between CascadeClassifier and HOGDescriptor, and which is better for car detection? Thank you.
Its hard to tell without testing, but actually I got very good results with a third option - latentSVM detector.
It depends on your goals in terms of speed, accuracy etc. You use cascade classifiers and can get the desired results. Although cascade classifiers are quite good but still they are computational expensive. In that case you can you gpu. Below is an few example for you.
https://github.com/abhi-kumar/CAR-DETECTION
It's a research question. Classifier trained by and CascadeClassifier and HOG can be used to detect vehicles. OpenCV provide a detailed help to build a customized detector in case of CascadeClassifier and HOG. Quality of the classifier can be described by accuracy and in my case I was more interested in real time implementation. So you have to look a number of different factors.

Anomaly Detection Algorithms

I am tasked with detecting anomalies (known or unknown) using machine-learning algorithms from data in various formats - e.g. emails, IMs etc.
What are your favorite and most effective anomaly detection algorithms?
What are their limitations and sweet-spots?
How would you recommend those limitations be addressed?
All suggestions very much appreciated.
Statistical filters like Bayesian filters or some bastardised version employed by some spam filters are easy to implement. Plus there are lots of online documentation about it.
The big downside is that it cannot really detect unknown things. You train it with a large sample of known data so that it can categorize new incoming data. But you can turn the traditional spam filter upside down: train it to recognize legitimate data instead of illegitimate data so that anything it doesn't recognize is an anomaly.
There are various types of anomaly detection algorithms, depending on the type of data and the problem you are trying to solve:
Anomalies in time series signals:
Time series signals is anything you can draw as a line graph over time (e.g., CPU utilization, temperature, rate per minute of number of emails, rate of visitors on a webpage, etc). Example algorithms are Holt-Winters, ARIMA models, Markov Models, and more. I gave a talk on this subject a few months ago - it might give you more ideas about algorithms and their limitations.
The video is at: https://www.youtube.com/watch?v=SrOM2z6h_RQ
Anomalies in Tabular data: These are cases where you have feature vector that describe something (e.g, transforming an email to a feature vector that describes it: number of recipients, number of words, number of capitalized words, counts of keywords, etc....). Given a large set of such feature vectors, you want to detect some that are anomalies compared to the rest (sometimes called "outlier detection"). Almost any clustering algorithm is suitable in these cases, but which one would be most suitable depends on the type of features and their behavior -- real valued features, ordinal, nominal or anything other. The type of features determine if certain distance functions are suitable (the basic requirement for most clustering algorithms), and some algorithms are better with certain types of features than others.
The simplest algo to try is k-means clustering, where an anomaly sample would be either very small clusters or vectors that are far from all cluster centers. One sided SVM can also detect outliers, and has the flexibility of choosing different kernels (and effectively different distance functions). Another popular algo is DBSCAN.
When anomalies are known, the problem becomes a supervised learning problem, so you can use classification algorithms and train them on the known anomalies examples. However, as mentioned - it would only detect those known anomalies and if the number of training samples for anomalies is very small, the trained classifiers may not be accurate. Also, because the number of anomalies is typically very small compared to "no-anomalies", when training the classifiers you might want to use techniques like boosting/bagging, with over sampling of the anomalies class(es), but optimize on very small False Positive rate. There are various techniques to do it in the literature --- one idea that I found to work many times very well is what Viola-Jones used for face detection - a cascade of classifiers. see: http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf
(DISCLAIMER: I am the chief data scientist for Anodot, a commercial company doing real time anomaly detection for time series data).

How can I visualize changes in a large code base quality?

One of the things I’ve been thinking about a lot off and on is how we can use metrics of some kind to measure change, are we going backwards or not? This is in the context of a large, legacy code base which we are improving. Most of the code is C++ with a C heritage. Some new functions and the GUI are written in C#.
To start with, we could at least be checking if the simple complexity level was changing over time in the code. The difficulty is in having a representation – we can maybe do a 3D surface where a 2D map represents the code and we have a heat-map of color representing complexity with the 3D surface bulging in and out to show change.
Once you can generate some matrics of numbers there are a ton of math systems around to take care of stuff like this.
Over time, I'd like to have more sophisticated numbers in there but the same visualisation techniques used to represent change.
I like the idea in Crap4j of focusing on the ratio between complexity and number of unit tests covering that code.
I'd also like to include Uncle Bob's SOLID metrics and some of the Chidamber and Kemerer OO metrics. The hard part is finding tools to generate these for C++. The only option seems to be Krakatau Essential Metrics (I have no objection to paying for tools). My desire to use the CK metrics comes partly from the books Object-Oriented Metrics:Measures of Complexity by Henderson-Sellers and the earlier Object-Oriented Software Metrics.
If we start using a number of these metrics we could end up with ten or so numbers that are varying across time. I'm fairly ignorant of statistics but it seems it could be interesting to track a bunch of such metrics and then pay attention to which ones tend to vary.
Note that a related question is about measuring code quality across a large code base. I'm more interested in measuring the change.
I'd consider using a Kiviat Diagram to represent multiple software metrics dimensions evolving over time. These diagrams represent multiple data points in a concave hull around a centerpoint. Visual inspection will show where a particular metric is going up or down, and one ought to be able to compute an overall ratio of area biased by metric value using some hueristic area computation.
You can also have a glance at NDepend documentation about code metrics. Disclaimer: I am one of the developer of the tool NDepend.
With the Code Rule and Query over LINQ (CQLinq) facility, it is possible to ask for code metric evolution/trending across two different snapshots in time of the code base. For example there is a default rule proposed: Avoid making complex methods even more complex illustrated by the screenshot below:
Several metric trending rules are proposed like:
Avoid decreasing code coverage by[enter link description here]5 tests of types
Types that used to be 100% covered but not anymore
and also, since you mentioned Crap4J the metric C.R.A.P can be written with CQLinq, and the query could be easily tweaked to see the trending in C.R.A.P metric.
Concerning the visualization of code metric, NDepend lets visualize code metrics values through an interactive treemap:
There is a fresh approach for this topic.
E.g.
https://github.com/databricks/koalas/pull/840#issuecomment-536949320
See https://softagram.com/docs/visualizing-code-changes/ for more info or do an image search in search engine using the two keywords: softagram koalas
Disclaimer: I work for Softagram.

Resources