Evaluation Metrics for Generative Adversarial Networks

I'm using GANs on non-image data, and I'm looking for quantitative metrics to measure the quality of the generated samples.
I can't use the Inception Score (IS) or similar metrics that rely on the Inception model, because I'm not working with images.
Are there any metrics that are clearly explained?

Related

What if my Snorkel labeling function has a very low coverage over a development set?

I am trying to label a span recognition dataset using Snorkel and am currently at the stage of improving my labeling functions. One of the LFs has rather low coverage because it only labels a subclass of one of the entity spans. What would be the impact of low-coverage labeling functions on the final downstream span recognition model?
Even if a labeling function has low coverage, it might have high empirical accuracy over the class it labels. According to the video "Best Practices for Improving Your Labeling Functions" from Snorkel co-founder Paroma Varma, Snorkel LFs that have low coverage but good empirical accuracy should not be discarded.
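A minimal sketch of how you might check this, assuming a toy development set (the label scheme, `df_dev`, and `lf_rare_subclass` below are hypothetical, not from the question): Snorkel's `LFAnalysis` reports both coverage and empirical accuracy per LF, which is the comparison that advice relies on.

```python
import numpy as np
import pandas as pd
from snorkel.labeling import LFAnalysis, PandasLFApplier, labeling_function

ABSTAIN, OTHER, DISEASE = -1, 0, 1  # hypothetical label scheme

@labeling_function()
def lf_rare_subclass(x):
    # Only fires on a narrow subclass of spans, so coverage will be low.
    return DISEASE if "carcinoma" in x.text.lower() else ABSTAIN

lfs = [lf_rare_subclass]

# Hypothetical hand-labeled development set.
df_dev = pd.DataFrame({"text": ["invasive ductal carcinoma", "persistent headache"]})
Y_dev = np.array([DISEASE, OTHER])

L_dev = PandasLFApplier(lfs=lfs).apply(df=df_dev)
summary = LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)
print(summary[["Coverage", "Emp. Acc."]])
```

If "Coverage" is low but "Emp. Acc." is high, the LF is still contributing reliable signal for its subclass and is usually worth keeping.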

How are the visually similar images in Google Vision API retrieved?

I have retrieved "Visually Similar Images" using Google Vision API. I would like to know how given a photo (that could pertain to a blog or article), Google Vision API finds a list of visually similar images? I cannot seem to find a white paper describing this.
Additionally, I would like to know if it makes sense to consider these visually similar images if the labels predicted by Google Vision API have a score lower than 70% confidence?
According to the documentation, Google Cloud’s Vision API offers powerful pre-trained machine learning models through REST and RPC APIs, such as Web Detection, which processes and analyzes the submitted image in order to identify other images with similar characteristics, as mentioned here. However, since it is a pre-trained Google model, there is no public documentation of its development.
Regarding your question about considering a confidence score lower than 70%: it depends entirely on your use case; you have to evaluate the acceptance limits required to satisfy your requirements.
Please note that the objects returned in the "visuallySimilarImages" field of the JSON response are WebImage objects, and their score field is deprecated. You may be referring to the score within the WebEntity object, which is an overall relevancy score for the entity; it is not normalized and not comparable across different image queries.
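For reference, a minimal sketch of requesting Web Detection with the google-cloud-vision Python client (credentials are assumed to be configured, and the image URL is a placeholder):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "https://example.com/photo.jpg"  # placeholder URL

response = client.web_detection(image=image)
web = response.web_detection

for entity in web.web_entities:
    # Relevancy score for the entity: not normalized, and not comparable
    # across different image queries.
    print(entity.description, entity.score)

for similar in web.visually_similar_images:
    # WebImage objects: their per-image score field is deprecated.
    print(similar.url)
```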

yolov3 Metrics Output Interpretation

I'm training a yolov3 neural network (https://github.com/ultralytics/yolov3/) to recognize objects in an image and was able to get some metrics out.
I was just wondering if anyone knew how to interpret the following metrics (i.e. definition of what these metrics measure).
Objectness
Classification
[Figure: yoloV3 training metrics plots]
I'm assuming the val Objectness and val Classification are the scores for the validation set.
Thanks!
Sorry for the late reply. Anyway, hope this is useful for somebody.
Objectness: measures how well the model identifies that an object exists in a proposed region of interest.
Classification: measures how well the model labels those objects with their corresponding class.
Both are usually computed with nn.BCEWithLogitsLoss, since both are treated as classification tasks.
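For intuition, a toy sketch (the tensor shapes and targets below are made up, not the actual ultralytics code) of how objectness and classification losses can both be computed with nn.BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

# Objectness: one logit per predicted box; target 1 if the box should
# contain an object, 0 otherwise.
obj_logits = torch.randn(8)                      # 8 hypothetical predictions
obj_targets = torch.tensor([1., 0., 1., 0., 0., 1., 0., 0.])
loss_obj = bce(obj_logits, obj_targets)

# Classification: one logit per class for the boxes matched to ground truth,
# with a one-hot (multi-label style) target per box.
cls_logits = torch.randn(3, 5)                   # 3 matched boxes, 5 classes
cls_targets = torch.zeros(3, 5)
cls_targets[torch.arange(3), torch.tensor([0, 2, 4])] = 1.0
loss_cls = bce(cls_logits, cls_targets)

print(loss_obj.item(), loss_cls.item())
```

The validation-prefixed curves ("val Objectness", "val Classification") are the same losses evaluated on the validation set, as you assumed.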

What machine learning algorithm would be best suited for a scenario when you are not sure about the test features/attributes?

E.g., for training, you use data for which users have filled in all the fields (around 40 fields) in a form, along with an expected output.
We now build a model (could be an artificial neural net or SVM or logistic regression, etc).
Finally, a user now enters 3 fields in the form and expects a prediction.
In this scenario, what is the best ML algorithm I can use?
I think it will depend on the specific context of your problem. What are you trying to predict based on what kind of input?
For example, recommender systems are used by companies like Netflix to predict a user's rating of, for example, movies based on a very sparse feature vector (user's existing ratings of a tiny percentage of all of the movies in the catalog).
Another option is to develop some mapping from your sparse feature space to a common latent space on which you perform your classification with, e.g., an SVM or neural network; I believe this paper does something similar. You can also look into papers like this one for a classifier that translates data from two different domains (your training vs. testing set, for example, where both contain similar information, but one has complete data and the other does not) into a common latent space for classification. There is actually a lot out there on domain-independent classification.
Keywords to look up (with some links to get you started): generative adversarial networks (GAN), domain-adversarial training, domain-independent classification, transfer learning.
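As a rough illustration of the "map sparse inputs into a common latent space" idea (this is not taken from either paper; the network architecture, masking rate, and shapes below are all assumptions), one could train an encoder on the complete training forms with random feature masking, so that at prediction time a form with only 3 of 40 fields filled in can still be embedded and classified:

```python
import torch
import torch.nn as nn

N_FEATURES, N_LATENT, N_CLASSES = 40, 16, 2

class MaskedEncoderClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # The encoder sees the (masked) features plus the mask itself.
        self.encoder = nn.Sequential(
            nn.Linear(2 * N_FEATURES, 64), nn.ReLU(),
            nn.Linear(64, N_LATENT), nn.ReLU(),
        )
        self.classifier = nn.Linear(N_LATENT, N_CLASSES)

    def forward(self, x, mask):
        z = self.encoder(torch.cat([x * mask, mask], dim=1))
        return self.classifier(z)

model = MaskedEncoderClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake complete training data standing in for the 40-field forms.
X = torch.randn(256, N_FEATURES)
y = torch.randint(0, N_CLASSES, (256,))

for _ in range(50):
    # Randomly hide most fields during training to mimic sparse test inputs.
    mask = (torch.rand_like(X) < 0.3).float()
    loss = loss_fn(model(X, mask), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Prediction with only 3 of 40 fields filled in.
x_new = torch.zeros(1, N_FEATURES)
mask_new = torch.zeros(1, N_FEATURES)
for idx, value in [(0, 1.2), (5, -0.4), (17, 3.0)]:
    x_new[0, idx], mask_new[0, idx] = value, 1.0
print(model(x_new, mask_new).softmax(dim=1))
```

The key design choice is training with the same kind of sparsity you expect at test time, so the learned latent space is shared between complete and incomplete inputs.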

Good results when training and cross-validating a model, but test data set shows poor results

My problem is that I obtain a model with very good results (training and cross-validating), but when I test it again (with a different data set) poor results appear.
I have a model that has been trained and tested with cross-validation. The model shows AUC = 0.933, TPR = 0.90 and FPR = 0.04.
Looking at the plots (learning curve for error, learning curve for score, and deviance curve), I don't think there is any overfitting:
The problem is that when I test this model with a different test data set, I obtain poor results, nothing like my previous results: AUC = 0.52, TPR = 0.165 and FPR = 0.105.
I used a Gradient Boosting Classifier to train my model, with learning_rate=0.01, max_depth=12, max_features='auto', min_samples_leaf=3, n_estimators=750.
I used SMOTE to balance the classes. It is a binary model. I vectorized my categorical attributes. I used 75% of my data set to train and 25% to test. My model has a very low training error and a low test error, so I guess it is not overfitted. The training error is very low, so there are no outliers in the training and CV-test data sets. What can I do from now on to find the problem? Thanks
If the process generating your datasets is non-stationary, it could cause the behavior you describe.
In that case, the distribution of the dataset you're using for testing was not seen during training.
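One practical check (my suggestion, often called "adversarial validation", not part of the original answer) is to train a classifier to distinguish training rows from the new test rows; an AUC well above 0.5 suggests the two sets come from different distributions. A minimal sketch with placeholder data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# X_train, X_new_test: placeholders for your two feature matrices.
X_train = np.random.randn(500, 10)
X_new_test = np.random.randn(200, 10) + 1.0  # shifted on purpose for the demo

X_all = np.vstack([X_train, X_new_test])
origin = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_new_test))])

auc = cross_val_score(GradientBoostingClassifier(), X_all, origin,
                      scoring="roc_auc", cv=5).mean()
print(f"train-vs-test AUC: {auc:.2f}  (close to 0.5 means similar distributions)")
```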
