Probability estimates from Google AutoML - google-cloud-automl

I am analyzing pedigrees to determine if families have a cancer susceptibility gene mutation based on their family history of cancer (various types). I have created a dataset for Google AutoML Tables with 250,000 generated pedigrees. I would like to use this to develop a classifier to predict if a given pedigree has a familial mutation.
What I am interested in, though, is not a yes/no answer, but rather the Bayesian posterior probability. In other words, what is the probability of this family carrying a mutation given a positive result? Is it possible to do this with AutoML Tables? If so, how would I go about this?
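For concreteness, the quantity being asked about can be written with Bayes' rule. The sketch below is only an illustration: the function name and the prior, sensitivity and specificity values are made-up assumptions, not numbers from the pedigree dataset, and it says nothing about what AutoML Tables itself exposes.

```python
def posterior_given_positive(prior: float, sensitivity: float, specificity: float) -> float:
    """P(mutation | positive result) via Bayes' rule."""
    true_pos = sensitivity * prior                   # P(positive and mutation)
    false_pos = (1.0 - specificity) * (1.0 - prior)  # P(positive and no mutation)
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers: 5% of pedigrees carry a mutation,
# and the classifier has 90% sensitivity and 95% specificity.
p = posterior_given_positive(prior=0.05, sensitivity=0.90, specificity=0.95)
print(round(p, 3))  # 0.486
```

With a low prior, even a fairly accurate classifier gives a posterior well below 1, which is why the posterior is more informative than the yes/no label.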

Related

Is the k-Nearest Neighbors algorithm used a lot in real life?

I am teaching myself machine learning through the book "Introduction to Machine Learning with Python: A Guide for Data Scientists", and I am currently at the k-Nearest Neighbors section. The authors mention that this algorithm is rarely used in real life due to "prediction being slow and its inability to handle many features". However, k-Nearest Neighbors is mentioned as one of the most popular algorithms for data scientists in many articles. So, could somebody explain this to me?
k-nearest neighbors has many applications in machine learning because the underlying problem, finding the k nearest neighbors of a point, is fundamental and appears as a building block in many solutions. For example, in data-representation methods such as t-SNE, running the algorithm requires computing the k nearest neighbors of each point, with k based on the predefined perplexity.
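As a rough illustration of that building block, here is a minimal brute-force neighbor search in plain Python. The function name and the sample points are hypothetical, and real libraries typically use faster index structures; this only shows what "the k nearest neighbors of a point" means.

```python
import math


def k_nearest_neighbors(points, query_index, k):
    """Return indices of the k points closest to points[query_index] (Euclidean)."""
    query = points[query_index]
    distances = [
        (math.dist(query, p), i)
        for i, p in enumerate(points)
        if i != query_index  # a point is not its own neighbor
    ]
    distances.sort()  # closest first
    return [i for _, i in distances[:k]]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (0.5, 0.5)]
neighbors = k_nearest_neighbors(points, query_index=0, k=2)
print(neighbors)  # [4, 1]
```

Every query scans all points, which is the "prediction being slow" cost the book refers to when the dataset is large.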
Also, you can find more applications of kNN here, and its applications in industry on the last page of this article.
The KNN algorithm is one of the most popular algorithms for text categorization or text mining.
Another interesting application is the evaluation of forest inventories and for estimating forest variables. In these applications, satellite imagery is used, with the aim of mapping the land cover and land use with few discrete classes. The other applications of the k-NN method in agriculture include climate forecasting and estimating soil water parameters.
Some of the other applications of KNN in finance are mentioned below:
Forecasting stock market: Predict the price of a stock, on the basis of company performance measures and economic data.
Currency exchange rate
Bank bankruptcies
Understanding and managing financial risk
Trading futures
Credit rating
Loan management
Bank customer profiling
Money laundering analyses
Medicine
Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.
Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person’s blood.
Identify the risk factors for prostate cancer, based on clinical and demographic variables.
The KNN algorithm has been also applied for analyzing micro-array gene expression data, where the KNN algorithm has been coupled with genetic algorithms, which are used as a search tool. Other applications include the prediction of solvent accessibility in protein molecules, the detection of intrusions in computer systems, and the management of databases of moving objects such as computer with wireless connections.

What machine learning algorithm would be best suited for a scenario when you are not sure about the test features/attributes?

E.g., for training, you use data for which users have filled in all the fields (around 40 fields) of a form, along with an expected output.
We now build a model (could be an artificial neural net or SVM or logistic regression, etc).
Finally, a user enters only 3 fields in the form and expects a prediction.
In this scenario, what is the best ML algorithm I can use?
I think it will depend on the specific context of your problem. What are you trying to predict based on what kind of input?
For example, companies like Netflix use recommender systems to predict a user's rating of items such as movies based on a very sparse feature vector (the user's existing ratings of a tiny percentage of all the movies in the catalog).
Another option is to develop some mapping algorithm from your sparse feature space to a common latent space on which you perform your classification with, e.g., an SVM or neural network. I believe this paper does something similar. You can also look into papers like this one for a classifier that translates data from two different domains (your training vs. testing set, for example, where both contain similar information, but one has complete data and the other does not) into a common latent space for classification. There is actually a lot out there on domain-independent classification.
Keywords to look up (with some links to get you started): generative adversarial networks (GAN), domain-adversarial training, domain-independent classification, transfer learning.

What estimator to use in scikit-learn?

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled all the statistics of each player who has played for my high school baseball team. I also have a list of all the players that have ever made it to the MLB from my high school. What I'd like to do is split the data into a training set and a test set, and then feed it to some algorithm in the scikit-learn package and predict the probability of making the MLB.
So I looked through a number of sources and found this cheat sheet that suggests I start with linear SVC.
So, as I understand it, I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on base percentage, yada, yada), X_train; and a corresponding truth matrix with a single row per player that is simply 1 (played in MLB) or 0 (did not play in MLB), Y_train. From there, I just do fit(X_train, Y_train) and then I can use predict(X_test) to see if it gets the right values for Y_test.
Does this seem a logical choice of algorithm, method, and application?
EDIT to provide more information:
The data is made up of 20 features such as number of games played, number of hits, number of home runs, number of strikeouts, etc. Most are basic counting statistics about the player's career; a few are rates such as batting average.
I have about 10k total rows to work with, so I can split the data based on that; but I have no idea how to optimally split the data, given that <1% have made the MLB.
Alright, here are a few steps that you might want to take:
Prepare your data set. In practice, you might want to scale the features, but we'll leave that out to make the first working model as simple as possible. So we'll just need to split the dataset into train and test sets. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for it in the scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You might want to make sure that both positive and negative examples are present in the train and test sets. To do so, you can separate them before the train/test split to make sure that, say, 70% of negative examples and 70% of positive examples go to the training set.
Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.
Creating the classifier and training it is easy:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
Now it's time to make our first predictions:
y_pred = clf.predict(X_test)
A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so a model that unconditionally returns 0 can get a very high score. We can use the F1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.
That's it. We have a working model! Of course, there are a lot of things you may try in order to improve it, such as scaling the features, trying different classifiers, and tuning their hyperparameters, but this should be enough to get started.
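The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the data is synthetic (make_classification stands in for the real player statistics, with about 3% positives rather than the <1% in the question), and stratify=y is used as a shortcut for keeping the same share of positives in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the player table: 20 features, few positives (assumption).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.97, 0.03], random_state=0
)

# stratify=y keeps the positive/negative ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Hard 0/1 labels, scored with F1 rather than accuracy.
y_pred = clf.predict(X_test)
print("F1:", f1_score(y_test, y_pred))

# Column 1 of predict_proba is the estimated probability of class 1.
proba = clf.predict_proba(X_test)[:, 1]
print(proba[:5])
```

The values in proba are what the question asks for: an estimated probability per player instead of a 0/1 label.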
If you don't have a lot of experience in ML: scikit-learn offers classification algorithms (if the target of your dataset is a boolean or a categorical variable) and regression algorithms (if the target is a continuous variable).
If you have a classification problem and your variables are on very different scales, a good starting point is a decision tree:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
The classifier is a tree, and you can see the decisions that are taken at its nodes.
After that you can use a random forest, which is a group of decision trees whose results are averaged:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
After that you can bring every feature to the same scale:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
And you can use other algorithms like SVMs.
For every algorithm you need a technique to select its parameters, for example cross validation:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
But a good course is the best way to learn. On Coursera you can find several good courses, like this one:
https://www.coursera.org/learn/machine-learning
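The progression suggested in this answer (decision tree, then random forest, then scaling plus an SVM, each checked with cross-validation) might look like the sketch below. The synthetic dataset, the 5 folds and the default hyperparameters are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real classification problem (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # The scaler sits inside the pipeline so it is refit on each training fold.
    "scaled SVM": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the scaler in a pipeline keeps the validation folds from leaking into the scaling statistics.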

Clustering+Regression-the right approach or not?

My task is to predict how quickly goods will sell (for example, within one category). E.g., the client enters the price at which he wants to sell his item, and the algorithm should display that, at the entered price, the item will sell within n days. It should also have 3 intervals: quick, medium and long sale. Like in the picture:
The question: how exactly should I prepare the algorithm?
My suggestion: use clustering techniques to find these three price ranges, and then solve a regression task for each cluster to predict the number of days. Is this the right approach?
There are two questions here, and I think the answer to each lies in a different domain:
Given an input price, predict how long it will take to sell the item. This is a well-defined prediction problem, and can be tackled using ML algorithms, e.g. use your entire dataset to train and test a regression model for prediction.
Translate the prediction into a class: quick-, medium- or slow-sell. This problem is product-oriented: there doesn't seem to be any concrete data allowing you to train a classifier on this translation, and I agree with @anony-mousse that using unsupervised learning might not yield easy-to-use results.
You can either consult your users or a product manager on reasonable thresholds to use (there might be considerations here like the type of item, season etc.), or try getting some additional data in order to train a supervised classifier.
E.g. you could ask your users, after the sale, whether they think the sale was quick, medium or slow. Then you'll have some data to use for thresholding or for classification.
I suggest you simply define thresholds of 10 days and 31 days. Keep it simple.
Because these are the values the users will want to understand. If you use clustering, you may end up with 0.31415 days or similar nonintuitive values that you cannot explain to the user anyway.
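A minimal sketch of the fixed-threshold idea: take the number of days predicted by whatever regression model is used and map it to a label. The function name is hypothetical, and treating the 10- and 31-day boundaries as inclusive is an assumption; the answers above do not specify it.

```python
def sale_speed(predicted_days: float, quick_max: float = 10, medium_max: float = 31) -> str:
    """Map a predicted time-to-sale in days to a quick/medium/slow label."""
    if predicted_days <= quick_max:
        return "quick"
    if predicted_days <= medium_max:
        return "medium"
    return "slow"


labels = [sale_speed(d) for d in (3, 10, 18.5, 31, 45)]
print(labels)  # ['quick', 'quick', 'medium', 'medium', 'slow']
```

Because the thresholds are plain parameters, they can later be replaced with values agreed with a product manager or derived from user feedback.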

What is the theoretical criterion for selecting which artificial neural network to use for predicting data?

I have trained neural networks 50 times, and now I am not sure which of the 50 networks I should choose to predict my data. Should I choose based on the network's MSE (mean squared error), R-squared, validation performance or test performance? Thanks for any suggestions.
If you have training data that you actually use to build a predictive model (so you need the network for some actual application, not for research purposes), then you should just split it into two subsets: training and validation, nothing else. When you run your 50 trainings (over k-fold CV or random splits), you are supposed to use them to find out which set of hyperparameters is best: select the hyperparameters that lead to the best mean score over these 50 splits, then retrain your model on the whole dataset with these hyperparameters. Similarly, if you want to select between K algorithms, use each of them to estimate its generalization capability, select the one with the best mean score, and retrain that particular model on the whole dataset.
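The select-then-retrain procedure can be sketched in plain Python. To keep it short, a one-feature ridge regression stands in for the neural network, the hyperparameter is its penalty, and 5 folds are used instead of 50 splits; all of these, and the synthetic data, are assumptions for illustration. Here "best mean score" means lowest mean validation MSE.

```python
import random
from statistics import mean


def fit_ridge(xs, ys, lam):
    """One-feature ridge regression through the origin; returns the weight."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    return mean((w * x - y) ** 2 for x, y in zip(xs, ys))


def cv_score(xs, ys, lam, k=5):
    """Mean validation MSE over k folds for one hyperparameter value."""
    fold_errors = []
    for fold in range(k):
        # Every k-th example (offset by fold) is held out for validation.
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        w = fit_ridge(train_x, train_y, lam)
        fold_errors.append(mse(w, xs[fold::k], ys[fold::k]))
    return mean(fold_errors)


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]  # synthetic data, true weight 2

candidates = [0.0, 1.0, 10.0, 100.0]
scores = {lam: cv_score(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

# Retrain on the whole dataset with the selected hyperparameter.
final_w = fit_ridge(xs, ys, best_lam)
print(best_lam, round(final_w, 2))
```

None of the models trained inside the folds is kept; they only serve to rank the hyperparameter candidates.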

Resources