How to get Google AutoML model coefficients - google-cloud-automl

I am new to Google AutoML. Once I have trained a model, I want to see the model details, i.e. the features and their related coefficients. Any suggestions?

Assuming you are talking about an AutoML Vision model (classification and object detection both work similarly): you can choose to train an Edge model when starting the training. This enables you to download the model afterwards as a TensorFlow saved_model.pb.
With this, you could then e.g. use Netron to visualize the network, or load the model with Python and print details about it with code like:
import tensorflow as tf
from tensorflow.python.framework import tensor_util

path_mdl = "input/model"  # folder containing saved_model.pb
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ["serve"], path_mdl)
    graph = tf.get_default_graph()
    # print all operations (including the input and output operations)
    print(graph.get_operations())
    # print info about all constant nodes (the weights)
    weight_nodes = [n for n in graph.as_graph_def().node if n.op == 'Const']
    for n in weight_nodes:
        print("Name of the node - %s" % n.name)
        print("Value - ")
        print(tensor_util.MakeNdarray(n.attr['value'].tensor))
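Note that this snippet uses the TensorFlow 1.x session API. On TensorFlow 2.x, a rough equivalent sketch (assuming the same SavedModel directory; for frozen graphs the variables list may be empty, since the weights are baked in as constants) would be:
import tensorflow as tf

# TF 2.x sketch: load the SavedModel and list any tracked variables
loaded = tf.saved_model.load("input/model")
for v in getattr(loaded, "variables", []):  # may be empty for frozen graphs
    print(v.name, v.shape)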

Related

Reduce the output layer size from XLTransformers

I'm running the following using the huggingface implementation:
t1 = "My example sentence is really great."
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
encoded_input = tokenizer(t1, return_tensors='pt', add_space_before_punct_symbol=True)
output = model(**encoded_input)
tmp = output[0].detach().numpy()
print(tmp.shape)
>>> (1, 7, 267735)
My goal is to get output embeddings that I'll use downstream.
The last dimension is substantially larger than I expected, and it looks like it is the size of the entire vocabulary (vocab_size) rather than a reduction based on the ECL from the paper (which I am potentially misinterpreting).
What argument would I provide the model to reduce this layer size to a smaller dimensional space, something more like the basic BERT at 400 or 768 and still obtain good performance based on the pretrained embeddings?
That's because you used ...LMHeadModel, which predicts the next token and therefore projects onto the full vocabulary. You can use TransfoXLModel.from_pretrained("transfo-xl-wt103") instead; then output[0] is the last hidden state, which has the shape (batch_size, sequence_length, hidden_size).
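For illustration, a minimal sketch of that change (same tokenizer call as in the question; the hidden size of the transfo-xl-wt103 checkpoint should be 1024 rather than the 267735-word vocabulary):
from transformers import TransfoXLTokenizer, TransfoXLModel

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
encoded_input = tokenizer("My example sentence is really great.",
                          return_tensors='pt',
                          add_space_before_punct_symbol=True)
output = model(**encoded_input)
hidden = output[0].detach().numpy()
print(hidden.shape)  # expected: (1, 7, 1024) instead of (1, 7, 267735)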

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to integrate my LDA model into a web application, and if there were a way to use matrix multiplication to get a similar result, I could use the model in JavaScript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')

def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))  # StemmAndLemmatize: user-defined helper, not shown
    return processedList

lda_model = models.LdaModel.load(r'LDAModel\GoldModel')  # load pretrained LDA model
dictionary = Dictionary.load(r"ModelTrain\ManDict")      # load dictionary the model was trained on

# sample unseen doc to analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()   # get term-topic matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)               # tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)      # create bow using dictionary
dictSize = len(termTopicMatrix[0])         # number of terms in the dictionary
fullDict = np.zeros(dictSize)              # initialize array of dictionary length
First = [first[0] for first in bowDoc]     # indices of terms in the bag of words
Second = [second[1] for second in bowDoc]  # frequency of each term in the bag of words
fullDict[First] = Second                   # add word frequencies to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix, fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
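As a first debugging step, you can also pull the model's own full, unclipped distribution for the same bag-of-words (using lda_model and bowDoc from your code); get_document_topics() takes a minimum_probability argument, and setting it to 0 returns every topic instead of only those above the default threshold:
# compare all 35 topic probabilities, not just those above the clipping threshold
doc_topics = lda_model.get_document_topics(bowDoc, minimum_probability=0.0)
for topic_id, prob in doc_topics:
    print(topic_id, prob)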

How to add words and vectors manually to Word2vec gensim?

Let's say word2vec.model is my trained word2vec model. When an out-of-vocabulary word (oov_word) occurs, I compute a vector vec using a compute_vec(oov_word) method. Now, I want to add/append oov_word and its corresponding vector vec to my already trained model word2vec.model.
I have already checked the below links. But they do not answer my question.
Combining/adding vectors from different word2vec models
https://datascience.stackexchange.com/questions/49431/how-to-train-an-existing-word2vec-gensim-model-on-new-words
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.add
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

vector_length = 100
kv = WordEmbeddingsKeyedVectors(vector_length)
# wordList   - list of words
# vectorList - list of vectors corresponding to the words
kv.add(wordList, vectorList)
kv.most_similar(word1)  # gives the list of words similar to word1
For gensim 4.0+:
w2v_model.wv.add_vectors(keys, vectors)
Check the docs.
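A self-contained sketch of the 4.x route (gensim >= 4 assumed; the words and vectors here are random stand-ins for oov_word and the output of your compute_vec()):
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors(vector_size=100)
# seed with some existing entries (stand-ins for an already-trained model's vocabulary)
kv.add_vectors(["hello", "world"], [np.random.rand(100), np.random.rand(100)])
# append an out-of-vocabulary word with its externally computed vector
oov_vec = np.random.rand(100)  # stand-in for compute_vec(oov_word)
kv.add_vectors(["oov_word"], [oov_vec])
print(kv.most_similar("oov_word"))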

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:
"All of the following should use data from the training_data.csv file
provided. training_data gives you a labeled set of integer pairs,
representing the scores of two sports teams, with the labels giving the
sport.
Write the following functions:
plot_scores() should draw a scatterplot of the data.
predict(dataset) should produce a trained Estimator to guess the sport
that resulted in a given score (from a dataset we've withheld, which will
be inputs as a 1000 x 2 np array). You can use any algorithm from scikit.
An optional additional function called "preprocess" will process the dataset
before it is passed to predict.
"
This is what I have done so far:
import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle

def plot_scores():
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array = np.array(lst)
    pl.scatter(array[:, 0], array[:, 1])
    pl.show()

def preprocess(dataset):
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst
In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, but the original data was not at all random. My question is: how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal for classifying a dataset that looks like my scatterplot?
The task likely wants you to train a standard scikit classifier model and return it, i.e. something like:
from sklearn.svm import SVC

def predict(dataset):
    X = ...  # features, extract from dataset
    y = ...  # labels, extract from dataset
    clf = SVC()    # create classifier
    clf.fit(X, y)  # train
    return clf
Though judging from the name of the function (predict) you should check if it really wants you to return a trained classifier or return predictions for the given dataset argument, as that would be more typical.
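If it is the latter, a minimal sketch (assuming the CSV layout from the question, i.e. two integer scores followed by a label) could be:
import numpy as np
from sklearn.svm import SVC

def predict(dataset):
    # train on the labeled CSV, then classify the withheld 1000 x 2 array
    data = np.loadtxt('training_data.csv', delimiter=',')
    X_train, y_train = data[:, :2], data[:, 2]
    clf = SVC()
    clf.fit(X_train, y_train)
    return clf.predict(dataset)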
As a classifier you can use basically any one that you like. Your plot suggests your dataset is linearly separable (there are no colors for the classes, but I assume the blobs are the two classes). On linearly separable data hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries, see here (which also contains an overview of the available classifiers).
I would recommend you take a look at this structure:
from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need

def get_data():
    # open your file and parse the data to prepare X as a set of input vectors and Y as a set of targets
    return X, Y

def split_data(X, Y):
    size = len(X)
    indices = list(range(size))
    shuffle(indices)
    train_indices = indices[:size // 2]
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test

def plot_scatter(Y1, Y2):
    plt.figure()
    plt.scatter(Y1, Y2, c='b', marker='o')
    plt.show()

# get data
X, Y = get_data()
# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)
# create a classifier as an object
classifier = YourImportedClassifier()
# train the classifier; after that, the classifier is the trained estimator you need
classifier.fit(X_train, Y_train)  # or another train routine, e.g. .train(X_train, Y_train)
# make a prediction
Y_prediction = classifier.predict(X_test)
# plot the scatter
plot_scatter(Y_prediction, Y_test)
I think what you are looking for is the clf.fit() function, instead of creating a function that produces another function.

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above command computes Fst distances on 1000 Genomes population data using Weir and Cockerham's 1984 formula. This formula uses three variance components, namely a, b, c (between populations; between individuals within populations; between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a, b, c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Without a hierarchical design, Fst for each locus (each row of res$loc) is therefore the first variance component divided by the sum of all components: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.
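In terms of the three components named above, that ratio is simply Fst = a / (a + b + c). A toy Python sketch with made-up component values for a single locus:
# made-up variance components for one locus: a (between populations),
# b (between individuals within populations), c (between gametes within individuals)
a, b, c = 0.12, 0.05, 0.40
fst = a / (a + b + c)
print(round(fst, 4))  # 0.2105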
--update per your comment--
I do this in pandas, but any language would do; it's a text-replacement exercise. Just get your .012 file into a dataframe and convert as below. I read it in row by row into numpy because I have tons of SNPs, but read_csv would work, too.
import pandas as pd
import numpy as np

z12_file = "out.012"  # path to the --012 output from vcftools (adjust as needed)
z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print(i)
z12_data = np.array(z12_data)

z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)                 # drop the individual-index column vcftools adds
z12_df.columns = pd.Series(z12_df.columns) - 1  # renumber columns from 0

hierf_trans = {0: 11, 1: 12, 2: 22, -1: 'NA'}

def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]

hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then you'd read that file, hierfstat.txt, into R; these are your loci. You'd need to specify the levels of your sampling design (e.g., your population). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele in this case. That may be what you want, maybe not. I often calculate the minor allele frequency and make 2 the minor allele, but it depends on your study goal.
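If you go that route, a rough sketch of the recoding (made-up genotype matrix; flips the 0/2 coding at loci where allele 2 is actually the major allele, keeping -1 as missing):
import numpy as np

z12 = np.array([[0, 1, 2],
                [2, 2, -1],
                [1, 2, 2]])  # made-up 0/1/2 genotypes, -1 = missing
called = z12 >= 0
freq2 = (z12 * called).sum(axis=0) / (2.0 * called.sum(axis=0))  # frequency of allele "2"
flip = freq2 > 0.5  # loci where "2" is the major allele
z12[:, flip] = np.where(z12[:, flip] >= 0, 2 - z12[:, flip], -1)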
