Extrapolating variance components from Weir-Fst on VCFtools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above command computes Fst on 1000 Genomes population data using Weir and Cockerham's 1984 estimator. That estimator is built from three variance components, a, b, and c (between populations; between individuals within populations; and between gametes within individuals within populations), with the per-locus estimate given by theta = a / (a + b + c).
The output directly provides the result of the formula, but not the components that the program calculated to arrive at the final result. How can I ask VCFtools to output the values for a, b, and c?

If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
1. use vcftools with --012 to get genotypes
2. convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
3. load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
# hierfstat.txt: first column is popid, then one column per locus (11/12/22/NA)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)    # sampling levels (here, just population)
loci = data[,2:ncol(data)]         # genotype columns
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)   # per-locus variance components
print(res$F)     # overall F-statistics
Without a hierarchical design, the Fst for each locus (one row of res$loc) is therefore the first variance component divided by the sum of the components, e.g. res$loc[1]/sum(res$loc) for a single locus. If you have more complicated sampling, you'll need to interpret the variance components differently.
--update per your comment--
I do this in pandas, but any language would do; it's a text-replacement exercise. Just get your .012 file into a DataFrame and convert as below. I read it in row by row into numpy because I have tons of SNPs, but read_csv would work too.
import pandas as pd
import numpy as np

z12_file = "out.012"  # vcftools --012 output; "out" is the default --out prefix
z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print(i)  # progress indicator
z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)                 # drop the individual index column
z12_df.columns = pd.Series(z12_df.columns) - 1  # renumber loci from 0
hierf_trans = {0: 11, 1: 12, 2: 22, -1: 'NA'}   # 012 coding -> hierfstat coding

def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]

hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then you'd read that hierfstat.txt file into R; these are your loci. You'd need to specify your levels from your sampling design (e.g., your population ids). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are treating 0 as the reference allele in this case. That may be what you want, or it may not. I often calculate the minor allele frequency and make 2 the minor allele, but it depends on your study goal.
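If you want to recode so that 2 counts the minor allele, a minimal pandas sketch (my illustration, assuming the z12_df from above with -1 for missing) could look like this, applied before the hierfstat translation:

import numpy as np

geno = z12_df.replace(-1, np.nan)        # mask missing genotypes
alt_freq = geno.mean() / 2.0             # per-locus frequency of the allele counted by 2
flip = alt_freq.index[alt_freq > 0.5]    # loci where that allele is actually the major one
block = z12_df[flip]
z12_df[flip] = block.where(block == -1, 2 - block)  # swap 0 <-> 2, keep -1 as missing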

Related

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model, without using the LDA_Model[unseenDoc] syntax? I am trying to integrate my LDA model into a web application, and if there were a way to use matrix multiplication to get a similar result, I could use the model in JavaScript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))  # StemmAndLemmatize defined elsewhere in the asker's code
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
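For orientation, here is a condensed sketch (my paraphrase, not gensim's exact code) of the variational E-step that inference() runs per document. The function name infer_doc_topics and its arguments are illustrative, not gensim API. Note that gensim works from exp(E[log beta]) held in the model's state, not the normalized rows returned by get_topics(), and it iterates to convergence, which is why a single matrix multiplication won't reproduce its output:

import numpy as np
from scipy.special import psi  # digamma

def infer_doc_topics(exp_elog_beta, alpha, ids, cts, iters=50, thresh=0.001):
    # exp_elog_beta: num_topics x num_terms matrix of exp(E[log beta])
    # alpha: document-topic prior (length num_topics); ids/cts: doc2bow pairs
    cts = np.asarray(cts, dtype=float)
    beta_d = exp_elog_beta[:, ids]                    # restrict to this doc's terms
    gamma = np.random.gamma(100., 1. / 100., len(alpha))
    exp_elog_theta = np.exp(psi(gamma) - psi(gamma.sum()))
    for _ in range(iters):
        last_gamma = gamma
        phinorm = exp_elog_theta.dot(beta_d) + 1e-100
        gamma = alpha + exp_elog_theta * (cts / phinorm).dot(beta_d.T)
        exp_elog_theta = np.exp(psi(gamma) - psi(gamma.sum()))
        if np.mean(np.abs(gamma - last_gamma)) < thresh:
            break
    return gamma / gamma.sum()                        # normalized topic mixture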

Even an image from the data set used for training gives opposite values when making a prediction

I am new to ML and TensorFlow. I am trying to build a CNN to categorize good images against corrupted images, similar to the rock-paper-scissors tutorial in TensorFlow, except with only two categories.
The Colab Notebook
Model Architecture
train_generator = training_datagen.flow_from_directory(
    TRAINING_DIR,
    target_size=(150,150),
    class_mode='categorical'
)

validation_generator = validation_datagen.flow_from_directory(
    VALIDATION_DIR,
    target_size=(150,150),
    class_mode='categorical'
)
model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image 150x150 with 3 bytes color
    # This is the first convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The third convolution
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # The fourth convolution
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])
model.summary()
model.compile(loss = 'categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit_generator(train_generator, epochs=25, validation_data = validation_generator, verbose = 1)
model.save("rps.h5")
The only changes I made were turning the input shape from (150,150,1) to (150,150,3) and changing the last layer's output from 3 neurons to 2. Training consistently gave me accuracy above 90% on a data set of 600 images per class. But when I make a prediction using the code in the tutorial, it gives me badly wrong values, even for data in the data set.
PREDICTION
Original code in TensorFlow tutorial
for fn in onlyfiles:
    path = fn
    img = image.load_img(path, target_size=(150, 150,3)) # changed target_size from (150,150) to (150,150,3)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    images = np.vstack([x])
    classes = model.predict(images, batch_size=10)
    print(fn)
    print(classes)
I changed target_size from (150,150) to (150,150,3) in the belief that, since my input is a 3-channel image, the target size should match.
Result
It gives badly wrong values like [0,1][0,1], even for images that are in the data set.
But when I changed the code to this
for fn in onlyfiles:
    path = fn
    img = image.load_img(path, target_size=(150, 150,3))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x /= 255.
    images = np.vstack([x])
    classes = model.predict(images, batch_size=10)
    print(fn)
    print(classes)
In this case values come like
[[9.9999774e-01 2.2242968e-06]]
[[9.9999785e-01 2.1864464e-06]]
[[9.9999785e-01 2.1641024e-06]]
There are one or two errors, but it is mostly correct.
So my question: even though the last activation is softmax, why is the output coming out as decimal values? Is there any logical mistake in the way I am making predictions? I tried binary mode as well, but couldn't find much difference.
Please note -
When you change the number of output classes from 2 to 3, you are asking the model to categorise into 3 classes. That would contradict your problem statement, which separates good from corrupted images, i.e. 2 output classes (a binary problem). I think you mean it was reversed from 3 to 2, if I have understood the question correctly.
Second, the output you are getting is perfectly correct: neural network models output probabilities rather than absolute class values like 0 or 1. The probability tells you how likely the input is to belong to, say, class 0 or class 1.
Also, as mentioned above by @BBloggsbott, you just have to use np.argmax on the output array, which returns the index of the most probable class.
Hope this helps.
Thanks.
Softmax returns a probability distribution over the vector it receives as input. So the fact that you are getting decimal values is not a problem. If you want the exact class each image belongs to, use the argmax function on the predictions.
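A minimal sketch of that last step, assuming model and a batch x preprocessed and scaled by 1/255 as above:

import numpy as np

probs = model.predict(x)                        # e.g. [[9.9999774e-01, 2.2242968e-06]]
pred_class = int(np.argmax(probs, axis=1)[0])   # index of the most probable class
print(pred_class)                               # 0 = first class, 1 = second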

Copying embeddings for gensim word2vec

I wanted to see if I could simply set new weights for gensim's Word2Vec without training. I got the 20 Newsgroups data set from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
model_w2v = models.Word2Vec(sg = 1, size=300)
model_w2v.build_vocab(all_tokens)
model_w2v.train(all_tokens, total_examples=model_w2v.corpus_count, epochs = 30)
Here all_tokens is the tokenized data set.
Then I created a new instance of Word2Vec without training
model_w2v_new = models.Word2Vec(sg = 1, size=300)
model_w2v_new.build_vocab(all_tokens)
and set the embeddings of the new Word2Vec equal to the first one
model_w2v_new.wv.vectors = model_w2v.wv.vectors
Most of the functions work as expected, e.g.
model_w2v.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
model_w2v_new.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
and
model_w2v.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
model_w2v_new.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
and
entities_list = list(model_w2v.wv.vocab.keys())
entities_list.remove('religion')  # list.remove() mutates in place and returns None
model_w2v.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
model_w2v_new.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
However, most_similar doesn't work:
model_w2v.wv.most_similar(positive=['religion'], topn=3)
> [('religions', 0.4796232581138611),
>  ('judaism', 0.4426296651363373),
>  ('theists', 0.43141329288482666)]
model_w2v_new.wv.most_similar(positive=['religion'], topn=3)
> [('roderick', 0.22643062472343445),
>  ('nci', 0.21744996309280396),
>  ('soviet', 0.20012077689170837)]
What am I missing?
Disclaimer: I posted this question on datascience.stackexchange but got no response; hoping to have better luck here.
Generally, your approach should work.
It's likely that the specific problem you're encountering was caused by an extra probing step you took that is not shown in your code, because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call, but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit length. The first time these unit-normed vectors are needed, they're calculated and then cached inside the model. So if you then replace the raw vectors with other values, but don't discard those cached values, you'll see results like you're reporting: essentially random, reflecting the randomly-initialized-but-never-trained starting vector values.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
model_w2v_new.wv.vectors_norm = None
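That is, after copying the raw vectors, clear the cache before querying again (attribute names as in gensim 3.x, matching the code above):

model_w2v_new.wv.vectors = model_w2v.wv.vectors
model_w2v_new.wv.vectors_norm = None  # discard cached unit-normed vectors
# the next most_similar() recomputes the norms from the copied vectors
print(model_w2v_new.wv.most_similar(positive=['religion'], topn=3))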

use iris efficiently to combine iris.analysis and aggregated_by while outputting the time of occurrence

I am using Python/iris to get annual extreme values from daily data. I use aggregated_by('season_year', iris.analysis.MIN) to get the extreme values, but I also need to know when in each year they occur. I have written the code below, but it is really slow, so I am wondering whether anyone knows of an iris built-in way to do it, or can think of another, more efficient approach?
Thank you!
import iris
import numpy as np

#--- get daily data
cma = iris.load_cube('daily_data.nc')

#--- get annual extremes
c_metric = cma.aggregated_by('season_year', iris.analysis.MIN)

#--- add date of when the extremes are occurring
extrdateli = []
# loop over all years
for mij in range(c_metric.data.shape[0]):
    # get extreme value
    m = c_metric.data[mij]

    # get values for this year
    cma_thisseasyr = cma.extract(iris.Constraint(
        season_year=lambda season_year: season_year == c_metric.coord('season_year').points[mij]))

    # get the date(s) in the cube on which this extreme occurs and add as a string to the list
    hits = np.where(cma_thisseasyr.data == m)
    days = [str(p) for p in cma_thisseasyr.coord('day').points[hits]]
    months = [str(p) for p in cma_thisseasyr.coord('month').points[hits]]
    years = [str(p) for p in cma_thisseasyr.coord('year').points[hits]]
    extrdateli += [str(c_metric.coord('season_year').points[mij]) + ':'
                   + ','.join(''.join(dmy) for dmy in zip(days, months, years))]

# add this list to the metric cube as an attribute
c_metric.attributes['date_of_extreme_value'] = ' '.join(extrdateli)

#--- save to file
iris.save(c_metric, 'annual_min.nc')
I think the slow part is where you extract the values for each season year. You can speed this up a bit by dispensing with the lambda, i.e.:
iris.Constraint(season_year=c_metric.coord('season_year').points[mij])
If this is still too slow, you could work directly on the numpy arrays in your cube. Slicing numpy arrays is much faster than extracting from cubes. For simplicity, the example below assumes you have a time coordinate.
import iris
import numpy as np
import iris.coord_categorisation as cat

#--- create a dummy data cube
ndays = 12 * 365 + 3  # 12 years of data
tcoord = iris.coords.DimCoord(range(ndays), units='days since 2001-02-01',
                              standard_name='time')
cma = iris.cube.Cube(np.random.normal(0, 1, ndays), long_name='blah')
cma.add_dim_coord(tcoord, 0)
cat.add_season_year(cma, 'time')

#--- get annual extremes
c_metric = cma.aggregated_by('season_year', iris.analysis.MIN)

#--- add date of when the extremes are occurring
extrdateli = []
# loop over all years
for mij in range(c_metric.data.shape[0]):
    # get extreme value
    m = c_metric.data[mij]

    # get values for this year
    year_index = cma.coord('season_year').points == c_metric.coord('season_year').points[mij]
    temperatures_this_syear = cma.data[year_index]
    dates_this_syear = tcoord.units.num2date(tcoord.points[year_index])

    # get the date(s) on which this extreme occurs and add as a string to the list
    extreme_dates = dates_this_syear[temperatures_this_syear == m]
    extrdateli += [str(c_metric.coord('season_year').points[mij]) + ':'
                   + ','.join(str(date) for date in extreme_dates)]

# add this list to the metric cube as an attribute
c_metric.attributes['date_of_extreme_value'] = ' '.join(extrdateli)
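If the per-year boolean masking is still too slow, here is a further sketch of the same idea (assuming the cma, tcoord and c_metric objects from above): group once with np.unique and locate each season-year's minimum with argmin. Unlike the version above, this records only the first occurrence when the minimum is tied:

import numpy as np

syears = cma.coord('season_year').points
dates = tcoord.units.num2date(tcoord.points)
parts = []
for sy in np.unique(syears):
    idx = np.where(syears == sy)[0]
    imin = idx[np.argmin(cma.data[idx])]   # index of this season-year's minimum
    parts.append('%s:%s' % (sy, dates[imin]))
c_metric.attributes['date_of_extreme_value'] = ' '.join(parts)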

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:
"All of the following should use data from the training_data.csv file
provided. training_data gives you a labeled set of integer pairs,
representing the scores of two sports teams, with the labels giving the
sport.
Write the following functions:
plot_scores() should draw a scatterplot of the data.
predict(dataset) should produce a trained Estimator to guess the sport
that resulted in a given score (from a dataset we've withheld, which will
be inputs as a 1000 x 2 np array). You can use any algorithm from scikit.
An optional additional function called "preprocess" will process dataset
before it is passed to predict.
"
This is what I have done so far:
import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle

def plot_scores():
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array = np.array(lst)
    pl.scatter(array[:,0], array[:,1])
    pl.show()

def preprocess(dataset):
    k = open('training_data.csv')
    lst = []
    for triple in k:
        temp = triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst
In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, and the original data was not at all random. My question is: how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal to classify based on a dataset that looks like this? (scatter plot not shown)
The task likely wants you to train a standard scikit classifier model and return it, i.e. something like
from sklearn.svm import SVC

def predict(dataset):
    X = ... # features, extract from dataset
    y = ... # labels, extract from dataset
    clf = SVC() # create classifier
    clf.fit(X, y) # train
    return clf
Though judging from the name of the function (predict) you should check if it really wants you to return a trained classifier or return predictions for the given dataset argument, as that would be more typical.
As a classifier you can use basically any one that you like. Your plot suggests your dataset is linearly separable (there are no colors for the classes, but I assume the two blobs are the two classes). On linearly separable data, hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries, see here (which also contains an overview of the available classifiers); a sketch follows below.
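A minimal decision-boundary sketch, assuming X is an n x 2 array of score pairs and y the sport labels (the variable names are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

clf = SVC().fit(X, y)                # any scikit-learn classifier works here
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 1, X[:, 0].max() + 1, 0.5),
                     np.arange(X[:, 1].min() - 1, X[:, 1].max() + 1, 0.5))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)   # shade the predicted regions
plt.scatter(X[:, 0], X[:, 1], c=y)   # overlay the training points
plt.show()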
I would recommend you take a look at this structure:
from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need

def get_data():
    # open your file and parse data to prepare X as a set of input vectors
    # and Y as a set of targets
    return X, Y

def split_data(X, Y):
    size = len(X)
    indices = list(range(size))  # list() so shuffle works in Python 3
    shuffle(indices)
    train_indices = indices[:size // 2]
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test

def plot_scatter(Y1, Y2):
    plt.figure()
    plt.scatter(Y1, Y2)  # scatter takes no format string, unlike plot
    plt.show()

# get data
X, Y = get_data()
# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)
# create a classifier as an object
classifier = YourImportedClassifier()
# train the classifier; after that, the classifier is the trained estimator you need
classifier.fit(X_train, Y_train)  # scikit-learn estimators use .fit()
# make a prediction
Y_prediction = classifier.predict(X_test)
# plot the scatter
plot_scatter(Y_prediction, Y_test)
I think what you are looking for is the clf.fit() function, rather than creating a function that produces another function.
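Putting the pieces together, a minimal end-to-end sketch under the stated assumptions (training_data.csv holds "score1,score2,label" rows; whether predict should return the estimator itself or its predictions depends on how the assignment is graded):

import numpy as np
from sklearn.svm import SVC

def predict(dataset):
    data = np.loadtxt('training_data.csv', delimiter=',', dtype=int)
    X, y = data[:, :2], data[:, 2]   # score pairs and sport labels
    clf = SVC()
    clf.fit(X, y)                    # the trained estimator
    return clf.predict(dataset)      # predictions for the withheld 1000 x 2 array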
