tf.data.Dataset.zip((a, b)) changes the order of elements if a was shuffled - tensorflow-datasets

I am preparing a dataset, training a model, and then storing the model's outputs (for the purpose of knowledge distillation).
In order to store them in the TFRecord format I need to use the .zip() function.
I reproduced the bug/mistake with the code below; my actual training files are hundreds of lines long, so I didn't include them here.
I use TensorFlow 2.1 and Python 3.7 on Ubuntu 18.04.
The problem I can't solve is: the data is shuffled (which is okay), but after zipping the tuples no longer correspond to each other (which is not okay).
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])

# prepare dataset for training
batch_size = 2
ds = ds.cache().repeat().shuffle(buffer_size=5, reshuffle_each_iteration=True).batch(batch_size)

# create model. here: map the identity function
model = tf.keras.models.Sequential([tf.keras.layers.Lambda(lambda x: x, input_shape=(1,))])

# train with model.fit()

# make predictions
pred = model.predict(ds, steps=5 // batch_size)

# prepare for saving to tfrecords
ds = ds.unbatch()
ds = ds.take(5)
pred = tf.data.Dataset.from_tensor_slices(pred)
combined = tf.data.Dataset.zip((ds, pred))

# show the unwanted behaviour
for a, c in combined:
    print(a, c)
The output of the code snippet shows that the elements in each line don't match (e.g. in line 1, the input 3 should be paired with the prediction 3.):
tf.Tensor(3, shape=(), dtype=int32) tf.Tensor([4.], shape=(1,), dtype=float32)
tf.Tensor(1, shape=(), dtype=int32) tf.Tensor([1.], shape=(1,), dtype=float32)
tf.Tensor(4, shape=(), dtype=int32) tf.Tensor([1.], shape=(1,), dtype=float32)
tf.Tensor(3, shape=(), dtype=int32) tf.Tensor([2.], shape=(1,), dtype=float32)

TensorFlow re-applies the shuffle every time the dataset is iterated.
Zipping triggers another iteration, which is why the order seen by model.predict will not match the order seen during the zip (each pass gets its own shuffle).
In any case, you do not really need to shuffle the dataset for predict: the predictions should not depend on what the model saw in a previous prediction.
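A minimal sketch of one way around this (my own illustration, assuming the goal is to pair each input with its prediction): build a separate, unshuffled, finite pipeline such as eval_ds below and use it both for predict() and for zipping, so both iterations see the same element order.

# Sketch: an unshuffled pipeline used for both prediction and zipping
eval_ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5]).batch(batch_size)

pred = model.predict(eval_ds)                        # deterministic order, no shuffle
pred_ds = tf.data.Dataset.from_tensor_slices(pred)

combined = tf.data.Dataset.zip((eval_ds.unbatch(), pred_ds))
for a, c in combined:
    print(a, c)                                      # each input now lines up with its prediction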

TensorFlow shuffles along the first axis, so if your tensor shape is (x,) the shuffle will change the order of your elements; here is a test:
a = tf.data.Dataset.from_tensor_slices(tf.constant([[x] for x in range(10)]))
b = tf.data.Dataset.from_tensor_slices(tf.constant([[x] for x in range(10)]))
c = tf.data.Dataset.zip((a, b)).shuffle(10)

for i, j in c.batch(1):
    print(i.numpy(), j.numpy())
and the output is
[[3]] [[3]]
[[6]] [[6]]
[[5]] [[5]]
[[8]] [[8]]
[[7]] [[7]]
[[1]] [[1]]
[[2]] [[2]]
[[0]] [[0]]
[[9]] [[9]]
[[4]] [[4]]
As you can see, the pairing between the two datasets has been preserved, but the items along the first axis of each tensor have been shuffled.

Related

The interpretation of cross validation scores

I am trying to understand this piece of code that I found on the Internet:
kfold = KFold(n_splits=7, random_state=seed)
results = cross_val_score(estimator, x, y, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
What does cross_val_score do?
I know that it calculates scores; I want to understand what these scores mean and how they are evaluated.
Here's how cross_val_score works:
1. As seen in the source code of cross_val_score, the x you supplied will be divided into X_train and X_test using cv=kfold. The same is done for y.
2. X_test and y_test are held back, while X_train and y_train are passed to the estimator's fit().
3. After fitting, the estimator is scored using X_test and y_test.
Steps 1 to 3 are repeated for each fold specified in kfold, and the array of scores is returned from cross_val_score.
Explanation of the 3rd point: scoring depends on the estimator and on the scoring parameter of cross_val_score. In your code you have not passed anything for scoring, so the default estimator.score() is used.
If the estimator is a classifier, estimator.score(X_test, y_test) returns accuracy; if it is a regressor, R-squared is returned.
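For intuition, here is a rough sketch of those steps written out by hand (illustrative only; sklearn's actual implementation does more, but the logic is the same, and clone gives each fold a fresh copy of the estimator):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def manual_cross_val_score(estimator, x, y, cv):
    scores = []
    for train_idx, test_idx in cv.split(x):
        X_train, X_test = x[train_idx], x[test_idx]      # step 1: split via cv
        y_train, y_test = y[train_idx], y[test_idx]
        est = clone(estimator)                           # fresh, unfitted copy per fold
        est.fit(X_train, y_train)                        # step 2: fit on the training fold
        scores.append(est.score(X_test, y_test))         # step 3: default scorer
    return np.array(scores)

# results = manual_cross_val_score(estimator, x, y, KFold(n_splits=7))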

word2vec training procedure clarification

I'm trying to learn the skip-gram model within word2vec, however I'm confused by some of the basic concepts. To start, here is my current understanding of the model motivated with an example. I am using Python gensim as I go.
Here I have a corpus with three sentences.
sentences = [
['i', 'like', 'cats', 'and', 'dogs'],
['i', 'like', 'dogs'],
['dogs', 'like', 'dogs']
]
From this, I can determine my vocabulary, V = ['and', 'cats', 'dogs', 'i', 'like'].
Following this paper by Tomas Mikolov (and others):
The basic Skip-gram formulation defines p(w_{t+j} | w_t) using the softmax function:
p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_I})}
where v_w and v'_w are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary.
To my understanding, the skip-gram model involves two matrices (I'll call them I and O): the vector representations of the "input/center" words and the vector representations of the "output/context" words. Assuming d = 2 (the vector dimension, or 'size' as it's called in gensim), I should be a 5x2 matrix and O should be a 2x5 matrix. At the start of the training procedure, these matrices are filled with random values (yes?). So we might have
import numpy as np
np.random.seed(2017)
I = np.random.rand(5,2).round(2) # 5 rows by 2 cols
[[ 0.02 0.77] # and
[ 0.45 0.12] # cats
[ 0.93 0.65] # dogs
[ 0.14 0.23] # i
[ 0.23 0.26]] # like
O = np.random.rand(2,5).round(2) # 2 rows by 5 cols
#and #cats #dogs #i #like
[[ 0.11 0.63 0.39 0.32 0.63]
[ 0.29 0.94 0.15 0.08 0.7 ]]
Now if I want to calculate the probability that the word "dogs" appears in the context of "cats", I should do
exp([0.39, 0.15] · [0.45, 0.12]) / (...) = exp(0.39*0.45 + 0.15*0.12) / (...) = exp(0.1935) / (...) ≈ 1.21 / (...)
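To make the full softmax concrete, here is a small numpy sketch using the matrices above (my own illustration of the formula, not gensim's internals); the denominator sums the exponentiated dot products over the entire vocabulary:

import numpy as np

vocab = ['and', 'cats', 'dogs', 'i', 'like']
I = np.array([[0.02, 0.77], [0.45, 0.12], [0.93, 0.65], [0.14, 0.23], [0.23, 0.26]])
O = np.array([[0.11, 0.63, 0.39, 0.32, 0.63],
              [0.29, 0.94, 0.15, 0.08, 0.70]])

i_cats = I[vocab.index('cats')]                  # input vector of the center word "cats"
scores = i_cats @ O                              # dot products with every output vector
probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
print(round(probs[vocab.index('dogs')], 3))      # p('dogs' | 'cats')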
A few questions on this:
Is my understanding of the algorithm correct thus far?
Using gensim, I can train a model on this data using:
import gensim
model = gensim.models.Word2Vec(sentences, sg = 1, size=2, window=1, min_count=1)
model.wv['dogs'] # array([ 0.06249372, 0.22618999], dtype=float32)
For the array given, is that the vector for "dogs" in the Input matrix or the Output matrix? Is there a way to view both matrices in the final model?
Why does model.wv.similarity('cats','cats') = 1? I thought this should be closer to 0, since the data would indicate that the word "cats" is unlikely to occur in the context of the word "cats".
(1) Generally, yes, but:
The O output matrix – more properly understood as the weights from the neural network's hidden layer to a number of output nodes – is interpreted differently depending on whether you use 'negative sampling' ('NS') or 'hierarchical softmax' ('HS') training.
In practice, both I and O have len(vocab) rows and vector-size columns. (I is the Word2Vec model instance's model.wv.syn0 array; O is its model.syn1neg array in NS, or model.syn1 in HS.)
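For question (2), a quick way to inspect both arrays on a trained model, using the attribute names mentioned above (these names match older gensim releases; newer gensim renames model.wv.syn0 to model.wv.vectors):

# Input matrix I: one row per vocabulary word, vector-size columns
print(model.wv.syn0.shape)       # (5, 2) for this toy corpus

# Output weights O (negative-sampling mode; under hierarchical softmax use model.syn1)
print(model.syn1neg.shape)       # also (len(vocab), size)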
I find NS a bit easier to think about: each predictable word corresponds to a single output node. For training data where (context)-indicates->(word), training tries to drive that word's node value toward 1.0 and the node values of the other, randomly chosen words toward 0.0.
In HS, each word is represented by a Huffman code over a small subset of the output nodes – those 'points' are driven to 1.0 or 0.0 so that the network becomes more indicative of a single word after a (context)-indicates->(word) example.
Only the I matrix (the initial word vectors) is randomized to low-magnitude vectors at the beginning; the hidden-to-output weights O are left as zeros.
(2) Yes, that will train things – just note that tiny toy-sized examples won't necessarily produce the useful constellations of vector coordinates that word2vec is valued for.
(3) Note, model.similarity('cats', 'cats') is actually checking the cosine-similarity between the (input) vectors for those two words. Those are the same word, thus they definitionally have the same vector, and the similarity between identical vectors is 1.0.
That is, similarity() is not asking the model for a prediction, it's retrieving learned words by key and comparing those vectors. (Recent versions of gensim do have a predict_output_word() function, but it only works in NS mode, and making predictions isn't really the point of word2vec, and many implementations don't offer any prediction API at all. Rather, the point is using those attempted predictions during training to induce word-vectors that turn out to be useful for various other tasks.)
But even if you were reading predictions, 'cats' might still be a reasonable-although-bad prediction from the model in the context of 'cats'. The essence of forcing large vocabularies into the smaller dimensionality of 'dense' embeddings is compression – the model has no choice but to cluster related words together, because there's not enough internal complexity (learnable parameters) to simply memorize all details of the input. (And for the most part, that's a good thing, because it results in generalizable patterns, rather than just overfit idiosyncrasies of the training corpus.)
The word 'cats' will wind up close to 'dogs' and 'pets' – because they all co-occur with similar words, or each other. And thus the model will be forced to make similar output-predictions for each, because their input-vectors don't vary that much. And a few predictions that are nonsensical in logical language use – like a repeating word - may be made, but only because taking a larger error there still gives less error over the whole training set, compared to other weighting alternatives.
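For question (3), a tiny sketch of what similarity() does under the hood – just cosine similarity between the stored input vectors, computed here with plain numpy for illustration:

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(model.wv['cats'], model.wv['cats']))   # 1.0: identical vectors
print(cosine(model.wv['cats'], model.wv['dogs']))   # compare with model.wv.similarity('cats', 'dogs')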

tensorflow multiply two tensors

I'm trying to multiply two tensors together that both have the same shape:
weights = tf.Variable(tf.random_normal([200], stddev=0.35),
name="weights")
weights2 = tf.Variable(tf.random_normal([200], stddev=0.35),
name="weights2")
greg = tf.matmul(weights,weights2)
sess=tf.Session()
sess.run(tf.initialize_all_variables())
sess.close()
Trying this in jupyter notebook, I get this error:
"Shapes (200,) and (?, ?) must have the same rank"
What am I missing?
As NPE mentions in their comment, the tf.matmul() op expects both of its inputs to be two-dimensional tensors, but your arguments weights and weights2 are one-dimensional tensors.
If you want to compute the inner product of these two vectors, you need to reshape them into a 1-by-200 and a 200-by-1 matrix, using (e.g.) tf.reshape() as follows:
greg = tf.matmul(tf.reshape(weights, [1, 200]), tf.reshape(weights2, [200, 1]))
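If the goal is simply an element-wise product, or a scalar inner product without any reshaping, a sketch of the usual alternatives (TF 1.x style, to match the Session usage in the question):

import tensorflow as tf

weights = tf.Variable(tf.random_normal([200], stddev=0.35), name="weights")
weights2 = tf.Variable(tf.random_normal([200], stddev=0.35), name="weights2")

elementwise = tf.multiply(weights, weights2)     # shape (200,), same as weights * weights2
inner = tf.reduce_sum(weights * weights2)        # scalar inner product

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(inner))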

What is stratified bootstrap?

I have learned bootstrap and stratification. But what is stratified bootstrap? And how does it work?
Let's say we have a dataset of n instances (observations), and m is the number of classes. How should I divide the dataset, and what's the percentage for training and testing?
You split your dataset per class. Afterwards, you sample from each sub-population independently. The number of instances drawn from a sub-population should be proportional to its share of the whole dataset.
data
d(i) <- { x in data | class(x) = i }
for each class i:
    for j = 1 .. samplesize * (size(d(i)) / size(data)):
        sample(i) <- sample(i) U { draw element from d(i) }
sample <- U_i sample(i)
If you sample four elements from a dataset with classes {'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'}, this procedure makes sure that at least one element of class b is contained in the stratified sample.
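For reference, newer versions of scikit-learn can do the same thing in one call via resample and its stratify argument (a minimal sketch, assuming scikit-learn is available; the data and n_samples values are just for illustration):

from sklearn.utils import resample

data = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b']
boot = resample(data, replace=True, n_samples=4, stratify=data, random_state=0)
print(boot)   # the 3:1 class proportion of the data is preserved in the bootstrap sample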
I just had to implement this in Python, so I will post my current approach here in case it is of interest to others.
Function to create an index into the original DataFrame for a stratified bootstrap sample:
I chose to iterate over all relevant strata in the original DataFrame, retrieve the index of the rows in each stratum, and randomly (with replacement) draw as many samples from that stratum as the stratum itself contains.
The randomly drawn indices can then be combined into one list (which should in the end have the same length as the original DataFrame).
import pandas as pd
from random import choices

def provide_stratified_bootstap_sample_indices(bs_sample):
    # count how many rows fall into each stratum
    strata = bs_sample.loc[:, "STRATIFICATION_VARIABLE"].value_counts()
    bs_index_list_stratified = []
    for idx_stratum_var, n_stratum_var in strata.items():
        # indices of all rows belonging to this stratum
        data_index_stratum = list(bs_sample[bs_sample["STRATIFICATION_VARIABLE"] == idx_stratum_var].index)
        # draw, with replacement, as many indices as the stratum contains
        bs_index_list_stratified.extend(choices(data_index_stratum, k=len(data_index_stratum)))
    return bs_index_list_stratified
And then the actual bootstrapping loop (say 10,000 times):
k = 10000
for i in range(k):
    bs_sample = DATA_original.copy()
    bs_index_list_stratified = provide_stratified_bootstap_sample_indices(bs_sample)
    bs_sample = bs_sample.loc[bs_index_list_stratified, :]
    # process the data with whatever statistical operation is required
    # and save the results for each iteration
    RESULTS = FUNCTION_X(bs_sample)

Sorting: Return an array with new positions of each element

I need to sort an array while also returning an array that contains the sorted position of each original element. (N.B. not an argsort, which gives the indices that would sort the array.)
At present this requires two steps:
1. An argsort
2. A scatter operation onto a new array, i.e. pos[argsort[i]] = i
I feel like I am missing a trick here. Is there a well-known algorithm that I have overlooked that achieves this in one step?
Step 2 can also be implemented with a search, but I think the scatter is more efficient.
I have included some example Python code to illustrate the problem.
import numpy as np

l = [0, -8, 1, 10, 13, 2]
a = np.argsort(l)
# returns [1 0 2 5 3 4], the order required to sort l

# init new list to zero
pos = [0 for x in range(0, len(l))]

# scatter: http://en.wikipedia.org/wiki/Gather-scatter_(vector_addressing)
for i in range(0, len(l)):
    pos[a[i]] = i

print(pos)
# prints [1, 0, 2, 4, 5, 3], i.e. each original element's new position in the sorted array
Searching for references to this problem has left me frustrated; perhaps I am missing the correct terminology for this type of operation.
Any help or guidance would be much appreciated.
Here's a simple implementation, although it's not "in-place" in any meaningful sense. I'm not sure what you mean by "in-place", since the output is an np.array of type int and the input could contain doubles.
Updated in response to #norio's comment and to clarify intent:
#!/usr/bin/env python
import numpy as np

unsorted = np.array([0, -8, 1, 10, 13, 2])

def myargsort(numbers):
    tuples = enumerate(numbers)  # returns an iterable of (index, value) pairs
    sortedTuples = sorted(tuples, key=lambda pair: pair[1])
    sortedNumbers = [num for idx, num in sortedTuples]
    sortIndexes = [idx for idx, num in sortedTuples]
    return (sortedNumbers, sortIndexes)

sortedNums, sortIndices = myargsort(unsorted)
print(unsorted)
print(sortedNums)
print(sortIndices)
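As a follow-up sketch in NumPy: the 'positions' array the question asks for is the inverse permutation of the argsort, which can be obtained in one expression, or with a single vectorized scatter that avoids the second sort:

import numpy as np

l = np.array([0, -8, 1, 10, 13, 2])

pos = np.argsort(np.argsort(l))   # argsort of the argsort gives each element's sorted position
print(pos)                        # [1 0 2 4 5 3]

# Equivalent O(n) vectorized scatter, avoiding the second sort:
a = np.argsort(l)
pos2 = np.empty_like(a)
pos2[a] = np.arange(len(l))
print(pos2)                       # [1 0 2 4 5 3]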
