I want to be able to sample from a multinomial distribution very efficiently and apparently my TensorFlow code is very... very slow...
The idea is that, I have:
A vector: counts = [40, 50, 26, ..., 19] for example
A matrix of probabilities: probs = [[0.1, ..., 0.5], ... [0.3, ..., 0.02]] such that np.sum(probs, axis=1) = 1
Let's say len(counts) = N and probs.shape = (N, 50). What I want to do is (in our example):
sample 40 times from the first probability vector of the matrix probs
sample 50 times from the second probability vector of the matrix probs
...
sample 19 times from the Nth probability vector of the matrix probs
such that my final matrix looks like (for example):
A = [[22, ... 13], ..., [12, ..., 3]] where np.sum(A, axis=1) == counts
(i.e the sum over each row = the number in the corresponding row of counts vector)
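For reference, the desired result can be sketched in plain NumPy with one multinomial draw per row. This is only a minimal illustration: the counts, the 5 categories and the seed below are made-up values, not the sizes from the question.

```python
import numpy as np

rng = np.random.RandomState(0)

counts = np.array([40, 50, 26, 19])          # illustrative values
probs = rng.uniform(size=(len(counts), 5))   # 5 categories here, not 50
probs /= probs.sum(axis=1, keepdims=True)    # each row sums to 1

# One multinomial draw per row: row i distributes counts[i] samples
# over the categories according to probs[i].
A = np.array([rng.multinomial(n, p) for n, p in zip(counts, probs)])

print(A)
print(np.array_equal(A.sum(axis=1), counts))
```

Each row of A sums to the matching entry of counts, which is the property the TensorFlow and Theano versions should also satisfy.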
Here is my TensorFlow code sample:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.distributions as ds
import time
nb_distribution = 100 # number of probability distributions
counts = np.random.randint(2000, 3500, size=nb_distribution) # define number of counts (vector of size 100 with int in 2000, 3500)
# print(u[:40]) # should be the same as the output of print(np.sum(res, 1)[:40]) in the tf.Session()
# probsn is a matrix of probability:
# each row of probsn contains a vector of size 30 that sums to 1
probsn = np.random.uniform(size=(nb_distribution, 30))
probsn /= np.sum(probsn, axis=1)[:, None]
counts = tf.Variable(counts, dtype=tf.float32)
probs = tf.Variable(tf.convert_to_tensor(probsn.astype(np.float32)))
# sample from the multinomial
dist = ds.Multinomial(total_count=counts, probs=probs)
out = dist.sample()
start = time.time()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(out)
    # print(np.sum(res, 1)[:40])
    print(time.time() - start)
elapsed time: 0.12 seconds
My equivalent code in Theano:
import time
import numpy as np
import theano
from theano.tensor import _shared
nb_distribution = 100 # number of probability distributions
counts = np.random.randint(2000, 3500, size=nb_distribution)
#print(u[:40]) # should be the same as the output of print(np.sum(v_sample(), 1)[:40])
counts = _shared(counts) # define number of counts (vector of size 100 with int in 2000, 3500)
# probsn is a matrix of probability:
# each row of probsn contains a vector that sums to 1
probsn = np.random.uniform(size=(nb_distribution, 30))
probsn /= np.sum(probsn, axis=1)[:, None]
probsn = _shared(probsn)
from theano.tensor.shared_randomstreams import RandomStreams
np_rng = np.random.RandomState(12345)
theano_rng = RandomStreams(np_rng.randint(2 ** 30))
v_sample = theano.function(inputs=[], outputs=theano_rng.multinomial(n=counts, pvals=probsn))
start_t = time.time()
out = np.sum(v_sample(), 1)[:40]
# print(out)
print(time.time() - start_t)
elapsed time: 0.0025 seconds
Theano is roughly 50x faster here (0.0025 s vs. 0.12 s)... Is there something wrong with my TensorFlow code? How can I sample from a multinomial distribution efficiently in TensorFlow?
The problem is that the TensorFlow Multinomial sample() method actually calls _sample_n(). This method is defined here. As we can see in the code, to sample from the multinomial it produces a one-hot matrix for each row and then reduces that matrix to a vector by summing over the rows:
math_ops.reduce_sum(array_ops.one_hot(x, depth=k), axis=-2)
It is inefficient because it materializes a one-hot matrix with one row per draw, which uses extra memory. To avoid this I have used the tf.scatter_nd function. Here is a fully runnable example:
import tensorflow as tf
import numpy as np
import tensorflow.contrib.distributions as ds
import time
tf.reset_default_graph()
nb_distribution = 100 # number of probabilities distribution
u = np.random.randint(2000, 3500, size=nb_distribution) # define number of counts (vector of size 100 with int in 2000, 3500)
# probsn is a matrix of probability:
# each row of probsn contains a vector of size 30 that sums to 1
probsn = np.random.uniform(size=(nb_distribution, 30))
probsn /= np.sum(probsn, axis=1)[:, None]
counts = tf.Variable(u, dtype=tf.float32)
probs = tf.Variable(tf.convert_to_tensor(probsn.astype(np.float32)))
# sample from the multinomial
dist = ds.Multinomial(total_count=counts, probs=probs)
out = dist.sample()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(out)  # warm-up run; without it the timed run below is slower
    start = time.time()
    res = sess.run(out)
    print(time.time() - start)
    print(np.all(u == np.sum(res, axis=1)))
This code took 0.05 seconds to compute
def vmultinomial_sampling(counts, pvals, seed=None):
    k = tf.shape(pvals)[1]
    logits = tf.expand_dims(tf.log(pvals), 1)

    def sample_single(args):
        logits_, n_draw_ = args[0], args[1]
        x = tf.multinomial(logits_, n_draw_, seed)
        indices = tf.cast(tf.reshape(x, [-1,1]), tf.int32)
        updates = tf.ones(n_draw_) # tf.shape(indices)[0]
        return tf.scatter_nd(indices, updates, [k])

    x = tf.map_fn(sample_single, [logits, counts], dtype=tf.float32)
    return x
xx = vmultinomial_sampling(u, probsn)
# check = tf.expand_dims(counts, 1) * probs
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run(xx)  # warm-up run; without it the timed run below is slower
    start_t = time.time()
    res = sess.run(xx)
    print(time.time() - start_t)
    #print(np.sum(res, axis=1))
    print(np.all(u == np.sum(res, axis=1)))
This code took 0.016 seconds
The drawback is that my code doesn't actually parallelize the computation (even though the parallel_iterations parameter of map_fn is 10 by default, setting it to 1 doesn't change anything...)
Maybe someone will find something better, because it is still very slow compared to Theano's implementation (it doesn't take advantage of parallelization, and yet parallelization makes sense here because sampling one row is independent of sampling another).
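The memory argument can be illustrated outside TensorFlow. The NumPy sketch below is an analogy, not the TensorFlow implementation, and its sizes and seed are arbitrary. It counts the same draws twice: once through a one-hot matrix that is then summed, and once by accumulating directly into a length-k vector, which is the idea behind the scatter-based version.

```python
import numpy as np

rng = np.random.RandomState(0)
k = 5                                  # number of categories (illustrative)
draws = rng.randint(0, k, size=1000)   # category index of each individual draw

# Approach 1: build an (n_draws, k) one-hot matrix, then sum over draws.
one_hot = np.eye(k, dtype=np.int64)[draws]
counts_one_hot = one_hot.sum(axis=0)

# Approach 2: accumulate directly into a length-k vector (scatter-add).
counts_scatter = np.zeros(k, dtype=np.int64)
np.add.at(counts_scatter, draws, 1)

print(np.array_equal(counts_one_hot, counts_scatter))
print(one_hot.nbytes, counts_scatter.nbytes)
```

Both approaches give the same counts, but the intermediate one-hot matrix grows with the number of draws while the scatter target stays at k entries.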
I have a conditional statement that adds row of binary values from matrix A to matrix B. I want to put this in a loop so that it continues to add rows from matrix A until matrix B is full. Currently matrix B is initialized as 10 by 10 matrix of zeros. Do I need to initialize matrix B differently in order to create this condition or is there a way of doing it as is?
Below is roughly how my code looks so far
from random import sample
import numpy as np
matrixA = np.random.randint(2, size=(10,10))
matrixB = np.zeros((10,10))
x, y = sample(range(1, 10), k=2)
if someCondition:
    matrixB = np.append(matrixB, [matrixA[x]], axis=0)
else:
    matrixB = np.append(matrixB, [matrixA[y]], axis=0)
You don't need a loop for this. It is easy to do with indexing alone. For example:
import numpy as np
A = np.random.randint(0, 10, size=(20,10))
B = np.empty((10, 10))
print(A)
# Copy till the row that satisfies your conditions. Here I assume it to be 10
B = A[:10, :]
print(B)
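If the row to copy depends on a condition, the same indexing idea still works without a loop. The sketch below is only an assumption about what the question intends: some_condition, x and y are hypothetical stand-ins (random here) for the real condition and candidate row indices.

```python
import numpy as np

rng = np.random.RandomState(0)
matrixA = rng.randint(2, size=(10, 10))
n_rows = 10                                   # number of rows B should hold

# Two candidate row indices per output row, and a stand-in condition
# (hypothetical: the real someCondition comes from your own code).
x = rng.randint(0, 10, size=n_rows)
y = rng.randint(0, 10, size=n_rows)
some_condition = rng.rand(n_rows) < 0.5

# Pick x where the condition holds, y otherwise, then gather the rows.
rows = np.where(some_condition, x, y)
matrixB = matrixA[rows]

print(matrixB.shape)
```

Because matrixB is built in one step with its final size, there is no need to start from a zero matrix and append to it.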
I have a fairly simple test data set I am trying to fit with pymc3.
The result generated by traceplot looks something like this.
Essentially, the traces of all parameters look like a standard 'caterpillar' for 100 iterations, followed by a flat line for 750 iterations, followed by the caterpillar again.
The initial 100 iterations happen after 25,000 ADVI iterations, and 10,000 tune iterations. If I change these amounts, I randomly will/won't have these periods of unwanted stability.
I'm wondering if anyone has any advice about how I can stop this from happening - and what is causing it?
Thanks.
The full code is below. In brief, I am generating a set of 'phases' (-pi -> pi) with a corresponding set of values y = a(j)*sin(phase) + b(j)*cos(phase). a and b are drawn for each subject j at random, but are related to each other.
I then essentially try to fit this same model.
Edit: Here is a similar example, running for 25,000 iterations. Something goes wrong around iteration 20,000.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
%matplotlib inline
np.random.seed(0)
n_draw = 2000
n_tune = 10000
n_init = 25000
init_string = 'advi'
target_accept = 0.95
##
# Generate some test data
# Just generates:
# x a vector of phases
# y a vector corresponding to some sinusoidal function of x
# subject_idx a vector corresponding to which subject x is
#9 Subjects
N_j = 9
#Each with 276 measurements
N_i = 276
sigma_y = 1.0
mean = [0.1, 0.1]
cov = [[0.01, 0], [0, 0.01]] # diagonal covariance
x_sub = np.zeros((N_j,N_i))
y_sub = np.zeros((N_j,N_i))
y_true_sub = np.zeros((N_j,N_i))
ab_sub = np.zeros((N_j,2))
tuning_sub = np.zeros((N_j,1))
sub_ix_sub = np.zeros((N_j,N_i))
for j in range(0,N_j):
    aj,bj = np.random.multivariate_normal(mean, cov)
    #aj = np.abs(aj)
    #bj = np.abs(bj)
    xj = np.random.uniform(-1,1,size = (N_i,1))*np.pi
    xj = np.sort(xj, axis=0)#for convenience (axis=0 because xj is a column vector)
    yj_true = aj*np.sin(xj) + bj*np.cos(xj)
    yj = yj_true + np.random.normal(scale=sigma_y, size=(N_i,1))
    x_sub[j,:] = xj.ravel()
    y_sub[j,:] = yj.ravel()
    y_true_sub[j,:] = yj_true.ravel()
    ab_sub[j,:] = [aj,bj]
    tuning_sub[j,:] = np.sqrt(((aj**2)+(bj**2)))
    sub_ix_sub[j,:] = [j]*N_i
x = np.ravel(x_sub)
y = np.ravel(y_sub)
subject_idx = np.ravel(sub_ix_sub)
subject_idx = np.asarray(subject_idx, dtype=int)
##
# Fit model
hb1_model = pm.Model()
with hb1_model:
    # Hyperpriors
    hb1_mu_a = pm.Normal('hb1_mu_a', mu=0., sd=100)
    hb1_sigma_a = pm.HalfCauchy('hb1_sigma_a', 4)
    hb1_mu_b = pm.Normal('hb1_mu_b', mu=0., sd=100)
    hb1_sigma_b = pm.HalfCauchy('hb1_sigma_b', 4)
    # We fit a mixture of a sine and cosine with these two coefficients
    # allowed to be different for each subject
    hb1_aj = pm.Normal('hb1_aj', mu=hb1_mu_a, sd=hb1_sigma_a, shape = N_j)
    hb1_bj = pm.Normal('hb1_bj', mu=hb1_mu_b, sd=hb1_sigma_b, shape = N_j)
    # Model error
    hb1_eps = pm.HalfCauchy('hb1_eps', 5)
    hb1_linear = hb1_aj[subject_idx]*pm.math.sin(x) + hb1_bj[subject_idx]*pm.math.cos(x)
    hb1_linear_like = pm.Normal('y', mu = hb1_linear, sd=hb1_eps, observed=y)
with hb1_model:
    hb1_trace = pm.sample(draws=n_draw, tune = n_tune,
                          init = init_string, n_init = n_init,
                          target_accept = target_accept)
pm.traceplot(hb1_trace)
To partially answer my own question: After playing with this for a while, it looks like the problem might be due to the hyperprior standard deviation going to 0. I am not sure why the algorithm should get stuck there though (testing a small standard deviation can't be that uncommon...).
In any case, two solutions that seem to alleviate the problem (although they don't remove it entirely) are:
1) Add an offset to the definitions of the standard deviation. e.g.:
offset = 1e-2
hb1_sigma_a = offset + pm.HalfCauchy('hb1_sigma_a', 4)
2) Instead of using a HalfCauchy or HalfNormal for the SD prior, use a Lognormal distribution parameterized so that values near 0 are unlikely.
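To get a feel for how much prior mass each choice puts near zero, the closed-form CDFs can be compared directly. This is a rough sketch: the HalfCauchy scale of 4 comes from the model above, while the Lognormal parameters (mu=0, sd=1) and the 0.01 threshold are illustrative assumptions, not values from the original code.

```python
import math

def half_cauchy_cdf(x, beta):
    """P(sigma <= x) for a HalfCauchy with scale beta."""
    return (2.0 / math.pi) * math.atan(x / beta)

def lognormal_cdf(x, mu, sd):
    """P(sigma <= x) for a Lognormal with log-scale parameters mu, sd."""
    z = (math.log(x) - mu) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

threshold = 0.01
p_hc = half_cauchy_cdf(threshold, beta=4.0)     # prior used in the model
p_ln = lognormal_cdf(threshold, mu=0.0, sd=1.0) # assumed alternative

print(p_hc, p_ln)
```

Under these assumed parameters the Lognormal assigns far less probability to standard deviations below the threshold than the HalfCauchy does, which is the effect the second suggestion relies on.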
I'd look at the divergences, as explained in notes and literature on Hamiltonian Monte Carlo; see, e.g., here and here.
with hb1_model:
    np.savetxt('diverging.csv', hb1_trace['diverging'])
As a dirty solution, you can try to increase target_accept, perhaps.
Good luck!
I was recently working on a deep learning model in Keras and it gave me very perplexing results. The model is capable of mastering the training data over time, but it consistently gets worse results on the validation data.
I know that if the validation accuracy goes up for a while and then starts to decrease that you are over-fitting to the training data, but in this case, the validation accuracy only ever decreases. I am really confused why this happens. Does anyone have any intuition as to what could cause this to happen? Or any suggestions on things to test to potentially fix it?
Edit to add more info and code
Ok. So I am making a model that is trying to do some basic stock predictions. By looking at the open, high, low, close, and volume of the last 40 days, the model tries to predict whether or not the price will go up two average true ranges without going down one average true range. As input, I took CSVs from Yahoo Finance that include this information for the last 30 years for all of the stocks in the Dow Jones Industrial Average. The model trains on 70% of the stocks and validates on the other 30%. This leads to about 150,000 training samples. I am currently using a 1D convolutional neural network, but I have also tried other smaller models (logistic regression and a small feed-forward NN), and I always get the same result: either diverging train and validation loss, or nothing learned at all because the model is too simple.
Here is the code:
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import auc, roc_curve, roc_auc_score
from keras.layers import Input, Dense, Flatten, Conv1D, Activation, MaxPooling1D, Dropout, Concatenate
from keras.models import Model
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
from keras import backend as K
import matplotlib.pyplot as plt
from random import seed, shuffle
from os import listdir
class roc_auc(Callback):
    def on_train_begin(self, logs={}):
        self.aucs = []

    def on_train_end(self, logs={}):
        return

    def on_epoch_begin(self, epoch, logs={}):
        return

    def on_epoch_end(self, epoch, logs={}):
        y_pred = self.model.predict(self.validation_data[0])
        self.aucs.append(roc_auc_score(self.validation_data[1], y_pred))
        if max(self.aucs) == self.aucs[-1]:
            # use the callback's own model reference rather than the global `model`
            self.model.save_weights("weights.roc_auc.hdf5")
        print(" - auc: %0.4f" % self.aucs[-1])
        return

    def on_batch_begin(self, batch, logs={}):
        return

    def on_batch_end(self, batch, logs={}):
        return
rrr = 2
epochs = 200
batch_size = 64
days_input = 40
seed(42)
X_train = []
X_test = []
y_train = []
y_test = []
files = listdir("Stocks")
total_stocks = len(files)
shuffle(files)
for x, file in enumerate(files):
    test = False
    if (x+1.0)/total_stocks > 0.7:
        test = True
    if test:
        print("Test -> Stocks/%s" % file)
    else:
        print("Train -> Stocks/%s" % file)
    stock = np.loadtxt(open("Stocks/"+file, "r"), delimiter=",", skiprows=1, usecols = (1,2,3,5,6))
    atr = []
    last = None
    for day in stock:
        if last is None:
            tr = abs(day[1] - day[2])
            atr.append(tr)
        else:
            tr = max(day[1] - day[2], abs(last[3] - day[1]), abs(last[3] - day[2]))
            atr.append((13*atr[-1]+tr)/14)
        last = day.copy()
    stock = np.insert(stock, 5, atr, axis=1)
    for i in range(days_input,stock.shape[0]-1):
        input = stock[i-days_input:i, 0:5].copy()
        for j, day in enumerate(input):
            input[j][1] = (day[1]-day[0])/day[0]
            input[j][2] = (day[2]-day[0])/day[0]
            input[j][3] = (day[3]-day[0])/day[0]
        input[:,0] = input[:,0] / np.linalg.norm(input[:,0])
        input[:,1] = input[:,1] / np.linalg.norm(input[:,1])
        input[:,2] = input[:,2] / np.linalg.norm(input[:,2])
        input[:,3] = input[:,3] / np.linalg.norm(input[:,3])
        input[:,4] = input[:,4] / np.linalg.norm(input[:,4])
        preprocessing.scale(input, copy=False)
        output = -1
        buy = stock[i][1]
        stoploss = buy - stock[i][5]
        target = buy + rrr*stock[i][5]
        for j in range(i+1, stock.shape[0]):
            if stock[j][0] < stoploss or stock[j][2] < stoploss:
                output = 0
                break
            elif stock[j][1] > target:
                output = 1
                break
        if output != -1:
            if test:
                X_test.append(input)
                y_test.append(output)
            else:
                X_train.append(input)
                y_train.append(output)
shape = list(X_train[0].shape)
shape[:0] = [len(X_train)]
X_train = np.concatenate(X_train).reshape(shape)
y_train = np.array(y_train)
shape = list(X_test[0].shape)
shape[:0] = [len(X_test)]
X_test = np.concatenate(X_test).reshape(shape)
y_test = np.array(y_test)
print("Train class split is %0.2f" % (100*np.average(y_train)))
print("Test class split is %0.2f" % (100*np.average(y_test)))
inputs = Input(shape=(days_input,5))
x = Conv1D(32, 5, padding='same')(inputs)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Conv1D(64, 5, padding='same')(x)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Conv1D(128, 5, padding='same')(x)
x = Activation('relu')(x)
x = MaxPooling1D()(x)
x = Flatten()(x)
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
output = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inputs,outputs=output)
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True, mode='max')
auc_hist = roc_auc()
callbacks_list = [checkpoint, auc_hist]
history = model.fit(X_train, y_train, validation_data=(X_test,y_test) , epochs=epochs, callbacks=callbacks_list, batch_size=batch_size, class_weight ='balanced').history
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("weights.latest.hdf5")
model.load_weights("weights.roc_auc.hdf5")
plt.plot(history['acc'])
plt.plot(history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.plot(auc_hist.aucs)
plt.title('model ROC AUC')
plt.ylabel('AUC')
plt.xlabel('epoch')
plt.show()
y_pred = model.predict(X_train)
fpr, tpr, _ = roc_curve(y_train, y_pred)
roc_auc = auc(fpr, tpr)
plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy',linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Train ROC')
plt.legend(loc="lower right")
y_pred = model.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.subplot(1, 2, 2)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy',linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Test ROC')
plt.legend(loc="lower right")
plt.show()
with open('roc.csv','w+') as file:
    for i in range(len(thresholds)):
        file.write("%f,%f,%f\n" % (fpr[i], tpr[i], thresholds[i]))
Results by 100 batches instead of by epoch
I listened to suggestions and made a few updates. The classes are now balanced 50% to 50% instead of 25% to 75%. Also, the validation data is randomly selected now instead of being a specific set of stocks. By graphing the loss and accuracy at a finer resolution(100 batches vs 1 epoch), the over-fitting can clearly be seen. The model does actually start to learn at the very beginning before it starts to diverge. I am surprised at how fast it starts to over-fit, but now that I can see the issue hopefully I can debug it.
Possible explanations
Coding error
Overfitting due to differences in the training / validation data
Skewed classes (and differences in the training / validation data)
Things I would try
Swapping the training and the validation set. Does the problem still occur?
Plot the curves in more detail for the first ~10 epochs (e.g. directly after initialization; each few training iterations, not only per epoch). Do you still start at > 75%? Then your classes might be skewed and you might also want to check if your training-validation split is stratified.
Code
This is useless: np.concatenate(X_train)
Make your code as readable as possible when you post it here. This includes removing lines which are commented out.
This looks suspicious for a coding error to me:
if test:
    X_test.append(input)
    y_test.append(output)
else:
    #if((output == 0 and np.average(y_train) > 0.5) or output == 1):
    X_train.append(input)
    y_train.append(output)
Use sklearn.model_selection.train_test_split instead. Apply all transformations to the data first, then make the split with this function.
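To make the idea of a stratified split concrete, here is a NumPy-only sketch of what stratification does. It is not scikit-learn's implementation, and the helper name stratified_split, the toy labels and the 30% test fraction are all assumptions for illustration.

```python
import numpy as np

def stratified_split(y, test_fraction, seed=0):
    """Return (train_idx, test_idx) with class ratios preserved in both parts."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # positions of this class
        rng.shuffle(idx)                 # shuffle within the class
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels with the 25% / 75% skew mentioned in the question.
y = np.array([1] * 250 + [0] * 750)
train_idx, test_idx = stratified_split(y, test_fraction=0.3)

print(y[train_idx].mean(), y[test_idx].mean())
```

In practice, train_test_split accepts a stratify argument that gives the same guarantee, so both parts keep the class ratio of the full data set.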
Looks like the batch size is much too small for the number of training samples you have. Try a batch size of about 20% of the training set and see if that makes a difference.
When using tf.boolean_mask(), a ValueError is raised. It reads: "Number of mask dimensions must be specified, even if some dimensions are None. E.g. shape=[None] is ok, but shape=None is not."
I suspect that something is going wrong when I create my boolean mask s, because when I just create a boolean mask by hand, all works fine. However, I've checked the shape and the dtype of s so far, and couldn't notice anything suspicious. Both seemed to be identical to the shape and type of the boolean mask I created by hand.
Please see a screenshot of the problem.
The following should allow you to reproduce the error on your machine. You need tensorflow, numpy and scipy.
with tf.Session() as sess:
    # receive five embedded vectors
    v0 = tf.constant([[3.0,1.0,2.,4.,2.]])
    v1 = tf.constant([[4.0,0,1.0,4,1.]])
    v2 = tf.constant([[1.0,1.0,0.0,4.,8.]])
    v3 = tf.constant([[1.,4,2.,5.,2.]])
    v4 = tf.constant([[3.,2.,3.,2.,5.]])
    # concatenate the five embedded vectors into a matrix
    VT = tf.concat([v0,v1,v2,v3,v4],axis=0)
    # perform SVD on the concatenated matrix
    s, u1, u2 = tf.svd(VT)
    e = tf.square(s) # list of eigenvalues
    v = u1 # eigenvectors as column vectors
    # sample a set
    s = tf.py_func(sample_dpp_bin,[e,v],tf.bool)
    X = tf.boolean_mask(VT,s)
    print(X.eval())
This is the code to generate s. s is a sample from a determinantal point process (for the mathematically interested).
Note that I'm using tf.py_func to wrap this python function:
import tensorflow as tf
import numpy as np
from scipy.linalg import orth
def sample_dpp_bin(e_val,e_vec):
    # e_val = np.array of eigenvalues
    # e_vec = array of eigenvectors (= column vectors)
    eps = 0.01
    # sample a set of eigenvectors
    ind = (np.random.rand(len(e_val)) <= (e_val)/(1+e_val))
    k = sum(ind)
    if k == e_val.size:
        return np.ones(e_val.size,dtype=bool) # check for full set
    if k == 0:
        return np.zeros(e_val.size,dtype=bool)
    V = e_vec[:,np.array(ind)]
    # sample a set of k items
    sample = np.zeros(e_val.size,dtype=bool)
    for l in range(k-1,-1,-1):
        p = np.sum(V**2,axis=1)
        p = np.cumsum(p / np.sum(p)) # item cumulative probabilities
        i = int((np.random.rand() <= p).argmax()) # choose random item
        sample[i] = True
        j = (np.abs(V[i,:])>eps).argmax() # pick an eigenvector not orthogonal to e_i
        Vj = V[:,j]
        V = orth(V - (np.outer(Vj,(V[i,:]/Vj[i]))))
    return sample
The output if I print s and tf.shape(s) is
[False True True True True]
[5]
The output if I print VT and tf.shape(VT) is
[[ 3. 1. 2. 4. 2.]
[ 4. 0. 1. 4. 1.]
[ 1. 1. 0. 4. 8.]
[ 1. 4. 2. 5. 2.]
[ 3. 2. 3. 2. 5.]]
[5 5]
Any help much appreciated.
Following example works for me.
import tensorflow as tf
import numpy as np
tensor = [[1, 2], [3, 4], [5, 6]]
mask = np.array([True, False, True])
t_m = tf.boolean_mask(tensor, mask)
sess = tf.Session()
print(sess.run(t_m))
Output:
[[1 2]
[5 6]]
Provide your runnable code snippet to reproduce the error. I think you might be doing something wrong in s.
Update:
s = tf.py_func(sample_dpp_bin,[e,v],tf.bool)
s_v = (s.eval())
X = tf.boolean_mask(VT,s_v)
print(X.eval())
Here the mask is evaluated to a NumPy array first instead of being passed as a TF tensor of unknown shape. You don't have to use tf.py_func.
The error message states that the shape of the mask is not defined. What do you get if you print tf.shape(s)? I'd bet the problem with your code is that the shape of s is completely unknown, and you could fix that with a simple call like s.set_shape([None]) (to simply specify that s is a 1-dimensional tensor; note that (None) without a trailing comma is just None, which leaves the rank unknown). Consider this code snippet:
X = np.random.randint(0, 2, (100, 100, 3))
with tf.Session() as sess:
    X_tf = tf.placeholder(tf.int8)
    # X_tf.set_shape((None, None, None))
    y = tf.greater(tf.reduce_max(X_tf, axis=(0, 1)), 0)
    print(tf.shape(y))
    z = tf.boolean_mask(X_tf, y, axis=2)
    print(sess.run(z, feed_dict={X_tf: X}))
This prints Tensor("Shape_3:0", shape=(?,), dtype=int32) (i.e., even the number of dimensions of y is unknown) and raises the same error as you have. However, if you uncomment the set_shape line, then X_tf is known to be 3-dimensional and so y is 1-dimensional. The code then works. So, I think all you need to do is add an s.set_shape([None]) call after the py_func call.
Many machine learning algorithms rely on matrix multiplication (or at least can be implemented using it), so to test my GPU I plan to create matrices a and b, multiply them, and record the time the computation takes.
Here is the code that generates two matrices of dimensions 300000 x 20000 and multiplies them:
import tensorflow as tf
import numpy as np
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
#a = np.array([[1, 2, 3], [4, 5, 6]])
#b = np.array([1, 2, 3])
a = np.random.rand(300000,20000)
b = np.random.rand(300000,20000)
print("Init complete")
result = tf.mul(a , b)
v = sess.run(result)
print(v)
Is this a sufficient test to compare the performance of GPUs? What other factors should I consider?
Here's an example of a matmul benchmark which avoids common pitfalls, and matches the official 11 TFLOP mark on Titan X Pascal.
import os
import sys
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import tensorflow as tf
import time
n = 8192
dtype = tf.float32
with tf.device("/gpu:0"):
    matrix1 = tf.Variable(tf.ones((n, n), dtype=dtype))
    matrix2 = tf.Variable(tf.ones((n, n), dtype=dtype))
    product = tf.matmul(matrix1, matrix2)
# avoid optimizing away redundant nodes
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
iters = 10
# pre-warming
sess.run(product.op)
start = time.time()
for i in range(iters):
    sess.run(product.op)
end = time.time()
ops = n**3 + (n-1)*n**2 # n^2*(n-1) additions, n^3 multiplications
elapsed = (end - start)
rate = iters*ops/elapsed/10**9
print('\n %d x %d matmul took: %.2f sec, %.2f G ops/sec' % (n, n,
elapsed/iters,
rate,))
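As a quick sanity check on the numbers involved, the arithmetic can be worked through by hand. The sketch below reuses the operation count from the benchmark and the 11 TFLOP/s figure quoted above. It also estimates the memory for one matrix of the size used in the question, assuming float64, which is what np.random.rand returns.

```python
n = 8192

# Same operation count as in the benchmark: n^3 multiplications
# plus n^2 * (n - 1) additions.
ops = n**3 + (n - 1) * n**2

# Time one n x n matmul would take at the quoted 11 TFLOP/s.
peak_flops = 11e12
seconds_per_matmul = ops / peak_flops

# Memory needed for one 300000 x 20000 float64 array
# (8 bytes per element).
bytes_per_matrix = 300000 * 20000 * 8
gigabytes_per_matrix = bytes_per_matrix / 1e9

print(ops, seconds_per_matmul, gigabytes_per_matrix)
```

At that rate a single 8192 x 8192 matmul takes roughly a tenth of a second, which is why the benchmark repeats it several times. Each matrix in the question would need tens of gigabytes, so a smaller square size is a more practical choice for a GPU test.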