How to use another library in the TensorFlow graph?

I just read this article. It says that TensorFlow's resize algorithm has some bugs, so now I want to use scipy.misc.imresize instead of tf.image.resize_images, and I wonder what the best way is to plug the scipy resize algorithm into my graph.
Let's consider the following layer:
def up_sample(input_tensor, new_height, new_width):
    _up_sampled = tf.image.resize_images(input_tensor, [new_height, new_width])
    _conv = tf.layers.conv2d(_up_sampled, 32, [3, 3], padding="SAME")
    return _conv
How can I use the scipy algorithm in this layer?
Edit:
An example could be this function:
import numpy as np
import tensorflow as tf
from scipy import misc

input_tensor = tf.placeholder("float32", [10, 200, 200, 8])
output_shape = [32, 210, 210, 8]

def up_sample(input_tensor, output_shape):
    new_array = np.zeros(output_shape)
    for batch in range(input_tensor.shape[0]):
        for channel in range(input_tensor.shape[-1]):
            new_array[batch, :, :, channel] = misc.imresize(input_tensor[batch, :, :, channel], output_shape[1:3])
    return new_array
But obviously scipy raises a ValueError because the tf.Tensor object does not have the right shape. I read that during a tf.Session the tensors are accessible as numpy arrays. How can I use the scipy function only during a session and skip its execution when the graph (the protocol buffer) is created?
And is there a faster way than looping over all batches and channels?

Generally speaking, the tools you need are a combination of tf.map_fn and tf.py_func.
tf.py_func allows you to wrap a standard python function into a tensorflow op that is inserted into your graph.
tf.map_fn allows you to call a function repeatedly on the batch samples, when the function cannot operate on the whole batch, as is often the case with image functions.
In the present case, I would probably advise using scipy.ndimage.zoom, on the basis that it can operate directly on the 4D tensor, which makes things simpler. On the other hand, it takes zoom factors as input rather than target sizes, so we need to compute them.
import tensorflow as tf
sess = tf.InteractiveSession()

# unimportant -- just a way to get an input tensor
batch_size = 13
im_size = 7
num_channel = 5
x = tf.eye(im_size)[None, ..., None] + tf.zeros((batch_size, 1, 1, num_channel))

new_size = 17

from scipy import ndimage
new_x = tf.py_func(
    lambda a: ndimage.zoom(a, (1, new_size / im_size, new_size / im_size, 1)),
    [x], [tf.float32], stateful=False)[0]

print(new_x.eval().shape)
# (13, 17, 17, 5)
You could use other functions (e.g. OpenCV's cv2.resize, Scikit-image's transform.resize, Scipy's misc.imresize), but none of them can operate directly on 4D tensors, so they are more verbose to use. You may still want to use one of them if you need an interpolation other than zoom's spline-based interpolation.
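For example, here is a hedged sketch of that more verbose route, wrapping OpenCV's cv2.resize (assuming OpenCV is installed) with tf.py_func and mapping it over the batch with tf.map_fn. It reuses x, new_size, batch_size, num_channel and the session from the snippet above, and the output shape has to be set by hand because py_func cannot infer it:

import cv2
import numpy as np
import tensorflow as tf

def resize_one(image):
    # image arrives here as a single [H, W, C] float32 numpy array
    resized = cv2.resize(image, (new_size, new_size), interpolation=cv2.INTER_LINEAR)
    return resized.astype(np.float32)

new_x_cv = tf.map_fn(
    lambda img: tf.py_func(resize_one, [img], tf.float32, stateful=False),
    x, dtype=tf.float32)
new_x_cv.set_shape((batch_size, new_size, new_size, num_channel))

print(new_x_cv.eval().shape)
# (13, 17, 17, 5)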
However, be aware of the following things:
Python functions are executed on the host. So, if you are executing your graph on a device like a graphics card, it needs to stop, copy the tensor to host memory, call your function, and then copy the result back to the device. This can completely ruin your computation time if the memory transfers are significant.
Gradients do not pass through python functions. If your node is used, say, in an upscaling part of a network, layers upstream will not receive any gradient (or only part of it, if you have skip connections), which would compromise your training.
For those two reasons, I would advise applying this kind of resampling to inputs only, i.e. during preprocessing on the CPU, where gradients are not needed.
If you do want to use this upscaling node for training on the device, then I see no alternative but to either stick with the buggy tf.image.resize_images or write your own (see the sketch below).
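For what it's worth, here is a minimal sketch of what "writing your own" could look like: a bilinear resize built only from differentiable TensorFlow ops (align_corners-style coordinate mapping), assuming TF >= 1.3 for the axis argument of tf.gather and Python-int target sizes. It is an illustration, not a drop-in replacement for tf.image.resize_images:

import tensorflow as tf

def bilinear_resize(images, new_h, new_w):
    # images: a [B, H, W, C] float tensor; new_h, new_w: Python ints
    shape = tf.shape(images)
    h = tf.cast(shape[1], tf.float32)
    w = tf.cast(shape[2], tf.float32)
    # map target pixel indices back to source coordinates (align_corners style)
    ys = tf.linspace(0.0, h - 1.0, new_h)
    xs = tf.linspace(0.0, w - 1.0, new_w)
    y0, x0 = tf.floor(ys), tf.floor(xs)
    y1 = tf.minimum(y0 + 1.0, h - 1.0)
    x1 = tf.minimum(x0 + 1.0, w - 1.0)
    wy, wx = ys - y0, xs - x0
    y0, y1 = tf.cast(y0, tf.int32), tf.cast(y1, tf.int32)
    x0, x1 = tf.cast(x0, tf.int32), tf.cast(x1, tf.int32)
    # interpolate along the height axis, then along the width axis
    top = tf.gather(images, y0, axis=1)
    bottom = tf.gather(images, y1, axis=1)
    rows = top * (1.0 - wy)[None, :, None, None] + bottom * wy[None, :, None, None]
    left = tf.gather(rows, x0, axis=2)
    right = tf.gather(rows, x1, axis=2)
    return left * (1.0 - wx)[None, None, :, None] + right * wx[None, None, :, None]

# quick check, reusing x and the session from the example above:
print(bilinear_resize(x, 17, 17).eval().shape)
# (13, 17, 17, 5)

Since gradients flow through tf.gather and the arithmetic ops, a node built this way can sit inside the trained part of the network.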

Related

Difference between convolution2d and conv2d in TensorFlow in terms of usage

In TensorFlow for 2D convolution we have:
tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None,
             data_format=None, name=None)
and
tf.contrib.layers.convolution2d(*args, **kwargs)
I am not sure about the differences. I know that I should use the first one if I want to use a special filter, right? But what else? Especially, what about the outputs?
Thank you
tf.nn.conv2d(...) is the core, low-level convolution functionality provided by TensorFlow. tf.contrib.layers.conv2d(...) is part of a higher-level API built around core TensorFlow.
Note that in current TensorFlow versions, parts of layers are now in core too, e.g. tf.layers.conv2d.
The difference is simply that tf.nn.conv2d is an op that does convolution and nothing else. tf.layers.conv2d does more: for example, it also creates the variables for the kernel and the biases, among other things.
Check out the TensorFlow tutorial on CNNs which uses TensorFlow core (here). With the low-level API, the convolutional layers are created like this:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
Compare that to the TF Layers tutorial for CNNs (here). With TF Layers, convolutional layers are created like this:
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=32,
    kernel_size=[5, 5],
    padding="same",
    activation=tf.nn.relu)
Without knowing your use case: Most likely you want to use tf.layers.conv2d.
There will be no difference between tf.keras.layers.Conv2D and tf.keras.layers.Convolution2D in TensorFlow 2.x.
Here's the link for the illustration.
In TensorFlow 2.x, Keras is an API within TensorFlow.
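A quick way to verify this yourself, assuming TensorFlow 2.x is installed (Convolution2D is expected to be just an alias for Conv2D):

import tensorflow as tf

print(tf.__version__)
print(tf.keras.layers.Convolution2D is tf.keras.layers.Conv2D)  # expected: True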

Poor performance when using TensorFlow input functions in a loop

I am currently training on a set of around 10,000 images for ten epochs. My question is about the following code:
file_contents = cv2.imread(shuffle_image_list[i], 3)
resized_image = cv2.resize(file_contents, (72, 72), interpolation=cv2.INTER_AREA)
data = np.array(resized_image)
flattened = data.flatten()
# image_batch, label_batch = tf.train.batch([resized_image, shuffle_label_list[i]], batch_size=batch_size) # does train.batch take individual images or final tensors
# if (i > batch_size):
#     print(label_batch.eval())
print(str(i))
imageArr.append(flattened)
labelArr.append(int(shuffle_label_list[i]))

if i % 100 == 0:
    print("....... " + str(i))
    _, c = sess.run([optimizer, cost], feed_dict={x: imageArr, y: labelArr})
    epoch_loss += c
    imageArr = []
    labelArr = []
Here, I am feeding images to the neural network in mini-batches of 100. The code first reads the file contents and then decodes them into a JPEG image. However, when I use TensorFlow's functions for this, such as tf.image.decode_jpeg, tf.reshape(), etc., there is a big difference in speed: with TensorFlow's functions doing the same image-decoding work, training starts out fast and then becomes incredibly slow after the first epoch.
To put this in perspective: using OpenCV, I could train this entire model for 10 epochs in around 1 hour and 30 minutes. Using TensorFlow's functions, it took over 12 hours to get past the first epoch, and midway through the second epoch I stopped the training after seeing the progress it was making.
I am not sure if this has anything to do with the concept of network slowdown, as seen here. I am simply replacing OpenCV functions with TensorFlow functions to read a file and decode an image. Why is there such a dramatic speed difference between OpenCV and TensorFlow? Why exactly do TensorFlow's functions slow down as the code progresses? Will this have any effect on the accuracy of the model?
Note: the only functions I changed were at the top. I didn't use tf.train.batch in either version. The only change was replacing the lines of code from file_contents down to data.flatten() with the corresponding TensorFlow functions.
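For reference, a hypothetical sketch of what that TensorFlow replacement block might look like (the question does not include the exact code, so the function name and details below are assumptions):

import tensorflow as tf

def load_image_ops(path):
    # Hypothetical reconstruction: read the file, decode the JPEG, resize to
    # 72x72 and flatten, all as graph ops; each call adds new ops to the
    # default graph.
    file_contents = tf.read_file(path)
    decoded = tf.image.decode_jpeg(file_contents, channels=3)
    resized = tf.image.resize_images(decoded, [72, 72])
    return tf.reshape(resized, [-1])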

Is there a way to parallelize stacked RNNs over multiple GPUs in TensorFlow?

Is it possible to take the output of a tf.scan operation and stream it directly to a different GPU, effectively running two stacked RNNs on two GPUs in parallel? Something like this:
cell1 = tf.nn.rnn_cell.MultiRNNCell(..)
cell2 = tf.nn.rnn_cell.MultiRNNCell(..)

with tf.device("/gpu:0"):
    ys1 = tf.scan(lambda a, x: cell1(x, a[1]), inputs,
                  initializer=(tf.zeros([batch_size, state_size]), init_state))

with tf.device("/gpu:1"):
    ys2 = tf.scan(lambda a, x: cell2(x, a[1]), ys1,
                  initializer=(tf.zeros([batch_size, state_size]), init_state))
Will TensorFlow automatically take care of that optimization, or will it block the graph flow until the list ys1 is finalized?
Unfortunately, tf.scan has a "boundary" at the output: all iterations have to complete before the output tensor can be read by the next operations. However, you can run the different levels of your LSTM stack on different GPUs and still get frame parallelism within a single scan. Write your own version of MultiRNNCell that uses a separate device for each LSTM layer (see the sketch below).
Also, you probably want to use tf.nn.dynamic_rnn instead of scan.
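As a rough illustration of that suggestion, here is a hedged sketch against the TF 1.x tf.nn.rnn_cell API, not a tested implementation (TF 1.x also shipped tf.contrib.rnn.DeviceWrapper, which pins a single cell to a device and can be combined with MultiRNNCell for a similar effect):

import tensorflow as tf

class DeviceMultiRNNCell(tf.nn.rnn_cell.RNNCell):
    """Stacks RNN cells like MultiRNNCell, but pins each layer to its own device."""

    def __init__(self, cells, devices):
        super(DeviceMultiRNNCell, self).__init__()
        self._cells = cells
        self._devices = devices

    @property
    def state_size(self):
        return tuple(cell.state_size for cell in self._cells)

    @property
    def output_size(self):
        return self._cells[-1].output_size

    def call(self, inputs, state):
        current = inputs
        new_states = []
        for i, (cell, device) in enumerate(zip(self._cells, self._devices)):
            with tf.variable_scope("cell_%d" % i), tf.device(device):
                current, new_state = cell(current, state[i])
                new_states.append(new_state)
        return current, tuple(new_states)

# usage, e.g. with tf.nn.dynamic_rnn:
# cell = DeviceMultiRNNCell(
#     [tf.nn.rnn_cell.LSTMCell(128), tf.nn.rnn_cell.LSTMCell(128)],
#     ["/gpu:0", "/gpu:1"])
# outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)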

Safety of sharing a read-only scipy sparse matrix between multiple processes

I have a computation I must do which is somewhat expensive and I want to spawn multiple processes to complete it. The gist is more or less this:
1) I have a big scipy.sparse.csc_matrix (I could use another sparse format if needed) from which I'm going to read (only read, never write) data for the calculation.
2) I must do lots of embarrassingly parallel calculations and return values.
So I did something like this:
import numpy as np
from multiprocessing import Process, Manager

def f(instance, big_matrix):
    """
    This is the actual thing I want to calculate. This reads lots of
    data from big_matrix but never writes anything to it.
    """
    return stuff_calculated

def do_some_work(big_matrix, instances, outputs):
    """
    This does a few chunked calculations for a few instances and
    saves the result in `outputs`, which is a memory-shared dictionary.
    """
    for instance in instances:
        x = f(instance, big_matrix)
        outputs[instance] = x

def split_work(big_matrix, instances_to_calculate):
    """
    Split do_some_work into many processes by chunking instances_to_calculate,
    creating a shared dictionary and spawning and joining the processes.
    """
    # break the instance list into 4 chunks to pass to each process
    instance_sets = np.array_split(instances_to_calculate, 4)

    manager = Manager()
    outputs = manager.dict()

    processes = [
        Process(target=do_some_work, args=(big_matrix, instances, outputs))
        for instances in instance_sets
    ]

    for p in processes:
        p.start()
    for p in processes:
        p.join()

    return instance_sets, outputs
My question is: is this safe? My function f never writes anything, but I'm not taking any precautions to share big_matrix between the processes, just passing it as it is. It seems to be working, but I'm concerned whether I can corrupt anything just by passing the value between multiple processes, even though I never write to it.
I tried to use the sharemem package to share the matrix between multiple processes but it seems to be unable to hold scipy sparse matrices, only normal numpy arrays.
If this isn't safe, how can I share (read only) big sparse matrices between processes without problems?
I've seen here that I can make another csc_matrix pointing to the same memory with:
other_matrix = csc_matrix(
    (bit_matrix.data, bit_matrix.indices, bit_matrix.indptr),
    shape=bit_matrix.shape,
    copy=False
)
Will this make it safer, or would it be just as safe as passing the original object?
Thanks.
As explained here, it seems your first option creates one copy of the sparse matrix per process. This is safe, but it isn't ideal from a performance point of view. However, depending on the computation you perform on the sparse matrix, the overhead may not be significant.
I suspect a cleaner option using the multiprocessing lib would be to create three shared arrays (depending on the matrix format you use) and populate them with the values (data), row indices (indices) and column pointers (indptr) of your CSC matrix. The documentation for multiprocessing shows how this can be done using an Array, or using the Manager and one of the supported types.
Afterwards I don't see how you could run into trouble using read-only operations, and it may be more efficient (see the sketch below).
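A minimal sketch of that idea, with hypothetical helper names, assuming a reasonably recent NumPy (for np.ctypeslib.as_ctypes_type) and the standard multiprocessing module: the three CSC component arrays are copied into shared RawArrays once, and each worker rebuilds a read-only csc_matrix around the shared buffers without copying the data.

import numpy as np
from multiprocessing import Process, RawArray
from scipy.sparse import csc_matrix

def to_shared(arr):
    """Copy a numpy array into a shared RawArray (no lock, read-only use)."""
    shared = RawArray(np.ctypeslib.as_ctypes_type(arr.dtype), arr.size)
    np.frombuffer(shared, dtype=arr.dtype)[:] = arr
    return shared, arr.dtype

def worker(data_info, indices_info, indptr_info, shape):
    # Rebuild the matrix around the shared buffers; nothing is copied here.
    data = np.frombuffer(data_info[0], dtype=data_info[1])
    indices = np.frombuffer(indices_info[0], dtype=indices_info[1])
    indptr = np.frombuffer(indptr_info[0], dtype=indptr_info[1])
    m = csc_matrix((data, indices, indptr), shape=shape, copy=False)
    print(m.sum())  # read-only work goes here

if __name__ == "__main__":
    big = csc_matrix(np.eye(4))
    infos = [to_shared(a) for a in (big.data, big.indices, big.indptr)]
    p = Process(target=worker, args=(infos[0], infos[1], infos[2], big.shape))
    p.start()
    p.join()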

SVM training performance

I'm using SVMLib to train a simple SVM over the MNIST dataset. It contains 60,000 training samples. However, I have several performance issues: the training seems to be endless (after a few hours I had to shut it down by hand, because it wasn't responding). My code is very simple; I just call ovrtrain on the dataset without specifying any kernel or special constants:
function features = readFeatures(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 4, "int32", 0, "ieee-be");
  if header(1) ~= 2051
    fprintf("Wrong magic number!");
  end
  M = header(2);
  rows = header(3);
  columns = header(4);
  features = fread(fid, [M, rows*columns], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

function labels = readLabels(fileName)
  [fid, msg] = fopen(fileName, 'r', 'ieee-be');
  header = fread(fid, 2, "int32", 0, "ieee-be");
  if header(1) ~= 2049
    fprintf("Wrong magic number!");
  end
  M = header(2);
  labels = fread(fid, [M, 1], "uint8", 0, "ieee-be");
  fclose(fid);
  return;
endfunction

labels = readLabels("train-labels.idx1-ubyte");
features = readFeatures("train-images.idx3-ubyte");
model = ovrtrain(labels, features, "-t 0"); % doesn't respond...
My question: is this normal? I'm running it on Ubuntu in a virtual machine. Should I wait longer?
I don't know whether you got your answer already, but let me tell you what I suspect about your situation. 60,000 examples is not a lot for a powerful trainer like LIBSVM. Currently I am working on a training set of 6,000 examples, and it takes 3 to 5 seconds to train. However, parameter selection is important, and that is probably what is taking so long. If the number of unique features in your data set is too high, then any given example will have lots of zero values for the features it does not contain. If the tool applies data scaling to your training set, most of those zero feature values will most likely be scaled to some non-zero value, leaving you with an astronomical number of unique, non-zero-valued features for each and every example. That makes it very hard for an SVM tool to dig in and extract efficient parameter values.
Long story short: if you have done enough research on SVM tools and understand what I mean, either assign the parameter values in the training command before executing it, or find a way to decrease the number of unique features. If you haven't, go ahead and download the latest version of LIBSVM and read the README files as well as the FAQ on the tool's website.
If none of these is the case, then sorry for taking your time :) Good luck.
It might be an issue of convergence given the characteristics of your data.
Check which kernel is selected by default and change it. Also check the package's stopping criterion. Additionally, if you are looking for a faster implementation, check out MSVMpack, which is a parallel implementation of SVM.
Finally, feature selection is desirable in your case. You can end up with a good feature subset of almost half of what you have. In addition, you only need a portion of the data for training, e.g. 60-70% is sufficient.
First of all, 60k is a huge amount of data for training. Training that much data with a linear kernel will take an awfully long time unless you have a supercomputer. Also, you have selected a linear kernel function of degree 1. It's better to use a Gaussian or higher-degree polynomial kernel (a polynomial of degree 4 used with the same dataset showed good training accuracy). Try adding the LIBSVM options for -c (cost), -m (memory cache size) and -e (epsilon, the tolerance of the termination criterion, default 0.001). First run 1,000 samples with a Gaussian or degree-4 polynomial kernel and compare the accuracy.
