In TensorFlow, what is the difference between Session.partial_run and Session.run?

I always thought that Session.run required all placeholders in the graph to be fed, while Session.partial_run only the ones specified through Session.partial_run_setup, but looking further that is not the case.
So how exactly do the two methods differ? What are the advantages/disadvantages of using one over the other?

With tf.Session.run, you usually give some inputs and the outputs you want, and TensorFlow runs the operations in the graph needed to compute and return those outputs. If you later want some other output, even with the same input, you have to run all the necessary operations in the graph again, even if some intermediate results are the same as in the previous call. For example, consider something like this:
import tensorflow as tf
input_ = tf.placeholder(tf.float32)
result1 = some_expensive_operation(input_)
result2 = another_expensive_operation(result1)
with tf.Session() as sess:
    x = ...
    sess.run(result1, feed_dict={input_: x})
    sess.run(result2, feed_dict={input_: x})
Computing result2 requires running the operations from both some_expensive_operation and another_expensive_operation, even though most of that computation was already done when result1 was calculated. tf.Session.partial_run allows you to evaluate part of a graph, leave that evaluation "on hold" and complete it later. For example:
import tensorflow as tf
input_ = tf.placeholder(tf.float32)
result1 = some_expensive_operation(input_)
result2 = another_expensive_operation(result1)
with tf.Session() as sess:
    x = ...
    h = sess.partial_run_setup([result1, result2], [input_])
    sess.partial_run(h, result1, feed_dict={input_: x})
    sess.partial_run(h, result2)
Unlike before, here the operations from some_expensive_operation will only be run once in total, because the computation of result2 is just a continuation of the computation of result1.
This can be useful in several contexts, for example if you want to split the computational cost of a run into several steps, but also if you need to do some mid-evaluation checks out of TensorFlow, such as computing an input to the second half of the graph that depends on an output of the first half, or deciding whether or not to complete an evaluation depending on an intermediate result (these may also be implemented within TensorFlow, but there may be cases where you do not want that).
Note too that it is not only a matter of avoiding repeated computation. Many operations have a state that changes on each evaluation, so the result of two separate evaluations and of one evaluation divided into two partial ones may actually be different. This is the case with random operations, where you get a new value on every run, and with other stateful objects like iterators. Variables are also obviously stateful, so operations that change variables (like tf.assign or optimizers) will not produce the same results when they are run once as when they are run twice.
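To make the last two points concrete, here is a small plain-Python analogy. It is not TensorFlow code: ToyGraph, two_full_runs and one_split_run are made-up names, and the random draw merely stands in for a stateful operation. It only illustrates why two full runs repeat the expensive step (and can disagree with each other), while a single split evaluation does not.

```python
import random

class ToyGraph:
    """Plain-Python stand-in for a graph; an analogy, not TensorFlow."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.expensive_calls = 0

    def result1(self, x):
        # Stateful "expensive" step: draws a fresh random number every time.
        self.expensive_calls += 1
        return x + self.rng.random()

    def result2(self, r1):
        return 2 * r1

def two_full_runs(graph, x):
    # Like two Session.run calls: the second recomputes result1 from scratch.
    r1 = graph.result1(x)
    r2 = graph.result2(graph.result1(x))
    return r1, r2

def one_split_run(graph, x):
    # Like partial_run: result2 continues from the result1 already computed.
    r1 = graph.result1(x)
    r2 = graph.result2(r1)
    return r1, r2

g_full, g_split = ToyGraph(), ToyGraph()
full_r1, full_r2 = two_full_runs(g_full, 1.0)
split_r1, split_r2 = one_split_run(g_split, 1.0)
print(g_full.expensive_calls, g_split.expensive_calls)
```

In the split version result2 is exactly twice the result1 that was returned; in the two-run version it is not, because the stateful step ran a second time with a different random draw.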
In any case, note that, as of v1.12.0, partial_run is still an experimental feature and is subject to change.

Related

Can we do a parallel operation for Quantum Monte Carlo method in julia?

This is my main code of parallel operation:
using Distributed
using SharedArrays
nprocs()
addprocs(7)
Now, I need to store a variable about time:
variable = SharedArray{ComplexF64, 3}(Dim, steps, paths)
Note that "steps" and "paths" denote the time series and the total number of trajectories, respectively. However, if I define this variable, I run into an out-of-memory problem because Dim=10000, steps=600, and paths=1000, even though I can use multiple cores to run in parallel. The parallel code can be written as
@sync @distributed for path=1:paths
    ...
    variable[:,:,path] = matrix_var
end
Actually, this variable is not my final result, and the result is
final_var = sum(variable, dims=3)
, which represents the summation over all trajectories.
Thus, I want to deal with the out-of-memory problem and still run in parallel. If I drop the "paths" dimension when I define this variable, the out-of-memory problem vanishes, but the parallel operation becomes invalid. I hope that there is a solution to overcome this.
It seems that for each value of path you should create the variable locally rather than in one huge array. Your code might look more or less like this:
final_vars = @distributed (append!) for path=1:paths
    # create local variable for a single step
    locvariable = Array{ComplexF64, 2}(undef, Dim, steps)
    # at any time locvariable is at most in nprocs() copies
    # load data to locvariable specific to path and do your job
    final_var = sum(locvariable, dims=2)
    [final_var] # in this way you will get a vector of arrays
end
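The memory argument can be sketched independently of Julia. The Python example below is only an illustration with tiny made-up sizes: simulate_path, add_into and partial_sum are hypothetical helpers, and the sum is taken over paths, as the question asks. Each worker keeps one running sum plus one trajectory at a time, so the full Dim x steps x paths array never exists. A reducing loop such as Julia's @distributed (+) is in the same spirit.

```python
from concurrent.futures import ThreadPoolExecutor

DIM, STEPS, PATHS, WORKERS = 4, 3, 10, 3  # tiny illustrative sizes

def simulate_path(path):
    # Hypothetical stand-in for one trajectory: a DIM x STEPS matrix.
    return [[complex(path, d + s) for s in range(STEPS)] for d in range(DIM)]

def add_into(acc, mat):
    # Accumulate mat into acc element-wise, in place.
    for d in range(DIM):
        for s in range(STEPS):
            acc[d][s] += mat[d][s]

def partial_sum(paths):
    # A worker holds only its running sum and the current trajectory.
    acc = [[0j] * STEPS for _ in range(DIM)]
    for path in paths:
        add_into(acc, simulate_path(path))
    return acc

# Split the paths 1..PATHS into one interleaved chunk per worker.
chunks = [range(w + 1, PATHS + 1, WORKERS) for w in range(WORKERS)]
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Final reduction: only WORKERS matrices ever need to be combined.
final_var = [[0j] * STEPS for _ in range(DIM)]
for part in partials:
    add_into(final_var, part)
```

Peak memory here scales with the number of workers rather than with the number of paths. Threads are used only to keep the sketch short; in standard CPython a process pool would be needed for a real CPU-bound speed-up.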

Why is my Doc2Vec model in gensim not reproducible?

I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os
os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0,500]:
    documents = [TaggedDocument(doc, [i + a])
                 for i, doc in enumerate(common_texts)]
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
                    workers=1, epochs=10, dm=0, seed=0)
    reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))]))
reps[0].sum() == reps[1].sum()
This last line returns False. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.
First & foremost, your test isn't even comparing vectors corresponding to the same texts!
In run #1, the vector for the 1st text is in model.docvecs[0]. In run #2, the vector for the 1st text is in model.docvecs[1].
And, in run #2, the vector at model.docvecs[0] is just a randomly-initialized, but never-trained, vector - because none of the training texts had a document tag of (int) 0. (If using pure ints as the doc-tags, Doc2Vec uses them as literal indexes - potentially leaving any unused slots less than your highest tag allocated-and-initialized, but never-trained.)
Since common_texts only has 11 entries, by the time you reach run #12, all the vectors in your reps array of the first 11 vectors are garbage, uncorrelated with any of your texts.
However, even after correcting that:
As explained in the Gensim FAQ answer #11, determinism in this algorithm shouldn't generally be expected, given many sources of potential randomness, and the fuzzy/approximate nature of the whole approach. If you're relying on it, or testing for it, you're probably making some unwarranted assumptions.
In general, tests of these algorithms should be evaluating "roughly equivalent usefulness in comparative uses" rather than "identical (or even similar) specific vectors". For example, a test of whether apple and orange are roughly at the same positions in each other's nearest-neighbor rankings makes more sense than checking their (somewhat arbitrary) exact vector positions or even cosine-similarity.
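A minimal sketch of such a ranking-based check, in plain Python with made-up 2-D "embeddings" (the words, the numbers and the helper names are all illustrative, not gensim output). The second "run" is the first one rotated, so every coordinate differs while the neighbor structure is unchanged:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def neighbor_ranking(vectors, key):
    # Other keys, ordered from most to least similar to `key`.
    others = [k for k in vectors if k != key]
    return sorted(others, key=lambda k: cosine(vectors[key], vectors[k]),
                  reverse=True)

def rotate(v, angle):
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Made-up vectors from a hypothetical first training run.
run_a = {"apple": (1.0, 0.1), "orange": (0.9, 0.3),
         "car": (-0.2, 1.0), "truck": (-0.4, 0.9)}
# A second run that learned the same geometry in a rotated frame.
run_b = {word: rotate(vec, 1.0) for word, vec in run_a.items()}

same_vectors = all(run_a[w] == run_b[w] for w in run_a)
same_rankings = all(neighbor_ranking(run_a, w) == neighbor_ranking(run_b, w)
                    for w in run_a)
print(same_vectors, same_rankings)
```

A test on exact vectors would fail here, while a test on rankings passes. With real models the rankings will usually match only approximately, so a practical check would probably compare top-k overlap rather than require identical lists.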
Additionally:
tiny toy datasets like common_texts won't show the algorithm's usual behavior/benefits
PYTHONHASHSEED is only consulted by the Python interpreter at startup; setting it from Python can't have any effect. But also, the kind of indeterminism it introduces only comes up with separate interpreter launches: a tight loop within a single interpreter run like this wouldn't be affected by that in any case.
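The PYTHONHASHSEED point can be checked with the standard library alone. In this sketch (string_hash_in_new_interpreter is a hypothetical helper name) the variable is set once inside a running interpreter, where it changes nothing, and once in the environment of freshly launched interpreters, where it makes string hashes repeatable:

```python
import os
import subprocess
import sys

def string_hash_in_new_interpreter(seed):
    # Launch a fresh interpreter with PYTHONHASHSEED set before startup.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('gensim'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

before = hash("gensim")
os.environ["PYTHONHASHSEED"] = "0"  # too late: this interpreter has started
after = hash("gensim")

first_launch = string_hash_in_new_interpreter(0)
second_launch = string_hash_in_new_interpreter(0)
print(before == after, first_launch == second_launch)
```

The in-process assignment leaves the hash untouched, and the two separate launches with the same seed agree with each other.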
Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
for the aggregate difference gives -1.2598932e-05 when I run it.
Comparison dimension-wise:
eps = 10**-4
diff = reps[0] - reps[1]
over = (np.abs(diff) <= eps).all()
over is True on the vast majority of runs, which means that you are getting quite reproducible results given the complexity of the calculations.
I would blame the numerical stability of the calculations or uncontrolled randomness. Even though you do try to control the random seed, NumPy and the standard-library random module each have their own seed, so you are not controlling all of the sources of randomness. This can also influence the results, but I did not check the actual implementation in gensim and its dependencies.
Change
import os
os.environ['PYTHONHASHSEED'] = '0'
to
import os
import sys
hashseed = os.getenv('PYTHONHASHSEED')
if not hashseed:
    os.environ['PYTHONHASHSEED'] = '0'
    os.execv(sys.executable, [sys.executable] + sys.argv)

Speed of L2 Regularization on Pytorch

I'm trying to manually implement L2 regularisation and a couple of its variations in a neural network. What I'm doing is the following:
l2_reg = 0.0
for name, param in model.named_parameters():
    if 'weight' in name:
        l2_reg += torch.sum(param**2)
loss = cross_entropy(outputs, labels) + 0.0001*l2_reg
Is this equivalent to adding 'weight_decay = 0.0001' inside my optimizer? i.e.:
torch.optim.SGD(model.parameters(), lr=learning_rate , momentum=0.9, weight_decay = 0.0001)
My problem is that I thought they were equivalent, but the manual procedure is about 100x slower than adding 'weight_decay = 0.0001'. Why is that? How can I fix it?
Note that I need to also implement my own variation of L2 regularization, so just adding 'weight_decay = 0.0001' won't help.
You can check the PyTorch implementation of SGD to get some tips and base your code on it.
There are a few things going on there that should speed up your custom regularization.
Below is a cleaned version (a little pseudo-code, refer to original) of the parts we are interested in:
for p in group['params']:
    if p.grad is None:
        continue
    d_p = p.grad.data
    if weight_decay != 0:
        d_p.add_(weight_decay, p.data)
    p.data.add_(-group['lr'], d_p)
return loss
By the way, it seems your implementation is mathematically sound (correct me if I missed anything) and equivalent to PyTorch's, but it will indeed be slow.
Modify only gradient
Notice that you perform regularization explicitly during the forward pass. This takes a lot of time, more or less because you:
take the parameters and iterate over them
raise each of them to the power of 2
sum all of them
add the result to a variable containing all previous parameters (all this while creating the graph dynamically and creating new nodes).
PyTorch, by contrast, only touches the backward pass, as that is all that is needed. This is pretty handy because:
parameters have to be loaded and iterated over once anyway during corrections performed by optimizer (in your case they are taken out twice)
no power of 2 because gradient of w**2 is simply 2*w (2 is further left out and L2 is often expressed as 1/2 * w **2 to make it simpler and a little faster)
no accumulation and creation of additional graph nodes
Essentially, this line:
d_p.add_(weight_decay, p.data)
modifies the gradient by adding p.data (the weight) multiplied by weight_decay, all done in place (notice d_p.add_), which is all you have to do to perform L2 regularization.
Finally this line:
p.data.add_(-group['lr'], d_p)
Updates weights with gradient (modified by weight decay) using standard SGD formula (once again, in-place to be as fast as possible, at least on Python level).
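The factor of 2 mentioned above is easy to miss, so here is a scalar sanity check in plain Python. It is only a sketch under stated assumptions: plain SGD without momentum, a made-up data loss (w - 3)**2, and illustrative function names. It compares one step with an explicit penalty lam * w**2 in the loss against one step that adds weight_decay * w to the gradient.

```python
def sgd_step_explicit_penalty(w, grad_loss, lr, lam):
    # Penalty lam * w**2 added to the loss; its derivative is 2 * lam * w.
    return w - lr * (grad_loss + 2 * lam * w)

def sgd_step_weight_decay(w, grad_loss, lr, weight_decay):
    # Optimizer-style decay: add weight_decay * w to the gradient directly.
    return w - lr * (grad_loss + weight_decay * w)

w, lr, lam = 1.5, 0.1, 0.0001
grad_loss = 2 * (w - 3.0)  # gradient of the made-up data loss (w - 3)**2

explicit = sgd_step_explicit_penalty(w, grad_loss, lr, lam)
decay_same_coeff = sgd_step_weight_decay(w, grad_loss, lr, lam)
decay_doubled = sgd_step_weight_decay(w, grad_loss, lr, 2 * lam)
print(explicit, decay_same_coeff, decay_doubled)
```

Under these assumptions, a penalty of 0.0001 * sum(w**2) in the loss matches weight_decay = 0.0002 rather than 0.0001. The two coincide with the same coefficient only if the penalty is written as (weight_decay / 2) * sum(w**2).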
Your own implementation
I would advise you to follow similar logic for your own regularization if you want to make it faster.
You can copy the PyTorch implementation of SGD and change only this one relevant line. This would also give you the functionality of a PyTorch optimizer in case you need it in your experiments.
For L1 regularization (|w| instead of w**2) you would have to calculate the derivative of it (which is 1 for positive case, -1 for negative and undefined for 0 (we can't have that so it should be zero)).
With that in mind we can write the weight_decay like this:
if weight_decay != 0:
    d_p.add_(weight_decay, torch.sign(p.data))
torch.sign returns 1 for positive values and -1 for negative and 0 for... yeah, 0.
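The same update can be sketched without PyTorch to see what it does to individual weights. This is a plain-Python illustration on a list of scalars; sign and sgd_step_l1 are made-up helper names, and the data gradient is set to zero so that only the penalty acts.

```python
def sign(x):
    # 1 for positive, -1 for negative, 0 at exactly zero.
    return (x > 0) - (x < 0)

def sgd_step_l1(weights, grads, lr, weight_decay):
    # Add weight_decay * sign(w) to each gradient, then take a plain SGD step.
    return [w - lr * (g + weight_decay * sign(w))
            for w, g in zip(weights, grads)]

weights = [0.5, -0.25, 0.0]
grads = [0.0, 0.0, 0.0]  # zero data gradient isolates the penalty's effect
updated = sgd_step_l1(weights, grads, lr=0.1, weight_decay=0.5)
print(updated)
```

Each non-zero weight moves toward zero by the same fixed amount (lr * weight_decay) regardless of its size, and a weight that is exactly zero stays put. This is the usual intuition for why L1 tends to produce sparse weights.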
Hope this helps, exact implementation is left for you (hit me up in the comments in case you have any questions or troubles).

Data management in a parallel for-loop in Julia

I'm trying to do some statistical analysis using Julia. The code consists of the files script.jl (e.g. initialisation of the data) and algorithm.jl.
The number of simulations is large (at least 100,000) so it makes sense to use parallel processing.
The code below is just some pseudocode to illustrate my question —
function script(simulations::Int64)
    # initialise input data
    ...
    # initialise other variables for statistical analysis using zeros()
    ...
    require("algorithm.jl")
    @parallel for z = 1:simulations
        while true
            choices = algorithm(data);
            if length(choices) == 0
                break
            else
                # process choices and pick one (which alters the data)
                ...
            end
        end
    end
    # display results of statistical analysis
    ...
end
and
function algorithm(data)
    # actual algorithm
    ...
    return choices;
end
As an example, I would like to know how many choices there are on average, what the most common choice is, and so on. For this purpose I need to save some data from choices (in the for-loop) to the statistical analysis variables (initialised before the for-loop) and display the results (after the for-loop).
I've read about using @spawn and fetch() and functions like pmap() but I'm not sure how I should proceed. Just using the variables inside the for-loop does not work as each proc gets its own copy, so the values of the statistical analysis variables after the for-loop will just be zeros.
[Edit] In Julia I use include("script.jl") and script(100000) to run the simulations, there are no issues when using a single proc. However, when using multiple procs (e.g. using addprocs(3)) all statistical variables are zeros after the for-loop — which is to be expected.
It seems that you want to parallelize an inherently serial operation, because each step depends on the result of the previous one (in this case data).
I think that if you could implement the above code like this:
@parallel (dosumethingwithdata) for z = 1:simulations
    while true
        choices = algorithm(data,z);
        if length(choices) == 0
            break
        else
            # process choices and pick one (which alters the data)
            ...
        end
    end
    data  # last expression of the loop body: the value handed to the reducer
end
then you may find a parallel solution for the problem.
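The underlying pattern is "return the statistics from each simulation and merge them afterwards", instead of writing into variables created before the loop. A minimal Python sketch of it follows. Everything in it is hypothetical: algorithm is a stand-in that offers the remaining items as choices, and the rule that picking a value removes it and all larger ones is invented just to make the data change.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def algorithm(data):
    # Hypothetical stand-in: the remaining items are the available choices.
    return sorted(data)

def run_simulation(z):
    # Each simulation owns a private copy of the data and its own RNG.
    rng = random.Random(z)
    data = set(range(1, 6))
    picks, rounds = Counter(), 0
    while True:
        choices = algorithm(data)
        if len(choices) == 0:
            break
        pick = rng.choice(choices)
        picks[pick] += 1
        # Picking a choice alters this simulation's own data only.
        data = {d for d in data if d < pick}
        rounds += 1
    return picks, rounds

simulations = 100
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_simulation, range(simulations)))

# Reduce step: merge the per-simulation statistics after the parallel loop.
total_picks = sum((picks for picks, _ in results), Counter())
average_rounds = sum(rounds for _, rounds in results) / simulations
most_common_choice = total_picks.most_common(1)[0][0]
```

The while loop inside one simulation stays serial, but the simulations are independent of each other, so only the merge at the end has to see all of the results.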

Best way to calculate the result of a formula?

I currently have an application which can contain 100s of user defined formulae. Currently, I use reverse polish notation to perform the calculations (pushing values and variables on to a stack, then popping them off the stack and evaluating). What would be the best way to start parallelizing this process? Should I be looking at a functional language?
The calculations are performed on arrays of numbers, so for example a simple A+B could actually mean 100s of additions. I'm currently using Delphi, but this is not a requirement going forward. I'll use the tool most suited to the job. Formulae may also depend on each other, so we may have one formula C=A+B and a second one D=C+A, for example.
Let's assume your formulae (equations) are not cyclic, as otherwise you cannot "just" evaluate them. If you have vectorized equations like A = B + C where A, B and C are arrays, let's conceptually split them into equations on the components, so that if the array size is 5, this equation is split into
a1 = b1 + c1
a2 = b2 + c2
...
a5 = b5 + c5
Now assuming this, you have a large set of equations on simple quantities (whether integer, rational or something else).
If you have two equations E and F, let's say that F depends_on E if the right-hand side of F mentions the left-hand side of E, for example
E: a = b + c
F: q = 2*a + y
Now to get towards how to calculate this, you could always use randomized iteration to solve this (this is just an intermediate step in the explanation), following this algorithm:
1 while (there is at least one equation which has not been computed yet)
2     select one such pending equation E so that:
3         for every equation D such that E depends_on D:
4             D has been already computed
5     calculate the left-hand side of E
This process terminates with the correct answer regardless of how you make your selections on line 2. Now the cool thing is that it also parallelizes easily. You can run it in an arbitrary number of threads! What you need is a concurrency-safe queue which holds those equations whose prerequisites (the equations they depend on) have been computed but which have not been computed themselves yet. Every thread pops (thread-safely) one equation from this queue at a time, calculates the answer, then checks whether there are now new equations whose prerequisites have all been computed, and adds those equations (thread-safely) to the work queue. Done.
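The ready-queue idea can be sketched in a few lines of Python. This is a single-threaded version to keep it short: evaluate and add are illustrative names, and the formulas are the C=A+B and D=C+A from the question. A multi-threaded variant would presumably swap the deque for a thread-safe queue and guard the counters with a lock.

```python
from collections import deque

def evaluate(formulas, inputs):
    """formulas maps a name to (dependency names, function of their values)."""
    values = dict(inputs)
    # For each formula, count the prerequisites that are not yet computed.
    pending = {name: sum(dep not in values for dep in deps)
               for name, (deps, _) in formulas.items()}
    dependents = {}
    for name, (deps, _) in formulas.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # The work queue holds formulas whose prerequisites are all available.
    ready = deque(name for name, count in pending.items() if count == 0)
    while ready:
        name = ready.popleft()
        deps, func = formulas[name]
        values[name] = func(*(values[dep] for dep in deps))
        # Computing `name` may complete the prerequisites of other formulas.
        for other in dependents.get(name, []):
            pending[other] -= 1
            if pending[other] == 0:
                ready.append(other)
    return values

def add(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

formulas = {"D": (["C", "A"], add), "C": (["A", "B"], add)}
result = evaluate(formulas, {"A": [1, 2, 3], "B": [10, 20, 30]})
print(result["C"], result["D"])  # [11, 22, 33] [12, 24, 36]
```

D is listed first but is computed second, because it only enters the queue once C is available. As in the explanation above, this assumes the formulas are not cyclic; a cycle would simply leave the affected formulas uncomputed.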
Without knowing more, I would suggest taking a SIMD-style approach if possible. That is, create threads to compute all formulas for a single data set. Trying to divide the computation of the formulas themselves in order to parallelise them wouldn't yield much speed improvement: the logic required to split the computations into discrete units suitable for threading would be hard to write and harder to get right, and the overhead would cancel out any speed gains. It would also suffer quickly from diminishing returns.
Now, if you've got a set of formulas that are applied to many sets of data, then the parallelisation becomes easier and would scale better. Each thread does all computations for one set of data. Create one thread per CPU core and pin each thread to its own core. Each thread instantiates one instance of the formula evaluation code. Create a supervisor which loads a single data set and passes it to an idle thread. If no threads are idle, wait for the first thread to finish processing its data. When all data sets are processed and all threads have finished, then exit. Using this method, there's no advantage to having more threads than there are cores on the CPU, as thread switching is slow and will have a negative effect on overall speed.
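A rough sketch of that supervisor scheme in Python, where the executor plays the supervisor's role. The function evaluate_all_formulas and the data sets are made up for illustration, and core affinity is left out because the Python standard library has no portable way to set it.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def evaluate_all_formulas(data_set):
    # Hypothetical formula set applied to one data set: C = A + B, D = C + A.
    a, b = data_set["A"], data_set["B"]
    c = [x + y for x, y in zip(a, b)]
    d = [x + y for x, y in zip(c, a)]
    return {"C": c, "D": d}

data_sets = [{"A": [i, i + 1], "B": [10 * i, 10 * i + 1]} for i in range(8)]

# One worker per core; the executor hands each idle worker the next data set.
workers = os.cpu_count() or 1
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(evaluate_all_formulas, data_sets))
```

Each worker evaluates every formula for one data set, and results come back in the order of the inputs. In standard CPython, threads do not execute Python bytecode in parallel, so for CPU-bound formulas a process pool would likely be the closer equivalent of the native threads described here.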
If you've only got one data set then it is not a trivial task. It would require parsing the evaluation tree for branches without dependencies on other branches and farming those branches to separate threads running on each core and waiting for the results. You then get problems synchronizing the data and ensuring data coherency.
