I have the following code:
import random
lst = []
for i in range(100):
    lst.append(random.randint(1, 10))
print(lst)
buffer = []
# This is the piece of code which I am interested in converting into TensorFlow.
for a in lst:
    buffer.append(a)
    if len(buffer) > 5:
        buffer.pop(0)
    if len(buffer) == 5:
        print(buffer)
So, based on the code, I need to create a buffer (which could be a variable in TensorFlow). This buffer should hold the features extracted from the last conv layer. The variable will be the input to an RNN in my case.
The advantage of this approach is that when we have large images and need to feed an RNN a tensor of shape (batch of images) * (sequence length) * (size of 1 image), a very big batch of images would have to be loaded into main memory. On the other hand, with the code above, we feed 1 image at a time using TensorFlow Datasets, an input queue, or any other alternative. As a result, we only store in memory features of size batch_size * sequence_length * feature_space. In addition, we can say:
if len(buffer) == n:
    # empty out the buffer after using its elements
    buffer = []  # or any other alternative way
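(For reference, the same sliding-window behaviour can be written a bit more compactly with a deque; this is only a plain-Python restatement of the loop above, not the TensorFlow version I am asking about:)
from collections import deque

buffer = deque(maxlen=5)  # the deque discards the oldest element automatically
for a in lst:
    buffer.append(a)
    if len(buffer) == 5:
        print(list(buffer))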
I am aware that I can feed my network batches of images, but I need to implement the approach shown in the code above, following some literature.
Any help is much appreciated!!
I tried to reproduce your output using tf.FIFOQueue (https://www.tensorflow.org/api_docs/python/tf/FIFOQueue). I have given my code below, with comments where necessary.
import random
import numpy as np
import tensorflow as tf

BATCH_SIZE = 20
lst = []
for i in range(BATCH_SIZE):
    lst.append(random.randint(1, 10))
print(lst)
curr_data = np.reshape(lst, (BATCH_SIZE, 1))  # reshape so that the shape is [BATCH_SIZE, 1]
# queue starts here
queue_input_data = tf.placeholder(tf.int32, shape=[1])  # placeholder used to feed the data
queue = tf.FIFOQueue(capacity=50, dtypes=[tf.int32], shapes=[1])  # queue defined here
enqueue_op = queue.enqueue([queue_input_data])  # enqueue operation
len_op = queue.size()  # check the queue size

# check the length of the queue and dequeue one element if greater than 5
dequeue_one = tf.cond(tf.greater(len_op, 5), lambda: queue.dequeue(), lambda: 0)
# check the length of the queue and dequeue five elements if it equals 5
dequeue_many = tf.cond(tf.equal(len_op, 5), lambda: queue.dequeue_many(5), lambda: 0)
with tf.Session() as session:
    for i in range(BATCH_SIZE):
        _ = session.run(enqueue_op, feed_dict={queue_input_data: curr_data[i]})  # enqueue one element each iteration
        queue_len = session.run(len_op)  # check the length of the queue
        print(queue_len)
        element = session.run(dequeue_one)  # dequeue the first element
        print(element)
However, the following two problems are associated with the above code:
Only the dequeue-one and dequeue-many operations are available, and you cannot inspect the elements inside the queue (I don't think you will need this, since you are looking for something like a pipeline).
I think that tf.cond is the only way to implement a conditional operation (I couldn't find any other suitable function for this). However, since it is similar to an if-then-else statement, it is mandatory to define an operation for the false case as well (not just an if without the else). Since TensorFlow is all about building a graph, I think it is necessary to include both branches (for when the condition is true and when it is false); see the sketch below.
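For example, a minimal sketch (TF 1.x) reusing the len_op defined above; the name should_flush and the constant values are illustrative placeholders, not part of the original code:
should_flush = tf.cond(tf.greater(len_op, 5),
                       lambda: tf.constant(1),   # branch evaluated when the condition is true
                       lambda: tf.constant(0))   # the false branch must also be supplied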
Moreover, a good explanation of TensorFlow input pipelines can be found here (http://ischlag.github.io/2016/11/07/tensorflow-input-pipeline-for-large-datasets/).
Hope this helps.
I am trying to "skip forward" a few realizations by using the function Future.randjump(), but it doesn't seem to behave as I expect it to. The following code gives me the desired result, where jumping forward 1 steps gives the same result as if I had called rand(rng) twice, i.e. the two println display the same number:
using Random, Future
rng = MersenneTwister(123);
new_rng = Future.randjump(rng, 1)
rand(rng)
rand(rng)
println(rand(rng))
println(rand(new_rng))
However, if I add one extra call to rand(rng) before the call to randjump(), the two printed numbers are completely different:
using Random, Future
rng = MersenneTwister(123);
rand(rng) # Added line
new_rng = Future.randjump(rng, 1)
rand(rng)
rand(rng)
println(rand(rng))
println(rand(new_rng))
I expected that the two calls to println() would display the same thing even in the second case; how come they don't? Is there a way I can use randjump() in the second case to get the same realizations as if I had called rand(rng) several times? Thank you in advance.
One unit of randjump corresponds to generation of two floating point numbers.
Consider this example
julia> rng = MersenneTwister(123);
julia> rng2 = Future.randjump(rng, 1);
julia> rand(rng, 4)
4-element Vector{Float64}:
0.7684476751965699
0.940515000715187
0.6739586945680673
0.3954531123351086
julia> rand(rng2,2)
2-element Vector{Float64}:
0.6739586945680673
0.3954531123351086
Note that in the second call (that is, rand(rng2, 2)) both numbers are identical to the last two numbers in the first call (that is, rand(rng, 4)).
Another issue is that different distributions might "consume" Float64 numbers from the stream at different speeds - so you need to check, for a particular distribution, how fast it consumes floats from the stream (some might also use buffering, etc.).
Looking at the source code of randn (via @edit randn()), it consumes one float per generated value, and hence you get the same results for those two calls:
julia> randn(MersenneTwister(123),6)[3:end]
4-element Vector{Float64}:
1.142650902867199
0.45941562040708034
-0.396679079295223
-0.6647125451916877
julia> randn(Future.randjump(MersenneTwister(123),1),4)
4-element Vector{Float64}:
1.142650902867199
0.45941562040708034
-0.396679079295223
-0.6647125451916877
EDIT
Regarding your comment: the size of the Mersenne Twister state is 19937 bits and half-unit jumps are not supported. Running rand mutates this state, but not half-way - so you end up with different bits. Note that an RNG is a sequence of states, and the actual values are calculated from that state.
The correct pattern to synchronize random numbers in your computations is the following:
master_rng = MersenneTwister(123);
rng1 = Future.randjump(master_rng, big(10)^20)
# do whatever you want
rng2 = Future.randjump(master_rng, 2*big(10)^20)
# do whatever you want
rng3 = Future.randjump(master_rng, 3*big(10)^20)
# do whatever you want
With this pattern you can correctly maintain synchronization between the random number streams and have full control over whether they should overlap or not.
The code works absolutely fine for the data set containing 500000+ instances, but whenever I reduce the data set to 5000/10000/15000 it throws a KeyError: word "***" not in vocabulary. Not for every data point, but for most of them it throws the error. The data set is in Excel format (screenshot: https://i.stack.imgur.com/YCBiQ.png).
I don't know how to fix this problem since I have very little knowledge about it; I am still learning. Please help me fix this problem!
from tqdm import tqdm
from gensim.models import Word2Vec

purchases_train = []
for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

purchases_val = []
for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
model.save("word2vec_2.model")
model.init_sims(replace=True)
# extract all vectors
X = model[model.wv.vocab]
X.shape
products = train_df[["StockCode", "Description"]]
products.drop_duplicates(inplace=True, subset='StockCode', keep="last")
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()
def similar_products(v, n=6):
    ms = model.similar_by_vector(v, topn=n+1)[1:]
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
similar_products(model['21883'])
If you get a KeyError saying a word is not in the vocabulary, that's a reliable indicator that the word you're looking up was not in the training data fed to Word2Vec, or did not appear enough times (default min_count=5).
So, your error indicates the word-token '21883' did not appear at least 5 times in the texts (purchases_train) supplied to Word2Vec. You should do either or both of:
Ensure all words you're going to look up appear enough times, either with more training data or a lower min_count. (However, words with only one or a few occurrences tend not to get good vectors & instead just drag the quality of surrounding words' vectors down - so keeping this value above 1, or even raising it above the default of 5 to discard more rare words, is a better path whenever you have sufficient data.)
If your later code will be looking up words that might not be present, either check for their presence first (word in model.wv.vocab) or set up a try: ... except: ... to catch & handle the case where they're not present.
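For instance, here is a rough sketch of that second option, assuming the same old-style gensim API used in this thread (model.wv.vocab and model[...]); the token and the messages are only illustrative:
token = '21883'  # the example key from the question
if token in model.wv.vocab:  # presence check against the trained vocabulary
    print(similar_products(model[token]))
else:
    print("token %s not in vocabulary, skipping" % token)

# or, equivalently, catch the KeyError:
try:
    vec = model[token]
except KeyError:
    vec = None  # handle the missing token however suits your pipeline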
I wrote a function that acts on each combination of columns in an input matrix. It uses multiple for loops and is very slow, so I am trying to parallelize it to use the maximum number of threads on my computer.
I am having difficulty finding the correct syntax to set this up. I'm using the Parallel package in Octave, and have tried several ways to set up the calls. Here are two of them, in simplified form, as well as a non-parallel version that I believe works:
function A = parallelExample(M)
  pkg load parallel;

  # Get total count of columns
  ct = columns(M);

  # Generate column pairs
  I = nchoosek([1:ct], 2);
  ops = rows(I);
  slice = ones(1, ops);
  Ic = mat2cell(I, slice, 2);

  ## # Non-parallel
  ## A = zeros(1, ops);
  ## for i = 1:ops
  ##   A(i) = cmbtest(Ic{i}, M);
  ## endfor

  # Parallelized call v1
  A = parcellfun(nproc, @cmbtest, Ic, {M});

  ## # Parallelized call v2
  ## afun = @(x) cmbtest(x, M);
  ## A = parcellfun(nproc, afun, Ic);
endfunction

# function to apply
function P = cmbtest(indices, matrix)
  colset = matrix(:, indices);
  product = colset(:, 1) .* colset(:, 2);
  P = sum(product);
endfunction
For both of these examples I generate every combination of two columns and convert those pairs into a cell array that the parcellfun function should split up. In the first, I attempt to convert the input matrix M into a 1x1 cell array so it goes to each parallel instance in the same form. I get the error 'C must be a cell array' but this must be internal to the parcellfun function. In the second, I attempt to define an anonymous function that includes the matrix. The error I get here specifies that 'cmbtest' is undefined.
(Naturally, the actual function I'm trying to apply is far more complex than cmbtest here)
Other things I have tried:
Put M into a global variable so it doesn't need to be passed. Seemed to be impossible to put a global variable in a function file, though I may just be having syntax issues.
Make cmbtest a nested function so it can access M (parcellfun doesn't support that)
I'm out of ideas at this point and could use help figuring out how to get this to work.
Converting my comments above to an answer.
When performing parallel operations, it is useful to think of each parallel worker that will result as separate and independent octave instances, which need to have appropriate access to all functions and variables they will require in order to do their independent work.
Therefore, do not rely on subfunctions when calling parcellfun from a main function, since this might lead to errors if the worker is unable to access the subfunction directly under the hood.
In this case, separating the subfunction into its own file fixed the problem.
I am trying to apply a function over each row of a DataFrame as the code shows.
using RDatasets
iris = dataset("datasets", "iris")
function mean_n_var(x)
    mean1 = mean([x[1], x[2], x[3], x[4]])
    var1 = var([x[1], x[2], x[3], x[4]])
    rst = [mean1, var1]
    return rst
end
mean_n_var([2,4,5,6])
for row in eachrow(iris[1:4])
    println(mean_n_var(convert(Array, row)))
end
However, instead of printing results, I'd like to save them in an array or another DataFrame.
Thanks in advance.
I thought it is worth mentioning some more options beyond what was already mentioned.
I assume you want a Matrix or a DataFrame. There are several possible approaches.
First is the most direct to get a Matrix:
mean_n_var(a) = [mean(a), var(a)]
hcat((mean_n_var(Array(x)) for x in eachrow(iris[1:4]))...) # rows
vcat((mean_n_var(Array(x)).' for x in eachrow(iris[1:4]))...) # cols
another possible approach is vectorized, e.g.:
mat_iris = Matrix(iris[1:4])
mat = hcat(mean(mat_iris, 2), var(mat_iris, 2))
df = DataFrame([vec(f(mat_iris, 2)) for f in [mean,var]], [:mean, :var])
DataFrame(mat) # this constructor also accepts variable names on master but is not released yet
So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = @parallel (+) for p in partitions(1:n)
    int(is_valid(p))
end
println(valid_num)
This would use @parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = @parallel (+) for i = 1:200000000
    Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running over night and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!
The below code gives 511, the number of partitions of size 2 of a set of 10.
using Iterators
s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p)==2
valid_num = @parallel (+) for i = 1:30
    sum(map(is_valid, takenth(chain(1:29, drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p)==2
valid_num = @parallel (+) for i = 1:30
    foldl((x,y) -> (x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next
immutable TakeEvery{Itr}
    itr::Itr
    start::Any
    value::Any
    flag::Bool
    skip::Int64
end

function take_every(itr, offset, skip)
    value, state = Nothing, start(itr)
    for i = 1:(offset+1)
        if done(itr, state)
            return TakeEvery(itr, state, value, false, skip)
        end
        value, state = next(itr, state)
    end
    if done(itr, state)
        TakeEvery(itr, state, value, true, skip)
    else
        TakeEvery(itr, state, value, false, skip)
    end
end

function start{Itr}(itr::TakeEvery{Itr})
    itr.value, itr.start, itr.flag
end

function next{Itr}(itr::TakeEvery{Itr}, state)
    value, state_, flag = state
    for i = 1:itr.skip
        if done(itr.itr, state_)
            return state[1], (value, state_, false)
        end
        value, state_ = next(itr.itr, state_)
    end
    if done(itr.itr, state_)
        state[1], (value, state_, !flag)
    else
        state[1], (value, state_, false)
    end
end

function done{Itr}(itr::TakeEvery{Itr}, state)
    done(itr.itr, state[2]) && !state[3]
end
One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter, state, n)
    i = n
    arr = Array[]
    while !done(iter, state) && (i > 0)
        a, state = next(iter, state)
        push!(arr, a)
        i = i - 1
    end
    return arr, state
end

function get_part(npart, npar)
    valid_num = 0
    p = partitions(1:npart)
    s = start(p)
    while !done(p, s)
        arr, s = my_take(p, s, npar)
        valid_num += @parallel (+) for a in arr
            length(a)
        end
    end
    return valid_num
end

valid_num = @time get_part(10, 30)
I was going to use the take() method to realize up to npar items from the iterator, but take() appears to be deprecated, so I've included my own implementation which I've called my_take(). The get_part() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.