How to save Julia for loop returns in an array or dataframe?

I am trying to apply a function over each row of a DataFrame, as the code below shows.
using RDatasets
iris = dataset("datasets", "iris")
function mean_n_var(x)
    mean1 = mean([x[1], x[2], x[3], x[4]])
    var1 = var([x[1], x[2], x[3], x[4]])
    rst = [mean1, var1]
    return rst
end

mean_n_var([2, 4, 5, 6])

for row in eachrow(iris[1:4])
    println(mean_n_var(convert(Array, row)))
end
However, instead of printing results, I'd like to save them in an array or another DataFrame.
Thanks in advance.

I think it is worth mentioning a few more options beyond what has already been covered.
I assume you want a Matrix or a DataFrame. There are several possible approaches.
First is the most direct to get a Matrix:
mean_n_var(a) = [mean(a), var(a)]
hcat((mean_n_var(Array(x)) for x in eachrow(iris[1:4]))...)    # statistics in rows, one column per observation
vcat((mean_n_var(Array(x)).' for x in eachrow(iris[1:4]))...)  # one row per observation, statistics in columns
Another possible approach is vectorized, e.g.:
mat_iris = Matrix(iris[1:4])
mat = hcat(mean(mat_iris, 2), var(mat_iris, 2))
df = DataFrame([vec(f(mat_iris, 2)) for f in [mean,var]], [:mean, :var])
DataFrame(mat) # this constructor also accepts variable names on master but is not released yet

Related

multiprocessing a geopandas.overlay() throws no error but seemingly never completes

I'm trying to pass a geopandas.overlay() to multiprocessing to speed it up.
I have used custom functions and functools.partial to pre-fill some of the function inputs and then pass the iterative component to the function, producing a series of dataframes that I then concat into one.
def taska(id, points, crs):
    return make_break_points((vms_points[points.ID == id]).reset_index(drop=True), crs)

points_gdf = ...  # GeoDataFrame of points with an ID field
grid_gdf = ...    # GeoDataFrame polygon grid

partialA = functools.partial(taska, points=points_gdf, crs=grid_gdf.crs)
partialA_results = []
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialA, list(points_gdf.ID.unique())):
        partialA_results.append(results)
bpts_gdf = pd.concat(partialA_results)
In the example above I use the list of unique values to subset the df and pass it to a processor to perform the function and return the results. In the end all the results are combined using pd.concat.
When I apply the same approach to a list of dataframes created with numpy.array_split(), the process starts up a number of workers, then they all close and everything hangs, with no indication that work is being done or that it will ever exit.
def taskc(tracks, grid):
    return gpd.overlay(tracks, grid, how='union').explode().reset_index(drop=True)

tracks_gdf = ...  # GeoDataFrame of points with an ID field
dfs = np.array_split(tracks_gdf, (cpu_count() - 4))
grid_gdf = ...    # GeoDataFrame polygon grid

partialC_results = []
partialC = functools.partial(taskc, grid=grid_gdf)
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfs):
        partialC_results.append(results)
results_df = pd.concat(partialC_results)
I tried using with get_context('spawn').Pool(cpu_count() - 4) as pool: based on the information here https://pythonspeed.com/articles/python-multiprocessing/ with no change in behavior.
Additionally, if I simply run geopandas.overlay(tracks_gdf, grid_gdf) the process is successful and the script carries on to the end with expected results.
Why does the partial function approach work on a list of items but not a list of dataframes?
Is the numpy.array_split() not an iterable object like a list?
How can I pass a single df into geopandas.overlay() in chunks to utilize multiprocessing capabilities and get back a single dataframe or a series of dataframes to concat?
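For what it's worth, numpy.array_split() does return a plain Python list of chunks, so the iterable itself is unlikely to be the problem. A minimal sketch with toy pandas data (not the poster's geodata):

import numpy as np
import pandas as pd

# Toy frame standing in for tracks_gdf; the real data is a GeoDataFrame.
df = pd.DataFrame({"ID": range(10), "val": range(10)})

chunks = np.array_split(df, 4)
print(type(chunks))              # <class 'list'>
print([len(c) for c in chunks])  # e.g. [3, 3, 2, 2]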
This is my workaround, but I am also interested in whether there is a better way to perform this and similar tasks. Essentially, I modified the partial function so that the df split happens inside it, and I then create a list of values from range() as my iterable.
def taskc(num, tracks, grid):
    return gpd.overlay(np.array_split(tracks, cpu_count() - 4)[num], grid, how='union').explode().reset_index(drop=True)

partialC = functools.partial(taskc, tracks=tracks_gdf, grid=grid_gdf)
dfrange = list(range(0, cpu_count() - 4))
partialC_results = []
with get_context('spawn').Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfrange):
        partialC_results.append(results)
results_gdf = pd.concat(partialC_results)
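As a general aside (my own assumption, not from the original post): with the spawn start method the worker function needs to live at module top level, and the pool creation should sit under an if __name__ == "__main__" guard, otherwise workers can hang or fail silently. A minimal, self-contained sketch of that layout with toy pandas data:

import numpy as np
import pandas as pd
from multiprocessing import get_context, cpu_count

def process_chunk(chunk):
    # Stand-in for the real per-chunk work (e.g. a gpd.overlay call).
    return chunk

if __name__ == "__main__":
    df = pd.DataFrame({"ID": range(100), "val": range(100)})
    chunks = np.array_split(df, cpu_count())
    with get_context("spawn").Pool(cpu_count()) as pool:
        results = pool.map(process_chunk, chunks)
    combined = pd.concat(results)
    print(len(combined))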

Convert structure fields to arrays efficiently matlab

I have a structure called s in MATLAB. It has two fields, a and b, and its size is 1 x 1,620,000.
It is a very large structure (it probably takes up half of the RAM on my machine).
I am looking for an efficient way to concatenate each of the fields a and b into two separate arrays that I can then export to CSV. I built the code below to do so, but even after 12 hours of running it has not reached even a quarter of the loop. Is there a more efficient way of doing this?
a = [];
b = [];
total_n = size(s,2);
count = 1;
while size(s,2) > 0
    if size(s(1).a,1)
        a = [a; s(1).a];
    end
    if size(s(1).b,1)
        b = [b; s(1).b];
    end
    s(1) = []; % to save memory
    if mod(count,1000) == 0
        fprintf('Done %2f \n', [count/total_n])
    end
    count = count + 1;
end
s(1) = []; % to save memory
Ah, but what a huge misunderstanding that comment is.
If size(s) is 1 x 1,620,000, that line forces each loop iteration to do (under the hood, where you don't see it):
snew = zeros(1, size(s,2)-1);   % now you use double the memory
snew = s(2:end);                % now you force an unnecessary copy
So not only does that line make your code require double the memory, but in each iteration you also make an unnecessary copy of a large array.
Just replace your while loop with a normal for loop, for ii = 1:size(s,2), and then index s with ii.
Hopefully you can now see why the following is an equally big mistake (and any modern MATLAB version will flag it as a bad idea right in the editor):
a = [];
a = [a; s(1).a];
Here, in each iteration you force MATLAB to create a new a that is one element bigger than before and copy the contents of the old a into it.
Instead, preallocate a.
Since you don't know in advance what you are going to put there, I suggest using a cell array, as each s(ii).a has a different length.
After the loop you can then remove all empty cells (isempty) if you want.
I managed to do it efficiently:
s = struct2cell(s);  % 2 x 1 x 1,620,000 cell array: row 1 holds the a fields, row 2 the b fields
s = squeeze(s);      % drop the singleton dimension -> 2 x 1,620,000
a = s(1,:);
a = a';
a = vertcat(a{:});   % concatenate all a fields into one array
b = s(2,:);
b = b';
b = vertcat(b{:});   % concatenate all b fields into one array

Input f into play3d() and movie3d() in the rgl package in R

I don't understand the input f expected by play3d and movie3d in the rgl package.
library(rgl)
nobs <- 10
x <- runif(nobs)
y <- runif(nobs)
z <- runif(nobs)
n <- rep(1:nobs)
df <- as.data.frame(cbind(x, y, z, n))
listofobs <- split(df, n)
plot3d(df[,1], df[,2], df[,3], type = "n", radius = .2)
myplotfunction <- function(x) {
    rgl.spheres(x = x$x, y = x$y, z = x$z, type = "s", r = 0.025)
}
When executing the two lines below, the animation does play, but both lines (play3d() and movie3d()) trigger an error.
play3d(f=lapply(listofobs,myplotfunction), fps=1 )
movie3d(f=lapply(listofobs,myplotfunction), fps=1 , duration=20)
I am hoping someone can correct my code and help me understand the f input to play3d and movie3d.
Question 1: Why is the play3d line above correct enough that the animation does display correctly?
Question 2: Why is the play3d line above incorrect enough that it triggers the error?
Question 3: What is wrong with the movie3d line that it does not produce a video output?
As the docs say, f is "A function returning a list that may be passed to par3d". It needs to be a function, not a list, which is what your usage passes.
To answer the questions:
Question 1: R evaluates the lapply call, which does the animation; then play3d looks at the result and dies because it's not a function.
Question 2: f needs to be a function, as described in the help page.
Question 3: movie3d dies when it looks at f, because it's not a function.
This looks like it will do what you want:
library(rgl)
nobs <- 10
x <- runif(nobs)
y <- runif(nobs)
z <- runif(nobs)
df <- data.frame(x, y, z)
plot3d(df, type = "n")
id <- NA
myplotfunction <- function(time) {
    index <- round(time)
    # For a 3x faster display, use index <- round(3*time)
    # To cycle through the points several times, use
    # index <- round(3*time) %% nobs + 1
    if (!is.na(id))
        pop3d(id = id)  # Delete previous item
    id <<- spheres3d(df[index,], r = 0.025)
    list()
}
play3d(myplotfunction, startTime = 1, duration = nobs - 1)
movie3d(myplotfunction, startTime = 1, duration = nobs - 1, fps = 1)
This will leave a GIF in file.path(tempdir(), "movie.gif").
Some other notes:
don't call rgl.spheres; it will cause you immense pain later. Either use spheres3d, or never call any *3d function and never upgrade rgl: you're living in the past if you use the rgl.* functions. The *3d functions and the rgl.* functions don't play nicely together.
to construct a dataframe, just use the data.frame() function; don't convert a matrix.
you don't need all those contortions to extract points from the dataframe.
Most rgl functions can handle a dataframe with x, y, and z columns.
You might notice the plot3d frame move a little: spheres are bigger than points, so it will adjust to accommodate them. You could use xlim, ylim and zlim to set the original frame a little bigger if you don't like this.

How can I pass multiple parameters to a parallel operation in Octave?

I wrote a function that acts on each combination of columns in an input matrix. It uses multiple for loops and is very slow, so I am trying to parallelize it to use the maximum number of threads on my computer.
I am having difficulty finding the correct syntax to set this up. I'm using the Parallel package in Octave, and have tried several ways to set up the calls. Here are two of them, in a simplified form, as well as a non-parallel version that I believe works:
function A = parallelExample(M)
  pkg load parallel;

  # Get total count of columns
  ct = columns(M);

  # Generate column pairs
  I = nchoosek([1:ct], 2);
  ops = rows(I);
  slice = ones(1, ops);
  Ic = mat2cell(I, slice, 2);

  ## # Non-parallel
  ## A = zeros(1, ops);
  ## for i = 1:ops
  ##   A(i) = cmbtest(Ic{i}, M);
  ## endfor

  # Parallelized call v1
  A = parcellfun(nproc, @cmbtest, Ic, {M});

  ## # Parallelized call v2
  ## afun = @(x) cmbtest(x, M);
  ## A = parcellfun(nproc, afun, Ic);
endfunction

# function to apply
function P = cmbtest(indices, matrix)
  colset = matrix(:, indices);
  product = colset(:,1) .* colset(:,2);
  P = sum(product);
endfunction
For both of these examples I generate every combination of two columns and convert those pairs into a cell array that the parcellfun function should split up. In the first, I attempt to convert the input matrix M into a 1x1 cell array so it goes to each parallel instance in the same form. I get the error 'C must be a cell array' but this must be internal to the parcellfun function. In the second, I attempt to define an anonymous function that includes the matrix. The error I get here specifies that 'cmbtest' is undefined.
(Naturally, the actual function I'm trying to apply is far more complex than cmbtest here)
Other things I have tried:
Putting M into a global variable so it doesn't need to be passed. It seemed to be impossible to use a global variable in a function file, though I may just be having syntax issues.
Making cmbtest a nested function so it can access M (parcellfun doesn't support that).
I'm out of ideas at this point and could use help figuring out how to get this to work.
Converting my comments above to an answer.
When performing parallel operations, it is useful to think of each resulting parallel worker as a separate and independent Octave instance, which needs appropriate access to all the functions and variables it requires in order to do its work.
Therefore, do not rely on subfunctions when calling parcellfun from a main function, since this might lead to errors if the worker is unable to access the subfunction directly under the hood.
In this case, separating the subfunction into its own file fixed the problem.

What is the equivalent of the following code in tensorflow?

I have the following code:
import random

lst = []
for i in range(100):
    lst.append(random.randint(1, 10))
print(lst)

buffer = []
# This is the piece of code which I am interested in converting to TensorFlow.
for a in lst:
    buffer.append(a)
    if len(buffer) > 5:
        buffer.pop(0)
    if len(buffer) == 5:
        print(buffer)
So, from the code, I need to create a buffer (which could be a variable in TensorFlow). This buffer should hold the features extracted from the last conv layer; the variable will be the input to an RNN in my case.
The advantage of this approach is that with large images, feeding an RNN with a tensor of shape (batch of images) * (sequence length) * (size of 1 image) would require loading a very large batch of images into main memory. With the code above, on the other hand, we feed one image at a time using the Datasets from TensorFlow, an input queue, or any other alternative, and only store features of size batch_size * sequence_length * feature_space in memory. In addition, we can say:
if len(buffer) == n:
    # empty out the buffer after using its elements
    buffer = []  # or any other alternative way
I am aware that I can feed my network batches of images, but I need to accomplish the mentioned code based on some literature.
Any help is much appreciated!!
I tried to reproduce your behaviour using tf.FIFOQueue (https://www.tensorflow.org/api_docs/python/tf/FIFOQueue). My code is given below, with comments where necessary.
import random
import numpy as np
import tensorflow as tf

BATCH_SIZE = 20

lst = []
for i in range(BATCH_SIZE):
    lst.append(random.randint(1, 10))
print(lst)

curr_data = np.reshape(lst, (BATCH_SIZE, 1))  # reshape the data to shape [BATCH_SIZE, 1]

# queue starts here
queue_input_data = tf.placeholder(tf.int32, shape=[1])  # placeholder for feeding the data
queue = tf.FIFOQueue(capacity=50, dtypes=[tf.int32], shapes=[1])  # queue defined here
enqueue_op = queue.enqueue([queue_input_data])  # enqueue operation
len_op = queue.size()  # check the queue size

# check the length of the queue and dequeue one element if greater than 5
dequeue_one = tf.cond(tf.greater(len_op, 5), lambda: queue.dequeue(), lambda: 0)
# check the length of the queue and dequeue five elements if it equals 5
dequeue_many = tf.cond(tf.equal(len_op, 5), lambda: queue.dequeue_many(5), lambda: 0)

with tf.Session() as session:
    for i in range(BATCH_SIZE):
        _ = session.run(enqueue_op, feed_dict={queue_input_data: curr_data[i]})  # enqueue one element each iteration
        len = session.run(len_op)  # check the length of the queue
        print(len)
        element = session.run(dequeue_one)  # dequeue the first element
        print(element)
However, the following two problems are associated with the above code:
Only the dequeue-one and dequeue-many operations are available, and you cannot see the elements inside the queue (I don't think you will need this, since you are after something like a pipeline).
I think tf.cond is the only way to implement a conditional operation (I couldn't find any other suitable function for it). However, since it is similar to an if-then-else statement, you must also define an operation for when the condition is false (not just an if branch without an else). Since TensorFlow is all about building a graph, I think it is necessary to include both branches (for when the condition is true and when it is false).
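To make that last point concrete, here is a minimal sketch (my own illustration, not part of the original answer) of tf.cond with both branches supplied, in the same TF 1.x session style as above:

import tensorflow as tf

x = tf.constant(3)
y = tf.constant(5)

# Both branches are zero-argument callables returning compatible tensors;
# only the branch selected by the predicate runs at session time.
result = tf.cond(tf.less(x, y),
                 lambda: tf.add(x, y),        # taken when x < y
                 lambda: tf.subtract(x, y))   # taken otherwise

with tf.Session() as sess:
    print(sess.run(result))  # prints 8 here, since 3 < 5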
Moreover, a good explanation of TensorFlow input pipelines can be found here: http://ischlag.github.io/2016/11/07/tensorflow-input-pipeline-for-large-datasets/.
Hope this helps.
