Copy big value from Box into Vec in Rust without blowing the stack

I want to copy (not move) a very big value from a Box into a Vec. The normal way of doing this (dereferencing the Box) copies the value onto the stack temporarily, which blows it. Here's an example, and a Playground link where it can be reproduced:
fn main() {
    let big_value = Box::new([0u8; 8 * 1024 * 1024]);
    let mut vec = Vec::new();
    vec.push(*big_value);
}
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5998cff185dc209e7a1e676d41850ff4
Since both the Box and the Vec are on the heap, it should be possible to do a copy without passing through the stack. What's the best solution here?

See this answer on a Reddit post I made: https://old.reddit.com/r/rust/comments/n2jasd/question_copy_big_value_from_box_into_vec_without/gwmrcxp/
You can do this with vec.extend_from_slice(std::slice::from_ref(&*big_value));. Apart from the Vec growing its own buffer, this performs no allocations, and it copies big_value straight from the heap into a new slot in the vec.
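A complete version of the fix, spelled out as a small program (the function name is mine, and the array is sized down here so the Box::new call itself does not risk a large stack temporary in debug builds; the same approach applies at 8 MiB):

```rust
const N: usize = 256 * 1024; // sized down for illustration

// `slice::from_ref(&*big_value)` produces a one-element `&[[u8; N]]`
// viewing the Box's heap allocation, and `extend_from_slice` memcpys
// from it into the Vec's buffer -- no large stack temporary.
fn push_big(big_value: Box<[u8; N]>, vec: &mut Vec<[u8; N]>) {
    vec.extend_from_slice(std::slice::from_ref(&*big_value));
}

fn main() {
    let big_value = Box::new([1u8; N]);
    let mut vec = Vec::new();
    push_big(big_value, &mut vec);
    assert_eq!(vec.len(), 1);
    assert_eq!(vec[0][0], 1);
}
```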

Related

Stack overflow solution with O(n) runtime

I have a problem related to runtime for push and pop in a stack.
Here, I implemented a stack using array.
I want to avoid overflow in the stack when I insert a new element to a full stack, so when the stack is full I do the following (Pseudo-Code):
(I consider a stack as an array)
Generate a new array with the size of double of the origin array.
Copy all the elements in the origin stack to the new array in the same order.
Now, I know that for a single push operation to the stack with the size of n the action executes in the worst case in O(n).
I want to show that the runtime of n pushes to an empty stack in the worst case is also O(n).
Also, how can I update this algorithm so that every push executes in constant time in the worst case?
Amortized constant time is often just as good in practice as, if not better than, worst-case constant-time alternatives.
Generate a new array with the size of double of the origin array.
Copy all the elements in the origin stack to the new array in the same order.
This is actually a very decent and respectable solution for a stack implementation, because it has good locality of reference and the cost of reallocating and copying is amortized to the point of being almost negligible: over n pushes, the reallocations copy 1 + 2 + 4 + ... elements, fewer than 2n in total, so the whole sequence is O(n) even though an individual push can cost O(n). Most generalized solutions to "growable arrays" like ArrayList in Java or std::vector in C++ rely on this type of solution, though they might not exactly double in size (many std::vector implementations grow by a factor closer to 1.5 than 2.0).
One of the reasons this is much better than it sounds is because our hardware is super fast at copying bits and bytes sequentially. After all, we often rely on millions of pixels being blitted many times a second in our daily software. That's a copying operation from one image to another (or frame buffer). If the data is contiguous and just sequentially processed, our hardware can do it very quickly.
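The amortized claim is easy to check empirically. A minimal counting sketch (the struct name and the doubling-from-one policy are my own choices, not the asker's code):

```cpp
#include <cstddef>
#include <vector>

// Counts element copies performed by a doubling stack. Over n pushes the
// reallocations copy 1 + 2 + 4 + ... elements, fewer than 2n in total,
// so the whole sequence is O(n): amortized O(1) per push.
struct CountingStack {
    std::vector<int> buf;   // raw storage; resized manually below
    std::size_t size = 0, cap = 0, copies = 0;

    void push(int v) {
        if (size == cap) {
            std::size_t newcap = cap ? cap * 2 : 1;
            std::vector<int> bigger(newcap);
            for (std::size_t i = 0; i < size; ++i) {
                bigger[i] = buf[i]; // the copy step of the reallocation
                ++copies;
            }
            buf = std::move(bigger);
            cap = newcap;
        }
        buf[size++] = v;
    }
};
```

Pushing n = 1000 elements triggers reallocations at sizes 1, 2, 4, ..., 512, for 1023 copies in total, comfortably below 2n.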
Also how can I update this algorithm that for every push the operation
will execute in a constant runtime in the worst case?
I have come up with stack solutions in C++ that are ever-so-slightly faster than std::vector for pushing and popping a hundred million elements and meet your requirements, but only for pushing and popping in a LIFO pattern. We're talking about something like 0.22 secs for vector as opposed to 0.19 secs for my stack. That relies on just allocating blocks like this:
... of course typically with more than 5 elements' worth of data per block! (I just didn't want to draw an epic diagram.) Each block stores an array of contiguous data, but when it fills up, it links to a new block. The blocks are linked (each stores only a previous link), and each one might hold, say, 512 bytes worth of data with 64-byte alignment. That allows constant-time pushes and pops with no need to reallocate or copy. When a block fills up, the structure just links a new block to the previous one and starts filling that up. When you pop, you pop until the block becomes empty; once it is, you follow its previous link to get to the block before it and start popping from that (you can also free the now-empty block at this point).
Here's your basic pseudo-C++ example of the data structure:
template <class T>
struct UnrolledNode
{
    // Points to the previous block. We use this to get
    // back to a former block when this one becomes empty.
    UnrolledNode* prev;

    // Stores the number of elements in the block. If
    // this becomes full with, say, 256 elements, we
    // allocate a new block and link it to this one.
    // If this reaches zero, we deallocate this block
    // and begin popping from the previous block.
    size_t num;

    // Array of the elements. This has a fixed capacity,
    // say 256 elements, though determined at runtime
    // based on sizeof(T). The structure is a VLS to
    // allow node and data to be allocated together.
    T data[];
};

template <class T>
struct UnrolledStack
{
    // Stores the tail end of the list (the last
    // block we're going to be using to push to and
    // pop from).
    UnrolledNode<T>* tail;
};
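A push/pop sketch built on these blocks might look like this (the block capacity, the malloc-based layout, and all names are my own illustrative choices; T is assumed trivially copyable since no constructors run in the raw blocks):

```cpp
#include <cstddef>
#include <cstdlib>

// Constant-time worst-case push/pop over linked fixed-size blocks, in the
// spirit of the UnrolledNode/UnrolledStack structs above. Uses a fixed
// in-struct array instead of a flexible array member for simplicity.
template <class T, std::size_t Cap = 256>
class UnrolledStack {
    struct Node {
        Node* prev;       // link back to the previous block
        std::size_t num;  // elements currently stored in this block
        T data[Cap];      // fixed capacity per block
    };
    Node* tail = nullptr;

public:
    ~UnrolledStack() {
        while (tail) { Node* p = tail->prev; std::free(tail); tail = p; }
    }
    void push(const T& value) {
        if (!tail || tail->num == Cap) {
            // Block full (or stack empty): link a fresh block. Existing
            // elements are never moved, so every push is worst-case O(1).
            Node* n = static_cast<Node*>(std::malloc(sizeof(Node)));
            n->prev = tail;
            n->num = 0;
            tail = n;
        }
        tail->data[tail->num++] = value;
    }
    T pop() {
        T value = tail->data[--tail->num];
        if (tail->num == 0) { // block drained: free it and step back
            Node* prev = tail->prev;
            std::free(tail);
            tail = prev;
        }
        return value;
    }
    bool empty() const { return tail == nullptr; }
};
```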
That said, I actually recommend your solution instead for performance, since mine barely has an edge over the simple reallocate-and-copy solutions, and yours has a slight edge when it comes to traversal: it can walk the array in a straightforward sequential fashion (and supports straightforward random access if you need it). I didn't actually implement mine for performance reasons. I implemented it to prevent pointers from being invalidated when you push things to the container (the actual thing is a memory allocator in C), and, again, in spite of achieving true constant-time push backs and pop backs, it's still barely any faster than the amortized constant-time solution involving reallocation and memory copying.

Max size of a 2-dimensional array in C++

I want to run a large computational program in 2 and 3 dimensions with an array of size [40000][40000] or more. The code below shows the problem; I commented out the vector because it has the same problem (when I run it, it dies inside the vector library). How can I increase the available memory, or delete (clean) parts of it while the program is running?
#include<iostream>
#include<cstdlib>
#include<vector>
using namespace std;

int main(){
    float array[40000][40000];
    //vector< vector<double> > array(1000,1000);
    cout<<"bingo"<<endl;
    return 0;
}
A slightly better option than vector (and far better than vector-of-vector1), which like vector, uses dynamic allocation for the contents (and therefore doesn't overflow the stack), but doesn't invite resizing:
std::unique_ptr<float[][40000]> array{ new float[40000][40000] };
Conveniently, float[40000][40000] still appears, making it fairly obvious what is going on here even to a programmer unfamiliar with incomplete array types.
1 vector<vector<T> > is very bad, since it would have many different allocations, which all have to be separately initialized, and the resulting storage would be discontiguous. Slightly better is a combination of vector<T> with vector<T*>, with the latter storing pointers created one row apart into a single large buffer managed by the former.
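The single-large-buffer idea from the footnote can also be done without the pointer row table, using plain index arithmetic (the class name and interface here are mine, for illustration):

```cpp
#include <cstddef>
#include <vector>

// One contiguous heap allocation indexed as a 2-D grid. Avoids both the
// stack overflow of a huge automatic array and the many scattered
// allocations of vector<vector<float>>.
class Matrix2D {
    std::size_t rows_, cols_;
    std::vector<float> data_; // single allocation, rows stored back to back

public:
    Matrix2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    float& operator()(std::size_t r, std::size_t c) {
        return data_[r * cols_ + c]; // row-major addressing
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
};
```

Note that a 40000 x 40000 float instance still needs about 6.4 GB of RAM; the allocation is merely no longer on the stack.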

How to pass a list of arbitrary size to an OpenCL kernel

I have been fiddling with OpenCL recently, and I have run into a serious limitation: You cannot pass an array of pointers to a kernel. This makes it difficult to pass an arbitrarily sized list of, say, images to a kernel. I had a couple of thoughts toward this, and I was wondering if anybody could say for sure whether or not they would work, or offer better suggestions.
Let's say you had x image objects that you wanted to be passed to the kernel. If they were all only 2D, one solution might be to pack them all into a 3D image, and just index the slices. The problem with this is, if the images are different sizes, then space will be wasted, because the 3D image has to have the width of the widest image, the height of the tallest image, and the depth would be the number of images.
However, I was also thinking that when you pass a buffer object to a kernel, it appears in the kernel as a pointer. If you had a kernel that took an arbitrary data buffer, and a buffer designated just for storing pointers, and then appended the pointer to the first buffer to the end of the second buffer, (provided there was enough allocated space of course) then maybe you could keep a buffer of pointers to other buffers on the device. This buffer could then be passed to other kernels, which would then, with some interesting casting, be able to access these arbitrary buffers on the device. The only problem is whether or not a given buffer pointer would remain the same throughout the life of the buffer. Also, when you pass an image, you get a struct as an argument. Now, does this struct actually have a home in device memory? Or is it around just long enough to be passed to the kernel? These things are important in that they would determine whether or not the pointer buffer trick would work on images too, assuming it would work at all.
Does anybody know if the buffer trick would work? Are there any other ways anybody can think of to pass a list of arbitrary size to a kernel?
EDIT: The buffer trick does NOT work. I have tested it. I am not sure why exactly, but the pointers on the device don't seem to stay the same from one invocation to another.
Passing an array of pointers to a kernel does not make sense, because the pointers would point to host memory, which the OpenCL device does not know anything about. You would have to transfer the data to a device buffer and then pass the buffer pointer to the kernel. (There are some more complicated options with mapped/pinned memory and especially in the case of APUs, but they don't change the main fact, that host pointers are invalid on the device).
I can suggest one approach, although I have never actually used it myself. If you have a large device buffer preallocated, you could fill it up with images back to back from the host. Then call the kernel with the buffer and a list of offsets as arguments.
This is easy, and I've done it. You don't use pointers, so much as references, and you do it like this. In your kernel, you can provide two arguments:
kernel void process_rows(
    global const int *rowoffsets,
    global const float *data
) {
    // ...
}
Now, on the host, you simply take your 2d data, copy it into a 1d array, and put the index of the start of each row into rowoffsets. For the last row, you add an additional rowoffset, pointing one past the end of data.
Then in your kernel, to read the data from a row, you can do things like:
kernel void process_rows(
    global const int *rowoffsets,
    global const float *data,
    const int N
) {
    for( int n = 0; n < N; n++ ) {
        const int rowoffset = rowoffsets[n];
        const int rowlen = rowoffsets[n+1] - rowoffset;
        for( int col = 0; col < rowlen; col++ ) {
            // do stuff with data[rowoffset + col] here
        }
    }
}
Obviously, how you're actually going to assign the data to each workitem is up to you, so whether you're using actual loops, or giving each workitem a single row and column is part of your own application design.
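The host-side packing step described above can be sketched like this (struct and function names are mine):

```cpp
#include <cstddef>
#include <vector>

// Flattens rows of varying length into one contiguous buffer and records
// each row's starting index, plus a sentinel one past the end, matching
// the rowoffsets/data pair the kernel expects.
struct PackedRows {
    std::vector<float> data;
    std::vector<int> rowoffsets; // rows.size() + 1 entries
};

PackedRows pack(const std::vector<std::vector<float>>& rows) {
    PackedRows p;
    for (const auto& row : rows) {
        p.rowoffsets.push_back(static_cast<int>(p.data.size()));
        p.data.insert(p.data.end(), row.begin(), row.end());
    }
    p.rowoffsets.push_back(static_cast<int>(p.data.size())); // sentinel
    return p;
}
```

Both vectors would then be copied into device buffers (e.g. via clCreateBuffer and clEnqueueWriteBuffer) and passed as the two kernel arguments.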

openCL reduction, and passing 2d array

Here is the loop I want to convert to openCL.
for(n=0; n < LargeNumber; ++n) {
    for (n2=0; n2< SmallNumber; ++n2) {
        A[n]+=B[n2][n];
    }
    Re+=A[n];
}
And here is what I have so far, although, I know it is not correct and missing some things.
__kernel void openCL_Kernel( __global int *A,
                             __global int **B,
                             __global int *C,
                             __global _int64 Re,
                             int D)
{
    int i=get_global_id(0);
    int ii=get_global_id(1);

    A[i]+=B[ii][i];
    //barrier(..); ?
    Re+=A[i];
}
I'm a complete beginner to this type of thing. First of all I know that I can't pass a global double pointer to an openCL kernel. If you can, wait a few days or so before posting the solution, I want to figure this out for myself, but if you can help point me in the right direction I would be grateful.
Concerning your problem with passing double pointers: that kind of problem is typically solved by copying the whole matrix (or whatever you are working on) into one contiguous block of memory and, if the rows have different lengths, passing another array which contains the offsets for the individual rows (so your access would look something like B[index[ii]+i]).
Now for your reduction down to Re: since you didn't mention what kind of device you are working on, I'm going to assume it's a GPU. In that case I would avoid doing the reduction in the same kernel, since it's going to be slow as hell the way you posted it: you would have to serialize the access to Re across thousands of threads (and the access to A[i], too).
Instead I would write one kernel which sums all B[*][i] into A[i], and put the reduction from A into Re in another kernel, doing it in several steps: you use a reduction kernel which operates on n elements and reduces them to something like n / 16 (or any other number). Then you iteratively call that kernel until you are down to one element, which is your result. (I'm keeping this description intentionally vague, since you said you wanted to figure things out yourself.)
As a sidenote: you realize that the original code doesn't exactly have a nice memory access pattern? Assuming B is relatively large (and much larger than A due to the second dimension), having the inner loop iterate over the outer index is going to create a lot of cache misses. This is even worse when porting to the GPU, which is very sensitive to coherent memory access.
So reordering it like this may massively increase performance:
for (n2=0; n2< SmallNumber; ++n2)
    for(n=0; n < LargeNumber; ++n)
        A[n]+=B[n2][n];

for(n=0; n < LargeNumber; ++n)
    Re+=A[n];
This is particularly true if you have a compiler that is good at autovectorization, since it might be able to vectorize that construct, but it is very unlikely to manage it for the original code (and if it can't prove that A and B[n2] never refer to the same memory, it can't transform the original code into this).
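A self-contained comparison of the two orderings (sizes and function names are illustrative; both variants produce identical results, only the access pattern differs):

```cpp
#include <cstdint>
#include <vector>

// Original order: the inner loop strides down a "column" of B, touching
// a different row on every iteration (poor locality for large B).
std::int64_t reduce_original(const std::vector<std::vector<int>>& B,
                             std::vector<int> A) {
    std::int64_t re = 0;
    for (std::size_t n = 0; n < A.size(); ++n) {
        for (std::size_t n2 = 0; n2 < B.size(); ++n2)
            A[n] += B[n2][n];
        re += A[n];
    }
    return re;
}

// Reordered: the inner loop walks each row of B sequentially, which is
// cache friendly and easier for the compiler to vectorize.
std::int64_t reduce_reordered(const std::vector<std::vector<int>>& B,
                              std::vector<int> A) {
    for (std::size_t n2 = 0; n2 < B.size(); ++n2)
        for (std::size_t n = 0; n < A.size(); ++n)
            A[n] += B[n2][n];
    std::int64_t re = 0;
    for (std::size_t n = 0; n < A.size(); ++n)
        re += A[n];
    return re;
}
```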

rvalues c++0x and moving to heap

Okay, I've been reading about rvalues and they seem like a great idea, but something keeps bothering me about them. Particularly the claim that move allows us to steal resources and avoid copying.
I understand that move works and does avoid copying for everything that happens on the stack, but eventually most of the stuff done on the stack yields some value that we want copied into the heap and this is where I don't think move works.
Assuming that int has a move assignment operator, given the following code:
struct Foo
{
    int x;
};

void doIt()
{
    Foo* f = new Foo();
    f->x = (2 + 4);
}
So in this example, the rvalue resulting from (2+4) can supposedly be moved over to f->x instead of copied. Okay, great. But f and consequently f->x is on the heap and that rvalue is on the stack. It seems impossible to avoid a copy. You cannot simply point f->x to the memory of the rvalue. That rvalue is going to be blown away as soon as doIt ends. A copy seems necessary.
So am I right that a copy will be made?
Or am I wrong?
Or did I completely misunderstand the rvalue concept?
In this case, it probably would do a copy, but since the object only contains an int, that's not really much of a problem.
The times you care are generally when the object contains a pointer to some data that's allocated on the heap (regardless of where the object itself is allocated). In this case, avoiding allocating a new copy of that data is quite worthwhile (and since it's on the heap even if the object itself is on the stack, you can move it regardless of where the object itself is located).
My understanding is that moving is not the opposite of copying: moving is preferable because in most cases it will implement a shallow copy (instead of a deep copy).
If an object holds a pointer to some resource, a shallow copy is copying the pointer, while a deep copy is copying the data pointed to by the pointer. There always is copying involved: the question is "how deep must we go".
Your example only involves an int: there is no such thing as a shallow or deep copy of an int, so it is irrelevant here. So indeed, you are right to believe that moving only makes sense when dynamically allocated resources are involved.
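The point about heap-owning types can be demonstrated directly. A minimal sketch (the function name is mine; std::vector's move constructor takes over the old heap storage, and pointers into it remain valid):

```cpp
#include <utility>
#include <vector>

// Returns true if moving a heap-backed vector transfers its buffer
// rather than copying the elements.
bool move_steals_buffer() {
    std::vector<int> a(1000, 42);      // the 1000 ints live on the heap
    const int* buf = a.data();
    std::vector<int> b = std::move(a); // shallow copy: pointer/size/capacity
    // b now owns the very same heap storage; no element was copied.
    // a is left in a valid but unspecified state.
    return b.data() == buf && b.size() == 1000 && b[0] == 42;
}
```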
Um. Local variables are on the stack. That rvalue of yours will be optimized to 6 and the resulting binary will most likely have a mov [dest], 6 in it.
