Problem with maximum index value of a vector [duplicate] - c++11

Why does the following work? I thought that writing to an index of a vector object beyond the end of the vector would cause a segmentation fault.
#include <iostream>
#include <vector>
using namespace std;
int main() {
    vector<int> x(1);
    x[10] = 1;
    cout << x[10] << endl;
}
What are the implications of this? Is there a safer way to initialize a vector of exactly n elements and write only to those? Should I always use push_back()?

Somebody implementing std::vector might easily decide to give it a minimum capacity of 10 or 20 elements or so, on the theory that the memory manager likely has a large enough minimum allocation size that it will use (about) the same amount of memory anyway.
As far as avoiding reading/writing past the end, one possibility is to avoid indexing whenever possible, and to use .at() when you truly can't avoid it.
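For instance, here is a minimal sketch of what .at() buys you, reusing the one-element vector from the question (the index 10 is only illustrative):

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> x(1);
    // x[10] = 1;          // no bounds check: undefined behaviour
    try {
        x.at(10) = 1;      // bounds-checked: throws std::out_of_range
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << "\n";
    }
}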
I find that I can usually avoid doing indexing at all by using range-based for loops and/or standard algorithms for most tasks. For a trivial example, consider something like this:
int main() {
    vector<int> x(1);
    x.push_back(1);
    for (auto i : x)
        cout << i << "\n";
}
.at() does work, but I rarely find it useful or necessary--I suspect I use it less than once a year on average.

Under the covers, what actually happens when you address an element of an array or vector outside the container's bounds is a dereference of memory that is not part of the container. Reads and writes to that location can appear to "work" because it is just more memory in your process, but the behaviour is undefined and you are doing something very, very bad. You will generally see random junk when you read out of bounds, because that memory may belong to another object or be left over from earlier allocations; nothing zeroes it out for you. So the best practice, whenever you are in doubt about whether an index is within bounds, is to check it against the size of the container; vector.size() gives you the current size.
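As a hedged sketch of that size check (the vector, its contents, and the index are all illustrative):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(5, 42);
    std::size_t i = 10;                      // an index that may or may not be valid
    if (i < v.size()) {                      // check against the container's size first
        std::cout << v[i] << "\n";
    } else {
        std::cout << "index " << i << " is out of bounds\n";
    }
}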

Related

Max size of a 2-dimensional array in C++

I want to run a large computational program in 2 and 3 dimensions, with arrays of size [40000][40000] or more. The code below illustrates my problem a bit. I commented out the vector version because it has the same problem; when I run it, it crashes inside the vector library. How can I increase the memory available to the program, or delete (clean up) part of it while the program is running?
#include <iostream>
#include <cstdlib>
#include <vector>
using namespace std;

int main() {
    float array[40000][40000];
    //vector< vector<double> > array(1000,1000);
    cout << "bingo" << endl;
    return 0;
}
A slightly better option than vector (and far better than vector-of-vector¹), which, like vector, uses dynamic allocation for the contents (and therefore doesn't overflow the stack), but doesn't invite resizing:
std::unique_ptr<float[][40000]> array{ new float[40000][40000] };
Conveniently, float[40000][40000] still appears, making it fairly obvious what is going on here even to a programmer unfamiliar with incomplete array types.
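For completeness, a quick usage sketch (the indices are arbitrary, and this still needs roughly 6.4 GB of memory, just like the array in the question):

#include <iostream>
#include <memory>

int main() {
    std::unique_ptr<float[][40000]> array{ new float[40000][40000]() };
    array[123][456] = 1.0f;                 // plain two-index access
    std::cout << array[123][456] << '\n';   // contiguous, heap-allocated storage
}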
¹ vector<vector<T> > is very bad, since it would have many different allocations, which all have to be separately initialized, and the resulting storage would be discontiguous. Slightly better is a combination of vector<T> with vector<T*>, with the latter storing pointers created one row apart into a single large buffer managed by the former.
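A hedged sketch of that footnote's single-buffer-plus-row-pointers idea (with much smaller dimensions so it actually fits in memory; all names are illustrative):

#include <cstddef>
#include <vector>

int main() {
    const std::size_t rows = 1000, cols = 1000;
    std::vector<float> storage(rows * cols);   // one contiguous allocation
    std::vector<float*> row(rows);             // pointers into that buffer, one per row
    for (std::size_t r = 0; r < rows; ++r)
        row[r] = storage.data() + r * cols;

    row[12][34] = 1.0f;                        // 2D-style access: row[r][c]
}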

Reduced efficiency of OpenMP over time

I have a large calculation algorithm where I use OpenMP to help speed it up. It works fine for the first 50 or so iterations (i.e. until y = 50 or so), but then starts to slow down progressively. I also notice that the CPU usage goes from ~100% to ~40% by the end.
The code looks something like this:
#include <cstdio>
#include <string>
#include <iostream>
#include <omp.h>
#include <ipp.h>

int main() {
    std::string filename = "Large_File.file";
    FILE* fid = fopen(filename.c_str(), "rb");
    Ipp32f* vector = ippsMalloc_32f(100000000);
    for (int y = 0; y < 300; y++) {
        fread(vector, sizeof(float), 100000000, fid);
        #pragma omp parallel for
        for (int x = 0; x < 300; x++) {
            // A time-consuming calculation
        }
    }
    ippsFree(vector);
    fclose(fid);
}
First of all, check whether this is exactly the same question, with the same answer, as this Stack Overflow link: Why does moving the buffer pointer slow down fread (C programming language)?
(It's not very far from what Hristo suggested)
Secondly, from the way you stated your question, and from common sense, the slowdown is most likely driven by the fread call (the only thing that varies as y increases, judging from your reduced example).
Third side comment: since you use IPP, you are likely using a recent (probably Intel) compiler. If so, I would suggest more modern ways of allocating memory in an aligned manner: either _aligned_malloc() or #include <aligned_new>.
(I assume you expected to hear more about OpenMP-specific slowdowns, but it is fairly unlikely that threading is involved in your case. You could verify this by disabling OpenMP and comparing the y=50, y=100, y=150, ... runs, although that will not prove much if you really are dealing with sophisticated NUMA/threading composability issues, which again I don't believe is the case.)
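One way to check the fread hypothesis yourself (my own hedged sketch, not taken from the answer above; it simply times each read with std::chrono and prints the result):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <iostream>

int main() {
    const std::size_t count = 100000000;             // same element count as the question
    FILE* fid = fopen("Large_File.file", "rb");
    if (!fid) return 1;
    float* buffer = new float[count];
    for (int y = 0; y < 300; y++) {
        auto t0 = std::chrono::steady_clock::now();
        std::size_t got = fread(buffer, sizeof(float), count, fid);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "y=" << y << ": read " << got << " floats in "
                  << std::chrono::duration<double>(t1 - t0).count() << " s\n";
        // ... the OpenMP parallel-for would go here ...
    }
    delete[] buffer;
    fclose(fid);
}

If the reported read times grow with y while the parallel section stays the same, the I/O, not OpenMP, is what is degrading.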

How to pass a list of arbitrary size to an OpenCL kernel

I have been fiddling with OpenCL recently, and I have run into a serious limitation: You cannot pass an array of pointers to a kernel. This makes it difficult to pass an arbitrarily sized list of, say, images to a kernel. I had a couple of thoughts toward this, and I was wondering if anybody could say for sure whether or not they would work, or offer better suggestions.
Let's say you had x image objects that you wanted to be passed to the kernel. If they were all only 2D, one solution might be to pack them all into a 3D image, and just index the slices. The problem with this is, if the images are different sizes, then space will be wasted, because the 3D image has to have the width of the widest image, the height of the tallest image, and the depth would be the number of images.
However, I was also thinking that when you pass a buffer object to a kernel, it appears in the kernel as a pointer. If you had a kernel that took an arbitrary data buffer and a buffer designated just for storing pointers, and then appended the pointer to the first buffer to the end of the second buffer (provided there was enough allocated space, of course), then maybe you could keep a buffer of pointers to other buffers on the device. This buffer could then be passed to other kernels, which would then, with some interesting casting, be able to access these arbitrary buffers on the device.

The only problem is whether or not a given buffer pointer would remain the same throughout the life of the buffer. Also, when you pass an image, you get a struct as an argument. Does this struct actually have a home in device memory, or is it around just long enough to be passed to the kernel? These things matter because they determine whether the pointer-buffer trick would work on images too, assuming it works at all.
Does anybody know if the buffer trick would work? Are there any other ways anybody can think of to pass a list of arbitrary size to a kernel?
EDIT: The buffer trick does NOT work. I have tested it. I am not sure why exactly, but the pointers on the device don't seem to stay the same from one invocation to another.
Passing an array of pointers to a kernel does not make sense, because the pointers would point to host memory, which the OpenCL device does not know anything about. You would have to transfer the data to a device buffer and then pass the buffer pointer to the kernel. (There are some more complicated options with mapped/pinned memory and especially in the case of APUs, but they don't change the main fact, that host pointers are invalid on the device).
I can suggest one approach, although I have never actually used it myself. If you have a large device buffer preallocated, you could fill it up with images back to back from the host. Then call the kernel with the buffer and a list of offsets as arguments.
This is easy, and I've done it. You don't use pointers so much as offsets, and you do it like this. In your kernel, you can provide two arguments:
kernel void my_kernel(
    global int *rowoffsets,
    global float *data
) {
Now, in your host, you simply take your 2D data, copy it into a 1D array, and put the index of the start of each row into rowoffsets.
For the last row, you add an additional rowoffset, pointing to one past the end of data.
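A hedged host-side sketch of that packing step (plain C++; the OpenCL buffer creation and kernel-argument calls are omitted, and all names are illustrative):

#include <vector>

int main() {
    // Ragged 2D input: rows of different lengths.
    std::vector<std::vector<float>> rows = { {1, 2, 3}, {4, 5}, {6, 7, 8, 9} };

    std::vector<float> data;        // flattened contents, row after row
    std::vector<int>   rowoffsets;  // index of the start of each row
    for (const auto& r : rows) {
        rowoffsets.push_back(static_cast<int>(data.size()));
        data.insert(data.end(), r.begin(), r.end());
    }
    rowoffsets.push_back(static_cast<int>(data.size()));  // one past the end of data

    // data and rowoffsets would now be copied into two device buffers
    // and passed to the kernel, along with N = rows.size().
}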
Then in your kernel, to read the data from a row, you can do things like:
kernel void my_kernel(
    global int *rowoffsets,
    global float *data,
    const int N
) {
    for( int n = 0; n < N; n++ ) {
        const int rowoffset = rowoffsets[n];
        const int rowlen = rowoffsets[n+1] - rowoffset;
        for( int col = 0; col < rowlen; col++ ) {
            // do stuff with data[rowoffset + col] here
        }
    }
}
Obviously, how you're actually going to assign the data to each workitem is up to you, so whether you're using actual loops, or giving each workitem a single row and column is part of your own application design.

OpenCL reduction, and passing a 2D array

Here is the loop I want to convert to OpenCL.
for (n = 0; n < LargeNumber; ++n) {
    for (n2 = 0; n2 < SmallNumber; ++n2) {
        A[n] += B[n2][n];
    }
    Re += A[n];
}
And here is what I have so far, although I know it is not correct and is missing some things.
__kernel void openCL_Kernel( __global int *A,
                             __global int **B,
                             __global int *C,
                             __global _int64 Re,
                             int D)
{
    int i = get_global_id(0);
    int ii = get_global_id(1);
    A[i] += B[ii][i];
    //barrier(..); ?
    Re += A[i];
}
I'm a complete beginner to this type of thing. First of all, I know that I can't pass a global double pointer to an OpenCL kernel. If you can, wait a few days or so before posting the solution; I want to figure this out for myself, but if you can help point me in the right direction I would be grateful.
Concerning your problem with passing double pointers: that kind of problem is typically solved by copying the whole matrix (or whatever you are working on) into one contiguous block of memory and, if the rows have different lengths, passing another array which contains the offsets of the individual rows (so your access would look something like B[index[ii]+i]).
Now for your reduction down to Re: since you didn't mention what kind of device you are working on, I'm going to assume it's a GPU. In that case I would avoid doing the reduction in the same kernel, since it's going to be slow as hell the way you posted it (you would have to serialize the access to Re over thousands of threads, and the access to A[i] too).
Instead I would write one kernel which sums all B[*][i] into A[i], and put the reduction from A into Re in another kernel, done in several steps: you use a reduction kernel which operates on n elements and reduces them to something like n / 16 (or any other number), and then iteratively call that kernel until you are down to one element, which is your result. (I'm making this description intentionally vague, since you said you wanted to figure things out yourself.)
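To keep that intentionally vague description concrete without spelling out the kernel, here is a hedged, plain-C++ illustration of the multi-pass idea only (each pass plays the role of one kernel launch; the factor 16 is just the example number from the text):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<long long> a(1000, 1);   // stand-in for A on the device

    // Each pass reduces groups of 16 elements to one partial sum,
    // like one launch of a reduction kernel; repeat until one value is left.
    while (a.size() > 1) {
        std::vector<long long> partial((a.size() + 15) / 16, 0);
        for (std::size_t i = 0; i < a.size(); ++i)
            partial[i / 16] += a[i];
        a.swap(partial);
    }
    std::cout << "Re = " << a[0] << "\n";   // prints 1000
}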
As a sidenote: you realize that the original code doesn't exactly have a nice memory access pattern? Assuming B is relatively large (and much larger than A, due to the second dimension), having the inner loop iterate over the outer index is going to create a lot of cache misses. This is even worse when porting to the GPU, which is very sensitive to coherent memory access.
So reordering it like this may massively increase performance:
for (n2 = 0; n2 < SmallNumber; ++n2)
    for (n = 0; n < LargeNumber; ++n)
        A[n] += B[n2][n];

for (n = 0; n < LargeNumber; ++n)
    Re += A[n];
This is particularly true if you have a compiler which is good at autovectorization, since it might be able to vectorize that construct, but it's very unlikely to be able to do so for the original code (and if it can't prove that A and B[n2] don't refer to the same memory, it can't turn the original code into this).

CUDA Add Rows of a Matrix

I'm trying to add the rows of a 4800x9600 matrix together, resulting in a matrix 1x9600.
What I've done is split the 4800x9600 into 9,600 matrices of length 4800 each. I then perform a reduction on the 4800 elements.
The trouble is, this is really slow...
Anyone got any suggestions?
Basically, I'm trying to implement MATLAB's sum(...) function.
Here is the code, which I've verified works fine; it's just really slow:
void reduceRows(Matrix Dresult, Matrix DA)
{
    // split DA into chunks
    Matrix Dchunk;
    Dchunk.h = 1; Dchunk.w = DA.h;
    cudaMalloc((void**)&Dchunk.data, Dchunk.h * Dchunk.w * sizeof(float));

    Matrix DcolSum;
    DcolSum.h = 1; DcolSum.w = 1;
    //cudaMalloc((void**)&DcolSum.data, DcolSum.h * DcolSum.w * sizeof(float));

    int i;
    for (i = 0; i < DA.w; i++)    // loop over each column
    {
        //printf("%d ", i);
        cudaMemcpy(Dchunk.data, &DA.data[i * DA.h], DA.h * sizeof(float), cudaMemcpyDeviceToDevice);
        DcolSum.data = &Dresult.data[i];
        reduceTotal(DcolSum, Dchunk);
    }
    cudaFree(Dchunk.data);
}
Matrix is defined as:
typedef struct {
    long w;
    long h;
    float* data;
} Matrix;
reduceTotal() just calls the standard NVIDIA reduction; it sums all the elements in Dchunk and puts the answer in DcolSum.
I'm about to do all this on the CPU if I can't find an answer... ;(
Many thanks in advance,
Instead of looping over each column, parallelize over the columns. Each of the 9600 threads sums the 4800 entries in its column and puts the sum in the appropriate place in the result vector.
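A hedged sketch of that per-column kernel (assuming, as the questioner's cudaMemcpy stride suggests, that the matrix is stored column-major with each 4800-element column contiguous; the kernel name and launch configuration are illustrative):

// One thread per column: thread `col` sums the `height` entries of its column.
__global__ void sumColumns(const float* data, float* result, int height, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;

    float sum = 0.0f;
    for (int row = 0; row < height; ++row)
        sum += data[col * height + row];   // column-major: column `col` is contiguous
    result[col] = sum;
}

// Possible launch for the 4800x9600 case:
//   sumColumns<<<(9600 + 255) / 256, 256>>>(DA.data, Dresult.data, 4800, 9600);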
If you're looking for a library to make working with Cuda simpler, I highly recommend Thrust: http://code.google.com/p/thrust/
Using Thrust, I would create a functor to hold your matrix's pointer in device memory, and then map it over a sequence of column indices. The operator() of the functor would take an index, sum up everything in that column of the matrix, and return the sum. Then you would have your sum sitting in a thrust::device_vector without any memory copies (or even direct CUDA calls).
Your functor might look something like:
struct ColumnSumFunctor {
    const Matrix matrix;

    // Make a functor to sum the matrix
    ColumnSumFunctor(const Matrix& matrix);

    // Compute and return the sum of the specified column
    __device__
    float operator()(const int& column) const;
};
Reduction is a very basic operation in GPGPU; it's supposed to be fast, and doing 9600 reductions shouldn't be slow either.
What graphics card are you using?
I suggest you split it into 9600 arrays, each time reducing an array of 4800 elements to one result. Instead of reduceTotal, I suggest you use CUDPP to perform the reduction; CUDPP is like the STL for CUDA, and it's implemented with performance in mind.
http://code.google.com/p/cudpp/
I think your problem is that you are launching 9600X2 kernels. This should be an easy algorithm to express as a single kernel.
The most naive way to implement it would not coalesce memory, but it could well be faster than the way you are doing it now.
Once you've got the naive way working, then coalesce your memory reads: e.g. have every thread in a block read 16 consecutive floats into shared memory, syncthreads, then accumulate the relevant 16 floats into a register, syncthreads, then repeat.
The Computing SDK has lots of examples of reduction techniques.
