Please correct me if this is not possible with C++, but here is the idea: I have a set of data on which I would like to perform addition (+), subtraction (-), and/or multiplication (*) at runtime. Currently I have three for loops to achieve this, which means it can be slow. I'd like to put all these operations into a single loop.
pseudocode:
Data ApplyOperations(const Data &a, ... const Data &n, OperatorA(), ..., OperatorN()) {
    for (size_t i = 0; i < a.size(); ++i)
        result[i] = a[i] OperatorA() ... n[i] OperatorN();
    return result;
}
This way, I can apply N operations in whatever order I want in a single loop. Can someone point me in the right direction to achieve this in C++11?
Thanks!
Basically you have two nested loops: one over an "array" of N sets and operators, and one over the M elements in each set.
From a complexity analysis point of view it makes no difference which loop is the outer one; the complexity is O(N*M) either way. However, if the data sets passed as arguments to your functions are in fact the same set, then on most modern architectures you will get much better performance by making the iteration over the data items the outer loop. The reason is the effect of caches: if you iterate over the same data over and over, you're going to have more cache misses, which have a very heavy effect on performance. Of course, if you're really passing a different data set for each operator, then there is no difference.
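For what it's worth, here is a minimal C++11 sketch of the single-loop idea. I assume Data is a std::vector<double> and model each step as an arbitrary binary operation; applyOperations, inputs and ops are my own names, not from the question:

#include <cstddef>
#include <functional>
#include <vector>

using Data  = std::vector<double>;
using BinOp = std::function<double(double, double)>;

// result[i] = (((in0[i] op0 in1[i]) op1 in2[i]) op2 ...), evaluated left to right.
// Expects inputs.size() == ops.size() + 1.
Data applyOperations(const std::vector<const Data*> &inputs,
                     const std::vector<BinOp> &ops)
{
    Data result(inputs.front()->size());
    for (std::size_t i = 0; i < result.size(); ++i) {
        double acc = (*inputs[0])[i];
        for (std::size_t k = 0; k < ops.size(); ++k)
            acc = ops[k](acc, (*inputs[k + 1])[i]);   // all operations in one pass over the data
        result[i] = acc;
    }
    return result;
}

The ops vector can be filled at runtime with std::plus<double>(), std::minus<double>(), std::multiplies<double>() or lambdas. Each std::function call has some overhead, so a variadic-template version would be faster still, but the single-loop structure stays the same.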
Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which takes a 2-D array as an argument and returns it.
I'm wondering whether one can claim that it is (or isn't) in general beneficial to transpose this array when calling the function, in order to work with column-wise operations instead of row-wise operations, or whether the transposing negates the benefits of the column-wise operations?
As an example, in R I have an object of class ts named y which has dimension n x p, i.e. I have p time series of length n.
I need to make some computations with y in Fortran, where I have two loops with the following kind of structure:
do i = 1, n
   do j = 1, p
      ! just an example, some row-wise operations on `y`
      x(i,j) = a*y(i,j)
      D = ddot(m,y(i,1:p),1,b,1)
      ! ...
   end do
end do
As Fortran (like R) uses column-major storage, it would be better to do the computations with a p x n array instead. So instead of
out <- .Fortran("something", y = array(y, dim(y)), x = array(0, dim(y)))
ynew <- out$y
x <- out$x
I could use
out <- .Fortran("something2", y = t(array(y, dim(y))), x = array(0, dim(y)[2:1]))
ynew <- t(out$y)
x <- t(out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
   do j = 1, p
      ! just an example, some column-wise operations on `y`
      x(j,i) = a*y(j,i)
      D = ddot(m,y(1:p,i),1,b,1)
      ! ...
   end do
end do
Does the choice of approach always depend on the dimensions n and p or is it possible to say one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is 1 to 10 in most cases.
More of a comment, but I wanted to put in a bit of code: under old-school F77 you would essentially be forced to use the second approach, as
y(1:p,i)
is simply a pointer to y(1,i), with the following p values contiguous in memory.
The first construct,
y(i,1:p)
is a list of values strided across memory, so it seems to require making a copy of the data to pass to the subroutine. I say "seems" because I haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think at best it's a wash; at worst this could really hurt. Imagine an array so large that you need to page-swap to access the whole vector.
In the end, the only way to answer this is to test it yourself.
----------edit
Did a little testing and confirmed my hunch: passing rows y(i,1:p) does cost you compared to passing columns y(1:p,i). I used a subroutine that does practically nothing in order to see the difference. My guess is that with any real subroutine the hit is negligible.
Btw (and maybe this helps in understanding what goes on), passing every other value in a column,
y(1:p:2,i), takes longer (orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half compared to passing a whole row.
(using gfortran 12..)
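The same contiguity issue exists in row-major languages, just with rows and columns swapped. As a rough illustration, here is a C++ sketch with a flat row-major buffer (the layout and function names are my own):

#include <cstddef>
#include <vector>

// Element (i, j) of an r x c matrix stored row-major sits at m[i*c + j].
void copyRow(const std::vector<double> &m, std::size_t c,
             std::size_t i, std::vector<double> &out)
{
    // Contiguous: the whole row is one block of memory.
    out.assign(m.begin() + i * c, m.begin() + (i + 1) * c);
}

void copyColumn(const std::vector<double> &m, std::size_t c,
                std::size_t j, std::vector<double> &out)
{
    // Strided: consecutive elements are c doubles apart, so each access
    // typically lands on a different cache line.
    out.clear();
    for (std::size_t k = j; k < m.size(); k += c)
        out.push_back(m[k]);
}

In column-major Fortran the roles are reversed: y(1:p,i) is the contiguous case and y(i,1:p) the strided one, which is exactly the copy cost measured above.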
This is a general programming question.
I have seen in a lot of posts that iterating through a 2D array via a double for loop is "horrible", "ugly", etc. Why is this?
Are arrays not an efficient data structure compared to dictionaries and such, and isn't a double for loop more efficient than a foreach or other alternatives?
Also, if you're using a 2D array you are often dealing with a 2D coordinate system. The x and y positions are already "built in" to the data structure as the indexes of the arrays (so you don't need to add, say, a tuple as a dictionary key), and by changing the for loop parameters you can very cheaply iterate through different parts of your grid while totally ignoring the parts you don't want to iterate through. For example, to avoid the "outer" rows and columns you could do:
for (int x = 1; x < Grid.GetLength(0)-1; x++)
{
    for (int y = 1; y < Grid.GetLength(1)-1; y++)
    {
        Grid[x,y].DoSomething();
    }
}
With a foreach you'd iterate through everything in the collection and then have something to check whether it is in the coordinate range you want.
Nothing wrong with 2 loops for iterating a 2D array, as long as this is what you really need to do.
One thing to note is performance - in general, the looping should follow the layout of the array in memory. E.g., if the 2D array is stored as a 1D memory buffer where row n is stored right after row n-1 (this is the common implementation in general-purpose languages), the outer loop should go through the rows and the inner one through the columns. This way cache misses are minimized.
In general, the effectiveness of array access compared to other methods depends entirely on the particular language implementation. Usually the array is the most primitive data structure, resulting in the fastest access. BTW, a dictionary is a generalization of the array concept.
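To make the loop-order point concrete, here is a small C++ sketch (the nested-vector grid and the function name are my own):

#include <cstddef>
#include <vector>

void scaleAll(std::vector<std::vector<int>> &grid, int factor)
{
    // Outer loop over rows, inner loop over columns: each row is contiguous,
    // so consecutive inner iterations touch adjacent memory and stay in cache.
    for (std::size_t row = 0; row < grid.size(); ++row)
        for (std::size_t col = 0; col < grid[row].size(); ++col)
            grid[row][col] *= factor;
}
// Swapping the loops (columns outside, rows inside) visits the same elements
// but jumps from row to row on every step, which causes far more cache misses.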
The setup: I have two arrays which are not sorted and are not of the same length. I want to see if one of the arrays is a subset of the other. Each array is a set in the sense that there are no duplicates.
Right now I am doing this subset check sequentially in a brute-force manner, so it isn't very fast. I have been having trouble finding any algorithms online that A) go faster and B) are in parallel. Say the maximum size of either array is N; then right now it is scaling something like N^2. I was thinking that maybe if I sorted them and did something clever I could bring it down to something like N log(N), but I'm not sure.
The main thing is that I have no idea how to parallelize this operation at all. I could do something like having each processor look at an equal share of the first array and compare those entries to all of the second array, but I'd still be doing N^2 work. I guess it would be better, though, since it would run in parallel.
Any Ideas on how to improve the work and make it parallel at the same time?
Thanks
Suppose you are trying to decide if A is a subset of B, and let len(A) = m and len(B) = n.
If m is a lot smaller than n, then it makes sense to me to sort A and then iterate through B, doing a binary search in A for each element to see if there is a match or not. You can partition B into k parts and have a separate thread iterate through each part doing the binary searches.
To count the matches you can do two things. Either you could have a num_matched variable incremented every time you find a match (you would need to guard this variable with a mutex though, which might hinder your program's concurrency) and then check whether num_matched == m at the end of the program. Or you could have another array or bit vector of size m, and have a thread set the k'th entry if it found a match for the k'th element of A; then at the end, you make sure this array is all 1's. (On second thought, a bit vector might not work out without a mutex, because threads might overwrite each other's updates when they load the integer containing the bit relevant to them.) The array approach, at least, would not need any mutex that could hinder concurrency.
Sorting would cost you m log(m), and then, if you only had a single thread doing the matching, that would cost you n log(m). So if n is a lot bigger than m, this would effectively be n log(m). Your worst case still remains N log(N), but I think concurrency would really help you here to make this fast.
Summary: Just sort the smaller array.
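Here is a minimal C++11 sketch of this plan, using std::thread and one flag array per thread so no mutex is needed (all the names are mine, and error handling is omitted):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Returns true if every element of a appears in b.
bool isSubset(std::vector<int> a, const std::vector<int> &b,
              unsigned numThreads = std::thread::hardware_concurrency())
{
    std::sort(a.begin(), a.end());                     // m log m
    if (numThreads == 0) numThreads = 1;

    // One flag vector per thread avoids any shared writes.
    std::vector<std::vector<char>> found(numThreads,
                                         std::vector<char>(a.size(), 0));
    std::vector<std::thread> workers;
    const std::size_t chunk = (b.size() + numThreads - 1) / numThreads;

    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end   = std::min(b.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                // Binary search in the sorted copy of a: (n/k) log m per thread.
                auto it = std::lower_bound(a.begin(), a.end(), b[i]);
                if (it != a.end() && *it == b[i])
                    found[t][it - a.begin()] = 1;
            }
        });
    }
    for (auto &w : workers) w.join();

    // a is a subset of b iff every element of a was matched by some thread.
    for (std::size_t j = 0; j < a.size(); ++j) {
        bool any = false;
        for (unsigned t = 0; t < numThreads; ++t) any = any || (found[t][j] != 0);
        if (!any) return false;
    }
    return true;
}

Each thread only writes into its own found[t], so the merge at the end is the only sequential part.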
Alternatively, if you are willing to convert A into a HashSet (or any equivalent set data structure that uses some sort of hashing + probing/chaining to give O(1) lookups), then you can do a single membership check in just O(1) amortized time, so you can do the whole thing in O(n) plus the cost of converting A into a set.
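A sketch of that alternative, here in C++ with std::unordered_set; counting the hits works because neither array contains duplicates:

#include <cstddef>
#include <unordered_set>
#include <vector>

// Returns true if a is a subset of b. Expected O(m + n) time.
bool isSubsetHash(const std::vector<int> &a, const std::vector<int> &b)
{
    std::unordered_set<int> setA(a.begin(), a.end());   // the "HashSet" of A
    std::size_t matched = 0;
    for (int x : b)
        if (setA.count(x))
            ++matched;       // x covers exactly one element of a (no duplicates anywhere)
    return matched == setA.size();
}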
I was looking into different sorting algorithms, and trying to think how to port them to GPUs when I got this idea of sorting without actually sorting. This is how my kernel looks:
__global__ void noSort(int *inarr, char *outarr, int size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < size)
        outarr[inarr[idx]] = 1;
}
Then on the host side, I am just printing the array indices where outarr[i] == 1. Now effectively, the above could be used to sort an integer list, and it may even be faster than algorithms which actually sort.
Is this legit?
Your example is essentially a specialized counting sort for inputs with unique keys (i.e. no duplicates). To make the code a proper counting sort you could replace the assignment outarr[inarr[idx]] = 1 with atomicAdd(outarr + inarr[idx], 1) (with outarr widened to an integer type) so duplicate keys are counted. However, aside from the fact that atomic operations are fairly expensive, you still have the problem that the complexity of the method is proportional to the largest value in the input. Fortunately, radix sort solves both of these problems.
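For reference, this is roughly what the (serial, CPU-side) counting sort being described looks like; it should make clear why the work grows with the largest key rather than only with the number of keys (maxValue and the function name are mine, and keys are assumed non-negative):

#include <cstddef>
#include <vector>

void countingSort(std::vector<int> &keys, int maxValue)
{
    std::vector<int> count(maxValue + 1, 0);   // one bucket per possible key value
    for (int k : keys)
        ++count[k];                            // the histogram step (the atomicAdd on the GPU)
    std::size_t pos = 0;
    for (int v = 0; v <= maxValue; ++v)        // sweep the buckets in key order
        for (int c = 0; c < count[v]; ++c)
            keys[pos++] = v;
}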
Radix sort can be thought of as a generalization of counting sort that looks at only B bits of the input at a time. Since integers of B bits can only take on values in the range [0,2^B) we can avoid looking at the full range of values.
Now, before you go and implement radix sort on CUDA I should warn you that it has been studied extensively and extremely fast implementations are readily available. In fact, the Thrust library will automatically apply radix sort whenever possible.
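As a rough sketch of what using Thrust looks like (compiled with nvcc; per the note above, for primitive key types such as int the sort is a radix sort under the hood):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

int main()
{
    int h[] = { 7, 3, 9, 1, 3, 8 };
    thrust::device_vector<int> d(h, h + 6);   // copy the keys to the GPU
    thrust::sort(d.begin(), d.end());         // sorts on the device
    thrust::host_vector<int> out = d;         // copy the sorted keys back
    for (int i = 0; i < 6; ++i) {
        int v = out[i];
        std::printf("%d ", v);                // prints: 1 3 3 7 8 9
    }
    return 0;
}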
I see what you're doing here, but I think it's only useful in special cases. For example, what if an element of inarr had an extremely large value? That would require outarr to have at least as many elements in order to handle it. What about duplicate numbers?
Supposing you started with an array with unique, small values within it, this is an interesting way of sorting. In general though, it seems to me that it will use enormous amounts of memory to do something that is already well-handled with algorithms such as parallel merge sort. Reading the output array would also be a very expensive process (especially if there are any large values in the input array), as you will essentially end up with a very sparse array.
Say I employ merge sort to sort an array of integers. Now I also need to remember the positions that the elements initially had in the unsorted array. What would be the best way to do this?
A very naive and space-consuming way to do this would be (in C) to maintain each number as a "structure" with another number storing its index:
struct integer {
    int value;
    int orig_pos;
};
But, obviously there are better ways. Please share your thoughts and solution if you have already tackled such problems. Let me know if you would need more context. Thank you.
Clearly, for an N-long array you do need to store N integers SOMEwhere -- the original position of each item, for example; any other way to encode "1 out of N!" possibilities (i.e., which permutation has in fact occurred) will also take at least O(N) space (since, by Stirling's approximation, log(N!) is about N log(N)...).
So, I don't see why you consider it "space consuming" to store those indices most simply and directly. Of course there are other possibilities (taking similar space): for example, you might make a separate auxiliary array of the N indices and sort THAT auxiliary array (based on the value at each index), leaving the original one alone. This means an extra level of indirection for accessing the data in sorted order, but it can save you a lot of data movement if you're sorting an array of large structures, so there's a performance tradeoff... but the space consumption is basically the same!-)
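A compact way to do the auxiliary-index-array version in C++ (shown with std::sort and a lambda for brevity; the remembering trick itself is independent of the sorting algorithm, and sortedOrder is my own name):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<std::size_t> sortedOrder(const std::vector<int> &data)
{
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));   // 0, 1, 2, ..., N-1
    std::sort(idx.begin(), idx.end(),
              [&data](std::size_t i, std::size_t j) { return data[i] < data[j]; });
    // data is untouched; data[idx[k]] is the k-th smallest value and idx[k]
    // is its original position.
    return idx;
}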
Is the struct such a bad idea? The alternative, to me, would be an array of pointers.
It feels to me that in this question you have to consider the age-old tradeoff: speed vs. size. In either case, you are keeping both a new representation of your data (the sorted array) and an old representation of your data (the way the array used to look), so inherently your solution will have some data replication. If you are sorting n numbers and you need to remember, after they are sorted, where those n numbers originally were, you will have to store n pieces of information somewhere; there is no getting around that.
As long as you accept that you are doubling the amount of space you need in order to keep this old data, then you should consider the specific application and decide what will be faster. One option is to just make a copy of the array before you sort it; however, resolving which element was where later might turn into an O(N) problem. From that point of view, your suggestion of adding another int to your struct doesn't seem like such a bad idea, if it fits with the way you will be using the data later.
This looks like a case where I would use an index sort. The following C# example shows how to do it with a lambda expression. I am new at using lambdas, but they can do some complex tasks very easily.
// first, some data to work with
List<double> anylist = new List<double>();
anylist.Add(2.18); // add a value
... // add many more values
// index sort
IEnumerable<int> serial = Enumerable.Range(0, anylist.Count);
int[] index = serial.OrderBy(item => (anylist[item])).ToArray();
// how to use
double FirstValue = anylist[index[0]];
double SecondValue = anylist[index[1]];
And, of course, anylist is still in the original order.
You can do it the way you proposed.
You can also retain a copy of the original unsorted array (which means you may use a sorting algorithm that is not in-place).
You can create an additional array containing only the original indices.
All three ways are equally space-consuming; there is no "better" way. You may use short instead of int to save space if your array won't get more than 65k elements (but be aware of structure padding with your suggestion).
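For completeness, here is a sketch of the first option (value plus original position kept together), written in C++ rather than plain C so it can be sorted with std::sort; with two ints there is typically no padding to worry about (the names are mine):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry {
    int value;
    int orig_pos;   // position in the unsorted array
};

std::vector<Entry> sortRemembering(const std::vector<int> &input)
{
    std::vector<Entry> entries(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        entries[i] = Entry{ input[i], static_cast<int>(i) };
    std::sort(entries.begin(), entries.end(),
              [](const Entry &a, const Entry &b) { return a.value < b.value; });
    // entries[k].value is the k-th smallest value, and entries[k].orig_pos is
    // where it came from in the unsorted input.
    return entries;
}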