Tuple vs. StaticVectors in Julia - data-structures

If I understand correctly, since tuples are immutable in Julia, they must also be stack allocated (similar to StaticVectors). So there should not be any advantage to using StaticVectors in place of Tuples when I am dealing with small vectors, say a length-3 vector for the coordinates of a particle. Can someone highlight the advantages of using StaticVectors in such cases? And more broadly, what are the use cases where I would possibly want to choose one over the other?

The raw performance is similar, since StaticArrays are built on tuples. The point of StaticArrays is all the functionality, the linear algebra, the solvers, sorting, the mutable arrays, etc.
Tuples are a barebones data collection with barely any mathematical structure. That's fine as far as it goes, but StaticArrays has done most of the work you would have to do yourself with tuples.


Algorithm for 2D nearest-neighbour queries with dynamic points

I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that @gsamaras said, just to add a few things:
In my experience (using large datasets with >= 500,000 points), the kNN performance of KD-Trees is worse than that of pretty much any other spatial index by a factor of 10 to 100. I tested them (2 KD-Trees and various other indexes) on a large OpenStreetMap dataset. In the following diagram, the KD-Trees are called KDL and KDS, and the 2D dataset is called OSM-P (left diagram). The diagram is taken from this document; see the bullet points below for more information.
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either, they can be very fast in 2D, with excellent kNN performance for datasets < 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. It has implementations of quadtrees, R-star-tree, ph-tree, and others, all with a common API that also supports kNN. The library was written for TinSpin, which is a framework for testing multidimensional indexes. Some results can be found here (it doesn't really describe the test data, but 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points).
Depending on your scenario, you may also want to consider PH-Trees. They appear to be slower for kNN queries than R-Trees in low dimensionality (though still faster than KD-Trees), but they are faster for removal and updates than R-Trees. If you have a lot of removal/insertion, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent query and update performance regardless of the number of updates performed on it.
However this data structure is multi dimensional, and not specialized to lower dimensions (like the kd-tree).
Play with it in bkdtree.
Dynamic Quadtrees can also be a candidate, with O(log n) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D, however, we have octrees, and in a similar way the structure can be generalized to higher dimensions.
An implementation is QuadTree.
The R*-tree is another choice, but I agree with you on the generality. An r-star-tree implementation exists too.
A Cover tree could be considered as well, but I am not sure if it fits your description. Read more here, and check the implementation in CoverTree.
The kd-tree should still be considered, since its performance is remarkable in 2 dimensions, and its insertion complexity is logarithmic in size.
nanoflann and CGAL are just two implementations of it; the first requires no installation and the second does, but may be more performant.
In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).
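As a starting point for such a benchmark, here is a minimal sketch of a dynamic point quadtree in Python. It is not taken from any of the libraries mentioned above; the class name `QuadTree`, the `capacity` parameter, and the method names are assumptions made for illustration. It assumes all points lie inside the root square, and it never merges emptied nodes after removals, which a production implementation would probably want to do.

```python
class QuadTree:
    """Point quadtree over the square [x0, x0 + size] x [y0, y0 + size].

    Hypothetical illustration: points are (x, y) tuples assumed to lie
    inside the root square.
    """

    def __init__(self, x0, y0, size, capacity=4):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []      # points stored here while this node is a leaf
        self.children = None  # four sub-quadrants once the leaf has split

    def _child_for(self, p):
        half = self.size / 2
        ix = 1 if p[0] >= self.x0 + half else 0
        iy = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * iy + ix]

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split an overfull leaf; the size guard stops endless splitting
        # when many identical points are inserted.
        if len(self.points) > self.capacity and self.size > 1e-9:
            half = self.size / 2
            self.children = [
                QuadTree(self.x0 + dx * half, self.y0 + dy * half,
                         half, self.capacity)
                for dy in (0, 1) for dx in (0, 1)
            ]
            old_points, self.points = self.points, []
            for q in old_points:
                self._child_for(q).insert(q)

    def remove(self, p):
        """Remove one occurrence of p; return True if it was present."""
        if self.children is not None:
            return self._child_for(p).remove(p)
        if p in self.points:
            self.points.remove(p)
            return True
        return False

    def _box_dist2(self, p):
        # Squared distance from p to this node's square (0 if p is inside).
        dx = max(self.x0 - p[0], 0.0, p[0] - (self.x0 + self.size))
        dy = max(self.y0 - p[1], 0.0, p[1] - (self.y0 + self.size))
        return dx * dx + dy * dy

    def nearest(self, p, best=None):
        """Return (squared_distance, point) of the nearest stored point,
        or None if the tree is empty."""
        if best is not None and self._box_dist2(p) >= best[0]:
            return best  # this square cannot contain a closer point
        if self.children is None:
            for q in self.points:
                d2 = (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
                if best is None or d2 < best[0]:
                    best = (d2, q)
            return best
        # Visit the closest quadrants first so that pruning kicks in early.
        for child in sorted(self.children, key=lambda c: c._box_dist2(p)):
            best = child.nearest(p, best)
        return best
```

For moving points, the simplest use is `tree.remove(old)` followed by `tree.insert(new)`. Comparing the results of `nearest` against a brute-force scan over the same points is a cheap correctness check before timing anything.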

What is the best data structure for an AABB collision checking physics engine?

I need an engine which consists of a world populated with axis-aligned bounding boxes (AABBs). A continuous loop will be executed, doing the following:
for box_a in world
    box_a = do_something(box_a)
    for box_b in world
        if (box_a!=box_b and collides(box_a, box_b))
            collide(box_a, box_b)
            collide(box_b, box_a)
The problem with that is, obviously, that this is O(n^2). I have managed to make this loop much faster by partitioning the space into chunks, so it became:
for box_a in world
    box_a = do_something(box_a)
    for chunk in box_a.neighbor_chunks
        for box_b in chunk
            if (box_a!=box_b and collides(box_a, box_b))
                collide(box_a, box_b)
                collide(box_b, box_a)
This is much faster but a little crude. Given that there is such a faster algorithm with not a lot of effort, I'd bet there is a data structure I'm not aware of that generalizes what I've done here, providing much better scalability for this algorithm.
So, my question is: what is the name of this problem and what are the optimal algorithms and data-structures to implement it?
This is indeed a generic problem in computer science: space partitioning.
It's used in ray tracing, path tracing, raster rendering, physics, AI, games, and most likely in HPC, databases, matrix maths, various sciences (molecules, pharmacy...), and I bet thousands of other things.
There is no single best structure. I have a friend who did his master's on an algorithm to tessellate a point cloud coming out of a laser scanner (billions of data points), and in his case the best data structure was a mix of a collection of uniform 3D grids with some octrees.
For some people the kd-tree is the best; for others, BVH trees are the best.
I like the grid system, but it cannot work if the space is too wide, because all cells have to exist.
I once implemented a sparse grid system using a hash map. It worked, but I didn't bother to profile or investigate the performance, so I don't know whether it's an excellent way; I only know it's one way.
To do that, you make a KEY class, which is basically a hasher for 3D position vectors. First you apply an integer division to the coordinates to define the size of one grid cell. Then you naively combine all coordinates into one hash and provide a hash_value method or friend function, plus an equality operator, and then it's usable as a key in a hash map.
You can use a google::sparse_map or something along these lines. I personally used boost::unordered and it was enough in my case.
Then the thing to consider is the presence of an AABB in more than one cell. You can store a reference in every cell covered by your AABB; it's just something to be aware of in every algorithm: "there is no 1-1 relationship between cell references and AABBs." That's all.
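To make the sparse grid idea concrete, here is a minimal Python sketch rather than the C++ I actually used. A tuple of integer cell coordinates plays the role of the KEY class, since Python tuples already provide hashing and equality. The box layout `((min_x, min_y, min_z), (max_x, max_y, max_z))` and the function names are assumptions made for this example.

```python
import math
from collections import defaultdict
from itertools import combinations, product


def cells_covered(box, cell_size):
    """Yield the integer cell keys overlapped by an AABB.

    box is ((min_x, min_y, min_z), (max_x, max_y, max_z)); this layout is
    an assumption of the sketch.
    """
    lo, hi = box
    ranges = [
        range(math.floor(a / cell_size), math.floor(b / cell_size) + 1)
        for a, b in zip(lo, hi)
    ]
    return product(*ranges)


def overlaps(a, b):
    """True if two AABBs intersect on every axis (touching counts)."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))


def colliding_pairs(boxes, cell_size):
    """Return the set of index pairs (i, j), i < j, of overlapping boxes."""
    grid = defaultdict(list)  # sparse: only occupied cells exist
    for idx, box in enumerate(boxes):
        for key in cells_covered(box, cell_size):
            grid[key].append(idx)  # one reference per covered cell
    pairs = set()  # a set, because two boxes may share several cells
    for members in grid.values():
        for i, j in combinations(members, 2):
            if overlaps(boxes[i], boxes[j]):
                pairs.add((i, j))
    return pairs
```

Collecting pairs in a set is one way to handle the "no 1-1 relationship" caveat: a pair that shares several cells is only reported once. The cell size is a tuning knob; a value around the typical box size is a reasonable first guess, but that should be measured.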
good luck

Fast/Area optimised sorting in hardware (fpga)

I'm trying to sort an array of 8-bit numbers using VHDL.
I'm trying to find one method that optimises delay and another that uses less hardware.
The size of the array is fixed, but I'm also interested in extending the functionality to variable lengths.
I've come across 3 algorithms so far:
Batcher Parallel Method
Green Sort
Van Voorhis Sort
Which of these will do the best job? Are there any other methods I should be looking at?
There are a lot of research articles on the matter. You could try searching the web for them. I searched for "Sorting Networks" and came up with a lot of comparisons of different algorithms and how well they fit into an FPGA.
The algorithm you choose will greatly depend on which parameter is most important to optimize for, i.e. latency, area, etc. Another important factor is where the values are stored at the beginning and end of the sort. If they are stored in registers, all might be accessed at once, but if you have to read them from a memory with a limited width, you should consider that in your implementation as well, because then you will have to sort values in a stream, and rearrange that stream before saving it back to memory.
Personally, I'd consider something time-constant like merge sort, which takes the same time to sort a given array size regardless of the data, so you could easily schedule the sort for a fixed-size array. However, I'm not sure how well this scales or works with arbitrarily sized arrays. You'd probably have to set an upper limit on array size, and this approach works best if all data is stored in registers.
I read about this in a book by Knuth, and according to that book, Batcher's parallel merge sort is the fastest algorithm and also the most hardware efficient.
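To see why a sorting network suits hardware, it helps to look at the comparator list it produces: the sequence of compare-and-swap operations is fixed in advance and does not depend on the data. Below is a small Python sketch of Batcher's odd-even merge sort as a network generator, assuming the input length is a power of two. The function names are chosen for this example, and this is a software model only, not VHDL.

```python
def oddeven_merge(lo, hi, r):
    """Yield comparators that merge two sorted halves of indices lo..hi,
    looking at every r-th element."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)      # even subsequence
        yield from oddeven_merge(lo + r, hi, step)  # odd subsequence
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def oddeven_merge_sort_range(lo, hi):
    """Yield comparators that sort indices lo..hi (both inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort_range(lo, mid)
        yield from oddeven_merge_sort_range(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def oddeven_merge_sort(length):
    """Comparator sequence for `length` inputs (a power of two)."""
    yield from oddeven_merge_sort_range(0, length - 1)


def apply_network(values, comparators):
    """Software model: run each compare-and-swap in order."""
    for a, b in comparators:
        if values[a] > values[b]:
            values[a], values[b] = values[b], values[a]
    return values
```

Each `(a, b)` pair corresponds to one compare-and-swap element, and comparators that touch disjoint wires can sit in the same pipeline stage, so the comparator count relates to area and the number of stages to latency. For 8 inputs this generator yields 19 comparators.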

What is a good sorting algorithm on CUDA?

I have an array of structs and I need to sort this array according to a property of the struct (N). The object looks like this:
struct OBJ {
    int N;  // sort array of OBJ with respect to N
    OB *c;  // OB is another struct
};
The array size is small, about 512 elements, but the size of every element is big, so I cannot copy the array to shared memory.
What is the simplest "good" way to sort this array? I do not need a complex algorithm that requires a lot of time to implement (since the number of elements in the array is small); I just need a simple algorithm.
Note: I have read some papers about sorting algorithms using GPUs, but the speed gain in these papers only shows up when the size of the array is very big. Therefore I did not try to implement their algorithms, because the size of my array is small. I only need a simple way to sort my array in parallel. Thanks.
What do "big" and "small" mean?
By "big" I assume you mean something with >1M elements, and by "small" something small enough to actually fit in shared memory (probably <1K elements). If my understanding of "small" matches yours, I would try the following:
Use only a single block to sort the array (it can then be a part of some bigger CUDA kernel)
Bitonic sort is one good approach that can be adapted into a parallel algorithm.
Some pages on bitonic sort:
Bitonic sort (nice explanation, an applet to visualise it, and Java source which does not take too much space)
Wikipedia (a bit too short an explanation for my taste, but more source code: some abstract language and Java)
NVIDIA code Samples (a sample source in CUDA. I think it is a bit overfocused on killing bank conflicts. I believe simpler code may actually perform faster)
I once also implemented a bubble sort (lol!) for a single warp to sort arrays of 32 elements. Thanks to its simplicity it did not actually perform that badly. A well-tuned bitonic sort will still perform faster though.
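To show the structure of bitonic sort without any CUDA details, here is a sequential Python sketch that assumes a power-of-two length. The loop over `i` stands in for the threads of a single block; in a real kernel each `i` would be one thread, with a barrier such as `__syncthreads()` between consecutive `j` passes. That mapping is my assumption about how you would port it, not code taken from the NVIDIA sample.

```python
def bitonic_sort(a):
    """Sort list a in place (ascending); len(a) must be a power of two."""
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared elements
            for i in range(n):        # conceptually: one thread per i
                partner = i ^ j
                if partner > i:       # handle each pair exactly once
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending \
                        else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

For 512 elements this gives 9 merge sizes and 45 compare passes in total, each of which is fully parallel. For your structs you would compare on `N` and swap either whole elements or, more cheaply, indices.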
Use the sorting calls available in the CUDPP or the Thrust library.
If you use cudppSort, note that it only works with integers or floats. To sort your array of structures, you can first sort the keys along with an index array. Later, you can use the sorted index array to move the structures to their final sorted location. I have described how to do this for the cudppCompact compaction algorithm in a blog post here. The steps are similar for sorting an array of structs using cudppSort.
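The key-plus-index idea can be sketched in plain Python. This does not call CUDPP or Thrust; Python's built-in `sorted` stands in for the GPU key-value sort, and the function name and dictionary records are made up for the example.

```python
def sort_records_by_key(records, key):
    """Sort large records by sorting only (key, index) pairs.

    Step 1: extract the small integer keys and build an index array.
    Step 2: sort the keys together with their indices (on the GPU this
            would be the key-value sort; here sorted() stands in).
    Step 3: gather the records into their final positions.
    """
    keys = [key(r) for r in records]
    indices = range(len(records))
    sorted_pairs = sorted(zip(keys, indices))
    return [records[i] for _, i in sorted_pairs]


# Hypothetical records mirroring the OBJ struct from the question.
objs = [{"N": 42, "c": "payload-a"},
        {"N": 7, "c": "payload-b"},
        {"N": 19, "c": "payload-c"}]
by_n = sort_records_by_key(objs, key=lambda o: o["N"])
```

The point of the detour is that the sort itself only ever moves small (key, index) pairs; each big element is moved once, in the final gather.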
Why exactly are you heading towards CUDA? I mean, it smells like your problem is not one of those that CUDA is very good at. You just want to sort an array of 512 elements and let some pointers refer to another location. This is nothing fancy; use a simple serial algorithm for that, e.g. Quicksort, Heapsort or Mergesort.
Additionally, think about the overhead it takes to copy data from your heap/stack to your CUDA device. Using CUDA only makes sense when the calculations are intense enough that COMPUTING_TIME_ON_CUDA+COPY_DATA_FROM_HEAP_TO_CUDA_DEVICE+COPY_DATA_FROM_CUDA_DEVICE_TO_HEAP < COMPUTING_TIME_ON_HOST_CPU.
Besides, CUDA is immensely powerful at math calculations with big vectors and matrices and rather simple data types (numbers), because that is one of the problems that often arises on a GPU: calculating graphics.
Yes I would totally agree, the overhead of sorting small arrays (<5k elements) kills the possible speedup you will achieve with a "fine-tuned" parallel sorting algorithm implemented in CUDA. I would prefer CPU based sorting for such a small size...

What is the standard OCaml data structure with fastest iteration?

I'm looking for a container that provides the fastest unordered iteration through the encapsulated elements. In other words, "add once, iterate many times".
Is there one among OCaml's standard modules that is fast enough (such that further optimization of it would be useless)? Or some kind of third-party GPL-ready ones?
AFAIK there's just one OCaml compiler, so the concept of being fast is more or less clear...
...But after I saw a couple of answers, it appears it's not. Of course, there are plenty of data structures that allow O(n) iteration through a container of size n. But the task I'm solving is one of those where the difference between O(n) and O(2n) matters ;-).
I also see that Arrays and Lists keep unnecessary information about the order in which elements were added, which I don't need. Maybe in the "functional world" there exist data structures that can trade this information for a bit of iteration speed.
In C I would outright pick a plain array. The question is, what should I pick in OCaml?
You are unlikely to do better than built-in arrays and lists, since they are hand-coded in C, unless you bind to your own native implementation of an iterator. An array will behave almost exactly like an array in C (a contiguously allocated block of memory containing a sequence of element values), possibly with some extra pointer indirections due to boxing. Lists are implemented exactly how you would expect: as cells with a value and a "next" pointer. Arrays will give you the best locality for unboxed types (especially floats, which have a super-special unboxed implementation).
For information about the implementation of arrays and lists, see Section 18.3 of the OCaml manual and the files byterun/mlvalues.h, byterun/array.c, and byterun/alloc.c in the OCaml source code.
From the questioner: indeed, Array appeared to be the fastest solution. However, it only outperformed List by 7%. Maybe that was because the type of an array element was not plain enough: it was an algebraic type. Hashtbl performed 4 times worse, as expected.
So, I will pick Array and I'm accepting this one. Good.
To know for sure, you're going to have to measure. Based on the machine instructions the compiler is likely to generate, I would try an array, then a list.
Access to an array element requires a bounds check, address arithmetic, and a load.
Access to the head of a list requires a load, a test for the empty list, and a load at a known compile-time offset.
The details of which is faster probably depend on your application and what else is happening on your machine. They also depend on the type of elements; for example, if they are floating-point numbers, ocamlopt may be clever enough to make an unboxed array, which will save you a level of indirection.
Other common data structures like hash tables or balanced trees generally require that you allocate some context somewhere to keep track of where you are. With an array, keeping track requires only an integer index; with a list, keeping track requires a single pointer. I think this is going to be hard to beat in another data structure.
Finally please note that there may be only one OCaml compiler, but it has two back ends: bytecode and native code. Naturally if you care about this level of performance, you are using the native-code ocamlopt version. Right?
Please take measurements and edit the results into your question.
Don't forget about Bigarrays: they are the closest to C arrays (just a flat piece of memory), but cannot contain arbitrary OCaml values. Also consider switching bounds checking off (unsafe_set/get). And of course you should profile first.
The array - a linear piece of memory with the items visited in sequential order - best utilises the CPU's L1 data cache.
All common data structures are iterable in O(n) time, so the differences between data structures will only be constant (and very probably not significant).
At least lists and arrays allow iteration without significant overhead. I can't think of a situation where that would not be fast enough.
