PCL kd-tree implementation extremely slow - nearest-neighbor

I am using the Point Cloud Library (PCL) C++ implementation of kd-tree nearest-neighbour (NN) search. The data set contains about 2.2 million points, and I am searching for the NN points of every point. The search radius is set to 2.0. Computing that for the whole cloud takes about 12 hours! I am using a 64-bit Windows machine with 4 GB of RAM. Is this kind of runtime common for kd-tree searching? I wonder if there is another C++ library for 3D point cloud data that is faster. I have heard about the ANN C++ library and CGAL, but I am not sure how fast they are.
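The processing loop is essentially of the following shape (a minimal sketch assuming PCL's pcl::KdTreeFLANN and pcl::PointXYZ; the random fill is just a placeholder, not my real data):

// Sketch of a per-point radius search with PCL's kd-tree.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/kdtree/kdtree_flann.h>
#include <cstdlib>
#include <vector>

int main()
{
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    for (int i = 0; i < 1000; ++i)                       // placeholder for the real 2.2M points
        cloud->push_back(pcl::PointXYZ(std::rand() / float(RAND_MAX),
                                       std::rand() / float(RAND_MAX),
                                       std::rand() / float(RAND_MAX)));

    pcl::KdTreeFLANN<pcl::PointXYZ> kdtree;
    kdtree.setInputCloud(cloud);

    const double radius = 2.0;
    std::vector<int> indices;
    std::vector<float> sqr_distances;                    // squared distances

    for (std::size_t i = 0; i < cloud->size(); ++i)      // NN query for every point
        kdtree.radiusSearch(cloud->points[i], radius, indices, sqr_distances);

    return 0;
}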

In short:
You can only be sure if you run the time measurements yourself.
You should check whether the NN library is actually faster than brute force, which is probably the case for your data.
As anderas mentioned in the comments, the search radius also plays a significant role: if a lot of points fall inside the search radius, the search can become really slow.
Full answer:
Three dimensions are not much; the problem comes from the number of points you have.
ANN stands for approximate nearest neighbour searching. It is very common to accept a trade-off between accuracy and speed when it comes to NNS (nearest neighbour search).
This means that you perform the search faster, but you may not find the exact NN, only a close one (for example not the closest point, but the second closest one, and so on). More speed means less accuracy and vice versa.
CGAL also has a parameter epsilon, which stands for accuracy (ε = 0 means full accuracy). CGAL is meant to go up to 3 dimensions, so you could give it a shot.
I could just test it myself and post the results. However, that would not be 100% safe, since your points may have some relation to each other. Whether the library can exploit the relation the points (may) have is very important for its performance.
Another factor to take into account is ease of installation.
CGAL is hard to install. When I did it, I followed these steps.
ANN is easy to install.
I would also suggest taking a look at Boost.Geometry, which may come in handy.
FLANN is also a strong player in the field, but I would not suggest it, since it is meant to handle datasets of much higher dimensionality (like 128 dimensions, for example).
....
OK I admit it, I am now curious myself and I am going to run some tests!
....
ANN
(I am not posting the full code, so that the answer does not get too big. There are examples in the documentation, and you can of course ask if you want!)
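For orientation, an ANN query generally has the following shape (a minimal sketch against the ANN headers, not the exact benchmark code; the random fill is only placeholder data and eps is the accuracy knob):

// Minimal ANN usage sketch: build a kd-tree and query the nearest neighbour of one point.
#include <ANN/ANN.h>
#include <cstdlib>

int main()
{
    const int nPts = 100000, dim = 3, k = 1;
    const double eps = 0.0;                          // 0 = exact NN, >0 = approximate (faster)

    ANNpointArray dataPts = annAllocPts(nPts, dim);
    for (int i = 0; i < nPts; ++i)                   // placeholder data
        for (int d = 0; d < dim; ++d)
            dataPts[i][d] = std::rand() / double(RAND_MAX);

    ANNkd_tree tree(dataPts, nPts, dim);             // tree construction

    ANNidx  nnIdx[k];
    ANNdist nnDist[k];                               // squared distances
    tree.annkSearch(dataPts[0], k, nnIdx, nnDist, eps);

    annDeallocPts(dataPts);
    annClose();
    return 0;
}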
Output:
// for a Klein bottle
samaras#samaras-A15:~/Inria/nn_libs/ANN/ann_1.1.2/code$ g++ main.cpp -O3 -I ../include/ -L../lib -lANN -std=c++0x -o myExe
samaras#samaras-A15:~/Inria/nn_libs/ANN/ann_1.1.2/code$ ./myExe
N = 1000000
D = 3
ε = 0 // full accuracy
Misses: 0
creation of the tree took 1.06638 seconds.
Brute force for min: 0.009985598 seconds.
NN for 0.0 0.000078551 seconds.
Speedup of ANN for NN = 8.721533780
For ε = 0.1 I got:
Misses: 1000
creation of the tree took 0.727613 seconds.
Brute force for min: 0.006351318 seconds.
NN for 0.0 0.000004260 seconds.
Speedup of ANN for NN = 8.678169573
// for a sphere
ε = 0
Misses: 0
creation of the tree took 1.28098 seconds.
Brute force for min: 0.006382912 seconds.
NN for 0.0 0.000022341 seconds.
Speedup of ANN for NN = 4.897436311
ε = 0.1
Misses: 1000
creation of the tree took 1.25572 seconds.
Brute force for min: 0.006482465 seconds.
NN for 0.0 0.000004379 seconds.
Speedup of ANN for NN = 5.144413371
Notice the difference in the speedup! This has to do with the nature of the datasets, as explained above (the relation the points have).
CGAL:
// Klein bottle
samaras#samaras-A15:~/code/NN$ ./myExe
ε = 0
SplitingRule = Sliding_midpoint, N = 1000000, D = 3
creation of the tree took 0.02478 seconds.
Tree statistics:
Number of items stored: 1000000
Number of nodes: 471492
Tree depth: 34
0x80591e4
Brute force for min: 0.007979495 seconds.
NN for 0.0 0.008085704 seconds.
Speedup of cgal for NN = 0.983849423
ε = 0.1
SplitingRule = Sliding_midpoint, N = 1000000, D = 3
creation of the tree took 0.02628 seconds.
Tree statistics:
Number of items stored: 1000000
Number of nodes: 471492
Tree depth: 34
0x80591e4
Brute force for min: 0.007852951 seconds.
NN for 0.0 0.007856228 seconds.
Speedup of cgal for NN = 0.996250305
// Sphere
samaras#samaras-A15:~/code/NN$ ./myExe
ε = 0
SplitingRule = Sliding_midpoint, N = 1000000, D = 3
creation of the tree took 0.025502 seconds.
Tree statistics:
Number of items stored: 1000000
Number of nodes: 465002
Tree depth: 32
0x80591e4
Brute force for min: 0.007946504 seconds.
NN for 0.0 0.007981456 seconds.
Speedup of cgal for NN = 0.992449817
samaras#samaras-A15:~/code/NN$ ./myExe
ε = 0.1
SplitingRule = Sliding_midpoint, N = 1000000, D = 3
creation of the tree took 0.025106 seconds.
Tree statistics:
Number of items stored: 1000000
Number of nodes: 465002
Tree depth: 32
0x80591e4
Brute force for min: 0.008115519 seconds.
NN for 0.0 0.008117261 seconds.
Speedup of cgal for NN = 0.996702679
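For completeness, a CGAL query generally looks like this (a minimal sketch with Orthogonal_k_neighbor_search and Search_traits_3, not the exact benchmark code; ε is passed to the search object and 0 means exact):

// Minimal CGAL neighbour-search sketch: build the tree from 3D points and query one NN.
#include <CGAL/Simple_cartesian.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Orthogonal_k_neighbor_search.h>
#include <cstdlib>
#include <vector>

typedef CGAL::Simple_cartesian<double>              Kernel;
typedef Kernel::Point_3                             Point_3;
typedef CGAL::Search_traits_3<Kernel>               Traits;
typedef CGAL::Orthogonal_k_neighbor_search<Traits>  Neighbor_search;
typedef Neighbor_search::Tree                       Tree;

int main()
{
    std::vector<Point_3> points;
    for (int i = 0; i < 100000; ++i)                 // placeholder data
        points.push_back(Point_3(std::rand() / double(RAND_MAX),
                                 std::rand() / double(RAND_MAX),
                                 std::rand() / double(RAND_MAX)));

    Tree tree(points.begin(), points.end());         // tree construction

    const unsigned int k = 1;
    const double eps = 0.0;                          // 0 = exact, >0 = approximate
    Neighbor_search search(tree, points[0], k, eps);

    for (Neighbor_search::iterator it = search.begin(); it != search.end(); ++it)
        (void)it->second;                            // it->second is the squared distance

    return 0;
}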
ANN is clearly faster than CGAL in my tests, and it probably will be in yours too!
A side-note:
You are actually asking for the NN of every point. However, the library doesn't know that and doesn't reuse the work already done for previous points, which is a pity. I am not aware of any library that does this, though.

Related

Big O Notation O(n^2) what does it mean?

For example, suppose selection sort sorts 3000 numbers in 1 second. How can we predict how many numbers it will sort in 10 seconds?
I know that selection sort is O(n^2), but I don't understand how to calculate how many numbers will be sorted in 10 seconds.
We cannot use big O to reliably extrapolate actual running times or input sizes (whichever is the unknown).
Imagine the same code running on two machines A and B, with different parsers, compilers, hardware, operating systems, array implementations, etc.
Let's say they both can parse and run the following code:
procedure sort(reference A)
    declare i, j, n, x
    i ← 1
    n ← length(A)
    while i < n
        x ← A[i]
        j ← i - 1
        while j >= 0 and A[j] > x
            A[j+1] ← A[j]
            j ← j - 1
        end while
        A[j+1] ← x
        i ← i + 1
    end while
end procedure
Now system A spends 0.40 seconds on the initialisation part before the loop starts, independent of what A is, because on that configuration the initiation of the function's execution context, including the allocation of the variables, is a very, very expensive operation. It also needs to spend 0.40 seconds on the de-allocation of the declared variables and the call stack frame when it arrives at the end of the procedure, again because on that configuration the memory management is very expensive. Furthermore, the length function is costly as well and takes 0.19 seconds. That's a total overhead of 0.99 seconds.
On system B this memory allocation and de-allocation is cheap and takes 1 microsecond. The length function is also fast and needs 1 microsecond. That's a total overhead of 2 microseconds.
System A is however much faster on the rest of the statements in comparison with system B.
Both implementations happen to need 1 second to sort an array A having 3000 values.
If we now take the reasoning that we could predict the array size that can be sorted in 10 seconds based on the results for 1 second, we would say:
𝑛 = 3000, and the duration is 1 second, which corresponds to 𝑛² = 9 000 000 operations. So if 9 000 000 operations correspond to 1 second, then 90 000 000 operations correspond to 10 seconds, and the new 𝑛 = √(90 000 000) ≈ 9 487 (the size of the array that can be sorted in 10 seconds).
However, if we follow the same reasoning, we can look at the time needed for completing the outer loop only (without the initialisation overhead), which also is O(𝑛²) and thus the same reasoning can be followed:
𝑛 = 3000, and the loop duration in system A is 0.01 second, which corresponds to 𝑛² = 9 000 000 operations. So if 9 000 000 operations can be executed in 0.01 second, then in 10 − 0.99 = 9.01 seconds (the overhead is subtracted) we can execute 9.01 / 0.01 = 901 times as many operations, i.e. 𝑛² = 901 × 9 000 000 = 8 109 000 000 operations, and now 𝑛 = √(8 109 000 000) ≈ 90 050.
The problem is that using the same reasoning on big O, the predicted outcomes differ by a factor of about 10!
We may be tempted to think that this is now only a "problem" of constant overhead, but similar things can be said about operations inside the outer loop. For instance, it might be that x ← A[i] has a relatively high cost for some reason on some system. These are factors that are not revealed in the big O notation, which only retains the most significant term, omitting the lower-order terms and constant factors that still play a role.
The actual running time for an actual input size is dependent on a more complex function that is likely close to polynomial, like 𝑛² + 𝑎𝑛 + 𝑏. These coefficients 𝑎, and 𝑏 would be needed to make a more reasonable prediction possible. There might even be function components that are non-polynomial, like 𝑛² + 𝑎𝑛 + 𝑏 + 𝑐√𝑛... This may seem unlikely, but systems on which the code runs may do all kinds of optimisations while code runs which may have such or similar effect on actual running time.
The conclusion is that this type of reasoning gives no guarantee that the prediction is anywhere near the reality -- without more information about the actual code, system on which it runs,... etc, it is nothing more than a guess. Big O is a measure for asymptotic behaviour.
As the comments say, big-oh notation has nothing to do with specific time measurements; however, the question still makes sense, because the big-oh notation is perfectly usable as a relative factor in time calculations.
Big-oh notation gives us an indication of how the number of elementary operations performed by an algorithm varies as the number of items to process varies.
Simple algorithms perform a fixed number of operations per item, but in more complicated algorithms the number of operations that need to be performed per item varies as the number of items varies. Sorting algorithms are a typical example of such complicated algorithms.
The great thing about big-oh notation is that it belongs to the realm of science, rather than technology, because it is completely independent of your hardware, and of the speed at which your hardware is capable of performing a single operation.
However, the question tells us exactly how much time it took for some hypothetical hardware to process a certain number of items, so we have an idea of how much time that hardware takes to perform a single operation, so we can reason based on this.
If 3000 numbers are sorted in 1 second, and the algorithm operates with O( N ^ 2 ), this means that the algorithm performed 3000 ^ 2 = 9,000,000 operations within that second.
If given 10 seconds to work, the algorithm will perform ten times that many operations within that time, which is 90,000,000 operations.
Since the algorithm works in O( N ^ 2 ) time, this means that after 90,000,000 operations it will have sorted Sqrt( 90,000,000 ) = 9,486 numbers.
To verify: 9,000,000 operations within a second means 1.11e-7 seconds per operation. Since the algorithm works at O( N ^ 2 ), this means that to process 9,486 numbers it will require 9,486 ^ 2 operations, which is roughly equal to 90,000,000 operations. At 1.11e-7 seconds per operation, 90,000,000 operations will be done in roughly 10 seconds, so we are arriving at the same result via a different avenue.
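That arithmetic can also be written down as a few lines of C++ (just the extrapolation above, under the same O( N ^ 2 ) assumption):

// Naive O(n^2) extrapolation: 3000 numbers in 1 second, how many in 10 seconds?
#include <cmath>
#include <cstdio>

int main()
{
    const double n1 = 3000.0, t1 = 1.0, t2 = 10.0;   // known size/time and the new time budget
    const double c  = t1 / (n1 * n1);                // seconds per elementary "operation"
    const double n2 = std::sqrt(t2 / c);             // size sortable within t2 under t = c*n^2
    std::printf("c ~= %.3e s/op, n(10 s) ~= %.0f\n", c, n2);
    return 0;
}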
If you are seriously pursuing computer science or programming I would recommend reading up on big-oh notation, because it is a) very important and b) a very big subject which cannot be covered in stackoverflow questions and answers.

Estimate the order of growth of algorithm from run time and ratio of change

I'm a beginner practising with algorithms. The list below shows the times and ratios of change I recorded for an algorithm I ran. I'm not sure how to figure out the order of growth from this list. What factors do I have to consider? I would very much appreciate an explanatory answer.
    N | seconds | ratio | log2(ratio)
-------------------------------------
  512 |    0.12 |  4.14 |        2.05
 1024 |    0.49 |  4.24 |        2.08
 2048 |    2.08 |  4.24 |        2.08
 4096 |    8.83 |  4.24 |        2.08
Compare the times for your smallest input to the various larger inputs:
A 2x increase in N (512->1024) results in a 4x increase in running time.
A 4x increase in N (512->2048) results in a 16x increase in running time.
An 8x increase in N (512->4096) results in a 64x increase in running time.
From this, you can extrapolate that a kx increase in N will result in a k²x increase in running time, indicating an O(n²) algorithm.
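The same reasoning can be automated with a tiny exponent estimator (a sketch using the (N, seconds) pairs from the question; the exponent is log(t2/t1) / log(n2/n1)):

// Estimate the growth exponent b in t(n) ~ c*n^b from consecutive doublings of N.
#include <cmath>
#include <cstdio>

int main()
{
    const double n[] = {512, 1024, 2048, 4096};
    const double t[] = {0.12, 0.49, 2.08, 8.83};
    for (int i = 1; i < 4; ++i) {
        double b = std::log(t[i] / t[i - 1]) / std::log(n[i] / n[i - 1]);
        std::printf("N: %4.0f -> %4.0f  exponent ~= %.2f\n", n[i - 1], n[i], b);
    }
    return 0;
}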

Finding the constant c in the time complexity of certain algorithms

I need help finding and approximating the constant c in the complexity of insertion sort (cn^2) and merge sort (cnlgn) by inspecting the results of their running times.
A bit of background, my purpose was to "implement insertion sort and merge sort (decreasing order) algorithms and measure the performance of these two algorithms. For each algorithm, and for each n = 100, 200, 300, 400, 500, 1000, 2000, 4000, measure its running time when the input is
already sorted, i.e. n, n-1, …, 3, 2,1;
reversely sorted 1, 2, 3, … n;
random permutation of 1, 2, …, n.
The running time should exclude the time for initialization."
I have done the code for both algorithms and put the measurements (microseconds) in a spreadsheet. Now, I'm not sure how to find this c due to differing values for each condition of each algorithm.
For reference, the time table:
   n |  InsertionSort           |  MergeSort
     |    AS     RS    Random   |    AS     RS   Random
 100 |    12    419       231   |   192    191      211
 200 |    13   2559      1398   |  1303   1299     1263
 300 |    20    236        94   |   113    113      123
 400 |    25    436       293   |   536    641      556
 500 |    32    504       246   |    91     81      105
1000 |    65   1991       995   |   169    246      214
2000 |     9   8186      4003   |   361    370      454
4000 |    17  31777     15797   |   774    751      952
I can provide the code if necessary.
It's hardly possible to determine the values of these constants, especially for modern processors that use caches, pipelines, and other "performance things".
Of course, you can try to find an approximation, and then you'll need Excel or any other spreadsheet.
Enter your data, create chart, and then add trendline. The spreadsheet calculates the values of constants for you.
The first thing to understand is that complexity and running time are not the same and may not have very much to do with each other.
Complexity is a theoretical measurement that gives an idea of how an algorithm slows down on bigger inputs compared to smaller inputs, or compared to other algorithms.
The running time depends on the exact implementation, the computer it is running on, the other programs that run on the same computer, and many other things. You will also notice that the running time takes a jump if the input is too big for your cache, and jumps again if it is also too big for your RAM. As you can see, for n = 200 you got some weird running times. This will not help you find the constants.
In cases where you don't have the code, you have no choice but to use the running times to approximate the complexity. Then you should use only big inputs (1000 should be the smallest input in your case). If your algorithm is deterministic, just input the worst case. Random cases can be good or bad, so you never learn anything about the real complexity from them. Another problem is that complexity counts "operations", so evaluating an if-statement and incrementing a variable count the same, while in running time an if needs more time than an increment.
So what you can do is plot your complexity and the values you measured and look for a factor that holds...
E.g. this is a plot of n² scaled by 1/500 together with the points from your chart.
First some notes:
you have very small n
Algorithmic complexity starts corresponding to runtime only if n is big enough. For n = 4000 that is only ~4 KB of data, which can still fit into most CPU caches, so increasing to at least n = 1,000,000 can and will change the relation between runtime and n considerably!
Runtime measurement
For random data you need an average runtime, not a single measurement, so for each n do at least 5 runs, each with a different dataset, and use the average time of all of them.
Now how to obtain c
If the program has complexity O(n^2), it means that for big enough n the runtime is:
t(n)=c*n^2
So take a few measurements. I chose the last 3 from your insertion sort with reverse-sorted input, because that should match the worst-case O(n^2) complexity if I am not mistaken, so:
c*n^2 = t(n)
c*1000^2 =  1.991 ms
c*2000^2 =  8.186 ms
c*4000^2 = 31.777 ms
solve the equations:
c = t(n)/(n^2)
c =  1.991/ 1000000 ms ≈ 1.991 ns
c =  8.186/ 4000000 ms ≈ 2.047 ns
c = 31.777/16000000 ms ≈ 1.986 ns
If everything is alright then the c for different n should be roughly the same. In your case it is around 2 ns (the times above are in milliseconds, so t/n² comes out in nanoseconds), but as I mentioned above, with increasing n this will change due to cache usage. Also, if any dynamic container is used, then you have to include the complexity of its usage in the algorithm, which can sometimes be significant!!!
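To make that concrete, here is a small sketch that just computes and averages t(n)/n² over those three measurements (times in microseconds, exactly as in the table):

// Estimate c in t(n) ~= c*n^2 from the measured (n, t) pairs.
#include <cstdio>

int main()
{
    const double n[] = {1000, 2000, 4000};
    const double t[] = {1991, 8186, 31777};          // microseconds, reverse-sorted input
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        double c = t[i] / (n[i] * n[i]);             // microseconds per n^2 "operation"
        std::printf("n = %4.0f: c ~= %.4f us (%.3f ns)\n", n[i], c, c * 1000.0);
        sum += c;
    }
    std::printf("average c ~= %.3f ns\n", sum / 3.0 * 1000.0);
    return 0;
}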
Take the case of 4000 elements and divide the time by the respective complexity estimate, 4000² or 4000 Lg 4000.
This is not worse than any other method.
For safety, you should check anyway that the last values align on a relatively smooth curve, so that the value for 4000 is representative.
As others commented, this is rather poor methodology. You should also consider the standard deviation of the running times, or even better, the histogram of running times, and cover a larger range of sizes.
On the other hand, getting very accurate values is not so important, as knowing the values of the constants is not really helpful for comparing the two algorithms.

Optimizing rank computation for very large sparse matrices

I have a sparse matrix such as
A =
(1,1) 1
(3,1) 1
(1,2) 1
(2,2) 1
(1,3) 1
(3,3) 1
(4,3) 1
(4,4) 1
The full matrix of A looks like the following:
full(A) =
1 1 1 0
0 1 0 0
1 0 1 0
0 0 1 1
I want to find the rank of matrix A in a fast way (because my matrix can grow to 10000 x 20000). I tried two ways, but they give different results:
Convert to full matrix and find rank using
rank(full(A)) = 3
Find the rank using sprank
sprank(A) = 4
The true answer must be 3, i.e. the first way. However, it takes a long time to find the rank, especially for matrices of large size. I know why the second way gives 4: sprank only tells you how many rows/columns of your matrix have non-zero elements, while rank reports the actual rank of the matrix, which indicates how many rows of your matrix are linearly independent. sprank(A) is 4 but rank(A) is only 3 because you can write the third row as a linear combination of the other rows, specifically A(1,:) - A(2,:).
My problem is how to find the rank of a sparse matrix with the lowest time consumption.
Update: I tried some other ways. However, they reported larger time consumption:
%% Create random matrix
G = sparse(randi(2,1000,1000))-1;
A = sparse(G); %% because my input matrix is a sparse matrix
%% Measure performance
>> tic; rank(full(A)); toc
Elapsed time is 0.710750 seconds.
>> tic; svds(A); toc
Elapsed time is 1.130674 seconds.
>> tic; eigs(A); toc
Warning: Only 3 of the 6 requested eigenvalues converged.
> In eigs>processEUPDinfo at 1472
In eigs at 365
Elapsed time is 4.894653 seconds.
I don't know which algorithm is best suited for you, and I agree that it may be more appropriate to ask on math.stackexchange.com. While I was trying with the random matrix you supplied, G = sparse(randi(2,1000,1000))-1;, I noticed that there is little chance that its rank is < 1000, and whatever algorithm you use, it is likely that its performance is very data-dependent. For instance, using eigs(G) on 2000-sample square matrices of rank (resp.) [198, 325, 503, 1026, 2000] yields the following performance (in seconds): [0.64, 0.90, 1.38, 1.57, 4.00], which shows that the performance of the eigs function is strongly related to the rank of the matrix.
I also searched for existing tools and gave a try to spnrank, which I think is not so data-dependent (it gives better performance than eigs for high ranks and worse if the rank is small).
In the end you may want to adapt your technical solution depending on the kind of matrices you are most likely to work with.

random number generator test

How will you test if the random number generator is generating actual random numbers?
My approach: first build a hash table of size M, where M is a prime number. Then take the numbers
generated by the random number generator and reduce them mod M,
and see whether they fill the whole table or just some part of it.
That's my approach. Can we prove it with visualization?
I have very little knowledge about testing, so can you suggest a thorough approach to this question? Thanks in advance.
You should be aware that you cannot guarantee that the random number generator is working properly. Note that even with a perfectly uniform distribution on the range [1,10], there is a 10⁻¹⁰ chance of getting the value 10 ten times in a row in a random sample of 10 numbers.
Is it likely? Of course not.
So - what can we do?
We can statistically prove that the combination (10,10,....,10) is unlikely if the random number generator is indeed uniformly distributed. This concept is called Hypothesis testing. With this approach we can say "with certainty level of x% - we can reject the hypothesis that the data is taken from a uniform distribution".
A common way to do it is using Pearson's Chi-Squared test. The idea is similar to yours - you fill in a table and check what the observed (generated) number of numbers is for each cell, and what the expected number of numbers is for each cell under the null hypothesis (in your case, the expected count is k/M, where M is the range's size and k is the total number of numbers drawn).
You then do some manipulation on the data (see the Wikipedia article for more info on what this manipulation is exactly) and get a number (the test statistic). You then check if this number is likely to come from a Chi-Square distribution. If it is, you cannot reject the null hypothesis; if it is not, you can say with x% certainty that the data is not taken from a uniform random generator.
EDIT: example:
You have a die (a cube), and you want to check if it is "fair" (uniformly distributed in [1,6]). Throw it 200 times (for example) and create the following table:
number:                 1     2     3     4     5     6
empirical occurrences: 37    41    30    27    32    33
expected occurrences:  33.3  33.3  33.3  33.3  33.3  33.3
Now, according to Pearson's test, the statistic is:
X = ((37-33.3)^2)/33.3 + ((41-33.3)^2)/33.3 + ... + ((33-33.3)^2)/33.3
X = (13.69 + 59.29 + 10.89 + 39.69 + 1.69 + 0.09) / 33.3
X ≈ 3.76
For a random C ~ ChiSquare(5), the probability of being higher than 3.76 is roughly 0.58 (which is not improbable).(1)
So we cannot reject the null hypothesis, and we can conclude that the data is probably uniformly distributed in [1,6].
(1) We usually reject the null hypothesis if this probability is smaller than 0.05, but this is very case dependent.
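The whole computation from this example fits in a few lines (a sketch only; the observed counts are the ones in the table above):

// Pearson chi-squared statistic for the 200-throw dice example.
#include <cstdio>

int main()
{
    const double observed[6] = {37, 41, 30, 27, 32, 33};
    const double expected    = 200.0 / 6.0;          // uniform null hypothesis
    double X = 0.0;
    for (int i = 0; i < 6; ++i) {
        double d = observed[i] - expected;
        X += d * d / expected;                       // sum of (O - E)^2 / E
    }
    std::printf("X = %.2f  (compare against a ChiSquare(5) distribution)\n", X);
    return 0;
}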
My naive idea:
The generator follows a distribution (at least it should). Do a reasonable number of runs, then plot the values on a graph and fit a regression curve to the points. If it correlates with the shape of the distribution, you're good. (This is also possible in 1D with projections and histograms, and it is fully automatable with the right tool, e.g. MATLAB.)
You can also use the diehard tests, as mentioned before; that is surely better but involves much less intuition, at least on your side.
Let's say you want to generate a uniform distribution on the interval [0, 1].
Then one possible test is:
counter ← 0
for i from 1 to sample-size
    when a < random-being-tested() < b
        counter ← counter + 1
return counter / sample-size
And see if the result is close to b - a (b minus a).
Of course you should define a function taking a, b between 0 and 1 as inputs and returning the difference between counter/sample-size and b - a. Loop through possible a, b, say over multiples of 0.01 with a < b. Print out a, b when the difference is larger than a preset epsilon, say 0.001.
Those are the a, b for which there are too many outliers.
If you let sample-size be 5000, your random-being-tested will be called about 5000 * 5050 times in total, which is hopefully not too bad.
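A direct implementation of that check could look like the following sketch (std::mt19937 merely stands in for the generator under test; the reporting threshold is chosen looser than 0.001 because with 5000 samples the sampling noise alone is on the order of 0.01):

// Interval-coverage test: for sub-intervals [a, b] of [0, 1], compare the observed
// fraction of samples inside against the expected b - a.
#include <cmath>
#include <cstdio>
#include <random>

int main()
{
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> rngUnderTest(0.0, 1.0);

    const int sampleSize = 5000;
    const double epsilon = 0.02;                     // report threshold for the deviation

    for (int ai = 0; ai < 100; ++ai) {
        for (int bi = ai + 1; bi <= 100; ++bi) {     // 5050 (a, b) pairs in total
            const double a = ai * 0.01, b = bi * 0.01;
            int counter = 0;
            for (int i = 0; i < sampleSize; ++i) {
                const double x = rngUnderTest(gen);
                if (a < x && x < b) ++counter;
            }
            const double deviation = std::fabs(counter / double(sampleSize) - (b - a));
            if (deviation > epsilon)
                std::printf("suspicious interval [%.2f, %.2f]: deviation %.4f\n", a, b, deviation);
        }
    }
    return 0;
}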
I had the same problem.
When I finished writing my code (using an external RNG engine),
I looked at the results and found that all of them failed the chi-square test whenever I had too many results.
My code generated random numbers and kept buckets counting how many results fell into each range.
I don't know why the chi-square test fails when I have a lot of results.
During my research I saw that C#'s Random.Next() fails for any range of random numbers and that some numbers have better odds than others; furthermore, I saw that the RNGCryptoServiceProvider random provider does not handle big numbers well.
When trying to get numbers in the range 0-1,000,000,000, the numbers in the lower range 0-300M had better odds of appearing...
As a result I am using the RNGCryptoServiceProvider, and if my range is higher than 100M I combine the number myself (RandomHigh*100M + RandomLow), where the ranges of both randoms are smaller than 100M, so it works.
Good luck!
