Cutting Dimension in KD-Tree - nearest-neighbor

Recently I have been reading about KD-trees, where we choose a cutting dimension (whether arbitrarily or by variance) to split the data and build the tree.
Now I am wondering: is it possible to use more than a single dimension as the cutting dimension in a KD-tree? For example, some combination of two dimensions at each level, rather than just one.
If the idea is a bad one, I would like to know why.
PS: I just want to know whether there is related research or information on this topic that I can refer to, as I can't seem to find any.

Splitting in several dimensions can be a very good idea. Have a look at quadtrees: they split on all dimensions at every level, and they are also widely used.
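To make the contrast concrete, here is a rough Python sketch (illustrative classes of my own, not from any library): a KD-tree node cuts on one dimension per level, while a quadtree node cuts on both dimensions at once and gets four children.

    # KD-tree: one cutting dimension per level, chosen by cycling through the axes.
    class KDNode:
        def __init__(self, points, depth=0):
            k = len(points[0])                     # dimensionality
            self.axis = depth % k                  # the single cutting dimension
            points = sorted(points, key=lambda p: p[self.axis])
            mid = len(points) // 2
            self.point = points[mid]
            self.left = KDNode(points[:mid], depth + 1) if points[:mid] else None
            self.right = KDNode(points[mid + 1:], depth + 1) if points[mid + 1:] else None

    # Quadtree: every split cuts on BOTH dimensions at once (2-D only).
    class QuadNode:
        def __init__(self, points, bounds, capacity=4):
            x0, y0, x1, y1 = bounds
            self.bounds, self.children, self.points = bounds, [], points
            if len(points) > capacity:
                cx, cy = (x0 + x1) / 2, (y0 + y1) / 2      # cut on x AND y simultaneously
                buckets = [[], [], [], []]
                for p in points:
                    buckets[(p[0] >= cx) + 2 * (p[1] >= cy)].append(p)
                child_bounds = [(x0, y0, cx, cy), (cx, y0, x1, cy),
                                (x0, cy, cx, y1), (cx, cy, x1, y1)]
                self.children = [QuadNode(b, cb, capacity)
                                 for b, cb in zip(buckets, child_bounds) if b]
                self.points = []

    pts = [(0.1, 0.8), (0.4, 0.2), (0.9, 0.9), (0.7, 0.3), (0.2, 0.5)]
    kd = KDNode(pts)
    qt = QuadNode(pts, bounds=(0.0, 0.0, 1.0, 1.0))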

Related

Tiling Algorithm

I'm faced with a problem where I have to solve puzzles.
For example, I have a (variable) area of 20x20 (metres, say). There is a number of given pieces of varying sizes, such as 4x3, 4x2, 1x5, etc. These pieces can also be rotated, to add more pain to my problem. The point of the puzzle is to fill the entire 20x20 area with the given pieces.
What would be a good starting algorithm to achieve such a feat?
I'm thinking of using a heuristic that calculates the open space (for efficiency purposes).
Thanks in advance
That's an Exact Cover problem, usually with a nice structure too, depending on the pieces. I don't know of any heuristic algorithms, but there are several exact approaches that should work well.
As usual with Exact Covers, you can use Dancing Links, a way to implement Algorithm X efficiently.
Less generally, you can probably solve this with zero-suppressed decision diagrams (ZDDs); it depends on the tiles, though. As a bonus, you can represent all possible solutions and count them, or generate one with certain properties, all without ever explicitly storing the entire (usually far too large) set of solutions.
Binary decision diagrams (BDDs) would work about as well, just using more nodes to accomplish the same thing: the solutions are very sparse (each uses only a few of the possible tile placements), which suits ZDDs, whereas BDDs handle symmetry better than sparseness.
Or you could turn it into a SAT problem; you get less information that way (no solution count, for example), but it can be faster if there are easy solutions.
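To make the exact-cover formulation concrete, here is a hedged Python sketch (plain recursive Algorithm X rather than Dancing Links, and all names are illustrative): every possible placement of a piece becomes a row that covers a set of grid cells, and a solution picks rows so that each cell is covered exactly once. It assumes a piece type may be reused; if each piece may be used only once, you would also add one column per piece.

    from itertools import product

    def placements(width, height, pieces):
        """Every axis-aligned placement (with 90-degree rotation) as a set of covered cells."""
        rows = []
        for piece_id, (w, h) in enumerate(pieces):
            for pw, ph in {(w, h), (h, w)}:
                for x, y in product(range(width - pw + 1), range(height - ph + 1)):
                    cells = frozenset((x + i, y + j) for i in range(pw) for j in range(ph))
                    rows.append((piece_id, cells))
        return rows

    def solve(cells_left, rows, partial=()):
        """Naive Algorithm X: cover every remaining cell exactly once; returns one tiling or None."""
        if not cells_left:
            return partial
        # Branch on a cell with few candidate placements (the usual Algorithm X heuristic).
        cell = min(cells_left, key=lambda c: sum(c in r for _, r in rows))
        for piece_id, r in rows:
            if cell in r and r <= cells_left:
                result = solve(cells_left - r, rows, partial + ((piece_id, r),))
                if result is not None:
                    return result
        return None

    # Tiny example: tile a 4x3 area with 2x3 pieces.
    area = frozenset(product(range(4), range(3)))
    print(solve(area, placements(4, 3, pieces=[(2, 3)])))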

Choose N samples from dataset for best distribution

This question may have been asked before, but I couldn't find it. I am not sure how to word it properly.
My situation is this:
I have a large dataset and I want to choose a subset of samples that best represents the data.
To be more specific, I have many points within a unit (hyper)cube, and I want to choose N points that have the widest/smoothest coverage.
I'm thinking this must be a well-known problem (colour quantisation, for example). However, there is one extra constraint: the N samples must be members of the original dataset, not mean values of clusters or anything like that.
Also, I don't care about efficiency, accuracy is far more important.
Thanks.

Matlab - distinguish overlapping low-contrast objects in an RGB or grayscale image

I have a big problem detecting objects within an image. I know this topic has already been discussed extensively in many forums, but I have spent the last 4 days searching for an answer and was not able to find one.
In fact: I have a picture of a branch (http://cl.ly/image/343Y193b2m1c). My goal is to count every single needle in this picture. So I have to face several problems:
Separate the branch with its needles from the background (which in this case is no problem).
Select the borders of the needles. This is a huge problem; I tried different approaches, including all the edge() methods, but the result is always the same: the borders around the needles are not closed, which leads to the last problem:
Needles are overlapping! This results in "squares between the needles" which, if I use imfill() or a similar function, are filled in instead of the needles. And the places where the needles are concentrated (many needles in one place) are nearly impossible to distinguish.
I tried watershed, contrast enhancement, k-means clustering, and imerode, imdilate and related functions with subsequent edge detection. I also tried filtering and smoothing the picture a bit to "unsharpen" the needles, so that not every small change in colour is recognized as a border (which is another problem).
I am relatively new to MATLAB, so I don't know what to look for. I tried to follow the MATLAB tutorial on nuclei detection, but with that I can only get all the green objects (i.e. all needles at once).
I hope this question has not come up before; if it has, I apologize deeply for the double post. If anybody has an idea what to do or what methods to use, it would be awesome and would save this really bad start to the week.
Thank you very much in advance,
Phillip
Distinguishing overlapping objects is very, very hard, particularly if you do not know how many objects you have to distinguish. Your brain is much better at distinguishing overlapping objects than any segmentation algorithm I'm aware of, since it is able to integrate a lot of information that is difficult to encode. Therefore: If you're not able to distinguish some of the features yourself, forget about doing it via code.
Having said that, there may be a way to get an approximate count of the needles: if you can segment the image pixels into two classes, "needle" versus "not needle", and you know how much area a single needle covers in your picture (it may help to include a ruler when you take the picture), you can then divide the number of "needle" pixels by the number of pixels covered by a single needle to estimate the total number of needles in the image. This will somewhat underestimate the needle count due to overlaps, and it will underestimate more the denser the needles are (more overlaps), but it should allow you to compare automatically between branches with many needles and branches with few needles, as well as to identify changes over time, should that be one of your goals.
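In code, the estimate is a one-liner; here is an illustrative Python sketch (the question is about MATLAB, but the arithmetic is the same there), where the mask file and the pixels-per-needle calibration are hypothetical values you would supply:

    import numpy as np

    needle_mask = np.load("needle_mask.npy")   # hypothetical boolean mask: True where a pixel is "needle"
    pixels_per_needle = 850.0                  # hypothetical calibration, e.g. measured from a few isolated needles

    estimate = needle_mask.sum() / pixels_per_needle
    print(f"Roughly {estimate:.0f} needles (an underestimate wherever needles overlap)")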
I agree with @Jonas - you've got yourself one HUGE problem.
Let me make a few suggestions.
First, along @Jonas' direction: instead of getting an accurate count, another way of getting a rough estimate is to count the tips of the needles. Obviously, not all the tips are clearly visible, but if you can get a clean mask of the branch, it might be relatively easy to identify the tips of the needles using some of the morphological operations you mentioned yourself.
Second, is there any way you can get more information? For example, depth information might help a little in distinguishing the needles from one another (it will not completely solve the task, but it may help). You may get depth information from stereo, that is, by taking two pictures of the branch while moving the camera a bit. If you have a Kinect device (or some other range camera) at your disposal, you can get a depth map directly.

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass values (well, m/z, but that's not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group the entries that are related, based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be grouped together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine in advance how many clusters there will be. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? Sadly, I am not familiar with clustering algorithms.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that is the step where you decide which feature, or combination of features, you will be clustering on.
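For example, a hedged sketch with scikit-learn's DBSCAN (assuming that library is an option); the per-feature scaling and eps are illustrative values you would tune, and they are exactly where the distance-metric decision lives:

    import numpy as np
    from sklearn.cluster import DBSCAN

    data = np.array([[337.65, 1524.6],
                     [337.65, 1524.6],
                     [337.65, 1604.3]])

    # Scale each feature so "close" means the same thing for m/z and time,
    # e.g. treat 0.1 m/z units and 120 time units as one unit of distance each.
    scaled = data / np.array([0.1, 120.0])

    labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(scaled)
    print(labels)   # rows with the same label are grouped; here all three fall in one cluster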
Why don't you just set a threshold?
If successive values (by time) do not differ by more than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For the simple one-dimensional case, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One good way to select K is to plot K against the residual variance and pick the K after which the variance stops dropping "dramatically". Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
You can extend K-means to multi-dimensional data easily, but you should be careful about the scaling of the individual dimensions. E.g. among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) at these scales; but once you start expressing the second dimension in metres, the opposite is probably true.
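A hedged scikit-learn sketch of both points, K-means plus scaling (the data, K and scaler choice are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    data = np.array([[337.65, 1524.6], [337.65, 1524.6], [337.65, 1604.3],
                     [512.30,  980.0], [512.31,  981.5]])

    # Without scaling, the time column dominates the Euclidean distance.
    scaled = StandardScaler().fit_transform(data)

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(scaled)
    print(labels)   # the 337.65 readings land in one cluster and the 512.3x readings in the other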

anything better than bounding boxes?

I have a scenario where I have x million longitude/latitude points.
When a new long/lat point is added, I want to know efficiently which other points are within a user-configured distance, so I can add them to a list.
got anything better than bounding boxes?
I would love to see algorithms, references and a few implementations ;) thank you kindly!
There are quite a few options that are better, mostly based around space partitioning.
A common, and often very good, option (which isn't too tough to implement) is to use a KD-tree. Quadtrees are easier to implement but slower to search. Depending on the distribution of your data and your requirements, other space-partitioning structures may perform better, have lower memory requirements, or trade off other related concerns.
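For example, with SciPy's cKDTree (assuming SciPy is acceptable; the radius and data here are illustrative, and this treats lon/lat as flat coordinates, so dateline wrap-around and pole distortion are ignored):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.uniform([-180, -90], [180, 90], size=(1_000_000, 2))   # lon, lat
    tree = cKDTree(points)

    new_point = [13.4, 52.5]                                # the point being added
    nearby = tree.query_ball_point(new_point, r=0.5)        # indices of points within 0.5 degrees
    print(len(nearby))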
A colleague told me that he had good experience using Morton codes as a spatial index on GIS data; maybe that is something worth investigating.
This quick-and-dirty approach may save you some grief: divide the surface of the earth into 1-degree boxes. You will then have a 180x360 element array, and you only need to search a small number of boxes: the box containing the new point, plus the surrounding boxes for which one of the corners is within the user-specified distance. You will find there are some tricks you can use to quickly figure out which boxes to check without considering them all. Just don't forget latitude and longitude wrap-around.
If you "only" have millions of points, and they aren't clustered into hot spots, that might get you through.
A theoretically superior way: you could map each point into three-dimensional space and then store the points in an octree, which would let you quickly find nearby points within an arbitrary distance. Of course, the straight-line distance in three-dimensional space is slightly different from the great-circle distance on the globe, so you will have to calculate a conversion factor; that should be simple, though. You don't mention an implementation language, but there is almost certainly a well-tested octree implementation for any language you are working in. If you don't mind pulling in third-party code, this solution is the way to go.
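A short sketch of that mapping and the conversion factor (illustrative Python; the Earth radius is the usual mean value, and the resulting chord length is what you would feed to the octree or 3-D KD-tree range query):

    import math

    EARTH_RADIUS_KM = 6371.0   # mean Earth radius; adjust if you need a different model

    def to_xyz(lat_deg, lon_deg, r=EARTH_RADIUS_KM):
        """Map a lat/lon point onto a sphere of radius r in 3-D space."""
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        return (r * math.cos(lat) * math.cos(lon),
                r * math.cos(lat) * math.sin(lon),
                r * math.sin(lat))

    def chord_for_arc(arc_km, r=EARTH_RADIUS_KM):
        """Straight-line (chord) distance in 3-D corresponding to a great-circle distance arc_km."""
        return 2.0 * r * math.sin(arc_km / (2.0 * r))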

Resources