least squares approximate 1D data by a given amount horizontal lines - approximation

The problem is following. I have 1D array of data, that i need to approximate by a given amount of horizontal lines (for example, by 3 lines) in the optimal way (so, the summary error becomes minimal). The method of approximation should be as fast as possible (so, i cannot take every horizontal line, approximate data set, extract it value from data set and approximate the rest by the reamaining set of lines). Now, i have no idea how to do it except slightly feeling that the solution of this problem is linked to the solution of the maximum subarray problem. Please, could you give me some advices how to solve it?

The least-squares approximation of a 1D set of data is by definition its arithmetic mean, so there's one of your lines. I'm not sure what criterion you'd want to use for the other two.

Related

how to find the nearest neighbor of a sparse vector

I have about 500 vectors,each vector is a 1500-dimension vector,
and almost every vector is very sparse-- I mean only about 30-70 dimension of the vector is not 0。
Now, the problom is that here is a given vetor,also 1500 dimension,and I need to compare it to the 500 vectors to find which of the 500 is the nearest one.(In euclidean distance).
There is no doubt that brute-force method is a solution , but I need to calculate the distance for 500 times ,which takes a long time.
Yesterday I read an article "Object retrieval with large vocabularies and fast spatial matching", it says using inverted index will help,its says:
but after my test, it made almost no sense, imagine a 1500-vector in which 50 of the dimension are not zero, when it comes to another one, they may always have the same dimension that are not zero. In other words, this algorithm can only rule out a little vectors, I still need to compare with many vectors left.
Thank you for your nice that you have read to here, my question is that:
1.will this algorithm make sense?
2.is there any other way to do what I want to do? such as flann or Kd-TREE?
but I want the exact accurate nearest neighbor, a approxiate one is not enough
This kind of index is called inverted lists, and is commonly used for text.
For example, Apache Lucene uses this kind of indexing for text similarity search.
Essentially, you use a columnar layout, and you only store the non-zero values. For on-disk efficiency, various compression techniques can be employed.
You can then compute many similarities using set operations on these lists.
k-d-trees cannot be used here. They will be extremely inefficient if you have many duplicate (zero) values.
I don't know your context but if you don't care of having a long preprocess step and you have to make this check often and fast, you can build a neighborhood graph and sorting neighbors by distances.
To efficiently build this graph you can perform a taxicab distance or a square distance to sort the points by distances (This will avoid heavy calculations).
Then if you want the nearest neighbor you just have to pick the first neighbor :p.

Fast method of tile fitting/arranging

For example, let's say we have a bounded 2D grid which we want to cover with square tiles of equal size. We have unlimited number of tiles that fall into a defined number of types. Each type of tile specifies the letters printed on that tile. Letters are printed next to each edge and only the tiles with matching letters on their adjacent edges can be placed next to one another on the grid. Tiles may be rotated.
Given the size of the grid and tile type definitions, what is the fastest method of arranging the tiles such that the above constraint is met and the entire/majority of the grid is covered? Note that my use case is for large grids (~20 in each dimension) and medium-large number of solutions (unlike Eternity II).
So far, I've tried DFS starting in the center and picking the locations around filled area that allow the least number of possibilities and backtrack in case no progress can be made. This only works for simple scenarios with one or two types. Any more and too much backtracking ensues.
Here's a trivial example, showing input and the final output:
This is a hard puzzle.
The Eternity 2 was a puzzle of this form with a 16 by 16 square grid.
Despite a 2 million dollar prize, no one found the solution in several years.
The paper "Jigsaw Puzzles, Edge Matching, and Polyomino Packing: Connections and Complexity" by Erik D. Demaine, Martin L. Demaine shows that this type of puzzle is NP-complete.
Given a problem of this sort with a square grid I would try a brute force list of all possible columns, and then a dynamic programming solution across all of the rows. This will find the number of solutions, and with a little extra work can be used to generate all of the solutions.
However if your rows are n long and there are m letters with k tiles, then the brute force list of all possible columns has up to mn possible right edges with up to m4k combinations/rotations of tiles needed to generate it. Then the transition from the right edge of one column to the right edge of the next next potentially has up to m2n possibilities in it. Those numbers are usually not worst case scenarios, but the size of those data structures will be the upper bound on the feasibility of that technique.
Of course if those data structures get too large to be feasible, then the recursive algorithm that you are describing will be too slow to enumerate all solutions. But if there are enough, it still might run acceptably fast even if this approach is infeasible.
Of course if you have more rows than columns, you'll want to reverse the role of rows and columns in this technique.

How to find the nearest line segment to a specific point more efficently?

This is a problem I came across frequently and I'm searching a more effective way to solve it. Take a look at this pics:
Let's say you want to find the shortest distance from the red point to a line segment an. Assume you only know the start/end point (x,y) of the segments and the point. Now this can be done in O(n), where n are the line segments, by checking every distance from the point to a line segment. This is IMO not effective, because in the worst case there have to be n-1 distance checks till the right one is found.
This can be a real performance issue for n = 1000 f.e. (which is a likely number), especially if the distance calculation isn't just done in the euclidean space by the Pythagorean theorem but for example by a geodesic method like the haversine formula or Vincenty's.
This is a general problem in different situations:
Is the point inside a radius of the vertices?
Which set of vertices is nearest to the point?
Is the point surrounded by line segments?
To answer these questions, the only approach I know is O(n). Now I would like to know if there is a data structure or a different strategy to solve these problems more efficiently?
To make it short: I'm searching a way, where the line segments / vertices could be "filtered" somehow to get a set of potential candidates before I start my distance calculations. Something to reduce the complexity to O(m) where m < n.
Probably not an acceptable answer, but too long for a comment: The most appropriate answer here depends on details that you did not state in the question.
If you only want to perform this test once, then there will be no way avoid a linear search. However, if you have a fixed set of lines (or a set of lines that does not change too significantly over time), then you may employ various techniques for accelerating the queries. These are sometimes referred to as Spatial Indices, like a Quadtree.
You'll have to expect a trade-off between several factors, like the query time and the memory consumption, or the query time and the time that is required for updating the data structure when the given set of lines changes. The latter also depends on whether it is a structural change (lines being added or removed), or whether only the positions of the existing lines change.

Faster ways to search points close to a line

I have a set of points and a line in 2D space. I need to find all points that lie within a distance D from the line. Is there a way for me to do this without having to actually compute distances di of all points from the line? Is there a solution better than linear search?
Edit: I need to search through the same point set for different lines multiple times. The points are always constant but the line would be different during each search. Typically the point set is of the order of tens of thousands (~50k).
As for queries:
If you create a kd-tree using the points, and use a few equidistant points (likely around d) points on the line, you should be able use a modifed nearest neighbor query to find all points that are withing d of a line in roughly O(k + log(N)). The kd-tree requires O(N log N) preprocessing through, so it's only better if you use the same point set (with perhaps slight differences as you can add/remove a point from a kd-tree in O(log N)) and different lines. The only issue is that a kd-tree isn't really meant for use with lines. I'm sure there is something like it for lines that would work better, but I'm not familiar with it.
Note: False positives and negatives are possible depending on how things are arranged, as you are really querying the distance from a point on the line instead of the distance on the line. How problematic this is largely depends on the ratio between the length of the line and d. Thus you are either going to get a fair number false positives or false negatives unless the majority of the points are no where near the line. In general, this probably won't be too much of an issue though, as even with the false positives k should be fairly small compared to N unless d is relatively large.
After a bit of review, I noticed that the query is against a line not line segment. It can however be converted into one by make the line segment bounded by the min/max x/y. I imagine there is still probably a more efficent way to use a kd-tree for this.
This search can't be completed faster than linear search, as input data and result population has the same complexity.

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to more carefully choose your ranges for each parameter. 100**.3 isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parmeters b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2 on second thought I think using a constant in the model is pointless, all that matters is the shape.

Resources