Iterate twice through values in Reducer Hadoop - hadoop

I have read in a couple of places that the only way to iterate twice through the values in a Reducer is to cache those values.
But there is also a limitation in that case: all of the values must fit in main memory.
What if you need to iterate twice, but you don't have the luxury of caching the values in memory?
Is there some kind of workaround?
There may already be answers to this problem, but I'm new to Hadoop, so I'm hoping that a solution has been found since those questions were asked.
To be more concrete with my question, here is what I need to do:
The Reducer gets a certain number of points (for example, points in 3D space with x, y, z coordinates)
One random point among them should be selected - let's call it firstPoint
The Reducer should then find the point that is farthest from firstPoint; to do that it needs to iterate through all the values - this way we get secondPoint
After that, the Reducer should find the point farthest from secondPoint, so there's a need to iterate through the dataset again - this way we get thirdPoint
The distance from thirdPoint to all other points needs to be calculated
The distances from secondPoint to all other points and from thirdPoint to all other points need to be saved, so that additional steps can be performed.
Buffering these distances is not a problem, since each distance is just a double. The points themselves are the concern: a point could be a point in n-dimensional space with n coordinates, so caching all of the points could take up too much space.
My original question was how to iterate twice, but it is really more general: how can you iterate multiple times through the values in order to perform the steps above?

It might not work for every case, but you could try running more reducers so that each one processes a small enough amount of data that you could then cache the values into memory.
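If the data for each key does fit in memory once you have enough reducers, the caching itself is straightforward. Below is a minimal sketch using the Hadoop Java (org.apache.hadoop.mapreduce) API of what that could look like for the steps described above; the class name, the "x,y,z"-style text encoding of the points, and the helper methods are all assumptions made for illustration.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: caches all points for a key in memory, then scans
    // the cached list as often as needed (pick a random point, find the point
    // farthest from it, then the point farthest from that one).
    public class FarthestPointsReducer extends Reducer<Text, Text, Text, Text> {

        private final Random random = new Random();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // The Iterable can only be traversed once, so copy the points out first.
            List<double[]> points = new ArrayList<>();
            for (Text value : values) {
                String[] parts = value.toString().split(",");   // assumed "x,y,z,..." encoding
                double[] p = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    p[i] = Double.parseDouble(parts[i]);
                }
                points.add(p);
            }
            if (points.isEmpty()) {
                return;
            }

            double[] firstPoint = points.get(random.nextInt(points.size()));
            double[] secondPoint = farthestFrom(firstPoint, points);   // pass 1 over the cache
            double[] thirdPoint = farthestFrom(secondPoint, points);   // pass 2 over the cache

            // Further passes (distances from secondPoint and thirdPoint to every
            // point, etc.) simply reuse the same cached list.
            context.write(key, new Text(toCsv(secondPoint) + " | " + toCsv(thirdPoint)));
        }

        private static double[] farthestFrom(double[] origin, List<double[]> points) {
            double bestDistance = -1;
            double[] bestPoint = origin;
            for (double[] p : points) {
                double d = 0;
                for (int i = 0; i < origin.length; i++) {
                    double diff = p[i] - origin[i];
                    d += diff * diff;
                }
                if (d > bestDistance) {
                    bestDistance = d;
                    bestPoint = p;
                }
            }
            return bestPoint;
        }

        private static String toCsv(double[] p) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < p.length; i++) {
                if (i > 0) {
                    sb.append(",");
                }
                sb.append(p[i]);
            }
            return sb.toString();
        }
    }

This is still bounded by the memory available to each reducer, which is exactly the limitation from the question; what the extra reducers buy you is a smaller group of values per reducer, so the cache is more likely to fit.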

Related

Query from list of 2D data points (C++11)

I'm finding it hard to describe (and then search) what I want, so I will try here.
I have a list of 2D data points (time and distance). You could say it's like a vector of pairs. Although the data type doesn't matter, as I'm trying to find the best one now. It is/can be sorted on time.
Here is some example data to help me explain:
So I want to store a fairly large number of data points like the ones in the spreadsheet above. I then want to be able to query them.
So if I say get_distance(0.2); it would return 1.1. This is quite simple.
Something like a map sounds sensible here to store the data, with the time being the key. But then I come to the problem: what happens if the time I am querying isn't in the map, like below:
But if I say get_distance(0.45);, I want it to average between the two nearest points just like the line on the graph and it would return 2.
All I have in my head at the minute is to loop through the vector of data points, find the point with the closest time below the time I want and the point with the closest time above it, and average the two distances. That doesn't sound efficient, especially with a large number of data points (probably up to around 10,000, but possibly more), and I want to run this query fairly often.
If anyone has a nice data type or algorithm that would work for me and could point me in that direction I would be grateful.
The STL is the way to go.
If your query time is not in the data, you want the largest that is smaller and the smallest that is larger so you can interpolate.
https://cplusplus.com/reference/algorithm/lower_bound/
https://cplusplus.com/reference/algorithm/upper_bound/
Note that since your data is already sorted you do not need a map - a vector is fine and saves the time taken to populate the map
You can still achieve this with a std::map: building it is O(N log N), N being the number of stored points, and each query afterwards is only O(log N).
You can first query whether there is an exact match. Use something like std::map::find to achieve this.
If there is no exact match, you then want the two keys that "sandwich" the query: the largest key less than it and the smallest key greater than it.
To do this, use std::map::lower_bound (std::map::upper_bound behaves the same way when the key is absent) and save the iterator it returns. It points to the first element whose key is not less than the query, i.e. to the neighbour above it; the neighbour below is reached by stepping the iterator back once (std::prev(itr), or itr-- on a copy), after checking you are not at begin() or end().
find and lower_bound are each O(log N), and stepping the iterator is cheap, so every query costs O(log N) on top of the one-off O(N log N) construction, which should be efficient enough in your case.
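To make the floor/ceiling-plus-interpolation idea concrete, here is a small sketch. It is written in Java with a TreeMap (floorEntry/ceilingEntry play the role of the lower_bound/prev lookups described above); the class and method names are just placeholders, not part of the question.

    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch: time -> distance samples, with linear interpolation
    // between the two nearest samples when the queried time is not stored.
    public class DistanceLookup {
        private final TreeMap<Double, Double> samples = new TreeMap<>();

        public void add(double time, double distance) {
            samples.put(time, distance);
        }

        public double getDistance(double time) {
            Map.Entry<Double, Double> below = samples.floorEntry(time);   // greatest key <= time
            Map.Entry<Double, Double> above = samples.ceilingEntry(time); // least key >= time
            if (below == null) return above.getValue();   // before the first sample
            if (above == null) return below.getValue();   // after the last sample
            if (below.getKey().equals(above.getKey())) {
                return below.getValue();                  // exact match
            }
            // Linear interpolation between the two surrounding samples.
            double t = (time - below.getKey()) / (above.getKey() - below.getKey());
            return below.getValue() + t * (above.getValue() - below.getValue());
        }
    }

Each lookup is two O(log N) tree searches, so after the O(N log N) cost of loading the samples a query is O(log N); a sorted vector with binary search (lower_bound on the vector, as suggested above) gives the same bound with less memory overhead.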

Find optimal local alignment of two strings using local & global alignments

I have a homework question that I have been trying to solve for many hours without success; maybe someone can guide me toward the right way of thinking about it.
The problem:
We want to find an optimal local alignment of two strings S1 and S2. We know that such an alignment exists
in which the two aligned substrings of S1 and S2 both have length at most q.
We also know that the number of table cells holding the maximal value, opt, is at most r.
Describe an algorithm solving the problem in time O(mn + r*q^2) using working space of at most
O(n + r + q^2).
Restrictions: you may run the algorithm for finding the optimal local alignment value, with
additions of your choice (like the list of index pairs), only once. You may, however, run any variant of the algorithm for solving the global optimal alignment problem as many times as you wish.
I know how to solve this problem by running the local alignment many times and the global alignment only once, but not the other way around.
(The standard global alignment algorithm and local alignment algorithm were attached here.)
Any help would be appreciated.
The answer, in case someone is interested in this question in the future:
Compute the optimal local alignment score OPT of both strings in $O(mn)$ time and $O(n)$ space by maintaining just a single row of the DP matrix. (Since we are only computing the score and don't need to perform traceback to build the full solution, we don't need to keep the full DP matrix.) As you do so, keep track of the highest cell value seen so far, as well as a list of the coordinates $(i, j)$ of cells having this value. Whenever a new maximum is seen, clear the list and update the maximum. Whenever a cell $\ge$ the current maximum is seen (including in the case where we just saw a new maximum), add the coordinates of the current cell to the list. At the end, we have a list of endpoints of all optimal local alignments; by assumption, there are at most $r$ of these.
For each entry $(i, j)$ in the list:
Set $R1$ to the reverse of the substring $S1[i-q+1..i]$, and $R2$ to the reverse of $S2[j-q+1..j]$.
Perform optimal global alignment of $R1$ and $R2$ in $O(q^2)$ time, maintaining the full $O(q^2)$ DP matrix this time.
Search for the highest entry in the matrix (also $O(q^2)$ time; or you can perform this during the previous step).
If this entry is OPT, we have found a solution: Trace back towards the top-left corner from this cell to find the full solution, reverse it, and output it, and stop.
By assumption, at least one of the alignments performed in the previous step reaches a score of OPT. (Note that reversing both strings does not change the score of an alignment.)
Step 2 iterates at most $r$ times, and does $O(q^2)$ work each time, using at most $O(q^2)$ space, so overall the time and space bounds are met.
(A simpler way, that avoids reversing strings, would be to simply perform local alignments of the length-$q$ substrings, but the restrictions appear to forbid this.)
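For reference, here is a minimal Java sketch of the scoring pass in step 1: score-only Smith-Waterman over two reusable rows, recording the coordinates of every cell that attains the maximum. The scoring scheme (+1 for a match, -1 for a mismatch or gap) is an arbitrary choice for illustration.

    import java.util.ArrayList;
    import java.util.List;

    // Step 1 sketch: compute the optimal local alignment score in O(mn) time and
    // O(n) space, recording the (i, j) coordinates of every cell that attains it.
    public class LocalAlignmentScore {

        public static List<int[]> maximalCells(String s1, String s2) {
            int m = s1.length(), n = s2.length();
            int[] prev = new int[n + 1];
            int[] curr = new int[n + 1];
            int best = 0;
            List<int[]> bestCells = new ArrayList<>();

            for (int i = 1; i <= m; i++) {
                curr[0] = 0;
                for (int j = 1; j <= n; j++) {
                    int match = prev[j - 1] + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 1 : -1);
                    int del = prev[j] - 1;      // gap in s2
                    int ins = curr[j - 1] - 1;  // gap in s1
                    int score = Math.max(0, Math.max(match, Math.max(del, ins)));
                    curr[j] = score;

                    if (score > best) {         // new maximum: reset the list
                        best = score;
                        bestCells.clear();
                    }
                    if (score == best && best > 0) {
                        bestCells.add(new int[] { i, j });  // endpoint of an optimal local alignment
                    }
                }
                int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two rows
            }
            return bestCells;  // at most r entries, by assumption
        }
    }

Step 2 would then, for each returned (i, j), reverse the length-q suffixes ending there and run an ordinary full-matrix global alignment on them, as described above.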

How to do the final steps of the k-means algorithm

I'm having a problem with my program. I am trying to implement k-means (manually) by clustering a set of RGB values into 3 clusters. I don't need help with coding, just with understanding. So far I have done this:
Created 3 cluster objects, each with a mean and an array to hold that cluster's members.
Imported the text file and saved the RGB values in an array.
Looped over the array and, for each stored RGB value, found the cluster mean with the closest distance using Euclidean distance.
Added each RGB value to the array of the cluster with the closest mean.
I have done research and I can’t seem to understand the next step. Research has suggested:
In each cluster, add all of the RGB values together and divide by the number of values in that cluster, then update the mean with that value; or
Find the average distance between all RGB values and the mean, then update the mean with that value; or
Update the mean each time you add an RGB value to the cluster.....
I just can’t seem to understand the last steps, thanks.
Once you have a per-cluster array, recompute the "cluster center" as the average of the items in that cluster. Then, reassign every item to the appropriate cluster (the one whose "center" it's closest to after the recomputation). You're done when no item changes cluster (it's theoretically possible to end up in a situation where one item keeps flipping between two clusters, with generalized distance measures -- this can be detected to stop the loop anyway -- but I don't think it can happen with Euclidean distances).
In other words, that would be the first one of your three alternatives. I'm not even sure what you mean by the second alternative; the third one would likely not be stable, and would depend on the arbitrary order of items, so I feel strongly against it.
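A minimal sketch of that loop in Java, assuming the RGB values are stored as double[3] arrays and the three means have already been initialised somehow (for example, to three randomly chosen points); all names are hypothetical.

    import java.util.Arrays;

    // Hypothetical k-means loop for RGB triples (k = means.length, here 3).
    public class KMeansRgb {

        public static double[][] cluster(double[][] points, double[][] means) {
            int k = means.length;
            int[] assignment = new int[points.length];
            Arrays.fill(assignment, -1);   // no point assigned yet

            boolean changed = true;
            while (changed) {
                changed = false;

                // Assignment step: put each point in the cluster with the closest mean.
                for (int p = 0; p < points.length; p++) {
                    int best = 0;
                    for (int c = 1; c < k; c++) {
                        if (squaredDistance(points[p], means[c]) < squaredDistance(points[p], means[best])) {
                            best = c;
                        }
                    }
                    if (assignment[p] != best) {
                        assignment[p] = best;
                        changed = true;
                    }
                }

                // Update step: recompute each mean as the average of its members.
                for (int c = 0; c < k; c++) {
                    double[] sum = new double[3];
                    int count = 0;
                    for (int p = 0; p < points.length; p++) {
                        if (assignment[p] == c) {
                            for (int d = 0; d < 3; d++) sum[d] += points[p][d];
                            count++;
                        }
                    }
                    if (count > 0) {
                        for (int d = 0; d < 3; d++) means[c][d] = sum[d] / count;
                    }
                }
            }
            return means;
        }

        private static double squaredDistance(double[] a, double[] b) {
            double s = 0;
            for (int d = 0; d < a.length; d++) {
                double diff = a[d] - b[d];
                s += diff * diff;
            }
            return s;
        }
    }

The loop stops as soon as an assignment pass changes nothing, which is exactly the termination test described above.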

Collision detection complexity <O(n²): Simpler approach than grid, quadtrees, BSP?

I have a large number of objects (balls, for a start) that move stepwise in space, one at a time, and must not overlap. Currently, for every move I check for collision with every other object. Several other questions here deal with this; however, I thought of a seemingly simple solution that does not seem to come up in this context, and I wonder why.
Why not simply keep 2 collections (for 2D, or 3 in three dimensions) of all objects, sorted by the x and y (and z) coordinate, respectively, and at every move look up all other objects within a given distance (ball diameter here) in each dimension and do the actual collision check only on objects in both (or all 3) result sets?
I realize this only works for equally-sized objects, but alternatively one could use twice as many collections, sorted by the (1) highest and (2) lowest coordinate of each object for each dimension. Is there any reason why this would not work, or would give significantly less of an improvement than going from the O(n) pairwise check to the grid method or quad/octrees? I see the update of these sorted collections as the costly operation here, but using e.g. a TreeSet (my implementation would be in Java) it should still be significantly cheaper than O(n), right?
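For concreteness, here is a rough Java sketch of the kind of per-axis index described above, using TreeMap.subMap for the range lookups. It is simplified to assume distinct coordinates per axis; a real index would keep a list of objects per coordinate, or break ties in the key. All names are hypothetical.

    import java.util.HashSet;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    // One sorted index per axis; collision candidates are the objects that fall
    // inside the query range on both axes.
    public class AxisSortedIndex {
        private final NavigableMap<Double, Ball> byX = new TreeMap<>();
        private final NavigableMap<Double, Ball> byY = new TreeMap<>();

        public void insert(Ball b) {
            byX.put(b.x, b);
            byY.put(b.y, b);
        }

        public void remove(Ball b) {   // call before moving the ball, then insert it again
            byX.remove(b.x);
            byY.remove(b.y);
        }

        // All balls whose x AND y coordinates lie within 'range' of (x, y);
        // only these need the exact (and more expensive) collision test.
        public Set<Ball> candidates(double x, double y, double range) {
            Set<Ball> inX = new HashSet<>(byX.subMap(x - range, true, x + range, true).values());
            Set<Ball> result = new HashSet<>();
            for (Ball b : byY.subMap(y - range, true, y + range, true).values()) {
                if (inX.contains(b)) {
                    result.add(b);
                }
            }
            return result;
        }

        public static class Ball {
            double x, y;
        }
    }
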
The check for which objects are in both result sets involves looking at all objects in two strips of the plane. That is a much larger area, and therefore involves more objects, than the enclosing square that a quadtree lets you immediately narrow down to. More objects means that it is slower.
You want to use a spatial index or a space-filling curve instead of a quadtree. An SFC reduces the 2D problem to a 1D one and differs from a quadtree in that it can only store one object per (x, y) pair. Maybe this works for your problem? Search for Nick's Hilbert curve quadtree spatial index blog.

Possible to calculate closest locations via lat/long in better than O(n) time?

I'm wondering if there is an algorithm for calculating the nearest locations (represented by lat/long) in better than O(n) time.
I know I could use the Haversine formula to get the distance from the reference point to each location and sort ASC, but this is inefficient for large data sets.
How does the MySQL DISTANCE() function perform? I'm guessing O(n)?
If you use a kd-tree to store your points, you can do this in O(log n) time (expected) or O(sqrt(n)) worst case.
You mention MySQL, but there are some pretty sophisticated spatial features in SQL Server 2008, including a geography data type. There's some information out there about doing the types of things you are asking about. I don't know the spatial features well enough to talk about performance, but I doubt there is a bounded-time algorithm to do what you're asking; you might, however, be able to do some fast set operations on locations.
If the data set being searched is static, e.g., the coordinates of all gas stations in the US, then a proper index (BSP) would allow for efficient searching. Postgres has had good support since the mid 90's for 2-dimensional indexed data so you can do just this sort of query.
Better than O(n)? Only if you go the way of radix sort or store the locations with hash keys that represent the general location they are in.
For instance, you could divide the globe with latitude and longitude down to the minutes, enumerate the resulting areas, and make the hash for a location its area. When the time comes to get the closest location, you then only need to check at most 9 hash keys -- you can test beforehand whether an adjacent grid cell could possibly provide a closer location than the best found so far, thus decreasing the set of locations to compute the distance to. It's still O(n), but with a much smaller constant factor. Properly implemented you won't even notice it.
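A minimal Java sketch of that grid-hash idea (the cell size, the key encoding and the distance helper are all arbitrary choices for illustration); a query inspects at most the 3x3 block of cells around the query point and would have to widen the search if those cells turn out to be empty.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical grid index: locations are bucketed by a coarse lat/long cell key,
    // and a query only measures distances inside the 3x3 block of cells around it.
    public class GeoGrid {
        private final double cellSizeDegrees;
        private final Map<Long, List<double[]>> cells = new HashMap<>();

        public GeoGrid(double cellSizeDegrees) {
            this.cellSizeDegrees = cellSizeDegrees;
        }

        public void add(double lat, double lon) {
            cells.computeIfAbsent(key(lat, lon), k -> new ArrayList<>()).add(new double[] { lat, lon });
        }

        public double[] nearest(double lat, double lon) {
            double best = Double.MAX_VALUE;
            double[] bestLoc = null;
            long row = cellRow(lat), col = cellCol(lon);
            for (long dr = -1; dr <= 1; dr++) {          // at most 9 cells to inspect
                for (long dc = -1; dc <= 1; dc++) {
                    List<double[]> bucket = cells.get((row + dr) * 1_000_000L + (col + dc));
                    if (bucket == null) continue;
                    for (double[] loc : bucket) {
                        double d = haversineKm(lat, lon, loc[0], loc[1]);
                        if (d < best) { best = d; bestLoc = loc; }
                    }
                }
            }
            return bestLoc;  // null means: widen the search to further rings of cells
        }

        private long cellRow(double lat) { return (long) Math.floor(lat / cellSizeDegrees); }
        private long cellCol(double lon) { return (long) Math.floor(lon / cellSizeDegrees); }
        private long key(double lat, double lon) { return cellRow(lat) * 1_000_000L + cellCol(lon); }

        // Standard haversine great-circle distance in kilometres.
        private static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        }
    }
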
Or, if the data is in memory or otherwise randomly accessible, you could store it sorted by both latitude and longitude. You then use binary search to find the closest latitude and longitude in the respective data sets. Next, you keep reading locations with increasing latitude or longitude (ie, the preceding and succeeding locations), until it becomes impossible to find a closer location.
You know you can't find a closer location when the latitude of the next location to either side in the latitude-sorted data wouldn't be closer than the best found so far even if it had the same longitude as the point from which the distance is being calculated. A similar test applies to the longitude-sorted data.
This actually gives you better than O(n) -- closer to O(logN), I think, but does require random, instead of sequential, access to data, and duplication of all data (or the keys to the data, at least).
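Here is a rough Java sketch of the latitude-sorted half of that approach: binary-search to the query latitude, then walk outward in both directions, pruning as soon as the latitude gap alone rules out any closer point. The flat-earth distance helper is a simplification made for brevity; a real implementation would use the haversine formula and would run the symmetric search over the longitude-sorted copy as well.

    import java.util.Arrays;
    import java.util.Comparator;

    // Hypothetical sketch: locations[i] = { lat, lon }, sorted ascending by latitude.
    public class LatSortedSearch {

        public static double[] nearest(double[][] byLat, double lat, double lon) {
            int idx = Arrays.binarySearch(byLat, new double[] { lat, lon },
                    Comparator.comparingDouble((double[] p) -> p[0]));
            if (idx < 0) idx = -idx - 1;          // insertion point when not found

            double best = Double.MAX_VALUE;
            double[] bestLoc = null;
            int lo = idx - 1, hi = idx;
            while (lo >= 0 || hi < byLat.length) {
                // Walk upward while the latitude gap alone can still beat the best distance.
                if (hi < byLat.length && degreesToKm(byLat[hi][0] - lat) < best) {
                    double d = distanceKm(lat, lon, byLat[hi][0], byLat[hi][1]);
                    if (d < best) { best = d; bestLoc = byLat[hi]; }
                    hi++;
                } else {
                    hi = byLat.length;            // nothing further up can be closer
                }
                // Walk downward under the same pruning rule.
                if (lo >= 0 && degreesToKm(lat - byLat[lo][0]) < best) {
                    double d = distanceKm(lat, lon, byLat[lo][0], byLat[lo][1]);
                    if (d < best) { best = d; bestLoc = byLat[lo]; }
                    lo--;
                } else {
                    lo = -1;                      // nothing further down can be closer
                }
            }
            return bestLoc;
        }

        private static double degreesToKm(double latDegrees) {
            return Math.abs(latDegrees) * 111.0;  // roughly 111 km per degree of latitude
        }

        private static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
            // A real implementation would use haversine; a flat approximation keeps this short.
            double dLat = degreesToKm(lat2 - lat1);
            double dLon = degreesToKm(lon2 - lon1) * Math.cos(Math.toRadians(lat1));
            return Math.sqrt(dLat * dLat + dLon * dLon);
        }
    }
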
I wrote an article about finding the nearest line at DDJ a couple of years ago, using a grid (I call it quadrants). Using it to find the nearest point (instead of lines) would just be a reduction of it.
Using quadrants reduces the time drastically, although the complexity is not determinable mathematically (all points could theoretically lie in a single quadrant). A precondition of using quadrants/grids is that you have a maximum distance for the point searched. If you just look for the nearest point, without giving a maximum distance, you can't use quadrants.
In this case, have a look at A Template for the Nearest Neighbor Problem (Larry Andrews at DDJ), which has a retrieval complexity of O(log n). I did not compare the runtime of the two algorithms. Probably, if you have a reasonable maximum width, quadrants are better. The better general-purpose algorithm is the one from Larry Andrews.
If you are looking for the (1) closest location, there's no need to sort. Simply iterate through your list, calculating the distance to each point and keeping track of the closest one. By the time you get through the list, you'll have your answer.
Even better would be to introduce the concept of grids. You would assign each point to a grid. Then, for your search, first determine the grid you are in and perform your calculations on the points in the grid. You'll need to be a little careful though. If the test location is close to the boundary of a grid, you'll need to search those grid(s) as well. Still, this is likely to be highly performant.
I haven't looked at it myself, but Postgres does have a module dedicated to the management of GIS data.
In an application I worked on in a previous life, we took all of the data, computed its key for a quad-tree (for 2D space) or an oct-tree (for 3D space), and stored that in the database. It was then a simple matter of loading the values from the database (to prevent you having to recompute the quad-tree) and following the standard quad-tree search algorithm.
This does of course mean you will touch all of the data at least once to get it into the data structure. But persisting this data-structure means you can get better lookup speeds from then on. I would imagine you will do a lot of nearest-neighbour checks for each data-set.
(for kd-tree's wikipedia has a good explanation: http://en.wikipedia.org/wiki/Kd-tree)
You need a spatial index. Fortunately, MySQL provides just such an index, in its Spatial Extensions. They use an R-Tree index internally - though it shouldn't really matter what they use. The manual page referenced above has lots of details.
I guess you could do it theoretically if you had a large enough table to do this... and, secondly, perhaps caching correctly could get you a very good average case?
An R-Tree index can be used to speed spatial searches like this. Once created, it allows such searches to be better than O(n).
