I'm trying to use t-SNE to arrange images based on their visual similarity, similar to this cool example for emoji (source):
But the output of t-SNE is just a "point cloud", while my goal is to display the images in a regular, near-square, dense grid. So I need to somehow convert the output of t-SNE to (x, y) locations on a grid.
So far, I've followed the suggestion in this great blog post: I formulated it as a linear assignment problem to find the best embedding into a regular grid. I'm pleased with the results, for example:
My problem is that the "snapping to grid" stage turns out to be a huge bottleneck, and I need my method to scale well to a large number of images (10K). To solve the linear assignment problem I'm using a Java implementation of the Jonker-Volgenant algorithm, whose time complexity is O(n^3). So while t-SNE is O(n log n) and scales well up to 10K images, the part that aligns to a regular grid can only deal with up to 2K images.
Potential solutions, as I see them:
Randomly sample 2K images out of the total 10K.
Divide the 10K images into 5 groups and create 5 maps. This is problematic because of a "chicken and egg" problem: how do I do the division well in the first place?
Trade accuracy for performance: solve the linear assignment problem approximately in near-linear time. I want to try this, but I couldn't find any existing implementations to use (a rough greedy sketch appears at the end of this question).
Implement the "snap to grid" part in a different, more efficient way.
I'm working with Java, but solutions in C++ are also good. I'm guessing I'm not the first to try this. Any suggestions? Thoughts?
Thanks!
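For reference, this is the kind of cheap greedy heuristic I have in mind for the last two options: visit the t-SNE points in random order and give each one the nearest still-free grid cell via an outward ring search. This is only a sketch under my own assumptions (coordinates normalized to [0, 1], at least as many grid cells as images), not the optimal assignment that Jonker-Volgenant computes:

```java
import java.util.*;

/** Greedy approximation of the "snap to grid" step: visit points in random order and
 *  give each one the nearest still-free grid cell, found by an outward ring search.
 *  This is NOT the optimal assignment (unlike Jonker-Volgenant), just a cheap heuristic;
 *  it assumes rows * cols >= points.length and coordinates normalized to [0, 1]. */
public class GreedySnapToGrid {

    /** Returns the grid cell index (row * cols + col) chosen for each point. */
    public static int[] assign(double[][] points, int rows, int cols) {
        boolean[][] taken = new boolean[rows][cols];
        int[] cellOf = new int[points.length];

        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Collections.shuffle(Arrays.asList(order), new Random(0));

        for (int idx : order) {
            int r0 = Math.min(rows - 1, (int) (points[idx][1] * rows));
            int c0 = Math.min(cols - 1, (int) (points[idx][0] * cols));
            // Search rings of increasing Chebyshev radius around the ideal cell.
            search:
            for (int radius = 0; radius < Math.max(rows, cols); radius++) {
                for (int r = r0 - radius; r <= r0 + radius; r++) {
                    for (int c = c0 - radius; c <= c0 + radius; c++) {
                        if (r < 0 || c < 0 || r >= rows || c >= cols) continue;
                        if (Math.max(Math.abs(r - r0), Math.abs(c - c0)) != radius) continue;
                        if (!taken[r][c]) {
                            taken[r][c] = true;
                            cellOf[idx] = r * cols + c;
                            break search;
                        }
                    }
                }
            }
        }
        return cellOf;
    }
}
```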
So I'm trying to calculate the distance between two large strings (about 20-100 characters each).
The obstacle is performance: I need to run about 20K distance comparisons, and it currently takes hours.
After investigating, I came across a few algorithms, and I'm having trouble deciding which to choose (based on performance vs. accuracy).
https://github.com/tdebatty/java-string-similarity - includes a performance list for each of the algorithms.
** EDITED **
Is SIFT4 algorithm well-proven / reliable?
Is SIFT4 the right algorithm for the task?
How come it's so much faster than LCP-based / Levenshtein algorithms?
Is SIFT also used in image processing, or is that a different thing?
Thanks.
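For concreteness, a minimal pairwise-comparison loop with the java-string-similarity library might look like the sketch below; the Levenshtein class and its distance(s1, s2) method are what I understand the library to expose (check its README for the other measures, e.g. the exact package of SIFT4), and loadStrings() is a hypothetical placeholder for the real data:

```java
import info.debatty.java.stringsimilarity.Levenshtein;

public class PairwiseDistances {

    public static void main(String[] args) {
        String[] strings = loadStrings();      // hypothetical: the 20-100 char strings to compare

        Levenshtein lev = new Levenshtein();   // swap in another measure from the library if needed
        long start = System.currentTimeMillis();
        for (int i = 0; i < strings.length; i++) {
            for (int j = i + 1; j < strings.length; j++) {
                double d = lev.distance(strings[i], strings[j]);
                // ... store d or compare it against a threshold ...
            }
        }
        System.out.println("took " + (System.currentTimeMillis() - start) + " ms");
    }

    static String[] loadStrings() {
        return new String[] {"example one", "example two"};   // placeholder data
    }
}
```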
As far as I know, the Scale-Invariant Feature Transform (SIFT) is a computer vision algorithm that detects and describes local features in images.
Also, if you want to find similar images, you must compare the local features of the images to each other by calculating their distance, which may do what you intend. But local features are vectors of numbers, as I recall. It uses a brute-force matcher: Feature Matching - OpenCV Library - SIFT.
Please read about SIFT here: http://docs.opencv.org/3.1.0/da/df5/tutorial_py_sift_intro.html
SIFT4, which is mentioned in the link you provided, is a completely different thing: a string distance measure, unrelated to the image-processing SIFT.
Sorry if this is a duplicate.
I have a two-class prediction model; it has n configurable (numeric) parameters. The model can work pretty well if you tune those parameters properly, but specific good values for those parameters are hard to find. I used grid search for that (providing, say, m values for each parameter). This yields m^n training runs, which is very time-consuming even when run in parallel on a machine with 24 cores.
I tried fixing all parameters but one and varying only that one parameter (which yields m × n runs), but it's not obvious to me what to do with the results I got. This is a sample plot of precision (triangles) and recall (dots) for negative (red) and positive (blue) samples:
Simply taking the "winner" values for each parameter obtained this way and combining them doesn't lead to the best (or even good) prediction results. I thought about building a regression on parameter sets with precision/recall as the dependent variable, but I don't think a regression with more than 5 independent variables will be much faster than the grid search scenario.
What would you propose to find good parameter values with reasonable estimation time? Sorry if this has an obvious (or well-documented) answer.
I would use randomized search (pick random values for each of your parameters within ranges you deem reasonable and evaluate each randomly chosen configuration), which you can run for as long as you can afford. This paper runs experiments showing that it is at least as good as grid search:
Grid search and manual search are the most widely used strategies for hyper-parameter optimization.
This paper shows empirically and theoretically that randomly chosen trials are more efficient
for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison
with a large previous study that used grid search and manual search to configure neural networks
and deep belief networks. Compared with neural networks configured by a pure grid search,
we find that random search over the same domain is able to find models that are as good or better
within a small fraction of the computation time.
For what it's worth, I have used scikit-learn's randomized search for a problem that required optimizing about 10 hyper-parameters for a text classification task, with very good results in only around 1000 iterations.
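A minimal sketch of plain randomized search in Java, assuming a hypothetical trainAndEvaluate() hook that trains the two-class model and returns the validation metric to maximize, and assumed per-parameter ranges:

```java
import java.util.Arrays;
import java.util.Random;

public class RandomSearch {

    // Hypothetical hook: train the model with these parameter values and
    // return the validation score you care about (e.g. F1 on a held-out set).
    static double trainAndEvaluate(double[] params) {
        return 0.0;   // replace with the real training/evaluation
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] lo = {0.001, 1,   0.0};   // assumed lower bound per parameter
        double[] hi = {1.0,   100, 5.0};   // assumed upper bound per parameter

        double bestScore = Double.NEGATIVE_INFINITY;
        double[] bestParams = null;

        int budget = 1000;                 // as many trials as you can afford
        for (int trial = 0; trial < budget; trial++) {
            // Draw one random configuration uniformly from the box [lo, hi].
            double[] p = new double[lo.length];
            for (int i = 0; i < p.length; i++) {
                p[i] = lo[i] + rng.nextDouble() * (hi[i] - lo[i]);
            }
            double score = trainAndEvaluate(p);
            if (score > bestScore) {
                bestScore = score;
                bestParams = p;
            }
        }
        System.out.println("best score " + bestScore + " at " + Arrays.toString(bestParams));
    }
}
```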
I'd suggest the Simplex Algorithm (Nelder-Mead) with Simulated Annealing (a small sketch using Apache Commons Math follows this list):
Very simple to use: simply give it n + 1 points and let it run until some configurable stopping condition (a number of iterations, or convergence).
Implementations exist in practically every language.
Doesn't require derivatives.
More resilient to local optima than the method you're currently using.
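For illustration, here is a rough Nelder-Mead sketch using Apache Commons Math 3 (assumed to be on the classpath); it omits the simulated-annealing restarts and uses a hypothetical trainAndEvaluate() hook for the model:

```java
import java.util.Arrays;
import org.apache.commons.math3.analysis.MultivariateFunction;
import org.apache.commons.math3.optim.InitialGuess;
import org.apache.commons.math3.optim.MaxEval;
import org.apache.commons.math3.optim.PointValuePair;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;
import org.apache.commons.math3.optim.nonlinear.scalar.ObjectiveFunction;
import org.apache.commons.math3.optim.nonlinear.scalar.noderiv.NelderMeadSimplex;
import org.apache.commons.math3.optim.nonlinear.scalar.noderiv.SimplexOptimizer;

public class SimplexTuning {

    // Hypothetical hook: train the model with these parameter values and
    // return the validation metric to maximize (e.g. F1).
    static double trainAndEvaluate(double[] params) {
        return 0.0;   // replace with the real training/evaluation
    }

    public static void main(String[] args) {
        MultivariateFunction score = SimplexTuning::trainAndEvaluate;

        SimplexOptimizer optimizer = new SimplexOptimizer(1e-6, 1e-8);
        PointValuePair best = optimizer.optimize(
                new MaxEval(500),                                // evaluation budget
                new ObjectiveFunction(score),
                GoalType.MAXIMIZE,
                new InitialGuess(new double[] {1.0, 0.5, 10.0}), // assumed starting point (n = 3 here)
                new NelderMeadSimplex(3));                       // simplex dimension = number of parameters
        System.out.println(Arrays.toString(best.getPoint()) + " -> " + best.getValue());
    }
}
```

Restarting the optimizer from a few different initial guesses (or adding an annealing-style perturbation between restarts) is a common way to make it even less likely to get stuck in a local optimum.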
I'm faced with a problem where I have to solve puzzles.
E.g., I have a (variable-size) area of 20x20 (meters, for example). There are a number of given pieces of variable sizes, such as 4x3, 4x2, 1x5, etc. These pieces can also be rotated, to add more pain to my problem. The goal of the puzzle is to fill the entire 20x20 area with the given pieces.
What would be a good starting algorithm to achieve such a feat?
I'm thinking of using a heuristic that calculates the open space (for efficiency purposes).
Thanks in advance
That's an Exact Cover problem, usually with a nice structure too, depending on the pieces (each row of the cover matrix is one possible placement of one piece in one orientation; each column is one board cell that must be covered exactly once). I don't know of any heuristic algorithms, but there are several exact options that should work well.
As usual with Exact Cover, you can use Dancing Links, a way to implement Knuth's Algorithm X efficiently.
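As a rough illustration (not Dancing Links itself), here is a plain recursive Algorithm X over a set representation; rows are assumed to be pre-generated candidate tile placements and columns are board cells:

```java
import java.util.*;

/** Naive recursive Algorithm X over a set representation (no dancing links).
 *  rows.get(r) = set of column indices (board cells) covered by placement r. */
public class AlgorithmX {

    /** Returns the indices of the chosen rows, or null if no exact cover exists. */
    public static List<Integer> solve(List<Set<Integer>> rows, int numCols) {
        Set<Integer> uncovered = new HashSet<>();
        for (int c = 0; c < numCols; c++) uncovered.add(c);
        List<Integer> active = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) active.add(r);
        List<Integer> solution = new ArrayList<>();
        return search(rows, uncovered, active, solution) ? solution : null;
    }

    private static boolean search(List<Set<Integer>> rows, Set<Integer> uncovered,
                                  List<Integer> active, List<Integer> solution) {
        if (uncovered.isEmpty()) return true;          // every cell is covered exactly once
        // Choose the uncovered column with the fewest candidate rows (Knuth's heuristic).
        int bestCol = -1, bestCount = Integer.MAX_VALUE;
        for (int c : uncovered) {
            int count = 0;
            for (int r : active) if (rows.get(r).contains(c)) count++;
            if (count < bestCount) { bestCount = count; bestCol = c; }
        }
        if (bestCount == 0) return false;              // dead end: some cell cannot be covered

        for (int r : active) {
            if (!rows.get(r).contains(bestCol)) continue;
            // Tentatively place row r: drop its columns and every row that conflicts with it.
            List<Integer> nextActive = new ArrayList<>();
            for (int other : active) {
                boolean conflicts = false;
                for (int c : rows.get(r)) {
                    if (rows.get(other).contains(c)) { conflicts = true; break; }
                }
                if (!conflicts) nextActive.add(other);
            }
            Set<Integer> nextUncovered = new HashSet<>(uncovered);
            nextUncovered.removeAll(rows.get(r));

            solution.add(r);
            if (search(rows, nextUncovered, nextActive, solution)) return true;
            solution.remove(solution.size() - 1);      // backtrack
        }
        return false;
    }
}
```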
Less generally, you can probably solve this with zero-suppressed decision diagrams (ZDDs). It depends on the tiles, though. As a bonus, you can represent all possible solutions and count them, or generate one with certain properties, all without ever explicitly storing the entire (usually far too large) set of solutions.
BDDs would work about as well, using more nodes to accomplish the same thing (because the solutions are very sparse, in the sense of using few of the possible tile placements; ZDDs like sparseness, while BDDs like symmetry better).
Or you could turn it into a SAT problem; then you get less information (no solution count, for example), but it can be faster if there are easy solutions.
I recently saw how Windows 8 presents the icons on the dashboard in the "Metro" style.
In the image above, it seems that some widgets receive twice the width of the others, so that the whole list of widgets is balanced to a certain degree.
My question is whether the algorithm described here (a partition problem solution) is used to achieve this result.
Can anybody give me some hints on how to build a similar display when the widgets can span multiple lines (e.g., the "Popular Session" widget would take 2 columns and 2 lines to display)?
The algorithm called linear partitioning divides a given sequence of positive numbers into k optimal sub-ranges; it can be restated as an optimization problem to which dynamic programming is a solution.
Your problem here does not look like an optimization problem, in the sense that there isn't much of a target function. If the tiles had weights and you needed to partition them evenly among a few pages, then it would be an optimization problem. So my answer to the first question is no.
Metro's tile arrangement seems to be much simpler, since it actually lets the user edit the list, and automatically re-arranging the tiles would be annoying. So all Metro does is avoid starting a new row while the tile still fits on the current row.
Maybe the same simple technique will work for you (a minimal sketch is below).
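A minimal sketch of that greedy row-filling idea for tiles that are 1 or 2 columns wide (multi-row tiles like a 2x2 "Popular Session" would need a 2D occupancy grid instead); the column count and tile widths are assumptions:

```java
import java.util.*;

/** Greedy "Metro-like" row filler: keep tiles in user order and only start a new
 *  row when the current one has no room left. Assumes every tile fits within
 *  `columns` and occupies a single row. */
public class TileRowFiller {

    /** Returns one list of tile indices per row. */
    public static List<List<Integer>> layout(int[] tileWidths, int columns) {
        List<List<Integer>> rows = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int used = 0;
        for (int i = 0; i < tileWidths.length; i++) {
            if (used + tileWidths[i] > columns) {   // tile doesn't fit: close the row
                rows.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(i);
            used += tileWidths[i];
        }
        if (!current.isEmpty()) rows.add(current);
        return rows;
    }

    public static void main(String[] args) {
        // Example: widths 2,1,1,2,2 laid out in 4 columns -> rows [0,1,2] and [3,4].
        System.out.println(layout(new int[] {2, 1, 1, 2, 2}, 4));
    }
}
```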
I have a list of tweets with their geo locations.
They are going to be displayed in a heatmap image transparently placed over Google Map.
The trick is to find groups of locations residing next to each other and display them as a single heatmap circle/figure of a certain heat/color, based on cluster size.
Is there some library ready to group locations on a map into clusters?
Or should I rather decide on my clustering parameters and build a custom algorithm?
I don't know if there is a "library ready to group locations on a map into clusters"; maybe there is, maybe there isn't. Anyway, I don't recommend building a custom clustering algorithm, since there are a lot of libraries already implemented for this.
#recursive sent you a link with PHP code for k-means (one clustering algorithm). There is also a huge Java library with other techniques (Java-ML), including k-means, hierarchical clustering, k-means++ (to select the centroids), etc.
Finally, I'd like to point out that clustering is an unsupervised algorithm, which means that it will give you a set of clusters with data inside them, but at first glance you don't know how the algorithm clustered your data. It may cluster by location as you want, but it can also cluster by another characteristic you don't need, so it's all about playing with the algorithm's parameters and tuning your solution.
I'm interested in the final solution you find to this problem :) Maybe you can share it in a comment when you finish this project!
K-means clustering is a technique often used for such problems.
The basic idea is this:
Given an initial set of k means m1,…,mk, the
algorithm proceeds by alternating between two steps:
Assignment step: Assign each observation to the cluster with the closest mean
Update step: Calculate the new means to be the centroid of the observations in the cluster.
Here is some sample code for php.
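For a Java flavor of the two steps above (if Java-ML or the PHP sample isn't an option), a minimal Lloyd's-algorithm sketch might look like this; the (lat, lon) input array, k, and the iteration count are assumptions, and plain Euclidean distance is only reasonable for points in a small area:

```java
import java.util.*;

/** Minimal k-means over (lat, lon) points: plain Lloyd's algorithm with
 *  squared Euclidean distance. */
public class KMeans {

    /** Returns the cluster index assigned to each point; cluster sizes drive circle size/heat. */
    public static int[] cluster(double[][] points, int k, int iterations, long seed) {
        Random rng = new Random(seed);
        double[][] means = new double[k][2];
        for (int c = 0; c < k; c++) means[c] = points[rng.nextInt(points.length)].clone();
        int[] assignment = new int[points.length];

        for (int it = 0; it < iterations; it++) {
            // Assignment step: each point goes to the cluster with the closest mean.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - means[c][0];
                    double dy = points[i][1] - means[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assignment[i] = best;
            }
            // Update step: each mean becomes the centroid of its assigned points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    means[c][0] = sums[c][0] / counts[c];
                    means[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```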
heatmap.js is an HTML5 library for rendering heatmaps, and has a sample for doing it on top of the Google Maps API. It's pretty robust, but only works in browsers that support canvas:
The heatmap.js library is currently supported in Firefox 3.6+, Chrome
10, Safari 5, Opera 11 and IE 9+.
You can try my PHP class "hilbert curve" at phpclasses.org. It's a monster (space-filling) curve that reduces 2D complexity to 1D complexity. I use a quadkey to address a coordinate, and it has 21 zoom levels like Google Maps.
This isn't really a clustering problem. Heat maps don't work by creating clusters. Instead, they convolve the data with a Gaussian kernel. If you're not familiar with image processing, think of it as using a normal or Gaussian "stamp" and stamping it over each point. Since overlapping stamps add up on top of each other, areas of high density will have higher values.
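A minimal sketch of that stamping idea, assuming the tweet locations have already been projected to pixel coordinates and a sigma chosen to taste:

```java
/** Builds a density image by "stamping" a Gaussian kernel at every point. */
public class GaussianHeat {

    public static double[][] densityMap(int[][] pixelPoints, int width, int height, double sigma) {
        double[][] heat = new double[height][width];
        int radius = (int) Math.ceil(3 * sigma);   // the kernel is effectively zero beyond 3 sigma
        for (int[] p : pixelPoints) {
            // Add the Gaussian "stamp" centered on this point; overlaps accumulate.
            for (int dy = -radius; dy <= radius; dy++) {
                for (int dx = -radius; dx <= radius; dx++) {
                    int x = p[0] + dx, y = p[1] + dy;
                    if (x < 0 || y < 0 || x >= width || y >= height) continue;
                    heat[y][x] += Math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma));
                }
            }
        }
        return heat;   // map values through a color ramp to render the heatmap overlay
    }
}
```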
One simple alternative for heatmaps is to just round the lat/long to some number of decimals and group by that.
See this explanation of lat/long decimal accuracy:
1 decimal - 11km
2 decimals - 1.1km
3 decimals - 110m
etc.
For a low-zoom-level heatmap with lots of data, rounding to 1 or 2 decimals and grouping the results by that should do the trick (a minimal sketch is below).
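A minimal sketch of that rounding-and-grouping approach, assuming an array of (lat, lon) pairs; each bucket's count is the cell's "heat":

```java
import java.util.*;

/** Buckets points by rounded coordinates: roughly 1.1 km cells at 2 decimals. */
public class LatLonBuckets {

    /** Returns a map from "lat,lon" cell key to the number of points in that cell. */
    public static Map<String, Integer> countByCell(double[][] latLon, int decimals) {
        double factor = Math.pow(10, decimals);
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : latLon) {
            String key = (Math.round(p[0] * factor) / factor) + ","
                       + (Math.round(p[1] * factor) / factor);
            counts.merge(key, 1, Integer::sum);   // increment the cell's count
        }
        return counts;
    }
}
```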