Determine Mapping Function or Approximation from Massive Amount of Data - algorithm

Is there a good/well-known approach to fleshing out a mapping function/approximation when you have access to lots of mapped data?
Example situation:
Let's say I have a domain space of a 3D cube (bottom_left: 0,0,0, top_right: 10,10,10), e.g. points (0,0,1), (1,2,3) etc.
Each point maps to a separate set of 3 values in the solution space. We do NOT know the mapping function, which I guess is the heart of the problem, but we do have a massive amount of mapped data. From the data, these values were found to range from -30.0 to +30.0. E.g. data:
[0,0,1] -> (0.1, 0.1, 0.1), [1,2,3]-> (10.2, 3.1, 29.3) etc.
Any two different keys CAN be mapped to the same point in the solution space; however, such keys will be positioned far away from one another in the domain space.
We also have the last position, and a constraint that the searched domain position cannot be farther than a given distance, e.g. (0.1, 0.1, 0.1), from that last position. I feel like this could somehow be used to disambiguate keys that map to identical solution-space values?
If I have a random point (2.3, 6.5, 2.6) in the solution space, how would I find the nearest domain value? Since I have massive amounts of data, is there a good approach to flesh out a mapping function/approximation?
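For illustration only, here is a minimal sketch of one common way to attack the inverse-lookup part of this: index the solution-space values with a k-d tree and filter candidates by the domain-distance constraint. The array names, placeholder data and thresholds are assumptions, not from the original post; SciPy's cKDTree is just one convenient choice.

import numpy as np
from scipy.spatial import cKDTree

# Placeholder data standing in for the "massive amount of mapped data":
# domain_points live in the 0..10 cube, solution_points in -30..+30.
rng = np.random.default_rng(0)
domain_points = rng.uniform(0, 10, size=(100_000, 3))
solution_points = rng.uniform(-30, 30, size=(100_000, 3))

tree = cKDTree(solution_points)   # index the solution space for nearest-neighbour queries

def nearest_domain(query, last_position, max_step, k=20):
    # Take the k solution values closest to `query`, then keep the first
    # candidate whose domain point is within `max_step` of the last position
    # (the constraint described above). Returns None if nothing qualifies.
    _, idxs = tree.query(query, k=k)
    for idx in np.atleast_1d(idxs):
        candidate = domain_points[idx]
        if np.linalg.norm(candidate - last_position) <= max_step:
            return candidate
    return None

print(nearest_domain(np.array([2.3, 6.5, 2.6]),
                     last_position=np.array([1.0, 2.0, 3.0]),
                     max_step=2.0))

With real data, a regression or interpolation model fitted to the mapped pairs could replace the raw lookup to approximate the mapping function itself.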

Related

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each model has the same parameters and dimensionality, but is trained on slightly different data. I then want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Its similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations so that the same words occupy the same position in the vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
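For illustration of the underlying learn-a-projection idea (rather than the TranslationMatrix API itself), here is a hedged sketch that aligns one space onto another with orthogonal Procrustes, i.e. the best rotation mapping a set of shared anchor words from model A onto model B. The toy corpus, the anchor list and the gensim 4.x parameter names are assumptions.

import numpy as np
from gensim.models import Word2Vec

# Two models trained on the same toy data but with different seeds, so their
# spaces end up rotated/flipped relative to each other (as described above).
corpus = [["hot", "weather", "today"],
          ["cold", "weather", "tomorrow"],
          ["hot", "and", "cold", "drinks"]] * 50
model_a = Word2Vec(corpus, vector_size=20, min_count=1, seed=1)
model_b = Word2Vec(corpus, vector_size=20, min_count=1, seed=2)

anchor_words = ["hot", "cold", "weather"]   # words assumed to be equivalent in both spaces
A = np.stack([model_a.wv[w] for w in anchor_words])
B = np.stack([model_b.wv[w] for w in anchor_words])

# Orthogonal Procrustes: the rotation R minimising ||A @ R - B|| in the Frobenius norm.
U, _, Vt = np.linalg.svd(A.T @ B)
R = U @ Vt

# Any word vector from model A can now be compared in model B's coordinate system.
aligned_hot = model_a.wv["hot"] @ R
print(float(np.dot(aligned_hot, model_b.wv["hot"])))

The same word pairs that would be fed to TranslationMatrix can serve as the anchor set here; the more anchors, the more stable the estimated rotation.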
But, there are some other techniques you could also consider:
Transform and concatenate the training corpora instead, retaining some words that are the same across all corpora (such as very frequent words) while making other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all the data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)", since each was trained only by its respective subset of the data. (It's still important to have many 'anchor' words, shared between the different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying words.) A sketch of this idea appears below, after the next suggestion.
Create a model for all the data, with a single vector per word, and save that model aside. Then re-load it and try re-training it with just subsets of the whole data. Check how much words move when trained on just those segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in model.trainables, with a name ending in _lockf, that lets you scale the updates to each word. If you set its value to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words by setting their _lockf values to 0.0, so that only the other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
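Here is a hedged sketch of the first idea (tagging the words of interest per subcorpus, keeping anchor words shared, and training a single model). The corpora, word lists and gensim 4.x parameter names are illustrative assumptions.

import random
from gensim.models import Word2Vec

# Toy stand-ins for the two slightly different corpora.
corpus_a = [["hot", "tamale", "stand"], ["skiing", "in", "cold", "weather"]] * 50
corpus_b = [["hot", "tamale", "recipe"], ["skiing", "gear", "for", "cold", "days"]] * 50

words_of_interest = {"tamale", "skiing"}   # words whose per-corpus drift we want to measure

def tag(sentences, label):
    # Append a subcorpus label to words of interest; anchor words stay unchanged.
    return [[f"{w}({label})" if w in words_of_interest else w for w in sentence]
            for sentence in sentences]

combined = tag(corpus_a, "A") + tag(corpus_b, "B")
random.shuffle(combined)   # mix the subcorpora into one training session

model = Word2Vec(combined, vector_size=20, min_count=1)

# The per-subcorpus variants now live in one shared space and can be compared directly.
print(model.wv.similarity("tamale(A)", "tamale(B)"))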

Maintaining a list of regions without overlaps

I have a list of integer axis-aligned cuboids that is being built and then processed (a dirty-region system).
Currently this will often contain overlaps, with some coordinates getting processed many times as a result (although still far less in total than the "process everything because of one change" approach). What I want, when adding a new region to the list, is a simple way to prevent any such overlaps.
Due to the size of the data (IIRC about 100 million cells), even though the coordinates are integers, I want to avoid a bool array marking every coordinate as up to date/dirty. On the other hand, the actual number of regions in the list will generally be pretty small (most of the time only covering a fraction of the data set, with individual regions being thousands of cells).
#include <vector>

struct Region
{
    int x, y, z; // corner coordinate
    int w, h, d; // size
};

std::vector<Region> regions;

void addRegion(Region region)
{
    regions.push_back(region);
}
So my current thinking is, in addRegion, to go through all the regions, find the overlapping ones, and split them up appropriately. However, even in 2D this seems tricky to get right, so is there a known algorithm for this sort of thing?
You might be able to make use of an R-tree or one of its variants, which are designed for indexing multidimensional data and support fast intersection tests; given the size of your dataset, you might instead want to use a spatial database.
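For the splitting step described in the question, here is a hedged sketch (in Python for brevity) of subtracting a new axis-aligned box from an existing overlapping one, which yields at most six non-overlapping pieces covering "existing minus new". The field names mirror the Region struct above; everything else is an illustrative assumption.

from dataclasses import dataclass

@dataclass
class Region:
    x: int; y: int; z: int   # corner coordinate
    w: int; h: int; d: int   # size

def overlaps(a, b):
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h and
            a.z < b.z + b.d and b.z < a.z + a.d)

def subtract(existing, new):
    # Split `existing` into pieces not covered by `new` (assumes they overlap).
    # Axis by axis, carve off the slab below and the slab above `new`, then
    # shrink the remainder to the overlap on that axis; whatever is left at the
    # end lies entirely inside `new` and is discarded.
    pieces = []
    lo = [existing.x, existing.y, existing.z]
    hi = [existing.x + existing.w, existing.y + existing.h, existing.z + existing.d]
    new_lo = [new.x, new.y, new.z]
    new_hi = [new.x + new.w, new.y + new.h, new.z + new.d]
    for axis in range(3):
        if lo[axis] < new_lo[axis]:
            cut_lo, cut_hi = lo[:], hi[:]
            cut_hi[axis] = new_lo[axis]
            pieces.append(make_region(cut_lo, cut_hi))
            lo[axis] = new_lo[axis]
        if hi[axis] > new_hi[axis]:
            cut_lo, cut_hi = lo[:], hi[:]
            cut_lo[axis] = new_hi[axis]
            pieces.append(make_region(cut_lo, cut_hi))
            hi[axis] = new_hi[axis]
    return pieces

def make_region(lo, hi):
    return Region(lo[0], lo[1], lo[2], hi[0] - lo[0], hi[1] - lo[1], hi[2] - lo[2])

addRegion would then keep the incoming region whole and replace each overlapping existing region with the pieces subtract returns (or, symmetrically, clip the new region against the existing ones); an R-tree over the region list keeps the "find the overlapping ones" step fast.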

Geohashes - Why is interleaving index values necessary?

I have had a look at this post about geohashes. According to the author, the final step in calculating the hash is interleaving the x and y index values. But is this really necessary? Is there a proper reason not to just concatenate these values, as long as the hash table is built according to that altered indexing rule?
From the Wikipedia page:
Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision).
If you simply concatenated the x and y coordinates, users trying to reduce precision would have to be careful to remove exactly the right number of characters from both the x and the y coordinate.
There is a related (and more important) reason than arbitrary precision: Geohashes with a common prefix are close to one another. The longer the common prefix, the closer they are.
54.321 -2.345 has geohash gcwm48u6
54.322 -2.346 has geohash gcwm4958
(See http://geohash.org to try this)
This feature enables fast lookup of nearby points (though there are some complications), and only works because we interleave the two dimensions to get a sort of approximate 2D proximity metric.
As the wikipedia entry goes on to explain:
When used in a database, the structure of geohashed data has two advantages. First, data indexed by geohash will have all points for a given rectangular area in contiguous slices (the number of slices depends on the precision required and the presence of geohash "fault lines"). This is especially useful in database systems where queries on a single index are much easier or faster than multiple-index queries. Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes.
Note that the converse is not always true - if two points happen to lie on either side of a subdivision (e.g. either side of the equator) then they may be extremely close but have no common prefix. Hence the complications I mentioned earlier.
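To make the interleaving concrete, here is a hedged sketch of the standard encoding: longitude and latitude bits alternate (longitude first) and every 5 bits become one base32 character, so truncating the string coarsens both dimensions at once, which is what gives nearby points a common prefix. This is a bare-bones illustration, not a production geohash implementation.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # standard geohash alphabet

def geohash(lat, lon, length=8):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, use_lon = [], True
    while len(bits) < length * 5:              # 5 bits per base32 character
        if use_lon:                            # interleave: even bits refine longitude...
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:                                  # ...odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        use_lon = not use_lon
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))

print(geohash(54.321, -2.345), geohash(54.322, -2.346))   # the two should share a long prefix

Concatenating all x bits followed by all y bits would instead put one dimension's fine detail between the two coarse parts, so truncation and prefix comparison would no longer behave this way.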

Smart sorting by function of geo and int

I'm thinking about ways to solve the following task.
We are developing a service (website) which has some objects. Each object has a geo field (lat and long). About 200-300 cities with objects can be involved. The number of objects is in the thousands to tens of thousands.
Each object also has a creation date.
We need to search objects sorted by a function of distance and freshness.
E.g. we have two nearby cities A and B. A user from city A logs in and should see objects from city A first and then, on some later pages, objects from city B (because objects from A are closer).
But if there is an object from A which was added a year ago, and an object from B which was added today, then B's object should be displayed before A's.
So, for people from city A we can create a special relevance field, like relevance = 100*distance + age_in_days,
and then sort by this field to get the data in the order we need.
The problem is that such a relevance index will not work for people from other places.
In my example I used a linear function, but it's just an example; we will need to fit the correct function.
The site will run on our own servers, so we can use almost any database or other software (I was planning to use MongoDB).
I have the following ideas.
Recalculate the relevance index every day and store it with each object, like:
{
    fields : ...,
    relindex : {
        cityA : 100,
        cityB : 120
    }
}
If a user belongs to cityA, then sort by relindex.cityA.
Disadvantages:
Recurring updates of all objects, but I don't think that's a huge problem.
Huge Mongo index: if we have about 300 cities, then each object will have 300 indexed fields.
Hard to add new cities.
Use a 3D spatial index: (lat, long, freshness). But I don't know whether any database supports 3D geospatial indexes.
Group nearby objects into clusters and search only within a cluster rather than across the whole database. But I'm not sure that's OK.
I think there are four possible solutions:
1) Use 3D index - lat, lon, time.
2) Distance is more important - use some geo index and select nearest objects. If the object is too old then discard it and increase allowed distance. Stop after you have enough objects.
3) Time is more important - index by time and discard the objects which are too far.
4) Approximate distance - choose some important points (centre of cities or centre of clusters of objects) and calculate the distances from these important points up front. The query will first find the nearest important point and then use index to find the data.
Alternatively you can create clusters from your objects and then calculate the distance in the query. The point here is that the number of clusters is limited.
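For illustration, here is a hedged sketch of the kind of scoring being discussed, using the question's example weighting (100*distance + age_in_days) computed in application code with a haversine distance. The field names, weights and sample data are assumptions; in practice one of the index-backed options above would pre-filter the candidates before scoring.

import math
from datetime import datetime, timezone

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def relevance(obj, user_lat, user_lon, now):
    # Lower is better: the question's example of 100*distance + age_in_days.
    dist_km = haversine_km(user_lat, user_lon, obj["lat"], obj["lon"])
    age_days = (now - obj["created"]).days
    return 100 * dist_km + age_days

now = datetime.now(timezone.utc)
objects = [
    {"id": 1, "lat": 50.45, "lon": 30.52, "created": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "lat": 50.00, "lon": 36.23, "created": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
objects.sort(key=lambda o: relevance(o, user_lat=50.45, user_lon=30.52, now=now))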

NULL values across a dimension in Support Vector Machine

I am designing a support vector machine with n dimensions. Along every dimension, the values range over [0, 1]. Now, if I am unable to determine the value along a particular dimension for a particular data point (for various reasons in the original data set), what should the value along that dimension be for the SVM? Can I just set it to -1 to indicate a missing value?
Thanks
Abhishek S
You would be better served leaving the missing dimension out altogether if it won't be able to contribute to your machine's partitioning of the space. This is because the only thing the SVM can do is place zero weight on that dimension in terms of classification power, as all of the points along that dimension are in the same place.
Thus each pass over that dimension is just wasted computational resources. If recovering this value is important, you may be able to use a regression model of some type to get estimated values back, but if that estimated value is generated from your other data, it again won't actually contribute to your SVM, because the data in that estimated dimension is nothing more than a summary of the data you used to generate it (which I assume is already in your SVM model).
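As a hedged sketch of this advice, using scikit-learn as an assumed toolkit: drop a dimension that is missing for every point rather than encoding it as -1 (which the SVM would treat as a real, extreme coordinate), and impute dimensions that are only partially missing if you must keep them.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Illustrative data in [0, 1]; NaN marks values that could not be determined.
X = np.array([[0.2, np.nan, np.nan, 0.7],
              [0.9, np.nan, 0.3,    0.1],
              [0.4, np.nan, np.nan, 0.8],
              [0.8, np.nan, 0.6,    0.2]])
y = np.array([0, 1, 0, 1])

# Drop dimensions that are missing everywhere: they cannot help the classifier.
keep = ~np.all(np.isnan(X), axis=0)
X_kept = X[:, keep]

# For dimensions missing in only some rows, estimate the value instead of using -1
# (a simple column mean here; the answer above suggests a regression model, with the
# caveat that values derived from the other features add little new information).
X_filled = SimpleImputer(strategy="mean").fit_transform(X_kept)

clf = SVC().fit(X_filled, y)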

Resources