Method to find optimal-ish result for numerous combinations - algorithm

I have a pool of n objects, and each object has a known degree of "compatibility" with each other object. Recognizing that "best" is a relative term, how can I best group these objects into n/2 pairs where the overall result is "good"?
To further clarify, it'd be fairly simple to just iterate over the entire batch and say, "What's the best match for this one?", put them together, and move on, but that may lead to some items at the end of the process that have very low compatibility being paired, and a small change earlier on (perhaps making a second-best match) might have increased the overall quality of matches across the entire pool.
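To make the concern concrete, here is a minimal sketch comparing the greedy "best match for this one" pass described above against an exhaustive search over every pairing. The compatibility table and the function names (`greedy_pairs`, `best_pairs`) are hypothetical, and the exhaustive search is only feasible for a handful of objects; it is here to show the gap, not to suggest a method for large n.

```python
def greedy_pairs(items, score):
    """Pair each item, in order, with its best remaining partner."""
    remaining = list(items)
    pairs = []
    while len(remaining) >= 2:
        first = remaining.pop(0)
        best = max(remaining, key=lambda other: score(first, other))
        remaining.remove(best)
        pairs.append((first, best))
    return pairs


def best_pairs(items, score):
    """Try every perfect matching recursively (tiny inputs only)."""
    items = list(items)
    if len(items) < 2:
        return []
    first, rest = items[0], items[1:]
    best, best_total = None, float("-inf")
    for i, partner in enumerate(rest):
        candidate = [(first, partner)] + best_pairs(rest[:i] + rest[i + 1:], score)
        total = sum(score(a, b) for a, b in candidate)
        if total > best_total:
            best, best_total = candidate, total
    return best


def total_score(pairs, score):
    return sum(score(a, b) for a, b in pairs)


# Hypothetical symmetric compatibility table for four objects.
COMPAT = {
    frozenset("AB"): 10, frozenset("CD"): 1,
    frozenset("AC"): 9, frozenset("BD"): 9,
    frozenset("AD"): 1, frozenset("BC"): 1,
}


def compat(a, b):
    return COMPAT[frozenset((a, b))]


greedy = greedy_pairs("ABCD", compat)   # A takes B first, leaving C with D
optimal = best_pairs("ABCD", compat)    # A-C and B-D score higher overall
print(total_score(greedy, compat), total_score(optimal, compat))
```

In this made-up table the greedy pass grabs the single best pair and is left with a poor one, while two second-best pairs give a higher total.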


What is the fastest way to intersect two large sets of ids

The Problem
On a server, I host ids in a json file. From clients, I need to instruct the server to intersect and sometimes negate these ids (the ids never travel to the client; the client only tells the server which operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you need not limit yourselves to the following list, here are the broad options I have debated internally to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that implements the algorithm. I know that assembly is a mess to maintain and code, but sometimes one needs the speed of a highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternatively, maybe C would be close enough to assembly to produce clean and optimized machine code without the burden of maintaining assembly code.
Images and GPU: GPU and image processing could be used and, instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id has a unique value between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but a pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but translating the resulting image back to ids may be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). A 200,000,000-bit sequence is an estimated 23MB each; just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured so that each id resides under its corresponding hash key, then the smaller ids set could be run sequentially: matching an id against the larger ids set would first require hashing the value to find the key, and then sequentially scanning the ids stored under that key.
This should make the algorithm run in O(n) time, and since I'd pick the smallest ids set to be the one run sequentially, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise and and not operations to intersect and negate lists of ids. Depending on the language and libraries used, it would require operating element-wise: iterating over arrays and applying corresponding operations to each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
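As a rough illustration of the bit-array idea, the sketch below stores each list as one large Python integer used as a bitmap, so that intersection is `&` and negation is a masked `~`. The function names and the tiny demo range are assumptions made for readability; with the real range of -100_000_000 to +100_000_000 each bitmap is about 25 MB, and decoding bits back to ids this way would be slow, so a packed array type (or a native implementation) would likely be the better choice there.

```python
def to_bitmap(ids, lo, hi):
    """Return an int whose bit (id - lo) is set for every id in ids."""
    buf = bytearray((hi - lo) // 8 + 1)
    for value in ids:
        j = value - lo
        buf[j >> 3] |= 1 << (j & 7)
    return int.from_bytes(buf, "little")


def intersect(a, b):
    """Ids present in both bitmaps."""
    return a & b


def negate(bitmap, lo, hi):
    """Ids in [lo, hi] that are NOT in the bitmap."""
    mask = (1 << (hi - lo + 1)) - 1
    return ~bitmap & mask


def to_ids(bitmap, lo):
    """Decode a bitmap back to a sorted list of ids (slow for dense maps)."""
    ids = []
    while bitmap:
        low = bitmap & -bitmap               # isolate the lowest set bit
        ids.append(low.bit_length() - 1 + lo)
        bitmap ^= low
    return ids


# Small demo range; the question's real range is -100_000_000..100_000_000.
LO, HI = -10, 10
a = to_bitmap([-5, 0, 3, 7], LO, HI)
b = to_bitmap([0, 3, 4], LO, HI)

both = to_ids(intersect(a, b), LO)                        # in a and in b
a_not_b = to_ids(intersect(a, negate(b, LO, HI)), LO)     # in a, not in b
print(both, a_not_b)
```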
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
  isNegated @0 :Bool;
  ids @1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping isNegated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list.
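The four cases above can be sketched as a single merge walk over two sorted lists. This is only an illustration under assumptions: each set is modelled as a plain `(negated, sorted_ids)` tuple rather than a Cap'n Proto message, and the helper names are hypothetical.

```python
def merge_walk(a, b, keep_a_only, keep_both, keep_b_only):
    """Walk two sorted lists in parallel, keeping the requested categories."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            if keep_a_only:
                out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            if keep_b_only:
                out.append(b[j])
            j += 1
        else:
            if keep_both:
                out.append(a[i])
            i += 1
            j += 1
    if keep_a_only:
        out.extend(a[i:])
    if keep_b_only:
        out.extend(b[j:])
    return out


def negate(s):
    """NOT is just flipping the flag; the id list is untouched."""
    negated, ids = s
    return (not negated, ids)


def intersect(x, y):
    """AND of two (negated, sorted_ids) sets, returned in the same shape."""
    (x_neg, xs), (y_neg, ys) = x, y
    if not x_neg and not y_neg:
        return (False, merge_walk(xs, ys, False, True, False))   # in both
    if not x_neg and y_neg:
        return (False, merge_walk(xs, ys, True, False, False))   # in x, not y
    if x_neg and not y_neg:
        return (False, merge_walk(xs, ys, False, False, True))   # in y, not x
    # not(x) and not(y) == not(x or y): store the union, keep it negated
    return (True, merge_walk(xs, ys, True, True, True))


A = (False, [1, 3, 5, 7])
B = (False, [3, 4, 5, 8])
print(intersect(A, B))
print(intersect(A, negate(B)))
print(intersect(negate(A), negate(B)))
```

Keeping the result of the doubly negated case as a negated union avoids ever materialising the complement, which is what keeps memory proportional to the list sizes.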
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number and picking most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
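A minimal sketch of the galloping idea, assuming both inputs are sorted lists of unique ids: for each id of the small list, probe forward in the large list with doubling steps, then binary-search inside the bracket that was found. The function name is hypothetical, and the standard-library `bisect_left` stands in for a hand-written binary search; the break-even size ratio would still need to be benchmarked.

```python
from bisect import bisect_left


def gallop_intersect(small, large):
    """Intersect two sorted lists when `small` is much shorter than `large`."""
    out = []
    lo = 0                      # everything before `lo` is < the current id
    for x in small:
        # Exponential probe: jump 1, 2, 4, ... positions until we pass x.
        hi, step = lo, 1
        while hi < len(large) and large[hi] < x:
            lo = hi + 1
            hi += step
            step *= 2
        # The first element >= x now lies in large[lo:hi] or at index hi.
        pos = bisect_left(large, x, lo, min(hi, len(large)))
        if pos < len(large) and large[pos] == x:
            out.append(x)
        lo = pos                # later ids are larger, so never look back
    return out


large = list(range(0, 1000, 3))          # multiples of 3 below 1000
small = [3, 50, 51, 999]
print(gallop_intersect(small, large))
```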
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

word2vec window size at sentence boundaries

I am using word2vec (and doc2vec) to get embeddings for sentences, but I want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'dates')) before it eventually gets to the correct ('apple', ('banana','carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than 'zucchini', since there are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled in inverse proportion to the distance from the target word to the context word. This also causes an issue, making nearby words seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
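To get a feel for the numbers, the following sketch simulates only the random window shrinking described above (a uniform draw from 1 to `window` per target word) and estimates how often the drawn window is wide enough to span a whole sentence. It does not call gensim; the function name and the example sizes are assumptions for illustration.

```python
import random


def full_coverage_rate(sentence_len, window, trials=100_000, seed=0):
    """Estimate how often a randomly shrunk window still spans the sentence.

    A word at one end of the sentence needs a window of at least
    sentence_len - 1 to reach the word at the other end.
    """
    rng = random.Random(seed)
    needed = sentence_len - 1
    hits = sum(rng.randint(1, window) >= needed for _ in range(trials))
    return hits / trials


# 26-word sentences, as in the fruit example from the question.
print(full_coverage_rate(26, window=30))            # small window: often too short
print(full_coverage_rate(26, window=26_000_000))    # a million times longer
```

With the giant window, the drawn value falls below the sentence length only about once in a million draws, which is why the nearest and farthest words end up treated almost identically.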
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs, but if for some reason it weren't, another way to essentially cancel out any nearness effects could be to ensure that the word order within each of your corpus's items is re-shuffled each time the item is accessed as training data. That ensures any nearness advantages will be mixed evenly across all words, especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even just a 1-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
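One way to get a fresh word order on every pass is a small re-iterable wrapper like the sketch below. The class name is hypothetical and the example uses only the standard library; the assumption is that the training code iterates over the corpus once per pass, as is the case when a restartable iterable of token lists is handed to gensim's `Word2Vec`.

```python
import random


class ShuffledCorpus:
    """Re-iterable corpus that reshuffles each sentence's words on every pass."""

    def __init__(self, sentences, seed=None):
        self.sentences = [list(s) for s in sentences]
        self.rng = random.Random(seed)

    def __iter__(self):
        for sentence in self.sentences:
            shuffled = sentence[:]          # leave the stored order untouched
            self.rng.shuffle(shuffled)
            yield shuffled


corpus = ShuffledCorpus(
    [["apple", "banana", "carrot", "dates", "elderberry"],
     ["aluminium", "brass", "copper", "zinc"]],
    seed=1,
)
first_pass = list(corpus)     # one ordering
second_pass = list(corpus)    # same words, freshly shuffled
```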
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
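The effect can be reproduced with a toy model. In the sketch below each element carries a simplified single-level identifier `(position, site)`, and merging is just sorting by identifier. The inserted words and the specific positions chosen by sites B and C are made up for illustration; real Logoot identifiers are variable-length and the positions are drawn at random, so interleaving is likely rather than guaranteed.

```python
# Shared starting document: identifiers are (position, site) pairs.
base = [((20, "A"), "I"), ((25, "A"), "conquered")]

# Two disconnected sites each insert two words between 20 and 25.
# The positions are hypothetical stand-ins for Logoot's random choices.
site_b = [((21, "B"), "came"), ((23, "B"), "and")]
site_c = [((22, "C"), "saw"), ((24, "C"), "then")]


def merge(*replicas):
    """Converged document: every element, ordered by identifier."""
    elements = sorted(item for replica in replicas for item in replica)
    return [text for _, text in elements]


merged = merge(base, site_b, site_c)
print(" ".join(merged))   # the words from B and C alternate
```

Both replicas converge to the same sequence, so the convergence guarantee holds; what is lost is the contiguity of each site's insertion.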
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)

Are there any good techniques for keeping nearly-sorted data nearly-sorted?

Short Version:
I'm looking for a technique to keep nearly-sorted data in nearly-sorted order over time, despite the values changing slightly.
Here's the scenario:
In the world of 3D graphics, it is often beneficial to order your objects from front-to-back before drawing. As your scene changes or your view of the scene changes, this data may require re-sorting, however it will usually be very close to the sorted order (i.e. it won't change very much between frames). It's also not critical that the data be exactly in sorted order. The worst thing that will happen is that a polygon will be rendered and then completely hidden. It's a small performance hit, but not the end of the world.
With this in mind, is it possible to sort the data once ahead of time and then apply a minimal patch to the data once per frame to ensure that the data stays mostly sorted? In this scenario, the data would be considered mostly sorted if most of the objects were in ascending order. That is, 1 object that is 10 steps away from its proper location is much better (10x better) than 10 objects that are 1 step away from their proper location.
It's also worth noting that the data could continue to be patched on a semi regular basis, as the data is typically rendered 30 times per second (or so). As long as the calculation was efficient, it could continue to be done over time until the changes stop and the list was completely sorted.
Existing Idea:
My knee jerk reaction to this problem is:
1. Apply an n log n sort to the data when it is loaded, and on large changes (which I can track pretty easily).
2. When the data starts changing slowly (e.g. when the scene is rotated), apply a single (linear) pass of some sort on the data to swap backwards neighbors and try to maintain sort order (I think this is basically shell sort - maybe there is a better algorithm to use for this single pass).
3. Keep doing a single pass of the partial sort each frame until the changes stop and the data is completely sorted.
4. Go back to step 2 and wait for more changes.
There are a variety of sorts that run in O(n) time if the input is mostly sorted, and O(n log n) if the data is not sorted. It sounds like you can use that pretty easily. Timsort is one such sort and, I believe, is the default sort now in both Python and Java. Smoothsort is another one that is fairly easy to implement yourself.
From your description it sounds like the sort order changes without you changing the data itself. E.g. you change the camera, so the sort order should change, even though you have not modified any polygons.
If so, you can't detect sort order changes directly when they happen. If you could, I would create buckets for the list of polygons, and resort buckets when 'enough' polygons in that bucket have been touched.
But I'm betting your system doesn't work that way. The sort is determined by the view port. In that case polygons at the front of the sort matter much more than ones at the end.
So I'd segment the poly list into fifths or something like that, front to back, so that the first fifth is the part closest to the camera. I'd completely sort the first segment every frame. I'd divide the second segment into sub-segments, say 5 again, and sort one sub-segment every frame, such that every 5 frames the second fifth is completely sorted. Segment the third through fifth segments into 15 sub-segments and do one of those every 5 frames, such that the rest get sorted completely every 75 frames. At 60 fps you'd have the display list completely resorted about once every 1.25 seconds.
The nice things about prioritizing the front of the list are:
1. Polys at the front are going to tend to be larger on the screen, and will fail depth test more often. Bad orders at the end of the list will more often than not just not matter.
2. The front of the list is more susceptible to sort-order changes due to camera changes.
Also choose those segment ranges with a little overlap, so that polygons can migrate to their correct segment in 2 sorts.
@OP: Thinking about it a little more, you are probably more concerned with having the sorting cost stay bounded instead of exploding with scene complexity, especially since a very complex scene should, surprisingly, be less susceptible to bad sorts (because generally the polys get smaller).
You could define a fixed amount of sorting you are willing to do per frame. Use say 50% of the budget for as much of the front of the list as you can afford, 25% of the budget to sort the next region and 25% to spend equally on the rest.
Say you budget 1000 polys sorted per frame, and you have 10000 polys in the scene. Sort the first 500 polys every frame. Sort 250 polys every tenth frame for the next region. So 501-750 on frame 1, 751-1000 on frame 2 etc. And then divide the rest of the list into 250-poly segments and sort them round robin for however many frames you need to.
This keeps the sorting cost fixed as the scene gets more and less complex, and it is easy to tune: you just adjust the sorting budget to what you can afford.
I'll suggest a solution that borrows from a number of others here. Of course we start with a full sort of the objects on initialisation.
What I would do is always perform, say, 10 linear-time runs over your objects for every frame (with early termination if you find out that your objects are already completely sorted). Each run can be, say, one pass of bubble sort with a shell sort-style gap over the whole array: for all i from 0 to n-gap-1, compare A[i] and A[i+gap], and exchange them if they are not sorted. You can use a fixed sequence of gaps, or maybe better, let it vary between frames; either way, if you do sufficiently many frames where the objects do not change, you'll have a fully sorted sequence. You could even mix different types of sub-algorithms to do your runs, as long as each iteration improves the 'sortedness'.
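The per-frame runs described above might look like the following sketch. The gap sequence, the pass budget, and the function names are assumptions; the only property relied on is that every swap removes an inversion, so repeated frames over unchanged data eventually leave the list fully sorted.

```python
def gap_pass(a, gap, key=lambda x: x):
    """One linear pass: compare a[i] with a[i + gap], swap if out of order."""
    swapped = False
    for i in range(len(a) - gap):
        if key(a[i]) > key(a[i + gap]):
            a[i], a[i + gap] = a[i + gap], a[i]
            swapped = True
    return swapped


def refine(a, gaps=(4, 2, 1), max_passes=10, key=lambda x: x):
    """Run up to max_passes passes for this frame.

    Returns True as soon as a gap-1 pass makes no swap, which means the
    list is completely sorted and the remaining passes can be skipped.
    """
    for n in range(max_passes):
        gap = gaps[n % len(gaps)]
        swapped = gap_pass(a, gap, key)
        if gap == 1 and not swapped:
            return True
    return False


# Depths that drifted slightly out of order since the last frame.
depths = [0, 1, 3, 2, 4, 6, 5, 7, 9, 8]
frames = 0
while not refine(depths):
    frames += 1             # in a renderer this would be once per drawn frame
print(depths, frames)
```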
You can add Rafael Baptista's idea of prioritizing the front of the scene easily by doing one extra run on the front segment, or choosing to divide the gap by two for the front half, or something like that.
It doesn't work out as neatly as the problem you've supposed because all you have to do is turn the camera 90 degrees and the basis for being sorted is on a different axis entirely. (X and Y axis are independent, for example -- looking down the X axis will cause the sort order to not rely on the X axis, and looking down the Y axis will cause the sort order to not rely on the Y axis.) Even a 5 degree turn can cause far away "close" (as far as Z-order is concerned) things to be suddenly "far".
Let's be honest -- generating the draw calls for the objects is normally going to take much more time than sorting them, especially if you have an optimized sorting algorithm for your scenario and your game is of modern visual complexity.
Sorting can be practically O(n), especially with histogram-based algorithms or radix-style algorithms. (Yes, radix sort applies to integers, so you'd have to scale your world coordinates to integers, but normally that's more than good enough unless you have a gigantic world.)
That being said, since you're already doing O(n) ops for everything you're drawing, resorting per frame isn't going to be a huge problem, especially with both high and low level optimization.
Another common way of addressing this issue is with a scene graph, but for your purposes it ends up essentially being a re-sort per frame. However, you can build frustum culling, shadow culling, and level of detail calculations into the scene graph traversal.
If you're looking for approximations, instead of doing a z-distance sort do a true distance sort and update the sort order more often for close by objects and less often for further objects (depending on distance the camera has traveled). This can work because if you're further away from an object, moving doesn't cause the angle to the viewer to change as often which, in turn, means the old sorting data is more likely to be valid. I'm not a fan of this because I like algorithms which allow my game to teleport across the map without any issues. (Mind you, streaming assets from disk becomes the real issue for teleporting.)
Shell sort is good for lists with few unique values and some scenarios that "need short code and do not use the call stack".
In your case, you need something called an adaptive sort, meaning an algorithm that "takes advantage of existing order in its input".
If your space is tight, you can just use straight insertion sort, which is adaptive and in place.
Otherwise you can try Timsort and Smoothsort as @RunningWild suggested; they are both adaptive sort algorithms.
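For reference, a straight insertion sort is only a few lines. The sketch below sorts in place and takes a `key` function (an added convenience, e.g. for sorting objects by depth); on nearly-sorted input the inner loop rarely runs, so the cost stays close to linear.

```python
def insertion_sort(a, key=lambda x: x):
    """Sort the list in place; fast when few elements are out of position."""
    for i in range(1, len(a)):
        item = a[i]
        k = key(item)
        j = i - 1
        # Shift larger elements one slot right until item's place is found.
        while j >= 0 and key(a[j]) > k:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item
    return a


# Hypothetical (name, depth) objects, already nearly ordered by depth.
objects = [("tree", 1.0), ("rock", 2.5), ("house", 2.1), ("hill", 9.0)]
insertion_sort(objects, key=lambda obj: obj[1])
```

In Python specifically, the built-in `list.sort` is Timsort, so calling it every frame is already an adaptive sort.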

Coming up with factors for a weighted algorithm?

I'm trying to come up with a weighted algorithm for an application. In the application, there is a limited amount of space available for different elements. Once all the space is occupied, the algorithm should choose the best element(s) to remove in order to make space for new elements.
There are different attributes which should affect this decision. For example:
T: Time since last accessed. (It's best to replace something that hasn't been accessed in a while.)
N: Number of times accessed. (It's best to replace something which hasn't been accessed many times.)
R: Number of elements which need to be removed in order to make space for the new element. (It's best to replace the fewest elements. Ideally this should also take into consideration the T and N attributes of each element being replaced.)
I have 2 problems:
Figuring out how much weight to give each of these attributes.
Figuring out how to calculate the weight for an element.
(1) I realize that coming up with the weight for something like this is very subjective, but I was hoping that there's a standard method or something that can help me in deciding how much weight to give each attribute. For example, I was thinking that one method might be to come up with a set of two sample elements and then manually compare the two and decide which one should ultimately be chosen. Here's an example:
Element A: N = 5, T = 2 hours ago.
Element B: N = 4, T = 10 minutes ago.
In this example, I would probably want A to be the element that is chosen to be replaced since although it was accessed one more time, it hasn't been accessed in a lot of time compared with B. This method seems like it would take a lot of time, and would involve making a lot of tough, subjective decisions. Additionally, it may not be trivial to come up with the resulting weights at the end.
Another method I came up with was to just arbitrarily choose weights for the different attributes and then use the application for a while. If I notice anything obviously wrong with the algorithm, I could then go in and slightly modify the weights. This is basically a "guess and check" method.
Both of these methods don't seem that great and I'm hoping there's a better solution.
(2) Once I do figure out the weight, I'm not sure which way is best to calculate the weight. Should I just add everything? (In these examples, I'm assuming that whichever element has the highest replacementWeight should be the one that's going to be replaced.)
replacementWeight = .4*T - .1*N - 2*R
or multiply everything?
replacementWeight = (T) * (.5*N) * (.1*R)
What about not using constants for the weights? For example, sure "Time" (T) may be important, but once a specific amount of time has passed, it starts not making that much of a difference. Essentially I would lump it all in an "a lot of time has passed" bin. (e.g. even though 8 hours and 7 hours have an hour difference between the two, this difference might not be as significant as the difference between 1 minute and 5 minutes since these two are much more recent.) (Or another example: replacing (R) 1 or 2 elements is fine, but when I start needing to replace 5 or 6, that should be heavily weighted down... therefore it shouldn't be linear.)
replacementWeight = 1/T + sqrt(N) - R*R
Obviously (1) and (2) are closely related, which is why I'm hoping that there's a better way to come up with this sort of algorithm.
What you are describing is the classic problem of choosing a cache replacement policy. Which policy is best for you, depends on your data, but the following usually works well:
First, always store a new object in the cache, evicting the R worst one(s). There is no way to know a priori if an object should be stored or not. If the object is not useful, it will fall out of the cache again soon.
The popular Squid cache implements the following cache replacement algorithms:
Least Recently Used (LRU):
replacementKey = -T
Least Frequently Used with Dynamic Aging (LFUDA):
replacementKey = N + C
Greedy-Dual-Size-Frequency (GDSF):
replacementKey = (N/R) + C
C refers to a cache age factor here. C is basically the replacementKey of the item that was evicted last (or zero).
NOTE: The replacementKey is calculated when an object is inserted or accessed, and stored alongside the object. The object with the smallest replacementKey is evicted.
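A minimal sketch of how such a key-based policy could be wired up, using the GDSF formula above. It assumes that R stands for the size of an object in the same units as the cache capacity, that every object fits in the cache on its own, and that a linear scan for the smallest key is acceptable (a heap would be the usual refinement). The class and method names are hypothetical and this is not Squid's implementation.

```python
class GdsfCache:
    """Toy cache that evicts the entry with the smallest replacement key."""

    def __init__(self, capacity):
        self.capacity = capacity    # total size units available
        self.used = 0
        self.age = 0.0              # C: key of the most recently evicted entry
        self.items = {}             # name -> [key, hits, size]

    def _key(self, hits, size):
        return hits / size + self.age          # (N / R) + C

    def access(self, name, size=1):
        """Record an access; returns True on a hit, False on a miss."""
        if name in self.items:
            entry = self.items[name]
            entry[1] += 1
            entry[0] = self._key(entry[1], entry[2])   # recompute on access
            return True
        # Miss: always store the new object, evicting the worst entries first.
        while self.items and self.used + size > self.capacity:
            victim = min(self.items, key=lambda n: self.items[n][0])
            key, _, victim_size = self.items.pop(victim)
            self.used -= victim_size
            self.age = key                     # dynamic aging
        self.items[name] = [self._key(1, size), 1, size]
        self.used += size
        return False


cache = GdsfCache(capacity=3)
for name in ["a", "a", "b", "c", "d"]:
    cache.access(name)
print(sorted(cache.items))    # "b" was evicted to make room for "d"
```

Swapping `_key` for `hits + self.age` would give the LFUDA variant from the same list.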
LRU is simple and often good enough. The bigger your cache, the better it performs.
LFUDA and GDSF both are tradeoffs. LFUDA prefers to keep large objects even if they are less popular, under the assumption that one hit to a large object makes up lots of hits for smaller objects. GDSF basically makes the opposite tradeoff, keeping many smaller objects over fewer large objects. From what you write, the latter might be a good fit.
If none of these meet your needs, you can calculate optimal values for T, N and R (and compare different formulas for combining them) by minimizing regret, the difference in performance between your formula and the optimal algorithm, using, for example, linear regression.
This is a completely subjective issue, as you yourself point out. And a distinct possibility is that if your test cases consist of pairs (A,B) where you prefer A to B, then you might find that you prefer A to B, B to C but also C to A, i.e. it's not an ordering.
If you are not careful, your function might not exist!
If you can define a scalar function of your input variables, with various parameters for coefficients and exponents, you might be able to estimate said parameters by using regression, but you will need an awful lot of data if you have many parameters.
This is the classical statistician's approach of first reviewing the data to IDENTIFY a model, and then using that model to ESTIMATE a particular realisation of the model. There are large books on this subject.
