Synchronize two lists of objects - algorithm

Problem
I have two lists of objects. Each object contains the following:
GUID (determines whether two objects are the same from a business point of view)
Timestamp (set to the current UTC time each time the object changes)
Version (positive integer; incremented each time the object changes)
Deleted (boolean flag; set to "true" instead of actually deleting the object)
Data (some useful payload)
Any other fields, if needed
Next, I need to sync the two lists according to these rules:
If an object with some GUID is present in only one list, it should be copied to the other list
If an object with some GUID is present in both lists, the instance with the lower Version should be replaced by the one with the higher Version (nothing to do if the versions are equal)
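The two rules can be sketched in a few lines. This is only an illustration that ignores the network entirely; the `Item` class and the `sync` function are hypothetical names, not part of the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    guid: str
    version: int
    deleted: bool = False
    data: str = ""


def sync(list_a, list_b):
    """Apply both rules and return the synced contents as two dicts keyed by GUID."""
    a = {item.guid: item for item in list_a}
    b = {item.guid: item for item in list_b}
    for guid in a.keys() | b.keys():
        left, right = a.get(guid), b.get(guid)
        if left is None:
            a[guid] = right              # present only in B: copy to A
        elif right is None:
            b[guid] = left               # present only in A: copy to B
        elif left.version < right.version:
            a[guid] = right              # B is newer: replace in A
        elif right.version < left.version:
            b[guid] = left               # A is newer: replace in B
        # equal versions: nothing to do
    return a, b
```

Deleted objects need no special handling in this sketch: the Deleted flag travels with the item like any other field.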
Real-world requirements:
Each list has 50k+ objects, each object is about 1 KB
The lists are located on different machines connected via the Internet (e.g., a mobile app and a remote server), so the algorithm shouldn't waste much traffic or CPU
Most of the time (say, 96%) the lists are already in sync before the sync process starts, so the algorithm should detect this with minimal effort
If there are any differences, most of the time they are pretty small (3-5 objects changed/added)
It should work correctly if one list is empty (and the other still has 50k+ items)
Solution #1 (currently implemented)
Client stores the time-of-last-sync-succeed (say T)
Both lists are asked for all objects having Timestamp > T (i.e. recently modified; in the production it's ... > (T - day) for better robustness)
These lists of recently modified objects are synced naively:
items present only in the first list are saved to the second list
items present only in the second list are saved to the first list
other items have their Versions compared and are saved to the appropriate list (if needed)
Pros:
Works great with small changes
Almost fits the requirements
Cons:
Depends on T, which makes the algorithm fragile: it's easy to sync last updates, but hard to make sure lists are completely synced (using minimal T like 1970-01-01 just hangs the sync process)
My questions:
Is there any common / best-practice / proven way to sync object lists?
Are there any better solutions (than #1) for my case?
P.S. Already viewed, not duplicates:
Compare Two List Of Objects For Synchronization
Two list synchronization

Summary
All the answers make some worthwhile points. To summarize, here is the compiled answer I was looking for, based on the sync system I finally implemented and have working:
In general, use Merkle trees. They are dramatically efficient in comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it.
Check the time required to rebuild hash tree. Most likely it's pretty fast (e.g., in my case on Nexus 4 rebuilding tree for 20k items takes ~2 sec: 1.8 sec for fetching data from DB + 0.2 sec for building tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changed (my first try was rebuilding only relevant branches — it's not too complicated to implement, but is very fragile).
Nevertheless, it's OK to cache and reuse the tree if no data modifications were made at all. Once a modification happens, invalidate the whole cache.
Technical details
GUID is 32 chars long without any hyphens/braces, lowercase;
I use 16-ary tree with the height of 4, where each branch is related to the GUID's char. It may be implemented as actual tree or map:
0000 → (hash of items with GUID 0000*)
0001 → (hash of items with GUID 0001*)
...
ffff → (hash of items with GUID ffff*);
000 → (hash of hashes 000_)
...
00 → (hash of hashes 00_)
...
() → (root hash, i.e. hash of hashes _)
Thus, the tree has 65536 leaves and requires about 2 MB of memory; each leaf covers ~N/65536 data items. A binary tree would be 2x more memory-efficient, but it is harder to implement.
I had to implement these methods:
getHash(): returns the root hash; used for the primary check (as mentioned, in 96% of cases that's all we need to test);
getHashChildren(x): returns the list of hashes x_ (at most 16); used to discover data differences efficiently, one request per level;
findByIdPrefix(x): returns the items with GUID x*, where x must contain exactly 4 chars; used for requesting leaf items;
count(x): returns the number of items with GUID x*; when it is reasonably small, we can skip checking the tree branch by branch and transfer the whole bunch of items in a single request;
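A minimal sketch of such a tree, stored as a map from GUID prefix to hash, might look like this. The class name `HashTree`, the choice of SHA-256, and hashing only (GUID, Version) pairs rather than the full item content are assumptions made for illustration; the four methods mirror the ones listed above.

```python
import hashlib
from collections import defaultdict

HEX = "0123456789abcdef"
DEPTH = 4  # tree height: one level per leading GUID character


def _h(parts):
    """Hash a sequence of byte strings into a single digest."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part)
    return digest.digest()


class HashTree:
    def __init__(self, items):
        # items: iterable of (guid, version) pairs; rebuilt from scratch each time
        self.leaves = defaultdict(list)
        for guid, version in items:
            self.leaves[guid[:DEPTH]].append((guid, version))
        self.hashes = {}  # prefix ("" .. 4 chars) -> hash
        self._build("")

    def _build(self, prefix):
        if len(prefix) == DEPTH:
            # leaf: hash of the sorted items whose GUID starts with this prefix
            entries = sorted(self.leaves.get(prefix, []))
            value = _h(f"{g}:{v};".encode() for g, v in entries)
        else:
            # inner node: hash of the 16 child hashes
            value = _h(self._build(prefix + c) for c in HEX)
        self.hashes[prefix] = value
        return value

    def get_hash(self):
        return self.hashes[""]

    def get_hash_children(self, prefix):
        return {prefix + c: self.hashes[prefix + c] for c in HEX}

    def find_by_id_prefix(self, prefix):
        assert len(prefix) == DEPTH
        return sorted(self.leaves.get(prefix, []))

    def count(self, prefix):
        return sum(len(v) for k, v in self.leaves.items() if k.startswith(prefix))
```

Two sides holding the same items produce the same root hash regardless of insertion order, which is what makes the single `get_hash()` comparison sufficient in the common already-synced case.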
Since syncing is done per branch, transmitting small amounts of data at a time, it's very responsive (you can check the progress at any time), very robust against unexpected termination (e.g., due to a network failure), and easily restarts from the last point if needed.
IMPORTANT: sometimes you will get stuck in a conflicting state {version_1 = version_2, but hash_1 != hash_2}. In this case you must make a decision (perhaps with the user's help, or by comparing timestamps as a last resort) and overwrite one item with the other to resolve the conflict; otherwise you'll end up with unsynced and unsyncable hash trees.
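The per-branch descent can be sketched as follows. Both trees are plain in-memory dicts here; in a real client the `remote` look-ups would be network requests. The function names, the sparse prefix map, and hashing only (GUID, Version) pairs are illustrative assumptions.

```python
import hashlib

DEPTH = 4
HEX = "0123456789abcdef"


def build_hashes(items):
    """Map every non-empty GUID prefix of length 0..DEPTH to a hash."""
    # Leaf level: hash the sorted (guid, version) pairs under each 4-char prefix.
    groups = {}
    for guid, version in sorted(items):
        groups.setdefault(guid[:DEPTH], []).append(f"{guid}:{version}")
    level = {p: hashlib.sha256(";".join(e).encode()).digest()
             for p, e in groups.items()}
    hashes = dict(level)
    # Inner levels: each parent hashes its children's prefixes and hashes.
    for _ in range(DEPTH):
        parents = {}
        for prefix in sorted(level):
            node = parents.setdefault(prefix[:-1], hashlib.sha256())
            node.update(prefix.encode() + level[prefix])
        level = {p: node.digest() for p, node in parents.items()}
        hashes.update(level)
    return hashes


def differing_leaves(local, remote, prefix=""):
    """Yield the 4-char prefixes whose items differ; equal branches are skipped."""
    if local.get(prefix) == remote.get(prefix):
        return
    if len(prefix) == DEPTH:
        yield prefix
        return
    for c in HEX:
        yield from differing_leaves(local, remote, prefix + c)
```

Because the generator yields one leaf prefix at a time, a caller can sync each leaf as it arrives and simply start over after a dropped connection; branches already synced will compare equal and be skipped.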
Possible improvements
Transmit (GUID, Version) pairs without the payload to make requests lighter.

Two suggestions come to mind, the first one is possibly something you're doing already:
1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples of objects with timestamps > T. Then the other side can figure out which objects it needs to update from that. Send the UUIDs of those back to request the actual objects. This avoids sending full objects if they have timestamp > T, but are nonetheless newer already (or present already with the latest Version) on the other side.
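As a rough sketch of the receiving side of suggestion 1: given the announced (UUID, Version) pairs, it looks each one up in its own store and asks only for the objects it lacks or holds in an older version. The function name and the dict-based store are hypothetical.

```python
def uuids_to_request(announced, local_versions):
    """announced: (uuid, version) pairs sent by the other side.
    local_versions: uuid -> version for every object stored locally.
    Returns the uuids whose full objects should be requested."""
    # Versions are positive integers, so 0 stands for "not present locally".
    return [uuid for uuid, version in announced
            if local_versions.get(uuid, 0) < version]
```

For example, with local versions `{"a": 3, "b": 1}` and an announcement of `[("a", 3), ("b", 2), ("c", 1)]`, only `"b"` and `"c"` need to be transferred.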
2) Don't process the full list at once, but in chunks, i.e. first sync 10%, then the next 10% etc. to avoid transferring too much data at once for big syncs (and to allow for restarting points if a connection should break). This can be done by e.g. starting with all UUIDs with a checksum equivalent to 0 modulo 10, then 1 modulo 10 etc.
Another possibility would be proactive syncing, e.g. asynchronously posting changes, possibly via UDP (unreliable as opposed to TCP). You would still need to sync when you need current information, but chances are most of it is current.
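One way to form such chunks, assuming CRC-32 as the checksum (the suggestion does not name one), is sketched below. Both sides must use the same checksum and chunk count so that a given UUID always lands in the same chunk.

```python
import zlib


def chunk_of(uuid, chunks=10):
    """Chunk index of a UUID: its CRC-32 checksum modulo the chunk count."""
    return zlib.crc32(uuid.encode("ascii")) % chunks


def split_into_chunks(uuids, chunks=10):
    """Group UUIDs by chunk index; chunk 0 is synced first, then chunk 1, and so on."""
    buckets = [[] for _ in range(chunks)]
    for uuid in uuids:
        buckets[chunk_of(uuid, chunks)].append(uuid)
    return buckets
```

After an interrupted sync, the client only has to remember the index of the last completed chunk to know where to resume.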

You need to store not time of last synchronization, but the state of the objects (eg. the hash of object data) at time of last synchronization. Then you compare each list with the stored list and find, what objects have changed on each side.
This is much more reliable than relying on time, because time-based detection requires both sides to have synchronized clocks that give precise time (and this is not the case on most systems). For the same reason, your idea of detecting changes based on time + version can be more error-prone than it initially seems.
Also you don't initially transfer object data but only GUIDs.
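A small sketch of the idea: keep a GUID-to-hash snapshot taken at the last successful sync, then compare the current data against it. The function names and the use of SHA-256 over the raw payload are assumptions for illustration.

```python
import hashlib


def snapshot(objects):
    """objects maps GUID -> payload bytes; returns GUID -> hash of the payload."""
    return {guid: hashlib.sha256(data).hexdigest() for guid, data in objects.items()}


def changes_since(last_sync, objects):
    """Compare the current objects with the snapshot stored at the last sync.
    Returns (added, removed, modified) lists of GUIDs."""
    current = snapshot(objects)
    added = sorted(current.keys() - last_sync.keys())
    removed = sorted(last_sync.keys() - current.keys())
    modified = sorted(g for g in current.keys() & last_sync.keys()
                      if current[g] != last_sync[g])
    return added, removed, modified
```

Each side runs this against its own stored snapshot, so only the resulting GUID lists need to be exchanged before any object data is transferred.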
BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.

Related

What is the fastest way to intersect two large set of ids

The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you need not limit yourselves to the following list of broad considerations, here is what I debated internally to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that implements such an algorithm. I know that assembly is a mess to maintain and code, but sometimes one needs the speed of a highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine executed directly on the Linux server (and then I'd always take care to stick with the same processor to avoid having to rewrite this too often)? Or, alternatively, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead of maintaining assembly code.
Images and GPU: GPU and image processing could be used and, instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id has a unique value between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but a pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but translating the resulting image back to ids may then be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). A 200,000,000-bit sequence is an estimated 23MB per image; just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured so that each id is stored under its corresponding hash key, then I could run sequentially through the smaller ids set; matching an id against the larger ids set would first require hashing the value to find the key, and then sequentially matching it against the ids stored under that key.
This should make the algorithm run in O(n) time, and since I'd pick the smaller ids set as the one to run through sequentially, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise AND and NOT operations to intersect and negate lists of ids. Depending on the language and libraries used, this may require operating element-wise: iterating over the arrays and applying the corresponding operation at each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
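A compact way to try the bit-array idea is to use Python's arbitrary-precision integers as bitmaps, so that intersection is a single `&`. This is a sketch under that assumption, not a tuned implementation; the function names and the `lo`/`hi` parameters are introduced here for illustration.

```python
MIN_ID = -100_000_000
MAX_ID = 100_000_000


def to_bitmap(ids, lo=MIN_ID, hi=MAX_ID):
    """Pack ids into an int whose bit (i - lo) is set for every id i."""
    buf = bytearray((hi - lo) // 8 + 1)
    for i in ids:
        j = i - lo
        buf[j >> 3] |= 1 << (j & 7)
    return int.from_bytes(buf, "little")


def negate(bitmap, lo=MIN_ID, hi=MAX_ID):
    """Flip every bit inside the valid id range [lo, hi]."""
    full = (1 << (hi - lo + 1)) - 1
    return bitmap ^ full


def to_ids(bitmap, lo=MIN_ID):
    """Unpack a bitmap into a sorted list of ids (intended for small results)."""
    ids = []
    while bitmap:
        low = bitmap & -bitmap               # isolate the lowest set bit
        ids.append(lo + low.bit_length() - 1)
        bitmap ^= low
    return ids
```

With these helpers, `to_ids(to_bitmap(a) & to_bitmap(b))` intersects two id lists, and `to_bitmap(a) & negate(to_bitmap(b))` keeps the ids of `a` that are not in `b`. Since the files never change, the bitmaps can be built once and stored.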
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
  isNegated @0 :Bool;
  ids @1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping isNegated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list.
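The four cases above can be sketched with one merge pass over the two sorted lists. Each set is modelled as a `(negated, ids)` tuple here instead of a Cap'n Proto message; that representation and the function names are assumptions for illustration.

```python
def merge(a, b, keep_both, keep_only_a, keep_only_b):
    """Walk two sorted lists in parallel, keeping the selected categories of ids."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            if keep_both:
                out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if keep_only_a:
                out.append(a[i])
            i += 1
        else:
            if keep_only_b:
                out.append(b[j])
            j += 1
    if keep_only_a:
        out.extend(a[i:])
    if keep_only_b:
        out.extend(b[j:])
    return out


def negate(x):
    """NOT is just flipping the flag; the ids are untouched."""
    return (not x[0], x[1])


def intersect(x, y):
    (neg_a, a), (neg_b, b) = x, y
    if not neg_a and not neg_b:
        return (False, merge(a, b, True, False, False))   # ids in both lists
    if not neg_a and neg_b:
        return (False, merge(a, b, False, True, False))   # in first, not in second
    if neg_a and not neg_b:
        return (False, merge(a, b, False, False, True))   # in second, not in first
    return (True, merge(a, b, True, True, True))          # negated union of both
```

When both inputs are negated, the result stays negated and stores the union, so no set ever has to be expanded to the full id range.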
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number, skipping most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
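A galloping intersection might look roughly like this: for each id of the small list, the search window in the large list is doubled until it brackets the id, and a binary search finishes the job. The function name is hypothetical, and both inputs are assumed to be sorted lists of unique ids.

```python
from bisect import bisect_left


def gallop_intersect(small, large):
    """Intersect two sorted lists of unique ints, skipping ahead in the large one."""
    out, lo = [], 0
    for x in small:
        # Gallop: double the step until large[lo + step] >= x or we pass the end.
        step = 1
        while lo + step < len(large) and large[lo + step] < x:
            step *= 2
        hi = min(lo + step + 1, len(large))
        # Binary search inside the bracketed window for the first element >= x.
        lo = bisect_left(large, x, lo, hi)
        if lo == len(large):
            break                      # everything left in small is too big
        if large[lo] == x:
            out.append(x)
    return out
```

The work is roughly proportional to the size of the small list times the logarithm of the gaps skipped, rather than to the size of the large list.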
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

best data structure for range delete

I have a stream of chars that I need to keep in a big data structure (can contain billions of chars)
I need to be able to:
store these chars quickly.
get all the chars quickly in order to print them for example
Delete a range of chars without leaving any gaps in the memory.
My first thought was a doubly linked list, but the problem is that it takes too long to get to the middle of the list (the beginning of the range) in order to delete.
To solve that I was thinking about a skip list, which would make the search for this range faster, but then I face the problem of having to re-index each node after a deletion
([0,1,2,3,4,5,6,7]
=> delete (3,4)
=> [0,1,2,5,6,7]
=> delete (3,4)
=> [0,1,2,7]
In this example, after the first delete I need to give the numbers 5, 6, 7 new indexes.)
What is the best way to do this?
It might be helpful to read about the span<T> data structure.
Related Answer: What is a "span" and when should I use one?
A span<T> is:
A very lightweight abstraction of a contiguous sequence of values of type T somewhere in memory.
Basically a struct { T * ptr; std::size_t length; } with a bunch of convenience methods.
A non-owning type (i.e. a "reference-type" rather than a "value type"): It never allocates nor deallocates anything and does not keep smart pointers alive.
I would add that if you are processing a stream of characters, you will probably want to use buffering (or perhaps more apt - "chunking") where each chunk is itself a span<char> of fixed-size (which are all stored in a separate bit of memory) but tracked in a central array (or a more complex data structure like a double-linked-list, to facilitate quick deletion)
It would be an anti-pattern to attempt to actually maintain your entire stream of data in a single piece of contiguous physical memory (which you seem to suggest in part 3 of your request), especially if you plan on deleting chunks of it. There should be other ways to facilitate fast deletion without sacrificing performance elsewhere.
For example if you wish to delete a range of characters that falls into a given span, you can create two new spans from the start and end of the original span, excluding the deleted characters, and then replace the original span instance in your larger data structure (e.g if it were a double-linked list) with the two new smaller spans. None of this requires copying the underlying data itself, just slicing up our lightweight references to the underlying data.
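The span-splitting step can be sketched in Python, with each span modelled as an `(offset, length)` pair into a shared, never-modified buffer. A plain list stands in for the linked list, and the function names are hypothetical.

```python
def delete_range(spans, start, count):
    """Remove logical characters [start, start + count) without copying any data.
    spans: list of (offset, length) views into a shared buffer."""
    end = start + count
    out, pos = [], 0                 # pos: logical index of the current span's first char
    for offset, length in spans:
        span_end = pos + length
        keep_left = max(0, min(length, start - pos))      # chars before the deleted range
        keep_right = max(0, min(length, span_end - end))  # chars after the deleted range
        if keep_left:
            out.append((offset, keep_left))
        if keep_right:
            out.append((offset + length - keep_right, keep_right))
        pos = span_end
    return out


def render(buffer, spans):
    """Materialize the logical text, e.g. for printing."""
    return "".join(buffer[o:o + n] for o, n in spans)
```

Applied to the buffer `"01234567"` split into two spans of four characters, deleting two characters at index 3 twice in a row leaves `"012567"` and then `"0127"`, while the buffer itself is never touched. Logical indexes are recomputed from the span lengths, so nothing has to be re-indexed after a deletion.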
If your language of choice doesn't support span, or a similar structure, check out how span is implemented.
Depending on your language of choice, it may even have built-in support for streaming spans (as .NET Core 2.1+ (2018) does).
Any additional requirements (such as fast indexing to any point in your data stream, net of any deletions) can be satisfied by maintaining separate data structures that carry metadata about your spans (such as the suggested linked list). They will need updating when spans are deleted or added to, but because spans are a thin layer on top of large strings of characters, they reduce the cardinality of the data structures you are maintaining by several orders of magnitude. So while you could get fancy with maintaining a variety of heaps and maps to facilitate O(1) algorithms for every operation, you will probably find that basic structures and O(log(n)) or even O(N) (where N is actually N/chunk-size) maintenance operations are feasible.

Efficient synchronization algorithm

Let's say I have a large sorted (10+ MB, 650k+ rows) dataset on node_a and a different dataset on node_b. There is no master version of the dataset, meaning that either node can have some pieces which are not available to the other node. My goal is to have the content of node_a synchronized with the content of node_b. What is the most efficient way to do so?
Common sense solution would be:
node_a: Here's everything I have... (sends entire dataset)
node_b: Here's what you don't have... (sends missing parts)
But this solution is not efficient at all. It requires node_a to send the entire dataset (10+ MB) every time it attempts to synchronize.
So, using a little brainpower, I could introduce partitioning of the dataset: send only a part of the entire content and expect differences to be found between the first and last row of that part.
Can you think of any better solutions?
For a single synchronization:
Break the dataset up into arbitrary parts, hash each (with MD5, for example), and only send through the hash values instead of the whole data set. Then use a comparison of the hash values on the other side to determine what's not the same on each side, and send this through as appropriate.
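As an illustration, the sketch below assigns each row to a part by the CRC-32 of its key and hashes every part with MD5. It assumes each row has a unique key that is the same on both nodes; the function names and the part count of 16 are arbitrary choices.

```python
import hashlib
import zlib


def part_hashes(rows, n_parts=16):
    """rows maps a unique key -> row text. Returns one MD5 hex digest per part."""
    parts = [hashlib.md5() for _ in range(n_parts)]
    for key in sorted(rows):
        part = zlib.crc32(key.encode()) % n_parts   # same key always lands in the same part
        parts[part].update(f"{key}={rows[key]}\n".encode())
    return [p.hexdigest() for p in parts]


def differing_parts(local_hashes, remote_hashes):
    """Indexes of the parts whose hashes disagree and therefore need a closer look."""
    return [i for i, (l, r) in enumerate(zip(local_hashes, remote_hashes)) if l != r]
```

Only the list of digests crosses the network at first; the rows of a part are exchanged only when its digest differs. Assigning parts by key rather than by row position means a missing row affects one part instead of shifting all the parts after it.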
If each part doesn't have a global unique ID (i.e. a primary key that's guaranteed to be the same for the corresponding row on each side), you may need some meta-data sent across as well, or send hashes of parts incrementally, determining the difference as you go, and changing what you send if required (e.g. send the hash of 10 rows at a time, if you find a missing row, there will be a mismatch of the rows - either cater for this on the receiver-side, or offset the sender by one row). How exactly this should be done will depend on what your data looks like.
For repeated synchronization:
A good idea might be to create a master version, and store this separately on one of the nodes, although this probably isn't necessary if you don't care about conflicts or being able to revert mistakes.
With or without a master version, you can use versioning here. Store the version of the last synchronization, and store a version on each part. When synchronizing, just send the parts with a version higher than the last synchronized version.
As an alternative to a globally auto-incremented version, you could either use a timestamp as the version, or just have a modified flag on each part, setting it when modified, sending all parts with their flag set, and resetting the flags once synchronized.
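The globally incremented version variant could be sketched as follows; the class and method names are hypothetical, and parts are plain strings for brevity.

```python
class VersionedStore:
    """Keeps a global counter and stamps each part with the version of its last change."""

    def __init__(self):
        self.version = 0
        self.parts = {}          # part id -> (version, data)

    def put(self, part_id, data):
        self.version += 1
        self.parts[part_id] = (self.version, data)

    def changed_since(self, last_synced_version):
        """Parts modified after the given version, i.e. what has to be sent."""
        return {p: d for p, (v, d) in self.parts.items() if v > last_synced_version}
```

After a successful synchronization the node records `store.version`; the next synchronization sends only `store.changed_since(recorded_version)`.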

What data structure is best optimized to represent a stock market?

Data for various stocks is coming from various stock exchange continuously. Which data structure is suitable to store these data?
things to consider are :
a) effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
I thought of using Heap as the number of stocks would be more or less constant and the most frequent used operations are retrieval and update so heap should perform well for this scenario.
b) need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
I am not sure how to go about this.
c) As storing to a database from any programming language has some latency, considering the number of stocks that will be traded at a particular time, how can you store all the transactional data persistently?
PS: This is an interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e. look-up by index) nor getting the top k elements without removing elements (which is not desired).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding of data structures (related to in-memory storage, rather than persistent).
It seems multiple data structures is the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map would make sense for this one. Hash-map or tree-map allows for fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted (double) linked-list. It takes minimal time to get the first or last n items. Since you have a pointer to the element through the map, updating takes as long as the map lookup plus the number of moves of that item required to get it sorted again (if any). If an item often moves many indices at once, a linked-list would not be a good option (in which case I'd probably go for a Binary Search Tree).
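A rough sketch of the map-plus-sorted-structure combination is shown below. It swaps in a sorted Python list maintained with `bisect` for the linked list described above, purely to keep the example short; the class and method names are hypothetical, and only trading volume is tracked.

```python
from bisect import bisect_left, insort


class StockBoard:
    def __init__(self):
        self.volume = {}       # symbol -> current volume (fast look-up and update)
        self.by_volume = []    # sorted list of (volume, symbol) for trending queries

    def update(self, symbol, volume):
        old = self.volume.get(symbol)
        if old is not None:
            # The map tells us the old key, so the stale entry can be located directly.
            del self.by_volume[bisect_left(self.by_volume, (old, symbol))]
        self.volume[symbol] = volume
        insort(self.by_volume, (volume, symbol))

    def most_active(self, n):
        return [s for _, s in self.by_volume[:-n - 1:-1]]

    def least_active(self, n):
        return [s for _, s in self.by_volume[:n]]
```

A second sorted structure keyed on profit would be maintained the same way. With a Python list, each update shifts elements and is linear in the number of stocks, which is tolerable while that number stays more or less constant; a balanced tree would avoid the shifting.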
c) How can you store all the transactional data persistently?
I understand this question as - if the connection to the database is lost or the database goes down at any point, how do you ensure there is no data corruption? If this is not it, I would've asked for a rephrase.
Just about any database course should cover this.
As far as I remember - it has to do with creating another record, updating this record, and only setting the real pointer to this record once it has been fully updated. Before this you might also have to set a pointer to the old record so you can check if it's been deleted if something happens after setting the pointer away, but before deletion.
Another option is having an active transaction table which you add to when starting a transaction and remove from when a transaction completes (and which also stores all the details required to roll back or resume the transaction). Thus, whenever everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I had to choose, I would go for a hash table:
Reason: it is synchronized and thread-safe, with O(1) average-case complexity.
Provided:
1. A good hash function to avoid collisions.
2. A high-performance cache.
While this is a language agnostic question, a few of the requirements jumped out at me. For example:
effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
The Java class HashMap uses the hash code of a key to rapidly access values in its collection. Look-ups and updates have O(1) average-case runtime complexity, which is ideal.
need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
This is an implementation based issue. Your best bet is to implement a fast sorting algorithm, like QuickSort or Mergesort.
as storing to database using any programming language has some latency considering the amount of stocks that will be traded during a particular time, how can u store all the transactional data persistently??
A database would have been my first choice, but it depends on your resources.

Caching vector addition over changing collections

I have the following setup:
I have a largish number of uuids (currently about 10k but expected to grow unboundedly - they're user IDs) and a function f : id -> sparse vector with 32-bit integer values (no need to worry about precision). The function is reasonably expensive (not outrageously so, but probably on the order of a few 100ms for a given id). The dimension of the sparse vectors should be assumed to be infinite, as new dimensions can appear over time, but in practice is unlikely to ever exceed about 20k (and individual results of f are unlikely to have more than a few hundred non-zero values).
I want to support the following operations efficiently:
add a new ID to the collection
invalidate an existing ID
retrieve sum f(id) in O(changes since last retrieval)
i.e. I want to cache the sum of the vectors in a way that's reasonable to do incrementally.
One option would be to support a remove ID operation and treat invalidation as a remove followed by an add. The problem with this is that it requires us to keep track of all the old values of f, which is expensive in space. I potentially need to use many instances of this sort of cached structure, so I would like to avoid that.
The likely usage pattern is that new IDs are added at a fairly continuous rate and are frequently invalidated at first. Ids which have been invalidated recently are much more likely to be invalidated again than ones which have remained valid for a long time, but in principle an old Id can still be invalidated.
Ideally I don't want to do this in memory (or at least I want a way that lets me save the result to disk efficiently), so an idea which lets me piggyback off an existing DB implementation of some sort would be especially appreciated.

Resources