Caching vector addition over changing collections - data-structures

I have the following setup:
I have a largish number of uuids (currently about 10k but expected to grow unboundedly - they're user IDs) and a function f : id -> sparse vector with 32-bit integer values (no need to worry about precision). The function is reasonably expensive (not outrageously so, but probably on the order of a few 100ms for a given id). The dimension of the sparse vectors should be assumed to be infinite, as new dimensions can appear over time, but in practice is unlikely to ever exceed about 20k (and individual results of f are unlikely to have more than a few hundred non-zero values).
I want to support the following operations efficiently:
add a new ID to the collection
invalidate an existing ID
retrieve sum f(id) in O(changes since last retrieval)
i.e. I want to cache the sum of the vectors in a way that's reasonable to do incrementally.
One option would be to support a remove ID operation and treat invalidation as a remove followed by an add. The problem with this is that it requires us to keep track of all the old values of f, which is expensive in space. I potentially need to use many instances of this sort of cached structure, so I would like to avoid that.
The likely usage pattern is that new IDs are added at a fairly continuous rate and are frequently invalidated at first. Ids which have been invalidated recently are much more likely to be invalidated again than ones which have remained valid for a long time, but in principle an old Id can still be invalidated.
Ideally I don't want to do this in memory (or at least I want a way that lets me save the result to disk efficiently), so an idea which lets me piggyback off an existing DB implementation of some sort would be especially appreciated.

Related

What is the fastest way to intersect two large set of ids

The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you could not limit yourselves to the following list of broad considerations for solutions, here is a list of what I internally debated to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that does such algorithm. I know that assembly is a mess to maintain and code, but sometimes one need the speed of an highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine to be executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternately, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead to maintain assembly code.
Images and GPU: GPU and image processing could be used and instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id have unique values between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but the pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but then to translate the resulting image back to ids may be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). An estimate of a 200,000,000 bits sequence is 23MB each, just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured in such a way that each id resides within its corresponding hash key, then, the smaller of the ids set could be sequentially ran and matching the id into the larger ids set first would require hashing the value to match, and then sequentially matching of the corresponding ids into that key match?
This should make the algorithm an O(n) time based one, and since I'd pick the smallest ids set to be the sequentially ran one, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where bit at index j is set if the list contains value j-100_000_000. It is possible, because the range of id values is fixed and small.
Then you can simply use bitwise and and not operations to intersect and negate lists of ids. Depending on the language and libraries used, it would require operating element-wise: iterating over arrays and applying corresponding operations to each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
First, the bitmap approach will produce the required performance, at a huge overhead in memory. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, with that almost entirely dominated by the cost of loading data from disk, and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
is_negated #0 :Bool;
ids #1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping is_negated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
If both are negated, you just want to find all ids in either list.
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number and picking most of them. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

best data structure for range delete

I have a stream of chars that I need to keep in a big data structure (can contain billions of chars)
I need to be able to:
store these chars quickly.
get all the chars quickly in order to print them for example
Delete a range of chars without leaving any gaps in the memory.
my first thought was double linked list , but the problem is that is taking to long to get to the middle of the list (begnining of the range)in order to delete.
to solve that I was thinking about a skip list which will make the search of this range faster but then I'm facing the problem of having to re-index each node after deletion
([0,1,2,3,4,5,6,7]
=> delete (3,4)
=> [0,1,2,5,6,7]
=> delete (3,4)
=> [0,1,2,7]
in this example after the first delete I need to give numbers 5,6,7 new indexes )
what is the best way to do this ?
It might be helpful to read about the span<T> data structure.
Related Answer: What is a "span" and when should I use one?
A span<T> is:
A very lightweight abstraction of a contiguous sequence of values of type T somewhere in memory.
Basically a struct { T * ptr; std::size_t length; } with a bunch of convenience methods.
A non-owning type (i.e. a "reference-type" rather than a "value type"): It never allocates nor deallocates anything and does not keep
smart pointers alive.
I would add that if you are processing a stream of characters, you will probably want to use buffering (or perhaps more apt - "chunking") where each chunk is itself a span<char> of fixed-size (which are all stored in a separate bit of memory) but tracked in a central array (or a more complex data structure like a double-linked-list, to facilitate quick deletion)
It would be an anti-pattern to attempt to actually maintain your entire stream of data in a single piece of contiguous physical memory (which you seem to suggest in part 3 of your request) - especially if you plan on deleting chunks of it. There should other ways to facilitate fast deletion without sacrificing performance elsewhere.
For example if you wish to delete a range of characters that falls into a given span, you can create two new spans from the start and end of the original span, excluding the deleted characters, and then replace the original span instance in your larger data structure (e.g if it were a double-linked list) with the two new smaller spans. None of this requires copying the underlying data itself, just slicing up our lightweight references to the underlying data.
If your language of choice doesn't support span, or a similar structure, check out how span is implemented.
Depending on your language of choice, it may even have built-in support for streaming spans (as .NET Core 2.1+ (2018) does).
Any additional requirements (such fast indexing to any point in your data stream, net of any deletions) can be satisfied by maintaining separate data structures that carry metadata about your spans (such as the suggested linked list). They will need updating when spans are deleted or added to, but because spans are a thin layer on top of large strings of characters, they reduce the cardinality of data structures you are maintaining by several orders of magnitude, so while you could get fancy with maintaining a variety of heaps and maps to facilitate O(1) algorithms for every operation, you will probably find that basic structures and O(log(n)) or even O(N) (where N is actually N/chunk-size) maintenance operations are feasible.

Synchronize two lists of objects

Problem
I have two lists of objects. Each object contains the following:
GUID (allows to determine if objects are the same — from business
point of view)
Timestamp (updates to current UTC each time the
object changed)
Version (positive integer; increments each time
the object changed)
Deleted (boolean flag; switches to "true" instead
of actual object deleting)
Data (some useful payload)
Any other fields if need
Next, I need to sync two lists according to these rules:
If object with some GUID presented only in one list, it should be copied to another list
If object with some GUID presented in both lists, the instance with less Version should be replaced with one having greater Version (nothing to do if versions are equal)
Real-world requirements:
Each list has 50k+ objects, each object is about 1 Kb
Lists are placed on different machines connected via Internet (e.g., mobile app and remote server), thus, algorithm shouldn't waste the traffic or CPU much
Most of time (say, 96%) lists are already synced before sync process, hence, the algorithm should determine it with minimal effort
If there are any differences, most of time they are pretty small (3-5 objects changed/added)
Should proceed OK if one list is empty (and other still has 50k+ items)
Solution #1 (currently implemented)
Client stores the time-of-last-sync-succeed (say T)
Both lists are asked for all objects having Timestamp > T (i.e. recently modified; in the production it's ... > (T - day) for better robustness)
These lists of recently modified objects are synced naively:
items presented only in first list are saved to second list
items presented only in second list are saved to first list
other items has their Version's compared and saved to appropriative list (if need)
Procs:
Works great with small changes
Almost fits the requirements
Cons:
Depends on T, which makes the algorithm fragile: it's easy to sync last updates, but hard to make sure lists are completely synced (using minimal T like 1970-01-01 just hangs the sync process)
My questions:
Is there any common / best-practice / proved way to sync object lists?
Is there any better [than #1] solutions for my case?
P.S. Already viewed, not duplicates:
Compare Two List Of Objects For Synchronization
Two list synchronization
Summary
All answers has some worth points. To summarize, here is the compiled answer I was looking for, based on finally implemented working sync system:
In general, use Merkle trees. They are dramatically efficient in comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it.
Check the time required to rebuild hash tree. Most likely it's pretty fast (e.g., in my case on Nexus 4 rebuilding tree for 20k items takes ~2 sec: 1.8 sec for fetching data from DB + 0.2 sec for building tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changed (my first try was rebuilding only relevant branches — it's not too complicated to implement, but is very fragile).
Nevertheless, it's ok to cache and reuse tree if no data modifications was done at all. Once modification happened, invalidate the whole cache.
Technical details
GUID is 32 chars long without any hyphens/braces, lowercase;
I use 16-ary tree with the height of 4, where each branch is related to the GUID's char. It may be implemented as actual tree or map:
0000 → (hash of items with GUID 0000*)
0001 → (hash of items with GUID 0001*)
...
ffff → (hash of items with GUID ffff*);
000 → (hash of hashes 000_)
...
00 → (hash of hashes 00_)
...
() → (root hash, i.e. hash of hashes _)
Thus, the tree has 65536 leafs and requires 2 Mb of memory; each leaf covers ~N/65536 data items. Binary trees would be 2x more efficient in terms of memory, but it's a harder to implement.
I had to implement these methods:
getHash() — returns root hash; used for primary check (as mentioned,
in 96% that's all we need to test);
getHashChildren(x) — returns list of hashes x_ (at most 16); used for effective, single-request discovering data difference;
findByIdPrefix(x) — returns items with GUID x*, x must contain exactly 4 chars; used for requesting leaf items;
count(x) — returns number of items with GUID x*; when reasonably small, we can dismiss checking tree branch-by-branch and transfer bunch of items with single request;
As far as syncing is done per-branch transmitting small amounts of data, it's very responsive (you can check the progress at any time) + very robust for unexpected terminating (e.g., due to network failure) and easily restarts from the last point if need.
IMPORTANT: sometimes you will stuck with conflicting state: {version_1 = version_2, but hash_1 != hash_2}: in this case you must make some decision (maybe with user's help or comparing timestamps as last resort) and rewrite some item with another to resolve the conflict, otherwise you'll end up with unsynced and unsyncable hash trees.
Possible improvements
Implement transmitting (GUID, Version) pairs without payload for lightweighting requests.
Two suggestions come to mind, the first one is possibly something you're doing already:
1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples of objects with timestamps > T. Then the other side can figure out which objects it needs to update from that. Send the UUIDs of those back to request the actual objects. This avoids sending full objects if they have timestamp > T, but are nonetheless newer already (or present already with the latest Version) on the other side.
2) Don't process the full list at once, but in chunks, i.e. first sync 10%, then the next 10% etc. to avoid transferring too much data at once for big syncs (and to allow for restarting points if a connection should break). This can be done by e.g. starting with all UUIDs with a checksum equivalent to 1 modulo 10, then 1 modulo 10 etc.
Another possibility would be proactive syncing, e.g. asynchronously posting chances, possibly via UCP (unreliable as opposed to TCP). You would still need to sync when you need current information, but chances are most of it is current.
You need to store not time of last synchronization, but the state of the objects (eg. the hash of object data) at time of last synchronization. Then you compare each list with the stored list and find, what objects have changed on each side.
This is much more reliable than rely on time, cause time requires that both sides have synchronized timer which gives precise time (and this is not the case on most systems). For the same reason your idea of detecting changes based on time + version can be more error-prone than it initially seems.
Also you don't initially transfer object data but only GUIDs.
BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.

What data structure will optimzied to represent stock market?

Data for various stocks is coming from various stock exchange continuously. Which data structure is suitable to store these data?
things to consider are :
a) effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
I thought of using Heap as the number of stocks would be more or less constant and the most frequent used operations are retrieval and update so heap should perform well for this scenario.
b) need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
I am nt sure about how to got about this.
c) as storing to database using any programming language has some latency considering the amount of stocks that will be traded during a particular time, how can u store all the transactional data persistently??
Ps: This is a interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e. look-up by index) nor getting the top k elements without removing elements (which is not desired).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding of data structures (related to in-memory storage, rather than persistent).
It seems multiple data structures is the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map would make sense for this one. Hash-map or tree-map allows for fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted (double) linked-list. It takes minimal time to get the first or last n items. Since you have a pointer to the element through the map, updating takes as long as the map lookup plus the number of moves of that item required to get it sorted again (if any). If an item often moves many indices at once, a linked-list would not be a good option (in which case I'd probably go for a Binary Search Tree).
c) How can you store all the transactional data persistently?
I understand this question as - if the connection to the database is lost or the database goes down at any point, how do you ensure there is no data corruption? If this is not it, I would've asked for a rephrase.
Just about any database course should cover this.
As far as I remember - it has to do with creating another record, updating this record, and only setting the real pointer to this record once it has been fully updated. Before this you might also have to set a pointer to the old record so you can check if it's been deleted if something happens after setting the pointer away, but before deletion.
Another option is having a active transaction table which you add to when starting a transaction and remove from when a transaction completes (which also stores all required details to roll back or resume the transaction). Thus, whenever everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I have to choose , I would go for Hash Table:
Reason : It is synchronized and thread safe , BigO(1) as average case complexity.
Provided :
1.Good hash function to avoid the collision.
2. High performance cache.
While this is a language agnostic question, a few of the requirements jumped out at me. For example:
effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
The java class HashMap uses the hash code of a key value to rapidly access values in its collection. It actually has an O(1) runtime complexity, which is ideal.
need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
This is an implementation based issue. Your best bet is to implement a fast sorting algorithm, like QuickSort or Mergesort.
as storing to database using any programming language has some latency considering the amount of stocks that will be traded during a particular time, how can u store all the transactional data persistently??
A database would have been my first choice, but it depends on your resources.

Are there any hash functions that allow you to resize the table without also rehashing (removing + reinserting) the contents?

Is it possible using a certain hash function and method (the division method, or double hashing) to make a chained hash table that can be resized without having to reinsert (rehash) each element already in the table?
You would still need to reinsert, but some way to make that cheaper would be to store the hash value before the modulus was applied. That way, you can save a large part of the calculation cost of rehashing.
With this approach, it would be possible to shrink the table in size as well.
I can only assume the reason you want to avoid rehashing everything is that the resulting high latency operation is not an issue to throughput but is instead a problem for responsiveness (either human or in SLA sense)
In theory you could use a modified closed addressing hash table like so:
remember all previous sizes where elements were added
On resize keep the old buckets around linked to internally via a map of sizeWhenUsed -> buckets (obviously if the buckets are empty no need to bother)
Invariant a mapping of Key k exists in only one of the 'internal hash tables' at any time.
on addition of a value you must first look it up in all the other maps to determine if the entry already exists and is mapped. If it is remove it from the old one and add it to the new one.
if an internal map becomes empty/below a certain size it should be deleted and remaining elements moved into the current hash table.
so long as the number of internal hashes is kept constant this will not impact the big O behaviour of the data structure in time, though it will in memory.
This will however affect the actual performance as X additional checks must be made where X is the number of old hashes maintained.
If the wasted space of the list of buckets (the buckets themselves will be null if empty so are zero cost unless populated) becomes significant (use a fudge factor for this) then at some point on a rehash you may have to take the hit of moving things into the current table unless you are willing to expend essentially unlimited memory.
Downgrades in size of the hash will only function in the desired manner (releasing memory) if you are willing to rehash. This is unavoidable.
It is possible you could make use of some complex additional data within an open addressing scheme to 'flag' which of the internal hashes the cell was in use by but removals would be extremely complex to get right and would be very expensive unless you just left them as wasted space. I would never attempt this.
I would not suggest using the former method either unless the underlying data spent very little time in the hash, thus the related churn would tend to steadily 'erase' the older sized hashes. It is likely that a hash tuned for just this sort of behaviour and preset with an appropriate size would perform much better though.
Since the above scheme is simply trading wasted memory and throughput for reduction in the expensive operations with speculative (at best) chance of reducing this waste I would suggest simply pre-sizing your hash to be larger than required and thus never resized would be a more sensible option.
Probably not - the hash would have to not use any variety of modulus, which would mean that it would have a required table size depending on the data anyway.
All hash tables must deal with collisions, either through chaining or probing or whatever, so, I suspect that if upon table resize you simply resized the table (IE, you don't re-insert everything), you would have a functional, though highly non optimal, hash table.
I assume you're asking this question because you want to avoid the high cost of resizing a hash table. You want a hash table which has guaranteed constant time (assuming no collision problems, of course). This can be done.
The trick is to iteratively initialize the next-size hash table while the current one is filling up. By the time you need it, it's ready.
Quick pseudo-code to add an element:
if resizing then
smallTable = bigTable
bigTable = new T[smallTable.length * 2] //if allocation zeroes memory, we lose O(1)
set state to zeroing
elseif zeroing then
zero a small amount of the bigTable memory
if done zeroing then set state to transfering
elseif transfering then
transfer a few values in the small table to the big table
if done transfering then set state to resizing
end if
add new item to small array
add new item to large array

Resources