This is an interview question I came across for a big software major.
Design a data structure for a server that can store at most 100 records. Two functions are used to access the server: get(k) and put(k, v, x),
where k is the key, v is the corresponding value, and x is the expiry time before which the record can't be removed.
The approach I have come up with so far: maintain two data structures.
HashMap: stores the key-value pairs.
PriorityQueue: orders the records by expiry time.
Each entry in the queue also holds the key, so that when a record expires we can remove the key-value pair from the hashmap in O(1) time.
I would like to ask: can we design a better solution for this question?
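The two-structure approach described above can be sketched as follows. This is only a minimal sketch under assumptions that the question does not state: put fails (returns False) when the store is full and nothing has expired, the current time is passed in explicitly as now, and the class and method names are made up for illustration.

```python
import heapq
import itertools


class ExpiringStore:
    """Hash map for lookups plus a min-heap ordered by expiry time."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.records = {}              # key -> (value, expiry, seq)
        self.heap = []                 # (expiry, seq, key); may hold stale entries
        self._seq = itertools.count()  # tie-breaker so keys are never compared

    def get(self, key):
        entry = self.records.get(key)
        return None if entry is None else entry[0]

    def _evict_one_expired(self, now):
        # Pop stale heap entries lazily; evict the earliest record if it has expired.
        while self.heap:
            expiry, seq, key = self.heap[0]
            current = self.records.get(key)
            if current is None or current[2] != seq:
                heapq.heappop(self.heap)   # entry was superseded by a later put
                continue
            if expiry > now:
                return False               # earliest record is still protected
            heapq.heappop(self.heap)
            del self.records[key]
            return True
        return False

    def put(self, key, value, expiry, now):
        # Assumption: refuse the insert when full and no record may be removed yet.
        if key not in self.records and len(self.records) >= self.capacity:
            if not self._evict_one_expired(now):
                return False
        seq = next(self._seq)
        self.records[key] = (value, expiry, seq)
        heapq.heappush(self.heap, (expiry, seq, key))
        return True
```

With this layout get is O(1) on average, and put is O(log n) plus the cost of discarding stale heap entries left behind when a key is overwritten.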
I am looking for some inspiration. My problem is how to pool objects identified by key-value pairs. I am working on the JVM and don't want to create garbage when creating those objects at runtime, hence the idea of pooling.
Let me give an example: a user can provide identifiers like (A(1) B(4)), (A(1) B(2) C('foo') E(0)), (B(4) A(1)), (A(0)), etc. For each such set of pairs I would like to get a unique reference to the object created earlier. Note that (A(1) B(4)) and (B(4) A(1)) should point to exactly the same reference; of course I can sort by keys and then perform the calculation.
Of course I could build a map at runtime to use as a key and keep a Map<Map<K,V>, T>, but this is not really efficient when we are talking about X thousand lookups per second.
I was thinking about hashing those input pairs to get a unique id, but then I would have to prove that the hashing function has no collisions as the number or variety of k-v pairs grows at runtime. I expect to have no more than 1M unique k-v pairs, so computing an id and mapping that id to an object would be the best solution, but I need a function that will calculate such an id.
Maybe some smart graph structure could be a solution; the only trouble is that not only the keys form the graph but also the values of those keys.
Any help would be appreciated.
I've solved this particular problem by having an instance of a key builder per thread. The key builder holds an array of objects; the array is of known size because we have strictly defined keys (in the example these were A, B, C, etc., so we know the ordinal of each key).
I don't allocate anything per request execution: we reuse the builder to set key-values in the array, and once that is done I can use the key to find or create the object identified by those K-V pairs. This approach also solves the ordering problem, as it doesn't matter whether somebody identified the object by (A(1) B(4)) or (B(4) A(1)), to follow the example.
I could use a Map instead of Object[], but at least the java.util.HashMap implementation allocates lots of node objects, and I would clear such a map for every request! If you use a Map that caches nodes, then maybe this is a way to go.
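The ordinal-slot idea can be illustrated with a small sketch. It is written in Python only to show why the order of the pairs stops mattering; it does not reproduce the allocation-free behaviour of the JVM solution (building the tuple allocates). The names KEY_ORDINALS, KeyBuilder and Pool are hypothetical, not taken from the original code.

```python
# Hypothetical fixed set of key names; the ordinal is the slot index.
KEY_ORDINALS = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
_UNSET = object()  # marks a key that was not provided


class KeyBuilder:
    """Reusable builder: one fixed-size slot array, one slot per known key."""

    def __init__(self):
        self.slots = [_UNSET] * len(KEY_ORDINALS)

    def reset(self):
        for i in range(len(self.slots)):
            self.slots[i] = _UNSET
        return self

    def set(self, name, value):
        # The slot is chosen by the key's ordinal, so call order is irrelevant.
        self.slots[KEY_ORDINALS[name]] = value
        return self

    def build(self):
        return tuple(self.slots)


class Pool:
    """Finds or creates the object identified by a built key."""

    def __init__(self, factory):
        self.factory = factory
        self.objects = {}

    def get(self, key):
        obj = self.objects.get(key)
        if obj is None:
            obj = self.objects[key] = self.factory(key)
        return obj


builder = KeyBuilder()
pool = Pool(lambda key: object())
first = pool.get(builder.reset().set("A", 1).set("B", 4).build())
second = pool.get(builder.reset().set("B", 4).set("A", 1).build())
print(first is second)  # True
```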
I know, maybe the title is a little confusing. However, my actual question is basic, I think.
I'm working on a brand new LRU implementation, for which I use an Index Table that maps the name of an incoming packet to the index where the content of the packet is stored in the CS.
As illustrated below, each incoming packet is stored in the CS and can be addressed through the Index Table.
Now suppose a new packet arrives. As we know, under LRU its index must be set to the top of the CS (zero), and the other indexes need to be updated: they must be incremented as a result.
One obvious solution is to loop over all entries in the Index Table and increment them.
Is there any solution or structure that is used for such a problem?
I don't see how you are establishing the order of your cache in the description. But to answer your question, it's possible to reduce the LRU store method to O(1) time complexity.
The classical way to do it is to have these two data structures:
Doubly Linked List : for order in the cache. Each node stores a data element (it plays the role of your content store).
HashMap that associates each key to the pointer to the node in the linked list. (it plays the role of your index table)
So when you access data that is already stored in your cache, it must move to the top of the list: you delete the corresponding node from the linked list (in O(1) time, because you have access to its previous and next nodes) and re-insert it at the head.
For new data it is simpler, only store it at the head of the list and store your (key, value) in the hashmap.
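A minimal sketch of the two structures described above. The eviction of the least recently used entry when the cache is full is an assumption added here (the answer does not spell it out), as are the class and method names and the requirement that capacity is at least 1.

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    """Hash map (index table) + doubly linked list (content store order)."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed to be >= 1
        self.index = {}           # key -> node
        self.head = _Node()       # sentinel: most recently used side
        self.tail = _Node()       # sentinel: least recently used side
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.index.get(key)
        if node is None:
            return None
        self._unlink(node)        # O(1): neighbours are known
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.index.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.index) >= self.capacity:
            oldest = self.tail.prev   # least recently used entry
            self._unlink(oldest)
            del self.index[oldest.key]
        node = _Node(key, value)
        self.index[key] = node
        self._push_front(node)
```

No index ever has to be incremented: the position of an entry is implied by its place in the list, so both get and put stay O(1) on average.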
Following the pointers in an ebay tech blog and a datastax developers blog, I model some event log data in Cassandra 1.2. As a partition key, I use “yymmddhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int,
rId int, created timeuuid, data map, PRIMARY
KEY((yymmddhh, bucket), created) );
(rId identifies the resource that fired the event.)
(The map holds key-value pairs derived from a JSON document; the keys change, but not much.)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but can't pinpoint the problem.
When I use the timeuuid as a part of the row key, the performance remains stable on a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain below roughly 0.005s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the datastax documentation (can't post a link due to reputation limitations, sorry, but it is easy to find) to say that each key creates an additional column. Or does it create one new column per "map"? That would be hard for me to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
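The wrap-around described above is plain modular arithmetic on a 16-bit size field. The function name below is made up for illustration; it models only the count the client would see, not the protocol itself.

```python
SIZE_FIELD_BITS = 16  # two bytes for the element count


def count_seen_by_client(actual_count):
    # A 16-bit unsigned field keeps only the low 16 bits of the real count.
    return actual_count % (2 ** SIZE_FIELD_BITS)


print(count_seen_by_client(2 ** 16 + 3))  # 3
```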
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
yymmddhh VARCHAR,
bucket INT,
created TIMEUUID,
rId INT,
key VARCHAR,
value VARCHAR,
PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
)
(Notice that I moved rId and the map key into the primary key, I don't know what rId is, but I assume that this would be correct)
This has two drawbacks compared to using a MAP: it requires you to reassemble the map when you query the data (you get back one row per map entry), and it uses a little more space since C* will insert a few extra columns. The upside is that there is no problem with collections getting too big.
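Reassembling the map on the client side is a simple grouping step. The sketch below assumes the query returns rows as (created, rId, key, value) tuples; that row shape and the function name are assumptions for illustration, not a driver API.

```python
from collections import defaultdict


def reassemble_maps(rows):
    """Group one-row-per-entry results back into one dict per event.

    rows: iterable of (created, r_id, key, value) tuples (assumed shape).
    """
    maps = defaultdict(dict)
    for created, r_id, key, value in rows:
        maps[(created, r_id)][key] = value
    return dict(maps)


rows = [
    ("t1", 7, "status", "ok"),
    ("t1", 7, "size", "12"),
    ("t2", 7, "status", "fail"),
]
print(reassemble_maps(rows))
```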
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
Use yymmddhh as the row key and bucket + timeUUID as the column name, where each bucket holds 20 (or some other fixed number of) records; the buckets can be managed using a counter column family.
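As a rough illustration of the bucketing idea, the writer can pick the bucket at random for each insert. This is only a sketch: choosing the bucket randomly, the function name, and the parameter names are assumptions, not something stated in the question.

```python
import random
from datetime import datetime


def partition_key(event_time, num_buckets, rng=random):
    """Return the (yymmddhh, bucket) pair used as the partition key."""
    hour = event_time.strftime("%y%m%d%H")
    bucket = rng.randrange(num_buckets)  # spreads one hour's writes over many partitions
    return hour, bucket


hour, bucket = partition_key(datetime(2013, 5, 17, 14, 30), num_buckets=300)
print(hour)  # 13051714
```

Readers then have to query every bucket for the hours they are interested in, which is the price paid for spreading the writes.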
Data for various stocks is coming from various stock exchange continuously. Which data structure is suitable to store these data?
things to consider are :
a) effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
I thought of using Heap as the number of stocks would be more or less constant and the most frequent used operations are retrieval and update so heap should perform well for this scenario.
b) need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
I am not sure how to go about this.
c) as storing to a database using any programming language has some latency, considering the number of stocks that will be traded during a particular time, how can you store all the transactional data persistently?
PS: This is an interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e. look-up by index) nor getting the top k elements without removing elements (which is not desired).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding of data structures (related to in-memory storage, rather than persistent).
It seems multiple data structures is the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map would make sense for this one. Hash-map or tree-map allows for fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted (double) linked-list. It takes minimal time to get the first or last n items. Since you have a pointer to the element through the map, updating takes as long as the map lookup plus the number of moves of that item required to get it sorted again (if any). If an item often moves many indices at once, a linked-list would not be a good option (in which case I'd probably go for a Binary Search Tree).
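The combination of a map and a sorted structure can be sketched as follows. Note that this sketch swaps in a sorted Python list maintained with bisect instead of the linked list or binary search tree mentioned above, so updates cost O(n) for the list shifts; it is meant only to show how the two structures cooperate. The class and method names are made up, and only trading volume is tracked.

```python
import bisect


class StockBoard:
    """Map for lookup/update plus a sorted list for most/least active."""

    def __init__(self):
        self.volume = {}     # symbol -> current volume
        self.by_volume = []  # sorted list of (volume, symbol)

    def update(self, symbol, volume):
        old = self.volume.get(symbol)
        if old is not None:
            # Locate and drop the old entry via binary search.
            i = bisect.bisect_left(self.by_volume, (old, symbol))
            del self.by_volume[i]
        self.volume[symbol] = volume
        bisect.insort(self.by_volume, (volume, symbol))

    def most_active(self, n):
        return [symbol for _, symbol in self.by_volume[::-1][:n]]

    def least_active(self, n):
        return [symbol for _, symbol in self.by_volume[:n]]


board = StockBoard()
board.update("AAA", 100)
board.update("BBB", 300)
board.update("CCC", 200)
print(board.most_active(2))   # ['BBB', 'CCC']
print(board.least_active(1))  # ['AAA']
```

A second sorted structure keyed by profit would be maintained the same way.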
c) How can you store all the transactional data persistently?
I understand this question as - if the connection to the database is lost or the database goes down at any point, how do you ensure there is no data corruption? If this is not it, I would've asked for a rephrase.
Just about any database course should cover this.
As far as I remember - it has to do with creating another record, updating this record, and only setting the real pointer to this record once it has been fully updated. Before this you might also have to set a pointer to the old record so you can check if it's been deleted if something happens after setting the pointer away, but before deletion.
Another option is having an active transaction table, which you add to when starting a transaction and remove from when a transaction completes (and which also stores all the details required to roll back or resume the transaction). Thus, whenever everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I had to choose, I would go for a hash table:
Reason: O(1) average-case complexity; some implementations, such as Java's Hashtable, are also synchronized and thread-safe.
Provided:
1. A good hash function to avoid collisions.
2. A high-performance cache.
While this is a language agnostic question, a few of the requirements jumped out at me. For example:
effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
The Java class HashMap uses the hash code of a key to rapidly access values in its collection. Lookups and updates take O(1) time on average, which is ideal.
need to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)
This is an implementation based issue. Your best bet is to implement a fast sorting algorithm, like QuickSort or Mergesort.
as storing to database using any programming language has some latency considering the amount of stocks that will be traded during a particular time, how can u store all the transactional data persistently??
A database would have been my first choice, but it depends on your resources.
I am looking for the optimal (in time and space) data structure for supporting the following operations:
1. Add Persons (name, age) to a global data store of persons
2. Fetch the Person with minimum and maximum age
3. Search for a Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects minPerson and maxPerson for Person with min and max age. Update this if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), I think it may not be the best way if there are many collisions in the hash. Also, addition of a Person would mean an overhead of adding to the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings that's provided by your programming language of choice is going to do pretty well out of the box.
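A minimal sketch of the hashtable + min/max combination. It assumes names are unique and that persons are only ever added, never removed or updated; the class and method names are invented for illustration.

```python
class PersonStore:
    """Hash table keyed by name, plus running min/max by age."""

    def __init__(self):
        self.age_by_name = {}
        self.youngest = None  # (name, age) with the minimum age
        self.oldest = None    # (name, age) with the maximum age

    def add(self, name, age):
        if name in self.age_by_name:
            # Assumption: names are unique identifiers.
            raise ValueError("duplicate name: " + name)
        self.age_by_name[name] = age
        if self.youngest is None or age < self.youngest[1]:
            self.youngest = (name, age)
        if self.oldest is None or age > self.oldest[1]:
            self.oldest = (name, age)

    def age_of(self, name):
        return self.age_by_name.get(name)


store = PersonStore()
store.add("Ann", 34)
store.add("Bob", 19)
store.add("Cid", 71)
print(store.youngest, store.oldest)  # ('Bob', 19) ('Cid', 71)
```

All three operations are O(1) on average. Supporting deletes would require recomputing the min/max when the current extreme is removed, or a sorted structure.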
It looks like you need a data structure that supports fast inserts and also fast queries on 2 different keys (name and age).
I would suggest keeping two data structures, one a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, the other a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree would provide O(log(n)) inserts and max/min queries, while the hashtable would give us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
It sounds like you're expecting the name to be the unique identifier; otherwise your operation 3 is ambiguous (what is the correct return result if you have two entries for John Smith?).
Assuming that the uniqueness of a name is guaranteed, I would go with a plain hashtable keyed by names. Operation 1 and 3 are trivial to execute. Operation 2 could be done in O(N) time if you want to search through the data structure manually, or you can do like you suggest and keep track of the min/max and update it as you add/delete entries in the hash table.