I need a data structure that can properly model blocks of time, like appointments. Each appointment has a start time and an end time, and I need extremely fast access to things like:
Does a specified start time and end time conflict with an existing event?
What events exist between a specified start time and end time?
I thought of using a binary search tree (e.g., Java's TreeMap), but I can't think of what key or value I would use. Is there a single data structure, or a combination of data structures, that is well suited to modeling this?
A Guava Table would probably work for your use case, depending on what it is you want to actually index on.
A naive approach would be to index by name, then time of day, and then store a boolean value indicating whether that particular block is occupied by that particular person.
This would make the instantiation of the object become...
Table<String, LocalDateTime, Boolean> calendar = TreeBasedTable.create();
You would populate each individual's allocation at a given interval. You get to set what that interval is: 15-minute, 30-minute, or 1-hour periods (as defined by the table).
To find if a time is occupied, you look for the closest interval to the time at which you want to schedule a person. You'd use the column() method for this to see if there's any availability, or you could get specific and get a row for the individual. This means you'd have to check two values: the start time you want, and however many minutes your interval extends beyond it. I'll leave most of that as an exercise for the reader, but a rough sketch of the basic operations follows.
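A minimal sketch of how that could look, assuming 30-minute slots; the book/isFree/bookingsAt helpers are my own illustrative names, not part of the Guava API:

import com.google.common.collect.Table;
import com.google.common.collect.TreeBasedTable;
import java.time.LocalDateTime;
import java.util.Map;

public class CalendarSketch {
    // Row = person, column = slot start time, value = occupied?
    private final Table<String, LocalDateTime, Boolean> calendar = TreeBasedTable.create();

    // Mark a person's slot as occupied.
    public void book(String person, LocalDateTime slot) {
        calendar.put(person, slot, Boolean.TRUE);
    }

    // Is this person free at this slot? Absent cells count as free.
    public boolean isFree(String person, LocalDateTime slot) {
        return !Boolean.TRUE.equals(calendar.get(person, slot));
    }

    // Everyone's status at a slot: column() gives a person -> occupied view.
    public Map<String, Boolean> bookingsAt(LocalDateTime slot) {
        return calendar.column(slot);
    }
}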
Say I want to store a forex rate trend in which I receive, on average, two updates every second. I don't want to store every update against its timestamp over a whole day, as the data would be huge. But I do want to show every update from the last two minutes, every second update from the last hour, and so on, with decreasing frequency over the day. Which algorithm/data structure is best for this?
You could use a circular buffer. But generally StackOverflow is not for questions like that.
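For the finest tier ("every update in the last two minutes"), a fixed-size circular buffer might look like the sketch below; the names are my own, and the coarser tiers would be additional buffers fed at lower sampling rates:

public class RateBuffer {
    private final double[] rates;
    private int next = 0;  // index of the next write
    private int size = 0;  // number of valid entries

    public RateBuffer(int capacity) {
        rates = new double[capacity];
    }

    // Overwrites the oldest entry once the buffer is full.
    public void add(double rate) {
        rates[next] = rate;
        next = (next + 1) % rates.length;
        if (size < rates.length) size++;
    }

    // Returns the stored entries, oldest first.
    public double[] snapshot() {
        double[] out = new double[size];
        int start = (next - size + rates.length) % rates.length;
        for (int i = 0; i < size; i++) {
            out[i] = rates[(start + i) % rates.length];
        }
        return out;
    }
}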
Data for various stocks arrives continuously from various stock exchanges. Which data structure is suitable for storing this data?
Things to consider are:
a) Effective retrieval and update of the data is required, as stock prices change every second or microsecond during trading hours.
I thought of using a heap, since the number of stocks is more or less constant and the most frequently used operations are retrieval and update, so a heap should perform well for this scenario.
b) Need to show stocks which are currently trending (most and least active by volume of shares traded, and highest profit and loss on a particular day).
I am not sure how to go about this.
c) Since storing to a database from any programming language involves some latency, considering the number of stocks traded at any given time, how can you store all the transactional data persistently?
P.S.: This is an interview question from Morgan Stanley.
A heap doesn't support efficient random access (i.e., looking up an arbitrary element), nor getting the top k elements without removing them (which is not desired here).
My answer would be something like:
A database would be the preferred choice for this, as, with a proper table structure and indexing, all of the required operations can be done efficiently.
So I suppose this is more a theoretical question about understanding data structures (related to in-memory storage rather than persistent storage).
It seems a combination of data structures is the way to go:
a) Effective retrieval and update of data is required as stock data changes per second or microsecond during trading time.
A map makes sense for this one; a hash map or tree map allows fast look-up.
b) How to show stocks which are currently trending (as in volume of shares being sold most active and least active, high profit and loss on a particular day)?
Just about any sorted data structure seems to make sense here (with the above map having pointers to the correct node, or pointing to the same node). One for activity and one for profit.
I'd probably go with a sorted doubly linked list. It takes minimal time to get the first or last n items, and since the map holds a pointer to the element, an update costs the map look-up plus however many positions the item has to move to be sorted again (if any). If an item often moves many positions at once, a linked list would not be a good option; in that case I'd probably go for a binary search tree.
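A minimal sketch of the map-plus-sorted-structure combination; all names here are illustrative, and I use a TreeSet for the sorted side instead of the linked list discussed above (same idea, but re-sorting on update becomes a remove and re-insert):

import java.util.*;

public class StockBoard {
    static final class Stock {
        final String symbol;
        long volume;
        Stock(String symbol, long volume) { this.symbol = symbol; this.volume = volume; }
    }

    private final Map<String, Stock> bySymbol = new HashMap<>();
    private final TreeSet<Stock> byVolume = new TreeSet<>(
            Comparator.comparingLong((Stock s) -> s.volume).thenComparing(s -> s.symbol));

    public void update(String symbol, long volume) {
        Stock s = bySymbol.get(symbol);
        if (s == null) {
            s = new Stock(symbol, volume);
            bySymbol.put(symbol, s);
        } else {
            byVolume.remove(s);  // remove before mutating the sort key
            s.volume = volume;
        }
        byVolume.add(s);
    }

    public Stock mostActive()  { return byVolume.isEmpty() ? null : byVolume.last(); }
    public Stock leastActive() { return byVolume.isEmpty() ? null : byVolume.first(); }
}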
c) How can you store all the transactional data persistently?
I understand this question as: if the connection to the database is lost, or the database goes down at any point, how do you ensure there is no data corruption? If that is not it, I would ask for a rephrase.
Just about any database course should cover this.
As far as I remember, it has to do with creating a new record, updating that record, and only switching the real pointer over to it once it has been fully updated. Before that, you might also keep a pointer to the old record, so you can check whether it has been deleted if something fails after the pointer is switched but before the deletion completes.
Another option is an active transaction table, which you add to when starting a transaction and remove from when a transaction completes (and which also stores all the details required to roll back or resume the transaction). Then, whenever everything is okay again, you check this table and roll back or resume any transactions that have not yet completed.
If I had to choose, I would go for a hash table (e.g., Java's Hashtable):
Reason: it is synchronized and thread-safe, with O(1) average-case complexity.
Provided:
1. A good hash function to avoid collisions.
2. A high-performance cache.
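A minimal sketch of thread-safe price updates. I use ConcurrentHashMap here as the modern alternative to Hashtable; both give O(1) average-case get/put, but ConcurrentHashMap avoids locking the whole table:

import java.util.concurrent.ConcurrentHashMap;

public class PriceTable {
    private final ConcurrentHashMap<String, Double> prices = new ConcurrentHashMap<>();

    // Called from the exchange feed thread(s).
    public void update(String symbol, double price) {
        prices.put(symbol, price);
    }

    // Called from reader threads; returns null if the symbol is unknown.
    public Double lookup(String symbol) {
        return prices.get(symbol);
    }
}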
While this is a language-agnostic question, a few of the requirements jumped out at me. For example:
effective retrieval and update of the data is required, as stock prices change every second or microsecond during trading hours.
The Java class HashMap uses the hash code of a key to rapidly access values in its collection. It has O(1) average-case runtime complexity, which is ideal.
need to show stocks which are currently trending (most and least active by volume of shares traded, and highest profit and loss on a particular day)
This is an implementation question. Your best bet is a fast sorting algorithm, like quicksort or mergesort.
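For what it's worth, in Java you rarely need to hand-roll quicksort; List.sort uses a tuned mergesort (TimSort) under the hood. A sketch, with an illustrative Quote record (Java 16+):

import java.util.*;

record Quote(String symbol, long volume) {}

class TrendingDemo {
    public static void main(String[] args) {
        List<Quote> quotes = new ArrayList<>(List.of(
                new Quote("AAPL", 90_000_000L),
                new Quote("MSFT", 25_000_000L),
                new Quote("TSLA", 120_000_000L)));

        // Most active first.
        quotes.sort(Comparator.comparingLong(Quote::volume).reversed());
        System.out.println(quotes.get(0).symbol()); // TSLA
    }
}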
since storing to a database from any programming language involves some latency, considering the number of stocks traded at any given time, how can you store all the transactional data persistently?
A database would have been my first choice, but it depends on your resources.
A DVR needs to store a list of programs to record. Each program has a starting time and duration. This data needs to be stored in a way that allows the system to quickly determine if a new recording request conflicts with existing scheduled recordings.
The issue is that merely looking for a show with a conflicting start time is inadequate, because the end of a longer program can overlap a shorter one. I suppose one could create a data structure that tracked the availability of each time slice, perhaps at half-hour granularity, but this fails if we cannot assume all shows start and end on half-hour boundaries, and tracking at the minute level seems inefficient in both storage and look-up.
Is there a data structure that allows one to query by range, where you supply the lower and upper bound and it returns a collection of all elements that fall within or overlap that range?
An interval tree (maybe using the augmented tree data structure?) does exactly what you're looking for. You'd enter all scheduled recordings into the tree and when a new request comes in, check whether it overlaps any of the existing intervals. Both this lookup and adding a new request take O(log(n)) time, where n is the number of intervals currently stored.
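If the stored recordings are guaranteed non-overlapping (the DVR rejects conflicts), you don't even need a full interval tree: a TreeMap keyed by start time gives the same O(log n) bounds. A sketch, with my own naming and time convention:

import java.util.*;

public class RecordingSchedule {
    // start time -> end time, in minutes since some epoch (my own convention)
    private final TreeMap<Long, Long> byStart = new TreeMap<>();

    // Books [start, end) and returns true only if it conflicts with nothing.
    public boolean tryBook(long start, long end) {
        Map.Entry<Long, Long> before = byStart.floorEntry(start);
        if (before != null && before.getValue() > start) return false; // earlier show still running
        Long after = byStart.ceilingKey(start);
        if (after != null && after < end) return false;                // later show starts too soon
        byStart.put(start, end);
        return true;
    }
}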
Clarification: I am not trying to calculate friendly times (e.g., "8 seconds ago") by using a timestamp and the current time.
I need to create a timeline of events in my data model, but where these events are only relative to each other. For example, I have events A, B, and C. They happen in order, so it may be that B occurs 20 seconds after A, and that C occurs 20 years after B.
I don't care about the unit of time. For my purpose, there is no time, just relativity.
I intend to model this like a linked list, where each event is a node:
Event
id
name
prev_event
next_event
Is this the most efficient way to model relative events?
All time recorded by computers is relative time; nominally it is an offset from an epoch, in milliseconds. Normally that epoch is 1970-01-01, as is the case with Unix.
If you store normal, everyday timestamp values, you already have the relative time between events: if they are all sequential, you just subtract them to get intervals, which is what you are calling relative times (they are actually intervals).
You can use whatever resolution you need; milliseconds is what most things use. If you are sampling things at sub-millisecond resolution, you would use nanoseconds.
I don't think you need to link to the previous and next events; why not just use a timestamp and order by it?
If you can have multiple, simultaneous event timelines, then you would use some kind of identifier for the timeline (int, GUID, whatever) and key that in with the timestamp. No id is even necessary unless you need to refer to an event by a single number.
Something like this:
Event
TimeLineID (key)
datetime (key)
Name
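To recover the intervals, you sort by timestamp and subtract neighbours, as described above. A sketch with illustrative data (the Event record here is my stand-in for the schema above, Java 16+):

import java.util.*;

record Event(String timeLineId, long timestampMillis, String name) {}

class TimelineDemo {
    public static void main(String[] args) {
        List<Event> events = new ArrayList<>(List.of(
                new Event("t1", 1_000L, "A"),
                new Event("t1", 21_000L, "B"),            // 20 seconds after A
                new Event("t1", 631_152_021_000L, "C"))); // ~20 years after B

        events.sort(Comparator.comparingLong(Event::timestampMillis));
        for (int i = 1; i < events.size(); i++) {
            long interval = events.get(i).timestampMillis()
                          - events.get(i - 1).timestampMillis();
            System.out.printf("%s -> %s: %d ms%n",
                    events.get(i - 1).name(), events.get(i).name(), interval);
        }
    }
}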
I was wondering if there is an algorithm for counting "most frequent items" without having to keep a count of each item? For example, let's say I was a search engine and wanted to keep track of the 10 most popular searches. What I don't want to do is keep a counter for every query, since there could be too many queries to count (and most of them will be singletons). Is there a simple algorithm for this? Maybe something probabilistic? Thanks!
Well, if you have a very large number of queries (like a search engine presumably would), then you could just do "sampling" of queries. So you might be getting 1,000 queries per second, but if you keep a count of just one per second, then over a longish period of time you'd get an answer relatively close to the "real" one.
This is how, for example, a "sampling" profiler works. Every n milliseconds it looks at what function is currently being executed. Over a long period of time (several seconds) you get a good idea of the "expensive" functions, because they're the ones that appear in your samples more often.
You still have to do "counting", but by taking periodic samples instead of counting every single query, you put an upper bound on the amount of data you actually have to store (e.g., at most one query per second).
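A sketch of that sampling idea, with an illustrative one-query-per-interval rule; all names are mine:

import java.util.*;

public class SampledCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    private final long intervalMillis;
    private long nextSampleAt = 0;

    public SampledCounter(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Call for every query; only one query per interval is actually counted.
    public void observe(String query, long nowMillis) {
        if (nowMillis >= nextSampleAt) {
            counts.merge(query, 1, Integer::sum);
            nextSampleAt = nowMillis + intervalMillis;
        }
    }

    public Map<String, Integer> snapshot() {
        return Collections.unmodifiableMap(counts);
    }
}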
If you want the most frequent searches at any given time, you don't need endless counters keeping track of every query ever submitted. Instead, you need an algorithm that measures the number of submissions for a given query over a set period of time. This is a pretty simple algorithm.
Any search submitted to your engine, for example the word "cache", is stored for a fixed period of time called the refresh rate (its length depends on the kind of traffic your search engine gets and the number of "top results" you want to keep track of). If the refresh period expires and searches for the word "cache" have not persisted, the query is deleted from memory. If searches for the word "cache" do persist, your algorithm only needs to keep track of the rate at which the word is being searched.
To do this, simply store all searches on a "leaky counter": every entry is pushed onto the counter with an expiration date, after which the query is deleted. Your active counters are the indicators of your top queries.
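A minimal sketch of such a leaky counter; the window length and the explicit prune() call are my own illustrative choices:

import java.util.*;

public class LeakyCounters {
    private static final class Counter {
        int count;
        long expiresAt;
    }

    private final Map<String, Counter> counters = new HashMap<>();
    private final long windowMillis;

    public LeakyCounters(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    public void observe(String query, long nowMillis) {
        Counter c = counters.computeIfAbsent(query, q -> new Counter());
        c.count++;
        c.expiresAt = nowMillis + windowMillis; // each search renews the expiry
    }

    // Drop queries whose searches have not persisted through the window.
    public void prune(long nowMillis) {
        counters.values().removeIf(c -> c.expiresAt <= nowMillis);
    }
}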
Storing each and every query would be expensive, yet necessary to ensure the top 10 are actually the top 10. You'll have to cheat.
One idea is to store a table of URLs, hit counters, and timestamps, indexed by count and then timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should make it onto the table because of their faster query rate.
Another idea would be to write a 16-bit (or more) hash function for search queries. Have a 65536-entry table holding counters and URLs. When a search is performed, increment the respective table entry and set the URL if necessary. However, this approach has a major drawback. A spam bot could make repeated queries like "cheap viagra", possibly making legitimate queries increment the spam query counters instead, placing their messages on your main page.
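A minimal sketch of that hashed counting table; the collision behaviour noted in the comment is exactly the spam-bot caveat above:

public class HashedCounters {
    private final int[] counts = new int[65536];
    private final String[] queries = new String[65536];

    public void observe(String query) {
        // Mask the hash down to 16 bits (0..65535).
        int slot = query.hashCode() & 0xFFFF;
        counts[slot]++;
        queries[slot] = query; // last writer wins on collision
    }

    public int countFor(String query) {
        return counts[query.hashCode() & 0xFFFF];
    }
}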
You want a cache, of which there are many kinds; see the Wikipedia articles on cache algorithms and on the aging page-replacement algorithm.