Quick design approach needed - data structures

I am reading a huge file containing orderTime (a DateTime object) and orderID (a String). Which data structure should I use, and how, so that given a time range I can return all the order IDs that fall within it?

You can use linear or non-linear data structures. A linear structure could be as simple as a linked list that holds the order IDs sorted by time.
You can also go for calendar queues; they are efficient for range queries.

You can use some sort of binary search tree, allowing you to quickly find not only the corresponding value for a given key, but also all parts of the tree larger or smaller than that key.
For example, in Java you could use a TreeMap, and in particular its headMap, tailMap, and subMap methods. Example usage:
import java.util.*;

SortedMap<Date, String> map = new TreeMap<>();
map.put(someDate, someId);
...
// All entries with fromDate <= key < toDate (the upper bound is exclusive)
SortedMap<Date, String> between = map.subMap(fromDate, toDate);
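One caveat the snippet glosses over: two orders can share a timestamp, and a plain TreeMap keeps only one value per key. A minimal sketch of one way around that, mapping each time to the list of its order IDs (class and method names are my own, and java.time.LocalDateTime stands in for the question's DateTime):

import java.time.LocalDateTime;
import java.util.*;

class OrderIndex {
    // One list of order IDs per timestamp, so duplicate times are preserved
    private final TreeMap<LocalDateTime, List<String>> orders = new TreeMap<>();

    void add(LocalDateTime time, String orderId) {
        orders.computeIfAbsent(time, t -> new ArrayList<>()).add(orderId);
    }

    // All order IDs with from <= time <= to (both bounds inclusive)
    List<String> between(LocalDateTime from, LocalDateTime to) {
        List<String> result = new ArrayList<>();
        for (List<String> ids : orders.subMap(from, true, to, true).values())
            result.addAll(ids);
        return result;
    }
}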


How does a count-min sketch find the most frequent item in a stream? - Heavy Hitters

A count-min sketch uses different hash functions to map elements of the stream into its table. How do you map back from the sketch to find the most frequent item, considering that enough elements (millions) have passed through and we don't know what they were?
First of all, to store data the CMS uses pairwise-independent hash functions to map elements into its structure (think of it as a table).
Secondly, the reverse process is not supported as such: you cannot recover the distinct elements of the stream from the table.
Using specific elements as queries, you can retrieve their estimated counts in the stream via the same family of hash functions (a point query).
To retrieve the most frequent item(s), an additional data structure such as a heap should be maintained alongside the sketch as the stream is processed.
Apart from the CMS papers, a quick and useful presentation on your question can be found here: http://theory.stanford.edu/~tim/s15/l/l2.pdf
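A minimal Java sketch of the sketch-plus-heap combination described above (class names, seeds, and table sizes are my own illustrative choices, not any library's API):

import java.util.*;

// Count-min sketch: d rows of w counters; an item's estimate is the minimum
// of its d counters, so it can only over-count, never under-count.
class CountMinSketch {
    static final int D = 4, W = 1 << 14;
    final long[][] table = new long[D][W];
    final int[] seeds = {0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F};

    int bucket(String item, int row) {
        // Seeded mixing of the built-in hash; a real CMS would use the
        // pairwise-independent hash functions mentioned above
        return Math.floorMod(item.hashCode() ^ seeds[row], W);
    }

    long add(String item) {          // update every row, return the new estimate
        long est = Long.MAX_VALUE;
        for (int r = 0; r < D; r++) est = Math.min(est, ++table[r][bucket(item, r)]);
        return est;
    }

    long query(String item) {        // point query: min over the d counters
        long est = Long.MAX_VALUE;
        for (int r = 0; r < D; r++) est = Math.min(est, table[r][bucket(item, r)]);
        return est;
    }
}

// Heap of the current top-k candidates, maintained while the stream is consumed
class HeavyHitters {
    final CountMinSketch cms = new CountMinSketch();
    final int k;
    final PriorityQueue<String> heap;      // min-heap ordered by estimated count
    final Set<String> inHeap = new HashSet<>();

    HeavyHitters(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(Comparator.comparingLong(cms::query));
    }

    void offer(String item) {
        long est = cms.add(item);
        if (inHeap.contains(item)) {       // its count changed: re-heapify it
            heap.remove(item);
            heap.add(item);
        } else if (heap.size() < k) {
            heap.add(item); inHeap.add(item);
        } else if (est > cms.query(heap.peek())) {
            inHeap.remove(heap.poll());    // evict the weakest candidate
            heap.add(item); inHeap.add(item);
        }
    }

    Collection<String> topK() { return new ArrayList<>(heap); }
}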

C# HashSet<T> search performance (compared to an ObservableCollection<T>)?

In C#, the generic HashSet<T> should have O(1) search performance, while the search performance of an ObservableCollection<T> should be O(n).
I have a large number of unique elements; each element has a DateTime property that is not unique.
Each element calculates its hash code by simply returning its DateTime.GetHashCode().
Now I want to get a subset of my data, e.g. all elements that have a date which is between March 2012 and June 2012.
var result = from p in this.Elements
             where p.Date >= new DateTime(2012, 03, 01) &&
                   p.Date <= new DateTime(2012, 06, 30)
             select p;
If I run this LINQ query on a collection of 300,000 elements, it takes ~25 ms to return the 80 elements that are within the given range, regardless of whether I use a HashSet<T> or an ObservableCollection<T>.
If I loop through all elements manually and check them, it takes the same time, ~25 ms.
But I do know the HashCode of all Dates that are within the given range. Is it possible to get all elements with the given HashCodes from my HashSet<T>? I think that would be much faster...
Is it possible to speed up the LINQ query? I assume that it does not make use of the special abilities of my HashSet<T>?
You're not using the right data structure. You should be using something like a sorted list (sorted on the Date property) where you can then binary search for the beginning and end of the range.
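A minimal Java sketch of that idea (the Element type and all names are illustrative; in C# the same approach works with List<T>.Sort and List<T>.BinarySearch):

import java.time.LocalDate;
import java.util.*;

// Hypothetical element type mirroring the question: a non-unique date plus data
record Element(LocalDate date, String payload) {}

class DateIndex {
    // Sorted once by date, so every range query is just two binary searches
    private final List<Element> sorted;

    DateIndex(Collection<Element> elements) {
        sorted = new ArrayList<>(elements);
        sorted.sort(Comparator.comparing(Element::date));
    }

    // All elements with from <= date <= to, in O(log n) plus the matches
    List<Element> between(LocalDate from, LocalDate to) {
        return sorted.subList(lowerBound(from), lowerBound(to.plusDays(1)));
    }

    // Index of the first element whose date is >= key (a "lower bound" search)
    private int lowerBound(LocalDate key) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).date().isBefore(key)) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}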
As has been pointed out, a hash set is very efficient at determining whether a given element is in the set. Your query just uses the fact that HashSet<T> implements IEnumerable<T> to iterate over the entire set and do the date comparison; it will not use the hashes at all. This is why the manual loop takes the same time as the query.
You cannot get an element from a hash set based on its hash; you can only test for the existence of an element in the set. A dictionary is what you want if you need to look things up by hash (which it seems you don't).
Decide what it is you need to do with your data and use a structure optimised for that. This may be your own class maintaining multiple internal structures, each efficient at one thing (one for searching ranges, another for checking existence by multiple fields), or there may be an existing structure that fits your needs. But without knowing what you want to do with your data, it's difficult to advise.
The other thing to consider is whether you are optimising prematurely. If 25 ms to search manually is fast enough, then maybe any structure that implements IEnumerable<T> will be good enough, in which case you can choose one based on your other criteria.

Suitable data structure for finding a person's phone number, given their name?

Suppose you want to write a program that implements a simple phone book. Given a particular name, you want to be able to retrieve that person's phone number as quickly as possible. What data structure would you use to store the phone book, and why?
The text below answers your question.
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
The text is from the Wikipedia article on hash tables.
There are further discussions there of collisions, hash functions, and so on; check the page for details.
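To make the quoted description concrete, a minimal sketch in Java, with HashMap playing the role of the hash table (the entries are made up):

import java.util.*;

class PhoneBookDemo {
    public static void main(String[] args) {
        // The name (key) is hashed to a bucket index; the number (value) is
        // stored there, giving expected O(1) insertion and lookup
        Map<String, String> phoneBook = new HashMap<>();
        phoneBook.put("Ada Lovelace", "555-0100");   // illustrative entries
        phoneBook.put("Alan Turing", "555-0199");

        System.out.println(phoneBook.get("Ada Lovelace")); // prints 555-0100
    }
}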
I respect & love hashtables :) but even a balanced binary tree would be fine for your phone book application, giving you logarithmic complexity in the worst case and sparing you from needing good hash functions, handling collisions, etc., which matters more for huge amounts of data.
When I talk about huge data, what I mean relates to storage. Every time you fill all of the buckets in a hash table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees won't run into these problems. The domain also needs to be considered when designing data structures; for example, on small devices storage matters a lot.
I was wondering why tries didn't come up in any of the answers; a trie is well suited to phone-book data.
It also saves space compared to a hash table at (almost) the same retrieval cost, assuming a constant-size alphabet and constant-length names.
Tries also facilitate the prefix matches sometimes required while searching.
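A minimal trie sketch in Java along those lines (class and method names are my own):

import java.util.*;

// Hypothetical trie mapping names to phone numbers, with prefix lookup
class PhoneTrie {
    private final Map<Character, PhoneTrie> children = new HashMap<>();
    private String number;                    // set only on terminal nodes

    void put(String name, String phoneNumber) {
        PhoneTrie node = this;
        for (char c : name.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new PhoneTrie());
        node.number = phoneNumber;
    }

    String get(String name) {                 // O(length of name)
        PhoneTrie node = walk(name);
        return node == null ? null : node.number;
    }

    // All numbers whose names start with the given prefix
    List<String> byPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        PhoneTrie node = walk(prefix);
        if (node != null) node.collect(out);
        return out;
    }

    private PhoneTrie walk(String s) {        // follow s character by character
        PhoneTrie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }

    private void collect(List<String> out) {  // gather all numbers in a subtree
        if (number != null) out.add(number);
        for (PhoneTrie child : children.values()) child.collect(out);
    }
}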
A dictionary is both dynamic and fast.
You want a dictionary, where you use the name as the key and the number as the stored value. Check this out: http://en.wikipedia.org/wiki/Dictionary_%28data_structure%29
Why not use a singly linked list? Each node would hold the name, number, and link information.
One drawback is that a search might take some time, since you have to traverse the list link by link. You could keep the list ordered by inserting each node in sorted position.
PS: To make the search a tad faster, maintain a link to the middle of the list. The search can then continue to the left or right of that node based on the value of its "name" field. Note that this requires a doubly linked list.
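A sketch of that mid-pointer search in Java (names are my own; keeping the middle reference up to date as nodes are inserted is elided, so this shows only the directional search over a list sorted by name):

class Node {
    String name, number;
    Node prev, next;      // doubly linked, kept sorted by name
}

class MidPointerList {
    // Walk toward the head or tail depending on how name compares at the middle
    static String find(Node mid, String name) {
        if (name.compareTo(mid.name) <= 0) {
            for (Node n = mid; n != null && name.compareTo(n.name) <= 0; n = n.prev)
                if (n.name.equals(name)) return n.number;
        } else {
            for (Node n = mid; n != null && name.compareTo(n.name) >= 0; n = n.next)
                if (n.name.equals(name)) return n.number;
        }
        return null;      // not found
    }
}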

Best data structure for a given set of operations - Add, Retrieve Min/Max and Retrieve a specific object

I am looking for the data structure that is optimal in both time and space for supporting the following operations:
Add Persons (name, age) to a global data store of persons
Fetch Person with minimum and maximum age
Search for Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects, minPerson and maxPerson, for the Persons with min and max age. Update them, if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), it may not be the best way if there are many collisions in the hash. Also, adding a Person means the overhead of inserting into the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings provided by your programming language of choice will do pretty well out of the box.
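A minimal sketch of that hashtable-plus-min/max combination in Java (class and method names are my own; note it is add-only, since deletions would invalidate the tracked min/max):

import java.util.*;

class PersonStore {
    private final Map<String, Integer> ageByName = new HashMap<>();
    private String minName, maxName;   // names of the youngest/oldest so far

    void add(String name, int age) {
        ageByName.put(name, age);
        if (minName == null || age < ageByName.get(minName)) minName = name;
        if (maxName == null || age > ageByName.get(maxName)) maxName = name;
    }

    Integer ageOf(String name) { return ageByName.get(name); } // expected O(1)
    String youngest() { return minName; }                      // O(1)
    String oldest()   { return maxName; }                      // O(1)
}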
It looks like you need a data structure that supports fast inserts and also fast queries on two different keys (name and age).
I would suggest keeping two data structures, one a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, the other a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree provides O(log(n)) inserts and max/min queries, while the hashtable gives us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
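A minimal Java sketch of this two-structure design (all names are my own; TreeMap serves as the balanced BST, and several Persons may share an age):

import java.util.*;

class Person {
    final String name;
    final int age;
    Person(String name, int age) { this.name = name; this.age = age; }
}

// Two indexes over the same Person objects: a TreeMap (balanced BST) keyed by
// age and a HashMap keyed by name. Names are assumed unique here.
class PersonIndex {
    private final TreeMap<Integer, List<Person>> byAge = new TreeMap<>();
    private final Map<String, Person> byName = new HashMap<>();

    void add(Person p) {               // O(log n) tree insert + O(1) hash insert
        byAge.computeIfAbsent(p.age, a -> new ArrayList<>()).add(p);
        byName.put(p.name, p);
    }

    Person findByName(String name) { return byName.get(name); }       // expected O(1)

    // Both assume at least one Person has been added
    Person youngest() { return byAge.firstEntry().getValue().get(0); } // O(log n)
    Person oldest()   { return byAge.lastEntry().getValue().get(0); }  // O(log n)
}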
It sounds like you're expecting the name to be the unique identifier; otherwise your operation 3 is ambiguous. (What is the correct result if you have two entries for John Smith?)
Assuming that the uniqueness of names is guaranteed, I would go with a plain hashtable keyed by names. Operations 1 and 3 are trivial to execute. Operation 2 could be done in O(n) time if you search through the data structure manually, or you can do as you suggest and keep track of the min/max, updating them as you add/delete entries in the hash table.

What's a good way to manage a lot of loosely related components in F#?

I'm trying to translate an idea I had from OOP concepts to FP concepts, but I'm not quite sure how to best go about it. I want to have multiple collections of records, but have individual records linked across the collections. In C# I would probably use multiple Dictionary objects with an Entity-specific ID as a common key, so that given any set of the dictionaries, a method could extract a particular Entity using its ID/Name.
I guess I could do the same thing in F#, owing to its hybrid nature, but I'd prefer to be more purely functional. What is the best structure to do what I'm talking about here?
I had considered maybe a trie or a Patricia trie, but I shouldn't need very deep name searching, and I'm more likely to have one or two of some things and lots of other things. It's a game design idea, so, for example, you'd only have one "Player" but could have tons of "Enemy1", "Enemy2", etc.
Is there a really good data structure for fast keyed lookup in FP, or should I just stick to Dictionary/Hashmaps?
A usual functional data structure for representing dictionaries in F# is Map (as pointed out by larsmans). Under the covers, it is implemented as a balanced binary tree, so the complexity of lookup is O(log N) for a tree containing N elements. This is slower than a hash-based dictionary (which is O(1) with good hash keys), but it allows adding and removing elements without copying the whole collection; only part of the tree needs to be changed.
From your description, I have the impression that you'll be creating the data structure only once and then using it for a long time without modifying it. In this case, you could implement a simple immutable wrapper type that uses Dictionary<_, _> under the covers, but takes all elements as a sequence in the constructor and doesn't allow modification:
type ImmutableMap<'K, 'V when 'K : equality>(data: seq<'K * 'V>) =
    // Store the data passed to the constructor in a hash-based dictionary
    let dict = new System.Collections.Generic.Dictionary<_, _>()
    do for k, v in data do dict.Add(k, v)
    // Provide read-only access
    member x.Item with get (k) = dict.[k]

let f = new ImmutableMap<_, _>([1, "Hello"; 2, "Ahoj"])
let str = f.[1]
This should be faster than using F# Map as long as you don't need to modify the collection (or, more precisely, create copies with elements added/removed).
Use the F# Collections.Map module. My bet is that it is implemented as a balanced binary search tree, the data structure of choice for this task in functional programming.
Tries are hard to program and mostly useful in specialized applications such as search engine indexing, where they are commonly used as a secondary store on top of an array/database/etc. Don't use them unless you know you need to.
