Dictionary Lookup (O(1)) vs Linq where

Which is faster, and should I give up the LINQ style to gain speed (assuming Dictionary lookup is truly faster)? Let me elaborate.
I have the following:
List<Product> products = GetProductList();
I need to search for a product based on some attribute, for example the serial number. I could first create a dictionary and populate it as follows:
Dictionary<string, Product> dict = new Dictionary<string, Product>();
foreach(Product p in products)
{
dict.Add(p.serial, p);
}
When it's time to find a product, take advantage of O(1) offered by the Dictionary look-up:
string some_serial = ...;
try { Product p = dict[some_serial]; } catch(KeyNotFoundException) { }
Alternatively, using Linq:
Product p = products.Where(p => p.serial.Equals(some_serial)).FirstOrDefault();
The drawback of the dictionary approach is, of course, that it requires more memory and more code and is less elegant (though most of this is debatable). Assume that is a non-factor. Should I take the first approach?
Finally, I would like to confirm that the complexity of the LINQ approach above is indeed O(n); I don't see how it could be better than that.

Assuming you are starting with an enumeration of objects and are only doing this once ...
It will be faster to use the Where method than to add everything to a Dictionary<TKey,TValue> and then look the item up. The reason is that the dictionary approach as a whole is not O(1): you first add every item to the dictionary and only then perform the lookup. The adding part is O(N), which is just as expensive as the Where method, with additional memory overhead. The dictionary only pays off when the same index is reused for many lookups.
Another minor point to be aware of is that a Dictionary<TKey,TValue> lookup is not strictly O(1). It is O(1) on average, but it can degrade toward O(N) in certain circumstances (many colliding keys, for instance).
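As a rough illustration of that trade-off, here is a minimal Java sketch (a Java analogue, not the asker's C# code). The Product class, its fields, and the ProductLookup helper are hypothetical stand-ins: a one-off search is a single O(n) scan, while building the index costs O(n) once and then serves each lookup in expected O(1).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the question's Product type.
class Product {
    final String serial;
    final String name;

    Product(String serial, String name) {
        this.serial = serial;
        this.name = name;
    }
}

class ProductLookup {
    // O(n) per call: scans until the first match, like Where(...).FirstOrDefault().
    static Product findByScan(List<Product> products, String serial) {
        for (Product p : products) {
            if (p.serial.equals(serial)) {
                return p;
            }
        }
        return null;
    }

    // O(n) once: builds an index that later lookups can reuse.
    static Map<String, Product> buildIndex(List<Product> products) {
        Map<String, Product> index = new HashMap<>();
        for (Product p : products) {
            index.putIfAbsent(p.serial, p); // keep the first product per serial
        }
        return index;
    }

    public static void main(String[] args) {
        List<Product> products = Arrays.asList(
            new Product("A-1", "widget"),
            new Product("B-2", "gadget"));

        // One-off search: a single O(n) scan is enough.
        System.out.println(findByScan(products, "B-2").name); // gadget

        // Repeated searches: pay O(n) once, then expected O(1) per lookup.
        Map<String, Product> index = buildIndex(products);
        System.out.println(index.get("A-1").name);        // widget
        System.out.println(index.containsKey("missing")); // false
    }
}
```

A missing key simply yields null (or false from containsKey), so no exception handling is needed on the lookup path.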

Related

What is the time complexity performance of Scala's Vector data structure?

I know that most of the Vector methods are effectively O(1) (constant time) because of the tree they use, but I cannot find any information on the contains method. My first thought is that it would have to be O(n) to check all the elements but I am not sure.
Answering the question in the title, performance characteristics (2.13 docs version) of basic operations head, tail, apply, update, prepend, append, insert are all listed as eC for Vector:
eC: The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.
You are correct that contains is O(N), as there is no hashing or anything else that would avoid the need to compare against all items. Still, if you want to be sure, it is best to check the implementation.
Because finding the right implementation in the library sources can be difficult (many traits and overrides are used to implement the containers), the easiest way to check is with the debugger. Use code like:
val v = Vector(0, 1, 2)
v.contains(1)
Use the debugger to step into v.contains and the source you will see is:
def contains[A1 >: A](elem: A1): Boolean = exists (_ == elem)
If you are still not convinced at this point, some more "step into" will lead you to:
def exists(p: A => Boolean): Boolean = {
  var res = false
  while (!res && hasNext) res = p(next())
  res
}

An effective way to perform a prefix search on ranked (sorted) list?

I have a large list of some elements sorted by their probabilities:
data class Element(val value: String, val probability: Float)
val sortedElements = listOf(
    Element("dddcccdd", 0.7f),
    Element("aaaabb", 0.2f),
    Element("bbddee", 0.1f)
)
Now I need to perform prefix searches on this list: find the items that start with one prefix, then those that start with the next prefix, and so on (the results must still be sorted by probability).
val filteredElements1 = sortedElements
    .filter { it.value.startsWith("aa") }
val filteredElements2 = sortedElements
    .filter { it.value.startsWith("bb") }
Each "request" of elements filtered by some prefix takes O(n) time, which is too slow in case of a large list.
If I didn't care about the order of the elements (their probabilities), I could sort the elements lexicographically and perform a binary search: sorting takes O(n*log n) time and each request -- O(log n) time.
Is there any way to speed up the execution of these operations without losing the sorting (probability) of elements at the same time? Maybe there is some kind of special data structure that is suitable for this task?
You can read more about the trie data structure here: https://en.wikipedia.org/wiki/Trie
It could be really useful for your use case.
LeetCode also has a very detailed explanation, which you can find here: https://leetcode.com/articles/implement-trie-prefix-tree/
Hope this helps
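One way a trie could preserve the probability ranking is sketched below in Java (a swapped-in language; the question uses Kotlin). The RankedTrie class and its method names are hypothetical. It assumes values are inserted in descending-probability order, so the list stored at each node is already ranked and a query only has to walk the prefix.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankedTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Values whose path passes through this node, in insertion order.
        final List<String> ranked = new ArrayList<>();
    }

    private final Node root = new Node();

    // Assumption: called in descending-probability order, so every
    // node's list stays sorted by rank without any extra sorting.
    void insert(String value) {
        Node node = root;
        node.ranked.add(value);
        for (char c : value.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
            node.ranked.add(value);
        }
    }

    // Walks prefix.length() nodes; the returned list is already ranked.
    List<String> startingWith(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptyList();
            }
        }
        return Collections.unmodifiableList(node.ranked);
    }

    public static void main(String[] args) {
        RankedTrie trie = new RankedTrie();
        // Already sorted by probability, as in the question ("aab" is an extra sample value).
        for (String v : new String[] {"dddcccdd", "aaaabb", "bbddee", "aab"}) {
            trie.insert(v);
        }
        System.out.println(trie.startingWith("aa")); // [aaaabb, aab]
        System.out.println(trie.startingWith("zz")); // []
    }
}
```

The cost is memory: each value is referenced once per character of its length, so this suits lists that are built once and queried often.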
If your list does not change often, you could create a HashMap in which each existing prefix is a key that refers to a collection (sorted by probability) of all entries it is a prefix of.
Getting all entries for a given prefix then takes ~O(1).
Be careful: the map can get really big, and creating it takes quite some time.
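A minimal Java sketch of that prefix map follows; the PrefixIndex class and its method names are hypothetical, and the input is assumed to be sorted by descending probability already, so appending in order keeps every bucket ranked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixIndex {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    // Assumption: rankedValues is already sorted by descending probability.
    PrefixIndex(List<String> rankedValues) {
        for (String value : rankedValues) {
            // Register the value under every non-empty prefix of itself.
            for (int end = 1; end <= value.length(); end++) {
                byPrefix.computeIfAbsent(value.substring(0, end), k -> new ArrayList<>())
                        .add(value);
            }
        }
    }

    // Expected constant number of map probes; hashing the prefix itself
    // still costs time proportional to its length.
    List<String> startingWith(String prefix) {
        return byPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
```

The size warning is visible in the constructor: a value of length k creates k keys, so key storage grows roughly with the sum of the squared value lengths.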

Hash Tables and Separate Chaining: How do you know which value to return from the bucket's list?

We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve a hash table collision, that is, to put or get an item whose hash maps to the same bucket as another item, you fall back on the data structure that backs each bucket; this is generally a linked list. Collisions are the worst case for a hash table: if every key lands in the same bucket, finding the correct item in the linked list is an O(n) operation. It is, as you said, a loop that searches for the item with the matching key. If the bucket is backed by a structure such as a balanced tree instead, the search can take O(log n) time, as in the Java 8 HashMap implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest always looking at an existing implementation; for example, you could look at the Java 7 one. Doing so will improve your code-reading skills, which matter almost as much as writing code, since you will read code more often than you write it. I know it takes more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that the check if ((e.hash == hash) && e.key.equals(key)) is what finds the item with the matching key: the loop walks the bucket's chain, comparing the cheap hash first and then the key itself.
And here is the full source code: Hashtable.java
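To make the lookup loop concrete for the rick/morty scenario, here is a minimal separate-chaining sketch in Java. ChainedTable and its Node class are hypothetical names, not part of the JDK; constructing the table with a single bucket is a deliberate assumption that forces both keys to collide.

```java
class ChainedTable {
    private static final class Node {
        final String key;
        String value;
        final Node next;

        Node(String key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedTable(int bucketCount) {
        buckets = new Node[bucketCount];
    }

    private int indexFor(String key) {
        // Clear the sign bit so the remainder is never negative.
        return (key.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    void put(String key, String value) {
        int i = indexFor(key);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value; // key already present: overwrite
                return;
            }
        }
        buckets[i] = new Node(key, value, buckets[i]); // prepend to the chain
    }

    String get(String key) {
        // Walk the chain and return the node whose KEY matches;
        // the bucket index alone does not identify the entry.
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(1); // one bucket: every key collides
        table.put("rick", "Rick Sanchez");
        table.put("morty", "Morty Smith");
        System.out.println(table.get("rick"));  // Rick Sanchez
        System.out.println(table.get("morty")); // Morty Smith
    }
}
```

This is why each node stores the key alongside the value: without it, entries sharing a bucket could not be told apart.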

Scala collection transformation performance: single looping vs. multiple looping

When you have a collection and must perform two or more operations on all of its elements, which is faster?
val f1: String => String = _.reverse
val f2: String => String = _.toUpperCase
val elements: Seq[String] = List("a", "b", "c")
Iterate multiple times, performing one operation per pass:
val result = elements.map(f1).map(f2)
This approach has the advantage that the result after applying the first function could be reused.
Iterate once and apply all the operations to each element together:
val result = elements.map(element => f2(f1(element)))
or
val result = elements.map(f1.andThen(f2))
Is there any difference in performance between these two approaches? And if yes, which is faster?
Transforming a collection takes roughly O(N) time multiplied by the cost of the functions applied, so I doubt that the choice between the two single-pass variants makes even the slightest difference in runtime. The first option you list is a different story: it creates a new intermediate collection, and that overhead can be avoided. That is where "view" collections come in (see this good example I spotted):
In Scala, what does "view" do?
If you had to apply several mapping operations, you might do this:
val result = elements.view.map(f1).map(f2).force
(force at the end causes all the functions to be evaluated)
The 2nd set of examples above would maybe be a tiny bit faster, but the "view" option could make your code more readable if you had a lot of these or complex anonymous functions used in the mapping.
Composing functions to produce a single-pass transformation will probably gain you some performance, but it can quickly become unreadable. Consider using views as an alternative. This will create intermediate collections:
val result = elements.map(f1).map(f2)
This is evaluated lazily and composes the functions the same way you would by hand:
val result = elements.view.map(f1).map(f2)
Notice that the result type will be SeqView, so you might want to convert it to a list later with toList.
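The same three shapes can be sketched in Java as a rough analogue (this is not Scala code and not a benchmark; the SinglePassDemo class and its method names are hypothetical). Java streams are lazy, so chained map calls behave much like a Scala view: each element flows through both functions before the next one is processed, and no intermediate list is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class SinglePassDemo {
    static final Function<String, String> f1 = s -> new StringBuilder(s).reverse().toString();
    static final Function<String, String> f2 = String::toUpperCase;

    // Two eager passes with an intermediate list, like elements.map(f1).map(f2).
    static List<String> twoPasses(List<String> elements) {
        List<String> intermediate = new ArrayList<>();
        for (String e : elements) {
            intermediate.add(f1.apply(e));
        }
        List<String> result = new ArrayList<>();
        for (String e : intermediate) {
            result.add(f2.apply(e));
        }
        return result;
    }

    // One pass with a composed function: f1 first, then f2.
    static List<String> onePass(List<String> elements) {
        Function<String, String> both = f1.andThen(f2);
        List<String> result = new ArrayList<>();
        for (String e : elements) {
            result.add(both.apply(e));
        }
        return result;
    }

    // Lazy pipeline: reads like two maps, runs as one traversal.
    static List<String> lazyPipeline(List<String> elements) {
        return elements.stream().map(f1).map(f2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("ab", "cd");
        System.out.println(twoPasses(elements));    // [BA, DC]
        System.out.println(onePass(elements));      // [BA, DC]
        System.out.println(lazyPipeline(elements)); // [BA, DC]
    }
}
```

All three produce the same result; they differ only in whether an intermediate collection is allocated along the way.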

Best data structure to retrieve by max values and ID?

I have a large number of fixed-size records. Each record has lots of fields; ID and Value are among them. I am wondering what kind of data structure would be best so that I can
locate a record by ID (unique) very fast,
list the 100 records with the biggest values.
A max-heap seems to work, but it is far from perfect; do you have a smarter solution?
Thank you.
A hybrid data structure will most likely be best. For efficient lookup by ID a good structure is obviously a hash-table. To support top-100 iteration a max-heap or a binary tree is a good fit. When inserting and deleting you just do the operation on both structures. If the 100 for the iteration case is fixed, iteration happens often and insertions/deletions aren't heavily skewed to the top-100, just keep the top 100 as a sorted array with an overflow to a max-heap. That won't modify the big-O complexity of the structure, but it will give a really good constant factor speed-up for the iteration case.
I know you want a pseudo-code algorithm, but in Java, for example, I would use a TreeSet and add all the records as (ID, value) pairs.
If the set is ordered by descending value, reading its first 100 entries gives you the top 100. Note that a set ordered by value cannot look a record up by ID efficiently, so retrieval by ID needs a separate map keyed by ID.
The underlying structure is a balanced binary search tree (TreeSet is backed by a red-black tree).
Max heap would match the second requirement, but hash maps or balanced search trees would be better for the first one. Make the choice based on frequency of these operations. How often would you need to locate a single item by ID and how often would you need to retrieve top 100 items?
Pseudo code:
add(Item t)
{
    // Add the same object instance to both data structures
    heap.add(t);
    hash.add(t);
}
remove(int id)
{
    heap.removeItemWithId(id); // this is going to be slow: a plain heap has to search for the item
    hash.remove(id);
}
getTopN(int n)
{
    return heap.topNitems(n);
}
getItemById(int id)
{
    return hash.getItemById(id);
}
updateValue(int id, String value)
{
    Item t = hash.getItemById(id);
    // t is the same object referred to by both the heap and the hash
    t.value = value;
    // both structures now see the new value, but the heap must still
    // restore its ordering for t (sift it up or down)
}
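A runnable variant of the hybrid idea is sketched below in Java, swapping the heap for a TreeSet so that removals and value updates stay O(log n). The RecordStore and Rec names are hypothetical, and the record is reduced to an id and a numeric value for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class RecordStore {
    private static final class Rec {
        final int id;
        long value;

        Rec(int id, long value) {
            this.id = id;
            this.value = value;
        }
    }

    private final Map<Integer, Rec> byId = new HashMap<>();
    // Largest value first; ties broken by id so distinct records never compare equal.
    private final TreeSet<Rec> byValue = new TreeSet<>(
        Comparator.comparingLong((Rec r) -> r.value).reversed().thenComparingInt(r -> r.id));

    // Inserts a new record or updates the value of an existing one.
    void put(int id, long value) {
        Rec rec = byId.get(id);
        if (rec != null) {
            byValue.remove(rec); // remove while the old value still positions it
            rec.value = value;
        } else {
            rec = new Rec(id, value);
            byId.put(id, rec);
        }
        byValue.add(rec);
    }

    void remove(int id) {
        Rec rec = byId.remove(id);
        if (rec != null) {
            byValue.remove(rec);
        }
    }

    // Expected O(1) lookup by id; null if the id is unknown.
    Long valueOf(int id) {
        Rec rec = byId.get(id);
        return rec == null ? null : rec.value;
    }

    // Ids of the n records with the largest values, largest first.
    List<Integer> topIds(int n) {
        List<Integer> ids = new ArrayList<>();
        for (Rec r : byValue) {
            if (ids.size() == n) {
                break;
            }
            ids.add(r.id);
        }
        return ids;
    }
}
```

The important detail is in put: a record is removed from the ordered set before its value changes and re-added afterwards, because mutating a key in place would leave the tree in an inconsistent order.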
