Does an ordered Map know how to search efficiently for a key in Scala?

Imagine I have a Map:
val unorderedMap: Map[Int, String] = ...
val orderedMap: SortedMap[Int, String] = SortedMap(unorderedMap.toSeq: _*)
Is the lookup operation for a key faster in orderedMap?
unorderedMap.get(i) //Slower???
orderedMap.get(i) //Faster???
Does the compiler know how to search efficiently?
Does the compiler perform the lookup operation differently in each case?
*EDIT:
I have a
case class A(key: Int, value1: String, value2: String, ...)
val SeqA: Seq[A] = Seq(A(1, "One", "Uno", ...), A(2, "Two", "Duo", ...), ..., A(20000, ..., ...))
I want to have fast lookup operations on key (that is ALL I am interested in).
Is it better to make a Map out of it, like:
val mapA: Map[Int, A] = SeqA.map(a => a.key -> a)(collection.breakOut)
Or is it better to leave it as a Seq (and maybe sort it)?
Then, if I make it a Map, should I order it or not? *There are around
20K - 30K elements!

Sorted maps are usually(*) slower than hash maps in any language. This is because sorted maps have O(log n) lookup complexity, compared to hash maps, which have O(1) amortized complexity.
You should have a look at the relevant wiki pages for a more in-depth explanation.
(*) This depends on many factors, like the size of the map. For small sets, a sorted array with binary search might do better if it fits in cache.
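For the numbers in the edit (20K - 30K elements keyed by an Int), a hash-based Map is the natural fit: build it once from the Seq, and every get(key) is an expected constant-time lookup, with no sorting needed. A minimal sketch, assuming the case class from the question (the ... fields are trimmed and the val names are illustrative):
import scala.collection.immutable.SortedMap

case class A(key: Int, value1: String, value2: String)

val seqA: Seq[A] = Seq(A(1, "One", "Uno"), A(2, "Two", "Duo"), A(3, "Three", "Tres"))

// Hash-based map: expected O(1) lookup by key, no ordering guarantees.
val byKey: Map[Int, A] = seqA.map(a => a.key -> a).toMap

// Tree-based map: O(log n) lookup, keys kept in sorted order (only useful if
// you also need ordered traversal or range queries).
val byKeySorted: SortedMap[Int, A] = SortedMap(seqA.map(a => a.key -> a): _*)

byKey.get(2)        // Some(A(2, "Two", "Duo"))
byKeySorted.get(2)  // Some(A(2, "Two", "Duo"))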

Related

Check if a value belongs to a hash

I'm not sure if this is actually possible, so I'm asking here. Does anyone know of an algorithm that would allow something like this?
const values = ['a', 'b', 'c', 'd'];
const hash = createHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('b'); // => true
hash.includes('v'); // => false
What this snippet does is first create some sort of hash from a list of values, then check whether a certain value belongs to that hash.
Hash functions in general
The primary idea of hash functions is to reduce the space; that is, the functions are not injective, as they map from a bigger domain to a smaller one.
So they produce collisions. That is, there are different elements x and y that get mapped to the same hash value:
h(x) = h(y)
So basically you lose information about the given argument x.
However, in order to answer the question whether all values are contained you would need to keep all information (or at least all non-duplicates). This is obviously not possible for nearly all practical hash-functions.
A possible hash function would be the identity function:
h(x) = x for all x
but this doesn't reduce the space, so it's not practical.
A natural idea would be to compute hash values of the individual elements and then concatenate them, like
h(a, b, c) = (h(a), h(b), h(c))
But this again doesn't reduce the space; the hash values are as long as the message, so it's not practical.
Another possibility is to drop all duplicates, so given the values [a, b, c, a, b] we only keep [a, b, c]. But in most examples this only reduces the space marginally; again, not practical.
No matter what you do, though, you cannot compress below the set of non-duplicates. Otherwise you wouldn't be able to answer the question for some values. For example, if we used [a, b, c, a] but only kept [a, b], we would be unable to answer "was c contained" correctly.
Perfect hash functions
However, there is the field of perfect hash functions (Wikipedia). Those are hash functions that are injective; they don't produce collisions.
In some areas they are of interest.
For those you may be able to answer that question, for example if computing the inverse is easy.
Cryptographic hash functions
If you talk about cryptographic hash functions, the answer is no.
Those need to have three properties (Wikipedia):
Pre-image resistance - given h, it should be difficult to find an m with hash(m) = h
Second pre-image resistance - given m, it should be difficult to find an m' ≠ m with hash(m') = hash(m)
Collision resistance - it should be difficult to find any pair (m, m') with m ≠ m' and hash(m) = hash(m')
Informally you have especially:
A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.
If you could answer containment queries on such a hash value, you would be able to easily reconstruct the message by asking whether particular values are contained. Using that, you could construct collisions on purpose, and so on.
Details would however depend on the specific hash algorithm.
For a toy-example let's use the previous algorithm that simply removes all duplicates:
[a, b, c, a, a] -> [a, b, c]
In that case we find messages like
[a, b, c]
[a, b, c, a]
[a, a, b, b, c]
...
that all map to the same hash value.
If the hash function produces collisions (as almost all hash functions do), this cannot be possible.
Think about it this way: if, for example, h('abc') = x and h('abd') = x, how can you decide based on x whether the original string contains 'd'?
You could arguably decide to use the identity as a hash function, which would do the job.
A trivial solution would be simple hash concatenation.
function createHash(values) {
  let hash = '';
  for (const v of values)
    hash += MD5(v);  // MD5() is assumed to return a hex digest string
  return hash;
}
Can it be done with a fixed-length hash and variable input? I'd bet it's impossible.
In the case of a string hash (such as the ones used in HashMaps), because it is additive, I think we can match partially (a prefix match, but not a suffix match).
const values = ['a', 'b', 'c', 'd'];
const hash = createStringHash(values); // => xjaks14sdffdghj23h4kjhgd9f81nkjrsdfg9aiojd
hash.includes('a'); // => true
hash.includes('a', 'b'); // => true
hash.includes('a', 'b', 'v'); // => false
Bit arrays
If you don't care what the resulting hash looks like, I'd recommend just using a bit array.
Take the range of all possible values
Map this to the range of integers starting from 0
Let each bit in our hash indicate whether or not this value appears in the input
This will require 1 bit for every possible value (which could be a lot of bits for large ranges).
Note: this representation is optimal in terms of the number of bits used, assuming there's no limit on the number of elements you can have (beyond 1 of each possible value) - if it were possible to use any fewer bits, you'd have an algorithm that's capable of providing guaranteed compression of any data, which is impossible by the pigeonhole principle.
For example:
If your range is a-z, you can map this to 0-25, then [a,d,g,h] would map to:
10010011000000000000000000 = 38535168 = 0x24c0000
(abcdefghijklmnopqrstuvwxyz)
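A minimal sketch of this bit-array idea for the a-z range, with 'a' on the most significant of the 26 bits so it reproduces the example above (the object and method names are just illustrative):
object BitArrayHash {
  // 'a' -> bit 25, 'b' -> bit 24, ..., 'z' -> bit 0
  private def bit(c: Char): Int = 25 - (c - 'a')

  def createHash(values: Seq[Char]): Int =
    values.foldLeft(0)((acc, c) => acc | (1 << bit(c)))

  def includes(hash: Int, c: Char): Boolean =
    (hash & (1 << bit(c))) != 0
}

val h = BitArrayHash.createHash(Seq('a', 'd', 'g', 'h'))  // 38535168 = 0x24c0000
BitArrayHash.includes(h, 'd')  // true
BitArrayHash.includes(h, 'v')  // false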
More random-looking hashes
If you care what the hash looks like, you could take the output from the above and perform a perfect hash on it to map it either to the same length hash or a longer hash.
One trivial example of such a map would be to increment the resulting hash by a randomly chosen but deterministic value (i.e. it's the same for every hash we convert); you can also do this for each byte (with wrap-around) if you want (e.g. byte0 = (byte0+5)%256, byte1 = (byte1+18)%256).
To determine whether an element appears, the simplest approach would be to reverse the above operation (subtract instead of add) and then just check if the corresponding bit is set. Depending on what you did, it might also be possible to only convert a single byte.
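As a tiny illustration of that idea, here is the bitmask from the previous sketch shifted by an arbitrary fixed constant; a membership test simply reverses the shift first (the constant is made up, any deterministic value works):
val Offset = 0x5DEECE66L  // arbitrary, but fixed for every hash we produce

def obfuscate(hash: Int): Long = (hash & 0xFFFFFFFFL) + Offset

def includesObfuscated(obfuscated: Long, c: Char): Boolean =
  BitArrayHash.includes((obfuscated - Offset).toInt, c)  // undo the shift, then test the bit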
Bloom filters
If you don't mind false positives, I might recommend just using a bloom filter instead.
In short, this sets multiple bits for each value, and then checks each of those bits to check whether a value is in our collection. But the bits that are set for one value can overlap with the bits for other values, which allows us to significantly reduce the number of bits required at the cost of a few false positives (assuming the total number of elements isn't too large).
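A minimal Bloom filter sketch (the sizes, and the way the k hash functions are derived from MurmurHash3 seeds, are illustrative choices rather than a tuned implementation):
import scala.util.hashing.MurmurHash3

class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new Array[Boolean](numBits)

  // Derive k different hash functions by seeding MurmurHash3 differently.
  private def positions(value: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(value, seed) % numBits
      if (h < 0) h + numBits else h
    }

  def add(value: String): Unit = positions(value).foreach(i => bits(i) = true)

  // May return a false positive, but never a false negative.
  def mightContain(value: String): Boolean = positions(value).forall(i => bits(i))
}

val bf = new BloomFilter(numBits = 1024, numHashes = 3)
Seq("a", "b", "c", "d").foreach(bf.add)
bf.mightContain("b")  // true
bf.mightContain("v")  // false (with high probability)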

Find the Most Frequent Ordered Word Pair In a Document

This is a problem from S. Skiena's "Algorithm Design Manual"; the problem statement is:
Give an algorithm for finding an ordered word pair (e.g. "New York")
occurring with the greatest frequency in a given webpage.
Which data structure would you use? Optimize both time and space.
One obvious solution is inserting each ordered pair into a hash map and then iterating over all of them to find the most frequent one; however, there should definitely be a better way. Can anyone suggest anything?
In a text with n words, we have exactly n - 1 ordered word pairs (not distinct of course). One solution is to use a max priority queue; we simply insert each pair in the max PQ with frequency 1 if not already present. If present, we increment the key. However, if we use a Trie, we don't need to represent all n - 1 pairs separately. Take for example the following text:
A new puppy in New York is happy with it's New York life.
The resulting Trie would look like the following:
If we store the number of occurrences of a pair in the leaf nodes, we could easily compute the maximum occurrence in linear time. Since we need to look at each word, that's the best we can do, time wise.
Working Scala code below. The official site has a solution in Python.
import scala.annotation.tailrec
import scala.collection.mutable.{ListBuffer, Map => MutableMap}

class TrieNode(val parent: Option[TrieNode] = None,
               val children: MutableMap[Char, TrieNode] = MutableMap.empty,
               var n: Int = 0) {

  def add(c: Char): TrieNode = {
    val child = children.getOrElseUpdate(c, new TrieNode(parent = Some(this)))
    child.n += 1
    child
  }

  def letter(node: TrieNode): Char = {
    node.parent
      .flatMap(_.children.find(_._2 eq node))
      .map(_._1)
      .getOrElse('\u0000')
  }

  override def toString: String = {
    Iterator
      .iterate((ListBuffer.empty[Char], Option(this))) {
        case (buffer, node) =>
          node
            .filter(_.parent.isDefined)
            .map(letter)
            .foreach(buffer.prepend(_))
          (buffer, node.flatMap(_.parent))
      }
      .dropWhile(_._2.isDefined)
      .take(1)
      .map(_._1.mkString)
      .next()
  }
}

def mostCommonPair(text: String): (String, Int) = {
  val root = new TrieNode()

  @tailrec
  def loop(s: String,
           mostCommon: TrieNode,
           count: Int,
           parent: TrieNode): (String, Int) = {
    s.split("\\s+", 2) match {
      case Array(head, tail @ _*) if head.nonEmpty =>
        val word = head.foldLeft(parent)((tn, c) => tn.add(c))
        val (common, n, p) =
          if (parent eq root) (mostCommon, count, word.add(' '))
          else if (word.n > count) (word, word.n, root)
          else (mostCommon, count, root)
        loop(tail.headOption.getOrElse(""), common, n, p)
      case _ => (mostCommon.toString, count)
    }
  }

  loop(text, new TrieNode(), -1, root)
}
Inspired by the question here.
I think the first point to note is that finding the most frequent ordered word pair is no more (or less) difficult than finding the most frequent word. The only difference is that instead of words made up of the letters a..z+A..Z separated by punctuation or spaces, you are looking for word pairs made up of the letters a..z+A..Z+exactly_one_space, similarly separated by punctuation or spaces.
If your web page has n words then there are only n-1 word pairs. So hashing each word pair and then iterating over the hash table will be O(n) in both time and memory. This should be pretty quick to do even if n is ~10^6 (i.e. the length of an average novel). I can't imagine anything more efficient unless n is fairly small, in which case the memory savings from constructing an ordered list of word pairs (instead of a hash table) might outweigh the cost of increasing the time complexity to O(n log n).
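A minimal sketch of that hash-and-count approach, assuming words are separated by whitespace (punctuation handling omitted) and the text contains at least two words:
def mostFrequentPair(text: String): (String, Int) = {
  val words = text.split("\\s+").filter(_.nonEmpty)
  words
    .sliding(2)                                   // all n - 1 ordered word pairs
    .collect { case Array(a, b) => s"$a $b" }
    .toSeq
    .groupBy(identity)
    .map { case (pair, occ) => pair -> occ.size } // pair -> frequency
    .maxBy(_._2)
}

mostFrequentPair("A new puppy in New York is happy with its New York life")
// => ("New York", 2)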
Why not keep all the ordered pairs in an AVL tree, with a 10-element array to track the top 10 ordered pairs? The AVL tree would keep all the ordered pairs with their occurrence counts, and the top 10 would be kept in the array. This way, searching for any ordered pair would be O(log n) and traversing would be O(n).
I think we cannot do better than O(n) in terms of time, as one would have to look at each element at least once. So the time complexity cannot be optimised further.
But we can use a trie to optimise the space used. In a page, there are often words which are repeated, so this might lead to a significant reduction in space usage. The leaf nodes in the trie could store the frequency of the ordered pair, and we could use two pointers to iterate through the text, one pointing at the current word and the other at the previous word.

Design of a data structure that can search over objects with 2 attributes

I'm trying to think of a way to design a data structure that I can efficiently insert into, remove from and search in.
The catch is that the search function gets a similar object as input, with 2 attributes, and I need to find an object in my dataset such that both the 1st and the 2nd attribute of the object in my dataset are equal to or bigger than those of the search function's input.
So for example, if I send as input, the following object:
object[a] = 9; object[b] = 14
Then a valid found object could be:
object[a] = 9; object[b] = 79
but not:
object[a] = 8; object[b] = 28
Is there any way to store the data such that the search complexity is better than linear?
EDIT:
I forgot to include this in my original question. The search has to return the smallest possible object in the dataset, by multiplication of the 2 attributes.
Meaning that the value of object[a]*object[b] for the returned object that fits the original condition is smaller than that of any other object in the dataset that also fits.
You may want to use the k-d tree data structure, which is typically used to index k-dimensional points. The search operation you describe requires O(log n) time on average.
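A minimal sketch of that idea (unbalanced, insertion only; the names are illustrative): among the stored points with x >= qx and y >= qy, return the one minimizing x * y, pruning subtrees that cannot contain a candidate.
case class Point(x: Long, y: Long)

sealed trait KdTree
case object Leaf extends KdTree
case class Node(point: Point, left: KdTree, right: KdTree, splitOnX: Boolean) extends KdTree

object KdTree {
  // Alternate the splitting axis at each level; smaller coordinates go left.
  def insert(t: KdTree, p: Point, splitOnX: Boolean = true): KdTree = t match {
    case Leaf => Node(p, Leaf, Leaf, splitOnX)
    case n @ Node(q, l, r, sx) =>
      val goLeft = if (sx) p.x < q.x else p.y < q.y
      if (goLeft) n.copy(left = insert(l, p, !sx)) else n.copy(right = insert(r, p, !sx))
  }

  // Smallest x * y among points dominating (qx, qy). The worst case is still
  // linear, but the pruning below can skip whole subtrees.
  def bestDominating(t: KdTree, qx: Long, qy: Long): Option[Point] = t match {
    case Leaf => None
    case Node(p, l, r, sx) =>
      val here = if (p.x >= qx && p.y >= qy) Some(p) else None
      // If we split on x and p.x <= qx, every point in the left subtree has
      // x < p.x <= qx, so it cannot qualify; the symmetric rule holds for y.
      val exploreLeft = if (sx) p.x > qx else p.y > qy
      val candidates = Seq(
        here,
        if (exploreLeft) bestDominating(l, qx, qy) else None,
        bestDominating(r, qx, qy)).flatten
      if (candidates.isEmpty) None else Some(candidates.minBy(c => c.x * c.y))
  }
}

val tree = Seq(Point(9, 79), Point(8, 28), Point(10, 15))
  .foldLeft(Leaf: KdTree)((t, p) => KdTree.insert(t, p))
KdTree.bestDominating(tree, 9, 14)  // Some(Point(10,15)): smallest x * y with x >= 9 and y >= 14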
This post may help when the attributes are hierarchically linked, like name and forename. For points in a 2D space, a k-d tree is better adapted, as explained by fajarkoe.
class Person {
  string name;
  string forename;
  // ... other non-key attributes
}
You have to write a comparator function which takes two objects of the class as input and returns -1, 0 or +1 for the <, = and > cases.
Libraries like glibc, with qsort() and bsearch(), or higher-level languages like Java, with its java.util.Comparator interface and java.util.SortedMap containers (implemented by java.util.TreeMap), use comparators.
Other languages use an equivalent concept.
The comparator method may be written following your spec like:
int compare( Person left, Person right ) {
  if( left.name < right.name ) {
    return -1;
  }
  if( left.name > right.name ) {
    return +1;
  }
  if( left.forename < right.forename ) {
    return -1;
  }
  if( left.forename > right.forename ) {
    return +1;
  }
  return 0;
}
Complexity of qsort()
Quicksort, or partition-exchange sort, is a sorting algorithm
developed by Tony Hoare that, on average, makes O(n log n) comparisons
to sort n items. In the worst case, it makes O(n^2) comparisons, though
this behavior is rare. Quicksort is often faster in practice than
other O(n log n) algorithms. Additionally, quicksort's sequential
and localized memory references work well with a cache. Quicksort is a
comparison sort and, in efficient implementations, is not a stable
sort. Quicksort can be implemented with an in-place partitioning
algorithm, so the entire sort can be done with only O(log n)
additional space used by the stack during the recursion.
Complexity of bsearch()
If the list to be searched contains more than a few items (a dozen,
say) a binary search will require far fewer comparisons than a linear
search, but it imposes the requirement that the list be sorted.
Similarly, a hash search can be faster than a binary search but
imposes still greater requirements. If the contents of the array are
modified between searches, maintaining these requirements may even
take more time than the searches. And if it is known that some items
will be searched for much more often than others, and it can be
arranged so that these items are at the start of the list, then a
linear search may be the best.

Scala: fastest `remove(i: Int)` in mutable sequence

Which implementation from scala.collection.mutable package should I take if I intend to do lots of by-index-deletions, like remove(i: Int), in a single-threaded environment? The most obvious choice, ListBuffer, says that it may take linear time depending on buffer size. Is there some collection with log(n) or even constant time for this operation?
Removal operations, including buf remove i, are not part of Seq; they are actually part of the Buffer trait under scala.collection.mutable. (See Buffers.)
See the first table on Performance Characteristics. I am guessing buf remove i has the same characteristic as insert, which is linear for both ArrayBuffer and ListBuffer.
As documented in Array Buffers, array buffers use arrays internally, and list buffers use linked lists (which is still O(n) for remove).
As an alternative, the immutable Vector may give you effectively constant time.
Vectors are represented as trees with a high branching factor. Every tree node contains up to 32 elements of the vector or contains up to 32 other tree nodes. [...] So for all vectors of reasonable size, an element selection involves up to 5 primitive array selections. This is what we meant when we wrote that element access is "effectively constant time".
scala> import scala.collection.immutable._
import scala.collection.immutable._
scala> def remove[A](xs: Vector[A], i: Int) = (xs take i) ++ (xs drop (i + 1))
remove: [A](xs: scala.collection.immutable.Vector[A],i: Int)scala.collection.immutable.Vector[A]
scala> val foo = Vector(1, 2, 3, 4, 5)
foo: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3, 4, 5)
scala> remove(foo, 2)
res0: scala.collection.immutable.Vector[Int] = Vector(1, 2, 4, 5)
Note, however, that a high constant factor with lots of overhead may not beat a quick linear scan until the data size is significantly large.
Depending on your exact use case, you may be able to use LinkedHashMap from scala.collection.mutable.
Although you cannot remove by index, you can remove by a unique key in constant time, and it maintains a deterministic ordering when you iterate.
scala> val foo = new scala.collection.mutable.LinkedHashMap[String,String]
foo: scala.collection.mutable.LinkedHashMap[String,String] = Map()
scala> foo += "A" -> "A"
res0: foo.type = Map((A,A))
scala> foo += "B" -> "B"
res1: foo.type = Map((A,A), (B,B))
scala> foo += "C" -> "C"
res2: foo.type = Map((A,A), (B,B), (C,C))
scala> foo -= "B"
res3: foo.type = Map((A,A), (C,C))
Java's ArrayList effectively has constant time complexity if the last element is the one to be removed. Look at the following snippet copied from its source code,
int numMoved = size - index - 1;
if (numMoved > 0)
System.arraycopy(elementData, index+1, elementData, index,
numMoved);
elementData[--size] = null; // clear to let GC do its work
As you can see, if numMoved is equal to 0, remove will not shift or copy the array at all. This can be quite useful in some scenarios. For example, if you do not care much about ordering, then to remove an element you can always swap it with the last element and then delete the last element from the ArrayList, which effectively makes the remove operation constant time. I was hoping ArrayBuffer would do the same; unfortunately, that is not the case.
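For completeness, here is a minimal sketch of that swap-with-last trick on a Scala ArrayBuffer (at the cost of not preserving element order; how much copying the final removal does depends on the ArrayBuffer implementation and version):
import scala.collection.mutable.ArrayBuffer

def swapRemove[A](buf: ArrayBuffer[A], i: Int): A = {
  val removed = buf(i)
  buf(i) = buf(buf.length - 1)  // overwrite slot i with the last element
  buf.remove(buf.length - 1)    // drop the last slot; no earlier elements need shifting
  removed
}

val buf = ArrayBuffer(1, 2, 3, 4, 5)
swapRemove(buf, 1)  // => 2; buf is now ArrayBuffer(1, 5, 3, 4)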

What's the best data structure for storing 2-tuples (a, b) which supports adding, deleting tuples and comparing (either on a or b)?

So here is my problem. I want to store 2-tuples (key, val) and want to perform the following operations:
keys are strings and values are Integers
multiple keys can have same value
adding new tuples
updating any key with new value (any new value or updated value is greater than the previous one, like timestamps)
fetching all the keys with values less than or greater than given value
deleting tuples.
A hash seems to be the obvious choice for updating a key's value, but then lookups via values are going to take longer (O(n)). The other option is a balanced binary search tree with key and value switched. With that, lookups via values will be fast (O(log n)), but updating a key will take O(n). So is there any data structure which can be used to address these issues?
Thanks.
I'd use 2 data structures: a hash table from keys to values, and a search tree ordered by value and then by key. When inserting, insert the pair into both structures; when deleting by key, look up the value from the hash and then remove the pair from the tree. Updating is basically delete + insert. Insert, delete and update are O(log n). For fetching all the keys less than a value, look up the value in the search tree and iterate backwards. This is O(log n + k).
The choice of good hash table and search tree implementations depends a lot on your particular distribution of data and operations. That said, a good general-purpose implementation of both should be sufficient.
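A minimal sketch of that two-structure approach in Scala, using a HashMap keyed by key plus a TreeSet of (value, key) pairs ordered by value and then key (names are illustrative; a range query such as rangeUntil could replace the takeWhile to get the O(log n + k) bound):
import scala.collection.mutable

class KeyValueIndex {
  private val byKey = mutable.HashMap.empty[String, Int]
  private val byValue = mutable.TreeSet.empty[(Int, String)]  // ordered by (value, key)

  def put(key: String, value: Int): Unit = {
    byKey.get(key).foreach(old => byValue.remove((old, key)))  // drop the stale pair, if any
    byKey(key) = value
    byValue.add((value, key))
  }

  def delete(key: String): Unit =
    byKey.remove(key).foreach(old => byValue.remove((old, key)))

  // All keys whose value is strictly less than v, in ascending value order.
  def keysWithValueLessThan(v: Int): Seq[String] =
    byValue.iterator.takeWhile(_._1 < v).map(_._2).toSeq
}

val idx = new KeyValueIndex
idx.put("a", 1); idx.put("b", 2); idx.put("c", 3); idx.put("d", 3)
idx.keysWithValueLessThan(3)  // Seq("a", "b")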
For a binary search tree, insert is an O(log n) operation on average and O(n) in the worst case. The same holds for lookup. So this should be your choice, I believe.
Dictionary or Map types tend to be based on one of two structures.
Balanced tree (guarantee O(log n) lookup).
Hash based (best case is O(1), but a poor hash function for the data could result in O(n) lookups).
Any book on algorithms should cover both in lots of detail.
To provide operations on both keys and values, there are also multi-index collections (with all the extra complexity) which maintain multiple structures (much like an RDBMS table can have multiple indexes). Unless you do a lot of lookups over a large collection, the extra overhead might cost more than a few linear lookups.
You can create a custom data structure which holds two dictionaries, i.e. a hash table from keys -> values and another hash table from values -> lists of keys. For example (Python 2):
class Foo:
    def __init__(self):
        self.keys = {}    # (KEY=key, VALUE=value)
        self.values = {}  # (KEY=value, VALUE=list of keys)

    def add_tuple(self, kd, vd):
        self.keys[kd] = vd
        if self.values.has_key(vd):
            self.values[vd].append(kd)
        else:
            self.values[vd] = [kd]

f = Foo()
f.add_tuple('a', 1)
f.add_tuple('b', 2)
f.add_tuple('c', 3)
f.add_tuple('d', 3)

print f.keys
print f.values
print f.keys['a']
print f.values[3]
print [f.values[v] for v in f.values.keys() if v > 1]
OUTPUT:
{'a': 1, 'c': 3, 'b': 2, 'd': 3}
{1: ['a'], 2: ['b'], 3: ['c', 'd']}
1
['c', 'd']
[['b'], ['c', 'd']]
