Efficient way to identify named Sets with common elements in Scala - algorithm

Given a Map[String, Set[String]], what is an elegant and efficient way in Scala to determine the set of all pairs of distinct keys where the corresponding sets have a non-empty intersection?
For example, take the map
val input = Map(
  "a" -> Set("x", "z"),
  "b" -> Set("f"),
  "c" -> Set("f", "z", "44"),
  "d" -> Set("99")
)
then the required output is
Set(
  ("a", "c"),
  ("b", "c")
)
Efficient in this context means better than O(n^2) where n is the sum of the number of elements in the family of sets given as input.

You can't get a better worst-case complexity than O(n^2). Look at the following example:
Map(
  1 -> Set("a"),
  2 -> Set("a"),
  3 -> Set("a"),
  ...
  n -> Set("a")
)
In this case every single pair of sets has a non-empty intersection, so the size of the output alone is O(n^2), and no algorithm can do better.
Obviously, that doesn't mean you can't think of a better algorithm than plain brute force. For example, you could transform this:
val input = Map(
  "a" -> Set("x", "z"),
  "b" -> Set("f"),
  "c" -> Set("f", "z", "44"),
  "d" -> Set("99")
)
into this:
val transformed = Map(
  "x" -> Set("a"),
  "z" -> Set("a", "c"),
  "f" -> Set("b", "c"),
  "44" -> Set("c"),
  "99" -> Set("d")
)
You can do this in linear time. I'd use Scala collection builders or mutable collections for this to avoid expensive operations on immutable collections.
Then you can just look at every set that is a value in this transformed map and, for each one, generate all possible pairs of its elements. This can still take O(n^2) in the worst case, but if your output doesn't contain many pairs it will be a lot faster.
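For illustration, here is a minimal sketch of that two-step approach (the name overlappingPairs is made up; pairs are ordered lexicographically so each unordered pair appears once):

import scala.collection.mutable

def overlappingPairs(input: Map[String, Set[String]]): Set[(String, String)] = {
  // Invert the map in linear time: element -> keys of the sets containing it.
  val byElement = mutable.Map.empty[String, mutable.Set[String]]
  for ((key, elems) <- input; e <- elems)
    byElement.getOrElseUpdate(e, mutable.Set.empty) += key
  // Emit all pairs of distinct keys that share a bucket.
  val pairs = mutable.Set.empty[(String, String)]
  for (keys <- byElement.values; a <- keys; b <- keys if a < b)
    pairs += ((a, b))
  pairs.toSet
}

Applied to the example input above, this yields Set(("a", "c"), ("b", "c")).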

Related

calculate the optimal order of a series of transformations

Not sure if I should ask here or on a different Stack Exchange, but: is there a known way to find the shortest sequence of transformations between two values, given a number of potential transformations?
Brute-force solution/example in Python:
from itertools import permutations, groupby
start = ["A", "B", "C", "D"]
goal = ["A", "X", "C", "Y"]
Transforms = [
    (None, None, "B", "D"),
    ("F", None, None, "Y"),
    (None, "X", "C", None),
    (None, None, "G", "Y"),
    ("D", "X", None, None),
    (None, "X", None, None)
]
def apply_transform(value, transform):
    for x in range(4):
        if transform[x] is None:
            continue
        value[x] = transform[x]

perms = permutations(range(len(Transforms)))
results = []
for order in perms:
    value = start.copy()
    moves = 0
    for o in order:
        moves += 1
        apply_transform(value, Transforms[o])
        if value == goal:
            results.append([moves, order[0:moves]])
            break

# just printing sorted unique results in a formatted way... I'd really just
# pick the first one rather than listing all potential ones
results.sort(key=lambda x: x[0])
results = list(k for k, _ in groupby(results))
print("\n".join(f"moves {m} | {' -> '.join(str(s) for s in ms)}" for m, ms in results))
Results that correctly move the start to the goal:
moves 2 | 3 -> 2
moves 3 | 0 -> 3 -> 2
moves 3 | 3 -> 5 -> 2
moves 3 | 5 -> 3 -> 2
moves 4 | 0 -> 3 -> 5 -> 2
moves 4 | 0 -> 5 -> 3 -> 2
moves 4 | 5 -> 0 -> 3 -> 2
So, picking the first item in the sorted list gives the lowest number of transformations (applying transformation 3 and then transformation 2).
Obviously, this exact brute-force "algorithm" can be improved by abandoning a permutation once it has already grown longer than the best solution found so far... but is there a better solution to this problem that I'm not seeing? Some sort of graph? Permutations aren't the best for speed, but they might be the only option. Are there other small optimizations that can be done here?
One possible optimization would be to find the transformations that have to be the last ones, and work your way backwards.
So, here only transformations 2 and 5 can be the last ones, and 5 is a subset of 2, so it can be ignored (one more optimization: ignore transformations that are parts of other transformations); the only one that remains is 2.
Now you are looking for a way to reach the state (A, *, *, Y) using the remaining transformations. Transformations 1 and 3 are the only candidates, and 3 -> 2 makes the solution.
This algorithm is a bit complicated, because it requires recursion and backtracking (if you do it the easy, depth-first way) or some queue processing (if you do it the better, breadth-first way), but it will be faster than trying all possible permutations.
Think of it as a graph. Each value is a node, and the transformations are edges. There are known algorithms for the shortest path in a graph, e.g. Dijkstra or A*.
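For illustration, here is a minimal breadth-first-search sketch of that graph view, written in Scala (the names applyTransform and shortestPath are made up; note that, unlike the permutation-based brute force, it allows a transformation to be reused):

import scala.collection.mutable

type State     = Vector[String]
type Transform = Vector[Option[String]] // None = leave that slot unchanged

def applyTransform(t: Transform, s: State): State =
  s.zip(t).map { case (old, repl) => repl.getOrElse(old) }

// BFS over states: returns the shortest list of transform indices
// turning start into goal, if one exists.
def shortestPath(start: State, goal: State, ts: Vector[Transform]): Option[List[Int]] = {
  val seen  = mutable.Set(start)
  val queue = mutable.Queue((start, List.empty[Int]))
  while (queue.nonEmpty) {
    val (state, path) = queue.dequeue()
    if (state == goal) return Some(path.reverse)
    for ((t, i) <- ts.zipWithIndex) {
      val next = applyTransform(t, state)
      if (seen.add(next)) queue.enqueue((next, i :: path)) // unseen state: visit it
    }
  }
  None
}

Because every edge costs one move, plain BFS already finds the minimum number of transformations; Dijkstra or A* only become necessary if transformations have different weights.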

Looking up all keys in a hashmap-like data structure that are a subset of a query

The Question
Let's say I have a HashMap:
{
  "foo" => (A,B),
  "bar" => (B,C),
  "baz" => (A,D)
}
And I transform it into some other data-structure, something similar to:
{
  (A,B) => "foo",
  (B,C) => "bar",
  (A,D) => "baz",
  (D,E) => "biz"
}
Given the 'key' (or query I suppose): (A,B,C)
I'd like to lookup all values where their key is a subset of the query set: ("foo","bar")
I'm trying to find an algorithm/data structure that will do that lookup efficiently, although I don't think there's an O(1) solution.
One possible solution
One idea I had was to break up the transformed map to look like this:
{
  A => ("foo","baz"),
  B => ("foo","bar"),
  C => ("bar"),
  D => ("baz","biz"),
  E => ("biz")
}
Then look up each element not in the query set: (D => ("baz", "biz"), E => ("biz"))
Union them: ("baz", "biz")
And take the difference from the set of all possible results: ("foo", "bar", "baz", "biz") - ("baz", "biz") => ("foo", "bar")
While this works, it's a lot of steps, with num_of_set_elements unions, and on large sets with a small query, a possibly huge number of lookups.
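Concretely, those steps might look like this in Scala (a sketch; index is the per-element map above, all the set of all possible results):

def lookupSubsets[A, V](index: Map[A, Set[V]], all: Set[V], query: Set[A]): Set[V] = {
  val outside = index.keySet.diff(query) // elements not in the query set
  val blocked = outside.flatMap(index)   // union of their value sets
  all.diff(blocked)                      // keep values whose key sets avoid them
}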
So, does anyone have a better solution?
The basic solution does not need to build a new structure: iterate through the pairs (key, value) and, if the value is a subset of your query, yield the key. The expected time is O(n) (n = number of keys) if you can represent the sets as 64-bit integers (i.e. fewer than 64 atomic values A, B, C, ...): S contains A exactly when A & S == A. See https://cs.calvin.edu/activities/books/c++/ds/2e/WebItems/Chapter09/Bitsets.pdf for example. It costs more if you have to check the potential subsets another way.
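In Scala, the bitmask test might look like this (a sketch; the encode helper and the atoms bit-assignment are assumptions):

// Each atomic value gets a bit position via atoms; a set becomes a Long.
def encode[A](s: Set[A], atoms: Map[A, Int]): Long =
  s.foldLeft(0L)((bits, a) => bits | (1L << atoms(a)))

// A is a subset of S exactly when A & S == A.
def isSubsetOf(a: Long, s: Long): Boolean = (a & s) == a

// Linear scan: yield every key whose encoded set is a subset of the query.
def lookup[K](encoded: Map[K, Long], query: Long): Iterable[K] =
  encoded.collect { case (k, bits) if isSubsetOf(bits, query) => k }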
But if you have enough room and a long time for preprocessing data, there's a (crazy?) amortized O(1) time solution for lookup.
If S is the set of all possible atomic values (A, B, ...), then for every value you can build all of its possible supersets ((A, B), (A, C), ...) that are still subsets of S.
Now you can build a new map like this (pseudocode):
for each pair (key, value):
    for every superset s of value:
        add key to m[s]
The lookup will be in constant amortized time O(1). The map will have roughly 2^|S| keys. The size of the values will depend on the size of the keys, e.g. the biggest key, S, will contain every key of the initial map. Preprocessing will be something like O(n*2^|S|).
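A sketch of that precomputation, reusing the bitmask encoding from the basic solution (subset enumeration uses the standard (sub - 1) & free trick; the names are made up):

import scala.collection.mutable

// For each (key, bits), register the key under every superset of bits
// inside universe, by enumerating all subsets of the unused bits.
def precompute[K](encoded: Map[K, Long], universe: Long): Map[Long, List[K]] = {
  val m = mutable.Map.empty[Long, List[K]].withDefaultValue(Nil)
  for ((key, bits) <- encoded) {
    val free = universe & ~bits
    var sub  = free
    var done = false
    while (!done) {
      m(bits | sub) = key :: m(bits | sub)
      if (sub == 0L) done = true else sub = (sub - 1) & free
    }
  }
  m.toMap
}

// Lookup is then a single map access:
//   precomputed.getOrElse(queryBits, Nil)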
I'd rather go for O(n) though.

Find ranges in array

I've been trying to find the optimal solution to the following (interesting?) problem that came up at work. Eventually I settled for a good-enough solution, but I'd like to know if there's a better one.
Let a1...an be an array of strings.
Let s1...sk be an unordered list of strings, all of them also members of the array.
The task is to find the minimum set of index ranges that the elements of s cover in a.
So for example if a = [ "x", "y", "a", "f", "c" ] and s = { "c","y","f" }, the answer would be (1;1), (3;4), assuming that the array is indexed from zero.
a is typically fairly large (hundreds of thousands of elements), while s is relatively small, typically length(s) < log(length(a)).
So the question is: can you find a time-efficient algorithm for this problem? (Space efficiency is not a concern within reasonable limits.)
Just a quick but important update: I need to perform this operation a lot, with different s values but the same a. So precomputing stuff based on a is allowed; indeed, it is the only way.
Build a hash table H(a) mapping each element to its index (a[x] -> x) in O(n) time and space. Then look up each s[y] in H(a) (O(1) time on average, O(k) total for s) and keep track of the ranges. For the ranges you can use an array of pairs (min_index, max_index) sorted by min_index, and binary-search it to either locate an existing range or find where to insert a new one-element range.
So overall, the solution above takes O(n + k + k * log(nb_ranges)) time and O(n + nb_ranges) space.
This is what you want, written in python:
def flattened(indexes):
    s, rest = indexes[0], indexes[1:]
    result = (s, s)
    for e in rest:
        if e == result[1] + 1:
            result = (result[0], e)
        else:
            yield result
            result = (e, e)
    yield result

a = ["x", "y", "a", "f", "c"]
s = ["c", "y", "f"]

# Create lookup table from element to its index in a
src_indexes = dict((key, i) for i, key in enumerate(a))
# Create sorted list of the indexes into a of the elements of s
raw_dst_indexes = sorted(src_indexes[key] for key in s)
# Convert the sorted list of indexes into a list of ranges
dst_indexes = [r for r in flattened(raw_dst_indexes)]
print(dst_indexes)
You can throw the elements of s into a set or hash table, anything with near-O(1) membership checks. Then just do a linear scan over a, with a flag recording whether you are currently covering elements of s, plus the start position of that cover. Should be O(n + k).
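A sketch of that scan in Scala (the name coveredRanges is made up):

import scala.collection.mutable

def coveredRanges(a: IndexedSeq[String], s: Set[String]): List[(Int, Int)] = {
  val ranges = mutable.ListBuffer.empty[(Int, Int)]
  var start  = -1 // -1 = not currently inside a covered run
  for (i <- a.indices) {
    if (s.contains(a(i))) {
      if (start < 0) start = i       // a covered run begins here
    } else if (start >= 0) {
      ranges += ((start, i - 1))     // the run just ended
      start = -1
    }
  }
  if (start >= 0) ranges += ((start, a.length - 1)) // run reaching the end of a
  ranges.toList
}

coveredRanges(Vector("x", "y", "a", "f", "c"), Set("c", "y", "f")) returns List((1, 1), (3, 4)), matching the example above.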

Scala map sorting

How do I sort a map of this kind:
"01" -> List(34,12,14,23), "11" -> List(22,11,34)
by its keys?
One way is to use scala.collection.immutable.TreeMap, which is always sorted by keys:
val t = TreeMap("01" -> List(34,12,14,23), "11" -> List(22,11,34))
// If you already have a map...
val m = Map("01" -> List(34,12,14,23), "11" -> List(22,11,34))
// ...use this:
val t = TreeMap(m.toSeq: _*)
You can convert it to a Seq or List and sort it, too:
// by specifying an element for sorting
m.toSeq.sortBy(_._1) // sort by comparing keys
m.toSeq.sortBy(_._2) // sort by comparing values (needs an Ordering for the value type)
// by providing a sort function
m.toSeq.sortWith(_._1 < _._1) // sort by comparing keys
There are plenty of possibilities, each more or less convenient in a certain context.
As stated, the default Map type is unsorted, but there's always SortedMap
import collection.immutable.SortedMap
SortedMap("01" -> List(34,12,14,23), "11" -> List(22,11,34))
Although I'm guessing you can't use that, because I recognise this as homework and suspect that YOUR map is the result of a groupBy operation. So you have to create an empty SortedMap and add the values:
val unsorted = Map("01" -> List(34,12,14,23), "11" -> List(22,11,34))
val sorted = SortedMap.empty[String, List[Int]] ++ unsorted
//or
val sorted = SortedMap(unsorted.toSeq:_*)
Or if you're not wedded to the Map interface, you can just convert it to a sequence of tuples. Note that this approach only works if both the keys and the values have a defined ordering; Lists don't have a default ordering, so this won't work with your example code. I therefore made up some other numbers instead.
val unsorted = Map("01" -> 56, "11" -> 34)
val sorted = unsorted.toSeq.sorted
This might be useful if you can first convert your lists to some other type (such as a String), which is best done using mapValues.
Update: see Landei's answer, which shows how you can provide a custom sort function that'll make this approach work.

How to implement a half-edge data structure in Haskell?

For a description of the data structure see
http://www.flipcode.com/archives/The_Half-Edge_Data_Structure.shtml
http://www.cgal.org/Manual/latest/doc_html/cgal_manual/HalfedgeDS/Chapter_main.html
A half-edge data structure involves cycles.
Is it possible to implement it in a functional language like Haskell?
Are mutable references (STRef) the way to go?
Thanks
In order to efficiently construct half-edge data structures you need an acceleration structure for the HE_vert (let's call it HE_vert_acc... but you can actually just do this in HE_vert directly) that saves all HE_edges that point to this HE_vert. Otherwise you get very bad complexity when trying to define the "HE_edge* pair" (which is the oppositely oriented adjacent half-edge), e.g. via brute-force comparison.
So, making a half-edge data structure for a single face can easily be done with the tying-the-knot method, because there are (probably) no pairs anyway. But if you add the complexity of the acceleration structure to decide on those pairs efficiently, then it becomes a bit more difficult, since you need to update the same HE_vert_acc across different faces, and then update the HE_edges to contain a valid pair. Those are actually multiple steps. How you would glue them all together via tying-the-knot is way more complex than constructing a circular doubly linked list and not really obvious.
Because of that... I wouldn't bother much with the question "how do I construct this data structure in idiomatic Haskell".
I think it's reasonable to use more imperative approaches here while trying to keep the API functional. I'd probably go for arrays and state-monads.
Not saying it isn't possible with tying-the-knot, but I haven't seen such an implementation yet. It is not an easy problem in my opinion.
EDIT: so I couldn't let go and implemented this, assuming the input is an .obj mesh file.
My approach is based on the method described at https://wiki.haskell.org/Tying_the_Knot#Migrated_from_the_old_wiki, specifically the one from Andrew Bromage, where he explains tying the knot for a DFA without knowing the knots at compile time.
Unfortunately, the half-edge data structure is even more complex, since it actually consists of 3 data structures.
So I started with what I actually want:
data HeVert a = HeVert {
    vcoord :: a        -- the coordinates of the vertex
  , emedge :: HeEdge a -- one of the half-edges emanating from the vertex
}

data HeFace a = HeFace {
    bordedge :: HeEdge a -- one of the half-edges bordering the face
}

data HeEdge a = HeEdge {
    startvert :: HeVert a         -- start-vertex of the half-edge
  , oppedge   :: Maybe (HeEdge a) -- oppositely oriented adjacent half-edge
  , edgeface  :: HeFace a         -- face the half-edge borders
  , nextedge  :: HeEdge a         -- next half-edge around the face
}
The problem is that we run into multiple issues here when constructing it efficiently, so for all these data structures we will use an "Indirect" one which basically just saves plain information given by the .obj mesh file.
So I came up with this:
data IndirectHeEdge = IndirectHeEdge {
    edgeindex  :: Int -- edge index
  , svindex    :: Int -- index of the start vertex
  , nvindex    :: Int -- index of the next vertex
  , indexf     :: Int -- index of the face
  , offsetedge :: Int -- offset to get the next edge
}

data IndirectHeVert = IndirectHeVert {
    emedgeindex :: Int   -- emanating edge index (starts at 1)
  , edgelist    :: [Int] -- indices of the edges that point to this vertex
}

data IndirectHeFace =
  IndirectHeFace (Int, [Int]) -- (face index, [vertex indices])
A few things are probably not intuitive and can be done better, e.g. the "offsetedge" thing.
See how I didn't save the actual vertices anywhere. This is just a lot of index stuff which sort of emulates the C pointers.
We will need "edgelist" to efficiently find the oppositely oriented adjacent half-edges later.
I don't go into detail how I fill these indirect data structures, because that is really specific to the .obj file format. I'll just give an example on how things convert.
Suppose we have the following mesh file:
v 50.0 50.0
v 250.0 50.0
v 50.0 250.0
v 250.0 250.0
v 50.0 500.0
v 250.0 500.0
f 1 2 4 3
f 3 4 6 5
The indirect faces will now look like this:
[IndirectHeFace (0,[1,2,4,3]),IndirectHeFace (1,[3,4,6,5])]
The indirect edges (record fields abbreviated after the first entry):
[IndirectHeEdge {edgeindex = 0, svindex = 1, nvindex = 2, indexf = 0, offsetedge = 1},
IndirectHeEdge {1, 2, 4, 0, 1},
IndirectHeEdge {2, 4, 3, 0, 1},
IndirectHeEdge {3, 3, 1, 0, -3},
IndirectHeEdge {0, 3, 4, 1, 1},
IndirectHeEdge {1, 4, 6, 1, 1},
IndirectHeEdge {2, 6, 5, 1, 1},
IndirectHeEdge {3, 5, 3, 1, -3}]
And the indirect vertices (abbreviated the same way):
[(1,IndirectHeVert {emedgeindex = 0, edgelist = [3]}),
(2,IndirectHeVert {1, [0]}),
(3,IndirectHeVert {4, [7,2]}),
(4,IndirectHeVert {5, [4,1]}),
(5,IndirectHeVert {7, [6]}),
(6,IndirectHeVert {6, [5]})]
Now the really interesting part is how we can turn these indirect data structures into the "direct" one we defined at the very beginning. This is a bit tricky, but is basically just index lookups and works because of laziness.
Here's the pseudocode (the actual implementation uses more than just lists and has additional overhead in order to make the function safe):
indirectToDirect :: [a] -- parsed vertices, e.g. 2d points (Double, Double)
                 -> [IndirectHeEdge]
                 -> [IndirectHeFace]
                 -> [IndirectHeVert]
                 -> HeEdge a
indirectToDirect points edges faces vertices
  = thisEdge (head edges)
  where
    thisEdge edge
      = HeEdge (thisVert (vertices !! svindex edge) $ svindex edge)
               (thisOppEdge (svindex edge) $ indexf edge)
               (thisFace $ faces !! indexf edge)
               (thisEdge $ edges !! (edgeindex edge + offsetedge edge))
    thisFace face = HeFace $ thisEdge (edges !! (head . snd $ face))
    thisVert vertice coordindex
      = HeVert (points !! (coordindex - 1))
               (thisEdge $ edges !! (emedgeindex vertice - 1))
    thisOppEdge startverticeindex faceindex
      = thisEdge
        <$> (headMay
              . filter ((/=) faceindex . indexf)
              . fmap (edges !!)
              . edgelist -- getter
              $ vertices !! startverticeindex)
Mind that we cannot really make this return a "Maybe (HeEdge a)" because it would try to evaluate the whole thing (which is infinite) in order to know which constructor to use.
I had to add a NoVert/NoEdge/NoFace constructor for each of them to avoid the "Maybe".
Another downside is that this heavily depends on the input and isn't really a generic library thing. I'm also not entirely sure whether it will re-evaluate already-visited edges (which would still be very cheap).
Using Data.IntMap.Lazy seems to increase performance (at least for the list of IndirectHeVert). Data.Vector didn't really do much for me here.
There's no need for using the state monad anywhere, unless you want to use Arrays or Vectors.
Obviously the problem is that a half-edge references the next and the opposite half-edge (the other references are no problem). You can "break the cycle" e.g. by referencing not directly to other half-edges, but to reference just an ID (e.g. simple Ints). In order to look up a half-edge by ID, you can store them in a Data.Map. Of course this approach requires some book-keeping in order to avoid a big hairy mess, but it is the easiest way I can think of.
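To illustrate (a sketch in Scala rather than Haskell, since the shape is the same either way; all names are made up): half-edges hold plain Int IDs, and maps resolve them.

// Half-edges reference each other by ID only, so the values themselves
// contain no cycles.
final case class HalfEdge(
  startVert: Int,         // ID of the start vertex
  oppEdge:   Option[Int], // ID of the oppositely oriented half-edge, if any
  face:      Int,         // ID of the face this half-edge borders
  nextEdge:  Int          // ID of the next half-edge around the face
)

final case class Mesh(
  edges: Map[Int, HalfEdge],
  verts: Map[Int, (Double, Double)], // vertex ID -> coordinates
  faces: Map[Int, Int]               // face ID -> one bordering half-edge ID
) {
  // Following a reference is one lookup; updating an edge is one map
  // update, with no cyclic structure to rebuild.
  def next(edgeId: Int): HalfEdge = edges(edges(edgeId).nextEdge)
}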
Stupid me, I'm not thinking lazy enough. The solution above works for strict functional languages, but is unnecessary for Haskell.
If the task in question allows you to build the half-edge structure once and then query it many times, then the lazy tying-the-knot approach is the way to go, as was pointed out in the comments and the other answer.
However, if you want to update your structure, then a purely functional interface might prove cumbersome to work with. Also, you need to consider the O(..) requirements for update functions. It might turn out that you need a mutable internal representation (probably with a pure API on top) after all.
I've run into a helpful application of polymorphism for this sort of thing. You'll commonly want both a static, non-infinite version for serialization and a knot-tied version for the internal representation.
If you make one version that's polymorphic, then you can update that particular field using record syntax:
data Foo edge_type_t = Depot {
    edge_type :: edge_type_t,
    idxI, idxE, idxF, idxL :: !Int
} deriving (Show, Read)

loadFoo edgetypes d = d { edge_type = edgetypes ! edge_type d }
unloadFoo d = d { edge_type = edgetype_id $ edge_type d }
There is however one major caveat: you cannot make a Foo (Foo (Foo (...))) type this way, because Haskell would need to resolve the type recursively without end. :(
