Algorithms for compression of set tries - algorithm

I have a collection of sets that I'd like to place in a trie.
Normal tries are made of strings of elements - that is, the order of the elements is important. Sets lack a defined order, so there's the possibility of greater compression.
For example, given the strings "abc", "bc", and "c", I'd create the trie:
(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)
But given the sets { 'a', 'b', 'c' }, { 'b', 'c' }, { 'c' }, I could create the above trie, or any of these eleven:
(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
-> ('c',2) -> ('a',1)
(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
-> ('b',1) -> ('c',1)
-> ('c',1)
(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
-> ('c',2) -> ('a',1)
(*,3) -> ('b',2) -> ('a',1) -> ('c',1)
-> ('c',1)
-> ('c',1)
(*,3) -> ('b',1) -> ('a',1) -> ('c',1)
-> ('c',2) -> ('b',1)
(*,3) -> ('b',2) -> ('c',2) -> ('a',1)
-> ('c',1)
(*,3) -> ('b',1) -> ('c',1) -> ('a',1)
-> ('c',2) -> ('b',1)
(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
-> ('b',1) -> ('c',1)
(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
-> ('b',1)
(*,3) -> ('c',2) -> ('b',1) -> ('a',1)
-> ('b',1) -> ('c',1)
(*,3) -> ('c',3) -> ('b',2) -> ('a',1)
So there's obviously room for compression (7 nodes to 4).
I suspect defining a local order at each node dependent on the relative frequency of its children would do it, but I'm not certain, and it might be overly expensive.
So before I hit the whiteboard, and start cracking away at my own compression algorithm, is there an existing one? How expensive is it? Is it a bulk process, or can it be done per-insert/delete?

I think you should sort each set by item frequency; as you suspect, this gives a good heuristic. FP-growth (frequent pattern mining) uses the same approach to represent item sets compactly.
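As a rough illustration of the frequency-ordering heuristic, here is a minimal Python sketch. The function names `build_frequency_trie` and `count_nodes` and the dict-based node layout are assumptions made for this example, not part of FP-growth itself.

```python
from collections import Counter

def build_frequency_trie(sets):
    """Insert each set into a trie after sorting its elements by global frequency."""
    freq = Counter(e for s in sets for e in s)
    root = {"count": 0, "end": 0, "children": {}}
    for s in sets:
        # Most frequent element first; ties are broken by the element itself.
        ordered = sorted(s, key=lambda e: (-freq[e], e))
        node = root
        node["count"] += 1
        for e in ordered:
            node = node["children"].setdefault(
                e, {"count": 0, "end": 0, "children": {}})
            node["count"] += 1
        node["end"] += 1  # a complete set ends at this node
    return root

def count_nodes(node):
    """Count all nodes in the trie, root included."""
    return 1 + sum(count_nodes(c) for c in node["children"].values())

trie = build_frequency_trie([{"a", "b", "c"}, {"b", "c"}, {"c"}])
print(count_nodes(trie))  # 4 nodes, root included
```

For the three example sets this produces the single chain c, b, a under the root. Being a heuristic, it is not guaranteed to give the smallest possible trie for every input.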

Basically, you should construct a dependency graph. If element y occurs only if x occurs, draw an edge from x to y (in case of equality, just order lexicographically). The resulting graph is a DAG. Now do a topological sort of this graph to get the order of the elements, with a twist: whenever you can choose between two or more elements, choose the one with the higher number of occurrences.
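One way to read the construction above, as a sketch in Python. The name `dependency_order` is hypothetical, and the code assumes elements are mutually comparable (for example, all strings) so that the lexicographic tie-break works.

```python
from collections import Counter

def dependency_order(sets):
    """Order elements by a greedy topological sort of the dependency graph."""
    elements = sorted({e for s in sets for e in s})
    freq = Counter(e for s in sets for e in s)
    # For each element, the indices of the sets that contain it.
    containing = {e: frozenset(i for i, s in enumerate(sets) if e in s)
                  for e in elements}
    # preds[y] holds every x with an edge x -> y, i.e. y occurs only where x occurs.
    preds = {y: set() for y in elements}
    for x in elements:
        for y in elements:
            if x == y:
                continue
            if containing[y] < containing[x]:        # proper subset
                preds[y].add(x)
            elif containing[y] == containing[x] and x < y:
                preds[y].add(x)                      # equality: lexicographic order
    order, remaining = [], set(elements)
    while remaining:
        # Elements whose predecessors have all been placed already.
        ready = [e for e in remaining if not (preds[e] & remaining)]
        # The twist: among the candidates, take the most frequent one.
        best = min(ready, key=lambda e: (-freq[e], e))
        order.append(best)
        remaining.remove(best)
    return order

print(dependency_order([{"a", "b", "c"}, {"b", "c"}, {"c"}]))  # ['c', 'b', 'a']
```

Each set would then be inserted into the trie with its elements sorted according to the returned order.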

My suspicion is that the maximum compression would keep the most common elements at the top (as in your last example).
The compression algorithm would start with the whole collection of sets and the top node, and recursively create a node for each subset containing the most common element:
Compress(collection, node):
  while NOT collection.isEmpty?
    e = collection.find_most_common_element   // NIL if only empty sets remain
    c2 = collection.find_all_containing(e)    // for NIL: the empty sets
    collection = collection - c2
    if e == NIL   // empty sets only
      node[END_OF_SET] = true
    else
      c2.each{|set| set.remove(e)}
      node[e] = new Node
      Compress(c2, node[e])
    end
  end
The resulting tree would have a special End-of-set marker to signify that a complete set ends at that node. For your example it would be
*->(C,3)->(B,2)->(A,1)->EOS
               ->EOS
        ->EOS
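The pseudocode above could be sketched in Python roughly as follows. The nested-dict representation, the `END_OF_SET` sentinel object, and the tie-breaking rule (smallest element among equally common ones) are assumptions made for this example.

```python
from collections import Counter

END_OF_SET = object()  # hypothetical sentinel key marking "a set ends here"

def compress(collection):
    """Build a nested-dict trie from a list of sets, most common element first."""
    node = {}
    remaining = [set(s) for s in collection]
    while remaining:
        counts = Counter(e for s in remaining for e in s)
        if not counts:
            # Only empty sets remain: a complete set ends at this node.
            node[END_OF_SET] = True
            break
        # Most common element; ties are broken by the element itself.
        e = min(counts, key=lambda x: (-counts[x], x))
        with_e = [s - {e} for s in remaining if e in s]
        remaining = [s for s in remaining if e not in s]
        node[e] = compress(with_e)
    return node

trie = compress([{"a", "b", "c"}, {"b", "c"}, {"c"}])
```

For the example sets, `trie` has the single top-level key `"c"`, with end-of-set markers under `c`, `b`, and `a`, matching the tree drawn above.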
Deleting a set is easy: just remove its EOS marker (and any parent nodes that become empty). You could insert on the fly (at each node, descend to the matching element with the most children until there are no matches, then use the algorithm above), but keeping the trie maximally compressed would be tricky. When element B gained more children than element A, you'd have to move all sets containing both A and B into the B node, which would involve a full search of all of A's children. But if you don't keep it compressed, then inclusion searches are no longer linear in the set size.

Related

how to sort by levels in graphviz

I want to separate the rectangles like this
or (3 items per line)
You were quite close. Most of the changes are just for (my) clarity. The key pieces are rank=same and dir=back:
// https://stackoverflow.com/questions/72449201/how-to-sort-by-levels-in-graphviz
digraph {
{rank=same; a -> b -> c}
{rank=same; edge [dir=back] f -> e -> d }
{rank=same; g -> h -> i}
{rank=same; edge [dir=back] l -> k -> j }
c -> d
f -> g
i->j
}
Giving:

Constructing proximity matrix in Wolfram Mathematica

I have the following dataset:
dataset =
Dataset[{<|"City" -> "Belgrade" , "Population" -> 1500000|>, <|
"City" -> "Ljubljana", "Population" -> 300000|>, <|
"City" -> "Sarajevo", "Population" -> 275000|>, <|
"City" -> "Zagreb", "Population" -> 800000|>, <|
"City" -> "Skopje", "Population" -> 530000|>, <|
"City" -> "Podgorica", "Population" -> 180000|>}]
I want to construct a proximity matrix from it, using the Euclidean distance (function in Wolfram Mathematica: EuclideanDistance) between the city populations. I made a few attempts, but none of them worked. Does anyone have an idea?
Thank you in advance!
Try
pop=Normal[dataset[All,"Population"]];
MatrixPlot[Outer[Sqrt[(#1-#2)^2]&,pop,pop]]
which I think implements EuclideanDistance
Adding FrameTicks and using EuclideanDistance.
pop = Normal[dataset[All, "Population"]];
cities = Normal[dataset[All, "City"]];
ticks = List @@@ Thread[Range@Length@cities -> (Style[#, 14, Black] &) /@ cities]
MatrixPlot[Outer[EuclideanDistance[##] &, pop, pop],
FrameTicks -> {ticks, ticks, ticks, ticks},
Mesh -> True,
MeshStyle -> Black,
ImageSize -> 600]
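For scalar values, the Euclidean distance reduces to the absolute difference, which is why Sqrt[(#1-#2)^2] and EuclideanDistance give the same matrix here. As a language-neutral check of what the Outer call computes, here is a small Python sketch; the variable names are chosen for this example only.

```python
populations = {
    "Belgrade": 1500000,
    "Ljubljana": 300000,
    "Sarajevo": 275000,
    "Zagreb": 800000,
    "Skopje": 530000,
    "Podgorica": 180000,
}
cities = list(populations)

# matrix[i][j] is the distance between the populations of cities[i] and cities[j].
matrix = [[abs(populations[a] - populations[b]) for b in cities]
          for a in cities]

print(matrix[0][1])  # Belgrade vs. Ljubljana: 1200000
```

The resulting matrix is symmetric with zeros on the diagonal, as a proximity matrix should be.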

Topological sort, but with a certain kind of grouping

It seems this must be a common scheduling problem, but I don't see the solution or even what to call the problem. It's like a topological sort, but different....
Given some dependencies, say
A -> B -> D -- that is, A must come before B, which must come before D
A -> C -> D
there might be multiple solutions to a topological sort:
A, B, C, D
and A, C, B, D
are both solutions.
I need an algorithm that returns this:
(A) -> (B,C) -> (D)
That is, do A, then all of B and C, then you can do D. All the ambiguities or don't-cares are grouped.
I think algorithms such as those at Topological Sort with Grouping won't correctly handle cases like the following.
A -> B -> C -> D -> E
A - - - > M - - - > E
For this, the algorithm should return
(A) -> (B, C, D, M) -> (E)
This
A -> B -> D -> F
A -> C -> E -> F
should return
(A) -> (B, D, C, E) -> (F)
While this
A -> B -> D -> F
A -> C -> E -> F
C -> D
B -> E
should return
(A) -> (B, C) -> (D, E) -> (F)
And this
A -> B -> D -> F
A -> C -> E -> F
A -> L -> M -> F
C -> D
C -> M
B -> E
B -> M
L -> D
L -> E
should return
(A) -> (B, C, L) -> (D, E, M) -> (F)
Is there a name and a conventional solution to this problem? (And do the algorithms posted at Topological Sort with Grouping correctly handle this?)
Edit to answer requests for more examples:
A->B->C
A->C
should return
(A) -> (B) -> (C). That would be a straight topological sort.
And
A->B->D
A->C->D
A->D
should return
(A) -> (B, C) -> (D)
And
A->B->C
A->C
A->D
should return
(A) -> (B,C,D)
Let G be the transitive closure of the graph. Let G' be the undirected graph that results from removing the orientation from G and taking the complement. The connected components of G' are the sets you are looking for.
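A minimal Python sketch of this construction, assuming the input is an acyclic list of (before, after) pairs; the function name `grouped_levels` is made up for this example. Two nodes are adjacent in G' exactly when neither can reach the other, so the groups are the connected components of that "incomparable" relation.

```python
def grouped_levels(edges):
    """Group the nodes of a DAG into ordered sets of mutually unordered tasks."""
    nodes = sorted({n for edge in edges for n in edge})
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)

    # Transitive closure: reach[u] is every node reachable from u.
    reach = {}
    def descendants(u):
        if u not in reach:
            found = set()
            for v in succ[u]:
                found |= {v} | descendants(v)
            reach[u] = found
        return reach[u]
    for n in nodes:
        descendants(n)

    def incomparable(u, v):
        # Edge of G': neither node reaches the other in the closure.
        return v not in reach[u] and u not in reach[v]

    # Connected components of G', found by depth-first search.
    groups, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        group, stack = [], [start]
        while stack:
            u = stack.pop()
            group.append(u)
            for v in nodes:
                if v not in seen and incomparable(u, v):
                    seen.add(v)
                    stack.append(v)
        groups.append(sorted(group))

    # Every member of an earlier group reaches every member of each later group,
    # so the size of a representative's reachable set orders the groups.
    groups.sort(key=lambda g: -len(reach[g[0]]))
    return groups

print(grouped_levels([("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]))
# [['A'], ['B', 'C'], ['D']]
```

On the examples from the question this yields (A) -> (B, C, D, M) -> (E) for the chain with the long A -> M -> E edge, and (A) -> (B, C, D) for the last example.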

How to draw the classic state diagram using Mathematica?

Is it possible and practical for Mathematica to draw something like this (being created by Graphviz):
This is the best that I can get (but the shape and style are not satisfying):
Code:
GraphPlot[{{A -> C, "go"}, {C -> B, "gone"}, {C -> D,
"went"}, {C -> C, "loop"}}, VertexLabeling -> True,
DirectedEdges -> True]
You can do something like this using VertexRenderingFunction.
GraphPlot[{{A -> C, "go"}, {C -> B, "gone"}, {C -> D, "went"}, {C -> C, "loop"}},
DirectedEdges -> True,
VertexRenderingFunction -> ({{White, Disk[#, 0.15]},
AbsoluteThickness[2], Circle[#, 0.15],
If[MatchQ[#2, A | B], Circle[#, 0.12], {}], Text[#2, #]} &)]
Method Updated February 2015
To preserve the ability to interactively rearrange the graph with the drawing tools (double click) one must keep the vertex graphics inside of GraphicsComplex, with indexes rather than coordinates. I believe one could do this from VertexRenderingFunction using an incrementing variable, but it seems easier and possibly more robust to do it with post-processing. This works in versions 7 and 10 of Mathematica, presumably 8 and 9 as well:
GraphPlot[
{{A -> C, "go"}, {C -> B, "gone"}, {C -> D, "went"}, {C -> C, "loop"}},
DirectedEdges -> True
] /.
Tooltip[Point[n_Integer], label_] :>
{{White, Disk[n, 0.15]},
Black, AbsoluteThickness[2], Circle[n, 0.15],
If[MatchQ[label, A | B], Circle[n, 0.12], {}], Text[label, n]}
There's no need for interactive placement to get your vertices at the desired location, as Mr.Wizard suggests in his answer. You can use VertexCoordinateRules for that:
GraphPlot[{{A -> C, "go"}, {C -> B, "gone"}, {C -> D, "went"}, {C -> C, "loop"}},
DirectedEdges -> True,
VertexRenderingFunction ->
({{White, Disk[#, 0.15]}, AbsoluteThickness[2], Circle[#, 0.15],
If[MatchQ[#2, A | B], Circle[#, 0.12], {}], Text[#2, #]} &),
VertexCoordinateRules ->
{A -> {0, 0}, C -> {0.75, 0},B -> {1.5, 0.25}, D -> {1.5, -0.25}}
]

Processing KMZ in Mathematica

I'm stuck on a conversion.
I have a KMZ file with some coordinates. I read the file like this:
m=Import["~/Desktop/locations.kmz","Data"]
I get something like this:
{{LayerName->Point Features,
Geometry->{
Point[{-120.934,49.3321,372}],
Point[{-120.935,49.3275,375}],
Point[{-120.935,49.323,371}]},
Labels->{},LabeledData->{},ExtendedData->{},
PlacemarkNames->{1,2,3},
Overlays->{},NetworkLinks->{}
}}
I want to extract the {x,y,z} from each of the points and also the placemark names {1,2,3} associated with the points. Even if I can just get the points out of Geometry->{} that would be fine because I can extract them into a list with List @@@, but I'm lost at the fundamental part where I can't extract the Geometry "Rule".
Thanks for any help,
Ron
While Leonid's answer is correct, you will likely find that it does not work with your code. The reason is that the output of your Import command contains strings, such as "LayerName", rather than symbols, such as LayerName. I've uploaded a KML file to my webspace so we can try this using an actual Import command. Try something like the following:
in = Import["http://facstaff.unca.edu/mcmcclur/my.kml", "Data"];
pointList = "Geometry" /.
Cases[in, Verbatim[Rule]["Geometry", _], Infinity];
pointList /. Point[stuff_] -> stuff
Again, note that "Geometry" is a string. In fact, the contents of in look like so (in InputForm):
{{"LayerName" -> "Waypoints",
"Geometry" -> {Point[{-82.5, 32.5, 0}]},
"Labels" -> {}, "LabeledData" -> {},
"ExtendedData" -> {}, "PlacemarkNames" -> {"asheville"},
"Overlays" -> {}, "NetworkLinks" -> {}}}
Context: KML refers to Keyhole Markup Language. Keyhole was a company that developed tools that ultimately became Google Earth, after they were acquired by Google. KMZ is a zipped version of KML.
A simplification to Leonid and Mark's answers that I believe can be made safely is to remove the fancy Verbatim construct. That is:
Leonid's first operation can be written:
Join @@ Cases[expr, (Geometry -> x_) :> (x /. Point -> Sequence), Infinity]
Leonid's second operation:
Join @@ Cases[expr, (PlacemarkNames -> x_) :> x, Infinity]
I had trouble importing Mark's data, but from what I can guess, one could write:
pointList = Cases[in, ("Geometry" -> x_) :> x, Infinity, 1]
I'll let the votes on this answer tell me if I am correct.
Given your expression
expr = {{LayerName -> Point Features,
Geometry -> {
Point[{-120.934, 49.3321, 372}],
Point[{-120.935, 49.3275, 375}],
Point[{-120.935, 49.323, 371}]},
Labels -> {}, LabeledData -> {}, ExtendedData -> {},
PlacemarkNames -> {1, 2, 3}, Overlays -> {}, NetworkLinks -> {}}}
This will extract the points:
In[121]:=
Flatten[Cases[expr, Verbatim[Rule][Geometry, x_] :> (x /. Point -> Sequence),
Infinity], 1]
Out[121]= {{-120.934, 49.3321, 372}, {-120.935, 49.3275,375}, {-120.935, 49.323, 371}}
And this will extract the placemarks:
In[124]:= Flatten[Cases[expr, Verbatim[Rule][PlacemarkNames, x_] :> x, Infinity], 1]
Out[124]= {1, 2, 3}
Here is a more elegant method exploiting that we are looking for rules, that will extract both:
In[127]:=
{Geometry, PlacemarkNames} /.Cases[expr, _Rule, Infinity] /. Point -> Sequence
Out[127]=
{{{-120.934, 49.3321, 372}, {-120.935, 49.3275,375}, {-120.935, 49.323, 371}}, {1, 2, 3}}
How about Transpose[{"PlacemarkNames", "Geometry"} /. m[[1]]] ?
