Performance suggestions for trees and the GC - data-structures

In my attempt to create a CSS3 parser (https://github.com/tdewolff/css) for my minification library, I'm optimizing the tokenizer, parser and minifier. The tokenizer is pretty fast, but I believe the parser can be improved.
The parser processes the tokens and generates a tree of nodes representing the syntax. All nodes satisfy this interface:
type Node interface {
	Type() NodeType
	String() string
}
Example nodes (see https://github.com/tdewolff/css/blob/master/node.go):
// root
type NodeStylesheet struct {
	NodeType
	Nodes []Node
}

// example node
type NodeRuleset struct {
	NodeType
	SelGroups []*NodeSelectorGroup
	Decls     []*NodeDeclaration
}

// leaf (the only terminal leaf possible)
type NodeToken struct {
	NodeType
	TokenType
	Data string
}
Parsing a stylesheet (see https://github.com/tdewolff/css/blob/master/parse.go):
func (p *parser) parseStylesheet() *NodeStylesheet {
	n := NewStylesheet()
	for {
		p.skipWhitespace()
		if p.at(ErrorToken) {
			return n
		}
		if p.at(CDOToken) || p.at(CDCToken) {
			n.Nodes = append(n.Nodes, p.shift())
		} else if cn := p.parseAtRule(); cn != nil {
			n.Nodes = append(n.Nodes, cn)
		} else if cn := p.parseRuleset(); cn != nil {
			n.Nodes = append(n.Nodes, cn)
		} else if cn := p.parseDeclaration(); cn != nil {
			n.Nodes = append(n.Nodes, cn)
		} else if !p.at(ErrorToken) {
			n.Nodes = append(n.Nodes, p.shift())
		}
	}
}
Each node is allocated on the heap, and a significant amount of time is spent in the GC and related tasks. Could I reduce that?
Can I put all elements in a flat array, for instance? The elements of the tree are filled sequentially (i.e. it can be flattened). What techniques can I use to reduce (small) heap allocations?
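One technique along these lines, purely as an illustrative sketch using the node types above (the tokenArena type, newToken method and arenaBlockSize constant are invented, not part of the library): carve nodes out of large blocks so the GC tracks a few big allocations instead of millions of small ones.

type tokenArena struct {
	block []NodeToken // current block, grown in fixed-size chunks
}

const arenaBlockSize = 1024

// newToken returns a *NodeToken carved out of the current block,
// starting a fresh block when the current one is full. Pointers into
// earlier blocks stay valid because a block is never reallocated.
func (a *tokenArena) newToken(nt NodeType, tt TokenType, data string) *NodeToken {
	if len(a.block) == cap(a.block) {
		a.block = make([]NodeToken, 0, arenaBlockSize)
	}
	a.block = append(a.block, NodeToken{NodeType: nt, TokenType: tt, Data: data})
	return &a.block[len(a.block)-1]
}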
Update
The minifier is actually not really slow: bootstrap (134kB) is minified in 28ms (Node.js implementations take at least 45ms and produce larger files, see http://goalsmashers.github.io/css-minification-benchmark/). But it would be great if I could squeeze out even more!
I know that some time is spent on []byte -> string conversion, but the []byte from the tokenizer needs to be copied anyway because its memory can be reused at any tokenizer.Next() call. Since it needs to be copied anyway, I figured converting to string was better because it made much of the code simpler (e.g. checking for equality).
I can make a tokenizer variant that keeps the whole file in memory, which is fine for the parser because the parser doesn't stream anyway.
Update 2
I loaded the whole file into memory for the parser and replaced all string with []byte; it's now 10% faster! Bootstrap.css now takes 23ms.
I don't think flattening the tree is worth the hassle, and it would remove some flexibility (the user being able to modify the tree).

Related

How to test if a generic type can fit a more restrictive interface - golang

I am building an all-purpose data structure that I intend to use in various contexts and with various bits of data.
I am currently attempting to write a matcher that will look into my structure and return all nodes containing the given data. My problem is that, since I need my structure to be as generic as possible, my data has to be of a generic type constrained by any, which doesn't allow equality comparisons.
I have built a "descendant type" (there's probably a correct term; I'm self-taught on this) that has the more rigorous comparable constraint.
I want to see if I can convert from the more general one to the more specific one (even if I have to catch an error and return it to the user). I know that I don't strictly need to, but it makes the code more understandable down the line if I do it like that.
Here's a bit of code to explain my question:
type DataStructureGeneralCase[T any] struct {
	/*
		my data structure, which is too long to make for a good example,
		so I'm using a slice instead
	*/
	data []T
}

type DataStructureSpecific[T comparable] DataStructureGeneralCase[T]

// this works because any contains comparable
func (ds *DataStructureSpecific[T]) GetMatching(content T) int {
	/* The real function returns my custom Node type, but let's be brief */
	for idx, item := range ds.data {
		if item == content {
			return idx
		}
	}
	return -1
}

func (dg *DataStructureGeneralCase[T]) TryMatching(content T) (int, error) {
	if ds, ok := (*dg).(DataStructureGeneral); ok {
		// Does not work because dg is not interface, which I understand
	} else {
		return -1, fmt.Errorf("Could not convert because of non-comparable content")
	}
}
My question can be summarized to "How can I do my conversion ?".
Cast to the empty interface first:
castedDg := interface{}(dg)
if ds, ok := castedDg.(DataStructureGeneralCase[T]); ok {
The reasoning behind not permitting type assertions directly on values whose type is a type parameter is explained in https://go.googlesource.com/proposal/+/refs/heads/master/design/43651-type-parameters.md#why-not-permit-type-assertions-on-values-whose-type-is-a-type-parameter
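As a self-contained illustration of the same idea (not the asker's types; describe is an invented example function), converting the value to the empty interface first makes a type switch or assertion legal inside generic code:

package main

import "fmt"

// describe cannot type-switch on v directly because v's type is a type
// parameter, but it can after converting v to the empty interface.
func describe[T any](v T) string {
	switch x := any(v).(type) {
	case int:
		return fmt.Sprintf("int: %d", x)
	case string:
		return fmt.Sprintf("string: %q", x)
	default:
		return fmt.Sprintf("other: %v", x)
	}
}

func main() {
	fmt.Println(describe(42))      // int: 42
	fmt.Println(describe("hello")) // string: "hello"
}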

How to ensure compile-time safety in custom data structures

I am writing some data structures to get my feet wet and learn about the Go language and am struggling with Go's lack of generics.
In my implementations I have chosen to force each user to implement an interface, so my structures can refer to these objects abstractly, but I don't love my solution because, as you will see, this is not verified at compile time.
Comparer Interface
Each object that is held in a container must implement a Compare function with the following signature (onerous if all you want to hold are raw types):
type Comparer interface {
	Compare(Comparer) int
}
You could then have various elements that implement the interface like float64 or a custom struct:
float64
type number float64

func (n1 number) Compare(comparer Comparer) int {
	n2, _ := comparer.(number)
	if n1 > n2 {
		return 1
	} else if n1 < n2 {
		return -1
	} else {
		return 0
	}
}
Person
type Person struct {
	Age int
}

func (p1 Person) Compare(comparer Comparer) int {
	p2, _ := comparer.(Person)
	if p1.Age > p2.Age {
		return 1
	} else if p1.Age < p2.Age {
		return -1
	} else {
		return 0
	}
}
And now I can compare some of these things:
func main() {
	fmt.Println(number(2).Compare(number(4)))   // -1
	fmt.Println(Person{26}.Compare(Person{28})) // -1
	fmt.Println(Person{26}.Compare(number(28))) // 1
}
The problem here is that I should not be able to compare a Person and a number. I realize that I can check the type at runtime but I would like to find either a) a compile-time way to verify the type or b) a different method to implement data structures generically.
Questions
I know that one can do almost everything one might need with the built in data structures ... but how would someone make their own data structures without generics or runtime type checking?
Since interface implementation in Go appears to use duck typing, how does Go enforce types at compile time?
I mean, there's nothing unsafe about that code... there just isn't compile-time safety. For example, in your method below, the first line does a type assertion on comparer; if it's not a number and you hadn't discarded the second value on the LHS with _, the boolean would be false and you could act accordingly. Or you could use the single-value form of the assertion, and a panic would occur, leaving it up to the caller to handle it (which would be appropriate, since they're the one calling the method with the wrong argument; it would be like getting an InvalidOperationException in C#).
func (n1 number) Compare(comparer Comparer) int {
	n2, _ := comparer.(number)
	if n1 > n2 {
		return 1
	} else if n1 < n2 {
		return -1
	} else {
		return 0
	}
}
The difference between this and a language like C# is purely in generics, which allow you to do these kinds of things with more compile time safety (because you're not able to call the method incorrectly). That being said, there was a time before C# had generics and many languages before that which didn't feature them at all. These operations are no more unsafe than the casts you do routinely even in languages that do have generics.
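(As an aside from a later vantage point: with Go 1.18+ type parameters, the comparison can be made compile-time safe. A rough sketch, not from the original discussion; the Comparer[T] interface and SortedSlice type are illustrative:)

// Comparer is a generic comparison interface: a type states which type
// it can be compared against, so mixing Person and number no longer compiles.
type Comparer[T any] interface {
	Compare(T) int
}

type number float64

func (n1 number) Compare(n2 number) int {
	switch {
	case n1 > n2:
		return 1
	case n1 < n2:
		return -1
	default:
		return 0
	}
}

// A container can then require that its elements be comparable to each other.
type SortedSlice[T Comparer[T]] struct {
	items []T
}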

Implementing a Merkle-tree data structure in Go

I'm currently attempting to implement a Merkle-tree data structure in Go. Basically, my end goal is to store a small set of structured data (10MB max) and allow this "database" to be easily synchronised with other nodes distributed over the network (see related).
I've implemented this reasonably effectively in Node, as there are no type checks. Herein lies my problem with Go: I'd like to make use of Go's compile-time type checks, though I also want one library that works with any provided tree.
In short, I'd like to use structs as merkle nodes and I'd like to have one Merkle.Update() method which is embedded in all types. I'm trying to avoid writing an Update() for every struct (though I'm aware this might be the only/best way).
My idea was to use embedded types:
//library
type Merkle struct {
Initialised bool
Container interface{} //in example this references foo
Fields []reflect.Type
//... other merkle state
}
//Merkle methods... Update()... etc...
//userland
type Foo struct {
Merkle
A int
B bool
C string
D map[string]*Bazz
E []*Bar
}
type Bazz struct {
Merkle
S int
T int
U int
}
type Bar struct {
Merkle
X int
Y int
Z int
}
In this example, Foo will be the root, which will contain Bazzs and Bars. This relationship could be inferred by reflecting on the types. The problem is the usage:
foo := &Foo{
	A: 42,
	B: true,
	C: "foo",
	D: map[string]*Bazz{
		"b1": &Bazz{},
		"b2": &Bazz{},
	},
	E: []*Bar{
		&Bar{},
		&Bar{},
		&Bar{},
	},
}
merkle.Init(foo)
foo.Hash // Initial hash => abc...
foo.A = 35
foo.E = append(foo.E, &Bar{})
foo.Update()
foo.Hash // Updated hash => def...
I think we need to merkle.Init(foo) since foo.Init() would actually be foo.Merkle.Init() and would not be able to reflect on foo. The uninitialised Bars and Bazzs could be detected and initialised by the parent foo.Update(). Some reflection is acceptable as correctness is more important than performance at the moment.
Another problem is, when we Update() a node, all struct fields (child nodes) would need to be Update()d as well (rehashed) since we aren't sure what was changed. We could do foo.SetInt("A", 35) to implement an auto-update, though then we lose compile time type-checks.
Would this be considered idiomatic Go? If not, how could this be improved? Can anyone think of an alternative way to store a dataset in memory (for fast reads) with concise dataset comparison (for efficient delta transfers over the network)?
Edit: And also a meta-question: Where is the best place to ask this kind of question, StackOverflow, Reddit or go-nuts? Originally posted on reddit with no answer :(
Some goals seem like:
Hash anything -- make it easy to use by hashing lots of things out of the box
Cache hashes -- make updates just rehash what they need to
Be idiomatic -- fit in well among other Go code
I think you can attack hashing anything roughly the way that serialization tools like the built-in encoding/gob or encoding/json do, which is three-pronged: use a special method if the type implements it (for JSON that's MarshalJSON), use a type switch for basic types, and fall back to a nasty default case using reflection. Here's an API sketch that provides a helper for hash caching and lets types either implement Hash or not:
package merkle

type HashVal uint64

const MissingHash HashVal = 0

// Hasher provides a custom hash implementation for a type. Not
// everything needs to implement it, but doing so can speed
// updates.
type Hasher interface {
	Hash() HashVal
}

// HashCacher is the interface for items that cache a hash value.
// Normally implemented by embedding HashCache.
type HashCacher interface {
	CachedHash() *HashVal
}

// HashCache implements HashCacher; it's meant to be embedded in your
// structs to make updating hash trees more efficient.
type HashCache struct {
	h HashVal
}

// CachedHash implements HashCacher.
func (h *HashCache) CachedHash() *HashVal {
	return &h.h
}

// Hash returns something's hash, using a cached hash or Hash() method if
// available.
func Hash(i interface{}) HashVal {
	if hashCacher, ok := i.(HashCacher); ok {
		if cached := *hashCacher.CachedHash(); cached != MissingHash {
			return cached
		}
	}
	switch i := i.(type) {
	case Hasher:
		return i.Hash()
	case uint64:
		return HashVal(i * 8675309) // or, you know, use a real hash
	case []byte:
		// CRC the bytes, say
		return 0xdeadbeef
	default:
		// terrible slow recursive case using reflection
		// like: iterate fields using reflect, then hash each
	}
	// instead of panic()ing here, you could live a little
	// dangerously and declare that changes to unhashable
	// types don't invalidate the tree
	panic("unhashable type passed to Hash()")
}

// Item is a node in the Merkle tree, which must know how to find its
// parent Item (the root node should return nil) and should usually
// embed HashCache for efficient updates. To avoid using reflection,
// Items might benefit from being Hashers as well.
type Item interface {
	Parent() Item
	HashCacher
}

// Update updates the chain of items between i and the root, given the
// leaf node that may have been changed.
func Update(i Item) {
	for i != nil {
		cached := i.CachedHash()
		*cached = MissingHash // invalidate
		*cached = Hash(i)
		i = i.Parent()
	}
}
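To make the sketch concrete, a caller-side type might wire into it roughly like this (the Leaf type and its fields are invented for illustration, not part of the sketch above):

// Leaf embeds merkle.HashCache for caching and implements merkle.Item
// and merkle.Hasher so updates avoid the reflection fallback.
type Leaf struct {
	merkle.HashCache
	parent merkle.Item
	Value  uint64
}

func (l *Leaf) Parent() merkle.Item { return l.parent }

func (l *Leaf) Hash() merkle.HashVal { return merkle.Hash(l.Value) }

// After mutating a leaf, rehash the chain of nodes up to the root:
//   leaf.Value = 42
//   merkle.Update(leaf)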
Go doesn't have inheritance in the same way other languages do.
The "parent" can't modify items in the child; you'd have to implement Update on each struct, do your business in it, and then have it call the parent's Update.
func (b *Bar) Update() {
	b.Merkle.Update()
	// do stuff related to b and b.Merkle
	// stuff
}

func (f *Foo) Update() {
	f.Merkle.Update()
	for _, b := range f.E {
		b.Update()
	}
	// etc
}
I think you will have to re-implement your tree in a different way.
Also, please provide a testable case next time.
Have you seen https://github.com/xsleonard/go-merkle, which lets you create a binary Merkle tree? You could append a type byte to the end of your data to identify it.

Is there an efficient way of reclaiming over-capacity slices?

I have a large number of allocated slices (a few million) which I have appended to. I'm sure a large number of them are over capacity. I want to try and reduce memory usage.
My first attempt is to iterate over all of them, allocate a new slice of len(oldSlice) and copy the values over. Unfortunately this appears to increase memory usage (up to double) and the garbage collection is slow to reclaim the memory.
Is there a good general way to slim down memory usage for a large number of over-capacity slices?
Choosing the right strategy to allocate your buffers is hard without knowing the exact problem.
In general you can try to reuse your buffers:
type buffer struct{}

var buffers = make(chan *buffer, 1024)

func newBuffer() *buffer {
	select {
	case b := <-buffers:
		return b
	default:
		return &buffer{}
	}
}

func returnBuffer(b *buffer) {
	select {
	case buffers <- b:
	default:
	}
}
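A caller would then bracket its use of a buffer with the two helpers (usage sketch):

	b := newBuffer()
	// ... fill and use b ...
	returnBuffer(b)

Because both selects have a default case, the pool never blocks: it simply allocates when empty and drops buffers when full.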
The heuristic used in append may not be suitable for all applications. It's designed for use when you don't know the final length of the data you'll be storing. Instead of iterating over them later, I'd try to minimize the amount of extra capacity you're allocating as early as possible. Here's a simple example of one strategy, which is to use a buffer only while the length is not known, and to reuse that buffer:
type buffer struct {
	names []string
	// ... possibly other things
}

// assume this is called frequently and has lots and lots of names
func (b *buffer) readNames(lines bufio.Scanner) ([]string, error) {
	// Start from zero, so we can re-use capacity
	b.names = b.names[:0]
	for lines.Scan() {
		b.names = append(b.names, lines.Text())
	}
	// Figure out the error
	err := lines.Err()
	if err == io.EOF {
		err = nil
	}
	// Allocate a minimal slice
	out := make([]string, len(b.names))
	copy(out, b.names)
	return out, err
}
Of course, you'll need to modify this if you need something that's safe for concurrent use; for that I'd recommend using a buffered channel as a leaky bucket for storing your buffers.
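As an aside, the standard library's sync.Pool can also serve as a concurrency-safe, leaky pool for such buffers; a minimal sketch (the namePool variable is an invented example):

package main

import (
	"fmt"
	"sync"
)

// namePool holds reusable string buffers; idle entries may be dropped
// by the garbage collector, which is fine for scratch space.
var namePool = sync.Pool{
	New: func() interface{} { return new([]string) },
}

func main() {
	bufp := namePool.Get().(*[]string)
	buf := (*bufp)[:0] // reuse capacity, start with length zero
	buf = append(buf, "alice", "bob")
	fmt.Println(len(buf))
	*bufp = buf
	namePool.Put(bufp) // hand the (possibly grown) buffer back
}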

Is there an easy way to iterate over a map in order?

This is a variant of the venerable "why is my map printing out of order" question.
I have a (fairly large) number of maps of the form map[MyKey]MyValue, where MyKey and MyValue are (usually) structs. I've got "less" functions for all the key types.
I need to iterate over the maps in order. (Specifically, the order defined by the less function on that type.) Right now, my code looks like this:
type PairKeyValue struct {
	MyKey
	MyValue
}

type PairKeyValueSlice []PairKeyValue

func (ps PairKeyValueSlice) Len() int {
	return len(ps)
}

func (ps PairKeyValueSlice) Swap(i, j int) {
	ps[i], ps[j] = ps[j], ps[i]
}

func (ps PairKeyValueSlice) Less(i, j int) bool {
	return LessKey(ps[i].MyKey, ps[j].MyKey)
}

func NewPairKeyValueSlice(m map[MyKey]MyValue) (ps PairKeyValueSlice) {
	ps = make(PairKeyValueSlice, len(m))
	i := 0
	for k, v := range m {
		ps[i] = PairKeyValue{k, v}
		i++
	}
	sort.Sort(ps)
	return
}
And then, any time I want an in-order iteration, it looks like:
var m map[MyKey]MyValue
m = GetMapFromSomewhereUseful()
for _, kv := range NewPairKeyValueSlice(m) {
	key := kv.MyKey
	value := kv.MyValue
	DoUsefulWork(key, value)
}
And this appears to largely work. The problem is that it is terribly verbose, particularly since the problem at hand really has very little to do with implementing ordered maps and is really about the useful work in the loop.
Also, I have several different key and value types. So every time I want to iterate over a map in order, I copy/paste all that code and find/replace MyKey with the new key and MyValue with the new value. Copy/paste of that magnitude is... "smelly". It has already become a hassle, since I've already made a few errors that I had to fix several times.
This technique also has the downside that it requires making a full copy of all the keys and values. That is undesirable, but I don't see a way around it. (I could reduce it to just the keys, but it doesn't change the primary nature of the problem.)
This question is attempting the same thing with strings. This question does it with strings and ints. This question implies that you need to use reflection and will have to have a switch statement that switches on every possible type, including all user-defined types.
But given how many people are puzzled that maps don't iterate deterministically, it seems there has to be a better solution to this problem. I'm from an OO background, so I'm probably missing something fundamental.
So, is there a reasonable way to iterate over a map in order?
Update: Editing the question to have more information about the source, in case there's a better solution than this.
I have a lot of things I need to group for output. Each grouping level is in a structure that looks like these:
type ObjTypeTree struct {
	Children   map[Type]*ObjKindTree
	TotalCount uint
}

type ObjKindTree struct {
	Children   map[Kind]*ObjAreaTree
	TotalCount uint
}

type ObjAreaTree struct {
	Children   map[Area]*ObjAreaTree
	TotalCount uint
	Objs       []*Obj
}
Then I'd iterate over the children in the ObjTypeTree to print the Type groupings, and for each of those I iterate over the ObjKindTree to print the Kind groupings. The iterations are done with methods on the types, and each type needs a slightly different way of printing its grouping level. Groups need to be printed in order, which causes the problem.
Don't use a map if key collating is required. Use a B-tree or any other/similar ordered container.
I second jnml's answer. But if you want something shorter than you have and are willing to give up compile time type safety, then my library might work for you. (It's built on top of reflect.) Here's a full working example:
package main

import (
	"fmt"
	"github.com/BurntSushi/ty/fun"
)

type OrderedKey struct {
	L1 rune
	L2 rune
}

func (k1 OrderedKey) Less(k2 OrderedKey) bool {
	return k1.L1 < k2.L1 || (k1.L1 == k2.L1 && k1.L2 < k2.L2)
}

func main() {
	m := map[OrderedKey]string{
		OrderedKey{'b', 'a'}: "second",
		OrderedKey{'x', 'y'}: "fourth",
		OrderedKey{'x', 'x'}: "third",
		OrderedKey{'a', 'b'}: "first",
		OrderedKey{'x', 'z'}: "fifth",
	}
	for k, v := range m {
		fmt.Printf("(%c, %c): %s\n", k.L1, k.L2, v)
	}
	fmt.Println("-----------------------------")
	keys := fun.QuickSort(OrderedKey.Less, fun.Keys(m)).([]OrderedKey)
	for _, k := range keys {
		v := m[k]
		fmt.Printf("(%c, %c): %s\n", k.L1, k.L2, v)
	}
}
Note that such a method will be slower, so if you need performance, this is not a good choice.
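(For what it's worth, with Go 1.18+ type parameters this boilerplate can be collapsed into one reusable helper without reflection; a rough sketch, with SortedKeys as an invented name:)

package main

import "sort"

// SortedKeys returns the keys of m ordered by less, so a caller can
// range over any map deterministically without per-type boilerplate.
func SortedKeys[K comparable, V any](m map[K]V, less func(a, b K) bool) []K {
	keys := make([]K, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return less(keys[i], keys[j]) })
	return keys
}

// Usage with the question's types (only keys are copied, not values):
//   for _, k := range SortedKeys(m, LessKey) {
//       DoUsefulWork(k, m[k])
//   }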

Resources