Slow performance of html/template in Go lang, any workaround?

I'm stress testing (with loader.io) this type of code in Go: it creates an array of 100 items along with some other basic variables and passes them all to the template:
package main

import (
    "html/template"
    "net/http"
)

var templates map[string]*template.Template

// Load templates on program initialisation
func init() {
    if templates == nil {
        templates = make(map[string]*template.Template)
    }
    templates["index.html"] = template.Must(template.ParseFiles("index.html"))
}

func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id             int
        Title, Content string
    }
    var Posts [100]Post
    // Fill posts
    for i := 0; i < 100; i++ {
        Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }
    type Page struct {
        Title, Subtitle string
        Posts           [100]Post
    }
    var p Page
    p.Title = "Index Page of My Super Blog"
    p.Subtitle = "A blog about everything"
    p.Posts = Posts
    tmpl := templates["index.html"]
    tmpl.ExecuteTemplate(w, "index.html", p)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8888", nil)
}
My Loader test runs 5k concurrent connections/s for 1 minute. The problem is that just a few seconds after the test starts, average latency climbs to almost 10s; as a result only about 5k responses succeed, and the test stops because it reaches the 50% error rate (timeouts).
On the same machine, PHP gives 50k+.
I understand that it's not a Go performance issue, but probably something related to html/template. Go can easily manage hard enough calculations a lot faster than anything like PHP of course, but when it comes to rendering data into the template, why is it so awful?
Are there any workarounds, or am I just doing it wrong (I'm new to Go)?
P.S. Even with a single item it's exactly the same: 5-6k responses, then the test stops after a huge number of timeouts. But that's probably because the posts array stays the same length.
My template code (index.html):
{{ .Title }}
{{ .Subtitle }}
{{ range .Posts }}
{{ .Title }}
{{ .Content }}
{{ end }}
Here's the profiling result from github.com/pkg/profile:
root@Test:~# go tool pprof app /tmp/profile311243501/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof! For help, type 'help'.
(pprof) top10
Total: 2054 samples
97 4.7% 4.7% 726 35.3% reflect.Value.call
89 4.3% 9.1% 278 13.5% runtime.mallocgc
85 4.1% 13.2% 86 4.2% syscall.Syscall
66 3.2% 16.4% 75 3.7% runtime.MSpan_Sweep
58 2.8% 19.2% 1842 89.7% text/template.(*state).walk
54 2.6% 21.9% 928 45.2% text/template.(*state).evalCall
51 2.5% 24.3% 53 2.6% settype
47 2.3% 26.6% 47 2.3% runtime.stringiter2
44 2.1% 28.8% 149 7.3% runtime.makeslice
40 1.9% 30.7% 223 10.9% text/template.(*state).evalField
These are profiling results after refining the code (as suggested in the answer by icza):
root@Test:~# go tool pprof app /tmp/profile501566907/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof! For help, type 'help'.
(pprof) top10
Total: 2811 samples
137 4.9% 4.9% 442 15.7% runtime.mallocgc
126 4.5% 9.4% 999 35.5% reflect.Value.call
113 4.0% 13.4% 115 4.1% syscall.Syscall
110 3.9% 17.3% 122 4.3% runtime.MSpan_Sweep
102 3.6% 20.9% 2561 91.1% text/template.(*state).walk
74 2.6% 23.6% 337 12.0% text/template.(*state).evalField
68 2.4% 26.0% 72 2.6% settype
66 2.3% 28.3% 1279 45.5% text/template.(*state).evalCall
65 2.3% 30.6% 226 8.0% runtime.makeslice
57 2.0% 32.7% 57 2.0% runtime.stringiter2
(pprof)

There are two main reasons why the equivalent application using html/template is slower than the PHP variant.
First of all, html/template provides more functionality than PHP's templating. The main difference is that html/template automatically escapes variables using the correct escaping rules (HTML, JS, CSS, etc.) depending on their location in the resulting HTML output (which I think is quite cool!).
Secondly, the html/template rendering code relies heavily on reflection and variadic methods, which are simply not as fast as statically compiled code.
Under the hood the following template
{{ .Title }}
{{ .Subtitle }}
{{ range .Posts }}
{{ .Title }}
{{ .Content }}
{{ end }}
is converted to something like
{{ .Title | html_template_htmlescaper }}
{{ .Subtitle | html_template_htmlescaper }}
{{ range .Posts }}
{{ .Title | html_template_htmlescaper }}
{{ .Content | html_template_htmlescaper }}
{{ end }}
Calling html_template_htmlescaper using reflection in a loop kills performance.
Having said all that, this micro-benchmark of html/template shouldn't be used to decide whether to use Go or not. Once you add database access to the request handler, I suspect that template rendering time will hardly be noticeable.
Also, I am pretty sure that over time both Go reflection and the html/template package will become faster.
If in a real application you find that html/template is a bottleneck, it is still possible to switch to text/template and supply it with already-escaped data.
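A minimal sketch of that fallback, assuming you escape the data once when you build it (the names here are illustrative, not from the question):

package main

import (
    "html"
    "os"
    "text/template"
)

type Post struct {
    Title, Content string // stored already HTML-escaped
}

// text/template performs no contextual escaping pass, so the escaping
// cost is paid once when the data is built instead of on every render.
var tmpl = template.Must(template.New("posts").Parse(
    "{{range .}}{{.Title}}\n{{.Content}}\n{{end}}"))

func main() {
    posts := []Post{{
        Title:   html.EscapeString("Sample <Title>"),
        Content: html.EscapeString("Lorem & Ipsum"),
    }}
    tmpl.Execute(os.Stdout, posts)
}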

You are working with arrays and structs, both of which are non-pointer types; nor are they descriptors (like slices, maps, or channels). So passing them around always creates a copy of the value, and assigning an array value to a variable copies all of its elements. This is slow and gives a huge amount of work to the GC.
Also you are utilizing only 1 CPU core. To utilize more, add this to your main() function:
func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}
Edit: this was only the case prior to Go 1.5. Since Go 1.5, runtime.NumCPU() is the default.
Your code
var Posts [100]Post
An array with space for 100 Posts is allocated.
Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
You create a Post value with a composite literal, then this value is copied into the ith element of the array (a redundant copy).
var p Page
This creates a variable of type Page. It is a struct, so memory is allocated for it, which includes the field Posts [100]Post, so another array of 100 elements is allocated.
p.Posts = Posts
This copies 100 elements (a hundred structs)!
tmpl.ExecuteTemplate(w, "index.html", p)
This creates a copy of p (which is of type Page), so another array of 100 posts is created and elements from p are copied, then it is passed to ExecuteTemplate().
And since Page.Posts is an array, when it is iterated over in the template engine a copy will most likely be made of each element (I haven't verified this).
Proposal for a more efficient code
Some things to speed up your code:
func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id             int
        Title, Content string
    }
    Posts := make([]*Post, 100) // A slice of pointers
    // Fill posts
    for i := range Posts {
        // Initialize pointers: just copies the address of the created struct value
        Posts[i] = &Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }
    type Page struct {
        Title, Subtitle string
        Posts           []*Post // "Just" a slice type (it's a descriptor)
    }
    // Create a page, only the Posts slice descriptor is copied
    p := Page{"Index Page of My Super Blog", "A blog about everything", Posts}
    tmpl := templates["index.html"]
    // Only pass the address of p.
    // Although since Page.Posts is now just a slice, passing by value would also be OK.
    tmpl.ExecuteTemplate(w, "index.html", &p)
}
Please test this code and report back your results.

PHP isn't answering 5000 requests concurrently. The requests are being multiplexed to a handful of processes for serial execution, which makes more efficient use of both CPU and memory. 5000 concurrent connections may make sense for a message broker or something similar that does limited processing of small pieces of data, but it makes little sense for a service doing real I/O or processing. If your Go app is not behind a proxy of some type that limits the number of concurrent requests, you will want to do so yourself, perhaps at the beginning of your handler, using a buffered channel or a wait group, a la https://blakemesdag.com/blog/2014/11/12/limiting-go-concurrency/.
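For illustration, here is a minimal sketch of the buffered-channel approach (the limit of 64 and the stub handler are placeholders, not recommendations):

package main

import "net/http"

// sem acts as a counting semaphore: its capacity caps the number of
// requests allowed to execute the handler at once.
var sem = make(chan struct{}, 64)

func limit(h http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        sem <- struct{}{}        // acquire a slot, blocking if all are taken
        defer func() { <-sem }() // release the slot when the handler returns
        h(w, r)
    }
}

func handler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok")) // stand-in for the real handler
}

func main() {
    http.HandleFunc("/", limit(handler))
    http.ListenAndServe(":8888", nil)
}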

html/template is slow because it uses reflection, which isn't optimized for speed yet.
Try quicktemplate as a workaround for the slow html/template. Currently quicktemplate is more than 20x faster than html/template, according to the benchmark in its source code.

There is a template benchmark you can check at goTemplateBenchmark. Personally, I think Hero is the one that best combines efficiency and readability.

Typed strings (such as template.HTML) are your friend if you would like to speed up html/template. It is sometimes useful to pre-render repeating HTML fragments.
Assuming that most of the time is spent rendering those 100 Post objects, it could make sense to pre-render them.
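A hedged sketch of that idea, assuming the Post type from the question (helper names are mine): render the posts once into a template.HTML value, which html/template then inserts into later pages without re-escaping.

package main

import (
    "bytes"
    "fmt"
    "html/template"
)

type Post struct {
    Id             int
    Title, Content string
}

var postTmpl = template.Must(template.New("post").Parse(
    "{{.Title}}\n{{.Content}}\n"))

// prerender executes the per-post template once; the resulting
// template.HTML is treated as already-safe HTML and is not escaped
// again when embedded in another html/template.
func prerender(posts []Post) (template.HTML, error) {
    var buf bytes.Buffer
    for _, p := range posts {
        if err := postTmpl.Execute(&buf, p); err != nil {
            return "", err
        }
    }
    return template.HTML(buf.String()), nil
}

func main() {
    h, err := prerender([]Post{{1, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}})
    if err != nil {
        panic(err)
    }
    fmt.Println(h)
}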

Related

What does runtime.memclrNoHeapPointers do?

I am profiling a library and see that a function called runtime.memclrNoHeapPointers is taking up about 0.82 seconds of the CPU time.
What does this function do, and what does this tell me about the memory usage of the library I am profiling?
The output, for completeness:
File: gribtest.test
Type: cpu
Time: Feb 12, 2019 at 8:27pm (CET)
Duration: 5.21s, Total samples = 5.11s (98.15%)
Showing nodes accounting for 4.94s, 96.67% of 5.11s total
Dropped 61 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
1.60s 31.31% 31.31% 1.81s 35.42% github.com/nilsmagnus/grib/griblib.(*BitReader).readBit
1.08s 21.14% 52.45% 2.89s 56.56% github.com/nilsmagnus/grib/griblib.(*BitReader).readUint
0.37s 7.24% 59.69% 0.82s 16.05% encoding/binary.(*decoder).value
0.35s 6.85% 66.54% 0.35s 6.85% runtime.memclrNoHeapPointers
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr)

memclrNoHeapPointers clears n bytes starting at ptr.

Usually you should use typedmemclr. memclrNoHeapPointers should be
used only when the caller knows that *ptr contains no heap pointers
because either:

1. *ptr is initialized memory and its type is pointer-free.

2. *ptr is uninitialized memory (e.g., memory that's being reused
   for a new allocation) and hence contains only "junk".

The implementations live in memclr_*.s, and the Go declaration is marked //go:noescape.
See https://github.com/golang/go/blob/9e277f7d554455e16ba3762541c53e9bfc1d8188/src/runtime/stubs.go#L78
This is part of the garbage collector. You can see the declaration in the runtime source linked above.
The specifics of what it does are CPU dependent; see the various memclr_*.s files in the runtime for the implementations.
This does seem like a long time in the GC, but it's hard to say much about the library's memory usage from just the data you've shown, I think.
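As a hedged illustration of where this shows up: allocating pointer-free memory whose backing spans are being reused forces the runtime to zero it, and that zeroing is what a CPU profile attributes to runtime.memclrNoHeapPointers.

package main

// Repeatedly allocating large pointer-free buffers makes the runtime
// clear reused memory; profiling a loop like this would typically show
// time under runtime.memclrNoHeapPointers.
func main() {
    var sink []byte
    for i := 0; i < 1000; i++ {
        sink = make([]byte, 1<<20) // pointer-free, cleared on allocation
    }
    _ = sink
}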

pprof (for golang) doesn't show details for my package

I've been trying to profile my go application (evm-specification-miner) with pprof, but the output is not really useful:
(pprof) top5
108.59mins of 109.29mins total (99.36%)
Dropped 607 nodes (cum <= 0.55mins)
Showing top 5 nodes out of 103 (cum >= 0.98mins)
flat flat% sum% cum cum%
107.83mins 98.66% 98.66% 108.64mins 99.40% [evm-specification-miner]
0.36mins 0.33% 98.99% 6mins 5.49% net.dialIP
0.30mins 0.28% 99.27% 4.18mins 3.83% net.listenIP
0.06mins 0.052% 99.32% 34.66mins 31.71% github.com/urfave/cli.BoolFlag.ApplyWithError
0.04mins 0.036% 99.36% 0.98mins 0.9% net.probeIPv6Stack
And here is the cumulative output:
(pprof) top5 --cum
1.80hrs of 1.82hrs total (98.66%)
Dropped 607 nodes (cum <= 0.01hrs)
Showing top 5 nodes out of 103 (cum >= 1.53hrs)
flat flat% sum% cum cum%
1.80hrs 98.66% 98.66% 1.81hrs 99.40% [evm-specification-miner]
0 0% 98.66% 1.53hrs 83.93% net.IP.matchAddrFamily
0 0% 98.66% 1.53hrs 83.92% net.(*UDPConn).WriteToUDP
0 0% 98.66% 1.53hrs 83.90% net.sockaddrToUDP
0 0% 98.66% 1.53hrs 83.89% net.(*UDPConn).readMsg
As you can see, most of the time is spent in evm-specification-miner (which is the name of my Go application), but I've been unable to obtain more details or even to understand what the square brackets mean (there is a question with a similar problem, but it didn't receive any answer).
Here are the build and pprof commands:
go install evm-specification-miner
go tool pprof evm-specification-miner cpuprof
I've even tried the debug flags -gcflags "-N -l" (as noted here: https://golang.org/doc/gdb#Introduction), to no avail.
The profiling is done with calls to pprof.StartCPUProfile() and pprof.StopCPUProfile() as is explained by this blog post: https://blog.golang.org/profiling-go-programs:
func StartProfiling(cpuprof string) error {
    f, err := os.Create(cpuprof)
    if err != nil {
        return err
    }
    return pprof.StartCPUProfile(f)
}

func StopProfiling() error {
    pprof.StopCPUProfile()
    return nil
}
StartProfiling is called at the beginning of main(), and StopProfiling when a signal (interrupt or kill) is received (or if the program terminates normally). This profile was obtained after an interruption.
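For reference, a minimal sketch of the wiring described above (the exact signal set is my assumption; SIGKILL itself cannot be trapped):

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    if err := StartProfiling("cpuprof"); err != nil {
        log.Fatal(err)
    }
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, os.Interrupt, syscall.SIGTERM) // SIGKILL cannot be caught
    go func() {
        <-sig
        StopProfiling() // flush and close the CPU profile before exiting
        os.Exit(0)
    }()
    // ... run the miner ...
}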
Looks like updating to Go 1.9rc1 fixed it.
I no longer have [evm-specification-miner] in the profile (for the record, the top functions do not even come from my own package, so it is even weirder that they did not appear before).

Why is locking in Go much slower than in Java? Lots of time spent in Mutex.Lock() and Mutex.Unlock()

I've written a small Go library (go-patan) that collects a running min/max/avg/stddev of certain variables. I compared it to an equivalent Java implementation (patan), and to my surprise the Java implementation is much faster. I would like to understand why.
The library basically consists of a simple data store with a lock that serializes reads and writes. This is a snippet of the code:
type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution
    lock      *sync.Mutex
}

func (store *Store) addSample(key string, value int64) {
    store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
    store.addToStore(store.durations, key, value)
}

func (store *Store) addToCounter(key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    store.counters[key] = store.counters[key] + value
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
}
I've benchmarked the Go and Java implementations (go-benchmark-gist, java-benchmark-gist) and Java wins by far, but I don't understand why:
Go Results:
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
100 threads with 200000 items took 17900 millis
Java Results:
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis
10 threads with 200000 items takes 311 millis
100 threads with 200000 items takes 3067 millis
I've profiled the program with Go's pprof and generated a call graph. It shows that the program basically spends all its time in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
The Top20 calls according to the profiler:
(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
flat flat% sum% cum cum%
8900ms 12.04% 12.04% 8900ms 12.04% runtime.futex
7270ms 9.84% 21.88% 7270ms 9.84% runtime/internal/atomic.Xchg
7020ms 9.50% 31.38% 7020ms 9.50% runtime.procyield
4560ms 6.17% 37.56% 4560ms 6.17% sync/atomic.CompareAndSwapUint32
4400ms 5.95% 43.51% 4400ms 5.95% runtime/internal/atomic.Xadd
4210ms 5.70% 49.21% 22040ms 29.83% runtime.lock
3650ms 4.94% 54.15% 3650ms 4.94% runtime/internal/atomic.Cas
3260ms 4.41% 58.56% 3260ms 4.41% runtime/internal/atomic.Load
2220ms 3.00% 61.56% 22810ms 30.87% sync.(*Mutex).Lock
1870ms 2.53% 64.10% 1870ms 2.53% runtime.osyield
1540ms 2.08% 66.18% 16740ms 22.66% runtime.findrunnable
1430ms 1.94% 68.11% 1430ms 1.94% runtime.freedefer
1400ms 1.89% 70.01% 1400ms 1.89% sync/atomic.AddUint32
1250ms 1.69% 71.70% 1250ms 1.69% github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
1240ms 1.68% 73.38% 3140ms 4.25% runtime.deferreturn
1070ms 1.45% 74.83% 6520ms 8.82% runtime.systemstack
1010ms 1.37% 76.19% 1010ms 1.37% runtime.newdefer
1000ms 1.35% 77.55% 1000ms 1.35% runtime.mapaccess1_faststr
950ms 1.29% 78.83% 15660ms 21.19% runtime.semacquire
860ms 1.16% 80.00% 50220ms 67.97% main.Benchmrk.func1
Can someone explain why locking in Go seems to be so much slower than in Java, what am I doing wrong? I've also written a channel based implementation in Go, but that is even slower.
It is best to avoid defer in tiny functions that need high performance, since it is expensive. In most other cases there is no need to avoid it, since the cost of defer is outweighed by the code around it.
I'd also recommend declaring the field as lock sync.Mutex instead of using a pointer. The pointer creates a slight amount of extra work for the programmer (initialisation, nil bugs) and a slight amount of extra work for the garbage collector.
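A sketch of both suggestions applied to addToCounter from the question (behaviour unchanged, just less overhead; the constructor is mine):

package main

import "sync"

type Store struct {
    lock     sync.Mutex // value field: nothing to initialise, no nil-pointer bugs
    counters map[string]int64
}

func NewStore() *Store {
    return &Store{counters: make(map[string]int64)}
}

func (store *Store) addToCounter(key string, value int64) {
    store.lock.Lock()
    store.counters[key] += value
    store.lock.Unlock() // explicit unlock skips the defer overhead in this hot, tiny function
}

func main() {
    s := NewStore()
    s.addToCounter("requests", 1)
}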
I've also posted this question on the golang-nuts group. The reply from Jesper Louis Andersen explains quite well that Java uses synchronization optimization techniques such as lock escape analysis/lock elision and lock coarsening.
The Java JIT might be taking the lock and allowing multiple updates at once within the lock to increase performance. I ran the Java benchmark with -Djava.compiler=NONE, which changed the performance dramatically, but disabling the JIT is not a fair comparison.
I assume that many of these optimization techniques have less impact in a production environment.

Map access bottleneck in Golang

I am using Golang to implement naive Bayesian classification for a dataset with over 30000 possible tags. I have built the model and I am in the classification phase. I am working on classifying 1000 records, and this is taking up to 5 minutes. I have profiled the code with pprof; the top10 results are shown below:
Total: 28896 samples
16408 56.8% 56.8% 24129 83.5% runtime.mapaccess1_faststr
4977 17.2% 74.0% 4977 17.2% runtime.aeshashbody
2552 8.8% 82.8% 2552 8.8% runtime.memeqbody
1468 5.1% 87.9% 28112 97.3% main.(*Classifier).calcProbs
861 3.0% 90.9% 861 3.0% math.Log
435 1.5% 92.4% 435 1.5% runtime.markspan
267 0.9% 93.3% 302 1.0% MHeap_AllocLocked
187 0.6% 94.0% 187 0.6% runtime.aeshashstr
183 0.6% 94.6% 1137 3.9% runtime.mallocgc
127 0.4% 95.0% 988 3.4% math.log10
Surprisingly, the map access seems to be the bottleneck. Has anyone experienced this? What other key-value data structure could be used to avoid this bottleneck? All the map access is done in the following piece of code:
func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue {
    probs := &BoundedPriorityQueue{}
    heap.Init(probs)
    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb {
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms {
            termProb := clsProbs[term]
            if termProb != 0 {
                condProb += math.Log10(termProb)
            } else {
                condProb += -6 // math.Log10(0.000001)
            }
        }
        entry := &Item{
            value:    class,
            priority: condProb,
        }
        heap.Push(probs, entry)
    }
    return probs
}
The maps are nb.classProb, which is map[string]float64, while nb.model is a nested map of type
map[string]map[string]float64
In addition to what @tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.
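For illustration, a minimal interning sketch (the names are mine, not from the package mentioned above): map each distinct term to a dense integer id once, then index slices by id instead of hashing strings in the hot loop.

package main

// Interner assigns each distinct string a small integer id, so hot loops
// can index slices by id instead of hashing strings in a map.
type Interner struct {
    ids map[string]int
}

func NewInterner() *Interner {
    return &Interner{ids: make(map[string]int)}
}

// Intern returns the id for s, assigning a new one on first sight.
func (in *Interner) Intern(s string) int {
    id, ok := in.ids[s]
    if !ok {
        id = len(in.ids)
        in.ids[s] = id
    }
    return id
}

func main() {
    in := NewInterner()
    println(in.Intern("cat"), in.Intern("dog"), in.Intern("cat")) // 0 1 0
}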
Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.
It's not possible to tell for sure from the code you've included, but I expect you've got a limited number of classes. What you might do is number them, and store the term-wise class probabilities like this:
map[string][NumClasses]float64
(ie: for each term, store an array of class-wise probabilities [or perhaps their logs already precomputed], and NumClasses is the number of different classes you have).
Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.
This'll reduce the number of map lookups by a factor of NumClasses. This may need more memory if your data is extremely sparse.
The next optimisation is to use multiple goroutines to do the calculations, assuming you've more than one CPU core available.
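A hedged sketch of the restructured layout with hypothetical names (NumClasses must be a compile-time constant for the array type, and the stored values are the precomputed logs suggested above):

package main

const NumClasses = 3 // example value; the answer assumes a small, fixed number

// termLogProbs[term] holds math.Log10 of the term's probability for every class.
var termLogProbs = map[string][NumClasses]float64{}

// calcProbs iterates terms in the outer loop (one map lookup each) and
// classes in the inner loop (cheap array indexing).
func calcProbs(terms []string, classProb [NumClasses]float64) [NumClasses]float64 {
    condProb := classProb
    for _, term := range terms {
        probs := termLogProbs[term] // zero array when the term is unknown
        for class, p := range probs {
            if p != 0 {
                condProb[class] += p
            } else {
                condProb[class] += -6 // math.Log10(0.000001), as in the question
            }
        }
    }
    return condProb
}

func main() {}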

Performance Drop whilst testing Go Language

I've been testing out a simple web server written in Go with http_load. When running the test for 1 second with 100 parallel connections, I've seen 16k requests completed. However, running the test for 10 seconds results in a similar number of requests being completed, at around 1/10th of the rate of the 1-second test.
Additionally, if I run several 1 second test close together, the first test will complete 16k requests and the subsequent tests will complete just 100-200 requests.
package main

import "net/http"

func main() {
    bytes := make([]byte, 1024)
    for i := 0; i < len(bytes); i++ {
        bytes[i] = 100
    }
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write(bytes)
    })
    http.ListenAndServe(":8000", nil)
}
I'm wondering whether there is any reason why the performance would hit a ceiling at this number of requests, and whether there is something I have missed in the implementation of the web server above.
This is probably a limitation on your own system rather than the go server. The same kind of degradation happens if you try hitting something like google with http_load:
$> http_load -parallel 100 -seconds 10 google.txt
1000 fetches, 100 max parallel, 219000 bytes, in 10.0006 seconds
219 mean bytes/connection
99.9944 fetches/sec, 21898.8 bytes/sec
msecs/connect: 410.409 mean, 4584.36 max, 16.949 min
msecs/first-response: 279.595 mean, 3647.74 max, 35.539 min
HTTP response codes:
code 301 -- 1000
$> http_load -parallel 100 -seconds 50 google.txt
729 fetches, 100 max parallel, 159213 bytes, in 50.0008 seconds
218.399 mean bytes/connection
14.5798 fetches/sec, 3184.21 bytes/sec
msecs/connect: 1588.57 mean, 36192.6 max, 17.944 min
msecs/first-response: 237.376 mean, 33816.7 max, 33.092 min
2 bad byte counts
HTTP response codes:
code 301 -- 727
$> http_load -parallel 100 -seconds 100 google.txt
1091 fetches, 100 max parallel, 223161 bytes, in 100 seconds
204.547 mean bytes/connection
10.91 fetches/sec, 2231.61 bytes/sec
msecs/connect: 1652.16 mean, 35860.4 max, 17.825 min
msecs/first-response: 319.259 mean, 35482.1 max, 31.892 min
HTTP response codes:
code 301 -- 1019
As you can see, the rate goes down quite a bit the longer you hit google (google.txt contains the single URL "http://google.com"). This is most likely due to limitations in your system (the maximum number of open connections your programs can have, memory, CPU, etc...).
