I am using Go to implement naive Bayes classification for a dataset with over 30000 possible tags. I have built the model and I am now in the classification phase. I am classifying 1000 records and this is taking up to 5 minutes. I have profiled the code with pprof; the top10 results are shown below:
Total: 28896 samples
16408 56.8% 56.8% 24129 83.5% runtime.mapaccess1_faststr
4977 17.2% 74.0% 4977 17.2% runtime.aeshashbody
2552 8.8% 82.8% 2552 8.8% runtime.memeqbody
1468 5.1% 87.9% 28112 97.3% main.(*Classifier).calcProbs
861 3.0% 90.9% 861 3.0% math.Log
435 1.5% 92.4% 435 1.5% runtime.markspan
267 0.9% 93.3% 302 1.0% MHeap_AllocLocked
187 0.6% 94.0% 187 0.6% runtime.aeshashstr
183 0.6% 94.6% 1137 3.9% runtime.mallocgc
127 0.4% 95.0% 988 3.4% math.log10
Surprisingly, the map access seems to be the bottleneck. Has anyone experienced this? What other key-value data structure could be used to avoid this bottleneck? All of the map access happens in the following piece of code:
func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue {
    probs := &BoundedPriorityQueue{}
    heap.Init(probs)
    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb {
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms {
            termProb := clsProbs[term]
            if termProb != 0 {
                condProb += math.Log10(termProb)
            } else {
                condProb += -6 // math.Log10(0.000001)
            }
        }
        entry := &Item{
            value:    class,
            priority: condProb,
        }
        heap.Push(probs, entry)
    }
    return probs
}
The maps are nb.classProb, which is a map[string]float64, while nb.model is a nested map of type
map[string]map[string]float64
In addition to what #tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.
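Independent of any particular package, the core idea is small enough to sketch directly (the interner type and its methods below are illustrative, not my package's API):

package main

import "fmt"

// interner assigns each distinct string a small integer ID exactly once.
// After interning, hot loops can index slices by ID instead of hashing
// strings in a map on every lookup.
type interner struct {
    ids map[string]int
}

func newInterner() *interner {
    return &interner{ids: make(map[string]int)}
}

// id returns the ID for s, assigning the next free ID on first sight.
func (in *interner) id(s string) int {
    if id, ok := in.ids[s]; ok {
        return id
    }
    id := len(in.ids)
    in.ids[s] = id
    return id
}

func main() {
    in := newInterner()
    fmt.Println(in.id("foo"), in.id("bar"), in.id("foo")) // prints: 0 1 0
}

If you intern the terms while building the model, nb.model can become a [][]float64 indexed by class ID and term ID, and the hot classification loop no longer hashes any strings.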
Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.
It's not possible to tell for sure from the code that you've included, but I expect you've got a limited number of classes. What you might do, is number them, and store the term-wise class probabilities like this:
map[string][NumClasses]float64
(i.e. for each term, store an array of class-wise probabilities [or perhaps their logs, already precomputed], where NumClasses is the number of different classes you have).
Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.
This'll reduce the number of map lookups by a factor of NumClasses. This may need more memory if your data is extremely sparse.
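A rough sketch of what that might look like (NumClasses, termLogProbs and calcLogProbs are placeholder names, and I'm assuming the probabilities are stored as log10 values precomputed at model-build time):

package classifier

import "strings"

const NumClasses = 64 // placeholder: use your real number of classes

// calcLogProbs does one map lookup per term; the inner loop is plain
// array iteration. termLogProbs holds precomputed log10 probabilities per
// term, classLogProb holds the per-class log10 priors.
func calcLogProbs(data string, termLogProbs map[string][NumClasses]float64, classLogProb [NumClasses]float64) [NumClasses]float64 {
    condProb := classLogProb // copies the priors into the result
    for _, term := range strings.Split(data, " ") {
        if logs, ok := termLogProbs[term]; ok { // one map lookup per term
            for c := range condProb {
                condProb[c] += logs[c]
            }
        } else {
            for c := range condProb {
                condProb[c] += -6 // math.Log10(0.000001), as in the original
            }
        }
    }
    return condProb
}

You can then push the per-class results into your BoundedPriorityQueue exactly as before.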
The next optimisation is to use multiple goroutines to do the calculations, assuming you've more than one CPU core available.
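For example, something along these lines (classifyAll and its worker setup are just a sketch around your existing calcProbs; it needs the runtime and sync imports, and on Go versions before 1.5 you would also call runtime.GOMAXPROCS(runtime.NumCPU())):

// classifyAll fans records out to one worker goroutine per CPU core.
// Each worker writes into its own slots of the results slice, so no
// extra locking is needed.
func (nb *Classifier) classifyAll(records []string) []*BoundedPriorityQueue {
    results := make([]*BoundedPriorityQueue, len(records))
    jobs := make(chan int)

    var wg sync.WaitGroup
    for w := 0; w < runtime.NumCPU(); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := range jobs {
                results[i] = nb.calcProbs(records[i])
            }
        }()
    }
    for i := range records {
        jobs <- i
    }
    close(jobs)
    wg.Wait()
    return results
}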
I am profiling a library and see that a function called runtime.memclrNoHeapPointers is taking up about 0.82 seconds of CPU time.
What does this function do, and what does this tell me about the memory usage of the library I am profiling?
The output, for completeness:
File: gribtest.test
Type: cpu
Time: Feb 12, 2019 at 8:27pm (CET)
Duration: 5.21s, Total samples = 5.11s (98.15%)
Showing nodes accounting for 4.94s, 96.67% of 5.11s total
Dropped 61 nodes (cum <= 0.03s)
flat flat% sum% cum cum%
1.60s 31.31% 31.31% 1.81s 35.42% github.com/nilsmagnus/grib/griblib.(*BitReader).readBit
1.08s 21.14% 52.45% 2.89s 56.56% github.com/nilsmagnus/grib/griblib.(*BitReader).readUint
0.37s 7.24% 59.69% 0.82s 16.05% encoding/binary.(*decoder).value
0.35s 6.85% 66.54% 0.35s 6.85% runtime.memclrNoHeapPointers
// memclrNoHeapPointers clears n bytes starting at ptr.
//
// Usually you should use typedmemclr. memclrNoHeapPointers should be
// used only when the caller knows that *ptr contains no heap pointers
// because either:
//
// 1. *ptr is initialized memory and its type is pointer-free.
//
// 2. *ptr is uninitialized memory (e.g., memory that's being reused
//    for a new allocation) and hence contains only "junk".
//
// in memclr_*.s
//go:noescape
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr)
See https://github.com/golang/go/blob/9e277f7d554455e16ba3762541c53e9bfc1d8188/src/runtime/stubs.go#L78
This is part of the garbage collector. You can see the declaration here.
The specifics of what it does are CPU dependent. See the various memclr_*.s files in the runtime for the implementation.
This does seem like a long time spent in the GC, but it's hard to say much about the memory usage of the library from just the data you've shown.
I've been trying to profile my Go application (evm-specification-miner) with pprof, but the output is not really useful:
(pprof) top5
108.59mins of 109.29mins total (99.36%)
Dropped 607 nodes (cum <= 0.55mins)
Showing top 5 nodes out of 103 (cum >= 0.98mins)
flat flat% sum% cum cum%
107.83mins 98.66% 98.66% 108.64mins 99.40% [evm-specification-miner]
0.36mins 0.33% 98.99% 6mins 5.49% net.dialIP
0.30mins 0.28% 99.27% 4.18mins 3.83% net.listenIP
0.06mins 0.052% 99.32% 34.66mins 31.71% github.com/urfave/cli.BoolFlag.ApplyWithError
0.04mins 0.036% 99.36% 0.98mins 0.9% net.probeIPv6Stack
And here is the cumulative output:
(pprof) top5 --cum
1.80hrs of 1.82hrs total (98.66%)
Dropped 607 nodes (cum <= 0.01hrs)
Showing top 5 nodes out of 103 (cum >= 1.53hrs)
flat flat% sum% cum cum%
1.80hrs 98.66% 98.66% 1.81hrs 99.40% [evm-specification-miner]
0 0% 98.66% 1.53hrs 83.93% net.IP.matchAddrFamily
0 0% 98.66% 1.53hrs 83.92% net.(*UDPConn).WriteToUDP
0 0% 98.66% 1.53hrs 83.90% net.sockaddrToUDP
0 0% 98.66% 1.53hrs 83.89% net.(*UDPConn).readMsg
As you can see, most of the time is spent in evm-specification-miner (which is the name of my Go application), but I've been unable to obtain more details or even understand what the square brackets mean (there is a question with a similar problem, but it didn't receive any answer).
Here are the build and pprof commands:
go install evm-specification-miner
go tool pprof evm-specification-miner cpuprof
I've even tried the debug flags -gcflags "-N -l" (as noted here: https://golang.org/doc/gdb#Introduction), to no avail.
The profiling is done with calls to pprof.StartCPUProfile() and pprof.StopCPUProfile() as is explained by this blog post: https://blog.golang.org/profiling-go-programs:
func StartProfiling(cpuprof string) error {
    f, err := os.Create(cpuprof)
    if err != nil {
        return err
    }
    return pprof.StartCPUProfile(f)
}

func StopProfiling() error {
    pprof.StopCPUProfile()
    return nil
}
StartProfiling is called at the beginning of main(), and StopProfiling when a signal (interrupt or kill) is received (or if the program terminates normally). This profile was obtained after an interruption.
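The wiring looks roughly like this (a simplified sketch of my main(); the real code differs in details, and it needs the log, os, os/signal and syscall imports):

func main() {
    if err := StartProfiling("cpuprof"); err != nil {
        log.Fatal(err)
    }
    defer StopProfiling() // normal termination

    // Flush the profile cleanly when interrupted or terminated.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
    go func() {
        <-sigs
        StopProfiling()
        os.Exit(1)
    }()

    // ... the actual mining work runs here ...
}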
Looks like updating to Go 1.9rc1 fixed it.
I no longer have [evm-specification-miner] in the profile (for the record, the top functions do not even come from my own package, so it is even weirder that they did not appear before).
I've written a small Go library (go-patan) that collects a running min/max/avg/stddev of certain variables. I compared it to an equivalent Java implementation (patan), and to my surprise the Java implementation is much faster. I would like to understand why.
The library basically consists of a simple data store with a lock that serializes reads and writes. This is a snippet of the code:
type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution

    lock *sync.Mutex
}

func (store *Store) addSample(key string, value int64) {
    store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
    store.addToStore(store.durations, key, value)
}

func (store *Store) addToCounter(key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    store.counters[key] = store.counters[key] + value
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
}
I've benchmarked the Go and Java implementations (go-benchmark-gist, java-benchmark-gist) and Java wins by far, but I don't understand why:
Go Results:
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
100 threads with 200000 items took 17900 millis
Java Results:
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis
10 threads with 200000 items takes 311 millis
100 threads with 200000 items takes 3067 millis
I've profiled the program with Go's pprof and generated a call graph. It shows that basically all the time is spent in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().
The Top20 calls according to the profiler:
(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
flat flat% sum% cum cum%
8900ms 12.04% 12.04% 8900ms 12.04% runtime.futex
7270ms 9.84% 21.88% 7270ms 9.84% runtime/internal/atomic.Xchg
7020ms 9.50% 31.38% 7020ms 9.50% runtime.procyield
4560ms 6.17% 37.56% 4560ms 6.17% sync/atomic.CompareAndSwapUint32
4400ms 5.95% 43.51% 4400ms 5.95% runtime/internal/atomic.Xadd
4210ms 5.70% 49.21% 22040ms 29.83% runtime.lock
3650ms 4.94% 54.15% 3650ms 4.94% runtime/internal/atomic.Cas
3260ms 4.41% 58.56% 3260ms 4.41% runtime/internal/atomic.Load
2220ms 3.00% 61.56% 22810ms 30.87% sync.(*Mutex).Lock
1870ms 2.53% 64.10% 1870ms 2.53% runtime.osyield
1540ms 2.08% 66.18% 16740ms 22.66% runtime.findrunnable
1430ms 1.94% 68.11% 1430ms 1.94% runtime.freedefer
1400ms 1.89% 70.01% 1400ms 1.89% sync/atomic.AddUint32
1250ms 1.69% 71.70% 1250ms 1.69% github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
1240ms 1.68% 73.38% 3140ms 4.25% runtime.deferreturn
1070ms 1.45% 74.83% 6520ms 8.82% runtime.systemstack
1010ms 1.37% 76.19% 1010ms 1.37% runtime.newdefer
1000ms 1.35% 77.55% 1000ms 1.35% runtime.mapaccess1_faststr
950ms 1.29% 78.83% 15660ms 21.19% runtime.semacquire
860ms 1.16% 80.00% 50220ms 67.97% main.Benchmrk.func1
Can someone explain why locking in Go seems to be so much slower than in Java? What am I doing wrong? I've also written a channel-based implementation in Go, but that is even slower.
It is best to avoid defer in tiny functions that need high performance since it is expensive. In most other cases, there is no need to avoid it since the cost of defer is outweighed by the code around it.
I'd also recommend declaring the field as lock sync.Mutex (a value) instead of using a pointer. The pointer creates a slight amount of extra work for the programmer (initialisation, nil bugs) and a slight amount of extra work for the garbage collector.
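Applied to the Store from your snippet, those two suggestions might look like this (a sketch only, assuming the Distribution type and NewDistribution from your library):

type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution

    lock sync.Mutex // a value, not a pointer: the zero value is ready to use
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
    store.lock.Unlock() // explicit unlock on the hot path instead of defer
}

The trade-off of dropping defer is that a panic inside addSample would leave the mutex locked; in a tiny, panic-free function like this that is usually an acceptable risk.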
I've also posted this question on the golang-nuts group. The reply from Jesper Louis Andersen explains quite well that Java uses synchronization optimization techniques such as lock escape analysis/lock elision and lock coarsening.
The Java JIT might be taking the lock once and allowing multiple updates within it to increase performance. I ran the Java benchmark with -Djava.compiler=NONE, which degraded performance dramatically, but that is not a fair comparison either.
I assume that many of these optimization techniques have less impact in a production environment.
I'm stress testing (with loader.io) this type of code in Go, which creates an array of 100 items along with some other basic variables and passes them all to the template:
package main

import (
    "html/template"
    "net/http"
)

var templates map[string]*template.Template

// Load templates on program initialisation
func init() {
    if templates == nil {
        templates = make(map[string]*template.Template)
    }
    templates["index.html"] = template.Must(template.ParseFiles("index.html"))
}

func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id             int
        Title, Content string
    }
    var Posts [100]Post
    // Fill posts
    for i := 0; i < 100; i++ {
        Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }

    type Page struct {
        Title, Subtitle string
        Posts           [100]Post
    }
    var p Page
    p.Title = "Index Page of My Super Blog"
    p.Subtitle = "A blog about everything"
    p.Posts = Posts

    tmpl := templates["index.html"]
    tmpl.ExecuteTemplate(w, "index.html", p)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8888", nil)
}
My test with Loader uses 5k concurrent connections/s over 1 minute. The problem is that just a few seconds after starting the test, the average latency climbs to almost 10s, I end up with only about 5k successful responses, and the test stops because it reaches the error rate of 50% (timeouts).
On the same machine, PHP gives 50k+.
I understand that it's not a Go performance issue, but probably something related to html/template. Go can of course handle heavy calculations a lot faster than anything like PHP, but when it comes to rendering the data into the template, why is it so slow?
Any workarounds, or am I just doing it wrong (I'm new to Go)?
P.S. Actually, even with 1 item it's exactly the same... 5-6k and stopping after a huge number of timeouts. But that's probably because the posts array stays the same length.
My template code (index.html):
{{ .Title }}
{{ .Subtitle }}
{{ range .Posts }}
{{ .Title }}
{{ .Content }}
{{ end }}
Here's the profiling result of github.com/pkg/profile:
root#Test:~# go tool pprof app /tmp/profile311243501/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof! For help, type 'help'.
(pprof) top10
Total: 2054 samples
97 4.7% 4.7% 726 35.3% reflect.Value.call
89 4.3% 9.1% 278 13.5% runtime.mallocgc
85 4.1% 13.2% 86 4.2% syscall.Syscall
66 3.2% 16.4% 75 3.7% runtime.MSpan_Sweep
58 2.8% 19.2% 1842 89.7% text/template.(*state).walk
54 2.6% 21.9% 928 45.2% text/template.(*state).evalCall
51 2.5% 24.3% 53 2.6% settype
47 2.3% 26.6% 47 2.3% runtime.stringiter2
44 2.1% 28.8% 149 7.3% runtime.makeslice
40 1.9% 30.7% 223 10.9% text/template.(*state).evalField
These are profiling results after refining the code (as suggested in the answer by icza):
root#Test:~# go tool pprof app /tmp/profile501566907/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof! For help, type 'help'.
(pprof) top10
Total: 2811 samples
137 4.9% 4.9% 442 15.7% runtime.mallocgc
126 4.5% 9.4% 999 35.5% reflect.Value.call
113 4.0% 13.4% 115 4.1% syscall.Syscall
110 3.9% 17.3% 122 4.3% runtime.MSpan_Sweep
102 3.6% 20.9% 2561 91.1% text/template.(*state).walk
74 2.6% 23.6% 337 12.0% text/template.(*state).evalField
68 2.4% 26.0% 72 2.6% settype
66 2.3% 28.3% 1279 45.5% text/template.(*state).evalCall
65 2.3% 30.6% 226 8.0% runtime.makeslice
57 2.0% 32.7% 57 2.0% runtime.stringiter2
(pprof)
There are two main reasons why the equivalent application using html/template is slower than the PHP variant.
First of all, html/template provides more functionality than the PHP version. The main difference is that html/template automatically escapes variables using the correct escaping rules (HTML, JS, CSS, etc.) depending on their location in the resulting HTML output (which I think is quite cool!).
Secondly, the html/template rendering code heavily uses reflection and methods with a variable number of arguments, and those are just not as fast as statically compiled code.
Under the hood the following template
{{ .Title }}
{{ .Subtitle }}
{{ range .Posts }}
{{ .Title }}
{{ .Content }}
{{ end }}
is converted to something like
{{ .Title | html_template_htmlescaper }}
{{ .Subtitle | html_template_htmlescaper }}
{{ range .Posts }}
{{ .Title | html_template_htmlescaper }}
{{ .Content | html_template_htmlescaper }}
{{ end }}
Calling html_template_htmlescaper using reflection in a loop kills performance.
Having said all that, this micro-benchmark of html/template shouldn't be used to decide whether or not to use Go. Once you add code that works with the database to the request handler, I suspect the template rendering time will hardly be noticeable.
Also, I am pretty sure that over time both Go reflection and the html/template package will become faster.
If in a real application you find that html/template is a bottleneck, it is still possible to switch to text/template and supply it with already-escaped data.
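A rough sketch of that approach, escaping the data yourself with html.EscapeString and handing it to text/template (illustrative only, not a drop-in replacement for your handler):

package main

import (
    "html"
    "net/http"
    "text/template"
)

type Post struct {
    Id             int
    Title, Content string
}

// Plain text/template: no automatic escaping, so the fields must be
// escaped by hand (here with html.EscapeString) before execution.
var indexTmpl = template.Must(template.New("index").Parse(
    "{{range .}}{{.Title}}\n{{.Content}}\n{{end}}"))

func handler(w http.ResponseWriter, r *http.Request) {
    posts := make([]Post, 100)
    for i := range posts {
        posts[i] = Post{
            Id:      i,
            Title:   html.EscapeString("Sample Title"),
            Content: html.EscapeString("Lorem Ipsum Dolor Sit Amet"),
        }
    }
    indexTmpl.Execute(w, posts)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8888", nil)
}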
You are working with arrays and structs, both of which are non-pointer types, and neither are descriptors (like slices, maps, or channels). So passing them around always creates a copy of the value, and assigning an array value to a variable copies all of its elements. This is slow and gives a huge amount of work to the GC.
Also you are utilizing only 1 CPU core. To utilize more, add this to your main() function:
func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}
Edit: this was only necessary prior to Go 1.5. Since Go 1.5, runtime.NumCPU() is the default value of GOMAXPROCS.
Your code
var Posts [100]Post
An array with space for 100 Posts is allocated.
Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
You create a Post value with a composite literal, then this value is copied into the ith element of the array (a redundant copy).
var p Page
This creates a variable of type Page. It is a struct, so memory is allocated for it, and since it contains a field Posts [100]Post, another array of 100 elements is allocated.
p.Posts = Posts
This copies 100 elements (a hundred structs)!
tmpl.ExecuteTemplate(w, "index.html", p)
This creates a copy of p (which is of type Page), so another array of 100 posts is created and the elements from p are copied into it, and only then is it passed to ExecuteTemplate().
And since Page.Posts is an array, most likely a copy will be made of each element when it is processed (iterated over) in the template engine (I haven't checked/verified this).
Proposal for more efficient code
Some things to speed up your code:
func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id             int
        Title, Content string
    }
    Posts := make([]*Post, 100) // A slice of pointers

    // Fill posts
    for i := range Posts {
        // Initializing a pointer just copies the address of the created struct value
        Posts[i] = &Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }

    type Page struct {
        Title, Subtitle string
        Posts           []*Post // "Just" a slice type (it's a descriptor)
    }

    // Create a page; only the Posts slice descriptor is copied
    p := Page{"Index Page of My Super Blog", "A blog about everything", Posts}

    tmpl := templates["index.html"]
    // Only pass the address of p.
    // (Since Page.Posts is now just a slice, passing by value would also be OK.)
    tmpl.ExecuteTemplate(w, "index.html", &p)
}
Please test this code and report back your results.
PHP isn't answering 5000 requests concurrently. The requests are being multiplexed to a handful of processes for serial execution. This makes more efficient use of both CPU and memory. 5000 concurrent connections may make sense for a message broker or similar, doing limited processing of small pieces of data, but it makes little sense for any service doing real I/O or processing. If your Go app is not behind a proxy of some type that will limit the number of concurrent requests, you will want to do so yourself, perhaps at the beginning of your handler, using a buffered channel or a wait group, a la https://blakemesdag.com/blog/2014/11/12/limiting-go-concurrency/.
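A minimal sketch of the buffered-channel approach (the limit of 100 is an arbitrary example; limitedHandler simply wraps the handler from the question):

// sem acts as a counting semaphore: at most cap(sem) requests are handled
// at the same time; the rest block until a slot frees up.
var sem = make(chan struct{}, 100)

func limitedHandler(w http.ResponseWriter, r *http.Request) {
    sem <- struct{}{}        // acquire a slot
    defer func() { <-sem }() // release it when the handler returns

    handler(w, r) // the original handler from the question
}

Register limitedHandler with http.HandleFunc instead of handler and the rest of the program stays the same.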
html/template is slow because it uses reflection, which isn't optimized for speed yet.
Try quicktemplate as a workaround for slow html/template. Currently quicktemplate is more than 20x faster than html/template, according to the benchmark in its source code.
There is a template benchmark you can check at goTemplateBenchmark. Personally, I think Hero is the one that best combines efficiency and readability.
Typed strings are your friend if you would like to speed up html/template. It is sometimes useful to pre-render repeating HTML fragments.
Assuming that most of the time is spent rendering those 100 Post objects, it could make sense to pre-render them.
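For example, something like this (a sketch that assumes the Post type from the question is moved to package level; prerenderPosts and postTmpl are illustrative names, and it needs the bytes and html/template imports):

var postTmpl = template.Must(template.New("post").Parse(
    "<h2>{{.Title}}</h2><p>{{.Content}}</p>"))

// prerenderPosts executes the per-post fragment once per post and returns
// the combined markup as template.HTML, which the outer template inserts
// verbatim instead of re-escaping on every request.
func prerenderPosts(posts []Post) (template.HTML, error) {
    var buf bytes.Buffer
    for _, p := range posts {
        if err := postTmpl.Execute(&buf, p); err != nil {
            return "", err
        }
    }
    return template.HTML(buf.String()), nil
}

If the posts don't change per request, call prerenderPosts once and reuse the template.HTML value; the outer template then only substitutes the cheap top-level fields.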
I am starting a discussion which, I hope, will become the one place to discuss data loading using mutators vs. loading from a flat file via 'LOAD DATA INFILE'.
I have been baffled trying to get an enormous performance gain using mutators (with batch sizes of 1000, 10000, 100K, et cetera).
My project involved loading close to 400 million rows of social media data into HyperTable to be used for real-time analytics. It took me close to 3 days to load just 1 million rows of data (code sample below). Each row is approximately 32 bytes. So, in order to avoid taking 2-3 weeks to load this much data, I prepared a flat file with the rows and used the LOAD DATA INFILE method. The performance gain was amazing: using this method, the loading rate was 368336 cells/sec.
See below for actual snapshot of action:
hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;
Loading 7,113,154,337 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 508.07 s
Avg key size: 8.92 bytes
Total cells: 218976067
Throughput: 430998.80 cells/s
Resends: 2210404
hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;
Loading 12,693,476,187 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 1189.71 s
Avg key size: 17.48 bytes
Total cells: 437952134
Throughput: 368118.13 cells/s
Resends: 1483209
Why is the performance difference between the two methods so vast? What's the best way to improve mutator performance? Sample mutator code is below:
my $batch_size = 1000000; # or 1000 or 10000 make no substantial difference
my $ignore_unknown_cfs = 2;
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);
my $keys = new Hypertable::ThriftGen::Key({ row => $row, column_family => $cf, column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({key => $keys, value => $val});
$ht->mutator_set_cell($mutator, $cell);
$ht->mutator_flush($mutator);
I would appreciate any input on this. I don't have a tremendous amount of HyperTable experience.
Thanks.
If it's taking three days to load one million rows, then you're probably calling flush() after every row insert, which is not the right thing to do. Before I describe how to fix that, your mutator_open() arguments aren't quite right. You don't need to specify ignore_unknown_cfs, and you should supply 0 for the flush_interval, something like this:
my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);
You should only call mutator_flush() if you would like to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all data that has been inserted on that mutator has durably made it into the database. If you're not checkpointing how much of the input data has been consumed, then there is no need to call mutator_flush(), since it will get flushed automatically when you close the mutator.
The next performance problem with your code that I see is that you're using mutator_set_cell(). You should use either mutator_set_cells() or mutator_set_cells_as_arrays() since each method call is a round-trip to the ThriftBroker, which is expensive. By using the mutator_set_cells_* methods, you amortize that round-trip over many cells. The mutator_set_cells_as_arrays() method can be more efficient for languages where object construction overhead is large in comparison to native datatypes (e.g. string). I'm not sure about Perl, but you might want to give that a try to see if it boosts performance.
Also, be sure to call mutator_close() when you're finished with the mutator.