Parallel execution of prime finding algorithm slows runtime - go

So I implemented the following prime finding algorithm in Go:
1. primes = []
2. Assume all numbers are prime (vacuously true).
3. check = 2
4. If check is still assumed to be prime, append it to primes.
5. Multiply check by each prime less than or equal to its minimum factor and eliminate the results from the assumed primes.
6. Increment check by 1 and repeat steps 4 through 6 until check > limit.
Here is my serial implementation:
package main

import (
	"fmt"
	"time"
)

type numWithMinFactor struct {
	number    int
	minfactor int
}

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor) {
	var n int
	for i := 0; primes[i] <= check.minfactor; i++ {
		n = check.number * primes[i]
		if n > top {
			break
		}
		minFactors[n] = numWithMinFactor{n, primes[i]}
		if i+1 == len(primes) {
			break
		}
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]numWithMinFactor, top+2)
	check := 2
	for power := 1; check <= top; power++ {
		if minFactors[check].number == 0 {
			primes = append(primes, check)
			minFactors[check] = numWithMinFactor{check, check}
		}
		process(minFactors[check], primes, top, minFactors)
		check++
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(1000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
This runs great, producing all the primes < 1,000,000 in about 63ms (mostly printing) and the primes < 10,000,000 in 600ms on my PC.
Now I figure that none of the numbers check such that 2^n < check <= 2^(n+1) have factors > 2^n, so once I have the primes up to 2^n I can do all the multiplications and eliminations for each check in that range in parallel. My parallel implementation is as follows:
package main

import (
	"fmt"
	"sync"
	"time"
)

type numWithMinFactor struct {
	number    int
	minfactor int
}

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor, wg *sync.WaitGroup) {
	defer wg.Done()
	var n int
	for i := 0; primes[i] <= check.minfactor; i++ {
		n = check.number * primes[i]
		if n > top {
			break
		}
		minFactors[n] = numWithMinFactor{n, primes[i]}
		if i+1 == len(primes) {
			break
		}
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]numWithMinFactor, top+2)
	check := 2
	var wg sync.WaitGroup
	for power := 1; check <= top; power++ {
		for check <= pow(2, power) {
			if minFactors[check].number == 0 {
				primes = append(primes, check)
				minFactors[check] = numWithMinFactor{check, check}
			}
			wg.Add(1)
			go process(minFactors[check], primes, top, minFactors, &wg)
			check++
			if check > top {
				break
			}
		}
		wg.Wait()
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(1000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
Unfortunately, this implementation is slower, running up to 1,000,000 in 600ms and up to 10 million in 6 seconds. My intuition tells me that there is potential for parallelism to improve performance, but I clearly haven't been able to achieve that. I would greatly appreciate any input on how to improve the runtime here, or more specifically any insight as to why the parallel solution is slower.
Additionally, the parallel solution consumes more memory than the serial solution, but that is to be expected; the serial solution can grind up to 1,000,000,000 in about 22 seconds, where the parallel solution runs out of memory on my system (32GB RAM) going for the same target. But I'm asking about runtime here, not memory use. I could, for example, use the zero-value state of the minFactors array rather than a separate isPrime []bool true state, but I think it is more readable as is.
I've tried passing a pointer for primes []int, but that didn't seem to make a difference. Using a channel instead of passing the minFactors array to the process function resulted in big memory use and much (10x-ish) slower performance. I've rewritten this algorithm a couple of times to see if I could iron anything out, but no luck. Any insights or suggestions would be much appreciated, because I think parallelism could make this faster, not 10x slower!
Per #Volker's suggestion I limited the number of goroutines to something less than my PC's available logical processors with the following revision; however, I am still getting runtimes that are 10x slower than the serial implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

type numWithMinFactor struct {
	number    int
	minfactor int
}

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check numWithMinFactor, primes []int, top int, minFactors []numWithMinFactor, wg *sync.WaitGroup) {
	defer wg.Done()
	var n int
	for i := 0; primes[i] <= check.minfactor; i++ {
		n = check.number * primes[i]
		if n > top {
			break
		}
		minFactors[n] = numWithMinFactor{n, primes[i]}
		if i+1 == len(primes) {
			break
		}
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]numWithMinFactor, top+2)
	check := 2
	nlogicalProcessors := 20
	var wg sync.WaitGroup
	var twoPow int
	for power := 1; check <= top; power++ {
		twoPow = pow(2, power)
		for check <= twoPow {
			for nLogicalProcessorsInUse := 0; nLogicalProcessorsInUse < nlogicalProcessors; nLogicalProcessorsInUse++ {
				if minFactors[check].number == 0 {
					primes = append(primes, check)
					minFactors[check] = numWithMinFactor{check, check}
				}
				wg.Add(1)
				go process(minFactors[check], primes, top, minFactors, &wg)
				check++
				if check > top {
					break
				}
				if check > twoPow {
					break
				}
			}
			wg.Wait()
			if check > top {
				break
			}
		}
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(10000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
tldr; Why is my parallel implementation slower than the serial implementation, and how do I make it faster?
Per #mh-cbon's suggestion I made larger jobs for parallel processing, resulting in the following code.
package main

import (
	"fmt"
	"sync"
	"time"
)

func pow(base int, power int) int {
	result := 1
	for i := 0; i < power; i++ {
		result *= base
	}
	return result
}

func process(check int, primes []int, top int, minFactors []int) {
	var n int
	for i := 0; primes[i] <= minFactors[check]; i++ {
		n = check * primes[i]
		if n > top {
			break
		}
		minFactors[n] = primes[i]
		if i+1 == len(primes) {
			break
		}
	}
}

func processRange(start int, end int, primes []int, top int, minFactors []int, wg *sync.WaitGroup) {
	defer wg.Done()
	for start <= end {
		process(start, primes, top, minFactors)
		start++
	}
}

func findPrimes(top int) []int {
	primes := []int{}
	minFactors := make([]int, top+2)
	check := 2
	nlogicalProcessors := 10
	var wg sync.WaitGroup
	var twoPow int
	var start int
	var end int
	var stepSize int
	var stepsTaken int
	for power := 1; check <= top; power++ {
		twoPow = pow(2, power)
		stepSize = (twoPow - start) / nlogicalProcessors
		stepsTaken = 0
		stepSize = (twoPow / 2) / nlogicalProcessors
		for check <= twoPow {
			start = check
			end = check + stepSize
			if stepSize == 0 {
				end = twoPow
			}
			if stepsTaken == nlogicalProcessors-1 {
				end = twoPow
			}
			if end > top {
				end = top
			}
			for check <= end {
				if minFactors[check] == 0 {
					primes = append(primes, check)
					minFactors[check] = check
				}
				check++
			}
			wg.Add(1)
			go processRange(start, end, primes, top, minFactors, &wg)
			if check > top {
				break
			}
			if check > twoPow {
				break
			}
			stepsTaken++
		}
		wg.Wait()
		if check > top {
			break
		}
	}
	return primes
}

func main() {
	fmt.Println("Welcome to prime finder!")
	start := time.Now()
	fmt.Println(findPrimes(1000000))
	elapsed := time.Since(start)
	fmt.Printf("Finding primes took %s\n", elapsed)
}
This runs at a similar speed to the serial implementation.

So I did eventually get a parallel version of the code to run slightly faster than the serial version, following suggestions from #mh-cbon (see above). However, this implementation did not result in vast improvements relative to the serial implementation (50ms to 10 million, compared to 75ms serially). Considering that allocating and writing an []int 0:10000000 takes 25ms, I'm not disappointed by these results. As #Volker stated, "such stuff often is not limited by CPU but by memory bandwidth," which I believe is the case here.
I would still love to see any additional improvements; however, I am somewhat satisfied with what I've gained here.
Serial code running up to 2 billion: 19.4 seconds
Parallel code running up to 2 billion: 11.1 seconds
Initializing []int{0:2Billion}: 4.5 seconds
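The memory-bandwidth floor mentioned above can be measured directly. This is a minimal sketch (the slice size is illustrative; fillTime is my own name, not from the post):

```go
package main

import (
	"fmt"
	"time"
)

// fillTime measures allocating a []int of length n and writing every
// element, which approximates the memory-bandwidth floor that any sieve
// over the same range has to pay regardless of CPU parallelism.
func fillTime(n int) time.Duration {
	start := time.Now()
	s := make([]int, n)
	for i := range s {
		s[i] = i // the writes force every page to actually be touched
	}
	_ = s
	return time.Since(start)
}

func main() {
	const n = 10_000_000
	fmt.Printf("alloc+write of %d ints took %s\n", n, fillTime(n))
}
```

If the parallel sieve's runtime is close to this number, adding more goroutines cannot help much.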

Related

Need help about concurrency programming in Go

For this task I need to find the number with the minimum digit sum in a list of numbers, then print that number. This must be done with a Mutex and WaitGroups. I can't find where the mistake is, or why the output differs between runs.
Logic: Scanf n and make a vector of length n. Then create a function that sums the digits of a number, and a second function that hands that work out to goroutines in one for loop.
I ran this code a few times, and it sometimes gives different answers for the same input.
Input:
3
13
12
11
Output:
Sometimes 12
Sometimes 11
package main

import (
	"fmt"
	"math"
	"runtime"
	"sync"
)

var wg sync.WaitGroup
var mutex sync.Mutex
var vector []int
var i int
var n int
var firstsum int
var p int // Temp sum
var index_result int

func sumanajmanjih(broj int) int {
	var br int
	var suma int
	br = int(math.Abs(float64(broj)))
	suma = 0
	for {
		suma += br % 10
		br = br / 10
		if br <= 0 {
			break
		}
	}
	return suma
}

func glavna(rg int) {
	var index int
	firstsum = sumanajmanjih(vector[0])
	for {
		mutex.Lock()
		if i == n {
			mutex.Unlock()
			break
		} else {
			index = i
			i += 1
			mutex.Unlock()
		}
		fmt.Printf("Procesor %d radni indeks %d\n", rg, index)
		p = sumanajmanjih(vector[index])
		if p < firstsum {
			firstsum = p
			index_result = index
		}
	}
	wg.Done()
}

func main() {
	fmt.Scanf("%d", &n)
	vector = make([]int, n)
	for i := 0; i < n; i++ {
		fmt.Scanf("%d", &vector[i])
	}
	fmt.Println(vector)
	brojGR := runtime.NumCPU()
	wg.Add(brojGR)
	for rg := 0; rg < brojGR; rg++ {
		go glavna(rg)
	}
	wg.Wait()
	fmt.Println(vector[index_result])
}
Not a full answer to your question, but a few suggestions to make the code more readable and stable:
Use English names: glavna and brojGR are hard to understand.
Add comments to the code explaining intent.
Try to avoid shared/global variables, especially in concurrent code. glavna(rg) is executed concurrently, and you assign the globals i, p and firstsum inside that function; that is a race condition. Pass all data in and out of the function explicitly, as arguments or results.
A mutex can easily lock up the code, and that is complicated to debug, so simplify its usage. Often a defer mutex.Unlock() on the line right after Lock() is good enough.
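To make the shared-state advice concrete, here is a hedged sketch of one way the program could be restructured: each goroutine scans its own stride with purely local state, and the mutex guards only the final merge. The names digitSum and minDigitSumIndex are mine, not from the original post.

```go
package main

import (
	"fmt"
	"sync"
)

// digitSum returns the sum of the decimal digits of n (assumed non-negative).
func digitSum(n int) int {
	s := 0
	for {
		s += n % 10
		n /= 10
		if n <= 0 {
			break
		}
	}
	return s
}

// minDigitSumIndex returns the index of the element with the smallest digit
// sum. Workers scan disjoint strides with local bests, so there is no data
// race; the mutex is taken once per worker for the merge.
func minDigitSumIndex(vector []int, workers int) int {
	var mu sync.Mutex
	bestSum := digitSum(vector[0])
	bestIndex := 0
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			localSum, localIndex := -1, -1
			for i := w; i < len(vector); i += workers {
				if s := digitSum(vector[i]); localSum == -1 || s < localSum {
					localSum, localIndex = s, i
				}
			}
			if localIndex == -1 {
				return // this worker had no elements
			}
			mu.Lock()
			defer mu.Unlock()
			if localSum < bestSum {
				bestSum, bestIndex = localSum, localIndex
			}
		}(w)
	}
	wg.Wait()
	return bestIndex
}

func main() {
	vector := []int{13, 12, 11}
	fmt.Println(vector[minDigitSumIndex(vector, 4)]) // prints 11 (digit sum 2)
}
```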

Why this Go program is so slow?

I just read some short tutorials on Go and wrote a simple sieve program. It uses the sieve algorithm to print all the prime numbers smaller than 10000, creating a lot of goroutines. I got the correct results, but the program is very slow (5 seconds on my machine). I also wrote a Lua script and a Python script implementing the same algorithm, and both run a lot faster (about 1 second each on my machine).
Note that the purpose is to get an idea of goroutine performance compared with coroutines in other languages, for example Lua. The implementation is very inefficient; some comments pointed out that it's not the correct way to implement a Sieve of Eratosthenes. Yes, that's intentional. Some other replies pointed out that the slowness is caused by print I/O, so I commented out the print lines.
My question is: why is my sieve program implemented in Go so slow?
Here is the code:
package main

import (
	"fmt"
	"sync"
)

type Sieve struct {
	id        int
	msg_queue chan int
	wg        *sync.WaitGroup
}

func NewSieve(id int) *Sieve {
	sieve := new(Sieve)
	sieve.id = id
	sieve.msg_queue = make(chan int)
	sieve.wg = new(sync.WaitGroup)
	sieve.wg.Add(1)
	return sieve
}

func (sieve *Sieve) run() {
	defer sieve.wg.Done()
	myprime := <-sieve.msg_queue
	if myprime == 0 {
		return
	}
	// fmt.Printf("Sieve (%d) is for prime number %d.\n", sieve.id, myprime)
	next_sieve := NewSieve(sieve.id + 1)
	go next_sieve.run()
	for {
		number := <-sieve.msg_queue
		if number == 0 {
			next_sieve.msg_queue <- number
			next_sieve.wg.Wait()
			return
		} else if number%myprime != 0 {
			// fmt.Printf("id: %d, number: %d, myprime: %d, number mod myprime: %d\n", sieve.id, number, myprime, number%myprime)
			next_sieve.msg_queue <- number
		}
	}
}

func driver() {
	first := NewSieve(2)
	go first.run()
	for n := 2; n <= 10000; n++ {
		first.msg_queue <- n
	}
	first.msg_queue <- 0
	first.wg.Wait()
}

func main() {
	driver()
}
As a comparison, here is the code of sieve.lua
function sieve(id)
    local myprime = coroutine.yield()
    -- print(string.format("Sieve (%d) is for prime number %d", id, myprime))
    local next_sieve = coroutine.create(sieve)
    coroutine.resume(next_sieve, id + 1)
    while true do
        local number = coroutine.yield()
        if number % myprime ~= 0 then
            -- print(string.format("id: %d, number: %d, myprime: %d, number mod myprime: %d", id, number, myprime, number % myprime))
            coroutine.resume(next_sieve, number)
        end
    end
end

function driver()
    local first = coroutine.create(sieve)
    coroutine.resume(first, 2)
    local n
    for n = 2, 10000 do
        coroutine.resume(first, n)
    end
end

driver()
Meaningless microbenchmarks produce meaningless results.
You are timing print I/O.
You are incurring goroutine and channel overhead for a small amount of work.
Here is a prime number sieve program in Go.
Output:
$ go version
go version devel +46be01f4e0 Sun Oct 13 01:48:30 2019 +0000 linux/amd64
$ go build sumprimes.go && time ./sumprimes
5736396
29.96µs
real 0m0.001s
user 0m0.001s
sys 0m0.000s
sumprimes.go:
package main

import (
	"fmt"
	"time"
)

const (
	prime    = 0x00
	notprime = 0xFF
)

func oddPrimes(n uint64) (sieve []uint8) {
	sieve = make([]uint8, (n+1)/2)
	sieve[0] = notprime
	p := uint64(3)
	for i := p * p; i <= n; i = p * p {
		for j := i; j <= n; j += 2 * p {
			sieve[j/2] = notprime
		}
		for p += 2; sieve[p/2] == notprime; p += 2 {
		}
	}
	return sieve
}

func sumPrimes(n uint64) uint64 {
	sum := uint64(0)
	if n >= 2 {
		sum += 2
	}
	for i, p := range oddPrimes(n) {
		if p == prime {
			sum += 2*uint64(i) + 1
		}
	}
	return sum
}

func main() {
	start := time.Now()
	var n uint64 = 10000
	sum := sumPrimes(n)
	fmt.Println(sum)
	fmt.Println(time.Since(start))
}
Most of the time is spent in fmt.Printf.
Taking out the line:
fmt.Printf("id: %d, number: %d, myprime: %d, number mod myprime: %d\n", sieve.id, number, myprime, number%myprime)
reduces runtime from ~5.4 seconds to ~0.64 seconds on one test I ran.
Taking out the unnecessary sync.WaitGroups reduces the time a bit further, to ~0.48 seconds. See the version without sync.WaitGroup here. You're still doing a lot of channel operations, which languages with yield-value-from-coroutine operators do not need (though they have their own issues instead). This is not a good way to implement primality testing.

How to iterate int range concurrently

For purely educational purposes I created a base58 package. It will encode/decode a uint64 using the bitcoin base58 symbol chart, for example:
b58 := Encode(100) // return 2j
num := Decode("2j") // return 100
While creating the first tests I came up with this:
func TestEncode(t *testing.T) {
	var i uint64
	for i = 0; i <= (1<<64 - 1); i++ {
		b58 := Encode(i)
		num := Decode(b58)
		if num != i {
			t.Fatalf("Expecting %d for %s", i, b58)
		}
	}
}
This "naive" implementation tries to convert the whole uint64 range (from 0 to 18,446,744,073,709,551,615) to base58 and back to uint64, but it takes too much time.
To better understand how Go handles concurrency, I would like to know how to use channels or goroutines to iterate across the full uint64 range in the most efficient way.
Could the data be processed in chunks and in parallel? If yes, how?
Thanks in advance.
UPDATE:
As mentioned in the answer by #Adrien, one way is to use t.Parallel(), but that applies only when testing the package. In any case, when I implemented it I found it noticeably slower; it runs in parallel, but there is no speed gain.
I understand that covering the full uint64 range may take years, but what I want to find out is how a channel or goroutine may help speed up the process (testing with a small range, 1<<16), probably by using something like this: https://play.golang.org/p/9U22NfrXeq just as an example.
The question is not about how to test the package; it is about what algorithm or technique could be used to iterate faster by using concurrency.
This functionality is built into the Go testing package, in the form of T.Parallel:
func TestEncode(t *testing.T) {
	var i uint64
	for i = 0; i <= (1<<64 - 1); i++ {
		t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
			j := i       // Copy to local var - important
			t.Parallel() // Mark test as parallelizable
			b58 := Encode(j)
			num := Decode(b58)
			if num != j {
				t.Fatalf("Expecting %d for %s", j, b58)
			}
		})
	}
}
I came up with this solution:
package main

import (
	"fmt"
	"time"

	"github.com/nbari/base58"
)

func encode(i uint64) {
	x := base58.Encode(i)
	fmt.Printf("%d = %s\n", i, x)
	time.Sleep(time.Second)
}

func main() {
	concurrency := 4
	sem := make(chan struct{}, concurrency)
	for i, val := uint64(0), uint64(1<<16); i <= val; i++ {
		sem <- struct{}{}
		go func(i uint64) {
			defer func() { <-sem }()
			encode(i)
		}(i)
	}
	for i := 0; i < cap(sem); i++ {
		sem <- struct{}{}
	}
}
Basically, this starts 4 workers that call the encode function. To make the behavior easier to observe, a sleep is added so that the data is printed in chunks of 4.
Also, this answer helped me to better understand concurrency: https://stackoverflow.com/a/18405460/1135424
If there is a better way please let me know.
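One alternative to the one-goroutine-per-value pattern above is to split the range into one contiguous chunk per CPU, which cuts goroutine and scheduling overhead dramatically. This is a hedged sketch: roundTrip here uses strconv base-36 only so the code is self-contained; substitute the real base58 Encode/Decode pair.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"sync"
)

// roundTrip stands in for the base58 Encode/Decode pair from the question.
// It reports whether a value survives an encode/decode round trip.
func roundTrip(i uint64) bool {
	s := strconv.FormatUint(i, 36)
	n, err := strconv.ParseUint(s, 36, 64)
	return err == nil && n == i
}

// checkRange verifies every value in [lo, hi), splitting the range into one
// contiguous chunk per CPU instead of spawning a goroutine per value.
func checkRange(lo, hi uint64) bool {
	workers := uint64(runtime.NumCPU())
	chunk := (hi - lo + workers - 1) / workers
	if chunk == 0 {
		chunk = 1
	}
	var (
		wg sync.WaitGroup
		mu sync.Mutex
		ok = true
	)
	for w := uint64(0); w < workers; w++ {
		start := lo + w*chunk
		if start >= hi {
			break
		}
		end := start + chunk
		if end > hi {
			end = hi
		}
		wg.Add(1)
		go func(start, end uint64) {
			defer wg.Done()
			for i := start; i < end; i++ {
				if !roundTrip(i) {
					mu.Lock()
					ok = false
					mu.Unlock()
					return
				}
			}
		}(start, end)
	}
	wg.Wait()
	return ok
}

func main() {
	fmt.Println(checkRange(0, 1<<16))
}
```

Each goroutine does a long stretch of pure computation, so the synchronization cost is paid once per chunk rather than once per value.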

GoLang - Sequential vs Concurrent

I have two versions of factorial, concurrent vs. sequential.
Both programs calculate the factorial of 10, 1,000,000 times.
Factorial Concurrent Processing
package main

import (
	"fmt"
	//"math/rand"
	"sync"
	"time"
	//"runtime"
)

func main() {
	start := time.Now()
	printFact(fact(gen(1000000)))
	fmt.Println("Current Time:", time.Now(), "Start Time:", start, "Elapsed Time:", time.Since(start))
	panic("Error Stack!")
}

func gen(n int) <-chan int {
	c := make(chan int)
	go func() {
		for i := 0; i < n; i++ {
			//c <- rand.Intn(10) + 1
			c <- 10
		}
		close(c)
	}()
	return c
}

func fact(in <-chan int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup
	for n := range in {
		wg.Add(1)
		go func(n int) {
			//temp := 1
			//for i := n; i > 0; i-- {
			//	temp *= i
			//}
			temp := calcFact(n)
			out <- temp
			wg.Done()
		}(n)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func printFact(in <-chan int) {
	//for n := range in {
	//	fmt.Println("The random Factorial is:", n)
	//}
	var i int
	for range in {
		i++
	}
	fmt.Println("Count:", i)
}

func calcFact(c int) int {
	if c == 0 {
		return 1
	} else {
		return calcFact(c-1) * c
	}
}

//###End of Factorial Concurrent
Factorial Sequential Processing
package main

import (
	"fmt"
	//"math/rand"
	"time"
	//"runtime" // commented out: an unused import fails to compile
)

func main() {
	start := time.Now()
	//for _, n := range factorial(gen(10000)...) {
	//	fmt.Println("The random Factorial is:", n)
	//}
	var i int
	for range factorial(gen(1000000)...) {
		i++
	}
	fmt.Println("Count:", i)
	fmt.Println("Current Time:", time.Now(), "Start Time:", start, "Elapsed Time:", time.Since(start))
}

func gen(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		//out = append(out, rand.Intn(10)+1)
		out = append(out, 10)
	}
	println(len(out))
	return out
}

func factorial(val ...int) []int {
	var out []int
	for _, n := range val {
		fa := calcFact(n)
		out = append(out, fa)
	}
	return out
}

func calcFact(c int) int {
	if c == 0 {
		return 1
	} else {
		return calcFact(c-1) * c
	}
}

//###End of Factorial sequential processing
My assumption was that concurrent processing would be faster than sequential, but the sequential version executes faster on my Windows machine.
I am using an 8-core i7 with 32 GB RAM.
I am not sure if there is something wrong in the programs or whether my basic understanding is correct.
p.s. - I am new to GoLang.
The concurrent version of your program will always be slower than the sequential version. The reason, however, is related to the nature and behavior of the problem you are trying to solve.
Your program is concurrent, but it is not parallel. Each calcFact runs in its own goroutine, but there is no division of the amount of work required to be done. Each goroutine must perform the same computation and output the same value.
It is like having a task that requires some text to be copied a hundred times. You have just one CPU (ignore the cores for now).
When you start a sequential process, you point the CPU to the original text once and ask it to write it down 100 times. The CPU has to manage a single task.
With goroutines, the CPU is told that there are a hundred tasks that must be done concurrently. It just so happens that they are all the same task. But the CPU is not smart enough to know that.
So it does the same thing as above. Even though each task is now 100 times smaller, there is still just one CPU. So the amount of work the CPU has to do is the same, except with all the added overhead of managing 100 different things at once. Hence, it loses a part of its efficiency.
To see an improvement in performance you'll need proper parallelism. A simple example would be to split the factorial input number roughly in the middle and compute 2 smaller factorials. Then combine them together:
// not an ideal solution
package main

import (
	"fmt"
	"sync"
)

func main() {
	ch := make(chan int)
	r := 10
	result := 1
	go fact(r, ch)
	for i := range ch {
		result *= i
	}
	fmt.Println(result)
}

func fact(n int, ch chan int) {
	p := n / 2
	q := p + 1
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		ch <- factPQ(1, p)
		wg.Done()
	}()
	go func() {
		ch <- factPQ(q, n)
		wg.Done()
	}()
	go func() {
		wg.Wait()
		close(ch)
	}()
}

func factPQ(p, q int) int {
	r := 1
	for i := p; i <= q; i++ {
		r *= i
	}
	return r
}
Working code: https://play.golang.org/p/xLHAaoly8H
Now you have two goroutines working towards the same goal and not just repeating the same calculations.
Note about CPU cores:
In your original code, the sequential version's operations are most definitely being distributed amongst various CPU cores by the runtime environment and the OS. So it still has parallelism to a degree; you just don't control it.
The same is happening in the concurrent version, but again, as mentioned above, the overhead of goroutine context switching brings the performance down.
abhink has given a good answer. I would also like to draw attention to Amdahl's Law, which should always be borne in mind when trying to use parallel processing to increase the overall speed of computation. That's not to say "don't make things parallel", but rather: be realistic about expectations and understand the parallel architecture fully.
Go allows us to write concurrent programs. This is related to trying to write faster parallel programs, but the two issues are separate. See Rob Pike's Concurrency is not Parallelism for more info.
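To put a number on the Amdahl's Law point: the law bounds the overall speedup by S = 1 / ((1-p) + p/n), where p is the parallelizable fraction of the work and n is the number of workers. A quick illustration (the 95% figure is an example, not a measurement of the programs above):

```go
package main

import "fmt"

// amdahl returns the best-case overall speedup when a fraction p of the
// work is parallelized perfectly across n workers: S = 1 / ((1-p) + p/n).
func amdahl(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	// Even if 95% of a program parallelizes perfectly, the remaining
	// serial 5% caps the speedup at 20x no matter how many cores exist.
	for _, n := range []int{2, 4, 8, 1 << 20} {
		fmt.Printf("p=0.95, n=%7d: speedup %.2fx\n", n, amdahl(0.95, n))
	}
}
```

On 8 workers this gives only about a 5.9x speedup, which is why realistic expectations matter before parallelizing.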

Go algorithm for looping through servers in predefined ratio

I am trying to make an algorithm that can loop through things (backend servers in my case) by a predefined ratio.
For example, I have 2 backend servers:
type server struct {
	addr    string
	ratio   float64
	counter int64
}

// s2 is a beast and may handle 3 times the requests of s1
s1 := &server{addr: ":3000", ratio: 0.25}
s2 := &server{addr: ":3001", ratio: 0.75}

func nextServer() {
	server := next() // simple goroutine that provides the next server between s1 and s2
	N := server.counter / i
	if float64(N) > server.ratio {
		// repeat this function
		return nextServer()
	}
	server.counter += 1
}

for i := 0; i < 1000; i++ {
	nextServer()
}
s1 has 250 as counter (requests handled)
s2 is huge, so he has 750 as counter (requests handled)
This is a very simple implementation of what I have, but when i is around 10,000 it keeps looping in nextServer() because N is always > server.ratio.
As long as i is around 5,000 it works perfectly, but I think there are better algorithms for looping by ratios.
How can I make this simple and solid?
Something like this?
package main

import (
	"fmt"
	"math/rand"
)

type server struct {
	addr  string
	ratio float64
}

var servers []server

func nextServer() *server {
	rndFloat := rand.Float64() // pick a random number between 0.0-1.0
	ratioSum := 0.0
	for _, srv := range servers {
		ratioSum += srv.ratio // sum ratios of all previous servers in list
		if ratioSum >= rndFloat { // if ratioSum rises above the random number
			return &srv // return that server
		}
	}
	return nil // should not come here
}

func main() {
	servers = []server{
		{"0.25", 0.25}, {"0.50", 0.50},
		{"0.10", 0.10}, {"0.15", 0.15},
	}
	counts := make(map[string]int, len(servers))
	for i := 0; i < 100; i++ {
		srv := nextServer()
		counts[srv.addr] += 1
	}
	fmt.Println(counts)
}
Yields for example:
map[0.50:56 0.15:15 0.25:24 0.10:5]
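If exact, deterministic spacing matters more than randomness, one alternative is the smooth weighted round-robin scheme (the approach nginx uses for upstream balancing). This sketch is not from the answer above and uses integer weights rather than float ratios:

```go
package main

import "fmt"

type server struct {
	addr    string
	weight  int
	current int
}

// next implements smooth weighted round-robin: on each pick, every server's
// current grows by its weight; the server with the largest current wins and
// is then reduced by the total weight. The resulting sequence is fully
// deterministic and spreads picks evenly (e.g. weights 1:3 yield B A B B).
func next(servers []*server) *server {
	total := 0
	var best *server
	for _, s := range servers {
		s.current += s.weight
		total += s.weight
		if best == nil || s.current > best.current {
			best = s
		}
	}
	best.current -= total
	return best
}

func main() {
	servers := []*server{
		{addr: ":3000", weight: 1},
		{addr: ":3001", weight: 3},
	}
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[next(servers).addr]++
	}
	fmt.Println(counts) // 250 picks for :3000 and 750 for :3001
}
```

Unlike the random pick, this never needs a retry loop and honors the ratio exactly over every window of (total weight) requests.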
