Why is this program not performing better with goroutines?

I am learning the Go programming language. Please consider the following program:
package main

import (
    "bytes"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "sync"
)

func grep(file string) {
    defer wg.Done()
    cmd := exec.Command("grep", "-H", "--color=always", "add", file)
    var out bytes.Buffer
    cmd.Stdout = &out
    cmd.Run()
    fmt.Printf("%s\n", out.String())
}

func walkFn(path string, info os.FileInfo, err error) error {
    if !info.IsDir() {
        wg.Add(1)
        go grep(path)
    }
    return nil
}

var wg sync.WaitGroup

func main() {
    filepath.Walk("/tmp/", walkFn)
    wg.Wait()
}
This program walks all the files in the /tmp directory and runs grep on each file in its own goroutine, so it spawns n goroutines, where n is the number of files in /tmp. main waits until all the goroutines finish their work.
Interestingly, the program takes the same time to execute with and without goroutines: try running go grep(path) versus a plain grep(path) call.
I was expecting the goroutine version to run faster, since multiple greps run concurrently, but both versions execute in almost equal time. I am wondering why this happens?

Try using more cores. Also, use a better root directory for comparative purposes, like the Go directory. An SSD makes a big difference too. For example,
func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    goroot := "/home/peter/go/"
    filepath.Walk(goroot, walkFn)
    wg.Wait()
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
GOMAXPROCS: 1
real 0m10.137s
user 0m2.628s
sys 0m6.472s

GOMAXPROCS: 4
real 0m3.284s
user 0m2.492s
sys 0m5.116s

Your program's performance is bound by the speed of the disk (or RAM, if /tmp is a RAM disk): the workload is I/O-bound. No matter how many goroutines run in parallel, they can't read the data faster than that.
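That said, spawning one grep per file can create thousands of processes at once. If you want to cap that, a common pattern is a buffered channel used as a semaphore. Here is a minimal sketch of the program above with that added (the cap of 8 and the error handling are my own choices, not from the answers):

package main

import (
    "bytes"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "sync"
)

var (
    wg  sync.WaitGroup
    sem = make(chan struct{}, 8) // at most 8 grep processes at a time
)

func grep(file string) {
    defer wg.Done()
    sem <- struct{}{}        // acquire a slot
    defer func() { <-sem }() // release it when done

    cmd := exec.Command("grep", "-H", "--color=always", "add", file)
    var out bytes.Buffer
    cmd.Stdout = &out
    // grep exits non-zero when it finds no match, so only print on success.
    if err := cmd.Run(); err == nil {
        fmt.Print(out.String())
    }
}

func walkFn(path string, info os.FileInfo, err error) error {
    if err == nil && !info.IsDir() {
        wg.Add(1)
        go grep(path)
    }
    return nil
}

func main() {
    filepath.Walk("/tmp/", walkFn)
    wg.Wait()
}

This won't beat the disk either; it just keeps the process count bounded.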

Related

Multi-Threading using goroutines

I am trying to automate my recon tool using Go. So far I can run two basic tools in Kali (Nikto/whois). Now I want them to execute in parallel rather than waiting for one function to finish. After reading a bit, I learned this can be achieved with goroutines. But my code doesn't seem to work:
package main

import (
    "fmt"
    "log"
    "os"
    "os/exec"
)

var url string

func nikto() {
    cmd := exec.Command("nikto", "-h", url)
    cmd.Stdout = os.Stdout
    err := cmd.Run()
    if err != nil {
        log.Fatal(err)
    }
}

func whois() {
    cmd := exec.Command("whois", "google.co")
    cmd.Stdout = os.Stdout
    err := cmd.Run()
    if err != nil {
        log.Fatal(err)
    }
}

func main() {
    fmt.Printf("Please input URL")
    fmt.Scanln(&url)
    nikto()
    go whois()
}
I do understand that go whois() only keeps running for as long as main() does, but I still can't see the two commands execute in parallel.
If I understood your question correctly, you want to execute both nikto() and whois() in parallel and wait until both have returned. To wait until a set of goroutines has finished, sync.WaitGroup is a good way to achieve this. For you, it could look something like this:
func main() {
    fmt.Printf("Please input URL")
    fmt.Scanln(&url)

    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        nikto()
    }()
    go func() {
        defer wg.Done()
        whois()
    }()
    wg.Wait()
}
Here wg.Add(2) tells the WaitGroup that we will wait for 2 goroutines.
Then your two functions are called inside small wrapper functions, which also call wg.Done() once each function is done; this tells the WaitGroup that the function finished. The defer keyword tells Go to execute that call once the surrounding function returns. Also notice the go keyword before the two calls to the wrapper functions, which makes them execute in 2 separate goroutines.
Finally, once both goroutines have been started, the call to wg.Wait() in main blocks until wg.Done() has been called twice, which happens once both nikto() and whois() have finished.
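One thing the code above doesn't address: both commands still write to os.Stdout at the same time, so their lines can interleave on the terminal. A variant that captures each tool's output separately and prints it in one block per tool (a sketch; the runTool helper and the buffering are my own additions):

package main

import (
    "bytes"
    "fmt"
    "os/exec"
    "sync"
)

var url string

// runTool runs a command and returns everything it printed.
func runTool(name string, args ...string) string {
    var out bytes.Buffer
    cmd := exec.Command(name, args...)
    cmd.Stdout = &out
    cmd.Stderr = &out
    if err := cmd.Run(); err != nil {
        fmt.Fprintf(&out, "error: %v\n", err)
    }
    return out.String()
}

func main() {
    fmt.Printf("Please input URL")
    fmt.Scanln(&url)

    var wg sync.WaitGroup
    results := make([]string, 2)

    wg.Add(2)
    go func() { defer wg.Done(); results[0] = runTool("nikto", "-h", url) }()
    go func() { defer wg.Done(); results[1] = runTool("whois", "google.co") }()
    wg.Wait()

    // Each slice element is written by exactly one goroutine, so this is race-free.
    fmt.Print(results[0])
    fmt.Print(results[1])
}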

Detecting whether something is on STDIN

Program should be able to get input from stdin at the terminal, as follows:
echo foobar | program
However, in the source below for Program, the stdin read blocks if the pipe is omitted:
package main

import (
    "fmt"
    "os"
)

func main() {
    b := make([]byte, 1024)
    r := os.Stdin
    n, e := r.Read(b)
    if e != nil {
        fmt.Printf("Err: %s\n", e)
    }
    fmt.Printf("Res: %s (%d)\n", b, n)
}
So how can Program detect whether something is being piped to it in this manner, and continue execution instead of blocking if not?
... and is it a good idea to do so?
os.Stdin is treated like a file and has permissions. When something is piped into os.Stdin, the perms are 0600; when stdin is just the terminal, they are 0620 (on Linux, at least, where pseudo-terminal devices have mode 0620).
This code works:
package main

import (
    "fmt"
    "os"
)

func main() {
    stat, _ := os.Stdin.Stat()
    fmt.Printf("stdin mode: %v\n", stat.Mode().Perm())
    if stat.Mode().Perm() == 0600 {
        fmt.Printf("stdin open\n")
        return
    }
    fmt.Printf("stdin close\n")
}
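Note that the exact permission bits are platform-specific. A more robust variant of the same Stat-based idea (my sketch, not from the answer above) checks whether stdin is a character device, which a terminal is and a pipe is not:

package main

import (
    "fmt"
    "os"
)

func main() {
    stat, err := os.Stdin.Stat()
    if err != nil {
        fmt.Println(err)
        return
    }
    if stat.Mode()&os.ModeCharDevice == 0 {
        fmt.Println("stdin is a pipe or file") // something is being piped in
    } else {
        fmt.Println("stdin is a terminal") // nothing piped; interactive
    }
}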
It would be possible if you put the read in a goroutine and use select to do a non-blocking read from the channel in main, but you would then have a race condition where you could get different results depending on whether the data comes in quickly enough. What is the larger problem you're trying to solve?
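For illustration, a minimal sketch of that goroutine-plus-select approach (the 100 ms timeout is an arbitrary choice of mine, and the race described above still applies):

package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    input := make(chan string, 1)
    go func() {
        b := make([]byte, 1024)
        n, err := os.Stdin.Read(b) // blocks in the goroutine, not in main
        if err == nil {
            input <- string(b[:n])
        }
    }()

    select {
    case s := <-input:
        fmt.Printf("Res: %s", s)
    case <-time.After(100 * time.Millisecond):
        fmt.Println("no input arrived in time, continuing")
    }
}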

Concurrent writing to a file

In Go, how can I control concurrent writing to a text file?
I ask this because I will have multiple goroutines writing to a text file using the same file handle.
I wrote this bit of code to try and see what happens, but I'm not sure if I did it "right":
package main

import (
    "fmt"
    "math"
    "math/rand"
    "os"
    "sync"
    "time"
)

func WriteToFile(i int, f *os.File, w *sync.WaitGroup) {
    // sleep for either 200 or 201 milliseconds
    randSleep := int(math.Floor(200 + (2 * rand.Float64())))
    fmt.Printf("Thread %d waiting %d\n", i, randSleep)
    time.Sleep(time.Duration(randSleep) * time.Millisecond)
    // write to the file
    fmt.Fprintf(f, "Printing out: %d\n", i)
    // write to stdout
    fmt.Printf("Printing out: %d\n", i)
    w.Done()
}

func main() {
    rand.Seed(time.Now().UnixNano())
    d, err := os.Getwd()
    if err != nil {
        fmt.Println(err)
    }
    filename := d + "/log.txt"
    f, err := os.OpenFile(filename, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0666)
    if err != nil {
        fmt.Println(err)
    }
    var w *sync.WaitGroup = new(sync.WaitGroup)
    w.Add(10)
    // start 10 writers to the file
    for i := 1; i <= 10; i++ {
        go WriteToFile(i, f, w)
    }
    // wait for writers to finish
    w.Wait()
}
I half expected that the output would show something like this in the file instead of the coherent output I got:
Printing Printing out: 2
out: 5
Poriuntitng: 6
Essentially, I expected the characters to come out incoherently interleaved due to a lack of synchronization. Did I not write code that would coax this behavior out? Or does some mechanism in calls to fmt.Fprintf synchronize the writing?
A simple approach to controlling concurrent access is via a service goroutine, receiving messages from a channel. This goroutine would have sole access to the file. Access would therefore be sequential, without any race problems.
Channels do a good job of interleaving requests. The clients write to the channel instead of directly to the file. Messages on the channel are automatically interleaved for you.
The benefit of this approach over simply using a Mutex is that you start viewing your program as a collection of microservices. This is the CSP way and leads to easy composition of large systems from smaller components.
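A minimal sketch of that service-goroutine pattern (the names and the log.txt filename are illustrative, not from the original post):

package main

import (
    "fmt"
    "os"
    "sync"
)

// writer is the only goroutine that touches the file, so writes are sequential.
func writer(f *os.File, lines <-chan string, done chan<- struct{}) {
    for line := range lines {
        fmt.Fprintln(f, line)
    }
    close(done)
}

func main() {
    f, err := os.Create("log.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer f.Close()

    lines := make(chan string)
    done := make(chan struct{})
    go writer(f, lines, done)

    var wg sync.WaitGroup
    for i := 1; i <= 10; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            lines <- fmt.Sprintf("Printing out: %d", i) // send instead of writing directly
        }(i)
    }
    wg.Wait()
    close(lines) // all writers finished; let the service goroutine exit
    <-done       // wait until everything has been written
}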
There are many ways to control concurrent access. The easiest is to use a Mutex:
var mu sync.Mutex

func WriteToFile(i int, f *os.File, w *sync.WaitGroup) {
    mu.Lock()
    defer mu.Unlock()
    // etc...
}
As to why you're not seeing problems: Go uses operating system calls to implement file access, and those system calls are thread-safe (emphasis added):
According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread Interactions with Regular File Operations"):
All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: ...
Among the APIs subsequently listed are write() and writev(2). And among the effects that should be atomic across threads (and processes) are updates of the file offset. However, on Linux before version 3.14, this was not the case: if two processes that share an open file description (see open(2)) perform a write() (or writev(2)) at the same time, then the I/O operations were not atomic with respect to updating the file offset, with the result that the blocks of data output by the two processes might (incorrectly) overlap. This problem was fixed in Linux 3.14.
I would still use a lock, though, since Go code is not automatically thread-safe (two goroutines modifying the same variable will result in strange behavior).

Communication with other Go process

I have a program that reads a filename from the console and executes go run filename.go.
// main.go
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "os/exec"
)

func main() {
    console := bufio.NewReader(os.Stdin)
    fmt.Print("Enter a filename: ")
    input, err := console.ReadString('\n')
    if err != nil {
        log.Fatalln(err)
    }
    input = input[:len(input)-1]
    gorun := exec.Command("go", "run", input)
    result, err := gorun.Output()
    if err != nil {
        log.Println(err)
    }
    fmt.Println("---", input, "Result ---")
    fmt.Println(string(result))
}
In the same directory, I have another file like this.
// hello.go
package main

import "fmt"

func main() {
    fmt.Println("Hello, World!")
}
When I input "hello.go" in the console, that file is run, and its output gets returned to the parent Go process. However, I have another program like this.
// count.go
package main

import (
    "fmt"
    "time"
)

func main() {
    i := 0
    for {
        time.Sleep(time.Second)
        i++
        fmt.Println(i)
    }
}
Except, because this program never returns, my parent process is left hanging forever. Is there a way to communicate with other Go processes? I'm thinking of something like channels between goroutines, but between processes. I need to be able to receive live stdout from the child process.
The problem I'm trying to solve is dynamically executing Go programs from a directory. Go files will be added, removed, and modified daily. I'm trying to make something like the Go Playground. The main process is a web server serving web pages, so I can't shut it down all the time to modify code.
Don't use go run; you need to do what go run does yourself, so that the Go program is a direct child of your server process.
Using go build -o path_to/binary source_file.go will give you more control. Then you can directly execute and communicate with the resulting binary.
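A sketch of that approach, building count.go from the question and streaming its stdout live via StdoutPipe (the binary name child is my own choice):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os/exec"
)

func main() {
    // Build the target file into a binary first.
    if out, err := exec.Command("go", "build", "-o", "child", "count.go").CombinedOutput(); err != nil {
        log.Fatalf("build failed: %v\n%s", err, out)
    }

    cmd := exec.Command("./child")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        log.Fatal(err)
    }
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }

    // Stream the child's stdout line by line as it is produced.
    // count.go never exits, so a real server would also enforce a
    // timeout and call cmd.Process.Kill() when it is exceeded.
    scanner := bufio.NewScanner(stdout)
    for scanner.Scan() {
        fmt.Println("child says:", scanner.Text())
    }
    if err := cmd.Wait(); err != nil {
        log.Println(err)
    }
}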

What can create huge overhead of goroutines?

For an assignment we are using Go, and one of the things we are going to do is parse a UniProt database file line by line to collect UniProt records.
I prefer not to share too much code, but I have a working code snippet that parses such a file (2.5 GB) correctly in 48 s (measured using the Go time package). It parses the file iteratively, adding lines to a record until a record-end signal is reached (a full record), at which point metadata for the record is created. Then the record string is nulled, and a new record is collected line by line. Then I thought I would try to use goroutines.
I have gotten some tips from Stack Overflow before, and to the original code I simply added a function to handle everything concerning the metadata creation.
So, the code is doing:
1) create an empty record,
2) iterate the file and add lines to the record,
3) if a record stop signal is found (now we have a full record), give it to a goroutine to create the metadata,
4) null the record string and continue from 2).
I also added a sync.WaitGroup to make sure that I waited (at the end) for each goroutine to finish. I thought this would actually lower the time spent parsing the database file, as it would continue to parse while the goroutines acted on each record. However, the code seems to run for more than 20 minutes, indicating that something is wrong or the overhead has gone crazy. Any suggestions?
package main

import (
    "bufio"
    "crypto/sha1"
    "fmt"
    "io"
    "log"
    "os"
    "strings"
    "sync"
    "time"
)

type producer struct {
    parser uniprot
}

type unit struct {
    tag string
}

type uniprot struct {
    filenames     []string
    recordUnits   chan unit
    recordStrings map[string]string
}

func main() {
    p := producer{parser: uniprot{}}
    p.parser.recordUnits = make(chan unit, 1000000)
    p.parser.recordStrings = make(map[string]string)
    p.parser.collectRecords(os.Args[1])
}

func (u *uniprot) collectRecords(name string) {
    fmt.Println("file to open ", name)
    t0 := time.Now()
    wg := new(sync.WaitGroup)
    record := []string{}
    file, err := os.Open(name)
    errorCheck(err)
    scanner := bufio.NewScanner(file)
    for scanner.Scan() { // scan the file
        retText := scanner.Text()
        if strings.HasPrefix(retText, "//") {
            wg.Add(1)
            go u.handleRecord(record, wg)
            record = []string{}
        } else {
            record = append(record, retText)
        }
    }
    file.Close()
    wg.Wait()
    t1 := time.Now()
    fmt.Println(t1.Sub(t0))
}

func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
    defer wg.Done()
    recString := strings.Join(record, "\n")
    t := hashfunc(recString)
    u.recordUnits <- unit{tag: t}
    u.recordStrings[t] = recString
}

func hashfunc(record string) (hashtag string) {
    hash := sha1.New()
    io.WriteString(hash, record)
    hashtag = string(hash.Sum(nil))
    return
}

func errorCheck(err error) {
    if err != nil {
        log.Fatal(err)
    }
}
First of all: your code is not thread-safe, mainly because you're accessing a hashmap concurrently. Maps are not safe for concurrent use in Go and need to be locked. The faulty line in your code:
u.recordStrings[t] = recString
As this will blow up when you're running Go with GOMAXPROCS > 1, I'm assuming that you're not doing that. Make sure you're running your application with GOMAXPROCS=2 or higher to achieve parallelism.
The default value was 1 at the time of this answer (since Go 1.5 it defaults to the number of CPUs), so your code runs on one single OS thread which, of course, can't be scheduled on two CPUs or CPU cores simultaneously. Example:
$ GOMAXPROCS=2 go run udb.go uniprot_sprot_viruses.dat
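To make that map write safe, a minimal sketch (my own addition, not from the original answer) guards it with a sync.Mutex:

var mu sync.Mutex

func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
    defer wg.Done()
    recString := strings.Join(record, "\n")
    t := hashfunc(recString)
    u.recordUnits <- unit{tag: t}
    mu.Lock()
    u.recordStrings[t] = recString // only one goroutine writes at a time
    mu.Unlock()
}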
Lastly: pull the values from the channel, or your program will not terminate. You're creating a deadlock if the number of goroutines exceeds your channel's capacity. I tested with a 76 MiB file of data; you said your file was about 2.5 GB. My file yielded 16347 entries. Assuming linear growth, your file will exceed 1e6 records, so there are not enough slots in the channel and your program will deadlock, giving no result while accumulating goroutines that never get to run, only to fail at the end (miserably).
So the solution should be to add a goroutine that pulls the values from the channel and does something with them, as in the sketch below.
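A minimal sketch of such a consumer, wired into collectRecords (the variable names are mine):

// Start this before the scan loop in collectRecords:
done := make(chan struct{})
go func() {
    for rec := range u.recordUnits {
        _ = rec.tag // do something useful with each record here
    }
    close(done)
}()

// After wg.Wait() returns, no goroutine can send anymore:
close(u.recordUnits)
<-done // wait for the consumer to drain the channel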
As a side note: If you're worried about performance, do not use strings as they're always copied. Use []byte instead.
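For example, the hashing step could work on bytes directly; hex-encoding the digest also gives a printable tag, unlike string(hash.Sum(nil)), which yields raw bytes (a sketch, assuming callers are changed to pass []byte):

import (
    "crypto/sha1"
    "encoding/hex"
)

func hashfunc(record []byte) string {
    sum := sha1.Sum(record) // returns a [20]byte array
    return hex.EncodeToString(sum[:])
}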
