Concurrency issues with crawler - go

I am trying to build a concurrent crawler based on the Tour of Go and a few other SO answers. What I have so far is below, but I think it has two subtle issues.
1. Sometimes I get 16 URLs in the response and sometimes 17 (debug print in main). I know this because when I swap WriteToSlice for Read, 'Read: end, counter = ' is sometimes never reached, and that always happens when I get 16 URLs.
2. I have trouble with the err channel: I get no messages on it, even when I run Crawl with an address like www.golang.org, which has no scheme, so an error should be sent via the err channel.
Concurrency is a really difficult topic; help and advice will be appreciated.
package main
import (
"fmt"
"net/http"
"sync"
"golang.org/x/net/html"
)
type urlCache struct {
urls map[string]struct{}
sync.Mutex
}
func (v *urlCache) Set(url string) bool {
v.Lock()
defer v.Unlock()
_, exist := v.urls[url]
v.urls[url] = struct{}{}
return !exist
}
func newURLCache() *urlCache {
return &urlCache{
urls: make(map[string]struct{}),
}
}
type results struct {
data chan string
err chan error
}
func newResults() *results {
return &results{
data: make(chan string, 1),
err: make(chan error, 1),
}
}
func (r *results) close() {
close(r.data)
close(r.err)
}
func (r *results) WriteToSlice(s *[]string) {
for {
select {
case data := <-r.data:
*s = append(*s, data)
case err := <-r.err:
fmt.Println("e ", err)
}
}
}
func (r *results) Read() {
fmt.Println("Read: start")
counter := 0
for c := range r.data {
fmt.Println(c)
counter++
}
fmt.Println("Read: end, counter = ", counter)
}
func crawl(url string, depth int, wg *sync.WaitGroup, cache *urlCache, res *results) {
defer wg.Done()
if depth == 0 || !cache.Set(url) {
return
}
response, err := http.Get(url)
if err != nil {
res.err <- err
return
}
defer response.Body.Close()
node, err := html.Parse(response.Body)
if err != nil {
res.err <- err
return
}
urls := grablUrls(response, node)
res.data <- url
for _, url := range urls {
wg.Add(1)
go crawl(url, depth-1, wg, cache, res)
}
}
func grablUrls(resp *http.Response, node *html.Node) []string {
var f func(*html.Node) []string
var results []string
f = func(n *html.Node) []string {
if n.Type == html.ElementNode && n.Data == "a" {
for _, a := range n.Attr {
if a.Key != "href" {
continue
}
link, err := resp.Request.URL.Parse(a.Val)
if err != nil {
continue
}
results = append(results, link.String())
}
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
return results
}
res := f(node)
return res
}
// Crawl ...
func Crawl(url string, depth int) []string {
wg := &sync.WaitGroup{}
output := &[]string{}
visited := newURLCache()
results := newResults()
defer results.close()
wg.Add(1)
go crawl(url, depth, wg, visited, results)
go results.WriteToSlice(output)
// go results.Read()
wg.Wait()
return *output
}
func main() {
r := Crawl("https://www.golang.org", 2)
// r := Crawl("www.golang.org", 2) // no schema, error should be generated and send via err
fmt.Println(len(r))
}

Both of your questions, 1 and 2, are the result of the same bug.
In Crawl() you are not waiting for this goroutine to finish: go results.WriteToSlice(output). When the last crawl() call finishes, the wait group is released and the output is returned and printed before WriteToSlice has finished draining the data and err channels. So what happens is this:
1. crawl() finishes, placing data in results.data and results.err.
2. The wait group's Wait() unblocks, causing main() to print the length of the result []string.
3. WriteToSlice receives the last data (or err) item from the channel, too late to be counted.
You need to return from Crawl() not only when all the data has been written to the channel, but also when the channel has been read in its entirety (including the buffer). A good way to do this is to close channels when you are sure you are done with them. Organized this way, you can block on the goroutine that is draining the channels, and instead of using the wait group to release main, you wait until the channels are completely drained.
See this Go by Example page: https://gobyexample.com/closing-channels. Remember that after you close a channel, receivers can still read from it until the last queued item is taken. So you can close a buffered channel, and the reader will still get all the items that were queued in it.
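To make the point about closed buffered channels concrete, here is a minimal sketch. The helper name drain is only for illustration and is not part of the crawler above.

```go
package main

import "fmt"

// drain reads every value still queued in ch and returns them in order.
// The range loop ends once ch is closed and its buffer is empty.
func drain(ch <-chan string) []string {
	var out []string
	for v := range ch {
		out = append(out, v)
	}
	return out
}

func main() {
	ch := make(chan string, 3)
	ch <- "a"
	ch <- "b"
	close(ch) // closing does not discard the two queued values

	fmt.Println(drain(ch)) // [a b]

	// Receiving from a closed, empty channel yields the zero value and ok == false.
	v, ok := <-ch
	fmt.Printf("%q %v\n", v, ok) // "" false
}
```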
Some of the code structure could change to make this cleaner, but here is a quick way to fix your program. Change Crawl to block on WriteToSlice: close the data channel once all crawl goroutines have finished, and wait for WriteToSlice to return.
// Crawl ...
func Crawl(url string, depth int) []string {
wg := &sync.WaitGroup{}
output := &[]string{}
visited := newURLCache()
results := newResults()
go func() {
wg.Add(1)
go crawl(url, depth, wg, visited, results)
wg.Wait()
// All data is written, this makes `WriteToSlice()` unblock
close(results.data)
}()
// This will block until results.data is closed
results.WriteToSlice(output)
close(results.err)
return *output
}
Then in WriteToSlice, you have to check for the closed channel to exit the for loop:
func (r *results) WriteToSlice(s *[]string) {
for {
select {
case data, open := <-r.data:
if !open {
return // All data done
}
*s = append(*s, data)
case err := <-r.err:
fmt.Println("e ", err)
}
}
}
Here is the full code: https://play.golang.org/p/GBpGk-lzrhd (it won't work in the playground)
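One possible variation, sketched here under my own assumptions rather than taken from the linked code: close both channels after the wait group is released and keep receiving until both are drained, so a late error is not left sitting in the buffer. The names collect and run are hypothetical, and an empty input string stands in for a failed fetch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// collect drains both channels until each has been closed.
// Setting a closed channel variable to nil removes its case from the select,
// because receiving from a nil channel blocks forever.
func collect(data <-chan string, errs <-chan error) ([]string, []error) {
	var out []string
	var failed []error
	for data != nil || errs != nil {
		select {
		case d, ok := <-data:
			if !ok {
				data = nil
				continue
			}
			out = append(out, d)
		case e, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			failed = append(failed, e)
		}
	}
	return out, failed
}

// run starts one goroutine per input; an empty input simulates a failure.
func run(inputs []string) ([]string, []error) {
	data := make(chan string)
	errs := make(chan error)
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(in string) {
			defer wg.Done()
			if in == "" {
				errs <- errors.New("empty input")
				return
			}
			data <- in
		}(in)
	}
	// Close both channels only after every sender has returned.
	go func() {
		wg.Wait()
		close(data)
		close(errs)
	}()
	return collect(data, errs)
}

func main() {
	out, failed := run([]string{"a", "", "b"})
	fmt.Println(len(out), len(failed)) // 2 1
}
```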

Related

How to prioritize goroutines

I want to call two endpoints (A and B) at the same time. If I get a 200 response from both, I need to use the response from A; otherwise I use the response from B.
If B returns first I need to wait for A; in other words, I must use A whenever A returns 200.
Can you help me with the pattern?
Thank you
Wait for a result from A. If the result is not good, then wait for a result from B. Use a buffered channel for the B result so that the sender does not block when A is good.
In the following snippet, fnA() and fnB() are functions that issue requests to the endpoints, consume the response and clean up. I assume that the result is a []byte, but it could be the result of decoding JSON or something else. Here's an example for fnA:
func fnA() ([]byte, error) {
r, err := http.Get("http://example.com/a")
if err != nil {
return nil, err
}
defer r.Body.Close() // <-- Important: close the response body!
if r.StatusCode != 200 {
return nil, errors.New("bad response")
}
return ioutil.ReadAll(r.Body)
}
Define a type to hold the result and error.
type response struct {
result []byte
err error
}
With those preliminaries done, here's how to prioritize A over B.
a := make(chan response)
go func() {
result, err := fnA()
a <- response{result, err}
}()
b := make(chan response, 1) // Size > 0 is important!
go func() {
result, err := fnB()
b <- response{result, err}
}()
resp := <-a
if resp.err != nil {
resp = <-b
if resp.err != nil {
// handle error. A and B both failed.
}
}
result := resp.result
If the application does not execute code concurrently with A and B, then there's no need to use a goroutine for A:
b := make(chan response, 1) // Size > 0 is important!
go func() {
result, err := fnB()
b <- response{result, err}
}()
result, err := fnA()
if err != nil {
resp := <-b // declare resp here; it does not exist in this variant yet
if resp.err != nil {
// handle error. A and B both failed.
}
result = resp.result
}
I suggest something like the following. It is a bulkier solution, but it lets you query more than two endpoints if you need to.
func endpointPriorityTest() {
const (
sourceA = "a"
sourceB = "b"
sourceC = "c"
)
type endpointResponse struct {
source string
response *http.Response
error
}
// Buffered so every goroutine can send and exit even after the reader stops
epResponseChan := make(chan *endpointResponse, len(endpointsMap))
endpointsMap := map[string]string{
sourceA: "https://jsonplaceholder.typicode.com/posts/1",
sourceB: "https://jsonplaceholder.typicode.com/posts/10",
sourceC: "https://jsonplaceholder.typicode.com/posts/100",
}
for source, endpointURL := range endpointsMap {
source := source
endpointURL := endpointURL
go func(respChan chan<- *endpointResponse) {
// You can add a delay so that the response from A takes longer than from B
// and look to the result map
// if source == sourceA {
// time.Sleep(time.Second)
// }
resp, err := http.Get(endpointURL)
respChan <- &endpointResponse{
source: source,
response: resp,
error: err,
}
}(epResponseChan)
}
respCache := make(map[string]*http.Response)
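// Read at most one response per endpoint, so the loop also ends if A fails
// (ranging over a channel that is never closed would block forever)
for i := 0; i < len(endpointsMap); i++ {
epResp := <-epResponseChan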
// Skips failed requests
if epResp.error != nil {
continue
}
// Save successful response to cache map
respCache[epResp.source] = epResp.response
// Interrupt reading channel if we've got an response from source A
if epResp.source == sourceA {
break
}
}
fmt.Println("result map: ", respCache)
// Now we can use data from cache map
// resp, ok :=respCache[sourceA]
// if ok{
// ...
// }
}
@Zombo's answer has the correct logic flow. Building on it, I would suggest one addition: leveraging the context package.
Basically, any potentially blocking task should accept a context.Context so the call chain can clean up more efficiently in the event of early cancellation.
In your case, context.Context can also be used to abort the B call early if the A call succeeds:
func failoverResult(ctx context.Context) *http.Response {
// wrap the (parent) context
ctx, cancel := context.WithCancel(ctx)
// if we return early i.e. if `fnA()` completes first
// this will "cancel" `fnB()`'s request.
defer cancel()
b := make(chan *http.Response, 1)
go func() {
b <- fnB(ctx)
}()
resp := fnA(ctx)
// resp is nil when the request itself failed
if resp == nil || resp.StatusCode != 200 {
resp = <-b
}
return resp
}
fnA (and fnB) would look something like this:
func fnA(ctx context.Context) (resp *http.Response) {
req, _ := http.NewRequestWithContext(ctx, "GET", aUrl, nil) // nil body for a GET
resp, _ = http.DefaultClient.Do(req) // TODO: check errors
return
}
In Go, channels are normally used for communicating between goroutines.
You can orchestrate your scenario with the following sample code.
Basically, you pass a channel into callB to hold its response. You don't need to run callA in a goroutine, because you always need the result from that endpoint/service.
package main
import (
"fmt"
"time"
)
func main() {
resB := make(chan int)
go callB(resB)
res := callA()
if res == 200 {
fmt.Print("No Need for B")
} else {
res = <-resB
fmt.Printf("Response from B : %d", res)
}
}
func callA() int {
time.Sleep(1000)
return 200
}
func callB(res chan int) {
time.Sleep(500)
res <- 200
}
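Update: as suggested in a comment, the code above leaks the callB goroutine when B's result is never read. Giving the channel a buffer of 1 fixes that (the sleeps now also use explicit time units):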
package main
import (
"fmt"
"time"
)
func main() {
resB := make(chan int, 1)
go callB(resB)
res := callA()
if res == 200 {
fmt.Print("No Need for B")
} else {
res = <-resB
fmt.Printf("Response from B : %d", res)
}
}
func callA() int {
time.Sleep(1000 * time.Millisecond)
return 200
}
func callB(res chan int) {
time.Sleep(500 * time.Millisecond)
res <- 200
}

Best time to close channel, when iterating over channel

I am playing around with Go and I created this little app to make several concurrent API calls using goroutines.
The app works, but after the calls complete it gets stuck, which makes sense: it cannot exit the range c loop because the channel is never closed.
I am not sure where best to close the channel in this pattern.
package main
import "fmt"
import "net/http"
func main() {
links := []string{
"https://github.com/fabpot",
"https://github.com/andrew",
"https://github.com/taylorotwell",
"https://github.com/egoist",
"https://github.com/HugoGiraudel",
}
checkUrls(links)
}
func checkUrls(urls []string) {
c := make(chan string)
for _, link := range urls {
go checkUrl(link, c)
}
for msg := range c {
fmt.Println(msg)
}
close(c) //this won't get hit
}
func checkUrl(url string, c chan string) {
_, err := http.Get(url)
if err != nil {
c <- "We could not reach:" + url
} else {
c <- "Success reaching the website:" + url
}
}
You close a channel when there are no more values to send, so in this case it's when all checkUrl goroutines have completed.
var wg sync.WaitGroup
func checkUrls(urls []string) {
c := make(chan string)
for _, link := range urls {
wg.Add(1)
go checkUrl(link, c)
}
go func() {
wg.Wait()
close(c)
}()
for msg := range c {
fmt.Println(msg)
}
}
func checkUrl(url string, c chan string) {
defer wg.Done()
_, err := http.Get(url)
if err != nil {
c <- "We could not reach:" + url
} else {
c <- "Success reaching the website:" + url
}
}
(Note that the error from http.Get only reflects connection and protocol errors. It does not cover HTTP server errors such as a 404 or 500 status, which you probably also care about, given that you are checking paths and not just hosts.)
When writing Go programs with channels and goroutines, always think about who (which function) owns a channel. I prefer to let the function that owns a channel close it. If I were to write this, I would do it as shown below.
Note: A better way to handle situations like this is the fan-out, fan-in concurrency pattern. See Go Concurrency Patterns: Pipelines and cancellation (https://blog.golang.org/pipelines).
package main
import "fmt"
import "net/http"
import "sync"
func main() {
links := []string{
"https://github.com/fabpot",
"https://github.com/andrew",
"https://github.com/taylorotwell",
"https://github.com/egoist",
"https://github.com/HugoGiraudel",
}
processURLS(links)
fmt.Println("End of Main")
}
func processURLS(links []string) {
resultsChan := checkUrls(links)
for msg := range resultsChan {
fmt.Println(msg)
}
}
func checkUrls(urls []string) chan string {
outChan := make(chan string)
go func(urls []string) {
defer close(outChan)
var wg sync.WaitGroup
for _, url := range urls {
wg.Add(1)
go checkUrl(&wg, url, outChan)
}
wg.Wait()
}(urls)
return outChan
}
func checkUrl(wg *sync.WaitGroup, url string, c chan string) {
defer wg.Done()
_, err := http.Get(url)
if err != nil {
c <- "We could not reach:" + url
} else {
c <- "Success reaching the website:" + url
}
}

Confusion regarding channel directions and blocking in Go

In a function definition, if a channel is an argument without a direction, does it have to send or receive something?
func makeRequest(url string, ch chan<- string, results chan<- string) {
start := time.Now()
resp, err := http.Get(url)
defer resp.Body.Close()
if err != nil {
fmt.Printf("%v", err)
}
resp, err = http.Post(url, "text/plain", bytes.NewBuffer([]byte("Hey")))
defer resp.Body.Close()
secs := time.Since(start).Seconds()
if err != nil {
fmt.Printf("%v", err)
}
// Cannot move past this.
ch <- fmt.Sprintf("%f", secs)
results <- <- ch
}
func MakeRequestHelper(url string, ch chan string, results chan string, iterations int) {
for i := 0; i < iterations; i++ {
makeRequest(url, ch, results)
}
for i := 0; i < iterations; i++ {
fmt.Println(<-ch)
}
}
func main() {
args := os.Args[1:]
threadString := args[0]
iterationString := args[1]
url := args[2]
threads, err := strconv.Atoi(threadString)
if err != nil {
fmt.Printf("%v", err)
}
iterations, err := strconv.Atoi(iterationString)
if err != nil {
fmt.Printf("%v", err)
}
channels := make([]chan string, 100)
for i := range channels {
channels[i] = make(chan string)
}
// results aggregate all the things received by channels in all goroutines
results := make(chan string, iterations*threads)
for i := 0; i < threads; i++ {
go MakeRequestHelper(url, channels[i], results, iterations)
}
resultSlice := make([]string, threads*iterations)
for i := 0; i < threads*iterations; i++ {
resultSlice[i] = <-results
}
}
In the above code,
ch <- or <-results
seems to be blocking every goroutine that executes makeRequest.
I am new to Go's concurrency model. I understand that sending to and receiving from a channel blocks, but I find it difficult to work out what is blocking what in this code.
I'm not really sure what you are trying to do; it seems quite convoluted. I suggest you read up on how to use channels:
https://tour.golang.org/concurrency/2
That said, there is so much going on in your code that it was easier to strip it down to something simpler (it can be simplified further). I left comments to explain the code.
package main
import (
"fmt"
"io/ioutil"
"log"
"net/http"
"sync"
"time"
)
// using structs is a nice way to organize your code
type Worker struct {
wg sync.WaitGroup
semaphore chan struct{}
result chan Result
client http.Client
}
// group returns so that you don't have to send to many channels
type Result struct {
duration float64
results string
}
// closing your channels will stop the for loop in main
func (w *Worker) Close() {
close(w.semaphore)
close(w.result)
}
func (w *Worker) MakeRequest(url string) {
// a semaphore is a simple way to rate limit the amount of goroutines running at any single point of time
// google them, Go uses them often
w.semaphore <- struct{}{}
defer func() {
w.wg.Done()
<-w.semaphore
}()
start := time.Now()
resp, err := w.client.Get(url)
if err != nil {
log.Println("error", err)
return
}
defer resp.Body.Close()
// don't have any examples where I need to also POST anything but the point should be made
// resp, err = http.Post(url, "text/plain", bytes.NewBuffer([]byte("Hey")))
// if err != nil {
// log.Println("error", err)
// return
// }
// defer resp.Body.Close()
secs := time.Since(start).Seconds()
b, err := ioutil.ReadAll(resp.Body)
if err != nil {
log.Println("error", err)
return
}
w.result <- Result{duration: secs, results: string(b)}
}
func main() {
urls := []string{"https://facebook.com/", "https://twitter.com/", "https://google.com/", "https://youtube.com/", "https://linkedin.com/", "https://wordpress.org/",
"https://instagram.com/", "https://pinterest.com/", "https://wikipedia.org/", "https://wordpress.com/", "https://blogspot.com/", "https://apple.com/",
}
workerNumber := 5
worker := Worker{
semaphore: make(chan struct{}, workerNumber),
result: make(chan Result),
client: http.Client{Timeout: 5 * time.Second},
}
// use sync groups to allow your code to wait for
// all your goroutines to finish
for _, url := range urls {
worker.wg.Add(1)
go worker.MakeRequest(url)
}
// by declaring wait and close as a separate goroutine
// I can get to the for loop below and iterate on the results
// in a non blocking fashion
go func() {
worker.wg.Wait()
worker.Close()
}()
// do something with the results channel
for res := range worker.result {
fmt.Printf("Request took %.2f seconds.\nResults: %s\n\n", res.duration, res.results)
}
}
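To see the semaphore in isolation, here is a small sketch that measures how many tasks actually run at once. The function maxConcurrent is a hypothetical name of mine, and the short sleep stands in for the HTTP request.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// maxConcurrent runs n tasks with at most limit running at once and
// returns the highest number of tasks observed running simultaneously.
func maxConcurrent(n, limit int) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var running, peak int64

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks while limit tasks are running
			defer func() { <-sem }() // release the slot

			cur := atomic.AddInt64(&running, 1)
			// Record a new peak if this is the highest count seen so far.
			for {
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			time.Sleep(time.Millisecond) // stand-in for the HTTP request
			atomic.AddInt64(&running, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println(maxConcurrent(20, 3) <= 3) // true
}
```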
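A nil channel (for example, an element of a slice of channels that was created with make while the channel itself never was) blocks forever on both send and receive. In the code as shown the channels are created, but they are unbuffered, so ch <- in makeRequest blocks until another goroutine receives from ch, and the only receive from ch happens later in that same goroutine. I'm not sure exactly what you're trying to do here, but that's the basic problem.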
See https://golang.org/doc/effective_go.html#channels for an explanation of how channels work.
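A quick way to see which sends would block is a select with a default case. This is only a small sketch; trySend is a hypothetical helper name.

```go
package main

import "fmt"

// trySend reports whether a send on ch can proceed immediately.
func trySend(ch chan string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // a plain `ch <- v` would block here
	}
}

func main() {
	var nilCh chan string            // nil: send and receive block forever
	unbuffered := make(chan string)  // send blocks until a receiver is ready
	buffered := make(chan string, 1) // send succeeds while the buffer has room

	fmt.Println(trySend(nilCh, "x"))      // false
	fmt.Println(trySend(unbuffered, "x")) // false
	fmt.Println(trySend(buffered, "x"))   // true
	fmt.Println(trySend(buffered, "y"))   // false: the buffer is full
}
```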

goroutine didn't take effect in Crawl example of 'A Tour of Go'

Following the hints in the Crawl example of 'A Tour of Go', I modified the Crawl function, and I wonder why 'go Crawl' failed to spawn another goroutine: only one URL was printed.
Is there anything wrong with my modification?
My modification is listed below:
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// TODO: Fetch URLs in parallel.
// TODO: Don't fetch the same URL twice.
// This implementation doesn't do either:
if depth <= 0 {
fmt.Printf("depth <= 0 return")
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
crawled.mux.Lock()
crawled.c[url]++
crawled.mux.Unlock()
for _, u := range urls {
//crawled.mux.Lock()
if cnt, ok := crawled.c[u]; ok {
cnt++
} else {
fmt.Println("go ...", u)
go Crawl(u, depth-1, fetcher)
}
//crawled.mux.Unlock()
//Crawl(u, depth-1, fetcher)
}
return
}
type crawledUrl struct {
c map[string]int
mux sync.Mutex
}
var crawled = crawledUrl{c: make(map[string]int)}
Your program has no synchronization between its goroutines.
So the output of this code is nondeterministic: the main goroutine will most likely finish before the others get to run.
Remember that the main goroutine never waits for other goroutines to terminate unless you explicitly synchronize with them,
for example with channels or the utilities in the sync package.
Here is a version that does so.
type fetchState struct {
mu sync.Mutex
fetched map[string]bool
}
func (f *fetchState) CheckAndMark(url string) bool {
f.mu.Lock()
defer f.mu.Unlock()
if f.fetched[url] {
return true
}
f.fetched[url] = true
return false
}
func mkFetchState() *fetchState {
f := &fetchState{}
f.fetched = make(map[string]bool)
return f
}
func CrawlConcurrentMutex(url string, fetcher Fetcher, f *fetchState) {
if f.CheckAndMark(url) {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
var done sync.WaitGroup
for _, u := range urls {
done.Add(1)
go func(u string) {
defer done.Done()
CrawlConcurrentMutex(u, fetcher, f)
}(u) // Without the u argument there is a race
}
done.Wait()
return
}
Pay attention to how sync.WaitGroup is used here; its documentation explains the details.
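If you want to run the idea outside the Tour, here is a self-contained sketch. It swaps the Tour's Fetcher for a hypothetical in-memory map (fakeWeb), and the names visited, crawl and crawlAll are mine, so treat it as an illustration only.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeWeb maps a URL to the links found on that page (made-up data).
type fakeWeb map[string][]string

type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

// firstVisit marks url as seen and reports whether this was the first visit.
func (v *visited) firstVisit(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

// crawl visits url and, concurrently, every page it links to.
// It returns only after all goroutines it started have finished.
func crawl(url string, web fakeWeb, v *visited) {
	if !v.firstVisit(url) {
		return
	}
	var wg sync.WaitGroup
	for _, u := range web[url] {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			crawl(u, web, v)
		}(u)
	}
	wg.Wait()
}

// crawlAll returns the number of distinct URLs reached from start.
func crawlAll(start string, web fakeWeb) int {
	v := &visited{seen: make(map[string]bool)}
	crawl(start, web, v)
	return len(v.seen) // safe: every goroutine has finished by now
}

func main() {
	web := fakeWeb{
		"a": {"b", "c"},
		"b": {"a", "d"},
		"c": {"d"},
	}
	fmt.Println(crawlAll("a", web)) // 4
}
```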

How do I handle errors in a worker pool using WaitGroup?

I have a problem using sync.WaitGroup and select together. If you look at the following HTTP request pool, you will notice that if an error occurs it is never reported, because wg.Wait() blocks and nothing reads from the errors channel anymore.
package pool
import (
"fmt"
"log"
"net/http"
"sync"
)
var (
MaxPoolQueue = 100
MaxPoolWorker = 10
)
type Pool struct {
wg *sync.WaitGroup
queue chan *http.Request
errors chan error
}
func NewPool() *Pool {
return &Pool{
wg: &sync.WaitGroup{},
queue: make(chan *http.Request, MaxPoolQueue),
errors: make(chan error),
}
}
func (p *Pool) Add(r *http.Request) {
p.wg.Add(1)
p.queue <- r
}
func (p *Pool) Run() error {
for i := 0; i < MaxPoolWorker; i++ {
go p.doWork()
}
select {
case err := <-p.errors:
return err
default:
p.wg.Wait()
}
return nil
}
func (p *Pool) doWork() {
for r := range p.queue {
fmt.Printf("Request to %s\n", r.Host)
p.wg.Done()
_, err := http.DefaultClient.Do(r)
if err != nil {
log.Fatal(err)
p.errors <- err
} else {
fmt.Printf("no error\n")
}
}
}
Source can be found here
How can I still use WaitGroup but also get errors from go routines?
I found the answer myself while writing the question, and since I think it is an interesting case I would like to share it.
The trick to using sync.WaitGroup and a chan together is that we wrap:
select {
case err := <-p.errors:
return err
default:
p.wg.Done()
}
Together in a for loop:
for {
select {
case err := <-p.errors:
return err
default:
p.wg.Done()
}
}
In this case select will always check for errors and wait if nothing happens :)
It looks a bit like the fail-fast mechanism enabled by the Tomb library (Tomb V2 GoDoc):
The tomb package handles clean goroutine tracking and termination.
If any of the tracked goroutines returns a non-nil error, or the Kill or Killf method is called by any goroutine in the system (tracked or not), the tomb Err is set, Alive is set to false, and the Dying channel is closed to flag that all tracked goroutines are supposed to willingly terminate as soon as possible.
Once all tracked goroutines terminate, the Dead channel is closed, and Wait unblocks and returns the first non-nil error presented to the tomb via a result or an explicit Kill or Killf method call, or nil if there were no errors.
You can see an example in this playground:
(extract)
// start runs all the given functions concurrently
// until either they all complete or one returns an
// error, in which case it returns that error.
//
// The functions are passed a channel which will be closed
// when the function should stop.
func start(funcs []func(stop <-chan struct{}) error) error {
var tomb tomb.Tomb
var wg sync.WaitGroup
allDone := make(chan struct{})
// Start all the functions.
for _, f := range funcs {
f := f
wg.Add(1)
go func() {
defer wg.Done()
if err := f(tomb.Dying()); err != nil {
tomb.Kill(err)
}
}()
}
// Start a goroutine to wait for them all to finish.
go func() {
wg.Wait()
close(allDone)
}()
// Wait for them all to finish, or one to fail
select {
case <-allDone:
case <-tomb.Dying():
}
tomb.Done()
return tomb.Err()
}
A simpler implementation would be like below. (Check in play.golang: https://play.golang.org/p/TYxxsDRt5Wu)
package main
import "fmt"
import "sync"
import "time"
type Error struct {
message string
}
func (e Error) Error() string {
return e.message
}
func main() {
var wg sync.WaitGroup
waitGroupLength := 8
errChannel := make(chan error, 1)
// Setup waitgroup to match the number of go routines we'll launch off
wg.Add(waitGroupLength)
finished := make(chan bool, 1) // this along with wg.Wait() are why the error handling works and doesn't deadlock
for i := 0; i < waitGroupLength; i++ {
go func(i int) {
fmt.Printf("Go routine %d executed\n", i+1)
time.Sleep(time.Duration(waitGroupLength - i))
time.Sleep(0) // only here so the time import is needed
if i%4 == 1 {
errChannel <- Error{fmt.Sprintf("Errored on routine %d", i+1)}
}
// Mark the wait group as Done so it does not hang
wg.Done()
}(i)
}
go func() {
wg.Wait()
close(finished)
}()
L:
for {
select {
case <-finished:
break L // this will break from loop
case err := <-errChannel:
if err != nil {
fmt.Println("error ", err)
// handle your error
}
}
}
fmt.Println("Executed all go routines")
}

Resources