Exercise: Web Crawler - print not working - go

I'm a golang newbie and currently working on Exercise: Web Crawler.
I simply put the keyword 'go' in front of every place where func Crawl is invoked, hoping it would run in parallel, but fmt.Printf doesn't work and prints nothing. Nothing else in the original code was changed. Would someone like to give me a hand?
func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		go Crawl(u, depth-1, fetcher)
	}
	return
}

func main() {
	go Crawl("https://golang.org/", 4, fetcher)
}

According to the spec
Program execution begins by initializing the main package and then invoking the function main. When that function invocation returns, the program exits. It does not wait for other (non-main) goroutines to complete.
Therefore you have to explicitly wait for the other goroutines to finish in the main() function.
One way is to simply add a time.Sleep() at the end of main() for however long you think the other goroutines need (e.g. maybe 1 second in this case).
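For example, a minimal sketch of that approach (you also need to import "time"; the duration is just a guess, so this is not reliable):

func main() {
	go Crawl("https://golang.org/", 4, fetcher)
	time.Sleep(1 * time.Second) // hope every Crawl goroutine finishes within a second
}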
A cleaner way is to use sync.WaitGroup, as follows:
func Crawl(wg *sync.WaitGroup, url string, depth int, fetcher Fetcher) {
	defer wg.Done()
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		wg.Add(1)
		go Crawl(wg, u, depth-1, fetcher)
	}
	return
}

func main() {
	wg := &sync.WaitGroup{}
	wg.Add(1)
	// The first call does not need to be a goroutine, since its recursive calls are goroutines.
	Crawl(wg, "https://golang.org/", 4, fetcher)
	//time.Sleep(1000 * time.Millisecond)
	wg.Wait()
}
This code keeps a counter in the WaitGroup: it is incremented with wg.Add(), decremented with wg.Done(), and wg.Wait() blocks until it reaches zero.
Confirm it in go playground: https://play.golang.org/p/WqQBqe6iFLp
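As a minimal, self-contained sketch of just that counter mechanism (separate from the crawler):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1) // counter +1 before the goroutine starts
		go func(n int) {
			defer wg.Done() // counter -1 when the goroutine finishes
			fmt.Println("worker", n)
		}(i)
	}
	wg.Wait() // blocks until the counter is back to zero
}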

Related

A Tour of Go Web Crawler: how is this channel being used concurrently?

I'm looking at this MIT solution to this A Tour of Go exercise:
//
// Concurrent crawler with channels
//

func worker(url string, ch chan []string, fetcher Fetcher) {
	urls, err := fetcher.Fetch(url)
	if err != nil {
		ch <- []string{}
	} else {
		ch <- urls
	}
}

func coordinator(ch chan []string, fetcher Fetcher) {
	n := 1
	fetched := make(map[string]bool)
	for urls := range ch {
		for _, u := range urls {
			if fetched[u] == false {
				fetched[u] = true
				n += 1
				go worker(u, ch, fetcher)
			}
		}
		n -= 1
		if n == 0 {
			break
		}
	}
}

func ConcurrentChannel(url string, fetcher Fetcher) {
	ch := make(chan []string)
	go func() {
		ch <- []string{url}
	}()
	coordinator(ch, fetcher)
}

func main() {
	//fmt.Printf("=== Serial===\n")
	//Serial("http://golang.org/", fetcher, make(map[string]bool))
	//fmt.Printf("=== ConcurrentMutex ===\n")
	//ConcurrentMutex("http://golang.org/", fetcher, makeState())
	fmt.Printf("=== ConcurrentChannel ===\n")
	ConcurrentChannel("http://golang.org/", fetcher)
}
What I don't understand is how this solution is concurrent. How does the use of this channel make the fetching of the urls parallel?
The way I see it, as you loop through the urls present on one 'page', you start a worker on any url that hasn't been 'fetched' yet.
But each worker just retrieves all the urls on the page it was called on and puts them in the channel, so nothing seems to be happening in 'parallel'.
Rather, it feels like urls are just being placed into the channel page by page and marked as 'fetched' iteratively.

Concurrency issues with crawler

I'm trying to build a concurrent crawler based on the Tour and some other SO answers about it. What I currently have is below, but I think there are two subtle issues.
Sometimes I get 16 urls in the response and sometimes 17 (debug print in main). I know this because when I swap WriteToSlice for Read, 'Read: end, counter = ' is sometimes never reached in Read, and that always coincides with getting 16 urls.
I have trouble with the err channel: I get no messages on it, even when I run my main Crawl method with an address like www.golang.org, i.e. without a valid scheme, where an error should be sent via the err channel.
Concurrency is a really difficult topic; help and advice will be appreciated.
package main

import (
	"fmt"
	"net/http"
	"sync"

	"golang.org/x/net/html"
)

type urlCache struct {
	urls map[string]struct{}
	sync.Mutex
}

func (v *urlCache) Set(url string) bool {
	v.Lock()
	defer v.Unlock()
	_, exist := v.urls[url]
	v.urls[url] = struct{}{}
	return !exist
}

func newURLCache() *urlCache {
	return &urlCache{
		urls: make(map[string]struct{}),
	}
}

type results struct {
	data chan string
	err  chan error
}

func newResults() *results {
	return &results{
		data: make(chan string, 1),
		err:  make(chan error, 1),
	}
}

func (r *results) close() {
	close(r.data)
	close(r.err)
}

func (r *results) WriteToSlice(s *[]string) {
	for {
		select {
		case data := <-r.data:
			*s = append(*s, data)
		case err := <-r.err:
			fmt.Println("e ", err)
		}
	}
}

func (r *results) Read() {
	fmt.Println("Read: start")
	counter := 0
	for c := range r.data {
		fmt.Println(c)
		counter++
	}
	fmt.Println("Read: end, counter = ", counter)
}

func crawl(url string, depth int, wg *sync.WaitGroup, cache *urlCache, res *results) {
	defer wg.Done()
	if depth == 0 || !cache.Set(url) {
		return
	}

	response, err := http.Get(url)
	if err != nil {
		res.err <- err
		return
	}
	defer response.Body.Close()

	node, err := html.Parse(response.Body)
	if err != nil {
		res.err <- err
		return
	}

	urls := grablUrls(response, node)
	res.data <- url

	for _, url := range urls {
		wg.Add(1)
		go crawl(url, depth-1, wg, cache, res)
	}
}

func grablUrls(resp *http.Response, node *html.Node) []string {
	var f func(*html.Node) []string
	var results []string

	f = func(n *html.Node) []string {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key != "href" {
					continue
				}
				link, err := resp.Request.URL.Parse(a.Val)
				if err != nil {
					continue
				}
				results = append(results, link.String())
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
		return results
	}

	res := f(node)
	return res
}

// Crawl ...
func Crawl(url string, depth int) []string {
	wg := &sync.WaitGroup{}
	output := &[]string{}
	visited := newURLCache()
	results := newResults()
	defer results.close()

	wg.Add(1)
	go crawl(url, depth, wg, visited, results)
	go results.WriteToSlice(output)
	// go results.Read()
	wg.Wait()

	return *output
}

func main() {
	r := Crawl("https://www.golang.org", 2)
	// r := Crawl("www.golang.org", 2) // no schema, error should be generated and send via err
	fmt.Println(len(r))
}
Both your questions 1 and 2 are a result of the same bug.
In Crawl() you are not waiting for this goroutine to finish: go results.WriteToSlice(output). When the last crawl() finishes, the wait group is released and the output is returned and printed before the WriteToSlice function has finished with the data and err channels. So what has happened is this:
crawl() finishes, placing data in results.data and results.err.
The WaitGroup's Wait() unblocks, causing main() to print the length of the result []string.
Only then does WriteToSlice take the last data (or err) item off the channel.
You need to return from Crawl() not only when the data is done being written to the channel, but also when the channel has been read in its entirety (including the buffer). A good way to do this is to close channels when you are sure that you are done with them. By organizing your code this way, you can block on the goroutine that is draining the channels, and instead of using the wait group to release main, you wait until the channels are 100% done.
You can see this in gobyexample: https://gobyexample.com/closing-channels. Remember that when you close a channel, the channel can still be read until the last item is taken. So you can close a buffered channel, and the reader will still get all the items that were queued in the channel.
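A minimal sketch of that behaviour, separate from the crawler, just to show the channel semantics:

package main

import "fmt"

func main() {
	ch := make(chan string, 2)
	ch <- "first"
	ch <- "second"
	close(ch) // no more sends, but the buffered items are still there

	// range keeps receiving until the buffer is drained, then exits
	for v := range ch {
		fmt.Println(v)
	}
}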
There is some code structure that could change to make this cleaner, but here is a quick way to fix your program: change Crawl to block on WriteToSlice, close the data channel when the crawl function finishes, and wait for WriteToSlice to finish.
// Crawl ...
func Crawl(url string, depth int) []string {
	wg := &sync.WaitGroup{}
	output := &[]string{}
	visited := newURLCache()
	results := newResults()

	go func() {
		wg.Add(1)
		go crawl(url, depth, wg, visited, results)
		wg.Wait()

		// All data is written, this makes `WriteToSlice()` unblock
		close(results.data)
	}()

	// This will block until results.data is closed
	results.WriteToSlice(output)
	close(results.err)

	return *output
}
Then in WriteToSlice, you have to check for the closed channel in order to exit the for loop:
func (r *results) WriteToSlice(s *[]string) {
	for {
		select {
		case data, open := <-r.data:
			if !open {
				return // All data done
			}
			*s = append(*s, data)
		case err := <-r.err:
			fmt.Println("e ", err)
		}
	}
}
Here is the full code: https://play.golang.org/p/GBpGk-lzrhd (it won't run in the playground, since it needs network access).
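As a design note, an equivalent arrangement is to keep WriteToSlice in a goroutine, as in the original question, and have Crawl wait on a signal channel instead. This is only a sketch using the same helpers as above; readerDone is a name introduced here:

// Crawl ... (alternative arrangement, same idea: writers close data, reader signals when drained)
func Crawl(url string, depth int) []string {
	wg := &sync.WaitGroup{}
	output := &[]string{}
	visited := newURLCache()
	results := newResults()

	readerDone := make(chan struct{})
	go func() {
		results.WriteToSlice(output) // returns once results.data is closed and drained
		close(readerDone)
	}()

	wg.Add(1)
	go crawl(url, depth, wg, visited, results)
	wg.Wait()           // all writers finished
	close(results.data) // unblocks WriteToSlice
	<-readerDone        // wait until the reader has consumed everything
	close(results.err)

	return *output
}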

Goroutine didn't run as expected

I'm still learning Go and was doing the web crawler exercise linked here. The main part I implemented is as follows (other parts remain the same and can be found in the link).
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	cache.Set(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		if cache.Get(u) == false {
			fmt.Println("Next:", u)
			Crawl(u, depth-1, fetcher) // I want to parallelize this
		}
	}
	return
}

func main() {
	Crawl("https://golang.org/", 4, fetcher)
}

type SafeCache struct {
	v   map[string]bool
	mux sync.Mutex
}

func (c *SafeCache) Set(key string) {
	c.mux.Lock()
	c.v[key] = true
	c.mux.Unlock()
}

func (c *SafeCache) Get(key string) bool {
	return c.v[key]
}

var cache SafeCache = SafeCache{v: make(map[string]bool)}
When I ran the code above, the result was as expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
found: https://golang.org/pkg/ "Packages"
Next: https://golang.org/cmd/
not found: https://golang.org/cmd/
Next: https://golang.org/pkg/fmt/
found: https://golang.org/pkg/fmt/ "Package fmt"
Next: https://golang.org/pkg/os/
found: https://golang.org/pkg/os/ "Package os"
However, when I tried to parallelize the crawler (on the line with a comment in the program above) by changing Crawl(u, depth-1, fetcher) to go Crawl(u, depth-1, fetcher), the results were not as I expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
Next: https://golang.org/cmd/
I thought directly adding the go keyword would be as straightforward as it seems, but I'm not sure what went wrong and am confused about how best to approach this problem. Any advice would be appreciated. Thank you in advance!
Your program is most likely exiting before the crawlers finish doing their work. One approach would be for Crawl to have a WaitGroup with which it waits for all of its sub-crawlers to finish. For example:
import "sync"

// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
	defer func() {
		// If the crawler was given a wait group, signal that it's finished
		if wg != nil {
			wg.Done()
		}
	}()

	if depth <= 0 {
		return
	}
	body, urls, err := fetcher.Fetch(url)
	cache.Set(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)

	var crawlers sync.WaitGroup
	for _, u := range urls {
		if cache.Get(u) == false {
			fmt.Println("Next:", u)
			crawlers.Add(1)
			go Crawl(u, depth-1, fetcher, &crawlers)
		}
	}
	crawlers.Wait() // Waits for its sub-crawlers to finish

	return
}

func main() {
	// The root call does not need a WaitGroup
	Crawl("http://example.com/index.html", 4, fetcher, nil)
}
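As a usage note, the nil check in the deferred function is what lets the root call pass nil. If you did want to start the root call itself as a goroutine, you could hand it a wait group of its own (a minimal sketch, assuming the signature above):

func main() {
	var root sync.WaitGroup
	root.Add(1)
	go Crawl("http://example.com/index.html", 4, fetcher, &root)
	root.Wait() // keep main alive until the whole crawl tree is done
}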

goroutine didn't take effect in Crawl example of 'A Tour of Go'

As the hints in the Crawl example of 'A Tour of Go' suggest, I modified the Crawl function, and I just wonder why 'go Crawl' failed to spawn more goroutines, as only one url was found and printed.
Is there anything wrong with my modification?
My modification is listed below:
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
	// TODO: Fetch URLs in parallel.
	// TODO: Don't fetch the same URL twice.
	// This implementation doesn't do either:
	if depth <= 0 {
		fmt.Printf("depth <= 0 return")
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	crawled.mux.Lock()
	crawled.c[url]++
	crawled.mux.Unlock()
	for _, u := range urls {
		//crawled.mux.Lock()
		if cnt, ok := crawled.c[u]; ok {
			cnt++
		} else {
			fmt.Println("go ...", u)
			go Crawl(u, depth-1, fetcher)
		}
		//crawled.mux.Unlock()
		//Crawl(u, depth-1, fetcher)
	}
	return
}

type crawledUrl struct {
	c   map[string]int
	mux sync.Mutex
}

var crawled = crawledUrl{c: make(map[string]int)}
In your program, you have no synchronization at all for your goroutines, so the behavior of this code is undefined; most likely the main goroutine ends before the others get to run.
Remember that the main goroutine never blocks to wait for other goroutines to terminate unless you explicitly use some kind of synchronization, such as channels or the sync package utilities.
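For instance, a minimal channel-based sketch that keeps main alive until a goroutine signals completion. Note this only waits for the one goroutine it wraps; the recursive go Crawl calls inside still need their own synchronization, which is what the WaitGroup version below provides:

done := make(chan struct{})
go func() {
	Crawl("https://golang.org/", 4, fetcher)
	close(done) // signal that this call has returned
}()
<-done // main blocks here instead of exiting immediately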
Let me give a version using sync.WaitGroup:
type fetchState struct {
	mu      sync.Mutex
	fetched map[string]bool
}

func (f *fetchState) CheckAndMark(url string) bool {
	defer f.mu.Unlock()
	f.mu.Lock()
	if f.fetched[url] {
		return true
	}
	f.fetched[url] = true
	return false
}

func mkFetchState() *fetchState {
	f := &fetchState{}
	f.fetched = make(map[string]bool)
	return f
}

func CrawlConcurrentMutex(url string, fetcher Fetcher, f *fetchState) {
	if f.CheckAndMark(url) {
		return
	}

	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	var done sync.WaitGroup
	for _, u := range urls {
		done.Add(1)
		go func(u string) {
			defer done.Done()
			CrawlConcurrentMutex(u, fetcher, f)
		}(u) // Without the u argument there is a race
	}
	done.Wait()
	return
}
Please pay attention to the usage of sync.WaitGroup; refer to its documentation and you can understand the whole story.
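The "(u)" argument in that closure is worth dwelling on. Here is a minimal, self-contained sketch of the race it avoids (in Go versions before 1.22, where the loop variable is shared across iterations; since 1.22 each iteration gets its own variable):

package main

import (
	"fmt"
	"sync"
)

func main() {
	urls := []string{"a", "b", "c"}
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		// Racy (pre-1.22): go func() { defer wg.Done(); fmt.Println(u) }()
		// every goroutine would read the shared u, often seeing only the last value.
		go func(u string) {
			defer wg.Done()
			fmt.Println(u) // u was copied when the goroutine was launched
		}(u)
	}
	wg.Wait()
}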

Goroutines not exiting when data channel is closed

I'm trying to follow along with the bounded goroutine example posted at http://blog.golang.org/pipelines/bounded.go. The problem I'm having is that if more workers are spun up than there is work to do, the extra workers never get cancelled. Everything else seems to work: the values get computed and logged, but when I close the groups channel, the workers just hang at the range statement.
I guess what I don't understand (in both my code and the example code) is how the workers know when there is no more work to do and that they should exit.
Update
A runnable example (which demonstrates the problem) is posted at http://play.golang.org/p/T7zBCYLECp. It shows the deadlock in the workers, since they are all asleep and there is no work to do. What I'm confused about is that I think the example code would have the same problem.
Here is the code that I'm currently using:
// Creates a pool of workers to do a bunch of computations
func computeAll() error {
	done := make(chan struct{})
	defer close(done)

	groups, errc := findGroups(done)

	// start a fixed number of goroutines to schedule with
	const numComputers = 20
	c := make(chan result)
	var wg sync.WaitGroup
	wg.Add(numComputers)
	for i := 0; i < numComputers; i++ {
		go func() {
			compute(done, groups, c)
			wg.Done()
		}()
	}
	go func() {
		wg.Wait()
		close(c)
	}()

	// log the results of the computation
	for r := range c { // log the results }

	if err := <-errc; err != nil {
		return err
	}
	return nil
}
Here is the code that fills up the channel with data:
// Retrieves the groups of data that must be computed
func findGroups(done <-chan struct{}) (<-chan model, <-chan error) {
	groups := make(chan model)
	errc := make(chan error, 1)
	go func() {
		// close the groups channel after find returns
		defer close(groups)

		group, err := //... code to get the group ...
		if err == nil {
			// add the group to the channel
			select {
			case groups <- group:
			}
		}
	}()
	return groups, errc
}
And here is the code that reads the channel to do the computations.
// Computes the results for the groups of data
func compute(done <-chan struct{}, groups <-chan model, c chan<- result) {
for group := range groups {
value := compute(group)
select {
case c <- result{value}:
case <-done:
return
}
}
}
Because you're trying to read from errc and it's empty unless there's an error.
//edit
computeAll() will always block on <-errc if there are no errors; another approach is to use something like:
func computeAll() (err error) {
	.........
	select {
	case err = <-errc:
	default: // don't block
	}
	return
}
Try closing errc, as OneOfOne says:
go func() {
	wg.Wait()
	close(c)
	close(errc)
}()

// log the results of the computation
for r := range c { // log the results }

for err := range errc {
	if err != nil {
		return err
	}
}
