Tour of Go exercise #10: Crawler

I'm going through the Go Tour and I feel like I have a pretty good understanding of the language except for concurrency.
Slide 10 is an exercise that asks the reader to parallelize a web crawler (and to make it avoid fetching the same URL twice, but I haven't gotten there yet).
Here is what I have so far:
func Crawl(url string, depth int, fetcher Fetcher, ch chan string) {
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
ch <- fmt.Sprintln(err)
return
}
ch <- fmt.Sprintf("found: %s %q\n", url, body)
for _, u := range urls {
go Crawl(u, depth-1, fetcher, ch)
}
}
func main() {
ch := make(chan string, 100)
go Crawl("http://golang.org/", 4, fetcher, ch)
for i := range ch {
fmt.Println(i)
}
}
My question is: where do I put the close(ch) call?
If I put a defer close(ch) somewhere in the Crawl function, the program ends up writing to a closed channel from one of the spawned goroutines, because the call to Crawl returns before the spawned goroutines do.
If I omit the call to close(ch), as shown here, the program deadlocks in main while ranging over the channel, because the channel is never closed once all the goroutines have returned.

A look at the Parallelization section of Effective Go leads to ideas for the solution. Essentially, you have to close the channel on each return path of the function. Actually, this is a nice use case for the defer statement:
func Crawl(url string, depth int, fetcher Fetcher, ret chan string) {
defer close(ret)
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
ret <- err.Error()
return
}
ret <- fmt.Sprintf("found: %s %q", url, body)
result := make([]chan string, len(urls))
for i, u := range urls {
result[i] = make(chan string)
go Crawl(u, depth-1, fetcher, result[i])
}
for i := range result {
for s := range result[i] {
ret <- s
}
}
return
}
func main() {
result := make(chan string)
go Crawl("http://golang.org/", 4, fetcher, result)
for s := range result {
fmt.Println(s)
}
}
The essential difference from your code is that every instance of Crawl gets its own return channel, and the calling function collects the results from those channels into its own return channel.

I went in a completely different direction with this one. I might have been misled by the tip about using a map.
// SafeUrlMap is safe to use concurrently.
type SafeUrlMap struct {
v map[string]string
mux sync.Mutex
}
func (c *SafeUrlMap) Set(key string, body string) {
c.mux.Lock()
// Lock so only one goroutine at a time can access the map c.v.
c.v[key] = body
c.mux.Unlock()
}
// Value returns mapped value for the given key.
func (c *SafeUrlMap) Value(key string) (string, bool) {
c.mux.Lock()
// Lock so only one goroutine at a time can access the map c.v.
defer c.mux.Unlock()
val, ok := c.v[key]
return val, ok
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, urlMap *SafeUrlMap) {
// The cache is passed as a pointer so every goroutine shares the same mutex.
defer wg.Done()
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
// Record the fetched body after a successful fetch.
urlMap.Set(url, body)
for _, u := range urls {
if _, ok := urlMap.Value(u); !ok {
wg.Add(1)
go Crawl(u, depth-1, fetcher, urlMap)
}
}
return
}
var wg sync.WaitGroup
func main() {
urlMap := SafeUrlMap{v: make(map[string]string)}
wg.Add(1)
go Crawl("http://golang.org/", 4, fetcher, urlMap)
wg.Wait()
for url := range urlMap.v {
body, _ := urlMap.Value(url)
fmt.Printf("found: %s %q\n", url, body)
}
}

An O(1) lookup of the URL in a map, instead of an O(n) scan over a slice of all URLs visited, helps minimize the time spent inside the critical section. That is a trivial amount of time for this example, but it would become relevant at scale.
A WaitGroup is used to prevent main from returning until the top-level Crawl() call and all of its child goroutines are complete.
func Crawl(url string, depth int, fetcher Fetcher) {
var str_map = make(map[string]bool)
var mux sync.Mutex
var wg sync.WaitGroup
var crawler func(string, int)
crawler = func(url string, depth int) {
defer wg.Done()
if depth <= 0 {
return
}
mux.Lock()
if _, ok := str_map[url]; ok {
mux.Unlock()
return
} else {
str_map[url] = true
mux.Unlock()
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q %q\n", url, body, urls)
for _, u := range urls {
wg.Add(1)
go crawler(u, depth-1)
}
}
wg.Add(1)
crawler(url, depth)
wg.Wait()
}
func main() {
Crawl("http://golang.org/", 4, fetcher)
}

Similar idea to the accepted answer, but with no duplicate URLs fetched, and printing directly to the console. defer is not used either. We use channels to signal when goroutines complete. The SafeMap idea is lifted from the SafeCounter given previously in the tour.
For the child goroutines, we create a slice of channels and wait until every child returns by receiving on its channel.
package main
import (
"fmt"
"sync"
)
// SafeMap is safe to use concurrently.
type SafeMap struct {
v map[string] bool
mux sync.Mutex
}
// SetVal sets the value for the given key.
func (m *SafeMap) SetVal(key string, val bool) {
m.mux.Lock()
// Lock so only one goroutine at a time can access the map c.v.
m.v[key] = val
m.mux.Unlock()
}
// GetVal returns the current value for the given key.
func (m *SafeMap) GetVal(key string) bool {
m.mux.Lock()
// Lock so only one goroutine at a time can access the map c.v.
defer m.mux.Unlock()
return m.v[key]
}
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, status chan bool, urlMap SafeMap) {
// Check if we fetched this url previously.
if ok := urlMap.GetVal(url); ok {
//fmt.Println("Already fetched url!")
status <- true
return
}
// Marking this url as fetched already.
urlMap.SetVal(url, true)
if depth <= 0 {
status <- false
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
status <- false
return
}
fmt.Printf("found: %s %q\n", url, body)
statuses := make([]chan bool, len(urls))
for index, u := range urls {
statuses[index] = make(chan bool)
go Crawl(u, depth-1, fetcher, statuses[index], urlMap)
}
// Wait for child goroutines.
for _, childstatus := range statuses {
<- childstatus
}
// And now this goroutine can finish.
status <- true
return
}
func main() {
urlMap := SafeMap{v: make(map[string] bool)}
status := make(chan bool)
go Crawl("https://golang.org/", 4, fetcher, status, urlMap)
<- status
}

I think using a map (the same way we could use a set in other languages) and a mutex is the easiest approach:
func Crawl(url string, depth int, fetcher Fetcher) {
mux.Lock()
defer mux.Unlock()
if depth <= 0 || IsVisited(url) {
return
}
visit[url] = true
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
go Crawl(u, depth-1, fetcher)
}
return
}
func IsVisited(s string) bool {
_, ok := visit[s]
return ok
}
var mux sync.Mutex
var visit = make(map[string]bool)
func main() {
Crawl("https://golang.org/", 4, fetcher)
time.Sleep(time.Second)
}

Here is my solution. I use empty structs as values in the safe cache because they do not take up any memory. I based it off of whossname's solution.
package main
import (
"fmt"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
type safeCache struct {
m map[string]struct{}
c sync.Mutex
}
func (s *safeCache) Get(key string) bool {
s.c.Lock()
defer s.c.Unlock()
if _,ok:=s.m[key];!ok{
return false
}
return true
}
func (s *safeCache) Set(key string) {
s.c.Lock()
s.m[key] = struct{}{}
s.c.Unlock()
return
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, cach safeCache) {
defer wg.Done()
// Record this URL in the cache before crawling it.
cach.Set(url)
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
if found := cach.Get(u); !found{
wg.Add(1)
go Crawl(u, depth-1, fetcher, cach)
}
}
return
}
var wg sync.WaitGroup
func main() {
urlSafe := safeCache{m: make(map[string]struct{})}
wg.Add(1)
go Crawl("https://golang.org/", 4, fetcher, urlSafe)
wg.Wait()
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}

Below is my solution. Apart from the global map, I only had to change the contents of Crawl. Like other solutions, I used sync.Map and sync.WaitGroup. I've boxed off the important parts with comment markers.
var m sync.Map
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
if depth <= 0 {
return
}
// Don't fetch the same URL twice.
/////////////////////////////////////
_, ok := m.LoadOrStore(url, url) //
if ok { //
return //
} //
/////////////////////////////////////
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
// Fetch URLs in parallel.
/////////////////////////////////////
var wg sync.WaitGroup //
defer wg.Wait() //
for _, u := range urls { //
wg.Add(1) //
go func(u string) { //
defer wg.Done() //
Crawl(u, depth-1, fetcher) //
}(u) //
} //
/////////////////////////////////////
return
}

Here's my solution. I have a "master" routine that listens to a channel of urls and starts new crawling routine (which puts crawled urls into the channel) if it finds new urls to crawl.
Instead of explicitly closing the channel, I have a counter for unfinished crawling goroutines, and when the counter is 0, the program exits because it has nothing to wait for.
func doCrawl(url string, fetcher Fetcher, results chan []string) {
body, urls, err := fetcher.Fetch(url)
results <- urls
if err != nil {
fmt.Println(err)
} else {
fmt.Printf("found: %s %q\n", url, body)
}
}
func Crawl(url string, depth int, fetcher Fetcher) {
results := make(chan []string)
crawled := make(map[string]bool)
go doCrawl(url, fetcher, results)
// counter for unfinished crawling goroutines
toWait := 1
for urls := range results {
toWait--
for _, u := range urls {
if !crawled[u] {
crawled[u] = true
go doCrawl(u, fetcher, results)
toWait++
}
}
if toWait == 0 {
break
}
}
}

I have implemented it with a simple channel to which all the goroutines send their messages. To ensure that it is closed when there are no more goroutines, I use a safe counter that closes the channel when the counter reaches 0.
type Msg struct {
url string
body string
}
type SafeCounter struct {
v int
mux sync.Mutex
}
func (c *SafeCounter) inc() {
c.mux.Lock()
defer c.mux.Unlock()
c.v++
}
func (c *SafeCounter) dec(ch chan Msg) {
c.mux.Lock()
defer c.mux.Unlock()
c.v--
if c.v == 0 {
close(ch)
}
}
var goes SafeCounter = SafeCounter{v: 0}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, ch chan Msg) {
defer goes.dec(ch)
if depth <= 0 {
return
}
if !cache.existsAndRegister(url) {
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
ch <- Msg{url, body}
for _, u := range urls {
goes.inc()
go Crawl(u, depth-1, fetcher, ch)
}
}
return
}
func main() {
ch := make(chan Msg, 100)
goes.inc()
go Crawl("http://golang.org/", 4, fetcher, ch)
for m := range ch {
fmt.Printf("found: %s %q\n", m.url, m.body)
}
}
Note that the safe counter must be incremented outside of the goroutine.
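The cache with the existsAndRegister method that this code relies on is not shown in the answer. A minimal sketch of what it could look like, assuming a mutex-guarded set (my naming, not the author's), is:
type safeSet struct {
    v   map[string]bool
    mux sync.Mutex
}
// existsAndRegister reports whether url was seen before and, if not,
// registers it; doing both under one lock makes the check-and-set atomic.
func (s *safeSet) existsAndRegister(url string) bool {
    s.mux.Lock()
    defer s.mux.Unlock()
    if s.v[url] {
        return true
    }
    s.v[url] = true
    return false
}
var cache = safeSet{v: make(map[string]bool)}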

I passed the SafeCounter and a WaitGroup to the Crawl function, and then used the SafeCounter to skip URLs that have already been visited and the WaitGroup to prevent exiting before all goroutines have finished.
func Crawl(url string, depth int, fetcher Fetcher, c *SafeCounter, wg *sync.WaitGroup) {
defer wg.Done()
if depth <= 0 {
return
}
c.mux.Lock()
c.v[url]++
c.mux.Unlock()
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
c.mux.Lock()
i := c.v[u]
c.mux.Unlock()
if i > 0 {
continue
}
wg.Add(1)
go Crawl(u, depth-1, fetcher, c, wg)
}
return
}
func main() {
c := SafeCounter{v: make(map[string]int)}
var wg sync.WaitGroup
wg.Add(1)
Crawl("https://golang.org/", 4, fetcher, &c, &wg)
wg.Wait()
}
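The SafeCounter type used above is not shown; it is the mutex-protected counter map from the tour's sync.Mutex slide. A definition consistent with the code above (field names inferred, so treat this as an assumption) would be:
type SafeCounter struct {
    v   map[string]int
    mux sync.Mutex
}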

Here is my version (inspired by fasmat's answer) – this one prevents fetching the same URL twice by using a custom cache with a sync.RWMutex.
type Cache struct {
data map[string]fakeResult
mux sync.RWMutex
}
var cache = Cache{data: make(map[string]fakeResult)}
//cache adds new page to the global cache
func (c *Cache) cache(url string) fakeResult {
c.mux.Lock()
body, urls, err := fetcher.Fetch(url)
if err != nil {
body = err.Error()
}
data := fakeResult{body, urls}
c.data[url] = data
c.mux.Unlock()
return data
}
//Visit visits the page at the given url and caches it if needed
func (c *Cache) Visit(url string) (data fakeResult, alreadyCached bool) {
c.mux.RLock()
data, alreadyCached = c.data[url]
c.mux.RUnlock()
if !alreadyCached {
data = c.cache(url)
}
return data, alreadyCached
}
/*
Crawl crawls all pages reachable from url and within the depth (given by args).
It fetches pages using the given fetcher and caches them in the global cache.
It continuously sends newly discovered pages to the out channel.
*/
func Crawl(url string, depth int, fetcher Fetcher, out chan string) {
defer close(out)
if depth <= 0 {
return
}
data, alreadyCached := cache.Visit(url)
if alreadyCached {
return
}
//send newly discovered page to out channel
out <- fmt.Sprintf("found: %s %q", url, data.body)
//visit linked pages
res := make([]chan string, len(data.urls))
for i, link := range data.urls {
res[i] = make(chan string)
go Crawl(link, depth-1, fetcher, res[i])
}
//send newly discovered pages from links to out channel
for i := range res {
for s := range res[i] {
out <- s
}
}
}
func main() {
res := make(chan string)
go Crawl("https://golang.org/", 4, fetcher, res)
for page := range res {
fmt.Println(page)
}
}
Aside from not fetching URLs twice, this solution doesn't rely on knowing the total number of pages in advance (it works for any number of pages) and doesn't artificially limit or extend execution time with timers.

I'm new to Go, so take this with a grain of salt, but this solution seems more idiomatic to me. It uses a single channel for all of the results, a single channel for all of the crawl requests (an attempt to crawl a specific URL), and a wait group for keeping track of completion. The main Crawl call acts as the distributor of crawl requests to worker goroutines (while handling deduplication) and as the tracker for how many crawl requests are pending.
package main
import (
"fmt"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
type FetchResult struct {
url string
body string
err error
}
type CrawlRequest struct {
url string
depth int
}
type Crawler struct {
depth int
fetcher Fetcher
results chan FetchResult
crawlRequests chan CrawlRequest
urlReservations map[string]bool
waitGroup *sync.WaitGroup
}
func (crawler Crawler) Crawl(url string, depth int) {
defer crawler.waitGroup.Done()
if depth <= 0 {
return
}
body, urls, err := crawler.fetcher.Fetch(url)
crawler.results <- FetchResult{url, body, err}
if len(urls) == 0 {
return
}
crawler.waitGroup.Add(len(urls))
for _, url := range urls {
crawler.crawlRequests <- CrawlRequest{url, depth - 1}
}
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) (results chan FetchResult) {
results = make(chan FetchResult)
urlReservations := make(map[string]bool)
crawler := Crawler{
crawlRequests: make(chan CrawlRequest),
depth: depth,
fetcher: fetcher,
results: results,
waitGroup: &sync.WaitGroup{},
}
crawler.waitGroup.Add(1)
// Listen for crawlRequests, pass them through to the caller if they aren't duplicates.
go func() {
for crawlRequest := range crawler.crawlRequests {
if _, isReserved := urlReservations[crawlRequest.url]; isReserved {
crawler.waitGroup.Done()
continue
}
urlReservations[crawlRequest.url] = true
go crawler.Crawl(crawlRequest.url, crawlRequest.depth)
}
}()
// Wait for the wait group to finish, and then close the channel
go func() {
crawler.waitGroup.Wait()
close(results)
}()
// Send the first crawl request to the channel
crawler.crawlRequests <- CrawlRequest{url, depth}
return
}
func main() {
results := Crawl("https://golang.org/", 4, fetcher)
for result := range results {
if result.err != nil {
fmt.Println(result.err)
continue
}
fmt.Printf("found: %s %q\n", result.url, result.body)
}
fmt.Printf("done!")
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}

Here is my solution. I had a problem where the main function didn't wait for the goroutines to print their statuses and finish. I noticed that the previous slide's solution waited one second before exiting, and I decided to use that approach. In practice, though, I believe some coordinating mechanism is better; a sketch of one follows the code.
import (
"fmt"
"sync"
"time"
)
type SafeMap struct {
mu sync.Mutex
v map[string]bool
}
// Sets the given key to true.
func (sm *SafeMap) Set(key string) {
sm.mu.Lock()
sm.v[key] = true
sm.mu.Unlock()
}
// Get returns the current value for the given key.
func (sm *SafeMap) Get(key string) bool {
sm.mu.Lock()
defer sm.mu.Unlock()
return sm.v[key]
}
var safeMap = SafeMap{v: make(map[string]bool)}
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
if depth <= 0 {
return
}
// if the value exists, don't fetch it twice
if safeMap.Get(url) {
return
}
// check if there is an error fetching
body, urls, err := fetcher.Fetch(url)
safeMap.Set(url)
if err != nil {
fmt.Println(err)
return
}
// list contents and crawl recursively
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
go Crawl(u, depth-1, fetcher)
}
}
func main() {
go Crawl("https://golang.org/", 4, fetcher)
time.Sleep(time.Second)
}
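As a sketch of the coordinating mechanism mentioned above (my adaptation, not part of the original answer): a package-level sync.WaitGroup can replace the one-second sleep, reusing the SafeMap, Fetcher, and fetcher already defined. The function is renamed CrawlWG here only to avoid clashing with the Crawl above, and its main replaces the Sleep-based one.
var wg sync.WaitGroup

func CrawlWG(url string, depth int, fetcher Fetcher) {
    defer wg.Done()
    if depth <= 0 || safeMap.Get(url) {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    safeMap.Set(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        wg.Add(1) // track each child before spawning it
        go CrawlWG(u, depth-1, fetcher)
    }
}

func main() {
    wg.Add(1)
    go CrawlWG("https://golang.org/", 4, fetcher)
    wg.Wait() // returns once every CrawlWG call has finished
}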

No need to change any signatures or introduce anything new in the global scope. We can use a sync.WaitGroup to wait for the recurse goroutines to finish. A map from strings to empty structs acts as a set, and is an efficient way to keep track of the already-crawled URLs.
func Crawl(url string, depth int, fetcher Fetcher) {
visited := make(map[string]struct{})
var mu sync.Mutex
var wg sync.WaitGroup
var recurse func(string, int)
recurse = func(url string, depth int) {
defer wg.Done()
if depth <= 0 {
return
}
mu.Lock()
if _, ok := visited[url]; ok {
mu.Unlock()
return
}
visited[url] = struct{}{}
// Unlock before fetching so other goroutines can run in parallel.
mu.Unlock()
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
wg.Add(1)
go recurse(u, depth-1)
}
}
wg.Add(1)
go recurse(url, depth)
wg.Wait()
}
func main() {
Crawl("https://golang.org/", 4, fetcher)
}
Full demo on the Go Playground

I use a slice to avoid crawling the same URL twice. The recursive version without concurrency is OK, but I'm not sure about this concurrent version.
func Crawl(url string, depth int, fetcher Fetcher) {
var str_arrs []string
var mux sync.Mutex
// A WaitGroup keeps Crawl from returning before the spawned goroutines finish.
var wg sync.WaitGroup
var crawl func(string, int)
crawl = func(url string, depth int) {
defer wg.Done()
if depth <= 0 {
return
}
mux.Lock()
for _, v := range str_arrs {
if url == v {
mux.Unlock()
return
}
}
str_arrs = append(str_arrs, url)
mux.Unlock()
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
wg.Add(1)
go crawl(u, depth-1) // could delete “go” then it is recursive
}
}
wg.Add(1)
crawl(url, depth)
wg.Wait()
return
}
func main() {
Crawl("http://golang.org/", 4, fetcher)
}

Here's my solution, using sync.WaitGroup and a SafeCache of fetched urls:
package main
import (
"fmt"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Safe to use concurrently
type SafeCache struct {
fetched map[string]string
mux sync.Mutex
}
func (c *SafeCache) Add(url, body string) {
c.mux.Lock()
defer c.mux.Unlock()
if _, ok := c.fetched[url]; !ok {
c.fetched[url] = body
}
}
func (c *SafeCache) Contains(url string) bool {
c.mux.Lock()
defer c.mux.Unlock()
_, ok := c.fetched[url]
return ok
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, cache SafeCache,
wg *sync.WaitGroup) {
defer wg.Done()
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
cache.Add(url, body)
for _, u := range urls {
if !cache.Contains(u) {
wg.Add(1)
go Crawl(u, depth-1, fetcher, cache, wg)
}
}
return
}
func main() {
cache := SafeCache{fetched: make(map[string]string)}
var wg sync.WaitGroup
wg.Add(1)
Crawl("http://golang.org/", 4, fetcher, cache, &wg)
wg.Wait()
}

Below is a simple solution for parallelization using only sync.WaitGroup.
var fetchedUrlMap = make(map[string]bool)
var mutex sync.Mutex
func Crawl(url string, depth int, fetcher Fetcher) {
//fmt.Println("In Crawl2 with url" , url)
if depth <= 0 {
return
}
// Check and mark the URL under one lock so concurrent crawlers neither
// race on the map nor fetch the same URL twice.
mutex.Lock()
if fetchedUrlMap[url] {
mutex.Unlock()
return
}
fetchedUrlMap[url] = true
mutex.Unlock()
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
var wg sync.WaitGroup
for _, u := range urls {
// fmt.Println("Solving for ", u)
wg.Add(1)
go func(uv string) {
Crawl(uv, depth-1, fetcher)
wg.Done()
}(u)
}
wg.Wait()
}

Here is my solution :)
package main
import (
"fmt"
"runtime"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, set map[string]bool) {
// TODO: Fetch URLs in parallel.
// TODO: Don't fetch the same URL twice.
// This implementation doesn't do either:
if depth <= 0 {
return
}
// use a set to identify if the URL should be traversed or not
if set[url] == true {
wg.Done()
return
} else {
fmt.Println(runtime.NumGoroutine())
set[url] = true
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
Crawl(u, depth-1, fetcher, set)
}
}
}
var wg sync.WaitGroup
func main() {
wg.Add(6)
collectedURLs := make(map[string]bool)
go Crawl("https://golang.org/", 4, fetcher, collectedURLs)
wg.Wait()
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}

Since most of the solutions here don't work for me (including the accepted answer), here is my own, inspired by Kamil's (special thanks!). It fetches no duplicates and only valid URLs.
package main
import (
"fmt"
"runtime"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, set map[string]bool) {
// Balance the wg.Add(1) that was done before this call started.
defer wg.Done()
if depth <= 0 {
return
}
// use a set to identify if the URL should be traversed or not
// (note: access to the map itself is not synchronized here)
fmt.Println(runtime.NumGoroutine())
set[url] = true
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
if set[u] == false {
wg.Add(1)
go Crawl(u, depth-1, fetcher, set)
}
}
}
var wg sync.WaitGroup
func main() {
collectedURLs := make(map[string]bool)
wg.Add(1)
Crawl("https://golang.org/", 4, fetcher, collectedURLs)
wg.Wait()
}

/*
Exercise: Web Crawler
In this exercise you'll use Go's concurrency features to parallelize a web crawler.
Modify the Crawl function to fetch URLs in parallel without fetching the same URL twice.
Hint: you can keep a cache of the URLs that have been fetched on a map, but maps alone are not safe for concurrent use!
*/
package main
import (
"fmt"
"sync"
"time"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
type Response struct {
url string
urls []string
body string
err error
}
var ch chan Response = make(chan Response)
var fetched map[string]bool = make(map[string]bool)
var wg sync.WaitGroup
var mu sync.Mutex
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
var fetch func(url string, depth int, fetcher Fetcher)
wg.Add(1)
recv := func() {
for res := range ch {
body, _, err := res.body, res.urls, res.err
if err != nil {
fmt.Println(err)
continue
}
fmt.Printf("found: %s %q\n", url, body)
}
}
fetch = func(url string, depth int, fetcher Fetcher) {
time.Sleep(time.Second / 2)
defer wg.Done()
if depth <= 0 {
return
}
// Check and mark the URL under the mutex to avoid racing on the map.
mu.Lock()
if fetched[url] {
mu.Unlock()
return
}
fetched[url] = true
mu.Unlock()
body, urls, err := fetcher.Fetch(url)
for _, u := range urls {
wg.Add(1)
go fetch(u, depth-1, fetcher)
}
ch <- Response{url, urls, body, err}
}
go fetch(url, depth, fetcher)
go recv()
return
}
func main() {
Crawl("https://golang.org/", 4, fetcher)
wg.Wait()
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd1/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd2/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}
https://gist.github.com/gaogao1030/5d63ed925534f3610ccb7e25ed46992a

Super-simple solution, using one channel per fetched URL to wait for the goroutines crawling the URLs found in the corresponding body.
Duplicate URLs are avoided by using a UrlCache struct with a mutex and a map[string]struct{} (this saves memory compared to a map of booleans).
Side effects that could cause deadlocks are mitigated by using defer for both mutex unlocking and channel writes.
package main
import (
"fmt"
"sync"
)
type UrlCache struct {
v map[string]struct{}
mux sync.Mutex
}
func NewUrlCache() *UrlCache {
res := UrlCache{}
res.v = make(map[string]struct{})
return &res
}
func (c *UrlCache) check(url string) bool {
c.mux.Lock()
defer c.mux.Unlock()
if _, p := c.v[url]; !p {
c.v[url] = struct{}{}
return false
}
return true
}
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, uc *UrlCache, c chan struct{}) {
defer func() { c <- struct{}{} }()
if depth <= 0 {
return
}
if uc.check(url) {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
ci := make(chan struct{})
for _, u := range urls {
go Crawl(u, depth-1, fetcher, uc, ci)
}
// Wait for the parallel crawls to finish
for range urls {
<-ci
}
}
func main() {
c := make(chan struct{})
go Crawl("https://golang.org/", 4, fetcher, NewUrlCache(), c)
<-c
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}

Below is my solution. defer is a really powerful construct in Go.
var urlcache FetchedUrls
func (urlcache *FetchedUrls) CacheIfNotPresent(url string) bool {
urlcache.m.Lock()
defer urlcache.m.Unlock()
_, ok := urlcache.urls[url]
if !ok {
urlcache.urls[url] = true
}
return !ok
}
func BlockOnChan(ch chan int) {
<-ch
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, ch chan int) {
defer close(ch)
if depth <= 0 {
return
}
if !urlcache.CacheIfNotPresent(url) {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
fch := make(chan int)
defer BlockOnChan(fch)
go Crawl(u, depth-1, fetcher, fch)
}
}
func main() {
urlcache.urls = make(map[string]bool)
Crawl("https://golang.org/", 4, fetcher, make(chan int))
}
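The FetchedUrls type is not shown in this answer; a definition consistent with the methods used above (assumed, not the author's exact code) would be:
type FetchedUrls struct {
    m    sync.Mutex       // guards urls
    urls map[string]bool  // set of URLs already cached
}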

Adding my solution for others to reference. Hope it helps.
Being able to compare our different approaches is just great!
You can try the code below in the Go Playground.
func Crawl(url string, depth int, fetcher Fetcher) {
defer wg.Done()
if depth <= 0 {
return
} else if _, ok := fetched.Load(url); ok {
fmt.Printf("Skipping (already fetched): %s\n", url)
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
fetched.Store(url, nil)
for _, u := range urls {
wg.Add(1)
go Crawl(u, depth-1, fetcher)
}
}
// As many kinds of errors can occur when fetching a URL, a URL is only
// marked as fetched once it has been processed successfully.
var fetched sync.Map
// For each Crawl, wg is incremented, and main waits for all of them to finish.
var wg sync.WaitGroup
func main() {
wg.Add(1)
go Crawl("https://golang.org/", 4, fetcher)
wg.Wait()
}

Using a mutex and channels
package main
import (
"fmt"
"sync"
)
type SafeMap struct {
mu sync.Mutex
seen map[string]bool
}
func (s *SafeMap) getVal(url string) bool {
s.mu.Lock()
defer s.mu.Unlock()
return s.seen[url]
}
func (s *SafeMap) setVal(url string) {
s.mu.Lock()
defer s.mu.Unlock()
s.seen[url] = true
}
var s = SafeMap{seen: make(map[string]bool)}
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, ch chan bool) {
if depth <= 0 || s.getVal(url) {
ch <- false
return
}
body, urls, err := fetcher.Fetch(url)
s.setVal(url)
if err != nil {
fmt.Println(err)
ch <- false
return
}
fmt.Printf("found: %s %q\n", url, body)
chs := make(map[string]chan bool, len(urls))
for _, u := range urls {
chs[u] = make(chan bool)
go Crawl(u, depth-1, fetcher, chs[u])
}
for _,v := range urls {
<-chs[v]
}
ch <- true
return
}
func main() {
ch := make(chan bool)
go Crawl("https://golang.org/", 4, fetcher, ch)
<-ch
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}

You can solve the problem of closing the channel by using a sync.WaitGroup and spawning a separate goroutine to close the channel.
This solution does not address the requirement to avoid repeated visits to URLs.
func Crawl(url string, depth int, fetcher Fetcher, ch chan string, wg *sync.WaitGroup) {
defer wg.Done()
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
ch <- fmt.Sprintln(err)
return
}
ch <- fmt.Sprintf("found: %s %q", url, body)
for _, u := range urls {
wg.Add(1)
go Crawl(u, depth-1, fetcher, ch, wg)
}
}
func main() {
ch := make(chan string)
var wg sync.WaitGroup
wg.Add(1)
go Crawl("https://golang.org/", 4, fetcher, ch, &wg)
go func() {
wg.Wait()
close(ch)
}()
for i := range ch {
fmt.Println(i)
}
}

Related

Goroutine didn't run as expected

I'm still learning Go and was doing the web crawler exercise linked here. The main part I implemented is as follows. (Other parts remain the same and can be found in the link.)
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// TODO: Fetch URLs in parallel.
// TODO: Don't fetch the same URL twice.
// This implementation doesn't do either:
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
cache.Set(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
if cache.Get(u) == false {
fmt.Println("Next:", u)
Crawl(u, depth-1, fetcher) // I want to parallelize this
}
}
return
}
func main() {
Crawl("https://golang.org/", 4, fetcher)
}
type SafeCache struct {
v map[string]bool
mux sync.Mutex
}
func (c *SafeCache) Set(key string) {
c.mux.Lock()
c.v[key] = true
c.mux.Unlock()
}
func (c *SafeCache) Get(key string) bool {
return c.v[key]
}
var cache SafeCache = SafeCache{v: make(map[string]bool)}
When I ran the code above, the result was expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
found: https://golang.org/pkg/ "Packages"
Next: https://golang.org/cmd/
not found: https://golang.org/cmd/
Next: https://golang.org/pkg/fmt/
found: https://golang.org/pkg/fmt/ "Package fmt"
Next: https://golang.org/pkg/os/
found: https://golang.org/pkg/os/ "Package os"
However, when I tried to parallelize the crawler (on the line with a comment in the program above) by changing Crawl(u, depth-1, fetcher) to go Crawl(u, depth-1, fetcher), the results were not as I expected:
found: https://golang.org/ "The Go Programming Language"
Next: https://golang.org/pkg/
Next: https://golang.org/cmd/
I thought directly adding the go keyword would be as straightforward as it seems, but I'm not sure what went wrong, and I'm confused about how best to approach this problem. Any advice would be appreciated. Thank you in advance!
Your program is most likely exiting before the crawlers finish doing their work. One approach would be for Crawl to have a WaitGroup where it waits for all of its sub-crawlers to finish. For example:
import "sync"
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
defer func() {
// If the crawler was given a wait group, signal that it's finished
if wg != nil {
wg.Done()
}
}()
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
cache.Set(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
var crawlers sync.WaitGroup
for _, u := range urls {
if cache.Get(u) == false {
fmt.Println("Next:", u)
crawlers.Add(1)
go Crawl(u, depth-1, fetcher, &crawlers)
}
}
crawlers.Wait() // Waits for its sub-crawlers to finish
return
}
func main() {
// The root does not need a WaitGroup
Crawl("http://example.com/index.html", 4, nil)
}

Go-routine not running when called through recursion

I'm doing the Web Crawler problem from the tour of go. Here's my solution so far:
func GatherUrls(url string, fetcher Fetcher) []string {
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println("error:", err)
} else {
fmt.Printf("found: %s %q\n", url, body)
}
return urls
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// get all urls for depth
// check if url has been crawled
// Y: noop
// N: crawl url
// when depth is 0, stop
fmt.Printf("crawling %q...\n", url)
if depth <= 0 {
return
}
urls := GatherUrls(url, fetcher)
fmt.Println("urls:", urls)
for _, u := range urls {
fmt.Println("currentUrl:", u)
if _, exists := cache[u]; !exists {
fmt.Printf("about to crawl %q\n", u)
go Crawl(u, depth - 1, fetcher)
} else {
cache[u] = true
}
}
}
func main() {
cache = make(map[string]bool)
Crawl("https://golang.org/", 4, fetcher)
}
When I run this code, Crawl() is never called when the function recurses (I know this because fmt.Printf("crawling %q...\n", url) is only ever called once).
Here are the logs:
crawling "https://golang.org/"...
found: https://golang.org/ "The Go Programming Language"
urls: [https://golang.org/pkg/ https://golang.org/cmd/]
currentUrl: https://golang.org/pkg/
about to crawl "https://golang.org/pkg/"
currentUrl: https://golang.org/cmd/
about to crawl "https://golang.org/cmd/"
What am I doing wrong? I suspect that spawning a goroutine to do the recursion is the wrong way to do this? Please advise.
Please note that I want to do this with as few libraries as possible. I've seen some answers using sync.WaitGroup. I don't want to use that.
NOTE: The full code including the lesson boilerplate is below:
package main
import (
"fmt"
)
var cache map[string]bool
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
func GatherUrls(url string, fetcher Fetcher) []string {
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println("error:", err)
} else {
fmt.Printf("found: %s %q\n", url, body)
}
return urls
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// get all urls for depth
// check if url has been crawled
// Y: noop
// N: crawl url
// when depth is 0, stop
fmt.Printf("crawling %q...\n", url)
if depth <= 0 {
return
}
urls := GatherUrls(url, fetcher)
fmt.Println("urls:", urls)
for _, u := range urls {
fmt.Println("currentUrl:", u)
if _, exists := cache[u]; !exists {
fmt.Printf("about to crawl %q\n", u)
go Crawl(u, depth - 1, fetcher)
} else {
cache[u] = true
}
}
}
func main() {
cache = make(map[string]bool)
Crawl("https://golang.org/", 4, fetcher)
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}
As you can see in this sample: https://tour.golang.org/concurrency/10, we should do the following tasks:
Fetch URLs in parallel.
Don't fetch the same URL twice.
Cache URLs already fetched on a map, but maps alone are not safe for concurrent use!
So, we can take the following steps to solve the above tasks:
Create struct to store the fetch result:
type Result struct {
body string
urls []string
err error
}
Create a struct to record whether a URL has already been fetched in a map; we need to use sync.Mutex to guard it:
type Cache struct {
store map[string]bool
mux sync.Mutex
}
Fetch the URL and body in parallel: add the URL to the cache when fetching it, but first we need to guard concurrent reads/writes with the mutex. So, we can modify the Crawl function like this:
func Crawl(url string, depth int, fetcher Fetcher) {
if depth <= 0 {
return
}
ch := make(chan Result)
go func(url string, res chan Result) {
body, urls, err := fetcher.Fetch(url)
if err != nil {
ch <- Result{body, urls, err}
return
}
var furls []string
cache.mux.Lock()
for _, u := range urls {
if _, exists := cache.store[u]; !exists {
furls = append(furls, u)
}
cache.store[u] = true
}
cache.mux.Unlock()
ch <- Result{body: body, urls: furls, err: err}
}(url, ch)
res := <-ch
if res.err != nil {
fmt.Println(res.err)
return
}
fmt.Printf("found: %s %q\n", url, res.body)
for _, u := range res.urls {
Crawl(u, depth-1, fetcher)
}
}
You can view the full code and run this in the playground: https://play.golang.org/p/iY9uBXchx3w
Hope this helps.
The main() function exits before the goroutines execute. Fix by using a wait group:
There's a data race on cache. Protect it with a mutex. Always set cache[u] = true for URLs to be visited.
var wg sync.WaitGroup
var mu sync.Mutex
var fetched = map[string]bool{}
func Crawl(url string, depth int, fetcher Fetcher) {
if depth <= 0 {
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
mu.Lock()
f := fetched[u]
fetched[u] = true
mu.Unlock()
if !f {
wg.Add(1)
go func(u string) {
defer wg.Done()
Crawl(u, depth-1, fetcher)
}(u)
}
}
return
}
playground example
Wait groups are the idiomatic way to wait for goroutines to complete. If you cannot use sync.WaitGroup for some reason, then reimplement the type using a counter, mutex and channel:
type WaitGroup struct {
mu sync.Mutex
n int
done chan struct{}
}
func (wg *WaitGroup) Add(i int) {
wg.mu.Lock()
defer wg.mu.Unlock()
if wg.done == nil {
wg.done = make(chan struct{})
}
wg.n += i
if wg.n < 0 {
panic("negative count")
}
if wg.n == 0 {
close(wg.done)
wg.done = nil
}
}
func (wg *WaitGroup) Done() {
wg.Add(-1)
}
func (wg *WaitGroup) Wait() {
wg.mu.Lock()
done := wg.done
wg.mu.Unlock()
if done != nil {
<-done
}
}
playground example
Because the main function exits before the goroutines finish, you need to add a sync.WaitGroup to make main wait until all goroutines are done.
package main
import (
"fmt"
"sync"
)
var cache map[string]bool
var wg sync.WaitGroup
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
func GatherUrls(url string, fetcher Fetcher, Urls chan []string) {
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println("error:", err)
} else {
fmt.Printf("found: %s %q\n", url, body)
}
Urls <- urls
wg.Done()
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// get all urls for depth
// check if url has been crawled
// Y: noop
// N: crawl url
// when depth is 0, stop
fmt.Printf("crawling %q... %d\n", url, depth)
if depth <= 0 {
return
}
uc := make(chan []string)
wg.Add(1)
go GatherUrls(url, fetcher, uc)
urls, _ := <-uc
fmt.Println("urls:", urls)
for _, u := range urls {
fmt.Println("currentUrl:", u)
if _, exists := cache[u]; !exists {
fmt.Printf("about to crawl %q\n", u)
wg.Add(1)
go Crawl(u, depth-1, fetcher)
} else {
cache[u] = true
}
}
}
func main() {
cache = make(map[string]bool)
wg.Add(1)
go Crawl("https://golang.org/", 4, fetcher)
wg.Wait()
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}

Simple solution for golang tour webcrawler exercise

I'm new to Go and I saw some solutions for this exercise, but I think they are complex...
In my solution everything seems simple, but I get a deadlock error. I can't figure out how to properly close the channels and stop the loop inside the main block. Is there a simple way to do this?
Solution on Golang playground
Thanks for any/all help one may provide!
package main
import (
"fmt"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
type SafeCache struct {
cache map[string]bool
mux sync.Mutex
}
func (c *SafeCache) Set(s string) {
c.mux.Lock()
c.cache[s] = true
c.mux.Unlock()
}
func (c *SafeCache) Get(s string) bool {
c.mux.Lock()
defer c.mux.Unlock()
return c.cache[s]
}
var (
sc = SafeCache{cache: make(map[string]bool)}
errs, ress = make(chan error), make(chan string)
)
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
if depth <= 0 {
return
}
var (
body string
err error
urls []string
)
if ok := sc.Get(url); !ok {
sc.Set(url)
body, urls, err = fetcher.Fetch(url)
} else {
err = fmt.Errorf("Already fetched: %s", url)
}
if err != nil {
errs <- err
return
}
ress <- fmt.Sprintf("found: %s %q\n", url, body)
for _, u := range urls {
go Crawl(u, depth-1, fetcher)
}
return
}
func main() {
go Crawl("http://golang.org/", 4, fetcher)
for {
select {
case res, ok := <-ress:
fmt.Println(res)
if !ok {
break
}
case err, ok := <-errs:
fmt.Println(err)
if !ok {
break
}
}
}
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"http://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"http://golang.org/pkg/",
"http://golang.org/cmd/",
},
},
"http://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"http://golang.org/",
"http://golang.org/cmd/",
"http://golang.org/pkg/fmt/",
"http://golang.org/pkg/os/",
},
},
"http://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
"http://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
}
You can solve this with sync.WaitGroup.
You can start listening to your channels in separate goroutines.
The WaitGroup will coordinate how many goroutines you have:
wg.Add(1) says that we're going to start a new goroutine.
wg.Done() says that a goroutine has finished.
wg.Wait() blocks until all started goroutines have finished.
These three methods allow you to coordinate your goroutines; a rough sketch follows below.
Go playground link
PS. you might be interested in sync.RWMutex for your SafeCache
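For reference, here is a rough sketch of the shape described above, adapted from the question's code rather than taken from the answerer's playground link: every Crawl call is tracked by a package-level WaitGroup, the two channels are drained in separate goroutines, and main closes them once the crawl is done.
var wg sync.WaitGroup

func Crawl(url string, depth int, fetcher Fetcher) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    if sc.Get(url) {
        errs <- fmt.Errorf("Already fetched: %s", url)
        return
    }
    sc.Set(url)
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        errs <- err
        return
    }
    ress <- fmt.Sprintf("found: %s %q", url, body)
    for _, u := range urls {
        wg.Add(1)
        go Crawl(u, depth-1, fetcher)
    }
}

func main() {
    done := make(chan struct{})
    go func() { // drain results
        for res := range ress {
            fmt.Println(res)
        }
        done <- struct{}{}
    }()
    go func() { // drain errors
        for err := range errs {
            fmt.Println(err)
        }
        done <- struct{}{}
    }()
    wg.Add(1)
    go Crawl("http://golang.org/", 4, fetcher)
    wg.Wait()   // all crawlers finished, so there are no more senders
    close(ress) // closing ends the drain loops
    close(errs)
    <-done
    <-done
}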

Go lang Tour Webcrawler Exercise - Solution

I am new to Go, and for a course I have to give a presentation about concurrency in Go.
I think the Go Tour webcrawler exercise is a nice example to talk about.
Before I do that, it would be nice if anybody could verify whether this solution fits.
I assume it is correct, but perhaps I have missed something, or somebody has a better alternative.
Here is my Code:
package main
import (
"fmt"
"sync"
"strconv"
"time"
)
/*
* Data and Types
* ===================================================================================
*/
var fetched map[string]bool // Map of fetched URLs -> true: fetched
var lock sync.Mutex // locks write access to fetched-map
var urlChan chan string // Channel to Write fetched URL
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"http://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"http://golang.org/pkg/",
"http://golang.org/cmd/",
},
},
"http://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"http://golang.org/",
"http://golang.org/cmd/",
"http://golang.org/pkg/fmt/",
"http://golang.org/pkg/os/",
},
},
"http://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
"http://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
}
/*
* End Data and Types
* ===================================================================================
*/
/*
* Webcrawler implementation
* ===================================================================================
*/
func waitUntilDone(d int) {
fMap := make(map[string]string)
for i := 0; i < d; i++ {
fMap[<-urlChan] = strconv.Itoa(time.Now().Nanosecond())
}
time.Sleep(time.Millisecond * 100)
fmt.Println()
fmt.Println("Fetch stats")
fmt.Println("==================================================================")
for k, v := range fMap {
fmt.Println("Fetched: " + k + " after: " + v + " ns")
}
fmt.Println("==================================================================")
fmt.Println()
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
var str string
var strArr [] string
var err error
if fetched[url] {
// already fetched?
str, strArr, err = "", nil, fmt.Errorf("already fetched: %s this will be ignored", url)
}else if res, ok := f[url]; ok {
str, strArr, err = res.body, res.urls, nil
urlChan <- url
}else {
str, strArr, err = "", nil, fmt.Errorf("not found: %s", url)
}
return str, strArr, err
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, goRoutNum int) {
if depth <= 0 {
return
}
// Start fetching url concurrently
fmt.Println("Goroutine " + strconv.Itoa(goRoutNum) + " is fetching: " + url)
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
// Lock map
lock.Lock()
fetched[url] = true
// Unlock
lock.Unlock()
fmt.Printf("found: %s %q\n", url, body)
for i, u := range urls {
go func(url string, goRoutNumber int) {
Crawl(url, depth - 1, fetcher, goRoutNumber)
}(u, i + 1)
}
return
}
func StartCrawling(url string, depth int, fetcher Fetcher) {
fmt.Println()
fmt.Println("Start crawling ...")
fmt.Println("==================================================================")
go func(u string, i int, f Fetcher) {
Crawl(u, i, f, 0)
}(url, depth, fetcher)
}
/*
* End Webcrawler implementation
* ===================================================================================
*/
/*
* Main
* ====================================================================
*/
func main() {
depth := len(fetcher)
fetched = make(map[string]bool)
url := "http://golang.org/"
urlChan = make(chan string, len(fetcher))
go StartCrawling(url, depth, fetcher)
waitUntilDone(depth)
}
/*
* End Main
* =====================================================================
*/
Playground: https://play.golang.org/p/GHHt5I162o
Exercise Link: https://tour.golang.org/concurrency/10

When and where to check if channel won't get any more data?

I'm trying to solve Exercise: Web Crawler
In this exercise you'll use Go's concurrency features to parallelize a
web crawler.
Modify the Crawl function to fetch URLs in parallel without fetching
the same URL twice.
When should I check whether all URLs have already been crawled? (Or how can I know when there will be no more data queued?)
package main
import (
"fmt"
)
type Result struct {
Url string
Depth int
}
type Stor struct {
Queue chan Result
Visited map[string]int
}
func NewStor() *Stor {
return &Stor{
Queue: make(chan Result,1000),
Visited: map[string]int{},
}
}
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(res Result, fetcher Fetcher, stor *Stor) {
defer func() {
/*
if len(stor.Queue) == 0 {
close(stor.Queue)
}
*/ // this is wrong, it makes the channel closes too early
}()
if res.Depth <= 0 {
return
}
// TODO: Don't fetch the same URL twice.
url := res.Url
stor.Visited[url]++
if stor.Visited[url] > 1 {
fmt.Println("skip:",stor.Visited[url],url)
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
stor.Queue <- Result{u,res.Depth-1}
}
return
}
func main() {
stor := NewStor()
Crawl(Result{"http://golang.org/", 4}, fetcher, stor)
for res := range stor.Queue {
// TODO: Fetch URLs in parallel.
go Crawl(res,fetcher,stor)
}
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"http://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"http://golang.org/pkg/",
"http://golang.org/cmd/",
},
},
"http://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"http://golang.org/",
"http://golang.org/cmd/",
"http://golang.org/pkg/fmt/",
"http://golang.org/pkg/os/",
},
},
"http://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
"http://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
}
The output was a deadlock because the stor.Queue channel was never closed.
The simplest way to wait until all goroutines are done is sync.WaitGroup from the sync package:
package main
import "sync"
var wg sync.WaitGroup
//then you do
func Crawl(res Result, fetcher Fetcher) { // why pass stor *Stor as an arg? It is visible to all goroutines anyway
defer wg.Done()
...
// why not spawn the new goroutine right inside Crawl?
for _, u := range urls {
wg.Add(1)
go Crawl(Result{u, res.Depth - 1}, fetcher)
}
...
}
...
//And in main.main()
func main() {
wg.Add(1)
Crawl(Result{"http://golang.org/", 4}, fetcher)
...
wg.Wait() //Will block until all routings Done
}
Complete solution will be:
package main
import (
"fmt"
"sync"
)
var wg sync.WaitGroup
var mu sync.Mutex // guards visited
var visited map[string]int = map[string]int{}
type Result struct {
Url string
Depth int
}
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(res Result, fetcher Fetcher) {
defer wg.Done()
if res.Depth <= 0 {
return
}
// TODO: Don't fetch the same URL twice.
url := res.Url
mu.Lock()
visited[url]++
count := visited[url]
mu.Unlock()
if count > 1 {
fmt.Println("skip:", count, url)
return
}
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
wg.Add(1)
go Crawl( Result{u,res.Depth-1},fetcher)
//stor.Queue <- Result{u,res.Depth-1}
}
return
}
func main() {
wg.Add(1)
Crawl(Result{"http://golang.org/", 4}, fetcher)
wg.Wait()
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"http://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"http://golang.org/pkg/",
"http://golang.org/cmd/",
},
},
"http://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"http://golang.org/",
"http://golang.org/cmd/",
"http://golang.org/pkg/fmt/",
"http://golang.org/pkg/os/",
},
},
"http://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
"http://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"http://golang.org/",
"http://golang.org/pkg/",
},
},
}
Checking the len of a channel is always a race; you can't use that for any sort of synchronization.
The producer is always the side that closes a channel, because it's a fatal error to try and send on a closed channel. Don't use a defer here, just close the channel when you're done sending.
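A tiny, generic illustration of that rule (not crawler-specific): the sending side closes the channel once it has nothing left to send, and the receiver's range loop then ends on its own.
package main

import "fmt"

func produce(ch chan<- int) {
    defer close(ch) // the producer owns the close
    for i := 0; i < 3; i++ {
        ch <- i
    }
}

func main() {
    ch := make(chan int)
    go produce(ch)
    for v := range ch { // terminates when produce closes ch
        fmt.Println(v)
    }
}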
