I'm building a crawler that takes a URL, extracts the links from it, and visits each of them up to a certain depth, building a tree of paths on a specific site.
The way I implemented parallelism for this crawler is to visit each newly found URL as soon as it is found, like this:
func main() {
link := "https://example.com"
wg := new(sync.WaitGroup)
wg.Add(1)
q := make(chan string)
go deduplicate(q, wg)
q <- link
wg.Wait()
}
func deduplicate(ch chan string, wg *sync.WaitGroup) {
for link := range ch {
// seen is a global variable that holds all seen URLs
if seen[link] {
wg.Done()
continue
}
seen[link] = true
go crawl(link, ch, wg)
}
}
func crawl(link string, q chan string, wg *sync.WaitGroup) {
// handle the link and create a variable "links" containing the links found inside the page
wg.Add(len(links))
for _, l := range links {
q <- l
}
}
This works fine for relatively small sites, but when I run it on a large one with a lot of links everywhere, I start getting one of these two errors on some requests: socket: too many open files and no such host (the host is indeed there).
What's the best way to handle this? Should I check for these errors and, when I get them, pause execution for some time until the other requests are finished? Or should I specify a maximum number of concurrent requests? (That makes more sense to me, but I'm not sure exactly how to code it.)
The "files" referred to in the error socket: too many open files are file descriptors, which include sockets (the HTTP requests made to load the web pages being scraped).
See this question.
The DNS query most likely also fails because it cannot open a file descriptor, but the error that is reported is no such host.
The problem can be fixed in two ways:
1) Increase the maximum number of open file handles
2) Limit the maximum number of concurrent `crawl` calls
1) is the simplest solution, but it might not be ideal, as it only postpones the problem until you find a website that has more links than the new limit. On Linux you can set this limit with ulimit -n.
2) is more a matter of design. We need to limit the number of HTTP requests that can be made concurrently. I have modified the code a little. The most important change is maxGoRoutines, a buffered channel used as a semaphore. Every scraping call inserts a value into the channel when it starts. Once the channel is full, the next call blocks until a value is removed from the channel, which happens every time a scraping call finishes.
package main
import (
"fmt"
"sync"
"time"
)
func main() {
link := "https://example.com"
wg := new(sync.WaitGroup)
wg.Add(1)
q := make(chan string)
go deduplicate(q, wg)
q <- link
fmt.Println("waiting")
wg.Wait()
}
//This is the maximum number of concurrent scraping calls running
var MaxCount = 100
var maxGoRoutines = make(chan struct{}, MaxCount)
func deduplicate(ch chan string, wg *sync.WaitGroup) {
seen := make(map[string]bool)
for link := range ch {
// Every link sent on ch has already been counted in wg by its sender.
// seen is only touched by this goroutine, so it needs no lock.
if seen[link] {
wg.Done()
continue
}
seen[link] = true
go crawl(link, ch, wg)
}
}
func crawl(link string, q chan string, wg *sync.WaitGroup) {
// Releases this link's count; once every counted link is done, main can quit
defer wg.Done()
links := doCrawl(link)
// Count the new links before sending them, so the counter cannot reach zero early
wg.Add(len(links))
for _, l := range links {
q <- l
}
}
func doCrawl(link string) []string {
//This limits the maximum number of concurrent scraping requests
maxGoRoutines <- struct{}{}
defer func() { <-maxGoRoutines }()
// handle the link and create a variable "links" containing the links found inside the page
time.Sleep(time.Second)
return []string{link + "a", link + "b"}
}
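One way to convince yourself that the buffered channel really caps concurrency is to count how many calls are inside the guarded section at the same time. The sketch below is only an illustration: runLimited, its parameters, and the 5 ms sleep standing in for an HTTP request are assumptions, not part of the code above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited starts `tasks` goroutines but lets at most `limit` of them
// execute the guarded section at once. It returns the highest number of
// goroutines that were observed inside that section simultaneously.
func runLimited(tasks, limit int) int {
	sem := make(chan struct{}, limit) // plays the role of maxGoRoutines
	var wg sync.WaitGroup
	var running, peak int64

	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // blocks once `limit` slots are taken
			defer func() { <-sem }() // frees the slot when the call returns

			n := atomic.AddInt64(&running, 1)
			for { // record n as the new peak if it is higher
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(5 * time.Millisecond) // stand-in for the HTTP request
			atomic.AddInt64(&running, -1)
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt64(&peak))
}

func main() {
	fmt.Println("peak concurrency:", runLimited(50, 10))
}
```

The reported peak should never exceed the limit passed in, no matter how many goroutines are started.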
I got this code from someone on GitHub, and I am trying to play around with it to understand concurrency.
package main
import (
"bufio"
"fmt"
"os"
"sync"
"time"
)
var wg sync.WaitGroup
func sad(url string) string {
fmt.Printf("gonna sleep a bit\n")
time.Sleep(2 * time.Second)
return url + " added stuff"
}
func main() {
sc := bufio.NewScanner(os.Stdin)
urls := make(chan string)
results := make(chan string)
for i := 0; i < 20; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for url := range urls {
n := sad(url)
results <- n
}
}()
}
for sc.Scan() {
url := sc.Text()
urls <- url
}
for result := range results {
fmt.Printf("%s arrived\n", result)
}
wg.Wait()
close(urls)
close(results)
}
I have a few questions:
Why does this code give me a deadlock?
How can that for loop come before the code that reads input from the user? Do the goroutines wait until something is passed into the urls channel and then start doing work? I don't get this because it isn't sequential: why is it considered wrong to read the user's input, put every line into the urls channel, and only then start the goroutines?
Inside the for loop I have another loop that iterates over the urls channel. Does each goroutine deal with exactly one line of input, or does one goroutine handle multiple lines? How does any of this work?
Am I gathering the output correctly here?
Mostly you're doing things correctly, but you have things a little out of order. The for sc.Scan() loop runs until the Scanner is done, so the for result := range results loop after it has not started yet, and no goroutine (main, in this case) is receiving from results. Once every worker is blocked sending to results, nothing reads from urls either, and the program deadlocks. When running your example, I moved the for result := range results loop before for sc.Scan() and put it in its own goroutine; otherwise for sc.Scan() would never be reached.
go func() {
for result := range results {
fmt.Printf("%s arrived\n", result)
}
}()
for sc.Scan() {
url := sc.Text()
urls <- url
}
Also, because you run wg.Wait() before close(urls), the main goroutine is left blocked waiting for the 20 sad() go routines to finish. But they can't finish until close(urls) is called. So just close that channel before waiting for the waitgroup.
close(urls)
wg.Wait()
close(results)
The for-loop creates 20 goroutines, all waiting for input from the urls channel. When someone writes into this channel, one of the goroutines will pick the value up and work on it. This is a typical worker-pool implementation.
Then the scanner reads input line by line and sends it to the urls channel, where one of the goroutines will pick it up and write the response to the results channel. At this point, no goroutine is reading from the results channel, so this send will block.
As the scanner reads more URLs, the other goroutines will pick them up and block in the same way. So if the scanner reads more than 20 URLs, the program will deadlock because all goroutines will be waiting for a reader.
If there are fewer than 20 URLs, the scanner for-loop will end, and the results will be read. However, that will eventually deadlock as well, because the for-loop over results terminates only when the channel is closed, and nothing ever closes it.
To fix this, first, close the urls channel right after you finish reading. That will release all the for-loops in the goroutines. Then you should put the for-loop reading from the results channel into a goroutine, so you can call wg.Wait while results are being processed. After wg.Wait, you can close the results channel.
This does not guarantee that all items in the results channel will be read. The program may terminate before all messages are processed, so use a third channel which you close at the end of the goroutine that reads from the results channel. That is:
done := make(chan struct{})
go func() {
defer close(done)
for result := range results {
fmt.Printf("%s arrived\n", result)
}
}()
wg.Wait()
close(results)
<-done
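Putting those steps together, a complete version might look like the sketch below. It is written as a function so it can be run without stdin; the name process, the workers parameter, and the slice input are assumptions made for the example, and the sleep in sad is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// process fans the input lines out to `workers` goroutines and collects
// every result. The order of the results is not guaranteed.
func process(lines []string, workers int) []string {
	urls := make(chan string)
	results := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range urls { // ends when urls is closed
				results <- url + " added stuff"
			}
		}()
	}

	// Start the reader before feeding input so workers never block forever.
	var collected []string
	done := make(chan struct{})
	go func() {
		defer close(done)
		for r := range results {
			collected = append(collected, r)
		}
	}()

	for _, l := range lines {
		urls <- l
	}
	close(urls)    // 1. no more input: the worker loops can finish
	wg.Wait()      // 2. every worker has sent its last result
	close(results) // 3. the reader loop can finish
	<-done         // 4. the reader has consumed everything
	return collected
}

func main() {
	out := process([]string{"a", "b", "c"}, 20)
	fmt.Println(len(out), "results arrived")
}
```

The order of the four numbered steps is what matters: closing results before wg.Wait would make a worker panic on send, and skipping the final receive from done could let the function return before the last result is stored.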
I am not super happy with the previous answers, so here is a solution based on the behavior documented in the Go tour, the Go documentation, and the language specification.
package main
import (
"bufio"
"fmt"
"strings"
"sync"
"time"
)
var wg sync.WaitGroup
func sad(url string) string {
fmt.Printf("gonna sleep a bit\n")
time.Sleep(2 * time.Millisecond)
return url + " added stuff"
}
func main() {
// sc := bufio.NewScanner(os.Stdin)
sc := bufio.NewScanner(strings.NewReader(strings.Repeat("blah blah\n", 15)))
urls := make(chan string)
results := make(chan string)
for i := 0; i < 20; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for url := range urls {
n := sad(url)
results <- n
}
}()
}
// results is written to by many goroutines,
// so we must wait for all of them to finish before closing it.
// We don't want to block here, so do that in its own goroutine.
go func() {
wg.Wait()
close(results)
}()
go func() {
for sc.Scan() {
url := sc.Text()
urls <- url
}
close(urls) // done sending on urls, so close it right away.
}()
for result := range results {
fmt.Printf("%s arrived\n", result)
} // the program will finish when it gets out of this loop.
// It will get out of this loop because you have made sure the results channel is closed.
}
I am trying to learn Golang and took on a simple project to call all the craigslist cities and query them for a specific search. In the code below I removed all the links in locationMap, but there are over 400 links there, so the loop is fairly large. I thought this would be a good test to put what I am learning into practice, but I am running into a strange issue.
Sometimes most of the http.Get() calls get no response from the server, while at other times they all succeed with no problem. So I started adding prints to show how many calls errored out and were recovered, and how many made it through successfully. Also, while this is running, it will randomly hang and never respond. The program doesn't freeze, but the page just sits there trying to load and the terminal shows no activity.
I am making sure my response body is closed by deferring the cleanup after the recover, but it still doesn't seem to work. Is there something that jumps out at anyone that I may be missing?
Thanks in advance guys!
package main
import (
"fmt"
"net/http"
"html/template"
"io/ioutil"
"encoding/xml"
"sync"
)
var wg sync.WaitGroup
var locationMap = map[string]string {"https://auburn.craigslist.org/": "auburn "...}
var totalRecovers int = 0
var successfulReads int = 0
type Listings struct {
Links []string `xml:"item>link"`
Titles []string `xml:"item>title"`
Descriptions []string `xml:"item>description"`
Dates []string `xml:"item>date"`
}
type Listing struct {
Title string
Description string
Date string
}
type ListAggPage struct {
Title string
Listings map[string]Listing
SearchRequest string
}
func cleanUp(link string) {
defer wg.Done()
if r:= recover(); r!= nil {
totalRecovers++
// recoverMap <- link
}
}
func cityRoutine(c chan Listings, link string) {
defer cleanUp(link)
var i Listings
address := link + "search/sss?format=rss&query=motorhome"
resp, rErr := http.Get(address)
if(rErr != nil) {
fmt.Println("Fatal error has occurs while getting response.")
fmt.Println(rErr);
}
bytes, bErr := ioutil.ReadAll(resp.Body)
if(bErr != nil) {
fmt.Println("Fatal error has occurs while getting bytes.")
fmt.Println(bErr);
}
xml.Unmarshal(bytes, &i)
resp.Body.Close()
c <- i
successfulReads++
}
func listingAggHandler(w http.ResponseWriter, r *http.Request) {
queue := make(chan Listings, 99999)
listing_map := make(map[string]Listing)
for key, _ := range locationMap {
wg.Add(1)
go cityRoutine(queue, key)
}
wg.Wait()
close(queue)
for elem := range queue {
for index, _ := range elem.Links {
listing_map[elem.Links[index]] = Listing{elem.Titles[index * 2], elem.Descriptions[index], elem.Dates[index]}
}
}
p := ListAggPage{Title: "Craigslist Aggregator", Listings: listing_map}
t, _ := template.ParseFiles("basictemplating.html")
fmt.Println(t.Execute(w, p))
fmt.Println("Successfully loaded: ", successfulReads)
fmt.Println("Recovered from: ", totalRecovers)
}
func indexHandler(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "<h1>Whoa, Go is neat!</h1>")
}
func main() {
http.HandleFunc("/", indexHandler)
http.HandleFunc("/agg/", listingAggHandler)
http.ListenAndServe(":8000", nil)
}
I'm having trouble finding the golang mailing list discussion I was reading in reference to this, but you generally don't want to open up hundreds of requests. There's some information here: How Can I Effectively 'Max Out' Concurrent HTTP Requests?
Craigslist might also just be rate limiting you. Either way, I recommend limiting yourself to around 20 simultaneous requests. Here's a quick update to your listingAggHandler:
queue := make(chan Listings, 99999)
listing_map := make(map[string]Listing)
request_queue := make(chan string)
for i := 0; i < 20; i++ {
go func() {
// range stops when request_queue is closed; a bare receive would keep returning ""
for key := range request_queue {
cityRoutine(queue, key)
}
}()
}
for key, _ := range locationMap {
wg.Add(1)
request_queue <- key
}
wg.Wait()
close(request_queue)
close(queue)
The application should still be very fast. I agree with the other comments on your question as well. I would also try to avoid putting so much in the global scope.
You could also spruce my changes up a little by just using the wait group in the request pool and have each goroutine clean itself up and decrement the wait group. That would limit some of the global scope.
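A minimal sketch of that idea follows, with the WaitGroup kept local and counting workers instead of requests. The names fetchAll and fake are hypothetical, and fake stands in for the real http.Get call in cityRoutine.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll runs fetch for every key on a fixed pool of workers.
// The WaitGroup is local to the function and counts workers, not requests.
func fetchAll(keys []string, workers int, fetch func(string) string) []string {
	requests := make(chan string)
	results := make(chan string, len(keys)) // buffered so workers never block on send
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // each worker cleans up after itself
			for key := range requests {
				results <- fetch(key)
			}
		}()
	}

	for _, k := range keys {
		requests <- k
	}
	close(requests) // lets the workers' range loops end
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	// fake is a hypothetical stand-in for the real HTTP request.
	fake := func(key string) string { return "listings for " + key }
	out := fetchAll([]string{"auburn", "bham", "dothan"}, 20, fake)
	fmt.Println(len(out), "cities fetched")
}
```

Because nothing here is global, two handler invocations running at the same time would not share a WaitGroup or counters.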
So I followed everyone's suggestions and it seems to have resolved my issue, so I greatly appreciate it. I ended up removing the global WaitGroup, as many suggested, and passed it in as a parameter (pointer) to clean up the code. As for the earlier errors, the program must have been maxing out the concurrent HTTP requests, as maxm mentioned. Since I added a wait after every 20 searches, I have not seen any errors. The program runs a little slower than I would like, but for learning purposes this has been helpful.
Below is the major change the code needed.
counter := 0
for key, _ := range locationMap {
if(counter >= 20) {
wg.Wait()
counter = 0
}
wg.Add(1)
frmtSearch := key + "search/sss?format=rss&query=" + strings.Replace(p.SearchRequest, " ", "%20", -1)
go cityRoutine(queue, frmtSearch, &wg)
counter++
}
I am implementing a web crawler and I have a Parse function that takes a link as input and should return all links contained in the page.
I would like to make the most of goroutines to make it as fast as possible. To do so, I want to create a pool of workers.
I set up a channel of strings representing the links, links := make(chan string), and pass it as an argument to the Parse function. I want the workers to communicate through a single channel. When the function starts, it takes a link from links, parses it, and for each valid link found in the page, adds that link to links.
func Parse(links chan string) {
l := <- links
// If link already parsed, return
for _, url := range newUrlFounds {
links <- url
}
}
However, the main issue here is how to tell when no more links will be found. One way I thought of doing it was to wait until all workers have completed, but I don't know how to do that in Go.
As Tim already commented, don't use the same channel for reading and writing in a worker. This will deadlock eventually (even if buffered, because Murphy).
A far simpler design is simply launching one goroutine per URL. A buffered channel can serve as a simple semaphore to limit the number of concurrent parsers (goroutines that don't do anything because they are blocked are usually negligible). Use a sync.WaitGroup to wait until all work is done.
package main
import (
"sync"
)
func main() {
sem := make(chan struct{}, 10) // allow ten concurrent parsers
wg := &sync.WaitGroup{}
wg.Add(1)
Parse("http://example.com", sem, wg)
wg.Wait()
// all done
}
func Parse(u string, sem chan struct{}, wg *sync.WaitGroup) {
defer wg.Done()
sem <- struct{}{} // grab
defer func() { <-sem }() // release
// If URL already parsed, return.
var newURLs []string
// ...
for _, u := range newURLs {
wg.Add(1)
go Parse(u, sem, wg)
}
}
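The "If URL already parsed, return" step needs a visited set that is safe to use from many goroutines. Below is one possible runnable sketch of the whole design, assuming a mutex-protected map and a hypothetical in-memory links map in place of real page fetching; the crawler type and crawlAll are names invented for the example.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawler bundles the shared state: semaphore, WaitGroup, and visited set.
type crawler struct {
	sem   chan struct{}
	wg    sync.WaitGroup
	mu    sync.Mutex
	seen  map[string]bool
	links map[string][]string // hypothetical in-memory "site", read-only
}

// parse visits u at most once and starts one goroutine per link found on it.
func (c *crawler) parse(u string) {
	defer c.wg.Done()

	c.mu.Lock()
	if c.seen[u] {
		c.mu.Unlock()
		return // already parsed
	}
	c.seen[u] = true
	c.mu.Unlock()

	c.sem <- struct{}{}   // grab a slot
	newURLs := c.links[u] // stand-in for fetching and parsing the page
	<-c.sem               // release the slot

	for _, next := range newURLs {
		c.wg.Add(1)
		go c.parse(next)
	}
}

// crawlAll returns every URL reachable from start, sorted.
func crawlAll(start string, links map[string][]string, limit int) []string {
	c := &crawler{
		sem:   make(chan struct{}, limit),
		seen:  make(map[string]bool),
		links: links,
	}
	c.wg.Add(1)
	go c.parse(start)
	c.wg.Wait()

	visited := make([]string, 0, len(c.seen))
	for u := range c.seen {
		visited = append(visited, u)
	}
	sort.Strings(visited)
	return visited
}

func main() {
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/"},
		"/b": {"/c"},
	}
	fmt.Println(crawlAll("/", site, 10)) // prints [/ /a /b /c]
}
```

The semaphore is held only around the fetch itself, so goroutines that are merely waiting for a slot or returning early on an already seen URL do not count against the limit.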
I need some help understanding how to use goroutines in this problem. I will post only some snippets of code, but if you want to take a deeper look you can check it out here.
Basically, I have a distributor function that receives a request slice and is called many times. Each time it is called, it must hand the request to another function that actually resolves it. What I'm trying to do is create a channel and launch that function on a new goroutine, so the program can handle requests concurrently.
How the distribute function is called:
// Run trigger the system to start receiving requests
func Run() {
// Since the programs starts here, let's make a channel to receive requests
requestCh := make(chan []string)
idCh := make(chan string)
// If you want to play with us you need to register your Sender here
go publisher.Sender(requestCh)
go makeID(idCh)
// Our request pool
for request := range requestCh {
// add ID
request = append(request, <-idCh)
// distribute
distributor(request)
}
// PROBLEM
for result := range resultCh {
fmt.Println(result)
}
}
The distribute function itself:
// Distribute requests to respective channels.
// No waiting in line. Everybody gets its own goroutine!
func distributor(request []string) {
switch request[0] {
case "sum":
arithCh := make(chan []string)
go arithmetic.Exec(arithCh, resultCh)
arithCh <- request
case "sub":
arithCh := make(chan []string)
go arithmetic.Exec(arithCh, resultCh)
arithCh <- request
case "mult":
arithCh := make(chan []string)
go arithmetic.Exec(arithCh, resultCh)
arithCh <- request
case "div":
arithCh := make(chan []string)
go arithmetic.Exec(arithCh, resultCh)
arithCh <- request
case "fibonacci":
fibCh := make(chan []string)
go fibonacci.Exec(fibCh, resultCh)
fibCh <- request
case "reverse":
revCh := make(chan []string)
go reverse.Exec(revCh, resultCh)
revCh <- request
case "encode":
encCh := make(chan []string)
go encode.Exec(encCh, resultCh)
encCh <- request
}
}
And here is the fibonacci.Exec function, to illustrate how I'm trying to calculate the Fibonacci number for a request received on fibCh and send the result through resultCh.
func Exec(fibCh chan []string, result chan map[string]string) {
fib := parse(<-fibCh)
nthFibonacci(fib)
result <- fib
}
So far, when I range over resultCh in the Run function, I get the results but also a deadlock. But why? Also, I imagine that I should use a WaitGroup to wait for the goroutines to finish, but I'm not sure how to implement that since I'm expecting to receive a continuous stream of requests. I would appreciate some help understanding what I'm doing wrong here and a way to solve it.
I'm not digging into the implementation details of your application, but from what you describe, you can use the workers pattern.
With the workers pattern, multiple goroutines read from a single channel, distributing the work between CPU cores, hence the name. In Go, this pattern is easy to implement: just start a number of goroutines with the channel as a parameter and send values to that channel. Distributing and multiplexing are done by the Go runtime, automagically.
Here is a simple implementation of the workers pattern:
package main
import (
"fmt"
"sync"
"time"
)
func worker(tasksCh <-chan int, wg *sync.WaitGroup) {
defer wg.Done()
for {
task, ok := <-tasksCh
if !ok {
return
}
d := time.Duration(task) * time.Millisecond
time.Sleep(d)
fmt.Println("processing task", task)
}
}
func pool(wg *sync.WaitGroup, workers, tasks int) {
tasksCh := make(chan int)
for i := 0; i < workers; i++ {
go worker(tasksCh, wg)
}
for i := 0; i < tasks; i++ {
tasksCh <- i
}
close(tasksCh)
}
func main() {
var wg sync.WaitGroup
wg.Add(36)
go pool(&wg, 36, 50)
wg.Wait()
}
Another useful resource on how to use a WaitGroup to wait for all the goroutines to finish executing before continuing (and so avoid getting trapped in a deadlock) is this nice article:
http://nathanleclaire.com/blog/2014/02/15/how-to-wait-for-all-goroutines-to-finish-executing-before-continuing/
And a very basic implementation of it:
Go playground
If you do not want to change the implementation to use the worker pattern, it may be a good idea to use another channel to signal the end of goroutine execution, because a deadlock happens when there is no receiver to accept a message sent through an unbuffered channel.
done := make(chan bool)
//.....
done <- true //Tell the main function everything is done.
So the goroutine marks its work as completed by sending true on the channel, and the main function continues once it receives that value.
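As a hedged illustration of the done channel, the sketch below runs one goroutine per request, collects the results in a separate reader goroutine, and uses done to learn when the reader has finished. The function collect and the squaring "work" are invented for the example and are not taken from the question's code.

```go
package main

import (
	"fmt"
	"sync"
)

// collect processes every request in its own goroutine and returns the
// sum of the results. The done channel reports that the reader is finished.
func collect(requests []int) int {
	resultCh := make(chan int)
	done := make(chan bool)
	total := 0

	// Reader: there is always a receiver for resultCh, so senders never hang.
	go func() {
		for r := range resultCh {
			total += r
		}
		done <- true // tell the caller everything has been read
	}()

	var wg sync.WaitGroup
	for _, n := range requests {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resultCh <- n * n // stand-in for the real computation
		}(n)
	}

	wg.Wait()       // every sender has finished
	close(resultCh) // ends the reader's range loop
	<-done          // wait until the reader has consumed the last value
	return total
}

func main() {
	fmt.Println(collect([]int{1, 2, 3})) // prints 14
}
```

For a continuous stream of requests there is no final wg.Wait(); in that case the reader goroutine simply stays running for the lifetime of the program.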
I'm trying to write my first web-spider in Golang. Its task is to crawl domains (and inspect their html) from the provided database query. The idea is to have no 3rd party dependencies (e.g. msg queue), or as little as possible, yet it has to be performant enough to crawl 5 million domains per day. I have approx 150 million domains I need to check every month.
The very basic version is below. It runs in an "infinite loop", as theoretically the crawl process would be endless.
func crawl(n time.Duration) {
var wg sync.WaitGroup
runtime.GOMAXPROCS(runtime.NumCPU())
for _ = range time.Tick(n * time.Second) {
wg.Add(1)
go func() {
defer wg.Done()
// do the expensive work here - query db, crawl domain, inspect html
}()
}
wg.Wait()
}
func main() {
go crawl(1)
select{}
}
Running this code on 4 CPU cores at the moment means it can perform max 345600 requests during 24 hours ((60 * 60 * 24) * 4) with the given threshold of 1s. At least that's my understanding :-) If my thinking's correct then I will need to come up with solution being 14x faster to meet daily requirements.
I would appreciate your advice on how to make the crawler faster without resorting to a complicated stack setup or buying a server with more CPU cores.
Why have the timing component at all?
Just create a channel that you feed URLs to, then spawn N goroutines that loop over that channel and do the work.
Then just tweak the value of N until your CPU/memory is capped at ~90% utilization (to accommodate fluctuations in site response times).
Something like this (on Play):
package main
import "fmt"
import "sync"
var numWorkers = 10
func crawler(urls chan string, wg *sync.WaitGroup) {
defer wg.Done()
for u := range urls {
fmt.Println(u)
}
}
func main() {
ch := make(chan string)
var wg sync.WaitGroup
for i := 0; i < numWorkers; i++ {
wg.Add(1)
go crawler(ch, &wg)
}
ch <- "http://ibm.com"
ch <- "http://google.com"
close(ch)
wg.Wait()
fmt.Println("All Done")
}