Performant web-spider with no external dependencies - performance

I'm trying to write my first web spider in Go. Its task is to crawl domains (and inspect their HTML) coming from a database query. The idea is to have no 3rd-party dependencies (e.g. a message queue), or as few as possible, yet it has to be performant enough to crawl 5 million domains per day. I have approximately 150 million domains I need to check every month.
The very basic version is below; it runs in an "infinite loop", as in theory the crawl process would be endless.
package main

import (
    "runtime"
    "sync"
    "time"
)

func crawl(n time.Duration) {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU())
    for range time.Tick(n * time.Second) {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // do the expensive work here - query db, crawl domain, inspect html
        }()
    }
    wg.Wait()
}

func main() {
    go crawl(1)
    select {}
}
Running this code on 4 CPU cores at the moment means it can perform at most 345,600 requests in 24 hours ((60 * 60 * 24) * 4) with the given threshold of 1s. At least that's my understanding :-) If my thinking is correct, I will need to come up with a solution that is ~14x faster to meet the daily requirement.
I would appreciate your advice on how to make the crawler faster, without resorting to a complicated stack setup or buying a server with more CPU cores.

Why have the timing component at all?
Just create a channel that you feed URLs to, then spawn N goroutines that loop over that channel and do the work.
Then just tweak the value of N until your CPU/memory is capped at ~90% utilization (to accommodate fluctuations in site response times).
Something like this (on Play):
package main

import "fmt"
import "sync"

var numWorkers = 10

func crawler(urls chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        fmt.Println(u)
    }
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }
    ch <- "http://ibm.com"
    ch <- "http://google.com"
    close(ch)
    wg.Wait()
    fmt.Println("All Done")
}
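To relate this back to the numbers in the question: 5 million domains per day is roughly 58 requests per second, so with an average fetch time of about one second you need on the order of 60-100 workers rather than more CPU cores, because the work is I/O-bound. A sketch of a worker doing the actual fetch (fetchAndInspect here is a stand-in for your query/crawl/inspect step, not code from the thread) might look like this:

package main

import (
    "io"
    "net/http"
    "sync"
    "time"
)

// A single shared client reuses connections and bounds how long a slow site can hold a worker.
var client = &http.Client{Timeout: 10 * time.Second}

func fetchAndInspect(url string) error {
    resp, err := client.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return err
    }
    _ = body // inspect the HTML here
    return nil
}

func crawler(urls <-chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        if err := fetchAndInspect(u); err != nil {
            continue // log and move on; one bad domain must not stall the pool
        }
    }
}

func main() {
    const numWorkers = 100 // tune until CPU, memory or bandwidth becomes the bottleneck

    urls := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(urls, &wg)
    }

    // feed URLs from the database query here
    urls <- "http://example.com"
    close(urls)
    wg.Wait()
}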

Related

gomaxprocs ignored when more workers are explicitly called

How can I use GOMAXPROCS? The code below sets GOMAXPROCS, but then more workers are spawned than that setting. I expect 2 processes, but 5 still run.
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func worker(i int, waiter chan struct{}, wg *sync.WaitGroup) {
    defer func(waiter chan struct{}, wg *sync.WaitGroup) {
        fmt.Printf("worker %d done\n", i)
        wg.Done()
        <-waiter
    }(waiter, wg)
    fmt.Printf("worker %d starting\n", i)
    time.Sleep(time.Second)
}

func main() {
    runtime.GOMAXPROCS(2)
    var concurrency = 5
    var items = 10
    waiter := make(chan struct{}, concurrency)
    var wg sync.WaitGroup
    for i := 0; i < items; i++ {
        wg.Add(1)
        waiter <- struct{}{}
        go worker(i, waiter, &wg)
    }
    wg.Wait()
}
Go has three concepts for what C/C++ programmers think of as a thread: G, P, M.
M = actual thread
G = Goroutines (i.e., the code in your program)
P = Processor
There is no Go API for limiting the number of Ms. There is no API for limiting the number of Gs - a new one gets created every time go func(...) is called. The GOMAXPROCS thing is there to limit Ps.
Each P is used to track the runtime state of some running Goroutine.
You should think of GOMAXPROCS as the peak number of Ms devoted to running Goroutines. (There are other Ms that don't run Goroutines, but handle garbage collection tasks and serve as template threads for creating new Ms as needed etc. Some Ms are devoted to holding runtime state while some Go code is blocked inside a system call.)
So, in terms of the code in your program, GOMAXPROCS is a constraint on how parallel its Go code execution can be. When a running Goroutine reaches a point where it becomes blocked, it is parked and its P is used to resume execution of some other Goroutine that is not blocked.
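To make that concrete, here is a small sketch (not from the original thread) showing that GOMAXPROCS caps parallelism, not the number of goroutines:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    runtime.GOMAXPROCS(2) // at most 2 goroutines execute Go code in parallel

    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(time.Second) // all five workers exist at the same time
        }()
    }

    // Typically prints 6 (the five sleeping workers plus main),
    // even though only 2 Ps are available to run them.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    fmt.Println("goroutines:", runtime.NumGoroutine())
    wg.Wait()
}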

Unsuccessful attempts at implementing concurrency

I'm having difficulty getting Go concurrency to work correctly. I'm working with data loaded from an XML data source. Once I load the data into memory, I loop through the XML elements and perform an operation. The code prior to the concurrency addition has been tested and is functional, and I don't believe it has any influence on the concurrency addition. I have 2 failed attempts at concurrency implementations, both with different outputs. I used locking because I don't want to end up with a race condition.
With this implementation, it never enters the goroutine.
var mu sync.Mutex
// length is 197K
for i := 0; i < len(listings.Listings); i++ {
    go func() {
        mu.Lock()
        // code execution (tested prior to adding concurrency and locking)
        mu.Unlock()
    }()
}
With this implementation using a WaitGroup, a runtime out-of-memory error occurs.
var mu sync.Mutex
var wg sync.WaitGroup
// length is 197K
for i := 0; i < len(listings.Listings); i++ {
    wg.Add(1)
    go func() {
        mu.Lock()
        // code execution (tested prior to adding concurrency, locking and wait group)
        wg.Done()
        mu.Unlock()
    }()
}
wg.Wait()
I'm not really sure what's going on and could use some assistance.
You don't need the Mutex here if you want to make it concurrent.
197K goroutines are a lot; try a lower number of goroutines. You can accomplish that by creating N goroutines, each of which listens on the same channel.
https://play.golang.org/p/s4e0YyHdyPq
package main

import (
    "fmt"
    "sync"
)

type Listing struct{}

func main() {
    var (
        wg          sync.WaitGroup
        concurrency = 100
    )
    c := make(chan Listing)
    wg.Add(concurrency)
    for i := 0; i < concurrency; i++ {
        go func(ci <-chan Listing) {
            for l := range ci {
                // code, l is a single Listing
                fmt.Printf("%v", l)
            }
            wg.Done()
        }(c)
    }
    // replace with your var
    listings := []Listing{Listing{}}
    for _, l := range listings {
        c <- l
    }
    close(c)
    wg.Wait()
}

Closing channel when all workers have finished

I am implementing a web crawler and I have a Parse function that takes a link as input and should return all links contained in the page.
I would like to make the most of goroutines to make it as fast as possible. To do so, I want to create a pool of workers.
I set up a channel of strings representing the links, links := make(chan string), and pass it as an argument to the Parse function. I want the workers to communicate through a single channel. When the function starts, it takes a link from links, parses it and, for each valid link found in the page, adds the link to links.
func Parse(links chan string) {
    l := <-links
    // If link already parsed, return
    for _, url := range newUrlFounds {
        links <- url
    }
}
However, the main issue here is knowing when no more links will be found. One way I thought of doing it was to wait until all workers have completed, but I don't know how to do that in Go.
As Tim already commented, don't use the same channel for reading and writing in a worker. This will deadlock eventually (even if buffered, because Murphy).
A far simpler design is simply launching one goroutine per URL. A buffered channel can serve as a simple semaphore to limit the number of concurrent parsers (goroutines that don't do anything because they are blocked are usually negligible). Use a sync.WaitGroup to wait until all work is done.
package main

import (
    "sync"
)

func main() {
    sem := make(chan struct{}, 10) // allow ten concurrent parsers
    wg := &sync.WaitGroup{}

    wg.Add(1)
    Parse("http://example.com", sem, wg)

    wg.Wait()
    // all done
}

func Parse(u string, sem chan struct{}, wg *sync.WaitGroup) {
    defer wg.Done()

    sem <- struct{}{}        // grab
    defer func() { <-sem }() // release

    // If URL already parsed, return.

    var newURLs []string
    // ...

    for _, u := range newURLs {
        wg.Add(1)
        go Parse(u, sem, wg)
    }
}
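The "If URL already parsed, return" step above is only a comment. One way to fill it in, extending the program above (markSeen and seenURLs are names introduced here for illustration, not part of the original answer), is a mutex-guarded set:

// seenURLs records every URL that has already been handed to Parse,
// so the same page is not fetched twice.
var (
    seenMu   sync.Mutex
    seenURLs = make(map[string]bool)
)

// markSeen reports whether u is new, recording it as seen in the same step.
func markSeen(u string) bool {
    seenMu.Lock()
    defer seenMu.Unlock()
    if seenURLs[u] {
        return false
    }
    seenURLs[u] = true
    return true
}

Parse would then begin with something like if !markSeen(u) { return } right after the deferred wg.Done().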

What's the best way to handle "too many open files"?

I'm building a crawler that takes a URL, extracts links from it, and visits each one of them to a certain depth; making a tree of paths on a specific site.
The way I implemented parallelism for this crawler is that I visit each newly found URL as soon as it's discovered, like this:
func main() {
    link := "https://example.com"
    wg := new(sync.WaitGroup)
    wg.Add(1)

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    wg.Wait()
}

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    for link := range ch {
        // seen is a global variable that holds all seen URLs
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // handle the link and create a variable "links" containing the links found inside the page
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}
This works fine for relatively small sites, but when I run it on a large one with a lot of links everywhere, I start getting one of these two errors on some requests: socket: too many open files and no such host (the host is indeed there).
What's the best way to handle this? Should I check for these errors and pause execution for some time when I get them, until the other requests are finished? Or specify a maximum number of requests allowed at a given time? (That makes more sense to me, but I'm not sure how to code it up exactly.)
The files referred to in the error socket: too many open files include threads and sockets (the HTTP requests used to load the web pages being scraped).
The DNS query also most likely fails due to being unable to create a file; however, the error that is reported is no such host.
The problem can be fixed in two ways:
1) Increase the maximum number of open file handles
2) Limit the maximum number of concurrent `crawl` calls
1) is the simplest solution, but it might not be ideal, as it only postpones the problem until you find a website that has more links than the new limit. On Linux you can set this limit with ulimit -n.
2) is more a problem of design. We need to limit the number of HTTP requests that can be made concurrently. I have modified the code a little. The most important change is maxGoRoutines. With every scraping call that starts, a value is inserted into the channel. Once the channel is full, the next call will block until a value is removed from the channel. A value is removed from the channel every time a scraping call finishes.
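If you go that route, you can also inspect or raise the limit from inside the program. A rough sketch for Linux (not from the original answer; it uses the Linux-specific syscall package) follows:

package main

import (
    "fmt"
    "syscall"
)

func main() {
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        panic(err)
    }
    fmt.Printf("open-file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)

    // Raise the soft limit up to the hard limit (no extra privileges needed).
    rl.Cur = rl.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        panic(err)
    }
}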
package main

import (
    "fmt"
    "sync"
    "time"
)

// This is the maximum number of concurrent scraping calls running
var MaxCount = 100
var maxGoRoutines = make(chan struct{}, MaxCount)

func main() {
    link := "https://example.com"
    wg := new(sync.WaitGroup)
    wg.Add(1) // accounts for the first link pushed below

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    fmt.Println("waiting")
    wg.Wait()
}

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    // seen holds all URLs that have already been queued for crawling
    seen := make(map[string]bool)
    for link := range ch {
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // This allows us to know when all the requests are done, so that we can quit
    defer wg.Done()

    links := doCrawl(link)
    wg.Add(len(links)) // account for the new links before queueing them
    for _, l := range links {
        q <- l
    }
}

func doCrawl(link string) []string {
    // This limits the maximum number of concurrent scraping requests
    maxGoRoutines <- struct{}{}
    defer func() { <-maxGoRoutines }()

    // handle the link and create a variable "links" containing the links found inside the page
    time.Sleep(time.Second)
    return []string{link + "a", link + "b"}
}
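The doCrawl above is only a stand-in (it just sleeps). In a real crawler another common source of socket: too many open files is forgetting to close response bodies, since every unclosed body keeps a file descriptor alive. A sketch of what the fetch part might look like, reusing a single shared http.Client (this assumes io, net/http and time are added to the imports; extractLinks is a placeholder for your HTML parsing, not part of the original answer):

// A single shared client reuses connections instead of opening a new socket per request.
var httpClient = &http.Client{Timeout: 15 * time.Second}

func doCrawl(link string) []string {
    // This limits the maximum number of concurrent scraping requests
    maxGoRoutines <- struct{}{}
    defer func() { <-maxGoRoutines }()

    resp, err := httpClient.Get(link)
    if err != nil {
        return nil // DNS failures, timeouts, etc. end this branch here
    }
    defer resp.Body.Close() // without this, descriptors leak and the error returns

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil
    }
    return extractLinks(link, body) // extractLinks is a placeholder for your HTML parsing
}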

Undetected "deadlock" while reading from channel

How do I deal with a situation where an undetected deadlock occurs when reading the results of an uncertain number of tasks from a channel in a complex program, e.g. a web server?
package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    rand.Seed(time.Now().UTC().UnixNano())
    results := make(chan int, 100)

    // we can't know how many tasks there will be
    for i := 0; i < rand.Intn(1<<8)+1<<8; i++ {
        go func(i int) {
            time.Sleep(time.Second)
            results <- i
        }(i)
    }

    // can't close the channel here
    // because it is still being written to
    //close(results)

    // something else is going on in other threads (think web server),
    // therefore a deadlock won't be detected
    go func() {
        for {
            time.Sleep(time.Second)
        }
    }()

    for j := range results {
        fmt.Println(j)
        // we just get stuck here
    }
}
In simpler programs Go detects the deadlock and fails properly. Most examples either fetch a known number of results or write to the channel sequentially.
The trick is to use sync.WaitGroup and wait for the tasks to finish in a non-blocking way.
var wg sync.WaitGroup
// we can't know how many tasks there will be
for i := 0; i < rand.Intn(1<<8)+1<<8; i++ {
    wg.Add(1)
    go func(i int) {
        time.Sleep(time.Second)
        results <- i
        wg.Done()
    }(i)
}

// wait for all tasks to finish in another goroutine
go func() {
    wg.Wait()
    close(results)
}()

// execution continues here, so you can print the results
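Put together with the original program, a complete runnable sketch of the pattern might look like this (the task body is still just a sleep):

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func main() {
    rand.Seed(time.Now().UTC().UnixNano())
    results := make(chan int, 100)

    var wg sync.WaitGroup
    // we still can't know how many tasks there will be
    for i := 0; i < rand.Intn(1<<8)+1<<8; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            time.Sleep(time.Second)
            results <- i
        }(i)
    }

    // close the channel only after every producer has finished,
    // so the range below can terminate instead of blocking forever
    go func() {
        wg.Wait()
        close(results)
    }()

    for j := range results {
        fmt.Println(j)
    }
    fmt.Println("done")
}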
See also: Go Concurrency Patterns: Pipelines and cancellation - The Go Blog
