Multiple Http.Get hanging randomly - go

I am trying to learn Go and took on a simple project: call all the craigslist cities and query them for a specific search. In the code below I removed all the links from the location map, but there are over 400 of them, so the loop is fairly large. I thought this would be a good test to put what I am learning into practice, but I am running into a strange issue.
Sometimes most of the http.Get() calls get no response from the server, while other times every one of them succeeds without a problem. So I started adding prints to show how many errored out (and were recovered) and how many made it through successfully. Also, while this is running it will randomly hang and never respond: the program doesn't freeze, but the page just sits there trying to load and the terminal shows no activity.
I am making sure my response body is closed by deferring the cleanup after the recover, but it still doesn't seem to work. Is there something that jumps out to anyone that maybe I am missing?
Thanks in advance guys!
package main
import (
"fmt"
"net/http"
"html/template"
"io/ioutil"
"encoding/xml"
"sync"
)
var wg sync.WaitGroup
var locationMap = map[string]string {"https://auburn.craigslist.org/": "auburn "...}
var totalRecovers int = 0
var successfulReads int = 0
type Listings struct {
Links []string `xml:"item>link"`
Titles []string `xml:"item>title"`
Descriptions []string `xml:"item>description"`
Dates []string `xml:"item>date"`
}
type Listing struct {
Title string
Description string
Date string
}
type ListAggPage struct {
Title string
Listings map[string]Listing
SearchRequest string
}
func cleanUp(link string) {
defer wg.Done()
if r:= recover(); r!= nil {
totalRecovers++
// recoverMap <- link
}
}
func cityRoutine(c chan Listings, link string) {
defer cleanUp(link)
var i Listings
address := link + "search/sss?format=rss&query=motorhome"
resp, rErr := http.Get(address)
if(rErr != nil) {
fmt.Println("Fatal error has occurs while getting response.")
fmt.Println(rErr);
}
bytes, bErr := ioutil.ReadAll(resp.Body)
if(bErr != nil) {
fmt.Println("Fatal error has occurs while getting bytes.")
fmt.Println(bErr);
}
xml.Unmarshal(bytes, &i)
resp.Body.Close()
c <- i
successfulReads++
}
func listingAggHandler(w http.ResponseWriter, r *http.Request) {
queue := make(chan Listings, 99999)
listing_map := make(map[string]Listing)
for key, _ := range locationMap {
wg.Add(1)
go cityRoutine(queue, key)
}
wg.Wait()
close(queue)
for elem := range queue {
for index, _ := range elem.Links {
listing_map[elem.Links[index]] = Listing{elem.Titles[index * 2], elem.Descriptions[index], elem.Dates[index]}
}
}
p := ListAggPage{Title: "Craigslist Aggregator", Listings: listing_map}
t, _ := template.ParseFiles("basictemplating.html")
fmt.Println(t.Execute(w, p))
fmt.Println("Successfully loaded: ", successfulReads)
fmt.Println("Recovered from: ", totalRecovers)
}
func indexHandler(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "<h1>Whoa, Go is neat!</h1>")
}
func main() {
http.HandleFunc("/", indexHandler)
http.HandleFunc("/agg/", listingAggHandler)
http.ListenAndServe(":8000", nil)
}

I'm having trouble finding the golang mailing list discussion I was reading in reference to this, but you generally don't want to open up hundreds of requests. There's some information here: How Can I Effectively 'Max Out' Concurrent HTTP Requests?
Craigslist might also just be rate limiting you. Either way, I recommend limiting yourself to around 20 simultaneous requests or so; here's a quick update to your listingAggHandler:
queue := make(chan Listings, 99999)
listing_map := make(map[string]Listing)
request_queue := make(chan string)
for i := 0; i < 20; i++ {
go func() {
for {
key := <- request_queue
cityRoutine(queue, key)
}
}()
}
for key := range locationMap {
wg.Add(1)
request_queue <- key
}
wg.Wait()
close(request_queue)
close(queue)
The application should still be very fast. I agree with the other comments on your question as well. I would also try to avoid putting so much in the global scope.
You could also tidy my changes up a little by using the wait group inside the request pool and having each goroutine clean up after itself and decrement the wait group; see the sketch below. That would reduce the reliance on global state.
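Here's a rough sketch of that tidier version. It is only an illustration: it assumes cityRoutine is changed so it no longer touches the global WaitGroup or does its own recover, and the worker count of 20 is arbitrary.
var wg sync.WaitGroup
queue := make(chan Listings, len(locationMap)) // buffered so senders never block
request_queue := make(chan string)

wg.Add(20)
for i := 0; i < 20; i++ {
	go func() {
		defer wg.Done() // each pool goroutine cleans itself up
		for key := range request_queue {
			cityRoutine(queue, key) // assumed variant that no longer calls wg.Done itself
		}
	}()
}

for key := range locationMap {
	request_queue <- key
}
close(request_queue) // lets the pool goroutines fall out of their range loop
wg.Wait()            // once every worker has exited, all requests are done
close(queue)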

So I followed everyone's suggestions and it seems to have resolved my issue, so I greatly appreciate it. I ended up removing the global WaitGroup, as many suggested, and passed it in as a parameter (a pointer) to clean up the code. As for the earlier errors, it must have been maxing out the concurrent HTTP requests, as maxm mentioned. Once I added a wait after every 20 searches, I haven't seen any errors. The program runs a little slower than I would like, but for learning purposes this has been helpful.
Below is the major change the code needed.
counter := 0
for key, _ := range locationMap {
if(counter >= 20) {
wg.Wait()
counter = 0
}
wg.Add(1)
frmtSearch := key + "search/sss?format=rss&query=" + strings.Replace(p.SearchRequest, " ", "%20", -1)
go cityRoutine(queue, frmtSearch, &wg)
counter++
}
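For completeness, the cityRoutine that goes with that call might look roughly like this. It is a reconstruction, not the poster's exact final code: it takes the pre-built search URL and the WaitGroup pointer, defers Done, and returns early on errors instead of relying on recover.
func cityRoutine(c chan Listings, address string, wg *sync.WaitGroup) {
	defer wg.Done()
	var l Listings
	resp, err := http.Get(address)
	if err != nil {
		fmt.Println("error getting response:", err)
		return
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("error reading body:", err)
		return
	}
	xml.Unmarshal(body, &l)
	c <- l
}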

Related

Is creating goroutines asynchronous?

I'm trying to fetch the content of an API with numerous goroutines.
I'm using a for loop to iterate over different characters, but it seems like the for loop reaches its final value before the requests are sent off.
package main
import (
"encoding/json"
"fmt"
"net/http"
"sync"
)
type people struct {
Name string `json:"name"`
}
func main(){
names := make(chan string, 25)
var wg sync.WaitGroup
for i := 0; i < 25; i++ {
wg.Add(1)
go func() {
defer wg.Done()
var p people
url := fmt.Sprintf("https://swapi.dev/api/people/%d", i)
getJSON(url, &p)
names <- p.Name
}()
}
name := <-names
fmt.Println(name)
wg.Wait()
}
func getJSON(url string, target interface{}) error {
r, err := http.Get(url)
if err != nil {
return err
}
defer r.Body.Close()
json.NewDecoder(r.Body).Decode(target)
return nil
}
Also, if somebody could improve my code quality I'd be very grateful; I'm very new to Go and don't have anybody to learn from!
Your goroutines are all using the same variable i. So on the first iteration, you launch a goroutine that builds a URL from i, and on the next iteration i is incremented before that goroutine has had a chance to run.
It's a common mistake in Go. The solution is to make a new variable for each iteration and pass that one forward. You can either do it with a closure like this (playground):
for i := 0; i < 25; i++ {
wg.Add(1)
localI := i
go func() {
defer wg.Done()
var p people
// Use localI here
url := fmt.Sprintf("https://swapi.dev/api/people/%d", localI)
getJSON(url, &p)
names <- p.Name
}()
}
Or as an argument to the function (playground)
for i := 0; i < 25; i++ {
wg.Add(1)
go func(localI int) {
defer wg.Done()
var p people
// Use localI here
url := fmt.Sprintf("https://swapi.dev/api/people/%d", localI)
getJSON(url, &p)
names <- p.Name
// Pass i here. Since i is a primitive, it is passed by value, not by reference.
// Meaning a copy is made.
}(i)
}
Here is a good writeup on the mistake you made:
https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables
And the one above it is good to read too!
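On the code-quality request in the question: once the loop-variable fix above is applied, main still prints only a single name, because it does exactly one receive from the channel. A small sketch of a replacement for the tail of main that drains all 25 results (assuming the same getJSON helper and the buffered names channel from the question) could look like this:
// Close the channel once every worker has sent its name and called Done.
go func() {
	wg.Wait()
	close(names)
}()
// Ranging over the channel collects every result and stops when it is closed.
for name := range names {
	fmt.Println(name)
}
This would replace the single name := <-names receive, the fmt.Println(name), and the trailing wg.Wait() in the original main.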

Golang Goroutine synchronization with channel

I have the following program where the HTTP server is created using gorilla mux.
When any request comes in, it starts goroutine 1. While processing, I start another goroutine, goroutine 2.
I want to wait for goroutine 2's response in goroutine 1. How can I do that?
How do I ensure that only goroutine 2 will give its response to goroutine 1?
There can be a GR4 created by GR3, and GR3 should wait for GR4 only.
GR = Goroutine
SERVER
package main
import (
"encoding/json"
"fmt"
"net/http"
"strconv"
"time"
"github.com/gorilla/mux"
)
type Post struct {
ID string `json:"id"`
Title string `json:"title"`
Body string `json:"body"`
}
var posts []Post
var i = 0
func getPosts(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
i++
fmt.Println(i)
ch := make(chan int)
go getTitle(ch, i)
p := Post{
ID: "123",
}
// Wait for getTitle result and update variable P with title
s := <-ch
//
p.Title = strconv.Itoa(s) + strconv.Itoa(i)
json.NewEncoder(w).Encode(p)
}
func main() {
router := mux.NewRouter()
posts = append(posts, Post{ID: "1", Title: "My first post", Body: "This is the content of my first post"})
router.HandleFunc("/posts", getPosts).Methods("GET")
http.ListenAndServe(":9999", router)
}
func getTitle(resultCh chan int, m int) {
time.Sleep(2 * time.Second)
resultCh <- m
}
CLIENT
package main
import (
"fmt"
"net/http"
"io/ioutil"
"time"
)
func main(){
for i :=0;i <100 ;i++ {
go main2()
}
time.Sleep(200 * time.Second)
}
func main2() {
url := "http://localhost:9999/posts"
method := "GET"
client := &http.Client {
}
req, err := http.NewRequest(method, url, nil)
if err != nil {
fmt.Println(err)
}
res, err := client.Do(req)
defer res.Body.Close()
body, err := ioutil.ReadAll(res.Body)
fmt.Println(string(body))
}
RESULT ACTUAL
{"id":"123","title":"25115","body":""}
{"id":"123","title":"23115","body":""}
{"id":"123","title":"31115","body":""}
{"id":"123","title":"44115","body":""}
{"id":"123","title":"105115","body":""}
{"id":"123","title":"109115","body":""}
{"id":"123","title":"103115","body":""}
{"id":"123","title":"115115","body":""}
{"id":"123","title":"115115","body":""}
{"id":"123","title":"115115","body":""}
RESULT EXPECTED
{"id":"123","title":"112112","body":""}
{"id":"123","title":"113113","body":""}
{"id":"123","title":"115115","body":""}
{"id":"123","title":"116116","body":""}
{"id":"123","title":"117117","body":""}
There are a few ways to do this; a simple way is to use channels.
Change the getTitle func to this:
func getTitle(resultCh chan string) {
time.Sleep(2 * time.Second)
resultCh <- "Game Of Thrones"
}
and getPosts will use it like this:
func getPosts(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
ch := make(chan string)
go getTitle(ch)
s := <-ch // this will wait until getTile inserts data to channel
p := Post{
ID: s,
}
json.NewEncoder(w).Encode(p)
}
I suspect you are new to Go; this is basic channel usage. Check here for more details: Channels.
So the problem you're having is that you haven't really learned how to deal with concurrent code yet (not a dig, I was there once). Most of this doesn't center on channels: the channels are working correctly, as kojan's answer explains. Where things go awry is with the i variable. Firstly, you have to understand that i is not being mutated atomically, so if your client requests arrive in parallel you can mess up the number:
C1:        C2:
i == 6     i == 6
i++        i++
i == 7     i == 7
Two increments in software become one increment in actuality because i++ is really 3 operations: load, increment, store.
The second problem you have is that i is not a pointer, so when you pass i to your goroutine you're making a copy. The i inside the goroutine is sent back on the channel and becomes the first number in your concatenated string, which you can watch increment. However, the i left behind, which is used in the tail of the string, has continued to be incremented by successive client invocations.
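To make both points concrete, here is a minimal sketch of the handler with the shared counter made safe. It is illustrative only: it uses sync/atomic and plain net/http instead of gorilla/mux so it stays self-contained, and the incremented value is captured in a local n so the goroutine and the response body see the same number.
package main

import (
	"encoding/json"
	"net/http"
	"strconv"
	"sync/atomic"
	"time"
)

type Post struct {
	ID    string `json:"id"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

var counter int64 // replaces the unguarded global i

func getTitle(resultCh chan int64, m int64) {
	time.Sleep(2 * time.Second)
	resultCh <- m
}

func getPosts(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	n := atomic.AddInt64(&counter, 1) // load, increment, and store as one atomic step
	ch := make(chan int64)
	go getTitle(ch, n)
	s := <-ch // wait for the goroutine started for this request
	p := Post{ID: "123", Title: strconv.FormatInt(s, 10) + strconv.FormatInt(n, 10)}
	json.NewEncoder(w).Encode(p)
}

func main() {
	http.HandleFunc("/posts", getPosts)
	http.ListenAndServe(":9999", nil)
}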

What's the best way to handle "too many open files"?

I'm building a crawler that takes a URL, extracts links from it, and visits each one of them to a certain depth; making a tree of paths on a specific site.
The way I implemented parallelism for this crawler is to visit each newly found URL as soon as it's discovered, like this:
func main() {
link := "https://example.com"
wg := new(sync.WaitGroup)
wg.Add(1)
q := make(chan string)
go deduplicate(q, wg)
q <- link
wg.Wait()
}
func deduplicate(ch chan string, wg *sync.WaitGroup) {
for link := range ch {
// seen is a global variable that holds all seen URLs
if seen[link] {
wg.Done()
continue
}
seen[link] = true
go crawl(link, ch, wg)
}
}
func crawl(link string, q chan string, wg *sync.WaitGroup) {
// handle the link and create a variable "links" containing the links found inside the page
wg.Add(len(links))
for _, l := range links {
q <- l
}
}
This works fine for relatively small sites, but when I run it on a large one with a lot of links everywhere, I start getting one of these two errors on some requests: socket: too many open files and no such host (the host is indeed there).
What's the best way to handle this? Should I check for these errors and pause execution for some time when I get them, until the other requests are finished? Or specify a maximum number of requests allowed at any one time? (The latter makes more sense to me, but I'm not sure exactly how to code it up.)
The files referred to in the error socket: too many open files include threads and sockets (the HTTP requests used to load the web pages being scraped).
See this question.
The DNS query also most likely fails because it is unable to create a file; however, the error that is reported is no such host.
The problem can be fixed in two ways:
1) Increase the maximum number of open file handles
2) Limit the maximum number of concurrent `crawl` calls
1) is the simplest solution, but it might not be ideal, as it only postpones the problem until you find a website that has more links than the new limit. On Linux you can raise this limit with ulimit -n.
2) is more of a design problem. We need to limit the number of HTTP requests that can be made concurrently. I have modified the code a little. The most important change is maxGoRoutines: every time a scraping call starts, a value is inserted into the channel; once the channel is full, the next call blocks until a value is removed. A value is removed from the channel every time a scraping call finishes.
package main
import (
"fmt"
"sync"
"time"
)
func main() {
link := "https://example.com"
wg := new(sync.WaitGroup)
wg.Add(1)
q := make(chan string)
go deduplicate(q, wg)
q <- link
fmt.Println("waiting")
wg.Wait()
}
//This is the maximum number of concurrent scraping calls running
var MaxCount = 100
var maxGoRoutines = make(chan struct{}, MaxCount)
func deduplicate(ch chan string, wg *sync.WaitGroup) {
seen := make(map[string]bool)
for link := range ch {
// seen (declared above, local to deduplicate) holds all URLs seen so far
if seen[link] {
wg.Done()
continue
}
seen[link] = true
wg.Add(1)
go crawl(link, ch, wg)
}
}
func crawl(link string, q chan string, wg *sync.WaitGroup) {
//This allows us to know when all the requests are done, so that we can quit
defer wg.Done()
links := doCrawl(link)
for _, l := range links {
q <- l
}
}
func doCrawl(link string) []string {
//This limits the maximum number of concurrent scraping requests
maxGoRoutines <- struct{}{}
defer func() { <-maxGoRoutines }()
// handle the link and create a variable "links" containing the links found inside the page
time.Sleep(time.Second)
return []string{link + "a", link + "b"}
}

Use range channel in go error fatal [duplicate]

I don't understand why the deadlock is occurring in this code. I've tried several different things to get the deadlock to stop (several different versions using a WaitGroup). This is my first day in Go, and I am pretty disappointed so far with the complexity of fairly simple and straightforward operations. I feel like I'm missing something big and obvious, but all of the docs I have found on this seem very different from what, to me, is a very basic mode of operation. All of the docs use primitive types for channels (int, string) rather than more complex types, with very basic for loops, or they are at the other end of the spectrum, where the functions are fairly complicated orchestrations.
I guess I'm really looking for a middle of the road sample of "this is how it's usually done" with goroutines.
package main
import "fmt"
//import "sync"
import "time"
type Item struct {
name string
}
type Truck struct {
Cargo []Item
name string
}
func UnloadTrucks(c chan Truck) {
for t := range c {
fmt.Printf("%s has %d items in cargo: %s\n", t.name, len(t.Cargo), t.Cargo[0].name)
}
}
func main() {
trucks := make([]Truck, 2)
ch := make(chan Truck)
for i, _ := range trucks {
trucks[i].name = fmt.Sprintf("Truck %d", i+1)
fmt.Printf("Building %s\n", trucks[i].name)
}
for t := range trucks {
go func(tr Truck) {
itm := Item{}
itm.name = "Groceries"
fmt.Printf("Loading %s\n", tr.name)
tr.Cargo = append(tr.Cargo, itm)
ch <- tr
}(trucks[t])
}
time.Sleep(50 * time.Millisecond)
fmt.Println("Unloading Trucks")
UnloadTrucks(ch)
fmt.Println("Done")
}
You never close the "truck" channel ch, so UnloadTrucks never returns.
You can close the channel after all workers are done by using a WaitGroup:
go func() {
wg.Wait()
close(ch)
}()
UnloadTrucks(ch)
http://play.golang.org/p/1V7UbYpsQr
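Spelled out a bit more, here is a sketch of how the WaitGroup could be wired into the original main (the variable names come from the question, and the commented-out sync import needs to come back):
var wg sync.WaitGroup
for t := range trucks {
	wg.Add(1)
	go func(tr Truck) {
		defer wg.Done()
		itm := Item{name: "Groceries"}
		fmt.Printf("Loading %s\n", tr.name)
		tr.Cargo = append(tr.Cargo, itm)
		ch <- tr
	}(trucks[t])
}
go func() {
	wg.Wait()
	close(ch) // safe: every truck has been sent by the time Wait returns
}()
UnloadTrucks(ch)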

Always have x number of goroutines running at any time

I see lots of tutorials and examples on how to make Go wait for x number of goroutines to finish, but what I'm trying to do is ensure there are always x number running, so a new goroutine is launched as soon as one ends.
Specifically I have a few hundred thousand 'things to do' which is processing some stuff that is coming out of MySQL. So it works like this:
db, err := sql.Open("mysql", connection_string)
checkErr(err)
defer db.Close()
rows,err := db.Query(`SELECT id FROM table`)
checkErr(err)
defer rows.Close()
var id uint
for rows.Next() {
err := rows.Scan(&id)
checkErr(err)
go processTheThing(id)
}
checkErr(err)
rows.Close()
Currently that will launch several hundred thousand threads of processTheThing(). What I need is that a maximum of x number (we'll call it 20) goroutines are launched. So it starts by launching 20 for the first 20 rows, and from then on it will launch a new goroutine for the next id the moment that one of the current goroutines has finished. So at any point in time there are always 20 running.
I'm sure this is quite simple/standard, but I can't seem to find a good explanation in any of the tutorials or examples of how this is done.
You may find the Go Concurrency Patterns article interesting, especially the Bounded parallelism section; it explains the exact pattern you need.
You can use a channel of empty structs as a limiting guard to control the number of concurrent worker goroutines:
package main
import "fmt"
func main() {
maxGoroutines := 10
guard := make(chan struct{}, maxGoroutines)
for i := 0; i < 30; i++ {
guard <- struct{}{} // would block if guard channel is already filled
go func(n int) {
worker(n)
<-guard
}(i)
}
}
func worker(i int) { fmt.Println("doing work on", i) }
I think something simple like this will work:
package main
import "fmt"
const MAX = 20
func main() {
sem := make(chan int, MAX)
for {
sem <- 1 // will block if there is MAX ints in sem
go func() {
fmt.Println("hello again, world")
<-sem // removes an int from sem, allowing another to proceed
}()
}
}
Thanks to everyone for helping me out with this. However, I don't feel that anyone really provided something that both worked and was simple/understandable, although you did all help me understand the technique.
What I have done in the end is I think much more understandable and practical as an answer to my specific question, so I will post it here in case anyone else has the same question.
Somehow this ended up looking a lot like what OneOfOne posted, which is great because now I understand that. But I found OneOfOne's code very difficult to understand at first, because passing functions to functions made it quite confusing to work out which bit was for what. I think this way makes a lot more sense:
package main
import (
"fmt"
"sync"
)
const xthreads = 5 // Total number of threads to use, excluding the main() thread
func doSomething(a int) {
fmt.Println("My job is",a)
return
}
func main() {
var ch = make(chan int, 50) // This number 50 can be anything as long as it's larger than xthreads
var wg sync.WaitGroup
// This starts xthreads number of goroutines that wait for something to do
wg.Add(xthreads)
for i:=0; i<xthreads; i++ {
go func() {
for {
a, ok := <-ch
if !ok { // if there is nothing to do and the channel has been closed then end the goroutine
wg.Done()
return
}
doSomething(a) // do the thing
}
}()
}
// Now the jobs can be added to the channel, which is used as a queue
for i:=0; i<50; i++ {
ch <- i // add i to the queue
}
close(ch) // This tells the goroutines there's nothing else to do
wg.Wait() // Wait for the threads to finish
}
1) Create a channel for passing data to the goroutines.
2) Start 20 goroutines that process the data from the channel in a loop.
3) Send the data to the channel instead of starting a new goroutine.
Grzegorz Żur's answer is the most efficient way to do it, but for a newcomer it could be hard to implement without reading code, so here's a very simple implementation:
type idProcessor func(id uint)
func SpawnStuff(limit uint, proc idProcessor) chan<- uint {
ch := make(chan uint)
for i := uint(0); i < limit; i++ {
go func() {
for {
id, ok := <-ch
if !ok {
return
}
proc(id)
}
}()
}
return ch
}
func main() {
runtime.GOMAXPROCS(4)
var wg sync.WaitGroup //this is just for the demo, otherwise main will return
fn := func(id uint) {
fmt.Println(id)
wg.Done()
}
wg.Add(1000)
ch := SpawnStuff(10, fn)
for i := uint(0); i < 1000; i++ {
ch <- i
}
close(ch) //should do this to make all the goroutines exit gracefully
wg.Wait()
}
playground
This is a simple producer-consumer problem, which in Go can be solved easily by using a channel to buffer the jobs.
To put it simply: create a channel that accepts your IDs. Run a number of goroutines that read from the channel in a loop and process each ID. Then run your loop that feeds IDs into the channel.
Example:
func producer() {
var buffer = make(chan uint)
for i := 0; i < 20; i++ {
go consumer(buffer)
}
for _, id := range IDs {
buffer <- id
}
}
func consumer(buffer chan uint) {
for {
id := <- buffer
// Do your things here
}
}
Things to know:
Unbuffered channels are blocking: if the item written to the channel isn't accepted, the goroutine feeding the item will block until it is.
My example lacks a closing mechanism: you must find a way to make the producer wait for all consumers to end their loop before returning. The simplest way to do this is with another channel; I'll let you think about it. One possible shape is sketched below.
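For reference, here is one minimal sketch of such a shutdown. It deliberately swaps the extra channel hinted at above for a sync.WaitGroup (either works), and the names and the worker count are only illustrative:
func producer(ids []uint) {
	buffer := make(chan uint)
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go consumer(buffer, &wg)
	}
	for _, id := range ids {
		buffer <- id
	}
	close(buffer) // tells the consumers there is nothing left to read
	wg.Wait()     // wait for every consumer to finish its last item
}

func consumer(buffer chan uint, wg *sync.WaitGroup) {
	defer wg.Done()
	for id := range buffer {
		_ = id // do your things here
	}
}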
I've written a simple package to handle concurrency in Go. This package will help you limit the number of goroutines that are allowed to run concurrently:
https://github.com/zenthangplus/goccm
Example:
package main
import (
"fmt"
"goccm"
"time"
)
func main() {
// Limit 3 goroutines to run concurrently.
c := goccm.New(3)
for i := 1; i <= 10; i++ {
// This has to be called before starting any goroutine
c.Wait()
go func(i int) {
fmt.Printf("Job %d is running\n", i)
time.Sleep(2 * time.Second)
// This has to be called when a goroutine has finished
// (or you can use `defer c.Done()` at the top of the goroutine).
c.Done()
}(i)
}
// Call this to ensure all goroutines have finished
// before the main program exits.
c.WaitAllDone()
}
You can also take a look here: https://github.com/LiangfengChen/goutil/blob/main/concurrent.go
The test case below shows how it can be used.
func TestParallelCall(t *testing.T) {
format := "test:%d"
data := make(map[int]bool)
mutex := sync.Mutex{}
val, err := ParallelCall(1000, 10, func(pos int) (interface{}, error) {
mutex.Lock()
defer mutex.Unlock()
data[pos] = true
return pos, errors.New(fmt.Sprintf(format, pos))
})
for i := 0; i < 1000; i++ {
if _, ok := data[i]; !ok {
t.Errorf("TestParallelCall pos not found: %d", i)
}
if val[i] != i {
t.Errorf("TestParallelCall return value is not right (%d,%v)", i, val[i])
}
if err[i].Error() != fmt.Sprintf(format, i) {
t.Errorf("TestParallelCall error msg is not correct (%d,%v)", i, err[i])
}
}
}
