Golang web spider with pagination processing

I'm working on a Go web crawler that parses the search results of a specific search engine. The main difficulty is doing this concurrently, specifically handling pagination such as
← Previous 1 2 3 4 5 ... 34 Next →. Everything works except the recursive crawling of paginated results. Here is my code:
package main
import (
"bufio"
"errors"
"fmt"
"net"
"strings"
"github.com/antchfx/htmlquery"
"golang.org/x/net/html"
)
type Spider struct {
HandledUrls []string
}
func NewSpider(url string) *Spider {
// ...
}
func requestProvider(request string) string {
// Everything is good here
}
func connectProvider(url string) net.Conn {
// Also works
}
// getContents makes request to search engine and gets response body
func getContents(request string) *html.Node {
// ...
}
// CheckResult controls empty search results
func checkResult(node *html.Node) bool {
// ...
}
func (s *Spider) checkVisited(url string) bool {
// ...
}
// The problem is in this method
func (s *Spider) Crawl(url string, channelDone chan bool, channelBody chan *html.Node) {
body := getContents(url)
defer func() {
channelDone <- true
}()
if checkResult(body) == false {
err := errors.New("Nothing found there")
ErrFatal(err)
}
channelBody <- body
s.HandledUrls = append(s.HandledUrls, url)
fmt.Println("Handled ", url)
newUrls := s.getPagination(body)
for _, u := range newUrls {
fmt.Println(u)
}
for i, newurl := range newUrls {
if s.checkVisited(newurl) == false {
fmt.Println(i)
go s.Crawl(newurl, channelDone, channelBody)
}
}
}
func (s *Spider) getPagination(node *html.Node) []string {
// ...
}
func main() {
request := requestProvider(*requestFlag)
channelBody := make(chan *html.Node, 120)
channelDone := make(chan bool)
var parsedHosts []*Host
s := NewSpider(request)
go s.Crawl(request, channelDone, channelBody)
for {
select {
case recievedNode := <-channelBody:
// ...
for _, h := range newHosts {
parsedHosts = append(parsedHosts, h)
fmt.Println("added", h.HostUrl)
}
case <-channelDone:
fmt.Println("Jobs finished")
}
break
}
}
It always returns only the first page, with no pagination. getPagination(...) itself works fine. Please tell me where my error is.
I hope Google Translate got this right.

The problem is most likely that main exits before all goroutines have finished.
First, there is a break after the select statement, and it runs unconditionally after the first value is received from either channel. As a result, main returns as soon as the first page arrives on channelBody, and any goroutines still running are terminated along with it.
Second, channelDone is not the right tool here: every Crawl call sends on it, so a single receive cannot tell you that all of them are done. The most idiomatic approach is a sync.WaitGroup. Before starting each goroutine, call wg.Add(1), replace the deferred send with defer wg.Done(), and call wg.Wait() in main. Be sure to share the WaitGroup through a pointer (or a struct field reached through a pointer), because a WaitGroup must not be copied after first use.
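The WaitGroup approach described above could look roughly like the sketch below. It is not the asker's real crawler: the pages map is a hypothetical stand-in for getContents and getPagination, and markVisited and crawlAll are names invented for this example. The sketch also guards the visited set with a mutex, on the assumption that appending to HandledUrls from several goroutines would otherwise be a data race.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pages is a hypothetical stand-in for getContents + getPagination:
// it maps each result page to the pagination links found on it.
var pages = map[string][]string{
	"/page/1": {"/page/2", "/page/3"},
	"/page/2": {"/page/1", "/page/3"},
	"/page/3": {"/page/1", "/page/2"},
}

// Spider records visited URLs. The mutex guards the map because
// Crawl runs in many goroutines at once.
type Spider struct {
	mu      sync.Mutex
	visited map[string]bool
	wg      sync.WaitGroup
}

// markVisited reports whether url is new, and records it if so.
func (s *Spider) markVisited(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.visited[url] {
		return false
	}
	s.visited[url] = true
	return true
}

// Crawl reports url on results and starts one goroutine per unseen link.
func (s *Spider) Crawl(url string, results chan<- string) {
	defer s.wg.Done()
	results <- url
	for _, next := range pages[url] {
		if s.markVisited(next) {
			s.wg.Add(1) // Add before the goroutine starts, never inside it
			go s.Crawl(next, results)
		}
	}
}

// crawlAll crawls from start and returns every page reached, sorted.
func crawlAll(start string) []string {
	s := &Spider{visited: map[string]bool{start: true}}
	results := make(chan string)
	s.wg.Add(1)
	go s.Crawl(start, results)
	// Close results once every Crawl has returned; this ends the range below.
	go func() {
		s.wg.Wait()
		close(results)
	}()
	var out []string
	for u := range results {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(crawlAll("/page/1")) // [/page/1 /page/2 /page/3]
}
```

Closing the results channel after wg.Wait() lets main use a plain range loop instead of a select with a separate done channel.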

Related

Why do I push false to a channel once, but select receives false twice in Go?

I am new to Go and am studying its concurrency features, so I tried to write a simple crawler demo. Once all the given URLs have been read, I push false to processChannel, and this push executes only once.
In another goroutine I select on processChannel, and when I receive false I close the channels to shut the application down. However, this select case receives false twice, and I get "panic: close of closed channel".
I cannot understand why the select case fires twice when I pushed false only once.
The full code is below:
package main
import (
"fmt"
"io/ioutil"
"net/http"
"sync"
"time"
)
var applicationStatus bool
var urls []string
var urlsProcessed int
var foundUrls []string
var fullText string
var totalURLCount int
var wg sync.WaitGroup
var v1 int
func main() {
applicationStatus = true
statusChannel := make(chan int)
textChannel := make(chan string)
processChannel := make(chan bool)
totalURLCount = 0
urls = append(urls, "https://www.msn.cn/zh-cn/news/other/nasa%E7%AC%AC%E4%BA%94%E6%AC%A1%E8%A7%82%E5%AF%9F%E5%88%B0%E9%BB%91%E6%B4%9E%E5%90%83%E6%8E%89%E4%B8%80%E9%A2%97%E6%B5%81%E6%B5%AA%E7%9A%84%E6%81%92%E6%98%9F/ar-AA15ybhx?cvid=0eaf927e48604c0588413d393c788a8f&ocid=winp2fptaskbarent")
urls = append(urls, "https://www.msn.cn/zh-cn/news/other/nasa%E7%AC%AC%E4%BA%94%E6%AC%A1%E8%A7%82%E5%AF%9F%E5%88%B0%E9%BB%91%E6%B4%9E%E5%90%83%E6%8E%89%E4%B8%80%E9%A2%97%E6%B5%81%E6%B5%AA%E7%9A%84%E6%81%92%E6%98%9F/ar-AA15ybhx?cvid=0eaf927e48604c0588413d393c788a8f&ocid=winp2fptaskbarent")
fmt.Println("Starting spider")
urlsProcessed = 0
totalURLCount = len(urls)
go evaluateStatus(statusChannel, processChannel)
go readURLs(statusChannel, textChannel)
go appendToFullText(textChannel, processChannel)
for {
if applicationStatus == false {
fmt.Println(fullText)
fmt.Println("Done!")
break
}
//select {
//case sC := <-statusChannel:
// fmt.Println("Message on statusChannel", sC)
//}
}
}
func evaluateStatus(statusChannel chan int, processChannel chan bool) {
for {
select {
case status := <-statusChannel:
urlsProcessed++
if status == 0 {
fmt.Println("got url")
}
if status == 1 {
close(statusChannel)
}
if urlsProcessed == totalURLCount {
fmt.Println("=============>>>>urlsProcessed")
fmt.Println(urlsProcessed)
fmt.Println("read all top-level url")
processChannel <- false
applicationStatus = false
}
}
}
}
func readURLs(statusChannel chan int, textChannel chan string) {
time.Sleep(time.Millisecond * 1)
fmt.Println("grabing ", len(urls), " urls")
for _, url := range urls {
resp, _ := http.Get(url)
text, err := ioutil.ReadAll(resp.Body)
if err != nil {
fmt.Println("No HTML body")
}
textChannel <- string(text)
statusChannel <- 0
}
}
func appendToFullText(textChannel chan string, processChannel chan bool) {
for {
select {
case pC := <-processChannel:
fmt.Println("pc==============>>>")
fmt.Println(pC)
if pC == true {
// hang out
}
if pC == false {
// all url got
close(textChannel)
close(processChannel)
}
case tC := <-textChannel:
fmt.Println("text len: ")
fmt.Println(len(tC))
fullText += tC
}
}
}
Thx for your help.

As per the Go Programming Language Specification:
A receive operation on a closed channel can always proceed immediately, yielding the element type's zero value after any previously sent values have been received.
This can be seen in the following (playground) demonstration (the comments show what is output):
package main
import "fmt"
func main() {
processChannel := make(chan bool)
go func() {
processChannel <- true
processChannel <- false
close(processChannel)
}()
fmt.Println(<-processChannel) // true
fmt.Println(<-processChannel) // false
fmt.Println(<-processChannel) // false
select {
case x := <-processChannel:
fmt.Println(x) // false
}
}
In your code you are closing processChannel, so every later receive from it returns the zero value (false) immediately. One solution is to set processChannel = nil after closing it, because:
A nil channel is never ready for communication.
However, in your case appendToFullText closes both channels when pC == false, so you should probably just return after doing so (with both channels closed there is no point in keeping the loop running).
Please note that I have only skimmed your code.
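As an illustration of the nil-channel technique mentioned above, here is a minimal sketch that is independent of the asker's program. The merge function and its channel names are made up for this example. It uses the two-value receive form to detect a closed channel and then sets that channel to nil so the select never picks it again.

```go
package main

import "fmt"

// merge reads from both channels until both have been closed.
func merge(a, b <-chan string) []string {
	var out []string
	for a != nil || b != nil {
		select {
		case v, ok := <-a:
			if !ok {
				a = nil // closed: a nil channel is never selected again
				continue
			}
			out = append(out, v)
		case v, ok := <-b:
			if !ok {
				b = nil
				continue
			}
			out = append(out, v)
		}
	}
	return out
}

func main() {
	a := make(chan string, 2)
	b := make(chan string, 1)
	a <- "x"
	a <- "y"
	b <- "z"
	close(a)
	close(b)
	fmt.Println(len(merge(a, b))) // 3
}
```

Without the ok check, each closed channel would keep delivering zero values and the loop would spin forever.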

How to handle multiple goroutines that share the same channel

I've searched a lot but have not found an answer to my problem yet.
I need to make multiple concurrent calls to an external API, each with different parameters.
For each call I then need to initialize a struct for its dataset and process the data I receive from that call. Bear in mind that I read the incoming response line by line and immediately send each line to the channel.
The first problem I ran into, which was not obvious at the beginning because of the large quantity of data I'm receiving, is that each goroutine does not receive all the data that goes through the channel (as I learned from my research). So what I need is a way of requeuing or redirecting that data to the correct goroutine.
This is the function that sends the streamed response from a single dataset.
(I've cut parts of the code that are irrelevant here.)
func (api *API) RequestData(ctx context.Context, c chan DWeatherResponse, dataset string, wg *sync.WaitGroup) error {
for {
line, err := reader.ReadBytes('\n')
s := string(line)
if err != nil {
log.Printf("End of %s", dataset)
return err
}
data, err := extractDataFromStreamLine(s, dataset)
if err != nil {
continue
}
c <- *data
}
}
The function that will process the incoming data
func (s *StrikeStruct) Process(ch, requeue chan dweather.DWeatherResponse) {
for {
data, more := <-ch
if !more {
break
}
// data contains {dataset string, value float64, date time.Time}
// The s.Parameter needs to match the dataset
// IMPORTANT PART, checks if the received data is part of this struct dataset
// If not, I want to send it to another goroutine until it gets to the correct
// one. There will be a max of 4 datasets, but this still may not be the best approach
if !api.GetDataset(s.Parameter, data.Dataset) {
requeue <- data
continue
}
// Do stuff with the data from this point
}
}
Now on my own API endpoint I have the following:
ch := make(chan dweather.DWeatherResponse, 2)
requeue := make(chan dweather.DWeatherResponse)
final := make(chan strike.StrikePerYearResponse)
var wg sync.WaitGroup
for _, s := range args.Parameters.Strikes {
strike := strike.StrikePerYear{
Parameter: strike.Parameter(s.Dataset),
StrikeValue: s.Value,
}
// I receive and process the data in here
go strike.ProcessStrikePerYear(ch, requeue, final, string(s.Dataset))
}
go func() {
for {
data, _ := <-requeue
ch <- data
}
}()
// Creates a goroutine for each dataset
for _, dataset := range api.Params.Dataset {
wg.Add(1)
go api.RequestData(ctx, ch, dataset, &wg)
}
wg.Wait()
close(ch)
//Once the data is all processed it is all appended
var strikes []strike.StrikePerYearResponse
for range args.Fetch.Datasets {
strikes = append(strikes, <-final)
}
return strikes
The issue with this code is that as soon as I start receiving data from more than one endpoint, the requeue blocks and nothing more happens. If I remove the requeue logic, data is lost whenever it does not land on the correct goroutine.
My two questions are:
Why does the requeue block if there is a goroutine always ready to receive from it?
Should I take a different approach to processing the incoming data?

This is not a good way to solve your problem, and you should change your approach. I suggest an implementation like the one below:
package main
import (
"fmt"
"sync"
)
// answer for https://stackoverflow.com/questions/68454226/how-to-handle-multiple-goroutines-that-share-the-same-channel
var (
finalResult = make(chan string)
)
// IData use for message dispatcher that all struct must implement its method
type IData interface {
IsThisForMe() bool
Process(*sync.WaitGroup)
}
//MainData can be your main struct like StrikePerYear
type MainData struct {
// add any props
Id int
Name string
}
type DataTyp1 struct {
MainData *MainData
}
func (d DataTyp1) IsThisForMe() bool {
// you can check your condition here to checking incoming data
if d.MainData.Id == 2 {
return true
}
return false
}
func (d DataTyp1) Process(wg *sync.WaitGroup) {
d.MainData.Name = "processed by DataTyp1"
// send result to final channel, you can change it as you want
finalResult <- d.MainData.Name
wg.Done()
}
type DataTyp2 struct {
MainData *MainData
}
func (d DataTyp2) IsThisForMe() bool {
// you can check your condition here to checking incoming data
if d.MainData.Id == 3 {
return true
}
return false
}
func (d DataTyp2) Process(wg *sync.WaitGroup) {
d.MainData.Name = "processed by DataTyp2"
// send result to final channel, you can change it as you want
finalResult <- d.MainData.Name
wg.Done()
}
// dispatcher starts a new goroutine for each request.
// You could use a worker pool instead to avoid running too many goroutines.
func dispatcher(incomingData *MainData, wg *sync.WaitGroup) {
// depending on your requirements, you may not need this goroutine
go func() {
var p IData
p = DataTyp1{incomingData}
if p.IsThisForMe() {
go p.Process(wg)
return
}
p = DataTyp2{incomingData}
if p.IsThisForMe() {
go p.Process(wg)
return
}
}()
}
func main() {
dummyDataArray := []MainData{
MainData{Id: 2, Name: "this data #2"},
MainData{Id: 3, Name: "this data #3"},
}
wg := sync.WaitGroup{}
for i := range dummyDataArray {
wg.Add(1)
dispatcher(&dummyDataArray[i], &wg)
}
result := make([]string, 0)
done := make(chan struct{})
// data collector
go func() {
loop:
for {
select {
case <-done:
break loop
case r := <-finalResult:
result = append(result, r)
}
}
}()
wg.Wait()
done <- struct{}{}
for _, s := range result {
fmt.Println(s)
}
}
Note: this is only meant to point you toward a better solution; it is certainly not production-ready code.
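Another way to avoid requeuing altogether is to give each dataset its own channel and route every item directly to it. The sketch below is a simplification under stated assumptions: Reading is a hypothetical stand-in for DWeatherResponse, and sumByDataset and its summing logic are invented for illustration. As for why the original requeue may block (a guess, since the full code is not shown): the forwarding goroutine can get stuck on ch <- data when the buffer of ch is full, while every processor is itself stuck on requeue <- data, so nothing drains ch.

```go
package main

import (
	"fmt"
	"sync"
)

// Reading is a hypothetical, simplified stand-in for DWeatherResponse.
type Reading struct {
	Dataset string
	Value   float64
}

// sumByDataset routes each reading to the worker that owns its dataset
// and returns the per-dataset totals.
func sumByDataset(readings []Reading, datasets []string) map[string]float64 {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		sums = make(map[string]float64)
	)
	chans := make(map[string]chan Reading, len(datasets))
	for _, name := range datasets {
		ch := make(chan Reading)
		chans[name] = ch
		wg.Add(1)
		go func(name string, in <-chan Reading) {
			defer wg.Done()
			var total float64
			for r := range in { // only this dataset's readings arrive here
				total += r.Value
			}
			mu.Lock()
			sums[name] = total
			mu.Unlock()
		}(name, ch)
	}
	for _, r := range readings {
		if ch, ok := chans[r.Dataset]; ok {
			ch <- r // readings for unknown datasets are dropped
		}
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
	return sums
}

func main() {
	readings := []Reading{{"rain", 1}, {"wind", 10}, {"rain", 2}}
	sums := sumByDataset(readings, []string{"rain", "wind"})
	fmt.Println(sums["rain"], sums["wind"]) // 3 10
}
```

Because each worker only ever sees its own dataset, no item has to travel back through a shared channel, which removes the circular wait.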

Making a struct thread safe using go channels

Suppose I have the following struct:
package manager
type Manager struct {
strings []string
}
func (m *Manager) AddString(s string) {
m.strings = append(m.strings, s)
}
func (m *Manager) RemoveString(s string) {
for i, str := range m.strings {
if str == s {
m.strings = append(m.strings[:i], m.strings[i+1:]...)
}
}
}
This pattern is not thread safe, so the following test fails due to some race condition (array index out of bounds):
func TestManagerConcurrently(t *testing.T) {
m := &manager.Manager{}
wg := sync.WaitGroup{}
for i:=0; i<100; i++ {
wg.Add(1)
go func () {
m.AddString("a")
m.AddString("b")
m.AddString("c")
m.RemoveString("b")
wg.Done()
} ()
}
wg.Wait()
fmt.Println(m)
}
I'm new to Go, and from googling around I suppose I should use channels (?). So one way to make this concurrent would be like this:
type ManagerA struct {
Manager
addStringChan chan string
removeStringChan chan string
}
func NewManagerA() *ManagerA {
ma := &ManagerA{
addStringChan: make(chan string),
removeStringChan: make(chan string),
}
go func () {
for {
select {
case msg := <-ma.addStringChan:
ma.AddString(msg)
case msg := <-ma.removeStringChan:
ma.RemoveString(msg)
}
}
}()
return ma
}
func (m *ManagerA) AddStringA(s string) {
m.addStringChan <- s
}
func (m *ManagerA) RemoveStringA(s string) {
m.removeStringChan <- s
}
I would like to expose an API similar to the non-concurrent example, hence AddStringA, RemoveStringA.
This seems to work as expected concurrently (although I guess the inner goroutine should also exit at some point). My problem with this is that there is a lot of extra boilerplate:
need to define & initialize channels
define inner goroutine loop with select
map functions to channel calls
It seems a bit much to me. Is there a way to simplify this (refactor / syntax / library)?
I think the best way to implement this would be to use a Mutex instead? But is it still possible to simplify this sort of boilerplate?

Using a mutex would be perfectly idiomatic, like this:
type Manager struct {
mu sync.Mutex
strings []string
}
func (m *Manager) AddString(s string) {
m.mu.Lock()
m.strings = append(m.strings, s)
m.mu.Unlock()
}
func (m *Manager) RemoveString(s string) {
m.mu.Lock()
for i, str := range m.strings {
if str == s {
m.strings = append(m.strings[:i], m.strings[i+1:]...)
}
}
m.mu.Unlock()
}
You could do this with channels, but as you noted it is a lot of extra work for not much gain. My advice is to just use a mutex.

If you simply need to make access to the struct thread-safe, use a mutex:
type Manager struct {
sync.Mutex
strings []string
}
func (m *Manager) AddString(s string) {
m.Lock()
m.strings = append(m.strings, s)
m.Unlock()
}
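The mutex approach from the answers above can be exercised with a version of the question's concurrent test. The sketch below is a variation, not the answerers' exact code: it uses defer for unlocking, adds a hypothetical Len method and run helper for checking the result, and rewrites RemoveString to filter in place. That last change is an assumption about intent: removing elements while ranging over the same slice can panic when the string occurs more than once, even without concurrency.

```go
package main

import (
	"fmt"
	"sync"
)

// Manager guards its slice with a mutex.
type Manager struct {
	mu      sync.Mutex
	strings []string
}

// AddString appends s under the lock.
func (m *Manager) AddString(s string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.strings = append(m.strings, s)
}

// RemoveString removes every occurrence of s by filtering in place.
func (m *Manager) RemoveString(s string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	kept := m.strings[:0] // reuse the backing array
	for _, str := range m.strings {
		if str != s {
			kept = append(kept, str)
		}
	}
	m.strings = kept
}

// Len is a hypothetical helper used to inspect the result.
func (m *Manager) Len() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.strings)
}

// run starts n goroutines that each add "a", "b", "c" and remove "b".
func run(n int) int {
	m := &Manager{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.AddString("a")
			m.AddString("b")
			m.AddString("c")
			m.RemoveString("b")
		}()
	}
	wg.Wait()
	return m.Len()
}

func main() {
	fmt.Println(run(100)) // 200
}
```

Every "b" is removed by the RemoveString call that follows it in the same goroutine at the latest, so only the "a" and "c" entries remain.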

HTTP request fails when executed asynchronously

I'm trying to write a tiny application in Go that sends an HTTP request to every IP address in the hope of finding specific content. The issue is that the application seems to fail in a very peculiar way when the call is executed asynchronously.
ip/validator.go
package ip
import (
"io/ioutil"
"net/http"
"regexp"
"time"
)
type ipValidator struct {
httpClient http.Client
path string
exp *regexp.Regexp
confirmationChannel *chan string
}
func (this *ipValidator) validateUrl(url string) bool {
response, err := this.httpClient.Get(url)
if err != nil {
return false
}
defer response.Body.Close()
if response.StatusCode != http.StatusOK {
return false
}
bodyBytes, _ := ioutil.ReadAll(response.Body)
result := this.exp.Match(bodyBytes)
if result && this.confirmationChannel != nil {
*this.confirmationChannel <- url
}
return result
}
func (this *ipValidator) ValidateIp(addr ip) bool {
httpResult := this.validateUrl("http://" + addr.ToString() + this.path)
httpsResult := this.validateUrl("https://" + addr.ToString() + this.path)
return httpResult || httpsResult
}
func (this *ipValidator) GetSuccessChannel() *chan string {
return this.confirmationChannel
}
func NewIpValidadtor(path string, exp *regexp.Regexp) ipValidator {
return newValidator(path, exp, nil)
}
func NewAsyncIpValidator(path string, exp *regexp.Regexp) ipValidator {
c := make(chan string)
return newValidator(path, exp, &c)
}
func newValidator(path string, exp *regexp.Regexp, c *chan string) ipValidator {
httpClient := http.Client{
Timeout: time.Second * 2,
}
return ipValidator{httpClient, path, exp, c}
}
main.go
package main
import (
"./ip"
"fmt"
"os"
"regexp"
)
func processOutput(c *chan string) {
for true {
url := <- *c
fmt.Println(url)
}
}
func main() {
args := os.Args[1:]
fmt.Printf("path: %s regex: %s", args[0], args[1])
regexp, regexpError := regexp.Compile(args[1])
if regexpError != nil {
fmt.Println("The provided regexp is not valid")
return
}
currentIp, _ := ip.NewIp("172.217.22.174")
validator := ip.NewAsyncIpValidator(args[0], regexp)
successChannel := validator.GetSuccessChannel()
go processOutput(successChannel)
for currentIp.HasMore() {
go validator.ValidateIp(currentIp)
currentIp = currentIp.Increment()
}
}
Note the line go validator.ValidateIp(currentIp) in main.go. If I remove the word "go" so that everything runs in the main goroutine, the code works as expected: it sends requests to IP addresses starting at 172.217.22.174, and if one of them returns a result matching the regexp that the ipValidator was initialized with, the URL is passed to the channel and printed by the processOutput function in main.go. Simply adding go in front of validator.ValidateIp(currentIp) breaks that functionality. In fact, according to the debugger, execution never gets past the line response, err := this.httpClient.Get(url) in validator.go.
The struggle is real. If I decide to scan the whole internet, there are 256^4 IP addresses to go through. That would take years unless I find a way to split the work across multiple goroutines.
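No answer is recorded here. One plausible cause, stated as an assumption, is that main returns as soon as the loop has started its goroutines, which ends the program before any request completes. A bounded worker pool that main waits on would avoid both that and starting an unbounded number of goroutines. In the sketch below, scan and its check parameter are hypothetical; check stands in for validator.ValidateIp so the example runs without network access.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// scan runs check on every address using a fixed number of workers
// and returns the addresses for which check reported true, sorted.
func scan(addrs []string, workers int, check func(string) bool) []string {
	jobs := make(chan string)
	hits := make(chan string)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for a := range jobs {
				if check(a) {
					hits <- a
				}
			}
		}()
	}
	// Feed the jobs, then close hits once every worker has finished.
	go func() {
		for _, a := range addrs {
			jobs <- a
		}
		close(jobs)
		wg.Wait()
		close(hits)
	}()
	var found []string
	for h := range hits { // blocks until hits is closed
		found = append(found, h)
	}
	sort.Strings(found)
	return found
}

func main() {
	addrs := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	// The check below is a placeholder for validator.ValidateIp.
	hits := scan(addrs, 2, func(a string) bool { return strings.HasSuffix(a, ".2") })
	fmt.Println(hits) // [10.0.0.2]
}
```

The range over hits keeps main alive until all workers are done, and the workers argument caps how many requests are in flight at once.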

Reading from map with locks doesn't return value via channel

I tried to implement a locking version of reading/writing from a map in golang, but it doesn't return the desired result.
package main
import (
"sync"
"fmt"
)
var m = map[int]string{}
var lock = sync.RWMutex{}
func StoreUrl(id int, url string) {
for {
lock.Lock()
defer lock.Unlock()
m[id] = url
}
}
func LoadUrl(id int, ch chan string) {
for {
lock.RLock()
defer lock.RUnlock()
r := m[id]
ch <- r
}
}
func main() {
go StoreUrl(125, "www.google.com")
chb := make(chan string)
go LoadUrl(125, chb);
C := <-chb
fmt.Println("Result:", C)
}
The output is:
Result:
Meaning the value is not returned via the channel, which I don't get. Without the locking/goroutines it seems to work fine. What did I do wrong?
The code can also be found here:
https://play.golang.org/p/-WmRcMty5B

Infinite loops without a sleep or some kind of I/O are always a bad idea.
In your code, if you put a print statement at the start of StoreUrl, you will find that it never gets printed, i.e. the goroutine never got to run: the go statement only places the new goroutine in the scheduler's run queue, and the scheduler has not yet run it by the time LoadUrl reads the map. How do you give the scheduler a chance to run? Sleep, do I/O, or read from or write to a channel.
Another problem is that your infinite loop takes the lock and then tries to take it again on the next iteration, which deadlocks. A deferred call only runs when the function exits, and this function never exits because of the infinite loop.
Below is modified code that uses sleep to give every goroutine time to do its job.
package main
import (
"sync"
"fmt"
"time"
)
var m = map[int]string{}
var lock = sync.RWMutex{}
func StoreUrl(id int, url string) {
for {
lock.Lock()
m[id] = url
lock.Unlock()
time.Sleep(time.Millisecond) // a bare 1 would mean one nanosecond
}
}
func LoadUrl(id int, ch chan string) {
for {
lock.RLock()
r := m[id]
lock.RUnlock()
ch <- r
}
}
func main() {
go StoreUrl(125, "www.google.com")
time.Sleep(time.Millisecond) // a bare 1 would mean one nanosecond
chb := make(chan string)
go LoadUrl(125, chb);
C := <-chb
fmt.Println("Result:", C)
}
Edit: As @Jaun mentioned in the comments, you can also use runtime.Gosched() instead of sleep.

The use of defer is incorrect: a deferred call executes at the end of the enclosing function, not at the end of each for iteration. Either wrap the loop body in a function:
func StoreUrl(id int, url string) {
for {
func() {
lock.Lock()
defer lock.Unlock()
m[id] = url
}()
}
}
or
func StoreUrl(id int, url string) {
for {
lock.Lock()
m[id] = url
lock.Unlock()
}
}
We can't control the order in which goroutines run, so add time.Sleep() to influence the order.
The code is here:
https://play.golang.org/p/Bu8Lo46SA2
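Sleeping only makes the desired order likely; it does not guarantee it. A minimal sketch of an alternative, assuming a single write is all that is needed: drop the infinite loops so that defer behaves as intended, and use a channel (the hypothetical stored below) to signal that the write has happened before the read starts. Here LoadUrl returns its value instead of sending on a channel, which is a change from the original signature.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	lock sync.RWMutex
	m    = map[int]string{}
)

// StoreUrl writes once; defer is safe here because the function returns.
func StoreUrl(id int, url string) {
	lock.Lock()
	defer lock.Unlock()
	m[id] = url
}

// LoadUrl reads the value for id under the read lock.
func LoadUrl(id int) string {
	lock.RLock()
	defer lock.RUnlock()
	return m[id]
}

func main() {
	stored := make(chan struct{})
	go func() {
		StoreUrl(125, "www.google.com")
		close(stored) // signal that the write has happened
	}()
	<-stored // wait for the write instead of sleeping
	ch := make(chan string)
	go func() { ch <- LoadUrl(125) }()
	fmt.Println("Result:", <-ch) // Result: www.google.com
}
```

Receiving from stored establishes the ordering explicitly, so the result no longer depends on how the scheduler happens to interleave the goroutines.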
