I am trying to take input from a text file containing domains (unknown amount), then use each one as an argument and get its server type. As expected, this only returns one domain. How do I iterate over multiple return values?
Below is the code.
// Test
package main
import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
	//"github.com/gocolly/colly"
)
var Domain string
var Target string
func main() {
Domain := DomainGrab()
Target := BannerGrab(Domain)
//CheckDB if not listed then add else skip
//RiskDB
//Email
fmt.Println(Domain)
fmt.Println(Target)
}
func BannerGrab(s string) string {
client := &http.Client{}
req, err := http.NewRequest("GET", s, nil)
if err != nil {
log.Fatalln(err)
}
req.Header.Set("User-Agent", "Authac/0.1")
resp, _ := client.Do(req)
serverEntry := resp.Header.Get("Server")
return serverEntry
}
func DomainGrab() string {
//c := colly.NewCollector()
// Open the file.
f, _ := os.Open("domains.txt")
defer f.Close()
// Create a new Scanner for the file.
scanner := bufio.NewScanner(f)
// Loop over all lines in the file and print them.
for scanner.Scan() {
line := scanner.Text()
time.Sleep(2 * time.Second)
//fmt.Println(line)
return line
}
return Domain
}
If you wanted to do it "concurrently", you would return a channel through which you will send the multiple things you want to return:
https://play.golang.org/p/iYBGPwfYLYR
func DomainGrab() <-chan string {
ch := make(chan string, 1)
f, _ := os.Open("domains.txt")
defer f.Close()
scanner := bufio.NewScanner(f)
go func() {
// Loop over all lines in the file and print them.
for scanner.Scan() {
line := scanner.Text()
time.Sleep(2 * time.Second)
ch <- line
}
close(ch)
}()
return ch
}
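For completeness, here is a minimal sketch (my addition, not part of the original answer) of how main could consume that channel, reusing the BannerGrab function from the question:

func main() {
	// Range over the channel until DomainGrab closes it.
	for domain := range DomainGrab() {
		target := BannerGrab(domain)
		fmt.Println(domain)
		fmt.Println(target)
	}
}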
If I understand your question, you want to read the file, somehow detect that the file was modified, and have a method that emits these modifications to the client code.
That is not how files work.
You have two options:
Listen for file changes using an OS-specific API - https://www.linuxjournal.com/content/linux-filesystem-events-inotify
Read the file in an infinite loop: read it once, save a copy in memory, then read the same file again and again until the new contents differ from the copy, and calculate the delta (a rough sketch of this follows below).
Also check whether it is possible to use push instead of pull for getting new domains: could the system that controls the domain names in the file push the data to you directly?
If the loop is the only possible option, set up some pause time between file reads to reduce system load.
Use channels, as @dave suggested, once you have obtained new domains and need to process them concurrently.
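To illustrate the polling idea only, here is a rough sketch that combines it with the channel approach above; the file name, the pause, and the in-memory "seen" set are my assumptions, and error handling is kept minimal:

// watchDomains re-reads the file forever and emits only lines it has not seen before.
func watchDomains(path string, pause time.Duration) <-chan string {
	ch := make(chan string)
	go func() {
		seen := make(map[string]bool) // in-memory copy of what was already emitted
		for {
			if f, err := os.Open(path); err == nil {
				scanner := bufio.NewScanner(f)
				for scanner.Scan() {
					line := scanner.Text()
					if !seen[line] {
						seen[line] = true
						ch <- line // emit only the delta
					}
				}
				f.Close()
			}
			time.Sleep(pause) // pause between reads to reduce system load
		}
	}()
	return ch
}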
Probably not the BEST solution, but I decided to get rid of the separate function altogether to cover more ground. I'll post the code below. Now I need to parse the domains so that root URLs and subdomains are only scanned once.
// Main
package main
import (
"log"
"fmt"
"time"
"net/http"
"github.com/gocolly/colly"
)
//var Domain string
var Target string
func main() {
c := colly.NewCollector()
c.OnError(func(r *colly.Response, err error) {
fmt.Println("Request URL:", r.Request.URL, "\n Failed with response:", r.StatusCode)
})
// Find and visit all links
c.OnHTML("a", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
c.OnRequest(func(r *colly.Request) {
Domain := r.URL.String()
Target := BannerGrab(Domain)
fmt.Println(Domain)
fmt.Println(Target)
fmt.Println("Dropping By.. ", r.URL)
time.Sleep(1000 * time.Millisecond)
})
c.Visit("http://www.milliondollarhomepage.com/")
}
//CheckDB if not listed else add
//RiskDB
//Email
func BannerGrab(s string) string {
client := &http.Client{}
req, err := http.NewRequest("GET", s, nil)
if err != nil {
log.Fatalln(err)
}
req.Header.Set("User-Agent", "Authac/0.1")
resp, _ := client.Do(req)
serverEntry := resp.Header.Get("Server")
return serverEntry
}
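One possible way to scan each root domain or subdomain only once is to key a visited set by host. This is only a sketch under my own assumptions: it needs the net/url import, and it assumes the collector is not running in async mode, since a plain map is not safe for concurrent use.

var visited = make(map[string]bool)

// shouldScan reports whether this URL's host has not been seen yet,
// so each root domain or subdomain gets banner-grabbed only once.
func shouldScan(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname()
	if visited[host] {
		return false
	}
	visited[host] = true
	return true
}

Inside the OnRequest callback, BannerGrab would then only be called when shouldScan(r.URL.String()) returns true.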
Related
I am trying to build a web scraper to scrape jobs from internshala.com. I am using go colly to build the scraper. I visit every page and then visit the subsequent links of each job to scrape the data from. Doing this in a sequential manner scrapes almost all the links, but if I try doing it using colly's parallel scraping, the number of links scraped decreases. I write all the data to a CSV file.
EDIT
My question is: why does this happen when scraping in parallel, and how can I solve it (how can I scrape all the data even when scraping in parallel)?
Or is there something else I am doing wrong that is causing the problem? A code review would be really helpful. Thanks :)
package main
import (
"encoding/csv"
"log"
"os"
"strconv"
"sync"
"time"
"github.com/gocolly/colly"
)
func main(){
parallel(10)
seq(10)
}
I comment out one of the two functions before running for obvious reasons.
The parallel function:
func parallel(n int){
start := time.Now()
c := colly.NewCollector(
colly.AllowedDomains("internshala.com", "https://internshala.com/internship/detail",
"https://internshala.com/internship/", "internshala.com/", "www.intershala.com"),
colly.Async(true),
)
d := colly.NewCollector(
colly.AllowedDomains("internshala.com", "https://internshala.com/internship/detail",
"https://internshala.com/internship/", "internshala.com/", "www.intershala.com"),
colly.Async(true),
)
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4})
d.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4})
fileName := "data.csv"
file, err := os.Create(fileName)
cnt := 0
if err != nil{
log.Fatalf("Could not create file, err: %q", err)
return
}
defer file.Close() // close the file after the main routine exits
writer := csv.NewWriter(file)
defer writer.Flush()
var wg sync.WaitGroup
c.OnHTML("a[href]", func(e *colly.HTMLElement){
if e.Attr("class") != "view_detail_button"{
return
}
detailsLink := e.Attr("href")
d.Visit(e.Request.AbsoluteURL(detailsLink))
})
d.OnHTML(".detail_view", func(e *colly.HTMLElement) {
wg.Add(1)
go func(wg *sync.WaitGroup) {
writer.Write([]string{
e.ChildText("span.profile_on_detail_page"),
e.ChildText(".company_name a"),
e.ChildText("#location_names a"),
e.ChildText(".internship_other_details_container > div:first-of-type > div:last-of-type .item_body"),
e.ChildText("span.stipend"),
e.ChildText(".applications_message"),
e.ChildText(".internship_details > div:nth-last-of-type(3)"),
e.Request.URL.String(),
})
wg.Done()
}(&wg)
})
c.OnRequest(func(r *colly.Request) {
log.Println("visiting", r.URL.String())
})
d.OnRequest(func(r *colly.Request) {
log.Println("visiting", r.URL.String())
cnt++
})
for i := 1; i < n; i++ {
c.Visit("https://internshala.com/internships/page-"+strconv.Itoa(i))
}
c.Wait()
d.Wait()
wg.Wait()
t := time.Since(start)
log.Printf("time %v \n", t)
log.Printf("amount %v \n", cnt)
log.Printf("Scrapping complete")
log.Println(c)
}
The seq function:
func seq(n int){
start := time.Now()
c := colly.NewCollector(
colly.AllowedDomains("internshala.com", "https://internshala.com/internship/detail",
"https://internshala.com/internship/", "internshala.com/", "www.intershala.com"),
)
d := colly.NewCollector(
colly.AllowedDomains("internshala.com", "https://internshala.com/internship/detail",
"https://internshala.com/internship/", "internshala.com/", "www.intershala.com"),
)
fileName := "data.csv"
file, err := os.Create(fileName)
cnt := 0
if err != nil{
log.Fatalf("Could not create file, err: %q", err)
return
}
defer file.Close() // close the file after the main routine exits
writer := csv.NewWriter(file)
defer writer.Flush()
c.OnHTML("a[href]", func(e *colly.HTMLElement){
if e.Attr("class") != "view_detail_button"{
return
}
detailsLink := e.Attr("href")
d.Visit(e.Request.AbsoluteURL(detailsLink))
})
d.OnHTML(".detail_view", func(e *colly.HTMLElement) {
writer.Write([]string{
e.ChildText("span.profile_on_detail_page"),
e.ChildText(".company_name a"),
e.ChildText("#location_names a"),
e.ChildText(".internship_other_details_container > div:first-of-type > div:last-of-type .item_body"),
e.ChildText("span.stipend"),
e.ChildText(".applications_message"),
e.ChildText(".internship_details > div:nth-last-of-type(3)"),
e.Request.URL.String(),
})
})
c.OnRequest(func(r *colly.Request) {
log.Println("visiting", r.URL.String())
})
d.OnRequest(func(r *colly.Request) {
log.Println("visiting", r.URL.String())
cnt++
})
for i := 1; i < n; i++ {
// Add URLs to the queue
c.Visit("https://internshala.com/internships/page-"+strconv.Itoa(i))
}
t := time.Since(start)
log.Printf("time %v \n", t)
log.Printf("amount %v \n", cnt)
log.Printf("Scrapping complete")
log.Println(c)
}
Any help will be much appreciated. :)
Sorry for being late to the party, but I came up with a working solution to your problem. Let me show it:
package main
import (
"encoding/csv"
"fmt"
"log"
"os"
"strconv"
"strings"
"time"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/queue"
)
func parallel(n int) {
start := time.Now()
cnt := 0
queue, _ := queue.New(8, &queue.InMemoryQueueStorage{MaxSize: 1000}) // tried up to 8 threads
fileName := "data_par.csv"
file, err := os.Create(fileName)
if err != nil {
log.Fatalf("Could not create file, err: %q", err)
return
}
defer file.Close() // close the file after the main routine exits
writer := csv.NewWriter(file)
defer func() {
writer.Flush()
if err := writer.Error(); err != nil {
panic(err)
}
}()
c := colly.NewCollector(
colly.AllowedDomains("internshala.com", "https://internshala.com/internship/detail",
"https://internshala.com/internship/", "internshala.com/", "www.intershala.com"),
)
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
if e.Attr("class") != "view_detail_button" {
return
}
detailsLink := e.Attr("href")
e.Request.Visit(detailsLink)
})
c.OnRequest(func(r *colly.Request) {
writer.Write([]string{r.URL.String()})
})
for i := 1; i < n; i++ {
queue.AddURL("https://internshala.com/internships/page-" + strconv.Itoa(i))
}
queue.Run(c)
t := time.Since(start)
log.Printf("time: %v\tamount: %d\n", t, cnt)
}
func seq(n int) {
start := time.Now()
c := colly.NewCollector(
colly.AllowedDomains("internshala.com", "https://internshala.com/internship/detail",
"https://internshala.com/internship/", "internshala.com/", "www.intershala.com"),
)
fileName := "data_seq.csv"
file, err := os.Create(fileName)
cnt := 0
if err != nil {
log.Fatalf("Could not create file, err: %q", err)
return
}
defer file.Close() // close the file after the main routine exits
writer := csv.NewWriter(file)
defer func() {
writer.Flush()
if err := writer.Error(); err != nil {
panic(err)
}
}()
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
if e.Attr("class") != "view_detail_button" {
return
}
detailsLink := e.Attr("href")
e.Request.Visit(detailsLink)
})
c.OnRequest(func(r *colly.Request) {
writer.Write([]string{r.URL.String()})
})
for i := 1; i < n; i++ {
c.Visit("https://internshala.com/internships/page-" + strconv.Itoa(i))
}
t := time.Since(start)
log.Printf("time: %v\tamount: %d\n", t, cnt)
}
func main() {
fmt.Println("sequential")
seq(6)
fmt.Println(strings.Repeat("#", 50))
fmt.Println("parallel")
parallel(6)
}
The problem
After looking at your code, I think that everything is implemented correctly. Sure, things could be done in a better way, but at least as far as the concurrency is concerned, everything is set up properly. Some aspects you could have improved are the following:
Check the error returned when flushing to the underlying CSV file
Use only one collector instead of two
Again, as I already said, these are only small refinements.
The actual problem
The actual problem is that when you make concurrent (and potentially parallel) requests, the colly collector cannot keep up and starts losing some responses. This gets noticeably worse as you increase the number of requests.
The easiest solution (IMO)
gocolly provides the Queue type, which fits these situations very well. With it, you can be sure that every request will be processed, while the requests are still issued concurrently. The steps can be summarized as follows:
Instantiate a new queue with the New function provided by the queue sub-package. You have to set the number of consumer threads and the type of queue storage (in our case an in-memory implementation is fine).
Instantiate a default collector with all of its needed callbacks.
Invoke the AddURL method on the queue variable defined above with the appropriate URL to request.
Invoke the Run method, which sends the actual requests to the target URLs and waits for the responses.
Note that I simplified the solution you shared just to focus on the number of requests in the two approaches. I didn't check the logic you wrote in the OnHTML callback, but I assumed it works.
Let me know if this solves your issue or share how you were able to solve this problem, thanks!
I started learning Go recently and I've been chipping away at this for a while now, but figured it was time to ask for some specific help. My program requests paginated data from an API, and there are about 160 pages of data, so it seems like a good use of goroutines, except that I have race conditions and I can't figure out why. It's probably because I'm new to the language, but my impression was that parameters are passed to a function as a copy of the caller's data unless they are pointers.
According to what I think I know, this should be making copies of my data, which leaves me free to change it in the main function, but I end up requesting some pages multiple times and other pages just once.
My main.go
package main
import (
"bufio"
"encoding/json"
"log"
"net/http"
"net/url"
"os"
"strconv"
"sync"
"github.com/joho/godotenv"
)
func main() {
err := godotenv.Load()
if err != nil {
log.Fatalln(err)
}
httpClient := &http.Client{}
baseURL := "https://api.data.gov/ed/collegescorecard/v1/schools.json"
filters := make(map[string]string)
page := 0
filters["school.degrees_awarded.predominant"] = "2,3"
filters["fields"] = "id,school.name,school.city,2018.student.size,2017.student.size,2017.earnings.3_yrs_after_completion.overall_count_over_poverty_line,2016.repayment.3_yr_repayment.overall"
filters["api_key"] = os.Getenv("API_KEY")
outFile, err := os.Create("./out.txt")
if err != nil {
log.Fatalln(err)
}
writer := bufio.NewWriter(outFile)
requestURL := getRequestURL(baseURL, filters)
response := requestData(requestURL, httpClient)
wg := sync.WaitGroup{}
for (page+1)*response.Metadata.ResultsPerPage < response.Metadata.TotalResults {
page++
filters["page"] = strconv.Itoa(page)
wg.Add(1)
go func() {
defer wg.Done()
requestURL := getRequestURL(baseURL, filters)
response := requestData(requestURL, httpClient)
_, err = writer.WriteString(response.TextOutput())
if err != nil {
log.Fatalln(err)
}
}()
}
wg.Wait()
}
func getRequestURL(baseURL string, filters map[string]string) *url.URL {
requestURL, err := url.Parse(baseURL)
if err != nil {
log.Fatalln(err)
}
query := requestURL.Query()
for key, value := range filters {
query.Set(key, value)
}
requestURL.RawQuery = query.Encode()
return requestURL
}
func requestData(url *url.URL, httpClient *http.Client) CollegeScoreCardResponseDTO {
request, _ := http.NewRequest(http.MethodGet, url.String(), nil)
resp, err := httpClient.Do(request)
if err != nil {
log.Fatalln(err)
}
defer resp.Body.Close()
var parsedResponse CollegeScoreCardResponseDTO
err = json.NewDecoder(resp.Body).Decode(&parsedResponse)
if err != nil {
log.Fatalln(err)
}
return parsedResponse
}
I know another issue I will run into is writing to the output file in the correct order, but I believe using channels to tell each routine which request has finished writing could solve that. If I'm wrong about that, I would appreciate any advice on how to approach it as well.
Thanks in advance.
Goroutines do not receive copies of data. When the compiler detects that a variable "escapes" the current function, it allocates that variable on the heap. In this case, filters is one such variable. When the goroutine starts, the filters it accesses is the same map the main goroutine uses. Since you keep modifying filters in the main goroutine without locking, there is no guarantee of what the goroutine sees.
I suggest you keep filters read-only, create a new map in the goroutine by copying all the items from filters, and add the "page" entry in the goroutine. You also have to be careful to pass a copy of the page variable:
go func(page int) {
	flt := make(map[string]string)
	for k, v := range filters {
		flt[k] = v
	}
	flt["page"] = strconv.Itoa(page)
	...
}(page)
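Putting it together, a sketch of how the loop body could look; the names are reused from the question, and the mutex around the shared writer is my addition, since a bufio.Writer is not safe for concurrent use:

var mu sync.Mutex
for (page+1)*response.Metadata.ResultsPerPage < response.Metadata.TotalResults {
	page++
	wg.Add(1)
	go func(page int) {
		defer wg.Done()
		// Copy the read-only filters and add the page locally.
		flt := make(map[string]string, len(filters)+1)
		for k, v := range filters {
			flt[k] = v
		}
		flt["page"] = strconv.Itoa(page)
		requestURL := getRequestURL(baseURL, flt)
		response := requestData(requestURL, httpClient)
		mu.Lock()
		_, err := writer.WriteString(response.TextOutput())
		mu.Unlock()
		if err != nil {
			log.Fatalln(err)
		}
	}(page)
}
wg.Wait()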
This is code taken from the Go book. The client enters a message and the request is sent to the server. How can I send the same request repeatedly without entering values every time? Also, the time interval between successive requests should be 3 seconds. Should I use goroutines?
package main
import (
"bufio"
"fmt"
"net"
"os"
)
func main() {
arguments := os.Args
if len(arguments) == 1 {
fmt.Println("Please provide host:port.")
return
}
CONNECT := arguments[1]
c, err := net.Dial("tcp", CONNECT)
if err != nil {
fmt.Println(err)
return
}
for {
reader := bufio.NewReader(os.Stdin)
fmt.Print(">>")
text, _ := reader.ReadString('\n')
fmt.Fprintf(c, text+"\n")
}
}
Use a time.Ticker to execute code at some specified interval:
t := time.NewTicker(3 * time.Second)
defer t.Stop()
for range t.C {
_, err := c.Write([]byte("Hello!\n"))
if err != nil {
log.Fatal(err)
}
}
Run it on the playground.
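For context, here is a sketch of how the ticker could replace the stdin loop in the question's client, assuming it is acceptable to send a fixed message:

package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"time"
)

func main() {
	if len(os.Args) == 1 {
		fmt.Println("Please provide host:port.")
		return
	}
	c, err := net.Dial("tcp", os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	defer c.Close()

	t := time.NewTicker(3 * time.Second)
	defer t.Stop()
	for range t.C {
		// Send the same message every 3 seconds instead of reading from stdin.
		if _, err := c.Write([]byte("Hello!\n")); err != nil {
			log.Fatal(err)
		}
	}
}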
I have a Go program which is supposed to call an API with different payloads. The web application is a Dropwizard application running on localhost, and the Go program is below.
package main
import (
"bufio"
"encoding/json"
"log"
"net"
"net/http"
"os"
"strings"
"time"
)
type Data struct {
PersonnelId string `json:"personnel_id"`
DepartmentId string `json:"department_id"`
}
type PersonnelEvent struct {
EventType string `json:"event_type"`
Data `json:"data"`
}
const (
maxIdleConnections = 20
maxIdleConnectionsPerHost = 20
timeout = time.Duration(5 * time.Second)
)
var transport = http.Transport{
Dial: dialTimeout,
MaxIdleConns: maxIdleConnections,
MaxIdleConnsPerHost: 20,
}
var client = &http.Client{
Transport: &transport,
}
func dialTimeout(network, addr string) (net.Conn, error) {
return net.DialTimeout(network, addr, timeout)
}
func makeRequest(payload string) {
req, _ := http.NewRequest("POST", "http://localhost:9350/v1/provider-
location-personnel/index", strings.NewReader(payload))
req.Header.Set("X-App-Token", "TESTTOKEN1")
req.Header.Set("Content-Type", "application/json")
resp, err := client.Do(req)
if err != nil {
log.Println("Api invocation returned an error ", err)
} else {
defer resp.Body.Close()
log.Println(resp.Body)
}
}
func indexPersonnels(personnelPayloads []PersonnelEvent) {
for _, personnelEvent := range personnelPayloads {
payload, err := json.Marshal(personnelEvent)
if err != nil {
log.Println("Error while marshalling payload ", err)
}
log.Println(string(payload))
// go makeRequest(string(payload))
}
}
func main() {
ch := make(chan PersonnelEvent)
for i := 0; i < 20; i++ {
go func() {
for personnelEvent := range ch {
payload, err := json.Marshal(personnelEvent)
if err != nil {
log.Println("Error while marshalling payload", err)
}
go makeRequest(string(payload))
//log.Println("Payload ", string(payload))
}
}()
}
file, err := os.Open("/Users/tmp/Desktop/personnels.txt")
defer file.Close()
if err != nil {
log.Fatalf("Error opening personnel id file %v", err)
}
scanner := bufio.NewScanner(file)
for scanner.Scan() {
go func() {
ch <- PersonnelEvent{EventType: "provider_location_department_personnel_linked", Data: Data{DepartmentId: "2a8d9687-aea8-4a2c-bc08-c64d7716d973", PersonnelId: scanner.Text()}}
}()
}
}
It reads some IDs from a file, creates a payload out of each one, and issues a POST request to the web server, but when I run the program it gives "too many open files" / "no such host" errors. I suspect the program is too concurrent; how can I make it run gracefully?
Inside the 20 goroutines started in main(), "go makeRequest(...)" starts yet another goroutine for every event. You don't need to start an extra goroutine there.
Besides, I don't think you need to start a goroutine in your scan loop either. A buffered channel is enough, because the bottleneck should be the HTTP requests; a sketch follows below.
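A sketch of what that could look like, reusing the names from the question; the buffer size is an arbitrary assumption, and the sync.WaitGroup (which also requires importing "sync") is added so main waits for the workers to finish:

ch := make(chan PersonnelEvent, 100) // buffered, so the scan loop rarely blocks

var wg sync.WaitGroup
for i := 0; i < 20; i++ {
	wg.Add(1)
	go func() {
		defer wg.Done()
		for personnelEvent := range ch {
			payload, err := json.Marshal(personnelEvent)
			if err != nil {
				log.Println("Error while marshalling payload", err)
				continue
			}
			makeRequest(string(payload)) // call directly, no extra goroutine
		}
	}()
}

scanner := bufio.NewScanner(file)
for scanner.Scan() {
	// Send directly from the scan loop; the buffered channel absorbs bursts.
	ch <- PersonnelEvent{
		EventType: "provider_location_department_personnel_linked",
		Data:      Data{DepartmentId: "2a8d9687-aea8-4a2c-bc08-c64d7716d973", PersonnelId: scanner.Text()},
	}
}
close(ch)
wg.Wait()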
You can use a buffered channel, a.k.a. a counting semaphore, to limit the parallelism.
// The capacity of the buffered channel is 10,
// which means you can have 10 goroutines to
// run the makeRequest function in parallel.
var tokens = make(chan struct{}, 10)
func makeRequest(payload string) {
tokens <- struct{}{} // acquire the token or block here
defer func() { <-tokens }() // release the token to awake another goroutine
// other code...
}
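With this in place, callers can still start one goroutine per payload if they want; the buffered channel guarantees that at most 10 requests are in flight at the same time, and struct{} is used for the tokens because they carry no data.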
I have the following code, which is supposed to download a file by splitting it into multiple parts. But right now it only works on images; when I try downloading other files, like tar files, the output is an invalid file.
UPDATED:
Used file.WriteAt instead of file.Write and removed the os.O_APPEND file mode.
package main
import (
"errors"
"flag"
"fmt"
"io/ioutil"
"log"
"net/http"
"os"
"strconv"
)
var file_url string
var workers int
var filename string
func init() {
flag.StringVar(&file_url, "url", "", "URL of the file to download")
flag.StringVar(&filename, "filename", "", "Name of downloaded file")
flag.IntVar(&workers, "workers", 2, "Number of download workers")
}
func get_headers(url string) (map[string]string, error) {
headers := make(map[string]string)
resp, err := http.Head(url)
if err != nil {
return headers, err
}
if resp.StatusCode != 200 {
return headers, errors.New(resp.Status)
}
for key, val := range resp.Header {
headers[key] = val[0]
}
return headers, err
}
func download_chunk(url string, out string, start int, stop int) {
client := new(http.Client)
req, _ := http.NewRequest("GET", url, nil)
req.Header.Add("Range", fmt.Sprintf("bytes=%d-%d", start, stop))
resp, _ := client.Do(req)
defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
log.Fatalln(err)
return
}
file, err := os.OpenFile(out, os.O_WRONLY, 0600)
if err != nil {
if file, err = os.Create(out); err != nil {
log.Fatalln(err)
return
}
}
defer file.Close()
if _, err := file.WriteAt(body, int64(start)); err != nil {
log.Fatalln(err)
return
}
fmt.Println(fmt.Sprintf("Range %d-%d: %d", start, stop, resp.ContentLength))
}
func main() {
flag.Parse()
headers, err := get_headers(file_url)
if err != nil {
fmt.Println(err)
} else {
length, _ := strconv.Atoi(headers["Content-Length"])
bytes_chunk := length / workers
fmt.Println("file length: ", length)
for i := 0; i < workers; i++ {
start := i * bytes_chunk
stop := start + (bytes_chunk - 1)
go download_chunk(file_url, filename, start, stop)
}
var input string
fmt.Scanln(&input)
}
}
Basically, it just reads the length of the file, divides it by the number of workers, then each worker downloads its part using HTTP's Range header and, after downloading, writes it at the position in the file where that chunk belongs.
If you really ignore as many errors as the code above does, then your code cannot be expected to work reliably for any file type.
However, I think I can see one problem in your code. Mixing O_APPEND and Seek is probably a mistake (Seek is ignored in that mode). I suggest using (*os.File).WriteAt instead.
IIRC, O_APPEND forces every write to happen at the [current] end of the file. However, your download_chunk function instances for the file parts can execute in unpredictable order, thus "reordering" the file parts. The result is then a corrupted file.
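A sketch of that idea: create the file once before starting the workers, let every worker write at its own offset, and wait with a sync.WaitGroup (which would also need to be imported) instead of fmt.Scanln. The change to download_chunk's signature is my assumption:

// Create the output file once, up front, instead of opening it in every worker.
file, err := os.Create(filename)
if err != nil {
	log.Fatalln(err)
}
defer file.Close()

var wg sync.WaitGroup
for i := 0; i < workers; i++ {
	start := i * bytes_chunk
	stop := start + (bytes_chunk - 1)
	wg.Add(1)
	go func(start, stop int) {
		defer wg.Done()
		// download_chunk is assumed to be changed to accept the shared *os.File
		// and to end with out.WriteAt(body, int64(start)).
		download_chunk(file_url, file, start, stop)
	}(start, stop)
}
wg.Wait() // replaces the fmt.Scanln(&input) wait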
1. The order in which the goroutines run is not guaranteed. For example, the output may look like this:
...
file length: 20902
Range 10451-20901: 10451
Range 0-10450: 10451
...
so the chunks cannot simply be appended one after another.
2. Writes of the chunk data need to be coordinated, e.g. with a sync.Mutex.