Scrolling to bottom of infinite scroll page then getting html document - go

I'm fairly new to Go and not sure how to approach this. I'm trying to scroll to the bottom of a page that uses infinite scroll so that all of the elements load, and then save the result as an HTML file (or, even better, just get the tags). Without scrolling, only part of the elements are returned. There's also no discernible footer or other element I could jump to using Rod or chromedp. Any help? :)

It depends on how the infinite scroll page is implemented.
With chromedp, we usually scroll a page like this:
package main
import (
	"context"
	"time"
	"github.com/chromedp/chromedp"
	"github.com/chromedp/chromedp/kb"
)
func main() {
	opts := append(chromedp.DefaultExecAllocatorOptions[:],
		// Disable headless mode to see what happens.
		chromedp.Flag("headless", false),
	)
	ctx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
	defer cancel()
	ctx, cancel = chromedp.NewContext(ctx)
	defer cancel()
	if err := chromedp.Run(ctx,
		chromedp.Navigate("https://intoli.com/blog/scrape-infinite-scroll/demo.html"),
	); err != nil {
		panic(err)
	}
	for i := 0; i < 10; i++ {
		if err := chromedp.Run(ctx,
			// Option 1 to scroll the page: window.scrollTo.
			chromedp.Evaluate(`window.scrollTo(0, document.documentElement.scrollHeight)`, nil),
			// Slow down the action so we can see what happens.
			chromedp.Sleep(2*time.Second),
			// Option 2 to scroll the page: send the "End" key to the page.
			chromedp.KeyEvent(kb.End),
			chromedp.Sleep(2*time.Second),
		); err != nil {
			panic(err)
		}
	}
}
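The loop above always scrolls ten times. If the page has no footer to wait for, one possible stop condition is to keep scrolling until the page height stops growing. The sketch below shows only that loop logic, using the standard library; `scrollUntilStable` and the `scroll` callback are hypothetical names, not part of chromedp. With chromedp, the callback could run the `window.scrollTo` expression from above, sleep briefly, and then read `document.documentElement.scrollHeight` through `chromedp.Evaluate`. Once the height is stable, something like `chromedp.OuterHTML("html", &html)` should return the loaded markup.

```go
package main

import (
	"errors"
	"fmt"
)

// scrollUntilStable calls scroll repeatedly until the reported page height
// stops growing or maxRounds is reached. scroll is a hypothetical callback
// that scrolls to the bottom, waits for content, and returns the new height.
func scrollUntilStable(scroll func() (int64, error), maxRounds int) (int64, error) {
	var last int64 = -1
	for i := 0; i < maxRounds; i++ {
		h, err := scroll()
		if err != nil {
			return 0, err
		}
		if h == last {
			// Height did not change since the previous round: assume we are done.
			return h, nil
		}
		last = h
	}
	return last, errors.New("page still growing after maxRounds")
}

func main() {
	// Fake page that grows a few times and then stops (stand-in for a browser).
	heights := []int64{1000, 2000, 3000, 3000}
	i := 0
	scroll := func() (int64, error) {
		h := heights[i]
		if i < len(heights)-1 {
			i++
		}
		return h, nil
	}
	h, err := scrollUntilStable(scroll, 10)
	fmt.Println(h, err)
}
```

The maxRounds limit is there because some feeds never stop loading; without it the loop could run forever.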

Related

Stop all running nested loops inside for -> go func -> for loop

The best way I found to "multi-thread" requests with a different proxy each time was to nest a go func and a for loop inside another for loop. However, I can't figure out how to stop all the loops the way a break normally would. I tried a regular break, and I also tried break out with an out: label above the loops, but neither stopped them.
package main
import (
	"encoding/json"
	"log"
	"github.com/parnurzeal/gorequest"
)
func main() {
	rep := 100
	for i := 0; i < rep; i++ {
		log.Println("starting loop")
		go func() {
			for {
				request := gorequest.New()
				resp, body, errs := request.Get("https://discord.com/api/v9/invites/family").End()
				if errs != nil {
					return
				}
				if resp.StatusCode == 200 {
					var result map[string]interface{}
					json.Unmarshal([]byte(body), &result)
					serverName := result["guild"].(map[string]interface{})["name"].(string)
					log.Println(serverName + " response 200, closing all loops")
					// break all loops and goroutines here
				}
			}
		}()
	}
	log.Println("response 200, closed all loops")
}
Answering this is complicated by your use of parnurzeal/gorequest, because that package does not provide any obvious way to cancel requests (see this issue). Because your focus appears to be on the process rather than the specific function, I've used the standard library (net/http) instead. If you do need to use gorequest, then perhaps ask a question specifically about that.
Anyway, the solution below demonstrates a few things:
Uses a WaitGroup so it knows when all goroutines are done (not essential here, but often you want to know you have shut down cleanly).
Passes the result out via a channel (updating shared variables from a goroutine leads to data races).
Uses a context for cancellation. The cancel function is called when we have a result, and this will stop in-progress requests.
package main
import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"net/http"
	"sync"
)
func main() {
	// Get the context and a function to cancel it
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // Not really required here, but it's good practice to ensure the context is cancelled eventually.
	results := make(chan string)
	const goRoutineCount = 100
	var wg sync.WaitGroup
	wg.Add(goRoutineCount) // we will be waiting on 100 goroutines
	for i := 0; i < goRoutineCount; i++ {
		go func() {
			defer wg.Done() // Decrement WaitGroup when the goroutine exits
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://discord.com/api/v9/invites/family", nil)
			if err != nil {
				panic(err)
			}
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				if errors.Is(err, context.Canceled) {
					return // The error is due to the context being cancelled, so just shut down
				}
				panic(err)
			}
			defer resp.Body.Close() // Ensure body is closed
			if resp.StatusCode == 200 {
				var result map[string]interface{}
				if err = json.NewDecoder(resp.Body).Decode(&result); err != nil {
					panic(err)
				}
				serverName := result["guild"].(map[string]interface{})["name"]
				results <- serverName.(string) // Should error check this...
				cancel()                       // We have a result so all goroutines can stop now!
			}
		}()
	}
	// We need to process results until everything has shut down; a simple approach is to just close the channel when done
	go func() {
		wg.Wait()
		close(results)
	}()
	var firstResult string
	requestsProcessed := 0
	for x := range results {
		fmt.Println("got result")
		if requestsProcessed == 0 {
			firstResult = x
		}
		requestsProcessed++ // Possible that we will get more than one result (remember that requests are running in parallel)
	}
	// At this point all goroutines have shut down
	if requestsProcessed == 0 {
		log.Println("No results received")
	} else {
		log.Printf("xx%s response 200, closing all loops (requests processed: %d)", firstResult, requestsProcessed)
	}
}
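In the code above each goroutine makes a single request, whereas the goroutines in the question loop until they succeed. A minimal sketch of how a looping worker might notice cancellation, using only the standard library: `attempt` and `firstSuccess` are hypothetical names, and `attempt` simulates a request instead of calling a real server (by assumption, only worker 3 succeeds, on its third try).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// attempt is a hypothetical stand-in for one HTTP request. It reports
// success only for worker 3 on its third try.
func attempt(id, try int) (string, bool) {
	time.Sleep(time.Millisecond)
	if id == 3 && try == 3 {
		return fmt.Sprintf("worker %d", id), true
	}
	return "", false
}

// firstSuccess starts n looping workers, returns the first successful
// result, and cancels the remaining workers.
func firstSuccess(n int) (string, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan string, n) // buffered so no worker blocks on send
	var wg sync.WaitGroup
	for id := 0; id < n; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for try := 1; ; try++ {
				// Check for cancellation before each attempt.
				select {
				case <-ctx.Done():
					return
				default:
				}
				if res, ok := attempt(id, try); ok {
					results <- res
					cancel() // tells every other loop to stop
					return
				}
			}
		}(id)
	}
	wg.Wait() // all loops have exited
	close(results)
	res, ok := <-results
	return res, ok
}

func main() {
	res, ok := firstSuccess(10)
	fmt.Println(res, ok) // prints "worker 3 true"
}
```

The select with a default case is what plays the role of break here: a plain break only leaves the loop it is written in, while a cancelled context is visible to every goroutine that checks it.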

How to detect javascript alert using chromedp

I'm trying to identify that an alert popped up after navigating to a URL using chromedp. I tried using a listener as follows but I'm new to Golang so I'm not sure why it didn't work.
package main
import (
	"context"
	"fmt"
	"log"
	"github.com/chromedp/cdproto/page"
	"github.com/chromedp/chromedp"
)
func main() {
	// create context
	url := "https://grey-acoustics.surge.sh/?__proto__[onload]=alert('hello')"
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	chromedp.ListenTarget(ctx, func(ev interface{}) {
		if ev, ok := ev.(*page.EventJavascriptDialogOpening); ok {
			fmt.Printf("Got an alert: %s\n", ev.Message)
		}
	})
	// run task list
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
	)
	if err != nil {
		log.Fatal(err)
	}
}
For your specific URL, it helped to wait for the iframe to load before expecting the event; otherwise chromedp seems to stop because it has finished its task list.
	// run task list
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitVisible("iframe"),
	)
	if err != nil {
		log.Fatal(err)
	}
}

Trouble figuring out data race in goroutine

I started learning Go recently and I've been chipping away at this for a while now, but figured it was time to ask for some specific help. My program requests paginated data from an API, and there are about 160 pages of data, so it seemed like a good use of goroutines. However, I have race conditions and I can't figure out why. It's probably because I'm new to the language, but my impression was that function parameters are passed as copies of the caller's data unless they are pointers.
According to what I think I know, this should be making copies of my data, which leaves me free to change it in the main function, but I end up requesting some pages multiple times and other pages just once.
My main.go
package main
import (
	"bufio"
	"encoding/json"
	"log"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"sync"
	"github.com/joho/godotenv"
)
func main() {
	err := godotenv.Load()
	if err != nil {
		log.Fatalln(err)
	}
	httpClient := &http.Client{}
	baseURL := "https://api.data.gov/ed/collegescorecard/v1/schools.json"
	filters := make(map[string]string)
	page := 0
	filters["school.degrees_awarded.predominant"] = "2,3"
	filters["fields"] = "id,school.name,school.city,2018.student.size,2017.student.size,2017.earnings.3_yrs_after_completion.overall_count_over_poverty_line,2016.repayment.3_yr_repayment.overall"
	filters["api_key"] = os.Getenv("API_KEY")
	outFile, err := os.Create("./out.txt")
	if err != nil {
		log.Fatalln(err)
	}
	writer := bufio.NewWriter(outFile)
	requestURL := getRequestURL(baseURL, filters)
	response := requestData(requestURL, httpClient)
	wg := sync.WaitGroup{}
	for (page+1)*response.Metadata.ResultsPerPage < response.Metadata.TotalResults {
		page++
		filters["page"] = strconv.Itoa(page)
		wg.Add(1)
		go func() {
			defer wg.Done()
			requestURL := getRequestURL(baseURL, filters)
			response := requestData(requestURL, httpClient)
			_, err = writer.WriteString(response.TextOutput())
			if err != nil {
				log.Fatalln(err)
			}
		}()
	}
	wg.Wait()
}
func getRequestURL(baseURL string, filters map[string]string) *url.URL {
	requestURL, err := url.Parse(baseURL)
	if err != nil {
		log.Fatalln(err)
	}
	query := requestURL.Query()
	for key, value := range filters {
		query.Set(key, value)
	}
	requestURL.RawQuery = query.Encode()
	return requestURL
}
func requestData(url *url.URL, httpClient *http.Client) CollegeScoreCardResponseDTO {
	request, _ := http.NewRequest(http.MethodGet, url.String(), nil)
	resp, err := httpClient.Do(request)
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	var parsedResponse CollegeScoreCardResponseDTO
	err = json.NewDecoder(resp.Body).Decode(&parsedResponse)
	if err != nil {
		log.Fatalln(err)
	}
	return parsedResponse
}
I know another issue I will run into is writing to the output file in the correct order, but I believe using channels to tell each goroutine which request has finished writing could solve that. If I'm wrong about that, I would appreciate any advice on how to approach it as well.
Thanks in advance.
Goroutines do not receive copies of data. A closure captures the variables it uses by reference, and when the compiler detects that a variable "escapes" the current function, it allocates that variable on the heap. In this case, filters is one such variable. When the goroutine starts, the filters it accesses is the same map that the main goroutine uses. Since you keep modifying filters in the main goroutine without locking, there is no guarantee of what the goroutine sees.
I suggest you keep filters read-only, create a new map in the goroutine by copying all items from filters, and add the "page" entry in the goroutine. You have to be careful to pass a copy of page as well:
go func(page int) {
	flt := make(map[string]string)
	for k, v := range filters {
		flt[k] = v
	}
	flt["page"] = strconv.Itoa(page)
	...
}(page)
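A fuller sketch of the idea above, under some assumptions: `withPage`, `fetchAll`, and the `fetch` callback are hypothetical names, and `fetch` stands in for the real HTTP request. Each goroutine gets its own copy of the filters and writes only to its own slot in a slice, so the results come back in page order and the main goroutine can write them to the file afterwards. That also keeps the bufio.Writer out of the goroutines, which matters because bufio.Writer is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// withPage returns a copy of filters with the "page" key set, leaving the
// shared map untouched.
func withPage(filters map[string]string, page int) map[string]string {
	flt := make(map[string]string, len(filters)+1)
	for k, v := range filters {
		flt[k] = v
	}
	flt["page"] = strconv.Itoa(page)
	return flt
}

// fetchAll runs fetch once per page concurrently and returns the results
// in page order. filters is only read, never written, by the goroutines.
func fetchAll(filters map[string]string, pages int, fetch func(map[string]string) string) []string {
	out := make([]string, pages)
	var wg sync.WaitGroup
	for page := 0; page < pages; page++ {
		wg.Add(1)
		go func(page int) { // page is passed by value, so each goroutine has its own copy
			defer wg.Done()
			out[page] = fetch(withPage(filters, page)) // each goroutine writes only its own slot
		}(page)
	}
	wg.Wait()
	return out
}

func main() {
	filters := map[string]string{"fields": "id"}
	// Hypothetical stand-in for the HTTP request in the question.
	fetch := func(f map[string]string) string { return "page=" + f["page"] }
	fmt.Println(fetchAll(filters, 4, fetch)) // prints "[page=0 page=1 page=2 page=3]"
}
```

With about 160 pages, holding all responses in memory before writing is probably acceptable; for much larger result sets, a channel plus a single writer goroutine would be the usual alternative.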

How to load local html to headless chrome

I have local HTML files that I need to render and take screenshots of.
I could not find any way to load HTML into chromedp.
Is that possible?
Yes, it is.
The chromedp examples include a nice one: https://github.com/chromedp/examples/blob/master/screenshot/main.go. The only difference is that instead of using "https://..." in urlstr, you use "file:///<absolute_path_to_your_file>".
Here is example code, mostly taken from the linked example, which I used to take a screenshot of HTML stored on my local system:
package main
import (
	"context"
	"io/ioutil"
	"log"
	"math"
	"github.com/chromedp/cdproto/emulation"
	"github.com/chromedp/cdproto/page"
	"github.com/chromedp/chromedp"
)
func main() {
	// create context
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	// if you want to use html from your local filesystem use file:/// + absolute path to your html file
	url := "file:///home/oktogen/Documents/Notebooks/2020/May/AnalysisJobs/FlaskApp/templates/index.html"
	// capture screenshot of an element
	var buf []byte
	// capture entire browser viewport, returning png with quality=90
	if err := chromedp.Run(ctx, fullScreenshot(url, 90, &buf)); err != nil {
		log.Fatal(err)
	}
	if err := ioutil.WriteFile("fullScreenshot.png", buf, 0644); err != nil {
		log.Fatal(err)
	}
}
// fullScreenshot takes a screenshot of the entire browser viewport.
//
// Liberally copied from puppeteer's source.
//
// Note: this will override the viewport emulation settings.
func fullScreenshot(urlstr string, quality int64, res *[]byte) chromedp.Tasks {
	return chromedp.Tasks{
		chromedp.Navigate(urlstr),
		chromedp.ActionFunc(func(ctx context.Context) error {
			// get layout metrics
			_, _, contentSize, err := page.GetLayoutMetrics().Do(ctx)
			if err != nil {
				return err
			}
			width, height := int64(math.Ceil(contentSize.Width)), int64(math.Ceil(contentSize.Height))
			// force viewport emulation
			err = emulation.SetDeviceMetricsOverride(width, height, 1, false).
				WithScreenOrientation(&emulation.ScreenOrientation{
					Type:  emulation.OrientationTypePortraitPrimary,
					Angle: 0,
				}).
				Do(ctx)
			if err != nil {
				return err
			}
			// capture screenshot
			*res, err = page.CaptureScreenshot().
				WithQuality(quality).
				WithClip(&page.Viewport{
					X:      contentSize.X,
					Y:      contentSize.Y,
					Width:  contentSize.Width,
					Height: contentSize.Height,
					Scale:  1,
				}).Do(ctx)
			if err != nil {
				return err
			}
			return nil
		}),
	}
}
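If the HTML file's location is only known as a relative path, the file:/// URL can be built instead of hard-coded. A minimal sketch using only the standard library; `fileURL` is a hypothetical helper, and it assumes a Unix-like system where absolute paths start with a slash (Windows drive-letter paths would need extra handling).

```go
package main

import (
	"fmt"
	"net/url"
	"path/filepath"
)

// fileURL converts a local path into a file:// URL string.
// Assumes Unix-style absolute paths (leading slash).
func fileURL(path string) (string, error) {
	abs, err := filepath.Abs(path) // resolves relative paths against the working directory
	if err != nil {
		return "", err
	}
	u := url.URL{Scheme: "file", Path: filepath.ToSlash(abs)}
	return u.String(), nil // special characters such as spaces are percent-encoded
}

func main() {
	u, err := fileURL("templates/index.html")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

The returned string can then be passed to chromedp.Navigate in place of the hard-coded url.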

How can I check whether an element is present in the page using golang knq/chromedp

I am creating an app using [chromedp][1].
How can I check whether an element is present in the page?
I tried to use cdp.WaitVisible(), but it didn't give me what I wanted.
I need this so that I can decide whether the application will do one thing or another.
For this example, suppose I need to know whether the search input is present or not.
How can I do that?
[1]: https://github.com/knq/chromedp
package main
import (
	"context"
	"fmt"
	"io/ioutil"
	"log"
	"time"
	cdp "github.com/knq/chromedp"
	cdptypes "github.com/knq/chromedp/cdp"
)
func main() {
	var err error
	// create context
	ctxt, cancel := context.WithCancel(context.Background())
	defer cancel()
	// create chrome instance
	c, err := cdp.New(ctxt, cdp.WithLog(log.Printf))
	if err != nil {
		log.Fatal(err)
	}
	// run task list
	var site, res string
	err = c.Run(ctxt, googleSearch("site:brank.as", "Easy Money Management", &site, &res))
	if err != nil {
		log.Fatal(err)
	}
	// shutdown chrome
	err = c.Shutdown(ctxt)
	if err != nil {
		log.Fatal(err)
	}
	// wait for chrome to finish
	err = c.Wait()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("saved screenshot of #testimonials from search result listing `%s` (%s)", res, site)
}
func googleSearch(q, text string, site, res *string) cdp.Tasks {
	var buf []byte
	sel := fmt.Sprintf(`//a[text()[contains(., '%s')]]`, text)
	return cdp.Tasks{
		cdp.Navigate(`https://www.google.com`),
		cdp.Sleep(2 * time.Second),
		cdp.WaitVisible(`#hplogo`, cdp.ByID),
		cdp.SendKeys(`#lst-ib`, q+"\n", cdp.ByID),
		cdp.WaitVisible(`#res`, cdp.ByID),
		cdp.Text(sel, res),
		cdp.Click(sel),
		cdp.Sleep(2 * time.Second),
		cdp.WaitVisible(`#footer`, cdp.ByQuery),
		cdp.WaitNotVisible(`div.v-middle > div.la-ball-clip-rotate`, cdp.ByQuery),
		cdp.Location(site),
		cdp.Screenshot(`#testimonials`, &buf, cdp.ByID),
		cdp.ActionFunc(func(context.Context, cdptypes.Handler) error {
			return ioutil.WriteFile("testimonials.png", buf, 0644)
		}),
	}
}
Here is my answer.
The web page is www.google.co.in. The element used is lst-ib, the text box present on the page.
Navigate to the page.
Wait until the element is visible.
Read the value of the element. This is the first time the page is being loaded, so the value will be "".
Assume you have modified the value of the element by typing in the text box. Now, if we read the value of the same element lst-ib again, we should get the updated value.
My code is below:
package main
import (
	"context"
	"log"
	cdp "github.com/knq/chromedp"
)
func main() {
	var err error
	// create context
	ctxt, cancel := context.WithCancel(context.Background())
	defer cancel()
	// create chrome instance
	c, err := cdp.New(ctxt)
	if err != nil {
		log.Fatal(err)
	}
	// run task list
	var res, value, newValue string
	err = c.Run(ctxt, text(&res, &value, &newValue))
	if err != nil {
		log.Fatal(err)
	}
	// shutdown chrome
	err = c.Shutdown(ctxt)
	if err != nil {
		log.Fatal(err)
	}
	// wait for chrome to finish
	err = c.Wait()
	if err != nil {
		log.Fatal(err)
	}
	if len(value) > 1 {
		log.Println("Search Input is present.")
	} else {
		log.Println("Search Input is NOT present.")
	}
	log.Println("New updated value: ", newValue)
}
func text(res, value, newValue *string) cdp.Tasks {
	return cdp.Tasks{
		cdp.Navigate(`https://www.google.co.in`),
		cdp.WaitVisible(`lst-ib`, cdp.ByID),
		cdp.EvaluateAsDevTools("document.getElementById('lst-ib').value", value),
		cdp.EvaluateAsDevTools("document.getElementById('lst-ib').value='Hello';document.getElementById('lst-ib').value", newValue),
	}
}
To run the code, use go run <FileName>.go
I am getting the following output, which was expected:
$ go run main.go
2017/09/28 20:05:20 Search Input is NOT present.
2017/09/28 20:05:20 New updated value: Hello
NOTE:
First I checked with Google Chrome Developer Tools to get the exact JavaScript for my needs. It helps a lot.
I have added the screenshot of the JavaScript I tried in Chrome Developer Tools.
I hope it helps you. :)

Resources