Golang Colly Scraping - Website Captcha Catches My Scraper

I wrote a scraper for Amazon product titles, but Amazon's captcha catches it. I ran go run main.go 10 times: 8 runs hit the captcha and 2 returned the product title.
I researched this but only found solutions for Python, none for Go. Is there a solution for Go?
package main
import (
	"fmt"
	"strings"
	"github.com/gocolly/colly"
)
func main() {
	// Create a collector restricted to the Amazon domains
	c := colly.NewCollector(
		colly.AllowedDomains("www.amazon.com", "amazon.com"),
	)
	c.OnHTML("div", func(h *colly.HTMLElement) {
		captcha := h.Text
		title := h.ChildText("span#productTitle")
		fmt.Println(strings.TrimSpace(title))
		fmt.Println(strings.TrimSpace(captcha))
	})
	// Start the collector
	c.Visit("https://www.amazon.com/Bluetooth-Over-Ear-Headphones-Foldable-Prolonged/dp/B07K5214NZ")
}
Output:
Enter the characters you see below Sorry, we just need to make sure
you're not a robot. For best results, please make sure your browser is
accepting cookies.

If you don't mind a different package, I wrote a package to search HTML
(essentially a thin wrapper around github.com/tdewolff/parse):
package main
import (
	"github.com/89z/parse/html"
	"net/http"
	"os"
)
func main() {
	req, err := http.NewRequest(
		"GET", "https://www.amazon.com/dp/B07K5214NZ", nil,
	)
	if err != nil {
		panic(err)
	}
	// Send a User-Agent header with the request
	req.Header = http.Header{
		"User-Agent": {"Mozilla"},
	}
	res, err := new(http.Transport).RoundTrip(req)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()
	// Move the lexer to the element with id="productTitle" and print its text
	lex := html.NewLexer(res.Body)
	lex.NextAttr("id", "productTitle")
	os.Stdout.Write(lex.Bytes())
}
Result:
Bluetooth Headphones Over-Ear, Zihnic Foldable Wireless and Wired Stereo
Headset Micro SD/TF, FM for Cell Phone,PC,Soft Earmuffs &Light Weight for
Prolonged Waring(Rose Gold)
https://github.com/89z/parse
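Because the captcha only appears on some runs (8 of 10 in the question), it can help to detect the captcha page and retry later instead of printing it as if it were a title. Below is a minimal sketch, assuming the captcha page keeps the wording quoted in the question's output. The function name looksLikeCaptcha and the marker list are hypothetical, not part of Colly or any other library.

```go
package main

import (
	"fmt"
	"strings"
)

// captchaMarkers are phrases taken from the captcha page quoted in the
// question. Assumption: Amazon may change this wording at any time.
var captchaMarkers = []string{
	"Enter the characters you see below",
	"make sure you're not a robot",
}

// looksLikeCaptcha reports whether pageText resembles the captcha page
// rather than a product page.
func looksLikeCaptcha(pageText string) bool {
	// Collapse whitespace so line wrapping does not hide a marker.
	normalized := strings.Join(strings.Fields(pageText), " ")
	for _, marker := range captchaMarkers {
		if strings.Contains(normalized, marker) {
			return true
		}
	}
	return false
}

func main() {
	blocked := "Enter the characters you see below Sorry, we just need to make sure\nyou're not a robot."
	product := "Bluetooth Headphones Over-Ear, Zihnic Foldable Wireless and Wired Stereo Headset"
	fmt.Println(looksLikeCaptcha(blocked))
	fmt.Println(looksLikeCaptcha(product))
}
```

In the Colly callback from the question, a check like this could be applied to h.Text before reading span#productTitle.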

Related

Access EV Stations Detail for a Route or Geolocation Using Golang and HereApi

I am working on an R&D project to find charging stations around a particular route or geolocation. I get a "credentials not valid" error when I access the URL below. I can access some other services with the same API key, but not the EV ones.
How can I access the EV service with these credentials? Here is what I have tried in Go.
package main
import (
	"fmt"
	"io"
	"log"
	"net/http"
)
var apikey = "XXXXXXXX"
var latitude = 42.36399
var longitude = -71.05493
var address string
func main() {
	url := "https://ev-v2.cc.api.here.com/ev/stations.json?prox=" + fmt.Sprint(latitude) + "," + fmt.Sprint(longitude) + ",5000&connectortype=31&apiKey=" + apikey
	res, err := http.Get(url)
	if err != nil {
		log.Fatalln(err)
	}
	defer res.Body.Close()
	// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+)
	body, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Println(string(body))
}
EV routing requires whitelisting. I suggest getting in touch with your AE, or filling in the contact form to reach the team for your region: https://www.here.com/contact
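Independently of the whitelisting issue, the request URL from the question can be assembled with net/url so that each parameter is escaped and easy to change. This is only a sketch, assuming the endpoint and parameter names from the question are correct; buildStationsURL is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// buildStationsURL assembles the EV stations request from the question.
// The host, path and parameter names are copied from the question and
// are not verified against the HERE documentation.
func buildStationsURL(apiKey string, lat, lon float64, radiusMeters int) string {
	// prox is "latitude,longitude,radius" as in the question.
	prox := strconv.FormatFloat(lat, 'f', -1, 64) + "," +
		strconv.FormatFloat(lon, 'f', -1, 64) + "," +
		strconv.Itoa(radiusMeters)

	q := url.Values{}
	q.Set("prox", prox)
	q.Set("connectortype", "31")
	q.Set("apiKey", apiKey)

	u := url.URL{
		Scheme:   "https",
		Host:     "ev-v2.cc.api.here.com",
		Path:     "/ev/stations.json",
		RawQuery: q.Encode(), // sorts keys and percent-encodes values
	}
	return u.String()
}

func main() {
	fmt.Println(buildStationsURL("XXXXXXXX", 42.36399, -71.05493, 5000))
}
```

Note that Encode sorts the parameters by key and writes the commas in prox as %2C, so the string differs from the hand-built one even though it decodes to the same values.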

Dynamic Discovery Client For GCP Golang

I've recently shifted from Python to Go. I had been using Python to work with GCP.
I used to pass in the scopes and name the discovery client I wanted to create, like this:
def get_client(scopes, api, version="v1"):
    service_account_json = os.environ.get("SERVICE_ACCOUNT_KEY_JSON", None)
    if service_account_json is None:
        sys.exit("Exiting !!! No SSH_KEY_SERVICE_ACCOUNT env var found.")
    credentials = service_account.Credentials.from_service_account_info(
        json.loads(b64decode(service_account_json)), scopes=scopes
    )
    return discovery.build(api, version, credentials=credentials, cache_discovery=False)
This would create whichever discovery client I wanted, whether the Compute Engine service or sqladmin.
However, in Go I can't seem to find an equivalent.
I found this: https://pkg.go.dev/google.golang.org/api/discovery/v1
For any client I want to create, I would have to import its package and then create it, like this:
https://cloud.google.com/resource-manager/reference/rest/v1/projects/list#examples
package main
import (
	"fmt"
	"log"
	"golang.org/x/net/context"
	"golang.org/x/oauth2/google"
	"google.golang.org/api/cloudresourcemanager/v1"
)
func main() {
	ctx := context.Background()
	c, err := google.DefaultClient(ctx, cloudresourcemanager.CloudPlatformScope)
	if err != nil {
		log.Fatal(err)
	}
	cloudresourcemanagerService, err := cloudresourcemanager.New(c)
	if err != nil {
		log.Fatal(err)
	}
	req := cloudresourcemanagerService.Projects.List()
	if err := req.Pages(ctx, func(page *cloudresourcemanager.ListProjectsResponse) error {
		for _, project := range page.Projects {
			// TODO: Change code below to process each `project` resource:
			fmt.Printf("%#v\n", project)
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
So I have to import each client library to get its client:
"google.golang.org/api/cloudresourcemanager/v1"
There's no dynamic creation of it.
Is it even possible, given that Go is strictly type-checked? 🤔
Thanks.
No, this is not possible with the Golang Google Cloud library.
You've nailed the point about strict type checking: a dynamically created client would defeat the benefits of compile-time type checking. It would also be bad Go practice to return different objects with different signatures from one function, as Go has no duck typing and relies on interface contracts instead.
Golang is boring and verbose, and it's like that by design :)
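To illustrate the point about interface contracts: the closest Go gets to a dynamic factory is a registry of constructors that all return one shared interface. The sketch below uses made-up stand-in types only. Service, computeService, sqlAdminService and getClient are hypothetical, and the real generated Google clients do not share such an interface, which is why each package still has to be imported and constructed explicitly.

```go
package main

import "fmt"

// Service is a hypothetical shared contract. The generated Google API
// clients do NOT implement anything like this.
type Service interface {
	Name() string
}

// Stand-in types playing the role of two different API clients.
type computeService struct{}

func (computeService) Name() string { return "compute" }

type sqlAdminService struct{}

func (sqlAdminService) Name() string { return "sqladmin" }

// constructors maps an API name to a function that builds its client.
var constructors = map[string]func() Service{
	"compute":  func() Service { return computeService{} },
	"sqladmin": func() Service { return sqlAdminService{} },
}

// getClient looks up a constructor by name. Callers only get the
// methods declared on Service, nothing API-specific.
func getClient(api string) (Service, error) {
	newService, ok := constructors[api]
	if !ok {
		return nil, fmt.Errorf("no constructor registered for %q", api)
	}
	return newService(), nil
}

func main() {
	svc, err := getClient("sqladmin")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(svc.Name())
}
```

The caller can only use what the shared interface declares, so anything specific to one API would need a type assertion, which is the loss of compile-time checking the answer describes.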

golang check if javascript is enabled

This is my current code:
package main
import (
	"net/http"
	"net/http/httputil"
	"net/url"
)
const BaseUrl = "http://127.0.0.1:5000"
const ListeningPort = "80"
func main() {
	// intercept call
	http.HandleFunc("/test", Test)
	// pass all other traffic on to the backend
	http.HandleFunc("/", ProxyFunc)
	http.ListenAndServe(":"+ListeningPort, nil)
}
func ProxyFunc(w http.ResponseWriter, r *http.Request) {
	u, err := url.Parse(BaseUrl)
	if err != nil {
		w.Write([]byte(err.Error()))
		return
	}
	proxy := httputil.NewSingleHostReverseProxy(u)
	proxy.ServeHTTP(w, r)
}
func Test(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("TEST"))
}
Before accepting a client connection, I want to check whether the browser has JavaScript enabled. How can I do this in my current code?
I want to check with this method:
https://pastebin.com/ZASFQumf
It is not possible to do that in Go, since it runs on the server side. I don't even think it is possible with JavaScript.
It is not something you can add, set, or read from the headers.
What you are trying to do is check browser-specific flags.
You might be able to find third-party libraries for managing Chrome flags, Firefox flags, and so on. That is your best option.

How to get the URL in Go

I tried looking it up, but couldn't find an already answered question.
How do I get the host part of the URL in Go?
For example, if the user enters http://localhost:8080 in the address bar, I want to extract "localhost" from the URL.
If you are talking about extracting the host from a *http.Request, you can do the following:
func ExampleHandler(w http.ResponseWriter, r *http.Request) {
	host := r.Host // This is your host (and will include the port if specified)
	fmt.Fprintln(w, host)
}
If you just want the host part without the port, you can do the following:
func ExampleHandler(w http.ResponseWriter, r *http.Request) {
	host, _, _ := net.SplitHostPort(r.Host) // port and error are discarded here
	fmt.Fprintln(w, host)
}
For these to compile you also have to import fmt and net.
Go has wonderful documentation; I would also recommend taking a look at net/http.
Go has a built-in library that can do it for you.
package main
import "fmt"
import "net"
import "net/url"
func main() {
	s := "http://localhost:8080"
	u, err := url.Parse(s)
	if err != nil {
		panic(err)
	}
	host, _, _ := net.SplitHostPort(u.Host)
	fmt.Println(host)
}
https://golang.org/pkg/net/url/#
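Note that net.SplitHostPort returns an error (and an empty host) when the address has no port, as in http://localhost. The Hostname method on url.URL handles both cases and also strips the brackets from IPv6 literals. A small sketch; hostOf is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"net/url"
)

// hostOf returns the host part of rawURL without any port.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	// Hostname drops the port if there is one and works when there is none.
	return u.Hostname(), nil
}

func main() {
	inputs := []string{
		"http://localhost:8080",
		"http://localhost",
		"http://[::1]:8080/path",
	}
	for _, s := range inputs {
		host, err := hostOf(s)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(host)
	}
}
```

Inside a handler, the same result is available from r.URL.Hostname() only when the request line carries an absolute URL, so for ordinary server requests r.Host remains the value to split.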

How to take screenshot of a website using Golang?

Given a URL, I want to take a screenshot of the website using Go. I searched but didn't find anything. Can anyone please help?
You can use a Go version of Selenium if you want to go that route. https://godoc.org/github.com/tebeka/selenium
There is no pure Go way to do this at the moment, since it must involve a browser in some form.
The easiest path to this functionality is probably:
1. Find a good NodeJS library for taking website screenshots
2. Create a NodeJS script that suits your needs for taking screenshots (input/output and settings)
3. Execute this NodeJS script from Go and handle the results in your Go code
This is not the cleanest method, though. If you want it cleaner, you probably have to build or find a Go package that controls a browser, so you can skip the NodeJS middleman.
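The last step, running the script from Go, could look roughly like this with os/exec. This is a sketch under assumptions: screenshot.js is a hypothetical script that takes a URL and an output path as its two arguments, and node is expected to be on the PATH. Neither comes from the answer.

```go
package main

import (
	"fmt"
	"os/exec"
)

// screenshotCommand builds the call to a hypothetical Node.js script
// (screenshot.js) that receives the page URL and the output file path.
func screenshotCommand(pageURL, outPath string) *exec.Cmd {
	return exec.Command("node", "screenshot.js", pageURL, outPath)
}

func main() {
	cmd := screenshotCommand("https://example.com", "shot.png")
	// CombinedOutput runs the command and returns stdout and stderr together,
	// which is handy for reporting why the script failed.
	out, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Printf("screenshot failed: %v\n%s\n", err, out)
		return
	}
	fmt.Println("screenshot written to shot.png")
}
```

Passing the URL as a separate argument, rather than building a shell string, avoids quoting problems with URLs that contain characters such as & or ?.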
I solved this issue using https://github.com/mafredri/cdp and a Chrome headless docker container.
You can see my service example here: https://gist.github.com/efimovalex/9f9b815b0d5b1b7889a51d46860faf8a
A few more tools using Go and Chrome/Chromium include:
gowitness CLI app
screenshot library
web2image CLI app based on chromedp
I was writing a program for this specific task. Here is sample code that browses google.com and takes a screenshot.
package main
import (
	"time"
	driver "github.com/dreygur/webdriver"
)
func main() {
	url := `https://google.com`
	driver.RunServer("./geckodriver")
	driver.GetSession()
	driver.Get(url)
	time.Sleep(8 * time.Second)
	driver.Screenshot("google")
	time.Sleep(8 * time.Second)
	defer driver.Kill()
}
To install the module, run go get github.com/dreygur/webdriver
You can use chromedp, but you need the Chrome browser installed.
Example:
package main
import (
	"context"
	"fmt"
	"os"
	"time"
	"github.com/chromedp/chromedp"
)
// TakeScreenshot opens url in Chrome and returns a screenshot as PNG bytes.
func TakeScreenshot(ctx context.Context, url string) ([]byte, error) {
	// Named browserCtx so the context package is not shadowed
	browserCtx, cancel := chromedp.NewContext(ctx)
	defer cancel()
	var buf []byte
	if err := chromedp.Run(browserCtx, chromedp.Tasks{
		chromedp.Navigate(url),
		chromedp.Sleep(3 * time.Second), // give the page time to render
		chromedp.CaptureScreenshot(&buf),
	}); err != nil {
		return nil, err
	}
	return buf, nil
}
func main() {
	url := "https://google.com"
	data, err := TakeScreenshot(context.Background(), url)
	if err != nil {
		panic(err)
	}
	// os.WriteFile (Go 1.16+) creates the file, writes the data, and closes it
	if err := os.WriteFile("./shot.png", data, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("screenshot taken!")
}
