Colly Max Depth and encoding/json - null - go

I have gone through the Go tour and I'm now going through some of the Colly tutorials. I understand the max depth option and have been trying to use it in a Go program like so:
package main
import (
"encoding/json"
"log"
"net/http"
"github.com/gocolly/colly"
)
func ping(w http.ResponseWriter, r *http.Request) {
log.Println("Ping")
w.Write([]byte("ping"))
}
func getData(w http.ResponseWriter, r *http.Request) {
//Verify the param "URL" exists
URL := r.URL.Query().Get("url")
if URL == "" {
log.Println("missing URL argument")
return
}
log.Println("visiting", URL)
//Create a new collector which will be in charge of collect the data from HTML
c := colly.NewCollector(
// MaxDepth is 2, so only the links on the scraped page
// and links on those pages are visited
colly.MaxDepth(2),
colly.Async(true),
)
// Limit the maximum parallelism to 2
// This is necessary if the goroutines are dynamically
// created to control the limit of simultaneous requests.
//
// Parallelism can be controlled also by spawning fixed
// number of go routines.
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})
//Slices to store the data
var response []string
//onHTML function allows the collector to use a callback function when the specific HTML tag is reached
//in this case whenever our collector finds an
//anchor tag with href it will call the anonymous function
// specified below which will get the info from the href and append it to our slice
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Request.AbsoluteURL(e.Attr("href"))
if link != "" {
response = append(response, link)
}
})
//Command to visit the website
c.Visit(URL)
// parse our response slice into JSON format
b, err := json.Marshal(response)
if err != nil {
log.Println("failed to serialize response:", err)
return
}
// Add some header and write the body for our endpoint
w.Header().Add("Content-Type", "application/json")
w.Write(b)
}
func main() {
addr := ":7171"
http.HandleFunc("/links", getData)
http.HandleFunc("/ping", ping)
log.Println("listening on", addr)
log.Fatal(http.ListenAndServe(addr, nil))
}
When doing so the response is null. Taking out the MaxDepth and Async lines results in the expected response (with only the top level links).
Any help is appreciated!

When running in Async mode c.Visit will return before the requests are actually made (see here); the correct process is demonstrated in the Parallel demo. In your case this means:
c.Visit(URL)
c.Wait()
Using async is not very useful when just making the one request. Check out the reddit example to see how this can be used to visit multiple URLs in one operation.
Note: You really should be checking the error values returned by these functions and adding an error handler is also good practice.
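The effect of the missing Wait can be sketched without Colly. The following is a stdlib-only analogy, not Colly's implementation: each "page" is handled on its own goroutine, the shared slice is guarded by a mutex (callbacks in async mode may run on different goroutines), and the WaitGroup plays the role of c.Wait(). The function name collectLinks and the sample pages are made up for illustration. Without the wait, the slice would still be nil when the function returns, and json.Marshal encodes a nil slice as null, which matches the response seen in the question.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectLinks is a stand-in for an asynchronous crawl: each page is
// processed on its own goroutine and reports the links it found.
// The pages map is made-up sample data, not real crawl output.
func collectLinks(pages map[string][]string) []string {
	var (
		mu    sync.Mutex // guards links, since the workers run concurrently
		wg    sync.WaitGroup
		links []string
	)
	for _, found := range pages {
		wg.Add(1)
		go func(found []string) {
			defer wg.Done()
			mu.Lock()
			links = append(links, found...)
			mu.Unlock()
		}(found)
	}
	// Without this Wait (the analogue of c.Wait()), the function could
	// return before any goroutine has appended anything.
	wg.Wait()
	sort.Strings(links)
	return links
}

func main() {
	pages := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c"},
	}
	fmt.Println(collectLinks(pages))
}
```

In the handler from the question, the same idea would mean calling c.Wait() after c.Visit(URL) and protecting the append to response with a mutex.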

Related

Limit Max Number of Requests Per Hour with `didip/tollbooth`

I'm new to rate limiting and want to use tollbooth to limit HTTP requests.
I also read the Token Bucket Algorithm page on Wikipedia.
For a simple test app, I want to limit the max number of concurrent requests to 10 regardless of request IP, and have a max burst size of 3 based on request IP.
NOTE: The 10 and 3 are just to make rate limiting easier to observe.
Below is my code based on the examples on tollbooth's GitHub page:
package main
import (
"net/http"
"time"
"github.com/didip/tollbooth/v7"
"github.com/didip/tollbooth/v7/limiter"
)
func main() {
lmt := tollbooth.NewLimiter(3, &limiter.ExpirableOptions{DefaultExpirationTTL: time.Hour})
http.Handle("/", tollbooth.LimitFuncHandler(lmt, HelloHandler))
http.ListenAndServe(":8080", nil)
}
func HelloHandler(w http.ResponseWriter, req *http.Request) {
w.Write([]byte("Hello, World!"))
}
I test the code by running curl -i localhost:8080 several times in rapid succession, and I do get HTTP/1.1 429 Too Many Requests errors whenever I exceed the rate limit I set.
Below are my questions:
How do I use tollbooth to limit max number of concurrent requests to something like 10? And does it even make sense to do so? I assume it does because rate limiting based only on IPs sounds like the server could still go out of memory when too many IPs access it at once.
Am I approaching rate limiting correctly, or am I missing something? Perhaps this is something that's better handled by whatever load balancer is working with the app in the cloud?
UPDATE: Here's my working code based on Woody1193's answer:
package main
import (
"net/http"
"sync"
"time"
"github.com/didip/tollbooth/v7"
"github.com/didip/tollbooth/v7/limiter"
)
func main() {
ipLimiter := tollbooth.NewLimiter(3, &limiter.ExpirableOptions{DefaultExpirationTTL: time.Hour})
globalLimiter := NewConcurrentLimiter(10)
http.Handle("/", globalLimiter.LimitConcurrentRequests(ipLimiter, HelloHandler))
http.ListenAndServe(":8080", nil)
}
func HelloHandler(w http.ResponseWriter, req *http.Request) {
w.Write([]byte("Hello, World!"))
}
type ConcurrentLimiter struct {
max int
current int
mut sync.Mutex
}
func NewConcurrentLimiter(limit int) *ConcurrentLimiter {
return &ConcurrentLimiter{
max: limit,
}
}
func (limiter *ConcurrentLimiter) LimitConcurrentRequests(lmt *limiter.Limiter,
handler func(http.ResponseWriter, *http.Request)) http.Handler {
middle := func(w http.ResponseWriter, r *http.Request) {
limiter.mut.Lock()
maxHit := limiter.current == limiter.max
if maxHit {
limiter.mut.Unlock()
http.Error(w, http.StatusText(429), http.StatusTooManyRequests)
return
}
limiter.current += 1
limiter.mut.Unlock()
defer func() {
limiter.mut.Lock()
limiter.current -= 1
limiter.mut.Unlock()
}()
// There's no rate-limit error, serve the next handler.
handler(w, r)
}
return tollbooth.LimitHandler(lmt, http.HandlerFunc(middle))
}
It appears that tollbooth doesn't offer the functionality you're looking for. However, you can roll your own:
type ConcurrentLimiter struct {
max int
current int
mut sync.Mutex
}
func NewConcurrentLimiter(limit int) *ConcurrentLimiter {
return &ConcurrentLimiter{
max: limit,
// mut needs no initialization: the zero value of sync.Mutex is ready to use
}
}
func (limiter *ConcurrentLimiter) LimitConcurrentRequests(lmt *limiter.Limiter,
next http.Handler) http.Handler {
middle := func(w http.ResponseWriter, r *http.Request) {
limiter.mut.Lock()
maxHit := limiter.current == limiter.max
if maxHit {
limiter.mut.Unlock()
http.Error(w, http.StatusText(http.StatusTooManyRequests), http.StatusTooManyRequests)
return
}
limiter.current += 1
limiter.mut.Unlock()
defer func() {
limiter.mut.Lock()
limiter.current -= 1
limiter.mut.Unlock()
}()
// There's no rate-limit error, serve the next handler.
next.ServeHTTP(w, r)
}
return tollbooth.LimitHandler(lmt, http.HandlerFunc(middle))
}
Then, in your setup you can do:
// lmt is the *limiter.Limiter created with tollbooth.NewLimiter
http.Handle("/", NewConcurrentLimiter(10).LimitConcurrentRequests(lmt, http.HandlerFunc(HelloHandler)))
This code works by keeping a count of how many requests the API is currently handling and returning an error once the maximum has been reached. The Mutex ensures that the count is updated safely when requests arrive concurrently.
I had to inject the tollbooth.Limiter into the limiter I wrote because of the way tollbooth handles such functions (i.e. it doesn't operate as a middleware).
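An alternative way to count in-flight requests is a buffered channel used as a semaphore. The sketch below is stdlib only and leaves tollbooth out entirely; limitConcurrency and demo are names invented here, and the limit of 1 is chosen only to make the rejection easy to trigger. The demo handler issues a nested request while it still holds the only slot, so the nested request is rejected deterministically.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// limitConcurrency rejects requests with 429 once max requests are in flight.
// The buffered channel acts as a counting semaphore.
func limitConcurrency(max int, next http.Handler) http.Handler {
	slots := make(chan struct{}, max)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // acquired a slot
			defer func() { <-slots }() // release it when the handler returns
			next.ServeHTTP(w, r)
		default: // all slots are busy
			http.Error(w, http.StatusText(http.StatusTooManyRequests), http.StatusTooManyRequests)
		}
	})
}

// demo sends one request whose handler issues a nested request while it
// still holds the only slot, and returns both status codes.
func demo() (outer, inner int) {
	var limited http.Handler
	limited = limitConcurrency(1, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := httptest.NewRecorder()
		limited.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		inner = rec.Code
		w.Write([]byte("Hello, World!"))
	}))
	rec := httptest.NewRecorder()
	limited.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code, inner
}

func main() {
	outer, inner := demo()
	fmt.Println(outer, inner)
}
```

The channel version needs no explicit counter or mutex. To combine it with the per-IP limit, the returned handler could be passed to tollbooth.LimitHandler in the same way the answer above does.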

Cannot bind POST body to URL in Go

I'm trying to make a simple API call to the pokemon API through reaching a POST request that I'm serving with Echo.
I'm sending a POST request to "localhost:8000/pokemon" with the body { "pokemon": "pikachu" }. The handler reads the body with ioutil, reattaches it to the request, and should then use the bound value so that the call is made with the body appended: "localhost:8000/pokemon/pikachu".
The POST request works by responding with some JSON, but the call being made is only to "localhost:8000/pokemon", and it seems the body isn't added to the URL.
I think there is something wrong with the binding here u := new(pokemon)
Anyone have any ideas?
func main() {
e := echo.New() // Middleware
e.Use(middleware.Logger()) // Logger
e.Use(middleware.Recover())
//CORS
e.Use(middleware.CORSWithConfig(middleware.CORSConfig{
AllowOrigins: []string{"*"},
AllowMethods: []string{echo.GET, echo.HEAD, echo.PUT, echo.PATCH, echo.POST, echo.DELETE},
}))
// Root route => handler
e.GET("/", func(c echo.Context) error {
return c.String(http.StatusOK, "Hello, World!\n")
})
e.POST("/pokemon", controllers.GrabPrice) // Price endpoint
// Server
e.Logger.Fatal(e.Start(":8000"))
}
type pokemon struct {
pokemon string `json:"pokemon" form:"pokemon" query:"pokemon"`
}
// GrabPrice - handler method for binding JSON body and scraping for stock price
func GrabPrice(c echo.Context) (err error) {
// Read the Body content
var bodyBytes []byte
if c.Request().Body != nil {
bodyBytes, _ = ioutil.ReadAll(c.Request().Body)
}
// Restore the io.ReadCloser to its original state
c.Request().Body = ioutil.NopCloser(bytes.NewBuffer(bodyBytes))
u := new(pokemon)
er := c.Bind(u) // bind the structure with the context body
// oh no, panic!
if er != nil {
panic(er)
}
// company ticker
ticker := u.pokemon
print("Here", string(u.pokemon))
// yahoo finance base URL
baseURL := "https://pokeapi.co/api/v2/pokemon"
print(baseURL + ticker)
// price XPath
//pricePath := "//*[@name=\"static\"]"
// load HTML document by binding base url and passed in ticker
doc, err := htmlquery.LoadURL(baseURL + ticker)
// uh oh :( freak out!!
if err != nil {
panic(err)
}
// HTML Node
// from the Node get inner text
price := string(htmlquery.InnerText(doc))
return c.JSON(http.StatusOK, price)
}
Adding to what @mkopriva and @A.Lorefice have already answered:
Yes, you need to make sure the struct fields are exported for the binding to work properly.
The binding process uses reflection on the struct, and reflection can only set exported fields. See this documentation and scroll to the Structs section for details.
type pokemon struct {
Pokemon string `json:"pokemon" form:"pokemon" query:"pokemon"`
}
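The effect of the lowercase field can be reproduced with encoding/json alone, which is also reflection based. This is a minimal sketch that leaves Echo out; the type names unexported and exported and the helpers decodeBoth and pokemonURL are invented for the example. It also shows a second detail visible in the question's code: baseURL has no trailing slash, so baseURL + ticker yields ".../pokemonpikachu" unless a slash is added.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unexported mirrors the struct from the question: the field is lowercase,
// so reflection-based decoders cannot set it, whatever the tag says.
type unexported struct {
	pokemon string `json:"pokemon"`
}

// exported mirrors the fixed struct from the answer.
type exported struct {
	Pokemon string `json:"pokemon"`
}

// decodeBoth decodes the same JSON body into both struct shapes.
func decodeBoth(body string) (string, string, error) {
	var u unexported
	if err := json.Unmarshal([]byte(body), &u); err != nil {
		return "", "", err
	}
	var e exported
	if err := json.Unmarshal([]byte(body), &e); err != nil {
		return "", "", err
	}
	return u.pokemon, e.Pokemon, nil
}

// pokemonURL joins the base URL and the name with an explicit slash.
func pokemonURL(name string) string {
	const baseURL = "https://pokeapi.co/api/v2/pokemon"
	return baseURL + "/" + name
}

func main() {
	lower, upper, err := decodeBoth(`{"pokemon": "pikachu"}`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("unexported=%q exported=%q\n", lower, upper)
	fmt.Println(pokemonURL(upper))
}
```

Decoding into the unexported field reports no error; the value is silently dropped, which is why the handler in the question appeared to work while calling the bare base URL.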

Rate limit function 40/second with "golang.org/x/time/rate"

I'm trying to use "golang.org/x/time/rate" to build a function which blocks until a token is free. Is this the correct way to use the library to rate limit blocks of code to 40 requests per second, with a bucket size of 2?
type Client struct {
limiter *rate.Limiter
ctx context.Context
}
func NewClient() *Client {
c := Client{}
c.limiter = rate.NewLimiter(40, 2)
c.ctx = context.Background()
return &c
}
func (client *Client) RateLimitFunc() {
err := client.limiter.Wait(client.ctx)
if err != nil {
fmt.Printf("rate limit error: %v", err)
}
}
To rate limit a block of code I call
RateLimitFunc()
I don't want to use a ticker as I want the rate limiter to take into account the length of time the calling code runs for.
According to the documentation, the first parameter to NewLimiter is of type rate.Limit, which is expressed as a number of events per second.
If you want 40 requests per second, that translates into a rate of one request every 25 ms.
You can make that explicit with rate.Every, which yields the same limit as passing 40 directly:
limiter := rate.NewLimiter(rate.Every(25 * time.Millisecond), 2)
Side note:
In general, a context, ctx, should not be stored on a struct and should be per request. It appears that Client will be reused, so you could pass a context to RateLimitFunc(), or wherever appropriate, instead of storing a single context on the client struct.
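// The limiter is shared across calls. Creating a new limiter inside the
// function would give every call a full bucket, so Wait would never block.
var limiter = rate.NewLimiter(40, 10)
func RateLimit(ctx context.Context) {
err := limiter.Wait(ctx)
if err != nil {
// Log the error and return
return
}
// Do the actual work here
}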
As Zak said, do not store a Context inside a struct type; the documentation for the context package advises passing it explicitly to each function that needs it.
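The advice from both answers can be put together as follows. This is a sketch under assumptions: the waiter interface, fakeLimiter, Do, demo, and interval are names invented here, and fakeLimiter only stands in for a real limiter so that the example runs without the x/time/rate module. In real code a *rate.Limiter could be stored in the limiter field, since it has a Wait method with this signature.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waiter is the one method used here; *rate.Limiter from
// golang.org/x/time/rate has a Wait method with this signature.
type waiter interface {
	Wait(ctx context.Context) error
}

// Client holds the long-lived limiter but no context.
type Client struct {
	limiter waiter
}

// Do blocks until the limiter grants a token or ctx is done, then runs work.
func (c *Client) Do(ctx context.Context, work func()) error {
	if err := c.limiter.Wait(ctx); err != nil {
		return fmt.Errorf("rate limit wait: %w", err)
	}
	work()
	return nil
}

// fakeLimiter is a hypothetical stand-in so the sketch runs without the
// x/time/rate dependency; it only checks whether ctx is already done.
type fakeLimiter struct{}

func (fakeLimiter) Wait(ctx context.Context) error { return ctx.Err() }

// interval converts a per-second rate into the gap between events,
// for example 40 per second is one event every 25ms.
func interval(perSecond int) time.Duration {
	return time.Second / time.Duration(perSecond)
}

// demo calls Do once with a live context and once with a cancelled one.
func demo() (liveErr, cancelledErr error) {
	c := &Client{limiter: fakeLimiter{}}
	liveErr = c.Do(context.Background(), func() {})
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the context is already done when Do is called
	cancelledErr = c.Do(ctx, func() {})
	return liveErr, cancelledErr
}

func main() {
	live, cancelled := demo()
	fmt.Println(live, cancelled, interval(40))
}
```

Passing ctx per call lets each caller apply its own deadline or cancellation while the limiter itself stays shared.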

How to make stateless connections with gorilla mux?

My program runs fine with one connection at a time, but not with concurrent connections.
I need all connections to be rendered by one function, which will have all the data I need in my service, and that is not working properly, so I have illustrated it with the simple code below:
package main
import (
"encoding/json"
"fmt"
"github.com/gorilla/mux"
"github.com/rs/cors"
"net/http"
"reflect"
"time"
)
var Out struct {
Code int `json:"status"`
Message []interface{} `json:"message"`
}
func Clear(v interface{}) {
p := reflect.ValueOf(v).Elem()
p.Set(reflect.Zero(p.Type()))
}
func YourHandler(w http.ResponseWriter, r *http.Request) {
Clear(&Out.Message)
Out.Code = 0
// w.Header().Set("Content-Type", "application/json; charset=UTF-8")
w.Header().Set("Access-Control-Allow-Origin", "*")
w.Header().Set("Access-Control-Allow-Headers","Content-Type,access-control-allow-origin, access-control-allow-headers")
w.WriteHeader(http.StatusOK)
for i:=0; i<10; i++ {
Out.Code = Out.Code + 1
Out.Message = append(Out.Message, "Running...")
time.Sleep(1000 * time.Millisecond)
if err := json.NewEncoder(w).Encode(Out); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
}
}
}
func main() {
r := mux.NewRouter()
r.StrictSlash(true);
r.HandleFunc("/", YourHandler)
handler := cors.New(cors.Options{
AllowedOrigins: []string{"*"},
AllowCredentials: true,
Debug: true,
AllowedHeaders: []string{"Content-Type"},
AllowedMethods: []string{"GET"},
}).Handler(r)
fmt.Println("Working in localhost:5000")
http.ListenAndServe(":5000", handler)
}
If you run this code with one connection at a time, you won't see anything wrong. But if you open it in another tab or browser at the same time, then because of the delay the status codes will not go from 1 to 10; they will be mixed up across all the calls.
So I guess that means it's not stateless, and I need it to be, so that even with 300 simultaneous connections each one always gets status codes from 1 to 10.
How can I do that? (As I said, this is simplified code; the structure and the render functions are in separate packages from each other and of all data collection and)
Handlers are called concurrently by the net/http server. The server creates a goroutine for each client connection and calls handlers on those goroutines.
The Gorilla Mux is passive with respect to concurrency. The mux calls through to the registered application handler on whatever goroutine the mux is called on.
Use a sync.Mutex to limit execution to one goroutine at a time:
var mu sync.Mutex
func YourHandler(w http.ResponseWriter, r *http.Request) {
mu.Lock()
defer mu.Unlock()
Clear(&Out.Message)
Out.Code = 0
...
This is not a good solution given the time.Sleep calls in the handler. The server will process at most one request every 10 seconds.
A better solution is to declare Out as a local variable inside the handler function. With this change, there's no need for the mutex or to clear Out:
func YourHandler(w http.ResponseWriter, r *http.Request) {
var Out struct {
Code int `json:"status"`
Message []interface{} `json:"message"`
}
// w.Header().Set("Content-Type", "application/json; charset=UTF-8")
w.Header().Set("Access-Control-Allow-Origin", "*")
...
If it's not possible to move the declaration of Out, then copy the value to a local variable:
func YourHandler(w http.ResponseWriter, r *http.Request) {
Out := Out // local Out is copy of package-level Out
Clear(&Out.Message)
Out.Code = 0
...
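The local-variable approach can be checked with a small stdlib-only sketch. It makes some simplifying assumptions: the names out, handler, and serveConcurrently are invented here, Message is a []string rather than []interface{}, the loop runs three times instead of ten, and the sleep is dropped. Each concurrent call builds its own value, so every response body is identical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// out is declared as a type; each request builds its own value.
type out struct {
	Code    int      `json:"status"`
	Message []string `json:"message"`
}

// handler keeps all state in local variables, so concurrent requests
// cannot interfere with one another.
func handler(w http.ResponseWriter, r *http.Request) {
	var o out
	enc := json.NewEncoder(w)
	for i := 0; i < 3; i++ {
		o.Code++
		o.Message = append(o.Message, "Running...")
		if err := enc.Encode(o); err != nil {
			return // the write failed, so there is nothing more to send
		}
	}
}

// serveConcurrently calls the handler n times in parallel and returns
// the response bodies.
func serveConcurrently(n int) []string {
	bodies := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			rec := httptest.NewRecorder()
			handler(rec, httptest.NewRequest(http.MethodGet, "/", nil))
			bodies[i] = rec.Body.String() // each goroutine writes its own index
		}(i)
	}
	wg.Wait()
	return bodies
}

func main() {
	for _, body := range serveConcurrently(2) {
		fmt.Print(body)
	}
}
```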
Gorilla Mux uses Go's net/http server to process your HTTP requests, and Go creates a goroutine to service each incoming request. If I understand your question correctly, you expect each response to carry your custom status codes in order from 1 to 10, as if the requests were handled one after another. Goroutines, like Java threads if you're familiar with Java, give no guarantee about the order of execution. So if a goroutine is spawned for each request, each one runs on its own without regard for which started or finishes first, and each serves its request as it finishes. If you want requests that are processed in parallel to complete in a particular order, you can use channels. See this link for synchronizing goroutines with channels: https://gobyexample.com/channel-synchronization
First, I would like to thank ThunderCat and Ramil for the help; your answers pointed me toward the correct answer.
The short answer is: Go doesn't have stateless connections, so I can't do what I was looking for.
That said, the reason I think it doesn't (based on RFC 7230) is this:
In a traditional web server setup, one program handles the connections (Apache, nginx, etc.) and opens a thread to the routed application, while in Go both live in the same application, so anything global is always shared between connections.
In other languages that can work like Go (the application opens a port and keeps listening on it), such as C++, the code is object oriented, so even public variables live inside a class and are not shared, since you have to create an instance of the class each time.
Creating a thread would solve the problem, but Go doesn't have threads; instead it has goroutines. More detail here:
https://translate.google.com/translate?sl=ko&tl=en&u=https%3A%2F%2Ftech.ssut.me%2F2017%2F08%2F20%2Fgoroutine-vs-threads%2F
After days on this, and with the help here, I'll fix it by changing my struct to a type and making the value local, like this:
package main
import (
"encoding/json"
"fmt"
"github.com/gorilla/mux"
"github.com/rs/cors"
"net/http"
"reflect"
"time"
)
type Out struct {
Code int `json:"status"`
Message []interface{} `json:"message"`
}
func Clear(v interface{}) {
p := reflect.ValueOf(v).Elem()
p.Set(reflect.Zero(p.Type()))
}
func YourHandler(w http.ResponseWriter, r *http.Request) {
localOut := Out{0,nil}
// w.Header().Set("Content-Type", "application/json; charset=UTF-8")
w.Header().Set("Access-Control-Allow-Origin", "*")
w.Header().Set("Access-Control-Allow-Headers","Content-Type,access-control-allow-origin, access-control-allow-headers")
w.WriteHeader(http.StatusOK)
for i:=0; i<10; i++ {
localOut.Code = localOut.Code + 1
localOut.Message = append(localOut.Message, "Running...")
time.Sleep(1000 * time.Millisecond)
if err := json.NewEncoder(w).Encode(localOut); err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
}
}
}
func main() {
r := mux.NewRouter()
r.StrictSlash(true);
r.HandleFunc("/", YourHandler)
handler := cors.New(cors.Options{
AllowedOrigins: []string{"*"},
AllowCredentials: true,
Debug: true,
AllowedHeaders: []string{"X-Session-Token","Content-Type"},
AllowedMethods: []string{"GET","POST","PUT","DELETE"},
}).Handler(r)
fmt.Println("Working in localhost:5000")
http.ListenAndServe(":5000", handler)
}
Of course that will take some weeks, so for now I put my application behind nginx and now it works as expected.

Gin If `request body` bound in middleware, c.Request.Body become 0

My API server has middleware that gets a token from the request header.
If the access is valid, it goes on to the next function.
But once the request has passed through the middleware and reached the next function, c.Request.Body is empty.
Middleware:
func getUserIdFromBody(c *gin.Context) (int) {
var jsonBody User
length, _ := strconv.Atoi(c.Request.Header.Get("Content-Length"))
body := make([]byte, length)
length, _ = c.Request.Body.Read(body)
json.Unmarshal(body[:length], &jsonBody)
return jsonBody.Id
}
func CheckToken() (gin.HandlerFunc) {
return func(c *gin.Context) {
var userId int
config := model.NewConfig()
reqToken := c.Request.Header.Get("token")
_, resBool := c.GetQuery("user_id")
if resBool == false {
userId = getUserIdFromBody(c)
} else {
userIdStr := c.Query("user_id")
userId, _ = strconv.Atoi(userIdStr)
}
...
if ok {
c.Next()
return
}
}
}
Next function:
func bindOneDay(c *gin.Context) (model.Oneday, error) {
var oneday model.Oneday
if err := c.BindJSON(&oneday); err != nil {
return oneday, err
}
return oneday, nil
}
bindOneDay returns an EOF error, probably because c.Request.Body is empty.
I want to get user_id from the request body in the middleware.
How can I do that without c.Request.Body ending up empty?
You can only read the Body from the client once. The data is streaming from the user, and they're not going to send it again. If you want to read it more than once, you're going to have to buffer the whole thing in memory, like so:
bodyCopy := new(bytes.Buffer)
// Read the whole body
_, err := io.Copy(bodyCopy, req.Body)
if err != nil {
return err
}
bodyData := bodyCopy.Bytes()
// Replace the body with a reader that reads from the buffer
req.Body = ioutil.NopCloser(bytes.NewReader(bodyData))
// Now you can do something with the contents of bodyData,
// like passing it to json.Unmarshal
Note that buffering the entire request into memory means that a user can cause you to allocate unlimited memory -- you should probably either block this at a frontend proxy or use an io.LimitedReader to limit the amount of data you'll buffer.
You also have to read the entire body before Unmarshal can start its work -- this is probably no big deal, but you can do better using io.TeeReader and json.NewDecoder if you're so inclined.
Better, of course, would be to figure out a way to restructure your code so that buffering the body and decoding it twice aren't necessary.
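The buffering and the size limit can be combined in one helper. The following is a stdlib-only sketch rather than Gin code; peekBody and demo are names invented here, and the request is built with http.NewRequest purely for demonstration. In a Gin middleware the same helper could be applied to c.Request.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// peekBody reads at most limit bytes of the request body, then puts the
// bytes back so that later handlers can read the body again.
func peekBody(r *http.Request, limit int64) ([]byte, error) {
	// Read one byte past the limit to detect oversized bodies.
	data, err := io.ReadAll(io.LimitReader(r.Body, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errors.New("request body too large")
	}
	r.Body.Close()
	r.Body = io.NopCloser(bytes.NewReader(data))
	return data, nil
}

// demo peeks at the body to extract "id", then reads the body a second
// time to show that it is still intact.
func demo(body string, limit int64) (id int, second string, err error) {
	req, err := http.NewRequest(http.MethodPost, "/", strings.NewReader(body))
	if err != nil {
		return 0, "", err
	}
	data, err := peekBody(req, limit)
	if err != nil {
		return 0, "", err
	}
	var user struct {
		ID int `json:"id"`
	}
	if err := json.Unmarshal(data, &user); err != nil {
		return 0, "", err
	}
	rest, err := io.ReadAll(req.Body)
	if err != nil {
		return 0, "", err
	}
	return user.ID, string(rest), nil
}

func main() {
	id, second, err := demo(`{"id": 7}`, 1024)
	fmt.Println(id, second, err)
}
```

Reading limit+1 bytes is what lets the helper tell a body of exactly limit bytes apart from one that is too large.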
Gin provides a native solution to allow you to get data multiple times from c.Request.Body. The solution is to use c.ShouldBindBodyWith. Per the gin documentation:
ShouldBindBodyWith ... stores the request body into the context, and reuse when it is called again.
For your particular example, this would be implemented in your middleware like so,
func getUserIdFromBody(c *gin.Context) (int) {
var jsonBody User
if err := c.ShouldBindBodyWith(&jsonBody, binding.JSON); err != nil {
//return error
}
return jsonBody.Id
}
After the middleware, if you want to bind to the body again, just use ctx.ShouldBindBodyWith again. For your particular example, this would be implemented like so
func bindOneDay(c *gin.Context) (model.Oneday, error) {
var oneday model.Oneday
if err := c.ShouldBindBodyWith(&oneday, binding.JSON); err != nil {
return oneday, err
}
return oneday, nil
}
The issue we're fighting against is that c.Request.Body is an io.ReadCloser, a stream that is intended to be read only once. So, if you access c.Request.Body in your code at all, the bytes will be read (consumed) and c.Request.Body will be empty thereafter. By using ShouldBindBodyWith to access the bytes, gin saves the bytes into another storage mechanism within the context, so that they can be reused over and over again.
As a side note, if you've consumed the c.Request.Body and later want to access c.Request.Body, you can do so by tapping into gin's storage mechanism via ctx.Get(gin.BodyBytesKey). Here's an example of how you can obtain the gin-stored Request Body as []byte and then convert it to a string,
var body string
if cb, ok := ctx.Get(gin.BodyBytesKey); ok {
if cbb, ok := cb.([]byte); ok {
body = string(cbb)
}
}
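The idea of reading the body once and keeping the bytes alongside the request can be illustrated with net/http alone. This is an analogy under assumptions, not gin's implementation: bodyKey, cacheBody, bindCached, and demo are names invented for this sketch, and gin uses its own key (gin.BodyBytesKey) and its own context storage.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// bodyKey is a hypothetical context key used only in this sketch.
type bodyKey struct{}

// cacheBody reads the body once and stores the bytes in the request context.
func cacheBody(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		data, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		ctx := context.WithValue(r.Context(), bodyKey{}, data)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// bindCached decodes the cached bytes; it can be called any number of times.
func bindCached(r *http.Request, v interface{}) error {
	data, ok := r.Context().Value(bodyKey{}).([]byte)
	if !ok {
		return errors.New("request body was not cached")
	}
	return json.Unmarshal(data, v)
}

// demo binds the same body twice, into two different structs.
func demo(body string) (id int, title string, err error) {
	h := cacheBody(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var user struct {
			ID int `json:"id"`
		}
		var day struct {
			Title string `json:"title"`
		}
		if err = bindCached(r, &user); err != nil {
			return
		}
		if err = bindCached(r, &day); err != nil {
			return
		}
		id, title = user.ID, day.Title
	}))
	req := httptest.NewRequest(http.MethodPost, "/", strings.NewReader(body))
	h.ServeHTTP(httptest.NewRecorder(), req)
	return id, title, err
}

func main() {
	id, title, err := demo(`{"id": 3, "title": "picnic"}`)
	fmt.Println(id, title, err)
}
```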
