What is the default mode in which network requests are executed in GoColly? Since we have the Async method in the collector I would assume that the default mode is synchronous.
However, I see no particular difference when I execute these 8 requests in the program, other than needing to call Wait in async mode. It seems as if the option only controls how the rest of the program is executed, and the requests are always asynchronous.
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    urls := []string{
        "http://webcode.me",
        "https://example.com",
        "http://httpbin.org",
        "https://www.perl.org",
        "https://www.php.net",
        "https://www.python.org",
        "https://code.visualstudio.com",
        "https://clojure.org",
    }

    c := colly.NewCollector(
        colly.Async(true),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    for _, url := range urls {
        c.Visit(url)
    }

    c.Wait()
}
The default collection is synchronous.
The confusing bit is probably the collector option colly.Async(), which ignores the actual param. In fact, the implementation at the time of writing is:
func Async(a ...bool) CollectorOption {
    return func(c *Collector) {
        c.Async = true // uh-oh...!
    }
}
Based on this issue, it was done this way for backwards compatibility, so that (I believe) you can pass the option with no param and it'll still work, e.g.:
colly.NewCollector(colly.Async()) // no param, async collection
If you remove the async option altogether and instantiate with just colly.NewCollector(), the network requests will be clearly sequential — i.e. you can also remove c.Wait() and the program won't exit right away.
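For reference, a minimal sketch of the synchronous variant (same collector setup as the question, just without the Async option and without Wait):

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    urls := []string{
        "http://webcode.me",
        "https://example.com",
    }

    // No colly.Async(true): each Visit blocks until the request and its
    // callbacks have finished, so the URLs are fetched one after another.
    c := colly.NewCollector()

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    for _, url := range urls {
        c.Visit(url)
    }
    // No c.Wait() needed: by the time the loop ends, all requests are done.
}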
Related
Assuming that I have a function that sends web requests to an API endpoint, I would like to add a timeout to the client so that if the call is taking too long, the operation breaks, either by returning an error or by panicking the current thread.
Another assumption is that the client function (the function that sends web requests) comes from a library and has been implemented in a synchronous way.
Let's have a look at the client function's signature:
func Send(params map[string]string) (*http.Response, error)
I would like to write a wrapper around this function to add a timeout mechanism. To do that, I can do:
func SendWithTimeout(ctx context.Context, params map[string]string) (*http.Response, error) {
    completed := make(chan bool)

    go func() {
        res, err := Send(params)
        _ = res
        _ = err
        completed <- true
    }()

    for {
        select {
        case <-ctx.Done():
            return nil, errors.New("Cancelled")
        case <-completed:
            return nil, nil // just to test how this method works
        }
    }
}
Now when I call the new function and pass a cancellable context, I successfully get a cancellation error, but the goroutine that is running the original Send function keeps on running to the end.
Since the function makes an API call, meaning that socket/TCP connections are actually established in the background, it is not good practice to leave a long-running API call running behind the scenes.
Is there any standard way to interrupt the original Send function when the context.Done() is hit?
This is a "poor" design choice to add context support to an existing API / implementation that did not support it earlier. Context support should be added to the existing Send() implementation that uses it / monitors it, renaming it to SendWithTimeout(), and provide a new Send() function that takes no context, and calls SendWithTimeout() with context.TODO() or context.Background().
For example if your Send() function makes an outgoing HTTP call, that may be achieved by using http.NewRequest() followed by Client.Do(). In the new, context-aware version use http.NewRequestWithContext().
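For illustration, a context-aware version of such a Send() could look roughly like this; the endpoint URL, the GET method and the query-parameter handling are assumptions made for the sketch, not part of any real API:

func SendWithTimeout(ctx context.Context, params map[string]string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://api.example.com/endpoint", nil)
    if err != nil {
        return nil, err
    }

    // Attach the parameters as query values (just one possible convention).
    q := req.URL.Query()
    for k, v := range params {
        q.Set(k, v)
    }
    req.URL.RawQuery = q.Encode()

    // Client.Do aborts the request, including the underlying connection,
    // as soon as ctx is cancelled or its deadline passes.
    return http.DefaultClient.Do(req)
}

// Send keeps the old signature and delegates to the context-aware version.
func Send(params map[string]string) (*http.Response, error) {
    return SendWithTimeout(context.Background(), params)
}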
If you have a Send() function which you cannot change, then you're "out of luck". The function itself has to support the context or cancellation. You can't abort it from the outside.
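If wrapping is really all you can do, a variant of the question's wrapper can at least return the real result and let the goroutine exit on its own by using a buffered channel; the underlying call still runs to completion in the background, it just can't be aborted (sendResult is an illustrative helper type):

type sendResult struct {
    res *http.Response
    err error
}

func SendWithTimeout(ctx context.Context, params map[string]string) (*http.Response, error) {
    // Buffered, so the goroutine can always deliver its result and exit,
    // even if the caller has already given up and nobody is receiving.
    done := make(chan sendResult, 1)

    go func() {
        res, err := Send(params)
        done <- sendResult{res: res, err: err}
    }()

    select {
    case <-ctx.Done():
        return nil, ctx.Err()
    case r := <-done:
        return r.res, r.err
    }
}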
See related:
Terminating function execution if a context is cancelled
Is it possible to cancel unfinished goroutines?
Stopping running function using context timeout in Golang
cancel a blocking operation in Go
Which ctx should I use in run parameter of hystrix.Do function of hystrix-go package? The ctx from the upper level, or context.Background()?
package main

import (
    "context"

    "github.com/myteksi/hystrix-go/hystrix"
)

func tb(ctx context.Context) error {
    return nil
}

func ta(ctx context.Context) {
    hystrix.Do("cbName", func() error {
        // At this place, the ctx parameter of function tb:
        // should I use ctx from the ta function, or context.Background()?
        return tb(ctx)
    }, nil)
}

func main() {
    ta(context.Background())
}
If you're using contexts, it seems to me like you should be using hystrix.DoC. There's no reason to use anything other than the context that is passed through, since Do is synchronous, and you want whatever cancellations, deadlines (and anything else attached to your context) to be preserved inside this code.
func ta(ctx context.Context) {
    err := hystrix.DoC(ctx, "cbName", func(ctx context.Context) error {
        // ... code that uses ctx here.
        return tb(ctx)
    }, nil)
    // handle err, which may be a hystrix error (ignored in this sketch)
    _ = err
}
It's hard to say if this is actually different from calling hystrix.Do, but this potentially allows hystrix to use your context, to add deadlines/cancellations itself.
Always use the context.Context coming from the upper level as a parameter wherever you can. It provides an end-to-end mechanism to control the request: all the caller has to do is cancel, or set a timeout on, the initial ctx, and it will work for the complete request path.
The initial context passed in can depend on your requirements. If you're not sure what context to use initially, context.TODO can be a good option until you are sure.
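For example, the top of the request path might set the timeout and hand the same ctx all the way down (reusing ta from the question; the 2-second value is arbitrary):

func main() {
    // The caller controls the lifetime of the whole request path.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    // Every function below receives this ctx and passes it on, so
    // cancelling or timing out here affects the complete call chain.
    ta(ctx)
}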
I'm using the pion/webrtc Go library in my project and found this problem that the callback-based API the library provides (which mirrors the JavaScript API of WebRTC) can be awkward to use in Go.
For example, doing the following
conn.OnTrack(func(...) { ... })
conn.OnICEConnectionStateChange(func(...) { ... })
is typical in JavaScript, but in Go, this has a few problems:
This API makes it easy to introduce data races if the callbacks are called in parallel.
The callback-based API propagates to other parts of the codebase and makes everything take callbacks.
What's the conventional way to handle this situation in Go? I'm new to Go, and I read that synchronous APIs are preferred in Go because goroutines are cheap. So perhaps one possible design is to use a channel to synchronize the callbacks:
msgChan := make(chan Msg)
// or use a separate channel for each type of event?

conn.OnTrack(func(...) {
    msgChan <- onTrackMsg
})

conn.OnICEConnectionStateChange(func(...) {
    msgChan <- onStateChangeMsg
})

for {
    msg := <-msgChan
    // do something depending on the type of msg
}
I think forcing synchronization with channels basically mimics the single-threaded nature of JavaScript.
Anyway, how do people usually model event-driven workflow in Go?
No need for a channel. Just wrap your async/callback code in a single function that waits for a response, and use a WaitGroup (you could use a channel here instead, but a WaitGroup is much easier):
func DoSomething() (SomeType, error) {
    var result SomeType
    var err error

    wg := sync.WaitGroup{}
    wg.Add(1)

    StartAsyncProcess(func() {
        // This is the callback that gets called eventually
        defer wg.Done()
        result = /* Set the result */
        err = /* and/or set the error */
    })

    wg.Wait() // Wait until the callback has been called and has exited

    return result, err // And finally return our values
}
You may need to add additional locks or other synchronization in the callback if it relies on or modifies shared state.
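If the callbacks do touch shared state, that extra synchronization could look roughly like this sketch, reusing the hypothetical StartAsyncProcess from above (CollectResults is just an illustrative name):

func CollectResults() []string {
    var (
        mu      sync.Mutex
        results []string // shared state written by the callbacks
    )

    wg := sync.WaitGroup{}
    wg.Add(2)

    // Two callbacks may run in parallel, so guard the shared slice.
    for i := 0; i < 2; i++ {
        StartAsyncProcess(func() {
            defer wg.Done()
            mu.Lock()
            defer mu.Unlock()
            results = append(results, "some value")
        })
    }

    wg.Wait()
    return results
}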
I'm writing a basic server for a website. Now I face a (for me) difficult performance question. Is it better to read the template file in the init() function?
var indexPageData []byte

// Initialize all pages of website
func init() {
    var err error
    indexPageData, err = ioutil.ReadFile("./tpl/index.tpl")
    check(err)
}
Or in the http.HandlerFunc?
func index(w http.ResponseWriter, req *http.Request) {
    indexPageData, err := ioutil.ReadFile("./tpl/index.tpl")
    check(err)

    indexPageTpl := template.Must(template.New("index").Parse(string(indexPageData)))
    indexPageTpl.Execute(w, "test")
}
I think with the first example, after the server is started you never need to access the disk again, which improves the performance of each request.
But during development I want to refresh the browser and see the new content. That can be done with the second example.
Does someone have a state-of-the-art solution? Or what is right from a performance point of view?
Let's analyze the performance:
We name your first solution (with slight changes, see below) a and your second solution b.
One request:
a: One disk access
b: One disk access
Ten requests:
a: One disk access
b: Ten disk accesses
10 000 000 requests:
a: One disk access
b: 10 000 000 disk accesses (this is slow)
So, performance is better with your first solution. But what about your concern regarding up-to-date data? From the documentation of func (t *Template) Execute(wr io.Writer, data interface{}) error:
Execute applies a parsed template to the specified data object, writing the output to wr. If an error occurs executing the template or writing its output, execution stops, but partial results may already have been written to the output writer. A template may be executed safely in parallel.
So, what happens is this:
You read a template from disk
You parse the file into a template
You choose the data to fill in the blanks with
You Execute the template with that data, the result is written out into an io.Writer
Your data is as up-to-date as you choose it. This has nothing to do with re-reading the template from disk, or even re-parsing it. This is the whole idea behind templates: One disk access, one parse, multiple dynamic end results.
The documentation quoted above tells us another thing:
A template may be executed safely in parallel.
This is very useful, because your http.HandlerFuncs are run in parallel if you have multiple requests in parallel.
So, what to do now?
Read the template file once,
Parse the template once,
Execute the template for every request.
I'm not sure if you should read and parse in the init() function, because at the least Must can panic (and don't use some relative, hard-coded path in there!) - I would try to do that in a more controlled environment, e.g. provide a function (like New()) to create a new instance of your server and do that stuff in there, as sketched below.
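A rough sketch of that more controlled setup might look like the following; the server type, the New() constructor and the tplDir parameter are illustrative names, not an existing API:

type server struct {
    indexTmpl *template.Template
}

// New reads and parses the template exactly once and returns an error
// instead of panicking, so the caller decides how to handle failure.
func New(tplDir string) (*server, error) {
    t, err := template.ParseFiles(filepath.Join(tplDir, "index.tpl"))
    if err != nil {
        return nil, err
    }
    return &server{indexTmpl: t}, nil
}

func (s *server) index(w http.ResponseWriter, req *http.Request) {
    // Execute is safe for parallel use, so the one parsed template
    // serves every request.
    s.indexTmpl.Execute(w, "test")
}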
EDIT: I re-read your question and I might have misunderstood you:
If the template itself is still in development then yes, you would have to read it on every request to have an up-to-date result. This is more convenient than to restart the server every time you change the template. For production, the template should be fixed and only the data should change.
Sorry if I got you wrong there.
Never read and parse template files in the request handler in production; that is as bad as it gets (you should basically always avoid it). During development it is fine, of course.
Read this question for more details:
It takes too much time when using "template" package to generate a dynamic web page to client in golang
You could approach this in multiple ways. Here I list 4, each with an example implementation.
1. With a "dev mode" setting
You could have a constant or variable telling if you're running in development mode which means templates are not to be cached.
Here's an example to that:
const dev = true

var indexTmpl *template.Template

func init() {
    if !dev { // Prod mode, read and cache template
        indexTmpl = template.Must(template.ParseFiles("./tpl/index.tpl"))
    }
}

func getIndexTmpl() *template.Template {
    if dev { // Dev mode, always read fresh template
        return template.Must(template.ParseFiles("./tpl/index.tpl"))
    } else { // Prod mode, return cached template
        return indexTmpl
    }
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
    getIndexTmpl().Execute(w, "test")
}
2. Specify in the request (as a param) if you want a fresh template
When you develop, you may specify an extra URL parameter indicating to read a fresh template and not use the cached one, e.g. http://localhost:8080/index?dev=true
Example implementation:
var indexTmpl *template.Template

func init() {
    indexTmpl = getIndexTmpl()
}

func getIndexTmpl() *template.Template {
    return template.Must(template.ParseFiles("./tpl/index.tpl"))
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
    t := indexTmpl
    if r.FormValue("dev") != "" {
        t = getIndexTmpl()
    }
    t.Execute(w, "test")
}
3. Decide based on host
You can also check the host name of the request URL, and if it is "localhost", you can omit the cache and use a fresh template. This requires the smallest extra code and effort. Note that you may want to accept other hosts as well e.g. "127.0.0.1" (up to you what you want to include).
Example implementation:
var indexTmpl *template.Template

func init() {
    indexTmpl = getIndexTmpl()
}

func getIndexTmpl() *template.Template {
    return template.Must(template.ParseFiles("./tpl/index.tpl"))
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
    t := indexTmpl
    if r.Host == "localhost" || strings.HasPrefix(r.Host, "localhost:") {
        t = getIndexTmpl()
    }
    t.Execute(w, "test")
}
4. Check template file last modified
You could also store the last modified time of the template file when it is loaded. Whenever the template is requested, you can check the last modified time of the source template file. If it has changed, you can reload it before executing it.
Example implementation:
type mytempl struct {
    t       *template.Template
    lastmod time.Time
    mutex   sync.Mutex
}

var indexTmpl mytempl

func init() {
    // You may want to call this in init so the first request won't be slow
    checkIndexTempl()
}

func checkIndexTempl() *template.Template {
    nm := "./tpl/index.tpl"
    fi, err := os.Stat(nm)
    if err != nil {
        panic(err)
    }
    // Don't forget the locking!
    indexTmpl.mutex.Lock()
    defer indexTmpl.mutex.Unlock()
    if indexTmpl.lastmod != fi.ModTime() {
        // Changed, reload.
        indexTmpl.t = template.Must(template.ParseFiles(nm))
        indexTmpl.lastmod = fi.ModTime()
    }
    return indexTmpl.t
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
    checkIndexTempl().Execute(w, "test")
}
I'm trying to chain HTTP handlers in Go to provide some added functionality, like this:
package router

import (
    // snip
    "net/http"

    "github.com/gorilla/handlers"
    "github.com/gorilla/mux"
)

// snip

r := mux.NewRouter()

/* routing code */

var h http.Handler
h = r

if useGzip {
    h = handlers.CompressHandler(h)
}

if useLogFile {
    fn := pathToLog
    accessLog, err := os.OpenFile(fn, os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0666)
    if err != nil {
        panic(err)
    }
    h = handlers.CombinedLoggingHandler(accessLog, h)
}

// etc...
The problem is that if any HTTP headers are already set by one of the controllers that the gorilla/mux router points to (for example, w.WriteHeader(404) or w.Header().Set("Content-Type", "application/json")), this silently breaks any "wrapper" handler trying to set or add its own headers, like the compress handler. I can't see any errors (unless I forgot to catch one somewhere), but the browser gets an invalid response.
Is there any graceful way to deal with this, short of just stashing the headers somewhere and then leaving the final handler to write them? It seems like that would mean rewriting the handlers' code, which I'd love to avoid if at all possible.
Once you call w.WriteHeader(404), the header goes on the wire, so you can't add to it anymore.
The best you can do is to buffer the status code and write it at the end of the chain.
For example, you can provide your own wrapper for http.ResponseWriter that re-implements WriteHeader() to save the status value, and then add a Commit() method to actually write it.
Call Commit() in the last handler. You have to determine somehow which handler is last, of course.
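A minimal sketch of such a wrapper (statusBuffer and Commit are illustrative names, not part of net/http):

type statusBuffer struct {
    http.ResponseWriter
    status int
}

// WriteHeader only records the status; nothing goes on the wire yet.
func (b *statusBuffer) WriteHeader(code int) {
    b.status = code
}

// Commit sends the buffered status, and with it all headers that the
// wrapped handlers have accumulated so far.
func (b *statusBuffer) Commit() {
    if b.status == 0 {
        b.status = http.StatusOK
    }
    b.ResponseWriter.WriteHeader(b.status)
}

Note that the first call to Write() on the embedded ResponseWriter would also flush the headers, so Commit() has to happen before the body is written.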
I experienced the same silently-failing behaviour, but only in handlers where I called WriteHeader to set a status code other than StatusOK. I think things went wrong in this part of CompressHandler:
if h.Get("Content-Type") == "" {
h.Set("Content-Type", http.DetectContentType(b))
}
Which appears to be resolved when explicitly setting the content type in my own handler:
w.Header().Set("Content-Type", "text/html; charset=utf-8")
w.WriteHeader(code)