sql.DB on AWS Lambda: too many connections - Go

As I understand it, in Go the DB handle is meant to be long-lived and shared between many goroutines.
But when I use Go with AWS Lambda, it's a very different story, since execution stops once the function has finished.
I call defer db.Close() in the Lambda invoke function, but it has no effect. On the MySQL side the connection is still kept around as a Sleep query. As a result, it causes too many connections on MySQL.
Currently I have to set wait_timeout in MySQL to a small number, but that is not the best solution in my opinion.
Is there any way to close the connection when using the Go SQL driver with Lambda?
Thanks,

There are two problems that we need to address:
Correctly managing state between lambda invocations
Configuring a connection pool
Correctly managing state
Let us understand a bit of how the container is managed by AWS. From the AWS docs:
After a Lambda function is executed, AWS Lambda maintains the
execution context for some time in anticipation of another Lambda
function invocation. In effect, the service freezes the execution
context after a Lambda function completes, and thaws the context for
reuse, if AWS Lambda chooses to reuse the context when the Lambda
function is invoked again.
This execution context reuse approach has the following implications:
Any declarations in your Lambda function code (outside the handler
code, see Programming Model) remains initialized, providing additional
optimization when the function is invoked again. For example, if your
Lambda function establishes a database connection, instead of
reestablishing the connection, the original connection is used in
subsequent invocations. We suggest adding logic in your code to check
if a connection exists before creating one.
Each execution context provides 500MB of additional disk space in the
/tmp directory. The directory content remains when the execution
context is frozen, providing transient cache that can be used for
multiple invocations. You can add extra code to check if the cache has
the data that you stored. For information on deployment limits, see
AWS Lambda Limits.
Background processes or callbacks initiated by your Lambda function
that did not complete when the function ended resume if AWS Lambda
chooses to reuse the execution context. You should make sure any
background processes or callbacks (in case of Node.js) in your code
are complete before the code exits.
This first bullet point says that state is maintained between executions. Let us see this in action:
let counter = 0

module.exports.handler = (event, context, callback) => {
  counter++
  callback(null, { count: counter })
}
If you deploy this and call it multiple times consecutively, you will see that the counter is incremented between calls.
Now that you know that, you should not call defer db.Close(); instead, you should reuse the database instance. You can do that by simply making db a package-level variable.
First, create a database package that will export an Open function:
package database

import (
    "fmt"
    "os"

    _ "github.com/go-sql-driver/mysql"
    "github.com/jinzhu/gorm"
)

var (
    host = os.Getenv("DB_HOST")
    port = os.Getenv("DB_PORT")
    user = os.Getenv("DB_USER")
    name = os.Getenv("DB_NAME")
    pass = os.Getenv("DB_PASS")
)

func Open() (db *gorm.DB) {
    // MySQL DSN format: user:password@tcp(host:port)/dbname
    args := fmt.Sprintf("%s:%s@tcp(%s:%s)/%s?parseTime=true", user, pass, host, port, name)
    // Initialize a new db connection.
    db, err := gorm.Open("mysql", args)
    if err != nil {
        panic(err)
    }
    return
}
Then use it in your handler.go file:
package main

import (
    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/jinzhu/gorm"

    "github.com/<username>/<name-of-lib>/database"
)

var db *gorm.DB

func init() {
    db = database.Open()
}

func Handler() (events.APIGatewayProxyResponse, error) {
    // You can use db here.
    return events.APIGatewayProxyResponse{
        StatusCode: 201,
    }, nil
}

func main() {
    lambda.Start(Handler)
}
Note: don't forget to replace github.com/<username>/<name-of-lib>/database with the right import path.
Now, you might still see the too many connections error. If that happens you will need a connection pool.
Configuring a connection pool
From Wikipedia:
In software engineering, a connection pool is a cache of database
connections maintained so that the connections can be reused when
future requests to the database are required. Connection pools are
used to enhance the performance of executing commands on a database.
You will need a connection pool in which the number of allowed connections matches the number of Lambdas that can run in parallel. You have two choices:
MySQL Proxy
MySQL Proxy is a simple program that sits between your client and
MySQL server(s) and that can monitor, analyze or transform their
communication. Its flexibility allows for a wide variety of uses,
including load balancing, failover, query analysis, query filtering
and modification, and many more.
AWS Aurora:
Amazon Aurora Serverless is an on-demand, auto-scaling configuration
for Amazon Aurora (MySQL-compatible edition), where the database will
automatically start up, shut down, and scale capacity up or down based
on your application's needs. It enables you to run your database in
the cloud without managing any database instances. It's a simple,
cost-effective option for infrequent, intermittent, or unpredictable
workloads.
Regardless of your choice, there are plenty of tutorials on the internet on how to configure both.
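Independently of an external pooler, you can also cap how many connections each Lambda container holds by tuning the pool that database/sql maintains underneath gorm. A minimal sketch, assuming gorm v1 (github.com/jinzhu/gorm) as in the code above; the numbers are illustrative, not recommendations:
package database

import (
    "time"

    "github.com/jinzhu/gorm"
)

// configurePool caps the number of MySQL connections a single Lambda
// container may hold. Call it right after gorm.Open succeeds in Open().
func configurePool(db *gorm.DB) {
    sqlDB := db.DB() // the underlying *sql.DB managed by database/sql
    sqlDB.SetMaxOpenConns(2)                  // hard cap per container
    sqlDB.SetMaxIdleConns(1)                  // keep at most one idle connection warm
    sqlDB.SetConnMaxLifetime(5 * time.Minute) // recycle connections before MySQL's wait_timeout
}
With a cap like this, the total number of MySQL connections is bounded by the max open connections per container times the number of concurrent Lambda containers.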

Related

Scheduler-worker cluster without port forwarding

Hello Stack Overflow!
TL;DR: I would like to recreate https://github.com/KorayGocmen/scheduler-worker-grpc without port forwarding on the worker.
I am trying to build a competitive programming judge server for evaluation of submissions as a project for my school where I teach programming to kids.
Because the evaluation is computationally heavy I would like to have multiple worker nodes.
The scheduler would receive submissions and hand them out to the worker nodes. For ease of worker deployment (as it will change often), I would like the worker to be able to subscribe to the scheduler and thus become a worker and receive jobs.
The workers may not be on the same network as the scheduler, and each worker resides in a VM (it may later be ported to Docker, but currently there are issues with that).
The scheduler should be able to know resource usage of the worker, send different types of jobs to the worker and receive a stream of results.
I am currently thinking of using grpc to address my requirements of communication between workers and the scheduler.
I could create multiple scheduler service methods like:
register worker, receive a stream of jobs
stream job results, receive nothing
stream worker state periodically, receive nothing
However, I would prefer the following, but I don't know whether it is possible:
The scheduler GRPC api:
register a worker ( making the worker GRPC api available to the scheduler )
The worker GRPC api:
start a job ( returns stream of job status )
cancel a job ???
get resource usage
The worker should unregister automatically if the connection is lost.
So my question is... is it possible to create a grpc worker api that can be registered to the scheduler for later use if the worker is behind a NAT without port forwarding?
Additional possibly unnecessary information:
Making matters worse, I have multiple radically different types of jobs (streaming an interactive console, executing code against prepared test cases). I may just create different workers for different jobs.
Sometimes the jobs involve having large files on the local filesystem (up to 500 MB) that are usually kept near the scheduler, therefore I would like to send the job to a worker which already has the specific files downloaded from the scheduler; otherwise, download the large files on one of the workers. Having all files on a worker at the same time would take more than 20 GB, therefore I would like to avoid it.
A worker can run multiple jobs (up to 16) at the same time.
I am writing the system in go.
As long as only the workers initiate the connections you don't have to worry about NAT. gRPC supports streaming in either direction (or both). This means that all of your requirements can be implemented using just one server on the scheduler; there is no need for the scheduler to connect back to the workers.
Given your description your service could look something like this:
syntax = "proto3";

import "google/protobuf/empty.proto";

service Scheduler {
    rpc GetJobs(GetJobsRequest) returns (stream GetJobsResponse) {}
    rpc ReportWorkerStatus(stream ReportWorkerStatusRequest) returns (google.protobuf.Empty) {}
    rpc ReportJobStatus(stream JobStatus) returns (stream JobAction) {}
}

enum JobType {
    JOB_TYPE_UNSPECIFIED = 0;
    JOB_TYPE_CONSOLE = 1;
    JOB_TYPE_EXEC = 2;
}

message GetJobsRequest {
    // List of job types this worker is willing to accept.
    repeated JobType types = 1;
}

message GetJobsResponse {
    string jobId = 1;
    JobType type = 2;
    string fileName = 3;
    bytes fileContent = 4;
    // etc.
}

message ReportWorkerStatusRequest {
    float cpuLoad = 1;
    uint64 availableDiskSpace = 2;
    uint64 availableMemory = 3;
    // etc.
    // List of filenames or file hashes, or whatever else you need to precisely
    // report the presence of files.
    repeated string haveFiles = 4;
}
Much of this is a matter of preference (you can use oneof instead of enums, for instance), but hopefully it's clear that a single connection from client to server is sufficient for your requirements.
Maintaining the set of available workers is quite simple:
func (s *Server) GetJobs(req *pb.GetJobsRequest, stream pb.Scheduler_GetJobsServer) error {
    ctx := stream.Context()

    s.scheduler.AddWorker(req)
    defer s.scheduler.RemoveWorker(req)

    for {
        job, err := s.scheduler.GetJob(ctx, req)
        switch {
        case ctx.Err() != nil: // client disconnected
            return nil
        case err != nil:
            return err
        }

        if err := stream.Send(job); err != nil {
            return err
        }
    }
}
The Basics tutorial includes examples for all types of streaming, including server and client implementations in Go.
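For the worker side, a minimal client sketch could look like the following. The scheduler address, the generated package import path, and the runJob helper are placeholders, not part of the original project:
package main

import (
    "context"
    "io"
    "log"

    "google.golang.org/grpc"

    pb "example.com/judge/proto" // placeholder: the package generated from the .proto above
)

func main() {
    // The worker dials out to the scheduler, so NAT on the worker side is irrelevant.
    conn, err := grpc.Dial("scheduler.example.com:50051", grpc.WithInsecure())
    if err != nil {
        log.Fatalf("dial scheduler: %v", err)
    }
    defer conn.Close()

    client := pb.NewSchedulerClient(conn)

    // Subscribe for jobs; the scheduler pushes them over this stream.
    stream, err := client.GetJobs(context.Background(), &pb.GetJobsRequest{
        Types: []pb.JobType{pb.JobType_JOB_TYPE_EXEC},
    })
    if err != nil {
        log.Fatalf("subscribe for jobs: %v", err)
    }

    for {
        job, err := stream.Recv()
        if err == io.EOF {
            return // scheduler closed the stream
        }
        if err != nil {
            log.Fatalf("receive job: %v", err)
        }
        go runJob(job) // hypothetical helper that executes the job
    }
}

func runJob(job *pb.GetJobsResponse) {
    // Execute the job and report progress back via ReportJobStatus (not shown).
}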
As for registration, that usually just means creating some sort of credential that a worker will use when communicating with the server. This might be a randomly generated token (which the server can use to load associated metadata), or a username/password combination, or a TLS client certificate, or similar. Details will depend on your infrastructure and desired workflow when setting up workers.
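One common shape for that, sketched here under the assumption of a pre-shared token handed to each worker when it is provisioned ("worker-token" is an arbitrary metadata key, not something the library defines):
package main

import (
    "context"

    "google.golang.org/grpc/metadata"
)

// withWorkerToken returns a context carrying the worker's credential as gRPC
// metadata; pass the result as the context of every RPC the worker makes.
func withWorkerToken(ctx context.Context, token string) context.Context {
    return metadata.AppendToOutgoingContext(ctx, "worker-token", token)
}
On the scheduler side an interceptor (or the handler itself) reads that metadata, looks the token up, and rejects unknown workers.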

Invoke kubernetes operator reconcile loop on external resources changes

I'm working on developing a k8s custom resource that, as part of its business logic, needs to reconcile its state when an external Job in the cluster has changed its own state.
Those Jobs aren't created by the custom resource itself but are created externally by a third-party service; however, I need to reconcile the state of the CRO, for example, when any of those external Jobs has finished.
After reading a bunch of documentation, I came up with setting a watcher on the controller to watch Jobs, as in the following example:
func (r *DatasetReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&datasetv1beta1.Dataset{}).
        Watches(&source.Kind{Type: &batchv1.Job{}}, &handler.EnqueueRequestForObject{} /* filter by predicates, see https://pkg.go.dev/sigs.k8s.io/controller-runtime#v0.9.6/pkg/controller#Controller */).
        Complete(r)
}
Now I'm having my reconcile loop triggered for Jobs and for my CRs with the corresponding name and namespace, but I don't know anything about the object kind.
func (r *DatasetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    l := log.FromContext(ctx)
    l.Info("Enter Reconcile loop")
    l.Info("Request", "Req", req)

    // if this is triggered by my CR
    dataset := &datasetv1beta1.Dataset{}
    r.Get(ctx, types.NamespacedName{Name: req.Name, Namespace: req.Namespace}, dataset)

    // whereas when triggered by a Job
    job := &batchv1.Job{}
    r.Get(ctx, types.NamespacedName{Name: req.Name, Namespace: req.Namespace}, job)

    return ctrl.Result{}, nil
}
How can I check the object kind within Reconcile, so that I can retrieve the full object data by calling r.Get?
By design, the event that triggered reconciliation is not passed to the reconciler so that you are forced to define and act on a state instead. This approach is referred to as level-based, as opposed to edge-based.
In your example you have two resources you are trying to keep track of. I would suggest either:
Using ownerReferences or labels if these resources are related. That way you can get all related Datasets for a given Job (or vice versa) and reconcile things that way.
If the two resources are not related, create a separate controller for each resource.
If you want to prevent reconciliation on certain events, you can make use of predicates. From the event in the predicate function you can get the object type with a type assertion such as e.Object.(*core.Pod), for example.
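To make both suggestions concrete, here is a sketch of how SetupWithManager might look with a label-based map function plus a predicate. The label key "dataset" and the module path for the Dataset API are assumptions, not something the question defines:
import (
    batchv1 "k8s.io/api/batch/v1"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/builder"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"
    "sigs.k8s.io/controller-runtime/pkg/source"

    datasetv1beta1 "example.com/dataset-operator/api/v1beta1" // placeholder: use your real module path
)

func (r *DatasetReconciler) SetupWithManager(mgr ctrl.Manager) error {
    // Map a Job event to a reconcile request for the Dataset named in the
    // Job's "dataset" label, so Reconcile only ever sees Dataset names.
    jobToDataset := handler.EnqueueRequestsFromMapFunc(func(obj client.Object) []reconcile.Request {
        name, ok := obj.GetLabels()["dataset"]
        if !ok {
            return nil // not a Job we care about
        }
        return []reconcile.Request{{
            NamespacedName: types.NamespacedName{Name: name, Namespace: obj.GetNamespace()},
        }}
    })

    // Only react to Job updates where the Job has actually finished.
    jobFinished := predicate.Funcs{
        UpdateFunc: func(e event.UpdateEvent) bool {
            job, ok := e.ObjectNew.(*batchv1.Job)
            return ok && job.Status.CompletionTime != nil
        },
    }

    return ctrl.NewControllerManagedBy(mgr).
        For(&datasetv1beta1.Dataset{}).
        Watches(&source.Kind{Type: &batchv1.Job{}}, jobToDataset, builder.WithPredicates(jobFinished)).
        Complete(r)
}
With a mapping like this, every request that reaches Reconcile refers to a Dataset, so the kind ambiguity disappears and r.Get for the Dataset is always the right call.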

Proper logging implementation in Golang package

I have a small Go package which does some work. This work is expected to produce a high number of errors, and this is OK. Currently all errors are ignored. Yes, it may look strange, but visit the link and check the main purpose of the package.
I'd like to extend the functionality of the package and provide the ability to see errors that occur at runtime. But due to my lack of software design skills I have some questions with no answers.
At first, I thought of implementing logging inside the package using an existing logging library (zerolog, zap or whatever else). But will that be OK for the package's users? They might want to use other logging packages and might want to modify the output format.
Maybe it's possible to provide a way for the user to inject their own logger?
I'd like to provide an easily configurable way of logging which can be switched on or off on the user's demand.
Some Go libraries handle logging like this.
In your package, define a logger interface:
type Yourlogging interface {
    Errorf(format string, args ...interface{})
    Warningf(format string, args ...interface{})
    Infof(format string, args ...interface{})
    Debugf(format string, args ...interface{})
}
and define a variable of this interface type, together with a setter:
var mylogger Yourlogging

func SetLogger(l Yourlogging) {
    mylogger = l
}
In your functions, you can then call it for logging:
mylogger.Infof(...)
mylogger.Errorf(...)
You don't need to implement the interface yourself; callers can pass in anything that already implements it.
For example:
SetLogger(logrus.New()) // route the package's logging to logrus (github.com/sirupsen/logrus)
works because *logrus.Logger already provides Errorf, Warningf, Infof and Debugf methods.
In Go, you will see some libraries implement logging interfaces like other answers have suggested. However, you could avoid your package needing to log at all if you structured your application differently.
For example, in the application you linked, your main application runtime calls idleexacts.Run(), which starts this function:
// startLoop starts workload using passed settings and database connection.
func startLoop(ctx context.Context, log log.Logger, pool db.DB, tables []string, jobs uint16, minTime, maxTime time.Duration) error {
    rand.Seed(time.Now().UnixNano())

    // Increment maxTime up to 1 due to rand.Int63n() never return max value.
    maxTime++

    // While running, keep required number of workers using channel.
    // Run new workers only until there is any free slot.
    guard := make(chan struct{}, jobs)
    for {
        select {
        // Run workers only when it's possible to write into channel (channel is limited by number of jobs).
        case guard <- struct{}{}:
            go func() {
                table := selectRandomTable(tables)
                naptime := time.Duration(rand.Int63n(maxTime.Nanoseconds()-minTime.Nanoseconds()) + minTime.Nanoseconds())

                err := startSingleIdleXact(ctx, pool, table, naptime)
                if err != nil {
                    log.Warnf("start idle xact failed: %s", err)
                }

                // When worker finishes, read from the channel to allow starting another worker.
                <-guard
            }()
        case <-ctx.Done():
            return nil
        }
    }
}
The problem here is all of the orchestration of your logic is happening inside of your packages. Instead, this loop should be running in your main application, and this package should provide users with simple actions such as selectRandomTable() or createTempTable().
If the orchestration of the code were in your main application and the package only provided simple actions, it would be much easier to return errors to the user as part of the function calls.
It would also make your package easier for others to reuse, because it would offer simple actions that users could combine in ways you did not intend.
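To make that concrete, here is a rough sketch of what such a package surface could look like; the exported names and the use of *sql.DB are illustrative assumptions, not the actual API of the linked project:
package idleexacts

import (
    "context"
    "database/sql"
    "errors"
    "math/rand"
    "time"
)

// SelectRandomTable picks one of the candidate tables. It returns an error
// instead of logging, so the caller decides what to do about failures.
func SelectRandomTable(tables []string) (string, error) {
    if len(tables) == 0 {
        return "", errors.New("no tables to choose from")
    }
    return tables[rand.Intn(len(tables))], nil
}

// StartIdleXact opens a transaction on table (taken from a trusted list),
// keeps it idle for naptime, then rolls it back. Any failure surfaces to
// the caller as an error instead of being logged inside the package.
func StartIdleXact(ctx context.Context, db *sql.DB, table string, naptime time.Duration) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    if _, err := tx.ExecContext(ctx, "SELECT * FROM "+table+" LIMIT 1"); err != nil {
        return err
    }

    select {
    case <-time.After(naptime):
    case <-ctx.Done():
    }
    return nil
}
The worker loop with the guard channel would then live in the caller's main package, which is also where logging of the returned errors would happen.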

Go HTTP listener with data update every second

I'm trying to build a little website in Go containing a report based on data collected from a web service. It uses an API to query the service for data. The service can only be queried once every few seconds.
However, I have to query it a number of times to get the complete report data. Right now I just hammer the API to update my whole data structure each time the http handler (http.HandleFunc) is called. Of course, this is bad because it triggers lots of queries to the external API, which are throttled. So my report comes up very, very, very slowly.
What I want to do instead is to have a function updateReportData that runs in a non-blocking way and stores that data in some variable that the http.HandleFunc() handler can just ingest without calling the external API.
But I'm very new to Go (and things like closures, semaphores, concurrency, etc.), so I'm not really sure how to build this. Should I be using channels? Should I use timers? How can I get updateReportData to not block http.HandleFunc, but still run on a fixed interval?
To sum up, I want a background routine to update a data structure on a fixed interval, and I want http.HandleFunc to serve whatever data is in that structure any time I make an HTTP request to the program. I just have no idea how to start. Any advice would be appreciated.
There are a few things you need to do:
Create a background service that polls for the data. This service can run as a goroutine that periodically checks for new data.
Create a channel that the background service uses to send new data to a centralized place to store the values. The background service should write data to this channel whenever it finds something new. Another option would be to protect the centralized data store with a mutex. Depending on the way the data is written and read, one option will be a better choice; a channel-based variant is sketched after the mutex example below.
Create a HTTP handler that returns the current contents of the centralized data store.
Here is a simplified example showing how to use a goroutine and a sync.RWMutex to accomplish what you want:
package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

var (
    timeSumsMu sync.RWMutex
    timeSums   int64
)

func main() {
    // Start the goroutine that will sum the current time
    // once per second.
    go runDataLoop()

    // Create a handler that will read-lock the mutex and
    // write the summed time to the client.
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        timeSumsMu.RLock()
        defer timeSumsMu.RUnlock()
        fmt.Fprint(w, timeSums)
    })

    http.ListenAndServe(":8080", nil)
}

func runDataLoop() {
    for {
        // Within an infinite loop, lock the mutex and
        // increment our value, then sleep for 1 second until
        // the next time we need to get a value.
        timeSumsMu.Lock()
        timeSums += time.Now().Unix()
        timeSumsMu.Unlock()

        time.Sleep(1 * time.Second)
    }
}
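The same thing can be done with a channel instead of a mutex, as mentioned in the steps above. A minimal sketch: the updater goroutine owns the value outright, and handlers ask it for the latest snapshot through a request channel.
package main

import (
    "fmt"
    "net/http"
    "time"
)

// reads is how HTTP handlers ask the owning goroutine for the current value.
var reads = make(chan chan int64)

func main() {
    go runDataLoop()

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        reply := make(chan int64)
        reads <- reply         // ask the owner for the current value
        fmt.Fprint(w, <-reply) // and wait for the answer
    })

    http.ListenAndServe(":8080", nil)
}

// runDataLoop owns timeSums; no other goroutine touches it, so no mutex is needed.
func runDataLoop() {
    var timeSums int64
    ticker := time.NewTicker(1 * time.Second)
    for {
        select {
        case <-ticker.C:
            timeSums += time.Now().Unix() // periodic update
        case reply := <-reads:
            reply <- timeSums // serve the latest value to a handler
        }
    }
}
Which variant is better depends on the access pattern: the RWMutex version allows many concurrent readers, while the channel version keeps all access serialized in one goroutine.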

How frequently should I be calling sql.Open in my program?

As the title says, I don't know whether having multiple sql.Open statements is a good or a bad thing, or whether I should instead have a file with just an init that is something like:
var db *sql.DB

func init() {
    var err error
    db, err = sql.Open
}
just wondering what the best practice would be. Thanks!
You should at least check the error.
As mentioned in "Connecting to a database":
Note that Open does not directly open a database connection: this is deferred until a query is made. To verify that a connection can be made before making a query, use the Ping function:
if err := db.Ping(); err != nil {
    log.Fatal(err)
}
After use, the database is closed using Close.
If possible, limit the number of opened connection to a database to a minimum.
See "Go/Golang sql.DB reuse in functions":
You shouldn't need to open database connections all over the place.
The database/sql package does connection pooling internally, opening and closing connections as needed, while providing the illusion of a single connection that can be used concurrently.
As elithrar points out in the comments, database/sql#Open does mention:
The returned DB is safe for concurrent use by multiple goroutines and maintains its own pool of idle connections.
Thus, the Open function should be called just once.
It is rarely necessary to close a DB.
As mentioned here, declaring *sql.DB globally also has some additional benefits, such as SetMaxIdleConns (regulating connection pool size) or preparing SQL statements across your application.
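Putting those pieces together, a sketch of the usual pattern: open once, verify with Ping, and tune the pool in one place. The driver name, DSN and pool limits below are placeholders:
package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

var db *sql.DB

func main() {
    var err error
    // Open is called exactly once; the returned *sql.DB is a connection pool
    // that is safe for concurrent use by every goroutine in the program.
    db, err = sql.Open("mysql", "user:pass@tcp(localhost:3306)/mydb?parseTime=true")
    if err != nil {
        log.Fatal(err)
    }

    // Verify that a connection can actually be established.
    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }

    // Regulate the pool from a single place.
    db.SetMaxOpenConns(25)
    db.SetMaxIdleConns(5)
    db.SetConnMaxLifetime(5 * time.Minute)

    // ... run the application; the same db is shared everywhere ...
}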
You can use a function init, which will run even if you don't have a main():
var db *sql.DB
var err error

func init() {
    db, err = sql.Open(DBparms....)
}
init() is always called, regardless of whether there is a main or not, so if you import a package that has an init function, it will be executed.
You can have multiple init() functions per package; they will be executed in the order they show up in the code (after all package-level variables are initialized, of course).
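A tiny illustration of those rules (the package and variable names are made up):
package config

import "log"

// Package-level variables are initialized before any init function runs.
var defaults = loadDefaults()

func loadDefaults() map[string]string {
    return map[string]string{"env": "dev"}
}

// Multiple init functions are allowed; they run in source order,
// after defaults has been set, whenever this package is imported.
func init() { log.Println("first init, defaults:", defaults) }
func init() { log.Println("second init") }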
