Prometheus Exporter - Direct Instrumentation vs Custom Collector - go

I'm currently writing a Prometheus exporter for a telemetry network application.
I've read the Writing Exporters documentation, and while I understand the use case for implementing a custom collector to avoid race conditions, I'm not sure whether my use case fits direct instrumentation.
Basically, the network metrics are streamed via gRPC by the network devices, so my exporter just receives them and doesn't have to actively scrape the devices itself.
I've used direct instrumentation with the code below.
I declare my metrics using the promauto package to keep the code compact:
package metrics

import (
	"github.com/lucabrasi83/prom-high-obs/proto/telemetry"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	cpu5Sec = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "cisco_iosxe_iosd_cpu_busy_5_sec_percentage",
			Help: "The IOSd daemon CPU busy percentage over the last 5 seconds",
		},
		[]string{"node"},
	)
)
Below is how I set the metric value from the decoded gRPC protocol buffer message:
cpu5Sec.WithLabelValues(msg.GetNodeIdStr()).Set(float64(val))
Finally, here is my main loop, which handles the telemetry gRPC streams for the metrics I'm interested in:
for {
	req, err := stream.Recv()
	if err == io.EOF {
		return nil
	}
	if err != nil {
		logging.PeppaMonLog(
			"error",
			fmt.Sprintf("Error while reading client %v stream: %v", clientIPSocket, err))

		return err
	}

	data := req.GetData()

	msg := &telemetry.Telemetry{}

	err = proto.Unmarshal(data, msg)
	if err != nil {
		log.Fatalln(err)
	}

	if !logFlag {
		logging.PeppaMonLog(
			"info",
			fmt.Sprintf(
				"Telemetry Subscription Request Received - Client %v - Node %v - YANG Model Path %v",
				clientIPSocket, msg.GetNodeIdStr(), msg.GetEncodingPath(),
			),
		)
	}
	logFlag = true

	// Flag to determine whether the telemetry device streams an accepted YANG node path.
	yangPathSupported := false

	for _, m := range metrics.CiscoMetricRegistrar {
		if msg.EncodingPath == m.EncodingPath {
			yangPathSupported = true

			go m.RecordMetricFunc(msg)
		}
	}
}
For each metric I'm interested in, I register it with a record metric function (m.RecordMetricFunc) that takes the protocol buffer message as an argument, as per below.
package metrics

import "github.com/lucabrasi83/prom-high-obs/proto/telemetry"

var CiscoMetricRegistrar []CiscoTelemetryMetric

type CiscoTelemetryMetric struct {
	EncodingPath     string
	RecordMetricFunc func(msg *telemetry.Telemetry)
}
I then use an init function for the actual registration:
func init() {
	CiscoMetricRegistrar = append(CiscoMetricRegistrar, CiscoTelemetryMetric{
		EncodingPath:     CpuYANGEncodingPath,
		RecordMetricFunc: ParsePBMsgCpuBusyPercent,
	})
}
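For reference, a record function in this style might look like the following. The actual field extraction isn't shown in the question, so extractCpu5Sec is a hypothetical helper:
func ParsePBMsgCpuBusyPercent(msg *telemetry.Telemetry) {
	// Hypothetical helper standing in for the real protobuf field walking.
	val := extractCpu5Sec(msg)

	cpu5Sec.WithLabelValues(msg.GetNodeIdStr()).Set(float64(val))
}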
I'm using Grafana as the frontend, and so far I haven't seen any particular discrepancy when correlating the Prometheus-exposed metrics against the metrics checked directly on the device.
So I would like to understand whether this follows Prometheus best practices, or whether I should still go down the custom collector route.
Thanks in advance.

You are not following best practices because you are using the global metrics that the article you linked to cautions against. With your current implementation your dashboard will forever show some arbitrary and constant value for the CPU metric after a device disconnects (or, more precisely, until your exporter is restarted).
Instead, the RPC method should maintain a set of local metrics and remove them once the method returns. That way the device's metrics vanish from the scrape output when it disconnects.
Here is one approach to do this. It uses a map that contains currently active metrics. Each map element is the set of metrics for one particular stream (which I understand corresponds to one device). Once the stream ends, that entry is removed.
package main

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// Exporter is a prometheus.Collector implementation.
type Exporter struct {
	// We need some way to map gRPC streams to their metrics. Using the stream
	// itself as a map key is simple enough, but anything works as long as we
	// can remove metrics once the stream ends.
	sync.Mutex
	Metrics map[StreamServer]*DeviceMetrics
}

type DeviceMetrics struct {
	sync.Mutex
	CPU prometheus.Metric
}

// Globally defined descriptions are fine.
var cpu5SecDesc = prometheus.NewDesc(
	"cisco_iosxe_iosd_cpu_busy_5_sec_percentage",
	"The IOSd daemon CPU busy percentage over the last 5 seconds",
	[]string{"node"},
	nil, // constant labels
)

// Collect implements prometheus.Collector.
func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
	// Copy the current metrics so we don't hold the lock for long if ch's
	// consumer is slow.
	var metrics []prometheus.Metric

	e.Lock()
	for _, deviceMetrics := range e.Metrics {
		deviceMetrics.Lock()
		metrics = append(metrics,
			deviceMetrics.CPU,
		)
		deviceMetrics.Unlock()
	}
	e.Unlock()

	for _, m := range metrics {
		if m != nil {
			ch <- m
		}
	}
}

// Describe implements prometheus.Collector.
func (e *Exporter) Describe(ch chan<- *prometheus.Desc) {
	ch <- cpu5SecDesc
}
// Service is the gRPC service implementation.
type Service struct {
	exp *Exporter
}

func (s *Service) RPCMethod(stream StreamServer) (*Response, error) {
	deviceMetrics := new(DeviceMetrics)

	s.exp.Lock()
	s.exp.Metrics[stream] = deviceMetrics
	s.exp.Unlock()

	defer func() {
		// Stop emitting metrics for this stream.
		s.exp.Lock()
		delete(s.exp.Metrics, stream)
		s.exp.Unlock()
	}()

	for {
		req, err := stream.Recv()
		if err != nil {
			// TODO: distinguish io.EOF (clean disconnect) from real errors.
			return nil, err
		}

		var msg *Telemetry = parseRequest(req) // Your existing code that unmarshals the nested message.

		var (
			metricField *prometheus.Metric
			metric      prometheus.Metric
		)

		switch msg.GetEncodingPath() {
		case CpuYANGEncodingPath:
			metricField = &deviceMetrics.CPU
			metric = prometheus.MustNewConstMetric(
				cpu5SecDesc,
				prometheus.GaugeValue,
				ParsePBMsgCpuBusyPercent(msg), // func(*Telemetry) float64
				msg.GetNodeIdStr(),            // label values are positional; no label name here
			)
		default:
			continue
		}

		deviceMetrics.Lock()
		*metricField = metric
		deviceMetrics.Unlock()
	}
}
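For completeness, a minimal sketch of wiring the collector up at startup (the promhttp handler and the listen address are the usual defaults, not something from the question):
func main() {
	exp := &Exporter{Metrics: make(map[StreamServer]*DeviceMetrics)}
	prometheus.MustRegister(exp)

	// Serve the scrape endpoint alongside the gRPC server.
	// promhttp is github.com/prometheus/client_golang/prometheus/promhttp.
	http.Handle("/metrics", promhttp.Handler())
	// ... start the gRPC server with &Service{exp: exp}, then:
	log.Fatal(http.ListenAndServe(":9090", nil))
}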

Related

Azure DevOps Rate Limit

The goal is to retrieve Azure DevOps users with their license and project entitlements in Go.
I'm using the Microsoft SDK.
Our Azure DevOps organization has more than 1500 users, so when I request each user's entitlements I get an error caused by the Azure DevOps rate limit => 443: read: connection reset by peer
However, limiting top to 100/200 does the job, of course.
For a real solution, I thought about no longer using the SDK and making direct REST API calls with a custom HTTP handler which would support rate limiting. Or maybe using heimdall.
What is your advice for a good design?
Thanks.
Here is the code:
package main

import (
	"context"
	"fmt"
	"log"
	"runtime"
	"sync"
	"time"

	"github.com/microsoft/azure-devops-go-api/azuredevops"
	"github.com/microsoft/azure-devops-go-api/azuredevops/memberentitlementmanagement"
)

var organizationUrl = "https://dev.azure.com/xxx"
var personalAccessToken = "xxx"

type User struct {
	DisplayName         string
	MailAddress         string
	PrincipalName       string
	LicenseDisplayName  string
	Status              string
	GroupAssignments    string
	ProjectEntitlements []string
	LastAccessedDate    azuredevops.Time
	DateCreated         azuredevops.Time
}

func init() {
	runtime.GOMAXPROCS(runtime.NumCPU()) // Try to use all available CPUs.
}

func main() {
	// Time measure
	defer timeTrack(time.Now(), "Fetching Azure DevOps Users License and Projects")

	// Compute context
	fmt.Println("Version", runtime.Version())
	fmt.Println("NumCPU", runtime.NumCPU())
	fmt.Println("GOMAXPROCS", runtime.GOMAXPROCS(0))
	fmt.Println("Starting concurrent calls...")

	// Create a connection to your organization
	connection := azuredevops.NewPatConnection(organizationUrl, personalAccessToken)

	// New context
	ctx := context.Background()

	// Create a member client
	memberClient, err := memberentitlementmanagement.NewClient(ctx, connection)
	if err != nil {
		log.Fatal(err)
	}
	// Request all users
	top := 10000
	skip := 0
	filter := "Id"

	response, err := memberClient.GetUserEntitlements(ctx, memberentitlementmanagement.GetUserEntitlementsArgs{
		Top:        &top,
		Skip:       &skip,
		Filter:     &filter,
		SortOption: nil,
	})
	if err != nil {
		log.Fatal(err)
	}

	usersLen := len(*response.Members)
	allUsers := make(chan User, usersLen)

	var wg sync.WaitGroup
	wg.Add(usersLen)

	for _, user := range *response.Members {
		go func(user memberentitlementmanagement.UserEntitlement) {
			defer wg.Done()

			var userEntitlement = memberentitlementmanagement.GetUserEntitlementArgs{UserId: user.Id}
			account, err := memberClient.GetUserEntitlement(ctx, userEntitlement)
			if err != nil {
				log.Fatal(err)
			}

			var GroupAssignments string
			var ProjectEntitlements []string

			for _, assignment := range *account.GroupAssignments {
				GroupAssignments = *assignment.Group.DisplayName
			}

			for _, userProject := range *account.ProjectEntitlements {
				ProjectEntitlements = append(ProjectEntitlements, *userProject.ProjectRef.Name)
			}

			allUsers <- User{
				DisplayName:         *account.User.DisplayName,
				MailAddress:         *account.User.MailAddress,
				PrincipalName:       *account.User.PrincipalName,
				LicenseDisplayName:  *account.AccessLevel.LicenseDisplayName,
				DateCreated:         *account.DateCreated,
				LastAccessedDate:    *account.LastAccessedDate,
				GroupAssignments:    GroupAssignments,
				ProjectEntitlements: ProjectEntitlements,
			}
		}(user)
	}

	wg.Wait()
	close(allUsers)

	for eachUser := range allUsers {
		fmt.Println(eachUser)
	}
}

func timeTrack(start time.Time, name string) {
	elapsed := time.Since(start)
	log.Printf("%s took %s", name, elapsed)
}
You can write a custom version of the GetUserEntitlement function.
https://github.com/microsoft/azure-devops-go-api/blob/dev/azuredevops/memberentitlementmanagement/client.go#L297-L314
It does not use any private members.
After getting the http.Response you can check the Retry-After header and delay the next loop iteration if it is present.
https://github.com/microsoft/azure-devops-go-api/blob/dev/azuredevops/memberentitlementmanagement/client.go#L306
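A minimal sketch of that idea, where sendRequest is a hypothetical stand-in for such a hand-rolled call (the real one would build the request the same way the SDK's client.go does):
package main

import (
	"net/http"
	"strconv"
	"time"
)

// fetchSequentially calls a hypothetical sendRequest for each user ID and
// honors the Retry-After header before moving on to the next request.
func fetchSequentially(ids []string, sendRequest func(id string) (*http.Response, error)) error {
	for _, id := range ids {
		resp, err := sendRequest(id)
		if err != nil {
			return err
		}
		resp.Body.Close() // the real code would decode the body first

		// Azure DevOps signals throttling with a Retry-After header (seconds).
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil {
				time.Sleep(time.Duration(secs) * time.Second)
			}
		}
	}
	return nil
}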
P.S. The concurrency in your code is redundant and can be removed.
Update - explaining concurrency issue:
You cannot easily implement rate limiting in concurrent code. It will be much simpler if you execute all requests sequentially and check the Retry-After header in every response before moving on to the next one.
With parallel execution: 1) you cannot rely on the Retry-After header value, because another request executing at the same time may return a different value; 2) you cannot apply the delay to other requests, because some of them are already in progress.
For a real solution, I thought about no longer using the SDK and making direct REST API calls with a custom HTTP handler which would support rate limiting. Or maybe using heimdall.
Do you mean you want to avoid the rate limit by using the REST API directly?
If so, your idea will not work.
Most REST APIs are accessed through client libraries, and if you're using an SDK based on a REST API, or anything else based on a REST API, it will of course hit the same rate limit.
Since the rate limit is applied per user, I suggest you spread your operations across multiple users (provided your request volume is not so high that the server blocks your IP).
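A rough sketch of spreading the per-user calls across a couple of PAT connections (the environment variable names are made up, and this reuses ctx, organizationUrl, and response from the question's code):
pats := []string{os.Getenv("AZDO_PAT_USER1"), os.Getenv("AZDO_PAT_USER2")}

clients := make([]memberentitlementmanagement.Client, 0, len(pats))
for _, pat := range pats {
	conn := azuredevops.NewPatConnection(organizationUrl, pat)
	c, err := memberentitlementmanagement.NewClient(ctx, conn)
	if err != nil {
		log.Fatal(err)
	}
	clients = append(clients, c)
}

// Then, in the loop over *response.Members, use clients[i%len(clients)]
// instead of the single memberClient so each identity stays under its own limit.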

How to handle GRPC Golang High CPU Usage

We are seeing suspiciously high CPU usage in our Golang function that uses gRPC to stream our transactions. The function is simple: when we get a request for ORDER ID data changes from the frontend, we consume the change and stream it back.
Here is the code:
func (consumer OrderChangesConsumer) Serve(message string) {
	response := messages.OrderChanges{}
	if err := json.Unmarshal([]byte(message), &response); err != nil {
		logData := map[string]interface{}{
			"message": message,
		}
		seelog.Error(commonServices.GenerateLog("parse_message_error", err.Error(), &logData))
	}

	if response.OrderID > 0 {
		services.PublishChanges(response.OrderID, &response)
	}
}

// PublishChanges sends the order change message to the changes channel.
func PublishChanges(orderID int, orderChanges *messages.OrderChanges) {
	orderMutex.RLock()
	defer orderMutex.RUnlock()

	orderChan, ok := orderChans[orderID]
	if !ok {
		return
	}

	orderChan <- orderChanges
}
How can we improve this, and what is the best practice for this case?
I would update your PublishChanges code to the following, so the read lock is released before the potentially blocking channel send, and see if that helps:
// PublishChanges sends the order change message to the changes channel.
func PublishChanges(orderID int, orderChanges *messages.OrderChanges) {
	orderMutex.RLock()
	orderChan, ok := orderChans[orderID]
	orderMutex.RUnlock()

	if !ok {
		return
	}

	orderChan <- orderChanges
}
You might also want to consider using sync.Map for an easier-to-use concurrent map.
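A minimal sketch of what that could look like, assuming orderChans becomes a sync.Map instead of a mutex-guarded map:
var orderChans sync.Map // effectively map[int]chan *messages.OrderChanges

// PublishChanges sends the order change message to the changes channel.
func PublishChanges(orderID int, orderChanges *messages.OrderChanges) {
	v, ok := orderChans.Load(orderID)
	if !ok {
		return
	}
	v.(chan *messages.OrderChanges) <- orderChanges
}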

How can I scale sarama consumer group in kubernetes deployment?

I am trying to have some consumers process messages from Kafka, and I would like to implement Kubernetes deployment scalability for elastic message-processing capability.
I found this code in the sarama official guide https://pkg.go.dev/github.com/Shopify/sarama#NewConsumerGroup (note the example lives inside the sarama package, so identifiers like NewConsumerGroup and ConsumerGroupSession appear unqualified):
package main

import (
	"context"
	"fmt"
)

type exampleConsumerGroupHandler struct{}

func (exampleConsumerGroupHandler) Setup(_ ConsumerGroupSession) error   { return nil }
func (exampleConsumerGroupHandler) Cleanup(_ ConsumerGroupSession) error { return nil }
func (h exampleConsumerGroupHandler) ConsumeClaim(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		fmt.Printf("Message topic:%q partition:%d offset:%d\n", msg.Topic, msg.Partition, msg.Offset)
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	config := NewTestConfig()
	config.Version = V2_0_0_0 // specify appropriate version
	config.Consumer.Return.Errors = true

	group, err := NewConsumerGroup([]string{"localhost:9092"}, "my-group", config)
	if err != nil {
		panic(err)
	}
	defer func() { _ = group.Close() }()

	// Track errors
	go func() {
		for err := range group.Errors() {
			fmt.Println("ERROR", err)
		}
	}()

	// Iterate over consumer sessions.
	ctx := context.Background()
	for {
		topics := []string{"my-topic"}
		handler := exampleConsumerGroupHandler{}

		// `Consume` should be called inside an infinite loop; when a
		// server-side rebalance happens, the consumer session will need to be
		// recreated to get the new claims.
		err := group.Consume(ctx, topics, handler)
		if err != nil {
			panic(err)
		}
	}
}
I have some questions:
How do I set the number of consumers in a consumer group?
If I deploy this program in a Pod, can I scale it safely? I mean, assume one instance is running and I scale the replicas from 1 to 2: will another NewConsumerGroup call with the same group ID work perfectly without conflict?
Thank you in advance.
NOTE: I am using Kafka 2.8, and I heard that the sarama_cluster package is DEPRECATED.
Reminder that groups cannot scale beyond the topic's partition count.
Scaling the pods is the correct way to use consumer groups, and using the same group name is correct. However, I'd recommend extracting that and the broker address to environment variables so they can easily be changed at deploy time, as sketched below.
As-is, the containerized code would be unable to use localhost as the Kafka connection string, since inside the container that refers to the pod itself.
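A minimal sketch of that change (the environment variable names are assumptions):
package main

import (
	"os"
	"strings"

	"github.com/Shopify/sarama"
)

func newGroupFromEnv(config *sarama.Config) (sarama.ConsumerGroup, error) {
	// e.g. KAFKA_BROKERS="kafka-0:9092,kafka-1:9092", KAFKA_GROUP_ID="my-group"
	brokers := strings.Split(os.Getenv("KAFKA_BROKERS"), ",")
	groupID := os.Getenv("KAFKA_GROUP_ID")
	return sarama.NewConsumerGroup(brokers, groupID, config)
}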

Rate limit function 40/second with "golang.org/x/time/rate"

I'm trying to use "golang.org/x/time/rate" to build a function which blocks until a token is free. Is this the correct way to use the library to rate limit blocks of code to 40 requests per second, with a bucket size of 2?
type Client struct {
	limiter *rate.Limiter
	ctx     context.Context
}

func NewClient() *Client {
	c := Client{}
	c.limiter = rate.NewLimiter(40, 2)
	c.ctx = context.Background()
	return &c
}

func (client *Client) RateLimitFunc() {
	err := client.limiter.Wait(client.ctx)
	if err != nil {
		fmt.Printf("rate limit error: %v", err)
	}
}
To rate limit a block of code I call
RateLimitFunc()
I don't want to use a ticker as I want the rate limiter to take into account the length of time the calling code runs for.
Reading the documentation (link), you can see that the first parameter to NewLimiter is of type rate.Limit.
If you want 40 requests per second, that translates into a rate of 1 request every 25 ms.
You can create that by doing:
limiter := rate.NewLimiter(rate.Every(25 * time.Millisecond), 2)
(Since rate.Limit counts events per second, rate.NewLimiter(40, 2) already expresses the same rate; rate.Every just makes the interval explicit.)
Side note:
In general, a context, ctx, should not be stored on a struct and should be per request. It would appear that Client will be reused, so you could pass a context to RateLimitFunc() or wherever appropriate instead of storing a single context on the Client struct.
// RateLimit blocks until the shared limiter allows another event. The
// limiter must be created once and reused; constructing a new one on
// every call would never actually limit anything.
func RateLimit(ctx context.Context, limiter *rate.Limiter) {
	if err := limiter.Wait(ctx); err != nil {
		// Log the error and return
		return
	}
	// Do the actual work here
}
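Usage could then look something like this (a sketch; requests and process are placeholders, not from the question):
limiter := rate.NewLimiter(rate.Every(25*time.Millisecond), 2)

for _, req := range requests {
	RateLimit(ctx, limiter)
	process(req)
}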
As Zak said, do not store a Context inside a struct type, as explained in the Go documentation for the context package.

Syncing websocket loops with channels in Golang

I'm facing a dilemma here trying to keep certain websockets in sync for a given user. Here's the basic setup:
type msg struct {
	Key   string
	Value string
}

type connStruct struct {
	//...
	ConnRoutineChans []*chan string
	LoggedIn         bool
	Login            string
	//...
	Sockets []*websocket.Conn
}

var (
	//...
	/* LIST OF CONNECTED USERS AND THEIR IP ADDRESSES */
	guestMap sync.Map
)

func main() {
	post("Started...")
	rand.Seed(time.Now().UTC().UnixNano())
	http.HandleFunc("/wss", wsHandler)
	panic(http.ListenAndServeTLS("...", "...", "...", nil))
}
func wsHandler(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Origin")+":8080" != "https://...:8080" {
		http.Error(w, "Origin not allowed", 403)
		fmt.Println("Client origin not allowed! (https://" + r.Host + ")")
		fmt.Println("r.Header Origin: " + r.Header.Get("Origin"))
		return
	}
	///
	conn, err := websocket.Upgrade(w, r, w.Header(), 1024, 1024)
	if err != nil {
		http.Error(w, "Could not open websocket connection", http.StatusBadRequest)
		fmt.Println("Could not open websocket connection with client!")
	}

	// ADD CONNECTION TO guestMap IF CONNECTION IS nil
	var authString string = /*gets device identity*/
	var authChan chan string = make(chan string)

	authValue, authOK := guestMap.Load(authString)

	if !authOK {
		// NO SESSION, CREATE A NEW ONE
		newSession := getSession()
		//defer newSession.Close()
		guestMap.Store(authString, connStruct{LoggedIn: false,
			ConnRoutineChans: []*chan string{&authChan},
			Login:            "",
			Sockets:          []*websocket.Conn{conn},
			/* .... */})
	} else {
		// SESSION STARTED, ADD NEW SOCKET TO Sockets
		var tempConn connStruct = authValue.(connStruct)
		tempConn.Sockets = append(tempConn.Sockets, conn)
		tempConn.ConnRoutineChans = append(tempConn.ConnRoutineChans, &authChan)
		guestMap.Store(authString, tempConn)
	}
	//
	go echo(conn, authString, &authChan)
}
func echo(conn *websocket.Conn, authString string, authChan *chan string) {
	var message msg

	// TEST CHANNEL
	authValue, _ := guestMap.Load(authString)
	go sendToChans(authValue.(connStruct).ConnRoutineChans, "sup dude?")
	fmt.Println("got past send...")

	for true {
		select {
		case val := <-*authChan:
			// use value of channel
			fmt.Println("AuthChan for user #"+strconv.Itoa(myConnNumb)+" spat out: ", val)
		default:
			// if channels are empty, this is executed
		}

		readError := conn.ReadJSON(&message)
		fmt.Println("got past readJson...")
		if readError != nil || message.Key == "" {
			// DISCONNECT USER
			//.....
			return
		}
		//
		_key, _value := chief(message.Key, message.Value, &*conn, browserAndOS, authString)
		if writeError := conn.WriteJSON(_key + "|" + _value); writeError != nil {
			//...
			return
		}
		fmt.Println("got past writeJson...")
	}
}

func sendToChans(chans []*chan string, message string) {
	for i := 0; i < len(chans); i++ {
		*chans[i] <- message
	}
}
I know, a big block of code eh? And I commented out most of it...
Anyway, if you've ever used websockets, most of it should be quite familiar:
1) func wsHandler() fires every time a user connects. It makes an entry in guestMap (for each unique device that connects), which holds a connStruct that holds a list of channels: ConnRoutineChans []*chan string. This all gets passed to:
2) echo(), which is a goroutine that constantly runs for each websocket connection. Here I'm just testing out sending a message to the other running goroutines, but it seems my for loop isn't actually firing constantly. It only fires when the websocket receives a message from the open tab/window it's connected to. (If anyone can clarify this mechanic, I'd love to know why it's not looping constantly?)
3) For each window or tab that the user has open on a given device, there is a websocket and a channel stored in arrays. I want to be able to send a message to all the channels in the array (essentially the other goroutines for open tabs/windows on that device) and receive the message in the other goroutines to change some variables set in the constantly running goroutine.
What I have right now works only for the very first connection on a device, and (of course) it sends "sup dude?" to itself, since it's the only channel in the array at the time. Then if you open a new tab (or even many), the message doesn't get sent to anyone at all! Strange?... Then when I close all the tabs (and my commented-out logic removes the device item from guestMap) and start up a new device session, still only the first connection gets its own message.
I already have a method for sending a message to all the other websockets on a device, but sending to a goroutine seems to be a little more tricky than I thought.
To answer my own question:
First, I've switched from a sync.Map to a normal map. Secondly, so that nobody reads from and writes to it at the same time, I've made a channel that you go through to do any read/write operation on the map. I've been trying my best to keep the data access and manipulation quick to execute so the channel doesn't get crowded too easily. Here's a small example of that:
package main

import (
	"fmt"
)

var (
	guestMap           map[string]*guestStruct = make(map[string]*guestStruct)
	guestMapActionChan                         = make(chan actionStruct)
)

type actionStruct struct {
	Action     func([]interface{}) []interface{}
	Params     []interface{}
	ReturnChan chan []interface{}
}

type guestStruct struct {
	Name string
	Numb int
}

func main() {
	// make chan listener
	go guestMapActionChanListener(guestMapActionChan)

	// some guest logs in...
	newGuest := guestStruct{Name: "Larry Josher", Numb: 1337}

	// add to the map
	addRetChan := make(chan []interface{})
	guestMapActionChan <- actionStruct{Action: guestMapAdd,
		Params:     []interface{}{&newGuest},
		ReturnChan: addRetChan}
	addReturned := <-addRetChan
	fmt.Println(addReturned)
	fmt.Println("Also, numb was changed by listener to:", newGuest.Numb)

	// Same kind of thing for removing, except (of course) there's
	// a lot more logic to a real-life application.
}

func guestMapActionChanListener(c chan actionStruct) {
	for {
		value := <-c
		//
		returned := value.Action(value.Params)
		value.ReturnChan <- returned
		close(value.ReturnChan)
	}
}

func guestMapAdd(params []interface{}) []interface{} {
	// .. do some parameter verification checks
	theStruct := params[0].(*guestStruct)
	name := theStruct.Name
	theStruct.Numb = 75
	guestMap[name] = &*theStruct
	return []interface{}{"Added '" + name + "' to the guestMap"}
}
For communication between connections, I just have each socket loop hold onto its guestStruct, and I have more guestMapActionChan functions that take care of distributing data to other guests' guestStructs (see the removal sketch below).
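For instance, the removal mentioned in the code comment above could go through the same channel with a companion action (a hypothetical sketch in the same style as guestMapAdd):
func guestMapRemove(params []interface{}) []interface{} {
	// .. parameter verification checks would go here too
	name := params[0].(string)
	delete(guestMap, name)
	return []interface{}{"Removed '" + name + "' from the guestMap"}
}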
Now, I'm not going to mark this as the correct answer unless I get some better suggestions on how to do something like this the right way. But for now this is working and should guarantee no races for reading/writing to the map.
Edit: The correct approach should really have been to just use a sync.Mutex like I do in the (mostly) finished project GopherGameServer
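For reference, a minimal sketch of that mutex-based approach (addGuest and removeGuest are made-up names; types reused from the example above):
var (
	guestMapMu sync.Mutex
	guestMap   = make(map[string]*guestStruct)
)

func addGuest(g *guestStruct) {
	guestMapMu.Lock()
	defer guestMapMu.Unlock()
	guestMap[g.Name] = g
}

func removeGuest(name string) {
	guestMapMu.Lock()
	defer guestMapMu.Unlock()
	delete(guestMap, name)
}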
