So I'm building a small utility that listens on a socket and stores incoming messages as structs in a slice:
var points []Point

type Point struct {
    time time.Time
    x    float64
    y    float64
}

func main() {
    points = make([]Point, 0)
    l, err := net.Listen("tcp4", ":8900")
    // (...)
}
func processIncomingData(data string) {
    // Parse incoming data that comes as: xValue,yValue
    inData := strings.Split(data, ",")
    x, err := strconv.ParseFloat(inData[0], 64)
    if err != nil {
        fmt.Println(err)
    }
    y, err := strconv.ParseFloat(inData[1], 64)
    if err != nil {
        fmt.Println(err)
    }
    // Store the new Point
    points = append(points, Point{
        time: time.Now(),
        x:    x,
        y:    y,
    })
    // Remove points older than 1h ?
}
Now, as you might imagine, this will quickly fill my RAM. What's the best (fastest-executing) way to remove points older than 1h after appending each new one? I'll be getting new points 10-15 times per second.
Thank you.
An approach I've used several times is to start a goroutine early in the project that looks something like this:
go cleanup()

...

func cleanup() {
    for {
        time.Sleep(...)
        // do cleanup
    }
}
Then what you could do is iterate over points using time.Since(point.time) to figure out how old each piece of data is. If it's too old, there's a slice trick to remove an item from a slice given its position:
points = append(points[:i], points[i+1:]...)
(where i is the index to remove)
Because the points are in the slice in order of the time they were added, you could speed things up by simply finding the first index that isn't an hour old and doing points = points[i:] to chop off the old points from the beginning of the slice.
You may run into problems if you get a request that accesses the slice while you're cleaning it up. Adding a sync.Mutex can help with that. Just lock the mutex before the cleanup and also lock it anywhere else you write to the slice. This may be premature optimization, though. I'd experiment without the mutex before adding it in, as this would effectively make interacting with points a serial operation and slow the service down.
The time.Sleep(...) in the loop is there to prevent cleaning too often. You might be tempted to set it to an hour, since you want to delete points older than that, but you might end up with a situation where a point is added immediately after a cleanup: on the next cleanup it'll be 59 minutes old and you don't delete it, and on the NEXT cleanup it's nearly 2 hours old. My rule of thumb is to attempt a cleanup every 1/10 of the amount of time I want to allow an object to stay in memory, but that's rather arbitrary. This approach means an object could be at most 1h 5m 59s old when it's deleted.
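Putting those pieces together, here is a minimal sketch of the cleanup goroutine. It assumes the asker's Point type, a package-level points slice, the sync and time imports, and a mutex named mu that is also locked around the append in processIncomingData:

var (
    mu     sync.Mutex
    points []Point
)

func cleanup() {
    for {
        // Clean roughly every 1/10 of the 1h retention window.
        time.Sleep(6 * time.Minute)

        mu.Lock()
        // Points are appended in chronological order, so find the first
        // index that is still under an hour old and drop everything before it.
        i := 0
        for ; i < len(points); i++ {
            if time.Since(points[i].time) < time.Hour {
                break
            }
        }
        points = points[i:]
        mu.Unlock()
    }
}

At 10-15 points per second the slice only ever holds a few tens of thousands of points, so a linear scan every few minutes is cheap.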
My use case:
append items (a small struct) to a slice in the main process
every 100 items, I want to process them in a processor goroutine (then pop them from the slice)
items come in very fast, continuously
I read that if a variable (a slice in my case) is used by two or more goroutines and at least one of them writes to it, one must handle the concurrency (mutex or similar).
My questions:
If I do not protect the reads/writes on the slice with a mutex, do I risk problems? (i.e. item 101 arrives while the processor is working on items 1-100)
What is the best concurrency technique for the incoming item flow to remain "fluent"?
Disclaimer:
I do not want any event queueing; I need to process items in bundles of a given size.
Actually, you don't need a lock here. Here is working code:
package main

import (
    "fmt"
    "sync"
)

type myStruct struct {
    Cpt int
}

func main() {
    buf := make([]myStruct, 0, 100)
    wg := sync.WaitGroup{}
    // Main process
    // Appending ten million times
    for i := 0; i < 10e6; i++ {
        // Appending
        buf = append(buf, myStruct{Cpt: i})
        // Did we reach 100 items?
        if len(buf) >= 100 {
            // Yes we did. Creating a slice from the buffer
            processSlice := make([]myStruct, 100)
            copy(processSlice, buf[0:100])
            // Emptying the buffer
            buf = buf[:0]
            // Running the processor in parallel:
            // adding one element to the waitgroup
            wg.Add(1)
            go processor(&wg, processSlice)
        }
    }
    // Waiting for all processors to finish
    wg.Wait()
}

func processor(wg *sync.WaitGroup, processSlice []myStruct) {
    // Removing one element from the waitgroup when done
    defer wg.Done()
    // Doing some processing
    fmt.Printf("Processing items from %d to %d\n", processSlice[0].Cpt, processSlice[99].Cpt)
}
A few notes about your problem and this solution:
If you want a minimal stop time in your feeding process (e.g., to respond as fast as possible to an HTTP call), then the minimum you have to do synchronously is the copy, and you run the processor function in a goroutine. To do that, you have to create a new process slice each time and copy the contents of your buffer into it.
The sync.WaitGroup object is needed to ensure that the last processor function has ended before exiting the program.
Note that this is not a perfect solution: if you run this pattern for a long time and the input data comes in faster than the processor goroutines can handle the slices, then there will be:
More and more processSlice instances in RAM -> risk of filling the RAM and hitting swap
More and more parallel processor goroutines -> the same risk for the RAM, plus more work in flight at the same time, which makes each call slower, so the problem feeds on itself
This will end up with the system crashing at some point.
The solution is to have a limited number of workers, which guarantees there is no crash. However, when all of those workers are busy, the feeding process has to wait, which is not what you want. It is, however, a good way to absorb a load whose intensity changes over time, as sketched below.
In general, just remember that if you feed in more data than you can process in the same amount of time, your program will reach a limit at some point where it can't keep up, so it has to either slow down input acquisition or crash. This is mathematical!
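For completeness, here is a hedged sketch of that bounded-worker variant, reusing myStruct and the fmt/sync imports from the example above; numWorkers and the channel buffer size are arbitrary choices:

// A fixed pool of workers consuming 100-item bundles from a channel.
// When every worker is busy and the channel buffer is full, the feeding
// loop blocks on the send, which is the back-pressure described above.
func runWithWorkerPool() {
    const numWorkers = 4
    jobs := make(chan []myStruct, numWorkers) // small buffer to absorb bursts
    wg := sync.WaitGroup{}

    // Start the workers once.
    for w := 0; w < numWorkers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for bundle := range jobs {
                fmt.Printf("Processing items from %d to %d\n",
                    bundle[0].Cpt, bundle[len(bundle)-1].Cpt)
            }
        }()
    }

    // Feeding loop, same batching as before.
    buf := make([]myStruct, 0, 100)
    for i := 0; i < 10e6; i++ {
        buf = append(buf, myStruct{Cpt: i})
        if len(buf) >= 100 {
            bundle := make([]myStruct, 100)
            copy(bundle, buf[:100])
            buf = buf[:0]
            jobs <- bundle // blocks when every worker is busy
        }
    }

    close(jobs) // no more bundles; lets the workers' range loops end
    wg.Wait()
}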
Defining the problem:
We have IoT devices, each of which sends us logs about car locations. We want to compute the distance each car travels online, so whenever a log comes in (after putting it in a queue, etc.) we do this:
type Delta struct {
    DeviceId string
    time     int64
    Distance float64
}

var LastLogs = make(map[string]FullLog)
var Distances = make(map[string]Delta)

func addLastLog(l FullLog) {
    LastLogs[l.DeviceID] = l
}

func AddToLogPerDay(l FullLog) {
    //mutex.Lock()
    if val, ok := LastLogs[l.DeviceID]; ok {
        if distance, exist := Distances[l.DeviceID]; exist {
            x := computingDistance(val, l)
            Distances[l.DeviceID] = Delta{
                DeviceId: l.DeviceID,
                time:     distance.time + 1,
                Distance: distance.Distance + x,
            }
        } else {
            Distances[l.DeviceID] = Delta{
                DeviceId: l.DeviceID,
                time:     1,
                Distance: 0,
            }
        }
    }
    addLastLog(l)
}
This basically calculates the distance using a utility function, so in Distances each device ID is mapped to some distance traveled. Now here is where the problem starts: while these distances are being added to the Distances map, I want a goroutine to put this data into the database. But since there are many devices and many logs, doing a query for every log is not a good idea, so I need to do this every 5 seconds, i.e. every 5 seconds flush all the distances that were added to the map since the last flush. I wrote this function:
func UpdateLogPerDayTable() {
    for {
        for _, distance := range Distances {
            logs := model.HourPerDay{}
            result := services.CarDBProvider.DB.Table(model.HourPerDay{}.TableName()).
                Where("created_at >? AND device_id = ?", getCurrentData(), distance.DeviceId).
                Find(&logs)
            if result.Error != nil && !result.RecordNotFound() {
                log.Infof("Something went wrong while checking the log: %v", result.Error)
            } else {
                if !result.RecordNotFound() {
                    logs.CountDistance = distance.Distance
                    logs.CountSecond = distance.time
                    err := services.CarDBProvider.DB.Model(&logs).
                        Update(map[string]interface{}{
                            "count_second":   logs.CountSecond,
                            "count_distance": logs.CountDistance,
                        })
                    if err.Error != nil {
                        log.Infof("Something went wrong while updating the log: %v", err.Error)
                    }
                } else if result.RecordNotFound() {
                    dayLog := model.HourPerDay{
                        Model:         gorm.Model{},
                        DeviceId:      distance.DeviceId,
                        CountSecond:   int64(distance.time),
                        CountDistance: distance.Distance,
                    }
                    err := services.CarDBProvider.DB.Create(&dayLog)
                    if err.Error != nil {
                        log.Infof("Something went wrong while adding the log: %v", err.Error)
                    }
                }
            }
        }
        time.Sleep(time.Second * 5)
    }
}
It is called with go utlis.UpdateLogPerDayTable() in another goroutine. However, there are several problems here:
I don't know how to protect Distances so that when I add to it in one goroutine and read it somewhere else, everything is OK. (The problem is that I want to use Go channels and have no idea how to do it.)
How can I schedule tasks in Go for this problem?
I will probably add Redis to store all the devices that are online, so I can do the select query faster and only update the actual database; I'd also set an expire time in Redis so that if a device doesn't send any data for a while, it vanishes. Where should I put this code?
Sorry if my explanation wasn't enough, but I really need some help, specifically with the code implementation.
Go has a really cool pattern using for / select over multiple channels. This allows you to batch distance writes using both a timeout and a max record size. Using this pattern requires using channels.
First thing is to model your distances as a channel:
distances := make(chan Delta)
Then you can keep track of the current batch:
var deltas []Delta
Then:

ticker := time.NewTicker(time.Second * 5)

var deltas []Delta
for {
    select {
    case <-ticker.C:
        // 5 seconds are up: flush to db
        // reset deltas
    case d := <-distances:
        deltas = append(deltas, d)
        if len(deltas) >= maxDeltasPerFlush {
            // flush
            // reset deltas
        }
    }
}
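To make that loop concrete, here is a hedged sketch where flushToDB is a stand-in for whatever persistence code you end up with (for example the gorm calls above), and maxDeltasPerFlush is an arbitrary batch size; it assumes the Delta type from the question and the time import:

const maxDeltasPerFlush = 100 // illustrative batch size

func batchDistances(distances <-chan Delta, flushToDB func([]Delta)) {
    ticker := time.NewTicker(time.Second * 5)
    defer ticker.Stop()

    var deltas []Delta
    flush := func() {
        if len(deltas) == 0 {
            return
        }
        flushToDB(deltas)
        deltas = nil // reset the batch
    }

    for {
        select {
        case <-ticker.C:
            // 5 seconds are up: flush whatever has accumulated.
            flush()
        case d := <-distances:
            deltas = append(deltas, d)
            if len(deltas) >= maxDeltasPerFlush {
                flush()
            }
        }
    }
}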
I don't know how to protect Distances so that when I add to it in one goroutine and read it somewhere else, everything is OK. (The problem is that I want to use Go channels and have no idea how to do it.)
If you intend to keep a map and share memory, you need to protect it using mutual exclusion (a mutex) to synchronize access between goroutines. Using a channel allows you to send a copy to the channel, removing the need to synchronize access to the Delta object. Depending on your architecture, you could also create a pipeline of goroutines connected by channels, so that only a single goroutine (a monitor goroutine) accesses the Delta, which also removes the need for synchronization.
How can I schedule tasks in Go for this problem?
Using a channel as the primitive for how you pass Deltas to different goroutines :)
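For example, the question's AddToLogPerDay could compute a per-log Delta and send it on the distances channel instead of writing to the shared map; the batching goroutine (or the SQL UPDATE) then does the per-device aggregation, and LastLogs stays confined to the single goroutine that receives the logs. A sketch, with the extra channel parameter being my addition:

func AddToLogPerDay(l FullLog, distances chan<- Delta) {
    if prev, ok := LastLogs[l.DeviceID]; ok {
        distances <- Delta{
            DeviceId: l.DeviceID,
            time:     1, // one more log seen for this device
            Distance: computingDistance(prev, l),
        }
    }
    addLastLog(l)
}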
I will probably add Redis to store all the devices that are online, so I can do the select query faster and only update the actual database; I'd also set an expire time in Redis so that if a device doesn't send any data for a while, it vanishes. Where should I put this code?
This depends on your final architecture. You could write a decorator for the select operation, which would check Redis first and then go to the DB. The client of this function wouldn't have to know about this. Write operations could be done the same way: write to the persistent store and then write back to Redis with the cached value and the expiration. With decorators the client wouldn't need to know about any of this; they would just perform the reads and writes, and the cache logic would be implemented inside the decorators. There are many ways to do this, and it largely depends on where your implementation settles.
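As a very rough illustration of that decorator idea, here is a sketch in which everything is hypothetical: DayLogStore is an invented interface, HourPerDay is assumed to be the question's model type, the sync and time imports are assumed, and the in-memory map with an expiry merely stands in for Redis (a real Redis client would replace it):

// DayLogStore is a hypothetical interface the rest of the code depends on.
type DayLogStore interface {
    GetDayLog(deviceID string) (HourPerDay, bool, error)
    SaveDayLog(dl HourPerDay) error
}

// cachingStore decorates another DayLogStore with a cache: here an
// in-memory map with an expiry, standing in for Redis.
type cachingStore struct {
    next  DayLogStore
    mu    sync.Mutex
    cache map[string]cachedEntry
    ttl   time.Duration
}

type cachedEntry struct {
    value   HourPerDay
    expires time.Time
}

func (c *cachingStore) GetDayLog(deviceID string) (HourPerDay, bool, error) {
    c.mu.Lock()
    e, ok := c.cache[deviceID]
    c.mu.Unlock()
    if ok && time.Now().Before(e.expires) {
        return e.value, true, nil // cache hit: no DB round trip
    }
    dl, found, err := c.next.GetDayLog(deviceID) // fall through to the DB
    if err == nil && found {
        c.put(deviceID, dl)
    }
    return dl, found, err
}

func (c *cachingStore) SaveDayLog(dl HourPerDay) error {
    if err := c.next.SaveDayLog(dl); err != nil { // write-through to the DB
        return err
    }
    c.put(dl.DeviceId, dl)
    return nil
}

func (c *cachingStore) put(deviceID string, dl HourPerDay) {
    c.mu.Lock()
    c.cache[deviceID] = cachedEntry{value: dl, expires: time.Now().Add(c.ttl)}
    c.mu.Unlock()
}

The caller only ever sees a DayLogStore, so the caching (and its expiry) can be added or removed without touching the batching code.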
I am working on a chat bot for the site Twitch.tv that is written in Go.
One of the features of the bot is a points system that rewards users for watching a particular stream. This data is stored in a SQLite3 database.
To get the viewers, the bot makes an API call to twitch and gathers all of the current viewers of a stream. These viewers are then put in a slice of strings.
Total viewers can range anywhere from a couple to 20,000 or more.
What the bot does
Makes API call
Stores all viewers in a slice of strings
For each viewer, the bot iterates and adds points accordingly.
The bot clears this slice before the next iteration.
Code
type Viewers struct {
    Chatters struct {
        CurrentModerators []string `json:"moderators"`
        CurrentViewers    []string `json:"viewers"`
    } `json:"chatters"`
}

func RunPoints(timer time.Duration, modifier int, conn net.Conn, channel string) {
    database := InitializeDB() // Loads database through SQLite3 driver
    var Points int
    var allUsers []string
    for range time.NewTicker(timer * time.Second).C {
        currentUsers := GetViewers(conn, channel)
        tx, err := database.Begin()
        if err != nil {
            fmt.Println("Error starting points transaction: ", err)
        }
        allUsers = append(allUsers, currentUsers.Chatters.CurrentViewers...)
        allUsers = append(allUsers, currentUsers.Chatters.CurrentModerators...)
        for _, v := range allUsers {
            userCheck := UserInDB(database, v)
            if userCheck == false {
                statement, _ := tx.Prepare("INSERT INTO points (Username, Points) VALUES (?, ?)")
                statement.Exec(v, 1)
            } else {
                err = tx.QueryRow("Select Points FROM points WHERE Username = ?", v).Scan(&Points)
                if err != nil {
                } else {
                    Points = Points + modifier
                    statement, _ := tx.Prepare("UPDATE points SET Points = ? WHERE username = ?")
                    statement.Exec(Points, v)
                }
            }
        }
        tx.Commit()
        allUsers = allUsers[:0]
        currentUsers = Viewers{} // Clear Viewer object struct
    }
}
Expected Behavior
When pulling thousands of viewers, naturally, I expect the system resource usage to get pretty high. This can take the bot from 3.0 MB of RAM up to 20 MB+. Thousands of elements take up a lot of space, of course!
However, something else happens.
Actual Behavior
Each time the API is called, the RAM increases as expected. But because I clear the slice, I expect it to fall back down to its 'normal' 3.0 MB of usage.
However, the amount of RAM usage increases with every API call, and doesn't go back down even if the total number of viewers of a stream decreases.
Thus, given a few hours, the bot will easily consume 100+ MB of RAM, which doesn't seem right to me.
What am I missing here? I'm fairly new to programming and CS in general, so perhaps I am trying to fix something that isn't a problem. But this almost sounds like a memory leak to me.
I have tried forcing garbage collection and freeing the memory through Go's runtime library, but this does not fix it.
To understand what's happening here, you need to understand the internals of a slice and what's happening with it. You should probably start with https://blog.golang.org/go-slices-usage-and-internals
To give a brief answer: A slice gives a view into a portion of an underlying array, and when you are attempting to truncate your slice, all you're doing is reducing the view you have over the array, but the underlying array remains unaffected and still takes up just as much memory. In fact, by continuing to use the same array, you're never going to decrease the amount of memory you're using.
I'd encourage you to read up on how this works, but as an example for why no actual memory would be freed up, take a look at the output from this simple program that demos how changes to a slice will not truncate the memory allocated under the hood: https://play.golang.org/p/PLEZba8uD-L
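As a standalone illustration in the same spirit as the linked playground snippet (not a copy of it), this tiny program shows that truncating a slice leaves the capacity, i.e. the backing array, in place:

package main

import "fmt"

func main() {
    s := make([]string, 0)
    for i := 0; i < 20000; i++ {
        s = append(s, "viewer")
    }
    fmt.Println(len(s), cap(s)) // length 20000, capacity at least 20000

    s = s[:0]
    fmt.Println(len(s), cap(s)) // length 0, but the capacity (and memory) is unchanged
}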
When you reslice the slice:
allUsers = allUsers[:0]
All the elements are still in the backing array and cannot be collected. The memory is still allocated, which will save some time on the next run (the runtime doesn't have to resize the array as much, avoiding slow allocations), and that is the point of reslicing to zero length instead of just dumping the slice.
If you want the memory released to the GC, you'd need to just dump it altogether and create a new slice every time. This would be slower, but use less memory between runs. However, that doesn't necessarily mean you'll see less memory used by the process. The GC collects unused heap objects, then may eventually free that memory to the OS, which may eventually reclaim it, if other processes are applying memory pressure.
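In the bot's loop, the two options would look something like this (a sketch using the names from the question):

// Option 1: keep the backing array and reuse its capacity on the next tick.
allUsers = allUsers[:0]

// Option 2: drop the backing array so the GC can eventually collect it
// (along with the strings it references); the next tick re-allocates.
allUsers = nil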
In code where a global map with an expensive-to-generate value structure may be modified by multiple concurrent goroutines, which pattern is correct?
// equivalent to map[string]*activity where activity is a
// fairly heavyweight structure
var ipActivity sync.Map

// version 1: not safe with multiple goroutines, I think
func incrementIP(ip string) {
    val, ok := ipActivity.Load(ip)
    if !ok {
        val = buildComplexActivityObject()
        ipActivity.Store(ip, val)
    }
    updateTheActivityObject(val.(*activity), ip)
}

// version 2: inefficient, I think, because a complex object is built
// every time even though it's only needed the first time
func incrementIP(ip string) {
    tmp := buildComplexActivityObject()
    val, _ := ipActivity.LoadOrStore(ip, tmp)
    updateTheActivity(val.(*activity), ip)
}

// version 3: more complex but technically correct?
func incrementIP(ip string) {
    val, found := ipActivity.Load(ip)
    if !found {
        tmp := buildComplexActivityObject()
        // using LoadOrStore in case the mapping was already made
        // by another goroutine
        val, _ = ipActivity.LoadOrStore(ip, tmp)
    }
    updateTheActivity(val.(*activity), ip)
}
Is version three the correct pattern given Go's concurrency model?
Option 1 can obviously be called by multiple goroutines with a new ip concurrently, and only the last Store in the if block would win; the goroutines that lose the race update activity objects that are no longer in the map, so those updates are lost. This possibility grows the longer buildComplexActivityObject takes, as there is more time in the critical section.
Option 2 works, but calls buildComplexActivityObject every time, which you state is not what you want.
Given that you want to call buildComplexActivityObject as infrequently as possible, the third option is the only one that makes sense.
The sync.Map however cannot protect the actual activity values referenced by the stored pointers. You also need synchronization there when updating the activity value.
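For example (a sketch: the fields of activity are invented here, and the sync import is assumed), each activity can carry its own mutex so that concurrent updates to the same object are safe:

type activity struct {
    mu    sync.Mutex
    count int64
    // ... other fairly heavyweight fields
}

func updateTheActivity(a *activity, ip string) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.count++ // whatever the real update is, it happens under the lock
}

If the update really is just a counter, a sync/atomic field would work as well.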
I have a large number of allocated slices (a few million) which I have appended to. I'm sure a large number of them are over capacity. I want to try and reduce memory usage.
My first attempt is to iterate over all of them, allocate a new slice of len(oldSlice) and copy the values over. Unfortunately this appears to increase memory usage (up to double) and the garbage collection is slow to reclaim the memory.
Is there a good general way to slim down memory usage for a large number of over-capacity slices?
Choosing the right strategy to allocate your buffers is hard without knowing the exact problem.
In general you can try to reuse your buffers:
type buffer struct{}

var buffers = make(chan *buffer, 1024)

func newBuffer() *buffer {
    select {
    case b := <-buffers:
        return b
    default:
        return &buffer{}
    }
}

func returnBuffer(b *buffer) {
    select {
    case buffers <- b:
    default:
    }
}
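Callers then take a buffer from the pool and hand it back when they are done, for example:

func handle() {
    b := newBuffer()
    defer returnBuffer(b)
    // ... use b ...
}

When the channel is full, returnBuffer simply drops the buffer and lets the GC collect it, which is what makes the pool a "leaky bucket" and keeps it bounded at 1024 buffers.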
The heuristic used in append may not be suitable for all applications. It's designed for use when you don't know the final length of the data you'll be storing. Instead of iterating over them later, I'd try to minimize the amount of extra capacity you're allocating as early as possible. Here's a simple example of one strategy, which is to use a buffer only while the length is not known, and to reuse that buffer:
type buffer struct {
    names []string
    // ... possibly other things
}

// assume this is called frequently and has lots and lots of names
func (b *buffer) readNames(lines *bufio.Scanner) ([]string, error) {
    // Start from zero, so we can re-use capacity
    b.names = b.names[:0]
    for lines.Scan() {
        b.names = append(b.names, lines.Text())
    }
    // Figure out the error
    err := lines.Err()
    if err == io.EOF {
        err = nil
    }
    // Allocate a minimal slice
    out := make([]string, len(b.names))
    copy(out, b.names)
    return out, err
}
Of course, you'll need to modify this if you need something that's safe for concurrent use; for that I'd recommend using a buffered channel as a leaky bucket for storing your buffers.
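A sketch of that combination, applying the same buffered-channel leaky bucket shown earlier to the buffer type above (the pool size of 64 is arbitrary):

var buffers = make(chan *buffer, 64)

func getBuffer() *buffer {
    select {
    case b := <-buffers:
        return b
    default:
        return &buffer{}
    }
}

func putBuffer(b *buffer) {
    select {
    case buffers <- b:
    default: // pool is full: drop the buffer and let the GC take it
    }
}

// Each concurrent caller works with its own buffer, so readNames itself
// needs no locking.
func readNamesConcurrent(lines *bufio.Scanner) ([]string, error) {
    b := getBuffer()
    defer putBuffer(b)
    return b.readNames(lines)
}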