Golang concurrent map access with range - data-structures

I have a map with objects that need to be released before the map is cleared. I am tempted to iterate over the map and remove/release objects as I walk through it.
Here is a mock-up example:
https://play.golang.org/p/kAtPoUgMsq
Since the only way to iterate the map is through range, how would I synchronize multiple producers and multiple consumers?
I don't want to read lock the map since that would make delete/modifying keys during the iteration impossible.

There are a bunch of ways you can clean up things from a map without racy map accesses. What works for your application depends a lot on what it's doing.
0) Just lock the map while you work on it. If the map's not too big, or you have some latency tolerance, it gets the job done quickly (in terms of time you spend on it) and you can move on to thinking about other stuff. If it becomes a problem later, you can come back to the problem then.
1) Copy the objects or pointers out and clear the map while holding a lock, then release the objects in the background. If the problem is that the slowness of releasing itself will keep the lock held a long time, this is the simple workaround for that (see the sketch after this list).
2) If efficient reads are basically all that matters, use atomic.Value. That lets you entirely replace one map with a new and different one. If writes are essentially 0% of your workload, the efficient reads balance out the cost of creating a new map on every change. That's rare, but it happens, e.g., encoding/gob has a global map of types managed this way.
3) If none of those do everything you need, tweak how you store the data (e.g. shard the map). Replace your map with 16 maps and hash keys yourself to decide which map a thing belongs in, and then you can lock one shard at a time, for cleanup or any other write.
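For illustration, here's a minimal sketch of option 1, assuming a hypothetical Object type and release() function: swap the map out while holding the lock, then release the old contents in the background.

var (
	mu sync.Mutex
	m  = map[string]*Object{}
)

func clearAndRelease() {
	mu.Lock()
	old := m
	m = map[string]*Object{} // other goroutines now see an empty map
	mu.Unlock()

	// Nothing else can reach old anymore, so it can be drained without a lock.
	go func() {
		for k, v := range old {
			release(k, v)
		}
	}()
}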
There's also the issue of a race between release and use: goroutine A gets something from the map, B clears the map and releases the thing, A uses the released thing.
One strategy there is to lock each value while you use or release it; then you need locks but not global ones.
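A minimal sketch of that per-value locking idea, assuming a hypothetical resource type that wraps whatever handle needs releasing:

// imports: "errors", "sync"
type resource struct {
	mu       sync.Mutex
	released bool
	// ... the underlying handle (file, connection, etc.)
}

func (r *resource) use() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.released {
		return errors.New("resource already released")
	}
	// ... work with the underlying handle
	return nil
}

func (r *resource) release() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.released {
		// ... free the underlying handle
		r.released = true
	}
}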
Another is to tolerate the consequences of races if they're known and not disastrous; for example, concurrent access to net.Conns is explicitly allowed by its docs, so closing an in-use connection may cause a request on it to error but won't lead to undefined app behavior. You have to really be sure you know what you're getting into then, though, 'cause many benign-seeming races aren't.
Finally, maybe your application already is ensuring no in-use objects are released, e.g. there's a safely maintained reference count on objects and only unused objects are released. Then, of course, you don't have to worry.
It may be tempting to try to replace these locks with channels somehow but I don't see any gains from it. It's nice when you can design your app thinking mainly in terms of communication between processes rather than shared data, but when you do have shared data, there's no use in pretending otherwise. Excluding unsafe access to shared data is what locks are for.

You do not state all the requirements (e.g. can the release of multiple objects happen simultaneously, etc) but the simplest solution I can think of is to remove elements and launch a release goroutine for each of the removed elements:
for k := range keysToRemove {
	if v, ok := m[k]; ok {
		delete(m, k)     // remove the entry from the map first
		go release(k, v) // then release the object in the background
	}
}

Update August 2017 (golang 1.9)
You now have a new Map type in the sync package: a concurrent map with amortized-constant-time loads, stores, and deletes.
It is safe for multiple goroutines to call a Map's methods concurrently.
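A minimal sketch of how the original release-and-clear problem could look with sync.Map (Object and release are stand-ins from the question's mock-up, not a real API):

// imports: "sync"
var m sync.Map

func put(key string, obj *Object) {
	m.Store(key, obj)
}

func get(key string) (*Object, bool) {
	v, ok := m.Load(key)
	if !ok {
		return nil, false
	}
	return v.(*Object), true
}

func releaseAll() {
	// Deleting entries while ranging over a sync.Map is allowed.
	m.Range(func(key, value interface{}) bool {
		m.Delete(key)
		release(key.(string), value.(*Object))
		return true // keep iterating
	})
}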
Original answer Nov. 2016
I don't want to read lock the map
That makes sense, since deleting from a map is considered a write operation and must be serialized with all other reads and writes. That implies a write lock to complete the delete. (source: this answer)
Assuming the worst-case scenario (multiple writers and readers), you can take a look at the implementation of orcaman/concurrent-map, which has a Remove() method backed by multiple sync.RWMutexes: to avoid lock bottlenecks, the concurrent map is divided into several (SHARD_COUNT) map shards.
This is faster than using only one RWMutex as in this example.
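A rough sketch of the sharding idea behind that library (the value type and shard count are assumptions; the real implementation differs in detail):

// imports: "hash/fnv", "sync"
const shardCount = 16

type shard struct {
	sync.RWMutex
	items map[string]*Object
}

type shardedMap [shardCount]*shard

func newShardedMap() *shardedMap {
	var s shardedMap
	for i := range s {
		s[i] = &shard{items: map[string]*Object{}}
	}
	return &s
}

// getShard hashes the key so each key always maps to the same shard.
func (s *shardedMap) getShard(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s[h.Sum32()%shardCount]
}

func (s *shardedMap) Set(key string, v *Object) {
	sh := s.getShard(key)
	sh.Lock()
	defer sh.Unlock()
	sh.items[key] = v
}

func (s *shardedMap) Remove(key string) {
	sh := s.getShard(key)
	sh.Lock()
	defer sh.Unlock()
	delete(sh.items, key)
}

Only the shard holding the key is locked, so writes to different shards proceed in parallel.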

Related

golang hash-table with concurrency support

I need to use a very big hash-table, and access it from many readers and many writers in parallel. is there data structure like map, that support many reads and writes in parallel, without locking the whole structure each access?
Since you asked for a map "without locking the whole structure each access", I direct you to the following implementation:
https://github.com/cornelk/hashmap
This project implements a pure lock-free hash map data structure using atomic instructions common in many CPU architectures.
The regular Go sync.Map still uses an underlying Mutex which locks the corresponding map data structure.
Package sync provides the concurrency-safe map: Map is like a Go map[interface{}]interface{} but is safe for concurrent use by multiple goroutines without additional locking or coordination. Loads, stores, and deletes run in amortized constant time.
The documentation itself points out the two specific cases in which it should be used (otherwise it suggests using an ordinary map with a locking mechanism):
when the entry for a given key is only ever written once but read many times, as in caches that only grow (see the sketch after this list)
when multiple goroutines read, write and overwrite entries for disjoint sets of keys
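A short sketch of that first case, a grow-only cache, using LoadOrStore (Entry and compute are assumptions):

// imports: "sync"
var cache sync.Map

func getOrCompute(key string) *Entry {
	if v, ok := cache.Load(key); ok {
		return v.(*Entry)
	}
	// Under a race two goroutines may both compute, but only one value is kept.
	v, _ := cache.LoadOrStore(key, compute(key))
	return v.(*Entry)
}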

What is the name of this kind of cache/ data structure?

I need a fixed-size cache of objects that keeps track how many times each object was requested. When it is full and a new object is added, the object with the lowest usage score gets removed.
So this is different from an LRU cache of size N in that if some object is heavily requested, then even adding N new objects won't push it out of the cache.
Some kind of mix of a cache and a priority queue. Is there a name for that?
Thanks!
Without a time element, this kind of cache clogs up with things that were used a lot in the past, but aren't used currently. Replacement becomes impossible, because everything in the cache has been used more than once, so you won't evict anything in favor of a new item.
You could write some code that degrades the value of the count over time (i.e. take into account the time since last used), but doing so is just a really complicated way of simulating an LRU cache. I experimented with it at one point, but found that it didn't perform any better than the simple LRU cache. At least not in my application.
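For what it's worth, the structure described is usually called an LFU (least-frequently-used) cache. A minimal, non-concurrent sketch with linear-scan eviction (a heap would make eviction O(log n)); it also exhibits the clogging problem described above, since counts never decay:

type lfuEntry struct {
	value interface{}
	count int
}

type lfuCache struct {
	max   int
	items map[string]*lfuEntry
}

func newLFUCache(max int) *lfuCache {
	return &lfuCache{max: max, items: map[string]*lfuEntry{}}
}

func (c *lfuCache) Get(key string) (interface{}, bool) {
	e, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e.count++ // track how often each object is requested
	return e.value, true
}

func (c *lfuCache) Add(key string, value interface{}) {
	if _, exists := c.items[key]; !exists && len(c.items) >= c.max {
		// Evict the entry with the lowest usage count.
		var victim string
		min, first := 0, true
		for k, e := range c.items {
			if first || e.count < min {
				min, victim, first = e.count, k, false
			}
		}
		delete(c.items, victim)
	}
	c.items[key] = &lfuEntry{value: value}
}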

Golang web-server and concurrent approach for an in memory cache

I am not new to programming, but I am relatively new to golang and still not completely used to the golang concurrency approach.
The general set-up:
Web server (should be fast and parallel), so I use net/http
I need to store and retrieve lots of documents. While retrieving happens more often than storing, the factor is rather low. Maybe 20.
When retrieving, the most important documents are, by far, the most recently stored ones. The rest can be retrieved from the disk/DB if needed.
Solution: In memory cache of last added items.
Note: On retrieval, I don't care about the last 3 seconds. Meaning, if, at time (A), I ask for a complete list of the last added items, the items added in the last 3 seconds can (partially or completely) be missing. But when asking again at time (A+3s) all those items should be in the list.
My question is related to how to implement the in memory cache.
Naive approach #1 (RWLock)
Have a big list of items in memory.
Guard it with a RW lock
Problem with this approach: I successfully serialized the web server :)
OK, please forget about this approach.
Approach #2: Split things up
have X lists in memory (each with RWLock)
on HTTP handler start, get a random number and choose one of the X lists; work only on that list
Another collector routine is started every 2.5 seconds collecting and combining the lists
This is better, I theoretically could even split the work between servers.
But, for example based on the golang tour code:
func main() {
	http.HandleFunc("/view/", makeHandler(viewHandler))
	http.HandleFunc("/edit/", makeHandler(editHandler))
	http.HandleFunc("/save/", makeHandler(saveHandler))
	http.ListenAndServe(":8080", nil)
}
How do I pass/get a new random number in the http handler without serializing?
It does not need to be cryptographically secure. I just want to use it to pick one of the X lists.
I know there is a global random generator but that uses a mutex internally, so back to square 1.
I could ask the clients (JavaScript) to provide a random number as get parameter. But that sounds dangerous (DOS)? Or is this OK?
I might not know the users IP address in the go server (reverse proxy setup).
And, generally: is this a good approach? Is there a better way? Also, I am now limiting myself to X, which does not scale. If I want X to change at run-time, how could I tell the handlers about that change (without becoming serial again)?
You don't really serialize your server with an RWLock. Use RLock() for parallel reads of documents.
Check out a thread-safe concurrent map library for Go; it uses mutexes together with sharding techniques. I would also add CQRS at the database level, and it could easily handle 100K concurrent requests/sec.
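To make the shard-picking concrete: a hedged sketch of approach #2 in which the shard is chosen with an atomic counter instead of the mutex-guarded global rand, so the request path shares no lock (Document and the shard count are assumptions):

// imports: "sync", "sync/atomic"
const numShards = 16 // the "X" from the question

type docShard struct {
	mu   sync.Mutex
	docs []Document
}

var (
	shards  [numShards]docShard
	nextIdx uint64
)

// addDocument is called from the HTTP handlers; only one shard is locked.
func addDocument(d Document) {
	i := atomic.AddUint64(&nextIdx, 1) % numShards
	s := &shards[i]
	s.mu.Lock()
	s.docs = append(s.docs, d)
	s.mu.Unlock()
}

// collect runs in the background every few seconds, draining each shard
// in turn and merging the results into the main list.
func collect() []Document {
	var all []Document
	for i := range shards {
		s := &shards[i]
		s.mu.Lock()
		all = append(all, s.docs...)
		s.docs = nil
		s.mu.Unlock()
	}
	return all
}

Round-robin selection also spreads load across the shards more evenly than a random pick would.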

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 2^34 – 2^37 bytes of storage.
Ultimately, it all boils down to how you define the efficiency needed on the part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
Given the current memory-hog requirement, some form of persistent storage seems a reasonable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
An in-memory database like Redis doesn't solve the memory-hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in memory is needed due to low response-time requirements. Yet, given that your jobs take tens of seconds to complete, response times with respect to workers hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
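To illustrate how trivially the structure maps onto Redis: a hedged sketch using one Redis list per category, written here with the go-redis client as an example (the key scheme "jobs:<category>" and the address are assumptions):

package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Controller seeds or re-queues a job string under its category's queue.
	if err := rdb.RPush(ctx, "jobs:category-42", "job-payload").Err(); err != nil {
		panic(err)
	}

	// Controller pops the next job from that category to hand to a worker.
	job, err := rdb.LPop(ctx, "jobs:category-42").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("dispatching:", job)
}

Note that Redis keeps these lists in RAM too, which is exactly the caveat raised above.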
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
Requirements:
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requests per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more data per job)?
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/

MongoDB caching counters

I'm writing a visit counter for products on a website which uses MongoDB as its DB engine.
Here it says that Mongo keeps frequently accessed stuff in memory and has an integrated in-memory caching engine.
So can I just rely on this integrated caching system and naively update the counters on every visit, or is another caching layer still needed in a high-traffic environment?
They're two separate things. MongoDB uses a simple paged memory management system that, by design, keeps the most accessed parts of the memory-mapped disk space in memory.
As a result, this will help you most for counters that are requested frequently but do not change often. Unfortunately, for website counters these two things are mutually exclusive. Because increasing counters will generally not cause MongoDB to move the document holding the counter on disk, the read caching will still be fairly effective.
The main issue is your writes, basically doing an increase per visit is not going to be very cost effective. I suggest a strategy where your counter webapp caches incoming visits and only pushes counter updates every X visits or every Y seconds, whichever comes first. Your main goal here is to reduce writes per second so you definitely do not want a db write per counter visit.
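A hedged sketch of that batching strategy: count visits in memory and flush every X visits or every Y seconds, whichever comes first (flushToDB is a placeholder for the actual update, e.g. an {$inc: {visits: n}} against MongoDB):

// imports: "sync", "time"
type visitBatcher struct {
	mu        sync.Mutex
	pending   map[string]int // product ID -> visits since last flush
	threshold int
}

func newVisitBatcher(threshold int, interval time.Duration) *visitBatcher {
	b := &visitBatcher{pending: map[string]int{}, threshold: threshold}
	go func() {
		for range time.Tick(interval) { // flush every Y seconds
			b.flush()
		}
	}()
	return b
}

func (b *visitBatcher) Visit(productID string) {
	b.mu.Lock()
	b.pending[productID]++
	flushNow := b.pending[productID] >= b.threshold // flush every X visits
	b.mu.Unlock()
	if flushNow {
		b.flush()
	}
}

func (b *visitBatcher) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = map[string]int{}
	b.mu.Unlock()
	for id, n := range batch {
		flushToDB(id, n)
	}
}

The trade-off is that a crash loses at most the counts accumulated since the last flush.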
Although I have never worked on the kind of system you describe, I would do the following (assuming that I have read your question correctly and that you do indeed simply want to increment the counter for each visit).
Use the $inc operator to atomically perform the incrementation, or use upserts with modifiers to create the document structure if it is not already there (see the sketch after this list)
Use an appropriate Write Concern to speed up updates if that is safe to do so (i.e. with a Write Concern of NONE, your call to update will return immediately and you'll just have to trust Mongo to persist it to disk). Of course, whether this is safe or not depends on the use case. If you are counting millions of hits then one failed hit may not be a problem.
If the scale of data you are storing is truly enormous, look into using sharding to partition writes
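A hedged sketch of the $inc-plus-upsert suggestion using the official MongoDB Go driver (database, collection, and field names are assumptions):

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func incrementVisits(ctx context.Context, coll *mongo.Collection, productID string, n int) error {
	_, err := coll.UpdateOne(ctx,
		bson.M{"_id": productID},
		bson.M{"$inc": bson.M{"visits": n}}, // atomic server-side increment
		options.Update().SetUpsert(true),    // create the counter document if missing
	)
	return err
}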
