I need to use a very big hash table and access it from many readers and many writers in parallel. Is there a data structure, like a map, that supports many reads and writes in parallel, without locking the whole structure on each access?
Since you asked for a map "without locking the whole structure each access", I direct you to the following implementation:
https://github.com/cornelk/hashmap
This project implements a pure lock-free hash map data structure using atomic instructions common in many CPU architectures.
The regular Go sync.Map still uses an underlying Mutex which locks the corresponding map data structure.
The sync package documentation describes its concurrency-safe map as follows:
Map is like a Go map[interface{}]interface{} but is safe for concurrent use by multiple goroutines without additional locking or coordination. Loads, stores, and deletes run in amortized constant time.
Although the documentation itself points out the two specific cases in which it should be used (otherwise it suggests using a normal map with a locking mechanism):
when the entry for a given key is only ever written once but read many times, as in caches that only grow
when multiple goroutines read, write and overwrite entries for disjoint sets of keys
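For completeness, here is a minimal sketch of sync.Map usage; the keys and values are purely illustrative:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var m sync.Map

    // Many goroutines may call Store/Load/Delete concurrently
    // without any additional locking.
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            m.Store(i, i*i)
        }(i)
    }
    wg.Wait()

    if v, ok := m.Load(3); ok {
        fmt.Println("3 squared:", v) // 9
    }
    m.Delete(3)
}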
I'm new to Dask. What I'm trying to find is a way to implement a shared writable array in Dask: an array shared between processes that is writable by any process. Could someone show me how?
Dask's internal abstraction is a DAG, a functional graph in which it is assumed that tasks act the same should you rerun them ("functionally pure"), since it's always possible that a task runs in two places, or that a worker which holds a task's output dies.
Dask does not, therefore, support mutable data structures as task inputs/outputs normally. However, you can execute tasks that create mutation as a side-effect, such as any of the functions that write to disk.
If you are prepared to set up your own shared memory and pass around handles to this, there is nothing stopping you from making functions that mutate that memory. The caveats around tasks running multiple times hold, and you would be on your own. There is no mechanism currently to do this kind of thing for you, but it is something I personally intend to investigate within the next few months.
I am not new to programming, but I am relatively new to Go and still not completely used to Go's concurrency approach.
The general set-up:
Web server (should be fast and parallel), so I use net/http
I need to store and retrieve lots of documents. Retrieving happens more often than storing, but the factor is rather low, maybe 20.
When retrieving, by far the most important documents are the most recently stored ones. The rest can be retrieved from disk/DB if needed.
Solution: In memory cache of last added items.
Note: On retrieval, I don't care about the last 3 seconds. Meaning, if, at time (A), I ask for a complete list of the last added items, the items added in the last 3 seconds can (partially or completely) be missing. But when asking again at time (A+3s) all those items should be in the list.
My question is related to how to implement the in memory cache.
Naive approach #1 (RWLock)
Have a big list of items in memory.
Guard it with an RW lock
Problem with this approach: I successfully serialized the web server :)
OK, please forget about this approach.
Approach #2: Split things up
have X lists in memory (each with RWLock)
on HTTP handler start, get a random number and choose one of the X lists; work only on that list
another collector routine runs every 2.5 seconds, collecting and combining the lists
This is better, I theoretically could even split the work between servers.
But, for example, based on the Go tour code:
func main() {
    http.HandleFunc("/view/", makeHandler(viewHandler))
    http.HandleFunc("/edit/", makeHandler(editHandler))
    http.HandleFunc("/save/", makeHandler(saveHandler))
    http.ListenAndServe(":8080", nil)
}
How do I pass/get a new random number in the http handler without serializing?
It does not need to be cryptographically secure. I just want to use it to pick one of the X lists.
I know there is a global random generator but that uses a mutex internally, so back to square 1.
I could ask the clients (JavaScript) to provide a random number as a GET parameter, but that sounds dangerous (DoS)? Or is this OK? I might not know the user's IP address in the Go server (reverse proxy setup).
And generally: is this a good approach? Is there a better way? Right now I am limiting myself to a fixed X, and this does not scale. If I want X to change during run time, how could I tell the handlers about that change (without becoming serial again)?
You don't really serialize your server with an RWLock: use RLock() for parallel reads of documents.
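A minimal sketch of that pattern, with illustrative names:

package main

import (
    "fmt"
    "sync"
)

// docStore is a hypothetical document store guarded by a sync.RWMutex:
// any number of readers proceed in parallel; only writers are exclusive.
type docStore struct {
    mu   sync.RWMutex
    docs []string
}

func (s *docStore) Add(doc string) {
    s.mu.Lock() // exclusive: blocks readers and other writers
    defer s.mu.Unlock()
    s.docs = append(s.docs, doc)
}

func (s *docStore) Recent(n int) []string {
    s.mu.RLock() // shared: many readers may hold this at once
    defer s.mu.RUnlock()
    if n > len(s.docs) {
        n = len(s.docs)
    }
    out := make([]string, n)
    copy(out, s.docs[len(s.docs)-n:])
    return out
}

func main() {
    s := &docStore{}
    s.Add("doc1")
    s.Add("doc2")
    fmt.Println(s.Recent(1)) // [doc2]
}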
Check out a thread-safe concurrent map library for Go; such libraries combine mutexes with sharding techniques. I would also add CQRS at the database level; then it could easily handle 100K concurrent requests/sec.
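A hand-rolled sketch of the sharding idea (all names here are made up): instead of picking a shard via the mutex-guarded global random generator, an atomic round-robin counter spreads writers across shards, and the periodic collector locks only one shard at a time:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

const shardCount = 16 // the "X" from the question

// shard is one independently locked list.
type shard struct {
    mu    sync.Mutex
    items []string
}

// shardedCache spreads writers over shards; no global lock, and no
// random generator: an atomic counter round-robins the shard choice.
type shardedCache struct {
    next   uint64 // incremented atomically per Add
    shards [shardCount]shard
}

func (c *shardedCache) Add(item string) {
    i := atomic.AddUint64(&c.next, 1) % shardCount
    s := &c.shards[i]
    s.mu.Lock()
    s.items = append(s.items, item)
    s.mu.Unlock()
}

// Snapshot is what the collector routine would run: it combines all
// shards into one list, locking only one shard at a time.
func (c *shardedCache) Snapshot() []string {
    var all []string
    for i := range c.shards {
        s := &c.shards[i]
        s.mu.Lock()
        all = append(all, s.items...)
        s.mu.Unlock()
    }
    return all
}

func main() {
    c := &shardedCache{}
    c.Add("a")
    c.Add("b")
    fmt.Println(c.Snapshot())
}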
I know that some of Spark Actions like collect() cause performance issues.
This is noted in the documentation:
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
And from one more related SE question: Spark runs out of memory when grouping by key
I have come to know that groupByKey() and reduceByKey() may cause out-of-memory errors if parallelism is not set properly.
I did not find enough information about which other transformations and actions have to be used with caution.
Are these three the only commands to be handled carefully? I have doubts about the commands below, too:
aggregateByKey()
sortByKey()
persist() / cache()
It would be great if you could provide information on the expensive commands (those that operate globally across partitions rather than on a single partition, or that otherwise perform poorly), which have to be handled with better guarding.
You have to consider three types of operations:
transformations implemented using only mapPartitions(WithIndex), like filter, map, flatMap, etc. Typically this is the safest group. Probably the biggest issue you can encounter is extensive spilling to disk.
transformations which require a shuffle. These include obvious suspects like the different variants of combineByKey (groupByKey, reduceByKey, aggregateByKey) or join, and less obvious ones like sortBy, distinct or repartition. Without context (data distribution, exact reduction function, partitioner, resources) it is hard to tell whether a particular transformation will be problematic. There are two main factors:
network traffic and disk IO - any operation which is not performed in memory will be at least an order of magnitude slower.
skewed data distribution - if the distribution is highly skewed, a shuffle can fail or subsequent operations may suffer from suboptimal resource allocation
operations which require passing data to and from the driver. Typically this covers actions like collect or take, and creating a distributed data structure from a local one (parallelize).
Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. The total cost depends, of course, on the particular operation and the amount of data.
While some of these operations can be expensive, none is particularly bad by itself (including the demonized groupByKey). Obviously it is better to avoid network traffic or additional disk IO, but in practice you cannot avoid them in any complex application.
Regarding cache, you may find Spark: Why do I have to explicitly tell what to cache? useful.
I have a map with objects that need to be released before clearing the map. I am tempted to iterate over the map and remove/release objects as I walk through it.
Here is a mock-up example:
https://play.golang.org/p/kAtPoUgMsq
Since the only way to iterate the map is through range, how would I synchronize multiple producers and multiple consumers?
I don't want to read-lock the map, since that would make deleting/modifying keys during the iteration impossible.
There are a bunch of ways you can clean up things from a map without racy map accesses. What works for your application depends a lot on what it's doing.
0) Just lock the map while you work on it. If the map's not too big, or you have some latency tolerance, it gets the job done quickly (in terms of time you spend on it) and you can move on to thinking about other stuff. If it becomes a problem later, you can come back to the problem then.
1) Copy the objects or pointers out and clear the map while holding a lock, then release the objects in the background. If the problem is that the slowness of releasing itself will keep the lock held a long time, this is the simple workaround for that.
2) If efficient reads are basically all that matters, use atomic.Value. That lets you entirely replace one map with a new and different one. If writes are essentially 0% of your workload, the efficient reads balance out the cost of creating a new map on every change. That's rare, but it happens, e.g., encoding/gob has a global map of types managed this way. (See the sketch after this list.)
3) If none of those do everything you need, tweak how you store the data (e.g. shard the map). Replace your map with 16 maps and hash keys yourself to decide which map a thing belongs in, and then you can lock one shard at a time, for cleanup or any other write.
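Here is a minimal sketch of option 2, the atomic.Value approach (types and names are illustrative): readers load the current map with no locks at all, while writers copy the map, modify the copy, and atomically swap it in:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// registry keeps an immutable map in an atomic.Value. Readers never
// lock; writers copy the map, modify the copy, and swap it in.
type registry struct {
    mu sync.Mutex   // serializes writers only
    v  atomic.Value // always holds a map[string]int, never mutated in place
}

func newRegistry() *registry {
    r := &registry{}
    r.v.Store(map[string]int{})
    return r
}

// Lookup is lock-free.
func (r *registry) Lookup(key string) (int, bool) {
    m := r.v.Load().(map[string]int)
    val, ok := m[key]
    return val, ok
}

// Insert is expensive (full copy), which is why this only pays off
// when writes are a negligible share of the workload.
func (r *registry) Insert(key string, val int) {
    r.mu.Lock()
    defer r.mu.Unlock()
    old := r.v.Load().(map[string]int)
    next := make(map[string]int, len(old)+1)
    for k, v := range old {
        next[k] = v
    }
    next[key] = val
    r.v.Store(next)
}

func main() {
    r := newRegistry()
    r.Insert("a", 1)
    fmt.Println(r.Lookup("a")) // 1 true
}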
There's also the issue of a race between release and use: goroutine A gets something from the map, B clears the map and releases the thing, A uses the released thing.
One strategy there is to lock each value while you use or release it; then you need locks but not global ones.
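A sketch of such per-value locking, under assumed types (the *os.File stands in for whatever resource must be released):

package main

import (
    "fmt"
    "os"
    "sync"
)

// resource is a hypothetical map value carrying its own lock, so using
// it and releasing it exclude each other without any global lock.
type resource struct {
    mu       sync.Mutex
    f        *os.File // stands in for whatever must be released
    released bool
}

// use runs fn against the resource unless it has already been released.
func (r *resource) use(fn func(*os.File)) bool {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.released {
        return false // lost the race with release; caller must re-fetch
    }
    fn(r.f)
    return true
}

// release is idempotent and safe to race with use.
func (r *resource) release() {
    r.mu.Lock()
    defer r.mu.Unlock()
    if !r.released {
        r.f.Close()
        r.released = true
    }
}

func main() {
    f, _ := os.Open(os.DevNull)
    r := &resource{f: f}
    ok := r.use(func(f *os.File) { fmt.Println("using", f.Name()) })
    fmt.Println(ok)
    r.release()
    fmt.Println(r.use(func(*os.File) {})) // false: already released
}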
Another is to tolerate the consequences of races if they're known and not disastrous; for example, concurrent access to net.Conns is explicitly allowed by its docs, so closing an in-use connection may cause a request on it to error but won't lead to undefined app behavior. You have to really be sure you know what you're getting into then, though, 'cause many benign-seeming races aren't.
Finally, maybe your application already is ensuring no in-use objects are released, e.g. there's a safely maintained reference count on objects and only unused objects are released. Then, of course, you don't have to worry.
It may be tempting to try to replace these locks with channels somehow but I don't see any gains from it. It's nice when you can design your app thinking mainly in terms of communication between processes rather than shared data, but when you do have shared data, there's no use in pretending otherwise. Excluding unsafe access to shared data is what locks are for.
You do not state all the requirements (e.g. whether the release of multiple objects can happen simultaneously, etc.), but the simplest solution I can think of is to remove the elements and launch a release goroutine for each removed element:
// Assumes access to m is already synchronized, e.g. the caller holds a write lock.
for k := range keysToRemove {
    if v, ok := m[k]; ok {
        delete(m, k)
        go release(k, v) // release in the background so the loop stays fast
    }
}
Update August 2017 (golang 1.9)
You now have a new Map type in the sync package: a concurrent map with amortized-constant-time loads, stores, and deletes.
It is safe for multiple goroutines to call a Map's methods concurrently.
Original answer Nov. 2016
I don't want to read-lock the map
That makes sense, since deleting from a map is considered a write operation and must be serialized with all other reads and writes. That implies a write lock to complete the delete. (Source: this answer.)
Assuming the worst-case scenario (multiple writers and readers), you can take a look at the implementation of orcaman/concurrent-map, which has a Remove() method using multiple sync.RWMutex instances: to avoid lock bottlenecks, the concurrent map is divided into several (SHARD_COUNT) map shards.
This is faster than using only one RWMutex as in this example.
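A short usage sketch of that library (method names follow its pre-generics API; check the repository for the current one):

package main

import (
    "fmt"

    cmap "github.com/orcaman/concurrent-map"
)

func main() {
    m := cmap.New() // internally split into SHARD_COUNT shards

    // Set and Remove lock only the shard that owns the key,
    // so operations on different shards proceed in parallel.
    m.Set("foo", 42)

    if v, ok := m.Get("foo"); ok {
        fmt.Println(v) // 42
    }

    m.Remove("foo")
}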
I'm writing a visit counter for products on a website which uses MongoDB as its DB engine.
Here it says that Mongo keeps frequently accessed stuff in memory and has an integrated in-memory caching engine.
So can I just rely on this integrated caching system and dumbly increment the counters on every visit, or does one still need another caching layer in a high-traffic environment?
They're two separate things. MongoDB uses a simple paged memory management system that, by design, keeps the most accessed parts of the memory-mapped disk space in memory.
As a result, this will help you most for counters that are requested frequently but do not change often; unfortunately, for website counters those two things are mutually exclusive. Still, because increasing a counter will generally not cause MongoDB to move the document holding it on disk, the read caching will remain fairly effective.
The main issue is your writes: an increment per visit is not going to be very cost-effective. I suggest a strategy where your counter webapp caches incoming visits and only pushes counter updates every X visits or every Y seconds, whichever comes first. Your main goal here is to reduce writes per second, so you definitely do not want a DB write per counter visit.
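A sketch of that batching strategy in Go (all names invented; flushToDB stands in for the actual MongoDB update):

package main

import (
    "sync"
    "time"
)

// counterBuffer accumulates visits in memory and flushes a counter
// after flushEvery visits or flushAfter time, whichever comes first.
type counterBuffer struct {
    mu         sync.Mutex
    pending    map[string]int // productID -> buffered visit count
    flushEvery int
    flushToDB  func(productID string, n int)
}

func newCounterBuffer(flushEvery int, flushAfter time.Duration,
    flushToDB func(string, int)) *counterBuffer {
    c := &counterBuffer{
        pending:    make(map[string]int),
        flushEvery: flushEvery,
        flushToDB:  flushToDB,
    }
    go func() { // periodic flush: "every Y seconds"
        for range time.Tick(flushAfter) {
            c.flushAll()
        }
    }()
    return c
}

func (c *counterBuffer) Visit(productID string) {
    c.mu.Lock()
    c.pending[productID]++
    n := c.pending[productID]
    if n >= c.flushEvery { // size-based flush: "every X visits"
        delete(c.pending, productID)
        c.mu.Unlock()
        c.flushToDB(productID, n)
        return
    }
    c.mu.Unlock()
}

func (c *counterBuffer) flushAll() {
    c.mu.Lock()
    batch := c.pending
    c.pending = make(map[string]int)
    c.mu.Unlock()
    for id, n := range batch {
        c.flushToDB(id, n)
    }
}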
Although I have never worked on the kind of system you describe, I would do the following (assuming that I have read your question correctly and that you do indeed simply want to increment the counter for each visit).
Use the $inc operator to atomically perform the increment, or use upserts with modifiers to create the document structure if it is not already there (see the sketch after this list)
Use an appropriate Write Concern to speed up updates if that is safe to do so (i.e. with a Write Concern of NONE, your call to update will return immediately and you'll just have to trust Mongo to persist it to disk). Of course whether this is safe or not depends on the use case. If you are counting millions of hits then 1 failed hit may not be a problem.
If the scale of data you are storing is truly enormous, look into using sharding to partition writes
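For the first point, a sketch of an atomic $inc upsert using the official MongoDB Go driver (database, collection, and field names are made up):

package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(ctx)

    counters := client.Database("shop").Collection("counters")

    // $inc is atomic; SetUpsert creates the document on the first visit.
    _, err = counters.UpdateOne(ctx,
        bson.M{"_id": "product-123"},
        bson.M{"$inc": bson.M{"visits": 1}},
        options.Update().SetUpsert(true),
    )
    if err != nil {
        log.Fatal(err)
    }
}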