I'm new to Dask.
What I'm trying to find is a shared array between processes that is writable by any process.
Could someone show me how to do that?
A way to implement a shared writable array in Dask
Dask's internal abstraction is a DAG, a functional graph in which it is assumed that tasks act the same should you rerun them ("functionally pure"), since it's always possible that a task runs in two places, or that a worker which holds a task's output dies.
Dask does not, therefore, support mutable data structures as task inputs/outputs normally. However, you can execute tasks that create mutation as a side-effect, such as any of the functions that write to disk.
If you are prepared to set up your own shared memory and pass around handles to this, there is nothing stopping you from making functions that mutate that memory. The caveats around tasks running multiple times hold, and you would be on your own. There is no mechanism currently to do this kind of thing for you, but it is something I personally intend to investigate within the next few months.
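As a rough illustration of the "set up your own shared memory and pass around handles" idea, here is a minimal sketch using Python's multiprocessing.shared_memory with a local Dask cluster. It assumes all workers run as processes on one machine; the function and variable names are made up for illustration, and this is not an official Dask mechanism:

```python
import numpy as np
from multiprocessing import shared_memory
from dask.distributed import Client

def mutate(shm_name, shape, dtype, idx, value):
    # Attach to the existing shared-memory block and write in place
    # (a side effect, not a task result).
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[idx] = value
    shm.close()
    return idx

if __name__ == "__main__":
    client = Client(processes=True)  # workers as local processes on one host
    shape, dtype = (1000,), np.float64
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * 8)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[:] = 0.0

    # Each task mutates the shared array as a side effect; per the caveats above,
    # a retried task will simply perform its write again.
    futures = [client.submit(mutate, shm.name, shape, dtype, i, float(i))
               for i in range(10)]
    client.gather(futures)
    print(arr[:10])

    shm.close()
    shm.unlink()
    client.close()
```

The caveats above still apply: a task that is rerun will write again, so the writes should be idempotent, and nothing protects you from races between concurrently running tasks.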
What is your question?
I am trying to implement a metric which needs access to the whole data. So instead of updating the metric in the *_step() methods, I am trying to collect the outputs in the *_epoch_end() methods. However, the outputs contain only the output for the partition of the data each device gets. Basically, if there are n devices, then each device is getting 1/n of the total outputs.
What's your environment?
OS: Ubuntu
Packaging: conda
Version: 1.0.4
PyTorch: 1.6.0
See the pytorch-lightning manual. I think you are looking for training_step_end/validation_step_end (assuming you are using DP/DDP2).
...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.
When using the DDP backend, there's a separate process running for every GPU. They don't have access to each other's data, but there are a few special operations (reduce, all_reduce, gather, all_gather) that make the processes synchronize. When you use such operations on a tensor, the processes will wait for each other to reach the same point and combine their values in some way, for example take the sum from every process.
In theory it's possible to gather all data from all processes and then calculate the metric in one process, but this is slow and prone to problems, so you want to minimize the data that you transfer. The easiest approach is to calculate the metric in pieces and then for example take the average. self.log() calls will do this automatically when you use sync_dist=True.
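As a minimal sketch of that approach (the module, layer sizes, and the "val_loss" name are placeholders, not from this issue):

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        # With DDP, sync_dist=True averages the logged value across processes
        # instead of logging each GPU's partial value separately.
        self.log("val_loss", loss, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```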
If you don't want to take the average over the GPU processes, it's also possible to update some state variables at each step, and after the epoch synchronize the state variables and calculate your metric from those values. The recommended way is to create a class that uses the Metrics API, which recently moved from PyTorch Lightning to the TorchMetrics project.
If it's not enough to store a set of state variables, you can try to make your metric gather all data from all the processes. Derive your own metric from the Metric base class, overriding the update() and compute() methods. Use add_state("data", default=[], dist_reduce_fx="cat") to create a list where you collect the data that you need for calculating the metric. dist_reduce_fx="cat" will cause the data from different processes to be combined with torch.cat(). Internally it uses torch.distributed.all_gather. The tricky part here is that it assumes that all processes create identically-sized tensors. If the sizes don't match, syncing will hang indefinitely.
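For illustration, here is a minimal sketch of such a metric. Only the add_state/update/compute pattern comes from the description above; the toy accuracy computation and all names are placeholders:

```python
import torch
from torchmetrics import Metric

class GatheringAccuracy(Metric):
    def __init__(self):
        super().__init__()
        # dist_reduce_fx="cat" concatenates the state from all processes
        # (torch.distributed.all_gather under the hood) when compute() runs.
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("target", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        self.preds.append(preds)
        self.target.append(target)

    def compute(self):
        # After syncing, the states may already be concatenated tensors;
        # within a single process they are still lists of tensors.
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        target = torch.cat(self.target) if isinstance(self.target, list) else self.target
        return (preds.argmax(dim=-1) == target).float().mean()
```

As noted above, with dist_reduce_fx="cat" every process must contribute identically-shaped tensors, otherwise the synchronization will hang.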
I need to use a very big hash table and access it from many readers and many writers in parallel. Is there a data structure like a map that supports many reads and writes in parallel, without locking the whole structure on each access?
Since you asked for a map
without locking the whole structure on each access
I direct you to the following implementation:
https://github.com/cornelk/hashmap
This project implements a pure lock free hash map data structure using atomic instructions common in many CPU architectures
The regular Go sync.Map still uses an underlying Mutex which locks the corresponding map data structure.
The sync package provides a concurrency-safe Map type.
Map is like a Go map[interface{}]interface{} but is safe for concurrent use by multiple goroutines without additional locking or coordination. Loads, stores, and deletes run in amortized constant time.
Although the docs themselves point out the two specific cases in which it should be used (otherwise they suggest using a normal map with a separate locking mechanism):
when the entry for a given key is only ever written once but read many times, as in caches that only grow
when multiple goroutines read, write and overwrite entries for disjoint sets of keys
I know that some Spark actions like collect() cause performance issues.
As quoted in the documentation:
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
And from one more related SE question, Spark runs out of memory when grouping by key,
I have come to learn that groupByKey() and reduceByKey() may cause out-of-memory errors if parallelism is not set properly.
I did not find enough evidence about other transformations and actions which have to be used with caution.
Are these three the only commands to be careful with? I have doubts about the commands below too:
aggregateByKey()
sortByKey()
persist() / cache()
It would be great if you could provide information on intensive commands (those that work globally across partitions rather than within a single partition, or that otherwise perform poorly) which have to be handled with extra care.
You have to consider three types of operations:
transformations implemented using only mapPartitions(WithIndex), like filter, map, flatMap, etc. Typically this is the safest group; probably the biggest issue you can encounter is extensive spilling to disk.
transformations which require a shuffle. This includes obvious suspects like the different variants of combineByKey (groupByKey, reduceByKey, aggregateByKey) or join, and less obvious ones like sortBy, distinct or repartition. Without context (data distribution, exact function for reduction, partitioner, resources) it is hard to tell if a particular transformation will be problematic. There are two main factors:
network traffic and disk IO - any operation which is not performed in memory will be at least an order of magnitude slower.
skewed data distribution - if the distribution is highly skewed, the shuffle can fail or subsequent operations may suffer from suboptimal resource allocation.
operations which require passing data to and from the driver. Typically this covers actions like collect or take, and creating a distributed data structure from a local one (parallelize).
Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. The total cost depends of course on the particular operation and the amount of data.
While some of these operations can be expensive, none is particularly bad by itself (including the much-demonized groupByKey). Obviously it is better to avoid network traffic or additional disk IO, but in practice you cannot avoid them in any complex application.
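To make the three categories concrete, here is a hedged PySpark sketch; the dataset, key function, and sizes are made up purely for illustration:

```python
from pyspark import SparkContext

sc = SparkContext(appName="operation-types-sketch")

# Driver -> cluster: parallelize ships a local collection out (third category).
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Narrow transformations (first category): no shuffle, processed partition by partition.
pairs = rdd.map(lambda x: (x % 100, 1))

# Broadcast (third category): a small lookup table shipped once to every executor.
small = sc.broadcast({i: i * i for i in range(100)})
squared_keys = pairs.map(lambda kv: (small.value[kv[0]], kv[1]))

# Shuffle (second category) with map-side combining: values are pre-aggregated
# before crossing the network.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Shuffle that moves every value for a key to a single partition; with heavily
# skewed keys one executor can end up holding most of the data.
grouped_sizes = pairs.groupByKey().mapValues(lambda vs: len(list(vs)))

# Cluster -> driver: take moves only a few elements, collect moves everything.
print(counts.take(5))
# counts.collect()  # pulls the whole RDD to the driver: use with care

sc.stop()
```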
Regarding cache you may find Spark: Why do i have to explicitly tell what to cache? useful.
I have a map with objects that need to be released before clearing the map. I am tempted to iterate over the map and remove/release objects as I walk through it.
Here is a mock-up example:
https://play.golang.org/p/kAtPoUgMsq
Since the only way to iterate the map is through range, how would I synchronize multiple producers and multiple consumers?
I don't want to read-lock the map, since that would make deleting/modifying keys during the iteration impossible.
There are a bunch of ways you can clean up things from a map without racy map accesses. What works for your application depends a lot on what it's doing.
0) Just lock the map while you work on it. If the map's not too big, or you have some latency tolerance, it gets the job done quickly (in terms of time you spend on it) and you can move on to thinking about other stuff. If it becomes a problem later, you can come back to the problem then.
1) Copy the objects or pointers out and clear the map while holding a lock, then release the objects in the background. If the problem is that the slowness of releasing itself will keep the lock held a long time, this is the simple workaround for that.
2) If efficient reads are basically all that matters, use atomic.Value. That lets you entirely replace one map with a new and different one. If writes are essentially 0% of your workload, the efficient reads balance out the cost of creating a new map on every change. That's rare, but it happens, e.g., encoding/gob has a global map of types managed this way.
3) If none of those do everything you need, tweak how you store the data (e.g. shard the map). Replace your map with 16 maps and hash keys yourself to decide which map a thing belongs in, and then you can lock one shard at a time, for cleanup or any other write.
There's also the issue of a race between release and use: goroutine A gets something from the map, B clears the map and releases the thing, A uses the released thing.
One strategy there is to lock each value while you use or release it; then you need locks but not global ones.
Another is to tolerate the consequences of races if they're known and not disastrous; for example, concurrent access to net.Conns is explicitly allowed by its docs, so closing an in-use connection may cause a request on it to error but won't lead to undefined app behavior. You have to really be sure you know what you're getting into then, though, 'cause many benign-seeming races aren't.
Finally, maybe your application already is ensuring no in-use objects are released, e.g. there's a safely maintained reference count on objects and only unused objects are released. Then, of course, you don't have to worry.
It may be tempting to try to replace these locks with channels somehow but I don't see any gains from it. It's nice when you can design your app thinking mainly in terms of communication between processes rather than shared data, but when you do have shared data, there's no use in pretending otherwise. Excluding unsafe access to shared data is what locks are for.
You do not state all the requirements (e.g. can the release of multiple objects happen simultaneously, etc) but the simplest solution I can think of is to remove elements and launch a release goroutine for each of the removed elements:
for k := range keysToRemove {       // keysToRemove holds the keys to delete
	if v, ok := m[k]; ok {
		delete(m, k)
		go release(k, v)        // release each removed value in its own goroutine
	}
}
Update August 2017 (golang 1.9)
You now have a new Map type in the sync package: a concurrent map with amortized-constant-time loads, stores, and deletes.
It is safe for multiple goroutines to call a Map's methods concurrently.
Original answer Nov. 2016
I don't want to read lock the map
That makes sense, since deleting from a map is considered a write operation and must be serialized with all other reads and writes. That implies a write lock to complete the delete. (source: this answer)
Assuming the worst case scenario (multiple writers and readers), you can take a look at the implementation of orcaman/concurrent-map, which has a Remove() method using multiple sync.RWMutex instances because, to avoid lock bottlenecks, this concurrent map is divided into several (SHARD_COUNT) map shards.
This is faster than using only one RWMutex as in this example.
I would like to make a service call for each row of a file. Our source file is greater than 50 GB, and iterating over 50 GB of rows sequentially would take a long time. Is there any built-in feature for this, or does a MapReduce program need to be written to make a call to the service for each row, since MapReduce offers some parallelization? Is there any custom tool already built for this requirement?
The basic requirement for MapReduce is that tasks can run in parallel without affecting one another's results. If your service call is independent of everything else, you can use MapReduce. I think a map-only job will suffice to read each row and make the service call. However, you also need to think about the other side of the map: what are you going to do with the result of each service call, and eventually with the map output? That part decides whether you need a reducer.
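If it helps, here is a hedged sketch of such a map-only job as a Hadoop Streaming mapper in Python. The service endpoint, payload format, and output format are placeholders, not something from the original question:

```python
#!/usr/bin/env python3
# Map-only Hadoop Streaming mapper sketch: each input line becomes one service call.
import sys
import json
import urllib.request

SERVICE_URL = "http://example.com/api/process"   # hypothetical endpoint

def call_service(row: str) -> str:
    req = urllib.request.Request(
        SERVICE_URL,
        data=json.dumps({"row": row}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")

def main():
    for line in sys.stdin:
        row = line.rstrip("\n")
        if not row:
            continue
        result = call_service(row)
        # Emit key<TAB>value; with zero reducers this is the final output.
        print(f"{row}\t{result}")

if __name__ == "__main__":
    main()
```

A job like this could then be launched with Hadoop Streaming and zero reducers (e.g. -D mapreduce.job.reduces=0), although the exact invocation depends on your cluster, and in practice you would likely want batching, retries, and rate limiting around the service calls.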