I had a PyTorch task that worked with DP (DataParallel):
The same network is replicated across multiple GPUs sharing the same weights, but each replica receives a different data batch, so training is sped up by increasing the effective batch size.
But now I hope to introduce multiple different networks into the training flow:
net_A, net_B, net_C, and they are of different architectures and don't share weights.
Is it possible to assign each network to a different node (1 node with 4 GPUs), so that "net_A" can still enjoy the DP speed-up on the 4 GPUs of "node_A", "net_B" occupies "node_B", and so on?
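One hedged sketch of how this could look, assuming each node runs its own training process and that any data flow between the three networks across nodes is handled separately (e.g. via torch.distributed RPC or send/recv, omitted here): each node selects "its" network and wraps it in nn.DataParallel, so the node's 4 local GPUs still split every batch. The NODE_ROLE variable and the SmallNet class are placeholders, not part of any PyTorch API.

```python
# Minimal sketch: every node runs this script with a different NODE_ROLE and
# keeps only the network it owns, wrapped in DataParallel for its local GPUs.
import os
import torch
import torch.nn as nn

class SmallNet(nn.Module):            # stand-in for net_A
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 64)
    def forward(self, x):
        return torch.relu(self.fc(x))

# net_B / net_C would be defined analogously with their own architectures.
ROLE_TO_NET = {"node_A": SmallNet}    # extend with "node_B": ..., "node_C": ...

role = os.environ.get("NODE_ROLE", "node_A")
net = ROLE_TO_NET[role]()

if torch.cuda.is_available():
    net = net.cuda()
    # DataParallel replicates this one network over the node's local GPUs,
    # so each replica sees a different slice of the batch, as in plain DP.
    net = nn.DataParallel(net, device_ids=list(range(torch.cuda.device_count())))

x = torch.randn(32, 128, device="cuda" if torch.cuda.is_available() else "cpu")
out = net(x)
print(role, out.shape)
```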
On a supercomputer you have a set of nodes, and each node has some number of CPUs. Is it generally better to use, say, 20 CPUs on 1 node, as opposed to 2 nodes with 10 CPUs each? In both cases there are 20 CPUs in total.
Is communication between CPUs within a node a lot faster than between CPUs across 2 different nodes?
As a general rule of thumb, it is better to use 20 CPUs in 1 node, since intra-node communication is faster than inter-node communication.
This generally depends on the problem. If you want to use a shared-memory programming model (creating threads/tasks etc.), then 1 node with 20 CPUs will be better: you can take advantage of shared memory and caching, with lower communication overhead. But if your application requires both shared and distributed memory (processes spread among nodes), then using multiple nodes may be beneficial.
If your problem (shared or distributed) only requires the resources of a single node, then as a general rule don't take extra nodes, because you won't get any benefit from them. Even if your application uses a distributed-memory paradigm, use a single node, because intra-node communication is very fast and well optimised.
As #Poshi's comment points out, a more concrete answer is problem-specific: it requires understanding the problem and profiling the application to come up with a specific solution.
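To put numbers on the intra- vs inter-node claim for your own cluster, a simple ping-pong benchmark is enough. Here is a minimal sketch with mpi4py (assuming mpi4py and an MPI launcher are available); run it once with both ranks pinned to the same node and once with the ranks on two different nodes, and compare the timings.

```python
# Run with e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1 << 16, dtype=np.uint8)   # 64 KiB payload
reps = 1000

comm.Barrier()
start = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
comm.Barrier()

if rank == 0:
    rtt = (time.perf_counter() - start) / reps
    print(f"average round trip: {rtt * 1e6:.1f} us")
```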
I have multiple nodes with 4 GB GPUs, and I want them to run a huge model in parallel. I hoped that just splitting the layers into several pieces with appropriate device scopes would enable model parallelism, but it turns out that it doesn't reduce the memory footprint of the master node (task 0). (10-node configuration: master 20 GB, followers 2 GB; 1-node configuration: master 6-7 GB.)
My suspicion is that the gradients are not distributed because I didn't set up the right device scope for them.
My model is available on GitHub: https://github.com/nakosung/tensorflow-wavenet/tree/model_parallel_2
The device placement log is here: https://gist.github.com/nakosung/a38d4610fff09992f7e5569f19eefa57
So the good news is that you are using colocate_gradients_with_ops, which means that the gradients are being computed on the same devices as the ops they belong to (https://github.com/nakosung/tensorflow-wavenet/blob/model_parallel_2/train.py#L242).
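For reference, a minimal TF 1.x-style sketch of that pattern (the job/task names below are illustrative and not taken from the WaveNet repo): layers are pinned with explicit tf.device scopes, and colocate_gradients_with_ops=True keeps each layer's gradient ops on the same device as the layer itself, instead of letting them pile up on the default (master) device.

```python
import tensorflow as tf  # TF 1.x API

x = tf.placeholder(tf.float32, [None, 128])
y = tf.placeholder(tf.float32, [None, 1])

with tf.device("/job:worker/task:0/gpu:0"):
    h = tf.layers.dense(x, 256, activation=tf.nn.relu)

with tf.device("/job:worker/task:1/gpu:0"):
    out = tf.layers.dense(h, 1)
    loss = tf.reduce_mean(tf.square(out - y))

opt = tf.train.AdamOptimizer(1e-3)
# Without colocate_gradients_with_ops=True the gradient ops can end up on the
# default device, which is one way the master's memory footprint balloons.
train_op = opt.minimize(loss, colocate_gradients_with_ops=True)
```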
Reading the device placement log is a little difficult, so I would suggest using TensorBoard to try visualizing the graph. It has options to be able to visualize how nodes are being placed on devices.
Secondly, you can try to see how the sizes of your operations map onto devices -- it is possible that the largest layers (largest activations or largest weights) are disproportionately placed on some nodes. You might use https://github.com/tensorflow/tensorflow/blob/6b1d4fd8090d44d20fdadabf06f1a9b178c3d80c/tensorflow/python/tools/graph_metrics.py to analyze your graph and get a better picture of where resources are required.
Longer term we'd like to try to solve some of these placement problems automatically, but so far model parallelism requires a bit of care to place things precisely.
I need to train a neural network with 2-4 hidden layers; I'm not sure yet about the structure of the actual net. I was thinking of training it using Hadoop MapReduce (a cluster of 12 PCs) or a GPU in order to get faster results. Which do you think would be better? Also, are there any available libraries that already implement these?
Thanks
I've been lucky enough to work in a lab that has dabbled in both of these methods for training networks, and while both are useful in very computationally expensive settings, the location of the computational bottleneck usually determines which method to use.
Training a network using a distributed system (e.g. Hadoop)
This is useful when your network is large enough that the matrix multiplications involved in training become unwieldy on a traditional PC. This problem is particularly pressing when you have harsh time constraints (e.g. online training); otherwise the hassle of a Hadoop implementation isn't worth it (just run the network overnight). If you're thinking about Hadoop because you want to fiddle with network parameters and not have to wait a day before fiddling some more (frequently the case in my lab), then simply run multiple instances of the network with different parameters on different machines. That way you can make use of your cluster without dealing with actual distributed computation.
Example:
You're training a network to find the number of people in images. Instead of a predefined set of training examples (image, number-of-people pairs), you decide to have the program pull random images from Google. While the network is processing an image, you view the image and provide feedback on how many people are actually in it. Since this is image processing, your network size is probably on the scale of millions of units, and since you're providing the feedback in real time, the speed of the network's computations matters. Thus, you should probably invest in a distributed implementation.
Training a network on a GPU
This is the right choice if the major computational bottleneck isn't the network size but the size of the training set (though the networks are still generally quite large). Since GPUs are ideal for applying the same vector/matrix operation across large amounts of data in parallel, they are mainly used when you can use batch training with a very large batch size.
Example:
You're training a network to answer questions posed in natural language. You have a huge database of question-answer pairs and don't mind the network only updating its weights every 10000 questions. With such a large batch size and presumably a rather large network as well, a GPU based implementation would be a good idea.
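As a hedged illustration of that GPU case (the dataset, network, and batch size below are invented placeholders), the point is simply that each step pushes one very large batch through the same matrix operations on the device:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Fake question/answer features standing in for a real corpus.
data = TensorDataset(torch.randn(100_000, 300), torch.randint(0, 2, (100_000,)))
loader = DataLoader(data, batch_size=10_000, shuffle=True)  # very large batches

model = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 2)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:
    xb, yb = xb.to(device), yb.to(device)
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)   # one matrix-heavy pass over 10,000 examples
    loss.backward()
    opt.step()                      # weights update only once per huge batch
```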
I have a large set of scalar values distributed over a 3D mesh (one value per vertex).
My goal is to show:
all points in the mesh where the value is greater than a threshold,
AND group the connected points together (to simplify the display).
So my basic solution was:
Find the points that pass the threshold test
For each point that has not been grouped, create a new group and recursively put all connected points into that group.
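A minimal Python sketch of those two steps, assuming `values` maps each vertex to its scalar and `neighbors` maps each vertex to the vertices it is connected to in the mesh (a BFS stands in for the recursion so large groups don't overflow the stack):

```python
from collections import deque

def local_groups(values, neighbors, threshold):
    above = {v for v, s in values.items() if s > threshold}
    group_of = {}                       # vertex -> group id
    next_group = 0
    for seed in above:
        if seed in group_of:
            continue                    # already grouped
        group_of[seed] = next_group
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for n in neighbors[v]:
                if n in above and n not in group_of:
                    group_of[n] = next_group
                    queue.append(n)
        next_group += 1
    return group_of
```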
This works fine, until I started using a multicore solution:
The data set has been divided across multiple cores
Each core knows about boundary points that are shared by other cores.
I'm using MPI to communicate between cores.
I used my original algorithm to find "local" groups on a single core.
My challenge is to merge the "local" groups into global groups. The problem gets complicated for a number of reasons: connected groups can cross many core boundaries, and groups that seem separate on one core can be connected by a group on a second core.
Thanks in advance.
Jeff
The threshold test can be carried out locally, so for the sake of simplicity we can eliminate it from the discussion. What you want is a distributed algorithm that calculates the connected components of your graph. This paper should be very relevant:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1091
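As a simplified illustration of the merge step (this is not the algorithm from the paper): make each rank's local group ids globally unique with an offset, exchange the ids observed on shared boundary vertices, and union the resulting equivalences with a union-find. The sketch assumes mpi4py and a `boundary_labels` dict from global vertex id to local group id; the all-gather is not scalable, but it shows the idea.

```python
from mpi4py import MPI

def merge_global_groups(boundary_labels, n_local_groups):
    comm = MPI.COMM_WORLD

    # 1. Make local group ids globally unique via an exclusive prefix sum.
    offset = comm.exscan(n_local_groups) or 0
    mine = {v: g + offset for v, g in boundary_labels.items()}

    # 2. Gather everyone's boundary labels; the same vertex appearing on two
    #    ranks yields an equivalence between their (globalised) group ids.
    all_labels = comm.allgather(mine)

    parent = {}
    def find(a):
        root = a
        while parent.get(root, root) != root:
            root = parent[root]
        parent[a] = root
        return root
    def union(a, b):
        parent[find(a)] = find(b)

    seen = {}
    for labels in all_labels:
        for vertex, gid in labels.items():
            if vertex in seen:
                union(seen[vertex], gid)
            else:
                seen[vertex] = gid

    # 3. Map each of this rank's local group ids to a global representative.
    return {g: find(g + offset) for g in range(n_local_groups)}
```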
Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN while others are connected via a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, induced latency) and b) nodes with differing computational power (or even memory constraints, for that matter)?
In other words, how do I prioritize local nodes that have lots of computational power over those that have high latency and may be less powerful? And how would I ideally prioritize high-performance remote nodes with high transmission latencies so that they specifically handle the processes with a relatively high computation-to-transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking of benchmarking each node in the cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be measured, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node completes a given task).
Something like that would probably have to be done repeatedly, on the one hand to get representative (averaged) data, and on the other hand because it might even be useful at runtime to dynamically adjust to changing runtime conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines.)
The hope is that this would optimize internal job dispatch so that specific nodes handle specific jobs.
We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
The overhead in simply making a remote call over a local (internal) call
The network latency to the node (as a function of the request/result payload)
The performance of the remote node
The compute power needed to execute the function
Whether batching calls provides any performance improvement when there is a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others)
For 2. we actively measured it with probe messages (round-trip time) and collated information from the actual calls made
For 3. we measured it on the node and had the nodes broadcast that information (it changed depending on the load currently active on the node)
For 4 and 5. we worked it out empirically for the given batch
Then the caller solved for the minimum-cost assignment for a batch of calls (in our case, pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
We got much better utilization of our calculation "grid" using this technique, but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment, so we had a lot more control. Adding an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity, with possibly diminishing returns...
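A toy version of the dispatch step described above, with invented node names and cost numbers: score each node as measured round-trip time plus estimated compute time for the batch, and send each batch to the cheapest node.

```python
def dispatch(batches, nodes):
    """nodes: list of dicts with measured 'rtt_s' and estimated 'ops_per_s'."""
    assignments = []
    for batch in batches:
        # Estimated cost = network round trip + compute time on that node.
        costs = [
            (node["rtt_s"] + batch["ops"] / node["ops_per_s"], node["name"])
            for node in nodes
        ]
        cost, name = min(costs)
        assignments.append((batch["id"], name, cost))
    return assignments

nodes = [
    {"name": "local_fast", "rtt_s": 0.001, "ops_per_s": 5e9},
    {"name": "remote_big", "rtt_s": 0.120, "ops_per_s": 2e10},
]
batches = [{"id": i, "ops": 1e9 * (i + 1)} for i in range(4)]
for b_id, name, cost in dispatch(batches, nodes):
    print(f"batch {b_id} -> {name} (est. {cost:.3f}s)")
```

Note that this greedy per-batch choice ignores queueing on the nodes; a real solver would account for the work already assigned.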
The problem you are talking about has been tackled in many different ways in the context of grid computing (e.g. see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e. is there unexpected external load?], etc.).
Implementing an adaptive job dispatcher will usually also require adjusting the frequency with which you probe the available resources (otherwise the overhead of probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).
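As a small sketch of that feedback loop (the names, the problem-size feature, and the smoothing factor are all illustrative): keep a per-node speed estimate, predict run time from a problem-size feature, and refine the estimate from observed execution times with an exponential moving average.

```python
class AdaptiveEstimator:
    def __init__(self, alpha=0.2, initial_speed=1e9):
        self.alpha = alpha               # weight given to new observations
        self.initial_speed = initial_speed
        self.speed = {}                  # node -> estimated ops/second

    def predict(self, node, ops):
        """Predicted execution time for a job of `ops` operations on `node`."""
        return ops / self.speed.get(node, self.initial_speed)

    def observe(self, node, ops, elapsed_s):
        """Fold an actual execution time back into the node's speed estimate."""
        measured = ops / elapsed_s
        old = self.speed.get(node, measured)
        self.speed[node] = (1 - self.alpha) * old + self.alpha * measured

est = AdaptiveEstimator()
est.observe("node_a", ops=2e9, elapsed_s=1.0)   # node_a managed 2 Gop/s
est.observe("node_a", ops=2e9, elapsed_s=0.8)
print(est.predict("node_a", ops=4e9))           # refined prediction
```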