What is the right way to do model parallelism in tensorflow?

I have multiple 4 GB GPU nodes, so I want them to run a huge model in parallel. I hoped that just splitting layers into several pieces with appropriate device scopes would enable model parallelism, but it turns out it doesn't reduce the memory footprint for the master node (task 0). (10-node configuration - master: 20 GB, followers: 2 GB each; 1-node configuration - master: 6-7 GB.)
My suspicion is that the gradients are not being distributed because I didn't set up the right device scope for them.
My model is available on GitHub: https://github.com/nakosung/tensorflow-wavenet/tree/model_parallel_2
The device placement log is here: https://gist.github.com/nakosung/a38d4610fff09992f7e5569f19eefa57

So the good news is that you are using colocate_gradients_with_ops, which ensures that the gradients are computed on the same devices as the ops they differentiate. (https://github.com/nakosung/tensorflow-wavenet/blob/model_parallel_2/train.py#L242)
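For reference, here is a minimal sketch of how that flag is passed (TF1-style API; the tiny `loss` below is only a stand-in for the loss your training script builds):

```python
import tensorflow as tf

# Stand-in loss; only the flag on minimize() is the point here.
w = tf.get_variable('w', [128, 128])
loss = tf.reduce_mean(tf.square(w))

optimizer = tf.train.AdamOptimizer(1e-3)
# With this flag, each gradient op is placed on the same device as the forward
# op it differentiates, so the backward pass follows your device scopes.
train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)
```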
Reading the device placement log is a little difficult, so I would suggest using TensorBoard to visualize the graph. It has options for visualizing how nodes are placed on devices.
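A minimal sketch of writing the graph out for TensorBoard (the log directory is an arbitrary example path; on older TF versions the class is called tf.train.SummaryWriter):

```python
import tensorflow as tf

# Writing the graph definition lets TensorBoard's graph view color ops by the
# device they were placed on.
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./logdir', graph=sess.graph)
    writer.close()
# Then run: tensorboard --logdir ./logdir
```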
Secondly, you can try to see how the sizes of your operations map down to devices -- it is possible that the largest layers (largest activations, or largest weights) are disproportionately placed on some nodes rather than others. You might use https://github.com/tensorflow/tensorflow/blob/6b1d4fd8090d44d20fdadabf06f1a9b178c3d80c/tensorflow/python/tools/graph_metrics.py to analyze your graph and get a better picture of where resources are required.
Longer term we'd like to try to solve some of these placement problems automatically, but so far model parallelism requires a bit of care to place things precisely.
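As a hedged illustration of what placing things precisely can look like (the job/task names and layer sizes below are made up, not taken from your cluster spec or model):

```python
import tensorflow as tf

# Explicitly pin each piece of the model, including its variables, to a device.
with tf.device('/job:worker/task:0/gpu:0'):
    w1 = tf.get_variable('w1', [128, 512])
    h1 = tf.nn.relu(tf.matmul(tf.random_normal([8, 128]), w1))
with tf.device('/job:worker/task:1/gpu:0'):
    w2 = tf.get_variable('w2', [512, 1])
    out = tf.matmul(h1, w2)

# Variables created outside any device scope can silently land on the default
# device (often task 0), which would explain a large master footprint;
# tf.train.replica_device_setter is one way to spread them deliberately.
```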

Related

How to distribute different models to multiple nodes with Pytorch?

I had a PyTorch task which worked with DP (DataParallel):
The same network is copied to multiple GPUs sharing the same weights, but each copy receives a different data batch, so training is sped up by increasing the effective batch size.
But now I hope to introduce multiple different networks into the training flow:
net_A, net_B, net_C, and they are of different architectures and don't share weights.
Is it possible to assign each network to a different node (1 node with 4 GPUs), so that "net_A" can still enjoy the speed up of DP on 4 GPUs of "node_A", and "net_B" occupies "node_B", etc?
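One hedged sketch of the single-machine version of this idea, giving each network its own GPU group via DataParallel (the networks, GPU indices, and batch below are placeholders; spanning separate physical nodes would instead require torch.distributed, which this sketch does not cover):

```python
import torch
import torch.nn as nn

# Placeholder architectures standing in for net_A / net_B.
net_A = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
net_B = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Each network gets its own group of GPUs and its own DataParallel wrapper,
# so both keep the usual DP speed-up within their group.
net_A = nn.DataParallel(net_A.to('cuda:0'), device_ids=[0, 1, 2, 3])
net_B = nn.DataParallel(net_B.to('cuda:4'), device_ids=[4, 5, 6, 7])

batch = torch.randn(64, 128)
out_A = net_A(batch.to('cuda:0'))  # replicated across GPUs 0-3
out_B = net_B(batch.to('cuda:4'))  # replicated across GPUs 4-7
```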

How to determine the optimal capacity for Quadtree subdivision?

I've created a flocking simulation using the Boids algorithm and have integrated a quadtree for optimization. Boids are inserted into the quadtree if it has not yet reached its boid capacity. If it has reached capacity, it subdivides into smaller quadtrees and the remaining boids are inserted into those recursively.
The performance seems to get better if I increase the capacity from its default of 4 to one capable of holding more boids, such as 20, and I was wondering whether there is any rule or methodology for picking the optimal capacity formulaically.
You can view the site live here or the source code here if relevant.
I'd assume it very much depends on your implementation, hardware, and the data characteristics.
Implementation:
An extreme case would be using GPU processing to compare entries. If you support that, having very large nodes, potentially just a single node containing all entries, may be faster than any other solution.
Hardware:
Cache size and Bus speed will play a big role, also depending on how much memory every node and every entry consumes. Accessing a sub-node that is not cached is obviously expensive, so you may want to increase the size of nodes in order to reduce sub-node traversal.
-> Coming back to implementation, storing the whole quadtree on a contiguous segment of memory can be very beneficial.
Data characteristics:
Clustered data: Having strongly clustered data can have an adverse effect on performance because it may cause the tree to become very deep. In this case, increasing node size may help.
Large amounts of data may push you past the point where everything fits into the cache. In this case, making nodes larger will save memory because you will have fewer nodes, and everything may fit into the cache again.
In my experience I found that 10-50 entries per node gives the best performance across different datasets.
If you update your tree a lot, you may want to define separate thresholds to avoid 'flickering' and frequent merging/splitting of nodes, i.e. split nodes with more than 25 entries but merge them only when they drop below 15 entries.
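A minimal, self-contained sketch that treats the node capacity as a parameter you can benchmark (the class and method names are illustrative, not taken from the question's source):

```python
class QuadTree:
    def __init__(self, x, y, w, h, capacity=20):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.capacity = capacity
        self.points = []
        self.children = None            # four child trees once subdivided

    def insert(self, px, py):
        if not (self.x <= px < self.x + self.w and self.y <= py < self.y + self.h):
            return False                # point lies outside this node
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._subdivide()
        return any(c.insert(px, py) for c in self.children)

    def _subdivide(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [QuadTree(self.x,      self.y,      hw, hh, self.capacity),
                         QuadTree(self.x + hw, self.y,      hw, hh, self.capacity),
                         QuadTree(self.x,      self.y + hh, hw, hh, self.capacity),
                         QuadTree(self.x + hw, self.y + hh, hw, hh, self.capacity)]
        for px, py in self.points:      # push existing points down a level
            any(c.insert(px, py) for c in self.children)
        self.points = []

tree = QuadTree(0, 0, 800, 600, capacity=20)
tree.insert(100, 150)
```

For the dynamic case, the single capacity can be split into two thresholds (e.g. split above 25, merge below 15) to get the hysteresis described above.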
If you are interested in a quadtree-like structure that avoids degenerated 'deep' quadtrees, have a look at my PH-Tree. It is structured like a quadtree but operates on bit-level, so maximum depth is strictly limited to 64 or 32, depending on how many bits your data has. In practice the depth will rarely exceed 10 levels or so, even for very dense data. Note: A plain PH-Tree is a key-value 'map' in the sense that every coordinate (=key) can only have one entry (=value). That means you need to store lists or sets of entries in case you expect more than one entry for any given coordinate.

Neural Network training in parallel, better to use Hadoop or a gpu?

I need to train a neural network with 2-4 hidden layers; I'm not sure yet about the structure of the actual net. I was thinking of training it using Hadoop MapReduce (a cluster of 12 PCs) or a GPU in order to get faster results. Which do you think would be better? Also, are there any available libraries that already have these implemented?
Thanks
I've been lucky to work in a lab which has dabbled in both of these methods for training networks, and while both are useful in very computationally expensive settings, the location of the computational bottleneck usually determines which method to use.
Training a network using a distributed system (e.g. HADOOP)
This is useful when your network is large enough that the matrix multiplications involved in training become unwieldy on a traditional PC. This problem is particularly prevalent when you have harsh time constraints (e.g. online training), as otherwise the hassle of a HADOOP implementation isn't worth it (just run the network overnight). If you're thinking about HADOOP because you want to fiddle with network parameters and not have to wait a day before fiddling some more (frequently the case in my lab), then simply run multiple instances of the network with different parameters on different machines. That way you can make use of your cluster without dealing with actual distributed computation.
Example:
You're training a network to find the number of people in images. Instead of a predefined set of training examples (image-number of people pairs) you decide to have the program pull random images from Google. While the network is processing the image, you must view the image and provide feedback on how many people are actually in the image. Since this is image processing, your network size is probably on the scale of millions of units. And since you're providing the feedback in real time the speed of the network's computations matters. Thus, you should probably invest in a distributed implementation.
Training a network on a GPU
This is the right choice if the major computational bottleneck isn't the network size, but the size of the training set (though the networks are still generally quite large). Since GPUs are ideal for situations involving applying the same vector/matrix operation across a large number of data sets, they are mainly used when you can use batch training with a very large batch size.
Example:
You're training a network to answer questions posed in natural language. You have a huge database of question-answer pairs and don't mind the network only updating its weights every 10000 questions. With such a large batch size and presumably a rather large network as well, a GPU based implementation would be a good idea.
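Purely for illustration (using PyTorch as a stand-in; the data here is random and the model tiny), large-batch GPU training boils down to something like this:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Random stand-in data: the point is simply that one very large batch per step
# turns training into a few big matrix multiplies that keep the GPU busy.
data = TensorDataset(torch.randn(100000, 512), torch.randint(0, 10, (100000,)))
loader = DataLoader(data, batch_size=10000, shuffle=True)

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    x, y = x.to(device), y.to(device)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```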

Multicore - how to merge local groups of data found on each core?

I have a large set of scalar values distributed over a 3D mesh (one value per vertex.)
My goal is to show:
all points in the mesh where the value is greater than a threshold.
AND group the points that are connected (to simplify the display.)
So my basic solution was:
Find the points that pass the threshold test
For each point that has not been grouped, create a new group and recursively put all connected points into that group.
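In sketch form, the serial grouping step looks roughly like this (illustrative only; `values`, `neighbors`, and `threshold` stand in for the actual mesh structures):

```python
from collections import deque

def group_points(values, neighbors, threshold):
    """values: dict vertex -> scalar; neighbors: dict vertex -> iterable of vertices."""
    selected = {v for v, val in values.items() if val > threshold}
    group_of = {}                         # vertex -> group id
    next_group = 0
    for seed in selected:
        if seed in group_of:
            continue
        group_of[seed] = next_group
        queue = deque([seed])
        while queue:                      # BFS instead of recursion: no stack overflow
            v = queue.popleft()
            for n in neighbors[v]:
                if n in selected and n not in group_of:
                    group_of[n] = next_group
                    queue.append(n)
        next_group += 1
    return group_of
```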
This works fine, until I started using a multicore solution:
The data set has been divided across multiple cores
Each core knows about boundary points that are shared by other cores.
I'm using MPI to communicate between cores.
I used my original algorithm to find "local" groups on a single core.
My challenge is to merge "local" groups into global groups. The problem gets complicated for a number of reasons: Connected groups can cross many core boundaries. Groups that seem separate on one core can be connected by a group on a second core.
Thanks in advance.
Jeff
The threshold test can be carried out locally, so for the sake of simplicity we can eliminate it from the discussion. What you want is a distributed algorithm that calculates the connected components of your graph. This paper should be very relevant:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1091
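As a hedged sketch of the merge step itself (using mpi4py purely for illustration, and assuming each rank already knows, for every shared boundary vertex, its global vertex id and the local group it landed in; all names are made up):

```python
from mpi4py import MPI

def merge_groups(boundary_labels):
    """boundary_labels: dict global_vertex_id -> local_group_id on this rank."""
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # 1. Gather (vertex, (rank, local_group)) pairs on rank 0.
    pairs = [(v, (rank, g)) for v, g in boundary_labels.items()]
    gathered = comm.gather(pairs, root=0)

    mapping = None
    if rank == 0:
        # 2. Union-find over (rank, local_group) nodes: two local groups that
        #    share a boundary vertex belong to the same global group, even if
        #    the connection runs through a third rank.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        first_seen = {}                          # vertex id -> first group seen
        for plist in gathered:
            for v, node in plist:
                find(node)                       # register the node
                if v in first_seen:
                    union(node, first_seen[v])
                else:
                    first_seen[v] = node

        # 3. Give every union-find root a compact global group id.
        roots, mapping = {}, {}
        for node in list(parent):
            mapping[node] = roots.setdefault(find(node), len(roots))

    # 4. Everyone receives the (rank, local_group) -> global_group mapping.
    mapping = comm.bcast(mapping, root=0)
    # Local groups that touch no shared boundary never appear in
    # boundary_labels and simply keep their purely local ids.
    return {g: mapping.get((rank, g)) for g in set(boundary_labels.values())}
```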

Prioritizing Erlang nodes

Assuming I have a cluster of n Erlang nodes, some of which may be on my LAN, while others may be connected using a WAN (that is, via the Internet), what are suitable mechanisms to cater for a) different bandwidth availability/behavior (for example, latency induced) and b) nodes with differing computational power (or even memory constraints for that matter)?
In other words, how do I prioritize local nodes that have lots of computational power over those that have high latency and may be less powerful, or how would I ideally prioritize high-performance remote nodes with high transmission latencies so that they specifically handle processes with a relatively high computation/transmission ratio (that is, completed work per message, per time unit)?
I am mostly thinking in terms of benchmarking each node in the cluster by sending it a benchmark process to run during initialization, so that the latencies involved in messaging can be calculated, as well as the overall computation speed (that is, using a node-specific timer to determine how fast a node finishes a given task).
Probably, something like that would have to be done repeatedly, on the one hand in order to get representative data (that is, averaging data) and on the other hand it might possibly even be useful at runtime in order to be able to dynamically adjust to changing runtime conditions.
(In the same sense, one would probably want to prioritize locally running nodes over those running on other machines)
This would be meant to hopefully optimize internal job dispatch so that specific nodes handle specific jobs.
We've done something similar to this, on our internal LAN/WAN only (WAN being for instance San Francisco to London). The problem boiled down to a combination of these factors:
The overhead in simply making a remote call over a local (internal) call
The network latency to the node (as a function of the request/result payload)
The performance of the remote node
The compute power needed to execute the function
Whether batching of calls provided any performance improvement when there was a shared "static" data set.
For 1. we assumed no overhead (it was negligible compared to the others)
For 2. we actively measured it using probe messages to measure round trip time, and we collated information from actual calls made
For 3. we measured it on the node and had the nodes broadcast that information (this changed depending on the load currently active on the node)
For 4 and 5. we worked it out empirically for the given batch
Then the caller solved to get the minimum solution for a batch of calls (in our case pricing a whole bunch of derivatives) and fired them off to the nodes in batches.
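As a rough illustration of that "solve for the minimum" step (written in Python rather than Erlang purely for brevity; the numbers and field names are invented), a greedy longest-processing-time assignment over measured latency and node speed looks like this:

```python
def assign_batch(jobs, nodes):
    """jobs: list of estimated work units; nodes: dict name -> {'latency': s, 'speed': units/s}."""
    finish_time = {name: 0.0 for name in nodes}
    plan = {name: [] for name in nodes}
    for job in sorted(jobs, reverse=True):          # biggest jobs first (greedy LPT)
        best = min(nodes, key=lambda n: finish_time[n]
                   + nodes[n]['latency'] + job / nodes[n]['speed'])
        finish_time[best] += nodes[best]['latency'] + job / nodes[best]['speed']
        plan[best].append(job)
    return plan

nodes = {'local_fast': {'latency': 0.001, 'speed': 100.0},
         'remote_big': {'latency': 0.120, 'speed': 400.0}}
print(assign_batch([50, 20, 20, 5, 5, 5], nodes))
```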
We got much better utilization of our calculation "grid" using this technique but it was quite a bit of effort. We had the added advantage that the grid was only used by this environment so we had a lot more control. Adding in an internet mix (variable latency) and other users of the grid (variable performance) would only increase the complexity with possible diminishing returns...
The problem you are talking about has been tackled in many different ways in the context of Grid computing (e.g., see Condor). To discuss this more thoroughly, I think some additional information is required (homogeneity of the problems to be solved, degree of control over the nodes [i.e. is there unexpected external load, etc.?]).
Implementing an adaptive job dispatcher will usually also require adjusting the frequency with which you probe the available resources (otherwise the overhead due to probing could exceed the performance gains).
Ideally, you might be able to use benchmark tests to come up with an empirical (statistical) model that allows you to predict the computational hardness of a given problem (requires good domain knowledge and problem features that have a high impact on execution speed and are simple to extract), and another one to predict communication overhead. Using both in combination should make it possible to implement a simple dispatcher that bases its decisions on the predictive models and improves them by taking into account actual execution times as feedback/reward (e.g., via reinforcement learning).
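A tiny sketch of that feedback loop (illustrative names only; a reinforcement-learning formulation would replace the update rule, but the structure is the same):

```python
class NodeModel:
    def __init__(self, initial_estimate, alpha=0.2):
        self.estimate = initial_estimate   # predicted seconds per work unit
        self.alpha = alpha                 # how quickly new observations dominate

    def predict(self, job_size):
        return self.estimate * job_size

    def update(self, job_size, observed_seconds):
        # Exponential moving average: drift the prediction toward what was
        # actually measured on the last job sent to this node.
        per_unit = observed_seconds / job_size
        self.estimate = (1 - self.alpha) * self.estimate + self.alpha * per_unit

model = NodeModel(initial_estimate=0.05)
print(model.predict(100))     # 5.0 seconds predicted
model.update(100, 8.0)        # the job was slower than expected
print(model.predict(100))     # prediction drifts toward the observation
```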
