How to understand the cache mechanism in TensorFlow - caching

The paper TensorFlow: A System for Large-Scale Machine Learning, §3.3, says:
We optimized TensorFlow for executing large subgraphs repeatedly with low latency. Once the graph for a step has been pruned, placed, and partitioned, its subgraphs are cached in their respective devices. A client session maintains the mapping from step definitions to cached subgraphs, so that a distributed step on a large graph can be initiated with one small message to each participating task. This model favours static, reusable graphs, but it can support dynamic computations using dynamic control flow, as the next subsection describes.
How should 'cached in their respective devices' be understood here? Also, many APIs have a 'caching_device' parameter whose default value is None, so how does the cache feature actually work?
In general, a cache mechanism always comes with a cache-invalidation policy, so what is the invalidation policy here?
If we use multiple clones of the model graph for multiple GPUs with between-graph parallelism, that is, several model clones referring to shared variables on the ps, how does each clone read the remote variables? Does it cache the variables on some local device by default to reduce network communication?
More details:
A Tour of TensorFlow
https://arxiv.org/pdf/1610.01178.pdf
Finally, an important optimization made by TensorFlow at this step is “canonicalization” of (send,receive) pairs. In the setup displayed in Figure 5b, the existence of each recv node on device B would imply allocation and management of a separate buffer to store ν’s output tensor, so that it may then be fed to nodes α and β, respectively. However, an equivalent and more efficient transformation places only one recv node on device B, streams all output from ν to this single node, and then to the two dependent nodes α and β. This last and final evolution is given in Figure 5c.
The passage above describes how the optimization shown in Figure 5c is done automatically to reduce implicit read operations. If this happens in a distributed system, network traffic is automatically reduced, as desired.
Separately, /model/slim/deployment/model_deploy.py tries to set up variable caching as follows:
def caching_device(self):
  """Returns the device to use for caching variables.

  Variables are cached on the worker CPU when using replicas.

  Returns:
    A device string or None if the variables do not need to be cached.
  """
  if self._num_ps_tasks > 0:
    return lambda op: op.device
  else:
    return None
to try to optimize network traffic, I think.
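For illustration, here is a minimal sketch (mine, not from the original post) of how a caching_device string can be attached to a variable scope in the TF 1.x graph API; the job names, device strings, variable name and shapes are made-up examples:

import tensorflow as tf  # TF 1.x graph-mode API

# Hypothetical setup: the variable lives on the parameter server, but reads
# issued for this worker are cached on the worker's CPU, so consumers on the
# worker can reuse one fetched copy instead of re-reading from the ps.
with tf.device("/job:ps/task:0/cpu:0"):
    with tf.variable_scope("shared",
                           caching_device="/job:worker/task:0/cpu:0"):
        w = tf.get_variable("w", shape=[1024, 256])

with tf.device("/job:worker/task:0/gpu:0"):
    x = tf.ones([8, 1024])
    y = tf.matmul(x, w)  # reads the locally cached copy of w

The model_deploy.caching_device() helper quoted above has a similar intent: when ps tasks are in use it returns a device function instead of None so that variable reads are cached rather than re-fetched; per its docstring, the variables end up cached on the worker CPU when replicas are used.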
What is the real or best way to optimize communication in a distributed system?
We would also appreciate more clarification about this, and we will try to update this issue when we get more experimental tuning results.

Related

PWM transistor heating - Raspberry

I have a Raspberry Pi and an auxiliary PCB with transistors for driving some LED strips.
The strips' datasheet says 12 V, 13.3 W/m. I'll use 3 strips in parallel, 1.8 m each, so 13.3 * 1.8 * 3 = 71.82 W, which at 12 V is almost 6 A.
I'm using an 8A transistor, E13007-2.
In the project I have 5 channels of different LEDs: RGB and 2 types of white.
R, G, B, W1 and W2 are directly connected to the Pi's pins.
The LED strips are connected to 12 V, with CN3 and CN4 for GND (through the transistors).
Transistor schematic.
I know that's a lot of current passing through the transistors, but is there a way to reduce the heating? I think they're reaching 70-100°C. I already had a problem with one Raspberry Pi, and I think it's getting dangerous for the application. I have some large traces on the PCB, so that's not the problem.
Some thoughts:
1 - A resistor driving the base of the transistor. Maybe it won't reduce heating, but I think it's advisable for short-circuit protection; how can I calculate this?
2 - The PWM has a frequency of 100 Hz; would it make any difference if I reduced this frequency?
The BJT you're using has a current gain (hFE) of roughly 20. This means that the collector current is roughly 20 times the base current, or the base current needs to be 1/20 of the collector current, i.e. 6 A / 20 = 300 mA.
A Raspberry Pi certainly can't supply 300 mA from its IO pins, so you're operating the transistor in the linear region, which causes it to dissipate a lot of heat.
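As a rough worked example for thought 1 (the 3.3 V GPIO level and a V_BE of about 0.7 V are assumed values, not measurements):

$$ R_B \approx \frac{V_{GPIO} - V_{BE}}{I_B} = \frac{3.3\,\mathrm{V} - 0.7\,\mathrm{V}}{0.3\,\mathrm{A}} \approx 8.7\,\Omega $$

and since a Pi GPIO pin can only source on the order of 16 mA, no base resistor value will let this BJT carry 6 A in saturation when driven straight from the pin; that is exactly why it sits in the linear region and heats up.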
Change your transistors to MOSFETs with a low enough threshold voltage (around 2.0 V, so they conduct well at the 3.3 V IO voltage) to keep it simple.
An N-channel MOSFET will run much cooler if you apply enough gate voltage to fully enhance it. Since this is not a high-volume item, why not simply use a MOSFET gate driver chip? Then you can use a low-RDS(on) device. Another option is the Siemens BTS660 (S50085B BTS50085B TO-220). It is a high-side driver that you will need to drive with an open-collector or open-drain device. It will switch 5 A at room temperature with no heat sink. It is rated for much more current and is available in a TO-220 type package. It is obsolete but still available, as is its replacement. MOSFETs are voltage controlled while BJTs are current controlled.
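For a rough sense of scale (assuming, purely for illustration, a fully enhanced logic-level MOSFET with R_DS(on) of about 50 mΩ and the ~6 A worst case above):

$$ P \approx I^2 R_{DS(on)} = (6\,\mathrm{A})^2 \times 0.05\,\Omega \approx 1.8\,\mathrm{W} $$

whereas a BJT stuck in its linear region dropping even 2 V at 6 A dissipates about 12 W. The exact numbers depend on the parts and the drive, but that order-of-magnitude gap is why a properly driven MOSFET (or a gate driver plus low-RDS(on) MOSFET) runs so much cooler.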

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data set, and the operations on each of the data elements are independent, OpenCL can be a good choice to make it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
[X,Y] = function1(BigData);
M = functionA(X);
b = function2(M);
N = functionB(Y);
c = function3(N);
}
Here function1 is applied to each element of BigData and produces two more big data sets (X, Y). functionA/function2 and functionB/function3 then operate individually on each element of X and Y, respectively.
Since all of the functions operate on each element of the data sets independently, using a GPU might make it faster. So I came up with the following:
while( function(b,c)!=TRUE)
{
//[X,Y] = function1(BigData);
1. load kernel1 and BigData on the GPU. Each of the threads will work on one of the data
   elements and save the result on X and Y on GPU.
//M = functionA(X);
2a. load kernel2 on GPU. Each of the threads will work on one of the
data elements of X and save the result on M on GPU.
(workItems=n1, workgroup size=y1)
//b = function2(M);
2b. load kernel2 (Same kernel) on GPU. Each of the threads will work on
one of the data elements of M and save the result on B on GPU
(workItems=n2, workgroup size=y2)
3. read the data B on host variable b
//N = functionB(Y);
4a. load kernel3 on GPU. Each of the threads will work on one of the
data element of Y and save the result on N on GPU.
(workItems=n1, workgroup size=y1)
//c = function3(N);
4b. load kernel3 (Same kernel) on GPU. Each of the threads will work
on one of the data elements of N and save the result on C on GPU
(workItems=n2, workgroup size=y2)
5. read the data C on host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up being even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?), or on what hardware (compute card? laptop IGPU?). "Significant overhead" can mean a lot of different things. 5 ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?
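To make the "keep intermediates resident on the device, read back only what the host test needs" idea concrete, here is a hedged host-side sketch using Python and pyopencl. The kernel bodies (k1, k2), sizes, buffer names and the np.allclose convergence test are placeholders standing in for function1/functionA/function2 and function(b,c), not the poster's real code:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

num = 1_000_000
big = np.random.rand(num).astype(np.float32)

# Placeholder kernels: k1 stands in for function1, k2 for functionA/function2.
prg = cl.Program(ctx, """
__kernel void k1(__global const float *src, __global float *x, __global float *y) {
    int i = get_global_id(0);
    x[i] = src[i] * 2.0f;
    y[i] = src[i] + 1.0f;
}
__kernel void k2(__global const float *src, __global float *out) {
    int i = get_global_id(0);
    out[i] = src[i] * src[i];
}
""").build()

# Allocate every intermediate once, outside the loop, and keep it on the device.
big_d = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=big)
x_d = cl.Buffer(ctx, mf.READ_WRITE, big.nbytes)
y_d = cl.Buffer(ctx, mf.READ_WRITE, big.nbytes)
m_d = cl.Buffer(ctx, mf.READ_WRITE, big.nbytes)
n_d = cl.Buffer(ctx, mf.READ_WRITE, big.nbytes)

b = np.empty(num, dtype=np.float32)
c = np.empty(num, dtype=np.float32)

for _ in range(100):                     # stands in for while(function(b,c) != TRUE)
    prg.k1(queue, (num,), None, big_d, x_d, y_d)
    prg.k2(queue, (num,), None, x_d, m_d)
    prg.k2(queue, (num,), None, y_d, n_d)
    # Only the values the host-side test needs come back each iteration.
    cl.enqueue_copy(queue, b, m_d)
    cl.enqueue_copy(queue, c, n_d)
    queue.finish()
    if np.allclose(b, c):                # stands in for function(b, c)
        break

If the convergence test only needs a scalar (or a short vector), it is usually worth doing that reduction in a kernel too, so the per-iteration host/device transfer shrinks to a few bytes. In any case the buffers and program compilation should stay outside the loop, since re-creating them every iteration is a common source of exactly the kind of overhead described above.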

NFC APDU READ command performance tuning

I am reading several hundred bytes from a DESFire card using APDU commands.
The application holding the data is authenticated, and the responses are MAC'ed.
I submit a series of READ_DATA commands (0xBD), each retrieving 54 bytes+MAC while increasing read offset for each command.
Will this operation go much quicker if I use a long READ with ADDITIONAL_FRAME (AF) instead of many sequential reads?
I understand that a simple AF is 1 byte vs 8 bytes for a full READ DATA command, thus reducing the number of bytes transferred by ~10%.
But will use of AF give additional performance benefits, for example because of less processing needed by the card?
I am asking this since I am getting only ~220kbit/s effective transfer rate when the theoretical limit is 424kbit/s. See my question on this here
I modified my reads to use ADDITIONAL FRAME.
This reduced the total bytes sent+received from 1628 to 1549 bytes, a 5% reduction.
The time used by transceive() was reduced from 602 ms to 593 ms, a 1.5% reduction.
The conclusion is that the use of AF does not give any additional performance benefit beyond the reduced time for the bytes transferred.
The finding also indicates that, since the time was reduced by a much smaller factor than the data, there must be operations that introduce significant latency independent of the amount of data transceived, and ReadFile is not the one.
Authenticate, SelectApplication or ReadRecord may be significantly more expensive than the time used for data transfer.
(Wanted to write a comment, but it got quite long...)
I would use the ADDITIONAL FRAME (AF) way; some reasoning follows:
- in addition to the mentioned 7 bytes saved in the command, you save 4 MAC bytes in all of the responses (but the last one, of course)
- every time you read 54 bytes, you wastefully MAC 2 zero bytes of padding, which might have been MACed as data (given the DES block size of 8); in the "AF way" there is only a single MAC run covering all the data, so this does not happen (see the short calculation after this list)
- you are not enforcing the actual frame size; it is up to the reader and the card to select the right frame size (here I am not 100% sure how DESFire handles the ISO 14443-4 chaining and FSD)
- some readers can handle the AF situation on their own and (magically?) give you the complete answer (doing the AF somehow in their firmware -- I have seen at least one reader doing this)
If my thoughts are (at least partially) correct, these points make only 9ms difference in your scenario. But under another scenario they might make much more.
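A quick illustration of the padding point in the second bullet above (54-byte reads, 8-byte DES blocks):

$$ \lceil 54 / 8 \rceil \times 8 = 56 \quad\Rightarrow\quad 56 - 54 = 2 \text{ zero bytes MACed per read} $$

whereas one MAC run over the whole chained transfer pads only once, at the very end.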
Additional notes:
- I would exclude SELECT APPLICATION and AUTHENTICATE from the benchmark and measure them separately. It is up to you, but I would say that they only interfere with the desired "raw" data transfer measurement
- I would recommend benchmarking the pure "plain data transfer" mode, which is (presumably) the fastest "raw" data transfer possible
Thank you for sharing your results, not many people do so... Good luck!

Implementing Stack and Queue with O(1/B)

This is an exercise from this textbook (page 77):
Exercise 48 (External memory stacks and queues). Design a stack data structure that needs O(1/B) I/Os per operation in the I/O model
from Section 2.2. It suffices to keep two blocks in internal memory.
What can happen in a naive implementation with only one block in
memory? Adapt your data structure to implement FIFOs, again using two
blocks of internal buffer memory. Implement deques using four buffer
blocks.
I don't want the code. Can anyone explain to me what the question is asking for, and how I can do operations in O(1/B) I/Os?
As the book goes, quoting Section 2.2 on page 27:
External Memory: <...> There are special I/O operations that transfer B consecutive words between slow and fast memory. For
example, the external memory could be a hard disk, M would then be the
main memory size and B would be a block size that is a good compromise
between low latency and high bandwidth. On current technology, M = 1
GByte and B = 1 MByte are realistic values. One I/O step would then be
around 10 ms, which is 10^7 clock cycles of a 1 GHz machine. With another
setting of the parameters M and B, we could model the smaller access
time difference between a hardware cache and main memory.
So, doing things in O(1/B) I/Os per operation most likely means, in other words, using only a constant number of these block I/O operations for every B stack/queue operations, i.e. the cost is amortized over B operations.
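To make that concrete, here is a small simulation sketch (mine, not from the book): the "disk" is just a Python list of blocks, every whole-block transfer counts as one I/O, and the stack keeps at most two blocks' worth of elements in internal memory.

class ExternalStack:
    """Stack with O(1/B) amortized I/Os per operation (simulated I/O model)."""

    def __init__(self, B):
        self.B = B
        self.mem = []        # internal memory: at most 2*B elements
        self.disk = []       # external memory: a list of full blocks of size B
        self.io_count = 0    # number of block transfers performed

    def push(self, x):
        self.mem.append(x)
        if len(self.mem) == 2 * self.B:
            # Internal memory is full: write the OLDEST B elements out as one
            # block (one I/O), keeping the newest B elements in memory.
            self.disk.append(self.mem[:self.B])
            self.mem = self.mem[self.B:]
            self.io_count += 1

    def pop(self):
        if not self.mem:
            if not self.disk:
                raise IndexError("pop from empty stack")
            # Internal memory is empty: read the most recently written block
            # back in (one I/O); it holds the next B elements to be popped.
            self.mem = self.disk.pop()
            self.io_count += 1
        return self.mem.pop()

After a block write, B elements remain in memory, and after a block read, B elements are available, so between any two I/Os at least B pushes or pops must happen: the amortized cost is O(1/B) I/Os per operation. With only one block in memory, alternating push and pop right at a block boundary would force a block transfer on every single operation, which is the pitfall the exercise hints at.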

Storing values in separate CPU cache banks

As part of a class project I am looking at ways to improve the performance of a path finding algorithm in a CPU architecture. The algorithm is implemented in C++. The basic operation is to read x,y coordinates and perform some operations on them.
The idea I have right now is to store the x and y coordinates separately in two cache banks (set associative). The two coordinates for a location should be stored in different banks so that the CPU can read them both in parallel, do separate operations on the x and y coordinates, and store the combined result. By using vector operations this process can be sped up further to read up to 4 x coordinates and 4 y coordinates at the same time.
For example, for computing euclidean distance from the goal node, at each location the x and y coordinates has to be read and subtracted from goal coordinates to find out the distance.
I wish to know if there is an effective way (cache placement policy) to keep the x and y coordinates in different cache lines/blocks to take advantage of this parallelism. Is there any operation/encoding of the coordinates I can use to implement it?
P.S.: I am not looking for software optimizations but for a (theoretical) modified cache design to speed up the algorithm.
Ref: This blog post mentions that "L1 cache can process two accesses in parallel if they access cache lines from different banks, and serially if they belong to the same bank."
A coordinate held in a structure like this:
struct {
    int x;
    int y;
} coordinate;
will fit into a single cache line. In fact, on typical MIPS processors with a 32-byte cache line, 4 instances of coordinate will fit into a cache line. The CPU will read the entire line at once from a single bank or way in the data cache, and all 32 bytes, or all 4 coordinates, will thenceforth be available in a 32-byte fill/store buffer (FSB) at zero delay to the pipeline. MIPS processors with caches typically have more than one FSB.
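The line-fit arithmetic, assuming 4-byte ints as on MIPS32:

$$ \mathrm{sizeof(coordinate)} = 4 + 4 = 8\ \text{bytes}, \qquad \frac{32\ \text{bytes per line}}{8\ \text{bytes}} = 4\ \text{coordinates per line} $$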
MIPS processors with banked or set-associative caches typically use an LRU algorithm to select which bank/way new data is placed into on a cache miss. It is not generally possible to ensure that data from a certain location always gets placed into a particular way in the cache.
All this is to say that your scheme of storing x and y coordinates in separate cache banks will not improve performance over a naive scheme because of the positive effects of the FSBs on the naive scheme and because of the unpredictable effects of the way selection algorithm for the cache.
