What is the performance cost of the eager style in TFJS? - performance

The paper TENSORFLOW.JS: MACHINE LEARNING FOR THE WEB AND BEYOND states:
Since an important part of our design goals is to prioritize
ease-of-use over performance, TensorFlow.js supports the eager style
of differentiation.
In general what kind of performance hit are we talking about? Does it depend on the model? Are there cases where there is no performance difference at all?

The main performance benefits of a declarative (a.k.a., deferred-execution, graph-mode) programming paradigm like the one in the default graph model of TensorFlow v1 (Python) comes from the following aspects:
Pushing the entire model down to the C++ layer, where execution overhead is much
lower compared to interpreted or non-compiled languages such as Python and
JavaScript
Parallel execution of independent paths of the model's computation graph. An
example is a model consisting of a number of separate input towers. These
towers can be executed concurrently on different cores of the CPU or multiple
GPUs of the same host.
Thanks to the fact that the entire model is known before the execution begins,
the C++ execution engine can perform a whole suite of optimizations on the
model's computation graph. Just to give a few examples:
Constant folding: a subtree of the graph consisting of only stateless,
deterministic operations on constant nodes can be folded into a single constant node
Op fusion: in some cases, a few adjacent nodes (ops) of the computation
graph can be replaced by a mathematically equivalent but computationally more
efficient node.
Pruning: some computation graphs contain nodes that don't contribute to
the final output. A graph-model execution engine can see that beforehand
and prevent those nodes from executing.
Just in time (JIT) compilation: the graph execution engine can take the
entire graph and compile it to a lower-level representation that involves a lower
dispatching overhead and are more amenable
to high-performance execution on the available hardware (e.g., CUDA programs
for NVIDIA or compatible GPUs, special instructions for Google TPUs, or
even shader programs for WebGL, etc.)
All the aforementioned optimization are supported by graph-mode TensorFlow. For
more details, google for terms "grappler" and "XLA".
TensorFlow.js adopts an imperative (a.k.a., eager) paradigm, mainly based on
usability considerations. This is analogous to TensorFlow eager execution,
PyTorch and NumPy. As a result, it doesn't come with all the aforementioned
opportunities for optimization.
However, realize that there are ways to get computation graphs out from an
imperative program (see TensorFlow v2's tf.function decorator and JAX). There is no reason why TensorFlow.js
can't adopt a similar paradigm for performance boosts. It's just the need hasn't
been clear enough for the product team to prioritize that feature yet.

Related

Does the guarantee of non-divergence when dispatching single work item exist?

As we know, work items running on GPUs could diverge when there are conditional branches. One of those mentions exist in Apple's OpenCL Programming Guide for Mac.
As such, some portions of an algorithm may run "single-threaded", having only 1 work item running. And when it's especially serial and long-running, some applications take those work back to CPU.
However, this question concerns only GPU and assume those portions are short-lived. Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever your terminology prefers) skip those false branches?
Update
In reply to comment, I'd remove the OpenCL tag and leave the Vulkan tag there.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?
From an API point of view it has to appear to the program that only the active branch paths were taken. As to what actually happens, I suspect you'll never know for sure. GPU hardware architectures are nearly all confidential so it's impossible to be certain.
There are really two cases here:
Cases where a branch in the program turns into a real branch instruction.
Cases where a branch in the program turns into a conditional select between two computed values.
In the case of a real branch I would expect most cases to only execute the active path because it's a horrible waste of power to do both, and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.
For simple branches the compiler might choose to use a conditional select (compute both results, and then select the right answer). In this case you will compute both results. The compiler heuristics will generally aim to choose this where computing both results is less expensive than actually having a full branch.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
Why would they be different? They are doing the same thing conceptually ...
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.
Q : Or will the compute units skip those false branches?
The ecosystem of CPU / GPU code-execution is rather complex.
The layer of hardware is where the code-paths (translated into "machine"-code) operate. On this laye, the SIMD-Computing-Units cannot and will not skip anything they are ordered to SIMD-process by the hardware-scheduler (next layer).
The layer of hardware-specific scheduler (GPUs have typically right two-modes: a WARP-mode scheduling for coherent, non-diverging code-paths efficiently scheduled in SIMD-blocks and greedy-mode scheduling). From this layer, the SIMD-Computing-Units are loaded to work on SIMD-operated blocks-of-work, so any first divergence detected on the lower layer (above) breaks the execution, flags the SIMD-hardware scheduler about blocks, deferred to be executed later and all known SIMD-specific block-device-optimised scheduling is well-known to start to grow less-efficient and less-efficient, due to each such run-time divergence.
The layer of { OpenCL | Vulkan API }-mediated device-specific programming decides a lot about the ease or comfort of human-side programming of the wide range of the target-devices, all without knowing about its respective internal constraints, about (compiler decided) preferred "machine"-code computing problem re-formulation and device-specific tricks and scheduling. A bit oversimplified battlefield picture has made for years human-users just stay "in front" of the mediated asynchronous work-units ( kernel's ) HOST-to-DEVICE scheduling queues and wait until we receive back the DEVICE-to-HOST delivered results back, doing some prior-H2D/posterior-D2H memory transfers, if allowed and needed.
The HOST-side DEVICE-kernel-code "scheduling" directives are rather imperative and help the mediated-device-specific programming reflect user-side preferences, yet leave user blind from seeing all internal decisions ( assembly-level reviews are indeed only for hard-core, DEVICE-specific, GPU-engineering Aces and hard to modify, if willing to )
All that said, "adaptive" run-time values' based decisions to move a particular "work-unit" back-to-the-HOST-CPU, rather than finalising it all in DEVICE-GPU, are not, to the best of my knowledge, taking place on the bottom of this complex computing ecosystem hierarchy ( afaik, it would be exhaustively expensive to try to do so ).

Alternative for dynamic parallelism for CUDA

I am very new to the CUDA programming model and programming in general, I suppose. I'm attempting to parallelize an expectation maximization algorithm. I am working on a gtx 480 which has compute capability 2.0. At first, I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this pdf.
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too much specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive ability. I'm not sure if my question makes sense so please ask if you need clarification.
Thank you!
As indicated by #JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism in a device that does not support it (ie. pre cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to realize some unit of your code as parallelizable, convert it to a kernel, and work through your code in essentially a non-nested fashion. Repetetive functions might be done via looping in the kernel, or else looping in the host code that calls the kernel.
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
I have checked your algorithm in Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, __syncthreads(), do the maximization step, and __syncthreads() again. From this distance, the expectation looks like a reduction primitive, and the maximization is a filter one.
If it doesn't work, and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do that to some degree, this is not what this architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also check OpenMP and MPI, these are the languages used for task-parallel programming (actually OpenMP is just a handful of pragmas in most cases).

How Concurrent is Prolog?

I can't find any info on this online... I am also new to Prolog...
It seems to me that Prolog could be highly concurrent, perhaps trying many possibilities at once when trying to match a rule. Are modern Prolog compilers/interpreters inherently* concurrent? Which ones? Is concurrency on by default? Do I need to enable it somehow?
* I am not interested in multi-threading, just inherent concurrency.
Are modern Prolog compilers/interpreters inherently* concurrent? Which ones? Is concurrency on by default?
No. Concurrent logic programming was the main aim of the 5th Generation Computer program in Japan in the 1980s; it was expected that Prolog variants would be "easily" parallelized on massively parallel hardware. The effort largely failed, because automatic concurrency just isn't easy. Today, Prolog compilers tend to offer threading libraries instead, where the program must control the amount of concurrency by hand.
To see why parallelizing Prolog is as hard as any other language, consider the two main control structures the language offers: conjunction (AND, serial execution) and disjunction (OR, choice with backtracking). Let's say you have an AND construct such as
p(X) :- q(X), r(X).
and you'd want to run q(X) and r(X) in parallel. Then, what happens if q partially unifies X, say by binding it to f(Y). r must have knowledge of this binding, so either you've got to communicate it, or you have to wait for both conjuncts to complete; then you may have wasted time if one of them fails, unless you, again, have them communicate to synchronize. That gives overhead and is hard to get right. Now for OR:
p(X) :- q(X).
p(X) :- r(X).
There's a finite number of choices here (Prolog, of course, admits infinitely many choices) so you'd want to run both of them in parallel. But then, what if one succeeds? The other branch of the computation must be suspended and its state saved. How many of these states are you going to save at once? As many as there are processors seems reasonable, but then you have to take care to not have computations create states that don't fit in memory. That means you have to guess how large the state of a computation is, something that Prolog hides from you since it abstracts over such implementation details as processors and memory; it's not C.
In other words, automatic parallelization is hard. The 5th Gen. Computer project got around some of the issues by designing committed-choice languages, i.e. Prolog dialects without backtracking. In doing so, they drastically changed the language. It must be noted that the concurrent language Erlang is an offshoot of Prolog, and it too has traded in backtracking for something that is closer to functional programming. It still requires user guidance to know what parts of a program can safely be run concurrently.
In theory that seems attractive, but there are various problems that make such an implementation seem unwise.
for better or worse, people are used to thinking of their programs as executing left-to-right and top-down, even when programming in Prolog. Both the order of clauses for a predicate and of terms within a clause is semantically meaningful in standard Prolog. Parallelizing them would change the behaviour of far too much exising code to become popular.
non-relational language elements such as the cut operator can only be meaningfully be used when you can rely on such execution orders, i.e. they would become unusable in a parallel interpreter unless very complicated dependency tracking were invented.
all existing parallelization solutions incur at least some performance overhead for inter-thread communication.
Prolog is typically used for high-level, deeply recursive problems such as graph traversal, theorem proving etc. Parallelization on a modern machines can (ideally) achieve a speedup of n for some constant n, but it cannot turn an unviable recursive solution method into a viable one, because that would require an exponential speedup. In contrast, the numerical problems that Fortran and C programmers usually solve typically have a high but quite finite cost of computation; it is well worth the effort of parallelization to turn a 10-hour job into a 1-hour job. In contrast, turning a program that can look about 6 moves ahead into one that can (on average) look 6.5 moves ahead just isn't as compelling.
There are two notions of concurrency in Prolog. One is tied to multithreading, the other to suspended goals. I am not sure what you want to know. So I will expand a little bit about multithreading first:
Today widely available Prolog system can be differentiated whether they are multithreaded or not. In a multithreaded Prolog system you can spawn multiple threads that run concurrently over the same knowledge base. This poses some problems for consult and dynamic predicates, which are solved by these Prolog systems.
You can find a list of the Prolog systems that are multithreaded here:
Operating system and Web-related features
Multithreading is a prerequesite for various parallelization paradigmas. Correspondingly the individudal Prolog systems provide constructs that serve certain paradigmas. Typical paradigmas are thread pooling, for example used in web servers, or spawning a thread for long running GUI tasks.
Currently there is no ISO standard for a thread library, although there has been a proposal and each Prolog system has typically rich libraries that provide thread synchronization, thread communication, thread debugging and foreign code threads. A certain progress in garbage collection in Prolog system was necessary to allow threaded applications that have potentially infinitely long running threads.
Some existing layers even allow high level parallelization paradigmas in a Prolog system independent fashion. For example Logtalk has some constructs that map to various target Prolog systems.
Now lets turn to suspended goals. From older Prolog systems (since Prolog II, 1982, in fact) we know the freeze/2 command or blocking directives. These constructs force a goal not to be expanded by existing clauses, but instead put on a sleeping list. The goal can the later be woken up. Since the execution of the goal is not immediate but only when it is woken up, suspended goals are sometimes seen as concurrent goals,
but the better notion for this form of parallelism would be coroutines.
Suspended goals are useful to implement constraint solving systems. In the simplest case the sleeping list is some variable attribute. But a newer approach for constraint solving systems are constraint handling rules. In constraint handling rules the wake up conditions can be suspended goal pair patterns. The availability of constraint solving either via suspended goals or constraint handling rules can be seen here:
Overview of Prolog Systems
Best Regards
From a quick google search it appears that the concurrent logic programming paradigm has only been the basis for a few research languages and is no longer actively developed. I have seen claims that concurrent logic is easy to do in the Mozart/Oz system.
There was great hope in the 80s/90s to bake parallelism into the language (thus making it "inherently" parallel), in particular in the context of the Fifth Generation Project. Even special hardware constructs were studied to implement "Parallel Inference Machine" (PIM) (Similar to the special hardware for LISP machines in the "functional programming" camp). Hardware efforts were abandoned due to continual improvement of off-the-shelf CPUs and software effort were abandoned due to excessive compiler complexity, lack of demand for hard-to-implement high-level features and likely lack of payoff: parallelism that looks transparent and elegantly exploitable at the language level generally means costly inter-process communication and transactional locking "under the hood".
A good read about this is
"The Deevolution of Concurrent Logic Programming Languages"
by Evan Tick, March 1994. Appeared in "Journal of Logic Programming, Tenth Anniversary Special Issue, 1995". The Postscript file linked to is complete, unlike the PDF you get at Elsevier.
The author says:
There are two main views of concurrent logic programming and its
development over the past several years [i.e. 1990-94]. Most logic programming
literature views concurrent logic programming languages as a
derivative or variant of logic programs, i.e., the main difference
being the extensive use of "don't care" nondeterminism rather than
"don't know" (backtracking) nondeterminism. Hence the name committed
choice or CC languages. A second view is that concurrent logic
programs are concurrent, reactive programs, not unlike other
"traditional" concurrent languages such as 'C' with explicit message
passing, in the sense that procedures are processes that communicate
over data streams to incrementally produce answers. A cynic might say
that the former view has more academic richness, whereas the latter
view has more practical public relations value.
This article is a survey of implementation techniques of concurrent
logic programming languages, and thus full disclosure of both of these
views is not particularly relevant. Instead, a quick overview of basic
language semantics, and how they relate to fundamental programming
paradigms in a variety of languages within the family, will suffice.
No attempt will be made to cover the many feasible programming
paradigms; nor semantical nuances, nor the family history. (...).
The main point I wish to make in this article is that concurrent logic
programming languages have been deevolving since their inception,
about ten years ago, because of the following tatonnement:
Systems designers and compiler writers could supply only certain limited features in robust; efficient implementations. This drove the
market to accept these restricted languages as, in some informal
sense, de facto standards.
Programmers became aware that certain, more expressive language features were not critically important to getting applications
written, and did not demand their inclusion.
Thus my stance in this article will be a third view: how the initially
rich languages gradually lost their "teeth," and became weaker, but
more practically implementable, and achieved faster performance.
The deevolutionary history begins with Concurrent Prolog (deep guards,
atomic unification; read-only annotated variables for
synchronization), and after a series of reductions (for example: GHC
(input-matching synchronization), Parlog (safe), FCP (flat), Fleng (no
guards), Janus (restricted communication), Strand (assignment rather
than output unification)), and ends for now with PCN (flat guards,
non-atomic assignments input-matching synchronization, and
explicitly-defined mutable variables). This and other terminology will
be defined as the article proceeds.
This view may displease some
readers because it presupposes that performance is the main driving
force of the language market; and furthermore that the main "added
value" of concurrent logic programs over logic programs is the ability
to naturally exploit parallelism to gain speed. Certainly the reactive
nature of the languages also adds value; e.g., in building complex
object-oriented applications. Thus one can argue that the deevolution
witnessed is a bad thing when reactive capabilities are being traded
for speed.
ECLiPSe-CLP, a language "largely backward-compatible with Prolog", supports OR-parallelism, even though "this functionality is currently not actively maintained because of other priorities".
[1,2] document OR- (and AND-)parallelism in ECLiPSe-CLP.
However, I tried to get it working some time using the code from ECLiPSe-CLP's repository, but I didn't get it though.
[1] http://eclipseclp.org/reports/book.ps.gz
[2] http://eclipseclp.org/doc/bips/kernel/compiler/parallel-1.html

Is Cilk's approach to shared memory parallel programming a panacea?

What challenges in shared memory parallel programming (particularly multicore) cannot be solved or cannot be solved efficiently using a Cilk-style solution (i.e. nested data parallelism with per-core work stealing task deques)?
I think Cilk's model is nested task parallelism (which you can use to implement
data parallelism). It is pretty cute, but...
Cilk doesn't support SIMD data parallelism or streaming parallelism.
Cilk doesn't appear to handle partial orders over tasks well, because it only offers nested parallelism. Try coding the following set of parallel tasks: A,B,C,D with ordering constraints A before B, A before D, C before D. (This is the canonical example of the smallest partial task order that nested task parallelism can't encode directly). You lose some the parallelism implementing this with nested parallelism. Parallelism being precious, you don't want to waste opportunities to be parallel.
It doesn't handle (AFAIK) exception handling across thread boundaries.
This is needed if you want to build a really big, complex symbolic application.
It is also useful if you want to compute speculatively.
I also don't think Cilk can handle large sets of interacting (waiting on synchronization events) between computation grains, because AFAIK in a Cilk program can only have as many live parallel computations as there are OS-offered threads. This is due to the Cilk implementation choice of living on top of standard C/C++ compilers and their stack models, which in most workstations use the one-big-stack-per-thread model. You might get to 10 or 100, but not 10,000; this matters when handling very large graphs. [I don't know for a fact that Cilk even allows computation grains to synchronize, but I don't see any technical reason why it could not if it gave up the large-stack-model].
A second implication is that Cilk applications can't recurse over huge data structures, because whatever size stack you choose is bounded, and there's some example for which you will run out of stack. (This isn't a design flaw of Cilk, just of its implementation). That's too bad, because huge things are one of the places where you want parallelism.
For an alternative, see PARLANSE, which offers
arbitrarily large number of computation grains, with work-stealing, but with heap-allocated grains and activation records. Each grain has its own context (and therefore one can implement large interacting sets of grains, because it is straightforward to save grain state when it needs to wait on an event. PARLANSE synchonization primitives include futures, semaphores, and critical function results (see below)
PARLANSE offers explicit "teams" (set of computational grains) as an abstraction, with exceptions propagating out of functions to the top of a computational grain (Java defines this to be "undefined" which is stupid), to the team parent, back to all the other team children as an asynchronous abort-exception (catchable in a try clause) allowing the other children to clean up.
Because some actions (e.g., asynchronous abort-exceptions) can occur at arbitrary times, PARLANSE offers the notion of critical functions, whose results are guaranteed to be returned to the caller atomically, so a function either returns a result assuredly, or does not, and the function can clean up resources safely in an asynch abort handler.
Special "partial order" teams allows one to encode computations in which the partial order is known; I think this is more efficient than Cilk's nested parallelism if you have large sets of such.
(We use PARLANSE to implement large-scale program analysis/transformation tools.. PARLANSE was invented to support this; we want parallelism because the artifacts we deal with are huge, well, trees and graphs, representing millions of lines of code).
(PARLANSE doesn't do streams or SIMD either, but they aren't out of scope of the language. One could arguably add streams and SIMD to C and C++ but its likely to be pretty hard).

How to implement efficient sorting algorithms for multiple processors with Scala?

How to implement efficient sorting algorithms for multiple processors in Scala? Here's the link for radix algorithm in GPU:
radix algorithm in GPU
Use scala.actors.Futures. It isn't a good solution, because you are talking about parallel computation, not concurrent computation, and Futures is aimed at the latter, not the former.
Things like parallel arrays that are coming with Java 7 and a later (not 2.8) version of Scala are more appropriate for parallel algorithms.
Just explaining, a parallel algorithm is one which does the same computation on multiple processing units. It is easy to see that each of which runs the same code. A concurrent computation is one in which each processing unit is running a potentially different code.
Also related, in parallel algorithms, the code being run doesn't change, only the data. In concurrent computation you have code changing constantly.
By the way, though that is not what you are asking, let me state that there's a library for Scala to run OpenCL code (ie, run computation on GPU). It's called ScalaCL.

Resources