How is functional programming beneficial in parallel computations? - parallel-processing

It’s interesting to note that there exist functional programming languages, like Scala or Erlang that forbid changing variable values.
there are areas like parallel computations where this limitation confers certain benefits. Studying such a language (even if you’re not planning to use it soon) is recommended to broaden the mind.
So I started wondering how it's beneficial to parallel computations?

Q : "how it's beneficial to parallel computations?"
In that neither of code-execution paths ever tries to change any variable, the less it ever changes the value of the variable, so all benefit from a warranty of principally coherent ( as per it's value (never ever changed) ), non-colliding processing.
No such thing as a race on read/write coherent-access to a variable (that is by definition never changed).
If nothing else this is A COOL property for the (by definition safe) [PARALLEL] flow of code-execution paths, isn't it?

Related

Does the guarantee of non-divergence when dispatching single work item exist?

As we know, work items running on GPUs could diverge when there are conditional branches. One of those mentions exist in Apple's OpenCL Programming Guide for Mac.
As such, some portions of an algorithm may run "single-threaded", having only 1 work item running. And when it's especially serial and long-running, some applications take those work back to CPU.
However, this question concerns only GPU and assume those portions are short-lived. Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever your terminology prefers) skip those false branches?
Update
In reply to comment, I'd remove the OpenCL tag and leave the Vulkan tag there.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?
From an API point of view it has to appear to the program that only the active branch paths were taken. As to what actually happens, I suspect you'll never know for sure. GPU hardware architectures are nearly all confidential so it's impossible to be certain.
There are really two cases here:
Cases where a branch in the program turns into a real branch instruction.
Cases where a branch in the program turns into a conditional select between two computed values.
In the case of a real branch I would expect most cases to only execute the active path because it's a horrible waste of power to do both, and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.
For simple branches the compiler might choose to use a conditional select (compute both results, and then select the right answer). In this case you will compute both results. The compiler heuristics will generally aim to choose this where computing both results is less expensive than actually having a full branch.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
Why would they be different? They are doing the same thing conceptually ...
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.
Q : Or will the compute units skip those false branches?
The ecosystem of CPU / GPU code-execution is rather complex.
The layer of hardware is where the code-paths (translated into "machine"-code) operate. On this laye, the SIMD-Computing-Units cannot and will not skip anything they are ordered to SIMD-process by the hardware-scheduler (next layer).
The layer of hardware-specific scheduler (GPUs have typically right two-modes: a WARP-mode scheduling for coherent, non-diverging code-paths efficiently scheduled in SIMD-blocks and greedy-mode scheduling). From this layer, the SIMD-Computing-Units are loaded to work on SIMD-operated blocks-of-work, so any first divergence detected on the lower layer (above) breaks the execution, flags the SIMD-hardware scheduler about blocks, deferred to be executed later and all known SIMD-specific block-device-optimised scheduling is well-known to start to grow less-efficient and less-efficient, due to each such run-time divergence.
The layer of { OpenCL | Vulkan API }-mediated device-specific programming decides a lot about the ease or comfort of human-side programming of the wide range of the target-devices, all without knowing about its respective internal constraints, about (compiler decided) preferred "machine"-code computing problem re-formulation and device-specific tricks and scheduling. A bit oversimplified battlefield picture has made for years human-users just stay "in front" of the mediated asynchronous work-units ( kernel's ) HOST-to-DEVICE scheduling queues and wait until we receive back the DEVICE-to-HOST delivered results back, doing some prior-H2D/posterior-D2H memory transfers, if allowed and needed.
The HOST-side DEVICE-kernel-code "scheduling" directives are rather imperative and help the mediated-device-specific programming reflect user-side preferences, yet leave user blind from seeing all internal decisions ( assembly-level reviews are indeed only for hard-core, DEVICE-specific, GPU-engineering Aces and hard to modify, if willing to )
All that said, "adaptive" run-time values' based decisions to move a particular "work-unit" back-to-the-HOST-CPU, rather than finalising it all in DEVICE-GPU, are not, to the best of my knowledge, taking place on the bottom of this complex computing ecosystem hierarchy ( afaik, it would be exhaustively expensive to try to do so ).

Programming Chess in Functional programming

A year ago I programmed a chess AI using the Alphabeta prunning algorithm. This was relatively straight forward to do in c++. One of the main issues I considered while doing this was making my code efficient. I did this by using having a data type I called a "game" that I passed around through the search tree made by the algorithm. To increase efficiency I didn't ever copy the "game" data type but rather mutated it while keeping the nessisary information needed to return it to its previous states.
Recently I have been reading about functional programming and the concept of purely using functions that do not change the state of the parameters they are passed appeals to me. I am wondering how I would using the paradigm of functional programming while still taking efficiency of the program into account.
In OOP the solution seems quite straight forward (which is what I implemented) while in functional programming it seems that copying data types is nessisary which decreases efficiency. Is it possible to use functional programming without this loss of efficiency?
In functional programming, data structures are not always copied completely. In many cases, only the part that changes needs to be copied, while the old parts can be referenced (since no mutation is allowed, this is safe).
The article on persistant data structures describes this in more detail.
Jephron's answer points out the important fact that only small parts of a persistent data structure need to get updated, thus the bigger part is shared between the old state and the new state.
To be honest, this would still be slower than a mutation in most cases.
But immutable, persistent data structures have other advantages. Let's assume you have already completed the playing engine. And now, you want to implement a history (for example to allow the player to undo earlier moves). This is dead simple: Just remember all states in a list. You'll find that you need to touch only a few functions to take a list of states instead of just the last state, and you're done. You don't need to worry about compromising your game engine --- there is no global variable or something you could destroy.
Another thing is taking advantage of the many CPU cores you probably have by employing parallelism. Needless to say that you can't let many tasks, threads, fibers or whatever operate on a single mutable data structure. This would just become a synchronization nightmare, and your code would probably go slower even. However, there simply are no synchronization problems on immutable data, as they are read only for all threads.
This could very well speed up your code in such a way that it dwarfs the C++ solution, even if "doing a move" on a functional data structure is much slower than on mutable data.
Here is an example for changing a board game (TTT) from single threaded to parallel: https://dierk.gitbooks.io/fregegoodness/content/src/docs/asciidoc/incremental_episode4.html

Preventing performance regressions in R

What is a good workflow for detecting performance regressions in R packages? Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
What is a good workflow in general? What other languages provide good tools? Is it something that can be built on top unit testing, or that is usually done separately?
This is a very challenging question, and one that I'm frequently dealing with, as I swap out different code in a package to speed things up. Sometimes a performance regression comes along with a change in algorithms or implementation, but it may also arise due to changes in the data structures used.
What is a good workflow for detecting performance regressions in R packages?
In my case, I tend to have very specific use cases that I'm trying to speed up, with different fixed data sets. As Spacedman wrote, it's important to have a fixed computing system, but that's almost infeasible: sometimes a shared computer may have other processes that slow things down 10-20%, even when it looks quite idle.
My steps:
Standardize the platform (e.g. one or a few machines, a particular virtual machine, or a virtual machine + specific infrastructure, a la Amazon's EC2 instance types).
Standardize the data set that will be used for speed testing.
Create scripts and fixed intermediate data output (i.e. saved to .rdat files) that involve very minimal data transformations. My focus is on some kind of modeling, rather than data manipulation or transformation. This means that I want to give exactly the same block of data to the modeling functions. If, however, data transformation is the goal, then be sure that the pre-transformed/manipulated data is as close as possible to standard across tests of different versions of the package. (See this question for examples of memoization, cacheing, etc., that can be used to standardize or speed up non-focal computations. It references several packages by the OP.)
Repeat tests multiple times.
Scale the results relative to fixed benchmarks, e.g. the time to perform a linear regression, to sort a matrix, etc. This can allow for "local" or transient variations in infrastructure, such as may be due to I/O, the memory system, dependent packages, etc.
Examine the profiling output as vigorously as possible (see this question for some insights, also referencing tools from the OP).
Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
Unfortunately, I don't have an answer for this.
What is a good workflow in general?
For me, it's quite similar to general dynamic code testing: is the output (execution time in this case) reproducible, optimal, and transparent? Transparency comes from understanding what affects the overall time. This is where Mike Dunlavey's suggestions are important, but I prefer to go further, with a line profiler.
Regarding a line profiler, see my previous question, which refers to options in Python and Matlab for other examples. It's most important to examine clock time, but also very important to track memory allocation, number of times the line is executed, and call stack depth.
What other languages provide good tools?
Almost all other languages have better tools. :) Interpreted languages like Python and Matlab have the good & possibly familiar examples of tools that can be adapted for this purpose. Although dynamic analysis is very important, static analysis can help identify where there may be some serious problems. Matlab has a great static analyzer that can report when objects (e.g. vectors, matrices) are growing inside of loops, for instance. It is terrible to find this only via dynamic analysis - you've already wasted execution time to discover something like this, and it's not always discernible if your execution context is pretty simple (e.g. just a few iterations, or small objects).
As far as language-agnostic methods, you can look at:
Valgrind & cachegrind
Monitoring of disk I/O, dirty buffers, etc.
Monitoring of RAM (Cachegrind is helpful, but you could just monitor RAM allocation, and lots of details about RAM usage)
Usage of multiple cores
Is it something that can be built on top unit testing, or that is usually done separately?
This is hard to answer. For static analysis, it can occur before unit testing. For dynamic analysis, one may want to add more tests. Think of it as sequential design (i.e. from an experimental design framework): if the execution costs appear to be, within some statistical allowances for variation, the same, then no further tests are needed. If, however, method B seems to have an average execution cost greater than method A, then one should perform more intensive tests.
Update 1: If I may be so bold, there's another question that I'd recommend including, which is: "What are some gotchas in comparing the execution time of two versions of a package?" This is analogous to assuming that two programs that implement the same algorithm should have the same intermediate objects. That's not exactly true (see this question - not that I'm promoting my own questions, here - it's just hard work to make things better and faster...leading to multiple SO questions on this topic :)). In a similar way, two executions of the same code can differ in time consumed due to factors other than the implementation.
So, some gotchas that can occur, either within the same language or across languages, within the same execution instance or across "identical" instances, which can affect runtime:
Garbage collection - different implementations or languages can hit garbage collection under different circumstances. This can make two executions appear different, though it can be very dependent on context, parameters, data sets, etc. The GC-obsessive execution will look slower.
Cacheing at the level of the disk, motherboard (e.g. L1, L2, L3 caches), or other levels (e.g. memoization). Often, the first execution will pay a penalty.
Dynamic voltage scaling - This one sucks. When there is a problem, this may be one of the hardest beasties to find, since it can go away quickly. It looks like cacheing, but it isn't.
Any job priority manager that you don't know about.
One method uses multiple cores or does some clever stuff about how work is parceled among cores or CPUs. For instance, getting a process locked to a core can be useful in some scenarios. One execution of an R package may be luckier in this regard, another package may be very clever...
Unused variables, excessive data transfer, dirty caches, unflushed buffers, ... the list goes on.
The key result is: Ideally, how should we test for differences in expected values, subject to the randomness created due to order effects? Well, pretty simple: go back to experimental design. :)
When the empirical differences in execution times are different from the "expected" differences, it's great to have enabled additional system and execution monitoring so that we don't have to re-run the experiments until we're blue in the face.
The only way to do anything here is to make some assumptions. So let us assume an unchanged machine, or else require a 'recalibration'.
Then use a unit-test alike framework, and treat 'has to be done in X units of time' as just yet another testing criterion to be fulfilled. In other words, do something like
stopifnot( timingOf( someExpression ) < savedValue plus fudge)
so we would have to associate prior timings with given expressions. Equality-testing comparisons from any one of the three existing unit testing packages could be used as well.
Nothing that Hadley couldn't handle so I think we can almost expect a new package timr after the next long academic break :). Of course, this has to be either be optional because on a "unknown" machine (think: CRAN testing the package) we have no reference point, or else the fudge factor has to "go to 11" to automatically accept on a new machine.
A recent change announced on the R-devel feed could give a crude measure for this.
CHANGES IN R-devel UTILITIES
‘R CMD check’ can optionally report timings on various parts of the check: this is controlled by environment variables documented in ‘Writing R Extensions’.
See http://developer.r-project.org/blosxom.cgi/R-devel/2011/12/13#n2011-12-13
The overall time spent running the tests could be checked and compared to previous values. Of course, adding new tests will increase the time, but dramatic performance regressions could still be seen, albeit manually.
This is not as fine grained as timing support within individual test suites, but it also does not depend on any one specific test suite.

How Concurrent is Prolog?

I can't find any info on this online... I am also new to Prolog...
It seems to me that Prolog could be highly concurrent, perhaps trying many possibilities at once when trying to match a rule. Are modern Prolog compilers/interpreters inherently* concurrent? Which ones? Is concurrency on by default? Do I need to enable it somehow?
* I am not interested in multi-threading, just inherent concurrency.
Are modern Prolog compilers/interpreters inherently* concurrent? Which ones? Is concurrency on by default?
No. Concurrent logic programming was the main aim of the 5th Generation Computer program in Japan in the 1980s; it was expected that Prolog variants would be "easily" parallelized on massively parallel hardware. The effort largely failed, because automatic concurrency just isn't easy. Today, Prolog compilers tend to offer threading libraries instead, where the program must control the amount of concurrency by hand.
To see why parallelizing Prolog is as hard as any other language, consider the two main control structures the language offers: conjunction (AND, serial execution) and disjunction (OR, choice with backtracking). Let's say you have an AND construct such as
p(X) :- q(X), r(X).
and you'd want to run q(X) and r(X) in parallel. Then, what happens if q partially unifies X, say by binding it to f(Y). r must have knowledge of this binding, so either you've got to communicate it, or you have to wait for both conjuncts to complete; then you may have wasted time if one of them fails, unless you, again, have them communicate to synchronize. That gives overhead and is hard to get right. Now for OR:
p(X) :- q(X).
p(X) :- r(X).
There's a finite number of choices here (Prolog, of course, admits infinitely many choices) so you'd want to run both of them in parallel. But then, what if one succeeds? The other branch of the computation must be suspended and its state saved. How many of these states are you going to save at once? As many as there are processors seems reasonable, but then you have to take care to not have computations create states that don't fit in memory. That means you have to guess how large the state of a computation is, something that Prolog hides from you since it abstracts over such implementation details as processors and memory; it's not C.
In other words, automatic parallelization is hard. The 5th Gen. Computer project got around some of the issues by designing committed-choice languages, i.e. Prolog dialects without backtracking. In doing so, they drastically changed the language. It must be noted that the concurrent language Erlang is an offshoot of Prolog, and it too has traded in backtracking for something that is closer to functional programming. It still requires user guidance to know what parts of a program can safely be run concurrently.
In theory that seems attractive, but there are various problems that make such an implementation seem unwise.
for better or worse, people are used to thinking of their programs as executing left-to-right and top-down, even when programming in Prolog. Both the order of clauses for a predicate and of terms within a clause is semantically meaningful in standard Prolog. Parallelizing them would change the behaviour of far too much exising code to become popular.
non-relational language elements such as the cut operator can only be meaningfully be used when you can rely on such execution orders, i.e. they would become unusable in a parallel interpreter unless very complicated dependency tracking were invented.
all existing parallelization solutions incur at least some performance overhead for inter-thread communication.
Prolog is typically used for high-level, deeply recursive problems such as graph traversal, theorem proving etc. Parallelization on a modern machines can (ideally) achieve a speedup of n for some constant n, but it cannot turn an unviable recursive solution method into a viable one, because that would require an exponential speedup. In contrast, the numerical problems that Fortran and C programmers usually solve typically have a high but quite finite cost of computation; it is well worth the effort of parallelization to turn a 10-hour job into a 1-hour job. In contrast, turning a program that can look about 6 moves ahead into one that can (on average) look 6.5 moves ahead just isn't as compelling.
There are two notions of concurrency in Prolog. One is tied to multithreading, the other to suspended goals. I am not sure what you want to know. So I will expand a little bit about multithreading first:
Today widely available Prolog system can be differentiated whether they are multithreaded or not. In a multithreaded Prolog system you can spawn multiple threads that run concurrently over the same knowledge base. This poses some problems for consult and dynamic predicates, which are solved by these Prolog systems.
You can find a list of the Prolog systems that are multithreaded here:
Operating system and Web-related features
Multithreading is a prerequesite for various parallelization paradigmas. Correspondingly the individudal Prolog systems provide constructs that serve certain paradigmas. Typical paradigmas are thread pooling, for example used in web servers, or spawning a thread for long running GUI tasks.
Currently there is no ISO standard for a thread library, although there has been a proposal and each Prolog system has typically rich libraries that provide thread synchronization, thread communication, thread debugging and foreign code threads. A certain progress in garbage collection in Prolog system was necessary to allow threaded applications that have potentially infinitely long running threads.
Some existing layers even allow high level parallelization paradigmas in a Prolog system independent fashion. For example Logtalk has some constructs that map to various target Prolog systems.
Now lets turn to suspended goals. From older Prolog systems (since Prolog II, 1982, in fact) we know the freeze/2 command or blocking directives. These constructs force a goal not to be expanded by existing clauses, but instead put on a sleeping list. The goal can the later be woken up. Since the execution of the goal is not immediate but only when it is woken up, suspended goals are sometimes seen as concurrent goals,
but the better notion for this form of parallelism would be coroutines.
Suspended goals are useful to implement constraint solving systems. In the simplest case the sleeping list is some variable attribute. But a newer approach for constraint solving systems are constraint handling rules. In constraint handling rules the wake up conditions can be suspended goal pair patterns. The availability of constraint solving either via suspended goals or constraint handling rules can be seen here:
Overview of Prolog Systems
Best Regards
From a quick google search it appears that the concurrent logic programming paradigm has only been the basis for a few research languages and is no longer actively developed. I have seen claims that concurrent logic is easy to do in the Mozart/Oz system.
There was great hope in the 80s/90s to bake parallelism into the language (thus making it "inherently" parallel), in particular in the context of the Fifth Generation Project. Even special hardware constructs were studied to implement "Parallel Inference Machine" (PIM) (Similar to the special hardware for LISP machines in the "functional programming" camp). Hardware efforts were abandoned due to continual improvement of off-the-shelf CPUs and software effort were abandoned due to excessive compiler complexity, lack of demand for hard-to-implement high-level features and likely lack of payoff: parallelism that looks transparent and elegantly exploitable at the language level generally means costly inter-process communication and transactional locking "under the hood".
A good read about this is
"The Deevolution of Concurrent Logic Programming Languages"
by Evan Tick, March 1994. Appeared in "Journal of Logic Programming, Tenth Anniversary Special Issue, 1995". The Postscript file linked to is complete, unlike the PDF you get at Elsevier.
The author says:
There are two main views of concurrent logic programming and its
development over the past several years [i.e. 1990-94]. Most logic programming
literature views concurrent logic programming languages as a
derivative or variant of logic programs, i.e., the main difference
being the extensive use of "don't care" nondeterminism rather than
"don't know" (backtracking) nondeterminism. Hence the name committed
choice or CC languages. A second view is that concurrent logic
programs are concurrent, reactive programs, not unlike other
"traditional" concurrent languages such as 'C' with explicit message
passing, in the sense that procedures are processes that communicate
over data streams to incrementally produce answers. A cynic might say
that the former view has more academic richness, whereas the latter
view has more practical public relations value.
This article is a survey of implementation techniques of concurrent
logic programming languages, and thus full disclosure of both of these
views is not particularly relevant. Instead, a quick overview of basic
language semantics, and how they relate to fundamental programming
paradigms in a variety of languages within the family, will suffice.
No attempt will be made to cover the many feasible programming
paradigms; nor semantical nuances, nor the family history. (...).
The main point I wish to make in this article is that concurrent logic
programming languages have been deevolving since their inception,
about ten years ago, because of the following tatonnement:
Systems designers and compiler writers could supply only certain limited features in robust; efficient implementations. This drove the
market to accept these restricted languages as, in some informal
sense, de facto standards.
Programmers became aware that certain, more expressive language features were not critically important to getting applications
written, and did not demand their inclusion.
Thus my stance in this article will be a third view: how the initially
rich languages gradually lost their "teeth," and became weaker, but
more practically implementable, and achieved faster performance.
The deevolutionary history begins with Concurrent Prolog (deep guards,
atomic unification; read-only annotated variables for
synchronization), and after a series of reductions (for example: GHC
(input-matching synchronization), Parlog (safe), FCP (flat), Fleng (no
guards), Janus (restricted communication), Strand (assignment rather
than output unification)), and ends for now with PCN (flat guards,
non-atomic assignments input-matching synchronization, and
explicitly-defined mutable variables). This and other terminology will
be defined as the article proceeds.
This view may displease some
readers because it presupposes that performance is the main driving
force of the language market; and furthermore that the main "added
value" of concurrent logic programs over logic programs is the ability
to naturally exploit parallelism to gain speed. Certainly the reactive
nature of the languages also adds value; e.g., in building complex
object-oriented applications. Thus one can argue that the deevolution
witnessed is a bad thing when reactive capabilities are being traded
for speed.
ECLiPSe-CLP, a language "largely backward-compatible with Prolog", supports OR-parallelism, even though "this functionality is currently not actively maintained because of other priorities".
[1,2] document OR- (and AND-)parallelism in ECLiPSe-CLP.
However, I tried to get it working some time using the code from ECLiPSe-CLP's repository, but I didn't get it though.
[1] http://eclipseclp.org/reports/book.ps.gz
[2] http://eclipseclp.org/doc/bips/kernel/compiler/parallel-1.html

Understanding parallel usage of Fortran 90

y(1:n-1) = a*y(2:n) + x(1:n-1)
y(n) = c
In the above Fortran 90 code I want to know how it is executed in term of synchronization, communication and arithmetic.
What I understand is:
Communication is the need for different task to communication with each other. E.g. if there's some variable that have dependencies with some other variable. But the above code doesn't show that there is some communication. As it seems to be no dependencies, am I right?
Synchronization is somewhat related to communication, but it also involves if there has been some use of barriers. But in the above code there is no barrier. Therefore the only synchronization that is involved is if there are any data dependencies.
Arithmetic I have no clue regarding this point, and would be gladly if someone could explain it to me.
The rule in Fortran is fairly simple: the right hand side is completely evaluated before the result is assigned to the left.
Thus you could claim there is a communication upon assigning (sending the result to y), which is at the same time a synchronization point.
The actual evaluation of the right side could be vectorized/parallelized by the compiler, resulting in arbitrary orders of the evaluations for all entries in the array, except for the last one, which is only set after the first assignment.
However, except for pipelining, there is no real parallelism introduced here by common compilers.
Without stopping too much at the given snippet, it looks you could perhaps be interested (tell me if I'm wrong) at for example, Using OpenMP book (presentation here). It is a nice gentle introduction to the world of parallel computing (memory shared). For larger systems you would do well to google "MPI" and its related subjects. There is really a plethora of material on the matter (a lot of them deal with fortran+mpi / fortran+openmp) so I'll skip giving examples here.
Is this what you were aiming for?

Resources