From explicit threading to TBB - performance

Assuming I have a parallel algorithm that uses explicit threading with one or two locks for synchronization and is optimized to take advantage of cache lines (including shared L3 cache between multiple cores), what are good ways of incorporating that into a TBB program? The algorithm in question does not break down as nicely into tasks as it does into threads.

Without any further information (a code sample, or some generic presentation of the algorithm, such as a flowchart), I would say that the best way to make such a transition is to refactor the algorithm: isolate the repeatable actions and try to combine them into tasks (a range of one or more activities that have a common purpose).
Unfortunately, there is no magic formula for this transition. The two techniques are fundamentally two answers to the same problem, so they share some common ground, but they tackle the problem from different angles.

Related

What is an approach for designing complex FSMs?

At work, we use FSMs. Recently, I had to design an FSM for a problem that I deem "a little too complex for a simple FSM". Why? Because the problem has about 6 different data dimensions, and many permutations of this data impact the behaviour of the solution significantly. My brain thinks "6 data attributes means 2^6 +1 permutations of this data" if it were all boolean data. Furthermore, there are about 8 inputs that can happen at any given time.
This problem made me aware that my FSM creating skills stop at simple problems used in my hobby projects. At work, we are constrained to use FSMs. That means, I cannot just say "this problem is outside of the scope of FSMs. I'll use something else." Indeed, the FSM platform we have in place does provide a lot of power for our solutions.
Question: What is an approach for designing an FSM when the problem is sufficiently complex? I've researched a bit on this and found a few papers which, honestly, didn't help me much. I hope there are some best practices for this, and all I'm asking for is one. Please and thanks.
I suppose that you might be experiencing the usual "state-transition explosion", which is the known problem of traditional "flat" FSMs. Traditional FSMs "explode" because they force the same reactions to be repeated in many states. FSMs lack any mechanism to capture commonalities of behavior among states. The long-known solution is to use Hierarchical State Machines (a.k.a. Harel statecharts or UML state machines). HSMs support the concept of state nesting, in which substates inherit behavior from the surrounding superstate(s). When used correctly, state nesting eliminates the repetitions and counteracts the "explosion" problem. Most non-trivial problems are not really tractable with FSMs, but are quite manageable with HSMs.
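To make the state-nesting idea concrete, here is a minimal C++ sketch. It is not taken from any particular FSM platform: the states (`Operational`, `Idle`, `Running`, `Failed`), the events, and the `dispatch` helper are all hypothetical names invented for illustration. The point it tries to show is that the `Fault` reaction is written once, in the superstate, instead of being repeated in every substate.

```cpp
#include <cassert>

// Hypothetical events, invented for this illustration.
enum class Event { Start, Pause, Fault };

struct State {
    const char* name;
    const State* parent;            // enclosing superstate, or nullptr at the top level
    // Returns the target state, or nullptr if this state does not handle the event.
    const State* (*handle)(Event);
};

extern const State Operational, Idle, Running, Failed;

// The superstate handles Fault once, on behalf of all of its substates.
const State* operationalHandle(Event e) { return e == Event::Fault ? &Failed : nullptr; }
const State* idleHandle(Event e)        { return e == Event::Start ? &Running : nullptr; }
const State* runningHandle(Event e)     { return e == Event::Pause ? &Idle : nullptr; }
const State* failedHandle(Event)        { return nullptr; }

const State Operational{"Operational", nullptr, operationalHandle};
const State Idle{"Idle", &Operational, idleHandle};
const State Running{"Running", &Operational, runningHandle};
const State Failed{"Failed", nullptr, failedHandle};

// Offer the event to the current state first, then to each enclosing superstate.
// An event that nobody handles leaves the machine in its current state.
const State* dispatch(const State* current, Event e) {
    assert(current != nullptr);
    for (const State* s = current; s != nullptr; s = s->parent) {
        if (const State* target = s->handle(e)) {
            return target;
        }
    }
    return current;
}
```

With this layout, `dispatch(&Idle, Event::Fault)` and `dispatch(&Running, Event::Fault)` both end up in `Failed` even though neither substate mentions `Fault`. A real statechart framework would also deal with entry/exit actions and transitions into nested targets, which this sketch leaves out.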

Different approaches to an allocation algorithm for processes?

I am looking to implement an automated way of allocating processes to the servers available. There are many types of servers (characterized by things like location, CPU, network card, etc.) and there are various types of processes (more than there are servers) with different priorities and location/hardware requirements. I can think of simple greedy algorithms, but I was wondering what other references and approaches exist for this type of problem (which I feel is pretty standard). I am also interested in solving a related problem: say we remove one of the servers after things have been allocated, and we need to reshuffle with minimal interference. I feel this one is standard too, but I'm not sure what good references to look at. Any suggestions on where to start?
Your question is pretty vague. Normally problems like this are handled either by modeling them as a set of linear equations and optimizing an objective function given the linear constraints, or the problem is modeled as a knapsack problem.
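As a rough illustration of the knapsack view, the sketch below covers only a drastically simplified version of the problem: a single server with one integer CPU capacity, where each process has a CPU cost and a priority. The `Process` fields and the `bestPriority` function are assumptions made up for this example; the real problem, with several servers and location/hardware constraints, is closer to a multiple-knapsack or integer-programming formulation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical process description: CPU units required and a priority score.
struct Process {
    int cpu;
    int priority;
};

// Classic 0/1 knapsack by dynamic programming: the maximum total priority of a
// subset of processes whose CPU demand fits within `capacity` on one server.
int bestPriority(const std::vector<Process>& procs, int capacity) {
    assert(capacity >= 0);
    // best[c] = best total priority achievable with at most c CPU units.
    std::vector<int> best(static_cast<std::size_t>(capacity) + 1, 0);
    for (const Process& p : procs) {
        assert(p.cpu >= 0);
        // Iterate capacity downwards so that each process is placed at most once.
        for (int c = capacity; c >= p.cpu; --c) {
            const std::size_t with = static_cast<std::size_t>(c);
            const std::size_t without = static_cast<std::size_t>(c - p.cpu);
            best[with] = std::max(best[with], best[without] + p.priority);
        }
    }
    return best[static_cast<std::size_t>(capacity)];
}
```

For processes `{2,3}, {3,4}, {4,5}, {5,6}` (CPU, priority) and a capacity of 5, the best choice is the first two processes, for a total priority of 7. A greedy pick of the single highest-priority process would reach only 6.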

Most effective method to use parallel computing on different architectures

I am planning to write something to take advantage of the many devices that I have at home.
Basically, my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so it would be great for me to take advantage of everything I have available at home.
I am mainly using OS X, so I need something that works with that OS. I can code in Objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning not to use the full power of my desktop, but only some of the available cores (this way I can continue to work on the desktop doing other tasks that are not so resource intensive). In particular, I would love to use the graphics card (it is an ATI card, so no CUDA), since all I mainly do is spreadsheets, word processing and coding with Xcode, and the graphics card is basically unused in that scenario.
Is there a specific set of libraries or APIs, among the aforementioned three, that would allow me to selectively route tasks and use resources on another machine without leaving the control totally to the compiler? I've heard that GCD is great but gives very limited control over where the blocks are executed, while MPI is on the other side of the spectrum. OpenCL seems to be in the middle.
Before diving into one of these technologies, I would like to know which one would most likely suit my needs; I am sure that some other researcher has already successfully used parallel computing to achieve what I am trying to achieve.
Thanks in advance.
MPI is aimed more at large-scale scientific computing, with many processors and many nodes, than at a weekend project. For what you describe I would suggest OpenCL, or a message-queue framework such as ZeroMQ or RabbitMQ (an AMQP broker), or a combination of OpenCL and message queuing. Even simpler, consider plain multithreading; I would suggest OpenMP for that. I'm not sure whether you are looking for direct solvers or parallel functions, but many of both already exist for GPUs and CPUs and can be found on the web.
Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built into OS X and requires no additional dependencies) and figure out how to factor your code so that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and whether there even are any: if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straightforward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem, and either way you still want to use all the cores available on each target machine, so the two challenges are both complementary and orthogonal).

Implementations of algorithms for evaluating circuits

Consider the problem of circuit evaluation, where the input is a boolean circuit C and an input string x and you want to compute C(x). (Assume fan-in 2 if you like.)
This is a 'trivial' problem algorithmically, however it appears non-trivial to implement when C can be huge (think several million gates) and memory management becomes an issue.
There are several ways this problem can be approached, trading off memory, time, and disc access. But before going through all this work myself, does anyone know of any existing implementations of algorithms for this problem? It would be surprising to me if none exist...
For C/C++, the standard digital circuit design & simulation system for more than 10 years now is SystemC.
It is a library that allows you to design digital logic in C++. There is supporting software that allows you to do timing analysis and even generate a schematic netlist for C code.
I've only played with it a little before deciding that I was more comfortable with Verilog. But it is a mature piece of software with lots of industry support. Googling around will yield a lot of information including several tutorial pages.
It sounds like Binary Decision Diagrams could be used for your task? There are well-known algorithms (and implementations) of these which are very compact in terms of memory usage, given that they are designed to be used on huge state spaces.
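For reference, the "trivial" in-memory evaluation that the question starts from can be sketched as below. This is neither SystemC nor a BDD package. The `Gate` layout and the `evaluate` function are assumptions made for illustration, and the sketch assumes the gates are already stored in topological order (every gate refers only to earlier gates) with the last gate as the output. Using `std::vector<bool>` keeps one bit per gate, which matters when the circuit has millions of gates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op : std::uint8_t { Input, And, Or, Not };

// Hypothetical gate record with fan-in at most 2.
// Input: `a` is an index into x.  Not: only `a` is used.
// And/Or: `a` and `b` are indices of earlier gates.
struct Gate {
    Op op;
    std::uint32_t a;
    std::uint32_t b;
};

// Evaluates a circuit whose gates are listed in topological order.
// The value of the last gate is taken as the circuit output C(x).
bool evaluate(const std::vector<Gate>& gates, const std::vector<bool>& x) {
    assert(!gates.empty());
    std::vector<bool> value(gates.size());  // bit-packed: one bit per gate
    for (std::size_t i = 0; i < gates.size(); ++i) {
        const Gate& g = gates[i];
        switch (g.op) {
            case Op::Input:
                assert(g.a < x.size());
                value[i] = x[g.a];
                break;
            case Op::Not:
                assert(g.a < i);
                value[i] = !value[g.a];
                break;
            case Op::And:
                assert(g.a < i && g.b < i);
                value[i] = value[g.a] && value[g.b];
                break;
            case Op::Or:
                assert(g.a < i && g.b < i);
                value[i] = value[g.a] || value[g.b];
                break;
        }
    }
    return value.back();
}
```

As a usage example, XOR can be written as `(x0 OR x1) AND NOT (x0 AND x1)` with six gates: two inputs, an `Or`, an `And`, a `Not` of the `And`, and a final `And`. A single pass like this takes time and memory linear in the number of gates; streaming the gates from disk in the same order would be one way to trade memory for I/O.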

What are some hints that an algorithm should be parallelized?

My experience thus far has shown me that even with multi-core processors, parallelizing an algorithm won't always speed it up noticeably. In fact, sometimes it can slow things down. What are some good hints that an algorithm can be sped up significantly by being parallelized?
(Of course given the caveats with premature optimization and their correlation to evil)
To gain the most benefit from parallelisation, a task should be able to be broken into similar-sized, coarse-grained chunks that are independent (or mostly so) and require little communication of data or synchronisation between the chunks.
Fine-grained parallelisation almost always suffers from increased overheads, and will have a limited speed-up regardless of the number of physical cores available.
[The caveat to this is architectures that have a very large number of 'cores' (such as the Connection Machine, with up to 65,536 processors). These are well suited to calculations that can be broken into relatively simple actions assigned to a particular topology (like a rectangular mesh).]
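A minimal sketch of the coarse-grained decomposition described in this answer, using only the C++ standard library: the data is cut into a few similar-sized, independent chunks, each chunk is summed by its own asynchronous task, and the only synchronisation is collecting one result per chunk at the end. The function name `parallelSum` and the choice of summation as the workload are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums `data` by splitting it into at most `chunks` contiguous, similar-sized pieces.
// Each piece is independent, so the tasks share no mutable state and need no locks.
long long parallelSum(const std::vector<int>& data, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t step = (data.size() + chunks - 1) / chunks;  // ceiling division
    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += step) {
        const std::size_t end = std::min(begin + step, data.size());
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(first, last, 0LL);
        }));
    }
    // The only communication: one partial result per chunk.
    long long total = 0;
    for (auto& part : parts) {
        total += part.get();
    }
    return total;
}
```

Choosing `chunks` close to the number of cores keeps the chunks coarse. Pushing it towards one task per element is the fine-grained case, where task creation and result collection are likely to cost more than the additions themselves.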
If you can divide the work into independent parts then it may be parallelized well.
Remember also Amdahl's Law, which is a sobering reminder of how little we can expect in terms of performance gains by adding more cores to most programs.
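Amdahl's Law itself is short enough to write down: if a fraction p of the running time can be parallelized and the rest is serial, the speed-up on n cores is 1 / ((1 - p) + p / n). The helper below is just that formula; the name `amdahlSpeedup` is made up for this sketch.

```cpp
#include <cassert>

// Amdahl's Law: speed-up on n cores when a fraction p of the work is parallelizable.
// p must be in [0, 1] and n must be at least 1.
double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with 90% of the work parallelizable, 8 cores give a speed-up of only about 4.7, and no number of cores can push it past 10, because the serial 10% still has to run.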
First, check out this paper by the late Jim Gray:
Distributed Computing Economics
Actually, this will clear up some misunderstanding based on what you wrote in the question. Obviously, the less amenable your problem set is to being discretized, the more difficult it will be.
Any time you have computations that depend on previous computations, it is not a parallel problem. Things like linear image processing, brute force methods, and genetic algorithms are all easily parallelized.
A good analogy: what kind of job could you split among a bunch of friends so that they work on different parts at once? For example, putting IKEA furniture together might parallelize well if different people can work on different sections, but hanging wallpaper might not, because you need to do the walls in sequence.
If you're doing large matrix computations, like simulations involving finite element models, these can often be broken down into smaller pieces in straightforward ways. Matrix-vector multiplies can benefit from parallelization, assuming you are dealing with very large matrices. Unless there is a real performance bottleneck that is causing code to run slowly, it's probably not worth the hassle of parallel processing.
Well, if you need lots of locks for it to work, then it's probably one of those difficult algorithms that doesn't parallelise well. Is there any part of the algorithm that can be broken up into separate parts that don't need to touch each other?

Resources