Recommended method to parallelize a multipass algorithm

I am writing a decoder and encoder for a custom video format (QTC). The decoding process consists of multiple stages, where the output of each stage is passed to the next stage:
1. deserialize the input stream
2. generate a sequence of symbols with a range coder
3. generate a stream of images from the symbol stream
4. serialize the image stream into an output format
Steps three and four take up almost all of the processing time: step three takes roughly 35% and step four about 60%, while the first and last steps are rather trivial.
What is the recommended and idiomatic way to run the four steps in parallel? I am mostly interested in how to handle the communication between the parts. I plan to use one goroutine for step two and one for step three, with the routines connected by a buffered channel. Is this the right way?

For some tasks having a "shared" data structure "protected" by careful use of mutexes is easier, but a buffered channel would be the "standard" way to do this in Go and for your task it sounds like the right solution, too.
Are you running into any problems with it since you are asking here?
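To make that concrete, here is a minimal sketch of such a pipeline, one goroutine per stage connected by buffered channels. The Symbol and Image types and all stage bodies are placeholders, not the actual QTC codec logic, and the channel capacities are guesses to be tuned by benchmarking:

    package main

    import "fmt"

    // Placeholder types standing in for the real codec's data.
    type Symbol int
    type Image struct{}

    // Stage 2: turn raw bytes into symbols (stand-in for the range coder).
    func decodeSymbols(in <-chan byte, out chan<- Symbol) {
        for b := range in {
            out <- Symbol(b)
        }
        close(out)
    }

    // Stage 3: turn symbols into images (stand-in for image reconstruction).
    func buildImages(in <-chan Symbol, out chan<- Image) {
        for range in {
            out <- Image{}
        }
        close(out)
    }

    func main() {
        raw := make(chan byte, 64)       // stage 1 -> stage 2
        symbols := make(chan Symbol, 64) // stage 2 -> stage 3
        images := make(chan Image, 8)    // stage 3 -> stage 4

        go func() { // stage 1: deserialize the input stream
            for _, b := range []byte("example input") {
                raw <- b
            }
            close(raw)
        }()
        go decodeSymbols(raw, symbols)
        go buildImages(symbols, images)

        for range images { // stage 4: serialize the image stream
            fmt.Println("wrote a frame")
        }
    }

Each stage closing its output channel when its input is exhausted is what lets the next stage's range loop terminate cleanly.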

Do not overthink. Just write some code. Go code is easy to change - the compiler holds your hand. Write some tests / benchmarks to keep yourself honest.
Alex

Related

Implementing a MapReduce skeleton in Erlang

I am fairly new to both parallel programming and the Erlang language and I'm struggling a bit.
I'm having a hard time implementing a mapreduce skeleton. I spawn M mappers (their task is to map the power function over a list of floats) and R reducers (they sum the elements of the input list sent by the mappers).
What I then want to do is send the intermediate results of each mapper to a random reducer; how do I go about linking one mapper to a reducer?
I have looked around the internet for examples. The closest thing to what I want that I could find is this word counter example; the author seems to have found a clever way to link a mapper to a reducer, and the logic makes sense. However, I have not been able to tweak it to fit my particular needs. Maybe the key-value implementation is not suitable for finding the sum of a list of powers?
Any help, please?
Just to give an update: apparently there were problems with the algorithm linked in the OP. It looks like there is something wrong with the synchronization protocol, which is hinted at by the presence of the call to the sleep() function (i.e. it's not supposed to be there).
For a good working implementation of the map/reduce framework, please refer to Joe Armstrong's version in the Programming Erlang book (2nd ed).
Armstrong's version only uses one reducer, but it can easily be modified to use more reducers in order to eliminate the bottleneck.
I have also added a function to split the input list into chunks. Each mapper will get a chunk of data.
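As a rough sketch of that chunking idea, translated into Go for illustration (the input data, chunk size, and squaring step are all made up): each mapper gets one chunk and pre-sums it, and a single reducer, as in Armstrong's design, totals the partial results.

    package main

    import "fmt"

    // chunks splits xs into slices of at most n elements; each mapper
    // receives one chunk.
    func chunks(xs []float64, n int) [][]float64 {
        var out [][]float64
        for len(xs) > n {
            out = append(out, xs[:n])
            xs = xs[n:]
        }
        if len(xs) > 0 {
            out = append(out, xs)
        }
        return out
    }

    func main() {
        input := []float64{1, 2, 3, 4, 5, 6, 7}
        partials := make(chan float64)

        cs := chunks(input, 3)
        for _, c := range cs { // spawn one mapper per chunk
            go func(c []float64) {
                sum := 0.0
                for _, x := range c {
                    sum += x * x // map the power function, pre-summing locally
                }
                partials <- sum
            }(c)
        }

        total := 0.0
        for i := 0; i < len(cs); i++ { // the single reducer
            total += <-partials
        }
        fmt.Println(total) // 140
    }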

Is there guaranteed order of subscribers in chronicle queue?

I'm looking at Chronicle and I don't understand one thing.
An example: I have a queue with one writer - a market data provider that writes tick data as it appears.
Let's say the queue has 10 readers - each reader is a different trading strategy that reads the new tick and might send a buy or sell order; let's name them Strategy1 .. Strategy10.
Let's say there is a rule that I can have only one trade open at any given time.
Now the problem: as far as I understand, there is no guarantee on the order in which these subscribed readers process the tick event. Each strategy is subscribed to the queue, so each of them will get the new tick asynchronously.
So when I run it for the first time, it might be that Strategy1 receives the tick first and places the order; all the other strategies will then be unable to place their orders.
If I replay the same sequence of events, it might be that a different strategy processes the tick first and places its order.
This will result in totally different results while using the same sequence of initial events.
Am I understanding something wrong or is this how it really works?
What are the possible solutions to this problem?
What I want to achieve is that the same sequence of source events (ticks) always produces the same sequence of trades.
If you want determinism in your system, then you need to remove sources of non-determinism. In your case, since you can only have a single trade open at one time, it sounds like it would be sensible to run all 10 strategies on a single thread (reader). This would also remove the need for any synchronisation on the reader side to ensure that there is only one open trade.
You can then use some fixed ordering to your strategies (e.g. round-robin) that will always produce the same output for a given set of inputs. Alternatively, if the decision logic is too expensive to run in serial, it could be executed in parallel, with each reader feeding into some form of gate (a Phaser-like structure), on which a decision about what strategy to use can be made deterministically. The drawback to this design is that the slowest trading strategy would hold back all the others.
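A minimal sketch of the single-reader variant (in Go for brevity; the Tick and Strategy types and the strategy bodies are invented for illustration): one thread drains the ticks and consults the strategies in a fixed priority order, so replaying the same ticks always yields the same trades.

    package main

    import "fmt"

    // Hypothetical tick and strategy types.
    type Tick struct{ Price float64 }
    type Strategy func(Tick) (order string, ok bool)

    // runReader processes all ticks on one thread, asking the strategies
    // in a fixed order; the first willing strategy places the trade.
    func runReader(ticks []Tick, strategies []Strategy) {
        openTrade := false
        for _, t := range ticks {
            if openTrade {
                continue // rule: only one trade open at any given time
            }
            for _, s := range strategies { // deterministic priority order
                if order, ok := s(t); ok {
                    fmt.Println("placing", order)
                    openTrade = true
                    break
                }
            }
        }
    }

    func main() {
        never := func(t Tick) (string, bool) { return "", false }
        buyBelow100 := func(t Tick) (string, bool) { return "BUY", t.Price < 100 }
        runReader([]Tick{{101}, {99}, {98}}, []Strategy{never, buyBelow100})
    }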
I think you need to make a choice about how much you want to run concurrently and independently, and how much you want in order and serial. I suggest you allow the strategies to place orders independently, but have the reader of those orders process them in the original order by checking a sequence number, such as the queue index in the first queue.
This way the reader of the orders will process them in the same order regardless of the order in which they were produced and written, which appears to be your goal.
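Sketching that suggestion (again in Go, with an invented Order record rather than Chronicle's actual API): each order carries the queue index of the tick that triggered it, and the order reader orders on that index with a fixed tie-break between strategies, so the processing order matches the source order no matter which strategy wrote first.

    package main

    import (
        "fmt"
        "sort"
    )

    // Hypothetical order record: which tick (queue index) triggered it
    // and which strategy produced it.
    type Order struct {
        TickIndex int
        Strategy  int
        Side      string
    }

    // processInOrder handles orders in source-tick order with a fixed
    // tie-break, making replay deterministic.
    func processInOrder(orders []Order) {
        sort.Slice(orders, func(i, j int) bool {
            if orders[i].TickIndex != orders[j].TickIndex {
                return orders[i].TickIndex < orders[j].TickIndex
            }
            return orders[i].Strategy < orders[j].Strategy
        })
        for _, o := range orders {
            fmt.Printf("tick %d: strategy %d -> %s\n", o.TickIndex, o.Strategy, o.Side)
        }
    }

    func main() {
        // Arrival order differs from source order; the output does not.
        processInOrder([]Order{{2, 1, "SELL"}, {1, 3, "BUY"}, {1, 1, "BUY"}})
    }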

What is the Random Control Logic in the 6502?

I am currently developing a subset of the 6502 in LogiSim and at the current stage I am determining which parts to implement and what can be cut out. One of my main resources is Hanson's Block Diagram.
I am currently trying to determine how exactly the instructions are decoded into the control lines. In the diagram below, there are two parts, the Decode ROM and the Random Control Logic.
How exactly does the 6502 decode the program instructions into control lines? As a follow-up, is it possible to simplify this area to eliminate the Random Control Logic and create the decoding with only one ROM?
I'm on the outskirts of my knowledge here, but my understanding is that the PLA decode ROM outputs its 130 control signals as a function of opcode and cycle, and the random control logic is a purely combinational unit that takes the PLA output as input in order to control the rest of the chip. I think you could combine the two into a single ROM. From looking at the die shot, the random logic is about twice as large as the PLA, so my guess would be that time/cost considerations led to the combined approach - possibly including intelligent task subdivision, and almost certainly including a calculation of debugging time, as the 6502 was laid out literally by hand, using pen and ruler.
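A toy model of the combined-ROM idea (in Go, with invented widths and contents; the real PLA emits on the order of 130 signals): index a flat table by opcode and cycle and read out the whole bundle of control lines at once.

    package main

    import "fmt"

    const maxCycles = 7 // illustrative bound on cycles per instruction

    // Each entry bundles the control lines for one (opcode, cycle) pair.
    var decodeROM [256][maxCycles]uint64

    func controlLines(opcode byte, cycle int) uint64 {
        return decodeROM[opcode][cycle]
    }

    func main() {
        decodeROM[0xA9][0] = 0b1010 // invented lines for LDA #imm, cycle 0
        fmt.Printf("%04b\n", controlLines(0xA9, 0))
    }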

What are the main advantages in using promises in scheme?

Pragmatically, what are the main advantages of using promises? Can you show me some examples of real-life useful usage of promises?
In Scheme, a promise is just a value whose computation is not necessarily done yet; if you never use the value, it will never be calculated. In short, it is a way to do lazy evaluation in the otherwise eager Scheme. A typical use is to do computations on streams instead of lists.
With lists you can use higher-order functions: you take a list, filter it for the values you are interested in, then transform those values, and perhaps at some point you have enough to produce the value you needed. This is nice since you can abstract each step so that each piece of logic does only one thing, and compose the steps into the whole program. In this scenario, however, the first step needs to finish in full before the next step can handle its result; if you are searching for the first prime number between 0 and 1000, iterating over all the numbers in each step might not be so effective. Here is where streams come in.
With streams the code looks the same, but the intermediate results are made by need. A stream is a pair whose parts are promises, so the code that would otherwise make a pair is delayed until the values are used. Every step produces just enough data for the next step; thus, should it be enough for the first step to iterate over just 20% of the elements for the last step to compute the final result, the remaining 80% will never be processed in any of the steps. With such a structure the initial stream can also be infinite, like all the numbers from 0 increased by 1.
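To make this concrete, here is a small sketch of the same promise-pair idea, written in Go rather than Scheme (the Stream type and its helpers are invented for illustration): the tail of the stream is a memoized thunk, i.e. a promise, forced only when someone asks for it.

    package main

    import "fmt"

    // Stream is a head value plus a promise for the rest: the tail thunk
    // runs at most once and its result is cached, like delay/force.
    type Stream struct {
        Head int
        tail func() *Stream
        memo *Stream
    }

    // Tail forces the promise, computing and caching the rest.
    func (s *Stream) Tail() *Stream {
        if s.memo == nil && s.tail != nil {
            s.memo = s.tail()
        }
        return s.memo
    }

    // From builds the infinite stream n, n+1, n+2, ...
    func From(n int) *Stream {
        return &Stream{Head: n, tail: func() *Stream { return From(n + 1) }}
    }

    // Filter lazily keeps the elements satisfying pred (assumes a stream
    // with infinitely many matches).
    func Filter(s *Stream, pred func(int) bool) *Stream {
        for !pred(s.Head) {
            s = s.Tail()
        }
        cur := s
        return &Stream{Head: cur.Head, tail: func() *Stream { return Filter(cur.Tail(), pred) }}
    }

    func main() {
        // Only as many numbers are ever generated as the consumer demands.
        evens := Filter(From(1), func(n int) bool { return n%2 == 0 })
        for i, s := 0, evens; i < 3; i, s = i+1, s.Tail() {
            fmt.Println(s.Head) // 2, 4, 6
        }
    }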
There are penalties involved in using streams. Imagine you have an algorithm that would visit all the elements anyway. A stream version of that algorithm would be slower, since the promises that are created and the subsequent forcing give the program overhead compared with doing the computation without laziness.
You might be interested in seeing Hal Abelson explaining streams and their pros and cons.
There are other alternatives to streams and lazy evaluation. One is generators: here you can also make composable procedures that take a generator and produce a generator, and the iteration happens by need, as with streams.
Another alternative is transducers. These are also composable and iterate like streams and generators, but unlike streams and generators, the initial data cannot be an infinite sequence unless the underlying structure supports it.
The advantages of using promises, or of any other technique in this answer, are not Scheme specific; they are true for all eager programming languages!

Design tips for synchronising signals through a VHDL pipeline

I am designing a video pixel data processing pipeline in VHDL which involves several steps including multiply and divide.
I want to keep signals synchronised so that I can e.g. maintain a sync signal and output it correctly at the end of the pipeline along with manipulated pixel data which has been through several processing stages.
I assume I want to use shift registers or something to delay signals by the right number of cycles so that the output is correct, but I'm looking for advice about good ways to design this, particularly as the number of pipeline stages for different signals may vary as I evolve the design.
Good question.
I'm not aware of a complete solution but here are two partial strategies...
Interconnecting components... It would be really nice if a component could export a generic whose value was its pipeline depth. Unfortunately you can't, and dedicating a port to this seems silly (though it's probably workable: as it would be an integer constant, it would disappear in synthesis).
Failing that, pass IN a generic indicating the budget for this module. Inside the module, assert (severity FAILURE) if the budget can't be met... (this assert is checkable at synth time and at least Xilinx XST handles similar asserts)
Make the budget a hard number, and either assert if not equal to actual pipeline depth, or add pipe stages inside the module if the budget is too large, and only assert if the budget is too small.
That way you are connecting predictable modules, and the top level can perform pipeline arithmetic to balance things (e.g. passing a computed constant value to a programmable delay line)
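As a toy model of that top-level pipeline arithmetic (sketched in Go for brevity rather than VHDL, with invented block names and depths): given each block's reported latency, compute how many extra delay stages each parallel path needs to match the deepest one.

    package main

    import "fmt"

    // balance returns, for each path, the number of extra register
    // stages a delay line must add so all paths match the deepest path.
    func balance(depths map[string]int) map[string]int {
        deepest := 0
        for _, d := range depths {
            if d > deepest {
                deepest = d
            }
        }
        pad := make(map[string]int)
        for name, d := range depths {
            pad[name] = deepest - d
        }
        return pad
    }

    func main() {
        // e.g. the sync flag bypasses the maths and needs the full delay
        fmt.Println(balance(map[string]int{"multiply": 4, "divide": 9, "sync": 0}))
    }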
Within a component... I use a single process, with registers represented as internal signals whose names reflect their pipe stage, exponent_1, exponent_2, exponent_3 and so on. Within the process, the first section describes all the actions for the first cycle, the second section describes the second cycle, and so on. Typically the "easier" paths may be copied verbatim to the next pipe stage, just to sync them with the critical path. The process is fairly organised and easy to maintain.
I might break a 32-bit multiply down into 16*16 chunks and pipeline the partial product additions. The control this gives USED to give better results than XST gave alone...
I know some people prefer variables within a process, and I use them for intermediate results in a pipe stage, but using signals I can describe the pipeline in its natural order (thanks to postponed assignment) whereas using variables, I would have to describe it backwards!
I create a package for each of my major processing blocks, one of the constants in there is the processing delay of that block. I can then connect that up to my general-purpose "delay-line" block which has a generic for the number of cycles.
Keeping that constant in "sync" with the actual implementation is best done by a self-checking testbench.
Something to consider is delay lines (i.e. back-to-back registers) vs FIFOs.
Consider a module X with a pipeline delay N. FIFOs work well when N is variable. The trick is remembering that you can only request new work when both the module and the FIFO can accept it. Ideally you size the FIFO so that it can contain the maximum number of items that X can work on concurrently, but sometimes that's not practical - for example, if your calculation includes accesses to a distant memory.
Another option is integrating the side channel (i.e. the path that your sync flag is taking) into the module X rather than having it go outside. If you do this, then whenever any part of the calculation has to stall, you can also stall the side channel and the two stay in sync. You can do this because you're in a scope that has all the necessary signals in it. Then all signals, whether used in the calculation or not, appear at the output at the same time.
