Make a previously unknown number of parallel operations. In VHDL - parallel-processing

I'm working on a project for which I need to make calculations with vectors (orthogonalizing a matrix using the Gram-Schmidt method). The length of these vectors is unknown for now, so the program must be able to adapt to different lengths. One such calculation is computing a new vector (C) which is the result of adding A and B. Each element of the vectors is a fixed-point number.
I want C(i)=A(i)+B(i). For all the elements of the vector (for i=0 to N, where N is the vector length).
I can find 2 solutions for this but both present some problems:
1- I can declare in the entity, vectors whose length changes according to a generic and then just create a for loop which goes through all the vector.
for I in 0 to N loop
    C(I) <= A(I) + B(I);
end loop;
The problem with this solution is that the execution would be sequential, and therefore slow. I'm not completely sure about this and I don't know how to check it, but I guess that the compiler is not smart enough to notice that it can be processed in parallel. In this application speed is a key factor.
2- I can declare vectors which are as long as the maximum possible length for the actual data and fill them with zeroes. Then I could just assign:
This is not an elegant solution, and in this application N can be between 3 and 300, so it could be a complete waste and tedious to program.
3- I want to find a third solution which would create a number (assigned by the generic) of combinational calculations following a template such as C(i)=A(i)+B(i). Is there any solution like this? It would effectively be a loop that is not executed sequentially but all at the same time.
I know that similar stuff can be done using CUDA but this project is actually a comparison between GPUs and FPGAs, so changing the platform is not a suitable solution either.
Thank you in advance
Edit: I have thought of another unsatisfactory solution, but I want to share it in case it is helpful for somebody else checking this in the future. Given that A and B have the same length, you can write them in a 1-D format, that is: A(normal)=[1001,1100,0011], A(1-D)=100111000011. The same would be done with B.
If you know beforehand that the sum of any two possible numbers can be expressed with the same number of bits, there will be no problems. So with 4 unsigned bits you should make sure that in every possible case the numbers in A and B are not higher than 0111. You could then just write C(1-D)=A(1-D)+B(1-D) and assign C(0)=C(1-D)(3 downto 0), C(1)=C(1-D)(7 downto 4) etc.
If you cannot make sure that the numbers are not higher than 0111 (in the 4-bit case), it won't work, because a carry out of one element would spill into the next one.
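The packed "1-D" addition described in the edit can be checked quickly in software. The following is only a minimal Python sketch of the idea, not VHDL; the helper names `pack` and `unpack` are hypothetical, and it assumes 4-bit elements whose values never exceed 0111 so that no carry crosses an element boundary.

```python
def pack(values, width=4):
    """Concatenate the elements into one integer, element 0 in the lowest bits."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed


def unpack(packed, count, width=4):
    """Slice the packed integer back into `count` elements of `width` bits."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


# Every element is at most 0b0111, so each sum still fits in 4 bits.
a = [3, 7, 1]
b = [4, 7, 6]
c = unpack(pack(a) + pack(b), len(a))  # one wide addition does all element sums
```

If any element could exceed 0111, a carry would leak into the neighbouring element and the unpacked result would be wrong, which is the limitation described above.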

You might be able to use the length attribute to create a loop depending on the size of your vector.
As mentioned in a comment on the question, the synthesis tool should unroll the loop into parallel hardware as long as it is not synchronized to the clock.


Never ending 'for' loop prevents my RStudio notebook from being rendered into a .md file

I'm trying to calculate the Kolmogorov-Smirnov statistic in R. I have the following sample, which clearly comes from a random variable that follows a long-tailed distribution.
Download link
As you may know, the Kolmogorov-Smirnov statistic requires the calculation of the empirical cumulative distribution function and the presumed cumulative distribution function. For both calculations I take the following approach: first, I create a vector with the same length as the sample, and then I modify each component of the vector so that it contains the empirical cdf (or presumed cdf) of the corresponding observation of the sample.
For the sake of illustration, I'll show you the code I wrote in order to calculate the empirical cdf.
I'm assuming that the data has been read and stored in a dataframe called data.
ecdf = vector("numeric", length(data$logueos))
for (i in 1:length(data$logueos)) {
  ecdf[i] = sum (data$logueos <= data$logueos[i])/length(data$logueos)
}
The code I wrote for the calculation of the presumed cdf is analogous to the preceding one; the only difference is that I set each component of the pcdf vector equal to the formula $P(X<=t)$ —where t is the corresponding observation of the sample— according to the distribution that I'm assuming.
The problem is that this 'for' loop never ends. If I force it to end by clicking RStudio's stop button it works: it makes the vector store what I want it to store. But, if I press Ctrl+Shift+k in order to render my notebook and preview it, the load gets stuck when trying to execute the first chunk encountered that contains one of those loops.
First of all, your loop is not endless. It will finish, eventually.
You start by initializing a vector with as many elements as the number of observations (1,245,888, which is a lot of iterations). This vector is FULL OF ZEROS.
What your loop does is iterate, replacing each zero with the result of sum (data$logueos <= data$logueos[i])/length(data$logueos). Each iteration compares the observation against the whole sample, so the total work grows with the square of the sample size. Check that when you stop the execution, the first values of your vector are between 0 and 1 while the last values are still 0s (because the loop hasn't reached them yet).
So, you will have to wait longer.
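Separately from waiting or parallelizing, the same numbers can be computed with far less work by sorting the sample once and counting with a binary search. The following is a minimal sketch in Python rather than R, shown only to illustrate the idea; the function name `empirical_cdf` is hypothetical, and this is a swapped-in technique, not the code from the question.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """For every observation x, return the fraction of the sample that is <= x."""
    ordered = sorted(sample)  # one O(n log n) sort
    n = len(ordered)
    # bisect_right counts how many sorted values are <= x in O(log n)
    return [bisect_right(ordered, x) / n for x in sample]


values = empirical_cdf([3, 1, 2, 2])
```

Each entry equals what the original loop would store in the same position, but the cost is roughly n log n comparisons instead of n squared.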
To make the execution faster, you could consider parallelizing the loop. Standard loops run sequentially, one iteration at a time; if that takes too long, parallelization runs several iterations at once (for example four at a time, depending on your computer's capabilities). Here you'll find some information about it:
Then, my proposal to you:
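if(!require(foreach)){install.packages("foreach")}; require(foreach)
if(!require(doParallel)){install.packages("doParallel")}; require(doParallel)
registerDoParallel(detectCores() - 1)
# %dopar% (not %do%) runs the iterations in parallel; each iteration
# returns one value and .combine = c collects them into a vector
ecdf = foreach (i=1:length(data$logueos), .combine = c) %dopar% {
sum (data$logueos <= data$logueos[i])/length(data$logueos)
}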
The first line will download and load the foreach library, which you need for parallelization.
detectCores() - 1 uses all the processors that your computer has except one (to avoid freezing your machine) for computing this loop. You should see that it is going to be faster.
The registerDoParallel function is what tells foreach how many cores to use.

How does a finite state machine perform division?

I am taking a course on models of computation and currently we are doing finite state machines. One of my tasks is to draw an FSM that performs division by 3; to simplify the model, the machine only accepts numbers that are multiples of 3. I am not sure how exactly this works, especially since I imagine an FSM putting out only single binary values. Could you give examples (division by 2 or 4) or hints on how to approach this?
This is what you need, I think (sorry about the bad picture). The 'E' represents epsilon/lambda/no-output. The label of the edges denotes 'input/output'. For each symbol read there is also a corresponding output which may be lambda (no output).
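One common way to build such a machine is to let the state be the remainder seen so far while reading the binary input most significant bit first, and to emit one quotient bit per input bit. The sketch below is in Python and is only an illustration under those assumptions; it is not necessarily the machine in the picture (it emits a leading 0 instead of "no output"), and the names `TRANSITIONS` and `divide_by_3` are hypothetical.

```python
# Mealy machine: (state, input bit) -> (next state, output bit).
# The state is the running remainder modulo 3, so there are three states.
TRANSITIONS = {
    (r, b): ((2 * r + b) % 3, (2 * r + b) // 3)
    for r in range(3)
    for b in (0, 1)
}


def divide_by_3(bits):
    """Feed a binary string (MSB first); return (quotient bits, final state)."""
    state = 0
    output = []
    for ch in bits:
        state, out_bit = TRANSITIONS[(state, int(ch))]
        output.append(str(out_bit))
    return "".join(output), state


quotient, remainder = divide_by_3("1001")  # 9 in binary
```

Because 2 * r + b is at most 5, the emitted value is always a single bit, which is why a finite machine with single-symbol outputs is enough. An input is a multiple of 3 exactly when the machine ends in state 0.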

Algorithms to represent a set of integers with only one integer

This may not be a programming question, but it's a problem that arose recently at work. Some background: big C development with special interest in performance.
I have a set of integers and want to test the membership of another given integer. I would love to implement an algorithm that can check it with a minimal set of algebraic functions, using only one integer to represent the whole set of integers contained in the first set.
I've tried a composite Cantor pairing function, for instance, but with a 30-element set it seems too complicated, and with a focus on performance it makes no sense. I played with some operations, like XORing and negating, but they give me poor estimates of membership. Then I tried successions of additions and finally got lost.
Any ideas?
For sets of unsigned long of size 30, the following is one fairly obvious way to do it:
store each set as a sorted array, 30 * sizeof(unsigned long) bytes per set.
to look up an integer, do a few steps of a binary search, followed by a linear search (profile in order to figure out how many steps of binary search is best - my wild guess is 2 steps, but you might find out different, and of course if you test bsearch and it's fast enough, you can just use it).
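The sorted-array lookup can be sketched as follows. This is Python rather than C, it uses a plain binary search from the standard library instead of the hybrid binary-then-linear search suggested above, and the helper names `make_set` and `contains` are hypothetical.

```python
from bisect import bisect_left


def make_set(values):
    """Store the set as a sorted list without duplicates."""
    return sorted(set(values))


def contains(sorted_values, x):
    """Binary search: find the insertion point and check what is stored there."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x


members = make_set([42, 7, 19, 7, 1000003])
found = contains(members, 19)
missing = contains(members, 20)
```

For 30 elements this takes at most about five comparisons per lookup, which is the baseline any big-number representation would have to beat.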
So the next question is why you want a big-maths solution, which will tell me what's wrong with this solution other than "it is insufficiently pleasing".
I suspect that any big-math solution will be slower than this. A single arithmetic operation on an N-digit number takes at least linear time in N. A single number to represent a set can't be very much smaller than the elements of the set laid end to end with a separator in between. So even a linear search in the set is about as fast as a single arithmetic operation on a big number. With the possible exception of a Goedel representation, which could do it in one division once you've found the nth prime number, any clever mathematical representation of sets is going to take multiple arithmetic operations to establish membership.
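For illustration only, the Goedel-style representation mentioned above might look like the sketch below. It assumes the set elements are small non-negative indices, maps index n to the n-th prime, and stores the set as the product of those primes; all function names are hypothetical and the prime generator is deliberately naive.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def encode(indices, primes):
    """Represent the set as one integer: the product of the matching primes."""
    code = 1
    for i in indices:
        code *= primes[i]
    return code


def contains(code, index, primes):
    """Membership is a single divisibility test once the prime is known."""
    return code % primes[index] == 0


PRIMES = first_primes(10)
code = encode({0, 3, 7}, PRIMES)  # 2 * 7 * 19
```

The single integer grows quickly with the size and range of the set, so the one division is performed on a big number, which is the cost the paragraph above warns about.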
Note also that there are two different reasons you might care about the performance of "look up an integer in a set":
You are looking up lots of different integers in a single set, in which case you might be able to go faster by constructing a custom lookup function for that data. Of course in C that means you need either (a) a simple virtual machine to execute that "function", or (b) runtime code generation, or (c) to know the set at compile time. None of which is necessarily easy.
You are looking up the same integer in lots of different sets (to get a sequence of all the sets it belongs to), in which case you might benefit from a combined representation of all the sets you care about, rather than considering each set separately.
I suppose that very occasionally, you might be looking up lots of different integers, each in a different set, and so neither of the reasons applies. If this is one of them, you can ignore that stuff.
One good start is to try Bloom Filters.
Basically, it's a probabilistic data structure that gives you no false negatives, but some false positives. So when an integer matches a Bloom filter, you then have to check whether it really is in the set, but it's a big speedup because it greatly reduces the number of sets to check.
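A minimal Bloom filter sketch in Python, kept in a single integer used as a bit array. The class name, the default sizes, and the use of SHA-256 to derive bit positions are all assumptions made for the example, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # one Python integer acts as the whole bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


bloom = BloomFilter()
for member in (3, 17, 42, 1000):
    bloom.add(member)
```

A `True` answer still has to be confirmed against the real set, while a `False` answer can be trusted immediately.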
If I've understood you correctly, here is a Python example:
>>> a=[1,2,3,4,5,6,7,8,9,0]
>>> len_a = len(a)
>>> b = [1]
>>> if len(set(a) - set(b)) < len_a:
...     print('this integer exists in set')
...
this integer exists in set
math base:

Fastest/easiest way to average ARGB color ints?

I have five colors stored in the format #AARRGGBB as unsigned ints, and I need to take the average of all five. Obviously I can't simply divide each int by five and just add them, and the only way I thought of so far is to bitmask them, do each channel separately, and then OR them together again. Is there a clever or concise way of averaging all five of them?
Half way between your (OP) proposed solution and Patrick's solution looks quite neat:
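Color colors[5]={ 0xAARRGGBB,...};
unsigned long sum1=0,sum2=0;
for (int i=0;i<5;i++)
{
    sum1+= colors[i] &0x00FF00FF; // 0x00RR00BB
    sum2+=(colors[i]>>8)&0x00FF00FF; // 0x00AA00GG
}
unsigned long output=0;
output|=((sum1&0xFFFF)/5)&0xFF; // blue, from the bottom half of sum1
output|=(((sum2&0xFFFF)/5)&0xFF)<<8; // green, from the bottom half of sum2
sum1>>=16;sum2>>=16; // and now the top halves
output|=((sum1/5)&0xFF)<<16; // red
output|=((sum2/5)&0xFF)<<24; // alpha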
I don't think you could really divide sum1/sum2 by 5, because the bits from the top half would spill down...
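The two-accumulator idea can be checked with a short Python sketch. The function name `average_argb` is hypothetical, and the sketch assumes few enough colors (at most 257) that each 16-bit half of an accumulator cannot overflow into the other.

```python
def average_argb(colors):
    """Average 0xAARRGGBB values using two sums that each hold two channels."""
    n = len(colors)
    sum_rb = sum(c & 0x00FF00FF for c in colors)         # 0x00RR00BB lanes
    sum_ag = sum((c >> 8) & 0x00FF00FF for c in colors)  # 0x00AA00GG lanes
    # Split each sum into its 16-bit halves before dividing, so that bits
    # from the top half cannot spill down into the bottom half.
    blue = (sum_rb & 0xFFFF) // n
    red = (sum_rb >> 16) // n
    green = (sum_ag & 0xFFFF) // n
    alpha = (sum_ag >> 16) // n
    return (alpha << 24) | (red << 16) | (green << 8) | blue


result = average_argb([0xFF000000, 0xFF0000FF, 0xFF00FF00, 0xFFFF0000, 0xFFFFFFFF])
```

Only two additions per color are needed; the four divisions happen once at the end.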
If an approximation would be valid, you could try a multiplication by something like 0.1875 (0.125 + 0.0625). This means: multiply by 3 and shift down by 4 places, which you could do with bitmasking and care.
The problem is, 0.2 has a crappy binary representation, so multiplying by it is an ass.
As ever, accuracy or speed. Your choice.
When using x86 machines with at least SSE, and if you need to approximate only, you could use the assembly instruction PAVGB (Packed Average Byte), which averages bytes. See for explanation.
Since you've got 5 values, you would need to be creative in calling PAVGB, since PAVGB will only do two values at a time.
I found a smart solution to your problem; sadly, it is only applicable if the number of colors is a power of 2. I'll show it for the case of two colors:
mask = 0x01010101 # the lowest bit of every 8-bit channel
pom = ~((a ^ b) & mask) # ^ means xor here, ~ negation
a = a & pom
b = b & pom
avg = (a+b) >> 1 # the sum needs 33 bits, so use a wider type than 32 bits
The trick of this method is that when you compute the average, the LSB of the sum (in the case of two numbers) has no meaning, as it will be dropped in the division (we're talking integers here, of course). In your problem, the LSB position of each partial sum is also where the carry bit from the sum of the adjacent color lands. Provided that the LSB of every color sum is 0, you can safely add the two integers: the additions won't interfere with each other. The bit shift then divides every color by two.
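A small Python sketch of the two-color case, with the xor grouped explicitly before the mask is applied. The name `average_two` is hypothetical; Python integers are unbounded, so the carry out of the alpha channel is not lost here, whereas in C the sum would need a type wider than 32 bits.

```python
MASK = 0x01010101  # lowest bit of each 8-bit channel


def average_two(a, b):
    """Per-channel floor average of two 0xAARRGGBB values with one addition."""
    # Clear the low bit of a channel in both inputs when the two low bits
    # differ, so every channel sum ends in 0 and can absorb a neighbour's carry.
    keep = ~((a ^ b) & MASK)
    a &= keep
    b &= keep
    return ((a + b) >> 1) & 0xFFFFFFFF


def average_two_reference(a, b):
    """Straightforward per-channel version, used only to check the trick."""
    out = 0
    for shift in (0, 8, 16, 24):
        channel = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) // 2
        out |= channel << shift
    return out


mixed = average_two(0xFF102030, 0x01304050)
```

The result matches the per-channel floor average because clearing a lone low bit subtracts exactly the 1 that integer division would have discarded.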
This method can be used with 4 colors as well, but you have to work out the carry of the sum of the numbers formed by the last two bits of every color. It is also possible to omit this part and just zero the last two bits of every color; the largest error this omission introduces is 3 for every component (for example, four channel values of 3 average to 3, but all of them are zeroed first).
EDIT I'll leave this attempt for posterity, but please note that it is incorrect and will not work.
One "clever" way you could do it would be to insert zeros between the components, parse into an unsigned long, average the numbers, convert back to a hex string, remove the zeros and finally parse into an unsigned int.
i.e. convert #AARRGGBB to #AA00RR00GG00BB
This method involves parsing and string manipulations, so will undoubtedly be slower than the method you proposed.
If you were to factor your own solution carefully, it might actually look quite clever itself.

Algorithm to find a common multiplier to convert decimal numbers to whole numbers

I have an array of numbers that potentially have up to 8 decimal places and I need to find the smallest common number I can multiply them by so that they are all whole numbers. I need this so all the original numbers can all be multiplied out to the same scale and be processed by a sealed system that will only deal with whole numbers, then I can retrieve the results and divide them by the common multiplier to get my relative results.
Currently we do a few checks on the numbers and multiply by 100 or 1,000,000, but the processing done by the *sealed system can get quite expensive when dealing with large numbers, so multiplying everything by a million just for the sake of it isn’t really a great option. As an approximation, let's say that the sealed algorithm gets 10 times more expensive every time you multiply by a factor of 10.
What is the most efficient algorithm, that will also give the best possible result, to accomplish what I need, and is there a mathematical name and/or formula for what I need?
*The sealed system isn’t really sealed. I own/maintain the source code for it, but it's 100,000-odd lines of proprietary magic and it has been thoroughly bug and performance tested; altering it to deal with floats is not an option for many reasons. It is a system that creates a grid of X by Y cells, then rects that are X by Y are dropped into the grid, “proprietary magic” occurs and results are spat out – obviously this is an extremely simplified version of reality, but it’s a good enough approximation.
So far there are quite a few good answers and I wondered how I should go about choosing the ‘correct’ one. To begin with I figured the only fair way was to create each solution and performance test it, but I later realised that pure speed wasn’t the only relevant factor – a more accurate solution is also very relevant. I wrote the performance tests anyway, but currently I’m choosing the correct answer based on speed as well as accuracy using a ‘gut feel’ formula.
My performance tests process 1000 different sets of 100 randomly generated numbers.
Each algorithm is tested using the same set of random numbers.
Algorithms are written in .Net 3.5 (although thus far would be 2.0 compatible)
I tried pretty hard to make the tests as fair as possible.
Greg – Multiply by large number and then divide by GCD – 63 milliseconds
Andy – String Parsing – 199 milliseconds
Eric – Decimal.GetBits – 160 milliseconds
Eric – Binary search – 32 milliseconds
Ima – sorry, I couldn’t figure out how to implement your solution easily in .Net (I didn’t want to spend too long on it)
Bill – I figure your answer was pretty close to Greg’s so didn’t implement it. I’m sure it’d be a smidge faster but potentially less accurate.
So Greg’s “Multiply by large number and then divide by GCD” solution was the second fastest algorithm and it gave the most accurate results, so for now I’m calling it correct.
I really wanted the Decimal.GetBits solution to be the fastest, but it was very slow; I’m unsure if this is due to the conversion of a Double to a Decimal or the bit masking and shifting. There should be a similar usable solution for a straight Double using BitConverter.GetBytes and some knowledge contained here: but my eyes just kept glazing over every time I read that article and I eventually ran out of time to try to implement a solution.
I’m always open to other solutions if anyone can think of something better.
I'd multiply by something sufficiently large (100,000,000 for 8 decimal places), then divide by the GCD of the resulting numbers. You'll end up with a pile of smallest integers that you can feed to the other algorithm. After getting the result, reverse the process to recover your original range.
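As a rough sketch of this approach (in Python rather than .Net, with a hypothetical function name), the numbers are scaled by 10^8, rounded to integers, and then reduced by their common GCD. The GCD of a whole list is just the pairwise GCD folded over the list, so it stays cheap even for hundreds of numbers.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def to_smallest_integers(values, places=8):
    """Scale by 10**places, then divide out the GCD shared by all numbers."""
    scale = 10 ** places
    scaled = [round(v * scale) for v in values]
    common = reduce(gcd, scaled) or 1  # guard against an all-zero input
    multiplier = Fraction(scale, common)  # original value * multiplier = integer
    return [s // common for s in scaled], multiplier


integers, multiplier = to_smallest_integers([0.25, 0.5, 1.75])
```

Dividing the results from the integer-only system by `multiplier` recovers the original scale. Note that the multiplier can be a fraction such as 10/21 when the inputs share a factor larger than the scale allows for.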
Multiply all the numbers by 10 until you have integers.
Divide by 2, 3, 5, 7 while you still have all integers.
I think that covers all cases.
2.1 * 10/7 -> 3
0.008 * 10^3/2^3 -> 1
That's assuming your multiplier can be a rational fraction.
If you want to find some integer N so that N*x is an exact integer for every float x in a given set, then you have a basically unsolvable problem. Suppose x is the smallest positive float your type can represent, say 10^-30. If you multiply all your numbers by 10^30, and then try to represent them in binary (otherwise, why are you even trying so hard to make them ints?), then you'll lose basically all the information of the other numbers due to overflow.
So here are two suggestions:
If you have control over all the related code, find another
approach. For example, if you have some function that takes only
int's, but you have floats, and you want to stuff your floats into
the function, just re-write or overload this function to accept
floats as well.
If you don't have control over the part of your system that requires
ints, then choose the precision you care about, accept that
you will simply have to lose some information sometimes (but it will
always be "small" in some sense), and then just multiply all your
floats by that constant, and round to the nearest integer.
By the way, if you're dealing with fractions rather than floats, then it's a different game. If you have a bunch of fractions a/b, c/d, e/f in lowest terms, and you want the least common multiplier N such that N*(each fraction) is an integer, then N is the least common multiple of the denominators: N = lcm(b, d, f) = lcm(lcm(b, d), f), where lcm(x, y) = x*y / gcd(x, y). You can use Euclid's algorithm to find the gcd of any two numbers.
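For fractions in lowest terms, the smallest integer N that turns every one of them into an integer is the least common multiple of their denominators. A minimal Python sketch, with hypothetical helper names and the standard `fractions.Fraction` type (which always stores fractions in lowest terms):

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm(x, y):
    """Least common multiple via Euclid's gcd."""
    return x * y // gcd(x, y)


def common_multiplier(fractions):
    """Smallest positive integer N such that N * f is an integer for every f."""
    return reduce(lcm, (f.denominator for f in fractions), 1)


parts = [Fraction(1, 4), Fraction(5, 6), Fraction(3, 10)]
n = common_multiplier(parts)
```

Folding `lcm` over the denominators handles any number of fractions, in the same way that the gcd of a list can be folded pairwise.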
Greg: Nice solution, but won't calculating a GCD that's common to an array of 100+ numbers get a bit expensive? And how would you go about that? It's easy to do GCD for two numbers but for 100 it becomes more complex (I think).
Evil Andy: I'm programming in .Net and the solution you pose is pretty much a match for what we do now. I didn't want to include it in my original question because I was hoping for some outside-the-box (or my box anyway) thinking and I didn't want to taint people's answers with a potential solution. While I don't have any solid performance statistics (because I haven't had any other method to compare it against), I know the string parsing would be relatively expensive and I figured a purely mathematical solution could potentially be more efficient.
To be fair, the current string parsing solution is in production and there have been no complaints about its performance yet (it's even in production in a separate system in a VB6 format and no complaints there either). It's just that it doesn't feel right; I guess it offends my programming sensibilities, but it may well be the best solution.
That said I'm still open to any other solutions, purely mathematical or otherwise.
What language are you programming in? Something like
would give you the number of decimal places for a double in C#. You could run each number through that and find the largest number of decimal places(x), then multiply each number by 10 to the power of x.
Edit: Out of curiosity, what is this sealed system which you can pass only integers to?
In a loop get mantissa and exponent of each number as integers. You can use frexp for exponent, but I think bit mask will be required for mantissa. Find minimal exponent. Find most significant digits in mantissa (loop through bits looking for last "1") - or simply use predefined number of significant digits.
Your multiple is then something like 2^(numberOfDigits-minMantissa). "Something like" because I don't remember biases/offsets/ranges, but I think idea is clear enough.
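In Python the mantissa and exponent bookkeeping can be sidestepped for a quick experiment, because `float.as_integer_ratio()` already returns the exact value of a binary float as a reduced fraction whose denominator is a power of two. The sketch below uses that instead of `frexp` and bit masks, so it illustrates the same idea rather than reproducing the steps above; the function name is hypothetical.

```python
def power_of_two_multiplier(values):
    """Smallest power of two that makes every binary float an exact integer."""
    # Each denominator is a power of two, so the largest one is a multiple
    # of all the others.
    return max(v.as_integer_ratio()[1] for v in values)


m = power_of_two_multiplier([0.5, 0.375, 2.0])
```

This works nicely for values such as 0.375 that are exact in binary, but a decimal value like 0.1 is not, and its exact multiplier is a very large power of two, which is the caveat that applies to any approach working directly on the float representation.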
So basically you want to determine the number of digits after the decimal point for each number.
This would be rather easier if you had the binary representation of the number. Are the numbers being converted from rationals or scientific notation earlier in your program? If so, you could skip the earlier conversion and have a much easier time. Otherwise you might want to pass each number to a function in an external DLL written in C, where you could work with the floating point representation directly. Or you could cast the numbers to decimal and do some work with Decimal.GetBits.
The fastest approach I can think of in-place and following your conditions would be to find the smallest necessary power-of-ten (or 2, or whatever) as suggested before. But instead of doing it in a loop, save some computation by doing binary search on the possible powers. Assuming a maximum of 8, something like:
int NumDecimals( double d )
{
    // make d positive for clarity; it won't change the result
    if( d<0 ) d=-d;
    // now do binary search on the possible numbers of post-decimal digits to
    // determine the actual number as quickly as possible:
    if( NeedsMore( d, 1e4 ) )
    {
        // more than 4 decimals
        if( NeedsMore( d, 1e6 ) )
        {
            // > 6 decimal places
            if( NeedsMore( d, 1e7 ) ) return 8;
            return 7;
        }
        // <= 6 decimal places
        if( NeedsMore( d, 1e5 ) ) return 6;
        return 5;
    }
    // <= 4 decimal places
    // etc...
}

bool NeedsMore( double d, double e )
{
    // check whether d still has a fractional part after scaling by the
    // power of 10 given in e (note that 1e4 is 10^4, whereas 10e4 is 10^5)
    return (d*e - Math.Floor( d*e )) > 0;
}
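The same binary search can be written as a loop. The sketch below is in Python and deliberately uses `decimal.Decimal` built from the number's string form, because a product such as 0.07 * 100 is not exactly 7 in binary floating point and would make the fractional-part test misfire; the function name and the 8-place limit are assumptions taken from the question.

```python
from decimal import Decimal


def num_decimals(value, max_places=8):
    """Binary search for the smallest k with value * 10**k an exact integer."""
    d = abs(Decimal(str(value)))
    low, high = 0, max_places
    while low < high:
        mid = (low + high) // 2
        scaled = d * (10 ** mid)
        if scaled != scaled.to_integral_value():
            low = mid + 1  # still has a fractional part: needs more places
        else:
            high = mid     # already an integer: try fewer places
    return low


places = num_decimals(0.12345678)
```

The largest result over all the numbers gives the power of ten to multiply by. The search relies on the assumption that no input has more than `max_places` decimal places.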
PS: you wouldn't be passing security prices to an option pricing engine would you? It has exactly the flavor...
