Set bits before index without shift or LUT


Let's say I need to set all bits before a specific bit index. Here are examples with 4 bits:
index(0) = (0x0, 0000)
index(1) = (0x1, 1000)
index(2) = (0x3, 1100)
index(3) = (0x7, 1110)
How can I do this without using shifts or a LUT, but instead using minimal bitwise operations or arithmetic or something similarly efficient?

The constraints are odd: you want an efficient solution but rule out the only two means of doing it properly.
You basically want to compute x=(2^bit)-1, which is easy with a bit shift:
x=(1<<bit)-1; // O(1)
A LUT works just as well. So how to attack this without either of the two?
x=pow(2,bit)-1; //O(?) can be O(1),O(log(n)),O(n)
This is far from efficient, and pow itself uses bit shifts (and, in some implementations, a LUT) internally. The only solutions left are:
approximation
You can use a polynomial, PCA, or any other method, but you need to consider the target range. This is neither optimal nor robust. It can be O(1), O(log(n)) or O(n), but usually with a very large constant factor.
emulate bit-shift left
You can do that with a loop and addition:
int x; for (x=1;bit;bit--) x+=x; x--;
But this runs in O(n). It is still faster than pow unless you have some pow2 implemented in hardware.
[Notes]
In the complexity formulas n=bit, and all the code is in C++ except the first formula, where ^ means power.
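The addition-based loop above can be wrapped in a small function for testing. This is only a minimal sketch in C; the name mask_below is made up for illustration, and it assumes bit is smaller than the width of unsigned int.

```c
#include <assert.h>
#include <limits.h>

/* Returns a value with the lowest `bit` bits set, i.e. 2^bit - 1,
   using only addition and subtraction (no shifts, no lookup table).
   Hypothetical helper; assumes bit < width of unsigned int. */
unsigned int mask_below(unsigned int bit)
{
    assert(bit < CHAR_BIT * sizeof(unsigned int));
    unsigned int x = 1;
    while (bit--)   /* double x once per bit: afterwards x == 2^bit */
        x += x;
    return x - 1;   /* 2^bit - 1 has exactly the low `bit` bits set */
}
```

With the 4-bit examples from the question, mask_below(0), mask_below(1), mask_below(2) and mask_below(3) give 0x0, 0x1, 0x3 and 0x7.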

Related

Iterating over bits in FPGA

I'm trying to figure out the best method for iterating over bits in an FPGA. I'm using a variation of the fast powering algorithm, a.k.a. exponentiation by squaring (more precisely, the double-and-add algorithm for elliptic curve mathematics). To implement it in hardware, I know I must use an FSM that does the iteration. My problem is how to properly "handle" moving from bit to bit. My first thought was to switch the order of bytes, but when my k = 17 is 32 bits wide I must discard the first 27 bits, so it's a rather stupid idea. Another concept was to "move" a 0001000 pattern and bitwise AND it with the number, but that also requires finding the first nonzero bit.
TL;DR
For example, I have k = 17 (32 bits, so 27 zeros followed by 10001) and want to iterate 5 times (that is, start the iteration at the first "real" bit of the number), knowing each bit I iterate over.
The language doesn't matter - I only need the algorithm, not a solution in a specific language. However, if it is easily done in Verilog, I wouldn't mind. :P

A dedicated combinatorial circuit to find the first nonzero bit, shift it to the first position and tell you the shift amount should be fairly light on resources.
In principle, the compiler should be able to find this solution on its own and improve on it:
if none of the top 16 bits are set, set bit 4 of the shift amount, and shift by 16.
if none of the top 8 bits are set, set bit 3 of the shift amount, and shift by 8.
...
The compiler should be able to find further optimizations on this.
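The halving steps above can be sketched in software as follows. This is a C model of the idea, not HDL; the name normalize32 is hypothetical and the input is assumed to be nonzero.

```c
#include <stdint.h>

/* Shifts *k left until its most significant bit is set and returns the
   shift amount (the number of leading zeros). Assumes *k != 0.
   Hypothetical helper modelling the combinatorial circuit in C. */
unsigned int normalize32(uint32_t *k)
{
    unsigned int shift = 0;
    if ((*k & 0xFFFF0000u) == 0) { shift |= 16; *k <<= 16; } /* top 16 clear */
    if ((*k & 0xFF000000u) == 0) { shift |= 8;  *k <<= 8;  } /* top 8 clear  */
    if ((*k & 0xF0000000u) == 0) { shift |= 4;  *k <<= 4;  } /* top 4 clear  */
    if ((*k & 0xC0000000u) == 0) { shift |= 2;  *k <<= 2;  } /* top 2 clear  */
    if ((*k & 0x80000000u) == 0) { shift |= 1;  *k <<= 1;  } /* top 1 clear  */
    return shift;
}
```

For k = 17 this returns a shift of 27 and leaves 0x88000000, so 32 - 27 = 5 iterations remain, reading bits from the top.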

I do not code for FPGAs, but still:
rewrite the algorithm to iterate over the number x from LSB to MSB
then in each iteration shift x right by 1 bit
stop when x==0.
This way you have the bit scan inside your main loop and do not need additional cycles for it.
x!=0 is easily computed by ORing all its bits together.
C++ code example:
DWORD x = ...;
for (; x != 0; x >>= 1)
{
    // here is your iteration loop stuff, like:
    if (DWORD(x & 1) != 0) ...;
}

Something like:
always @ *
casex(num)
8XXX_XXXX: k = 32;
4XXX_XXXX: k = 31;
2XXX_XXXX: k = 30;
...
Should give you the value of k.
You can have a shift register which can be parallel loaded so you can write a 1 to the kth bit, so you know when your iterations have ended.

If you loop from 0 to 31 and discard the 27 leading zeros, you aren't necessarily wasting cycles. It depends on whether you've wrapped this in a synchronous process or an asynchronous one.
One gives you a rather small clocked circuit with a 32 clock latency.
The other gives you a giant rats nest of ANDs and ORs which won't run at a very high frequency.
Depends on what you want. Remember though, that even if you do decide to loop over 32 clocks, you can PIPELINE it such that you start a new calculation every clock. It might take you 32 clocks to get an answer, but you CAN do them at high speed.

Finding seeds for a 5 byte PRNG

This is an old idea, but ever since I had it I haven't found a reasonably good way to solve the problem it raised. I "invented" (see below) a very compact and, in my opinion, reasonably well performing PRNG, but I can't figure out an algorithm to build suitable seed values for it at large bit depths. My current solution is simple brute force; its running time is O(n^3).
The generator
My idea came from the XOR taps (essentially LFSRs) some old 8-bit machines used for sound generation. I fiddled with XOR as a base on a C64, tried putting opcodes together, and experimented with the results. The final working solution looked like this:
asl
adc #num1
eor #num2
This is 5 bytes on the 6502. With a well chosen num1 and num2, in the accumulator it iterates over all 256 values in a seemingly random order, that is, it looks reasonably random when used to fill the screen (I wrote a little 256b demo back then on this). There are 40 suitable num1 & num2 pairs for this, all giving decent looking sequences.
The concept can be well generalized, if expressed in pure C, it may look like this (BITS being the bit depth of the sequence):
r = (((r >> (BITS-1)) & 1U) + (r << 1) + num1) ^ num2;
r = r & ((1U<<BITS)-1U);
This C code is longer since it is generalized, and even if one would use the full depth of an unsigned integer, C wouldn't have the necessary carry logic to transfer the high bit of the shift to the add operation.
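As a rough sketch, the generalized step can be wrapped in a function and used to measure the cycle that contains zero. The names prng_step and cycle_length_from_zero are made up here, and the state is assumed to be already reduced to the given bit depth. Since the shift plus carry is a rotate, and rotating, adding a constant and XORing a constant are each invertible on BITS-bit values, the walk from zero always comes back to zero.

```c
#include <assert.h>
#include <stdint.h>

/* One step of the generator at a given bit depth (1..32).
   Hypothetical wrapper around the two C lines above. */
uint32_t prng_step(uint32_t r, unsigned int bits, uint32_t num1, uint32_t num2)
{
    assert(bits >= 1 && bits <= 32);
    uint32_t mask = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    uint32_t carry = (r >> (bits - 1)) & 1u;  /* bit shifted out by ASL */
    return ((carry + (r << 1) + num1) ^ num2) & mask;
}

/* Number of steps needed to get from state 0 back to state 0. */
uint64_t cycle_length_from_zero(unsigned int bits, uint32_t num1, uint32_t num2)
{
    uint32_t r = 0;
    uint64_t n = 0;
    do {
        r = prng_step(r, bits, num1, num2);
        n++;
    } while (r != 0);
    return n;
}
```

A pair gives a full sequence at a given depth when cycle_length_from_zero returns 2^BITS. As a tiny hand-checkable case, with 2 bits, num1 = 1 and num2 = 3 the walk is 0, 2, 1, 0 (length 3), while state 3 maps to itself, so that pair is not full length.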
For some performance analysis and comparisons, see below, after the question(s).
The problem / question(s)
The core problem with the generator is finding suitable num1 and num2 which would make it iterate over the whole possible sequence of a given bit depth. At the end of this section I attach my code which just brute-forces it. It will finish in reasonable time for up to 12 bits, you may wait for all 16 bits (there are 5736 possible pairs for that by the way, acquired with an overnight full search a while ago), and you may get a few 20 bits if you are patient. But O(n^3) is really nasty...
(Who will get to find the first full 32bit sequence?)
Other interesting questions which arise:
For both num1 and num2, only odd values are able to produce full sequences. Why? This may not be hard (simple logic, I guess), but I never properly proved it.
There is a mirroring property along num1 (the add value): if 'a' with a given num2 'b' gives a full sequence, then the two's complement of 'a' (in the given bit depth) with the same num2 also gives a full sequence. I have only observed this; it held reliably for all the full sequences I calculated.
A third interesting property is that for all num1 & num2 pairs the resulting sequences seem to form proper cycles, that is, at least the number zero always seems to be part of a cycle. Without this property my brute-force search would die in an infinite loop.
Bonus: Was this PRNG already known before? (and I just re-invented it)?
And here is the brute force search's code (C):
#define BITS 16
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    unsigned int r;
    unsigned int c;
    unsigned int num1;
    unsigned int num2;
    unsigned int mc=0U;
    num1=1U; /* Only odd add values produce useful results */
    do{
        num2=1U; /* Only odd eor values produce useful results */
        do{
            r= 0U;
            c=~0U;
            do{
                r=(((r>>(BITS-1)) & 1U)+r+r+num1)^num2;
                r&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U); /* 32bit safe */
                c++;
            }while (r);
            if (c>=mc){
                mc=c;
                printf("Count-1: %08X, Num1(adc): %08X, Num2(eor): %08X\n", c, num1, num2);
            }
            num2+=2U;
            num2&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U);
        }while(num2!=1U);
        num1+=2U;
        num1&=((1U<<(BITS-1))-1U); /* Do not check complements */
    }while(num1!=1U);
    return 0;
}
To show that it is working, this prints each pair whose sequence length is equal to or longer than the longest found so far. Modify the BITS constant for sequences of other depths.
Seed hunting
I did some graphing relating to the seeds. Here is a nice image showing all the 9bit sequence lengths:
The white dots are the full-length sequences; the X axis is num1 (add), the Y axis is num2 (xor); the brighter the dot, the longer the sequence. Other bit depths look very similar in pattern: they all seem to be broken up into sixteen major tiles, with two patterns repeating with mirroring. The similarity of the tiles is not complete (for example, above, a diagonal from the upper-left corner to the bottom-right is clearly visible while its opposite is absent), but for the full-length sequences this property seems to be reliable.
Relying on this it is possible to reduce the work even more than by the previous assumptions, but that's still O(n^3)...
Performance analysis
As of now, the longest sequences that can be generated are 24 bits: on my computer it takes about 5 hours to brute-force a full 24-bit sequence. This is still only so-so for real PRNG tests such as Diehard, so for now I have gone with my own approach.
First, it's important to understand the role of the generator. Given its simplicity, this is by no means a very good generator; its goal is rather to produce decent numbers blazing fast. In this area, without needing multiply / divide operations, a Galois LFSR can produce similar performance. So my generator is only of use if it can outperform that one.
The tests I performed were all on 16-bit generators. I chose this depth since it gives a useful sequence length while the numbers can still be split into two 8-bit parts, making it possible to present various bit-exact graphs for visual analysis.
The core of the tests was looking for correlations between previous and currently generated numbers. For this I used X:Y plots where the previous generation was Y and the current one X, both split into low / high parts as mentioned above, giving two graphs. I created a program capable of plotting these step by step in real time, so as to also make it possible to roughly examine how the numbers follow each other and how the graphs fill up. Here, obviously, only the end results are shown, after the generators ran through their full 2^16 or 2^16-1 (Galois) cycle.
The explanation of the fields:
The images consist of 8x2 graphs of 256x256 each, making the total image size 2048x512 (check them at original size).
The top left graph just confirms that indeed a full sequence was plotted, it is simply an X = r % 256; Y = r / 256; plot.
The bottom left graph shows every second number only plotted the same way as the top, just confirming that the numbers occur reasonably randomly.
From the second graph the top row are the high byte correlation graphs. The first of them uses the previous generation, the next skips one number (so uses 2nd previous generation), and so on until the 7th previous generation.
From the second the bottom row are the low byte correlation graphs, organized the same way as above.
Galois generator, 0xB400 tap set
This is the generator found in the Wikipedia Galois example. Its performance is not the worst, but it is still definitely not really good.
Galois generator, 0xA55A tap set
One of the decent Galois "seeds" I found. Note that the low part of the 16bit numbers seem to be a lot better than the above, however I couldn't find any Galois "seed" which would fuzz up the high byte.
My generator, 0x7F25 (adc), 0x00DB (eor) seed
This is the best of my generators where the high byte of the EOR value is zero. Limiting the high byte is useful on 8bit machines since then this calculation can be omitted for smaller code and faster execution if the loss of randomness performance is affordable.
My generator, 0x778B (adc), 0x4A8B (eor) seed
This is one of the very good quality seeds by my measurements.
To find seeds with good correlation, I built a small program which would analyse them to some degree, the same way for Galois and mine. The "good quality" examples were pinpointed by that program, and then I tested several of them and selected one from those.
Some conclusions:
The Galois generator seems to be more rigid than mine. On all the correlation graphs definite geometrical patterns are observable (some seeds produce "checkerboard" patterns, not shown here) even if it is not composed of lines. My generator also shows patterns, but with more generations they grow less defined.
A portion of the Galois generator's result which include the bits in the high byte seems to be inherently rigid which property seems to be absent from my generator. This is a weak assumption yet probably needing some more research (to see if this is always so with the Galois generator and not with mine on other bit combinations).
The Galois generator lacks zero (maximal period being 2^16-1).
As of now it is impossible to generate a good set of seeds for my generator above 20 bits.
Later I might get in this subject deeper seeking to test the generator with Diehard, but as of now the lack of the ability of generating large enough seeds for it makes it impossible.

This is some form of non-linear feedback shift register. I don't know if it has been used as such, but it somewhat resembles linear feedback shift registers. Read this Wikipedia page as an introduction to LFSRs. They are used frequently in pseudo random number generation.
However, your pseudo random number generator is inherently bad in that there is a linear correlation between the highest order bit of a previously generated number and the lowest order bit of the number generated next. You shift the highest bit B out, and then the lowest order bit of the new number will be the XOR of B, the lowest order bit of the additive constant num1 and the lowest order bit of the XORed constant num2, because binary addition is equivalent to exclusive or at the lowest order bit. Most likely your PRNG has other similar deficiencies. Creating good PRNGs is hard.
However, I must admit that the C64 code is pleasingly compact!
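The correlation described above can be checked exhaustively at 16 bits. This is a minimal sketch; the function name is hypothetical and the step is the generalized C form from the question with BITS fixed to 16.

```c
#include <stdint.h>

/* Returns 1 if, for every 16-bit state r, the low bit of the next state
   equals high_bit(r) XOR low_bit(num1) XOR low_bit(num2), else 0.
   Hypothetical checker; BITS is fixed to 16 for illustration. */
int low_bit_is_predictable(uint32_t num1, uint32_t num2)
{
    for (uint32_t r = 0; r <= 0xFFFFu; r++) {
        uint32_t high = (r >> 15) & 1u;  /* bit B that gets shifted out */
        uint32_t next = ((high + (r << 1) + num1) ^ num2) & 0xFFFFu;
        uint32_t predicted = high ^ (num1 & 1u) ^ (num2 & 1u);
        if ((next & 1u) != predicted)
            return 0;
    }
    return 1;
}
```

The check passes for any pair of constants, including the 0x778B / 0x4A8B and 0x7F25 / 0x00DB seeds quoted in the question, because no carry can reach the lowest bit of the addition.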

A clever homebrew modulus implementation

I'm programming a PLC with some legacy software (RSLogix 500, don't ask) and it does not natively support a modulus operation, but I need one. I do not have access to: modulus, integer division, local variables, a truncate operation (though I can hack it with rounding). Furthermore, all variables available to me are laid out in tables sorted by data type. Finally, it should work for floating point decimals, for example 12345.678 MOD 10000 = 2345.678.
If we make our equation:
dividend / divisor = integer quotient, remainder
There are two obvious implementations.
Implementation 1:
Perform floating point division: dividend / divisor = decimal quotient. Then hack together a truncation operation so you find the integer quotient. Multiply it by the divisor and find the difference between the dividend and that, which results in the remainder.
I don't like this because it involves a bunch of variables of different types. I can't 'pass' variables to a subroutine, so I just have to allocate some of the global variables located in multiple different variable tables, and it's difficult to follow. Unfortunately, 'difficult to follow' counts, because it needs to be simple enough for a maintenance worker to mess with.
Implementation 2:
Create a loop such that while dividend >= divisor, dividend = dividend - divisor. This is very clean, but it violates one of the big rules of PLC programming, which is to never use loops, since if someone inadvertently modifies an index counter you could get stuck in an infinite loop and machinery would go crazy or irrecoverably fault. Plus loops are hard for maintenance to troubleshoot. Plus, I don't even have looping instructions, I have to use labels and jumps. Eww.
So I'm wondering if anyone has any clever math hacks or smarter implementations of modulus than either of these. I have access to + - * /, exponents, sqrt, trig functions, log, abs value, and AND/OR/NOT/XOR.

How many bits are you dealing with? You could do something like:
if dividend >= 32 * divisor dividend -= 32 * divisor
if dividend >= 16 * divisor dividend -= 16 * divisor
if dividend >= 8 * divisor dividend -= 8 * divisor
if dividend >= 4 * divisor dividend -= 4 * divisor
if dividend >= 2 * divisor dividend -= 2 * divisor
if dividend >= 1 * divisor dividend -= 1 * divisor
remainder = dividend
Just unroll as many times as there are bits in the dividend. Be careful about those multiplies overflowing. This is just like your #2 except it takes log(n) instead of n steps, so it is feasible to unroll completely.
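A floating-point version of the unrolled steps might look like the sketch below. The name mod_unrolled is hypothetical, six steps are assumed (so the result is only a true remainder for 0 <= dividend < 64 * divisor), and the comparison is >= so that exact multiples reduce to 0.

```c
/* Remainder of dividend / divisor using only comparison, multiplication
   and subtraction. Assumes divisor > 0 and 0 <= dividend < 64 * divisor.
   Hypothetical helper; add more steps for a larger range. */
double mod_unrolled(double dividend, double divisor)
{
    if (dividend >= 32 * divisor) dividend -= 32 * divisor;
    if (dividend >= 16 * divisor) dividend -= 16 * divisor;
    if (dividend >= 8 * divisor)  dividend -= 8 * divisor;
    if (dividend >= 4 * divisor)  dividend -= 4 * divisor;
    if (dividend >= 2 * divisor)  dividend -= 2 * divisor;
    if (dividend >= 1 * divisor)  dividend -= 1 * divisor;
    return dividend;   /* what is left over is the remainder */
}
```

For the example in the question, mod_unrolled(12345.678, 10000.0) only triggers the last step and yields approximately 2345.678.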

If you don't mind overly complicating things and wasting computer time you can calculate modulus with periodic trig functions:
atan(tan(( 12345.678 -5000)*pi/10000))*10000/pi+5000 = 2345.678
Seriously though, subtracting 10000 once or twice (your "implementation 2") is better. The usual algorithms for general floating point modulus require a number of bit-level manipulations that are probably unfeasible for you. See for example http://www.netlib.org/fdlibm/e_fmod.c (The algorithm is simple but the code is complex because of special cases and because it is written for IEEE 754 double precision numbers assuming there is no 64-bit integer type)

This all seems completely overcomplicated. You have an encoder index that rolls over at 10000 and objects rolling along the line whose positions you are tracking at any given point. If you need to forward project stop points or action points along the line, just add however many inches you need and immediately subtract 10000 if your target result is greater than 10000.
Alternatively, or in addition, you always get a new encoder value every PLC scan. In the case where the difference between the current value and last value is negative you can energize a working contact to flag the wrap event and make appropriate corrections for any calculations on that scan. (**or increment a secondary counter as below)
Without knowing more about the actual problem it is hard to suggest a more specific solution but there are certainly better solutions. I don't see a need for MOD here at all. Furthermore, the guys on the floor will thank you for not filling up the machine with obfuscated wizard stuff.
I quote :
Finally, it has to work for floating point decimals, for example
12345.678 MOD 10000 = 2345.678
There is a brilliant function that exists to do this - it's a subtraction. Why does it need to be more complicated than that? If your conveyor line is actually longer than 833 feet then roll a second counter that increments on a primary index roll-over until you've got enough distance to cover the ground you need.
For example, if you need 100000 inches of conveyor memory you can have a secondary counter that rolls over at 10. Primary encoder rollovers can be easily detected as above and you increment the secondary counter each time. Your working encoder position, then, is 10000 times the counter value plus the current encoder value. Work in the extended units only and make the secondary counter roll over at whatever value you require to not lose any parts. The problem, again, then reduces to a simple subtraction (as above).
I use this technique with a planetary geared rotational part holder, for example. I have an encoder that rolls over once per primary rotation while the planetary geared satellite parts (which themselves rotate around a stator gear) require 43 primary rotations to return to an identical starting orientation. With a simple counter that increments (or decrements, depending on direction) at the primary encoder rollover point it gives you a fully absolute measure of where the parts are at. In this case, the secondary counter rolls over at 43.
This would work identically for a linear conveyor with the only difference being that a linear conveyor can go on for an infinite distance. The problem then only needs to be limited by the longest linear path taken by the worst-case part on the line.
With the caveat that I've never used RSLogix, here is the general idea (I've used generic symbols here and my syntax is probably a bit wrong but you should get the idea)
With the above, you end up with a value ENC_EXT which has essentially transformed your encoder from a 10k inch one to a 100k inch one. I don't know if your conveyor can run in reverse, if it can you would need to handle the down count also. If the entire rest of your program only works with the ENC_EXT value then you don't even have to worry about the fact that your encoder only goes to 10k. It now goes to 100k (or whatever you want) and the wraparound can be handled with a subtraction instead of a modulus.
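The rollover counter described above can be sketched as follows, in C rather than ladder logic. All names (ext_encoder, ext_encoder_update, ext_encoder_project) and the constants are illustrative assumptions; the sketch handles forward motion only and treats any drop in the raw value as a rollover.

```c
#include <stdint.h>

#define ENC_MODULUS 10000  /* raw encoder rolls over at this count (assumed) */
#define WRAP_COUNT  10     /* secondary counter rolls over here (assumed)   */

typedef struct {
    int32_t last_enc;  /* raw encoder value seen on the previous scan */
    int32_t wraps;     /* secondary counter, 0 .. WRAP_COUNT-1        */
} ext_encoder;

/* Call once per scan with the raw encoder value; returns ENC_EXT in the
   range 0 .. ENC_MODULUS*WRAP_COUNT-1. */
int32_t ext_encoder_update(ext_encoder *e, int32_t enc)
{
    if (enc < e->last_enc) {          /* raw value dropped: rollover */
        e->wraps++;
        if (e->wraps >= WRAP_COUNT)
            e->wraps = 0;
    }
    e->last_enc = enc;
    return e->wraps * ENC_MODULUS + enc;
}

/* Forward-project a target position; the wraparound is one subtraction.
   Assumes 0 <= offset < ENC_MODULUS*WRAP_COUNT. */
int32_t ext_encoder_project(int32_t enc_ext, int32_t offset)
{
    int32_t target = enc_ext + offset;
    if (target >= ENC_MODULUS * WRAP_COUNT)
        target -= ENC_MODULUS * WRAP_COUNT;
    return target;
}
```

Reading 9990 on one scan and 5 on the next gives extended positions 9990 and 10005, so the rest of the program never sees the 10k wrap.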
Afterword :
PLCs are first and foremost state machines. The best solutions for PLC programs are usually those that are in harmony with this idea. If your hardware is not sufficient to fully represent the state of the machine then the PLC program should do its best to fill in the gaps for that missing state information with the information it has. The above solution does this - it takes the insufficient 10000 inches of state information and extends it to suit the requirements of the process.
The benefit of this approach is that you now have preserved absolute state information, not just for the conveyor, but also for any parts on the line. You can track them forward and backward for troubleshooting and debugging and you have a much simpler and clearer coordinate system to work with for future extensions. With a modulus calculation you are throwing away state information and trying to solve individual problems in a functional way - this is often not the best way to work with PLCs. You kind of have to forget what you know from other programming languages and work in a different way. PLCs are a different beast and they work best when treated as such.

You can use a subroutine to do exactly what you are talking about. You can tuck the tricky code away so the maintenance techs will never encounter it. It's almost certainly the easiest for you and your maintenance crew to understand.
It's been a while since I used RSLogix500, so I might get a couple of terms wrong, but you'll get the point.
Define a Data File each for your floating points and integers, and give them symbols something along the lines of MOD_F and MOD_N. If you make these intimidating enough, maintenance techs leave them alone, and all you need them for is passing parameters and workspace during your math.
If you are really worried about them messing up the data tables, there are ways to protect them, but I have forgotten what they are on a SLC/500.
Next, define a subroutine, far away numerically from the ones in use now, if possible. Name it something like MODULUS. Again, maintenance guys almost always stay out of SBRs if they sound like programming names.
In the rungs immediately before your JSR instruction, load the variables you want to process into the MOD_N and MOD_F Data Files. Comment these rungs with instructions that they load data for MODULUS SBR. Make the comments clear to anyone with a programming background.
Call your JSR conditionally, only when you need to. Maintenance techs do not bother troubleshooting non-executing logic, so if your JSR is not active, they will rarely look at it.
Now you have your own little walled garden where you can write your loop without maintenance getting involved with it. Only use those Data Files, and don't assume the state of anything but those files is what you expect. In other words, you cannot trust indirect addressing. Indexed addressing is OK, as long as you define the index within your MODULUS JSR. Do not trust any incoming index. It's pretty easy to write a FOR loop with one word from your MOD_N file, a jump and a label. Your whole Implementation #2 should be less than ten rungs or so. I would consider using an expression instruction or something...the one that lets you just type in an expression. Might need a 504 or 505 for that instruction. Works well for combined float/integer math. Check the results though to make sure the rounding doesn't kill you.
After you are done, validate your code, perfectly if possible. If this code ever causes a math overflow and faults the processor, you will never hear the end of it. Run it on a simulator if you have one, with weird values (in case they somehow mess up the loading of the function inputs), and make sure the PLC does not fault.
If you do all that, no one will ever even realize you used regular programming techniques in the PLC, and you will be fine. AS LONG AS IT WORKS.

This is a loop based on the answer by @Keith Randall, but it also keeps track of the quotient of the division by subtraction. I kept the printf's for clarity.
#include <stdio.h>
#include <limits.h>
#define NBIT (CHAR_BIT * sizeof (unsigned int))
unsigned modulo(unsigned dividend, unsigned divisor)
{
    unsigned quotient, bit;
    printf("%u / %u:", dividend, divisor);
    for (bit = NBIT, quotient=0; bit-- && dividend >= divisor; ) {
        /* 1ull keeps (2^bit * divisor) from overflowing where long is 32 bits */
        if (dividend < (1ull << bit) * divisor) continue;
        dividend -= (1ull << bit) * divisor;
        quotient += (1ull << bit);
    }
    printf("%u, %u\n", quotient, dividend);
    return dividend; // the remainder *is* the modulo
}
int main(void)
{
    modulo( 13,5);
    modulo( 33,11);
    return 0;
}

Fastest/easiest way to average ARGB color ints?

I have five colors stored in the format #AARRGGBB as unsigned ints, and I need to take the average of all five. Obviously I can't simply divide each int by five and just add them, and the only way I thought of so far is to bitmask them, do each channel separately, and then OR them together again. Is there a clever or concise way of averaging all five of them?

Half way between your (OP) proposed solution and Patrick's solution looks quite neat:
Color colors[5]={ 0xAARRGGBB,...};
unsigned long sum1=0,sum2=0;
for (int i=0;i<5;i++)
{
sum1+= colors[i] &0x00FF00FF; // 0x00RR00BB
sum2+=(colors[i]>>8)&0x00FF00FF; // 0x00AA00GG
}
unsigned long output=0;
output|=(((sum1&0xFFFF)/5)&0xFF);
output|=(((sum2&0xFFFF)/5)&0xFF)<<8;
sum1>>=16;sum2>>=16; // and now the top halves
output|=(((sum1&0xFFFF)/5)&0xFF)<<16;
output|=(((sum2&0xFFFF)/5)&0xFF)<<24;
I don't think you could really divide sum1/sum2 by 5, because the bits from the top half would spill down...
If an approximation is acceptable, you could try multiplying by something like 0.1875 (0.125+0.0625); that means multiply by 3 and shift down by 4 places, which you could do with bitmasking and care.
The problem is that 0.2 has no finite binary representation, so multiplying by it is a pain.
As ever, accuracy or speed. Your choice.
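The two-lane summing idea above can be written as a self-contained function. This is a sketch with a made-up name (average5_argb); each channel is averaged with truncating integer division, and each 16-bit lane holds at most 5 * 255 = 1275, so the lanes cannot overflow into each other.

```c
#include <stdint.h>

/* Averages five 0xAARRGGBB colors channel by channel (truncating).
   Hypothetical helper following the two-lane approach. */
uint32_t average5_argb(const uint32_t colors[5])
{
    uint32_t rb = 0;  /* lanes: 0x00RR00BB */
    uint32_t ag = 0;  /* lanes: 0x00AA00GG */
    for (int i = 0; i < 5; i++) {
        rb += colors[i] & 0x00FF00FFu;
        ag += (colors[i] >> 8) & 0x00FF00FFu;
    }
    uint32_t b = (rb & 0xFFFFu) / 5u;  /* divide each lane separately */
    uint32_t r = (rb >> 16) / 5u;
    uint32_t g = (ag & 0xFFFFu) / 5u;
    uint32_t a = (ag >> 16) / 5u;
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```

For the inputs 0x10203040, 0x20304050, 0x30405060, 0x40506070 and 0x50607080 the result is 0x30405060.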

When using x86 machines with at least SSE, and if you need to approximate only, you could use the assembly instruction PAVGB (Packed Average Byte), which averages bytes. See http://www.tommesani.com/SSEPrimer.html for explanation.
Since you've got 5 values, you would need to be creative in calling PAVGB, since PAVGB will only do two values at a time.

I found a smart solution to your problem; sadly it is only applicable if the number of colors is a power of 2. I'll show it for the case of two colors:
mask = 0x01010101
pom = ~((a^b) & mask) # ^ means xor here, ~ negation
a = a & pom
b = b & pom
avg = (a+b) >> 1
The trick of this method is this: when you compute an average, the LSB of the sum (in the case of two numbers) has no meaning, as it is dropped by the division (we're talking integers here, of course). In your problem, the LSB of each partial sum is at the same time the carry bit of the sum of the adjacent color. Provided that the LSB of every color sum is 0, you can safely add the two integers, since the additions won't interfere with each other. The bit shift divides every color by two. Note that the carry out of the top channel needs an extra bit, so the sum has to be held in a type wider than the colors.
This method can be used with 4 colors as well, but you have to work out the carry of the sum of the numbers made of the last two bits of every color. It is also possible to omit this part and just zero the last two bits of every color; the largest error introduced by this omission is 1 for every component.
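A sketch of the two-color case in C, under two assumptions that are not spelled out above: the mask is taken per byte (0x01010101 for a 32-bit color), and the XOR is applied before the AND. The sum is held in 64 bits so that the carry out of the alpha byte is not lost; average2_argb is a made-up name.

```c
#include <stdint.h>

/* Per-channel truncating average of two 0xAARRGGBB colors.
   Hypothetical helper; uses a 64-bit sum to keep the top carry. */
uint32_t average2_argb(uint32_t a, uint32_t b)
{
    const uint32_t mask = 0x01010101u;  /* LSB of every channel */
    uint32_t pom = ~((a ^ b) & mask);   /* clear LSBs that differ */
    uint64_t sum = (uint64_t)(a & pom) + (uint64_t)(b & pom);
    return (uint32_t)(sum >> 1);        /* halve all channels at once */
}
```

For example, average2_argb(0x10203040, 0x30405060) gives 0x20304050.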

EDIT I'll leave this attempt for posterity, but please note that it is incorrect and will not work.
One "clever" way you could do it would be to insert zeros between the components, parse into an unsigned long, average the numbers, convert back to a hex string, remove the zeros and finally parse into an unsigned int.
i.e. convert #AARRGGBB to #AA00RR00GG00BB
This method involves parsing and string manipulations, so will undoubtedly be slower than the method you proposed.
If you were to factor your own solution carefully, it might actually look quite clever itself.

What's better multiplication by 2 or adding the number to itself ? BIGnums

I need some help deciding what is better performance wise.
I'm working with bigints (more than 5 million digits) and most of the computation (if not all) is in doubling the current bigint. So I wanted to know: is it better to multiply every cell (part of the bigint) by 2, then mod it, and you know the rest? Or is it better to just add the bigint to itself?
I'm thinking a bit about ease of implementation too (addition of 2 bigints is more complicated than multiplication by 2), but I'm more concerned about performance than about code size or ease of implementation.
Other info:
I'll code it in C++; I'm fairly familiar with bigints (I just never came across this problem).
I don't need any source code or similar, I just need a good opinion and an explanation/proof of it, since I need to make a good decision from the start: the project will be fairly large and mostly built around this part, so it depends heavily on what I choose now.
Thanks.

Try bitshifting each bit. That is probably the fastest method. When you bitshift an integer to the left, then you double it (multiply by 2). If you have several long integers in a chain, then you need to store the most significant bit, because after shifting it, it will be gone, and you need to use it as the least significant bit on the next long integer.
This doesn't actually matter a whole lot. Modern 64-bit computers can add two integers in the same time it takes to bit-shift them (1 clock cycle), so it will take just as long. I suggest you try the different methods and then report back if there are any major time differences. All three methods should be easy to implement, and generating a 5mb number should also be easy, using a random number generator.

To store a 5 million digit integer, you'll need quite a few bits -- 5 million if you were referring to binary digits, or ~17 million bits if those were decimal digits. Let's assume the numbers are stored in a binary representation, and your arithmetic happens in chunks of some size, e.g. 32 bits or 64 bits.
If adding the number to itself, each chunk is added to itself and to the carry from the addition of the previous chunk. Any carry forward is kept for the next chunk. That's a couple of addition operations, and some bookkeeping for tracking the carry.
If multiplying by two by left-shifting, that's one left-shift operation for the multiplication, and one right-shift operation plus an AND with 1 to obtain the carry. Carry bookkeeping is a little simpler.
Superficially, the shift version appears slightly faster. The overall cost of doubling the number, however, is highly influenced by the size of the number. A 17 million bits number exceeds the cpu's L1 cache, and processing time is likely overwhelmed by memory fetch operations. On modern PC hardware, memory fetch is orders of magnitude slower than addition and shifting.
With that, you might want to pick the one that's simpler for you to implement. I'm leaning towards the left-shift version.
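The left-shift version could be sketched like this, assuming the number is stored as an array of 32-bit chunks with the least significant chunk first. The name bignum_double and the storage layout are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Doubles a bignum in place. limbs[0] is the least significant 32-bit
   chunk. Returns the bit carried out of the top chunk (0 or 1).
   Hypothetical helper; the limb layout is an assumption. */
uint32_t bignum_double(uint32_t *limbs, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t next_carry = limbs[i] >> 31;  /* top bit leaves this chunk */
        limbs[i] = (limbs[i] << 1) | carry;    /* shift in previous carry   */
        carry = next_carry;
    }
    return carry;
}
```

If the returned carry is 1, the caller has to grow the number by one chunk and store the carry there.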

did you try shifting the bits?
<< multiplies by 2
>> divides by 2

Left bit shifting by one is the same as a multiplication by two !
This link explains the mechanism and gives examples.
int A = 10; //...01010 = 10
int B = A<<1; //..010100 = 20

If it really matters, you need to write all three methods (including bit-shift!), and profile them, on various input. (Use small numbers, large numbers, and random numbers, to avoid biasing the results.)
Sorry for the "Do it yourself" answer, but that's really the best way. No one cares about this result more than you, which just makes you the best person to figure it out.

Well-implemented multiplication of bignums is O(n log(n) log(log(n))). Addition is O(n). Therefore, adding a number to itself should be faster than multiplying it by two. However, that's only true if you're multiplying two arbitrary bignums; if your library knows you're multiplying a bignum by a small integer it may be able to optimize to O(n).
As others have noted, bit-shifting is also an option. It should be O(n) as well, but with a smaller constant factor. But that will only work if your bignum library supports bit shifting.

most of the computation (if not all) is in the part of doubling the current bigint
If all your computation is in doubling the number, why don't you just keep a distinct (base-2) scale field? Then just add one to scale, which can just be a plain-old int. This will surely be faster than any manipulation of some-odd million bits.
IOW, use a bigfloat.
random benchmark
use Math::GMP;
use Time::HiRes qw(clock_gettime CLOCK_REALTIME CLOCK_PROCESS_CPUTIME_ID);
my $n = Math::GMP->new(2);
$n = $n ** 1_000_000;
my $m = Math::GMP->new(2);
$m = $m ** 10_000;
my $str;
for ($bits = 1_000_000; $bits <= 2_000_000; $bits += 10_000) {
my $start = clock_gettime(CLOCK_PROCESS_CPUTIME_ID);
$str = "$n" for (1..3);
my $stop = clock_gettime(CLOCK_PROCESS_CPUTIME_ID);
print "$bits,@{[($stop-$start)/3]}\n";
$n = $n * $m;
}
This seems to show that GMP is somehow doing its conversion in O(n) time (where n is the number of bits in the binary number). This may be due to the special case of having a 1 followed by a million (or two) zeros; the GNU MP docs say it should be slower (but still better than O(N^2)).
http://img197.imageshack.us/img197/6527/chartp.png
