Factorial of integer mod m fast calculation [duplicate]

Factorial of integer mod m fast calculation [duplicate] - algorithm

This question already has answers here:
Fast way to calculate n! mod m where m is prime?
(8 answers)
Closed 8 years ago.
Is it possible to calculate Factorial(x) mod m without looping through the whole expression chain
((1 % m) * (2 %m) * (3 % m) * ... (x % m)) % m?
To be more precise m can be 1 <= m <= 10^7 and x : 1<= x < m

There are few fast algorithms for Factorial out there
so the answer is: Yes you can compute factorial without looping through all values
all I saw uses primes decompositions (including mine algorithm)
so from that it is just matter of usein mod multiplication instead of normal multiplication
look here: Fast exact bigint factorial is mine fast algorithm
and the other answer also contains link to swinging primes algorithm ...
[Notes]
for N! you will need a list of primes up to N
but the rest of code can work on arithmetics capable of holding N,m
so no need for huge numbers ...
[edit1] mine 32bit C++ implementations
//---------------------------------------------------------------------------
DWORD modmul(DWORD a,DWORD b,DWORD n)
{
DWORD _a,_b,_n;
_a=a;
_b=b;
_n=n;
asm {
mov eax,_a
mov ebx,_b
mul ebx // H(edx),L(eax) = eax * ebx
mov ebx,_n
div ebx // eax = H(edx),L(eax) / ebx
mov _a,edx // edx = H(edx),L(eax) % ebx
}
return _a;
}
//---------------------------------------------------------------------------
DWORD modfact0(DWORD n,DWORD m) // (n!) mod m (naive approach)
{
DWORD i,f;
for (f=1,i=2;i<=n;i++) f=modmul(f,i,m);
return f;
}
//---------------------------------------------------------------------------
DWORD modfact1(DWORD n,DWORD m) // (n!) mod m (mine fast approach)
{
if (n<=4)
{
if (n==4) return 24;
if (n==3) return 6;
if (n==2) return 2;
if (n==1) return 1;
if (n==0) return 1;
}
int N4,N2,p,i,j,e; DWORD c,pp;
N4=(n>>2)<<2;
N2=N4>>1;
c=modfact1(N2,m); c=modmul(c,c,m); // c=((2N)!)^2;
for (i=0;;i++) // c*= T2
{
p=primes_i32.dat[i];
if (!p) break;
if (p>N4) break;
for (e=0,j=N4;j;e+=j&1,j/=p);
if (e) // c*=p^e
{
if (p==2) c<<=e;
else for (pp=p;;)
{
if (int(e&1)) c=modmul(c,pp,m);
e>>=1; if (!e) break;
pp=modmul(pp,pp,m);
}
}
}
for (i=N4+1;i<=n;i++) c=modmul(c,i,m);
return c;
}
//---------------------------------------------------------------------------
primes:
DWORD primes_i32.dat[] is precomputed sorted (ascending) list of all primes up to n
Here the result:
[ 18.529 ms] slow modfact0(1000000,1299721) = 195641
[ 2.995 ms] fast modfact1(1000000,1299721) = 195641
[ 96.242 ms] slow modfact0(5000000,9999991) = 2812527
[ 13.305 ms] fast modfact1(5000000,9999991) = 2812527
1299721 is first prime close to 1000000 I found
if m is not prime and subresult hits zero then you can ignore the rest of multiplication to massive speed up...
Hope the result is OK have nothing to compare with ...

Related

Compute the exact inverse of this "simple" floating-point function

I have the following function:
float int_to_qty(unsigned x) {
const float MAX = 8.5f;
const float MIN = .001f;
return ((MAX-MIN) / (float)(1<<24)) * x + MIN;
}
This compiles (with reasonable options, on x86) to the following:
.LCPI0_0:
.long 0x3507fbe7 # float 5.06579852E-7
.LCPI0_1:
.long 0x3a83126f # float 0.00100000005
int_to_qty: # #int_to_qty
mov eax, edi
cvtsi2ss xmm0, rax
mulss xmm0, dword ptr [rip + .LCPI0_0]
addss xmm0, dword ptr [rip + .LCPI0_1]
ret
I consider the assembly to be the "canonical" version of the function: Convert the int to a float, multiply by a constant at 32-bit precision, add another constant at 32-bit precision, that's the result.
I want to find the exact inverse of this function. Specifically, a function
unsigned qty_to_int(float qty) that will pass the following test:
int test() {
for (unsigned i = 0; i < (1 << 24); ++i) {
float qty = int_to_qty(i);
if (int_to_qty(qty_to_int(qty)) != qty) {
return 0;
}
}
return 1;
}
Notes:
In the range 4 ≤ int_to_qty(x) < 8, the returned values primarily differ by 1 ulp, which is what makes this challenging.
In the range 8 ≤ int_to_qty(x) < 8.5, the function stops being one-to-one. In this case either answer is fine for the inverse, it doesn't have to be consistently the lowest or the highest.

After wrestling for a long time, I finally came up with a solution that passes the tests. (In Rust, but the translation to C is straightforward.)
pub fn qty_to_int(qty: f64) -> u32 {
const MAX: f32 = 8.5;
const MIN: f32 = 0.001;
let size_inv = f64::from(1<<24) / f64::from(MAX - MIN);
// We explicitly shrink the precision to f32 and then pop back to f64, because we *need* to
// perform that rounding step to properly reverse the addition at the end of int_to_qty.
// We could do the whole thing at f32 precision, except that our input is f64 so the
// subtraction needs to be done at f64 precision.
let fsqueezed: f32 = (qty - f64::from(MIN)) as f32;
// The squeezed subtraction is a one-to-one operation across most of our range. *However*,
// in the border areas where our input is just above an exponent breakpoint, but
// subtraction will bring it below, we hit an issue: The addition in int_to_qty() has
// irreversibly lost a bit in the lowest place! This causes issues when we invert the
// multiply, since we are counting on the error to be centered when we round at the end.
//
// To solve this, we need to re-bias the error by subtracting 0.25 ulp for these cases.
// Technically this should be applied for ranges like [2.0,2.001) as well, but they
// don't need it since they already round correctly (due to having more headroom).
let adj = if qty > 4.0 && qty < 4.001 {
0.5 - 2.0 / 8.499
} else if qty > 8.0 && qty < 8.001 {
0.5 - 4.0 / 8.499
} else {
0.5
};
// Multiply and round, taking into account the possible adjustments.
let fresult = f64::from(fsqueezed) * size_inv + adj;
unsafe { fresult.to_int_unchecked() }
}

Is there an efficient way to calculate ceiling of log_b(a)?

I need to accurately calculate where a and b
are both integers. If I simply use typical change of base formula with floating point math functions I wind up with errors due to rounding error.

You can use this identity:
b^logb(a) = a
So binary search x = logb(a) so the result of b^x is biggest integer which is still less than a and afterwards just increment the final result.
Here small C++ example for 32 bits:
//---------------------------------------------------------------------------
DWORD u32_pow(DWORD a,DWORD b) // = a^b
{
int i,bits=32;
DWORD d=1;
for (i=0;i<bits;i++)
{
d*=d;
if (DWORD(b&0x80000000)) d*=a;
b<<=1;
}
return d;
}
//---------------------------------------------------------------------------
DWORD u32_log2(DWORD a) // = ceil(log2(a))
{
DWORD x;
for (x=32;((a&0x80000000)==0)&&(x>1);x--,a<<=1);
return x;
}
//---------------------------------------------------------------------------
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
DWORD x,m,bx;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
m=1<<(u32_log2(a)-1); // max limit for b=2, all other bases lead to smaller exponents anyway
for (x=0;m;m>>=1)
{
x|=m;
bx=u32_pow(b,x);
if (bx>=a) x^=m;
}
return x+1;
}
//---------------------------------------------------------------------------
Where DWORD is any unsigned 32bit int type... for more info about pow,log,exp and bin search see:
Power by squaring for negative exponents
Note that u32_log2 is not really needed (unless you want bigints) you can use constant bitwidth instead, also some CPUs like x86 has single asm instruction returning the same much faster than for loop...
Now the next step is exploit the fact that the u32_pow bin search is the same as the u32_log bin search so we can merge the two functions and get rid of one nested for loop completely improving complexity considerably like this:
//---------------------------------------------------------------------------
DWORD u32_pow(DWORD a,DWORD b) // = a^b
{
int i,bits=32;
DWORD d=1;
for (i=0;i<bits;i++)
{
d*=d;
if (DWORD(b&0x80000000)) d*=a;
b<<=1;
}
return d;
}
//---------------------------------------------------------------------------
DWORD u32_log2(DWORD a) // = ceil(log2(a))
{
DWORD x;
for (x=32;((a&0x80000000)==0)&&(x>1);x--,a<<=1);
return x;
}
//---------------------------------------------------------------------------
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
const int _bits=32; // DWORD bitwidth
DWORD bb[_bits]; // squares of b LUT for speed up b^x
DWORD x,m,bx,bx0,bit,bits;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
// max limit for x where b=2, all other bases lead to smaller x
bits=u32_log2(a);
// compute bb LUT
bb[0]=b;
for (bit=1;bit< bits;bit++) bb[bit]=bb[bit-1]*bb[bit-1];
for ( ;bit<_bits;bit++) bb[bit]=1;
// bin search x and b^x at the same time
for (bx=1,x=0,bit=bits-1,m=1<<bit;m;m>>=1,bit--)
{
x|=m; bx0=bx; bx*=bb[bit];
if (bx>=a){ x^=m; bx=bx0; }
}
return x+1;
}
//---------------------------------------------------------------------------
The only drawback is that we need LUT for squares of b so: b,b^2,b^4,b^8... up to bits number of squares
Beware squaring will double the number of bits so you should also handle overflow if b or a are too big ...
[Edit2] more optimization
As benchmark on normal ints (on bigints the bin search is much much faster) revealed bin search version is the same speed as naive version (because of many subsequent operations except multiplications):
DWORD u32_log_naive(DWORD b,DWORD a) // = ceil(logb(a))
{
int x,bx;
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
for (x=2,bx=b;bx*=b;x++)
if (bx>=a) break;
return x;
}
We can optimize more:
we can comment out computation of unused squares:
//for ( ;bit<_bits;bit++) bb[bit]=1;
with this bin search become faster also on ints but not by much
we can use faster log2 instead of naive one
see: Fastest implementation of log2(int) and log2(float)
putting all together (x86 CPUs):
DWORD u32_log(DWORD b,DWORD a) // = ceil(logb(a))
{
const int _bits=32; // DWORD bitwidth
DWORD bb[_bits]; // squares of b LUT for speed up b^x
DWORD x,m,bx,bx0,bit,bits;
// edge cases
if (b< 2) return 0;
if (a< 2) return 0;
if (a<=b) return 1;
// max limit for x where b=2, all other bases lead to smaller x
asm {
bsr eax,a; // bits=u32_log2(a);
mov bits,eax;
}
// compute bb LUT
bb[0]=b;
for (bit=1;bit< bits;bit++) bb[bit]=bb[bit-1]*bb[bit-1];
// for ( ;bit<_bits;bit++) bb[bit]=1;
// bin search x and b^x at the same time
for (bx=1,x=0,bit=bits-1,m=1<<bit;m;m>>=1,bit--)
{
x|=m; bx0=bx; bx*=bb[bit];
if (bx>=a){ x^=m; bx=bx0; }
}
return x+1;
}
however the speed up is just slight for example naive 137 ms bin search 133 ms ... note that faster log2 did almost no change but that is because how my compiler is handling inline asm (not sure why BDS2006 and BCC32 is very slow on switching between asm and C++ but its true that is why in older C++ builders inline asm functions where not a good choice for speed optimizations unless a major speedup was expected) ...

How does a compiler optimise this factorial function so well?

So I have been having a look at some of the magic that is O3 in GCC (well actually I'm compiling using Clang but it's the same with GCC and I'm guessing a large part of the optimiser was pulled over from GCC to Clang).
Consider this C program:
int foo(int n) {
if (n == 0) return 1;
return n * foo(n-1);
}
int main() {
return foo(10);
}
The first thing I was pretty WOW-ed at (which was also WOW-ed at in this question - https://stackoverflow.com/a/414774/1068248) was how int foo(int) (a basic factorial function) compiles into a tight loop. This is the ARM assembly for it:
.globl _foo
.align 2
.code 16
.thumb_func _foo
_foo:
mov r1, r0
movs r0, #1
cbz r1, LBB0_2
LBB0_1:
muls r0, r1, r0
subs r1, #1
bne LBB0_1
LBB0_2:
bx lr
Blimey I thought. That's pretty interesting! Completely tight looping to do the factorial. WOW. It's not a tail call optimisation since, well, it's not a tail call. But it appears to have done a much similar optimisation.
Now look at main:
.globl _main
.align 2
.code 16
.thumb_func _main
_main:
movw r0, #24320
movt r0, #55
bx lr
That just blew my mind to be honest. It's just totally bypassing foo and returning 3628800 which is 10!.
This makes me really realise how your compiler can often do a much better job than you can at optimising your code. But it raises the question, how does it manage to do such a good job? So, can anyone explain (possibly by linking to relevant code) how the following optimisations work:
The initial foo optimisation to be a tight loop.
The optimisation where main just goes and returns the result directly rather than actually executing foo.
Also another interesting side effect of this question would be to show some more interesting optimisations which GCC/Clang can do.

If you compile with gcc -O3 -fdump-tree-all, you can see that the first dump in which the recursion has been turned into a loop is foo.c.035t.tailr1. This means the same optimisation that handles other tail calls also handles this slightly extended case. Recursion in the form of n * foo(...) or n + foo(...) is not that hard to handle manually (see below), and since it's possible to describe exactly how, the compiler can perform that optimisation automatically.
The optimisation of main is much simpler: inlining can turn this into 10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 * 1, and if all the operands of a multiplication are constants, then the multiplication can be performed at compile time.
Update: Here's how you can manually remove the recursion from foo, which can be done automatically. I'm not saying this is the method used by GCC, but it's one realistic possibility.
First, create a helper function. It behaves exactly as foo(n), except that its results are multiplied by an extra parameter f.
int foo(int n)
{
return foo_helper(n, 1);
}
int foo_helper(int n, int f)
{
if (n == 0) return f * 1;
return f * n * foo(n-1);
}
Then, turn recursive calls of foo into recursive calls of foo_helper, and rely on the factor parameter to get rid of the multiplication.
int foo(int n)
{
return foo_helper(n, 1);
}
int foo_helper(int n, int f)
{
if (n == 0) return f;
return foo_helper(n-1, f * n);
}
Turn this into a loop:
int foo(int n)
{
return foo_helper(n, 1);
}
int foo_helper(int n, int f)
{
restart:
if (n == 0) return f;
{
int newn = n-1;
int newf = f * n;
n = newn;
f = newf;
goto restart;
}
}
Finally, inline foo_helper:
int foo(int n)
{
int f = 1;
restart:
if (n == 0) return f;
{
int newn = n-1;
int newf = f * n;
n = newn;
f = newf;
goto restart;
}
}
(Naturally, this is not the most sensible way to manually write the function.)

Fastest sort of fixed length 6 int array

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 integers?
As the question is very low level:
we can't assume libraries are available (and the call itself has its cost), only plain C
to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.
Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.
As for why it is interesting, there is several layers:
the example is simple and easy to understand and measure, not much C skill involved
it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.
Here is my reference (naive, not optimized) implementation and my test set.
#include <stdio.h>
static __inline__ int sort6(int * d){
char j, i, imin;
int tmp;
for (j = 0 ; j < 5 ; j++){
imin = j;
for (i = j + 1; i < 6 ; i++){
if (d[i] < d[imin]){
imin = i;
}
}
tmp = d[j];
d[j] = d[imin];
d[imin] = tmp;
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
int main(int argc, char ** argv){
int i;
int d[6][5] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
    sort6(d[i]);
    /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
     *  d[i][0], d[i][6], d[i][7],
      *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}
Raw results
As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).
I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2
Direct call to qsort library function : 689.38
Naive implementation (insertion sort) : 285.70
Insertion Sort (Daniel Stutzbach) : 142.12
Insertion Sort Unrolled : 125.47
Rank Order : 102.26
Rank Order with registers : 58.03
Sorting Networks (Daniel Stutzbach) : 111.68
Sorting Networks (Paul R) : 66.36
Sorting Networks 12 with Fast Swap : 58.86
Sorting Networks 12 reordered Swap : 53.74
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.54
Reordered Sorting Network w/ fast swap V2 : 33.63
Inlined Bubble Sort (Paolo Bonzini) : 48.85
Unrolled Insertion Sort (Paolo Bonzini) : 75.30
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1
Direct call to qsort library function : 705.93
Naive implementation (insertion sort) : 135.60
Insertion Sort (Daniel Stutzbach) : 142.11
Insertion Sort Unrolled : 126.75
Rank Order : 46.42
Rank Order with registers : 43.58
Sorting Networks (Daniel Stutzbach) : 115.57
Sorting Networks (Paul R) : 64.44
Sorting Networks 12 with Fast Swap : 61.98
Sorting Networks 12 reordered Swap : 54.67
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.24
Reordered Sorting Network w/ fast swap V2 : 33.07
Inlined Bubble Sort (Paolo Bonzini) : 45.79
Unrolled Insertion Sort (Paolo Bonzini) : 80.15
I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?
Comments on proposed solutions
Insertion Sort (Daniel Stutzbach)
As expected minimizing branches is indeed a good idea.
Sorting Networks (Daniel Stutzbach)
Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).
Sorting Networks (Paul R)
The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?
Sorting Networks 12 SWAP with Fast Swap
As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.
It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.
Calling Library qsort
To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).
Rank order
Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimizing this particular code (there was very little difference for other proposals).
update:
As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.
Sorting Networks 12 with reordered Swap
The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.
Sorting Networks 12 with Simple Swap
One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.
The "best" code is now as follow:
static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x)
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
const int b = max(d[x], d[y]); \
d[x] = a; d[y] = b; }
SWAP(1, 2);
SWAP(4, 5);
SWAP(0, 2);
SWAP(3, 5);
SWAP(0, 1);
SWAP(3, 4);
SWAP(1, 4);
SWAP(0, 3);
SWAP(2, 5);
SWAP(1, 3);
SWAP(2, 4);
SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.
Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted dat, so it will be the better choice if there's an above-average chance of almost-sorted data.
The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.
Here's an insertion sort implementation:
static __inline__ int sort6(int *d){
int i, j;
for (i = 1; i < 6; i++) {
int tmp = d[i];
for (j = i; j >= 1 && tmp < d[j-1]; j--)
d[j] = d[j-1];
d[j] = tmp;
}
}
Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:
static __inline__ int sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
SWAP(1, 2);
SWAP(0, 2);
SWAP(0, 1);
SWAP(4, 5);
SWAP(3, 5);
SWAP(3, 4);
SWAP(0, 3);
SWAP(1, 4);
SWAP(2, 5);
SWAP(2, 4);
SWAP(1, 3);
SWAP(2, 3);
#undef SWAP
}

Here's an implementation using sorting networks:
inline void Sort2(int *p0, int *p1)
{
const int temp = min(*p0, *p1);
*p1 = max(*p0, *p1);
*p0 = temp;
}
inline void Sort3(int *p0, int *p1, int *p2)
{
Sort2(p0, p1);
Sort2(p1, p2);
Sort2(p0, p1);
}
inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
Sort2(p0, p1);
Sort2(p2, p3);
Sort2(p0, p2);
Sort2(p1, p3);
Sort2(p1, p2);
}
inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
Sort3(p0, p1, p2);
Sort3(p3, p4, p5);
Sort2(p0, p3);
Sort2(p2, p5);
Sort4(p1, p2, p3, p4);
}
You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a sequence of min and max operations (13 of each, in total). I leave this as an exercise for the reader.
Note that this implementation lends itself easily to vectorization (e.g. SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).
See also: Fast algorithm implementation to sort very small list

Since these are integers and compares are fast, why not compute the rank order of each directly:
inline void sort6(int *d) {
int e[6];
memcpy(e,d,6*sizeof(int));
int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
int o5 = 15-(o0+o1+o2+o3+o4);
d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}

Looks like I got to the party a year late, but here we go...
Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads at stores to be as close as possible to there the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.
I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.
I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).
#include <stdio.h>
static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
register int x0,x1,x2,x3,x4,x5;
x1 = d[1];
x2 = d[2];
SWAP(x1, x2);
x4 = d[4];
x5 = d[5];
SWAP(x4, x5);
x0 = d[0];
SWAP(x0, x2);
x3 = d[3];
SWAP(x3, x5);
SWAP(x0, x1);
SWAP(x3, x4);
SWAP(x1, x4);
SWAP(x0, x3);
d[0] = x0;
SWAP(x2, x5);
d[5] = x5;
SWAP(x1, x3);
d[1] = x1;
SWAP(x2, x4);
d[4] = x4;
SWAP(x2, x3);
d[2] = x2;
d[3] = x3;
#undef SWAP
#undef min
#undef max
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
return x;
}
void ran_fill(int n, int *a) {
static int seed = 76521;
while (n--) *a++ = (seed = seed *1812433253 + 12345);
}
#define NTESTS 4096
int main() {
int i;
int d[6*NTESTS];
ran_fill(6*NTESTS, d);
unsigned long long cycles = rdtsc();
for (i = 0; i < 6*NTESTS ; i+=6) {
sort6_fast(d+i);
}
cycles = rdtsc() - cycles;
printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);
for (i = 0; i < 6*NTESTS ; i+=6) {
if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
printf("d%d : %d %d %d %d %d %d\n", i,
d[i+0], d[i+1], d[i+2],
d[i+3], d[i+4], d[i+5]);
}
return 0;
}
I changed modified the test suite to also report clocks per sort and run more tests (the cmp function was updated to handle integer overflow as well), here are the results on some different architectures. I attempted testing on an AMD cpu but rdtsc isn't reliable on the X6 1100T I have available.
Clarkdale (i5-650)
==================
Direct call to qsort library function 635.14 575.65 581.61 577.76 521.12
Naive implementation (insertion sort) 538.30 135.36 134.89 240.62 101.23
Insertion Sort (Daniel Stutzbach) 424.48 159.85 160.76 152.01 151.92
Insertion Sort Unrolled 339.16 125.16 125.81 129.93 123.16
Rank Order 184.34 106.58 54.74 93.24 94.09
Rank Order with registers 127.45 104.65 53.79 98.05 97.95
Sorting Networks (Daniel Stutzbach) 269.77 130.56 128.15 126.70 127.30
Sorting Networks (Paul R) 551.64 103.20 64.57 73.68 73.51
Sorting Networks 12 with Fast Swap 321.74 61.61 63.90 67.92 67.76
Sorting Networks 12 reordered Swap 318.75 60.69 65.90 70.25 70.06
Reordered Sorting Network w/ fast swap 145.91 34.17 32.66 32.22 32.18
Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function 870.01 736.39 723.39 725.48 721.85
Naive implementation (insertion sort) 503.67 174.09 182.13 284.41 191.10
Insertion Sort (Daniel Stutzbach) 345.32 152.84 157.67 151.23 150.96
Insertion Sort Unrolled 316.20 133.03 129.86 118.96 105.06
Rank Order 164.37 138.32 46.29 99.87 99.81
Rank Order with registers 115.44 116.02 44.04 116.04 116.03
Sorting Networks (Daniel Stutzbach) 230.35 114.31 119.15 110.51 111.45
Sorting Networks (Paul R) 498.94 77.24 63.98 62.17 65.67
Sorting Networks 12 with Fast Swap 315.98 59.41 58.36 60.29 55.15
Sorting Networks 12 reordered Swap 307.67 55.78 51.48 51.67 50.74
Reordered Sorting Network w/ fast swap 149.68 31.46 30.91 31.54 31.58
Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function 559.97 451.88 464.84 491.35 458.11
Naive implementation (insertion sort) 341.15 160.26 160.45 154.40 106.54
Insertion Sort (Daniel Stutzbach) 284.17 136.74 132.69 123.85 121.77
Insertion Sort Unrolled 239.40 110.49 114.81 110.79 117.30
Rank Order 114.24 76.42 45.31 36.96 36.73
Rank Order with registers 105.09 32.31 48.54 32.51 33.29
Sorting Networks (Daniel Stutzbach) 210.56 115.68 116.69 107.05 124.08
Sorting Networks (Paul R) 364.03 66.02 61.64 45.70 44.19
Sorting Networks 12 with Fast Swap 246.97 41.36 59.03 41.66 38.98
Sorting Networks 12 reordered Swap 235.39 38.84 47.36 38.61 37.29
Reordered Sorting Network w/ fast swap 115.58 27.23 27.75 27.25 26.54
Nehalem (Xeon E5640)
====================
Direct call to qsort library function 911.62 890.88 681.80 876.03 872.89
Naive implementation (insertion sort) 457.69 236.87 127.68 388.74 175.28
Insertion Sort (Daniel Stutzbach) 317.89 279.74 147.78 247.97 245.09
Insertion Sort Unrolled 259.63 220.60 116.55 221.66 212.93
Rank Order 140.62 197.04 52.10 163.66 153.63
Rank Order with registers 84.83 96.78 50.93 109.96 54.73
Sorting Networks (Daniel Stutzbach) 214.59 220.94 118.68 120.60 116.09
Sorting Networks (Paul R) 459.17 163.76 56.40 61.83 58.69
Sorting Networks 12 with Fast Swap 284.58 95.01 50.66 53.19 55.47
Sorting Networks 12 reordered Swap 281.20 96.72 44.15 56.38 54.57
Reordered Sorting Network w/ fast swap 128.34 50.87 26.87 27.91 28.02

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.
That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to
#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }
and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed length array of 6 integers. In my case however, my integers are only 8 bits (instead of 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyways, in case they might be helpful to someone...
I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations, to the extent possible. It takes six "passes" to completely sort the array. I used a novel mechanism to directly convert the results of PCMPGTB (vectorized compare) to shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and in some cases also a PAND (bitwise AND) instruction.
This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.
It appears that this implementation is about 38% faster than the implementation which is currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.
I should note that this approach can be applied to any array size up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger for the bigger arrays.
The code is written in MASM for x86_64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...
PUBLIC simd_sort_6
.DATA
ALIGN 16
pass1_shuffle OWORD 0F0E0D0C0B0A09080706040503010200h
pass1_add OWORD 0F0E0D0C0B0A09080706050503020200h
pass2_shuffle OWORD 0F0E0D0C0B0A09080706030405000102h
pass2_and OWORD 00000000000000000000FE00FEFE00FEh
pass2_add OWORD 0F0E0D0C0B0A09080706050405020102h
pass3_shuffle OWORD 0F0E0D0C0B0A09080706020304050001h
pass3_and OWORD 00000000000000000000FDFFFFFDFFFFh
pass3_add OWORD 0F0E0D0C0B0A09080706050404050101h
pass4_shuffle OWORD 0F0E0D0C0B0A09080706050100020403h
pass4_and OWORD 0000000000000000000000FDFD00FDFDh
pass4_add OWORD 0F0E0D0C0B0A09080706050403020403h
pass5_shuffle OWORD 0F0E0D0C0B0A09080706050201040300h
pass5_and OWORD 0000000000000000000000FEFEFEFE00h
pass5_add OWORD 0F0E0D0C0B0A09080706050403040300h
pass6_shuffle OWORD 0F0E0D0C0B0A09080706050402030100h
pass6_add OWORD 0F0E0D0C0B0A09080706050403030100h
.CODE
simd_sort_6 PROC FRAME
.endprolog
; pxor xmm4, xmm4
; pinsrd xmm4, dword ptr [rcx], 0
; pinsrb xmm4, byte ptr [rcx + 4], 4
; pinsrb xmm4, byte ptr [rcx + 5], 5
; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer. Same on extract
; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
movd xmm4, dword ptr [rcx]
pinsrw xmm4, word ptr [rcx + 4], 2 ; word 2 = bytes 4 and 5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass1_shuffle]
pcmpgtb xmm5, xmm4
paddb xmm5, oword ptr [pass1_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass2_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass2_and]
paddb xmm5, oword ptr [pass2_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass3_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass3_and]
paddb xmm5, oword ptr [pass3_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass4_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass4_and]
paddb xmm5, oword ptr [pass4_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass5_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass5_and]
paddb xmm5, oword ptr [pass5_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass6_shuffle]
pcmpgtb xmm5, xmm4
paddb xmm5, oword ptr [pass6_add]
pshufb xmm4, xmm5
;pextrd dword ptr [rcx], xmm4, 0 ; benchmarked with this
;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
;pextrb byte ptr [rcx + 5], xmm4, 5
movd dword ptr [rcx], xmm4
pextrw word ptr [rcx + 4], xmm4, 2 ; x86 is little-endian, so this is the right order
ret
simd_sort_6 ENDP
END
You can compile this to an executable object and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can use the following C prototype to call the function from your C code:
void simd_sort_6(char *values);

While I really like the swap macro provided:
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
I see an improvement (which a good compiler might make):
#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }
We take note of how min and max work and pull the common sub-expression explicitly. This eliminates the min and max macros completely.

Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:
#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }
(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.
Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)
By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.
Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);
and here is the insertion sort:
//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
int t;
t = d[1]; ITER(0);
t = d[2]; ITER(1); ITER(0);
t = d[3]; ITER(2); ITER(1); ITER(0);
t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:
MOV r6, r2
CMP r6, r1
MOVLT r2, r1
MOVLT r1, r6
CMP r6, r0
MOVLT r1, r0
MOVLT r0, r6
For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

I ported the test suite to a PPC architecture machine I can not identify (didn't have to touch code, just increase the iterations of the test, use 8 test cases to avoid polluting results with mods and replace the x86 specific rdtsc):
Direct call to qsort library function : 101
Naive implementation (insertion sort) : 299
Insertion Sort (Daniel Stutzbach) : 108
Insertion Sort Unrolled : 51
Sorting Networks (Daniel Stutzbach) : 26
Sorting Networks (Paul R) : 85
Sorting Networks 12 with Fast Swap : 117
Sorting Networks 12 reordered Swap : 116
Rank Order : 56

An XOR swap may be useful in your swapping functions.
void xorSwap (int *x, int *y) {
if (*x != *y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
}
The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.
Here's the last set of times:
**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)** : 85
**Insertion Sort Unrolled** : 97
**Sorting Networks (Daniel Stutzbach)** : 457
**Sorting Networks (Paul R)** : 179
**Sorting Networks 12 with Fast Swap** : 238
**Sorting Networks 12 reordered Swap** : 236
**Rank Order** : 116

Here is my contribution to this thread: an optimized 1, 4 gap shellsort for a 6-member int vector (valp) containing unique values.
void shellsort (int *valp)
{
int c,a,*cp,*ip=valp,*ep=valp+5;
c=*valp; a=*(valp+4);if (c>a) {*valp= a;*(valp+4)=c;}
c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;}
cp=ip;
do
{
c=*cp;
a=*(cp+1);
do
{
if (c<a) break;
*cp=a;
*(cp+1)=c;
cp-=1;
c=*cp;
} while (cp>=valp);
ip+=1;
cp=ip;
} while (ip<ep);
}
On my HP dv7-3010so laptop with a dual-core Athlon M300 # 2 Ghz (DDR2 memory) it executes in 165 clock cycles. This is an average calculated from timing every unique sequence (6!/720 in all). Compiled to Win32 using OpenWatcom 1.8. The loop is essentially an insertion sort and is 16 instructions/37 bytes long.
I do not have a 64-bit environment to compile on.

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.
Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).
static __inline__ int sort6(int * d) {
char j, i;
int tmp;
for (inc = 4; inc > 0; inc -= 3) {
for (i = inc; i < 5; i++) {
tmp = a[i];
j = i;
while (j >= inc && a[j - inc] > tmp) {
a[j] = a[j - inc];
j -= inc;
}
a[j] = tmp;
}
}
}
I don't think this will win, but if someone posts a question about sorting 10 elements, who knows...
According to Wikipedia this can even be combined with sorting networks:
Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

I know I'm super-late, but I was interested in experimenting with some different solutions. First, I cleaned up that paste, made it compile, and put it into a repository. I kept some undesirable solutions as dead-ends so that others wouldn't try it. Among this was my first solution, which attempted to ensure that x1>x2 was calculated once. After optimization, it is no faster than the other, simple versions.
I added a looping version of rank order sort, since my own application of this study is for sorting 2-8 items, so since there are a variable number of arguments, a loop is necessary. This is also why I ignored the sorting network solutions.
The test code didn't test that duplicates were handled correctly, so while the existing solutions were all correct, I added a special case to the test code to ensure that duplicates were handled correctly.
Then, I wrote an insertion sort that is entirely in AVX registers. On my machine it is 25% faster than the other insertion sorts, but 100% slower than rank order. I did this purely for experiment and had no expectation of this being better due to the branching in insertion sort.
static inline void sort6_insertion_sort_avx(int* d) {
__m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0);
__m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
__m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX,
INT_MAX, INT_MAX, INT_MAX, INT_MAX);
__m256i val, gt, permute;
unsigned j;
// 8 / 32 = 2^-2
#define ITER(I) \
val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\
gt = _mm256_cmpgt_epi32(sorted, val);\
permute = _mm256_blendv_epi8(index, shlpermute, gt);\
j = ffs( _mm256_movemask_epi8(gt)) >> 2;\
sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\
val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j)))
ITER(1);
ITER(2);
ITER(3);
ITER(4);
ITER(5);
int x[8];
_mm256_storeu_si256((__m256i*)x, sorted);
d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5];
#undef ITER
}
Then, I wrote a rank order sort using AVX. This matches the speed of the other rank-order solutions, but is no faster. The issue here is that I can only calculate the indices with AVX, and then I have to make a table of indices. This is because the calculation is destination-based rather than source-based. See Converting from Source-based Indices to Destination-based Indices
static inline void sort6_rank_order_avx(int* d) {
__m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7);
__m256i one = _mm256_set1_epi32(1);
__m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX);
__m256i rot = src;
__m256i index = _mm256_setzero_si256();
__m256i gt, permute;
__m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6);
__m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
__m256i srcIx = dstIx;
__m256i eq = one;
__m256i rotIx = _mm256_setzero_si256();
#define INC(I)\
rot = _mm256_permutevar8x32_epi32(rot, ror);\
gt = _mm256_cmpgt_epi32(src, rot);\
index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\
index = _mm256_add_epi32(index, _mm256_and_si256(eq,\
_mm256_cmpeq_epi32(src, rot)));\
eq = _mm256_insert_epi32(eq, 0, I)
INC(0);
INC(1);
INC(2);
INC(3);
INC(4);
int e[6];
e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5];
int i[8];
_mm256_storeu_si256((__m256i*)i, index);
d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5];
}
The repo can be found here: https://github.com/eyepatchParrot/sort6/

This question is becoming quite old, but I actually had to solve the same problem these days: fast agorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).
The algorithm sort6 uses the algorithm sort4 which uses the algorithm sort3. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).
Sorting 3 values
The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:
void sort3(int* array)
{
if (array[1] < array[0]) {
if (array[2] < array[0]) {
if (array[2] < array[1]) {
std::swap(array[0], array[2]);
} else {
int tmp = array[0];
array[0] = array[1];
array[1] = array[2];
array[2] = tmp;
}
} else {
std::swap(array[0], array[1]);
}
} else {
if (array[2] < array[1]) {
if (array[2] < array[0]) {
int tmp = array[2];
array[2] = array[1];
array[1] = array[0];
array[0] = tmp;
} else {
std::swap(array[1], array[2]);
}
}
}
}
It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.
Sorting 4 values
This one calls sort3 then performs an unrolled insertion sort with the last element of the array:
void sort4(int* array)
{
// Sort the first 3 elements
sort3(array);
// Insert the 4th element with insertion sort
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
if (array[1] < array[0]) {
std::swap(array[0], array[1]);
}
}
}
}
This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...
Sorting 6 values
This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive, here is how it works:
Sort everything but the first and the last elements of the array.
Swap the first and the elements of the array if the first is greater than the last.
Insert the first element into the sorted sequence from the front then the last element from the back.
After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been insert in the 3rd position, then the last one can't be inserted lower than the 4th position.
void sort6(int* array)
{
// Sort everything but first and last elements
sort4(array+1);
// Switch first and last elements if needed
if (array[5] < array[0]) {
std::swap(array[0], array[5]);
}
// Insert first element from the front
if (array[1] < array[0]) {
std::swap(array[0], array[1]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[4] < array[3]) {
std::swap(array[3], array[4]);
}
}
}
}
// Insert last element from the back
if (array[5] < array[4]) {
std::swap(array[4], array[5]);
if (array[4] < array[3]) {
std::swap(array[3], array[4]);
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
}
}
}
}
}
My tests on every permutation of 6 values ever show that this algorithms always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.
I hope that this helps, even if this question may not represent an actual problem anymore :)
EDIT: after putting it in the provided benchmark, it is cleary slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers but could be interesting for types with an expensive comparison operation.

I found that at least on my system, the functions sort6_iterator() and sort6_iterator_local() defined below both ran at least as fast, and frequently noticeably faster, than the above current record holder:
#define MIN(x, y) (x<y?x:y)
#define MAX(x, y) (x<y?y:x)
template<class IterType>
inline void sort6_iterator(IterType it)
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
const auto b = MAX(*(it + x), *(it + y)); \
*(it + x) = a; *(it + y) = b; }
SWAP(1, 2) SWAP(4, 5)
SWAP(0, 2) SWAP(3, 5)
SWAP(0, 1) SWAP(3, 4)
SWAP(1, 4) SWAP(0, 3)
SWAP(2, 5) SWAP(1, 3)
SWAP(2, 4)
SWAP(2, 3)
#undef SWAP
}
I passed this function a std::vector's iterator in my timing code.
I suspect (from comments like this and elsewhere) that using iterators gives g++ certain assurances about what can and can't happen to the memory that the iterator refers to, which it otherwise wouldn't have and it is these assurances that allow g++ to better optimize the sorting code (e.g. with pointers, the compiler can't be sure that all pointers are pointing to different memory locations). If I remember correctly, this is also part of the reason why so many STL algorithms, such as std::sort(), generally have such obscenely good performance.
Moreover, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function, which copies the data into local variables before sorting them.1 Note that since there are only 6 local variables defined, if these local variables are primitives then they are likely never actually stored in RAM and are instead only ever stored in the CPU's registers until the end of the function call, which helps make this sorting function fast. (It also helps that the compiler knows that distinct local variables have distinct locations in memory).
template<class IterType>
inline void sort6_iterator_local(IterType it)
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
const auto b = MAX(data##x, data##y); \
data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a) auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;
DD2(1,2) SWAP(1, 2)
DD2(4,5) SWAP(4, 5)
DD1(0) SWAP(0, 2)
DD1(3) SWAP(3, 5)
SWAP(0, 1) SWAP(3, 4)
SWAP(1, 4) SWAP(0, 3) CB(0)
SWAP(2, 5) CB(5)
SWAP(1, 3) CB(1)
SWAP(2, 4) CB(4)
SWAP(2, 3) CB(2) CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}
Note that defining SWAP() as follows sometimes results in slightly better performance although most of the time it results in slightly worse performance or a negligible difference in performance.
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
data##y = MAX(data##x, data##y); \
data##x = a; }
If you just want a sorting algorithm that on primitive data types, gcc -O3 is consistently good at optimizing no matter what context the call to the sorting function appears in1 then, depending on how you pass the input, try one of the following two algorithms:
template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a) register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a) *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
DD2(1,2) SORT2(1,2)
DD2(4,5) SORT2(4,5)
DD1(0) SORT2(0,2)
DD1(3) SORT2(3,5)
SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
SORT2(1,4) SORT2(0,3) CB1(0)
SORT2(2,4) CB1(4)
SORT2(1,3) CB1(1)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
Or if you want to pass the variables by reference then use this (the below function differs from the above in its first 5 lines):
template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a) register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a) e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
DD2(1,2) SORT2(1,2)
DD2(4,5) SORT2(4,5)
DD1(0) SORT2(0,2)
DD1(3) SORT2(3,5)
SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
SORT2(1,4) SORT2(0,3) CB1(0)
SORT2(2,4) CB1(4)
SORT2(1,3) CB1(1)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
The reason for using the register keyword is because this is one of the few times that you know that you want these values in registers. Without register, the compiler will figure this out most of the time but sometimes it doesn't. Using the register keyword helps solve this issue. Normally, however, don't use the register keyword since it's more likely to slow your code than speed it up.
Also, note the use of templates. This is done on purpose since, even with the inline keyword, template functions are generally much more aggressively optimized by gcc than vanilla C functions (this has to do with gcc needing to deal with function pointers for vanilla C functions but not with template functions).
While timing various sorting functions I noticed that the context (i.e. surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.

I believe there are two parts to your question.
The first is to determine the optimal algorithm. This is done - at least in this case - by looping through every possible ordering (there aren't that many) which allows you to compute exact min, max, average and standard deviation of compares and swaps. Have a runner-up or two handy as well.
The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.
I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there fetch latencies are refreshingly low which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so which is right where it should be (there are two different inner loops in my sorting algorithm, they are 10 instructions/22 bytes and 9/22 long respectively). Assuming the code doesn't contain any divs you can be sure it will be blindingly fast.

I know this is an old question.
But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN MAX,
It's not fast as it uses 114 of each,
could reduce it to 75 pretty simply like so -> pastebin
But then it's not purely min max anymore.
What might work is doing min/max on multiple integers at once with AVX
PMINSW reference
#include <stdio.h>
static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
"=A" (x));
return x;
}
#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))
static __inline__ void sort6(int * in) {
const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];
in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );
const int
AB = MAX(A, B),
AC = MAX(A, C),
AD = MAX(A, D),
AE = MAX(A, E),
AF = MAX(A, F),
BC = MAX(B, C),
BD = MAX(B, D),
BE = MAX(B, E),
BF = MAX(B, F),
CD = MAX(C, D),
CE = MAX(C, E),
CF = MAX(C, F),
DE = MAX(D, E),
DF = MAX(D, F),
EF = MAX(E, F);
in[1] = MIN4 (
MIN4( AB, AC, AD, AE ),
MIN4( AF, BC, BD, BE ),
MIN4( BF, CD, CE, CF ),
MIN3( DE, DF, EF)
);
const int
ABC = MAX(AB,C),
ABD = MAX(AB,D),
ABE = MAX(AB,E),
ABF = MAX(AB,F),
ACD = MAX(AC,D),
ACE = MAX(AC,E),
ACF = MAX(AC,F),
ADE = MAX(AD,E),
ADF = MAX(AD,F),
AEF = MAX(AE,F),
BCD = MAX(BC,D),
BCE = MAX(BC,E),
BCF = MAX(BC,F),
BDE = MAX(BD,E),
BDF = MAX(BD,F),
BEF = MAX(BE,F),
CDE = MAX(CD,E),
CDF = MAX(CD,F),
CEF = MAX(CE,F),
DEF = MAX(DE,F);
in[2] = MIN( MIN4 (
MIN4( ABC, ABD, ABE, ABF ),
MIN4( ACD, ACE, ACF, ADE ),
MIN4( ADF, AEF, BCD, BCE ),
MIN4( BCF, BDE, BDF, BEF )),
MIN4( CDE, CDF, CEF, DEF )
);
const int
ABCD = MAX(ABC,D),
ABCE = MAX(ABC,E),
ABCF = MAX(ABC,F),
ABDE = MAX(ABD,E),
ABDF = MAX(ABD,F),
ABEF = MAX(ABE,F),
ACDE = MAX(ACD,E),
ACDF = MAX(ACD,F),
ACEF = MAX(ACE,F),
ADEF = MAX(ADE,F),
BCDE = MAX(BCD,E),
BCDF = MAX(BCD,F),
BCEF = MAX(BCE,F),
BDEF = MAX(BDE,F),
CDEF = MAX(CDE,F);
in[3] = MIN4 (
MIN4( ABCD, ABCE, ABCF, ABDE ),
MIN4( ABDF, ABEF, ACDE, ACDF ),
MIN4( ACEF, ADEF, BCDE, BCDF ),
MIN3( BCEF, BDEF, CDEF )
);
const int
ABCDE= MAX(ABCD,E),
ABCDF= MAX(ABCD,F),
ABCEF= MAX(ABCE,F),
ABDEF= MAX(ABDE,F),
ACDEF= MAX(ACDE,F),
BCDEF= MAX(BCDE,F);
in[4]= MIN (
MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
MIN ( ACDEF, BCDEF )
);
in[5] = MAX(ABCDE,F);
}
int main(int argc, char ** argv) {
int d[6][6] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
unsigned long long cycles = rdtsc();
for (int i = 0; i < 6; i++) {
sort6(d[i]);
}
cycles = rdtsc() - cycles;
printf("Time is %d\n", (unsigned)cycles);
for (int i = 0; i < 6; i++) {
printf("d%d : %d %d %d %d %d %d\n", i,
d[i][0], d[i][1], d[i][2],
d[i][3], d[i][4], d[i][5]);
}
}
EDIT:
Rank order solution inspired by Rex Kerr's,
Much faster than the mess above
static void sort6(int *o) {
const int
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
BC = B>C, BD = B>D, BE = B>E,
CD = C>D, CE = C>E,
DE = D>E,
a = AB + AC + AD + AE + (A>F),
b = 1 - AB + BC + BD + BE + (B>F),
c = 2 - AC - BC + CD + CE + (C>F),
d = 3 - AD - BD - CD + DE + (D>F),
e = 4 - AE - BE - CE - DE + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}

I thought I'd try an unrolled Ford-Johnson merge-insertion sort, which achieves the minimum possible number of comparisons (ceil(log2(6!)) = 10) and no swaps.
It doesn't compete, though (I got a slightly better timing than the worst sorting networks solution sort6_sorting_network_v1).
It loads the values into six registers, then performs 8 to 10 comparisons
to decide which of the 720=6!
cases it's in, then writes the registers back in the appropriate one
of those 720 orders (separate code for each case).
There are no swaps or reordering of anything until the final write-back. I haven't looked at the generated assembly code.
static inline void sort6_ford_johnson_unrolled(int *D) {
register int a = D[0], b = D[1], c = D[2], d = D[3], e = D[4], f = D[5];
#define abcdef(a,b,c,d,e,f) (D[0]=a, D[1]=b, D[2]=c, D[3]=d, D[4]=e, D[5]=f)
#define abdef_cd(a,b,c,d,e,f) (c<a ? abcdef(c,a,b,d,e,f) \
: c<b ? abcdef(a,c,b,d,e,f) \
: abcdef(a,b,c,d,e,f))
#define abedf_cd(a,b,c,d,e,f) (c<b ? c<a ? abcdef(c,a,b,e,d,f) \
: abcdef(a,c,b,e,d,f) \
: c<e ? abcdef(a,b,c,e,d,f) \
: abcdef(a,b,e,c,d,f))
#define abdf_cd_ef(a,b,c,d,e,f) (e<b ? e<a ? abedf_cd(e,a,c,d,b,f) \
: abedf_cd(a,e,c,d,b,f) \
: e<d ? abedf_cd(a,b,c,d,e,f) \
: abdef_cd(a,b,c,d,e,f))
#define abd_cd_ef(a,b,c,d,e,f) (d<f ? abdf_cd_ef(a,b,c,d,e,f) \
: b<f ? abdf_cd_ef(a,b,e,f,c,d) \
: abdf_cd_ef(e,f,a,b,c,d))
#define ab_cd_ef(a,b,c,d,e,f) (b<d ? abd_cd_ef(a,b,c,d,e,f) \
: abd_cd_ef(c,d,a,b,e,f))
#define ab_cd(a,b,c,d,e,f) (e<f ? ab_cd_ef(a,b,c,d,e,f) \
: ab_cd_ef(a,b,c,d,f,e))
#define ab(a,b,c,d,e,f) (c<d ? ab_cd(a,b,c,d,e,f) \
: ab_cd(a,b,d,c,e,f))
a<b ? ab(a,b,c,d,e,f)
: ab(b,a,c,d,e,f);
#undef ab
#undef ab_cd
#undef ab_cd_ef
#undef abd_cd_ef
#undef abdf_cd_ef
#undef abedf_cd
#undef abdef_cd
#undef abcdef
}
TEST(ford_johnson_unrolled, "Unrolled Ford-Johnson Merge-Insertion sort");

Try 'merging sorted list' sort. :) Use two array. Fastest for small and big array.
If you concating, you only check where insert. Other bigger values you not need compare (cmp = a-b>0).
For 4 numbers, you can use system 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort use 6 cmp (6). Lots of cmp for big numbers slower code.
This code use 5 cmp (not MSL sort):
if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}
Principial MSL
9 8 7 6 5 4 3 2 1 0
89 67 45 23 01 ... concat two sorted lists, list length = 1
6789 2345 01 ... concat two sorted lists, list length = 2
23456789 01 ... concat two sorted lists, list length = 4
0123456789 ... concat two sorted lists, list length = 8
js code
function sortListMerge_2a(cmp)
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end = arr_count;
//var str = '';
cycles = 0;
if (end>3)
{
stepmax = ((end - start + 1) >> 1) << 1;
m = 1;
n = 2;
for (step=1;step<stepmax;step<<=1) //bounds 1-1, 2-2, 4-4, 8-8...
{
a = start;
while (a<end)
{
b = a + step;
c = a + step + step;
b = b<end ? b : end;
c = c<end ? c : end;
i = a;
j = b;
k = i;
while (i<b && j<c)
{
if (cmp(arr[m][i],arr[m][j])>0)
{arr[n][k] = arr[m][j]; j++; k++;}
else {arr[n][k] = arr[m][i]; i++; k++;}
}
while (i<b)
{arr[n][k] = arr[m][i]; i++; k++;
}
while (j<c)
{arr[n][k] = arr[m][j]; j++; k++;
}
a = c;
}
tmp = m; m = n; n = tmp;
}
return m;
}
else
{
// sort 3 items
sort10(cmp);
return m;
}
}

Maybe I am late to the party, but at least my contribution is a new approach.
The code really should be inlined
even if inlined, there are too many branches
the analysing part is basically O(N(N-1)) which seems OK for N=6
the code could be more effective if the cost of swap would be higher (irt the cost of compare)
I trust on static functions being inlined.
The method is related to rank-sort
instead of ranks, the relative ranks (offsets) are used.
the sum of the ranks is zero for every cycle in any permutation group.
instead of SWAP()ing two elements, the cycles are chased, needing only one temp, and one (register->register) swap (new <- old).
Update: changed the code a bit, some people use C++ compilers to compile C code ...
#include <stdio.h>
#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif
static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);
int wsort6(int *arr);
void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
fprintf(stderr, " %3d", *arr);
}
fprintf(stderr,"\n");
}
void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
fprintf(stderr, " %3d", (int) *arr);
}
fprintf(stderr,"\n");
}
static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;
for (top = 0; top < cnt; top++ ) {
for (bot = 0; bot < top; bot++ ) {
if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
}
}
return ;
}
/* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){
dif[0] = (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}
static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;
Dif difs[cnt];
#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);
#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif
nswap = 0;
for (idx=0; idx < cnt; idx++) {
int newval;
int step,cyc;
if ( !difs[idx] ) continue;
newval = arr[idx];
cyc = 0;
src = idx;
do {
int oldval;
step = difs[src];
difs[src] =0;
dst = src + step;
cyc += step ;
if(dst == idx+1)idx=dst;
oldval = arr[dst];
#if (DUMP_IT&1)
fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
, nswap, cyc, step, idx, oldval, newval
, src, dst, difs[dst], arr[dst]
, newval );
do_print_a("Array ", arr, cnt);
do_print_d("Steps ", difs, cnt);
#endif
arr[dst] = newval;
newval = oldval;
nswap++;
src = dst;
} while( cyc);
}
return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}

//Bruteforce compute unrolled count dumbsort(min to 0-index)
void bcudc_sort6(int* a)
{
int t[6] = {0};
int r1,r2;
r1=0;
r1 += (a[0] > a[1]);
r1 += (a[0] > a[2]);
r1 += (a[0] > a[3]);
r1 += (a[0] > a[4]);
r1 += (a[0] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[0];
r2=0;
r2 += (a[1] > a[0]);
r2 += (a[1] > a[2]);
r2 += (a[1] > a[3]);
r2 += (a[1] > a[4]);
r2 += (a[1] > a[5]);
while(t[r2]){r2++;}
t[r2] = a[1];
r1=0;
r1 += (a[2] > a[0]);
r1 += (a[2] > a[1]);
r1 += (a[2] > a[3]);
r1 += (a[2] > a[4]);
r1 += (a[2] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[2];
r2=0;
r2 += (a[3] > a[0]);
r2 += (a[3] > a[1]);
r2 += (a[3] > a[2]);
r2 += (a[3] > a[4]);
r2 += (a[3] > a[5]);
while(t[r2]){r2++;}
t[r2] = a[3];
r1=0;
r1 += (a[4] > a[0]);
r1 += (a[4] > a[1]);
r1 += (a[4] > a[2]);
r1 += (a[4] > a[3]);
r1 += (a[4] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[4];
r2=0;
r2 += (a[5] > a[0]);
r2 += (a[5] > a[1]);
r2 += (a[5] > a[2]);
r2 += (a[5] > a[3]);
r2 += (a[5] > a[4]);
while(t[r2]){r2++;}
t[r2] = a[5];
a[0]=t[0];
a[1]=t[1];
a[2]=t[2];
a[3]=t[3];
a[4]=t[4];
a[5]=t[5];
}
static __inline__ void sort6(int* a)
{
#define wire(x,y); t = a[x] ^ a[y] ^ ( (a[x] ^ a[y]) & -(a[x] < a[y]) ); a[x] = a[x] ^ t; a[y] = a[y] ^ t;
register int t;
wire( 0, 1); wire( 2, 3); wire( 4, 5);
wire( 3, 5); wire( 0, 2); wire( 1, 4);
wire( 4, 5); wire( 2, 3); wire( 0, 1);
wire( 3, 4); wire( 1, 2);
wire( 2, 3);
#undef wire
}

Well, if it's only 6 elements and you can leverage parallelism, want to minimize conditional branching, etc. Why you don't generate all the combinations and test for order? I would venture that in some architectures, it can be pretty fast (as long as you have the memory preallocated)

Sort 4 items with usage cmp==0.
Numbers of cmp is ~4.34 (FF native have ~4.52), but take 3x time than merging lists. But better less cmp operations, if you have big numbers or big text.
Edit: repaired bug
Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm
function sort4DG(cmp,start,end,n) // sort 4
{
var n = typeof(n) !=='undefined' ? n : 1;
var cmp = typeof(cmp) !=='undefined' ? cmp : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end = typeof(end) !=='undefined' ? end : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
{
swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
if (cc[13]>0)
{
swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
return n;
}
else {
cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3])); // new cc23 | c03 //repaired
if (cc[23]>0)
{
swap(n,i+2,i+3);
return n;
}
return n;
}
}
else {
if (cc[12]>0)
{
swap(n,i+1,i+2);
cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
if (cc[23]>0)
{
swap(n,i+2,i+3);
return n;
}
return n;
}
else {
return n;
}
}
return n;
}

Most efficient code for the first 10000 prime numbers?

I want to print the first 10000 prime numbers.
Can anyone give me the most efficient code for this?
Clarifications:
It does not matter if your code is inefficient for n >10000.
The size of the code does not matter.
You cannot just hard code the values in any manner.

The Sieve of Atkin is probably what you're looking for, its upper bound running time is O(N/log log N).
If you only run the numbers 1 more and 1 less than the multiples of 6, it could be even faster, as all prime numbers above 3 are 1 away from some multiple of six.
Resource for my statement

I recommend a sieve, either the Sieve of Eratosthenes or the Sieve of Atkin.
The sieve or Eratosthenes is probably the most intuitive method of finding a list of primes. Basically you:
Write down a list of numbers from 2 to whatever limit you want, let's say 1000.
Take the first number that isn't crossed off (for the first iteration this is 2) and cross off all multiples of that number from the list.
Repeat step 2 until you reach the end of the list. All the numbers that aren't crossed off are prime.
Obviously there are quite a few optimizations that can be done to make this algorithm work faster, but this is the basic idea.
The sieve of Atkin uses a similar approach, but unfortunately I don't know enough about it to explain it to you. But I do know that the algorithm I linked takes 8 seconds to figure out all the primes up to 1000000000 on an ancient Pentium II-350
Sieve of Eratosthenes Source Code: http://web.archive.org/web/20140705111241/http://primes.utm.edu/links/programs/sieves/Eratosthenes/C_source_code/
Sieve of Atkin Source Code: http://cr.yp.to/primegen.html

This isn't strictly against the hardcoding restriction, but comes terribly close. Why not programatically download this list and print it out, instead?
http://primes.utm.edu/lists/small/10000.txt

In Haskell, we can write down almost word for word the mathematical definition of the sieve of Eratosthenes, "primes are natural numbers above 1 without any composite numbers, where composites are found by enumeration of each prime's multiples":
import Data.List.Ordered (minus, union)
primes = 2 : minus [3..] (foldr (\p r -> p*p : union [p*p+p, p*p+2*p..] r)
[] primes)
primes !! 10000 is near-instantaneous.
References:
Sieve of Eratosthenes
Richard Bird's sieve (see pp. 10,11)
minus, union
The above code is easily tweaked into working on odds only, primes = 2 : 3 : minus [5,7..] (foldr (\p r -> p*p : union [p*p+2*p, p*p+4*p..] r) [] (tail primes)). Time complexity is much improved (to just about a log factor above optimal) by folding in a tree-like structure, and space complexity is drastically improved by multistage primes production, in
primes = 2 : _Y ( (3:) . sieve 5 . _U . map (\p -> [p*p, p*p+2*p..]) )
where
_Y g = g (_Y g) -- non-sharing fixpoint combinator
_U ((x:xs):t) = x : (union xs . _U . pairs) t -- ~= nub.sort.concat
pairs (xs:ys:t) = union xs ys : pairs t
sieve k s#(x:xs) | k < x = k : sieve (k+2) s -- ~= [k,k+2..]\\s,
| otherwise = sieve (k+2) xs -- when s⊂[k,k+2..]
(In Haskell the parentheses are used for grouping, a function call is signified just by juxtaposition, (:) is a cons operator for lists, and (.) is a functional composition operator: (f . g) x = (\y -> f (g y)) x = f (g x)).

GateKiller, how about adding a break to that if in the foreach loop? That would speed up things a lot because if like 6 is divisible by 2 you don't need to check with 3 and 5. (I'd vote your solution up anyway if I had enough reputation :-) ...)
ArrayList primeNumbers = new ArrayList();
for(int i = 2; primeNumbers.Count < 10000; i++) {
bool divisible = false;
foreach(int number in primeNumbers) {
if(i % number == 0) {
divisible = true;
break;
}
}
if(divisible == false) {
primeNumbers.Add(i);
Console.Write(i + " ");
}
}

#Matt: log(log(10000)) is ~2
From the wikipedia article (which you cited) Sieve of Atkin:
This sieve computes primes up to N
using O(N/log log N) operations with
only N1/2+o(1) bits of memory. That is
a little better than the sieve of
Eratosthenes which uses O(N)
operations and O(N1/2(log log N)/log
N) bits of memory (A.O.L. Atkin, D.J. Bernstein, 2004). These asymptotic
computational complexities include
simple optimizations, such as wheel
factorization, and splitting the
computation to smaller blocks.
Given asymptotic computational complexities along O(N) (for Eratosthenes) and O(N/log(log(N))) (for Atkin) we can't say (for small N=10_000) which algorithm if implemented will be faster.
Achim Flammenkamp wrote in The Sieve of Eratosthenes:
cited by:
#num1
For intervals larger about 10^9,
surely for those > 10^10, the Sieve of
Eratosthenes is outperformed by the
Sieve of Atkins and Bernstein which
uses irreducible binary quadratic
forms. See their paper for background
informations as well as paragraph 5 of
W. Galway's Ph.D. thesis.
Therefore for 10_000 Sieve of Eratosthenes can be faster then Sieve of Atkin.
To answer OP the code is prime_sieve.c (cited by num1)

Using GMP, one could write the following:
#include <stdio.h>
#include <gmp.h>
int main() {
mpz_t prime;
mpz_init(prime);
mpz_set_ui(prime, 1);
int i;
char* num = malloc(4000);
for(i=0; i<10000; i++) {
mpz_nextprime(prime, prime);
printf("%s, ", mpz_get_str(NULL,10,prime));
}
}
On my 2.33GHz Macbook Pro, it executes as follows:
time ./a.out > /dev/null
real 0m0.033s
user 0m0.029s
sys 0m0.003s
Calculating 1,000,000 primes on the same laptop:
time ./a.out > /dev/null
real 0m14.824s
user 0m14.606s
sys 0m0.086s
GMP is highly optimized for this sort of thing. Unless you really want to understand the algorithms by writing your own, you'd be advised to use libGMP under C.

Not efficient at all, but you can use a regular expression to test for prime numbers.
/^1?$|^(11+?)\1+$/
This tests if, for a string consisting of k “1”s, k is not prime (i.e. whether the string consists of one “1” or any number of “1”s that can be expressed as an n-ary product).

I have adapted code found on the CodeProject to create the following:
ArrayList primeNumbers = new ArrayList();
for(int i = 2; primeNumbers.Count < 10000; i++) {
bool divisible = false;
foreach(int number in primeNumbers) {
if(i % number == 0) {
divisible = true;
}
}
if(divisible == false) {
primeNumbers.Add(i);
Console.Write(i + " ");
}
}
Testing this on my ASP.NET Server took the rountine about 1 minute to run.

Sieve of Eratosthenes is the way to go, because of it's simplicity and speed. My implementation in C
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
int main(void)
{
unsigned int lim, i, j;
printf("Find primes upto: ");
scanf("%d", &lim);
lim += 1;
bool *primes = calloc(lim, sizeof(bool));
unsigned int sqrtlim = sqrt(lim);
for (i = 2; i <= sqrtlim; i++)
if (!primes[i])
for (j = i * i; j < lim; j += i)
primes[j] = true;
printf("\nListing prime numbers between 2 and %d:\n\n", lim - 1);
for (i = 2; i < lim; i++)
if (!primes[i])
printf("%d\n", i);
return 0;
}
CPU Time to find primes (on Pentium Dual Core E2140 1.6 GHz, using single core)
~ 4s for lim = 100,000,000

Here is a Sieve of Eratosthenes that I wrote in PowerShell a few days ago. It has a parameter for identifying the number of prime numbers that should be returned.
#
# generate a list of primes up to a specific target using a sieve of eratosthenes
#
function getPrimes { #sieve of eratosthenes, http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
param ($target,$count = 0)
$sieveBound = [math]::ceiling(( $target - 1 ) / 2) #not storing evens so count is lower than $target
$sieve = #($false) * $sieveBound
$crossLimit = [math]::ceiling(( [math]::sqrt($target) - 1 ) / 2)
for ($i = 1; $i -le $crossLimit; $i ++) {
if ($sieve[$i] -eq $false) {
$prime = 2 * $i + 1
write-debug "Found: $prime"
for ($x = 2 * $i * ( $i + 1 ); $x -lt $sieveBound; $x += 2 * $i + 1) {
$sieve[$x] = $true
}
}
}
$primes = #(2)
for ($i = 1; $i -le $sieveBound; $i ++) {
if($count -gt 0 -and $primes.length -ge $count) {
break;
}
if($sieve[$i] -eq $false) {
$prime = 2 * $i + 1
write-debug "Output: $prime"
$primes += $prime
}
}
return $primes
}

Adapting and following on from GateKiller, here's the final version that I've used.
public IEnumerable<long> PrimeNumbers(long number)
{
List<long> primes = new List<long>();
for (int i = 2; primes.Count < number; i++)
{
bool divisible = false;
foreach (int num in primes)
{
if (i % num == 0)
divisible = true;
if (num > Math.Sqrt(i))
break;
}
if (divisible == false)
primes.Add(i);
}
return primes;
}
It's basically the same, but I've added the "break on Sqrt" suggestion and changed some of the variables around to make it fit better for me. (I was working on Euler and needed the 10001th prime)

The Sieve seems to be the wrong answer. The sieve gives you the primes up to a number N, not the first N primes. Run #Imran or #Andrew Szeto, and you get the primes up to N.
The sieve might still be usable if you keep trying sieves for increasingly larger numbers until you hit a certain size of your result set, and use some caching of numbers already obtained, but I believe it would still be no faster than a solution like #Pat's.

The deque sieve algorithm mentioned by BenGoldberg deserves a closer look, not only because it is very elegant but also because it can occasionally be useful in practice (unlike the Sieve of Atkin, which is a purely academical exercise).
The basic idea behind the deque sieve algorithm is to use a small, sliding sieve that is only large enough to contain at least one separate multiple for each of the currently 'active' prime factors - i.e. those primes whose square does not exceed the lowest number currently represented by the moving sieve. Another difference to the SoE is that the deque sieve stores the actual factors into the slots of composites, not booleans.
The algorithm extends the size of the sieve window as needed, resulting in fairly even performance over a wide range until the sieve starts exceeding the capacity of the CPU's L1 cache appreciably. The last prime that fits fully is 25,237,523 (the 1,579,791st prime), which gives a rough ballpark figure for the reasonable operating range of the algorithm.
The algorithm is fairly simple and robust, and it has even performance over a much wider range than an unsegmented Sieve of Eratosthenes. The latter is a lot faster as long its sieve fits fully into the cache, i.e. up to 2^16 for an odds-only sieve with byte-sized bools. Then its performance drops more and more, although it always remains significantly faster than the deque despite the handicap (at least in compiled languages like C/C++, Pascal or Java/C#).
Here is a rendering of the deque sieve algorithm in C#, because I find that language - despite its many flaws - much more practical for prototyping algorithms and experimentation than the supremely cumbersome and pedantic C++. (Sidenote: I'm using the free LINQPad which makes it possible to dive right in, without all the messiness with setting up projects, makefiles, directories or whatnot, and it gives me the same degree of interactivity as a python prompt).
C# doesn't have an explicit deque type but the plain List<int> works well enough for demonstrating the algorithm.
Note: this version does not use a deque for the primes, because it simply doesn't make sense to pop off sqrt(n) out of n primes. What good would it be to remove 100 primes and to leave 9900? At least this way all the primes are collected in a neat vector, ready for further processing.
static List<int> deque_sieve (int n = 10000)
{
Trace.Assert(n >= 3);
var primes = new List<int>() { 2, 3 };
var sieve = new List<int>() { 0, 0, 0 };
for (int sieve_base = 5, current_prime_index = 1, current_prime_squared = 9; ; )
{
int base_factor = sieve[0];
if (base_factor != 0)
{
// the sieve base has a non-trivial factor - put that factor back into circulation
mark_next_unmarked_multiple(sieve, base_factor);
}
else if (sieve_base < current_prime_squared) // no non-trivial factor -> found a non-composite
{
primes.Add(sieve_base);
if (primes.Count == n)
return primes;
}
else // sieve_base == current_prime_squared
{
// bring the current prime into circulation by injecting it into the sieve ...
mark_next_unmarked_multiple(sieve, primes[current_prime_index]);
// ... and elect a new current prime
current_prime_squared = square(primes[++current_prime_index]);
}
// slide the sieve one step forward
sieve.RemoveAt(0); sieve_base += 2;
}
}
Here are the two helper functions:
static void mark_next_unmarked_multiple (List<int> sieve, int prime)
{
int i = prime, e = sieve.Count;
while (i < e && sieve[i] != 0)
i += prime;
for ( ; e <= i; ++e) // no List<>.Resize()...
sieve.Add(0);
sieve[i] = prime;
}
static int square (int n)
{
return n * n;
}
Probably the easiest way of understanding the algorithm is to imagine it as a special segmented Sieve of Eratosthenes with a segment size of 1, accompanied by an overflow area where the primes come to rest when they shoot over the end of the segment. Except that the single cell of the segment (a.k.a. sieve[0]) has already been sieved when we get to it, because it got run over while it was part of the overflow area.
The number that is represented by sieve[0] is held in sieve_base, although sieve_front or window_base would also be a good names that allow to draw parallels to Ben's code or implementations of segmented/windowed sieves.
If sieve[0] contains a non-zero value then that value is a factor of sieve_base, which can thus be recognised as composite. Since cell 0 is a multiple of that factor it is easy to compute its next hop, which is simply 0 plus that factor. Should that cell be occupied already by another factor then we simply add the factor again, and so on until we find a multiple of the factor where no other factor is currently parked (extending the sieve if needed). This also means that there is no need for storing the current working offsets of the various primes from one segment to the next, as in a normal segmented sieve. Whenever we find a factor in sieve[0], its current working offset is 0.
The current prime comes into play in the following way. A prime can only become current after its own occurrence in the stream (i.e. when it has been detected as a prime, because not marked with a factor), and it will remain current until the exact moment that sieve[0] reaches its square. All lower multiples of this prime must have been struck off due to the activities of smaller primes, just like in a normal SoE. But none of the smaller primes can strike off the square, since the only factor of the square is the prime itself and it is not yet in circulation at this point. That explains the actions taken by the algorithm in the case sieve_base == current_prime_squared (which implies sieve[0] == 0, by the way).
Now the case sieve[0] == 0 && sieve_base < current_prime_squared is easily explained: it means that sieve_base cannot be a multiple of any of the primes smaller than the current prime, or else it would have been marked as composite. I cannot be a higher multiple of the current prime either, since its value is less than the current prime's square. Hence it must be a new prime.
The algorithm is obviously inspired by the Sieve of Eratosthenes, but equally obviously it is very different. The Sieve of Eratosthenes derives its superior speed from the simplicity of its elementary operations: one single index addition and one store for each step of the operation is all that it does for long stretches of time.
Here is a simple, unsegmented Sieve of Eratosthenes that I normally use for sieving factor primes in the ushort range, i.e. up to 2^16. For this post I've modified it to work beyond 2^16 by substituting int for ushort
static List<int> small_odd_primes_up_to (int n)
{
var result = new List<int>();
if (n < 3)
return result;
int sqrt_n_halved = (int)(Math.Sqrt(n) - 1) >> 1, max_bit = (n - 1) >> 1;
var odd_composite = new bool[max_bit + 1];
for (int i = 3 >> 1; i <= sqrt_n_halved; ++i)
if (!odd_composite[i])
for (int p = (i << 1) + 1, j = p * p >> 1; j <= max_bit; j += p)
odd_composite[j] = true;
result.Add(3); // needs to be handled separately because of the mod 3 wheel
// read out the sieved primes
for (int i = 5 >> 1, d = 1; i <= max_bit; i += d, d ^= 3)
if (!odd_composite[i])
result.Add((i << 1) + 1);
return result;
}
When sieving the first 10000 primes a typical L1 cache of 32 KiByte will be exceeded but the function is still very fast (fraction of a millisecond even in C#).
If you compare this code to the deque sieve then it is easy to see that the operations of the deque sieve are a lot more complicated, and it cannot effectively amortise its overhead because it always does the shortest possible stretch of crossings-off in a row (exactly one single crossing-off, after skipping all multiples that have been crossed off already).
Note: the C# code uses int instead of uint because newer compilers have a habit of generating substandard code for uint, probably in order to push people towards signed integers... In the C++ version of the code above I used unsigned throughout, naturally; the benchmark had to be in C++ because I wanted it be based on a supposedly adequate deque type (std::deque<unsigned>; there was no performance gain from using unsigned short). Here are the numbers for my Haswell laptop (VC++ 2015/x64):
deque vs simple: 1.802 ms vs 0.182 ms
deque vs simple: 1.836 ms vs 0.170 ms
deque vs simple: 1.729 ms vs 0.173 ms
Note: the C# times are pretty much exactly double the C++ timings, which is pretty good for C# and ìt shows that List<int> is no slouch even if abused as a deque.
The simple sieve code still blows the deque out of the water, even though it is already operating beyond its normal working range (L1 cache size exceeded by 50%, with attendant cache thrashing). The dominating part here is the reading out of the sieved primes, and this is not affected much by the cache problem. In any case the function was designed for sieving the factors of factors, i.e. level 0 in a 3-level sieve hierarchy, and typically it has to return only a few hundred factors or a low number of thousands. Hence its simplicity.
Performance could be improved by more than an order of magnitude by using a segmented sieve and optimising the code for extracting the sieved primes (stepped mod 3 and unrolled twice, or mod 15 and unrolled once) , and yet more performance could be squeezed out of the code by using a mod 16 or mod 30 wheel with all the trimmings (i.e. full unrolling for all residues). Something like that is explained in my answer to Find prime positioned prime number over on Code Review, where a similar problem was discussed. But it's hard to see the point in improving sub-millisecond times for a one-off task...
To put things a bit into perspective, here are the C++ timings for sieving up to 100,000,000:
deque vs simple: 1895.521 ms vs 432.763 ms
deque vs simple: 1847.594 ms vs 429.766 ms
deque vs simple: 1859.462 ms vs 430.625 ms
By contrast, a segmented sieve in C# with a few bells and whistles does the same job in 95 ms (no C++ timings available, since I do code challenges only in C# at the moment).
Things may look decidedly different in an interpreted language like Python where every operation has a heavy cost and the interpreter overhead dwarfs all differences due to predicted vs. mispredicted branches or sub-cycle ops (shift, addition) vs. multi-cycle ops (multiplication, and perhaps even division). That is bound to erode the simplicity advantage of the Sieve of Eratosthenes, and this could make the deque solution a bit more attractive.
Also, many of the timings reported by other respondents in this topic are probably dominated by output time. That's an entirely different war, where my main weapon is a simple class like this:
class CCWriter
{
const int SPACE_RESERVE = 11; // UInt31 + '\n'
public static System.IO.Stream BaseStream;
static byte[] m_buffer = new byte[1 << 16]; // need 55k..60k for a maximum-size range
static int m_write_pos = 0;
public static long BytesWritten = 0; // for statistics
internal static ushort[] m_double_digit_lookup = create_double_digit_lookup();
internal static ushort[] create_double_digit_lookup ()
{
var lookup = new ushort[100];
for (int lo = 0; lo < 10; ++lo)
for (int hi = 0; hi < 10; ++hi)
lookup[hi * 10 + lo] = (ushort)(0x3030 + (hi << 8) + lo);
return lookup;
}
public static void Flush ()
{
if (BaseStream != null && m_write_pos > 0)
BaseStream.Write(m_buffer, 0, m_write_pos);
BytesWritten += m_write_pos;
m_write_pos = 0;
}
public static void WriteLine ()
{
if (m_buffer.Length - m_write_pos < 1)
Flush();
m_buffer[m_write_pos++] = (byte)'\n';
}
public static void WriteLinesSorted (int[] values, int count)
{
int digits = 1, max_value = 9;
for (int i = 0; i < count; ++i)
{
int x = values[i];
if (m_buffer.Length - m_write_pos < SPACE_RESERVE)
Flush();
while (x > max_value)
if (++digits < 10)
max_value = max_value * 10 + 9;
else
max_value = int.MaxValue;
int n = x, p = m_write_pos + digits, e = p + 1;
m_buffer[p] = (byte)'\n';
while (n >= 10)
{
int q = n / 100, w = m_double_digit_lookup[n - q * 100];
n = q;
m_buffer[--p] = (byte)w;
m_buffer[--p] = (byte)(w >> 8);
}
if (n != 0 || x == 0)
m_buffer[--p] = (byte)((byte)'0' + n);
m_write_pos = e;
}
}
}
That takes less than 1 ms for writing 10000 (sorted) numbers. It's a static class because it is intended for textual inclusion in coding challenge submissions, with a minimum of fuss and zero overhead.
In general I found it to be much faster if focussed work is done on entire batches, meaning sieve a certain range, then extract all primes into a vector/array, then blast out the whole array, then sieve the next range and so on, instead of mingling everything together. Having separate functions focussed on specific tasks also makes it easier to mix and match, it enables reuse, and it eases development/testing.

In Python
import gmpy
p=1
for i in range(10000):
p=gmpy.next_prime(p)
print p

Here is my VB 2008 code, which finds all primes <10,000,000 in 1 min 27 secs on my work laptop. It skips even numbers and only looks for primes that are < the sqrt of the test number. It is only designed to find primes from 0 to a sentinal value.
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles
Button1.Click
Dim TestNum As Integer
Dim X As Integer
Dim Z As Integer
Dim TM As Single
Dim TS As Single
Dim TMS As Single
Dim UnPrime As Boolean
Dim Sentinal As Integer
Button1.Text = "Thinking"
Button1.Refresh()
Sentinal = Val(SentinalTxt.Text)
UnPrime = True
Primes(0) = 2
Primes(1) = 3
Z = 1
TM = TimeOfDay.Minute
TS = TimeOfDay.Second
TMS = TimeOfDay.Millisecond
For TestNum = 5 To Sentinal Step 2
Do While Primes(X) <> 0 And UnPrime And Primes(X) ^ 2 <= TestNum
If Int(TestNum / Primes(X)) - (TestNum / Primes(X)) = 0 Then
UnPrime = False
End If
X = X + 1
Loop
If UnPrime = True Then
X = X + 1
Z = Z + 1
Primes(Z) = TestNum
End If
UnPrime = True
X = 0
Next
Button1.Text = "Finished with " & Z
TM = TimeOfDay.Minute - TM
TS = TimeOfDay.Second - TS
TMS = TimeOfDay.Millisecond - TMS
ShowTime.Text = TM & ":" & TS & ":" & TMS
End Sub

The following Mathcad code calculated the first million primes in under 3 minutes.
Bear in mind that this would be using floating point doubles for all of the numbers and is basically interpreted. I hope the syntax is clear.

Here is a C++ solution, using a form of SoE:
#include <iostream>
#include <deque>
typedef std::deque<int> mydeque;
void my_insert( mydeque & factors, int factor ) {
int where = factor, count = factors.size();
while( where < count && factors[where] ) where += factor;
if( where >= count ) factors.resize( where + 1 );
factors[ where ] = factor;
}
int main() {
mydeque primes;
mydeque factors;
int a_prime = 3, a_square_prime = 9, maybe_prime = 3;
int cnt = 2;
factors.resize(3);
std::cout << "2 3 ";
while( cnt < 10000 ) {
int factor = factors.front();
maybe_prime += 2;
if( factor ) {
my_insert( factors, factor );
} else if( maybe_prime < a_square_prime ) {
std::cout << maybe_prime << " ";
primes.push_back( maybe_prime );
++cnt;
} else {
my_insert( factors, a_prime );
a_prime = primes.front();
primes.pop_front();
a_square_prime = a_prime * a_prime;
}
factors.pop_front();
}
std::cout << std::endl;
return 0;
}
Note that this version of the Sieve can compute primes indefinitely.
Also note, the STL deque takes O(1) time to perform push_back, pop_front, and random access though subscripting.
The resize operation takes O(n) time, where n is the number of elements being added. Due to how we are using this function, we can treat this is a small constant.
The body of the while loop in my_insert is executed O(log log n) times, where n equals the variable maybe_prime. This is because the condition expression of the while will evaluate to true once for each prime factor of maybe_prime. See "Divisor function" on Wikipedia.
Multiplying by the number of times my_insert is called, shows that it should take O(n log log n) time to list n primes... which is, unsurprisingly, the time complexity which the Sieve of Eratosthenes is supposed to have.
However, while this code is efficient, it's not the most efficient... I would strongly suggest using a specialized library for primes generation, such as primesieve. Any truly efficient, well optimized solution, will take more code than anyone wants to type into Stackoverflow.

Using Sieve of Eratosthenes, computation is quite faster compare to "known-wide" prime numbers algorithm.
By using pseudocode from it's wiki (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes), I be able to have the solution on C#.
/// Get non-negative prime numbers until n using Sieve of Eratosthenes.
public int[] GetPrimes(int n) {
if (n <= 1) {
return new int[] { };
}
var mark = new bool[n];
for(var i = 2; i < n; i++) {
mark[i] = true;
}
for (var i = 2; i < Math.Sqrt(n); i++) {
if (mark[i]) {
for (var j = (i * i); j < n; j += i) {
mark[j] = false;
}
}
}
var primes = new List<int>();
for(var i = 3; i < n; i++) {
if (mark[i]) {
primes.Add(i);
}
}
return primes.ToArray();
}
GetPrimes(100000000) takes 2s and 330ms.
NOTE: Value might vary depend on Hardware Specifications.

Here is my code which finds
first 10,000 primes in 0.049655 sec on my laptop, first 1,000,000 primes in under 6 seconds and first 2,000,000 in 15 seconds
A little explanation. This method uses 2 techniques to find prime number
first of all any non-prime number is a composite of multiples of prime numbers so this code test by dividing the test number by smaller prime numbers instead of any number, this decreases calculation by atleast 10 times for a 4 digit number and even more for a bigger number
secondly besides dividing by prime, it only divides by prime numbers that are smaller or equal to the root of the number being tested further reducing the calculations greatly, this works because any number that is greater than root of the number will have a counterpart number that has to be smaller than root of the number but since we have tested all numbers smaller than the root already, Therefore we don't need to bother with number greater than the root of the number being tested.
Sample output for first 10,000 prime number
https://drive.google.com/open?id=0B2QYXBiLI-lZMUpCNFhZeUphck0 https://drive.google.com/open?id=0B2QYXBiLI-lZbmRtTkZETnp6Ykk
Here is the code in C language,
Enter 1 and then 10,000 to print out the first 10,000 primes.
Edit: I forgot this contains math library ,if you are on windows or visual studio than that should be fine but on linux you must compile the code using -lm argument or the code may not work
Example: gcc -Wall -o "%e" "%f" -lm
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <limits.h>
/* Finding prime numbers */
int main()
{
//pre-phase
char d,w;
int l,o;
printf(" 1. Find first n number of prime numbers or Find all prime numbers smaller than n ?\n"); // this question helps in setting the limits on m or n value i.e l or o
printf(" Enter 1 or 2 to get anwser of first or second question\n");
// decision making
do
{
printf(" -->");
scanf("%c",&d);
while ((w=getchar()) != '\n' && w != EOF);
if ( d == '1')
{
printf("\n 2. Enter the target no. of primes you will like to find from 3 to 2,000,000 range\n -->");
scanf("%10d",&l);
o=INT_MAX;
printf(" Here we go!\n\n");
break;
}
else if ( d == '2' )
{
printf("\n 2.Enter the limit under which to find prime numbers from 5 to 2,000,000 range\n -->");
scanf("%10d",&o);
l=o/log(o)*1.25;
printf(" Here we go!\n\n");
break;
}
else printf("\n Try again\n");
}while ( d != '1' || d != '2' );
clock_t start, end;
double cpu_time_used;
start = clock(); /* starting the clock for time keeping */
// main program starts here
int i,j,c,m,n; /* i ,j , c and m are all prime array 'p' variables and n is the number that is being tested */
int s,x;
int p[ l ]; /* p is the array for storing prime numbers and l sets the array size, l was initialized in pre-phase */
p[1]=2;
p[2]=3;
p[3]=5;
printf("%10dst:%10d\n%10dnd:%10d\n%10drd:%10d\n",1,p[1],2,p[2],3,p[3]); // first three prime are set
for ( i=4;i<=l;++i ) /* this loop sets all the prime numbers greater than 5 in the p array to 0 */
p[i]=0;
n=6; /* prime number testing begins with number 6 but this can lowered if you wish but you must remember to update other variables too */
s=sqrt(n); /* 's' does two things it stores the root value so that program does not have to calaculate it again and again and also it stores it in integer form instead of float*/
x=2; /* 'x' is the biggest prime number that is smaller or equal to root of the number 'n' being tested */
/* j ,x and c are related in this way, p[j] <= prime number x <= p[c] */
// the main loop begins here
for ( m=4,j=1,c=2; m<=l && n <= o;)
/* this condition checks if all the first 'l' numbers of primes are found or n does not exceed the set limit o */
{
// this will divide n by prime number in p[j] and tries to rule out non-primes
if ( n%p[j]==0 )
{
/* these steps execute if the number n is found to be non-prime */
++n; /* this increases n by 1 and therefore sets the next number 'n' to be tested */
s=sqrt(n); /* this calaulates and stores in 's' the new root of number 'n' */
if ( p[c] <= s && p[c] != x ) /* 'The Magic Setting' tests the next prime number candidate p[c] and if passed it updates the prime number x */
{
x=p[c];
++c;
}
j=1;
/* these steps sets the next number n to be tested and finds the next prime number x if possible for the new number 'n' and also resets j to 1 for the new cycle */
continue; /* and this restarts the loop for the new cycle */
}
// confirmation test for the prime number candidate n
else if ( n%p[j]!=0 && p[j]==x )
{
/* these steps execute if the number is found to be prime */
p[m]=n;
printf("%10dth:%10d\n",m,p[m]);
++n;
s = sqrt(n);
++m;
j=1;
/* these steps stores and prints the new prime number and moves the 'm' counter up and also sets the next number n to be tested and also resets j to 1 for the new cycle */
continue; /* and this restarts the loop */
/* the next number which will be a even and non-prime will trigger the magic setting in the next cycle and therfore we do not have to add another magic setting here*/
}
++j; /* increases p[j] to next prime number in the array for the next cycle testing of the number 'n' */
// if the cycle reaches this point that means the number 'n' was neither divisible by p[j] nor was it a prime number
// and therfore it will test the same number 'n' again in the next cycle with a bigger prime number
}
// the loops ends
printf(" All done !!\n");
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf(" Time taken : %lf sec\n",cpu_time_used);
}

I spend some time writing a program calculating a lot of primes and this is the code I'm used to calculate a text file containing the first 1.000.000.000 primes. It's in German, but the interesting part is the method calcPrimes(). The primes are stored in an array called Primzahlen. I recommend a 64bit CPU because the calculations are with 64bit integers.
import java.io.*;
class Primzahlengenerator {
long[] Primzahlen;
int LastUnknown = 2;
public static void main(String[] args) {
Primzahlengenerator Generator = new Primzahlengenerator();
switch(args.length) {
case 0: //Wenn keine Argumente übergeben worden:
Generator.printHelp(); //Hilfe ausgeben
return; //Durchfallen verhindern
case 1:
try {
Generator.Primzahlen = new long[Integer.decode(args[0]).intValue()];
}
catch (NumberFormatException e) {
System.out.println("Das erste Argument muss eine Zahl sein, und nicht als Wort z.B. \"Tausend\", sondern in Ziffern z.B. \"1000\" ausgedrückt werden.");//Hinweis, wie man die Argumente angeben muss ausgeben
Generator.printHelp(); //Generelle Hilfe ausgeben
return;
}
break;//dutchfallen verhindern
case 2:
switch (args[1]) {
case "-l":
System.out.println("Sie müsen auch eine Datei angeben!"); //Hilfemitteilung ausgeben
Generator.printHelp(); //Generelle Hilfe ausgeben
return;
}
break;//durchfallen verhindern
case 3:
try {
Generator.Primzahlen = new long[Integer.decode(args[0]).intValue()];
}
catch (NumberFormatException e) {
System.out.println("Das erste Argument muss eine Zahl sein, und nicht als Wort z.B. \"Tausend\", sondern in Ziffern z.B. \"1000\" ausgedrückt werden.");//Hinweis, wie man die Argumente angeben muss ausgeben
Generator.printHelp(); //Generelle Hilfe ausgeben
return;
}
switch(args[1]) {
case "-l":
Generator.loadFromFile(args[2]);//Datei Namens des Inhalts von Argument 3 lesen, falls Argument 2 = "-l" ist
break;
default:
Generator.printHelp();
break;
}
break;
default:
Generator.printHelp();
return;
}
Generator.calcPrims();
}
void printHelp() {
System.out.println("Sie müssen als erstes Argument angeben, die wieviel ersten Primzahlen sie berechnen wollen."); //Anleitung wie man das Programm mit Argumenten füttern muss
System.out.println("Als zweites Argument können sie \"-l\" wählen, worauf die Datei, aus der die Primzahlen geladen werden sollen,");
System.out.println("folgen muss. Sie muss genauso aufgebaut sein, wie eine Datei Primzahlen.txt, die durch den Aufruf \"java Primzahlengenerator 1000 > Primzahlen.txt\" entsteht.");
}
void loadFromFile(String File) {
// System.out.println("Lese Datei namens: \"" + File + "\"");
try{
int x = 0;
BufferedReader in = new BufferedReader(new FileReader(File));
String line;
while((line = in.readLine()) != null) {
Primzahlen[x] = new Long(line).longValue();
x++;
}
LastUnknown = x;
} catch(FileNotFoundException ex) {
System.out.println("Die angegebene Datei existiert nicht. Bitte geben sie eine existierende Datei an.");
} catch(IOException ex) {
System.err.println(ex);
} catch(ArrayIndexOutOfBoundsException ex) {
System.out.println("Die Datei enthält mehr Primzahlen als der reservierte Speicherbereich aufnehmen kann. Bitte geben sie als erstes Argument eine größere Zahl an,");
System.out.println("damit alle in der Datei enthaltenen Primzahlen aufgenommen werden können.");
}
/* for(long prim : Primzahlen) {
System.out.println("" + prim);
} */
//Hier soll code stehen, der von der Datei mit angegebenem Namen ( Wie diese aussieht einfach durch angeben von folgendem in cmd rausfinden:
//java Primzahlengenerator 1000 > 1000Primzahlen.txt
//da kommt ne textdatei, die die primzahlen enthält. mit Long.decode(String ziffern).longValue();
//erhält man das was an der entsprechenden stelle in das array soll. die erste zeile soll in [0] , die zweite zeile in [1] und so weiter.
//falls im arry der platz aus geht(die exception kenn ich grad nich, aber mach mal:
//int[] foo = { 1, 2, 3};
//int bar = foo[4];
//dann kriegst ne exception, das ist die gleiche die man kriegt, wenn im arry der platzt aus geht.
}
void calcPrims() {
int PrimzahlNummer = LastUnknown;
// System.out.println("LAstUnknown ist: " + LastUnknown);
Primzahlen[0] = 2;
Primzahlen[1] = 3;
long AktuelleZahl = Primzahlen[PrimzahlNummer - 1];
boolean IstPrimzahl;
// System.out.println("2");
// System.out.println("3");
int Limit = Primzahlen.length;
while(PrimzahlNummer < Limit) {
IstPrimzahl = true;
double WurzelDerAktuellenZahl = java.lang.Math.sqrt(AktuelleZahl);
for(int i = 1;i < PrimzahlNummer;i++) {
if(AktuelleZahl % Primzahlen[i] == 0) {
IstPrimzahl = false;
break;
}
if(Primzahlen[i] > WurzelDerAktuellenZahl) break;
}
if(IstPrimzahl) {
Primzahlen[PrimzahlNummer] = AktuelleZahl;
PrimzahlNummer++;
// System.out.println("" + AktuelleZahl);
}
AktuelleZahl = AktuelleZahl + 2;
}
for(long prim : Primzahlen) {
System.out.println("" + prim);
}
}
}

I have written this using python, as I just started learning it, and it works perfectly fine. The 10,000th prime generate by this code as same as mentioned in http://primes.utm.edu/lists/small/10000.txt. To check if n is prime or not, divide n by the numbers from 2 to sqrt(n). If any of this range of number perfectly divides n then it's not prime.
import math
print ("You want prime till which number??")
a = input()
a = int(a)
x = 0
x = int(x)
count = 1
print("2 is prime number")
for c in range(3,a+1):
b = math.sqrt(c)
b = int(b)
x = 0
for b in range(2,b+1):
e = c % b
e = int(e)
if (e == 0):
x = x+1
if (x == 0):
print("%d is prime number" % c)
count = count + 1
print("Total number of prime till %d is %d" % (a,count))

I have been working on find primes for about a year. This is what I found to be the fastest:
import static java.lang.Math.sqrt;
import java.io.PrintWriter;
import java.io.File;
public class finder {
public static void main(String[] args) {
primelist primes = new primelist();
primes.insert(3);
primes.insert(5);
File file = new File("C:/Users/Richard/Desktop/directory/file0024.txt");
file.getParentFile().mkdirs();
long time = System.nanoTime();
try{
PrintWriter printWriter = new PrintWriter ("file0024.txt");
int linenum = 0;
printWriter.print("2");
printWriter.print (" , ");
printWriter.print("3");
printWriter.print (" , ");
int up;
int down;
for(int i =1; i<357913941;i++){//
if(linenum%10000==0){
printWriter.println ("");
linenum++;
}
down = i*6-1;
if(primes.check(down)){
primes.insert(down);
//System.out.println(i*6-1);
printWriter.print ( down );
printWriter.print (" , ");
linenum++;
}
up = i*6+1;
if(primes.check(up)){
primes.insert(up);
//System.out.println(i*6+1);
printWriter.print ( up );
printWriter.print (" , ");
linenum++;
}
}
printWriter.println ("Time to execute");
printWriter.println (System.nanoTime()-time);
//System.out.println(primes.length);
printWriter.close ();
}catch(Exception e){}
}
}
class node{
node next;
int x;
public node (){
node next;
x = 3;
}
public node(int z) {
node next;
x = z;
}
}
class primelist{
node first;
int length =0;
node current;
public void insert(int x){
node y = new node(x);
if(current == null){
current = y;
first = y;
}else{
current.next = y;
current = y;
}
length++;
}
public boolean check(int x){
int p = (int)sqrt(x);
node y = first;
for(int i = 0;i<length;i++){
if(y.x>p){
return true;
}else if(x%y.x ==0){
return false;
}
y = y.next;
}
return true;
}
}
1902465190909 nano seconds to get to 2147483629 starting at 2.

Here the code that I made :
enter code here
#include <cmath>
#include <cstdio>
#include <vector>
#include <iostream>
#include <algorithm>
using namespace std;
int main() {
/* Enter your code here. Read input from STDIN. Print output to STDOUT*/
unsigned long int n;
int prime(unsigned long int);
scanf("%ld",&n);
unsigned long int val;
for(unsigned long int i=0;i<n;i++)
{
int flag=0;
scanf("%ld",&val);
flag=prime(val);
if(flag==1)
printf("yes\n");
else
printf("no\n");
}
return 0;
}
int prime(unsigned long int n)
{
if(n==2) return 1;
else if (n == 1||n%2==0) return 0;
for (unsigned long int i=3; i<=sqrt(n); i+=2)
if (n%i == 0)
return 0;
return 1;
}

Using the Array.prototype.find() method in Javascript. 2214.486 ms
function isPrime (number) {
function prime(element) {
let start = 2;
while (start <= Math.sqrt(element)) {
if (element % start++ < 1) {
return false;
}
}
return element > 1;
}
return [number].find(prime)
}
function logPrimes (n) {
let count = 0
let nth = n
let i = 0
while (count < nth) {
if (isPrime(i)) {
count++
console.log('i', i) //NOTE: If this line is ommited time to find 10,000th prime is 121.157ms
if (count === nth) {
console.log('while i', i)
console.log('count', count)
}
}
i++
}
}
console.time(logPrimes)
logPrimes(10000)
console.timeEnd(logPrimes) // 2214.486ms

I can give you some tips, you have to implement it.
For each number, get the half of that number. E.g. for checking 21, only obtain the remainder by dividing it from range 2-10.
If its an odd number, only divide by odd number, and vice versa. Such as for 21, divide with 3, 5, 7, 9 only.
Most efficient method I got up to so far.

Since you want first 10000 primes only, rather than coding complex algorithm I'll suggest
the following
boolean isPrime(int n){
//even but is prime
if(n==2)
return true;
//even numbers filtered already
if(n==0 || n==1 || n%2==0)
return false;
// loop for checking only odd factors
// i*i <= n (same as i<=sqrt(n), avoiding floating point calculations)
for(int i=3 ; i*i <=n ; i+=2){
// if any odd factor divides n then its not a prime!
if(n%i==0)
return false;
}
// its prime now
return true;
}
now call is prime as you need it
for(int i=1 ; i<=1000 ; i++){
if(isPrime(i)){
//do something
}
}

This is an old question, but there's something here everyone's missing...
For primes this small, trial division isn't that slow... there are only 25 primes under 100. With so few primes to test, and such small primes, we can pull out a neat trick!
If a is coprime to b, then gcd a b = 1. Coprime. Fun word. Means it doesn't share any prime factors. We can thus test for divisibility by several primes with one GCD call. How many? Well, the product of the first 15 primes is less than 2^64. And the product of the next 10 is also less than 2^64. That's all 25 that we need. But is it worth it?
Let's see:
check x = null $ filter ((==0) . (x `mod`)) $ [<primes up to 101>]
Prelude> length $ filter check [101,103..85600]
>>> 9975
(0.30 secs, 125,865,152 bytes
a = 16294579238595022365 :: Word64
b = 14290787196698157718
pre = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97]
primes = (pre ++) $ filter ((==1) . gcd a) $ filter ((==1) . gcd b) [99,101..85600]
main = print $ length primes
Prelude> main
>>> 10000
(0.05 secs, 36,387,520 bytes)
A 6 fold improvement there.
(length is to force the list to be computed. By default Haskell prints things 1 Unicode character at a time and so actually printing the list will either dominate the time or dominate the amount of actual code used.)
Of course, this is running in GHCi - a repl running interpreted code - on an old laptop and it is not interpreting any of these numbers as int64s or even BigInts, nor will it even if you ask it to (well, you can force it, but it's ugly and doesn't really help). It is interpreting every single number there as generalized Integer-like things that can be specialized to some particular type via dictionary lookup, and it is traversing a linked list (which is not fused away here as it's not compiled) 3 times. Interestingly, hand fusing the two filters actually slows it down in the REPL.
Let's compile it:
...\Haskell\8.6\Testbed>Primes.exe +RTS -s
10000
606,280 bytes allocated in the heap
Total time 0.000s ( 0.004s elapsed)
Using the RTS report because Windows. Some lines trimmed because they aren't relevant - they were other GC data, or measurements of only part of the execution, and together add up to 0.004s (or less). It's also not constant folding, because Haskell doesn't actually do much of that. If we constant fold ourselves (main = print 10000), we get dramatically lower allocation:
...Haskell\8.6\Testbed>Primes.exe +RTS -s
10000
47,688 bytes allocated in the heap
Total time 0.000s ( 0.001s elapsed)
Literally just enough to load the runtime, then discover there's nothing to do but print a number and exit. Let's add wheel factorization:
wheel = scanl (+) 7 $ cycle [4, 2, 4, 2, 4, 6, 2, 6]
primes = (pre ++) $ filter ((==1) . gcd a) $ filter ((==1) . gcd b) $ takeWhile (<85600) wheel
Total time 0.000s ( 0.003s elapsed)
Cut down approximately 1/3rd relative to our reference of main = print 10000, but there's definitely room for more optimization. It actually stopped to perform a GC in there for example, while with tweaking there shouldn't be any heap use. For some reason, compiling for profiling here actually cuts the runtime down to 2 milliseconds:
Tue Nov 12 21:13 2019 Time and Allocation Profiling Report (Final)
Primes.exe +RTS -p -RTS
total time = 0.00 secs (2 ticks # 1000 us, 1 processor)
total alloc = 967,120 bytes (excludes profiling overheads)
I'm going to leave this as is for now, I'm pretty sure random jitter is starting to dominate.

def compute_primes(bound):
"""
Return a list of the prime numbers in range(2, bound)
Implement the Sieve of Eratosthenes
https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
"""
primeNumber = [True for i in range(bound + 1)]
start_prime_number = 2
primes = []
while start_prime_number * start_prime_number <=bound:
# If primeNumber[start_prime_number] is not changed, then it is a prime
if primeNumber[start_prime_number]:
# Update all multiples of start_prime_number
for i in range(start_prime_number * start_prime_number, bound + 1, start_prime_number):
primeNumber[i] = False
start_prime_number += 1
# Print all prime numbers
for start_prime_number in range(2, bound + 1):
if primeNumber[start_prime_number]:
primes.append(start_prime_number)
return primes
print(len(compute_primes(200)))
print(len(compute_primes(2000)))

This is a Python code that prints prime numbers between 1 to 1000000.
import math
k=0
factor=0
pl=[]
for i in range(1,1000000):
k=int(math.sqrt(i))
if i==2 or i==3:
pl.append(i)
for j in range(2,k+1):
if i%j==0:
factor=factor+1
elif factor==0 and j==k:
pl.append(i)
factor=0
print(pl)
print(len(pl))

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Factorial of integer mod m fast calculation [duplicate] - algorithm

Related

Compute the exact inverse of this "simple" floating-point function

Is there an efficient way to calculate ceiling of log_b(a)?

How does a compiler optimise this factorial function so well?

Fastest sort of fixed length 6 int array

Most efficient code for the first 10000 prime numbers?

Categories

Resources