Can anyone sort faster than this? [closed] - algorithm

I was able to write an even faster sort for integers!
It sorts faster than the array can be generated. It works by declaring a counting array whose length is the maximum value of the integer array to be sorted (plus one), initialized to zero. The array to be sorted is then looped through, using each value as an index into the counting array, which is incremented every time that value is encountered. Finally, the counting array is looped through, and each index is written to the input/output array, in order, as many times as it was counted.
Code below:
SUBROUTINE icountSORT(arrA, nA)
  ! This is a count sort. It counts the frequency of
  ! each element in the integer array to be sorted using
  ! an array with a length of MAXVAL(arrA)+1 such that
  ! 0's are counted at index 1, 1's are counted at index 2,
  ! etc.
  !
  ! ~ Derrel Walters
  IMPLICIT NONE
  INTEGER(KIND=8),INTENT(IN) :: nA
  INTEGER(KIND=8),DIMENSION(nA),INTENT(INOUT) :: arrA
  INTEGER(KIND=8),ALLOCATABLE,DIMENSION(:) :: arrB
  INTEGER(KIND=8) :: i, j, k, maxA
  INTEGER :: iStat
  maxA = MAXVAL(arrA)
  ALLOCATE(arrB(maxA+1),STAT=iStat)
  arrB = 0
  DO i = 1, nA
    arrB(arrA(i)+1) = arrB(arrA(i)+1) + 1
  END DO
  k = 1
  DO i = 1, SIZE(arrB)
    DO j = 1, arrB(i)
      arrA(k) = i - 1
      k = k + 1
    END DO
  END DO
END SUBROUTINE icountSORT
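For readers who prefer Python, here is a hedged sketch of the same counting idea (illustrative only, assuming non-negative integers, as the Fortran routine above does; the names are made up):

def counting_sort(arr):
    counts = [0] * (max(arr) + 1)      # one slot per possible value, 0..max
    for v in arr:
        counts[v] += 1                 # tally each value
    out_index = 0
    for value, count in enumerate(counts):
        for _ in range(count):         # write each value back 'count' times, in order
            arr[out_index] = value
            out_index += 1
    return arr

assert counting_sort([3, 0, 2, 3, 1]) == [0, 1, 2, 3, 3]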
Posting more evidence: n log n predicts execution times that are too high at large array sizes. Further, the Fortran program posted near the end of this question writes the array (unsorted and sorted) to files and reports the write and sort times. File writing is a known O(n) process. The sort runs faster than the file writing all the way up to the largest arrays. If the sort were running in O(n log n), at some point the sorting time would cross the write time and become longer at large array sizes. Therefore, it has been shown that this sort routine executes with O(n) time complexity.
I've added a complete Fortran program for compilation at the bottom of this post so that the output can be reproduced. The execution times are linear.
More timing data in a clearer format, using the code below, from a Debian environment on Windows 10:
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ for (( i=100000; i<=50000000; i=2*i )); do ./derrelSORT-example.py $i; done | awk 'BEGIN {print "N Time(s)"}; {if ($1=="Creating") {printf $4" "} else if ($1=="Sorting" && $NF=="seconds") {print $3}}'
N Time(s)
100000 0.01
200000 0.02
400000 0.04
800000 0.08
1600000 0.17
3200000 0.35
6400000 0.76
12800000 1.59
25600000 3.02
This code executes linearly with respect to the number of elements (an integer example is given here). It achieves this by exponentially increasing the size of the sorted chunks as the (merge) sort proceeds. To facilitate the exponentially growing chunks:
the number of iterations needs to be calculated before the sort begins;
index transformations need to be derived for the chunks (language specific, depending on the indexing protocol) before they are passed to merge(); and
the remainder at the tail of the list must be handled gracefully when the list length is not evenly divisible by the chunk size (a power of 2).
With these things in mind, and starting, traditionally, by merging pairs of single-value arrays, the merged chunks can be grown from 2 to 4 to 8 to 16 and so on up to 2^n. This single case is the exception that breaks the O(n log n) speed limit for comparative sorts. This routine sorts linearly with respect to the number of elements to be sorted.
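For illustration only, here is a hedged Python sketch of the general bottom-up scheme described above (doubling chunk widths, with the ragged tail handled by min); it is not the Fortran routine posted below:

def bottom_up_mergesort(a):
    n = len(a)
    width = 1
    while width < n:                      # one pass per doubling of the chunk width
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)   # handles the remainder chunk at the tail
            a[lo:hi] = merge(a[lo:mid], a[mid:hi])
        width *= 2
    return a

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1   # take from the left on ties, keeping the sort stable
    return out + left[i:] + right[j:]

assert bottom_up_mergesort([5, 3, 8, 1, 9, 2, 7]) == [1, 2, 3, 5, 7, 8, 9]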
Can anyone sort faster? ;)
Fortran Code (derrelSORT.f90):
! Derrel Walters © 2019
! These sort routines were written by Derrel Walters ~ 2019-01-23
SUBROUTINE iSORT(arrA, nA)
! This implementation of derrelSORT is for integers,
! but the same principles apply for other datatypes.
!
! ~ Derrel Walters
IMPLICIT NONE
INTEGER(KIND=8),INTENT(IN) :: nA
INTEGER,DIMENSION(nA),INTENT(INOUT) :: arrA
INTEGER,DIMENSION(nA) :: arrB
INTEGER(KIND=8) :: lowIDX, highIDX, midIDX
INTEGER :: iStat
INTEGER(KIND=8) :: i, j, A, B, C, thisHigh, mergeSize, nLoops
INTEGER,DIMENSION(:),ALLOCATABLE :: iterMark
LOGICAL,DIMENSION(:),ALLOCATABLE :: moreToGo
arrB = arrA
mergeSize = 2
lowIDX = 1 - mergeSize
highIDX = 0
nLoops = INT(LOG(REAL(nA))/LOG(2.0))
ALLOCATE(iterMark(nLoops), moreToGo(nLoops), STAT=iStat)
moreToGo = .FALSE.
iterMark = 0
DO i = 1, nLoops
iterMark(i) = FLOOR(REAL(nA)/2**i)
IF (MOD(nA, 2**i) > 0) THEN
moreToGo(i) = .TRUE.
iterMark(i) = iterMark(i) + 1
END IF
END DO
DO i = 1, nLoops
DO j = 1, iterMark(i)
A = 0
B = 1
C = 0
lowIDX = lowIDX + mergeSize
highIDX = highIDX + mergeSize
midIDX = (lowIDX + highIDX + 1) / 2
thisHigh = highIDX
IF (j == iterMark(i).AND.moreToGo(i)) THEN
lowIDX = lowIDX - mergeSize
highIDX = highIDX - mergeSize
midIDX = (lowIDX + highIDX + 1) / 2
A = midIDX - lowIDX
B = 2
C = nA - 2*highIDX + midIDX - 1
thisHigh = nA
END IF
CALL imerge(arrA(lowIDX:midIDX-1+A), B*(midIDX-lowIDX), &
arrA(midIDX+A:thisHigh), highIDX-midIDX+1+C, &
arrB(lowIDX:thisHigh), thisHigh-lowIDX+1)
arrA(lowIDX:thisHigh) = arrB(lowIDX:thisHigh)
END DO
mergeSize = 2*mergeSize
lowIDX = 1 - mergeSize
highIDX = 0
END DO
END SUBROUTINE iSORT
SUBROUTINE imerge(arrA, nA, arrB, nB, arrC, nC)
! This merge is a faster merge. Array A arrives
! just to the left of Array B, and Array C is
! filled from both ends simultaneously - while
! still preserving the stability of the sort.
! The derrelSORT routine is so fast, that
! the merge does not affect the O(n) time
! complexity of the sort in practice
!
! ~ Derrel Walters
IMPLICIT NONE
INTEGER(KIND=8),INTENT(IN) :: nA, nB , nC
INTEGER,DIMENSION(nA),INTENT(IN) :: arrA
INTEGER,DIMENSION(nB),INTENT(IN) :: arrB
INTEGER,DIMENSION(nC),INTENT(INOUT) :: arrC
INTEGER(KIND=8) :: i, j, k, x, y, z
arrC = 0
i = 1
j = 1
k = 1
x = nA
y = nB
z = nC
DO
IF (i > x .OR. j > y) EXIT
IF (arrB(j) < arrA(i)) THEN
arrC(k) = arrB(j)
j = j + 1
ELSE
arrC(k) = arrA(i)
i = i + 1
END IF
IF (arrA(x) > arrB(y)) THEN
arrC(z) = arrA(x)
x = x - 1
ELSE
arrC(z) = arrB(y)
y = y - 1
END IF
k = k + 1
z = z - 1
END DO
IF (i <= x) THEN
DO
IF (i > x) EXIT
arrC(k) = arrA(i)
i = i + 1
k = k + 1
END DO
ELSEIF (j <= y) THEN
DO
IF (j > y) EXIT
arrC(k) = arrB(j)
j = j + 1
k = k + 1
END DO
END IF
END SUBROUTINE imerge
The times below were obtained by using f2py3 to convert the above Fortran file (derrelSORT.f90) into a module callable from Python. Here is the Python code and the times it produced (derrelSORT-example.py):
#!/bin/python3
import numpy as np
import derrelSORT as dS
import time as t
import random as rdm
import sys
try:
    array_len = int(sys.argv[1])
except IndexError:
    array_len = 100000000
# Create an array with array_len elements
print(50*'-')
print("Creating array of", array_len, "random integers.")
t0 = t.time()
x = np.asfortranarray(np.array([round(100000*rdm.random(), 0)
                                for i in range(array_len)]).astype(np.int32))
t1 = t.time()
print('Creation time:', round(t1-t0, 2), 'seconds')
# Sort the array using derrelSORT
print("Sorting the array with derrelSORT.")
t0 = t.time()
dS.isort(x, len(x))
t1 = t.time()
print('Sorting time:', round(t1-t0, 2), 'seconds')
print(50*'-')
Output from the command line. Please note the times.
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ ./derrelSORT-example.py 1000000
--------------------------------------------------
Creating array of 1000000 random integers.
Creation time: 0.78 seconds
Sorting the array with derrelSORT.
Sorting time: 0.1 seconds
--------------------------------------------------
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ ./derrelSORT-example.py 10000000
--------------------------------------------------
Creating array of 10000000 random integers.
Creation time: 8.1 seconds
Sorting the array with derrelSORT.
Sorting time: 1.07 seconds
--------------------------------------------------
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ ./derrelSORT-example.py 20000000
--------------------------------------------------
Creating array of 20000000 random integers.
Creation time: 15.73 seconds
Sorting the array with derrelSORT.
Sorting time: 2.21 seconds
--------------------------------------------------
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ ./derrelSORT-example.py 40000000
--------------------------------------------------
Creating array of 40000000 random integers.
Creation time: 31.64 seconds
Sorting the array with derrelSORT.
Sorting time: 4.39 seconds
--------------------------------------------------
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ ./derrelSORT-example.py 80000000
--------------------------------------------------
Creating array of 80000000 random integers.
Creation time: 64.03 seconds
Sorting the array with derrelSORT.
Sorting time: 8.92 seconds
--------------------------------------------------
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ ./derrelSORT-example.py 160000000
--------------------------------------------------
Creating array of 160000000 random integers.
Creation time: 129.56 seconds
Sorting the array with derrelSORT.
Sorting time: 18.04 seconds
--------------------------------------------------
More output:
dwalters@Lapper3:~/PROGRAMMING/DATA-WATER$ for (( i=100000; i<=500000000; i=2*i )); do
> ./derrelSORT-example.py $i
> done
--------------------------------------------------
Creating array of 100000 random integers.
Creation time: 0.08 seconds
Sorting the array with derrelSORT.
Sorting time: 0.01 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 200000 random integers.
Creation time: 0.16 seconds
Sorting the array with derrelSORT.
Sorting time: 0.02 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 400000 random integers.
Creation time: 0.32 seconds
Sorting the array with derrelSORT.
Sorting time: 0.04 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 800000 random integers.
Creation time: 0.68 seconds
Sorting the array with derrelSORT.
Sorting time: 0.08 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 1600000 random integers.
Creation time: 1.25 seconds
Sorting the array with derrelSORT.
Sorting time: 0.15 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 3200000 random integers.
Creation time: 2.57 seconds
Sorting the array with derrelSORT.
Sorting time: 0.32 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 6400000 random integers.
Creation time: 5.23 seconds
Sorting the array with derrelSORT.
Sorting time: 0.66 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 12800000 random integers.
Creation time: 10.09 seconds
Sorting the array with derrelSORT.
Sorting time: 1.35 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 25600000 random integers.
Creation time: 20.25 seconds
Sorting the array with derrelSORT.
Sorting time: 2.74 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 51200000 random integers.
Creation time: 41.84 seconds
Sorting the array with derrelSORT.
Sorting time: 5.62 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 102400000 random integers.
Creation time: 93.19 seconds
Sorting the array with derrelSORT.
Sorting time: 11.49 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 204800000 random integers.
Creation time: 167.55 seconds
Sorting the array with derrelSORT.
Sorting time: 24.13 seconds
--------------------------------------------------
--------------------------------------------------
Creating array of 409600000 random integers.
Creation time: 340.84 seconds
Sorting the array with derrelSORT.
Sorting time: 47.21 seconds
--------------------------------------------------
When the array size doubles, the time doubles, as demonstrated. Thus, Mr. Mischel's initial assessment was incorrect. The reason is that, while the outer loop determines the number of cycles at each chunk size (which is log2(n)), the inner loop counter decreases exponentially as the sort proceeds. The proverbial proof is in the pudding, however: the times demonstrate the linearity clearly.
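To see the bookkeeping the paragraph above describes, here is a hedged Python sketch that tabulates, for each pass, the chunk width and the number of inner-loop iterations (the names are illustrative, not the Fortran variables):

import math

def passes_and_merges(n):
    # For pass i the chunk width is 2**i and the inner loop runs roughly n / 2**i times
    # (plus one extra iteration when there is a remainder), mirroring iterMark in the post.
    n_passes = int(math.log2(n))
    return [(2 ** i, n // 2 ** i + (1 if n % 2 ** i else 0)) for i in range(1, n_passes + 1)]

for width, merges in passes_and_merges(100):
    print(f"chunk width {width:3d}: {merges} inner-loop iterations")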
If anyone needs any assistance reproducing the results, please let me know. I'm happy to help.
The Fortran program found at the end of this post is an as-is copy of the one I wrote in 2019. It is meant to be used on the command line. To build and run it:
Copy the fortran code to a file with an .f90 extension
Compile the code using a command, such as:
gfortran -o derrelSORT-ex.x derrelSORT.f90
Give yourself permission to run the executable:
chmod u+x derrelSORT-ex.x
Execute the program from the command-line with or without an integer argument:
./derrelSORT-ex.x
or
./derrelSORT-ex.x 10000000
The output should look something like this (here, I've used a bash c-style loop to call the command repeatedly). Notice that as the array sizes double with each iteration, the execution time also doubles.
SORT-RESEARCH$ for (( i=100000; i<500000000; i=2*i )); do
> ./derrelSORT-2022.x $i
> done
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 100000
Time = 0.0000 seconds
Writing Array to rand-in.txt:
Time = 0.0312 seconds
Sorting the Array
Time = 0.0156 seconds
Writing Array to rand-sorted-out.txt:
Time = 0.0469 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 200000
Time = 0.0000 seconds
Writing Array to rand-in.txt:
Time = 0.0625 seconds
Sorting the Array
Time = 0.0312 seconds
Writing Array to rand-sorted-out.txt:
Time = 0.0312 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 400000
Time = 0.0156 seconds
Writing Array to rand-in.txt:
Time = 0.1250 seconds
Sorting the Array
Time = 0.0625 seconds
Writing Array to rand-sorted-out.txt:
Time = 0.0938 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 800000
Time = 0.0156 seconds
Writing Array to rand-in.txt:
Time = 0.2344 seconds
Sorting the Array
Time = 0.1406 seconds
Writing Array to rand-sorted-out.txt:
Time = 0.2031 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 1600000
Time = 0.0312 seconds
Writing Array to rand-in.txt:
Time = 0.4219 seconds
Sorting the Array
Time = 0.2969 seconds
Writing Array to rand-sorted-out.txt:
Time = 0.3906 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 3200000
Time = 0.0625 seconds
Writing Array to rand-in.txt:
Time = 0.8281 seconds
Sorting the Array
Time = 0.6562 seconds
Writing Array to rand-sorted-out.txt:
Time = 0.7969 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 6400000
Time = 0.0938 seconds
Writing Array to rand-in.txt:
Time = 1.5938 seconds
Sorting the Array
Time = 1.3281 seconds
Writing Array to rand-sorted-out.txt:
Time = 1.6406 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 12800000
Time = 0.2500 seconds
Writing Array to rand-in.txt:
Time = 3.3906 seconds
Sorting the Array
Time = 2.7031 seconds
Writing Array to rand-sorted-out.txt:
Time = 3.2656 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 25600000
Time = 0.4062 seconds
Writing Array to rand-in.txt:
Time = 6.6250 seconds
Sorting the Array
Time = 5.6094 seconds
Writing Array to rand-sorted-out.txt:
Time = 6.5312 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 51200000
Time = 0.8281 seconds
Writing Array to rand-in.txt:
Time = 13.2656 seconds
Sorting the Array
Time = 11.5000 seconds
Writing Array to rand-sorted-out.txt:
Time = 13.1719 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 102400000
Time = 1.6406 seconds
Writing Array to rand-in.txt:
Time = 26.3750 seconds
Sorting the Array
Time = 23.3438 seconds
Writing Array to rand-sorted-out.txt:
Time = 27.0625 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 204800000
Time = 3.3438 seconds
Writing Array to rand-in.txt:
Time = 53.1094 seconds
Sorting the Array
Time = 47.3750 seconds
Writing Array to rand-sorted-out.txt:
Time = 52.8906 seconds
Derrel Walters © 2019
Demonstrating derrelSORT©
WARNING: This program can produce LARGE files!
Generating random array of length: 409600000
Time = 6.6562 seconds
Writing Array to rand-in.txt:
Time = 105.1875 seconds
Sorting the Array
Time = 99.5938 seconds
Writing Array to rand-sorted-out.txt:
Time = 109.9062 seconds
This is the program as-is from 2019 without modification:
SORT-RESEARCH$ cat derrelSORT.f90
! Derrel Walters © 2019
! These sort routines were written by Derrel Walters ~ 2019-01-23
PROGRAM sort_test
! This program demonstrates a linear sort routine
! by generating a random array (here integer), writing it
! to a file 'rand-in.txt', sorting it with an
! implementation of derrelSORT (here for integers -
! where the same principles apply for other applicable
! datatypes), and finally, printing the sorted array
! to a file 'rand-sorted-out.txt'.
!
! To the best understanding of the author, the expert
! consensus is that a comparative sort can, at best,
! be done with O(nlogn) time complexity. Here a sort
! is demonstrated which experimentally runs O(n).
!
! Such time complexity is currently considered impossible
! for a sort. Using this sort, extremely large amounts of data can be
! sorted on any modern computer using a single processor core -
! provided the computer has enough memory to hold the array! For example,
! the sorting time for a given array will be on par (perhaps less than)
! what it takes the same computer to write the array to a file.
!
! ~ Derrel Walters
IMPLICIT NONE
INTEGER,PARAMETER :: in_unit = 21
INTEGER,PARAMETER :: out_unit = 23
INTEGER,DIMENSION(:),ALLOCATABLE :: iArrA
REAL,DIMENSION(:),ALLOCATABLE :: rArrA
CHARACTER(LEN=15) :: cDims
CHARACTER(LEN=80) :: ioMsgStr
INTEGER(KIND=8) :: nDims, i
INTEGER :: iStat
REAL :: start, finish
WRITE(*,*) ''
WRITE(*,'(A)') 'Derrel Walters © 2019'
WRITE(*,*) ''
WRITE(*,'(A)') 'Demonstrating derrelSORT©'
WRITE(*,'(A)') 'WARNING: This program can produce LARGE files!'
WRITE(*,*) ''
CALL GET_COMMAND_ARGUMENT(1, cDims)
IF (cDims == '') THEN
nDims = 1000000
ELSE
READ(cDims,'(1I15)') nDims
END IF
ALLOCATE(iArrA(nDims),rArrA(nDims),STAT=iStat)
WRITE(*,'(A,1X,1I16)') 'Generating random array of length:', nDims
CALL CPU_TIME(start)
CALL RANDOM_NUMBER(rArrA)
iArrA = INT(rArrA*1000000)
CALL CPU_TIME(finish)
WRITE(*,'(A,1X,f9.4,1X,A)') 'Time =',finish-start,'seconds'
DEALLOCATE(rArrA,STAT=iStat)
WRITE(*,'(A)') 'Writing Array to rand-in.txt: '
OPEN(UNIT=in_unit,FILE='rand-in.txt',STATUS='REPLACE',ACTION='WRITE',IOSTAT=iStat,IOMSG=ioMsgStr)
IF (iStat /= 0) THEN
WRITE(*,'(A)') ioMsgStr
ELSE
CALL CPU_TIME(start)
DO i=1, nDims
WRITE(in_unit,*) iArrA(i)
END DO
CLOSE(in_unit)
CALL CPU_TIME(finish)
WRITE(*,'(A,1X,f9.4,1X,A)') 'Time =',finish-start,'seconds'
END IF
WRITE(*,'(A)') 'Sorting the Array'
CALL CPU_TIME(start)
CALL iderrelSORT(iArrA, nDims) !! SIZE(iArrA))
CALL CPU_TIME(finish)
WRITE(*,'(A,1X,f9.4,1X,A)') 'Time =',finish-start,'seconds'
WRITE(*,'(A)') 'Writing Array to rand-sorted-out.txt: '
OPEN(UNIT=out_unit,FILE='rand-sorted-out.txt',STATUS='REPLACE',ACTION='WRITE',IOSTAT=iStat,IOMSG=ioMsgStr)
IF (iStat /= 0) THEN
WRITE(*,'(A)') ioMsgStr
ELSE
CALL CPU_TIME(start)
DO i=1, nDims
WRITE(out_unit,*) iArrA(i)
END DO
CLOSE(out_unit)
CALL CPU_TIME(finish)
WRITE(*,'(A,1X,f9.4,1X,A)') 'Time =',finish-start,'seconds'
END IF
WRITE(*,*) ''
END PROGRAM sort_test
SUBROUTINE iderrelSORT(arrA, nA)
! This implementation of derrelSORT is for integers,
! but the same principles apply for other datatypes.
!
! ~ Derrel Walters
IMPLICIT NONE
INTEGER(KIND=8),INTENT(IN) :: nA
INTEGER,DIMENSION(nA),INTENT(INOUT) :: arrA
INTEGER,DIMENSION(nA) :: arrB
INTEGER(KIND=8) :: lowIDX, highIDX, midIDX
INTEGER :: iStat
INTEGER(KIND=8) :: i, j, A, B, C, thisHigh, mergeSize, nLoops
INTEGER,DIMENSION(:),ALLOCATABLE :: iterMark
LOGICAL,DIMENSION(:),ALLOCATABLE :: moreToGo
arrB = arrA
mergeSize = 2
lowIDX = 1 - mergeSize
highIDX = 0
nLoops = INT(LOG(REAL(nA))/LOG(2.0))
ALLOCATE(iterMark(nLoops), moreToGo(nLoops), STAT=iStat)
moreToGo = .FALSE.
iterMark = 0
DO i = 1, nLoops
iterMark(i) = FLOOR(REAL(nA)/2**i)
IF (MOD(nA, 2**i) > 0) THEN
moreToGo(i) = .TRUE.
iterMark(i) = iterMark(i) + 1
END IF
END DO
DO i = 1, nLoops
DO j = 1, iterMark(i)
A = 0
B = 1
C = 0
lowIDX = lowIDX + mergeSize
highIDX = highIDX + mergeSize
midIDX = (lowIDX + highIDX + 1) / 2
thisHigh = highIDX
IF (j == iterMark(i).AND.moreToGo(i)) THEN
lowIDX = lowIDX - mergeSize
highIDX = highIDX - mergeSize
midIDX = (lowIDX + highIDX + 1) / 2
A = midIDX - lowIDX
B = 2
C = nA - 2*highIDX + midIDX - 1
thisHigh = nA
END IF
!! The traditional merge can also be used (see subroutine for comment). !!
! !
! CALL imerge(arrA(lowIDX:midIDX-1+A), B*(midIDX-lowIDX), & !
! arrA(midIDX+A:thisHigh), highIDX-midIDX+1+C, & !
! arrB(lowIDX:thisHigh), thisHigh-lowIDX+1) !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
CALL imerge2(arrA(lowIDX:midIDX-1+A), B*(midIDX-lowIDX), &
arrA(midIDX+A:thisHigh), highIDX-midIDX+1+C, &
arrB(lowIDX:thisHigh), thisHigh-lowIDX+1)
arrA(lowIDX:thisHigh) = arrB(lowIDX:thisHigh)
END DO
mergeSize = 2*mergeSize
lowIDX = 1 - mergeSize
highIDX = 0
END DO
END SUBROUTINE iderrelSORT
SUBROUTINE imerge(arrA, nA, arrB, nB, arrC, nC)
! This merge is a traditional merge that places
! the lowest element first. The form that the
! time complexity takes, O(n), is not affected
! by the merge routine - yet this routine
! does not run as fast as the merge used in
! imerge2.
!
! ~Derrel Walters
IMPLICIT NONE
INTEGER(KIND=8),INTENT(IN) :: nA, nB , nC
INTEGER,DIMENSION(nA),INTENT(IN) :: arrA
INTEGER,DIMENSION(nB),INTENT(IN) :: arrB
INTEGER,DIMENSION(nC),INTENT(INOUT) :: arrC
INTEGER(KIND=8) :: i, j, k
arrC = 0
i = 1
j = 1
k = 1
DO
IF (i > nA .OR. j > NB) EXIT
IF (arrB(j) < arrA(i)) THEN
arrC(k) = arrB(j)
j = j + 1
ELSE
arrC(k) = arrA(i)
i = i + 1
END IF
k = k + 1
END DO
IF (i <= nA) THEN
DO
IF (i > nA) EXIT
arrC(k) = arrA(i)
i = i + 1
k = k + 1
END DO
ELSEIF (j <= nB) THEN
DO
IF (j > nB) EXIT
arrC(k) = arrB(j)
j = j + 1
k = k + 1
END DO
END IF
END SUBROUTINE imerge
SUBROUTINE imerge2(arrA, nA, arrB, nB, arrC, nC)
! This merge is a faster merge. Array A arrives
! just to the left of Array B, and Array C is
! filled from both ends simultaneously - while
! still preserving the stability of the sort.
! The derrelSORT routine is so fast, that
! the merge does not affect the O(n) time
! complexity of the sort in practice
! (perhaps, making its execution more linear
! at small numbers of elements).
!
! ~ Derrel Walters
IMPLICIT NONE
INTEGER(KIND=8),INTENT(IN) :: nA, nB , nC
INTEGER,DIMENSION(nA),INTENT(IN) :: arrA
INTEGER,DIMENSION(nB),INTENT(IN) :: arrB
INTEGER,DIMENSION(nC),INTENT(INOUT) :: arrC
INTEGER(KIND=8) :: i, j, k, x, y, z
arrC = 0
i = 1
j = 1
k = 1
x = nA
y = nB
z = nC
DO
IF (i > x .OR. j > y) EXIT
IF (arrB(j) < arrA(i)) THEN
arrC(k) = arrB(j)
j = j + 1
ELSE
arrC(k) = arrA(i)
i = i + 1
END IF
IF (arrA(x) > arrB(y)) THEN
arrC(z) = arrA(x)
x = x - 1
ELSE
arrC(z) = arrB(y)
y = y - 1
END IF
k = k + 1
z = z - 1
END DO
IF (i <= x) THEN
DO
IF (i > x) EXIT
arrC(k) = arrA(i)
i = i + 1
k = k + 1
END DO
ELSEIF (j <= y) THEN
DO
IF (j > y) EXIT
arrC(k) = arrB(j)
j = j + 1
k = k + 1
END DO
END IF
END SUBROUTINE imerge2
MOAR data using the Fortran version. Anyone into straight lines?
SORT-RESEARCH$ for (( i=100000; i<500000000; i=2*i )); do ./derrelSORT-2022.x $i; done | awk 'BEGIN {old_1="Derrel"; print "N Time(s)"};{if ($1 == "Generating") {printf $NF" "; old_1=$1} else if (old_1 == "Sorting") {print $3; old_1=$1} else {old_1=$1}}'
N Time(s)
100000 0.0000
200000 0.0312
400000 0.0625
800000 0.1562
1600000 0.2969
3200000 0.6250
6400000 1.3594
12800000 2.7500
25600000 5.5625
51200000 11.8906
102400000 23.3750
204800000 47.3750
409600000 96.4531
Appears linear, doesn't it? ;)
Fortran sorting times from above plotted.

Your algorithm is not O(n). Your calculated number of loops (nLoops) is log2(n). The number of inner loops (the values in iterMark) will be essentially n/2, n/4, n/8, etc. But the segment sizes really don't matter because every time through the outer loop you look at every item in the list.
No matter how you obfuscate it, you're doing log2(n) passes over n items: O(n log n).
Your code is a fairly standard merge sort, which is proven to be O(n log n). It's well proven that the general case for comparison sorts is O(n log n). Sure, some algorithms can sort some specific cases more quickly. Conversely, the same algorithms have pathological cases that will take O(n^2). Other comparison sorts (heap sort, merge sort, for example) are not highly subject to the order of items. But in the general case comparison sorts make on the order of n log n comparisons. See https://www.cs.cmu.edu/~avrim/451f11/lectures/lect0913.pdf for a detailed explanation.
But don't take my word for it. You can easily test yourself by doing some simple timings. Time how long it takes to sort, say, 100K items. If your algorithm is indeed O(n), then it should take approximately twice as long to sort 200K items, and ten times as long to sort 1 million items. But if it's O(n log n), as I suspect, then the timings will be somewhat longer.
Consider: log2 of 100K is 16.61. log2 of 200K is 17.61. So sorting 100K items (if the algorithm is O(n log n)) will take time proportional to 100K * 16.61. Sorting 200K items will take time proportional to 200K * 17.61. Doing the arithmetic:
100K * 16.61 = 1,661,000
200K * 17.61 = 3,522,000
So 200K items will take approximately 2.12 times (3,522,000/1,661,000) as long. Or, about 6% longer than if the algorithm were linear.
If you're still unsure, pump it up to a million items. If the algorithm is linear, then a million items will take 10x the time that 100K items took. If it's O(n log n), then it will take 12 times as long.
1M * 19.93 = 19,930,000
(19,930,000 / 1,661,000) = 11.9987 (call it 12)
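For anyone who wants to run that arithmetic rather than do it by hand, here is a hedged Python sketch of the same back-of-the-envelope check (the constants match the numbers above):

import math

def predicted_ratio(n_small, n_large):
    # Ratio of sort times expected under an n*log2(n) model.
    return (n_large * math.log2(n_large)) / (n_small * math.log2(n_small))

print(predicted_ratio(100_000, 200_000))    # ~2.12, versus 2.0 for a linear model
print(predicted_ratio(100_000, 1_000_000))  # ~12.0, versus 10.0 for a linear model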

My f2py skills are not strong, so I wrote a pure fortran wrapper for your code (posted below, if you want to check it), and the timings I got were:
n time (s) 0.1*n/1e6 0.1*n*log(n)/1e6*log(1e6)
1000000 0.109375000 0.100000001 0.100000001
2000000 0.203125000 0.200000003 0.210034326
4000000 0.453125000 0.400000006 0.440137327
8000000 0.937500000 0.800000012 0.920411944
16000000 1.92187500 1.60000002 1.92109859
32000000 4.01562500 3.20000005 4.00274658
64000000 8.26562500 6.40000010 8.32659149
128000000 17.0468750 12.8000002 17.2953815
256000000 35.1406250 25.6000004 35.8751564
This is... not looking good for your O(n) theory, I'm afraid.
My wrapper:
module m
contains
! Your code goes here
end module
program p
use m
implicit none
integer(8) :: i,n
real, allocatable :: real_array(:)
integer, allocatable :: int_array(:)
real :: start
real :: stop
real_array = [0]
int_array = [0]
write(*,*) "n time (s) 0.1*n/1e6 0.1*n*log(n)/1e6*log(1e6)"
do i=0,30
n = 2**i*1e6
deallocate(real_array, int_array)
allocate(real_array(n), int_array(n))
call random_number(real_array)
int_array = -huge(0)*real_array + 2.0*huge(0)
call cpu_time(start)
call isort(int_array, n)
call cpu_time(stop)
write(*,*) n, stop-start, 0.1*n/1.0e6, 0.1*n*log(1.0*n)/(1.0e6*log(1.0e6))
enddo
end program

The other answers have explained why you do not have a linear comparison sort.
I'll try to explain why execution times will never prove a time complexity.
Many times you can come up with specific cases and an algorithm, using various CPU-specific optimizations, that does its job (whether that job is sorting or something else) in a way that looks better than O(n) on a plot: if the time for x items is y, then according to the graph the time for 2x items is less than 2y. And this can happen for as large an x as you can fit into memory.
Still, this proves nothing about time complexity. This could be an algorithm with time complexity O(n), or O(n log n) or maybe even O(log n) or O(n*n).
Big-Oh notation hides the various constants that describe the number of operations performed by the algorithm, so such an algorithm could just be O(n log n) with a very small constant (as in, a constant < 1) or O(log n) with a huge constant.
Big-Oh also does not care about real-life aspects such as system memory or disk space or how fast some CPU executes one instruction as opposed to another. Maybe the operations you use just execute really fast on that CPU. Regardless, if you have an O(n log n) algorithm, for large enough n, you would eventually see the graph look like an n log n graph.
A real example of this could be the Disjoint set data structure, which uses something called an iterated logarithm and its complexity is O(m log* n). In practice, log* n will be something <= 5 for all practical values, so if you plot it for practical values, you might think it's O(m) with a big constant, but that's not the case.
You could change your algorithm to read each number from a different file and write it back to that file, at each step, and remove your input array completely. It wouldn't affect its time complexity, but it would definitely affect the nice execution time measurements you're seeing, because storage is obviously slower than memory. Well, they're all the same to Big-Oh.

I don't doubt that your sort is fast, and I can believe that it compares favorably with the sort command-line utility. But it is an O(N log(N)) iterative merge sort, not an O(N) sort (nor a novel algorithm).
Observe,
Your outer loop iterates O(log(N)) times.
On each of those iterations, the inner loop iterates O(N / 2^k) times for some k.
And the main work of each inner-loop iteration is to split O(2^k) items into two halves and merge those together, which involves examining and moving every single item. (And then moving them all again back to the original array.) That costs O(2^k) operations per inner-loop iteration.
Those factors all multiply together:
O(log(N)) * O(N / 2^k) * O(2^k)
The factors of 2^k cancel each other, and you're left with O(N log(N)). (The k's are functions of N, so they cannot simply be ignored as constants.)
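Written out as a sum over the log2(N) passes (in LaTeX notation, with chunk width 2^k on pass k, as above), the same cancellation is visible:

\sum_{k=1}^{\log_2 N} \underbrace{\frac{N}{2^k}}_{\text{merges per pass}} \cdot \underbrace{O(2^k)}_{\text{cost per merge}} \;=\; \sum_{k=1}^{\log_2 N} O(N) \;=\; O(N \log N)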
The logarithm grows very slowly, so if you don't look too closely then it is easy to be fooled into thinking you see linear growth when it's really N log(N). You need to look at wide ranges of values to see the superlinearity, and some is indeed visible in your data.
As for your plot, there's a problem with the result of your curve fitting: the y intercept is significantly negative (for the scale of the data, and especially for the concentration of points with small y). Your data may fit a linear model well (if not entirely sensically), but they sure appear to fit an N log(N) model better.

Related

Fastest way to generate a kmer count vector from a nucleotide sequence (Julia)

Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.
Here is my current implementation,
using Distributions
using SparseArrays
function kmer_profile(seq, k, mask)
basis = [4^i for i in (k - 1):-1:0]
d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
kmer_dict = Dict{Int, Int32}(4^k=>0)
for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
j = 1
for i in 1:length(mask)
if mask[i]
kmer_hash += d[seq[n+i-1]] * basis[j]
j += 1
end
end
haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
end
return sparsevec(kmer_dict)
end
seq = join(sample(['A','C','G','T'], 1000000))
mask_str = "111111011111001111111111111110"
mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)
@time kmer_profile(seq, k, mask)
This code runs in about 0.3 seconds on my M1 MacBook Pro, is there any way to make it run significantly faster?
The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.
To produce the kmer hash, the function treats each kmer as a base 4 number and then converts it to a (base 10) 64-bit integer, for indexing into the kmer vector.
The size of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes can fit into a 64-bit integer type.
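As a language-neutral illustration of that hashing scheme, here is a hedged Python sketch with made-up names (not the Julia code above; the Julia version also adds 1 to the hash and uses a precomputed basis array):

CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def masked_kmer_hash(kmer, mask):
    # Treat the masked kmer as a base-4 number; positions where mask is 0 are skipped.
    h = 0
    for base, keep in zip(kmer, mask):
        if keep:
            h = h * 4 + CODE[base]
    return h

# ACGT with mask 1001 keeps A and T -> masked kmer "AT" -> 0*4 + 3 = 3
assert masked_kmer_hash("ACGT", [1, 0, 0, 1]) == 3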
There are several possible optimizations to make this code faster.
First of all, one can convert the Dict to an array, since array-based indexing is faster than dictionary-based indexing, and this is possible here since the key is an ASCII character.
Moreover, the extraction of the sequence codes can be done once instead of length(mask) times by pre-computing the codes and putting the result in a temporary array.
Additionally, the mask-based conditional and the loop-carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor, causing it to stall for several cycles. The loop-carried dependency makes things even worse, since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors based on both mask and basis. The result is a faster branch-less loop.
Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed this up by sorting the key-value pairs in the first place. This is faster due to a more cache-friendly execution, and it can also help the prediction unit of the processor. This is a complex topic. For more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?.
Here is the final optimized code:
function kmer_profile_opt(seq, k, mask)
basis = [4^i for i in (k - 1):-1:0]
d = zeros(Int8, 128)
d[Int64('A')] = 0
d[Int64('C')] = 1
d[Int64('G')] = 2
d[Int64('T')] = 3
seq_codes = [d[Int8(e)] for e in seq]
j = 1
premult = zeros(Int64, length(mask))
for i in 1:length(mask)
if mask[i]
premult[i] = basis[j]
j += 1
end
end
kmer_dict = Dict{Int, Int32}(4^k=>0)
for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
j = 1
for i in 1:length(mask)
kmer_hash += seq_codes[n+i-1] * premult[i]
end
haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
end
sorted_kmer_pairs = sort(collect(kmer_dict))
sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end
This code is a bit more than twice as fast as the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift, which is faster, assuming premult[i] is modified so as to contain exponents. I expect the code to be about 4 times faster than the original code. The main bottleneck should be the big dictionary creation. Improving the performance of this further is very hard (though it is still possible).
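A language-neutral illustration of that last point (a hedged sketch, not the author's Julia code): when the exponent is stored instead of the factor, multiplying by a power of 4 becomes a left shift.

# Hypothetical illustration: x * 4**e is the same as shifting x left by 2*e bits.
x = 12345
for e in range(8):
    assert x * (4 ** e) == x << (2 * e)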
Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:
function kmer_profile_opt3a(seq, k, mask)
d = zeros(Int8, 128)
d[Int64('A')] = 0
d[Int64('C')] = 1
d[Int64('G')] = 2
d[Int64('T')] = 3
seq_codes = [d[Int8(e)] for e in seq]
basis = [4^i for i in (k-1):-1:0]
j = 1
premult = zeros(Int64, length(mask))
for i in 1:length(mask)
if mask[i]
premult[i] = basis[j]
j += 1
end
end
kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
@inbounds for n in 1:(length(seq) - length(mask) + 1)
kmer_hash = 1
for i in 1:length(mask)
kmer_hash += seq_codes[n+i-1] * premult[i]
end
kmer_vec[n] = kmer_hash
end
sort!(kmer_vec)
return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)
end
This achieved another 2x over Jérôme's answer on my machine.
The auto-combining feature of sparsevec makes the code a bit more compact.
Trying to slim the code further, and avoid unnecessary allocations in sparse vector creation, the following can be used:
using SparseArrays, LinearAlgebra
function specialsparsevec(nzs, n)
vals = Vector{Int}(undef, length(nzs))
j, k, count, last = (1, 1, 0, nzs[1])
while k <= length(nzs)
if nzs[k] == last
count += 1
else
vals[j], nzs[j] = (count, last)
count, last = (1, nzs[k])
j += 1
end
k += 1
end
vals[j], nzs[j] = (count, last)
resize!(nzs, j)
resize!(vals, j)
return SparseVector(n, nzs, vals)
end
function kmer_profile_opt3(seq, k, mask)
d = zeros(Int8, 128)
foreach(((i,c),) -> d[Int(c)]=i-1, enumerate(collect("ACGT")))
seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
premult = foldr(
(i,(p,j))->(mask[i] && (p[i]=j ; j<<=2) ; (p,j)),
1:length(mask); init=(zeros(Int64,length(mask)),1)) |> first
kmer_vec = sort(
[ dot(@view(seq_codes[n:n+length(mask)-1]),premult) + 1 for
n in 1:(length(seq)-length(mask)+1)
])
return specialsparsevec(kmer_vec, 4^k)
end
This last version gets another 10% speedup (but is a little cryptic):
julia> @btime kmer_profile_opt($seq, $k, $mask);
367.584 ms (81 allocations: 134.71 MiB) # other answer
julia> @btime kmer_profile_opt3a($seq, $k, $mask);
140.882 ms (22 allocations: 54.36 MiB) # 1st this answer
julia> @btime kmer_profile_opt3($seq, $k, $mask);
127.016 ms (14 allocations: 27.66 MiB) # 2nd this answer

Ruby's digits method performance

I'm solving some Project Euler problems using Ruby, and specifically here I'm talking about problem 25 (What is the index of the first term in the Fibonacci sequence to contain 1000 digits?).
At first, I was using Ruby 2.2.3 and I coded the problem as such:
number = 3
a = 1
b = 2
while b.to_s.length < 1000
a, b = b, a + b
number += 1
end
puts number
But then I found out that version 2.4.2 has a method called digits which is exactly what I needed. I transformed the code to:
while b.digits.length < 1000
And when I compared the two methods, digits was much slower.
Time
./025/problem025.rb 0.13s user 0.02s system 80% cpu 0.190 total
./025/problem025.rb 2.19s user 0.03s system 97% cpu 2.275 total
Does anyone have an idea why?
Ruby's digits
... is implemented in rb_int_digits.
Which for non-tiny numbers (i.e., most of your numbers) uses rb_int_digits_bigbase.
Which extracts digit after digit naively with division/modulo by base.
So it should take quadratic time (at least with a small base such as 10).
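To make that concrete, here is a hedged Python sketch (Python integers are also arbitrary precision): extracting digits one divmod at a time grows roughly quadratically with the digit count, so doubling the digits roughly quadruples the time.

import time

def digits_by_divmod(n, base=10):
    # Naive digit extraction, one divmod per digit, like rb_int_digits_bigbase.
    out = []
    while n > 0:
        n, d = divmod(n, base)
        out.append(d)
    return out

for ndigits in (10000, 20000, 40000):
    big = 10 ** ndigits - 1
    t0 = time.perf_counter()
    digits_by_divmod(big)
    print(ndigits, "digits:", round(time.perf_counter() - t0, 3), "s")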
Ruby's to_s
... is implemented in int_to_s.
Which uses rb_int2str.
Which for non-tiny numbers uses rb_big2str.
Which uses rb_big2str1.
Which might use big2str_gmp if available (which sounds/looks like it uses the fast GMP library) or ...
... uses big2str_generic.
Which uses big2str_karatsuba (sweet, I recognize that name!).
Which looks like it has something to do with ...
... Karatsuba's algorithm, which is a fast multiplication algorithm. If you multiply two n-digit numbers the naive way you learned in school, you take n^2 single-digit products. Karatsuba on the other hand only needs about n^1.585, which is quite a lot better. And I didn't read into this further, but I suspect what Ruby does here is also this efficient. Eric Lippert's answer with a base conversion algorithm uses Karatsuba multiplication and says "this [base conversion] algorithm is utterly dominated by the cost of the multiplication".
Comparing quadratic to n^1.585 over the number lengths from 1 digit to 1000 digits gives factor 15:
(1..1000).sum { |i| i**2 } / (1..1000).sum { |i| i**1.585 }
=> 15.150583254950678
Which is roughly the factor you observed as well. Of course that's a rather naive comparison, but, well, why not.
GMP by the way apparently uses/used a "near O(n * log(n)) FFT-based multiplication algorithm".
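For readers curious what the Karatsuba recursion looks like, here is a hedged, purely illustrative Python sketch (Ruby's actual bignum code is in C and far more elaborate); it splits each factor, does three recursive multiplications instead of four, and recombines:

def karatsuba(x, y):
    # Illustrative only: multiply non-negative integers with ~n**1.585 digit products.
    if x < 10 or y < 10:                      # base case: a single-digit factor
        return x * y
    m = max(len(str(x)), len(str(y))) // 2    # split position (in decimal digits)
    high_x, low_x = divmod(x, 10 ** m)
    high_y, low_y = divmod(y, 10 ** m)
    z0 = karatsuba(low_x, low_y)
    z2 = karatsuba(high_x, high_y)
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2
    return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0

assert karatsuba(123456789, 987654321) == 123456789 * 987654321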
Thanks to @Drenmi's answer for motivating me to dig into the source after all. I hope I did this right, no guarantees, I'm a Ruby beginner. But that's why I left all the links there for you to check for yourself :-P
Integer#digits doesn't just "split" the number. From the documentation:
Returns the array including the digits extracted by place-value
notation with radix base of int.
This extraction is done even if a base argument is omitted. The relevant source:
# ruby/numeric.c:4809
while (!FIXNUM_P(num) || FIX2LONG(num) > 0) {
VALUE qr = rb_int_divmod(num, base);
rb_ary_push(digits, RARRAY_AREF(qr, 1));
num = RARRAY_AREF(qr, 0);
}
As you can see, this process includes repeated modulo arithmetics, which likely accounts for the additional runtime.
Many Ruby methods create objects (strings, arrays, etc.).
In Ruby, object creation is "expensive".
For instance, to_s creates a string and digits creates an array every time the while condition is evaluated.
If you want to optimize your example, you can do the following:
# create the smallest possible 1000 digits number
max = 10**999
number = 3
a = 1
b = 2
# do not create objects in while condition
while b < max
a, b = b, a + b
number += 1
end
puts number
I have not answered your question, but wish to suggest an improved algorithm for the problem you have addressed. For a given number of decimal digits, n, I have implemented the following algorithm.
estimate the number f of Fibonacci numbers ("FNs") that have n or fewer decimal digits.
compute the fth and (f-1)st FNs, and the number of digits m in the fth FN.
if m >= n, back down from the (f-1)st FN until the (f-1)st FN has fewer than n decimal digits, at which time the fth FN is the smallest FN to have n decimal digits.
if m < n, advance from the fth FN until one has n decimal digits, at which time it is the smallest FN to have n decimal digits.
The key is to compute a close estimate f in the first step.
Code
AVG_FNs_PER_DIGIT = 4.784971966781667
def first_fibonacci_with_n_digits(n)
return [1, 1] if n == 1
idx = (n * AVG_FNs_PER_DIGIT).round
fn, prev_fn = fib(idx)
fn.to_s.size >= n ? fib_down(n, fn, prev_fn, idx) : fib_up(n, fn, prev_fn, idx)
end
def fib(idx)
a = 1
b = 2
(idx - 2).times {a, b = b, a + b }
[b, a]
end
def fib_up(n, b, a, idx)
loop do
a, b = b, a + b
idx += 1
break [idx, b] if b.to_s.size == n
end
end
def fib_down(n, b, a, idx)
loop do
a, b = b - a, a
break [idx, b] if a.to_s.size == n - 1
idx -= 1
end
end
Benchmarks
In computing each Fibonacci number two operations are typically performed:
compute the number of digits in the last-computed Fibonacci number and if that number is equal to the target number of digits, terminate (for reasons made clear in the Explanation section below, it cannot be larger than the target number); else
compute the next number in the Fibonacci sequence.
By contrast, the method I have proposed performs the first step a relatively small number of times.
How important is the first step relative to the second and how does the use of n.digits.size compare with that of n.to_s.size in the first step? Let's run some benchmarks to find out.
def use_to_s(ndigits)
case ndigits
when 1
[1, 1]
else
a = 1
b = 2
idx = 3
loop do
break [idx, b] if b.to_s.length == ndigits
a, b = b, a + b
idx += 1
end
end
end
def use_digits(ndigits)
case ndigits
when 1
[1, 1]
else
a = 1
b = 2
idx = 3
loop do
break [idx, b] if b.digits.size == ndigits
a, b = b, a + b
idx += 1
end
end
end
require 'fruity'
def test(ndigits)
nfibs, last_fib = use_to_s(ndigits)
puts "\nndigits = #{ndigits}, nfibs=#{nfibs}, last_fib=#{last_fib}"
compare do
try_use_to_s { use_to_s(ndigits) }
try_use_digits { use_digits(ndigits) }
try_estimate { first_fibonacci_with_n_digits(ndigits) }
end
end
test 20
ndigits = 20, nfibs=93, last_fib=12200160415121876738
Running each test 128 times. Test will take about 1 second.
try_estimate is faster than try_use_to_s by 2x ± 0.1
try_use_to_s is faster than try_use_digits by 80.0% ± 10.0%
test 100
ndigits = 100, nfibs=476, last_fib=13447...37757 (90 digits omitted)
Running each test 16 times. Test will take about 4 seconds.
try_estimate is faster than try_use_to_s by 5x ± 0.1
try_use_to_s is faster than try_use_digits by 10x ± 1.0
test 500
ndigits = 500, nfibs=2390, last_fib=13519...63145 (490 digits omitted)
Running each test 2 times. Test will take about 27 seconds.
try_estimate is faster than try_use_to_s by 9x ± 0.1
try_use_to_s is faster than try_use_digits by 60x ± 1.0
test 1000
ndigits = 1000, nfibs=4782, last_fib=10700...27816 (990 digits omitted)
Running each test once. Test will take about 1 minute.
try_estimate is faster than try_use_to_s by 12x ± 10.0
try_use_to_s is faster than try_use_digits by 120x ± 100.0
There are two main take-aways from these results:
"try_estimate" is the fastest because it performs the first step relatively few times; and
the use of to_s is much faster than that of digits.
Further to the first of these observations note that the initial estimates of the index of the first FN having a given number of digits, compared to the actual index, are as follows:
for 20 digits: 96 est. vs 93 actual
for 100 digits: 479 est. vs 476 actual
for 500 digits: 2392 est. vs 2390 actual
for 1000 digits: 4785 est. vs 4782 actual
The deviation was at most 3, meaning numbers of digits had to be calculated for at most 3 FNs to obtain the desired result.
Explanation
The only explanation of the methods given in the section Code above is the derivation of the constant AVG_FNs_PER_DIGIT, which is used to calculate an estimate of the index of the first FN having the specified number of digits.
The derivation of this constant derives from the question and selected answer given here. (The Wiki for Fibonacci numbers provides a good overview of the mathematical properties of FNs.)
It is known that the first 7 FNs (including zero) have one digit; thereafter the FNs gain an additional digit every 4 or 5 FNs (i.e., sometimes 4, else 5). Therefore, as a very crude calculation, we see that to calculate the first FN with n digits, n >= 2, it will not be less than the 4*nth FN. For n = 1000, that would be 4,000. (In fact, the 4,782nd is the smallest to have 1,000 digits.) In other words, we don't need to calculate the number of digits in the first 4,000 FNs. We can improve on this estimate, however.
As n approaches infinity, the ratio of ranges 10**n...10**(n+1) (n-digit intervals) that contain 5 FNs to those that contain 4 FNs can be computed as follows.
LOG_10 = Math.log(10)
#=> 2.302585092994046
GR = (1 + Math.sqrt(5))/2
#=> 1.618033988749895
LOG_GR = Math.log(GR)
#=> 0.48121182505960347
RATIO_5to4 = (LOG_10 - 4*LOG_GR)/(5*LOG_GR - LOG_10)
#=> 3.6505564183095474
where GR is the Golden Ratio.
Over a large number of n-digit intervals let n4 be the number of those intervals containing 4 FNs and n5 be the number containing 5 FNs. The average number of FNs per interval is therefore (n4*4 + n5*5)/(n4 + n5). Since n5/n4 converges to RATIO_5to4, n5 approaches RATIO_5to4 * n4 in the limit (discarding roundoff error). If we substitute out n5, and let
b = 1/(1 + RATIO_5to4)
#=> 0.21502803321833364
we find the average number of FNs per n-digit interval converges to
avg = b * 4 + (1-b) *5
#=> 4.784971966781667
If fn is the first FN to have n decimal digits, the number of FNs in the sequence up to and including fn can therefore be approximated to be
n * avg
For example, the estimate of the index of the first FN to have 1000 decimal digits is (1000 * 4.784971966781667).round #=> 4785.

Efficiency in Haskell when counting primes

I have the following set of functions to count the number of primes less than or equal to a number n in Haskell.
The algorithm takes a number, checks if it is divisible by two and then checks if it divisible by odd numbers up to the square root of the number being checked.
-- is a number, n, prime?
isPrime :: Int -> Bool
isPrime n = n > 1 &&
            foldr (\d r -> d * d > n || (n `rem` d /= 0 && r))
                  True divisors
-- list of divisors for which to test primality
divisors :: [Int]
divisors = 2:[3,5..]
-- pi(n) - the prime counting function, the number of prime numbers <= n
primesNo :: Int -> Int
primesNo 2 = 1
primesNo n
  | isPrime n = 1 + primesNo (n-1)
  | otherwise = 0 + primesNo (n-1)
main = print $ primesNo (2^22)
Using GHC with the -O2 optimisation flag, counting the number of primes for n = 2^22 takes ~3.8 sec on my system. The following C code takes ~0.8 sec:
#include <stdio.h>
#include <math.h>
/*
compile with: gcc -std=c11 -lm -O2 c_primes.c -o c_orig
*/
int isPrime(int n) {
if (n < 2)
return 0;
else if (n == 2)
return 1;
else if (n % 2 == 0)
return 0;
int uL = sqrt(n);
int i = 3;
while (i <= uL) {
if (n % i == 0)
return 0;
i+=2;
}
return 1;
}
int main() {
int noPrimes = 0, limit = 4194304;
for (int n = 0; n <= limit; n++) {
if (isPrime(n))
noPrimes++;
}
printf("Number of primes in the interval [0,%d]: %d\n", limit, noPrimes);
return 0;
}
This algorithm takes about 0.9 sec in Java and 1.8 sec in JavaScript (on Node), so it just feels that the Haskell version is slower than I would expect it to be. Is there any way I can code this more efficiently in Haskell without changing the algorithm?
EDIT
The following version of isPrime offered by @dfeuer shaves one second off the running time, taking it down to 2.8 sec (down from 3.8). This is still slower than JavaScript (Node), which takes approx. 1.8 sec, as shown here: Yet Another Language Speed Test.
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)
EDIT
In the above isPrime function, go computes factor * factor for each divisor for a single n. I would imagine that it would be more efficient to compare factor to the square root of n, as the square root would only have to be calculated once per n. However, with the following code, computation time increases by approximately 10%. Is the square root of n being re-calculated every time the inequality is evaluated (for each factor)?
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor > upperLim = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)
      where
        upperLim = (floor.sqrt.fromIntegral) n
I urge you to use a different algorithm, such as the Sieve of Eratosthenes discussed in the paper by Melissa O'Neill, or the version used in Math.NumberTheory.Primes from the arithmoi package, which also offers an optimized prime counting function. However, this might get you better constant factors:
-- is a number, n, prime?
isPrime :: Int -> Bool
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && -- Put the 2 here instead
      foldr (\d r -> d * d > n || (n `rem` d /= 0 && r))
            True divisors
-- list of divisors for which to test primality
divisors :: [Int]
{-# INLINE divisors #-} -- No guarantee, but it might possibly inline and stay inlined,
                        -- so the numbers will be generated on each call instead of
                        -- being pulled in (expensively) from RAM.
divisors = [3,5..] -- No more 2:
The reason to get rid of the 2: is that an optimization called "foldr/build fusion", "short cut deforestation", or just "list fusion" can, potentially, make your divisors list go away, but, at least with GHC < 7.10.1, that 2: will block the optimization.
Edit: it seems that's not working for you, so here's something else to try:
isPrime n
  | n <= 2 = n == 2
  | otherwise = odd n && go 3
  where
    go factor
      | factor * factor > n = True
      | otherwise = n `rem` factor /= 0 && go (factor+2)
In general I've found that looping in Haskell is about 3-4 times slower than what can be accomplished with C.
To help understand the performance difference I slightly modified the
programs so that a fixed number of divisor tests are made per iteration
and added a parameter e to control how many iterations are made -
the number of (outer) iterations performed is 2^e. For each outer iteration
approx. 2^21 divisor tests are made.
The source code for each program and scripts to run and analyze the results may be found here: https://github.com/erantapaa/loopbench
Pull-requests to improve the benchmarking are welcome.
Here are the results I get on a 2.4 GHz Intel Core 2 Duo using ghc 7.8.3 (under OSX). The gcc used was "Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)".
e ctime htime allocated gc-bytes alloc/iter h/c dns
10 0.0101 0.0200 87424 3408 1.980 4.61
11 0.0151 0.0345 112000 3408 2.285 4.51
12 0.0263 0.0700 161152 3408 2.661 5.09
13 0.0472 0.1345 259456 3408 2.850 5.08
14 0.0819 0.2709 456200 3408 3.308 5.50
15 0.1575 0.5382 849416 9616 3.417 5.54
16 0.3112 1.0900 1635848 15960 3.503 5.66
17 0.6105 2.1682 3208848 15984 3.552 5.66
18 1.2167 4.3536 6354576 16032 24.24 3.578 5.70
19 2.4092 8.7336 12646032 16128 24.12 3.625 5.75
20 4.8332 17.4109 25229080 16320 24.06 3.602 5.72
e = exponent parameter
ctime = running time of the C program
htime = running time of the Haskell program
allocated = bytes allocated in the heap (Haskell program)
gc-bytes = bytes copied during GC (Haskell program)
alloc/iter = bytes allocated in the heap / 2^e
h / c = htime divided by ctime
dns = (htime - ctime) divided by the number of divisor tests made
in nanoseconds
# divisor tests made = 2^e * 2^11
Some observations:
The Haskell program performs heap allocation at a rate of about 24 bytes per (outer) loop iteration. The C program clearly does not perform any alloction and runs completely in L1 cache.
The gc-bytes count remains constant for e between 10 and 14 because no garbage collections were performed for those runs.
The time ratio h/c gets progressively worse as more allocations are made.
dns is a measure of the extra time the Haskell program takes per divisor test; it increases with the total amount of allocation made. Also, there are some plateaus, which suggests this is due to cache effects.
It is well known that GHC does not produce the same tight loop code that
a C compiler produces. The penalty you pay is approx. 4.6 ns per iteration.
Moreover, it looks like Haskell is also affected by cache effects due to
heap allocation.
24 bytes per allocation and 5 ns per loop iteration is not a lot for
most programs, but when you have 2^20 allocations and 2^40 loop iterations
it becomes a factor.
The C code uses 32-bit integers, while the Haskell code uses 64-bit integers.
The original C code runs in 0.63 secs on my computer. However, if I replace the int-s with long-s, it runs in 2.07 seconds with gcc and 2.17 secs with clang.
In comparison, the updated isPrime function (see it in the thread question) runs in 2.09 seconds (with -O2 and -fllvm). Note that this is slightly better than the clang-compiled C code, even though they use the same LLVM code generator.
The original Haskell code runs in 3.2 secs, which I think is an acceptable overhead for the convenience of using lists for iteration.
Inline everything, lose the superfluous tests, add strictness annotations just to be sure:
{-# LANGUAGE BangPatterns #-}
-- pi(n) - the prime counting function, the number of prime numbers <= n
primesNo :: Int -> Int
primesNo n
  | n < 2 = 0
  | otherwise = g 3 1
  where
    g k !cnt | k > n = cnt
             | go 3 = g (k+2) (cnt+1)
             | otherwise = g (k+2) cnt
      where
        go f
          | f*f > k = True
          | otherwise = k `rem` f /= 0 && go (f+2)
main = print $ primesNo (2^22)
The go testing function is as in dfeuer's answer. Compile with -O2 as usual, and always test by running a standalone executable (with something like > test +RTS -s).
Calls to g can be made direct (that's really micro-optimizing it):
primesNo n
  | n < 2     = 0
  | otherwise = g 3 1
  where
    g k !cnt | k > n     = cnt
             | otherwise = go 3
      where go f
              | f*f > k        = g (k+2) (cnt+1)
              | k `rem` f == 0 = g (k+2) cnt
              | otherwise      = go (f+2)
A more substantial change (still keeping the algorithm arguably the same), which might or might not speed it up, is to turn it inside out, to spare the squares computations: test by [3] all odds from 9 to 23, by [3,5] all odds from 25 to 47, etc., along the lines of this segmented code:
import Data.List (inits)

primesNo n = length (takeWhile (<= n) $ 2 : oddprimes)
  where
    oddprimes = sieve 3 9 [3,5..] (inits [3,5..])
    sieve x q ~(_:t) (fs:ft) =
      filter ((`all` fs) . ((/=0).) . rem) [x,x+2..q-2]
      ++ sieve (q+2) (head t^2) t ft
Sometimes tweaking your code to use 'and' instead of 'all' changes the speed too. Further speedup might be attempted by inlining and simplifying everything (replacing length with counting, etc.).

Project Euler number 35 efficiency

https://projecteuler.net/problem=35
All problems on Project Euler are supposed to be solvable by a program in under 1 minute. My solution, however, has a runtime of almost 3 minutes. Other solutions I've seen online are similar to mine conceptually, but have runtimes that are orders of magnitude faster. Can anyone help make my code more efficient/run faster?
Thanks!
#genPrimes takes an argument n and returns a list of all prime numbers less than n
def genPrimes(n):
    primeList = [2]
    number = 3
    while(number < n):
        isPrime = True
        for element in primeList:
            if element > number**0.5:
                break
            if number%element == 0 and element <= number**0.5:
                isPrime = False
                break
        if isPrime == True:
            primeList.append(number)
        number += 2
    return primeList

#isCircular takes a number as input and returns True if all rotations of that number are prime
def isCircular(prime):
    original = prime
    isCircular = True
    prime = int(str(prime)[-1] + str(prime)[:len(str(prime)) - 1])
    while(prime != original):
        if prime not in primeList:
            isCircular = False
            break
        prime = int(str(prime)[-1] + str(prime)[:len(str(prime)) - 1])
    return isCircular

primeList = genPrimes(1000000)
circCount = 0
for prime in primeList:
    if isCircular(prime):
        circCount += 1
print circCount
Two modifications of your code yield a pretty fast solution (roughly 2 seconds on my machine):
Generating primes is a common problem with many solutions on the web. I replaced yours with rwh_primes1 from this article:
def genPrimes(n):
    sieve = [True] * (n/2)
    for i in xrange(3,int(n**0.5)+1,2):
        if sieve[i/2]:
            sieve[i*i/2::i] = [False] * ((n-i*i-1)/(2*i)+1)
    return [2] + [2*i+1 for i in xrange(1,n/2) if sieve[i]]
It is about 65 times faster (0.04 seconds).
The most important step I'd suggest, however, is to filter the list of generated primes. Since each circularly shifted version of an integer has to be prime, the circular prime must not contain certain digits. The prime 23, e.g., can be easily spotted as an invalid candidate, because it contains a 2, which indicates divisibility by two when this is the last digit. Thus you might remove all such bad candidates by the following simple method:
def filterPrimes(primeList):
    for i in primeList[3:]:
        if '0' in str(i) or '2' in str(i) or '4' in str(i) \
           or '5' in str(i) or '6' in str(i) or '8' in str(i):
            primeList.remove(i)
    return primeList
Note that the loop starts at the fourth prime number to avoid removing the number 2 or 5.
The filtering step takes most of the computing time (about 1.9 seconds), but reduces the number of circular prime candidates dramatically from 78498 to 1113 (= 98.5 % reduction)!
The last step, the circulation of each remaining candidate, can be done as you suggested. If you wish, you can simplify the code as follows:
circCount = sum(map(isCircular, primeList))
Due to the reduced candidate set this step is completed in only 0.03 seconds.
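Putting the three pieces together is then just a matter of chaining the functions above (a minimal end-to-end sketch, Python 2 to match the code above; note that isCircular looks rotations up in the module-level primeList, so rebinding primeList to the filtered list keeps those lookups consistent):
primeList = filterPrimes(genPrimes(1000000))   # generate, then filter the candidates
circCount = sum(map(isCircular, primeList))    # rotate-and-check each survivor
print circCount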

Find the smallest regular number that is not less than N

Regular numbers are numbers that evenly divide powers of 60. As an example, 60^2 = 3600 = 48 × 75, so both 48 and 75 are divisors of a power of 60. Thus, they are also regular numbers.
This is an extension of rounding up to the next power of two.
I have an integer value N which may contain large prime factors and I want to round it up to a number composed of only small prime factors (2, 3 and 5)
Examples:
f(18) == 18 == 2^1 * 3^2
f(19) == 20 == 2^2 * 5^1
f(257) == 270 == 2^1 * 3^3 * 5^1
What would be an efficient way to find the smallest number satisfying this requirement?
The values involved may be large, so I would like to avoid enumerating all regular numbers starting from 1 or maintaining an array of all possible values.
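"Composed of only small prime factors" means that dividing out all 2s, 3s and 5s leaves 1. A minimal sketch of that membership test (the helper name is_regular is illustrative, not part of the question):
def is_regular(n):
    # n is regular iff nothing remains after dividing out the factors 2, 3 and 5
    if n < 1:
        return False
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1
(The brute-force answers further down essentially apply a test like this to successive integers starting at N.)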
One can produce an arbitrarily thin slice of the Hamming sequence around the n-th member in time ~ n^(2/3) by direct enumeration of triples (i,j,k) such that N = 2^i * 3^j * 5^k.
The algorithm works from log2(N) = i + j*log2(3) + k*log2(5): it enumerates all possible ks and, for each, all possible js; finds the top i and thus the triple (k,j,i); keeps it in a "band" if it lies inside the given "width" below the given high logarithmic top value (when width < 1 there can be at most one such i); and then sorts them by their logarithms.
WP says that n ~ (log N)^3, i.e. run time ~ (log N)^2. Here we don't care for the exact position of the found triple in the sequence, so all the count calculations from the original code can be thrown away:
import Data.List (sortBy)
import Data.Function (on)

slice hi w = sortBy (compare `on` fst) b  where   -- hi > log2(N) is a top value
  lb5 = logBase 2 5 ; lb3 = logBase 2 3           -- w < 1 (NB!) is log2(width)
  b   = concat                                    -- the slice
        [ [ (r,(i,j,k)) | frac < w ]              -- store it, if inside width
          | k <- [ 0 .. floor ( hi   /lb5) ], let p = fromIntegral k*lb5,
            j <- [ 0 .. floor ((hi-p)/lb3) ], let q = fromIntegral j*lb3 + p,
            let (i,frac) = properFraction (hi-q) ; r = hi - frac ]   -- r = i + q
        -- properFraction 12.7 == (12, 0.7)
-- update: in pseudocode:
def slice(hi, w):
    lb5, lb3 = logBase(2, 5), logBase(2, 3)   -- logs base 2 of 5 and 3
    for k from 0 step 1 to floor(hi/lb5) inclusive:
        p = k*lb5
        for j from 0 step 1 to floor((hi-p)/lb3) inclusive:
            q = j*lb3 + p
            i = floor(hi-q)
            frac = hi-q-i        -- frac < 1 , always
            r = hi - frac        -- r == i + q
            if frac < w:
                place (r,(i,j,k)) into the output array
    sort the output array's entries by their "r" component
      in ascending order, and return thus sorted array
Having enumerated the triples in the slice, it is a simple matter of sorting and searching, taking practically O(1) time (for an arbitrarily thin slice) to find the first triple above N. Well, actually, for constant (logarithmic) width, the number of entries in the slice (members of the "upper crust" in the (i,j,k)-space below the log(N) plane) is again m ~ n^(2/3) ~ (log N)^2, and sorting takes m log m time (so that searching, even linear, then takes ~ m run time). But the width can be made smaller for bigger Ns, following some empirical observations; and the constant factors for the enumeration of triples are much higher than for the subsequent sorting anyway.
Even with constant (logarithmic) width it runs very fast, calculating the 1,000,000-th value in the Hamming sequence instantly and the billionth in 0.05s.
The original idea of "top band of triples" is due to Louis Klauder, as cited in my post on a DDJ blogs discussion back in 2008.
update: as noted by GordonBGood in the comments, there's no need for the whole band but rather just about one or two values above and below the target. The algorithm is easily amended to that effect. The input should also be tested for being a Hamming number itself before proceeding with the algorithm, to avoid round-off issues with double precision. There are no round-off issues comparing the logarithms of the Hamming numbers known in advance to be different (though going up to a trillionth entry in the sequence uses about 14 significant digits in logarithm values, leaving only 1-2 digits to spare, so the situation may in fact be turning iffy there; but for 1-billionth we only need 11 significant digits).
update2: turns out the Double precision for logarithms limits this to numbers below about 20,000 to 40,000 decimal digits (i.e. 10 trillionth to 100 trillionth Hamming number). If there's a real need for this for such big numbers, the algorithm can be switched back to working with the Integer values themselves instead of their logarithms, which will be slower.
Okay, hopefully third time's a charm here. A recursive, branching algorithm for an initial input of p, where N is the number being 'built' within each thread. NB 3a-c here are launched as separate threads or otherwise done (quasi-)asynchronously.
1. Calculate the next-largest power of 2 after p, call this R. N = p.
2. Is N > R? Quit this thread. Is p composed of only small prime factors? You're done. Otherwise, go to step 3.
3. After any of 3a-c, go to step 4.
   a) Round p up to the nearest multiple of 2. This number can be expressed as m * 2.
   b) Round p up to the nearest multiple of 3. This number can be expressed as m * 3.
   c) Round p up to the nearest multiple of 5. This number can be expressed as m * 5.
4. Go to step 2, with p = m.
I've omitted the bookkeeping to do regarding keeping track of N but that's fairly straightforward I take it.
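One way to read steps 1-4 as code is a memoised recursion on the cofactor m, keeping the smallest regular number built along each branch. This is a hedged sketch of the branching idea only, not the answerer's implementation; the thread bookkeeping and the power-of-2 cutoff are folded into the cache and the min:
from functools import lru_cache

@lru_cache(maxsize=None)
def round_up_regular(p):
    # smallest 2^i * 3^j * 5^k that is >= p: round p up to a multiple of
    # 2, 3 or 5, recurse on the cofactor m, and keep the best branch
    if p <= 1:
        return 1
    return min(2 * round_up_regular((p + 1) // 2),
               3 * round_up_regular((p + 2) // 3),
               5 * round_up_regular((p + 4) // 5))

# round_up_regular(19) == 20, round_up_regular(257) == 270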
Edit: Forgot 6, thanks ypercube.
Edit 2: Had this up to 30, (5, 6, 10, 15, 30) realized that was unnecessary, took that out.
Edit 3: (The last one I promise!) Added the power-of-30 check, which helps prevent this algorithm from eating up all your RAM.
Edit 4: Changed power-of-30 to power-of-2, per finnw's observation.
Here's a solution in Python, based on Will Ness's answer but taking some shortcuts and using pure integer math to avoid running into log space numerical accuracy errors:
import math

def next_regular(target):
    """
    Find the next regular number greater than or equal to target.
    """
    # Check if it's already a power of 2 (or a non-integer)
    try:
        if not (target & (target-1)):
            return target
    except TypeError:
        # Convert floats/decimals for further processing
        target = int(math.ceil(target))

    if target <= 6:
        return target

    match = float('inf')  # Anything found will be smaller
    p5 = 1
    while p5 < target:
        p35 = p5
        while p35 < target:
            # Ceiling integer division, avoiding conversion to float
            # (quotient = ceil(target / p35))
            # From https://stackoverflow.com/a/17511341/125507
            quotient = -(-target // p35)

            # Quickly find next power of 2 >= quotient
            # See https://stackoverflow.com/a/19164783/125507
            try:
                p2 = 2**((quotient - 1).bit_length())
            except AttributeError:
                # Fallback for Python <2.7
                p2 = 2**(len(bin(quotient - 1)) - 2)

            N = p2 * p35
            if N == target:
                return N
            elif N < match:
                match = N
            p35 *= 3
            if p35 == target:
                return p35
        if p35 < match:
            match = p35
        p5 *= 5
        if p5 == target:
            return p5
    if p5 < match:
        match = p5
    return match
In English: iterate through every combination of 5s and 3s, quickly finding the next power of 2 >= target for each pair and keeping the smallest result. (It's a waste of time to iterate through every possible multiple of 2 if only one of them can be correct). It also returns early if it ever finds that the target is already a regular number, though this is not strictly necessary.
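On the examples from the question this returns, as expected:
>>> next_regular(18), next_regular(19), next_regular(257)
(18, 20, 270)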
I've tested it pretty thoroughly, testing every integer from 0 to 51200000 and comparing to the list on OEIS http://oeis.org/A051037, as well as many large numbers that are ±1 from regular numbers, etc. It's now available in SciPy as fftpack.helper.next_fast_len, to find optimal sizes for FFTs (source code).
I'm not sure if the log method is faster because I couldn't get it to work reliably enough to test it. I think it has a similar number of operations, though? I'm not sure, but this is reasonably fast. Takes <3 seconds (or 0.7 seconds with gmpy) to calculate that 2^142 × 3^80 × 5^444 is the next regular number above 2^2 × 3^454 × 5^249 + 1 (the 100,000,000th regular number, which has 392 digits)
You want to find the smallest number m such that m >= N and m = 2^i * 3^j * 5^k where all i,j,k >= 0.
Taking logarithms the equations can be rewritten as:
log m >= log N
log m = i*log2 + j*log3 + k*log5
You can calculate log2, log3, log5 and logN to sufficiently high accuracy (depending on the size of N). Then this problem looks like an Integer Linear Programming problem and you could try to solve it using one of the known algorithms for this NP-hard problem.
EDITED/CORRECTED: Corrected the codes to pass the scipy tests:
Here's an answer based on endolith's answer, but it almost eliminates long multi-precision integer calculations by using float64 logarithm representations for the base comparison that finds triple values passing the criteria. It only resorts to full-precision comparisons when there is a chance that the logarithm value may not be accurate enough, which only occurs when the target is very close to either the previous or the next regular number:
import math

def next_regular(target):
    """
    Find the next regular number greater than or equal to target.
    """
    if target < 2: return ( 0, 0, 0 )
    log2hi = 0
    mant = 0
    # Check if it's already a power of 2 (or a non-integer)
    try:
        mant = target & (target - 1)
        target = int(target)  # take care of case where not int/float/decimal
    except TypeError:
        # Convert floats/decimals for further processing
        target = int(math.ceil(target))
        mant = target & (target - 1)
    # Quickly find next power of 2 >= target
    # See https://stackoverflow.com/a/19164783/125507
    try:
        log2hi = target.bit_length()
    except AttributeError:
        # Fallback for Python <2.7
        log2hi = len(bin(target)) - 2
    # exit if this is a power of two already...
    if not mant: return ( log2hi - 1, 0, 0 )
    # take care of trivial cases...
    if target < 9:
        if target < 4: return ( 0, 1, 0 )
        elif target < 6: return ( 0, 0, 1 )
        elif target < 7: return ( 1, 1, 0 )
        else: return ( 3, 0, 0 )
    # find log of target, which may exceed the float64 limit...
    if log2hi < 53: mant = target << (53 - log2hi)
    else: mant = target >> (log2hi - 53)
    log2target = log2hi + math.log2(float(mant) / (1 << 53))
    # log2 constants
    log2of2 = 1.0; log2of3 = math.log2(3); log2of5 = math.log2(5)
    # calculate range of log2 values close to target;
    # desired number has a logarithm of log2target <= x <= top...
    fctr = 6 * log2of3 * log2of5
    top = (log2target**3 + 2 * fctr)**(1/3)  # for up to 2 numbers higher
    btm = 2 * log2target - top               # or up to 2 numbers lower
    match = log2hi  # Anything found will be smaller
    result = ( log2hi, 0, 0 )  # placeholder for eventual matches
    count = 0  # only used for debugging counting band
    fives = 0; fiveslmt = int(math.ceil(top / log2of5))
    while fives < fiveslmt:
        log2p = top - fives * log2of5
        threes = 0; threeslmt = int(math.ceil(log2p / log2of3))
        while threes < threeslmt:
            log2q = log2p - threes * log2of3
            twos = int(math.floor(log2q)); log2this = top - log2q + twos
            if log2this >= btm: count += 1  # only used for counting band
            if log2this >= btm and log2this < match:
                # logarithm precision may not be enough to differentiate between
                # the next lower regular number and the target, so do
                # a full resolution comparison to eliminate this case...
                if (2**twos * 3**threes * 5**fives) >= target:
                    match = log2this; result = ( twos, threes, fives )
            threes += 1
        fives += 1
    return result

print(next_regular(2**2 * 3**454 * 5**249 + 1))  # prints (142, 80, 444)
Since most long multi-precision calculations have been eliminated, gmpy isn't needed, and on IDEOne the above code takes 0.11 seconds instead of 0.48 seconds for endolith's solution to find the next regular number greater than the 100 millionth one as shown; it takes 0.49 seconds instead of 5.48 seconds to find the next regular number past the billionth (the next one is (761,572,489) past (1334,335,404) + 1), and the difference will get even larger as the range goes up, since the multi-precision calculations get increasingly longer for the endolith version compared to almost none here. Thus, this version could calculate the next regular number from the trillionth in the sequence in about 50 seconds on IDEOne, where it would likely take over an hour with the endolith version.
The English description of the algorithm is almost the same as for the endolith version, differing as follows:
1) calculates the float log estimation of the argument target value (we can't use the built-in log function directly as the range may be much too large for representation as a 64-bit float),
2) compares the log representation values in determining qualifying values inside an estimated range above and below the target value of only about two or three numbers (depending on round-off),
3) compares multi-precision values only if within the above defined narrow band,
4) outputs the triple indices rather than the full long multi-precision integer (would be about 840 decimal digits for the one past the billionth, ten times that for the trillionth), which can then easily be converted to the long multi-precision value if required.
This algorithm uses almost no memory other than for the potentially very large multi-precision integer target value, the intermediate evaluation comparison values of about the same size, and the output expansion of the triples if required. This algorithm is an improvement over the endolith version in that it successfully uses the logarithm values for most comparisons in spite of their lack of precision, and that it narrows the band of compared numbers to just a few.
This algorithm will work for argument ranges somewhat above ten trillion (a few minutes' calculation time at IDEOne rates), beyond which it will no longer be correct due to lack of precision in the log representation values, as per @WillNess's discussion; in order to fix this, we can change the log representation to a "roll-your-own" logarithm representation consisting of a fixed-length integer (124 bits for about double the exponent range, good for targets of over a hundred thousand digits if one is willing to wait); this will be a little slower due to the smallish multi-precision integer operations being slower than float64 operations, but not that much slower since the size is limited (maybe a factor of three or so).
Now, none of these Python implementations (without using C or Cython or PyPy or something) is particularly fast, as they are about a hundred times slower than the same algorithm implemented in a compiled language. For reference's sake, here is a Haskell version:
{-# OPTIONS_GHC -O3 #-}

import Data.Word
import Data.Bits

nextRegular :: Integer -> ( Word32, Word32, Word32 )
nextRegular target
  | target < 2 = ( 0, 0, 0 )
  | target .&. (target - 1) == 0 = ( fromIntegral lg2hi - 1, 0, 0 )
  | target < 9 = case target of
                   3 -> ( 0, 1, 0 )
                   5 -> ( 0, 0, 1 )
                   6 -> ( 1, 1, 0 )
                   _ -> ( 3, 0, 0 )
  | otherwise = match
  where
    lg3 = logBase 2 3 :: Double; lg5 = logBase 2 5 :: Double
    lg2hi = let cntplcs v cnt =
                  let nv = v `shiftR` 31 in
                  if nv <= 0 then
                    let cntbts x c =
                          if x <= 0 then c else
                          case c + 1 of
                            nc -> nc `seq` cntbts (x `shiftR` 1) nc in
                    cntbts (fromIntegral v :: Word32) cnt
                  else case cnt + 31 of ncnt -> ncnt `seq` cntplcs nv ncnt
            in cntplcs target 0
    lg2tgt = let mant = if lg2hi <= 53 then target `shiftL` (53 - lg2hi)
                        else target `shiftR` (lg2hi - 53)
             in fromIntegral lg2hi +
                  logBase 2 (fromIntegral mant / 2^53 :: Double)
    lg2top = (lg2tgt^3 + 2 * 6 * lg3 * lg5)**(1/3) -- for 2 numbers or so higher
    lg2btm = 2 * lg2tgt - lg2top                   -- or two numbers or so lower
    match =
      let klmt = floor (lg2top / lg5)
          loopk k mtchlgk mtchtplk =
            if k > klmt then mtchtplk else
            let p = lg2top - fromIntegral k * lg5
                jlmt = fromIntegral $ floor (p / lg3)
                loopj j mtchlgj mtchtplj =
                  if j > jlmt then loopk (k + 1) mtchlgj mtchtplj else
                  let q = p - fromIntegral j * lg3
                      ( i, frac ) = properFraction q; r = lg2top - frac
                      ( nmtchlg, nmtchtpl ) =
                        if r < lg2btm || r >= mtchlgj then
                          ( mtchlgj, mtchtplj ) else
                        if 2^i * 3^j * 5^k >= target then
                          ( r, ( i, j, k ) ) else ( mtchlgj, mtchtplj )
                  in nmtchlg `seq` nmtchtpl `seq` loopj (j + 1) nmtchlg nmtchtpl
            in loopj 0 mtchlgk mtchtplk
      in loopk 0 (fromIntegral lg2hi) ( fromIntegral lg2hi, 0, 0 )

trival :: ( Word32, Word32, Word32 ) -> Integer
trival (i,j,k) = 2^i * 3^j * 5^k

main = putStrLn $ show $ nextRegular $ (trival (1334,335,404)) + 1 -- (1126,16930,40)
This code calculates the next regular number following the billionth in too small a time to be measured and following the trillionth in 0.69 seconds on IDEOne (and potentially could run even faster except that IDEOne doesn't support LLVM). Even Julia will run at something like this Haskell speed after the "warm-up" for JIT compilation.
EDIT_ADD: The Julia code is as per the following:
function nextregular(target :: BigInt) :: Tuple{ UInt32, UInt32, UInt32 }
    # trivial case of first value or anything less...
    target < 2 && return ( 0, 0, 0 )
    # Check if it's already a power of 2 (or a non-integer)
    mant = target & (target - 1)
    # Quickly find next power of 2 >= target
    log2hi :: UInt32 = 0
    test = target
    while true
        next = test & 0x7FFFFFFF
        test >>>= 31; log2hi += 31
        test <= 0 && (log2hi -= leading_zeros(UInt32(next)) - 1; break)
    end
    # exit if this is a power of two already...
    mant == 0 && return ( log2hi - 1, 0, 0 )
    # take care of trivial cases...
    if target < 9
        target < 4 && return ( 0, 1, 0 )
        target < 6 && return ( 0, 0, 1 )
        target < 7 && return ( 1, 1, 0 )
        return ( 3, 0, 0 )
    end
    # find log of target, which may exceed the Float64 limit...
    if log2hi < 53 mant = target << (53 - log2hi)
    else mant = target >>> (log2hi - 53) end
    log2target = log2hi + log(2, Float64(mant) / (1 << 53))
    # log2 constants
    log2of2 = 1.0; log2of3 = log(2, 3); log2of5 = log(2, 5)
    # calculate range of log2 values close to target;
    # desired number has a logarithm of log2target <= x <= top...
    fctr = 6 * log2of3 * log2of5
    top = (log2target^3 + 2 * fctr)^(1/3) # for 2 numbers or so higher
    btm = 2 * log2target - top # or 2 numbers or so lower
    # scan for values in the given narrow range that satisfy the criteria...
    match = log2hi # Anything found will be smaller
    result :: Tuple{UInt32,UInt32,UInt32} = ( log2hi, 0, 0 ) # placeholder for eventual matches
    fives :: UInt32 = 0; fiveslmt = UInt32(ceil(top / log2of5))
    while fives < fiveslmt
        log2p = top - fives * log2of5
        threes :: UInt32 = 0; threeslmt = UInt32(ceil(log2p / log2of3))
        while threes < threeslmt
            log2q = log2p - threes * log2of3
            twos = UInt32(floor(log2q)); log2this = top - log2q + twos
            if log2this >= btm && log2this < match
                # logarithm precision may not be enough to differentiate between
                # the next lower regular number and the target, so do
                # a full resolution comparison to eliminate this case...
                if (big(2)^twos * big(3)^threes * big(5)^fives) >= target
                    match = log2this; result = ( twos, threes, fives )
                end
            end
            threes += 1
        end
        fives += 1
    end
    result
end
Here's another possibility I just thought of:
If N is X bits long, then the smallest regular number R ≥ N will be in the range [2^(X-1), 2^X]
e.g. if N = 257 (binary 100000001) then we know R is 1xxxxxxxx unless R is exactly equal to the next power of 2 (512)
To generate all the regular numbers in this range, we can generate the odd regular numbers (i.e. multiples of powers of 3 and 5) first, then take each value and multiply by 2 (by bit-shifting) as many times as necessary to bring it into this range.
In Python:
from itertools import ifilter, takewhile
from Queue import PriorityQueue

def nextPowerOf2(n):
    p = max(1, n)
    while p != (p & -p):
        p += p & -p
    return p

# Generate multiples of powers of 3, 5
def oddRegulars():
    q = PriorityQueue()
    q.put(1)
    prev = None
    while not q.empty():
        n = q.get()
        if n != prev:
            prev = n
            yield n
            if n % 3 == 0:
                q.put(n // 3 * 5)
            q.put(n * 3)

# Generate regular numbers with the same number of bits as n
def regularsCloseTo(n):
    p = nextPowerOf2(n)
    numBits = len(bin(n))
    for i in takewhile(lambda x: x <= p, oddRegulars()):
        yield i << max(0, numBits - len(bin(i)))

def nextRegular(n):
    bigEnough = ifilter(lambda x: x >= n, regularsCloseTo(n))
    return min(bigEnough)
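For example (Python 2, matching the imports above), this should reproduce the f(257) case from the question:
print nextRegular(257)   # 270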
You know what? I'll put money on the proposition that actually, the 'dumb' algorithm is fastest. This is based on the observation that the next regular number does not, in general, seem to be much larger than the given input. So simply start counting up, and after each increment, refactor and see if you've found a regular number. But create one processing thread for each available core you have, and for N cores have each thread examine every Nth number. When each thread has found a number or crossed the power-of-2 threshold, compare the results (keep a running best number) and there you are.
I wrote a small C# program to solve this problem. It's not very optimised but it's a start.
This solution is pretty fast for numbers as big as 11 digits.
private long GetRegularNumber(long n)
{
    long result = n - 1;
    long quotient = result;
    while (quotient > 1)
    {
        result++;
        quotient = result;
        quotient = RemoveFactor(quotient, 2);
        quotient = RemoveFactor(quotient, 3);
        quotient = RemoveFactor(quotient, 5);
    }
    return result;
}

private static long RemoveFactor(long dividend, long divisor)
{
    long remainder = 0;
    long quotient = dividend;
    while (remainder == 0)
    {
        dividend = quotient;
        quotient = Math.DivRem(dividend, divisor, out remainder);
    }
    return dividend;
}

Resources