Neo4j: why the performance of allShortestPaths function is so slow? - performance

I am using Neo4j 'neo4j-community-2.3.0-RC1' version.
In my database there are just 1054 nodes.
when i do path query with 'allShotestPaths' function, why it is so slow.
it take about more than 1 second,the following is unit test result :
√ search optimalPath Path (192ms)
√ search optimal Path by Lat Lng (1131ms)
should i optimize the query? the following are querys for 'optimalPath' and 'optimal Path by Lat Lng'
optimalPath query:
MATCH path=allShortestPaths((start:潍坊_STATION )-[rels*..50]->(end:潍坊_STATION {name:"火车站"}))
RETURN NODES(path) AS stations,relationships(path) AS path,length(path) AS stop_count,
length(FILTER(index IN RANGE(1, length(rels)-1) WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
order by transfer_count,walk_count,stop_count
optimal Path by Lat Lng query:
MATCH path=allShortestPaths((start:潍坊_STATION {name:"公交总公司"})-[rels*..50]->(end:潍坊_STATION {name:"火车站"}))
WHERE
round(
6378.137 *1000*2*
asin(sqrt(
sin((radians(start.lat)-radians(36.714))/2)^2+cos(radians(start.lat))*cos(radians(36.714))*
sin((radians(start.lng)-radians(119.1268))/2)^2
))
)/1000 < 0.5 // this formula is used to calculate the distance between two GEO coordinate (latitude\longitude)
RETURN NODES(path) AS stations,relationships(path) AS path,length(path) AS stop_count,
length(FILTER(index IN RANGE(1, length(rels)-1) WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
order by transfer_count,walk_count,stop_count
you can download the database here:https://www.dropbox.com/s/zamkyh2aaw3voe6/data.rar?dl=0
i will be very grateful ,if anybody can help me. thanks

In general, without knowing more, I would pull the predicates and expressions that can be computed before the paths are all expanded, before the match.
And as your geo-filter is independent of anything else except your parameters and the start-node you can do:
MATCH (start:潍坊_STATION {name:"公交总公司"})
WHERE
// this formula is used to calculate the distance between two GEO coordinate (latitude\longitude)
round(6378.137 *1000*2*
asin(sqrt(sin((radians(start.lat)-radians({lat}))/2)^2
+cos(radians(start.lat))*cos(radians({lat}))*
sin((radians(start.lng)-radians({lng}))/2)^2)))/1000
< 0.5
MATCH (end:潍坊_STATION {name:"火车站"})
MATCH path=allShortestPaths((start)-[rels*..50]->(end))
RETURN NODES(path) AS stations,
relationships(path) AS path,
length(path) AS stop_count,
length(FILTER(index IN RANGE(1, length(rels)-1)
WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
ORDER BY transfer_count,walk_count,stop_count;
see this test (but the other query is equally fast):
neo4j-sh (?)$ MATCH (start:潍坊_STATION {name:"公交总公司"})
>
> WHERE
> // this formula is used to calculate the distance between two GEO coordinate (latitude\longitude)
> round(6378.137 *1000*2*
> asin(sqrt(sin((radians(start.lat)-radians({lat}))/2)^2
> +cos(radians(start.lat))*cos(radians({lat}))*
> sin((radians(start.lng)-radians({lng}))/2)^2)))/1000
> < 0.5
>
> MATCH (end:潍坊_STATION {name:"火车站"})
> MATCH path=allShortestPaths((start)-[rels*..50]->(end))
> WITH NODES(path) AS stations,
> relationships(path) AS path,
> length(path) AS stop_count,
> length(FILTER(index IN RANGE(1, length(rels)-1)
> WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
> length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
>
> ORDER BY transfer_count,walk_count,stop_count
> RETURN count(*);
+----------+
| count(*) |
+----------+
| 320 |
+----------+
1 row
10 ms

Related

Decision table to pseudocode

Create a decision table to help the Municipal Bank decide whether or not to loan money to a customer. Include the criteria used by the bank to identify qualified applicants.
Conditions
Income >=40000? T T T F F F
Credit Score >=600? T F F T T F
Months at job > 12? - T F T F -
Outcomes
Approve loan? Y Y X Y X X
Use pseudo code to write an algorithm for the decision table created in question one.
If Income >= 4000 And credScore >= 600 And monthJob > 12 Then
loanApp = Yes
I am having trouble converting the table to pseudo code, I wanted to know if the partial answer to the second question is on the right track.
Generally the approach works, yes. Although if you look carefully, you will see that monthJob > 12 is not needed for the first column of your condiditions. (there is a '-' there)
A faster approach you can get if you investigate your conditions. We always get Y if 2 of the conditions are met, otherwise we get X.
So here a optimised version (pseudocode):
score = 0
If Income >= 40000 Then
score = score + 1
Endif
If credScore >= 600 Then
score = score + 1
Endif
If monthJob > 12 Then
score = score + 1
Endif
If score >= 2 Then
loanApp = Yes
Else
loanApp = No
EndIf

Google codejam APAC Test practice round: Parentheses Order

I spent one day solving this problem and couldn't find a solution to pass the large dataset.
Problem
An n parentheses sequence consists of n "("s and n ")"s.
Now, we have all valid n parentheses sequences. Find the k-th smallest sequence in lexicographical order.
For example, here are all valid 3 parentheses sequences in lexicographical order:
((()))
(()())
(())()
()(())
()()()
Given n and k, write an algorithm to give the k-th smallest sequence in lexicographical order.
For large data set: 1 ≤ n ≤ 100 and 1 ≤ k ≤ 10^18
This problem can be solved by using dynamic programming
Let dp[n][m] = number of valid parentheses that can be created if we have n open brackets and m close brackets.
Base case:
dp[0][a] = 1 (a >=0)
Fill in the matrix using the base case:
dp[n][m] = dp[n - 1][m] + (n < m ? dp[n][m - 1]:0 );
Then, we can slowly build the kth parentheses.
Start with a = n open brackets and b = n close brackets and the current result is empty
while(k is not 0):
If number dp[a][b] >= k:
If (dp[a - 1][b] >= k) is true:
* Append an open bracket '(' to the current result
* Decrease a
Else:
//k is the number of previous smaller lexicographical parentheses
* Adjust value of k: `k -= dp[a -1][b]`,
* Append a close bracket ')'
* Decrease b
Else k is invalid
Notice that open bracket is less than close bracket in lexicographical order, so we always try to add open bracket first.
Let S= any valid sequence of parentheses from n( and n) .
Now any valid sequence S can be written as S=X+Y where
X=valid prefix i.e. if traversing X from left to right , at any point of time, numberof'(' >= numberof')'
Y=valid suffix i.e. if traversing Y from right to left, at any point of time, numberof'(' <= numberof')'
For any S many pairs of X and Y are possible.
For our example: ()(())
`()(())` =`empty_string + ()(())`
= `( + )(())`
= `() + (())`
= `()( + ())`
= `()(( + ))`
= `()(() + )`
= `()(()) + empty_string`
Note that when X=empty_string, then number of valid S from n( and n)= number of valid suffix Y from n( and n)
Now, Algorithm goes like this:
We will start with X= empty_string and recursively grow X until X=S. At any point of time we have two options to grow X, either append '(' or append ')'
Let dp[a][b]= number of valid suffixes using a '(' and b ')' given X
nop=num_open_parenthesis_left ncp=num_closed_parenthesis_left
`calculate(nop,ncp)
{
if dp[nop][ncp] is not known
{
i1=calculate(nop-1,ncp); // Case 1: X= X + "("
i2=((nop<ncp)?calculate(nop,ncp-1):0);
/*Case 2: X=X+ ")" if nop>=ncp, then after exhausting 1 ')' nop>ncp, therefore there can be no valid suffix*/
dp[nop][ncp]=i1+i2;
}
return dp[nop][ncp];
}`
Lets take example,n=3 i.e. 3 ( and 3 )
Now at the very start X=empty_string, therefore
dp[3][3]= number of valid sequence S using 3( and 3 )
= number of valid suffixes Y from 3 ( and 3 )

find best match numeric sequence in sequence array

Assume I have these two arrays:
float arr[] = {40.4357,40.6135,40.2477,40.2864,39.3449,39.8901,40.103,39.9959,39.7863,39.9102,39.2652,39.2688,39.5147,38.2246,38.5376,38.4512,38.9951,39.0999,39.3057,38.53,38.2761,38.1722,37.8816,37.6521,37.8306,38.0853,37.9644,38.0626,38.0567,38.3518,38.4044,38.3553,38.4978,38.3768,38.2058,38.3175,38.3123,38.262,38.0093,38.3685,38.0111,38.4539,38.8122,39.1413,38.9409,39.2043,39.3538,39.4123,39.3628,39.2825,39.1898,39.0431,39.0634,38.5993,38.252,37.3793,36.6334,36.4009,35.2822,34.4262,34.2119,34.1552,34.3325,33.9626,33.2661,32.3819,35.1959,36.7602,37.9039,37.8103,37.5832,37.9718,38.3111,38.9323,38.6763,39.1163,38.8469,39.805,40.2627,40.3689,40.4064,40.0558,40.815,41.0234,41.0128,41.0296,41.0927,40.7046,40.6775,40.2711,40.1283,39.7518,40.0145,40.0394,39.8461,39.6317,39.5548,39.1996,38.9861,38.8507,38.8603,38.483,38.4711,38.4214,38.4286,38.5766,38.7532,38.7905,38.6029,38.4635,38.1403,36.6844,36.616,36.4053,34.7934,34.0226,33.0505,33.4978,34.6106,35.284,35.7535,35.3541,35.5481,35.4086,35.7096,36.0526,36.1222,35.9408,36.1007,36.7952,36.99,37.1024,37.0993,37.3144,36.6951,37.1213,38.0026,38.1266,39.2538,38.8963,39.0158,38.6235,38.7908,38.6041,38.4489,38.3207,37.7398,38.5304,38.925,38.7249,38.9221,39.1704,39.5113,40.0613,39.3602,39.8689,39.973,40.0524,40.0025,40.7584,40.9714,40.9106,40.9685,40.6554,39.7314,39.0044,38.7183,38.5163,38.6101,38.2004,38.7606,38.7532,37.8903,37.8403,38.5368,39.0462,38.8279,39.0748,39.2907,38.5447,38.423,38.5624,38.476,38.5784,39.0905,39.379,39.4739,39.5774,40.7036,40.3044,39.6162,39.9967,40.0562,39.3426,38.666,38.7561,39.2823,38.8548,37.6214,37.8188,38.1086,38.3619,38.5472,38.1357,38.1422,37.95,37.1837,37.4636,36.8852,37.1617,37.5051,37.7724,38.0879,37.7197,38.0422,37.8551,38.5688,38.8388};
float pattern[] = {38.6434,38.1409,37.3391,37.5457,37.7487,37.7499,37.6121,37.4789,37.5821,37.6541,38.0365,37.7907,37.9932,37.9945,37.7032,37.3556,37.6359,37.5412,37.5296,37.8829,38.3797,38.4452,39.0929,39.1233,39.3014,39.0317,38.903,38.8221,39.045,38.6944,39.0699,39.0978,38.9877,38.8123,38.7491,38.5888,38.7875,38.2086,37.7484,37.3961,36.8663,36.2607,35.8838,35.3297,35.5574,35.7239};
Ives uploaded this example graph:
As you can see in the graph pattern almost fits in the array at index 17
Whats the best and fastest way to find this index? And is there a way to have a confidence for there match fuse the values are not equal as you can see?
If the starting index is your only degree of freedom, you can just try each index and calculate the sum of squared errors for each of the data points. In Python this could look like this:
data = [40.4357,40.6135,40.2477,...]
pattern = [38.6434,38.1409,37.3391,37.5457,37.7487,...]
best_ind, best_err = 0, 1e9999
for i in range(len(data) - len(pattern)):
subdata = data[i : i + len(pattern)]
err = sum((d-p)**2 for (d, p) in zip(subdata, pattern))
if err < best_err:
best_ind, best_err = i, err
Result:
>>> print best_ind, best_err
17 21.27929269
The straightforward algorithm is to chose a measure of convergence ( how you describe similarity, this might be average of errors, or their squared values or any other function suitable for your purposes) and apply steps
Let i = 0 be an integer index, M be a container of size = length(data) - length(pattern) + 1 to store the measurings
If i < size then shift your pattern by i, otherwise go to step 5
Calculate the measure of similarity and store into M
i = i + 1, go to 2 and repeat
Chose the index of minimum value in M
It's a one-liner in Python, using the fact that tuples are sorted lexicographically:
In [1]:
import numpy as np
arr = np.array( [ 40.4357,40.6135,40.2477,40.2864,39.3449,39.8901,40.103,39.9959,39.7863,39.9102,39.2652,39.2688,39.5147,38.2246,38.5376,38.4512,38.9951,39.0999,39.3057,38.53,38.2761,38.1722,37.8816,37.6521,37.8306,38.0853,37.9644,38.0626,38.0567,38.3518,38.4044,38.3553,38.4978,38.3768,38.2058,38.3175,38.3123,38.262,38.0093,38.3685,38.0111,38.4539,38.8122,39.1413,38.9409,39.2043,39.3538,39.4123,39.3628,39.2825,39.1898,39.0431,39.0634,38.5993,38.252,37.3793,36.6334,36.4009,35.2822,34.4262,34.2119,34.1552,34.3325,33.9626,33.2661,32.3819,35.1959,36.7602,37.9039,37.8103,37.5832,37.9718,38.3111,38.9323,38.6763,39.1163,38.8469,39.805,40.2627,40.3689,40.4064,40.0558,40.815,41.0234,41.0128,41.0296,41.0927,40.7046,40.6775,40.2711,40.1283,39.7518,40.0145,40.0394,39.8461,39.6317,39.5548,39.1996,38.9861,38.8507,38.8603,38.483,38.4711,38.4214,38.4286,38.5766,38.7532,38.7905,38.6029,38.4635,38.1403,36.6844,36.616,36.4053,34.7934,34.0226,33.0505,33.4978,34.6106,35.284,35.7535,35.3541,35.5481,35.4086,35.7096,36.0526,36.1222,35.9408,36.1007,36.7952,36.99,37.1024,37.0993,37.3144,36.6951,37.1213,38.0026,38.1266,39.2538,38.8963,39.0158,38.6235,38.7908,38.6041,38.4489,38.3207,37.7398,38.5304,38.925,38.7249,38.9221,39.1704,39.5113,40.0613,39.3602,39.8689,39.973,40.0524,40.0025,40.7584,40.9714,40.9106,40.9685,40.6554,39.7314,39.0044,38.7183,38.5163,38.6101,38.2004,38.7606,38.7532,37.8903,37.8403,38.5368,39.0462,38.8279,39.0748,39.2907,38.5447,38.423,38.5624,38.476,38.5784,39.0905,39.379,39.4739,39.5774,40.7036,40.3044,39.6162,39.9967,40.0562,39.3426,38.666,38.7561,39.2823,38.8548,37.6214,37.8188,38.1086,38.3619,38.5472,38.1357,38.1422,37.95,37.1837,37.4636,36.8852,37.1617,37.5051,37.7724,38.0879,37.7197,38.0422,37.8551,38.5688,38.8388] )
pattern = np.array( [ 38.6434,38.1409,37.3391,37.5457,37.7487,37.7499,37.6121,37.4789,37.5821,37.6541,38.0365,37.7907,37.9932,37.9945,37.7032,37.3556,37.6359,37.5412,37.5296,37.8829,38.3797,38.4452,39.0929,39.1233,39.3014,39.0317,38.903,38.8221,39.045,38.6944,39.0699,39.0978,38.9877,38.8123,38.7491,38.5888,38.7875,38.2086,37.7484,37.3961,36.8663,36.2607,35.8838,35.3297,35.5574,35.7239 ] )
min( ( ( ( arr[i:i+len(pattern)] - pattern ) ** 2 ).mean(), i ) for i in xrange(len(arr)-len(pattern)) )
Out[5]:
(0.46259331934782588, 17)
where 0.46 is the minimal mean squared error, and 17 is the position of the minimum in arr.

Find the smallest regular number that is not less than N

Regular numbers are numbers that evenly divide powers of 60. As an example, 602 = 3600 = 48 × 75, so both 48 and 75 are divisors of a power of 60. Thus, they are also regular numbers.
This is an extension of rounding up to the next power of two.
I have an integer value N which may contain large prime factors and I want to round it up to a number composed of only small prime factors (2, 3 and 5)
Examples:
f(18) == 18 == 21 * 32
f(19) == 20 == 22 * 51
f(257) == 270 == 21 * 33 * 51
What would be an efficient way to find the smallest number satisfying this requirement?
The values involved may be large, so I would like to avoid enumerating all regular numbers starting from 1 or maintaining an array of all possible values.
One can produce arbitrarily thin a slice of the Hamming sequence around the n-th member in time ~ n^(2/3) by direct enumeration of triples (i,j,k) such that N = 2^i * 3^j * 5^k.
The algorithm works from log2(N) = i+j*log2(3)+k*log2(5); enumerates all possible ks and for each, all possible js, finds the top i and thus the triple (k,j,i) and keeps it in a "band" if inside the given "width" below the given high logarithmic top value (when width < 1 there can be at most one such i) then sorts them by their logarithms.
WP says that n ~ (log N)^3, i.e. run time ~ (log N)^2. Here we don't care for the exact position of the found triple in the sequence, so all the count calculations from the original code can be thrown away:
slice hi w = sortBy (compare `on` fst) b where -- hi>log2(N) is a top value
lb5=logBase 2 5 ; lb3=logBase 2 3 -- w<1 (NB!) is log2(width)
b = concat -- the slice
[ [ (r,(i,j,k)) | frac < w ] -- store it, if inside width
| k <- [ 0 .. floor ( hi /lb5) ], let p = fromIntegral k*lb5,
j <- [ 0 .. floor ((hi-p)/lb3) ], let q = fromIntegral j*lb3 + p,
let (i,frac)=properFraction(hi-q) ; r = hi - frac ] -- r = i + q
-- properFraction 12.7 == (12, 0.7)
-- update: in pseudocode:
def slice(hi, w):
lb5, lb3 = logBase(2, 5), logBase(2, 3) -- logs base 2 of 5 and 3
for k from 0 step 1 to floor(hi/lb5) inclusive:
p = k*lb5
for j from 0 step 1 to floor((hi-p)/lb3) inclusive:
q = j*lb3 + p
i = floor(hi-q)
frac = hi-q-i -- frac < 1 , always
r = hi - frac -- r == i + q
if frac < w:
place (r,(i,j,k)) into the output array
sort the output array's entries by their "r" component
in ascending order, and return thus sorted array
Having enumerated the triples in the slice, it is a simple matter of sorting and searching, taking practically O(1) time (for arbitrarily thin a slice) to find the first triple above N. Well, actually, for constant width (logarithmic), the amount of numbers in the slice (members of the "upper crust" in the (i,j,k)-space below the log(N) plane) is again m ~ n^2/3 ~ (log N)^2 and sorting takes m log m time (so that searching, even linear, takes ~ m run time then). But the width can be made smaller for bigger Ns, following some empirical observations; and constant factors for the enumeration of triples are much higher than for the subsequent sorting anyway.
Even with constant width (logarthmic) it runs very fast, calculating the 1,000,000-th value in the Hamming sequence instantly and the billionth in 0.05s.
The original idea of "top band of triples" is due to Louis Klauder, as cited in my post on a DDJ blogs discussion back in 2008.
update: as noted by GordonBGood in the comments, there's no need for the whole band but rather just about one or two values above and below the target. The algorithm is easily amended to that effect. The input should also be tested for being a Hamming number itself before proceeding with the algorithm, to avoid round-off issues with double precision. There are no round-off issues comparing the logarithms of the Hamming numbers known in advance to be different (though going up to a trillionth entry in the sequence uses about 14 significant digits in logarithm values, leaving only 1-2 digits to spare, so the situation may in fact be turning iffy there; but for 1-billionth we only need 11 significant digits).
update2: turns out the Double precision for logarithms limits this to numbers below about 20,000 to 40,000 decimal digits (i.e. 10 trillionth to 100 trillionth Hamming number). If there's a real need for this for such big numbers, the algorithm can be switched back to working with the Integer values themselves instead of their logarithms, which will be slower.
Okay, hopefully third time's a charm here. A recursive, branching algorithm for an initial input of p, where N is the number being 'built' within each thread. NB 3a-c here are launched as separate threads or otherwise done (quasi-)asynchronously.
Calculate the next-largest power of 2 after p, call this R. N = p.
Is N > R? Quit this thread. Is p composed of only small prime factors? You're done. Otherwise, go to step 3.
After any of 3a-c, go to step 4.
a) Round p up to the nearest multiple of 2. This number can be expressed as m * 2.
b) Round p up to the nearest multiple of 3. This number can be expressed as m * 3.
c) Round p up to the nearest multiple of 5. This number can be expressed as m * 5.
Go to step 2, with p = m.
I've omitted the bookkeeping to do regarding keeping track of N but that's fairly straightforward I take it.
Edit: Forgot 6, thanks ypercube.
Edit 2: Had this up to 30, (5, 6, 10, 15, 30) realized that was unnecessary, took that out.
Edit 3: (The last one I promise!) Added the power-of-30 check, which helps prevent this algorithm from eating up all your RAM.
Edit 4: Changed power-of-30 to power-of-2, per finnw's observation.
Here's a solution in Python, based on Will Ness answer but taking some shortcuts and using pure integer math to avoid running into log space numerical accuracy errors:
import math
def next_regular(target):
"""
Find the next regular number greater than or equal to target.
"""
# Check if it's already a power of 2 (or a non-integer)
try:
if not (target & (target-1)):
return target
except TypeError:
# Convert floats/decimals for further processing
target = int(math.ceil(target))
if target <= 6:
return target
match = float('inf') # Anything found will be smaller
p5 = 1
while p5 < target:
p35 = p5
while p35 < target:
# Ceiling integer division, avoiding conversion to float
# (quotient = ceil(target / p35))
# From https://stackoverflow.com/a/17511341/125507
quotient = -(-target // p35)
# Quickly find next power of 2 >= quotient
# See https://stackoverflow.com/a/19164783/125507
try:
p2 = 2**((quotient - 1).bit_length())
except AttributeError:
# Fallback for Python <2.7
p2 = 2**(len(bin(quotient - 1)) - 2)
N = p2 * p35
if N == target:
return N
elif N < match:
match = N
p35 *= 3
if p35 == target:
return p35
if p35 < match:
match = p35
p5 *= 5
if p5 == target:
return p5
if p5 < match:
match = p5
return match
In English: iterate through every combination of 5s and 3s, quickly finding the next power of 2 >= target for each pair and keeping the smallest result. (It's a waste of time to iterate through every possible multiple of 2 if only one of them can be correct). It also returns early if it ever finds that the target is already a regular number, though this is not strictly necessary.
I've tested it pretty thoroughly, testing every integer from 0 to 51200000 and comparing to the list on OEIS http://oeis.org/A051037, as well as many large numbers that are ±1 from regular numbers, etc. It's now available in SciPy as fftpack.helper.next_fast_len, to find optimal sizes for FFTs (source code).
I'm not sure if the log method is faster because I couldn't get it to work reliably enough to test it. I think it has a similar number of operations, though? I'm not sure, but this is reasonably fast. Takes <3 seconds (or 0.7 second with gmpy) to calculate that 2142 × 380 × 5444 is the next regular number above 22 × 3454 × 5249+1 (the 100,000,000th regular number, which has 392 digits)
You want to find the smallest number m that is m >= N and m = 2^i * 3^j * 5^k where all i,j,k >= 0.
Taking logarithms the equations can be rewritten as:
log m >= log N
log m = i*log2 + j*log3 + k*log5
You can calculate log2, log3, log5 and logN to (enough high, depending on the size of N) accuracy. Then this problem looks like a Integer Linear programming problem and you could try to solve it using one of the known algorithms for this NP-hard problem.
EDITED/CORRECTED: Corrected the codes to pass the scipy tests:
Here's an answer based on endolith's answer, but almost eliminating long multi-precision integer calculations by using float64 logarithm representations to do a base comparison to find triple values that pass the criteria, only resorting to full precision comparisons when there is a chance that the logarithm value may not be accurate enough, which only occurs when the target is very close to either the previous or the next regular number:
import math
def next_regulary(target):
"""
Find the next regular number greater than or equal to target.
"""
if target < 2: return ( 0, 0, 0 )
log2hi = 0
mant = 0
# Check if it's already a power of 2 (or a non-integer)
try:
mant = target & (target - 1)
target = int(target) # take care of case where not int/float/decimal
except TypeError:
# Convert floats/decimals for further processing
target = int(math.ceil(target))
mant = target & (target - 1)
# Quickly find next power of 2 >= target
# See https://stackoverflow.com/a/19164783/125507
try:
log2hi = target.bit_length()
except AttributeError:
# Fallback for Python <2.7
log2hi = len(bin(target)) - 2
# exit if this is a power of two already...
if not mant: return ( log2hi - 1, 0, 0 )
# take care of trivial cases...
if target < 9:
if target < 4: return ( 0, 1, 0 )
elif target < 6: return ( 0, 0, 1 )
elif target < 7: return ( 1, 1, 0 )
else: return ( 3, 0, 0 )
# find log of target, which may exceed the float64 limit...
if log2hi < 53: mant = target << (53 - log2hi)
else: mant = target >> (log2hi - 53)
log2target = log2hi + math.log2(float(mant) / (1 << 53))
# log2 constants
log2of2 = 1.0; log2of3 = math.log2(3); log2of5 = math.log2(5)
# calculate range of log2 values close to target;
# desired number has a logarithm of log2target <= x <= top...
fctr = 6 * log2of3 * log2of5
top = (log2target**3 + 2 * fctr)**(1/3) # for up to 2 numbers higher
btm = 2 * log2target - top # or up to 2 numbers lower
match = log2hi # Anything found will be smaller
result = ( log2hi, 0, 0 ) # placeholder for eventual matches
count = 0 # only used for debugging counting band
fives = 0; fiveslmt = int(math.ceil(top / log2of5))
while fives < fiveslmt:
log2p = top - fives * log2of5
threes = 0; threeslmt = int(math.ceil(log2p / log2of3))
while threes < threeslmt:
log2q = log2p - threes * log2of3
twos = int(math.floor(log2q)); log2this = top - log2q + twos
if log2this >= btm: count += 1 # only used for counting band
if log2this >= btm and log2this < match:
# logarithm precision may not be enough to differential between
# the next lower regular number and the target, so do
# a full resolution comparison to eliminate this case...
if (2**twos * 3**threes * 5**fives) >= target:
match = log2this; result = ( twos, threes, fives )
threes += 1
fives += 1
return result
print(next_regular(2**2 * 3**454 * 5**249 + 1)) # prints (142, 80, 444)
Since most long multi-precision calculations have been eliminated, gmpy isn't needed, and on IDEOne the above code takes 0.11 seconds instead of 0.48 seconds for endolith's solution to find the next regular number greater than the 100 millionth one as shown; it takes 0.49 seconds instead of 5.48 seconds to find the next regular number past the billionth (next one is (761,572,489) past (1334,335,404) + 1), and the difference will get even larger as the range goes up as the multi-precision calculations get increasingly longer for the endolith version compared to almost none here. Thus, this version could calculate the next regular number from the trillionth in the sequence in about 50 seconds on IDEOne, where it would likely take over an hour with the endolith version.
The English description of the algorithm is almost the same as for the endolith version, differing as follows:
1) calculates the float log estimation of the argument target value (we can't use the built-in log function directly as the range may be much too large for representation as a 64-bit float),
2) compares the log representation values in determining qualifying values inside an estimated range above and below the target value of only about two or three numbers (depending on round-off),
3) compare multi-precision values only if within the above defined narrow band,
4) outputs the triple indices rather than the full long multi-precision integer (would be about 840 decimal digits for the one past the billionth, ten times that for the trillionth), which can then easily be converted to the long multi-precision value if required.
This algorithm uses almost no memory other than for the potentially very large multi-precision integer target value, the intermediate evaluation comparison values of about the same size, and the output expansion of the triples if required. This algorithm is an improvement over the endolith version in that it successfully uses the logarithm values for most comparisons in spite of their lack of precision, and that it narrows the band of compared numbers to just a few.
This algorithm will work for argument ranges somewhat above ten trillion (a few minute's calculation time at IDEOne rates) when it will no longer be correct due to lack of precision in the log representation values as per #WillNess's discussion; in order to fix this, we can change the log representation to a "roll-your-own" logarithm representation consisting of a fixed-length integer (124 bits for about double the exponent range, good for targets of over a hundred thousand digits if one is willing to wait); this will be a little slower due to the smallish multi-precision integer operations being slower than float64 operations, but not that much slower since the size is limited (maybe a factor of three or so slower).
Now none of these Python implementations (without using C or Cython or PyPy or something) are particularly fast, as they are about a hundred times slower than as implemented in a compiled language. For reference sake, here is a Haskell version:
{-# OPTIONS_GHC -O3 #-}
import Data.Word
import Data.Bits
nextRegular :: Integer -> ( Word32, Word32, Word32 )
nextRegular target
| target < 2 = ( 0, 0, 0 )
| target .&. (target - 1) == 0 = ( fromIntegral lg2hi - 1, 0, 0 )
| target < 9 = case target of
3 -> ( 0, 1, 0 )
5 -> ( 0, 0, 1 )
6 -> ( 1, 1, 0 )
_ -> ( 3, 0, 0 )
| otherwise = match
where
lg3 = logBase 2 3 :: Double; lg5 = logBase 2 5 :: Double
lg2hi = let cntplcs v cnt =
let nv = v `shiftR` 31 in
if nv <= 0 then
let cntbts x c =
if x <= 0 then c else
case c + 1 of
nc -> nc `seq` cntbts (x `shiftR` 1) nc in
cntbts (fromIntegral v :: Word32) cnt
else case cnt + 31 of ncnt -> ncnt `seq` cntplcs nv ncnt
in cntplcs target 0
lg2tgt = let mant = if lg2hi <= 53 then target `shiftL` (53 - lg2hi)
else target `shiftR` (lg2hi - 53)
in fromIntegral lg2hi +
logBase 2 (fromIntegral mant / 2^53 :: Double)
lg2top = (lg2tgt^3 + 2 * 6 * lg3 * lg5)**(1/3) -- for 2 numbers or so higher
lg2btm = 2* lg2tgt - lg2top -- or two numbers or so lower
match =
let klmt = floor (lg2top / lg5)
loopk k mtchlgk mtchtplk =
if k > klmt then mtchtplk else
let p = lg2top - fromIntegral k * lg5
jlmt = fromIntegral $ floor (p / lg3)
loopj j mtchlgj mtchtplj =
if j > jlmt then loopk (k + 1) mtchlgj mtchtplj else
let q = p - fromIntegral j * lg3
( i, frac ) = properFraction q; r = lg2top - frac
( nmtchlg, nmtchtpl ) =
if r < lg2btm || r >= mtchlgj then
( mtchlgj, mtchtplj ) else
if 2^i * 3^j * 5^k >= target then
( r, ( i, j, k ) ) else ( mtchlgj, mtchtplj )
in nmtchlg `seq` nmtchtpl `seq` loopj (j + 1) nmtchlg nmtchtpl
in loopj 0 mtchlgk mtchtplk
in loopk 0 (fromIntegral lg2hi) ( fromIntegral lg2hi, 0, 0 )
trival :: ( Word32, Word32, Word32 ) -> Integer
trival (i,j,k) = 2^i * 3^j * 5^k
main = putStrLn $ show $ nextRegular $ (trival (1334,335,404)) + 1 -- (1126,16930,40)
This code calculates the next regular number following the billionth in too small a time to be measured and following the trillionth in 0.69 seconds on IDEOne (and potentially could run even faster except that IDEOne doesn't support LLVM). Even Julia will run at something like this Haskell speed after the "warm-up" for JIT compilation.
EDIT_ADD: The Julia code is as per the following:
function nextregular(target :: BigInt) :: Tuple{ UInt32, UInt32, UInt32 }
# trivial case of first value or anything less...
target < 2 && return ( 0, 0, 0 )
# Check if it's already a power of 2 (or a non-integer)
mant = target & (target - 1)
# Quickly find next power of 2 >= target
log2hi :: UInt32 = 0
test = target
while true
next = test & 0x7FFFFFFF
test >>>= 31; log2hi += 31
test <= 0 && (log2hi -= leading_zeros(UInt32(next)) - 1; break)
end
# exit if this is a power of two already...
mant == 0 && return ( log2hi - 1, 0, 0 )
# take care of trivial cases...
if target < 9
target < 4 && return ( 0, 1, 0 )
target < 6 && return ( 0, 0, 1 )
target < 7 && return ( 1, 1, 0 )
return ( 3, 0, 0 )
end
# find log of target, which may exceed the Float64 limit...
if log2hi < 53 mant = target << (53 - log2hi)
else mant = target >>> (log2hi - 53) end
log2target = log2hi + log(2, Float64(mant) / (1 << 53))
# log2 constants
log2of2 = 1.0; log2of3 = log(2, 3); log2of5 = log(2, 5)
# calculate range of log2 values close to target;
# desired number has a logarithm of log2target <= x <= top...
fctr = 6 * log2of3 * log2of5
top = (log2target^3 + 2 * fctr)^(1/3) # for 2 numbers or so higher
btm = 2 * log2target - top # or 2 numbers or so lower
# scan for values in the given narrow range that satisfy the criteria...
match = log2hi # Anything found will be smaller
result :: Tuple{UInt32,UInt32,UInt32} = ( log2hi, 0, 0 ) # placeholder for eventual matches
fives :: UInt32 = 0; fiveslmt = UInt32(ceil(top / log2of5))
while fives < fiveslmt
log2p = top - fives * log2of5
threes :: UInt32 = 0; threeslmt = UInt32(ceil(log2p / log2of3))
while threes < threeslmt
log2q = log2p - threes * log2of3
twos = UInt32(floor(log2q)); log2this = top - log2q + twos
if log2this >= btm && log2this < match
# logarithm precision may not be enough to differential between
# the next lower regular number and the target, so do
# a full resolution comparison to eliminate this case...
if (big(2)^twos * big(3)^threes * big(5)^fives) >= target
match = log2this; result = ( twos, threes, fives )
end
end
threes += 1
end
fives += 1
end
result
end
Here's another possibility I just thought of:
If N is X bits long, then the smallest regular number R ≥ N will be in the range
[2X-1, 2X]
e.g. if N = 257 (binary 100000001) then we know R is 1xxxxxxxx unless R is exactly equal to the next power of 2 (512)
To generate all the regular numbers in this range, we can generate the odd regular numbers (i.e. multiples of powers of 3 and 5) first, then take each value and multiply by 2 (by bit-shifting) as many times as necessary to bring it into this range.
In Python:
from itertools import ifilter, takewhile
from Queue import PriorityQueue
def nextPowerOf2(n):
p = max(1, n)
while p != (p & -p):
p += p & -p
return p
# Generate multiples of powers of 3, 5
def oddRegulars():
q = PriorityQueue()
q.put(1)
prev = None
while not q.empty():
n = q.get()
if n != prev:
prev = n
yield n
if n % 3 == 0:
q.put(n // 3 * 5)
q.put(n * 3)
# Generate regular numbers with the same number of bits as n
def regularsCloseTo(n):
p = nextPowerOf2(n)
numBits = len(bin(n))
for i in takewhile(lambda x: x <= p, oddRegulars()):
yield i << max(0, numBits - len(bin(i)))
def nextRegular(n):
bigEnough = ifilter(lambda x: x >= n, regularsCloseTo(n))
return min(bigEnough)
You know what? I'll put money on the proposition that actually, the 'dumb' algorithm is fastest. This is based on the observation that the next regular number does not, in general, seem to be much larger than the given input. So simply start counting up, and after each increment, refactor and see if you've found a regular number. But create one processing thread for each available core you have, and for N cores have each thread examine every Nth number. When each thread has found a number or crossed the power-of-2 threshold, compare the results (keep a running best number) and there you are.
I wrote a small c# program to solve this problem. It's not very optimised but it's a start.
This solution is pretty fast for numbers as big as 11 digits.
private long GetRegularNumber(long n)
{
long result = n - 1;
long quotient = result;
while (quotient > 1)
{
result++;
quotient = result;
quotient = RemoveFactor(quotient, 2);
quotient = RemoveFactor(quotient, 3);
quotient = RemoveFactor(quotient, 5);
}
return result;
}
private static long RemoveFactor(long dividend, long divisor)
{
long remainder = 0;
long quotient = dividend;
while (remainder == 0)
{
dividend = quotient;
quotient = Math.DivRem(dividend, divisor, out remainder);
}
return dividend;
}

Algorithm that sorts elements within range of values into clusters?

Essentially I want to take a list of things like...
a 345
b 762
c 983
d 425
e 45
...
and given a maximum distance, create clusters for each element containing other elements within that range. For example, if I specified the maximum distance above to be 300 the clusters would be...
a 345
d 425
e 45
b 762
c 983
c 983
b 762
d 425
a 345
e 45
a 345
Constraints wise, I'm reading entries in a file, which is common with the work I'm doing. As such, I generally focus my algorithms on doing work as it reads entries, rather than reading everything in the file, storing it in some convenient structure, and then doing work on it. Anyway, storing the entries from the file and then performing a sort based on those values, then just going through the sorted list and doing appropriate output is something I'm trying to avoid.
I've done some layman brainstorming, but before I spend lots of time doing a thorough analysis I feel like I've seen this somewhere or that there is an algorithm very similar to this. I'm not asking for someone to come up with an algorithm unless you feel so inclined, just wondering if there are any existing ones that solve this problem or one very similar to it.
Thanks.
Another possible flood fill algorithm using some modified binary search to reduce the amount of required comparisons:
Pseudo-code:
def clustersWithinRange(values, distance): //Should run in O(n log n)
sortedValues = values.sorted()
valueCount = values.len()
clusters = Array()
cachedLower = 0
cachedUpper = values.len
cachedValue = null
for i in range(0, valueCount):
value = sortedValues.get(i)
if (cachedValue == value): //duplicate value, no need to calculate twice
clusters.add(clusters.get(i).copy()) //simply clone last cluster or nop if no duplicate clusters wanted
else:
lower = sortedValues.binaryBoundSearch(values, value, distance, cachedLower, i, LowerBoundMatchFlag)
upper = sortedValues.binaryBoundSearch(values, value, distance, cachedUpper, valueCount, UpperBoundMatchFlag)
clusters.add(sortedValues[lower:upper+1])//add sublist within (and including) lower...upper to clusters
cachedLower = lower
cachedUpper = upper
return clusters
def withinDistance(valueA, valueB, distance):
return abs(valueA - valueB) <= distance
def binaryBoundSearch(values, value, distance, low, high, searchFlag):
// Possible values of searchFlag:
// "LowerBoundMatchFlag" to get the leftmost found match.
// "UpperBoundMatchFlag" to get the rightmost found match.
if (high < low):
return -1 // not found
mid = low + ((high - low) / 2)
matchPosition = -1
aValue = values[mid]
if (withinDistance(aValue, value, distance)):
matchPosition = mid
displacement = (searchFlag == LowerBoundMatchFlag) ? -1 : 1
if (mid > low && mid < high && withinDistance(values.get(mid + displacement), value, distance)):
newLow = (searchFlag == LowerBoundMatchFlag) ? low : mid + displacement
newHigh = (searchFlag == LowerBoundMatchFlag) ? mid + displacement : high
matchPosition = binaryBoundSearch(values, value, newLow, newHigh, searchFlag)
else:
flag = compare(aValue, value)
if (flag < 0): // current position too far left
matchPosition = binaryBoundSearch(values, value, mid + 1, high, searchFlag)
else if (flag > 0): // current position too far right
matchPosition = binaryBoundSearch(values, value, low, mid - 1, searchFlag)
return matchPosition

Resources