Why is large number power timing out in Elixir?

I am trying to recreate some basics of Bitcoin in Elixir. I know Elixir is not the ideal language for the task but I am doing this for fun and learning purposes. I have run into the following problem when trying to implement public key derivation from a secret, which involves raising a very large number to a very large power over a finite field. I implemented this myself, given that :math.pow/2 already struggles with comparatively small numbers. But my implementation takes extremely long and the function eventually times out.
Values and calling the function:
prime = 115792089237316195423570985008687907853269984665640564039457584007908834671663
val = 65341020041517633956166170261014086368942546761318486551877808671514674964848
Util.my_fpow(val, prime - 2, prime)
my_fpow/3 method in the Util module:
defmodule Util do
  def my_fpow(n, k, prime), do: my_fpow(n, k, 1, prime)
  defp my_fpow(_, 0, acc, _), do: acc
  defp my_fpow(n, k, acc, prime) do
    new_acc = n * acc
              |> rem(prime)
    my_fpow(n, k - 1, new_acc, prime)
  end
end
First of all, I would like to understand on a deeper level why this method takes so long for these large numbers. Also I would be interested if there are other more efficient implementations that might still make it viable to do such calculations in Elixir (not at scale, just so the calculation could actually finish before timing out).

Your function will recurse prime - 2 times, since k is decremented by 1 on each call. Even if Elixir executed 1 billion calls per second (which it won't on current hardware), it would take more than 10^60 years to finish.
A much faster solution squares the base and halves the exponent on each iteration, which reduces the number of calls to about log2(k), roughly 256 for this exponent. Here's a translation of that algorithm:
defmodule A do
  # Calculates (n ^ k) % m.
  def powmod(n, k, m), do: powmod(n, k, m, 1)
  def powmod(_, 0, _, r), do: r
  def powmod(n, k, m, r) do
    r = if rem(k, 2) == 1, do: rem(r * n, m), else: r
    n = rem(n * n, m)
    k = div(k, 2)
    powmod(n, k, m, r)
  end
end
prime = 115792089237316195423570985008687907853269984665640564039457584007908834671663
val = 65341020041517633956166170261014086368942546761318486551877808671514674964848
IO.inspect A.powmod(val, prime - 2, prime)
Output:
83174505189910067536517124096019359197644205712500122884473429251812128958118
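For comparison, the same square-and-multiply idea can be sketched in Python. This is only an illustration: the name powmod_sketch is made up here, and Python's built-in three-argument pow already does the same job. Because the modulus is prime, the result can be checked by multiplying it back with val.

```python
def powmod_sketch(n, k, m):
    """Compute (n ** k) % m by square-and-multiply (hypothetical helper)."""
    result = 1 % m
    n %= m
    while k > 0:
        if k & 1:                  # lowest bit of the exponent is set
            result = (result * n) % m
        n = (n * n) % m            # square the base
        k >>= 1                    # drop the bit just processed
    return result


prime = 115792089237316195423570985008687907853269984665640564039457584007908834671663
val = 65341020041517633956166170261014086368942546761318486551877808671514674964848
inverse = powmod_sketch(val, prime - 2, prime)
# By Fermat's little theorem, val ** (prime - 2) is the modular inverse of val.
print((val * inverse) % prime)  # prints 1
```

The loop runs once per bit of the exponent, so about 256 iterations here instead of prime - 2.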

Related

How Do You Analyze This Sorting Algorithm's Time Complexity?

TL;DR - A non-comparison sorting algorithm whose execution time scales based on the number of bits in the datatype; how do you properly evaluate the Big O time complexity?
I have a sorting algorithm "Bitsort" whose time complexity is dependent on the number of bits in the datatype being sorted, and not on the length of the list being sorted; for lists ranging in length from 2 elements to maximum addressable size, the number of passes over the entire list needed to fully sort it is dictated by the size of the datatype used in the list. The simple version of bitsort ( non-optimized but working code example below ) sorts the list 1 bit at a time. In real world terms, this would often mean that a 32-bit system has 32 bits in both the datatype and the maximum length of a list to be sorted; and Log-base2 of 2^32 is 32. Ergo, it could be argued that simple bitsort's time complexity goes from O(NB), where B is the number of bits, to O(NLogN) as N goes to the maximum length array the system can handle. In absolute terms, though, as N goes to infinity, B stays constant, so it could be argued that the time complexity is O(N).
How should the time complexity be evaluated for simple bitsort?
import sys
from random import randrange
def checkOrder(arr):
    for i in range(1, len(arr)):
        if arr[i - 1] > arr[i]:
            return False
    return True
def randomArray(length, height):
    result = []
    for i in range(0, length):
        result.append(randrange(0, height, 1))
    return result
def swap(arr, left, right):
    temp = arr[left]
    arr[left] = arr[right]
    arr[right] = temp
# pat. pending, but not a troll
def bitsort(arr, begin, end, shiftBit):
    if end-begin <= 0 or shiftBit < 0:
        return
    LP, RP = begin - 1, end + 1
    mask = 0b1 << shiftBit
    while True:
        LP += 1
        if LP > end:
            bitsort(arr, begin, end, shiftBit - 1)
            return
        elif LP == RP:
            if LP <= begin: # left has 0 or 1 el
                pass
            else:
                bitsort(arr, begin, LP - 1, shiftBit - 1)
            if RP >= end: # right has 0 or 1 elements
                pass
            else:
                bitsort(arr, RP, end, shiftBit - 1)
            return
        if arr[LP] & mask == 0:
            continue
        else:
            while True:
                RP -= 1
                if RP < begin:
                    bitsort(arr, begin, end, shiftBit - 1)
                    return
                elif LP == RP:
                    if LP <= begin:
                        if RP <= end:
                            bitsort(arr, begin, end, shiftBit - 1)
                            return
                        else:
                            pass
                    else:
                        bitsort(arr, begin, LP - 1, shiftBit - 1)
                    if RP >= end:
                        pass
                    else:
                        bitsort(arr, RP, end, shiftBit - 1)
                    return
                if (arr[RP] & mask) >> shiftBit == 1:
                    continue
                else:
                    swap(arr, LP, RP)
                    break
def main(args):
    arr = randomArray(1000000, 2147483648)
    bitsort(arr, 0, len(arr) - 1, 31)
    print(f'Proper order: {checkOrder(arr)}')
if __name__ == '__main__':
    main(sys.argv[1:])
I’m going to trust that your analysis is correct and that the work done, as a function of N and B, is O(NB), and then focus on your question about how to compare this against, say, O(N log N).
The typical convention when analyzing sorting algorithms like this, at least from a CS theory perspective, is to classify this as O(NB) rather than O(N log N) or O(N). Here are a few reasons why:
Imagine that you’re sorting a list of numbers that you know don’t have many bits in them (say, US phone area codes, which fit in ten bits, or US ZIP codes, which fit in seventeen). In that case, you could tune the algorithm to work only on the bits that are actually in use, reducing the number of passes required to sort everything. A runtime of O(NB) more clearly telegraphs this, since as B decreases, the runtime drops accordingly.
While we’re currently using machines with word sizes that are typically 32 or 64, in principle we could crank that word size way up if we wanted to. SIMD instructions, for example, allow for parallel operations on many more bits than this. Or perhaps you’re sorting variable-length integers, like what you might do with RSA keys. In that case, we couldn’t necessarily claim that the runtime is O(N log N), since we aren’t guaranteed that B = O(log N). It also wouldn’t be correct to say that the runtime is O(N) in that latter case, since we can’t pretend that B is a constant.
Explicitly writing out O(NB) allows us to compare this sorting algorithm against other integer sorting algorithms. For example, we have sorting algorithms whose runtimes are O(N log B) and O(N log N / log B). Leaving the B term in makes it easier to see how these other algorithms compare against yours.
A runtime of O(N log N) would be incorrect for large B and much smaller N. In that case, you might do way more than O(log N) rounds of the bitsort routines.
The argument that it's better than O(n log n) presumes that keys are duplicated in the array. If all the keys are unique, it will take at least log n bits to represent them. This means the sort time is at minimum O(n log n), and worse if the keys are more sparsely spread out. For example, if we're talking about text encoded as ASCII, it's going to be much worse.
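To make the role of B concrete, here is a minimal Python sketch of a simpler relative of the algorithm in the question: a binary radix sort that starts from the least significant bit and makes exactly one stable pass per bit. It is not the poster's bitsort (which partitions in place starting from the most significant bit); the name binary_radix_sort and the bits parameter are illustrative assumptions. Each pass touches all N items and there are B passes, so the work is O(NB), and lowering B for small keys lowers the pass count directly.

```python
def binary_radix_sort(values, bits):
    """Sort non-negative ints that fit in `bits` bits; returns (sorted_list, passes)."""
    passes = 0
    for shift in range(bits):
        zeros = [v for v in values if not (v >> shift) & 1]
        ones = [v for v in values if (v >> shift) & 1]
        values = zeros + ones      # stable partition on the current bit
        passes += 1
    return values, passes


area_codes = [907, 212, 415, 305, 808, 202]
print(binary_radix_sort(area_codes, 10))     # 10 passes suffice for values below 1024
print(binary_radix_sort(area_codes, 32)[1])  # prints 32: same data, more passes
```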

Find the value of f(T) for big value T

I am trying to solve the problem described below.
Given the values of f(0) and k, which are integers,
I need to find the value of f(T), where T <= 10^10.
The recursive function is:
f(n) = 2*f(n-1) , if 4*f(n-1) <= k
       k - ( 2*f(n-1) ) , if 4*f(n-1) > k
My efforts,
#include<iostream>
using namespace std;
int main(){
    long k,f0,i;
    cin>>k>>f0;
    long operation ;
    cin>>operation;
    long answer=f0;
    for(i=1;i<=operation;i++){
        answer=(4*answer <= k )?(2*answer):(k-(2*answer));
    }
    cout<<answer;
    return 0;
}
My code gives the right answer, but in the worst case the loop runs 10^10 times, which gives me Time Limit Exceeded. I need a more efficient solution for this problem. Please help me; I don't know the correct algorithm.
If 2f(0) < k then you can compute this function in O(log n) time (using exponentiation by squaring modulo k).
r = f(0) * 2^n mod k
return 2 * r >= k ? k - r : r
You can prove this by induction. The induction hypothesis is that 0 <= f(n) < k/2, and that the above code fragment computes f(n).
Here's a Python program which checks random test cases, comparing a naive implementation (f) with an optimized one (g).
def f(n, k, z):
    r = z
    for _ in range(n):
        if 4*r <= k:
            r = 2 * r
        else:
            r = k - 2 * r
    return r
def g(n, k, z):
    r = (z * pow(2, n, k)) % k
    if 2 * r >= k:
        r = k - r
    return r
import random
errs = 0
while errs < 20:
    k = random.randrange(100, 10000000)
    n = random.randrange(100000)
    z = random.randrange(k//2)
    a1 = f(n, k, z)
    a2 = g(n, k, z)
    if a1 != a2:
        print(n, k, z, a1, a2)
        errs += 1
    print('.', end=' ')
Can you work out a mathematical solution before programming and computing?
Actually,
f(n) = f0*2^(n-1) , if f(n-1)*4 <= k
       k - f0*2^(n-1) , if f(n-1)*4 > k
thus, your code would look like this:
condition = f0*pow(2, operation-2)
answer = condition*4 <= k ? condition*2 : k - condition*2
For a simple loop, your answer looks pretty tight; one could optimise a little bit using answer<<2 instead of 4*answer, and answer<<1 for 2*answer, but quite possibly your compiler is already doing that. If you're blowing the time with this, it might be necessary to reduce the loop itself somehow.
I can't figure out the mathematical pattern that @Shannon was going for, but I'm thinking we could exploit the fact that this function will sooner or later cycle. If the cycle is short enough, then we could shorten the loop by just getting the answer at the same point in the cycle.
So let's get some cycle detection equipment in the form of Brent's algorithm, and see if we can cut the loop to reasonable levels.
def brent(f, x0):
    # main phase: search successive powers of two
    power = lam = 1
    tortoise = x0
    hare = f(x0) # f(x0) is the element/node next to x0.
    while tortoise != hare:
        if power == lam: # time to start a new power of two?
            tortoise = hare
            power *= 2
            lam = 0
        hare = f(hare)
        lam += 1
    # Find the position of the first repetition of length λ
    mu = 0
    tortoise = hare = x0
    for i in range(lam):
        # range(lam) produces a list with the values 0, 1, ... , lam-1
        hare = f(hare)
    # The distance between the hare and tortoise is now λ.
    # Next, the hare and tortoise move at same speed until they agree
    while tortoise != hare:
        tortoise = f(tortoise)
        hare = f(hare)
        mu += 1
    return lam, mu
f0 = 2
k = 198779
t = 10000000000
def f(x):
    if 4 * x <= k:
        return 2 * x
    else:
        return k - 2 * x
lam, mu = brent(f, f0)
t2 = t
if t >= mu + lam: # if T is past the cycle's first loop,
    t2 = (t - mu) % lam + mu # find the equivalent place in the first loop
x = f0
for i in range(t2):
    x = f(x)
print("Cycle start: %d; length: %d" % (mu, lam))
print("Equivalent result at index: %d" % t2)
print("Loop iterations skipped: %d" % (t - t2))
print("Result: %d" % x)
As opposed to the other proposed answers, this approach actually could use a memo array to speed up the process, since the start of the function is actually calculated multiple times (in particular, inside brent), or it may be irrelevant, depending on how big the cycle happens to be.
The algorithm you proposed is already O(n).
To come up with a more efficient algorithm, there are not many directions we can go in. Some typical options are:
1. Decrease the coefficient of the linear term (but I doubt it would make a difference in this case)
2. Change to O(log n) (typically using some sort of divide-and-conquer technique)
3. Change to O(1)
In this case, we can do the last one.
The recursion function is a piece-wise function
f(n) = 2*f(n-1) , if 4*f(n-1) <=k
k - ( 2*f(n-1) ) , if 4*f(n-1) > k
Let's tackle it by case:
case 1: if 4*f(n-1) <= k (1) (assuming the starting index is zero)
this is obviously a geometric sequence
a_n = 2*a_(n-1)
Therefore, we have the formula
a_n = 2^n * f(0) ----(*)
Case 2: if 4*f(n-1) > k (2), we have
a_n = -2*a_(n-1) + k
Assume a_j is the first element of the sequence that satisfies condition (2).
Substituting repeatedly into the formula, you obtain the equation
a_n = k - 2k + 4k - 8k ... + (-2)^(n-j) * a_j
k - 2k + 4k - 8k ... is another geometric series
S_n = k*(1-(-2)^(n-j))/(1-(-2)) ---geometric series sum formula with starting value k and ratio = -2
Therefore, we have a formula for a_n in case 2
a_n = k*(1-(-2)^(n-j))/3 + (-2)^(n-j) * a_j ----(**)
All that is left is to find the a_j that first fails condition (1) and satisfies (2).
This can be obtained in constant time, again using the formula we have for case 1:
find the smallest n such that 4*a_n = 4*2^n*f(0) > k
solve 4*2^n*f(0) = k for n and take the smallest integer strictly greater than the solution
In my first attempt to solve this question, I wrongly assumed that the sequence is monotonically increasing, but in fact the sequence might jump between case 1 and case 2. Therefore, there might not be a constant-time algorithm to solve the problem.
However, we can use the results above to skip most of the iterative updates.
The overall algorithm will look something like:
start with T, k, and f(0)
compute the n at which the condition switches, using either (*) or (**)
replace f(0) with f(n), and replace T with T - n
repeat
terminate when T - n = 0 (the last iteration might overshoot, giving T - n < 0; in that case you need to step back a little)
Create a map that stores your results. Before computing f(n), check whether the solution is already in the map.
If it exists, use that solution.
Otherwise compute it and store it for future use.
For C++:
Definition:
map<long,long> result;
Insertion:
result[key]=value;
Accessing:
value=result[key];
Checking:
map<long,long>::iterator it=result.find(key);
if(it==result.end())
{
    //key was not found, find the solution and insert into result
}
else
{
    return result[key];
}
Use the above technique for a better solution.

How can I evenly distribute distinct keys in a hashtable?

I have this formula:
index = (a * k) % M
which maps a number 'k', from an input set K of distinct numbers, to its position in a hashtable. I was wondering how to write a non-brute-force program that finds 'M' and 'a' such that 'M' is minimal and there are no collisions for the given set K.
If, instead of a numeric multiplication, you could perform a logical computation (and / or / not), I think the optimal solution (minimum value of M) could be as small as card(K), provided you could find a function that relates each value of K (once ordered) to its position in the set.
Theoretically, it must be possible to write a truth table for such a relation (bit by bit), and then simplify the minterms through a Karnaugh map with a suitable program. Depending on the required number of bits, the computational complexity would be affordable... or not.
If a is co-prime to M then a * k = a * k' mod M if, and only if, k = k' mod M, so you might as well use a = 1, which is always co-prime to M. This also covers all the cases in which M is prime, because all the numbers except 0 are then co-prime to M.
If a and M are not co-prime, then they share a common factor, say b, so a = x * b and M = y * b. In this case anything multiplied by a will also be divisible by b mod M, and you might as well be working mod y, not mod M, so there is nothing to be gained by using an a that is not co-prime to M.
So for the problem you state, you could save some time by leaving a=1 and trying all possible values of M.
If you are e.g. using 32-bit integers and really calculating not (a * k) mod M but ((a * k) mod 2^32) mod M you might be able to find cases where values of a other than 1 do better than a=1 because of what happens in (a * k) mod 2^32.
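As a minimal sketch of the a = 1 search suggested above, the following Python function tries increasing values of M until no two keys collide. The name smallest_modulus is hypothetical, the keys are assumed to be distinct non-negative integers, and this is still a trial search over M rather than a closed-form answer.

```python
def smallest_modulus(keys):
    """Return the smallest M such that k % M is distinct for every k in keys."""
    keys = list(keys)
    m = max(len(keys), 1)          # fewer than len(keys) slots cannot work
    while True:
        if len({k % m for k in keys}) == len(keys):
            return m
        m += 1                     # terminates: M = max(keys) + 1 never collides


print(smallest_modulus([0, 3, 6]))    # prints 4 (M = 3 maps every key to 0)
print(smallest_modulus([3, 10, 17]))  # prints 3
```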

Finding the closest integer fraction to a given random real between 0..1, given ranges of numerator and denominator

Given two ranges of positive integers x: [1 ... n] and y: [1 ... m] and random real R from 0 to 1, I need to find the pair of elements (i,j) from x and y such that x_i / y_j is closest to R.
What is the most efficient way to find this pair?
Using Farey sequence
This is a simple and mathematically beautiful algorithm to solve this: run a binary search, where on each iteration the next number is given by the mediant formula (below). By the properties of the Farey sequence that number is the one with the smallest denominator within that interval. Consequently this sequence will always converge and never 'miss' a valid solution.
In pseudocode:
input: m, n, R
a_num = 0, a_denom = 1
b_num = 1, b_denom = 1
repeat:
    -- interestingly c_num/c_denom is already in reduced form
    c_num = a_num + b_num
    c_denom = a_denom + b_denom
    -- if the numbers are too big, return the closest of a and b
    if c_num > n or c_denom > m then
        if R - a_num/a_denom < b_num/b_denom - R then
            return a_num, a_denom
        else
            return b_num, b_denom
    -- adjust the interval:
    if c_num/c_denom < R then
        a_num = c_num, a_denom = c_denom
    else
        b_num = c_num, b_denom = c_denom
    goto repeat
Even though it's fast on average (my educated guess that it's O(log max(m,n))), it can still be slow if R is close to a fraction with a small denominator. For example finding an approximation to 1/1000000 with m = n = 1000000 will take a million iterations.
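The pseudocode above translates fairly directly into Python. The sketch below is a plain transcription under the same assumptions (0 <= R <= 1, numerator limit n, denominator limit m); the name farey_approximation is made up here, and like the pseudocode it can return the starting bound 0/1.

```python
def farey_approximation(R, n, m):
    """Mediant search for a fraction near R with numerator <= n, denominator <= m."""
    a_num, a_den = 0, 1            # lower bound 0/1
    b_num, b_den = 1, 1            # upper bound 1/1
    while True:
        c_num, c_den = a_num + b_num, a_den + b_den   # mediant of a and b
        if c_num > n or c_den > m:
            # Limits exceeded: return whichever bound is closer to R.
            if R - a_num / a_den < b_num / b_den - R:
                return a_num, a_den
            return b_num, b_den
        if c_num / c_den < R:
            a_num, a_den = c_num, c_den
        else:
            b_num, b_den = c_num, c_den


print(farey_approximation(0.5, 10, 10))    # prints (1, 2)
print(farey_approximation(0.333, 5, 5))    # prints (1, 3)
```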
The standard approach to approximating reals with rationals is computing the continued fraction series (see [1]). Put a limit on the numerator and denominator while computing the terms of the series, and the last value before you exceed the limits is a fraction very close to your real number.
This will find a very good approximation very fast, but I'm not sure this will always find a closest approximation. It is known that
any convergent [partial value of the continued fraction expansion] is nearer to the continued fraction than any other fraction whose denominator is less than that of the convergent
but there may be approximations with larger denominator (still below your limit) that are better approximations, but are not convergents.
[1] http://en.wikipedia.org/wiki/Continued_fraction
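A rough Python sketch of the continued-fraction idea follows. It computes successive convergents with the standard recurrence and keeps the last one that stays within both limits. The function name last_convergent_within is an assumption for illustration, and, as noted above, the result is a very good approximation but not necessarily the closest fraction within the limits.

```python
from fractions import Fraction
from math import floor, pi


def last_convergent_within(R, n, m):
    """Last continued-fraction convergent of R with numerator <= n, denominator <= m."""
    x = Fraction(R)                      # exact value of the float
    h_prev, k_prev = 1, 0                # convergent h(-1)/k(-1)
    a = floor(x)
    h, k = a, 1                          # convergent h(0)/k(0)
    best = (h, k)
    while x != a:
        x = 1 / (x - a)                  # next complete quotient
        a = floor(x)
        h, h_prev = a * h + h_prev, h
        k, k_prev = a * k + k_prev, k
        if h > n or k > m:
            break                        # this convergent exceeds the limits
        best = (h, k)
    return best


print(last_convergent_within(0.5, 10, 10))        # prints (1, 2)
print(last_convergent_within(pi - 3, 200, 200))   # prints (16, 113)
```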
Given that R is a real number such that 0 <= R <= 1, integers x: [1 ... n] and integers y: [1 ... m]. It is assumed that n <= m, since if n > m then x[n]/y[m] will be greater than 1, which cannot be the closest approximation to R.
Therefore, the best approximation of R with the denominator d will be either floor(R*d) / d or ceil(R*d) / d.
The problem can be solved in O(m) time and O(1) space (in Python):
from __future__ import division
from random import random
from math import floor
def fractionize(R, n, d):
    error = abs(n/d - R)
    return (n, d, error) # (numerator, denominator, absolute difference to R)
def better(a, b):
    return a if a[2] < b[2] else b
def approximate(R, n, m):
    best = (0, 1, R)
    for d in range(1, m+1):
        n1 = min(n, int(floor(R * d)))
        n2 = min(n, n1 + 1) # ceil(R*d)
        best = better(best, fractionize(R, n1, d))
        best = better(best, fractionize(R, n2, d))
    return best
if __name__ == '__main__':
    def main():
        R = random()
        n = 30
        m = 100
        print(R, approximate(R, n, m))
    main()
Prolly get flamed, but a lookup might be best where we compute all of the fractional values for each of the possible values.. So a simply indexing a 2d array indexed via the fractional parts with the array element containing the real equivalent. I guess we have discrete X and Y parts so this is finite, it wouldnt be the other way around.... Ahh yeah, the actual searching part....erm reet....
Rather than a completely brute force search, do a linear search over the shorter of your lists, using round to find the best match for each element. Maybe something like this:
best_x,best_y=(1,1)
for x in 1...n:
    y=max(1,min(m,round(x/R)))
    #optional optimization (if you have a fast gcd)
    if gcd(x,y)>1:
        continue
    if abs(R-x/y)<abs(R-best_x/best_y):
        best_x,best_y=(x,y)
return (best_x,best_y)
Not at all sure whether the gcd "optimization" will ever be faster...
The Solution:
You can do this in O(1) space and O(m log(n)) time:
there is no need to create any list to search.
The pseudocode may be buggy, but the idea is this:
r: input number to search.
n,m: the ranges.
for (int i=1;i<=m;i++)
{
    minVal = min(Search(i,1,n,r), minVal);
}
//x and y are start and end of array:
decimal Search(i,x,y,r)
{
    if (i/x > r)
        return i/x - r;
    decimal middle1 = i/Cill((x+y)/2);
    decimal middle2 = i/Roof((x+y)/2);
    decimal dist = min(middle1,middle2)
    decimal searchResult = 100000;
    if( middle > r)
        searchResult = Search (i, x, cill((x+y)/2),r)
    else
        searchResult = Search(i, roof((x+y)/2), y,r)
    if (searchResult < dist)
        dist = searchResult;
    return dist;
}
Finding the index is left as homework for the reader.
Description: I think you can understand the idea from the code, but let's trace one iteration of the for loop:
when i=1:
you should search within the numbers below:
1,1/2,1/3,1/4,....,1/n
you check the number against (1,1/cill(n/2)) and (1/floor(n/2), 1/n) and do a similar binary search on it to find the smallest one.
This loop should be run for all items, so it will be done m times, and each time it takes O(log(n)). The function could be improved with some mathematical rules, but it would be complicated, so I skip it.
If the denominator of R is larger than m, use the Farey method (which the Fraction.limit_denominator method implements) with a limit of m to get a fraction a/b where b is at most m; otherwise let a/b = R. With b <= m, either a <= n and you are done, or else let M = math.ceil(n/R) and re-run the Farey method with M as the limit.
def approx2(a, b, n, m):
    from math import ceil
    from fractions import Fraction
    R = Fraction(a, b)
    if R < Fraction(1, m):
        return 1, m
    r = R.limit_denominator(m)
    if r.numerator > n:
        M = ceil(n/R)
        r = R.limit_denominator(M)
    return r.numerator, r.denominator
>>> approx2(113, 205, 50, 200)
(43, 78)
It might be possible to just run the Farey method once using a limiting denominator of min(ceil(n/R), m) but I am not sure about that:
def approx(a, b, n, m):
    from math import ceil
    from fractions import Fraction
    R = Fraction(a, b)
    if R < Fraction(1, m):
        return 1, m
    r = R.limit_denominator(min(ceil(n/R), m))
    return r.numerator, r.denominator

Generating shuffled range using a PRNG rather than shuffling

Is there any known algorithm that can generate a shuffled range [0..n) in linear time and constant space (when output produced iteratively), given an arbitrary seed value?
Assume n may be large, e.g. in the many millions, so a requirement to potentially produce every possible permutation is not required, not least because it's infeasible (the seed value space would need to be huge). This is also the reason for a requirement of constant space. (So, I'm specifically not looking for an array-shuffling algorithm, as that requires that the range is stored in an array of length n, and so would use linear space.)
I'm aware of question 162606, but it doesn't present an answer to this particular question - the mappings from permutation indexes to permutations given in that question would require a huge seed value space.
Ideally, it would act like a LCG with a period and range of n, but the art of selecting a and c for an LCG is subtle. Simply satisfying the constraints for a and c in a full period LCG may satisfy my requirements, but I am wondering if there are any better ideas out there.
Based on Jason's answer, I've made a simple straightforward implementation in C#. Find the next largest power of two greater than N. This makes it trivial to generate a and c, since c needs to be relatively prime (meaning it can't be divisible by 2, aka odd), and (a-1) needs to be divisible by 2, and (a-1) needs to be divisible by 4. Statistically, it should take 1-2 congruences to generate the next number (since 2N >= M >= N).
class Program
{
    IEnumerable<int> GenerateSequence(int N)
    {
        Random r = new Random();
        int M = NextLargestPowerOfTwo(N);
        int c = r.Next(M / 2) * 2 + 1; // make c any odd number between 0 and M
        int a = r.Next(M / 4) * 4 + 1; // M = 2^m, so make (a-1) divisible by all prime factors, and 4
        int start = r.Next(M);
        int x = start;
        do
        {
            x = (a * x + c) % M;
            if (x < N)
                yield return x;
        } while (x != start);
    }
    int NextLargestPowerOfTwo(int n)
    {
        n |= (n >> 1);
        n |= (n >> 2);
        n |= (n >> 4);
        n |= (n >> 8);
        n |= (n >> 16);
        return (n + 1);
    }
    static void Main(string[] args)
    {
        Program p = new Program();
        foreach (int n in p.GenerateSequence(1000))
        {
            Console.WriteLine(n);
        }
        Console.ReadKey();
    }
}
Here is a Python implementation of the Linear Congruential Generator from FryGuy's answer. Because I needed to write it anyway and thought it might be useful for others.
import random
import math
def lcg(start, stop):
    N = stop - start
    # M is the next largest power of 2
    M = int(math.pow(2, math.ceil(math.log(N+1, 2))))
    # c is any odd number between 0 and M
    c = random.randint(0, M//2 - 1) * 2 + 1
    # M=2^m, so make (a-1) divisible by all prime factors and 4
    a = random.randint(0, M//4 - 1) * 4 + 1
    first = random.randint(0, M - 1)
    x = first
    while True:
        x = (a * x + c) % M
        if x < N:
            yield start + x
        if x == first:
            break
if __name__ == "__main__":
    for x in lcg(100, 200):
        print(x, end=' ')
Sounds like you want an algorithm which is guaranteed to produce a cycle from 0 to n-1 without any repeats. There are almost certainly a whole bunch of these depending on your requirements; group theory would be the most helpful branch of mathematics if you want to delve into the theory behind it.
If you want fast and don't care about predictability/security/statistical patterns, an LCG is probably the simplest approach. The wikipedia page you linked to contains this (fairly simple) set of requirements:
The period of a general LCG is at most
m, and for some choices of a much less
than that. The LCG will have a full
period if and only if:
c and m are relatively prime,
a - 1 is divisible by all prime factors of m
a - 1 is a multiple of 4 if m is a multiple of 4
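The three conditions quoted above are easy to check mechanically. The following Python sketch tests them for given a, c and m and, for small m, measures the actual period by brute force. The function names are made up for this illustration.

```python
from math import gcd


def prime_factors(m):
    """Set of prime factors of m, by trial division (fine for small m)."""
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def meets_full_period_conditions(a, c, m):
    if gcd(c, m) != 1:                                   # c and m relatively prime
        return False
    if any((a - 1) % p for p in prime_factors(m)):       # a - 1 divisible by each prime factor
        return False
    return not (m % 4 == 0 and (a - 1) % 4)              # and by 4 when m is


def lcg_period(a, c, m, seed=0):
    """Steps until the seed recurs, or None if it does not within m steps."""
    x = seed
    for steps in range(1, m + 1):
        x = (a * x + c) % m
        if x == seed:
            return steps
    return None


print(meets_full_period_conditions(5, 3, 16), lcg_period(5, 3, 16))  # prints True 16
print(meets_full_period_conditions(3, 3, 16), lcg_period(3, 3, 16))  # prints False 8
```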
Alternatively, you could choose a period N >= n, where N is the smallest value that has convenient numerical properties, and just discard any values produced between n and N-1. For example, the lowest N = 2^k - 1 >= n would let you use linear feedback shift registers (LFSR). Or find your favorite cryptographic algorithm (RSA, AES, DES, whatever) and given a particular key, figure out the space N of numbers it permutes, and for each step apply encryption once.
If n is small but you want the security to be high, that's probably the trickiest case, as any sequence S is likely to have a period N much higher than n, but is also nontrivial to derive a nonrepeating sequence of numbers with a shorter period than N. (e.g. if you could take the output of S mod n and guarantee nonrepeating sequence of numbers, that would give information about S that an attacker might use)
See my article on secure permutations with block ciphers for one way to do it.
Look into Linear Feedback Shift Registers, they can be used for exactly this.
The short way of explaining them is that you start with a seed and then iterate using the formula
x = (x << 1) | f(x)
where f(x) can only return 0 or 1.
If you choose a good function f, x will cycle through all values between 1 and 2^n-1 (where n is some number), in a good, pseudo-random way.
Example functions can be found here, e.g. for 63 values you can use
f(x) = ((x >> 6) & 1) ^ ((x >> 5) & 1)
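As a small illustration of the idea, here is a Python sketch of a 6-bit register that visits all 63 non-zero values before repeating. The tap positions (the two highest bits, counting from zero) and the 0x3F mask are assumptions chosen for this sketch, not necessarily the same bit numbering as the formula above, and the function name is made up.

```python
def lfsr6_states(seed=1):
    """Yield the states of a 6-bit LFSR until the seed comes round again."""
    x = seed
    while True:
        yield x
        feedback = ((x >> 5) & 1) ^ ((x >> 4) & 1)   # XOR of the two top bits
        x = ((x << 1) | feedback) & 0x3F            # shift in the bit, keep 6 bits
        if x == seed:
            return


states = list(lfsr6_states())
print(len(states), len(set(states)))   # prints 63 63
```

To produce a shuffled range [0..n) with n <= 63, one could subtract 1 from each state and skip values that fall outside the range.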

Resources