Which base should I use in radix sort? And how do I convert between bases?

If I have to sort a list of base-10 integers, do I first convert these integers to, say, base 2, then perform the radix sort, and finally convert the integers back to base 10?
In general, how do you perform radix sort with a radix different from the base of the integers in the list?

Generally speaking, this depends on how the inputs are represented.
If your inputs are represented as fixed-width integer values, then it's easy to convert between bases by using division and mod. For example, you can get the last base-b digit of a number n by computing n % b and can drop that digit off by computing n / b (rounding down).
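For concreteness, here is a minimal sketch of an LSD radix sort in Python built on exactly that div/mod digit extraction; the base b is a free parameter, and the function name is just illustrative:

    def radix_sort(nums, b=10):
        """Sort non-negative integers with stable base-b bucket passes."""
        if not nums:
            return nums
        max_val = max(nums)
        exp = 1                                    # b**k for the current digit position
        while max_val // exp > 0:
            buckets = [[] for _ in range(b)]
            for n in nums:
                buckets[(n // exp) % b].append(n)  # the current digit is (n // exp) % b
            nums = [n for bucket in buckets for n in bucket]  # stable gather
            exp *= b
        return nums

    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66], b=2))
    # [2, 24, 45, 66, 75, 90, 170, 802]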
If your inputs are represented as strings, then it's harder to reinterpret the characters in other bases. Without using fancy algorithmic techniques, you can usually switch to bases that are powers of the base in which the number is written by treating blocks of digits as individual digits. You can also use smaller bases by, for example, rewriting each digit individually in the smaller base and then using a radix sort that advances less than one original digit at a time.
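As a tiny illustration of the block idea, assuming the input is a string of decimal digits, reading three characters at a time yields the number's base-1000 digits without any general base conversion (the function name is mine):

    def base1000_digits(decimal_str):
        """Base-1000 digits (most significant first) of a decimal string."""
        pad = (-len(decimal_str)) % 3        # left-pad to a multiple of 3
        s = "0" * pad + decimal_str
        return [int(s[i:i+3]) for i in range(0, len(s), 3)]

    print(base1000_digits("1234567"))  # [1, 234, 567]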
If you're interested in using a really heavyweight theoretical technique, this paper demonstrates a way to encode numbers of any base in binary that allows constant-time random access to the digits with no loss in space. This is certainly far more advanced than the other approaches, and the constant factor would probably make it inefficient in practice, but it shows that in theory you can use any base you'd like.

Related

Runtime of double sorting algorithm

In this picture, I'm having trouble understanding why sorting the array requires understanding that "Each string comparison takes O(s) time", therefore multiplying the a*log(a) term by s. Why are string comparisons non-constant time? I'm having trouble conceptualizing this.
As a follow-up, is our approach of multiplying by s similar to the reason sorting each string requires O(s log(s)) as opposed to O(log(s))? Is the additional "s" in "sorting each string is O(s log(s))" a result of having to compare characters within the string (as opposed to comparing entire strings to each other in the array)? Sorry if this isn't making much sense, but it's somewhat of a confusing topic for me.
I think you mixed up two things. Sorting a single string of length s takes O(s log s).
As for comparing two strings s1 and s2: they can be arbitrarily large, and we cannot fit them in a register to compare them in O(1), so we need O(|s1| + |s2|) time to compare them character by character. In the picture, the author assumed the longest string has length at most s.
Given two strings (say bytestrings), each of length s, a comparison takes however long it takes to scan past the prefix common to both strings and find the first difference. For two random strings, the average length of the common prefix is approximately 1/256, regardless of string length, so a comparison would be O(1).
However, these are sorted strings. If the strings were random to start, after being sorted, they each start with, on average, s/256 zero bytes that need to be scanned (for this purpose, the chance that they have the same number of zero bytes is negligible, so we don't have to worry about the rest), so that's why comparison of sorted strings ends up being O(s). (Edited with a correction: actually, if the string representation can provide the length in constant time, I guess you can use an O(log s) binary search to compare sorted strings.)
On the other hand, we're talking about comparing strings in the context of a sort, and it's not totally clear to me that we can assume the total cost is simply the product of O(s) per comparison and the expected number of comparisons, since the average similarity of the strings being compared (as measured by their common prefix) is likely to increase over the course of the sort.
As a result, I suspect it's possible to calculate the time complexity in the size of the strings for a fixed number of strings, and vice versa, using the argument above, but I'm not so convinced the joint time complexity in s and a holds for all possible ways that s and a can go to infinity.

Choosing radix and modulus prime in Rabin-Karp rolling hash

The hash function is explained on Wikipedia
It says, "The choice of a and n is critical to get good hashing," and refers to a linear congruential generator article that doesn't feel relevant. I can't figure out how the values are chosen. Any suggestions?
The basis of this algorithm is that a nonzero polynomial of degree at most d has at most d zeros. Each length-k string has its own associated polynomial of degree k - 1, and we screen for possible matches by subtracting the polynomials of the strings in question and evaluating at a. If the strings are equal, then the result is always zero. If the strings are not equal, then the result is zero if and only if a is one of the zeros of the polynomial difference (this is the fact that puts the primality requirement on n, as the integers mod n otherwise would not be a field).
In theory, at least, we want a to be random so that an oblivious adversary cannot create false positives with any frequency. If we don't expect trouble, then it might be better to choose a so that multiplication by a is cheap (e.g., the binary expansion of a has a small number of one bits). Nevertheless, some choices are bad on typical string sets (e.g., a = 1). We want n to be large enough to avoid false positives (probability (k - 1)/n) by random chance but small enough and preferably of a special form so that the modulo computations are efficient.
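Here is a sketch of that scheme in Python. The Mersenne prime and the random choice of a are illustrative picks consistent with the advice above (a special-form modulus for cheap reduction, a random evaluation point), not the only valid ones:

    import random

    N = (1 << 61) - 1           # Mersenne prime: prime, large, cheap to reduce by
    A = random.randrange(2, N)  # random evaluation point defeats oblivious adversaries

    def poly_hash(s):
        """Hash s as a polynomial in A over the integers mod N."""
        h = 0
        for ch in s:
            h = (h * A + ord(ch)) % N
        return h

    def roll(h, out_ch, in_ch, a_pow):
        """Slide the window: drop out_ch (weight a_pow = A**(k-1)), append in_ch."""
        h = (h - ord(out_ch) * a_pow) % N
        return (h * A + ord(in_ch)) % N

    def rabin_karp(text, pat):
        """Indices where pat occurs in text; candidates verified to kill false positives."""
        k = len(pat)
        if k == 0 or k > len(text):
            return []
        a_pow = pow(A, k - 1, N)
        target, h = poly_hash(pat), poly_hash(text[:k])
        hits = []
        for i in range(len(text) - k + 1):
            if h == target and text[i:i+k] == pat:
                hits.append(i)
            if i + k < len(text):
                h = roll(h, text[i], text[i + k], a_pow)
        return hits

    print(rabin_karp("abracadabra", "abra"))  # [0, 7]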

Fastest way to use big numbers

Is there any way to reasonably operate on very big integers (millions or billions of digits)? The operations I would need to do are simple: +, -, *, and maybe /.
What I am looking for are algorithms to do the above operations in a reasonable time (say up to 1 hour on a modern PC). I don't mind using any type of representation for the numbers, but if I need a different representation for each operation, then conversion between the different representations should also be done in reasonable time.
All the big-number libraries I have looked at completely break down when used for numbers of this size. Is this a sign that no such algorithms exist, or just that these libraries' representations/implementations are not optimized for such sizes?
EDIT: The 1-hour limit is probably impossible. I gave that number because a simple loop over a billion iterations should take less than that, and I was hoping for an algorithm that would use O(n) time. Does a limit of 24 hours seem more reasonable?
You may wish to take a look at the DecInt Python class.
This is optimised for very long decimal integers. (The numbers are stored in a representation that makes it easy to convert to decimal digits in O(n) time).
It can do the operations you wish including:
Multiplication in O(n ln(n))
Division in O(n ln(n)^2)
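For scale (this is not DecInt itself, whose internals aren't shown here): Python's built-in arbitrary-precision ints already cope with millions of digits, but CPython multiplies them with Karatsuba at roughly O(n**1.585), which is why FFT-based multiplication (e.g., GMP via gmpy2) pulls far ahead as the digit counts grow:

    import time

    # Baseline with built-in ints; the sizes here are illustrative only.
    digits = 1_000_000
    x = 10 ** digits - 1    # a million nines
    y = 10 ** digits - 3

    t0 = time.perf_counter()
    z = x * y
    print(f"multiplied two {digits}-digit ints in {time.perf_counter() - t0:.2f}s")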

Linear-time sorting for all categories

I had this maybe-stupid thought:
Since we have linear-time sorting algorithms for constrained categories, such as counting sort and radix sort for integers,
and since, inside a computer, all categories of number types are ultimately encoded as byte sequences (which are to some extent similar to integers), is it fair to say that we can sort all of these numbers in linear time using those linear-time sorting algorithms?
Sure, although details vary from type to type. One simple example is IEEE-754 floating point values (both 32-bit and 64-bit), which can almost be sorted as though they were integers. (More specifically, they can be sorted as though they were sign-magnitude integers.) So a radix-sort would work fine.
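A sketch of that sign-magnitude trick for 64-bit doubles in Python (the key-function name is mine): flip every bit of a negative value and only the sign bit of a non-negative one, and the resulting unsigned integers order exactly as the floats do, ready for any integer radix sort:

    import struct

    def float_key(x):
        """Map a double to a uint64 whose unsigned order matches float order."""
        bits = struct.unpack("<Q", struct.pack("<d", x))[0]
        if bits & (1 << 63):                  # negative: reverse the magnitude order too
            return bits ^ 0xFFFFFFFFFFFFFFFF
        return bits | (1 << 63)               # non-negative: lift above all negatives

    vals = [3.5, -0.1, 0.0, -2.75, 1e-9]
    print(sorted(vals, key=float_key))        # [-2.75, -0.1, 0.0, 1e-9, 3.5]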
For character strings, a not-uncommon technique when you have too many of them to fit in memory is to "bin" them by prefix, which is a variety of radix-sort.
For short bit-field values (like integers or, as above, floating point numbers), a left-to-right bit-at-a-time radix sort is really just a variant of quicksort, since it is basically just a way to find a plausible pivot. Unlike quicksort, it guarantees a finite recursion depth (32 in the case of 32-bit values). On the other hand, quicksort usually has a much smaller recursion depth, since log2 of the dataset size is usually a lot less than 32.
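A minimal sketch of that bit-at-a-time sort on 32-bit values makes the quicksort resemblance plain: each level is a partition whose "pivot" is implied by the current bit, and the recursion depth can never exceed 32:

    def msd_radix_sort(nums, bit=31):
        """Sort 32-bit non-negative ints by partitioning on one bit per level."""
        if len(nums) <= 1 or bit < 0:
            return nums
        zeros = [n for n in nums if not (n >> bit) & 1]
        ones = [n for n in nums if (n >> bit) & 1]
        return msd_radix_sort(zeros, bit - 1) + msd_radix_sort(ones, bit - 1)

    print(msd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
    # [2, 24, 45, 66, 75, 90, 170, 802]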
The main advantage of quicksort is that you can write the algorithm (STL style) without knowing anything about the datatype being sorted at all, other than how to call a function to compare two values. The same cannot be said of radix-sort; it's a lot harder to make a generic version.
Edited to add one important point:
It's very common to overemphasize the difference between O(n) and O(n log n). For very large n, they are different. But for most real-world non-Google-sized problems, log n is a small integer. It wouldn't make sense to use an O(n) algorithm which takes 100n seconds when there is an O(n log n) algorithm which takes 2n log2 n seconds, unless log n were greater than 50, which is to say that n were greater than 1,125,899,906,842,624.
No, you cannot. If you have a piece of data represented by the bytes below:
11001100 00110011
(204) (51)
If you were to sort these using something like radix sort you would get:
00110011 11001100
(51) (204)
The only problem with this is that this is no longer the piece of data you wrote to disk; it is a completely different piece of data that may not even mean anything at all (garbage).

Generate N quasi random numbers in less than O(N)

This was inspired by a question at a job interview: how do you efficiently generate N unique random numbers? Their security and distribution/bias don't matter.
I proposed the naive way of calling rand() N times and eliminating dupes by trial and error, thus getting an inefficient and flawed solution. Then I read this SO question; those algorithms are great for getting high-quality unique numbers, and they are O(N).
But I suspect there are ways to get low-quality unique random numbers for dummy tasks in less than O(N) time complexity. Some possible ideas:
Store many precomputed lists, each containing N numbers, and retrieve one list at random. Complexity is O(1) for fixed N. Storage space used is O(NR), where R is the number of lists.
Generate N/2 unique random numbers and then split each into two unequal parts (floor/ceil for odd numbers, n+1/n-1 for even). I know this is flawed (duplicates can pop up) and O(N/2) is still O(N); it's more food for thought.
Generate one big random number and then squeeze more variants from it by some fixed manipulations like bitwise operations, factorization, recursion, MapReduce or something else.
Use a quasi-random sequence somehow (not a math guy, just googled this term).
Your ideas?
Presumably this routine has some kind of output (i.e. the results are written to an array of some kind). Populating an array (or some other data-structure) of size N is at least an O(N) operation, so you can't do better than O(N).
You can generate the numbers one at a time and, whenever the result set already contains the newly generated number, add to it the maximum of the numbers generated so far to make it unique.
Detecting whether a number has already been generated is O(1) using a hash set, so the whole thing is O(N) with only N random() calls.
Of course, this assumes we do not overflow the upper limit of the integer type (i.e., it may need a BigInteger).
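Under one reading of the collision rule above (offset past the current maximum, so the replacement is guaranteed fresh), a minimal sketch looks like this; the upper bound is an arbitrary illustrative choice:

    import random

    def unique_randoms(n, upper=10**9):
        """n unique low-quality randoms: n random() calls, O(1) set lookups."""
        seen, out, max_so_far = set(), [], 0
        for _ in range(n):
            r = random.randrange(upper)
            if r in seen:
                r = max_so_far + r + 1   # strictly above everything seen so far
            seen.add(r)
            out.append(r)
            max_so_far = max(max_so_far, r)
        return out

    nums = unique_randoms(10)
    assert len(set(nums)) == len(nums)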
