Convert string to integer (not atoi!) - algorithm

I want to be able to take, as input, a character pointer to a number in base 2 through 16 and, as a second parameter, what base the number is in, and then convert that to its representation in base 2. The integer can be of arbitrary length. My solution now does what the atoi() function does, but I was curious, purely out of academic interest, whether a lookup table solution is possible.
I have found that this is simple for binary, octal, and hexadecimal. I can simply use a lookup table for each digit to get a series of bits. For instance:
0xF1E ---> (F = 1111) (1 = 0001) (E = 1110) ---> 111100011110
0766 ---> (7 = 111) (6 = 110) (6 = 110) ---> 111110110
1000 ---> ??? ---> 1111101000
However, my problem is that I want to do this look up table method for odd bases, like base 10. I know that I could write the algorithm like atoi does and do a bunch of multiplies and adds, but for this specific problem I'm trying to see if I can do it with a look up table. It's definitely not so obvious with base 10, though. I was curious if anyone had any clever way to figure out how to generate a generic look up table for Base X -> Base 2. I know that for base 10, you can't just give it one digit at a time, so the solution would likely have to lookup a group of digits at a time.
I am aware of the multiply and add solution but since these are arbitrary length numbers, the multiply and add operations are not free so I'd like to avoid them, if at all possible.

You will have to use a look up table with an input width of m base b symbols returning n bits so that
n = log2(b) * m
for positive integers b, n and m. So if b is not a power of two, there will be no (simple) look up table solution.
I do not think that there is a solution. The following example with base 10 illustrates why.
65536 = 1 0000 0000 0000 0000
Changing the last digit from 6 to 5 will flip all bits.
65535 = 0 1111 1111 1111 1111
And almost the same will hold if you process the input starting from the end. Changing the first digit from 6 to 5 flips a significant number of bits.
55535 = 0 1101 1000 1110 1111
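The ripple effect described above can be measured by counting the bit positions in which two values differ. This is only a minimal Python sketch; the helper name differing_bits is made up for the illustration.

```python
def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Changing the last decimal digit of 65536 flips all 17 bits.
print(differing_bits(65536, 65535))  # 17
# Changing the first decimal digit of 65535 still touches 5 scattered bits.
print(differing_bits(65535, 55535))  # 5
```

In a power-of-two base, changing one digit would only ever touch the bits of that digit's own group.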

Converting to base 2 with a pure lookup table is not possible for bases that aren't powers of two. The reason it works for base 8 (and 16) is that the conversion works as follows:
octal ABC = 8^2*A + 8^1*B + 8^0*C (decimal)
= 0b1000000*A + 0b1000*B + C (binary)
so if you have the lookup table of A = (0b000 to 0b111), then the multiplication is always by 1 and some trailing zeros, so the multiplication is simple (just shifting left).
However, consider the 'odd' base of 10. When you look at the powers of 10:
10^1 = 0b1010
10^2 = 0b1100100
10^3 = 0b1111101000
10^4 = 0b10011100010000
..etc
You'll notice that the multiplication never gets simple, so you can't have any lookup tables and do bitshifts and ors, no matter how big you group them. It will always overlap. The best you can do is have a lookup table of the form: (a,b) where a is the digit position, and b is the digit (0..9). Then, you are only reduced to adding n numbers, rather than multiplying and adding n numbers (plus the cost of the memory of the lookup table)
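For the power-of-two case, the per-digit lookup from the question can be sketched in Python as below. The function name and the DIGITS alphabet are assumptions made for this example, and only bases 2, 4, 8 and 16 are handled.

```python
DIGITS = "0123456789abcdef"


def bits_from_power_of_two_base(text: str, base: int) -> str:
    """Convert a base 2/4/8/16 numeral to a bit string, one digit at a time."""
    if base < 2 or base > 16 or base & (base - 1):
        raise ValueError("base must be 2, 4, 8 or 16")
    width = base.bit_length() - 1  # number of bits each digit contributes
    # Lookup table: digit character -> fixed-width group of bits.
    table = {DIGITS[d]: format(d, f"0{width}b") for d in range(base)}
    bits = "".join(table[ch] for ch in text.lower())
    return bits.lstrip("0") or "0"


print(bits_from_power_of_two_base("F1E", 16))  # 111100011110
print(bits_from_power_of_two_base("766", 8))   # 111110110
```

No arithmetic on the whole number is needed here because every digit maps to its own non-overlapping group of bits.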

How big are the strings? You can potentially convert the multiply-and-add to a lookup-and-add by doing something like this:
Store the numbers 0-9, 10, 20, 30, 40, ... 90, 100, 200, ... 900, 1000, 2000, ... , 9000, 10000, ... in the target base in a table.
For each character starting with the rightmost, index appropriately into the table and add it to a running result.
Of course I'm not sure how well this will actually perform, but it's a thought.
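A rough Python sketch of this lookup-and-add idea follows. The names build_table and lookup_and_add are hypothetical, and Python's built-in arbitrary-precision integers stand in for whatever big-number type the real code would use.

```python
def build_table(base, max_digits):
    """table[pos][digit] holds digit * base**pos, computed once up front."""
    table = []
    power = 1
    for _ in range(max_digits):
        table.append([digit * power for digit in range(base)])
        power *= base
    return table


def lookup_and_add(text, base, table):
    """Sum the table entries for each digit, starting with the rightmost."""
    total = 0
    for pos, ch in enumerate(reversed(text)):
        # Raises IndexError if text is longer than the table was built for.
        total += table[pos][int(ch, base)]
    return total


table = build_table(10, 8)
print(bin(lookup_and_add("1000", 10, table)))  # 0b1111101000
```

The multiplications move into table construction, so each conversion costs one addition per digit plus the memory for the table.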

The algorithm is quite simple. Language agnostic would be:
total = 0
base <- input_base
for each character in input:
total <- total*base + number(char)
In C++:
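#include <cctype>
#include <stdexcept>
#include <string>
// Helper to convert a digit to a number
unsigned int number( char ch )
{
    if ( ch >= '0' && ch <= '9' ) return ch-'0';
    ch = static_cast<char>( std::toupper( static_cast<unsigned char>(ch) ) );
    if ( ch >= 'A' && ch <= 'F' ) return 10 + (ch-'A');
    // Anything else is not a digit in bases 2 through 16
    throw std::invalid_argument( "not a digit" );
}
// Multiply-and-add over the digits, most significant first
unsigned int parse( std::string const & input, unsigned int base )
{
    unsigned int total = 0;
    for ( std::string::size_type i = 0; i < input.size(); ++i )
    {
        total = total*base + number(input[i]);
    }
    return total;
}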
Of course, you should take care of possible errors (incoherent input: base 2 and input string 'af12') or any other exceptional condition.

Start with a running count of 0.
For each character in the string (reading left to right)
Multiply count by base.
Convert character to int value (0 through base - 1)
Add character value to running count.
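The steps above can be sketched in Python as follows. The function name parse_digits is made up for the example, and rejecting digits that are too large for the base is an added assumption rather than part of the listed steps.

```python
def parse_digits(text: str, base: int) -> int:
    """Multiply-and-add over the characters, reading left to right."""
    count = 0
    for ch in text:
        value = int(ch, 36)  # '0'-'9' -> 0-9, letters -> 10-35
        if value >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        count = count * base + value
    return count


print(parse_digits("1000", 10))  # 1000
print(parse_digits("F1E", 16))   # 3870
```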

How accurate do you need to be?
If you're looking for perfection, then multiply-and-add is really your only recourse. And I'd be very surprised if it's the slowest part of your application.
If order-of-magnitude is good enough, use a lookup table to find the closest power of 2.
Example 1: 1234, closest power of 2 is 1024.
Example 2: 98765, the nearest power of 2 below it is 65536 (131072 is marginally closer)
You could also drive this by counting the number of digits, and multiplying the appropriate power of 2 by the leftmost digit. This can be implemented as a left-shift:
Example 3: 98765 has 5 digits, closest power of 2 to 10000 is 8192 (2^13), so result is 9 << 13

I wrote this before your clarifying comment, so it probably isn't quite applicable. I'm not sure whether a lookup table approach is possible. If you really don't need arbitrary precision, then take advantage of the runtime.
If a C/C++ solution is acceptable, I believe that something like the following is what you are looking for. It probably contains bugs in edge cases, but it does compile and work as expected, at least for positive numbers. Making it really work is an exercise for the reader.
/*
* NAME
* convert_num - convert a numerical string (str) of base (b) to
* a printable binary representation
* SYNOPSIS
* int convert_num(char const* s, int b, char** o)
* DESCRIPTION
* Generates a printable binary representation of an input number
* from an arbitrary base. The input number is passed as the ASCII
* character string `s'. The input string consists of characters
* from the ASCII character set {'0'..'9','A'..('A'+b-10)} where
* letter characters may be in either upper or lower case.
* RETURNS
* The number of characters from the input string `s' which were
* consumed by this operation. The output string is placed into
* newly allocated storage which is pointed to by `*o' upon successful
* completion. An error is signalled by returning `-1'.
*/
#include <stdlib.h>

int
convert_num(char const *str, int b, char **out)
{
    int rc = -1;
    char *endp = NULL;
    char *outp = NULL;
    unsigned long num = strtoul(str, &endp, b);
    if (endp != str) { /* then we have some numbers */
        int numdig = 0;
        unsigned long tmp;
        rc = (int)(endp - str); /* we have this many base `b' digits! */
        /* count the base 2 digits we need; shifting stays exact where
         * frexp((double)num, ...) would round for very large values */
        for (tmp = num; tmp != 0; tmp >>= 1) {
            numdig++;
        }
        if (numdig == 0) {
            numdig = 1; /* zero still needs one digit: "0" */
        }
        if ((outp = malloc(numdig + 1)) == NULL) {
            return -1;
        }
        *out = outp; /* return the buffer */
        outp += numdig; /* make sure it is NUL terminated */
        *outp = '\0';
        while (numdig-- != 0) { /* fill it in from LSb to MSb */
            *--outp = ((num & 1) ? '1' : '0');
            num >>= 1;
        }
    }
    return rc;
}

Related

how to calculate 2^32 without multiplying numbers directly?

The simplest way to calculate 2^32 is 2*2*2*2*2... = 4294967296.
I want to know whether there is any other way to get 4294967296. (2^16 * 2^16 is treated as the same method as 2*2*2...)
How many ways are there to calculate it?
Is there any function to calculate it?
I can't come up with any method to calculate it without 2*2*2...
2 << 31
is a bit shift. It effectively raises 2 to the 32nd power.
Options:
1 << 32
2^32 = (2^32 - 1) + 1 = (((2^32 - 1) + 1) - 1) + 1 = ...
Arrange 32 items on a table. Count the ways you can choose subsets of them.
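If you are not much of a fan of binary magic, then I would suggest quickpower. This function computes x^n in O(log n) time.
/* A 64-bit type is used so that 2^32 itself fits in the result. */
unsigned long long qpower(unsigned long long x, unsigned int n)
{
    if (n == 0) return 1;
    if (n == 1) return x;
    unsigned long long mid = qpower(x, n / 2);
    if (n % 2 == 0) return mid * mid;
    return x * mid * mid;
}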
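The same square-and-multiply idea can also be written without recursion. A small Python sketch, assuming a non-negative integer exponent; the name power is arbitrary:

```python
def power(x: int, n: int) -> int:
    """Compute x**n with O(log n) multiplications (exponentiation by squaring)."""
    result = 1
    while n > 0:
        if n & 1:      # the lowest remaining exponent bit is set
            result *= x
        x *= x         # square the base for the next exponent bit
        n >>= 1
    return result


print(power(2, 32))  # 4294967296
```

For 2^32 the loop runs six times, once per bit of the exponent 32.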
If you are on a common computer you can left bitshift 2 by 31 (i.e. 2<<31) to obtain 2^32.
In standard C:
unsigned long long x = 2ULL << 31;
unsigned long long is needed since a simple unsigned long is not guaranteed to be large enough to store the value of 2<<31.
In section 5.2.4.2.1 paragraph 1 of the C99 standard:
... the
following shall be replaced by expressions that have the same type as would an
expression that is an object of the corresponding type converted according to the integer
promotions. Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
— maximum value for an object of type unsigned long int
ULONG_MAX 4294967295 // 2^32 - 1
— maximum value for an object of type unsigned long long int
ULLONG_MAX 18446744073709551615 // 2^64 - 1
Why not use Math.Pow() (in .NET)? I think most languages (or environments) support a similar function:
Math.Pow(2,32);
In Groovy/Java you can do something like the following with a long (a signed int can hold at most 2^31 - 1 in Java):
long twoPOW32 = 1L << 32;

How to translate Text to Binary with Cocoa?

I'm making a simple Cocoa program that can encode text to binary and decode it back to text. I tried to make this script and I was not even close to accomplishing this. Can anyone help me? This has to include two textboxes and two buttons or whatever is best, Thanks!
There are two parts to this.
The first is to encode the characters of the string into bytes. You do this by sending the string a dataUsingEncoding: message. Which encoding you choose will determine which bytes it gives you for each character. Start with NSUTF8StringEncoding, and then experiment with other encodings, such as NSUnicodeStringEncoding, once you get it working.
The second part is to convert every bit of every byte into either a '0' character or a '1' character, so that, for example, the letter A, encoded in UTF-8 to a single byte, will be represented as 01000001.
So, converting characters to bytes, and converting bytes to characters representing bits. These two are completely separate tasks; the second part should work correctly for any stream of bytes, including any valid stream of encoded characters, any invalid stream of encoded characters, and indeed anything that isn't text at all.
The first part is easy enough:
- (NSString *) stringOfBitsFromEncoding:(NSStringEncoding)encoding
                               ofString:(NSString *)inputString
{
    //Encode the characters to bytes using the requested encoding. The bytes are contained in an NSData object, which we receive.
    NSData *data = [inputString dataUsingEncoding:encoding];
    //I did say these were two separate jobs.
    return [self stringOfBitsFromData:data];
}
For the second part, you'll need to loop through the bytes of the data. A C for loop will do the job there, and that will look like this:
//This is the method we're using above. I'll leave out the method signature and let you fill that in.
- …
{
//Find out how many bytes the data object contains.
NSUInteger length = [data length];
//Get the pointer to those bytes. “const” here means that we promise not to change the values of any of the bytes. (The compiler may give a warning if we don't include this, since we're not allowed to change these bytes anyway.)
const char *bytes = [data bytes];
//We'll store the output here. There are 8 bits per byte, and we'll be putting in one character per bit, so we'll tell NSMutableString that it should make room for (the number of bytes times 8) characters.
NSMutableString *outputString = [NSMutableString stringWithCapacity:length * 8];
//The loop. We start by initializing i to 0, then increment it (add 1 to it) after each pass. We keep looping as long as i < length; when i >= length, the loop ends.
for (NSUInteger i = 0; i < length; ++i) {
    char thisByte = bytes[i];
    //Walk from the highest bit (7) down to the lowest (0), so the output reads the way binary is normally written (01000001 for the letter A).
    for (int bitNum = 7; bitNum >= 0; --bitNum) {
        //Call a function, which I'll show the definition of in a moment, that will get the value of a bit at a given index within a given character.
        bool bit = getBitAtIndex(thisByte, bitNum);
        //If this bit is a 1, append a '1' character; if it is a 0, append a '0' character.
        [outputString appendFormat:@"%c", bit ? '1' : '0'];
    }
}
return outputString;
}
Bits 101 (or, 1100101)
Bits are literally just digits in base 2. Humans in the Western world usually write out numbers in base 10, but a number is a number no matter what base it's written in, and every character, and every byte, and indeed every bit, is just a number.
Digits—including bits—are counted up from the lowest place, according to the exponent to which the base is raised to find the magnitude of that place. We want bits, so that base is 2, so our place values are:
2^0 = 1: The ones place (the lowest bit)
2^1 = 2: The twos place (the next higher bit)
2^2 = 4: The fours place
2^3 = 8: The eights place
And so on, up to 2^7. (Note that the highest exponent is exactly one lower than the number of digits we're after; in this case, 7 vs. 8.)
If that all reminds you of reading about “the ones place”, “the tens place”, “the hundreds place”, etc. when you were a kid, it should: it's the exact same principle.
So a byte such as 65, which (in UTF-8) completely represents the character 'A', is the sum of:
2^7 × 0 = 0
+ 2^6 × 1 = 64
+ 2^5 × 0 = 0
+ 2^4 × 0 = 0
+ 2^3 × 0 = 0
+ 2^2 × 0 = 0
+ 2^1 × 0 = 0
+ 2^0 × 1 = 1
= 0 + 64 +0+0+0+0+0 + 1
= 64 + 1
= 65
Back when you learned base 10 numbers as a kid, you probably noticed that ten is “10”, one hundred is “100”, etc. This is true in base 2 as well: just as 10^x is “1” followed by x “0”s in base 10, 2^x is “1” followed by x “0”s in base 2. So, for example, sixty-four in base 2 is “1000000” (count the zeroes and compare to the table above).
We are going to use these exact-power-of-two numbers to test each bit in each input byte.
Finding the bit
C has a pair of “shift” operators that will insert zeroes or remove digits at the low end of a number. The former is called “shift left”, and is written as <<, and you can guess the opposite.
We want shift left. We want to shift 1 left by the number of the bit we're after. That is exactly equivalent to raising 2 (our base) to the power of that number; for example, 1 << 6 = 2^6 = “1000000”.
Testing the bit
C has an operator for bit testing, too; it's &, the bitwise AND operator. (Do not confuse this with &&, which is the logical AND operator. && is for using whole true/false values in making decisions; & is one of your tools for working with bits within values.)
Strictly speaking, & does not test single bits; it goes through the bits of both input values, and returns a new value whose bits are the bitwise AND of each input pair. So, for example,
01100101
& 00101011
----------
00100001
Each bit in the output is 1 if and only if both of the corresponding input bits were also 1.
Putting these two things together
We're going to use the shift left operator to give us a number where one bit, the nth bit, is set—i.e., 2^n—and then use the bitwise AND operator to test whether the same bit is also set in our input byte.
//This is a C function that takes a char and an int, promising not to change either one, and returns a bool.
bool getBitAtIndex(const char byte, const int bitNum)
//It could also be a method, which would look like this:
//- (bool) bitAtIndex:(const int)bitNum inByte:(const char)byte
//but you would have to change the code above. (Feel free to try it both ways.)
{
//Find 2^bitNum, which will be a number with exactly 1 bit set. For example, when bitNum is 6, this number is “1000000”—a single 1 followed by six 0s—in binary.
const int powerOfTwo = 1 << bitNum;
//Test whether the same bit is also set in the input byte.
bool bitIsSet = byte & powerOfTwo;
return bitIsSet;
}
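For comparison, the same shift-and-mask test can be sketched in Python. The names bit_at_index and bits_of_byte are invented for this example and are not part of the Cocoa code above.

```python
def bit_at_index(byte: int, bit_num: int) -> bool:
    """True if bit number bit_num (0 = lowest) is set in byte."""
    power_of_two = 1 << bit_num       # a value with exactly one bit set
    return bool(byte & power_of_two)  # any non-zero result becomes True


def bits_of_byte(byte: int) -> str:
    """Render one byte as eight '0'/'1' characters, highest bit first."""
    return "".join("1" if bit_at_index(byte, n) else "0" for n in range(7, -1, -1))


print(bits_of_byte(65))  # 01000001
```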
A bit of magic I should acknowledge
The bitwise AND operator does not evaluate to a single bit—it does not evaluate to only 1 or 0. Remember the above example, in which the & operator returned 33.
The bool type is a bit magic: Any time you convert any value to bool, it automatically becomes either 1 or 0. Anything that is not 0 becomes 1; anything that is 0 becomes 0.
The Objective-C BOOL type does not do this, which is why I used bool in the code above. You are free to use whichever you prefer, except that you generally should use BOOL whenever you deal with anything that expects a BOOL, particularly when overriding methods in subclasses or implementing protocols. You can convert back and forth freely, though not losslessly (since bool will change non-zero values as described above).
Oh yeah, you said something about text boxes too
When the user clicks on your button, get the stringValue of your input field, call stringOfBitsFromEncoding:ofString: using a reasonable encoding (such as UTF-8) and that string, and set the resulting string as the new stringValue of your output field.
Extra credit: Add a pop-up button with which the user can choose an encoding.
Extra extra credit: Populate the pop-up button with all of the available encodings, without hard-coding or hard-nibbing a list.
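The question also asks for decoding the bits back to text, which is the two steps in reverse: group the characters into bytes, then decode the bytes. A minimal Python sketch of the round trip, with made-up function names and UTF-8 assumed as the default encoding:

```python
def text_to_bits(text: str, encoding: str = "utf-8") -> str:
    """Encode characters to bytes, then write each byte as 8 bit characters."""
    return "".join(format(byte, "08b") for byte in text.encode(encoding))


def bits_to_text(bits: str, encoding: str = "utf-8") -> str:
    """Group the bit characters into bytes, then decode the bytes to text."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


print(text_to_bits("A"))                 # 01000001
print(bits_to_text("0100100001101001"))  # Hi
```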

Sort N numbers in digit order

Given a range of N numbers, e.g. [1 to 100], sort the numbers in digit order, i.e. for the numbers 1 to 100, the sorted output would be
1 10 100 11 12 13 . . . 19 2 20 21..... 99
This is just like Radix Sort but just that the digits are sorted in reversed order to what would be done in a normal Radix Sort.
I tried to store all the digits in each number as a linked list for faster operation but it results in a large Space Complexity.
I need a working algorithm for the question.
From all the answers, "Converting to Strings" is an option, but is there no other way this can be done?
Also an algorithm for Sorting Strings as mentioned above can also be given.
Use any sorting algorithm you like, but compare the numbers as strings, not as numbers. This is basically lexicographic sorting of regular numbers. Here's an example gnome sort in C:
#include <stdlib.h>
#include <string.h>
void sort(int* array, int length) {
    int* iter = array;
    char buf1[12], buf2[12];
    while (iter < array + length) {
        if (iter == array || strcmp(itoa(*iter, buf1, 10), itoa(*(iter-1), buf2, 10)) >= 0) {
            iter++;
        } else {
            /* out of order: swap with the previous element and step back */
            int tmp = *iter;
            *iter = *(iter-1);
            *(iter-1) = tmp;
            iter--;
        }
    }
}
Of course, this requires the non-standard itoa function to be present in stdlib.h. A more standard alternative would be to use sprintf, but that makes the code a little more cluttered. You'd possibly be better off converting the whole array to strings first, then sorting, then converting it back.
Edit: For reference, the relevant bit here is strcmp(itoa(*iter, buf1, 10), itoa(*(iter-1), buf2, 10)) >= 0, which replaces *iter >= *(iter-1).
I have a solution, but not exactly an algorithm. All you need to do is convert all the numbers to strings and sort them as strings.
Here is how you can do it with a recursive function (the code is in Java):
void doOperation(List<Integer> list, int prefix, int minimum, int maximum) {
for (int i = 0; i <= 9; i++) {
int newNumber = prefix * 10 + i;
if (newNumber >= minimum && newNumber <= maximum) {
list.add(newNumber);
}
if (newNumber > 0 && newNumber <= maximum) {
doOperation(list, newNumber, minimum, maximum);
}
}
}
You call it like this:
List<Integer> numberList = new ArrayList<Integer>();
int min=1, max =100;
doOperation(numberList, 0, min, max);
System.out.println(numberList.toString());
EDIT:
I translated my code in C++ here:
#include <stdio.h>
void doOperation(int list[], int &index, int prefix, int minimum, int maximum) {
for (int i = 0; i <= 9; i++) {
int newNumber = prefix * 10 + i;
if (newNumber >= minimum && newNumber <= maximum) {
list[index++] = newNumber;
}
if (newNumber > 0 && newNumber <= maximum) {
doOperation(list, index, newNumber, minimum, maximum);
}
}
}
int main(void) {
int min=1, max =100;
int* numberList = new int[max-min+1];
int index = 0;
doOperation(numberList, index, 0, min, max);
printf("[");
for(int i=0; i<max-min+1; i++) {
printf("%d ", numberList[i]);
}
printf("]");
return 0;
}
Basically, the idea is: for each digit (0-9), I add it to the array if it is between minimum and maximum. Then, I call the same function with this digit as prefix. It does the same: for each digit, it adds it to the prefix (prefix * 10 + i) and if it is between the limits, it adds it to the array. It stops when newNumber is greater than maximum.
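The same prefix recursion can be sketched as a Python generator. The name digit_order is hypothetical, and the sketch assumes a range of positive integers (zero is skipped).

```python
def digit_order(minimum: int, maximum: int, prefix: int = 0):
    """Yield the integers in [minimum, maximum] in digit (lexicographic) order."""
    for digit in range(10):
        number = prefix * 10 + digit
        if number == 0:
            continue  # a leading zero does not start a number
        if number > maximum:
            break     # larger digits only make the number bigger
        if number >= minimum:
            yield number
        # Everything that starts with this number comes next.
        yield from digit_order(minimum, maximum, number)


print(list(digit_order(1, 100))[:5])  # [1, 10, 100, 11, 12]
```

Nothing is sorted here: the numbers are generated directly in the required order, which is why this only applies to a contiguous range.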
I think if you convert the numbers to strings, you can use string comparison to sort them.
You can use any sorting algorithm for it.
"1" < "10" < "100" < "11" ...
Optimize the way you are storing the numbers: use a binary-coded decimal (BCD) type that gives simple access to a specific digit. Then you can use your current algorithm, which Steve Jessop correctly identified as most significant digit radix sort.
"I tried to store all the digits in each number as a linked list for faster operation but it results in a large Space Complexity."
Storing each digit in a linked list wastes space in two different ways:
A digit (0-9) only requires 4 bits of memory to store, but you are probably using anywhere from 8 to 64 bits. A char takes 8 bits, and an int can take up to 64 bits. That's using 2X to 16X more memory than the optimal solution!
Linked lists add additional unneeded memory overhead. For each digit, you need an additional 32 to 64 bits to store the memory address of the next link. Again, this increases the memory required per digit by 8X to 16X.
A more memory-efficient solution stores BCD digits contiguously in memory:
BCD only uses 4 bits per digit.
Store the digits in a contiguous memory block, like an array. This eliminates the need to store memory addresses. You don't need linked lists' ability to easily insert/delete from the middle. If you need the ability to grow the numbers to an unknown length, there are other abstract data types that allow that with much less overhead. For example, a vector.
One option, if other operations like addition/multiplication are not important, is to allocate enough memory to store each BCD digit plus one BCD terminator. The BCD terminator can be any combination of 4 bits that is not used to represent a BCD digit (like binary 1111). Storing this way will make other operations like addition and multiplication trickier, though.
Note this is very similar to the idea of converting to strings and lexicographically sorting those strings. Integers are internally stored as binary (base 2) in the computer. Storing in BCD is more like base 10 (base 16, actually, but 6 combinations are ignored), and strings are like base 256. Strings will use about twice as much memory, but there are already efficient functions written to sort strings. BCD's will probably require developing a custom BCD type for your needs.
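A rough Python sketch of the packed layout described above: two 4-bit digits per byte, with 1111 as the terminator. The helper names are invented for the example, and non-negative integers are assumed.

```python
TERMINATOR = 0xF  # a 4-bit pattern that is not a valid BCD digit


def to_bcd(number: int) -> bytes:
    """Pack the decimal digits two per byte, followed by a terminator nibble."""
    nibbles = [int(ch) for ch in str(number)]
    nibbles.append(TERMINATOR)
    if len(nibbles) % 2:
        nibbles.append(TERMINATOR)  # pad to a whole number of bytes
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def bcd_digits(packed: bytes):
    """Yield the stored digits, most significant first, up to the terminator."""
    for byte in packed:
        for nibble in (byte >> 4, byte & 0xF):
            if nibble == TERMINATOR:
                return
            yield nibble


numbers = [12, 2, 100, 1]
print(sorted(numbers, key=lambda n: tuple(bcd_digits(to_bcd(n)))))  # [1, 100, 12, 2]
```

Sorting uses the digit tuples rather than the raw bytes because the 1111 terminator compares higher than every digit, so a bytewise comparison would place "10" before "1".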
Edit: I missed that it's a contiguous range. That being the case, all the answers which talk about sorting an array are wrong (including your idea stated in the question that it's like a radix sort), and True Soft's answer is right.
just like Radix Sort but just that the digits are sorted in reversed order
Well spotted :-) If you actually do it that way, funnily enough, it's called an MSD radix sort.
http://en.wikipedia.org/wiki/Radix_sort#Most_significant_digit_radix_sorts
You can implement one very simply, or with a lot of high technology and fanfare. In most programming languages, your particular example faces a slight difficulty. Extracting decimal digits from the natural storage format of an integer, isn't an especially fast operation. You can ignore this and see how long it ends up taking (recommended), or you can add yet more fanfare by converting all the numbers to decimal strings before sorting.
Of course you don't have to implement it as a radix sort: you could use a comparison sort algorithm with an appropriate comparator. For example in C, the following is suitable for use with qsort (unless I've messed it up):
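#include <stdio.h>
#include <string.h>
/* qsort expects a comparator taking two const void pointers */
int lex_compare(const void *a, const void *b) {
    char a_str[12]; // assuming 32bit int
    char b_str[12];
    sprintf(a_str, "%d", *(const int*)a);
    sprintf(b_str, "%d", *(const int*)b);
    return strcmp(a_str,b_str);
}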
Not terribly efficient, since it does a lot of repeated work, but straightforward.
If you do not want to convert them to strings, but have enough space to store an extra copy of the list, I would store the largest power of ten not greater than each element in the copy. This is probably easiest to do with a loop. Now call your original array x and the powers of ten y.
int findPower(int x) {
    int y = 1;
    while (x / 10 >= y) { // same test as y * 10 <= x, but cannot overflow
        y = y * 10;
    }
    return y;
}
You could also compute them directly
y = exp10(floor(log10(x)));
but I suspect that the iteration may be faster than the conversions to and from floating point.
In order to compare the ith and jth elements
bool compare(int i, int j) {
if (y[i] < y[j]) {
int ti = x[i] * (y[j] / y[i]);
if (ti == x[j]) {
return (y[i] < y[j]); // the compiler will optimize this
} else {
return (ti < x[j]);
}
} else if (y[i] > y[j]) {
int tj = x[j] * (y[i] / y[j]);
if (x[i] == tj) {
return (y[i] < y[j]); // the compiler will optimize this
} else {
return (x[i] < tj);
}
} else {
return (x[i] < x[j]);
}
}
What is being done here is that we multiply the smaller number by the appropriate power of ten so that the two numbers have an equal number of digits, and then compare them. If the two modified numbers are equal, we compare the digit lengths instead.
If you do not have the space to store the y arrays you can compute them on each comparison.
In general, you are likely better off using the preoptimized digit conversion routines.
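The scaling comparison described above can be sketched in Python as a three-way comparator. The names leading_power and lex_cmp are made up, positive integers are assumed, and the powers are recomputed on every call rather than stored in a second array.

```python
from functools import cmp_to_key


def leading_power(x: int) -> int:
    """Largest power of ten that is not greater than x (x must be positive)."""
    y = 1
    while y * 10 <= x:
        y *= 10
    return y


def lex_cmp(a: int, b: int) -> int:
    """Negative, zero or positive as a sorts before, with or after b."""
    pa, pb = leading_power(a), leading_power(b)
    # Pad the shorter number with zeros so both have the same digit count.
    sa = a * (pb // pa) if pa < pb else a
    sb = b * (pa // pb) if pb < pa else b
    if sa != sb:
        return -1 if sa < sb else 1
    # Equal after padding: the number with fewer digits sorts first.
    return (pa > pb) - (pa < pb)


data = [12, 2434, 23, 1, 654, 222, 56, 100000]
print(sorted(data, key=cmp_to_key(lex_cmp)))
# [1, 100000, 12, 222, 23, 2434, 56, 654]
```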

What's the best way to strip leading / trailing digits from a number?

If I have a number like 12345, and I want an output of 2345, is there a mathematical algorithm that does this? The hack in me wants to convert the number to a string, and substring it. I know this will work, but I'm sure there has to be a better way, and Google is failing me.
Likewise, for 12345, if I want 1234, is there another algorithm that will do that? The best I can come up with is Floor(x / 10^(n)), where x is the input and n is the number of digits to strip, but I feel like there has to be a better way, and I just can't see it.
In the first case, don't you just want
n % 10000
i.e. the modulus wrt. 10000 ?
For your second case, if you're using integer arithmetic, just divide by 10. You might want to do this in a more 'explicit' fashion by modding with 10 to get the last digit, subtract and then divide (think of a shift in base 10).
Yes, the modulus operator (%) which is present in most languages, can return the n last digits:
12345 % 10^4 = 12345 % 10000 = 2345
Integral division (/ in C/C++/Java) can return the first n digits:
12345 / 10^4 = 12345 / 10000 = 1
Python 3.0:
>>> def remove_most_significant_digit(n, base=10):
...     power = 1
...     while power * base <= n:
...         power *= base
...     return n % power
...
>>> def remove_least_significant_digit(n, base=10):
...     return n // base
...
>>> remove_most_significant_digit(12345)
2345
>>> remove_least_significant_digit(12345)
1234
Converting to a string, and then using a substring method will ultimately be the fastest and best way, since you can just strip characters instead of doing math.
If you really don't want to do that though, you should use modulus (%), which gives the remainder of a division. 11 % 3 = 2, because 3 can only go into 11 three times (9). The remainder is then 2. 41 % 10 = 1, because 10 can go into 41 four times (40). The remainder is then 1.
For stripping the first digits, all you have to do is take the number modulo the appropriate power of ten. To strip two digits from 12345, you should take it modulo 1000: 1000 goes into 12345 twelve times, and the remainder is 345, which is your answer. You just need to find the power of ten that corresponds to the digits you want to keep. Use x % (10^(n)), where x is the input and n is the number of trailing digits you want to keep.
For stripping the last digits, your way works just fine. What's easier than a quick formula like that?
I don't think there's any other approach than division for removing trailing numbers. It might be more efficient to do repeated integral division than to cast to a float, perform an exponent, then floor and cast back to an integer, but the basic idea remains the same.
Keep in mind that the operation is nearly identical for any base. To remove one trailing decimal digit, you do / 10. If you had 0b0111 and you wanted to remove one digit, it would have to be /2. Or you could have 0xff / 16 to get 0x0f.
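A small Python sketch of the same two operations with the base as a parameter; the function names are invented for this example.

```python
def drop_trailing_digits(x: int, count: int, base: int = 10) -> int:
    """Remove the last `count` digits of x as written in the given base."""
    return x // base ** count


def keep_trailing_digits(x: int, count: int, base: int = 10) -> int:
    """Keep only the last `count` digits of x as written in the given base."""
    return x % base ** count


print(drop_trailing_digits(12345, 1))          # 1234
print(keep_trailing_digits(12345, 4))          # 2345
print(hex(drop_trailing_digits(0xff, 1, 16)))  # 0xf
```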
You have to realize that numbers don't have digits, only strings do, and how many (and which) digits they have depends entirely on the base (which numbers don't have either). Internally, computers use what amounts to binary strings. So in general, manipulating base 10 digits requires you to convert the number to a string first - or do calculations that are the same you would do when converting it to a string. However, for your specific task of removing leading and trailing digits, these calculations (modulo and integer division) are very simple and much faster than converting the entire number.
I think that converting to a string and then removing the first char won't do the trick.
I believe that the algorithm for converting to a string does the div-mod routine anyway, so as an optimization you might as well do the div-mod algorithm yourself and adapt it to your needs.
Here is C++ code ... It's not tested.
int myPow ( int n , int k ){
    int ret = 1;
    for (int i=0;i<k;++i) ret*=n;
    return ret;
}
int countDigits (int n ){
    int count = 0;
    while ( n )++count, n/=10;
    return count;
}
int getLastDigits (int number , int numDigits ){
    int tmp = myPow ( 10 , numDigits );
    return number % tmp;
}
int getFirstDigits (int number, int numDigits ){
    int tmp = myPow ( 10, countDigits ( number) - numDigits );
    return number / tmp;
}

How can I sort numbers lexicographically?

Here is the scenario.
I am given an array 'A' of integers. The size of the array is not fixed. The function that I am supposed to write may be called once with an array of just a few integers while another time, it might even contain thousands of integers. Additionally, each integer need not contain the same number of digits.
I am supposed to 'sort' the numbers in the array such that the resulting array has the integers ordered in a lexicographic manner (i.e they are sorted based on their string representations. Here "123" is the string representation of 123). Please note that the output should contain integers only, not their string equivalents.
For example: if the input is:
[ 12 | 2434 | 23 | 1 | 654 | 222 | 56 | 100000 ]
Then the output should be:
[ 1 | 100000 | 12 | 222 | 23 | 2434 | 56 | 654 ]
My initial approach: I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits (this was the messy step as it involved tracking etc making the solution very inefficient) and then did radix sort.
Finally, I removed the padded zeros, converted the strings back to their integers and put them in the resulting array. This was a very inefficient solution.
I've been led to believe that the solution doesn't need padding etc and there is a simple solution where you just have to process the numbers in some way (some bit processing?) to get the result.
What is the space-wise most efficient solution you can think of? Time-wise?
If you are giving code, I'd prefer Java or pseudo-code. But if that doesn't suit you, any such language should be fine.
Executable pseudo-code (aka Python): thenumbers.sort(key=str). Yeah, I know that using Python is kind of like cheating -- it's just too powerful;-). But seriously, this also means: if you can sort an array of strings lexicographically, as Python's sort intrinsically can, then just make the "key string" out of each number and sort that auxiliary array (you can then reconstruct the desired numbers array by a str->int transformation, or by doing the sort on the indices via indirection, etc etc); this is known as DSU (Decorate, Sort, Undecorate) and it's what the key= argument to Python's sort implements.
In more detail (pseudocode):
allocate an array of char** aux as long as the numbers array
for i from 0 to length of numbers-1, aux[i]=stringify(numbers[i])
allocate an array of int indices of the same length
for i from 0 to length of numbers-1, indices[i]=i
sort indices, using as cmp(i,j) strcmp(aux[i],aux[j])
allocate an array of int results of the same length
for i from 0 to length of numbers-1, results[i]=numbers[indices[i]]
memcpy results over numbers
free every aux[i], and also aux, indices, results
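Since the question asked for Java, here is a rough sketch of the same decorate/sort/undecorate steps. The class and method names (DsuLexSort, sortLex) are made up for illustration, and it assumes the usual decimal form from Integer.toString is the string representation you want.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DsuLexSort {
    /** Sorts numbers in place by their decimal strings; each string is built once. */
    static void sortLex(int[] numbers) {
        int n = numbers.length;
        // Decorate: one string per number.
        String[] aux = new String[n];
        for (int i = 0; i < n; i++) {
            aux[i] = Integer.toString(numbers[i]);
        }
        // Sort an array of indices, comparing the strings they point to.
        Integer[] indices = new Integer[n];
        for (int i = 0; i < n; i++) {
            indices[i] = i;
        }
        Arrays.sort(indices, Comparator.comparing((Integer i) -> aux[i]));
        // Undecorate: pull the numbers out in sorted order and copy them back.
        int[] results = new int[n];
        for (int i = 0; i < n; i++) {
            results[i] = numbers[indices[i]];
        }
        System.arraycopy(results, 0, numbers, 0, n);
    }

    public static void main(String[] args) {
        int[] a = {12, 2434, 23, 1, 654, 222, 56, 100000};
        sortLex(a);
        System.out.println(Arrays.toString(a));
    }
}
```

There is nothing to free in Java, but the aux, indices and results arrays are still the extra memory this approach costs.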
Since you mentioned Java is the actual language in question:
You don't need to convert to and from strings. Instead, define your own comparator and use that in the sort.
Specifically:
Comparator<Integer> lexCompare = new Comparator<Integer>() {
    // Comparator declares compare(), not compareTo(), and it must be public
    public int compare( Integer x, Integer y ) {
        return x.toString().compareTo( y.toString() );
    }
};
Then you can sort the array like this:
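Integer[] array = /* whatever */;
Arrays.sort( array, lexCompare );
(Note: the array has to be an Integer[] rather than an int[]. Arrays.sort only accepts a Comparator for object arrays, and auto-boxing converts individual values, not whole arrays.)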
I'd just turn them into strings, and then sort using strcmp, which does lex comparisons.
Alternatively you can write a "lexcmp" function that compares two numbers using % 10 and / 10, but that's basically the same thing as calling itoa many times, so not a good idea.
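For what it's worth, a digit-arithmetic comparison along those lines might look like the sketch below. It assumes non-negative ints, and the names (DigitLexCompare, digitCount) are hypothetical. The idea is to pad the shorter number with zeros by multiplying by ten, compare the padded values, and break ties by digit count so that the shorter number comes first.

```java
final class DigitLexCompare {
    /** Number of decimal digits in a non-negative int (0 counts as one digit). */
    static int digitCount(int n) {
        int count = 1;
        while (n >= 10) {
            n /= 10;
            count++;
        }
        return count;
    }

    /** Compares two non-negative ints the way their decimal strings would compare. */
    static int compare(int a, int b) {
        int da = digitCount(a);
        int db = digitCount(b);
        // Use long so the padded value (at most 10 digits) cannot overflow.
        long x = a;
        long y = b;
        for (int i = da; i < db; i++) {
            x *= 10;
        }
        for (int i = db; i < da; i++) {
            y *= 10;
        }
        if (x != y) {
            return Long.compare(x, y);
        }
        // Same leading digits: the shorter number is the smaller string.
        return Integer.compare(da, db);
    }
}
```

Whether this beats building two short strings per comparison is something you would have to measure.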
The actual sorting can be done by any algorithm you like. The key to this problem is finding the comparison function that will properly identify which numbers should be "less than" others, according to this scheme:
bool isLessThan(int a, int b)
{
    string aString = ToString(a);
    string bString = ToString(b);
    int charCount = min(aString.length(), bString.length());
    for (int charIndex = 0; charIndex < charCount; charIndex++)
    {
        // the first digit that differs decides the order
        if (aString[charIndex] < bString[charIndex]) { return TRUE; }
        if (aString[charIndex] > bString[charIndex]) { return FALSE; }
    }
    // if the numbers are of different lengths, but identical
    // for the common digits (e.g. 123 and 12345)
    // the shorter string is considered "less"
    return (aString.length() < bString.length());
}
My temptation would be to say that the int to string conversion should happen in the comparator code rather than in bulk. Although this may be more elegant from a code perspective, I'd have to say that the execution effort would be greater, as each number may be compared several times.
I'd be inclined to create a new array containing both the int and string representation (not sure that you need to pad the string versions for the string comparison to produce the order you've given), sort that on the string and then copy the int values back to the original array.
I can't think of a smart mathematical way of sorting this: by your own statement you want to sort lexicographically, so you need to transform the numbers to strings to do that.
You definitely don't need to pad the result. It will not change the order of the lexicographical compare, it will be more error prone, and it will just waste CPU cycles. The most "space-wise" efficient method would be to convert the numbers to strings when they are compared. That way, you would not need to allocate an additional array, the numbers would be compared in place.
You can get a reasonably good implementation quickly by just converting them to strings as needed. Stringifying a number isn't particularly expensive and, since you are only dealing with two strings at a time, it is quite likely that they will remain in the CPU cache at all times. So the comparisons will be much faster than the case where you convert the entire array to strings since they will not need to be loaded from main memory into the cache. People tend to forget that a CPU has a cache and that algorithms which do a lot of their work in a small local area of memory will benefit greatly from the much faster cache access. On some architectures, the cache is so much faster than the memory that you can do hundreds of operations on your data in the time it would have taken you to load it from main memory. So doing more work in the comparison function could actually be significantly faster than pre-processing the array. Especially if you have a large array.
Try doing the string serialization and comparison in a comparator function and benchmark that. I think it will be a pretty good solution. Example java-ish pseudo-code:
public static int compare(Number numA, Number numB) {
    // String has compareTo(), not compare()
    return numA.toString().compareTo(numB.toString());
}
I think that any fancy bit wise comparisons you could do would have to be approximately equivalent to the work involved in converting the numbers to strings. So you probably wouldn't get significant benefit. You can't just do a direct bit for bit comparison, that would give you a different order than lexicographical sort. You'll need to be able to figure out each digit for the number anyway, so it is most straightforward to just make them strings. There may be some slick trick, but every avenue I can think of off the top of my head is tricky, error-prone, and much more work than it is worth.
Pseudocode:
sub sort_numbers_lexicographically (array) {
    for 0 <= i < array.length:
        array[i] = munge(array[i]);
    sort(array); // using usual numeric comparisons
    for 0 <= i < array.length:
        array[i] = unmunge(array[i]);
}
So, what are munge and unmunge?
munge is different depending on the integer size. For example:
sub munge (4-bit-unsigned-integer n) {
    switch (n):
        case 0: return 0
        case 1: return 1
        case 2: return 8
        case 3: return 9
        case 4: return 10
        case 5: return 11
        case 6: return 12
        case 7: return 13
        case 8: return 14
        case 9: return 15
        case 10: return 2
        case 11: return 3
        case 12: return 4
        case 13: return 5
        case 14: return 6
        case 15: return 7
}
Essentially, what munge is doing is saying what order 4-bit integers come in when sorted lexicographically. I'm sure you can see that there is a pattern here (I didn't have to use a switch) and that you can write a version of munge that handles 32-bit integers reasonably easily. Think about how you would write versions of munge for 5, 6, and 7 bit integers if you can't immediately see the pattern.
unmunge is the inverse of munge.
So you can avoid converting anything to a string --- you don't need any extra memory.
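If you just want to see the table for other small widths, one way to generate it is sketched below. Note that this is not the arithmetic pattern hinted at above: it simply sorts every value by its decimal string and records the resulting rank, so it does use strings and extra memory, and it is only practical for small bit counts. The names (MungeTable, build) are made up.

```java
import java.util.Arrays;
import java.util.Comparator;

final class MungeTable {
    /** Returns table where table[n] is the rank of n among 0..2^bits-1 in string order. */
    static int[] build(int bits) {
        int size = 1 << bits;
        // order[rank] = the value that has that rank, i.e. the unmunge table.
        Integer[] order = new Integer[size];
        for (int i = 0; i < size; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparing((Integer i) -> i.toString()));
        // Invert it to get munge: value -> rank.
        int[] table = new int[size];
        for (int rank = 0; rank < size; rank++) {
            table[order[rank]] = rank;
        }
        return table;
    }

    public static void main(String[] args) {
        // For 4 bits this reproduces the switch statement above.
        System.out.println(Arrays.toString(build(4)));
    }
}
```

Printing build(5), build(6) and build(7) is a quick way to check a hand-written munge against the pattern.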
If you want to try a better preprocess-sort-postprocess, then note that an int is at most 10 decimal digits (ignoring signed-ness for the time being).
So the binary-coded-decimal data for it fits in 64 bits. Map digit 0->1, 1->2 etc, and use 0 as a NUL terminator (to ensure that "1" comes out less than "10"). Shift each digit in turn, starting with the smallest, into the top of a long. Sort the longs, which will come out in lexicographical order for the original ints. Then convert back by shifting digits one at a time back out of the top of each long:
#include <stdint.h>

uint64_t munge(uint32_t i) {
    uint64_t acc = 0;
    while (i > 0) {
        acc = acc >> 4;
        uint64_t digit = (i % 10) + 1;
        acc += (digit << 60);
        i /= 10;
    }
    return acc;
}
uint32_t demunge(uint64_t l) {
    uint32_t acc = 0;
    while (l > 0) {
        acc *= 10;
        uint32_t digit = (l >> 60) - 1;
        acc += digit;
        l <<= 4; /* was "l << 4;", which discards the result */
    }
    return acc;
}
Or something like that. Since Java doesn't have unsigned ints, you'd have to modify it a little. It uses a lot of working memory (twice the size of the input), but that's still less than your initial approach. It might be faster than converting to strings on the fly in the comparator, but it uses more peak memory. Depending on the GC, it might churn its way through less memory total, though, and require less collection.
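A possible Java adaptation is sketched below, assuming non-negative ints. To sidestep the missing unsigned types, it places the first digit in bits 56 to 59 instead of the top nibble, so every key stays non-negative and a plain Arrays.sort on long[] orders them correctly; ten digits still fit. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class BcdLexSort {
    /** Packs the decimal digits of a non-negative int, most significant first, below bit 60. */
    static long munge(int n) {
        long acc = 0;
        while (n > 0) {
            acc >>>= 4;
            // Digits are stored as 1..10 so that 0 can act as the terminator.
            acc |= (long) (n % 10 + 1) << 56;
            n /= 10;
        }
        return acc;
    }

    /** Inverse of munge: shifts the digits back out from the top. */
    static int demunge(long key) {
        int acc = 0;
        while (key != 0) {
            acc = acc * 10 + (int) (key >>> 56) - 1;
            // Drop the digit just read and keep the top nibble clear.
            key = (key << 4) & 0x0FFFFFFFFFFFFFFFL;
        }
        return acc;
    }

    /** Sorts non-negative ints in place into string order via the packed keys. */
    static void sortLex(int[] numbers) {
        long[] keys = new long[numbers.length];
        for (int i = 0; i < numbers.length; i++) {
            keys[i] = munge(numbers[i]);
        }
        Arrays.sort(keys);
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = demunge(keys[i]);
        }
    }
}
```

Because it sorts a primitive long[], this version allocates no strings or boxed objects at all, only the one key array.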
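If all the numbers are less than 1E+18, you could cast each number to UINT64, multiply by ten and add one, and then multiply by ten until they are at least 1E+18 (19 digits, which still fits in a UINT64; going on to 1E+19 can overflow). Then sort those. To get back the original numbers, divide each number by ten until the last digit is non-zero (it should be one) and then divide by ten once more. Be aware that the appended one makes a number sort after any number that extends it with a zero: 1 becomes 1100... while 10 becomes 1010..., so pairs such as 1 and 10 come out in the opposite order from a string sort.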
The question doesn't indicate how to treat negative integers in the lexicographic collating order. The string-based methods presented earlier typically will sort negative values to the front; eg, { -123, -345, 0, 234, 78 } would be left in that order. But if the minus signs were supposed to be ignored, the output order should be { 0, -123, 234, -345, 78 }. One could adapt a string-based method to produce that order by somewhat-cumbersome additional tests.
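As a rough illustration of the sign-ignoring variant of a string-based method, the comparator below compares the decimal strings of the absolute values. This is only one way to do it, and the names (AbsLexCompare, ABS_LEX) are made up. Values that differ only in sign, such as -5 and 5, compare as equal here.

```java
import java.util.Arrays;
import java.util.Comparator;

final class AbsLexCompare {
    // Widening to long first keeps Math.abs well defined for Integer.MIN_VALUE.
    static final Comparator<Integer> ABS_LEX =
            Comparator.comparing((Integer x) -> Long.toString(Math.abs(x.longValue())));

    public static void main(String[] args) {
        Integer[] a = {-123, -345, 0, 234, 78};
        Arrays.sort(a, ABS_LEX);
        System.out.println(Arrays.toString(a));
    }
}
```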
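It may be simpler, in both theory and code, to use a comparator that compares fractional parts of common logarithms of two integers. That is, it will compare the mantissas of base 10 logarithms of two numbers. A logarithm-based comparator will run faster or slower than a string-based comparator, depending on a CPU's floating-point performance specs and on quality of implementations. One caveat: numbers that differ only by trailing zeros, such as 12, 120 and 1200, have the same mantissa, so this kind of comparator treats them as equal, whereas a string comparison puts the shorter one first.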
The java code shown at the end of this answer includes two logarithm-based comparators: alogCompare and slogCompare. The former ignores signs, so would produce { 0, -123, 234, -345, 78 } from { -123, -345, 0, 234, 78 }.
The number-groups shown next are the output produced by the java program.
The “dar rand” section shows a random-data array dar as generated. It reads across and then down, 5 elements per line. Note, arrays sar, lara, and lars initially are unsorted copies of dar.
The “dar sort” section is dar after sorting via Arrays.sort(dar);.
The “sar lex” section shows array sar after sorting with Arrays.sort(sar,lexCompare);, where lexCompare is similar to the Comparator shown in Jason Cohen's answer.
The “lar s log” section shows array lars after sorting by Arrays.sort(lars,slogCompare);, illustrating a logarithm-based method that gives the same order as do lexCompare and other string-based methods.
The “lar a log” section shows array lara after sorting by Arrays.sort(lara,alogCompare);, illustrating a logarithm-based method that ignores minus signs.
dar rand -335768 115776 -9576 185484 81528
dar rand 79300 0 3128 4095 -69377
dar rand -67584 9900 -50568 -162792 70992
dar sort -335768 -162792 -69377 -67584 -50568
dar sort -9576 0 3128 4095 9900
dar sort 70992 79300 81528 115776 185484
sar lex -162792 -335768 -50568 -67584 -69377
sar lex -9576 0 115776 185484 3128
sar lex 4095 70992 79300 81528 9900
lar s log -162792 -335768 -50568 -67584 -69377
lar s log -9576 0 115776 185484 3128
lar s log 4095 70992 79300 81528 9900
lar a log 0 115776 -162792 185484 3128
lar a log -335768 4095 -50568 -67584 -69377
lar a log 70992 79300 81528 -9576 9900
Java code is shown below.
// Code for "How can I sort numbers lexicographically?" - jw - 2 Jul 2014
import java.util.Random;
import java.util.Comparator;
import java.lang.Math;
import java.util.Arrays;

public class lex882954 {
    // Comparator from Jason Cohen's answer
    public static Comparator<Integer> lexCompare = new Comparator<Integer>(){
        public int compare( Integer x, Integer y ) {
            return x.toString().compareTo( y.toString() );
        }
    };
    // Comparator that uses "abs." logarithms of numbers instead of strings
    public static Comparator<Integer> alogCompare = new Comparator<Integer>(){
        public int compare( Integer x, Integer y ) {
            Double xl = (x==0)? 0 : Math.log10(Math.abs(x));
            Double yl = (y==0)? 0 : Math.log10(Math.abs(y));
            Double xf=xl-xl.intValue();
            return xf.compareTo(yl-yl.intValue());
        }
    };
    // Comparator that uses "signed" logarithms of numbers instead of strings
    public static Comparator<Integer> slogCompare = new Comparator<Integer>(){
        public int compare( Integer x, Integer y ) {
            Double xl = (x==0)? 0 : Math.log10(Math.abs(x));
            Double yl = (y==0)? 0 : Math.log10(Math.abs(y));
            Double xf=xl-xl.intValue()+Integer.signum(x);
            return xf.compareTo(yl-yl.intValue()+Integer.signum(y));
        }
    };
    // Print array before or after sorting
    public static void printArr(Integer[] ar, int asize, String aname) {
        int j;
        for(j=0; j < asize; ++j) {
            if (j%5==0)
                System.out.printf("%n%8s ", aname);
            System.out.printf(" %9d", ar[j]);
        }
        System.out.println();
    }
    // Main Program -- to test comparators
    public static void main(String[] args) {
        int j, dasize=15, hir=99;
        Random rnd = new Random(12345);
        Integer[] dar = new Integer[dasize];
        Integer[] sar = new Integer[dasize];
        Integer[] lara = new Integer[dasize];
        Integer[] lars = new Integer[dasize];
        for(j=0; j < dasize; ++j) {
            lara[j] = lars[j] = sar[j] = dar[j] = rnd.nextInt(hir) *
                rnd.nextInt(hir) * (rnd.nextInt(hir)-44);
        }
        printArr(dar, dasize, "dar rand");
        Arrays.sort(dar);
        printArr(dar, dasize, "dar sort");
        Arrays.sort(sar, lexCompare);
        printArr(sar, dasize, "sar lex");
        Arrays.sort(lars, slogCompare);
        printArr(lars, dasize, "lar s log");
        Arrays.sort(lara, alogCompare);
        printArr(lara, dasize, "lar a log");
    }
}
If you're going for space-wise efficiency, I'd try just doing the work in the comparison function of the sort:
int compare(int a, int b) {
    // convert a to string
    // convert b to string
    // return -1 if a's string sorts before b's, 0 if they are equal, 1 if it sorts after
}
If it's too slow (it's slower than preprocessing, for sure), keep track of the conversions somewhere so that the comparison function doesn't keep having to do them.
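One way to "keep track of the conversions" is to memoize the strings inside the comparator, roughly as below. This is just a sketch with invented names (CachedLexCompare, create). It trades the space saving for speed, since the cache ends up holding one string per distinct number.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class CachedLexCompare {
    /** Returns a comparator that converts each distinct Integer to a string only once. */
    static Comparator<Integer> create() {
        Map<Integer, String> cache = new HashMap<>();
        return (x, y) -> {
            String xs = cache.computeIfAbsent(x, k -> k.toString());
            String ys = cache.computeIfAbsent(y, k -> k.toString());
            return xs.compareTo(ys);
        };
    }

    public static void main(String[] args) {
        Integer[] a = {12, 2434, 23, 1, 654, 222, 56, 100000};
        Arrays.sort(a, create());
        System.out.println(Arrays.toString(a));
    }
}
```

The comparator is not thread-safe as written, so it should not be shared with a parallel sort.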
Possible optimization: Instead of this:
I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits
you can multiply each number by 10^(N - floor(log10(number))), N being a number larger than log10 of any of your numbers.
#!/usr/bin/perl
use strict;
use warnings;
my @x = ( 12, 2434, 23, 1, 654, 222, 56, 100000 );
print $_, "\n" for sort @x;
__END__
Some timings ... First, with empty @x:
C:\Temp> timethis s-empty
TimeThis : Elapsed Time : 00:00:00.188
Now, with 10,000 randomly generated elements:
TimeThis : Elapsed Time : 00:00:00.219
This includes the time taken to generate the 10,000 elements but not the time to output them to the console. The output adds about a second.
So, save some programmer time ;-)
One really hacky method (using C) would be:
generate a new array of all the values converted to floats
do a sort using the mantissa (significand) bits for the comparison
In Java (from here):
long bits = Double.doubleToLongBits(5894.349580349);
boolean negative = (bits & 0x8000000000000000L) != 0;
long exponent = (bits & 0x7ff0000000000000L) >> 52; // parentheses needed: >> binds tighter than &
long mantissa = bits & 0x000fffffffffffffL;
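so you would sort on the long mantissa here. Keep in mind that this is a binary significand, so it reflects the leading binary digits of each number rather than its leading decimal digits.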
