I want to convert a fixed-length (say, 50-character) randomized string into a 64-bit integer and be able to convert it back to the original text given the 64-bit integer.
Does an algorithm exist for this? I want to go with encoding/decoding rather than hashing/reverse lookup.
Just a summarization of the comments...
A 1:1 mapping between string and number requires enough characters and bits to store your data. Assuming a 26-character alphabet only:
64bit -> 2^64 // possible numbers in 64 bits
1char -> 26 // possible characters per 1 char
so in order to get the number of chars fitting into 64 bit integer
chars = floor( 64 / (log(26)/log(2)) )
= floor( 64 / 4.7004397181410921603968126542567)
= floor( 13.6 )
= 13
if you want to know how many bits you need for 50 chars:
bits = ceil( 50 / (log(2)/log(26)) )
= ceil( 50 / 0.21274605355336315360618778415321 )
= ceil( 235.02198590705460801984063271284 )
= 236
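The same arithmetic as a quick sanity check in code (a minimal C++ sketch, just computing the two quantities above):
#include <cmath>
#include <cstdio>

int main()
{
    double bits_per_char = std::log2(26.0);                          // log(26)/log(2) ~= 4.7004
    int chars_in_64_bits = (int)std::floor(64.0 / bits_per_char);    // 13
    int bits_for_50_chars = (int)std::ceil(50.0 * bits_per_char);    // 236
    std::printf("%d chars fit into 64 bits, 50 chars need %d bits\n",
                chars_in_64_bits, bits_for_50_chars);
    return 0;
}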
Now if you want to encode 13 chars (a..z) from text into a 64-bit unsigned integer x:
char text[14] = "blablablablab"; // 13 lowercase letters (a..z only) plus the terminating NUL
unsigned long long x,m;          // 64-bit accumulator (26^13 does not fit in 32 bits)
unsigned int i;
for (i=0,x=0,m=1;i<13;i++,m*=26)
    x += ((unsigned long long)(text[i]-'a'))*m;
And decoding back:
for (i=0;i<13;i++)
{
text[i] = (x%26)+'a';
x /= 26;
}
As you can see, it's the same as converting between numbers in different bases...
In case you want faster dec/enc at the cost of text size, you can ceil the number of bits per single character to 5, meaning floor(64/5) = 12 chars, and use bit operations instead (each character would occupy 5 bits in the number)...
char text[13] = "blablablabla"; // 12 lowercase letters (a..z only) plus the terminating NUL
unsigned long long x;           // 12 chars * 5 bits = 60 bits, so use a 64-bit type
unsigned int i;
for (i=0,x=0;i<12;i++)
{
    x <<= 5;
    x |= text[i]-'a';
}
for (i=0;i<12;i++)
{
text[11-i] = (x&31)+'a';
x >>= 5;
}
However, if you have any additional knowledge about the characters, it's possible to implement compression, but only in cases where entropy allows it... for more info google RLE, Huffman encoding...
I have working C# code to split an integer into 2 bytes, as you can see below. I need to rewrite the same in Ruby:
int seat2 = 65000;
// Split into two bytes
byte seats = (byte)(seat2 & 0xFF); // lower byte
byte options = (byte)((seat2 >> 8) & 0xFF); // upper byte
Below is the output of the above:
seats => 232
options => 253
// Merge back into integer
seat2 = (options << 8) | seats;
Please suggest a solution for rewriting the above in Ruby.
The code you wrote would work well in Ruby with very few modifications.
You could simply try:
seat2 = 65000
seat2 & 0xFF
# => 232
(seat2 >> 8) & 0xFF
# => 253
An alternative would be to use pack and unpack:
[65000].pack('S').unpack('CC')
# => [232, 253]
[232, 253].pack('CC').unpack('S')
# => [65000]
I believe the most idiomatic way for binary transformations in Ruby is Array#pack and String#unpack (like in Eric's answer).
Also, you have the option to use Numeric#divmod with 256 (2^8, the byte size):
> upper, lower = 65000.divmod(256)
# => [253, 232]
> upper
# => 253
> lower
# => 232
In this case, to have correct bytes, your Integer should not exceed 65535 (2^16-1).
Another one:
lower, upper = 65000.digits(256)
I am working with a microcontroller which calculates the CRC32 checksum of data I upload to its flash memory on the fly. This can in turn be used to verify that the upload was correct, by verifying the resulting checksum after all data is uploaded.
The only problem is that the microcontroller reverses the bit order of the input bytes when they are run through the otherwise standard CRC32 calculation. This in turn means I need to reverse every byte in the data on the programming host in order to calculate the CRC32 sum to verify. As the programming host is somewhat constrained, this is quite slow.
I figure that if it's possible to modify the CRC32 lookup table so I can do the lookup without having to reverse the bit order, the verification algorithm would run many times faster. But I seem unable to figure out a way to do this.
To clarify the byte reversal, I need to change the input bytes following way:
01 02 03 04 -> 80 40 C0 20
It's a lot easier to see the reversal in binary representation of course:
00000001 00000010 00000011 00000100 ->
10000000 01000000 11000000 00100000
Edit
Here is the PoC Python code I use to verify the correctness of the CRC32 calculation; however, this reverses each byte (i.e. the slow way).
Edit 2
I've also included my failed attempt, which generates a permuted lookup table and uses a standard LUT CRC32 algorithm.
The code spits out the correct reference CRC value first, and then the wrong LUT calculated CRC afterwards.
import binascii
CRC32_POLY = 0xEDB88320
def reverse_byte_bits(x):
'''
Reverses the bit order of the given byte 'x' and returns the result
'''
x = ((x<<4) & 0xF0)|((x>>4) & 0x0F)
x = ((x<<2) & 0xCC)|((x>>2) & 0x33)
x = ((x<<1) & 0xAA)|((x>>1) & 0x55)
return x
def reverse_bits(ba, blen):
'''
Reverses all bytes in the given array of bytes
'''
bar = bytearray()
for i in range(0, blen):
bar.append(reverse_byte_bits(ba[i]))
return bar
def crc32_reverse(ba):
# Reverse the bits in each byte of the input array
bar = reverse_bits(ba, len(ba))
# Calculate the CRC value
return binascii.crc32(bar)
def gen_crc_table_msb():
crctable = [0] * 256
for i in range(0, 256):
remainder = i
for bit in range(0, 8):
if remainder & 0x1:
remainder = (remainder >> 1) ^ CRC32_POLY
else:
remainder = (remainder >> 1)
# The correct index for the calculated value is the reverse of the index
ix = reverse_byte_bits(i)
crctable[ix] = remainder
return crctable
def crc32_revlut(ba, lut):
crc = 0xFFFFFFFF
for x in ba:
crc = lut[x ^ (crc & 0xFF)] ^ (crc >> 8)
return ~crc
# Reference test which gives the correct CRC
test = bytearray([1, 2, 3, 4, 5, 6, 7, 8])
crcrev = crc32_reverse(test)
print("0x%08X" % (crcrev & 0xFFFFFFFF))
# Test using permutated lookup table, but standard CRC32 LUT algorithm
lut = gen_crc_table_msb()
crctst = crc32_revlut(test, lut)
print("0x%08X" % (crctst & 0xFFFFFFFF))
Does anyone have any hints to how this could be done?
By reversing the direction in which the CRC "streams", the reversal in the main calculation can be avoided. So instead of crc >> 8 there would be crc << 8, and instead of XORing the bottom byte of the crc into the LUT index we take the top byte. Like this:
def reverse_dword_bits(x):
'''
Reverses the bit order of the given dword 'x' and returns the result
'''
x = ((x<<16) & 0xFFFF0000)|((x>>16) & 0x0000FFFF)
x = ((x<<8) & 0xFF00FF00)|((x>>8) & 0x00FF00FF)
x = ((x<<4) & 0xF0F0F0F0)|((x>>4) & 0x0F0F0F0F)
x = ((x<<2) & 0xCCCCCCCC)|((x>>2) & 0x33333333)
x = ((x<<1) & 0xAAAAAAAA)|((x>>1) & 0x55555555)
return x
def gen_crc_table_msb():
crctable = [0] * 256
for i in range(0, 256):
remainder = i
for bit in range(0, 8):
if remainder & 0x1:
remainder = (remainder >> 1) ^ CRC32_POLY
else:
remainder = (remainder >> 1)
# The correct index for the calculated value is the reverse of the index
ix = reverse_byte_bits(i)
crctable[ix] = reverse_dword_bits(remainder)
return crctable
def crc32_revlut(ba, lut):
crc = 0xFFFFFFFF
for x in ba:
crc = lut[x ^ (crc >> 24)] ^ ((crc << 8) & 0xFFFFFFFF)
return reverse_dword_bits(~crc)
I have just started learning C++ and have come across the various data types in C++. I also learnt how the computer stores values when the data type is specified. One doubt that occurred to me while learning the char data type was how the computer differentiates between integers and characters.
I learnt that the character data type uses 8 bits to store a character and that the computer can store a character in its memory location by following ASCII encoding rules. However, I didn't realise how the computer knows whether the byte 01000001 represents the letter 'A' or the integer 65. Is there any special bit assigned for this purpose?
when we do
int a = 65;
or
char ch = 'A';
and we check the memory address, we will see the bit pattern 01000001 in both cases, as expected.
In the application layer we choose to interpret it as a character or an integer:
printf("%d", ch);
will print 65
Characters are represented as integers inside the computer. Hence the data type char is essentially just a small integer type.
Refer to the following page; it will clear up the ambiguities:
Data Types Detail
The computer itself does not remember or set any bits to distinguish chars from ints. Instead it's the compiler which maintains that information and generates proper machine code which operates on data appropriately.
You can even override and 'mislead' the compiler if you want. For example, you can cast a char pointer to a void pointer and then to an int pointer, and then try to read the location referred to as an int. I think 'dynamic casts' are also possible. If there were an actual tag bit, such operations would not be possible.
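For example, a minimal sketch of that kind of reinterpretation (the exact character printed assumes a little-endian, ASCII machine):
#include <cstdio>

int main()
{
    int n = 65;                    // four bytes holding the bit pattern for 65
    char *p = (char *)(void *)&n;  // reinterpret the very same memory as raw bytes
    std::printf("%c\n", *p);       // prints 'A' on a little-endian ASCII machine;
                                   // nothing in memory says whether this was a char or an int
    return 0;
}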
Adding more details in response to comment:
Really, what you should ask is: who will retrieve the values? Imagine that you write the contents of memory to a file and send them over the Internet. If the receiver "knows" that it is receiving chars then there is no need to encode the identity of chars. But if the receiver could receive either chars or ints then it would need identifying bits. In the same way, when you compile a program and the compiler knows what's stored where, there is no need to 'figure out' anything, since you already know it. How a char is encoded as bits vs. a float vs. an int is decided by conventions and standards (e.g. ASCII for characters, IEEE 754 for floating point).
You have asked a simple yet profound question. :-)
Answers and an example or two are below.
(see edit2, at bottom, for a longer example that tries to illustrate what happens when you interpret a single memory location's bit patterns in different ways).
The "profound" aspect of it lies in the astounding variety of character encodings that exist. There are many - I wager more than you believe there could possibly be. :-)
This is a worthwhile read: http://www.joelonsoftware.com/articles/Unicode.html
full title: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
As for your first question: "how did the computer differentiate between integers and characters":
The computer doesn't (for better or worse).
The meaning of bit patterns is interpreted by whatever reads them.
Consider this example bit pattern (8 bits, or one byte):
01000001b = 41x = 65d (binary, hex & decimal respectively).
If that bit pattern is based on ASCII it will represent an uppercase A.
If that bit pattern is EBCDIC it will represent a "non-breaking space" character (at least according to the EBCDIC chart at Wikipedia; most of the others I looked at don't say what 65d means in EBCDIC).
(Just for trivia's sake, in EBCDIC, 'A' would be represented with a different bit pattern entirely: C1x or 193d.)
If you read that bit pattern as an integer (perhaps a short), it may indicate you have 65 dollars in a bank account (or euros, or something else; just like with the character set, your bit pattern won't have anything in it to tell you what currency it is).
If that bit pattern is part of a 24-bit pixel encoding for your display (3 bytes for RGB), perhaps the 'blue' byte in the RGB encoding, it may indicate your pixel is roughly 25% blue (e.g. 65/255 is about 25.4%); 0% would be black, 100% would be as blue as possible.
So, yeah, there are lots of variations on how bits can be interpreted. It is up to your program to keep track of that.
edit: it is common to add metadata to track that, so if you are dealing with currencies you may have one byte for the currency type and other bytes for the quantity of a given currency. The currency type would have to be encoded as well; there are different ways to do that... something that the C++ enum attempts to solve in a space-efficient way: http://www.cprogramming.com/tutorial/enum.html
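As a hypothetical illustration of that metadata idea (the names below are made up for the example, not from any real library):
// Tag the quantity with a currency code so the bytes for "65" are no longer ambiguous.
enum Currency { USD, EUR, GBP };

struct Money {
    Currency currency;  // metadata: how to interpret 'amount'
    long     amount;    // the quantity itself
};

Money balance = { USD, 65 };  // 65 dollars, not 65 euros, and not the letter 'A'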
As for 8 bits (one byte) per character, that is a fair assumption when you're starting out. But it isn't always true. Lots of languages will use 2+ bytes for each character when you get into Unicode.
However... ASCII is very common and it fits into a single byte (8 bits).
If you are handling simple English text (A-Z, 0-9 and so on), that may be enough for you.
Spend some time browsing here and look at ASCII, EBCDIC and others:
http://www.lookuptables.com/
If you're running on Linux or something similar, hexdump can be your friend.
Try the following
$ hexdump -C myfile.dat
Whatever operating system you're using, you will want to find a hexdump utility you can use to see what is really in your data files.
You mentioned C++; I think it would be an interesting exercise to write a byte-dumper utility, just a short program that takes a void* pointer and the number of bytes it has, and then prints out that many bytes' worth of values.
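A minimal sketch of such a dumper might look like this (one possible shape for the exercise, printing the bytes in hex):
#include <cstddef>
#include <iomanip>
#include <iostream>

// Print 'len' bytes starting at 'p', in hex, without caring what type they came from.
void dump_bytes(const void *p, std::size_t len)
{
    const unsigned char *bytes = static_cast<const unsigned char *>(p);
    for (std::size_t i = 0; i < len; ++i)
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<int>(bytes[i]) << " ";
    std::cout << std::dec << "\n";
}

int main()
{
    int n = 65;
    dump_bytes(&n, sizeof(n));  // e.g. "41 00 00 00" on a little-endian machine
    return 0;
}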
Good luck with your studies! :-)
Edit 2: I added a small research program... I don't know how to illustrate the idea more concisely (it seems easier in C than C++).
Anyway...
In this example program, I have two character pointers that are referencing memory used by an integer.
The actual code (see 'example program', way below) is messier with casting, but this illustrates the basic idea:
unsigned short a; // reserve 2 bytes of memory to store our 'unsigned short' integer.
char *c1 = &a; // point to first byte at a's memory location.
char *c2 = c1 + 1; // point to next byte at a's memory location.
Note how 'c1' and 'c2' both share the memory that is also used by 'a'.
Walking through the output...
The sizeof's basically tells you how many bytes something uses.
The ===== Message Here ===== lines are like a comment printed out by the dump() function.
The important thing about the dump() function is that it is using the bit patterns in the memory location for 'a'.
dump() doesn't change those bit patterns, it just retrieves them and displays them via cout.
In the first run, before calling dump I assign the following bit pattern to a:
a = (0x41<<8) + 0x42;
This left-shifts 0x41 8 bits and adds 0x42 to it.
The resulting bit pattern is 0x4142 (which is 16706 decimal, or 01000001 01000010 binary).
One of the bytes will be 0x41, the other will hold 0x42.
Next it calls the dump() method:
dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
Note that the output for this run on my VirtualBox Ubuntu found the address of a was 0x6021b8.
Which nicely matches the expected addresses pointed to by both c1 & c2.
Then I modify the bit pattern in 'a'...
a += 1; dump(); // why did this find a 'C' instead of 'B'?
a += 5; dump(); // why did this find an 'H' instead of 'C' ?
As you dig deeper into C++ (and maybe C) you will want to be able to draw memory maps like this (more or less):
=== begin memory map ===
                   +-------+-------+
unsigned short a   : byte0 : byte1 :                   holds 2 bytes' worth of bit patterns.
                   +-------+-------+
                   +-------+-------+-------+-------+
char * c1          : byte0 : byte1 : byte2 : byte3 :   holds the address of a
                   +-------+-------+-------+-------+
char * c2          : byte0 : byte1 : byte2 : byte3 :   holds the address of a + 1
                   +-------+-------+-------+-------+
=== end memory map ===
Here is what it looks like when it runs; I encourage you to walk through the C++ code
in one window and tie each piece of output back to the C++ expression that generated it.
Note how sometimes we do simple math to add a number to a (e.g. "a +=1" followed by "a += 5").
Note the impact that has on the characters that dump() extracts from memory location 'a'.
=== begin run ===
$ clear; g++ memfun.cpp
$ ./a.out
sizeof char =1, unsigned char =1
sizeof short=2, unsigned short=2
sizeof int =4, unsigned int =4
sizeof long =8, unsigned long =8
===== In ASCII, 0x41 is 'A' and 0x42 is 'B' =====
a=16706(dec), 0x4142 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=B
c2=A
in hex, c1=42
in hex, c2=41
===== after a+= 1 =====
a=16707(dec), 0x4143 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=C
c2=A
in hex, c1=43
in hex, c2=41
===== after a+= 5 =====
a=16712(dec), 0x4148 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=H
c2=A
in hex, c1=48
in hex, c2=41
===== In ASCII, 0x58 is 'X' and 0x59 is 'Y' =====
a=22617(dec), 0x5859 (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Y
c2=X
in hex, c1=59
in hex, c2=58
===== In ASCII, 0x59 is 'Y' and 0x5A is 'Z' =====
a=22874(dec), 0x595a (address of a: 0x6021b8)
c1=0x6021b8 (should be the same as 'address of a')
c2=0x6021b9 (should be just 1 more than 'address of a')
c1=Z
c2=Y
in hex, c1=5a
in hex, c2=59
Done.
$
=== end run ===
=== begin example program ===
#include <iostream>
#include <string>
using namespace std;
// define some global variables
unsigned short a; // declare 2 bytes in memory, as per sizeof()s below.
char *c1 = (char *)&a; // point c1 to start of memory belonging to a (1st byte).
char * c2 = c1 + 1; // point c2 to next piece of memory belonging to a (2nd byte).
void dump(const char *msg) {
// so the important thing about dump() is that
// we are working with bit patterns in memory we
// do not own, and it is memory we did not set (at least
// not here in dump(), the caller is manipulating the bit
// patterns for the 2 bytes in location 'a').
cout << "===== " << msg << " =====\n";
cout << "a=" << dec << a << "(dec), 0x" << hex << a << dec << " (address of a: " << &a << ")\n";
cout << "c1=" << (void *)c1 << " (should be the same as 'address of a')\n";
cout << "c2=" << (void *)c2 << " (should be just 1 more than 'address of a')\n";
cout << "c1=" << (char)(*c1) << "\n";
cout << "c2=" << (char)(*c2) << "\n";
cout << "in hex, c1=" << hex << ((int)(*c1)) << dec << "\n";
cout << "in hex, c2=" << hex << (int)(*c2) << dec << "\n";
}
int main() {
cout << "sizeof char =" << sizeof( char ) << ", unsigned char =" << sizeof( unsigned char ) << "\n";
cout << "sizeof short=" << sizeof( short ) << ", unsigned short=" << sizeof( unsigned short ) << "\n";
cout << "sizeof int =" << sizeof( int ) << ", unsigned int =" << sizeof( unsigned int ) << "\n";
cout << "sizeof long =" << sizeof( long ) << ", unsigned long =" << sizeof( unsigned long ) << "\n";
// this logic changes the bit pattern in a then calls dump() to interpret that bit pattern.
a = (0x41<<8) + 0x42; dump( "In ASCII, 0x41 is 'A' and 0x42 is 'B'" );
a+= 1; dump( "after a+= 1" );
a+= 5; dump( "after a+= 5" );
a = (0x58<<8) + 0x59; dump( "In ASCII, 0x58 is 'X' and 0x59 is 'Y'" );
a = (0x59<<8) + 0x5A; dump( "In ASCII, 0x59 is 'Y' and 0x5A is 'Z'" );
cout << "Done.\n";
}
=== end example program ===
int is an integer, a number that has no digits after the decimal point. It can be positive or negative. Internally, integers are stored as binary numbers. On most computers, integers are 32-bit binary numbers, but this size can vary from one computer to another. When calculations are done with integers, anything after the decimal point is lost. So if you divide 2 by 3, the result is 0, not 0.6666.
char is a data type that is intended for holding characters, as in alphanumeric strings. This data type can be positive or negative, even though most character data for which it is used is unsigned. The typical size of char is one byte (eight bits), but this varies from one machine to another. The plot thickens considerably on machines that support wide characters (e.g., Unicode) or multiple-byte encoding schemes for strings. But in general char is one byte.
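A tiny sketch of both points above (integer division discarding the fraction, and char holding a small integer value):
#include <cstdio>

int main()
{
    int  quotient = 2 / 3;  // integer division: the fractional part is lost, so quotient is 0
    char letter   = 'A';    // stored as the small integer 65 (in ASCII)
    std::printf("%d %d %c\n", quotient, letter, letter);  // prints: 0 65 A
    return 0;
}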
I am basically a beginner in Computer Science. Please forgive me if I ask elementary questions. I am trying to understand radix sort. I read that a 32-bit unsigned integer can be broken down into 4 8-bit chunks. After that, all it takes is "4 passes" to complete the radix sort. Can somebody please show me an example of how this breakdown (32 bits into 4 8-bit chunks) works? Maybe with a 32-bit integer like 2147507648.
Thanks!
You would divide the 32-bit integer up into 4 pieces of 8 bits. Extracting those pieces is a matter of using some of the operators available in C:
#include <stdint.h>

uint32_t x = 2147507648;
uint8_t chunk1 = x & 0x000000ff; //lower 8 bits
uint8_t chunk2 = (x & 0x0000ff00) >> 8;
uint8_t chunk3 = (x & 0x00ff0000) >> 16;
uint8_t chunk4 = (x & 0xff000000) >> 24; //highest 8 bits
2147507648 decimal is 0x80005DC0 hex. You can pretty much eyeball those 8-bit chunks out of the hex representation, since each hex digit represents 4 bits, so two hex digits at a time represent 8 bits.
So that means chunk 1 is 0xC0, chunk 2 is 0x5D, chunk 3 is 0x00 and chunk 4 is 0x80.
It's done as follows:
2147507648
=> 0x80005DC0 (hex value of 2147507648)
=> 0x80 0x00 0x5D 0xC0
=> 128 0 93 192
To do this, you'd need bitwise operations as nos suggested.
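To tie this back to the "4 passes" part of the question, below is a rough sketch of a least-significant-digit radix sort that does one counting-sort pass per 8-bit chunk (a sketch only, not a tuned implementation):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Sort n values in a[] using 4 passes, one per 8-bit chunk, least significant chunk first.
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = new uint32_t[n];
    for (int pass = 0; pass < 4; pass++)
    {
        size_t count[256] = {0};

        // Histogram of the current 8-bit chunk.
        for (size_t i = 0; i < n; i++)
            count[(a[i] >> (8 * pass)) & 0xFF]++;

        // Turn counts into starting offsets (exclusive prefix sums).
        size_t offset = 0;
        for (int b = 0; b < 256; b++)
        {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        // Scatter the values into tmp in chunk order; this pass is stable.
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> (8 * pass)) & 0xFF]++] = a[i];

        memcpy(a, tmp, n * sizeof(uint32_t));
    }
    delete[] tmp;
}

int main()
{
    uint32_t data[] = { 2147507648u, 42u, 7u, 2147483648u, 24000u };
    radix_sort_u32(data, 5);
    for (int i = 0; i < 5; i++)
        printf("%u\n", (unsigned)data[i]);   // 7, 42, 24000, 2147483648, 2147507648
    return 0;
}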