How to use urandom to generate all possible base64 characters?

In one base64 digit you can store up to 6 bits (2**6 == 64).
This means that you can fit 3 bytes into 4 base64 digits.
64**4 == 2**24
That's why:
0x000000 == 'AAAA'
0xFFFFFF == '////'
This means that a random string of 3 bytes is equivalent to a base64 string of 4 characters.
However, if I convert a number of bytes that is not a multiple of 3 to a base64 string, I will not be able to generate all the combinations of the base64 string.
Let's take an example:
If I want a random 7 characters base64 string, I would need to generate 42 random bits (64**7 == 2**42).
If I use urandom to get 5 random bytes I will only get 40 bits (5*8), and if I ask for 6 I will get 48 bits (6*8).
Can I ask for 6 bytes and use a mask to trim it down to 5, or will that break my random distribution?
One solution:
hex(0x123456789012 & 0xFFFFFFFFFF)
'0x3456789012'
Another one:
hex(0x123456789012 >> 8)
'0x1234567890'
What do you think?
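For illustration, a minimal Python sketch of the two trimming ideas above, adapted to keep the 42 bits needed for 7 base64 characters (assuming os.urandom):

import os

n = int.from_bytes(os.urandom(6), 'big')   # 48 random bits
masked  = n & ((1 << 42) - 1)              # keep the low 42 bits
shifted = n >> 6                           # keep the high 42 bits
# Either value is uniformly distributed over range(2**42),
# because every bit returned by urandom is independent and uniform.
print(hex(masked), hex(shifted))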

A base64 string of 7 characters
is the encoding of 5 bytes (40 bits: no less, no more).
40 % 6 = 4,
so base64 needs to add 2 more bits; then, with 42 bits, 42 % 6 = 0 and
the encoding is possible. But beware:
"If I want a random 7 characters base64 string, I would need to
generate 42 random bits (64**7 == 2**42)."
The 2 additional bits are not random; they are constant (zero, in fact).
The cardinality of your key-space doesn't change: it is 2**40 = 1099511627776, not 64**7 == 2**42.
64**7 == 2**42 is the cardinality of the key-space that contains all possible combinations of 64 characters with a length of 7; but with the last two bits fixed (to zero in this case, though the value doesn't matter) you don't get all possible combinations.
6 random bytes (48 bits), or 42 random bits, would increase your original key-space; you should use 5 random bytes (40 bits) and feed them to base64.
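To see the fixed padding bits concretely, here is a small Python sketch (assuming the standard os and base64 modules): base64-encoding 5 random bytes always gives 7 data characters, and the last one can only take 16 of the 64 possible values because its 2 low bits are the constant padding.

import os
import base64

last_chars = set()
for _ in range(10000):
    raw = os.urandom(5)                       # 40 random bits
    encoded = base64.b64encode(raw).decode()  # 7 data characters + '=' padding
    assert len(encoded) == 8 and encoded.endswith('=')
    last_chars.add(encoded[6])                # the 7th (last) data character

# Only 16 of the 64 base64 characters can ever appear here, because the last
# character carries 4 data bits plus 2 constant zero padding bits.
print(sorted(last_chars))   # expect 16 distinct characters for a large sample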

What is the offset size in FlatBuffers?

The major part of our data is strings with possible substring duplication (e.g. domains: "some.thing.com" and "thing.com"). We'd like to reuse the substrings to reduce file size and memory consumption with FlatBuffers, so I'm planning to use [string]: I can just reference existing substrings, e.g. "thing.com" will be just a string created with let substr_offset = builder.create_string("thing.com"), and "some.thing.com" will be stored as [builder.create_string("some."), substr_offset].
However, it seems referencing has costs, so there is probably no benefit to referencing if the string is too short (less than the offset variable size). Is that correct? Is the offset type just usize? What are better alternatives for representing prefix/postfix strings with FlatBuffers?
PS. BTW, what is the cost of a string array instead of just a string? Is it just one more offset?
Both strings and vectors are addressed over a 32-bit offset to them, and also have a 32-bit size field prefixed. So:
"some.thing.com" 14 chars + 1 terminator + 4 size bytes == 19.
Or:
"thing.com" 9 chars + 1 terminator + 4 size bytes == 14.
"some." 5 chars + 1 terminator + 4 size bytes == 10.
vector of 2 strings: 2x4 bytes of offsets + 4 size bytes = 12.
total: 36
Of those 36 bytes, 14 are shared, leaving 22 bytes of unique data, which is larger than the original 19. So the shared string needs to be 13 bytes or larger for this technique to be worth it, assuming it is shared often.
For details: https://google.github.io/flatbuffers/flatbuffers_internals.html
The offset seems to be uint32 (4 bytes, not usize) according to the doc.
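To make the break-even arithmetic easy to replay, here is a small Python sketch (just a back-of-the-envelope model assuming the 32-bit offsets and size prefixes described above, not the FlatBuffers API):

def string_cost(s):
    return len(s) + 1 + 4         # characters + NUL terminator + 32-bit size prefix

def string_vector_cost(n):
    return 4 * n + 4              # n 32-bit offsets + 32-bit size prefix

plain  = string_cost("some.thing.com")    # 19
shared = string_cost("thing.com")         # 14 (reusable part)
prefix = string_cost("some.")             # 10
vector = string_vector_cost(2)            # 12
total  = shared + prefix + vector         # 36
unique = total - shared                   # 22 > 19, so not worth it here
print(plain, total, unique)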

Direct mapped cache example

I am really confused about the topic of direct-mapped caches. I've been looking around for an example with a good explanation and it's making me more confused than ever.
For example: I have
2048 byte memory
64 byte big cache
8 byte cache lines
With a direct-mapped cache, how do I determine the 'LINE', 'TAG' and 'byte offset'?
I believe that the total number of address bits is 11, because 2048 = 2^11
2048/64 = 32 = 2^5 blocks (0 to 31), so 5 bits are needed (tag)
64/8 = 8 = 2^3, so 3 bits for the index
8-byte cache lines = 2^3, which means I need 3 bits for the byte offset
So the address would be laid out like this: 5 bits for the tag, 3 for the index and 3 for the byte offset.
Do I have this figured out correctly?
Did you figure it out correctly? YES
Explanation
1) Main memory size is 2048 bytes = 2^11, so you need 11 bits to address a byte (if your word size is 1 byte). [word = the smallest individually addressable unit]
2) You can calculate the tag bits in direct mapping as (main memory size / cache size), but I will explain a little more about tag bits.
Here the size of a cache line (which is always the same as the size of a main memory block) is 8 bytes, which is 2^3 bytes. So you need 3 bits to represent a byte within a cache line. Now 8 bits (11 - 3) remain in the address.
The total number of lines present in the cache is (cache size / line size) = 2^6 / 2^3 = 2^3.
So you have 3 bits to represent the line in which your required byte is present.
The number of remaining bits is now 5 (8 - 3).
These 5 bits are used to represent the tag. :)
3) 3 bits for the index. If you were labelling the number of bits needed to represent a line as the index, yes, you are right.
4) 3 bits will be used to access a byte within a cache line (8 = 2^3).
So,
11-bit total address = 5 tag bits + 3 bits to identify the line + 3 bits to identify a byte (word) within the line
Hope there is no confusion now.
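As a concrete check, here is a minimal sketch (Python; the constants and helper name are just illustrative) that derives and applies the 5/3/3 split above:

MEMORY_SIZE = 2048   # bytes -> 11 address bits
CACHE_SIZE  = 64     # bytes
LINE_SIZE   = 8      # bytes per cache line

OFFSET_BITS = LINE_SIZE.bit_length() - 1                                 # 3
INDEX_BITS  = (CACHE_SIZE // LINE_SIZE).bit_length() - 1                 # 3
TAG_BITS    = (MEMORY_SIZE.bit_length() - 1) - INDEX_BITS - OFFSET_BITS  # 5

def split_address(addr):
    offset = addr & (LINE_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(TAG_BITS, INDEX_BITS, OFFSET_BITS)   # 5 3 3
print(split_address(0b10110_011_101))      # (22, 3, 5) = tag 0b10110, index 0b011, offset 0b101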

Compression of Integers

How can I compress a row of integers into something shorter?
E.g. Input: '1 2 4 9 8 5 2 7 6 2 3 4' -> Algorithm -> Output: 'X Y Z'
and can get it back the other way around? ('X Y Z' -> '1 2 4 9 8 5 2 7 6 2 3 4')
Input contains 12 digits max, numbers only. Output can be alphanumeric and should be 3-4 digits max.
Thanks in advance.
Edit: Each input digit is 0-9; the output characters are 0-9, a-z, A-Z.
Unless your input comes from a specific domain where many inputs are unlikely/unacceptable, you cannot do it.
You can encode 62^4 ≈ 1.48*10^7 different sequences with 4 alphanumeric characters.
On the other hand, an input of 12 digits can take 10^12 different values.
By the pigeonhole principle, there must be 2 different inputs that are mapped to the same compressed output.
But since you need to recreate the original sequence, you cannot tell those two inputs apart from that identical output.
So such a compression does not exist.
In fact, to compress a 12-digit number into 4 characters, you would need an alphabet of size 1000 for the characters:
x^4 = 10^12, x>0
x = 1000
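A quick Python check of those counts:

print(62**4)                     # 14776336  ~ 1.48e7 possible 4-character outputs
print(10**12)                    # 1000000000000 possible 12-digit inputs
print(round((10**12) ** 0.25))   # 1000: alphabet size needed to cover 10^12 with 4 characters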
First, you could just use any existing compression algorithm, via some library. However, knowing that your input is very specialized, you can also write a special algorithm adapted to your case.
But let's first analyze how much you can compress this input. To simplify, I'll first consider compressing exactly 12 digits from 0 to 9 (you didn't explicitly write what the input range is, however). There are 10^12 possible combinations, which is a little less than 2^40. So what you basically want to do is compress 40 bits.
Let's now analyze how you can compress these 40 bits. If you understand alphanumeric as [0-9A-Z], you have 36 characters available. Each character can encode log_2(36)=5.1 bits. Encoding your 40 bits therefore needs 8 alphanumeric characters.
A better alternative would be to use base64. Here, you have 64 characters, which means each character can encode 6 bits, so you could encode your input with only 40/6=6.666 => 7 characters.
If you consider compressing your input to binary, you will obviously need 40 bits. This can be written with 5 8-bit ASCII characters, with 2 32-bit integers or with 1 64-bit integer. However this probably isn't what you want to achieve.
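A quick Python check of the bit counts above:

import math
print(math.log2(10**12))               # ~39.86 -> just under 40 bits
print(math.log2(36))                   # ~5.17 bits per [0-9A-Z] character
print(math.ceil(40 / math.log2(36)))   # 8 alphanumeric characters
print(math.ceil(40 / 6))               # 7 base64 characters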
Conclusion: you can't compress data arbitrarily much and the data that you want to compress can't be compressed as much as you like to compress it.
As an example, to encode 12 digits from 0 to 9 into ASCII characters, you could simply convert them into one big number, convert that to binary, then take the binary number in portions of 8 bits and convert those to ASCII characters.
Example:
Input: 1 2 4 9 8 5 2 7 6 2 3 4
One number: 124985276234
Binary: 1110100011001101100111111011101001010
Grouped: 11101 00011001 10110011 11110111 01001010
ASCII: <GS><EM><0xB3><0xF7>J
Note that some of these bytes are not printable ASCII symbols. If that is important to you, you'll have to use an alternative encoding, for example base64, which only has 64 different characters, but they are all printable.
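The same round trip sketched in Python for concreteness (assuming exactly 12 decimal digits packed into 5 bytes):

digits = "124985276234"
number = int(digits)                      # 124985276234
packed = number.to_bytes(5, 'big')        # 40 bits is enough for any value below 10**12
print(packed)                             # b'\x1d\x19\xb3\xf7J'

unpacked = str(int.from_bytes(packed, 'big')).zfill(12)  # restore leading zeros
assert unpacked == digits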
Similar discussion
Compressing a set of large integers
PHP Compress array of bits into shortest string possible
$val = pack('H*', "124985276234");
echo '#'. $val . '#';
print_r(unpack('H*', $val));
die;
#Issue
00011001 => 25
11001 => 25
I tried to implement @Misch's algorithm in PHP, but some bits were wrong when using decbin and gave me bad results when unpacking. Then I found the pack function, which works similarly. But numbers from 0 to 9 are wrong when unpacking, and in a test over 9,000,000 values, 8,090,899 were decompressed to the wrong value; no collision was found.
set_time_limit(0);
ini_set('memory_limit', '5000M');
ini_set("max_execution_time", 0);

$collision = [];
$err = [];
for ($i = 0; $i < 9000000; $i++) {
    $packed = pack('H*', $i);
    $unpacked = unpack('H*', $packed)[1];
    if (array_key_exists($i, $collision)) {
        die("Collision:" . $i . ' !!!!' . $packed . '!!!!' . $unpacked);
    }
    if ($i != $unpacked) {
        $e = "Collision2:" . $i . ' !!!!' . $packed . '!!!!' . $unpacked . "\n";
        #echo $e;
        $err[] = $e;
    }
    $collision[] = $packed;
    #echo '#'. $i .'#' . $unpacked . '#' . $unpacked . "#\n";
}
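The wrong values most likely come from how 'H*' handles an odd number of hex digits: the last nibble gets padded, so the unpacked string is no longer the original number (this padding behaviour of PHP's pack is an assumption here). A rough Python illustration of the same effect:

# Packing '123' as hex nibbles effectively stores '1230',
# which unpacks as '1230', not '123'.
value = "123"
padded = value if len(value) % 2 == 0 else value + "0"  # pad the last nibble
packed = bytes.fromhex(padded)        # b'\x120'
unpacked = packed.hex()               # '1230'
print(unpacked == value)              # False: the round trip loses the original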

What is the maximum size of LENGTH field in SNMP frames?

Implementing an SNMP v1 decoder and working with some Wireshark captures, I can see that sometimes the length field of a BER element is coded with one byte and other times with two bytes.
Reading the BER rules: if the most significant bit is set to 1, then the length must be extended with the following byte(s) to represent values bigger than 127.
So, if the first byte is 0x81 and the next byte is 0x9F, then the extended length field should take the value 0x9F... OK.
My question is:
If the second byte is 0x9F, the most significant bit is 1 again.
Wireshark only takes two bytes for this length.
Why, in this case, is the size of the length field only two bytes?
Are length fields restricted to 2 bytes?
Thanks.
According to the BER rules, the length field can be multiple bytes (many more than 2):
http://en.wikipedia.org/wiki/KLV
Length Field
Following the bytes for the Key are bytes for the
Length field which will tell you how many bytes follow the length
field and make up the Value portion. There are four kinds of encoding
for the Length field: 1-byte, 2-byte, 4-byte and Basic Encoding Rules
(BER). The 1-, 2-, and 4-byte variants are pretty straightforward:
make an unsigned integer out of the bytes, and that integer is the
number of bytes that follow.
BER length encoding is a bit more
complicated but the most flexible. If the first byte in the length
field does not have the high bit set (0x80), then that single byte
represents an integer between 0 and 127 and indicates the number of
Value bytes that immediately follows. If the high bit is set, then the
lower seven bits indicate how many bytes follow that themselves make
up a length field.
For example if the first byte of a BER length field
is binary 10000010, that would indicate that the next two bytes make
up an integer that then indicates how many Value bytes follow.
Therefore a total of three bytes were taken up to specify a length.
"If second byte is 0x9F, the more significative bit is 1 again." Is that a question? Only the first byte in the bytes determines how many following bytes are used to determine the length. So you never need to care about the most significant bit of the second byte. Never.
How Wireshark represents the bytes is not very critical. Unless Wireshark shows you a wrong value for length, you should not pay much attention to it.
In ASN.1 BER length encoding rules:
a) If the number of content octets is <= 127, then a single length octet encodes the number of content octets.
b) Else the most significant bit of the first length octet is set to 1 and the other 7 bits give the number of length octets that follow.
c) The following length octets encode the length of the content octets in big-endian byte order.
Example:
Length 126: 01111110
Length 127: 01111111
Length 128: 10000001 10000000
Length 1031: 10000010 00000100 00000111
Number | MSB of 1st length octet | Following length octets | Big-endian value
128    | 1                       | 0000001 (= 1 octet)     | 10000000 (= 128)
1031   | 1                       | 0000010 (= 2 octets)    | 00000100 00000111 (= 1024 + 4 + 2 + 1)
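For completeness, a minimal sketch of these rules in Python (the helper names are just for illustration):

def encode_length(n):
    if n <= 127:                                   # short form: one octet
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return bytes([0x80 | len(body)]) + body        # long form: count octet + big-endian value

def decode_length(data):
    first = data[0]
    if first & 0x80 == 0:                          # short form
        return first, 1
    count = first & 0x7F                           # number of following length octets
    return int.from_bytes(data[1:1 + count], 'big'), 1 + count

print(encode_length(127).hex())                # 7f
print(encode_length(128).hex())                # 8180
print(encode_length(1031).hex())               # 820407
print(decode_length(bytes.fromhex('819f')))    # (159, 2)
print(decode_length(bytes.fromhex('829f9f')))  # (40863, 3): the MSBs of the 0x9F bytes don't matter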
I'd add that SNMP usually uses UDP datagrams for transport, which are limited to 65535 bytes, that is 0xFFFF. So at most 2 length octets (after the long-form indicator byte) are ever needed to encode the length.

Ruby - How to represent message length as 2 binary bytes

I'm using Ruby and I'm communicating with a network endpoint that requires the formatting of a 'header' prior to sending the message itself.
The first field in the header must be the message length which is defined as a 2 binary byte message length in network byte order.
For example, my message is 1024 in length. How do I represent 1024 as binary two-bytes?
The standard tools for byte wrangling in Ruby (and Perl and Python and ...) are pack and unpack. Ruby's pack is in Array. You have a length that should be two bytes long in network byte order; that sounds like a job for the n format specifier:
n | Integer | 16-bit unsigned, network (big-endian) byte order
So if the length is in length, you'd get your two bytes thusly:
two_bytes = [ length ].pack('n')
If you need to do the opposite, have a look at String#unpack:
length = two_bytes.unpack('n').first
See Array#pack.
[1024].pack("n")
This packs the number as the network-order byte sequence \x04\x00.
The way this works is that each byte is 8 binary bits. 1024 in binary is 10000000000. If we break this up into octets of 8 bits (8 bits per byte), we get: 00000100 00000000.
A byte can represent (2 states) ^ (8 positions) = 256 unique values. However, since we don't have 256 printable ASCII characters, we usually represent bytes visually as hexadecimal pairs, since a hexadecimal digit can represent 16 different values and 16 * 16 = 256. Thus, we can take the first byte, 00000100, and break it into two 4-bit nibbles, 0000 0100. Translating binary to hex gives us 0x04. The second byte is trivial: 0000 0000 is 0x00. This gives us the hexadecimal representation of the two-byte string.
It's worth noting that because you are constrained to a 2-byte (16-bit) header, you are limited to a maximum value of 11111111 11111111, or 2^16 - 1 = 65535 bytes. Any message larger than that cannot accurately represent its length in two bytes.
