Devise a simple algorithm which creates a file which contains nothing but its own checksum.
Let's say it is CRC-32, so this file must be 4 bytes long.
There might be some smart mathematical way of finding it out (or proving that none exists), if you know how the algorithm works.
But since I'm lazy and CRC32 has only 2^32 values, I would brute force it. While waiting for the algorithm to go through all 2^32 values, I would use Google and Stack Overflow to find whether somebody has a solution to it.
In case of SHA-1, MD5 and other more-or-less cryptographically secure algorithms, I would get intimidated by the mathematicians who designed those algorithms and just give up.
EDIT 1: Brute forcing... So far I've found one: CC4FBB6A in big-endian encoding. There might still be more. I'm checking four different encodings: ASCII uppercase and lowercase, and binary big-endian and little-endian.
EDIT 2: Brute force done. Here are the results:
CC4FBB6A (big-endian)
FFFFFFFF (big-endian & little-endian)
32F3B737 (uppercase ASCII)
The code is here. On my overclocked C2Q6600 it takes about 1.5 hours to run. The program is single-threaded, but it would be easy to make it multi-threaded, which would give nice linear scalability.
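For reference, here is a minimal sketch of the binary part of that search (not the linked program), assuming zlib is available to provide the reference CRC-32 (compile with -lz):

#include <stdint.h>
#include <cstdio>
#include <zlib.h>

int main() {
    uint32_t v = 0;
    do {
        // candidate file contents: v written as 4 bytes, little- and big-endian
        unsigned char le[4] = { static_cast<unsigned char>(v),
                                static_cast<unsigned char>(v >> 8),
                                static_cast<unsigned char>(v >> 16),
                                static_cast<unsigned char>(v >> 24) };
        unsigned char be[4] = { le[3], le[2], le[1], le[0] };
        if (crc32(0L, le, 4) == v) std::printf("little-endian fixed point: %08X\n", v);
        if (crc32(0L, be, 4) == v) std::printf("big-endian fixed point: %08X\n", v);
    } while (++v != 0);   // ++v wraps back to 0 after 0xFFFFFFFF, so all 2^32 values are tried
    return 0;
}

The ASCII variants are the same search, with v formatted as upper- or lowercase hex text before hashing.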
Aside from Jerry Coffin and Esko Luontola's good answers to an unusual problem, I'd like to add:
Mathematically, we're looking for X such that F(X) = X, where F is the checksum function, and X is the data itself.
Since the checksum's output is of fixed size, and the input we are looking for is of the same size, there is no guarantee that such an X even exists! It could very well be that every input value of the fixed size maps to a different value of that size.
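A back-of-the-envelope check, if we model the checksum as a uniformly random function from 32-bit values to 32-bit values: each of the 2^32 inputs is a fixed point with probability 2^-32, so the expected number of fixed points is exactly 1, while the probability that none exists at all is about (1 - 2^-32)^(2^32) ≈ 1/e ≈ 37%. Existence is plausible, but genuinely not guaranteed.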
EDIT: Your question didn't specify the exact way the checksum is supposed to be formatted within the file, so I assumed you meant the byte representation of the checksum. When strings and encodings and formatted strings come into play, things become more complex.
Lacking any specific guidance to the contrary, I'd define the checksum of nonexistent data as a nonexistent checksum, so creating an empty file would fulfill the requirement.
Another typical method is a negative checksum -- i.e. after the data you write a value that makes the checksum of the whole file (including the checksum) come out to zero. In this case, you write a checksum of 0, and it all works out.
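A tiny illustration of that negative-checksum idea, using a simple additive mod-256 checksum as a stand-in (my assumption for the example; the question itself is about CRC-32):

#include <cstdint>
#include <cstdio>
#include <vector>

// Appends one byte chosen so that the byte-sum of the whole buffer becomes 0 mod 256.
void append_negative_checksum(std::vector<uint8_t>& data) {
    uint8_t sum = 0;
    for (uint8_t b : data) sum += b;
    data.push_back(static_cast<uint8_t>(0u - sum));   // additive inverse mod 256
}

int main() {
    std::vector<uint8_t> file;          // no payload, as in the answer above
    append_negative_checksum(file);     // appends 0, since the empty sum is already 0
    uint8_t total = 0;
    for (uint8_t b : file) total += b;
    std::printf("file byte: %u, checksum of whole file: %u\n", (unsigned)file[0], (unsigned)total);
}

With no payload, the value written is 0 and the resulting one-byte file still sums to 0, which is the "it all works out" case.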
Brute force. This is Adler32, which I haven't implemented before, and didn't bother testing, so it's quite likely I've messed it up. I wouldn't expect a corrected version to run significantly slower, though, unless I've done something colossally wrong.
This assumes that the 32-bit checksum value is written to the file little-endian (I didn't find a fixed point with it big-endian):
#include <iostream>
#include <stdint.h>
#include <iomanip>

const uint32_t modulus = 65521;   // largest prime below 2^16, per the Adler-32 spec

// Recursively tries every 4-byte file. "sofar" is the candidate file interpreted as a
// little-endian 32-bit word; (a, b) is the running Adler state after the bytes chosen so far.
// Note: standard Adler-32 adds the *updated* a into b; this code adds the old a, so treat
// the reported fixed point with the scepticism expressed in the edit below.
void checkAllAdlers(uint32_t sofar, int depth, uint32_t a, uint32_t b) {
    if (depth == 4) {
        if ((b << 16) + a == sofar) {   // checksum equals the file's own contents
            std::cout << "Got a fixed point: 0x" <<
                std::hex << std::setw(8) << std::setfill('0') <<
                sofar << "\n";
        }
        return;
    }
    for (uint32_t i = 0; i < 256; ++i) {   // try every possible value for this byte
        uint32_t newa = a + i;
        if (newa >= modulus) newa -= modulus;
        uint32_t newb = b + a;
        if (newb >= modulus) newb -= modulus;
        checkAllAdlers(sofar + (i << (depth*8)), depth + 1, newa, newb);
    }
}

int main() {
    checkAllAdlers(0, 0, 1, 0);   // Adler-32 starts with a = 1, b = 0
}
Output:
$ g++ adler32fp.cpp -o adler32fp -O3 && time ./adler32fp
Got a fixed point: 0x03fb01fe
real 0m31.215s
user 0m30.326s
sys 0m0.015s
[Edit: several bugs fixed already, I have no confidence whatever in the correctness of this code ;-) Anyway, you get the idea: a 32 bit checksum which uses each byte of input only once is very cheap to brute force. Checksums are usually designed to be fast to compute, whereas hashes are usually much slower, even though they have superficially similar effects. If your checksum was "2 rounds of Adler32" (meaning that the target checksum was the result of computing the checksum and then computing the checksum of that checksum) then my recursive approach wouldn't help so much, there'd be proportionally less in common between inputs with a common prefix. MD5 has 4 rounds, SHA-512 has 80.]
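If you want to sanity-check a candidate against a reference implementation, something like the following works, assuming zlib is available (link with -lz); it writes the candidate as 4 little-endian bytes and asks zlib for their Adler-32:

#include <stdint.h>
#include <cstdio>
#include <zlib.h>

int main() {
    const uint32_t candidate = 0x03fb01fe;   // the value reported by the search above
    unsigned char bytes[4] = {
        static_cast<unsigned char>(candidate),
        static_cast<unsigned char>(candidate >> 8),
        static_cast<unsigned char>(candidate >> 16),
        static_cast<unsigned char>(candidate >> 24)
    };
    uLong sum = adler32(adler32(0L, Z_NULL, 0), bytes, 4);   // seed with the initial value, then hash the data
    std::printf("adler32 = %08lx -> %s\n", sum,
                sum == candidate ? "fixed point confirmed" : "NOT a fixed point");
}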
Brute force it. CRC-32 gives you a string of length 8 containing digits and the letters A-F (in other words, it's a hexadecimal number). Try every combination, giving you 16^8 = 2^32, i.e. roughly 4.3 billion possibilities. Then hash each possibility and see if it gives you the original string.
You can try optimizing it by assuming the solution will use each character no more than two or three times; this might make it finish faster.
If you have access to a CRC32 implementation, you can also try to break the algorithm and find a solution much faster, but I have no idea how you'd do this.
This is a question about an SO question; I don't think it belongs on Meta even though, by definition, it arguably is a meta question, but if someone feels it should go to Math, Cross Validated, etc., please let me know.
Background:
#ForceBru asked this question about how to generate a 64-bit random number using rand(). #nwellnhof provided an answer, which was accepted, that basically takes the low 15 bits of 5 random numbers (because RAND_MAX is apparently only guaranteed to be 15 bits, i.e. 32767, on at least some compilers), glues them together, and then drops the first 11 bits (15*5 - 64 = 11). #NikBougalis made a comment that while this seems reasonable, it won't pass many statistical tests of randomness. #Foon (me) asked for a citation or an example of a test it would fail. #NikBougalis replied with an answer that didn't enlighten me; #DavidSwartz suggested running it against dieharder.
So, I ran dieharder against the algorithm in question:
unsigned long long llrand() {
unsigned long long r = 0;
for (int i = 0; i < 5; ++i) {
r = (r << 15) | (rand() & 0x7FFF);
}
return r & 0xFFFFFFFFFFFFFFFFULL;
}
For comparison, I also ran it against just rand() and against just 8 bits of rand() at a time.
#include <stdio.h>
#include <stdlib.h>

void rand_test()
{
    int x;
    srand(1);
    while (1)
    {
        x = rand();
        fwrite(&x, sizeof(x), 1, stdout);
    }
}

void rand_byte_test()
{
    int x;
    unsigned char c;
    srand(1);
    while (1)
    {
        x = rand();
        c = x % 256;
        fwrite(&c, sizeof(c), 1, stdout);
    }
}
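For the record, the byte streams were piped into dieharder over stdin. A hypothetical driver for the llrand() case (not my exact harness) could look like the following, run as ./llrand_gen | dieharder -g 200 -a (generator 200 is dieharder's raw-stdin input):

#include <stdio.h>
#include <stdlib.h>

/* same construction as above: five 15-bit chunks; the top 11 of the
   75 gathered bits are shifted out of the 64-bit value */
unsigned long long llrand(void)
{
    unsigned long long r = 0;
    for (int i = 0; i < 5; ++i)
        r = (r << 15) | (rand() & 0x7FFF);
    return r;
}

int main(void)
{
    srand(1);
    while (1)
    {
        unsigned long long r = llrand();
        fwrite(&r, sizeof(r), 1, stdout);
    }
}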
The algorithm in question came back with two tests showing weaknesses: rgb_lagged_sum for ntuple=28 and one of the sts_serial tests for ntuple=8.
Just using rand() failed horribly on many tests, presumably because I'm taking a number that has 15 bits of randomness and passing it off as 32 bits of randomness.
Using the low 8 bits of rand() at a time came back as weak for rgb_lagged_sum with ntuple=2 and (edit) failed dab_monobit with ntuple=12.
My questions are:
1. Am I interpreting the results for 8 bits of rand() correctly, namely that given that one of the tests marked "good" came back weak and one failed (for the record, it also came back weak on one of the dieharder tests marked "suspect"), rand()'s randomness should be suspected?
2. Am I interpreting the results for the algorithm under test correctly, namely that it should also be marginally suspected?
3. Given the description of what the tests that came back weak do (e.g. sts_serial looks at whether the distribution of bit patterns of a certain size is valid), should I be able to determine what the bias likely is?
4. If the answer to 3 is yes, since I'm not seeing it, can someone point out what I should be seeing?
Edit: understood that rand() isn't guaranteed to be great. I also tried to think about which values would be less likely, and surmised it might be zero, the max value, or repeated numbers. But over a test of 1,000,000,000 tries the ratio is very near the expected 1 in 2^15: in one run of 1,000,000,000 we saw 30512 zeros, 30444 maxes, and 30301 repeats, and bc says 30512 * 2^15 is 999817216; other runs had similar ratios, including cases where the max and/or repeat counts were larger than the zero count.
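For completeness, here is a sketch of that sanity check (counting zeros, max values, and immediate repeats among the 15-bit samples); the counts quoted above are from my runs, not from this exact program:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long long trials = 1000000000LL;
    long long zeros = 0, maxes = 0, repeats = 0;
    int prev = -1;
    srand(1);
    for (long long i = 0; i < trials; ++i)
    {
        int v = rand() & 0x7FFF;         /* one 15-bit sample */
        if (v == 0)      ++zeros;
        if (v == 0x7FFF) ++maxes;
        if (v == prev)   ++repeats;      /* immediate repeat of the previous sample */
        prev = v;
    }
    printf("zeros=%lld maxes=%lld repeats=%lld (expected about %lld each)\n",
           zeros, maxes, repeats, trials >> 15);
    return 0;
}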
When you run dieharder the column you really need to watch is the p-value column.
The p-value column essentially says: "This is the probability that real random numbers could have produced this result." You want it to be uniformly distributed between 0 and 1.
You'll also want to run it multiple times on suspect cases. For instance, if a test reports a p-value of, say, 0.03, and on re-running it you get another low value rather than something higher, you can be fairly confident that your random number generator genuinely performs poorly on that test and it's not just a 3% fluke. If instead you get a high value, you were probably looking at a statistical fluke. But it cuts both ways.
Ultimately, knowing facts about random or pseudorandom processes is difficult. But armed with dieharder you have approximate knowledge of many things.
I know enough Haskell to translate the code below, but I don't know much about making it perform well:
typedef unsigned long precision;
typedef unsigned char uc;
const int kSpaceForByte = sizeof(precision) * 8 - 8;
const int kHalfPrec = sizeof(precision) * 8 / 2;
const precision kTop = ((precision)1) << kSpaceForByte;
const precision kBot = ((precision)1) << kHalfPrec;
//This must be called before encoding starts
void RangeCoder::StartEncode(){
_low = 0;
_range = (precision) -1;
}
/*
RangeCoder does not concern itself with models of the data.
To encode each symbol, you pass the parameters *cumFreq*, which gives
the cumulative frequency of the possible symbols ordered before this symbol;
*freq*, which gives the frequency of this symbol; and *totFreq*, which gives
the total frequency of all symbols.
This means that you can have different frequency distributions / models for
each encoded symbol, as long as you can restore the same distribution at
this point, when restoring.
*/
void RangeCoder::Encode(precision cumFreq, precision freq, precision totFreq){
assert(cumFreq + freq <= totFreq && freq && totFreq <= kBot);
_low += cumFreq * (_range /= totFreq);
_range *= freq;
while ((_low ^ _low + _range) < kTop or
_range < kBot and ((_range= -_low & kBot - 1), 1)){
//the "a or b and (r=..,1)" idiom is a way to assign r only if a is false.
OutByte(_low >> kSpaceForByte); //output one byte.
_range <<= sizeof(uc) * 8;
_low <<= sizeof(uc) * 8;
}
}
I know, I know "Write several versions and use criterion to see what works". I don't know enough to know what my options are though, or to avoid silly mistakes.
Here are my thoughts so far. One way would be to use the State monad and/or lenses. Another would be to translate the loop and state to explicit recursion. I read somewhere that explicit recursion tends to perform badly on GHC, though. I think using ByteString Builder would be a good way to output each byte. Assuming I run on a 64-bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Since this is not a 1-1 mapping, pipes with StateP would lead to very neat code, where I would request arguments one at a time and then let the while-loop respond byte for byte. Unfortunately, when I benchmarked it, the pipe overhead (unsurprisingly) was quite large. Since each symbol can lead to many byte outputs, it feels a bit like a concatMap with State. Perhaps that would be the idiomatic solution? Concatenating lists of bytes does not sound very fast to me, though. ByteString has a concatMap; perhaps that is the correct way? EDIT: no, it is not. It takes a ByteString as input.
I intend to release the package on Hackage when I'm done, so any advice (or actual code!) you can give will benefit the community :). I plan to use this compression as a base for writing a very memory efficient compressed map.
I read somewhere that explicit recursion tends to perform badly on GHC though.
No. GHC produces slow machine code only for recursion that it cannot reduce (or "doesn't want" to reduce). If the recursion can be unrolled (and I don't see any fundamental obstacle to that in your snippet), it is translated into almost the same machine code as a while-loop in C or C++.
Assuming I run on a 64-bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Do you mean Word#? Let GHC deal with that; use boxed types. I've never come across a situation where a gain could be had only by using unboxed types. Using 32-bit types wouldn't help on a 64-bit platform.
One general rule for optimizing performance on GHC is to avoid data structures where possible. If you can pass pieces of data through function arguments or closures, take the chance.
Which version is faster:
x * 0.5
or
x / 2 ?
I took a course at university called Computer Systems some time ago. From back then I remember that multiplying two values can be achieved with comparably "simple" logic gates, but division is not a "native" operation and requires a sum register that is increased by the divisor in a loop and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great I've upvoted them all. I chose rahul's answer because of the great link.
Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottleneck.
Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).
For floats, even dynamic division by powers of two is much faster than regular (dynamic or static) division, as it basically turns into a subtraction on the exponent.
In all other cases, division appears to be several times slower than multiplication.
For a dynamic divisor the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8; for static ones, about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two) :
(The chart itself isn't reproduced here. In its labels, ulong means 64-bit unsigned, 1 means a dynamic argument, and 0 means a statically known argument.)
The results were generated from the following C template (the $-variables were substituted by a bash script):
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
$TYPE arg = atoi(argv[1]);
$TYPE i = 0, res = 0;
for (i=0;i< $IT;i++)
res+=i $OP $ARG;
printf($FMT, res);
return 0;
}
with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line as it's obvious from the C code).
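As an example, here is one hypothetical instantiation of that template (64-bit unsigned, dynamic divisor; the iteration count and format string are placeholders, not necessarily the exact values I used):

#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
    /* $TYPE = ulong, $OP = /, $ARG = arg (dynamic), $IT = 100000000, $FMT = "%lu\n" */
    ulong arg = atoi(argv[1]);
    ulong i = 0, res = 0;
    for (i = 0; i < 100000000; i++)
        res += i / arg;
    printf("%lu\n", res);
    return 0;
}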
Well, if it is a single calculation you will hardly notice any difference, but if you're talking about millions of operations then division is definitely costlier than multiplication. You can always use whichever is the clearest and most readable.
Please refer to this link: Should I use multiplication or division?
That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.
One note to make, if you are looking for numerical stability:
Don't reuse a single division (i.e. a precomputed reciprocal) for results that require multiple components/coordinates, e.g. when implementing an n-D vector normalize() function; the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
.. but this code will..
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
I guess the problem in the first code excerpt is the additional rounding step: the precomputed 1/l carries its own rounding error, which is then forced upon each component of the actual result, introducing additional error.
I didn't look into this for too long, so please share your explanation of why this happens. I tested it with x, y and z set to .1f (and with doubles instead of floats).
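For anyone who wants to poke at it, here is a minimal self-contained repro of that comparison (a hypothetical V3 struct standing in for V3d; it just prints both recomputed lengths so you can inspect the rounding difference yourself):

#include <cstdio>
#include <cmath>

struct V3 {
    float x, y, z;
    float length() const { return std::sqrt(x*x + y*y + z*z); }
};

int main() {
    V3 a{0.1f, 0.1f, 0.1f}, b = a;
    float l = a.length();

    float oneOverL = 1.f / l;                 // normalize via one reciprocal multiplication
    a.x *= oneOverL; a.y *= oneOverL; a.z *= oneOverL;

    b.x /= l; b.y /= l; b.z /= l;             // normalize via per-component division

    std::printf("multiply: %.9g  divide: %.9g\n", a.length(), b.length());
}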
Consider a binary sequence:
11000111
I have to find the sum of this sequence (actually in parallel):
Sum = 1+1+0+0+0+1+1+1 = 5
This is a waste of resources: why invest time in adding 0s?
Is there any clever way to sum this sequence so I can avoid unnecessary additions?
Operate at the byte level rather than the bit level. Use a small LUT to convert a byte to a population count. That way you're only doing one lookup and one add per 8 bits. Unless your data is likely to be very sparse this should be quite efficient.
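A minimal sketch of that byte-level LUT approach (the names here are mine, not from any particular library):

#include <cstdint>
#include <cstdio>

static unsigned char popcount_table[256];

static void init_table() {
    for (int i = 0; i < 256; ++i)
        popcount_table[i] = (i & 1) + popcount_table[i / 2];   // reuse the already-filled lower half
}

// One table lookup and one add per byte of input.
static unsigned long count_bits(const unsigned char* data, unsigned long n) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; ++i)
        sum += popcount_table[data[i]];
    return sum;
}

int main() {
    init_table();
    unsigned char example[] = { 0xC7 };   // 11000111 from the question
    std::printf("%lu\n", count_bits(example, sizeof example));   // prints 5
}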
Well it depends on how you store your bitset.
If it's an array, then you can't do better than a plain for loop. If you want to do this in parallel, just split the array into chunks and process them concurrently.
If we are talking about a bitset (storing the bits in a native (32/64-bit) integer type), then the simplest way to count bits would be this one:
int bitset;   // the word whose set bits we want to count
int s = 0;
for (; bitset; s++)
    bitset &= bitset - 1;   // clears the lowest set bit each iteration
This clears the lowest set bit at every step, so it runs in O(s).
Of course, you can combine these two methods if you need more than 32/64 bits.
I don't know why people are answering without even looking at the link in the first comment on the question. You can easily do better than a naive O(size_of_bitset) loop, at least when it comes to the constant factor.
You could use this method (found in the link posted by J.F. Sebastian):
#include <stdio.h>

#define N 1024   /* N was not defined in the original; use whatever size you need */

static inline int count_bits(int num){
    int sum = 0;
    for (; num; sum++) num &= num - 1;   /* clears the lowest set bit each pass */
    return sum;
}

int main (void){
    int array[N] = {0};   /* fill with the data whose bits you want counted */
    int total_sum = 0;
    #pragma omp parallel for reduction(+:total_sum)
    for (int i = 0; i < N; i++){
        total_sum += count_bits(array[i]);
    }
    printf("%d\n", total_sum);
    return 0;
}
This will count the number of set bits in the array's memory range in parallel. The inline is important so the compiler can avoid the call overhead and optimize the loop much better.
You can swap count_bits for anything better at counting the bits in an integer, if you find something faster. This version has complexity O(bits_set) (not the size of the bit set!).
Invoking the parallel construct introduces quite a lot of overhead compared to a single summation, so the input needs to be quite large to compensate.
The parallelism is done via OpenMP. The partial sum of each thread is combined at the end of the parallel loop and stored in total_sum. Due to the reduction clause, each thread works on its own private copy of total_sum inside the loop.
You could alter the code to count the bits set in an arbitrary memory region, but it is quite important for the region to be properly aligned when you perform operations at such a low level.
As far as I can see, it would be wasteful to try to handle the zeros specially. As #bdares said, addition is really cheap. At a minimum, you'll need to execute N instructions to sum up an N-bit sequence; that's if you unconditionally sum every bit. If you add a test to see whether the bit is a 0 or a 1, that's another instruction that needs to be executed for each bit. Even if there were no branch penalty, you'd be executing a minimum of 1 instruction for every bit (the conditional test), and then also the original instruction (the add) for any bits that are equal to 1. So even without a branch penalty, this takes more time to execute.
#bdares mentions that the compiler will optimize out the branches, but that's only if the value of each bit is known at compile time, and if you know the values of the bits at compile time, you should just add them up yourself in advance.
There might be some cute things you can do with bit twiddling. For instance, if you take the bits two at a time you're adding up values of 0, 1, 2, or 3, and only have half as many additions to do. There may be something you can then do with the result to convert it into the value you want, but I haven't actually thought about how to do that.
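One concrete shape that idea can take is the classic SWAR popcount (pairs, then nibbles, then bytes), offered here only as an illustration of the bit-twiddling direction:

#include <stdio.h>
#include <stdint.h>

static int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555);                  /* 2-bit sums of adjacent bit pairs */
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);   /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F;                  /* 8-bit sums */
    return (x * 0x01010101) >> 24;                    /* add the four byte sums together */
}

int main(void)
{
    printf("%d\n", popcount32(0xC7));   /* 11000111 -> 5 */
    return 0;
}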
This is a question about the Linux kernel's implementation of /dev/urandom. If a user asks to read a very large amount of data (gigabytes) and no entropy is added to the pool, is it possible to predict the next data generated from urandom, based on the data seen so far?
The usual case is that entropy is frequently added to the pool, but in my case assume there is no additional entropy (e.g., adding it was disabled by patching the kernel). So my question is about the urandom algorithm itself.
Source is /drivers/char/random.c or http://www.google.com/codesearch#KMCRKdMbI4g/drivers/char/random.c&q=urandom%20linux&type=cs&l=116
or http://lxr.linux.no/linux+v3.3.3/drivers/char/random.c
// data copying loop
while (nbytes) {
    extract_buf(r, tmp);
    memcpy(buf, tmp, i);   /* elsewhere in the full loop, i = min(nbytes, EXTRACT_SIZE) */
    nbytes -= i;
    buf += i;
    ret += i;
}
static void extract_buf(struct entropy_store *r, __u8 *out)
{
int i;
__u32 hash[5], workspace[SHA_WORKSPACE_WORDS];
__u8 extract[64];
/* Generate a hash across the pool, 16 words (512 bits) at a time */
sha_init(hash);
for (i = 0; i < r->poolinfo->poolwords; i += 16)
sha_transform(hash, (__u8 *)(r->pool + i), workspace);
/*
* We mix the hash back into the pool to prevent backtracking
* attacks (where the attacker knows the state of the pool
* plus the current outputs, and attempts to find previous
* outputs), unless the hash function can be inverted. By
* mixing at least a SHA1 worth of hash data back, we make
* brute-forcing the feedback as hard as brute-forcing the
* hash.
*/
mix_pool_bytes_extract(r, hash, sizeof(hash), extract);
/*
* To avoid duplicates, we atomically extract a portion of the
* pool while mixing, and hash one final time.
*/
sha_transform(hash, extract, workspace);
memset(extract, 0, sizeof(extract));
memset(workspace, 0, sizeof(workspace));
/*
* In case the hash function has some recognizable output
* pattern, we fold it in half. Thus, we always feed back
* twice as much data as we output.
*/
hash[0] ^= hash[3];
hash[1] ^= hash[4];
hash[2] ^= rol32(hash[2], 16);
memcpy(out, hash, EXTRACT_SIZE);
memset(hash, 0, sizeof(hash));
}
There is a backtrack prevention mechanism, but what about "forward-track"?
E.g.: I did a single read syscall for 500 MB from urandom; having all data up to the 200th MB known and no additional entropy in the pool, can I predict what the 201st megabyte will be?
In principle, yes, you can predict it. When there is no entropy available, /dev/urandom becomes a PRNG and its output can in principle be predicted once its internal state is known. In practice it is not that simple, because the internal state is reasonably large and the hash function prevents us from working backwards from the output. It can be determined by trial and error, but that is likely to take a very long time.
The definition of "cryptographically strong pseudo-random number generator" is that it is computationally infeasible to distinguish its output from that of a true random number generator. If you could predict future output from past output, you could so distinguish; ergo, you cannot do so unless the Linux urandom algorithm is weak.
That code does not look like any standard pseudo-random generator to me -- the Linux folks have an unfortunate habit of "rolling their own" -- but breaking it would probably be a publishable result anyway. So if it is breakable, I suspect it is not easy.
Certainly the intent of the design is for "no" to be the answer to your question.
[edit]
Of course, in an information-theoretic sense, the answer is "yes" because you cannot get infinite entropy out of finite entropy. But in an information-theoretic sense, there is no secure cipher other than a one-time pad. I am assuming you are asking about the practical/cryptographic sense.
[edit 2]
A little searching turns up this paper, which claims to demonstrate an attack against the "forward security" in Linux's /dev/urandom. (That is, given the state of the generator, try to reconstruct earlier states.)
This is why programmers should never try to invent their own cryptography. No matter how clever you think you are, some Israeli academics who do this stuff for a living can make you look stupid.
That said, I do not see any attacks against the output of the generator, which is what you are asking about.