Hashing for sparse bit vectors - algorithm

Does anyone have any good intuition for a good hash function for a sparse bit vector?
To give a concrete example, say I want to hash a 4096 bit integer where the probability of each bit being 1 is 10%.
I want to get some compression in the hash. For example 4096 bits in and 32 bits out. This is just an example to illustrate what I am looking for. Of course, all answers are very much appreciated.

Would a Bloom filter help?
If the bit vector is 2^32 bits, then why not just use a 32 bit integer?

I would just hash the bits as usual by calling
hash<vector<bool>>(...)
if you're using C++0x, or else see boost::hash.
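Since the vector is sparse, another option is to describe it by the indices of its set bits and mix those into a fixed-width hash. A minimal sketch (the FNV-1a-style constants are illustrative, not a vetted choice for this use):

```python
def hash_sparse_bits(set_positions, out_bits=32):
    """Hash a sparse bit vector given as an iterable of set-bit indices."""
    h = 0xcbf29ce484222325  # FNV-1a 64-bit offset basis
    for pos in sorted(set_positions):  # sort so representation order doesn't matter
        h ^= pos
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV-1a 64-bit prime
    return h & ((1 << out_bits) - 1)  # fold to the requested output width

# A 4096-bit vector with ~10% density has ~410 set bits:
v = {i for i in range(4096) if (i * 2654435761) % 10 == 0}
print(hex(hash_sparse_bits(v)))
```

The advantage over hashing all 4096 bits is that the work is proportional to the number of set bits, not the vector length.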


Bitboard algorithms for board sizes greater than 64?

I know the Magic Bitboard technique is useful for modern games played on an 8x8 grid, because the board aligns perfectly with a single 64-bit integer, but is the idea extensible to board sizes greater than 64 squares?
Some games like Shogi have larger board sizes such as 81 squares, which doesn't cleanly fit into a 64-bit integer.
I assume you'd have to use multiple integers, but would it be better to use two 64-bit integers or something like three 32-bit ones?
I know there probably isn't a trivial answer to this, but what kind of knowledge would I need in order to research something like this? I only have some basic/intermediate algorithms and data structures knowledge.
Yes, you could do this with a structure that contains multiple integers of varying lengths. For example, you could use 11 unsigned bytes. Or a 64-bit integer and a 32-bit integer, etc. Anything that will add up to 81 or more bits.
I rather like the idea of three 32-bit integers because you can store three rows per integer. It makes your indexing code simpler than if you used a 64-bit integer and a 32-bit integer. 9 16-bit words would work well, too, but you're wasting almost half your bits.
You could use 11 unsigned bytes, but the indexing is kind of ugly.
All things considered, I'd probably go with the 3 32-bit integers, using the low 27 bits of each.
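The three-words-of-27-bits layout described above can be sketched as follows. (Python integers are unbounded, so here the 32-bit words are only simulated; in C or C++ you would use three `uint32_t` values. The `Board81` class name is hypothetical.)

```python
# Sketch of an 81-square board as three 32-bit words, 27 bits each
# (three 9-square rows per word), following the layout suggested above.

class Board81:
    def __init__(self):
        self.words = [0, 0, 0]  # word k holds rows 3k .. 3k+2

    def _locate(self, square):  # square in 0..80
        return square // 27, square % 27  # (word index, bit offset)

    def set(self, square):
        w, b = self._locate(square)
        self.words[w] |= 1 << b

    def clear(self, square):
        w, b = self._locate(square)
        self.words[w] &= ~(1 << b)

    def test(self, square):
        w, b = self._locate(square)
        return (self.words[w] >> b) & 1

b = Board81()
b.set(80)          # last square lands in word 2, bit 26
print(b.test(80))  # → 1
```

Note how the divide/modulo by 27 is the entire indexing cost; with a 64-bit + 32-bit split, the word boundary would fall mid-row and the arithmetic would be messier.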

Why SHA2 has a 384 bit version?

I understand that there are 256- and 512-bit versions, because those are powers of 2. But where did 384 come from?
I know that the binary representation of 384 is 110000000, but I can't understand the logic.
It is not the midpoint between 256 and 512, nor even the logarithmic midpoint.
Why 384?
A quick look on Wikipedia finds this:
SHA-256 and SHA-512 are novel hash functions computed with 32-bit and
64-bit words, respectively. They use different shift amounts and
additive constants, but their structures are otherwise virtually
identical, differing only in the number of rounds. SHA-224 and SHA-384
are simply truncated versions of the first two, computed with
different initial values.
Looking at the comparison between all the variants, it seems that SHA-384 is more resistant to length extension attacks than SHA-512 (its longer version).
You can find a more detailed answer on Cryptography Stack Exchange: here.
256 + 128 = 384
It is simply the sum of two of the powers of 2 mentioned above.
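The truncation relationship is easy to observe with Python's hashlib: SHA-384 produces a 48-byte (384-bit) digest using the same 64-bit-word structure as SHA-512, but because it starts from different initial values, its output is not simply a prefix of the SHA-512 digest of the same message.

```python
import hashlib

msg = b"hello"
d384 = hashlib.sha384(msg).hexdigest()
d512 = hashlib.sha512(msg).hexdigest()

print(len(d384) * 4)          # 384 bits
print(len(d512) * 4)          # 512 bits
print(d512.startswith(d384))  # False: different initial values, not a plain prefix
```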

How is data stored in a bit vector?

I'm a bit confused about how a fixed-size bit vector stores its data.
Let's assume that we have a bit vector bv that I want to store hello in as ASCII.
So we do bv[0]=104, bv[1]=101, bv[2]=108, bv[3]=108, bv[4]=111.
How is the ASCII of hello represented in the bit vector?
Is it as binary like this: [01101000][01100101][01101100][01101100][01101111]
or as ASCII like this: [104][101][108][108][111]
In the paper HAMPI, at section 3.5 step 2, the author assigns ASCII codes to a bit vector, but I'm confused about how the characters are represented in the bit vector.
Firstly, you should probably read up on what a bit vector is, just to make sure we're on the same page.
Bit vectors don't represent ASCII characters, they represent bits. Trying to do bv[0]=104 on a bit vector will probably not compile / run, or, if it does, it's very unlikely to do what you expect.
The operations you would expect to be supported are along the lines of: set the 5th bit to 1, set the 10th bit to 0, set all these bits to this value, OR the bits of these two vectors, and probably some others.
How these are actually stored in memory is completely up to the programming language, and, on top of that, it may even be completely up to a given implementation of that language.
The general consensus (not a rule) is that each bit should take up roughly one bit in memory (on average perhaps slightly more, since there can be overhead related to storing them).
As one example (how Java does it), you could have an array of 64-bit numbers and store 64 bits in each position. The translation to ASCII won't make sense in this case.
Another thing you should know - even ASCII gets stored as bits in memory, so those 2 arrays are essentially the same, unless you meant something else.
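To make that concrete, here is a toy bit vector backed by a single Python integer. Storing "hello" means setting the 40 individual bits of its ASCII encoding, one bit at a time, rather than assigning bytes. (The `BitVector` class is hypothetical, purely for illustration; real implementations pack bits into machine words.)

```python
class BitVector:
    """Toy fixed-size bit vector; bits are stored LSB-first within each byte."""
    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = 0  # a Python int as the backing store

    def set(self, i, value):
        if value:
            self.bits |= 1 << i
        else:
            self.bits &= ~(1 << i)

    def get(self, i):
        return (self.bits >> i) & 1

bv = BitVector(40)
for byte_index, ch in enumerate(b"hello"):   # codes 104, 101, 108, 108, 111
    for bit in range(8):
        bv.set(byte_index * 8 + bit, (ch >> bit) & 1)

# Recover the first character from its 8 bits:
first = sum(bv.get(b) << b for b in range(8))
print(chr(first))  # → h
```

So both of the asker's pictures are "right" in a sense: the vector only ever holds bits, and [104] is just shorthand for the bit pattern 01101000.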

Arbitrary precision arithmetic with Ruby

How the heck does Ruby do this? Does Jörg or anyone else know what's happening behind the scenes?
Unfortunately I don't know C very well, so bignum.c is of little help to me. I was just kind of curious if someone could explain (in plain English) the theory behind whatever miracle algorithm it's using.
irb(main):001:0> 999**999
36806348825922326789470084006052186583833823203735320465595962143702560930047223153010387361450517521869134525758989639113039318944796977164583238219236607653663113200177617597793217865870366077846576581183082787698201412402294867197567813172495806442794990281049897327103078771678146741952418004073439899695293083250893411694596612017673512082315195977953685229009037745250223699083945341679064045611647113975154675004860218929102864097057476260018595022613824453018748921161586402113531207791201884463078030746220525280773775767209432069237310103251745951849752401512016516672418981676639724782417539480202822816002710062399887366743579907305461890685546048835142661131063402348904429186051035230191242660848880746231212659020683041378266455426041126637886662665375576362779656908293178564560081623689116814177499326748817170217219107273106921688166829462567949269614897699986871567144087420642721205671737309963971116890119744041659022652419278284289641541461168818739123204832773896582026593409310817205487518824659176087713165789563358657661185727701178249794352294501124843043920129701511946873071236400763937391081195343030947683245323012399675023571078708664107031028872538959513893678471527415042649541619666983267998025343680786418716005458904566402715881795854937449051239905544881914848704936367461166460989003008854959199246636005004256627034833091179548764704594930128661465865007129969565224526608067298992179934250929163533082787426478958730697447232771870430635244592599615561915378391323721271601041029499987756974528735342290344338756274645252286042041668901973291379807377328153357091020520776715712817418487335705083075277790004194325673849906782148842105387086902273869881605981057922100256088299988476325216174756689383517855896114234930446650640237355631870717571086698303531312206832110245782411201496938722547625934287286636355038384072001083290669536055355664754529584996627998083056124296001365452951499511358490905081301519892828320218919461550140343555306014771313976632
3195743324848047347575473228198492343231496580885057330510949058490527738662697480293583612233134502078182014347192522391449087738579081585795613547198599661273567662441490401862839817822686573112998663038868314974259766039340894024308383451039874674061160538242392803580758232755749310843694194787991556647907091849600704712003371103926967137408125713631396699343733288014254084819379380555174777020843568689927348949484201042595271932630685747613835385434424807024615161848223715989797178155169951121052285149157137697718850449708843330475301440373094611119631361702936342263219382793996895988331701890693689862459020775599439506870005130750427949747071390095256759203426671803377068109744629909769176319526837824364926844730545524646494321826241925107158040561607706364484910978348669388142016838792902926158979355432483611517588605967745393958061959024834251565197963477521095821435651996730128376734574843289089682710350244222290017891280419782767803785277960834729869249991658417000499998999
Simple: it does it the same way you do, ever since first grade. Except it doesn't compute in base 10, it computes in base 4 billion (and change).
Think about it: with our number system, we can only represent numbers from 0 to 9. So, how can we compute 6+7 without overflowing? Easy: we do actually overflow! We cannot represent the result of 6+7 as a number between 0 and 9, but we can overflow to the next place and represent it as two numbers between 0 and 9: 3×10⁰ + 1×10¹. If you want to add two numbers, you add them digit-wise from the right and overflow ("carry") to the left. If you want to multiply two numbers, you have to multiply every digit of one number individually with the other number, then add up the intermediate results.
BigNum arithmetic (this is what this kind of arithmetic, where the numbers are bigger than the native machine numbers, is usually called) works basically the same way. Except that the base is not 10, and it's not 2, either – it's the size of a native machine integer. So, on a 32-bit machine, it would be base 2³², or 4,294,967,296.
Specifically, in Ruby Integer is actually an abstract class that is never instantiated. Instead, it has two subclasses, Fixnum and Bignum, and numbers automagically migrate between them, depending on their size. In MRI and YARV, Fixnum can hold a 31- or 63-bit signed integer (one bit is used for tagging), depending on the native word size of the machine. In JRuby, a Fixnum can hold a full 64-bit signed integer, even on a 32-bit machine.
The simplest operation is adding two numbers. And if you look at the implementation of +, or rather bigadd_core, in YARV's bignum.c, it's not too bad to follow. I can't read C either, but you can clearly see how it loops over the individual digits.
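That digit-wise loop can be sketched like this, using base-2³² "digits" (limbs) stored least-significant first. This mirrors the structure of schoolbook addition as described above, not the actual bigadd_core code:

```python
BASE = 1 << 32  # each "digit" is one 32-bit machine word

def big_add(a, b):
    """Add two little-endian lists of base-2**32 digits, schoolbook style."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        s = da + db + carry
        result.append(s % BASE)  # the digit that stays in this place
        carry = s // BASE        # the overflow carried to the next place
    if carry:
        result.append(carry)
    return result

# 2**32 - 1 plus 1 carries into a second limb, just like 9 + 1 = 10:
print(big_add([BASE - 1], [1]))  # → [0, 1]
```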
You could read the source for bignum.c...
At a very high level, without going into any implementation details, bignums are calculated "by hand" like you used to do in grade school. Now, there are certainly many optimizations that can be applied, but that's the gist of it.
I don't know of the implementation details so I'll cover how a basic Big Number implementation would work.
Basically, instead of relying on CPU integers, it creates its own representation out of multiple CPU integers. To see how it stores arbitrary precision, let's say you have 2 bits. So the current integer is 11. You want to add one. With normal CPU integers, this would roll over to 00.
But for a big number, instead of rolling over and keeping a fixed integer width, it would allocate another bit and simulate the addition, so that the number becomes the correct 100.
Try looking up how binary math can be done on paper. It's very simple and is trivial to convert to an algorithm.
Beaconaut APICalc 2, just released on Jan. 18, 2011, is an arbitrary-precision integer calculator for bignum arithmetic, cryptography analysis and number theory research.
http://www.beaconaut.com/forums/default.aspx?g=posts&t=13
It uses the Bignum class
irb(main):001:0> (999**999).class
=> Bignum
Rdoc is available of course

Simple integer encryption

Is there a simple algorithm to encrypt integers? That is, a function E(i,k) that accepts an n-bit integer and a key (of any type) and produces another, unrelated n-bit integer that, when fed into a second function D(E(i),k) (along with the key) produces the original integer?
Obviously there are some simple reversible operations you can perform, but they all seem to produce clearly related outputs (e.g. consecutive inputs lead to consecutive outputs). Also, of course, there are cryptographically strong standard algorithms, but they don't produce small enough outputs (e.g. 32-bit). I know any 32-bit cryptography can be brute-forced, but I'm not looking for something cryptographically strong, just something that looks random. Theoretically speaking it should be possible; after all, I could just create a dictionary by randomly pairing every integer. But I was hoping for something a little less memory-intensive.
Edit: Thanks for the answers. Simple XOR solutions will not work because similar inputs will produce similar outputs.
Wouldn't this amount to a block cipher with a block size of 32 bits?
Not very popular, because it's easy to break, but theoretically feasible.
Here is one implementation in Perl :
http://metacpan.org/pod/Crypt::Skip32
UPDATE: See also Format preserving encryption
UPDATE 2: RC5 supports 32-, 64- and 128-bit block sizes
I wrote an article some time ago about how to generate a 'cryptographically secure permutation' from a block cipher, which sounds like what you want. It covers using folding to reduce the size of a block cipher, and a trick for dealing with non-power-of-2 ranges.
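A small Feistel network is the standard way to build such a keyed permutation of 32-bit integers (it is what Skip32 does internally). Here is a toy version: the round function below is an arbitrary mixer chosen for illustration and is NOT secure, but the construction is exactly invertible and consecutive inputs produce unrelated-looking outputs.

```python
# Toy 32-bit Feistel permutation: invertible and "random-looking", NOT secure.

MASK16 = 0xFFFF

def round_fn(half, round_key):
    # Hypothetical mixing constants, for illustration only.
    x = (half * 0x9E3B + round_key) & 0xFFFFFFFF
    return (x ^ (x >> 7)) & MASK16

def encrypt32(i, key, rounds=8):
    left, right = (i >> 16) & MASK16, i & MASK16
    for r in range(rounds):
        left, right = right, left ^ round_fn(right, key + r)
    return (left << 16) | right

def decrypt32(c, key, rounds=8):
    left, right = (c >> 16) & MASK16, c & MASK16
    for r in reversed(range(rounds)):  # undo the rounds in reverse order
        left, right = right ^ round_fn(left, key + r), left
    return (left << 16) | right

k = 0xDEADBEEF
print(hex(encrypt32(1, k)), hex(encrypt32(2, k)))  # consecutive in, unrelated out
print(decrypt32(encrypt32(1, k), k))  # → 1
```

Because each round only swaps halves and XORs one half with a function of the other, decryption never needs to invert round_fn itself; any round function gives a permutation.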
A simple one:
rand = new Random(k);
return (i xor rand.Next())
(the point of XOR-ing with rand.Next() rather than with k is that otherwise, given i and E(i,k), you can recover k as k = i xor E(i,k))
Ayden is an algorithm that I developed. It is compact, fast and looks very secure. It is currently available for 32 and 64 bit integers. It is on public domain and you can get it from http://github.com/msotoodeh/integer-encoder.
You could take an n-bit hash of your key (assuming it's private) and XOR that hash with the original integer to encrypt, and with the encrypted integer to decrypt.
Probably not cryptographically solid, but depending on your requirements, may be sufficient.
If you just want it to look random and don't care about security, how about just swapping bits around? You could simply reverse the bit string, so the high bit becomes the low bit, the second-highest the second-lowest, and so on, or you could do some other fixed permutation (e.g. 1 to 4, 2 to 7, 3 to 1, etc.).
How about XORing it with a prime or two? Swapping bits around seems very random when trying to analyze it.
Try something along the lines of XORing it with a prime and itself after bit shifting.
How many integers do you want to encrypt? How much key data do you want to have to deal with?
If you have few items to encrypt, and you're willing to deal with key data that's just as long as the data you want to encrypt, then the one-time-pad is super simple (just an XOR operation) and mathematically unbreakable.
The drawback is that the problem of keeping the key secret is about as large as the problem of keeping your data secret.
It also has a flaw that is run into time and again whenever someone decides to try to use it: if you take any shortcuts - like using a non-random key, or the common one of using a limited-length key and recycling it - it becomes about the weakest cipher in existence. Well, maybe ROT13 is weaker.
But in all seriousness, if you're encrypting an integer, what are you going to do with the key no matter which cipher you decide on? Keeping the key secret will be a problem about as big (or bigger) than keeping the integer secret. And if you're encrypting a bunch of integers, just use a standard, peer reviewed cipher like you'll find in many crypto libraries.
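For completeness, the one-time pad described above really is just one XOR with key material as long as the data. A minimal sketch, using Python's secrets module for the random pad:

```python
import secrets

def otp_encrypt(i, pad):
    return i ^ pad  # the same XOR both encrypts and decrypts

pad = secrets.randbits(32)  # the pad must be truly random and never reused
secret = 123456
cipher = otp_encrypt(secret, pad)
print(otp_encrypt(cipher, pad))  # → 123456
```

Reusing the pad for a second integer immediately leaks the XOR of the two plaintexts, which is the shortcut trap mentioned above.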
RC4 will produce as little output as you want, since it's a stream cipher.
XOR it with /dev/random
