Is there difference between Cache index address calculation vs Division hash function? - caching

Upon studying hash data structure and cache memory from computer architecture, I noticed that they're very similar.
Division hash function calculates index by hash(k) = k Mod (table size M) but my DS book says M should be a prime number or at least an odd number, because if M is an even number, the result is always even when k is even, odd when k is odd, so even M should be avoided since you often use memory addresses which are always even.
And yet, my CA book says for direct-mapped cache you use (Block address) Mod (Number of blocks in the cache) and the result indices look uniform. Why is this? It's all very confusing because MIPS uses 32 bit address every 4 bytes which is even number. But I think it's because they threw out the last 2 bits since they're byte offsets?
And, since it uses (Block address) Mod (Number of blocks in the cache), it makes the cache size power of 2 so that you can just use the lower x bits of the block address.
But this method looks exactly the same as division hash function, except you make the hash table power of 2, which is even (data structure book said use prime or odd) and use the lower bits of the block address.
Are these 2 different methods? If so, what's the cache one called? I would really appreciate a reply please. Thank you.

The reason for not using an even number for hash table is described here.
And how caches use addresses to calculate line numbers are described here. And its ok for caches to map more than one entry to the same line. Just because an address is mapped to a cacheline which has data, we don't blindly use the data in that cacheline. We also do a tag comparison to make sure that the content is the cacheline is what exactly we are looking for.

The reason for using a prime to take the modulo by is to get "mixing" of the bits, which is helpful if the integers that you're hashing have a poor structure. That isn't the only way to deal with it though, and for example the Java standard library doesn't use that, it uses a separate "mixing" function (that XORs the input with right-shifted versions of itself) and then uses a power-of-two sizes table. Either way it's protection against badly distributed input, which isn't necessary in and of itself - if the input was always nicely distributed you wouldn't need it.
Memory addresses are usually fairly nicely distributed, because it's typically used in sequential pieces. The obvious exception is that you will see highly aligned big objects, which would conflict with each other in the cache if nothing was done about it. Of course you will probably use a set-associative cache rather than direct mapped, since it is far more robust against degradation, and that would take care of a lot of that. But nothing is ever immune to bad patterns (that also goes for hash-mod-prime, which you can easily defeat if you know the prime), but a fairly simple improvement (which is also used in practice, or at least was, more advanced techniques exist now - combined with adaptive replacement strategies that mitigate bad access patterns) is to XOR some of the higher address bits into the index. This is hash-strengthening, the same technique used in the Java standard library, but a much simpler version of it.
Computing a remainder by a prime number (or really anything that isn't a power of two) is not something you'd want to do in this case, it's a slow computation by itself, and it leaves you with an awkwardly sized cache that doesn't fully use the power of its decoders, which adds to the slowness (or reduces cache size for a given latency, depending on how you look at it). The difference between that and XORing some of the high bits into the low bits is much bigger in hardware than it is in software, since XOR is really a trivial operation in hardware, much faster as a circuit operation than as an instruction.

Related

Use big.Rat with Go to get Abs() value

I am a beginner with Go and a java developer.
I am currently working with big.Rat.
I need to get the Abs of a Rat n for which I have to write something like
n.Abs(n) or something like big.Rat{}.Abs(n)
Why didn't go provide something like just n.Abs()?
Or am I going wrong somewhere?
Go's big package is concerned with memory allocation when it comes to its function signatures. A big.Rat consists of two big.Ints which each contain an array of uints. Unlike an int (native 32 or 64 bit integer), a big.Int must thus be allocated dynamically, depending on its value. For large values this means more elements in the array.
Your proposed function signature n.Abs() would mean that a new array of the same size as n's would have to be allocated for this operation. In reality we often have the case that the original n is no longer needed, thus we can reuse its existing memory. To allow this, the Abs function takes a pointer to an existing big.Rat which might be n itself. The implementation can now reuse the memory. The caller is now in full control of what memory to use for these operations.
This might not make the nicest API for all use cases, in fact if you just want to do a quick calculation for a few large numbers, on a computer with Gigabytes of RAM, you might have preferred the n.Abs() version, but if you do numerically expensive computations with a lot of large numbers, you must be able to control your memory. Imagine doing some image manipulation on a Raspberry for example, where you are more constraint by the available memory. In this case the existing API allows you to be more efficient.

Why use 1 instead of -1?

At 29min mark of http://channel9.msdn.com/Events/GoingNative/2013/Writing-Quick-Code-in-Cpp-Quickly Andrei Alexandrescu says when using constants to prefer 0 and mentions hardware knows how to handle it. I did some assembly and I know what he is talking about and about the zero flag on CPUs
Then he says prefer the constant 1 rather then -1. -1 IIRC is not actually special but because it is negative the sign flag on CPUs would be set. 1 from my current understanding is simply a positive number there is no bit on the processor flag for it and no way to distinguish from 0 or other positive numbers.
But Andrei says to prefer 1 over -1. Why? What does hardware do with 1 that is better then -1?
First, it should be noted that Andrea Alexandrescu emphasized the difference between zero and the other two good constants, that the difference between using one and negative one is less significant. He also bundles compiler issues with hardware issues, i.e., the hardware might be able to perform the operation efficiently but the compiler will not generate the appropriate machine code given a reasonably clear expression in the chosen higher level language.
While I cannot read his mind, there are at least two aspects that may make one better than negative one.
Many ISAs provide comparison operations (or flag to GPR transfers) that return zero or one (e.g., MIPS has Set on Less Than) not zero or negative one. (SIMD instructions are an exception; SIMD comparisons typically generate zero or negative one [all bits set].)
At least one implementation of SPARC made loading smaller signed values more expensive, and I seem to recall that at least one ISA did not provide an instruction for loading signed bytes. Naive implementation of sign extension adds latency because whether to set or clear the more significant bits is not known until the value has been loaded.
Negative one does have some benefits. As you mentioned, testing for negativity is often relatively easy, so if negative one is the only negative value used it may be handled less expensively. Also, conditionally clearing a value based on zero or negative one is simply an and operation. (For conditionally setting or clearing a single bit, one rather than negative one would be preferred since such would involve only a shift and an and.)

Is it possible to count the number of Set bits in Number in O(1)? [duplicate]

This question already has answers here:
Count the number of set bits in a 32-bit integer
(65 answers)
Count bits in the number [duplicate]
(3 answers)
Closed 8 years ago.
I was asked the above question in an interview and interviewer is very certain of the answer. But i am not sure. Can anyone help me here?
Sure. The obvious brute force method is just a big lookup table with one entry for every possible value of the input number. That's not very practical if the number is very big, but is still enough to prove it's possible.
Edit: the notion has been raised that this is complete nonsense, and the same could be said of essentially any algorithm.
To a limited degree, that's a fair statement -- but the limitations are so severe that for most algorithms it remains utterly meaningless.
My original point (at least as well as I remember it) was that population counting is about equivalent to many other operations like addition and subtraction that we normally assume are O(1).
At the hardware level, circuitry for a single-cycle POPCNT instruction is probably easier than for a single-cycle ADD instruction. Just for one example, for any practical size of data word, we can use table lookups on 4-bit chunks in parallel, then add the results from those pieces together. Even using fairly unlikely worst-case assumptions (e.g., separate storage for each of those tables) this would still be easy to implement in a modern CPU -- in fact, it's probably at least somewhat simpler than the single-cycle addition or subtraction mentioned above1.
This is a decided contrast to many other algorithms. For one obvious example, let's consider sorting. For even the most trivial sort most people can imagine -- 2 items, 8 bits apiece, we're already at a 64 kilobyte lookup table to get constant complexity. Long before we can do even a fairly trivial sort (e.g., 100 items) we need a lookup table that contains far more data items than there are atoms in the universe.
Looking at it from the opposite direction, it's certainly true that at some point, essentially nothing is O(1) any more. Let's consider the most trivial operations possible. For an N-bit CPU, bitwise OR is normally implemented as a set of N OR gates in parallel. Unlike addition, there's no interaction between one bit and another, so for any practical size of CPU, this easy to execute in a single instruction.
Nonetheless, if I specify a bit-wise OR in which each operand is 100 petabits, there's nothing even approaching a practical way to do the job with constant complexity. Using the usual method of parallel OR gates, we end up with (among other things) 300 petabits worth of input and output lines -- a number that completely dwarfs even the number of pins on the largest CPUs.
On reasonable hardware, doing a bitwise OR on 100 petabit operands is going to take a while (not to mention quite a bit of hard drive space). If we increase that to 200 petabit operands, the time is likely to (about) double -- so from that viewpoint, it's an O(N) operation. Obviously enough, the same is going to be true with the other "trivial" operations like addition, subtraction, bit-wise AND, bit-wise XOR, and so on.
Nonetheless, unless you have very specific instructions to say you're going to be dealing with utterly immense operands, you're typically going to treat every one of these as a constant complexity operation. Looked at in these terms, a POPCNT instruction falls about halfway between bit-wise AND/OR/XOR on one hand, and addition/subtraction on the other, in terms of the difficulty to execute in fixed time.
1. You might wonder how it could possibly be simpler than an add when it actually includes an add after doing some other operations. If so, kudos -- it's an excellent question.
The answer is that it's because it only needs a much smaller adder. For example, a 64-bit CPU needs one half-adder and 63 full-adders. In the simple implementation, you carry out the addition bit-wise -- i.e., you add bit-0 of one operand to bit-0 of the other. That generates an output bit, and a carry bit. That carry bit becomes an input to the addition for the next pair of bits. There are some tricks to parallelize that to some degree, but the nature of the beast (so to speak) is bit-serial.
With a POPCNT instruction, we have an addition after doing the individual table lookups, but our result is limited to the size of the input words. Given the same size of inputs (64 bits) our final result can't be any larger than 64. That means we only need a 6-bit adder instead of a 64-bit adder.
Since, as outlined above, addition is basically bit-serial, this means that the addition at the end of the POPCNT instruction is fundamentally a lot faster than a normal add. To be specific, it's logarithmic on the operand size, whereas simple addition is roughly linear on the operand size.
If the bit size is fixed (e.g. natural word size of a 32- or 64-bit machine), you can just iterate over the bits and count them directly in O(1) time (though there are certainly faster ways to do it). For arbitrary precision numbers (BigInt, etc.), the answer must be no.
Some processors can do it in one instruction, obviously for integers of limited size. Look up the POPCNT mnemonic for further details.
For integers of unlimited size obviously you need to read the whole input, so the lower bound is O(n).
The interviewer probably meant the bit counting trick (the first Google result follows): http://www.gamedev.net/topic/547102-bit-counting-trick---new-to-me/

Is it fastest to access a byte than a bit? Why?

The question is very straight: is it fastest to access a byte than a bit? If I store 8 booleans in a byte will it be slower when I have to compare them than if I used 8 bytes? Why?
Chances are no. The smallest addressable unit of memory in most machines today is a byte. In most cases, you can't address or access by bit.
In fact, accessing a specific bit might be even more expensive because you have to build a mask and use some logic.
EDIT:
Your question mentions "compare", I'm not sure exactly what you mean by that. But in some cases, you perform logic very efficiently on multiple booleans using bitwise operators if your booleans are densely packed into larger integer types.
As for which to use: array of bytes (with one boolean per byte), or a densely packed structure with one boolean per bit is a space-effiicency trade-off. For some applications that need to store a massive amount of bools, dense packing is better since it saves memory.
The underlying hardware that your code runs on is built to access bytes (or longer words) from memory. To read a bit, you have to read the entire byte, and then mask off the bits you don't care about, and possibly also shift to get the bit into the ones position. So the instructions to access a bit are a superset of the instructions to access a byte.
It may be faster to store the data as bits for a different reason - if you need to traverse and access many 8-bit sets of flags in a row. You will perform more ops per boolean flag, but you will traverse less memory by having it packed in fewer bytes. You will also be able to test multiple flags in a single operation, although you may be able to do this with bools to some extent as well, as long as they lie within a single machine word.
The memory latency penalty is far higher than register bit twiddling. In the end, only profiling the code on the hardware on which it will actually run will tell you which way is best.
From a hardware point of view, I would say that in general all the bit masking and other operations in the best case might occur within a single clock (resulting in no different), but that entirely depends on hardware layer that you likely won't ever know the specifics of, and as such you cannot bank on it.
It's worth pointing out that things like the .NET system.collections.bitarray uses a 32bit integer array underneath to store it's bit data. There is likely a performance reason behind this implementation (even if only in a general case that 32bit words perform above average), I would suggest reading up about the inner workings of that might be revealing.
From a coding point of view, it really depends what you're going to do with the bits afterwards. That is to say if you're going to store your data in booleans such as:
bool a0, a1, a2, a3, a4, a5, a6, a7;
And then in your code you compare them one by one (and most of them together):
if ( a0 && a1 && !a2 && a3 && !a4 && (!a5 || a6) || a7) {
...
}
Then you will find that it will be faster (and likely neater in code) to use a bit mask. But really the only time this would matter is if you're going to be running this code millions of times in a high performance or time critical environment.
I guess what I'm getting at here is that you should do whatever your coding standards say (and if you don't have any or they don't consider such details then just do what looks neatest for your application and need).
But I highly suggest trying to look around and read a blog or two explaining the inner workings of the .NET system.collections.bitarray.
This depends on the kind of processor and motherboard data bus, i.e. 32 bit data bus will compare your data faster if you collect them into "word"s rather than "bool"s or "byte"s....
This is only valid when you are writing in assembly language when you can compare each instruction how many cycles it takes .... but since you are using compiler then it is almost the same.
However, collecting booleans into words or integers will be useful in saving memory required for variables.
Computers tend to access things in words. Accessing a bit is slower because it requires more effort:
Imagine I said something to you, then said "oh change my second word to instead".
Now imagine my edit instead was "oh, change the third letter in the second word to 's'".
Which requires more thinking on your part?

Best algorithm for hashing number values?

When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers, or bank account numbers. Preferred output would be a single unsigned integer to assist in matching purposes.
My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters and because of that, the collision rate might be higher than when run against a larger sample.
The target language is Delphi, however answers from other languages are welcome if they can provide a mathmatical basis which can lead to an optimal solution.
The purpose of this routine will be to determine if a previously received card/account was previously processed or not. The input file could have multiple records against a database of multiple records so performance is a factor.
With security questions all the answers lay on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure, and one that is very convenient. Given that and the explanation of each you can choose the best solution for your system.
You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CCV, expiration date, name, etc. would render it useless since it the value could be different with the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.
The convenient solution is to use a FNV (As Zebrabox and Nick suggested). This will produce a 32 bit number that will index quickly for searches. The downside of course is that it only allows for at max 4 billion different numbers, and in practice will produce collisions much quicker then that. Because it has such a high collision rate a brute force attack will probably generate enough invalid results as to make it of little use.
The secure solution is to rely on SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed to not have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.
Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.
If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may effect the duration. You may find that your optimizing the fastest point in the process, which will have little to no actual impact.
Also, I would recommend that you benchmark the look-up on the full hash vs. the 32 bit subset. Most modern database system are fairly fast and contain a number of optimizations and frequently optimize for us doing things the easy way. When we try to get smart we sometimes just slow it down. What is that quote about premature optimization . . . ?
This seems to be a case for key derivation functions. Have a look at PBKDF2.
Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force because this hash algorithms are usually designed to be as fast as possible.
UPDATE
Okay, security is no concern for your task. Because you have already a numerical input, you could just use this (account) number modulo your hash table size. If you process it as string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest to try highly non-linear hash function to spread this clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128 bit hash in four groups of 32 bits, combine them using XOR, and interpret the result as a 32 bit integer.
While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.
If you need security, use a cryptographically secure hash, such as SHA-256.
I needed to look deeply into hash functions a few months ago. Here are some things I found.
You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but could be 16 or 64-bits.) You want every character of the input to have and equally large effect on the output.
ALL the simple hashes (like ELF or PJW) that simply loop through the string and xor in each byte with a shift or a mod will fail that criteria for a simple reason: The last characters added have the most effect.
But there are some really good algorithms available in Delphi and asm. Here are some references:
See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c
SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html
You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his
hashtable. I suggested the old but nicely performing elf-hash, but also noted
a much better hash function I recently found. It was called SuperFastHash (SFH)
and was created by Paul Hsieh to overcome his 'problems' with the hash functions
from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm.
A few people worked on a basm implementation and posted it."
The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies c++ code MurmurrHash2.cpp and aligned-read-only implementation -
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implimentations of SuperFashHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman impliments both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196
Sorry if the above turns out to look like a mess. I had to just cut&paste it.
At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.
You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.
If performance is a factor I suggest to take a look at a CodeCentral entry of Peter Below. It performs very well for large number of items.
By default it uses P.J. Weinberger ELF hashing function. But others are also provided.
By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.
So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.
For a non cryptographic approach you could take a look at the FNV hash it's fast with a low collision rate.
As a very fast alternative, I've also used this algorithm for a few years and had few collision issues however I can't give you a mathematical analysis of it's inherent soundness but for what it's worth here it is
=Edit - My code sample was incorrect - now fixed =
In c/c++
unsigned int Hash(const char *s)
{
int hash = 0;
while (*s != 0)
{
hash *= 37;
hash += *s;
s++;
}
return hash;
}
Note that '37' is a magic number, so chosen because it's prime
Best hash function for the natural numbers let
f(n)=n
No conflicts ;)

Resources