Choosing an integer type in core data - cocoa

When I create models in core data, I'm always a little perplexed by which integer type I should choose—16, 32, 64. I'm almost always needing something for a simple, basic number: a count of people in the household, for my present case. Probably going to be a number between 1-20. Or, I have an incrementing case id number in another instance...can't imagine that going further than few hundred people.
And here's the deal...It's clear that true computer science folk think of numbers differently, taking into account factors like the architecture that's going to be processing the numbers, the space required to process and store the data, backwards compatibility, future proofing, etc. When I think of numbers, I basically think of how large a value is being represented. So when I get to that point of my process when I have to choose between three types of integers, I basically say to myself, "Well, this is going to be a small number, let's just use the Int 16 option...", or "Shoot, I could end up with a really big number here so let's use the Int 64 choice." Basically, I pick these data types with the same sort of logic I use when ordering fries...if I'm really hungry I go for the large, if I'm feeling a big guilty I'll just get the small.
I'm learning enough to know that I'm not thinking about this in the right terms, but I don't really know why, and I don't know the appropriate way to choose the best option. What factors should I really be considering...what's the most important criteria for selecting between Int 16, Int 32, and Int 64?

It doesn't matter much.
Assuming you're using a SQLite persistent store, the three integer types are all represented as SQLite INTEGER fields (same for Core Data's "Boolean" type). And in SQLite field type is purely advisory anyway, so even that doesn't mean much. Therefore: it makes literally no difference in terms of storage space. SQLite will optimize itself based on how big the integer values are, and larger int types at the Core Data level will have no effect.
For memory usage, it might have a small impact. If you use a 64 bit int instead of a 16 bit, you're requesting more bits than you need. But unless you have extremely large data sets, it's unlikely that you'll ever have a reason to care.
My usual rule then, is to use Integer 64 for any integral value.

You do it correctly: The size of the maximum expected number. But do not care to much about it. Even if you choose a integer "too big", is doesn't fail, but simply uses more memory than needed. On disk.
I think that different sized integers are an anachronism.

Related

Use big.Rat with Go to get Abs() value

I am a beginner with Go and a java developer.
I am currently working with big.Rat.
I need to get the Abs of a Rat n for which I have to write something like
n.Abs(n) or something like big.Rat{}.Abs(n)
Why didn't go provide something like just n.Abs()?
Or am I going wrong somewhere?
Go's big package is concerned with memory allocation when it comes to its function signatures. A big.Rat consists of two big.Ints which each contain an array of uints. Unlike an int (native 32 or 64 bit integer), a big.Int must thus be allocated dynamically, depending on its value. For large values this means more elements in the array.
Your proposed function signature n.Abs() would mean that a new array of the same size as n's would have to be allocated for this operation. In reality we often have the case that the original n is no longer needed, thus we can reuse its existing memory. To allow this, the Abs function takes a pointer to an existing big.Rat which might be n itself. The implementation can now reuse the memory. The caller is now in full control of what memory to use for these operations.
This might not make the nicest API for all use cases, in fact if you just want to do a quick calculation for a few large numbers, on a computer with Gigabytes of RAM, you might have preferred the n.Abs() version, but if you do numerically expensive computations with a lot of large numbers, you must be able to control your memory. Imagine doing some image manipulation on a Raspberry for example, where you are more constraint by the available memory. In this case the existing API allows you to be more efficient.

Is there difference between Cache index address calculation vs Division hash function?

Upon studying hash data structure and cache memory from computer architecture, I noticed that they're very similar.
Division hash function calculates index by hash(k) = k Mod (table size M) but my DS book says M should be a prime number or at least an odd number, because if M is an even number, the result is always even when k is even, odd when k is odd, so even M should be avoided since you often use memory addresses which are always even.
And yet, my CA book says for direct-mapped cache you use (Block address) Mod (Number of blocks in the cache) and the result indices look uniform. Why is this? It's all very confusing because MIPS uses 32 bit address every 4 bytes which is even number. But I think it's because they threw out the last 2 bits since they're byte offsets?
And, since it uses (Block address) Mod (Number of blocks in the cache), it makes the cache size power of 2 so that you can just use the lower x bits of the block address.
But this method looks exactly the same as division hash function, except you make the hash table power of 2, which is even (data structure book said use prime or odd) and use the lower bits of the block address.
Are these 2 different methods? If so, what's the cache one called? I would really appreciate a reply please. Thank you.
The reason for not using an even number for hash table is described here.
And how caches use addresses to calculate line numbers are described here. And its ok for caches to map more than one entry to the same line. Just because an address is mapped to a cacheline which has data, we don't blindly use the data in that cacheline. We also do a tag comparison to make sure that the content is the cacheline is what exactly we are looking for.
The reason for using a prime to take the modulo by is to get "mixing" of the bits, which is helpful if the integers that you're hashing have a poor structure. That isn't the only way to deal with it though, and for example the Java standard library doesn't use that, it uses a separate "mixing" function (that XORs the input with right-shifted versions of itself) and then uses a power-of-two sizes table. Either way it's protection against badly distributed input, which isn't necessary in and of itself - if the input was always nicely distributed you wouldn't need it.
Memory addresses are usually fairly nicely distributed, because it's typically used in sequential pieces. The obvious exception is that you will see highly aligned big objects, which would conflict with each other in the cache if nothing was done about it. Of course you will probably use a set-associative cache rather than direct mapped, since it is far more robust against degradation, and that would take care of a lot of that. But nothing is ever immune to bad patterns (that also goes for hash-mod-prime, which you can easily defeat if you know the prime), but a fairly simple improvement (which is also used in practice, or at least was, more advanced techniques exist now - combined with adaptive replacement strategies that mitigate bad access patterns) is to XOR some of the higher address bits into the index. This is hash-strengthening, the same technique used in the Java standard library, but a much simpler version of it.
Computing a remainder by a prime number (or really anything that isn't a power of two) is not something you'd want to do in this case, it's a slow computation by itself, and it leaves you with an awkwardly sized cache that doesn't fully use the power of its decoders, which adds to the slowness (or reduces cache size for a given latency, depending on how you look at it). The difference between that and XORing some of the high bits into the low bits is much bigger in hardware than it is in software, since XOR is really a trivial operation in hardware, much faster as a circuit operation than as an instruction.

Is it fastest to access a byte than a bit? Why?

The question is very straight: is it fastest to access a byte than a bit? If I store 8 booleans in a byte will it be slower when I have to compare them than if I used 8 bytes? Why?
Chances are no. The smallest addressable unit of memory in most machines today is a byte. In most cases, you can't address or access by bit.
In fact, accessing a specific bit might be even more expensive because you have to build a mask and use some logic.
EDIT:
Your question mentions "compare", I'm not sure exactly what you mean by that. But in some cases, you perform logic very efficiently on multiple booleans using bitwise operators if your booleans are densely packed into larger integer types.
As for which to use: array of bytes (with one boolean per byte), or a densely packed structure with one boolean per bit is a space-effiicency trade-off. For some applications that need to store a massive amount of bools, dense packing is better since it saves memory.
The underlying hardware that your code runs on is built to access bytes (or longer words) from memory. To read a bit, you have to read the entire byte, and then mask off the bits you don't care about, and possibly also shift to get the bit into the ones position. So the instructions to access a bit are a superset of the instructions to access a byte.
It may be faster to store the data as bits for a different reason - if you need to traverse and access many 8-bit sets of flags in a row. You will perform more ops per boolean flag, but you will traverse less memory by having it packed in fewer bytes. You will also be able to test multiple flags in a single operation, although you may be able to do this with bools to some extent as well, as long as they lie within a single machine word.
The memory latency penalty is far higher than register bit twiddling. In the end, only profiling the code on the hardware on which it will actually run will tell you which way is best.
From a hardware point of view, I would say that in general all the bit masking and other operations in the best case might occur within a single clock (resulting in no different), but that entirely depends on hardware layer that you likely won't ever know the specifics of, and as such you cannot bank on it.
It's worth pointing out that things like the .NET system.collections.bitarray uses a 32bit integer array underneath to store it's bit data. There is likely a performance reason behind this implementation (even if only in a general case that 32bit words perform above average), I would suggest reading up about the inner workings of that might be revealing.
From a coding point of view, it really depends what you're going to do with the bits afterwards. That is to say if you're going to store your data in booleans such as:
bool a0, a1, a2, a3, a4, a5, a6, a7;
And then in your code you compare them one by one (and most of them together):
if ( a0 && a1 && !a2 && a3 && !a4 && (!a5 || a6) || a7) {
...
}
Then you will find that it will be faster (and likely neater in code) to use a bit mask. But really the only time this would matter is if you're going to be running this code millions of times in a high performance or time critical environment.
I guess what I'm getting at here is that you should do whatever your coding standards say (and if you don't have any or they don't consider such details then just do what looks neatest for your application and need).
But I highly suggest trying to look around and read a blog or two explaining the inner workings of the .NET system.collections.bitarray.
This depends on the kind of processor and motherboard data bus, i.e. 32 bit data bus will compare your data faster if you collect them into "word"s rather than "bool"s or "byte"s....
This is only valid when you are writing in assembly language when you can compare each instruction how many cycles it takes .... but since you are using compiler then it is almost the same.
However, collecting booleans into words or integers will be useful in saving memory required for variables.
Computers tend to access things in words. Accessing a bit is slower because it requires more effort:
Imagine I said something to you, then said "oh change my second word to instead".
Now imagine my edit instead was "oh, change the third letter in the second word to 's'".
Which requires more thinking on your part?

Does Kernel::srand have a maximum input value?

I'm trying to seed a random number generator with the output of a hash. Currently I'm computing a SHA-1 hash, converting it to a giant integer, and feeding it to srand to initialize the RNG. This is so that I can get a predictable set of random numbers for an set of infinite cartesian coordinates (I'm hashing the coordinates).
I'm wondering whether Kernel::srand actually has a maximum value that it'll take, after which it truncates it in some way. The docs don't really make this obvious - they just say "a number".
I'll try to figure it out myself, but I'm assuming somebody out there has run into this already.
Knowing what programmers are like, it probably just calls libc's srand(). Either way, it's probably limited to 2^32-1, 2^31-1, 2^16-1, or 2^15-1.
There's also a danger that the value is clipped when cast from a biginteger to a C int/long, instead of only taking the low-order bits.
An easy test is to seed with 1 and take the first output. Then, seed with 2i+1 for i in [1..64] or so, take the first output of each, and compare. If you get a match for some i=n and all greater is, then it's probably doing arithmetic modulo 2n.
Note that the random number generator is almost certainly limited to 32 or 48 bits of entropy anyway, so there's little point seeding it with a huge value, and an attacker can reasonably easily predict future outputs given past outputs (and an "attacker" could simply be a player on a public nethack server).
EDIT: So I was wrong.
According to the docs for Kernel::rand(),
Ruby currently uses a modified Mersenne Twister with a period of 2**19937-1.
This means it's not just a call to libc's rand(). The Mersenne Twister is statistically superior (but not cryptographically secure). But anyway.
Testing using Kernel::srand(0); Kernel::sprintf("%x",Kernel::rand(2**32)) for various output sizes (2*16, 2*32, 2*36, 2*60, 2*64, 2*32+1, 2*35, 2*34+1), a few things are evident:
It figures out how many bits it needs (number of bits in max-1).
It generates output in groups of 32 bits, most-significant-bits-first, and drops the top bits (i.e. 0x[r0][r1][r2][r3][r4] with the top bits masked off).
If it's not less than max, it does some sort of retry. It's not obvious what this is from the output.
If it is less than max, it outputs the result.
I'm not sure why 2*32+1 and 2*64+1 are special (they produce the same output from Kernel::rand(2**1024) so probably have the exact same state) — I haven't found another collision.
The good news is that it doesn't simply clip to some arbitrary maximum (i.e. passing in huge numbers isn't equivalent to passing in 2**31-1), which is the most obvious thing that can go wrong. Kernel::srand() also returns the previous seed, which appears to be 128-bit, so it seems likely to be safe to pass in something large.
EDIT 2: Of course, there's no guarantee that the output will be reproducible between different Ruby versions (the docs merely say what it "currently uses"; apparently this was initially committed in 2002). Java has several portable deterministic PRNGs (SecureRandom.getInstance("SHA1PRNG","SUN"), albeit slow); I'm not aware of something similar for Ruby.

Best algorithm for hashing number values?

When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers, or bank account numbers. Preferred output would be a single unsigned integer to assist in matching purposes.
My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters and because of that, the collision rate might be higher than when run against a larger sample.
The target language is Delphi, however answers from other languages are welcome if they can provide a mathmatical basis which can lead to an optimal solution.
The purpose of this routine will be to determine if a previously received card/account was previously processed or not. The input file could have multiple records against a database of multiple records so performance is a factor.
With security questions all the answers lay on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure, and one that is very convenient. Given that and the explanation of each you can choose the best solution for your system.
You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CCV, expiration date, name, etc. would render it useless since it the value could be different with the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.
The convenient solution is to use a FNV (As Zebrabox and Nick suggested). This will produce a 32 bit number that will index quickly for searches. The downside of course is that it only allows for at max 4 billion different numbers, and in practice will produce collisions much quicker then that. Because it has such a high collision rate a brute force attack will probably generate enough invalid results as to make it of little use.
The secure solution is to rely on SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed to not have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.
Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.
If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may effect the duration. You may find that your optimizing the fastest point in the process, which will have little to no actual impact.
Also, I would recommend that you benchmark the look-up on the full hash vs. the 32 bit subset. Most modern database system are fairly fast and contain a number of optimizations and frequently optimize for us doing things the easy way. When we try to get smart we sometimes just slow it down. What is that quote about premature optimization . . . ?
This seems to be a case for key derivation functions. Have a look at PBKDF2.
Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force because this hash algorithms are usually designed to be as fast as possible.
UPDATE
Okay, security is no concern for your task. Because you have already a numerical input, you could just use this (account) number modulo your hash table size. If you process it as string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest to try highly non-linear hash function to spread this clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128 bit hash in four groups of 32 bits, combine them using XOR, and interpret the result as a 32 bit integer.
While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.
If you need security, use a cryptographically secure hash, such as SHA-256.
I needed to look deeply into hash functions a few months ago. Here are some things I found.
You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but could be 16 or 64-bits.) You want every character of the input to have and equally large effect on the output.
ALL the simple hashes (like ELF or PJW) that simply loop through the string and xor in each byte with a shift or a mod will fail that criteria for a simple reason: The last characters added have the most effect.
But there are some really good algorithms available in Delphi and asm. Here are some references:
See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c
SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html
You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his
hashtable. I suggested the old but nicely performing elf-hash, but also noted
a much better hash function I recently found. It was called SuperFastHash (SFH)
and was created by Paul Hsieh to overcome his 'problems' with the hash functions
from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm.
A few people worked on a basm implementation and posted it."
The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies c++ code MurmurrHash2.cpp and aligned-read-only implementation -
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implimentations of SuperFashHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman impliments both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196
Sorry if the above turns out to look like a mess. I had to just cut&paste it.
At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.
You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.
If performance is a factor I suggest to take a look at a CodeCentral entry of Peter Below. It performs very well for large number of items.
By default it uses P.J. Weinberger ELF hashing function. But others are also provided.
By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.
So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.
For a non cryptographic approach you could take a look at the FNV hash it's fast with a low collision rate.
As a very fast alternative, I've also used this algorithm for a few years and had few collision issues however I can't give you a mathematical analysis of it's inherent soundness but for what it's worth here it is
=Edit - My code sample was incorrect - now fixed =
In c/c++
unsigned int Hash(const char *s)
{
int hash = 0;
while (*s != 0)
{
hash *= 37;
hash += *s;
s++;
}
return hash;
}
Note that '37' is a magic number, so chosen because it's prime
Best hash function for the natural numbers let
f(n)=n
No conflicts ;)

Resources