Bloom Filter char based - algorithm

I am new to Bloom filters. I understand how to implement a Bloom filter with a bit array: we hash a value x with k hash functions and set the corresponding bit-array indices to 1.
But I am wondering how to implement a Bloom filter with a char array, especially if the input is a string. One way I can think of is adding up the ASCII values of the string's characters, hashing that sum, and setting the corresponding index of the char array to some value (I am also not sure what value to set, since with a char array it can't just be 0 or 1 as with a bit array), but the probability of false positives would be very high. Can someone give me some ideas to get started? (I do not need actual code, but I would really appreciate some insight into what hash function to use and how to map its output into a char array.)

You can use a hashing algorithm that converts the string to an integer hash, and then treat each bit of that integer as part of the bit array or char array. A simple polynomial hash works:
hash(S) = S[0]·p^0 + S[1]·p^1 + … + S[n-1]·p^(n-1)
You can use two such hashes (with different parameters) to reduce the chance of false positives; that will give you reasonable behavior.
The choice of p should be limited to primes, and p should be greater than the number of characters in the alphabet.
This will give you a better result than simple ASCII-value addition.
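A minimal sketch of this polynomial hash in Ruby; p = 31 and a 2^32 modulus are illustrative choices, not prescribed above:

```ruby
# Polynomial rolling hash: hash(S) = sum of S[i] * p^i, reduced mod m
# to keep the value bounded.
def poly_hash(str, p = 31, m = 2**32)
  hash = 0
  p_pow = 1
  str.each_byte do |b|
    hash = (hash + b * p_pow) % m
    p_pow = (p_pow * p) % m
  end
  hash
end
```

Running the same function with a different prime p gives you a second, largely independent hash for the filter.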
Note also that the hash functions used should be independent and uniformly distributed.
Speed is another criterion, which is why standard cryptographic hashes (like SHA-1) are not a good choice.
One standard hashing method is MurmurHash, which you can try and compare against the results you expect.
To be clear on how you would go about implementing it: you can use multiple hash functions, like Murmur, FNV-1a, or even the simple one I presented, getting one value from each hash. Set the appropriate positions, and that will work as your Bloom filter.
Since you are using several different hash functions, the false-positive probability depends on all of them together, which gives a better result.
For example:
You want to hash stackoverflow. You use 3 hash functions, which give you the numbers 11, 45 and 17. You keep a map in which you put these values:
{
11: 1,
45: 1,
17: 1
}
Now you hash another string the same way and get the values 11, 15 and 97.
The map then becomes:
{
11: 1,
15: 1,
17: 1,
45: 1,
97: 1
}
Note: I have used a map here, but it can also be something like a bit array where you set the bits; in the case of stackoverflow, bits 11, 17 and 45 would be set to 1.
This map is what lets you answer the query of whether an element is present or not.
Now, for a query, you do the same thing: compute the hash values and check whether they all exist. If they do, there is a high chance the element is there (only a chance, as it may be a false positive); if any is missing, the element is definitely not there.
Suppose you now check whether the string "abcd" is present. You apply the 3 hash functions used earlier and the results are 11, 99 and 55. You check whether all 3 of them exist; you can see that 55 is not there, so the string "abcd" is not in the set.
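Putting the whole scheme together, here is a minimal Bloom filter sketch in Ruby. Deriving the k hash functions by salting Ruby's String#hash is an assumption for illustration; MurmurHash or FNV-1a, as suggested above, would be the usual choice:

```ruby
# Minimal Bloom filter over a plain bit array.
class BloomFilter
  def initialize(size = 1024, k = 3)
    @size = size
    @bits = Array.new(size, 0)
    @k = k
  end

  # One index per hash function, each in 0...@size.
  # Salting String#hash stands in for k independent hash functions.
  def indexes(str)
    (1..@k).map { |salt| "#{salt}:#{str}".hash % @size }
  end

  def add(str)
    indexes(str).each { |i| @bits[i] = 1 }
  end

  # true  => probably present (false positives possible)
  # false => definitely absent
  def include?(str)
    indexes(str).all? { |i| @bits[i] == 1 }
  end
end
```

A query only answers "definitely not present" or "probably present", exactly as described above.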

Related

Ruby: Help improving hashing algorithm

I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into it. I am trying to generate a "hash" to check against in a low-level block-chain verifier, and I am wondering if there are any convenience methods I could use to make this hashing algorithm more efficient. I think I can make it more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code, which takes a string to hash, unpacks it into an array of UTF-8 values, does computationally intensive math on each of those values, sums the results, takes that sum modulo 65,536, and returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything more efficient than simply using String#hash. This is a case of trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of getting a "unique" (as unique as possible) integer to represent an object, so as to avoid comparing by value.
With that as the premise, any intense computation inside a hash algorithm is already counter-productive; once you are doing per-character pow operations of that size, it is inefficient.
Best practice is usually to take values of the object that can be represented as integers and combine them with bit operations, typically mixing in prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393 # ^ is bitwise XOR in Ruby
  h += value2 ^ 17
  h
end
In your example, you are for some reason forcing the hash into the range of a 16-bit unsigned integer, whereas 32 bits is typical; if you are comparing on the Ruby side, this is effectively 31 bits due to how Ruby masks Fixnum values. Fixnum was rightly deprecated on the Ruby side, but internally the same threshold still exists between how a Bignum and a Fixnum are handled; the Integer class simply provides a single interface, as those two really should never have been exposed outside the C code.
In your specific example using strings, I would simply symbolize them. This gives a quick and efficient way to determine whether two strings are equal, with hardly any overhead: comparing 2 symbols is exactly like comparing 2 integers. There is a caveat if you are comparing a vast number of strings, though. Once a symbol is created, it lives for the life of the program; any further string equal to it returns the same symbol, but you cannot reclaim its memory (just a few bytes) for as long as the program runs. Not good if you use this method to compare thousands and thousands of unique strings.
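For comparison, here is a sketch of a much cheaper hash with the same 0..65,535 output range, using simple multiply-and-add mixing instead of huge exponentiations (reducing mod 65,536 each step keeps the numbers small and is cheap). Note this is not a drop-in replacement: it produces different hashes than the original function, so an existing chain would have to be rehashed:

```ruby
# Cheap multiply-and-add string hash with a 16-bit output range,
# returned as hex like the original generate_hash.
def generate_hash_fast(string)
  sum = 0
  string.each_byte do |b|
    sum = (sum * 31 + b) % 65_536
  end
  sum.to_s(16)
end
```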

how does this Ruby code work? (hash) (Learnrubythehardway)

I know I will look like a total noob, but there's something I can't wrap my head around. Let me emphasize that I DID google this, but I didn't find what I was looking for.
I'm going through the learnrubythehardway course, and for ex39 this is one of the functions we have defined:
def Dict.hash_key(aDict, key)
  return key.hash % aDict.length
end
The author gives this explanation:
hash_key
This deceptively simple function is the core of how a hash works. What it does is uses the built-in Ruby hash function to convert a string to a number. Ruby uses this function for its own hash data structure, and I'm just reusing it. You should fire up a Ruby console to see how it works. Once I have a number for the key, I then use the % (modulus) operator and the aDict.length to get a bucket where this key can go. As you should know, the % (modulus) operator will divide any number and give me the remainder. I can also use this as a way of limiting giant numbers to a fixed smaller set of other numbers. If you don't get this then use Ruby to explore it.
I like this course, but the above paragraph was no help.
Ok, you call the function passing it two arguments (aDict is an array) and it returns something.
(My questions are not totally independent of one another.)
What and how does it do that? (ok, it returns a bucket index, but how do we "get there"?)
What does the key.hash do/what is it?
How does using the % help me get what I need? (What is the use of "modding" the key.hash by the aDict.length?)
"Use Ruby to explore it." - ok, but my question No.2. kinda already suggests that I wouldn't know how to go about doing that.
Thanks in advance.
key.hash is calling Object#hash, which is not to be confused with Hash.
Object#hash converts a string into a number consistently (the same string will always result in the same number, in the same running instance of Ruby).
pry(main)> "abc".hash
=> -1672853150
So now we have a number, but it's way too large for the number of buckets in our Dict structure, which defaults to 256 buckets. So we take it modulo 256 to get a number within our bucket range.
pry(main)> "abc".hash % 256
=> 98
This essentially allows us to translate Dict["abc"] into aDict[98].
RE: This example in particular
I'm going to change the order of things in a way that I hope makes more sense:
#2. You can think of a hash as a sort of 'fingerprint' of something. The .hash method will create a (generally) unique output for any given input.
#3. In this case, we know that the hash is a number, so we take the modulo of the generated number by the backing array's length in order to find a (hopefully empty) index that is within our storage's bounds.
#1. That's how. A hashing algorithm will return the same output for any given input. The modulo takes this output and turns it into something we can actually use in an array to find something reliably.
#4. Call hash on something. Call it on a string and then modulo it by the length of an array. Try again on another string. Do that again, and use your result to assign something to that array. Do it again to see that the hash and modulo thing will find that value again.
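That experiment might look like this (the key and value are arbitrary; 256 matches the default bucket count mentioned above):

```ruby
# Hash a key, modulo it into a bucket index, store there, and find it
# again by recomputing the same index.
buckets = Array.new(256)

key = "testkey"
index = key.hash % 256 # same key gives the same index within one process
buckets[index] = "some value"

# Later lookup: recompute the index and retrieve the value.
again = key.hash % 256
value = buckets[again]
```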
Further Notes:
By itself, the modulo function is not a good way to pick unique indexes for keys. This example is the first step, but especially in a small array, there is still a relatively large chance for the hashes of different keys to modulo into the same number. That's called a collision, and handling those seems to be outside the scope of this question.

Algorithms: random unique string

I need to generate string that meets the following requirements:
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digit characters) should be upper case.
I will store them in a database after generation (they will be assigned to other entities).
My intention is to do something like this:
Generate 2 random values from 0 to 9—they will be used for digits in the string;
generate 6 random values from 0 to 25 and add them to 64—they will be used as 6 symbols;
concatenate everything into one string;
check if the string already exists in the database; if it does, repeat from the start.
My concern with this algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the database).
Question: could you please give advice on how to improve this algorithm to be more deterministic?
Thanks.
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digit characters) should be upper case.
Assuming:
requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
the "symbols" in requirement #4 are the 26 capital letters A through Z
you would like an evenly-distributed random string
Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:
Generate two different integers between 0 and 7, and sort them.
Generate 2 random numbers from 0 to 9.
Generate 6 random letters from A to Z.
Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
Put the 6 random letters in the remaining positions.
There are 28 possibilities for the two different integers ((8·8 − 8 duplicates) / 2 orderings), 26^6 possibilities for the letters, and 100 possibilities for the numbers, so the total number of valid combinations is N_comb = 28 · 26^6 · 100 = 864,964,172,800 ≈ 8.65 × 10^11.
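The five steps above can be sketched in Ruby (the method name is mine):

```ruby
# Evenly distributed 8-character code: exactly 2 digits at random
# positions, 6 uppercase letters in the remaining slots.
def random_code
  positions = (0..7).to_a.sample(2).sort             # step 1: two distinct digit positions
  digits    = Array.new(2) { rand(0..9).to_s }        # step 2: two random digits
  letters   = Array.new(6) { ('A'.ord + rand(26)).chr } # step 3: six random letters

  code = Array.new(8)
  positions.each { |p| code[p] = digits.shift }       # step 4: place the digits
  code.map! { |c| c || letters.shift }                # step 5: fill the rest with letters
  code.join
end
```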
edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and N_max ≤ N_comb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)
This is possible with Feistel networks, which are commonly used in hash functions and symmetric cryptography (DES is the classic example). You'd probably want to choose N_max = 2^39, which is the largest power of 2 ≤ N_comb, and use a 39-bit Feistel network with a constant key you keep secret. You then plug your counter into the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:
Repeat the following step 6 times:
Take X mod 26, generate a capital letter, and set X = X / 26.
Take X mod 100 to generate your two digits, and set X = X / 100.
X will now be between 0 and 17 inclusive (2^39 / 26^6 / 100 ≈ 17.8). Map this number to two unique digit positions (probably easiest with a lookup table, since we're only talking about 28 possibilities; if you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number).
Follow the random approach above, but use the numbers generated by this algorithm instead.
Alternatively, use 40-bit numbers, and if the output of your Feistel network is ≥ N_comb, increment the counter and try again. This covers the entire string space at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
But this isn't something to get into unless you know what you're doing.
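To make the Feistel idea concrete, here is a toy 40-bit balanced Feistel network (two 20-bit halves, the simpler variant from the previous paragraph). The round function is an arbitrary multiply-xor mix, not a vetted cipher; it only illustrates that the construction is an invertible scramble of the counter:

```ruby
MASK20 = (1 << 20) - 1

# Round function: arbitrary multiply-xor mix (NOT a vetted cipher).
def round_f(half, key)
  ((half * 2_654_435_761) ^ key) & MASK20
end

# Four-round balanced Feistel network over 40-bit values. Any Feistel
# network is a bijection, so distinct counters give distinct outputs.
def feistel40(counter, keys = [0xA5A5A, 0x5A5A5, 0x3C3C3, 0xC3C3C])
  left  = (counter >> 20) & MASK20
  right = counter & MASK20
  keys.each do |k|
    left, right = right, left ^ round_f(right, k)
  end
  (left << 20) | right
end
```

The round keys here are placeholder constants; in practice they would be derived from your secret key.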
Are these user passwords? If so, there are a couple of things you need to take into account:
You must avoid 0/O and I/1, which can easily be mistaken for each other.
You must avoid too many consecutive letters, which might spell out a rude word.
As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).
If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following pseudocode:
for letters in ( 'AAAAAA' .. 'ZZZZZZ' ) {
  for numbers in ( 00 .. 99 ) {
    string = letters + numbers
  }
}
This will create unique strings eight characters long, with two digits and six upper-case letters.
If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.
I think you're safe well into the tens of thousands of such IDs, and even after that you're most likely alright.
Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.
I'm willing to bet money though that you'll never see the extra functionality kick in during your life time :)
Do it the other way around: generate one big random number that you will split up to obtain the individual characters:
long bigRandom = ...;
int firstDigit = bigRandom % 10;
int secondDigit = ( bigRandom / 10 ) % 10;
and so on.
Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.
However, when you try to insert a new value and it's already in the database, you can easily find the smallest unallocated number greater than the originally generated one, and use that instead of the one you generated.
What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
For one thing, your list of requirements doesn't state that the string has to be random, so you might consider something like a database index.
If 'random' is a requirement, you can make a few improvements.
Store the string as a number in the database. (Not sure how much this improves performance.)
Do not store used strings at all. You can employ the 'index' approach above, but convert the integer to a string in a seemingly random fashion (e.g., by a bit shift). Without much research, nobody will notice the pattern.
E.g., if we have the sequence 1, 2, 3, 4, ... and apply a cyclic binary right shift by 1 bit, it turns into 4, 1, 5, 2, ... (assuming we have 3 bits only).
It doesn't have to be a shift, either; it can be a permutation or any other 'randomization'.
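The 3-bit rotation example can be written out directly:

```ruby
# Cyclic right shift by 1 bit within a 3-bit word: the low bit wraps
# around to the top. This permutes 0..7, so each counter value maps to
# a unique "scrambled" value in the same range.
def rotr3(n)
  ((n >> 1) | ((n & 1) << 2)) & 0b111
end
```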
The problem with your approach is that while you have few records you are very unlikely to get collisions, but as the number of records grows the chance increases until a collision becomes more likely than not. Eventually you will hit multiple collisions before you get a valid result, each attempt requiring a table scan to determine whether the code is taken, and the whole thing turns into a mess.
The simplest solution is to precalculate your codes.
Start with the first code 00AAAA and increment to generate 00AAAB, 00AAAC, ... 99ZZZZ. Insert them into a table in random order. When you need a new code, retrieve the top unused record from the table and mark it as used. It's not a huge table, as pointed out above - only a few million records.
You don't need to calculate any random numbers and generate strings for each user (already done)
You don't need to check whether anything has already been used, just get the next available
No chance of getting multiple collisions before finding something usable.
If you ever need more 'codes', just generate some more 'random' strings and append them to the table.

Determining Perfect Hash Lookup Table for Pearson Hash

I'm developing a programming language, and in my programming language, I'm storing objects as hash tables. The hash function I'm using is Pearson hashing, which depends on a 256-byte lookup table. Here's the function:
unsigned char pearson(const char* name, const unsigned char* lookup)
{
    unsigned char index = 0;
    while(*name)
    {
        index = lookup[index ^ (unsigned char)*name];
        name++;
    }
    return index;
}
My question is: given a fixed group of fewer than 256 member names, how can one determine a lookup table such that pearson() returns unique characters within a contiguous range starting from '\0'? In other words, I need an algorithm to create a lookup table for a perfect hash. This will allow me to have objects that take up no more space than the number of their members. It will be done at compile time, so speed isn't a huge concern, but faster would be better. It would be easy to brute-force this, but I think (hope) there's a better way.
Here's an example: given member variables 'foo', 'bar', and 'baz' in a class, I want to determine a lookup such that:
pearson('foo',lookup) == (char) 0
pearson('bar',lookup) == (char) 1
pearson('baz',lookup) == (char) 2
Note that the order doesn't matter, so the following result would also be acceptable:
pearson('foo',lookup) == (char) 2
pearson('bar',lookup) == (char) 0
pearson('baz',lookup) == (char) 1
In an ideal world, all names that aren't in the table would return a value greater than 2, because that would let me avoid a check and possibly even avoid storing the member names; but I don't think this is possible, so I'll have to add an extra check to see whether a name is in the table. Given this, it would probably save time not to initialize the unused entries of the lookup table (collisions don't matter: if a name collides and fails the check, it isn't in the object at all, so the collision doesn't need to be resolved; only the error needs to be handled).
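A brute-force sketch of that search (in Ruby rather than C, for brevity): shuffle the 256-entry table until the given names hash to distinct values. Note that also forcing the hashes into the contiguous range 0...n makes each trial astronomically less likely to succeed; a small secondary remap table is the cheaper way to get contiguity:

```ruby
# Pearson hash over a 256-entry table, mirroring the C function above.
def pearson(name, table)
  index = 0
  name.each_byte { |b| index = table[index ^ b] }
  index
end

# Randomly permute the table until all names hash to distinct values.
# For a handful of names each trial succeeds with high probability.
def find_table(names)
  loop do
    table = (0..255).to_a.shuffle
    hashes = names.map { |n| pearson(n, table) }
    return table if hashes.uniq.size == names.size
  end
end
```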
I strongly doubt that you will be able to find a solution with brute force if the number of member names is too high. Thanks to the birthday paradox, the probability that no collision exists (i.e., that no two hashes are the same) is approximately 1:5,000 for 64 member names and 1:850,000,000 for 96. From the structure of your hash function (it is derived from a cryptographic construction designed to "mix" things well) I don't expect that an algorithm exists which solves your problem (but I would definitely be interested in such a beast).
Your ideal world is an illusion (as you expected): there are 256 characters you can append to 'foo', and no two of them give a new word with the same hash. Since there are only 256 possible hash values, every hash value is hit, so you can always append a character to 'foo' to make its hash equal any of the hashes of 'foo', 'bar' or 'baz'.
Why don't you use an existing library like CMPH?
If I understand you correctly, what you need is a sorted, duplicate-free array that you can binary-search. If the key is in the array, its index is the "hash"; otherwise, you get the size of the array. That is O(log n) per lookup compared to the lookup table's O(1), but it is good enough for a small number of elements - 256 in your case.

Can I identify a hash algorithm based on the initial key and output hash?

If I have both the initial key and the hash that was created, is there any way to determine what hashing algorithm was used?
For example:
Key: higher
Hash: df072c8afcf2385b8d34aab3362020d0
Algorithm: ?
By looking at the length, you can decide which algorithms to try. MD5 and MD2 produce 16-byte digests. SHA-1 produces 20 bytes of output. Etc. Then perform each hash on the input and see if it matches the output. If so, that's your algorithm.
Of course, if more than the "key" was hashed, you'll need to know that too. And depending on the application, hashes are often applied iteratively. That is, the output of the hash is hashed again, and that output is hashed… often thousands of times. So if you know in advance how many iterations were performed, that can help too.
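The length-then-test procedure might be sketched like this (the candidate set here is illustrative, not exhaustive, and it assumes the input was hashed as-is, with no salt or iteration):

```ruby
require 'digest'

# Candidate algorithms keyed by hex-digest length.
CANDIDATES_BY_LENGTH = {
  32 => { 'MD5'     => ->(s) { Digest::MD5.hexdigest(s) } },
  40 => { 'SHA-1'   => ->(s) { Digest::SHA1.hexdigest(s) } },
  64 => { 'SHA-256' => ->(s) { Digest::SHA256.hexdigest(s) } },
}

# Pick candidates by digest length, then hash the known input and
# compare. Returns the algorithm name, or nil if nothing matches
# (unknown algorithm, or the input was salted / iterated).
def guess_algorithm(input, hex_digest)
  candidates = CANDIDATES_BY_LENGTH[hex_digest.length] || {}
  candidates.each do |name, fn|
    return name if fn.call(input) == hex_digest.downcase
  end
  nil
end
```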
There's nothing besides the length in the output of a cryptographic hash that would help narrow down the algorithm that produced it.
Well, given that there are a finite number of popular hash algorithms, maybe what you propose is not so ridiculous.
But suppose I asked you this:
If I have an input and an output, can I determine the function?
Generally speaking, no, you cannot determine the inner-workings of any function simply from knowing one input and one output, without any additional information.
// very, very basic illustration
if (unknownFunction(2) == 4) {
    // what does unknownFunction do?
    // return x + 2?
    // or return x * 2?
    // or return Math.Pow(x, 2)?
    // or return Math.Pow(x, 3) - 4?
    // etc.
}
The hash contains only hexadecimal characters (each character represents 4 bits), and the total count is 32 characters, so this is a 128-bit hash.
Standard hashing algorithms that match these specs are: HAVAL, MD2, MD4, MD5 and RIPEMD-128.
The highest probability is that MD5 was used.
md5("higher") != df072c8afcf2385b8d34aab3362020d0
So the highest probability is that some salt was used,
and the most likely algorithm still remains MD5.
Didn't match any of the common hashing algorithms:
http://www.fileformat.info/tool/hash.htm?text=higher
Perhaps a salt was added prior to hashing...
No, other than trying out a bunch of algorithms that you know and seeing if any of them match.
