Pseudorandom function in Hive? - random

I'm trying to get a deterministic, pseudorandom function in Hive. I tried checksum, but apparently that exists only in other SQL dialects, not in Hive. I tried the following:
select hash(1) gave me 1
select rand(1), rand(2), rand(3) got me 0.730878191 0.731146936 0.731057369
Is there a cryptographically secure hash in Hive? Why is rand not random?

You can call Java library methods from Hive using the reflect() or java_method() functions. For example, sha256Hex from Apache Commons Codec's DigestUtils:
SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'message');
As of Hive 1.3.0, the SHA-2 family of hash functions is also implemented as the built-in sha2() function.
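One way to turn such a hash into a deterministic pseudorandom number is to scale part of the digest into [0, 1). The following is a minimal Python sketch of that arithmetic, not Hive code; the helper name hash_to_unit and the optional salt parameter are made up for illustration, and the same steps would have to be rewritten with Hive's own functions inside a query.

```python
import hashlib

def hash_to_unit(value, salt=""):
    """Map a value to a deterministic float in [0, 1) via SHA-256.

    hash_to_unit and salt are hypothetical names used only for this sketch.
    """
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    # Keep the first 13 hex digits (52 bits) so the integer fits a double exactly.
    return int(digest[:13], 16) / float(16 ** 13)

for row_id in (1, 2, 3):
    print(row_id, hash_to_unit(row_id))
```

The same input always yields the same number, and changing the salt yields a different but equally repeatable assignment.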
About rand(), the documentation says:
Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed will make sure the generated random number sequence is deterministic.
So with a fixed seed you will get the same random sequence every time, the same as with java.util.Random. If you want a different sequence on each run, the recommendation is to use the current time in seconds as the seed.
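The near-identical values for seeds 1, 2 and 3 can be explored with a small emulation of the java.util.Random algorithm as it is publicly specified (a 48-bit linear congruential generator). This Python sketch assumes, as the answer states, that rand(seed) behaves like java.util.Random; the function name first_java_double is made up for illustration.

```python
MULTIPLIER = 0x5DEECE66D
ADDEND = 0xB
MASK = (1 << 48) - 1

def first_java_double(seed):
    """First nextDouble() of a java.util.Random created with the given seed."""
    state = (seed ^ MULTIPLIER) & MASK  # initial seed scrambling

    def next_bits(bits):
        nonlocal state
        state = (state * MULTIPLIER + ADDEND) & MASK
        return state >> (48 - bits)

    high = next_bits(26)  # upper 26 bits of the result
    low = next_bits(27)   # lower 27 bits of the result
    return ((high << 27) + low) / float(1 << 53)

for seed in (1, 2, 3):
    print(seed, first_java_double(seed))
```

Small seeds differ only in their lowest bits, and a single generator step does not carry that difference far into the high bits that dominate the first output. Hashing the seed first, or discarding the first few outputs, are common workarounds.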

Related

How to select nth random integer from a range of integers without repetition or storage? [duplicate]

Let's say my system needs to provide a unique integer id regularly, between 1 and 10^20, from a function like --
function getNextRandomUniqueId(index:BigInt, min:BigInt, max:BigInt, seed:BigInt): BigInt { ? }
id = getNextRandomUniqueId(index=42, min=1, max=10^20, seed=0)
These ids need to be provided in random order as the index increases, not sequentially. Once an id has been provided, it cannot be provided again, as long as the index increases. My system cannot store a random list of all the numbers to be issued, or all the numbers already issued; there are too many. I also don't want to rely on something like a random UUID, which is exceedingly unlikely to produce a collision but is not guaranteed to avoid one.
How can this be done? Is there a deterministic mathematical way to iterate randomly through a set of sequential integers without repetition and without storage?
EDIT: Fixed 1^20 to 10^20
This can be done, assuming you are allowed to store an encryption key and counter. Encryption is a one-to-one mapping so by encrypting all the numbers in a given range you will get back all those same numbers in a randomized order. Different keys will give a different order. Encrypt the numbers 0, 1, 2, 3, ... in order, using the key and keeping track of how far you have got.
Depending on the range of numbers, you may need to use some form of format-preserving encryption to keep the outputs within the required range.
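The idea of encrypting a counter can be sketched in Python with a small Feistel network plus "cycle walking" to stay inside the range. This is an illustration under stated assumptions, not a vetted format-preserving encryption scheme: the function names, the choice of HMAC-SHA256 as round function and the four rounds are all choices made for this sketch, and the index is taken to start at 0.

```python
import hashlib
import hmac

def _round(key, round_no, half, half_bits):
    """Keyed round function: HMAC-SHA256 truncated to half_bits bits."""
    msg = bytes([round_no]) + half.to_bytes((half_bits + 7) // 8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)

def _feistel(key, value, half_bits, rounds=4):
    """Balanced Feistel network: a keyed permutation of [0, 2**(2*half_bits))."""
    mask = (1 << half_bits) - 1
    left, right = value >> half_bits, value & mask
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right, half_bits)
    return (left << half_bits) | right

def get_unique_id(index, lo, hi, key):
    """Map index 0..(hi-lo) to a distinct id in [lo, hi], in scrambled order."""
    size = hi - lo + 1
    if not 0 <= index < size:
        raise ValueError("index out of range")
    half_bits = max(1, ((size - 1).bit_length() + 1) // 2)
    value = _feistel(key, index, half_bits)
    while value >= size:  # cycle walking: re-encrypt until back inside the range
        value = _feistel(key, value, half_bits)
    return lo + value

print(get_unique_id(42, 1, 10 ** 20, b"example secret key"))
```

Because the Feistel construction is a permutation of its whole domain, and cycle walking restricts a permutation to a subset without creating collisions, every index maps to a different id. Only the key and the current index need to be stored.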
You cannot guarantee that the same id does not also appear in the sequence produced by a different seed.
Most languages seed the generator from the current time when you do not provide a seed yourself. You have set your seed to zero, so each time you restart your program you will get the same ids. This is most likely not your intent :-)
But even if you seed from the time, the chance of hitting the same id is still there:
1 in 100,000,000,000,000,000,000.
You can get the same id twice precisely because the values are random.
I would go with a GUID, where the chance is roughly
1 in 340,280,000,000,000,000,000,000,000,000,000,000,000.

Randomization in Tableau

I am trying to generate random numbers in Tableau between 1 and 15. Currently, I am using the Random() function. However, this returns random numbers in the [0,1] interval. Does anyone know how to get whole integer values instead?
I am using this feature to try to both randomize and anonymize the names of 15 people.
Thanks!
Actually, Random() is the function you are looking for, even though it returns numbers from 0 to 1.
Depending on the integer range you need, why not multiply the result by 10, 100, 1000, etc. and then use the round function to get rid of the fractional part? For whole numbers from 1 to 15 specifically, multiplying by 15, truncating, and adding 1 gives each value an equal chance.
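The scaling arithmetic can be sketched in Python (not Tableau's calculation language); both function names are made up for illustration. The first function scales a uniform value to 1..15. The second is an alternative for the anonymization use case, where each of the 15 people presumably needs a different number, something independent random draws do not guarantee.

```python
import random

def scaled_integer(u, low=1, high=15):
    """Scale a uniform value u in [0, 1] to an integer in [low, high]."""
    span = high - low + 1
    # min() guards the edge case u == 1.0, which would otherwise give high + 1.
    return low + min(int(u * span), span - 1)

def anonymous_labels(names, seed=0):
    """Assign each name a distinct number from 1..len(names) in shuffled order."""
    numbers = list(range(1, len(names) + 1))
    random.Random(seed).shuffle(numbers)
    return dict(zip(names, numbers))

print(scaled_integer(random.random()))
print(anonymous_labels(["Ann", "Bob", "Cho"]))
```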

Create 2 way map for sha256.

Assuming we are looking at the data set of all sha256 values.
Applying the sha256 function to each sha256 value will result in a different sha256 value.
Since our data set is the same size as the result set, we can assume we have a 1 to 1 function.
Is there a way to map all the values and create a backward (inverse) function? (Assuming we are looking only at the above data set.)
In a reasonable computational time (not 110 years).
Since our data set is the same size as the result set, we can assume we have a 1 to 1 function.
This is a faulty assumption. There is no reason to believe that SHA256 is a one-to-one mapping across 256-bit inputs; in all probability there are many pairs of 256-bit inputs that share the same SHA256 hash.
Is there a way to map all the values and create a backward (inverse) function?
No. There are 2^256 ≈ 1.16×10^77 possible SHA256 hashes. As a point of comparison, there are roughly 2.4×10^67 atoms in our galaxy. Even if you could turn the entire Milky Way into a computer and write one hash onto each atom, you would run out of atoms long before you finished.
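Both points can be checked on a toy scale in Python. The sketch below truncates SHA-256 to 16 bits and feeds it every 16-bit input; the truncation is only a stand-in for the full 256-bit function, which is an assumption of this toy model. A function that behaves randomly is expected to hit only about 1 - 1/e (roughly 63%) of the possible outputs, so many inputs must collide.

```python
import hashlib

BITS = 16  # toy width; the real question concerns 256 bits

def truncated_sha256(value):
    """SHA-256 of a 16-bit input, truncated to its first 16 bits."""
    digest = hashlib.sha256(value.to_bytes(BITS // 8, "big")).digest()
    return int.from_bytes(digest[: BITS // 8], "big")

# Hash every possible 16-bit input and count how many distinct outputs appear.
outputs = {truncated_sha256(v) for v in range(1 << BITS)}
coverage = len(outputs) / float(1 << BITS)
print(f"distinct outputs: {len(outputs)} of {1 << BITS} ({coverage:.1%})")

# Size of the full SHA-256 output space, for comparison with the atom count.
print(f"2**256 = {float(2 ** 256):.3e}")
```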

Hashing and encryption technique for a huge data set containing phone numbers

Description of problem:
I'm working with a highly sensitive dataset that contains people's phone numbers in one of its columns. I need to apply an encryption or hash function to convert them into encoded values before doing my analysis. A one-way hash is fine, i.e. after processing the encoded data we won't be converting it back to the original phone numbers. Essentially, I am looking for an anonymizer that takes phone numbers and converts them to some random-looking value on which I can do my processing. Please suggest the best way to go about this. Recommendations on the best algorithms to use are welcome.
Update: size of the dataset
My dataset is really huge, on the order of hundreds of GB.
Update: Sensitive
By sensitive, I meant that the phone numbers themselves should not be a part of our analysis. So, basically I would need a one-way hashing function but without redundancy: each phone number should map to a unique value, and two phone numbers should not map to the same value.
Update: Implementation ?
Thanks for your answers. I am looking for a more detailed implementation. I was going through Python's hashlib library for hashing. Does it necessarily perform the same set of steps that you suggested? Here is the link
Can you give me some example code to achieve this, preferably in Python?
Generate a key for your data set (16 or 32 bytes) and keep it secret. Apply HMAC-SHA1 to each phone number with this key and base64-encode the result. This gives you a random-looking string per phone number that is unique for all practical purposes and cannot be reversed without the key.
Example (HMAC-SHA1 with a 256-bit key) using Keyczar:
Create random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize phone number:
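from keyczar import keyczar

def anonymize(phone_num):
    # Load the signing key set created with the keyczart commands above.
    signer = keyczar.Signer.Read("path_to_key_set")
    # The signature serves as the anonymized token for this phone number.
    return signer.Sign(phone_num)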
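Since the question asks about Python's hashlib, the same keyed-hash idea can also be sketched with only the standard library. Note the swap: this version uses HMAC-SHA256 rather than the HMAC-SHA1 mentioned above, and the function names are made up for illustration.

```python
import base64
import hashlib
import hmac
import secrets

def new_key(num_bytes=32):
    """Generate a random secret key; store it safely or discard it after the run."""
    return secrets.token_bytes(num_bytes)

def anonymize(phone_num, key):
    """Return a base64 token for a phone number using HMAC-SHA256."""
    mac = hmac.new(key, phone_num.encode("utf-8"), hashlib.sha256)
    return base64.urlsafe_b64encode(mac.digest()).decode("ascii")

key = new_key()
print(anonymize("555-0100", key))
```

The same number always maps to the same token under one key, so joins and counts still work on the anonymized column. Different spellings of one number (spaces, dashes, country prefix) produce different tokens, so normalizing the format first is probably worthwhile.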
If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not provide the right security guarantees. Really, though, are there so many different phone numbers that you can't just incrementally construct a map representing an actually random function?
Sort your data by the respective column and start counting distinct values, then replace each actual value with its counter value. This is collision-free and one-way.
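A minimal Python sketch of the counter idea, assuming the distinct phone numbers fit in memory; make_pseudonymizer is a made-up name. Unlike the sorted variant described above, it numbers values in order of first appearance, so no sort is needed.

```python
def make_pseudonymizer():
    """Return a function that maps each distinct value to the next counter value."""
    mapping = {}

    def pseudonymize(value):
        # First sighting of a value: assign the next unused counter value.
        if value not in mapping:
            mapping[value] = len(mapping) + 1
        return mapping[value]

    return pseudonymize

pseudonymize = make_pseudonymizer()
print([pseudonymize(p) for p in ["555-0100", "555-0199", "555-0100"]])
```

Discarding the mapping once processing is finished is what makes the result one-way. If the data is sorted before numbering, the ids preserve the ordering of the original numbers, which may leak more than intended.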
"So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value."
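This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best-known examples. You will read that "MD5 has been cracked", but those attacks concern deliberately constructed collisions, which do not matter for your purpose. Be aware, though, that the set of possible phone numbers is small enough to enumerate, so an unkeyed hash can be reversed by hashing every candidate number; adding a secret key prevents this.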

How to set the seed for Math.random() in Apex

Is there a way to set the seed for the random number generator in Apex? And if so, which function do I use for it?
It likely isn't possible to seed the RNG in Apex. If you need a repeatable sequence of random numbers, you'll have to implement a seeded pseudo random number generator yourself.
On the Apex platform, I'm sure they have a huge source of entropy available to generate random numbers, and there's no need for you to seed the generator.
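If a repeatable sequence is needed after all, a hand-written seeded generator can be quite small. The sketch below is in Python rather than Apex and uses the xorshift32 algorithm purely as an illustration; the class and method names are made up, it would have to be ported to Apex by hand, and it is not suitable for anything security-related.

```python
class SeededRandom:
    """Tiny xorshift32 generator: the same seed always yields the same sequence."""

    def __init__(self, seed):
        # xorshift state must be non-zero; fall back to a fixed constant if it is.
        self.state = (seed & 0xFFFFFFFF) or 0x9E3779B9

    def next_uint32(self):
        x = self.state
        x ^= (x << 13) & 0xFFFFFFFF
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        self.state = x
        return x

    def next_uniform(self):
        """Float in [0.0, 1.0)."""
        return self.next_uint32() / 4294967296.0

    def next_integer(self, upper_limit):
        """Integer in [0, upper_limit); slight modulo bias is ignored here."""
        return self.next_uint32() % upper_limit

rng = SeededRandom(2024)
print([rng.next_integer(100) for _ in range(5)])
```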
There is no way to seed the built-in random number generator in Salesforce. I was in the same boat as you. I wanted to be able to use a seed, so that I could create repeatable random numbers.
So, I thought I'd attempt to write my own RNG. I spent a number of days scouring the Internet for algorithms. I was able to piece together a pretty comprehensive library of functions borrowing from various sources. The classes are: "Random.cls", which is the main RNG class, and "Random_Test.cls", which is the test code.
It has the following methods:
nextInteger(upperLimit)
nextLong(upperLimit)
nextDouble(upperLimit)
nextUniform() - Same function as Math.Random() to return a Double between 0.0 and 1.0.
nextIntegerInRange(lowerLimit, upperLimit)
nextLongInRange(lowerLimit, upperLimit)
nextDoubleInRange(lowerLimit, upperLimit)
shuffle(List<Object>) - destroys the order of the original list
shuffleWithCopy(List<Object>) - return a shuffled copy of the list, in case you wish to preserve the list's original order (less efficient than "shuffle(List<Object>)")
The "Random.cls" documents the sources that I borrowed from in case you want to read more about random number generators.
I put the code out on GitHub for anyone who wants it: https://github.com/DeviousBard/Salesforce/tree/master
