Techniques for data anonymization - algorithm

I'm looking for a good way to anonymize data in my database while retaining the capability of aggregating / summarizing statistical information.
As an example, let's say I want to track clicks by IP address per hour but I don't actually want to store the IP address.
My first thought is to store only a hash (e.g. SHA-256) of the IP. However, I'm not sure this provides sufficient security. If an attacker got hold of our database and was determined to reverse our anonymization, they could generate a rainbow table of IPs and get back the real IP info fairly easily.
My next thought was to add a static prefix to the IP before hashing (e.g. 192.168.1.10 becomes MY_SECRET_STRING-192.168.1.10). Of course, if the attacker finds the static prefix then it is essentially useless.
I've been searching for sound solutions to this problem and I haven't found anything I really like so far. Are there any well known methods for anonymizing data like this?

If someone has access to your salt and your database, I would say it's almost impossible (if not outright impossible) to keep them from building some sort of lookup table and "cracking" your hashes. The only option you have is to make their job hard and expensive.
Using a static salt is a bad idea, though, since the whole point of a salt is to prevent an attacker from generating a single rainbow table for all your records. Uniqueness is what makes a salt a good salt: its purpose is to make each hash unique even when the original content is the same as another record's, thereby forcing an attacker to brute-force each row individually to figure out its content.
Also worth noting: salts don't need to be secret, so you can simply store each salt in an additional column.
There is this nice article about salting and hashing if you have any doubt about the topic.
The problem with the described approach is that, in the end, you, just like an attacker, won't be able to tell which rows belong to the same IP.
One potential solution, if you really do need to implement this, is to have a table where you store the IP (or its hash) plus a click count, and then run a process every hour that anonymizes the data by replacing every IP/hash from the last hour with a good RANDOM value. In the end this means you will only be able to group clicks per hour without knowing the actual IP, but please note two things:
Although an attacker will never be able to recover the past data, at any given time you will have one hour's worth of data that is not anonymized. That means an attacker could "spy" on you and collect this information over time, which could become a much bigger problem than "we just leaked one hour's worth of data".
You won't be able to link the same IP across hours. For example: if IP 127.0.0.1 made 3 clicks from 17:00 to 18:00 and the same IP made 6 clicks from 18:00 to 19:00, you wouldn't be able to tell that 127.0.0.1 made 9 clicks from 17:00 to 19:00.
Also, to make the hourly non-anonymized IPs a bit harder to crack, you could have a function that takes an IP, generates a unique salt, and caches that salt for that IP until the next hour, so that each IP gets its own unique salt every hour. This way the attacker would have to build a new rainbow table for each row every hour, while you could still figure out which IP row to increment or create; see the sketch below.
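A minimal sketch of that hourly scheme in Python (the in-memory dictionaries and function names here are just illustrative stand-ins for real database tables and jobs):

    import os
    import hashlib

    salt_cache = {}    # ip -> random salt, cleared every hour
    click_counts = {}  # hashed key -> click count for the current hour

    def hourly_key(ip):
        # One random salt per IP per hour, cached until the next rotation.
        salt = salt_cache.setdefault(ip, os.urandom(16))
        return hashlib.sha256(salt + ip.encode()).hexdigest()

    def record_click(ip):
        key = hourly_key(ip)
        click_counts[key] = click_counts.get(key, 0) + 1

    def rotate_hour():
        # Replace every key with a fresh random value so the finished hour
        # can no longer be linked to any IP, then clear the salt cache.
        anonymized = [(os.urandom(16).hex(), count) for count in click_counts.values()]
        salt_cache.clear()
        click_counts.clear()
        return anonymized  # persist these rows; the link to the IPs is gone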

Why, yes, there are. The most well-known is called "salting". Basically, instead of adding a static string to all of the plaintexts, you add a unique string to each one. This string is randomly or algorithmically generated and stored separately. It doesn't make a single hash any harder to crack, but it prevents the use of precomputed tables to crack multiple hashes. See the Wikipedia article on Salt (cryptography).
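A minimal sketch of per-record salting, assuming the salt simply lives in an extra column next to the hash:

    import os
    import hashlib

    def hash_with_salt(ip):
        salt = os.urandom(16)  # unique random salt per record
        digest = hashlib.sha256(salt + ip.encode()).hexdigest()
        return salt.hex(), digest  # store both columns; the salt doesn't need to be secret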
That being said, I think that a one-way hash of the IP is sufficient. An attacker would have to crack each IP address. No matter what method you use, once an IP is cracked then all of the records for that IP will be exposed. But cracking one IP doesn't help with any of the others.

Related

Can Provably Fair truly be fair?

I have been seeing a lot of gambling (BTC) websites use the "Provably Fair" system. I am wondering if some of these could possibly be faked.
As an example:
Place a bet on a website for 1 BTC
The website gives you a hash to "verify" the outcome of the result
Displays the result and awards or takes your bet accordingly
Now I understand that these are supposed to be completely random, but with pretty much any programming language thousands of these hashes can be generated in milliseconds. Is it possible for gambling websites to "scam" a user by generating many outcomes ahead of time and then deciding which one to give them based on whether they would win or lose?
I just started researching if they are trustworthy and this came across my mind.
I apologize if this is on the wrong Stack Exchange site; if so, please direct me to the correct one.
Here are some examples:
http://provablyfair.org/
https://fortunejack.com/help/provably_fair
I understand what you mean, and I also think this could be done; in fact it's pretty simple:
The server sends numbers to the client, then modifies the results
The hashes are displayed the next day
Create 10,000 hashes, choose the outcomes you need, and publish them in that order
Done
And now some genius will come along saying: "You can't modify the seeds." No, but as far as I know you can create as many different secrets as you want to achieve different number results.
(I'm new at coding, but I think it could work this way.)
A result is often calculated using 3 things:
A server seed: generated by the server. A hash of it is published so that the player can verify the result is legit and the server didn't change it mid-way, but the hash doesn't let the player calculate the result themselves (which would be cheating).
A client seed: generated by the browser. This is used so that the server doesn't know the result in advance and can't rig it.
A nonce, known by both parties. This is often used as a counter for how many bets you have made.
To get the result:
Your browser sends the client seed and the bet info (amount, odds) to the server. Now the server knows the result, but can't change it because the client will check the hash later on.
The server sends the result and the server seed to your browser.
To verify:
Step 1: Take the server seed and hash it, then compare it with the hash you received before. If it matches, the server played it nice and didn't cheat on you; continue to Step 2. If it doesn't, you are getting scammed :(
Step 2: Calculate the result yourself.
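To make those two steps concrete, here is a minimal sketch. The HMAC-SHA256 construction and the 0-9999 roll range are assumptions; each site documents its own exact formula, so check theirs before relying on this:

    import hashlib
    import hmac

    def verify_server_seed(server_seed, published_hash):
        # Step 1: the hash published before the bet must match the revealed seed.
        return hashlib.sha256(server_seed.encode()).hexdigest() == published_hash

    def calculate_roll(server_seed, client_seed, nonce):
        # Step 2: recompute the result from the three inputs.
        digest = hmac.new(server_seed.encode(),
                          f"{client_seed}:{nonce}".encode(),
                          hashlib.sha256).hexdigest()
        return int(digest[:8], 16) % 10000  # assumed roll range of 0-9999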

Best way to encrypt and decrypt data using php and mysql

To start, I am trying to encrypt very sensitive information on a public website. Users will be able to update their information, and administrators will need access to it. I am worried that if the encrypted data is somehow compromised, then everyone's information would be as well, since it all uses the same salt and key.
So I know that using a salt and key is always preferred. But as mentioned above, if someone reverse engineers the encrypted data, what use is it?
My solution is to store keys and salts in a DB with many rows and columns, any of which can be used as the salt or key. I would have an algorithm that uses "something" fixed in the user's account to figure out which salt and key to use. This way, statistically speaking, no two users will have the same combo of salt and key.
Is this overkill, or a good idea?
I question the value of this second database that holds keys and salts. Consider:
The "something" in the user's data that identifies the salt and key will necessarily have to be encrypted differently from the rest of the user's data. Otherwise, you wouldn't be able to get it without first already having it.
Statistical analysis of the encrypted user data would almost certainly discover that the "something" is encrypted differently. That will be like waving a red flag at a bull, and an attacker will concentrate on figuring out why that's different.
You can assume that if an attacker can get the database of encrypted user information, he can also get the database of salts and keys.
Given that, there are two possible scenarios:
The encryption of the "something" that identifies the key and salt is unbreakable. That is, it's so good that the attacker's best efforts fail to reveal the connection between that "something" and the key/salt database.
The attacker discovers the encryption of the "something," and therefore is able to decrypt your sensitive data.
If #1 is the case, then you probably should use that encryption algorithm for all of your user data. Why do something in two steps when you can do it just as effectively in one?
If #2 is the case, then all the work you've done just put up a little bump in the road for the attacker.
So, short answer: what you propose looks like either unnecessary work or ineffective road blocking. In either case, it looks to me like a whole lot of work and added complexity for no appreciable gain.
That said, I could have misinterpreted your brief description. If so, please correct me.

Is there a way to generate a short random id, avoiding collisions, without hitting persistent storage?

If you've used GoToMeeting, that's the type of ID I want. I'd like it to be random so that it obfuscates the number of items being tracked and short, so that it's easy to reference manually; UUIDs are way too long. I'd like to avoid hitting persistent storage merely for performance reasons, but I can't think of any other way to avoid collisions. Is 9 digits enough to do something time-based?
In response to questions:
I'm building a ticket-tracking application. This ID would be used as the primary key for a table, but it would be needed before the record is persisted which would result in an extra database call that I'd like to avoid if possible.
I'd like to keep it at a 9 digit int. I consider a UUID to be too long because people are going to have to reference the ID manually (via email, phone, etc.).
I'm thinking of using the time of generation somehow. Since time is always ticking on forward, it would continually limit the set of potential IDs, excluding those that had already been generated.
One way is to take a unique number or string (like a random UUID) then calculate a fixed-length digest (such as MD5 or SHA-1) and/or encode it in a higher base (like base64) to shorten it further.
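A rough sketch of that approach; the 9-character length and the base32 alphabet are arbitrary choices, and truncating the digest means collisions become improbable rather than impossible:

    import base64
    import hashlib
    import uuid

    def short_id(length=9):
        digest = hashlib.sha1(uuid.uuid4().bytes).digest()      # fixed-length digest of a random UUID
        token = base64.b32encode(digest).decode().rstrip("=")   # base32 is case-insensitive and easy to read aloud
        return token[:length]                                   # truncation trades uniqueness for brevity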
Git does something similar: it generates SHA hashes for commits (and other objects), and the user can then reference those hashes manually in order to look up the commits. The trick is that the user doesn't have to enter the whole string to find the correct object; they only have to enter a prefix long enough that it doesn't collide with any other commit currently in the repository. In general this only requires 5 or so hex digits, even for relatively large repositories.
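The same idea as a sketch: given the set of existing IDs, you only need a prefix long enough to be unambiguous (an in-memory set is assumed here, which is not how Git actually stores objects):

    def shortest_unique_prefix(full_id, existing_ids, minimum=5):
        # Grow the prefix until no other ID shares it.
        for length in range(minimum, len(full_id) + 1):
            prefix = full_id[:length]
            if not any(other != full_id and other.startswith(prefix) for other in existing_ids):
                return prefix
        return full_id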

What is a reliable method to record votes from anonymous users, without allowing duplicates

First of all, I searched as best I could and read all SO questions that seem relevant, but nothing specifically answered this. This is not a duplicate, afaik.
Obviously if anonymous voting on a website is allowed, there is no foolproof way to prevent someone from voting more than once.
However, I am wondering if someone with experience can aid me in coming up with a reasonably reliable way of tracking absolutely unique visitors and recording votes against those credentials.
Currently I am ensuring that only one vote per item/session combo is allowed, however this is easily circumvented by restarting browser, changing browsers/computers, or clearing your session data.
Recording against IP seems the next reasonable solution, but I wonder if this will produce false positives too often (multiple people on the same LAN behind a NAT will have the same external IP, etc.).
Is there a middle ground to be had here or some other method/combination I am overlooking?
I'd collect as much data about the session as possible without asking any questions directly (browser, OS, installed plugins, all with version numbers, IP address, etc.) and hash it.
Record the hash and increment a counter if you want multiple votes to be allowed. Include a timestamp (daily, hourly, etc.) in the salt to make votes time-sensitive, say 5 votes per day.
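A minimal sketch of that idea; the fingerprint fields and the 5-votes-per-day limit are placeholders:

    import hashlib
    from datetime import date

    DAILY_VOTE_LIMIT = 5
    vote_counts = {}  # fingerprint -> votes today (stand-in for a real table)

    def fingerprint(ip, user_agent, plugins=""):
        # Including today's date in the hashed material makes the
        # fingerprint, and therefore the limit, reset every day.
        material = f"{date.today().isoformat()}|{ip}|{user_agent}|{plugins}"
        return hashlib.sha256(material.encode()).hexdigest()

    def try_vote(ip, user_agent, plugins=""):
        key = fingerprint(ip, user_agent, plugins)
        if vote_counts.get(key, 0) >= DAILY_VOTE_LIMIT:
            return False  # over the per-day limit, reject silently
        vote_counts[key] = vote_counts.get(key, 0) + 1
        return True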
The simplest answer is to use a cookie. Obviously it's vulnerable to people clearing their cookies, but anonymous voting is inherently approximate anyway.
In practice, unless the topic being voted on is in some way controversial or inflammatory, people aren't going to have a motive behind rigging the vote anyway.
IP is more 'reliable' but will produce an unacceptably high level of collisions due to NATs.
How about a more unique identifier composed of IP + user-agent (maybe a hash)? That effectively means for each IP, each exact OS/browser version pair gets 1 vote, which is a lot closer to 1 vote per person. Most browsers provide detailed version information in the user-agent -- I'm not sure, but my gut feel is that this would prevent the majority of collisions caused by NATs.
The only place that would still produce lots of collisions is a corporate environment with a standardised network, where everyone is using an identical machine.
In China, people have to share one IPv4 address with hundreds of others, while HP/Compaq/DEC holds almost 50 million addresses. IPv6 doesn't help, as everyone gets addresses by the billion. A person simply is not the same thing as an IP address, and that notion is becoming ever more false.
There are just no proper ways to do this on the Internet. Persons are simply a concept unknown on the Internet, and any idea to introduce the concept is unlikely to succeed. (Too many governments would not want this to happen, for instance.)
Of course, you can relate the number of votes per IP to the amount of repeat page visits from that IP, especially in combination with cookie tracking. This works best if you estimate that number before you start the voting period. If the top 5% most popular articles are typically read 10 times from a single IP, it's likely that 10 people share that IP and they should get 10 votes. Cookies can be used to prevent them from stealing each other's votes, but on the whole they can't skew your poll. (Note: this fails in small communities where a large group of voters comes from a small number of IPs; in particular, this happens around universities.)
If you're not looking at authenticating voters, then you're going to be getting some duplicate votes no matter what you use. I'd use a cookie, and have done with it for the anonymous users.
UserVoice allows both anonymous voting and voting when logged in, but then allows the admin to filter out anonymous votes - a nice solution to this problem.
Anything based on IP addresses isn't an option - the case of NAT has been mentioned, but only in the context of home users. There are many larger installations that use NAT - some corporations can have thousands of users pooled behind a single IP address. There are also ISPs that use proxy servers for their users - another case where many thousands of users appear to your application as a single address. Adding unique UA combinations to this won't help, as there isn't enough variation.
A persistent cookie is going to be your best bet - and you'll have to live with the fact that it is easy to game. At least when the cookie is persistent (as opposed to session based) you'll catch the majority of users who run a single browser.
If you really want to rely on the results, you are going to have to add some form of identification in the process (like e-mail validation, which is still gameable).
At the end of the day any internet survey is going to have flaws (like: http://www.time.com/time/arts/article/0,8599,1894028,00.html), and you'll have to live with this.
Use a persistent cookie to allow only one vote per item
and record the IP; if there are more than 100 (1,000? 10,000?) requests in less than X minutes, then "soft block" the IP
The "soft block": don't show a page saying "your IP has been blocked"; instead, show your "thank you for your vote" page and DON'T record the vote in your DB. You can even increase the counter for that IP only. You want to keep them from knowing that you are blocking their IP.
Two ideas not mentioned yet are:
Asking for the user's email address and emailing them a verification link
Using a captcha
Obviously the former can be circumvented with disposable email addresses and so on, but gives you an audit trail, and provides a significant hurdle to casual/bot vote-stuffing. A good captcha likewise severely limits vote-stuffing, but with all the usual caveats surrounding their use.
I have the same problem, and here's what I am planning on doing...
Set a persistent cookie. Check the cookie to decide whether a particular vote could be cast.
Additionally store some data about the vote request in the form of a combination of IP address + User Agent. And then use this value to limit the no. of votes to, say, 10 per day.
What is the best way of going about creating this hash (IP + UA String)?

Algorithm for unique CD-KEY generation with validation

I am trying to create a unique CD-KEY to put in our product's box, just like a normal CD-KEY found in standard software boxes that users use to register the product.
However we are not selling software, we are selling DNA collection kit for criminal and medical purposes. Users will receive a saliva collection kit by mail with the CD-KEY on it and they will use that CD-KEY to create an account on our website and get their results. The results from the test will be linked to the CD-KEY. This is the only way that we will have to link the results to the patients. It is therefore important that it does not fail :)
One of the requirements would be that the list of CD-KEYs must be sufficiently "spread apart" so that there is no possibility of someone entering an incorrect CD-KEY and still having it approved for someone else's kit, thereby mixing up two kits. That could cost us thousands of dollars in liability.
For example, it cannot be an incremental sequence of numbers such as
00001
00002
00003
...
The reason is that if someone receives kit 00002 but registers it as 00003 by accident, then his results will be matched to someone else. So it must be like credit card numbers... unless a valid sequence is entered, your chances of randomly hitting a valid number are 1 in a million...
Also, we are selling over 50,000 kits annually to various providers (who will generate their own CD-KEYs using our algorithm), so we cannot maintain a list of all previously issued CD-KEYs to check for duplicates. The algorithm must generate unique CD-KEYs.
We also require the ability to verify that a CD-KEY is valid using a quick check algorithm, so that we can inform the user if the code he enters is invalid. This rules out many hashing or MD5-based approaches, I believe. And it cannot be 128 bits long, because who would take the time to type that in from a computer screen?
So far this is what I was thinking the final CD-KEY structure would look like
(4 char product code) - (4 char reseller code) - (12 char unique, verifiable CD-KEY)
Ex. 384A - GTLD - {4565 - FR54 - EDF3}
To ensure the uniqueness of the keys, I could include the current date (20090521) as part of the source. We won't generate unique keys more than once a week, so this value changes often enough for the purpose of a unique initial value.
What possible algorithm can I use to generate the unique keys?
Create the strings <providername>000001, <providername>000002, etc. or whatever and encrypt them with a public key, and that's your "CD-KEY" that the user enters. Decrypt the CD-KEY with the private key and validate that when decrypted you get a valid string with a valid provider name.
Credit card numbers use the Luhn algorithm; you might want to look at something similar to that.
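For reference, the Luhn check digit is cheap to compute and verify; a minimal sketch:

    def luhn_check_digit(payload):
        # Double every second digit from the right, subtract 9 when the doubled
        # value exceeds 9, then pick the digit that makes the total a multiple of 10.
        total = 0
        for i, d in enumerate(int(c) for c in reversed(str(payload))):
            if i % 2 == 0:  # rightmost payload digit gets doubled
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return (10 - total % 10) % 10

    def luhn_valid(number):
        s = str(number)
        return luhn_check_digit(s[:-1]) == int(s[-1])

    # Example: luhn_check_digit("7992739871") == 3, so "79927398713" is valid.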
I use SeriousBit Ellipter for software protection, but I don't see any reason you couldn't generate a group of unique keys each week and use the library to verify a key's validity when it's entered into your web site. You can also encode optional services into the key, allowing you to control how the sample is processed (that's if you have different service levels).
As it uses an encrypted method of key generation in the first place and it's relatively cheap, it's certainly worth a look I would say.
I finally settled on a CD-KEY of this form:
<TIMESTAMP>-<incremented number>-<8 char MD5 hash>-<checksumdigit>
I used the mod 11 ISBN checksum digit algorithm.
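A rough sketch of how such a key could be assembled; the field widths, the daily timestamp, and the choice to run the mod-11 check over only the digits of the key body are all illustrative assumptions:

    import hashlib
    import time

    counter = 0  # in practice an incrementing value kept in the database

    def mod11_check_digit(digits):
        # ISBN-10 style: weight the digits with descending weights and pick the
        # value that makes the weighted sum divisible by 11 ("X" stands for 10).
        total = sum(w * int(d) for w, d in zip(range(len(digits) + 1, 1, -1), digits))
        check = (11 - total % 11) % 11
        return "X" if check == 10 else str(check)

    def generate_key():
        global counter
        counter += 1
        timestamp = time.strftime("%Y%m%d")
        digest = hashlib.md5(f"{timestamp}-{counter}".encode()).hexdigest()[:8].upper()
        body = f"{timestamp}-{counter:06d}-{digest}"
        digits = "".join(c for c in body if c.isdigit())  # assumption: checksum covers digits only
        return f"{body}-{mod11_check_digit(digits)}"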
Generate a GUID and concatenate a random number to it. The GUID is guaranteed to be unique, and the random number makes it improbable that someone will hit a valid code accidentally. Just don't modify the GUID in any way or you might compromise its uniqueness.
http://msdn.microsoft.com/en-us/library/aa475087.aspx
