Comparing secret data without giving away source - algorithm

Issue:
Company A has secret data they don't want to give away to company B.
Company B has secret data they don't want to give away to company A.
The secret data is IP addresses on both sides.
But the two companies want to know the number of overlapping IPs they have (IP addresses that both companies have in the database).
Without using a third party I can't think of a way to solve this issue without one party compromising their secret data set. Is there any type of hashing algo written to solve this problem?

First I'll describe a simple but not very secure idea. Then I'll describe a way that I think it can be easily made much more secure. The basic idea is to have each company send an encoding of a one-way function to the other company.
Sending Programs
As a warm-up, let's first suppose that one company (let's say A) develops an ordinary computer program in some language and sends it to B; B will then run it, supplying its own list of email addresses as input, and the program will report how many of them are also used by A. At this point, B knows how many email addresses it shares with A. Then the process can be repeated, but with the roles of A and B reversed.
Sending SAT Instances
Implementing this program straightforwardly in a normal programming language would yield a program that is almost trivially easy to reverse-engineer. To mitigate this, first, instead of having the program report the count directly, let's reformulate the problem as a decision problem: Does the other company have at least k of the emails in the input? (This involves choosing some value k to test for; of course, if both parties agree then the whole procedure can be performed for many different values of k. (But see the last section for possible ramifications.)) Now the program can be represented instead as a SAT instance that takes as input (some bitstring encoding of) a list of email addresses, and outputs a single bit that indicates whether k or more of them also belong to the company that created the instance.
It's computationally easy to supply inputs to a SAT instance and read off the output bit, but when the instance is large, it's (in principle) very difficult to go in "the other direction" -- that is, to find a satisfying assignment of inputs, i.e., a list of email addresses that will drive the output bit to 1: SAT being an NP-hard problem, all known exact techniques take time exponential in the problem size.
Making it Harder with Hashing
[EDIT: Actually there are many more than (n choose k) possible hashes to be ORed together, since any valid subsequence (with gaps allowed) in the list of email addresses that contains at least k shared ones needs to turn the output bit on. If each email address takes at most b bits, then there are much more than 2^((n-k)b)*(n choose k) possibilities. It's probably only feasible to sample a small fraction of them, and I don't know if unsampled ones can be somehow turned into "don't-cares"...]
The SAT instance I propose here would certainly be very large, as it would have to be a disjunction (OR) of all (n choose k) possible allowed bitstrings. (Let's assume that email addresses are required to be listed in some particular order, to wipe off an n-factorial factor.) However it has a very regular structure that might make it amenable to analysis that could dramatically reduce the time required to solve it. To get around this, all we need to do is to require the receiver to hash the original input and supply this hash value as input instead. The resulting SAT instance will still look like the disjunction (OR) of (n choose k) possible valid bitstrings (which now represent hashes of lists of strings, rather than raw lists of strings) -- but, by choosing a hash size large enough and applying some logic minimisation to the resulting instance, I'm confident that any remaining telltale patterns can be removed. (If anyone with more knowledge in the area can confirm or deny this, please edit or comment.)
Possible Attacks
One weakness of this approach is that nothing stops the receiver from "running" (supplying inputs to) the SAT instance many times. So, choosing k too low allows the receiver to easily isolate the email addresses shared with the sender by rerunning the SAT instance many times using different k-combinations of their own addresses, and dummy values (e.g. invalid email addresses) for the remaining input bits. E.g. if k=2, then the receiver can simply try running all n^2 pairs of its own email addresses and invalid email addresses for the rest until a pair is found that turns the output bit on; either of these email addresses can then be paired with all remaining email addresses to detect them in linear time.
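The k=2 attack described above can be sketched in a few lines. This is only an illustration under stated assumptions: `make_oracle` is a hypothetical stand-in for the SAT instance (it answers the single bit "at least k shared?" and nothing else), `recover_shared` is a made-up name for the receiver's procedure, and the padding with dummy addresses is left out because the stand-in accepts lists of any length.

```python
from itertools import combinations


def make_oracle(sender_addresses, k):
    """Hypothetical stand-in for the SAT instance: one output bit only."""
    secret = frozenset(sender_addresses)

    def oracle(candidates):
        # True when at least k of the supplied addresses belong to the sender.
        return len(secret.intersection(candidates)) >= k

    return oracle


def recover_shared(oracle, own_addresses):
    """Receiver-side attack for k = 2, using nothing but oracle queries."""
    own = list(own_addresses)
    # Step 1: try pairs of our own addresses until one turns the bit on.
    for a, b in combinations(own, 2):
        if oracle([a, b]):
            break
    else:
        return set()  # fewer than two shared addresses: no pair succeeds
    # Step 2: pair one known shared address with every remaining address.
    shared = {a, b}
    for c in own:
        if c not in shared and oracle([a, c]):
            shared.add(c)
    return shared
```

The first step costs at most a quadratic number of queries and the second a linear number, which matches the estimate above. A larger k makes the first step more expensive but does not remove the problem.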

You should be able to use homomorphic encryption to carry out the computation. I imagine creating something like bitmasks on both sites, performing encryption, then performing a XOR of the result. I think this source points to some information on what encryption you can perform that supports XOR.

Related

How can I generate a unique identifier that is apparently not progressive [duplicate]

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
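For readers who want to experiment with the scheme just described, here is a rough Python sketch. Several details are assumptions, since the post does not give them: the 16/32/16 split of the 64 bits, the exact set of excluded characters (chosen here so that 54 symbols remain), and all function names.

```python
import secrets
import string

# Assumed alphabet: A-Z, a-z, 2-9 without the look-alikes I, L, O, i, l, o.
_CONFUSABLE = set("ILOilo")
ALPHABET = "".join(
    c
    for c in string.ascii_uppercase + string.ascii_lowercase + "23456789"
    if c not in _CONFUSABLE
)
BASE = len(ALPHABET)  # 54 with the exclusions above


def embed(seq_id):
    """Place a 32-bit sequential id in the middle of 64 otherwise random bits."""
    if not 0 <= seq_id < 2**32:
        raise ValueError("seq_id must fit in 32 bits")
    high = secrets.randbits(16)  # random top 16 bits
    low = secrets.randbits(16)   # random bottom 16 bits
    return (high << 48) | (seq_id << 16) | low


def extract(value):
    """Recover the sequential id from the center bits."""
    return (value >> 16) & 0xFFFFFFFF


def encode(value):
    """Encode a non-negative integer using the lookup table."""
    digits = []
    while True:
        value, remainder = divmod(value, BASE)
        digits.append(ALPHABET[remainder])
        if value == 0:
            break
    return "".join(reversed(digits))


def decode(code):
    """Inverse of encode."""
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return value
```

With this layout, `extract(decode(encode(embed(n))))` gives back `n`, which also shows that anyone who learns the layout can read the sequence number straight out of a code.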
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list and concatenate them and add a random 3 digits number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to check how often my algorithm produces collisions. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.)
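A minimal Python sketch of the idea, including the collision test. The word lists here are tiny, made-up placeholders (a real list would come from a filtered dictionary as described above), and the names `make_code`, `keyspace` and `count_collisions` are hypothetical.

```python
import secrets

# Placeholder word lists; a real deployment would load filtered dictionaries.
VERBS = ["eat", "pick", "ride", "find", "make"]
NOUNS = ["cake", "basket", "flyer", "river", "stone"]


def make_code():
    """verb + Noun + three random digits, e.g. something like eatCake778."""
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(1000)
    return f"{verb}{noun.capitalize()}{number:03d}"


def keyspace():
    """Number of distinct codes this parameter choice can produce."""
    return len(VERBS) * len(NOUNS) * 1000


def count_collisions(samples):
    """Generate `samples` codes and count how many repeat an earlier one."""
    seen = set()
    duplicates = 0
    for _ in range(samples):
        code = make_code()
        if code in seen:
            duplicates += 1
        seen.add(code)
    return duplicates
```

As a rule of thumb from the birthday bound, duplicates become likely once the number of issued codes approaches the square root of `keyspace()`, so the word lists need to be large if codes are generated independently and never checked against earlier ones.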
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
// requires: using System.Security.Cryptography;
byte[] randomBytes = new byte[4];
using (var rng = new RNGCryptoServiceProvider())
    rng.GetBytes(randomBytes); // fills the array with cryptographically strong random bytes
You can increase the length of the byte array and pluck out the character values you want to allow.
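The "pluck out the characters you want" step can be sketched as follows. This is a Python analogue rather than .NET code; `secrets.token_bytes` plays the role of `GetBytes()`, and the allowed character set and the name `random_code` are assumptions made for the example.

```python
import secrets

# Assumed set of easily typed characters; adjust to taste.
ALLOWED = frozenset("ABCDEFGHJKMNPQRSTUVWXYZ23456789")


def random_code(length, allowed=ALLOWED):
    """Draw random bytes and keep only those that map to allowed characters."""
    out = []
    while len(out) < length:
        # Ask for more bytes than needed, since most will be discarded.
        for byte in secrets.token_bytes(4 * length):
            ch = chr(byte)
            if ch in allowed:
                out.append(ch)
                if len(out) == length:
                    break
    return "".join(out)
```

Discarding unwanted bytes, instead of folding them into range with a modulo, keeps every allowed character equally likely.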
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
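If the application can check for duplicates, one common pattern is to generate a random key and retry on a clash. The sketch below assumes an in-memory set as the "store", a 31-symbol alphabet and a function name `new_key`, all of which are illustrative only; a real application would rely on a unique constraint in its database instead.

```python
import secrets

# Assumed alphabet of 31 easily typed symbols.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def new_key(existing, length=6, max_attempts=100):
    """Return a random key not yet in `existing`, and record it there."""
    for _ in range(max_attempts):
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in existing:
            existing.add(key)
            return key
    raise RuntimeError("key space too crowded; increase the key length")
```

Usage would look like `issued = set(); key = new_key(issued)`. The shorter the key, the sooner retries become frequent and the easier the keys are to guess, which is the conflict between goals described above.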
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.

Solidity - Generate unpredictable random number that does not depend on input

I know that the "how to generate random number" in solidity is a very common question. However, after reading the great majority of answers I did not find one to fit my case.
A short description of what I want to do is: I have a list of objects that each have a unique id, a number. I need to produce a list that contains 25% of those objects, randomly selected each time the function is called. The person calling the function cannot be depended on to provide input that will somehow influence predictably the resulting list.
The only answer I found that gives a secure random number was Here. However, it depends on input coming from the participants and it is meant to address a gambling scenario. I cannot use it in my implementation.
All other cases mention that the number generated is going to be predictable, and even some of those depend on a singular input to produce a single random number. Once again, does not help me.
Summarising, I need a function that will give me multiple, non-predictable, random numbers.
Thanks for any help.
Here is an option:
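function rand()
    public
    view
    returns (uint256)
{
    // Mix several block and caller properties into one hash.
    // block.timestamp replaces its alias `now`, which was removed in Solidity 0.7.
    uint256 seed = uint256(keccak256(abi.encodePacked(
        block.timestamp + block.difficulty +
        ((uint256(keccak256(abi.encodePacked(block.coinbase)))) / (block.timestamp)) +
        block.gaslimit +
        ((uint256(keccak256(abi.encodePacked(msg.sender)))) / (block.timestamp)) +
        block.number
    )));
    // Reduce the hash to the range 0..999.
    return seed % 1000;
}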
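It generates a pseudorandom number between 0 and 999 that is hard to guess from outside the chain (it has been used by some famous Dapps such as Fomo3D). Note, however, that every input is known to the block producer and to any contract that calls it within the same transaction, so it is not unpredictable to them.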
Smart contracts are deterministic, so basically every function is predictable: if we know the input, we can and should know the output. And you cannot get a random number without any input; almost every language generates pseudorandom numbers seeded from the clock. This means you will not get a random number on the blockchain using a simple method.
There are many interesting methods to generate random number using Smart Contract - using DAO, Oracle, etc. - but they all have some trade-offs.
So in conclusion, the method you are looking for does not exist. You need to sacrifice something.
:(
100% randomness is definitely impossible on Ethereum. The reason is that when distributed nodes build the blockchain from scratch, they construct the state by running every single transaction ever recorded on the chain, and all of them have to arrive at exactly the same final state. For that to work, randomness is forbidden in the Ethereum Virtual Machine; otherwise each execution of the exact same code could yield a different result, which would make it impossible for the participants of the network to agree on a common final state.
That being said, there are projects like RanDAO that aim to create trustworthy pseudorandomness on the blockchain.
In any case, there are approaches to achieving pseudorandomness; two of the most important are commit-reveal techniques and using an oracle (or a combination of both).
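To make the commit-reveal idea concrete, here is an off-chain Python sketch of the logic, including one way to turn the resulting seed into a 25% selection as asked in the question. It is a model of the protocol, not contract code; SHA-256 stands in for the contract's hash function, and all function names are hypothetical.

```python
import hashlib


def commit(secret):
    """Phase 1: each participant publishes only the hash of a secret."""
    return hashlib.sha256(secret).digest()


def verify(commitment, secret):
    """Phase 2: a revealed secret must match the earlier commitment."""
    return commit(secret) == commitment


def combine(revealed_secrets):
    """Derive one shared seed from all revealed secrets, in a fixed order."""
    h = hashlib.sha256()
    for secret in revealed_secrets:
        h.update(hashlib.sha256(secret).digest())
    return h.digest()


def select_quarter(ids, seed):
    """Deterministically pick 25% of the ids with a seeded partial shuffle."""
    pool = sorted(ids)
    count = len(pool) // 4
    chosen = []
    for i in range(count):
        # Fresh 256-bit value per draw, derived from the seed and the index.
        digest = hashlib.sha256(seed + i.to_bytes(4, "big")).digest()
        j = i + int.from_bytes(digest, "big") % (len(pool) - i)
        pool[i], pool[j] = pool[j], pool[i]
        chosen.append(pool[i])
    return chosen
```

Two caveats apply to any scheme of this shape: the secrets must be long and random (otherwise a commitment can be brute-forced), and the last participant to reveal can see the outcome first and refuse to reveal, so real protocols usually add a deposit or a penalty for not revealing.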
As an example that just occurred to me: you could use Oraclize to call from time to time to a trusted external JSON API that returns pseudorandom numbers and verify on the contract that the call has truly been performed.
Of course the downside of these methods is that you and/or your users will have to spend more gas executing the smart contracts, but it's in my opinion a fair price for the huge benefits in security.

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on your definition of intention. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
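The anomaly is easy to reproduce with a toy model. The sketch below is a simplification, not Logoot itself: identifiers have a single (digit, site) level, each site picks the midpoint of the remaining gap instead of a random point, and the inserted words are made up. Only the positions (20, A) and (25, A) come from the question.

```python
def type_words(site, words, left, right):
    """Insert words left to right, each at the midpoint of the remaining gap."""
    ops = []
    for word in words:
        digit = (left + right) // 2
        if digit <= left:
            raise ValueError("no room left at this identifier level")
        ops.append(((digit, site), word))
        left = digit  # the next word goes after the one just inserted
    return ops


def merge(*op_lists):
    """Total order: sort by (digit, site), as the position identifiers dictate."""
    ops = [op for op_list in op_lists for op in op_list]
    return sorted(ops, key=lambda op: op[0])


def text(ops):
    return " ".join(word for _, word in ops)


base = [((20, "A"), "I"), ((25, "A"), "conquered")]
site_b = type_words("B", ["came", "then", "saw"], 20, 25)
site_c = type_words("C", ["fought", "and", "won"], 20, 25)
merged = merge(base, site_b, site_c)
# text(merged) is "I came fought then and saw won conquered"
```

Both sites converge on the same order regardless of merge order, so the algorithm is doing what it promises; the interleaving comes from the identifiers carrying no information about which insertions belonged together.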
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)

Iterative Hash Algorithm for Fast File Check

I want to create a representation of the state of all files in a folder (ignoring order), so that I can send this state to another computer to check if we are in sync. This "state representation" is 3 numbers concatenated by . which are:
sum . product . number of items
The "sum" is the numerical addition of all of the files' md5 numerical representations.
The product is the multiplication of all of the files' md5 numerical representations.
The number of items is just the number of files.
The main reason for doing this is that this allows me to create unique states iteratively/quickly when I add or delete a file (a modification being a combination of delete then add). Also, one should end up with the same "state" even if the same set of operations are performed in any random order.
Adding A File
Generate the file's md5
Calculate the md5's numerical value (x).
Add x to the sum
Multiply the product by x
Increment the number of items.
Removing A File
Generate the file's md5
Calculate the md5's numerical value (x).
Subtract x from the sum
Divide the product by x
Decrement the number of items.
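The add and remove steps above can be sketched in Python, whose built-in integers have arbitrary precision, so no string-based arithmetic library is needed. The class name `FolderState` and its methods are hypothetical, and file contents are passed in as bytes to keep the example self-contained.

```python
import hashlib


class FolderState:
    """Order-independent state: sum, product and count of file md5 values."""

    def __init__(self):
        self.total = 0
        self.product = 1
        self.count = 0

    @staticmethod
    def _md5_int(content):
        # Numerical value of the md5 digest of the file content.
        return int.from_bytes(hashlib.md5(content).digest(), "big")

    def add(self, content):
        x = self._md5_int(content)
        self.total += x
        self.product *= x
        self.count += 1

    def remove(self, content):
        x = self._md5_int(content)
        self.total -= x
        self.product //= x  # exact, because x was multiplied in earlier
        self.count -= 1

    def __str__(self):
        return f"{self.total}.{self.product}.{self.count}"
```

One thing the sketch makes visible is that the product grows by about 128 bits per file, so the state string gets longer with every file added; the sum and the count stay small.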
Problems
Since the numerical representations of hashes can be quite large, I may have to use a library to generate results using strings rather than integers which may be quite slow.
With the limited testing I have done, I have not been able to create "collisions" where a collision is where two different sets of file hashes could produce the same state (remember that we are ignoring the order of the file hashes).
Question
I'm sure that I can't be the first person to want to achieve such a thing. Is there an algorithm or iterative hash function that aims to do the same thing already, preferably in PHP, Java, or Python? Is there a term for this type of thing, all I could think of was "iterative hash"? Is there a flaw with this algorithm that I haven't spotted, such as with "collisions" making generated state representations non-unique?
How many states can your file system take? Infinitely many, for all practical purposes.
How long is your hash? Short enough to be efficient, and finite in any case.
Will you get collisions? Yes.
So your hash approach seems fine, particularly if it spreads nearby points well, i.e. two states of the file system that differ in the content of just one file hash to very different values.
However, you should expect your hash to produce collisions in the long run: since the chance of a collision is not 0, the probability that you eventually hit one approaches 1.
So to be really safe, you probably need a full MD5 exchange. If speed and fast updates are the goal, your scheme sounds good, but I would back it up with less frequent exchanges of longer keys, just to be on the safe side if sync is mission critical.

Mapping a value to an other value and back

Imagine a value, say '1234'. I want to map that value to another value, say 'abcd'. The constraints:
The length of the target value is equal to the start value
The mapping should be unique. E.g. 1234 should only map to abcd and vice versa
The mapping process should be (very) difficult to guess. E.g. multiplying by 2 does not count
The mapping should be reversible
The start value is an integer
The target value can be of any type
This should be a basic algorithm, eventually I'll write it in Ruby but that is of no concern here.
I was thinking along the following lines:
SECRET = 1234

def to(int)
  SECRET + int * 2
end

def fro(int)
  (int - SECRET) / 2
end
Obviously this violates constraints 1 and 3.
The eventual goal is to anonymize records in my database. I might be over thinking this.
First off, I rather think your objectives are too ambitious: why constraint 6?
Second, what you need is technically a bijection from the domain of integers.
Third, your constraint 3 goes against Kerckhoffs's principle. You'd be better off with a well-known algorithm governed by a secret key, where the secret key is hard to derive even if you know the results for a large set of integers.
Fourth, what are you anonymizing against? If you are dealing with personal information, how will you protect against statistical analysis revealing that Xyzzy is actually John Doe, based on the relations to other data? There's some research on countering such attack vectors (google for e.g. 'k-anonymization').
Fifth, use existing cryptographic primitives rather than trying to invent your own. Encryption algorithms exist (e.g. AES in cipher-block-chaining mode) that are well-tested -- AES is well supported by all modern platforms, presumably Ruby as well. However, encryption still doesn't give records anonymity in any strong sense.
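To illustrate what "a bijection governed by a secret key" can look like while keeping the number of digits fixed, here is a toy Python sketch: a small Feistel network with HMAC-SHA256 as the round function, plus cycle walking to stay inside the decimal range. This is an assumption-laden illustration, not a vetted cipher and not the AES approach mentioned above; for production use, a standardized format-preserving encryption mode such as NIST FF1 from an audited library would be the safer route. The names `to` and `fro` simply mirror the question.

```python
import hashlib
import hmac


def _round(key, round_no, half, bits):
    """Keyed round function: HMAC of (round number, half), cut to `bits` bits."""
    msg = bytes([round_no]) + half.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << bits) - 1)


def _feistel(key, value, half_bits, rounds, decrypt=False):
    mask = (1 << half_bits) - 1
    left, right = value >> half_bits, value & mask
    order = range(rounds - 1, -1, -1) if decrypt else range(rounds)
    for r in order:
        if decrypt:
            left, right = right ^ _round(key, r, left, half_bits), left
        else:
            left, right = right, left ^ _round(key, r, right, half_bits)
    return (left << half_bits) | right


def _walk(key, value, digits, rounds, decrypt):
    domain = 10 ** digits
    if not 0 <= value < domain:
        raise ValueError("value does not fit in the given number of digits")
    half_bits = ((domain - 1).bit_length() + 1) // 2
    out = _feistel(key, value, half_bits, rounds, decrypt)
    while out >= domain:  # cycle walking: repeat until back inside the range
        out = _feistel(key, out, half_bits, rounds, decrypt)
    return out


def to(key, value, digits=4, rounds=8):
    """Map a `digits`-digit integer to another one, reversibly."""
    return _walk(key, value, digits, rounds, decrypt=False)


def fro(key, value, digits=4, rounds=8):
    """Inverse of `to` under the same key."""
    return _walk(key, value, digits, rounds, decrypt=True)
```

For display, `f"{to(key, 1234):04d}"` keeps leading zeros so the length matches the input. The algorithm is public and only `key` is secret, which is the point of Kerckhoffs's principle; with a domain this small, though, anyone who can query the mapping can simply tabulate it.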
It might be worth giving a little more detail on what you're trying to achieve. Presumably you're worried about some evil person getting hold of your data, but isn't it equally possible that this evil person will also have access to the code that accesses your database? What's to stop them learning the algorithm by inspecting your code?
If you truly want to anonymize the data then that's generally a one-way thing (names are removed, credit card values are removed, etc.). If you're trying to encrypt the contents of the database then many database engines provide well-tested mechanisms to do this. For example:
Best practices for dealing with encrypted data in MSSQL
database encryption
It's always better to use a product's encryption mechanism than roll your own.
