Distributed algorithm for storing synchronized data

In a distributed system, all agents are expected to use the same key to encrypt information. The system is completely distributed: no agent is privileged, and agents can come and go.
Each agent holds a key, and I want any agent to be able to replace the key for everyone at once. That is, from some time t onward, all agents will use a new key K' instead of the previous key K.
The connection is reliable (TCP), but a more versatile solution should also be considered.
Is there a standard algorithm to do that?
I'm missing a textbook reference for a similar consensus algorithm.
My idea:
All agents will connect in a circle.
Once an agent has a new key, it sends the key to its neighbours together with the key's creation time.
When an agent receives a new key, it compares the creation times of the two keys and discards the older one.
When an agent discards its previous key, it sends its new key to its neighbours.
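The ring idea above can be sketched as a small in-process simulation. This is only a sketch under strong assumptions: clocks are comparable across agents, no agent fails or joins during propagation, and the names `Key`, `newer` and `propagate` are made up for illustration. Ties on the creation time are broken by the key value so that all agents make the same choice.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    value: str
    created_at: float  # assumption: creation times are comparable across agents


def newer(a: Key, b: Key) -> bool:
    """True if key a should replace key b (ties broken by the key value)."""
    return (a.created_at, a.value) > (b.created_at, b.value)


def propagate(keys: list, origin: int, candidate: Key) -> list:
    """Simulate agent `origin` announcing `candidate` on a ring of agents.

    keys[i] is the key currently held by agent i; a new list is returned.
    """
    keys = list(keys)
    if not newer(candidate, keys[origin]):
        return keys  # the announced key is stale, nothing to do
    keys[origin] = candidate
    pending = deque([origin])  # agents that still have to notify neighbours
    n = len(keys)
    while pending:
        sender = pending.popleft()
        for neighbour in ((sender - 1) % n, (sender + 1) % n):
            # The receiver keeps whichever key is newer and forwards on change.
            if newer(keys[sender], keys[neighbour]):
                keys[neighbour] = keys[sender]
                pending.append(neighbour)
    return keys


if __name__ == "__main__":
    old, new = Key("K", 1.0), Key("K-prime", 2.0)
    print(propagate([old] * 5, origin=2, candidate=new))
```

Note that this only gives eventual agreement: while the new key is travelling around the ring, some agents still use K and others already use K', so it does not by itself provide the "from time t" switch.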

If I get it right, you want an agreement protocol. Depending on your assumptions (e.g., synchronization, possible faults, etc.) there are lots of such protocols.
E.g., in many cases simple group communication protocols (e.g., ISIS) are sufficient; for more sophisticated fault modes you need Byzantine agreement.

You can see this for a few generic approaches, and this for a few famous ones.

Related

consistent hashing where you want a key mapped to multiple servers

I'm wondering if I'm missing a concept here somewhere, and wondering if someone can explain how this might work.
My understanding of consistent hashing makes perfect sense where I want to map a particular key to one server. I can just map the key to a single server or virtual node clockwise or counterclockwise from the key.
How does consistent hashing work if I want to specify that each key should be mapped to some quorum of servers I define? For example, what if I have 5 servers and want each key mapped on at least two servers? Would I just choose the two unique servers clockwise on the ring, or is there some other strategy you need? Could you equivalently choose one server clockwise and one counterclockwise? What if you want a key mapped to an arbitrary number of servers?
My issue may be also that I just don't know the right terminology to search for. My particular use case is that I want to have a cluster of Prometheus collectors, say 7, and say I have a pool of 150 exporters. I want each exporter to be collected by at least 3 of the collectors. What's a good way to think about this problem? Appreciate thoughts, thanks!
It turns out that consistent hashing is actually a special case of rendezvous hashing, which is what I was looking for.
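A minimal sketch of rendezvous (highest random weight) hashing for the use case above: every collector gets a score for each exporter, and the exporter is assigned to the collectors with the highest scores. The names `score` and `assign`, the collector and exporter labels, and the use of SHA-256 as the scoring hash are illustrative assumptions, not part of any Prometheus API.

```python
import hashlib


def score(server: str, key: str) -> int:
    """Pseudo-random weight of a (server, key) pair derived from a hash."""
    digest = hashlib.sha256(f"{server}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign(key: str, servers: list, replicas: int) -> list:
    """Return the `replicas` servers with the highest weight for `key`."""
    ranked = sorted(servers, key=lambda s: score(s, key), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    collectors = [f"collector-{i}" for i in range(7)]
    for exporter in ("exporter-0", "exporter-1", "exporter-2"):
        print(exporter, "->", assign(exporter, collectors, replicas=3))
```

Because the ranking of the remaining servers does not change when one server disappears, losing a collector only moves the exporters that were assigned to it; each of them picks up the next collector in its own ranking.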

How to remove duplicated values in distributed system?

Assume we have a distributed system and there are K machines in the cluster. Each machine stores several integers. I would like to remove all the duplicate values from the system. So if integer 123 appears in machine1 and machine2, we should only keep one 123 in the system. How should I handle this?
My idea is to first let each machine do a removeDuplicate operation using something like bucket sorting (all numbers are integers), and then let one machine be the master node to do a reduce. Is there any better idea?
The easy answer would be to not end up with unmanaged duplicate values on different machines in the first place, by using a distributed hash ring or similar technology to make sure a certain value ends up on a certain node.
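As a rough sketch of that idea, each value can be given a single owning machine, here simply `value % K` rather than a real hash ring, which is an assumption made to keep the example short. Every machine removes its local duplicates and ships each value to its owner; because the owner stores values in a set, every integer survives exactly once and no master node is needed. The function names are hypothetical.

```python
def owner(value: int, num_machines: int) -> int:
    """Index of the machine responsible for `value` (simple modulo partitioning)."""
    return value % num_machines  # non-negative in Python even for negative values


def deduplicate(machines: list) -> list:
    """Redistribute values so that each integer is kept only on its owner.

    machines[i] is the list of integers initially stored on machine i.
    """
    k = len(machines)
    owned = [set() for _ in range(k)]
    for local_values in machines:
        # Local deduplication first, so each value is sent at most once.
        for value in set(local_values):
            owned[owner(value, k)].add(value)
    return owned


if __name__ == "__main__":
    cluster = [[123, 5, 5], [123, 7], [7, 8, -4]]
    print(deduplicate(cluster))
```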
If that's not good enough, I'd look into heuristic optimizations. Since you already have multiple copies on different machines, I'm assuming that you want to deduplicate these values for a little bit of extra performance, rather than application correctness.
If this is the case, let each node slowly pass through its keyspace (foreach integer on node) and ask all other nodes if they have a copy of that same value. If they do, we deduplicate it. If someone doesn't respond (fast enough); ignore them and continue. This allows for a decentralized deduplication algorithm that handles node failures and that can be run in any speed, allowing more important traffic to be prioritized when needed.
I'm guessing that the keys are accessed according to a power law distribution, so sweeping through the most commonly updated keys more often could be more efficient, but there's no guarantee of that.
Not sure what type of system you are interested in, but if shared memory is an option you can keep a counter array. Since all your numbers are integers, you can flag each integer that appears in this shared array. If an integer is already flagged, drop it. This will result in O(k) operations for each integer received and no duplicates.

High cost encryption but less cost decryption

I want the user (or attacker) to encrypt the data and send it to the server. I want an algorithm that is the complete opposite of standard algorithms (which are fast to use and hard to decrypt): it should be very expensive to encrypt data such as passwords using a key sent by the server, to protect against random attacks, but very cheap to decrypt. That way the server spends very little time validating the user, while an attacker has to pay a high cost to encrypt each new trial password with the key sent by the server.
Once again I am not talking about SSL.
It sounds as if you're looking for a proof-of-work scheme -- one of the applications for such a scheme is just what you describe: To force a client to do a certain amount of work and so prevent it from flooding a server with requests.
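A hashcash-style puzzle is one common way to build such a scheme; the sketch below is an illustration under assumptions, not a vetted protocol. The server sends a fresh random `challenge`, the client must find a `nonce` such that the SHA-256 hash of challenge plus nonce falls below a target, and the server checks the answer with a single hash. The function names and the choice of 16 difficulty bits are made up for the example.

```python
import hashlib
import itertools
import os


def _digest(challenge: bytes, nonce: int) -> int:
    """SHA-256 of challenge || nonce, interpreted as a 256-bit integer."""
    data = challenge + nonce.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: try nonces until the hash has `difficulty_bits` leading zero bits.

    Expected cost is about 2 ** difficulty_bits hash evaluations.
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        if _digest(challenge, nonce) < target:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash evaluation."""
    return _digest(challenge, nonce) < (1 << (256 - difficulty_bits))


if __name__ == "__main__":
    challenge = os.urandom(16)  # the server should issue a fresh challenge per request
    nonce = solve(challenge, difficulty_bits=16)
    print(nonce, verify(challenge, nonce, difficulty_bits=16))
```

Raising `difficulty_bits` by one doubles the expected client work while leaving the server's verification cost unchanged.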
One silly idea that might work really well is to attach a "puzzle" to the encryption scheme. Make it so that in order to post encrypted data to the server, you have to solve some known NP-hard problem (for example, finding a satisfying assignment to a large Boolean formula) and send the answer along with it. The server can then easily verify the solution, but under the assumption that P ≠ NP, clients trying to post data have to do a superpolynomial amount of extra work, preventing them from flooding your server.
Hope this helps!
You can use RSA signatures (e.g. PKCS#1).
Your server may accept answers only if they are signed with a certain RSA key whose private part you distributed earlier. The server uses the public part.
RSA has the property that verification is much quicker than signing when you select a small public exponent (typically called e, for instance e=3). The difference is a factor of roughly 10 to 100, depending on the key length and on whether your clients are smart enough to use CRT.
RSA key generation is extremely slow, but it is probably sufficient to use one key for all clients, as long as you include a challenge in your protocol to foil replay attacks.
Finally, RSA signing does not give you confidentiality.

Distributed algorithm design

I've been reading Introduction to Algorithms and started to get a few ideas and questions popping up in my head. The one that's baffled me most is how you would approach designing an algorithm to schedule items/messages in a queue that is distributed.
My thoughts have led me to browsing Wikipedia on topics such as sorting, message queues, scheduling, and distributed hash tables, to name a few.
The scenario:
Say you wanted to have a system that queued messages (strings or some serialized objects, for example). A key feature of this system is to avoid any single point of failure. The system has to be distributed across multiple nodes within some cluster and has to consistently (or as well as possible) even out the workload of each node within the cluster to avoid hotspots.
You want to avoid a master/slave design for replication and scaling (no single point of failure). The system never writes to disk and keeps its data structures in memory.
Since this is meant to be a queue of some sort, the system should be able to use varying scheduling algorithms (FIFO, earliest deadline, round robin, etc.) to determine which message should be returned on the next request, regardless of which node in the cluster receives the request.
My initial thoughts
I can imagine how this would work on a single machine, but when I start thinking about how you'd distribute something like this, questions come up like:
How would I hash each message?
How would I know which node a message was sent to?
How would I schedule each item so that I can determine which message and from which node should be returned next?
I started reading about distributed hash tables and how projects like Apache Cassandra use some sort of consistent hashing to distribute data but then I thought, since the query won't supply a key I need to know where the next item is and just supply it...
This led to reading about peer-to-peer protocols and how they approach the synchronization problem across nodes.
So my question is, how would you approach a problem like the one described above, or is this too far fetched and is simply a stupid idea...?
Just an overview, pointers, different approaches, and the pitfalls and benefits of each. The technologies/concepts/design/theory that may be appropriate. Basically anything that could be of use in understanding how something like this may work.
And if you're wondering, no, I'm not intending to implement anything like this; it just popped into my head while reading (it happens, I get distracted by wild ideas when I read a good book).
UPDATE
Another interesting point that would become an issue is distributed deletes. I know systems like Cassandra have tackled this by implementing hinted handoff, read repair and anti-entropy, and it seems to work well, but are there any other (viable and efficient) means of tackling this?
Overview, as you wanted
There are some popular techniques for distributed algorithms, e.g. using clocks, waves or general purpose routing algorithms.
You can find these in the great distributed algorithms books Introduction to Distributed Algorithms by Tel and Distributed Algorithms by Lynch.
Reductions
Reductions are particularly useful since general distributed algorithms can become quite complex. You might be able to use a reduction to a simpler, more specific case.
If, for instance, you want to avoid having a single point of failure, but a symmetric distributed algorithm is too complex, you can use the standard distributed algorithm of (leader) election and afterwards use a simpler asymmetric algorithm, i.e. one which can make use of a master.
Similarly, you can use synchronizers to run an algorithm designed for a synchronous network model on an asynchronous network.
You can use snapshots to be able to analyze offline instead of having to deal with varying online process states.

Why do you need lots of randomness for effective encryption?

I've seen it mentioned in many places that randomness is important for generating keys for symmetric and asymmetric cryptography and when using the keys to encrypt messages.
Can someone provide an explanation of how security could be compromised if there isn't enough randomness?
Randomness means unguessable input. If the input is guessable, then the output can be easily calculated. That is bad.
For example, Debian had a long-standing bug in its SSL implementation that failed to gather enough randomness when creating a key. This resulted in the software generating one of only 32k possible keys. It is thus easy to decrypt anything encrypted with such a key by simply trying all 32k possibilities, which is very fast given today's processor speeds.
The important feature of most cryptographic operations is that they are easy to perform if you have the right information (e.g. a key) and infeasible to perform if you don't have that information.
For example, symmetric cryptography: if you have the key, encrypting and decrypting is easy. If you don't have the key (and don't know anything about its construction) then you must embark on something expensive like an exhaustive search of the key space, or a more-efficient cryptanalysis of the cipher which will nonetheless require some extremely large number of samples.
On the other hand, if you have any information on likely values of the key, your exhaustive search of the keyspace is much easier (or the number of samples you need for your cryptanalysis is much lower). For example, it is (currently) infeasible to perform 2^128 trial decryptions to discover what a 128-bit key actually is. If you know the key material came out of a time value that you know within a billion ticks, then your search just became 340282366920938463463374607431 times easier.
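The figure above can be checked with a few lines of arithmetic. The sketch compares a full 128-bit key space with a search restricted to a billion candidate time values; the variable names are just for illustration.

```python
import math

full_keyspace = 2 ** 128       # number of possible 128-bit keys
time_candidates = 10 ** 9      # key derived from a time value known within a billion ticks

# How many times smaller the search becomes (integer part of the ratio).
speedup = full_keyspace // time_candidates
print(speedup)

# The same restriction expressed as effective key strength in bits.
effective_bits = math.log2(time_candidates)
print(f"effective key strength: about {effective_bits:.1f} bits instead of 128")
```

In other words, such a key offers roughly 30 bits of security, which a single modern machine can search in very little time.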
To decrypt a message, you need to know the right key.
The more possible keys you have to try, the harder it is to decrypt the message.
Taking an extreme example, let's say there's no randomness at all. When I generate a key to use in encrypting my messages, I'll always end up with the exact same key. No matter where or when I run the keygen program, it'll always give me the same key.
That means anyone who has access to the program I used to generate the key can trivially decrypt my messages. After all, they just have to ask it to generate a key too, and they get one identical to the one I used.
So we need some randomness to make it unpredictable which key you end up using. As David Schmitt mentions, Debian had a bug which made it generate only a small number of unique keys, which means that to decrypt a message encrypted by the default OpenSSL implementation on Debian, I just have to try this smaller number of possible keys. I can ignore the vast number of other valid keys, because Debian's SSL implementation will never generate those.
On the other hand, if there was enough randomness in the key generation, it's impossible to guess anything about the key. You have to try every possible bit pattern. (and for a 128-bit key, that's a lot of combinations.)
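To make the effect concrete, here is a toy sketch of an attack on a key generator that draws from only 2^15 possible seeds. It is not the actual Debian code: `weak_keygen`, `toy_encrypt` (a simple XOR with a hash-derived stream, not a real cipher) and `brute_force` are invented for illustration, and the attacker is assumed to know the first few bytes of the plaintext.

```python
import hashlib

SEED_SPACE = 2 ** 15  # assumption: the generator can only produce this many keys


def weak_keygen(seed: int) -> bytes:
    """Derive a 128-bit key from a tiny seed, so only SEED_SPACE keys exist."""
    return hashlib.sha256(seed.to_bytes(2, "big")).digest()[:16]


def toy_encrypt(key: bytes, message: bytes) -> bytes:
    """Toy XOR cipher for messages up to 32 bytes; decryption is the same call."""
    stream = hashlib.sha256(key).digest()
    return bytes(m ^ s for m, s in zip(message, stream))


def brute_force(ciphertext: bytes, known_prefix: bytes):
    """Try every possible seed until the decryption starts with the known prefix."""
    for seed in range(SEED_SPACE):
        candidate = toy_encrypt(weak_keygen(seed), ciphertext)
        if candidate.startswith(known_prefix):
            return seed
    return None


if __name__ == "__main__":
    secret_seed = 31337
    ciphertext = toy_encrypt(weak_keygen(secret_seed), b"Too many secrets")
    found = brute_force(ciphertext, known_prefix=b"Too many")
    print(found, toy_encrypt(weak_keygen(found), ciphertext))
```

The key is nominally 128 bits long, but the attacker never has to look at more than 32768 candidates.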
It has to do with some of the basic reasons for cryptography:
Make sure a message isn't altered in transit (Immutable)
Make sure a message isn't read in transit (Secure)
Make sure the message is from who it says it's from (Authentic)
Make sure the message isn't the same as one previously sent (No Replay)
etc
There's a few things you need to include, then, to make sure that the above is true. One of the important things is a random value.
For instance, if I encrypt "Too many secrets" with a key, it might come out with "dWua3hTOeVzO2d9w"
There are a few problems with this. An attacker might be able to break the encryption more easily, since I'm using a very limited set of characters. Further, if I send the same message again, it's going to come out exactly the same. Lastly, an attacker could record it and send the message again, and the recipient wouldn't know that I didn't send it, even if the attacker didn't break it.
If I add some random garbage to the string each time I encrypt it, then not only does it make it harder to crack, but the encrypted message is different each time.
The other features of cryptography in the bullets above are fixed using means other than randomness (seed values, two way authentication, etc) but the randomness takes care of a few problems, and helps out on other problems.
A bad source of randomness limits the character set again, so it's easier to break, and if it's easy to guess, or otherwise limited, then the attacker has fewer paths to try when doing a brute force attack.
-Adam
A common pattern in cryptography is the following (sending text from Alice to Bob):
Take plaintext p
Generate random k
Encrypt p with k using symmetric encryption, producing crypttext c
Encrypt k with Bob's public key, using asymmetric encryption, producing x
Send c+x to Bob
Bob reverses the processes, decrypting x using his private key to obtain k
The reason for this pattern is that symmetric encryption is much faster than asymmetric encryption. Of course, it depends on a good random number generator to produce k, otherwise the bad guys can just guess it.
Here's a "card game" analogy: Suppose we play several rounds of a game with the same deck of cards. The shuffling of the deck between rounds is the primary source of randomness. If we didn't shuffle properly, you could beat the game by predicting cards.
When you use a poor source of randomness to generate an encryption key, you significantly reduce the entropy (or uncertainty) of the key value. This could compromise the encryption because it makes a brute-force search over the key space much easier.
Work out this problem from Project Euler, and it will really drive home what "lots of randomness" will do for you. When I saw this question, that was the first thing that popped into my mind.
Using the method he talks about there, you can easily see what "more randomness" would gain you.
A pretty good paper that outlines why not being careful with randomness can lead to insecurity:
http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html
This describes how, back in 1995, the Netscape browser's SSL implementation was vulnerable to guessing of the SSL keys because of a problem with seeding the PRNG.

Resources