Algorithm that costs time to run but is easy to verify?

I am designing a website for an experiment. There will be a button which the user must click and hold for a while, then release; the client then submits an AJAX event to the server.
However, to prevent auto-click bots and fast spam, I want the hold time to be real and not skippable, e.g. by doing some calculation. The point is to burn actual CPU time, so that you can't simply guess the AJAX callback value or speed up the system clock to bypass it.
Are there any algorithms that
are fast & easy for the server to generate as a challenge,
cost real time to execute on the client side, with no way to spoof or shortcut the time, and
are easy & fast for the server to verify?

You're looking for a Proof-of-work system.
The most popular algorithm seems to be Hashcash (also on Wikipedia), which is used for bitcoins, among other things. The basic idea is to ask the client program to find a hash with a certain number of leading zeroes, which is a problem they have to solve with brute force.
Basically, it works like this: the client has some sort of token. For email, this is usually the recipient's email address and today's date. So it could look like this:
bob@example.com:04102011
The client now has to find a random string to put in front of this:
asdlkfjasdlbob@example.com:04102011
such that the hash of this has a bunch of leading 0s. (My example won't actually work, because I just made up the prefix.)
Then, on your side, you just have to take this random input and run a single hash on it, to check if it starts with a bunch of 0s. This is a very fast operation.
The reason that the client has to spend a fair amount of CPU time on finding the right hash is that it is a brute-force problem. The only known way to do it is to choose a random string, test it, and, if it doesn't work, choose another one.
Of course, since you're not doing emails, you will probably want to use a different token of some sort rather than an email address and date. However, in your case, this is easy: you can just make up a random string server-side and pass it to the client.
One advantage of this particular algorithm is that it's very easy to adjust the difficulty: just change how many leading zeroes you want. The more zeroes you require, the longer it will take the client; however, verification still takes the same amount of time on your end.
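To make the mechanics concrete, here is a minimal sketch in Python (the token format, SHA-256, and the difficulty value are my own choices for illustration; hashcash proper uses a more specific header format):
import hashlib
import secrets

DIFFICULTY = 20  # required number of leading zero bits; raise it to make the client work longer

def make_challenge():
    # Server side: a random token the client must extend.
    return secrets.token_hex(16)

def leading_zero_bits(digest):
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge, difficulty=DIFFICULTY):
    # Client side: brute-force a nonce until the hash has enough leading zero bits.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{nonce}:{challenge}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge, nonce, difficulty=DIFFICULTY):
    # Server side: a single hash, so verification is cheap.
    digest = hashlib.sha256(f"{nonce}:{challenge}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = make_challenge()
nonce = solve(challenge)         # slow on the client (brute force)
print(verify(challenge, nonce))  # fast on the server: True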

Came back to answer my own question. This is called a Verifiable Delay Function (VDF).
The concept was first proposed in 2018 by Boneh et al., who described several candidate constructions for verifiable delay functions; it is an important tool for adding time delay in decentralized applications. To be exact, a verifiable delay function is a function f: X → Y that takes a prescribed wall-clock time to compute, even on a parallel processor, and outputs a unique result that can be verified efficiently. In short, even when evaluated on a large number of parallel processors, f still requires a specified number of sequential steps to evaluate.
https://www.mdpi.com/1424-8220/22/19/7524
The idea of a VDF is a step beyond @TikhonJelvis's PoW answer, because it "takes a prescribed wall-clock time to compute, even on a parallel processor".
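A minimal sketch of the delay part only, iterated squaring modulo N (toy parameters of my own; practical constructions such as Pietrzak's or Wesolowski's add a succinct proof so the verifier does not have to repeat the squarings):
# Iterated squaring: y = x^(2^t) mod N. Each squaring depends on the previous one,
# so extra processors do not help; t controls the wall-clock delay.
def vdf_eval(x, t, N):
    y = x % N
    for _ in range(t):
        y = (y * y) % N
    return y

# Toy parameters only: real VDFs use an RSA modulus or class group of unknown order,
# plus a short proof so verification does not need to repeat the t squarings.
N = 104729 * 1299709
print(vdf_eval(3, 1_000_000, N))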

Related

socket library poll vs custom poll

So I was having some (arguably) fun with sockets (in C), and then I came across the problem of receiving asynchronously.
As stated here, select and poll do a linear search across the sockets, which does not scale very well. Then I thought: can I do better, knowing application-specific behaviour of the sockets?
For instance, if
Xn: the time of arrival of the nth datagram for socket X (for simplicity, let's assume time is discrete)
Pr(Xn = xn | Xn-1 = xn-1, Xn-2 = xn-2, ...): the probability of Xn = xn given the previous arrival times
is known from statistics, assumptions, or whatever, then I could implement an algorithm that polls sockets in order of decreasing probability.
The question is, is this an insane attempt? Does the library poll/select have some advantage that I can't beat from user space?
EDIT: to clarify, I don't mean to duplicate the semantics of poll and select, I just want a working way of finding at least a socket that is ready to receive.
Also, stuff like epoll exists and all that, which I think is most likely superior but I want to seek out any possible alternatives first.
Does the library poll/select have some advantage that I can't beat from user space?
The C library runs in userspace too, but its select() and poll() functions are almost certainly wrappers for system calls (details vary from system to system). That they wrap single system calls (where in fact they do) gives them a distinct advantage over any scheme involving multiple system calls, such as I imagine the kind of approach you have in mind would require. System calls have high overhead.
All of that is probably moot, however, if you have in mind to duplicate the semantics of select() and poll(): specifically, that when they return, they provide information on all the files that are ready. In order to do that, they must test or somehow watch every specified file, and so, therefore, must your hypothetical replacement. Since you need to scan every file anyway, it doesn't much matter what order you scan them in; a linear scan is probably an ideal choice because it has very low overhead.
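For illustration, here is roughly what the single-system-call approach looks like through Python's wrapper around poll(2); socks is an assumed name for a list of already-connected, non-blocking sockets:
import select

# socks is assumed to be a list of already-connected, non-blocking sockets.
poller = select.poll()
fd_to_sock = {}
for sock in socks:
    poller.register(sock, select.POLLIN)
    fd_to_sock[sock.fileno()] = sock

while True:
    # One system call reports every socket that is ready to read.
    for fd, event in poller.poll():
        if event & select.POLLIN:
            data = fd_to_sock[fd].recv(4096)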

Suggest proof of work algorithm that can be used to control the growth of the blockchain

I'm working on a blockchain based identity system. And, since each item will be in the chain forever, consuming space, I'm thinking on adding a proof of work requirement to add items to the chain.
At first I was thinking of Bitcoin, since it's a tried and tested way to prove that the work was done, but doing it this way would prevent users from joining in, since Bitcoin is not widely adopted yet. Also, in a distributed system, it is not clear who should receive the money.
So I'm looking for a proof-of-work algorithm whose complexity can be easily adjusted based on the blockchain's growth speed, and whose results would be hard to reuse. Also, if the complexity has grown since the work was started, it should be possible to finish the work at the adjusted complexity without having to redo it.
Can someone suggest to me something that would work for my purpose, as well as would be resistant to GPU acceleration?
Simple... burn bitcoins. Anyone can do it, so there's no barrier to entry, and what you really need is "proof of destroyed value". Because the value is destroyed, you know the miners' incentive is to strengthen your chain.
Invent a bitcoin address that cannot be real but checksums correctly. Then have your miners send to that burn address, with a public key in OP_RETURN. Doing so earns them the right to mine during some narrow window of time.
"Difficulty" is adjusted by increasing the amount of bitcoins burned. Multiple burns in the same window can share the reward, but only one block is elected correct (the one with a checksum closest to the checksum of all of the valid burns for the window).

What is the difference between an on-line and off-line algorithm?

These terms were used in my data structures textbook, but the explanation was very terse and unclear. I think it has something to do with how much knowledge the algorithm has at each stage of computation.
(Please, don't link to the Wikipedia page: I've already read it and I am still looking for a clarification. An explanation as if I'm twelve years old and/or an example would be much more helpful.)
The Wikipedia page is quite clear:
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start. In contrast, an offline algorithm is given the whole problem data from the beginning and is required to output an answer which solves the problem at hand. (For example, selection sort requires that the entire list be given before it can sort it, while insertion sort doesn't.)
Let me expand on the above:
An offline algorithm requires all information BEFORE the algorithm starts. In the Wikipedia example, selection sort is offline because step 1 is Find the minimum value in the list. To do this, you need to have the entire list available - otherwise, how could you possibly know what the minimum value is? You cannot.
Insertion sort, by contrast, is online because it does not need to know anything about what values it will sort and the information is requested WHILE the algorithm is running. Simply put, it can grab new values at every iteration.
Still not clear?
Think of the following examples (for four year olds!). David is asking you to solve two problems.
In the first problem, he says:
"I'm, going to give you two balls of different masses and you need to
drop them at the same time from a tower.. just to make sure Galileo
was right. You can't use a watch, we'll just eyeball it."
If I gave you only one ball, you'd probably look at me and wonder what you're supposed to be doing. After all, the instructions were pretty clear. You need both balls at the beginning of the problem. This is an offline algorithm.
For the second problem, David says
"Okay, that went pretty well, but now I need you to go ahead and kick
a couple of balls across a field."
I go ahead and give you the first ball. You kick it. Then I give you the second ball and you kick that. I could also give you a third and fourth ball (without you even knowing that I was going to give them to you). This is an example of an online algorithm. As a matter of fact, you could be kicking balls all day.
I hope this was clear :D
An online algorithm processes the input only piece by piece and doesn't know about the actual input size at the beginning of the algorithm.
An often used example is scheduling: you have a set of machines, and an unknown workload. Each machine has a specific speed. You want to clear the workload as fast as possible. But since you don't know all inputs from the beginning (you can often see only the next in the queue) you can only estimate which machine is the best for the current input. This can result in non-optimal distribution of your workload since you cannot make any assumption on your input data.
An offline algorithm on the other hand works only with complete input data. All workload must be known before the algorithm starts processing the data.
Example:
Workload:
1. Unit (Weight: 1)
2. Unit (Weight: 1)
3. Unit (Weight: 3)
Machines:
1. Machine (1 weight/hour)
2. Machine (2 weights/hour)
Possible result (Online):
1. Unit -> 2. Machine // 2. Machine has now a workload of 30 minutes
2. Unit -> 2. Machine // 2. Machine has now a workload of one hour
either
3. Unit -> 1. Machine // 1. Machine has now a workload of three hours
or
3. Unit -> 2. Machine // 2. Machine has now a workload of 2.5 hours
==> the work is done after 2.5 hours (or 3 hours with the first choice)
Possible result (Offline):
1. Unit -> 1. Machine // 1. Machine has now a workload of one hour
2. Unit -> 1. Machine // 1. Machine has now a workload of two hours
3. Unit -> 2. Machine // 2. Machine has now a workload of 1.5 hours
==> the work is done after 2 hours
Note that the better result of the offline algorithm is only possible because we don't use the faster machine from the start: we already know that a heavy unit (unit 3) is coming, so that unit should be processed by the fastest machine.
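Here is a small sketch of that difference in Python, using the numbers from the example above (greedy assignment in both cases; the only difference is whether the jobs are seen in arrival order or sorted first; ties are broken toward the faster machine to match the example):
# The example above in code: speeds in weight/hour, jobs as weights.
speeds = [1, 2]
jobs = [1, 1, 3]

def schedule(job_order, speeds):
    # Greedily put each job on the machine that finishes it earliest,
    # breaking ties toward the faster machine (as in the example above).
    loads = [0.0] * len(speeds)
    for w in job_order:
        m = min(range(len(speeds)), key=lambda i: (loads[i] + w / speeds[i], -speeds[i]))
        loads[m] += w / speeds[m]
    return max(loads)

print(schedule(jobs, speeds))                        # online: jobs in arrival order -> 2.5 hours
print(schedule(sorted(jobs, reverse=True), speeds))  # offline: heaviest job first  -> 2.0 hours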
An offline algorithm knows all about its input data the moment it is invoked. An online algorithm, on the other hand, can get parts or all of its input data while it is running.
An algorithm is said to be online if it does not know the data it will be executing on beforehand. An offline algorithm may see all of the data in advance.
An on-line algorithm is one that receives a sequence of requests and performs an immediate action in response to each request.
In contrast, an off-line algorithm performs its actions only after all the requests have been received.
This paper by Richard Karp gives more insight into on-line and off-line algorithms.
We can differentiate offline and online algorithms based on the availability of the inputs prior to the processing of the algorithm.
Offline Algorithm: All input information is available to the algorithm and processed simultaneously. With the complete set of input information, the algorithm finds a way to process the inputs efficiently and obtain an optimal solution.
Online Algorithm: Inputs arrive on the fly, i.e. all input information is not available at once but rather part by part, as a sequence or over time. Upon the arrival of each input, the algorithm has to make an immediate decision without any knowledge of future inputs. In this process, the algorithm produces a sequence of decisions that will affect the final quality of its overall performance.
Eg: Routing in a communication network:
Data packets from different sources arrive at the nearest router. More than one communication link is connected to the router. When a new data packet arrives at the router, the router has to decide immediately to which link the packet is to be sent. (Assume all links lead to the destination, all links have the same bandwidth, and all links are part of the shortest path to the destination.) Here, the objective is to assign each incoming data packet to one of the links, without knowing the future data packets, in such a way that the load of each link is balanced. No link should be overloaded. This is a load balancing problem.
Here, the scheduler implemented in the router has no idea about future data packets, but it has to make a scheduling decision for each incoming data packet.
In contrast, an offline scheduler has full knowledge of all incoming data packets; it can then efficiently assign the packets to different links and optimally balance the load among them.
Cache miss problem: In a computer system, the cache is a memory unit used to bridge the speed mismatch between the faster processor and the slower primary memory. The objective of using a cache is to minimize the average access time by keeping some frequently accessed pages in the cache, on the assumption that these pages may be requested by the processor in the near future. Generally, when a page request is made by the processor, the page is fetched from primary or secondary memory and a copy of the page is stored in the cache. Suppose the cache is full; then the algorithm implemented in the cache has to make an immediate decision about replacing a cache block, without knowledge of future page requests. The question arises: which cache block should be replaced? (In the worst case, you may replace a cache block and the very next moment the processor requests the replaced block.)
So the algorithm must be designed in such a way that it makes an immediate decision upon the arrival of an incoming request, without advance knowledge of the entire request sequence. Algorithms of this type are known as ONLINE ALGORITHMS.
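As a sketch of such an online policy, here is a least-recently-used (LRU) cache in Python, which decides what to evict using only the requests seen so far (toy capacity and page numbers of my own):
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page -> contents, least recently used first

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)      # hit: mark as most recently used
            return True
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # miss: evict the least recently used page
        self.pages[page] = None               # fetching from memory would happen here
        return False

cache = LRUCache(2)
for page in [1, 2, 1, 3, 2]:
    print(page, "hit" if cache.access(page) else "miss")
An offline policy (Belady's algorithm) could instead evict the page whose next request lies furthest in the future, but that requires knowing the whole request sequence in advance.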

Locking data for x days

Is there an (easy) way to encrypt data so that it takes a certain number of CPU hours to decrypt it? Maybe a series of encryptions with short key lengths, a variable one-way function, or anything like that?
It's probably not of great use, but what would this encryption scheme be called, and are there tools for it?
edit:
To avoid varying results for the brute-force break time, shouldn't I use many rounds with an XOR feedback?
I just came up with this algorithm (for a symmetric block cipher with equal value and key lengths)... maybe it's nonsense:
round 1
create a zero-block
create random-block-1
encipher value: zero-block with key: random-block-1 => gives lock-output-1
round 2
create a zero-block
create random-block-2
encipher value: zero-block with key: random-block-2 => gives temp
xor temp with random-block-1 => gives lock-output-2
and so on
The XOR with random-block-1 is there so that the unlock routine has to find random-block-1 before it can start brute-forcing lock-output-2.
lock-output-1 + lock-output-2 ... lock-output-N together form the complete lock output. Once the unlock routine has found N key blocks that each decrypt their lock-output block to zero, it can use the N key blocks as a whole to decipher the actual data.
Then I'd also need a formula to calculate how many rounds would give a maximum variation of, e.g., 10% in the wanted number of CPU hours.
I guess a similar algorithm must already exist out there.
The concept is called timed commitment, as defined by Boneh and Naor. The data you want to encrypt is said to be committed by one party (which I call the sender), such that another party (the receiver) may, at some tunable cost, recover the data.
The method described by Boneh and Naor is considerably more advanced than what you suggest. Their timed commitment scheme has the following three properties:
Verifiable recovery: the sender is able to convince the receiver that he really did commit a proper value which the receiver will be able to recover by applying a substantial but feasible amount of CPU muscle to it.
Recovery with proof: once the recovery has been done, it is verifiable efficiently: a third party wishing to verify that the recovered value is indeed the one which was committed, can do so efficiently (without applying hours of CPU to it).
Immunity against parallel attacks: the recovery process cannot benefit from access to a thousand PCs: one cannot go much faster than with a single CPU.
With these properties, a timed commitment becomes a worthwhile tool in some situations; Boneh and Naor mainly discuss contract signing, but also honesty preserving auctions and a few other applications.
I am not aware of any actual implementation or even a defined protocol for timed commitments, beyond the mathematical description by Boneh and Naor.
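A closely related and much simpler construction, without the verifiability properties above, is the Rivest-Shamir-Wagner time-lock puzzle: the sender, who knows the factorization of N, can lock the data cheaply, while the receiver has to perform roughly t sequential squarings. A toy sketch in Python (tiny primes and a toy key, for illustration only):
# Rivest-Shamir-Wagner time-lock puzzle with toy parameters.
p, q = 104729, 1299709          # in practice: large secret primes
N = p * q
phi = (p - 1) * (q - 1)
t = 1_000_000                   # number of sequential squarings the receiver must perform
a = 2
key = 123456                    # toy key protecting the real data; here just XOR-masked

# Sender: knowing phi(N) allows reducing the exponent, so locking is cheap.
b = pow(a, pow(2, t, phi), N)
locked = key ^ b

# Receiver: without phi(N), t sequential squarings are unavoidable.
y = a
for _ in range(t):
    y = (y * y) % N
print(locked ^ y == key)        # True, but only after the delay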
You can encrypt it normally, and release just enough information about the key such that a brute force attack will take X CPU hours.
No, you can't do that reliably, because
the attacker could rent a powerful computer (a computing cloud, for example) and use it for a highly parallel, much faster attack
computers keep getting faster as time passes - what takes a day today might take one minute in two years
Well, to know the number of CPU hours needed for any kind of decryption, it does not really matter how the encryption takes place. Instead you would have to know
what decryption algorithm the decrypter will use (perhaps one that hasn't been invented yet?)
which implementation of that algorithm he will use
which CPU/hardware he will use.
Each of these 3 parameters can make a difference in speed of at least a factor 1000 or more.
An encryption algorithm is considered cracked when someone finds a way to recover the key faster than a brute-force attack (on average).
That's the case for some algorithms like MD5, so make sure you pick one that isn't cracked (yet).
Other algorithms, even if they are not cracked, are still vulnerable to brute-force attacks... it might take a while, but everything that is encrypted can be decrypted; it's only a question of time and resources.
If someone has a huge zombie computer farm working for him around the world, it might take a few hours to crack something that would take years for a guy with a single laptop.
If you want maximum security, you can combine a couple of existing encryption algorithms with a custom algorithm of your own. Someone can still try to crack your data, but most likely, unless you are dealing with top-secret national data, it will probably never happen.
It is relative: how fast a computer can decrypt depends on its computing power, and the algorithm you choose for encryption depends on the data you want to protect. With a good encryption algorithm, an average computer takes its time to decrypt, because there is always a price for good things. I'd also recommend looking at elliptic-curve cryptography, since it offers strong encryption with good performance; you can take a look at it.
That is what I can say about it.

Is user delay between random takes a good improvement for a PRNG?

I thought that for making random choices, for example the next track in a player or the next page in the browser, it could be possible to use time as a 'natural phenomenon'. For example, a decent PRNG could just continuously generate the next random number without a program request (say, in a thread every few milliseconds or even more often), and when the time comes (based on the user's decision), the choice would naturally be affected by this user delay.
Is this approach good enough, and how can it be tested? The problem with testing manually is that I cannot wait long enough in the real world to collect enough random numbers to feed to some test program, and any artificial attempt to speed this up would make the method itself invalid.
Thanks
A good random number generator really doesn't need improvement, and even if it did, it isn't clear that user input timing would help.
Could a user ever detect a pattern in tracks selected by an LCG? Whatever your platform, it's likely that its built-in random() function is good enough (that is, it will appear completely random to a user).
If you are still worried, however, use a cryptographic-quality RNG seeded with data from the dedicated source of randomness on your system. Nowadays, many of these system RNGs use truly random bits generated from physical noise in hardware. However, they can be slow to produce bits, so it's best to use them as a seed for a fast, algorithmic PRNG.
Now, if you aren't convinced these approaches are good enough, you should be very skeptical that the timing of user typing is a good source. The keys that are pressed by users are highly predictable, given the limited vocabulary in use and the patterns that tend to appear within that limited set of words. This predictability in letter sequences leads to a high degree of predictability in timing between key presses.
I know that a lot of security programs use this technique during key generation. I don't think that it is pure snake oil, but it could be a placebo to placate users. A good product will depend on the system RNG.
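If the built-in PRNG ever feels insufficient, it is simpler to draw directly from the OS randomness source than to mix in user timing; a minimal sketch in Python (the track names are placeholders):
import secrets

tracks = ["track01.mp3", "track02.mp3", "track03.mp3"]  # placeholder names

# secrets draws from the operating system's CSPRNG (e.g. /dev/urandom), which the
# kernel already seeds from hardware and timing entropy.
next_track = secrets.choice(tracks)
print(next_track)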
Acquiring the time information that you describe can indeed add entropy to a PRNG. However, from your description of your intended applications, I don't think you need it. For "random choices for example for next track in a player or next page in the browser", a trivial, unmodified PRNG is fine. For security applications such as nonces, etc. it is much more important.
Anyway, you should read about PRNG entropy sources.
I wouldn't improve PRNGs with user delays, mostly because they're quite regular: you type at around the same speed, and it takes too long to measure the delay between one click and another (assuming normal usage). I'd rather use other user-triggered events: which keys were pressed, the distance between clicks, the position of the mouse at given moments.
