What can I parallelize in an event-driven Python web application?

I'm trying to build a simple application on Tornado, which is an event-driven web server. Since it's Python, I'd like to use multiprocessing, but for what?
Password hashing is a sequential operation, no? If I hash a password 1000 times, doesn't operation n depend on the result of operation n-1?
What about image processing? If the images come from forms, doesn't the server have to wait until the client submits the form?
The only multiprocessing example I can come up with is 3D rendering: the more processes you add, the more time you save.

Why would you add the considerable complexity of multiprocessing when there is no actual need? If you want to take advantage of multiple cores, just run several Tornado instances behind Nginx. For trivial tasks like hash computations, template rendering, etc., the overhead is more than acceptable. If you have more complex scenarios, delegate the work to a task queue such as Celery.
Hashing is an O(n) operation, but that doesn't mean each hash needs the previous calculation. Also, 3D rendering doesn't take place on the server :)
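The distinction matters in practice: each hash chain is sequential internally, but separate requests are independent, so a process pool parallelizes across them. A minimal sketch (the name `hash_password` is mine, not a Tornado API; in a Tornado handler you would typically await such work via `IOLoop.run_in_executor`):

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor

def hash_password(password: str, rounds: int = 1000) -> str:
    """Iterated SHA-256: round n needs the output of round n-1,
    so a single password cannot be parallelized internally --
    but independent passwords can be hashed in parallel."""
    digest = password.encode()
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

if __name__ == "__main__":
    passwords = ["alice", "bob", "carol", "dave"]
    # Each chain is sequential, but the four chains are independent,
    # so a process pool can use four cores at once.
    with ProcessPoolExecutor() as pool:
        digests = list(pool.map(hash_password, passwords))
    print(len(digests))
```

Note this is exactly the shape of work that is usually cheaper to hand to a queue worker (Celery, as above) than to manage in-process.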

Related

socket library poll vs custom poll

So I was having some (arguably) fun with sockets (in C) when I came across the problem of receiving asynchronously.
As stated here, select and poll do a linear search across the sockets, which does not scale very well. So I thought: can I do better, knowing application-specific behaviour of the sockets?
For instance, suppose
Xn: the arrival time of the nth datagram on socket X (for simplicity, assume time is discrete)
Pr(Xn = xn | Xn-1 = xn-1, Xn-2 = xn-2, ...): the probability of Xn = xn given the previous arrival times
is known by statistics, assumption, or whatever. I could then implement an algorithm that polls sockets in order of decreasing probability.
The question is: is this an insane attempt? Does the library poll/select have some advantage that I can't beat from user space?
EDIT: to clarify, I don't mean to duplicate the semantics of poll and select; I just want a working way of finding at least one socket that is ready to receive.
Also, things like epoll exist, which I think are most likely superior, but I want to explore any possible alternatives first.
Does the library poll/select have some advantage that I can't beat from user space?
The C library runs in userspace, too, but its select() and poll() functions almost certainly are wrappers for system calls (but details vary from system to system). That they wrap single system calls (where in fact they do so) does give them a distinct advantage over any scheme involving multiple system calls, such as I imagine would be required for the kind of approach you have in mind. System calls have high overhead.
All of that is probably moot, however, if you have in mind to duplicate the semantics of select() and poll(): specifically, that when they return, they provide information on all the files that are ready. In order to do that, they must test or somehow watch every specified file, and so, therefore, must your hypothetical replacement. Since you need to scan every file anyway, it doesn't much matter what order you scan them in; a linear scan is probably an ideal choice because it has very low overhead.
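To see what a single readiness syscall buys, here is a small sketch using Python's `select` module (the same linear-scan interface discussed above). One `select()` call reports readiness for all watched sockets at once, whereas a probability-ordered probing scheme would pay one syscall per socket probed:

```python
import select
import socket

# Three connected socket pairs; only one will have data pending.
pairs = [socket.socketpair() for _ in range(3)]
readers = [r for r, _ in pairs]

pairs[1][1].send(b"ping")   # make only pair 1 readable

# A single select() system call checks readiness of ALL watched
# sockets; probing them one at a time in "smart" order would still
# cost a syscall per probe, which is where the overhead lives.
readable, _, _ = select.select(readers, [], [], 0)

assert readable == [readers[1]]
print(readable[0].recv(4))

for r, w in pairs:
    r.close()
    w.close()
```

This is also why epoll wins at scale: registration is done once, and the kernel reports only the ready descriptors, rather than rescanning the full set per call.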

Algorithm that costs time to run but easy to verify?

I am designing a website for an experiment. There will be a button which the user must click and hold for a while, then release; the client then submits an AJAX event to the server.
However, to prevent auto-click bots and fast spam, I want the hold time to be real and not skippable, e.g., by doing some calculation. The point is to consume actual CPU time, so that you can't simply guess the AJAX callback value or speed up the system clock to bypass it.
Is there an algorithm that is
fast & easy to generate as a challenge on the server,
costly in time to execute on the client side, with no way to spoof or shortcut it, and
easy & fast to verify on the server?
You're looking for a Proof-of-work system.
The most popular algorithm seems to be Hashcash (also on Wikipedia), which is used by Bitcoin, among other things. The basic idea is to ask the client program to find a hash with a certain number of leading zeroes, which is a problem it has to solve by brute force.
Basically, it works like this: the client has some sort of token. For email, this is usually the recipient's email address and today's date. So it could look like this:
bob@example.com:04102011
The client now has to find a random string to put in front of this:
asdlkfjasdlbob@example.com:04102011
such that the hash of this has a bunch of leading 0s. (My example won't work because I just made up a number.)
Then, on your side, you just have to take this random input and run a single hash on it, to check if it starts with a bunch of 0s. This is a very fast operation.
The reason the client has to spend a fair amount of CPU time finding the right hash is that it is a brute-force problem. The only known way to do it is to choose a random string, test it, and if it doesn't work, choose another one.
Of course, since you're not doing emails, you will probably want to use a different token of some sort rather than an email address and date. However, in your case, this is easy: you can just make up a random string server-side and pass it to the client.
One advantage of this particular algorithm is that it's very easy to adjust the difficulty: just change how many leading zeroes you want. The more zeroes you require, the longer it will take the client; however, verification still takes the same amount of time on your end.
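A minimal sketch of this scheme in Python, using SHA-256 and measuring difficulty in leading zero hex digits (the function names and token format are illustrative, not the exact Hashcash spec):

```python
import hashlib
from itertools import count

def make_challenge() -> str:
    # Server side: any fresh random token works; fixed here for the demo.
    return "bob@example.com:04102011"

def solve(token: str, difficulty: int = 4) -> int:
    """Client side: brute-force a counter until the SHA-256 digest
    starts with `difficulty` zero hex digits (~16**difficulty tries)."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{nonce}:{token}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(token: str, nonce: int, difficulty: int = 4) -> bool:
    # Server side: one hash, regardless of difficulty.
    digest = hashlib.sha256(f"{nonce}:{token}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

token = make_challenge()
nonce = solve(token, difficulty=4)   # costs the client ~65,000 hashes on average
assert verify(token, nonce, difficulty=4)
```

Raising `difficulty` by one multiplies the client's expected work by 16 while the server's verification stays a single hash, which is exactly the asymmetry the question asks for.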
Came back to answer my own question: this is called a Verifiable Delay Function (VDF).
The concept was first proposed in 2018 by Boneh et al., who proposed several candidate constructions for verifiable delay functions; it is an important tool for adding time delay in decentralized applications. To be exact, a verifiable delay function is a function f: X → Y that takes a prescribed wall-clock time to compute, even on a parallel processor, and outputs a unique result that can be verified efficiently. In short, even when evaluated on a large number of parallel processors, it still requires a specified number of sequential steps.
https://www.mdpi.com/1424-8220/22/19/7524
The idea of a VDF is a step beyond @TikhonJelvis's PoW answer because it "takes a prescribed wall-clock time to compute, even on a parallel processor".
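The squaring-based constructions get their sequentiality from repeated modular squaring, the same core as the classic Rivest-Shamir-Wagner time-lock puzzle that preceded VDFs: computing x^(2^T) mod N takes T squarings one after another, unless you hold the factorization of N as a trapdoor. A toy sketch with deliberately tiny primes (real schemes use large moduli, and real VDFs attach a succinct proof so anyone can verify quickly without any trapdoor):

```python
# Toy time-lock: compute x^(2^T) mod N by T sequential squarings.
p, q = 10007, 10009          # toy primes; real schemes use ~1024-bit primes
N = p * q
phi = (p - 1) * (q - 1)
x, T = 5, 10_000             # T = number of forced sequential steps

# Slow path (no trapdoor): T squarings, each depending on the last,
# so extra processors do not help.
y = x % N
for _ in range(T):
    y = (y * y) % N

# Fast path (knowing the factorization of N): reduce the exponent
# 2^T modulo phi(N) first, by Euler's theorem (gcd(x, N) = 1 here).
e = pow(2, T, phi)
y_fast = pow(x, e, N)

assert y == y_fast
```

The gap between the two paths is the whole point: without the trapdoor, T is a lower bound on sequential work no matter how much parallel hardware the evaluator rents.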

redis batsd (statsd) counter

I use the batsd library (which builds on statsd) with the jeremy/statsd-ruby client for my Ruby web application (Rails), and I have to keep simple visit statistics. Great! I use the statsd.increment('users.visits') method from the above gem.
Then I noticed that this operation creates a new sorted set (zset) and adds one element (it looks like "1338932870<X>1") every time.
Why does statsd use this approach? Wouldn't it be easier and faster to use the HINCRBY method with a simple hash (rather than ZADD to a zset)?
I know statsd is a good, well-known instrument, but I wonder: is this the standard counter pattern in Redis? I'm new to Redis and NoSQL in general, thank you!
I'm not familiar with the package, but if you just use HINCRBY, you will only keep the latest value of the metric in Redis. A statistics package typically needs to store the evolution of the metric (in order to plot a graph over time or something similar).
Using a zset is a way to store the events ordered by timestamp (i.e., a time series), and therefore to keep a history of the metric's evolution. It is slower and consumes much more memory than just keeping the last value, but you get the history.
Using HINCRBY or INCRBY to aggregate counters in real time, and using a zset to store time series, are two common Redis patterns.
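To make the contrast concrete, here is an in-memory sketch of the two patterns (this is not batsd's actual storage layout; the Redis commands in the comments are the real counterparts):

```python
import time

# In-memory sketch of the two Redis patterns contrasted above.
counters = {}      # plain counter:  HINCRBY stats users.visits 1
timeseries = {}    # time series:    ZADD users.visits <timestamp> <member>

def increment(metric, now=None):
    now = now if now is not None else time.time()
    # The counter keeps only the running total: O(1) memory, no history.
    counters[metric] = counters.get(metric, 0) + 1
    # The zset keeps one entry per event, scored by timestamp, so you
    # can later ask "how many visits happened between t1 and t2?".
    timeseries.setdefault(metric, []).append(now)

for t in (10, 20, 30):
    increment("users.visits", now=t)

print(counters["users.visits"])                                      # 3
# ZCOUNT users.visits 15 35  -> events inside a time window
print(sum(1 for t in timeseries["users.visits"] if 15 <= t <= 35))   # 2
```

The hash answers "how many in total, right now" cheaply; the zset answers "how many, and when", which is what a graphing tool like batsd needs.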

Locking data for x days

Is there an (easy) way to encrypt data so that it takes a certain number of CPU hours to decrypt? Maybe a series of encryptions with short key lengths, a variable one-way function, or anything like that?
It's probably not of great use, but what would this encryption scheme be called, and are there tools for it?
edit:
To avoid varying brute-force break times, shouldn't I use many rounds with XOR feedback?
I just came up with this algorithm (for a symmetric block cipher with equal value and key length)... maybe it's nonsense:
round 1
    create a zero block
    create random-block-1
    encipher value: zero block with key: random-block-1 => gives lock-output-1
round 2
    create a zero block
    create random-block-2
    encipher value: zero block with key: random-block-2 => gives temp
    XOR temp with random-block-1 => gives lock-output-2
and so on
The XOR with random-block-1 is there so that the unlock routine has to find random-block-1 before it can start brute-forcing lock-output-2.
lock-output-1 + lock-output-2 ... lock-output-N would be the complete lock output. When the unlock routine has found the N key blocks that each decrypt their lock-output block to zero, it can use the N key blocks as a whole to decipher the actual data.
Then I'd also need a formula to calculate how many rounds would give a maximum variation of, e.g., 10% around the wanted number of CPU hours.
I guess a similar algorithm must already exist out there.
The concept is called timed commitment, as defined by Boneh and Naor. The data you want to encrypt is said to be committed by one party (which I call the sender), such that another party (the receiver) may, at some tunable cost, recover the data.
The method described by Boneh and Naor is considerably more advanced than what you suggest. Their timed commitment scheme has the three following properties:
Verifiable recovery: the sender is able to convince the receiver that he really did commit a proper value which the receiver will be able to recover by applying a substantial but feasible amount of CPU muscle to it.
Recovery with proof: once the recovery has been done, it can be verified efficiently: a third party wishing to check that the recovered value is indeed the one that was committed can do so without applying hours of CPU to it.
Immunity against parallel attacks: the recovery process cannot benefit from having access to a thousand PC: one cannot go much faster than what can be done with a single CPU.
With these properties, a timed commitment becomes a worthwhile tool in some situations; Boneh and Naor mainly discuss contract signing, but also honesty preserving auctions and a few other applications.
I am not aware of any actual implementation or even a defined protocol for timed commitments, beyond the mathematical description by Boneh and Naor.
You can encrypt it normally, and release just enough information about the key such that a brute force attack will take X CPU hours.
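A toy sketch of this idea, using a throwaway XOR stream cipher derived from SHA-256 (illustration only; `keystream_xor`, the crib, and the choice to withhold 2 key bytes are my inventions, and a real deployment would use a proper cipher):

```python
import hashlib
import itertools

def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256-derived keystream.
    (Demo only -- use a real cipher in practice.)"""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

secret = b"meet at dawn"
key = b"\x07\xf3" + b"public-part-of-key"      # the full key

# Release everything except the first 2 bytes: ~2^16 guesses to recover.
ciphertext = keystream_xor(secret, key)
known_suffix = key[2:]
crib = b"meet"            # recoverer knows a plaintext prefix to test against

def brute_force(ciphertext, known_suffix, crib):
    # Try every value of the withheld bytes until the crib matches.
    for guess in itertools.product(range(256), repeat=2):
        candidate = bytes(guess) + known_suffix
        if keystream_xor(ciphertext, candidate)[:len(crib)] == crib:
            return candidate
    return None

assert brute_force(ciphertext, known_suffix, crib) == key
```

Withholding n bits gives roughly 2^(n-1) expected trial decryptions, so the work factor is tunable; note, though, that the guesses are independent, so this scheme is exactly the kind that parallel hardware attacks well.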
No, you can't do that reliably, because:
the attacker could rent a powerful computer (a computing cloud, for example) and use it for a highly parallel, much faster attack
so far, computers keep getting faster as time passes: what takes a day today might take one minute in two years
Well, to know the number of CPU hours for any kind of decryption, it does not really matter how the encryption takes place. Instead you would have to know:
which decryption algorithm the decrypter will use (perhaps one not yet invented?)
which implementation of that algorithm he will use
which CPU/hardware he will use
Each of these three parameters can make a difference in speed of a factor of 1000 or more.
An encryption algorithm is considered cracked when someone finds a way to recover the password faster than a brute-force attack (on average).
That is the case for some algorithms, like MD5, so make sure you pick an algorithm that isn't cracked (yet).
Other algorithms, even if they are not cracked, are still vulnerable to brute-force attacks... it might take a while, but everything that is encrypted can be decrypted; it's only a question of time and resources.
If a guy has a huge zombie computer farm working for him around the world, it might take a few hours to crack something that would take years for a guy with a single laptop.
If you want maximum security, you can combine a couple of existing encryption algorithms with a custom algorithm of your own. Someone can still try to crack your data, but unless you are dealing with top-secret national data, it will most likely never happen.
It is relative: how fast a computer can decrypt depends on its computing power, and the algorithm you select to encrypt depends on the data you want to protect. With a good encryption algorithm, an average computer takes a long time to break the encryption, because there is always a price for good things. I recommend elliptic-curve cryptography, because it is strong and the time needed to break it is very good; you can take a look at it.
That is what I can say about it.

How can I make my applications scale well?

In general, what kinds of design decisions help an application scale well?
(Note: Having just learned about Big O Notation, I'm looking to gather more principles of programming here. I've attempted to explain Big O Notation by answering my own question below, but I want the community to improve both this question and the answers.)
Responses so far
1) Define scaling. Do you need to scale for lots of users, traffic, objects in a virtual environment?
2) Look at your algorithms. Will the amount of work they do scale linearly with the actual amount of work - i.e. number of items to loop through, number of users, etc?
3) Look at your hardware. Is your application designed such that you can run it on multiple machines if one can't keep up?
Secondary thoughts
1) Don't optimize too much too soon - test first. Maybe bottlenecks will happen in unforeseen places.
2) Maybe the need to scale will not outpace Moore's Law, and maybe upgrading hardware will be cheaper than refactoring.
The only thing I would say is write your application so that it can be deployed on a cluster from the very start. Anything above that is a premature optimisation. Your first job should be getting enough users to have a scaling problem.
Build the code as simple as you can first, then profile the system second and optimise only when there is an obvious performance problem.
Often the figures from profiling your code are counter-intuitive; the bottle-necks tend to reside in modules you didn't think would be slow. Data is king when it comes to optimisation. If you optimise the parts you think will be slow, you will often optimise the wrong things.
Ok, so you've hit on a key point in using the "big O notation". That's one dimension that can certainly bite you in the rear if you're not paying attention. There are also other dimensions at play that some folks don't see through the "big O" glasses (but if you look closer they really are).
A simple example of that dimension is a database join. There are "best practices" in constructing, say, a left join which will help the SQL execute more efficiently. If you work through the relational calculus, or even just look at an explain plan (Oracle), you can easily see which indexes are being used in which order and whether any table scans or nested operations are occurring.
The concept of profiling is also key. You have to be instrumented thoroughly and at the right granularity across all the moving parts of the architecture in order to identify and fix any inefficiencies. Say for example you're building a 3-tier, multi-threaded, MVC2 web-based application with liberal use of AJAX and client side processing along with an OR Mapper between your app and the DB. A simplistic linear single request/response flow looks like:
browser -> web server -> app server -> DB -> app server -> XSLT -> web server -> browser JS engine execution & rendering
You should have some method for measuring performance (response times, throughput measured in "stuff per unit time", etc.) in each of those distinct areas, not only at the box and OS level (CPU, memory, disk i/o, etc.), but specific to each tier's service. So on the web server you'll need to know all the counters for the web server you're using. In the app tier, you'll need that plus visibility into whatever virtual machine you're using (jvm, clr, whatever). Most OR mappers manifest inside the virtual machine, so make sure you're paying attention to all the specifics if they're visible to you at that layer. Inside the DB, you'll need to know everything that's being executed and all the specific tuning parameters for your flavor of DB. If you have big bucks, BMC Patrol is a pretty good bet for most of it (with appropriate knowledge modules (KMs)). At the cheap end, you can certainly roll your own but your mileage will vary based on your depth of expertise.
Presuming everything is synchronous (no queue-based things going on that you need to wait for), there are tons of opportunities for performance and/or scalability issues. But since your post is about scalability, let's ignore the browser except for any remote XHR calls that will invoke another request/response from the web server.
So given this problem domain, what decisions could you make to help with scalability?
Connection handling. This is also bound to session management and authentication. That has to be as clean and lightweight as possible without compromising security. The metric is maximum connections per unit time.
Session failover at each tier. Necessary or not? We assume that each tier will be a cluster of boxes horizontally under some load balancing mechanism. Load balancing is typically very lightweight, but some implementations of session failover can be heavier than desired. Also whether you're running with sticky sessions can impact your options deeper in the architecture. You also have to decide whether to tie a web server to a specific app server or not. In the .NET remoting world, it's probably easier to tether them together. If you use the Microsoft stack, it may be more scalable to do 2-tier (skip the remoting), but you have to make a substantial security tradeoff. On the java side, I've always seen it at least 3-tier. No reason to do it otherwise.
Object hierarchy. Inside the app, you need the cleanest possible, lightest weight object structure possible. Only bring the data you need when you need it. Viciously excise any unnecessary or superfluous getting of data.
OR mapper inefficiencies. There is an impedance mismatch between object design and relational design. The many-to-many construct in an RDBMS is in direct conflict with object hierarchies (person.address vs. location.resident). The more complex your data structures, the less efficient your OR mapper will be. At some point you may have to cut bait in a one-off situation and do a more...uh...primitive data access approach (Stored Procedure + Data Access Layer) in order to squeeze more performance or scalability out of a particularly ugly module. Understand the cost involved and make it a conscious decision.
XSL transforms. XML is a wonderful, normalized mechanism for data transport, but man can it be a huge performance dog! Depending on how much data you're carrying around with you and which parser you choose and how complex your structure is, you could easily paint yourself into a very dark corner with XSLT. Yes, academically it's a brilliantly clean way of doing a presentation layer, but in the real world there can be catastrophic performance issues if you don't pay particular attention to this. I've seen a system consume over 30% of transaction time just in XSLT. Not pretty if you're trying to ramp up 4x the user base without buying additional boxes.
Can you buy your way out of a scalability jam? Absolutely. I've watched it happen more times than I'd like to admit. Moore's Law (as you already mentioned) is still valid today. Have some extra cash handy just in case.
Caching is a great tool to reduce the strain on the engine (increasing speed and throughput is a handy side-effect). It comes at a cost though in terms of memory footprint and complexity in invalidating the cache when it's stale. My decision would be to start completely clean and slowly add caching only where you decide it's useful to you. Too many times the complexities are underestimated and what started out as a way to fix performance problems turns out to cause functional problems. Also, back to the data usage comment. If you're creating gigabytes worth of objects every minute, it doesn't matter if you cache or not. You'll quickly max out your memory footprint and garbage collection will ruin your day. So I guess the takeaway is to make sure you understand exactly what's going on inside your virtual machine (object creation, destruction, GCs, etc.) so that you can make the best possible decisions.
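As a concrete instance of the "start clean, add caching only where it pays off" advice above, here is a minimal time-based cache sketch (the class and method names are mine). Expiry by TTL trades a bounded amount of staleness for not having to solve explicit invalidation at all:

```python
import time

class TTLCache:
    """Minimal time-based cache: entries expire after ttl seconds,
    sidestepping explicit invalidation at the cost of staleness."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}          # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            return entry[1]                  # fresh hit
        value = compute()                    # miss or stale: recompute
        self._store[key] = (now + self.ttl, value)
        return value

calls = 0
def expensive():
    global calls
    calls += 1
    return "rendered page"

cache = TTLCache(ttl=60.0)
cache.get_or_compute("home", expensive)
cache.get_or_compute("home", expensive)   # second call is served from cache
print(calls)                              # 1
```

Even a sketch this small shows the memory-footprint caveat from the paragraph above: nothing here evicts entries before they expire, so an unbounded key space will grow the dict without limit.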
Sorry for the verbosity. Just got rolling and forgot to look up. Hope some of this touches on the spirit of your inquiry and isn't too rudimentary a conversation.
Well, there's this blog called High Scalability that contains a lot of information on this topic. Some useful stuff.
Often the most effective way to do this is a well-thought-through design in which scaling is a part from the start.
Decide what scaling actually means for your project. Is it an infinite number of users? Is it being able to handle a Slashdotting of your website? Is it development cycles?
Use this to focus your development efforts.
Jeff and Joel discuss scaling in the Stack Overflow Podcast #19.
FWIW, most systems will scale most effectively by ignoring this until it's a problem. Moore's law is still holding, and unless your traffic is growing faster than Moore's law does, it's usually cheaper to just buy a bigger box (at $2K or $3K a pop) than to pay developers.
That said, the most important place to focus is your data tier; that is the hardest part of your application to scale out, as it usually needs to be authoritative, and clustered commercial databases are very expensive- the open source variations are usually very tricky to get right.
If you think there is a high likelihood that your application will need to scale, it may be wise to look into systems like memcached or MapReduce relatively early in your development.
One good idea is to determine how much work each additional task creates. This can depend on how the algorithm is structured.
For example, imagine you have some virtual cars in a city. At any moment, you want each car to have a map showing where all the cars are.
One way to approach this would be:
for each car {
    determine my position;
    for each car {
        add my position to this car's map;
    }
}
This seems straightforward: look at the first car's position, add it to the map of every other car. Then look at the second car's position, add it to the map of every other car. Etc.
But there is a scalability problem. When there are 2 cars, this strategy takes 4 "add my position" steps; when there are 3 cars, it takes 9 steps. For each "position update," you have to cycle through the whole list of cars, and every car needs its position updated.
Ignoring how many other things must be done to each car (for example, it may take a fixed number of steps to calculate the position of an individual car), for N cars it takes N² "visits to cars" to run this algorithm. This is no problem when you've got 5 cars and 25 steps. But as you add cars, you will see the system bog down: 100 cars will take 10,000 steps, and 101 cars will take 10,201 steps!
A better approach would be to undo the nesting of the for loops.
for each car {
    add my position to a list;
}
for each car {
    give me an updated copy of the master list;
}
With this strategy, the number of steps is a multiple of N, not of N². So 100 cars will take 100 times the work of 1 car, NOT 10,000 times.
This concept is sometimes expressed in "big O notation": the number of steps needed is "big O of N" or "big O of N²."
Note that this concept is only concerned with scalability, not with optimizing the number of steps per car. Here we don't care if it takes 5 steps or 50 steps per car; the main thing is that N cars take (X * N) steps, not (X * N²).
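The two strategies can be sketched as simple step counters, confirming the N² vs. N growth:

```python
def update_nested(cars):
    """O(N^2): every car pushes its position into every car's map."""
    steps = 0
    for pos in cars:
        for car_map in cars:      # conceptually, "each car's map"
            steps += 1            # one "add my position" per pair
    return steps

def update_shared(cars):
    """O(N): build one master list, then hand each car a reference."""
    steps = 0
    master = []
    for pos in cars:
        master.append(pos)        # one append per car
        steps += 1
    for pos in cars:
        steps += 1                # each car grabs the shared list once
    return steps

assert update_nested(range(100)) == 10_000   # the 10,000-step case above
assert update_shared(range(100)) == 200      # a multiple of N, not N^2
```

Doubling the fleet quadruples the nested version's work but only doubles the shared-list version's, which is the whole difference between O(N²) and O(N).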
