How much faster is EBS than S3? - performance

I have been looking but haven't found the answer to this simple question: In general, how much faster is AWS EBS than S3?
I'm looking for a general ballpark answer in terms of "X times faster", "Y orders of magnitude faster", or a range of "somewhere # to # times faster depending on the specific use case". I'll even take answers that give different use cases as long as there's actual relative performance NUMBERS associated with them.
And PLEASE, I do NOT want this thread to devolve into an architectural discussion of various other solutions I might try (e.g., DynamoDB, ElastiCache, RDS, etc) or the limitations imposed by whatever compute solution I choose (e.g. EC2, Lambda, ECS, etc). Nor do I find "why would you want to do that?" counter-questions helpful no matter how often they appear on StackOverflow.
I'm just looking for "How much faster is EBS than S3?" because I haven't the slightest clue about it right now and haven't found a resource that gives me the type of answer I'm looking for.
Yes, yes, I know the real answer HAS to be "it depends" because determining the answer is infinitely more nuanced than my question.
I know that generally EBS is faster and that how much faster will depend on all sorts of things, like drive type, PIOPs, network speed, etc. all related to a specific use case. But surely there's a general rule of thumb to help choose between the two when evaluating a system's design cost/benefit tradeoffs. ("No there isn't, and stop calling me Shirley.")
If you need to know why I'm asking, let's say I'm just curious what the general speed difference is in case I ever want to decide between them when standing up a cheap-as-heck web site with a dirt-simple data store that is either on EBS or S3. (Again, I'm NOT looking for design options).
Thanks

For the edge case where you are storing a file as an immutable data file and need to choose between S3 object store and EBS-based file system, both are valid options if you stretch your definition of validity.
To answer the question: EBS is faster (20x?) than S3, but more expensive and bound to a single EC2 instance. S3 is slower but cheaper and more widely accessible to multiple resources. S3 can be made much faster with some additional components (see @Mark B's comments above for more details).
You (and I) will want to use S3.
Additional Info
For immutable data files, S3 makes more sense in all normal cases: it's fast enough, cheaper, durable, and its contents are more available to multiple parallel processes, making it better for HA solutions.
EBS could make sense if you are hypersensitive to speed (and can't achieve the speeds with S3 plus other components) and not to cost or high availability. EBS (or its LAN-like equivalent, EFS) is the only choice (out of the 2 being explored) if your file is going to be frequently updated (e.g., random access files for an RDBMS) or used by an application that requires a file system to be in place.
(Thanks @Mark B for your patience, and @Jarmod for the objective data)
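Since the "20x?" above is a guess, here is a minimal sketch of how the ratio could be measured for a specific workload. It assumes you supply two callables of your own, one that performs a single read from the EBS-backed file system and one that performs a single read from S3; neither is shown here, and the names median_seconds and speedup are made up for illustration.

```python
import statistics
import time


def median_seconds(read_once, repeats=20, clock=time.perf_counter):
    """Median wall-clock time of calling read_once() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = clock()
        read_once()
        samples.append(clock() - start)
    return statistics.median(samples)


def speedup(fast_read, slow_read, repeats=20, clock=time.perf_counter):
    """How many times faster fast_read is than slow_read (ratio of medians)."""
    slow = median_seconds(slow_read, repeats, clock)
    fast = median_seconds(fast_read, repeats, clock)
    return slow / fast
```

The median is used rather than the mean so that one slow outlier (a cold cache, a network retry) does not dominate the result. The number you get only describes the object size and access pattern you tested.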

Related

Does a machine learning algorithm copy the data it learns from?

I am not a programmer, rather a law student, but I am currently researching for a project involving artificial intelligence and copyright law. I am currently looking at whether the learning process of a machine learning algorithm may be copyright infringement if a protected work is used by the algorithm. However, this relies on whether or not the algorithm copies the work or does something else.
Can anyone tell me whether machine learning algorithms typically copy the data (picture/text/video/etc.) they are analysing (even if only briefly) or if they are able to obtain the required information from the data through other methods that do not require copying (akin to a human looking at a stop sign and recognising it as a stop sign without necessarily copying the image).
Apologies for my lack of knowledge and I'm sorry if any of my explanation flies in the face of any established machine learning knowledge. As I said, I am merely a lowly law student.
Thanks in advance!
A few machine learning algorithms actually retain a copy of the training set, for example k-nearest neighbours. See https://en.wikipedia.org/wiki/Instance-based_learning. Not all do this; in fact it is usually regarded as a disadvantage, because the training set can be large.
Also, computers are built around a number of different stores of data of different sizes and speeds. They usually copy data they are working on to small fast stores while they are working on it, because the larger stores take much longer to read and write. One of many possible examples of this has been the subject of legal wrangling of which I know little - see e.g. https://law.stackexchange.com/questions/2223/why-does-browser-cache-not-count-as-copyright-infringement and others for browser cache copyright. If a computer has added two numbers, it will certainly have stored them in its internal memory. It is very likely that it will have stored at least one of them in what are called internal registers - very small, very fast memory intended for storing numbers to be worked on.
If a computer (or any other piece of electronic equipment) has been used to process classified data, it is usual to treat it as classified from then on, making the worst case assumption that it might have retained some copy of any of the data it has been used to process, even if retrieving that data from it would in practice require a great deal of specialised expertise with specialised equipment.
Typically, no. The first thing that typical ML algorithms do with their inputs is not to copy or store it, but to compute something based on it and then forget the original. And this is a fair description of what neural networks, regression algorithms and statistical methods do. There is no 'eidetic memory' in mainstream ML. I imagine anything doing that would be marketed as a database or a full text indexing engine or somesuch.
But how will you present your data to an algorithm running on a machine without first copying the data to that machine?
Does a machine learning algorithm copy the data it learns from?
There are many different machine learning algorithms. If you are talking about k nearest neighbor (k-NN) then the answer is simply yes.
However, k-NN is rarely used. Most (all?) other models are not that simple. Usually, a machine learning developer wants the training data to be compressed (a lot, lossy) by the model for several reasons: (1) The amount of training data is large (many GB), (2) Generalization might be better if the training data is compressed (3) inference of new examples might take really long if the data is not compressed. (By "compress", I mean that the relevant information for the task is extracted and irrelevant data is removed. Not compression in the usual sense.)
For models other than k-NN, the answer is more complicated. It depends on what you consider a "copy". For example, from artificial neural networks (especially the sub-type of convolutional neural networks, short: CNNs) the training data can partially be restored. Those models are state of the art for many (all?) computer vision tasks.
I could not find papers which show that you can (partially) restore / extract training data from CNNs with the focus on possible privacy / copyright problems, but I'm ~70% certain I have read an abstract about this problem. I think I've also heard a talk where a researcher said this was a problem when building a detector for child pornography. However, I don't think that was recorded or anything published about this.
Here are two papers which indicate that restoring training data from CNNs might be possible:
Understanding deep learning requires rethinking generalization
Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images and the Zeiler & Fergus paper
It depends on what you mean by the word "copy". If you run any program, it will copy the data from the hard disk to RAM for processing. I am assuming this is not what you meant.
So let's say you have the copyrighted data in a particular machine and you run your machine learning algorithms on the data, then there is no reason for the algorithm to copy the data out of the machine.
On the other hand, if you use a cloud ML service (AWS/IBM Bluemix/Azure), then you need to upload the data to the cloud before you can run ML algorithms. This would mean you are copying the data.
Hopefully this sheds more light!
Lowly ML student
Some algorithms, such as k-NN, do copy the data set. Unfortunately, such algorithms are not commonly used in practice because they don't scale to large data sets.
Most ML algorithms use the data set to identify a pattern, that's why pattern recognition is another name for machine learning. The pattern is almost always much smaller (in terms of memory and variables etc) than the original data set.
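To make the distinction in these answers concrete, here is a small sketch in plain Python (no ML library) of two toy learners for one-dimensional data. The class names are made up for illustration. The first is a 1-nearest-neighbour model, which keeps every training example; the second fits a straight line by least squares and keeps only two numbers.

```python
class NearestNeighbour:
    """1-NN: the 'model' is a verbatim copy of the training set."""

    def fit(self, xs, ys):
        self.examples = list(zip(xs, ys))  # every example is retained
        return self

    def predict(self, x):
        # Return the label of the closest stored example.
        return min(self.examples, key=lambda e: abs(e[0] - x))[1]


class LeastSquaresLine:
    """Line fit: the 'model' is just a slope and an intercept."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self  # xs and ys are not stored anywhere

    def predict(self, x):
        return self.slope * x + self.intercept
```

After fitting, vars(NearestNeighbour().fit(xs, ys)) contains the whole data set, while vars(LeastSquaresLine().fit(xs, ys)) contains only slope and intercept. Both still had to read the data from memory to do the fitting, which is the other sense of "copy" discussed above.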

Image storing/tagging solution

We are creating a site where users will upload images that are classifiable and searchable.
My question concerns storing the images: what would make a solid, maintainable solution?
I've looked at S3 - it looks promising.
If S3 is a good option, where would I store the references to the objects (along with the metadata/tags)?
Thanks :)
If I were architecting such a system, I would certainly look no further than S3 for scalability and durability for actually storing the images -- and thumbnails -- and metadata, to some extent.
S3 metadata storage is limited to 2KB (total number of bytes of all keys and all values combined), is limited to US-ASCII, and is not indexed -- you have to fetch the metadata for the specific object. For many applications, this is entirely sufficient but that's very doubtful in your case.
http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-metadata
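As a rough illustration of the limit described above, this sketch checks whether a set of tags would fit in S3 user-defined metadata before you try to store them there. It does not call AWS at all; the function names are made up, and the 2 KB and US-ASCII rules are taken from the answer above.

```python
S3_USER_METADATA_LIMIT_BYTES = 2 * 1024  # limit as stated in the answer above


def user_metadata_size(metadata):
    """Total bytes of all keys and all values combined (UTF-8 encoded)."""
    return sum(
        len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in metadata.items()
    )


def fits_in_s3_metadata(metadata):
    """True if every key/value is US-ASCII and the total is within the limit."""
    try:
        for key, value in metadata.items():
            key.encode("ascii")
            value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return user_metadata_size(metadata) <= S3_USER_METADATA_LIMIT_BYTES
```

For example, fits_in_s3_metadata({"tags": "cat,beach,sunset"}) is true, but a free-text description or a long tag list will run out of room quickly, which is why a separate data store is suggested below.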
So, the question "is S3 a good option" is easily answered: if you mean among AWS services, the answer is yes; it's difficult to argue that it isn't the best fit.
You may also consider CloudFront -- not instead of, but in addition to S3. It can improve load times by caching your "popular" content closer to where users are located, among other things.
Where to store the references to the objects goes off into the land of "opinion based," which we don't do on Stack Overflow. The answer is, of course, "in a database," but AWS has options here.
I'm a relational database DBA, so of course, my inclination is that everything should have a relational database (such as RDS) as its authoritative data store, while others would probably say the DynamoDB NoSQL database offering would be a useful data store.
From there (wherever "there" is), CloudSearch could be populated with the metadata, keywords, etc., for processing the actual search operations, using indexes it builds which are potentially better suited to search-intensive operations than proper databases. I would not, however, try to use CloudSearch as the authoritative store of all your valuable metadata. Search indexes should be treated as disposable, rebuildable assets... although I fear even that statement might strike some as being opinion-based.
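Here is a minimal sketch of the "authoritative database plus disposable search index" idea, using an in-memory SQLite database as a stand-in for RDS and a plain dictionary as a stand-in for the search service. The table names, column names and functions are all assumptions made for illustration, not anything AWS prescribes.

```python
import sqlite3
from collections import defaultdict


def create_store():
    """Authoritative store: one row per image, one row per (image, tag)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE image (
            id     INTEGER PRIMARY KEY,
            s3_key TEXT NOT NULL UNIQUE      -- reference to the S3 object
        );
        CREATE TABLE image_tag (
            image_id INTEGER NOT NULL REFERENCES image(id),
            tag      TEXT NOT NULL,
            PRIMARY KEY (image_id, tag)
        );
        """
    )
    return db


def add_image(db, s3_key, tags):
    cur = db.execute("INSERT INTO image (s3_key) VALUES (?)", (s3_key,))
    image_id = cur.lastrowid
    db.executemany(
        "INSERT INTO image_tag (image_id, tag) VALUES (?, ?)",
        [(image_id, tag) for tag in sorted(set(tags))],
    )
    return image_id


def rebuild_search_index(db):
    """Disposable index: tag -> set of S3 keys, rebuilt entirely from the DB."""
    index = defaultdict(set)
    rows = db.execute(
        "SELECT image.s3_key, image_tag.tag "
        "FROM image JOIN image_tag ON image_tag.image_id = image.id"
    )
    for s3_key, tag in rows:
        index[tag].add(s3_key)
    return index
```

If the index is lost or its format changes, calling rebuild_search_index again recreates it, whereas nothing can recreate the tags if the database is lost.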
One thing that isn't a matter of opinion is that all of these various cloud services allow you to spin up a substantial proof-of-concept infrastructure at costs that are so low as to have been unimaginable just a few years ago... so you can try them, play with them, and throw them away if they don't do what you expect. You don't have to buy before you try.

Distributed algorithm design

I've been reading Introduction to Algorithms and started to get a few ideas and questions popping up in my head. The one that's baffled me most is how you would approach designing an algorithm to schedule items/messages in a queue that is distributed.
My thoughts have led me to browsing Wikipedia on topics such as sorting, message queues, scheduling, and distributed hash tables, to name a few.
The scenario:
Say you wanted to have a system that queued messages (strings or some serialized object, for example). A key feature of this system is to avoid any single point of failure. The system has to be distributed across multiple nodes within some cluster and has to consistently (or as best as possible) even out the workload of each node within the cluster to avoid hotspots.
You want to avoid the use of a master/slave design for replication and scaling (no single point of failure). The system totally avoids writing to disc and maintains in memory data structures.
Since this is meant to be a queue of some sort, the system should be able to use varying scheduling algorithms (FIFO, earliest deadline, round robin, etc.) to determine which message should be returned on the next request, regardless of which node in the cluster the request is made to.
My initial thoughts
I can imagine how this would work on a single machine, but when I start thinking about how you'd distribute something like this, questions come up like:
How would I hash each message?
How would I know which node a message was sent to?
How would I schedule each item so that I can determine which message and from which node should be returned next?
I started reading about distributed hash tables and how projects like Apache Cassandra use some sort of consistent hashing to distribute data but then I thought, since the query won't supply a key I need to know where the next item is and just supply it...
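For reference, the consistent hashing idea mentioned above can be sketched as follows. This is a toy ring with virtual nodes, not how Cassandra actually implements it; the class name, the replicas parameter and the use of SHA-256 are assumptions for illustration. It only answers "which node owns the message with this ID"; it does not solve the problem of finding the next message when the caller supplies no key.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring `replicas` times to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        """The first node clockwise from the key's position owns the key."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property is that removing a node only moves the keys that node owned; every other key keeps its owner, so the cluster does not have to reshuffle everything when membership changes.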
This led into reading about peer-to-peer protocols and how they approach the synchronization problem across nodes.
So my question is, how would you approach a problem like the one described above, or is this too far-fetched and simply a stupid idea...?
Just an overview, pointers, different approaches, pitfalls and benefits of each. The technologies/concepts/design/theory that may be appropriate. Basically anything that could be of use in understanding how something like this may work.
And if you're wondering, no, I'm not intending to implement anything like this; it just popped into my head while reading (it happens, I get distracted by wild ideas when I read a good book).
UPDATE
Another interesting point that would become an issue is distributed deletes. I know systems like Cassandra have tackled this by implementing Hinted Handoff, Read Repair and Anti-Entropy, and it seems to work well, but are there any other (viable and efficient) means of tackling this?
Overview, as you wanted
There are some popular techniques for distributed algorithms, e.g. using clocks, waves or general purpose routing algorithms.
You can find these in the great distributed algorithm books Introduction to distributed algorithms by Tel and Distributed Algorithms by Lynch.
Reductions
Reductions are particularly useful since general distributed algorithms can become quite complex. You might be able to use a reduction to a simpler, more specific case.
If, for instance, you want to avoid having a single point of failure, but a symmetric distributed algorithm is too complex, you can use the standard distributed algorithm of (leader) election and afterwards use a simpler asymmetric algorithm, i.e. one which can make use of a master.
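As an illustration of the election step, here is a single-process simulation of one classic ring election (Chang-Roberts style, where the highest ID wins). It is a sketch under strong assumptions: nodes sit on a one-way ring, IDs are unique, no messages are lost and no node fails during the election. The function name and return shape are made up.

```python
from collections import deque


def elect_leader(ids):
    """Simulate a ring election; return (leader_id, messages_sent)."""
    n = len(ids)
    # Every node starts by sending its own ID to its clockwise neighbour.
    # A message is (index_of_receiving_node, candidate_id).
    messages = deque(((i + 1) % n, ids[i]) for i in range(n))
    sent = n
    while messages:
        receiver, candidate = messages.popleft()
        if candidate == ids[receiver]:
            # The ID travelled all the way round: this node is the leader.
            return candidate, sent
        if candidate > ids[receiver]:
            # Forward larger IDs; smaller ones are silently dropped.
            messages.append(((receiver + 1) % n, candidate))
            sent += 1
    raise ValueError("ids must be non-empty and unique")
```

Once a leader is known, the simpler asymmetric (master-based) algorithm can run, and the election can be repeated if the leader disappears, so the master is not a permanent single point of failure.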
Similarly, you can use synchronizers to run an algorithm designed for a synchronous network model on an asynchronous one.
You can use snapshots to be able to analyze offline instead of having to deal with varying online process states.

Ok to use memcache in this way? or need a system re-architecture?

I have a "score" I need to calculate for multiple items for multiple users. Each user has many, many scores unique to them, and calculating them can be time/processor intensive (the slowness isn't on the database end). To deal with this, I'm making extensive use of memcached. Without memcache some pages would take 10 seconds to load! Memcache seems to work well because the scores are very small pieces of information, but take a while to compute. I'm actually setting the key to never expire, and then I delete them on the occasional circumstances the score changes.
I'm entering a new phase on this product, and am considering re-architecting the whole thing. There seems to be a way I can calculate the values iteratively, and then store them in a local field. It'll be a bit similar to what's happening now, just the value updates will happen faster, and the cache will be in the real database, and managing it will be a bit more work (I think I'd still use memcache on top of that though).
If it matters, it's all in Python/Django.
Is depending on the cache like this bad practice? Is it OK? Why? Should I try and re-architect things?
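For readers unfamiliar with the pattern described here (compute on a miss, never expire, delete when the score changes), this is a minimal sketch. A plain dictionary stands in for memcached, and the class name, key format and compute_score callable are assumptions for illustration rather than the poster's actual code.

```python
class ScoreCache:
    """Cache-aside with explicit invalidation instead of expiry."""

    def __init__(self, compute_score, store=None):
        self._compute = compute_score              # the slow calculation
        self._store = {} if store is None else store  # stand-in for memcached

    @staticmethod
    def _key(user_id, item_id):
        return f"score:{user_id}:{item_id}"

    def get(self, user_id, item_id):
        key = self._key(user_id, item_id)
        if key not in self._store:
            # Cache miss: pay the computation cost once, then keep the result.
            self._store[key] = self._compute(user_id, item_id)
        return self._store[key]

    def invalidate(self, user_id, item_id):
        # Called on the occasional circumstances where the score changes.
        self._store.pop(self._key(user_id, item_id), None)
```

The weak point of the pattern is visible in the sketch: if the store is emptied (for example a memcached restart), every score is recomputed on its next read.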
If it ain't broke...don't fix it ;^) It seems your method is working, so I'd say stick with it. You might look at memcachedb (or Tokyo Cabinet), which is a persistent version of memcache. This way, when the memcache machine crashes and reboots, it doesn't have to recalc all values.
You're applying several architectural patterns here, and each of them certainly has a place. There's not enough information here for me to evaluate whether your current solution needs rearchitecting or whether your ideas will work. It does seem likely to me that as your understanding of the user's requirements grows you may want to improve things.
As always, prototype, measure performance, consider the trade off between complexity and performance - you don't need to be as fast as possible, just fast enough.
Caching in various forms is often the key to good performance. The question here is whether there's merit in persisting the calculated, cached values. If they're stable over time then this is often an effective strategy. Whether to persist the cache or make space for them in your database schema will probably depend upon the access patterns. If there are various query paths then a carefully designed database schema may be appropriate.
Rather than using memcached, try storing the computed score in the same place as your other data; this may be simpler and require fewer boxes.
Memcached is not necessarily the answer to everything; it's intended for systems which need to read-scale very highly. It sounds like in your case, it doesn't need to, it simply needs to be a bit more efficient.

How can I make my applications scale well?

In general, what kinds of design decisions help an application scale well?
(Note: Having just learned about Big O Notation, I'm looking to gather more principles of programming here. I've attempted to explain Big O Notation by answering my own question below, but I want the community to improve both this question and the answers.)
Responses so far
1) Define scaling. Do you need to scale for lots of users, traffic, objects in a virtual environment?
2) Look at your algorithms. Will the amount of work they do scale linearly with the actual amount of work - i.e. number of items to loop through, number of users, etc?
3) Look at your hardware. Is your application designed such that you can run it on multiple machines if one can't keep up?
Secondary thoughts
1) Don't optimize too much too soon - test first. Maybe bottlenecks will happen in unforeseen places.
2) Maybe the need to scale will not outpace Moore's Law, and maybe upgrading hardware will be cheaper than refactoring.
The only thing I would say is write your application so that it can be deployed on a cluster from the very start. Anything above that is a premature optimisation. Your first job should be getting enough users to have a scaling problem.
Build the code as simple as you can first, then profile the system second and optimise only when there is an obvious performance problem.
Often the figures from profiling your code are counter-intuitive; the bottle-necks tend to reside in modules you didn't think would be slow. Data is king when it comes to optimisation. If you optimise the parts you think will be slow, you will often optimise the wrong things.
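As a small illustration of "profile first", here is a sketch using Python's standard cProfile and pstats modules. The functions slow_sum and handle_request are invented stand-ins for real application code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately slow stand-in for an unexpectedly expensive module."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_request():
    """Stand-in for one unit of application work."""
    return slow_sum(200_000) + sum(range(1_000))


def profile(fn, top=5):
    """Run fn under the profiler and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Calling print(profile(handle_request)) lists the functions with the highest cumulative time, which is the data to look at before deciding what to optimise.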
Ok, so you've hit on a key point in using the "big O notation". That's one dimension that can certainly bite you in the rear if you're not paying attention. There are also other dimensions at play that some folks don't see through the "big O" glasses (but if you look closer they really are).
A simple example of that dimension is a database join. There are "best practices" in constructing, say, a left outer join which will help to make the SQL execute more efficiently. If you break down the relational calculus or even look at an explain plan (Oracle) you can easily see which indexes are being used in which order and if any table scans or nested operations are occurring.
The concept of profiling is also key. You have to be instrumented thoroughly and at the right granularity across all the moving parts of the architecture in order to identify and fix any inefficiencies. Say for example you're building a 3-tier, multi-threaded, MVC2 web-based application with liberal use of AJAX and client side processing along with an OR Mapper between your app and the DB. A simplistic linear single request/response flow looks like:
browser -> web server -> app server -> DB -> app server -> XSLT -> web server -> browser JS engine execution & rendering
You should have some method for measuring performance (response times, throughput measured in "stuff per unit time", etc.) in each of those distinct areas, not only at the box and OS level (CPU, memory, disk I/O, etc.), but specific to each tier's service. So on the web server you'll need to know all the counters for the web server you're using. In the app tier, you'll need that plus visibility into whatever virtual machine you're using (JVM, CLR, whatever). Most OR mappers manifest inside the virtual machine, so make sure you're paying attention to all the specifics if they're visible to you at that layer. Inside the DB, you'll need to know everything that's being executed and all the specific tuning parameters for your flavor of DB. If you have big bucks, BMC Patrol is a pretty good bet for most of it (with appropriate knowledge modules (KMs)). At the cheap end, you can certainly roll your own but your mileage will vary based on your depth of expertise.
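At the "roll your own" end, per-tier measurement can start as small as the sketch below: a context manager that records how long each named tier took. The class and tier names are hypothetical, and a real system would export these numbers to a monitoring tool rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TierTimer:
    """Collect elapsed-time samples per named tier."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, tier):
        start = self._clock()
        try:
            yield
        finally:
            # Record the sample even if the wrapped code raised.
            self.samples[tier].append(self._clock() - start)

    def totals(self):
        """Total seconds spent in each tier so far."""
        return {tier: sum(values) for tier, values in self.samples.items()}
```

Usage would look like "with timer.measure('db'): run_query()" around each hop of the request flow, after which timer.totals() shows where the time went.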
Presuming everything is synchronous (no queue-based things going on that you need to wait for), there are tons of opportunities for performance and/or scalability issues. But since your post is about scalability, let's ignore the browser except for any remote XHR calls that will invoke another request/response from the web server.
So given this problem domain, what decisions could you make to help with scalability?
Connection handling. This is also bound to session management and authentication. That has to be as clean and lightweight as possible without compromising security. The metric is maximum connections per unit time.
Session failover at each tier. Necessary or not? We assume that each tier will be a cluster of boxes horizontally under some load balancing mechanism. Load balancing is typically very lightweight, but some implementations of session failover can be heavier than desired. Also whether you're running with sticky sessions can impact your options deeper in the architecture. You also have to decide whether to tie a web server to a specific app server or not. In the .NET remoting world, it's probably easier to tether them together. If you use the Microsoft stack, it may be more scalable to do 2-tier (skip the remoting), but you have to make a substantial security tradeoff. On the java side, I've always seen it at least 3-tier. No reason to do it otherwise.
Object hierarchy. Inside the app, you need the cleanest possible, lightest weight object structure possible. Only bring the data you need when you need it. Viciously excise any unnecessary or superfluous getting of data.
OR mapper inefficiencies. There is an impedance mismatch between object design and relational design. The many-to-many construct in an RDBMS is in direct conflict with object hierarchies (person.address vs. location.resident). The more complex your data structures, the less efficient your OR mapper will be. At some point you may have to cut bait in a one-off situation and do a more...uh...primitive data access approach (Stored Procedure + Data Access Layer) in order to squeeze more performance or scalability out of a particularly ugly module. Understand the cost involved and make it a conscious decision.
XSL transforms. XML is a wonderful, normalized mechanism for data transport, but man can it be a huge performance dog! Depending on how much data you're carrying around with you and which parser you choose and how complex your structure is, you could easily paint yourself into a very dark corner with XSLT. Yes, academically it's a brilliantly clean way of doing a presentation layer, but in the real world there can be catastrophic performance issues if you don't pay particular attention to this. I've seen a system consume over 30% of transaction time just in XSLT. Not pretty if you're trying to ramp up 4x the user base without buying additional boxes.
Can you buy your way out of a scalability jam? Absolutely. I've watched it happen more times than I'd like to admit. Moore's Law (as you already mentioned) is still valid today. Have some extra cash handy just in case.
Caching is a great tool to reduce the strain on the engine (increasing speed and throughput is a handy side-effect). It comes at a cost though in terms of memory footprint and complexity in invalidating the cache when it's stale. My decision would be to start completely clean and slowly add caching only where you decide it's useful to you. Too many times the complexities are underestimated and what started out as a way to fix performance problems turns out to cause functional problems. Also, back to the data usage comment. If you're creating gigabytes worth of objects every minute, it doesn't matter if you cache or not. You'll quickly max out your memory footprint and garbage collection will ruin your day. So I guess the takeaway is to make sure you understand exactly what's going on inside your virtual machine (object creation, destruction, GCs, etc.) so that you can make the best possible decisions.
Sorry for the verbosity. Just got rolling and forgot to look up. Hope some of this touches on the spirit of your inquiry and isn't too rudimentary a conversation.
Well, there's this blog called High Scalability that contains a lot of information on this topic. Some useful stuff.
Often the most effective way to do this is by a well thought through design where scaling is a part of it.
Decide what scaling actually means for your project. Is it an unlimited number of users? Is it being able to handle a slashdotting on a website? Is it development cycles?
Use this to focus your development efforts
Jeff and Joel discuss scaling in the Stack Overflow Podcast #19.
FWIW, most systems will scale most effectively by ignoring this until it's a problem. Moore's law is still holding, and unless your traffic is growing faster than Moore's law does, it's usually cheaper to just buy a bigger box (at $2K or $3K a pop) than to pay developers.
That said, the most important place to focus is your data tier; that is the hardest part of your application to scale out, as it usually needs to be authoritative, and clustered commercial databases are very expensive, while the open source variations are usually very tricky to get right.
If you think there is a high likelihood that your application will need to scale, it may be intelligent to look into systems like memcached or map reduce relatively early in your development.
One good idea is to determine how much work each additional task creates. This can depend on how the algorithm is structured.
For example, imagine you have some virtual cars in a city. At any moment, you want each car to have a map showing where all the cars are.
One way to approach this would be:
for each car {
    determine my position;
    for each car {
        add my position to this car's map;
    }
}
This seems straightforward: look at the first car's position, add it to the map of every other car. Then look at the second car's position, add it to the map of every other car. Etc.
But there is a scalability problem. When there are 2 cars, this strategy takes 4 "add my position" steps; when there are 3 cars, it takes 9 steps. For each "position update," you have to cycle through the whole list of cars - and every car needs its position updated.
Ignoring how many other things must be done to each car (for example, it may take a fixed number of steps to calculate the position of an individual car), for N cars, it takes N^2 "visits to cars" to run this algorithm. This is no problem when you've got 5 cars and 25 steps. But as you add cars, you will see the system bog down. 100 cars will take 10,000 steps, and 101 cars will take 10,201 steps!
A better approach would be to undo the nesting of the for loops.
for each car {
    add my position to a list;
}
for each car {
    give me an updated copy of the master list;
}
With this strategy, the number of steps is a multiple of N, not of N^2. So 100 cars will take 100 times the work of 1 car - NOT 10,000 times the work.
This concept is sometimes expressed in "big O notation" - the number of steps needed is "big O of N" or "big O of N^2."
Note that this concept is only concerned with scalability - not optimizing the number of steps for each car. Here we don't care if it takes 5 steps or 50 steps per car - the main thing is that N cars take (X * N) steps, not (X * N^2).
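The two strategies above can be written in Python with a step counter to check the arithmetic. This is a sketch with made-up function names. One assumption worth stating: in the second version each car is handed a reference to the single master map rather than its own copy, because physically copying an N-entry map for each of N cars would bring back N^2 work.

```python
def update_maps_nested(positions):
    """Every car writes its position into every car's map: N * N steps."""
    maps = {car: {} for car in positions}
    steps = 0
    for car, position in positions.items():
        for other in maps:
            maps[other][car] = position
            steps += 1
    return maps, steps


def update_maps_shared(positions):
    """Build one master map, then hand it to each car: N + N steps."""
    master = {}
    steps = 0
    for car, position in positions.items():
        master[car] = position
        steps += 1
    maps = {}
    for car in positions:
        maps[car] = master  # shared reference, not a per-car copy
        steps += 1
    return maps, steps
```

With 100 cars the first function reports 10,000 steps and the second reports 200, and both leave every car with the same map contents.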

Resources