I am trying to build a web image server that serves images to many clients (10,000+) simultaneously. (It would be an easier problem if there were fewer clients.) What is a good way to do this with as little latency as possible?
I am new to this field. Any suggestions are welcome.
Definitely look around for a good delivery service. Akamai is the best known.
If you really want to do it on your own, forget about Apache/IIS. Much more appropriate are 'light' webservers. Two very good ones are lighttpd and nginx (wiki). Nginx in particular has really solid performance.
Edit: Content Distribution Networks (CDNs) have flourished in the last few years, and it's much easier now to find simple and cheap ones. In particular, it's quite simple to put your static content in Amazon S3 and serve it through CloudFront.
If you want to design the fastest static-file webserver with the lowest latency, here's how I would do it (a rough sketch follows the list):
Use an event loop to detect which sockets are ready
Put those sockets in a queue
Create a stack of threads to deal with sockets (1 for each core). When they finish, put them back on the stack.
Assign work to threads.
Cache all image files in memory.
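Here's a rough sketch of that design in Python (the document root, port, and buffer sizes are assumptions; a real server would also need proper HTTP parsing and keep-alive handling):

    import os, socket, selectors, threading, queue

    DOC_ROOT = "./images"            # assumed image directory
    NUM_WORKERS = os.cpu_count() or 4

    # Cache all image files in memory up front.
    cache = {}
    for name in os.listdir(DOC_ROOT):
        path = os.path.join(DOC_ROOT, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                cache["/" + name] = f.read()

    sel = selectors.DefaultSelector()
    ready = queue.Queue()            # queue of sockets that are ready to read

    def worker():
        # One worker per core pulls ready sockets off the queue and serves them.
        while True:
            conn = ready.get()
            try:
                request = conn.recv(4096).decode("latin-1")
                path = request.split(" ")[1] if " " in request else "/"
                body = cache.get(path)
                if body is None:
                    conn.sendall(b"HTTP/1.0 404 Not Found\r\n\r\n")
                else:
                    header = ("HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n"
                              % len(body)).encode()
                    conn.sendall(header + body)
            finally:
                conn.close()

    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 8080))
    server.listen(1024)
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    # Event loop: detect which sockets are ready and hand them to the workers.
    while True:
        for key, _ in sel.select():
            if key.fileobj is server:
                conn, _addr = server.accept()
                sel.register(conn, selectors.EVENT_READ)
            else:
                sel.unregister(key.fileobj)   # the worker owns the socket from here
                ready.put(key.fileobj)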
This is essentially the I/O completion port model, minus the caching of files. That model is available in Windows and Solaris.
http://technet.microsoft.com/en-us/sysinternals/bb963891.aspx
With that many clients, you may want to look into using a content delivery network (such as Akamai). On the surface it might seem expensive, but if you really look at the cost of maintaining the hardware and particularly the cost of bandwidth, it starts to make economic sense.
How are the images to be served? Are they generated on the fly, or are they static and stored as .jpg or another format on the file system?
Either way, I'd use ASP.NET .ashx (generic handlers) and the System.Drawing classes.
You'll also want to set up TCP/IP Network Load Balancing per http://support.microsoft.com/kb/323431
I know the title of my question is rather vague, so I'll try to clarify as much as I can. Please feel free to moderate this question to make it more useful for the community.
Given a standard LAMP stack with more or less default settings (a bit of tuning is allowed, client-side and server-side caching turned on), running on modern hardware (16 GB RAM, 8-core CPU, unlimited disk space, etc.), and deploying a reasonably complicated CMS service (a Drupal or WordPress project, for argument's sake) - what amounts of traffic, SQL queries, and user requests can I reasonably expect to accommodate before I have to start thinking about performance?
NOTE: I know that specifics will greatly depend on the details of the project, i.e. optimizing MySQL queries, indexing stuff, minimizing filesystem hits - assuming web developers did a professional job - I'm really looking for a very rough figure in terms of visits per day, traffic during peak visiting times, how many records before (transactional) MySQL fumbles, so on.
I know the only way to really answer my question is to run load testing on a real project, and I'm concerned that my question may be treated as partly off-topic.
I would like to get a set of figures from people with first-hand experience, e.g. "we ran such and such a set-up and it handled at least this much load [problems started surfacing after such and such]". I'm also greatly interested in any condensed (I'm short on time atm) reading I can do to get a better understanding of the matter.
P.S. I'm meeting a client tomorrow to talk about his project, and I want to be prepared to reason about performance if his project turns out to be akin to FourSquare.
Very tricky to answer without specifics, as you have noted. If I were tasked with what you have to do, I would take each component in turn (network interface, CPU/memory, physical I/O load, SMP locking, etc.), work out the maximum capacity available, and divide by a rough estimate of use per request.
For example, network I/O. You might have one 1 Gbit card, which might achieve maybe 100 MBytes/sec (I tend to use 80% of the theoretical max). How big will a typical 'hit' be? Perhaps 3 KB on average, for HTML, images, etc. That means you can achieve about 33k requests per second before you bottleneck at the physical level. These numbers are absolute maximums; depending on tools and skills you might not get anywhere near them, but nobody can exceed them.
Repeat the above for every component, perhaps varying your numbers a little, and you will build a quick picture of what is likely to be a concern. Then consider how you can quickly get more capacity in each component: can you just chuck $$ at it and gain more performance (e.g. use SSD drives instead of HDs)? Or will you hit a limit that cannot be moved without rearchitecting? Also take into account what resources you have available: do you have lots of skilled programmer time, DBAs, or wads of cash? If you have lots of a resource, you can reduce those constraints more easily and quickly as you move along the experience curve.
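If it helps, here is a tiny sketch of that back-of-the-envelope exercise in Python; every number in it is an assumption you would replace with your own measurements:

    # component: (usable capacity per second, cost per request)
    components = {
        "network_bytes": (0.8 * 1e9 / 8, 3_000),   # 80% of a 1 Gbit link, ~3 KB per hit
        "disk_iops":     (10_000, 4),              # assumed IOPS, reads per hit
        "cpu_ms":        (8 * 1000, 25),           # 8 cores, ~25 ms of CPU per hit
    }

    for name, (capacity, per_request) in components.items():
        print(f"{name}: ~{capacity / per_request:,.0f} requests/sec max")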
Do not forget external components too, firewalls may have limits that are lower than expected for sustained traffic.
Sorry I cannot give you real numbers; our workloads use custom servers, heavy in-memory caching and other tricks, and not the products you list. However, I would concentrate most on IO/SQL queries and possibly network IO, as these tend to be harder limits than CPU/memory, although I'm sure others will have a different opinion.
Obviously the question is one that does not have a "proper" answer, but I'd like to close it and give some feedback. The client meeting has taken place, performance was indeed a biggie, and their hosting platform turned out to be on the Amazon cloud :)
From research I've done independently:
Memcache is a must;
MySQL (or whatever persistent storage instance you're running) is usually the first to go. Solutions include running multiple virtual instances and replicating data between them, distributing the load;
http://highscalability.com/ is a good read :)
I would like to try testing NLP tools against dumps of the web and other corpora, sometimes larger than 4 TB.
If I run this on a mac it's very slow. What is the best way to speed up this process?
deploying to EC2/Heroku and scaling up servers
buying hardware and creating a local setup
I just want to know how this is usually done (processing terabytes in a matter of minutes/seconds), and whether it's cheaper/better to experiment with this in the cloud or whether I need my own hardware setup.
Regardless of the brand of your cloud, the whole idea of cloud computing is being able to scale up and scale down in a flexible way.
In a corporate environment you might have a scenario in which you consistently need the same amount of computing resources; if you already have them, it is hard to make a case for the cloud, because you just don't need the flexibility it provides.
On the other hand if your processing tasks are not quite predictable, your best solution is the cloud because you will be able to pay more when you use more computing power, and then pay less when you don't need as much power.
Take into account, though, that not all cloud solutions are the same. For instance, a web role is a highly web-dedicated node whose main purpose is to serve web requests: the more requests are served, the more you pay.
With a virtual machine role, on the other hand, it is almost as if you are given exclusive use of a computer system that you can use for anything you want, with either a Linux or a Windows OS; the system keeps running even when you are not using it to its fullest.
Overall, the costs depend on your own scenario and how well it fits to your needs.
I suppose it depends quite a bit on what kind of experimenting you are wanting to do, for what purpose and for how long.
If you're looking into buying the hardware and running your own cluster, then you probably want something like Hadoop or Storm to manage the compute nodes. I don't know how feasible it is to go through 4 TB of data in a matter of seconds, but again that really depends on the kind of processing you want to do. Counting the frequency of words in the 4 TB corpus should be pretty easy (even on your Mac), but building SVMs or doing something like LDA on the lot won't be. One issue you'll run into is that you won't have enough memory to fit all of that, so you'll want a library that can run the methods off disk.
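For instance, the word-count case can be done in a streaming fashion so memory stays flat no matter how big the corpus is; a rough sketch (the file name is just a placeholder):

    from collections import Counter

    counts = Counter()
    with open("corpus_shard.txt", encoding="utf-8", errors="ignore") as f:
        for line in f:                  # one line at a time keeps memory use flat
            counts.update(line.lower().split())

    print(counts.most_common(20))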
If you don't know exactly what your requirements are, then I would use EC2 to set up a test rig to gain a better understanding of what it is that you want to do and how much grunt/memory that needs in order to get done in the amount of time you require.
We recently bought two compute nodes, 128 cores each with 256 GB of memory and a few terabytes of disk space, for around £20k I think. These are AMD Interlagos machines. That said, the compute cluster already had Infiniband storage, so we just had to hook up to that and buy the two compute nodes, not the whole infrastructure.
The obvious thing to do here is to start off with a smaller data set, say a few gigabytes. That'll get you started on your mac, you can experiment with the data and different methods to get an idea of what works and what doesn't, and then move your pipeline to the cloud, and run it with more data. If you don't want to start the experimentation with a single sample, you can always take multiple samples from different parts of the full corpus, just keep the sample sizes down to something you can manage on your own workstation to start off with.
As an aside, I highly recommend the scikit-learn project on GitHub for machine learning. It's written in Python, but most of the matrix operations are done in Fortran or C libraries so it's pretty fast. The developer community is also extremely active on the project. Another good library that is perhaps a bit more approachable (depending on your level of expertise) is NLTK. It's nowhere near as fast but makes a bit more sense if you're not familiar with thinking about everything as a matrix.
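To tie the two points together, scikit-learn also has out-of-core tools that get around the won't-fit-in-memory problem; a minimal sketch (the batch generator is a stand-in for reading your real data off disk):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)   # no vocabulary held in RAM
    clf = SGDClassifier()

    def batches():
        # Stand-in generator: in practice, yield (texts, labels) read from disk.
        yield (["good movie", "terrible plot"], [1, 0])
        yield (["loved it", "awful acting"], [1, 0])

    for texts, labels in batches():
        X = vectorizer.transform(texts)
        clf.partial_fit(X, labels, classes=[0, 1])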
UPDATE
One thing I forgot to mention is the time your project will be running, or to put it another way, how long you will get some use out of your specialty hardware. If it's a project that is supposed to serve the EU parliament for the next 10 years, then you should definitely buy the hardware. If it's a project for you to get familiar with NLP, then forking out the money might be overkill, unless you're also planning on starting your own cloud computing rental service :).
That said, I don't know what the real world costs of using EC2 are for something like this. I've never had to use them.
As the title says, what I would like to accomplish is: given a package (usually between 500 MB and 1 GB in size), copy it to around 40 servers at the same time (concurrently). I've been using a script that runs one copy at a time, so I'm considering these possibilities:
1 - Use the multiprocessing library and create a single process for each copy, so that they can run concurrently; although I think I might end up with an I/O bottleneck, and processes cannot share the same data.
2 - I'm not using a single internet connection, but a huge corporate WAN.
Can anyone tell me whether there is any other more effective (faster) way to achieve the same thing, or some other way to solve it? (I can run this task from a 2-core workstation.)
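For context, this is roughly the kind of thing I have in mind for option 1 (just a sketch; the host names, paths and copy command are made up):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["server%02d" % i for i in range(1, 41)]
    PACKAGE = "/tmp/package.tar.gz"

    def push(host):
        # scp stands in for whatever copy command the current script uses.
        return host, subprocess.call(["scp", PACKAGE, f"{host}:/tmp/"])

    with ThreadPoolExecutor(max_workers=8) as pool:
        for host, rc in pool.map(push, HOSTS):
            print(host, "OK" if rc == 0 else f"failed ({rc})")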
1) I have no experience with this, but it looks like a fit for your use case:
http://code.google.com/p/pysendfile/
sendfile(2) is a system call which provides a "zero-copy" way of copying data from one file descriptor to another (a socket). The phrase "zero-copy" refers to the fact that all of the copying of data between the two descriptors is done entirely by the kernel, with no copying of data into userspace buffers. This is particularly useful when sending a file over a socket (e.g. FTP).
and
When do you want to use it?
Basically any application sending files over the network can take advantage of sendfile(2).
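If you're on Python 3.3+, the standard library's os.sendfile exposes the same syscall, so a minimal sketch of the idea looks like this (the receiver host and port are made up):

    import os, socket

    def send_file(path, host="server01", port=9000):   # assumed receiver
        sock = socket.create_connection((host, port))
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            sent = 0
            while sent < size:
                # The kernel copies straight from the file to the socket,
                # with no userspace buffer in between.
                sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
        sock.close()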
2) Another option would be to use a torrent library. I recently learned (skip to 31:00 for the torrent stuff) that Facebook distributes its daily software updates via torrent (and updates thousands of servers with 1.5 GB binaries within 15 minutes or so).
Assume your machines have 1 Gbit connections. You'll get 800 Mbit/s if you're lucky and work at it, so it'll take ~10 s to copy each 1 GB and 6-7 minutes to update those machines sequentially. If that's good enough, the only thing you need to do is work on using the 1 Gbit efficiently to hit that target (what are you seeing from your current scripts? OK, 1 Gbit may be ambitious on a WAN, but you can do a similar analysis). Multiprocessing might or might not help here, but it's not going to magically get you more bandwidth.
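A quick back-of-the-envelope version of that calculation, with every number assumed:

    link_bits_per_s = 0.8 * 1e9     # ~800 Mbit/s of a 1 Gbit link
    package_bytes = 1e9             # 1 GB package
    servers = 40

    per_copy_s = package_bytes * 8 / link_bits_per_s
    print(f"{per_copy_s:.0f} s per copy, "
          f"~{per_copy_s * servers / 60:.0f} min for {servers} servers sequentially")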
If it's not good enough, I'd consider either of the following:
Go P2P (see miku's answer), so that as soon as one machine has a bit of the data it can share it with other machines using its own bandwidth. How much this helps depends to some extent on your network topology (the existence of other bottleneck points).
Look into multicast, if the network is enough under your control that you can get the traffic routed appropriately (this seems pretty unlikely for a WAN, but maybe one day in an IPv6 wonderland...). Instead of copying the same data 40 times (assuming it is the same each time), you just broadcast it once and all the receivers pick it up simultaneously. Multicast UDP isn't reliable (it's intended more for IPTV, I think), but there have been attempts to build reliable file-transfer tools using multicast tech, e.g. OpenPGM and MS's own implementation.
Has anybody got any experience of using HTTP byte ranges across multiple parallel requests to speed up downloads?
I have an app that needs to download fairly large images from a web service (1 MB+) and then send out the modified files (resized and cropped) to the browser. There are many of these images, so it is likely that caching will be ineffective, i.e. the cache may well be empty. In this case we are hit by some fairly large latencies while waiting for the image to download, 500 ms+, which is over 60% of our app's total response time.
I am wondering if I could speed up the download of these images by using a group of parallel HTTP Range requests, e.g. each thread downloads 100kb of data and the responses are concatenated back into a full file.
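Roughly what I'm picturing, sketched in Python purely for illustration (the URL and chunk size are made up):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://images.example.com/big.jpg"    # hypothetical image URL
    CHUNK = 100 * 1024                           # 100 KB per range request

    def total_size():
        req = urllib.request.Request(URL, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return int(resp.headers["Content-Length"])

    def fetch(offset, size):
        req = urllib.request.Request(URL)
        req.add_header("Range", f"bytes={offset}-{offset + size - 1}")
        with urllib.request.urlopen(req) as resp:
            return offset, resp.read()

    size = total_size()
    ranges = [(o, min(CHUNK, size - o)) for o in range(0, size, CHUNK)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = dict(pool.map(lambda r: fetch(*r), ranges))

    image = b"".join(parts[o] for o, _ in ranges)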
Does anybody out there have any experience of this sort of thing? Would the overhead of the extra downloads negate a speed increase, or might this technique actually work? The app is written in Ruby, but experiences/examples from any language would help.
A few specifics about the setup:
There are no bandwidth or connection restrictions on the service (it's owned by my company)
It is difficult to pre-generate all the cropped and resized images, there are millions with lots of potential permutations
It is difficult to host the app on the same hardware as the image disk boxes (political!)
Thanks
I found your post by Googling to see if someone had already written a parallel analogue of wget that does this. It's definitely possible and would be helpful for very large files over a relatively high-latency link: I've gotten >10x improvements in speed with multiple parallel TCP connections.
That said, since your organization runs both the app and the web service, I'm guessing your link is high-bandwidth and low-latency, so I suspect this approach will not help you.
Since you're transferring large numbers of small files (by modern standards), I suspect you are actually getting burned by the connection setup more than by the transfer speeds. You can test this by loading a similar page full of tiny images. In your situation you may want to go serial rather than parallel: see if your HTTP client library has an option to use persistent HTTP connections, so that the three-way handshake is done only once per page or less instead of once per image.
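In Python, for example, the idea looks roughly like this (host and paths are hypothetical; any HTTP client library with keep-alive support works the same way):

    import http.client

    conn = http.client.HTTPConnection("images.example.com")
    paths = ["/img/%d.jpg" % i for i in range(1, 11)]

    images = []
    for path in paths:
        conn.request("GET", path)      # reuses the same TCP connection each time
        resp = conn.getresponse()
        images.append(resp.read())     # must read fully before the next request
    conn.close()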
If you end up getting really fanatical about TCP latency, it's also possible to cheat, as certain major web services like to.
(My own problem involves the other end of the TCP performance spectrum, where a long round-trip time is really starting to drag on my bandwidth for multi-TB file transfers, so if you do turn up a parallel HTTP library, I'd love to hear about it. The only tool I found, called "puf", parallelizes by files rather than byte ranges. If the above doesn't help you and you really need a parallel transfer tool, likewise get in touch: I may have given up and written it by then.)
I've written the backend and services for the sort of place you're pulling images from. Every site is different so details based on what I did might not apply to what you're trying to do.
Here are my thoughts:
If you have a service agreement with the company you're pulling images from (which you should because you have a fairly high bandwidth need), then preprocess their image catalog and store the thumbnails locally, either as database blobs or as files on disk with a database containing the paths to the files.
Doesn't that service already have the images available as thumbnails? They're not going to send a full-sized image to someone's browser either... unless they're crazy or sadistic and their users are crazy and masochistic. We preprocessed our images into three or four different thumbnail sizes so it would have been trivial to supply what you're trying to do.
If your request is something they expect then they should have an API or at least some resources (programmers) who can help you access the images in the fastest way possible. They should actually have a dedicated host for that purpose.
As a photographer I also need to mention that there could be copyright and/or terms-of-service issues with what you're doing, so make sure you're above board by consulting a lawyer AND the site you're accessing. Don't assume everything is ok, KNOW it is. Copyright laws don't fit the general public's conception of what copyrights are, so involving a lawyer up front can be really educational, plus give you a good feeling you're on solid ground. If you've already talked with one then you know what I'm saying.
I would guess that using any P2P network would be useless, as there are more permutations than frequently requested files.
Downloading a few parts of a file in parallel can give an improvement only on slow networks (slower than 4-10 Mbps).
To get any improvement from parallel downloads you need to make sure there is enough server power. Based on your current problem (waiting over 500 ms for a connection) I assume you already have a problem with your servers:
you should add/improve load-balancing,
you should think about changing the server software to something with better performance.
And again, if 500 ms is 60% of the total response time then your servers are overloaded; if you think they are not, you should look for a bottleneck in the connections or server performance.
Lately I've asked this question. But the answer doesn't suit my needs, and I know that file hosting providers do manage to limit speed. So I'm wondering what the general algorithm/method to do that is (I do mean the downloading technique), in particular limiting the download speed of a single connection/user.
@back2dos I want to give a particular user a particular download speed (within hardware capabilities, of course), or in other words give a user the ability to download some particular file at, let's say, 20 KB/s. Of course, I want the ability to change that to some other value.
You could use a token bucket (http://en.wikipedia.org/wiki/Token_bucket).
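A minimal sketch of the idea in Python (the rate and capacity are just example values):

    import time

    class TokenBucket:
        def __init__(self, rate, capacity):
            self.rate = rate              # refill rate, bytes per second
            self.capacity = capacity      # maximum burst size in bytes
            self.tokens = capacity
            self.last = time.monotonic()

        def consume(self, n):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True               # caller may send n bytes now
            return False                  # caller should wait and retry

    bucket = TokenBucket(rate=20 * 1024, capacity=64 * 1024)   # ~20 KB/s cap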
Without mention of platform/language, it's difficult to answer, but a "leaky bucket" algorithm would probably be the best fit:
http://en.wikipedia.org/wiki/Leaky_bucket
Well, since this question is really general, here's a very simple approach for plain TCP:
You put the resource handles of all download connections into a list, paired with information about what data is requested, and loop through it. For each connection you write a chunk of the required data onto the socket, maybe about 1.5 KB, which is the most commonly used maximum segment size as far as I know. When you're at the end of the list, you start over. Before starting over, simply wait long enough to hit the desired average bandwidth.
Please note: if too many clients have lower bandwidth than you allow, your TCP buffers are likely to explode. Some TCP bindings permit finding the size of currently buffered data for one socket; if it exceeds a threshold, you can simply skip the socket.
Also, if too many clients are connected, you will actually not have enough time to write to all the sockets, so after one loop you "have to wait for a negative time". Increasing the chunk size might speed things up in such scenarios, but at some point your server will stop getting faster.
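A very rough sketch of that loop in Python (plain blocking sockets, illustrative numbers only):

    import time

    CHUNK = 1460                  # roughly one TCP segment's worth of payload
    RATE = 20 * 1024              # target bytes per second per connection

    def serve_loop(connections):
        # connections: list of (socket, open file object) pairs
        while connections:
            start = time.monotonic()
            for sock, f in list(connections):
                data = f.read(CHUNK)
                if not data:
                    connections.remove((sock, f))
                    sock.close()
                    continue
                sock.sendall(data)
            # One pass sent CHUNK bytes per connection; pad the pass out so
            # the average rate per connection stays near RATE.
            elapsed = time.monotonic() - start
            time.sleep(max(0.0, CHUNK / RATE - elapsed))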
A simpler approach is to do this on the client side, but this may generate a lot of overhead. The dead simple idea is to have the client request 1 KB every 50 ms (assuming you want 20 KB/s). You can even do that over HTTP, although I strongly suggest a bigger chunk size, since HTTP has enormous overheads.
My guess is the best option is to try to find a webserver capable of doing such things out of the box. I think Apache has a number of modules for all kinds of quotas.
greetz
back2dos