Scaling Tigase XMPP server on Amazon EC2 - amazon-ec2

Does anyone have experience running clustered Tigase XMPP servers on Amazon's EC2? Primarily I want to know about anything non-obvious that might trip me up. (For example, running Ejabberd on EC2 can apparently cause issues due to Mnesia.)
Or if you have any general advice on installing and running Tigase on Ubuntu.
Extra information:
The system I’m developing uses XMPP just to communicate (in near real-time) between a mobile app and the server(s).
The number of users will initially be small, but hopefully will grow. This is why the system needs to be scalable. Presumably for just a few thousand users you wouldn’t need a cc1.4xlarge EC2 instance? (Otherwise this is going to be very expensive to run!)
I plan on using a MySQL database hosted in Amazon RDS for the XMPP server database.
I also plan on creating an external XMPP component written in Python, using SleekXMPP. It will be this external component that does all the ‘work’ of the server, as the application I’m making is quite different from instant messaging. For this part I have not worked out how to connect an external XMPP component written in Python to a Tigase server. The documentation seems to suggest that components are written specifically for Tigase - and not for a general XMPP server, using XEP-0114: Jabber Component Protocol, as I expected.
With this extra information, if you can think of anything else I should know about I’d be glad to know.
Thank you :)

I have lots of experience. I think there are a lot of non-obvious problems. For example, the only reliable instance type for running an application like Tigase is cc1.4xlarge. Others cause problems with CPU availability, and it is just a lottery whether you are lucky enough to run your service on a server which is not busy with other people's work.
Also, you need an instance with the highest possible I/O to make sure it can cope with network traffic. The high I/O applies especially to the database instance.
Not sure if this is obvious or not, but there is this problem with hostnames on EC2: every time you start an instance, the hostname and IP address change. A Tigase cluster is quite sensitive to hostnames. There is a way to force/change the hostname for the instance, so this might be a way around the problem.
Of course I am talking about a cluster for millions of online users and really high traffic: 100k XMPP packets per second or more. Generally, for a large installation it is way cheaper and more efficient to have dedicated servers.
Generally Tigase runs very well on Amazon EC2 but you really need the latest SVN code as it has lots of optimizations added especially after tests on the cloud. If you provide some more details about your service I may have some more suggestions.
More comments:
When it comes to costs, a dedicated server is always the cheaper option for a constantly running service. Unless you plan to switch servers on and off on an hourly basis, I would recommend going for some dedicated service. Costs are lower and performance is way more predictable.
However, if you really want/need to stick to Amazon EC2 let me give you some concrete numbers, below is a list of instances and how many online users the cluster was able to reliably handle:
5*cc1.4xlarge - 1.7 million online users
1*c1.xlarge - 118k online users
2*c1.xlarge - 127k online users
2*m2.4xlarge (with 5GB RAM for Tigase) - 236k online users
2*m2.4xlarge (with 20GB RAM for Tigase) - 315k online users
5*m2.4xlarge (with 60GB RAM for Tigase) - 400k online users
5*m2.4xlarge (with 60GB RAM for Tigase) - 312k online users
5*m2.4xlarge (with 60GB RAM for Tigase) - 327k online users
5*m2.4xlarge (with 60GB RAM for Tigase) - 280k online users
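To make the figures above easier to compare, here is a small sketch that works out users per instance and the spread between the repeated 5*m2.4xlarge runs. The numbers are copied from the list; the labels and function names are just for illustration.

```python
# Figures copied from the list above: (instance count, label, online users).
RESULTS = [
    (5, "cc1.4xlarge", 1_700_000),
    (1, "c1.xlarge", 118_000),
    (2, "c1.xlarge", 127_000),
    (2, "m2.4xlarge, 5GB", 236_000),
    (2, "m2.4xlarge, 20GB", 315_000),
    (5, "m2.4xlarge, 60GB", 400_000),
    (5, "m2.4xlarge, 60GB", 312_000),
    (5, "m2.4xlarge, 60GB", 327_000),
    (5, "m2.4xlarge, 60GB", 280_000),
]


def users_per_instance(count, users):
    """Average number of online users handled by each instance."""
    return users / count


def spread(label):
    """Return (lowest, highest) user count across runs sharing a label."""
    runs = [users for _, name, users in RESULTS if name == label]
    return min(runs), max(runs)


if __name__ == "__main__":
    for count, name, users in RESULTS:
        per_node = users_per_instance(count, users)
        print(f"{count} x {name}: {per_node:,.0f} users per instance")
    low, high = spread("m2.4xlarge, 60GB")
    print(f"5 x m2.4xlarge (60GB) runs ranged from {low:,} to {high:,}")
```

The per-instance average is only a rough comparison, since a cluster does not necessarily scale linearly with the number of nodes.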
A few more comments:
Why does the amount of memory matter that much? This is because CPU power is very unreliable and inconsistent on all but cc1.4xlarge instances. You have 8 virtual CPUs, but if you look at the top command you often see that one CPU is working and the rest are not. This insufficient CPU power causes the internal queues in Tigase to grow. When the CPU power is back, Tigase can process the waiting packets. The more memory Tigase has, the more packets can be queued and the better it handles CPU deficiencies.
Why is 5*m2.4xlarge listed 4 times? This is because I repeated the tests many times on different days and at different times of the day. As you can see, depending on the time and date the system could handle a different load. I guess this is because the Tigase instance shared CPU power with some other services. If they were busy, Tigase suffered from a lack of CPU power.
That said, I think with an installation of up to 10k online users you should be fine. However, other factors like roster size matter greatly, as they affect traffic and load. Also, if you have other elements which generate significant traffic, this will put load on your system.
In any case, without some tests it is impossible to tell how your system really behaves or whether it can handle the load.
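The memory argument above can be illustrated with a toy queue model. This is only a sketch under simple assumptions: packets arrive at a fixed rate, the CPU does nothing for the first few ticks (a "stall"), and a queue that is full loses the overflow. It does not model what Tigase actually does internally; all names and numbers are made up for illustration.

```python
def simulate(queue_capacity, arrivals_per_tick, service_per_tick,
             stall_ticks, total_ticks):
    """Toy model of a packet queue during a CPU stall.

    Packets arrive every tick. Nothing is processed during the first
    `stall_ticks` ticks; afterwards `service_per_tick` packets are
    processed per tick. Returns (dropped_packets, peak_queue_length).
    """
    queued = dropped = peak = 0
    for tick in range(total_ticks):
        queued += arrivals_per_tick
        if queued > queue_capacity:
            # Queue is full: overflow is counted as lost in this model.
            dropped += queued - queue_capacity
            queued = queue_capacity
        peak = max(peak, queued)
        if tick >= stall_ticks:
            queued = max(0, queued - service_per_tick)
    return dropped, peak


if __name__ == "__main__":
    # Same traffic and the same 10-tick stall, two different queue sizes.
    for capacity in (500, 2000):
        dropped, peak = simulate(capacity, 100, 150, 10, 40)
        print(f"capacity={capacity}: dropped={dropped}, peak queue={peak}")
```

With the small queue the stall costs packets; with the larger one the backlog is simply worked off once the CPU is available again, which is the effect described above.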
And the last question regarding component:
Of course Tigase does support XEP-0114 and XEP-0225 for connecting external components, so components written in other languages should not be a problem. On the other hand, I recommend using Tigase's API for writing components. They can be deployed either as internal Tigase components or as external components, and this is transparent to the developer: you do not have to worry about it at development time. This is part of the API and framework.
Also, you can use all the goods from the Tigase framework: scripting capabilities, monitoring, statistics, and much easier development, as you can easily deploy your code as an internal component for tests.
You really do not have to worry about any XMPP-specific stuff; you just fill in the body of the processPacket(...) method and that's it.
There should be enough online documentation for all of this on the Tigase website.
Also, I would suggest reading about Python support for multi-threading and how it behaves under a very high load. It used to be not so great.
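For the XEP-0114 route mentioned above, the one protocol-specific step an external component has to get right is the handshake: the component sends the lowercase hex SHA-1 digest of the stream ID (received from the server) concatenated with the shared secret. A library such as SleekXMPP would normally do this for you; the sketch below just shows the calculation. The stream ID and secret are placeholder values, not anything Tigase-specific.

```python
import hashlib


def component_handshake(stream_id: str, secret: str) -> str:
    """Return the XEP-0114 handshake digest.

    The digest is the lowercase hex SHA-1 of the stream ID sent by the
    server concatenated with the shared secret configured on both sides.
    """
    return hashlib.sha1((stream_id + secret).encode("utf-8")).hexdigest()


def handshake_element(stream_id: str, secret: str) -> str:
    """Wrap the digest in the <handshake/> element sent to the server."""
    return f"<handshake>{component_handshake(stream_id, secret)}</handshake>"


if __name__ == "__main__":
    # Placeholder values: the real stream ID comes from the server's
    # opening <stream:stream> tag, the secret from your server config.
    print(handshake_element("3BF96D32", "test"))
```

If the server accepts the digest it replies with an empty handshake element, after which the component can exchange stanzas for its own domain.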

Related

Basic AWS questions

I'm a newbie on AWS, and it has so many products (EC2, Load Balancer, EBS, S3, SimpleDB etc.), and so many docs, that I can't figure out where I should start.
My goal is to be ready for scalability.
Suppose I want to set up a simple webserver, which accesses a database in mongolab. I suppose I need one EC2 instance to run it. At this point, do I need something more (EBS, S3, etc.)?
At some point, my app will have reached enough traffic that I must scale it. I was thinking of starting a new copy (instance) of my EC2 machine. But then it will have another IP. So how is traffic distributed between both EC2 instances? Is that done automatically? Must I pay for a Load Balancer service to distribute the traffic? And then will I have to pay for 2 EC2 instances and 1 LB? At this point, do I need something more (e.g. Elastic IP)?
Welcome to the club Sony Santos,
AWS is a very powerful architecture, but with this power comes responsibility. I, and presumably many others, have learned the hard way while building applications using AWS's services.
You ask, where do I start? This is actually a very good question, but you probably won't like my answer. You need to read and do research about all the technologies offered by Amazon and even other providers such as Rackspace, GoGrid, Google's Cloud and Azure. Amazon is not easy to get going with, but it's not really meant to be; its focus is more on being very customizable and having a very extensive API. But let's get back to your question.
To run a simple webserver you would need to start an EC2 instance. By default this instance runs on a disk drive called EBS. Essentially an EBS drive is a normal hard drive, except that you can do lots of other cool stuff with it, like take it off one server and move it to another. S3 is really more of a file storage system. It's more useful if you have a bunch of images or if you want to store a lot of backups of your databases etc., but it's not a requirement for a simple webserver. Just running an EC2 instance is all you need; everything else will happen behind the scenes.
If your app reaches a lot of traffic you have two options. You can scale your machine up by shutting it off and starting it with a larger instance. Generally speaking this is the easiest thing to do, but you'll get to a point where you either cannot handle all the traffic with 1 instance even at the larger size and you'll decide you need two, OR you'll want a more fault-tolerant application that will still be online in the event of a failure or update.
If you create a second instance you will need to do some form of load balancing. I recommend using Amazon's Elastic Load Balancer, as it's easy to configure and its integration with the cloud is better than using round-robin DNS or an application like haproxy. Elastic Load Balancers are not expensive; I believe they cost around $18 / month plus charges for the data that passes through the load balancer.
But no, you don't need anything else to scale up your site. 2 EC2 instances and an ELB will do the trick.
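To make the idea of "distributing traffic between two instances" concrete, here is a toy round-robin picker that also skips instances marked as unhealthy. It is only an illustration of the concept, not how Elastic Load Balancer is implemented, and the IP addresses are made-up placeholders.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        """Stop sending traffic to a backend (e.g. failed health check)."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Resume sending traffic to a known backend."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend in rotation."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    # Placeholder private IPs standing in for two EC2 instances.
    balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2"])
    print([balancer.pick() for _ in range(4)])
    balancer.mark_down("10.0.0.1")  # simulate one instance failing
    print([balancer.pick() for _ in range(2)])
```

The second half shows why a load balancer also helps with fault tolerance: when one instance goes away, requests keep flowing to the other.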
Additional questions you didn't ask but probably should have.
How often does an EC2 instance experience hardware failure and crash my server? What can I do if this happens?
It happens frequently, usually in batches. Sometimes I go months without any problems, then I will have a few servers crash at once. But it's definitely something you should plan for. I didn't in the beginning and I paid for it. Make sure you create scripts and have backups and a backup plan ready in case your server fails. Be OK with it being down, or have a load-balanced solution from day 1.
What's the hardest part about scalability?
Testing testing testing testing... Don't ever assume anything. Also be prepared for sudden spikes in your traffic. You have to be prepared for anything: if your page goes from 1 to 1000 people overnight, are you prepared to handle it? Have you tested what you "think" will happen?
Best of luck and have fun... I know I have :)

Amazon EC2 Capacity & Workflow Questions

I’m hoping some of you with experience using Amazon EC2 could offer some advice… of course it’ll be subjective, which is fine. I’m pretty sure your guesstimate would be better than mine.
I am planning on moving all my client’s websites from shared hosting environments to Amazon EC2. They’re all pretty low traffic sites (the busiest site receives around 50 unique visitors a day). There’s about 8 sites, but I may expand this as I take on more projects and host more sites… current capacity planning is for say 12 sites.
Each site runs on ASP.Net (Umbraco CMS), and requires a SQL Server database.
My thoughts are one of the following:
Set up a Small Instance (1.7 GB RAM, 1 EC2 Compute Unit), and run IIS and SQL Server Express on that server.
Set up 2 Micro Instances (613 MB RAM each, up to 2 EC2 Compute Units) – one for IIS, the other for SQL Server.
Which arrangement do you think would work best for my requirements? I’ve started setting up a Micro instance with Server 2008, SQL Server Express, etc… and am finding it is not coping with the memory requirements, hence considering expanding. I could always configure on a Small instance, then export the AMI and fire it up in a Micro instance afterwards, and do the same every time any serious changes to the server are required. I guess I could even do all updates etc. on a spare Small Spot instance, then load that AMI up in a Micro and transfer the IP address across, so I don’t need to do too much work on the production servers. I figure if I store all my website data files on EBS Volumes, then it should be fairly easy to move hosting between servers with minimal downtime, while never working on a production server.
I’m interested to know what you all think, and what strategies you employ for such activities as upgrades, windows updates, software installations, etc.
And what capacity do you think I’d need for my requirements.
Cheers
Greg
Well, first up, Server 2008 doesn't play well in the 613MB RAM the Micro instance gives you. It runs, but it's a dog, and it barks louder the more services (IIS, SSE, etc) you layer on top. We use nothing smaller than a Small for Server 2008, and in fact typically do the environment config in a Medium and scale down to Small once the heavy lifting is complete and the OS is ready to use. Server 2003, however, seems to breathe easier on a Micro - but we still do the config on a larger instance and scale down.
We're running low-traffic websites on Server 2003/IIS6 in a Micro, with a Server 2008/SS install on a shared, separate, Small instance. We do also have one Server 2008/IIS7 Micro build running, but only to remind ourselves why we don't use it more widely. ;)
Larger websites run Server 2008/IIS7 in either Small or Medium instances, but almost always still using that shared separate SS instance for database services. We try not to deploy multiple SS installations, since it makes maintenance and backups more complex.
Stashing content and config on EBS Volumes is of course good practice, unless you like rebuilding the entire system whenever an Instance disappears. Snapshotting your Instances periodically is also good practice, since you can spin-up a new Instance from a baseline AMI and swap the snapshot in as a boot Volume for fast recovery in the event of disaster.

What is the "http"/IIS capacity of an EC2 micro instance?

I want to split the current high-CPU medium instance that I have into a high-CPU instance plus a micro instance in order to take in more HTTP traffic, because the current instance already hits 100% CPU at times. Does anyone know how many simultaneous connections a micro instance can take in? The initial idea was to separate the db and host it on SQL Azure, but due to some old stored procedures, I'm opting to stay on a pure EC2 setup.
This question is quite old so presumably you've made the decision on your own, but just for posterity:
The capacity of any server will vary WIDELY depending on what you're using it for. It would be silly to even speculate on a value. If each HTTP request requires lots of server processing, combing through database results, repackaging media from other servers, you could quickly overload a Micro. However, if all you're doing is basic web serving, you could probably do alright for a while.
The only way to answer a question like this is to set up the micro instance and test it yourself. Set up your development workstation to flood it with requests, and then keep an eye on the response time and CPU usage in AWS CloudWatch to see if the behavior you get is acceptable to you.
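As a starting point for "flood it with requests", here is a minimal sketch using only the Python standard library. It fires a fixed number of GET requests with a thread pool and reports how many succeeded plus median and maximum latency. The function names, the default numbers and the URL in the usage line are all placeholders; a dedicated tool will give you more realistic load, and you should only point this at a server you own.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=10):
    """Fetch a URL, read the whole body and return the HTTP status."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
        return response.status


def timed_call(fetch, url):
    """Run one request; return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        fetch(url)
        succeeded = True
    except Exception:
        succeeded = False
    return succeeded, time.perf_counter() - start


def flood(url, total_requests=200, concurrency=20, fetch=http_get):
    """Send `total_requests` requests, `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(fetch, url),
                                range(total_requests)))
    latencies = sorted(elapsed for ok, elapsed in results if ok)
    summary = {"ok": len(latencies),
               "errors": sum(1 for ok, _ in results if not ok)}
    if latencies:
        summary["median_s"] = latencies[len(latencies) // 2]
        summary["max_s"] = latencies[-1]
    return summary


if __name__ == "__main__":
    # Placeholder address: replace with your own instance before running.
    print(flood("http://127.0.0.1:8080/", total_requests=100, concurrency=10))
```

Raise `concurrency` step by step while watching CPU in CloudWatch; the point where latency or errors climb sharply is roughly the capacity for your particular workload.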
(As a side note, there are also differences in network I/O between the various EC2 sizes, so if performance under heavy load seems bad but the CPU isn't working hard, it may be I/O bound.)

Is it possible to rent CPU cycles?

I have an application that takes days to process data. Is there a service that would let me run my application on powerful computers?
I'm not running a website or a web service. This is taking lots and lots of data files, running them through a big custom application, and outputting a result.
It takes days on my PC and it's something that needs to be done every once in a while, but not continuously.
Cost isn't really an issue, in the sense that my company will pay for it, but of course it should be cheaper than buying a big-ass machine ourselves.
Have you considered Amazon EC2? You pay by the hour for what you use. No more, no less. You could even rent many servers at once to split the workload.
I'm not sure if that meets your requirement of "powerful computers", because they're just average servers, but at least it will give you a pay-as-you-go solution for running the program off of your own computer.
Amazon's EC2 Service is an excellent solution for your needs. You only pay for the time you use, and you can scale up to as many machines as you need.
From their information:
Elastic – Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously. Of course, because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs.
Flexible – You have the choice of multiple instance types, operating systems, and software packages. Amazon EC2 allows you to select a configuration of memory, CPU, and instance storage that is optimal for your choice of operating system and application. For example, your choice of operating systems includes numerous Linux distributions, Microsoft Windows Server and OpenSolaris.
If your application is not parallel, you won't get many advantages by running it in a "big machine", unless the bottleneck is in the virtual memory swapping. Even the Top500 supercomputers are not essentially faster than any PC for sequential workloads.
If your application can exploit parallelism, maybe you could use your company's existing resources more efficiently than by deploying it on a single PC. If you have a few dozen computers, you could set up a loosely coupled heterogeneous cluster (or local grid; terminology changes with fashion).
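The point about parallelism can be put into numbers with Amdahl's law (not named in the answer above, but it is the standard way to express this): if a fraction p of the run time can be parallelised over n workers, the best possible speedup is 1 / ((1 - p) + p / n). A small sketch, with the function name and sample values chosen purely for illustration:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Best-case speedup when `parallel_fraction` of the run time is
    spread evenly over `workers` and the rest stays sequential."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 0.9, 1.0):
        for workers in (8, 64):
            speedup = amdahl_speedup(fraction, workers)
            print(f"p={fraction:.1f}, n={workers}: {speedup:.2f}x")
```

With p = 0 (a purely sequential program) the speedup stays at 1 no matter how many machines you rent, which is exactly the situation described above.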
I recommend CPUsage.
It is a "startup" in grid computing.
Its specialty is that any individual can join the grid with spare CPU cycles. That makes the grid management cheap, so the grid usage prices are also very cheap.
They have an API; once you integrate it into your program, it will be able to run on their system.

Load Balancing in Amazon EC2?

We've been fighting with HAProxy for a few days now in Amazon EC2; the experience has so far been great, but we're stuck on squeezing more performance out of the software load balancer. We're not exactly Linux networking whizzes (we're a .NET shop, normally), but we've so far held our own, attempting to set proper ulimits, inspecting kernel messages and tcpdumps for any irregularities.
So far though, we've reached a plateau of about 1,700 requests/sec, at which point client timeouts abound (we've been using and tweaking httperf for this purpose). A coworker and I were listening to the most recent Stack Overflow podcast, in which the Reddit founders note that their entire site runs off one HAProxy node, and that it so far hasn't become a bottleneck. Ack! Either they're somehow not seeing that many concurrent requests, we're doing something horribly wrong, or the shared nature of EC2 is limiting the network stack of the EC2 instance (we're using a large instance type). Considering the fact that both Joel and the Reddit founders agree that network will likely be the limiting factor, is it possible that's the limitation we're seeing?
Any thoughts are greatly appreciated!
Edit It looks like the actual issue was not, in fact, with the load balancer node! The culprit was actually the nodes running httperf, in this instance. As httperf builds and tears down a socket for each request, it spends a good amount of CPU time in the kernel. As we bumped the request rate higher, the TCP FIN TTL (being 60s by default) was keeping sockets around too long, and the ip_local_port_range's default was too low for this usage scenario. Basically, after a few minutes of the client (httperf) node constantly creating and destroying new sockets, the number of unused ports ran out, and subsequent 'requests' errored-out at this stage, yielding low request/sec numbers and a large amount of errors.
We had also looked at nginx, but we've been working with RightScale, and they've got drop-in scripts for HAProxy. Oh, and we've got too tight a deadline [of course] to switch out components unless it proves absolutely necessary. Mercifully, being on AWS allows us to test out another setup using nginx in parallel (if warranted), and make the switch overnight later on.
This page describes each of the sysctl variables fairly well (ip_local_port_range and tcp_fin_timeout were tuned, in this case).
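The port exhaustion described in the edit above lends itself to a back-of-envelope calculation. As a rough sketch: if every closed socket keeps its local port reserved for some hold time, one client talking to one destination cannot sustain more new connections per second than (number of local ports) / (hold time). The port range 32768-61000 used below is an assumption about the kernel default at the time, and 60 seconds is the hold time quoted above; check `sysctl net.ipv4.ip_local_port_range` on your own machine for the real values.

```python
def max_new_connections_per_second(port_low, port_high, hold_seconds):
    """Upper bound on sustained new outbound connections per second
    from one client IP to one destination, assuming each closed socket
    keeps its local port reserved for `hold_seconds`."""
    usable_ports = port_high - port_low + 1
    return usable_ports / hold_seconds


if __name__ == "__main__":
    # Assumed default port range with the 60 s hold time quoted above.
    before = max_new_connections_per_second(32768, 61000, 60)
    # A widened range and shorter hold time, as example tuned values.
    after = max_new_connections_per_second(1024, 65535, 15)
    print(f"before tuning: about {before:.0f} new connections/sec")
    print(f"after tuning:  about {after:.0f} new connections/sec")
```

Under these assumptions a single load-generating node tops out at a few hundred new connections per second, so a benchmark that opens a fresh socket per request can hit the client's limit long before the load balancer's.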
Not answering the question directly, but EC2 now supports load balancing through Elastic Load Balancing rather than running your own load balancer in an EC2 instance.
EDIT: Amazon's Route 53 DNS service now offers a way to point a top-level domain at an ELB with an "alias" record. Since Amazon knows the current IP address of the ELB, it can return an A record for that current IP rather than having to use a CNAME record, while still being free to change the IP from time to time.
Not really an answer to your question, but nginx and pound both have good reputations as load-balancers. Wordpress just switched to nginx with good results.
But more specifically, to debug your problem. If you aren't seeing 100% cpu usage (including I/O wait), then you are network bound, yes.
EC2 internally uses a gigabit network. Try using an XL instance, so you have the underlying hardware to yourself and don't have to share that gigabit network port.
Yes, you could use an off-site load balancer, and on bare metal LVS is a great choice, but your latency will be awful! Rumour has it that Amazon is going to fix the CNAME issue. However, they are unlikely to add https, in-depth or custom health checks, feedback agents, URL matching, or cookie insertion (and some people with good architecture would say quite right too). That's why Scalr, RightScale and others are using HAProxy, usually two of them behind a round-robin DNS entry. Here at Loadbalancer.org we are just about to launch our own EC2 load balancing appliance:
http://blog.loadbalancer.org/ec2-load-balancer-appliance-rocks-and-its-free-for-now-anyway/
We are planning on using SSH scripts to integrate with autoscaling in the same way RightScale does; any comments are appreciated on the blog.
Thanks
I would look at switching to an off-site load balancer, not in the cloud, and run something like IPVS on top of it. [The reason why it would be off of Amazon's cloud is because of kernel stuff.] If Amazon doesn't limit the source IP of packets coming out of the instances, you could go with a unidirectional load balancing mechanism. We do something like this, and it gets us about 800,000 simultaneous requests [though we don't deal with latency]. I would also say use "ab2" (apache bench), as it is a little more user friendly and easier to use, in my humble opinion.
Even though your issue is solved: KEMP Technologies now have a full-blown load balancer for AWS. Might save you some hassle.
