Distributed calculation on Cloud Foundry with help of auto-scaling - spring-boot

I have a computation-intensive, long-running task. It can easily be split into sub-tasks, and it would be fairly easy to aggregate the results later on. For example, Map/Reduce would work well.
I have to solve this on Cloud Foundry, and there I want to take advantage of auto-scaling, that is, the creation of additional instances under high CPU load. Normally I use Spring Boot for developing my cf apps.
Any ideas are welcome of how to divide&conquer in an elastic way on cf. It would be great to have as many instances created as cf would do, without needing to configure the amount of available application instances in the application. Also I need to trigger the creation of instances by loading the CPUs to provoke auto-scaling.

"I have to solve this on Cloud Foundry"
It sounds like you're on the right track here. The main thing is that you need to write your app so that it can coexist with multiple instances of itself (or perhaps break it into a primary node that coordinates work and multiple worker apps). However you architect the app, being able to scale up instances is critical. You can then simply cf scale to add or remove nodes and increase capacity.
If you wanted to get clever, you could set up a pipeline to run your jobs. Step one would be to scale up the worker nodes of your app, step two would be to schedule the work to run, step three would be to clean up and scale down your nodes.
I'm suggesting this because manual scaling is going to be the simplest path forward (please read on for why).
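The three pipeline steps described above could be sketched roughly like this. It is only a minimal sketch under assumptions: the app name, the instance counts and the `submit_jobs` hook are hypothetical, and `submit_jobs` is assumed to block until the work is done. The only real piece is the `cf scale APP -i N` command of the cf CLI.

```python
import subprocess
from typing import Callable, List, Optional


def cf_scale_command(app: str, instances: int) -> List[str]:
    """Build the cf CLI call that sets the instance count of an app."""
    if instances < 0:
        raise ValueError("instances must be non-negative")
    return ["cf", "scale", app, "-i", str(instances)]


def run_batch(app: str, busy: int, idle: int,
              submit_jobs: Callable[[], None],
              run: Optional[Callable[[List[str]], object]] = None) -> None:
    """Scale up, run the work, and always scale back down afterwards."""
    if run is None:
        # Default: really invoke the cf CLI (assumes you are already logged in).
        def run(cmd: List[str]) -> object:
            return subprocess.run(cmd, check=True)
    run(cf_scale_command(app, busy))        # step 1: add worker instances
    try:
        submit_jobs()                       # step 2: hypothetical hook that runs the work
    finally:
        run(cf_scale_command(app, idle))    # step 3: scale down, even if the work failed
```

Passing a custom `run` callable makes it possible to dry-run the pipeline without touching a real foundation.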
"and there I want to get advantage from auto-scaling, that is creation of additional instances due to high CPU loads."
As to autoscaling, I think it's possible but I also think it's making the problem more complicated than it needs to be. Autoscaling by CPU on Cloud Foundry is not as simple as it seems. The way Linux reports CPU usage, you can exceed 100%, because it's 100% per CPU core. Pair this with the fact that you may not know how many CPU cores are on your Cells (for example, if you're using a public CF provider) and the fact that the number of cores could change over time (if your provider changes hardware), and it becomes difficult to know at what point you should scale your application.
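To put numbers on the per-core point, here is a small illustration. The core counts are made-up examples, not figures for any particular provider: the same reported value means very different things depending on how many cores the Cell has.

```python
def normalized_cpu(reported_percent: float, cores: int) -> float:
    """Convert a per-core CPU percentage (which can exceed 100) into a
    0-100 share of the whole machine."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return reported_percent / cores


# The same reported 250% is a busy 4-core Cell but a fairly idle 8-core one.
on_four_cores = normalized_cpu(250.0, 4)
on_eight_cores = normalized_cpu(250.0, 8)
```

If you don't know `cores`, you can't pick a sensible threshold for the raw number, which is the problem described above.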
If you must autoscale, I would suggest trying to autoscale on some other metric. What metrics are available will depend on the autoscaler tool you are using. The best would be if you could have some custom metric; then you could use work queue length or something else that's relevant to your application. If custom metrics are not supported, you could always hack together your own autoscaler that does work with metrics relevant to your application (you can scale up and down by adjusting the instance count of your app using the CF API).
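As a rough idea of what the core of such a home-grown autoscaler might look like when driven by work queue length: the function below only computes a target instance count. All names and default limits are hypothetical, and applying the result (via `cf scale` or the CF API) is left out.

```python
import math


def desired_instances(queue_length: int, jobs_per_instance: int,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Pick an instance count proportional to the backlog, within fixed bounds."""
    if jobs_per_instance < 1:
        raise ValueError("jobs_per_instance must be at least 1")
    wanted = math.ceil(queue_length / jobs_per_instance)
    # Clamp so an empty queue keeps a minimum footprint and a huge backlog
    # cannot scale the app without limit.
    return max(min_instances, min(max_instances, wanted))
```

You would call this periodically with the current queue length and only issue a scale request when the result differs from the current instance count.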
You might also be able to hack together a solution based on the metrics that your autoscaler does provide. For example, you could artificially inflate a metric that your autoscaler does support in proportion to the workload you need to process.
You could also just scale up when your work day starts and scale down at the end of the day. It's not dynamic, but it's simple and it will get you some efficiency improvements.
Hope that helps!

Related

On memory intensive on-demand compute

We are considering using Azure Functions to run some compute on. However, our computations could take up lots of memory, let's say more than 5GB.
As I understand it, there is no easy way to scale Azure Functions based on memory usage, i.e. if you reach 15GB, start a new instance (since you don't want to run over the maximum memory of your instance).
Is there a way around this limitation?
OR
Is there another technical alternative to azure functions that provide pay per use and allows rapid scaling on demand?
Without any further details about what you are actually trying to compute, it is very hard to give you any meaningful advice.
But there are a few things you could consider.
For example, if you are processing CSV files and you know you need 1GB of memory to process a file, then "spin up" an AWS Lambda per file. You could use serverless orchestration like AWS Step Functions to coordinate your processing. This way you are effectively splitting up your compute-intensive tasks.
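The "one task per file" idea can be tried out locally before involving any cloud service. This is just a local stand-in using a thread pool: in the setup described above each call would be a separate Lambda invocation, and `handler` is a hypothetical per-file function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def process_each(files: Iterable[str], handler: Callable[[str], int],
                 max_parallel: int = 4) -> Dict[str, int]:
    """Run one independent task per file and collect the results by file name."""
    names = list(files)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps results in the same order as the inputs.
        results = list(pool.map(handler, names))
    return dict(zip(names, results))
```

Because every file is handled independently, the memory needed per task stays bounded by the largest single file rather than by the whole data set.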
Another option would be to automate "pay per use" yourself. You could use automation tools like Terraform to start an EC2 spot instance, run your computing task and shut down the EC2 instance after it finishes. That is more or less pay as you go with a bit of operations overhead on your part.
There are also other services like AWS Fargate, which are marketed as "Serverless compute for containers", allowing you to run Docker containers in a pay per use manner. Fargate allows provisioning of up to 30GB of memory.
Another option would be to use services like ElastiCache or Memcached to "externalize" your memory.
Which of these options would be the best for you is hard to tell, because it depends on your constraints: do you need everything to be stored in the "same memory" or can it be split up etc, what are your data structures, can it be processed in chunks (to minimise memory usage), is latency important, how long does it take, how often do you need to run your tasks, etc...
Note: There are probably equivalent Azure services to the AWS services I talked about in this answer.

Microservices interdependency

One of the benefits of Microservice architecture is one can scale heavily used parts of the application without scaling the other parts. This supposedly provides benefits around cost.
However, my question is: if a heavily used microservice depends on other microservices to do its work, wouldn't you have to scale the other services as well, seemingly defeating the purpose? If a microservice is calling another microservice in real time to do its job, does that mean the microservice boundaries are not established correctly?
There's no rule of thumb for that.
Scaling usually depends on some metrics and when some thresholds are reached then new instances are created. Same goes for the case when they are not needed anymore.
Some services do simple, fast tasks, like taking an input and writing it to the database, while others run longer tasks which can take any amount of time.
If a service that needs to scale is calling a service that can easily handle heavy loads in a reliable way, then there is no need to scale that second service.
The idea behind scaling is to scale up when needed in order to support the load and then scale down whenever the load returns to its regular range in order to reduce costs.
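A threshold rule of the kind described here might look like the following sketch. The thresholds and instance limits are made-up defaults; the gap between the scale-up and scale-down thresholds is there so that a load hovering around one value doesn't cause constant flapping.

```python
def next_instance_count(current: int, load: float,
                        scale_up_at: float = 0.8, scale_down_at: float = 0.3,
                        min_instances: int = 1, max_instances: int = 10) -> int:
    """Add an instance above the upper threshold, remove one below the lower."""
    if load > scale_up_at:
        return min(current + 1, max_instances)
    if load < scale_down_at:
        return max(current - 1, min_instances)
    return current  # load is in the regular range: leave things alone
```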
There are two topics to discuss here.
First, it is usually not good practice to have two microservices communicate synchronously, because you are coupling them in time: one service has to wait for the other to finish its task. So normally it is a better approach to use a message queue to decouple the producer and the consumer; this way the load on one service doesn't affect the other.
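The decoupling can be illustrated in a single process with Python's standard `queue` module. This is only an in-process stand-in for a real message broker, and `handle` is a hypothetical consumer function: the point is that the producer never waits for it to finish.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_pipeline(items: Iterable[int], handle: Callable[[int], int]) -> List[int]:
    """Producer puts work on a queue; a consumer thread drains it at its own pace."""
    q: queue.Queue = queue.Queue()
    results: List[int] = []
    stop = object()  # sentinel telling the consumer there is no more work

    def consumer() -> None:
        while True:
            item = q.get()
            if item is stop:
                break
            results.append(handle(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:   # the producer only enqueues and moves on
        q.put(item)
    q.put(stop)
    worker.join()
    return results
```

With a real broker the two sides would be separate services, and each could be scaled according to its own queue depth.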
However, there are situations in which synchronous communication between two services is necessary, but that doesn't necessarily mean both have to scale the same way. For example, if a service has to make several calls to other services, run database queries, or do other heavy computational work, and one of the services it calls only sorts an array, the first service will probably have to scale much more than the second in order to process the same number of requests, because the threads in the first service will be occupied for longer than those in the second.
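A back-of-the-envelope version of that argument: the number of busy threads is roughly the request rate times the time each request holds a thread, so the slower service needs more instances at the same request rate. The request rate, timings and thread counts below are invented for illustration.

```python
import math


def instances_needed(requests_per_second: float, seconds_per_request: float,
                     threads_per_instance: int) -> int:
    """Estimate instances from how many threads are busy at any moment."""
    busy_threads = requests_per_second * seconds_per_request
    return max(1, math.ceil(busy_threads / threads_per_instance))


# Same 100 requests/s, 10 threads per instance, very different thread hold times.
heavy_service = instances_needed(100, 0.5, 10)    # several calls, queries, etc.
sorting_service = instances_needed(100, 0.01, 10)  # only sorts an array
```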

Azure web and worker role - 2x small instances or 1x medium?

Which is better in terms of performance, 2 medium role instances or 4 small role instances?
What are the pros and cons of each configuration?
The only real way to know if you gain advantage of using larger instances is trying and measuring, not asking here. This article has a table that says that a medium instance has everything twice as large as a small one. However in real life your mileage may vary and how this affects your application only you can measure.
Smaller roles have one important advantage - if instances fail separately you get smaller performance degradation. Supposing you know about "guaranteed uptime" requirement of having at least two instances, you have to choose between two medium and four small instances. If one small instance fails you lose 1/4 of your performance, but if one medium instance fails you lose half of performance.
Instances will fail if for example you have an unhandled exception inside Run() of your role entry point descendant and sometimes something just goes wrong big time and your code can't handle this and it'd better just restart. Not that you should deliberately target for such failures but you should expect them and take measures to minimize impact to your application.
So the bottom line is - it's impossible to say which gets better performance, but uptime implications are just as important and they are clearly in favor of smaller instances.
Good points by @sharptooth. One more thing to consider: when scaling in, the fewest number of instances is one, not zero. So, say you have a worker role that does some nightly task for an hour, and it requires either 2 Medium or 4 Small instances to get the job done in that timeframe. When the work is done, you may want to save costs by scaling to one instance and letting it run as one instance for 23 hours until the next nightly job. With a single Small instance, you'll burn 23 core-hours, and with a single Medium instance, you'll burn 46 core-hours. This thinking also applies to your Web role, but probably more so, since you will probably have a minimum of two instances to make sure you have the uptime SLA (it may not be as important for you to have the SLA on your worker if, say, your end user never interacts with it and it's just for utility purposes).
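The core-hour arithmetic from the paragraph above, written out. The core counts (1 for Small, 2 for Medium) are the ones implied by the 23 versus 46 figures.

```python
# Cores per instance size, as implied by the 23 vs 46 core-hour figures above.
CORES = {"Small": 1, "Medium": 2}


def idle_core_hours(size: str, instances: int = 1, hours: int = 23) -> int:
    """Core-hours burned while idling between nightly jobs."""
    return CORES[size] * instances * hours
```

The same function also covers the Web role case, e.g. `idle_core_hours("Medium", instances=2)` for a minimum of two instances.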
My general rule of thumb when sizing: Pick the smallest VM size that can properly do the work, and then scale out/in as needed. Your choice will primarily be driven by CPU, RAM, and network bandwidth needs (and don't forget you need network when moving data between Compute and Storage).
For a start, you won't get the guaranteed uptime of 99% unless you have at least 2 role instances; this allows one to die and be restarted while the other one takes the burden. Otherwise, it is a case of how much you want to pay and what specs you get on each. It has not caused me any hassle having more than one role instance; Azure hides the difficult stuff.
One other point may be worth a mention: if you use four small roles, you would be able to run two in one datacenter and two in another and use Traffic Manager to route people to whichever is closest. This might give you some performance gains.
Two mediums will give you more room to cache data at the compute level, and serving more from cache rather than from SQL Azure is going to be faster.
Ideally you have to follow @sharptooth's advice and measure and test. This is all very subjective, and I second David: you want to start as small as possible and scale outwards. We run this way. You really want to think about designing your app around sharding, as this fits the Azure model better than the traditional approach of just getting a bigger box to run everything on; at some point the bigger-box approach runs into limits, e.g. SQL Azure connection limits.
Tools like JMeter are your friend here and should let you load-test your app.
http://jmeter.apache.org/

What's the correct Cloudwatch/Autoscale settings for extremely short traffic spikes on Amazon Web Services?

I have a site running on Amazon Elastic Beanstalk with the following traffic pattern:
~50 concurrent users normally.
~2000 concurrent users for 1-2 minutes when a post is made to the Facebook page.
Amazon Web Services claims to be able to scale rapidly to challenges like this, but the "Greater than x for more than 1 minute" setup of CloudWatch doesn't appear to be fast enough for this traffic pattern.
Usually within seconds all the EC2 instances crash, killing all CloudWatch metrics, and the whole site is down for 4-6 minutes. So far I've yet to find a configuration that works for this scenario.
Here is the graph of a smaller event that also killed the site:
Are these links posted predictably? If so, you can use Scaling by Schedule, or as an alternative you might change the DESIRED-CAPACITY value of the Auto Scaling Group, or even trigger as-execute-policy to scale out right before your link is posted.
Do you know you can have multiple scaling policies in one group? So you might have a special Auto Scaling policy for your case, something like SCALE_OUT_HIGH, which adds, say, 10 more instances at once. Take a look at the as-put-scaling-policy command.
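For reference, a policy like SCALE_OUT_HIGH could also be defined through boto3 instead of the `as-put-scaling-policy` command-line tool mentioned above. This is a sketch under assumptions: the group name and cooldown are hypothetical, and the call itself needs AWS credentials and an existing Auto Scaling group.

```python
def scale_out_policy(group: str, name: str = "SCALE_OUT_HIGH",
                     add_instances: int = 10, cooldown_seconds: int = 300) -> dict:
    """Keyword arguments for boto3's Auto Scaling put_scaling_policy call."""
    return {
        "AutoScalingGroupName": group,
        "PolicyName": name,
        "AdjustmentType": "ChangeInCapacity",  # add a fixed number of instances
        "ScalingAdjustment": add_instances,
        "Cooldown": cooldown_seconds,
    }


def apply_policy(params: dict) -> dict:
    """Create or update the policy (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access
    return boto3.client("autoscaling").put_scaling_policy(**params)
```

Usage would be something like `apply_policy(scale_out_policy("my-beanstalk-asg"))`, where the group name is a placeholder for your own.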
Also, you need to check your code and find the bottlenecks.
What HTTPD do you use? Consider switching to Nginx, as it is much faster and less resource-hungry than Apache. Try using Memcached; a NoSQL store like Redis is a fine option for high read and write volumes as well.
The suggestion from AWS was as follows:
We are always working to make our systems more responsive, but it is
challenging to provision virtual servers automatically with a response
time of a few seconds as your use case appears to require. Perhaps
there is a workaround that responds more quickly or that is more
resilient when requests begin to increase.
Have you observed whether the site performs better if you use a larger
instance type or a larger number of instances in the steady state?
That may be one method to be resilient to rapid increases in inbound
requests. Although I recognize it may not be the most cost-effective,
you may find this to be a quick fix.
Another approach may be to adjust your alarm to use a threshold or a
metric that would reflect (or predict) your demand increase sooner.
For example, you might see better performance if you set your alarm to
add instances after you exceed 75 or 100 users. You may already be
doing this. Aside from that, your use case may have another indicator
that predicts a demand increase, for example a posting on your
Facebook page may precede a significant request increase by several
seconds or even a minute. Using CloudWatch custom metrics to monitor
that value and then setting an alarm to Auto Scale on it may also be a
potential solution.
So I think the best answer is to run more instances at lower traffic and use custom metrics to predict traffic from an external source. I am going to try, for example, monitoring Facebook and Twitter for posts with links to the site and scaling up straight away.
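Publishing such an external signal as a CloudWatch custom metric might look roughly like this with boto3. The namespace and metric name are invented for illustration, and how you detect the posts in the first place is outside this sketch.

```python
def link_posts_metric(count: int, namespace: str = "MySite/Social") -> dict:
    """Keyword arguments for boto3's CloudWatch put_metric_data call."""
    return {
        "Namespace": namespace,            # hypothetical custom namespace
        "MetricData": [{
            "MetricName": "NewLinkPosts",  # hypothetical metric name
            "Value": float(count),
            "Unit": "Count",
        }],
    }


def publish(params: dict) -> None:
    """Send the data point (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access
    boto3.client("cloudwatch").put_metric_data(**params)
```

An alarm on this metric could then trigger a scale-out policy as soon as a post is detected, rather than waiting for load on the instances to rise.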

Prototyping for Amazon EC2

How do people (and start-up companies) actually go about prototyping/deploying things on Amazon while keeping costs reasonable? Last month we were experimenting with some specific applications and running our own Hadoop cluster, and managed to spend almost 1.5k just on tests. Sure, they have micro instances, but what if your application is so intensive that it actually requires a larger instance to even test? So I'd like some input as to how people go about doing this.
Several key issues:
Consider a local testbed for some purposes & consider if a given test really needs EC2. If it's really so hard to wrangle 2-4 machines to use as a testbed for Hadoop, there's a different problem. Get your head around whatever you're going to run, how Hadoop will play a role, and kick the tires on that. In time, you will also want to change your grid, upgrade software, tinker with other ideas, etc. When you go to EC2, you'll have smoothed some rough edges already.
Don't use a larger capacity machine than you need while getting the hang of things. If you're not pushing lots of data or compute cycles through at this stage, don't bother with cluster compute nodes, massive RAM instances, etc. Just focus on getting things set up correctly.
When you are ready to retarget to more powerful machines, try a few different machine setups. Maybe the cluster compute instances will pay off, maybe you don't need that kind of throughput: until you know your bottlenecks, don't overspend.
Be sure to use spot instances frequently during the testing phase. You will typically pay about 50% of the on-demand price.
If you get to a point where you want to pay for on-demand instances, have a separate instance start and stop Hadoop instances as needed - unless you need a big cluster all on cluster compute instances.
Prepare your AMIs to get launched as quickly as possible (under 1 minute) and never leave anything running overnight or over a weekend if it isn't necessary.
Until you get the system set up and running, you're basically paying tuition to learn how to get everything tailored to your needs. Just pay the "tuition" to learn each lesson (configurations, bottlenecks, scaling up, etc.), rather than try to take on everything at once. When you approach it as a series of lessons to be learned, it is less painful to spend the money, but as long as you know what you're about to test and learn, you will also spend money more judiciously.
Finally, compare the $1500 to the labor costs of this learning experience - it probably isn't a big deal in the long run. Once you know that something is going to be a reasonable block of computational effort, it's well engineered, and will finish quickly (albeit on many machines), it isn't so painful to spend money on it. Right now, it's hard to appreciate what you're learning because it doesn't yet benefit your org's goals.
To address the cost issue while doing a proof of concept on the Amazon cloud:
I created a lightweight Java application using the Amazon AWS API, which creates the cloud instances when I want to run a test on them. Once the test has finished or failed to start, the application terminates the instances immediately and sends out a diagnostic mail.
So no Amazon instance is kept running or sitting idle, which can happen if you create/terminate manually or through a separate program.
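The "always terminate, whatever happens to the test" pattern can be sketched like this, in Python rather than the Java used by the application described above. The `launch` and `terminate` callables are hypothetical wrappers around whatever cloud API you use.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def temporary_instances(launch: Callable[[int], List[str]],
                        terminate: Callable[[List[str]], None],
                        count: int) -> Iterator[List[str]]:
    """Launch instances for the duration of a test and always terminate them."""
    ids = launch(count)
    try:
        yield ids
    finally:
        # Runs whether the test passed, failed, or never started properly.
        terminate(ids)
```

A test run then becomes `with temporary_instances(launch, terminate, 4) as ids: ...`, and nothing is left running if the body raises.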
Consider using spot instances. If you overbid, you can be almost sure they won't be terminated. In the long run their price is on the level of reserved instances, but you don't need to pay upfront. I believe you could also schedule the tests for off-peak hours to get even better prices, or switch to on-demand if the spot price exceeds the on-demand one; Hadoop should handle it nicely. Check this article about spot instances. It also has references to two other articles that analyze the potential of spot instances.

Resources