Crawling With Multiple EC2 Instances

I've got a crawling process written in Python that I am running on an EC2 instance on Amazon. I've written the crawler so that it reports back to a separate "hub" instance with its results. The hub processes the results, and the crawler is free to keep crawling. What I had in mind with this setup is that it would be easy to clone several instances of the crawler, having each of them report back to the hub for processing.
So, at this point, I have one hub and 8 separate crawlers (all on their own instances) continually crawling and reporting back.
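To illustrate the setup, here is a minimal sketch of the crawler-to-hub reporting step; the hub address, port, endpoint, and payload shape below are placeholders for illustration only, not details from the actual system:

```python
# Minimal sketch of a crawler reporting one result back to the hub.
# The hub address and the /results endpoint are placeholders.
import requests

HUB_URL = "http://10.0.0.10:8080/results"  # placeholder hub address

def report_to_hub(page_url, extracted_data):
    """POST one crawl result to the hub, then return so crawling can continue."""
    payload = {"url": page_url, "data": extracted_data}
    resp = requests.post(HUB_URL, json=payload, timeout=5)
    resp.raise_for_status()

report_to_hub("https://example.com/some-page", {"title": "Example"})
```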
My thinking with the small, separate crawlers was:
There is redundancy so if one crawler gets hung up, the rest of the crawlers can keep working.
(This is an assumption) I have better network utilization if each crawler has its own independent IP.
I can spin up several crawlers or scale down depending on my current needs.
My question is: is this efficient? Would I be better off spinning up one larger instance and somehow splitting up the network utilization?
Thank you in advance for your input. Please forgive me if this is off-topic for Stack Overflow.

My view on your question:
(1) There is redundancy so if one crawler gets hung up, the rest of the crawlers can keep working.
Set up an Auto Scaling group to manage these crawler instances.
(2) (This is an assumption) I have better network utilization if each crawler has its own independent ip.
Yes, each EC2 instance can have its own public and private IP if created in a public subnet. Within one region, you can launch the instances in different Availability Zones (for example, the us-west-2 region has three AZs). With that, you can spread the network usage.
(3) I can spin up several crawlers or scale down depending on my current needs.
With an Auto Scaling group, this is easy to control (see the sketch after this answer).
My question is: is this efficient?
If you can, run EC2 instances in different regions (US, EU, Asia, etc.) to reduce latency to some websites.
Would I be better off spinning up a larger instance and somehow splitting up the network utilization?
In your case, separate smaller instances should be the better solution, and it also saves you money. You might also consider using Spot Instances for these crawlers to save even more.
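As a rough sketch of the Auto Scaling suggestion, the crawler fleet could be resized by changing the group's desired capacity with boto3; the group name below is a placeholder, not something from the question:

```python
# Sketch: scale the crawler fleet up or down by changing the Auto Scaling
# group's desired capacity. "crawler-asg" is a placeholder group name.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

def set_crawler_count(count):
    """Resize the crawler fleet to the requested number of instances."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="crawler-asg",
        DesiredCapacity=count,
        HonorCooldown=False,
    )

set_crawler_count(8)  # scale out for a big crawl
set_crawler_count(2)  # scale back in afterwards
```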

Related

Behavior of EC2 startup times

I have a use case where we have a very large computation job which can be broken up into many small units of work fairly efficiently. There could be, let's say, 1,000 hours of computational work for an m4.large instance. If I wanted the result back within the next 10 minutes, that would mean I would need 6,000 instances to get the job done in time.
So far I have set up AWS Batch, and I haven't used any more than the 20 m4.large instances your account comes with. I know I can ask AWS to raise the instance limit, but I still don't know much about what the behaviour is if you suddenly try to provision thousands of on-demand instances, or how AWS limits the number of instances you can use.
So my question is: am I able to launch thousands of m4.large instances on-demand? And if so, what sort of times would I be looking at for all instances to reach the Running state?
I have done this many times with ~100 instances but never in the thousands of instances.
STEP 1: Open a support ticket with AWS. You will need to get your account approved, credit checked, etc. My customers are very big companies, so for them the credit and approval process is easy. If you are a little guy, I don't know.
STEP 2: Think through your VPC design and how you will address that many instances. It is one thing to have 5 instances going through a NAT Gateway, but a hundred systems will bring Internet connectivity to its knees.
STEP 3: Think through the networking bandwidth required. Do you need placement groups or very high speed intranet or Internet connectivity?
STEP 4: Be prepared that you may not be able to launch all instances with a specific instance type (capacity-not-available error). Have a selection of instance types that you can fall back on.
STEP 5: Create your own software (I use Python) to launch the instances, perform updates, install software, etc. You can then poll the instances using the Boto3 EC2 API to determine when all the instances are running. The length of time for 1,000 instances won't be much different than for 1 instance.
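As a rough illustration of STEP 5, a boto3 sketch that launches a batch and polls until every instance is running; the AMI ID and counts are placeholders:

```python
# Sketch: launch a batch of instances and wait for them all to reach the
# "running" state. The AMI ID and instance count are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m4.large",
    MinCount=100,
    MaxCount=100,
)
instance_ids = [i["InstanceId"] for i in resp["Instances"]]

# The waiter polls DescribeInstances until every instance is running.
waiter = ec2.get_waiter("instance_running")
waiter.wait(InstanceIds=instance_ids)
print(f"{len(instance_ids)} instances are running")
```

For thousands of instances you would likely issue run_instances in several batches and collect the IDs from each call.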
Now for the real world. If your job takes 1,000 hours, launching 1,000 instances will not reduce it to 1 hour unless you have a really scalable software design with minimum inter-machine communications required. Once you go beyond 10 systems, networking bandwidth and communications overhead becomes an issue. Even though AWS's resources are huge, launching 1,000 EC2 instances at one time by one customer is not a common launch case.
I would also NOT launch 1,000 instances to get processing down to 10 minutes. It can take 10 minutes for your instances to come online, get updated, synchronize, etc. This means that you will be spending 50% of your budget on waiting time. For really large jobs today we prefer to use Hadoop / Spark where scaling to hundreds of machines is realistic.
You can contact AWS Customer Service to increase your EC2 limits (use the link shown in the Limits section of the EC2 management console). They will verify your use-case.
You might also consider using Spot Pricing to lower your costs. Spot instances take longer to provision.
Sample use-case: Gigaom | Cycle Computing once again showcases Amazon’s high-performance computing potential
There are also services like Spotinst that can help you provision servers at the lowest possible cost.

What's the correct Cloudwatch/Autoscale settings for extremely short traffic spikes on Amazon Web Services?

I have a site running on Amazon Elastic Beanstalk with the following traffic pattern:
~50 concurrent users normally.
~2000 concurrent users for 1-2 minutes when a post is made to our Facebook page.
Amazon Web Services claims to be able to scale rapidly to challenges like this, but the "greater than x for more than 1 minute" setup of CloudWatch doesn't appear to be fast enough for this traffic pattern.
Usually within seconds all the EC2 instances crash, killing all CloudWatch metrics, and the whole site is down for 4-6 minutes. So far I've yet to find a configuration that works for this scenario.
Here is the graph of a smaller event that also killed the site: [graph not included here]
Are these links posted predictably? If so, you can use scaling by schedule, or as an alternative you might change the DESIRED-CAPACITY value of the Auto Scaling group, or even trigger as-execute-policy to scale out right before your link is posted.
Did you know you can have multiple scaling policies in one group? So you might have a special Auto Scaling policy for your case, something like SCALE_OUT_HIGH, which adds, say, 10 more instances at once. Take a look at the as-put-scaling-policy command.
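A rough boto3 equivalent of that as-put-scaling-policy / as-execute-policy idea; the group and policy names are placeholders:

```python
# Sketch: an aggressive scale-out policy that adds 10 instances at once,
# plus a manual trigger to fire it just before the Facebook post goes out.
# "web-asg" and "SCALE_OUT_HIGH" are placeholder names.
import boto3

autoscaling = boto3.client("autoscaling")

# Create (or update) the burst scale-out policy.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="SCALE_OUT_HIGH",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=10,   # add 10 instances in one step
    Cooldown=120,
)

# Fire it manually right before posting the link.
autoscaling.execute_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="SCALE_OUT_HIGH",
    HonorCooldown=False,
)
```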
Also, you need to check your code and find bottlenecks.
Which HTTP server do you use? Consider switching to Nginx, as it is much faster and consumes fewer resources than Apache. Try using Memcached; a NoSQL store like Redis is a fine option for high read and write loads as well.
The suggestion from AWS was as follows:
We are always working to make our systems more responsive, but it is challenging to provision virtual servers automatically with a response time of a few seconds, as your use case appears to require. Perhaps there is a workaround that responds more quickly or that is more resilient when requests begin to increase.
Have you observed whether the site performs better if you use a larger instance type or a larger number of instances in the steady state? That may be one method to be resilient to rapid increases in inbound requests. Although I recognize it may not be the most cost-effective, you may find this to be a quick fix.
Another approach may be to adjust your alarm to use a threshold or a metric that would reflect (or predict) your demand increase sooner. For example, you might see better performance if you set your alarm to add instances after you exceed 75 or 100 users. You may already be doing this. Aside from that, your use case may have another indicator that predicts a demand increase; for example, a posting on your Facebook page may precede a significant request increase by several seconds or even a minute. Using CloudWatch custom metrics to monitor that value and then setting an alarm to Auto Scale on it may also be a potential solution.
So I think the best answer is to run more instances at lower traffic and use custom metrics to predict traffic from an external source. I am going to try, for example, monitoring Facebook and Twitter for posts with links to the site and scaling up straight away.
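As a sketch of that idea, the external watcher could publish a custom CloudWatch metric that an alarm and scaling policy react to before the traffic actually arrives; the namespace and metric name below are made up:

```python
# Sketch: publish a custom CloudWatch metric meaning "a link to the site
# was just posted", so an alarm can trigger scaling ahead of the spike.
# Namespace and metric name are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_social_post(count=1):
    cloudwatch.put_metric_data(
        Namespace="MySite/Social",
        MetricData=[{
            "MetricName": "LinkPosted",
            "Value": float(count),
            "Unit": "Count",
        }],
    )

# Call this from whatever process watches Facebook/Twitter for new posts.
record_social_post()
```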

AWS Autoscaling and Elastic load balancing

For my application I am using Auto Scaling without Elastic Load Balancing. Is there any performance issue with using Auto Scaling directly, without an ELB?
David is right.
Auto Scaling allows you to scale instances (based on CloudWatch metrics, a single event, or a recurring schedule).
Suppose you have three instances running (scaled with Auto Scaling): how is traffic going to reach them? You need to implement load balancing somewhere, which is why Elastic Load Balancing is so useful.
Without it, your traffic can only be directed in a poorly engineered manner.
See Slide #5 of this presentation on slideshare, to get a sense of the architecture: http://www.slideshare.net/harishganesan/scale-new-business-peaks-with-auto-scaling
Autoscaling determines, based on some measurement (CPU load is a common measurement), whether or not to increase/decrease the number of instances running.
Load balancing relates to how you distribute traffic to your instances, based on domain name lookup, etc. Somewhere you must have knowledge of which IP addresses are currently assigned to the instances that autoscaling creates.
You can have multiple IP address entries for A records in the DNS settings and machines will be allocated in a roughly round-robin fashion from that pool. But, keeping the pool up to date in real-time is hard.
The load balancer gives you an easy mechanism to provide a single interface/IP address to the outside world and it has knowledge of which instances it is load balancing in real time.
If you are using autoscaling, unless you are going to create a fairly complex monitoring and DNS updating system, you can reasonably assume that you must use a load balancer as well.
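For completeness, a hedged sketch of wiring an existing Auto Scaling group to a load balancer target group with boto3, so newly launched instances are registered with the load balancer automatically; the group name and target group ARN are placeholders:

```python
# Sketch: attach an Auto Scaling group to an ALB target group so that
# instances launched by scaling are registered with the load balancer
# automatically. Names and the ARN are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.attach_load_balancer_target_groups(
    AutoScalingGroupName="web-asg",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
        "targetgroup/web/0123456789abcdef"
    ],
)
```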

Prototyping for Amazon EC2

How do people (and startup companies) actually go about prototyping/deploying things on Amazon and keep costs reasonable? Last month we were experimenting with some specific applications, running our own Hadoop cluster, and managed to spend almost $1.5k just for tests. Sure, they have micro instances, but what if your application is so intensive it actually requires a larger instance even to test? So I'd like some input on how people go about doing this.
Several key issues:
Consider a local testbed for some purposes & consider if a given test really needs EC2. If it's really so hard to wrangle 2-4 machines to use as a testbed for Hadoop, there's a different problem. Get your head around whatever you're going to run, how Hadoop will play a role, and kick the tires on that. In time, you will also want to change your grid, upgrade software, tinker with other ideas, etc. When you go to EC2, you'll have smoothed some rough edges already.
Don't use a larger capacity machine than you need while getting the hang of things. If you're not pushing lots of data or compute cycles through at this stage, don't bother with cluster compute nodes, massive RAM instances, etc. Just focus on getting things set up correctly.
When you are ready to retarget to more powerful machines, try a few different machine setups. Maybe the cluster compute instances will pay off, maybe you don't need that kind of throughput: until you know your bottlenecks, don't overspend.
Be sure to use spot instances frequently during the testing phase. You will typically pay about 50% of the on-demand price.
If you get to a point where you want to pay for on-demand instances, have a separate instance start and stop Hadoop instances as needed - unless you need a big cluster all on cluster compute instances.
Prepare your AMIs to get launched as quickly as possible (under 1 minute) and never leave anything running overnight or over a weekend if it isn't necessary.
Until you get the system set up and running, you're basically paying tuition to learn how to get everything tailored to your needs. Just pay the "tuition" to learn each lesson (configurations, bottlenecks, scaling up, etc.), rather than try to take on everything at once. When you approach it as a series of lessons to be learned, it is less painful to spend the money, but as long as you know what you're about to test and learn, you will also spend money more judiciously.
Finally, compare the $1500 to the labor costs of this learning experience - it probably isn't a big deal in the long run. Once you know that something is going to be a reasonable block of computational effort, it's well engineered, and will finish quickly (albeit on many machines), it isn't so painful to spend money on it. Right now, it's hard to appreciate what you're learning because it doesn't yet benefit your org's goals.
To address the cost issue while doing a proof of concept on the Amazon cloud:
I created a lightweight Java application using the Amazon AWS API which creates the cloud instances when I want to run a test on them. Once the test finishes or fails to start, the application terminates the instances immediately and sends out a diagnostic mail.
So no Amazon instance is left running or sitting idle, which can happen if you create/terminate manually or through a separate program.
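The same launch/test/terminate pattern, sketched in Python with boto3 (the original answer used Java); the AMI ID and the test driver are placeholders:

```python
# Sketch: create an instance only for the duration of a test, and always
# terminate it afterwards, even if the test fails to start.
import boto3

ec2 = boto3.client("ec2")

def run_test(instance_ids):
    # Placeholder for the real test driver.
    print(f"running test on {instance_ids}")

def run_disposable_test():
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="m4.large",
        MinCount=1,
        MaxCount=1,
    )
    instance_ids = [i["InstanceId"] for i in resp["Instances"]]
    try:
        ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
        run_test(instance_ids)
    finally:
        # Terminate no matter what, so nothing is left running and billing.
        ec2.terminate_instances(InstanceIds=instance_ids)

run_disposable_test()
```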
Consider using Spot Instances. If you overbid, you can be almost sure they won't be terminated. In the longer run they are priced at roughly the level of Reserved Instances, but you don't need to pay upfront. I believe you could also schedule the tests for off-peak hours to reach even better prices, or switch to on-demand if the Spot price exceeds the on-demand one; Hadoop should handle that nicely. Check this article about Spot Instances; it also references two other articles that analyze their potential.
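A hedged sketch of requesting Spot capacity with a price cap via the EC2 run_instances market options; the AMI and max price are placeholder values:

```python
# Sketch: launch Spot instances with an explicit price cap ("overbidding"
# relative to the usual Spot price). AMI and MaxPrice are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m4.large",
    MinCount=4,
    MaxCount=4,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.05",  # cap set well above the typical Spot price
            "SpotInstanceType": "one-time",
        },
    },
)
```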

Load balancing: DNS round robin in front of hardware load balancers. How to share stickiness?

DNS Round Robin (DRR) permits cheap load balancing (distribution is a better term). Its advantage is that it permits effectively unlimited horizontal scaling. The drawback is that if one of the web servers goes down, some clients continue to use the broken IP for minutes (minimum TTL 300 s) or more, even if the DNS implements failover.
A Hardware Load Balancer (HLB) handles such web server failures transparently, but it cannot scale its bandwidth indefinitely. A hot spare is also needed.
A good solution seems to be using DRR in front of a group of HLB pairs. Each HLB pair never goes down, and therefore DRR never keeps clients pointed at a dead endpoint. Plus, when bandwidth isn't enough you can add a new HLB pair to the group.
Problem: DRR moves clients randomly between the HLB pairs and therefore (AFAIK) session stickiness cannot work.
I could simply avoid using session stickiness, but it makes better use of caches, so it is something I want to preserve.
Question: is there an HLB implementation where an instance can share its (session ID, web server) mapping with other instances?
If this is possible then a client would be routed to the same web server independently by the HLB that routed the request.
Thanks in advance.
Modern load balancers have very high throughput capabilities (gigabit). So unless you're running a huuuuuuuuuuge site (e.g. google), adding bandwidth is not why you'll need a new pair of load balancers, especially since most large sites offload much of their bandwidth to CDNs (Content Delivery Networks) like Akamai. If you're pumping a gigabit of un-CDN-able data through your site and don't already have a global load-balancing strategy, you've got bigger problems than cache affinity. :-)
Instead of bandwidth limits, sites tend to add additional LB pairs for geo-distribution of servers at separate data centers to ensure users spread across the world can talk to a server closest to them.
For that latter scenario, load balancer companies offer geo-location solutions, which (at least until a few years ago, when I was last following this space) were based on custom DNS implementations that looked at client IPs and resolved to the load balancer pair's virtual IP address "closest" (in network topology or measured performance) to the client. These days, CDNs like Akamai also offer global load balancing services (e.g. http://www.akamai.com/html/technology/products/gtm.html). Amazon's EC2 hosting also supports this kind of feature for sites hosted there (see http://aws.amazon.com/elasticloadbalancing/).
Since users tend not to move across continents in the course of a single session, you automatically get affinity (aka "stickiness") with geographic load balancing, assuming your pairs are located in separate data centers.
Keep in mind that geo-location is really hard since you also have to geo-locate your data to ensure your back-end cross-data-center network doesn't get swamped.
I suspect that F5 and other vendors also offer single-datacenter solutions which achieve the same ends, if you're really concerned about the single point of failure of network infrastructure (routers, etc.) inside your datacenter. But router and switch vendors have high-availability solutions which may be more appropriate to address that issue.
Net-net, if I were you I wouldn't worry about multiple pairs of load balancers. Get one pair and, unless you have a lot of money and engineering time to burn, partner with a hoster who's good at keeping their data center network up and running.
That said, if cache affinity is such a big deal for your app that you're thinking about shelling out big $$$ for multiple pairs of load balancers, it may be worth considering some app architecture changes (like using an external caching cluster). Solutions like memcached (for Linux) are designed for this scenario. Microsoft also has one coming called "Velocity".
Anyway, hope this is useful info-- it's admittedly been a while since I've been deeply involved in this space (I was part of the team which designed an application load balancing product for a large software vendor) so you might want to double-check my assumptions above with facts you can pull off the web from F5 and other LB vendors.
Ok, this is an ancient question which I just found through a Google search. But for any future visitors, here are some additional clarifications:
Problem: [DNS Round Robin] moves clients randomly between the HLB pairs and therefore (AFAIK) session stickiness cannot work.
This premise is, as best I can tell, not accurate. It seems nobody really knows what old browsers might do, but presumably each browser window will stay on the same IP address as long as it's open. Newer operating systems probably obey the "match longest prefix" rule. Thus there shouldn't be much 'flapping', i.e. randomly switching from one load balancer IP to another.
However, if you're still worried about users getting randomly reassigned to a new load balancer pair, then a small modification of the classic L3/4 & L7 load balancing setup can help:
Publish DNS Round Robin records that point to virtual high-availability IPs handled by L4 load balancers.
Have the L4 load balancers forward to pairs of L7 load balancers based on the origin IP address, i.e. use consistent hashing of the end user's IP to always route the same end user to the same L7 load balancer (sketched below).
Have your L7 load balancers use "sticky sessions" as you want them to.
Essentially this is just a small modification to what Willy Tarreau (the creator of HAProxy) wrote years ago.
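To make the consistent-hashing step concrete, here is a minimal Python sketch of a hash ring keyed on the origin IP; real L4 devices implement this in their own firmware, and the pair names below are made up:

```python
# Sketch: consistent hashing of the client IP onto L7 load balancer pairs,
# so the same end user always lands on the same pair. Pair names are
# placeholders; this only illustrates the idea.
import bisect
import hashlib

def _hash(key):
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each node owns several points on a ring; a key maps to the next point."""
    def __init__(self, nodes, replicas=100):
        points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    def node_for(self, client_ip):
        idx = bisect.bisect(self._keys, _hash(client_ip)) % len(self._keys)
        return self._nodes[idx]

ring = ConsistentHashRing(["lb-pair-1", "lb-pair-2", "lb-pair-3"])
print(ring.node_for("203.0.113.42"))  # the same IP always maps to the same pair
```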
Thanks for putting things in the right perspective.
I agree with you.
I did some reading and found:
Flickr: http://highscalability.com/flickr-architecture
4 billion queries per day --> about 50000 queries/s
Youtube: http://highscalability.com/youtube-architecture
100 million video views/day --> about 1200 video views/second
PlentyOfFish: http://highscalability.com/plentyoffish-architecture
600 pages/second
200 Mbps used
CDN used
Twitter: http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster
300 tweets/second
600 req/s
A very top-end LB like this can scale up to:
200,000 SSL handshakes per second
1 million TCP connections per second
3.2 million HTTP requests per second
36 Gbps of TCP or HTTP throughput
Therefore, you are right: an LB could hardly become the bottleneck.
Anyway, I found this (old) article, http://www.tenereillo.com/GSLBPageOfShame.htm, where it is explained that geo-aware DNS could create availability issues.
Could someone comment on that article?
Thanks,
Valentino
So why not keep it simple and have the DNS server give out a certain IP address (or addresses) based on the origin IP address, i.e. use consistent hashing of the end user's IP to always give the same end user the same IP address(es)?
I'm aware that this only provides a simple and cheap load distribution mechanism.
I have been looking for this, but haven't found a DNS server which implements it (although BIND has some possibilities with views).
