Scrapy spiders drastically slow down while running on AWS EC2

I am using Scrapy to scrape multiple sites and Scrapyd to run the spiders.
I have written 7 spiders, and each spider processes at least 50 start URLs. In total I have around 7000 URLs, about 1000 URLs per spider.
I place jobs in Scrapyd with 50 start URLs per job. Initially all the spiders respond fine, but suddenly they start working really slowly.
While running Scrapyd on localhost I get very high performance, but as soon as I publish jobs to the Scrapyd server on EC2, the request/response time drops drastically.
After some time on the server, the response time for each start URL becomes really slow.
The settings look like this:
BOT_NAME = 'service_scraper'
SPIDER_MODULES = ['service_scraper.spiders']
NEWSPIDER_MODULE = 'service_scraper.spiders'
CONCURRENT_REQUESTS = 30
# DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS_PER_DOMAIN = 1000
ITEM_PIPELINES = {
'service_scraper.pipelines.MongoInsert': 300,
}
MONGO_URL="mongodb://xxxxx:yyyy"
EXTENSIONS = {'scrapy.contrib.feedexport.FeedExporter': None}
HTTPCACHE_ENABLED = True
We tried changing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN, but nothing worked. We host Scrapyd on AWS EC2.

As with all performance testing, the goal is to find the performance bottleneck. It typically comes down to one (or more) of the following (a quick spot-check sketch follows the list):
Memory: Use top to measure memory consumption. If too much memory is consumed, it might swap to disk, which is slower than RAM. Try adding memory.
CPU: Use Amazon CloudWatch to track CPU. Be very careful with t2 instances (see below).
Disk speed: If the job is disk-intensive, or if memory is swapping to disk, this can impact performance -- especially for databases. Amazon EBS is network-attached disk, so network speed can actually throttle disk speed.
Network speed: Due to the multi-tenant design of Amazon EC2, network bandwidth is intentionally throttled. The amount of network bandwidth available depends upon the instance type used.
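A quick, rough way to spot-check all four of these from inside the instance is the psutil library; this is only a minimal sketch (assuming psutil has been installed, e.g. with pip), not a replacement for top or CloudWatch:
import psutil

# Snapshot the four usual bottleneck candidates on the instance itself.
mem = psutil.virtual_memory()
swap = psutil.swap_memory()
cpu = psutil.cpu_percent(interval=1)     # sample CPU usage over one second
disk = psutil.disk_io_counters()
net = psutil.net_io_counters()

print(f"memory used: {mem.percent}% (swap used: {swap.percent}%)")
print(f"cpu used:    {cpu}%")
print(f"disk io:     {disk.read_bytes} bytes read, {disk.write_bytes} bytes written")
print(f"network io:  {net.bytes_recv} bytes received, {net.bytes_sent} bytes sent")
If swap usage climbs or CPU sits pinned while the spiders slow down, that points at the first two items; steadily growing disk or network counters with a flat CPU point at the latter two.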
You are using a t2.small instance. It has:
Memory: 2GB (This is less than the 4GB on your own laptop)
CPU: The t2 family is extremely powerful, but the t2.small only receives an average 20% of CPU (see below).
Network: The t2.small is rated as Low to Moderate network bandwidth.
The fact that your CPU is recording 60%, while the t2.small is limited to an average 20% of CPU indicates that the instance is consuming CPU credits faster than they are being earned. This leads to an eventual exhaustion of CPU credits, thereby limiting the machine to 20% of CPU. This is highly likely to be impacting your performance. You can view CPU Credit balances in Amazon CloudWatch.
See: T2 Instances documentation for an understanding of CPU Credits.
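If you prefer to pull the credit balance programmatically rather than through the console, a minimal boto3 sketch looks like this (the region and instance ID are placeholders):
import boto3
from datetime import datetime, timedelta

# Fetch the CPUCreditBalance metric for the last six hours (5-minute averages).
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
A balance trending towards zero while the jobs run is the classic sign of the credit exhaustion described above.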
Network bandwidth is relatively low for the t2.small. This impacts Internet access and communication with the Amazon EBS storage volume. Given that your application is downloading lots of web pages in parallel, and then writing them to disk, this is also a potential bottleneck for your system.
Bottom line: When comparing to the performance on your laptop, the instance in use has less memory, potentially less CPU due to exhaustion of CPU credits and potentially slower disk access due to high network traffic.
I recommend you use a larger instance type to confirm that performance is improved, then experiment with different instance types (both in the t2 family and outside of it) to determine what size machine gives you the best price/performance trade-off.
Continue to monitor the CPU, Memory and Network performance to identify the leading bottleneck, then aim to fix that bottleneck.

Related

Disk latency causing CPU spikes on EC2 instance

We are having an interesting issue where we see a CPU spike on our EC2 instance and, at the same time, a spike in disk latency. Here is the pattern for the CPU spike:
CPU spike from 50% to 100% within 30 seconds
It stays at 100% utilization for two minutes
CPU utilization drops from 100% to almost 0 in 10 seconds. At almost the same time, disk latency is also back to normal
This issue has happened on different AWS EC2 instances a couple of times over a week and is still happening. In all cases we see the CPU spike together with the disk latency spike, with the CPU spike following a pattern similar to the one above.
We set up process monitoring tools to check whether any particular process was occupying the CPU. They revealed that each process on the EC2 instance starts taking roughly twice its usual CPU. For example, our app server's CPU utilization increases from 0.75% to 1.5%; we observed similar behaviour for Nginx and other processes. No single process occupied more than 8% CPU. We studied our traffic pattern and there is nothing unusual that could cause this. So the question is:
Can an increase in disk latency cause a CPU spike pattern like the one above? In general, can disk latency result in CPU spikes?
Here is my bet: you are running t2/t3 machines, which are burstable instances. You can use 30% of the CPU all the time, and a credit system creates a fair, predictable usage model for the remaining 70%. You earn credits by running the instance and spend credits by going above 30% CPU usage.
You are running out of credits, and AWS then reduces your access to the CPU. The system runs smoothly again once credits are added back to your balance.
t2 and t3 don't have the same credit system; you can find the details here: CPU Credits and baseline
You have two solutions:
Take a bigger instance, so you get more credits per hour and a better baseline, or move to another family such as c5, m5, r5, etc.
Turn on the unlimited mode option for your t3 instances (see the sketch after this list).
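For the second option, unlimited mode can be switched on without stopping the instance; here is a minimal boto3 sketch, assuming the instance ID and region below are replaced with your own:
import boto3

# Switch a burstable (t2/t3) instance to unlimited CPU credit mode.
ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
ec2.modify_instance_credit_specification(
    InstanceCreditSpecifications=[
        {"InstanceId": "i-0123456789abcdef0", "CpuCredits": "unlimited"}  # placeholder ID
    ]
)
Keep in mind that unlimited mode can incur extra charges when the instance bursts above its baseline for long periods.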
I would suggest faster storage. CPU time always adds up to 100%, and limiting shows up in this strange way where time appears as usage for an "unknown" reason. That time can fall into one of these categories:
idle time (notice that this is what you consider FREE CPU; that's why I say it adds up to 100%)
user time (normal usage)
system time (system usage)
iowait (your case: the CPU waiting for the HDD/SSD to answer)
nice time (low-priority processes that are not included in user time)
interrupt time (time spent "talking" to external devices; could be your case if you have many USB devices etc., but rather unlikely)
softirq (queued work from a processed interrupt - see above)
steal time (the case that Clement is describing)
I would suggest confirming which one applies in your case.
You can try the commands below to get the info (or the psutil sketch that follows them):
$ sudo apt-get install sysstat
$ mpstat -P ALL 1
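If you'd rather collect the same breakdown from Python (for example, to log it alongside your application metrics), psutil exposes the per-CPU counters too; a minimal sketch, assuming psutil is installed on a Linux host:
import psutil

# Per-CPU breakdown of where time goes; iowait and steal are the interesting
# columns for this problem (these fields are available on Linux).
for cpu_id, t in enumerate(psutil.cpu_times_percent(interval=1, percpu=True)):
    print(f"cpu{cpu_id}: user={t.user}% system={t.system}% "
          f"iowait={t.iowait}% steal={t.steal}% idle={t.idle}%")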
From here there are two options for you :)
EBS lets you use an IO-optimized volume type called "io1" (mid price - mid speed)
Change the machine and use one built on the "Nitro System" (provides bare-metal capabilities, that is: as if you had an actual NVMe drive connected directly - maximum possible speed)
m5.2xlarge - 8 vCPU, 37 ECU, 32 GiB RAM, EBS-only storage - $0.384 per hour
m5d.2xlarge - 8 vCPU, 37 ECU, 32 GiB RAM, 1 x 300 GB NVMe SSD - $0.452 per hour
Source: Instances built on the Nitro System

Tool to load balance tasks in distributed system

I'm a bit stumped on searching for a tool to help me with a pesky load balancing problem.
Say you have a large variety of repeated hourly, daily and weekly units of code that can vary greatly in their RAM, CPU, disk and network usage. The usage is known and tracked fairly well.
Now say you have N servers to execute these tasks. The RAM, CPU, disk and network limits of these servers are known as well.
What tools are available that can automatically delegate tasks to these servers in a manner that ensures resources won't be strained?
For example (simplified; a first-fit sketch of this logic follows the list):
Task A - Consumes 20% CPU, 15GB of RAM -> Push to Server P that has 100% CPU and 32GB RAM Available
Task B - Consumes 10% CPU, 20GB of RAM -> Server P would be Memory strained if I allocate this. Push to Server L that has 100% CPU and 32GB RAM Available
Task C - Consumes 10% CPU, 2GB of RAM -> Push to Server P that has 80% CPU and 17GB RAM Available
Task A - Reports finish, Server P now has 90% CPU and 30GB RAM
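There may well be an off-the-shelf scheduler for this, but to make the allocation logic in the example concrete, here is a minimal first-fit sketch (the Task and Server classes are hypothetical, not from any particular tool):
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    cpu_free: float   # percent of CPU still available
    ram_free: float   # GB of RAM still available

@dataclass
class Task:
    name: str
    cpu: float        # percent of CPU the task consumes
    ram: float        # GB of RAM the task consumes

def assign(task, servers):
    """First-fit: place the task on the first server with enough headroom."""
    for server in servers:
        if server.cpu_free >= task.cpu and server.ram_free >= task.ram:
            server.cpu_free -= task.cpu
            server.ram_free -= task.ram
            return server.name
    return None  # no server has room; queue or reject the task

servers = [Server("P", 100, 32), Server("L", 100, 32)]
for task in [Task("A", 20, 15), Task("B", 10, 20), Task("C", 10, 2)]:
    print(task.name, "->", assign(task, servers))
    # A -> P, B -> L (P would be memory-strained), C -> P
A real tool would also release the resources back to the server when a task reports completion, as in the Task A example above.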
I feel like this is a common problem, and I'm not sure why, but I'm having a heck of a time finding anything in my Google adventures.
It would be pretty straightforward to code something myself, but if there's already a tried and tested tool for this, why reinvent the wheel?
The closest I could find was a tool called Hangfire: https://www.hangfire.io/overview.html
It is somewhat close, but it looks like it ignores the resource-aware load balancing aspect of scheduling, which is key. I already have a solid scheduling system in place as it is, so this won't really help me.

Performance of instance store vs EBS-optimized EC2 or RAID volumes

As far as I can tell from my own experience and from what I have read, there are very few situations in which one wouldn't want to use EBS over instance store. However, instance store is generally faster for disk reads/writes due to its being physically attached to the EC2 instance. How much faster, and whether it is faster in all cases, I don't know.
So, what I am curious about, if anyone out there has had some experience with any of these, is the relative speed/performance of:
An EC2 using instance store vs a non-storage-optimized EC2 using EBS (of any storage type)
An EC2 using instance store vs a storage-optimized (I3) EC2 using EBS
An EC2 using instance store vs a non-storage-optimized EC2 using some kind of EBS RAIDing
A non-storage-optimized EBS-backed EC2 vs a storage-optimized EC2 vs an EC2 with an EBS RAID configuration
All of the above vs EBS-optimized instances of any type.
The more specific and quantifiable the answers the better -- thanks!
Now Available – I3 Instances for Demanding, I/O Intensive Applications claims that Instance Store on i3 instances:
can deliver up to 3.3 million IOPS at a 4 KB block and up to 16 GB/second of sequential disk throughput.
Coming Soon – The I2 Instance Type – High I/O Performance Via SSD claims that Instance Store on i2 instances:
deliver 350,000 random read IOPS and 320,000 random write IOPS.
Amazon EBS Volume Types lists:
General Purpose SSD: Maximum 10,000 IOPS/Volume
Provisioned IOPS SSD: Maximum 20,000 IOPS/Volume
Throughput Optimized HDD: Maximum throughput 500 MiB/s (Optimized for throughput rather than IOPS, good for large, contiguous reads)
We did a thorough set of benchmarks for some of those situations comparing EBS vs. instance store, namely,
EBS SSD, general purpose IOPS (about 3K for a 1 TB volume)
EBS SSD, provisioned IOPS (about 50K for a 1 TB volume)
instance store local disk (one disk, RAID 0)
instance store local disk (two disks mirrored, RAID 1)
We had the following takeaways:
The local disk options are vastly better at random access than EBS, due to their low latency and to IOPS being a limiting factor on EBS. Although most applications try to avoid random reads, they are hard to avoid completely, so good performance in this area is a big plus.
Sequential reads are also vastly better than on EBS, mainly due to EBS rate limiting, specifically on throughput. Generally you get full, unrestricted access to a local disk, with much lower latency than network storage (EBS).
RAID 1 is (not surprisingly) up to 2x better for reads than a single disk. Writes are the same, since both disks need to be written. However, on larger systems you can have 4+ disks and do RAID 10 (mirrored striping), which would improve writes as well.
Unfortunately, as mentioned at the start, the local disk options are ephemeral and will lose your data on a terminate/stop of the instance. Even so, it might be worth considering a high-availability architecture that allows using them.
EBS at 50K IOPS is certainly more performant than at 3K, although you generally need 4+ threads to see a real difference (e.g. a database). Single-threaded processes are not going to be much faster (e.g. a file copy, zip, etc.). EBS at 50K was limited by the instance's maximum IOPS (30K), so be aware that the instance size can also be a limiting factor on EBS performance.
It's possible to RAID EBS volumes as well, but keep in mind it's networked storage, so the network will likely be a real bottleneck on any performance gains. Worth a separate test to compare.
Full details on the benchmarks can be found at:
https://www.scalebench.com/blog/index.php/2020/06/03/aws-ebs-vs-aws-instance-storelocal-disk-storage/
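If you want a rough, do-it-yourself feel for random-read latency on a given volume before running a full benchmark suite, a minimal Python sketch along these lines can help (the file path is a placeholder; the OS page cache and a small test file will skew the numbers, so a dedicated tool such as fio is more reliable):
import os
import random
import time

PATH = "/mnt/testfile"   # placeholder: a large pre-created file on the volume under test
BLOCK = 4096             # 4 KB reads, matching the IOPS figures quoted above
READS = 1000

size = os.path.getsize(PATH)
fd = os.open(PATH, os.O_RDONLY)
start = time.perf_counter()
for _ in range(READS):
    # Pick a random block-aligned offset and read 4 KB from it.
    offset = random.randrange(0, size - BLOCK) // BLOCK * BLOCK
    os.pread(fd, BLOCK, offset)
elapsed = time.perf_counter() - start
os.close(fd)

print(f"{READS / elapsed:.0f} random 4 KB reads/sec, "
      f"average latency {elapsed / READS * 1000:.2f} ms")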

Play2 performance surprises

Some introduction first.
Our Play2 (ver. 2.5.10) web service provides responses in JSON, and the response size can be relatively large: up to 350KB.
We've been running our web service in standalone mode for a while. We had gzip compression enabled in Play2, which reduces the response body roughly 10x, i.e. we had response bodies of up to ~35KB.
In that mode the service can handle up to 200 queries per second while running on an AWS EC2 m4.xlarge (4 vCPU, 16GB RAM, 750 Mbit network) inside a Docker container. The performance is completely CPU-bound: most of the time (75%) is spent on JSON serialization, and the rest (25%) on gzip compression. The business logic is very fast and is not even visible on the perf graphs.
Now we have introduced a separate front-end node (running Nginx) to handle some specific functions: authentication, authorization and, crucially for this question, traffic compression. We hoped to offload compression from the Play2 back-end to the front-end and spend those 25% of CPU cycles on the main tasks.
However, instead of improving, performance got much worse! Now our web service can only handle up to 80 QPS. At that point, most of the CPU is already consumed by something inside the JVM. Our metrics show it's not garbage collection, and it's also not in our code, but rather something inside Play.
It's important to note that at this load (80 QPS at ~350KB per response) we generate ~30 MB/s of traffic (80 × 350KB ≈ 28 MB/s, roughly 220 Mbit/s). This number, while significant, doesn't saturate the 750 Mbit EC2 network, so that shouldn't be the bottleneck.
So my question, I guess, is as follows: does anyone have an explanation and a mitigation plan for this problem? Some hints about how to get to the root cause of this would also be helpful.

Is there any limitation on EC2 machine or network?

I have 2 instances on Amazon EC2. One is a t2.micro machine acting as a web cache server; the other runs a performance test tool.
When I started a test, TPS (transactions per second) was about 3000, but a few minutes later TPS dropped to 300.
At first I thought the CPU credit balance was exhausted, but there was enough to process the requests. During the test, the maximum outgoing traffic of the web cache was 500 Mbit/s, CPU usage was 60%, and free memory was more than sufficient.
I couldn't find any cause for the TPS decrease. Is there any limitation on the EC2 machine or the network?
There are several factors that could be constraining your processes.
CPU credits on T2 instances
As you referenced, T2 instances use credits for bursting CPU. They are very powerful machines, but each instance is limited to a certain amount of CPU. t2.micro instances are given 10% of CPU, meaning they actually get 100% of the CPU only 10% of the time (at low millisecond resolution).
Instances start with CPU credits for a fast start, and these credits are consumed when the CPU is used faster than the credits are earned. However, you say that the credit balance was sufficient, so this appears not to be the cause.
Network Bandwidth
Each Amazon EC2 instance can use a certain throughput of network bandwidth. Smaller instances have 'low' bandwidth, bigger instances have more. There is no official statement of bandwidth size, but this is an interesting reference from Serverfault: Bandwidth limits for Amazon EC2
Disk IOPS
If your application uses disk access for each transaction, and your instance is using a General Purpose (SSD) instance type, then your disk may have consumed all available burst credits. If your disk is small, this could mean it will run slow (speed is 3 IOPS per GB, so a 20GB disk would run at 60 IOPS). Check the Amazon CloudWatch VolumeQueueLength metric to see if IO is queuing excessively.
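As with the CPU credit balance earlier, this metric can be pulled with boto3 if you want to watch it during a test run; a minimal sketch with placeholder region and volume ID:
import boto3
from datetime import datetime, timedelta

# Average EBS queue length over the last hour; sustained high values mean
# IO is queuing, i.e. the volume is likely the bottleneck.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder ID
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))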
Something else
The slowdown could also be due to your application or cache system (e.g. running out of free memory for storing data).
