We are having an interesting issue where we see a CPU spike on our EC2 instance and, at the same time, a spike in disk latency. Here is the pattern of the CPU spike:
CPU spike from 50% to 100% within 30 seconds
It stays at 100% utilization for two minutes
CPU utilization drops from 100% to almost 0% in 10 seconds. At almost the same time, disk latency is also back to normal.
This issue has happened on different AWS EC2 instances a couple of times over the past week and is still happening. In all cases we see a CPU spike along with a disk latency spike, and the CPU spike follows a pattern similar to the one above.
We put process monitoring tools in place to check whether any particular process was occupying the CPU. The tool revealed that every process on the EC2 instance starts taking roughly twice the CPU. For example, our app server's CPU utilization increases from 0.75% to 1.5%, and we see a similar pattern for Nginx and other processes. No single process occupies more than 8% CPU. We studied our traffic pattern and there is nothing unusual that could cause this. So the question is:
Can an increase in disk latency cause the CPU spike pattern above, or, more generally, can disk latency result in a CPU spike?
Here is my bet: you are running t2/t3 machines, which are burstable instances. You can access 30% of the CPU all the time, and a credit system creates a fair, predictable usage model for the remaining 70%. You earn credits by running the instance; you spend credits by going over 30% CPU usage.
You are running out of credits, and then AWS reduces your access to the CPU. The system runs smoothly again once credits are added back to your balance.
t2 and t3 don't use the same credit system; you can find the details here: CPU Credits and baseline
You have two solutions:
Take a bigger instance, so you have more credits per hour and a better baseline, or move to another family such as c5, m5, r5, etc.
Enable the unlimited mode option for your t3 instances
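If you want to confirm this theory before resizing anything, CloudWatch exposes the credit balance directly. Here is a minimal sketch, assuming boto3 is installed and credentials are configured; the region and instance ID are placeholders to replace with your own:
import datetime
import boto3

# Placeholders: use your own region and instance ID.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.datetime.utcnow()

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",   # drops toward 0 while you burst above baseline
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - datetime.timedelta(hours=6),
    EndTime=now,
    Period=300,                      # 5-minute datapoints
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
If the balance bottoms out right before each incident, credit throttling is the likely culprit.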
I would suggest faster storage. CPU accounting always adds up to 100%, so any throttling or waiting still shows up as "usage" attributed somewhere, which is why it can look like it has an "unknown" reason. The possible categories are:
idle time (notice that this is what you consider FREE CPU; that's why I say it adds up to 100%)
user time (normal usage)
system time (system usage)
iowait (your case: the CPU waiting for the HDD/SSD to answer)
nice time (low-priority processes that were not included in user time)
interrupt time (time spent "talking" to external devices; could be your case if you have many USB devices etc., but rather unlikely)
softirq (queued work from a processed interrupt - see above)
steal time (the case Clement is describing)
I would suggest confirming which one applies in your case.
You can run the commands below to get the info:
$ sudo apt-get install sysstat
$ mpstat -P ALL 1
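If you would rather capture this programmatically (for example, to log it around the spikes), here is a minimal sketch assuming the psutil package is installed; it samples the same categories that mpstat shows:
import psutil

# Sample the per-category CPU percentages once per second for 10 seconds.
# iowait and steal only exist on Linux, hence the getattr fallback.
for _ in range(10):
    t = psutil.cpu_times_percent(interval=1)
    print(
        f"user={t.user:5.1f}  system={t.system:5.1f}  idle={t.idle:5.1f}  "
        f"iowait={getattr(t, 'iowait', 0.0):5.1f}  "
        f"steal={getattr(t, 'steal', 0.0):5.1f}"
    )
A high iowait during an incident points at storage; a high steal points at the credit/throttling explanation in the other answer.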
From here there are two options for you :)
EBS lets you run an IO-optimized volume type called io1 (mid price, mid speed).
Change the machine and use one built on the Nitro System (it provides bare-metal capabilities, that is, as if you had an actual NVMe drive attached directly: the maximum possible speed).
Instance       vCPU  ECU  Memory   Instance Storage (GB)   Price per Hour
m5.2xlarge     8     37   32 GiB   EBS only                $0.384
m5d.2xlarge    8     37   32 GiB   1 x 300 NVMe SSD        $0.452
Source: Instances built on the Nitro System
Related
I develop a multithreaded, CPU-intensive application. Until now this application has been tested on multicore (but single-CPU) systems like an i7-6800K and worked well under Linux and Windows. A newly observed phenomenon is that it does not run well on certain server hardware (2 x Xeon E5 2660 v3):
When 40 threads are active, CPU utilization drops to 5-10%. This server has two physical CPUs and supports NUMA. The application was not written with the NUMA model in mind, so we certainly have lots of accesses to non-local memory, and that should be improved. But the question is: can low displayed CPU utilization be caused by slow memory access?
I believe it can, but a colleague said the CPU utilization would nevertheless stay at 100%. This matters because if he is right, then the trouble does not come from memory misplacement. I don't know how Windows 10 counts CPU utilization, so I hope somebody with practical experience on server hardware knows whether the displayed CPU utilization drops when the memory controllers are congested.
I am using Scrapy to scrape multiple sites and Scrapyd to run the spiders.
I have written 7 spiders, and each spider processes at least 50 start URLs. I have around 7000 URLs in total, about 1000 per spider.
I place jobs in Scrapyd with 50 start URLs per job. Initially all spiders respond fine, but suddenly they start working really slowly. When I run Scrapyd on localhost it gives very high performance, but as soon as I publish jobs to the Scrapyd server, the request/response time drops drastically.
After some time on the server, the response time for each start URL becomes really slow.
Settings look like this:
BOT_NAME = 'service_scraper'
SPIDER_MODULES = ['service_scraper.spiders']
NEWSPIDER_MODULE = 'service_scraper.spiders'
CONCURRENT_REQUESTS = 30
# DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS_PER_DOMAIN = 1000
ITEM_PIPELINES = {
    'service_scraper.pipelines.MongoInsert': 300,
}
MONGO_URL="mongodb://xxxxx:yyyy"
EXTENSIONS = {'scrapy.contrib.feedexport.FeedExporter': None}
HTTPCACHE_ENABLED = True
We tried changing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN, but nothing has worked. We have hosted Scrapyd on AWS EC2.
As with all performance testing, the goal is to find the performance bottleneck. This typically falls to one (or more) of:
Memory: Use top to measure memory consumption. If too much memory is consumed, it might swap to disk, which is slower than RAM. Try adding memory.
CPU: Use Amazon CloudWatch to track CPU. Be very careful with t2 instances (see below).
Disk speed: If the job is disk-intensive, or if memory is swapping to disk, this can impact performance -- especially for databases. Amazon EBS is network-attached disk, so network speed can actually throttle disk speed.
Network speed: Due to the multi-tenant design of Amazon EC2, network bandwidth is intentionally throttled. The amount of network bandwidth available depends upon the instance type used.
You are using a t2.small instance. It has:
Memory: 2GB (This is less than the 4GB on your own laptop)
CPU: The t2 family is extremely powerful, but the t2.small only receives an average 20% of CPU (see below).
Network: The t2.small is rated as Low to Moderate network bandwidth.
The fact that your CPU is recording 60%, while the t2.small is limited to an average 20% of CPU indicates that the instance is consuming CPU credits faster than they are being earned. This leads to an eventual exhaustion of CPU credits, thereby limiting the machine to 20% of CPU. This is highly likely to be impacting your performance. You can view CPU Credit balances in Amazon CloudWatch.
See: T2 Instances documentation for an understanding of CPU Credits.
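To see why a sustained 60% is unsustainable, here is a back-of-the-envelope sketch using AWS's published T2 figures (a t2.small earns 12 credits per hour, and one credit is one vCPU-minute at 100%); treat these constants as assumptions to verify against the current documentation:
# t2.small: 20% baseline, 12 CPU credits earned per hour.
credits_earned_per_hour = 12
observed_cpu_percent = 60                                   # what CloudWatch is showing
credits_spent_per_hour = observed_cpu_percent / 100 * 60    # 36 credits/hour
net_per_hour = credits_earned_per_hour - credits_spent_per_hour
print(f"net credit change: {net_per_hour:+.0f} credits/hour")   # -24 credits/hour
At roughly 24 credits lost per hour, any starting balance drains within a few hours, after which the instance is pinned to its 20% baseline.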
Network bandwidth is relatively low for the t2.small. This impacts Internet access and communication with the Amazon EBS storage volume. Given that your application is downloading lots of web pages in parallel, and then writing them to disk, this is also a potential bottleneck for your system.
Bottom line: When comparing to the performance on your laptop, the instance in use has less memory, potentially less CPU due to exhaustion of CPU credits and potentially slower disk access due to high network traffic.
I recommend you use a larger instance type to confirm that performance is improved, then experiment with different instance types (both in the t2 family and outside of it) to determine what size machine gives you the best price/performance trade-off.
Continue to monitor the CPU, Memory and Network performance to identify the leading bottleneck, then aim to fix that bottleneck.
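If you want a single script to watch the memory, swap, disk and network counters mentioned above while a job is running, here is a minimal sketch (it assumes the psutil package is installed on the instance; stop it with Ctrl-C):
import time
import psutil

# Take a throughput sample every 5 seconds.
prev_disk = psutil.disk_io_counters()
prev_net = psutil.net_io_counters()
while True:
    time.sleep(5)
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    # Disk and network throughput over the last interval, in MiB/s.
    disk_mb = (disk.read_bytes + disk.write_bytes
               - prev_disk.read_bytes - prev_disk.write_bytes) / 5 / 2**20
    net_mb = (net.bytes_recv + net.bytes_sent
              - prev_net.bytes_recv - prev_net.bytes_sent) / 5 / 2**20
    print(f"mem {mem.percent}%  swap {swap.percent}%  "
          f"disk {disk_mb:.1f} MiB/s  net {net_mb:.1f} MiB/s")
    prev_disk, prev_net = disk, net
Whichever number saturates first (or swap creeping above 0%) is the bottleneck to attack.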
I am developing a program that happens to use a lot of CPU cycles to do its job. I have noticed that it, and other CPU intensive tasks, like iMovie import/export or Grapher Examples, will trigger a spin dump report, logged in Console:
1/21/16 12:37:30.000 PM kernel[0]: process iMovie[697] thread 22740 caught burning CPU! It used more than 50% CPU (Actual recent usage: 77%) over 180 seconds. thread lifetime cpu usage 91.400140 seconds, (87.318264 user, 4.081876 system) ledger info: balance: 90006145252 credit: 90006145252 debit: 0 limit: 90000000000 (50%) period: 180000000000 time since last refill (ns): 116147448571
1/21/16 12:37:30.881 PM com.apple.xpc.launchd[1]: (com.apple.ReportCrash[705]) Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
1/21/16 12:37:30.883 PM ReportCrash[705]: Invoking spindump for pid=697 thread=22740 percent_cpu=77 duration=117 because of excessive cpu utilization
1/21/16 12:37:35.199 PM spindump[423]: Saved cpu_resource.diag report for iMovie version 9.0.4 (1634) to /Library/Logs/DiagnosticReports/iMovie_2016-01-21-123735_cudrnaks-MacBook-Pro.cpu_resource.diag
I understand that high CPU usage may be associated with software errors, but some operations simply require high CPU usage. It seems a waste of resources to watch-dog and report processes/threads that are expected to use a lot of CPU.
In my program, I use four serial GCD dispatch queues, one for each core of the i7 processor. I have tried using QOS_CLASS_BACKGROUND, and spin dump recognizes this:
Primary state: 31 samples Non-Frontmost App, Non-Suppressed, Kernel mode, Thread QoS Background
The fan spins much more slowly when using QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED, and the program takes about 2x longer to complete. As a side issue, Activity Monitor still reports the same % CPU usage and even longer total CPU Time for the same task.
Based on Apple's Energy Efficiency documentation, QOS_CLASS_BACKGROUND seems to be the proper choice for something that takes a long time to complete:
Work takes significant time, such as minutes or hours.
So why then does it still complain about using a lot of CPU time? I've read about methods to disable spindump, but these methods disable it for all processes. Is there a programmatic way to tell the system that this process/thread is expected to use a lot of CPU, so don't bother watch-dogging it?
I have 2 instances on Amazon EC2. One is a t2.micro machine acting as a web cache server; the other runs a performance test tool.
When I started a test, TPS (transactions per second) was about 3000, but a few minutes later it had dropped to 300.
At first I thought the CPU credit balance was exhausted, but there were enough credits to process requests. During the test, the maximum outgoing traffic of the web cache was 500 Mbit/s, CPU usage was 60%, and free memory was more than enough.
I couldn't find any cause for the TPS decrease. Is there any limitation on the EC2 machine or network?
There are several factors that could be constraining your processes.
CPU credits on T2 instances
As you referenced, T2 instances use credits for bursting CPU. They are very powerful machines, but each instance is limited to a certain amount of CPU. t2.micro instances are given 10% of CPU, meaning they actually get 100% of the CPU only 10% of the time (at low millisecond resolution).
Instances start with CPU credits for a fast start, and these credits are consumed when the CPU is used faster than the credits are earned. However, you say that the credit balance was sufficient, so this appears not to be the cause.
Network Bandwidth
Each Amazon EC2 instance can use a certain throughput of network bandwidth. Smaller instances have 'low' bandwidth, bigger instances have more. There is no official statement of bandwidth size, but this is an interesting reference from Serverfault: Bandwidth limits for Amazon EC2
Disk IOPS
If your application uses disk access for each transaction, and your instance is using a General Purpose (SSD) instance type, then your disk may have consumed all available burst credits. If your disk is small, this could mean it will run slow (speed is 3 IOPS per GB, so a 20GB disk would run at 60 IOPS). Check the Amazon CloudWatch VolumeQueueLength metric to see if IO is queuing excessively.
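For context on how long a small gp2 volume can keep up before this bites, here is a rough sketch based on AWS's published gp2 burst model (baseline of 3 IOPS per GB, burst up to 3,000 IOPS, burst bucket of about 5.4 million I/O credits); treat these constants as assumptions to check against the current EBS documentation:
volume_gb = 20
baseline_iops = 3 * volume_gb          # 60 IOPS for a 20 GB gp2 volume
burst_iops = 3000
burst_bucket = 5_400_000               # I/O credits available when the bucket is full

seconds_at_full_burst = burst_bucket / (burst_iops - baseline_iops)
print(f"baseline: {baseline_iops} IOPS")
print(f"full burst lasts about {seconds_at_full_burst / 60:.0f} minutes")   # ~31 minutes
After roughly half an hour of sustained bursting, the volume falls back to 60 IOPS, which looks exactly like a sudden, unexplained slowdown.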
Something else
The slowdown could also be due to your application or cache system (e.g. running out of free memory for storing data).
Is increased CPU time (as reported by the time CLI command) indicative of inefficiency when hyperthreading is used (e.g. time spent in spinlocks or on cache misses), or is it possible that the CPU time is simply inflated by the odd nature of HT (e.g. the real cores being busy so HT can't kick in)?
I have a quad-core i7, and I'm testing a trivially parallelizable part (image-to-palette remapping) of an OpenMP program, with no locks and no critical sections. All threads access a bit of read-only shared memory (a look-up table), but write only to their own memory.
cores real CPU
1: 5.8 5.8
2: 3.7 5.9
3: 3.1 6.1
4: 2.9 6.8
5: 2.8 7.6
6: 2.7 8.2
7: 2.6 9.0
8: 2.5 9.7
I'm concerned that the amount of CPU time used increases rapidly as the number of cores exceeds 1 or 2.
I imagine that in an ideal scenario CPU time wouldn't increase much (the same amount of work just gets distributed over multiple cores).
Does this mean that 40% of the CPU time is overhead spent on parallelizing the program?
It's quite possibly an artefact of how CPU time is measured. As a trivial example: if you run a 100 MHz CPU and a 3 GHz CPU for one second each, each will report that it ran for one second. The second CPU might do 30 times more work, but it still reports one second.
With hyperthreading, a reasonable (if not quite accurate) model would be that one core can run either one task at, let's say, 2000 MHz, or two tasks at, let's say, 1200 MHz each. Running two tasks, it does only 60% of the work per thread, but 120% of the work for both threads together, a 20% improvement. But if the OS asks how many seconds of CPU time were used, the first case will report "1 second" after each second of real time, while the second will report "2 seconds".
So the reported CPU time goes up. If it increases by less than a factor of two, overall performance has improved.
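To make that concrete, here is the same toy model worked through in a few lines (the 2000/1200 MHz figures are the illustrative assumptions from above, not measurements):
work = 2000.0                     # arbitrary units of work in the job

# One thread on the core, running at the full 2000 MHz-equivalent rate.
real_1 = work / 2000              # 1.00 s wall clock
cpu_1 = real_1                    # 1.00 s reported CPU time

# Two hyperthreads on the core, each at the 1200 MHz-equivalent rate.
real_2 = (work / 2) / 1200        # ~0.83 s wall clock
cpu_2 = 2 * real_2                # ~1.67 s reported CPU time (both threads are billed)

print(f"1 thread : real {real_1:.2f} s, cpu {cpu_1:.2f} s")
print(f"2 threads: real {real_2:.2f} s, cpu {cpu_2:.2f} s")
Reported CPU time rises by about 67% while wall-clock time drops by about 17%: more CPU time billed, yet better throughput, which is exactly the "less than doubles" criterion above.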
Quick question: are you running the genuine time program, /usr/bin/time, or the bash built-in of the same name? I'm not sure it matters; they look very similar.
Looking at your table of numbers, I sense that the processed data set (i.e. the input plus all the output data) is reasonably large overall (bigger than the L2 cache), and that the processing per data item is not that lengthy.
The numbers show a nearly linear improvement from 1 to 2 cores, but that is tailing off significantly by the time you're using 4 cores, and the hyperthreaded cores are adding virtually nothing. This means that something shared is being contended for. Your program has free-running threads, so that something can only be memory (the L3 cache and main memory on the i7).
This sounds like a typical example of being I/O bound rather than compute bound, the I/O in this case being to/from L3 cache and main memory. L2 cache is 256k, so I'm guessing that the size of your input data plus one set of results and all intermediate arrays is bigger than 256k.
Am I near the mark?
Generally speaking, when considering how many threads to use, you have to take shared cache and memory speeds and data-set sizes into account. That can be a right bugger because you have to work it out at run time, which is a lot of programming effort (unless your hardware config is fixed).
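To put numbers on the tail-off described above, here is a tiny sketch computing speedup and parallel efficiency (T1 / (n x Tn)) from the "real" column of the question's table:
# Wall-clock times ("real" column) from the question, indexed by thread count.
real = {1: 5.8, 2: 3.7, 3: 3.1, 4: 2.9, 5: 2.8, 6: 2.7, 7: 2.6, 8: 2.5}

for n, t in real.items():
    speedup = real[1] / t
    efficiency = speedup / n
    print(f"{n} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
Efficiency drops from 100% at one thread to about 78% at two, 50% at four and 29% at eight: the near-linear start, the tail-off at four cores, and the hyperthreads adding very little.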