Tool to load-balance tasks in a distributed system - performance

I'm a bit stumped searching for a tool to help me with a pesky load-balancing problem.
Say you have a large variety of repeated hourly, daily and weekly units of code that can vary greatly in their RAM, CPU, disk and network usage. The usage is known and tracked fairly well.
Now say you have N servers to execute these tasks. The RAM, CPU, disk and network usage limits on these servers are known as well.
What tools are available that can handle automatically delegating tasks to these servers in a manner that ensures resources won't be strained?
For example (simplified):
Task A - Consumes 20% CPU, 15 GB of RAM -> Push to Server P, which has 100% CPU and 32 GB RAM available
Task B - Consumes 10% CPU, 20 GB of RAM -> Server P would be memory-strained if I allocated this. Push to Server L, which has 100% CPU and 32 GB RAM available
Task C - Consumes 10% CPU, 2 GB of RAM -> Push to Server P, which now has 80% CPU and 17 GB RAM available
Task A - Reports finished; Server P now has 90% CPU and 30 GB RAM available
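
A minimal sketch of the first-fit, resource-aware assignment illustrated above (the Task and Server classes and the capacity numbers are just stand-ins for this example, not a real tool):

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpu_pct: float   # CPU the task consumes, as a percentage of one server
    ram_gb: float    # RAM the task consumes, in GB

@dataclass
class Server:
    name: str
    cpu_free_pct: float
    ram_free_gb: float

    def fits(self, task: Task) -> bool:
        return task.cpu_pct <= self.cpu_free_pct and task.ram_gb <= self.ram_free_gb

    def allocate(self, task: Task) -> None:
        self.cpu_free_pct -= task.cpu_pct
        self.ram_free_gb -= task.ram_gb

    def release(self, task: Task) -> None:
        # Called when a task reports finished, returning its resources to the server
        self.cpu_free_pct += task.cpu_pct
        self.ram_free_gb += task.ram_gb

def assign(task: Task, servers: list[Server]) -> Server | None:
    """Push the task to the first server with enough free CPU and RAM."""
    for server in servers:
        if server.fits(task):
            server.allocate(task)
            return server
    return None  # no server can take it right now; queue it and retry later

servers = [Server("P", 100, 32), Server("L", 100, 32)]
for task in (Task("A", 20, 15), Task("B", 10, 20), Task("C", 10, 2)):
    target = assign(task, servers)
    print(task.name, "->", target.name if target else "queued")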
I feel like this is a common problem, and I'm not sure why, but I'm having a heck of a time finding anything in my Google adventures.
It would be pretty straightforward to code something myself, but if there's already a tried and tested tool for this, why reinvent the wheel?
The closest I could find was a tool called Hangfire (https://www.hangfire.io/overview.html).
It is somewhat close, but it looks like it ignores the resource-based load-balancing aspect of scheduling, which is key. I already have a solid scheduling system in place, so this won't really help me.

Related

High CPU/Memory use by the vmcompute process

We are running a Windows 2019 cluster in AWS ECS.
From time to time the instances run into problems with high CPU and memory usage that is not related to container usage.
When checking the instances, we can see that the vmcompute process has spiked its memory usage (commit) to as much as 90% of system memory, with an average CPU usage of at least 30-40%.
I fail to understand why that is happening, and whether it is a real issue,
or whether the memory and CPU usage will decrease when more load is put onto the containers.

NUMA: Win10 CPU utilization

I develop a multithreaded, CPU-intensive application. Until now this application has been tested on multi-core (but single-CPU) systems like an i7-6800K and worked well under Linux and Windows. A newly observed phenomenon is that it does not run well on certain server hardware: 2 x Xeon E5 2660 v3.
When 40 threads are active, CPU utilization drops to 5-10%. This server has two physical CPUs and supports NUMA. The application has not been written with the NUMA model in mind, and thus we certainly have lots of accesses to non-local memory, which should be improved. But the question is: "Can low displayed CPU utilization be caused by slow memory access?"
I believe this is the case, but a colleague said that the CPU utilization would nevertheless stay at 100%. This is important because if he is right, then the trouble does not come from memory misplacement. I don't know how Windows 10 counts CPU utilization, so I hope that somebody knows from practical experience with server hardware whether the displayed CPU utilization drops when memory controllers are congested.

Scrapy spiders drastically slow down while running on AWS EC2

I am using Scrapy to scrape multiple sites and Scrapyd to run the spiders.
I have written 7 spiders, and each spider processes at least 50 start URLs. I have around 7000 URLs in total, about 1000 URLs per spider.
I place jobs in Scrapyd with 50 start URLs per job. Initially all spiders respond fine, but suddenly they start working really slowly.
When I run Scrapyd on localhost it gives very high performance; as soon as I publish jobs to the Scrapyd server on EC2, performance drops drastically.
The response time for each start URL becomes really slow after some time on the server.
The settings look like this:
BOT_NAME = 'service_scraper'
SPIDER_MODULES = ['service_scraper.spiders']
NEWSPIDER_MODULE = 'service_scraper.spiders'

# Concurrency caps: 30 requests globally; per-domain cap of 1000 (so the global cap of 30 dominates)
CONCURRENT_REQUESTS = 30
# DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS_PER_DOMAIN = 1000

# Scraped items are written to MongoDB by a custom pipeline
ITEM_PIPELINES = {
    'service_scraper.pipelines.MongoInsert': 300,
}
MONGO_URL = "mongodb://xxxxx:yyyy"

# Feed export disabled; HTTP cache enabled
EXTENSIONS = {'scrapy.contrib.feedexport.FeedExporter': None}
HTTPCACHE_ENABLED = True
We tried changing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN, but nothing is working. We have hosted Scrapyd on AWS EC2.
As with all performance testing, the goal is to find the performance bottleneck. This typically falls to one (or more) of:
Memory: Use top to measure memory consumption. If too much memory is consumed, it might swap to disk, which is slower than RAM. Try adding memory.
CPU: Use Amazon CloudWatch to track CPU. Be very careful with t2 instances (see below).
Disk speed: If the job is disk-intensive, or if memory is swapping to disk, this can impact performance -- especially for databases. Amazon EBS is network-attached disk, so network speed can actually throttle disk speed.
Network speed: Due to the multi-tenant design of Amazon EC2, network bandwidth is intentionally throttled. The amount of network bandwidth available depends upon the instance type used.
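
One quick way to snapshot all four of these from the instance itself is psutil. A minimal sketch, assuming psutil is installed (pip install psutil):

import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"Memory: {mem.percent}% used, swap {swap.percent}% used")   # heavy swap use hints at memory pressure
print(f"CPU:    {psutil.cpu_percent(interval=1)}% over the last second")

disk = psutil.disk_io_counters()
net = psutil.net_io_counters()
print(f"Disk:   {disk.read_bytes >> 20} MiB read, {disk.write_bytes >> 20} MiB written since boot")
print(f"Net:    {net.bytes_recv >> 20} MiB received, {net.bytes_sent >> 20} MiB sent since boot")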
You are using a t2.small instance. It has:
Memory: 2GB (This is less than the 4GB on your own laptop)
CPU: The t2 family is extremely powerful, but the t2.small only receives an average 20% of CPU (see below).
Network: The t2.small is rated as Low to Moderate network bandwidth.
The fact that your CPU is recording 60%, while the t2.small is limited to an average 20% of CPU indicates that the instance is consuming CPU credits faster than they are being earned. This leads to an eventual exhaustion of CPU credits, thereby limiting the machine to 20% of CPU. This is highly likely to be impacting your performance. You can view CPU Credit balances in Amazon CloudWatch.
See: T2 Instances documentation for an understanding of CPU Credits.
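
If you prefer to pull the CPUCreditBalance history programmatically rather than through the console, here is a minimal boto3 sketch (the region and instance ID below are placeholders):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region
now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance ID
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,                # one datapoint every 5 minutes
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))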
Network bandwidth is relatively low for the t2.small. This impacts Internet access and communication with the Amazon EBS storage volume. Given that your application is downloading lots of web pages in parallel, and then writing them to disk, this is also a potential bottleneck for your system.
Bottom line: When comparing to the performance on your laptop, the instance in use has less memory, potentially less CPU due to exhaustion of CPU credits and potentially slower disk access due to high network traffic.
I recommend you use a larger instance type to confirm that performance is improved, then experiment with different instance types (both in the t2 family and outside of it) to determine what size machine gives you the best price/performance trade-off.
Continue to monitor the CPU, Memory and Network performance to identify the leading bottleneck, then aim to fix that bottleneck.

Fully Utilise CPU and Memory

I have a system with a 2.8 GHz processor, 20 physical cores, 40 logical cores, 128 GB RAM and a 4 TB hard drive.
Scenario:
I am running 3 independent Python-based processes/scripts that read data from files and write it to a database. They are taking a long time while not using CPU and memory at 100%, not even 40%.
Why is that? (I think it depends upon the OS.)
How can I configure it to utilise CPU and memory more?
I am using Windows 8.1.
Take a look at ProcessorAffinity and process priority (PriorityClass):
https://msdn.microsoft.com/en-us/library/system.diagnostics.processthread.processoraffinity(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.diagnostics.process.priorityclass(v=vs.110).aspx
A process (including a Python script) isn't going to use any more cores than it has running threads. So if your Python script is single-threaded, it's only going to use a single core.
Further, disk and database operations will stall the process while it is blocked on I/O and the network (effective CPU usage == 0).
In other words, your program may not be "CPU bound" if it's doing a lot of I/O.
I'm not sure what your programs do, but if the problem at hand can be parallelized (split up into multiple independent tasks), then it might lend itself to having more threads or processes to take advantage of the extra hardware you have. But it's tricky, and it can be hard to get this right and actually see the performance gain.
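
As a rough illustration of the multi-process route (the file names are made up and the stand-in work is a CPU-bound loop, not your real file-to-database logic):

from multiprocessing import Pool
import os

def process_file(path):
    # Stand-in for "read the file, transform, write to the database";
    # a CPU-bound loop so each worker visibly occupies one core.
    total = sum(i * i for i in range(5_000_000))
    return path, os.getpid(), total

if __name__ == "__main__":
    files = [f"data_{i}.csv" for i in range(8)]
    with Pool(processes=4) as pool:               # roughly one worker per core you want busy
        for path, pid, _ in pool.imap_unordered(process_file, files):
            print(f"{path} handled by worker {pid}")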

CPU Utilization and threads

We have a transaction-intensive process at one customer site running on a quad-core server with four processors. The process is designed to take advantage of every core available. So in this installation, we take an input queue, divide it into 16ths, and allocate each fraction of the queue to a core. It works well and keeps up with the transaction volume on the box.
Looking at the CPU utilization on the box, it never seems to go above 33%. Now we have a new customer with at least double the volume of the existing customer. Some of us are arguing that since CPU usage is way below maximum utilization, we should go with the same configuration.
Others claim that there is no direct correlation between CPU utilization and transaction processing speed, and that since the logic of the underlying software module is based on the number of available cores, it makes sense to obtain a box with proportionately more cores for the new client to accommodate the increased traffic volume.
Does anyone have a sense as to who is right in this instance?
Thank you,
To determine the optimum configuration for your new customer, understanding the reason for low CPU usage is paramount.
Very likely, the reason is one of the following:
Your process is limited by memory bandwidth. In this case, faster RAM will help if supported by the motherboard. If possible, a redesign to limit the amount of data accessed during processing will improve performance. Adding more CPU cores will, on its own, do nothing to improve performance.
Your process is limited by disk I/O. Using faster disk connections (SATA etc.) and/or upgrading to an SSD might help, but more CPU power will not.
Your process is limited by synchronization contention. In this case, adding more threads for more cores might even be counterproductive. Redesigning your algorithm might help in this case.
Having said this, I have also seen situations where processes that are definitely CPU bound fail to achieve 100% CPU usage on modern processors (Core i7 etc.), because in certain Turbo Boost-related cases Task Manager will show less than 100%.
As 9000 said, you need to find out what your bottlenecks are when under load. Perfmon might provide enough data to find out.
Another afterthought: You could limit your process on the existing machine to part of the cores (but still at least 30% so that theoretically, CPU doesn't become a bottleneck due to this limitation) and check if overall throughput degrades. If it does not, adding more cores will not improve performance.
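
A minimal sketch of that experiment with psutil (assuming psutil is available; the core numbers are illustrative, and cpu_affinity() is supported on Windows and Linux):

import psutil

p = psutil.Process()                   # or psutil.Process(pid_of_the_worker_process)
print("before:", p.cpu_affinity())     # typically all logical cores
p.cpu_affinity([0, 1, 2, 3, 4, 5])     # pin to 6 of the 16 logical cores (~37%, above the 30% floor)
print("after: ", p.cpu_affinity())
# Now run the usual workload and compare throughput against the unrestricted run.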
