Service Fabric Resource balancer uses stale Reported load - metrics

While looking into the resource balancer and dynamic load metrics on Service Fabric, we ran into some questions (Running devbox SDK GA 2.0.135).
In the Service Fabric Explorer (the portal and the standalone application) we can see that the balancing is ran very often, most of the time it is done almost instantly and this happens every second. While looking at the Load Metric Information on the nodes or partitions it is not updating the values as we report load.
We send a dynamic load report based on our interaction (a HTTP request to a service), increasing the reported load data of a single partition by a large amount. This spike becomes visible somewhere in 5 minutes at which point the balancer actually starts balancing. This seems to be an interval in which the load data gets refreshed. The last reported time gets updated all the time but without the new value.
We added the metrics to applicationmanifest and the clustermanifest to make sure it gets used in the balancing.
This means the resource balancer uses the same data for 5 minutes. Is this a configurable setting? Is it constraint because it is running on a devbox?
We tried a lot of variables in the clustermanifest but none seem to be affecting this refreshtime.
If this is not adaptable, can someone explain why would you run the balancer with stale data? and why this 5 minute interval was chosen?

This is indeed a configurable setting, and the default is 5 minutes. The idea behind it is that in prod you have tons of replicas all reporting load all the time, and so you want to batch them up so you don't spam the Cluster Resource Manager with all those as independent messages.
You're probably right in that this value is way too long for local development. We'll look into changing that for the local clusters, but in the meantime you can add the following to your local cluster manifest to change the amount of time we wait by default. If there are other settings already in there, just add the SendLoadReportInterval line. The value is in seconds and you can adjust it accordingly. The below would change the default load reporting interval from 5 minutes (300 seconds) to 1 minute (60 seconds).
<Section Name="ReconfigurationAgent">
<Parameter Name="SendLoadReportInterval" Value="60" />
</Section>
Please note that doing so does increase load on some of the system services (TANSTAAFL), and as always if you're operating on a generated or complete cluster manifest be sure to Test-ServiceFabricClusterManifest before deploying it. If you're working with a local development cluster the easiest way to get it deployed is probably just to modify the cluster manifest template (by default here: "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\NonSecure\ClusterManifestTemplate.xml") and just add the line, then right click on the Service Fabric Local Cluster Manager in your system tray and select "Reset Local Cluster". This will regenerate the local cluster with your changes to the template.

Related

Sitecore page load slowness

I'm using Sitecore instance 9.1, Solr 7.2.1, and SXA 1.8.
I have deployed the environment on Azure and while monitoring incoming requests (to CD instance), I've noticed slowness in loading some pages at specific times.
I've explored App Insights and found an unexplainable behavior the request is taking 28.7 seconds while the breakdown of it shows executions of milli-seconds .. How is that possible? and How to explain what's happening during extra 28 seconds on the app service ??
I've checked the profiler and it shows that the thread is taking only 1042.48 ms .. How is that possible ?
This is an intermittent issue happens during the day .. regular requests are being served within 3 to 4 seconds.
I noticed that Azure often shows a profile trace for a "similar", but completely different request when clicking from the End-to-end transaction view. You can check this by comparing the timestamp and URL of the profile trace and the transaction you clicked from.
For example, I see a transaction logged at 8:58:39 PM, 2021-09-25 with 9.1 s response time:
However, when I click the profile trace icon, Azure takes me to a trace that was captured 10 minutes earlier, at 08:49:20 PM, 2021-09-25 and took only 121.64 ms:
So, if the issue you experience is intermittent and you cannot replicate it easily, try looking at the profile traces with the Slowest wall clock time by going to Application Insights → Performance → Drill into profile traces:
This will show you the worst-performing requests captured by the profiler at the top of the list:
In order to figure out why it is slow, you’ll need to understand what happens internally, f.e:
How the wall clock time is spent while processing your request?
Are there any locks internally?
The source of that data is dynamic profiling, Azure can do that on demand.
The IIS stats report would show you slowest requests, so you could look into Thread Time distribution to see where those 28 seconds are spent:
In Sitecore the when the application start the Initial prefetch configuration allows to pre-populate prefetch caches. Pre-heated prefetch caches help to reduce the processing time of incoming requests. The initial prefetch configuration of caches are taking time to load on initial stage.
Sitecore XP instance takes too long to load. This is caused by a performance issue in the CatalogRepository.GetCatalogItems method. It will be fixed in upcoming updates
see Site core knowledge base
In Sitecore XP 9.0 the initial prefetch configuration was revised. The prefetch cache for the core database was configured to include items that are used to render the Sitecore Client interface.
The Sitecore Client interface is not used on Content Delivery instances. Disabling initial prefetch configuration for the Core database helps in avoiding excessive resource consumption on the SQL Server hosting the Core database.
Change the configuration of the Core database in the \App_Config\Sitecore.config file:
Refer site core knowledge base

AWS Application load balancer stops to send traffic when new instance is added

I have a problem with autoscaling with AWS Application Load Balancer.
I'm running my Jmeter tests and discovered that whenever new instance is added to autoscaling group (that is when it becomes healthy and ALB starts to route traffic to it), then for some short period of time Load Balancer forwards less requests to targets and a lot of requests are apparently stuck at Load Balancer itself.
I'm attaching 3 images that show this issue. CPU of JVM of one of instances drops and than goes back to normal, some requests are hanging for more than 30 sec. Number of requests per target drops and then goes back to trend. (see attached pictures)
I'm using sticky sessions with 3 minutes validity period.
Does any one knows what may cause this temporary "choking" when new instance is added?
It is quite crucial to our user experience. Can't actually understand why adding new instance can have such adverse effect on traffic routing.
Issue is fully reproducible.

Using Azure load balancer to reboot/update server with zero downtime

I have a really simple setup: An azure load balancer for http(s) traffic, two application servers running windows and one database, which also contains session data.
The goal is being able to reboot or update the software on the servers, without a single request being dropped. The problem is that the health probe will do a test every 5 seconds and needs to fail 2 times in a row. This means when I kill the application server, a lot of requests during those 10 seconds will time out. How can I avoid this?
I have already tried running the health probe on a different port, then denying all traffic to the different port, using windows firewall. Load balancer will think the application is down on that node, and therefore no longer send new traffic to that specific node. However... Azure LB does hash-based load balancing. So the traffic which was already going to the now killed node, will keep going there for a few seconds!
First of all, could you give us additional details: is your database load balanced as well ? Are you performing read and write on this database or only read ?
For your information, you have the possibility to change Azure Load Balancer distribution mode, please refer to this article for details: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-distribution-mode
I would suggest you to disable the server you are updating at load balancer level. Wait a couple of minutes (depending of your application) before starting your updates. This should "purge" your endpoint. When update is done, update your load balancer again and put back the server in it.
Cloud concept is infrastructure as code: this could be easily scripted and included in you deployment / update procedure.
Another solution would be to use Traffic Manager. It could give you additional option to manage your endpoints (It might be a bit oversized for 2 VM / endpoints).
Last solution is to migrate to a PaaS solution where all this kind of features are already available (Deployment Slot).
Hoping this will help.
Best regards

502 server error in Google App Engine Flexible when load testing with JMeter

I have deployed a simple Spring boot app in Google App Engine Flexible. The app. has two APIs, one to add the user data into the DB (xxx.appspot.com/add) the other to get all the user data from the DB (xxx.appspot.com/all).
I wanted to see how GAE scales for the load, hence used JMeter to create a load with 100 user concurrency ramped up in 10 seconds and calls these two APIs in half a second delay, forever. While it runs fine for sometime (with just one instance), it starts to fail after 30 seconds or so with a "java.net.SocketException" or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays,
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 mins or so, and whenever the load test happens it repeats the same behavior as mentioned above. I expect GAE to auto-scale based on the load coming in to handle it without any down time (using multiple instances), instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is,
runtime: java
env: flex
service: hello-service
automatic_scaling:
min_num_instances: 1
max_num_instances: 10
I am a bit stuck with this one, Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resource parameter, it defaulted to the pre-defined values for both CPU and Memory. In this case, the default
memory was set at 0.6GB. App Engine Flex instances uses about 0.4GB
for overhead processes. Given Java is known to consume higher memory, there is a
great likelihood that the overhead processes consumed more than the
approximate 0.4GB value. Now instances in App Engine are restarted due
to a variety of reasons including optimization due to memory use. This
explains why your instances went off and it shows Tomcat is starting
up (they got restarted) and ends up in 502 error due to the nginx is
not able to complete the request. Fixing the above may lessen if not completely eliminate the 502s.
After I have specified the resources attribute and increased the configuration in app.yaml 502 error seems to be gone.

How to prevent WebSphere from starting before files from an application update have been unpacked

Using the WebSphere Integrated Solutions Console, a large (18,400 file) web application is updated by specifying a war file name and going through the update screens and finally saving the configuration. The Solutions Console web UI spins a while, then it returns, at which point the user is able to start the web application.
If the application is started after the "successful update", it fails because the files that are the web application have not been exploded out to the deployment directory yet.
Experimentation indicates that it takes on the order of 12 minutes for the files to appear!
One more bit of background that may be significant: There are 19 application servers on this one WebSphere instance. WebSphere insists that there be a lot of chatter between them, even though they don't need anything from each other. I wondered if this might be slowing things down when it comes to deployment. Or if there's some timer in the bowels of WebSphere that is just set wrong (usual disclaimers apply...I'm just showing up and finding this situation...I didn't configure this installation).
Additional Information:
This is a Network Deployment configuration, and it's all on one physical host.
* ND 6.1.0.23
Is this a standalone or a ND set up? I am guessing it is ND set up considering you have stated that they are 19 app servers. The nodes should be synchronized with the deployment manager so that the updated files are available to the respective nodes.
After you update and save the changes, try and synchronize the nodes with the dmgr (or alternatively as part of the update process, click on review and the check the box which says synchronize nodes) and this would distribute the changes to the various nodes.
The default interval, i believe is 1 minute.
12 minutes certainly sounds a lot. Is there any possibility of network being an issue here?
HTH
Manglu

Resources