Sitecore page load slowness - performance

I'm using Sitecore instance 9.1, Solr 7.2.1, and SXA 1.8.
I have deployed the environment on Azure, and while monitoring incoming requests (to the CD instance) I've noticed that some pages load slowly at specific times.
I've explored App Insights and found behavior I can't explain: the request takes 28.7 seconds, while its breakdown shows operations that take only milliseconds. How is that possible? And how can I explain what is happening on the app service during the extra 28 seconds?
I've checked the profiler, and it shows that the thread takes only 1042.48 ms. How is that possible?
This is an intermittent issue that happens during the day; regular requests are served within 3 to 4 seconds.

I noticed that Azure often shows a profile trace for a "similar", but completely different request when clicking from the End-to-end transaction view. You can check this by comparing the timestamp and URL of the profile trace and the transaction you clicked from.
For example, I see a transaction logged at 8:58:39 PM, 2021-09-25 with 9.1 s response time:
However, when I click the profile trace icon, Azure takes me to a trace that was captured 10 minutes earlier, at 08:49:20 PM, 2021-09-25 and took only 121.64 ms:
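The timestamp comparison from the example above can be sketched in a few lines of Python. This is only an illustration: the helper names and the "gap must not exceed the transaction's own duration" rule are my assumptions, not anything Azure defines.

```python
from datetime import datetime

# Timestamp layout as written in the example above, e.g. "8:58:39 PM, 2021-09-25"
PORTAL_FORMAT = "%I:%M:%S %p, %Y-%m-%d"

def gap_seconds(transaction_time: str, trace_time: str) -> float:
    """Absolute time difference between a transaction and a profile trace."""
    t1 = datetime.strptime(transaction_time, PORTAL_FORMAT)
    t2 = datetime.strptime(trace_time, PORTAL_FORMAT)
    return abs((t1 - t2).total_seconds())

def plausibly_same_request(transaction_time: str, trace_time: str,
                           transaction_duration_s: float) -> bool:
    """Hypothetical heuristic: the two timestamps should be no further apart
    than the transaction's own duration."""
    return gap_seconds(transaction_time, trace_time) <= transaction_duration_s

gap = gap_seconds("8:58:39 PM, 2021-09-25", "08:49:20 PM, 2021-09-25")
same = plausibly_same_request("8:58:39 PM, 2021-09-25",
                              "08:49:20 PM, 2021-09-25", 9.1)
print(gap)   # 559.0 seconds, i.e. 9 min 19 s
print(same)  # False: the trace cannot belong to the 9.1 s transaction
```

You would still compare the URLs by eye as well; the timestamps alone only rule a trace out.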
So, if the issue you experience is intermittent and you cannot replicate it easily, try looking at the profile traces with the Slowest wall clock time by going to Application Insights → Performance → Drill into profile traces:
This will show you the worst-performing requests captured by the profiler at the top of the list:

To figure out why it is slow, you’ll need to understand what happens internally, for example:
How is the wall clock time spent while processing your request?
Are there any locks internally?
The source of that data is dynamic profiling, which Azure can do on demand.
The IIS stats report would show you the slowest requests, so you could look into the Thread Time distribution to see where those 28 seconds are spent.
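To illustrate how a lock can eat wall clock time while the request itself does very little, here is a minimal Python sketch. It is not Sitecore or Azure code; `handle_request`, `hold_lock` and the sleep durations are hypothetical stand-ins.

```python
import threading
import time

lock = threading.Lock()

def handle_request(timings: dict) -> None:
    """Record how long the request waited for the lock versus how long it worked."""
    start = time.perf_counter()
    with lock:                      # blocked here while another thread holds the lock
        acquired = time.perf_counter()
        time.sleep(0.05)            # stand-in for the real (short) work
    end = time.perf_counter()
    timings["wait_s"] = acquired - start
    timings["work_s"] = end - acquired
    timings["wall_s"] = end - start

def hold_lock(seconds: float, ready: threading.Event) -> None:
    """Hypothetical other request that keeps the shared lock busy."""
    with lock:
        ready.set()
        time.sleep(seconds)

ready = threading.Event()
holder = threading.Thread(target=hold_lock, args=(0.3, ready))
holder.start()
ready.wait()                        # make sure the lock is already taken

timings = {}
handle_request(timings)
holder.join()
print(timings)                      # most of wall_s is wait_s, not work_s
```

A breakdown that only lists the "work" part would look fast here, even though the caller waited several times longer.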

In Sitecore, the initial prefetch configuration pre-populates the prefetch caches when the application starts. Pre-heated prefetch caches help to reduce the processing time of incoming requests, but loading them takes time during the initial stage.
Sitecore XP instance takes too long to load. This is caused by a performance issue in the CatalogRepository.GetCatalogItems method. It will be fixed in upcoming updates.
See the Sitecore Knowledge Base.
In Sitecore XP 9.0 the initial prefetch configuration was revised. The prefetch cache for the core database was configured to include items that are used to render the Sitecore Client interface.
The Sitecore Client interface is not used on Content Delivery instances. Disabling initial prefetch configuration for the Core database helps in avoiding excessive resource consumption on the SQL Server hosting the Core database.
Change the configuration of the Core database in the \App_Config\Sitecore.config file:
Refer to the Sitecore Knowledge Base.

Related

ASP.NET 5 Web API application intermittently unresponsive

We are working on an ASP.NET 5 Web API project that is in production now but we are experiencing an issue where it becomes unresponsive intermittently throughout the day.
A few notes about the application architecture. It is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully following the best practices provided by Microsoft. Heavy focus on async operations, minimizing query response times and offloading more expensive operations into background services.
The app is extremely responsive. It performs with sub 100ms responses on almost all requests, even the more complicated requests, and all the way up until it becomes unresponsive this high level of performance remains the same. We tend to see between 10-30 requests per second and 300-500 select queries per second at peak usage so not too extreme. However, randomly (2-3 times over a 24 hour period) it will begin hanging on requests and simply not respond to the request. During this time, the database is still extremely responsive and we are never over 300 connections out of our 512 connection limit.
The resources on the application server itself are never really taxed much at all. The CPU never gets above ~20% and the memory usage sits around 20-30%.
If I were to stop the site in IIS and start it again while this is happening, it will quickly come back online. If I don't, it will be down for a few minutes until IIS finally kills it due to a failed health check. There are no real errors generated as a response to the issue other than typical errors caused by the hanging of the process, such as connection terminated errors. The only thing I have seen before that gave me pause was the fact that there were a few connection timeouts when getting the connection from the pool, but like I said, the connections to the server are never close to the limit.
Also, this app and version has been in production for months and it wasn't until the traffic volume started to grow that we started seeing these issues. At this point, I am at a loss for next steps of troubleshooting and I'm seeking suggestions.
In the IIS App Pool advanced settings, set Start Mode to AlwaysRunning.
I never found a root cause for this issue; however, after updating to newer versions of .NET MVC, this issue went away. My best guess is that changes in Kestrel possibly resolved it, although I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.

Web Server Performance Degradation

The web application is running on Spring Boot and deployed on WebLogic.
We have set the maximum thread count to 400 and the JDBC pool to 100 connections.
When we perform load testing on the web application, the performance is optimal when the load is low (the response time is less than 200 ms for most of the HTTP requests that we made).
When we increase the load, we can see that the thread count and the JDBC connection count also increase gradually, but they are nowhere near the max. However, the response time gets much longer, and a request can take more than 5 seconds to respond.
CPU usage, thread count, memory, and JDBC connections seem to be normal during this period.
Another observation: while testing, when we saw that the performance was degrading, we used another machine to make an HTTP call to the server that only retrieves text without any DB access or logic, and even this simple HTTP call took 10 s to respond. (And the server resources were still not at max!)
So, we are wondering what keeps them waiting.
Any other possible bottleneck?
If the server doesn't lack resources like CPU/RAM/etc., only a profiler can tell you where your application spends the most time, which might be in:
Waiting in a queue for the next thread/DB connection from the pool to become available
Slow database query
Inefficient functions/algorithms which are subject to optimization
WebLogic configuration not suitable for high loads
JVM configuration not suitable for high loads (i.e. the system is doing garbage collection too often/too long)
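The first item in the list above (waiting in a queue for a pooled thread or connection) is easy to underestimate, because CPU and memory stay low while response times climb. Here is a rough Python simulation of the effect; the pool size, request count and 50 ms service time are made-up numbers, and this is not WebLogic code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(submitted_at: float, service_s: float) -> tuple:
    """Return (time spent queued, total response time) for one request."""
    started_at = time.perf_counter()
    time.sleep(service_s)            # stand-in for a fast handler
    finished_at = time.perf_counter()
    return started_at - submitted_at, finished_at - submitted_at

def run_load(pool_size: int, requests: int, service_s: float = 0.05) -> list:
    """Submit all requests at once to a fixed-size pool and collect timings."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(task, time.perf_counter(), service_s)
                   for _ in range(requests)]
        return [f.result() for f in futures]

results = run_load(pool_size=2, requests=8)
worst_wait = max(wait for wait, _ in results)
worst_response = max(total for _, total in results)
print(f"worst queue wait: {worst_wait * 1000:.0f} ms")
print(f"worst response:   {worst_response * 1000:.0f} ms")
```

Every handler still takes about 50 ms, yet the last requests respond several times slower because they sat in the queue. A profiler that shows blocked/waiting time makes this visible; CPU graphs do not.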
So I would recommend re-running your test with profiler tool telemetry enabled and at the same time monitoring essential JVM metrics using, for example, the JMXMon Sample Collector, which can be used for monitoring your application-specific metrics as well. It's a plugin which can be installed using the JMeter Plugins Manager.
For a detailed approach on how to go about identifying poor thread performance, I suggest you take a look at the TSA Method by Brendan Gregg.

Ajax-calls: big differences between server runtime and client waiting time

I have two REST endpoints driving some navigation in a web site. Both create nearly the same response, but one gets its data straight from the DB whereas the other has to ask a search engine (Solr) first to get some data and then do the DB calls.
If I profile both endpoints via JProfiler, I get a higher runtime (approx. 60%) for the second one (about 31 ms vs. 53 ms). That's as expected.
Profile result:
If I view the same Ajax calls from the client side, I get a very different picture.
The faster of the two calls takes about 146 ms waiting and network time
The slower of the two calls takes about 1.4 seconds waiting and network time
Frontend timing is measured via Chrome developer tools. The server is a Tomcat 7.0.30 running in STS 3.2. Client and server live on the same system; DB and Solr are external, so there should be no network latency between Tomcat and the browser. As a side note: the faster response has the bigger payload (2.6 vs 4.5 kB).
I have no idea why the slower of the two calls takes about 60% more server time but in sum nearly 1000% more "frontend time".
The question is: is there any way I can figure out where these timing differences originate?
By default, the CPU views in JProfiler show times in the "Runnable" thread state. If a thread reads data from a socket connection or waits for some condition, that time is not included in the "Runnable" thread state.
In the upper right corner of the CPU views there is a thread state selector. If you change that to "All states", you will get times that you can compare with the wall clock times from the browser.
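The same distinction can be shown outside JProfiler. Below is a small Python sketch (not JProfiler's API) that measures one call twice: once as CPU time of the thread, which roughly corresponds to the "Runnable" view, and once as wall clock time, which corresponds to "All states". The 200 ms sleep is a hypothetical stand-in for waiting on a socket.

```python
import time

def handle_request() -> int:
    """Do a little CPU work, then wait (as if reading from a socket)."""
    total = 0
    for i in range(200_000):        # CPU-bound part
        total += i * i
    time.sleep(0.2)                 # waiting part: uses no CPU
    return total

wall_start = time.perf_counter()
cpu_start = time.thread_time()      # CPU time of the current thread only
handle_request()
cpu_s = time.thread_time() - cpu_start
wall_s = time.perf_counter() - wall_start

print(f"CPU time:        {cpu_s * 1000:.0f} ms")
print(f"wall clock time: {wall_s * 1000:.0f} ms")
```

The wall clock figure includes the wait and is the one that is comparable to what the browser reports.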

Troubleshooting MVC4 Web API Performance Issues

I have an asp.net mvc4 web api interface that gets about 54k requests a day.
http://myserv.x.com/api/123/getstuff?whatstuff=thisstuff
I have 3 web servers behind a load balancer that are setup to handle the http requests.
On average response times are ~300ms. However, lately something has gone awry (or maybe it has always been there) as there is sporadic behavior of response times coming back in 10-20sec. This would be for the same request hitting the same server directly instead of through the load balancer.
GIVEN:
- System has been passed down to me so there may be gaps with IIS configuration, etc.
- Database: SQL Server 2008R2
- Web Servers: Windows Server 2008R2 Enterprise SP1
- IIS 7.5
- Using MemoryCache aggressively with Model and Business Objects with eviction set to 2hrs
- Looked at the logs but really don't see anything significantly relevant
- One application pool...no other LOB applications running on this server
Assumptions & Ask:
Somehow I'm thinking that something is recycling the application pool, or IIS worker threads are shutting down and restarting, thus causing each new request to warm up and recache itself. It's so sporadic that it's tough to troubleshoot right now. The same request to the same server comes back fast as expected, in about 300 ms (back to back N requests), since it was cached...but wait about 5-10-20 min and that same request to the same server takes 16 seconds.
I have limited tracing to go by as these are prod systems so I can only expose so much logging details. Any help and information attacking this or similar behavior somebody else has run into is appreciated. Thx
UPDATE:
The w3wp.exe process grows to ~3 GB. Somehow it gets wiped out and the PID changes, so either the process itself or something else is killing it every 3-4 min. I see tons of warnings in my web server (IIS) log:
A process serving application pool 'MyApplication' suffered a fatal
communication error with the Windows Process Activation Service. The
process id was '1732'. The data field contains the error number.
After 4-5 days of assessing IIS and configuration vs internal code issues, I finally found the issue, with little to no help from the WinDbg or DebugDiag IIS tools. Those tools contain so much information, even with mini dumps or log trace stacks, that they can be red herrings. The best bet was to reproduce it by setting up a "copy intelligently" instance of a production system, which we did not have at the time and which took a bit for ops to set up.
Needless to say, the problem had to do with over-caching business objects. There was one race condition where updates on a certain table were updating an attribute of the corresponding business object (updates were coming from multiple servers), which was causing an OOC stack overflow that pretty much caused the caching to recursively cache itself to death, thus causing the w3wp.exe process to die and pseudo-recycle itself. It was one of those edge cases that was incredibly hard to test and repro in a non-production environment.

ASP.NET MVC lost in finding bottleneck

I have an ASP.NET MVC app which accepts file uploads and has result polling using SignalR. The app is hosted on a Prod server with IIS7, 4 GB RAM and a two-core CPU.
The app works perfectly on the Dev server, but when I host it on the Prod server with about 50 000 users per day, the app becomes unresponsive after five minutes of running. The web page request time increases dramatically, and it takes about 30 seconds to load one page. I have tried to record every MvcApplication.Application_BeginRequest event call and got 9000 hits in 5 minutes. I'm not sure whether this is an acceptable number of hits for an app like this.
I have used ANTS Performance Profiler (not useful for profiling a Prod app: slow and eats all the memory) to profile the code, but the profiler does not show any time delay issues in my code/MSSQL queries.
I have also tried to monitor for CPU and RAM spikes, but I didn't find any. CPU usage sometimes goes up to 15% but never higher, and memory usage is normal.
I suspect that there is something wrong with the request or thread limits in ASP.NET/IIS7, but I don't know how to profile it.
Could someone suggest any profiling solutions which could help in this situation? I have been trying to hunt the problem for two weeks already without any result :(
You may try using MiniProfiler, and more specifically the MiniProfiler.MVC3 NuGet package, which is specifically created for ASP.NET MVC applications. It will show you all kinds of useful information, such as the time spent in different methods during the execution of the request.
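To give an idea of what such per-step timing looks like, here is a minimal Python sketch of the general technique. It is not MiniProfiler's API; `StepTimer`, the step names and the sleeps are hypothetical and only illustrate wrapping parts of a request in named timers.

```python
import time
from contextlib import contextmanager

class StepTimer:
    """Collects (step name, elapsed seconds) pairs for one request."""

    def __init__(self) -> None:
        self.steps = []

    @contextmanager
    def step(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the step even if the wrapped code raises
            self.steps.append((name, time.perf_counter() - start))

timer = StepTimer()
with timer.step("load user"):
    time.sleep(0.02)                # stand-in for a database call
with timer.step("render view"):
    time.sleep(0.01)                # stand-in for view rendering

for name, seconds in timer.steps:
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Because it measures wall clock time per step, it also shows time spent waiting, which CPU-only profiles can miss.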

Resources