Azure Functions - Java CosmosClientBuilder slow on initial connection - performance

We're using Azure Functions with the Java SDK and connect to Cosmos DB using the following Java API:
CosmosClient client = new CosmosClientBuilder()
        .endpoint("https://my-cosmos-project-xyz.documents.azure.com:443/")
        .key(key)
        .consistencyLevel(ConsistencyLevel.SESSION)
        .buildClient();
This buildClient() call establishes the connection to Cosmos DB, which takes 2 to 3 seconds.
The subsequent database queries using that client are fast.
Only this first setup of the connection is pretty slow.
We keep the CosmosClient in a static variable, so we can reuse it across the multiple HTTP requests that go to our function.
But once the function goes cold (Azure shuts it down after a few minutes unused), the static variable is lost and the client has to reconnect when the function starts up again.
Is there a way to make this initial connection to Cosmos DB faster?
Or do you think we need to increase the time a function stays warm if we need faster response times?

This is expected behavior; see https://youtu.be/McZIQhZpvew?t=850.
The first request a client makes needs to go through a warm-up step. This warm-up consists of fetching the account, container, routing, and partitioning information needed to know where to route requests (as you experienced, subsequent requests do not incur this extra latency). Hence the importance of maintaining a singleton instance.
In some Functions plans (Consumption), instances get deprovisioned when there is no activity, in which case any existing instance of the client is destroyed; when a new instance is provisioned, your first request will pay this warm-up cost.
There is currently no workaround I'm aware of in the Java SDK, but this should not affect your P99 latency since it's just the first request on a cold client.
Hope this and the video help with the reason.
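To put the singleton advice in code: below is a minimal sketch of keeping one CosmosClient per JVM (and therefore per warm Function instance), so only the first invocation after a cold start pays the warm-up cost. The COSMOS_ENDPOINT and COSMOS_KEY environment variable names are assumptions for illustration.

import com.azure.cosmos.ConsistencyLevel;
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;

// Holder for a single CosmosClient per JVM; built lazily on first use
// and reused by every subsequent function invocation.
public final class CosmosClientHolder {
    private static CosmosClient client;

    private CosmosClientHolder() { }

    public static synchronized CosmosClient get() {
        if (client == null) {
            client = new CosmosClientBuilder()
                    .endpoint(System.getenv("COSMOS_ENDPOINT")) // assumed env var
                    .key(System.getenv("COSMOS_KEY"))           // assumed env var
                    .consistencyLevel(ConsistencyLevel.SESSION)
                    .buildClient();
        }
        return client;
    }
}

Each function invocation then calls CosmosClientHolder.get() instead of building a new client; this does not remove the cold-start cost, it only ensures it is paid once per instance.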

Related

Limit concurrent queries in Spring JPA

I have a simple REST endpoint that executes a Postgres procedure.
This procedure returns the current state of a device.
For example:
20 devices.
A client app connects to the API and makes 20 requests to that endpoint every second.
For x clients there are x*20 requests; for 2 clients, 40 requests.
This causes heavy CPU load on the Postgres server once there are many clients and/or many devices.
I didn't create it, but I need to redesign it.
How can I limit concurrent queries to the DB just for this endpoint? It would be a hot fix.
My second idea is to create a background worker that executes only one query at a time; the endpoint would then fetch the data from memory.
I would try the simple way first. Try to reduce
the number of database connections in the pool, OR
the number of worker threads in the built-in Tomcat.
A more flexible option would be to put the logic behind a thread pool that limits the number of worker threads, as sketched below. This is not trivial if the Spring context and the database are used inside a worker. Take a look at Spring's @Async annotation.
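As a rough illustration of the thread pool idea, here is a sketch of a bounded executor for @Async methods; the bean name, pool sizes, and queue capacity are assumptions to be tuned.

import java.util.concurrent.Executor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Bounded executor: at most two @Async tasks run concurrently;
// further tasks wait in the queue instead of hitting the database.
@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "dbExecutor")
    public Executor dbExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(2);
        executor.setMaxPoolSize(2);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("db-");
        executor.initialize();
        return executor;
    }
}

A service method annotated with @Async("dbExecutor") and returning a CompletableFuture would then never run more than two queries at the same time.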
Off-topic: the solution we are discussing here looks like a workaround. On its own it will most probably increase throughput only by a factor of 2, maybe 3. It is not JEE-conformant and will most probably not be very stable. It would be better to refactor the application to avoid such a problem in the first place. Another option would be to buy a new database server.
Update: a JEE-compliant solution would be to implement some sort of bulkhead pattern. It limits the number of concurrently running requests and rejects new ones once a critical number is reached. The server application answers with "503 Service Unavailable"; the client application catches this status and retries a moment later (see "exponential backoff").
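A minimal sketch of such a bulkhead using a plain Semaphore, assuming a hypothetical DeviceService wrapping the procedure call:

import java.util.concurrent.Semaphore;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Bulkhead: at most MAX_CONCURRENT requests reach the database;
// the rest are rejected with 503 so clients can back off and retry.
@RestController
public class DeviceStateController {

    private static final int MAX_CONCURRENT = 4; // tune to what Postgres tolerates
    private final Semaphore bulkhead = new Semaphore(MAX_CONCURRENT);
    private final DeviceService deviceService; // hypothetical service

    public DeviceStateController(DeviceService deviceService) {
        this.deviceService = deviceService;
    }

    @GetMapping("/devices/state")
    public ResponseEntity<?> deviceStates() {
        if (!bulkhead.tryAcquire()) {
            // Critical number reached: reject instead of queueing on the DB.
            return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).build();
        }
        try {
            return ResponseEntity.ok(deviceService.queryStates());
        } finally {
            bulkhead.release();
        }
    }
}

Libraries such as Resilience4j provide the same pattern off the shelf if hand-rolling it feels too fragile.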

Initial call on ServiceFabric proxy is VERY slow

Whenever I'm calling one Service Fabric service from another, the first call on the proxy is VERY slow, i.e. 100x slower than all subsequent calls. I've put timings in that record the time immediately before the call and the time immediately inside the service method being called, and this can easily be over 60 seconds! The Service Fabric cluster is a standalone cluster running on 12 nodes/VMs.
Interestingly, the length of time the first call takes seems to be related to the number of nodes, i.e. if I deactivate half the nodes the time is reduced (though not by half). Also, when running the exact same code on a dev cluster on my local PC, the first call typically takes around 8 seconds, with subsequent calls taking < 10 ms on either system. In addition, creating another proxy to the same service in the same client process still results in fast call times; it seems as if the proxy factory (which I believe SF caches per client process) is created on first use of the proxy and takes a very long time.
Interestingly no exceptions are thrown and the services actually work!
So my question is, why does it take so long the first time a call is made from one service to another on a proxy created with ServiceProxy.Create()?
According to the SF remoting docs (see below, emphasis mine), ServiceProxy.Create is a wrapper around ServiceProxyFactory, and the first call also involves setting up the factory for the subsequent calls.
ServiceProxyFactory is a factory that creates proxy for different remoting interfaces. If you use API ServiceProxy.Create for creating proxy, then framework creates the singleton ServiceProxyFactory. It is useful to create one manually when you need to override IServiceRemotingClientFactory properties. Factory is an expensive operation. ServiceProxyFactory maintains cache of communication client. Best practice is to cache ServiceProxyFactory for as long as possible.
I have not experienced slow resolution anywhere near what you have; however, I create my proxies when my API service starts up, using dependency injection.
The way I have my system set up is that the stateless API service (ASP.NET Core) communicates with the backend SF services.
It's possible that I am actually experiencing a longer delay, but by the time I go to use the application the resolution process has already started and finished, rather than starting when I make the first request to the app.
private void InitializeContainer(IApplicationBuilder app)
{
    // Add application presentation components:
    Container.RegisterMvcControllers(app);
    Container.RegisterMvcViewComponents(app);

    // Add application services.
    Container.Register(() => ServiceProxy.Create<IContestService>(FabricUrl.ContestService), Lifestyle.Transient);
    Container.Register(() => ServiceProxy.Create<IFriendService>(FabricUrl.FriendService), Lifestyle.Transient);
    Container.Register(() => ServiceProxy.Create<IUserService>(FabricUrl.UserService), Lifestyle.Transient);
    Container.Register(() => ServiceProxy.Create<IBillingService>(FabricUrl.BillingService), Lifestyle.Transient);

    Container.RegisterSingleton(AutoMapperApi.Configure());

    // Cross-wire ASP.NET services (if any). For instance:
    Container.RegisterSingleton(app.ApplicationServices.GetService<ILoggerFactory>());

    // NOTE: Prevent cross-wired instances as much as possible.
    // See: https://simpleinjector.org/blog/2016/07/
}

Falcor: avoid outdated client cache

I'm considering using Falcor in an app project I'm currently working on. I've started reading the docs, but there's still one issue that is not entirely clear to me.
Let's take this example.
At time zero, client A performs a request to a Falcor model, which in turn retrieves the needed data from a server DataSource and stores it in the client's cache.
At time one, the same server data is changed by operations performed by client B.
At time two, client A performs the same request to the Falcor model, which finds a cached value and serves the now-outdated data.
Is there any way to notify client A after time one that its Falcor cache for that data is outdated, and that it should instead perform a new request to the server DataSource?
You can use WebSockets to send messages to the client. On the client you can call invalidate to manually invalidate the cache. You can also set an expires time on values to cause them to expire after a certain amount of time.

Cache the client connection with ElasticSearch Nest so the first call to the client is fast

I'm about to set up suggestion search for Elasticsearch with the NEST client. Ideally I'd start matching as of the 2nd character entered. However, it takes 600 ms the first time I call the client; every subsequent call is more like 20 ms. Is there a way to cache or prepare the NEST client?
I've read this post: Elasticsearch and .NET
I've also read that I can either create a new client or use the same instance of the client with no repercussions.
I just want to get the client ready for use before I call it so the user isn't waiting for the client to validate itself.
For the moment I'm creating the client connection as soon as the user hits the website, then saving the client reference in the session. However, the first search is still slow even though I've already established the connection. Is there a way to preload/cache the connection so that the delay occurs during page load?
The cache built up on the first hit is per AppDomain, so you do not need to cache the client itself. Every client you instantiate after the first hit is going to be warm.
I've opened a ticket so that you will be able to initiate the warm-up process at application startup, and no longer penalize the first user of your system with the warm-up cost:
https://github.com/elasticsearch/elasticsearch-net/issues/742

Select on a one-row table takes seconds

I am experiencing very low performance in my web application: trivial HTTP requests take dozens of seconds to be processed. Tracing through the application code, I discovered that the majority of the time is spent executing the first DB query, even if it is as simple as a SELECT on a single-row, single-column table. This happens for every HTTP request, independently of the query performed. After this first pathological DB interaction, the remaining queries run smoothly.
I am using Hibernate on top of an Oracle DB (via JDBC).
It is not a connection pool problem, since I am successfully using Hibernate with c3p0; nor does it seem to be related to Oracle itself, because every query returns immediately when performed directly on the DB.
Furthermore, the Hibernate SessionFactory is correctly created only once, at application start-up time, and concurrency is not a problem at all since the tests were done with a single user.
Finally, the DB's IP address is correctly resolved in my application server's /etc/hosts, so even DNS-related issues can be ruled out (I am using two distinct virtual machines, DB and APP server).
I do not know what to look for. Any help?
This sounds like your session factory object is being spun up on the first query. I generally try to initialize the session factory at application startup instead, because when the first query pays this cost the user can see the slowdown. Doing it up front at startup avoids this.
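A minimal warm-up sketch, assuming Hibernate 5.2+ with a hibernate.cfg.xml on the classpath; the trivial query against Oracle's DUAL table is just an example of a cheap statement:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

// Build the SessionFactory and run one trivial query at application
// startup so the first real request does not pay the warm-up cost.
public final class HibernateWarmup {

    public static SessionFactory warmUp() {
        SessionFactory sessionFactory = new Configuration()
                .configure("hibernate.cfg.xml") // assumed config location
                .buildSessionFactory();
        try (Session session = sessionFactory.openSession()) {
            // Any cheap statement forces connection setup and metadata caching.
            session.createNativeQuery("select 1 from DUAL").getSingleResult();
        }
        return sessionFactory;
    }
}

Calling HibernateWarmup.warmUp() from a startup hook (e.g. a ServletContextListener) moves the pathological first-query cost out of the request path.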
