Hello all who read this,
We have written a router function on Azure, in an App Service plan, that receives messages from IoT Hub
and, depending on the message type, routes each message to another event hub.
Previously we had 6 output bindings to event hubs in this function.
Recently we added 3 more message types, so 3 more output bindings to 3 more event hubs.
No processing of the messages happens in this function, but what we see now is that we spend 16 times more time in the routing function.
Is there a performance issue with having multiple output bindings?
We don't see an increase in the load of incoming messages.
We are running on azure functions 1.0 (Runtime version: 1.0.12205.0 (~1))
Regards Ben
Simplified sample code of the routing function:
public static class IotHubRouterFunction
{
    [FunctionName("IotHubRouterFunction")]
    public static void Run(
        [EventHubTrigger("%iothub%", Connection = "IothubRouterListen")] EventData myEventHubData,
        [EventHub("%msg1-eventhub%", Connection = "msg1event")] ICollector<EventData> eventHub4Dmsg1Event,
        [EventHub("%msg2-eventhub%", Connection = "msg2event")] ICollector<EventData> eventHub4Dmsg2Event,
        [EventHub("%msg3-eventhub%", Connection = "msg3event")] ICollector<EventData> eventHub4Dmsg3Event,
        //... like 6 more bindings like this
        ILogger logger)
    {
        try
        {
            var messageType = GetValue(myEventHubData.Properties, "type");
            // routing: forward the incoming event to the collector that matches its type
            switch (messageType)
            {
                case "msg1event":
                {
                    eventHub4Dmsg1Event.Add(myEventHubData);
                    break;
                }
                case "msg2event":
                {
                    eventHub4Dmsg2Event.Add(myEventHubData);
                    break;
                }
                case "msg3event":
                {
                    eventHub4Dmsg3Event.Add(myEventHubData);
                    break;
                }
                //6 more cases like this
                default:
                {
                    logger.LogError("Unrouteable message of type: {messageType}", messageType);
                    break;
                }
            }
        }
        catch (Exception ex)
        {
            //removed
        }
    }
}
With 6 bindings, messages fly through the router function in 50 ms.
With 9 bindings, messages crawl through the router function in 800 ms.
CPU usage on the app plan also rose by 30% (we scaled out so we have it under control, but why so much, and what is causing this?).
A little late with the follow-up on what happened.
In the end we found out what was going on.
We have several instances of our app plan,
but the old monitoring solution showed only the average CPU and memory across all instances of the app plan.
By switching to the newer metrics and Azure monitoring, we were able to drill down into the separate instances of the app plan and the instances of the functions.
We found that the function was running on three instances: two of them were running normally, but on the third the internal app pool had crashed and was consuming all the CPU it could get hold of while doing absolutely nothing.
We restarted the function and all issues were gone.
We are still wondering whether it was something in our code that made it go through the roof
or whether something happened in Azure that made it go crazy.
:-s
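The masking effect described above is plain arithmetic: an average over instances can look moderate while one instance is pegged. A minimal sketch in Java, using made-up CPU percentages (they are illustrative assumptions, not measurements from this app plan):

```java
import java.util.Arrays;

final class InstanceAverageSketch {
    // Mean CPU percentage across all instances, as an aggregated plan-level metric would report it.
    static double average(double[] cpuPerInstance) {
        return Arrays.stream(cpuPerInstance).average().orElse(0.0);
    }

    // Highest per-instance value, which only a per-instance drill-down exposes.
    static double max(double[] cpuPerInstance) {
        return Arrays.stream(cpuPerInstance).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: two healthy instances and one stuck at 100%.
        double[] cpu = {20.0, 25.0, 100.0};
        System.out.printf("average=%.1f max=%.1f%n", average(cpu), max(cpu));
    }
}
```

With these numbers the average stays below 50% even though one instance is saturated, which is why the per-instance view was needed to spot the problem.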
When you run an Azure Function under an App Service plan, you have to watch performance parameters such as scaling. Have you checked that your function is not getting overloaded?
On the other hand, this design approach seems wrong to me. With this many bindings there could be performance issues, and what happens if you need to add more bindings in the future? If you are not performing any operation on the messages, you shouldn't be taking on the overhead of redirecting them.
Event Grid
You can use Event Grid for that. IoT Hub publishes events to a topic, and the events are consumed by subscribers, in your case the other event hubs. You also get the advantages of micro-billing (serverless) and auto-scaling. https://learn.microsoft.com/en-us/azure/event-grid/overview
I am working on a project to read from our existing ElasticSearch instance and produce messages in Pulsar. If I do this in a highly multithreaded way without any explicit synchronization, I get many occurrences of the following log line:
Message with sequence id X might be a duplicate but cannot be determined at this time.
That is produced from this line of code in the Pulsar Java client:
https://github.com/apache/pulsar/blob/a4c3034f52f857ae0f4daf5d366ea9e578133bc2/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L653
When I add a synchronized block to my method, synchronizing on the pulsar template, the error disappears, but my publish rate drops substantially.
Here is the current working implementation of my method that sends Protobuf messages to Pulsar:
public <T extends GeneratedMessageV3> CompletableFuture<MessageId> persist(T o) {
    var descriptor = o.getDescriptorForType();
    PulsarPersistTopicSettings settings = pulsarPersistConfig.getSettings(descriptor);
    MessageBuilder<T> messageBuilder = Optional.ofNullable(pulsarPersistConfig.getMessageBuilder(descriptor))
            .orElse(DefaultMessageBuilder.DEFAULT_MESSAGE_BUILDER);
    Optional<ProducerBuilderCustomizer<T>> producerBuilderCustomizerOpt =
            Optional.ofNullable(pulsarPersistConfig.getProducerBuilder(descriptor));
    PulsarOperations.SendMessageBuilder<T> sendMessageBuilder;
    sendMessageBuilder = pulsarTemplate.newMessage(o)
            .withSchema(Schema.PROTOBUF_NATIVE(o.getClass()))
            .withTopic(settings.getTopic());
    producerBuilderCustomizerOpt.ifPresent(sendMessageBuilder::withProducerCustomizer);
    sendMessageBuilder.withMessageCustomizer(mb -> messageBuilder.applyMessageBuilderKeys(o, mb));
    synchronized (pulsarTemplate) {
        try {
            return sendMessageBuilder.sendAsync();
        } catch (PulsarClientException re) {
            throw new PulsarPersistException(re);
        }
    }
}
The original version of the above method did not have the synchronized(pulsarTemplate) { ... } block. It performed faster, but generated a lot of logs about duplicate messages, which I knew to be incorrect. Adding the synchronized block got rid of the log messages, but slowed down publishing.
What are the best practices for multithreaded access to the PulsarTemplate? Is there a better way to achieve very high throughput message publishing?
Should I look at using the reactive client instead?
EDIT: I've updated the code block to show the minimum synchronization necessary to avoid the log lines, which is just synchronizing during the .sendAsync(...) call.
Your usage w/o the synchronized should work. I will look into that though to see if I see anything else going on. In the meantime, it would be great to give the Reactive client a try.
This issue was initially tracked here, and the final resolution was that it was an issue that has been resolved in Pulsar 2.11.
Please try updating to Pulsar 2.11.
Under load in production we receive "RabbitMQ.Client.Exceptions.ConnectFailureException" (connection failed) and "MassTransit.RequestTimeoutException" (timeout waiting for response). The consumer does receive the message and sends a response back. It's as if the web app isn't listening or is unable to accept the connection.
We're running an ASP.NET web application (not MVC) on .NET Framework 4.6.2 on Windows Server 2019 with IIS. We're using MassTransit 7.0.4. In production, under load, we get exceptions related to RabbitMQ sockets or timeouts from MassTransit. They are difficult to reproduce in dev. RabbitMQ is mirrored; the problem seems to start once we turn on a high-load service that bumps the rate from 140 messages/sec to 250 messages/sec.
I have a few questions about the code architecture, and I'd also like to know whether anyone else is running into these kinds of timeout issues.
Questions:
1. Should I have static scope for the IBusControl? I.e., should it be static inside Global.asax? And does it matter at all if it's a singleton underneath?
2. Should I create a new IBusControl and start it per request (maybe in Application_BeginRequest)? Would that make a difference?
3. Would adding another worker process affect the total number of open connections I'm able to make, if this is a resource issue (exhausting threads, connections, or some other resource)?
Exceptions:
MassTransit.RequestTimeoutException
Timeout Waiting for response
Stacktrace:
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
MassTransit.Clients.ResponseHandlerConnectionHandle`1+<GetTask>d_11.MoveNext
System.Threading.ExecutionContext.RunInternal
RabbitMQ.Client.Exceptions.ConnectFailureException
Connection failed
Stacktrace:
RabbitMQ.Client.Impl.SocketFrameHandler.ConnectOrFail
RabbitMQ.Client.Impl.SocketFrameHandler.ConnectUsingAddressFamily
RabbitMQ.Client.Impl.SocketFrameHandler..ctor
RabbitMQ.Client.ConnectionFactory.CreateFrameHandler
RabbitMQ.Client.EndPointResolverExtensions.SelectOne
RabbitMQ.Client.ConnectionFactory.CreateConnection
How Our Code Works ( overview )
Static IBusControl that is instantiated the first time someone tries to produce a message. The whole connection and send code is a little large to put in here ( connection factory and other metric classes, but below are the interesting parts ).
static IBusControl B;
B = Bus.Factory.CreateUsingRabbitMq(x =>
{
    hostAddress = host.HostAddress;
    x.Host(new Uri(host.HostAddress), h =>
    {
        h.Username(host.UserName);
        h.Password(host.Password);
    });
    x.Durable = false;
    x.SetQueueArgument("x-message-ttl", 600000);
});
B.Start(new TimeSpan(0, 0, 10));
// Then send the Actual Messages
// Generic with TRequest and TResponse : class BaseMessage
// Pulling the code out of a few different classes
string serviceAddressString = string.Format("{0}/{1}?durable={2}", HostAddress, ChkMassTransit.QueueName(typeof(TRequest), typeof(TResponse)), false ? "true" : "false");
Uri serviceAddress = new Uri(serviceAddressString);
RequestTimeout rt = RequestTimeout.After(0, 0, 0, 0, timeout.Value);
IRequestClient<TRequest> reqClient = B.CreateRequestClient<TRequest>(serviceAddress, rt);
var v = reqClient.GetResponse<TResponse>(request, sendInfo.CT, sendInfo.RT);
if ( v.Wait(timeoutMS) ) { /*do some stuff*/ }
First, I find your lack of async disturbing. Using Wait or anything like it on TPL-based code is a recipe for death and destruction, pain and suffering, dogs and cats living together, etc.
1. Yes, you should have a single bus instance that is started when the application starts. Since you're doing request/response, set AutoStart = true on the bus configurator to make sure it's all warmed up and ready.
2. Never, no, one bus only!
3. Each bus instance only has a single connection, so you shouldn't see any resource issues related to capacity on RabbitMQ.
MassTransit 7.0.4 is really old; you might consider the easy upgrade to 7.3.1 and see if that improves things for you. It's the last version of the v7 codebase available.
I'm working on a POC for using MassTransit sagas to handle state changes in a system for grant applications. I'm using MassTransit 8.0.0-develop.394, .NET 6, EF Core 6.0.2 and ActiveMQ Artemis 1.19.0.
In the final solution the applicants can register their application and prepare the data for several weeks. A few days before the deadline another external system will be populated with data that will be used to validate the application data. Application data entered before the validation data is populated should just be scheduled for later validation, but data entered after should be validated immediately. I think MassTransit sagas with scheduled events look like a good fit for this.
In the POC I just schedule the validation start time some 10 seconds into the future from when the program starts, and use a shorter and shorter delay in the schedule until I just schedule it with a delay of TimeSpan.Zero.
From looking in the database I noticed that some of the scheduled events somehow get lost when I run the POC with an empty saga repository, but everything works fine when I rerun the program with existing sagas in the database. I use the same scheduling code in Initially and in DuringAny, which makes me think that there might be some limit on how short a delay it's safe to use when scheduling saga events?
Note 1: I've switched to not scheduling the event in the saga when there is less than 1 second until the validation can start; in that case I just publish the validation message directly, so this issue is not blocking me at the moment.
Note 2: I noticed this when running the POC from the command line and checking the database manually. I've tried to reproduce it in a test using the TestHarness, and also using ActiveMQ Artemis and InMemoryRepository, but with no luck. I've been able to reproduce it (more or less consistently) with a test using Artemis and the EF Core repository. I must admit that the test got quite complex, with a lot of Task.Delay and other stuff, so it might be hard to follow the logic, but I can post it here if anyone thinks it would help.
Update 2: using Chris Patterson's recommendation to put cfg.UseMessageRetry and cfg.UseInMemoryOutbox in the SagaDefinition and not on the bus.
Here is the updated code where MassTransit is configured:
private static ServiceProvider BuildServiceProvider()
{
    return new ServiceCollection()
        .AddDbContext<MySagaDbContext>(builder =>
        {
            MySagaDbContextFactory.Apply(builder);
        })
        .AddMassTransit(cfg =>
        {
            cfg.AddDelayedMessageScheduler();
            cfg.UsingActiveMq((context, config) =>
            {
                config.Host("artemis", 61616, configureHost =>
                {
                    configureHost.Username("admin");
                    configureHost.Password("admin");
                });
                config.EnableArtemisCompatibility();
                config.UseDelayedMessageScheduler();
                config.ConfigureEndpoints(context);
            });
            cfg.AddSagaStateMachine<MyStateMachine, MySaga, MySagaDefinition<MySaga>>()
                .EntityFrameworkRepository(x =>
                {
                    x.ConcurrencyMode = ConcurrencyMode.Optimistic;
                    x.ExistingDbContext<MySagaDbContext>();
                });
        })
        .AddLogging(configure =>
        {
            configure.AddFilter("MassTransit", LogLevel.Error); // Filter out all retry warnings
            configure.AddFilter("Microsoft", LogLevel.None);
            configure.AddSimpleConsole(options =>
            {
                options.UseUtcTimestamp = true;
                options.TimestampFormat = "HH:mm:ss.fff ";
            });
        })
        .BuildServiceProvider(true);
}
Here is the updated saga definition code
public class MySagaDefinition<TSaga> : SagaDefinition<TSaga> where TSaga : class, ISaga
{
    protected override void ConfigureSaga(IReceiveEndpointConfigurator endpointConfigurator, ISagaConfigurator<TSaga> consumerConfigurator)
    {
        endpointConfigurator.UseMessageRetry(r => r.Intervals(10, 50, 100, 500, 1000));
        endpointConfigurator.UseInMemoryOutbox();
    }
}
If you are scheduling messages from a saga, or really producing any messages from a saga, you should always have the following middleware components configured:
cfg.UseMessageRetry(r => r.Intervals(50,100,1000));
cfg.UseInMemoryOutbox();
That will ensure that messages produced by the saga are:
Only produced if the saga is successfully saved to the repository
Produced after the saga has been saved to the repository
More details are available in the documentation.
The reason being, a short delay is likely delivering the message before it has been saved, and the scheduled event isn't correlating to an existing saga instance because it hasn't saved yet.
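The ordering problem described above can be shown with a toy model. The Java sketch below is not MassTransit code and does not reflect its implementation; `ToyRepository`, `ToyOutbox` and the two `deliver...` methods are hypothetical names made up for illustration. It only contrasts "message delivered before the saga is saved" with "message held back until the save has completed".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OutboxSketch {
    // Stands in for the saga repository: remembers which saga ids have been saved.
    static final class ToyRepository {
        private final Set<String> saved = new HashSet<>();
        void save(String sagaId) { saved.add(sagaId); }
        boolean exists(String sagaId) { return saved.contains(sagaId); }
    }

    // Holds produced messages until flush() is called.
    static final class ToyOutbox {
        private final List<String> pending = new ArrayList<>();
        void add(String sagaId) { pending.add(sagaId); }

        // Delivers every buffered message and returns how many found their saga.
        int flush(ToyRepository repo) {
            int correlated = 0;
            for (String sagaId : pending) {
                if (repo.exists(sagaId)) {
                    correlated++;
                }
            }
            pending.clear();
            return correlated;
        }
    }

    // No outbox: a zero-delay message arrives before the save, so it cannot correlate.
    static boolean deliverImmediately(ToyRepository repo, String sagaId) {
        boolean correlated = repo.exists(sagaId); // message is handled first
        repo.save(sagaId);                        // saga is saved afterwards
        return correlated;
    }

    // With an outbox: the message is buffered, the saga is saved, then the message is released.
    static boolean deliverViaOutbox(ToyRepository repo, String sagaId) {
        ToyOutbox outbox = new ToyOutbox();
        outbox.add(sagaId);
        repo.save(sagaId);
        return outbox.flush(repo) == 1;
    }

    public static void main(String[] args) {
        System.out.println("immediate: " + deliverImmediately(new ToyRepository(), "saga-1"));
        System.out.println("outbox:    " + deliverViaOutbox(new ToyRepository(), "saga-1"));
    }
}
```

In this model the immediate delivery fails to find the saga while the outbox delivery succeeds, which mirrors why the lost events only showed up when the saga repository started out empty.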
Could someone help me read the New Relic summary and trace details? The following screenshots show the trace for a single transaction, which does not issue any query to the database. It is just a simple request with a few lines of Scala template code, which renders an HTML page and returns it to the client. This is just a single transaction that is currently running in production. Production has plenty of more complex transactions running, which make lots of external calls to Mongo, Maria, queues, etc.
Does the trace reveal anything about where the bottleneck could be? Are we, for example, running out of threads or workers? As I said, most of the transactions make lots of external web calls, which might hold a single thread for quite a long time. How can one actually check whether threads or workers are running out in a Play application? We are using 2.1.4.
What actually happens in the following calls?
Promise.apply 21.406ms
Async Wait 21.406ms
Actor.tell 48.366ms
PlayDefaultUpstreamHandler 6.292ms
Edit:
What is the purpose of the following calls? They have very high average call times.
scala.concurrent.impl.CallbackRunnable.run()
scala.concurrent.impl.Future$PromiseCompletingRunnable.run()
org.jboss.netty.handler.codec.http.HttpRequestDecoder.unfoldAndFireMessageReceived()
Edit:
play {
  akka {
    event-handlers = ["akka.event.slf4j.Slf4jEventHandler"]
    loglevel = WARNING
    actor {
      default-dispatcher = {
        fork-join-executor {
          parallelism-min = 350
          parallelism-max = 350
        }
      }
      exports = {
        fork-join-executor {
          parallelism-min = 10
          parallelism-max = 10
        }
      }
    }
  }
}
I'm not sure if this will help you 1 year later but I think the performance problems you were hitting are not related to Play, Akka or Netty.
The problem will be in your business logic or in the database access. The big times that you see for PromiseCompletingRunnable and unfoldAndFireMessageReceived are misleading. These times are reported by New Relic in a wrong and misleading way. Please read this post:
Extremely slow play framework 2.3 request handling code
I faced a similar problem, and mine was in the database, but New Relic reported big times in Netty.
I hope this helps you even now.
I have a straightforward SignalR setup: OWIN-hosted .NET server and JavaScript client (both at v2.1.1). The client uses SignalR to synchronize its copy of an ordered event stream maintained in an Rx ReplaySubject on the server. When a client connects, it provides a startAfter query parameter that is used to initialize an IObserver against the ReplaySubject, and this observer then sends each event in the observed sequence to the client. Each event has a sequence number, and the client can tell, based on the event sequence number, if any event is missing in the sequence. (Which would be a serious problem in this application.)
The problem is that the client regularly receives only portions of the event sequence. In fact, there is a regular pattern to this. For every 250 events there is a large gap. So for example, each test shows that the first gap was from somewhere between 70 and 80 to 250. Why always 250? And from there on, the "skip-to" point is always at intervals of 250; e.g., a gap from 263 to 500, then one from 511 to 750, etc. I have to assume that this is some kind of default buffer size.
Also, the first time a client connects to the server it always receives the entire sequence just fine. It's the subsequent connections that exhibit the regular skipping problem. So it seems like it's a server-side problem, and not a client problem at all.
I then added some checks to the server to ensure that the IObserver for each client is seeing all of the events in the correct order. It is. So it seems almost certain that the problem is on the SignalR server side and has nothing to do with Rx.
And finally, I checked to see if the dropped messages were perhaps just being delivered out of order (which I could live with, although I assumed SignalR provides an ordered-delivery guarantee). They are not - the messages just disappear into a void.
If it helps, I'm currently running locally, with IIS Express on Win 8.1 x64 and testing on IE Developer Channel as well as Chrome 36. The connection is using WebSockets. I couldn't find any reference to 250 as a special quantity in either the SignalR source (client or server) or the Rx.Net source.
Any suggestions on troubleshooting? I'd love to find a stable solution before I start building a complicated workaround.
Here's the relevant server-side code:
public class AllEventsReplaySource
{
    private readonly IHubConnectionContext<dynamic> clients;
    private readonly ReplaySubject<dynamic> allEvents;

    private AllEventsReplaySource(IHubConnectionContext<dynamic> clients)
    {
        this.clients = clients;
        this.allEvents = new ReplaySubject<dynamic>();
        // (Not shown: code that generates the input to the ReplaySubject.)
    }

    public void SubscribeClient(string connectionId, int startAfter)
    {
        this.allEvents.Skip(startAfter).Subscribe(e =>
        {
            // (Not shown: code that verifies no skips are occurring at this point for a client.)
            clients.Client(connectionId).notifyEvent(e);
        });
    }

    private readonly static Lazy<AllEventsReplaySource> instance =
        new Lazy<AllEventsReplaySource>(() => new AllEventsReplaySource(
            GlobalHost.ConnectionManager.GetHubContext<AllEventsReplayHub>().Clients));

    public static AllEventsReplaySource Instance
    {
        get { return instance.Value; }
    }
}

[HubName("allEventsReplayHub")]
public class AllEventsReplayHub : Hub
{
    private readonly AllEventsReplaySource source;

    public AllEventsReplayHub()
        : this(AllEventsReplaySource.Instance)
    { }

    public AllEventsReplayHub(AllEventsReplaySource source)
    {
        this.source = source;
    }

    public override Task OnConnected()
    {
        var previousSequenceNumber = Int32.Parse(Context.QueryString["startAfter"]);
        var connectionId = this.Context.ConnectionId;
        AllEventsReplaySource.Instance.SubscribeClient(connectionId, previousSequenceNumber);
        return base.OnConnected();
    }
}
The issue you are experiencing seems consistent with a message buffer overflow. When SignalR releases messages from its buffer, it does so in 250 message fragments by default.
SignalR will buffer at least the last 1000 messages sent to a given connectionId. This means that when you send the 1251st message, the first 250 get dereferenced by the buffer. This explains why when a client first connects to the server, it receives the entire sequence of messages. You have to send at least 1251 messages to a given client before the buffer will drop fragments. Again, this is all assuming default settings.
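The arithmetic in the paragraph above can be checked with a simplified model. The Java sketch below is not SignalR's implementation; it assumes only what this answer states (fragments of 250 messages, at least 1000 messages retained), plus the assumption that a fifth fragment is being filled while four full ones are kept. `FragmentBufferSketch` and its constants are hypothetical names.

```java
final class FragmentBufferSketch {
    // From the answer: messages are released 250 at a time.
    static final int FRAGMENT_SIZE = 250;
    // Assumption: 4 full fragments (1000 messages) plus the one currently being filled.
    static final int FRAGMENT_COUNT = 5;

    // 0-based index of the oldest message still retained after `sent` messages were written.
    static long oldestRetained(long sent) {
        if (sent == 0) {
            return 0;
        }
        long newestFragment = (sent - 1) / FRAGMENT_SIZE; // fragment holding the latest message
        long oldestFragment = Math.max(0, newestFragment - (FRAGMENT_COUNT - 1));
        return oldestFragment * FRAGMENT_SIZE;
    }

    public static void main(String[] args) {
        long[] samples = {1000, 1250, 1251, 1500, 1501};
        for (long sent : samples) {
            // 1250 sent keeps everything from index 0; the 1251st message drops indexes 0..249.
            System.out.println(sent + " sent -> oldest retained index " + oldestRetained(sent));
        }
    }
}
```

Under these assumptions the retained range always jumps forward in steps of 250, which is consistent with the "skip-to" points at 250, 500 and 750 reported in the question.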
While you could increase the DefaultMessageBufferSize, that probably will not fix your root problem. It seems that you are trying to send messages faster than the server can send them to the client. If you do that continuously, you will run out of buffer space no matter the size.
It's more common to reduce the DefaultMessageBufferSize rather than increase it, since the buffers can consume a lot of memory, especially if you are sending a lot of large unique messages to many different clients.
Your best bet to avoid overrunning the buffer is to have the client send an ACK at least every 1000 messages. Given this, it might be possible to avoid sending over 1000 unACKed messages thereby avoiding this problem altogether.
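One way to picture the ACK approach is a sender-side window that refuses to send once too many messages are unconfirmed. The Java sketch below is only an illustration of that bookkeeping: it is not a SignalR API, `AckWindowSketch` is a hypothetical name, and a real setup would still need a client-to-server call (for example a hub method of your own) to report the last sequence number received.

```java
final class AckWindowSketch {
    private final int window; // maximum number of unACKed messages allowed in flight
    private long sent = 0;    // messages sent so far
    private long acked = 0;   // highest count the client has confirmed

    AckWindowSketch(int window) {
        this.window = window;
    }

    // True if another message can be sent without exceeding the window.
    boolean canSend() {
        return sent - acked < window;
    }

    // Records a send; returns false and sends nothing when the window is full.
    boolean trySend() {
        if (!canSend()) {
            return false;
        }
        sent++;
        return true;
    }

    // The client reports that it has received everything up to `count`.
    void ack(long count) {
        if (count > acked && count <= sent) {
            acked = count;
        }
    }

    long inFlight() {
        return sent - acked;
    }

    public static void main(String[] args) {
        AckWindowSketch sender = new AckWindowSketch(1000);
        int accepted = 0;
        for (int i = 0; i < 1500; i++) {
            if (sender.trySend()) {
                accepted++;
            }
        }
        System.out.println("accepted before any ACK: " + accepted);
        sender.ack(1000);
        System.out.println("can send after ACK: " + sender.canSend());
    }
}
```

Messages refused by `trySend()` would have to be held on the server (here the ReplaySubject already retains them) and sent once the client has caught up.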
By the way, you can take a look at SignalR's message buffer implementation yourself if you feel so inclined. Note that the capacity constructor argument is the DefaultMessageBufferSize.