Blocked threads due to ServiceRegistry - OSGi

I have a relatively high-volume system in which we are using OSGi; we get close to 800M requests per day.
We are currently seeing issues where threads are getting blocked. For every request that comes in, we forward the event/data to an OSGi bundle by calling registerService, passing the payload/data to the bundle that is listening for this service.
Like this: bundleContext.registerService(Map.class.getName(), dataHolderMap, null);
Where dataHolderMap is nothing but a regular Java HashMap.
Here is the thread dump taken with jstack:
===========================================================================
"RequestThread" prio=10 tid=0x00000000421ab800 nid=0x1042 runnable [0x00007fbdd3867000]
java.lang.Thread.State: RUNNABLE
at org.apache.felix.framework.ServiceRegistry.getService(ServiceRegistry.java:295)
- locked <0x0000000700e2c590> (a org.apache.felix.framework.ServiceRegistry)
at org.apache.felix.framework.Felix.getService(Felix.java:3568)
at org.apache.felix.framework.BundleContextImpl.getService(BundleContextImpl.java:468)
at org.osgi.util.tracker.ServiceTracker.addingService(ServiceTracker.java:411)
at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:932)
at org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:864)
at org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:256)
at org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:229)
at org.osgi.util.tracker.ServiceTracker$Tracked.serviceChanged(ServiceTracker.java:894)
at org.apache.felix.framework.util.EventDispatcher.invokeServiceListenerCallback(EventDispatcher.java:932)
at org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:793)
at org.apache.felix.framework.util.EventDispatcher.fireServiceEvent(EventDispatcher.java:543)
at org.apache.felix.framework.Felix.fireServiceEvent(Felix.java:4419)
at org.apache.felix.framework.Felix.registerService(Felix.java:3423)
at org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:346)
at org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:320)
at com.mypackage.person.bs.processor.Processor.sendEvent(Processor.java:56)
at com.mypackage.jetMyStream.event.support.AbstractEventSource.fireSendEvent(AbstractEventSource.java:97)
at com.mypackage.jetMyStream.event.channel.messaging.InboundMessagingChannel.fireEvent(InboundMessagingChannel.java:113)
at com.mypackage.jetMyStream.event.channel.messaging.InboundMessagingChannel.onMessage(InboundMessagingChannel.java:204)
at com.mypackage.jetMyStream.messaging.MessageService.dispatchMessageForContext(MessageService.java:349)
at com.mypackage.jetMyStream.messaging.MessageService.dispatch(MessageService.java:259)
at com.mypackage.jetMyStream.messaging.MessageServiceRequest.execute(MessageServiceRequest.java:40)
at com.mypackage.jetMyStream.util.RequestThread.run(RequestThread.java:69)
"RequestThread" prio=10 tid=0x00000000425f8800 nid=0x1041 waiting for monitor entry [0x00007fbdd3968000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.felix.framework.ServiceRegistry.registerService(ServiceRegistry.java:109)
- waiting to lock <0x0000000700e2c590> (a org.apache.felix.framework.ServiceRegistry)
at org.apache.felix.framework.Felix.registerService(Felix.java:3393)
at org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:346)
at org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:320)
at com.mypackage.person.bs.processor.Processor.sendEvent(BullseyeModelProcessor.java:56)
at com.mypackage.jetMyStream.event.support.AbstractEventSource.fireSendEvent(AbstractEventSource.java:97)
at com.mypackage.jetMyStream.event.channel.messaging.InboundMessagingChannel.fireEvent(InboundMessagingChannel.java:113)
at com.mypackage.jetMyStream.event.channel.messaging.InboundMessagingChannel.onMessage(InboundMessagingChannel.java:204)
at com.mypackage.jetMyStream.messaging.MessageService.dispatchMessageForContext(MessageService.java:349)
at com.mypackage.jetMyStream.messaging.MessageService.dispatch(MessageService.java:259)
at com.mypackage.jetMyStream.messaging.MessageServiceRequest.execute(MessageServiceRequest.java:40)
at com.mypackage.jetMyStream.util.RequestThread.run(RequestThread.java:69)
A couple of questions on this:
Am I doing something wrong in sending a map through registerService? If so, what are my alternatives?
Any ideas on how I can get this to work? We have 26 nodes and need to process only about 400 requests per second using this mechanism.
Anyone had similar issues? Any pointers are highly appreciated
Thanks
Masti

"Am I doing something wrong in sending a map through registerService. If so what are my alternatives?"
YES! The service registry is NOT intended to be used this way; it is not a messaging bus.
As for alternatives... why not use a messaging bus? You could look at OSGi Event Admin but there are many other implementations of this idea.
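For example, with Event Admin the payload travels as an event instead of a throwaway service registration. A minimal sketch, assuming the Event Admin service is available at runtime (the topic name and the "payload" property key are illustrative, not anything from your code):

import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;
import org.osgi.service.event.Event;
import org.osgi.service.event.EventAdmin;
import org.osgi.service.event.EventConstants;
import org.osgi.service.event.EventHandler;

public class EventAdminExample {

    // Sender side: post the map as an event property instead of registering a service per request.
    public void sendEvent(BundleContext bundleContext, Map<String, Object> dataHolderMap) {
        ServiceReference ref = bundleContext.getServiceReference(EventAdmin.class.getName());
        if (ref != null) {
            EventAdmin eventAdmin = (EventAdmin) bundleContext.getService(ref);
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("payload", dataHolderMap);
            // postEvent() delivers asynchronously; sendEvent() would deliver synchronously
            eventAdmin.postEvent(new Event("com/mypackage/person/REQUEST", props));
            bundleContext.ungetService(ref);
        }
    }

    // Receiver side: register one EventHandler once, instead of tracking per-request registrations.
    public void registerHandler(BundleContext bundleContext) {
        Hashtable<String, Object> props = new Hashtable<String, Object>();
        props.put(EventConstants.EVENT_TOPIC, "com/mypackage/person/REQUEST");
        bundleContext.registerService(EventHandler.class.getName(), new EventHandler() {
            public void handleEvent(Event event) {
                @SuppressWarnings("unchecked")
                Map<String, Object> payload = (Map<String, Object>) event.getProperty("payload");
                // process the payload here
            }
        }, props);
    }
}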
"Any ideas on how I can get this to work? We have 26 nodes and need to only process about 400 requests per second using this mechanism"
You have to be a bit more specific about what you're trying to achieve. The question is too abstract. Please describe where the messages are coming from, where they need to be delivered, what processing (if any) you need to do in the middle, etc.

Related

Logging MicroProfile fault tolerance events

I am working on a Quarkus app that uses the SmallRye MicroProfile Fault Tolerance implementation.
We have configured fault tolerance on the client definitions via the annotations API (@Retry, @Bulkhead, etc.), and it seems to work, but we don't get any sort of feedback about what is happening. Ideally we would like to get some sort of callback, but even just having logs would help as a first step.
The rest clients look something like this:
@RegisterRestClient(configKey = "foo-backend")
@Path("/backend")
interface FooClient {

    @POST
    @Retry(maxRetries = 4, delay = 900)
    @ExponentialBackoff
    @Timeout(value = 3000)
    fun getUser(payload: GetFooUserRequest): GetFooUserResponse
}
Looking at the logs, even though we trace all communication, I cannot see any event, even if I manually stop foo-backend and start it again before the retries run out.
Our logging config currently looks like this, but still nothing:
quarkus.rest-client.logging.scope=request-response
quarkus.rest-client.logging.body-limit=2048
quarkus.log.category."org.jboss.resteasy.reactive.client.logging".level=DEBUG
Is there a way to get callbacks when a fault tolerance event happens? Or a setting which logs them? I would also be interested in knowing when our circuit breakers are triggered or when a bulkhead fills up. Logging them would be good enough for now, but ideally I would like to somehow listen for them.
You can enable DEBUG logging for the io.smallrye.faulttolerance category, and you should get all the information you need.
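In the same style as the config you already have, that would be something like:
quarkus.log.category."io.smallrye.faulttolerance".level=DEBUG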
Specifically for circuit breakers, you can register state change listeners for circuit breakers that have been given a name using #CircuitBreakerName -- just inject CircuitBreakerMaintenance and use onStateChange. See https://smallrye.io/docs/smallrye-fault-tolerance/5.6.0/usage/extra.html#_circuit_breaker_maintenance
There's unfortunately nothing similar for bulkheads yet.
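For example, a minimal sketch of such a listener in Java (the bean and method names are illustrative, and whether the CDI imports are javax or jakarta depends on your Quarkus version), assuming the guarded client method is annotated with @CircuitBreakerName("foo-backend"):

import io.quarkus.runtime.StartupEvent;
import io.smallrye.faulttolerance.api.CircuitBreakerMaintenance;
import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.event.Observes;
import javax.inject.Inject;
import org.jboss.logging.Logger;

@ApplicationScoped
public class CircuitBreakerStateLogger {

    private static final Logger LOG = Logger.getLogger(CircuitBreakerStateLogger.class);

    @Inject
    CircuitBreakerMaintenance maintenance;

    void onStart(@Observes StartupEvent event) {
        // "foo-backend" must match the @CircuitBreakerName value on the guarded method
        maintenance.onStateChange("foo-backend",
                state -> LOG.infof("circuit breaker 'foo-backend' is now %s", state));
    }
}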

zeromq: ZMQ_CONFLATE==1 does not stop queues from saving old messages

With ZeroMQ and CPPZMQ 4.3.2, I want to drop old messages for all my sockets including
PAIR
Pub/Sub
REQ/REP
So I use m_socks[channel].setsockopt(ZMQ_CONFLATE, 1) on all my sockets before binding/connecting.
Test
However, when I made the following test, it seems that the old messages are still flushed out on each reconnection. In this test,
I use a thread to keep sending generated sinewave to a receiver thread
Every 10 seconds I double the sinewave's frequency
Then after 10 seconds I stop the process
Below is the pseudocode of the sender
// on sender end
auto thenSec = high_resolution_clock::now();
while (m_isRunning) {
    // generate sinewave, double the frequency every 10s or so
    auto nowSec = high_resolution_clock::now();
    if (duration_cast<seconds>(nowSec - thenSec).count() > 10) {
        m_sine.SetFreq(m_sine.GetFreq() * 2);
        thenSec = nowSec;
    }
    m_sine.Generate(audio);
    // send to rendering thread
    m_messenger.send("inproc://sound-ear.pair",
                     (const void*)(audio),
                     audio_size,
                     zmq::send_flags::dontwait);
}
Note that I already use DONTWAIT to mitigate blocking.
On the receiver side I have a zmq::poller_event handler that simply receives the last message on event polling.
In the stop sequence I reset the sinewave frequency to its lowest value, say, 440Hz.
Expected
The expected behaviour would be:
If I stop both the sender and the receiver after 10s when the frequency is doubled,
and I restart both,
then I should see the sinewave reset to 440Hz.
Observed
But the observed behaviour is that the received sinewave is still of the doubled frequency after restarting the communication, i.e., 880Hz.
Question
Am I doing it wrong or should I use some kind of killswitch to force drop all messages in this case?
OK, I think I solved it myself. Kind of.
Actual solution
I finally realized that the behaviour I want is to flush all messages when I stop the rendering. According to the official doc (How can I flush all messages that are in the ZeroMQ socket queue?), this can only be achieved by
setting the ZMQ_LINGER option of both the sender's and the receiver's sockets to 0, meaning nothing is kept when those sockets are closed;
closing the sockets on both the sender and receiver ends, which also involves bootstrapping the pollers and all references to the sockets.
This seems like a lot of unnecessary work if I'm going to restart rendering my data right after the stop sequence, but I found no other way to solve this cleanly.
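For reference, the teardown looks roughly like this (shown with the JeroMQ Java binding purely for illustration; the cppzmq calls are analogous, and the endpoint name is just the one from my pseudocode above):

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class FlushOnStop {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket pair = ctx.createSocket(SocketType.PAIR);
            pair.setLinger(0);                      // keep nothing once the socket is closed
            pair.bind("inproc://sound-ear.pair");

            // ... normal send/receive while rendering ...

            // stop sequence: close the socket; with linger = 0 any queued messages are dropped
            pair.close();

            // restarting rendering then means creating a fresh socket
            // and re-registering it with the poller
        }
    }
}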
Initial effort
It seems to me that ZMQ_CONFLATE does not make a difference on PAIR. I really have to tweak high water marks on sender and receiver ends using ZMQ_SNDHWM and ZMQ_RCVHWM.
However, I said "kind of" solved because tweaking the HWM is not an optimal solution for a realtime application:
with ZMQ_SNDHWM / ZMQ_RCVHWM set to the minimum of 1, we still get sizable latency in realtime terms;
also, the consumer thread can run into underruns, i.e., perceivable jitter, at the lowest HWM.
If I'm not doing anything wrong, I guess the optimal solution for my scenario would still be shared memory. This is sad, because I really enjoyed the simplicity of ZMQ's multicast messaging patterns and hate dealing with thread locking littered everywhere.

CAF receiver: shutdown handling for HTTP requests

Some background reading first :) what is shutdown handling
I'm building a custom receiver with the CAF SDK.
Following that shutdown-handling approach, I try to dispatch some HTTP requests within the callback, like:
receiver.addEventListener(
  cast.framework.system.EventType.SHUTDOWN,
  e => {
    // some http requests
    HttpHandler.post(url, somePayload);
    HttpHandler.post(anotherUrl, someOtherPayload);
    // ... more requests to go
  });
However, I can't guarantee those requests reach their destination, since the receiver application is about to terminate at any moment (likely in less than 1 second). In fact, those requests were proven not to reach the destination.
As far as I know, there is no way to postpone the shutdown of the receiver application with CAF SDK itself.
Is there a workaround about it? Is there a way we can postpone shutdown with the help of CAF SDK?
I did some more research, and it turns out you can also use
window.addEventListener("beforeunload", e => {
  ...
});
instead of
receiver.addEventListener(
  cast.framework.system.EventType.SHUTDOWN,
  e => {
    ...
  });
Alas, this does not help to assure everything in the callback is executed: the beforeunload callback is terminated in the same way as the shutdown handler.
The answer seems quite simple: make all the HTTP requests synchronous.
The drawbacks are quite obvious as well: synchronous requests block the thread, and if one request hangs somewhere for unknown reasons, the script will be blocked forever unless it is forcibly shut down.
Still looking for a better way to handle this.

Project Reactor processors v3.x

We are trying to migrate from 2.X to 3.X.
https://github.com/reactor/reactor-core/issues/375
We have used the EventBus as the event manager in our application (a low-latency FX system) and it works very well for us.
After the change, we decided to give every module its own processor to handle events.
1. Does this usage seem correct from your point of view? Because of the lack of documentation at the current stage, and after reviewing everything we could, we don't really know what to do here.
2. We have tried to use Flux in order to perform an action every X interval.
For example: market data arrives 1000 times in 1 second, but we want to process an update only 4 times per second. After upgrading we are using:
A processor with a buffer, sending to another method.
In this method we have a Flux that gets the list and tries to work in parallel in order to complete its task.
We had 2 major problems:
1. Sometimes we receive a null event, and we cannot find where in our system it is being sent from, so I suppose we may be misusing the processor.
//Definition of processor
ReplayProcessor<Event> classAEventProcessor = ReplayProcessor.create();

//Event handler subscribing
public void onMyEventX(Consumer<Event> consumer) {
    Flux<Event> handler = classAEventProcessor.filter(event -> event.getType().equals(EVENT_X));
    handler.subscribe(consumer);
}
In the example above, the event in the handler sometimes comes in as null. Once it does, the stream stops working until we restart the server (because we only create the processor on restart).
2. We have tried to use parallel, but sometimes some of the messages disappeared, so maybe we are misusing the framework.
//On constructor
tickProcessor.buffer(1024, Duration.of(250, ChronoUnit.MILLIS))
             .subscribe(markets -> handleMarkets(markets));

//Handler
Flux.fromIterable(getListToProcess())
    .parallel()
    .runOn(Schedulers.parallel())
    .doOnNext(entryMap -> DoBlockingWork(entryMap))
    .sequential()
    .subscribe();
The intention is that the processor wakes up every 250 ms and invokes the handler. The handler then works with a parallel Flux in order to process faster.
*In case DoBlockingWork takes more than 250 ms, I couldn't figure out what the behavior would be.
UPDATE:
The EventBus was wrapped by us, and every event was subscribed through the wrapped event manager.
Now we have tried to create an event processor for every module, but it works very slowly. We used TopicProcessor with a ThreadExecutor and it is still very slow; the EventBus did the same work at high speed.
Does anyone have any idea? By the way, when I tried to use DirectProcessor it seemed to work much better than TopicProcessor.
Reactor 3 is built around the concept that you should avoid blocking as much as you can, so in your second snippet DoBlockingWork doesn't look good.
How are the events generated? Do you maybe have a listener-based asynchronous API to get them? If so, you could try using Flux.create.
For your use case of "we have 1000 events in 1 second, but only want to process 4", I'd chain a sample operator. For instance, sample(Duration.ofMillis(250)) will divide each second into 4 windows, from which it will only emit the last element.
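Putting the two together, a rough sketch (the MarketFeed listener interface here is a hypothetical stand-in for however your events are actually produced):

import java.time.Duration;
import java.util.function.Consumer;
import reactor.core.publisher.Flux;

public class MarketSampling {

    // Hypothetical listener-based source, standing in for whatever produces your market events.
    interface MarketFeed {
        void onTick(Consumer<String> listener);
    }

    static Flux<String> ticks(MarketFeed feed) {
        // Bridge the callback API into a Flux
        return Flux.create(sink -> feed.onTick(sink::next));
    }

    public static void main(String[] args) {
        MarketFeed feed = listener -> { /* wire the real feed here */ };

        ticks(feed)
            .sample(Duration.ofMillis(250))   // at most 4 updates per second, keeping the latest
            .subscribe(tick -> System.out.println("process " + tick));
    }
}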
The reference guide is being written, as well as a page where you can find links to external articles and learning material. There's a preview of the WIP reference guide here and the learning resources page here.

How can I tell if my Ruby server script is being overloaded?

I have a daemonized ruby script running on my server that looks like this:
@server = TCPServer.open(61101)
loop do
  @thr = Thread.new(@server.accept) do |sock|
    Thread.current[:myArrayOfHashes] = [] # hashes containing attributes of myObject
    SystemTimer.timeout_after(5) do
      Thread.current[:string] = sock.gets
      sock.close
      # parse the string and load the data into myArrayOfHashes
      Myobject.transaction do # Update the myObjects Table
        Thread.current[:myArrayOfHashes].each do |h|
          Thread.current[:newMyObject] = Myobject.new
          # load up the new object with data
          Thread.current[:newMyObject].save
        end
      end
    end
  end
  @thr.join
end
This server receives and manages data for my Rails application, which is all running on Mac OS 10.6. The clients call the server every 15 minutes, on the 15, and while I currently have only 16 or so clients, I'm wondering about the following:
If two clients call at close enough to the same time, will one client's connection attempt fail?
How can I figure out how many client connections my server can accommodate at the same time?
How can I monitor how much memory my server is using?
Also, is there an article you can point me toward that discusses the best way to implement this kind of server? I mean, can I have multiple instances of the server listening on the same port? Would that even help?
I am using Bluepill to monitor my server daemons.
1 and 2
The answer is no: two clients connecting close to each other will not make the connection fail (however, enough clients connecting at once may; see below).
The reason is that the operating system has a default, so-called listening queue built into all server sockets. So even if you are not calling accept fast enough in your program, the OS will still keep buffering incoming connections for you. It will buffer these connections for as long as the listening queue does not get filled.
Now what is the size of this queue then?
In most cases the default size typically used is 5. The size is set after you create the socket and you call listen on this socket (see man page for listen here).
For Ruby, TCPSocket automatically calls listen for you, and if you look at the C source code for TCPSocket you will find that it indeed sets the size to 5:
https://github.com/ruby/ruby/blob/trunk/ext/socket/ipsocket.c#L108
SOMAXCONN is defined as 5 here:
https://github.com/ruby/ruby/blob/trunk/ext/socket/mkconstants.rb#L693
Now what happens if you don't call accept fast enough and the queue gets filled?
The answer is found in the man page of listen:
The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.
In your code, however, there is one problem which can make the queue fill up if more than 5 clients try to connect at the same time: you're calling @thr.join at the end of the loop.
What effectively happens when you do this is that your server will not accept any new incoming connections until all your stuff inside your accept-thread has finished executing.
So if the database stuff and the other things you are doing inside the accept-thread takes a long time, the listening queue may fill up in the meantime. It depends on how long your processing takes, and how many clients could potentially be connecting at the exact same time.
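To illustrate the idea in a language-agnostic way, here's a rough Java sketch of the same pattern: set the backlog explicitly and hand each accepted connection to a pool instead of joining per connection (the port, backlog, and pool size are just placeholders):

import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AcceptLoop {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // second argument is the listen backlog, i.e. the "listening queue" discussed above
        try (ServerSocket server = new ServerSocket(61101, 50)) {
            while (true) {
                Socket sock = server.accept();     // get back to accept() as quickly as possible
                pool.execute(() -> handle(sock));  // no join: the work runs off the accept loop
            }
        }
    }

    static void handle(Socket sock) {
        try (Socket s = sock) {
            // read the request, parse it, update the database, etc.
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}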
3
You didn't say which platform you are running on, but on linux/osx the easiest way is to just run top in your console. For more advanced memory monitoring options you might want to check these out:
ruby/ruby on rails memory leak detection
track application memory usage on heroku
