Reactive Quarkus app behaving differently when run as Java or native - quarkus

I have a reactive Quarkus app that uses hibernate-reactive-panache. The problem is that it behaves differently depending on whether I run it as a Java app or as a native app.
The app:
1) loads a lot of data from a MySQL DB via hibernate-reactive-panache
2) builds a graph based on the loaded data
3) runs a time-consuming algorithm on the graph
4) loads some more data from the DB based on the results returned from 3)
Initially the code looked something like this:
GraphProcessor graphProcessor = createInitialProcessor();
return Uni.createFrom().item(graphProcessor)
    // 1) loading of initial data
    .onItem().transformToUni(this::loadDataViaPanaceReactive1)
    .onItem().transformToUni(this::loadDataViaPanaceReactive2)
    .onItem().transformToUni(this::loadDataViaPanaceReactive3)
    // 2) building of graph
    .onItem().transform(graphProcessor::processLoadedData)
    .onItem().invoke(graphProcessor::loadingComplete) // sync
    // 3) running time consuming algorithm on graph
    .onItem().transformToMulti(this::runTimeConsumingTask)
    .onItem().invoke(this::prepareDBQueries)
    // 4) load more data from DB
    .onItem().transformToUniAndConcatenate(this::loadMoreData1)
    .onItem().transformToUniAndConcatenate(this::loadMoreData2)
    .onItem().transformToUniAndConcatenate(this::transformToPublicForm)
    .onFailure().invoke(log::error);
That worked fine when run as a Java app, but when I ran it as a native app it first complained that the computations in 2) and 3) were taking too long and were blocking the calling thread.
I fixed that by inserting
.emitOn(Infrastructure.getDefaultWorkerPool())
between 1) and 2).
This time I got another error:
java.lang.IllegalStateException: HR000069: Detected use of the
reactive Session from a different Thread than the one which was used
to open the reactive Session - this suggests an invalid integration;
original thread: 'vert.x-eventloop-thread-0' current Thread:
'vert.x-eventloop-thread-1'
I've fixed that by inserting
.emitOn(Infrastructure.getDefaultExecutor())
between 3 and 4.
GraphProcessor graphProcessor = createInitialProcessor();
return Uni.createFrom().item(graphProcessor)
    // 1) loading of initial data
    .onItem().transformToUni(this::loadDataViaPanaceReactive1)
    .onItem().transformToUni(this::loadDataViaPanaceReactive2)
    .onItem().transformToUni(this::loadDataViaPanaceReactive3)
    // 2) building of graph
    .emitOn(Infrastructure.getDefaultWorkerPool()) // Required for native mode
    .onItem().transform(graphProcessor::processLoadedData)
    .onItem().invoke(graphProcessor::loadingComplete)
    // 3) running time consuming algorithm on graph
    .onItem().transformToMulti(this::runTimeConsumingTask)
    .onItem().invoke(this::prepareDBQueries)
    .emitOn(Infrastructure.getDefaultExecutor()) // Required for native mode
    // 4) load more data from DB
    .onItem().transformToUniAndConcatenate(this::loadMoreData1)
    .onItem().transformToUniAndConcatenate(this::loadMoreData2)
    .onItem().transformToUniAndConcatenate(this::transformToPublicForm)
    .onFailure().invoke(log::error);
That worked in native mode, but now when I run it as a Java app I get the same exception (Detected use of the reactive Session from a different Thread than the one which was used to open the reactive Session).
As I understand it, emitOn(Infrastructure.getDefaultExecutor()) should have switched back to the original thread.
The odd thing is that this exception is not thrown every time I hit the app.
So what am I doing wrong here? What is the best way to handle time-consuming tasks and then run more DB queries afterwards?
I could use .runSubscriptionOn(Executor), but I would still need to switch back to the original thread for part 4).
Thanks for your help.
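For reference, the thread hand-off this question is about can be reproduced with nothing but the JDK. The sketch below is not Mutiny or Quarkus code: the class name ThreadHandOff and the thread names are made up for illustration, a single-threaded executor stands in for the event loop that opened the session, and CompletableFuture.thenApplyAsync plays the role that emitOn plays in the pipeline above. It only illustrates that downstream work runs on whatever executor it is handed, so getting back to one specific thread requires an executor bound to that thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadHandOff {

    // Runs three stages and records the name of the thread that executed each one.
    static String run() {
        // Hypothetical stand-in for the event loop thread that owns the DB session.
        ExecutorService eventLoop =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop"));
        // Hypothetical stand-in for a worker pool used for the CPU-heavy steps.
        ExecutorService workers =
                Executors.newFixedThreadPool(2, r -> new Thread(r, "worker"));
        try {
            return CompletableFuture
                    // 1) initial loading on the event loop
                    .supplyAsync(() -> Thread.currentThread().getName(), eventLoop)
                    // 2) + 3) heavy computation moved to the worker pool
                    .thenApplyAsync(t -> t + ">" + Thread.currentThread().getName(), workers)
                    // 4) explicitly handed back to the original single thread
                    .thenApplyAsync(t -> t + ">" + Thread.currentThread().getName(), eventLoop)
                    .join();
        } finally {
            eventLoop.shutdown();
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "event-loop>worker>event-loop"
    }
}
```

If the last stage were given the worker pool (or any other multi-threaded executor) instead of eventLoop, it would run on one of that pool's threads rather than on the thread that started the chain.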

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in a Europe environment (to be specific: job region europe-west1, worker location europe-west1-d), where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is tenantId/visitorId and the gap is 30 minutes. I am also using a trigger that fires every 30 seconds to release events before the end of the session (writing them to BigQuery).
The problem appears to be in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and dataFreshness keeps increasing (it has been increasing ever since I deployed). All steps before this one work well; all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and saw that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like a lot to me. CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB, while max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. On the other hand, a session usually relates to a single customer, so Google should expect to handle this many keys per second, no?
I would like suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive in a session window, that window persists in memory. This means that allowing 30 days of late data keeps every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found that we had some everlasting sessions caused by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was to decrease the allowed lateness to 2 days and use bounded sessions (look for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at the time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it reaches the sessions part and to stream to the sessions part only events that won't be dropped, i.e. events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session this event belongs to.)
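The filter condition above can be sketched as a small predicate. This is only an illustration using java.time rather than the Joda-Time types Beam uses; the class name LatenessFilter, the method name fitsSessionWindow and the parameter names are hypothetical, and it assumes the caller knows the arrival time of each event.

```java
import java.time.Duration;
import java.time.Instant;

public class LatenessFilter {

    // True when event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness),
    // i.e. when the event may still be sent to the session-window branch.
    static boolean fitsSessionWindow(Instant eventTimestamp,
                                     Instant arrivalTime,
                                     Duration gapDuration,
                                     Duration allowedLateness) {
        Instant cutoff = arrivalTime.minus(gapDuration.plus(allowedLateness));
        return !eventTimestamp.isBefore(cutoff);
    }

    public static void main(String[] args) {
        Instant arrival = Instant.parse("2024-01-10T00:00:00Z");
        Duration gap = Duration.ofMinutes(30);
        Duration lateness = Duration.ofDays(2);
        // The cutoff here is 2024-01-07T23:30:00Z.
        System.out.println(fitsSessionWindow(Instant.parse("2024-01-08T00:00:00Z"), arrival, gap, lateness)); // true
        System.out.println(fitsSessionWindow(Instant.parse("2024-01-07T23:00:00Z"), arrival, gap, lateness)); // false
    }
}
```

Events for which the predicate is false would take the branch that writes to BigQuery without session data.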
P.S. In the bounded sessions part, where he demonstrates how to implement a time-bounded session, I believe he has a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, which moves the session's start time earlier and thereby expands the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I swapped the order of the current-window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function in the window-spanning part, so that a session can't exceed the max size and can only accept data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            // Merge only while the session stays within maxSize (plus the gap).
            if (current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis()) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}
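The size-capped merge can also be tried outside Beam. The following is a simplified sketch, not the implementation above: it uses plain long[] {start, end} intervals instead of IntervalWindow, a single hypothetical maxSpan parameter in place of maxSize.plus(gapDuration), and it checks the span of the merged interval rather than current.union.start() to window.end(). The class name BoundedMerge is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BoundedMerge {

    // Merges overlapping [start, end) intervals, but starts a new interval
    // whenever merging would make the result longer than maxSpan.
    static List<long[]> merge(List<long[]> windows, long maxSpan) {
        List<long[]> sorted = new ArrayList<>(windows);
        sorted.sort(Comparator.<long[]>comparingLong(w -> w[0]).thenComparingLong(w -> w[1]));

        List<long[]> merged = new ArrayList<>();
        long[] current = null;
        for (long[] w : sorted) {
            boolean overlaps = current != null && w[0] < current[1];
            boolean fits = current != null
                    && Math.max(current[1], w[1]) - current[0] <= maxSpan;
            if (overlaps && fits) {
                // Extend the current interval; it stays within maxSpan.
                current[1] = Math.max(current[1], w[1]);
            } else {
                if (current != null) {
                    merged.add(current);
                }
                // Copy so the caller's arrays are never mutated.
                current = new long[] {w[0], w[1]};
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> input = List.of(new long[] {0, 10}, new long[] {5, 20}, new long[] {15, 40});
        for (long[] w : merge(input, 25)) {
            System.out.println(w[0] + ".." + w[1]); // prints 0..20 then 15..40
        }
    }
}
```

With a maxSpan of 25 the third interval is not merged because the result would span 40 units; with a larger cap such as 100 all three collapse into a single 0..40 interval.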

Running out of SQL connections with Quarkus and hibernate-reactive-panache

I've got a Quarkus app which uses hibernate-reactive-panache to run some queries, then processes the result and returns JSON via a REST call.
For each REST call 5 DB queries are run; the last one loads about 20k rows:
public Uni<GraphProcessor> loadData(GraphProcessor graphProcessor) {
    return myEntityRepository.findByDateLeaving(graphProcessor.getSearchDate())
        .select().where(graphProcessor::filter)
        .onItem().invoke(graphProcessor::onNextRow).collect().asList()
        .onItem().invoke(g -> log.info("loadData - end"))
        .replaceWith(graphProcessor);
}
// In myEntityRepository
public Multi<MyEntity> findByDateLeaving(LocalDate searchDate) {
    LocalDateTime startDate = searchDate.atStartOfDay();
    return MyEntity.find("#MyEntity.findByDate",
        Parameters.with("startDate", startDate)
            .map()).stream();
}
This all works fine for the first 4 calls, but on the 5th call I get
11:12:48:070 ERROR [org.hibernate.reactive.util.impl.CompletionStages:121] (147) HR000057: Failed to execute statement [$1select <ONE OF THE QUERIES HERE>]: $2could not load an entity: [com.mycode.SomeEntity#1]: java.util.concurrent.CompletionException: io.vertx.core.impl.NoStackTraceThrowable: Timeout
at <16 internal lines>
io.vertx.sqlclient.impl.pool.SqlConnectionPool$1PoolRequest.lambda$null$0(SqlConnectionPool.java:202) <4 internal lines>
at io.vertx.sqlclient.impl.pool.SqlConnectionPool$1PoolRequest.lambda$onEnqueue$1(SqlConnectionPool.java:199) <15 internal lines>
Caused by: io.vertx.core.impl.NoStackTraceThrowable: Timeout
I've checked https://quarkus.io/guides/reactive-sql-clients#pooled-connection-idle-timeout and configured
quarkus.datasource.reactive.idle-timeout=1000
That by itself did not make a difference.
I then added
quarkus.datasource.reactive.max-size=10
I was able to run 10 REST calls before getting the timeout again. With max-size=20 I was able to run it 20 times. So it looks as if each REST call uses up a SQL connection and never releases it.
Is there something that needs to be done to manually release the connection or is this simply a bug?
The problem was with using @Blocking on a reactive REST method.
See https://github.com/quarkusio/quarkus/issues/25138 and https://quarkus.io/blog/resteasy-reactive-smart-dispatch/ for more information.
So if you have a REST method that returns, e.g., Uni or Multi, DO NOT use @Blocking on it. I had initially added it because I received an exception telling me that the thread cannot block. This was due to some CPU-intensive calculations. Adding @Blocking made that exception go away (in dev mode, though another problem popped up in native mode) but caused this SQL pool issue.
The real solution was to use emitOn to change the thread for the CPU-intensive method:
.emitOn(Infrastructure.getDefaultWorkerPool())
.onItem().transform(processor::cpuIntensiveMethod)

Draw a graph using D3 (v3) in a WebWorker

The goal is to draw a graph using D3 (v3) in a WebWorker (Rickshaw would be even better).
Requirement #1:
The storage space for the entire project should not exceed 1 MB.
Requirement #2:
Internet Explorer 10 should be supported
I already tried to pass the DOM element to the Web Worker.
This produced the following error message:
DOMException: Failed to execute 'postMessage' on 'Worker': HTMLDivElement object could not be cloned.
var worker = new Worker( 'worker.js' );
worker.postMessage( {
'chart' : document.querySelector('#chart').cloneNode(true)
} );
The GitHub user chrisahardie has made...
a small proof on concept showing how to generate a d3 SVG chart in a
web worker and pass it back to the main UI thread to be injected into
a webpage.
https://github.com/chrisahardie/d3-svg-chart-in-web-worker
He integrated jsdom into the browser with Browserify.
The problem:
The script is almost 5 MB, which is far too large for the application.
So my question:
Does anyone have experience with this problem, or any idea how it can be solved while meeting the requirements?
Web Workers don't have access to the following JavaScript objects: the window object, the document object, and the parent object. So all we could do on the worker side is build something that can be used to create the DOM quickly. The worker(s) could, e.g., process the datasets and do all the heavy computations, then pass the result back as a set of arrays. For more details, you could check this article and this sample.

Client application hangs when inserting into table on Oracle using ArrayBinding

Here is our environment:
.Net version: 4.5
Database: Oracle 12.1.0.2 (odp.net)
We are using LLBL "Adapter" but I don't think that has anything to do with the issue
LLBLGen Pro version: 4.1
Llbl Gen Pro Runtime: 4.1.13.1213
When we do an insert (always into different tables, which we use for a short period and then remove), we use the following code:
int numRecords = strings.Count();
var insertCmd = "insert into " + tableName + " (StringField) values (:StringField)";
var oracleCommand = new OracleCommand();
oracleCommand.CommandText = insertCmd;
oracleCommand.CommandType = CommandType.Text;
oracleCommand.BindByName = true;
oracleCommand.ArrayBindCount = numRecords;
oracleCommand.Parameters.Add(":StringField", OracleDbType.NVarchar2, strings.ToArray(), ParameterDirection.Input);
// this is an LLBL adapter. Like I said, I think the issue is below the LLBL layer.
this.adapter.ExecuteActionQuery(new ActionQuery(oracleCommand));
When the database is being hit hard by several of these inserts in parallel, we get the following error and the insert call never returns from the database.
WG_6.Index_586.TVD: An exception was caught during the execution of an action query: ORA-24381: error(s) in array DML
ORA-12592: TNS:bad packet
ORA-12592: TNS:bad packet
ORA-12592: TNS:bad packet
ORA-12592: TNS:bad packet
ORA-03111: break received on communication channel
ORA-03111: break received on communication channel
ORA-03111: break received on communication channel
On the database, using Toad's session browser, I can see that the "Current Statement" is correct.
insert into schemaX.tableY(StringField) values(:Stringfield)
Under the Waits tab in Toad, there is the following message:
“Waiting for SQL*Net more data from client - waited X hundred seconds, so far” and the X keeps incrementing until we hit our database timeout.
We tried batches of 1 million, which gave us the best performance for our scenario. However, this hanging issue arose. I then decreased the ArrayBindCount to 500K, 100K, 50K, 10K and then 5K. Only at 5K did it stop happening.
A couple of notes:
This happens more frequently when the database is on a different physical machine than the client. When using a local VM, it rarely happens. The network that we are using is generally very reliable with no other noted issues.
From the error message (ORA-12592: TNS:bad packet), it seems that the issue might be on the client and perhaps related to code in the "Oracle.DataAccess.Client" (ODAC) dll.
My next steps for troubleshooting are to use Reflector to debug the call from the ODAC code and also to get more reliable client side tracing while forcing this error to occur.
I had the same situation when trying to insert into an Oracle table using ArrayBinding.
Using a small number for oracleCommand.ArrayBindCount seemed to reduce the frequency of the errors (the same ones as yours), but did not eliminate them completely.
The solution was to use the managed data access driver. I suggest you get the latest ODP.NET, add a reference to ManagedDataAccess and change to:
using Oracle.ManagedDataAccess.Client;
using Oracle.ManagedDataAccess.Types;
This fixed the problem in my case, with no need to change anything else in the code.

FastRWeb performance on Ubuntu with built-in web server

I have installed FastRWeb 1.1-0 on an installation of R 2.15.2 (Trick or Treat) running on an Ubuntu 10.04 box. I hope to use the resulting system to run a web service.
I've configured the system by setting http.port to 8181 in rserve.conf and unsetting the socket destination. I've assigned .http.request to FastRWeb::.http.request. I exchange JSON blobs between the client and the server using HTTP POST (the second blob can exceed 150KB in size, and will not fit in an HTTP GET query string.)
Everything works end to end -- I have a little client-side R script which generates JSON RPC calls across the channel. I see the run function invoked, and see it returned.
I've run into a significant performance problem, however: the return path takes in excess of 12 seconds between the time run() returns (including the call to done()) and the time the R client gets the return value. RCurl doesn't seem to be the culprit; it appears that something is taking twelve seconds to do a return.
Does anybody have any suggestions of where to look? I can easily shift over to using Apache 2.0 and CGI, but, honestly, I'd rather keep everything R centric.
Answering my own question.
I wrapped .http.request with an Rprof()/Rprof(NULL) pair and looked at the time spent in each routine. It turns out that the system spends ~11 seconds inside URLDecode in the standard implementation of .run. This looks like a scaling problem in URLDecode in the core.
