H2O Exception: water.AutoBuffer.AutoBufferException - h2o

We are productionizing a model that was created using h2o. However, during one of our load tests, we are getting the following error:
ERRR: water.AutoBuffer$AutoBufferException
What exactly is the AutoBufferException and what causes it to be thrown?
Previously, we deployed the application without any Java heap memory parameters, and load testing consistently caused the instance to terminate due to an OOM error. Since adding the Java heap memory parameter, however, load tests under heavy load have instead been producing the AutoBufferException.
When we look at the debug logs, we see that there is plenty of memory still available when this exception is thrown. CPU usage seems normal as well. Below is a snippet of our code and the stacktrace that the error produces.
According to the logs, h2o makes a series of web requests in order to complete the prediction. The exception only gets thrown during one of those calls: (Request Type: POST, Request Path: /99/Models.bin). Once the exception gets thrown, the request fails and the service picks up the next request.
h2o Version: 3.18.0.11
Python Version: 3.5.4
OS: RHEL 7.x
h2o.connect(ip = HOST, port = PORT)
loaded_model = h2o.load_model(MODEL_PATH)

def predict(to_be_scored):
    to_be_scored_hex = h2o.H2OFrame(to_be_scored)
    prediction = loaded_model.predict(to_be_scored_hex)
    return prediction
We would expect the code to return a prediction; however, the following stacktrace is produced instead.
Stacktrace: [water.AutoBuffer$AutoBufferException,
water.AutoBuffer.getImpl(AutoBuffer.java:634),
water.AutoBuffer.getSp(AutoBuffer.java:610),
water.AutoBuffer.get1(AutoBuffer.java:749),
water.AutoBuffer.get1U(AutoBuffer.java:750),
water.AutoBuffer.getInt(AutoBuffer.java:829),
water.AutoBuffer.get(AutoBuffer.java:793),
water.AutoBuffer.getKey(AutoBuffer.java:811),
water.AutoBuffer.getKey(AutoBuffer.java:808),
hex.Model.readAll_impl(Model.java:1635),
hex.tree.SharedTreeModel.readAll_impl(SharedTreeModel.java:411),
water.AutoBuffer.getKey(AutoBuffer.java:814),
water.Keyed.readAll(Keyed.java:50),
hex.Model.importBinaryModel(Model.java:2256),
water.api.ModelsHandler.importModel(ModelsHandler.java:209),
sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source),
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43),
java.lang.reflect.Method.invoke(Method.java:498),
water.api.Handler.handle(Handler.java:63),
water.api.RequestServer.serve(RequestServer.java:451),
water.api.RequestServer.doGeneric(RequestServer.java:296),
water.api.RequestServer.doPost(RequestServer.java:222),
javax.servlet.http.HttpServlet.service(HttpServlet.java:755),
javax.servlet.http.HttpServlet.service(HttpServlet.java:848),
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684),
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503),
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086),
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429),
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020),
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135),
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154),
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116),
water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:197),
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154),
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116),
org.eclipse.jetty.server.Server.handle(Server.java:370),
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494),
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53),
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982),
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043),
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865),
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240),
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72),
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264),
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608),
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543),
java.lang.Thread.run(Thread.java:748)];parms={dir=/opt/h2o/data/model, model_id=}

Related

Spring Data queryForStream: how can it run out of Heap Space?

I have a Spring Boot application that reads from a database table with potentially millions of rows and thus uses the queryForStream method from Spring Data. This is the code:
Stream<MyResultDto> result = jdbcTemplate.queryForStream("select * from table", myRowMapper);
This runs well for smaller tables, but from about 500 MB of table size the application dies with a stacktrace like this:
Exception in thread "http-nio-8080-Acceptor" java.lang.OutOfMemoryError: Java heap space
at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:64)
at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:363)
at org.apache.tomcat.util.net.SocketBufferHandler.<init>(SocketBufferHandler.java:58)
at org.apache.tomcat.util.net.NioEndpoint.setSocketOptions(NioEndpoint.java:486)
at org.apache.tomcat.util.net.NioEndpoint.setSocketOptions(NioEndpoint.java:79)
at org.apache.tomcat.util.net.Acceptor.run(Acceptor.java:149)
at java.base/java.lang.Thread.run(Thread.java:833)
2023-01-28 00:37:23.862 ERROR 1 --- [nio-8080-exec-3] o.a.c.h.Http11NioProtocol : Failed to complete processing of a request
java.lang.OutOfMemoryError: Java heap space
2023-01-28 00:37:30.548 ERROR 1 --- [nio-8080-exec-6] o.a.c.c.C.[.[.[.[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space] with root cause
java.lang.OutOfMemoryError: Java heap space
Exception in thread "http-nio-8080-Poller" java.lang.OutOfMemoryError: Java heap space
As you can probably guess from the stack trace, I am streaming the database results out via an HTTP REST interface. The stack is PostgreSQL 15, the standard PostgreSQL JDBC driver 42.3.8, and spring-boot-starter-data-jpa 2.6.14, which pulls in spring-jdbc 5.3.24.
It's worth noting that the table has no primary key, which I suppose should be no problem for the above query. I have not posted the RowMapper because it never gets to run: the memory runs out right after the query is sent to the database. The driver never comes back with a result set for the RowMapper to work on.
I have tried jdbcTemplate.setFetchSize(1000) and also omitting the fetch size entirely, which I believe falls back to the default (100, I think). In both cases the same thing happens: large result sets are not streamed but somehow exhaust the Java heap space before streaming starts. What could be the reason for this? Isn't the queryForStream method meant exactly to avoid such situations?
I was on the right track with the fetch size: that is exactly what prevents the JDBC driver from loading the entire result set into memory. In my case the setting was silently ignored, and that is a behavior of the PostgreSQL JDBC driver: it ignores the fetch size if autocommit is set to true, which is the default in Spring JDBC.
Therefore the solution was to define a datasource in Spring JDBC that sets autocommit to false and use that datasource for the streaming query (see the sketch below). The fetch size was then applied, and I ended up setting it to 10000, which in my case yielded the best performance / memory usage ratio.
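A minimal sketch of what such a configuration could look like, assuming HikariCP as the connection pool (the JDBC URL, pool choice, and bean wiring here are illustrative, not taken from the original setup):

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

@Configuration
class StreamingJdbcConfig {

    @Bean
    JdbcTemplate streamingJdbcTemplate() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb"); // assumed URL
        config.setAutoCommit(false); // without this, pgjdbc silently ignores the fetch size
        DataSource streamingDataSource = new HikariDataSource(config);

        JdbcTemplate template = new JdbcTemplate(streamingDataSource);
        template.setFetchSize(10_000); // rows fetched per round trip
        return template;
    }
}

With autocommit off, pgjdbc switches to cursor-based fetching, so queryForStream only ever holds one fetch-size batch of rows in memory at a time.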

Running out of SQL connections with Quarkus and hibernate-reactive-panache

I've got a Quarkus app which uses hibernate-reactive-panache to run some queries, process the result, and return JSON via a REST call.
Each REST call performs five DB queries; the last one loads about 20k rows:
public Uni<GraphProcessor> loadData(GraphProcessor graphProcessor) {
    return myEntityRepository.findByDateLeaving(graphProcessor.getSearchDate())
        .select().where(graphProcessor::filter)
        .onItem().invoke(graphProcessor::onNextRow).collect().asList()
        .onItem().invoke(g -> log.info("loadData - end"))
        .replaceWith(graphProcessor);
}

// In myEntityRepository
public Multi<MyEntity> findByDateLeaving(LocalDate searchDate) {
    LocalDateTime startDate = searchDate.atStartOfDay();
    return MyEntity.find("#MyEntity.findByDate",
            Parameters.with("startDate", startDate).map())
        .stream();
}
This all works fine for the first 4 calls, but on the 5th call I get:
11:12:48:070 ERROR [org.hibernate.reactive.util.impl.CompletionStages:121] (147) HR000057: Failed to execute statement [$1select <ONE OF THE QUERIES HERE>]: $2could not load an entity: [com.mycode.SomeEntity#1]: java.util.concurrent.CompletionException: io.vertx.core.impl.NoStackTraceThrowable: Timeout
at <16 internal lines>
io.vertx.sqlclient.impl.pool.SqlConnectionPool$1PoolRequest.lambda$null$0(SqlConnectionPool.java:202) <4 internal lines>
at io.vertx.sqlclient.impl.pool.SqlConnectionPool$1PoolRequest.lambda$onEnqueue$1(SqlConnectionPool.java:199) <15 internal lines>
Caused by: io.vertx.core.impl.NoStackTraceThrowable: Timeout
I've checked https://quarkus.io/guides/reactive-sql-clients#pooled-connection-idle-timeout and configured
quarkus.datasource.reactive.idle-timeout=1000
That itself did not make a difference.
I then added
quarkus.datasource.reactive.max-size=10
I was able to run 10 REST calls before getting the timeout again. With a pool setting of max-size=20 I was able to run it 20 times. So it does look like each REST call uses up a SQL connection and never releases it.
Is there something that needs to be done to manually release the connection or is this simply a bug?
The problem was using @Blocking on a reactive REST method.
See https://github.com/quarkusio/quarkus/issues/25138 and https://quarkus.io/blog/resteasy-reactive-smart-dispatch/ for more information.
So if you have a REST method that returns e.g. Uni or Multi, DO NOT use @Blocking on it. I had initially added it because I received an exception telling me that the thread cannot block; this was due to some CPU-intensive calculations. Adding @Blocking made that exception go away (in dev mode; another problem popped up in native mode) but caused this SQL pool issue.
The real solution was to use emitOn to switch threads for the CPU-intensive method (a fuller sketch follows below):
.emitOn(Infrastructure.getDefaultWorkerPool())
.onItem().transform(processor::cpuIntensiveMethod)
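Putting it together, a hedged sketch of the corrected REST method (the resource class, the GraphResult return type, and the GraphService wrapper around loadData are assumptions for illustration):

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;

import io.smallrye.mutiny.Uni;
import io.smallrye.mutiny.infrastructure.Infrastructure;

@Path("/graph")
public class GraphResource {

    @Inject
    GraphService graphService; // hypothetical service exposing loadData(...)

    @GET
    public Uni<GraphResult> graph() { // note: no @Blocking here
        GraphProcessor processor = new GraphProcessor();
        return graphService.loadData(processor)                     // reactive DB work on the event loop
                .emitOn(Infrastructure.getDefaultWorkerPool())      // hop to a worker thread
                .onItem().transform(processor::cpuIntensiveMethod); // CPU-heavy work is now safe
    }
}

The event-loop thread is released as soon as the DB work completes, so the pooled SQL connection is returned instead of being held hostage by the blocked request thread.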

Getting Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation while calling ITopic.publish()

[Spring Boot application] I am trying to delete a key from a Hazelcast map from an async method. In the delete method of the MapStore class, I put the key into a topic and call publish(). However, sometimes I get this message:
Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation. I am adding the stack trace below.
2019-03-27 11:52:08.041 WARN [] 24586 --- [trace=,span=] [hz._hzInstance_1_gaian.SlowOperationDetectorThread] c.h.s.i.o.s.SlowOperationDetector : [localhost]:5701 [gaian] [3.10] Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:160)
com.hazelcast.client.spi.ClientProxy.invokeOnPartition(ClientProxy.java:225)
com.hazelcast.client.proxy.PartitionSpecificClientProxy.invokeOnPartition(PartitionSpecificClientProxy.java:49)
com.hazelcast.client.proxy.ClientTopicProxy.publish(ClientTopicProxy.java:52)
com.gaian.adwize.cache.mapstore.CampaignMapStore.delete(CampaignMapStore.java:95)
com.gaian.adwize.cache.mapstore.CampaignMapStore.delete(CampaignMapStore.java:36)
com.hazelcast.map.impl.MapStoreWrapper.delete(MapStoreWrapper.java:115)
com.hazelcast.map.impl.mapstore.writethrough.WriteThroughStore.remove(WriteThroughStore.java:56)
com.hazelcast.map.impl.mapstore.writethrough.WriteThroughStore.remove(WriteThroughStore.java:28)
com.hazelcast.map.impl.recordstore.DefaultRecordStore.delete(DefaultRecordStore.java:565)
com.hazelcast.map.impl.operation.DeleteOperation.run(DeleteOperation.java:38)
com.hazelcast.spi.Operation.call(Operation.java:148)
com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.call(OperationRunnerImpl.java:202)
com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:191)
com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl.run(OperationExecutorImpl.java:406)
com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl.runOrExecute(OperationExecutorImpl.java:433)
com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvokeLocal(Invocation.java:581)
com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:566)
com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke0(Invocation.java:525)
com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke(Invocation.java:215)
com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl.invoke(InvocationBuilderImpl.java:60)
com.hazelcast.client.impl.protocol.task.AbstractPartitionMessageTask.processMessage(AbstractPartitionMessageTask.java:67)
com.hazelcast.client.impl.protocol.task.AbstractMessageTask.initializeAndProcessMessage(AbstractMessageTask.java:130)
com.hazelcast.client.impl.protocol.task.AbstractMessageTask.run(AbstractMessageTask.java:110)
com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:155)
com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:125)
com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:100)
Any help from the community will be really helpful. Thanks.
A slow operation is not necessarily an error. Newer versions of Hazelcast can calculate what a normal response time is and detect when something is out of the ordinary. It's usually more a matter of network latency or garbage collection introducing latency.
This is not an error. This is visible in Hazelcast Version 4.0.x as well.

ElasticSearch randomly fails when running tests

I have a test Elasticsearch box (2.3.0), and my tests that use ES fail in random order, which is really frustrating (they fail with an "All shards failed" exception).
Looking at the elastic_search.log file, it only showed me this:
[2017-05-04 04:19:15,990][DEBUG][action.search.type ] [es-testing-1] All shards failed for phase: [query]
RemoteTransportException[[es-testing-1][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
Caused by: [derp_test][[derp_test][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:993)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:814)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:641)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:618)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:369)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Any idea what's going on? So far my research only suggests this is most likely due to a corrupt translog -- but I don't think deleting the translog will help, because the test drops the test index for every namespace.
The ES test box has 3.5 GB RAM and uses a 2.5 GB heap; CPU usage is quite normal during the test (peaking at 15%).
To clarify: when I say failing test, I mean an error with the weird exception mentioned above (not a test failing due to an incorrect value). I do a manual refresh after every insert/update operation, so the values are correct.
After investigating the Elasticsearch log file (at DEBUG level) and the source code, it turns out that after an index is created, its shards enter the RECOVERING state, and sometimes my test tried to run a query while the shards were not yet active -- thus the exception.
The fix is simple: after creating an index, just wait until the shards are active using the setWaitForActiveShards function; to be more paranoid I also added setWaitForYellowStatus (see the sketch below).
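With the Elasticsearch 2.x Java client, that could look roughly like this (the client wiring, index name, and shard count are assumptions):

import org.elasticsearch.client.Client;

public final class TestIndexSetup {

    // Create the test index, then block until its shards are active so the
    // first query does not hit a shard that is still RECOVERING.
    static void createIndexAndWait(Client client) {
        client.admin().indices().prepareCreate("derp_test").get();
        client.admin().cluster().prepareHealth("derp_test")
                .setWaitForActiveShards(3)  // assumed primary shard count
                .setWaitForYellowStatus()   // extra safety, as mentioned above
                .get();
    }
}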
It's recommended to use ESIntegTestCase for integration tests.
ESIntegTestCase has helper methods such as ensureGreen and refresh to ensure Elasticsearch is ready to continue testing, and you can configure node settings for the test.
If you use Elasticsearch directly as a test box, it may cause various problems:
like your exception, which suggests the shards of the index derp_test are still recovering.
even when your data has been indexed, an immediate search can fail, since the cluster still needs to flush or refresh
...
Most of these problems can be "fixed" by using Thread.sleep to wait some time :), but that's a bad way to do it.
Try manually refreshing your indices after inserting the data and before performing a query to ensure the data is searchable (see the example after the links below).
Either:
As part of the index request - https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-index_.html#index-refresh
Or separately - https://www.elastic.co/guide/en/elasticsearch/reference/2.3/indices-refresh.html
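For instance, with the 2.x Java client, a separate refresh between indexing and searching might look like this (client wiring assumed):

import org.elasticsearch.client.Client;

public final class RefreshHelper {

    // Force a refresh so freshly indexed documents become visible to search
    // before the test runs its query.
    static void refresh(Client client, String index) {
        client.admin().indices().prepareRefresh(index).get();
    }
}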
There could be another reason. I had the same problem with my Elasticsearch unit tests. At first I thought the root cause was somewhere in .NET Core, NEST, or elsewhere outside my code, because the tests would run successfully in Debug mode (when debugging tests) but randomly fail in Release mode (when running tests).
After a lot of investigation and much trial and error, I found that the root cause (in my case) was concurrency: in other words, a race condition.
The tests run concurrently, and I used to recreate and seed my index in the test class constructor, which executes at the beginning of every test. With tests running concurrently, a race condition was likely to occur and make my tests fail.
Here is my initialization code, which caused tests to fail randomly when running them (in Release mode):
public BaseElasticDataTest(RootFixture fixture)
    : base(fixture)
{
    ElasticHelper = fixture.Builder.Build<ElasticDataProvider<FakePersonIndex>>();
    deleteFakeIndex();
    createFakeIndex();
    fillFakeIndexData();
}
The code above used to run for every test, concurrently. I fixed my problem by executing the initialization code only once per test class (once for all the test cases inside the test class), and the problem went away.
Here is my fixed test class constructor:
static bool initialized = false;

public BaseElasticDataTest(RootFixture fixture)
    : base(fixture)
{
    ElasticHelper = fixture.Builder.Build<ElasticDataProvider<FakePersonIndex>>();
    if (!initialized)
    {
        deleteFakeIndex();
        createFakeIndex();
        fillFakeIndexData();
        // for concurrency
        System.Threading.Thread.Sleep(100);
        initialized = true;
    }
}
Hope it helps

HazelCast IMap.values() giving OutofMemory on Tomcat

I'm still getting to know Hazelcast and have to make a decision on whether to use it or not.
I wrote a simple application in which I start up the cache on (single-node) server startup and load the map at the same time with about 400 entries. The object itself has two String fields. I have a service class that looks up the cache and tries to get all the values from the map.
However, I'm getting an OutOfMemoryError on Java heap space while trying to get the values out of the Hazelcast map. Eventually we plan to move to a 5-node cluster to start with.
Following is the Spring config for the cache:
<hz:hazelcast id="instance">
    <hz:config>
        <hz:group name="dev" password=""/>
        <hz:properties>
            <hz:property name="hazelcast.merge.first.run.delay.seconds">5</hz:property>
            <hz:property name="hazelcast.merge.next.run.delay.seconds">5</hz:property>
        </hz:properties>
        <hz:network port="5701" port-auto-increment="false">
            <hz:join>
                <hz:multicast enabled="true" />
            </hz:join>
        </hz:network>
    </hz:config>
</hz:hazelcast>

<hz:map instance-ref="instance" id="statusMap" name="statuses" />
Following is where the error occurs:
map = instance.getMap("statuses");
Set<Status> statuses = (Set<Status>) map.values();
return statuses;
Any other method of IMap works fine. I tried getting the keySet and the size, and both worked fine. It is only when I try to get the values that the OutOfMemoryError shows up.
java.lang.OutOfMemoryError: Java heap space
I've tried the above with a standalone Java application and it works fine. I've also monitored with VisualVM and don't see any spike in used heap memory when the error occurs, which is all the more confusing. The available heap is 1 GB, and the used heap was about 70 MB when the error occurred.
However, when I take the cache implementation out of the application, it works fine going to the database and getting the data.
I've also tried playing around with the Tomcat VM args, with no success. It is the same OutOfMemoryError when I access IMap.values(), with or without SQLPredicate. Any help or direction in this matter will be greatly appreciated.
Thanks.
As the exception indicates, you're running out of heap space, since the values() method tries to return all deserialized values at once. If they don't fit into memory, you're likely to get an OOME.
You can use paging to prevent this from happening (a sketch follows below): http://hazelcast.org/docs/latest/manual/html-single/hazelcast-documentation.html#paging-predicate-order-limit-
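A hedged sketch of what paging could look like here (the Status value type is from the question; the key type and page size are assumptions):

import java.util.Collection;

import com.hazelcast.core.IMap;
import com.hazelcast.query.PagingPredicate;

public final class PagedStatusReader {

    // Pull the map contents one page at a time instead of deserializing
    // every value in a single values() call.
    static void readAll(IMap<String, Status> map) {
        PagingPredicate pagingPredicate = new PagingPredicate(50); // 50 entries per page
        Collection<Status> page = map.values(pagingPredicate);
        while (!page.isEmpty()) {
            // ... process one page of Status objects ...
            pagingPredicate.nextPage();
            page = map.values(pagingPredicate);
        }
    }
}

Each values(pagingPredicate) call deserializes only one page, and nextPage() advances the window.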
How big are your 400 entries?
And like Chris said, the whole data set is being pulled into memory.
In the future we'll replace this with an iteration-based approach, where we'll only pull small chunks into memory instead of the whole thing.
I figured out the issue. The Status object implements "com.hazelcast.nio.serialization.Portable" for serialization, but I had not configured the corresponding serialization factory. After I configured the factory as follows, it worked fine:
<hz:serialization>
    <hz:portable-factories>
        <hz:portable-factory factory-id="1" class-name="ApplicationPortableFactory" />
    </hz:portable-factories>
</hz:serialization>
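For completeness, a sketch of the factory class this XML registers (the class ID constant and no-arg constructor are assumptions about the Status class):

import com.hazelcast.nio.serialization.Portable;
import com.hazelcast.nio.serialization.PortableFactory;

public class ApplicationPortableFactory implements PortableFactory {

    public static final int FACTORY_ID = 1; // must match factory-id in the XML

    @Override
    public Portable create(int classId) {
        if (classId == Status.CLASS_ID) { // hypothetical per-class ID constant
            return new Status();          // Portable classes need a public no-arg constructor
        }
        return null;
    }
}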
Apologies for not giving the complete background initially; I only noticed it later myself. Thanks for replying, though. I wasn't aware of the PagingPredicate, and I'm now using it for sorting and paging results. Thanks again.
