Delay / lag between commit and select with distributed transactions when two connections are enlisted in the transaction in Oracle with ODAC

We have our application calling two Oracle databases using two connections (which are kept open throughout the application). For certain functionality we use distributed transactions. We have Enlist=false in the connection string and manually enlist the connections in the transaction.
The problem comes up in a scenario where we update the same record very frequently within a distributed transaction, and we see a delay before the data committed in the previous run becomes visible.
Example:
using (OracleConnection connection1 = new OracleConnection())
{
    using (OracleConnection connection2 = new OracleConnection())
    {
        connection1.ConnectionString = connection1String;
        connection1.Open();
        connection2.ConnectionString = connection2String;
        connection2.Open();

        // for 100 times, do an update
        {
            // .. check the previously updated value
            connection1.EnlistTransaction(currentTransaction);
            connection2.EnlistTransaction(currentTransaction);
            // .. do an update using connection1
            // .. do some updates with connection2
        }
    }
}
As in the code fragment above, we do an update and then check the previously updated value in the next iteration. The issue comes up when we run this frequently against a single record: the next iteration does not see the update even though it was committed in the previous iteration. When this happens, the update becomes visible to other applications after a very small delay, and even within our code it is visible if we debug and run the line again.
It's almost as if the commit is delayed even though the commit call has already returned in the code.
Does anyone have any ideas?

It turned out that there's no way to control this behavior through ODAC. So the only viable solution was to implement retry behavior in our code: since this occurs very rarely, when it happens we delay 10 seconds and retry the same operation.
Additional details on what I found about this can be found here.
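Roughly, the retry looks like the sketch below (readCurrentValue, expectedValue and the three-attempt cap are illustrative placeholders rather than our actual code; only the 10-second delay is taken from above):
using System;
using System.Collections.Generic;
using System.Threading;

// Hedged sketch: re-read the value and, in the rare case the previous commit
// is not yet visible, wait 10 seconds and try again.
private static T ReadWithRetry<T>(Func<T> readCurrentValue, T expectedValue, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        T value = readCurrentValue();
        if (EqualityComparer<T>.Default.Equals(value, expectedValue) || attempt >= maxAttempts)
        {
            return value; // the committed value became visible, or we give up and return what we read
        }
        Thread.Sleep(TimeSpan.FromSeconds(10)); // rare case: previous commit not yet visible
    }
}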

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to a production environment in Europe (to be specific - job region: europe-west1, worker location: europe-west1-d) where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and the dataFreshness keeps increasing (it has been increasing since I deployed the pipeline). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB while the max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session is usually related to a single customer, so Google should be expected to handle this number of keys per second, no?
I would like to get suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input
    .apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event)))
    .setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive in a session window, that specific window will persist in memory. This means that allowing data to be 30 days late keeps every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some everlasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (search for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at the time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness here). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it enters the sessions part and to stream into the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if the event's timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session the event belongs to...)
P.S. - in the bounded-sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, making the start time of the session earlier and thereby expanding it further. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function in the window-spanning part, so that a session can't exceed the max size and can only accept data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            // Only merge if the resulting span stays within maxSize + gapDuration.
            if (current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis()) {
                current.add(window);
                continue;
            }
        }
        // The window doesn't fit into the current candidate: close it and start a new one.
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}

Client application hangs when inserting into table on Oracle using ArrayBinding

Here is our environment:
.Net version: 4.5
Database: Oracle 12.1.0.2 (odp.net)
We are using LLBL "Adapter" but I don't think that has anything to do with the issue
LLBLGen Pro version: 4.1
Llbl Gen Pro Runtime: 4.1.13.1213
When we do an insert (always into different tables, which we use for a short period and then remove), we use the following code:
int numRecords = strings.Count();
var insertCmd = "insert into " + tableName + " (StringField) values (:StringField)";
var oracleCommand = new OracleCommand();
oracleCommand.CommandText = insertCmd;
oracleCommand.CommandType = CommandType.Text;
oracleCommand.BindByName = true;
oracleCommand.ArrayBindCount = numRecords;
oracleCommand.Parameters.Add(":StringField", OracleDbType.NVarchar2, strings.ToArray(), ParameterDirection.Input);
// this is an LLBL adapter. Like I said, I think the issue is below the LLBL layer.
this.adapter.ExecuteActionQuery(new ActionQuery(oracleCommand));
When the database is getting hit hard with multiple of these inserts in parallel, we get the following error and the insert call never returns from the database.
WG_6.Index_586.TVD: An exception was caught during the execution of an action query: ORA-24381: error(s) in array DML
ORA-12592: TNS:bad packet
ORA-12592: TNS:bad packet
ORA-12592: TNS:bad packet
ORA-12592: TNS:bad packet
ORA-03111: break received on communication channel
ORA-03111: break received on communication channel
ORA-03111: break received on communication channel
On the database, using Toad's session browser, I can see that the "Current Statement" is correct.
insert into schemaX.tableY(StringField) values(:Stringfield)
Under the Waits tab in Toad, there is the following message:
“Waiting for SQL*Net more data from client - waited X hundred seconds, so far” and the X keeps incrementing until we hit our database timeout.
We tried batches of 1 million and this gave us the best performance for our scenario. However, this hanging issue arose. I then decreased the ArrayBindCount to 500K, 100K, 50K, 10K and then 5K. Only when I used 5K did it stop happening.
A couple of notes:
This happens more frequently when the database is on a different physical machine than the client. When using a local VM, it rarely happens. The network that we are using is generally very reliable with no other noted issues.
From the error message (ORA-12592: TNS:bad packet), it seems that the issue might be on the client and perhaps related to code in the "Oracle.DataAccess.Client" (ODAC) dll.
My next steps for troubleshooting are to use Reflector to debug the call from the ODAC code and also to get more reliable client side tracing while forcing this error to occur.
I had the same situation when trying to insert into an Oracle table using ArrayBinding.
Using a small number for oracleCommand.ArrayBindCount seemed to reduce the frequency of the errors (same as yours) but did not eliminate them.
The solution was to use the managed data access driver. I suggest you get the latest ODP.NET, add a reference to Oracle.ManagedDataAccess and change to:
using Oracle.ManagedDataAccess.Client;
using Oracle.ManagedDataAccess.Types;
This fixed the problem in my case, with no need to change anything else in the code.
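For illustration, a minimal sketch of the same kind of array-bound insert through the managed provider (connectionString, tableName and strings are placeholders mirroring the question, not code taken from it):
using System.Data;
using Oracle.ManagedDataAccess.Client;

// Sketch only: the same array-bound insert as in the question, but through the
// managed driver - the OracleCommand/ArrayBindCount API mirrors the unmanaged one.
static void InsertStrings(string connectionString, string tableName, string[] strings)
{
    using (var connection = new OracleConnection(connectionString))
    using (var command = connection.CreateCommand())
    {
        connection.Open();
        command.CommandText = "insert into " + tableName + " (StringField) values (:StringField)";
        command.CommandType = CommandType.Text;
        command.BindByName = true;
        command.ArrayBindCount = strings.Length; // one round trip, N bound rows
        command.Parameters.Add(":StringField", OracleDbType.NVarchar2, strings, ParameterDirection.Input);
        command.ExecuteNonQuery();
    }
}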

StackExchange.Redis timeout and "No connection is available to service this operation"

I have the following issues in our production environment (web farm - 4 nodes, with a load balancer on top of it):
1) Timeout performing HGET key, inst: 3, queue: 29, qu=0, qs=29, qc=0, wr=0/0
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in ConnectionMultiplexer.cs:line 1699. This happens 3-10 times in a minute.
2) No connection is available to service this operation: HGET key at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in ConnectionMultiplexer.cs:line 1666
I tried to implement what Marc suggested (maybe I interpreted it incorrectly) - that it is better to have few connections to Redis rather than many.
I made the following implementation:
public class SeRedisConnection
{
    private static ConnectionMultiplexer _redis;
    private static readonly object SyncLock = new object();

    public static IDatabase GetDatabase()
    {
        if (_redis == null || !_redis.IsConnected || !_redis.GetDatabase().IsConnected(default(RedisKey)))
        {
            lock (SyncLock)
            {
                try
                {
                    var configurationOptions = new ConfigurationOptions
                    {
                        AbortOnConnectFail = false
                    };
                    configurationOptions.EndPoints.Add(new DnsEndPoint(ConfigurationHelper.CacheServerHost,
                        ConfigurationHelper.CacheServerHostPort));
                    _redis = ConnectionMultiplexer.Connect(configurationOptions);
                }
                catch (Exception ex)
                {
                    IoC.Container.Resolve<IErrorLog>().Error(ex);
                    return null;
                }
            }
        }
        return _redis.GetDatabase();
    }

    public static void Dispose()
    {
        _redis.Dispose();
    }
}
Actually, Dispose is not being used right now. There are also some specifics of my implementation which could cause such behavior (I'm only using hashes):
1. Add, Remove hashes - async
2. Get - sync
Could somebody help me figure out how to avoid this behavior?
Thanks a lot in advance!
SOLVED - increasing the client connection timeout after evaluating network capabilities.
UPDATE 2: Actually it didn't solve the problem once the cache volume started to increase, e.g. beyond 2GB.
Then I saw the same pattern: these timeouts happened about every 5 minutes, and our sites were frozen for some period of time every 5 minutes until the fork operation finished.
Then I found out that there is an option to make a fork (save to disk) every x seconds:
save 900 1
save 300 10
save 60 10000
In my case it was "save 300 10" - save every 5 minutes if at least 10 updates happened. I also found out that "fork" can be very expensive. Commenting out the "save" section resolved the problem entirely. We could comment out the "save" section because we use Redis only as an in-memory cache - we don't need any persistence.
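For reference, this is roughly what the relevant redis.conf section looks like with snapshotting disabled (the commented lines are the defaults quoted above; the save "" form is how newer Redis versions document disabling RDB persistence explicitly, so it may not apply to the 2.4.6 port):
# RDB snapshotting disabled - Redis is used purely as an in-memory cache
# save 900 1
# save 300 10
# save 60 10000
save ""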
Here is the configuration of our cache servers, the "Redis 2.4.6" Windows port: https://github.com/rgl/redis/downloads
Maybe it has been solved in recent versions of the Redis Windows port from MS Open Tech: http://msopentech.com/blog/2013/04/22/redis-on-windows-stable-and-reliable/
but I haven't tested it yet.
Anyway, StackExchange.Redis has nothing to do with this issue and it works pretty stably in our production environment, thanks to Marc Gravell.
FINAL UPDATE:
Redis is a single-threaded solution - it is extremely fast, but when it comes to releasing memory (removing items that are stale or expired), problems emerge because one thread has to reclaim the memory (which is not a fast operation, whatever algorithm is used) while that same thread also has to handle GET and SET operations. Of course this happens when we are talking about a medium-loaded production environment. Even if you use a cluster with slaves, it will show the same behavior once the memory limit is reached.
It looks like in most cases this exception is a client-side issue. Previous versions of StackExchange.Redis used the Win32 socket API directly, which sometimes has a negative impact; ASP.NET internal routing is probably somehow related to it as well.
The good news is that StackExchange.Redis's network infrastructure was completely rewritten recently. The latest version is 2.0.513. Try it - there is a good chance your problem will go away.

Performance transaction scope against DB2 table

We are trying to track down the cause of a performance problem.
We have a table with a single row that contains a primary key and a counter. Within a transaction we read the value of the counter, increment the value by one and save the new value.
The read and update are done using Entity Framework. We use a serializable transaction scope because we need to ensure that a counter value is read only once.
Most of the time this takes 0.1 seconds, but sometimes it takes over 1 second. We have not been able to find any pattern as to why this happens.
Has anyone else experienced variable performance when using transaction scope? Would it help to drop using transaction scope and set the transaction directly on the connection?
I remember commenting on this question a long time ago, but recently some developers in my shop have started using TransactionScope, and have also run into performance issues. While trying to search for some information, this came up pretty high in the Google Search results.
The issue we ran into was that, apparently chaining commands (INSERTs, etc.) with BeginChain does not work when using a TransactionScope (at least on the version we are running, Client v9.7.4.4 connecting to DB2 z/OS v 10).
I thought I would leave a workaround for the issue we ran into (slow performance when running lots [1k+] of INSERTs under TransactionScope, which ran fine when the scope was removed and chaining was allowed). I'm not really sure if it would help the original question directly, but there are some options if you look in the IBM.Data.DB2.dll classes that allow you to update rows using a DB2DataAdapter and an underlying DataTable.
Here's some example code in VB.NET:
Private Sub InsertByBulk(tableName As String, insertCollection As List(Of Object))
    Dim curTimestamp = Date.Now
    Using scope = New TransactionScope
        'Something that opens a connection to DB2, may vary
        Using conn = GetDB2Connection()
            'Dumb query to get DataTable from the ResultSet
            Dim sql = String.Format("SELECT * FROM {0} WHERE 1=0", tableName)
            Using adapter = New DB2DataAdapter(sql, conn)
                Using table As New DataTable
                    adapter.FillSchema(table, SchemaType.Source)
                    For Each item In insertCollection
                        Dim row = table.NewRow()
                        row("ID") = item.Id
                        row("CHAR_FIELD") = item.CharField
                        row("QUANTITY") = item.Quantity
                        row("UPDATE_TIMESTAMP") = curTimestamp
                        table.Rows.Add(row)
                    Next
                    Using bc = New DB2BulkCopy(conn)
                        bc.DestinationTableName = tableName
                        bc.WriteToServer(table)
                    End Using 'BulkCopy
                End Using 'DataTable
            End Using 'DataAdapter
        End Using 'Connection
        scope.Complete()
    End Using
End Sub
We have now solved this problem.
The root of the problem was that the DB2 provider does not support transaction promotion. This results in TransactionScope using MSDTC distributed transactions for everything.
We replaced the use of TransactionScope with transactions set on the database connection.
Composite services that included the code in the question then went from 3 seconds down to 0.3 seconds.
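For illustration, a minimal sketch of what "transactions set on the database connection" can look like with the IBM provider (the counter table and column names are made up; this is an assumption-level sketch, not our production code):
using System;
using System.Data;
using IBM.Data.DB2;

// Sketch only: read and increment the single-row counter inside a local
// connection-level transaction instead of a (promotable) TransactionScope.
// COUNTER_TABLE / COUNTER_VALUE / ID are hypothetical names.
static int NextCounterValue(string connectionString)
{
    using (var conn = new DB2Connection(connectionString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction(IsolationLevel.Serializable))
        using (var cmd = conn.CreateCommand())
        {
            cmd.Transaction = tx;

            cmd.CommandText = "SELECT COUNTER_VALUE FROM COUNTER_TABLE WHERE ID = 1 FOR UPDATE";
            int current = Convert.ToInt32(cmd.ExecuteScalar());

            cmd.CommandText = "UPDATE COUNTER_TABLE SET COUNTER_VALUE = COUNTER_VALUE + 1 WHERE ID = 1";
            cmd.ExecuteNonQuery();

            tx.Commit();
            return current + 1;
        }
    }
}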

Timeout error trying to lock table in h2

I get the following error under a certain scenario:
While a different thread is populating a lot of users via a bulk upload operation, I try to view the list of all users on a different web page. The list query throws the timeout error below. Is there a way to set this timeout so that I can avoid the error?
Env: h2 (latest), Hibernate 3.3.x
Caused by: org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "USER"; SQL statement:
[50200-144]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:327)
at org.h2.message.DbException.get(DbException.java:167)
at org.h2.message.DbException.get(DbException.java:144)
at org.h2.table.RegularTable.doLock(RegularTable.java:482)
at org.h2.table.RegularTable.lock(RegularTable.java:416)
at org.h2.table.TableFilter.lock(TableFilter.java:139)
at org.h2.command.dml.Select.queryWithoutCache(Select.java:571)
at org.h2.command.dml.Query.query(Query.java:257)
at org.h2.command.dml.Query.query(Query.java:227)
at org.h2.command.CommandContainer.query(CommandContainer.java:78)
at org.h2.command.Command.executeQuery(Command.java:132)
at org.h2.server.TcpServerThread.process(TcpServerThread.java:278)
at org.h2.server.TcpServerThread.run(TcpServerThread.java:137)
at java.lang.Thread.run(Thread.java:619)
at org.h2.engine.SessionRemote.done(SessionRemote.java:543)
at org.h2.command.CommandRemote.executeQuery(CommandRemote.java:152)
at org.h2.jdbc.JdbcPreparedStatement.executeQuery(JdbcPreparedStatement.java:96)
at org.jboss.resource.adapter.jdbc.WrappedPreparedStatement.executeQuery(WrappedPreparedStatement.java:342)
at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:208)
at org.hibernate.loader.Loader.getResultSet(Loader.java:1808)
at org.hibernate.loader.Loader.doQuery(Loader.java:697)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:259)
at org.hibernate.loader.Loader.doList(Loader.java:2228)
... 125 more
Yes, you can change the lock timeout. The default is relatively low: 1 second (1000 ms).
In many cases the problem is that another connection has locked the table, and using multi-version concurrency also solves the problem (append ;MVCC=true to the database URL).
EDIT: the MVCC=true parameter is no longer supported, because since H2 1.4.200 it is effectively always enabled for the MVStore engine, which is the default engine.
I faced much the same problem, and using the parameter "MVCC=true" solved it. You can find more explanation of this parameter in the H2 documentation here: http://www.h2database.com/html/advanced.html#mvcc
I'd like to suggest that if you are getting this error, then perhaps you should not be using a transaction on your bulk database operation. Consider instead doing a transaction on each individual update: does it make sense to think of an entire bulk import as a transaction? Probably not. If it does, then yes, MVCC=true or a bigger lock timeout is a reasonable solution.
However, I think in most cases you are seeing this error because you are trying to perform a very long transaction - in other words, you are not aware that you are performing a really long transaction. This was certainly the case for me, and I simply took more care in how I was writing records (either using no transactions or using smaller transactions), and the lock timeout issue was resolved.
For those having this issue with integration tests (i.e. the server is accessing the H2 db and an integration test is accessing the db before calling the server, to prepare the test): adding a 'commit' to the script executed before the test makes sure that the data is in the database before calling the server (without MVCC=true - which I find a bit 'weird' if it is not enabled by default).
I had MVCC=true in my connection string but was still getting the error above. I added ;DEFAULT_LOCK_TIMEOUT=10000;LOCK_MODE=0 and the problem was solved.
I got this issue with the Play Framework:
JPAQueryException occured : Error while executing query from
models.Page where name = ?: Timeout trying to lock table "PAGE"
It ended up being an infinite loop of sorts, because I had a
@Before
annotation without an unless clause, which caused the function to call itself repeatedly. The fix was:
@Before(unless="getUser")
Working with DBUnit, H2 and Hibernate - same error, MVCC=true helped, but I would still get the error for any tests following deletion of data. What fixed these cases was wrapping the actual deletion code inside a transaction:
Transaction tx = session.beginTransaction();
...delete stuff
tx.commit();
From a 2020 user, see reference
Basically, the reference says:
Sets the lock timeout (in milliseconds) for the current session. The default value for this setting is 1000 (one second).
This command does not commit a transaction, and rollback does not affect it. This setting can be appended to the database URL: jdbc:h2:./test;LOCK_TIMEOUT=10000
