Reduce code randomly getting stuck when inserting data into Postgres (Hadoop)

We have a MapReduce job written in Java that reads many small files (say 10k+), converts them to a single Avro file in the driver, and whose reducer inserts a batch of reduced records into a Postgres database. This process runs every hour. Multiple MapReduce jobs run simultaneously, each processing a different Avro file and opening its own database connection per job. Sometimes (very randomly) all of the tasks get stuck in the reducer phase with the following thread dump:
"C2 CompilerThread0" daemon prio=10 tid=0x00007f78701ae000 nid=0x6db5 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007f78701ab800 nid=0x6db4 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Surrogate Locker Thread (Concurrent GC)" daemon prio=10 tid=0x00007f78701a1800 nid=0x6db3 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007f787018a800 nid=0x6db2 in Object.wait() [0x00007f7847941000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000006e5d34418> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x00000006e5d34418> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)
"Reference Handler" daemon prio=10 tid=0x00007f7870181000 nid=0x6db1 in Object.wait() [0x00007f7847a42000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000006e5d32b50> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x00000006e5d32b50> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007f7870013800 nid=0x6da1 runnable [0x00007f7877a7b000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71)
at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:269)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1700)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
- locked <0x00000006e5d34520> (a org.postgresql.core.v3.QueryExecutorImpl)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:302)
at ComputeReducer.setup(ComputeReducer.java:299)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:162)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
"VM Thread" prio=10 tid=0x00007f787017e800 nid=0x6db0 runnable
"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007f7870024800 nid=0x6da2 runnable
"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007f7870026800 nid=0x6da3 runnable
After this happens we have to restart the database, otherwise all the reduce tasks sit idle, stuck at around 70%, and even the next hour's jobs cannot run. Initially this used to exhaust the number of open connections, but after raising the connection limit to a considerably high number that is no longer the case. I should point out that I am no database expert, so please suggest any configuration changes that might help. Just to confirm: does this look like a database configuration issue? If so, would configuring connection pooling in front of Postgres help resolve it?
Any help/suggestions are highly appreciated! Thanks in advance.

My initial thought would be that if it is random, it is probably a lock. There are two areas to look for locks:
locks between threads on shared resources, and locks on database objects.
I don't see anything in your stack trace to suggest a database lock issue, but it could be caused by not closing transactions: you don't get a deadlock, you just sit waiting on inserts.
More likely you have a deadlock in your Java code; perhaps the two waiting threads are waiting on each other?
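To illustrate the point about unclosed transactions, here is a minimal reducer sketch (not the original ComputeReducer; the connection details, table and SQL are invented) that always commits and closes its JDBC resources in cleanup(), so a failed or slow task cannot leave an open transaction blocking other jobs' inserts:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PgInsertReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    private Connection conn;
    private PreparedStatement insert;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // one connection per reducer task, as in the setup described above
            conn = DriverManager.getConnection(
                    "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
            conn.setAutoCommit(false);
            insert = conn.prepareStatement(
                    "INSERT INTO reduced_records (record_key, record_value) VALUES (?, ?)");
        } catch (Exception e) {
            throw new IOException("Could not open database connection", e);
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException {
        try {
            for (Text value : values) {
                insert.setString(1, key.toString());
                insert.setString(2, value.toString());
                insert.addBatch();
            }
            insert.executeBatch();
        } catch (Exception e) {
            throw new IOException("Insert failed for key " + key, e);
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Always finish the transaction and close, so nothing is left
        // dangling on the database side if the task dies mid-way.
        try { conn.commit(); } catch (Exception e) { try { conn.rollback(); } catch (Exception ignored) { } }
        try { insert.close(); } catch (Exception ignored) { }
        try { conn.close(); } catch (Exception ignored) { }
    }
}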

I want to add my findings.
After refactoring the code it worked fine for a couple of months, then the problem recurred. We thought it was a Hadoop cluster problem, so a small fresh Hadoop cluster was created, but that didn't solve the problem either. So finally we looked at our largest database table: it had more than 1.5 billion rows and SELECT queries were taking a very long time. Getting rid of the old data in this table, followed by a full vacuum and reindex, helped.
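For reference, a minimal JDBC sketch of that maintenance (the table name, retention window and connection details are made up; adjust to your schema). VACUUM cannot run inside a transaction block, so autocommit has to stay enabled:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TableMaintenance {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement()) {
            conn.setAutoCommit(true); // VACUUM refuses to run inside a transaction
            // purge old rows first, then reclaim the space and rebuild the indexes
            st.execute("DELETE FROM compute_results WHERE created_at < now() - interval '90 days'");
            st.execute("VACUUM FULL compute_results");
            st.execute("REINDEX TABLE compute_results");
        }
    }
}

Note that VACUUM FULL and REINDEX take exclusive locks on the table, so this is something to run in a maintenance window, not while the hourly jobs are inserting.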

Related

JMeter threads getting blocked after some time in test

Using JMeter 5.4.1 and JDK 11 in non-GUI mode, running 40 threads at 5 requests/sec.
All listeners off
Disabled all assertions, relying on application logs
Groovy used as the scripting language in JSR223 samplers, used to emulate pacing between requests
Heap size increased to 12 GB
After around 40 minutes, sometimes 70 minutes, JMeter stops generating load, as seen in the web server logs. I ran VisualVM on the JMeter machine, hooked into the JMeter process and took a thread dump.
The majority of my threads are blocked on an object monitor.
Log as below:
"Script11 - GetRequest 11-1" #61 prio=5 os_prio=0 cpu=13390.63ms elapsed=4266.73s tid=0x000001d1599d9800 nid=0xee0 waiting for monitor entry [0x000000ca924fe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.io.PrintStream.println(java.base#11.0.11/PrintStream.java:881)
- waiting to lock <0x0000000500857638> (a java.io.PrintStream)
at org.apache.jmeter.reporters.Summariser.formatAndWriteToLog(Summariser.java:329)
at org.apache.jmeter.reporters.Summariser.sampleOccurred(Summariser.java:208)
at org.apache.jmeter.threads.ListenerNotifier.notifyListeners(ListenerNotifier.java:58)
at org.apache.jmeter.threads.JMeterThread.notifyListeners(JMeterThread.java:1024)
at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:579)
at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:489)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:256)
at java.lang.Thread.run(java.base#11.0.11/Thread.java:834)
Locked ownable synchronizers:
- None
"0x0000000500857638" is Locked on
"Script09 - GetRequest9 9-2" #55 prio=5 os_prio=0 cpu=13562.50ms elapsed=4266.74s tid=0x000001d1599d5000 nid=0x13e0 runnable [0x000000ca91efe000]
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(java.base#11.0.11/Native Method)
at java.io.FileOutputStream.write(java.base#11.0.11/FileOutputStream.java:354)
at java.io.BufferedOutputStream.flushBuffer(java.base#11.0.11/BufferedOutputStream.java:81)
at java.io.BufferedOutputStream.flush(java.base#11.0.11/BufferedOutputStream.java:142)
- locked <0x0000000500857660> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(java.base#11.0.11/PrintStream.java:561)
- locked <0x0000000500857638> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(java.base#11.0.11/StreamEncoder.java:233)
at sun.nio.cs.StreamEncoder.implFlushBuffer(java.base#11.0.11/StreamEncoder.java:312)
at sun.nio.cs.StreamEncoder.flushBuffer(java.base#11.0.11/StreamEncoder.java:104)
- locked <0x00000005008577b8> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flushBuffer(java.base#11.0.11/OutputStreamWriter.java:181)
at java.io.PrintStream.write(java.base#11.0.11/PrintStream.java:606)
- locked <0x0000000500857638> (a java.io.PrintStream)
at java.io.PrintStream.print(java.base#11.0.11/PrintStream.java:745)
at java.io.PrintStream.println(java.base#11.0.11/PrintStream.java:882)
- locked <0x0000000500857638> (a java.io.PrintStream)
at org.apache.jmeter.reporters.Summariser.formatAndWriteToLog(Summariser.java:329)
at org.apache.jmeter.reporters.Summariser.sampleOccurred(Summariser.java:208)
at org.apache.jmeter.threads.ListenerNotifier.notifyListeners(ListenerNotifier.java:58)
at org.apache.jmeter.threads.JMeterThread.notifyListeners(JMeterThread.java:1024)
at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:579)
at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:489)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:256)
at java.lang.Thread.run(java.base#11.0.11/Thread.java:834)
Locked ownable synchronizers:
- None
Heap utilization is around 1.5 GB out of the 12 GB allocated.
CPU utilization on the host machine is around 15%.
Any suggestions on how to avoid the JMeter threads getting blocked?
Thanks.

Spring 4.3 Framework: Deadlock on ConcurrentHashMap AbstractBeanFactory.doGetBean

Scenario - this looks like a timing issue. We take an application lock (Lock#1) before calling getBean(), and then comes the Spring framework's ConcurrentHashMap lock (Lock#2). Threads are getting blocked on Lock#1 and Lock#2. Please find below snippets of the thread dumps that demonstrate the use case.
T1 - (Acquired Lock#1 and waiting for Lock#2)
"Catalina-utility-1" #85 prio=1 os_prio=0 tid=0x00007f9918034000 nid=0x33f0 waiting for monitor entry [0x00007f97f1ccd000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:213)
- waiting to lock <0x00000004df51aef8> (a java.util.concurrent.ConcurrentHashMap) (-> at this point the thread is waiting for the Spring lock, Lock#2, having already acquired Lock#1)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1082)
at com.bmc.unifiedadmin.service.ABCServiceBeanContext.getService(ABCServiceBeanContext.java:53)
at com.bmc.unifiedadmin.service.ABCServicesFactory.getService(ABCServicesFactory.java:52) (-> at this point the application semaphore, Lock#1, has been acquired)
T2 - (Acquired Lock#2 and waiting for Lock#1)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312) (-> at this point the thread is waiting for the TSPS lock, Lock#1, having already acquired Lock#2)
—
–
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:142)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:89)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:1151)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1103)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:511)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:481)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:312)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
- locked <0x00000004df51aef8> (a java.util.concurrent.ConcurrentHashMap) (-> at this point Lock#2 has been acquired)
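For illustration, a stripped-down reconstruction of the two code paths described above (all class and lock names are invented): T1 takes the application semaphore and then needs the singleton-registry monitor, while T2 holds the singleton-registry monitor and needs the semaphore, so neither can proceed.

import java.util.concurrent.Semaphore;

public class LockOrderingDeadlockSketch {

    private static final Semaphore appLock = new Semaphore(1);    // stands in for Lock#1
    private static final Object singletonRegistry = new Object(); // stands in for Lock#2

    // T1: Lock#1 -> Lock#2 (factory takes the semaphore, then asks Spring for the bean)
    static Object getServiceViaFactory() throws InterruptedException {
        appLock.acquire();
        try {
            synchronized (singletonRegistry) {   // blocks while T2 holds Lock#2
                return createBean();
            }
        } finally {
            appLock.release();
        }
    }

    // T2: Lock#2 -> Lock#1 (singleton creation calls back into code that needs the semaphore)
    static Object getSingleton() throws InterruptedException {
        synchronized (singletonRegistry) {
            appLock.acquire();                   // blocks while T1 holds Lock#1
            try {
                return createBean();
            } finally {
                appLock.release();
            }
        }
    }

    private static Object createBean() {
        return new Object();
    }
}

The usual fixes are to take the locks in a consistent order, or to avoid calling getBean() while holding the application lock at all.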

hystrix many threads in waiting state

We have used Hystrix (a circuit-breaker pattern library) in one of our modules.
The use case is: we poll 16 messages from Kafka and process them using a parallel stream, so each message in the workflow makes 3 REST calls, each protected by a Hystrix command. Now the issue is that when I run a single instance, the CPU shows spikes and a thread dump shows many threads in the WAITING state for all 3 commands, like below.
Thread names omitted, but assume all thread pools show the same thing:
Thread Pool-7" #82
Thread State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000004cee2312> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Could you please help me fine-tune the application and thread pool parameters?
What am I missing here?
The default isolation strategy of Hystrix is thread pool, and its default size is just 10. That means only 10 REST calls can be running at the same time in your case.
First, try increasing the default property below to a larger value.
hystrix.threadpool.default.coreSize=1000 # default is 10
If it works, adjust the value to the proper one.
default can be replaced with the proper HystrixThreadPoolKey for each thread pool.
If you are using the semaphore isolation strategy, try increasing this one instead.
hystrix.command.default.execution.isolation.semaphore.maxConcurrentRequests=1000
The default for this one is also just 10. default can be replaced with the HystrixCommandKey name for each semaphore.
Updated
To choose the isolation strategy, you can use the below property.
hystrix.command.default.execution.isolation.strategy=THREAD or SEMAPHORE
default can be replaced with a HystrixCommandKey, which means you can assign a different strategy to each Hystrix command.
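If the commands are defined in code rather than through external properties, the same thread pool sizing can be applied per command with the Setter API. A minimal sketch (the group key, pool key and size are just example values):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class RestCallCommand extends HystrixCommand<String> {

    public RestCallCommand() {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("WorkflowRestCalls"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("WorkflowRestPool"))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(50))); // raised from the default of 10
    }

    @Override
    protected String run() {
        // the actual REST call would go here
        return "response";
    }
}

With a dedicated HystrixThreadPoolKey per command you can also size each of the three REST calls independently.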

Oracle JDBC connection cache, connection kept open for long time and eventually fails to close it

In our application we are facing an issue where certain Hibernate queries take longer than usual (sometimes never completing), and when profiled with a profiler we observe that the connection objects related to these queries are opened but never closed.
Because of this behavior, the application eventually runs out of connections and goes into high CPU and heap utilization.
java.lang.Thread.State: TIMED_WAITING
at java.lang.Object.wait(Native Method)
- waiting on <3a685292> (a oracle.jdbc.pool.OracleImplicitConnectionCache)
at oracle.jdbc.pool.OracleImplicitConnectionCache.processConnectionWaitTimeout(OracleImplicitConnectionCache.java:2955)
at oracle.jdbc.pool.OracleImplicitConnectionCache.getConnection(OracleImplicitConnectionCache.java:374)
at oracle.jdbc.pool.OracleDataSource.getConnection(OracleDataSource.java:374)
at oracle.jdbc.pool.OracleDataSource.getConnection(OracleDataSource.java:178)
at oracle.jdbc.pool.OracleDataSource.getConnection(OracleDataSource.java:156)
at org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy$LazyConnectionInvocationHandler.getTargetConnection(LazyConnectionDataSourceProxy.java:403)
at org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy$LazyConnectionInvocationHandler.invoke(LazyConnectionDataSourceProxy.java:376)
at com.sun.proxy.$Proxy75.prepareStatement(Unknown Source)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$5.doPrepare(StatementPreparerImpl.java:161)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:182)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(StatementPreparerImpl.java:159)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1854)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1831)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1811)
at org.hibernate.loader.Loader.doQuery(Loader.java:899)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:341)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:311)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2111)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:82)
at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:72)
at org.hibernate.persister.entity.AbstractEntityPersister.load(AbstractEntityPersister.java:3917)
at org.hibernate.event.internal.DefaultLoadEventListener.loadFromDatasource(DefaultLoadEventListener.java:460)
at org.hibernate.event.internal.DefaultLoadEventListener.doLoad(DefaultLoadEventListener.java:429)
at org.hibernate.event.internal.DefaultLoadEventListener.load(DefaultLoadEventListener.java:206)
at org.hibernate.event.internal.DefaultLoadEventListener.proxyOrLoad(DefaultLoadEventListener.java:262)
at org.hibernate.event.internal.DefaultLoadEventListener.onLoad(DefaultLoadEventListener.java:150)
at org.hibernate.internal.SessionImpl.fireLoad(SessionImpl.java:1091)
at org.hibernate.internal.SessionImpl.access$2000(SessionImpl.java:174)
at org.hibernate.internal.SessionImpl$IdentifierLoadAccessImpl.load(SessionImpl.java:2473)
at org.hibernate.internal.SessionImpl.get(SessionImpl.java:991)
at org.hibernate.event.internal.DefaultMergeEventListener.entityIsDetached(DefaultMergeEventListener.java:271)
at org.hibernate.event.internal.DefaultMergeEventListener.onMerge(DefaultMergeEventListener.java:151)
at org.hibernate.event.internal.DefaultMergeEventListener.onMerge(DefaultMergeEventListener.java:76)
at org.hibernate.internal.SessionImpl.fireMerge(SessionImpl.java:913)
at org.hibernate.internal.SessionImpl.merge(SessionImpl.java:897)
at org.hibernate.internal.SessionImpl.merge(SessionImpl.java:901)
In such a scenario, could you please suggest what kind of timeout property is preferable for the connection cache?
From the JDBC documentation we came across the following properties; please help:
InactivityTimeout
TimeToLiveTimeout
AbandonedConnectionTimeout
Ref: http://docs.oracle.com/cd/B14117_01/java.101/b10979/conncache.htm#CDEBCBJC
Please use Oracle Universal Connection Pool (UCP) for Java, the replacement for the Implicit Connection Cache (ICC), which has been de-supported in Oracle Database 12c. The documentation can be found on OTN.
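A minimal UCP configuration sketch (the URL, credentials, pool sizes and timeout values are placeholders) showing the pool properties that roughly correspond to the timeouts listed in the question:

import java.sql.Connection;
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class UcpConfigSketch {
    public static void main(String[] args) throws Exception {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//dbhost:1521/ORCL");
        pds.setUser("app_user");
        pds.setPassword("app_password");
        pds.setInitialPoolSize(5);
        pds.setMaxPoolSize(50);

        pds.setInactiveConnectionTimeout(300);    // close connections idle in the pool (cf. InactivityTimeout)
        pds.setTimeToLiveConnectionTimeout(1800); // cap how long a borrowed connection may live (cf. TimeToLiveTimeout)
        pds.setAbandonedConnectionTimeout(120);   // reclaim connections the application forgot to close (cf. AbandonedConnectionTimeout)
        pds.setConnectionWaitTimeout(30);         // fail fast instead of hanging when the pool is exhausted

        try (Connection conn = pds.getConnection()) {
            // hand this DataSource to Hibernate/Spring instead of the OracleDataSource with the implicit cache
        }
    }
}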

Weblogic Stuck Thread on JDBC call

We frequently get a series of Stuck threads on our Weblogic servers. I've analyzed this over a period of time.
What I'd like to understand is whether this stuck-thread stack indicates that it is still reading data from the open socket to the database, since the queries are simple SELECT statements:
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Packet.java:239)
at oracle.net.ns.DataPacket.receive(DataPacket.java:92)
We've run netstat and other commands; the sockets from the WebLogic app server to the database match the number of connections in the pool.
Any ideas what else we should be investigating here?
Stack trace of thread dump:
"[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x61a5b000 nid=0x25f runnable [0x6147b000..0x6147eeb0]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Packet.java:239)
at oracle.net.ns.DataPacket.receive(DataPacket.java:92)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:172)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:117)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:92)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:77)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1023)
at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:999)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:584)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:183)
at oracle.jdbc.driver.T4CStatement.fetch(T4CStatement.java:1000)
at oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:314)
- locked <0x774546e0> (a oracle.jdbc.driver.T4CConnection)
at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:228)
- locked <0x774546e0> (a oracle.jdbc.driver.T4CConnection)
at weblogic.jdbc.wrapper.ResultSet_oracle_jdbc_driver_OracleResultSetImpl.next(Unknown Source)
The bit starting from weblogic.work.ExecuteThread.run down to here has been omitted. We have 8 sets of thread dumps, and each one shows the thread waiting on the same line, with the same object locked:
at oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:314)
- locked <0x774546e0> (a oracle.jdbc.driver.T4CConnection)
At the time the stack was printed, the thread appears to be blocked waiting for more data from the server:
at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:228)
Maybe it is just a query that takes longer than StuckThreadMaxTime, and WebLogic issues a warning.
If possible I would try to:
Find which query or queries are getting the threads stuck and check their execution time
Use Wireshark to analyze the communication with the database
Have a look at the driver source code (a decompiler such as JD comes to mind) to understand the stack trace
If you use the WebLogic debug flag -Dweblogic.debug.DebugJDBCSQL you will be able to trace the SQL which is actually being executed
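As an aside (not something the answer above prescribes), a plain JDBC query timeout is one way to bound how long a statement can sit in socketRead0, so a slow SELECT fails with an exception instead of holding the execute thread past StuckThreadMaxTime. A sketch, with a hypothetical JNDI data source name and query:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class BoundedQueryExample {
    public void runQuery() throws Exception {
        DataSource ds = (DataSource) new InitialContext().lookup("jdbc/myDataSource"); // hypothetical JNDI name
        try (Connection conn = ds.getConnection();
             Statement st = conn.createStatement()) {
            st.setQueryTimeout(60); // seconds; the driver cancels the statement if the database takes longer
            try (ResultSet rs = st.executeQuery("SELECT id, name FROM some_table")) {
                while (rs.next()) {
                    // process rows
                }
            }
        }
    }
}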
