PySpark: Saving a dataframe to Hadoop or HDFS without overflowing memory?

I'm working on a pipeline that reads a number of Hive tables and parses them into DenseVectors for eventual use in SparkML. I want to iterate a lot to find optimal training parameters, both model inputs and computing resources. The dataframe I'm working with is somewhere between 50 and 100 GB in total, spread across a dynamic number of executors on a YARN cluster.
Whenever I try to save, either to Parquet or with saveAsTable, I get a series of failed tasks before the job finally fails completely and suggests raising spark.yarn.executor.memoryOverhead. Each id is a single row, no more than a few KB.
feature_df.write.parquet('hdfs:///user/myuser/featuredf.parquet', mode='overwrite', partitionBy='id')
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 98 in stage 33.0 failed 4 times, most recent failure: Lost task 98.3 in
stage 33.0 (TID 2141, rs172.hadoop.pvt, executor 441): ExecutorLostFailure
(executor 441 exited caused by one of the running tasks) Reason: Container
killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
I currently have spark.yarn.executor.memoryOverhead at 2g.
Spark executors are currently getting 10 GB each, and the driver (which is not on the cluster) gets 16 GB with a maxResultSize of 5 GB.
I'm caching the dataframe before I write; what else can I do to troubleshoot?
Edit: it seems like Spark is trying to do all of my transformations at once. When I look at the details for the saveAsTable() call, the physical plan is:
== Physical Plan ==
InMemoryTableScan [id#0L, label#90, features#119]
+- InMemoryRelation [id#0L, label#90, features#119], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Filter (isnotnull(id#0L) && (id#0L < 21326835))
+- InMemoryTableScan [id#0L, label#90, features#119], [isnotnull(id#0L), (id#0L < 21326835)]
+- InMemoryRelation [id#0L, label#90, features#119], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *Project [id#0L, label#90, pythonUDF0#135 AS features#119]
+- BatchEvalPython [<lambda>(collect_list_is#108, 56845.0)], [id#0L, label#90, collect_list_is#108, pythonUDF0#135]
+- SortAggregate(key=[id#0L, label#90], functions=[collect_list(indexedSegs#39, 0, 0)], output=[id#0L, label#90, collect_list_is#108])
+- *Sort [id#0L ASC NULLS FIRST, label#90 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#0L, label#90, 200)
+- *Project [id#0L, UDF(segment#2) AS indexedSegs#39, cast(label#1 as double) AS label#90]
+- *BroadcastHashJoin [segment#2], [entry#12], LeftOuter, BuildRight
:- HiveTableScan [id#0L, label#1, segment#2], MetastoreRelation pmccarthy, reka_data_long_all_files
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- *Project [cast(entry#7 as string) AS entry#12]
+- HiveTableScan [entry#7], MetastoreRelation reka_trop50, public_crafted_audiences_sized

My suggestion would be to disable dynamic allocation. Try running it with the configuration below:
--master yarn-client --driver-memory 15g --executor-memory 15g --executor-cores 10 --num-executors 15 --conf spark.yarn.executor.memoryOverhead=20000 --conf spark.yarn.driver.memoryOverhead=20000 --conf spark.default.parallelism=500
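If the job is driven from a PySpark script rather than via spark-submit flags, roughly the same settings can be applied through SparkConf before the SparkContext is created. This is only a sketch mirroring the flags above; the values are the answerer's and are not tuned for any particular cluster:
from pyspark import SparkConf, SparkContext

# Executor sizing, overheads and parallelism must be in place before the
# SparkContext starts, because YARN containers are sized when executors are
# requested. Driver memory itself still has to be passed on the command
# line in client mode, since that JVM is already running.
conf = (SparkConf()
        .setMaster('yarn-client')
        .set('spark.executor.memory', '15g')
        .set('spark.executor.cores', '10')
        .set('spark.executor.instances', '15')
        .set('spark.dynamicAllocation.enabled', 'false')
        .set('spark.yarn.executor.memoryOverhead', '20000')
        .set('spark.yarn.driver.memoryOverhead', '20000')
        .set('spark.default.parallelism', '500'))
sc = SparkContext(conf=conf)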

Ultimately the clue I got from the Spark user mailing list was to look at the partitions, both their balance and their sizes. As the planner had it, too much data was being handed to a single executor instance. Adding .repartition(1000) to the expression creating the dataframe to be written made all the difference, and further gains could probably be achieved by creating and repartitioning on a clever key column.
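In code, the fix amounts to something like the sketch below; the path, the id column and the partition count of 1000 are taken from the question and answer above rather than being a general recipe:
# Repartition before caching/writing so that no single executor ends up
# holding an oversized partition; 1000 is the count that worked here.
feature_df = feature_df.repartition(1000)
# Alternatively, repartition on a well-distributed key column:
# feature_df = feature_df.repartition(1000, 'id')
feature_df.write.parquet('hdfs:///user/myuser/featuredf.parquet',
                         mode='overwrite',
                         partitionBy='id')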

Related

Hive error: java.lang.Throwable: Child Error

I am using CDH 5.9, and while executing the following Hive query it throws an error. Any idea what the issue is?
A normal SELECT query works, but this more complex query fails.
hive> select * from table where dt='22-01-2017' and field like '%xyz%' limit 10;
Query ID = hdfs_20170123200303_44a9c423-4bb3-4f80-ade4-b1312971eb63
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201701131637_0067, Tracking URL = http://cdhum03.temp-dsc-updates.bms.bz:50030/jobdetails.jsp?jobid=job_201701131637_0067
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201701131637_0067
Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0
2017-01-23 20:05:46,563 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201701131637_0067 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cdhum03.temp-dsc-updates.bms.bz:50030/jobdetails.jsp?jobid=job_201701131637_0067
Examining task ID: task_201701131637_0067_m_000007 (and more) from job job_201701131637_0067
Examining task ID: task_201701131637_0067_r_000000 (and more) from job job_201701131637_0067
Task with the most failures(4):
-----
Task ID:
task_201701131637_0067_m_000006
URL:
http://cdhum03.temp-dsc-updates.bms.bz:50030/taskdetails.jsp?jobid=job_201701131637_0067&tipid=task_201701131637_0067_m_000006
-----
Diagnostic Messages for this Task:
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Thanks.
Please check your data size: the job needs more space than the JVMs have available, so either scale your cluster or make the query more selective. In the query you are using -
select * from table where dt='22-01-2017' and field like '%xyz%' limit 10
- the '%xyz%' predicate forces a scan of the whole dataset, so a more specific filter would be better.
Otherwise, drop your table and create a new partitioned table with the date as the partition column.

Hive query on small dataset never finishes (or OOM)

Performing a simple query on a small sample dataset (195 rows, 22 columns) either throws an out-of-memory exception or, following many suggestions to increase memory sizes, never finishes.
Options tried
set hive.optimize.sort.dynamic.partition = true
increase tez memory
increase memory & decrease shuffle size
increase memory
more like that
Sometimes the OOM error is gone, but then it runs for hours without any result...
Query
select *, lag(status, 1, null) over (partition by type_id order by time) as status_prev from sample_table
Example query that never stops
hive -hiveconf hive.tez.container.size=2048 -hiveconf hive.tez.java.opts=-Xmx1640m -hiveconf tez.runtime.io.sort.mb=820 -hiveconf tez.runtime.unordered.output.buffer.size-mb=205 -e "select *, lag(status, 1, null) over (partition by type_id order by time) as status_prev from sample_table"
Out of memory
Status: Running (Executing on YARN cluster with App id application_1473144435077_0015)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 FAILED 1 0 0 1 4 0
Reducer 2 KILLED 1 0 0 1 0 0
--------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 18.30 s
--------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1473144435077_0015_1_00, diagnostics=[Task failed, taskId=task_1473144435077_0015_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:159)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.<init>(PipelinedSorter.java:172)
at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.<init>(PipelinedSorter.java:116)
at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.start(OrderedPartitionedKVOutput.java:142)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:142)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
... 14 more
Never stops
(33 seconds is just an example; it doesn't stop even after hours)
Status: Running (Executing on YARN cluster with App id application_1473144435077_0025)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 INITED 1 0 0 1 0 0
Reducer 2 INITED 1 0 0 1 0 0
--------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 33.32 s
--------------------------------------------------------------------------------
It took me too long to find the answer, so hopefully this will help someone else...
This breaks down into two problems:
heap size too small, solved by increasing the heap size
Hive job stuck in pending status
The following command solved both issues for me:
hive -hiveconf hive.tez.container.size=512 -hiveconf hive.tez.java.opts="-server -Xmx512m -Djava.net.preferIPv4Stack=true" -e "select *, lag(status, 1, null) over (partition by type_id order by time) as status_prev from sample_table"

ACID transactions on data added from Spark not working

I'm trying to use ACID transactions in Hive, but I have a problem when the data is added with Spark.
First, I created a table with the following statement:
CREATE TABLE testdb.test(id string, col1 string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');
Then I added data with these queries:
INSERT INTO testdb.test VALUES("1", "A");
INSERT INTO testdb.test VALUES("2", "B");
INSERT INTO testdb.test VALUES("3", "C");
And I've been able to delete rows with this query:
DELETE FROM testdb.test WHERE id="1";
All that worked perfectly, but a problem occurs when I try to delete rows that were added with Spark.
What I do in Spark (IPython):
from pyspark.sql import HiveContext

hc = HiveContext(sc)
data = sc.parallelize([["1", "A"], ["2", "B"], ["3", "C"]])
data_df = hc.createDataFrame(data)
data_df.registerTempTable("data_df")
hc.sql("INSERT INTO testdb.test SELECT * FROM data_df")
Then, when I come back to Hive, I'm able to run a SELECT query on the "test" table.
However, when I try to run the exact same DELETE query as before, I get the following error (it happens after the reduce phase):
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
... 7 more
I have no idea where this is coming from, which is why I'm looking for ideas.
I'm using the Cloudera Quickstart VM (5.4.2).
Hive version: 1.1.0
Spark version: 1.3.0
And here is the complete output of the Hive DELETE command:
hive> delete from testdb.test where id="1";
Query ID = cloudera_20160914090303_795e40b7-ab6a-45b0-8391-6d41d1cfe7bd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1473858545651_0036, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1473858545651_0036
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4
2016-09-14 09:03:55,571 Stage-1 map = 0%, reduce = 0%
2016-09-14 09:04:14,898 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.66 sec
2016-09-14 09:04:15,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.33 sec
2016-09-14 09:04:44,101 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 4.21 sec
2016-09-14 09:04:46,523 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 4.79 sec
2016-09-14 09:04:47,673 Stage-1 map = 100%, reduce = 42%, Cumulative CPU 5.8 sec
2016-09-14 09:04:50,041 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 7.45 sec
2016-09-14 09:05:18,486 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.69 sec
MapReduce Total cumulative CPU time: 7 seconds 690 msec
Ended Job = job_1473858545651_0036 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/
Examining task ID: task_1473858545651_0036_m_000000 (and more) from job job_1473858545651_0036
Task with the most failures(4):
-----
Task ID:
task_1473858545651_0036_r_000001
URL:
http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1473858545651_0036&tipid=task_1473858545651_0036_r_000001
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253)
... 7 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
... 7 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 4 Cumulative CPU: 7.69 sec HDFS Read: 21558 HDFS Write: 114 FAIL
Total MapReduce CPU Time Spent: 7 seconds 690 msec
Thanks!
Use the Spark HiveAcid Datasource - http://github.com/qubole/spark-acid
val df = spark.read.format("HiveAcid").option("table", "testdb.test").load()
df.collect()
Spark needs to run with HMS 3.1.1 so that the underlying datasource can take necessary locks etc.
The directory structure and file formats are different for a Hive ACID table compared with a normal table, and CRUD needs to happen from Hive.
With respect to Spark, normal table reads are not compatible with Hive ACID table reads, so we could not use the native Spark APIs to read the table.
Also, there is currently no support for updates, deletes, or inserts from Spark.
As for reading the data, one could use the connector - http://github.com/qubole/spark-acid
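For completeness, the equivalent read from PySpark would look roughly like this; a sketch assuming the spark-acid package is already on the classpath and that a SparkSession named spark exists:
# Read a Hive ACID table through the qubole spark-acid datasource.
df = spark.read.format("HiveAcid").option("table", "testdb.test").load()
df.show()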
I had the same issue running from Hue, but after I set these parameters from the Hive CLI, it started working:
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DBTxnManager;
set hive.compactor.initiator.on=true;

SonarQube 5.2 MySQLTransactionRollbackException: Deadlock found when trying to get lock

Using SonarQube 5.2, I'm seeing the following deadlock issue:
05:48:22 ERROR: Error during Sonar runner execution
05:48:22 java.lang.IllegalStateException: Fail to execute request
[code=500, url=http://192.168.109.6/api/ce/submit?projectKey=CoprHD&projectName=CoprHD-controller&projectBranch=bugfix-COP-19001-hotfix]:
{"errors":[{"msg":"\n### Error updating database.
Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException:
Deadlock found when trying to get lock; try restarting transaction\n
### The error may involve org.sonar.db.user.RoleMapper.insertGroupRole-Inline\n### The error occurred while setting parameters\n
### SQL: INSERT INTO group_roles (group_id, resource_id, role) VALUES (?, ?, ?)\n
### Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction"}]}
05:48:22 at org.sonar.batch.report.ReportPublisher.uploadMultiPartReport(ReportPublisher.java:182)
05:48:22 at org.sonar.batch.report.ReportPublisher.sendOrDumpReport(ReportPublisher.java:151)
05:48:22 at org.sonar.batch.report.ReportPublisher.execute(ReportPublisher.java:115)
05:48:22 at org.sonar.batch.phases.PhaseExecutor.publishReportJob(PhaseExecutor.java:116)
05:48:22 at org.sonar.batch.phases.PhaseExecutor.execute(PhaseExecutor.java:106)
05:48:22 at org.sonar.batch.scan.ModuleScanContainer.doAfterStart(ModuleScanContainer.java:192)
05:48:22 at org.sonar.core.platform.ComponentContainer.startComponents(ComponentContainer.java:100)
05:48:22 at org.sonar.core.platform.ComponentContainer.execute(ComponentContainer.java:85)
05:48:22 at org.sonar.batch.scan.ProjectScanContainer.scan(ProjectScanContainer.java:258)
05:48:22 at org.sonar.batch.scan.ProjectScanContainer.scanRecursively(ProjectScanContainer.java:253)
05:48:22 at org.sonar.batch.scan.ProjectScanContainer.doAfterStart(ProjectScanContainer.java:243)
05:48:22 at org.sonar.core.platform.ComponentContainer.startComponents(ComponentContainer.java:100)
05:48:22 at org.sonar.core.platform.ComponentContainer.execute(ComponentContainer.java:85)
05:48:22 at org.sonar.batch.bootstrap.GlobalContainer.executeAnalysis(GlobalContainer.java:153)
05:48:22 at org.sonar.batch.bootstrapper.Batch.executeTask(Batch.java:110)
05:48:22 at org.sonar.runner.batch.BatchIsolatedLauncher.execute(BatchIsolatedLauncher.java:55)
05:48:22 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
05:48:22 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
05:48:22 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
05:48:22 at java.lang.reflect.Method.invoke(Method.java:606)
05:48:22 at org.sonar.runner.impl.IsolatedLauncherProxy.invoke(IsolatedLauncherProxy.java:61)
05:48:22 at com.sun.proxy.$Proxy0.execute(Unknown Source)
05:48:22 at org.sonar.runner.api.EmbeddedRunner.doExecute(EmbeddedRunner.java:275)
05:48:22 at org.sonar.runner.api.EmbeddedRunner.runAnalysis(EmbeddedRunner.java:166)
05:48:22 at org.sonar.runner.api.EmbeddedRunner.runAnalysis(EmbeddedRunner.java:153)
05:48:22 at org.sonar.runner.cli.Main.runAnalysis(Main.java:118)
05:48:22 at org.sonar.runner.cli.Main.execute(Main.java:80)
05:48:22 at org.sonar.runner.cli.Main.main(Main.java:66)
Searching for similar reports I found this reference which says the issue was resolved: https://jira.sonarsource.com/browse/SONAR-1945
I also found a reference saying that transaction-isolation should be changed from REPEATABLE-READ to READ-COMMITTED. Is this a reasonable thing to do in MySQL for SonarQube?
mysql> show variables like '%wait_timeout%';
+--------------------------+----------+
| Variable_name | Value |
+--------------------------+----------+
| innodb_lock_wait_timeout | 500 |
| lock_wait_timeout | 31536000 |
| wait_timeout | 28800 |
+--------------------------+----------+
3 rows in set (0.25 sec)
mysql> show variables like '%tx_isolation%';
+---------------+-----------------+
| Variable_name | Value |
+---------------+-----------------+
| tx_isolation | REPEATABLE-READ |
+---------------+-----------------+
1 row in set (0.00 sec)
mysql> SELECT @@GLOBAL.tx_isolation, @@tx_isolation;
+-----------------------+-----------------+
| @@GLOBAL.tx_isolation | @@tx_isolation |
+-----------------------+-----------------+
| REPEATABLE-READ | REPEATABLE-READ |
+-----------------------+-----------------+
For further information about the deadlock, here is some data.
Does anyone know whether this is something that should be tweaked in MySQL, or an issue that needs to be fixed in the SonarQube application?
mysql> show engine innodb status
=====================================
2015-12-18 07:42:25 7f61f03cd700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 31 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 44635 srv_active, 0 srv_shutdown, 1284536 srv_idle
srv_master_thread log flush and writes: 1329157
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 224853
OS WAIT ARRAY INFO: signal count 1727534
Mutex spin waits 1578113, rounds 7231747, OS waits 74673
RW-shared spins 483413, rounds 5257332, OS waits 110301
RW-excl spins 197945, rounds 3737144, OS waits 35005
Spin rounds per wait: 4.58 mutex, 10.88 RW-shared, 18.88 RW-excl
------------------------
LATEST DETECTED DEADLOCK
------------------------
2015-12-17 05:46:47 7f61f0594700
*** (1) TRANSACTION:
TRANSACTION 17641507, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 8 lock struct(s), heap size 1184, 7 row lock(s), undo log entries 9
MySQL thread id 5021, OS thread handle 0x7f61f071a700, query id 33269201 localhost 127.0.0.1 sonar update
INSERT INTO group_roles (group_id, resource_id, role)
VALUES (null, 1515106, 'codeviewer')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 310 page no 6 n bits 472 index `group_roles_resource` of table `sonar`.`group_roles` trx id 17641507 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
*** (2) TRANSACTION:
TRANSACTION 17641509, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
7 lock struct(s), heap size 1184, 4 row lock(s), undo log entries 3
MySQL thread id 5005, OS thread handle 0x7f61f0594700, query id 33269204 localhost 127.0.0.1 sonar update
INSERT INTO group_roles (group_id, resource_id, role)
VALUES (1, 1515107, 'admin')
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 310 page no 6 n bits 472 index `group_roles_resource` of table `sonar`.`group_roles` trx id 17641509 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 310 page no 6 n bits 472 index `group_roles_resource` of table `sonar`.`group_roles` trx id 17641509 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
*** WE ROLL BACK TRANSACTION (2)
------------
TRANSACTIONS
------------
Trx id counter 18864174
Purge done for trx's n:o
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 7482, OS thread handle 0x7f61f03cd700, query id 38116433 localhost sonar init
show engine innodb status
---TRANSACTION 18864038, not started
MySQL thread id 7478, OS thread handle 0x7f61f3349700, query id 38115903 localhost 127.0.0.1 sonar cleaning up
---TRANSACTION 18864173, not started
MySQL thread id 7475, OS thread handle 0x7f61f040e700, query id 38116432 localhost 127.0.0.1 sonar cleaning up
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
7146308 OS file reads, 6478063 OS file writes, 1783568 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 3077, seg size 3079, 22965 merges
merged operations:
insert 45672, delete mark 7198683, delete 214896
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 6374777, node heap has 11107 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 219765124434
Log flushed up to 219765124434
Pages flushed up to 219765124434
Last checkpoint at 219765124434
0 pending log writes, 0 pending chkp writes
1189792 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 3296722944; in additional pool allocated 0
Dictionary memory allocated 359878
Buffer pool size 196600
Free buffers 8192
Database pages 177301
Old database pages 65285
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 1567756, not young 296705943
0.00 youngs/s, 0.00 non-youngs/s
Pages read 7146255, created 1592527, written 5004155
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 177301, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]

spray-client throwing "Too many open files" exception when giving more concurrent requests

I have a spray HTTP client running on server X, which makes connections to server Y. Server Y is kind of slow (it takes 3+ seconds per request).
This is my HTTP client invocation:
def get() {
  val result = for {
    response <- IO(Http).ask(HttpRequest(GET, Uri(getUri(msg)), headers)).mapTo[HttpResponse]
  } yield response

  result onComplete {
    case Success(res)   => sendSuccess(res)
    case Failure(error) => sendError(error)
  }
}
These are the configurations I have in application.conf:
spray.can {
  client {
    request-timeout = 30s
    response-chunk-aggregation-limit = 0
    max-connections = 50
    warn-on-illegal-headers = off
  }
  host-connector {
    max-connections = 128
    idle-timeout = 3s
  }
}
Then I tried to stress server X with a large number of concurrent requests (using ab with n=1000 and c=100).
Up to about 900 requests it went fine; after that the server threw a lot of exceptions and I couldn't hit it any more.
These are the exceptions:
[info] [ERROR] [03/28/2015 17:33:13.276] [squbs-akka.actor.default-dispatcher-6] [akka://squbs/system/IO-TCP/selectors/$a/0] Accept error: could not accept new connection
[info] java.io.IOException: Too many open files
[info] at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
[info] at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
[info] at akka.io.TcpListener.acceptAllPending(TcpListener.scala:103)
and on hitting the same server again, it threw the exception below:
[info] [ERROR] [03/28/2015 17:53:16.735] [hcp-client-akka.actor.default-dispatcher-6] [akka://hcp-client/system/IO-TCP/selectors] null
[info] akka.actor.ActorInitializationException: exception during creation
[info] at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
[info] at akka.actor.ActorCell.create(ActorCell.scala:596)
[info] Caused by: java.lang.reflect.InvocationTargetException
[info] at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
[info] Caused by: java.io.IOException: Too many open files
[info] at sun.nio.ch.IOUtil.makePipe(Native Method)
I was previously using the Apache HTTP client (which is synchronous), and it was able to handle 10000+ requests with a concurrency of 100.
I'm not sure what I'm missing. Any help would be appreciated.
The problem is that every time you call the get() method it creates a new actor, which opens at least one connection to the remote server. Furthermore, you never shut that actor down, so each such connection lives until it times out.
You only need a single such actor to manage all your HTTP requests, so to fix it, take IO(Http) out of the get() method and call it only once. Reuse the returned ActorRef for all your requests to that server, and shut it down on application shutdown.
For example:
val system: ActorSystem = ...
val io = IO(Http)(system)
io ! Http.Bind( ...
def get(): Unit = {
  ...
  io.ask ...
  // or
  io.tell ...
}
