Delayed execution for 'EXPLAIN' queries on PrestoSQL clusters - trino

I have two types of PrestoSQL clusters, one on AWS instances and one on Kubernetes. PrestoSQL on K8s has a weird issue with EXPLAIN queries: they take a long time, ~2-3 minutes, compared to 2-3 seconds on the instance-based cluster.
The query stays in the "WAITING_FOR_RESOURCES" state for about 2 minutes and then executes very quickly.
There is also an exception in the server logs:
2020-12-23T05:25:01.930Z ERROR Query-20201223_052431_00004_pxqak-276 io.prestosql.cost.CachingStatsProvider Error occurred when computing stats for query 20201223_052431_00004_pxqak
io.prestosql.spi.PrestoException: HIVE_METASTORE_ERROR
at io.prestosql.plugin.hive.metastore.thrift.ThriftHiveMetastore.getMetastorePartitionColumnStatistics(ThriftHiveMetastore.java:461)
at io.prestosql.plugin.hive.metastore.thrift.ThriftHiveMetastore.getPartitionColumnStatistics(ThriftHiveMetastore.java:438)
at io.prestosql.plugin.hive.metastore.thrift.ThriftHiveMetastore.getPartitionStatistics(ThriftHiveMetastore.java:389)
at io.prestosql.plugin.hive.metastore.thrift.BridgingHiveMetastore.getPartitionStatistics(BridgingHiveMetastore.java:110)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.lambda$loadPartitionColumnStatistics$6(CachingHiveMetastore.java:360)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.loadPartitionColumnStatistics(CachingHiveMetastore.java:353)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.access$100(CachingHiveMetastore.java:89)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore$1.loadAll(CachingHiveMetastore.java:179)
at com.google.common.cache.CacheLoader$1.loadAll(CacheLoader.java:207)
at io.prestosql.cost.JoinStatsRule.doCalculate(JoinStatsRule.java:81)
at io.prestosql.cost.JoinStatsRule.doCalculate(JoinStatsRule.java:48)
at io.prestosql.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
at io.prestosql.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:82)
at io.prestosql.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:70)
at io.prestosql.cost.CachingStatsProvider.getGroupStats(CachingStatsProvider.java:103)
at io.prestosql.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:72)
at io.prestosql.cost.JoinStatsRule.doCalculate(JoinStatsRule.java:81)
at io.prestosql.cost.JoinStatsRule.doCalculate(JoinStatsRule.java:48)
at io.prestosql.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
at io.prestosql.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:82)
at io.prestosql.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:70)
at io.prestosql.cost.CachingStatsProvider.getGroupStats(CachingStatsProvider.java:103)
at io.prestosql.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:72)
at io.prestosql.cost.CostCalculatorWithEstimatedExchanges.calculateJoinExchangeCost(CostCalculatorWithEstimatedExchanges.java:233)
at io.prestosql.cost.CostCalculatorWithEstimatedExchanges.calculateJoinCostWithoutOutput(CostCalculatorWithEstimatedExchanges.java:208)
at io.prestosql.sql.planner.iterative.rule.DetermineJoinDistributionType.getJoinNodeWithCost(DetermineJoinDistributionType.java:180)
at io.prestosql.sql.planner.iterative.rule.DetermineJoinDistributionType.addJoinsWithDifferentDistributions(DetermineJoinDistributionType.java:116)
at io.prestosql.sql.planner.iterative.rule.DetermineJoinDistributionType.getCostBasedJoin(DetermineJoinDistributionType.java:98)
at io.prestosql.sql.planner.iterative.rule.DetermineJoinDistributionType.apply(DetermineJoinDistributionType.java:74)
at io.prestosql.sql.planner.iterative.rule.DetermineJoinDistributionType.apply(DetermineJoinDistributionType.java:49)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:165)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:140)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:190)
at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4058)
at com.google.common.cache.LocalCache.getAll(LocalCache.java:4021)
at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:4972)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.getAll(CachingHiveMetastore.java:255)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.getPartitionStatistics(CachingHiveMetastore.java:330)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.lambda$loadPartitionColumnStatistics$6(CachingHiveMetastore.java:360)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.loadPartitionColumnStatistics(CachingHiveMetastore.java:353)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.access$100(CachingHiveMetastore.java:89)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore$1.loadAll(CachingHiveMetastore.java:179)
at com.google.common.cache.CacheLoader$1.loadAll(CacheLoader.java:207)
at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4058)
at com.google.common.cache.LocalCache.getAll(LocalCache.java:4021)
at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:4972)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.getAll(CachingHiveMetastore.java:255)
at io.prestosql.plugin.hive.metastore.cache.CachingHiveMetastore.getPartitionStatistics(CachingHiveMetastore.java:330)
at io.prestosql.plugin.hive.HiveMetastoreClosure.getPartitionStatistics(HiveMetastoreClosure.java:88)
at io.prestosql.plugin.hive.metastore.SemiTransactionalHiveMetastore.getPartitionStatistics(SemiTransactionalHiveMetastore.java:256)
at io.prestosql.plugin.hive.statistics.MetastoreHiveStatisticsProvider.getPartitionsStatistics(MetastoreHiveStatisticsProvider.java:126)
at io.prestosql.plugin.hive.statistics.MetastoreHiveStatisticsProvider.lambda$new$0(MetastoreHiveStatisticsProvider.java:104)
at io.prestosql.plugin.hive.statistics.MetastoreHiveStatisticsProvider.getTableStatistics(MetastoreHiveStatisticsProvider.java:146)
at io.prestosql.plugin.hive.HiveMetadata.getTableStatistics(HiveMetadata.java:695)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:107)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:190)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:107)
at io.prestosql.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:96)
at io.prestosql.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:196)
at io.prestosql.sql.analyzer.QueryExplainer.getLogicalPlan(QueryExplainer.java:182)
at io.prestosql.sql.analyzer.QueryExplainer.getPlan(QueryExplainer.java:121)
at io.prestosql.sql.rewrite.ExplainRewrite$Visitor.getQueryPlan(ExplainRewrite.java:137)
at io.prestosql.sql.rewrite.ExplainRewrite$Visitor.visitExplain(ExplainRewrite.java:115)
at io.prestosql.sql.rewrite.ExplainRewrite$Visitor.visitExplain(ExplainRewrite.java:65)
at io.prestosql.sql.tree.Explain.accept(Explain.java:80)
at io.prestosql.sql.tree.AstVisitor.process(AstVisitor.java:27)
at io.prestosql.sql.rewrite.ExplainRewrite.rewrite(ExplainRewrite.java:62)
at io.prestosql.sql.rewrite.StatementRewrite.rewrite(StatementRewrite.java:57)
at io.prestosql.sql.analyzer.Analyzer.analyze(Analyzer.java:80)
at io.prestosql.sql.analyzer.Analyzer.analyze(Analyzer.java:75)
at io.prestosql.execution.SqlQueryExecution.analyze(SqlQueryExecution.java:221)
at io.prestosql.execution.SqlQueryExecution.<init>(SqlQueryExecution.java:180)
at io.prestosql.execution.SqlQueryExecution.<init>(SqlQueryExecution.java:97)
at io.prestosql.execution.SqlQueryExecution$SqlQueryExecutionFactory.createQueryExecution(SqlQueryExecution.java:732)
at io.prestosql.dispatcher.LocalDispatchQueryFactory.lambda$createDispatchQuery$0(LocalDispatchQueryFactory.java:119)
at io.prestosql.$gen.Presto_330____20201223_050837_2.call(Unknown Source)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: MetaException(message:null)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_statistics_req_result$get_partitions_statistics_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_statistics_req_result$get_partitions_statistics_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_statistics_req_result.read(ThriftHiveMetastore.java)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_statistics_req(ThriftHiveMetastore.java:4013)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_statistics_req(ThriftHiveMetastore.java:4000)
at io.prestosql.plugin.hive.metastore.thrift.ThriftHiveMetastoreClient.getPartitionColumnStatistics(ThriftHiveMetastoreClient.java:227)
at io.prestosql.plugin.hive.metastore.thrift.FailureAwareThriftMetastoreClient.lambda$getPartitionColumnStatistics$16(FailureAwareThriftMetastoreClient.java:191)
at io.prestosql.plugin.hive.metastore.thrift.FailureAwareThriftMetastoreClient.runWithHandle(FailureAwareThriftMetastoreClient.java:394)
at io.prestosql.plugin.hive.metastore.thrift.FailureAwareThriftMetastoreClient.getPartitionColumnStatistics(FailureAwareThriftMetastoreClient.java:191)
at io.prestosql.plugin.hive.metastore.thrift.ThriftHiveMetastore.lambda$getMetastorePartitionColumnStatistics$15(ThriftHiveMetastore.java:453)
at io.prestosql.plugin.hive.metastore.thrift.ThriftMetastoreApiStats.lambda$wrap$0(ThriftMetastoreApiStats.java:42)
at io.prestosql.plugin.hive.util.RetryDriver.run(RetryDriver.java:130)
at io.prestosql.plugin.hive.metastore.thrift.ThriftHiveMetastore.getMetastorePartitionColumnStatistics(ThriftHiveMetastore.java:451)
... 156 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
Suppressed: MetaException(message:null)
... 170 more
I tried changing the values of hive.metastore.partition-batch-size.max and hive.metastore-cache-ttl, but neither helped.

It seems that in your "slow" deployment the metastore call get_partitions_statistics_req fails for some reason and is retried. The retries likely consume all of the "waiting" time. Since Presto by default ignores stats-calculation failures like this, the query eventually succeeds.
The failure is on the Hive side, so you need to check the metastore logs to understand its cause; it is not propagated to the Presto side.
On the Presto side you can still apply some configuration changes as a workaround (see the sketch after this list):
- disable stats for the Hive connector with the hive.table-statistics-enabled configuration property
- reduce the time spent retrying metastore calls with the hive.metastore.thrift.client.max-retry-time configuration property
- make your queries fail loudly with the global config property optimizer.ignore-stats-calculator-failures=false (unlikely to be what you want)
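For illustration, assuming a standard Presto layout where the Hive catalog is defined in etc/catalog/hive.properties and global settings live in etc/config.properties (the file paths and the 10s retry value are illustrative, not prescriptive):

# etc/catalog/hive.properties
# Option 1: skip table statistics entirely (also disables cost-based join planning)
hive.table-statistics-enabled=false
# Option 2: cap the time spent retrying failing metastore calls
hive.metastore.thrift.client.max-retry-time=10s

# etc/config.properties
# Option 3: surface stats-calculation failures instead of silently ignoring them
optimizer.ignore-stats-calculator-failures=false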

Related

Spring ldaptemplate update group with large membership issue

I have issues updating groups in Active Directory with more than 1500 members; the update only tries to modify the member attribute.
I have no issues updating groups with fewer members, and I can also add a new group with many members.
However, if the group is too large, the update fails. Even if I try to update the large group down to just one member, it still fails with the same error.
The code fails on the modifyAttributes line:
ModificationItem[] modList = nameContext.getDirContextAdapter().getModificationItems();
writeADTemplate.modifyAttributes(nameContext.getName(), modList);
Stack trace below:
org.springframework.ldap.NameAlreadyBoundException: [LDAP: error code 68 - 00000562: UpdErr: DSID-031A122A, problem 6005 (ENTRY_EXISTS), data 0
nested exception is javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00000562: UpdErr: DSID-031A122A, problem 6005 (ENTRY_EXISTS), data 0
remaining name 'cn=Atlassian Users,ou=Groups'
at org.springframework.ldap.support.LdapUtils.convertLdapException(LdapUtils.java:169)
at org.springframework.ldap.core.LdapTemplate.executeWithContext(LdapTemplate.java:810)
at org.springframework.ldap.core.LdapTemplate.executeReadWrite(LdapTemplate.java:802)
at org.springframework.ldap.core.LdapTemplate.modifyAttributes(LdapTemplate.java:967)
more ...
Caused by: javax.naming.NameAlreadyBoundException: [LDAP: error code 68 - 00000562: UpdErr: DSID-031A122A, problem 6005 (ENTRY_EXISTS), data 0
remaining name 'cn=Atlassian Users,ou=Groups'
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(Unknown Source)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(Unknown Source)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(Unknown Source)
at com.sun.jndi.ldap.LdapCtx.c_modifyAttributes(Unknown Source)
at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_modifyAttributes(Unknown Source)
at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.modifyAttributes(Unknown Source)
at javax.naming.directory.InitialDirContext.modifyAttributes(Unknown Source)
at org.springframework.ldap.core.LdapTemplate$19.executeWithContext(LdapTemplate.java:969)
at org.springframework.ldap.core.LdapTemplate.executeWithContext(LdapTemplate.java:807)
... 88 more
OK, my real issue is that Active Directory will not return a multi-valued attribute like member if it has more than 1500 values.
When I was fetching the current group members it returned 0 values, so my code was trying to add all the members back to the group.
Looks like I'll have to figure out how to use DefaultIncrementalAttributesMapper to get all the members.
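For reference, a minimal sketch of that approach, assuming Spring LDAP 2.x and reusing writeADTemplate from the question (the group DN is the one from the stack trace):

import java.util.List;
import javax.naming.Name;
import org.springframework.ldap.core.support.DefaultIncrementalAttributesMapper;
import org.springframework.ldap.support.LdapNameBuilder;

// AD returns 'member' in ranges of at most 1500 values; this helper keeps
// issuing range-tagged lookups (member;range=0-1499, member;range=1500-2999, ...)
// until every value has been retrieved.
Name groupDn = LdapNameBuilder.newInstance("cn=Atlassian Users,ou=Groups").build();
List<Object> allMembers = DefaultIncrementalAttributesMapper
        .lookupAttributeValues(writeADTemplate, groupDn, "member");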

AWS connection timeout when running Spark job on EMR

I'm trying to submit a simple Spark job to an Amazon EMR cluster. My cluster has 5 m4.2xlarge instances (1 master, 4 slaves), each with 16 vCPUs and 32 GiB of memory.
This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.graphx.{Edge, Graph, GraphXUtils, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val sparkConfig = new SparkConf()
    .set("hive.exec.dynamic.partition", "true")
    .set("hive.exec.dynamic.partition.mode", "nonstrict")
    .set("hive.s3.max-client-retries", "50")
    .set("hive.s3.max-error-retries", "50")
    .set("hive.s3.max-connections", "100")
    .set("hive.s3.connect-timeout", "5m")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .set("spark.kryo.classesToRegister", "org.apache.spark.graphx.impl.VertexAttributeBlock")
    .set("spark.broadcast.compress", "true")
  val spark = SparkSession.builder()
    .appName("Spark Hive Example")
    .enableHiveSupport()
    .config(sparkConfig)
    .getOrCreate()
  // Set Kryo for serializing GraphX classes
  GraphXUtils.registerKryoClasses(sparkConfig)
  val res = spark.sql("SELECT col1, col2, col3 FROM table1 limit 10000")
  val edgesRDD = res.rdd.map(row => Edge(row.getString(0).hashCode, row.getString(1).hashCode, row(2).asInstanceOf[String]))
  val res_two = spark.sql("SELECT col1 FROM table2 where col1 is not NULL and col1 != '' limit 100000")
  val vertexRDD: RDD[(VertexId, String)] = res_two.rdd.map(row => (row.getString(0).hashCode, row(0).asInstanceOf[String]))
  val graph = Graph(vertexRDD, edgesRDD)
  val connectedComponents = graph.connectedComponents().vertices
}
Both table1 and table2 are S3-backed external tables in Hive. When I run this program, my job fails with the following error:
Job aborted due to stage failure: Task 827 in stage 0.0 failed 4 times, most recent failure: Lost task 827.3 in stage 0.0 (TID 921, xxx.internal, executor 3): com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1069)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4169)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4116)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1237)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:10)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:82)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:94)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:39)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:211)
at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy35.retrieveMetadata(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:768)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1194)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:355)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:316)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:237)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1204)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:246)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:203)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.get(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1190)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
... 59 more
I'm not sure if it is coming from Hadoop or from reading Hive, but I saw a similar issue reported elsewhere, so I added the following params to my spark-submit command:
--conf "spark.driver.extraJavaOptions=-Djavax.net.ssl.sessionCacheSize=1000 -Djavax.net.ssl.sessionCacheTimeout=60" --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.sessionCacheSize=1000 -Djavax.net.ssl.sessionCacheTimeout=60"
It still doesn't work. Does anyone know what's going on?
TL;DR: The property you need to set is fs.s3.maxConnections in the emrfs-site.xml configuration file. It defaults to 50. We were getting exactly the same error/stack trace as you, so I set it to 5000, which fixed the problem and had no ill effects.
From what I can tell, the root cause is InputFormat implementations that do not properly use try...finally to ensure that connections get closed when an exception is thrown. Notably, older versions of Hive, including v1.2.1 that Spark is compiled against, exhibit this bug. Hive 2.x massively refactors OrcInputFormat, though I haven't verified that the bug is fixed, nor do I know if/when/how you can compile Spark against Hive 2.x.
The workaround increases the size of the connection pool, as suggested in another answer, but both the property and its location are quite different from those of the "classic" S3 filesystems (s3/s3a/s3n). Of course, this isn't documented anywhere and required decompiling the emrfs jar to tease out; a sketch of setting it is below.
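If you create the cluster through the EMR configurations API, the setting can be supplied as a classification (a sketch; the value 5000 is simply what worked for us):

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "5000"
    }
  }
]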
I don't use EMRFS, but I do know the other Spark/Hadoop S3 clients all use a pool of HTTP connections for their requests to S3, and "timeout waiting for pool" messages invariably mean the pool isn't big enough. See if you can find out what the EMRFS options are for increasing that pool size. You will need at least one connection for every worker thread running in your process, and I'd double it in the hope that EMRFS parallelises block uploads the way the s3a client does.

Slave lost and very slow join in Spark

I joined two dataframes on one common column and then ran the show method:
df = df1.join(df2, df1.col1 == df2.col2, 'inner')
df.show()
The join ran very slowly and finally raised an error: slave lost.
Py4JJavaError: An error occurred while calling o109.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 : ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
After some searching, it seems this is a memory-related issue. I increased the repartition count to 3000, increased executor memory, and increased memoryOverhead, but still no luck; I got the same slave-lost error. During df.show(), I found that one of the executors had a very high shuffle write size, while the others did not.
Any clue?
Thanks
If using Scala, try joining on the column name itself, so the result keeps a single, unambiguous join column instead of two:
val df = df1.join(df2, Seq("columnname"))
If using PySpark:
df = df1.join(df2, ["columnname"])
or
df = df1.join(df2, df1.columnname == df2.columnname)
df.show()
If trying to do the same through Spark SQL in PySpark:
df1.createOrReplaceTempView("left_test_table")
df2.createOrReplaceTempView("right_test_table")
df = spark.sql("SELECT l.* FROM left_test_table l JOIN right_test_table r ON l.columnname = r.columnname")
df.show()

org.apache.http.nio.reactor.IOReactorException: I/O dispatch worker terminated abnormally

I have a service which uses Apache HttpAsyncClient (versions: httpasyncclient-4.0.2.jar, httpcore-4.4.3.jar, httpcore-nio-4.3.3.jar).
All requests start failing some time after the async client starts, with the following being the initial exception:
[#|2016-03-16T22:31:59.376-0700|SEVERE|glassfish3.1.2|org.apache.http.impl.nio.client.InternalHttpAsyncClient|_ThreadID=564;_ThreadName=Thread-6;|I/O reactor terminated abnormally
org.apache.http.nio.reactor.IOReactorException: I/O dispatch worker terminated abnormally
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:357)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:189)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.doExecute(CloseableHttpAsyncClientBase.java:67)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.access$000(CloseableHttpAsyncClientBase.java:38)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:57)
at java.lang.Thread.run(Unknown Source)
Caused by: RestException(statusCode=500, code=null, message=I/O operation failed, developerMessage=RestException(statusCode=500, code=null, message=I/O operation failed, developerMessage=null)
at com.notificationservice.analytics.client.AsyncResponse$2.failed(AsyncResponse.java:178)
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134)
at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.failed(DefaultClientExchangeHandlerImpl.java:258)
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.exception(HttpAsyncRequestExecutor.java:127)
at org.apache.http.impl.nio.client.InternalIODispatch.onException(InternalIODispatch.java:68)
at org.apache.http.impl.nio.client.InternalIODispatch.onException(InternalIODispatch.java:37)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.outputReady(AbstractIODispatch.java:154)
at org.apache.http.impl.nio.reactor.BaseIOReactor.writable(BaseIOReactor.java:180)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:342)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:316)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:277)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:586)
at java.lang.Thread.run(Unknown Source)
)
at com.notificationservice.client.AsyncResponse$2.failed(AsyncResponse.java:178)
at org.apache.http.concurrent.BasicFuture.failed(BasicFuture.java:134)
at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.failed(DefaultClientExchangeHandlerImpl.java:258)
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.exception(HttpAsyncRequestExecutor.java:127)
at org.apache.http.impl.nio.client.InternalIODispatch.onException(InternalIODispatch.java:68)
at org.apache.http.impl.nio.client.InternalIODispatch.onException(InternalIODispatch.java:37)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.outputReady(AbstractIODispatch.java:154)
at org.apache.http.impl.nio.reactor.BaseIOReactor.writable(BaseIOReactor.java:180)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:342)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:316)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:277)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:105)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:586)
... 1 more
The same problem happens with newer versions: httpasyncclient-4.1.1.jar, httpcore-4.4.4.jar, httpcore-nio-4.4.4.jar.
Any insight would be highly appreciated. Is there some IOReactorConfig parameter that needs to be changed?
I would say something is wrong with your REST parameters. Status code 500 comes from the server, so your requests are reaching it:
Caused by: RestException(statusCode=500, code=null, message=I/O operation failed, developerMessage=RestException(statusCode=500, code=null, message=I/O operation failed, developerMessage=null

Cosmos Hive error entering and using map reduce

I have a couple of problems executing Hive on a Cosmos FIWARE Lab instance.
First, after logging into the machine, I enter the Hive command line and get the following error (I saw other questions related to this, but I couldn't find a solution):
$ hive
log4j:ERROR Could not instantiate class [org.apache.hadoop.hive.shims.HiveEventCounter].
java.lang.RuntimeException: Could not load shims in class org.apache.hadoop.log.metrics.EventCounter
at org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:123)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:115)
at org.apache.hadoop.hive.shims.ShimLoader.getEventCounter(ShimLoader.java:98)
at org.apache.hadoop.hive.shims.HiveEventCounter.<init>(HiveEventCounter.java:34)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at java.lang.Class.newInstance0(Class.java:357)
at java.lang.Class.newInstance(Class.java:310)
at org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:330)
at org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:121)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:664)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:354)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jDefault(LogUtils.java:127)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:77)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:58)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:641)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.log.metrics.EventCounter
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:171)
at org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:120)
... 27 more
log4j:ERROR Could not instantiate appender named "EventCounter".
Logging initialized using configuration in jar:file:/usr/local/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
However, I'm able to run a query like SELECT * FROM table;
On the other hand, if I try to run a more specific query, such as selecting only one column, a MapReduce job starts and results in the following error:
hive> SELECT table.column FROM table;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201507101501_40071, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201507101501_40071
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -kill job_201507101501_40071
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-29 12:49:45,518 Stage-1 map = 0%, reduce = 0%
2016-01-29 12:50:08,642 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201507101501_40071 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201507101501_40071
Examining task ID: task_201507101501_40071_m_000002 (and more) from job job_201507101501_40071
Task with the most failures(4):
-----
Task ID:
task_201507101501_40071_m_000000
URL:
http://cosmosmaster-gi:50030/taskdetails.jsp?jobid=job_201507101501_40071&tipid=task_201507101501_40071_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:386)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Any help or suggestion is welcome.
Thanks.
The first error is not relevant and does not affect Hive querying, as you have seen.
Regarding the second error, it is most probably because the data stored in HDFS is in JSON format (most probably written by the Cygnus tool), so a JSON serializer/deserializer (SerDe) must be registered. You can do this by executing the following statements in the Hive CLI before running the select:
hive> ADD JAR /usr/local/apache-hive-0.13.0-bin/lib/json-serde-1.3.1-SNAPSHOT-jar-with-dependencies.jar;
hive> SELECT column FROM table;
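If the table itself was not created with a JSON SerDe, it would also need to declare one. A sketch, assuming the OpenX SerDe class bundled in that jar, with an illustrative schema and HDFS location:
hive> CREATE EXTERNAL TABLE example_table (col1 STRING, col2 STRING)
    > ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    > LOCATION '/user/youruser/yourdata';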
