Spark + Parquet + S3n: seems to read the Parquet file many times - performance

I have Parquet files laid out in a Hive-style partitioned structure in an S3 bucket (accessed via s3n). No separate metadata files are created; the Parquet footer is embedded in each file itself.
I ran a sample Spark job in local mode (v1.6.0) that reads a single 5.2 MB file:
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val filePath = "s3n://bucket/trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet"
val path: Path = new Path(filePath) // not used below; kept from the original snippet
val conf = new SparkConf()
  .setMaster("local[2]")
  .set("spark.app.name", "parquet-reader-s3n")
  .set("spark.eventLog.enabled", "true")
val sc = new SparkContext(conf)
val sqlc = new SQLContext(sc)
// Read only the "referenceCode" column from the Parquet file
val df = sqlc.read.parquet(filePath).select("referenceCode")
Thread.sleep(1000 * 10) // intentional pause
println(df.schema)
val output = df.collect()
The log generated is:
..
[22:21:56.505][main][INFO][BlockManagerMaster:58] Registered BlockManager
[22:21:56.909][main][INFO][EventLoggingListener:58] Logging events to file:/tmp/spark-events/local-1463676716372
[22:21:57.307][main][INFO][ParquetRelation:58] Listing s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet on driver
[22:21:59.927][main][INFO][SparkContext:58] Starting job: parquet at InspectInputSplits.scala:30
[22:21:59.942][dag-scheduler-event-loop][INFO][DAGScheduler:58] Got job 0 (parquet at InspectInputSplits.scala:30) with 2 output partitions
[22:21:59.942][dag-scheduler-event-loop][INFO][DAGScheduler:58] Final stage: ResultStage 0 (parquet at InspectInputSplits.scala:30)
[22:21:59.943][dag-scheduler-event-loop][INFO][DAGScheduler:58] Parents of final stage: List()
[22:21:59.944][dag-scheduler-event-loop][INFO][DAGScheduler:58] Missing parents: List()
[22:21:59.954][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at InspectInputSplits.scala:30), which has no missing parents
[22:22:00.218][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_0 stored as values in memory (estimated size 64.5 KB, free 64.5 KB)
[22:22:00.226][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.7 KB, free 86.2 KB)
[22:22:00.229][dispatcher-event-loop-0][INFO][BlockManagerInfo:58] Added broadcast_0_piece0 in memory on localhost:54419 (size: 21.7 KB, free: 1088.2 MB)
[22:22:00.231][dag-scheduler-event-loop][INFO][SparkContext:58] Created broadcast 0 from broadcast at DAGScheduler.scala:1006
[22:22:00.234][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at InspectInputSplits.scala:30)
[22:22:00.235][dag-scheduler-event-loop][INFO][TaskSchedulerImpl:58] Adding task set 0.0 with 2 tasks
[22:22:00.278][dispatcher-event-loop-1][INFO][TaskSetManager:58] Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2076 bytes)
[22:22:00.281][dispatcher-event-loop-1][INFO][TaskSetManager:58] Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2395 bytes)
[22:22:00.290][Executor task launch worker-0][INFO][Executor:58] Running task 0.0 in stage 0.0 (TID 0)
[22:22:00.291][Executor task launch worker-1][INFO][Executor:58] Running task 1.0 in stage 0.0 (TID 1)
[22:22:00.425][Executor task launch worker-1][INFO][ParquetFileReader:151] Initiating action with parallelism: 5
[22:22:00.447][Executor task launch worker-0][INFO][ParquetFileReader:151] Initiating action with parallelism: 5
[22:22:00.463][Executor task launch worker-0][INFO][Executor:58] Finished task 0.0 in stage 0.0 (TID 0). 936 bytes result sent to driver
[22:22:00.471][task-result-getter-0][INFO][TaskSetManager:58] Finished task 0.0 in stage 0.0 (TID 0) in 213 ms on localhost (1/2)
[22:22:00.586][pool-20-thread-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[22:22:25.890][Executor task launch worker-1][INFO][Executor:58] Finished task 1.0 in stage 0.0 (TID 1). 4067 bytes result sent to driver
[22:22:25.898][task-result-getter-1][INFO][TaskSetManager:58] Finished task 1.0 in stage 0.0 (TID 1) in 25617 ms on localhost (2/2)
[22:22:25.898][dag-scheduler-event-loop][INFO][DAGScheduler:58] ResultStage 0 (parquet at InspectInputSplits.scala:30) finished in 25.656 s
[22:22:25.899][task-result-getter-1][INFO][TaskSchedulerImpl:58] Removed TaskSet 0.0, whose tasks have all completed, from pool
[22:22:25.905][main][INFO][DAGScheduler:58] Job 0 finished: parquet at InspectInputSplits.scala:30, took 25.977801 s
StructType(StructField(referenceCode,StringType,true))
[22:22:36.271][main][INFO][DataSourceStrategy:58] Selected 1 partitions out of 1, pruned 0.0% partitions.
[22:22:36.325][main][INFO][MemoryStore:58] Block broadcast_1 stored as values in memory (estimated size 89.3 KB, free 175.5 KB)
[22:22:36.389][main][INFO][MemoryStore:58] Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.2 KB, free 195.7 KB)
[22:22:36.389][dispatcher-event-loop-0][INFO][BlockManagerInfo:58] Added broadcast_1_piece0 in memory on localhost:54419 (size: 20.2 KB, free: 1088.2 MB)
[22:22:36.391][main][INFO][SparkContext:58] Created broadcast 1 from collect at InspectInputSplits.scala:34
[22:22:36.520][main][INFO][deprecation:1174] mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
[22:22:36.522][main][INFO][ParquetRelation:58] Reading Parquet file(s) from s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet
[22:22:36.554][main][INFO][SparkContext:58] Starting job: collect at InspectInputSplits.scala:34
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Got job 1 (collect at InspectInputSplits.scala:34) with 1 output partitions
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Final stage: ResultStage 1 (collect at InspectInputSplits.scala:34)
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Parents of final stage: List()
[22:22:36.557][dag-scheduler-event-loop][INFO][DAGScheduler:58] Missing parents: List()
[22:22:36.557][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting ResultStage 1 (MapPartitionsRDD[4] at collect at InspectInputSplits.scala:34), which has no missing parents
[22:22:36.571][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_2 stored as values in memory (estimated size 7.6 KB, free 203.3 KB)
[22:22:36.575][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.0 KB, free 207.3 KB)
[22:22:36.576][dispatcher-event-loop-1][INFO][BlockManagerInfo:58] Added broadcast_2_piece0 in memory on localhost:54419 (size: 4.0 KB, free: 1088.2 MB)
[22:22:36.577][dag-scheduler-event-loop][INFO][SparkContext:58] Created broadcast 2 from broadcast at DAGScheduler.scala:1006
[22:22:36.577][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at collect at InspectInputSplits.scala:34)
[22:22:36.577][dag-scheduler-event-loop][INFO][TaskSchedulerImpl:58] Adding task set 1.0 with 1 tasks
[22:22:36.585][dispatcher-event-loop-3][INFO][TaskSetManager:58] Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2481 bytes)
[22:22:36.586][Executor task launch worker-1][INFO][Executor:58] Running task 0.0 in stage 1.0 (TID 2)
[22:22:36.605][Executor task launch worker-1][INFO][ParquetRelation$$anonfun$buildInternalScan$1$$anon$1:58] Input split: ParquetInputSplit{part: s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet start: 0 end: 5364897 length: 5364897 hosts: []}
[22:22:38.253][Executor task launch worker-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
[22:23:04.249][Executor task launch worker-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
[22:23:28.337][Executor task launch worker-1][INFO][CodecPool:181] Got brand-new decompressor [.gz]
[22:23:28.400][dispatcher-event-loop-1][INFO][BlockManagerInfo:58] Removed broadcast_0_piece0 on localhost:54419 in memory (size: 21.7 KB, free: 1088.2 MB)
[22:23:28.408][Spark Context Cleaner][INFO][ContextCleaner:58] Cleaned accumulator 1
[22:23:49.993][Executor task launch worker-1][INFO][Executor:58] Finished task 0.0 in stage 1.0 (TID 2). 9376344 bytes result sent to driver
[22:23:50.191][task-result-getter-2][INFO][TaskSetManager:58] Finished task 0.0 in stage 1.0 (TID 2) in 73612 ms on localhost (1/1)
[22:23:50.191][task-result-getter-2][INFO][TaskSchedulerImpl:58] Removed TaskSet 1.0, whose tasks have all completed, from pool
[22:23:50.191][dag-scheduler-event-loop][INFO][DAGScheduler:58] ResultStage 1 (collect at InspectInputSplits.scala:34) finished in 73.612 s
[22:23:50.195][main][INFO][DAGScheduler:58] Job 1 finished: collect at InspectInputSplits.scala:34, took 73.640193 s
The SparkUI snapshot is:
Questions:
In the logs, I can see that the Parquet file is read a total of three times: once by the [pool-21-thread-1] thread (on the driver) and two more times by the [Executor task launch worker-1] thread, which I assume is a worker thread. While debugging, I could see that before the first read, two s3n requests were made specifically for the footer (they carried an HTTP Content-Range header): first to get the size of the footer, then to fetch the footer itself. My question is: when the footer information was already available, why did the [pool-21-thread-1] thread still have to read the entire file? And why did the executor thread make two requests to read the S3 file?
In the Spark UI, it shows that only 670 KB is taken as input. Since I was not convinced this was true, I looked at the network activity, and it seems 20+ MB was received. The attached snapshot shows roughly 5+ MB of data received during the first read and another 15+ MB for the two reads after Thread.sleep(1000*10). I could not reach the debug point for the last two reads by the [pool-21-thread-1] thread due to IDE issues, so I am not sure whether only the selected column ("referenceCode") is being read or the entire file. I understand there is overhead from packets at the TCP/UDP layers, but 20+ MB seems like a lot for just one column.

After debugging into the application, it turned out that s3n still uses the jets3t library, while s3a has a newer implementation based on the AWS SDK (HADOOP-10400).
Hadoop's NativeS3FileSystem implementation does not support seek (partial-content reads) on S3 objects; it downloads the whole file first.
EDIT: This behaviour was not seen on EMR. On EMR, Amazon provides a highly optimized S3 connector, EMRFS, for all S3 schemes, which overrides the connector shipped with Hadoop.
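For comparison, here is a minimal sketch (not from the original post) of reading the same file through the s3a connector, which does support seek/ranged reads. It assumes the hadoop-aws and AWS SDK jars are on the classpath; the credential values are placeholders:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setMaster("local[2]").set("spark.app.name", "parquet-reader-s3a")
val sc = new SparkContext(conf)
// Placeholder credentials; s3a reads these standard Hadoop properties.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
val sqlc = new SQLContext(sc)
// Same file via the s3a scheme: only the footer and the projected column chunks
// should be fetched, rather than the whole object being downloaded.
val df = sqlc.read
  .parquet("s3a://bucket/trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet")
  .select("referenceCode")
println(df.schema)
val output = df.collect()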

Related

SonarQube upgrade to 6.7.1 LTS: Unrecoverable Indexation Failures

I successfully upgraded SonarQube to version 6.5, including the database upgrade, and I am now trying to upgrade to version 6.7.1 LTS. The new SonarQube version is being installed on a 64-bit Linux system and is connected to a Microsoft SQL Server 2014 database. Every time I try to launch the 6.7.1 version of SonarQube it fails with the error "Background initialization failed". If I run the new SonarQube against an empty SQL Server database, it starts up fine with no issues; the "Background initialization failed" error only occurs when I connect the new SonarQube to the upgraded database. I have tried adding memory to the Elasticsearch heap and reducing the number of issues being processed. Any help resolving this issue would be greatly appreciated.
Web log:
web[][o.s.p.ProcessEntryPoint] Starting web
web[][o.a.t.u.n.NioSelectorPool] Using a shared selector for servlet write/read
web[][o.e.p.PluginsService] no modules loaded
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
web[][i.n.c.MultithreadEventLoopGroup] -Dio.netty.eventLoopThreads: 64
web[][i.n.u.i.PlatformDependent0] -Dio.netty.noUnsafe: false
web[][i.n.u.i.PlatformDependent0] Java version: 8
web[][i.n.u.i.PlatformDependent0] sun.misc.Unsafe.theUnsafe: available
web[][i.n.u.i.PlatformDependent0] sun.misc.Unsafe.copyMemory: available
web[][i.n.u.i.PlatformDependent0] java.nio.Buffer.address: available
web[][i.n.u.i.PlatformDependent0] direct buffer constructor: available
web[][i.n.u.i.PlatformDependent0] java.nio.Bits.unaligned: available, true
web[][i.n.u.i.PlatformDependent0] jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
web[][i.n.u.i.PlatformDependent0] java.nio.DirectByteBuffer.<init>(long, int): available
web[][i.n.u.i.PlatformDependent] sun.misc.Unsafe: available
web[][i.n.u.i.PlatformDependent] -Dio.netty.tmpdir: /../../sonarqube-6.7.1/temp (java.io.tmpdir)
web[][i.n.u.i.PlatformDependent] -Dio.netty.bitMode: 64 (sun.arch.data.model)
web[][i.n.u.i.PlatformDependent] -Dio.netty.noPreferDirect: false
web[][i.n.u.i.PlatformDependent] -Dio.netty.maxDirectMemory: 4772593664 bytes
web[][i.n.u.i.PlatformDependent] -Dio.netty.uninitializedArrayAllocationThreshold: -1
web[][i.n.u.i.CleanerJava6] java.nio.ByteBuffer.cleaner(): available
web[][i.n.c.n.NioEventLoop] -Dio.netty.noKeySetOptimization: false
web[][i.n.c.n.NioEventLoop] -Dio.netty.selectorAutoRebuildThreshold: 512
web[][i.n.u.i.PlatformDependent] org.jctools-core.MpscChunkedArrayQueue: available
web[][i.n.c.DefaultChannelId] -Dio.netty.processId: ***** (auto-detected)
web[][i.netty.util.NetUtil] -Djava.net.preferIPv4Stack: true
web[][i.netty.util.NetUtil] -Djava.net.preferIPv6Addresses: false
web[][i.netty.util.NetUtil] Loopback interface: lo (lo, 127.0.0.1)
web[][i.netty.util.NetUtil] /proc/sys/net/core/somaxconn: 128
web[][i.n.c.DefaultChannelId] -Dio.netty.machineId: ***** (auto-detected)
web[][i.n.u.ResourceLeakDetector] -Dio.netty.leakDetection.level: simple
web[][i.n.u.ResourceLeakDetector] -Dio.netty.leakDetection.maxRecords: 4
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.numHeapArenas: 47
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.numDirectArenas: 47
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.pageSize: 8192
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.maxOrder: 11
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.chunkSize: 16777216
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.tinyCacheSize: 512
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.smallCacheSize: 256
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.normalCacheSize: 64
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.maxCachedBufferCapacity: 32768
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.cacheTrimInterval: 8192
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.useCacheForAllThreads: true
web[][i.n.b.ByteBufUtil] -Dio.netty.allocator.type: pooled
web[][i.n.b.ByteBufUtil] -Dio.netty.threadLocalDirectBufferSize: 65536
web[][i.n.b.ByteBufUtil] -Dio.netty.maxThreadLocalCharBufferSize: 16384
web[][i.n.b.AbstractByteBuf] -Dio.netty.buffer.bytebuf.checkAccessible: true
web[][i.n.u.ResourceLeakDetectorFactory] Loaded default ResourceLeakDetector: io.netty.util.ResourceLeakDetector#6c6be5c2
web[][i.n.util.Recycler] -Dio.netty.recycler.maxCapacityPerThread: 32768
web[][i.n.util.Recycler] -Dio.netty.recycler.maxSharedCapacityFactor: 2
web[][i.n.util.Recycler] -Dio.netty.recycler.linkCapacity: 16
web[][i.n.util.Recycler] -Dio.netty.recycler.ratio: 8
web[][o.s.s.e.EsClientProvider] Connected to local Elasticsearch: [127.0.0.1:*****]
web[][o.s.s.p.LogServerVersion] SonarQube Server / 6.7.1.35068 / 426519346f51f7b980a76f9050f983110550509d
web[][o.sonar.db.Database] Create JDBC data source for jdbc:sqlserver:*****
web[][o.s.s.p.ServerFileSystemImpl] SonarQube home: /../../sonarqube-6.7.1
web[][o.s.s.u.SystemPasscodeImpl] System authentication by passcode is disabled
web[][o.s.c.i.DefaultI18n] Loaded 2094 properties from l10n bundles
web[][o.s.s.p.d.m.c.MssqlCharsetHandler] Verify that database collation is case-sensitive and accent-sensitive
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.WebServiceFilter#7e977d45 [pattern=UrlPattern{inclusions=[/api/system/migrate_db/*, ...], exclusions=[/api/properties*, ...]}]
web[][o.s.s.a.TomcatAccessLog] Tomcat is started
web[][o.s.s.a.EmbeddedTomcat] HTTP connector enabled on port ****
web[][o.s.s.p.UpdateCenterClient] Update center:https://update.sonarsource.org/update-center.properties (no proxy)
web[][o.s.a.r.Languages] No language available
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.minAgeInMs=300000
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.loopLimit=10000
web[][o.s.s.s.LogServerId] Server ID: *****
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.delayInMs=300000
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.initialDelayInMs=26327
web[][o.s.s.t.TelemetryDaemon] Sharing of SonarQube statistics is enabled.
web[][o.s.s.n.NotificationDaemon] Notification service started (delay 60 sec.)
web[][o.s.s.s.GeneratePluginIndex] Generate scanner plugin index
web[][o.s.s.s.GeneratePluginIndex] Generate scanner plugin index (done) | time=1ms
web[][o.s.s.s.RegisterPlugins] Register plugins
web[][o.s.s.s.RegisterPlugins] Register plugins (done) | time=167ms
web[][o.s.s.s.RegisterMetrics] Register metrics
web[][o.s.s.s.RegisterMetrics] Register metrics (done) | time=2734ms
web[][o.s.s.r.RegisterRules] Register rules
web[][o.s.s.r.RegisterRules] Register rules (done) | time=685ms
web[][o.s.s.q.BuiltInQProfileRepositoryImpl] Load quality profiles
web[][o.s.s.q.BuiltInQProfileRepositoryImpl] Load quality profiles (done) | time=2ms
web[][o.s.s.s.RegisterPermissionTemplates] Register permission templates
web[][o.s.s.s.RegisterPermissionTemplates] Register permission templates (done) | time=153ms
web[][o.s.s.s.RenameDeprecatedPropertyKeys] Rename deprecated property keys
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.WebServiceFilter#3a6e54b [pattern=UrlPattern{inclusions=[/api/measures/component/*, ...], exclusions=[/api/properties*, ...]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.DeprecatedPropertiesWsFilter#3b2c45f3 [pattern=UrlPattern{inclusions=[/api/properties/*], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.WebServiceReroutingFilter#42ffe60e [pattern=UrlPattern{inclusions=[/api/components/bulk_update_key, ...], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.InitFilter#3bc1cd0f [pattern=UrlPattern{inclusions=[/sessions/init/*], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.OAuth2CallbackFilter#533fe992 [pattern=UrlPattern{inclusions=[/oauth2/callback/*], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.ws.LoginAction#54370dcd [pattern=UrlPattern{inclusions=[/api/authentication/login], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.ws.LogoutAction#7bc801b4 [pattern=UrlPattern{inclusions=[/api/authentication/logout], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.ws.ValidateAction#2e0576fc [pattern=UrlPattern{inclusions=[/api/authentication/validate], exclusions=[]}]
web[][o.s.s.e.IndexerStartupTask] Indexing of type [issues/issue] ...
web[][o.s.s.es.BulkIndexer] 1387134 requests processed (23118 items/sec)
web[][o.s.s.es.BulkIndexer] 2715226 requests processed (22134 items/sec)
web[][o.s.s.es.BulkIndexer] 3944404 requests processed (20486 items/sec)
web[][o.s.s.es.BulkIndexer] 5319447 requests processed (22917 items/sec)
web[][o.s.s.es.BulkIndexer] 6871423 requests processed (25866 items/sec)
web[][o.s.s.es.BulkIndexer] 7814247 requests processed (15713 items/sec)
web[][o.s.s.es.BulkIndexer] 7814247 requests processed (0 items/sec)
web[][o.s.s.es.BulkIndexer] 7814247 requests processed (0 items/sec)
web[][o.s.s.p.Platform] Background initialization failed. Stopping SonarQube
java.lang.IllegalStateException: Unrecoverable indexation failures
at org.sonar.server.es.IndexingListener$1.onFinish(IndexingListener.java:39)
at org.sonar.server.es.BulkIndexer.stop(BulkIndexer.java:117)
at org.sonar.server.issue.index.IssueIndexer.doIndex(IssueIndexer.java:247)
at org.sonar.server.issue.index.IssueIndexer.indexOnStartup(IssueIndexer.java:95)
at org.sonar.server.es.IndexerStartupTask.indexUninitializedTypes(IndexerStartupTask.java:68)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at org.sonar.server.es.IndexerStartupTask.execute(IndexerStartupTask.java:55)
at java.util.Optional.ifPresent(Optional.java:159)
at org.sonar.server.platform.platformlevel.PlatformLevelStartup$1.doPrivileged(PlatformLevelStartup.java:84)
at org.sonar.server.user.DoPrivileged.execute(DoPrivileged.java:45)
at org.sonar.server.platform.platformlevel.PlatformLevelStartup.start(PlatformLevelStartup.java:80)
at org.sonar.server.platform.Platform.executeStartupTasks(Platform.java:196)
at org.sonar.server.platform.Platform.access$400(Platform.java:46)
at org.sonar.server.platform.Platform$1.lambda$doRun$1(Platform.java:121)
at org.sonar.server.platform.Platform$AutoStarterRunnable.runIfNotAborted(Platform.java:371)
at org.sonar.server.platform.Platform$1.doRun(Platform.java:121)
at org.sonar.server.platform.Platform$AutoStarterRunnable.run(Platform.java:355)
at java.lang.Thread.run(Thread.java:748)
web[][o.s.s.p.Platform] Background initialization of SonarQube done
web[][o.s.p.StopWatcher] Stopping process
===========================================================================
Edit: I had already looked at the link provided before my initial post. That post referenced "free space", which I assumed to mean disk space; here are the disk space values where SonarQube 6.7.1 is installed:
1K-blocks    Used      Available  Use%  Mounted on
251531268    16204576  235326692  7%    /prod/appl
Also, here is the portion of my Elasticsearch log around the time the error appears in web.log. SonarQube 6.7.1 uses Elasticsearch 5.
Elasticsearch log:
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][0]] to free up its [29.8mb] indexing buffer
es[][o.e.i.s.IndexShard] add [29.8mb] writing bytes for shard [[issues][0]]
es[][o.e.i.e.Engine] use refresh to write indexing buffer (heap size=[23.5mb]), to also clear version map (heap size=[6.3mb])
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][49] took [462.3micros]
es[][o.e.i.s.IndexShard] remove [29.8mb] writing bytes for shard [[issues][0]]
es[][o.e.i.IndexingMemoryController] now write some indexing buffers: total indexing heap bytes used [104.3mb] vs indices.memory.index_buffer_size [98.9mb], currently writing bytes [0b], [5] shards with non-zero indexing buffer
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][1]] to free up its [54.8mb] indexing buffer
es[][o.e.i.s.IndexShard] add [54.8mb] writing bytes for shard [[issues][1]]
es[][o.e.i.e.Engine] use IndexWriter.flush to write indexing buffer (heap size=[51.1mb]) since version map is small (heap size=[3.6mb])
es[][o.e.i.s.IndexShard] remove [54.8mb] writing bytes for shard [[issues][1]]
es[][o.e.i.IndexingMemoryController] now write some indexing buffers: total indexing heap bytes used [104.2mb] vs indices.memory.index_buffer_size [98.9mb], currently writing bytes [0b], [5] shards with non-zero indexing buffer
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][1]] to free up its [50.7mb] indexing buffer
es[][o.e.i.s.IndexShard] add [50.7mb] writing bytes for shard [[issues][1]]
es[][o.e.i.e.Engine] use IndexWriter.flush to write indexing buffer (heap size=[43.9mb]) since version map is small (heap size=[6.7mb])
es[][o.e.i.s.IndexShard] remove [50.7mb] writing bytes for shard [[issues][1]]
es[][o.e.i.IndexingMemoryController] now write some indexing buffers: total indexing heap bytes used [100.1mb] vs indices.memory.index_buffer_size [98.9mb], currently writing bytes [0b], [5] shards with non-zero indexing buffer
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][1]] to free up its [31.5mb] indexing buffer
es[][o.e.i.s.IndexShard] add [31.5mb] writing bytes for shard [[issues][1]]
es[][o.e.i.e.Engine] use refresh to write indexing buffer (heap size=[23.3mb]), to also clear version map (heap size=[8.2mb])
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][46] took [988.8micros]
es[][o.e.i.s.IndexShard] remove [31.5mb] writing bytes for shard [[issues][1]]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][46] took [880.6micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][57] took [510.7micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][49] took [829.3micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][47] took [412.9micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][43] took [277.4micros]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_kh] done: took [30.9s], [343.7 MB], [3,159,200 docs], [0s stopped], [1.5s throttled], [169.4 MB written], [Infinity MB/sec throttle]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_oc] done: took [28.9s], [290.9 MB], [2,593,116 docs], [0s stopped], [0s throttled], [232.1 MB written], [Infinity MB/sec throttle]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_pz] done: took [30.6s], [341.3 MB], [2,573,716 docs], [0s stopped], [0s throttled], [266.1 MB written], [Infinity MB/sec throttle]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_th] done: took [35.2s], [346.3 MB], [3,102,397 docs], [0s stopped], [0s throttled], [262.0 MB written], [Infinity MB/sec throttle]
es[][o.e.c.s.ClusterService] processing [update-settings]: execute
es[][o.e.i.IndicesQueryCache] using [node] query cache with size [98.9mb] max filter count [10000]
es[][o.e.i.IndicesService] creating Index [[issues/WmTjz_-ITtyPeqpDlqPeFg]], shards [5]/[0] - reason [metadata verification]
es[][o.e.i.s.IndexStore] using index.store.throttle.type [NONE], with index.store.throttle.max_bytes_per_sec [null]
es[][o.e.i.m.MapperService] using dynamic[false]
es[][o.e.i.c.b.BitsetFilterCache] clearing all bitsets because [close]
es[][o.e.i.c.q.IndexQueryCache] full cache clear, reason [close]
es[][o.e.i.c.b.BitsetFilterCache] clearing all bitsets because [close]
es[][o.e.c.s.ClusterService] cluster state updated, version [17], source [update-settings]
es[][o.e.c.s.ClusterService] publishing cluster state version [17]
es[][o.e.c.s.ClusterService] applying cluster state version 17
es[][o.e.c.s.ClusterService] set local cluster state to version 17
es[][o.e.c.s.ClusterService] processing [update-settings]: took [19ms] done applying updated cluster_state (version: 17, uuid: dkhQacKBQGS5YsyMqp1kmQ)
es[][o.e.n.Node] stopping ...

MRAppMaster is running beyond physical memory limits

I am trying to troubleshoot a puzzling issue: MRAppMaster exceeds its allocated container memory and is then killed by the NodeManager, even though its heap size is much smaller than the container size.
NM logs:
2017-12-01 11:18:49,863 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 14191 for container-id container_1506599288376_62101_01_000001: 1.0 GB of 1 GB physical memory used; 3.1 GB of 2.1 GB virtual memory used
2017-12-01 11:18:49,863 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1506599288376_62101_01_000001 has processes older than 1 iteration running over the configured limit. Limit=1073741824, current usage = 1076969472
2017-12-01 11:18:49,863 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=14191,containerID=container_1506599288376_62101_01_000001] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 3.1 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1506599288376_62101_01_000001 :
|- 14279 14191 14191 14191 (java) 4915 235 3167825920 262632 /usr/java/default//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Djava.net.preferIPv4Stack=true -Xmx512m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
|- 14191 14189 14191 14191 (bash) 0 1 108650496 300 /bin/bash -c /usr/java/default//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Djava.net.preferIPv4Stack=true -Xmx512m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001/stdout 2>/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001/stderr
You can observe that while the heap size is set to 512 MB, the physical memory observed by the NM grows to 1 GB.
The application is an Oozie launcher (Hive task), so it has only one mapper, which does almost nothing, and no reducer.
What baffles me is that only this specific instance of MRAppMaster gets killed, and I cannot explain the ~500 MB gap between the max heap size and the physical memory measured by the NM:
Other MRAppMaster instances run fine even with the default configuration (yarn.app.mapreduce.am.resource.mb = 1024 and yarn.app.mapreduce.am.command-opts = -Xmx825955249).
MRAppMaster does not run any application-specific code, so why is only this one having trouble? I would expect MRAppMaster memory consumption to be roughly linear in the number of tasks/attempts, and this app has only one mapper.
-Xmx has been reduced to 512 MB to see whether the issue still happens with ~500 MB of headroom. I expect MRAppMaster to consume very little native memory, so what could those extra 500 MB be?
I will try to work around the issue by increasing yarn.app.mapreduce.am.resource.mb, but I would really like to understand what is going on. Any ideas?
Config: CDH 5.4
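As a rough, illustrative sketch of the workaround mentioned above (in Scala, using the standard MRv2 property names quoted in the question; the values and job name are placeholders, and for an Oozie launcher these would normally be supplied through the launcher/workflow configuration rather than in driver code):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
// Give the AM a larger container while keeping ~512 MB of headroom above the
// heap for native memory, thread stacks, metaspace, etc. Values are examples only.
conf.set("yarn.app.mapreduce.am.resource.mb", "2048")
conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx1536m")
val job = Job.getInstance(conf, "launcher-like-job") // hypothetical job name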

How to tune the Hadoop MapReduce parameters on Amazon EMR?

My MR job ended at map 100% reduce 35% with lots of error messages similar to: "running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container."
My input *.bz2 file is about 4 GB; uncompressed it is about 38 GB. It took about one hour to run this job with one master and two slaves on Amazon EMR.
My questions are:
- Why did this job use so much memory?
- Why did this job take about an hour? Running a 40 GB wordcount job on a small 4-node cluster usually takes about 10 minutes.
- How should I tune the MR parameters to solve this problem?
- Which Amazon EC2 instance types are a good fit for this problem?
Please refer to the following log:
- Physical memory (bytes) snapshot=43327889408 => 43.3GB
- Virtual memory (bytes) snapshot=108950675456 => 108.95GB
- Total committed heap usage (bytes)=34940649472 => 34.94GB
My proposed solutions are as follows, but I'm not sure whether they are correct:
- use a larger Amazon EC2 instance type with at least 8 GB of memory
- tune the MR parameters using the following code
Version 1:
Configuration conf = new Configuration();
// Don't kill the container if physical memory exceeds "mapreduce.map.memory.mb"
// or "mapreduce.reduce.memory.mb". Note: these pmem/vmem checks are NodeManager-side
// settings (yarn-site.xml), so setting them in the job configuration may have no effect.
conf.setBoolean("yarn.nodemanager.pmem-check-enabled", false);
conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false);
// Create the Job after populating conf: Job.getInstance copies the Configuration,
// so conf.set(...) calls made afterwards are not picked up by the job.
Job job = Job.getInstance(conf, "jobtest1");
Version 2:
Configuration conf = new Configuration();
//conf.set("mapreduce.input.fileinputformat.split.minsize", "3073741824");
// 8 GB containers with ~2 GB of headroom above the 6 GB heap.
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx6144m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx6144m");
// As above, create the Job only after the configuration is complete.
Job job = Job.getInstance(conf, "jobtest2");
Log:
15/11/08 11:37:27 INFO mapreduce.Job: map 100% reduce 35%
15/11/08 11:37:27 INFO mapreduce.Job: Task Id : attempt_1446749367313_0006_r_000006_2, Status : FAILED
Container [pid=24745,containerID=container_1446749367313_0006_01_003145] is running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
Dump of the process-tree for container_1446749367313_0006_01_003145 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24745 24743 24745 24745 (bash) 0 0 9658368 291 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145 1>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stdout 2>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stderr
|- 24749 24745 24745 24745 (java) 14124 1281 3910426624 789477 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
15/11/08 11:37:28 INFO mapreduce.Job: map 100% reduce 25%
15/11/08 11:37:30 INFO mapreduce.Job: map 100% reduce 26%
15/11/08 11:37:37 INFO mapreduce.Job: map 100% reduce 27%
15/11/08 11:37:42 INFO mapreduce.Job: map 100% reduce 28%
15/11/08 11:37:53 INFO mapreduce.Job: map 100% reduce 29%
15/11/08 11:37:57 INFO mapreduce.Job: map 100% reduce 34%
15/11/08 11:38:02 INFO mapreduce.Job: map 100% reduce 35%
15/11/08 11:38:13 INFO mapreduce.Job: map 100% reduce 36%
15/11/08 11:38:22 INFO mapreduce.Job: map 100% reduce 37%
15/11/08 11:38:35 INFO mapreduce.Job: map 100% reduce 42%
15/11/08 11:38:36 INFO mapreduce.Job: map 100% reduce 100%
15/11/08 11:38:36 INFO mapreduce.Job: Job job_1446749367313_0006 failed with state FAILED due to: Task failed task_1446749367313_0006_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/11/08 11:38:36 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=11806418671
FILE: Number of bytes written=22240791936
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=16874
HDFS: Number of bytes written=0
HDFS: Number of read operations=59
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=3942336319
S3: Number of bytes written=0
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0
Job Counters
Failed reduce tasks=22
Killed reduce tasks=5
Launched map tasks=59
Launched reduce tasks=27
Data-local map tasks=59
Total time spent by all maps in occupied slots (ms)=114327828
Total time spent by all reduces in occupied slots (ms)=131855700
Total time spent by all map tasks (ms)=19054638
Total time spent by all reduce tasks (ms)=10987975
Total vcore-seconds taken by all map tasks=19054638
Total vcore-seconds taken by all reduce tasks=10987975
Total megabyte-seconds taken by all map tasks=27438678720
Total megabyte-seconds taken by all reduce tasks=31645368000
Map-Reduce Framework
Map input records=728795619
Map output records=728795618
Map output bytes=50859151614
Map output materialized bytes=10506705085
Input split bytes=16874
Combine input records=0
Spilled Records=1457591236
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=150143
CPU time spent (ms)=14360870
Physical memory (bytes) snapshot=43327889408
Virtual memory (bytes) snapshot=108950675456
Total committed heap usage (bytes)=34940649472
File Input Format Counters
Bytes Read=0
I am not sure about the Amazon EMR specifics, so here are a few points to consider regarding MapReduce:
bzip2 is slower than gzip, although it compresses better. bzip2's decompression is faster than its compression, but it is still slower than the other formats. So at a high level you already have this overhead compared with the 40 GB wordcount job that ran in ten minutes (assuming that job's input was not compressed). The next question is: how much slower?
However, your job is still failing after one hour; please confirm this. Only once the job runs successfully can we think about performance, so let's first consider why it is failing.
You were getting a memory error, and based on that error a container failed during the reduce phase (the map phase had completed 100%). Most likely not even one reducer succeeded. Even though the ~35% reduce progress might trick you into thinking that some reducers ran, that percentage can come from the copy/shuffle work that happens before the first reducer actually runs. One way to confirm this is to check whether any reducer output files were generated.
Once you have confirmed that none of the reducers ran, you can increase the container memory as in your Version 2.
Your Version 1 will help you see whether only a specific container is causing the issue, by allowing the job to complete.
Your input size should determine the number of reducers. A common rule of thumb is one reducer per 1 GB of input, unless you compress the mapper output. In this case the ideal number would have been at least 38. Try passing the command-line option -D mapred.reduce.tasks=40 and see if there is any change.
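For illustration only (Scala here, using the same Hadoop MapReduce API as the Java snippets above), the reducer count from the last point can also be set programmatically alongside the Version 2 memory settings; the job name and values are just examples:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val conf = new Configuration()
conf.set("mapreduce.map.memory.mb", "8192")
conf.set("mapreduce.map.java.opts", "-Xmx6144m")
conf.set("mapreduce.reduce.memory.mb", "8192")
conf.set("mapreduce.reduce.java.opts", "-Xmx6144m")
val job = Job.getInstance(conf, "jobtest2")
// Roughly one reducer per GB of uncompressed input (~38 GB here).
job.setNumReduceTasks(40)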

Problems while working with apache-spark that is on EC2 (Master), from a local machine

I am using Apache Spark 1.3.0 and Hadoop 1.0.4.
I have managed to install everything on EC2, and I am running everything from EC2 without any issues; the master and slaves are running as expected.
What I want to do now is run this from a local machine and connect to the master (which is on EC2) by issuing:
./spark-shell --master spark://ec2-blahblah.compute.amazonaws.com:7077 --conf key=/blah/blah.pem --driver-cores 4 --executor-memory 512m
What I am getting (with and without changing the cores and executor memory) is an inability to connect to spark://ec2-blahblah.compute.amazonaws.com.
Also, I am getting the famous: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
What am I doing wrong?
What configurations do I need to set?
How do I secure the connection to "./spark-shell --master spark://ec2-blahblah.compute.amazon...." without using YARN?
EDIT, The errors I get are:
...
After setting the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey
...
scala> val csv = sc.textFile("s3n://LOCATION OF A FILE")
15/03/27 15:25:05 INFO MemoryStore: ensureFreeSpace(35538) called with curMem=0, maxMem=278019440
15/03/27 15:25:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 34.7 KB, free 265.1 MB)
15/03/27 15:25:05 INFO MemoryStore: ensureFreeSpace(5406) called with curMem=35538, maxMem=278019440
15/03/27 15:25:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.3 KB, free 265.1 MB)
15/03/27 15:25:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.188:54529 (size: 5.3 KB, free: 265.1 MB)
15/03/27 15:25:05 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/27 15:25:05 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
csv: org.apache.spark.rdd.RDD[String] = s3n://rtlm-dev/Iris_rtlm.csv MapPartitionsRDD[1] at textFile at <console>:21
scala> val cnt = csv.count
15/03/27 15:25:17 INFO Client: Retrying connect to server: ec2-52-11-115-141.us-west-2.compute.amazonaws.com/52.11.115.141:9000. Already tried 0 time(s).
15/03/27 15:25:26 INFO Client: Retrying connect to server: ec2-52-11-115-141.us-west-2.compute.amazonaws.com/52.11.115.141:9000. Already tried 1 time(s).
—————————————
The second error is (when trying to run the Pi example):
15/03/27 15:29:04 INFO SparkContext: Starting job: reduce at <console>:33
15/03/27 15:29:04 INFO DAGScheduler: Got job 0 (reduce at <console>:33) with 2 output partitions (allowLocal=false)
15/03/27 15:29:04 INFO DAGScheduler: Final stage: Stage 0(reduce at <console>:33)
15/03/27 15:29:04 INFO DAGScheduler: Parents of final stage: List()
15/03/27 15:29:04 INFO DAGScheduler: Missing parents: List()
15/03/27 15:29:04 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[1] at map at <console>:29), which has no missing parents
15/03/27 15:29:04 INFO MemoryStore: ensureFreeSpace(1912) called with curMem=0, maxMem=278019440
15/03/27 15:29:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1912.0 B, free 265.1 MB)
15/03/27 15:29:04 INFO MemoryStore: ensureFreeSpace(1307) called with curMem=1912, maxMem=278019440
15/03/27 15:29:04 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1307.0 B, free 265.1 MB)
15/03/27 15:29:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.188:54583 (size: 1307.0 B, free: 265.1 MB)
15/03/27 15:29:04 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/27 15:29:04 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/03/27 15:29:04 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[1] at map at <console>:29)
15/03/27 15:29:04 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/03/27 15:29:19 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Did you add a security rule that allows access to port 7077 from your machine?
Have you tried explicitly binding to the public EC2 name?
In spark-env.sh:
export SPARK_PUBLIC_DNS=ec2-blahblah.compute.amazonaws.com
export STANDALONE_SPARK_MASTER_HOST=ec2-blahblah.compute.amazonaws.com
Additionally, try setting SPARK_MASTER_IP.
I got this error as well. Besides the suggestions above, I found that the security rule for the EC2 instances should ideally open all ports: according to the documentation, the master/worker communication picks a random port number and binds to it.
My solution was to use a subnet with private IPs and open all ports within that subnet; maybe you can give that a try.
Also, if you have a cluster with multiple EC2 instances, i.e. one instance dedicated as the master and a couple of others as workers, then the "standalone" cluster mode may not work; try changing it.
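A hedged sketch of pinning the driver-side endpoints so they can be whitelisted in the EC2 security group and local firewall (hostname, ports, and app name are placeholders, and it assumes Spark 1.3 honors spark.blockManager.port; the workers must be able to connect back to the driver on these ports):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://ec2-blahblah.compute.amazonaws.com:7077")
  .setAppName("remote-driver-test")
  // Address the EC2 workers can use to reach the local driver (placeholder).
  .set("spark.driver.host", "my.public.hostname.example.com")
  // Fix the ports instead of letting Spark pick random ones, so they can be opened.
  .set("spark.driver.port", "51000")
  .set("spark.blockManager.port", "51001")
val sc = new SparkContext(conf)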

Hadoop YARN reducer/shuffle stuck

I was migrating from Hadoop 1 to Hadoop 2 (YARN). The source code was recompiled against the MRv2 jars and had no compatibility issues. When I tried to run the job under YARN, the map phase worked fine and went to 100%, but the reduce phase got stuck at ~6-7%. There was no performance problem as such: when I checked CPU usage while the reduce was stuck, there seemed to be no computation going on at all, with the CPUs mostly 100% idle. The same job runs successfully on Hadoop 1.2.1.
I checked the log messages from the ResourceManager and found that after the maps finished, no more containers were allocated, so no reducer was running in any container. What causes this situation?
I wonder whether it is related to the yarn.nodemanager.aux-services property. According to the official tutorial (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html), this property has to be set to mapreduce_shuffle, which indicates that MR will use the default shuffle implementation rather than another shuffle plugin (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html). I tried leaving this property unset, but Hadoop wouldn't let me.
Here is the log from userlogs/applicationfolder/containerfolder/syslog when the reduce is about to reach 7%. After that the log stops updating and the reduce stops as well.
2014-11-26 09:01:04,104 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle output of map attempt_1416988910568_0001_m_002988_0 decomp: 129587 len: 129591 to MEMORY
2014-11-26 09:01:04,104 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 129587 bytes from map-output for attempt_1416988910568_0001_m_002988_0
2014-11-26 09:01:04,104 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 129587, inMemoryMapOutputs.size() -> 2993, commitMemory -> 342319024, usedMemory ->342448611
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle output of map attempt_1416988910568_0001_m_002989_0 decomp: 128525 len: 128529 to MEMORY
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 128525 bytes from map-output for attempt_1416988910568_0001_m_002989_0
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 128525, inMemoryMapOutputs.size() -> 2994, commitMemory -> 342448611, usedMemory ->342577136
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: datanode03:13562 freed by fetcher#1 in 13ms
Is this a common issue when migrating from Hadoop 1 to 2? Has the map-shuffle-sort-reduce strategy changed in Hadoop 2? What caused this problem? Thanks so much; any comments will help!
Major environment setup:
Hadoop version: 2.5.2
6-node cluster with 8-core CPU, 15 GB memory on each node
Related properties settings:
yarn.scheduler.maximum-allocation-mb: 14336
yarn.scheduler.minimum-allocation-mb: 2500
yarn.nodemanager.resource.memory-mb: 14336
yarn.nodemanager.aux-services: mapreduce_shuffle
mapreduce.task.io.sort.factor: 100
mapreduce.task.io.sort.mb: 1024
Finally solved the problem after googling around, and I realized I had posted this question three months ago already.
It was caused by data skew.
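Not part of the original answer, but a quick way to check for key skew is to count records per map-output key on a sample of the input. A minimal Scala sketch, assuming a tab-separated sample file named sample.tsv with the key in the first field:
import scala.io.Source

// Count occurrences of each key and print the ten heaviest ones.
val counts = Source.fromFile("sample.tsv").getLines()
  .map(_.split("\t", 2)(0))
  .foldLeft(Map.empty[String, Long]) { (m, k) =>
    m.updated(k, m.getOrElse(k, 0L) + 1L)
  }
counts.toSeq.sortBy(-_._2).take(10).foreach { case (k, n) => println(s"$k\t$n") }
If one key accounts for most of the records, a single reducer ends up doing nearly all of the work.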
