SonarQube upgrade to 6.7.1 LTS: Unrecoverable Indexation Failures - elasticsearch

I have successfully upgraded SonarQube to version 6.5, including the database upgrade, and I am now trying to upgrade to version 6.7.1 LTS. The new SonarQube version is installed on a 64-bit Linux system and is connected to a Microsoft SQL Server 2014 database. Every time I try to launch the 6.7.1 version of SonarQube, it fails with the error "Background initialization failed". If I run the new SonarQube against an empty Microsoft SQL database, it starts up fine with no issues; the "Background initialization failed" error only occurs when I connect the new SonarQube to the upgraded database. I have tried increasing the Elasticsearch heap and reducing the number of issues being processed. Any help resolving this issue would be greatly appreciated.
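For reference, "increasing the Elasticsearch heap" was done through conf/sonar.properties; a minimal sketch of that kind of change (the heap value here is illustrative, not necessarily the one I used):
# conf/sonar.properties -- heap of the embedded Elasticsearch (search) process
sonar.search.javaOpts=-Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError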
Web log:
web[][o.s.p.ProcessEntryPoint] Starting web
web[][o.a.t.u.n.NioSelectorPool] Using a shared selector for servlet write/read
web[][o.e.p.PluginsService] no modules loaded
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
web[][o.e.p.PluginsService] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
web[][i.n.c.MultithreadEventLoopGroup] -Dio.netty.eventLoopThreads: 64
web[][i.n.u.i.PlatformDependent0] -Dio.netty.noUnsafe: false
web[][i.n.u.i.PlatformDependent0] Java version: 8
web[][i.n.u.i.PlatformDependent0] sun.misc.Unsafe.theUnsafe: available
web[][i.n.u.i.PlatformDependent0] sun.misc.Unsafe.copyMemory: available
web[][i.n.u.i.PlatformDependent0] java.nio.Buffer.address: available
web[][i.n.u.i.PlatformDependent0] direct buffer constructor: available
web[][i.n.u.i.PlatformDependent0] java.nio.Bits.unaligned: available, true
web[][i.n.u.i.PlatformDependent0] jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
web[][i.n.u.i.PlatformDependent0] java.nio.DirectByteBuffer.<init>(long, int): available
web[][i.n.u.i.PlatformDependent] sun.misc.Unsafe: available
web[][i.n.u.i.PlatformDependent] -Dio.netty.tmpdir: /../../sonarqube-6.7.1/temp (java.io.tmpdir)
web[][i.n.u.i.PlatformDependent] -Dio.netty.bitMode: 64 (sun.arch.data.model)
web[][i.n.u.i.PlatformDependent] -Dio.netty.noPreferDirect: false
web[][i.n.u.i.PlatformDependent] -Dio.netty.maxDirectMemory: 4772593664 bytes
web[][i.n.u.i.PlatformDependent] -Dio.netty.uninitializedArrayAllocationThreshold: -1
web[][i.n.u.i.CleanerJava6] java.nio.ByteBuffer.cleaner(): available
web[][i.n.c.n.NioEventLoop] -Dio.netty.noKeySetOptimization: false
web[][i.n.c.n.NioEventLoop] -Dio.netty.selectorAutoRebuildThreshold: 512
web[][i.n.u.i.PlatformDependent] org.jctools-core.MpscChunkedArrayQueue: available
web[][i.n.c.DefaultChannelId] -Dio.netty.processId: ***** (auto-detected)
web[][i.netty.util.NetUtil] -Djava.net.preferIPv4Stack: true
web[][i.netty.util.NetUtil] -Djava.net.preferIPv6Addresses: false
web[][i.netty.util.NetUtil] Loopback interface: lo (lo, 127.0.0.1)
web[][i.netty.util.NetUtil] /proc/sys/net/core/somaxconn: 128
web[][i.n.c.DefaultChannelId] -Dio.netty.machineId: ***** (auto-detected)
web[][i.n.u.ResourceLeakDetector] -Dio.netty.leakDetection.level: simple
web[][i.n.u.ResourceLeakDetector] -Dio.netty.leakDetection.maxRecords: 4
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.numHeapArenas: 47
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.numDirectArenas: 47
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.pageSize: 8192
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.maxOrder: 11
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.chunkSize: 16777216
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.tinyCacheSize: 512
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.smallCacheSize: 256
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.normalCacheSize: 64
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.maxCachedBufferCapacity: 32768
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.cacheTrimInterval: 8192
web[][i.n.b.PooledByteBufAllocator] -Dio.netty.allocator.useCacheForAllThreads: true
web[][i.n.b.ByteBufUtil] -Dio.netty.allocator.type: pooled
web[][i.n.b.ByteBufUtil] -Dio.netty.threadLocalDirectBufferSize: 65536
web[][i.n.b.ByteBufUtil] -Dio.netty.maxThreadLocalCharBufferSize: 16384
web[][i.n.b.AbstractByteBuf] -Dio.netty.buffer.bytebuf.checkAccessible: true
web[][i.n.u.ResourceLeakDetectorFactory] Loaded default ResourceLeakDetector: io.netty.util.ResourceLeakDetector#6c6be5c2
web[][i.n.util.Recycler] -Dio.netty.recycler.maxCapacityPerThread: 32768
web[][i.n.util.Recycler] -Dio.netty.recycler.maxSharedCapacityFactor: 2
web[][i.n.util.Recycler] -Dio.netty.recycler.linkCapacity: 16
web[][i.n.util.Recycler] -Dio.netty.recycler.ratio: 8
web[][o.s.s.e.EsClientProvider] Connected to local Elasticsearch: [127.0.0.1:*****]
web[][o.s.s.p.LogServerVersion] SonarQube Server / 6.7.1.35068 / 426519346f51f7b980a76f9050f983110550509d
web[][o.sonar.db.Database] Create JDBC data source for jdbc:sqlserver:*****
web[][o.s.s.p.ServerFileSystemImpl] SonarQube home: /../../sonarqube-6.7.1
web[][o.s.s.u.SystemPasscodeImpl] System authentication by passcode is disabled
web[][o.s.c.i.DefaultI18n] Loaded 2094 properties from l10n bundles
web[][o.s.s.p.d.m.c.MssqlCharsetHandler] Verify that database collation is case-sensitive and accent-sensitive
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.WebServiceFilter#7e977d45 [pattern=UrlPattern{inclusions=[/api/system/migrate_db/*, ...], exclusions=[/api/properties*, ...]}]
web[][o.s.s.a.TomcatAccessLog] Tomcat is started
web[][o.s.s.a.EmbeddedTomcat] HTTP connector enabled on port ****
web[][o.s.s.p.UpdateCenterClient] Update center:https://update.sonarsource.org/update-center.properties (no proxy)
web[][o.s.a.r.Languages] No language available
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.minAgeInMs=300000
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.loopLimit=10000
web[][o.s.s.s.LogServerId] Server ID: *****
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.delayInMs=300000
web[][o.s.s.e.RecoveryIndexer] Elasticsearch recovery - sonar.search.recovery.initialDelayInMs=26327
web[][o.s.s.t.TelemetryDaemon] Sharing of SonarQube statistics is enabled.
web[][o.s.s.n.NotificationDaemon] Notification service started (delay 60 sec.)
web[][o.s.s.s.GeneratePluginIndex] Generate scanner plugin index
web[][o.s.s.s.GeneratePluginIndex] Generate scanner plugin index (done) | time=1ms
web[][o.s.s.s.RegisterPlugins] Register plugins
web[][o.s.s.s.RegisterPlugins] Register plugins (done) | time=167ms
web[][o.s.s.s.RegisterMetrics] Register metrics
web[][o.s.s.s.RegisterMetrics] Register metrics (done) | time=2734ms
web[][o.s.s.r.RegisterRules] Register rules
web[][o.s.s.r.RegisterRules] Register rules (done) | time=685ms
web[][o.s.s.q.BuiltInQProfileRepositoryImpl] Load quality profiles
web[][o.s.s.q.BuiltInQProfileRepositoryImpl] Load quality profiles (done) | time=2ms
web[][o.s.s.s.RegisterPermissionTemplates] Register permission templates
web[][o.s.s.s.RegisterPermissionTemplates] Register permission templates (done) | time=153ms
web[][o.s.s.s.RenameDeprecatedPropertyKeys] Rename deprecated property keys
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.WebServiceFilter#3a6e54b [pattern=UrlPattern{inclusions=[/api/measures/component/*, ...], exclusions=[/api/properties*, ...]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.DeprecatedPropertiesWsFilter#3b2c45f3 [pattern=UrlPattern{inclusions=[/api/properties/*], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.ws.WebServiceReroutingFilter#42ffe60e [pattern=UrlPattern{inclusions=[/api/components/bulk_update_key, ...], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.InitFilter#3bc1cd0f [pattern=UrlPattern{inclusions=[/sessions/init/*], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.OAuth2CallbackFilter#533fe992 [pattern=UrlPattern{inclusions=[/oauth2/callback/*], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.ws.LoginAction#54370dcd [pattern=UrlPattern{inclusions=[/api/authentication/login], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.ws.LogoutAction#7bc801b4 [pattern=UrlPattern{inclusions=[/api/authentication/logout], exclusions=[]}]
web[][o.s.s.p.w.MasterServletFilter] Initializing servlet filter org.sonar.server.authentication.ws.ValidateAction#2e0576fc [pattern=UrlPattern{inclusions=[/api/authentication/validate], exclusions=[]}]
web[][o.s.s.e.IndexerStartupTask] Indexing of type [issues/issue] ...
web[][o.s.s.es.BulkIndexer] 1387134 requests processed (23118 items/sec)
web[][o.s.s.es.BulkIndexer] 2715226 requests processed (22134 items/sec)
web[][o.s.s.es.BulkIndexer] 3944404 requests processed (20486 items/sec)
web[][o.s.s.es.BulkIndexer] 5319447 requests processed (22917 items/sec)
web[][o.s.s.es.BulkIndexer] 6871423 requests processed (25866 items/sec)
web[][o.s.s.es.BulkIndexer] 7814247 requests processed (15713 items/sec)
web[][o.s.s.es.BulkIndexer] 7814247 requests processed (0 items/sec)
web[][o.s.s.es.BulkIndexer] 7814247 requests processed (0 items/sec)
web[][o.s.s.p.Platform] Background initialization failed. Stopping SonarQube
java.lang.IllegalStateException: Unrecoverable indexation failures
at org.sonar.server.es.IndexingListener$1.onFinish(IndexingListener.java:39)
at org.sonar.server.es.BulkIndexer.stop(BulkIndexer.java:117)
at org.sonar.server.issue.index.IssueIndexer.doIndex(IssueIndexer.java:247)
at org.sonar.server.issue.index.IssueIndexer.indexOnStartup(IssueIndexer.java:95)
at org.sonar.server.es.IndexerStartupTask.indexUninitializedTypes(IndexerStartupTask.java:68)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at org.sonar.server.es.IndexerStartupTask.execute(IndexerStartupTask.java:55)
at java.util.Optional.ifPresent(Optional.java:159)
at org.sonar.server.platform.platformlevel.PlatformLevelStartup$1.doPrivileged(PlatformLevelStartup.java:84)
at org.sonar.server.user.DoPrivileged.execute(DoPrivileged.java:45)
at org.sonar.server.platform.platformlevel.PlatformLevelStartup.start(PlatformLevelStartup.java:80)
at org.sonar.server.platform.Platform.executeStartupTasks(Platform.java:196)
at org.sonar.server.platform.Platform.access$400(Platform.java:46)
at org.sonar.server.platform.Platform$1.lambda$doRun$1(Platform.java:121)
at org.sonar.server.platform.Platform$AutoStarterRunnable.runIfNotAborted(Platform.java:371)
at org.sonar.server.platform.Platform$1.doRun(Platform.java:121)
at org.sonar.server.platform.Platform$AutoStarterRunnable.run(Platform.java:355)
at java.lang.Thread.run(Thread.java:748)
web[][o.s.s.p.Platform] Background initialization of SonarQube done
web[][o.s.p.StopWatcher] Stopping process
===========================================================================
Edit: I had already looked at the link provided, prior to my initial post. That post referenced "free space", which I assumed to mean disk space; here are the disk space values for the filesystem where SonarQube 6.7.1 is installed:
1K-blocks      Used Available Use% Mounted on
251531268  16204576 235326692   7% /prod/appl
Also, here is the portion of my Elasticsearch log from around the time the error appears in web.log. SonarQube 6.7.1 uses Elasticsearch 5.
Elasticsearch log:
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][0]] to free up its [29.8mb] indexing buffer
es[][o.e.i.s.IndexShard] add [29.8mb] writing bytes for shard [[issues][0]]
es[][o.e.i.e.Engine] use refresh to write indexing buffer (heap size=[23.5mb]), to also clear version map (heap size=[6.3mb])
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][49] took [462.3micros]
es[][o.e.i.s.IndexShard] remove [29.8mb] writing bytes for shard [[issues][0]]
es[][o.e.i.IndexingMemoryController] now write some indexing buffers: total indexing heap bytes used [104.3mb] vs indices.memory.index_buffer_size [98.9mb], currently writing bytes [0b], [5] shards with non-zero indexing buffer
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][1]] to free up its [54.8mb] indexing buffer
es[][o.e.i.s.IndexShard] add [54.8mb] writing bytes for shard [[issues][1]]
es[][o.e.i.e.Engine] use IndexWriter.flush to write indexing buffer (heap size=[51.1mb]) since version map is small (heap size=[3.6mb])
es[][o.e.i.s.IndexShard] remove [54.8mb] writing bytes for shard [[issues][1]]
es[][o.e.i.IndexingMemoryController] now write some indexing buffers: total indexing heap bytes used [104.2mb] vs indices.memory.index_buffer_size [98.9mb], currently writing bytes [0b], [5] shards with non-zero indexing buffer
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][1]] to free up its [50.7mb] indexing buffer
es[][o.e.i.s.IndexShard] add [50.7mb] writing bytes for shard [[issues][1]]
es[][o.e.i.e.Engine] use IndexWriter.flush to write indexing buffer (heap size=[43.9mb]) since version map is small (heap size=[6.7mb])
es[][o.e.i.s.IndexShard] remove [50.7mb] writing bytes for shard [[issues][1]]
es[][o.e.i.IndexingMemoryController] now write some indexing buffers: total indexing heap bytes used [100.1mb] vs indices.memory.index_buffer_size [98.9mb], currently writing bytes [0b], [5] shards with non-zero indexing buffer
es[][o.e.i.IndexingMemoryController] write indexing buffer to disk for shard [[issues][1]] to free up its [31.5mb] indexing buffer
es[][o.e.i.s.IndexShard] add [31.5mb] writing bytes for shard [[issues][1]]
es[][o.e.i.e.Engine] use refresh to write indexing buffer (heap size=[23.3mb]), to also clear version map (heap size=[8.2mb])
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][46] took [988.8micros]
es[][o.e.i.s.IndexShard] remove [31.5mb] writing bytes for shard [[issues][1]]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][46] took [880.6micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][57] took [510.7micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][49] took [829.3micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][47] took [412.9micros]
es[][o.e.i.f.p.SortedSetDVOrdinalsIndexFieldData] global-ordinals [_parent#authorization][43] took [277.4micros]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_kh] done: took [30.9s], [343.7 MB], [3,159,200 docs], [0s stopped], [1.5s throttled], [169.4 MB written], [Infinity MB/sec throttle]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_oc] done: took [28.9s], [290.9 MB], [2,593,116 docs], [0s stopped], [0s throttled], [232.1 MB written], [Infinity MB/sec throttle]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_pz] done: took [30.6s], [341.3 MB], [2,573,716 docs], [0s stopped], [0s throttled], [266.1 MB written], [Infinity MB/sec throttle]
es[][o.e.i.e.InternalEngine$EngineMergeScheduler] merge segment [_th] done: took [35.2s], [346.3 MB], [3,102,397 docs], [0s stopped], [0s throttled], [262.0 MB written], [Infinity MB/sec throttle]
es[][o.e.c.s.ClusterService] processing [update-settings]: execute
es[][o.e.i.IndicesQueryCache] using [node] query cache with size [98.9mb] max filter count [10000]
es[][o.e.i.IndicesService] creating Index [[issues/WmTjz_-ITtyPeqpDlqPeFg]], shards [5]/[0] - reason [metadata verification]
es[][o.e.i.s.IndexStore] using index.store.throttle.type [NONE], with index.store.throttle.max_bytes_per_sec [null]
es[][o.e.i.m.MapperService] using dynamic[false]
es[][o.e.i.c.b.BitsetFilterCache] clearing all bitsets because [close]
es[][o.e.i.c.q.IndexQueryCache] full cache clear, reason [close]
es[][o.e.i.c.b.BitsetFilterCache] clearing all bitsets because [close]
es[][o.e.c.s.ClusterService] cluster state updated, version [17], source [update-settings]
es[][o.e.c.s.ClusterService] publishing cluster state version [17]
es[][o.e.c.s.ClusterService] applying cluster state version 17
es[][o.e.c.s.ClusterService] set local cluster state to version 17
es[][o.e.c.s.ClusterService] processing [update-settings]: took [19ms] done applying updated cluster_state (version: 17, uuid: dkhQacKBQGS5YsyMqp1kmQ)
es[][o.e.n.Node] stopping ...

Related

Container is running beyond physical memory limits

I have a MapReduce job that processes 1.4 TB of data.
While running it, I get the error below.
The number of splits is 6444.
Before starting the job I set the following:
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.map.java.opts.max.heap", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx8192m");
conf.set("mapreduce.reduce.java.opts", "-Xmx8192m");
conf.set("mapreduce.job.heap.memory-mb.ratio", "0.8");
conf.set("mapreduce.task.timeout", "21600000");
The error:
2018-05-18 00:50:36,595 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1524473936587_2969_m_004719_3: Container [pid=11510,containerID=container_1524473936587_2969_01_004894] is running beyond physical memory limits. Current usage: 8.1 GB of 8 GB physical memory used; 8.8 GB of 16.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1524473936587_2969_01_004894 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 11560 11510 11510 11510 (java) 14960 2833 9460879360 2133706 /usr/lib/jvm/java-7-oracle-cloudera/bin/java
-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/sdk/7/yarn/nm/usercache/administrator/appcache/application_1524473936587_2969/container_1524473936587_2969_01_004894/tmp
-Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.106.79.75 41869 attempt_1524473936587_2969_m_004719_3 4894
|- 11510 11508 11510 11510 (bash) 0 0 11497472 679 /bin/bash -c /usr/lib/jvm/java-7-oracle-cloudera/bin/java
-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8192m -Djava.io.tmpdir=/sdk/7/yarn/nm/usercache/administrator/appcache/application_1524473936587_2969/container_1524473936587_2969_01_004894/tmp
-Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894
-Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 10.106.79.75 41869 attempt_1524473936587_2969_m_004719_3 4894 1>/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894/stdout 2>/var/log/hadoop-yarn/container/application_1524473936587_2969/container_1524473936587_2969_01_004894/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Any help would be really appreciated!
The setting mapreduce.map.memory.mb sets the physical memory size of the container running the mapper (mapreduce.reduce.memory.mb does the same for the reducer container).
Be sure that you adjust the heap value as well. In newer versions of YARN/MRv2 the setting mapreduce.job.heap.memory-mb.ratio can be used to have it auto-adjust. The default is 0.8, so 80% of whatever the container size is will be allocated as the heap. Otherwise, adjust it manually using the mapreduce.map.java.opts.max.heap and mapreduce.reduce.java.opts.max.heap settings.
BTW, I believe that 1 GB is the default and it is quite low. I recommend reading the link below. It provides a good understanding of YARN and MR memory settings, how they relate, and how to set some baseline values based on the cluster node size (disk, memory, and cores).
Reference: http://community.cloudera.com/t5/Cloudera-Manager-Installation/ERROR-is-running-beyond-physical-memory-limits/td-p/55173
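As a rough sketch (my own illustration, values are examples rather than recommendations), keeping the heap at roughly 80% of the container size with the standard mapred-site.xml properties would look like this:
<!-- container size for map/reduce tasks -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>8192</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
<!-- JVM heap kept at ~80% of the container so the process fits inside it -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx6553m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6553m</value>
</property>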
Try setting the YARN memory allocation limits:
SET yarn.scheduler.maximum-allocation-mb=16G;
SET yarn.scheduler.minimum-allocation-mb=8G;
You can look up other YARN settings here:
https://www.ibm.com/support/knowledgecenter/STXKQY_BDA_SHR/bl1bda_tuneyarn.htm
Try with: set yarn.app.mapreduce.am.resource.mb=1000;
The explanation is here:
In Spark, spark.driver.memoryOverhead is considered when calculating the total memory required for the driver. By default it is 0.10 of the driver memory, with a minimum of 384 MB. In your case that works out to 8 GB + (8 GB * 0.1) = 9011 MB ≈ 9G.
YARN allocates memory only in increments/multiples of yarn.scheduler.minimum-allocation-mb.
When yarn.scheduler.minimum-allocation-mb=4G, it can only allocate container sizes of 4G, 8G, 12G, etc. So if something like 9G is requested, it rounds up to the next multiple and allocates a 12G container for the driver.
When yarn.scheduler.minimum-allocation-mb=1G, container sizes of 8G, 9G, 10G, etc. are possible, and the nearest rounded-up size of 9G is used in this case.
https://community.cloudera.com/t5/Support-Questions/Yarn-Container-is-running-beyond-physical-memory-limits-but/m-p/199353#M161393
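To make the rounding concrete, here is a small sketch of my own (the helper name is made up, it is not a YARN API) showing how a request gets rounded up to a multiple of yarn.scheduler.minimum-allocation-mb:
// hypothetical helper illustrating YARN's rounding behaviour
def roundUpToAllocationMultiple(requestedMb: Int, minAllocationMb: Int): Int =
  math.ceil(requestedMb.toDouble / minAllocationMb).toInt * minAllocationMb

// roundUpToAllocationMultiple(9011, 4096) == 12288  -> a 12G container
// roundUpToAllocationMultiple(9011, 1024) ==  9216  -> roughly the requested 9G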

Mac brew arangodb delaying start log file path

I have installed ArangoDB through brew. I am new to both Mac and ArangoDB. Right after installing ArangoDB I could start and stop it through brew services, but since yesterday that hasn't worked. However, arangod start worked. Today it's taking a really long time for the service to start up:
$ arangod start
2018-04-30T07:40:32Z [3593] INFO ArangoDB 3.3.7 [darwin] 64bit, using jemalloc, build , VPack 0.1.30, RocksDB 5.6.0, ICU 58.1, V8 5.7.492.77, OpenSSL 1.0.2o 27 Mar 2018
2018-04-30T07:40:32Z [3593] INFO {authentication} Jwt secret not specified, generating...
2018-04-30T07:40:32Z [3593] INFO using storage engine mmfiles
2018-04-30T07:40:32Z [3593] INFO {cluster} Starting up with role SINGLE
2018-04-30T07:40:32Z [3593] INFO {syscall} file-descriptors (nofiles) hard limit is unlimited, soft limit is 8192
2018-04-30T07:40:32Z [3593] INFO {authentication} Authentication is turned on (system only), authentication for unix sockets is turned on
2018-04-30T07:40:32Z [3593] INFO running WAL recovery (1 logfiles)
2018-04-30T07:40:32Z [3593] INFO replaying WAL logfile '/Users/neel/start/journals/logfile-17009.db' (1 of 1)
2018-04-30T07:40:32Z [3593] INFO WAL recovery finished successfully
2018-04-30T07:40:33Z [3593] INFO using endpoint 'http+tcp://127.0.0.1:8529' for non-encrypted requests
2018-04-30T07:41:33Z [3593] WARNING {v8} giving up waiting for unused V8 context after 60.000000 s
2018-04-30T07:41:43Z [3593] WARNING {v8} giving up waiting for unused V8 context after 60.000000 s
2018-04-30T07:42:34Z [3593] WARNING {v8} giving up waiting for unused V8 context after 60.000000 s
2018-04-30T07:43:05Z [3593] INFO ArangoDB (version 3.3.7 [darwin]) is ready for business. Have fun!
I don't know where the log files are. So when I try to start with brew services start arangodb I can't check whether it has started or not, as it immediately responds Successfully started arangodb (label: homebrew.mxcl.arangodb). So my questions are: why is it delayed, and where are the log files?
The log files are located here: /usr/local/var/log/arangodb3
The delay above is caused by a lack of available V8 contexts. You can adjust the number of contexts in /usr/local/etc/arangodb3/arangod.conf, but the default value there is 0, which means ArangoDB decides how many to run.
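As a sketch (value illustrative; I believe the option sits in the [javascript] section of arangod.conf and mirrors the --javascript.v8-contexts startup option), raising the number of contexts would look like:
[javascript]
# 0 = let ArangoDB decide; a fixed positive number pins the pool size
v8-contexts = 16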

RMAppMaster is running beyond physical memory limits

I am trying to troubleshoot a puzzling issue: RMAppMaster oversteps its allocated container memory and is then killed by the node manager, even though its heap size is much smaller than the container size.
NM logs:
2017-12-01 11:18:49,863 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 14191 for container-id container_1506599288376_62101_01_000001: 1.0 GB of 1 GB physical memory used; 3.1 GB of 2.1 GB virtual memory used
2017-12-01 11:18:49,863 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1506599288376_62101_01_000001 has processes older than 1 iteration running over the configured limit. Limit=1073741824, current usage = 1076969472
2017-12-01 11:18:49,863 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=14191,containerID=container_1506599288376_62101_01_000001] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 3.1 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1506599288376_62101_01_000001 :
|- 14279 14191 14191 14191 (java) 4915 235 3167825920 262632 /usr/java/default//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Djava.net.preferIPv4Stack=true -Xmx512m org.apache.hadoop.mapreduce.v2.app.MRAppMaster
|- 14191 14189 14191 14191 (bash) 0 1 108650496 300 /bin/bash -c /usr/java/default//bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Djava.net.preferIPv4Stack=true -Xmx512m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001/stdout 2>/var/log/hadoop-yarn/container/application_1506599288376_62101/container_1506599288376_62101_01_000001/stderr
You can see that while the heap size is set to 512 MB, the physical memory observed by the NM grows up to 1 GB.
The application is an Oozie launcher (Hive task), so it has only one mapper, which does mostly nothing, and no reducer.
What baffles me is that only this specific instance of MRAppMaster is killed, and I cannot explain the 500 MB overhead between the max heap size and the physical memory measured by the NM:
Other MRAppMaster instances run fine even with the default config (yarn.app.mapreduce.am.resource.mb = 1024 and yarn.app.mapreduce.am.command-opts = -Xmx825955249).
MRAppMaster does not run any application-specific code, so why is only this one having trouble? I expect MRAppMaster memory consumption to be roughly linear in the number of tasks/attempts, and this app has only one mapper.
-Xmx has been reduced to 512 MB to see if the issue still happens with ~500 MB of headroom. I expect MRAppMaster to consume very little native memory; what could those extra 500 MB be?
I will try to work around the issue by increasing yarn.app.mapreduce.am.resource.mb, but I would really like to understand what is going on. Any ideas?
config: cdh-5.4
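For reference, the workaround I intend to try amounts to something like this in mapred-site.xml (values are illustrative, not tested yet):
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx1536m</value>
</property>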

Spark + Parquet + S3n : Seems to read parquet file many times

I have Parquet files laid out in Hive-style partitions in an S3n bucket. The metadata files are not created; the Parquet footers are inside the files themselves.
When I run a sample Spark job in local mode (v1.6.0) that reads a file of 5.2 MB:
val filePath = "s3n://bucket/trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet"
val path: Path = new Path(filePath)
val conf = new SparkConf().setMaster("local[2]").set("spark.app.name", "parquet-reader-s3n").set("spark.eventLog.enabled", "true")
val sc = new SparkContext(conf)
val sqlc = new org.apache.spark.sql.SQLContext(sc)
val df = sqlc.read.parquet(filePath).select("referenceCode")
Thread.sleep(1000*10) // Intentionally given
println(df.schema)
val output = df.collect
The log generated is:
..
[22:21:56.505][main][INFO][BlockManagerMaster:58] Registered BlockManager
[22:21:56.909][main][INFO][EventLoggingListener:58] Logging events to file:/tmp/spark-events/local-1463676716372
[22:21:57.307][main][INFO][ParquetRelation:58] Listing s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet on driver
[22:21:59.927][main][INFO][SparkContext:58] Starting job: parquet at InspectInputSplits.scala:30
[22:21:59.942][dag-scheduler-event-loop][INFO][DAGScheduler:58] Got job 0 (parquet at InspectInputSplits.scala:30) with 2 output partitions
[22:21:59.942][dag-scheduler-event-loop][INFO][DAGScheduler:58] Final stage: ResultStage 0 (parquet at InspectInputSplits.scala:30)
[22:21:59.943][dag-scheduler-event-loop][INFO][DAGScheduler:58] Parents of final stage: List()
[22:21:59.944][dag-scheduler-event-loop][INFO][DAGScheduler:58] Missing parents: List()
[22:21:59.954][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting ResultStage 0 (MapPartitionsRDD[1] at parquet at InspectInputSplits.scala:30), which has no missing parents
[22:22:00.218][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_0 stored as values in memory (estimated size 64.5 KB, free 64.5 KB)
[22:22:00.226][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.7 KB, free 86.2 KB)
[22:22:00.229][dispatcher-event-loop-0][INFO][BlockManagerInfo:58] Added broadcast_0_piece0 in memory on localhost:54419 (size: 21.7 KB, free: 1088.2 MB)
[22:22:00.231][dag-scheduler-event-loop][INFO][SparkContext:58] Created broadcast 0 from broadcast at DAGScheduler.scala:1006
[22:22:00.234][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at parquet at InspectInputSplits.scala:30)
[22:22:00.235][dag-scheduler-event-loop][INFO][TaskSchedulerImpl:58] Adding task set 0.0 with 2 tasks
[22:22:00.278][dispatcher-event-loop-1][INFO][TaskSetManager:58] Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2076 bytes)
[22:22:00.281][dispatcher-event-loop-1][INFO][TaskSetManager:58] Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2395 bytes)
[22:22:00.290][Executor task launch worker-0][INFO][Executor:58] Running task 0.0 in stage 0.0 (TID 0)
[22:22:00.291][Executor task launch worker-1][INFO][Executor:58] Running task 1.0 in stage 0.0 (TID 1)
[22:22:00.425][Executor task launch worker-1][INFO][ParquetFileReader:151] Initiating action with parallelism: 5
[22:22:00.447][Executor task launch worker-0][INFO][ParquetFileReader:151] Initiating action with parallelism: 5
[22:22:00.463][Executor task launch worker-0][INFO][Executor:58] Finished task 0.0 in stage 0.0 (TID 0). 936 bytes result sent to driver
[22:22:00.471][task-result-getter-0][INFO][TaskSetManager:58] Finished task 0.0 in stage 0.0 (TID 0) in 213 ms on localhost (1/2)
[22:22:00.586][pool-20-thread-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[22:22:25.890][Executor task launch worker-1][INFO][Executor:58] Finished task 1.0 in stage 0.0 (TID 1). 4067 bytes result sent to driver
[22:22:25.898][task-result-getter-1][INFO][TaskSetManager:58] Finished task 1.0 in stage 0.0 (TID 1) in 25617 ms on localhost (2/2)
[22:22:25.898][dag-scheduler-event-loop][INFO][DAGScheduler:58] ResultStage 0 (parquet at InspectInputSplits.scala:30) finished in 25.656 s
[22:22:25.899][task-result-getter-1][INFO][TaskSchedulerImpl:58] Removed TaskSet 0.0, whose tasks have all completed, from pool
[22:22:25.905][main][INFO][DAGScheduler:58] Job 0 finished: parquet at InspectInputSplits.scala:30, took 25.977801 s
StructType(StructField(referenceCode,StringType,true))
[22:22:36.271][main][INFO][DataSourceStrategy:58] Selected 1 partitions out of 1, pruned 0.0% partitions.
[22:22:36.325][main][INFO][MemoryStore:58] Block broadcast_1 stored as values in memory (estimated size 89.3 KB, free 175.5 KB)
[22:22:36.389][main][INFO][MemoryStore:58] Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.2 KB, free 195.7 KB)
[22:22:36.389][dispatcher-event-loop-0][INFO][BlockManagerInfo:58] Added broadcast_1_piece0 in memory on localhost:54419 (size: 20.2 KB, free: 1088.2 MB)
[22:22:36.391][main][INFO][SparkContext:58] Created broadcast 1 from collect at InspectInputSplits.scala:34
[22:22:36.520][main][INFO][deprecation:1174] mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
[22:22:36.522][main][INFO][ParquetRelation:58] Reading Parquet file(s) from s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet
[22:22:36.554][main][INFO][SparkContext:58] Starting job: collect at InspectInputSplits.scala:34
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Got job 1 (collect at InspectInputSplits.scala:34) with 1 output partitions
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Final stage: ResultStage 1 (collect at InspectInputSplits.scala:34)
[22:22:36.556][dag-scheduler-event-loop][INFO][DAGScheduler:58] Parents of final stage: List()
[22:22:36.557][dag-scheduler-event-loop][INFO][DAGScheduler:58] Missing parents: List()
[22:22:36.557][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting ResultStage 1 (MapPartitionsRDD[4] at collect at InspectInputSplits.scala:34), which has no missing parents
[22:22:36.571][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_2 stored as values in memory (estimated size 7.6 KB, free 203.3 KB)
[22:22:36.575][dag-scheduler-event-loop][INFO][MemoryStore:58] Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.0 KB, free 207.3 KB)
[22:22:36.576][dispatcher-event-loop-1][INFO][BlockManagerInfo:58] Added broadcast_2_piece0 in memory on localhost:54419 (size: 4.0 KB, free: 1088.2 MB)
[22:22:36.577][dag-scheduler-event-loop][INFO][SparkContext:58] Created broadcast 2 from broadcast at DAGScheduler.scala:1006
[22:22:36.577][dag-scheduler-event-loop][INFO][DAGScheduler:58] Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at collect at InspectInputSplits.scala:34)
[22:22:36.577][dag-scheduler-event-loop][INFO][TaskSchedulerImpl:58] Adding task set 1.0 with 1 tasks
[22:22:36.585][dispatcher-event-loop-3][INFO][TaskSetManager:58] Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2481 bytes)
[22:22:36.586][Executor task launch worker-1][INFO][Executor:58] Running task 0.0 in stage 1.0 (TID 2)
[22:22:36.605][Executor task launch worker-1][INFO][ParquetRelation$$anonfun$buildInternalScan$1$$anon$1:58] Input split: ParquetInputSplit{part: s3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet start: 0 end: 5364897 length: 5364897 hosts: []}
[22:22:38.253][Executor task launch worker-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
[22:23:04.249][Executor task launch worker-1][INFO][NativeS3FileSystem:619] Opening 's3n://bucket//trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet' for reading
[22:23:28.337][Executor task launch worker-1][INFO][CodecPool:181] Got brand-new decompressor [.gz]
[22:23:28.400][dispatcher-event-loop-1][INFO][BlockManagerInfo:58] Removed broadcast_0_piece0 on localhost:54419 in memory (size: 21.7 KB, free: 1088.2 MB)
[22:23:28.408][Spark Context Cleaner][INFO][ContextCleaner:58] Cleaned accumulator 1
[22:23:49.993][Executor task launch worker-1][INFO][Executor:58] Finished task 0.0 in stage 1.0 (TID 2). 9376344 bytes result sent to driver
[22:23:50.191][task-result-getter-2][INFO][TaskSetManager:58] Finished task 0.0 in stage 1.0 (TID 2) in 73612 ms on localhost (1/1)
[22:23:50.191][task-result-getter-2][INFO][TaskSchedulerImpl:58] Removed TaskSet 1.0, whose tasks have all completed, from pool
[22:23:50.191][dag-scheduler-event-loop][INFO][DAGScheduler:58] ResultStage 1 (collect at InspectInputSplits.scala:34) finished in 73.612 s
[22:23:50.195][main][INFO][DAGScheduler:58] Job 1 finished: collect at InspectInputSplits.scala:34, took 73.640193 s
The SparkUI snapshot is:
Questions:
In the logs, I can see that the Parquet file is read a total of 3 times: once by the [pool-21-thread-1] thread (on the driver) and two more times by the [Executor task launch worker-1] thread, which I assume is a worker thread. While debugging, I can see that before the first read, two s3n requests were made specifically for the footer (they had the content-range HTTP header), first to get the size of the footer and then to get the footer itself. My question is: when we already had the footer information, why did the [pool-21-thread-1] thread still have to read the entire file? And why did the executor thread make 2 requests to read the S3 file?
The Spark UI shows that only 670 KB is taken as input. Since I was not convinced this was true, I looked at the network activity, and it seems 20+ MB was received. The attached snapshot shows roughly 5+ MB of data received for the first read and later 15+ MB for the 2 reads after Thread.sleep(1000*10). I could not reach the debug point for the last 2 reads by the [pool-21-thread-1] thread due to IDE issues, so I am not sure whether only the particular column ("referenceCode") is being read or the entire file. I understand that there is network overhead at the TCP/UDP layers, but 20+ MB seems like a lot for just one column.
After debugging into the application, it turned out that S3N still uses the jets3t library, whereas S3A has a new implementation based on the AWS SDK (HADOOP-10400).
Hadoop's NativeS3FileSystem implementation does not support seek (partial content reads) on S3 files; it downloads the whole file first.
EDIT: This scenario was not seen on EMR. On EMR, Amazon provides a highly optimized S3 connector, EMRFS, for all schemes, which overrides the connector provided by Hadoop.
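For anyone hitting the same behaviour, a sketch of what switching the same read to s3a could look like (untested here; it assumes the hadoop-aws and AWS SDK jars are on the classpath, and the credential placeholders are mine):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("parquet-reader-s3a")
val sc = new SparkContext(conf)
// s3a supports seek/ranged reads, so Parquet can fetch just the footer and the needed column chunks
sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")
val sqlc = new org.apache.spark.sql.SQLContext(sc)
val df = sqlc.read.parquet("s3a://bucket/trackingPackage/dpYear=2016/dpMonth=5/dpDay=10/part-r-00004-1c86d6b0-4f6f-4770-a930-c42d77e3c729-1462833064172.gz.parquet")
val output = df.select("referenceCode").collect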

Hadoop YARN reducer/shuffle stuck

I was migrating from Hadoop 1 to Hadoop 2 YARN. The source code was recompiled using MRv2 jars and didn't have any compatibility issues. When I tried to run the job under YARN, map worked fine and went to 100%, but reduce was stuck at ~6-7%. There was no performance issue; in fact, I checked CPU usage, and it turned out that when reduce was stuck there seemed to be no computation going on, because the CPU was mostly 100% idle. The job runs successfully on Hadoop 1.2.1.
I checked the log messages from the resourcemanager and found that after map finished, no more containers were allocated, so no reduce task was running in any container. What caused this situation?
I'm wondering if it is related to the yarn.nodemanager.aux-services property setting. Following the official tutorial (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html), this property has to be set to mapreduce_shuffle, which indicates that MR will still use the default shuffle method instead of other shuffle plugins (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html). I tried not setting this property, but Hadoop wouldn't let me.
Here's the log from userlogs/applicationfolder/containerfolder/syslog when reduce is about to reach 7%. After that, the log didn't update anymore, and reduce stopped as well.
2014-11-26 09:01:04,104 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle output of map attempt_1416988910568_0001_m_002988_0 decomp: 129587 len: 129591 to MEMORY
2014-11-26 09:01:04,104 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 129587 bytes from map-output for attempt_1416988910568_0001_m_002988_0
2014-11-26 09:01:04,104 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 129587, inMemoryMapOutputs.size() -> 2993, commitMemory -> 342319024, usedMemory ->342448611
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle output of map attempt_1416988910568_0001_m_002989_0 decomp: 128525 len: 128529 to MEMORY
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 128525 bytes from map-output for attempt_1416988910568_0001_m_002989_0
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 128525, inMemoryMapOutputs.size() -> 2994, commitMemory -> 342448611, usedMemory ->342577136
2014-11-26 09:01:04,105 INFO [fetcher#1] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: datanode03:13562 freed by fetcher#1 in 13ms
Is this a common issue when migrating from Hadoop 1 to 2? Has the map-shuffle-sort-reduce strategy changed in Hadoop 2? What could be causing this problem? Thanks so much, any comments will help!
Major environment setup:
Hadoop version: 2.5.2
6-node cluster with 8-core CPU, 15 GB memory on each node
Related properties settings:
yarn.scheduler.maximum-allocation-mb: 14336
yarn.scheduler.minimum-allocation-mb: 2500
yarn.nodemanager.resource.memory-mb: 14336
yarn.nodemanager.aux-services: mapreduce_shuffle
mapreduce.task.io.sort.factor: 100
mapreduce.task.io.sort.mb: 1024
I finally solved the problem after googling around and realizing I had already posted this question three months ago.
It was caused by data skew.
